Introduction
This is the last part of a six-part review of the Pentaho BI suite.
In each part of the review, we take a look at the components that
make up the BI suite, according to how they would be used in the real
world.
Data Mining
In this sixth part, I originally wanted to at least touch on the
only part of the Pentaho BI Suite we have not talked about before:
Data Mining. However, as I gathered my materials, I realized that
Data Mining (along with its ilk: Machine Learning, Predictive
Analysis, etc.) is too big a topic to fit in the space that we have
here. Even if I tried, the usefulness would be limited at best,
since at the moment, while the results are being used to solve
real-world problems, the usage of Data Mining tools is still almost
exclusively within the realm of data scientists.
In addition, as of late I use Python more for working with
datasets that require a lot of munging, preparing, and cleaning. So
as an extension to that, I ended up using pandas, scikit-learn, and
other Python-specific Data Mining libraries instead of Weka (which is
basically what the Pentaho Data Mining tool is).
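To give a flavor of the munging mentioned above, here is a minimal
pandas sketch. The dataset, the column names, and the clean_orders
helper are all made up for illustration; they are not taken from any
real project or from Pentaho.

```python
import io

import pandas as pd

# Hypothetical raw export; columns and values are invented for illustration.
RAW_CSV = """order_id,region,amount
1, north ,100.50
2,SOUTH,
3,north,abc
3,north,abc
4,South,75
"""


def clean_orders(raw_csv: str) -> pd.DataFrame:
    """Apply a few typical cleaning steps to a messy CSV export."""
    df = pd.read_csv(io.StringIO(raw_csv))
    # Remove rows that are exact duplicates of an earlier row.
    df = df.drop_duplicates()
    # Normalize free-text labels: trim whitespace, unify case.
    df["region"] = df["region"].str.strip().str.lower()
    # Turn unparseable or missing amounts into NaN, then drop those rows.
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df = df.dropna(subset=["amount"])
    return df.reset_index(drop=True)


orders = clean_orders(RAW_CSV)
# Total amount per cleaned region label.
totals = orders.groupby("region")["amount"].sum()
```

From a cleaned frame like this, handing the data to scikit-learn or
a similar library is a short step, which is a large part of why I
stayed in Python.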
So for those who are new to Data Mining with Pentaho, here is a
good place to start: an interview with Mark Hall, one of the
authors of Weka, who now works for Pentaho:
https://www.floss4science.com/machine-learning-with-weka-mark-hall
The page above also links to places where you can find more
information.
For those who are experienced data scientists, you have probably
already made up your mind about which tool suits your needs best.
Just as I went with Python libraries, you may or may not prefer a
GUI approach like Weka's.
New Release: Pentaho 5.0 CE
For the rest of this review, we will go over the changes that
come with the highly anticipated release of the 5.0 CE version.
Overall, there are a lot of improvements in various parts of the
suite such as PDI and PRD, but we will focus on the BI Server itself,
where the largest impact of the new release can be seen.
A New Repository System
In this new release, one of the biggest shocks for existing users
is the switch from the file-based repository system to the new
JCR-based one. JCR (the Content Repository API for Java) is a
standard for content repositories. The implementation used here is
Apache Jackrabbit, from the Apache Software Foundation, which can
keep its content in a database.
The Good:
- Better metadata management
- No longer need to refresh the repository manually after
publishing solutions
- A much better UI for dealing with the solutions
- API to access the solutions via the repository which opens up
a lot of opportunities for custom applications
The Bad:
- It's not as familiar or convenient as the old file-based
system
- Need to use a synchronizer plugin to version-control the
solutions

It remains to be seen if this switch will pay off for both the
developers and the users in the long run. But it is stable and
working for the most part, so I can't complain.
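As an illustration of the API access listed under The Good, here is
a minimal Python sketch that builds the address for listing the
children of a repository folder over REST. The endpoint shape
(/api/repo/files/<path>/children, with ":" standing in for "/" in
the path) is my assumption about the 5.0 REST API, and the base URL,
folder name, and helper names are hypothetical, so check all of them
against the documentation for your own server.

```python
import base64
import urllib.request
from urllib.parse import quote


def repo_children_url(base_url: str, repo_path: str) -> str:
    """Build the URL that lists the children of a repository folder.

    ASSUMPTION: the server exposes /api/repo/files/<path>/children and
    expects ':' instead of '/' as the separator inside <path>.
    """
    path_id = repo_path.replace("/", ":")
    encoded = quote(path_id, safe=":")
    return f"{base_url.rstrip('/')}/api/repo/files/{encoded}/children"


def fetch_children(base_url: str, repo_path: str, user: str, password: str) -> bytes:
    """Fetch the raw listing using HTTP Basic authentication.

    Not called in this sketch because it needs a running server.
    """
    request = urllib.request.Request(repo_children_url(base_url, repo_path))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request) as response:
        return response.read()


# Hypothetical local server and folder, for illustration only.
url = repo_children_url("http://localhost:8080/pentaho/", "/public/Steel Wheels")
```

A custom application could call something like fetch_children to
show a folder listing without going through the Pentaho UI at all.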
The Marketplace
One of the best features of the Pentaho BI Server is its
plugin-friendly architecture. In version 5.0 this architecture has
been given a new face called the Marketplace.

This new interface serves two important functions:
- It allows admins to install and update plugins (almost all
Pentaho CE tools are written as plugins) effortlessly
- It allows developers to publish their own plugins to the
world
There are already several new plugins available with this new
release, notably Pivot4J Analytics, an alternative to Saiku that
shows a lot of promise as a very useful tool for working with OLAP
data. Another one that excites me is Sparkl, with which you can
create other custom plugins.
The Administration Console
The new version also brings a new Administration Console where we
manage Users and Roles.



No longer do we have to fire up another server just to do this
basic administration task. In addition, you can manage the mail
server settings there (no more wrangling configuration files).
The New Dashboard Editor
As we discussed in Part V of this review, the CDE is a very
powerful dashboard editor. In version 5.0, the list of available
Components has been extended with new ones, and the overall editor
seems to be more responsive in this new release.

Usage experience: The improvements in the Dashboard editor are
helping me to create dashboards for my clients that go beyond the
static ones. In fact, one dashboard I built (for demo purposes
only) has a level of interactivity that rivals a web application or
an electronic form.

NOTE: Nikon and Olympus are trademarks of Nikon Corporation and
Olympus Group respectively.
Parting Thoughts
Even though the final product of a Data Warehouse or a BI system is
a set of answers and forecasts, or dashboards and reports, it is
easy to forget that without the tools that help us to consolidate,
clean up, aggregate, and analyze the data, we will never get to the
results we are aiming for.
As you can probably tell, I serve my clients with various tools
that make sense given their situation, but time and again, the
Pentaho BI Suite (CE version especially) has risen to fulfill the
needs. I have created Data Warehouses from scratch using Pentaho BI
CE, pulling in data from various sources using the PDI and creating
OLAP cubes with the PSW. Those cubes end up as the data source for
the various dashboards (financial dashboards, inventory dashboards,
marketing dashboards, etc.) and for published reports created using
the PRD.
Of course my familiarity with the tool helps, but I am also
familiar with a lot of other BI tools besides Pentaho. And sometimes
I do have to use other tools in preference to Pentaho because they
suit the needs better.
But as I always mention to my clients, unless you have a good
enough relationship with the vendor to avoid paying hundreds of
thousands per year just to be able to use tools like IBM Cognos,
Oracle BI, or SAP Business Objects, there is a good chance that
Pentaho (either the EE or CE version) can do the same for less, even
zero license cost in the case of CE.
Given the increased awareness of the value of data analysis in
today's companies, these BI tools will continue to become more and
more sophisticated and powerful. It is up to us business owners,
consultants, and data analysts everywhere to develop the skills to
harness the tools and crank out useful, accurate, and yes,
easy-on-the-eyes decision-support systems. And I suspect that we
will always see Pentaho as one of the viable options, which is a
testament to the quality of the team working on it. As for the CE
team in particular, it would be remiss not to acknowledge their
efforts to improve and maintain a tool this complex using the Open
Source paradigm.
So here we are, at the end of the sixth part. Writing this
six-part review has been a blast. And I would like to give a shout
out to IT Central Station, which has graciously hosted this review
for all to benefit from. Thanks for reading.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
Have you looked into using Talend? It's got a great user interface, very similar to Kettle, and the paid-for version has version control that works very well. You also get the ability to run "joblets", which are basically reusable pieces of code. Even the free version has version control, although it's pretty clumsy; there are no joblets in the free version, and it is difficult to get working with GitHub.