Pentaho 3.2 Data Integration

diddy81 writes "A book about the open source ETL tool Kettle (Pentaho Data Integration) is finally available. Pentaho 3.2 Data Integration: Beginner's Guide by María Carina Roldán is for everybody who is new to Kettle. In a nutshell, this book will give you all the information that you need to get started with Kettle quickly and efficiently, even if you have never used it before. The book offers loads of illustrations and easy-to-follow examples. The code can be downloaded from the publisher's website, and Kettle is available for free from the SourceForge website. In sum, the book is the best way to get to know the power of the open source ETL tool Kettle, which is part of the Pentaho BI suite." Read on for the rest of diddy81's review.
Pentaho 3.2 Data Integration: Beginner's Guide
author Maria Carina Roldan
pages 492
publisher Packt Publishing
rating 9/10
reviewer diddy81
ISBN 1847199542
summary If you have never used PDI before, this will be a perfect book to start with.
The first chapter describes the purpose of PDI, its components and the UI, explains how to install it, and walks you through a very simple transformation. Moreover, the last part tells you step by step how to install MySQL on Windows and Ubuntu.

It's just what you want to know when you touch PDI for the first time. The instructions are easy to follow and understand and should help you to get started in no time. I honestly quite like the structure of the book: Whenever you learn something new, it is followed by a section that recaps everything, which makes it much easier to remember.

Maria focuses on using PDI with files instead of the repository, but she offers a description of how to work with the repository in the appendix of the book.

Chapter 2: You will learn how to read data from a text file and how to handle header and footer lines. Next up is a description of the "Select values ..." step, which allows you to apply special formatting to the input fields and to select the fields that you want to keep or remove. You will create a transformation that reads multiple text files at once by using regular expressions in the text input step. This is followed by a troubleshooting section that describes all kinds of problems that might happen in the setup and how to solve them. The last step of the sample transformation is the text file output step.
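As a rough illustration of selecting input files by regular expression (a plain-Python sketch, not code from the book or from Kettle itself; the file names and the `select_files` helper are made up, and the sketch assumes the pattern has to match the whole file name):

```python
import re

def select_files(names, pattern):
    """Return the file names that match `pattern` in full."""
    rx = re.compile(pattern)
    return [name for name in names if rx.fullmatch(name)]

# Hypothetical directory listing.
names = ["sales_2009.txt", "sales_2010.txt", "readme.txt", "sales_2010.bak"]

# One pattern picks up every yearly sales file and nothing else.
selected = select_files(names, r"sales_\d{4}\.txt")
print(selected)
```

A single pattern like this is what lets one input step pick up a whole batch of similarly named files.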

Then you improve this transformation by adding the "Get system info" step, which will allow you to pass parameters to this transformation on execution. This is followed by a detailed description of the data types (I wish I had had all this formatting info so easily at hand when I started). And then it gets even more exciting: Maria talks you through the setup of a batch process (scheduling a Kettle transformation).

The last part of this chapter describes how to read XML files with the XML file input step. There is a short description of XPath which should help you to get going with this particular step easily.

Chapter 3 walks you through the basic data manipulation steps. You set up a transformation that makes use of the calculator step (loads of fancy calculation examples here). For more complicated formulas Maria also introduces the formula step. Next in line are the Sort By and Group By steps to create some summaries. In the next transformation you import a text file and use the Split field to rows step. You then apply the filter step on the output to get a subset of the data. Maria demonstrates various examples of how to use the filter step effectively. At the end of the chapter you learn how to look up data by using the "Stream Lookup" step. Maria describes very well how this step works (even visualizing the concept), so it should be really easy for everybody to understand.
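To give an idea of what a stream lookup does conceptually, here is a minimal plain-Python sketch (not code from the book or from Kettle; the field names and the `stream_lookup` helper are invented for illustration). It assumes the lookup stream is small enough to hold in memory:

```python
def stream_lookup(main_rows, lookup_rows, key, value, default=None):
    """Add field `value` from lookup_rows to each main row, matching on `key`."""
    # Read the lookup stream completely and index it by the key field.
    table = {row[key]: row[value] for row in lookup_rows}
    # Enrich every row of the main stream; use the default when no match exists.
    return [{**row, value: table.get(row[key], default)} for row in main_rows]

# Hypothetical main stream and lookup stream.
orders = [{"order": 1, "country_code": "AR"}, {"order": 2, "country_code": "XX"}]
countries = [{"country_code": "AR", "country": "Argentina"}]

enriched = stream_lookup(orders, countries, "country_code", "country", default="unknown")
for row in enriched:
    print(row)
```

Every main-stream row comes out again, with the looked-up field attached or the default filled in when there is no match.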

Chapter 4 is all about controlling the flow of data: You learn how to split the data stream by distributing or copying the data to two or more steps (this is based on a good example: You start with a task list that contains records for various people. You then distribute the tasks to different output fields for each of these people). Maria explains properly how "distribute" and "copy" work. The concept is very easy to understand following her examples. In another example Maria demonstrates how you can use the filter step to send the data to different steps based on a condition. In some cases, the filter step will not be enough, hence Maria also introduces the "Switch/Case" step that you can use to create more complex conditions for your data flow. Finally Maria tells you all about merging streams and which approach/step is best to use in which scenario.
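The difference between the two modes can be sketched in a few lines of plain Python (an illustration only, not code from the book; the task names are invented, and the sketch assumes "distribute" hands rows out in round-robin order):

```python
def copy_rows(rows, n_targets):
    """Copy: every target step receives every row."""
    return [list(rows) for _ in range(n_targets)]

def distribute_rows(rows, n_targets):
    """Distribute: rows are dealt out to the target steps in turn (round-robin)."""
    targets = [[] for _ in range(n_targets)]
    for i, row in enumerate(rows):
        targets[i % n_targets].append(row)
    return targets

# Hypothetical task list.
tasks = ["task A", "task B", "task C", "task D", "task E"]

copied = copy_rows(tasks, 2)
distributed = distribute_rows(tasks, 2)
print(copied)
print(distributed)
```

With copy, both targets see all five tasks; with distribute, each task goes to exactly one target.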

In Chapter 5 it gets really interesting: Maria walks you through the JavaScript step. In the first example you use the JavaScript step for complex calculations. Maria provides an overview of the available functions (String, Numeric, Date, Logic and Special functions) that you can use to quickly create your scripts by dragging and dropping them onto the canvas. In the following example you use the JavaScript step to modify existing data and add new fields. You also learn how to test your code from within this step. Next up (and very interesting) Maria tells you how to create special start and end scripts (which are only executed one time as opposed to the normal script which is executed for every input row). We then learn how to use the transformation constants (SKIP_TRANSFORMATION, CONTINUE_TRANSFORMATION, etc.) to control what happens to the rows (very impressive!). In the last example of the chapter you use the JavaScript step to transform an unstructured text file. This chapter offered quite some in-depth information and I have to say that there were actually some things that I didn't know.

In the real world you will not always get the dataset structure in the way that you need it for processing. Hence, chapter 6 tells you how you can normalize and denormalize data sets. I have to say that Maria put a really huge effort into visualizing how these processes work, which really helps to understand the theory behind them. Maria also provides two good examples that you work through. In the last example of this chapter you create a date dimension (very useful, as every one of us will have to create one at some point).
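For readers who have not met the terms before, a small plain-Python sketch may help (this is not from the book; the field names and helper functions are invented for illustration). Normalizing turns one wide row into several key/value rows, and denormalizing reverses that:

```python
def normalize(rows, id_field, value_fields):
    """Wide to long: emit one (id, measure, value) row per value field."""
    out = []
    for row in rows:
        for field in value_fields:
            out.append({id_field: row[id_field], "measure": field, "value": row[field]})
    return out

def denormalize(rows, id_field):
    """Long to wide: collect all measures of the same id into a single row."""
    grouped = {}
    for row in rows:
        wide = grouped.setdefault(row[id_field], {id_field: row[id_field]})
        wide[row["measure"]] = row["value"]
    return list(grouped.values())

# Hypothetical quarterly sales, one row per product.
wide_rows = [{"product": "kettle", "q1": 10, "q2": 12}]

long_rows = normalize(wide_rows, "product", ["q1", "q2"])
print(long_rows)
print(denormalize(long_rows, "product"))
```

Running one after the other gets you back to the rows you started with, which is a handy sanity check when building such a transformation.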

Validating data and handling errors is the focus of chapter 7. This is quite an important topic, as when you automate transformations, you will have to find a way to deal with errors (so that they don't crash the transformation). Writing errors to the log, aborting a transformation, fixing captured errors and validating data are some of the steps you go through.

Chapter 8 focuses on importing data from databases. Readers with no SQL experience will find a section covering the basics of SQL. You will work with both the Hypersonic database and MySQL. Moreover Maria introduces you to the Pentaho sample database called "Steel Wheels", which you use for the first example. You learn how to set up a connection to the database and how to explore it. You will use the "Table Input" step to read from the database as well as the "Table Output" step to export the data to a database. Maria also describes how to parameterize SQL queries, which you will definitely need to do at some point in real world scenarios. In the next tutorials you use the Insert/Update step as well as the Delete step to work with tables in the database.
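The idea behind a parameterized query can be shown without Kettle at all. The following sketch uses Python's built-in sqlite3 module; the `products` table and its contents are made up for illustration and are not the Steel Wheels schema:

```python
import sqlite3

# In-memory database with a small, hypothetical product table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, product_line TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("Roadster", "Cars", 90.0), ("Chopper", "Motorcycles", 60.0), ("Coupe", "Cars", 75.0)],
)

def products_in_line(conn, product_line):
    """Run the same query for any product line; the value is bound to the '?' placeholder."""
    cursor = conn.execute(
        "SELECT name FROM products WHERE product_line = ? ORDER BY name",
        (product_line,),
    )
    return [row[0] for row in cursor.fetchall()]

cars = products_in_line(conn, "Cars")
print(cars)
```

The query text stays fixed and only the bound value changes from run to run, which is the same principle as feeding a parameter into a table input query.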

In chapter 9 you learn about more advanced database topics: Maria gives an introduction to data modelling, so you will soon know what fact tables, dimensions and star schemas are. You use various steps to look up data from the database (e.g. the Database lookup step, Combination lookup/update, etc.). You learn how to load slowly changing dimensions Type 1, 2 and 3. All these topics are excellently illustrated, so it's really easy to follow, even for a person who has never heard about these topics before.
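As a rough plain-Python illustration of the first two types (not the book's or Kettle's implementation; the column names `version`, `valid_from` and `valid_to` are assumptions chosen for this sketch): Type 1 overwrites an attribute in place, while Type 2 closes the current row and appends a new version, so history is kept.

```python
from datetime import date

def scd_type1(dimension, key, new_attrs):
    """Type 1: overwrite the attributes in place; no history is kept."""
    for row in dimension:
        if row["key"] == key:
            row.update(new_attrs)
            return
    dimension.append({"key": key, **new_attrs})

def scd_type2(dimension, key, new_attrs, today):
    """Type 2: close the current version of the row and append a new one."""
    current = next(
        (r for r in dimension if r["key"] == key and r["valid_to"] is None), None
    )
    if current is not None:
        if all(current.get(k) == v for k, v in new_attrs.items()):
            return  # nothing changed, keep the current version open
        current["valid_to"] = today
    version = current["version"] + 1 if current else 1
    dimension.append(
        {"key": key, **new_attrs, "version": version, "valid_from": today, "valid_to": None}
    )

# Hypothetical customer 42 moves from one city to another.
dim1 = []
scd_type1(dim1, 42, {"city": "Rosario"})
scd_type1(dim1, 42, {"city": "Buenos Aires"})

dim2 = []
scd_type2(dim2, 42, {"city": "Rosario"}, date(2010, 1, 1))
scd_type2(dim2, 42, {"city": "Buenos Aires"}, date(2010, 6, 16))

print(dim1)
print(dim2)
```

After the move, the Type 1 table holds a single row with the new city, while the Type 2 table holds both versions, the older one closed with an end date.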

Chapter 10 is all about creating jobs. You start off by creating a simple job and later learn more about how to use parameters and arguments in a job, how to run jobs from the terminal window and how to run job entries under conditions.

In chapter 11 you learn how to improve your processes by using variables, subtransformations (very interesting topic!), transferring data between transformations, nesting jobs and creating a loop process. These are all more complex topics which Maria managed to illustrate excellently.

Chapter 12 is the last practical chapter: You develop and load a datamart. I would consider this an essential chapter if you want to learn something about data warehousing. The last chapter, chapter 13, gives you some ideas on how to take it even further with Kettle/PDI (plugins, Carte, PDI as a process action, etc.).

In the appendix you also find a section that tells you all about working with repositories, pan and kitchen, a quick reference guide to steps and job entries and the new features in Kettle 4.

This book certainly fills a gap: It is the first book on the market that focuses solely on PDI. From my point of view, Maria's book is excellent for anyone who wants to start working with Kettle and even for those who are at an intermediate level. This book takes a very practical approach: It is full of interesting tutorials/examples (you can download the data/code from the Packt website), which is probably the best way to learn about something new. Maria also made a huge effort to illustrate the more complex topics, which helps the reader to understand the step/process easily.

All in all, I can only recommend this book. It is the easiest way to start with PDI/Kettle and you will be able to create complex transformations/jobs in no time!

You can purchase Pentaho 3.2 Data Integration: Beginner's Guide from Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.


  • Enough acronyms? (Score:5, Insightful)

    by pyite ( 140350 ) on Wednesday June 16, 2010 @01:45PM (#32592464)

    My goodness, would it kill you to state what an acronym stands for the first time you use it?

  • by ducomputergeek ( 595742 ) on Wednesday June 16, 2010 @01:56PM (#32592602)

    Was it made things three times more complicated than it needed to be. We needed to integrate one of our products with another and the other product's developer recommended Talend and Pentaho for the job. After two days of looking through the documentation it was complete overkill for what we needed. So we said screw it and directly mapped to their database using JDBC and Plain Ole XML as our transport layer. That only took a day to build.

  • by blackest_k ( 761565 ) on Wednesday June 16, 2010 @02:04PM (#32592702) Homepage Journal

    I'd settle for whats it for? and why i'd want to spend time learning how to use it?

    Apparently its for beginners but beginners who already have a foundation to build on.

  • by Wowlapalooza ( 1339989 ) on Wednesday June 16, 2010 @02:21PM (#32592864)

    You must be new here.

    Do you need a de-acronymization of SQL?

    I'd wager most IT professionals know that one.

    Do you need a de-acronymization of XML?

    I'd wager most IT professionals know that one too.

    Since it was established fairly early on that PDI is Pentaho Data Integration (damn, first time I saw it I coulda sworn it was "pendejo"), I'm not really sure what exactly your bitch is.

    Uh, because it's not obvious to non-data-warehousing weenies that semantically there's any intersection/equivalence between the set ("data", "integration") and some other (undefined) set consisting of terms beginning with the letters "e", "t" and "l"? Maybe?

    Sure, any or all of this stuff can be Google'd/Wikipedia'ed/etc., but does one want to go through that for an article summary? Especially when it would have been soooo easy to just expand the acronym...

  • by name_already_taken ( 540581 ) on Wednesday June 16, 2010 @02:26PM (#32592924)

    Sure, any or all of this stuff can be Google'd/Wikipedia'ed/etc., but does one want to go through that for an article summary? Especially when it would have been soooo easy to just expand the acronym...

    Especially when it's standard journalism (and general writing) practice to expand acronyms the first time they're used, particularly when they are obscure.

    To expect every reader to either know the definition of the acronym, or to search Google for it is the height of arrogance. It's also a good way to turn off readers.

  • by TopherC ( 412335 ) on Wednesday June 16, 2010 @02:49PM (#32593184)

    Normal, face-to-face conversation:

    "I might be interested in this book, but don't yet know what ETL, Kettle, Pentaho, or BI refer to. Could you help me out please?"

    "Sure! An ETL is a ..."


    would it kill you to state what an acronym stands for the first time you use it?

    You must be new here.

    Do you need a de-acronymization of SQL?

    Do you need a de-acronymization of XML?

    Since it was established fairly early on that PDI is Pentaho Data Integration ...

    (Note that PDI is the only TLA defined in the summary but it isn't actually used there.)

    Did you buy your four digit id, or are you just cranky today? ...

    I have low tolerance for idiots and none for trollish anonymous coward bitch-boys like yourself.

  • by Anonymous Coward on Wednesday June 16, 2010 @03:04PM (#32593426)

    I'd settle for whats it for? and why i'd want to spend time learning how to use it?

    Apparently its for beginners but beginners who already have a foundation to build on.

    I like my acronyms explained early and often - beginners need to know and experts need to be reminded, lest they drift off course. Still, it was in the title...

    ETL is one of those things that you do a lot of in an enterprise where there are many different systems from many different vendors (not to mention development groups who hate each other). When stuff has to get transferred from one database to another, that's ETL. When stuff has to get transferred and the data isn't in quite the right format, that, too is ETL. Pulling stuff from spreadsheets, FTP sites, web services, mixing and matching, validating and converting. Tools like Pentaho DI are how you can dump the process onto less-technical staff, since it does a lot of work that would otherwise require custom programming. And even some that does, since not only does Kettle support JavaScript transformations and user-developed Java plugins, it's an open-source project.

    Incidentally, Maria Carina Roldan was a guest author at the JavaRanch Big Moose Saloon several weeks ago. Those of us who hang out there had the opportunity to converse with her.

  • by Pollardito ( 781263 ) on Wednesday June 16, 2010 @03:17PM (#32593598)
    The worst part is that even if you google Kettle and get to their website, the front page for their product is essentially a changelog and roadmap. There are FAQ links but even the "Beginners FAQ" (which should be "WTF is Kettle?" style Q&A) is a product troubleshooting guide.

    I suspect that the same secrecy-obsessed person that built the product website also wrote this review
  • by SplashMyBandit ( 1543257 ) on Wednesday June 16, 2010 @03:40PM (#32593926)
    I hope the reviewer is suitably chastened by this experience. Understanding your likely reader is a very important skill in (technical) writing. Realizing that people come from all sorts of backgrounds should not be a surprise. Each of those people may be very intelligent, they just have a specialty that is not in the same field as the writer. Therefore it is the mark of a competent writer that they'll at least try to expand an acronym the first time they use it. An even better writer might even find a single sentence that explains the concept well. Poor writers (e.g. many soft-science academics and marketers) often obfuscate simple concepts behind jargon and convoluted sentence construction. Their pronouncements can often be written in a much more straightforward way, although that would often reveal that the "Emperor has no clothes". The best writers write simply, use the least complicated word that fits the purpose, and consider possible conceptual pitfalls of readers so try to write unambiguously.
  • by Per Wigren ( 5315 ) on Wednesday June 16, 2010 @04:23PM (#32594474) Homepage

    I will now argue that my home brew "mess" is very simple and clean and it will take any person with decent shell, Ruby and SQL knowledge a VERY short time to get a FULL understanding of. Even my bosses know and appreciate that.

    So, we already have the import, data cleaning, normalization and lots of aggregated tables in place and it's working fine. We don't want to change that. What we need is only a web interface that is easy for the non-technical managers and marketers to use. I can provide special tables and views for Mondrian in any way that it wants. No problem, except that it's helluva messy to set up. Unless you restart from scratch using the whole Pentaho stack, maybe.
