
Pentaho 3.2 Data Integration

diddy81 writes "A book about the open source ETL tool Kettle (Pentaho Data Integration) is finally available. Pentaho 3.2 Data Integration: Beginner's Guide by María Carina Roldán is for everybody who is new to Kettle. In a nutshell, this book will give you all the information that you need to get started with Kettle quickly and efficiently, even if you have never used it before. The book offers loads of illustrations and easy-to-follow examples. The code can be downloaded from the publisher's website and Kettle is available for free from the SourceForge website. In sum, the book is the best way to get to know the power of the open source ETL tool Kettle, which is part of the Pentaho BI suite." Read on for the rest of diddy81's review.
Pentaho 3.2 Data Integration: Beginner's Guide
author Maria Carina Roldan
pages 492
publisher Packt Publishing
rating 9/10
reviewer diddy81
ISBN 1847199542
summary If you have never used PDI before, this will be a perfect book to start with.
The first chapter describes the purpose of PDI, its components and the UI, explains how to install it, and walks you through a very simple transformation. Moreover, the last part tells you step by step how to install MySQL on Windows and Ubuntu.

It's just what you want to know when you touch PDI for the first time. The instructions are easy to follow and understand and should help you to get started in no time. I honestly quite like the structure of the book: Whenever you learn something new, it is followed by a section that recaps everything, which makes it much easier to remember.

Maria focuses on using PDI with files instead of the repository, but she offers a description of how to work with the repository in the appendix of the book.

Chapter 2: You will learn how to read data from a text file and how to handle header and footer lines. Next up is a description of the "Select values ..." step, which allows you to apply special formatting to the input fields and to select the fields that you want to keep or remove. You will create a transformation that reads multiple text files at once by using regular expressions in the text input step. This is followed by a troubleshooting section that describes all kinds of problems that might happen in the setup and how to solve them. The last step of the sample transformation is the text file output step.
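
To give an idea of what picking up several input files with one regular expression means, here is a minimal sketch in plain Python. It is not taken from the book and does not use Kettle; the file names and the `select_files` helper are made up for illustration.

```python
import re


def select_files(filenames, pattern):
    """Return the names that match the regular expression in full."""
    regex = re.compile(pattern)
    return [name for name in filenames if regex.fullmatch(name)]


# Hypothetical file names, invented for this illustration.
candidates = ["sales_2009.txt", "sales_2010.txt", "readme.txt", "sales_2010.bak"]

# Keep only yearly sales files such as sales_2010.txt.
matched = select_files(candidates, r"sales_\d{4}\.txt")
print(matched)
```

The point is simply that one pattern replaces a hand-maintained list of input files.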

Then you improve this transformation by adding the "Get system info" step, which will allow you to pass parameters to this transformation on execution. This is followed by a detailed description of the data types (I wish I had had all this formatting info so easily at hand when I started). And then it gets even more exciting: Maria talks you through the setup of a batch process (scheduling a Kettle transformation).

The last part of this chapter describes how to read XML files with the XML file input step. There is a short description of XPath which should help you to get going with this particular step easily.
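
For readers who have never seen XPath, the following sketch shows the flavour of it using Python's standard library, which supports a limited subset of XPath. The XML document and its element names are invented for illustration and have nothing to do with the book's sample files.

```python
import xml.etree.ElementTree as ET

# A made-up document, only for illustration.
XML = """
<countries>
  <country name="Argentina"><language>Spanish</language></country>
  <country name="Belgium"><language>Dutch</language><language>French</language></country>
</countries>
"""

root = ET.fromstring(XML)

# Every <language> element that sits directly under any <country>.
languages = [el.text for el in root.findall(".//country/language")]

# Only the languages of the country whose name attribute is "Belgium".
belgian = [el.text for el in root.findall("country[@name='Belgium']/language")]

print(languages)
print(belgian)
```

Each path expression picks a set of nodes out of the tree, which is roughly what you configure when telling an XML input step which elements become rows and fields.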

Chapter 3 walks you through the basic data manipulation steps. You set up a transformation that makes use of the calculator step (loads of fancy calculation examples here). For more complicated formulas Maria also introduces the formula step. Next in line are the Sort By and Group By steps to create some summaries. In the next transformation you import a text file and use the Split field to rows step. You then apply the filter step on the output to get a subset of the data. Maria demonstrates various examples of how to use the filter step effectively. At the end of the chapter you learn how to look up data by using the "Stream Lookup" step. Maria describes very well how this step works (even visualizing the concept), so it should be really easy for everybody to understand.
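
The idea behind a stream lookup can be sketched in a few lines of Python: one stream is loaded into memory as a lookup table, and every row of the main stream is enriched from it. This is only an illustration of the concept, not Kettle's implementation; the function name, field names and sample rows are assumptions.

```python
def stream_lookup(main_rows, lookup_rows, key, value_field, default=None):
    """Add value_field to each main row by looking up key in lookup_rows."""
    # The lookup stream is read completely into memory first.
    table = {row[key]: row[value_field] for row in lookup_rows}
    for row in main_rows:
        enriched = dict(row)  # leave the incoming row untouched
        enriched[value_field] = table.get(row[key], default)
        yield enriched


# Made-up sample data, for illustration only.
orders = [{"order": 1, "country_code": "AR"}, {"order": 2, "country_code": "XX"}]
countries = [
    {"country_code": "AR", "country": "Argentina"},
    {"country_code": "BE", "country": "Belgium"},
]

result = list(stream_lookup(orders, countries, "country_code", "country", default="unknown"))
print(result)
```

Rows without a match get the default value instead of being dropped, which is usually what you want when enriching data.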

Chapter 4 is all about controlling the flow of data: You learn how to split the data stream by distributing or copying the data to two or more steps. This is based on a good example: You start with a task list that contains records for various people, and you then distribute the tasks to different output fields for each of these people. Maria explains properly how "distribute" and "copy" work. The concept is very easy to understand following her examples. In another example Maria demonstrates how you can use the filter step to send the data to different steps based on a condition. In some cases, the filter step will not be enough, hence Maria also introduces the "Switch/Case" step that you can use to create more complex conditions for your data flow. Finally Maria tells you all about merging streams and which approach/step is best to use in which scenario.
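
As I understand it, "distribute" hands the rows out to the target steps in turn, while "copy" sends every row to every target. A small Python sketch of that difference, assuming simple round-robin distribution (the function names and the task list are made up):

```python
from itertools import cycle


def distribute_rows(rows, n_targets):
    """Hand rows out to the targets in turn (round-robin)."""
    targets = [[] for _ in range(n_targets)]
    for row, target in zip(rows, cycle(targets)):
        target.append(row)
    return targets


def copy_rows(rows, n_targets):
    """Give every target its own full copy of all rows."""
    rows = list(rows)
    return [list(rows) for _ in range(n_targets)]


tasks = ["t1", "t2", "t3", "t4", "t5"]  # hypothetical task list
distributed = distribute_rows(tasks, 2)
copied = copy_rows(tasks, 2)
print(distributed)
print(copied)
```

With distribution each target sees only part of the data, so it suits splitting up work; with copying each target sees everything, so it suits writing the same rows to two different outputs.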

In Chapter 5 it gets really interesting: Maria walks you through the JavaScript step. In the first example you use the JavaScript step for complex calculations. Maria provides an overview of the available functions (String, Numeric, Date, Logic and Special functions) that you can use to quickly create your scripts by dragging and dropping them onto the canvas. In the following example you use the JavaScript step to modify existing data and add new fields. You also learn how to test your code from within this step. Next up (and very interesting) Maria tells you how to create special start and end scripts (which are only executed one time as opposed to the normal script which is executed for every input row). We then learn how to use the transformation constants (SKIP_TRANSFORMATION, CONTINUE_TRANSFORMATION, etc) to control what happens to the rows (very impressive!). In the last example of the chapter you use the JavaScript step to transform an unstructured text file. This chapter offered quite some in-depth information and I have to say that there were actually some things that I didn't know.
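
The relationship between the start script, the per-row script, the end script and the skip/continue status can be modelled roughly as below. This is a Python stand-in written for this review, not Kettle's JavaScript API: `SKIP` and `CONTINUE` merely play the role of the constants mentioned above, and all function and field names are hypothetical.

```python
SKIP, CONTINUE = "skip", "continue"  # stand-ins for the step's status constants


def run_script_step(rows, start, per_row, end):
    """Run start once, per_row for every row, end once; drop skipped rows."""
    state = start()                      # start script: before the first row
    output = []
    for row in rows:                     # main script: once per input row
        status, new_row = per_row(row, state)
        if status == CONTINUE:
            output.append(new_row)
    end(state)                           # end script: after the last row
    return output, state


def start():
    return {"seen": 0, "log": []}


def per_row(row, state):
    state["seen"] += 1
    if row["score"] is None:
        return SKIP, None                # row is removed from the stream
    return CONTINUE, {**row, "passed": row["score"] >= 60}


def end(state):
    state["log"].append(f"processed {state['seen']} rows")


rows = [{"name": "a", "score": 70}, {"name": "b", "score": None}, {"name": "c", "score": 40}]
output, state = run_script_step(rows, start, per_row, end)
print(output)
print(state["log"])
```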

In the real world you will not always get the dataset structure in the way that you need it for processing. Hence, chapter 6 tells you how you can normalize and denormalize data sets. I have to say that Maria made a really huge effort to visualize how these processes work, which really helps to understand the theory behind them. Maria also provides two good examples that you work through. In the last example of this chapter you create a date dimension (very useful, as every one of us will have to create one at some point).
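
In case the terms are new: normalizing here means turning columns into rows, and denormalizing means gathering such rows back into columns. A minimal Python sketch of both directions, with made-up field names and data (this is not how the Kettle steps are configured, only the idea behind them):

```python
def normalise(rows, id_field, value_fields):
    """Wide to long: one output row per (id, field) pair."""
    return [
        {id_field: row[id_field], "key": field, "value": row[field]}
        for row in rows
        for field in value_fields
    ]


def denormalise(rows, id_field):
    """Long to wide: gather key/value rows back into one row per id."""
    wide = {}
    for row in rows:
        target = wide.setdefault(row[id_field], {id_field: row[id_field]})
        target[row["key"]] = row["value"]
    return list(wide.values())


# Hypothetical quarterly sales figures.
wide_rows = [
    {"product": "kettle", "q1": 10, "q2": 12},
    {"product": "spoon", "q1": 3, "q2": 4},
]

long_rows = normalise(wide_rows, "product", ["q1", "q2"])
round_trip = denormalise(long_rows, "product")
print(long_rows)
print(round_trip)
```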

Validating data and handling errors is the focus of chapter 7. This is quite an important topic, as when you automate transformations, you will have to find a way to deal with errors (so that they don't crash the transformation). Writing errors to the log, aborting a transformation, fixing captured errors and validating data are some of the steps you go through.
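
The general pattern of capturing bad rows instead of letting them abort the whole run looks roughly like this in Python. It is a sketch of the concept only; the helper names, the `amount` field and the sample rows are assumptions made for illustration.

```python
def validate_rows(rows, validator):
    """Split rows into valid results and captured errors."""
    good, errors = [], []
    for row in rows:
        try:
            good.append(validator(row))
        except (ValueError, KeyError) as exc:
            # Keep the offending row and the reason, then carry on.
            errors.append({"row": row, "error": str(exc)})
    return good, errors


def parse_amount(row):
    """Convert the amount field to a number; raises on bad or missing data."""
    return {**row, "amount": float(row["amount"])}


rows = [{"id": 1, "amount": "12.5"}, {"id": 2, "amount": "n/a"}, {"id": 3}]
good, errors = validate_rows(rows, parse_amount)
print(good)
print(errors)
```

The error list can then be written to a log or a separate file, while the good rows continue down the main stream.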

Chapter 8 focuses on importing data from databases. Readers with no SQL experience will find a section covering the basics of SQL. You will work with both the Hypersonic database and MySQL. Moreover Maria introduces you to the Pentaho sample database called "Steel Wheels", which you use for the first example. You learn how to set up a connection to the database and how to explore it. You will use the "Table Input" step to read from the database as well as the "Table Output" step to export the data to a database. Maria also describes how to parameterize SQL queries, which you will definitely need to do at some point in real world scenarios. In the next tutorials you use the Insert/Update step as well as the Delete step to work with tables on the database.
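
Kettle has its own mechanisms for parameterizing queries, which the book covers; the sketch below only shows the general idea, using Python's built-in sqlite3 module and an in-memory table whose name, columns and contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, country TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "USA", 100.0), (2, "France", 250.0), (3, "USA", 75.5)],
)


def orders_for(conn, country, min_total):
    """Run the same query with different values bound to the placeholders."""
    cur = conn.execute(
        "SELECT id, total FROM orders WHERE country = ? AND total >= ? ORDER BY id",
        (country, min_total),
    )
    return cur.fetchall()


usa_orders = orders_for(conn, "USA", 80)
french_orders = orders_for(conn, "France", 0)
print(usa_orders)
print(french_orders)
```

The values are passed separately from the SQL text, so the query itself never changes and no string concatenation is needed.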

In chapter 9 you learn about more advanced database topics: Maria gives an introduction to data modelling, so you will soon know what fact tables, dimensions and star schemas are. You use various steps to look up data from the database (e.g. Database lookup step, Combination lookup/update, etc). You learn how to load slowly changing dimensions Type 1, 2 and 3. All these topics are excellently illustrated, so it's really easy to follow, even for a person who has never heard about these topics before.
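
For those who have not met slowly changing dimensions before: Type 1 simply overwrites the old value, while Type 2 keeps history by closing the current row and adding a new one. A minimal Python sketch of these two, assuming a made-up customer dimension with hypothetical `valid_from` / `valid_to` columns (Type 3 is left out):

```python
from datetime import date


def scd_type1(dimension, customer_id, city):
    """Type 1: overwrite in place, no history is kept."""
    for row in dimension:
        if row["customer_id"] == customer_id:
            row["city"] = city


def scd_type2(dimension, customer_id, city, today):
    """Type 2: close the current version and append a new one."""
    current = next(
        (r for r in dimension
         if r["customer_id"] == customer_id and r["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["city"] == city:
            return  # nothing changed, keep the current version
        current["valid_to"] = today
    dimension.append(
        {"customer_id": customer_id, "city": city, "valid_from": today, "valid_to": None}
    )


history = []
scd_type2(history, 1, "Rosario", date(2010, 1, 1))
scd_type2(history, 1, "Buenos Aires", date(2010, 6, 16))
scd_type2(history, 1, "Buenos Aires", date(2010, 7, 1))  # no change, no new row

overwritten = [{"customer_id": 1, "city": "Rosario"}]
scd_type1(overwritten, 1, "Buenos Aires")

print(history)
print(overwritten)
```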

Chapter 10 is all about creating jobs. You start off by creating a simple job and later learn more about how to use parameters and arguments in a job, how to run jobs from the terminal window and how to run job entries under conditions.

In chapter 11 you learn how to improve your processes by using variables, subtransformations (very interesting topic!), transferring data between transformations, nesting jobs and creating a loop process. These are all more complex topics which Maria managed to illustrate excellently.

Chapter 12 is the last practical chapter: You develop and load a datamart. I would consider this a very essential chapter if you want to learn something about data warehousing. The last chapter 13 gives you some ideas on how to take it even further (Plugins, Carte, PDI as process action, etc) with Kettle/PDI.

In the appendix you also find a section that tells you all about working with repositories, pan and kitchen, a quick reference guide to steps and job entries and the new features in Kettle 4.

This book certainly fills a gap: It is the first book on the market that focuses solely on PDI. From my point of view, Maria's book is excellent for anyone who wants to start working with Kettle and even for those who are at an intermediate level. This book takes a very practical approach: The book is full of interesting tutorials/examples (you can download the data/code from the Packt website), which is probably the best way to learn about something new. Maria also made a huge effort to illustrate the more complex topics, which helps the reader to understand the step/process easily.

All in all, I can only recommend this book. It is the easiest way to start with PDI/Kettle and you will be able to create complex transformations/jobs in no time!

You can purchase Pentaho 3.2 Data Integration: Beginner's Guide from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.


This discussion has been archived. No new comments can be posted.

  • Enough acronyms? (Score:5, Insightful)

    by pyite ( 140350 ) on Wednesday June 16, 2010 @12:45PM (#32592464)

    My goodness, would it kill you to state what an acronym stands for the first time you use it?

    • Re: (Score:3, Funny)

      by Dave Emami ( 237460 )
      "Seeing as how the VP is such a VIP, shouldn't we keep the PC on the QT? 'Cause if it leaks to the VC he could end up MIA, and then we'd all be put out in KP."

      But seriously, to answer one of the tags: ETL = Extract Transform and Load. Basically it's how transactional or other data gets into a data warehouse.
    • Re: (Score:3, Funny)

      by PatPending ( 953482 )
      ACRONYM - A Completely Ridiculous Obsolete Noun You'll Misspell
    • by ClosedSource ( 238333 ) on Wednesday June 16, 2010 @12:58PM (#32592622)

      add PDI and ETL to my Resume. I wonder what they mean?

    • Re: (Score:1, Funny)

      by Anonymous Coward

      You should join our new group: Citizens Rejecting Acronym Proliferation. ;)

    • Re: (Score:3, Insightful)

      by Pollardito ( 781263 )
      The worst part is that even if you google Kettle and get to their website, the front page for their product [pentaho.org] is essentially a changelog and roadmap. There are FAQ links but even the "Beginners FAQ" (which should be "WTF is Kettle?" style Q&A) is a product troubleshooting guide.

      I suspect that the same secrecy-obsessed person that built the product website also wrote this review
    • by SplashMyBandit ( 1543257 ) on Wednesday June 16, 2010 @02:40PM (#32593926)
      I hope the reviewer is suitably chastened by this experience. Understanding your likely reader is a very important skill in (technical) writing. Realizing that people come from all sorts of backgrounds should not be a surprise. Each of those people may be very intelligent, they just have a specialty that is not in the same field as the writer. Therefore it is the mark of a competent writer that they'll at least try to expand an acronym the first time they use it. An even better writer might even find a single sentence that explains the concept well. Poor writers (e.g. many soft-science academics and marketers) often obfuscate simple concepts behind jargon and convoluted sentence construction. Their pronouncements can often be written in a much more straightforward way, although that would often reveal that the "Emperor has no clothes". The best writers write simply, use the least complicated word that fits the purpose, and consider possible conceptual pitfalls of readers so try to write unambiguously.
  • by ducomputergeek ( 595742 ) on Wednesday June 16, 2010 @12:56PM (#32592602)

    Was it made things three times more complicated than it needed to be. We needed to integrate one of our products with another and the other product's developer recommended Talend and Pentaho for the job. After two days of looking through the documentation it was complete overkill for what we needed. So we said screw it and directly mapped to their database using JDBC and Plain Ole XML as our transport layer. That only took a day to build.

    • by Per Wigren ( 5315 ) on Wednesday June 16, 2010 @01:51PM (#32593194) Homepage
      I totally agree.

      At work I have built a large data warehouse pretty much from scratch with PostgreSQL and SQL-files, controlled by a set of Ruby scripts. It's simple, powerful, extremely flexible and plenty fast. It imports data from various sources (PostgreSQL, MySQL, MS SQL Server, CSV files on a remote SSH server, XML, custom logfiles, etc) with some HEAVY data cleaning and normalization. On top of that we have lots of autogenerated PDF-reports and a custom built report tool for all kinds of data.

      Recently it was decided that we need a way for managers to generate "cubes" for quick generation of custom, one-off reports on all kinds of dimension of the data. After looking around a bit we settled with Mondrian, which is a part of the Pentaho suite.

      O. M. F. G. What a mess.

      It consists of a deep directory hierarchy with config files and duplicated jar files sprinkled all over. To do simple things like adding a database you have to edit a whole bunch of XML config files in various directories and I even had to copy a jar file from one directory to another. There is plenty of documentation but it's disorganized, overly verbose in the simple areas and overly terse (or nonexistent) in the moderately advanced areas.

      After editing a config file you have to go through its web interface and press one "clean cache" button and one "reload config" button. Then you have to restart the app server and log in again to see your changes. They don't provide any commandline tools to do this. When starting out and building your new cubes there will be a lot of trial-and-error experimentation as the XML schema is somewhat archaic and underdocumented. When asking them on IRC for a way to automate this it took me a lot of explaining WHY I wanted to do this so often before they even attempted to answer. The answer involved copying and modifying .properties files in WEB-INF directories and writing a script that runs curl on various URLs...

      Seemingly they themselves already set up their datawarehouse and cubes a long time ago and have totally forgotten about the experience for NEW users that have to do all this from scratch...

      Anyone know about a decent alternative to Mondrian?
      • Per, writing a ROLAP server is a non-trivial task. Mondrian is the only open source option for you at the moment. There is a MOLAP server called PALO however.

        If you don't mind me saying so but on the one hand you seem to complain that a visual programming tool like Kettle is too hard to use. And at the same time you choose to ignore the tools to configure Mondrian properly. I'm sure there is some kind of pattern here.

        Most people can start using Kettle in a matter of a few minutes to a few hours. I woul

        • Re: (Score:3, Insightful)

          by Per Wigren ( 5315 )

          I will now argue that my home brew "mess" is very simple and clean and it will take any person with decent shell, Ruby and SQL knowledge a VERY short time to get a FULL understanding of. Even my bosses know and appreciate that.

          So, we already have the import, data cleaning, normalization and lots of aggregated tables in place and it's working fine. We don't want to change that. What we need is only a web interface that is easy for the non-technical managers and marketers to use. I can provide special tables

          • The Pentaho stack includes two mondrian viewers, jpivot and analyzer. Jpivot is open and free, and you can set this up yourself. Analyzer is an EE feature which allows business users to create analysis views quickly and easily with a drag-drop interface.

            Mondrian produces XMLA, so any XMLA client will be able to access Mondrian's output.

        • Matt Casters... If I remember correctly one of the Pentaho devs

        • by dintech ( 998802 )

          If you don't mind me saying so but on the one hand you seem to complain that a visual programming tool like Kettle is too hard to use. And at the same time you choose to ignore the tools to configure Mondrian properly. I'm sure there is some kind of pattern here.

          Disclaimer: My experience was 2 years ago.

          Yes there is a pattern. The commonality is that the documentation for almost all of the pentaho suite is verbose but devoid of content. It's impossible to find documentation on the things you actually need d

          • With access to Pentaho's knowledgebase and support you would have had first class support and access to a wealth of documentation.

            Some of this documentation is available on Pentaho's open wiki: http://wiki.pentaho.org/ [pentaho.org]

          • 2 years is indeed a long time for a startup company. In that period we released a host of new versions for the 5 product pillars, improved usability dramatically and 2 Pentaho related books came out to help you on your path (with a third on the way).

            What was once only possible is now fairly straightforward too.

      • Re: (Score:3, Informative)

        by Karem Lore ( 649920 )

        you have to edit a whole bunch of XML config files in various directories

        If you use the open-source version, sure. If you use the Pentaho BI Suite then no.

        You have a central configuration console, schema workbench (available free) for schema design. You can clear the cache programmatically by way of a URL or using the API (which can be fine grained down to tuple) or through the user or enterprise-console.

        Before spouting such drivel, you should look at what exactly you are using and where you have gone wrong in your assumptions. Then, if you are still confused, contact support

      • Recently it was decided that we need a way for managers to generate "cubes" for quick generation of custom, one-off reports on all kinds of dimension of the data.

        Just put everything (and I mean everything) into one cube, and let them slice and dice it as they choose.

        I've heard that suggestion more than once, so I think you got off lightly.

    • As with any complex tool, if you don't know why it's useful, or when it should be used, you're probably going to make a mess.

      The visual nature of kettle masks its complexity due to the "pictures == easy, code == 3ll3t" bias. To simplify a bit, Kettle gives you the ability to create a multi-db, multi-data format "query plan", much as a DB optimiser would do when given a multi-table SQL statement with joins, filters, etc. The problem is that in kettle, you have to understand how to optimise that "query" you

    • I've actually used KETTLE (5 years ago, but still..) Buggy as hell. Would have taken me 1/10th the time to write the thing on my own. Unfortunately my manager at the time insisted that I use it.

      • 5 years ago, poor man! 5 years ago things were pretty wild. I open sourced Kettle in December 2005 so back then we weren't even with Pentaho yet.

        Now we have over 40 developers and a dozen translators, a QA team, doc writers, continuous integration servers, a JIRA system, a wiki, product managers, a sales team, etc.

        Thousands upon thousands of bugs have been fixed in the mean time and thousands of features have been implemented. Since then we released 27 stable versions!

  • So is PDI something like a database agnostic version of MSSQL DTS packages?

    • Re:PDI? (Score:4, Informative)

      by xouumalperxe ( 815707 ) on Wednesday June 16, 2010 @01:01PM (#32592668)
      It's a bit more than "database agnostic" as it can input from a load of non-db sources and output into a load of non-db sinks. I work at a pentaho shop, and one of our biggest projects involves, on the ETL front, parsing several gigs of apache logs per day and stuffing the (filtered) results into a db. We do that using Kettle.
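
For readers unfamiliar with this kind of job, here is a rough sketch of a log-to-database pipeline in plain Python rather than Kettle. It assumes the Apache common log format, keeps only error responses, and loads them into an in-memory SQLite table; the sample lines, the table name and the filter rule are all made up for illustration.

```python
import re
import sqlite3

# Assumed layout: host ident user [time] "method path protocol" status size
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def extract(lines):
    """Extract: parse each line, silently dropping lines that do not match."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            yield match.groupdict()


def transform(records):
    """Transform: keep only error responses and reduce them to three fields."""
    for rec in records:
        status = int(rec["status"])
        if status >= 400:
            yield (rec["host"], rec["path"], status)


def load(conn, rows):
    """Load: insert the filtered rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS errors (host TEXT, path TEXT, status INTEGER)")
    conn.executemany("INSERT INTO errors VALUES (?, ?, ?)", rows)
    conn.commit()


sample = [
    '127.0.0.1 - - [16/Jun/2010:12:45:00 -0400] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.7 - - [16/Jun/2010:12:45:02 -0400] "GET /missing HTTP/1.1" 404 -',
    "garbage line",
]

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(sample)))
loaded = conn.execute("SELECT host, path, status FROM errors").fetchall()
print(loaded)
```

The three functions are chained as generators, so lines are processed one at a time instead of being held in memory all at once.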
  • by CajunArson ( 465943 ) on Wednesday June 16, 2010 @12:57PM (#32592614) Journal

    Seriously, I can't imagine how dumb some people are... complaining about acronyms that can easily be looked up on Wikipedia!

    I mean, a quick search obviously reveals that ETL stands for Express Toll Lanes [wikipedia.org]. Any slashdotter should know that these lanes are used by the many cars generated by the numerous analogies dotting slashdot "discussions".
        And as for Pentaho... let's just break this word down into parts shall we? Penta is the root word for the number 5... duh! Of course, Ho is an accurate description of the only type of woman who will talk to the average slashdotter... assuming the slashdotter has a sufficient Benjamin supply.

        So let's put all of this together shall we? This book is obviously about how you can pick up 5 hoes on a highway quickly and efficiently. This is a life skill that I'm sure many slashdotters are keenly interested in acquiring. How the hell anyone could possibly complain that the reviewer didn't expressly spell out these stupidly obvious terms is frankly beyond me.

    • Considering that the author, María Carina Roldán, is Argentinian, it's obvious that "pentaho" is a misspelling for "pendejo". This book is about a latino asshole who drives an old truck very slowly in the express lane, ignoring all the honking cars behind him. The truck is slow because the radiator is boiling, its nickname is the "Kettle".

  • by aquabat ( 724032 ) on Wednesday June 16, 2010 @01:13PM (#32592790) Journal
    Awesome review! Truly enlightening. Before I saw this article, I had absolutely no idea what Pentaho was, or why I would want it. Now, I know exactly what I'm getting both my friends for Christmas this year. I can't wait to discuss all 492 pages of this treasure with them in the new year.
    • by obender ( 546976 )

      I know exactly what I'm getting both my friends for Christmas this year

      You could buy me one. I am more real than your imaginary friends: Ostap Bender [wikipedia.org]

      • by aquabat ( 724032 )

        I know exactly what I'm getting both my friends for Christmas this year

        You could buy me one. I am more real than your imaginary friends: Ostap Bender [wikipedia.org]

        Perhaps you'd also like the key to the apartment where the money is?

  • Pentaho/Kettle is for "Market Intelligence", and that's why it took me 5 minutes to re-read the article numerous times and an additional 10 minutes of Googling just to find out what this Slashdot story was about. Obviously I'm not smart enough to know anything "Intelligent", such as say, stating what the product/book is actually for? Analysing 'the market', or analysing 'whom I am marketing'?

    Now I am just left with the thought, is this "Intelligence" effort trying to market me? If so they are doing a pre

  • It is not nice to call Maria a ho, much less one of the penta variety. That's not just calling her a ho, but calling her a ho for 5 distinct reasons.

    • i pentaho once... her name was Steely Dan II. "Chewed to bits by a famished candiru in the Upper Baboonsasshole. And don't say 'wheeeeeeee!' this time."
    • Actually scrop1us, posters seem to think they are making some original joke.

      However... Pentaho had indeed 5 founders (penta) and (I say this with all the respect in the world for my esteemed colleagues) they have every intention of selling themselves out.

      So the given definition of 5 hoes is very close to the true meaning of the word Pentaho or so I have been told one drunken evening at Pentaho's bar, the Orlando Ale House.

      Maria is indeed not part of this group of 5 esteemed gentlemen.

      Matt

  • I'm a bit mystified about chapter 8, which sounds a whole heck of a lot like "apt-get install mysql-server" for those who can't apt-get.

    From what little info I have, this software seems to summarize to a super complicated way to push data in and out of databases. The kind of thing normal people would whip up write-once-read-never-again perl scripts full of obscene regexes and mysterious one liners, but if you'd rather do it differently, here's this giant complicated system written in Java and XML with the

  • "A book about the open source ${ACRONYM} tool ${DICTIONARYWORD} (${NONSESNEWORD} Data Integration) is finally available. ${NONSESNEWORD} ${VERSION} Data Integration: Beginner's Guide by ${AUTHORNAME} is for everybody who is new to ${DICTIONARYWORD}. In a nutshell, this book will give you all the information that you need to get started with ${DICTIONARYWORD} quickly and efficiently, even if you have never used it before.The books offers loads of illustrations and easy-to-follow examples. The code can be dow
    • by aquabat ( 724032 )
      Last Monday's xkcd comic (xkcd.com/753/) conveys a similar concept, in the mouseover text.
  • I am glad to see someone has got a book out about this package. If you need something like Pentaho, then writing simple translation scripts is probably not where you want to be. Kettle has a steep learning curve, but has proven to be reasonably reliable, and very flexible.

  • ETL stands for Extract Transform Load. Basically you want to extract data out of your very normalized application database. Transform it into something that makes a little more sense for historical reporting and trending. Then load it into your data warehouse.
  • "What I'd do if I had 2.5 million dollars"
    -Lawrence, Office Space
  • Seen lots of negative Pentaho experiences here, and I'd generally agree. It's one of those "Open Source" projects which forces you into buying their commercial version because they've made it way too complex.

    Luckily the Pentaho project is an umbrella which contains a number of separate products, most of which were developed independently. Which results in there being a big difference in the quality of each component.

    From my experiences, Kettle is a really nice tool for ETL. It is, IMHO, easier to use than

    • Thanks for the thumbs up. Just a thought: everything that is possible with the commercial (Enterprise Edition) version of Pentaho software is possible with the community edition. Please don't confuse us with certain other "Open Source" BI suites.

      To stay on topic I would advise you to simply buy one of the Pentaho books before you get started!

      • I understand that the capabilities are the same, I was getting at the fact that to get anywhere you need to hire Pentaho experts in, and then the only real choice was directly from Pentaho USA. Compare to Microsoft, where you could either learn it through books, or hire a local consultant.

        For me (as BI consultant) supporting the commercial operation behind the OSS project needs to be via licensing/support via customers, and via training. Same as it is with the Microsoft stack. The easier it is for peopl

        • It's unfortunate but experience tells us that unless you sweeten the deal with extras like documentation, configuration/monitoring/EE software, repositories and the like, very few companies would buy anything. That experience is contrary to what I once believed.

          So you can complain that you can't get your hands on nice documentation, the dashboard designer or the console, all part of the enterprise edition. However, when you really compare it to closed source software it's still a lot cheaper. This analys

  • It's nice to have real tangible documentation for this beast. It looked like it has a lot of promise and is powerful out of the box, without having to spend tons of $ on a commercial product, but the documentation was dismal (at least all that I have found).

    ETL is not cheap, and if you have a small project, pretty much unattainable.

  • by Hognoxious ( 631665 ) on Wednesday June 16, 2010 @04:26PM (#32595472) Homepage Journal

    Undefined TLA near every single fucking line. Bailing out, giving up and going home...
