Book Review: Hadoop Beginner's Guide

Book Review: Hadoop Beginner's Guide 57

Posted by samzenpus on Wednesday March 13, 2013 @02:35PM from the read-all-about-it dept.

First time accepted submitter sagecreek writes "Hadoop is an open-source, Java-based framework for large-scale data processing. Typically, it runs on big clusters of computers working together to crunch large chunks of data. You also can run Hadoop in "single-cluster mode" on a Linux machine, Windows PC or Mac, to learn the technology or do testing and debugging. The Hadoop framework, however, is not quickly mastered. Apache's Hadoop wiki cautions: "If you do not know about classpaths, how to compile and debug Java code, step back from Hadoop and learn a bit more about Java before proceeding." But if you are reasonably comfortable with Java, the well-written Hadoop Beginner's Guide by Garry Turkington can help you start mastering this rising star in the Big Data constellation." Read below for the rest of Si's review.

Hadoop Beginner's Guide
author	Garry Turkington
pages	374
publisher	Packt Publishing
rating	9/10
reviewer	Si Dunn
ISBN	9781849517300
summary	Explains and shows how to use Hadoop software in Big Data settings.

Dr. Turkington is vice president of data engineering and lead architect for London-based Improve Digital. He holds a doctorate in computer science from Queens University of Belfast in Northern Ireland. His Hadoop Beginner's Guide provides an effective overview of Hadoop and hands-on guidance in how to use it locally, in distributed hardware clusters, and out in the cloud.

Packt Publishing provided a review copy of the book. I have reviewed one other Packt book previously.

Much of the first chapter is devoted to "exploring the trends that led to Hadoop's creation and its enormous success." This includes brief discussions of Big Data, cloud computing, Amazon Web Services, and the differences between "scale-up" (using increasingly larger computers as data needs grow) and "scale-out" (spreading the data processing onto more and more machines as demand expands).

Dr. Turkington writes, "One of the most confusing aspects of Hadoop to a newcomer is its various components, projects, sub-projects, and their interrelationships."

His 374-page book emphasizes three major aspects of Hadoop: (1) its common projects; (2) the Hadoop Distributed File System (HDFS); and (3) MapReduce.

He explains, "Common projects comprise a set of libraries and tools that help the Hadoop product work in the real world."

The HDFS, meanwhile, "is a filesystem unlike most you may have encountered before." As a distributed filesystem, it can spread data storage across many nodes. "[I]t stores files in blocks typically at least 64 MB in size, much larger than the 4-32 KB seen in most filesystems." The book briefly describes several features, strengths, weaknesses, and other aspects of HDFS.

Finally, MapReduce is a well-known programming model for processing large data sets. Typically, MapReduce is used with clusters of computers that perform distributed computing. In the "Map" portion of the process, a single problem is split into many subtasks that are then assigned by a master computer to individual computers known as nodes (and there can be sub-nodes). During the "Reduce" part of the task, the master computer gathers up the processed data from the nodes, combines it and outputs a response to the problem that was posed to be solved. (MapReduce libraries are now available for many different computer languages, including Hadoop.)

"The developer focuses on expressing the transformation between source and result data sets, and the Hadoop framework manages all aspects of job execution, parallelization, and coordination," Dr. Turkington notes. He calls this "possibly the most important aspect of Hadoop. The platform takes responsibility for every aspect of executing the processing across the data. After the user defines the key criteria for the job, everything else becomes the responsibility of the system."

In this 11-chapter book, the first two chapters introduce Hadoop and explain how to install and run the software.

Three chapters are devoted to learning to work with MapReduce, from beginner to advanced levels. And the author stresses: "In the book, we will be learning how to write MapReduce programs to do some serious data crunching and how to run them on both locally managed and AWS-hosted Hadoop clusters." ["AWS" is "Amazon Web Services."]

Chapter 6, titled "When Things Break" zeroes in on Hadoop's "resilience to failure and an ability to survive failures when they do happen.much of the architecture and design of Hadoop is predicated on executing in an environment where failures are both frequent and expected." But node failures and numerous other problems still can arise, so the reader is given an overview of potential difficulties and how to handle them.

The next chapter, "Keeping Things Running," lays out what must be done to properly maintain a Hadoop cluster and keep it tuned and ready to crunch data.

Three of the remaining chapters show how Hadoop can be used elsewhere within an organization's systems and infrastructure, by personnel who are not trained to write MapReduce programs.

Chapter 8, for example, provides "A Relational View on Data with Hive." What Hive provides is "a data warehouse that uses MapReduce to analyze data stored on HDFS," Dr. Turkington notes. "In particular, it provides a query language called HiveQL that closely resembles the common Structured Query Language (SQL) standard."

Using Hive as an interface to Hadoop "not only accelerates the time required to produce results from data analysis, it significantly broadens who can use Hadoop and MapReduce. Instead of requiring software development skills, anyone with a familiarity with SQL can use Hive," the author states.

But, as Chapter 9 makes clear, Hive is not a relational database, and it doesn't fully implement SQL. So the text and code examples in Chapter 9 illustrate (1) how to set up MySQL to work with Hadoop and (2) how to use Sqoop to transfer bulk data between Hadoop and MySQL.

Chapter 10 shows how to set up and run Flume NG. This is a distributed service that collects, aggregates, and moves large amounts of log data from applications to Hadoop's HDFS.

The book's final chapter, "Where to Go Next," helps the newcomer see what else is available beyond the Hadoop core product. "There are," Dr. Turkington emphasizes, "a plethora of related projects and tools that build upon Hadoop and provide specific functionality or alternative approaches to existing ideas." He provides a quick tour of several of the projects and tools.

A key strength of this beginner's guide is in how its contents are structured and delivered. Four important headings appear repeatedly in most chapters. The "Time for action" heading singles out step-by-step instructions for performing a particular action. The "What just happened?" heading highlights explanations of "the working of tasks or instructions that you have just completed." The "Pop quiz" heading, meanwhile, is followed by short, multiple-choice questions that help you gauge your understanding. And the "Have a go hero" heading introduces paragraphs that "set practical challenges and give you ideas for experimenting with what you have learned."

Hadoop can be downloaded free from the Apache Software Foundation's Hadoop website.

Dr. Turkington's book does a good job of describing how to get Hadoop running on Ubuntu and other Linux distributions. But while he assures that "Hadoop does run well on other systems," he notes in his text: "Windows is supported only as a development platform, and Mac OS X is not formally supported at all." He refers users to Apache's Hadoop FAQ wiki for more information. Unfortunately, few details are offered there. So web searches become the best option for finding how-to instructions for Windows and Macs.

Running Hadoop on a Windows PC typically involves installing Cygwin and openSSH, so you can simulate using a Linux PC. But other choices can be found via sites such as Hadoop Wizard and Hadoop on Windows with Eclipse".

To install Hadoop on a Mac running OS X Mountain Lion, you will need to search for websites that offer how-to tips. Here is one example.

There are other ways get access to Hadoop on a single computer, using other operating systems or virtual machines. Again, web searches are necessary. The Cloudera Enterprise Free product is one virtual-machine option to consider.

Once you get past the hurdle of installing and running Hadoop, Garry Turkington's well-written, well-structured Hadoop Beginner's Guide can start you moving down the lengthy path to becoming an expert user.

You will have the opportunity, the book's tagline states, to "[l]earn how to crunch big data to extract meaning from the data avalanche."

Si Dunn is an author, screenwriter, and technology book reviewer.

You can purchase Hadoop Beginner's Guide from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

This discussion has been archived. No new comments can be posted.