Hadoop Basic Guide


Hadoop takes its name from an actual toy elephant: one that belonged to the son of Hadoop co-founder Doug Cutting. But naming trivia is not why you are here. Hadoop is a platform for parallel computing that acts much like an operating system.

It goes without saying that it is beyond the capabilities of a single machine to process all the big data that is constantly being generated and collected around us. By emulating the parallel processing of supercomputers, Hadoop provides a framework for handling these massive data sets.

If supercomputers can already parallelize the processing of big data, why not simply use them? A few reasons:

  • Since supercomputers lack a standard operating system (or OS-like framework), they are out of reach for most small and medium-sized businesses.
  • Both the initial investment and ongoing servicing costs are high.
  • Supported hardware must typically come from a single manufacturer, so a business cannot save money by buying parts from different vendors and assembling the machines itself.
  • In most scenarios, running a supercomputer requires the creation of specialized software designed for that machine.
  • They are hard to scale horizontally.

Hadoop is a lifesaver because it eliminates all of these problems by providing a free, open-source, operating-system-like platform for parallel processing that works with commodity hardware and does not require expensive, specialized software licenses.

Since its initial release in 2006, Hadoop has seen two additional stable releases.

Let’s take a closer look at Hadoop’s architecture; I’ll begin with Hadoop 1 so that Hadoop 2’s structure is easier to grasp. Some familiarity with commodity hardware, clusters and cluster nodes, distributed systems, and hot standby is assumed.

Hadoop 1 Architecture

The primary hardware parts of the original Hadoop architecture are as follows:

Master Nodes:

  • Name Node: Hadoop’s centralized file system manager. It stores no data itself, but keeps track of how many blocks each data file was broken into, the block size, and which data nodes store and process each individual block.
  • Secondary Name Node: a backup for the primary Name Node, but not in hot standby.
  • Job Tracker: Hadoop’s central job scheduler, responsible for determining when and where a job will run on each data node.

In a production setting, each of the above nodes is a standalone machine running in master mode, typically placed in a separate rack so that the failure of a single rack cannot take down multiple master nodes.

Slave Nodes:

  • Data Nodes: independent machines that store and process the working files in the form of data blocks.
  • Task Tracker: a service on each Data Node that receives work from the Job Tracker, keeps track of what is happening on the Data Node, and reports progress back to the Job Tracker. There is one Task Tracker per Data Node.

The slave nodes cannot perform any processing without explicit instructions from the master nodes. Every three seconds, each slave node confirms that it is online by sending a heartbeat signal to the Name Node.

A cluster is formed when all the above master and slave nodes are linked to one another via some form of networking.

The Name Node and Secondary Name Node mainly need storage space, while the Job Tracker needs significantly more processing power. The data nodes, on the other hand, are typically the machines with the most RAM and processing power in the cluster.

Deployment Modes


Hadoop provides three main deployment or configuration modes, which are as follows:

  1. Standalone Mode: all Hadoop services (Name Node, Secondary Name Node, Job Tracker, and Data Node) run on a single machine inside a single Java Virtual Machine (JVM). Standalone mode is rarely used these days.
  2. Pseudo-Distributed Mode: all Hadoop services run locally on a single machine, but in separate JVMs, to simulate a distributed system. This mode is typically used for development, testing, and training.
  3. Fully Distributed Mode: every Hadoop service runs on its own dedicated machine; this is the mode used in production.
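
In practice, the difference between these modes comes down to configuration. Here is a minimal sketch, assuming the Hadoop 2 property names fs.defaultFS and mapreduce.framework.name and a Hadoop client on the classpath, that reads the settings which distinguish local execution from a distributed cluster:

    import org.apache.hadoop.conf.Configuration;

    public class ShowDeploymentMode {
        public static void main(String[] args) {
            // Loads the client-side Hadoop configuration (core-site.xml, mapred-site.xml, etc.).
            Configuration conf = new Configuration();

            // file:/// means the local disk (standalone); hdfs://host:port means an HDFS cluster.
            System.out.println("fs.defaultFS             = " + conf.get("fs.defaultFS", "file:///"));

            // "local" runs jobs in a single JVM; "yarn" hands them to the cluster (Hadoop 2).
            System.out.println("mapreduce.framework.name = " + conf.get("mapreduce.framework.name", "local"));
        }
    }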

What is a Job in Hadoop?

A Job is the Hadoop equivalent of a Python script or program that is run to accomplish a specific task. A Hadoop job consists of one or more programs packaged as a JAR file; the JAR is submitted to the Hadoop cluster and run against the input (raw) data stored on the data nodes, and the processed output is written to a predetermined location.
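
As a rough sketch of what submitting a job looks like in practice, here is a minimal driver class modeled on the canonical Hadoop word-count example. The class and path names are illustrative; WordCountMapper and WordCountReducer stand in for the user’s Map and Reduce logic (a concrete version of both appears at the end of the worked example later in this post).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");

            // The JAR containing this class (and the Mapper/Reducer) is shipped to the cluster.
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);    // illustrative Map logic
            job.setReducerClass(WordCountReducer.class);  // illustrative Reduce logic
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Raw input is read from HDFS; processed output is written to a predetermined location.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }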

Hadoop’s Software Components

Having covered Hadoop’s hardware prerequisites, we can move on to its software building blocks. Hadoop’s foundational programs are as follows:

  1. HDFS (Hadoop Distributed File System) is a file system used for storing and retrieving data in Hadoop.
  2. MapReduce: Hadoop’s programming layer, a Java-based framework that processes the data stored in HDFS in parallel.

MapReduce, in turn, is made up of:

  • A parallel processing phase called Map, which is defined by the user.
  • Each Map phase’s output is aggregated in a user-defined Reduce phase.

To be clear, Hadoop is the parallel processing platform, and MapReduce is the framework it makes available (a skeleton that users adapt to their needs). While MapReduce is the most popular framework, Hadoop also supports others, such as Spark.

Hadoop Distributed File System (HDFS)

HDFS, or the Hadoop Distributed File System, is the file-management component of the Hadoop ecosystem. It is responsible for storing and tracking large data sets (both structured and unstructured data) across the various data nodes.

An input file of 200 MB in size will help us visualize how HDFS operates. As was previously mentioned, this single file will be split up into multiple blocks and stored on the data nodes to allow for parallel processing.

HDFS’s default block size (a global setting that the Hadoop administrator can change) is 64 MB. This means our 200 MB example input file will be divided into four blocks: three of 64 MB and a fourth of 8 MB. HDFS splits the input file into these blocks and stores them on the appropriate data nodes.

Notably, the splitting of the input file takes place on the client machine, outside the Hadoop cluster, while the name node decides (using its placement algorithm) which data nodes each block will land on. The data itself never passes through the name node: once the client receives the block placement from the name node, it writes the data blocks directly to the data nodes.

Like a book’s table of contents, the File System Image (FS Image) on the name node records the location of each data block across the various data nodes, along with other metadata such as block size, hostname, and so on.
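
To make this metadata concrete, here is a minimal client-side sketch using the standard HDFS FileSystem API. It assumes the client’s configuration points at the cluster and that /data/input.txt is a hypothetical file already stored in HDFS; it asks the name node for the file’s block size, replication factor, and the data nodes hosting each block.

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockReport {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/input.txt"); // hypothetical HDFS path

            FileStatus status = fs.getFileStatus(file);
            System.out.println("File size:   " + status.getLen() + " bytes");
            System.out.println("Block size:  " + status.getBlockSize() + " bytes");
            System.out.println("Replication: " + status.getReplication());

            // Ask the name node which data nodes hold each block of the file.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("Block at offset " + block.getOffset()
                        + " (" + block.getLength() + " bytes) on hosts "
                        + Arrays.toString(block.getHosts()));
            }
        }
    }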

Data Node Fault Management

But what happens if a data node fails? Even if only one data node goes down, the entire input file becomes unusable, because a crucial piece of the puzzle is now missing. Pushing the original data file back into the Hadoop cluster would be inefficient and time-consuming in a production setting, where we are typically dealing with files of hundreds of gigabytes.

To guard against this, each data block is duplicated on neighboring data nodes. The Replication Factor determines how many replicas of each data block are kept for safety. The default replication factor is three, so the cluster normally holds three copies of each data block: every block on a given data node is also stored on two additional data nodes. The replication factor can also be set on a per-file basis at the time the source data file is pushed into HDFS.
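
As a small illustration (with a hypothetical path and a hypothetical choice of two replicas), the replication factor can be controlled through the same FileSystem API, either as a client-side default or for an existing file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Default replication factor for files this client writes from now on.
            conf.set("dfs.replication", "2");

            FileSystem fs = FileSystem.get(conf);

            // Change the replication factor of a file that already sits in HDFS.
            fs.setReplication(new Path("/data/input.txt"), (short) 2);

            fs.close();
        }
    }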

If a data node stops sending heartbeats to the name node, the replicas on other data nodes take over. To keep the replication factor of three intact across the cluster, the name node then initiates another copy of the affected data blocks onto other healthy data nodes.

Secondary Name Node

In Hadoop 1.0, a Secondary Name Node is used to replicate the File System Image held by the Name Node. One of Hadoop 1.0’s biggest drawbacks, though, is that the Secondary Name Node does not function in hot standby mode. If the Name Node goes down, the Hadoop cluster as a whole becomes inoperable (the data is still stored on the Data Nodes, but it is inaccessible because the cluster has lost the FS Image), and the contents of the Secondary Name Node have to be manually copied over to the Name Node.

We’ll get back to this limitation in a bit, but know that it was fixed in Hadoop 2.0 by making the Secondary Name Node into a Hot Standby.

Hadoop 2.0

Hadoop 2.0 goes by several other names, including MapReduce 2 (MR2) and Yet Another Resource Negotiator (YARN).

Let’s take a stab at figuring out what sets Hadoop 2.0 apart from Hadoop 1.0 in terms of its underlying architecture. Recall from Hadoop 1.0 that the Job Tracker acts as a centralized job scheduler, breaking a single job down into multiple smaller tasks and sending them off to separate data nodes, where the Task Tracker keeps tabs on the progress of each task and reports back to the Job Tracker. In addition to scheduling jobs, the Job Tracker allocates system resources to each data node statically (that is, the resource allocation is not dynamic).

Although Hadoop’s underlying file system, HDFS, is still in use, the Job Tracker has been replaced in version 2.0 by a new system called YARN. YARN is not limited to just the MapReduce framework, but rather supports other parallel processing frameworks as well, such as Spark. When comparing YARN to Hadoop 1.0’s Job Tracker, the latter only supports up to 4,000 data nodes. YARN, on the other hand, can support up to 10,000.

YARN is made up of two parts: the Scheduler and the Applications Manager. In Hadoop 1.0, the Job Tracker handled both of these responsibilities on its own. Delegating them to dedicated YARN components makes more efficient use of the system’s resources.

Additionally, in Hadoop 2.0, a Node Manager (which runs in slave mode) takes the place of the Task Tracker on each data node. For resource management, the Node Manager talks directly to YARN’s Applications Manager.

Earlier, we mentioned that Hadoop 2.0 has a Hot Standby Name Node that takes over in the event that the Name Node fails. If the primary Name Node and the hot standby Name Node both go down, the secondary Name Node will come in handy.

What is MapReduce?

As its name suggests, MapReduce consists of two stages, Map and Reduce, each of which in turn consists of three sub-stages:

Map Stage

Hadoop’s parallelization kicks in during the Map stage: all three of its sub-stages are executed on each of the data blocks stored on the individual data nodes.

Record Reader

The Record Reader reads the input file line by line and generates two outputs for each line:

  • Key: a number (the line’s offset within the file)
  • Value: the entire line

Mapper

The Mapper processes each key-value pair returned by the Record Reader individually, applying whatever logic the problem statement requires. It runs a user-defined function and emits new key-value pairs.

Sorter

The keys from the Mapper’s output are fed into the Sorter, which sorts them lexicographically (if the keys are numbers, the Sorter sorts them in ascending numerical order). The Sorter is a fixed, built-in component; the user cannot customize it, only feed it output to sort.

Reduce stage

By the end of the Map phase, there is one Mapper output per data node. The Reduce operation is then carried out on all of these outputs on a single data node.

When performing a Reduce operation, there are three distinct phases:

Merge

The intermediate results of all the Map operations are appended together into a single file.

Shuffler

The Shuffler, another fixed, built-in module, groups all duplicate keys in its input and outputs each distinct key once, together with the list of all its values.

Reducer

The Shuffler’s output is sent to the Reducer, the user-programmable module of the Reduce stage. The Reducer applies the logic required by the problem statement and returns a final set of key-value pairs.

Practical Example

To demonstrate the inner workings and process flow of MapReduce, I will use a very basic problem statement that does not involve ML. Take the following two lines from an input file as an example:

Processing big data through Hadoop is easy

Hadoop is not the only big data processing platform

Our job is to count how many times each word appears in the input file, and the result should be:

processing    2

big           2

data          2

through       1

Hadoop        2

is            2

easy          1

not           1

the           1

only          1

platform      1

After going through the MapReduce steps:

  • Once the Record Reader finishes processing the first line, it will produce:
  • Key: 0 (the offset of the first line in the file)
  • Value: the entire first line, “Processing big data through Hadoop is easy”
  • The Mapper can then be instructed to carry out the following steps:
  • First, ignore the key.
  • Second, split the line into its individual words (tokenization).
  • Third, generate output as a set of key-value pairs, with each word serving as a key and the number of times it appears in the line serving as the value.
  • To that end, the combined result of the Mapper’s processing of the two lines will look something like this:

Processing     1

big            1

data           1

through        1

Hadoop         1

is             1

easy           1

Hadoop         1

is             1

not            1

the            1

only           1

big            1

data           1

processing     1

platform       1

  • This is roughly how the Sorter will present its results:

big            1

big            1

data           1

data           1

easy           1

Hadoop         1

Hadoop         1

is             1

is             1

not            1

only           1

platform       1

processing     1

Processing     1

the            1

through        1

  • The Shuffler will then produce something like this as its output:

big            1, 1

data           1, 1

easy           1

Hadoop         1, 1

is             1, 1

not            1

only           1

platform       1

processing     1, 1

the            1

through        1


  • The Reducer can then be instructed to:
  • First, read each key and its list of values from the Shuffler’s output.
  • Second, add up the values in that key’s list.
  • Third, emit a new key-value pair, with the word as the key and the sum as the value.
  • Fourth, repeat these steps for every key-value pair received from the Shuffler.
  • As a result, the output of the Reducer will be:

big            2

data           2

easy           1

Hadoop         2

is             2

not            1

only           1

platform       1

processing     2

the            1

through        1
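
The same word count can be written as a short MapReduce program. The sketch below follows the canonical Hadoop word-count example and shows the two classes a driver like the earlier sketch would wire together; the class names are illustrative, and note that the counting here is case-sensitive, so “Processing” and “processing” would strictly be separate keys.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map stage: the Record Reader hands each line in as (byte offset, line text);
    // the Mapper emits (word, 1) for every word in the line.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: after merge, shuffle, and sort, each word arrives with the list
    // of 1s collected for it; the Reducer sums the list and emits (word, count).
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }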

Use Cases of MapReduce

Some examples of MapReduce in industry include:

  • Search engines that index and query large data sets
  • Google used it for word counts, AdWords, PageRank, indexing data for Google Search, and clustering articles for Google News (Google has since moved on from MapReduce)
  • Text algorithms such as grep, text indexing, and reverse indexing
  • Data mining
  • Facebook uses it for data mining, ad optimization, and spam detection
  • Business intelligence at banks and other financial institutions
  • Non-real-time, batch processing

Conclusion

This was obviously a very broad, non-technical overview of Hadoop and MapReduce. I haven’t even scratched the surface of the wider Hadoop ecosystem: Hive, ZooKeeper, Pig, HBase, Spark, and so on.

Feel free to get in touch if you have any questions or comments about the material presented above, any of the material presented in my other posts, or anything else at all related to data analytics, machine learning, and financial risk.