Introduction to Hadoop

Hadoop is composed of two parts:

  • A processing part, which implements MapReduce (described in Google's paper "MapReduce: Simplified Data Processing on Large Clusters"); and
  • A storage part, HDFS (the Hadoop Distributed File System), which:
    • Distributes data across nodes and is natively redundant; and
    • Is easily scalable by adding more machines.

MapReduce

The diagram below shows the shape of a MapReduce job. For thorough information about how it works in practice (and how it was originally conceived), read the paper referenced above.

MapReduce diagram

  1. The master thread reads the input;
  2. It spawns a worker thread for each section the input can be split into;
  3. Each of those threads produces a mapping over its section (in this case, counting the words in it);
  4. The mapped data is shuffled so that each thread holds exactly one group of data that can then be reduced;
  5. Each thread reduces its group (here, the reduction is summing the counts); and
  6. All threads report back to the master thread with their counts, and the master thread reduces further, in this case to a single list.
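
This flow maps directly onto Hadoop's Java API. Below is a minimal sketch of the classic word-count job, essentially the standard Hadoop tutorial example; the class names are illustrative, and the input and output paths come from the command line:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: emit (word, 1) for every word in this worker's input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: after the shuffle, each call receives one word together with
      // all of its 1s; summing them gives the word's total count.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-reduce on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Note that the framework does the splitting, shuffling, and worker management; the programmer only supplies the map and reduce functions.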

HDFS

HDFS is a distributed file system. Files are split into blocks (64 MB each by default) and distributed across nodes. By default, each block is stored as 3 copies across the cluster. This is the Replication Factor.
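
Both settings are visible per file through Hadoop's FileSystem API. A minimal sketch, assuming a cluster running with the defaults above; the path /data/input.txt is a hypothetical example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationInfo {
      public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/input.txt"); // hypothetical example path
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Block size:  " + status.getBlockSize());   // 64 MB by default
        System.out.println("Replication: " + status.getReplication()); // 3 by default

        // The replication factor can also be raised or lowered per file:
        fs.setReplication(file, (short) 5);
      }
    }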

NameNode

Prior to version 2 of Hadoop, there was only 1 NameNode, which held the file system metadata: which blocks make up each file, and on which DataNodes those blocks can be found.

From version 2 onwards, multiple NameNodes can exist for resilience, so that a standby can take over if the active one goes down.

Sizing a Hadoop cluster

By Data

  • Imagine your system has 80 TB of data. With the default Replication Factor of 3, you would need 240 TB of storage across all your nodes.
  • Recommended: 2 TB of disk per CPU core.
  • Recommended: 4 GB of RAM per CPU core.

So for servers with 12 cores, you would need 10 of them: at 2 TB of disk per core, 240 TB of storage requires 120 cores, or 10 servers of 12 cores each.

Disk: 2 TB per core × 12 cores = 24 TB per server; 10 servers × 24 TB = 240 TB in total.
RAM: 4 GB per core × 12 cores = 48 GB per server.

Plus 2 more servers for the NameNode and a secondary NameNode, but these need not be as powerful.
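
The same arithmetic as a short, self-contained sketch; the constants are the figures from this section:

    public class ClusterSizing {
      public static void main(String[] args) {
        double rawDataTb = 80;       // data to be stored
        int replicationFactor = 3;   // HDFS default
        double diskTbPerCore = 2;    // rule of thumb above
        int ramGbPerCore = 4;        // rule of thumb above
        int coresPerServer = 12;

        double totalStorageTb = rawDataTb * replicationFactor;          // 240 TB
        double coresNeeded = totalStorageTb / diskTbPerCore;            // 120 cores
        long servers = (long) Math.ceil(coresNeeded / coresPerServer);  // 10 servers

        System.out.printf("Total storage: %.0f TB%n", totalStorageTb);
        System.out.printf("Servers: %d, each with %d cores, %.0f TB disk, %d GB RAM%n",
            servers, coresPerServer, coresPerServer * diskTbPerCore,
            coresPerServer * ramGbPerCore);
      }
    }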

By Performance

Take a representative sample of the workload and code, and run it on a small test setup (e.g. an EC2 instance). Hadoop scales roughly linearly, so extrapolate from the test results to calculate how much capacity you need.
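
As an illustration of that extrapolation, a sketch; the throughput, job size, and deadline below are made-up placeholder numbers, not benchmarks:

    public class PerformanceSizing {
      public static void main(String[] args) {
        // Placeholder numbers; substitute your own test results.
        double gbPerHourPerNode = 50;  // throughput measured on one test node
        double jobSizeGb = 5000;       // data the job must process
        double deadlineHours = 4;      // required completion time

        // Hadoop scales roughly linearly, so divide the required
        // throughput by the measured single-node throughput.
        double requiredGbPerHour = jobSizeGb / deadlineHours;
        long nodes = (long) Math.ceil(requiredGbPerHour / gbPerHourPerNode);

        System.out.println("Estimated nodes needed: " + nodes); // 25
      }
    }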

Hadoop Use Cases

  • Web log processing;
  • Text search;
  • Social network analysis;
  • Batch pre-processing;
  • Analytics at Last.fm.

For more, see http://wiki.apache.org/hadoop/PoweredBy.