More MapReduce

This thing's pretty important.

MapReduce Process diagram

Input

This is key-value pairs.

Map

Operates on each input, parses it and outputs intermediate KV pairs.

Partitioner

Distributes intermediate KV pairs and nominates the target reducer for keys.

Normally a hash function, but can be overriden.

Shuffle/Sort

Shuffle redirects the Mapper output to the correct Reducer.

Sort sorts the input to each Reducer after the shuffle.

Reduce

Receives the sorted intermediate KV pairs, and aggregates their values by key.

Output

Writes the output to HDFS (one file per reducer, e.g. Part-r-0001).

Also, Combiners

A Combiner is a Map-side reducer. It is used to decrease the amount of data transferred between nodes in the network.

Instead of sending a bunch of <key,1>, use a Combiner to send <key,n>.

Full example (average tweets)

Average Tweets MapReduce flow

Remember the code still has to be written to e.g. use UserID in the mapper.