More MapReduce

This thing's pretty important.

MapReduce Process diagram

This is key-value pairs.

Operates on each input, parses it and outputs intermediate KV pairs.

Distributes intermediate KV pairs and nominates the target reducer for keys.

Normally a hash function, but can be overriden.

Shuffle redirects the Mapper output to the correct Reducer.

Sort sorts the input to each Reducer after the shuffle.

Receives the sorted intermediate KV pairs, and aggregates their values by key.

Writes the output to HDFS (one file per reducer, e.g. Part-r-0001).

A Combiner is a Map-side reducer. It is used to decrease the amount of data transferred between nodes in the network.

Instead of sending a bunch of <key,1>, use a Combiner to send <key,n>.

Full example (average tweets)

Average Tweets MapReduce flow

Remember the code still has to be written to e.g. use UserID in the mapper.