Storm

Hadoop is built for processing data at rest. Data streams, by contrast, are fast-moving data that arrive in real time.

We need a way to process such data in parallel, store it, and potentially combine multiple streams.

This is where Storm comes in.

Elements of Storm

The official Storm documentation describes each of these in more depth.

  • Streams of Data;
  • Spouts;
  • Bolts;
  • Topologies;
  • Stream Groupings; and
  • Reliability.

Streams

A Stream is an unbounded sequence of Tuples. A Tuple is a class in Storm representing an ordered list of named values; in effect, a collection of name-value pairs.
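
As a minimal sketch, assuming a recent Storm release where the tuple classes live under org.apache.storm.tuple (the field names are invented for illustration): an emitter declares a schema with Fields, and each emitted Values lines up positionally with it.

    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class TupleSketch {
        public static void main(String[] args) {
            // Schema: an ordered list of field names, declared by the emitter.
            Fields fields = new Fields("word", "count");

            // A value list that lines up positionally with the schema above.
            Values values = new Values("storm", 1);

            // Pairing them gives the name-value view a receiving Bolt sees.
            for (int i = 0; i < fields.size(); i++) {
                System.out.println(fields.get(i) + " = " + values.get(i));
            }
        }
    }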

Spouts

A Spout is a source of Streams: it wraps an external data source (a queue, a log, an API) and emits Tuples into the topology.
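
As a sketch, a Spout that emits random sentences from a fixed list. The BaseRichSpout lifecycle (open, nextTuple, declareOutputFields) is standard Storm API, assuming Storm 2.x signatures; the class name and sentences are invented:

    import java.util.Map;
    import java.util.Random;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class RandomSentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();
        private final String[] sentences = {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away"
        };

        @Override
        public void open(Map<String, Object> conf, TopologyContext context,
                         SpoutOutputCollector collector) {
            this.collector = collector; // keep the collector for emitting
        }

        @Override
        public void nextTuple() {
            // Storm calls this in a loop; emit one Tuple per call.
            collector.emit(new Values(sentences[random.nextInt(sentences.length)]));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence")); // schema of the output Stream
        }
    }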

Bolts

A Bolt receives one or more Streams, processes the Tuples, and may emit one or more new Streams (or none at all, e.g. when it only writes to storage).

Examples of Bolt usage (a split Bolt is sketched after this list):

  • Combine;
  • Split;
  • Aggregate; or
  • Save.
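
A sketch of a split Bolt, continuing the Spout example above. It uses BaseBasicBolt, which acks input Tuples automatically (see Reliability below):

    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class SplitSentenceBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            // Receive one sentence Tuple, emit one Tuple per word: a "split".
            for (String word : input.getStringByField("sentence").split("\\s+")) {
                collector.emit(new Values(word));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }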

Topologies

A Topology is the graph in which Spouts and Bolts (and therefore Streams) are organized, analogous to a network topology; it is the unit you submit to a Storm cluster to run.
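
Continuing the invented word-count example, a topology is wired up with TopologyBuilder. The component names ("sentences", "split") are arbitrary labels, and LocalCluster (AutoCloseable in Storm 2.x) runs it in-process for testing:

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.topology.TopologyBuilder;

    public class WordCountTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // Nodes of the graph: one Spout and one Bolt.
            builder.setSpout("sentences", new RandomSentenceSpout());

            // Edge of the graph: the Bolt subscribes to the Spout's Stream.
            builder.setBolt("split", new SplitSentenceBolt())
                   .shuffleGrouping("sentences");

            // Run the topology locally for ten seconds.
            try (LocalCluster cluster = new LocalCluster()) {
                cluster.submitTopology("word-count", new Config(),
                                       builder.createTopology());
                Thread.sleep(10_000);
            }
        }
    }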

Stream Grouping

Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt's tasks.

There are eight built-in stream groupings in Storm, of which Shuffle grouping seems to be the most common. The official documentation has more in-depth explanations of each.
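
For example, extending the builder from the Topologies sketch (WordCountBolt is a hypothetical counting Bolt): Shuffle grouping spreads Tuples randomly and evenly across a Bolt's tasks, while Fields grouping routes Tuples with equal values in the named fields to the same task, which is exactly what a counter needs.

    // Random, even distribution: any of the 4 tasks may get any sentence.
    builder.setBolt("split", new SplitSentenceBolt(), 4)
           .shuffleGrouping("sentences");

    // Partition by the "word" field: the same word always reaches the
    // same task, so each task's running count stays consistent.
    builder.setBolt("count", new WordCountBolt(), 8)
           .fieldsGrouping("split", new Fields("word"));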

Reliability

Storm guarantees that every Tuple will be fully processed; if processing fails or times out, the Tuple is replayed from its Spout. To make this work, you must tell Storm whenever a new edge in a Tuple tree is created (anchoring) and whenever a Bolt has finished processing an individual Tuple (acking).
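
A sketch of explicit anchoring and acking with BaseRichBolt (BaseBasicBolt, used earlier, does both automatically):

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class AnchoredSplitBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map<String, Object> conf, TopologyContext context,
                            OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            for (String word : input.getStringByField("sentence").split("\\s+")) {
                // Passing `input` as the first argument anchors the new Tuple
                // to its parent, growing the Tuple tree.
                collector.emit(input, new Values(word));
            }
            // Acking tells Storm this Bolt is done with the input Tuple.
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

For replay to work, the Spout must also emit each Tuple with a message ID (the two-argument emit on SpoutOutputCollector); otherwise Storm has nothing to replay from.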

Other properties

  • Fairly simple API (simpler than MapReduce);
  • Scalable (due to tweakable parallelism and rebalancing at runtime; see the sketch after this list); and
  • Fault-tolerant (like Hadoop, it will restart a stalled process).
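
A hedged sketch of the parallelism knobs, again extending the invented word-count builder (the numbers are arbitrary): the parallelism hint sets the initial number of executor threads, the task count stays fixed and bounds how far a runtime rebalance can spread the work, and setNumWorkers controls the JVM processes.

    Config conf = new Config();
    conf.setNumWorkers(2); // JVM worker processes across the cluster

    builder.setBolt("split", new SplitSentenceBolt(), 4) // 4 executor threads
           .setNumTasks(8) // 8 tasks, redistributable at runtime by rebalancing
           .shuffleGrouping("sentences");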

Names

  • Nimbus: the master node.
  • Supervisors: the worker nodes (the name "worker" is already taken: it refers to the JVM processes a Supervisor launches).
  • ZooKeeper: coordinates Nimbus and the Supervisors, running on the Nimbus node or as a separate multi-node ensemble.

If the Nimbus dies after a topology has been submitted, that's OK: the cluster continues running, but no new topologies can be submitted and no tasks can be reassigned.

So if a Supervisor dies after the Nimbus has died, its tasks cannot be reassigned to other machines, and that part of the topology stops processing.