Storm
Hadoop is for processing data at rest. Data streams are fast-moving data that arrive in real time.
We need a way to process such streams in parallel and store them, and potentially ways of combining multiple streams.
This is where Storm comes in.
Elements of Storm
The official documentation can be found here.
- Streams of Data;
- Spouts;
- Bolts;
- Topologies;
- Stream Groupings; and
- Reliability.
Streams
A Tuple is a class in Storm; tuples are collections of name-value pairs.
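A conceptual sketch of the idea, in Python rather than Storm's Java API: a tuple pairs a declared schema of field names with values, and fields are then accessed by name.

```python
# Conceptual sketch only -- not Storm's actual Tuple class. The field names
# and values here are made up for illustration.
fields = ["word", "count"]            # schema declared by the emitting component
values = ["storm", 3]

tuple_ = dict(zip(fields, values))    # a tuple as name-value pairs
print(tuple_["word"])                 # fields are looked up by name -> "storm"
```

In real Storm, the schema is declared once by the emitting component and accessors such as `getStringByField` do the lookup by name.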
Spouts
The data source that outputs the tuples (Streams).
Bolts
Elements that receive one or more Streams, process them, and emit one Stream.
Examples of Bolt usage:
- Combine;
- Split;
- Aggregate; or
- Save.
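The "Split" case above can be sketched conceptually, with a bolt modeled as a generator that consumes an input stream of tuples and emits a new tuple per word (this is an illustration only, not Storm's Bolt interface):

```python
# Conceptual sketch, not Storm's API: a "Split" bolt that receives one input
# Stream of sentence tuples and emits one output Stream of word tuples.
def split_bolt(stream):
    for tup in stream:                        # receive tuples from the input Stream
        for word in tup["sentence"].split():
            yield {"word": word}              # emit one tuple per word

spout = iter([{"sentence": "storm processes streams"}])   # stand-in data source
words = list(split_bolt(spout))
print(words)  # [{'word': 'storm'}, {'word': 'processes'}, {'word': 'streams'}]
```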
Topologies
These are like network topologies: the way in which Spouts and Bolts (and therefore Streams) are organized.
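A minimal sketch of the idea, wiring a stand-in spout through two bolts; the component names and wiring are hypothetical, not Storm's TopologyBuilder API:

```python
# Conceptual sketch: a topology as a directed graph of a spout feeding bolts.
def spout():                                  # data source
    yield {"sentence": "to be or not to be"}

def split(stream):                            # bolt: split sentences into words
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}

def count(stream):                            # bolt: aggregate word counts
    counts = {}
    for tup in stream:
        counts[tup["word"]] = counts.get(tup["word"], 0) + 1
    return counts

# Topology: spout -> split bolt -> count bolt
result = count(split(spout()))
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In real Storm each component would additionally run as multiple parallel tasks, which is where stream groupings (next section) come in.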
Stream Grouping
Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt's tasks.
There are eight built-in stream groupings in Storm, of which Shuffle would seem to be the most common. The documentation here has more in-depth explanations.
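Two of the built-in groupings can be sketched as partitioning functions (a conceptual illustration; the task count and tuple shape are made up): shuffle grouping spreads tuples randomly and evenly, while fields grouping routes all tuples with the same value of the chosen fields to the same task.

```python
import random

# Conceptual sketch: a grouping decides which of a bolt's parallel tasks
# receives each tuple. NUM_TASKS is a hypothetical parallelism setting.
NUM_TASKS = 4

def shuffle_grouping(tup):
    return random.randrange(NUM_TASKS)        # random, evenly distributed

def fields_grouping(tup, fields=("word",)):
    key = tuple(tup[f] for f in fields)
    return hash(key) % NUM_TASKS              # same key -> always the same task

tup = {"word": "storm"}
assert fields_grouping(tup) == fields_grouping(tup)  # deterministic per key
```

Fields grouping is what makes stateful bolts (such as a word counter) correct under parallelism: every occurrence of a given word lands on the same task.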
Reliability
Storm guarantees that every tuple will be fully processed or, on failure, replayed from the Spout. To do this you must tell Storm whenever a Tuple tree is being created, and whenever a Bolt has finished processing an individual Tuple.
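The bookkeeping behind this can be sketched as a counter per tuple tree (a simplification of Storm's anchoring/acking mechanism; the data structures here are hypothetical): emitting an anchored child tuple grows the tree, acking shrinks it, and when the count reaches zero the spout tuple is fully processed.

```python
# Conceptual sketch of tuple-tree tracking, not Storm's internals.
pending = {}   # spout tuple id -> number of unacked tuples in its tree

def emit_spout(tid):
    pending[tid] = 1               # the spout tuple itself is pending

def emit_anchored(tid):
    pending[tid] += 1              # a bolt emits a child tuple: the tree grows

def ack(tid):
    pending[tid] -= 1              # a bolt finished processing one tuple
    if pending[tid] == 0:
        del pending[tid]
        print(f"tuple tree {tid} fully processed")

emit_spout("t1")
emit_anchored("t1")   # a bolt emitted one child tuple
ack("t1")             # the child is acked
ack("t1")             # the spout tuple is acked -> tree complete
```

If a tuple is not acked within a timeout, Storm replays it from the Spout, which is why tuples must be acked promptly.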
Other properties
- Fairly simple API (simpler than MapReduce);
- Scalable (due to tweakable parallelism and rebalancing during runtime); and
- Fault-tolerant (like Hadoop, will restart a stalled process).
Names
- Nimbus: the master node;
- Supervisors: the worker nodes (why they weren't called workers, I'm not sure); and
- Zookeeper: running on the Nimbus, or as multiple instances.
If the Nimbus dies after the topology has been defined, that's OK: the cluster continues running, but no new tasks can be assigned. So if a Supervisor dies after the Nimbus has died, its tasks cannot be reassigned and the cluster will fail.