Three Types of Big Data
Twitter data
ID | Text |
---|---|
1 | Team Work is the theme of this week's Photo Comp - enter now at bit.ly/... |
2 | Solved my I/O issues - conclusion in the latest update: bit.ly/... Thanks @nayanraval @tonyrogerson |
3 | RT @SNMUndergrad: Kirkcaldy campus currently offline for phone/internet.To contact staff call 388534 |
... | ... |
What is the problem? Velocity? Volume?
How fast could we process all the tweets coming from Twitter? Probably not fast enough to do it in real time.
Twitter messages are also much bigger than their text suggests: each one returned by the API (https://dev.twitter.com/docs/api/1.1/get/search/tweets) carries many attributes, and it may get bigger over time as more attributes are added. Is tabular data the best way to store them?
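To make the storage question concrete, here is a minimal sketch of a tweet held as one nested document rather than a flat row. The field names are an illustrative subset chosen for this sketch, not a complete or authoritative schema, and the `example_user` value is invented.

```python
import json

# An illustrative subset of tweet attributes; the field names are
# assumptions for this sketch, not a complete schema.
tweet = {
    "id": 2,
    "text": "Solved my I/O issues - conclusion in the latest update: bit.ly/... "
            "Thanks @nayanraval @tonyrogerson",
    "user": {"screen_name": "example_user"},  # hypothetical value
    "entities": {
        "user_mentions": [
            {"screen_name": "nayanraval"},
            {"screen_name": "tonyrogerson"},
        ],
    },
}


def mentioned_users(doc):
    """Return mentioned screen names, tolerating documents that lack the field."""
    mentions = doc.get("entities", {}).get("user_mentions", [])
    return [m["screen_name"] for m in mentions]


# The whole nested structure is stored as one document (here, a JSON string).
stored = json.dumps(tweet)
restored = json.loads(stored)

print(mentioned_users(restored))
# A smaller document without the nested attributes still works.
print(mentioned_users({"id": 1, "text": "no entities here"}))
```

A flat table would need either extra join tables for the nested lists or a wide row with many mostly empty columns; the document form keeps everything about one tweet together.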
Click-through data
Consider Google Analytics data or in-app usage data, for example. As the app or the page evolves, the data it produces is bound to change. More fields may be added, so tabular data with a static schema may not be the best way to store this data.
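A small sketch of why a static schema struggles here: if later releases add fields, a single table needs the union of all columns, and older events leave those columns empty. The events and field names below are hypothetical, invented only for this illustration.

```python
# Hypothetical click events; the first two predate fields added in a later release.
events = [
    {"page": "/home", "timestamp": 1},
    {"page": "/pricing", "timestamp": 2},
    {"page": "/home", "timestamp": 3, "button_id": "signup", "ab_variant": "B"},
]


def table_columns(docs):
    """Columns a single static table would need to hold every event."""
    columns = []
    for doc in docs:
        for key in doc:
            if key not in columns:
                columns.append(key)
    return columns


def empty_cells(docs):
    """Count the NULL cells such a table would contain."""
    columns = table_columns(docs)
    return sum(1 for doc in docs for col in columns if col not in doc)


print(table_columns(events))
print(empty_cells(events))
```

Every new field widens the table for all past rows, whereas schema-less documents simply carry whichever fields they were written with.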
Fridges and Meters
Suppose we are monitoring fridges as part of an Internet of Things. Different fridges may submit different sensor data, like so:
Indesit | Beko | Zanussi |
---|---|---|
Top Temp (16 bit) | Temp (String) | Top Temp (8 bit) |
Bottom Temp (16 bit) | Middle Temp (8 bit) | |
Bottom Temp (8 bit) | | |
On Time (32 bit) | | |
Coolant Temp (String) | | |
As you can see, the measurements differ in size and type, and they differ for each make (and possibly each fridge). Tabular data with a fixed schema might not be the best way to store this data. Instead, a schema-less NoSQL approach might be worth considering.
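As a rough sketch of the schema-less idea, each fridge reading can be kept as its own document with whatever fields that make reports. The field names follow the first row of the table above; the values are invented, and the helper functions are hypothetical names for this example.

```python
# Hypothetical readings; values are invented for illustration.
readings = [
    {"make": "Indesit", "top_temp": 4},   # 16-bit integer on the device
    {"make": "Beko", "temp": "4.5C"},     # free-form string
    {"make": "Zanussi", "top_temp": 5},   # 8-bit integer on the device
]


def makes_reporting(docs, field):
    """List the makes whose documents contain the given field."""
    return [d["make"] for d in docs if field in d]


def fields_by_make(docs):
    """Map each make to the sensor fields it actually submits."""
    return {d["make"]: sorted(k for k in d if k != "make") for d in docs}


print(makes_reporting(readings, "top_temp"))
print(fields_by_make(readings))
```

No document is forced to carry columns it has no value for, and a new make with new sensors can be stored without altering anything that already exists.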
Mass Spectrometers
A mass spectrometer might store hundreds of thousands of data points. If we had to store all of that data in a single SQL database row, we would have a problem, especially if we wanted to search through that data later on.
The only option we would have for storing this many data points is a separate table for each set of readings, with one data point per row.
If instead the data points can be stored together in one "row" or document, and we are still able to search through them easily, that simplifies the data storage.
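The single-document idea can be sketched as follows. The field names (`sample`, `points`, `mz`, `intensity`) and the generated values are assumptions made up for this example, not real instrument output.

```python
# One hypothetical spectrum stored as a single document holding all its points.
spectrum = {
    "sample": "example-001",
    "points": [
        {"mz": 100.0 + 0.5 * i, "intensity": (i * 37) % 1000}
        for i in range(200_000)
    ],
}


def points_in_range(doc, low, high, min_intensity):
    """Search inside one document for points in an m/z window above a threshold."""
    return [
        p for p in doc["points"]
        if low <= p["mz"] <= high and p["intensity"] >= min_intensity
    ]


print(len(spectrum["points"]))
hits = points_in_range(spectrum, 150.0, 151.0, 500)
print([p["mz"] for p in hits])
```

Here the scan is done in plain Python; a document store would typically offer its own way to query inside such an array, but the shape of the stored data is the point of the sketch.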
Summary
Big data can be defined by the 3 V's: volume, velocity and variety. It can also be defined as data that is not atomic. More realistically, it is data that needs horizontal scaling to handle it.
P.S. Remember that horizontal scaling is where adding more nodes adds more power to the system. Normally, SQL databases require vertical scaling: making individual servers more powerful, and more expensive, as a means of improving overall system power. With horizontal scaling we can simply add more commodity servers to a system, which is cheaper.
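One simple way to picture horizontal scaling is to spread documents across nodes by hashing their keys. The sketch below assumes a plain hash-modulo scheme and invented keys such as `tweet-0`; it is an illustration of the idea, not how any particular database does it.

```python
import hashlib


def node_for(key, node_count):
    """Pick a node by hashing the key (simple modulo scheme, for illustration)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def distribute(keys, node_count):
    """Count how many documents land on each node."""
    counts = [0] * node_count
    for key in keys:
        counts[node_for(key, node_count)] += 1
    return counts


keys = [f"tweet-{i}" for i in range(10_000)]  # hypothetical document keys
for nodes in (2, 4, 8):
    counts = distribute(keys, nodes)
    print(nodes, "nodes -> busiest node holds", max(counts), "documents")
```

Adding nodes shrinks the share each one has to hold. Note that with plain modulo hashing most keys move to a different node when the node count changes, which is why real systems tend to use schemes such as consistent hashing instead.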