Big Data Strategies

How can we handle Big Data, its volume and its velocity?

Big Data has some basic properties:

  • It's raw (you can answer more questions from raw data);
  • It's not cleaned (don't want to lose any data, even if we think it is not useful); and
  • It (should be) immutable: after all, data corruption is the cause of most problems in a database.

Timestamps and Small Tables

Instead of this:

id  name    gender  status              town
1   Jose    M       Pending Enroll      Dundee
2   Yago    M       Enrolled            Dundee
3   Stuart  M       Not Enrolled        Broughty Ferry
4   Helen   F       Pending Acceptance  London

We would have this:

id  name    timestamp
1   Jose    1449180525
2   Yago    1448185325
3   Stuart  1431297337
4   Helen   1429171731

id  gender  timestamp
1   M       1449180525
2   M       1448185325
3   M       1431297337
4   F       1429171731

id  status              timestamp
1   Pending Enroll      1449180525
2   Enrolled            1448185325
3   Not Enrolled        1431297337
4   Pending Acceptance  1429171731

id  town            timestamp
1   Dundee          1449180525
2   Dundee          1448185325
3   Broughty Ferry  1431297337
4   London          1429171731

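So now changing any of the values for any of the IDs is substantially easier: each change touches one small table instead of a wide row.

If we want to add more data to one of the IDs, we can add a row (or a new small table) without touching the others.
If we want to remove a field from one of the people, we can just remove that person's row from the relevant table, e.g. "town".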
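
As a rough illustration of the small-tables idea, here is a minimal Python sketch. It assumes the four tables are held in memory as lists of (id, value, timestamp) rows. The names `facts`, `record_fact` and `current_record`, and the new timestamp 1449190000, are made up for this example and are not part of any particular database.

```python
# Hypothetical in-memory stand-in for the four small tables above.
# Each row is (id, value, timestamp).
facts = {
    "name": [(1, "Jose", 1449180525), (2, "Yago", 1448185325),
             (3, "Stuart", 1431297337), (4, "Helen", 1429171731)],
    "gender": [(1, "M", 1449180525), (2, "M", 1448185325),
               (3, "M", 1431297337), (4, "F", 1429171731)],
    "status": [(1, "Pending Enroll", 1449180525), (2, "Enrolled", 1448185325),
               (3, "Not Enrolled", 1431297337), (4, "Pending Acceptance", 1429171731)],
    "town": [(1, "Dundee", 1449180525), (2, "Dundee", 1448185325),
             (3, "Broughty Ferry", 1431297337), (4, "London", 1429171731)],
}


def record_fact(attribute, person_id, value, timestamp):
    """Append a new row to one small table; existing rows are left alone."""
    facts.setdefault(attribute, []).append((person_id, value, timestamp))


def current_record(person_id, as_of=None):
    """Rebuild one person's wide row from the latest fact in each table.

    If as_of is given, facts recorded after that timestamp are ignored.
    """
    record = {}
    for attribute, rows in facts.items():
        matching = [
            (ts, value)
            for pid, value, ts in rows
            if pid == person_id and (as_of is None or ts <= as_of)
        ]
        if matching:
            # The row with the largest timestamp is the most recent fact.
            record[attribute] = max(matching, key=lambda row: row[0])[1]
    return record


# Jose enrols: only the "status" table gets a new row (made-up timestamp).
record_fact("status", 1, "Enrolled", 1449190000)
print(current_record(1))
print(current_record(1, as_of=1449180525))
```

The first call reports Jose's status as "Enrolled"; the second, limited to the original timestamp, still reports "Pending Enroll", because the older row was never overwritten.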

Add and Never Delete

Your data sets will get bigger, but never delete a fact, even if it is outdated. By keeping a history, you can reconstruct the state of affairs at any point in time.

Instead of this:

Person  Friend Count
Jose    25
Yago    8794

Do this:

Person  Friend Action  Timestamp
Jose    +1             1449180525
Jose    +1             1449180520
Jose    +1             1449180515
Jose    +1             1449180423
Jose    +1             1449151335
Jose    -1             1449131751
Jose    +1             1449051881
Jose    +1             1447481823
Jose    -1             1443945446

That way, it is much easier to check how many friends Jose had at timestamp 1449051881: add up every action recorded at or before that time. You could go further and save which friend was added at what time.
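
A minimal Python sketch of that lookup follows, assuming the event log is a list of (person, action, timestamp) rows. The names `friend_events` and `net_friend_change` are illustrative only. The nine rows shown are presumably an excerpt of Jose's history, so the function returns the net change over the listed events. It equals the absolute friend count only if the log goes back to the very first event.

```python
# The friend-action rows from the table above: (person, action, timestamp).
friend_events = [
    ("Jose", +1, 1449180525),
    ("Jose", +1, 1449180520),
    ("Jose", +1, 1449180515),
    ("Jose", +1, 1449180423),
    ("Jose", +1, 1449151335),
    ("Jose", -1, 1449131751),
    ("Jose", +1, 1449051881),
    ("Jose", +1, 1447481823),
    ("Jose", -1, 1443945446),
]


def net_friend_change(person, at_timestamp):
    """Sum every +1/-1 action for `person` recorded at or before `at_timestamp`."""
    return sum(
        action
        for who, action, ts in friend_events
        if who == person and ts <= at_timestamp
    )


print(net_friend_change("Jose", 1449051881))  # three events qualify: +1, +1, -1
print(net_friend_change("Jose", 1449180525))  # all nine listed events
```

Because no row is ever updated or deleted, the same function answers the question for any timestamp, which a single stored "Friend Count" value cannot do.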