Tools in the Hadoop Ecosystem

Pig

A high-level data-flow language built on top of MapReduce.

Has two components:

  • Pig Latin, the language; and
  • A compiler, which translates Pig Latin into MapReduce (MR) jobs.
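To make the compiler's role more concrete, the sketch below mimics in plain Python how a group-and-count data flow breaks down into map, shuffle, and reduce phases. It is only an illustration: the function names (map_phase, shuffle, reduce_phase) and the sample records are made up here, and the real Pig compiler emits Hadoop jobs, not Python.

```python
from collections import defaultdict

# Hypothetical input: (user, page) records, standing in for a file in HDFS.
records = [("alice", "/home"), ("bob", "/home"), ("alice", "/about")]

def map_phase(rows):
    """Emit one (key, 1) pair per record, keyed by user."""
    for user, _page in rows:
        yield user, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle(map_phase(records)))
print(counts)
```

In Pig Latin the same intent would be a few lines built around GROUP and COUNT, with the phases above left to the compiler.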

The Pig documentation is worth reading for more detail.

Hive

A high-level query language built on top of MapReduce. It lets you define a schema over files stored in HDFS, which can then be queried using an SQL-like language (HiveQL).

Hive is often used in conjunction with Pig scripts.

Hive is not:

  • A relational database;
  • A transactional database; or
  • A real-time query database.

A database is used in the back end to store the schema metadata (the metastore), while all user data stays in the actual HDFS files.
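The split between metadata and data can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Hive is implemented: the metastore dictionary, the page_views table, its columns, and the read_table helper are all hypothetical names chosen for illustration.

```python
import csv
import io

# Hypothetical metadata, standing in for what the back-end database stores.
metastore = {
    "page_views": {
        "columns": [("user", str), ("page", str), ("duration_s", int)],
        "delimiter": "\t",
    }
}

# Hypothetical raw file contents, standing in for a file in HDFS.
raw_file = "alice\t/home\t12\nbob\t/about\t7\n"

def read_table(name, data):
    """Apply the stored schema to raw text at read time."""
    meta = metastore[name]
    reader = csv.reader(io.StringIO(data), delimiter=meta["delimiter"])
    for fields in reader:
        yield {col: cast(value)
               for (col, cast), value in zip(meta["columns"], fields)}

rows = list(read_table("page_views", raw_file))

# Roughly the intent of: SELECT user FROM page_views WHERE duration_s > 10
long_views = [row["user"] for row in rows if row["duration_s"] > 10]
print(long_views)
```

The raw file never changes; only the metadata says how to interpret it, which is why the data can stay in plain HDFS files.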

HCatalog

HCatalog is a repository for schema definitions, which can then be re-used by other parts of the ecosystem, like Pig, Hive, or MapReduce code.

This provides consistency: every tool reads the same schema definition instead of maintaining its own copy.
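As a rough illustration of why a shared definition helps, the Python sketch below has two consumers look up the same schema entry. Everything in it is hypothetical (the SCHEMAS registry, the page_views table, and the two functions); it models the idea only and does not use any HCatalog API.

```python
# Hypothetical shared registry, standing in for HCatalog.
SCHEMAS = {"page_views": ["user", "page", "duration_s"]}

def pig_like_job(line):
    """One consumer: turns a raw line into a tuple of named fields."""
    names = SCHEMAS["page_views"]
    return tuple(zip(names, line.split("\t")))

def hive_like_query(line, column):
    """Another consumer: selects a single column by name."""
    names = SCHEMAS["page_views"]
    return line.split("\t")[names.index(column)]

line = "alice\t/home\t12"
print(pig_like_job(line))
print(hive_like_query(line, "page"))
```

Because both functions read the column names from one place, renaming or reordering a column in the registry changes both consumers at once.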

Its syntax is very similar to Hive's, since it can also be used to create schemas in an SQL-like language.