Tools in the Hadoop Ecosystem
Pig
High-level data-flow language built on top of MapReduce.
Has two components:
- Pig Latin, the language; and
- a compiler, which translates Pig Latin scripts into MapReduce (MR) jobs.
The official Pig documentation is worth reading for more detail.
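To make the compiler's role more concrete, here is a minimal Python sketch of the map, shuffle, and reduce steps that a simple group-and-count script might roughly correspond to. It is not how Pig is implemented; the input records and all function names are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical input: one tab-separated (user, url) record per line.
LINES = ["alice\t/home", "bob\t/home", "alice\t/about"]

# The Pig Latin being imitated is roughly:
#   grouped = GROUP visits BY user;
#   counts  = FOREACH grouped GENERATE group, COUNT(visits);


def map_phase(line):
    """Emit (key, 1) for the grouping field, as a mapper would."""
    user, _url = line.split("\t")
    yield user, 1


def shuffle(pairs):
    """Group values by key, standing in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Aggregate the values for one key, as a reducer would."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on in-memory data."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


visits_per_user = run_job(LINES)
print(visits_per_user)
```

The point of the sketch is the division of labour: two lines of script describe what is wanted, while the generated job has to spell out the map, shuffle, and reduce steps.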
Hive
High-level query language built on top of MapReduce, which lets you define a schema over files stored in HDFS. The data can then be queried using a SQL-like language.
Hive is often used in conjunction with Pig scripts.
Hive is not:
- A relational database;
- A transactional database; or
- A real-time query database.
A back-end database (the metastore) stores the schema metadata, while all user data stays in the actual HDFS files.
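The separation between schema metadata and user data can be illustrated with a small Python sketch. It assumes a dictionary standing in for the metadata database and a list of strings standing in for a delimited HDFS file; the table name, columns, and helper function are all hypothetical, and this is not Hive's actual mechanism.

```python
# Hypothetical stand-in for the metadata database: table name -> (column, type).
METASTORE = {"visits": [("user", str), ("url", str), ("duration_s", int)]}

# Hypothetical stand-in for a delimited file in HDFS: raw text, no schema inside.
RAW_FILE = ["alice,/home,12", "bob,/home,7", "alice,/about,30"]


def read_table(name, raw_lines):
    """Apply the stored schema to the raw lines at read time."""
    schema = METASTORE[name]
    for line in raw_lines:
        fields = line.split(",")
        yield {col: cast(value) for (col, cast), value in zip(schema, fields)}


# Roughly: SELECT user, duration_s FROM visits WHERE duration_s > 10
long_visits = [
    (row["user"], row["duration_s"])
    for row in read_table("visits", RAW_FILE)
    if row["duration_s"] > 10
]
print(long_visits)
```

Note that the raw lines never change: dropping or redefining the schema entry alters how the file is interpreted, not the file itself.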
HCatalog
HCatalog is a repository for schema definitions, which can then be re-used by other parts of the ecosystem, like Pig, Hive, or MapReduce code.
This keeps schema definitions consistent across tools, since each one reads the same definition instead of declaring its own.
Its syntax is very similar to Hive's, as schemas are likewise created in a SQL-like language.
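The idea of a shared schema repository can be sketched in Python as a single registry that two different consumers read from. Everything here (the registry, the function names, the table) is a hypothetical illustration and not the HCatalog API.

```python
# Hypothetical shared registry: one schema definition per table name.
REGISTRY = {}


def register_table(name, columns):
    """Store a schema once so every consumer reads the same definition."""
    REGISTRY[name] = list(columns)


def get_schema(name):
    """Look up the column names registered for a table."""
    return REGISTRY[name]


def parse_for_script(name, line):
    """A script-style consumer: returns a tuple in schema order."""
    return tuple(line.split("\t")[: len(get_schema(name))])


def parse_for_query(name, line):
    """A query-style consumer: returns a dict keyed by column name."""
    return dict(zip(get_schema(name), line.split("\t")))


register_table("visits", ["user", "url"])
record = "alice\t/home"
as_tuple = parse_for_script("visits", record)
as_dict = parse_for_query("visits", record)
print(as_tuple, as_dict)
```

Because both consumers call `get_schema`, a change to the registered columns reaches them together, which is the consistency benefit described above.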