Tuesday, December 22, 2015

Hadoop Hive

Hive is a data warehouse framework developed on top of Hadoop for querying and analyzing data stored in HDFS. It is open-source software that lets programmers analyze large data sets on Hadoop, and it simplifies operations such as data encapsulation, ad-hoc queries, and analysis of huge datasets.
Hive's design reflects its targeted use as a system for managing and querying structured data. For structured data, MapReduce by itself lacks optimization and usability features, which the Hive framework supplies. Hive's SQL-inspired language shields the user from the complexity of MapReduce programming, and it reuses familiar concepts from the relational database world, such as tables, rows, columns, and schemas, to ease learning.
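As a sketch of that relational feel, the following HQL defines a table and runs a familiar SQL-style aggregate over it (the table and column names here are illustrative, not from a real schema):

```sql
-- Hypothetical table of page-view records, stored as tab-delimited text.
CREATE TABLE page_views (
  view_time TIMESTAMP,
  user_id   BIGINT,
  page_url  STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

-- A familiar-looking query; Hive compiles it into MapReduce work behind the scenes.
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url;
```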
Hadoop programming works on flat files. Hive can use directory structures to "partition" data and improve performance on certain queries. To support these enhanced features, Hive adds an important new component, the metastore, which stores schema information and typically resides in a relational database.
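For illustration, a partitioned table might be declared as below. Hive lays the table out as one HDFS subdirectory per partition value, so a query that filters on the partition column can skip unrelated data entirely (the table and column names are hypothetical):

```sql
-- Partitioning by date creates one subdirectory per dt value
-- under the table's HDFS location.
CREATE TABLE logs (
  event_time STRING,
  message    STRING
)
PARTITIONED BY (dt STRING);

-- This query reads only the dt='2015-12-22' directory, not the whole table.
SELECT message
FROM logs
WHERE dt = '2015-12-22';
```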
We can interact with Hive through several interfaces, including a web GUI and a Java Database Connectivity (JDBC) interface, but most interactions take place over the command-line interface (CLI). The CLI accepts queries written in Hive Query Language (HQL), whose syntax is generally similar to the SQL that most data analysts are already familiar with.
Hive supports four file formats: TEXTFILE, SEQUENCEFILE, RCFILE (Record Columnar File), and ORC (Optimized Row Columnar). In a single-user scenario, Hive uses an embedded Derby database for metadata storage; in a multi-user scenario, Hive typically uses MySQL to store shared metadata.
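The storage format is chosen per table with a `STORED AS` clause, for example (the schemas are illustrative):

```sql
-- Plain text, the default in many installations.
CREATE TABLE raw_events (line STRING) STORED AS TEXTFILE;

-- Columnar formats such as RCFILE and ORC compress better and
-- let queries read only the columns they actually reference.
CREATE TABLE events_orc (
  event_time STRING,
  user_id    BIGINT
)
STORED AS ORC;
```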
The major difference between HQL and SQL is that a Hive query executes on the Hadoop infrastructure rather than on a traditional database. Because Hadoop provides distributed storage, a submitted Hive query is applied across huge data sets, ones so large that high-end, expensive, traditional databases would fail to perform the same operations.
Hive query execution becomes a series of automatically generated MapReduce jobs. Hive supports partitions and buckets for efficient retrieval of data when a client executes a query, and it supports custom UDFs (User-Defined Functions) for data cleansing and filtering. Programmers can define Hive UDFs according to their project requirements.
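As a sketch of how this fits together (the jar path, function name, Java class, and table are all hypothetical), a UDF packaged in a jar is registered and then called like any built-in function, and `EXPLAIN` shows the stages Hive generates for a query:

```sql
-- Register a hypothetical cleansing UDF packaged in a jar.
ADD JAR /tmp/my-udfs.jar;
CREATE TEMPORARY FUNCTION clean_url AS 'com.example.hive.CleanUrlUDF';

-- Use it like a built-in function.
SELECT clean_url(page_url) FROM page_views;

-- EXPLAIN prints the query plan, including the generated MapReduce stages.
EXPLAIN SELECT page_url, COUNT(*) FROM page_views GROUP BY page_url;
```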
