Monday, December 28, 2015

Hive Data Modeling

In Hive data modeling - Tables, Partitions and Buckets come in to picture.
Coming to Tables, it’s just like the way that we create a table in Traditional relational databases. The functionalities such as filtering, joins can be performed on the tables. Hive deals with two types of table structures - Internal and External, depends on the design of schema and how the data is getting loaded in to Hive.
Internal Table is tightly coupled with nature. At first, we have to create tables and load the data. We can call this one as data on the schema. By dropping this table, both data and schema will be removed. The stored location of this table will be at /user/hive/warehouse.
External Table is loosely coupled with nature. Data will be available in HDFS; the table is going to get created with HDFS data. We can say that its creating schema of data. At the time of dropping the table, it dropped only schema, data will be available in HDFS as before. External tables provide an option to create multiple schemas for the data stored in HDFS instead of deleting the data every time whenever schema updates.
Partitions
Partitions come into place when table is having one or more Partition keys which is the basis for determining how the data is stored. For Example: - “Client has Some E–commerce data which belong to India operations in which each state (29 states) operations mentioned in as a whole. If we take the state as partition key and perform partitions on that India data as a whole, we will be able to get a Number of partitions (29 partitions) which is equal to the number of states (29) present in India. Each state data can be viewed separately in the partition tables.”

Buckets
Buckets are used for efficient querying. The data, i.e. present in that partition can be divided further into buckets. The division is performed based on hash of a particular column that we had selected in the table.
Conclusion


Hive provides data warehousing platform which deals with large amounts of data (peta bytes in volume). Querying and fetching the required results from large data sets in seconds of time is important. In Cloudera enterprise edition, it comes up with Impala for Increasing query time performance.

No comments: