In this segment, you will learn about Hive data models. Hive stores and queries data using its data models. The purpose of using data models is to make querying convenient and fast.
The Hive data model has four main components, which resemble the way an RDBMS organises data:
- Databases
- Tables
- Partitions
- Buckets
You will learn about each of these components in the upcoming lectures.
Note:
Partitions and Buckets will be covered in detail in the next session.
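As a rough preview of how the four components fit together, the HiveQL sketch below creates a database and then a table that is both partitioned and bucketed. The names `retail_demo` and `orders`, the columns, and the bucket count of 4 are hypothetical and chosen only for illustration.

```sql
-- Database: a namespace that groups related tables.
CREATE DATABASE IF NOT EXISTS retail_demo;

-- Table: rows with a declared schema, as in an RDBMS.
-- Partition: each distinct order_date value gets its own subdirectory.
-- Bucket: rows within a partition are hashed on customer_id into 4 files.
CREATE TABLE IF NOT EXISTS retail_demo.orders (
  order_id    INT,
  customer_id INT,
  amount      DOUBLE
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (customer_id) INTO 4 BUCKETS
STORED AS ORC;
```

Note that the partition column `order_date` is declared in `PARTITIONED BY` rather than in the main column list.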
Let’s summarise what you have learnt in this segment.
Hive has two types of tables:
- Managed (or internal) table
- External table
Note:
‘Managed table’ and ‘internal table’ are synonymous terms.
Let’s look at the purpose of each table type and the differences between them.
You should use external tables when:
- You want to use the data outside Hive as well, for example, when another program running on the same cluster works with the same files.
- You want the data to remain on HDFS even after the table is dropped. Dropping an external table removes only its metadata; by default, Hive does not delete the underlying data files.
- You do not want Hive to control how your data is stored (for example, its location and directory layout).
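The points above can be sketched in HiveQL as follows. This is a minimal example, assuming comma-delimited text files already exist under a directory on HDFS; the table name `sales_ext`, its columns, and the path `/data/shared/sales` are hypothetical.

```sql
-- EXTERNAL plus LOCATION tells Hive to read the files where they already are.
-- Assumption: the (hypothetical) path below holds comma-delimited text files.
CREATE EXTERNAL TABLE IF NOT EXISTS sales_ext (
  sale_id INT,
  product STRING,
  amount  DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/shared/sales';

-- By default this removes only the table definition from the metastore;
-- the files under /data/shared/sales are left in place.
DROP TABLE sales_ext;
```

Because the files survive the `DROP TABLE`, other programs can keep using them, and the table can be recreated later over the same location.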
On the other hand, you use managed tables when:
- The data is temporary. When a managed table is dropped, Hive deletes the table's data along with its metadata.
- You want Hive to manage the life cycle of the data completely, i.e. both store and process it.
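For comparison, here is a minimal sketch of a managed table. The table name `staging_tmp` and its columns are hypothetical. With no `EXTERNAL` keyword and no `LOCATION` clause, Hive places the data under its own warehouse directory (set by `hive.metastore.warehouse.dir`).

```sql
-- No EXTERNAL keyword: Hive owns both the metadata and the data files.
CREATE TABLE IF NOT EXISTS staging_tmp (
  sale_id INT,
  product STRING,
  amount  DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Dropping a managed table removes the metadata and the data together.
DROP TABLE staging_tmp;
```

Depending on the cluster configuration, the dropped data may first be moved to the HDFS trash rather than removed immediately.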
An important thing to note is that the Hive metastore holds the metadata for both internal and external tables whose data is stored in HDFS: the names of the tables, their columns and data types, their creation dates, and so on.
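One way to see this metadata is to ask Hive to describe a table. The sketch below is illustrative only; the table `meta_demo` is hypothetical and is created just so that there is something to inspect.

```sql
-- Hypothetical table, created only for inspection.
CREATE TABLE IF NOT EXISTS meta_demo (id INT, name STRING);

-- List the tables in the current database.
SHOW TABLES;

-- Show column names and data types.
DESCRIBE meta_demo;

-- Show detailed metadata, including the table type
-- (MANAGED_TABLE or EXTERNAL_TABLE), creation time, and storage location.
DESCRIBE FORMATTED meta_demo;
```

The same commands work for external tables; only the reported table type and location differ.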
In the next segment, you will practise writing Hive queries and create managed and external tables.