So, you learnt about the features of Hive and understood that the main purpose of Hive is to query the data stored in the Hadoop Distributed File System (HDFS) for analytical purposes. Now, let’s move on to the architecture of Hive and understand how it functions and how it is built over the Hadoop ecosystem.
Let’s hear from Vishwa and learn about the architecture of Hive in detail.
So, the Hive ecosystem is divided into the following three parts:
- Hive Client: This acts as the face of the Hive system, through which users interact with Hive and query the data in HDFS.
- Hive Core: This is the core part of the Hive architecture, which links the Hive client and the Hadoop cluster.
- Hadoop Cluster: This is the file storage system where users can store files in either a structured or an unstructured format and query the data in HDFS using MapReduce.
Let’s first learn about the Hadoop Cluster, which is the bottom-most block of the Hive architecture. It consists of the following two parts:
- HDFS: The Hadoop Distributed File System (HDFS) allows you to store large volumes of data across multiple nodes in a Hadoop cluster. It is used to perform distributed computing on structured or unstructured data.
- MapReduce: This is the programming framework that Hadoop uses for processing data. After the data is stored in HDFS, MapReduce is used to query it and perform the data analysis. Note that even when you run a Hive job, a MapReduce job is triggered in the background to query the data in HDFS.
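As a quick illustration, here is a minimal HQL sketch; the `orders` table and its columns are hypothetical. Even a simple aggregation like this is compiled by Hive into a MapReduce job that scans the table’s files in HDFS.

```sql
-- Hypothetical table: orders(order_id INT, country STRING, amount DOUBLE)
-- Hive compiles this aggregation into a MapReduce job behind the scenes:
-- the map phase reads the files of `orders` from HDFS, and the reduce
-- phase aggregates the counts per country.
SELECT country, COUNT(*) AS order_count
FROM orders
GROUP BY country;
```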
Now, let’s understand the Hive Client and Hive Core.
Now, as you know, Hive is just a software tool that works on top of HDFS; it is a tool through which you can query the data in HDFS using an SQL-like language. Under the hood, all Hive jobs are converted into MapReduce jobs.
Let’s understand the first block of the Hive architecture, i.e., Hive Client.
- Command Line Interface (CLI): This is an interface between Hive and its users. You write queries in the Hive Query Language (HQL) in order to derive results from the data available in HDFS. It acts as the command-line tool for the Hive server.
- Web Interface: This is the graphical user interface of Hive and an alternative to the Hive CLI; it provides the same ability to query the data in HDFS through the Hive tool.
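For instance, a typical CLI session might look like the sketch below; the database and table names are hypothetical, and the same statements can be submitted through the web interface as well.

```sql
-- Started from the shell with the `hive` command, the CLI gives an
-- interactive prompt where HQL statements are typed directly:
SHOW DATABASES;                   -- list the databases known to the Metastore
USE ecommerce;                    -- hypothetical database
SHOW TABLES;                      -- list the tables in the current database
SELECT * FROM customers LIMIT 5;  -- hypothetical table; peek at a few rows
```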
Now, let’s move on to the most important part of the Hive architecture, which is Hive Core.
Hive Core consists of the following three components:
- Hive QL Process Engine: When you write a query in the Hive Query Language through the CLI or the web interface, it goes into the Hive QL Process Engine, which checks the syntax of the query and analyses it. HQL is similar to SQL in the way queries are written, and it uses the schema information available in the Metastore. Instead of writing a MapReduce program in Java, you can write the query for a MapReduce job in an SQL-like language, i.e., HQL, and have Hive process it (see the first sketch after this list).
- Hive Metastore: The Metastore can be considered the heart of the Hive ecosystem. Let’s try to understand this with the help of an example. Suppose an e-commerce company has to manage its employees, customers, orders, etc. The data is available in HDFS, and you want to perform analytics on it.
You already have an understanding of DDL and DML statements. In Hive, when you create the tables for entities such as employees, customers and orders, all the schema information, including the names of the entities, the data types and the relations among the tables, is stored in the Hive Metastore. Whenever a query needs this schema information, the Hive QL Process Engine refers to the Metastore and retrieves it. This helps in tasks such as interpreting flat files and performing joins, where the details of the tables are needed.
This information about the tables and the database is called metadata. Metadata is nothing but data about some given data.
It is important to note that Hive is not a database; the Metastore acts as a storage space for Hive, where the schema information of all the tables that have been created is stored. The Metastore does not store the actual data; it stores only the metadata of the tables that have been created (see the second sketch after this list). It contains its own embedded RDBMS, Apache Derby, to store these relational tables.
- Execution Engine: The Execution Engine converts Hive queries into MapReduce jobs. It processes the query and generates the results, which are the same as those the corresponding MapReduce job would produce.
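To make these components concrete, here are two small HQL sketches. The first shows the kind of job the Hive QL Process Engine lets you express in HQL instead of a hand-written Java MapReduce program: the classic word count, written against a hypothetical table `docs` with a single STRING column `line`.

```sql
-- Word count in HQL instead of a Java MapReduce program.
-- `docs(line STRING)` is a hypothetical table of text lines.
SELECT word, COUNT(*) AS freq
FROM docs
LATERAL VIEW explode(split(line, ' ')) t AS word
GROUP BY word;
```

The second shows what the Metastore actually records. The CREATE TABLE statement below stores only schema information (column names, data types, file format) in the Metastore, while the rows themselves live as files in HDFS; DESCRIBE reads that metadata back. The table and columns are again hypothetical.

```sql
-- Only the schema below goes into the Metastore; the data stays in HDFS.
CREATE TABLE customers (
  customer_id INT,
  name        STRING,
  city        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

DESCRIBE customers;  -- returns the column metadata held by the Metastore
```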
In the next video, you will see how a particular query flows throughout the Hive architecture.
So, in the video, you learnt how a particular query is executed and understood how you can query the data in HDFS using the Hive Query Language.
Let’s understand this process step by step:
Step 1: When you create a table with a DDL statement in HQL, its schema gets stored in the Hive Metastore.
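For example, a minimal sketch of such a DDL statement, using a hypothetical schema for the ‘user_info’ table mentioned later in this segment:

```sql
-- This DDL statement stores only the schema of `user_info`
-- in the Hive Metastore; no data is written to HDFS yet.
CREATE TABLE user_info (
  user_id INT,
  name    STRING,
  email   STRING
);
```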
Step 2: Once Hive Core receives the HQL command from the CLI, it passes the command to the Hive QL Process Engine. The engine requests the Metastore to verify the information you have sent in the form of a query, checks the syntax of the query and then sends the query to the Execution Engine (EE).
Step 3: The EE sends a request to the Metastore again to receive additional information, such as partitioning and bucketing details, then converts the Hive query into a MapReduce job and triggers it.
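You can peek at the plan before any job is triggered by prefixing a query with EXPLAIN; Hive prints the stage plan, including the map and reduce operator trees, without running the job. A sketch, reusing the hypothetical `user_info` table:

```sql
-- EXPLAIN shows the execution plan (stage dependencies and the
-- map/reduce operator trees) without actually running the query.
EXPLAIN
SELECT name, COUNT(*) AS cnt
FROM user_info
GROUP BY name;
```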
Step 4: The MapReduce job interacts with HDFS and provides the result of the query. Note that the actual data of a table such as ‘user_info’ is never carried into the Metastore; it is held by HDFS itself. Only the schema of the table is carried by the Metastore.
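Putting the steps together: running the query below (the filter is hypothetical) triggers a MapReduce job that reads the `user_info` files from HDFS, while the Metastore contributes only the schema needed to interpret those files.

```sql
-- The schema comes from the Metastore; the rows come from HDFS.
SELECT email
FROM user_info
WHERE user_id = 42;
```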
In the next segment, you will learn about the differences between RDBMS and Hive.