
Internal and External Tables I

So far, you have learnt how to set up the AWS CLI on an EC2 instance to run queries against a database, which in this case is backed by an S3 bucket. You have also learnt how to create databases in Hive.

Now, you will learn how to create tables in Hive. There are two types of tables in Hive, which are as follows:

  • Internal tables
  • External tables

Let’s listen to Vishwa and try to understand the basic difference between the two types of tables.

So, let us summarise all that you learnt from the video:

  • Creation of tables: When you create a table in Hive, it is an internal table by default. To create an external table, you need to explicitly use the keyword “EXTERNAL” while writing the query to create the table.
  • Internal table vs External table: The basic difference between internal and external tables is that when you create an internal table, Hive is itself responsible for the data, and it takes control of the entire life cycle of the metadata as well as the actual data itself. 

In contrast, when you create an external table, Hive is responsible for taking care of only the schema of the table. It does not take care of the actual data.

You will get a clear understanding of internal and external tables when you drop the tables.

When you apply the drop command on an internal table, the entire metadata available in the metastore, as well as the actual table data that is present in the HDFS, gets deleted.

However, when you apply the drop command on an external table, the entire metadata available in the metastore gets deleted but the actual table data available in the HDFS remains intact there.

You will get a clear understanding of internal and external tables when you watch the practical demonstration in the upcoming videos.

  • When to use internal/external tables: Suppose a particular table is very common and is useful for other applications as well. In such a case, it is preferable to use an external table, because when you drop the table, the data remains intact in the HDFS, and it can still be used by other applications. However, when you drop an internal table, the schema as well as the table data get deleted, and the data becomes unavailable to other applications as well.
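To make the distinction concrete, here is a hedged sketch of creating and then dropping an external table. The table name user_info_ext and the HDFS location /user/hive/external/u_user are illustrative assumptions, not values from this lesson:

```sql
-- Illustrative sketch only: the table name and HDFS path are assumptions.
create external table if not exists user_info_ext (
    id int,
    age int,
    gender string,
    profession string,
    reviews int
)
row format delimited fields terminated by '|'
lines terminated by '\n'
stored as textfile
location '/user/hive/external/u_user';  -- HDFS directory Hive points at

-- Dropping the table removes only its metadata from the metastore;
-- the files under /user/hive/external/u_user remain intact in HDFS.
drop table user_info_ext;
```

Had the same table been created without the EXTERNAL keyword (and without LOCATION), the same drop command would also delete the underlying data files.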

Now, let us come to the next part where you will learn how to create tables in Hive in a predefined database.

Important Note: Our industry expert is using /home/hadoop/u.user, but you have to use /root/movie_ratings/u.user

So, let us understand how to create an internal table step-by-step:

Database selection: You have seen in the video that there are three databases, namely, demo, demo2 and demo3, which were already created in the previous segment. An important point to note here is that when you do not select any database, the tables you create are stored in the “default” database. Now, let us choose the “demo” database to store the created tables in HDFS; for this, you need to write the query “use demo;”

Example

SQL
show databases;

Output

Plain text
OK
database_name
default 
demo
demo2
demo3
Time taken: 0.035 seconds, Fetched: 4 row(s)

Note:

“default” appears in bold above, but it will appear in normal formatting when you run the command yourself.

Now, run the below query to use the demo database.

Example

SQL
use demo;

Output

Plain text
OK
Time taken: 0.020 seconds
  • Creation of an internal table: By default, any table that you create is an internal table. The DDL statements for creating a table are almost the same as those in MySQL.

Example

SQL
create table if not exists user_info (
       id int,
       age int,
       gender string,
       profession string,
       reviews int
 )
row format delimited fields terminated by '|'
lines terminated by '\n' 
stored as textfile;

Now, describe the table that you created above.

Output

Plain text
hive> describe user_info ;
OK
col_name        data_type        comment
id                int
age               int
gender            string
profession        string
reviews           int
Time taken: 1.020 seconds, Fetched: 5 row(s)

Here, you need to understand a new point: you must specify the format of the data that is available in the HDFS, for instance, that fields are delimited by ‘|’ and lines are terminated by ‘\n’; otherwise, you will not be able to read the data into the tables.
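As a small hedged illustration (the sample record below is made up and is not taken from the actual u.user file), this is how a pipe-delimited line maps onto the table’s columns:

```sql
-- Hypothetical input line:    1|24|M|technician|85
-- fields terminated by '|':   id=1, age=24, gender='M',
--                             profession='technician', reviews=85
-- Without the declared delimiters, Hive cannot split each line into
-- these columns, so the data would not load into the table correctly.
```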

There is a command, “describe user_info;”, which is used to describe a table’s attributes and their data types.
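If you want to confirm whether a table is internal or external, Hive also provides a more detailed variant of this command; its output (not reproduced here) includes a Table Type field:

```sql
-- 'describe formatted' reports, among other details, the table type:
-- MANAGED_TABLE for an internal table, EXTERNAL_TABLE for an external one.
describe formatted user_info;
```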

  • Load data into the tables: Once you have created the schema of the tables, you need to load the data into them. To load the data, you have to specify a path from which you will load the data into these tables. In this case, the data is available at the location ‘/root/movie_ratings/u.user’. So, load the data from the ‘/root/movie_ratings/u.user’ location by executing the below command.

Example

SQL
load data local inpath '/root/movie_ratings/u.user' into table user_info;

Output

Now, check the data by executing the below query.

Example

SQL
select * from user_info limit 5;

Output

You will get the output shown below.
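As an optional sanity check (the expected row count is not stated in this lesson, so none is shown here), you can also count the rows that were loaded:

```sql
-- A non-zero count confirms that the declared delimiters matched the
-- file format and the rows were parsed into columns.
select count(*) from user_info;
```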

So, by now, you must be clear about the creation of internal tables. In the next segment, you will learn how to create external tables and also learn what happens to the table data when you apply drop commands on the tables.
