IKH

Dynamic Partitioning and Dropping the Partitions

In the previous segment, you learnt how to create static partitioning. In this segment, you will be introduced to dynamic partitioning.

So, let’s watch the next video for a demonstration of dynamic partitioning where data gets allocated automatically in the partitions.

Now, let’s summarise all that you saw in the video above:

  • By default, dynamic partitioning is not allowed in Hive. You need to set dynamic partitioning to true:

Example

Python
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode= nonstrict;

Output

Now, let’s summarise all that you saw in the video above:

  • By default, dynamic partitioning is not allowed in Hive. You need to set dynamic partitioning to true:

Example

Python
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

Output

  • You can see that the syntax to apply partition in the table is as follows:

Example

Python
create table if not exists dyn_part_user_info (
    id int,
    age int,
    gender string,
    ratings int)
partitioned by (profession string) 
row format delimited fields  terminated by '|'
lines terminated by '\n';

Output

  • How data has been loaded into it using the “insert” clause:

Example

Python
insert into table dyn_part_user_info partition (profession)
select id, age, gender, ratings, profession from user_info;

Output

  • Once you have loaded the data into the dynamic partitioned table named “dyn_part_user_info”, the values under the profession column are automatically allocated into multiple partitions, which as you can see within  “user/hive/warehouse/dyn_part_user_info” location in Hadoop.

In the next video, you will learn where to use static and dynamic partitioning.

So, you can see that it takes less time if you place the partitions manually, as in the case of static partitioning, as Hive gets an explicit command to allocate the values under the profession column into partitions. On the other hand, in the case of dynamic partitioning, Hive allocates the profession automatically and, hence, it takes some time to process the data into partitions.

In the next video, you will learn how to drop partitions.

  • So, you have already created static partitions on ‘doctor’ and ‘engineer’ in the previous segment. Now, let us first see what are the different partitions of the “part_user_info” table in the HDFS.

Example

Python
show partitions part_user_info

Output

So, there are two partitions in the “part_user_info” table, namely, ‘doctor’ and ‘engineer’.

  • Let us write a command to drop the profession ‘doctor’. Here, you have to explicitly define the partition name that you want to drop, e.g., ‘doctor’.

Example

Python
alter table part_user_infor drop partition (profession='doctor');

Output

PowerShell
Dropped the partition profession=doctor
OK
Time Taken: 0.249 seconds

Now again check the partitions now available

Example

Python
show partitions part_user_info;

Output

Now, there remains only one partition, that is, ‘engineer’.

So, you have learned how to create the two types of partitions and how to drop them. Now, let us compare the query time when you are using a non-partitioned table with the query time when you are using a partitioned one.

Here, Vishwa has taken an example whose problem statement is as follows:

“What are the total ratings given by the male and female users for the ‘engineer’ profession”? You have two tables, of which one is non-partitioned, whereas the other one is partitioned:

  • user_info: This is a non-partitioned table. If you want to find out the total rating of both the genders separately for the ‘engineering’ profession, then you can run the following query:

Example

Python
select gender, sum(ratings) from user_info where profession= ‘engineer’ group by gender ;

Output

The time for the query works out to be 7.157 seconds.

  • dyn_part_user_info: This table is partitioned dynamically on the ‘profession’ column.  If you want to find out the total rating of both the genders separately for the ‘engineering’ profession, then you can run the following query:

Example

Python
select gender, sum(ratings) from dyn_part_user_info where profession= ‘engineer’ group by gender ;

Output

The time for the query works out to be 6.911 seconds.

Thus, you can see that the query running time for the partitioned table was found to be lesser than that of the non-partitioned table.

In the next segment, you will learn about bucketing.

Report an error