In the previous segment, you learnt how to create static partitioning. In this segment, you will be introduced to dynamic partitioning.
So, let’s watch the next video for a demonstration of dynamic partitioning where data gets allocated automatically in the partitions.
Now, let’s summarise all that you saw in the video above:
- By default, dynamic partitioning is not allowed in Hive. You need to set dynamic partitioning to true:
Example
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode= nonstrict;Output
Now, let’s summarise all that you saw in the video above:
- By default, dynamic partitioning is not allowed in Hive. You need to set dynamic partitioning to true:
Example
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;Output
- You can see that the syntax to apply partition in the table is as follows:
Example
create table if not exists dyn_part_user_info (
id int,
age int,
gender string,
ratings int)
partitioned by (profession string)
row format delimited fields terminated by '|'
lines terminated by '\n';Output
- How data has been loaded into it using the “insert” clause:
Example
insert into table dyn_part_user_info partition (profession)
select id, age, gender, ratings, profession from user_info;Output
- Once you have loaded the data into the dynamic partitioned table named “dyn_part_user_info”, the values under the profession column are automatically allocated into multiple partitions, which as you can see within “user/hive/warehouse/dyn_part_user_info” location in Hadoop.
In the next video, you will learn where to use static and dynamic partitioning.
So, you can see that it takes less time if you place the partitions manually, as in the case of static partitioning, as Hive gets an explicit command to allocate the values under the profession column into partitions. On the other hand, in the case of dynamic partitioning, Hive allocates the profession automatically and, hence, it takes some time to process the data into partitions.
In the next video, you will learn how to drop partitions.
- So, you have already created static partitions on ‘doctor’ and ‘engineer’ in the previous segment. Now, let us first see what are the different partitions of the “part_user_info” table in the HDFS.
Example
show partitions part_user_infoOutput
So, there are two partitions in the “part_user_info” table, namely, ‘doctor’ and ‘engineer’.
- Let us write a command to drop the profession ‘doctor’. Here, you have to explicitly define the partition name that you want to drop, e.g., ‘doctor’.
Example
alter table part_user_infor drop partition (profession='doctor');Output
Dropped the partition profession=doctor
OK
Time Taken: 0.249 secondsNow again check the partitions now available
Example
show partitions part_user_info;Output
Now, there remains only one partition, that is, ‘engineer’.
So, you have learned how to create the two types of partitions and how to drop them. Now, let us compare the query time when you are using a non-partitioned table with the query time when you are using a partitioned one.
Here, Vishwa has taken an example whose problem statement is as follows:
“What are the total ratings given by the male and female users for the ‘engineer’ profession”? You have two tables, of which one is non-partitioned, whereas the other one is partitioned:
- user_info: This is a non-partitioned table. If you want to find out the total rating of both the genders separately for the ‘engineering’ profession, then you can run the following query:
Example
select gender, sum(ratings) from user_info where profession= ‘engineer’ group by gender ;Output
The time for the query works out to be 7.157 seconds.
- dyn_part_user_info: This table is partitioned dynamically on the ‘profession’ column. If you want to find out the total rating of both the genders separately for the ‘engineering’ profession, then you can run the following query:
Example
select gender, sum(ratings) from dyn_part_user_info where profession= ‘engineer’ group by gender ;Output
The time for the query works out to be 6.911 seconds.
Thus, you can see that the query running time for the partitioned table was found to be lesser than that of the non-partitioned table.
In the next segment, you will learn about bucketing.