In the previous segment, you learnt about ‘Partitioning’, which is one of the query optimisation techniques. In this segment, you will learn about another query optimisation technique, ‘Bucketing’.
First, let us watch the next video and try to understand bucketing theoretically.
So, let’s perform the queries to understand bucketing.
Now, let’s understand the queries to create buckets:
- One point that you need to keep in mind is that bucketing is separate from partitioning. But in general practice, you first create partitions and then buckets.
Example
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.enforce.bucketing=true;Note
To enable bucketing, you need to enable partitioning also.
- In the video, you can see that the partitions have been created based on gender. Then for each gender, seven buckets have been created based on ‘age’ in the “buck_user_info” table.
Example
create table if not exists buck_user_info (
id int,
age int,
profession string,
ratings int )
partitioned by (gender string)
clustered by (age)
into 7 buckets
row format delimited fields terminated by '|'
lines terminated by "\n"
stored as textfile;Output
- Once you have created the tables, you need to load the data into them. Following is the query to copy the data from the already created “user_info” table to the “buck_user_info” table:
Example
insert into table buck_user_info partition(gender);
select id,age, profession, ratings, gender from user_info;Output
Now, let us try to understand the memory locations in Hadoop where these buckets are created:
Let us first see the partitions based on gender:
Example
Output
So, you can see that two partitions, namely, ‘M’ and ‘F’, have been created based on gender.
Now, let’s take a look at the memory location of the buckets for the female partitions:
Here, you can see that in the female partition folder, seven files have been created, which correspond to the bucketing on the basis of age.
One of the very basic differences between bucketing and partitioning is that when you create partitions, new folders are created for each partition, but in the case of bucketing, multiple files (buckets) are created inside a single folder.
In the next video, you will see a comparison of the query time when using a non-bucketed table with the query time using a bucketed table.
Let us understand the problem statement.
“What are the total ratings given by males and females aged less than 35 years who belong to the artist profession?”
You have two tables, of which one is a non-partitioned/bucketed table, whereas the other one is bucketed:
- user_info: This is a non-bucketed table. If you want to find out the total ratings of both the genders separately for age less than 35 years for the artist profession, then you can run the following query:
Example
select gender, sum(ratings) as total_ratings from user_info where profession= ‘artist’ and age < 35 group by gender ;Output
The time of the query works out to be 19.611 seconds.
- buck_user_info: This is a bucket table based on age and gender. If you want to find out the total rating of both the genders separately for the artist profession with age less than 35 years, then you can run the following query:
Example
select gender, sum(ratings) as total_ratings from buck_user_info where profession= ‘artist’ and age < 35 group by gender ;Output
The time of the query works out to be 8.028 seconds.
Hence, using bucketing, you can reduce the query running time and optimise your queries.
Now, an important point that you need to understand is when to use partitioning and when to use bucketing. When the cardinality of a particular column is high, you should use bucketing on that column. Partitioning is preferable when the cardinality of a column is low.
Report an error