Let’s hear from Sandeep about how to improve the query execution time by enabling partitioning and bucketing on the Amazon review dataset.
Brief points that were covered in the above video:
- For partitioning, we have to increase partitions and enable the dynamic partition
Example
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;Output
- Create an external table and partition over year and month
Example
create external table if not exists amz_review_yr_mnth_part (
reviewerid string,
asin string,
reviewername string,
helpful array<int>,
reviewtext string,
overall double,
summary string,
unixreviewtime bigint) partitioned by
(yr int, mnth int)
location 's3a://<your-bucket-name>/folder-name'Output
Note
We are using year and months for partitioning, because these columns have minimum cardinality.
- Insert data into amz_review_yr_mnth_part from amz_review_col.
Example
insert overwrite table amz_review_yr_mnth_part partition(yr, mnth)
select reviewerid,
asin,
reviewername,
helpful,
reviewtext,
overall,
summary,
unixreviewtime,
year(from_unixtime(unixreviewtime)) as yr,
month(from_unixtime(unixreviewtime)) as mnth
from amz_review_colOutput
- Execute both queries to get execution time difference with and without partition.
select overall, count(*) as review_count from amz_review_yr_mnth_part
where yr = 2004 and mnth = 1 group by overall order by review_count desc;select overall, count(*) as review_count from amz_review_col
where year(from_unixtime(unixreviewtime)) = 2004 and month(from_unixtime(unixreviewtime)) = 1
group by overall order by review_count desc;Output
Now, let’s see how we can use the HQL Rank functions in our analysis.
Let’s summarise the commands that you saw in the above video:
- Using RANK() function
Example
Select asin as Product , RANK() OVER(order by overall) from amz_review_yr_mnth_part Output
- If we want consecutive ranking – we have to use DENSE_RANK() function.
Example
Select asin as Product , DENSE_RANK() OVER(order by overall) from amz_review_yr_mnth_partOutput
Now, let’s see how we can apply Bucketing in our Amazon review dataset tables.
Now you get an essence of how execution time gets affected by partitioning and bucketing.
Let’s conclude some important points of the above video:
- For doing bucketing we have to execute the following command:
Example
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.enforce.bucketing=true;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;Output
Note
we will use reviewerid for bucketing as each reviewer has a unique id. (The cardinality is very minimum here.)
- Create an external table with bucketing
Example
create external table if not exists
amz_review_clustered_yr_mnth
(reviewerid string, asin string, reviewername string, helpful
array<int>, reviewtext string,
overall double, summary string, unixreviewtime bigint) partitioned by
(yr int, mnth int)
clustered by (reviewerid) into 4 buckets
location 's3a://hue-demo-bucket/tables'Output
- After successful creation of the external table, insert data into it.
Example
insert overwrite table
amz_review_clustered_yr_mnth partition(yr,
mnth)
select reviewerid, asin, reviewername, helpful, reviewtext,
overall, summary, unixreviewtime, yr, mnth
from amz_review_yr_mnth_partOutput
As a practice, you can compare the execution time of the different queries between the amz_review (doesn’t havepartitioning and bucketing) and amz_review_clustered_yr_mnth (Has partitioning and bucketing).
Report an error