IKH

Data Analysis Using Partition

Let’s hear from Sandeep about how to improve the query execution time by enabling partitioning and bucketing on the Amazon review dataset.

Brief points that were covered in the above video:

  • For partitioning, we have to increase partitions and enable the dynamic partition

Example

Python
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;

Output

  • Create an external table and partition over year and month

Example

Python
create external table if not exists amz_review_yr_mnth_part (
    reviewerid string, 
    asin string, 
    reviewername string, 
    helpful array<int>, 
    reviewtext string,
    overall double, 
    summary string, 
    unixreviewtime bigint) partitioned by
(yr int, mnth int)
location 's3a://<your-bucket-name>/folder-name'

Output

Note

We are using year and months for partitioning, because these columns have minimum cardinality.

  • Insert data into amz_review_yr_mnth_part from amz_review_col.

Example

Python
insert overwrite table amz_review_yr_mnth_part partition(yr, mnth)
select    reviewerid,
          asin,
          reviewername,
          helpful,
          reviewtext,
          overall,
          summary,
          unixreviewtime,
          year(from_unixtime(unixreviewtime)) as yr,
          month(from_unixtime(unixreviewtime)) as mnth
from      amz_review_col

Output

  • Execute both queries to get execution time difference with and without partition.
Python
select  overall, count(*) as review_count from amz_review_yr_mnth_part
where  yr = 2004 and mnth = 1  group by overall   order by review_count desc;
Python
select  overall,  count(*) as review_count from amz_review_col
where year(from_unixtime(unixreviewtime)) = 2004 and month(from_unixtime(unixreviewtime)) = 1
 group by overall order by review_count desc;

Output

Now, let’s see how we can use the HQL Rank functions in our analysis.

Let’s summarise the commands that you saw in the above video:

  • Using RANK() function

Example

Python
Select asin as Product , RANK() OVER(order by overall)  from amz_review_yr_mnth_part 

Output

  • If we want consecutive ranking – we have to use DENSE_RANK() function.

Example

Python
Select asin as Product , DENSE_RANK() OVER(order by overall)  from amz_review_yr_mnth_part

Output

Now, let’s see how we can apply Bucketing in our Amazon review dataset tables.

Now you get an essence of how execution time gets affected by partitioning and bucketing.

Let’s conclude some important points of the above video:

  • For doing bucketing we have to execute the following command:

Example

Python
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.enforce.bucketing=true;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;

Output

Note

we will use reviewerid for bucketing as each reviewer has a unique id. (The cardinality is very minimum here.)

  • Create an external table with bucketing

Example

Python
create external table if not exists
amz_review_clustered_yr_mnth
(reviewerid string, asin string, reviewername string, helpful
array<int>, reviewtext string,
overall double, summary string, unixreviewtime bigint) partitioned by
(yr int, mnth int)
clustered by (reviewerid) into 4 buckets
location 's3a://hue-demo-bucket/tables'

Output

  • After successful creation of the external table, insert data into it.

Example

Python
insert overwrite table
amz_review_clustered_yr_mnth partition(yr,
mnth)
select reviewerid, asin, reviewername, helpful, reviewtext,
overall, summary, unixreviewtime, yr, mnth
from amz_review_yr_mnth_part

Output

As a practice, you can compare the execution time of the different queries between the amz_review (doesn’t havepartitioning and bucketing) and amz_review_clustered_yr_mnth (Has partitioning and bucketing). 

Report an error