Now, in this segment let’s look at some of the practice questions on the Hive queries you learnt in this session.
You are provided with the ‘Airlines’ datasets.
In the previous session, you have created the airlines table. Now, you cancreate a table with partitioning defined for the same dataset, and compare the performance of some heavy queries on both of these tables.
Note
The dataset and the data dictionary can be downloaded from the ‘Practice Questions’ segment of Session-02.
Let’s try and partition the airline data with the ‘Year’ column.
Important Note:
Before partitioning any table, make sure you run these commands.
Example
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;Output
Create an external table with name “airlines_partitioned“.
Example
create external table if not exists airlines_partitioned (
`SNo` int, `Month` int, `DayofMonth` int, `DayOfWeek` int,
`DepTime` int, `CRSDepTime` int, `ArrTime` int, `CRSArrTime` int,
`UniqueCarrier` string, `FlightNum` int, `TailNum` string,
`ActualElapsedTime` int, `CRSElapsedTime` int, `AirTime` int,
`ArrDelay` int, `DepDelay` int, `Origin` string, `Dest` string,
`Distance` int, `TaxiIn` int, `TaxiOut` int, `Cancelled` int,
`CancellationCode` string, `Diverted` int, `CarrierDelay` int,
`WeatherDelay` int, `NASDelay` int, `SecurityDelay` int,
`LateAircraftDelay` int)
partitioned by (`Year` int)
location '/root/airlines/partition/';Output
Insert the data in the table.
Example
insert overwrite table airlines_partitioned partition(`Year`)
select `SNo`, `Month`, `DayofMonth`, `DayOfWeek`, `DepTime`, `CRSDepTime`,
`ArrTime`, `CRSArrTime`, `UniqueCarrier`, `FlightNum`, `TailNum`,
`ActualElapsedTime`, `CRSElapsedTime`, `AirTime`, `ArrDelay`,`DepDelay`,
`Origin`, `Dest`, `Distance`, `TaxiIn`, `TaxiOut`, `Cancelled`, `CancellationCode`,
`Diverted`, `CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`,
`LateAircraftDelay`, `Year`
from airlines;Output
Based on this ‘airlines_partitioned‘ table, answer the following questions.
You can perform the same queries on the ‘airlines’ table(created in session-02) and see the performance difference.
For more practice, we highly recommend you to apply EDA technique to analyse the dataset and share your insights on the DF.