This segment contains some practice questions around Hive queries.
You are given an ‘Airlines’ data set. This data contains information about the flights that have arrived and departed from different airports.
Following is the file containing information about the attributes. You can download it and read the complete explanation of the attributes.
Get data by executing below command on EMR instance.
Example
# Switch to the root user
sudo -i
# Create a new directory named 'airlines'
mkdir airlines
# move inside the directory
cd airlines
# Download the dataset
wget ml-cloud-dataset.s3.amazonaws.com/Airlines_data.txtOutput
Following is the data dictionary for the dataset.
| Index | Name | Description |
| 0 | ID | Serial Number |
| 1 | Year | 2004-2008 |
| 2 | Month | 1-12 |
| 3 | DayofMonth | 1-31 |
| 4 | DayOfWeek | 1 (Monday) – 7 (Sunday) |
| 5 | DepTime | Actual departure time (local,hhmm) |
| 6 | CRSDep Time | Scheduled departure time (local,hhmm) |
| 7 | ArrTime | Actual arrival time (local,hhmm) |
| 8 | CRSArrTime | Scheduled arrival time (local,hhmm) |
| 9 | UniqueCarrier | Unique carrier code |
| 10 | FlightNum | Flight number |
| 11 | TailNum | Plane tail number |
| 12 | ActualElapsedTime | Time in minutes |
| 13 | CRSElapsedTime | Time in minutes |
| 14 | AirTime | Time in minutes |
| 15 | ArrDelay | Arrival delay. in minutes |
| 16 | DepDelay | Departure delay, in minutes |
| 17 | Origin | Origin IATA airport code (International Air Transport Association) |
| 18 | Dest | Destination IATA airport code |
| 19 | Distance | In miles |
| 20 | TaxiIn | Time taken from runway to the terminal after landing , excluding deceleration , in minutes |
| 21 | TaxiOut | Taxi taken from terminal to the runways before take-off, excluding acceleration, in minutes |
| 22 | Cancelled | Whether the flight was cancelled |
| 23 | CancellationCode | Reason for cancellation ( A = carrier , B = weather , C = NAS , D= security) |
| 24 | Diverted | 1 = yes , 0 = no |
| 25 | CarrierDelay | Time in minutes |
| 26 | WeatherDelay | Time in minutes |
| 27 | NASDelay | Time in minutes |
| 28 | SecurityDelay | Time in minutes |
| 29 | LateAircraftDelay | Time in minutes |
Note
Before creating any table, make sure you run this command.
Example
Add jar /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core-2.3.6-amzn-2.jar;Output
Create an external table
Example
CREATE EXTERNAL TABLE airlines(
`SNo` int, `Year` int, `Month` int, `DayofMonth` int, `DayOfWeek` int, `DepTime` int, `CRSDepTime` int,
`ArrTime` int, `CRSArrTime` int, `UniqueCarrier` string, `FlightNum` int, `TailNum` string, `ActualElapsedTime` int,
`CRSElapsedTime` int, `AirTime` int, `ArrDelay` int, `DepDelay` int, `Origin` string, `Dest` string, `Distance` int, `TaxiIn` int,
`TaxiOut` int, `Cancelled` int, `CancellationCode` string, `Diverted` int,`CarrierDelay` int, `WeatherDelay` int,
`NASDelay` int,`SecurityDelay` int, `LateAircraftDelay` int )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;Output
Insert data into table airlines
Example
load data local inpath '/root/airlines/Airlines_data.txt' overwrite into table airlines;Output
Based on this ‘Airlines’ data set, answer the following questions.
For more practice, we highly recommend you to apply EDA technique to analyse the dataset and share your insights on the DF.