In this segment, we will begin to tackle our real-world use case. Let’s understand the problem statement and devise a solution for the same.
In the following video, Ajay will explain the problem statement to you and discuss the data with which we will work.
As seen in the video, we will enable efficient OLAP (online analytical processing) on the trip and booking data for a ride-hailing company.
We need to design and schedule a data pipeline to generate the following insights:
- Get the car type with the highest number of trips for each city.
- Get the throughput of the trips (number of trips/number of bookings) for each city.
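To make the two insights concrete, here is an illustrative plain-Python sketch of the aggregations involved. The actual pipeline will compute these with Spark on the imported tables; the sample records, city names and column names (`city`, `car_type`) below are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical sample records; in the pipeline, this data comes from the
# booking and trip tables imported into HDFS.
bookings = [
    {"booking_id": 1, "city": "Bangalore"},
    {"booking_id": 2, "city": "Bangalore"},
    {"booking_id": 3, "city": "Delhi"},
    {"booking_id": 4, "city": "Delhi"},
]
trips = [
    {"trip_id": 1, "city": "Bangalore", "car_type": "sedan"},
    {"trip_id": 2, "city": "Bangalore", "car_type": "sedan"},
    {"trip_id": 3, "city": "Delhi", "car_type": "hatchback"},
]

# Insight 1: car type with the highest number of trips in each city.
trips_by_city = defaultdict(Counter)
for t in trips:
    trips_by_city[t["city"]][t["car_type"]] += 1
top_car_type = {city: counts.most_common(1)[0][0]
                for city, counts in trips_by_city.items()}

# Insight 2: throughput = number of trips / number of bookings per city.
booking_counts = Counter(b["city"] for b in bookings)
trip_counts = Counter(t["city"] for t in trips)
throughput = {city: trip_counts[city] / booking_counts[city]
              for city in booking_counts}

print(top_car_type)  # {'Bangalore': 'sedan', 'Delhi': 'hatchback'}
print(throughput)    # {'Bangalore': 1.0, 'Delhi': 0.5}
```

Not every booking results in a completed trip, which is why the throughput ratio is at most 1 per city.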
The data provided is in the form of the following two tables:
- booking
- trip
The schema for these tables can be seen below.
booking
trip
Now that you have gone through the problem statement and familiarised yourself with the data, in the next video, Ajay will explain the solution approach.
Note:
In the video at 0:02, the SME refers to the previous segment. However, the video that he is referring to is the first video in this segment.
The solution approach to our problem statement includes the following steps:
- Bring data from MySQL to HDFS via Sqoop.
- Create necessary directories in HDFS.
- Create Hive tables on the imported data.
- Construct partitions in the Hive table.
- Filter invalid records using Spark.
- Run the analysis to generate aggregated result using Spark.
- View the result.
Note:
Please note that the demonstrated solution includes one additional step that is not listed separately above: changing the JDK version to 8 after the Sqoop job is over, as this step depends on the Airflow installation. The overall generic solution approach remains the same.
The DAG for the same can be seen below.
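The steps above can be wired together as an Airflow DAG roughly as follows. This is only a sketch: the DAG id, schedule, task names and `echo` placeholder commands are all assumptions for illustration, not the course's actual code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# All bash_command values below are placeholders; the real pipeline would
# invoke the actual Sqoop, Hive and spark-submit commands.
with DAG(
    dag_id="trip_booking_pipeline",  # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    sqoop_import = BashOperator(
        task_id="sqoop_import",
        bash_command="echo 'sqoop import: MySQL -> HDFS'",
    )
    change_jdk = BashOperator(
        task_id="change_jdk_to_8",
        bash_command="echo 'switch to JDK 8 for Airflow'",
    )
    create_hive_tables = BashOperator(
        task_id="create_hive_tables",
        bash_command="echo 'create HDFS dirs, Hive tables and partitions'",
    )
    filter_invalid = BashOperator(
        task_id="filter_invalid_records",
        bash_command="echo 'spark-submit filter job'",
    )
    run_analysis = BashOperator(
        task_id="run_analysis",
        bash_command="echo 'spark-submit aggregation job'",
    )

    # Task ordering mirrors the solution steps listed above.
    sqoop_import >> change_jdk >> create_hive_tables >> filter_invalid >> run_analysis
```

Each `>>` edge encodes a dependency, so Airflow runs the tasks strictly in the order of the solution steps.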
In the next segment, we will start with the coding demonstration for this problem statement.