In this segment, we will begin to tackle our real-world use case. Let’s understand the problem statement and devise a solution for it.
In the following video, Ajay will explain the problem statement to you and discuss the data with which we will work.
As seen in the video, we will enable efficient OLAP (online analytical processing) on the trips and bookings data of a ride-hailing company.
We need to design and schedule a data pipeline to generate the following insights:
- Find the car type with the highest number of trips in each city
- Compute the throughput of trips (number of trips / number of bookings) for each city
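To make the two insights concrete, here is a minimal plain-Python sketch of the computations on tiny in-memory samples. The table names (`booking`, `trip`) come from the problem statement, but the sample rows and the column names (`city`, `car_type`) are assumptions; the actual pipeline computes these aggregates with Spark at scale.

```python
# Illustrative only: sample rows and column names (city, car_type) are assumed.
from collections import Counter

bookings = [
    {"booking_id": 1, "city": "Bangalore"},
    {"booking_id": 2, "city": "Bangalore"},
    {"booking_id": 3, "city": "Delhi"},
    {"booking_id": 4, "city": "Delhi"},
]
trips = [
    {"trip_id": 1, "city": "Bangalore", "car_type": "Sedan"},
    {"trip_id": 2, "city": "Bangalore", "car_type": "Sedan"},
    {"trip_id": 3, "city": "Delhi", "car_type": "SUV"},
]

# Insight 1: car type with the highest number of trips in each city
trips_per_city_car = Counter((t["city"], t["car_type"]) for t in trips)
top_car = {}
for (city, car), n in trips_per_city_car.items():
    if city not in top_car or n > top_car[city][1]:
        top_car[city] = (car, n)

# Insight 2: throughput = number of trips / number of bookings per city
trip_counts = Counter(t["city"] for t in trips)
booking_counts = Counter(b["city"] for b in bookings)
throughput = {c: trip_counts[c] / booking_counts[c] for c in booking_counts}

print(top_car)     # {'Bangalore': ('Sedan', 2), 'Delhi': ('SUV', 1)}
print(throughput)  # {'Bangalore': 1.0, 'Delhi': 0.5}
```

In Spark or Hive, the same logic becomes a group-by on (city, car_type) for the first insight and a join of per-city counts for the second.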
The data provided is in the form of the following two tables:
- booking
- trip
The schema for these tables can be seen below.
booking

trip
Now that you have gone through the problem statement and familiarised yourself with the data, in the next video, Ajay will explain the solution approach.
The solution approach to our problem statement includes the following steps:
- Bring data from MySQL to HDFS via Sqoop
- Create necessary directories in HDFS
- Create Hive tables on the imported data
- Construct partitions in the Hive table
- Filter invalid records using Spark
- Run the analysis with Spark to generate the aggregated results
- View the result
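As a rough sketch of how the steps above translate into commands, the helpers below only assemble the command strings; every host name, database, credential, HDFS path, and partition column here is a placeholder assumption, not a value from the course, and the Hive column list is elided because the schemas are shown only as images above.

```python
# Hypothetical command templates for the pipeline steps.
# All hosts, paths, and credentials below are placeholders.

def sqoop_import_cmd(table: str, target_dir: str) -> str:
    """Step 1: bring a MySQL table into HDFS via Sqoop."""
    return (
        "sqoop import "
        "--connect jdbc:mysql://mysql-host:3306/rides "
        "--username ride_user -P "
        f"--table {table} --target-dir {target_dir} -m 1"
    )

def mkdir_cmd(path: str) -> str:
    """Step 2: create the necessary HDFS directories."""
    return f"hdfs dfs -mkdir -p {path}"

def hive_ddl(table: str, location: str) -> str:
    """Steps 3-4: Hive external table over the imported data.
    Partitioning by city is an assumption; columns are elided."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table}_ext (...) "
        f"PARTITIONED BY (city STRING) LOCATION '{location}'"
    )

def spark_submit_cmd(script: str) -> str:
    """Steps 5-6: filter invalid records and aggregate with Spark."""
    return f"spark-submit --master yarn {script}"

for t in ("booking", "trip"):
    print(mkdir_cmd(f"/data/raw/{t}"))
    print(sqoop_import_cmd(t, f"/data/raw/{t}"))
    print(hive_ddl(t, f"/data/raw/{t}"))
print(spark_submit_cmd("clean_and_aggregate.py"))
```

In practice these steps are not run by hand; they are wired into an orchestrator as tasks so that each step runs only after its upstream step succeeds.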
The DAG for the same can be seen below.
In the next segment, we will start with the coding demonstration for this problem statement.