In this segment, we will begin to tackle our real-world use case. Let’s understand the problem statement and devise a solution for it.
In the following video, Ajay will explain the problem statement to you and discuss the data with which we will work.
As seen in the video, we will enable efficient OLAP (online analytical processing) on the trips and bookings data of a ride-hailing company.
We need to design and schedule a data pipeline to generate the following insights:
- Find the car type with the highest number of trips in each city
- Compute the throughput of trips (number of trips / number of bookings) for each city
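To make the two insights concrete, here is a minimal plain-Python sketch of the computations on tiny in-memory samples. The table names (`booking`, `trip`) come from the problem statement, but the sample rows and the column names (`city`, `car_type`) are assumptions; the actual pipeline computes these aggregates with Spark at scale.

```python
# Illustrative only: sample rows and column names (city, car_type) are assumed.
from collections import Counter

bookings = [
    {"booking_id": 1, "city": "Bangalore"},
    {"booking_id": 2, "city": "Bangalore"},
    {"booking_id": 3, "city": "Delhi"},
    {"booking_id": 4, "city": "Delhi"},
]
trips = [
    {"trip_id": 1, "city": "Bangalore", "car_type": "Sedan"},
    {"trip_id": 2, "city": "Bangalore", "car_type": "Sedan"},
    {"trip_id": 3, "city": "Delhi", "car_type": "SUV"},
]

# Insight 1: car type with the highest number of trips in each city
trips_per_city_car = Counter((t["city"], t["car_type"]) for t in trips)
top_car = {}
for (city, car), n in trips_per_city_car.items():
    if city not in top_car or n > top_car[city][1]:
        top_car[city] = (car, n)

# Insight 2: throughput = number of trips / number of bookings per city
trip_counts = Counter(t["city"] for t in trips)
booking_counts = Counter(b["city"] for b in bookings)
throughput = {c: trip_counts[c] / booking_counts[c] for c in booking_counts}

print(top_car)     # {'Bangalore': ('Sedan', 2), 'Delhi': ('SUV', 1)}
print(throughput)  # {'Bangalore': 1.0, 'Delhi': 0.5}
```

In Spark or Hive, the same logic becomes a group-by on (city, car_type) for the first insight and a join of per-city counts for the second.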
The data provided is in the form of the following two tables:
- booking
- trip
The schema for these tables can be seen below.
booking

trip
Now that you have gone through the problem statement and familiarised yourself with the data, in the next video, Ajay will explain the solution approach.
The solution approach to our problem statement includes the following steps:
- Bring data from MySQL to HDFS via Sqoop
- Create necessary directories in HDFS
- Create Hive tables on the imported data
- Construct partitions in the Hive table
- Filter invalid records using Spark
- Run the analysis with Spark to generate the aggregated results
- View the result
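As a rough sketch of how the steps above translate into commands, the helpers below only assemble the command strings; every host name, database, credential, HDFS path, and partition column here is a placeholder assumption, not a value from the course, and the Hive column list is elided because the schemas are shown only as images above.

```python
# Hypothetical command templates for the pipeline steps.
# All hosts, paths, and credentials below are placeholders.

def sqoop_import_cmd(table: str, target_dir: str) -> str:
    """Step 1: bring a MySQL table into HDFS via Sqoop."""
    return (
        "sqoop import "
        "--connect jdbc:mysql://mysql-host:3306/rides "
        "--username ride_user -P "
        f"--table {table} --target-dir {target_dir} -m 1"
    )

def mkdir_cmd(path: str) -> str:
    """Step 2: create the necessary HDFS directories."""
    return f"hdfs dfs -mkdir -p {path}"

def hive_ddl(table: str, location: str) -> str:
    """Steps 3-4: Hive external table over the imported data.
    Partitioning by city is an assumption; columns are elided."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table}_ext (...) "
        f"PARTITIONED BY (city STRING) LOCATION '{location}'"
    )

def spark_submit_cmd(script: str) -> str:
    """Steps 5-6: filter invalid records and aggregate with Spark."""
    return f"spark-submit --master yarn {script}"

for t in ("booking", "trip"):
    print(mkdir_cmd(f"/data/raw/{t}"))
    print(sqoop_import_cmd(t, f"/data/raw/{t}"))
    print(hive_ddl(t, f"/data/raw/{t}"))
print(spark_submit_cmd("clean_and_aggregate.py"))
```

In practice these steps are not run by hand; they are wired into an orchestrator as tasks so that each step runs only after its upstream step succeeds.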
The DAG for the same can be seen below.
In the next segment, we will start with the coding demonstration for this problem statement.