Now that you have a general overview of the DAG, let’s take a look at its actual code.
We recommend that you follow along with the demonstrations in your own EMR instance.
Since the code is quite long, the demo is divided into two parts.
In the upcoming video, Ajay will demonstrate the first one.
Note:
Please note that the Airflow UI in the video above is slightly different from the UI that you will see; however, the overall steps and configuration remain the same.
Note:
The above code was used on a different installation of Airflow, so some code lines will differ from our sample solution; however, the overall solution remains the same. Please refer to the code that we have shared on the platform as the final code for this use case.
In this video, you saw the creation of the DAG object and understood the definition of the following tasks:
- extract_booking_table.
- extract_trip_table.
- create_raw_booking_location.
- create_filtered_booking_location.
- create_raw_trip_location.
- create_filtered_trip_location.
- create_result_trip_throughput_location.
- create_result_car_with_most_trips_location.
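As a rough sketch of how the location-creation tasks above fit together (the base path, layer names and the use of a BashOperator are assumptions for illustration, not the exact platform code), each task boils down to a shell command that creates an HDFS directory. The command strings below are plain Python; in the real DAG, each one would be the `bash_command` of an Airflow task:

```python
# Hypothetical HDFS base path -- the real path comes from the platform's etl_dag.py.
RAW_BASE = "/user/hadoop/etl"

def mkdir_command(layer: str, table: str) -> str:
    """Build the shell command a location-creation task would run to
    create an HDFS directory for a given layer (raw/filtered/result)
    and table (booking/trip)."""
    return f"hdfs dfs -mkdir -p {RAW_BASE}/{layer}/{table}"

# Commands corresponding to two of the location tasks listed above.
create_raw_booking_location = mkdir_command("raw", "booking")
create_filtered_trip_location = mkdir_command("filtered", "trip")

# In the actual DAG each string becomes a task, e.g.:
#   BashOperator(task_id="create_raw_booking_location",
#                bash_command=create_raw_booking_location, dag=dag)
```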
Note:
There is an additional task in the code for the ETL DAG used with EMR called switch_java_version. It is a basic Bash operator that runs the EMR command to switch to Java JDK 8, as you did manually in the previous session. You can refer to the ETL DAG Python file below for this.
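A minimal sketch of what switch_java_version looks like (the JDK path below is a typical EMR location for JDK 8 and is an assumption; the exact command is the one you ran manually in the previous session):

```python
# Assumed JDK 8 path on the EMR node -- verify against the command you
# used manually in the previous session.
JDK8_PATH = "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.x86_64/jre/bin/java"

# Shell command to point the `java` alternative at JDK 8.
switch_java_command = f"sudo alternatives --set java {JDK8_PATH}"

# In the DAG this string is the bash_command of the task:
#   switch_java_version = BashOperator(task_id="switch_java_version",
#                                      bash_command=switch_java_command, dag=dag)
```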
In the next video, Ajay will demonstrate the second part of our DAG construction.
Note:
The above code was used on a different installation of Airflow, so some code lines will differ from our sample solution; however, the overall solution remains the same. Please refer to the code that we have shared on the platform as the final code for this use case.
In this video, you understood the definition of the following tasks:
- create_hive_database.
- create_booking_raw_table.
- create_trip_raw_table.
- add_partitions_for_booking_table.
- add_partitions_for_trip_table.
- create_filter_booking_table.
- create_filter_trip_table.
- filter_booking_table.
- filter_trip_table.
- create_car_with_most_trips_table.
- generate_car_with_most_trips.
- create_trip_throughput_table.
- generate_trip_throughput.
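To illustrate the shape of the Hive tasks above (the database name, column list and location are placeholders, not the actual platform code), each task submits a HiveQL statement. The statements below are plain Python strings; in the real DAG they would be run through a Hive operator or a `hive -e` Bash command:

```python
# Hypothetical database and schema -- the real definitions are in the
# platform's etl_dag.py; this only illustrates the kind of HiveQL
# each task submits.
create_hive_database = "CREATE DATABASE IF NOT EXISTS etl_db"

# create_booking_raw_table: an external table over the raw HDFS location,
# partitioned so extracted files can be registered day by day.
create_booking_raw_table = """
CREATE EXTERNAL TABLE IF NOT EXISTS etl_db.booking_raw (
    booking_id STRING,
    customer_id STRING,
    booking_ts TIMESTAMP
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/etl/raw/booking'
"""

# add_partitions_for_booking_table: register the latest extract with Hive.
# '{{ ds }}' is Airflow's templated execution date.
add_partitions_for_booking_table = (
    "ALTER TABLE etl_db.booking_raw "
    "ADD IF NOT EXISTS PARTITION (dt='{{ ds }}')"
)
```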
The etl_dag.py file is attached below for your reference.
We will run this file in an upcoming segment.
In the next segment, you will take a look at the four Spark applications used in this DAG.