In this segment, you will finally run the DAG that we created.
We recommend that you follow along with the demonstration in your own EMR instance.
Note:
You will need to configure the connections for the Hive, Sqoop and Spark tasks. The details of these configurations are covered in the following video.
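If you prefer to script these connections instead of entering them through the Airflow UI (Admin > Connections), a minimal sketch is shown below. The connection IDs, hosts, ports and extras here are placeholder assumptions; substitute the values demonstrated in the following video.

```python
# A minimal sketch of creating the required Airflow connections programmatically.
# The conn_ids, hosts and ports are placeholders; use the values from the video.
from airflow import settings
from airflow.models import Connection

session = settings.Session()

for conn in [
    Connection(conn_id="hive_cli_default", conn_type="hive_cli",
               host="localhost", port=10000),        # Hive CLI on the EMR master (assumed)
    Connection(conn_id="sqoop_default", conn_type="sqoop",
               host="localhost"),                    # Sqoop (assumed host)
    Connection(conn_id="spark_default", conn_type="spark",
               host="yarn", extra='{"deploy-mode": "client"}'),  # Spark on YARN (assumed)
]:
    # Skip connections that already exist so the script stays idempotent
    if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
        session.add(conn)

session.commit()
```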
Note:
Please note that this DAG contains a lot of operators and is very taxing on the EMR cluster. While running the ETL DAG, pause any other DAGs that you may have resumed or left running previously. Also, refrain from navigating the Airflow UI too much, apart from refreshing the home page of the Webserver UI to check whether the DAG run is complete.
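If you have several DAGs enabled, one convenient way to pause everything except the ETL DAG is through the Airflow 2.x stable REST API. This is only a hedged sketch: it assumes the Webserver is reachable at localhost:8080 with basic authentication enabled, and the DAG ID `etl_dag` and the credentials are placeholders.

```python
# Hedged convenience sketch: pause every DAG except the ETL DAG via the
# Airflow 2.x stable REST API. Endpoint, credentials and DAG ID are assumptions.
import requests

AIRFLOW_API = "http://localhost:8080/api/v1"
AUTH = ("admin", "admin")          # replace with your Webserver credentials
ETL_DAG_ID = "etl_dag"             # placeholder for the DAG created in this module

dags = requests.get(f"{AIRFLOW_API}/dags", auth=AUTH).json()["dags"]
for dag in dags:
    if dag["dag_id"] != ETL_DAG_ID and not dag["is_paused"]:
        requests.patch(
            f"{AIRFLOW_API}/dags/{dag['dag_id']}",
            auth=AUTH,
            json={"is_paused": True},   # pausing stops new runs from being scheduled
        )
        print(f"Paused {dag['dag_id']}")
```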
Before running the DAG, make sure that the JDK is set to version 11, as mentioned in the demo doc below.
In the upcoming video, Ajay will collate everything we have created so far and execute our DAG.
In the video, the following steps are demonstrated:
- Configure the connections for Hive, Sqoop, Spark, etc.
- Start the DAG using the toggle switch in the Airflow UI.
- Validate the KPIs generated by running some Hive queries (a validation sketch follows this list).
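Below is a minimal validation sketch using PyHive (`pip install pyhive[hive]`) to run a couple of sanity checks against the Hive tables produced by the DAG. The host, database and the `kpi_results` table name are placeholder assumptions; replace them with the tables created by your ETL pipeline, or run the equivalent queries directly in the Hive CLI as shown in the video.

```python
# Sanity-check the KPI output with a couple of Hive queries.
# Host, database and table names below are assumptions, not the course's exact names.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# Check that the KPI table is populated
cursor.execute("SELECT COUNT(*) FROM kpi_results")
print("Rows in kpi_results:", cursor.fetchone()[0])

# Inspect a few rows of the generated KPIs
cursor.execute("SELECT * FROM kpi_results LIMIT 10")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()
```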
You can find the code and other resources used in the demonstration attached below:
The document attached below details the steps followed in the demonstration.
This marks the end of our final demonstration. Next, we will look at some best practices for Apache Airflow.