Before we start using Airflow, we first need to configure our EMR cluster for Apache Airflow and then install and set up Airflow on that cluster. This segment covers the steps for both.
In the following video, you will learn about the EMR configuration needed for Apache Airflow.
Please note that the other steps for setting up an Amazon EMR instance, such as the login steps and the YARN parameter steps, remain the same as in the Introduction to Cloud and AWS Setup module.
As mentioned in the video, most of the steps are similar to previous EMR setups. The only significant changes are in the software packages chosen during the initial setup of the instance.
The following document contains the steps discussed above to set up an AWS EMR instance for Apache Airflow, which you have to follow for this module.
Please note that the entire process is long and contains many steps, so make sure that you follow each step correctly.
Now in the following videos, you will learn how to install and set up Airflow on the EMR cluster that you have created. In the first video below, you will learn how to set up the MySQL database and execute the first script for installing Airflow dependencies.
Now in the following video, you will continue and complete the installation of Airflow on your EMR cluster. You will also log in to the Airflow web UI once the installation is complete.
With this, you have successfully installed Airflow on your EMR cluster and logged into the Airflow Webserver UI.
The installation steps that were followed in the two videos above are present in the following document.
Note:
There are two versions of the JDK installed on your EMR cluster: JDK 8 and JDK 11. JDK 8 comes installed on the EMR cluster by default (Hive and Spark in particular require it), and we need JDK 11 to use the Sqoop operator in Airflow. Once the Airflow installation steps are complete, your EMR cluster will have JDK 11 enabled by default. You can switch to the other version at any time by running the appropriate command below; these commands are also mentioned in the document above.
Example
# To switch to JDK 8, run the following command
sudo alternatives --config java <<< 1
# To switch to JDK 11, run the following command
sudo alternatives --config java <<< 3
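After switching, it can help to confirm which JDK is actually active. The snippet below is a minimal sketch, not part of the course scripts: `java_major` is a hypothetical helper that reduces a Java version string to its major version, assuming the usual formats reported by `java -version` (for example `1.8.0_312` for JDK 8 and `11.0.13` for JDK 11).

```shell
# Hypothetical helper (not part of the course scripts): reduce a Java
# version string to its major version.
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # drop the legacy "1." prefix used up to Java 8
  esac
  echo "${v%%[._]*}"      # keep only the part before the first "." or "_"
}

# Report the active JDK, if java is on the PATH.
# Note: java -version writes to stderr, hence the 2>&1.
if command -v java >/dev/null 2>&1; then
  active=$(java -version 2>&1 | awk -F '"' '/version/ {print $2; exit}')
  echo "Active JDK major version: $(java_major "$active")"
fi
```

If the reported major version is not the one you expect, rerun the relevant `alternatives` command above.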
Note:
Airflow comes configured with the SequentialExecutor by default. However, for our installation, we will be using the CeleryExecutor.
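To check which executor an installation is configured with, you can read the `executor` key in the `[core]` section of airflow.cfg. The following is a minimal sketch: `get_executor` is a hypothetical helper, and the file location is an assumption based on Airflow's default home directory (`~/airflow`, unless `AIRFLOW_HOME` is set); your installation may place the file elsewhere.

```shell
# Hypothetical helper: print the value of "executor" from the [core]
# section of an airflow.cfg-style file passed as the first argument.
get_executor() {
  awk -F '=' '
    /^\[/ { in_core = ($0 ~ /^\[core\]/) }     # track the current section
    in_core && $1 ~ /^[ \t]*executor[ \t]*$/ {
      gsub(/^[ \t]+|[ \t]+$/, "", $2)          # trim surrounding whitespace
      print $2
      exit
    }
  ' "$1"
}

# Assumed default location; adjust if your airflow.cfg lives elsewhere.
cfg="${AIRFLOW_HOME:-$HOME/airflow}/airflow.cfg"
if [ -f "$cfg" ]; then
  echo "Configured executor: $(get_executor "$cfg")"
fi
```

For the installation described here, the value should be CeleryExecutor.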
The following table contains some of the important configurations that were used in the airflow.cfg file that we used for the installation discussed in the video.
NOTE:
If you face issues installing Airflow because of problems with the MySQL installation, please follow the steps in the document below, which lists the instructions for installing Airflow through a Docker image.
Now in the following video, you will walk through the Airflow webserver UI that you will be working with throughout this module.
In the next segment, you will learn about operators and understand how tasks are created in Airflow.
Additional Resource
- Airflow Configuration Reference – This documentation page lists all the available Airflow configurations that you can set in the airflow.cfg file or through environment variables.
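As a brief illustration of the environment-variable route: Airflow reads overrides from variables named `AIRFLOW__{SECTION}__{KEY}` (note the double underscores). The sketch below uses a hypothetical helper, `airflow_env_name`, to build such a name from a section and key; the executor value is only an example.

```shell
# Hypothetical helper: build the environment variable name that overrides
# a given airflow.cfg section and key.
airflow_env_name() {
  printf 'AIRFLOW__%s__%s\n' "$1" "$2" | tr '[:lower:]' '[:upper:]'
}

# Example: override [core] executor for the current shell session only.
export "$(airflow_env_name core executor)=CeleryExecutor"
echo "$AIRFLOW__CORE__EXECUTOR"
```

A value set this way takes precedence over the corresponding entry in airflow.cfg for processes started from that shell.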