In this segment, you will learn about the Spark operator.
We recommend that you follow along with the demonstrations in your own EC2 instance.
In the upcoming video, Ajay will introduce you to the Spark operator.
In the video, you learnt the theory behind the Spark operator in Airflow.
Spark operators are used to schedule Spark jobs from Airflow.
The SparkSubmitOperator launches a Spark job using the spark-submit CLI on the Airflow machine.
To run Spark SQL queries, you can use the SparkSqlOperator.
For this module, we will be focusing on the SparkSubmitOperator.
Some of the important parameters/arguments for the SparkSubmitOperator are listed below:
- application: The application to be submitted as a job – either a jar or a py file
- conn_id: The Airflow connection ID pointing to the Spark cluster
- application_args: Arguments for the application being submitted
- spark_binary: Command to use for spark-submit. We will use spark2-submit.
Note:
The task_id and dag arguments must be specified for all operators.
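To see how these parameters fit together, here is a minimal DAG sketch. The DAG name, task ID, application path, connection ID, and application arguments are placeholder assumptions, not values from the demonstration; the import path shown is from the apache-spark provider package and differs in older Airflow versions (where the operator lived under airflow.contrib.operators):

```python
# A minimal sketch of a DAG that submits a Spark job.
# All names and paths below are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_submit_demo",          # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,              # run only when triggered manually
    catchup=False,
) as dag:
    submit_job = SparkSubmitOperator(
        task_id="submit_spark_job",      # task_id is mandatory for every operator
        application="/path/to/app.py",   # the jar or .py file to submit (placeholder)
        conn_id="spark_default",         # Airflow connection to the Spark cluster
        application_args=["--run-date", "{{ ds }}"],  # passed through to the application
        spark_binary="spark2-submit",    # CLI command used to launch the job
    )
```

Because the operator runs spark-submit on the Airflow machine itself, the chosen spark_binary (here spark2-submit, as used in this module) must be on that machine's PATH.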
In the next video, we will start with the actual demonstration of the Spark operator.
You can find the code and other resources used in the demonstration attached below:
The document provided below details the steps followed in the demonstration.
Additional Reading
You can visit the following link for the source code for the SparkSubmitOperator: SparkSubmitOperator.