Job Deployment Modes in Spark

In the previous segment, you understood the need for optimising cluster utilisation for your Spark jobs.

In this segment, you will learn about the different modes in which you can deploy your job in a Spark cluster. Let’s watch the next video to learn about these modes in detail.

In the video above, you learnt about the different cluster deployment modes, which are as follows:

Local mode: Local mode is a non-distributed, single-JVM deployment mode in which all the components run on a single instance. Any job that you run on your own laptop or any other single machine runs in the local mode. Generally, this mode is used only for testing, debugging and demonstration purposes, not for actual ETL pipelines in the production environment. The degree of parallelism in this mode depends on the number of CPU cores in your local machine, which can be set according to the requirements of your Spark jobs. In order to start a Spark session, you need to specify in the master that your job is running in the local mode. Consider the following example:

spark = SparkSession.builder.appName("demo").master("local").getOrCreate()

Standalone mode: Spark has the capability of managing clusters on its own. In the standalone mode, Spark uses its inbuilt resource manager, which allows you to create a distributed master-slave architecture similar to that in Hadoop. By default, Spark is configured as a single-node cluster. You can submit a job to a standalone cluster as follows:

spark-submit --master spark://207.184.161.138:7077

Note: In most cases, you need to provide the IP address of the master node followed by the port number of the Spark Master (7077 by default).

Spark has the following two types of deployment modes in distributed environments:

  • Spark Cluster Mode: When the Driver program resides on one of the worker nodes inside the cluster itself, the mode of deployment is known as the Spark Cluster Mode.
  • Spark Client Mode: When the Driver program resides on an external client machine, the mode of deployment is known as the Spark Client Mode.

Note: In the case of the cluster mode, the Spark driver can run on any of the available worker nodes alongside the Spark executors.
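The two modes above are selected with the --deploy-mode flag of spark-submit. As a sketch (the master URL and application file are illustrative, not from a real cluster):

```shell
# Client mode (the default): the driver runs on the machine
# where spark-submit is invoked
spark-submit --master spark://207.184.161.138:7077 --deploy-mode client my_app.py

# Cluster mode: the driver is launched on one of the worker nodes
spark-submit --master spark://207.184.161.138:7077 --deploy-mode cluster my_app.py
```

Client mode is convenient for interactive work, since the driver's console output appears on your machine; cluster mode is preferred for production jobs, since the driver is not tied to the client's session.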

Since Spark integrates with the Hadoop ecosystem, it can also leverage Yet Another Resource Negotiator (YARN) for cluster management. Similar to the standalone mode, YARN supports the following two modes of deployment:

  • YARN Client Mode: When the Driver program resides on an external client node, the deployment mode is known as the YARN Client Mode. You can run a Spark job in the YARN Client Mode as follows:

spark-submit --master yarn --deploy-mode client [options] <app file> [app options]

The image given below is an example of this deployment mode.

  • YARN Cluster Mode: When the Driver program resides on one of the worker nodes of the cluster inside a separate Application Master, the deployment mode is known as the YARN Cluster Mode. You can run a Spark job in the YARN Cluster Mode as follows:

spark-submit --master yarn --deploy-mode cluster [options] <app file> [app options]

The image given below is an example of this deployment mode.
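Putting the pieces together, a typical YARN cluster-mode submission fills the [options] placeholders with executor resource settings. The application file, resource values and arguments below are assumptions for illustration only:

```shell
# Illustrative YARN cluster-mode submission; the resource options,
# application file and its arguments are hypothetical
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  my_etl_job.py --input /data/raw --output /data/processed
```

In cluster mode, YARN chooses a worker node to host the Application Master, and the Spark driver runs inside it, so the job keeps running even if the submitting machine disconnects.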
