
Spark Architecture

In the previous segment, you were introduced to some of the functionalities of the Spark ecosystem. You learnt that Spark is equipped with useful libraries and features for efficient data processing and analysis. Now, in this segment, you will learn about the Spark architecture and understand how a process comes together in the Spark framework. One point to note here is that you will be using your knowledge of the basic distributed architecture, which you learnt in the previous modules, to build on these topics in Spark.

So, in the upcoming video, you will learn about the Spark architecture from our SME.

Broadly, the Spark architecture consists of two types of nodes:

  • The driver node (master), which runs the driver program, and
  • The worker nodes (slaves), which run the executor program.

Now, from the image given above, it should be clear that the driver node and the worker nodes are physical machines on which the driver program and the executor program run, respectively. The driver program is responsible for managing the Spark application that you submit, and the executor program uses the worker nodes as distributed storage and processing space to run that application. Both storage and processing require multiple worker nodes, as you are dealing with distributed data.

Now, you have driver and worker nodes to perform the tasks that a user submits. However, in a distributed cluster, multiple processes run at the same time. Therefore, you will also require a resource manager that can divide the cluster's resources efficiently among them. Let’s watch the upcoming video and try to get an understanding of this.

To summarise, the driver program and the executor program are managed by the cluster manager. A cluster manager is a pluggable component in Spark. Because of this, Spark can run on various cluster manager modes, which include the following:

  • In the Standalone mode, Spark uses its own cluster manager and does not require any external infrastructure.
  • However, at an enterprise level, for running large Spark jobs, Spark can be integrated with external cluster managers like Apache YARN or Apache Mesos.
  • This facility to run Spark with external cluster managers allows Spark applications to be deployed on the same infrastructure as Hadoop, which serves as an advantage for companies that are looking to use Spark as an analytics platform (a short sketch of how the cluster manager is chosen follows this list).
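
To make this concrete, the choice of cluster manager is expressed through the master URL of the application. Below is a minimal PySpark sketch; the host names and ports are placeholders, and in practice the master URL is usually supplied through the --master option of spark-submit rather than hard-coded.

    from pyspark import SparkConf

    # The master URL tells Spark which cluster manager to use.
    # Host names and ports below are placeholders.
    standalone_conf = SparkConf().setMaster("spark://master-host:7077")  # Spark's own Standalone manager
    yarn_conf = SparkConf().setMaster("yarn")                            # Hadoop YARN
    mesos_conf = SparkConf().setMaster("mesos://mesos-host:5050")        # Apache Mesos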

Now, in the upcoming video, you will learn how a Spark job gets executed.

Note:

At [00:35], it is mentioned that the driver program is responsible for resource allocation. This is not accurate; the cluster manager is responsible for this task.

So, in the video, you saw that once a user submits the code, a Spark application is created, which starts the driver program. The driver program is responsible for initiating the Spark Context, which is the most crucial task in the entire process cycle.

To understand the process better, let’s look at the role of each component in the entire execution cycle.

Role of Spark Context:

  • Spark Context does not execute the code, but it creates an optimised physical plan of the execution within the Spark architecture. It is the initial entry point of Spark into the distributed environment.

Role of Driver Program:

  • A driver program is similar to the ‘main()’ method of your application. It contains the instructions and the action steps that are to be taken on the data present in each worker node.
  • A driver program creates a general logical graph of operations. These operations mostly involve the creation of an RDD from some source data, transformations to manipulate and filter the data, and, finally, an action to save or print the data.
  • When the driver program runs, the logical graph is converted to an execution plan.
  • In Spark terminology, a process performed or an action taken on data is called a Spark Job. A job is further broken down into multiple stages, which help make the Spark environment reliable and fault-tolerant. Finally, each stage comprises tasks that are executed by the executors. A task is the most basic unit of work, which each executor performs in parallel on its respective partition of the data.
  • Once the entire execution plan is ready, the Spark driver coordinates with the executors to run the various tasks, as illustrated in the sketch below.
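
To make these roles concrete, here is a minimal sketch of a driver program in PySpark (the application name and numbers are illustrative). The transformations only extend the logical graph; the action at the end triggers a Spark job, which is broken into stages and tasks that the executors run in parallel on the partitions.

    from pyspark import SparkConf, SparkContext

    # The driver program: it initiates the Spark Context and builds the
    # logical graph of operations.
    conf = SparkConf().setAppName("DriverProgramDemo")   # illustrative app name
    sc = SparkContext(conf=conf)

    # Transformations: these only build up the logical graph; nothing runs yet.
    numbers = sc.parallelize(range(1, 1001), 8)          # an RDD with 8 partitions
    evens = numbers.filter(lambda x: x % 2 == 0)
    squares = evens.map(lambda x: x * x)

    # Action: this triggers a Spark job. The job is split into stages and tasks,
    # one task per partition, which the executors run in parallel before
    # returning their results to the driver.
    total = squares.reduce(lambda a, b: a + b)
    print(total)

    sc.stop()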

Role of Executors:

  • Executors are JVM processes that run on the worker nodes.
  • The main task of an executor is to run the tasks and send the results to the driver program.
  • Executors also store the cached data that is created while running a user program.
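
As an illustration of the last point, caching is requested in the driver program, but the cached partitions live in the executors' memory. Here is a minimal sketch (the file path is a placeholder):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    logs = sc.textFile("hdfs:///data/app-logs.txt")         # placeholder path
    errors = logs.filter(lambda line: "ERROR" in line)

    # cache() asks the executors to keep these partitions in memory
    # once the first action has computed them.
    errors.cache()

    print(errors.count())   # computes the partitions and caches them on the executors
    print(errors.take(5))   # reuses the cached partitions instead of recomputing them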

One executor can consume one or more cores on a single worker node.

If you assume:

  • That 1 executor runs on 1 core and 1 worker node contains 8 cores, then there will be 8 executors on a single worker node. Since every executor has only one core, each executor can run only one task at a time, and the operations within an executor cannot be parallelised.
  • That 1 executor runs on 2 cores and 1 worker node contains 8 cores, then there will be 4 executors on a single worker node. Since every executor has two cores, each executor can run two tasks at a time, and the operations within an executor can be parallelised (see the configuration sketch after this list).
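
In practice, this sizing is controlled through Spark's executor settings. Below is a rough sketch of the second scenario (2 cores per executor), using standard Spark configuration properties; the application name and memory value are placeholders.

    from pyspark.sql import SparkSession

    # Roughly the second scenario above: 2 cores per executor, so an 8-core
    # worker node can host up to 4 executors.
    spark = (
        SparkSession.builder
        .appName("ExecutorSizingDemo")            # illustrative name
        .config("spark.executor.cores", "2")      # cores per executor
        .config("spark.executor.memory", "4g")    # memory per executor (placeholder)
        .getOrCreate()
    )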

Pro Tip:

If you are running Spark in local mode and not in a distributed environment, then the driver and executor programs run on the same JVM.
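
For example, this is what a local-mode session looks like in PySpark; the driver and the executor threads all run inside the one JVM on your machine (the application name is illustrative):

    from pyspark.sql import SparkSession

    # local[*] runs Spark on the current machine using all available cores;
    # the driver and the executors share the same JVM process.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("LocalModeDemo")        # illustrative name
        .getOrCreate()
    )
    print(spark.sparkContext.master)     # prints 'local[*]'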

Role of Cluster Manager:

  • It launches the executor programs and manages the resources that are allocated to each component.

Now, let’s revisit how the different components in Spark work together to execute the physical plan created by Spark Context. Spark Context sends the optimised physical plan to the cluster manager (Standalone, YARN or Mesos) in the Spark framework. The cluster manager first checks the availability of resources in the cluster and then allocates worker nodes based on the requirements. These worker nodes are responsible for loading the data and executing the desired tasks over each partition of the distributed data. This parallelism is responsible for the fast computation in Apache Spark.

Another advantage of the Spark architecture is that it can be deployed over distributed storage systems, such as Hadoop, as is. No additional support is required to make it functional over these storage systems.

Additional Reading:

  • Understanding the working of Spark Driver and Executor