
Apache Spark in the Production Environment

In the previous segments, you learnt how to do a cost and performance trade-off analysis to determine the optimal configuration for your Spark cluster for your use case.

In this segment, you will learn how Spark is deployed in the production environment.

In the next video, our SME will explain how Spark jobs are run in the production environment and the factors that you should consider while deploying them.

As discussed in the video above, there are several factors that you should consider while running jobs in the production environment.

Typically, version control tools such as Git and SVN are used to keep track of the different versions of a Spark job so that multiple developers can work together on the same job. You can also use these tools to view the previous versions of the job.

A CI-CD pipeline is also set up in the production environment. This reduces the toil and the time it takes to deploy a job into production after it has been written on a developer's machine. Standardised pipelines are constructed to ensure that this process is seamless.

Whenever a job is created and you push the code, the pipeline builds a package, or a bundled job, and each successful package becomes a new version of the Spark job.
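
To make this concrete, the following is a minimal sketch of the kind of PySpark script that would be versioned, packaged and deployed through such a pipeline. The file name, paths and column names are hypothetical and only illustrate the shape of a typical batch job.

    # sales_report.py - a hypothetical batch job that the CI-CD pipeline would package
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    def main():
        spark = SparkSession.builder.appName("sales_report_batch").getOrCreate()

        # Read the raw data, aggregate it and write the result out.
        orders = spark.read.parquet("/data/raw/orders")
        daily_totals = (
            orders.groupBy("order_date")
                  .agg(F.sum("amount").alias("total_amount"))
        )
        daily_totals.write.mode("overwrite").parquet("/data/reports/daily_totals")

        spark.stop()

    if __name__ == "__main__":
        main()

Each time a change to such a script is pushed and the pipeline build succeeds, the resulting artefact (for example, a zipped module or a wheel) becomes the new deployable version of the job.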

Typically, Spark jobs are executed as batch jobs, although sometimes they have to be executed on demand. So, you need proper hooks or a specialised scheduler framework to run Spark jobs as part of a batch-processing system.
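
As a sketch of what such a scheduler hook might look like, the snippet below assumes that Apache Airflow with its Spark provider is used to trigger the packaged job once a day; the DAG name, the file path and the spark_default connection are hypothetical.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    # Run the packaged Spark job as a nightly batch; it can also be triggered
    # manually from the Airflow UI when an on-demand run is needed.
    with DAG(
        dag_id="daily_sales_report",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        run_spark_job = SparkSubmitOperator(
            task_id="run_sales_report",
            application="/opt/jobs/sales_report.py",  # the packaged job produced by the pipeline
            conn_id="spark_default",
            name="sales_report_batch",
        )

Other scheduler frameworks, such as Oozie or a plain cron entry that calls spark-submit, play the same role; the essential point is that the scheduler, rather than a developer, invokes the packaged job.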

In this segment, you learnt about some of the factors that you should consider while running jobs in the production environment, as well as how Spark jobs are deployed there.

Additional Content

  • Job Scheduling in Spark – Link to the official Spark documentation on Job Scheduling in Spark
  • What is it like to use Apache Spark in production? – Link to a Quora post describing how Spark is used in production and the experience of using it.
