Spinning Up a Spark EMR Cluster

In the previous module, you learnt about the various concepts of Apache Spark and also learnt how to program in PySpark. Until now, you have run Spark jobs on the CDH EC2 instance. However, the CDH EC2 instance is a single-node cluster, and maintaining a multi-node cluster of EC2 instances for long periods of time is quite expensive. This is where Amazon EMR (Elastic MapReduce) clusters come into play.

In the upcoming video, you will learn about the EMR cluster and its various features.

Note

In the following video at 0:06, the SME is supposed to start with “Welcome everyone to this new segment…”

In the video above, you got an overview of the EMR cluster, its features and its advantages and disadvantages over traditional EC2 instances.

Amazon EMR is a managed cluster platform that provides an expandable low-configuration service. It also comes with ready-made big data tools depending on the cluster configuration that you choose.


It supports dynamic resizing, which means that it can scale the number of nodes automatically on the fly. You can also control such parameters manually when running Spark jobs.
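As a concrete illustration of controlling these parameters manually, the sketch below assembles a spark-submit command that overrides the cluster's default executor settings. The file path and the specific values are illustrative assumptions, not recommendations for your workload.

```python
# A minimal sketch of manually setting Spark resource parameters when
# submitting a job on an EMR cluster. The application path and the default
# values below are hypothetical.
def build_spark_submit(app_path, num_executors=4, executor_memory="4g",
                       executor_cores=2):
    """Assemble a spark-submit command that overrides cluster defaults."""
    return [
        "spark-submit",
        "--deploy-mode", "cluster",            # run the driver on the cluster
        "--num-executors", str(num_executors), # how many executors to launch
        "--executor-memory", executor_memory,  # memory per executor
        "--executor-cores", str(executor_cores),
        app_path,
    ]

cmd = build_spark_submit("s3://my-bucket/jobs/etl_job.py")
print(" ".join(cmd))
```

You would run the resulting command on the master node of the cluster; higher values give the job more resources but also increase the cost of keeping those nodes running.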

Compared to EC2 instances, there are several major advantages of using an EMR cluster. Some of these advantages are as follows:

  • It provides automatic scaling and ensures minimal loss of HDFS data. Also, since spot instances are mostly used with EMR, it is generally cheaper than a similarly configured set of EC2 instances over long periods of time.
  • It allows for dynamic orchestration of a new cluster on demand, along with easy termination of the cluster once the work is complete. The actual steps involved in creating an EMR cluster are quite straightforward, and most of the configuration takes place in the background. You can also use AWS Step Functions to automate the process of creating entire data processing and analysis workflows with minimal code. You can read more about this concept in the Additional Content section of this segment.
  • It enables direct access to data stored on S3 and to data sources in other connected AWS services, such as Hive tables.
  • It ensures high availability of the Slave nodes by constantly monitoring each node and automatically replacing all unhealthy nodes with new nodes.
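To make the on-demand orchestration mentioned above more concrete, the sketch below builds the kind of request you could pass to boto3's EMR client (`boto3.client("emr").run_job_flow(**request)`) to spin up a cluster programmatically. The cluster name, instance types, and counts are illustrative assumptions; the field names follow the EMR RunJobFlow API. The actual call is not made here, since it requires AWS credentials and incurs cost.

```python
# A hedged sketch of a programmatic EMR cluster request. To actually launch
# the cluster you would pass this to boto3.client("emr").run_job_flow(**request).
# All names, types and counts below are hypothetical.
request = {
    "Name": "spark-demo-cluster",          # hypothetical cluster name
    "ReleaseLabel": "emr-6.2.0",           # EMR release to launch
    "Applications": [{"Name": "Spark"}, {"Name": "JupyterHub"}],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1},
            # Spot instances for the slave (core) nodes keep the cost down,
            # as noted in the advantages above.
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 2, "Market": "SPOT"},
        ],
        # Keep the cluster alive after any submitted steps finish.
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}
```

Tools such as AWS Step Functions essentially automate the construction and submission of requests like this one as part of a larger workflow.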

However, it is important to note that EMR clusters are not the perfect alternative to Amazon EC2 machines and have some flaws as well. Some of these flaws are as follows:

  • EMR does not have a management console similar to that of Cloudera Manager, thereby making it much harder to manage and monitor the various services.
  • Even though EMR ensures high availability of the Slave nodes, it does not ensure the same for the cluster’s master node, which makes it a single point of failure.
  • One of the biggest disadvantages of using EMR clusters is that you cannot shut them down; you can only terminate them. Since the data stored on an EMR cluster is lost upon termination, you need to back up the data that you want to keep to S3 buckets. Fortunately, EMR clusters automatically save the Jupyter Notebooks that you create to S3 buckets.
  • While the automatic replacement of unhealthy nodes ensures high availability of the Slave nodes, it also creates a high chance of losing the data present on the unhealthy nodes.
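Because cluster data is lost on termination, it helps to plan a backup to S3 before terminating. The sketch below only maps local files to S3 keys; the actual upload would use boto3, e.g. `boto3.client("s3").upload_file(path, bucket, key)`, which is omitted here so the example stays self-contained. The bucket name and prefix are hypothetical.

```python
import os

# A minimal sketch of planning a local-to-S3 backup before terminating an
# EMR cluster. It builds (local_path, bucket, s3_key) tuples; the actual
# upload step (boto3's upload_file) is deliberately left out.
def plan_backup(local_dir, bucket="my-backup-bucket", prefix="emr-backup"):
    """Walk local_dir and pair each file with a target S3 bucket and key."""
    plan = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            # Preserve the directory layout under the given S3 prefix.
            key = f"{prefix}/{os.path.relpath(path, local_dir)}"
            plan.append((path, bucket, key))
    return plan
```

Running the resulting plan through `upload_file` before termination ensures that nothing you care about exists only on the cluster's ephemeral storage.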

In the next video, our SME Vishwa will walk you through the steps involved in spinning up an EMR cluster.

Now that you are aware of the steps involved in creating an EMR cluster, you will learn how to launch a Jupyter Notebook on the EMR cluster so that you can start working on your PySpark jobs. In the next video, our SME will show you how to do this.

The documents attached below contain the steps involved in creating an EMR cluster and the steps for launching a Jupyter Notebook on top of it.

You may need to upload data sets to your Amazon S3 buckets in order to use them with your EMR cluster. The document attached below contains the steps on how to upload the data sets using your EC2 instance.

The links to the various Data Sets and Python files required in this module are given below.

Please use the following commands to download them to your CDH EC2 instance, and then store them in your S3 buckets using the method described in the document above, so that you can access them in the Jupyter Notebooks used throughout this module.

Also, make sure to update the lines in the Jupyter Notebooks that mention the location of the S3 bucket so that they point to where you saved the files in your own S3 bucket.
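The kind of edit described above can be sketched as a small helper that swaps one bucket name for another in a line of notebook code. The bucket names and the example read statement are hypothetical; in practice you would simply edit the relevant cells by hand.

```python
# A small illustration of pointing a notebook's S3 paths at your own bucket.
# Both bucket names below are hypothetical.
def rewrite_s3_path(line, old_bucket, new_bucket):
    """Replace one S3 bucket name in a line of notebook code."""
    return line.replace(f"s3://{old_bucket}/", f"s3://{new_bucket}/")

line = 'df = spark.read.csv("s3://course-bucket/data/ratings.csv", header=True)'
print(rewrite_s3_path(line, "course-bucket", "my-own-bucket"))
# -> df = spark.read.csv("s3://my-own-bucket/data/ratings.csv", header=True)
```

Whichever way you make the change, the key is that every `s3://...` path in the notebooks must match the bucket and folder structure you actually created.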

Note

Before you execute the Jupyter Notebooks, please change the kernel to PySpark, if it is not already PySpark, so that the code executes correctly.

In this segment, you got an overview of Amazon EMR, its various features, and its advantages and disadvantages over EC2 instances. You also learnt how to spin up an EMR cluster and launch a Jupyter Notebook on top of it.

Again, we request you to exercise caution while working with Spark EMR clusters, since this service is very costly and can easily overshoot your budget. Consider terminating your cluster once you are done for the day.

Additional Content

  • Official Amazon EMR Documentation: Link to the official Amazon EMR documentation
  • Using Step Functions to Orchestrate Amazon EMR Workloads: An article on the concept of AWS Step Functions and how it can be used to orchestrate Amazon EMR workloads.