
Setting up an Amazon EMR cluster

So far, you have learnt about Amazon EMR and its features and benefits. In this segment, you will learn how to set up an EMR cluster to process big data.

In the following video, our faculty member Kautak will walk you through the Amazon EMR dashboard and then show you the steps to create a generic EMR cluster. You will use these same steps in future modules, with some modifications to the services that you install during cluster setup.

Please make sure that you have made the changes in the security group for your EMR cluster as shown in the video; otherwise, you will not be able to log in to your EMR instance. 
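The exact security group changes are shown in the video and are not reproduced in this text. As a rough sketch only, assuming the change is to allow inbound SSH (TCP port 22) from your own IP address, the rule could be described in Python as below. The function name `build_ssh_ingress_rule`, the security group ID and the IP address are hypothetical placeholders. The dictionary follows the argument shape of the EC2 `authorize_security_group_ingress` call in boto3.

```python
def build_ssh_ingress_rule(security_group_id: str, my_ip_cidr: str) -> dict:
    """Build keyword arguments for EC2's authorize_security_group_ingress.

    Assumption: the required change is an inbound SSH rule (TCP port 22)
    restricted to a single address range.
    """
    return {
        "GroupId": security_group_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": 22,  # SSH
                "ToPort": 22,
                "IpRanges": [
                    {"CidrIp": my_ip_cidr, "Description": "SSH from my machine"}
                ],
            }
        ],
    }


# Placeholder values: replace with your master node's security group and your IP.
rule = build_ssh_ingress_rule("sg-0123456789abcdef0", "203.0.113.25/32")

# To apply the rule (needs boto3 and AWS credentials), something like:
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(**rule)
```

Restricting the rule to a single `/32` address, rather than opening port 22 to everyone, is a design choice made here for safety; follow the video if it differs.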

In the video, you saw the various services and configurations that you can customise for your EMR cluster. Make sure that you follow the hardware and software configurations properly, as these same steps will be used in future modules as well. You also learnt about the EMR dashboard and its different components that will help you to keep track of the status of your EMR cluster and its configurations.
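The video creates the cluster through the console. For reference, the same kind of hardware and software choices can be written down as a request for the boto3 EMR `run_job_flow` call. The sketch below only builds the request dictionary. The release label, instance type, instance count, key pair name and application list are illustrative assumptions, not the values used in the video, and `build_cluster_request` is a hypothetical helper name.

```python
def build_cluster_request(name: str, key_name: str, applications: list,
                          instance_count: int = 3) -> dict:
    """Build keyword arguments for the EMR run_job_flow call.

    All concrete values below are placeholders for illustration.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.29.0",  # assumed release; use the one from the video
        # Software configuration: one entry per service to install.
        "Applications": [{"Name": app} for app in applications],
        # Hardware configuration.
        "Instances": {
            "MasterInstanceType": "m4.large",  # assumed instance type
            "SlaveInstanceType": "m4.large",
            "InstanceCount": instance_count,
            "Ec2KeyName": key_name,  # key pair used later to log in
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster running
        },
        # Default IAM roles that EMR offers to create for you.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request(
    "generic-emr-cluster", "my-key-pair", ["Hadoop", "Hive", "Sqoop"]
)

# To launch the cluster (needs boto3 and AWS credentials), something like:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Changing the `applications` list is the programmatic counterpart of selecting different services during cluster setup in later modules.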

You have learnt the steps to create an EMR cluster. You will now need to make some configuration changes to it. These changes are essential for optimal performance of your EMR cluster; without them, many services such as Apache Hive and Apache Sqoop might not work at all. The changes affect the YARN parameters of your Hadoop EMR cluster. You will learn more about these parameters in the next module.

In this video, you learnt the steps to set up YARN parameters for your EMR cluster. You will need to make these changes whenever you start up a new EMR cluster.
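The specific parameters and values are given in the video and are not repeated in this text. Purely as an illustration, EMR also accepts YARN settings as a `yarn-site` configuration classification supplied when a cluster is created. The sketch below builds such a classification. The two property names are real YARN settings, but the values are made-up placeholders, and `build_yarn_site_configuration` is a hypothetical helper, not the method used in the video.

```python
def build_yarn_site_configuration(properties: dict) -> list:
    """Wrap YARN properties in an EMR 'yarn-site' configuration classification.

    EMR expects property values as strings, so every value is converted.
    """
    return [
        {
            "Classification": "yarn-site",
            "Properties": {key: str(value) for key, value in properties.items()},
        }
    ]


# Placeholder values only: use the parameters and values shown in the video.
yarn_configuration = build_yarn_site_configuration(
    {
        "yarn.nodemanager.resource.memory-mb": 6144,
        "yarn.scheduler.maximum-allocation-mb": 6144,
    }
)
```

A list in this shape can be passed as the `Configurations` argument of the boto3 EMR `run_job_flow` call, so that a new cluster starts with the settings already in place.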

In the previous videos, you saw how to create an EMR cluster from scratch and then set up the YARN parameters on it. However, you will eventually need to terminate this EMR cluster, and repeating all of the steps above every time you need a new one can become time-consuming. Instead of going through every configuration step and then setting up the YARN parameters again, you can clone a previously configured EMR cluster. Please note that cloning does not copy the data from your old EMR cluster. It copies only the hardware and software configurations, along with any additional configurations, such as the YARN parameters, that you made on the previous cluster.

In this video, you saw how cloning can be a time-efficient way of setting up an EMR cluster. Please note that in future modules you might have to set up an EMR cluster with a different set of software packages, in which case you will have to follow all of the steps to set up an EMR cluster from scratch, including the YARN parameters. However, when you need exactly the same setup as one of your older EMR clusters, you can use this method to quickly bring up a replica.
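The video uses the console to clone a cluster. To make the idea concrete, the sketch below copies only the software settings (release, applications and configurations) from a cluster description into a new request. The input mimics the shape of a boto3 EMR `describe_cluster` response, but the sample here is written by hand and is not real output. `build_clone_request` is a hypothetical helper, and hardware settings are deliberately left out of this sketch.

```python
def build_clone_request(described: dict, new_name: str) -> dict:
    """Copy software settings from a describe_cluster-style response.

    No data is copied: only the release label, the application list and
    the configuration classifications are carried over.
    """
    cluster = described["Cluster"]
    return {
        "Name": new_name,
        "ReleaseLabel": cluster["ReleaseLabel"],
        # Keep only the application names; versions follow the release label.
        "Applications": [
            {"Name": app["Name"]} for app in cluster.get("Applications", [])
        ],
        # Includes any yarn-site settings that were supplied as configurations.
        "Configurations": cluster.get("Configurations", []),
    }


# Hand-written sample in the shape of a describe_cluster response (placeholders).
old_cluster = {
    "Cluster": {
        "Name": "generic-emr-cluster",
        "ReleaseLabel": "emr-5.29.0",
        "Applications": [
            {"Name": "Hadoop", "Version": "2.8.5"},
            {"Name": "Hive", "Version": "2.3.6"},
        ],
        "Configurations": [
            {
                "Classification": "yarn-site",
                "Properties": {"yarn.nodemanager.resource.memory-mb": "6144"},
            }
        ],
    }
}

clone_request = build_clone_request(old_cluster, "replica-emr-cluster")
```

Note that settings changed by hand on a running cluster, rather than supplied as configurations, would not appear in a description like this one.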

In the previous segment, you learnt about EMR Notebooks and how they can be used alongside EMR clusters to run Spark jobs from Jupyter Notebooks. In the following video, you will see how to create a new EMR Notebook and link your running EMR cluster to it.

In this video, you saw how you can create an EMR Notebook and then link it to your EMR cluster. You will be using this service later with Apache Sqoop to run Spark jobs on EMR clusters.

Please note that you can connect any EMR cluster that you create in the future to your EMR Notebook by following these same steps.

In the next segment, you will learn how you can log in to your EMR instance as well as how file transfer can be done between the EMR instance and your local machine.
