
Course Introduction

The next course of this programme is ‘Data Engineering – II’. In the previous course, you learnt about batch data processing techniques. However, there is another type of data, called streaming data, that needs to be processed as it arrives in real time; you will learn how to handle it in this course.

You will first learn how to optimise Spark for large-scale data processing. You will then learn about the tools and techniques for processing data in real time. After that, you will learn how to orchestrate end-to-end data pipelines and, finally, use Apache Spark concepts for predictive analysis and EDA. Let’s quickly take a look at the different modules in this course and what to expect in each of them.

Optimising Spark for Large-Scale Data Processing

The learning objective of this module is to optimise Spark for large-scale data processing.

First, you will set up a Spark EMR cluster and run a Spark job, which you will optimise at the end of the module. Next, you will understand the need for optimising Spark jobs in the industry and learn about the various approaches for achieving this. You will learn to optimise disk I/O and network I/O, and then to optimise the Spark cluster configuration. You will also learn about the various job deployment modes in Spark. At the end of this module, you will apply the optimisation concepts covered throughout the module to the Spark job that you wrote at the beginning.
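
To give you a flavour of what such tuning looks like in practice, here is a minimal, hypothetical PySpark sketch with a few common configuration knobs set explicitly. The application name, the values and the input path are illustrative assumptions, not the course's recommended settings.

    # A minimal sketch of configuration-level Spark tuning; values are illustrative.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("optimisation-demo")                    # hypothetical job name
        .config("spark.executor.memory", "4g")           # heap size per executor
        .config("spark.executor.cores", "2")             # cores per executor
        .config("spark.sql.shuffle.partitions", "200")   # partitions created after a shuffle
        .config("spark.serializer",
                "org.apache.spark.serializer.KryoSerializer")  # faster serialisation
        .getOrCreate()
    )

    df = spark.read.parquet("s3://my-bucket/input/")  # hypothetical input path
    df.cache()        # keep the data in memory across repeated actions
    print(df.count())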

Real-Time Data Streaming With Apache Kafka

In this module, you will learn about Apache Kafka and how it works. First, you will be introduced to Apache Kafka. You will then learn about some of the key features of Kafka and its various use cases. Next, you will learn about its architecture. You will then learn about the internals of Kafka and understand its terminology, such as producers, consumers, topics and partitions. You will then learn how to create topics and send messages to a topic through a producer, and you will also write a consumer that reads messages from this topic. Finally, you will learn about Kafka Connect and Kafka Streams.
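
As a preview, here is a minimal producer/consumer sketch using the third-party kafka-python package (an assumption; the course may use a different client or language). The broker address and topic name are hypothetical.

    # A minimal sketch with kafka-python; broker and topic are placeholders.
    from kafka import KafkaProducer, KafkaConsumer

    # Producer: send one message to a topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("demo-topic", value=b"hello, kafka")
    producer.flush()  # block until the message is actually delivered

    # Consumer: read messages from the same topic, starting from the beginning.
    consumer = KafkaConsumer(
        "demo-topic",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
    )
    for message in consumer:
        print(message.topic, message.partition, message.offset, message.value)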

Real-Time Data Processing Using Spark Streaming

In this module, you will learn about stream processing using Spark Streaming. You will learn about the Spark Streaming API and its features, and how it can be used with sources like Kafka and Flume. Next, you will learn about Structured Streaming, its APIs and its advantages, and then build a simple Spark Streaming application that consumes streaming data. After this, you will learn about triggers and the various output modes, work with structured streams and learn how to read from files into a stream. Next, you will learn about transformations and aggregation functions. You will also learn to monitor Spark Streaming jobs and visualise their metrics in the Spark UI. Finally, you will learn about the concept of windows.
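
As a preview of these ideas, here is a minimal Structured Streaming sketch that consumes a Kafka topic, aggregates it and writes running counts to the console with an explicit trigger and output mode. The broker and topic names are hypothetical, and reading from Kafka assumes the spark-sql-kafka connector is available on the classpath.

    # A minimal Structured Streaming sketch; broker/topic are placeholders.
    # Requires the spark-sql-kafka connector (e.g. added via --packages).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

    # Source: subscribe to a Kafka topic as an unbounded streaming DataFrame.
    lines = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "demo-topic")
        .load()
        .select(col("value").cast("string"))
    )

    # A simple aggregation: running count of records per distinct value.
    counts = lines.groupBy("value").count()

    # Sink: print the full result table every 10 seconds ('complete' mode).
    query = (
        counts.writeStream
        .outputMode("complete")
        .format("console")
        .trigger(processingTime="10 seconds")
        .start()
    )
    query.awaitTermination()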

Assignment – Stock Data Analysis (Optional)

This assignment mainly revolves around Apache Kafka and Apache Spark, two of the most widely used tools in the industry for real-time data processing. As part of this project, you will be provided with real-time global equity data. Based on this data, you need to perform some real-time analyses to generate insights that can be used to make informed decisions. The data will be hosted on a centralised Kafka server.

Automating Data Pipelines Using Apache Airflow

In this module, you will learn about Apache Airflow and how you can automate your data pipelines using it. You will learn about some of its prominent features and concepts, such as DAGs and tasks, and its architecture. Next, you will get hands-on experience with Airflow: you will start with the installation process and then learn about operators in Airflow, going through specific operators one by one and understanding how to use them to their full potential. Finally, you will work on a real-world problem statement and put the skills that you have learnt so far to use. In the process, you will deep-dive into some of the advanced concepts of Apache Airflow, such as SubDAGs, XComs and trigger rules. Towards the end, you will look at some best practices to follow when developing DAGs for production, as well as some advantages and limitations of Airflow.
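
To make the DAG-and-task vocabulary concrete, here is a minimal, hypothetical DAG with two dependent tasks. It assumes Airflow 2.x import paths; the DAG id, schedule and commands are placeholders.

    # A minimal Airflow DAG sketch; ids, schedule and commands are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator


    def transform():
        # Placeholder for a real transformation step.
        print("transforming data")


    with DAG(
        dag_id="demo_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",  # run once a day
        catchup=False,               # do not backfill past runs
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        process = PythonOperator(task_id="transform", python_callable=transform)

        extract >> process  # 'extract' must finish before 'transform' starts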

Analytics Using PySpark

In this module, you will learn how to manipulate data on a large scale with the help of the Spark ML libraries, and you will cover the basic ML algorithms they provide. First, you will learn how to perform basic exploratory data analysis with Spark ML and learn about concepts such as Imputer and VectorAssembler. You will then learn how to perform linear regression as well as classification using Spark MLlib. You will also learn about cross-validation and the bias-variance tradeoff. Finally, you will go through logistic regression and k-means clustering with Spark ML, along with practice coding problems for both of these concepts.
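
Here is a minimal sketch tying together the pieces named above: an Imputer to fill missing values, a VectorAssembler to build a feature vector, and a logistic-regression model, chained in a Pipeline. The column names and toy data are hypothetical.

    # A minimal Spark ML pipeline sketch; columns and data are toy placeholders.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Imputer, VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("sparkml-demo").getOrCreate()

    df = spark.createDataFrame(
        [(1.0, 2.0, 0.0), (None, 3.0, 1.0), (4.0, None, 0.0), (5.0, 6.0, 1.0)],
        ["f1", "f2", "label"],
    )

    # Fill nulls with column means, assemble features, then fit the classifier.
    imputer = Imputer(inputCols=["f1", "f2"], outputCols=["f1_i", "f2_i"])
    assembler = VectorAssembler(inputCols=["f1_i", "f2_i"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    model = Pipeline(stages=[imputer, assembler, lr]).fit(df)
    model.transform(df).select("features", "label", "prediction").show()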

Classification Assignment (Optional)

This assignment mainly revolves around Apache Spark and the Spark ML library. As part of this project, you will have to identify the gender of a Twitter user from the user’s profile information. The dataset provided has information about users, such as their username, a random tweet by them, their profile image and their location. Based on this data, you need to train an algorithm to determine whether a Twitter account belongs to a man, a woman or an organisation. You need to build two models based on two different classification algorithms and compare the results. The data will be provided on the course platform as a downloadable file.

Course Project: Retail Data Analysis

This is a real-time data-processing project covering Apache Spark Streaming and Apache Kafka, two of the most widely used tools in the industry for real-time data processing. The project will test your knowledge of the real-time data-processing tools you learnt about throughout this course. You will go through a real-world use case from the retail sector: your task will be to ingest data from a centralised Kafka server in real time and process it to calculate various key performance indicators (KPIs).

This will conclude the second course of the programme.

Some Very Important Pointers Regarding This Course

  • Module deadlines: All of the modules are one week long, except “Analytics Using PySpark”, and the course project will last for two weeks.
  • Showcasing your projects on GitHub: Whenever you create a project, it is important to put it on GitHub with proper documentation. This will help you explain things better during an interview and definitely give you an extra edge. This programme has a good number of small projects, and you are advised to add all of them to your GitHub profile.
    • You must not add the course project to GitHub, as it is a graded component. As per university guidelines on preventing plagiarism, strict action will be taken if you do so.
  • Live coding: In the modules, the faculty will show you how to write queries or methods in the videos. You must try these demos at your end in parallel; only then will you be able to understand the concepts properly. If you do not, you may find the course projects and other future modules challenging.
  • Live sessions: Because of time constraints, there are a few important topics that may not be covered in the pre-recorded lectures in the module. To make sure you do not miss those concepts, you will have live sessions. 

As explained earlier, live session content will be over and above the programme content and will cover some interview-related concepts or some advanced content as well. 

  • Graded components: All the mandatory modules will have graded questions at the end to test your understanding of the module. You must answer these questions with utmost sincerity to perform better in the programme.
  • Lecture notes and interview questions: To help you better summarise the learnings from the modules, you will be getting lecture notes. You will also be provided with some of the most commonly asked interview questions around the concepts covered in the modules.

Careful Use of Resources

You will be using EC2 and EMR instances extensively in this course, and you will be allocated a fixed budget every month to access these AWS resources. Since AWS resources are paid, you must use them wisely: whenever you are done trying out a demo, stop or terminate the resources.
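
If you prefer to do this programmatically rather than through the AWS console, here is a minimal boto3 sketch (an assumption: your AWS credentials are already configured; the IDs below are placeholders).

    # A minimal boto3 sketch for releasing billed resources; IDs are placeholders.
    import boto3

    # Terminate an EMR cluster once you are done with a demo.
    emr = boto3.client("emr")
    emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])

    # Stop (not terminate) an EC2 instance so you can resume it later.
    ec2 = boto3.client("ec2")
    ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])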

Great! So now you know the do’s and don’ts of this course. You also have a brief idea of the learning objectives of this course, so let’s get started.

Important Note: The EMR Notebook Service is experiencing issues in AWS Academy. To work with Jupyter Notebook and Spark in AWS EMR in AWS Academy, please follow the steps mentioned in the documentation below.
