
Module Introduction

Welcome to the module on ‘Real-Time Data Processing Using Spark Structured Streaming’.

In the previous modules on Spark, you learnt about Apache Spark and its ecosystem. You also learnt about the core concepts of Spark, such as Spark RDDs and Spark DataFrames, and how to use them to write Spark programs. Spark is a much faster big data processing framework than MapReduce because it focuses on in-memory computation. You also learnt certain optimisation techniques that can be applied to improve the performance of Apache Spark.

The two Spark modules that you have completed so far were centred on Spark's core APIs, which are used to process data that has already been collected (batch processing). In this module, you will learn about another very interesting Apache Spark API, which is used to process data in near real time. This API is called Spark Streaming.

Session 01

In the first session, you will learn about data streaming: its importance and a few industry examples. Next, you will learn about the difference between micro-batching and streaming, followed by an introduction to Spark Streaming and its APIs.

Session 02

In the second session, you will learn about the importance of the Structured Streaming API and why it is preferred in the industry. Following this, you will learn about the differences between static and streaming DataFrames and the transformations they support, along with triggers and the various output modes. You will then work with structured streams and learn how to read files into a stream and how triggers and output modes behave, applying this knowledge through coding examples. Finally, you will learn how joins work on streams and will implement a few coding examples.

Session 03

In the third session, we will cover event and processing times, and you will learn about the concept of windows and how to handle late-arriving data using watermarks.

Session 04

Finally, in the fourth session, you will be introduced to Kafka and learn how to integrate it with Spark Streaming.

Guidelines for this module

Since this module contains a lot of coding-based content, we advise that you keep practising the various commands used throughout the module and actively attempt the in-segment questions as well. Also, note that you need to use the EMR instance carefully while running the different Spark Streaming programs, as EMR is a costly service and you might overshoot your budget if you leave your EMR instance running for a long duration.

Please do read the lab documents for all the coding sessions and refer to them whenever you have any doubt about the commands used in this module. The presentation used in each session is provided in its ‘Session Summary’ segment. The lecture notes for this module will also be included at the end, in the last segment.

Guidelines for in-segment and graded questions

There will be a separate session for graded questions; all other sessions will have non-graded questions. Each graded question in this module carries 10 marks for a correct answer and 0 for an incorrect answer. Each graded question allows only 1 attempt, whereas non-graded questions may allow 1 or 2 attempts, depending on the question type and the number of options.

People you will hear from in this module

Subject Matter Expert

Kautuk Pandey

Senior Data Engineer

Kautuk is currently working as a senior data engineer. He has over 9 years of experience in the IT industry and has worked for several companies. He has deep knowledge of the various tools and technologies that are in use today.
