Module Introduction

Welcome to the module on 'Real-Time Data Processing Using Spark Structured Streaming'.

In the previous modules on Spark, you learnt about Apache Spark and its ecosystem. You also learnt about core Spark concepts, such as RDDs and DataFrames, and how to use them to write Spark programs. Spark is a much faster big data processing framework than MapReduce because it focuses on in-memory computation. You also learnt certain optimisation techniques that can be applied to improve the performance of Apache Spark.

The two Spark modules that you have completed so far were centred around the Spark Core APIs, which are used to process data that has already been collected (batch processing). In this module, you will learn about another very interesting Apache Spark API, which is used to process data in near real time. This API is called Spark Streaming.

In the upcoming video, our SME, Kautuk, will give you an overview of the topics that will be covered in this module.

Session 01

In the first session, you will learn about data streaming, its importance and a few industry examples. Next, you will learn about the difference between micro-batching and streaming, followed by an introduction to Spark Streaming and its APIs.
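As a quick preview of what a Spark Streaming program looks like, here is a minimal sketch of the classic Structured Streaming word count in PySpark. It assumes text is being sent to a socket on localhost port 9999 (the host, port and application name are illustrative choices, not fixed by this module):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Read lines from a socket source, e.g. netcat listening on port 9999
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words and maintain a running count per word
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the full updated result table to the console after each micro-batch
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()

Each micro-batch of incoming lines updates the running counts, which is exactly the micro-batching behaviour introduced in this session.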

Session 02

In the second session, you will learn about the importance of the Structured Streaming API and why it is preferred in the industry. Following this, you will learn about the differences between static and streaming DataFrames with transformations, along with triggers and the various output modes. You will then work with structured streams and learn how to read files into a stream, operate triggers and output modes, and apply this knowledge through coding examples. Finally, you will learn how joins work on streams and implement a few coding examples.
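To give a flavour of the file-source, trigger and output-mode ideas from this session, here is a small sketch. It assumes CSV files are dropped into an input directory over time; the schema, directory path and trigger interval are illustrative placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType

spark = SparkSession.builder.appName("FileStreamDemo").getOrCreate()

# Streaming file sources require an explicit schema up front;
# these column names are only examples
schema = (StructType()
          .add("name", StringType())
          .add("score", IntegerType()))

# Treat every new CSV file that appears in the directory as streaming input
df = (spark.readStream
      .schema(schema)
      .csv("/tmp/stream-input"))

# 'append' output mode, with a micro-batch fired every 10 seconds
query = (df.writeStream
         .outputMode("append")
         .format("console")
         .trigger(processingTime="10 seconds")
         .start())
query.awaitTermination()

Changing the outputMode or the trigger interval here is a good way to observe the behaviours discussed in the session.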

For the purpose of this module, you will need to install netcat on your machines. Please refer to the commands below to do this.

yum update -y

[root@localhost ~]# yum update -y
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
epel/x86_64/metalink                                     | 6.4 kB  00:00:00
 * base: mirrors.piconets.webwerks.in
 * epel: ftp.jaist.ac.jp
 * extras: mirrors.piconets.webwerks.in
 * updates: mirrors.piconets.webwerks.in
base                                                     | 3.6 kB  00:00:00
epel                                                     | 4.7 kB  00:00:00
extras                                                   | 2.9 kB  00:00:00
updates                                                  | 2.9 kB  00:00:00
(1/3): extras/7/x86_64/primary_db                        | 190 kB  00:00:00
(2/3): epel/x86_64/updateinfo                            | 1.0 MB  00:00:01
(3/3): epel/x86_64/primary_db                            | 6.8 MB  00:00:17
No packages marked for update

yum install -y nc

[root@localhost ~]# yum install -y nc
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirrors.piconets.webwerks.in
 * epel: mirrors.piconets.webwerks.in
 * extras: mirrors.piconets.webwerks.in
 * updates: mirrors.piconets.webwerks.in
Resolving Dependencies
--> Running transaction check
---> Package nmap-ncat.x86_64 2:6.40-19.el7 will be installed
--> Processing Dependency: libpcap.so.1()(64bit) for package: 2:nmap-ncat-6.40-19.el7.x86_64
--> Running transaction check
---> Package libpcap.x86_64 14:1.5.3-12.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package          Arch        Version              Repository           Size
================================================================================
Installing:
 nmap-ncat        x86_64      2:6.40-19.el7        base                206 k
Installing for dependencies:
 libpcap          x86_64      14:1.5.3-12.el7      base                139 k

Transaction Summary
================================================================================
Install  1 Package (+1 Dependent package)

Total download size: 345 k
Installed size: 740 k
Downloading packages:
(1/2): nmap-ncat-6.40-19.el7.x86_64.rpm                  | 206 kB  00:00:00
(2/2): libpcap-1.5.3-12.el7.x86_64.rpm                   | 139 kB  00:00:00
--------------------------------------------------------------------------------
Total                                           500 kB/s | 345 kB  00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : 14:libpcap-1.5.3-12.el7.x86_64                              1/2
  Installing : 2:nmap-ncat-6.40-19.el7.x86_64                              2/2
  Verifying  : 2:nmap-ncat-6.40-19.el7.x86_64                              1/2
  Verifying  : 14:libpcap-1.5.3-12.el7.x86_64                              2/2

Installed:
  nmap-ncat.x86_64 2:6.40-19.el7

Dependency Installed:
  libpcap.x86_64 14:1.5.3-12.el7

Complete!
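Once installed, netcat can act as a simple text server for Spark's socket source. For example, the following command keeps netcat listening on port 9999 (the port is just an illustrative choice that must match the one used in your Spark program); anything you type into this terminal becomes input for the connected stream:

[root@localhost ~]# nc -lk 9999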

Session 03

In the third session, we will cover event time and processing time, and you will learn about the concept of windows and how to handle late-arriving data using watermarks.
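As a preview of windows and watermarks, the sketch below counts words over 10-minute event-time windows sliding every 5 minutes, and uses a watermark to drop data that arrives more than 10 minutes late. The socket source, column names and interval sizes are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window

spark = SparkSession.builder.appName("WindowedWordCount").getOrCreate()

# Socket source that also attaches a timestamp to each line; in a real
# pipeline the event time would come from the data itself
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .option("includeTimestamp", True)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"),
                     lines.timestamp)

# Tolerate data up to 10 minutes late, then count each word per
# 10-minute window sliding every 5 minutes
windowed = (words
            .withWatermark("timestamp", "10 minutes")
            .groupBy(window(words.timestamp, "10 minutes", "5 minutes"),
                     words.word)
            .count())

query = (windowed.writeStream
         .outputMode("update")
         .format("console")
         .option("truncate", False)
         .start())
query.awaitTermination()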

Session 04

Finally, in the fourth session, you will be introduced to Kafka and learn how to integrate it with Spark Streaming. In this session, we will build an application to process and analyse tweets in real time using Kafka and Spark Streaming.
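As a preview of the integration, here is a minimal sketch of reading a stream from Kafka. The broker address and topic name are placeholders, and running it assumes you submit the job with the spark-sql-kafka connector package matching your Spark version:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaStreamDemo").getOrCreate()

# Subscribe to a Kafka topic; broker address and topic name are placeholders
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "tweets")
      .load())

# Kafka delivers binary key/value pairs; cast the payload to readable strings
messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

query = (messages.writeStream
         .outputMode("append")
         .format("console")
         .start())
query.awaitTermination()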

Important Note:

You will need to set up a Twitter developer account for this session. Since the approval process takes time (6-7 days), we advise you to start working on it right away.

Here is the link to the document on how to set up a Twitter developer account.

Guidelines for this module

Since this module contains a lot of coding-based content, we advise that you keep practising the various commands used throughout the module and actively attempt the in-segment questions as well. Also, note that you need to use the EC2 instance carefully while running the different Spark Streaming programs, as EC2 is a costly service and you might overshoot your budget if you leave your EC2 instance running for a long duration.

Please do read the lab documents for all the coding sessions and refer to them whenever you have any doubt about the commands used in this module. The presentation used in each session is provided in its ‘Session Summary’ segment. The lecture notes for this module are included in the last segment, at the end of the module.

Guidelines for in-segment and graded questions

There will be a separate session for graded questions; all other sessions will have non-graded questions. The graded questions in this module carry 10 marks each for a correct answer and 0 for an incorrect answer. Each graded question allows only one attempt, whereas non-graded questions may allow one or two attempts depending on the question type and the number of options.

People you will hear from in this module

Subject Matter Expert

Kautuk Pandey

Senior Data Engineer

Kautuk is currently working as a senior data engineer. He has over 9 years of experience in the IT industry and has worked for several companies. He has deep knowledge of the various tools and technologies that are in use today.
