Module Introduction

Welcome to the module on ‘Apache Spark’. Before we dive in, let’s recap some of the concepts that you have learnt so far in the data engineering course:

Apache Hadoop: It is a collection of software utilities that are used for scalable distributed computing.

Hadoop Distributed File System (HDFS): It is a filesystem that is designed to store GBs/TBs of data with high reliability.

Apache Hive: It is an abstraction that is used for reading, writing and managing large data sets residing in HDFS, using SQL-type queries.

Apache Sqoop: It is a framework that is used to transfer data between an RDBMS and HDFS, importing data from the RDBMS into HDFS and exporting it from HDFS back to the RDBMS.

Coming to Apache Spark, it is a big data processing engine that has gained a lot of popularity in a very short period of time. Since it joined the Apache Software Foundation in 2013, global giants, such as Amazon, Alibaba, Yahoo and Uber, have been using Spark extensively in their production environments. In the upcoming video, you will learn about the contents of this module from our SME Vishwa Mohan.

This module contains many coding demos and coding practice questions, and we recommend that you code along with the SMEs in your Jupyter notebooks. It is also beneficial to learn how to debug code through online research. When you run into errors, refer to the documentation and forums, where you will often find answers to errors that occur frequently.

Guidelines for this module

Apart from the first session, the module is practical in nature and it includes a number of code demos on various Spark APIs. We recommend that you start this module early to complete it within a week’s time. To understand the concepts better, we also recommend that you go through the platform text before attempting the in-segment questions. The videos in this module are presentation-based. The presentation used in each session is provided in the corresponding ‘Session Summary’ segment. The lecture notes for this module will also be attached in the last session of the module.

—Note—

In this module, you will work with Apache Spark installed on your AWS EMR (single-node) instance. In ‘Optimizing Spark for Large Scale Data Processing’ (Course-04), you will use an Amazon EMR multi-node cluster and learn various Spark optimisation techniques used in the industry.

—————————————————————-

Guidelines for in-segment and graded questions

Graded questions will be included in a separate session. All other sessions will contain questions that are not graded. In this module, for each graded question, you will be awarded 10 marks for a correct response and 0 for an incorrect response. For each graded question, you will be allowed 1 attempt, whereas you may be allowed 1 or at most 2 attempts for non-graded questions depending on the type of question and the number of options.

People you will hear from in this module

Adjunct Faculty

Vishwa Mohan

Senior Software Engineer, LinkedIn

Vishwa is currently working as a senior software engineer at LinkedIn, an online employment-oriented platform. He has over 9 years of experience in the IT industry and has worked in various companies, including Amazon, Walmart, Oracle and others. He has deep knowledge of various tools and technologies that are used today.

Adjunct Faculty

Kautuk Panday

Senior Data Engineer

Kautuk is currently working as a senior data engineer. He has over 9 years of experience in the IT industry and has worked for several companies. He has deep knowledge of the various tools and technologies that are in use today.

Adjunct Faculty

Sachin Arora

Big Data and ML Expert, Capgemini

Adjunct Faculty

Sajan Kedia

Data Science Lead, Myntra

Sajan completed his undergraduate and postgraduate degrees in Computer Science Engineering at IIT (BHU). He heads the pricing team at Myntra, where he actively works with technologies like Data Science, Big Data, Spark and Machine Learning. Presently, his work mainly involves the development of discounting strategies for all the products offered by Myntra.

Adjunct Faculty

Praveen Singh

Principal Data Engineer, Noodle.ai

Praveen is currently working as a Principal Data Engineer at Noodle.ai. He is a results-oriented Big Data architect with strong coding and execution skills. Praveen has more than half a decade of proven experience in building service-oriented architectures, distributed applications and Big Data solutions for both large enterprises and startups.

Presenter

Abhinav Rawat

upGrad
