This session has been a comprehensive study of the basics of Spark. It has introduced you to the various libraries and functionalities available in Spark, which make it a powerful yet easy-to-use framework. In the upcoming video, Vishwa Mohan will summarise the topics that were discussed in the first session.
To summarise this session:
- “Apache Spark™ is a unified analytics engine for large-scale data processing.” It is an open-source, distributed computing engine that provides a productive environment for data analysis owing to its fast in-memory processing and its support for various libraries.
- Why use in-memory data processing systems?
	- Real-time data processing – Since data in memory can be accessed quickly, in-memory processing suits use cases where immediate results are required.
	- Random data access – Since data is stored in RAM, it can be accessed randomly without scanning the entire storage.
	- Iterative and interactive operations – Intermediate results are stored in memory rather than on disk, so they can be reused in subsequent computations, as in the caching sketch below.
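To make the iterative-reuse point concrete, here is a minimal PySpark sketch: caching an RDD keeps the intermediate result in memory so that multiple actions can reuse it without recomputing. The data and computations are illustrative, not taken from the session.

```python
# A minimal sketch of reusing an in-memory intermediate result in PySpark;
# the data and computations are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachingDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 1001))
squares = numbers.map(lambda x: x * x).cache()  # keep the intermediate RDD in memory

# Both actions reuse the cached intermediate result instead of recomputing it.
total = squares.sum()
evens = squares.filter(lambda x: x % 2 == 0).count()
print(total, evens)

spark.stop()
```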
- The Spark architecture includes a driver node and worker nodes. A cluster manager allocates resources to each component of the architecture, as illustrated in the configuration sketch below.
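The sketch below shows where the cluster manager fits from an application's point of view: the driver program names a master and requests executor resources when building its session. The master URL and memory value here are assumptions for illustration, not values from the session.

```python
# An illustrative sketch of pointing an application at a cluster manager and
# requesting executor resources; the master URL and memory value are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ArchitectureDemo")
    # Local mode with 2 threads here; on a real cluster this could be
    # "yarn" or a standalone master such as "spark://host:7077".
    .master("local[2]")
    .config("spark.executor.memory", "2g")  # memory requested per executor
    .getOrCreate()
)
spark.stop()
```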
- Spark RDDs (Resilient Distributed Datasets) are the core data structure in Spark; a minimal example follows below.
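As a small preview of what the next session covers in depth, here is a minimal RDD sketch showing the transformation/action pattern; the sample data is made up.

```python
# A minimal sketch of the RDD API; the sample data is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "rdd", "dataframe", "dataset"])
lengths = words.map(len)      # transformation: lazily defines a new RDD
print(lengths.collect())      # action: triggers computation -> [5, 3, 9, 7]

spark.stop()
```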
- The DataFrame and Dataset APIs are useful for handling structured data, as in the sketch below.
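A brief DataFrame sketch, with made-up rows and column names. Note that PySpark exposes only the DataFrame API; the typed Dataset API is available in Scala and Java.

```python
# A brief sketch of the DataFrame API on structured data; the rows and column
# names are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameDemo").master("local[*]").getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], ["name", "age"])
df.filter(df.age > 30).select("name").show()  # prints the rows where age > 30

spark.stop()
```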
In the next session, you will learn about the basic Spark data structure: RDDs. In subsequent sessions, you will deepen your understanding of RDDs and the other APIs through various code demos and case studies.