
Session Summary

This session has been a comprehensive study of the basics of Spark. It introduced you to the various libraries and functionalities available in Spark, which make it a powerful as well as easy-to-use framework. In the upcoming video, Vishwa Mohan will summarise the topics that were discussed in the first session.

To summarise this session:

  • "Apache Spark™ is a unified analytics engine for large-scale data processing." It is an open-source, distributed computing engine, and it provides a productive environment for data analysis owing to its lightning speed and support for various libraries. 
  • Why in-memory data processing systems?
      • Real-time data processing – Since data can be accessed quickly, in-memory processing suits cases where immediate results are required.
      • Random access to data in memory – Since data is stored in RAM, it can be accessed randomly without scanning the entire storage.
      • Iterative and interactive operations – Intermediate results are stored in memory rather than on disk, so they can be reused in subsequent computations.
  • The Spark architecture comprises a driver node and worker nodes; a cluster manager allocates resources to each component of the architecture.
  • Spark RDDs are the core data structure in Spark.
  • The DataFrame and Dataset APIs are useful for handling structured data.

In the next session, you will learn about the fundamental Spark data structure: the RDD. In upcoming sessions, you will deepen your understanding of RDDs and the other APIs through various code demos and case studies.
