This session has been a comprehensive study of the basics of Spark. It has introduced you to the various libraries and functionalities available in Spark, which make it a powerful yet easy-to-use framework. In the upcoming video, Vishwa Mohan will summarise the topics that were discussed in the first session.
To summarise this session:
- “Apache Spark™ is a unified analytics engine for large-scale data processing.” It is an open-source, distributed computing engine that provides a productive environment for data analysis owing to its fast in-memory processing and its support for various libraries.
- Why use in-memory data processing systems?
	- Real-time data processing – Since data in memory can be accessed quickly, in-memory processing suits use cases where immediate results are required.
	- Random data access – Since data is stored in RAM, it can be accessed randomly without scanning the entire storage.
	- Iterative and interactive operations – Intermediate results are stored in memory rather than on disk, so they can be reused in subsequent computations, as in the caching sketch below.
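To make the iterative-reuse point concrete, here is a minimal PySpark sketch: caching an RDD keeps the intermediate result in memory so that multiple actions can reuse it without recomputing. The data and computations are illustrative, not taken from the session.

```python
# A minimal sketch of reusing an in-memory intermediate result in PySpark;
# the data and computations are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachingDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 1001))
squares = numbers.map(lambda x: x * x).cache()  # keep the intermediate RDD in memory

# Both actions reuse the cached intermediate result instead of recomputing it.
total = squares.sum()
evens = squares.filter(lambda x: x % 2 == 0).count()
print(total, evens)

spark.stop()
```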
- The Spark architecture includes a driver node and worker nodes. A cluster manager allocates resources to each component of the architecture, as illustrated in the configuration sketch below.
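The sketch below shows where the cluster manager fits from an application's point of view: the driver program names a master and requests executor resources when building its session. The master URL and memory value here are assumptions for illustration, not values from the session.

```python
# An illustrative sketch of pointing an application at a cluster manager and
# requesting executor resources; the master URL and memory value are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ArchitectureDemo")
    # Local mode with 2 threads here; on a real cluster this could be
    # "yarn" or a standalone master such as "spark://host:7077".
    .master("local[2]")
    .config("spark.executor.memory", "2g")  # memory requested per executor
    .getOrCreate()
)
spark.stop()
```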
- Spark RDDs (Resilient Distributed Datasets) are the core data structure in Spark; a minimal example follows below.
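As a small preview of what the next session covers in depth, here is a minimal RDD sketch showing the transformation/action pattern; the sample data is made up.

```python
# A minimal sketch of the RDD API; the sample data is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "rdd", "dataframe", "dataset"])
lengths = words.map(len)      # transformation: lazily defines a new RDD
print(lengths.collect())      # action: triggers computation -> [5, 3, 9, 7]

spark.stop()
```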
- The DataFrame and Dataset APIs are useful for handling structured data, as in the sketch below.
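A brief DataFrame sketch, with made-up rows and column names. Note that PySpark exposes only the DataFrame API; the typed Dataset API is available in Scala and Java.

```python
# A brief sketch of the DataFrame API on structured data; the rows and column
# names are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameDemo").master("local[*]").getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], ["name", "age"])
df.filter(df.age > 30).select("name").show()  # prints the rows where age > 30

spark.stop()
```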
In the next session, you will learn about the basic Spark data structure: RDDs. In subsequent sessions, you will deepen your understanding of RDDs and the other APIs through various code demos and case studies.