
Session Summary

This session was a comprehensive study of the Spark Structured APIs. It introduced the various DataFrame operations that Spark offers for analysing DataFrames, which make Spark both powerful and easy to use. In this video, Vishwa Mohan summarises the topics discussed in this session.

The Jupyter notebook used for analysing the police case study using DataFrames and Spark SQL is:

Let us summarise this session:

  • DataFrames: A collection of data organised in tabular form with rows and columns. They allow processing of large amounts of structured data.
  • Datasets: An extension of DataFrames that combines the features of both DataFrames and RDDs.
  • SQL tables and views (Spark SQL): With Spark SQL, you can run SQL-like queries against views or tables organised into databases.
  • Catalyst optimiser: The Catalyst optimiser creates an internal representation of a user's program, called a query plan. Once the initial version of the query plan is ready, Catalyst applies a series of transformations to convert it into an optimised query plan.
  • File formats: Different file formats can be loaded into DataFrames, and DataFrames can be saved in different file formats.
