So far, you have built a working knowledge of Spark. We will now look at how Spark handles different data structures. As you learnt in the last few segments, Spark can load data from a wide range of sources. These sources provide a variety of data, which can be classified into the following two categories:
- Unstructured data
- Structured data
In the upcoming video, you will get an understanding of these data sources and data structures.
[00:35]
In the video, the SME says that "unstructured data is present in any kind of csv files". Note, however, that CSV files are structured in nature.
So, in the video, you learnt that based on the structure of the data, Spark has the following two APIs:
Unstructured APIs
- Unstructured data is generally free-form text that lacks a schema (which defines the organisation of the data).
- Examples of such data include text files, log files, images, videos, etc.
- To deal with unstructured data, Spark uses an unstructured API in the form of a Resilient Distributed Dataset (RDD).
- The RDD is the core abstraction of Spark, and it is used to work with unstructured data.
Structured APIs
- Structured data includes a schema.
- The data could be structured in a columnar or a row format.
- Structured data formats include ORC and Parquet files, as well as tables and DataFrames in SQL, Python, etc.
- To deal with this type of data, Spark provides multiple structured APIs: Spark SQL, DataFrames and Datasets.