Earlier, you used RDDs to implement the word count and UserPin problem statements, so you are well aware of the capabilities of RDDs. You can say that RDDs form the core of Spark. But one question arises – what is the need for different classes of APIs when we already have RDDs?
In the next video, let’s first understand the evolution of Spark APIs.
Even though RDDs are the core abstraction of Spark, they have certain limitations. In the next video, you will learn about the limitations of RDDs from our industry expert.
Note
At timestamp 1:38, the SME refers to the code map(lambda v: v[0]/v[1]); the code shown in the video is incorrect. It should be map(lambda v: v[1][0]/v[1][1]).
Limitations of RDDs
- Data stored with the RDD abstraction is unstructured. Spark knows that each data point is an object with certain parameters (or attributes) associated with it, but it cannot look inside the object to learn anything more about those parameters.
- RDDs are a low-level abstraction. The code spells out very low-level details about how a job is executed. For instance, consider a piece of code that finds the average of a set of data points, grouped by key.
Take a look at this code:
Example
rdd1 = spark.sparkContext.parallelize([('a', 10), ('b', 15), ('a', 5), ('c', 12), ('b', 6)])
avg_by_key = rdd1.mapValues(lambda x: (x, 1))\
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))\
    .map(lambda v: v[1][0] / v[1][1])
avg_by_key.collect()
Output
[7.5, 12.0, 10.5]
For better understanding, we will analyse this code step by step:
Example
# Creates a paired RDD.
rdd1 = spark.sparkContext.parallelize([('a', 10), ('b', 15), ('a', 5), ('c', 12), ('b', 6)])
Example
# Pair each value with a count of 1
avg_by_key1 = rdd1.mapValues(lambda x: (x, 1))
Example
# You can see the output at each step by using the collect() action
avg_by_key1.collect()
Output
[('a', (10, 1)), ('b', (15, 1)), ('a', (5, 1)), ('c', (12, 1)), ('b', (6, 1))]
Example
# reduceByKey() adds up the values and the counts for each key
avg_by_key2 = avg_by_key1.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
avg_by_key2.collect()
Output
[('a', (15, 2)), ('c', (12, 1)), ('b', (21, 2))]
Example
# map() computes the average for each key by dividing the sum of its values by the count of values that contributed to that sum
avg_by_key = avg_by_key2.map(lambda v: v[1][0] / v[1][1])
Example
# Final output
avg_by_key.collect()
Output
[7.5, 12.0, 10.5]
In this code, the process of finding the average is broken down into smaller map and reduce operations:
- Map each value to a (value, 1) pair.
- Add up the values and the counts to arrive at one sum per key.
- Divide each sum by the number of data points that contributed to it.
In the code above, each logical step in calculating the average has to be written as a map or reduce statement. Now consider the same computation written in the structured API abstraction, as shown below:
df1.groupBy("product").avg("price")
This code can be understood easily: even without knowing its underlying details, you can tell what it is meant to do. You can see that code written with the higher-level APIs is far more intuitive than the RDD version.
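If you would like to run this one-liner yourself, here is a minimal, self-contained sketch of the same average-by-key computation using the DataFrame API. The column names product and price and the sample rows are illustrative assumptions, not part of the original example.
Example
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avg_demo").getOrCreate()

# Illustrative data: 'product' plays the role of the key, 'price' the value
df1 = spark.createDataFrame(
    [('a', 10), ('b', 15), ('a', 5), ('c', 12), ('b', 6)],
    ["product", "price"]
)

# We only state the 'what'; the 'how' (map, sum, divide) is left to Spark
df1.groupBy("product").avg("price").show()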
To summarise, the RDD programming paradigm expresses the ‘how’ of a solution rather than the ‘what’. This makes both the computation and the data types opaque to the optimiser.
Also, since RDDs do not carry a schema, the Spark execution engine cannot distinguish used columns from unused ones and therefore has to keep all of them in memory. Columns that are never used in the user program still end up occupying cluster memory.
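As a hedged sketch of this contrast (the columns and rows below are illustrative, and an active SparkSession named spark is assumed, as in the other snippets): with a DataFrame, Spark sees the schema and can drop columns a query never touches; with an RDD, every record carries all of its fields.
Example
# Illustrative DataFrame with an extra column that the query never uses
df = spark.createDataFrame(
    [('a', 10, 'north'), ('b', 15, 'south'), ('a', 5, 'east')],
    ["product", "price", "region"]
)

# The schema is known, so Spark can prune 'region' out of this query
df.select("product", "price").groupBy("product").avg("price").show()

# As an RDD of Row objects, every field of every record stays in memory,
# whether or not the computation ever touches it
df.rdd.map(lambda row: (row["product"], row["price"])).take(3)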
Now, in the next video, let’s try to understand why these new APIs are known as structured APIs.
Structured APIs
As the name suggests, data stored in these abstractions is structured, that is, in the form of rows and columns. Storing data in a table-like format ensures that Spark can read the details of each feature (column) of every data point.
The code written in a structured API is readable and intuitive, just like writing a command using the Pandas library.
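As a rough comparison, assuming a pandas DataFrame pdf and a Spark DataFrame df1 that both hold product and price columns, the two commands read almost identically:
Example
# pandas: average price per product
pdf.groupby("product")["price"].mean()

# PySpark DataFrame API: the same intent, expressed in nearly the same way
df1.groupBy("product").avg("price").show()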
Another major upgrade that the structured APIs bring is the Catalyst Optimiser. As the name suggests, it is a built-in optimiser that optimises your query plan to get the maximum performance in the minimum possible time. The actual workflow of this optimiser will be dealt with in the next session; for now, it is enough to know that structured APIs can perform jobs much faster than RDDs.
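As a small illustration (not covered in the video), you can peek at what the Catalyst Optimiser produces by calling explain() on a DataFrame; df1 here is the illustrative DataFrame from the earlier sketch.
Example
# Print the physical plan that Catalyst generated for this aggregation
df1.groupBy("product").avg("price").explain()

# Pass True to also print the parsed, analysed and optimised logical plans
df1.groupBy("product").avg("price").explain(True)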
Before we move forward, remember that Spark SQL is the module used for structured data processing. DataFrames, Datasets, and SQL tables and views (often referred to as SparkSQL) are APIs built on Spark SQL for writing your data flows. As you proceed, you'll see that whether you use the DataFrame, Dataset or SQL API, the functionality is called from the pyspark.sql module. With this in mind, let's understand a bit more about the three APIs.
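For instance, the entry point and the helper functions for all of these APIs are imported from pyspark.sql; a brief, hedged sketch (df1 is again the illustrative DataFrame from the earlier snippets):
Example
from pyspark.sql import SparkSession        # entry point for the structured APIs
from pyspark.sql import functions as F      # column functions such as avg() and col()

spark = SparkSession.builder.appName("structured_apis").getOrCreate()

# The same aggregation, this time using an explicit function from pyspark.sql
df1.groupBy("product").agg(F.avg("price").alias("avg_price")).show()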
There are three structured APIs in Spark, which are as follows:
- DataFrames: A collection of data organised into a table form with rows and columns. DataFrames allow processing over large amounts of structured data. One of the major differences between DataFrames and RDDs is that data in a DataFrame is organised into rows and columns, whereas in an RDD it is not. However, DataFrames do not have compile-time type safety.
- Datasets: An extension of DataFrames that combines the features of both DataFrames and RDDs. Datasets provide an object-oriented interface for processing data safely, that is, an interface where each record is treated as a typed object and its fields are accessed through that object. Note that Datasets are available only in the JVM-based languages Scala and Java, not in Python or R. Datasets have compile-time type safety.
- SQL tables and views (SparkSQL): With SparkSQL, you can run SQL queries against tables or views organised into databases (see the brief sketch after this list).
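Here is the brief sketch referred to above: the same DataFrame can be registered as a temporary view and then queried with SQL, and both routes go through the same Spark SQL engine. The view name sales is an illustrative assumption, and df1 is the illustrative DataFrame from the earlier snippets.
Example
# Register the DataFrame as a temporary view
df1.createOrReplaceTempView("sales")

# Query the view with SQL; the result is again a DataFrame
spark.sql("SELECT product, AVG(price) AS avg_price FROM sales GROUP BY product").show()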
Do not worry if you haven’t understood some of the terminology, such as compile-time type safety; you’ll learn more about these terms in the upcoming segments.