
DataFrames and Datasets

Structured APIs have a lot of advantages over RDDs, which is why they have recently taken an important place in the industry. Let us learn about the structured APIs one by one, starting with DataFrames.

DataFrames in Spark are very similar to dataframes in pandas, except that they also follow lazy evaluation, like RDDs. The data is available in the form of rows and columns, and each column in a dataframe has a specific data type associated with it. This structure of columns with specific data types is called the schema of a dataframe. Once the schema of a dataframe is established, new data can also be read into the same schema.
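
To make the idea of a schema concrete, here is a minimal PySpark sketch; the session name, column names and sample rows are illustrative assumptions, not part of the course material.

Example

Python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("schema_demo").getOrCreate()

# A schema: every column has a name and a specific data type.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("salary", IntegerType(), True),
])

# Build a small dataframe against that schema and inspect it.
df = spark.createDataFrame([("Asha", 6000), ("Ravi", 4500)], schema=schema)
df.printSchema()

# Once the schema is established, new data can be read into the same schema,
# e.g. spark.read.csv(<path to a CSV file>, schema=schema, header=True)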

Benefits of DataFrames:

  • High-level API: Code written using the dataframe abstraction will feel familiar, as it is structured very similarly to dataframes in pandas. The code is highly readable and easy to write, so it becomes accessible to many people besides specialists such as data engineers or data scientists.
  • Catalyst optimiser: DataFrames have the ability to optimise code internally. These optimisations are performed by the Catalyst optimiser, which can rearrange the code to make it run faster without any difference in the output. (The sketch after this list shows how to inspect the optimised plan.)
  • Memory management: You saw in the previous module that transformations on RDDs are performed in the executors. These transformations take place in the heap memory of the executors, which are essentially JVMs. DataFrames, on the other hand, make use of off-heap memory in addition to the heap memory of the JVMs, and managing both heap and off-heap memory requires custom memory management. As you go through the next couple of points, you’ll understand why utilising off-heap memory is beneficial.
  • As the executor works on its tasks, garbage objects accumulate, and at some point it has to run a ‘full garbage collection’ process. During this process, the machine slows down, as the complete heap memory has to be scanned. So, the larger the JVM heap, the more time this takes. Note that garbage collection runs only on JVM memory, which is heap memory.
  • Off-heap memory storage: An elegant solution to the above problem is to store data off the JVM heap, but still in RAM and not on the disk. With this method, one can allocate a large amount of memory off-heap without GC slowing down the machine. If the data is stored off-heap, the JVM heap can be made smaller, and the GC process will not affect performance as much, since GC runs only on the JVM (heap) memory. Another advantage of off-heap memory is that it requires much less storage space than JVM memory, as Java objects in the JVM carry extra storage overhead. (A configuration sketch follows this list.)
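
As referenced in the list above, here is a minimal PySpark sketch of how off-heap storage can be enabled through configuration and how the Catalyst-optimised plan of a dataframe can be inspected; the application name and the 2 GB off-heap size are illustrative assumptions, not recommendations.

Example

Python
from pyspark.sql import SparkSession

# Off-heap storage is opt-in; the size here is purely illustrative.
spark = (SparkSession.builder
         .appName("offheap_demo")
         .config("spark.memory.offHeap.enabled", "true")
         .config("spark.memory.offHeap.size", "2g")
         .getOrCreate())

df = spark.createDataFrame([("Asha", 6000), ("Ravi", 4500)], ["name", "salary"])

# explain() prints the plan produced by the Catalyst optimiser.
df.filter("salary > 5000").explain()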

Just like dataframes, datasets are also high-level APIs capable of working with structured data in Spark. Datasets combine the advantages of RDDs, such as compile-time type safety, with the advantages of dataframes, such as the Catalyst optimiser and the ease of high-level programming. Before we move forward, you should note that datasets are available only in JVM-compatible languages, that is, in Java and Scala, and not in Python. With this in mind, let’s take a look at some of the features of datasets explained in the upcoming video.

So, as you learnt in the video, the features of a Dataset are as follows:

  • High-level API: If you compare code written using RDDs with code written using datasets, you will notice that the code written in the dataset abstraction is elegant, readable and short. Since the instructions given are high level, the Spark optimisation engine can figure out ways to improve the performance of the code. This is one of the features of dataframes that datasets also offer.
  • Compile-time type safety: It is the ability of dataset abstraction to catch errors in the code at compile time itself. 
  • Let’s try to understand this in detail. Suppose that, while typing code, you make a syntax error or call a function that does not exist in the API. An error will then be raised during compilation. These errors are called syntax errors, and dataframes also have the ability to catch them at compile time; this is called syntax error safety. But what about logical mistakes made while coding?

For instance, let’s say you typed a wrong column name or performed an integer operation on string-type data. This category of errors is known as analysis errors. RDDs and datasets have compile-time analysis error safety as well. Consider SQL-style code written in the dataframe abstraction using Python:

Example

Python
df.filter("salary > 5000")

  • You will be practising coding on structured APIs in the upcoming segments. For now, let us break this code down and understand its details. The code is written in the dataframe abstraction using Python. ‘df’ represents a dataframe that already exists. On that dataframe, the ‘filter’ operation is called with an argument that describes the required criteria.
  • Note that the actual condition is passed as a string. The compiler cannot check the correctness of the statement, so even if you make a mistake here, for example, if there were no column called salary in the dataframe, the compiler would not raise an error. The error would only be raised at runtime, when the string inside the filter is evaluated. Now, consider the same code written using the dataset abstraction in Scala:

Example

Scala
dataset.filter(_.salary > 5000)
  • The compiler, in this case, is able to check the actual condition, because .salary is called as a method on the dataset rather than passed as a string. If there is no column named ‘salary’ in the dataset, an error is raised at compile time itself, because the method is invalid. By catching these errors at compile time, datasets help save a lot of computational resources and developer time.
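
For contrast with the compile-time behaviour above, here is a minimal PySpark sketch of the dataframe-side behaviour; the dataframe contents and the misspelt column name ‘salry’ are hypothetical.

Example

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("runtime_error_demo").getOrCreate()
df = spark.createDataFrame([("Asha", 6000), ("Ravi", 4500)], ["name", "salary"])

try:
    # 'salry' is a deliberate typo; there is no compile step to catch it.
    df.filter("salry > 5000").show()
except Exception as e:  # typically an AnalysisException
    print("Caught only at runtime:", type(e).__name__)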

                   SQL        DataFrames      Datasets
Syntax Errors      Runtime    Compile-time    Compile-time
Analysis Errors    Runtime    Runtime         Compile-time
  • Encoders: The primary function of an encoder is translation between heap (JVM) memory and off-heap memory. It achieves this by serialising and deserialising the data. Serialisation is the process of converting a JVM object into a byte stream; this conversion is necessary because off-heap memory does not understand objects, it understands byte streams. The data stored in off-heap memory is, therefore, serialised in nature.
  • When a Spark job is run, the job query is first optimised by Spark. At runtime, the encoder takes this optimised query and converts it into byte instructions. In response to these instructions, only the required data is deserialised. This also happens in the case of DataFrames.
  • Reduction in memory usage: Datasets “understand” the data as it is structured, and can create efficient layouts to reduce the memory usage while caching. 
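
Although datasets themselves are limited to Scala and Java, PySpark dataframes benefit from the same idea when cached: because the schema is known, the data is stored in a compact, columnar in-memory format rather than as generic Java objects. A minimal sketch with illustrative data:

Example

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache_demo").getOrCreate()
df = spark.createDataFrame([("Asha", 6000), ("Ravi", 4500)], ["name", "salary"])

df.cache()              # mark the dataframe for caching
df.count()              # an action materialises the cache
print(df.storageLevel)  # storage level used for the cached data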

Remember, though, that datasets are only available in Scala and Java; in Python (PySpark), only dataframes are available. In recent versions of Spark, dataframes have adopted all the functionalities of datasets, except for the features that come from the programming language itself, such as the benefits of a strongly-typed language. Scala is a strongly-typed language, which means that the rules for typing code are stricter: all the variables and constants used in a program need to have a predefined data type. In Python, however, it is okay not to define the data types.

As far as dataframes and datasets are concerned, they are the same thing in PySpark. Going forward, we will use dataframes in this module.
