Serialization and Deserialization in Spark

In the previous segment, you learnt how using optimised file formats can reduce Disk IO.

In this segment, you will learn about another technique that can be used for reducing Disk IO: Serialization and Deserialization.

In the next video, our SME will explain what exactly Serialization and Deserialization are and why they are important. You will also learn about the different types of serialization techniques available in PySpark.

In the video above, you learnt about the concept of Serialization and Deserialization. They can be defined as follows:

Serialization is the process of converting the state of an object into a stream of bits and bytes so that it can either be stored somewhere or transferred over the network.

Deserialization is the process of converting this byte stream back to the original object state.
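
As a minimal, standalone illustration, the sketch below uses Python's built-in pickle module (which the Pickle serializer discussed later in this segment is based on); the dictionary is just a hypothetical example object:

import pickle

# An in-memory Python object whose state we want to preserve
user = {"name": "Asha", "scores": [98, 87, 92]}

# Serialization: convert the object's state into a byte stream
payload = pickle.dumps(user)   # bytes that can be stored on disk or sent over the network

# Deserialization: rebuild the original object from the byte stream
restored = pickle.loads(payload)

print(restored == user)        # True: the object's state is fully recovered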

Serialization plays an important role in how a Spark job performs. Any format that is slow to serialize objects or that consumes a large number of bytes adversely affects the computation to a large extent.

In Python, you have access to two types of serializers: the Marshal serializer and the Pickle serializer.

Serialization is used to maintain the performance of a job in distributed systems. It is especially helpful when saving objects to disk or sending them over the network, as discussed in the video above.

In the case of Spark RDDs, the RDDs may be serialized to:

  • Decrease memory usage when stored in a serialised form;
  • Reduce network bottleneck in processes such as shuffling (as the size of the data itself may be reduced); and
  • Tune performance of Spark operations, as this helps in reducing both the Disk and Network IO.

Note that objects can be serialized before sending them to the Spark worker nodes.
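
For example, assuming an existing Spark Context named sc (created as shown in the next sketch), any driver-side object referenced inside a task is serialized and shipped to the worker nodes along with that task:

# A plain Python object living on the driver
lookup = {"a": 1, "b": 2}

rdd = sc.parallelize(["a", "b", "a"])

# The lambda below references `lookup`, so the closure (including `lookup`)
# is serialized on the driver and sent to the worker nodes before execution.
print(rdd.map(lambda key: lookup[key]).collect())   # [1, 2, 1]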

In PySpark, serializers are set during the creation of the Spark Context.
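
A minimal sketch of this is shown below (assuming a local-mode context; the application name is purely illustrative, and import paths may vary slightly across PySpark versions):

from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer   # or PickleSerializer

# The serializer is passed while constructing the Spark Context and
# cannot be changed once the context is running.
sc = SparkContext("local", "serializer-demo", serializer=MarshalSerializer())

# Every RDD created through this context now uses the Marshal serializer.
print(sc.parallelize(range(1000)).map(lambda x: x * 2).take(5))   # [0, 2, 4, 6, 8]

sc.stop()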

The differences between the two types of serializers available in PySpark are summarised below.

  • Marshal serializer: The faster of the two, but it supports only a limited set of standard Python data types.
  • Pickle serializer: Supports almost any Python object, but it is comparatively slower.

So, you need to choose a serializer based on your use case. Typically, if you want your Spark job to be more flexible in terms of the data types it can handle, you should select the Pickle serializer. If the data types you need to support are limited and are supported by the Marshal serializer, you should prefer it, as it is much faster.

In the next video, our SME will demonstrate how you can use the Marshal and Pickle serializers in your Spark jobs during the creation of the Spark Context.

The link to the Jupyter Notebook used in this segment is given below.

Note:

Please note that you may get different results when you run these Jupyter Notebooks. This may be due to changes in network bandwidth and other internal factors.


In this segment, you learnt about the concepts of Serialization and Deserialization and their importance in terms of performance optimisation for Spark. You also learnt about the Marshal and Pickle serializers and how to implement them in your PySpark jobs.

Additional Content

  • Official documentation of PySpark serializers: Link to the official documentation on serializers in PySpark
