
Spark Memory Management Parameters

In the previous segments, you learnt about techniques for reducing Disk IO. In this segment, you will learn about Spark memory management parameters and understand the concepts of persist and cache, which can be used to significantly reduce Disk IO.

In the next video, you will learn about the memory levels supported by Spark and understand the concepts of cache and persist as well as Checkpointing and Unpersist.

As you saw in the video above, Spark supports various Memory levels.

Let’s take a look at the RDD persistence logic to be followed while using these Memory levels (a short code sketch follows the list below).

  • MEMORY_ONLY: Whenever an RDD has to be persisted, if the entire RDD fits in the main memory, i.e., RAM, it is persisted there; but if the RDD is bigger than the available memory, the partitions that do not fit are not stored and have to be recomputed every time you need to use the RDD.
  • MEMORY_AND_DISK: This is similar to the MEMORY_ONLY level, except that if the RDD is bigger than the main memory, the spill-over partitions of the RDD are persisted on the Disk instead of being recomputed.
  • MEMORY_ONLY_SER (memory only, serialized): This is similar to the MEMORY_ONLY level, except that the RDD is stored as serialized Java objects before it is placed in the main memory. The RDD is stored in a more space-efficient manner than with MEMORY_ONLY, at the cost of extra CPU time for serialization and deserialization.
  • MEMORY_AND_DISK_SER (memory and disk, serialized): This is the same as the MEMORY_AND_DISK level, with the data stored in serialized form. If the RDD is still too big after serialization, the spill-over partitions of the serialized RDD are stored on the Disk.
  • DISK_ONLY: In this Memory level, the RDD partitions are persisted directly on the Disk, not in the main memory.
  • The other two Memory levels are MEMORY_ONLY_2 and MEMORY_AND_DISK_2, which are almost the same as their normal counterparts. The only difference is that each partition of the RDD is replicated on two cluster nodes.
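
To make these levels concrete, here is a minimal PySpark sketch (the application name and RDD contents are only illustrative) that requests a few of the storage levels listed above through persist(). Note that in the Python API, data is always stored in serialized (pickled) form, so the serialized levels are exposed as separate constants only in the Scala/Java API.

    # Illustrative PySpark sketch: requesting different storage levels via persist()
    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("storage-levels-demo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(1, 1_000_001))

    # Objects in RAM only; partitions that do not fit are recomputed on use.
    rdd.persist(StorageLevel.MEMORY_ONLY)
    rdd.unpersist()  # a storage level can be set only once, so release it first

    # Spill-over partitions go to the Disk instead of being recomputed.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rdd.unpersist()

    # Replicated variant: each partition is kept on two cluster nodes.
    rdd.persist(StorageLevel.MEMORY_AND_DISK_2)
    rdd.unpersist()

    # Disk only: partitions are written straight to the Disk.
    rdd.persist(StorageLevel.DISK_ONLY)

An RDD's storage level can only be assigned once, which is why the sketch calls unpersist() before requesting a different level on the same RDD.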

Typically, if you need to store the RDD in the main memory itself so that it does not have to be recalculated and processed each time, you can do so using persist and cache.

If you do not, then whenever you call an Action, the RDD will be recreated from scratch based on its lineage, which is time-consuming and causes Disk IO.

Persist and cache simply store the intermediate computation so that it can be reused easily, thereby reducing the overall Disk IO and improving job performance.

The difference between persist and cache is that cache stores RDDs at their default storage level, i.e., the MEMORY_ONLY level, and stores data sets at their default storage level, i.e., the MEMORY_AND_DISK level. On the other hand, you can use persist to assign user-defined storage levels (all the Memory levels listed above are supported) to RDDs and data sets. Both of these techniques are lazy in nature.
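
As a rough illustration of this difference (the RDD contents below are arbitrary), the following sketch caches one RDD at its default MEMORY_ONLY level and persists another at an explicitly chosen level; neither is materialised until an action runs.

    # Illustrative PySpark sketch: cache() vs persist() and their lazy behaviour
    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()
    sc = spark.sparkContext

    words = sc.parallelize(["spark", "cache", "persist", "spark"])
    pairs = words.map(lambda w: (w, 1))

    # cache() on an RDD is shorthand for persist(StorageLevel.MEMORY_ONLY).
    pairs.cache()

    # persist() lets you choose any of the storage levels explicitly.
    counts = pairs.reduceByKey(lambda a, b: a + b)
    counts.persist(StorageLevel.MEMORY_AND_DISK)

    # Both calls are lazy: nothing is stored until an action is triggered.
    print(counts.collect())  # first action computes and stores the data
    print(counts.count())    # reuses the persisted result instead of recomputing it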

There is another concept known as Checkpointing. Unlike cache and persist, Checkpointing breaks the Spark lineage, so Spark does not have to go back to the earlier stages to rebuild the RDD. The checkpoint itself is also computed as a separate job.

Note that Checkpointing is quite similar to restore points in the Windows operating system: you can set a restore point and later easily return to it and reconfigure the machine if an error is caused by a software update or installation.

Checkpoint data is persistent in nature and is not removed even after SparkContext is destroyed. You will learn more about checkpointing in the subsequent modules on Spark Streaming.
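
A minimal sketch of RDD Checkpointing in PySpark is given below; the checkpoint directory path is only an example, and on a real cluster it would normally point to reliable storage such as HDFS.

    # Illustrative PySpark sketch: checkpointing an RDD to truncate its lineage
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
    sc = spark.sparkContext

    # Checkpoint files are written to this directory (use an HDFS path on a cluster).
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    rdd = sc.parallelize(range(100)).map(lambda x: x * x)

    # Mark the RDD for checkpointing; its lineage is truncated once the files are written.
    rdd.checkpoint()

    rdd.count()                   # the action triggers both the job and the checkpoint write
    print(rdd.isCheckpointed())   # True once the checkpoint has been materialised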

You can use another technique known as Unpersist. So far, you have learnt how to persist and cache your RDDs and data sets in the main memory. However, if you need to persist more data, there might not be enough space left in the main memory to hold it.

If you want to manually remove these RDDs and data sets from the cache, you can do so with the help of the Unpersist method. Even if you do not remove the data manually using Unpersist, the cache follows the Least Recently Used (LRU) policy to evict data automatically.
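
The short sketch below (with arbitrary data) shows how unpersist() frees the memory occupied by a cached RDD once you no longer need it.

    # Illustrative PySpark sketch: manually evicting a cached RDD with unpersist()
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("unpersist-demo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(1000)).map(lambda x: x % 7)
    rdd.cache()
    rdd.count()      # the action materialises the cached partitions

    # Manually remove the RDD from the cache once it is no longer needed;
    # without this call, Spark would eventually evict it under the LRU policy.
    rdd.unpersist()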

Now that you have understood the theory behind these methods, in the following videos, you will learn how to implement them in a Spark job. You will also learn the difference between using these techniques and not using them in terms of how much processing time is saved.

In the next video, you will learn how to implement cache in your Spark jobs.

After that, you will learn how to implement persist in your Spark jobs.

Finally, in the following video, you will learn how to use the Unpersist method in your Spark jobs.
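
As a rough idea of how such a comparison can be made (the data size and timing approach below are only illustrative and are not the exact notebooks used in the videos), you can time the same action with and without caching:

    # Illustrative PySpark sketch: comparing job time with and without cache()
    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-timing-demo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(5_000_000)).map(lambda x: x * x)

    start = time.time()
    rdd.count()                                  # computed from scratch
    print("Without cache:", time.time() - start)

    rdd.cache()
    rdd.count()                                  # first action after cache() pays the storage cost

    start = time.time()
    rdd.count()                                  # served from memory, so typically much faster
    print("With cache:", time.time() - start)

    rdd.unpersist()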

The links to the Jupyter Notebooks used in this segment are given below.

Note:

Please note that you may get different results when you run these Jupyter Notebooks. This may be due to variations in network bandwidth and other internal factors.

In this segment, you learnt about the various memory storage levels and the persist and cache techniques. You also learnt about Checkpointing and Unpersist methods. Finally, you learnt how to actually implement the persist, cache and Unpersist methods in your Spark jobs.

Additional Content

  • Difference between Cache and Persist: Link to a Stack Overflow article on the differences between cache and persist.
  • RDD Persistence: Link to the Spark Documentation on RDD Persistence techniques.
