
Understanding Disk IO in Spark

In the previous segment, you learnt about the two major approaches to optimising Spark jobs at the code level. In this session, you will learn about the techniques used for reducing Disk IO. But first, you need to understand what Disk IO involves.

In the next video, our SME will walk you through the delays that are part of Disk IO and also give you an overview of some of the techniques used for reducing Disk IO.

In the video above, you understood what Disk IO actually means and also learnt about some techniques for reducing it. You will further explore this in the subsequent segments.

In simple terms, Disk IO is the process of reading data from or writing data to secondary storage devices such as hard drives (in commodity hardware, these are typically magnetic hard disks).

This process is much slower than in-memory operations owing to the mechanical nature of magnetic hard drives, as shown in the figure given below.

The total delay of a Disk IO operation is higher than the delays caused by other operations because hard drives are much slower than primary memory such as RAM.
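
To get a feel for the gap, the back-of-the-envelope sketch below estimates how long it takes to move the same amount of data through a hard drive and through RAM. The throughput figures and all names used are illustrative assumptions, not measurements; real values depend on the hardware and the access pattern.

```python
# Rough, assumed sequential throughputs in MB/s (illustrative only, not measured).
ASSUMED_HDD_MB_PER_S = 150.0
ASSUMED_RAM_MB_PER_S = 10_000.0


def transfer_seconds(size_mb: float, throughput_mb_per_s: float) -> float:
    """Time needed to move size_mb of data at a steady throughput."""
    return size_mb / throughput_mb_per_s


data_mb = 1_000.0  # hypothetical 1,000 MB of intermediate data
hdd_seconds = transfer_seconds(data_mb, ASSUMED_HDD_MB_PER_S)
ram_seconds = transfer_seconds(data_mb, ASSUMED_RAM_MB_PER_S)
slowdown = hdd_seconds / ram_seconds

print(f"HDD: {hdd_seconds:.2f} s, RAM: {ram_seconds:.2f} s, ratio: {slowdown:.0f}x")
```

Under these assumed figures, the disk path is tens of times slower than the in-memory path, which is why it pays to touch the disk as rarely as possible.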

Reducing Disk IO wherever possible can significantly improve the performance of Spark jobs.

Some of the techniques that you can use to reduce Disk IO are as follows:

  • Avoid shuffling as much as possible: Shuffling leads to the creation of stages. When this happens, the data at the stage boundary is written to disk so that the job remains fault-tolerant.
  • Use optimised file formats (Parquet and ORC): Optimised file formats such as Parquet and ORC not only reduce the size of the files but also speed up reading the data. This is because these file formats are columnar, so a job can read only the columns it needs.
  • Use appropriate serialization and deserialization: Using appropriate serialization and deserialization techniques for in-memory storage and cached data helps in reducing Disk IO.
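
The first two ideas can be illustrated without a cluster. The plain-Python sketch below is a simulation under stated assumptions, not Spark code: the partitions, the table and all function names are hypothetical. The first part counts how many records cross a stage boundary with and without combining values per key inside each partition beforehand (the idea behind preferring reduceByKey over groupByKey in Spark). The second part compares how many bytes must be scanned to read a single column from a row-oriented layout and from a column-oriented layout.

```python
import struct
from collections import Counter

# --- Part 1: fewer records at the stage boundary (simulated shuffle) ---

# Hypothetical input: each inner list stands for one partition of (key, value) records.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1), ("c", 1)],
]


def shuffled_without_combine(parts):
    """Every record crosses the stage boundary unchanged."""
    return [record for part in parts for record in part]


def shuffled_with_combine(parts):
    """Each partition sums its values per key first, then emits one record per key."""
    out = []
    for part in parts:
        local = Counter()
        for key, value in part:
            local[key] += value
        out.extend(local.items())
    return out


def final_totals(records):
    """Reduce side: sum all values per key after the simulated shuffle."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return dict(totals)


plain = shuffled_without_combine(partitions)
combined = shuffled_with_combine(partitions)
print("records shuffled:", len(plain), "vs", len(combined))

# --- Part 2: fewer bytes scanned with a columnar layout ---

# Hypothetical table of (user_id, age, score) rows.
rows = [(1, 34, 88.5), (2, 27, 92.0), (3, 45, 79.25)]

# Row layout: int32, int32, float64 packed together for every row (no padding).
row_store = b"".join(struct.pack("<iid", *row) for row in rows)

# Column layout: all ages stored contiguously as int32 values.
age_column = struct.pack(f"<{len(rows)}i", *(row[1] for row in rows))

# To read only the ages, the row layout forces a scan over every full row,
# whereas the column layout only needs the age column's bytes.
bytes_scanned_row_layout = len(row_store)
bytes_scanned_column_layout = len(age_column)
print("bytes scanned:", bytes_scanned_row_layout, "vs", bytes_scanned_column_layout)
```

Both shuffle variants produce the same totals per key; the difference lies only in how much data has to be written at the boundary. Real Parquet and ORC files add encoding and compression on top of the columnar layout, which this sketch does not model.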

Note:

Kryo serialization is generally used in JVM-based (Java or Scala) Spark jobs, whereas in Python (PySpark), either the Marshal serializer or the Pickle serializer is used, depending on the use case.
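
The trade-off behind the Python choice can be seen with Python's own marshal and pickle modules, on which PySpark's MarshalSerializer and PickleSerializer are based. The sketch below is a standard-library illustration, not PySpark code, and the record contents are hypothetical: marshal handles only a limited set of built-in types, while pickle handles far more kinds of objects.

```python
import datetime
import marshal
import pickle

# Hypothetical record made only of basic built-in types.
simple_record = {"id": 1, "tags": ["x", "y"], "score": 3.5}

# Both modules round-trip basic built-in types.
assert marshal.loads(marshal.dumps(simple_record)) == simple_record
assert pickle.loads(pickle.dumps(simple_record)) == simple_record

# Hypothetical record holding a richer type (a date object).
rich_record = {"id": 2, "joined": datetime.date(2024, 1, 15)}

# pickle can serialize and restore the date.
pickle_ok = pickle.loads(pickle.dumps(rich_record)) == rich_record

# marshal supports only a fixed set of built-in types and rejects the date.
try:
    marshal.dumps(rich_record)
    marshal_ok = True
except ValueError:
    marshal_ok = False

print("pickle handles the date:", pickle_ok)
print("marshal handles the date:", marshal_ok)
```

In PySpark itself, the serializer is typically chosen when the SparkContext is created, through its serializer argument. A marshal-based serializer suits jobs whose records are simple built-in types, while a pickle-based one is the safer choice for arbitrary Python objects.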

In this segment, you understood what Disk IO actually means and also learnt about some important techniques to reduce Disk IO for Spark jobs. You will learn about these techniques in detail in the subsequent segments.

Additional Content

  • An article on Disk IO at stage boundaries: A Stack Overflow article that explains what happens at stage boundaries and how Disk IO occurs there.
