
Using Various File Formats in Spark

In the previous segment, you learnt about some of the techniques used for reducing Disk IO in Spark. In this segment, you will learn how optimised columnar file formats can help reduce Disk IO in Spark.

In the next video, our SME will walk you through some of the common file formats used in PySpark and then discuss two of them in detail: Parquet and ORC.

In the video above, you learnt about some of the common file formats that are generally used in PySpark. You also learnt about two of these file formats: Parquet and ORC.

There are many file formats that you can use while writing Spark jobs. Some of the common file formats are .txt, .csv, .json, .avro, .parquet and .orc.

The text file format is the most common one and is supported by virtually every tool and platform in use today. Comma-separated values (CSV) is another common file format used for storing data sets. JavaScript Object Notation (JSON) has a dictionary-like structure in which data is stored as key-value pairs. Avro is a row-based format in which the data is stored together with its schema.
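
To make these formats concrete, here is a minimal PySpark sketch that writes the same DataFrame in several of the formats listed above. The input and output paths are hypothetical, and Avro support requires the external spark-avro package.

    from pyspark.sql import SparkSession

    # Create (or reuse) a SparkSession for the sketches in this segment.
    spark = SparkSession.builder.appName("file-format-demo").getOrCreate()

    # Read a CSV file into a DataFrame (hypothetical input path).
    df = spark.read.csv("input/data.csv", header=True, inferSchema=True)

    # Write the same data in other common formats (hypothetical output paths).
    df.write.mode("overwrite").json("output/data_json")        # JSON: key-value records
    df.write.mode("overwrite").parquet("output/data_parquet")  # Parquet: columnar
    df.write.mode("overwrite").orc("output/data_orc")          # ORC: columnar

    # Avro needs the external spark-avro package on the classpath, e.g.
    # spark-submit --packages org.apache.spark:spark-avro_2.12:<spark_version>
    df.write.mode("overwrite").format("avro").save("output/data_avro")

The later sketches in this segment assume the spark session and the df DataFrame created here.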

Both Parquet and ORC file formats are columnar in nature.

Some of the important points about the Optimised Row Columnar (ORC) file format are as follows:

  • A typical ORC file is organised into stripes, and each stripe contains three sections: Index Data, Row Data and a Stripe Footer.
  • It supports both compressed and uncompressed storage (see the sketch after this list).
  • It stores collections of rows in a file, and within each collection, the row data is stored in a columnar format.
  • An advantage of this file format is its fast response time for both reads and writes.
  • This file format is generally preferred when the original data is flat and non-hierarchical.
  • Files stored in this format can be up to 70% smaller than the original data.
  • It also supports lightweight indexes, which help improve the read time of data.
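
Here is a small sketch of these points in practice, assuming the spark session and df DataFrame from the earlier sketch; the compression codec is passed as a write option, and the paths are hypothetical.

    # Write ORC with zlib compression and without any compression.
    df.write.mode("overwrite").option("compression", "zlib").orc("output/data_orc_zlib")
    df.write.mode("overwrite").option("compression", "none").orc("output/data_orc_plain")

    # Read the compressed ORC data back; the schema is recovered from the file itself.
    orc_df = spark.read.orc("output/data_orc_zlib")
    orc_df.printSchema()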

Some of the important points about the Parquet file format are as follows:

  • Files stored in this format can be up to 70% smaller than the original data.
  • The file's metadata is stored in a footer at the end of the file (see the sketch after this list).
  • It is widely supported across Apache big data processing tools.
  • This file format is preferred when the data is nested.
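
A minimal sketch of writing and reading Parquet, again assuming the spark session and df DataFrame from the first sketch:

    # Write the DataFrame as Parquet; the schema and other metadata are kept
    # in a footer at the end of each file.
    df.write.mode("overwrite").parquet("output/data_parquet")

    # Reading it back needs no schema definition: it is recovered from the footer.
    parquet_df = spark.read.parquet("output/data_parquet")
    parquet_df.printSchema()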

In the next video, you will learn about the impact of choosing a particular file format on the performance of a Spark job. You will also learn about the benefits of using a columnar file format.

In the video above, you learnt about the various ways in which choosing a particular file format affects the performance of your Spark job. They are as follows:

  • Faster read time: At the industry level, reading files as fast as possible is extremely important; therefore, the performance of a file format is often judged based on its read time.
  • Faster write time: Just like read time, write time is another straightforward parameter based on which a particular file format is chosen in the industry. A faster write time has a major impact on the overall performance of a Spark job.
  • Splittable files: Certain file formats support the feature of splitting a file into multiple smaller chunks. This has a major impact on the performance of jobs in the industry because it directly increases the degree of parallelism of your Spark job.
  • Schema evolution support: In many industry use cases, you might need to accommodate changes in the schema of the data. Several file formats support this form of schema evolution (see the sketch after this list).
  • Advanced compression support: You might need advanced compression support for your Spark jobs. Columnar file formats, such as ORC and Parquet, support advanced compression techniques because the data of a column is stored together. Such techniques can reduce the file size by up to 70%.
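
The last two points can be illustrated with a short sketch. It assumes the spark session from the first sketch; the two small DataFrames and all paths are hypothetical, and the second DataFrame has one extra column to mimic a schema change.

    # Two versions of the same data set; version 2 adds an `amount` column.
    df_v1 = spark.createDataFrame([(1, "A")], ["id", "category"])
    df_v2 = spark.createDataFrame([(2, "B", 9.99)], ["id", "category", "amount"])

    df_v1.write.mode("overwrite").parquet("output/events/v1")
    df_v2.write.mode("overwrite").parquet("output/events/v2")

    # Schema evolution: read both versions together and merge their schemas.
    merged = spark.read.option("mergeSchema", "true").parquet("output/events/v1",
                                                              "output/events/v2")
    merged.printSchema()   # contains id, category and amount

    # Advanced compression: choose a codec when writing a columnar file.
    df_v2.write.mode("overwrite").option("compression", "snappy").parquet("output/events_snappy")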

Now that we have discussed the features of the ORC and Parquet file formats, let’s learn about some of the benefits of using a columnar file format. These benefits are as follows:

  • In a columnar file format, data is more homogeneous because it is stored in the form of columns. Hence, it becomes easier to achieve high degrees of compression.
  • IO is also reduced since we only need to scan the subset of columns required by the query (see the sketch after this list).
  • Since the data is homogeneous and all the values of a column are stored together, efficient column encodings, such as run-length and dictionary encoding, can be applied, and modern processors can scan such data efficiently.
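
The IO benefit is easy to see in a small sketch: when only some columns are selected, Spark scans just those column chunks from the Parquet output written earlier (column pruning). The column names below are hypothetical.

    # Select only the columns the query actually needs.
    subset = spark.read.parquet("output/data_parquet").select("order_id", "amount")

    # The physical plan shows that the scan reads only the selected columns.
    subset.explain()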

So, you learnt about optimised file formats in Spark and the benefits of using them. In the next video, you will get a practical hands-on demonstration of implementing different file formats for storing and reading data. You will also compare the performance of a job using a columnar format, such as ORC or Parquet, with the performance of a job using traditional non-columnar file formats.

Note:  At 4:53, SME mistakenly says 60 times instead of 16 times.

The link to the Jupyter Notebook used in this segment is given below.

Note: Please note that you may get different results when you run the Jupyter Notebooks. This may be due to changes in network bandwidth and other internal factors.
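
If you want to experiment on your own before (or after) watching the demonstration, the rough sketch below (not the notebook used in the video) shows one way to time a full read of the same data in CSV and in Parquet, assuming the spark session, df and Parquet output from the earlier sketches.

    import time

    def time_read(read_fn, path):
        """Time a full scan of the data set at `path` using the given reader."""
        start = time.time()
        read_fn(path).count()          # count() forces Spark to read everything
        return time.time() - start

    # Write a CSV copy of the data for comparison (hypothetical path).
    df.write.mode("overwrite").csv("output/data_csv", header=True)

    csv_time = time_read(lambda p: spark.read.csv(p, header=True, inferSchema=True),
                         "output/data_csv")
    parquet_time = time_read(spark.read.parquet, "output/data_parquet")

    print(f"CSV read: {csv_time:.2f} s | Parquet read: {parquet_time:.2f} s")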

In this segment, you learnt about the ORC and Parquet file formats and understood how choosing a file format affects the performance of a job in the industry environment. You also learnt about the benefits of using a columnar file format.

Additional Content

  • Official ORC file format homepage: Link to the official ORC file format homepage, where you can also access its documentation
  • Official Parquet file format homepage: Link to the official Parquet file format homepage, where you can also access its documentation
