Catalyst Optimizer

In the last segment, you were introduced to DataFrames and Datasets. In this segment, we will take a closer look at Spark’s Catalyst optimiser, which is at the core of Spark SQL’s execution engine. Now, let us learn from our SME how the Catalyst optimiser works.

The Catalyst optimiser creates an internal representation of a user’s program, called a query plan. Starting from the initial version of this query plan, Catalyst applies a series of transformations to convert it into an optimised query plan.
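To make the idea of "a plan that gets transformed" concrete, here is a minimal sketch in plain Python. It is a toy model and not Spark's actual implementation: the `PlanNode` class, the `transform_up` helper and the `remove_trivial_filter` rule are hypothetical names invented for this illustration. It assumes a query plan can be viewed as a tree of operators and a transformation as a function that rewrites one node at a time.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class PlanNode:
    """One operator in a toy query plan (hypothetical, not a Spark class)."""
    op: str
    detail: str = ""
    children: Tuple["PlanNode", ...] = ()


def transform_up(node: PlanNode, rule: Callable[[PlanNode], PlanNode]) -> PlanNode:
    """Rewrite the children first, then offer the rebuilt node to the rule."""
    new_children = tuple(transform_up(child, rule) for child in node.children)
    return rule(PlanNode(node.op, node.detail, new_children))


def remove_trivial_filter(node: PlanNode) -> PlanNode:
    """Toy rule: a filter whose condition is always true can be dropped."""
    if node.op == "Filter" and node.detail == "true":
        return node.children[0]
    return node


def render(node: PlanNode, depth: int = 0) -> List[str]:
    """Return the plan as indented lines, one operator per line."""
    lines = ["  " * depth + f"{node.op}({node.detail})"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Initial plan for something like: SELECT name FROM people WHERE true
initial_plan = PlanNode(
    "Project", "name",
    (PlanNode("Filter", "true", (PlanNode("Scan", "people"),)),),
)
optimised_plan = transform_up(initial_plan, remove_trivial_filter)
print("\n".join(render(optimised_plan)))
```

The rule never changes what the query returns; it only produces an equivalent plan with less work to do, which is the property every transformation in this sketch is meant to preserve.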

The Catalyst optimiser supports both rule-based and cost-based optimisation:

  • In the rule-based optimisation phase, Catalyst applies a set of rules to rewrite the query plan into an equivalent but more efficient one. From the optimised plan, Catalyst then generates multiple physical plans.
  • In the cost-based optimisation phase, Catalyst estimates the cost of each physical plan and selects the plan with the lowest cost for execution.
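The cost-based step can be sketched as "enumerate candidate plans, estimate a cost for each, keep the cheapest". The example below is a toy under stated assumptions: the `PhysicalPlan` class, the `NUM_EXECUTORS` constant and both cost formulas are invented for illustration and are not Spark's real cost model.

```python
import math
from dataclasses import dataclass
from typing import List

NUM_EXECUTORS = 50  # assumed cluster size, chosen only for this example


@dataclass(frozen=True)
class PhysicalPlan:
    """A candidate way of executing a join (hypothetical, not a Spark class)."""
    strategy: str
    estimated_cost: float


def candidate_plans(left_rows: int, right_rows: int) -> List[PhysicalPlan]:
    """Enumerate toy physical plans for joining two non-empty tables."""
    smaller = min(left_rows, right_rows)
    # Invented formulas: sort both sides, versus copy the smaller side to
    # every executor and then scan both sides once.
    sort_merge = left_rows * math.log2(left_rows) + right_rows * math.log2(right_rows)
    broadcast = left_rows + right_rows + smaller * NUM_EXECUTORS
    return [
        PhysicalPlan("sort-merge join", sort_merge),
        PhysicalPlan("broadcast hash join", broadcast),
    ]


def choose_plan(plans: List[PhysicalPlan]) -> PhysicalPlan:
    """Cost-based selection: keep the plan with the lowest estimated cost."""
    return min(plans, key=lambda plan: plan.estimated_cost)


small_side = choose_plan(candidate_plans(1_000_000, 100))
two_large = choose_plan(candidate_plans(1_000_000, 1_000_000))
print(small_side.strategy)
print(two_large.strategy)
```

With these made-up numbers, the broadcast plan wins when one side is tiny and the sort-merge plan wins when both sides are large. The point is only that the same logical join can map to different physical plans depending on the estimated cost.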

The steps of a Spark SQL compilation cycle are as follows:

  • The Spark SQL code submitted via any of the supported APIs is first converted to an initial logical plan, also called an unresolved logical plan.
  • This unresolved logical plan is converted to an optimised query plan in three phases, namely Analysis, Logical Optimisation and Physical Planning:
      • In the Analysis phase, the unresolved logical plan is transformed into a resolved logical plan, or simply a “logical plan”. This plan includes additional information such as the data source and the data types of the columns.
      • In the Logical Optimisation phase, the logical plan is transformed into an optimised logical plan through the application of optimisation rules such as Predicate Pushdown, Constant Folding and Column Pruning.
      • In the Physical Planning phase, Spark generates multiple physical plans from the optimised logical plan. Catalyst then uses its cost model to select the most efficient one, that is, the plan with the lowest estimated cost.
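The Analysis and Logical Optimisation phases described above can be illustrated with a small plain-Python sketch. Everything in it is hypothetical: the `CATALOG` dictionary stands in for Spark's catalog, and the `Lit`, `Col`, `Add` and `Gt` classes are a made-up expression tree, not Spark classes. It resolves a column against the catalog to attach its data type, then applies constant folding to the condition `age > 10 + 8`.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical catalog: table name -> {column name -> data type}.
CATALOG = {"people": {"name": "string", "age": "int"}}


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Col:
    name: str
    dtype: Optional[str] = None  # None means the column is still unresolved


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class Gt:
    left: object
    right: object


def analyse(expr, table):
    """Analysis: attach a data type to every column, or fail if it is unknown."""
    if isinstance(expr, Col):
        columns = CATALOG[table]
        if expr.name not in columns:
            raise ValueError(f"cannot resolve column '{expr.name}' in table '{table}'")
        return Col(expr.name, columns[expr.name])
    if isinstance(expr, Lit):
        return expr
    return type(expr)(analyse(expr.left, table), analyse(expr.right, table))


def fold_constants(expr):
    """Logical optimisation: evaluate literal-only additions ahead of time."""
    if isinstance(expr, (Lit, Col)):
        return expr
    left, right = fold_constants(expr.left), fold_constants(expr.right)
    if isinstance(expr, Add) and isinstance(left, Lit) and isinstance(right, Lit):
        return Lit(left.value + right.value)
    return type(expr)(left, right)


# WHERE age > 10 + 8, written as an unresolved expression.
unresolved = Gt(Col("age"), Add(Lit(10), Lit(8)))
resolved = analyse(unresolved, "people")
optimised = fold_constants(resolved)
print(optimised)
```

After both steps the condition is equivalent to `age > 18`, with `age` typed as `int`, so the addition is computed once at planning time rather than once per row. Asking for a column that the catalog does not know, such as `analyse(Col("salary"), "people")`, raises an error in this sketch, which mirrors the role of the Analysis phase in rejecting queries that cannot be resolved.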

Once the best physical plan has been selected, it is converted into a DAG of RDDs, ready for execution.

In the next segment, you will learn about the DataFrame APIs.