Pipeline

In the previous segments, you learnt about the various steps involved in preparing data. You must have observed how each step was implemented sequentially as part of the data preparation activity, and how the output of each step was a new DataFrame, making the process suboptimal. Let’s watch the next video to understand how a pipeline can internally optimise these steps and improve efficiency.

Instead of executing a transform method at each step of the data preparation process, a pipeline chains all the steps, such as data cleansing, scaling and normalising, in the desired sequence. Creating a pipeline saves you a lot of time by eliminating these separate calls and makes your code more efficient. Also, once you have designed a pipeline with all the required steps, you can reuse it on different data sets without substantially altering the code for each data set.

A pipeline is built by declaring the Pipeline object available in Spark MLlib. You then build a PipelineModel by fitting the Pipeline object on the data. Unlike the steps explored in the previous segments, the PipelineModel takes a single DataFrame as input and outputs the final prepared DataFrame.

In the previous segments, we used a total of three transformations: Imputer, Assembler and Scaler. These transformers and estimators are listed as a sequence of stages when defining the Pipeline object. You then call the fit() method to create a PipelineModel and use it to transform the data, as shown in the sketch below.
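
To make this concrete, here is a minimal, self-contained PySpark sketch of such a three-stage pipeline. The toy DataFrame and its column names (age, income) are hypothetical, and the sketch assumes the assembler and scaler referred to above are VectorAssembler and StandardScaler.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer, VectorAssembler, StandardScaler

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

# Toy data with missing values (hypothetical columns).
df = spark.createDataFrame(
    [(25.0, 40000.0), (32.0, None), (None, 56000.0), (45.0, 81000.0)],
    "age double, income double",
)

# Stage 1: fill missing values with the column mean.
imputer = Imputer(
    inputCols=["age", "income"],
    outputCols=["age_imp", "income_imp"],
    strategy="mean",
)

# Stage 2: combine the imputed columns into a single feature vector.
assembler = VectorAssembler(
    inputCols=["age_imp", "income_imp"],
    outputCol="features",
)

# Stage 3: scale the assembled feature vector.
scaler = StandardScaler(inputCol="features", outputCol="features_scaled")

# Declare the stages in the desired order, fit once to obtain a PipelineModel,
# then transform: one DataFrame in, one prepared DataFrame out.
pipeline = Pipeline(stages=[imputer, assembler, scaler])
pipeline_model = pipeline.fit(df)
prepared_df = pipeline_model.transform(df)

prepared_df.select("features_scaled").show(truncate=False)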

In short, a Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. As you learnt earlier, no intermediate DataFrame is materialised at any stage; the lineage created in Spark facilitates the one-shot creation of the final DataFrame.

So far, you have performed different operations such as imputing missing values, converting features into a vector and scaling the data set. You have also learnt how to integrate these activities in the form of a pipeline. However, you have not seen any visualisations made on the data. This is because the Spark API, unlike Python libraries such as pandas and Matplotlib, does not provide a good interface for visualisation. For this reason, when you are handling huge data sets, you take a small sample of your data to get a basic idea of the data set and then build models using PySpark based on the inferences you draw from it.
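
As a rough illustration of this workflow, the sketch below pulls a small random sample of a Spark DataFrame into pandas so that ordinary Python plotting tools can be used. It continues from the prepared_df of the earlier sketch (in practice, a much larger DataFrame); the 1% fraction and the plotted column are hypothetical choices, and it assumes pandas and Matplotlib are available on the driver.

import matplotlib.pyplot as plt

# Pull roughly 1% of the rows to the driver as a pandas DataFrame
# (hypothetical fraction; size the sample so it fits in driver memory).
sample_pdf = prepared_df.sample(fraction=0.01, seed=42).toPandas()

# Explore the sampled data with ordinary Python tools, e.g. a histogram
# of the (hypothetical) imputed age column.
sample_pdf["age_imp"].hist(bins=20)
plt.show()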

Additional Reading

  1. Pipelines Spark Documentation: You can refer to this Spark documentation link to learn about the ML pipeline.
  2. The elegance of the Spark ML Pipeline: You can refer to this link to appreciate the advantages offered by the Spark ML pipeline.
