
Imputer

In the previous segment, you found missing and null values in the data. As the next step of the data cleaning activity, you can either remove the records that contain incomplete or garbage values, or replace the missing values with an approximate value. Let’s watch the next video to understand how to do this.

Generally, the mean or median of the observed values in a column serves as a good approximation for the missing values in that column. Removing records often discards valuable information held in their other columns, so you may impute the missing values instead. In the video above, the following two methods were used to handle the missing values:

  1. The first method removes the records with missing values by calling drop() on the DataFrame's na functions in Spark, that is, df.na.drop(). By default, this drops every row that contains at least one missing value.
  2. The second method replaces the missing values with the mean of their respective features by using the Imputer class in the Spark ML library. Imputer is an Estimator: fitting it on the data computes the column means and returns an ImputerModel, which is the Transformer that fills in the missing values.
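
To make the difference between the two methods concrete, here is a minimal sketch in plain Python that mimics what each one does to a small table. It is an illustration only and does not use Spark. The toy rows, the column layout (id, age, income) and the helper names drop_incomplete and impute_mean are all hypothetical and do not come from the video.

```python
from statistics import mean

# Hypothetical toy data: each row is (id, age, income); None marks a missing value.
rows = [
    (1, 25.0, 50000.0),
    (2, None, 60000.0),
    (3, 31.0, None),
    (4, 40.0, 58000.0),
]

def drop_incomplete(rows):
    """Method 1: keep only the rows that have no missing value in any column."""
    return [row for row in rows if all(value is not None for value in row)]

def impute_mean(rows, columns):
    """Method 2: replace each missing value with the mean of its own column."""
    # Each column's mean is computed from the observed (non-missing) values only.
    means = {
        col: mean(row[col] for row in rows if row[col] is not None)
        for col in columns
    }
    return [
        tuple(
            means[i] if i in means and value is None else value
            for i, value in enumerate(row)
        )
        for row in rows
    ]

dropped = drop_incomplete(rows)              # rows 2 and 3 are removed
imputed = impute_mean(rows, columns=[1, 2])  # all four rows are kept

print(dropped)
print(imputed)
```

Dropping leaves only two of the four records, while imputing keeps all four and fills each gap with its column mean. In PySpark, the first approach roughly corresponds to df.na.drop(), and the second to fitting an Imputer from pyspark.ml.feature (with its inputCols and outputCols set) and calling transform() on the resulting model. Setting the Imputer's strategy to "median" would use the median instead of the mean.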

Additional Reading:

  • Imputations in machine learning – You can refer to this link in order to learn more about how you can deal with the missing values in a dataset.
  • Handling missing values – This link also explains how to deal with the missing values present in a dataset.
