You learnt how to fix columns and rows, and applied those techniques to the bank marketing dataset. Now, you will learn what missing values are and how they should be treated. Before working on the dataset, let’s listen to Anand as he explains the different methods to fix missing values in a dataset.
The most important takeaway from this lecture is: good methods add information, bad methods exaggerate information. If you can obtain information from reliable external sources, use it to replace the missing values. Often, though, it is better to leave missing values as they are and continue with the analysis than to extrapolate from the available information.
Let’s summarise the takeaways from the above video:
- Set values as missing values: Identify values that indicate missing data; for example, treat blank strings, “NA”, “XX”, “999”, etc., as missing.
- Adding is good, exaggerating is bad: You should try to get information from reliable external sources as much as possible, but if you can’t, then it is better to retain missing values than to exaggerate the information in the existing rows/columns.
- Delete rows and columns: Rows with missing values can be deleted if they are few in number, as this would not impact the overall analysis results. A column can be removed if a large proportion of its values is missing.
- Fill partial missing values using business judgement: Such values include missing time zones, century, etc. These values can be identified easily.
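The first and third takeaways above can be sketched in pandas. This is a minimal illustration rather than the code used in the video: the toy DataFrame, its column names and the 50% column threshold are all assumptions made for the example, and the placeholder codes are the ones named in the list.

```python
import pandas as pd

# Hypothetical toy data in which missing entries are encoded as
# blank strings, "NA", "XX" or "999".
df = pd.DataFrame({
    "age":   ["34", "999", "41", "52", "29", "47", "38", "60"],
    "city":  ["Pune", "Delhi", "XX", "Delhi", "Mumbai", "Pune", "Delhi", "Pune"],
    "notes": ["", "NA", "XX", "", "ok", "", "NA", ""],
})

# 1. Set values as missing: turn every placeholder code into NaN.
sentinels = ["", "NA", "XX", "999"]
df = df.mask(df.isin(sentinels))

# 2. Measure the share of missing values in each column.
missing_share = df.isna().mean()

# 3. Remove columns that are mostly missing (0.5 is an arbitrary cut-off).
cols_to_drop = missing_share[missing_share > 0.5].index
df = df.drop(columns=cols_to_drop)

# 4. Delete the few remaining rows that still contain a missing value.
cleaned = df.dropna()

print(missing_share)
print(cleaned)
```

Here the "notes" column is removed because most of its entries are missing, while only the two rows with a missing "age" or "city" are deleted.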
In the next video, Rahim will explain the different types of missing values and how to delete or impute them.
The following are the major takeaways from the video.
Types of missing values:
- MCAR: It stands for Missing completely at random. The reason behind the missing value does not depend on any other feature, nor on the missing value itself.
- MAR: It stands for Missing at random. The reason behind the missing value may be associated with some other observed features.
- MNAR: It stands for Missing not at random. There is a specific reason behind the missing value, typically related to the very value that is missing.
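To make the three types concrete, the sketch below simulates each mechanism on synthetic data. The columns ("age", "salary"), the probabilities and the cut-offs are all hypothetical choices for illustration, and the MNAR case follows the standard statistical reading in which the chance of a value being missing depends on that value itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic, fully observed data (hypothetical columns).
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "salary": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every salary has the same 10% chance of being missing.
mcar_mask = rng.random(n) < 0.10

# MAR: the chance of a missing salary depends on another observed
# column (here, older customers skip the question more often).
mar_mask = rng.random(n) < np.where(df["age"] > 60, 0.30, 0.05)

# MNAR: the chance of a missing salary depends on the salary itself
# (here, high earners withhold it more often).
mnar_mask = rng.random(n) < np.where(df["salary"] > 65_000, 0.40, 0.05)

# Apply each mask to get three versions of the column with gaps.
salary_mcar = df["salary"].mask(mcar_mask)
salary_mar = df["salary"].mask(mar_mask)
salary_mnar = df["salary"].mask(mnar_mask)

# Compare the mean of the observed values under each mechanism.
print("full:", df["salary"].mean())
print("MCAR:", salary_mcar.mean())
print("MAR: ", salary_mar.mean())
print("MNAR:", salary_mnar.mean())
```

Under MCAR, dropping the affected rows removes data without a systematic pattern, whereas under MNAR the observed values are a skewed sample of the original column.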
Now, let’s apply all these concepts to the bank marketing campaign dataset to tackle the issue of missing values in the age and month columns.
There are various ways to deal with missing values. Either you can drop the entries that are missing if you find that the percentage of missing values in a column is very small, or you can impute the missing values with some other values. Let’s look into the various ways to impute the missing values.
Imputation on categorical/numerical columns:
- Categorical column:
  - Impute the most popular category.
  - Imputation can also be done using logistic regression techniques.
- Numerical column:
  - Impute the missing value with the mean/median/mode.
  - Other methods to impute the missing values include interpolation and linear regression. These methods are useful for continuous numerical variables.
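The simpler options in the list above can be sketched as follows. The data is made up for the example: the "month" and "age" column names echo the bank marketing columns mentioned earlier, while "balance" and all the values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in each column.
df = pd.DataFrame({
    "month":   ["may", "may", None, "jun", "may", None],
    "age":     [30.0, np.nan, 45.0, 52.0, np.nan, 38.0],
    "balance": [100.0, np.nan, 300.0, np.nan, 500.0, 600.0],
})

# Categorical column: impute the most popular category (the mode).
most_common_month = df["month"].mode()[0]
df["month"] = df["month"].fillna(most_common_month)

# Numerical column: impute with the median, which is less sensitive
# to extreme values than the mean.
median_age = df["age"].median()
df["age"] = df["age"].fillna(median_age)

# Continuous numerical column: linear interpolation between the
# neighbouring observed values. This only makes sense when the row
# order is meaningful (for example, a time series).
df["balance"] = df["balance"].interpolate(method="linear")

print(df)
```

Regression-based imputation follows the same pattern, except that the fill value for each row is predicted from the other columns instead of being a single summary statistic.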
In this video, you will go through the analysis of the ‘pdays’ variable to deal with its missing values.
The major takeaway from the above video is that missing values do not always have to be null. You should now have a clear understanding of how to treat missing values in a dataset.
- Sometimes, it is good to just drop the missing values because they are missing completely at random.
- Sometimes, it is good to impute them with another value, maybe mean/median/mode, because they are not missing at random and have to be incorporated for further analysis.
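The point that a missing value need not be null can be illustrated with a small sketch. It assumes, for the sake of the example, that ‘pdays’ uses -1 as a placeholder code for "no valid value"; the numbers are invented, so check the actual coding in your copy of the dataset.

```python
import pandas as pd

# Hypothetical 'pdays' values in which -1 is a placeholder code,
# not a real number of days.
pdays = pd.Series([-1, 92, -1, 182, -1, 5], name="pdays")

# isna() finds nothing, because -1 is an ordinary number.
print(pdays.isna().sum())

# Summary statistics are distorted if the code is left in place.
mean_with_code = pdays.mean()

# Replace the placeholder with NaN so pandas skips it in calculations.
pdays_clean = pdays.where(pdays >= 0)
mean_clean = pdays_clean.mean()

print(mean_with_code, mean_clean)
```

With the placeholder left in, the mean is 46.0; once it is marked as missing, the mean of the three real values is 93.0.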
You have now worked through the bank telemarketing dataset. It has a ‘response’ variable, which is the target variable of the dataset. You have also learnt about missing values and the process for treating them. Based on your understanding of the code and the process for handling missing values, answer the following questions.
In the next segment, you will learn how to deal with outliers.