Handling Outliers

You have learnt what missing values are and how to treat them. Now, let’s move to the next concept of data cleaning, which is outliers. 

The definition of outliers is as follows:

Outliers are values that lie far beyond the next nearest data points.

In this video, Rahim will help you understand the concept of outliers.

You learnt that there are two types of outliers. These are:

  • Univariate outliers: Univariate outliers are those data points in a variable whose values lie beyond the range of expected values. You can get a better understanding of univariate outliers from the image below. Here, almost all the points lie between 0 and 5.0, and one point (at 20.0) lies extremely far from the rest of the data set.
  • Multivariate outliers: Some values of a variable may lie within the expected range when you look at that variable alone, but lie far from the expected values when you plot it against another variable. These are called multivariate outliers. You can refer to the image below to get a better understanding of multivariate outliers.
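The univariate case described above can be sketched in a few lines of pandas. This is only an illustration: the values are made up to mirror the description (most points between 0 and 5.0, one at 20.0), and the 1.5 × IQR rule used to flag the point is a common convention assumed here, not a method taken from the video.

```python
import pandas as pd

# Hypothetical values: most points lie between 0 and 5.0, one sits at 20.0.
values = pd.Series([0.5, 1.2, 2.0, 2.8, 3.1, 3.9, 4.4, 5.0, 20.0])

# Interquartile range (IQR) of the variable.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Assumed convention: points more than 1.5 * IQR outside the quartiles are flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]
print(outliers)
```

A flagged point is only a candidate outlier. Whether it is an error or a genuine extreme value still depends on what the variable represents.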

Now, let’s proceed to the next video, where you will learn about the reasons behind the appearance of outliers in data and how to treat them.

From the video, you must have understood that outliers should be treated before investigating data and drawing insights from a dataset.

The major approaches to treating outliers include:

  • Imputation
  • Deletion of outliers
  • Binning of values
  • Capping the outliers
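As a rough sketch, the four approaches listed above could look like this in pandas. The series, the bin edges and the 1.5 × IQR flagging rule are all assumptions made for illustration. In practice, the right thresholds and replacement values depend on the variable.

```python
import pandas as pd

# Hypothetical column with one extreme value.
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 11.0, 13.0, 12.0, 200.0],
              name="balance")

# Flag outliers with the 1.5 * IQR rule (an assumed convention).
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (s < lower) | (s > upper)

# Imputation: replace each outlier with the median of the remaining values.
imputed = s.mask(is_outlier, s[~is_outlier].median())

# Deletion: drop the rows that contain outliers.
deleted = s[~is_outlier]

# Binning: group values into categories so extremes fall into an open-ended bin.
binned = pd.cut(s, bins=[0, 11, 13, float("inf")],
                labels=["low", "medium", "high"])

# Capping: pull values outside the limits back to the limits.
capped = s.clip(lower=lower, upper=upper)

print(pd.DataFrame({"original": s, "imputed": imputed,
                    "binned": binned, "capped": capped}))
```

Deletion loses rows, while imputation, binning and capping keep every row but change the extreme values, so the choice depends on how much data you can afford to lose.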

While handling the missing values and outliers of different columns, you are already performing univariate analysis. You will learn more about it in further sessions. In this video, you will learn how to apply what you have learnt so far to the bank marketing dataset.

So, in the above video, you have seen that the box plot flags some values of the age variable as outliers, but these can be treated as normal values because a person can well be over 70 or 80 years of age. The 70-90 age group is sparsely populated, yet people in this group do participate in opening term deposit accounts. This is why these people fall outside the whiskers of the box plot even though they are not outliers and can be considered normal values.

Let’s listen to Rahim as he explains the variable ‘balance’.

An important aspect that has been covered in this video is quantiles. Sometimes, it is beneficial to look at the quantiles instead of the box plot, mean or median. Quantiles can give you a fair idea about the outliers. If there is a huge difference between the maximum value and the 95th or 99th percentile, then the data set likely contains outliers.
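A minimal sketch of this quantile check is shown below. The balances are hypothetical values chosen for illustration, not figures from the bank marketing dataset. Only the idea of comparing the maximum with the upper quantiles comes from the discussion above.

```python
import pandas as pd

# Hypothetical balances: 99 ordinary values and one very large one.
balance = pd.Series(list(range(1, 100)) + [10000], dtype=float)

# Upper quantiles of the variable.
summary = balance.quantile([0.5, 0.7, 0.9, 0.95, 0.99])
p99 = summary.loc[0.99]
max_value = balance.max()

print(summary)
print("max:", max_value)

# A large gap between the maximum and the 99th percentile hints at outliers.
print("max / 99th percentile:", max_value / p99)
```

On a real column, you would run the same `quantile` call on that column and read the gap between the last quantile and the maximum in the same way.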

In the next segment, you will learn about the standardisation process in EDA.
