Standardising Values

You learnt different techniques to handle outliers and also implemented the same in the ‘bank marketing’ dataset. Now, you will learn the next important aspect, which is to standardise values in a dataset.

In this video, Anand will explain how to standardise quantitative values in a dataset.

Scaling ensures that the values in a dataset have a common scale; this makes it easy to perform data analysis. Let’s take a data set that contains the grades of students studying in different universities. Some of the universities assign grades on a scale of 4, whereas the others assign grades on a scale of 10. Hence, you cannot assume that a GPA of 3 on a scale of 4 is equal to a GPA of 3 on a scale of 10, even though they are the same quantitatively. Thus, for the purpose of analysis, these values need to be brought to a common scale, such as the percentage scale.

Now, let’s summarise what you learnt so far about standardising the variables in a dataset. Given below is a list of the points that we covered. You could use this as a checklist for future data cleaning exercises:

  • Standardise units: Ensure all observations under one variable are expressed in a common and consistent unit, e.g., convert lbs to kg, miles/hr to km/hr, etc.
  • Scale values if required: Make sure all the observations under one variable have a common scale
  • Standardise precision for a better presentation of data, e.g., change 4.5312341 kg to 4.53 kg.

Now that you have learnt how to standardise the numeric values in a data set, let’s proceed to learn how to standardise text values, which is an equally important aspect of data analysis.

Now, let’s summarise what you learnt about standardising text values in a dataset. Given below is a list of the points that were covered, you can use this as a checklist for future data cleaning exercises:

  • Remove extra characters such as common prefixes/suffixes, leading/trailing/multiple spaces, etc. These are irrelevant to the analysis
  • Standardise case: String variables may take various casing styles, e.g., FULLCAPS, lowercase, Title Case, Sentence case, etc.
  • Standardise format: It is important to standardise the format of other elements such as date, name, etc. For example, change 23/10/16 to 2016/10/23, “Modi, Narendra” to “Narendra Modi”, etc.

In this video, Rahim will apply the concepts covered in this segment to the ‘bank marketing’ data set and standardise some required values in it.

In the videos, you saw the application of standardisation with a real-life example of the ‘duration’ variable in the ‘bank marketing’ data set. The duration variable has data in minutes as well as in seconds, which has to be converted into minute only. You can understand the entire code to convert the ‘duration’ variable into minutes in the image below.

In the next segment, you will learn how to fix invalid values in a data set.


Report an error