So far, you have worked with numerical variables. Data sets, however, often contain non-numeric variables as well, known as categorical variables. Because a regression model needs numeric inputs, these variables cannot be used in the model directly.
Let’s see how you can deal with these variables in the following video.
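When you have a categorical variable with, say, ‘n’ levels, the idea of dummy variable creation is to build ‘n-1’ variables that indicate the levels. For a variable, say, ‘Relationship’ with three levels, namely, ‘Single’, ‘In a relationship’, and ‘Married’, you would start with a dummy table like the following:

| Relationship status | Single | In a relationship | Married |
| --- | --- | --- | --- |
| Single | 1 | 0 | 0 |
| In a relationship | 0 | 1 | 0 |
| Married | 0 | 0 | 1 |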
As you can see, there is no need to define three different dummy variables. If you drop one of them, say, ‘Single’, the remaining two can still represent all three levels.
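Let’s drop the dummy variable ‘Single’ from the columns and see what the table looks like:

| Relationship status | In a relationship | Married |
| --- | --- | --- |
| Single | 0 | 0 |
| In a relationship | 1 | 0 |
| Married | 0 | 1 |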
If both the dummy variables, i.e., ‘In a relationship’ and ‘Married’, are equal to zero, it means that the person is single. If ‘In a relationship’ is denoted by 1 and ‘Married’ by 0, it means that the person is in a relationship. Finally, if ‘In a relationship’ is denoted by 0 and ‘Married’ by 1, it means that the person is married.
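The encoding described above can be sketched in a few lines of plain Python. This is only an illustration: the helper `make_dummies` and the sample list `statuses` are hypothetical names introduced here, not part of the course material.

```python
def make_dummies(values, levels, drop):
    """Return one dict of 0/1 indicators per value, leaving out the dropped level."""
    kept = [level for level in levels if level != drop]
    return [{level: int(value == level) for level in kept} for value in values]


levels = ["Single", "In a relationship", "Married"]

# Made-up sample observations of the 'Relationship' variable
statuses = ["Single", "In a relationship", "Married", "Single"]

# 'Single' is the dropped (baseline) level, so only n-1 = 2 dummies remain
rows = make_dummies(statuses, levels, drop="Single")

for status, row in zip(statuses, rows):
    print(status, row)
```

A row in which both indicators are 0 corresponds to the dropped level, ‘Single’. In practice you would more likely rely on a library routine; pandas, for instance, offers `get_dummies` with a `drop_first` option, though which level it drops depends on how the levels are ordered.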
Before you move on to the next segment, there’s one concept that needs to be addressed: scaling the variables. Rahim addressed scaling earlier, in an optional segment, while answering a few common doubts regarding linear regression. Now that dummy variables are also in the picture, let’s revisit the different aspects of scaling.
Note that scaling changes only the values of the coefficients. The t-statistics and p-values of the predictors, the F-statistic and R-squared all remain the same.
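A quick way to convince yourself of this is to fit the same simple regression before and after scaling the predictor. The sketch below uses only the Python standard library; the helper `simple_ols` and the `area` and `price` figures are made up for illustration.

```python
from statistics import mean


def simple_ols(x, y):
    """Fit y = b0 + b1 * x by least squares; return (slope, r_squared)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


# Made-up data: one predictor and one target
area = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]
price = [52.0, 71.0, 88.0, 115.0, 129.0]

# MinMax-scale the predictor into the range 0 to 1
lo, hi = min(area), max(area)
area_scaled = [(a - lo) / (hi - lo) for a in area]

slope_raw, r2_raw = simple_ols(area, price)
slope_scaled, r2_scaled = simple_ols(area_scaled, price)

print("slope:", slope_raw, "->", slope_scaled)    # the coefficient changes
print("R-squared:", r2_raw, "->", r2_scaled)      # the fit quality does not
```

The slope is multiplied by the range of the original predictor, while R-squared agrees up to floating-point rounding.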
Two major methods are employed to scale the variables: standardisation and MinMax scaling. Standardisation rescales each variable so that it has mean 0 and standard deviation 1. It does not change the shape of the distribution, so the result is a standard normal distribution only if the variable was normally distributed to begin with. MinMax scaling, on the other hand, brings all the data into the range 0 to 1. The formulae used in the background for each of these methods are given below:
- Standardisation: $x_{\text{scaled}} = \dfrac{x - \text{mean}(x)}{\text{sd}(x)}$
- MinMax Scaling: $x_{\text{scaled}} = \dfrac{x - \min(x)}{\max(x) - \min(x)}$
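Both formulae translate directly into code. The following is a minimal sketch using the Python standard library; the function names and the sample values are hypothetical, and it assumes the population standard deviation (`pstdev`) for sd(x). Using the sample standard deviation instead would give slightly different values.

```python
from statistics import mean, pstdev


def standardise(x):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sd = mean(x), pstdev(x)
    return [(xi - mu) / sd for xi in x]


def min_max_scale(x):
    """Map the smallest value to 0 and the largest value to 1."""
    lo, hi = min(x), max(x)
    return [(xi - lo) / (hi - lo) for xi in x]


# Made-up sample values
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z = standardise(values)
m = min_max_scale(values)

print("standardised:", z)
print("min-max scaled:", m)
```

Both functions assume that the variable is not constant; otherwise the denominator would be zero.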
Coming up
In the next segment, you will learn how to assess and compare models. For example, you can penalise models which use a large number of variables.