In the last segment, you learnt about the new considerations that need to be made when moving to multiple linear regression. Ajay has already talked about overfitting. Let’s now look at the next aspect, i.e., multicollinearity.
Multicollinearity refers to the phenomenon of having related predictor (independent) variables in the input data set. In simple terms, in a model built using several independent variables, some of these variables might be interrelated, making their presence in the model redundant. Dropping some of these related independent variables is one way of dealing with multicollinearity.
Multicollinearity affects the following:
- Interpretation
Does the interpretation “change in Y when all other variables are held constant” still apply?
- Inference
Coefficients swing wildly, and their signs can even invert.
Therefore, the p-values are not reliable.
It is, thus, essential to detect and deal with the multicollinearity present in a model while interpreting it. Let’s see how you can detect it.
You saw two basic ways of detecting multicollinearity.
- Looking at pairwise correlations
  - Looking at the correlation between different pairs of independent variables
- Checking the variance inflation factor (VIF)
  - Sometimes, pairwise correlations are not enough.
  - Instead of depending on just one other variable, an independent variable may depend upon a combination of other variables.
  - VIF measures how well one independent variable is explained by all the other independent variables combined. The VIF is given by:
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
Here, ‘i’ refers to the i-th variable, which is being represented as a linear combination of the rest of the independent variables, and R<sub>i</sub>² is the R-squared value obtained from that auxiliary regression. You will see VIF in action during the Python demonstration on multiple linear regression.
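As a quick sketch of this formula on a small, made-up data set (the variable names and values below are purely illustrative), you can regress one predictor on the others and plug the resulting R² into the expression above:

```python
import numpy as np

# Hypothetical predictors: x1 (area) and x2 (rooms) are strongly related
x1 = np.array([1000, 1500, 2000, 2400, 3100, 1800], dtype=float)
x2 = np.array([2, 3, 4, 5, 6, 4], dtype=float)
x3 = np.array([30, 5, 20, 12, 8, 15], dtype=float)

# Auxiliary regression: explain x1 using the other predictors (with an intercept)
X = np.column_stack([np.ones_like(x2), x2, x3])
beta, *_ = np.linalg.lstsq(X, x1, rcond=None)
residuals = x1 - X @ beta
r_squared = 1 - residuals.var() / x1.var()

# VIF_i = 1 / (1 - R_i^2); a large value means x1 is well explained by the rest
vif_1 = 1 / (1 - r_squared)
print(round(vif_1, 2))
```

Because x1 and x2 were constructed to be nearly proportional, the auxiliary R² is close to 1 and the VIF comes out large.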
The common heuristic we follow for the VIF values is:
- VIF > 10: Definitely high; the variable should be eliminated.
- VIF > 5: Can be okay, but it is worth inspecting.
- VIF < 5: Good VIF value; no need to eliminate this variable.
But once you have detected the multicollinearity present in the data set, how exactly do you deal with it? Rahim answers this question in the following video.
Some methods that can be used to deal with multicollinearity are as follows.
- Dropping variables
  - Drop the variable that is highly correlated with others.
  - Pick the variable that is more interpretable from a business perspective.
- Creating a new variable using the interactions of the older variables
  - Add interaction features, i.e., features derived using some of the original features.
- Variable transformations
- Principal component analysis (covered in a later module)
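The first two remedies can be sketched in a few lines of pandas. The column names and the derived `area_per_room` feature below are hypothetical, chosen only to illustrate the idea:

```python
import pandas as pd

# Hypothetical data set where area_sqft and num_rooms are highly correlated
df = pd.DataFrame({
    "area_sqft": [1000, 1500, 2000, 2400, 3100, 1800],
    "num_rooms": [2, 3, 4, 5, 6, 4],
    "age_years": [30, 5, 20, 12, 8, 15],
})

# Remedy 1: drop one of the correlated pair, keeping the variable
# that is easier to interpret for the business
reduced = df.drop(columns=["num_rooms"])

# Remedy 2: replace the correlated pair with a single derived feature
df["area_per_room"] = df["area_sqft"] / df["num_rooms"]
combined = df.drop(columns=["area_sqft", "num_rooms"])

print(list(reduced.columns), list(combined.columns))
```

Either way, the model is refit on the reduced set of predictors and the VIFs are rechecked.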
Coming up
In the next segment, you will learn how to handle the categorical variables present in a data set.