When using linear regression to model the relationship between a response variable and a predictor, we make a few assumptions. These are the conditions that must hold before we can draw inferences about the model estimates or use the model to make predictions. In the next video, Anjali explains the key assumptions of linear regression.
Linear regression rests on certain key assumptions. Let us understand the importance of each of these assumptions one by one.
1. There is a linear relationship between X and Y.
X and Y should display a linear relationship of some form; otherwise, there is no use of fitting a linear model between them.
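As a quick sanity check of linearity, you can look at a scatter plot of X against Y or compute the Pearson correlation coefficient. Here is a minimal sketch with made-up data (the values below are purely illustrative, not from any course dataset):

```python
import numpy as np

# Hypothetical data: y grows roughly as 2x, so the relationship is linear.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.0])

# Pearson correlation coefficient: close to +1 or -1 suggests a strong
# linear relationship; close to 0 suggests a linear fit is not useful.
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")  # close to 1 -> strong linear relationship
```

Note that a high correlation only captures straight-line association; a curved relationship can still hide behind a moderate r, which is why the residual plots discussed later are also important.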
2. Error terms are normally distributed with mean equal to 0 (note that this assumption applies to the error terms, not to X or Y)
- If you only wish to fit a line and not make any further interpretations, it is not a problem if the error terms are not normally distributed.
- However, if you wish to draw some inferences on the model that you have built, then you need to have a notion of the distribution of the error terms. One particular repercussion if the error terms are not distributed normally is that the p-values obtained during the hypothesis test to determine the significance of the coefficients become unreliable.
- The assumption of normality is made because, in practice, error terms have been observed to follow a normal distribution with mean equal to 0 in most cases.
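As a rough sketch of how you might inspect this assumption in Python (using simulated data, not the course dataset), you can fit a line and examine the mean and skewness of the residuals:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: y = 3x + 5 plus normal noise, so the fitted
# residuals should look roughly normal with mean 0.
x = rng.uniform(0, 10, 200)
y = 3 * x + 5 + rng.normal(0, 1, 200)

# Fit a simple linear model by ordinary least squares.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# With an intercept in the model, OLS residuals sum to ~0 by construction.
print(f"mean of residuals     = {residuals.mean():.4f}")

# Sample skewness: near 0 for symmetric (e.g. normal) error distributions.
skew = ((residuals - residuals.mean()) ** 3).mean() / residuals.std() ** 3
print(f"skewness of residuals = {skew:.4f}")
```

In practice you would also look at a histogram or Q-Q plot of the residuals rather than relying on a single summary number.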
3. Error terms are independent of each other
- The error terms should not be dependent upon one another (like in time-series data, where the next value is dependent upon the previous one).
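One common numeric check for independence is the Durbin-Watson statistic, which is roughly 2 for uncorrelated residuals, drops toward 0 under positive autocorrelation, and rises toward 4 under negative autocorrelation. A self-contained sketch using simulated residuals (the standard formula, implemented by hand here rather than via any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

def durbin_watson(residuals):
    # DW = sum of squared successive differences / sum of squared residuals.
    diff = np.diff(residuals)
    return (diff ** 2).sum() / (residuals ** 2).sum()

independent = rng.normal(0, 1, 500)           # independent errors
autocorrelated = np.cumsum(independent) / 10  # a random walk: strongly dependent

print(f"DW (independent)    = {durbin_watson(independent):.2f}")     # close to 2
print(f"DW (autocorrelated) = {durbin_watson(autocorrelated):.2f}")  # close to 0
```

The random walk mimics the time-series situation described above, where each value depends on the previous one.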
4. Error terms have constant variance (homoscedasticity)
- The variance of the error terms should not increase or decrease as the fitted (predicted) values change.
- In other words, the residuals should show no pattern, such as a funnel shape, when plotted against the fitted values.
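A rough numeric check for constant variance, again using simulated residuals (purely illustrative): compare the residual variance over low and high values of the predictor. Similar variances suggest homoscedasticity; a large difference suggests a funnel-shaped pattern.

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.uniform(0, 10, 1000)

# Homoscedastic errors: constant spread everywhere.
e_const = rng.normal(0, 1, 1000)
# Heteroscedastic errors: spread grows with x (a funnel shape).
e_funnel = rng.normal(0, 1, 1000) * x

# Compare residual variance for small vs large x.
low, high = x < 5, x >= 5
print(np.var(e_const[low]), np.var(e_const[high]))    # similar -> homoscedastic
print(np.var(e_funnel[low]), np.var(e_funnel[high]))  # very different -> heteroscedastic
```

Formal tests such as Breusch-Pagan exist for this, but a simple split-and-compare like the above (or, better, the residual plot itself) already makes a funnel shape obvious.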
You can also go through the content at this link to see what happens when the assumptions above are violated. You will gain more clarity as we move ahead.
We use the following plots to assess the above assumptions qualitatively:
- Residual versus prediction plot: This plot helps us detect nonlinearity, unequal error variances and outliers.
- Histogram of error terms: This plot helps detect non-normality of the error values.
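These two diagnostic plots can be sketched in Python roughly as follows, assuming matplotlib is available and using simulated data in place of the course dataset:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to view the plots
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)

# Simulated data and a simple linear fit.
x = rng.uniform(0, 10, 300)
y = 2 * x + 1 + rng.normal(0, 1, 300)
slope, intercept = np.polyfit(x, y, 1)
predictions = slope * x + intercept
residuals = y - predictions

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs predictions: look for nonlinearity, funnels, and outliers.
ax1.scatter(predictions, residuals, s=10)
ax1.axhline(0, color="red", linestyle="--")
ax1.set(xlabel="Predicted values", ylabel="Residuals",
        title="Residuals vs predictions")

# Histogram of residuals: look for departures from a bell shape around 0.
ax2.hist(residuals, bins=20)
ax2.set(xlabel="Residual", title="Histogram of error terms")

fig.savefig("residual_diagnostics.png")
```

With well-behaved data like this, the left panel shows a patternless band around zero and the right panel a roughly bell-shaped histogram; violations of the assumptions show up as curvature, funnels, or skew.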
We will work with the Marketing dataset, which we used earlier for simple linear regression in Python. Let us check whether the assumptions hold true in the forthcoming video.
In the next segment, Anjali will implement Multiple Linear Regression in Python and also perform residual analysis on the fitted model. Residual analysis will further help in checking whether the assumptions of linear regression hold true.