Before moving on to the Python code, we need to address an important aspect of linear regression: its assumptions.

While building a linear model, you assume that the target variable and the input variables are linearly dependent. But do you need any assumptions other than this?

Let’s hear what Rahim has to say.

You are making inferences on the ‘population’ using a ‘sample’. The assumption that variables are linearly dependent is not enough to generalise the results you obtain on a sample to the **population**, which is much larger in size than the sample. Thus, you need to have certain assumptions in place in order to make inferences.

Let’s understand the importance of each assumption one by one.

**There is a linear relationship between X and Y**

- X and Y should display some sort of linear relationship; otherwise, there is no point in fitting a linear model between them.
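
As a rough first check of this assumption, you can measure the linear correlation between X and Y. The sketch below is illustrative only: the data are made up, and `pearson_r` is a hypothetical helper written with the standard library, not code from this course. In practice, a scatter plot is usually the more informative check.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up data: ten X values and a small, fixed disturbance.
x = list(range(1, 11))
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.1]

# Y depends linearly on X (plus the disturbance).
y_linear = [2 * a + 1 + e for a, e in zip(x, noise)]

# Y depends on X, but through a symmetric curve rather than a line.
y_curved = [(a - 5.5) ** 2 for a in x]

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)

print("r for the linear data:", round(r_linear, 4))
print("r for the curved data:", round(r_curved, 4))
```

A value of r close to +1 or -1 suggests a linear trend. A value close to 0 means a straight line explains little, even when, as in the curved example, Y is completely determined by X.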

**Error terms are *normally distributed* with mean zero (not X, Y)**

- It is not a problem if the error terms are not normally distributed, as long as you only wish to fit a line and not draw any further inferences.
- But if you want to make inferences from the model that you have built (you will see this in the coming segments), you need to have a notion of the distribution of the error terms. One particular repercussion of the error terms not being normally distributed is that the p-values obtained during the hypothesis test to determine the significance of the coefficients become unreliable.
- The assumption of normality is made because, in most cases, the error terms have been observed to follow a **normal distribution with mean equal to zero**.
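
The following is a minimal sketch of how this assumption might be examined once a line has been fitted. It uses simulated data, and the helper `fit_line` and all variable names are assumptions made for illustration. Note that when the model includes an intercept, least-squares residuals average to zero by construction, so it is the shape of their distribution that needs checking. Here the shape is summarised by skewness and excess kurtosis, both of which are 0 for a normal distribution.

```python
import random

def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Simulated data: y = 3 + 2x plus normally distributed errors.
random.seed(42)
x = [random.uniform(0, 10) for _ in range(500)]
y = [3.0 + 2.0 * a + random.gauss(0, 1) for a in x]

intercept, slope = fit_line(x, y)
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

# Central moments of the residuals.
n = len(residuals)
mean_res = sum(residuals) / n
m2 = sum((r - mean_res) ** 2 for r in residuals) / n
m3 = sum((r - mean_res) ** 3 for r in residuals) / n
m4 = sum((r - mean_res) ** 4 for r in residuals) / n

skewness = m3 / m2 ** 1.5           # 0 for a symmetric distribution
excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution

print("mean of residuals:", mean_res)
print("skewness:", round(skewness, 3))
print("excess kurtosis:", round(excess_kurtosis, 3))
```

Values of skewness or excess kurtosis far from 0 hint that the residuals are not normal. A histogram or Q-Q plot of the residuals is the more common visual check.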

**Error terms are *independent* of each other**

- The error terms should not depend on one another (as in time-series data, where the next value depends on the previous one).
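
One common diagnostic for this assumption is the Durbin-Watson statistic, which compares successive residuals. The sketch below is only an illustration: the two error series are simulated stand-ins for model residuals, and `durbin_watson` is a hand-written helper, not a library call.

```python
import random

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    num = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    den = sum(r ** 2 for r in residuals)
    return num / den

random.seed(0)
n = 500

# Independent errors: each value is drawn on its own.
independent = [random.gauss(0, 1) for _ in range(n)]

# Dependent errors: each value carries over 90% of the previous one,
# as can happen with time-series data.
correlated = [random.gauss(0, 1)]
for _ in range(n - 1):
    correlated.append(0.9 * correlated[-1] + random.gauss(0, 1))

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)

print("Durbin-Watson, independent errors:", round(dw_independent, 2))
print("Durbin-Watson, correlated errors:", round(dw_correlated, 2))
```

The statistic lies between 0 and 4. A value near 2 suggests no first-order autocorrelation, values towards 0 suggest positive autocorrelation, and values towards 4 suggest negative autocorrelation.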

**Error terms have *constant variance* (homoscedasticity)**

- The variance of the error terms should not increase (or decrease) as the value of X (or the predicted value of Y) changes.
- Also, the variance should not follow any other pattern across the range of X.
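
One informal way to look for non-constant variance is to compare the spread of the residuals at low and high values of X. The sketch below does this on simulated data. The helpers `fit_line` and `variance_ratio` and the half-and-half split are assumptions chosen for illustration, not a formal statistical test.

```python
import random

def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def variance_ratio(x, y):
    """Mean squared residual in the high-X half divided by the low-X half."""
    intercept, slope = fit_line(x, y)
    # Pair each x with its residual, then order the pairs by x.
    pairs = sorted((a, b - (intercept + slope * a)) for a, b in zip(x, y))
    residuals = [r for _, r in pairs]
    half = len(residuals) // 2
    low, high = residuals[:half], residuals[half:]
    spread_low = sum(r ** 2 for r in low) / len(low)
    spread_high = sum(r ** 2 for r in high) / len(high)
    return spread_high / spread_low

random.seed(1)
x = [random.uniform(1, 10) for _ in range(500)]

# Errors with the same standard deviation everywhere (homoscedastic).
y_constant = [3.0 + 2.0 * a + random.gauss(0, 1) for a in x]

# Errors whose standard deviation grows with x (heteroscedastic).
y_growing = [3.0 + 2.0 * a + random.gauss(0, 0.3 * a) for a in x]

ratio_constant = variance_ratio(x, y_constant)
ratio_growing = variance_ratio(x, y_growing)

print("spread ratio, constant variance:", round(ratio_constant, 2))
print("spread ratio, growing variance:", round(ratio_growing, 2))
```

A ratio close to 1 is consistent with constant variance, while a ratio far from 1 suggests that the spread of the errors changes with X. Plotting residuals against X or against the predicted values is the usual visual counterpart of this check.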

You will look at each of these assumptions in more detail later and validate these while building the model.