In the previous segment, you implemented gradient descent for simple linear regression. In this segment, we will implement gradient descent for multiple linear regression, where the model has more than one predictor.

For multiple linear regression, the cost function looks like the following:
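The equation is not reproduced here; a reconstruction consistent with the symbols defined below is:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2
```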

In the equation above, xi is the column vector of feature inputs for the ith training example, m is the number of training examples, hθ(xi) is the prediction of our regression model for that example, and yi is the corresponding value of the dependent variable.

The dataset is a collection of data points (xi, yi). Given a model hθ, the squared error of hθ on a single data point is (hθ(xi)−yi)². Summing these errors and multiplying by 1/2 gives the total error (1/2)∑(hθ(xi)−yi)²; dividing this by the number of summands m gives the average error per data point, (1/2m)∑(hθ(xi)−yi)². The factor of 1/2 is included for convenience: it cancels the 2 that appears when the squared term is differentiated, which makes the gradient expressions cleaner.

Now, let’s write the Python code for the cost function explained above:
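The original code listing is not shown here, so the following is a minimal NumPy sketch; the function name `compute_cost` and the array shapes are assumptions, not the original course code:

```python
import numpy as np

def compute_cost(X, y, theta):
    """Least-squares cost J(theta) = (1/2m) * sum((h_theta(x_i) - y_i)^2).

    X     : (m, n) matrix of inputs, first column all ones for the intercept
    y     : (m, 1) column vector of the dependent variable
    theta : (n, 1) column vector of parameters
    """
    m = X.shape[0]
    predictions = np.matmul(X, theta)        # (m, 1) vector of h_theta(x_i)
    errors = predictions - y                 # (m, 1) residuals
    return float(np.sum(errors ** 2) / (2 * m))
```

`np.matmul` computes the matrix product of the input matrix and the parameter vector, which is why it appears in the note below.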

To learn more about np.matmul, read the article here.

In the cost function above, the hypothesis or the predicted value is given by the following linear model.
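The linear model itself is not reproduced here; written out in the notation used above, it is:

```latex
h_\theta(x_i) = \theta^{T} x_i = \theta_0 \cdot 1 + \theta_1 x_{i,1} + \dots + \theta_n x_{i,n}
```

where the first entry of xi is fixed at 1 so that θ0 acts as the intercept.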

Here, xi is the column vector whose first entry is 1 — the input paired with the intercept coefficient θ0 — and whose remaining entries are the values of the different features.

We need to minimise the cost function J(θ). One way to do this is to use the batch gradient descent algorithm. In batch gradient descent, the parameters θ are updated in every iteration:
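The update rule is not reproduced here; the standard form, consistent with the cost function above, is:

```latex
\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right) x_{i,j}
```

applied simultaneously for every j, where α is the learning rate.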

With each iteration, the parameter θ comes closer to the optimal value that achieves the lowest cost J(θ).

Now, let’s look at how to implement this in Python.
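The original listing is not shown here, so the following is a minimal sketch of batch gradient descent in NumPy; the function name `gradient_descent` and the returned `cost_history` list are assumptions:

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent for linear regression.

    X         : (m, n) input matrix, first column all ones
    y         : (m, 1) column vector of targets
    theta     : (n, 1) initial parameter vector
    alpha     : learning rate
    num_iters : number of iterations to run

    Returns the learned theta and the cost recorded at each iteration.
    """
    m = X.shape[0]
    cost_history = []
    for _ in range(num_iters):
        errors = np.matmul(X, theta) - y            # (m, 1) residuals
        gradient = np.matmul(X.T, errors) / m       # (n, 1) gradient of J
        theta = theta - alpha * gradient            # simultaneous update of all theta_j
        cost_history.append(float(np.sum(errors ** 2) / (2 * m)))
    return theta, cost_history
```

Recording the cost at every iteration is what lets us plot J(θ) against the iteration count further below.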

The code above iterates, updating θ and recording the cost J(θ) at each step.

The code above will generate a data frame, which is used to plot the cost function against the number of iterations.
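A sketch of how such a data frame could be assembled with pandas; the placeholder `cost_history` values here are illustrative only, standing in for the per-iteration costs recorded during gradient descent:

```python
import pandas as pd

# Placeholder for the per-iteration costs collected while running gradient
# descent; a decaying series is used here purely for illustration.
cost_history = [100 * 0.97 ** i for i in range(300)]

# One row per iteration: the iteration number and the cost at that point.
df = pd.DataFrame({"iteration": range(len(cost_history)),
                   "cost": cost_history})

# df.plot(x="iteration", y="cost") renders the cost against the
# number of iterations.
print(df.head())
```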

In the graph above, we can observe that the cost function flattens out after about 200 iterations. Since the cost function for linear regression is convex, this plateau indicates that gradient descent reaches (approximately) the global minimum within 200 iterations.

The learning rate and the number of iterations have a direct effect on the shape of the graph. Note that the graph may differ from the one shown above as you change the learning rate and the number of iterations.

In the next segment, you will work on a practice question on gradient descent.