
Scalability in Linear Regression

So far in this module, you have learnt how to perform basic EDA using the Spark ML library. In the next series of videos, you will understand how to implement linear regression using the PySpark API.

Before looking at the implementation, let’s first understand what makes the linear regression algorithm scalable. While studying the basics of linear regression, you saw how the whole algorithm boils down to a matrix expression for β, whose calculation involves a series of matrix transformations and matrix multiplications. Go through the next video, where you will learn about embarrassingly parallel problems.

You have seen earlier that matrix multiplication is parallelizable in Spark. Let’s take the example of the closed-form expression for the linear regression coefficients β:

$\hat{\beta} = (X^T X)^{-1} X^T Y$

You can see that it is composed of four operations:

  1. Multiplication of X^T and X
  2. Inverse of (X^T X)
  3. Multiplication of X^T and Y
  4. Multiplication of (X^T X)^-1 and X^T Y

Three of the above operations are matrix multiplications, which can be easily parallelized in Spark, and the inversion is performed only on the small X^T X matrix (p × p, where p is the number of features). Hence, most machine learning algorithms are parallelizable in Spark, as they usually reduce to a series of matrix operations.
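To make the four steps concrete, here is a minimal NumPy sketch of the normal-equation computation on a small, made-up dataset (the numbers are illustrative, not from the course). In Spark, each of these multiplications would be distributed across partitions; the logic of the steps is the same.

```python
import numpy as np

# Toy design matrix X (first column of ones is the intercept term)
# and target vector Y. Values are illustrative only.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
Y = np.array([3.1, 5.0, 6.9, 9.1])

# Step 1: multiply X^T and X
xtx = X.T @ X
# Step 2: invert (X^T X) -- a small p x p matrix
xtx_inv = np.linalg.inv(xtx)
# Step 3: multiply X^T and Y
xty = X.T @ Y
# Step 4: multiply (X^T X)^-1 and (X^T Y) to get the coefficients
beta_hat = xtx_inv @ xty

print(beta_hat)  # [intercept, slope]
```

Steps 1 and 3 are the expensive ones on big data, since they touch every row of X; each row's contribution can be computed independently and summed, which is what makes them embarrassingly parallel.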
