Regression using Spark MLlib

Now that you have seen how to perform basic EDA on the data using the MLlib library, let’s continue exploring the library by building different machine learning models. The next few videos aim to give you an intuition for how simple model building is. Before building any machine learning model, however, it is important to get a sense of what is happening internally.

In the next video, let’s build a linear regression model on the Boston housing dataset using the Spark ML library.

You can download the notebooks and dataset used for the demonstration here. Note that you may need to install some packages if they are not already installed. To install NumPy, for example, run pip install numpy from a terminal.

To load the data, you can upload your dataset to an Amazon S3 bucket and read it into a Spark DataFrame from there.

Recall that Ankit mentioned earlier that machine learning models require their input in the form of a vector. That is exactly why the first attempt to apply linear regression failed: you get an error saying that the linear regression object expects an input features column.

Using the VectorAssembler feature transformer, you can assemble the individual feature columns into a single vector column. Once the new features column is in place, you can build a linear regression model by specifying featuresCol and labelCol.

You can find the R² score of the model by calling model.evaluate() on a DataFrame and reading the r2 attribute of the summary it returns.
