So far in this module, you have learnt how to perform basic EDA on data using the MLlib library. Now let’s continue exploring the library by building different machine learning models. The next few videos focus on giving you an intuition of how simple model building is. Before building any machine learning model, however, it is important to get a sense of what is happening internally.
Before we start with the ML models, the following video shows how to install, on an EMR notebook, the important Python libraries that you will need throughout this module. You will also learn how to plot graphs on an EMR notebook with the help of Matplotlib.
In the video above, you learnt how to install important libraries such as sklearn and matplotlib on an EMR notebook, and how to plot graphs on an EMR notebook with matplotlib.
The notebook used in the video above is as follows.
The Jupyter notebooks provided throughout this module use the library installation method shown above. Some of the videos may not show these steps, but you will still need to run them to install the required Python libraries.
Now, let’s build a linear regression model on the Boston housing dataset using the Spark ML library in the next video.
Also, download the following dataset for the practice exercise.
To load the data, upload your dataset to an Amazon S3 bucket and read it using the following command:
df = spark.read.csv('s3a://…./iris.csv', header=True, inferSchema=True)
Output
Recall that Ankit mentioned earlier that machine learning models require their input in the form of a vector. That is exactly the issue you ran into when you tried to apply linear regression for the first time: you got an error saying that the linear regression object expects an input features column.
Example
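from pyspark.ml.feature import VectorAssembler
# Combine every column except the label ('price') into a single vector column
assembler = VectorAssembler(inputCols=[c for c in sdf.columns if c != 'price'], outputCol='features')
dataset = assembler.transform(sdf)
Output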
Using the VectorAssembler feature transformer, you can assemble the individual feature columns into a single vector. Once the features are assembled into a new features column, you can build a linear regression model by specifying featuresCol and labelCol.
Example
Output
You can find the r2 score associated with the model by calling model.evaluate() on a dataset and reading the r2 attribute of the summary it returns.
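To get a feel for what the r2 number means, the sketch below computes it by hand using the standard definition: one minus the ratio of the residual sum of squares to the total sum of squares. The function name r2_score_manual and the sample values are hypothetical and chosen only for illustration. This is plain Python, not the Spark implementation.

```python
def r2_score_manual(actual, predicted):
    """Return 1 - SS_res / SS_tot for two equal-length lists of numbers."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: how far the predictions are from the true values
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: how far the true values are from their own mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical labels and predictions
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]

score = r2_score_manual(actual, predicted)
print(score)
```

A score of 1 means the predictions match the labels exactly, while a model that always predicts the mean of the labels scores 0.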