Classification Using Spark MLlib

In the last segment, you saw how to perform regression using the Spark ML library. In this segment, you will learn how to perform classification with the same library. Classification involves predicting discrete values (or labels), whereas regression involves predicting continuous values. Many algorithms are available for classification, such as logistic regression and Naive Bayes classifiers. In the next video, let's build a simple logistic regression model.

Building a simple logistic regression model starts with loading the necessary data set into a new data frame. Next, you use a VectorAssembler to create a single feature column out of all the features, excluding the species column. The species column is the target (label) variable in our data set, that is, the value the model predicts. Unlike in regression, simply defining the input features and the output column is not enough for classification. You need to convert the output column 'species', which is categorical, into numerical values. You can do this using the StringIndexer feature transformer available in the Spark MLlib library.
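The loading and feature-assembly steps could be sketched roughly as follows. This is a minimal illustration, not the course's exact code: the function names, the CSV path argument, and the assumption that the file has a header row with numeric feature columns plus a 'species' column are all hypothetical.

```python
def feature_columns(all_columns, label_col="species"):
    """Return every column name except the label column."""
    return [c for c in all_columns if c != label_col]


def load_and_assemble(spark, csv_path, label_col="species"):
    """Load a CSV into a data frame and add a single 'features' vector column.

    `spark` is assumed to be an existing SparkSession.
    """
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.ml.feature import VectorAssembler

    # Assumption: the file has a header row and numeric feature columns.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    assembler = VectorAssembler(
        inputCols=feature_columns(df.columns, label_col),
        outputCol="features",
    )
    return assembler.transform(df)
```

Deriving the input columns from df.columns avoids hard-coding feature names, which is convenient when every column other than the label is a feature.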

StringIndexer Transformer

You can convert a column of string values in a data frame to numeric values using the StringIndexer transformer. It assigns an index to each distinct string value based on how frequently that value occurs in the column.

For example, given an input column of string values, StringIndexer converts each distinct value into a numeric index.

Here, as an intermediate step, the frequency of each string value is calculated, and the value with the highest frequency is given an index of 0. This way, each string value is converted to a numerical value.
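The frequency-based ordering described above can be illustrated in plain Python, without Spark. This is only a sketch of the idea, using made-up sample values. The alphabetical tie-breaking mirrors what recent Spark versions document for the default ordering, but treat it as an assumption for your version.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct string to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (assumption).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample column: 'a' occurs 3 times, 'c' twice, 'b' once.
sample = ["a", "b", "c", "a", "a", "c"]
mapping = frequency_index(sample)
print(mapping)                       # {'a': 0.0, 'c': 1.0, 'b': 2.0}
print([mapping[v] for v in sample])  # [0.0, 2.0, 1.0, 0.0, 0.0, 1.0]
```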

Finally, by calling transform() on the fitted indexer, you get the input data frame back with a new indexed column added.
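In PySpark, indexing the label column might look like the sketch below. The function name and the output column name 'label' are illustrative choices, not taken from the course material.

```python
def index_label(df, input_col="species", output_col="label"):
    """Add a numeric label column derived from a string column."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.ml.feature import StringIndexer

    indexer = StringIndexer(inputCol=input_col, outputCol=output_col)
    # fit() computes the value frequencies; transform() adds the indexed column.
    indexer_model = indexer.fit(df)
    return indexer_model.transform(df)
```

The original string column is kept alongside the new numeric column, so you can still read the class names when inspecting the result.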

Once you have applied the StringIndexer transformer to the output column, you can build a logistic regression model on top of it by specifying the input featuresCol and the output labelCol. An important point to note here is that the data set is split into train and test sets so that the model's performance can be checked on unseen data. This is something we haven't explored so far while building linear regression models, and you will do more of it as you move further in the session.
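A minimal sketch of the split-and-fit step is shown below. It assumes the data frame already has a 'features' vector column and a numeric 'label' column. The 70/30 split ratio, the seed, and the function name are arbitrary illustrative choices.

```python
def train_logistic_regression(df, train_fraction=0.7, seed=42):
    """Split the data and fit a logistic regression model on the training part."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.ml.classification import LogisticRegression

    # randomSplit takes relative weights; a fixed seed makes the split repeatable.
    train_df, test_df = df.randomSplit(
        [train_fraction, 1.0 - train_fraction], seed=seed
    )

    lr = LogisticRegression(featuresCol="features", labelCol="label")
    model = lr.fit(train_df)

    # Return the held-out data so the caller can evaluate on unseen rows.
    return model, test_df
```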

Finally, by calling the evaluate() method on the test data set, you can print summary statistics of the model.
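As a rough sketch, the evaluation step could be wrapped like this. The function name is hypothetical, and the three metrics printed are just a selection of the attributes a fitted logistic regression model's summary exposes.

```python
def print_test_metrics(model, test_df):
    """Evaluate a fitted logistic regression model on held-out data."""
    # evaluate() returns a summary object for the given data frame.
    summary = model.evaluate(test_df)
    print("accuracy:", summary.accuracy)
    print("weighted precision:", summary.weightedPrecision)
    print("weighted recall:", summary.weightedRecall)
    return summary
```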

Now, based on your understanding of how a classification model is built, try to build a Naive Bayes classifier on the Iris data set before watching the next video.

The implementation of a Naive Bayes classifier is largely the same as that of the logistic regression classifier, the major difference being the use of MulticlassClassificationEvaluator to evaluate the model. You will now focus on linear regression and discuss it in more detail in the next segment.
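One possible way to train and score a Naive Bayes classifier is sketched below. It assumes train and test data frames that already contain a 'features' vector column and a numeric 'label' column. The function name and the choice of accuracy as the metric are illustrative.

```python
def train_and_score_naive_bayes(train_df, test_df):
    """Fit a Naive Bayes model and return its accuracy on the test data."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # The default (multinomial) model type requires non-negative feature values.
    nb = NaiveBayes(featuresCol="features", labelCol="label")
    model = nb.fit(train_df)

    # transform() appends a 'prediction' column to the test rows.
    predictions = model.transform(test_df)

    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy"
    )
    return evaluator.evaluate(predictions)
```

Here the evaluator compares the 'prediction' column against the 'label' column, rather than relying on a model-specific summary object.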

Additional Reading

To learn more about StringIndexer, you can refer to the Spark documentation.