Application of One vs Rest Classifier

In this segment, you will learn from Ankit how to create new features from existing ones. The data will then be split into training and testing sets for model building and validation, respectively. In the next video, Ankit will proceed with feature creation.

In this video, Ankit created two new features in the data set: fund_perc and incToloan_perc. These features contain the following information:

  • fund_perc: The fraction of the total loan amount that was sanctioned (funded). A high value indicates that the bank is positive about lending to the customer.
  • incToloan_perc: The ratio of the customer's annual income to the loan amount. A high value indicates that the customer is more likely to repay without defaulting.

Here is the code used to create these features:

Python
loan_data['fund_perc'] = loan_data['funded_amnt'] / loan_data['loan_amnt']

loan_data['incToloan_perc'] = loan_data['annual_inc'] / loan_data['loan_amnt']

Then, as a good practice, Ankit checked the distribution of the modified data set and confirmed that the new columns are added to the data set. You can always dive deeper into the distribution using visualisations such as a boxplot.
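The check described above can be sketched as follows. This is a minimal, self-contained illustration using a hypothetical toy frame in place of the actual loan_data set; the column names match those used in the lesson.

```python
import pandas as pd

# Hypothetical toy data standing in for the actual loan_data set
loan_data = pd.DataFrame({
    "loan_amnt": [10000, 20000, 15000, 8000],
    "funded_amnt": [9500, 18000, 15000, 6000],
    "annual_inc": [60000, 80000, 45000, 32000],
})

# Create the two derived features
loan_data["fund_perc"] = loan_data["funded_amnt"] / loan_data["loan_amnt"]
loan_data["incToloan_perc"] = loan_data["annual_inc"] / loan_data["loan_amnt"]

# Confirm the new columns were added and inspect their distribution
print(loan_data.columns.tolist())
print(loan_data[["fund_perc", "incToloan_perc"]].describe())
```

A boxplot of either column (for example, loan_data['fund_perc'].plot(kind='box')) gives a visual view of the same distribution.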

Now that the data set is completely ready for model building, we split it into train and test sets. You already know how to do this from your previous learning, so try it on your own first and then watch Ankit do the same in the next video.

Firstly, the independent and dependent variables are separated into two data sets, X and Y, respectively. Only numerical columns are retained for the independent variables, as can be seen in the code below.

Python
X = loan_data.select_dtypes(np.number).drop(['id', 'funded_amnt'], axis=1)

Also, id and funded_amnt are dropped. Since id is a feature unique to each row, it is not possible for it to explain anything about the loan status. Further, since we are using fund_perc, funded_amnt is not required now. The dependent variable set can be created using the below code:

Python
Y = loan_data['loan_status']

After creating X and Y, a 7:3 train-test split is created as shown below: 

Python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=10)

Here, X_train and y_train form the training data set, which contains 70% of the data, and X_test and y_test form the testing data set, which contains 30% of the data. After this, both train and test data are scaled using the ‘StandardScaler’ as shown in the code below.

Python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Now that the data is in shape for model building, we create the model using the One vs Rest method. For both classification techniques (One vs Rest and One vs One), we will use built-in classes from the sklearn library. We start by importing the required libraries. Following is the list of imports along with a brief explanation of each:

  • from sklearn.linear_model import LogisticRegression: Required to build and train the Logistic Regression (LR) model
  • from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier: Imports the multi-class classification wrappers within which the LR algorithm can be used
  • from sklearn.metrics import accuracy_score, classification_report: Functions used to calculate the accuracy and other classification metrics
  • import warnings: Used to handle any warnings generated while executing the code

This is the code used for model building:
Python
LR = LogisticRegression()
oneVsrest = OneVsRestClassifier(LR)
oneVsrest.fit(X_train_scaled, y_train)

In this code, the first statement creates a logistic regression classifier. The second statement wraps the LR classifier in the One vs Rest method, which breaks the multi-class problem into a set of binary classification problems. The third statement fits the model on the scaled training data.

In the next video, you will learn how to make predictions using this model. Let’s hear about it from Ankit.

In this video, you learnt about the next step, that is, model prediction, in which we predict the category of the test data samples. Now that the model is trained using the One vs Rest method, we pass the test data samples to its predict method as shown below:

Python
prediction_oneVsRest = oneVsrest.predict(X_test_scaled)

The predicted results for the test samples are stored in the prediction_oneVsRest variable. Now, let’s take a look at the various classification parameters discussed in the video:

  • Accuracy: This is the ratio of correctly predicted observations to the total observations. The formula is given as follows:

        Accuracy = (True positives + True negatives) / (True positives + False positives + False negatives + True negatives)

  • Precision: This is the ratio of correctly predicted positive observations to the total predicted positive observations. The formula is given below:

         Precision = True positives / (True positives + False positives)

  • Recall: It is calculated as the number of true positives divided by the total number of true positives and false negatives. The formula is as follows:

          Recall = True positives / (True positives + False negatives)

  • F1 score: This is the harmonic mean of precision and recall. Therefore, this score takes both false positives and false negatives into account. The highest possible value of an F1-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, which indicates that either the precision or the recall is zero. The formula is as follows:

    F1 score = 2 * ((precision * recall) / (precision + recall))
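The four formulas above can be verified by hand on a small hypothetical set of labels and checked against sklearn's own metric functions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical toy binary labels, chosen just to exercise the formulas
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Count the four confusion-matrix cells by hand
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

# Apply the formulas from the text
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

# sklearn agrees with the hand-computed values
assert accuracy == accuracy_score(y_true, y_pred)
assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
```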


You can take a look at the code below that was executed to display the classification results. We print the accuracy value using the accuracy_score function and the remaining parameters (precision, recall, F1-score) using the classification_report function.

Python
print(f"Test Set Accuracy: {accuracy_score(y_test, prediction_oneVsRest) * 100} %\n\n")
print(f"Classification Report: \n\n{classification_report(y_test, prediction_oneVsRest)}")

The image given below shows the results under each parameter (accuracy, precision, recall and F1-score) for each class category separately. As mentioned in the video, support represents the number of actual data points in each class category. We will not get into the details of weighted average and macro average, as they are beyond the scope of this module.

You can see that the accuracy is about 82%, but as discussed in earlier sessions, we cannot depend on accuracy alone for model effectiveness. In the classification report table shown in the above image, each row represents a class category and its respective parameter values. We will not discuss the second table, as it is beyond the scope of this module.