Model Evaluation – Classification Metrics

Imagine that a person claims to have built a model that classifies whether a given transaction is fraudulent with 99.83% accuracy. Would you want to productionise that model? With state-of-the-art accuracy like that, you would have a near-perfect classifier, right? Well, not exactly.

This result may sound interesting and even impressive, but we should dive deeper to understand it better. Consider a dataset that is highly imbalanced, with 99.83% of the observations labelled as non-fraudulent transactions and only 0.17% labelled as fraudulent. Without handling this imbalance, the model simply learns to predict the majority class: it classifies every transaction as non-fraudulent and thereby achieves the aforementioned accuracy while never detecting a single fraud.
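A minimal sketch of this accuracy paradox on synthetic data (the class ratio mirrors the example above; the data itself is made up purely for illustration):

import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 100_000
# 1 = fraudulent, 0 = non-fraudulent; roughly 0.17% positives
y_true = (rng.random(n) < 0.0017).astype(int)

# A degenerate "model" that predicts non-fraudulent for every transaction
y_pred = np.zeros(n, dtype=int)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")  # ~0.9983, yet zero frauds caught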

So, is accuracy always a reliable evaluation metric for a classification model? Let us understand this situation better in the next video.

As Snehanshu explained, accuracy is not always the correct metric for solving classification problems. There are other metrics such as precision, recall, confusion matrix, F1 score, and the AUC-ROC score. Let us focus on these metrics and understand them in the next video.
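As a sketch of how these metrics are typically computed with scikit-learn (the labels and scores below are illustrative stand-ins for a real model's outputs):

import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Illustrative stand-ins: y_true are the actual labels, y_score the model's
# predicted probabilities of the positive class
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(rng.normal(0.35 + 0.3 * y_true, 0.25), 0, 1)

y_pred = (y_score >= 0.5).astype(int)             # default 0.5 threshold

print(confusion_matrix(y_true, y_pred))           # rows: actual, columns: predicted
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_true, y_score))  # uses scores, not hard labels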

The ROC curve is used to understand the strength of a model by evaluating its performance at all classification thresholds. It plots the true positive rate, TPR = TP / (TP + FN), against the false positive rate, FPR = FP / (FP + TN), as the threshold varies.

As Snehanshu explained, the default threshold of 0.5 is not always the ideal threshold for finding the best classification label of a test point. Because the ROC curve is measured at all thresholds, the best threshold would be one at which the TPR is high and the FPR is low, i.e., misclassifications are low.
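One common heuristic (not the only one) for picking such a threshold from the ROC curve is to maximise Youden's J statistic, J = TPR - FPR. A sketch, reusing y_true and y_score from the snippet above:

import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_score)
best_idx = np.argmax(tpr - fpr)          # point where TPR exceeds FPR the most
best_threshold = thresholds[best_idx]
print(f"Best threshold: {best_threshold:.3f} "
      f"(TPR = {tpr[best_idx]:.3f}, FPR = {fpr[best_idx]:.3f})")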


After determining the optimal threshold, you can calculate the F1 score of the classifier, which combines precision and recall as their harmonic mean, at the selected threshold. Let us understand this in detail in the next video.
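Continuing the sketch above, precision, recall, and F1 can be computed at the chosen threshold instead of the default 0.5:

from sklearn.metrics import precision_score, recall_score, f1_score

y_pred = (y_score >= best_threshold).astype(int)
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
# F1 is the harmonic mean: F1 = 2 * p * r / (p + r)
print(f"Precision = {p:.3f}, Recall = {r:.3f}, F1 = {f1_score(y_true, y_pred):.3f}")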

Finding the best F1 score is not the last step. This score depends on both precision and recall. So, depending on the use case, you have to account for what you need: high precision or high recall.

You have seen how the model was a bad classifier even though it achieved high accuracy by labelling every transaction as non-fraudulent. But what if the scenario is the opposite?


If the model labels all data points as fraudulent, then your recall becomes 1.0: recall is TP / (TP + FN), and with no negative predictions there are no false negatives. However, at the same time, your precision is compromised. Precision, defined as TP / (TP + FP), is the ability of a classification model to identify only the relevant data points. When you increase recall, you will typically decrease precision.
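A sketch of this opposite degenerate case on the same synthetic imbalanced data as before:

import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(42)
y_true = (rng.random(100_000) < 0.0017).astype(int)  # ~0.17% fraud, as above
y_pred = np.ones_like(y_true)                        # every transaction flagged

print("Recall:   ", recall_score(y_true, y_pred))    # 1.0 -- no fraud is missed
print("Precision:", precision_score(y_true, y_pred)) # ~0.0017 -- almost all false alarms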

You can maximise recall or precision at the expense of the other; which one to prioritise depends on whether your use case needs high precision or high recall.
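In practice, this trade-off is often tuned by sweeping the threshold along the precision-recall curve. A sketch that finds the first threshold meeting an illustrative 90% precision target (the 0.90 figure is an assumption for demonstration, not a recommendation):

import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative stand-ins for a real model's labels and scores, as earlier
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(rng.normal(0.35 + 0.3 * y_true, 0.25), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# precision and recall have one more entry than thresholds; drop the last point
meets_target = precision[:-1] >= 0.90
if meets_target.any():
    idx = np.argmax(meets_target)        # first threshold reaching the target
    print(f"Threshold {thresholds[idx]:.3f}: "
          f"precision = {precision[idx]:.3f}, recall = {recall[idx]:.3f}")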


For banks with a smaller average transaction value, we would want high precision because we want to flag only genuinely suspicious transactions as fraudulent. For every transaction flagged as fraudulent, a human element can be added to verify it, for example, by calling the customer. However, when precision is low, this verification becomes a burden because the volume of false alarms forces the human effort to increase.

For banks with a larger average transaction value, low recall means the model misses fraudulent transactions, i.e., they are labelled as non-fraudulent. Consider the losses if a missed transaction was a high-value fraudulent one, e.g., a transaction of $10,000.


So here, to save banks from high-value fraudulent transactions, we have to focus on high recall in order to detect the actual fraudulent transactions.
