Over the two previous segments, you developed a step-by-step understanding of the One vs One and One vs Rest techniques. You now know the theoretical schema of both the techniques. In this segment, through an example, you will graphically visualise how these techniques are used for training and prediction.
Now, you will hear from Ankit as he introduces the data set used for this demonstration.
As explained in the video, the data set taken by Ankit is the e-commerce data set that you have already seen in the previous segment (link to be added). The target variable takes the following three values:
- Normal: This type of transaction does not violate any conditions.
- Abusive: This transaction violates the fair usage policies of the website. For example, ordering multiple units of a heavily discounted product when the website has set a quantity limit on that item.
- Fraud: This transaction violates the laws of a given country. For example, placing an order using a stolen credit card.

The data set has the following feature variables, as explained by Ankit:
- Transaction id: This is a unique identifier for each transaction.
- Transaction amount: This is the total value of the items purchased in the transaction. Usually, large-value transactions have a higher probability of fraud.
- #linked accounts: These are the accounts that have been identified as linked to the account through which a transaction is being made. The linkage is established on the basis of parameters such as credit card, delivery address, cookie id and device id. To bypass the fair usage policy of a website, abusers create multiple linked accounts. So, for abusers, this value tends to be higher than it is for normal transactions.
- % items returned in the last 3 months: This represents the proportion of items returned in the last three months as compared with items bought in the last three months. A high proportion of returned items indicates that users are trying to abuse the system by ordering items and simply returning them.
- Fraud flagged on linked accounts: If any of the linked accounts have been flagged for fraud in the past, then this variable would be 1 for those accounts. This would indicate a high probability of a transaction being fraudulent.
For your understanding, this data set contains one row for each previous transaction on the e-commerce platform, and each row can be identified uniquely by its transaction id. The fraudulence of a transaction can be predicted by looking at features such as the transaction amount, the number of accounts linked to the transacting account, the percentage of items returned in the last three months, and fraud flags on accounts linked to the transacting account. Based on this prediction, appropriate action can be taken.
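To make the structure of the data concrete, a few rows of such a data set can be sketched as plain Python records. All values below are hypothetical and for illustration only; the real data set shown by Ankit is not reproduced here.

```python
# A minimal, hypothetical sketch of the e-commerce data set described above.
# Each record is one transaction; "label" is the target class.
transactions = [
    {"transaction_id": "T001", "amount": 120.50, "linked_accounts": 1,
     "pct_items_returned_3m": 0.05, "fraud_flag_linked": 0, "label": "Normal"},
    {"transaction_id": "T002", "amount": 35.00, "linked_accounts": 7,
     "pct_items_returned_3m": 0.60, "fraud_flag_linked": 0, "label": "Abusive"},
    {"transaction_id": "T003", "amount": 2400.00, "linked_accounts": 4,
     "pct_items_returned_3m": 0.10, "fraud_flag_linked": 1, "label": "Fraud"},
]

for row in transactions:
    print(row["transaction_id"], row["label"])
```

Note how each class shows the pattern described above: the abusive row has many linked accounts and a high return rate, while the fraudulent row has a large amount and a fraud flag on a linked account.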
By now, you should have a fair understanding of the problem statement. In the next video, you will hear from Ankit as he applies the One vs One technique on this data set and explains the steps graphically.
Following what you have studied in the segment on the One vs One method, in Step 1, the data set is used to create C(3,2) = 3 subsets, one for each pair of classes. Each subset contains the following rows:
- Subset 1: It contains rows where the target values are Abusive and Normal.
- Subset 2: It contains rows where the target values are Abusive and Fraud.
- Subset 3: It contains rows where the target values are Fraud and Normal.
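Step 1 above can be sketched in a few lines of Python. The rows below are hypothetical (features, label) pairs; the pairing logic itself is exactly the One vs One subset construction just described.

```python
# Sketch: building the C(3,2) = 3 One vs One subsets from labelled rows.
from itertools import combinations

# Hypothetical (features, label) pairs for illustration
rows = [
    ([120.5, 1], "Normal"), ([35.0, 7], "Abusive"), ([2400.0, 4], "Fraud"),
    ([80.0, 2], "Normal"), ([15.0, 9], "Abusive"), ([1800.0, 5], "Fraud"),
]

classes = ["Abusive", "Normal", "Fraud"]
subsets = {}
for pair in combinations(classes, 2):  # ("Abusive","Normal"), ("Abusive","Fraud"), ("Normal","Fraud")
    subsets[pair] = [(x, y) for x, y in rows if y in pair]

print(len(subsets))  # 3 subsets, one per pair of classes
```

Each subset keeps only the rows belonging to its two classes, which is what allows a binary classifier to be trained on it.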
Next, Ankit showed graphs to explain how each subset is classified using a binary logistic classifier. On each subset, Ankit fitted a sigmoid curve, which can classify a test sample into one of the two target classes that the subset contains. The graphs explained by Ankit can be seen below.
Now that you have the models in place, you can take test data and classify it using these binary models. As Ankit explains in the video, the test data is passed through each of these classifiers, and each classifier thus gives one prediction. Each prediction is a vote in favour of a particular target class. The target class that gets the maximum votes out of the C(n,2) total votes is taken as the final prediction.
One test data point, represented by a star mark, is passed through all three classifiers. The results obtained are shown below:
- Abusive vs Normal -> Abusive
- Fraud vs Abusive -> Abusive
- Fraud vs Normal -> Normal
Clearly, Abusive has the most votes (two out of three). So, the test data point is classified as Abusive.
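The voting step above reduces to a majority count over the three pairwise predictions, which can be sketched as follows (the votes mirror the example just shown; this is an illustration, not Ankit's implementation):

```python
from collections import Counter

# Votes cast by the three pairwise classifiers for the starred test point
votes = ["Abusive",   # Abusive vs Normal -> Abusive
         "Abusive",   # Fraud vs Abusive  -> Abusive
         "Normal"]    # Fraud vs Normal   -> Normal

# Final prediction: the class with the maximum votes
prediction = Counter(votes).most_common(1)[0][0]
print(prediction)  # Abusive
```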
In the next video, you will hear from Ankit as he applies the One vs Rest technique on the same data set and explains the steps graphically.
Similar to what you did in the One vs One technique, in Step 1 of the One vs Rest technique, the data set is used to create 'n' = 3 new data sets of the same dimension as the original data set. Each new data set contains the following rows:
- Data Set 1: It contains all the rows, with the target values relabelled as Normal and Rest.
- Data Set 2: It contains all the rows, with the target values relabelled as Abusive and Rest.
- Data Set 3: It contains all the rows, with the target values relabelled as Fraud and Rest.
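This relabelling step can be sketched directly: one copy of the full data set is made per class, with every label that is not that class replaced by "Rest". The rows below are hypothetical (features, label) pairs for illustration.

```python
# Sketch: relabelling the full data set once per class for One vs Rest.
rows = [
    ([120.5, 1], "Normal"), ([35.0, 7], "Abusive"), ([2400.0, 4], "Fraud"),
    ([80.0, 2], "Normal"), ([15.0, 9], "Abusive"), ([1800.0, 5], "Fraud"),
]

# One relabelled copy of the data per class: the class itself vs "Rest"
ovr_datasets = {
    c: [(x, y if y == c else "Rest") for x, y in rows]
    for c in ("Normal", "Abusive", "Fraud")
}

for c, data in ovr_datasets.items():
    print(c, [label for _, label in data])
```

Unlike One vs One, every relabelled data set here has the same number of rows as the original, which is why the text says the new data sets have the same dimension as the original.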
After this, Ankit showed graphs to explain how the data sets get classified using binary logistic classifiers. On each data set, Ankit fitted a sigmoid curve, which produces a probability score for the target class that the model represents. The graphs explained by Ankit in the video can be seen below.
In the first classifier image, a sigmoid curve could not be illustrated, since the data points of the Rest class are spread on both sides of the Abusive class. In the actual model, more complex mathematical operations are performed on the data set to fit a sigmoid curve, but these are beyond the scope of this module.
So, now that you have the models in place, you can take test data and classify it using these binary models. As Ankit explains, the test data is passed through each of these classifiers, and each classifier thus gives the probability of the test sample belonging to the target class associated with it. For example, in the case of Normal vs Rest, you look at the probability of the test sample being classified as Normal. Similarly, for Fraud vs Rest and Abusive vs Rest, you look at the probability of the test sample being classified as Fraud and Abusive, respectively.
The target class that gets the maximum probability is taken as the final prediction.
One test data point, represented by a star mark, is passed through all three classifiers. The results came out as follows.
- P(Abusive) = 0.99765
- P(Fraud) = 0.00035
- P(Normal) = 0.00200
Clearly, Abusive has the highest probability. So, the test data point is classified as Abusive.
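The final One vs Rest decision is simply an argmax over the three probability scores, which can be sketched as follows (the probabilities are the ones from the example above):

```python
# The three One vs Rest probability scores for the starred test point
probs = {"Abusive": 0.99765, "Fraud": 0.00035, "Normal": 0.00200}

# Final prediction: the class whose classifier gives the highest probability
prediction = max(probs, key=probs.get)
print(prediction)  # Abusive
```

Note that, in general, the scores from the three independent classifiers need not sum exactly to 1; only their relative order matters for the final prediction.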