How can you guarantee that an ensemble built from models that satisfy the two conditions of diversity and acceptability will be better than any individual model? Let’s hear about this in the next video from Prof. Raghavan.

In this video, we discussed a few reasons why ensembles work better than individual models.

Now, let’s understand how an ensemble helps in making decisions. Consider an ensemble with 100 models consisting of decision trees, logistic regression models, etc. Given a new data point, each model will predict an output y for this data point. If this is a binary classification, then you simply take the majority vote. If more than 50% of the models say y=0, you go with 0, and vice versa.
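The majority vote described above can be sketched in a few lines of Python. This is only an illustration: the function name `majority_vote` and the 62/38 split of votes are assumptions made for the example, not something taken from the lecture.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by more than half of the models, or None on a tie."""
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    # A strict majority is required; an exact 50/50 split has no winner.
    return label if votes > len(predictions) / 2 else None


# Hypothetical outputs of 100 models for one data point: 62 say 0, 38 say 1.
votes = [0] * 62 + [1] * 38
print(majority_vote(votes))  # prints 0
```

Returning `None` on a tie is one possible design choice; with an odd number of models in a binary task, a tie cannot occur.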

Now, the question that arises is: Why should you expect the majority vote to perform better on unseen data than any of the 100 individual models? Well, there are a number of convincing arguments to answer this question.

Firstly, if each individual model is **acceptable**, i.e., if it is wrong with a probability of less than 50%, and the models make their errors independently of one another, you can show that the probability of the ensemble being wrong (i.e., the majority vote going wrong) will be less than that of any individual model. Your chance of getting the prediction correct is therefore higher with the ensemble than with any individual model in it, because the ensemble pools the opinions of all the weak learners and leverages their combined predictive power to predict the final outcome.

Also, an ensemble is much less likely to be misled by the **assumptions made by individual models**. For example, ensembles (particularly random forests) successfully reduce the problem of overfitting. If a decision tree in an ensemble overfits, you let it. Chances are extremely low that more than 50% of the models are overfitted. Ensembles ensure that you do not put all your eggs in one basket.

In a binary classification task, an ensemble makes decisions by considering the majority vote. This means that if there are n models in the ensemble and more than half of them give you the right answers, you will make the right decision. On the other hand, if more than half of the models give you the wrong answers, you will make a wrong decision. In the **coin toss analogy**, making a **correct prediction** corresponds to **heads**, whereas making an **incorrect prediction** corresponds to **tails**.

If you can prove that the probability of more than half of the models making a wrong prediction is less than that of any of the individual models, you will know that the ensemble is a better choice than any of the individual models.

Like the professor mentioned, let’s assume heads is analogous to making a correct prediction and, since it is a biased coin, P(Head) > P(Tail). Now, let’s assume you have an ensemble of, say, 5 models, and let’s take P(Head or Right Prediction) to be 0.6 and P(Tail or Wrong Prediction) to be 0.4 for each of the five models for the sake of demonstration. The probability of the ensemble making a right prediction then turns out to be approximately 68% (calculate P(Heads >= 3), since three or more heads out of five form a majority). As you can see, this is about 8 percentage points higher than the performance of any individual model, each of which has only a 60% probability of being right.
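The 68% figure can be checked with the binomial distribution. The sketch below assumes the five models are independent and each is correct with probability 0.6; the function name `majority_correct_probability` is made up for this illustration.

```python
from math import comb


def majority_correct_probability(n_models, p_correct):
    """P(more than half of n independent models are correct)."""
    min_votes = n_models // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(min_votes, n_models + 1)
    )


# Five models, each correct 60% of the time: P(Heads >= 3).
p_ensemble = majority_correct_probability(5, 0.6)
print(round(p_ensemble, 5))  # prints 0.68256
```

The three terms in the sum are 0.3456 (exactly three correct), 0.2592 (exactly four correct) and 0.07776 (all five correct).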

It is important to remember that each model in an ensemble is **acceptable**, i.e., the probability of each model being wrong is less than 0.5 (as a random binary classification model is correct 50% of the time).

Using the above example, you saw that the probability of more than half of the models in the ensemble making a wrong prediction (about 0.32) is **well below 0.5**, i.e., the ensemble does better than a random model, and it is also below the 0.4 error probability of each individual model. To see this, you used the (biased) coin toss analogy, in which you **map the predictions** made by the models in an ensemble to the **two sides of a biased coin**: heads corresponds to success (a correct prediction) and tails to failure (an incorrect prediction).

**Experimenting with an Ensemble of Three Models**

To understand why ensembles work better than individual models, let’s take a simple example of three coins (models). Consider an ensemble of **three models: m1, m2 and m3**, for a **binary classification** task (say, 1 or 0). Suppose each of these models is **correct 70% of the time**.

So, each model is acceptable. Given a data point whose class has to be predicted, the ensemble will predict the class using a **majority vote**. In other words, if two or more models predict class = 1 as the output, the ensemble will predict 1, and vice versa.

The following table shows all the possible cases that can occur while classifying a test data point as 1 or 0. The column to the extreme right shows the probability of each case. For example, in the first row, m1, m2, and m3 all give the correct output. Hence, the probability for this case is (0.7 x 0.7 x 0.7) = 0.343, since each model is correct 70% of the time and the three models are assumed to be completely independent of each other. Similarly, the probability of the second row is (0.7 x 0.7 x 0.3) = 0.147, since you have “Correct”, “Correct”, and “Incorrect”. And so on for all the rows.

In this table, there are four cases in which the decision of the final model (ensemble) is correct and four cases in which it is incorrect. Let’s assume that the probability of the ensemble being correct is p, and the probability of the ensemble being incorrect is q.

For the data in the table, p and q can be calculated as follows:

**p = 0.343 + 0.147 + 0.147 + 0.147 = 0.784**.

**q = 0.027 + 0.063 + 0.063 + 0.063 = 0.216 = 1 – p**.
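The eight cases and the values of p and q can be reproduced by enumerating every combination of correct and incorrect models. This is a minimal sketch that assumes the three models are independent and each is correct with probability 0.7; the variable names and the order in which the rows are printed are choices made for this illustration.

```python
from itertools import product

P_CORRECT = 0.7  # accuracy of each model, as in the example

p = 0.0  # probability that the ensemble is correct
q = 0.0  # probability that the ensemble is incorrect
rows = []

# True means the model is correct, False means it is incorrect.
for outcome in product([True, False], repeat=3):
    prob = 1.0
    for is_correct in outcome:
        prob *= P_CORRECT if is_correct else 1 - P_CORRECT
    ensemble_correct = sum(outcome) >= 2  # majority of three models
    rows.append((outcome, ensemble_correct, prob))
    if ensemble_correct:
        p += prob
    else:
        q += prob

print("m1 | m2 | m3 | ensemble | probability")
for outcome, ensemble_correct, prob in rows:
    labels = ["Correct" if c else "Incorrect" for c in outcome]
    verdict = "Correct" if ensemble_correct else "Incorrect"
    print(*labels, verdict, f"{prob:.3f}", sep=" | ")

print(f"p = {p:.3f}, q = {q:.3f}")  # prints p = 0.784, q = 0.216
```

Changing `P_CORRECT` to a value below 0.5 shows the opposite effect: the ensemble becomes worse than its individual models, which is why acceptability matters.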

Notice how the ensemble has a higher probability of being correct and a lower probability of being incorrect than any of the individual models (0.784 > 0.70 and 0.216 < 0.30). In the same way, you can calculate the probabilities of the ensemble being correct and incorrect with 4, 5, 100, 1000, and even a million individual models. As long as the models are acceptable and independent, the difference in probabilities will increase with an increasing number of models, thus **improving the overall performance of the ensemble.**
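To see how the probability of a correct majority vote grows with the number of models, the same binomial calculation can be repeated for several ensemble sizes. The sketch below assumes independent models that are each correct 70% of the time, and uses odd ensemble sizes so that ties cannot occur; the function name `ensemble_accuracy` and the chosen sizes are illustrative.

```python
from math import comb


def ensemble_accuracy(n_models, p_correct):
    """P(majority vote is correct) for n independent, equally accurate models."""
    min_votes = n_models // 2 + 1  # strict majority
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(min_votes, n_models + 1)
    )


sizes = [1, 3, 5, 11, 101]  # odd sizes avoid tied votes
accuracies = {n: ensemble_accuracy(n, 0.7) for n in sizes}

for n, acc in accuracies.items():
    print(f"{n:>4} models: P(ensemble correct) = {acc:.4f}")
```

This direct sum is fine for ensembles of this size; for very large ensembles, floating-point overflow becomes a concern and a statistics library or a log-space computation would be a safer choice.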