Before diving into what exactly a simple model is and what its benefits are, we will take a short detour to revisit some terminology in the machine learning framework. You will recap the process of using training data, learning from it and building a model that describes a system performing the task at hand, such as classification or regression. The key objectives here are to understand:
- The meaning of model, learning algorithm, system and hypothesis class.
- The (often misunderstood) difference between a learning algorithm and a model.
- The meaning of ‘class of models’.
You have just revised the basic machine learning framework. You will now learn a basic property of a learning algorithm: it can only produce models of a certain kind, within its own boundaries. This means that an algorithm designed to produce a linear class of models, such as linear or logistic regression, will never produce a decision tree or a neural network. The class of models becomes critical because choosing the wrong class will yield a sub-optimal model. Let us understand this point in more detail.
Let’s understand two more important terms: hypothesis and hypothesis class.
A hypothesis is the same as a model, and a hypothesis class is the class of models that you are going to consider for a given problem. Every learning algorithm has its own limitations: it works within the boundary of a certain class of models. Take random forests as an example. Random forest is a learning algorithm, and every model it produces is a forest, i.e. a collection of decision trees. Those are the only kind of models that the random forest algorithm will ever produce.
Now suppose that the learning algorithm is linear regression, logistic regression or a linear SVM. In this case the learning algorithm will only ever produce linear models; it will not consider any other kind of model.
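The idea above can be sketched in a few lines of pure Python. Here `fit_constant` is a deliberately tiny, hypothetical learning algorithm (not from any library) whose hypothesis class contains only constant models of the form y = c. No matter what data you feed it, the best it can return is a constant, which is the whole point: the algorithm's output always stays inside its own class of models.

```python
# Sketch: a learning algorithm is a function that searches ONE hypothesis
# class and returns the best model it can find inside that class.
# fit_constant's hypothesis class is {y = c : c is a real number}.

def fit_constant(xs, ys):
    """A toy learning algorithm: its hypothesis class is constant models only."""
    c = sum(ys) / len(ys)        # the least-squares best constant is the mean
    return lambda x: c           # the learned model, returned as a callable

# Train on data that is clearly NOT constant (it follows y = 2x) ...
model = fit_constant([1, 2, 3], [2.0, 4.0, 6.0])

# ... yet the learned model still predicts the same constant everywhere,
# because the algorithm never considers anything outside its class.
print(model(10), model(-100))
```

However good or bad the fit, the output is always a member of the hypothesis class the algorithm was built to search; a richer model would require a different algorithm.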
Let’s look at the following illustration and understand this better. Suppose that you have a set of data points as shown below.
Is it possible to fit a linear regression model on this type of data? You can of course do that, but will you get the desired results? Will linear regression give you a well-performing model on this data?
When you fit a linear regression on this type of data, you get something like the image shown above, which is definitely not the best model, as it is unable to capture the underlying structure of the data. Linear regression will produce the line that it thinks best fits the dataset among all possible straight lines. It will not consider anything other than a straight line for this dataset.
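You can verify this numerically with a minimal pure-Python sketch (the data and the closed-form least-squares formulas below are illustrative, not from the original text). We generate points from a curve, y = x², fit the best possible straight line to them, and compare its mean squared error with that of the true curve. Even the best line in the hypothesis class of straight lines leaves a large error, because the class itself cannot represent the data's structure.

```python
# Sketch: fitting the best straight line to curved (quadratic) data.
# Pure-Python least squares, for illustration only.

xs = [x / 10 for x in range(-20, 21)]   # inputs in [-2, 2]
ys = [x * x for x in xs]                # underlying curve: y = x^2

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares line: y = slope * x + intercept
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

# Mean squared error of the best line vs. the true curve y = x^2
line_mse = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / n
curve_mse = sum((x * x - y) ** 2 for x, y in zip(xs, ys)) / n

print(slope, line_mse, curve_mse)
```

Because the data is symmetric, the best line is horizontal (slope near zero) with a large residual error, while the quadratic fits perfectly. No amount of optimisation within the straight-line class can close that gap.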
So, given data, a learning algorithm produces a model, which could be a linear regression model, a decision tree or any other model. The learning algorithm puts a boundary around the class of models that it is ever going to consider, and among those models it tries to find the one that best fits the training data. That best-fitting model comes out as the output of the learning algorithm. Once you have this model, you can use it to make predictions.
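The full flow described above, learning algorithm in, model out, predictions from the model, can be sketched as follows. `fit_line` is a hypothetical stand-in for a real learning algorithm (here, simple least-squares linear regression); the names and data are made up for illustration.

```python
# Sketch of the overall flow: training data -> learning algorithm -> model
# -> predictions. fit_line's hypothesis class is straight lines y = a*x + b.

def fit_line(xs, ys):
    """Toy learning algorithm: returns the least-squares line as a callable model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b      # the trained model

# Step 1: feed training data (drawn from y = 2x + 1) to the learning algorithm.
model = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

# Step 2: use the resulting model to make predictions on new inputs.
print(model(10))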
Having understood the terminology and the difference between algorithms, models and classes of models, we are now ready to explore the notions of model simplicity and complexity, and common issues such as overfitting. Let’s look into each of them in the next segment.