Before getting into the implementation of logistic regression using PySpark, let’s have a brief recap of the concepts involved in building a logistic regression model.
In the following video, Ajay will explain the basics of the logistic regression algorithm.
As explained by Ajay in this video, even though its name contains ‘regression’, logistic regression is not used to predict continuous target variables. Instead, it is used to solve classification problems.
Classification can be of the following two types:
- Binary: The target variable has only two possible values.
- Multi-Class: The target variable has more than two possible values.
In the diabetes classification example, a hard decision boundary on the blood sugar level can be used, as shown below:
However, a hard decision boundary gives erroneous results for borderline cases. To mitigate this shortcoming, a smooth sigmoid curve is used instead. It gives the probability of having diabetes at any given blood sugar level. The curve has the following form:
$$y\,(\text{Probability of diabetes}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$
Where,
y: the probability of diabetes
x: blood sugar level
β0, β1: the parameters of the curve (the intercept and the coefficient of x)
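To make the difference between the hard boundary and the sigmoid curve concrete, here is a minimal Python sketch. The function names, the threshold of 150 and the parameter values are hypothetical and chosen only for illustration; they are not taken from the course data.

```python
import math

def hard_boundary(blood_sugar, threshold):
    """Hard decision boundary: 1 (diabetic) at or above the threshold, else 0."""
    return 1 if blood_sugar >= threshold else 0

def sigmoid_probability(blood_sugar, beta0, beta1):
    """Probability of diabetes from the curve 1 / (1 + e^-(beta0 + beta1 * x))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * blood_sugar)))

# Hypothetical values for illustration only (not fitted to real data).
# With these values the curve crosses 0.5 at x = -BETA0 / BETA1 = 150.
BETA0, BETA1 = -15.0, 0.1
THRESHOLD = 150

for x in (100, 145, 150, 155, 200):
    label = hard_boundary(x, THRESHOLD)
    probability = sigmoid_probability(x, BETA0, BETA1)
    print(x, label, round(probability, 3))
```

For the borderline levels 145 and 155, the hard boundary flips abruptly from 0 to 1, whereas the sigmoid returns probabilities just below and just above 0.5.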
Now, you can get a different sigmoid curve by changing the values of β0 and β1. To find the combination of β0 and β1 that best fits the data, you maximise the likelihood expression. In this expression, each patient who has diabetes contributes their predicted probability P, and each patient who does not have diabetes contributes (1 − P). The likelihood expression for the diabetes curve can be written as follows:
$$\text{Likelihood} = (1 - P_1)(1 - P_2)(1 - P_3)(1 - P_4)(P_5)(1 - P_6)(P_7)(P_8)(P_9)(P_{10})$$
The likelihood expression can be derived from the following graph:
Maximum Likelihood Estimation (MLE) finds the values of β0 and β1 that maximise this likelihood expression.
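The idea can be sketched in a few lines of plain Python. The blood sugar values below are hypothetical, and only the labels follow the likelihood expression (patients 5, 7, 8, 9 and 10 have diabetes). Plain gradient ascent on the log-likelihood is used here purely as a simple illustration of MLE; it is an assumption of this sketch, not necessarily the optimiser a library would use.

```python
import math

# Hypothetical blood sugar levels for 10 patients (illustrative, not real data).
blood_sugar = [90, 100, 110, 120, 130, 140, 150, 160, 170, 180]
# Labels matching the likelihood expression: patients 5, 7, 8, 9, 10 have diabetes.
has_diabetes = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

# Centre and rescale x so that plain gradient ascent converges quickly.
x_scaled = [(x - 135) / 10 for x in blood_sugar]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(beta0, beta1):
    """Product of P for diabetic patients and (1 - P) for non-diabetic patients."""
    result = 1.0
    for x, y in zip(x_scaled, has_diabetes):
        p = sigmoid(beta0 + beta1 * x)
        result *= p if y == 1 else (1.0 - p)
    return result

def fit_by_gradient_ascent(learning_rate=0.01, steps=5000):
    """Maximise the log-likelihood, which has the same maximiser as the likelihood."""
    beta0, beta1 = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood: sum of (y - p) and sum of (y - p) * x.
        errors = [y - sigmoid(beta0 + beta1 * x)
                  for x, y in zip(x_scaled, has_diabetes)]
        beta0 += learning_rate * sum(errors)
        beta1 += learning_rate * sum(e * x for e, x in zip(errors, x_scaled))
    return beta0, beta1

beta0_hat, beta1_hat = fit_by_gradient_ascent()
print("likelihood at beta0 = 0, beta1 = 0:", likelihood(0.0, 0.0))
print("likelihood at fitted values:", likelihood(beta0_hat, beta1_hat))
```

With both parameters at zero, every patient gets a probability of 0.5, so the likelihood is 0.5 raised to the power 10. The fitted parameters give a higher likelihood, which is exactly what MLE searches for. Note that the fitted values refer to the rescaled blood sugar variable, not the raw one.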
Now that the quick recap is complete, let’s move on to discussing the CTR data set.
To better understand the concepts of logistic regression, you are advised to go through session 1 of the logistic regression module in the Machine Learning-1 course of this program.