In the previous segment, we mentioned that training a CRF model means computing the optimal set of weights w that best represents the observed label sequences y for the given word sequences x. In other words, we want to find the set of weights w that maximises the conditional probability P(y|x, w) over all the observed sequences (x, y).
Recall that in logistic regression, we maximise the likelihood function:
$$L(w\vert x, y) = P(y\vert x, w)$$
where w represents the model coefficients. In other words, we compute the weights w such that the likelihood of observing the data points (x, y) is maximised. We use exactly the same idea here: we compute the weights w such that the likelihood P(y|x, w) is maximised.
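As a quick illustration, here is a minimal sketch in Python of the logistic regression likelihood being recalled here, using made-up toy data: the likelihood of a coefficient vector w is the product of P(y|x, w) over the data points, where P(y = 1|x, w) is the sigmoid of w.x.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def likelihood(w, X, y):
    """L(w | X, y) = product over points of P(y | x, w) for binary labels y."""
    p = sigmoid(X @ w)                      # P(y = 1 | x, w) for each row of X
    return np.prod(np.where(y == 1, p, 1 - p))

# Toy data (illustrative assumption): a bias column plus one feature.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])
y = np.array([1, 0, 1])
print(likelihood(np.array([0.1, 0.8]), X, y))
```

Training amounts to searching for the w that makes this value (or, in practice, its logarithm) as large as possible.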
Note
To understand the difference between the terms ‘likelihood’ and ‘probability’, you can refer to the following brief explanation.
If there are N such sequences (i.e. N sentences along with their IOB labels), and assuming that the sequences are independent, we want to maximise the product of the likelihoods of all the sequences, i.e.
$$\prod_{1}^{N} P(y\vert x, w)$$
We have already established that the probability of observing a single sequence y assigned to x is:
$$P(y\vert x, w) = \frac{\exp\left(\sum_{i=1}^{n} w \cdot f(y_i, x_i, y_{i-1}, i)\right)}{Z(x)} = \frac{\exp\left(w \cdot f(x, y)\right)}{Z(x)}$$
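To make this formula concrete, here is a minimal sketch (not a reference implementation) that computes P(y|x, w) by brute force, enumerating every candidate label sequence to obtain Z(x). The toy feature function, the label set {'B', 'I', 'O'}, the example sentence and the random weights are all illustrative assumptions.

```python
import itertools
import math
import numpy as np

LABELS = ['B', 'I', 'O']

def local_features(y_i, x, y_prev, i):
    """Toy feature vector f(y_i, x_i, y_{i-1}, i): one indicator per
    (previous label, current label) transition, plus one indicator for
    'current label is B and the word is capitalised'."""
    feats = np.zeros(len(LABELS) * len(LABELS) + 1)
    prev_idx = LABELS.index(y_prev) if y_prev is not None else 0  # crude handling of the first position
    feats[prev_idx * len(LABELS) + LABELS.index(y_i)] = 1.0
    if y_i == 'B' and x[i][0].isupper():
        feats[-1] = 1.0
    return feats

def global_features(x, y):
    """f(x, y): the local feature vectors summed over all positions."""
    return sum(local_features(y[i], x, y[i - 1] if i > 0 else None, i)
               for i in range(len(x)))

def sequence_probability(x, y, w):
    """P(y | x, w) = exp(w . f(x, y)) / Z(x), with Z(x) computed by
    enumerating every candidate label sequence (feasible only for toy inputs)."""
    score = math.exp(w @ global_features(x, y))
    Z = sum(math.exp(w @ global_features(x, list(y_prime)))
            for y_prime in itertools.product(LABELS, repeat=len(x)))
    return score / Z

x = ['John', 'lives', 'in', 'Paris']
y = ['B', 'O', 'O', 'B']
w = np.random.default_rng(0).normal(size=len(LABELS) * len(LABELS) + 1)
print(sequence_probability(x, y, w))
```

In practice, Z(x) is computed with dynamic programming (the forward algorithm) rather than by enumeration; the brute-force version is only meant to mirror the formula.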
Now, if you have N such (x, y) sequences, the training task is to find weights w which maximise the probability of observing the N sequences (x, y). Let’s have Ashish explain this in detail.
To summarise, the objective is to find weight vector w such that P(y|x,w) is maximised. If there are N sequences, assuming that the N sequences are independent of each other, the likelihood function to be maximised is:
$$L(w\vert x, y) = P(Y\vert X, w) = \prod_{1}^{N} P(y\vert x, w)$$
Since the likelihood function is exponential in nature, we take the log of both sides to simplify the computation (this is valid since log(x) is a monotonically increasing function, and thus maximising x is equivalent to maximising log(x)):
$$L(w\vert x, y) = \log P(Y\vert X, w) = \sum_{1}^{N} \log P(y\vert x, w)$$
From the previous segment, P(y|x, w) can be written as:
$$P(y\vert x, w) = \frac{\exp\left(\sum_{i=1}^{n} w \cdot f(y_i, x_i, y_{i-1}, i)\right)}{Z(x)} = \frac{\exp\left(w \cdot f(x, y)\right)}{Z(x)}$$
So, the final equation becomes:
$$L(w\vert x, y) = \sum_{1}^{N} \log\left(\frac{\exp\left(w \cdot f(x, y)\right)}{Z(x)}\right) = \sum_{1}^{N} \left(\log \exp\left(w \cdot f(x, y)\right) - \log Z(x)\right) = \sum_{1}^{N} \left(w \cdot f(x, y) - \log Z(x)\right)$$
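Reusing global_features, LABELS and w from the sketch above, this final equation can be evaluated directly on a small, made-up training set:

```python
import itertools
import math

def log_likelihood(data, w):
    """Unregularised log-likelihood: sum over sequences of w . f(x, y) - log Z(x)."""
    total = 0.0
    for x, y in data:
        log_Z = math.log(sum(
            math.exp(w @ global_features(x, list(y_prime)))
            for y_prime in itertools.product(LABELS, repeat=len(x))))
        total += w @ global_features(x, y) - log_Z
    return total

# Toy training set (illustrative assumption).
train = [(['John', 'lives', 'in', 'Paris'], ['B', 'O', 'O', 'B']),
         (['She', 'works', 'here'], ['O', 'O', 'O'])]
print(log_likelihood(train, w))
```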
To prevent overfitting, we add a regularisation term to this objective. Using an L2 penalty (which is consistent with the 2w/C term that appears in the gradient below), the regularised objective becomes:
$$L(w) = \sum_{1}^{N} \left[w \cdot f(x, y) - \log Z(x)\right] - \frac{w \cdot w}{C}$$
where the last term is the regularisation term.
Refer to the advanced regression module for a refresher on regularisation. Briefly, L1 and L2 regularisation, also called Lasso and Ridge respectively, use the L1 norm (sum(|w|)) and the L2 norm (sum(w²)) respectively as the penalty terms.
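As a small illustration, the two penalties can be written as one-liners; the scaling constant C is an assumed hyperparameter, chosen here so that the L2 penalty's gradient matches the 2w/C term used below:

```python
import numpy as np

def l1_penalty(w, C=1.0):
    return np.sum(np.abs(w)) / C   # Lasso-style penalty: sum of |w|

def l2_penalty(w, C=1.0):
    return np.dot(w, w) / C        # Ridge-style penalty: sum of w^2; gradient is 2w / C

w_example = np.array([0.5, -1.0, 2.0])
print(l1_penalty(w_example), l2_penalty(w_example))
```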
Thus, we now have the objective function to be maximised.
Now, you have already studied gradient descent as an optimisation technique. In the upcoming lecture, you'll study how to maximise the log-likelihood equation using gradient ascent (equivalently, gradient descent on the negative log-likelihood):
The final equation after taking the gradient of the log-likelihood function is
$$\nabla L(w) = \sum_{1}^{N} \left(f(x, y) - E_{p(y'\vert x, w)}\left[f(x, y')\right]\right) - \frac{2w}{C}$$
Where
$$E_{p(y'\vert x, w)}\left[f(x, y')\right] = \sum_{y'} f(x, y') \cdot \frac{\exp\left(w \cdot f(x, y')\right)}{Z_w(x)}$$
where the sum runs over all candidate label sequences y' for the sentence x.
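Putting these two equations together, here is a minimal brute-force sketch of the gradient, reusing global_features and LABELS from the earlier sketch; the feature expectation is computed by enumerating all candidate label sequences, which is feasible only for toy inputs (real CRF implementations use the forward–backward algorithm instead):

```python
import itertools
import math
import numpy as np

def expected_features(x, w):
    """E_{p(y'|x,w)}[f(x, y')], computed by enumerating every label sequence y'."""
    scores = {y_prime: math.exp(w @ global_features(x, list(y_prime)))
              for y_prime in itertools.product(LABELS, repeat=len(x))}
    Z = sum(scores.values())
    return sum((s / Z) * global_features(x, list(y_prime))
               for y_prime, s in scores.items())

def gradient(data, w, C=10.0):
    """Gradient of the regularised log-likelihood: observed minus expected
    feature counts, summed over sequences, minus the 2w/C penalty term."""
    grad = np.zeros_like(w)
    for x, y in data:
        grad += global_features(x, y) - expected_features(x, w)
    return grad - 2 * w / C
```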
So, we start with a random initial value of the weights w, and in each iteration, we adjust w to move in the direction of increasing log-likelihood. This direction is given by the gradient of the log-likelihood function.
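A toy training loop matching this description, reusing gradient(), log_likelihood() and the train data from the sketches above, might look as follows (the learning rate, iteration count and random seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=len(LABELS) * len(LABELS) + 1)  # random initial weights, same dimension as the toy features
for step in range(50):
    w = w + 0.1 * gradient(train, w)                # step along the gradient (ascent)
print(log_likelihood(train, w))                     # should be higher than at the random starting point
```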
Once you have maximised the likelihood function to find the optimal set of weights, you can use them to infer the labels for a given word sequence. In the next segment, you will learn how to do that.