In this segment, you’ll study the CRF model architecture in detail. Let’s start with CRFs’ feature functions.
CRF Feature Functions
We had mentioned in the previous segment that CRFs use ‘feature functions’ rather than the input word sequence x itself. The idea is similar to how we had extracted features for building the naive Bayes and decision tree classifiers in a previous section. Some example ‘word-features’ (each word has these features) are listed below, followed by a short code sketch:
- Word and POS tag-based features: word_is_city, word_is_digit, pos, etc.
- Label-based features: previous_label.
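To make these features concrete, here is a minimal sketch in Python of how they might be computed for a single token. The feature names mirror the list above, while the tiny city list and the ‘O’ start label are illustrative assumptions for the example, not fixed conventions.

```python
# Minimal, illustrative word-feature extractor for the i-th token of a sentence.
# The small city list below is a toy gazetteer used only for demonstration.
CITY_LIST = {"chicago", "boston", "san", "francisco"}

def word_features(tokens, pos_tags, labels, i):
    """Return a dict of simple word- and label-based features for token i."""
    word = tokens[i]
    return {
        "word_is_city": word.lower() in CITY_LIST,
        "word_is_digit": word.isdigit(),
        "pos": pos_tags[i],
        "previous_label": labels[i - 1] if i > 0 else "O",  # assumed label before the sentence
    }

tokens = ["The", "flight", "to", "San", "Francisco"]
pos_tags = ["DT", "NN", "TO", "NNP", "NNP"]
labels = ["O", "O", "O", "B-location", "I-location"]
print(word_features(tokens, pos_tags, labels, 4))
# {'word_is_city': True, 'word_is_digit': False, 'pos': 'NNP', 'previous_label': 'B-location'}
```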
Let’s spend some time to understand CRF features in detail.
Consider a sequence of n tokens x = (x1, x2, x3, …, xn) and the corresponding IOB label sequence y = (y1, y2, y3, …, yn). The task is to predict the label sequence y using the word sequence x.
Similar to conventional classifiers, we define some feature functions in CRFs which we’ll denote as f. A feature function takes the following four inputs:
- The input sequence of words: x.
- The position i of a word in the sentence (whose features are to be extracted).
- The label yi of the current word (the target label).
- The label yi−1 of the previous word.
There are usually multiple feature functions; let’s denote the jth feature function as fj(x, i, yi, yi−1).
We restrict the model to extract label-based features using only the previous label (and not the labels of the previous n words where n > 1). This assumption is similar to the first-order Markov assumption we had used in POS tagging (i.e. transition probabilities depend only on the previous POS tag). Also, each feature function returns a real-valued number, which is often 0 or 1 (i.e. a binary output).
Let’s take an example of a feature function f1 which returns 1 if the word xi is a city name and the corresponding label yi is ‘I-location’, and 0 otherwise. This can be represented as:
f1(x, i, yi, yi−1) = [[xi is in the list of city names] & [yi is I-location]]
The feature function returns 1 only if both conditions are satisfied, i.e. when the word is a city name and is tagged as ‘I-location’ (e.g. Chicago/I-location).
Similarly, a second feature function can be defined using the previous label, for example:
f2(x, i, yi, yi−1) = [[yi−1 is B-location] & [yi is I-location]]
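As a hedged sketch, the two feature functions above could be written in Python roughly as follows; the city list and the exact label strings are assumptions made for this example.

```python
# Toy implementations of the two feature functions described above.
CITY_LIST = {"chicago", "boston", "san", "francisco"}  # illustrative gazetteer (assumption)

def f1(x, i, y_i, y_prev):
    """1 if the i-th word is a city name and its label is I-location, else 0."""
    return int(x[i].lower() in CITY_LIST and y_i == "I-location")

def f2(x, i, y_i, y_prev):
    """1 if the previous label is B-location and the current label is I-location, else 0."""
    return int(y_prev == "B-location" and y_i == "I-location")

x = ["The", "flight", "to", "San", "Francisco"]
print(f1(x, 4, "I-location", "B-location"))  # 1: 'Francisco' is a city word tagged I-location
print(f2(x, 4, "I-location", "B-location"))  # 1: previous label is B-location
print(f1(x, 4, "O", "B-location"))           # 0: wrong label for a city word
```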
You can define multiple such feature functions fj, though not all are equally important. Every feature function fj has a weight wj associated with it, which represents the ‘importance’ of that feature function. This is almost exactly the same as logistic regression, where the coefficients of the features represent their importance.
Training a CRF means computing the optimal weight vector w which best represents the observed label sequences y for the given word sequences x. In other words, we want to find the set of weights w which maximises P(y|x, w). This is similar to training a logistic regression model to find the optimal coefficients by maximising P(y|x, w), where w represents the vector of logistic regression coefficients. You’ll study CRF model training shortly; for now, let’s focus on the model architecture.
Recall that in logistic regression, the conditional probability P(y|x, w) is modelled using a sigmoid function.
In CRFs, we model the conditional probability P(y|x, w) in a similar fashion. If there are k feature functions (and thus k weights), then for each word i in the sequence x, we define a scoring function as follows:
$$score_i = \exp(w_1 f_1 + w_2 f_2 + \ldots + w_k f_k) = \exp(w \cdot f(x, i, y_i, y_{i-1}))$$
The intuition for the score of the ith word is that it takes a high value if the correct tag is assigned to the word, and a low value otherwise. For example, say you want to tag the word ‘Francisco’ in the sentence ‘The flight to San Francisco.’ Consider that you have only the two feature functions f1 and f2 defined above. Now, if you assign the (correct) label ‘I-location’ to ‘Francisco’, both f1 and f2 return 1, whereas they return 0 for any other tag such as ‘O’. Also, since exp(x) is a monotonically increasing function, it returns a high value for high values of x. Thus, the score of a word is high when the correct tag is assigned to it.
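Continuing the same toy example, here is a sketch of the word-level score using only f1 and f2 and a pair of assumed (untrained) weights; the weight values are purely illustrative.

```python
import math

CITY_LIST = {"chicago", "boston", "san", "francisco"}  # toy gazetteer (assumption)

# f1 and f2 as in the earlier sketch (compact versions)
def f1(x, i, y_i, y_prev):
    return int(x[i].lower() in CITY_LIST and y_i == "I-location")

def f2(x, i, y_i, y_prev):
    return int(y_prev == "B-location" and y_i == "I-location")

w = [1.2, 0.8]  # assumed weights for f1 and f2

def word_score(x, i, y_i, y_prev):
    """exp(w . f) for the i-th word, given the current and previous labels."""
    dot = sum(w_j * f_j(x, i, y_i, y_prev) for w_j, f_j in zip(w, [f1, f2]))
    return math.exp(dot)

x = ["The", "flight", "to", "San", "Francisco"]
print(word_score(x, 4, "I-location", "B-location"))  # exp(1.2 + 0.8) ≈ 7.39: correct tag, high score
print(word_score(x, 4, "O", "B-location"))           # exp(0.0) = 1.00: wrong tag, low score
```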
The score above is defined for each ith word in the sequence x. If the sequences x and y are of length n, then the score of the sequence y is given by the product of scores of individual words in x. The sequence score will be highest when all the words have been assigned the correct labels:
$$sequence\_score(y \vert x) = \prod_{i=1}^{n} \exp(w \cdot f(x, i, y_i, y_{i-1}))$$
The product of exponentials can be expressed as the exponential of the sum as follows:
$$sequence\_score(y \vert x) = \prod_{i=1}^{n} \exp(w \cdot f(x, i, y_i, y_{i-1})) = \exp\left(\sum_{i=1}^{n} w \cdot f(x, i, y_i, y_{i-1})\right)$$
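A quick numerical check of this identity, using arbitrary assumed values for the per-word dot products w · f:

```python
import math

# Arbitrary (assumed) per-word values of w . f for a 3-word sequence
dots = [2.0, 0.5, 1.5]

product_of_exps = math.prod(math.exp(d) for d in dots)  # prod_i exp(w . f_i)
exp_of_sum = math.exp(sum(dots))                         # exp(sum_i w . f_i)
print(product_of_exps, exp_of_sum)                       # both ≈ 54.598
```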
Scores to Probabilities
Using some math, we can now translate the scores of label sequences, which can be any real numbers, into probabilities. The probability P(y|x, w), i.e. the probability of observing the sequence y given x for a certain w, is given by the score of the sequence y divided by the total score of all possible label sequences. If you have n words and t IOB labels, there are t^n possible IOB sequences y. The inference task is to find the optimal sequence y.
The denominator in the following expression, Z(x), is a normalising constant which represents the sum of scores of all possible label sequences of x:
$$P(y \vert x, w) = \frac{\exp\left(\sum_{i=1}^{n} w \cdot f(x, i, y_i, y_{i-1})\right)}{Z(x)} = \frac{\exp(w \cdot f(x, y))}{Z(x)}$$
= score of sequence y assigned to x / sum of scores of all possible sequences.
The term exp(w · f(x, y)) is a concise form of the expression
$$\exp\left(\sum_{i=1}^{n} w \cdot f(x, i, y_i, y_{i-1})\right)$$
which represents the score of the sequence y assigned to x. Let’s have Ashish explain this further.
To summarise, the probability of observing the label sequence y given the input sequence x is given by:
$$P(y \vert x, w) = \frac{\exp\left(\sum_{i=1}^{n} w \cdot f(x, i, y_i, y_{i-1})\right)}{Z(x)} = \frac{\exp(w \cdot f(x, y))}{Z(x)}$$
= score of sequence y assigned to x / sum of scores of all possible sequences
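Putting the pieces together, here is a brute-force sketch that computes Z(x) by enumerating all t^n label sequences and then the probability of one particular sequence. Everything below (the label set, the features f1 and f2, and the weights) is an illustrative assumption carried over from the earlier sketches, and the enumeration is feasible only because the sentence is tiny.

```python
import math
from itertools import product

CITY_LIST = {"chicago", "boston", "san", "francisco"}  # toy gazetteer (assumption)
LABELS = ["B-location", "I-location", "O"]             # t = 3 IOB labels
w = [1.2, 0.8]                                         # assumed weights for f1 and f2

def f1(x, i, y_i, y_prev):  # word is a city name and labelled I-location
    return int(x[i].lower() in CITY_LIST and y_i == "I-location")

def f2(x, i, y_i, y_prev):  # previous label B-location, current label I-location
    return int(y_prev == "B-location" and y_i == "I-location")

def sequence_score(x, y):
    """exp of the summed dot products w . f over all positions (the unnormalised score)."""
    total, y_prev = 0.0, "O"  # 'O' assumed as the label before the sentence starts
    for i, y_i in enumerate(y):
        total += sum(w_j * f_j(x, i, y_i, y_prev) for w_j, f_j in zip(w, [f1, f2]))
        y_prev = y_i
    return math.exp(total)

def probability(x, y):
    """P(y | x, w): score of y divided by the sum of scores of all t^n label sequences."""
    Z = sum(sequence_score(x, y_cand) for y_cand in product(LABELS, repeat=len(x)))
    return sequence_score(x, y) / Z

x = ["The", "flight", "to", "San", "Francisco"]
y = ["O", "O", "O", "B-location", "I-location"]
print(probability(x, y))  # a small probability, since only two weak features are defined here
```

With just two toy features the distribution is not very peaked; a real CRF would use many more feature functions.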
For now, assume that you have somehow computed the optimal weights w by training the model; you’ll study that in the next segment. For a sentence of n words and t IOB labels, there are t^n possible label sequences, and your task is to find the optimal sequence y_opt. You will study how to do this in a later segment.
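Until then, the most naive way to find y_opt is to enumerate all t^n sequences and keep the highest-scoring one, as in the sketch below (reusing sequence_score and LABELS from the previous sketch). This is workable only for very short sentences, and with just the two toy features the winning sequence need not be the true labelling.

```python
from itertools import product

def best_sequence(x, labels):
    """Naive inference: score all len(labels) ** len(x) label sequences and return the argmax."""
    # sequence_score is the function defined in the previous sketch; this enumeration is
    # exponential in sentence length, so practical CRFs use dynamic programming instead.
    return max(product(labels, repeat=len(x)), key=lambda y: sequence_score(x, y))

x = ["The", "flight", "to", "San", "Francisco"]
print(best_sequence(x, LABELS))  # 3**5 = 243 candidate sequences for this sentence
```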
In the next segment, you will understand the mathematical form of the normalising constant Z(x) and some examples of feature functions.