We have discussed that NER is a sequence prediction task and that there are broadly two types of models for NER: rule-based techniques and probabilistic models.
Let’s start with the simpler rule-based models for entity recognition. Rule-based taggers use commonly observed patterns in the text to identify the tag of each word. They are similar to rule-based POS taggers, which use rules such as these: VBG mostly ends with ‘-ing’, VBD is likely to end with ‘-ed’, etc.
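To make the idea of suffix rules concrete, here is a minimal sketch of a rule-based tagger. The rule list, the function name rule_based_tag and the fallback tag NN are assumptions chosen for illustration; a real tagger would use many more rules.

```python
import re

# Hypothetical suffix rules, checked in order; the first rule that matches wins.
SUFFIX_RULES = [
    (re.compile(r".+ing$"), "VBG"),  # gerund / present participle
    (re.compile(r".+ed$"), "VBD"),   # past tense
]
DEFAULT_TAG = "NN"  # assumed fallback when no rule fires


def rule_based_tag(word):
    """Return a POS tag for a single word using suffix rules only."""
    for pattern, tag in SUFFIX_RULES:
        if pattern.match(word.lower()):
            return tag
    return DEFAULT_TAG


tagged = [(w, rule_based_tag(w)) for w in ["booking", "booked", "flight"]]
print(tagged)  # [('booking', 'VBG'), ('booked', 'VBD'), ('flight', 'NN')]
```

Such rules are only heuristics: a word like ‘red’ also ends with ‘ed’ and would be tagged incorrectly, which is why the rules above say ‘mostly’ and ‘likely’.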
Chunking
Rule-based models for NER tasks are based on a technique called chunking. Chunking is a commonly used shallow parsing technique that groups together words which constitute a meaningful phrase in the sentence. Chunks are non-overlapping subsets of words in a sentence that form a meaningful ‘entity’. A noun phrase chunk (NP chunk), for instance, is commonly used in NER tasks to identify groups of words that correspond to some ‘entity’. In the following sentence, there are two noun phrase chunks:
Sentence
He bought a new car from the Maruti Suzuki showroom.
Noun phrase chunks
a new car, the Maruti Suzuki showroom
Note that a key difference between a noun phrase (NP) used in constituency parsing and a noun phrase chunk is that a chunk does not include any other noun phrase chunk within it, i.e. NP chunks are non-overlapping. This is also why chunking is a shallow parsing technique which falls somewhere between POS tagging and constituency parsing.
In general, the idea of chunking in the context of entity recognition is simple – since we know that most entities are nouns and noun phrases, we can write rules to extract these noun phrases and hopefully extract a large number of named entities (e.g. Maruti Suzuki, a new car, as shown above).
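The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the POS tags are written by hand rather than produced by a tagger, and the rule (an optional determiner, zero or more adjectives, then one or more nouns) and the name np_chunks are hypothetical choices, not a grammar taken from the lecture.

```python
import re

# Assumed NP rule: optional determiner, zero or more adjectives, one or more nouns.
# It is applied to a string of tags such as "<DT><JJ><NN>".
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN[A-Z]*>)+")


def np_chunks(tagged_tokens):
    """Return the noun phrase chunks found in a list of (word, tag) pairs."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Each tag starts with "<", so counting "<" converts character
        # offsets in tag_string back into token positions.
        start = tag_string.count("<", 0, match.start())
        length = match.group().count("<")
        words = [word for word, _ in tagged_tokens[start:start + length]]
        chunks.append(" ".join(words))
    return chunks


# Hand-assigned Penn Treebank style tags (an assumption for this example).
sentence = [
    ("He", "PRP"), ("bought", "VBD"), ("a", "DT"), ("new", "JJ"), ("car", "NN"),
    ("from", "IN"), ("the", "DT"), ("Maruti", "NNP"), ("Suzuki", "NNP"),
    ("showroom", "NN"),
]
print(np_chunks(sentence))  # ['a new car', 'the Maruti Suzuki showroom']
```

Because the regular expression consumes each tag at most once, the chunks it returns never overlap, which matches the definition of NP chunks given above.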
In the upcoming lecture, Ashish will explain the concept of chunking in detail and how regular expressions can be used to identify chunks in the sentence.
[Correction at 0:47: PRP is a pronoun. Please ignore any place in the video where it is referred to as a preposition.]
Note
At 3:14, it should be ‘zero or more adjectives’ instead of ‘one or more adjectives’.
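The difference between the two quantifiers can be checked with a small sketch. It assumes the rule in question has the shape ‘optional determiner, adjectives, noun’; the two patterns and the helper name matches are hypothetical and are not the exact grammar from the video.

```python
import re

# Assumed rule shapes over tag strings such as "<DT><NN>".
ONE_OR_MORE_JJ = re.compile(r"(?:<DT>)?(?:<JJ>)+<NN>")   # requires an adjective
ZERO_OR_MORE_JJ = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")  # adjective is optional


def matches(pattern, tag_string):
    """Return True if the whole tag string is one chunk under the pattern."""
    return pattern.fullmatch(tag_string) is not None


no_adjective = "<DT><NN>"        # a determiner followed directly by a noun
with_adjective = "<DT><JJ><NN>"  # e.g. 'a new car'

print(matches(ONE_OR_MORE_JJ, no_adjective))     # False
print(matches(ZERO_OR_MORE_JJ, no_adjective))    # True
print(matches(ONE_OR_MORE_JJ, with_adjective))   # True
print(matches(ZERO_OR_MORE_JJ, with_adjective))  # True
```

With ‘one or more’, a noun phrase without an adjective would be missed, which is why ‘zero or more’ is the appropriate quantifier.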
Let’s take some more examples of chunking done using regular expressions:
Sentence
Ram booked the flight.
Noun phrase chunks
‘Ram’, ‘the flight’
One possible grammar to chunk the sentence is as follows:
Grammar
$$//NP\_chunk:,\left\{