Until now, you have learnt about the basics of phrases, CFGs and how CFGs can be used for parsing sentences. Let’s now study algorithms for parsing. There are two broad approaches for parsing:
- Top-down: Start from the starting symbol S and produce each word in the sentence.
- Bottom-up: Start from the individual words and reduce them to the sentence S.
You have studied that CFGs define what are called grammar or production rules of the form A > BC. Both parsing techniques use these rules to create parse trees.
Let’s first study top-down parsing.
To summarise, top-down parsing starts with the start symbol S at the top and uses the production rules to parse each word one by one. You continue to parse until all the words have been allocated to some production rule.
The figure below shows all the paths the parser tried for parsing “The man saw dogs”.
In the process, you often encounter dead ends, i.e. points where no production rule can be applied to produce a right-hand side from the left-hand side. In such cases, the algorithm needs to backtrack and try some other rule.
Let’s understand this through an example of a simple sentence and production rules.
Sentence: The man slept
Grammar:
S -> NP VP
NP -> N| Det N
VP -> V | V NP
Det -> ‘the’
N -> ‘man’
V -> ‘slept’
The sentence couldn’t be parsed using the left side of the tree since it reached a dead end. On the other hand, the grammar used on the right side is able to parse the sentence completely.
The NLTK library in Python will show the parse tree as:
(S
(NP (DT The) (N man))
(VP (V slept)))
You learnt how to parse a sentence in the previous lecture. But top-down parsers have a specific limitation. In the next lecture, the professor will explain where the top-down parser fails.
Note that the recursive descent parser is a type of top-down parser, though the terms are sometimes used equivalently. You can read more about the recursive descent parser in the additional reading material below.
Let’s understand the problem of left-recursion using an example:
Sentence: The man saw a car.
Grammar:
S -> NP VP
NP -> DT N
VP -> VP NP| V
DT -> the
N -> man
The top-down parse structure of the sentence will look as follows:
The rule VP -> VP NP runs into an infinite loop and no parse tree will be obtained. This is the problem of left-recursion. This problem can be resolved using the bottom-up approach, which you’ll learn in the next segment.
For now, let’s look at Python implementation of the top-down Parse tree. The following Jupyter notebook is attached for your reference.
Additional Reading
- Although the naive implementation of the recursive descent parser suffers from the problem of left-recursion, alternate algorithms have been suggested which can handle this problem. The algorithms use predictive techniques to ‘look ahead’ and avoid going down the infinite loop. Read more about it here.