To deal with the complexity and ambiguity of natural language, we first need to identify and define commonly observed grammatical patterns.
The first step in understanding grammar is to divide a sentence into groups of words called constituents based on their grammatical role in the sentence.
To start with, let’s take an example sentence: “The fox ate the squirrel.”
The sentence can be divided into three grammatical units, or constituents: ‘The fox’ is a noun phrase, ‘ate’ is a verb phrase, and ‘the squirrel’ is another noun phrase.
In the next few lectures, you will study how constituency parsers ‘parse’ the grammatical structure of sentences.
Let’s first understand the concept of constituents in a little more detail. Consider the following two sentences:
- ‘Ram read an article on data science’
- ‘Shruti ate dinner’
In these sentences, ‘an article on data science’ and ‘dinner’ each form a constituent (or a phrase). The rationale for grouping these words into a single unit is the notion of substitutability, i.e., a constituent can be replaced with another equivalent constituent while keeping the sentence syntactically valid.
For example, replacing the constituent ‘an article on data science’ (a noun phrase) with ‘dinner’ (another noun phrase) doesn’t affect the syntax of the sentence, though the resulting sentence, “Ram read dinner”, is semantically meaningless.
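Substitutability can be illustrated with a minimal sketch. The constituent labels below are assigned by hand for illustration (they are not the output of a parser), and the helper names `substitute`, `labels` and `text` are hypothetical.

```python
# Each sentence is a list of (label, words) constituents.
# The labels are hand-assigned for illustration, not produced by a parser.
sentence_1 = [("NP", "Ram"), ("VBD", "read"), ("NP", "an article on data science")]
sentence_2 = [("NP", "Shruti"), ("VBD", "ate"), ("NP", "dinner")]


def substitute(sentence, index, replacement):
    """Return a copy of `sentence` with the constituent at `index` replaced.

    The swap is only allowed when both constituents carry the same label.
    """
    label, _ = sentence[index]
    new_label, _ = replacement
    if label != new_label:
        raise ValueError(f"cannot replace {label} with {new_label}")
    return sentence[:index] + [replacement] + sentence[index + 1:]


def labels(sentence):
    """Return only the sequence of constituent labels."""
    return [label for label, _ in sentence]


def text(sentence):
    """Return the sentence as a plain string."""
    return " ".join(words for _, words in sentence)


swapped = substitute(sentence_1, 2, sentence_2[2])
print(text(swapped))                          # Ram read dinner
print(labels(swapped) == labels(sentence_1))  # True
```

The sequence of labels is unchanged after the swap, which is what "syntactically valid" means here; nothing in the sketch checks whether the result makes sense semantically.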
The most common constituents in English are Noun Phrases (NP), Verb Phrases (VP), and Prepositional Phrases (PP). The following table summarises these phrases:
| Type of Phrase | Definition | Examples |
| --- | --- | --- |
| Noun Phrase | Has a primary noun and other words that modify it | a crazy white cat, the morning flight, a large elephant |
| Verb Phrase | Starts with a verb, followed by other words that syntactically depend on it | saw an elephant, made a cake, killed the squirrel |
| Prepositional Phrase | Starts with a preposition, followed by other words (usually a Noun Phrase) that syntactically depend on it | on the table, into the solar system, down the road, by the river |
There are various other types of phrases, such as an adverbial phrase, a nominal (Nom), etc., though in most cases you will need to work with only the above three phrases along with the nominal (introduced in a later lecture).
Context-Free Grammars
The most commonly used technique to organize sentences into constituents is the Context-Free Grammar, or CFG. A CFG defines a set of grammar rules (or productions) that specify how words can be grouped to form constituents such as noun phrases, verb phrases, etc.
In the following lecture, the professor will explain the elements of a context-free grammar.
To summarise, a context-free grammar is a set of production rules. Let’s understand production rules using some examples. The following production rule says that a noun phrase can be formed using either a determiner (DT) followed by a noun (N) or a noun phrase (NP) followed by a prepositional phrase (PP):
NP -> DT N | NP PP
Some example phrases that follow this production rule are:
- The/DT man/N.
- The/DT man/N over/P the/DT bridge/N.
Both of the above are noun phrases (NP). The first phrase (The man) follows the first rule:
NP -> DT N
The second phrase (The man over the bridge) follows the second rule:
NP -> NP PP
It has a noun phrase (The man) and a prepositional phrase (over the bridge).
In this way, using grammar rules, you can parse sentences into different constituents. In general, a production rule can be written as A -> B C, where A is a non-terminal symbol (NP, VP, N, etc.) and B and C are either non-terminals or terminal symbols (i.e., words in the vocabulary such as flight, man, etc.).
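To make the two NP rules concrete, here is a minimal sketch of a recogniser that checks whether a sequence of tags forms a noun phrase. It works on tags rather than words, and it assumes one extra rule that the text above does not state, PP -> P NP. The function names `tags_of` and `is_np` are hypothetical.

```python
from functools import lru_cache


def tags_of(tagged_phrase):
    """Extract the tags from a string such as 'The/DT man/N'."""
    return [token.rsplit("/", 1)[1] for token in tagged_phrase.split()]


def is_np(tags):
    """Return True if `tags` can be derived from NP.

    Rules used:
        NP -> DT N | NP PP     (from the text)
        PP -> P NP             (assumption, not given in the text)
    """
    tags = tuple(tags)

    @lru_cache(maxsize=None)
    def np(i, j):
        # Rule 1: NP -> DT N
        if tags[i:j] == ("DT", "N"):
            return True
        # Rule 2: NP -> NP PP, trying every split point k
        return any(np(i, k) and pp(k, j) for k in range(i + 1, j))

    @lru_cache(maxsize=None)
    def pp(i, j):
        # Assumed rule: PP -> P NP
        return j - i >= 2 and tags[i] == "P" and np(i + 1, j)

    return np(0, len(tags))


print(is_np(tags_of("The/DT man/N")))                          # True
print(is_np(tags_of("The/DT man/N over/P the/DT bridge/N")))   # True
print(is_np(tags_of("over/P the/DT bridge/N")))                # False
```

Each recursive call looks at a strictly shorter span, so the left-recursive rule NP -> NP PP does not cause an infinite loop in this formulation.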
Some other examples of commonly observed production rules in English grammar are provided in the table below. Note that a nominal (Nom) refers to an entity such as morning, flight, etc., which commonly follows the rule Nominal -> Nominal Noun. There is a subtle difference and a significant overlap between a nominal (Nom) and a noun (NN), though you need not worry much about these nuances in this course.
The symbol S represents an entire sentence.
Production Rules
| Production Rule | Example |
| --- | --- |
| S -> NP VP | he + swam |
| NP -> Pronoun | she |
| NP -> NP PP | a man + across the river |
| NP -> DT Nom | a + river |
| VP -> VP PP | swam + across the river |
| VP -> VBD | enjoyed |
| VP -> VP NP | ate + the squirrel |
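The rules in the table can be written down as plain data and applied one at a time. The following is a minimal sketch, assuming that symbols without a rule of their own (Pronoun, DT, Nom, VBD, PP) are treated as leaves. The names `GRAMMAR` and `leftmost_derivation` are hypothetical.

```python
# Production rules from the table: each left-hand side maps to its alternatives.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Pronoun"], ["NP", "PP"], ["DT", "Nom"]],
    "VP": [["VP", "PP"], ["VBD"], ["VP", "NP"]],
}


def leftmost_derivation(choices, start="S"):
    """Apply one rule per step to the leftmost symbol that has a rule.

    choices[i] is the index of the alternative to use at step i.
    Returns the list of intermediate forms, starting with [start].
    """
    form = [start]
    history = [form]
    for choice in choices:
        # Find the leftmost symbol that can still be expanded.
        pos = next(i for i, symbol in enumerate(form) if symbol in GRAMMAR)
        rhs = GRAMMAR[form[pos]][choice]
        form = form[:pos] + rhs + form[pos + 1:]
        history.append(form)
    return history


# Derive the tag sequence of a sentence such as "she ate the squirrel".
for form in leftmost_derivation([0, 0, 2, 1, 2]):
    print(" ".join(form))
# S
# NP VP
# Pronoun VP
# Pronoun VP NP
# Pronoun VBD NP
# Pronoun VBD DT Nom
```

Each printed line is obtained from the previous one by applying exactly one production rule, which is how a CFG generates a sentence from the symbol S.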
Further, the professor mentioned two broad approaches for parsing sentences using CFGs:
- Top-down: Start from the starting symbol S and produce each word in the sentence.
- Bottom-up: Start from the individual words and reduce them to the sentence S.
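The two approaches can be sketched as simple recognisers that only answer whether a sentence can be derived from S. This is a rough illustration, not necessarily the algorithms covered in the next segments: the small `LEXICON` and the rule PP -> P NP are assumptions added for the example, and the function names are hypothetical. The top-down version is a plain search over expansions of S, and the bottom-up version is a simple chart over word spans.

```python
# Rules from the table above, plus one assumed rule for PP.
RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("Pronoun",)), ("NP", ("NP", "PP")), ("NP", ("DT", "Nom")),
    ("VP", ("VP", "PP")), ("VP", ("VBD",)), ("VP", ("VP", "NP")),
    ("PP", ("P", "NP")),  # assumption: not listed in the table
]
NONTERMINALS = {lhs for lhs, _ in RULES}

# Hypothetical word-to-tag lexicon, just large enough for the examples.
LEXICON = {
    "he": "Pronoun", "she": "Pronoun", "a": "DT", "the": "DT",
    "man": "Nom", "river": "Nom", "squirrel": "Nom",
    "swam": "VBD", "ate": "VBD", "enjoyed": "VBD", "across": "P",
}


def top_down(words):
    """Start from S and expand the leftmost non-terminal until the tags match."""
    tags = tuple(LEXICON.get(word) for word in words)
    stack, seen = [("S",)], {("S",)}
    while stack:
        form = stack.pop()
        if form == tags:
            return True
        pos = next((i for i, s in enumerate(form) if s in NONTERMINALS), None)
        # Skip forms that are fully expanded or whose prefix already mismatches.
        if pos is None or form[:pos] != tags[:pos]:
            continue
        for lhs, rhs in RULES:
            if lhs == form[pos]:
                new = form[:pos] + rhs + form[pos + 1:]
                # Every symbol yields at least one word, so longer forms cannot match.
                if len(new) <= len(tags) and new not in seen:
                    seen.add(new)
                    stack.append(new)
    return False


def close_unary(symbols):
    """Add every A that has a rule A -> B for some B already in the set."""
    changed = True
    while changed:
        changed = False
        for lhs, rhs in RULES:
            if len(rhs) == 1 and rhs[0] in symbols and lhs not in symbols:
                symbols.add(lhs)
                changed = True
    return symbols


def bottom_up(words):
    """Start from the words and combine adjacent spans until S covers everything."""
    n = len(words)
    chart = {}
    for i, word in enumerate(words):
        chart[i, i + 1] = close_unary({LEXICON[word]} if word in LEXICON else set())
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = set()
            for k in range(i + 1, j):
                for lhs, rhs in RULES:
                    if len(rhs) == 2 and rhs[0] in chart[i, k] and rhs[1] in chart[k, j]:
                        cell.add(lhs)
            chart[i, j] = close_unary(cell)
    return n > 0 and "S" in chart[0, n]


for sentence in ["he swam", "she ate the squirrel", "swam he"]:
    words = sentence.split()
    print(sentence, top_down(words), bottom_up(words))
# he swam True True
# she ate the squirrel True True
# swam he False False
```

Both functions accept and reject the same sentences; they differ only in the direction in which the rules are applied.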
You’ll learn both approaches in detail in the next segments.