IKH

Regular expressions: Quantifiers-1

From this section onwards, you’ll learn about regular expressions. Regular expressions, also called regex, are powerful programming tools used for purposes such as feature extraction from text, string replacement and other string manipulations. Proficiency with regular expressions is a must-have skill for anyone who wants to master text analytics.

A regular expression is a set of characters, or a pattern, which is used to find substrings in a given string.

Let’s say you want to extract all the hashtags from a tweet. A hashtag has a fixed pattern to it, i.e. a pound (‘#’) character followed by a string. Some example hashtags are – #mumbai, #bangalore, #upgrad. You could easily achieve this task by providing this pattern and the tweet that you want to extract the pattern from (in this case, the pattern is – any string starting with #). Another example is to extract all the phone numbers from a large piece of textual data.
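The hashtag example can be sketched in Python as follows. This is a minimal illustration, not the course's own solution: the tweet text is made up, and the pattern uses '\w' (a word character) together with the '+' quantifier and the re.findall() function, all of which come up later in the session.

```python
import re

# Hypothetical tweet text, made up for illustration.
tweet = "Loving the monsoon in #mumbai, next stop #bangalore with #upgrad"

# '#' followed by one or more word characters (letters, digits, underscore).
hashtag_pattern = r"#\w+"

# re.findall() returns every non-overlapping match as a list of strings.
hashtags = re.findall(hashtag_pattern, tweet)
print(hashtags)  # ['#mumbai', '#bangalore', '#upgrad']
```

Note that the comma after '#mumbai' is not part of the match, because a comma is not a word character.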

In short, if there’s a pattern in any string, you can easily extract, substitute and do all kinds of other string manipulation operations using regular expressions.

Learning regular expressions basically means learning how to identify and define these patterns.

Regular expressions are a small language in their own right, with their own syntax that a regex engine compiles into a matcher. Almost all popular programming languages support regular expressions, and so does Python.

Let’s take a look at how to work with regular expressions in Python. Download the Jupyter notebook provided below to follow along:

General Note on Practice Coding Questions

In the practice questions that you’ll attempt in this module, the phrases ‘match string’ and ‘extract string’ will be used interchangeably. In both cases, you need to use the ‘re.search()’ function, which detects whether the given regular expression pattern is present in the given input string. The ‘re.search()’ function returns a match object if the pattern is found in the string; otherwise it returns None.
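As a rough sketch of how a practice answer might check for a pattern, the snippet below wraps re.search() in a small helper. The helper name contains_pattern and the sample strings are hypothetical, chosen only for illustration.

```python
import re

def contains_pattern(pattern, text):
    """Hypothetical helper: True if `pattern` occurs anywhere in `text`."""
    match = re.search(pattern, text)
    # re.search() gives back a match object on success and None on failure.
    return match is not None

print(contains_pattern("regex", "learning regex is fun"))   # True
print(contains_pattern("regex", "learning python is fun"))  # False
```

Comparing the result with None makes the two possible outcomes of re.search() explicit.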

After writing your code, you can use the ‘Verify’ button to evaluate your code against sample test cases. After verifying the code, you can ‘Submit’ it, and it will then be validated against the (hidden) test cases.

The comments in the coding questions will guide you with these nuances. Also, you can look at the sample solution after submitting your code (i.e. after the maximum number of allowed submissions) at the bottom of the coding console window.

So that’s how you import the regular expressions library in Python and use it. You saw how to use the re.search() function: it returns a match object if the pattern is found in the string. You also saw two of its methods, match.start() and match.end(), which return the index where the match starts and the index just past where it ends.
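A short sketch of these two methods, using a made-up sentence, might look like this. The value returned by match.end() is the position just after the last matched character, so slicing the string with the two values gives back the matched text.

```python
import re

# Hypothetical input string, made up for illustration.
text = "The match starts somewhere in this sentence."

match = re.search("match", text)
if match is not None:
    start, end = match.start(), match.end()
    print(start, end)        # 4 9
    print(text[start:end])   # match
```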

Apart from re.search(), there are other functions in the re library that are useful for other tasks. You’ll look at the other functions later in this session.

Now, the first thing that you’ll learn about regular expressions is the use of quantifiers. Quantifiers let you specify and control how many times the character(s) in your pattern should occur.

Let’s take an example. Suppose you have some data that contains the word ‘awesome’. The list might look like – [‘awesome’, ‘awesomeeee’, ‘awesomee’]. You decide to extract only those elements which have more than one ‘e’ at the end of the word ‘awesome’. This is where quantifiers come into the picture. They let you handle these tasks.

You’ll learn four types of quantifiers:

  • The ‘?’ operator
  • The ‘*’ operator
  • The ‘+’ operator
  • The ‘{m,n}’ operator (written without a space, since Python treats ‘{m, n}’ as literal text)

The first quantifier is ‘?’. Let’s understand what the ‘?’ quantifier does.

You heard Krishna say that you’ll learn about five quantifiers instead of four. That’s because the fourth quantifier has some more variations. You’ll learn about it later in the session. 

The ‘?’ quantifier can be used where you want the preceding character of your pattern to be optional in the string. For example, if you want to write a regex that matches both ‘car’ and ‘cars’, the corresponding regex will be ‘cars?’. An ‘s’ followed by ‘?’ means that ‘s’ can be absent or present, i.e. it can occur zero times or one time.
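The ‘cars?’ example can be tried out with a few sample words. This is only a sketch: the word list is made up, and match.group(), which returns the matched text, is a standard match-object method that has not been covered so far in this section.

```python
import re

pattern = r"cars?"  # 'car' followed by an optional 's'

results = {}
for word in ["car", "cars", "cat"]:  # hypothetical sample words
    match = re.search(pattern, word)
    # Store the matched text, or None when the pattern is not found.
    results[word] = match.group() if match else None

print(results)  # {'car': 'car', 'cars': 'cars', 'cat': None}
```

‘cat’ does not match because the ‘?’ makes only the ‘s’ optional; the letters ‘c’, ‘a’ and ‘r’ are still required.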

The next quantifier that you’re going to study is the ‘+’ quantifier.

The ‘+’ quantifier matches the preceding character one or more times. Practice some questions below to strengthen your understanding of it.
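Going back to the ‘awesome’ list from earlier, one possible way to keep only the words with more than one ‘e’ at the end is the pattern ‘awesomee+’: the literal word ‘awesome’ followed by at least one extra ‘e’. This is a sketch under that assumption, not the only way to write it.

```python
import re

words = ["awesome", "awesomeeee", "awesomee"]

# 'awesome' already ends in one 'e'; 'e+' then demands one or more extra 'e's.
pattern = r"awesomee+"

stretched = [w for w in words if re.search(pattern, w) is not None]
print(stretched)  # ['awesomeeee', 'awesomee']
```

Plain ‘awesome’ is left out because ‘+’ needs at least one occurrence of the preceding ‘e’, unlike ‘?’, which also accepts zero.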

You learnt two quantifiers: the ‘?’ and the ‘+’. In the next section, you’ll learn two more quantifiers.
