
Regular Expressions: Character Sets

Until now, you were using either literal characters (such as ab, 23, 78, etc.) or the wildcard character in your regular expression patterns. There was no way of specifying that the preceding character is a digit, a letter, a special character, or a combination of these.

For example, say you want to match phone numbers in a large document. You know that the numbers may contain hyphens, a plus symbol, etc. (e.g. +91-9930839123), but they will not contain any letters. You need to somehow specify that you are looking only for digits and a few other symbols, and not for letters.

To handle such situations, you can use what are called character sets in regular expression jargon.
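
As a rough illustration of the phone number case, the sketch below uses Python's built-in re module. The sample sentence and the exact set of allowed symbols (digits, plus and hyphen) are assumptions made for this example, not a complete phone number validator.

```python
import re

# Hypothetical sample text containing the phone number from the example above.
text = "Call +91-9930839123 today, or write to the office."

# The set allows digits, '+' and '-'. The hyphen is placed last so that it is
# read as a literal hyphen rather than as a range separator.
phone_pattern = r"[0-9+-]+"

numbers = re.findall(phone_pattern, text)
print(numbers)  # ['+91-9930839123']
```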

The following lecture explains the various types of character sets available in regular expressions, and how you can use them in different situations.

Note: At 1:47 in the video, the statement should be “the ASCII values of a, b, c, … are 97, 98, 99, …” instead of “the ASCII values of a, b, c, … are 70, 71, 72, …”.

Character sets provide a lot more flexibility than a wildcard or literal characters. A character set can be specified with or without a quantifier. When no quantifier follows the character set, it matches exactly one character, and the match succeeds only if that character in the string is one of the characters inside the set. For example, the pattern ‘[a-z]ed’ will match strings such as ‘ted’, ‘bed’ and ‘red’ because the first character of each string (‘t’, ‘b’ and ‘r’) falls within the range of the character set.

On the other hand, when a character set is used with a quantifier, as in ‘[a-z]+ed’, it will match any lowercase word that ends with ‘ed’, such as ‘watched’, ‘baked’, ‘jammed’, ‘educated’ and so on. In this way, a character set is similar to a wildcard because it can also be used with or without a quantifier. It’s just that a character set gives you more control over which characters are allowed.
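
The difference between the two patterns can be sketched in Python as follows. The word list is made up for illustration, and re.fullmatch is used here (an assumption about how "match" is meant) so that the whole word has to fit the pattern.

```python
import re

# Hypothetical test words, including two that should not match.
words = ["ted", "bed", "red", "watched", "Ted", "ed"]

# Without a quantifier, [a-z] must match exactly one lowercase letter before "ed".
single = [w for w in words if re.fullmatch(r"[a-z]ed", w)]

# With +, [a-z] may repeat one or more times before "ed".
repeated = [w for w in words if re.fullmatch(r"[a-z]+ed", w)]

print(single)    # ['ted', 'bed', 'red']
print(repeated)  # ['ted', 'bed', 'red', 'watched']
```

‘Ted’ is rejected by both patterns because the set contains only lowercase letters, and ‘ed’ is rejected because the set has to match at least one character.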

Note that a quantifier loses its special meaning when it appears inside a character set. Inside square brackets, it is treated like any other literal character. For example, ‘[a+]’ matches either an ‘a’ or a literal ‘+’.
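
A small sketch of this behaviour in Python, using an arbitrary sample string chosen for the example:

```python
import re

sample = "a+b aa"  # arbitrary sample string

# Inside the brackets, + is a literal plus sign, not a quantifier.
literal_plus = re.findall(r"[a+]", sample)

# Outside the brackets, + repeats the preceding 'a'.
quantified = re.findall(r"a+", sample)

print(literal_plus)  # ['a', '+', 'a', 'a']
print(quantified)    # ['a', 'aa']
```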

You can also include a space inside a character set so that spaces in the string are matched as well. For example, the pattern ‘[A-Za-z ]+’ can be used to match the full name of a person: the set includes a space, so it can match the first name, the space, and the last name. Write the range as ‘A-Za-z’ rather than ‘A-z’, because ‘A-z’ also covers the punctuation characters that sit between ‘Z’ and ‘a’ in ASCII.
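
The name-matching idea can be sketched as below. The names are invented for the example, and the set is written as ‘[A-Za-z ]’ because the range ‘A-z’ would also accept characters such as ‘_’ and ‘[’, which lie between ‘Z’ and ‘a’ in ASCII.

```python
import re

# Letters and spaces only; + allows the set to repeat across the whole name.
name_pattern = re.compile(r"[A-Za-z ]+")

ok_name = bool(name_pattern.fullmatch("Ada Lovelace"))
bad_name = bool(name_pattern.fullmatch("Ada_Lovelace"))

# The wider range A-z also accepts the underscore (ASCII 95).
loose_match = bool(re.fullmatch(r"[A-z ]+", "Ada_Lovelace"))

print(ok_name)      # True
print(bad_name)     # False
print(loose_match)  # True
```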

But what if you want to match every character other than the ones mentioned inside the character set? You can use the caret operator to do this. Here, Krishna explains the use of the caret operator inside a character set.

The ‘^’ has two use cases. You already know that it can be used outside a character set to specify the start of a string. Here, it is known as an anchor.

Its other use is inside a character set. When used as the first character inside a character set, it acts as a complement operator, i.e. the set matches any character other than the ones mentioned inside it.

The pattern ‘[0-9]’ matches any single digit. On the other hand, the pattern ‘[^0-9]’ matches any single character that is not a digit.
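
Both roles of the caret can be seen in a short Python sketch. It reuses the phone number from the earlier example as sample input.

```python
import re

text = "+91-9930839123"

# [0-9] picks out every digit; [^0-9] picks out everything that is not a digit.
digits = re.findall(r"[0-9]", text)
non_digits = re.findall(r"[^0-9]", text)

# Outside a set, ^ is an anchor: this asks whether the string starts with a digit.
starts_with_digit = bool(re.search(r"^[0-9]", text))

print("".join(digits))    # 919930839123
print(non_digits)         # ['+', '-']
print(starts_with_digit)  # False, because the string starts with '+'
```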

Meta Sequences

When you work with regular expressions, you’ll find yourself using character sets often. You’ll commonly use sets to match only digits, only letters, only alphanumeric characters, only whitespace, etc.

Therefore, there is a shorthand way to write commonly used character sets in regular expressions. These are called meta-sequences. In the following video, Krishna explains the use of meta-sequences.

Those were the commonly used meta-sequences. You can use meta-sequences in two ways:

  • You can use them without square brackets. For example, the pattern ‘\w+’ matches one or more alphanumeric characters (‘\w’ also matches the underscore).
  • Or you can use them inside square brackets. For example, the pattern ‘[\w]+’ is the same as ‘\w+’. Inside square brackets, meta-sequences are usually combined with other meta-sequences or characters. For example, ‘[\w\s]+’ matches a run of alphanumeric characters and whitespace. The square brackets combine the two meta-sequences into a single set.
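
The two ways of using meta-sequences can be compared in Python as follows. The sample sentences are made up for this sketch.

```python
import re

text = "Room 42 is free"  # hypothetical sample sentence

# \w+ on its own and [\w]+ inside brackets behave identically.
bare = re.findall(r"\w+", text)
bracketed = re.findall(r"[\w]+", text)

# \d is the shorthand for the set [0-9].
numbers = re.findall(r"\d+", text)

# Combining \w and \s in one set lets a match run across spaces,
# but it still stops at other characters such as the comma.
combined = re.findall(r"[\w\s]+", "Room 42, is free")

print(bare)       # ['Room', '42', 'is', 'free']
print(bracketed)  # ['Room', '42', 'is', 'free']
print(numbers)    # ['42']
print(combined)   # ['Room 42', ' is free']
```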
