Lecture_Notes/Regex.md at 965918fa72fbe1ba6693de3a7fef387135f19f3d

mirror of https://github.com/dholerobin/Lecture_Notes.git synced 2025-03-16 06:10:00 +00:00

Pragy Agarwal 98a49c6a01 restructure files

2020-03-19 12:33:50 +05:30


Regular Expression (RegEx)

While filling in online forms, haven't you come across errors like "Please enter valid email address" or "Please enter valid phone number"?

Annoying as they may be, there's a lot of black magic that the computer does before it determines that the details you've entered are incorrect.

Can you think of what that black magic is? If you are familiar with algorithms, you will say that we can write an algorithm for it.

Yes, we can write an algorithm to verify different things. But there is a standard tool designed specifically for this kind of task.

It is the Regular Expression. We call it RegEx for short. RegEx makes our work a lot easier. Let's see some basic examples where RegEx comes in handy.

Suppose you are searching for the average price of a particular product on Amazon. The following regular expression will find any price (e.g. $12, $75.50) on the webpage: \$([0-9]+)(\.[0-9]+)?. The trailing ? makes the decimal part optional, so prices like $12 are matched as well.

Quite interesting!

Let's look at another example. You have a long list of documents with different kinds of extensions. You are particularly looking for data files having .dat extension.

^.*\.dat$ is a regular expression which represents the set of strings ending with .dat. A regular expression is a standardized way to encode such patterns.

So what does the name Regular Expression (RegEx) mean? A regular expression is a sequence of characters that defines a search pattern.

RegEx is a standardized tool for the following tasks:

Find and verify patterns in a string.
Extract particular data present in the text.
Replace, split and rearrange particular parts of a string.

We are going to look at all three of these.

Let's begin the journey with RegEx!

Note:

In all the images below, the first section is a RegEx and the text is below it; the shaded regions show the matches. All the images are taken using regexr.com. You can use it to experiment with regex.
In all the images, a small dot between words in the text represents a space.
An alphanumeric character belongs to any one of the ranges 0-9, A-Z, a-z.
A string is a sequence of characters and a substring is a contiguous part of a string.

Simple Alpha-numeric character matching

Simple matching of a specific word can be done as follows:

As you can see, it matches "Reg" in the text. Similarly, what will be the match for "Ex" in the same text above?

Do you notice anything? Matching is case-sensitive.

Note: Most programming languages have libraries for RegEx, with broadly similar syntax. Here, we will see how to use it in JavaScript.

Below is basic JavaScript code for regex. Patterns are written as /_____/g, where g is a modifier that finds all matches rather than stopping at the first match.

Note: The function exec returns null if there is no match, and the match data otherwise.

// Main text (string) in which we are finding
// Patterns
var str = "RegEx stands for Regular Expression!";

// Pattern string
var pattern = /Reg/g;

// This will print all the data of matches
// across the whole string
// Declare result explicitly instead of creating an implicit global
var result;
while ((result = pattern.exec(str)) !== null)
{
	console.log(result); // printing
}

// This will be the output
/*
[
  'Reg',
  index: 0,
  input: 'RegEx stands for Regular Expression!',
  groups: undefined
]
[
  'Reg',
  index: 17,
  input: 'RegEx stands for Regular Expression!',
  groups: undefined
]
*/

Note: groups in the above output refers to a RegEx concept that we will look at later. Keep reading!

Now, you can change the expression and the text in the code above to observe other patterns.

Character classes:

What if you want to match both "soon" and "moon" or basically words ending with "oon"?

What did you observe? You can see that adding [sm] matches both soon and moon. Here [sm] is called a character class, which is basically a list of characters we want to match.

More formally, [abc] is basically 'either a or b or c'.
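To try the character class above in JavaScript, a minimal sketch could look like the following. The sample sentence and the variable names are made up for illustration.

```javascript
// Hypothetical sample text, chosen only for illustration
const oonText = "The moon will rise soon, around noon.";

// [sm] matches exactly one character: either "s" or "m"
const oonMatches = oonText.match(/[sm]oon/g);

console.log(oonMatches); // [ 'moon', 'soon' ]
```

"noon" is not matched because n is not listed in the character class.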

Predict the output of the following:

RegEx: [ABC][12] Text: A1 grade is the best, but I scored A2.

Answer:
RegEx: [0123456789][12345]:[abcdef][67890]:[0123456789][67890]:[1234589][abcdef] Text: Let's match 14:f6:89:3c mac address type of pattern. Other patterns are 51:a6:90:c5, 44:t6:u9:3d, 72:c8:39:8e.

Answer:

Now, if we put ^ as the first character inside the brackets, the class matches any character other than the ones listed in the brackets.

Predict the output for the following:

RegEx: [^13579]A[^abc]z3[590*-] Text: 1Abz33 will match or 2Atz30 and 8Adz3*.

Answer:

Writing out every character (like [0123456789] or [abcd]) is slow and error-prone. What is the shortcut?

Ranges

Ranges make our work easier. A run of consecutive characters can be replaced by putting a dash between the smallest and the largest character.

For example, abcdef --> a-f, 456 --> 4-6, abc3456 --> a-c3-6, c367980 --> c36-90 (that is: c, 3, the range 6-9, and 0).
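The equivalence between a range and the fully written-out class can be checked with a short sketch. The sample text and variable names below are assumptions made for illustration.

```javascript
// Hypothetical sample text
const rangeText = "a1 f6 g7 c3";

// [a-f] is shorthand for [abcdef], and [0-9] for [0123456789]
const withRanges = rangeText.match(/[a-f][0-9]/g);
const writtenOut = rangeText.match(/[abcdef][0123456789]/g);

// A leading ^ negates the class: any character that is not a-f and not a space
const negated = rangeText.match(/[^a-f ][0-9]/g);

console.log(withRanges); // [ 'a1', 'f6', 'c3' ]
console.log(writtenOut); // [ 'a1', 'f6', 'c3' ]
console.log(negated);    // [ 'g7' ]
```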

Predict the output of the following regex:

RegEx: [a-d][^l-o][12][^5-7][l-p] Text: co13i, ae14p, eo30p, ce33l, dd14l.

Answer:

Note: If you write a range in reverse order (e.g. 9-0), it is an error.

RegEx: [a-zB-D934][A-Zab0-9] Text: t9, da, A9, zZ, 99, 3D, aCvcC9.

Answer:

Predefined Character Classes

\w & \W: \w is just a short form of the character class [A-Za-z0-9_]. \w is called the word character class.

\W is equivalent to [^\w]. \W matches everything other than word characters.
\d & \D: \d matches any digit character. It is equivalent to character class [0-9]. \D is equivalent to [^\d]. \D matches everything other than digits.
\s & \S: \s matches whitespace characters. Tab(\t), newline(\n) & space( ) are whitespace characters. These characters are called non-printable characters.

Similarly, \S is equivalent to [^\s]. \S matches everything other than whitespace characters.
dot(.): Dot matches any character except \n(line-break or new-line character) and \r(carriage-return character). Dot(.) is known as a wildcard.

Note: Windows-style line endings consist of \r followed by \n.
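The predefined classes can be explored with a small sketch like the one below. The sample string and variable names are hypothetical.

```javascript
// Hypothetical sample text containing a space, a tab and punctuation
const sample = "Room 42\tis_open!";

const digits = sample.match(/\d/g);     // every digit character
const wordRuns = sample.match(/\w+/g);  // runs of word characters
const whitespace = sample.match(/\s/g); // whitespace characters
const nonWord = sample.match(/\W/g);    // everything that is not a word character

// The dot skips the line break between "a" and "b"
const dotMatches = "a\nb".match(/./g);

console.log(digits);     // [ '4', '2' ]
console.log(wordRuns);   // [ 'Room', '42', 'is_open' ]
console.log(whitespace); // [ ' ', '\t' ]
console.log(nonWord);    // [ ' ', '\t', '!' ]
console.log(dotMatches); // [ 'a', 'b' ]
```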

Predict the output of the following regex:

RegEx: [01][01][0-1]\W\s\d Text: Binary to decimal data: 001- 1, 010- 2, 011- 3, a01- 4, 100- 4. Answer:

Problems

Write a regex to match 28th February of any year. Date is in dd-mm-yyyy format.

Answer: 28-02-\d\d\d\d
Write a regex to match dates that are not in March. Assume that the dates are valid but the separator is not fixed, i.e. a date can be in dd.mm.yyyy, dd\mm\yyyy or dd/mm/yyyy format.

Answer: \d\d\W[10][^3]\W\d\d\d\d

Note that the above regex will also match mixed separators such as dd-mm.yyyy or dd/mm\yyyy. This problem can be solved by using backreferencing.

Alternation (OR operator)

Character class can be used to match a single character out of several possible characters. Alternation is more generic than character class. It can also be used to match an expression out of several possible expressions.

In the above example, cat|dog|lion basically means 'either cat or dog or lion'. Here, we have used specific expression(cat, dog & lion), but we can use any regular expression. For example,

Problem

Find a regex to match boot or bot. Answer: There is more than one possible answer: boot|bot, b(o|oo)t. The last expression uses a group.

Problem with OR operator:

Suppose, you want to match two words Set and SetValue. What will be the regular expression?

From what we have learned so far, you might say that Set|SetValue is the answer. But it does not work: in SetValue, only the Set part is matched.

If you try SetValue|Set, it works.

Can you observe anything from it?

The OR operator tries the alternatives from left to right at each position in the text. As soon as one alternative matches, the remaining alternatives are not tried at that position.
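The effect of the order of alternatives can be reproduced with a short sketch. The sample text and variable names are assumptions made for illustration.

```javascript
// Hypothetical sample text containing both words
const setText = "Set and SetValue";

// "Set" is tried first and succeeds, so "SetValue" is never tried at that position
const shortFirst = setText.match(/Set|SetValue/g);

// Putting the longer alternative first lets it win where it applies
const longFirst = setText.match(/SetValue|Set/g);

console.log(shortFirst); // [ 'Set', 'Set' ]
console.log(longFirst);  // [ 'Set', 'SetValue' ]
```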

Find a regex which matches every word in the following set: {bat, cat, hat, mat, nat, oat, pat, Pat, ot}. The regex should be as short as possible.

Hint: Use character-class, ranges and or-operator together.

Answer: [b-chm-pP]at|ot

Quantifiers (Repetition)

To match 3-digit patterns, we can use [0-9][0-9][0-9]. What if we have n-digit patterns? We would have to write [0-9] n times, which is a waste of time. This is where quantifiers come to help.

Limiting repetitions ({min,max}): To match n-digit patterns, we can simply write [0-9]{n}. Instead of n, by providing minimum and maximum values as [0-9]{min,max}, we can match a pattern repeating min to max times. Do not put a space after the comma: in JavaScript, {min, max} with a space is not treated as a quantifier.

Let's see an example to match all numbers from 1 to 999.

Note: If you don't write the upper bound ({min,}), there is no limit on the maximum number of repetitions.
+ quantifier: It is equivalent to {1,}, i.e. at least one occurrence.
* quantifier: It is equivalent to {0,}, i.e. zero or more occurrences.
? quantifier: It is equivalent to {0,1}, i.e. either zero or one occurrence. ? is very useful for optional parts of a pattern.
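The quantifiers listed above can be compared side by side in a small sketch. The sample strings and variable names are hypothetical.

```javascript
// Hypothetical sample texts
const numbers = "7 42 -305 1234";
const letters = "ac abc abbc";

const exactlyTwo = numbers.match(/\d{2}/g);   // exactly two digits at a time
const oneToThree = numbers.match(/\d{1,3}/g); // between one and three digits
const oneOrMore = numbers.match(/\d+/g);      // same as \d{1,}
const signed = numbers.match(/-?\d+/g);       // optional minus sign, same as -{0,1}
const zeroOrMore = letters.match(/ab*c/g);    // b may occur zero or more times

console.log(exactlyTwo); // [ '42', '30', '12', '34' ]
console.log(oneToThree); // [ '7', '42', '305', '123', '4' ]
console.log(oneOrMore);  // [ '7', '42', '305', '1234' ]
console.log(signed);     // [ '7', '42', '-305', '1234' ]
console.log(zeroOrMore); // [ 'ac', 'abc', 'abbc' ]
```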

Let's see an example to match negative and positive numbers.

Problems

Find a regex to match positive integers or floating-point numbers with exactly two digits after the decimal point.

Answer: \d+(\.\d\d)?
Predict the output of the following regex: RegEx: [abc]{2,} Text: aaa abc abbccc avbcc

Answer:

Nature of Quantifiers: HTML tag is represented as <tag_name>some text</tag_name>. For example, <title>Regular expression</title>

So, can you figure out an expression that will match both <tag_name> & </tag_name>?

Most people would say it is <.*>. But it gives a different result: rather than stopping at the first >, it matches everything up to the last >. This is because quantifiers are greedy by default. This behaviour is called greediness!

Now, if we add ? after the quantifier, the following happens.

Lazy matching:

As we have seen, the default nature of quantifier is greedy, so it will match as many characters as possible.

To make it lazy, we put ? right after the quantifier, which makes the regex engine match as few characters as possible while still satisfying the regex.
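The difference between greedy and lazy matching can be seen by running both variants on the tag example from above. This is a minimal sketch; the variable names are chosen for illustration.

```javascript
const html = "<title>Regular expression</title>";

// Greedy: .* grabs as much as possible, up to the last ">"
const greedy = html.match(/<.*>/g);

// Lazy: .*? stops at the first ">" that completes a match
const lazy = html.match(/<.*?>/g);

console.log(greedy); // [ '<title>Regular expression</title>' ]
console.log(lazy);   // [ '<title>', '</title>' ]
```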

Note: Now, you may be thinking, what if we want to match characters like *, ?, +, {, } in the text. We will look at it shortly. Keep reading!

Predict the output of the following regex: RegEx: (var|let)\s[a-zA-Z0-9_]\w* =\s"?\w+"?; Text: var carname = "volvo"; console.log(carname); let age = 8; var date = "23-03-2020";

Answer:

Boundary Matchers

Now, we will learn how to match patterns at specific positions, like before, after or between some characters. For this purpose we use special characters like ^, $, \b & \B, and \A, \z & \Z, which are known as anchors.

Notes:

A line is a string which ends at a line-break (the new-line character \n).
There is a slight change in the JavaScript code we have been using up till now. Instead of /____/g, we will now use /____/gm. The modifier 'm' is used to perform a multiline search. Notice it in the next images!
A word character can be represented by [A-Za-z0-9_].
Anchor ^: It is used to match patterns at the very start of a line. For example,

It will show a match only if the pattern occurs at the start of a line.
Anchor $: Similarly, $ is used to match patterns at the very end of a line.

It will show a match only if the pattern occurs at the end of a line.

Example, both ^ and $,
Anchors \b & \B: \b is called the word boundary anchor. It matches a position, not a character: a position with a word character on one side and a non-word character (or the start or end of the text) on the other.

Which side must hold the word character depends on what the regex pattern next to \b starts (or ends) with:
- A word character: the other side of \b must be a non-word character or the edge of the text. Let's call it a word boundary.
- A non-word character: the other side of \b must be a word character. Let's call it a non-word boundary.
So, in short, \b always needs a word character on exactly one side of the position, which is why it is called a word boundary.

Let's first observe some examples to understand how it works:

What did you observe? Our regex-pattern is starting and ending with a word character. So, the match occurs only if there is a substring starting and ending at word characters, which are required in our regex [a-z] and \d respectively.

Now, let's look at one more example. Here \+ will show a match for +.

What did you observe? First observation: Our pattern is starting with a non-word character and ending with a word character. So, the match occurs only if there is a substring having a non-word boundary at starting and word boundary at the ending.

Second observation: Non-word character after a word-boundary does not affect the result.

\b need not be used in pair. You can use a single \b.

\B is just the complement of \b. \B matches at all the positions that are not word boundaries. Observe the two examples below:
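The positions where \b and \B allow a match can be listed with a short sketch. The sample text and variable names are assumptions, and matchAll is used here only as a compact way to collect the index of each match.

```javascript
// Hypothetical sample text: "cat" appears alone, inside words, and after "+"
const boundaryText = "cat concat cats cat9 +cat";

// \bcat\b: "cat" with a word boundary on both sides
const wholeWordAt = Array.from(boundaryText.matchAll(/\bcat\b/g), (m) => m.index);

// \Bcat: "cat" that does NOT start at a word boundary
const insideWordAt = Array.from(boundaryText.matchAll(/\Bcat/g), (m) => m.index);

console.log(wholeWordAt);  // [ 0, 22 ]  -> the first "cat" and the one after "+"
console.log(insideWordAt); // [ 7 ]      -> the "cat" inside "concat"
```

"cats" and "cat9" are skipped by \bcat\b because s and 9 are word characters, so there is no boundary after the t.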

Note: \A and \z & \Z are other anchors, which are used to match at the very start of the input text and at the very end of the input text respectively. But they are not supported in JavaScript.

Predict the output of the following regex:

RegEx: ^[\w$#%@!&^*]{6,18}$ Text: This is matching passwords of length between 6 to 18: Abfah45$ gadfaJ%33 Abjapda454&1 spc bjaphgu12$ Note that no whitespace characters are allowed. Answer:
RegEx: \b\w+:\B Text: 1232: , +1232:, abc:, abc:a, abc89, (+abc::) Answer:

Groups & Capturing

Grouping is one of the most useful features of regex. Grouping is done by placing a regular expression inside round brackets.

It unifies the regular expressions inside it as a single unit. Let's look at its uses one by one:

It makes the regular expression more readable, and sometimes it is unavoidable. Suppose we want to match both the sentences in the above text; then grouping is unavoidable.
To apply quantifiers to one or more expressions. Similarly, you can use other quantifiers.

To extract and replace substrings using groups. We call such groups capturing groups, because we are capturing data (substrings) using groups.

In this part, we will see how to extract and replace data using groups in Javascript.

Data Extraction:

Observe the code below.

var str = "2020-01-20";

// Pattern string
var pattern = /(\d{4})-(\d{2})-(\d{2})/g;

//                ^       ^       ^
//group-no:	  1       2       3

var result = pattern.exec(str);

// printing
console.log(result);
/* Output will be:
[
  '2020-01-20', //-------pattern
  '2020', //-----First group
  '01', //-------Second group
  '20', //-------Third group
  index: 0,
  input: '2020-01-20',
  groups: undefined
]
*/
// Data extraction
console.log(result[1]); // First group
console.log(result[2]); // Second group
console.log(result[3]);	// Third group

In the output array, the first element is the matched string, followed by the matched groups in order.

Data Replacement:

Replace is another function, which can be used to replace and rearrange the data using regex. Observe the code below.

var str = "2020-01-20";

// Pattern string
var pattern = /(\d{4})-(\d{2})-(\d{2})/g;

//                ^       ^       ^
//group-no:	  1       2       3

// Data replacement using $group_no
var ans=str.replace(pattern, '$3-$2-$1');

console.log(ans);
// Output will be: 20-01-2020

As you can see, we have used $group_no to indicate the capturing group.

Predict the output of the following regex:

RegEx: ([abc]){2,}(one|two) Text: aone cqtwo abone actwo abcbtwoone abbcccone

Answer:
RegEx: ([\dab]+(r|c)){2} Text: 1r2c ar4ccc 12abr12abc acac, accaca, acaaca aaar1234234c, aaa1234234c 194brar, 134bcbb-c

Answer:

Characters with special meaning

We have seen that we use *, +, ., $, etc. for different purposes. If we want to match these characters themselves, we have to escape them using the escape character (backslash, \).

Below is the table of such characters and their escaped versions, along with their usages.

Character	Usage	Escaped version
\	escape character	\\
.	predefined character class	\.
|	OR operator	\|
*	as quantifier	\*
+	as quantifier	\+
?	as quantifier	\?
^	boundary matcher	\^
$	boundary matcher	\$
{	in quantifier notation	\{
}	in quantifier notation	\}
[	in character class notation	\[
]	in character class notation	\]
(	in group notation	\(
)	in group notation	\)
-	range operator (inside a character class)	\- (inside a character class)

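Sometimes the forward slash (/) also has to be escaped as \/. For example, a JavaScript regex literal is delimited by slashes, so a / inside the pattern must be escaped.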
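A few of the escapes from the table can be tried out as follows. This is a minimal sketch; the sample text and variable names are made up for illustration.

```javascript
// Hypothetical sample text containing several special characters
const priceText = "Total: $75.50 (1+1=2), due 10/12/2020";

const price = priceText.match(/\$\d+\.\d+/g);     // literal "$" and literal "."
const hasSum = /1\+1=2/.test(priceText);          // literal "+"
const inParens = priceText.match(/\(.*\)/g);      // literal "(" and ")"
const date = priceText.match(/\d+\/\d+\/\d+/g);   // "/" escaped inside a regex literal

console.log(price);    // [ '$75.50' ]
console.log(hasSum);   // true
console.log(inParens); // [ '(1+1=2)' ]
console.log(date);     // [ '10/12/2020' ]
```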

Backreferencing

Backreferencing is used to match same text again. Backreferences match the same text as previously matched by a capturing group. Let's look at an example:

The first captured group is (\w+), now we can use this group again by using a backreference (\1) at the closing tag, which matches the same text as in captured group \w+.

You can backreference any captured group by using \group_no.
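The opening and closing tag idea described above can be sketched in JavaScript as follows. The sample markup and variable names are assumptions made for illustration.

```javascript
// Hypothetical sample markup: the middle element has a mismatched closing tag
const markup = "<b>bold</b> <i>oops</b> <em>ok</em>";

// (\w+) captures the tag name; \1 requires the same name in the closing tag
const tagPattern = /<(\w+)>.*?<\/\1>/g;

const wellFormed = markup.match(tagPattern);

console.log(wellFormed); // [ '<b>bold</b>', '<em>ok</em>' ]
```

"<i>oops</b>" is skipped because the backreference \1 holds "i", and no "</i>" follows.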

Let's have two more examples:

Problems:

Match any palindrome string of length 6, having only lowercase letters. Answer: ([a-z])([a-z])([a-z])\3\2\1
RegEx: (\w+)oo\1le Text: google, doodle jump, ggooggle, ssoosle

Answer:

Note: For group numbers greater than 9, the backreference syntax differs between regex flavors, so check the documentation of the language you are using.

Named Groups

Regular expressions with lots of groups and backreferences can be difficult to maintain, as adding or removing a capturing group in the middle of the regex changes the numbers of all the groups that follow it.

Regex provides named groups, which solve this issue. Let's look at them.

We can name a group by putting ?<name> just after the opening parenthesis of the group. For example, (?<year>\d{4}) is a named group.

Below is the code we have already seen in the capturing groups part, rewritten with named groups. You can see that the code is more readable now.

	var str = "2020-01-20";

	// Pattern string
	var pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g;

	//                ^       ^       ^
	//group-no:	  1       2       3

	// Data replacement using $<group_name>
	var ans=str.replace(pattern, '$<day>-$<month>-$<year>');

	console.log(ans);
	// Output will be: 20-01-2020

The backreference syntax for numbered groups works for named capture groups as well. In addition, \k<name> matches the string that was previously matched by the named capture group name; this is the standard way to backreference a named group.
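Both ways of using a named group, reading it from the match data and backreferencing it with \k<name>, can be sketched as follows. The group name ch, the sample text and the variable names are hypothetical.

```javascript
// Reading named groups from the match data via the "groups" property
const datePattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const dateMatch = datePattern.exec("2020-01-20");

console.log(dateMatch.groups.year);  // 2020
console.log(dateMatch.groups.month); // 01
console.log(dateMatch.groups.day);   // 20

// \k<ch> must repeat whatever the group named "ch" matched: a doubled letter
const doubledLetters = "book keeper".match(/(?<ch>[a-z])\k<ch>/g);

console.log(doubledLetters); // [ 'oo', 'ee' ]
```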

Practical Applications of RegEx

Syntax highlighting systems
Data scraping and wrangling
Find-and-replace in text editors

Now that you have learned RegEx, let's look at some classic examples.

Classical examples

Number Ranges: Can you find a regex matching all integers from 0 to 255?

First, let's look at how we can match all integers from 0 to 59:

As you can see, we have used ? quantifier to make the first digit(0-5) optional. Now, can you solve it for 0-255?

Hint : Use OR operator.

We can divide the range 0-255 into three ranges: 0-199, 200-249 & 250-255. Now, creating an expression, for each of them independently, is easy.

Range	RegEx
0-199	[01][0-9][0-9]
200-249	2[0-4][0-9]
250-255	25[0-5]

Now, by using OR operator, we can match the whole 0-255 range.

As you can see, the above regex does not match 0, only 000. So how can you modify the regex so that it matches 0 as well as 000, and 1 as well as 001?

We have just used the ? quantifier.
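One way to write this down, as a sketch only, is shown below. The exact pattern is an assumption built from the three sub-ranges and the ? quantifier described above; the ^ and $ anchors are added so that the whole string has to be a number in range. The names byteRange and inRange are hypothetical.

```javascript
// 0-199 with optional leading digits, OR 200-249, OR 250-255
const byteRange = /^([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])$/;

// Hypothetical helper: true if the whole string is an integer from 0 to 255
function inRange(text) {
  return byteRange.test(text);
}

console.log(inRange("0"));   // true
console.log(inRange("000")); // true
console.log(inRange("199")); // true
console.log(inRange("255")); // true
console.log(inRange("256")); // false
console.log(inRange("300")); // false
```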
Validate an IP address:

An IP address consists of four numbers from 0-255 separated by 3 dots (.). The valid IP address format is (0-255).(0-255).(0-255).(0-255).

For example, 10.10.11.4, 255.255.255.255, 234.9.64.43, 1.2.3.4 are valid IP addresses.

Can you find a regex to match an IP address?

We have already seen how to match number ranges, and to match a dot we use an escaped dot (\.). But in an IP address, we don't allow leading zeroes in numbers, like 001.

So, we have to divide the range into four sub-ranges: 0-99, 100-199, 200-249, 250-255. And finally we use the OR operator.

So, Regex to match IP Address is as below:

Note: The whole expression is contiguous; for the sake of easy understanding it is shown the way it is.
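As a sketch of how the four sub-ranges can be combined in JavaScript, the pattern below is assembled from a string so that the repeated part is written only once. The names octet, ipPattern and isValidIp are hypothetical, and the pattern is one possible construction under the rules stated above, not necessarily the exact expression shown in the image.

```javascript
// One number from 0 to 255 without leading zeroes:
// 0-99 | 100-199 | 200-249 | 250-255
const octet = "([1-9]?[0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])";

// First number, then three times "escaped dot + number", anchored to the whole string
const ipPattern = new RegExp("^" + octet + "(\\." + octet + "){3}$");

// Hypothetical helper: true if the whole string is a valid IP address
function isValidIp(text) {
  return ipPattern.test(text);
}

console.log(isValidIp("10.10.11.4"));      // true
console.log(isValidIp("255.255.255.255")); // true
console.log(isValidIp("256.1.1.1"));       // false
console.log(isValidIp("001.2.3.4"));       // false
console.log(isValidIp("1.2.3"));           // false
```

Inside the string, the backslash is doubled (\\.) so that the resulting regex contains \. and matches a literal dot.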