mirror of
https://github.com/dholerobin/Lecture_Notes.git
synced 2025-03-15 21:59:56 +00:00
Update and rename Regex_pending.md to Regex.md
This commit is contained in:
parent
9eca14ac85
commit
4d514d922b
640
Akash Articles/Regex.md
Normal file
@ -0,0 +1,640 @@
## Regular Expression (RegEx)

While filling in online forms, haven't you come across errors like "Please enter valid email address" or "Please enter valid phone number"?

Annoying as they may be, there's a lot of black magic that the computer does before it determines that the details you've entered are incorrect.

Can you guess what that black magic is? If you are familiar with algorithms, you will say that we can write an algorithm for it.

Yes, we can write an algorithm to verify different things. But we have a standard tool that is designed specifically for this kind of purpose.

It is the **Regular Expression**. We call it **RegEx** for short. RegEx makes our work a lot easier. Let's see some basic examples where RegEx comes in handy.

Suppose you are looking for the average price of a particular product on Amazon. The following regular expression will find any price (e.g. `$12`, `$75.50`) on the webpage: `\$[0-9]+(\.[0-9]+)?`. The `(\.[0-9]+)?` part makes the fractional part optional, so that whole-dollar prices such as `$12` match too.

Quite interesting!

Let's look at another example. You have a long list of documents with different kinds of extensions. You are particularly looking for data files having the **.dat** extension.

`^.*\.dat$` is a regular expression which represents the set of strings ending with **.dat**. A regular expression is a standardized way to encode such patterns.
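
The two introductory patterns can be tried out with a short JavaScript sketch. The sample text and file names below are made up for illustration, and the price pattern is a variant in which the decimal part is optional so that `$12` matches as well.

```js
// Hypothetical sample data, invented for this example
const page = "Phone case $12, charger $75.50, cable $8.99";

// A dollar sign, one or more digits, then an optional ".digits" part
const pricePattern = /\$[0-9]+(\.[0-9]+)?/g;
const prices = page.match(pricePattern);
console.log(prices); // [ '$12', '$75.50', '$8.99' ]

// Keep only the file names that end with ".dat"
const files = ["report.pdf", "points.dat", "notes.txt", "grid.dat"];
const datPattern = /^.*\.dat$/;
const dataFiles = files.filter((name) => datPattern.test(name));
console.log(dataFiles); // [ 'points.dat', 'grid.dat' ]
```

The syntax used in these patterns is explained step by step in the rest of the article.
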

Well, what does the name **Regular Expression (RegEx)** represent? A regular expression is a sequence of characters that defines a search pattern.

RegEx is a standardized tool for the following tasks:
1. Find and verify patterns in a string.
2. Extract particular data present in the text.
3. Replace, split and rearrange particular parts of a string.

We are going to look at all three of them.

Let's begin the journey with RegEx!

**Note:**
1. In all the images below, the first section is a RegEx and below it is a text in which the matches are shown; the shaded regions show the matches. All the images are taken using regexr.com. You can use it to experiment with regex.
2. In all the images, a small dot between words in the text represents a space.
3. An **alpha-numeric character** belongs to any one of the $0-9,A-Z,a-z$ ranges.
4. A string is a sequence of characters and a substring is a contiguous part of a string.

## Simple Alpha-numeric character matching

Simple matching of a specific word can be done as follows:



As you can see, it matches "Reg" in the text. Similarly, what will be the match for "Ex" in the same text above?



Do you notice anything? The match is **case sensitive**.

**Note:** Most programming languages have libraries for RegEx, and their syntax is almost the same. Here, we will see how to use RegEx in **JavaScript**.

Below is a basic JavaScript example. The pattern is written as `/_____/g`, where `g` is a modifier that is used to find all matches rather than stopping at the first match.

**Note:** The function **exec** returns `null` if there is no match, and the match data otherwise.

```js
// Main text (string) in which we search for the pattern
var str = "RegEx stands for Regular Expression!";

// Pattern
var pattern = /Reg/g;

// Print the data of every match across the whole string
var result;
while ((result = pattern.exec(str)) !== null)
{
    console.log(result);  // printing
}

// This will be the output
/*
[
  'Reg',
  index: 0,
  input: 'RegEx stands for Regular Expression!',
  groups: undefined
]
[
  'Reg',
  index: 17,
  input: 'RegEx stands for Regular Expression!',
  groups: undefined
]
*/
```

**Note:** **Groups** in the above output is a RegEx concept. We will look at it later, so keep reading.

Now, you can change the expression and the text in the code above to observe other patterns.

## Character classes:



What if you want to match both "soon" and "moon", or basically words ending with "oon"?



What did you observe? You can see that adding `[sm]` matches both $soon$ and $moon$. Here `[sm]` is called a character class, which is basically a list of characters we want to match.

More formally, `[abc]` means 'either a or b or c'.

Predict the output of the following:

1. **RegEx:** ``[ABC][12]``

**Text:** A1 grade is the best, but I scored A2.

Answer:



2. **RegEx:** ```[0123456789][12345]:[abcdef][67890]:[0123456789][67890]:[1234589][abcdef]```

**Text:** Let's match 14:f6:89:3c mac address type of pattern. Other patterns are 51:a6:90:c5, 44:t6:u9:3d, 72:c8:39:8e.

Answer:



Now, if we put `^` as the first character inside the brackets, then it will match any character other than the ones in the brackets.



Predict the output for the following:

**RegEx:** ```[^13579]A[^abc]z3[590*-]```

**Text:** 1Abz33 will match or 2Atz30 and 8Adz3*.

Answer:



Writing every character (like `[0123456789]` or `[abcd]`) is slow and error-prone. What is the shortcut?

## Ranges
Ranges make our work easier. Consecutive characters can simply be replaced by putting a dash between the smallest and the largest character.

For example, `abcdef` --> `a-f`, `456` --> `4-6`, `abc3456` --> `a-c3-6`, `c367980` --> `c36-90`.



Predict the output of the following regex:

1. **RegEx:** ```[a-d][^l-o][12][^5-7][l-p]```

**Text:** co13i, ae14p, eo30p, ce33l, dd14l.

Answer:



**Note:** If you write the range in reverse order (e.g. `9-0`), then it is an error.

2. **RegEx:** ``[a-zB-D934][A-Zab0-9]``

**Text:** t9, da, A9, zZ, 99, 3D, aCvcC9.

Answer:



## Predefined Character Classes

1. **`\w` & `\W`**: `\w` is just a short form of the character class `[A-Za-z0-9_]`. `\w` is called the word character class.



`\W` is equivalent to ``[^\w]``. `\W` matches everything other than word characters.



2. **`\d` & `\D`**: `\d` matches any digit character. It is equivalent to the character class `[0-9]`.



`\D` is equivalent to ``[^\d]``. `\D` matches everything other than digits.



3. **`\s` & `\S`**: `\s` matches whitespace characters, such as tab (`\t`), newline (`\n`) and space (` `). These characters do not print a visible symbol.



Similarly, `\S` is equivalent to ``[^\s]``. `\S` matches everything other than whitespace characters.



4. **dot(`.`)**: Dot matches any character except `\n` (line-break or new-line character) and `\r` (carriage-return character). Dot (`.`) is known as a **wildcard**.



**Note:** `\r` is the carriage-return character. Windows-style line endings consist of `\r` followed by `\n`.
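
The predefined classes can be checked quickly in JavaScript. The sample string below is invented for the example; it contains a space, a tab and a trailing newline.

```js
// Hypothetical sample text: 16 characters including space, tab and newline
const sample = "Room 42\tis_open\n";

const digits = sample.match(/\d/g);
console.log(digits); // [ '4', '2' ]

const words = sample.match(/\w+/g);
console.log(words); // [ 'Room', '42', 'is_open' ]

const whitespaceCount = sample.match(/\s/g).length;
console.log(whitespaceCount); // 3 (space, tab, newline)

const dotCount = sample.match(/./g).length;
console.log(dotCount); // 15 (every character except the final newline)
```
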

Predict the output of the following regex:

1. **RegEx:** ``[01][01][0-1]\W\s\d``

**Text:** Binary to decimal data: 001- 1, 010- 2, 011- 3, a01- 4, 100- 4.

Answer:



### Problems

1. Write a regex to match the 28th of February of any year. The date is in dd-mm-yyyy format.

Answer: `28-02-\d\d\d\d`

2. Write a regex to match dates that are not in March. Assume that the dates are valid and that no fixed format is given, i.e. a date can be in dd.mm.yyyy, dd\mm\yyyy or dd/mm/yyyy format.

Answer: `\d\d\W[10][^3]\W\d\d\d\d`

Note that the above regex will also match wrongly formatted dates such as dd-mm.yyyy or dd/mm\yyyy. This problem can be solved by using backreferencing.
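
Both answers can be verified with a few lines of JavaScript. The dates below are sample inputs chosen for this sketch; the last one deliberately mixes two separators to show the limitation mentioned in the note.

```js
const feb28 = /28-02-\d\d\d\d/;
const isFeb28 = feb28.test("28-02-2021");
const isNotFeb28 = feb28.test("27-02-2021");
console.log(isFeb28, isNotFeb28); // true false

const notMarch = /\d\d\W[10][^3]\W\d\d\d\d/;
// "\\" inside a JavaScript string literal is a single backslash
const dates = ["12.03.2020", "12/04/2020", "05\\11\\1999", "31-03.2001", "01-12.2001"];
const outsideMarch = dates.filter((d) => notMarch.test(d));
console.log(outsideMarch); // [ '12/04/2020', '05\\11\\1999', '01-12.2001' ]
```
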

## Alternation (OR operator)

A **character class** can be used to match a single character out of several possible characters. Alternation is more generic than a character class. It can also be used to match an expression out of several possible expressions.



In the above example, ``cat|dog|lion`` basically means 'either cat or dog or lion'. Here, we have used specific expressions (cat, dog & lion), but we can use any regular expression. For example,



### Problem
- Find a regex to match boot or bot.
Answer: There is more than one possible answer: `boot|bot`, `b(o|oo)t`. The last expression uses a group.

### Problem with OR operator:
Suppose you want to match the two words **Set** and **SetValue**. What will be the regular expression?

From whatever we have learned so far, you will say that ``Set|SetValue`` is the answer. But it is not correct.



If you try `SetValue|Set`, then it works.



Can you observe anything from it?

The **OR operator** tries the alternatives from left to right at each position in the text. As soon as one alternative matches, it does not try the remaining alternatives at the same position. With ``Set|SetValue``, the first alternative `Set` already matches, so `SetValue` is never tried.
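
The effect of the order of alternatives is easy to reproduce in JavaScript. This is a minimal sketch on a single sample word:

```js
const code = "SetValue";

// "Set" is tried first and succeeds, so "SetValue" is never tried
const shortFirst = code.match(/Set|SetValue/)[0];
console.log(shortFirst); // Set

// Putting the longer alternative first gives the full word
const longFirst = code.match(/SetValue|Set/)[0];
console.log(longFirst); // SetValue
```

A practical rule of thumb that follows from this: when one alternative is a prefix of another, write the longer one first.
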

Find a regex which matches each and every word in the following set: `{bat, cat, hat, mat, nat, oat, pat, Pat, ot}`. The regex should be as short as possible.

**Hint:** Use a character class, ranges and the OR operator together.

Answer: `[b-chm-pP]at|ot`

## Quantifiers (Repetition)

To match 3-digit patterns, we can use ``[0-9][0-9][0-9]``. What if we have n-digit patterns? We would have to write `[0-9]` n times, but that is a waste of time. This is where quantifiers come to help.

1. **Limiting repetitions (``{min,max}``):** To match n-digit patterns, we can simply write ``[0-9]{n}``. Instead of n, by providing minimum and maximum values as ``[0-9]{min,max}``, we can match a pattern repeating min to max times. Do not put a space after the comma: in JavaScript, ``{min, max}`` with a space is not treated as a quantifier.

Let's see an example to match all numbers from 1 to 999.


**Note:** If you don't write the upper bound (``{min,}``), then there is no limit on the maximum number of repetitions.

2. **``+`` quantifier:** It is equivalent to ``{1,}``, i.e. at least one occurrence.


3. **``*`` quantifier:** It is equivalent to ``{0,}``, i.e. zero or more occurrences.


4. **``?`` quantifier:** It is equivalent to ``{0,1}``, i.e. either zero or one occurrence. ``?`` is very useful for optional occurrences in patterns.

Let's see an example to match negative and positive numbers.


### Problems

1. Find a regex to match positive integers or floating point numbers with exactly two digits after the decimal point.

Answer: `\d+(\.\d\d)?`

2. Predict the output of the following regex:
RegEx: `[abc]{2,}`
Text:
<code>aaa
abc
abbccc
avbcc
</code>

Answer:


**Nature of Quantifiers:**
An HTML tag is written as `<tag_name>some text</tag_name>`. For example, `<title>Regular expression</title>`

So, can you figure out an expression that will match both `<tag_name>` and `</tag_name>`?

Most people will say it is `<.*>`. But it gives a different result.

So, rather than stopping at the first `>`, it matches everything up to the last `>`. Quantifiers are greedy by default. This is called **greediness!**

Now, if we put `?` after the quantifier (for example `<.*?>`), then the following happens.


### Lazy matching:
As we have seen, quantifiers are greedy by default, so they match as many characters as possible.



To make a quantifier lazy, we put `?` after it, which makes the regex engine match as few characters as possible while still satisfying the regex.


**Note:** Now, you may be wondering how to match characters like `*`, `?`, `+`, `{` and `}` themselves in the text. We will look at that shortly. Keep reading!

Predict the output of the following regex:

1. RegEx: `(var|let)\s[a-zA-Z0-9_]\w* =\s"?\w+"?;`
Text:
<code>var carname = "volvo";
console.log(carname);
let age = 8;
var date = "23-03-2020";</code>

Answer:


## Boundary Matchers

Now, we will learn how to match patterns at specific positions, like before, after or between some characters. For this purpose we use special characters like `^`, `$`, `\b` & `\B`, `\A`, `\z` & `\Z`, which are known as anchors.

**Notes:**
- A line is a string which ends at a line-break or new-line character `\n`.
- There is a slight change in the JavaScript code we have been using up till now. Instead of `/____/g`, we will now use `/____/gm`. The modifier `m` is used to perform a multiline search. Notice it in the next images!
- A word character can be represented by `[A-Za-z0-9_]`.

- **Anchor `^`**: It is used to match patterns at the very start of a line.
For example,


It will show a match only if the pattern occurs at the start of a line.

- **Anchor `$`**: Similarly, ``$`` is used to match patterns at the very end of a line.


It will show a match only if the pattern occurs at the end of a line.

An example using both `^` and `$`:


- **Anchors `\b` & `\B`**: `\b` is called the **word boundary** anchor. It does not match a character; it matches a position that has a word character on one side and a non-word character (or the start or end of the text) on the other side.

What `\b` requires depends on the character of the regex pattern that is next to it:
- If that character is a word character, then the neighbour on the other side of `\b` must be a non-word character, or the start or end of the text.
- If that character is a non-word character, then the neighbour on the other side of `\b` must be a word character.

So, in short, `\b` always needs a word character on exactly one of its two sides, which is why it is called a **word boundary**.

Let's first observe some examples to understand how it works:


What did you observe? Our regex pattern starts and ends with a word character (`[a-z]` and `\d` respectively). So, a match occurs only if the matching substring is not directly preceded or followed by another word character.

Now, let's look at one more example.

Here `\+` matches a literal `+`.

What did you observe?
**First observation:** Our pattern starts with a non-word character and ends with a word character. So, a match occurs only if there is a word character just before the start of the match and no word character just after its end.

**Second observation:** A non-word character after the closing word boundary does not affect the result.

`\b` need not be used in pairs. You can use a single `\b`.



`\B` is just the complement of `\b`. `\B` matches at every position that is not a word boundary. Observe the two examples below:





**Note:** `\A` and `\z` & `\Z` are other anchors, which are used to match at the very start of the input text and at the very end of the input text respectively. They are not supported in JavaScript.
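
Since word boundaries are positions rather than characters, they are easiest to understand by experiment. The sketch below uses invented sample strings; the last example has a pattern that starts with a non-word character (`+`), so `\b` asks for a word character in front of it.

```js
// Hypothetical sample text
const s = "cat concat catalog scatter";

const wholeWord = s.match(/\bcat\b/g).length;
console.log(wholeWord); // 1 (only the standalone word "cat")

const wordStart = s.match(/\bcat/g).length;
console.log(wordStart); // 2 ("cat" and the start of "catalog")

const wordEnd = s.match(/cat\b/g).length;
console.log(wordEnd); // 2 ("cat" and the end of "concat")

const insideWord = s.match(/\Bcat\B/g).length;
console.log(insideWord); // 1 (the "cat" inside "scatter")

// "+2" has the word character "1" before it; "+3" has a space before it
const plusAfterWord = "1+2 +3".match(/\b\+\d/g);
console.log(plusAfterWord); // [ '+2' ]
```
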

Predict the output of the following regex:

1. **RegEx:** ```^[\w$#%@!&^*]{6,18}$```
**Text:**
<code>This is matching passwords of length between 6 to 18:
Abfah45$
gadfaJ%33
Abjapda454&1 spc
bjaphgu12$
Note that no whitespace characters are allowed.</code>
Answer:


2. RegEx: `\b\w+:\B`
Text: <code>1232: , +1232:, abc:, abc:a, abc89, (+abc::)</code>
Answer: 

## Groups & Capturing

Grouping is one of the most useful features of regex. Grouping is done by placing a regular expression inside parentheses.

It unifies the regular expressions inside it into a single unit. Let's look at its uses one by one:

1. It makes the regular expression more readable, and sometimes it is unavoidable.

Suppose we want to match both sentences in the above text; then grouping is unavoidable.


2. To apply quantifiers to one or more expressions.

Similarly, you can use other quantifiers.

3. To extract and replace substrings using groups. We call such groups **capturing groups**, because we capture data (substrings) using them.

In this part, we will see how to extract and replace data using groups in JavaScript.

**Data Extraction:**

Observe the code below.
```js
var str = "2020-01-20";

// Pattern string
var pattern = /(\d{4})-(\d{2})-(\d{2})/g;
//             ^       ^       ^
// group no.:  1       2       3

var result = pattern.exec(str);

// printing
console.log(result);
/* Output will be:
[
  '2020-01-20',  //------- whole match
  '2020',        //------- First group
  '01',          //------- Second group
  '20',          //------- Third group
  index: 0,
  input: '2020-01-20',
  groups: undefined
]
*/
// Data extraction
console.log(result[1]);  // First group
console.log(result[2]);  // Second group
console.log(result[3]);  // Third group
```
In the output array, the first element is the whole matched string, followed by the matched groups in order.

**Data Replacement:**

`replace` is another function, which can be used to replace and rearrange data using regex. Observe the code below.

```js
var str = "2020-01-20";

// Pattern string
var pattern = /(\d{4})-(\d{2})-(\d{2})/g;
//             ^       ^       ^
// group no.:  1       2       3

// Data replacement using $group_no
var ans = str.replace(pattern, '$3-$2-$1');

console.log(ans);
// Output will be: 20-01-2020
```
As you can see, we have used `$group_no` to refer to a capturing group.

Predict the output of the following regex:

1. RegEx: `([abc]){2,}(one|two)`
Text:
<code>aone
cqtwo
abone
actwo
abcbtwoone
abbcccone
</code>

Answer: 

2. RegEx: `([\dab]+(r|c)){2}`
Text:
<code>1r2c
ar4ccc
12abr12abc
acac, accaca, acaaca
aaar1234234c, aaa1234234c
194brar, 134bcbb-c </code>

Answer: 

## Characters with special meaning

We have seen that `*`, `+`, `.`, `$`, etc. are used for different purposes. If we want to match these characters themselves, we have to escape them using the escape character, the backslash (`\`).

Below is a table of these characters and their escaped versions, along with their usages.

| Character | Usage | Escaped version |
|:---------:|:---------------------------:|:---------------:|
| \ | escape character | \\\\ |
| . | predefined character class | \\. |
| \| | OR operator | \\\| |
| * | as quantifier | \\* |
| + | as quantifier | \\+ |
| ? | as quantifier | \\? |
| ^ | boundary matcher | \\^ |
| $ | boundary matcher | \\$ |
| { | in quantifier notation | \\{ |
| } | in quantifier notation | \\} |
| [ | in character class notation | \\[ |
| ] | in character class notation | \\] |
| ( | in group notation | \\( |
| ) | in group notation | \\) |
| - | range operator | \\- (needed only inside a character class) |

In JavaScript regex literals, the forward slash (`/`) marks the start and the end of the pattern, so a literal forward slash is also written in its escaped form (`\/`).
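
Escaping is easiest to see by comparing an escaped and an unescaped pattern. In the sketch below, `escapeRegExp` is a hypothetical helper written for this example (it is not a built-in function); it puts a backslash in front of every special character so that arbitrary text can be matched literally.

```js
// Unescaped dot is a wildcard; escaped dot is a literal "."
const wildcardDot = "3.14 and 3x14".match(/3.14/g);
console.log(wildcardDot); // [ '3.14', '3x14' ]

const literalDot = "3.14 and 3x14".match(/3\.14/g);
console.log(literalDot); // [ '3.14' ]

// Escaped "+" and "?" match the characters themselves
const sum = /^\d+\+\d+=\?$/.test("2+2=?");
console.log(sum); // true

// Escaped forward slash inside a regex literal
const parts = "a/b/c".split(/\//);
console.log(parts); // [ 'a', 'b', 'c' ]

// Hypothetical helper: escape every character that has a special meaning
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const literal = new RegExp(escapeRegExp("(a+b)*"));
const found = literal.test("x = (a+b)*c");
console.log(found); // true
```
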

## Backreferencing

Backreferencing is used to match the same text again. A backreference matches the same text as previously matched by a capturing group. Let's look at an example:



The first captured group is (`\w+`). We can use this group again by placing a backreference (`\1`) in the closing tag, which matches the same text as was captured by the group `\w+`.

You can backreference any captured group by using `\group_no`.
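
Here is a small JavaScript sketch of backreferences. The patterns are sample patterns written for this example and are not necessarily the ones shown in the images.

```js
// The closing tag must repeat whatever name group 1 captured
const tagPair = /<(\w+)>.*<\/\1>/;
const sameTag = tagPair.test("<b>bold</b>");
const differentTag = tagPair.test("<b>bold</i>");
console.log(sameTag, differentTag); // true false

// A word followed by a space and the same word again
const doubled = /\b(\w+) \1\b/;
const repeatedWord = "this is is a test".match(doubled)[0];
console.log(repeatedWord); // is is

// A 6-letter palindrome: three letters, then the same letters in reverse order
const palindrome = /^([a-z])([a-z])([a-z])\3\2\1$/;
const isPalindrome = palindrome.test("abccba");
const isNotPalindrome = palindrome.test("abcabc");
console.log(isPalindrome, isNotPalindrome); // true false
```
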

Let's have two more examples:





**Problems:**

1. Match any palindrome string of length 6, having only lowercase letters.
Answer: `([a-z])([a-z])([a-z])\3\2\1`

2. RegEx: `(\w+)oo\1le`
Text: `google, doodle jump, ggooggle, ssoosle`

Answer:


**Note:** For group numbers above 9, the backreference syntax differs between regex flavors, so check the documentation of the engine you are using.

## Named Groups
Regular expressions with lots of groups and backreferences can be difficult to maintain, because adding or removing a capturing group in the middle of the regex changes the numbers of all the groups that follow it.

Regex provides named groups, which solve this issue. Let's look at them.

We can name a group by putting `?<name>` just after the opening parenthesis of the group. For example, `(?<year>\d{4})` is a named group.

Below is the code we have already seen in the **capturing groups** part. You can see that the code is more readable now.

```js
var str = "2020-01-20";

// Pattern string
var pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g;
//             ^              ^               ^
// group no.:  1              2               3

// Data replacement using $<group_name>
var ans = str.replace(pattern, '$<day>-$<month>-$<year>');

console.log(ans);
// Output will be: 20-01-2020
```

The backreference syntax for numbered groups (`\group_no`) works for named capture groups as well. In addition, `\k<name>` matches the string that was previously matched by the named capture group `name`; this is the standard way to backreference a named group.



## Practical Applications of RegEx
1. Syntax highlighting systems
2. Data scraping and wrangling
3. The find and replace facility of text editors

Now that you have learned RegEx, let's look at some classical examples of RegEx.

## Classical examples

1. **Number Ranges:**
Can you find a regex matching all integers from 0 to 255?

First, let's look at how we can match all integers from 0 to 59:


As you can see, we have used the `?` quantifier to make the first digit (0-5) optional. Now, can you solve it for 0-255?

Hint: Use the OR operator.

We can divide the range 0-255 into three ranges: 0-199, 200-249 & 250-255. Now, creating an expression for each of them independently is easy.

| Range | RegEx |
| :--: | :--: |
| 0-199 | `[01][0-9][0-9]` |
| 200-249 | `2[0-4][0-9]` |
| 250-255 | `25[0-5]` |

Now, by using the OR operator, we can match the whole 0-255 range.



As you can see, the above regex matches only three-digit strings: it matches 000 and 001, but not 0 or 1. So, how can you modify the regex so that it also matches numbers written without leading zeroes?



We have just used the `?` quantifier.
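
One possible way to write the complete 0-255 pattern in JavaScript is sketched below. It may differ in detail from the regex in the image; the `^` and `$` anchors are added so that the whole string has to be a number in the range. The loop checks the pattern against every integer from 0 to 300.

```js
// 250-255, or 200-249, or 0-199 with optional leading digits
const byte = /^(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])$/;

let wrong = 0;
for (let n = 0; n <= 300; n++) {
  const expected = n <= 255;
  if (byte.test(String(n)) !== expected) {
    wrong++;
  }
}
console.log(wrong); // 0

// Zero-padded forms are still accepted by this version
const padded = byte.test("007");
console.log(padded); // true
```
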

2. **Validate an IP address:**

An IPv4 address consists of four numbers from 0 to 255 separated by three dots (`.`). The valid IP address format is (0-255).(0-255).(0-255).(0-255).

For example, 10.10.11.4, 255.255.255.255, 234.9.64.43 and 1.2.3.4 are valid IP addresses.

Can you find a regex to match an IP address?

We have already seen how to match number ranges, and to match a dot we use an escaped dot (`\.`). But in an IP address, we don't allow leading zeroes in numbers, like 001.

So, we have to divide the range into four sub-ranges: 0-99, 100-199, 200-249 and 250-255. Finally, we combine them with the OR operator.



So, the regex to match an IP address is as below:


**Note:** The whole expression is contiguous; it is split across lines in the image only for the sake of easy understanding.
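
The four sub-ranges can be assembled into a complete validator as follows. This is a sketch that follows the description above and may differ in detail from the regex in the image; the names `octet` and `ipv4` are chosen for this example.

```js
// 250-255 | 200-249 | 100-199 | 0-99 (without leading zeroes)
const octet = "(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])";

// One octet followed by three ".octet" parts; backslashes are doubled
// because the pattern is built from a string
const ipv4 = new RegExp("^" + octet + "(\\." + octet + "){3}$");

const valid = ["10.10.11.4", "255.255.255.255", "234.9.64.43", "1.2.3.4"];
const invalid = ["256.1.1.1", "1.2.3", "01.2.3.4", "1.2.3.4.5"];

const validCount = valid.filter((ip) => ipv4.test(ip)).length;
const invalidCount = invalid.filter((ip) => ipv4.test(ip)).length;
console.log(validCount, invalidCount); // 4 0
```

Building the pattern from the `octet` string avoids repeating the same sub-expression four times.
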

### Bonus Problem:

Predict the output of the following regex:

**RegEx:** ``\b(0|(1(01*0)*1))*\b``

**Text:** This RegEx denotes the set of binary numbers divisible by 3:
0,11,1010, 1100, 1111, 1001

Answer:


|
@ -1,461 +0,0 @@
|
||||
|
||||
## Regular Expression (RegEx)
|
||||
|
||||
While filling online forms, haven't you come across errors like "Please enter valid email address" or "Please enter valid phone number".
|
||||
|
||||
Annoying as they may be, there's a lot of black magic that the computer does before it determines that, the details you've entered are incorrect.
|
||||
|
||||
Can you think out, what is that black magic? If you are familar with algorithms, then you will say, we can write an algorithm for the same.
|
||||
|
||||
Yes, we can write an algorithm to verify different things. But we have a standard tool which is particularly designed for the similar kind of purposes.
|
||||
|
||||
It is **Regular Expression**. We call it **RegEx** for short. RegEx makes our work a lot easier. Let's see some basic examples where RegEx becomes handy.
|
||||
|
||||
Suppose, you are in search of an averge price of a particular product on amazon. The following regular expression will find you any price(\$12, \$75.50) on the webpage: `\$([0-9]+)\.([0-9]+)`.
|
||||
|
||||
Quite interesting!
|
||||
|
||||
Let's look at another example. You have a long list of documents with different kinds of extensions. You are particularly looking for data files having **.dat** extension.
|
||||
|
||||
`^.*\.dat$` is a regular expression which represents a set of string ending with **.dat**. Regular expression is a standardized way to encode such patterns.
|
||||
|
||||
Well. What does the name **Regular Expression(RegEx)** represent? Regular Expression represents the sequence of characters that defines a regular search pattern.
|
||||
|
||||
RegEx is a standardized tool to do the following works:
|
||||
1. Find and verify patterns in a string.
|
||||
2. Extract particular data present in the form of substrings.
|
||||
3. Replace, split and rearrange particular parts of a string.
|
||||
|
||||
We are going to look at all the three things above.
|
||||
|
||||
Let's begin the journey with RegEx!
|
||||
|
||||
**Note:**
|
||||
1. In all the images below, the first section is a RegEx code and below is a text, in which the matches are shown-the shaded regions show the match. All the images are taken using regexr.com.
|
||||
2. In all the images, Small dot between words in the text shows a space.
|
||||
3. **Alpha-numeric character** is the one which belongs to any of $0-9,A-Z,a-z$ ranges.
|
||||
4. String is a sequence of characters.
|
||||
|
||||
## Simple Alpha-numeric character matching
|
||||
|
||||
Simple matching of a specific word can be done as the following:
|
||||
|
||||

|
||||
|
||||
As you can see it matches "Reg" in the text. Similarly, what will be the match for "Ex" in the same text above?
|
||||
|
||||

|
||||
|
||||
Do you notice anything? It is a **case sensitive**.
|
||||
|
||||
**Note:** Most of the programming languages have libraries for RegEx and almost similar kind of syntax usages. Here we will see how to implement in **Javascript**.
|
||||
|
||||
Here is a basic code in Javascript. Here pattern is written in /_____/g, where g is a modifier, which is used find all matches rather than stopping at the first match.
|
||||
|
||||
**Note:** The function **exec** returns null, if there is no match and match data otherwise.
|
||||
```js
|
||||
// Main text (string) in which we are finding
|
||||
// Patterns
|
||||
var str = "RegEx stands for Regular Expression!";
|
||||
|
||||
// Pattern string
|
||||
var pattern = /Reg/g;
|
||||
|
||||
// This will print all the data of matches
|
||||
// across the whole string
|
||||
while(result = pattern.exec(str))
|
||||
{
|
||||
console.log(result); // printing
|
||||
}
|
||||
|
||||
// This will be the output
|
||||
/*
|
||||
[
|
||||
'Reg',
|
||||
index: 0,
|
||||
input: 'RegEx stands for Regular Expression!',
|
||||
groups: undefined
|
||||
]
|
||||
[
|
||||
'Reg',
|
||||
index: 17,
|
||||
input: 'RegEx stands for Regular Expression!',
|
||||
groups: undefined
|
||||
]
|
||||
*/
|
||||
```
|
||||
|
||||
**Note:** **Groups** in the above output is a RegEx concept we will look at it soon, keep reading.
|
||||
|
||||
Now, you can change the pattern and string in the code above to observe other patterns as we will learn below.
|
||||
|
||||
## Character classes:
|
||||
|
||||

|
||||
|
||||
What if you want to match both "soon" and "moon" or basically words ending with "oon"?
|
||||
|
||||

|
||||
|
||||
What did you observe? You can see that adding `[sm]` matches both $soon$ and $moon$. Here `[sm]` is called character class, which is basically a list of characters we want to match.
|
||||
|
||||
More formally, `[abc]` is basically either `a` or `b` or `c`.
|
||||
|
||||
Predict the output of the following:
|
||||
|
||||
1. **RegEx code:** ```[ABC][12]```
|
||||
**Text:** A1 grade is the best, but I scored A2.
|
||||
|
||||
Answer:
|
||||
|
||||

|
||||
|
||||
|
||||
2. **RegEx code:** ```[0123456789][12345]:[abcdef][67890]:[0123456789][67890]:[1234589][abcdef]```
|
||||
**Text:** Let's match 14:f6:89:3c mac address type of pattern. Other patterns are 51:a6:90:c5, 44:t6:u9:3d, 72:c8:39:8e.
|
||||
|
||||
Answer:
|
||||
|
||||

|
||||
|
||||
Now, if we put `^`, then it will show a match for characters other than the ones in the bracket.
|
||||
|
||||

|
||||
|
||||
Predict the output for the following:
|
||||
|
||||
**RegEx code:** ```[^13579]A[^abc]z3[590*-]```
|
||||
**Text:** 1Abz33 will match or 2Atz30 and 8Adz3*.
|
||||
|
||||
Answer:
|
||||
|
||||

|
||||
|
||||
|
||||
Writing every characters(like `[0123456789]` or `[abcd]`) is some what slow and also errorneous, what is the short-cut?
|
||||
|
||||
## Ranges
|
||||
Ranges makes our work easier. Consecutive characters can simply be replaced by putting a dash between the first and last character.
|
||||
|
||||

|
||||
|
||||
Predict the output of the following regex:
|
||||
|
||||
**RegEx code:** ```[a-d][^l-o][12][^5-7][l-p]```
|
||||
**Text:** co13i, ae14p, eo30p, ce33l, dd14l.
|
||||
|
||||
Answer:
|
||||
|
||||

|
||||
|
||||
|
||||
**Note:** If you write the range in reverse order (ex. 9-0), then it is an error.
|
||||
|
||||
Predict the output of the following regex:
|
||||
|
||||
**RegEx code:** ``[a-zB-D934][A-Zab0-9]``
|
||||
**Text:** t9, da, A9, zZ, 99, 3D, aCvcC9.
|
||||
Answer:
|
||||

|
||||
|
||||
## Predefined Character Classes
|
||||
|
||||
1. **`\w` & `\W`**: `\w` is just a short form of a character class `[A-Za-Z0-9_]`.
|
||||
|
||||

|
||||
`\W` is equivalent to ``[^\w]``.
|
||||

|
||||
|
||||
|
||||
2. **`\d` & `\D`**: `\d` matches any digit character. It is equivalent to character class `[0-9]`.
|
||||
|
||||

|
||||
`\D` is equivalent to ``[^\d]``.
|
||||

|
||||
3. **`\s` & `\S`**: `\s` matches whitespace characters. Tab (`\t`), newline (`\n`), carriage return (`\r`) and space (` `) are the most common whitespace characters.
|
||||

|
||||
|
||||
Similarly, `\S` is equivalent to ``[^\s]``.
|
||||

|
||||
|
||||
4. **dot (`.`)**: Dot matches any character except `\n` (line break or newline character) and `\r` (carriage-return character). It is known as **wildcard matching**.
|
||||
|
||||

|
||||
|
||||
**Note:** `\r` is the carriage-return character. Windows-style line endings are written as the two-character sequence `\r\n`.
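
The predefined classes above can be tried out with a short sketch. The sample text is invented for illustration:

```js
const sample = "Room 42, floor_3!";

const digits = sample.match(/\d/g);      // every digit
const nonWord = sample.match(/\W/g);     // everything that is not a letter, digit or underscore
const whitespace = sample.match(/\s/g);  // every whitespace character
const anyChar = "a\nb".match(/./g);      // the dot skips the newline

console.log(digits);     // [ '4', '2', '3' ]
console.log(nonWord);    // [ ' ', ',', ' ', '!' ]
console.log(whitespace); // [ ' ', ' ' ]
console.log(anyChar);    // [ 'a', 'b' ]
```

Note that the underscore in `floor_3` counts as a word character, so `\W` does not match it.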
|
||||
|
||||
Predict the output of the following regex:
|
||||
|
||||
1. **RegEx code:** ``[01][01][0-1]\W\s\d``
|
||||
**Text:** Binary to decimal data: 001- 1, 010- 2, 011- 3, a01- 4, 100- 4.
|
||||
Answer:
|
||||

|
||||
2. **RegEx code:**
|
||||
**Text:**
|
||||
|
||||
|
||||
## Alternation (OR operator)
|
||||
|
||||
As we have seen, a **character class** matches a single character out of several possible characters. Alternation is more general: it matches one expression out of several possible expressions.
|
||||
|
||||

|
||||
|
||||
In the above example, ``cat|dog|lion`` means either cat, dog or lion. Here we have used literal words (cat, dog & lion), but each alternative can be any regular expression. For example,
|
||||
|
||||

|
||||
|
||||
### Problem with OR operator:
|
||||
Suppose you want to match either of two words, **Set** or **SetValue**. What will be the regular expression?

From what we have learned so far, you might say ``Set|SetValue`` is the answer. But it is not correct.
|
||||
|
||||

|
||||
|
||||
If you try `SetValue|Set` instead, it works.
|
||||
|
||||

|
||||
|
||||
Can you observe anything from it?
|
||||
|
||||
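The **OR operator** tries the alternatives from left to right at each position in the text. As soon as one alternative matches, it does not try the remaining alternatives at that position. With ``Set|SetValue``, `Set` always matches first, so the full word `SetValue` is never matched.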
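
The effect of the order of alternatives can be checked with a short sketch (variable names are illustrative):

```js
const input = "SetValue";

// "Set" comes first, matches, and the longer alternative is never tried.
const shortFirst = input.match(/Set|SetValue/g);
console.log(shortFirst); // [ 'Set' ]

// With the longer alternative first, the whole word is matched.
const longFirst = input.match(/SetValue|Set/g);
console.log(longFirst); // [ 'SetValue' ]
```

A practical rule of thumb that follows from this: when one alternative is a prefix of another, list the longer one first.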
|
||||
|
||||
Predict the output of the following regex:
|
||||
1. **RegEx code:**
|
||||
**Text:**
|
||||
|
||||
2. **RegEx code:**
|
||||
**Text:**
|
||||
|
||||
|
||||
## Quantifiers (Repetition)
|
||||
|
||||
We have seen that to match 3-digit patterns we can use ``[0-9][0-9][0-9]``. What if we have n-digit patterns? We would have to write `[0-9]` n times, which is a waste of time. This is where quantifiers come to help.

1. **Limiting repetitions (``{min,max}``):** To match an n-digit pattern we can simply write ``[0-9]{n}``. By providing minimum and maximum values instead, as in ``[0-9]{min,max}``, we can match a pattern repeated min to max times. Do not put a space after the comma: many regex engines, JavaScript included, then treat the braces as literal text.

Let's see an example to match all numbers from 1 to 999.
|
||||

|
||||
**Note:** If you leave out the upper bound (``{min,}``), there is no limit on the maximum number of repetitions.
|
||||
|
||||
|
||||
2. **``+`` quantifier:** It is equivalent to ``{1,}``: at least one occurrence.
|
||||

|
||||
|
||||
3. **``*`` quantifier:** It is equivalent to ``{0,}``: zero or more occurrences.
Let's see an example:
|
||||

|
||||
|
||||
4. **``?`` quantifier:** It is equivalent to ``{0,1}``, either zero or one occurrence. ``?`` is very useful for optional occurrences in patterns.
|
||||
|
||||
Let's see an example to match negative and positive numbers.
|
||||

|
||||
|
||||
**Nature of Quantifiers:**
|
||||
An HTML tag is written as `<tag_name>some text</tag_name>`. For example, `<title>Regular expression</title>`.

So can you figure out an expression that will match both `<tag_name>` & `</tag_name>`?

Most people will say it is `<.*>`. But it gives a different result.
|
||||

|
||||
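Rather than stopping at the first `>`, it matches everything up to the last `>`, that is, the whole element. This is because quantifiers are greedy by default: they match as much text as possible. It is called **greediness**.

To make a quantifier lazy, we put `?` right after it, as in `<.*?>`. A lazy quantifier matches as little text as possible, so the regex engine stops at the first `>`.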
|
||||

|
||||
|
||||
|
||||
Predict the output of the following regex:
|
||||
1. **RegEx code:**
|
||||
**Text:**
|
||||
|
||||
2. **RegEx code:**
|
||||
**Text:**
|
||||
|
||||
|
||||
**Note:** Now you may be thinking, what if we want to match characters like `*`, `?`, `+`, `{`, `}`, etc. in the text? We will look at it shortly. Keep reading!
|
||||
|
||||
## Boundary Matchers
|
||||
|
||||
Now, we will learn how to match a pattern at specific positions, like before, after or between some characters. For this purpose we use special characters like `^`, `$`, `\b` & `\B`, `\A`, `\z` & `\Z`, which are known as anchors.
|
||||
|
||||
|
||||
**Note:**
|
||||
- A line is a string which ends at a line break character such as `\n` or `\r`.
- There is a slight change in the JavaScript code we have been using until now. Instead of `/____/g`, we will now use `/____/gm`. The modifier `m` is used to perform a multiline search. Notice it in the next images!
- A word character is any character in `[A-Za-z0-9_]`.
|
||||
|
||||
1. **Anchor `^`**: It is used to match patterns at the very start of a line.
|
||||
For example,
|
||||

|
||||
|
||||
It will show a match only if the pattern occurs at the start of a line.
|
||||
|
||||
2. **Anchor `$`**: Similarly, ``$`` is used to match patterns at the very end of a line.
|
||||
|
||||

|
||||
|
||||
It will show a match only if the pattern occurs at the end of a line.
|
||||
Example using both `^` and `$`:
|
||||

|
||||
|
||||
|
||||
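3. **Anchors `\b` & `\B`**: `\b` is called the **word boundary** anchor. It matches a position rather than a character: the position between a word character and a non-word character, or between a word character and the start or end of the text.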
|
||||
|
||||
Let's first observe some examples:
|
||||

|
||||
|
||||
|
||||
What did you observe? Our pattern starts with a word character (`[a-z]`) and ends with a word character (`\d`). So a match occurs only if the substring starts at a word boundary with the required `[a-z]` character and ends at a word boundary with the required `\d` character. Now let's look at one more example.
|
||||

|
||||
|
||||
What did you observe?
|
||||
**First observation:** Our pattern starts with a non-word character and ends with a word character. So a match occurs only if there is a word character just before the start of the matched string and the required `\d` character at its end.

**Second observation:** If our pattern starts (or ends) with a word character, the match can still occur when there is a non-word character before (or after) the matched string.

`\b` need not be used in pairs. You can use a single `\b`.

`\B` is the complement of `\b`: it matches at every position where `\b` does not.
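
The anchors can be tried out with a small sketch. The sample strings are made up for illustration and are not the ones used in the images:

```js
const lines = "cat one\ntwo cat\ncat";

// With the m flag, ^ and $ match at the start and end of every line.
const atLineStart = lines.match(/^cat/gm);  // lines 1 and 3
const atLineEnd = lines.match(/cat$/gm);    // lines 2 and 3
const wholeLine = lines.match(/^cat$/gm);   // line 3 only
console.log(atLineStart.length, atLineEnd.length, wholeLine.length); // 2 2 1

// Without the m flag, ^ matches only at the very start of the text.
const textStartOnly = lines.match(/^cat/g);
console.log(textStartOnly.length); // 1

const words = "cat concat cats";

// \b on both sides: "cat" must stand alone as a word.
const wholeWord = words.match(/\bcat\b/g);
console.log(wholeWord); // [ 'cat' ]

// \B before "cat": it must be preceded by another word character, as in "concat".
const insideWordIndex = words.search(/\Bcat/);
console.log(insideWordIndex); // 7
```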
|
||||
|
||||
|
||||
|
||||
**Note:** `\A` and `\z` & `\Z` are other anchors, which match at the very start and at the very end of the input text respectively. But they are not supported in JavaScript.
|
||||
|
||||
Predict the output of the following regex:
|
||||
|
||||
1. **RegEx code:**
|
||||
**Text:**
|
||||
|
||||
2. **RegEx code:**
|
||||
**Text:**
|
||||
|
||||
|
||||
## Groups & Capturing
|
||||
|
||||
Grouping is one of the most useful features of regex. Grouping is done by placing a regular expression inside round brackets.

It combines the regular expressions inside it into a single unit. Let's look at its uses one by one:
|
||||
|
||||
1. It makes the regular expression more readable, and sometimes it is unavoidable.
|
||||

|
||||
Suppose we want to match both sentences; then grouping is unavoidable.
|
||||

|
||||
|
||||
2. To apply quantifiers to one or more expressions.
|
||||

|
||||
Similarly, you can use other quantifiers.
|
||||
|
||||
3. To extract and replace substrings using groups. We call groups **capturing groups**, because we are capturing data (substrings) using groups.

In this part we will see how to extract and replace data using groups in JavaScript.

In the output array, the first element is the whole matched string, followed by the matched groups in order.
|
||||
|
||||
```js
var str = "2020-01-20";

// Pattern with three capturing groups
var pattern = /(\d{4})-(\d{2})-(\d{2})/g;
//             ^       ^       ^
// group-no:   1       2       3

var result = pattern.exec(str);

console.log(result);
/* Output will be:
[
  '2020-01-20', //-------whole match
  '2020', //-----First group
  '01', //-------Second group
  '20', //-------Third group
  index: 0,
  input: '2020-01-20',
  groups: undefined
]
*/

// Data extraction
console.log(result[1]); // First group
console.log(result[2]); // Second group
console.log(result[3]); // Third group
```
|
||||
`replace` is another function, which is used to replace and rearrange the data using groups.
|
||||
|
||||
```js
var str = "2020-01-20";

// Pattern with three capturing groups
var pattern = /(\d{4})-(\d{2})-(\d{2})/g;
//             ^       ^       ^
// group-no:   1       2       3

// Data replacement: $1, $2 and $3 refer to the captured groups
var ans = str.replace(pattern, '$3-$2-$1');

console.log(ans);
// Output will be: 20-01-2020
```
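
The first two uses of groups can also be sketched in code, together with one way of extracting groups from several matches at once. The sentences and strings below are hypothetical examples, not the ones from the images:

```js
// Use 1: a group limits how far the | operator reaches.
const grouped = /^I like (cats|dogs)$/;
const ungrouped = /^I like cats|dogs$/; // means "^I like cats" OR "dogs$"
console.log(grouped.test("I like dogs")); // true
console.log(grouped.test("dogs"));        // false
console.log(ungrouped.test("dogs"));      // true, probably not what was intended

// Use 2: a quantifier after a group repeats the whole group.
const repeatLastChar = "haaa haha".match(/ha+/g);
const repeatGroup = "haaa haha".match(/(ha)+/g);
console.log(repeatLastChar); // [ 'haaa', 'ha', 'ha' ]
console.log(repeatGroup);    // [ 'ha', 'haha' ]

// Use 3: matchAll returns every match with its groups (the g flag is required).
const dates = "Start 2020-01-20, end 2021-12-05";
const years = [...dates.matchAll(/(\d{4})-(\d{2})-(\d{2})/g)].map((m) => m[1]);
console.log(years); // [ '2020', '2021' ]
```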
|
||||
|
||||
Predict the output of the following regex:
|
||||
1. **RegEx code:**
|
||||
**Text:**
|
||||
|
||||
2. **RegEx code:**
|
||||
**Text:**
|
||||
|
||||
|
||||
## Characters with special meaning
|
||||
|
||||
We have seen that we are using `*`, `+`, `.`, `$`, etc. for different purposes. Now you may be thinking, what if we want to match these characters themselves? For that purpose, we have to escape them using the escape character (backslash, `\`).

Below is the table of such characters and their escaped versions, along with their usages.
|
||||
|
||||
| Character | Usage                       | Escaped version |
|:---------:|:---------------------------:|:---------------:|
| \\        | escape character            | \\\\            |
| .         | predefined character class  | \\.             |
| \|        | OR operator                 | \\\|            |
| *         | as quantifier               | \\*             |
| +         | as quantifier               | \\+             |
| ?         | as quantifier               | \\?             |
| ^         | boundary matcher            | \\^             |
| $         | boundary matcher            | \\$             |
| {         | in quantifier notation      | \\{             |
| }         | in quantifier notation      | \\}             |
| [         | in character class notation | \\[             |
| ]         | in character class notation | \\]             |
| (         | in group notation           | \\(             |
| )         | in group notation           | \\)             |
|
||||
|
||||
When the pattern is written between `/` delimiters, as in JavaScript regex literals, the forward slash (`/`) also has to be escaped, as `\/`.
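
A short sketch of escaping in practice. The sample strings are invented, and `escapeRegExp` is a hypothetical helper written for this example, not a built-in function:

```js
// \$ and \. match a literal dollar sign and a literal dot.
const prices = "Total: $12.50 (was $15)".match(/\$[0-9]+\.[0-9]+/g);
console.log(prices); // [ '$12.50' ]

// An unescaped dot matches any character; an escaped dot matches only a dot.
console.log(/1.5/.test("1x5"));  // true
console.log(/1\.5/.test("1x5")); // false

// Inside a regex literal the forward slash is escaped as \/.
console.log(/a\/b/.test("a/b")); // true

// Hypothetical helper: put a backslash before every special character,
// so that arbitrary text can be matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const literal = new RegExp(escapeRegExp("1+1=2"));
console.log(literal.test("1+1=2")); // true
console.log(/1+1=2/.test("1+1=2")); // false, because + acts as a quantifier here
```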
|
||||
|
||||
|
||||
## Backreferencing
|
||||
|
||||
Backreferencing is used to match the same text again. A backreference matches the same text as previously matched by a capturing group. Let's look at an example:
|
||||
|
||||

|
||||
|
||||
The first captured group is (`\w+`). We can use this group again at the closing tag through a backreference (`\1`), which matches the same text that the captured group `\w+` matched.

You can backreference any captured group as `\group_no`.
|
||||
|
||||
Let's have one more example:
|
||||
|
||||

|
||||
|
||||
Predict the output of the following regex:
|
||||
1. **RegEx code:**
|
||||
**Text:**
|
||||
|
||||
2. **RegEx code:**
|
||||
**Text:**
|
||||
|
||||
|
||||
|
||||
|
||||
## Practical Applications of RegEx
|
||||
1. Syntax highlighting systems.
2. Data scraping and wrangling.
3. The find-and-replace facility of text editors.
|