Update and rename Regex_pending.md to Regex.md

This commit is contained in:
Aakash Panchal 2020-02-24 03:12:44 +05:30 committed by GitHub
parent 9eca14ac85
commit 4d514d922b
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 640 additions and 461 deletions

640
Akash Articles/Regex.md Normal file
View File

@ -0,0 +1,640 @@
## Regular Expression (RegEx)
While filling online forms, haven't you come across errors like "Please enter valid email address" or "Please enter valid phone number".
Annoying as they may be, there's a lot of black magic that the computer does before it determines that, the details you've entered are incorrect.
Can you think out, what is that black magic? If you are familiar with algorithms, then you will say we can write an algorithm for the same.
Yes, we can write an algorithm to verify different things. But we have a standard tool, which is particularly designed for the similar kind of purposes.
It is **Regular Expression**. We call it **RegEx** for short. RegEx makes our work a lot easier. Let's see some basic examples where RegEx becomes handy.
Suppose, you are in search of an averge price of a particular product on amazon. The following regular expression will find you any price(ex. `$12`, `$75.50`) on the webpage: `\$([0-9]+)\.([0-9]+)`.
Quite interesting!
Let's look at another example. You have a long list of documents with different kinds of extensions. You are particularly looking for data files having **.dat** extension.
`^.*\.dat$` is a regular expression which represents a set of string ending with **.dat**. Regular expression is a standardized way to encode such patterns.
Well. What does the name **Regular Expression(RegEx)** represent? Regular Expression represents the sequence of characters that defines a regular search pattern.
RegEx is a standardized tool to do the following works:
1. Find and verify patterns in a string.
2. Extract particular data present in the text.
3. Replace, split and rearrange particular parts of a string.
We are going to look at all the three things above.
Let's begin the journey with RegEx!
**Note:**
1. In all the images below, the first section is a RegEx and below is a text, in which the matches are shown-the shaded regions show the match. All the images are taken using regexr.com. You can use it to do experiments on regex.
2. In all the images, Small dot between words in the text shows a space.
3. **Alpha-numeric character** belongs to anyone of the $0-9,A-Z,a-z$ ranges.
4. String is a sequence of characters and substring is a contiguous part of a string.
## Simple Alpha-numeric character matching
Simple matching of a specific word can be done as the following:
![enter image description here](https://lh3.googleusercontent.com/YGfz9u58rRKD0ABrSKDv7ZJOEMaIMGdFWgJWGGNzCFNakCtfAZVk1UEm7mBS4lIX1LFXoV420cmY=s1600)
As you can see it matches "Reg" in the text. Similarly, what will be the match for "Ex" in the same text above?
![enter image description here](https://lh3.googleusercontent.com/LkJXO79wn08dvgX5Q2JXHtyN7MW38AeNdV7fjG6lk7MNsiamx9iOekEGQg-WS9OLQMWxBuspjSkh=s1600)
Do you notice anything? It is a **case sensitive**.
**Note:** Most of the programming languages have libraries for RegEx. They have almost similar kind of syntax. Here, we will see how to implement it in **Javascript**.
Below is a basic code in Javascript for regex. The patterns are written in `/_____/g`. Where `g` is a modifier, which is used to find all matches rather than stopping at the first match.
**Note:** The function **exec** returns null, if there is no match and match data otherwise.
```js
// Main text (string) in which we are finding
// Patterns
var str = "RegEx stands for Regular Expression!";
// Pattern string
var pattern = /Reg/g;
// This will print all the data of matches
// across the whole string
while(result = pattern.exec(str))
{
console.log(result); // printing
}
// This will be the output
/*
[
'Reg',
index: 0,
input: 'RegEx stands for Regular Expression!',
groups: undefined
]
[
'Reg',
index: 17,
input: 'RegEx stands for Regular Expression!',
groups: undefined
]
*/
```
**Note:** **Groups** in the above output is a RegEx concept. We will look at it, keep reading.
Now, you can change the expression and text in the code above, to observe other patterns.
## Character classes:
![IMG](https://lh3.googleusercontent.com/dlLzL3teyEoax1JsdF7JeGP6DZOJll-UgnZqFkJtBeAVJhM1xMnXBHVeJYf_cUFLmj-f1qPO8asf=s1600)
What if you want to match both "soon" and "moon" or basically words ending with "oon"?
![enter image description here](https://lh3.googleusercontent.com/bsRHqYuPZIQ7Yra4-zyF1BX2pIYDukCEtTfCK3rjaCTRmTAuo_fuHTVK5sJjbTdbXjGTVq1z5eYc=s1600)
What did you observe? You can see that, adding `[sm]` matches both $soon$ and $moon$. Here `[sm]` is called character class, which is basically a list of characters we want to match.
More formally, `[abc]` is basically 'either a or b or c'.
Predict the output of the following:
1. **RegEx:** ``[ABC][12]``
**Text:** A1 grade is the best, but I scored A2.
Answer:
![enter image description here](https://lh3.googleusercontent.com/2JgAwwprVtZ8AfQMLoB9PEi7NoyJXxrEqv_46tLvdtCBKo4HBcPlbi3atyKsWJTwJLBTneyA_C0j=s1600)
2. **RegEx:** ```[0123456789][12345]:[abcdef][67890]:[0123456789][67890]:[1234589][abcdef]```
**Text:** Let's match 14:f6:89:3c mac address type of pattern. Other patterns are 51:a6:90:c5, 44:t6:u9:3d, 72:c8:39:8e.
Answer:
![enter image description here](https://lh3.googleusercontent.com/d2ynsBn5p8gIzvQeKewe8VrPiEu0EyOoNiEBkj_Co8fq_12FKhWK81V1Rcc2YCs3or9d4sCbuGtA=s1600)
Now, if we put `^`, then it will show a match for characters other than the ones in the bracket.
![soon moon noon woon](https://lh3.googleusercontent.com/rj-zgBEZ7Fdv6rckQgHC90L_j7y1X7jj8veTZQoOKGQ2RSiEHPxPeSZUZoJE9yLW-o2dvXj6OI1j=s1600)
Predict the output for the following:
**RegEx:** ```[^13579]A[^abc]z3[590*-]```
**Text:** 1Abz33 will match or 2Atz30 and 8Adz3*.
Answer:
![enter image description here](https://lh3.googleusercontent.com/BXaE8cxW7PcMJcfoUTlY-xBm9qNuhB5isy-PDLS5hIqQGIdRWiUf4viVxHF5yn5DJ0wHtoqHYKmP=s1600)
Writing every character (like `[0123456789]` or `[abcd]`) is somewhat slow and also erroneous, what is the short-cut?
## Ranges
Ranges makes our work easier. Consecutive characters can simply be replaced by putting a dash between the smallest and largest character.
For example, `abcdef` --> `a-f`, `456` --> `4-6`, `abc3456` --> `a-c3-6`, `c367980` --> `c36-90`.
![regex](https://lh3.googleusercontent.com/PWRFyDwe-89sdNSbmGc528PZXWhoX_-GNq0gQ8X9fOA-NX1Q4hzQNq1-Ty1LYjjsL8L4nVbSgvaq=s1600)
Predict the output of the following regex:
1. **RegEx:** ```[a-d][^l-o][12][^5-7][l-p]```
**Text:** co13i, ae14p, eo30p, ce33l, dd14l.
Answer:
![enter image description here](https://lh3.googleusercontent.com/pgDHTvxQZ35ybyF7ozdeSGEBchiK8huiMQ3PfQWIgPSrzWoca8BpEoQ1yht8qyA4VVOdP6dNa-sl=s1600)
**Note:** If you write the range in reverse order (ex. 9-0), then it is an error.
2. **RegEx:** ``[a-zB-D934][A-Zab0-9]``
**Text:** t9, da, A9, zZ, 99, 3D, aCvcC9.
Answer:
![enter image description here](https://lh3.googleusercontent.com/ftl29tcq2QeaOCMpRs6kwEwtNYaxvTzkkLB-2SGi2WjkSWvUfAMmTTrE7NXtyORo8gaptghcnJQJ=s1600)
## Predefined Character Classes
1. **`\w` & `\W`**: `\w` is just a short form of a character class `[A-Za-Z0-9_]`. `\w` is called word character class.
![enter image description here](https://lh3.googleusercontent.com/UzEtYLNxnrtpDgOIW1N9SeyJ5Nyeh51hHIb516CwPOJutVSkWQZpcDfo09lSXmGzDMxDgtoikJAU=s1600)
`\W` is equivalent to ``[^\w]``. `\W` matches everything other than word characters.
![enter image description here](https://lh3.googleusercontent.com/cKEXAPheBxESGkBoe8zOONJP3REvaTUhYs4FPkPizMU4t-v2_enG-9Jk8tgF-HX6Wxrn0jQATBes=s1600)
2. **`\d` & `\D`**: `\d` matches any digit character. It is equivalent to character class `[0-9]`.
![enter image description here](https://lh3.googleusercontent.com/Q1WTXPIBFR0fCJ7QT5jdU_XummS39Jqzi96l1g_ijg-LA4hoSLf05pscFT32lW-39yEPC5uDP-V_=s1600)
`\D` is equivalent to ``[^\d]``. `\D` matches everything other than digits.
![enter image description here](https://lh3.googleusercontent.com/JWIIzBQOIqi7lPIkrveW6h_gL1C5sWd_0cNGCswkBxRGoNKDB9ZKN4Zwd21BdEmfuluuzu-THYpc=s1600)
3. **`\s` & `\S`**: `\s` matches whitespace characters. Tab(`\t`), newline(`\n`) & space(` `) are whitespace characters. These characters are called non-printable characters.
![enter image description here](https://lh3.googleusercontent.com/LbokzFHfw58rfmDUlcVoktdYHZtbWi76ddM-6-qyTiNVnk4s0Ea9KfC1KHRJkjTvDYRnbKXprkPr=s1600)
Similarly, `\S` is equivalent to ``[^\s]``. `\S` matches everything other than whitespace characters.
![enter image description here](https://lh3.googleusercontent.com/Vp2QdnqK-WOhuZaCZW82IBVNCPmVC--O2te2XzXKqCKwZJe4FKoJVHlzevhBgNfUSzF-34FcZFof=s1600)
4. **dot(`.`)**: Dot matches any character except `\n`(line-break or new-line character) and `\r`(carriage-return character). Dot(`.`) is known as a **wildcard**.
![enter image description here](https://lh3.googleusercontent.com/jwBp2XH1lL9ZRu_wASyTYsD03p81_3DIRjfHWtH5cA3jpSDuGfmE3P5A0RIhSfbrusmoV8w1D9k1=s1600)
**Note:** `\r` is known as a windows style new-line character.
Predict the output of the following regex:
1. **RegEx:** ``[01][01][0-1]\W\s\d``
**Text:** Binary to decimal data: 001- 1, 010- 2, 011- 3, a01- 4, 100- 4.
Answer:
![enter image description here](https://lh3.googleusercontent.com/YzmmRMcSqjhJPHthh_MwLnGldVl4nYR86Bb83viXeT2SM0koPmFjFKOathYXxxLyLKSz96Gkigcl=s1600)
### Problems
1. Write a regex to match 28th February of any year. Date is in dd-mm-yyyy format.
Answer: `28-02-\d\d\d\d`
2. Write a regex to match dates that are not in March. Consider that, the dates are valid and no proper format is given, i.e. it can be in dd.mm.yyyy, dd\mm\yyyy, dd/mm/yyyy format.
Answer: `\d\d\W[10][^3]\W\d\d\d\d`
Note that, the above regex will also match dd-mm.yyyy or dd/mm\yyyy kind of wrong format, this problem can be solved by using backreferencing.
## Alternation (OR operator)
**Character class** can be used to match a single character out of several possible characters. Alternation is more generic than character class. It can also be used to match an expression out of several possible expressions.
![enter image description here](https://lh3.googleusercontent.com/JxvB4lPgFBxk0FA-d3XapsLPad_JnhegB7NJ7BpRqmZkNuLwivUTAl08ek14EwpIM42LoIeB2O_c=s1600)
In the above example, ``cat|dog|lion`` basically means 'either cat or dog or lion'. Here, we have used specific expression(cat, dog & lion), but we can use any regular expression. For example,
![enter image description here](https://lh3.googleusercontent.com/syg6gq9LBaQfA43tyVVlIqHd3oD0jbhBnxrLkho21EsD4DoUy2gGfJZoLgWaz0PA5BxWHghdybRD=s1600)
### Problem
- Find a regex to match boot or bot.
Answer: There more than one possible answers: `boot|bot`, `b(o|oo)t`. Last expression is using a group.
### Problem with OR operator:
Suppose, you want to match two words **Set** and **SetValue**. What will be the regular expression?
From whatever we have learned so far, you will say, ``Set|SetValue`` will be the answer. But it is not correct.
![enter image description here](https://lh3.googleusercontent.com/fcX6JOLw9u7YK0Sbeec_0tol5IbcxFHR4zlb3TSgCoMb3tlkZlGWCNd-9KNc832YMcuG7TcceJyz=s1600)
If you try `SetValue|Set`, then it is working.
![enter image description here](https://lh3.googleusercontent.com/H8y8thO6EVan-NfxRKoGSEZG-ObgbX4EdfvHMJznr37m1Q-DekLpHMce3OqdAlje9jNQFy2ZK4u2=s1600)
Can you observe anything from it?
**OR operator** tries to match a substring starting from the first word(or expression)-in the regex. If it is a match, then it will not try to match the next word(or expression) at the same place in text.
Find out an regex which matches each and every word in the following set: `{bat, cat, hat, mat, nat, oat, pat, Pat, ot}`. The regex should be as small as possible.
**Hint:** Use character-class, ranges and or-operator together.
Answer: `[b-chm-pP]at|ot`
## Quantifiers (Repetition)
To match 3 digit patterns, we can use ``[0-9][0-9][0-9]``. What if we have n digit patterns? We have to write `[0-9]` n times, but that is a waste of time. Here is when quantifiers come for help.
1. **Limiting repetitions(``{min, max}``):** To match n digit patterns, we can simply write ``[0-9]{n}``. Instead of n, by providing minimum and maximum values as ``[0-9]{min, max}``, we can match a pattern repeating min to max times.
Let's see an example to match all numbers between 1 to 999.
![enter image description here](https://lh3.googleusercontent.com/i-Xd_gn0AYks2HX3HL-8kbVQWHaUzuuO5VO2ZoV5sqxIfFRyniKMEWNvM758zIfFb1ArY3q08dp5=s1600)
**Note:** If you don't write the upper bound(``{min,}``), then it basically means, there is no limit for maximum repetitions.
2. **``+`` quantifier:** It is equivalent to ``{1,}``-at least one occurrence.
![enter image description here](https://lh3.googleusercontent.com/_f5hQYEghXft3ZttKB7r177rDXXT4m04TQlkjnsg-2E5fkUAOZNUxeLYvWIt6T7B2XLTeVUkoXu1=s1600)
3. **``*``quantifier:** It is equivalent to ``{0,}``-zero or more occurrences. ![enter image description here](https://lh3.googleusercontent.com/vGqELFywEUZ5jilWFotcN_l4IC0MCpg45TMAB2k3x80nAm6gn_2R9NB3h8KMiXNB6aG3HlZ2C0Hs=s1600)
4. **``?`` quantifier:** It is equivalent to ``{0,1}``, either zero or one occurrence. ``?`` is very useful for optional occurrences in patterns.
Let's see an example to match negative and positive numbers.
![enter image description here](https://lh3.googleusercontent.com/YBbsvb14Aoje2CB32deP6kszaZ0OcUWThaK71y5RZ7q6eqQ8H4EkL8XzZOB9IoSKB_Tav37lE__W=s1600)
### Problems
1. Find out a regex to match positive integers or floating point numbers with exactly two characters after the decimal point.
Answer: `\d+(\.\d\d)?`
2. Predict the output of the following regex:
RegEx: `[abc]{2,}`
Text:
<code>aaa
abc
abbccc
avbcc
</code>
Answer:
![enter image description here](https://lh3.googleusercontent.com/Uo7nokSqOEM8zLuOuA2a56q74vvBUmvmNQiUjdPSfn2H9yzTVRAW2DZlfCEsOaZJxhWg7tIdBLcv=s1600)
**Nature of Quantifiers:**
HTML tag is represented as <tag_name>some text</tag_name>. For example, `<title>Regular expression</title>`
So, can you figure out an expression that will match both <tag_name> & </tag_name>?
Most of the people will say, it is `<.*>`. But it gives different result.
![enter image description here](https://lh3.googleusercontent.com/dYECtAiwn0dWJwY0K8gzb6U_vrzoihid1bgJxEvHA3G64Wm49dM5BVl5V41AHb3D1MxQ_t1MNXhh=s1600)
So, rather than matching up till first `>`, it matches the whole tag. So, quantifiers are greedy by default. It is called **Greediness!**
Now, if we use `?`, then following happens.
![enter image description here](https://lh3.googleusercontent.com/7Em674dudMFsnG-T5PiM8wpXBH8FOp3AW3o992orHzA49lO9muYkmAi4-NDKP4FV-Ay926fKdtEt=s1600)
### Lazy matching:
As we have seen, the default nature of quantifier is greedy, so it will match as many characters as possible.
![enter image description here](https://lh3.googleusercontent.com/ye_RJ7Gc34O3BL6zMFEFodtbHsrK9hnwfQ3sxYWkrDHHdUtR6JvOcPT77kKMaNtn3IJonpmWNz1E=s1600)
To make it lazy, we use `?` quantifier, which turns the regex engine to match as less characters as possible which satisfies the regex.
![enter image description here](https://lh3.googleusercontent.com/tHhbib6sbPqxvj9Or7SB2Gk6ODGjzwuAS0NOSi0FUb60SRaoc8RuxyGcGJqqWbInHmP6Z0eiQRua=s1600)
**Note:** Now, you may be thinking, what if we want to match characters like `*, ?, +, {, }` in the text. We will look at it shortly. Keep reading!
Predict the output of the following regex:
1. Predict the output of the following regex:
RegEx: `(var|let)\s[a-zA-Z0-9_]\w* =\s"?\w+"?;`
Text:
<code>var carname = "volvo";
console.log(carname);
let age = 8;
var date = "23-03-2020";</code>
Answer:
![enter image description here](https://lh3.googleusercontent.com/svrWV_MmE9x6WNVelmFRArJ-cvsY6VgUIUhT74KsvG0wdyifVmgOoMitGZHJ-ZjM1zgCJH6ID-S3=s1600)
## Boundary Matchers
Now, we will learn how to match patterns at specific positions, like before, after or between some characters. For this purpose we use special characters like `^`,`$`,`\b & \B`,`\A`,`\z & \Z`, which are known as anchors.
**Notes:**
- Line is a string which ends at a line-break or a new-line character `\n`.
- There is a slight change in javascript code, we were using up till now. Instead of `/____/g`, we will now use `/____/gm`. Modifier 'm' is used to perform multiline search. Notice it in next images!
- Word character can be represented by, `[A-Za-z0-9_]`.
- **Anchor `^`**: It is used to match patterns at the very start of a line.
For example,
![enter image description here](https://lh3.googleusercontent.com/AsEJmx-zVHGJAIOK7wzOpJLXC_EFtszcSvDE3ByluMDZkAh01Z6Z48n0LqZXLnjq0e7CiKqyXoWG=s1600)
It will show a match, only if the pattern is occuring at the start of the line.
- **Anchor `$`**: Similarly, ``$`` is used to match patterns at the very end of a line.
![enter image description here](https://lh3.googleusercontent.com/iwj0YbwW5I5ocLglFFypayBFmNpHClc3Bew-DXer_XYhvhJ2QMd0z4ZFEpgc_sZqMFA91HgbkxYz=s1600)
It will show a match, only if the pattern is occuring at the end of a line.
Example, both `^` and `$`,
![enter image description here](https://lh3.googleusercontent.com/udplrGsLdQkeEkwCZl06ZmqM7M-kSP18Lyk7w0qFqrB46EAZJhJalhQ3WbPPnzcNt4jeHJxlgAXi=s1600)
- **Anchors `\b` & `\B`**: `\b` is called **word boundary character**.
Below is a list of positions, which qualifies as a **boundary** for `\b`:
If Regex-pattern is ending(or starting) with,
- A word character, then boundary is itself(word character). Let's call it a word boundary.
- A non-word character, then boundary is the next word-character. Let's call it a non-word boundary.
So, in short `\b` is only looking for word-character at boundaries, so it is called **word boundary character**.
Let's first observe some examples to understand it's working:
![enter image description here](https://lh3.googleusercontent.com/RJqHdBX--517Xuq4MWDVoGsBoOFGIbWhqs8YOwWFU-LKjwSQLfVuQPHTgWjxU3rR54D6bg2TNxJu=s1600)
What did you observe? Our regex-pattern is starting and ending with a word character. So, the match occurs only if there is a substring starting and ending at word characters, which are required in our regex `[a-z]` and `\d` respectively.
Now, let's look at one more example.
![enter image description here](https://lh3.googleusercontent.com/NYHZmEuDDxajsib_cG248-x-L1YgxAR-Tn3KqYgnOJSM1uk1EQgbjTHTL_E9C-tNVjQ1qIxZyU_e=s1600)
Here `\+` will show a match for `+`.
What did you observe?
**First observation:** Our pattern is starting with a non-word character and ending with a word character. So, the match occurs only if there is a substring having a non-word boundary at starting and word boundary at the ending.
**Second observation:** Non-word character after a word-boundary does not affect the result.
`\b` need not be used in pair. You can use a single `\b`.
![enter image description here](https://lh3.googleusercontent.com/ZQWwiNznl_P_foHiLgoxmwN7b0xRN54qMHUk9B4wSuiDDKSWdkOY9rjSwjTGvbEWKqptvLG2i_Rk=s1600)
`\B` is just a complement of `\b`. `\B` matches at all the positions that is not a word boundary. Observe two examples below:
![enter image description here](https://lh3.googleusercontent.com/gDCZqZbdNHcP9XetuE8hqx_7BLE2rh27Z1Bnz-SPUKJmj-_qEdFrXfP4xCzpPCxA7RYN1ZP6L0Zk=s1600)
![enter image description here](https://lh3.googleusercontent.com/ucV3anQWFfYhO7qaXcU7-25bqR3W7KO87aSCVcd_GDMFtfEvWmqK-JvC0jJ62KQevzfhPdn4DMTC=s1600)
**Note:** `\A` and `\z & \Z` are another anchors, which are used to match at the very start of input text and at very end of input text respectively. But it is not supported in Javascript.
Predict the output of the following regex:
1. **RegEx:** ```^[\w$#%@!&^*]{6,18}$```
**Text:**
<code>This is matching passwords of length between 6 to 18:
Abfah45$
gadfaJ%33
Abjapda454&1 spc
bjaphgu12$
Note that no whitespace characters are allowed.</code>
Answer:
![enter image description here](https://lh3.googleusercontent.com/bsQ8pb1a2tBRUHfE8ul2wiJ_7rRQjilptThKAy2WY2P0ndudVI8UEsiyn8eXLGbRvY6hj7Fdg6GM=s1600)
2. RegEx: `\b\w+:\B`
Text: <code>1232: , +1232:, abc:, abc:a, abc89, (+abc::)</code>
Answer: ![enter image description here](https://lh3.googleusercontent.com/TgLgYDWKd9wbBmzzq1_m-CRh4t5ndAoLnB-6nH0kWTpJSkZk9ePhH_5j2uFByAzxLUSlGeeLf5HR=s1600)
## Groups & Capturing
Grouping is the most useful feature of regex. Grouping can be done by placing regular expression inside round brackets.
It unifies the regular expressions inside it as a single unit. Let's look at its usages one by one:
1. It makes the regular expression more readable and sometimes it is an inevitable thing.
![enter image description here](https://lh3.googleusercontent.com/JVFtW5n8xGothhQa-MCzSV-oIEFM7zpjOJWDAOto_JrjQalqcEt29LtCT4m62FZHuVLgntRJVJYN=s1600)
Suppose, we want to match both the sentences in the above text, then grouping is the inevitable thing.
![enter image description here](https://lh3.googleusercontent.com/_LLeROC6R9eAel9CF-FpwRvm4T2styQe60Qv4Zokhvky_6pbGrLJSWYd5TLaz5NwB6zaOwd3fKmX=s1600)
2. To apply quantifiers to one or more expressions.
![enter image description here](https://lh3.googleusercontent.com/cSn7JesNbcMaaXb_tFi1ymMlKtZxe7G09jROJtWuu7kPvUmAGOU_CDiVp9k0NQ8FuCistLgW4vUg=s1600)
Similarly, you can use other quantifiers.
3. To extract and replace substrings using groups. So, we call groups **Capturing groups**, becuase we are capturing data(substrings) using groups.
In this part, we will see how to extract and replace data using groups in Javascript.
**Data Extraction:**
Observe the code below.
```js
var str = "2020-01-20";
// Pattern string
var pattern = /(\d{4})-(\d{2})-(\d{2})/g;
// ^ ^ ^
//group-no: 1 2 3
var result = pattern.exec(str);
// printing
console.log(result);
/* Output will be:
[
'2020-01-20', //-------pattern
'2020', //-----First group
'01', //-------Second group
'20', //-------Third group
index: 0,
input: '2020-01-20',
groups: undefined
]
*/
// Data extraction
console.log(result[1]); // First group
console.log(result[2]); // Second group
console.log(result[3]); // Third group
```
In the output array, the first data is a match string followed by the matched groups in the order.
**Data Replacement:**
`Replace` is another function, which can be used to replace and rearrange the data using regex. Observe the code below.
```js
var str = "2020-01-20";
// Pattern string
var pattern = /(\d{4})-(\d{2})-(\d{2})/g;
// ^ ^ ^
//group-no: 1 2 3
// Data replacement using $group_no
var ans=str.replace(pattern, '$3-$2-$1');
console.log(ans);
// Output will be: 20-01-2020
```
As you can see, we have used `$group_no` to indicate the capturing group.
Predict the output of the following regex:
1. RegEx: `([abc]){2,}(one|two)`
Text:
<code>aone
cqtwo
abone
actwo
abcbtwoone
abbcccone
</code>
Answer: ![enter image description here](https://lh3.googleusercontent.com/oPAS6ExWeAQoTEc7GRnrd4iuCmeSZt1-6Jv58t1_clZgHkpT9U1qPJFEXE780Bdz9oTeypFEgpRy=s1600)
2. RegEx: `([\dab]+(r|c)){2}`
Text:
<code>1r2c
ar4ccc
12abr12abc
acac, accaca, acaaca
aaar1234234c, aaa1234234c
194brar, 134bcbb-c </code>
Answer: ![enter image description here](https://lh3.googleusercontent.com/7t717TZSxuWDD92l58z8v0zInRNqiGvdP1q_AGPrN419PfxrMsHj27SaliwSC6EK2KZHWWvsr___=s1600)
## Characters with special meaning
We have seen that, we are using `*`, `+`, `.`, `$`, etc for different purposes. Now, if we want to match them themselves, we have to escape them using escape character(backslash-\\) .
Below is the table for these kind of characters and their escaped version, along with their usages.
| Character | Usage | Escaped version |
|:---------:|:---------------------------:|:---------------:|
| \ | escape character | \\\ |
| . | predefined character class | \\. |
| \| | OR operator | \\\ |
| * | as quantifier | \\* |
| + | as quantifier | \\+ |
| ? | as quantifier | \\? |
| ^ | boundary matcher | \\^ |
| $ | boundary matcher | \\$ |
| { | in quantifier notation | \\{ |
| } | in quantifier notation | \\} |
| [ | in character class notation | \\[ |
| ] | in character class notation | \\] |
| ( | in group notation | \\( |
| ) | in group notation | \\) |
| -|range operator | NA
Sometimes, it is also preferred to use escaped forward slash(`/`).
## Backreferencing
Backreferencing is used to match same text again. Backreferences match the same text as previously matched by a capturing group. Let's look at an example:
![enter image description here](https://lh3.googleusercontent.com/VrwREOtqL_b2IPbzM2qJQVAiP9Q8XWoAny41UodrLlEzWBxUbOJZ3WTvR7T0b-9zHn7iOqN8op3l=s1600)
The first captured group is (`\w+`), now we can use this group again by using a backreference (`\1`) at the closing tag, which matches the same text as in captured group `\w+`.
You can backreference any captured group by using `\group_no`.
Let's have two more examples:
![enter image description here](https://lh3.googleusercontent.com/Wx30vdBz2zif4zqMt1P6rJIh9b3NBOWz0XMzGZR50gU5n8p4sxhtCRWYl1j5hWYfJpI6jC5VEDEX=s1600)
![enter image description here](https://lh3.googleusercontent.com/Ji2buSF895THR4SIILfEou2SJMmuFExrpJsG8xWoxPSl6O-hDv7wPOHP1b_145NYh-M0yQYe0BxE=s1600)
**Problems:**
1. Match any palindrome string of length 6, having only lowercase letters.
Answer: `([a-z])([a-z])([a-z])\3\2\1`
2. RegEx: `(\w+)oo\1le`
Text: `google, doodle jump, ggooggle, ssoosle`
Answer:
![enter image description here](https://lh3.googleusercontent.com/y_4mk8QPlH0dWWqzXVhm5_V9wZlieX36x_sPTfX_Tr86l5SFD0so0ejYXD2dy2BiadXWHGVzxfkW=s1600)
**Note:** For group numbers more than 9, there is a syntax difference.
## Named Groups
Regular expressions with lots of groups and backreferencing can be difficult to maintain, as adding or removing a capturing group in the middle of the regex turns to change the numbers of all the groups that follow the added or removed group.
In regex, we have facility of named groups, which solves the above issue. Let's look at it.
We can name a group by putting `?<name>` just after opening the paranthesis representing a group. For example, `(?<year>\d{4})` is a named group.
Below is a code, we have already looked in **capturing groups** part. You can see, the code is more readable now.
```js
var str = "2020-01-20";
// Pattern string
var pattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/g;
// ^ ^ ^
//group-no: 1 2 3
// Data replacement using $<group_name>
var ans=str.replace(pattern, '$<day>-$<month>-$<year>');
console.log(ans);
// Output will be: 20-01-2020
```
Backreference syntax for numbered groups works for named capture groups as well. `\k<name>` matches the string that was previously matched by the named capture group `name`, which is a standard way to backreference named group.
![enter image description here](https://lh3.googleusercontent.com/GbalU1cVIs8G5kP_pTXBj8eC1MF1M3vNVFaeAvnFcVyOUPg7mhBDoVu8nye9oC4inIg_4Vnzst1y=s1600)
## Practical Applications of RegEx
1. Syntax highlighting systems
2. Data scraping and wrangling
3. In find and replace facility of text editors
Now that, you have learned RegEx. Let's look at some classical examples of RegEx.
## Classical examples
1. **Number Ranges:**
Can you find a regex matching all integers from 0 to 255?
First, Let's look at how can we match all integers from 0 to 59:
![enter image description here](https://lh3.googleusercontent.com/my4yBp9jXoS3wHCjkL_OVD3EjmsUtj61hIWKCssvS5PH0et26vbhhSIoM5Jphx-FVwK6yvc0qT7u=s1600)
As you can see, we have used `?` quantifier to make the first digit(0-5) optional. Now, can you solve it for 0-255?
Hint : Use OR operator.
We can divide the range 0-255 into three ranges: 0-199, 200-249 & 250-255. Now, creating an expression, for each of them independently, is easy.
| Range| RegEx |
| :--: | :--: |
| 0-199 | `[01][0-9][0-9]` |
| 200-249| `2[0-4][0-9]`|
| 250-255| `25[0-5]`|
Now, by using OR operator, we can match the whole 0-255 range.
![enter image description here](https://lh3.googleusercontent.com/ao-aGXzSqOsICjuE8DlAmTtNCvMWVHKeStYMlccX3ounTHaTlI4qLNhAcm7jM_3P0FPAjXi0k2zB=s1600)
As you can see, the above regex is not going to match 0, but 000. So, how can you modify the regex which matches 0 as well, rather than matching 001 only?
![enter image description here](https://lh3.googleusercontent.com/qxfoJm05roxaS3FwOoTy0Nr88Ku2KvZPHRUl7_HB5ybcpUhLspgQAbitJjvV-J5Gi4OG2upbTl9K=s1600)
We have just used `?` quantifier.
2. **Validate an IP address:**
IP address consists of digits from 0-255 and 3 points(`.`). Valid IP address format is (0-255).(0-255).(0-255).(0-255).
For example, 10.10.11.4, 255.255.255.255, 234.9.64.43, 1.2.3.4 are Valid IP addresses.
Can you find a regex to match an IP-address?
We have already seen, how to match number ranges and to match a point, we use escaped-dot(`\.`). But in IP address, we don't allow leading zeroes in numbers like 001.
So, We have to divide the range in four sub-ranges: 0-99, 100-199, 200-249, 250-255. And finally we use OR-operator.
![enter image description here](https://lh3.googleusercontent.com/UXWulQq4P0STobfIjrW4Um5jaqwSqTPx3KSSYYvEg_HZJ22a5wNQYio3qQUtzgidIOxLO_M48M9p=s1600)
So, Regex to match IP Address is as below:
![enter image description here](https://lh3.googleusercontent.com/n2CaC-8Q8NH-H5RkDrCM4AQQkV2PamIAlA3dwljdRsW33WWoj18qJEIN5iyzjLzfifHj-dh-IW-u=s1600)
**Note:** The whole expression is contiguous, for the shake of easy understanding it is shown the way it is.
### Bonus Problem:
Predict the output of the following regex:
**RegEx:** ``\b(0|(1(01*0)*1))*\b``
**Text:** This RegEx denotes the set of binary numbers divisible by 3:
0,11,1010, 1100, 1111, 1001
Answer:
![enter image description here](https://lh3.googleusercontent.com/d4c8LeDk2JahKEQXXkMfPwDoDEmO6ijmqd4u3aHDUJ9_At1DZY95HSGZMPbJPn4Mptlfp3p1XZ4x=s1600)

View File

@ -1,461 +0,0 @@
## Regular Expression (RegEx)
While filling online forms, haven't you come across errors like "Please enter valid email address" or "Please enter valid phone number".
Annoying as they may be, there's a lot of black magic that the computer does before it determines that, the details you've entered are incorrect.
Can you think out, what is that black magic? If you are familar with algorithms, then you will say, we can write an algorithm for the same.
Yes, we can write an algorithm to verify different things. But we have a standard tool which is particularly designed for the similar kind of purposes.
It is **Regular Expression**. We call it **RegEx** for short. RegEx makes our work a lot easier. Let's see some basic examples where RegEx becomes handy.
Suppose, you are in search of an averge price of a particular product on amazon. The following regular expression will find you any price(\$12, \$75.50) on the webpage: `\$([0-9]+)\.([0-9]+)`.
Quite interesting!
Let's look at another example. You have a long list of documents with different kinds of extensions. You are particularly looking for data files having **.dat** extension.
`^.*\.dat$` is a regular expression which represents a set of string ending with **.dat**. Regular expression is a standardized way to encode such patterns.
Well. What does the name **Regular Expression(RegEx)** represent? Regular Expression represents the sequence of characters that defines a regular search pattern.
RegEx is a standardized tool to do the following works:
1. Find and verify patterns in a string.
2. Extract particular data present in the form of substrings.
3. Replace, split and rearrange particular parts of a string.
We are going to look at all the three things above.
Let's begin the journey with RegEx!
**Note:**
1. In all the images below, the first section is a RegEx code and below is a text, in which the matches are shown-the shaded regions show the match. All the images are taken using regexr.com.
2. In all the images, Small dot between words in the text shows a space.
3. **Alpha-numeric character** is the one which belongs to any of $0-9,A-Z,a-z$ ranges.
4. String is a sequence of characters.
## Simple Alpha-numeric character matching
Simple matching of a specific word can be done as the following:
![enter image description here](https://lh3.googleusercontent.com/YGfz9u58rRKD0ABrSKDv7ZJOEMaIMGdFWgJWGGNzCFNakCtfAZVk1UEm7mBS4lIX1LFXoV420cmY=s1600)
As you can see it matches "Reg" in the text. Similarly, what will be the match for "Ex" in the same text above?
![enter image description here](https://lh3.googleusercontent.com/LkJXO79wn08dvgX5Q2JXHtyN7MW38AeNdV7fjG6lk7MNsiamx9iOekEGQg-WS9OLQMWxBuspjSkh=s1600)
Do you notice anything? It is a **case sensitive**.
**Note:** Most of the programming languages have libraries for RegEx and almost similar kind of syntax usages. Here we will see how to implement in **Javascript**.
Here is a basic code in Javascript. Here pattern is written in /_____/g, where g is a modifier, which is used find all matches rather than stopping at the first match.
**Note:** The function **exec** returns null, if there is no match and match data otherwise.
```js
// Main text (string) in which we are finding
// Patterns
var str = "RegEx stands for Regular Expression!";
// Pattern string
var pattern = /Reg/g;
// This will print all the data of matches
// across the whole string
while(result = pattern.exec(str))
{
console.log(result); // printing
}
// This will be the output
/*
[
'Reg',
index: 0,
input: 'RegEx stands for Regular Expression!',
groups: undefined
]
[
'Reg',
index: 17,
input: 'RegEx stands for Regular Expression!',
groups: undefined
]
*/
```
**Note:** **Groups** in the above output is a RegEx concept we will look at it soon, keep reading.
Now, you can change the pattern and string in the code above to observe other patterns as we will learn below.
## Character classes:
![IMG](https://lh3.googleusercontent.com/dlLzL3teyEoax1JsdF7JeGP6DZOJll-UgnZqFkJtBeAVJhM1xMnXBHVeJYf_cUFLmj-f1qPO8asf=s1600)
What if you want to match both "soon" and "moon" or basically words ending with "oon"?
![enter image description here](https://lh3.googleusercontent.com/bsRHqYuPZIQ7Yra4-zyF1BX2pIYDukCEtTfCK3rjaCTRmTAuo_fuHTVK5sJjbTdbXjGTVq1z5eYc=s1600)
What did you observe? You can see that adding `[sm]` matches both $soon$ and $moon$. Here `[sm]` is called character class, which is basically a list of characters we want to match.
More formally, `[abc]` is basically either `a` or `b` or `c`.
Predict the output of the following:
1. **RegEx code:** ```[ABC][12]```
**Text:** A1 grade is the best, but I scored A2.
Answer:
![enter image description here](https://lh3.googleusercontent.com/2JgAwwprVtZ8AfQMLoB9PEi7NoyJXxrEqv_46tLvdtCBKo4HBcPlbi3atyKsWJTwJLBTneyA_C0j=s1600)
2. **RegEx code:** ```[0123456789][12345]:[abcdef][67890]:[0123456789][67890]:[1234589][abcdef]```
**Text:** Let's match 14:f6:89:3c mac address type of pattern. Other patterns are 51:a6:90:c5, 44:t6:u9:3d, 72:c8:39:8e.
Answer:
![enter image description here](https://lh3.googleusercontent.com/d2ynsBn5p8gIzvQeKewe8VrPiEu0EyOoNiEBkj_Co8fq_12FKhWK81V1Rcc2YCs3or9d4sCbuGtA=s1600)
Now, if we put `^`, then it will show a match for characters other than the ones in the bracket.
![soon moon noon woon](https://lh3.googleusercontent.com/rj-zgBEZ7Fdv6rckQgHC90L_j7y1X7jj8veTZQoOKGQ2RSiEHPxPeSZUZoJE9yLW-o2dvXj6OI1j=s1600)
Predict the output for the following:
**RegEx code:** ```[^13579]A[^abc]z3[590*-]```
**Text:** 1Abz33 will match or 2Atz30 and 8Adz3*.
Answer:
![enter image description here](https://lh3.googleusercontent.com/BXaE8cxW7PcMJcfoUTlY-xBm9qNuhB5isy-PDLS5hIqQGIdRWiUf4viVxHF5yn5DJ0wHtoqHYKmP=s1600)
Writing every characters(like `[0123456789]` or `[abcd]`) is some what slow and also errorneous, what is the short-cut?
## Ranges
Ranges makes our work easier. Consecutive characters can simply be replaced by putting a dash between the first and last character.
![regex](https://lh3.googleusercontent.com/PWRFyDwe-89sdNSbmGc528PZXWhoX_-GNq0gQ8X9fOA-NX1Q4hzQNq1-Ty1LYjjsL8L4nVbSgvaq=s1600)
Predict the output of the following regex:
**RegEx code:** ```[a-d][^l-o][12][^5-7][l-p]```
**Text:** co13i, ae14p, eo30p, ce33l, dd14l.
Answer:
![enter image description here](https://lh3.googleusercontent.com/pgDHTvxQZ35ybyF7ozdeSGEBchiK8huiMQ3PfQWIgPSrzWoca8BpEoQ1yht8qyA4VVOdP6dNa-sl=s1600)
**Note:** If you write the range in reverse order (ex. 9-0), then it is an error.
Predict the output of the following regex:
**RegEx code:** ``[a-zB-D934][A-Zab0-9]``
**Text:** t9, da, A9, zZ, 99, 3D, aCvcC9.
Answer:
![enter image description here](https://lh3.googleusercontent.com/ftl29tcq2QeaOCMpRs6kwEwtNYaxvTzkkLB-2SGi2WjkSWvUfAMmTTrE7NXtyORo8gaptghcnJQJ=s1600)
## Predefined Character Classes
1. **`\w` & `\W`**: `\w` is just a short form of a character class `[A-Za-Z0-9_]`.
![enter image description here](https://lh3.googleusercontent.com/UzEtYLNxnrtpDgOIW1N9SeyJ5Nyeh51hHIb516CwPOJutVSkWQZpcDfo09lSXmGzDMxDgtoikJAU=s1600)
`\W` is equivalent to ``[^\w]``.
![enter image description here](https://lh3.googleusercontent.com/cKEXAPheBxESGkBoe8zOONJP3REvaTUhYs4FPkPizMU4t-v2_enG-9Jk8tgF-HX6Wxrn0jQATBes=s1600)
2. **`\d` & `\D`**: `\d` matches any digit character. It is equivalent to character class `[0-9]`.
![enter image description here](https://lh3.googleusercontent.com/Q1WTXPIBFR0fCJ7QT5jdU_XummS39Jqzi96l1g_ijg-LA4hoSLf05pscFT32lW-39yEPC5uDP-V_=s1600)
`\D` is equivalent to ``[^\d]``.
![enter image description here](https://lh3.googleusercontent.com/JWIIzBQOIqi7lPIkrveW6h_gL1C5sWd_0cNGCswkBxRGoNKDB9ZKN4Zwd21BdEmfuluuzu-THYpc=s1600)
3. **`\s` & `\S`**: `\s` matches white space characters. Tab(`\t`), newline(`\n`) & space(` `) are whitespace characters.
![enter image description here](https://lh3.googleusercontent.com/LbokzFHfw58rfmDUlcVoktdYHZtbWi76ddM-6-qyTiNVnk4s0Ea9KfC1KHRJkjTvDYRnbKXprkPr=s1600)
Similarly, `\S` is equivalent to ``[^\s]``.
![enter image description here](https://lh3.googleusercontent.com/Vp2QdnqK-WOhuZaCZW82IBVNCPmVC--O2te2XzXKqCKwZJe4FKoJVHlzevhBgNfUSzF-34FcZFof=s1600)
4. **dot(`.`)**: Dot matches any character except `\n`(line break or new line character) and `\r`(carriage-return character). It is known as **wildcard matching**.
![enter image description here](https://lh3.googleusercontent.com/jwBp2XH1lL9ZRu_wASyTYsD03p81_3DIRjfHWtH5cA3jpSDuGfmE3P5A0RIhSfbrusmoV8w1D9k1=s1600)
**Note:** `\r` is known as windows style new line character.
Predict the output of the following regex:
1. **RegEx code:** ``[01][01][0-1]\W\s\d``
**Text:** Binary to decimal data: 001- 1, 010- 2, 011- 3, a01- 4, 100- 4.
Answer:
![enter image description here](https://lh3.googleusercontent.com/YzmmRMcSqjhJPHthh_MwLnGldVl4nYR86Bb83viXeT2SM0koPmFjFKOathYXxxLyLKSz96Gkigcl=s1600)
2. **RegEx code:**
**Text:**
## Alternation (OR operator)
As we have seen **character class**, it can be used to match a single character out of several possible characters, Alternation is more generic. It can also be used to match an expression out of several possible expressions.
![enter image description here](https://lh3.googleusercontent.com/JxvB4lPgFBxk0FA-d3XapsLPad_JnhegB7NJ7BpRqmZkNuLwivUTAl08ek14EwpIM42LoIeB2O_c=s1600)
In the above example, ``cat|dog|lion`` basically means either cat or dog or lion. Here we have used specific patterns(cat, dog & lion) but we can use any regular expression. For example,
![enter image description here](https://lh3.googleusercontent.com/syg6gq9LBaQfA43tyVVlIqHd3oD0jbhBnxrLkho21EsD4DoUy2gGfJZoLgWaz0PA5BxWHghdybRD=s1600)
### Problem with OR operator:
Suppose that you want to match two words either **Set** or **SetValue**. What will be the regular expression?
From what ever we have learned till now, you will say ``Set|SetValue`` will be the answer. But it is not correct.
![enter image description here](https://lh3.googleusercontent.com/fcX6JOLw9u7YK0Sbeec_0tol5IbcxFHR4zlb3TSgCoMb3tlkZlGWCNd-9KNc832YMcuG7TcceJyz=s1600)
If you try `SetValue|Set`, then it is working.
![enter image description here](https://lh3.googleusercontent.com/H8y8thO6EVan-NfxRKoGSEZG-ObgbX4EdfvHMJznr37m1Q-DekLpHMce3OqdAlje9jNQFy2ZK4u2=s1600)
Can you observe anything from it?
**OR operator** tries to match starting from the first word(in the expression), if it is a match, then it will not try to match next word(in the expression) at the same place in text.
Predict the output of the following regex:
1. **RegEx code:**
**Text:**
2. **RegEx code:**
**Text:**
## Quantifiers (Repetition)
We have seen that to match 3 digit patterns we can use ``[0-9][0-9][0-9]``. What if we have n digit patterns? We have to write `[0-9]` n times, but that is really waste of time. Here is when quantifiers comes for help.
1. **Limiting repetitions(``{min, max}``):** To match n digit pattern we can simply write ``[0-9]{n}``. Instead of ``{n}`` by providing minimum and maximum values as ``[0-9]{min, max}``, we can match a pattern repeating min to max times.
Let's see an example to match all numbers between 1 to 999.
![enter image description here](https://lh3.googleusercontent.com/i-Xd_gn0AYks2HX3HL-8kbVQWHaUzuuO5VO2ZoV5sqxIfFRyniKMEWNvM758zIfFb1ArY3q08dp5=s1600)
**Note:** If you don't write the upper bound(``{min,}``), then it basically means, there is no limit for maximum repetitions.
2. **``+`` quantifier:** It basically means ``{1,}``-at least one occurrence.
![enter image description here](https://lh3.googleusercontent.com/_f5hQYEghXft3ZttKB7r177rDXXT4m04TQlkjnsg-2E5fkUAOZNUxeLYvWIt6T7B2XLTeVUkoXu1=s1600)
3. **``*``quantifier:** It is equivalent to ``{0,}``-zero or more occurrences.
Let's
![enter image description here](https://lh3.googleusercontent.com/vGqELFywEUZ5jilWFotcN_l4IC0MCpg45TMAB2k3x80nAm6gn_2R9NB3h8KMiXNB6aG3HlZ2C0Hs=s1600)
4. **``?`` quantifier:** It is equivalent to ``{0,1}``, either zero or one occurrence. ``?`` is very useful for optional occurrences in patterns.
Let's see an example to match negative and positive numbers.
![enter image description here](https://lh3.googleusercontent.com/YBbsvb14Aoje2CB32deP6kszaZ0OcUWThaK71y5RZ7q6eqQ8H4EkL8XzZOB9IoSKB_Tav37lE__W=s1600)
**Nature of Quantifiers:**
HTML tag is represented as <tag_name>some text</tag_name>. For example, <title>Regular expression</title>
So can you figure out an expression that will match both <tag_name> & </tag_name>?
Most of the people will say, it is `<.*>`. But it gives different result.
![enter image description here](https://lh3.googleusercontent.com/dYECtAiwn0dWJwY0K8gzb6U_vrzoihid1bgJxEvHA3G64Wm49dM5BVl5V41AHb3D1MxQ_t1MNXhh=s1600)
So, rather than matching up till first `>`, it matches the whole tag. So, quantifiers are greedy by default. It is called **Greediness!**
To make it lazy, we use `?` quantifier. That stops the regex engine going further(makes it optional).
![enter image description here](https://lh3.googleusercontent.com/7Em674dudMFsnG-T5PiM8wpXBH8FOp3AW3o992orHzA49lO9muYkmAi4-NDKP4FV-Ay926fKdtEt=s1600)
Predict the output of the following regex:
1. **RegEx code:**
**Text:**
2. **RegEx code:**
**Text:**
**Note:** Now you may be thinking, what if we want to match characters like ***, ?, +, {, },** etc in the text. We will look at it shortly. Keep reading!
## Boundary Matchers
Now, we will learn how to match pattern at specific positions, like before, after or between some characters. For this purpose we use special characters like `^`,`$`,`\b & \B`,`\A`,`\z & \Z`, which are known as anchors.
**Note:**
- Line is a string which ends at line break character like, `\n` or `\r`.
- There is a slight change in javascript code we were using uptill now. Instead of `/____/g`, we will now use `/____/gm`. Modifier 'm' is used to perform multiline search. Notice it in next images!
- Word character can be represented by, `[A-Za-z0-9_]`.
1. **Anchor `^`**: It is used to match patterns at the very start of a line.
For example,
![enter image description here](https://lh3.googleusercontent.com/AsEJmx-zVHGJAIOK7wzOpJLXC_EFtszcSvDE3ByluMDZkAh01Z6Z48n0LqZXLnjq0e7CiKqyXoWG=s1600)
It will show a match only if the pattern is occuring at the start of the line.
2. **Anchor `$`**: Similarly, ``$`` is used to match patterns at the very end of a line.
![enter image description here](https://lh3.googleusercontent.com/iwj0YbwW5I5ocLglFFypayBFmNpHClc3Bew-DXer_XYhvhJ2QMd0z4ZFEpgc_sZqMFA91HgbkxYz=s1600)
It will show a match only if the pattern is occuring at the end of a line.
Example using both `^` and `$`:
![enter image description here](https://lh3.googleusercontent.com/udplrGsLdQkeEkwCZl06ZmqM7M-kSP18Lyk7w0qFqrB46EAZJhJalhQ3WbPPnzcNt4jeHJxlgAXi=s1600)
3. **Anchors `\b` & `\B`**: `\b` is called **word boundary character**.
Let's first observe some examples:
![enter image description here](https://lh3.googleusercontent.com/RJqHdBX--517Xuq4MWDVoGsBoOFGIbWhqs8YOwWFU-LKjwSQLfVuQPHTgWjxU3rR54D6bg2TNxJu=s1600)
What did you observe? Our pattern is starting and ending with word characters and so the match occurs only if the substring is starting(`[a-z]`) and ending(`\d`) at word characters which are required in our pattern-`[a-z]` and `\d` respectively. Now let's look at one more example.
![enter image description here](https://lh3.googleusercontent.com/NYHZmEuDDxajsib_cG248-x-L1YgxAR-Tn3KqYgnOJSM1uk1EQgbjTHTL_E9C-tNVjQ1qIxZyU_e=s1600)
What did you observe?
**First observation:** Our pattern is starting with a non-word character and ending with a word character. So the match occurs only if there is a word character before the starting of a match string and there is a required `\d` character at the end.
**Second observation:** One new thing to observe is that, If our pattern is starting(or ending) with a word character, then the match can still occur if there is a non-word character before(or after) the match string.
`\b` need not be used in pair. You can use a single `\b`.
`\B` is just a complement of `\b`.
**Note:** `\A` and `\z & \Z` are another anchors which are used to match at very start of input text and at very end of input text respectively. But it is not supported in Javascript.
Predict the output of the following regex:
1. **RegEx code:**
**Text:**
2. **RegEx code:**
**Text:**
## Groups & Capturing
Grouping is the most useful feature of regex. Grouping can be done by placing regular expression inside round brackets.
It unifies the regular expressions inside it as a single unit. Let's look at its usages one by one:
1. It makes the regular expression more readable and sometimes it is an inevitable thing.
![enter image description here](https://lh3.googleusercontent.com/JVFtW5n8xGothhQa-MCzSV-oIEFM7zpjOJWDAOto_JrjQalqcEt29LtCT4m62FZHuVLgntRJVJYN=s1600)
Suppose we want to match both the sentences, then grouping is the inevitable thing.
![enter image description here](https://lh3.googleusercontent.com/CotGZFIfY0VW1yVHb2W2p7vqRaUmZXNLgS4IZoAOiPxU8O5813uR0LKGzmXUpkzfp6vVH1X1HE5T=s1600)
2. To apply quantifiers to one or more expressions.
![enter image description here](https://lh3.googleusercontent.com/cSn7JesNbcMaaXb_tFi1ymMlKtZxe7G09jROJtWuu7kPvUmAGOU_CDiVp9k0NQ8FuCistLgW4vUg=s1600)
Similarly, you can use other quantifiers.
3. To extract and replace substrings using groups. So we call groups **Capturing groups**, becuase we are capturing data(substrings) using groups.
In this part we will see how to extract and replace data using groups in Javascript.
In the output array, the first data is a match string followed by matched groups in the order.
```js
var str = "2020-01-20";
// Pattern string
var pattern = /(\d{4})-(\d{2})-(\d{2})/g;
// ^ ^ ^
//group-no: 1 2 3
var result = pattern.exec(str);
console.log(result);
/* Output will be:
[
'2020-01-20', //-------pattern
'2020', //-----First group
'01', //-------Second group
'20', //-------Third group
index: 0,
input: '2020-01-20',
groups: undefined
]
*/
// Data extraction
console.log(result[1]); // First group
console.log(result[2]); // Second group
console.log(result[3]); // Third group
```
`Replace` is another function which is used to replace and rearrange the data using groups.
```js
var str = "2020-01-20";
// Pattern string
var pattern = /(\d{4})-(\d{2})-(\d{2})/g;
// ^ ^ ^
//group-no: 1 2 3
// Data replacement
var ans=str.replace(pattern, '$3-$2-$1');
console.log(ans);
// Output will be: 20-01-2020
```
Predict the output of the following regex:
1. **RegEx code:**
**Text:**
2. **RegEx code:**
**Text:**
## Characters with special meaning
We have seen that we are using `*`, `+`, `.`, `$`, etc for different purposes. Now, you may be thinking, what if we want to match them themselves. For that purpose, we have to escape them using escape character(backslash-\\) .
Below is the table for these kind of characters and their escaped version, along with their usages.
| Character | Usage | Escaped version |
|:---------:|:---------------------------:|:---------------:|
| \ | escape character | \\\ |
| . | predefined character class | \\. |
| \| | OR operator | \\\ |
| * | as quantifier | \\* |
| + | as quantifier | \\+ |
| ? | as quantifier | \\? |
| ^ | boundary matcher | \\^ |
| $ | boundary matcher | \\$ |
| { | in quantifier notation | \\{ |
| } | in quantifier notation | \\} |
| [ | in character class notation | \\[ |
| ] | in character class notation | \\] |
| ( | in group notation | \\( |
| ) | in group notation | \\) |
Sometimes, it is also preferred to use escaped forward slash(`/`).
## Backreferencing
Backreferencing is used to match same text again. Backreferences match the same text as previously matched by a capturing group. Let's look at an example:
![enter image description here](https://lh3.googleusercontent.com/VrwREOtqL_b2IPbzM2qJQVAiP9Q8XWoAny41UodrLlEzWBxUbOJZ3WTvR7T0b-9zHn7iOqN8op3l=s1600)
The first captured group is (`\w+`), now we can use this group again by using a backreference (`\1`), at the closing tag, which matches the same text as in captured group `\w+`.
You can use backreferencing for any captured group as \group_no.
Let's have one more example:
![enter image description here](https://lh3.googleusercontent.com/Wx30vdBz2zif4zqMt1P6rJIh9b3NBOWz0XMzGZR50gU5n8p4sxhtCRWYl1j5hWYfJpI6jC5VEDEX=s1600)
Predict the output of the following regex:
1. **RegEx code:**
**Text:**
2. **RegEx code:**
**Text:**
## Practical Applications of RegEx
1. Syntax highlighting systems
2. Data scraping and wrangling.
3. In find and replace facility of text editors