What if you want to match both "soon" and "moon" or basically words ending with "oon"?
What did you observe? You can see that adding [sm] matches both "soon" and "moon". Here [sm] is called a character class, which is a list of the characters we want to match at that position.
More formally, [abc] means 'either a or b or c'.
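As a quick check of the idea, here is a minimal sketch using Python's built-in re module. The sample sentence is made up for illustration and is not part of the original exercise.

```python
import re

# Hypothetical sample text; "noon" is included to show what is NOT matched.
text = "See you soon, under the full moon, around noon."

# [sm] matches exactly one character: either 's' or 'm'.
matches = re.findall(r"[sm]oon", text)
print(matches)  # ['soon', 'moon']
```

The word "noon" is skipped because 'n' is not listed inside the brackets.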
Predict the output of the following:
RegEx: [ABC][12]
Text: A1 grade is the best, but I scored A2.
Answer:
RegEx: [0123456789][12345]:[abcdef][67890]:[1234589][abcdef]
Text: Let's match 14:f6:3c mac address type of pattern.
Other patterns are 51:a6:c5, 44:t6:3d, 72:c8:8e.
Answer:
Now, if we put ^ as the first character inside the brackets, the class is negated: it matches any character other than the ones in the bracket.
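A small sketch of a negated class in Python's re module, using an invented word list rather than the exercise text below:

```python
import re

# Hypothetical sample text for illustration only.
text = "soon moon noon boon"

# [^sm] matches any single character EXCEPT 's' or 'm'.
negated = re.findall(r"[^sm]oon", text)
print(negated)  # ['noon', 'boon']
```

Keep in mind that a negated class still has to consume one character, so it can also match spaces or punctuation if they appear in that position.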
Predict the output for the following:
RegEx: [^13579]A[^abc]z3[590*-]
Text: 1Abz33 will match or 2Atz30 and 8Adz3*.
Answer:
Writing every character (like [0123456789] or [abcd]) is slow and error-prone. What is the shortcut?
Ranges make our work easier. Consecutive characters can be included in a character class using the dash operator. For example, the digits from 0 to 9 can be written simply as 0-9. Similarly, abcdef can be replaced by a-f.
Examples: 456 --> 4-6, abc3456 --> a-c3-6, c367980 --> c36-90.
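To see that a range is only shorthand for the spelled-out class, the following sketch compares the two forms in Python. The sample text is an assumption made for this illustration.

```python
import re

# Hypothetical sample text for illustration only.
text = "a1 f9 g3 c7 z0"

# The same class written out in full and written with ranges.
explicit = re.findall(r"[abcdef][0123456789]", text)
ranged = re.findall(r"[a-f][0-9]", text)

print(ranged)              # ['a1', 'f9', 'c7']
print(explicit == ranged)  # True
```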
Predict the output of the following regex:
RegEx: [a-d][^l-o][12][^5-7][l-p]
Text: co13i, ae14p, eo30p, ce33l, dd14l.
Answer:
Note: If you write a range in reverse order (e.g. 9-0), it is an error.
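How the error surfaces depends on the regex engine. As one example, Python's re module refuses to compile such a pattern and raises re.error. The helper name compiles below is hypothetical, introduced only for this sketch.

```python
import re

def compiles(pattern):
    """Return True if the pattern compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Python reports a reversed range as a "bad character range".
        return False

print(compiles("[0-9]"))  # True
print(compiles("[9-0]"))  # False
```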
Ranges and single characters can be mixed within one class, for example: [a-zB-D934][A-Zab0-9]
\w & \W: \w is just a short form of the character class [A-Za-z0-9_]. \w is called the word character class.
\W is equivalent to [^\w]. \W matches everything other than word characters.
\d & \D: \d matches any digit character. It is equivalent to the character class [0-9].
\D is equivalent to [^\d]. \D matches everything other than digits.
\s & \S: \s matches whitespace characters. Tab (\t), newline (\n) and space ( ) are whitespace characters. These characters are called non-printable characters. Similarly, \S is equivalent to [^\s]. \S matches everything other than whitespace characters.
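The shorthand classes can be tried out with a short Python sketch. The sample string is invented for illustration. One assumption to note: in Python 3, \w on text strings also matches non-ASCII letters by default, so the re.ASCII flag is used here to make \w behave exactly like [A-Za-z0-9_].

```python
import re

# Hypothetical sample text: letters, underscore, digits, space, dash, tab, '!'.
text = "a_1 b-2\t!"

# re.ASCII restricts \w and \W to the ASCII definition of a word character.
word_chars = re.findall(r"\w", text, flags=re.ASCII)
explicit_class = re.findall(r"[A-Za-z0-9_]", text)
non_word = re.findall(r"\W", text, flags=re.ASCII)

digits = re.findall(r"\d", text)
whitespace = re.findall(r"\s", text)
non_whitespace = "".join(re.findall(r"\S", text))

print(word_chars)      # ['a', '_', '1', 'b', '2']
print(non_word)        # [' ', '-', '\t', '!']
print(digits)          # ['1', '2']
print(whitespace)      # [' ', '\t']
print(non_whitespace)  # a_1b-2!
```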
Dot (.): Dot matches any character except \n (line-break or new-line character) and, in engines such as JavaScript, \r (carriage-return character). Dot (.) is known as a wildcard.
Note: \r is part of the Windows-style line ending, which is written \r\n.
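The exact set of characters the dot skips varies between regex engines. The sketch below shows the behaviour of Python's re module in its default mode, where only \n is excluded and \r is still matched. The sample text is made up for illustration.

```python
import re

# Hypothetical sample text containing a newline between 'a' and 'c'.
text = "abc a\nc a-c"

# The dot stands for any single character except a newline.
hits = re.findall(r"a.c", text)
print(hits)  # ['abc', 'a-c']

# Python-specific detail: the dot matches \r but not \n by default.
dot_matches_cr = re.fullmatch(r".", "\r") is not None
dot_matches_lf = re.fullmatch(r".", "\n") is not None
print(dot_matches_cr, dot_matches_lf)  # True False
```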
Predict the output of the following regex:
RegEx: [01][01][0-1]\W\s\d
Text: Binary to decimal data: 001- 1, 010- 2, 011- 3, a01- 4, 100- 4.
Answer:
Write a regex to match 28th February of any year. Date is in dd-mm-yyyy format.
Answer: 28-02-\d\d\d\d
Write a regex to match dates that are not in March. Assume that the dates are valid and that no fixed format is given, i.e. a date can be in dd.mm.yyyy, dd\mm\yyyy or dd/mm/yyyy format.
Answer: \d\d\W[10][^3]\W\d\d\d\d
Note that the above regex will also match mixed formats such as dd-mm.yyyy or dd/mm\yyyy. This problem can be solved by using backreferences, another regex concept.
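As a rough sketch of how a backreference could enforce a consistent separator, the Python example below captures the first separator in a group and requires the same character again with \1. Restricting the separator to '.', '/' and '\' is an assumption based on the formats listed in the question, and the sample dates and the helper name is_non_march are invented for illustration.

```python
import re

# ([./\\]) captures the first separator; \1 demands the same one again.
pattern = re.compile(r"\d\d([./\\])[01][^3]\1\d\d\d\d")

def is_non_march(date):
    """Return True if the whole string is a non-March date with one separator style."""
    return pattern.fullmatch(date) is not None

print(is_non_march("12.04.2020"))    # True
print(is_non_march("01\\11\\1999"))  # True  (the string is 01\11\1999)
print(is_non_march("12/04\\2020"))   # False (mixed separators)
print(is_non_march("12/03/2020"))    # False (March)
```

Like the original answer, this still relies on the input dates being valid, since [^3] would accept a non-digit in the month position.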