Character Classes

What if you want to match both "soon" and "moon", or more generally a chosen set of words ending in "oon"?

Try the regex [sm]oon. Putting [sm] in front of oon matches both "soon" and "moon". Here [sm] is called a character class: a list of characters, any one of which is allowed to match at that position.

More formally, [abc] matches exactly one character, which is either a or b or c.
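As a quick check of the idea above, here is a minimal sketch using Python's built-in re module. The sample sentence is made up for illustration; "noon" is included to show that a character outside the class does not match.

```python
import re

# Hypothetical sample text, not taken from the lesson
text = "The moon will rise soon, around noon."

# [sm] matches exactly one character: either 's' or 'm'
matches = re.findall(r"[sm]oon", text)

# The same idea written as an explicit alternation
alternation = re.findall(r"(?:s|m)oon", text)

print(matches)      # ['moon', 'soon']
print(alternation)  # ['moon', 'soon']
```

"noon" is skipped because n is not listed in the class.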

Predict the output of the following:

  1. RegEx: [ABC][12]
    Text: A1 grade is the best, but I scored A2.

    Answer:

  2. RegEx: [0123456789][12345]:[abcdef][67890]:[1234589][abcdef]
    Text: Let's match 14:f6:3c mac address type of pattern. Other patterns are 51:a6:c5, 44:t6:3d, 72:c8:8e.

    Answer:

Negation

Now, if we put ^ as the first character inside the brackets, the class is negated: it matches any single character other than the ones listed in the brackets.
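A small sketch of negation in Python's re module, using made-up sample strings. It also illustrates two details that are easy to miss: a negated class still has to consume one character, and a caret that is not the first character in the brackets is treated as a literal ^ (this is how Python behaves; other engines are generally the same, but that is an assumption worth checking for your flavor).

```python
import re

# Hypothetical sample text
text = "cat bat rat hat"

# [^bc] matches any one character except 'b' or 'c'
negated = re.findall(r"[^bc]at", text)

# A negated class must still match a character, so bare "at" fails
needs_a_char = re.findall(r"[^bc]at", "at")

# A caret that is not first in the brackets is a literal '^'
literal_caret = re.findall(r"[b^]at", "^at bat")

print(negated)        # ['rat', 'hat']
print(needs_a_char)   # []
print(literal_caret)  # ['^at', 'bat']
```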

Predict the output for the following:

RegEx: [^13579]A[^abc]z3[590*-]
Text: 1Abz33 will match or 2Atz30 and 8Adz3*.

Answer:

Writing out every character (like [0123456789] or [abcd]) is slow and error-prone. What is the shortcut?

Ranges

Ranges make our work easier. Consecutive characters can be included in a character class using a hyphen (-) between the first and last character. For example, the digits from 0 to 9 can be written simply as 0-9. Similarly, abcdef can be replaced by a-f.

Examples: 456 --> 4-6, abc3456 --> a-c3-6, c367980 --> c36-90 (here only 6-9 is a range; c, 3 and 0 are listed individually).
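The rewrites above can be checked mechanically. The sketch below, in Python, uses a hypothetical helper named members that reports which lowercase letters and digits a given class accepts, so the long and short forms can be compared side by side.

```python
import re
import string

# Characters to test each class against: a-z followed by 0-9
candidates = string.ascii_lowercase + string.digits

def members(char_class):
    """Return the candidate characters that the given class matches."""
    return "".join(ch for ch in candidates if re.fullmatch(char_class, ch))

print(members("[456]"), members("[4-6]"))          # 456 456
print(members("[abc3456]"), members("[a-c3-6]"))   # abc3456 abc3456
print(members("[c367980]"), members("[c36-90]"))   # c036789 c036789
```

The output lists members in the order of candidates, which is why the last line shows c036789 rather than c367980; the set of characters is the same.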

Predict the output of the following regex:

  1. RegEx: [a-d][^l-o][12][^5-7][l-p]
    Text: co13i, ae14p, eo30p, ce33l, dd14l.

    Answer:

Note: Writing a range in reverse order (e.g. 9-0) is an error.
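How the error surfaces depends on the regex engine. As one illustration, Python's re module rejects a reversed range when the pattern is compiled and raises re.error. The helper name compiles below is hypothetical.

```python
import re

def compiles(pattern):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

forward_ok = compiles(r"[0-9]")   # a valid range
reversed_ok = compiles(r"[9-0]")  # a reversed range

print(forward_ok)   # True
print(reversed_ok)  # False
```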

  2. RegEx: [a-zB-D934][A-Zab0-9]
    Text: t9, da, A9, zZ, 99, 3D, aCvcC9.

    Answer:

Predefined Character Classes

  1. \w & \W: \w is shorthand for the character class [A-Za-z0-9_]. \w is called the word character class.

    \W is equivalent to [^\w]. \W matches everything other than word characters.

  2. \d & \D: \d matches any digit character. It is equivalent to character class [0-9].

    \D is equivalent to [^\d]. \D matches everything other than digits.

  3. \s & \S: \s matches whitespace characters. Tab (\t), newline (\n) and space ( ) are whitespace characters. Because they leave no visible mark, they are often described as non-printable characters.

    Similarly, \S is equivalent to [^\s]. \S matches everything other than whitespace characters.

  4. dot (.): Dot matches any character except \n (line-break or new-line character) and, in many regex flavors, \r (carriage-return character). Dot (.) is known as a wildcard.

Note: \r is the carriage-return character. Windows-style line endings use the pair \r\n.
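The shorthand classes above can be tried out with a short Python sketch. The sample string is made up. One assumption to flag: in Python 3, \w, \d and \s also match non-ASCII letters, digits and spaces by default, so the re.ASCII flag is passed here to get the ASCII definitions given in this section. Only the newline behaviour of the dot is shown, since the handling of \r differs between engines.

```python
import re

# Hypothetical sample text containing a tab and a newline
sample = "id_7: ok\t9\n"

# re.ASCII restricts \w, \d and \s to their ASCII definitions
word_chars = re.findall(r"\w", sample, flags=re.ASCII)
non_word = re.findall(r"\W", sample, flags=re.ASCII)
digits = re.findall(r"\d", sample, flags=re.ASCII)
spaces = re.findall(r"\s", sample, flags=re.ASCII)

# The dot skips the newline between 'a' and 'b'
dot_hits = re.findall(r".", "a\nb")

print(word_chars)  # ['i', 'd', '_', '7', 'o', 'k', '9']
print(non_word)    # [':', ' ', '\t', '\n']
print(digits)      # ['7', '9']
print(spaces)      # [' ', '\t', '\n']
print(dot_hits)    # ['a', 'b']
```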

Problems

  1. Predict the output of the following regex.
    RegEx: [01][01][0-1]\W\s\d
    Text: Binary to decimal data: 001- 1, 010- 2, 011- 3, a01- 4, 100- 4.

    Answer:

  2. Write a regex to match 28th February of any year. Date is in dd-mm-yyyy format.

    Answer: 28-02-\d\d\d\d

  3. Write a regex to match dates that are not in March. Assume that the dates are valid but the separator is not fixed, i.e. a date can be in dd.mm.yyyy, dd\mm\yyyy or dd/mm/yyyy format.

    Answer: \d\d\W[10][^3]\W\d\d\d\d

    Note that the above regex will also match mixed-separator forms such as dd-mm.yyyy or dd/mm\yyyy. This problem can be solved by using backreferences, which are another regex concept.
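To make the mixed-separator issue concrete, here is a hedged sketch in Python. The variable names and sample dates are made up, and the second pattern is only one possible way to apply a backreference: it captures the first separator in a group and requires the same character again with \1.

```python
import re

# The answer from problem 3: either separator may be any non-word character
loose = re.compile(r"\d\d\W[10][^3]\W\d\d\d\d")

# Sketch of a backreference fix: (\W) captures the first separator,
# and \1 requires the second separator to be the same character
strict = re.compile(r"\d\d(\W)[10][^3]\1\d\d\d\d")

# Hypothetical sample dates; the last one is 05\11\1999
dates = ["12.04.2021", "12/03/2021", "12-04.2021", "05\\11\\1999"]

loose_hits = [d for d in dates if loose.fullmatch(d)]
strict_hits = [d for d in dates if strict.fullmatch(d)]

print(loose_hits)   # ['12.04.2021', '12-04.2021', '05\\11\\1999']
print(strict_hits)  # ['12.04.2021', '05\\11\\1999']
```

Both patterns reject the March date, but only the second one rejects 12-04.2021, where the two separators differ.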