Published on

Character classes and escape characters - Building a Regex Engine Part 5

Character classes are used to match characters inside a set of symbols in a simpler and more expresive way. For example if you want to match any number you can do \d or [0-9] instead of (0|1|2|3|4|5|6|7|8|9).

There are several types of character classes. Since every page I found uses a different nomenclature I'll contribute to the confusion by adding my own random names:

  • Shorthand character classes, like \d, which matches a digit. As the name implies this is a shorthand for the other types, for example \d is [0-9] and \w is equivalent to [A-Za-z0-9_].
    • All shorthand characters have a negated version. For example \D matches anything EXCEPT digits.
  • Complex character classes. They describe a set by combining ranges and individual characters:
    • A range [a-z] Matches any character between "a" and "z" (in lowercase).
    • You can concatenate several ranges to match characters that are in at least one range. For example[a-zA-Z] matches any character between "a" and "z" regardless of the case.
    • You can also match a single character. [w0-9a] matches a digit and the letters a and w.
    • You can negate the range adding a caret at the beginning. [^abc] matches any character EXCEPT a, b or c.
    • You can use shorthand character classes inside a range and they work as an alternative. For example [\d] is equivalent to \d, and [\d\s] matches any character that is either a digit or a whitespace.
  • Dot: Matches any character that ISN'T a line terminator
    • Except when it's inside a range ([.]). In that case it matches a literal dot character.
Optional - A quirky way to match ANY character, including line terminators

If you want to match all chacters, including new lines and other line terminators you could add all of them in an alternation, like (.|\n|<insert-others>). But it's easier to use two opposite shorthand classes, like \d and \D.

A range like [\d\D] means "Match any character that is a digit, or that isn't".

Honor Societies - xkcd 703

Implementing Character classes

First we need to change the grammar to accept the syntax. As always I'll ignore this part 😇. Then we'll represent them in the AST and finally build the to NFA by implementing a new matcher.

Representing ranges and shorthands in AST

A complex class is just a set of individual characters and ranges. We'll also implement a matches method to determine whether a character is included in the class and add a name, which will determine the label of the NFA transition.

class ComplexClass {
    constructor(individialChars, ranges, name, negated) {
        this.chars = individialChars;
        this.ranges = ranges;
        this.name = name;
        this.negated = negated;
    }

    matches(c) {
        const base = this.chars.includes(c) || this.ranges.some(x => x.matches(c));
        return this.negated ? !base : base;
    }
}

Each range is represented with a simple class:

class ComplexClassRange {
    constructor(start, end) {
        this.start = start;
        this.end = end;
    }

    matches(c) {
        return c >= this.start && this.end >= c;
    }
}

The magic is that all shorthands can be written as complex classes, so we can just translate them:

const SHORTHAND_CHARACTER_CLASSES = {
    "\d": new ComplexClass([], [new ComplexClassRange("0", "9")], "\d", false),
    "\D": new ComplexClass([], [new ComplexClassRange("0", "9")], "\D", true),
    "\w": new ComplexClass(["_"], [new ComplexClassRange("A", "Z"), new ComplexClassRange("a", "z"), new ComplexClassRange("0", "9")], "\w", false),
    "\W": new ComplexClass(["_"], [new ComplexClassRange("A", "Z"), new ComplexClassRange("a", "z"), new ComplexClassRange("0", "9")], "\W", true),
    "\s": new ComplexClass( [" ", "\f", "\n", "\r", "\t", "\v", "\u00a0", "\u1680", "\u2028","\u2029","\u202f", "\u205f", "\u3000", "\ufeff"],
       [new ComplexClassRange("\u2000", "\u200a")], "\s", false),
    "\S": new ComplexClass( [" ", "\f", "\n", "\r", "\t", "\v", "\u00a0", "\u1680", "\u2028","\u2029","\u202f", "\u205f", "\u3000", "\ufeff"],
       [new ComplexClassRange("\u2000", "\u200a")], "\S", true),
};

And since both ComplexClass and ComplexClassRange have the matches method we can have nested ComplexClasses, which allows having shorthands inside complex classes. For example [a-z_\d] would be represented as:

new ComplexClass(["_"], [
    new ComplexClassRange("a", "z"),
    SHORTHAND_CHARACTER_CLASSES["\\d"]
], "[a-z_\d]", false)

Implementation in the NFA

A simple way to implement character classes on the engine would be to translate the class to an alternation like \d->(0|1|2|3|4|5|6|7|8|9). But this would create a lot of unnecessary nodes, specially in big ranges or negated character classes. A smarter solution would be playing with (matches 🔥) matchers.

A reminder on matchers

A matcher is a function that, given a character, returns whether the character is inside a set of symbols. Currently we have two types of matchers: CharacterMatcher, and EpsilonMatcher, and they both implement this interface:

class Matcher {
    matches(char) {
        return false;
    }

    // We need to know when it's an epsilon transition to avoid consuming input
    isEpsilon() {
        return null;
    }

    get label() {
        return "undefined-matcher"
    }
}

We'll modify this in the next post to implement anchors, but for now this will do.

Adding a CustomMatcher

You might have realized that since ComplexClass already has a matches method it could inherit from Matcher. The main reason I didn't do that is because they are at different layers of abstraction. First we parse the text to create a Parse Tree (PT), translate the PT to an Abstract Syntax Tree (AST) and finally create a NFA. Matchers are part of the NFA, so the AST shouldn't depend on the NFA.

Therefore we'll add a simple CustomMather class that will act as a wrapper:

class CustomMatcher extends Matcher {
    constructor(matcher, label) {
        super();
        this._matcher = matcher; 
        this._label = label;
    }

    matches(c) {
        // The engine tries to match with c = undefined when it has run out of input. We have to ensure that in that 
        // case it doesn't match, or it could get stuck in an infinite loop
        if (c === undefined) return false;
        return this._matcher(c);
    }

    isEpsilon() {
        return false;
    }

    get label() {
        return this._label;
    }
}

This way, ComplexClass doesn't need to know that c is undefined when the NFA runs out of input. This is weird but necessary so the machine can keep going through epsilon transitions even when it's out of input.

In the NFA, character classes are represented with two states and a transition between them, just as other normal characters. The difference is now we'll use the CustomMatcher. And luckily, we already have a convenient method _oneStepNFA to create this pattern:

_singleRegexToNFA(regexAST) {
    //...
    for (const c of regexAST.subpatterns) {
        // ...
        } else if (c.child instanceof ComplexClass) {
            baseBuilder = () => this._oneStepNFA(new CustomMatcher((char) => c.child.matches(char), c.child.name));
        }
    }
    // ....
}

And it just works 🚀

const regex = new NFARegex("[0-8]+");
console.log(regex.compute("0123456789")); // {groups: 0: {"012345678"}}
const regex2 = new NFARegex("[^0-8]+");
console.log(regex2.compute("9abcde102")); // {groups: 0: {"9abcde"}}

Note that in both cases the group 0 only includes the text until there's a character not included in the class. In the first case it would be the 9 and in the second the 1.

Dots

The dot is a really useful (and probably overused) character class. It matches any character except line breaks.

The definition of what a line break is varies with the languages. As regular-expressions.info states: \n is always a line break, Javascript also considers \r, \u2028 and \u2029 and Java even adds \u0085. For simplicity I'll only add \n and \r:

To represent them you only have to add it to the grammar and create a ComplexClass:

visitDotPattern() {
    return new ComplexClass(["\n", "\r"], [], true, ".");
}

The true is to negate the class. Otherwise it would match only those two characters. And... magic:

const regex = new NFARegex(".+");
console.log(regex.compute(""abc\nde")); // {groups: 0: {"abc"}}

Escape characters

I decided to sneak escape characters in this post since they are also related to matchers and they are simple to explain.

Regex have the classic escape characters (\n, \t, \r, \b, \\) and also allow escaping reserved characters. So if you want to match literally a + you can write \+, or for an opening bracket \[.

To implement this, first add a rule to the grammar for escape characters. In antlr4 it would be something like:

ESCAPED_RESERVED_CHAR: BACKSLASH (BACKSLASH | OPEN_PAR | CLOSE_PAR | ASTERISK
    | PLUS | DOT | OPEN_BRACKET | CLOSE_BRACKET 
    | GREATER_THAN | LOWER_THAN | COLON | CARET | DOLLAR 
    | 'n' | 't' | 'b' | 'r');

subexpr: 
    regexGroup #groupPattern
    | atomicChar #atomicPattern
    | ESCAPED_RESERVED_CHAR #escapedReservedAtomicPattern
    ...

And then in the AST create an AtomicPattern (which is the same class used to match regular characters). The actual pattern will depend on type of the escape character:

  • For regex specific escape characters (\., \+, ...) remove the \.
    • This also applies to \\, even though it's a common escape character.
  • For generic characters (\n, \t, \r, \b) we need to map the text to the actual character. \\n -> \n So we'll have something like:
const ESPECIAL_ESCAPE_CHARACTERS = {"\\n": "\n", "\\b": "\b", "\\t": "\t"};
//...
visitEscapedReservedAtomicPattern(ctx) {
    const char = ctx.getText();
    const pattern = char in ESPECIAL_ESCAPE_CHARACTERS ? ESPECIAL_ESCAPE_CHARACTERS[char] : char.substring(1); 
    return new AtomicPattern(pattern);
}

Let's test it:

const regex = new NFARegex("\\n");
console.log(regex.compute("\n")); // {groups: 0: {"\n"}}

const regex2 = new NFARegex("\+");
console.log(regex.compute("\+")); // {groups: 0: {"+"}}

Even more complexity

And that's all there is about escape characters and character classes, right? The post is finally over?

Even though this topic is simple to learn there are always more details to take into account. For example some regex engines implement stuff like:

  • Unicode characters: \u{hhhhh}
  • Matching based on unicode properties \p{UnicodeProperty}
  • \xhh to match characters with a two digit hexadecimal code

And other things I don't know or don't remember. I could write about these things, but I think they can be solved by just adding more matchers. So yes. The post is over. See you in the next one.

Season 1 - Building a Regex Engine (Completed)

YES. FINALLY. This was supposed to be a really simple post. How did it take so long to write?

Wait, you're still here? Well... Forget about this last part. The next post about anchors should be shorter, which means it'll take me four times as long to write and will be twice as big as this post. Because 🥔. Always 🥔