Published on

Anchors and multiline mode - Building a Regex Engine Part 6

Anchors are assertions about what's the engine position at a certain moment. The most used are the start of string (^) and end of string ($) anchors, which, as their names imply, assert the engine is at the first and last position of the string.

For example, for the regex [ab] the input a-b finds two different matches (a and b). ^[ab] only matches a and [ab]$ only matches b.

Regex modes/flags

Modes/flags are modifiers that change the engine's behaviour. For example in Javascript the flag i makes the engine case insensitive, so /ABC/i matches abc.

But... what do modes have to do with anchors? There is a multiline modifier m that changes how the start and end anchors work. In this mode ^ also matches after every line break and $ matches before every line break. We'll implement both behaviours.

What's the point of anchors

Anchors are used to ensure you are matching the right thing. For example let's say you want to match a number. If you use \d+ and the user writes the input a5, it'll find the number 5 and ignore the a, but that might not be what you want. This happens to me a lot in Javascript with parseInt vs Number. The first one parses 5a as 5 while Number returns NaN. Most of the times what I really want is to just parse correct inputs and otherwise return null.

Optional - A more complicated example

Let's see a more complicated example where the users could trick you if you don't use anchors. Imagine your app uses an external API but you want to customize the error messages shown to your users. The API is sh*t and doesn't give you a context object, so you have to parse the error message, extract the info you want and show another message. You might have something like:

const re1 = /The variable '(.+)' doesn't exist/;
const match1 = re1.exec(str);
if (match1) {
    throw new VariableUndefinedError(match1[1]);
}
const re2 = /The variable '(.+)' doesn't have the right type/;
const match2 = re2.exec(str);
if (match2)
    throw new InvalidTypeError(match2[1]);

So what's the problem? (Besides being extremely prone to error if the API changes the messages and the 🍝 code ). Well, assumming that the name of the variable is an input from the user, he might decide to name his variable The variable 'potato' doesn't exist (let's also assume that in this case variable names allow whitespaces). And if that variable exists but has a wrong type, it should return a InvalidTypeError. The problem is the first regex matches because it's looking for matches anywhere in the text, so we'll get a VariableUndefinedError("potato").

If instead both regex had start and end of string anchors, it would ensure that no matter what the user writes only the right regex matches. Of course this might no be the best example, since it requires the user to know the regex and he doesn't even get anything from naming the variable this way. But in other cases it might be more dangerous.

The start of the string anchor (^) also has a nice secondary effect. It makes your regex faster. If you just want to match the strings that start in a certain way, making it explicit allows the engine to avoid useless searching when the string doesn't start that way. As a rule of thumb, the faster the regex knows the input is incorrect, the less unnecessary computation it'll do.

Implementation

As always, the first step is to add ^ and $ to the grammar and build the AST nodes. I'll assume we have two nodes CaretAnchor and DollarAnchor.

To implement anchors in the NFA we'll add two matchers. The EndOfStringMatcher is easy to implement because we can clearly differenciate that we are at the end of the string because the engine has consumed all the input. And when this happens, it keeps trying to transverse the NFA matching the value undefined:

class EndOfStringMatcher extends Matcher {

    isEpsilon() {
        // This is a pseudo-epsilon transition, since it doesn't consume any input 
        // (even though the NFA can't transverse it freely)
        return true; 
    }

    matches(char) {
        return char === undefined;
    }

    get label() {
        return "$";
    }
}

But the StartOfStringMatcher is more difficult, since the matcher has no way of knowing that a character is the first one. To have that information we'll change the interface of the matchers. Instead of receiving a character the method will receive the full string and the engine's pointer.

As a disclaimer, another alternative would be passing the character and its position, but we'll need the string and pointer in the multiline mode, and I don't want to change the interface twice in the same post.

Warning: Since we are changing the matchers' interface we have to update every single matcher we implemented in previous posts. It's a simple change so I'll ignore it. You can see the final code here

By changing the call to the matchers in the NFA's compute method to if (matcher.matches(string, i)) { ... } we can do this:

class StartOfStringMatcher extends Matcher {

    isEpsilon() {
        // This is a pseudo-epsilon transition, since it doesn't consume any input 
        // (even though the NFA can't transverse it freely)
        return true; 
    }
    
    matches(string, i) {
        return i == 0;
    }

    get label() {
        return "^";
    }
}

Note that, unlike the matchers we added in the post about character classes, these ones are "epsilon" transitions, because they don't consume any input. If you want to get really picky, they aren't really epsilon transitions since the NFA can't transverse them freely. Still, in this engine isEpsilon() just determines whether it should increase the pointer after it transverses the transition or not, so it does the trick. Maybe I shouldn't have named it isEpsilon, but I don't think it's worth to change it now (at least for me, change it if you want).

Now let's add this to the NFA builder. In the NFA an anchor is just two nodes with a single transition. Since this is a common pattern, we already defined a method in a previous post (_oneStepNFA) to implement it:

_singleRegexToNFA(regexAST) {
    //...
    for (const c of regexAST.subpatterns) {
        //..
        } else if (c.child instanceof DollarAnchor)
            baseBuilder = () => this._oneStepNFA(new EndOfStringMatcher());
        else if (c.child instanceof CaretAnchor)
            baseBuilder = () => this._oneStepNFA(new StartOfStringMatcher());
Optional - An alternative approach: Handling anchors at a wrapper level

The implementation above adds anchors at a "NFA" level (the NFA is between quotes because we have broken the formal definition of NFA multiple times in these posts). Another approach could be to handle them in the regex wrapper class.

As a reminder, when the user creates a regex they don't use the NFA class. Instead, they use the regex class, which is really simple:

class NFARegex {
    constructor (regex) {
        this.source = regex;
        const ast = parseRegex(regex);
        const builder = new NFABuilder();
        this.nfa = builder.regexToNFA(ast);
    }

    compute(string) {
        const memory =  this.nfa.compute(string);
        if (!memory) return null;
        const result = {groups: {}};
        for (const [group, start, end] of Object.values(memory.GROUP_MATCHES)) {
            result.groups[group] = string.substring(start, end);
        }
        return result;
    }
}

So for regex like ^foo$, we could detect the anchors in this wrapper, save it in some attributes (this._startAnchor, this._endAnchor) and create the nfa for the regex foo (removing the anchors). Then we can check in the compute method if the anchors are true or not.

In fact, right now the compute method only executes the NFA starting at the first position of the string, which means all regex behave as if they had a start of string anchor. We'll fix that in the next post. For the end of string anchor we could add something like:

    compute(string) {
        const memory =  this.nfa.compute(string);
        if (!memory) return null;
        const result = {groups: {}};
        for (const [group, start, end] of Object.values(memory.GROUP_MATCHES)) {
            result.groups[group] = string.substring(start, end);
            if (this._endAnchor && group === 0 && end !== string.length)
                return null; // <--- Fail if there's a '$' anchor but the group 0 
                             // (which always contains the whole match) is not the same length as the string
        }
        return result;
    }

(This can be implemented in a cleaner way, but this whole subsection is just a random thought, so it wasn't worth the effort.)

Of course this approach wouldn't allow anchors in the middle of the regex, like a^$b. Though those anchos don't make sense unless you are in multiline mode. So... the NFA approach is better and I just wasted your time. You are welcome 😝.

A weird test

We can test the anchors like this:

const r = new NFARegex("^ab$")
r.compute("ab"); // {groups: {0: "ab"}}
r.compute("abb"); // null

But testing the start of string anchor right now is a bit useless. I already explained why in a previous post:

Still, it's missing a lot of modern features, and all the regex have an implicit start of the string anchor (^) at the beginning. For example, for the regex a our engine doesn't find a match in ba.

-- Creating a NFA from a regex - Building a Regex Engine Part 3

So whether you write the ^ or not the result will be the same. We'll solve this problem in the next post by adding a method that tries to execute the NFA with multiple initial positions.

Multiline mode

In multiline mode ^ also matches after every line break and $ matches before every line break. To add both we'll make two other matchers, and choose what matcher to use depending on the mode of the regex.

But first, let's determine the interface of the regex. There are multiple ways to indicate the regex flags, for example in Javascript you can do new RegExp('ab+c', 'm') and in Java you can use an enumeration of flags. In this post I want to make it as simple as possible, so I'll have a second argument mode which is just an object with the active modes, like this: new NFARegex("^ab$", {multiline: true}) and we'll pass this object to the NfABuilder:

class NFARegex {
    constructor (regex, modes) {
        this.source = regex;
        const ast = parseRegex(regex);
        const builder = new NFABuilder(modes);
        this.nfa = builder.regexToNFA(ast);
    }

Defining the matchers is simple. In multiline mode ^ still matches the first position and $ still matches the last one, so this implementation will just add an extra OR condition.

In the StartOfLineMatcher we need to look if the previous character was a line break, since the regex engine has already consumed it. In EndOfLineMatcher we just look at the current one.

class StartOfLineMatcher extends Matcher {

    isEpsilon() {
        return true; 
    }
    
    matches(string, i) {
        return i == 0 || string[i-1] === "\n";
    }

    get label() {
        return "^";
    }
}


class EndOfLineMatcher extends Matcher {

    isEpsilon() {
        return true; 
    }

    matches(string, pos) {
        return string[pos] === undefined || string[pos] === "\n";
    }

    get label() {
        return "$";
    }
}

And finally, add them to the NFA Builder:

class NFABuilder {
    constructor(modes) {
        this.stateNumber = 0;
        this.groupNumber = 0;
        this.modes = modes;
    }
    
    _singleRegexToNFA(regexAST) {
        //...
        for (const c of regexAST.subpatterns) {
            //..
            } else if (c.child instanceof DollarAnchor)
                baseBuilder = () => this._oneStepNFA(
                    this.modes.multiline ? new EndOfLineMatcher() : new EndOfStringMatcher());
            else if (c.child instanceof CaretAnchor)
                baseBuilder = () => this._oneStepNFA(
                    this.modes.multiline ? new StartOfLineMatcher() : new StartOfStringMatcher());
    }

As before, we can't fully test these until we implement the search for matches (which will happen in the next post 😎). But we can test some things, like:

const r = new NFARegex("^a$\n^b$")
r.compute("a\nb"); // {groups: {0: "a\nb"}}
r.compute("ab"); // null

Even MORE anchors? 😭

While I was doing some research on anchors I discovered there are a LOT more anchors. TOO MANY anchors. I don't even know who uses those anchors or why would they use them. If you are curious the list is here https://www.regular-expressions.info/refanchors.html. And those don't even include word boundaries https://www.regular-expressions.info/refwordboundaries.html 😱.Let's see an example:

The anchor \G matches at the position where the previous match ended. During the first match attempt, \G matches at the start of the string in the way \A does.

Applying \G\w to the string test string matches t. Applying it again matches e. The 3rd attempt yields s and the 4th attempt matches the second t in the string. The fifth attempt fails.

-- (Source: https://www.regular-expressions.info/continue.html)

I need that fridge 🤩

I guess that anchor could be implemented by giving the matcher access to the engine's memory. And about word boundaries... as long as matchers can access the whole string and the current position, it should be easy to implement. But I'm not too interested on trying it. I'll just throw a smoke bomb and end this post.

Season 1 - Building a Regex Engine (Completed)