Published on

Backreferences - Building a Regex Engine Part 9

Backreferences are used to match the same text that has already matched in a previous capturing group. For example if you want to match a word between single or double quotes you could do something like (['"])[a-zA-Z]+(['"]) but that would match mixed quotes, like "foo'.

To avoid this you can do (['"])[a-zA-Z]+\1. The \1 means match the same thing that matched in the group one. So in the previous example, if the starting quote is a " it'll only match if the ending quote is also a ".

Corner cases: Empty and unmatched groups

The introduction above is everything there is to know for day-to-day use of backreferences. But as always, we are going to implement them in a engine, so that's not enough. There are two corner cases we need to take into account: Empty groups and unmatched groups.

There isn't a standard way to handle them, and as we will see, PCRE (Perl Compatible Regular Expressions) and Javascript handle them in different ways. Let's see a simple (and stupid) example: ()\1a. The first group doesn't match anything, so what happens if we execute the regex with the input a? Well, the group technically matches an empty string (ε/""), so \1 also matches an empty string and it succeeds.

But what if the regex was (b)?\1a?. Here the first group fails to match, but since it's optional it continues trying to match the rest and arrives at the \1, but what value does it have? It depends on the engine.

In PCRE unmatched backreferences always fails, so the regex fails too. On the contrary, in Javascript an unmatched group is the same as an empty group, so \1 matches an empty string and succeeds. I find it ironic that the language that differenciates between null and undefined doesn't differenciate between empty and unmatched groups 😝.

Let's expand this cases even more:

  • What if the backreference is before the group, or even inside it? For example \1(a) and (a\1). As the name BACKreference implies, they only reference past matched groups. Both are valid regex but, as before, in PCRE always fails and in Javascript they are equivalent to the regex (a)
The engine when it finds a backreference to a capturing group that is after that backreference.
  • And what if you put a backreference in an alternative? Like: (\1b|a)+.This interesting because it looks similar to a recursion.
    • PCRE: Every time the capture group matches the value of group 1 is overwritten, which changes the definition of \1 for the next match.

      It might seem like a recursion, but it isn't. Let's see it with a classic example, balanced parenthesis ^(\(\1\)|\(\))+$. This regex looks a bit confusing with all the escaped parenthesis, but the idea is "Match () or the value of group 1 inside parenthesis". Let's see what PCRE does:

      Input Result
      ()
      (()) Doesn't work ❌. The first time the group tries to match, \1 is empty, so it tries with () and fails
      ()(()) ✔ The capturing group first matches (), then tries to match again (due to the + quantifier) with \1=()
      ()(())((())) ✔ Same as before extended a third time
      • In a way it's "sequentially recursive"... does that even make sense? In order to match nn nested parenthesis first it needs to match 1,2,3...n11, 2, 3 ... n-1 nested parenthesis.
      • PCRE has recursion with the syntax (?R). I wonder how difficult it would be to implement in this regex engine 🤔.
    • In Javascript the \1 matches an empty string, so it's equivalent to (b|a). What a downer 😞.

Implementation

Warning: This post is experimental and not as curated as the previous posts. Since I implemented it quite fast I might not have tested it enough or found some corner cases

We are going to implement backreferences in a similar way we implemented anchors. With a simple transition that will match the whole backreferenced text. As always I'll ignore the parsing and I'll assume we have an AST node called Backreference that just contains the number of the group referenced.

To make backreferences with transitions we need to give more power to matchers. Till now, a matcher received the current string and the position of the machine. Now it also needs to access the machine memory to know what's the current value of the group it's referencing.

We also need to be able to create matchers that consume a variable number of characters. If a group matches potato, the backreference matcher should consume 6 characters. So, from now on all matchers will return an object saying how many characters they consume.

Before After just 2 months! Buy this product. ee out of π\pi developers recommend it
class Matcher {

    matches(string, pos) {
        return false;
    }

    isEpsilon() {
        return null;
    }

    get label() {
        return "undefined-matcher"
    }
}
class Matcher {
    /**
     * Returns an object thats either {matches: false} or {matches:true, consumes:int}
     */
    matches(string, pos, memory) {
        return {matches: false};
    }

    get label() {
        return "undefined-matcher"
    }
}

This way we can remove the isEpsilon method. This is nice because in the post about anchors we defined four matchers (start/end of line/string) as epsilon transitions that weren't really epsilon, they just didn't consume input (to be an epsilon transition it has to allow the machine to go through the transition freely, but those matchers have conditions). This will make the notation clearer.

At this point I could convert all matchers to functions, but I'll leave them as they are only to annoy the fanatics of functional programming. I like to watch the world burn 🔥🔥🔥.

You can see the redefinition of all matches here https://github.com/DanielBV/blog-examples/blob/main/regex-engine/backreferences/src/nfa.js#L30. After all these changes the NFA's compute method changes slightly:

class EngineNFA {

    compute(string, i) {
        //...
        for (let c = currentState.transitions.length-1; c >= 0; c--) {
            const [matcher, toState] = currentState.transitions[c];
            const match = matcher.matches(string, i, memory); // <--- We now store the object in a constant
            if (match.matches) { 
                const copyMemory = JSON.parse(JSON.stringify(memory));
                // <----> Reminder - All this is to avoid epsilon loops (loops where the engine get's stuck because 
                // it keeps looping without consuming input)
                const epsilonLoopPossible = matcher.consumes === 0; // <-- Instead of having an isEpsilon() method, 
                                                                    // we just consider dangerous the transitions that 
                                                                    // don't consume input
                if (epsilonLoopPossible) {
                    // Don't follow the transition. We already have been in that state
                    if (memory.EPSILON_VISITED.includes(toState.name))
                        continue;
                    copyMemory.EPSILON_VISITED.push(currentState.name);
                } else 
                    copyMemory.EPSILON_VISITED = [];
                const nextI = i + match.consumes; // <--- This is also new. Previously we had 
                                                  // const nextI = matcher.isEpsilon() ? i : i+1;
                stack.push({i: nextI, currentState: toState, memory: copyMemory});
            }
        }
    }

I know you just skipped the code to continue reading, but just take into account that the engine still needs to avoid epsilon loops.

Now we have all the ingredients to make a backreference matcher:

class BackreferenceMatcher extends Matcher {
    constructor(groupNumber) {
        this.g = groupNumber;
    }

    matches(string, i, memory) {
        const group = memory.GROUP_MATCHES[this.g];
        if (!group)
            return {matches: false};
        const [_, start, end] = group;
        if (!end) // The backreference is inside the capturing group, which still hasn't finished.
            return {matches: false};
        if (string.substring(start,end) === string.substring(i, i + end-start))
            return {matches: true, consumes: end - start};
        return {matches: false};
    }

    get label() {
        return `\\${this.g}`
    }
}

Note that usually the range would be end-start + 1, but since the intervals are right-open, the plus one is already included implicitly.

As you can see, we are following the PCRE approach (because it's the most fun 🦄). If the group isn't defined or hasn't matched yet, the matcher fails. But there's a problem with the code above in the corner case (\1b|a)+. Every time the engine enters a capturing group it overrides the memory value, and it's incomplete till it exits the group. So in this regex, the \1 always references an incomplete group. We need to check the PREVIOUS match, not the current one.

So the solution is to store the current matches and the previous matches separately. I'm not sure if there is a more efficient way to do this, as I said in the warning at the beginning of the section implementating this was unplanned, so this implementation was done really quickly and without much thought.

class NFARegex {}
    compute(string, i) {
        let stack = [];
        stack.push({i, currentState: this.states[this.initialState], memory: {GROUP_MATCHES:{}, EPSILON_VISITED: [], 
            PREVIOUS_GROUP_MATCHES: [] // <--- This is new
        }});
    }

    computeGroups(currentState, memory, i) {
        for (const group of currentState.startsGroups) {
            memory.GROUP_MATCHES[group] =  [group, i, null];
        }
        for (const group of currentState.endsGroups) {
            memory.GROUP_MATCHES[group][2] = i;
            // When a group ends, we store it in PREVIOUS_GROUP_MATCHES. This will only contain the 
            // previous match, not multiple matches.
            memory.PREVIOUS_GROUP_MATCHES[group] = memory.GROUP_MATCHES[group]; // <----- This is new
        }
    }


Once a group ends memory.PREVIOUS_GROUP_MATCHES[group] is the same as memory.GROUP_MATCHES[group], but if the group is overwritten, the value will differ till the new group ends. Now we just need to use PREVIOUS_GROUP_MATCHES in BackreferenceMatcher:

class BackreferenceMatcher extends Matcher {

    matches(string, i, memory) {
        const group = memory.PREVIOUS_GROUP_MATCHES[this.g];
        //...
    }
}

And finally, we need to create the matcher in the NFA builder. We'll just use the oneStepNFA method again, which just creates two states with a transition in between:

class NFABuilder {

    _singleRegexToNFA
        //...
        else if (c.child instanceof CaretAnchor)
            //...
        else if (c.child instanceof Backreference) 
                baseBuilder = () => this._oneStepNFA(new BackreferenceMatcher(c.child.group))
    }

And that's all 😎:

new NFARegex("(a)\\1").find("aa"); // {groups: {0: "aa", 1: "a"}}
new NFARegex("^(\\1b|a)+$").find("aabb"); // null
new NFARegex("^(\\1b|a)+$").find("aababb"); // {groups: {0: "aababb", 1: "abb"}}

This is the last feature of this regex engine (unless I feel a random urge to implement lookaheads or recursion). In the next and last post of the series I'll theorize about the problems that would happen if we attempted to implement the same features in a DFA engine (remember, the current engine is a NFA engine). And after that... I'll need a vacation from regular expressions.

Season 1 - Building a Regex Engine (Completed)