Published on

Finding multiple matches - Building a Regex Engine Part 7

In the previous posts we have developed a formal regex engine and built some modern features on top. But it still two major problems:

  1. The engine can only find one match.
  2. All regex have an implicit start of string anchor. For example the regex [ab] can't find any match in the string ca even though any other engine would find the a.

The NFA always computes a single match starting at a specific position of the string. Until now, this position was always the first one. We can solve both problems at the same time by executing the NFA multiple times with different initial positions.

Implementation

The first thing we need to do is define the interface to get a single or multiple matches. Different languages do this in different ways. Javascript by default gets a single match, and if you want all of them you need to use the global flag (g). Python's regex have the method search for a single one and findAll for all matches. And, for some reason, Java only has a find method for a single match, and the only way to get all matches is to call find() in a loop. Ironically Java has both replace and replaceAll methods, so I don't know why there isn't a findAll 🤷‍♂️.

The only thing all languages (I know) share is that each time you call the find method it returns the next match. This means the regex stores the position of the last match to skip it the next time the method is called.

To keep it simple we'll add two methods find and findAll. These are added to the NFARegex class, not to the NFA class. Why? As I said in the introduction, a NFA is supposed to find a single match and always start at a specific index. The regex class is the one responsible for controlling when and how the NFA is executed.

Find

Now we'll have two methods that need to convert from the NFA memory to the output result, so to simplify it we'll move the code does it to it's own method.

class NFARegex {
    // ...

    _memoryToFinalResult(memory, string) {
        const result = {groups: {}};
        for (const [group, start, end] of Object.values(memory.GROUP_MATCHES)) {
            result.groups[group] = string.substring(start, end);
        }
        return result;
    }
}

We'll also assume the NFA's compute method also takes the initial position as an argument. I'll explain that later.

The find method is just a loop that tries to find the first match in every position, beginning at _currentPointer. The pointer stores the index where the last match ended, so that each time you call the method it returns the next match.

class NFARegex {
    constructor (regex, modes) {
        // ...
        this._currentPointer = 0;
    }

    // ...
    find(string) {
        for (let i = this._currentPointer; i < string.length; i++) {
            const r = this.nfa.compute(string, i);
            if (!r) continue;
            this._currentPointer = r.GROUP_MATCHES[0][2]; // Ending position of the group 0
            return this._memoryToFinalResult(r, string);
        }
        this._currentPointer = string.length; // This is to cache that there are no more matches.
        return null;
    }
}

To get the position where the match ended it uses the ending index of the group 0. Notice that we aren't adding one to that index. That's because the ending index already points to the character after the group ends:

Notice that each group is actually a right-open interval. When we run computeGroups we have already exited the group, so i points to the first non-consumed letter after the group, not to the group's last letter. This won't be a problem, since the substring function is also right-open.

-- Adding capturing groups - Building a Regex Engine Part 4

Accurate representation of the engine looking at every position of the string

FindAll

The findAll method is similar, but here we always start at 0 and even if it finds a match it keeps looking for more. To be able to change the current position in a smoother way I'll use a while loop.

class NFARegex {

    //...

    findAll(string) {
        let i = 0;
        const matches = [];
        while (i < string.length) {
            const r = this.nfa.compute(string, i);
            if (r) {
                matches.push(this._memoryToFinalResult(r, string));
                i = i ===  r.GROUP_MATCHES[0][2] ? i + 1 : r.GROUP_MATCHES[0][2];
            } else 
                i++;
        }
        return matches;
    }
}

When it finds a match, it updates the index to the ending position of the match, unless it's an empty match. If you use lazy quantifiers, they might be matches with 0 length. For example the regex a?? and the string a produces one empty match. In this case we must advance the pointer one position to avoid getting stuck in an infinite loop.

The final detail

The final touch to get this working is allowing the NFA to start at a non-0 position. In the NFAEngine we had:

compute(string) {
    const stack = [];
    stack.push({i: 0, 
        currentState: this.states[this.initialState], memory: {GROUP_MATCHES:{}, EPSILON_VISITED: []}})
    //...
}

The change is as simple as adding an argument and setting i to that value:

compute(string, i) {
    const stack = [];
    stack.push({i, // <--- Here
        currentState: this.states[this.initialState], memory: {GROUP_MATCHES:{}, EPSILON_VISITED: []}})
    //...
}

Finally, we can test it:

const r = new NFARegex("[abc]");
console.log(r.find("abc")); // {groups: {0: "a"}}
console.log(r.find("abc")); // {groups: {0: "b"}}
console.log(r.find("abc")); // {groups: {0: "c"}}
console.log(r.find("abc")); // null

console.log(r.findAll("abc")); // [{groups: {0: "a"}}, {groups: {0: "b"}}, {groups: {0: "c"}}]

Conclusion

With this change we already have a decent regex engine. The interface could be improved even more, with methods like replace and replaceAll. It shouldn't be too difficult, since we have the indexes of each match.

The next posts will be a (hopefully) short mini-series on the remaining features I know about regex: Atomic groups, backreferences, lookahead and lookbehinds. I didn't intend to cover these topics, but I feel that this series would be incomplete without them. Still, I'll make a more simple and experimental implementation. I might not cover all the corner cases (or even realize they exist), since I would like to end this series as fast as possible (which is contradictory with the idea of adding 4 unplanned posts to the series 😆).

Season 1 - Building a Regex Engine (Completed)