Published on

Atomic groups - Building a Regex Engine Part 8

Atomic groups are non-capturing groups that, once exited, the engine can't backtrack that decision. Of course this only works with NFA engines that backtrack. Since this is a low-level feature I thought it would be interesting to implement it.

Optional - Javascript and atomic groups

Javascript doesn't implement atomic groups. You might think that's because it uses an incredibly advanced algorithm without backtracking... 😂 Oh wait, you're serious, let me laugh even harder 🤣🤣🤣.

I haven't found too many details on Javascript's regex engine, but it suffers from catastrophic backtracking. This melodramatic name means there are regex where the engine get's stuck backtracking for a long time. You can see a long explanation here: https://javascript.info/regexp-catastrophic-backtracking. I'll steal their example:

let regexp = /^(\d+)*$/;

let str = "012345678901234567890123456789z";

// will take a very long time (careful!)
alert( regexp.test(str) );

I executed it in firefox and it raised an Uncaught InternalError: too much recursion 😔. That was disappointing.

So why Javascript doesn't have atomic groups? 🥔. Javascript does have lookaheads and those are inherently atomic groups, so I don't see any technical reason. Steven Levithan made an interesting post about how you can mimick atomic groups using lookaheads.

The conclusion of all this is that, at the end of this post, our engine will officialy have more features than Javascript (ignoring all the other features that JS has and I didn't implement out of lazyness 😝).

To create a capturing group you have to add a ?> at the beginning of a capturing group: (?>this is an atomic group).

Why would anyone use atomic groups?

Short answer: To optimize.

Long answer: People say that it's to optimize, but since it changes the results of the regex I have a hard time thinking of general real life uses.

Atomic groups are used to avoid useless backtracking. As I said in the post about anchors, the faster we can make the engine fail (for invalid inputs) the better. But you just can't make every group atomic, since they change the final result. The classic example that I definitively didn't steal is a(?>bc|b)c. For the input abc the engine would try the first alternative and consume the bc, then it would try to backtrack because it's missing the last c, but it can't because it's an atomic group, so it just fails without trying the second alternative. In a normal group, it would match. So for this regex using an atomic group is useful only if you want to match abcc but not abc, but then, what's the point of the alternative?.

I did a bit of research to understand the use case of atomic groups. These are two key insights I've discovered and that might help you:

  • When writing an atomic group you want to order alternatives from the longest to the shortest because regex try to match alternatives from left to right. For example \b(?>integer|in)\b matches both integer and in but \b(?>in|integer)\b doesn't match integer because the atomic group chooses and succeeds in the first alternative, just to fail a moment later in the \b.

  • It can be useful to avoid backtracking in places you are sure there is no need to backtrack. For example if you want to match some numbers followed by some letters, you might do \d+\w+. Since a digit (\d) is also a word character (\w) this would cause the engine to backtrack a lot trying to decide which of the two should match the digit.

    In this case you know that when you find the last digit you don't need to backtrack again, so you can do (?>\d+)\w+ (of course in this example it would be easier to do \d+[a-zA-Z]+, but this is just an example).

I still have dificulties imagining a real life example where I could use atomic groups (probably because I don't use regex that often), but I feel that now I understand them a bit more.

Capturing groups the moment you add a ?> at the beginning

Implementation

Warning: This post is experimental and not as curated as the previous posts. Since I implemented it quite fast I might not have tested it enough or found some corner cases

In terms of implementation atomic groups are just a group that when exited empties the stack. Since we implemented the compute method with an iterative approach, we have an explicit stack that we can just empty whenever we want:

class EngineNFA {

    compute(string) {
            const stack = []; // <----
            //...
    }
}

First, as always, we need to change the parser to allow the atomic group syntax and store it in the AST. In both Regex and RegexAlternative AST nodes I added an atomic attribute:

  constructor(...alternatives) {
        //...
        this._atomic = false;
    }
    isAtomic() {
        return this._atomic;
    }
    
    atomic() {
        this._atomic = true;
        this._capturing = false; // Atomic groups are non-capturing
        return this;
    }

Now we need to decide how to implement in in the NFA. As a reminder, in each state we are storing a list with the groups that the state begins and ends:

class State {
    constructor(name) {
        //...
        this.startsGroups = [];
        this.endsGroups = [];
    }

This makes it a bit weird to implement. Should I store the groups in endGroups as {group: 1, isAtomic: true}? It didn't convince me, specially because atomic groups are non-capturing groups. Another option I thought is moving the info about groups to the NFA class and creating a Group class. But didn't convince me either. So I decided to just store an attribute in the ending states of atomic groups. This is much quicker to implement though not as clean as I would like.

In the State I added a nukesStack attribute:

class State {
    constructor(name) {
        //...
        this.nukesStack = false;
    }
}

Shouldn't I have used getters and setters? Yes. At least I added a setter in the NFA:

class NFAEngine{
    //...
    setNukeState(state) {
        this.states[state].nukesStack = true;
    }
}

And we need to translate the AST to NFA:

_singleRegexToNFA(regexAST) {
    //...
    if (groupName !== null) nfa.addGroup(nfa.initialState, nfa.endingStates[0], groupName); 
    if (regexAST.isAtomic()) nfa.setNukeState(nfa.endingStates[0]); // <-- This is the new line
    return nfa;
}

_alternativeToNFA(alternativeAst) {
    //...
    if (groupName !== null) nfa.addGroup(start, end, groupName);
    if (alternativeAst.isAtomic()) nfa.setNukeState(nfa.endingStates[0]);  // <-- This is the new line
    return nfa;
}

Finally, we need to tell the engine's compute method to empty the stack if the current state has the nukesStack attribute to true:

compute(string) {

  while (stack.length) {
            const {currentState, i, memory} = stack.pop();
            this.computeGroups(currentState, memory, i);

            if (this.endingStates.includes(currentState.name)) 
                return memory;
            if (currentState.nukesStack) // <--- Here
                stack = [];            
            
            for (let c = currentState.transitions.length-1; c >= 0; c--) {

Easy 😎. Of course that's because we have a lot of control over the stack due to the iterative implementation. Otherwise we would have had to change the whole algorithm to an iteration and the explanation would have been much longer.

const r = new NFARegex("a(?>bc|b)c");
console.log(r.find("abcc")); //  {groups: {0: "abcc"}}
r.reset(); // Regex are stateful. This empties the state to avoid affecting the next execution
console.log(r.find("abc")); // null
Season 1 - Building a Regex Engine (Completed)