Published on

Adding capturing groups - Building a Regex Engine Part 4

In the last three posts we implemented a formal regex engine, but it's still really limited and not good enough for any real use. Today we'll add one of the most used features of modern engines, Capturing groups.

Till now, the engine always responded with "Yes, this string matches the regex" or "😂 Nice try", but capturing groups allow extracting the parts of the string that matched a part of the regex. For example: .+@(.+)\.com can be used to extract the email domain of an email address (let's ignore weird emails and email validation)

  • Input string: coolest-potato@gmail.com
  • Output ["gmail"]
Optional - Quick 'Did you know' about emails and regex

Never use regex for email validation. Instead you should delegate this problem to tried and trusted libraries. BUT... did you know there's a website (https://emailregex.com/) that shows a regex that (reportedly) works with 99.9% of emails? At the time of writing this post, that regex is:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b
\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]
(?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?
[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c
\x0e-\x7f])+)\])

(Added line breaks to make it more readable, if for some reason you want to copy the regex as it is just go to the website).

All I'm going to say is:


Understanding the details of capturing groups

By default every regex expression inside parenthesis is a capturing group. You can also make it a non-capturing group with the syntax (?:<expression>), this is often used to wrap things in parenthesis without creating useless groups, for example ((?:a|b)+).

Most (maybe all?) regex engines use the capturing group 0 to output the whole text that matched the pattern. The following capturing groups are numbered from left to right.

Regex Knowledge: +10 points for Gryffindor.
Alignment of parenthesis with vertical lines: -50 points.

Multiple groups can capture the same input, the most stupid example would be ((((a)))), where the same a would appear in 5 capturing groups (4 parentehsis + group 0). But if a capturing group is transvered multiple times it only captures the last one. Let's see this with an example:

Regex (a+)
Input aaaaa
Result

Group 0: aaaaa
Group 1: aaaaa

The group captures one or more a, so the result above is expected. But what happens if we take the + quantifier outside the capturing group?

Regex (a)+
Input aaaaa
Result

Group 0: aaaaa
Group 1: a

The group 1 only matches a single a, so even though the four a's are matched by that group, it only captures the last one. In implementations terms this means every time we exit a group we'll override the old value.

Implementing capturing groups

To add this functionality we need to modify most parts of the engine:

  • Parsing: If your parser didn't allow parenthesis or the non-capturing syntax, you'll have to add that. I won't cover this part.
  • Building the AST: In the previous post we already had the nodes RegexAlternative and Regex but they assumed it wasn't a capturing group. Now we'll add a variable to store whether it's capturing or not.
  • NFA model and building: The capturing groups have to be modeled in the NFA so we can use that info when we execute it.
  • Engine execution: At every step of the execution we need to check if we are entering or exiting a capturing group and store the indexes of the captured substring.

Some code snippets only include the things that have changed. You can see the complete source code here. Let's begin.

Building the AST

This one is easy. Previously we had two main "compound" AST nodes: Regex (a single regex expression) and RegexAlternative. All we need to do is add an extra attribute _capturing. We'll consider the root of the tree to be a group too (group 0), even though it doesn't have parenthesis. So the only case it isn't a capturing group if it's an explicit non capturing group

  • aaa -> Regex(aaa, true)
  • (aaa)-> Regex(aaa, true)
  • (?:aaa) -> Regex(aaa,false)

A RegexAlternative contains multipe Regex. Even if the alternative is capturing, the inner regex are not:

  • (a|b) -> RegexAlternative([Regex("a", false), Regex("b", false)], true)

I think it would be simpler to merge Regex and RegexAlternative into a single AST node, but it's too late. If you can, don't overcomplicate the AST like I did.

Optional - A small reminder on AST

An AST (Abstract Syntaxt Tree) is a tree representation of a text. It's called abstract because it ignores some details of the text, and it might not be possible to reconstruct exactly the same text from an AST. In the example above, aaa and (aaa) have the same AST, so given an AST you can't know whether the original text had parenthesis or not.

There is no "standard AST". ASTs often have just enough information to fill the specific needs of the problem to solve. In this case I assume we have an ASTs like this one:

AST for '(a|b)+a?|c'

As I said in the previous post the AST has some unnecessary nodes, for example Expression is quite useless, since I could have added the quantifier to a generic RegexNode class and then make all the other classes inherit from it. The code will assume this structure, but if you are going to implement your own engine feel free to improve it all you want.

In my particular case due to how I made the AST builder it was easier to make groups capturing by default and then add a nonCapturing() method to set them as non-capturing.

class Regex {
    constructor(subpatterns) {
        this.subpatterns = subpatterns;
        this._capturing = true;
    }

    isCapturingGroup() {
        return this._capturing;
    }

    nonCapturing() {
        this._capturing = false;
        return this;
    }
}

And the same for RegexAlternative.

Building the NFA

Warning: We are going to extend NFAs, so some mathematical methods like the Powerset construction (which transforms NFAs to DFAs) can no longer be applied

Previously the states of an NFA were just a label with some transitions. Now we need to store the information about groups somehow. These are the options I can think of:

  • Tag the initial and ending states of a group.
  • Tag the transitions. This is slightly more accurate because "TeChNiCaLlY" speaking 🥴 the groups would begin and end when you go through a transition, rather than when you enter or exit a state.

Both methods are only possible because Thompson's construction generates readable and intuitive NFAs, where it's clear where a group begins and ends.

For a NFA engine there's no much different between the two of them. I chose to tag the states because it was simpler. When I did a bit of research afterwards I discovered this topic get's much more interesting if you want to convert the NFA to DFA (Quick reminder: NFA = Non deterministic finite automata, DFA = Deterministic Finite Automata).

I'll talk more about this in the post about NFA vs DFA. As a quick spoiler, in a NFA it's easier to know which states are the beginning and ending of a capturing group. In formal DFA it's impossible because that information is lost during the conversion/minimization. To be able to keep the capturing groups you can't use the Powerset construction, instead you have to use an extended algorithm. The algorithms I'll mention in that post tag the transition instead of the state, since this allows to merge more states.

For the regex (a|b)*a(a|b):

DFA
NFA for the same regex + The paint skills of a professional and serious person

Are you able to see the states that start the groups 1 and 2 in the DFA? Because I definitively can't. Of course I'm cheating. The DFA is fully minimized with the Powerset construction. DFAs with capturing groups can't be minimized as much as formal DFAs, since in those you can merge more states without worrying about groups.

So as a recap: I'm going to tag the states, but if you intend to make a DFA engine it might be more useful to tag the transitions.

Me trying to program the engine to capture input

The model

First we are going to store the groups that start and end in each state. In most cases a state will start or end a single group, but there are some stupid corner cases line (((a))). I'm also not sure if the states can both start and end groups at the same time, but I'll assume it's possible.

Each group will be represented with its name. This will also be useful to implement named capturing groups, but I'll talk about those later.

export class State {
    constructor(name) {
        this.name = name;
        this.transitions = [];
        this.startsGroups = [];
        this.endsGroups = [];
    }

    addStartGroup(group) {
        this.startsGroups.push(group);
    }

    addEndGroup(group) {
        this.endsGroups.push(group);
    }

    //Methods previously defined
}

In EngineNFA I'll also add a method to add a group:

class EngineNFA {
    // ... Previously defined Methods

    addGroup(start, end, group) {
        this.states[start].addStartGroup(group);
        this.states[end].addEndGroup(group);
    }
}

NFA Building

We just need to get the name of the group at the beginning of _singleRegexToNFA and apply the group at the end.

_singleRegexToNFA(regexAST) {
    let nfa = null;
    //...
    
    // You have to define a concrete group name before visiting any subnode so that the order of the group 
    // numbers is ascending.
    const groupName = regexAST.isCapturingGroup() ? this.newGroup() : null;
    for (const c of regexAST.subpatterns) {
        // A lot of code that was defined in the last post
    }
    // ...

    // Be careful, 'if(groupName)' is not enough because group '0' would be false. I guess you instead you could 
    // convert the group to a string
    if (groupName !== null) nfa.addGroup(nfa.initialState, nfa.endingStates[0], groupName); 
    return nfa;
}
_alternativeToNFA(alternativeAst) {
  const groupName = alternativeAst.isCapturingGroup() ? this.newGroup() : null;
  const nfa = this.nfaFactory();
  // ... Other code
  if (groupName !== null) nfa.addGroup(start, end, groupName);
  return nfa;
}

It's simple, but we need to tie up a loose end. Both methods use appendNFA, a method that merges the initial state of the appended NFA with the "union state" (the state where that NFA is appended). In the process a state is lost, what if there is a group there? All we need to do is move that information to the merge state, just like we did with the transitions:

appendNFA(otherNfa, unionState) {
    // ...
    // Before we just added the transitions of the deleted state to the unionState
    for (const [matcher, toTransition] of otherNfa.states[otherNfa.initialState].transitions)
        this.addTransition(unionState, toTransition.name, matcher);

    // Now we also move the groups
    for (const group of otherNfa.states[otherNfa.initialState].startsGroups)
        this.states[unionState].addStartGroup(group);
    for (const group of otherNfa.states[otherNfa.initialState].endsGroups)
        this.states[unionState].addEndGroup(group);
    // ...
}

Finally we define the utilities for group numbering that we used in the previous methods:

class NFABuilder {

    newGroup() {
        const c = this.groupNumber;
        this.groupNumber++;
        return c;
    }

    resetGroupNumbers() {
        this.groupNumber = 0;
    }

    // This method was already defined, we just add the call to 'resetGroupNumber'
    regexToNFA(regexAST) {
        this.resetStateNumbers();
        this.resetGroupNumbers();
        return this._regexToNFA(regexAST);
    }

    // ... Other methods

Executing the model

All we did until now was just so the NFA had the info about the capturing groups... now we need to do something with that info. And as always, it boils down to the compute method in EngineNFA

In the second post of this series I gave the NFA a memory to avoid epsilon loops by storing visited epsilon transitions. We are going to use that same memory to keep track of the capturing groups. We'll add a new variable GROUP_MATCHES and initialize it do an empty object (technically it's a map so I should have used new Map(), but a plain object will also do the job).

Then every time we enter a new state we have to check if we just entered or exited any group. For that we'll call this.computeGroups. Also, before we always returned either true if it matched or false otherwise. Now we'll change that to return memory if it matches or null if it doesn't.

class EngineNFA {

  compute(string) {
        const stack = [];
        stack.push({i: 0, currentState: this.states[this.initialState], memory: {GROUP_MATCHES:{}, EPSILON_VISITED: []}}})
        while (stack.length) {
            const {currentState, i, memory} = stack.pop();
            this.computeGroups(currentState, memory, i); // <-- This is new
            if (this.endingStates.includes(currentState.name)) 
                return memory;
            // ... 
        }
        return null;
    }

    // ... Other methods
}

It's important that we compute the groups before checking if we are in an ending state, otherwise the groups that end in the ending state (like group 0) would never be completed.

And in computeGroups we just check the startGroups and endGroups of the state. As a reminder, i is the position of the string where the NFA is at the moment, which is the first letter that hasn't been consumed yet.

computeGroups(currentState, memory, i) {
    for (const group of currentState.startsGroups) {
        memory.GROUP_MATCHES[group] =  [group, i, null];
    }
    for (const group of currentState.endsGroups) {
        memory.GROUP_MATCHES[group][2] = i;
    }
}

I store the groups in the format [group, start, end]. I know the group is redundant because it's also the key of the object but I like it this way (okay maybe I'm too lazy to change it 😜).

Notice that each group is actually a right-open interval. When we run computeGroups we have already exited the group, so i points to the first non-consumed letter after the group, not to the group's last letter. This won't be a problem, since the substring function is also right-open.

Making a decent output

With all these changes capturing groups already work! Still, the result isn't in a user-friendly format:

const nfa = new NFARegex("((?:a|b)+)(cd)");
const result = nfa.compute("bbaacd");
console.log(result); // {"GROUP_MATCHES":{"0":[0,0,6],"1":[1,0,4],"2":[2,4,6]},"EPSILON_VISITED":[]}

So let's improve that by calculating the substring that matched in each group:

class NFARegex {

 compute(string) {
        const memory =  this.nfa.compute(string);
        if (!memory) return null;
        const result = {groups: {}};
        for (const [group, start, end] of Object.values(memory.GROUP_MATCHES)) {
            result.groups[group] = string.substring(start, end);
        }
        return result;
    }
}

And now:

const nfa = new NFARegex("((?:a|b)+)(cd)");
const result = nfa.compute("bbaacd");
console.log(result); // {"groups":{"0":"bbaacd","1":"bbaa","2":"cd"}}

Much cleaner! You might be wondering why I used an object instead of a simple list. That's because most regex engines also allow to define the name of a capturing group.

Extra, extra! - Named groups

I could have changed the text of the newspaper, but then the jailed guy wouldn't have made sense. Named groups aren't illegal... yet.

Named groups are a type of capturing groups where you can specify the name of the group. It follows the syntax (?<name>x), for example: (?<username>.+)@(?<domain>.+\.com). This allows more expressive regexes.

With all the work we've already done, they are pretty easy to implement. First we need to parse that pattern and store it in the AST.

class RegexAlternative {

    constructor(...alternatives) {
        this.groupName = null; // <-- This is new
        this._capturing = true;
        this.alternatives = alternatives;
    }
    //...
}
class Regex {
    constructor(subpatterns) {
        this.subpatterns = subpatterns;
        this.groupName = null; // <-- This is new
        this._capturing = true;
    }
    //...
}

And then all we have to do is take that name into account in the nfa builder. So in _alternativeToNFA instead of

const groupName = alternativeAst.isCapturingGroup() ? this.newGroup() : null;

We'll have

const groupName = alternativeAst.isCapturingGroup() ? alternativeAst.groupName || this.newGroup() : null;

And the same for _singleRegexToNFA. Done! 😎

const nfa = new NFARegex("(?<lotsOfA>a+)(?<lotsOfB>b+)");
const result = nfa.compute("aaabb");
console.log(result); //  {"groups":{"0":"aaabb","lotsOfA":"aaa","lotsOfB":"bb"}}

Just in case you should do a couple of checks to the names:

  1. What happens if there are duplicated group names? You could throw an error, keep the last one, autodestruct... so many choices.
  2. The engine shouldn't allow numbers as names, since it conflicts with regular capturing groups.

What's next

Capturing groups are a really important feature of modern regex, but the engine is still really flimsy. For example, while trying to test the code I tried to run the regex I mentioned in the introduction (.+@(.+)\.com) and I was surprised when it failed. I didn't realize we haven't implemented character classes, like the dot period, \w, or [a-z]. Don't worry, it can be fixed 🛠.

Season 1 - Building a Regex Engine (Completed)