Published on

Creating a NFA from a regex - Building a Regex Engine Part 3

In the last two posts we talked about how a regex is just a finite state machine and how to implement a non deterministic finite automata. Now we'll close the triangle circle by talking about how to define a NFA from a regex string. At the end of this post we'll have a fully fledged engine for formal regex!

Step one: Parsing

A string is just an array of characters without a meaning. To be able to translate a regex to a NFA we need to convert it to a data structure more suitable. This structure is an AST (Abstract Syntax Tree). An AST is a tree representation of a string. For example, for a regex like (a|b)+a?|c we want something like:

There isn't a standard definition of an AST. In fact the AST above could be simplified to remove the Expression node. I love parsing, and I already worked on some projects that required parsing. Maybe that's why I found this step to be the most uninteresing part of this project, and I implemented it as fast as possible.

And for that reason I won't enter into the code to actually parse the regex. I'll assume you somehow obtained an AST like the one above and focus on translating that AST to a NFA. In the future I would like to do posts specifically about parsing, but for now I'll just ignore it.

In my project I used antlr4, a well-known parser generator. There are a lot of tools you can use to parse it. If you are brave enough, you could use the Shunting-Yard Algorithm instead of a parser. This algorithm allows you to either generate an AST or to translate the regex to reverse polish notation.

In this post the code will assume you have an AST, but the concepts still apply if you are using polish notation. As a fun fact, Rust has a crate (what the heck is a crate?) for nfa regex, and they use reverse polish notation (Rust docs)

And that's all I'm going to say about parsers in this post. Good luck. Now let's get to the good part.

Me after ignoring all the people that have no idea about parsing

Step 2: Building the NFA

To build a NFA from a regex we are going to use Thompson's construction. This is a method that uses simple patterns to recursively build a NFA from a regex. In the code snippets I'll assume that all the classes defined in the previous posts are imported.

The rules of the game

Empty expression

An empty expression is translated to an epsilon transition. Totally unexpected.

In code we'll see this pattern (two states with a single transition) multiple times, so I'll make a small abstraction called _oneStepNFA.

_emptyExpression(character) {
    return this._oneStepNFA(new EpsilonMatcher());
}

_oneStepNFA(matcher) {
    const nfa = new EngineNFA();
    let a = this.newState();
    let b = this.newState();
    nfa.declareStates(a,b);
    nfa.setInitialState(a);
    nfa.setEndingStates([b]);

    nfa.addTransition(a,b, matcher);
    return nfa;
}

With the parser I made empty expressions are an AtomicPattern with an epsilon symbol inside, so I won't be using _emptyExpression. We'll handle the epsilon symbol in the following rule.

Single symbol

A single symbol is translated to... 🥁🥁🥁... a single transition that consumes that character 🤯🤯.

NFA for 'a'

In code, we can just use the abstraction previously defined.

_atomicPatternNFA(character) {
    const matcher = character === EPSILON ? new EpsilonMatcher() : new CharacterMatcher(character);
    return this._oneStepNFA(matcher);
}

Concatenation of two regex

If you have two regex r1r_1 and r2r_2, to concatenate them you have to build the NFA for each regex and combine the ending state of r1r_1 with the starting state of r2r_2.

Image from Wikipedia

Thompson's construction ensure that all NFA have a single initial state with only outward transitions and a single final state with only inward transitions. This ensures that concatenating two regex is always possible.

To implement this, I'll make a method a bit more generic. To keep it simple I made some assumptions: Both NFAs have unique state names, and the concatenation is destructive for the NFA r2r_2. Both assumptions could be avoided by making a deep copy of the argument and renaming the states.

/** 
* This should only be used when:
* - The initial state of 'otherNFA' doesn't have any inwards transition
* - 'unionState' belongs to 'this' and doesn't have any outwards transition
* The main use of this method is in thompson's construction
* If the assertios aren't true, the append might not be correct.
*/
appendNFA(otherNfa, unionState) {
    // 1. Copy the states of 'otherNfa' to this
    // (This isn't a real copy, but a pointer copy, so it's dangerous if 'otherNFA' changes afterwards)
    for (const s in otherNfa.states) {
        this.states[s] = otherNfa.states[s];
    } 
    // 2. Combining two states means one get's deleted
    delete this.states[otherNfa.initialState]; 
    // 3. All the outward transitions of 'otherNfa.initialState' now belong to 'unionState'
    for (const [matcher, toTransition] of otherNfa.states[otherNfa.initialState].transitions)
        this.addTransition(unionState, toTransition.name, matcher);
    // 4. If the unionState is an end state, then the end states of the appended nfa are also end states of the fusion.
    if (this.endingStates && this.endingStates.includes(unionState)) {
        this.endingStates.splice(this.endingStates.indexOf(unionState),1, ...otherNfa.endingStates);
    }
}

I called this method appendNFA rather than concatenateNFA because this method is a bit more generic. A pure concatenation would be an append where the unionState is the ending state of the instance NFA (nfa.appendNFA(otherNFA, nfa.endingStates[0])). We'll take advantage of this abstraction in the next rule.

With this implementation the labels of the states are no longer continous, since we are destroying one state. If you have q0q1q_0 \rightarrow q_1 and q2q3q_2 \rightarrow q_3 the result will be q0q1q3q_0 \rightarrow q_1 \rightarrow q_3. q2q_2 dissappears. Since labels are only for debugging, it doesn't affect the execution.

In my engine I made a small patch to avoid this by allowing the unionState to share the same label as the initialState of otherNFA, which makes a more consistent numbering. But the main reason was that the engine has a web app that shows the NFA diagram, so I wanted to keep the naming consistent to avoid unnecessary confusion.

Union of two regex

To get the union of two regex you generate the NFA of each alternative, and then you create two states that will be the new initial and ending states. Finally you have to add an epsilon transition from the union's initial state to the initial state of each alternative and an epsilon transition from the ending state of each transition to the union's ending state.

Don't worry copyright officer, this image is definitively not stolen from wikipedia. Nope. Definitively not

A simple example is a|b:

NFA for aba|b

But this can be extended with more complex regexes and even more alternatives

NFA for (aa)(bb)(cc)(aa)|(bb)|(cc)

To implement this, we'll use the appendNFA method defined in the previous rule. We'll also assume there is a method newState that returns a unique name for an state. The argument is an AST that represents the union, and we'll generate the NFA for each alternative by calling _regexToNFA, a method we'll define later.

_alternativeToNFA(alternativeAst) {
    // 1. Create a new NFA with a starting state
    const nfa = new EngineNFA(); 
    const start = this.newState();
    nfa.addState(start);
    nfa.setInitialState(start);
    const endingStates = [];

    for (let i = 0; i < alternativeAst.alternatives.length; i++) {
        //2. Generate a NFA for each alternative
        const tmp = this._regexToNFA(alternativeAst.alternatives[i], false);
        endingStates.push(...tmp.endingStates);
        //3. Append the NFA to the initial state of the union 
        nfa.appendNFA(tmp, start);
    }

    const end = this.newState();
    nfa.addState(end);
    //4. Point the end state of each alternative to the union end state
    endingStates.forEach(x => nfa.addTransition(x, end, new EpsilonMatcher()));
    nfa.setEndingStates([end]);
    return nfa;
}

As as small detail, I calledconst end = this.newState(); after appending all NFAs because I assume newState generates increasing ids. By calling it after the appends the union's end state will have the highest number, making the ordering more consistent. Still, the end user should know nothing about states, so this is just for debugging.

This implementation with appendNFA generates a NFA slightly different from the official Thompson's Construction:

"Official" construction for aba|b
NFA for aba|b with 'appendNFA'

appendNFA destroys the initial state of the appended NFA, so q1q_1 and q3q_3 are removed. This still should work for every case, and it removes some unnecesary states so it's a win-win.

Kleene star expression (aka the asterisk *)

The kleene star (rr*) matches zero or more rr. To achieve this, we'll get the NFA of rr, add a new initial and final state (qiq_i and qfq_f) and add four epsilon transitions:

  1. A transition from the end state of rr to the initial state of rr. This one allows repeating rr multiple times
  2. A transition from qiq_i to qfq_f. This is the transition that doesn't match any rr.
  3. A transition from qiq_i to the initial state of rr and a transition from the ending state of rr to qfq_f
    • I'm not 100% sure if you can just ignore qiq_i, qfq_f and these two transitions. For the basic cases it works and avoids creating "useless states", but I'm not sure if it works for all cases.
With this image one must wonder what's the point of this post if I'm copying half of Wikipedia images

Let's see some examples (I promise I didn't plagiarized the following images from Wikipedia):

NFA for the regex (aaaa)*

This rule can be modified slightly to translate the + quantifier. You just need to remove the epsilon transition from qiq_i to qfq_f:

NFA for the regex (aaaa)+

But things get more interesting. There are two types of quantifiers:

  • Eager quantifiers (?, *, +): They consume all the characters they can.
  • Lazy quantifiers (??, *?, +?): They consume the least characters possible.

Depending on how the engine prioritizes the four transitions the quantifier is eager or lazy. To make it eager you have to prioritize the transition that enters the loop (the transition from qiq_i to the intial state of rr) and the one that stays in the loop (the transition that goes fron the ending state of rr to the initial state of rr).

If instead you prioritize avoiding the loop (qiqfq_i\rightarrow q_f) and exiting the loop (the transition from the ending state of rr to qfq_f) you'll have a lazy quantifier.

Optional - More on lazy quantifiers and efficiency

Overall using lazy quantifier is recommended when possible to improve the regex efficiency, since greediness tends to provoke a lot of backtracking errors because the quantifier tries to consume more characters than it should. The post Performance of Greedy vs. Lazy Regex Quantifiers talks about how lazyness is not more efficient per se, instead, we tend to rely too much in backtracking

A common misconception about regular expression performance is that lazy quantifiers (also called non-greedy, reluctant, minimal, or ungreedy) are faster than their greedy equivalents. That's generally not true, but with an important qualifier: in practice, lazy quantifiers often are faster. This seeming contradiction is due to the fact that many people rely on backtracking to compensate for the impreciseness of their patterns, often without realizing this is what they're doing. Hence, the previously mentioned misconception stems from that fact that lazy quantifiers often result in much less backtracking when combined with long subject strings and the (mis)use of overly flexible regex tokens such as the dot.

-- Performance of Greedy vs. Lazy Regex Quantifiers by Steven Levithan, coauthor of Oreilly's Regular Expressions Cookbook

If you want to match the first tag of an HTML like <em></em>, an eager quantifier won't work. In the regex <(.+)> the > matches the second one, so the result is em></em. To avoid this we could use a lazy quantifier <(.+?)>, but we can make it even more explicit by using a negation character class to indicate we don't want to match any > character inside the group: <([^>]+?)>.

Also note that even though using lazy quantifiers is good, they should be followed by some other expression,specially for *?. For example a.*?b matches ab and aab because the b is forcing the engine to keep searching. Meanwhile the regex a.*? only matches the a in ab, making the *? completely useless. In fact the regex .*? doesn't match anything.

Scott Regex vs The World

To implement this, I'll have a builder argument that returns the NFA generated for rr. I did this so the new initial state had a number lower than rr initial state. You could just have the NFA as an argument, but for debugging having all states ordered is a bless.

We can implement +, * and their lazy versions in just one method. Remember that this engine prioritizes the transitions in the order they are added. So take special attentiont to the order:

_asterisk(builder, lazy) {
    return this._asteriskplus(builder, lazy, true);
}

_plus(builder, lazy) {
    return this._asteriskplus(builder, lazy, false);
}

_asteriskplus(builder, lazy, asterisk) {
    const newInit = this.newState();
    const base = builder(); // For 'r*' builder() returns the NFA of 'r'
    const newEnd = this.newState();
    base.addState(newInit); 
    base.addState(newEnd);
    // Both options add the same transitions but in different orders
    if (lazy) {
        if (asterisk) base.addTransition(newInit, newEnd, EPSILON_MATCHER); // q_i -> q_f (Skip the loop)
        base.addTransition(newInit, base.initialState, EPSILON_MATCHER);    // r_i -> r_f (Enter the loop)
        base.addTransition(base.endingStates[0], newEnd, EPSILON_MATCHER);  // r_f -> q_f (Exit the loop)
        base.addTransition(base.endingStates[0], base.initialState, EPSILON_MATCHER); //r_f -> r_i (Stay in the loop)
    } else {
        base.addTransition(newInit, base.initialState, EPSILON_MATCHER); // r_i -> r_f (Enter the loop)
        base.addTransition(base.endingStates[0], base.initialState, EPSILON_MATCHER); //r_f -> r_i (Stay in the loop)
        base.addTransition(base.endingStates[0], newEnd, EPSILON_MATCHER);  // r_f -> q_f (Exit the loop)
        if (asterisk) base.addTransition(newInit, newEnd, EPSILON_MATCHER); // q_i -> q_f (Skip the loop)
    }
    
    base.setInitialState(newInit);
    base.setEndingStates([newEnd]);
    return base;
}

The epsilon matcher is so common that I defined it as a constant const EPSILON_MATCHER = new EpsilonMatcher();

Tying all up

With all the rules implemented we just need to write the overall method to transverse the AST. As an extra I'll add the definition of the optional quantifier ? and it's lazy version ??.

A regex node is just a series of subpatterns, so we just need to transverse and cocatenate their respective NFA.

// Code for a single regex node (not an alternative node) 
_singleRegexToNFA(regexAST) {
    let nfa = null;
    // Handles empty capturing groups
    if (regexAST.subpatterns.length === 0) {
        nfa = this._oneStepNFA(new EpsilonMatcher());
    }
    // A regex is just a series of subpatterns. We iterate through them and concatenate them to 'nfa'
    for (const c of regexAST.subpatterns) {
        let baseBuilder, current, namedGroup = null;

        // 1: We translate the base of the regex (ignoring the quantifier)
        if (c.child instanceof AtomicPattern) {
            baseBuilder = () => this._atomicPatternNFA(c.child.char);
        } else if (c.child instanceof RegexAlternative || c.child instanceof Regex) { // Groups
            baseBuilder = () => this._regexToNFA(c.child);
        }

        // 2: We apply the quantifier (if there is one)
        if (c.quantifier === ASTERISK || c.quantifier === LAZY_ASTERISK) {
            current = this._asterisk(() => baseBuilder(), c.quantifier === LAZY_ASTERISK);
        } else if (c.quantifier === PLUS || c.quantifier === LAZY_PLUS) {
            current = this._plus(() => baseBuilder(), c.quantifier === LAZY_PLUS);
        } else if (c.quantifier === OPTIONAL || c.quantifier === LAZY_OPTIONAL) {
            current = baseBuilder();
            if (c.quantifier === LAZY_OPTIONAL)
                // Reminder: unshiftTransition adds a transition and gives it the max priority
                current.unshiftTransition(current.initialState, current.endingStates[0], EPSILON_MATCHER);
            else 
                current.addTransition(current.initialState, current.endingStates[0], EPSILON_MATCHER);
        } else {
            // No quantifier
            current = baseBuilder();
        }

        if (nfa === null) 
            nfa = current 
        else 
            // 3: Concatenate 'current' to the overall nfa
            nfa.appendNFA(current, nfa.endingStates[0]);
    }
    return nfa;
}

All these methods are inside a class NFABuilder:

export class NFABuilder {
    constructor() {
        this.stateNumber = 0;
    }

    newState() {
        const c = `q${this.stateNumber}`;
        this.stateNumber++;
        return c;
    }

    resetStateNumbers() {
        i = 0;
    }

    // This is the entry point from outside. 
    regexToNFA(regexAST) {
        // This is to ensure we always have the same state numbers. Skip it if you're not interested
        this.resetStateNumbers();
        return this._regexToNFA(regexAST);
    }

    _regexToNFA(regexAST) {
        if (regexAST instanceof RegexAlternative) 
            return this._alternativeToNFA(regexAST);
        else 
            return this._singleRegexToNFA(regexAST);
    }

    // All the other methods defined previously
}

Finally we can define the Regex class that will encapsulate everything 😎.

export class NFARegex {
    constructor (regex) {
        this.source = regex;
        const ast = parseRegex(regex);
        const builder = new NFABuilder();
        this.nfa = builder.regexToNFA(ast);
    }

    compute(string) {
        return this.nfa.compute(string, i);
    }
}

Conclusion

Finally, after three extremely long posts we have a fully functional engine for formal regex.

const nfa = new NFARegex("(a|b)+c*");
const result = nfa.compute("abababababacccc"); // True

const nfa2 = new NFARegex("a+c?b+");
console.log(nfa2.compute("aaaaacbbbbbb")); //True
console.log(nfa2.compute("accb")); //False

Still, it's missing a lot of modern features, and all the regex have an implicit start of the string anchor (^) at the beginning. For example, for the regex a our engine doesn't find a match in ba. Meanwhile javascript's regex finds one: "ba".match(/a/) returns [a]. Also, we can only get the first match, while modern regex are able to return all matches. We'll make these improvements in the following posts.

If you want to see all the code we've written in these three posts, it's all in https://github.com/DanielBV/blog-examples/tree/main/regex-engine/formal-regex-engine

Season 1 - Building a Regex Engine (Completed)