Published on

Building a Regex Engine Part 1 - Introduction

Most programming languages include a built-in regex engine and all of them behave like a black box. You create a regex object from a string, then you call an operation (matchAll, match, replace, replaceAll, etc) giving an input string and... it just works, or at least it works as the pattern says. If that pattern isn't the one you really wanted or it's extremelly difficult to understand that's not the engine's problem 😜. But how does it work? What's inside a regex object? And why do you have to instanciate it?

To be able to answer these questions I decided to build my own Regex Engine based on NFAs. You can find the complete project in Github. I also built a website were you can test it and see the NFAs generated by any regex.

In order to understand how to create a regex engine we must first understand their origins. Wait! WAIT! Don't leave! I promise I won't mention any years or random names. Formal language theory defines regular expressions as a notation to define a regular language. A language is just a set of strings, for example the regex a* defines the language L={ε,a,aa,aaa,aaaa...}L = \{\varepsilon, a, aa, aaa, aaaa...\} which can also be expressed as L={an:n>=0}L = \{a^n : n >=0 \}.

These "old" regex only allow three operations:

  • Concatenation: ab Matches an a followed by a b
  • Alternation: (a|b) Matches a or b
  • Kleene Star/Closure: a* Matches zero or more a

And that's all. No more operations. If you know about regexes, you might wonder where are ALL THE OTHER operations. The + quantifier (match one or more), the character classes (like [a-zA-Z0-9] to match any word character), the capturing groups, etc.

Modern regex engines allow an extended syntax, but the core remains the same. If we build a regex engine for formal regex it can be extended to implement the modern features. But how do we do that?

Remember a paragraph ago when I told you that regex are a notation of regular languages and completely ignored the fact that you probably don't know what a REGULAR language is? I did it on purpose, and that detail is the key to solve the puzzle. Yes, I'm an expert in narrative foreshadowing. Netflix give me a call.

A regular language is a language that can be recognized by a finite automaton. So that's all. To build a regex engine just build a finite automaton. Hope you liked the post, don't forget to click the subscribe button and the notification bell. BYEEEEEE!....... Wait, what's an automaton?

Finite automaton and regular languages

In an informal way, an automaton is like a state machine. It has a number of states and transitions that define the behaviour of the machine. It reads an input string one character at a time and each one makes the machine go through a transition. That process is often called consuming a character, because each character makes a single change in the machine. An automaton recognizes a regular language LL if it's able to determine if any word ww belongs to LL. This is exactly what we want!

If you feel that this definition is still kind of useless, don't worry. In the next post I'll explain in detail what a NFA (Non Deterministic Finite Automata) is and how to implement it. In the last post I'll talk about why I choose a non deterministic automata instead of a deterministic one (DFA).

Optional - A (slightly) better explanation of automata

An automaton is an abstraction of a computer. It has an input, that is divided into atomic cells. In a real computer this atoms are bits. In regular expressions this atoms are characters. The control unit implements the logic of the machine and consists of a series of internal states. The machine reads the input one cell at a time from left to right and decides what to do based on it's current state, the input and a memory (if it has one).

Image blatantly stolen from "An Introduction To Formal Languages (6th edition)"

Here's an animation of how a NFA machine for the regex (?:a∣b)a+(?:a∣b)(?:a|b)a+(?:a|b) computes the input string baaaab. The machine starts in state q0q0 and changes states as it consumes input. The ε\varepsilon transition mean it can change state without consuming input. I'll explain more about NFAs in the next post.

The animation is a bit fast, since it's a gif. If you want to see each step you can use the website I made to generate this animations.

There are several classes of automaton and each one of them defines a type of language. These types form a hierarchy called the "Chomsky hierarchy". From less to more powerful they are:

  • Finite State Automata (FSM): Regular languages
  • Pushdown automata: Context-free languages
  • Linear bounded automaton: Context-sensitive languages
  • Turing machine: Recursively enumerable languages

A regular language is also context-free, context-sensitive and recursively enumerable, so we could use any of the four types of automaton to recognize it. Still, using a turing machine would be an overkill (and inefficient) so we'll stick with finite state automata.

Congratulations. This is your reward for reading an optional section

To better understand what is and isn't a regular language let's focus on a single word: FINITE. The implications of an automata being finite are:

  1. It has a finite number of states
    • This doesn't mean it only matches finite strings. The simplest example is the regex a+a+ which matches an infinite number of a. If the states have a loop, the language is infinite. For example, this is the diagram for an automaton for a+a+:
  2. It doesn't have auxiliary memory. The only thing it can store is its current state.
    • To remember an input it needs to have an state that consumes specifically that input.

These two conditions define what is and isn't a regular language. Every language defined by a formal regex is a regular language, we already saw an example with the a+a+ diagram above. So what languages aren't regular? Here is the most common example:

L={anbn:n>=0}L = \{a^nb^n : n >=0 \}

This languages matches {ab,aabb,aaabbb,aaaabbbb,...}\{ab, aabb, aaabbb, aaaabbbb, ...\}. At first sight this doesn't look like a problem, after all regex can define infinite languages. But here the machine has to remember how many as it has found (the value of n) in order to match exactly the same number of bs. The only way a finite machine can remember something is by creating a state that represent that exactly information, so since n can be infinite the machine would need infinite states â›”.

This sounds a bit confusing, so let's look at specific values of n. If n is finite, then L(n)L(n) is a regular language and we can draw a diagram representing its automaton:

Diagram of the regex abab (n=1)
Diagram of the regex aabbaabb (n=2)
Diagram of the regex aaabbbaaabbb (n=3)

In each diagram there are n states with an a-transitions and n states with b-transitions. This is the way the machine "stores" the value of n. If n was infinite, the machine would need an infinite number of states.

There's a mathematical way to disprove that a language is regular, called the pumping lemma. I'm not going to talk about it, but if you are interested you can read the book "An introduction to Formal Language and Automata" (or... you know, just use google).

The language L={anbn:n>=0}L = \{a^nb^n : n >=0 \} might look like an useless example. Who wants to check that a string matches it? But if you swap the a for ( and the b for ) suddenly it's a real-world problem (called the balanced parenthesis problem). L={(n)n:n>=0}L = \{(^n)^n : n >=0 \} is not a regular language and therefore you can't use a (formal) regular expression to check if a string has balanced parenthesis.

I specified (formal) because as I said earlier modern regexes implement more complex features that go beyond regular languages. The most common example are backreferences, but C# also has balancing groups which solve the balanced parenthesis problem, and PCRE even implements recursion! 😮. The post "The true power of regular expressions" by Nikita Popov claims modern regex are able to match all context-free languages (the next tier of languages). I haven't used PCRE specific features, so I can't ensure it's true. All I can do is smile, nod, and accept it as a fact.

Even though formal regex are limited, NFA can be extended to support the features of modern regular expressions. In the following posts we'll begin implementing a NFA for formal regex and slowly build up the other features. Don't believe me? Here's a quote from the .NET documentation:

The .NET regular expression engine is a backtracking regular expression matcher that incorporates a traditional Nondeterministic Finite Automaton (NFA) engine such as that used by Perl, Python, Emacs, and Tcl.

-- Details of regular expression behavior - .NET docs

As a small spoiler, you can see the source code of the engine in Github, or even better, check the website I made where you can use it: https://danielbv.github.io/RegexEngine/

Season 1 - Building a Regex Engine (Completed)