Published on

Defining languages with sets - A horrible yet correct alternative to grammars

When you want to define a language you usually describe its syntax with a grammar. For example:

term:
    'true'
    | 'false'
    | '[+-]?[0-9]' //Integer number
    | 'plusOne' term
    | 'minusOne' term
    | 'isZero' term
    | 'if' term 'then' term 'else' term

This language includes plusOne 0, plusOne plusOne 0, if isZero minusOne 1 then 42 else -5, etc. The thing is, when I see a grammar I tend to focus on the parse trees it would produce with parser generators like ANTLR4. And the convenience of ANTLR4 makes me forget that the grammar doesn't define the parser, it defines the language (though yeah, it also defines the structure the parse trees will have).

            |
  /    /    |  \   \   \
if isZero then 42 else -5
     |
  minusOne
     |
     1

(Now I'm an ASCII artist. I'll turn this tree into a NFT)

In the first post of my series about making a regex engine, I wrote:

A language is just a set of strings, for example the regex a* defines the language L={ε,a,aa,aaa,aaaa...}L = \{\varepsilon, a, aa, aaa, aaaa...\} which can also be expressed as L={an:n>=0}L = \{a^n : n >=0 \}.

The same way the regex a* defines a set a grammar also defines one. And the same way you use the regex to build recognizer for LL, you use the grammar to build a recognizer (a.k.a parser) for the language it defines.

It's no coincidence that regex and grammars are so similar. Both are methods of defining formal languages, but regex are limited to regular languages, which are the lowest tier of languages in Chomsky's hierarchy. You can even use a regular grammar to define a regular language instead of a regex, but they are more inconvenient.

Defining a language with sets

Let's get to the point of this post. Given the above grammar of term, is it possible to define its language by directly defining the set of the language? Yes, though it's a bit harder than with regular expressions since grammar rules can be recursive. But you can still make an inductive definition:

S0={true,false}ZSi+1={true,false}Z{plusOne t1,minusOne t1,isZero t1  t1Si}{if t1 then t2 else t3  t1,t2,t3Si}\begin{array}{cccc} S_0 & = & \phantom{\cup} & \{true, false \} \cup \mathbb{Z} \\ S_{i+1} & = & & \{true, false \} \cup \mathbb{Z} \\ & & \cup & \{\textnormal{plusOne} \ t_1, \textnormal{minusOne} \ t_1, \textnormal{isZero} \ t_1 \ | \ t_1 \in S_i\} \\ & & \cup & \{\textnormal{if} \ t_1 \ \textnormal{then} \ t_2 \ \textnormal{else} \ t_3 \ | \ t_1, t_2, t_3 \in S_i\} \end{array}

S0S_0 includes literals (as a small reminder Z\mathbb{Z} is the set of integers). S1S_1 includes literals and calls to functions with a literal argument, like plusOne 0, I'll call them simple calls. S2S_2 contains again literals, simple calls and one level nested calls, like plusOne plusOne 0, S3S_3 contains literals, simple calls and two level nested calls, like if isZero minusOne 1 then 42 else -5. You get the idea, each level includes the next level of nested calls.

You might have noticed that every SiS_i contains the literals ({true,false}  Z\{true, false\} \ \cup \ \mathbb{Z}) and therefore simple calls too. This is necessary because otherwise you couldn't use them in SiS_i where i>1i > 1. Another approach would be doing something like L={true,false,0}L = \{true, false, 0\} and in every rule: t1SiLt_1 \in S_i \cup L, but that would be a bit confusing and I'm not sure if it makes sense mathematically.

Of course, each SiS_i is just a piece of the cake. The complete language is defined by the union of all these sets:

S=i=0Si\begin{aligned} S = \bigcup\limits_{i=0}^{\infty} S_{i} \end{aligned}

Depth and tree structure (part 1)

In a instinctive way I assumed that the set definition would list the strings that are contained in a language but, unlike grammars, it wouldn't define the structure of the parse trees. So it struck me when I realized the values contained in SiS_i are correlated to the depth of the parse trees those strings would generate.

i Tree depth
0 1
1 1, 2
2 1, 2, 3
3 1, 2, 3, 4
4 1, 2, 3, 4, 5
5 1, 2, 3, 4, 5, 6

For example, above I (horribly) drawed the tree of if isZero minusOne 1 then 42 else -5. It has a depth of 3, so the first SiS_i where it appears is S2S_2. Overall SiS_i contains all the strings with depth lower or equal than i+1i+1.

The reason that each SiS_i also contains trees of depth 0,1,2,3,...i0,1,2,3,... i is that each SiS_i contains literals, which have depth 1. So S2S_2 contains elements of depth 2 because S1S_1 contains elements of depth 1 (literals). And therefore S3S_3 also contain elements of depth 3. You get the idea. I'll make an attempt for a formal proof of this in a collapsible below.

So what's the point of all of this? That even though the complete set SS doesn't contain structure, the different values of SiS_i have traces of it. And it makes sense, since I defined the sets using the grammar as a base. For example, if instead of:

term:
    ...
    | 'if' term 'then' term 'else' term;

I wrote:

term:
    ...
    | 'if' term 'then' term elseClause;

elseClause: 'else' term;

The set SS would have a different definition but its value would be the same. The values of SiS_i would change to reflect the different definition, but the union of all SiS_i wouldn't change.

But before we can see this, we need to understand how the definition of the set changes when the grammar has multiple rules.

Defining a grammar with multiple rules as a set

The example above, which I stole extracted from Types and Programming Languages, defines the set given a grammar with a single rule. But grammars usually have more than one. I guess that every grammar could be transformed into a single rule grammar, at the cheap cost of the sanity of the developer, so that's a way of generating the set of the language. But that's boring, I prefer to interpret it in a different way: Each rule defines it's own set, and therefore, it's own sublanguage.

Warning: Everything below here is completely made up by me, so it might not be mathematically correct. Feel free to scream at the screen while I destroy every mathematical rule in existence

Let's use a simple grammar with two rules.

expr:
    expr ('*'|'/') expr
    | expr ('+'|'-') expr
    | ID '(' arguments ')'
    | NUMBER, STRING, BOOL;

arguments:
    expr (',' arguments)+;

This is a bit tricky because there is recursion. So this recursion is also be reflected in the definition of the sets. EE will be the set of expressions and AA the set of arguments:

E0,0={NUMBER,STRING,BOOL}Ei+1, j={NUMBER,STRING,BOOL}{ t1t2, t1t2, t1t2, t1t2, t1,t2Ei,j}{<ID>( t1 ) t1Ai+1, j}A0,0=Ai+1, j={t1 ’,’ t2  t1Ei,t2Ai+1,j1}{ t1 t1Ei, j}\begin{array}{cccc} E_{0,0} & = & \phantom{\cup} & \{\textnormal{NUMBER}, \textnormal{STRING}, \textnormal{BOOL}\} \\ E_{i+1,\ j} & = & & \{\textnormal{NUMBER}, \textnormal{STRING}, \textnormal{BOOL}\} \\ & & \cup & \{\ t_1 \textnormal{*} \ t_2, \ t_1 \textnormal{/} \ t_2, \ t_1 \textnormal{+} \ t_2, \ t_1 \textnormal{-} \ t_2, | \ t_1, t_2 \in E_{i,j}\} \\ & & \cup & \{\textnormal{<ID>} ( \ t_1 \ ) | \ t_1 \in A_{i+1, \ j}\} \end{array} \begin{array}{cccc} A_{0,0} & = & \phantom{\cup} & \emptyset \\ A_{i+1, \ j} & = & & \{ t_1 \ \textnormal{'},\textnormal{'} \ t_2 \ | \ t_1 \in E_i, t_2 \in A_{i+1, j-1} \} \\ & & \cup & \{\ t_1 | \ t_1 \in E_{i, \ j}\} \\ \end{array}
E=i=0 j=0Ei,j\begin{aligned} E = \bigcup\limits_{i=0 \ j=0}^{\infty} E_{i,j} \end{aligned}

There are several really important details:

  • Here there are two variables, ii and jj. This is because I want to separate the length of the arguments set from its depth.
    • A2,5A_{2,5} contains up to 5 arguments in E1,5E_{1, 5}.
    • A5,2A_{5, 2} contains up to 2 arguments in E4,2E_{4,2}.
    • I could have used a single variable ii to denote both width and depth, but that would break the argument that I plan to do on the next section. See the optional section below for more details.
  • The definition of Ei+1, jE_{i+1, \ j} depends on Ai+1, jA_{i+1, \ j}, and at the same time Ai+1, jA_{i+1, \ j} depends on Ei, jE_{i,\ j}. This might look like a circular definition, but the arguments set depends on the previous level of the expression sets. So it's more like a spiral definition 😎.
    • For it to work, Ei,jE_{i, j} must be a subset of Ei+1, jE_{i+1, \ j}. Otherwise there could be values that are valid for the language but aren't included in the set. I added a "proof" of this at the end of this section.
  • In the definition of Ai+1, jA_{i+1, \ j}, the { t1 t1Ei,j}\{\ t_1 | \ t_1 \in E_{i,j}\} is the same as just saying   Ei,j\ \cup \ E_{i,j}. So Ei,jE_{i,j} is a subset of Ai+1, jA_{i+1,\ j}. I use this syntax to keep it closer to the grammar definition.
  • I decided to leave A0A_0 as an empty set, though I guess I could just made A0=E0A_0 = E_0, but that would be confusing since the overall rule is that EiAi+1E_i \subset A_{i+1} but there's no E1E_{-1}

And as before, the value of ii is correlated to the depth of the parse tree. So even though my definition of the sets might be wrong, it doesn't look horribly wrong. I hope...😇

             |
   foo ( ArgumentsNode )                                                <----- E4 (& A5) (The value of every j is 2,
        |                                                                                 so I'll ignore it)
            var ( ArgumentsNode )                                       <----- E3 (& A4)
                        |
                     xz ( ArgumentsNode )                               <----- E2 (& A3)
                                |
                             yw ( ArgumentsNode )                       <----- E1 (& A2)
                                      /   \
                                     3     2                            <----- E0 (& A1)
Optional - More details on why I separated i and j

One of the main points I want to make of this post is that, in the same way two grammars can denote the same language but generate parse trees with different structure, two sets can be equal but have different definitions of each EiE_i change. So even though EE doesn't contain any structure, the values of each EiE_i show traces of the structure, even if that's not enough to make a parse tree.

And the basis of this hipothesis is that ii always changes with depth. Previously I had define AiA_i like this:

A0=Ai+1={t1 ’,’ t2  t1Ei,t2Ai+1}{ t1 t1Ei}\begin{array}{cccc} A_{0} & = & \phantom{\cup} & \emptyset \\ A_{i+1} & = & & \{ t_1 \ \textnormal{'},\textnormal{'} \ t_2 \ | \ t_1 \in E_i, t_2 \in A_{i+1} \} \\ & & \cup & \{\ t_1 | \ t_1 \in E_{i}\} \\ \end{array}

The problem is that the ii doesn't only affect the values of the arguments it can use (EiE_i), but also the number of arguments. Therefore a simple call like foo(1,2,3,4,5,6,7,8) would be in A8A_8 even though all of the arguments are in E0E_0. By having Ei,jE_{i,j} the previous call is in E1,8E_{1,8} and the ii remains unaffected by the number of arguments.

In a way the jj also denotes the structure of the parse tree. By allowing an arbitrary number of arguments in the same ii it's like if I had define the grammar like this:

arguments:
    expr (',' arguments)+;

While correlating the depth and width in the same ii would be like defining it like this:

arguments:
    expr ',' arguments | expr;

In this case the value of ii is the same as the depth of the tree. That's because unlike the first example, here you can't write a call like foo(3) till E2,1E_{2,1}. But in S1S_1 you could already have plusOne 0. In case you were wondering, E1,1E_{1,1} does contains arithmetic calls, like 3 + 2. But since those don't call the arguments rule they aren't interesting for this example, so I casually ignored them.

Optional - Are E_i and A_i subsets of E_i+1 and A_i+1?

The current definition of Ai+1, jA_{i+1, \ j} only works in the assumption that Ei,j  Ei+1, jE_{i,j} \ \subset \ E_{i+1, \ j}. Otherwise, the fact that Ai+1A_{i+1} uses values of Ei,jE_{i,j} would mean that there are values that the set wouldn't accept but that are valid in the grammar.

To check this my first thought was: What happens if I grab a really long nested tree like:

a(                                   )    <----- E8 (& A9) (j = 1 at every level)
    |
    b(                               )    <----- E7 (& A8)
        |
        c(                           )    <----- E6 (& A7)
            |
            d(                       )    <----- E5 (& A6)
                |
                e(                   )    <----- E4 (& A5)
                    |
                    f(               )    <----- E3 (& A4)
                        |
                        g(           )    <----- E2 (& A3)
                            |
                            h(       )    <----- E1 (& A2)
                                 |
                                 3        <----- E0 (& A1)

And I try to add another shorter argument i(j(k(4))):

i(             )    <----- E3 (& A4)  (j = 1 at every level)
    |
    j(         )    <----- E2 (& A3)
        |
        k(     )    <----- E1 (& A2)
            |
            4       <----- E0 (& A1)

Since i(j(k(4))) is in E3,1E_{3,1}, adding it as a first argument would create a string that is in the language but not in the set: a(i(j(k(4))), b(c(d(e(f(g(h(3)))))))), right? In an unexpected turn of events, no. Even though the first ii in which i(j(k(4))) is contained is E3,1E_{3,1}, since every Ei,jE_{i,j} can access the literals, it means that it's also defined in E4,1E_{4,1}:

i(             )    <----- E4 (& A4)  (j = 1 at every level)
    |
    j(         )    <----- E3 (& A3)
        |
        k(     )    <----- E2 (& A2) Note that in the previous tree the k() was in E1, not in E2
            |
            4       <----- E0 (& A1 & A2)

And in E5,1E_{5,1}:

i(             )    <----- E5 (& A4)   (j = 1 at every level)
    |
    j(         )    <----- E4 (& A3)
        |
        k(     )    <----- E3 (& A2) Now k() is in E3
            |
            4       <----- E0 (& A1 & A2 & A3)

Even though this isn't a mathematical proof it infers that Ei,j  Ei+1, j , Ai,j  Ai+1, jE_{i,j} \ \subset \ E_{i+1, \ j} \ , \ A_{i,j} \ \subset \ A_{i+1, \ j} for every i>0i > 0.

I'll try a probably wrong attempt at a formal proof using the depth of the trees. For simplicity I'll ignore the function calls, and the jj with it. Since jj is the maximum number of arguments allowed, and the arguments include every possible combination of Ei,jE_{i,j} I think it's decently obvious that Ei,jEi,j+1E_{i,j} \subset E_{i,j+1} (also when I wrote this I still hadn't divided ii in ii and jj and I don't want to rewrite half of the post for the hundredth time).

As a shortcut I'll define td as a tree depth operation and L={NUMBER,STRING,BOOL} L = \{\textnormal{NUMBER}, \textnormal{STRING}, \textnormal{BOOL}\}

Fist let's define how depth increases with every ii.

txEi  td(tx)={0if txLmax(td(t1),td(t2))+1  t1,t2Ei1if tx∉L\forall t_x \in E_{i} \ \ td(t_x) = \begin{cases} 0 &\text{if } t_x \in L \\ \text{max} (td(t_1), td(t_2)) + 1 \ | \ t_1, t_2 \in E_{i-1} & \text{if } t_x \not \in L \end{cases}

Then, if EiE_i contains all the possible values with depth xx, then Ei+1E_{i+1} contains all the possible values with depth x+1x+1. And since E0E_0 contains all the elements of depth 1, then E1E_1 contains all the elements of depth 2.

P(X)=t1,t2Ei/ max(td(t1),td(t2))=d  tzEi+1 / tz is formed by t1,t2     if Ei contains all possible elements of depth d then Ei+1 contains every element of depth d+1I(X)=ti / td(ti)=0,tiE0P(X) & I(X)    tx/ td(tx)=2 , txE1\begin{aligned} & P(X) = \forall t_1, t_2 \in E_i / \ max(td(t_1),td(t_2)) = d \ \exists \ t_z \in E_{i+1} \ / \ t_z \text{ is formed by } t_1, t_2 \\ & \implies \text{ if } E_i \text{ contains all possible elements of depth } d \text{ then } E_i+1 \text{ contains every element of depth } d+1 \\ & \\ & I(X) = \forall t_i \ / \ td(t_i) = 0, t_i \in E_0 \\ & \\ & P(X) \ \& \ I(X) \implies \forall t_x / \ td(t_x) = 2 \ , \ t_x \in E_1 \\ \end{aligned}

We could keep this going and infer that E2E_2 contains all elements of depth 3, etc. But here's the twist. By definition E0E_0 is a subset of every EiE_i, and therefore every EiE_i also contains trees of depth 2:

P(X) & I(X) &  i E0Ei    i>0,  tx/ td(tx)=2,txEi\begin{aligned} P(X) \ \& \ I(X) \ \& \ \forall \ i \ E_0 \subset E_i \implies \forall i > 0, \ \forall \ t_x / \ td(t_x) = 2 , t_x \in E_i \end{aligned}

And by induction, every EiE_i where i>1i > 1 also contains all the elements of depth 3.

P(X) &  tx/ td(tx)=2,txEi    i>1,tx/ td(tx)=3,txEi\begin{aligned} P(X) \ \& \ \forall \ t_x / \ td(t_x) = 2 , t_x \in E_i \implies \forall i > 1, t_x / \ td(t_x) = 3 , t_x \in E_i \end{aligned}

The overall conclusion is that each EiE_i contains all the elements from depth 1 to i+1:

i,tz/ td(tz){0,1,2...i+1},  tzEi\forall i, t_z / \ td(t_z) \in \{0, 1, 2... i+1\}, \ \ t_z \in E_i

And therefore EiEi+1E_i \subset E_{i+1} and AiA_i won't miss any value.

(Yes, I really need to review math notation... and math in general. I fell like I'm making everything up)

Depth and tree structure (part 2)

With this random made up theory, let's go back to how the way you define the set affects the generated parse tree.

term2:
    'true'
    | 'false'
    | '[+-]?[0-9]' //Integer number
    | 'plusOne' term
    | 'minusOne' term
    | 'isZero' term
    | 'if' term 'then' term elseClause;

elseClause:
    term 'else' term;

I'll call T the set defined by term2 and E the set defined by elseClause. The result is similar to the sets created by expr and arguments, but luckily this one doesn't allow an arbitrary number of arguments, so there is no need for a jj.

T0={true,false}ZTi+1={true,false}Z{plusOne t1,minusOne t1,isZero t1  t1Ti}{if t1 then t2  t3  t1,t2Ti ,t3Ei+1}E0=Ei+1={ else t1 t1Ti}\begin{array}{cccc} T_0 & = & \phantom{\cup} & \{true, false \} \cup \mathbb{Z} \\ T_{i+1} & = & & \{true, false \} \cup \mathbb{Z} \\ & & \cup & \{\textnormal{plusOne} \ t_1, \textnormal{minusOne} \ t_1, \textnormal{isZero} \ t_1 \ | \ t_1 \in T_i\} \\ & & \cup & \{\textnormal{if} \ t_1 \ \textnormal{then} \ t_2 \ \ t_3 \ | \ t_1, t_2\in T_i \ , t_3 \in E_{i+1}\} \end{array} \begin{array}{cccc} E_0 & = & \phantom{\cup} & \emptyset \\ E_{i+1} & = & & \{\ \textnormal{else} \ t_1 | \ t_1 \in T_i\} \end{array}
T=i=0Ti\begin{aligned} T = \bigcup\limits_{i=0}^{\infty} T_{i} \end{aligned}

For the input if isZero 1 then 42 else plusOne -5 the generated tree would be:

             |
   if isZero then 42 elseClause             <----- T2
        |                |
        1               else plusOne        <----- E2 (because plusOne -5 is in T1)
                                |
                                -5          <----- T0

With this new grammar the parse tree has a depth of 3, but with the first one it was 2. And accordingly, this string is in T2T_2 but in S1S_1.

I guess it makes sense since the induction rule (the definition of Si+1 S_{i+1} \ /  Ti+1\ T_{i+1}) is almost identical to the grammar. It's just a different notation.

Still, even though SiS_i is correlated to the depth of the tree, it doesn't define the specific structure. For example, it can't differenciate between these two:

         |                    |
      /  |  \              /  |  \
      3  *   +            *   +   1
           /   \        /  \
           2   1        3   2

Both would be in the same EiE_i but that tells you nothing about which is the correct one.

So after all this long post, it turns out the set definition is not as powerful as the grammar? Not exactly, even though the value of ii tells you nothing about the structure, you could still take an arbitrary decision based on how the set is defined:

{ t1t2, t1t2, t1t2, t1t2, t1,t2Ei}\{\ t_1 \textnormal{*} \ t_2, \ t_1 \textnormal{/} \ t_2, \ t_1 \textnormal{+} \ t_2, \ t_1 \textnormal{-} \ t_2, | \ t_1, t_2 \in E_i\}

The same way ANTLR4 prioritizes the subrules by their order, I could decide that since I wrote the multipication and division first the * has priority. Still, at this point I think I'm rambling too much.

The end

After all this experiments I've realized that defining languages as a set isn't that different as using a grammar. In fact I guess you could write a parser generator that takes these definitions as inputs. Though grammars are way more convenient. And of course there are probably a million things I haven't taken into account.

I guess my next project will be a parser generator from set defintion! Relax, I'm just joking... or am I?