Building an Autocomplete Library for ANTLR4 Part 2 - Writing the core

After the painstakingly long introduction, it's about time we wrote some code. I'll begin by defining the interface of the autocompleter, and then slowly develop the core, which is just an algorithm that exhaustively searches all the paths of the ATN.

The interface

I'll begin by defining a class Autocompleter:

class Autocompleter {
    constructor(Lexer, Parser, options) {
        this._lexer = Lexer;
        this._parser = Parser;
        this.options = options;
    }
}

It takes the lexer and parser classes (NOT the instances). I guess you could take the instances directly, but doing it this way allows me to control the lexing and replace the default ConsoleErrorListener with a listener that throws an error if it finds an unexpected token.

I also added an options parameter, which will be an object that allows configuring the autocompleter. This will come in handy later.

Now we need the method that will be called to autocomplete. It should have two parameters: the text that has already been written, and the position you want to autocomplete at, which is usually called THE CARET (I hope that writing it in bold and upper case has made you read it in your mind with a really epic voice).

For the sake of simplicity I'll assume that it's always the last position, but don't worry, I'll come back to this in another post... probably.

autocomplete(input) {
    const chars = antlr4.CharStreams.fromString(input);
    const lexer = new this._lexer(chars);
    lexer.removeErrorListeners();
    lexer.addErrorListener(new ThrowErrorListener());
    const all = lexer.getAllTokens();
    const tokenList = all.filter(x => x.channel === 0);

The first step is obtaining a list of tokens, which we'll use to traverse the ATN. This requires instantiating the lexer, replacing the error listener, and filtering out any token that isn't in the default channel, since ANTLR4 does the same thing.

The point of replacing the listener is that ANTLR4's default behaviour of silently ignoring strings it doesn't know how to tokenize can be confusing.

class ThrowErrorListener extends antlr4.error.ErrorListener {
  syntaxError(recognizer, offendingSymbol, line, column, msg, e) {
      throw new Error("line " + line + ":" + column + " " + msg);
  }
}

(Yes, I should probably throw a custom-made and more specific type of error)
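
For instance, a more specific error type could carry the position as structured data instead of burying it in the message. The class below is just a possible shape, not what the library actually ships:

class TokenizationError extends Error {
    // Illustration only: keeps line/column as fields so callers don't
    // have to parse them back out of the message string.
    constructor(line, column, msg) {
        super(`line ${line}:${column} ${msg}`);
        this.name = 'TokenizationError';
        this.line = line;
        this.column = column;
    }
}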

When the algorithm is traversing the states it will need to know if it just reached the caret. A nice (and hacky) way to do this is to append a special token to the list of tokens. There will be a single instance of this token, stored in a global variable, so we can check whether a token is the caret with token === caret.

// At the top of the file 
import antlr4 from 'antlr4';

const caret = new antlr4.CommonToken();
//...

autocomplete(input) {
    //...
    const tokenList = all.filter(x => x.channel === 0);
    tokenList.push(caret);

Now, to traverse the ATN we need to get a reference to it. For some reason the JavaScript implementation of ANTLR4 requires instantiating the parser even though the ATN is a "static" (hidden by the scope) variable.

const parser = new this._parser(lexer);

We can get the ATN by doing parser.atn, but... what's the initial state? It's the first state of the initial rule. And what's the initial rule?... No idea. The thing is, ANTLR4 grammars don't define an initial rule. For example, in this trivial grammar:

first: 'A';

second: 'B';

You might assume first is the initial rule, since it's... well, the first one. But when you call the ANTLR4 parser you can just as easily say myParser.second(), and it will parse the text as if second were the starting rule. So we need the user to explicitly tell us which rule is the starting one. To refer to a rule we use its index, which can be found as a static attribute of the parser class, like MyParser.RULE_first.

const parser = new this._parser(lexer);
const startingRule = this.options?.initialRule ?? 0;
const initialState = parser.atn.ruleToStartState[startingRule];

If the user doesn't specify a rule, it will assume the starting rule is the first one. To get the initial state of a rule we can use the convenient ruleToStartState map of the ATN. Finally, we need to traverse the machine and return the results:

const suggestions = [];
process(tokenList, 0, initialState, suggestions);
return suggestions;

The 0 is the index into the stream of tokens. For a first implementation I'll write a recursive algorithm so it's easier to understand, but at the end of the post I'll transform it into an iterative function.
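
Before diving into the core, here is how calling the autocompleter could look from user code. MyLexer and MyParser are placeholders for the classes the ANTLR4 tool generates for your grammar:

// Hypothetical usage; MyLexer and MyParser stand in for your generated classes.
import MyLexer from './generated/MyLexer.js';
import MyParser from './generated/MyParser.js';

const autocompleter = new Autocompleter(MyLexer, MyParser, {
    initialRule: MyParser.RULE_second, // optional; defaults to rule 0
});
// Returns token type numbers: the tokens that may follow "B"
const suggestions = autocompleter.autocomplete('B');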

The core of the autocompleter: "process()"

process() is the method that will non-magically traverse the ATN. It's given a list of tokens, the number of tokens already consumed (tokenStreamIndex), the current state and a list of suggestions. The suggestion list is a little trick so that I can simply push suggestions there instead of concatenating the suggestions generated in every possible path.

The first step is to take the current state and iterate through its transitions:

function process(tokens, tokenStreamIndex, state, suggestions) {
    state.transitions.forEach(it => {
        // This will handle all transitions types
    });
}

Epsilon (ε) Transitions

Let's begin with the simplest type: epsilon transitions. There's some confusion here I want to clear up: there is a class called EpsilonTransition, but other types of transitions, like RuleTransition, are also considered epsilon even though they aren't instances of that class.

The formal definition of an epsilon transition is a transition that doesn't consume any input. For now we'll handle all the types of epsilon transitions the same way, but this will need some updating in the following posts.

state.transitions.forEach(it => {
    if (it.isEpsilon) {
        process(tokens, tokenStreamIndex, it.target, suggestions);
    }
});

Since it doesn't consume any input, the tokenStreamIndex doesn't increase. The only thing that changes is the current state, which now is the target of the transition.

But there's already one problem: since the transitions don't consume any input, there's a chance the machine could get stuck in a loop of epsilon transitions. To avoid this, we'll keep track of the states already reached through epsilon transitions (alreadyPassed).

As long as it resets this list after it consumes any input, skipping repeated states won't affect the final result. This is because the algorithm does an exhaustive search, traversing all paths looking for all possible token suggestions, and if it has already inspected a state at a certain tokenStreamIndex, it doesn't need to calculate it again. In fact, we could cache this, like antlr4-c3 does.

function process(tokens, tokenStreamIndex, state, suggestions, alreadyPassed) {
    state.transitions.forEach(it => {
        if (it.isEpsilon && !alreadyPassed.includes(it.target.stateNumber)) {
            process(tokens, tokenStreamIndex, it.target, suggestions, [it.target.stateNumber, ...alreadyPassed]);
        }
    });
}
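
As a side note, here's a minimal sketch of what that caching could look like: skip any (state, tokenStreamIndex) pair the search has already fully explored. This is just the general idea, not how antlr4-c3 actually implements it:

// Sketch of the caching idea mentioned above (general idea only; the real
// antlr4-c3 cache is more elaborate). 'visited' would have to be reset at
// the start of every autocomplete() call.
const visited = new Set();

function process(tokens, tokenStreamIndex, state, suggestions, alreadyPassed) {
    const key = `${state.stateNumber}:${tokenStreamIndex}`;
    if (visited.has(key)) {
        return; // Already explored: its suggestions were pushed the first time
    }
    visited.add(key);
    // ...traverse the transitions exactly as before...
}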

Nice, but it still doesn't suggest anything. Let's begin with the fun part: AtomTransitions.

AtomTransition

These transitions match a single type of token. For example, the rule r: 'A' 'B' 'C'; contains three atom transitions.

The ATN of the rule r: 'A' 'B' 'C';

To handle this transition we get the next token with tokens[tokenStreamIndex] (note that tokenStreamIndex is the index of the first non-consumed token). Then, there are three scenarios:

  1. The next token is the caret. In this case the current transition is a suggestion, so we only need to add the expected token to the list of suggestions.
  2. The next token isn't the caret, but it is the token that the transition expects. In this case we keep executing the ATN.
  3. It's neither. This is completely normal. The autocompleter tries every possibility, which means that a lot of the time it will take the wrong path.

import AtomTransition from 'antlr4/src/antlr4/transition/AtomTransition.js';

//...

} else if (it instanceof AtomTransition) {
    const nextToken = tokens[tokenStreamIndex];
    if (nextToken === caret) {
        suggestions.push(...intervalToArray(it.label));
    } else if (it.label.contains(nextToken.type)) {
        process(tokens,
            tokenStreamIndex + 1, // Increase the index because it has consumed a token
            it.target,
            suggestions,
            // 'alreadyPassed' is now empty because it just consumed a token, so there's no longer
            // a risk of getting stuck in an infinite loop.
            []
        );
    }
}

If you have looked carefully at the code, you might be a bit confused by the "label". The label refers to the token that the transition matches. The name comes from the fact that when you draw a state machine you put a label above the transition arrow (like in the picture above).

In ANTLR4 this "label" is an IntervalSet object. As the name implies, an IntervalSet is composed of a set of intervals, even though an AtomTransition always has a single interval containing a single token.

This object allows multiple intervals because other types of transitions can match non-adjacent tokens. For example, you could have an IntervalSet that matches the tokens from [12-15] and the tokens from [20-25] (internally ANTLR4 refers to tokens by their number, rather than by their name).

The function intervalToArray is implemented like this:

function intervalToArray(intervalSet) {
  const values = [];
  const n = intervalSet.intervals.length;
  for (let i = 0; i < n; i++) {
    const interval = intervalSet.intervals[i];
    const a = interval.start;
    const b = interval.stop;
    // Careful: intervals are open on the right, that's why < and not <=
    for (let v = a; v < b; v++) {
      values.push(v);
    }
  }

  return values;
}
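
To make the open-right convention concrete, here's what the function returns for the two-interval example from before, using a plain object that mimics the shape of an IntervalSet:

// Hypothetical illustration: a plain object shaped like ANTLR4's IntervalSet.
// Remember that 'stop' is exclusive in the JS runtime.
const set = {
    intervals: [
        { start: 12, stop: 16 }, // matches tokens 12..15
        { start: 20, stop: 26 }, // matches tokens 20..25
    ],
};
console.log(intervalToArray(set));
// -> [12, 13, 14, 15, 20, 21, 22, 23, 24, 25]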

I must admit that I grabbed the function from antlr4-c3 🙈. Nevertheless, we can already see some useful results!

The testing approach

While I was researching for these posts, I was inspired by the tests of the antlr4-autosuggest library. That library shares some things with this one and with antlr4-c3, but it goes above and beyond and also uses the lexer's ATN, though I'm not sure that's really worth doing.

But back to the point of testing: since that library is implemented in Java, it can generate lexers and parsers on the fly. This one uses JavaScript (I feel tempted to make a joke like "the markup language", or "the language that isn't a programming language" 😜). Despite that, I decided I wanted my tests to be as clean as theirs, so I made an implementation that calls the console to execute the ANTLR4 tool.

await givenGrammar("r: 'A' 'B' 'C';").whenInput("A").thenExpect("B");

(The argument of thenExpect is just a representation for testing. Internally, the autocompletion engine returns the NUMBER of the token, not the string.)

The internal implementation of thenExpect is a bit cumbersome, but it does the job and the tests look really nice. The point of this post is to talk about the algorithm, not the testing, but if you are curious you can see the source code in the file GrammarTest in the GitHub repository of the library.
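
If you're wondering what the fluent interface boils down to, here's a simplified sketch assuming a Jest-style expect. It is not the actual GrammarTest code; generateParser and tokenName are hypothetical helpers that would invoke the ANTLR4 tool and map token numbers back to names:

// Simplified sketch, not the real GrammarTest implementation.
// 'generateParser' and 'tokenName' are hypothetical helpers.
function givenGrammar(grammar) {
    return {
        whenInput(input) {
            return {
                async thenExpect(expected) {
                    const { Lexer, Parser } = await generateParser(grammar);
                    const suggestions = new Autocompleter(Lexer, Parser).autocomplete(input);
                    const names = suggestions.map(type => tokenName(Parser, type));
                    expect(names).toEqual([].concat(expected)); // accepts "B" or ["B", "C"]
                },
            };
        },
    };
}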

SetTransition

In ANTLR4, when you write an alternative of tokens like (A|B|C), it gets translated into a single SetTransition. The good thing about these is that they follow exactly the same structure as AtomTransition, though in this case they can match more than one token.

import SetTransition from 'antlr4/src/antlr4/transition/SetTransition.js';

//...

} else if (it instanceof AtomTransition || it instanceof SetTransition) {
    const nextToken = tokens[tokenStreamIndex];
    //...

That's all. Finally something easy:

await givenGrammar("r: 'A' ('B'|'C'|'D') EOF;").whenInput("A").thenExpect(["B","C","D"]);

Optionals and quantifiers

When you use a quantifier like ?, * or +, it generates a series of epsilon transitions that produce the expected behaviour. The approach is similar to Thompson's construction in regular expressions. I talked about it in one of my posts about building a regex engine.

So quantifiers work out of the box:

await givenGrammar("r: 'A'? ('B'|'C') EOF;").whenInput("").thenExpect(["A", "B","C"]);
await givenGrammar("r: 'A' ('B'|'C')* 'D' EOF;").whenInput("A").thenExpect(["B","C", "D"]);
await givenGrammar("r: 'A'+ ('B'|'C') EOF;").whenInput("AAAAA").thenExpect(["A", "B", "C"]);

And their lazy versions too:

await givenGrammar("r: 'A' 'B'?? 'C' EOF;").whenInput("A").thenExpect(["C", "B"]);
await givenGrammar("r: A*? B; A: 'A'; B:'B';").whenInput("A").thenExpect(["B", "A"]);
await givenGrammar("r: A+? B; A: 'A'; B:'B';").whenInput("A").thenExpect(["B", "A"]);

As a fun implementation fact, the lazy quantifiers have the same number of transitions, just in a different order (again, I explained this in my series of posts about regex #spam). The idea is that the ATN follows the transitions in a certain order, and since the parser stops at the first valid result, altering the order also alters the final result.

In the case of the autocompleter, it keeps trying all paths even after one succeeds. But the order of the transitions affects the order of the suggested tokens. That's why r: A+ B; returns ["A", "B"] but r: A+? B; returns ["B", "A"].

Optional - The ATN in quantifiers

ANTLR4's Javadoc already has some cool documentation on the ATNs produced by quantifiers. There's no point in repeating each specific case here, so let's study just one: the greedy closure (...)*.

In case the URL of the image disappears, or at least to see it with the same style as the other ATNs, this is the ATN for the rule r: 'A'*;

The ATN of the rule r: 'A'*

I'm really curious about why it needs so many intermediate states; the StarBlockStartState and StarLoopbackState seem a bit redundant. Thompson's construction also has a lot of epsilon transitions because it works in a way that makes all the pieces fit together seamlessly, so I guess this is similar.

In the end ANTLR4 caches the ATN into a DFA (a simpler and faster type of state machine), so I guess that way it skips the unnecessary epsilon transitions. That won't be the case for this autocompletion system 😅, but it won't be a problem.

Recursion to Iteration

Before we add the remaining types of transitions it's better to transform the recursion into an iterative algorithm. The current algorithm would crash for any non-trivial text, since it calls the function recursively for each transition it traverses.

Let's make a rough estimate: even ignoring epsilon transitions, the algorithm traverses one transition per token. This means that if you try to autocomplete a text with 300 tokens, the stack will be 300 calls deep. And if you also take into account epsilon transitions... 🧨.

ANTLR4 probably doesn't have this issue because it doesn't execute the whole ATN at once. As far as I know, it only calls the ATN to calculate which rule it should follow next, and then returns control to the parser instance. However, this autocompleter executes the whole ATN, from beginning to end.

The change to iteration isn't too difficult, plus it'll give us a complete view of the method:

class Autocompleter {
    autocomplete(input) {
        //...
        const stack = [[0, initialState, []]];
        return process(tokenList, stack);
    }
}

function process(tokens, stack) {
    const suggestions = [];

    while (stack.length !== 0) {
        const [tokenStreamIndex, state, alreadyPassed] = stack.pop();

        // Iterate through the transitions in reverse order so that the first transition
        // is processed first (and therefore pushed last). This way, if the grammar says
        // '(A|B|C)', the autocompleter will suggest the tokens in that same order.
        for (let i = state.transitions.length - 1; i >= 0; i--) {
            const it = state.transitions[i];
            if (it.isEpsilon && !alreadyPassed.includes(it.target.stateNumber)) {
                stack.push([tokenStreamIndex, it.target, [it.target.stateNumber, ...alreadyPassed]]);
            } else if (it instanceof AtomTransition || it instanceof SetTransition) {
                const nextToken = tokens[tokenStreamIndex];
                if (nextToken === caret) {
                    suggestions.push(...intervalToArray(it.label));
                } else if (it.label.intervals.some(x => x.contains(nextToken.type))) {
                    stack.push([
                        tokenStreamIndex + 1, // Increase the index because it has consumed a token
                        it.target,
                        // 'alreadyPassed' is now empty because it just consumed a token, so there's
                        // no longer a risk of getting stuck in an infinite loop.
                        []
                    ]);
                }
            }
        }
    }
    return suggestions;
}

As in every transformation to iteration, instead of calling process it now pushes the arguments onto the stack. The tokenList is common to all the calls, so there's no need to put it on the stack. Plus, I removed the suggestions from the stack, since it's now cleaner to define a single list at the top of process().

In the next episode of "Really long posts that you probably don't finish..."

I wonder how many people will reach this point. This post is already quite long, so I'll continue in the next one. The autocompleter is already slightly useful, but there are still two important points to tackle. First, there are other types of transitions that we haven't covered yet, and second, there's the topic of rules...

If you were to ask whether rules work right now, the answer is "kind of". The algorithm uses epsilon transitions to move to another rule, so we already have that covered. But there's more to it.

await givenGrammar(`r: a | b;
    a: 'A';
    b: 'B';
`).whenInput("").thenExpect(["A", "B"]);

This does work, but the example is too simple. In the following posts we'll see in detail what ANTLR4 does with rules and, more importantly, how it returns to the caller after a rule has been completed.
