A regexp translation scheme

Copyright © 2024 J. M. Spivey
Jump to navigation Jump to search

Presented here is a translation from regular expressions to pseudocode for a matching subroutine. The generated code is suitable for use in an implementation of grep: it looks for an instance of the regular expression as a substring at every place within the text, and reports success or failure. For other applications, a different kind of matching is needed that looks for an instance of the regular expression only at the start of the text, possibly finding the longest instance that appears as a prefix of the text. Such applications can be accommodated by modifying the translation scheme shown here.

This translation scheme uses four scalar variables that can be kept in registers:

Char – the current character
Cloc – the current location in CLIST
Cmax – the last element of CLIST
Nmax – the last element of NLIST

Cloc and Cmax point into an array of addresses that is the current CLIST, and Nmax points into another array of addresses that is the current NLIST. Other registers will be needed for temporary values, including registers R1 and R2 that are used to hold the arguments of the NNODE and CNODE operations.

Global variables clist and nlist point to the beginning of these two arrays; they are swapped when the matching machine moves to the next character, with the current NLIST becoming the new CLIST, and the space occupied by the old CLIST being recycled for the new NLIST. The variables nlist and clist, and another variable ptr that points to the next character in the text, are less frequently used and can live in memory.

Both NLIST and CLIST are held as arrays of code addresses, not arrays of instructions as in Thompson's original scheme. This avoids having to generate instructions dynamically during matching, but it forces a bit of overhead from indirection: instead of branching to the current location in CLIST, the FAIL operation instead fetches an element of CLIST and branches to it indirectly: that is the meaning of the statement

goto *Cloc++;

written in the matching code. The current CLIST always ends (at the address held in register Cmax) with the the label XCHG, so that we don't have to test in implementing FAIL whether the list has been exhausted: if so, control transfers to XCHG automatically.

The matching routine

The outline of the matching routine is as follows, with a placeholder for the code implementing a specific regular expression.

function match(text):
    ptr = text;
    Nmax = nlist;

next:
    // Swap clist and nlist
    Cmax = Nmax;
    Cloc = nlist; Nmax = clist;
    clist = Cloc; nlist = Nmax;
    *Cmax = xchg;
    
    // Fetch next character      
    Char = *ptr++;

    // -------------------------
    [ translate(regexp, found) ]
    // -------------------------

found:
    return TRUE;

nnode:
    *Nmax++ = R1;

fail:
    goto *Cloc++;

cnode:
    *Cmax++ = R2;
    *Cmax = xchg;
    goto R1;

xchg:
    if (Char != 0) goto next;
    return FALSE;

Note that after fetching the next character of the text (at label next), this routine begins again to match the regular expression from the start, in addition to continuing to explore previous partial matches represented by the contents of NLIST from the previous cycle. Also, matching goes on until the input string is exhausted, and doesn't stop when the set of active states becomes empty.

Translating regular expressions

For the specific matching code in the middle of the routine, each regular expression is translated by a recursive routine

translate(regexp e, label ok)

into code that branches to a label ok on success and to a fixed label FAIL on failure, and can call the subroutines cnode or nnode to generate alternatives to be tried on the current character or on the next character respectively. We write NNODE(lab1) for the sequence

R1 = lab1; goto nnode

and CNODE(lab1, lab2) for the sequence

R1 = lab1; R2 = lab2; goto cnode;

The subroutine at nnode never "returns", but rather saves its argument and then explores the next alternative from CLIST; the subroutine at cnode saves lab2 as an alternative and returns to lab1.

Translations of each kind of regexp follow.

Literal character a
    if (Char != a) goto fail;
    NNODE(ok);
Sequencing e1 e2
    [ translate(e1, lab1) ]
lab1:
    [ translate(e2, ok) ]
Alternation e1 | e2
    CNODE(lab1, lab2);
lab1:
    [ translate(e1, ok) ]
lab2:
    [ translate(e2, ok) ]
Closure e1 *
using the equivalence e1 * = eps | e1 e1*
lab1:
    CNODE(ok, lab2);
lab2:
    [ translate(e1, lab1) ]
Empty string eps
    goto ok;
Empty language NULL
    goto fail;

Refinements

The translation scheme can easily be extended to support other kinds of regular expression, such as the dot . that matches any single character, classes of characters [a-z] (conveniently represented by small bitmaps), and beginning and end of line anchors ^ and $.

Some regular expressions, for example the expression a*a*a*a*a*a* given in Thompson's paper, can lead to an explosion in the number of states recorded in the NLIST, because the same state can be recorded multiple times. Thompson's solution to this is to scan the NLIST whenever a state is to be added to it, and make the addition only if the state is not already present. A better implementation of this idea uses a data structure for sets that provides operations to add an element, test for membership and (crucially) to clear the set, each in O(1) time. A suitable data structure is a vector of timestamps: to add an element, we set its entry in the vector to the current time; to test membership, we see if the entry shows the current time; and to make the set empty, we need only to increment the current time, and all existing entries then become invalid. It is quite easy to implement this scheme with a third array kept alongside CLIST and NLIST.

Thunder support

Using this translation scheme depends on being able to put the address of any label in a register, and being able to branch to a code address held in a register. Thunder provides specific operations for these two actions: there is an instruction generated by

vm_gen(MOV, r, lab);

where r is a register (type vmreg) and lab is a label (type vmlabel). The instruction places the address associated with lab into the register r. There also an instruction generated by

vm_gen(JUMP, r);

that transfers control to such a label value. These instructions are not generally used for the main purpose of Thunder, namely acting as a target for JIT translators from the code of an abstract machine, but they are available for use in translation schemes like the one described above.