A fast, portable grep

Copyright © 2024 J. M. Spivey
Jump to navigation Jump to search

Grep is a unix tool that takes a regular expression on the command line and prints all lines from a file that contain a string matching the regular expression. Various ways of implementing regular expression matching are possible, but in a 1968 article, Ken Thompson describes an implementation that dynamically translates regular expressions into machine code. Thunder is a small library (a replacement for GNU Lightning) that makes dynamic code generation possible in a portable way, with backends for the x86 and the ARM at present. Thunder works by taking assembly-level code for an invented RISC-like register machine and transliterating each instruction into one or two instructions for the host machine. The goal of this project is to implement a fast version of Grep that uses Thompson's method of matching.

Thompson's paper contains a stack-based compiler written in Algol 60 that generates machine code for the IBM 7094, one of the first transistorised computers, with bizarre features such as index registers that are subtracted from the address rather than added. It's worth deciphering to appreciate the wicked cleverness of Thompson's program, which although it does not mention nondeterministic finite automata explicitly, was nevertheless the source of the construction that turns regular expressions into NFAs. Russ Cox has written a cheat sheet that contains just enough detail about the 7094 to read Thompson's paper, and I have some notes of my own that transliterate the machine code into a more modern style. Thompson's compiler uses a fiendish stack-based algorithm that exploits quirks of the way subroutines work on the 7094, but a simpler compiler can work from an AST and use Thunder's system of code labels to avoid explicit patching of branch destinations.

To learn to use Thunder, it's best to begin with the program fact.c that is included with the source of the Oxford Oberon Compiler, and contains both iterative and recursive implementations of the factorial function written as Thunder code. There is a tutorial that explains how it works.

For the project, it would be possible to prototype a regular expression compiler with OCaml using Lex and Yacc, and generate the object code in the form of C source like the fact.c program. A smoother implementation would be written entirely in C, parsing regexps by recursive descent to build an AST, then using a recursive function written in C to translate to Thunder code. Thomson's algorithm cannot handle certain regular expressions like (a*)* without going into an infinite loop, but it is always possible to simplify these expressions (in this case to a*) to avoid the problem. This can be done during parsing or as a separate pass between parsing and code generation.

Some hints

[1] To build the 32-bit Thunder, you need 32-bit libraries installed. The way to do that has had a checkered history on Debian-based systems (of which Ubuntu is an example), but currently the way seems to be

$ sudo apt-get install libc6-dev-i386

The symptom of missing libraries is that the configure script finds that gcc with the -m32 flag is unable to make executables.

[2] For the parser, it's not necessary to have a separate lexer, because "tokens" in the syntax of regular expressions are almost all single characters. The parser can consist of a family of mutually recursive functions, one for each nonterminal in the grammar of regexps, and can work with an input string that is already entirely in memory: in fact, it is probably one of the argument strings passed to the main program when it starts. The parsing functions can communicate via a global variable chp that always points to the next character of the input string, and the duty of the function

regexp p_term(void);

is to expect chp to be pointing at the first character of a term (whatever that is), to advance chp until it points to the first character after the term, and to return the AST for the term as its result. It can recursively call functions like p_factor to recognise sub-expressions of the term. Allowing any expression in brackets to be a factor in a larger expression makes all the functions mutually recursive with each other. Parsing begins with a call to p_expr() when chp is pointing at the first character of the input regexp; p_expr will return with chp pointing at the terminating null character of the string, and the result returned will be a tree for the entire regexp.

[3] What I said in a recent conversation isn't true: there is no delay in the generation of native code on any of the platforms (x86, amd64, ARM, MIPS) currently supported by Thunder. Calls to the gen(...) routines immediately result in machine code output in all cases – not always the prettiest code imaginable, but good enough to be faster than interpretive schemes.

[4] Another page gives a simple translation scheme for regular expressions.

[5] A combinatorial explosion can be avoided by keeping track of the set of active states.