A fast, portable grep
Grep is a unix tool that takes a regular expression on the command line and prints all lines from a file that contain a string matching the regular expression. Various ways of implementing regular expression matching are possible, but in a 1968 article, Ken Thompson describes an implementation that dynamically translates regular expressions into machine code. Thunder is a small library (a replacement for GNU Lightning) that makes dynamic code generation possible in a portable way, with backends for the x86 and the ARM at present. Thunder works by taking assembly-level code for an invented RISC-like register machine and transliterating each instruction into one or two instructions for the host machine. The goal of this project is to implement a fast version of Grep that uses Thompson's method of matching.
Thompson's paper contains a stack-based compiler written in Algol 60 that generates machine code for the IBM 7094, one of the first transistorised computers, with bizarre features such as index registers that are subtracted from the address rather than added. It's worth deciphering to appreciate the wicked cleverness of Thompson's program, which although it does not mention nondeterministic finite automata explicitly, was nevertheless the source of the construction that turns regular expressions into NFAs. Russ Cox has written a cheat sheet that contains just enough detail about the 7094 to read Thompson's paper, and I have some notes of my own that transliterate the machine code into a more modern style. Thompson's compiler uses a fiendish stack-based algorithm that exploits quirks of the way subroutines work on the 7094, but a simpler compiler can work from an AST and use Thunder's system of code labels to avoid explicit patching of branch destinations.
To learn to use Thunder, it's best to begin with the program fact.c that is included with the source of the Oxford Oberon Compiler, and contains both iterative and recursive implementations of the factorial function written as Thunder code. There is a tutorial that explains how it works.
For the project, it would be possible to prototype a regular expression compiler with OCaml using Lex and Yacc, and generate the object code in the form of C source like the
fact.c program. A smoother implementation would be written entirely in C, parsing regexps by recursive descent to build an AST, then using a recursive function written in C to translate to Thunder code. Thomson's algorithm cannot handle certain regular expressions like
(a*)* without going into an infinite loop, but it is always possible to simplify these expressions (in this case to
a*) to avoid the problem. This can be done during parsing or as a separate pass between parsing and code generation.
(Reduced Instruction Set Computer). A style of computer design where there are multiple, identical registers, arithmetic instructions that operate between registers, and separate load and store instructions with a limited set of addressing modes.