Timestamp sets

Copyright © 2024 J. M. Spivey
Jump to navigation Jump to search

In Thompson's matcher for regular expressions, certain regular expressions such as \(a^*a^*a^*a^*\) can give rise to a combinatorial explosion in the number of states that are stored, because the same state can be stored multiple times in a list, and in each cycle of the machine, each item of the list can give rise to several items in a new list. A solution to this problem is to keep, alongside the list of states \({\it nlist}\) for the next cycle also the set of states \(\it nset\) that are on the list. Keeping both the list and the set allows us to iterate over the list efficiently, while checking using the set that no state is put on the list more than once. The operation \({\it NNODE}(x)\) that presently performs \({\it nlist} := {\it nlist}\mathbin{++}[x]\) can be replaced with

if \(x \notin {\it nset}\) then \({\it nlist} := {\it nlist}\mathbin{++}[x]\) and \({\it nset} := {\it nset}\cup\{x\}\).

A suitable implementation of sets is outlined below, giving \(O(1)\) performance for all operations, including the crucial step of resetting \(\it nset\) for each successive cycle.

Timestamp sets

[math]\displaystyle{ }[/math]This is a clever implementation of an abstract data type of sets \(S\) of small integers, with operations \(S := \empty\) and \(S := S \cup \{x\}\) and a test \(x \in S\). We keep an array \(v[x]\) indexed by potential elements \(x\) and a timestamp \(t > 0\), with the invariant that \(0 \leq v[x] \leq t\) for all \(x\); initially \(v[x]=0\). The abstraction function is that \(x \in S\) if and only if \(v[x] = t\).

We can implement the operation \(S := S \cup \{x\}\) by setting \(v[x] := t\), and of course the test \(x \in S\) is implemented simply by \(v[x] = t\). The clever part is that we can reset \(S := \empty\) simply by incrementing \(t\); all timestamps previously placed on elements \(x\) can remain unchanged, but no longer indicate that they are members of the set, until they are updated with the new timestamp.

In the regexp matcher, we can use the address of a state as the value of \(x\) for that state, and we can use the value of the pointer to the next character in the text as the timestamp.