From Regular Expressions to DFA's Using Compressed NFA's

11:30 a.m., Thursday, August 20, 1992
719 Broadway, 12th fl. conference room

We show how to turn a regular expression R of length r into an O(s) space representation of McNaughton and Yamada's NFA, where s is the number of occurrences of alphabet symbols in R, and s+1is the number of NFA states. The standard adjacency list representation of McNaughton and Yamada's NFA takes up 1 + 2s + s2 space in the worst case. The adjacency list representation of the NFA produced by Thompson takes up between 2r and 6r space, where r can be arbitrarily larger than s. Given any subset V of states in McNaughton and Yamada's NFA, our representation can be used to compute the set U of states one transition away from the states in V in optimal time O(|V| + |U|). McNaughton and Yamada's NFA requires $\Theta(|V| \times |U|)$ time in the worst case. Using Thompson's NFA, the equivalent calculation requires $\Theta(r)$ time in the worst case.

An implementation of our NFA representation confirms that it takes up an order of magnitude less space than McNaughton and Yamada's machine. An implementation to produce a DFA from our NFA representation by subset construction shows linear and quadratic speedups over subset construction starting from both Thompson's and McNaughton and Yamada's NFA's.

It also shows that the DFA produced from our NFA is as much as one order of magnitude smaller than DFA's constructed from the two other NFA's.

An UNIX egrep compatible software called cgrep based on our NFA representation is implemented. A benchmark shows that cgrep is dramatically faster than both UNIX egrep and GNU e?grep.

Throughout this thesis the importance of syntax is stressed in the design of our algorithms. In particular, we exploit a method of program improvement in which costly repeated calculations can be avoided by establishing and maintaining program invariants. This method of symbolic finite differencing has been used previously by Douglas Smith to derive efficient functional programs.