[FOM] regular expressions

Vaughan Pratt pratt at cs.stanford.edu
Tue Oct 30 21:29:19 EDT 2007


Adam Kolany wrote:
> I am sorry for possibly trivial questions:
>     1. Is the problem of equality of languages given by two regular
> expresions equal decidable?
>     2. Is there a (decidable) formal system for proving the above
> equalities?

Normally one would just answer this in the affirmative and point to any 
good automata text such as Hopcroft and Ullman, Lewis and Papadimitriou, 
Sipser, etc.  However this is a nice example of where there is a really 
slick proof that is essentially categorical in nature (exhibit the 
initial automaton and then the final automaton in the category of all 
deterministic automata accepting L) yet which doesn't require knowing 
any category theory to appreciate.  (Another such is the proof that two 
algebras freely generated by sets of the same cardinality are 
isomorphic.)  I believe the late Joe Goguen was the first to point this 
out.  I like this story enough that I can't resist telling it here.

For this purpose a deterministic automaton is any graph with set Q of 
vertices, possibly infinite, with one start state and an arbitrary set 
of final states (possibly empty, possibly all of Q), such that there are 
exactly |S| edges leaving each state, called the *transitions* from that 
state, each labeled with a different letter of the alphabet S.  Two or 
more of the edges from one state may go to the same state (this is the 
usual dispensation permitted for directed graphs whose edges are 
labeled---without labels graph theorists only allow one edge per ordered 
pair of vertices though category theorists don't feel so constrained). 
There are no epsilon transitions, every edge must be labeled with a 
single letter.

The crucial propositions are as follows.

Proposition 1.  Given any language L, as a subset of S*, there exists a 
canonical deterministic automaton A_L recognizing L.

(L need not be regular, or recursive, or recursively enumerable, it can 
be any language whatsoever.)

Proposition 2.  Any regular expression denoting a language L can be 
effectively transformed into A_L with a number of states at most 
exponential in the size of the regular expression.

For Proposition 1, take an infinite tree T all of whose vertices are of 
out-degree the cardinality s of the alphabet S.  All are of in-degree 1 
save the root vertex which is of in-degree zero.  There are no 
leaves---all paths are infinite.

Label the s edges leaving each vertex of T with the respective s letters 
of S.

Label the vertices of T with words from S* according to the labels on 
the path to that vertex from the root.  Thus the root will be labeled 
epsilon (the empty word), the s vertices immediately below the root will 
be labeled with the s one-letter words, and the s^n vertices n levels 
below the root will be labeled with the s^n n-letter words, for all 
natural numbers n (which is why every path from the root must be infinite).

Make T an automaton by taking its start state to be the root vertex of T 
and its set of final states to be the vertices of T labeled with members 
of L.

It should be clear that T is a deterministic automaton accepting L.  (It 
is in fact the initial such in a suitable category of all such.)

Now associate to every vertex v of T the language w\L defined as the set 
of those words x in S* such that wx is in L, where w is the word 
labeling v per the above procedure.  Call two vertices with labels v and 
w *equivalent* when v\L = w\L, that is, when the two subtrees with 
respective roots v and w are made equivalent automata by taking the 
respective subtree's root to be the start state in each case.  (The 
final states remain marked as originally.)

One can show that the automaton obtained by identifying equivalent 
vertices is the final automaton A_L in the category of all deterministic 
automata accepting L.  It has the further property that *any* automaton 
accepting L can be turned into the final one by identifying states (the 
sense in which it is final).  This is crucial for the last stage of the 
construction in Proposition 2.

It is further the case that A_L as defined above is finite if and only 
if L is regular.  (The initial automaton is always infinite as long as S 
is nonempty, the usual though not necessary assumption in automata 
theory.)  This automaton is the canonical deterministic finite state 
automaton A_L recognizing L, as promised by Proposition 1.  (It is also 
the minimal deterministic such, subject to a remark below.)

Proposition 2 is shown by a series of standard reductions.  First 
transform the regular expression into a nondeterministic finite state 
automaton A.  (The end product should have no epsilon transitions; a 
popular two-step approach first produces one with epsilon transitions 
and then eliminates them, but it can also be done in one step.)  Then 
"determinize" A by taking the set of states of A' to be the power set 
2^Q of the set Q of states of A.  Exercise: given a subset X of Q, 
determine the unique subset Y of Q that the transition from X labeled s 
must go to.  Then determine which subsets of Q should be the final 
states of A' and which the start state of A'.

The last step is to identify the equivalent states of A'.  (This is far 
from the fastest way but it is the easiest to describe.)  Start out with 
the very coarse equivalence relation such that two states are equivalent 
just when either both or neither are final (so there are only two 
equivalence classes, or one when all or no states are final).  Now 
refine this to the (new) equivalence relation holding between (old) 
equivalent states that also have (old) equivalent successor states. 
That is, if q is equivalent to q' before, and for all s the two states 
reached by the s-transitions from q and q' are equivalent before, then q 
and q' are still equivalent after, the new equivalence relation.  Refine 
to convergence (i.e. when nothing changes), necessarily in at most |2^Q| 
steps since no relation can be finer than the identity relation.  Now 
identify equivalent states.  The result is A_L.

To decide if two regular expressions are equivalent (denote the same 
language L), compute their respective canonical automata and test 
whether they are isomorphic.  (Isomorphism of automata is trivial to 
decide in comparison with graph isomorphism: start by pairing up the 
initial states, then the successor states of the initial states, and so 
on.  With graph isomorphism one does not have any good starting place in 
general.)

Note that this A_L is not necessarily minimal in the standard sense used 
in automata textbooks because some authors define "deterministic" to 
mean only that *at most* one s-transition leaves each state, whereas the 
definition needed for the above treatment calls for *exactly* one 
s-transition.  The difference is that the A_L produced above will in 
general (but not always) include a state that leads to no final state; 
the other definition eliminates that state (but no other) out of a sense 
of efficiency.

Also note that the initial and final automata are always connected and 
countable (provided the alphabet is countable), but that other automata 
in the category of automata accepting L need be neither.  These are 
points that are almost never made in automata theory texts but that 
should appeal to FOM subscribers.

============

Regarding the second question on formal systems, there are two good 
axiom systems, each with its own merits.

1.  Kleene algebras.  This is a quasiequational theory which proves 
exactly the identities of regular algebra, such as a** = a*, a(ba)* = 
(ab)*a, and so on.  The first nontrivial result about axiomatizing these 
identities is due to V. Redko in 1967 who showed that no finite set of 
equations could axiomatize them.  Arturo Salomaa then gave an axiom 
schema for the theory.  In 1971 John Conway (of Life and surreal-numbers 
fame) wrote a book on the subject which includes the observation that 
the equational theory includes a four element algebra in which aa = a 
yet a* is not equal to a, i.e. a nonstandard model.  In 1990 Dexter 
Kozen gave a finite equational theory together with a quasiequation 
expressing induction, the equations of which were the desired equations 
and which did not have Conway's nonstandard model.  Kozen defined a 
Kleene algebra to be any model of this quasiequational (universal Horn) 
theory.  For more details see the relevant Wikipedia article at 
http://en.wikipedia.org/wiki/Kleene_algebra .

I'm embarrassed to say I don't know what's known about decidability of 
Kleene algebra.  The Wikipedia article doesn't address the question.

2.  Action algebras.  In 1990 I posed and answered affirmatively the 
question of whether the equational theory of regular expressions had a 
finitely axiomatizable equational conservative extension.  That is, 
could there be a superset of the regular operations and a finitely 
axiomatized theory thereof, such that those equations in the language of 
regular expressions were exactly the regular identities defined by 
Kleene and studied by Redko, Salomaa, etc.  The extension amounts to 
adding "guards" or "tests" to the language of regular expressions, so 
one can think of the language of action algebra as guarded regular 
expressions.  The two new constructs are a --> b and a <-- b, understood 
as assertions "if a before then b now" and "a now if ever b in the 
future".  Concatenation ab is viewed as conjunction ("a and then b") 
while a+b is disjunction ("a or b").  The crucial axiom pinning down * 
is (a --> a)* |- a --> a, where |- can be read as <= (less-or-equal), 
with a <= b in turn being viewed as an abbreviation for a+b = b.  (Or 
more simply (a --> a)* = a --> a which is equivalent but less elegant, 
asserting that a --> a is its own reflexive transitive closure.) 
Conway's nonstandard model is killed off by this equational theory, and 
moreover in every model of the theory a* is reflexive transitive closure 
in the sense that a is reflexive when 1 <= a and transitive when aa <= 
a.  For more details see the relevant Wikipedia article at 
http://en.wikipedia.org/wiki/Action_algebra .

It is a long-standing (17 years) open problem whether action algebra is 
decidable.  This has been looked at by a number of algebraists, mainly 
in Europe, so far without success.

Vaughan Pratt


More information about the FOM mailing list