# [FOM] regular expressions

Vaughan Pratt pratt at cs.stanford.edu
Wed Oct 31 02:49:21 EDT 2007

```Adam Kolany wrote:
> I am sorry for possibly trivial questions:
>     1. Is the problem of equality of languages given by two regular
> expresions equal decidable?

Normally one would just answer this in the affirmative and point to any
good automata text such as Hopcroft and Ullman, Lewis and Papadimitriou,
Sipser, etc.  However this is a nice example of where there is a really
slick proof that is essentially categorical in nature (exhibit the
initial automaton and then the final automaton in the category of all
deterministic automata accepting L) yet which doesn't require knowing
any category theory to appreciate.  (Another such is the proof that two
algebras freely generated by sets of the same cardinality are
isomorphic.)  I believe the late Joe Goguen was the first to point this
out.  I like this story enough that I can't resist telling it here.

For this purpose a deterministic automaton is any graph with set Q of
vertices, possibly infinite, with one start state and an arbitrary set
of final states (possibly empty, possibly all of Q), such that there are
exactly |S| edges leaving each state, called the *transitions* from that
state, each labeled with a different letter of the alphabet S, and such
that every state is reachable by a path from the start state.  Two or
more of the edges from one state may go to the same state (in contrast
to the directed graphs of graph theory which only allow one edge per
ordered pair of vertices). There are no epsilon transitions, every edge
must be labeled with a single letter.

The crucial propositions are as follows.

Proposition 1.  Given any language L, as a subset of S*, there exists a
canonical deterministic automaton A_L recognizing L.

(L need not be regular, or recursive, or recursively enumerable, it can
be any language whatsoever.)

Proposition 2.  For any regular expression denoting a language L, the
A_L associated to L by Proposition 1 is finite, with a number of states
at most exponential in the size of the regular expression, and can be
obtained from the regular expression in time at most quadratic in the
size of A_L.

Corollary.  Equivalence of regular expressions is decidable in
exponential time.  (Meyer and Stockmeyer obtained in 1974 the tighter
result that the problem is PSPACE-complete.)

To show Proposition 1, take an infinite tree T all of whose vertices are
of out-degree the cardinality s of the alphabet S.  All are of in-degree
1 save the root vertex which is of in-degree zero.  There are no
leaves---all paths are infinite.

Label the s edges leaving each vertex of T with the respective s letters
of S.

Label the vertices of T with words from S* according to the labels on
the path to that vertex from the root.  Thus the root will be labeled
epsilon (the empty word), the s vertices immediately below the root will
be labeled with the s one-letter words, and the s^n vertices n levels
below the root will be labeled with the s^n n-letter words, for all
natural numbers n (which is why every path from the root must be infinite).

Make T an automaton by taking its start state to be the root vertex of T
and its set of final states to be the vertices of T labeled with members
of L.

It should be clear that T is a deterministic automaton accepting L.  (It
is in fact the initial such in a suitable category of all such.)

Now associate to every vertex v of T the language w\L defined as the set
of those words x in S* such that wx is in L, where w is the word
labeling v per the above procedure.  Call two vertices with labels v and
w *equivalent* when v\L = w\L, that is, when the two subtrees with
respective roots v and w are made equivalent automata by taking the
respective subtree's root to be the start state in each case.  (The
final states remain marked as originally.)

One can now show that the automaton obtained by identifying equivalent
vertices is the final automaton A_L in the category of all deterministic
automata accepting L.  Finality establishes the canonical property
promised by Proposition 1.

Both the meaning and significance of finality for A_L is that *any*
automaton accepting L can be turned into A_L by identifying states.
This is key to the last stage of the construction in Proposition 2.

It is further the case that A_L as defined above is finite if and only
if L is regular.  (The initial automaton is always infinite as long as S
is nonempty, the usual though not necessary assumption in automata
theory.)  This automaton is the canonical deterministic finite state
automaton A_L recognizing L, as promised by Proposition 1.  (It is also
the minimal deterministic such, subject to a remark below.)

Proposition 2 is shown by a series of standard reductions.  First
transform the regular expression into a nondeterministic finite state
automaton A.  (The end product should have no epsilon transitions; a
popular two-step approach first produces one with epsilon transitions
and then eliminates them, but it can also be done in one step.)  Then
"determinize" A by taking the set of states of A' to be the power set
2^Q of the set Q of states of A.  Exercise: given a subset X of Q,
determine the unique subset Y of Q that the transition from X labeled s
must go to.  Then determine which subsets of Q should be the final
states of A' and which the start state of A'.

The last step is to identify the equivalent states of A'.  (This is far
from the fastest way but it is the easiest to describe.)  Start out with
the very coarse equivalence relation such that two states are equivalent
just when either both or neither are final (so there are only two
equivalence classes, or one when all or no states are final).  Now
refine this to the (new) equivalence relation holding between (old)
equivalent states that also have (old) equivalent successor states. That
is, if q is equivalent to q' before, and for all s the two states
reached by the s-transitions from q and q' are equivalent before, then q
and q' are still equivalent after, the new equivalence relation.  Refine
to convergence (i.e. when nothing changes), necessarily in at most |2^Q|
steps since no relation can be finer than the identity relation.  Now
identify equivalent states.  The result is A_L.

To decide if two regular expressions are equivalent (denote the same
language L), compute their respective canonical automata and test
whether they are isomorphic.  (Isomorphism of automata is trivial to
decide in comparison with graph isomorphism: start by pairing up the
initial states, then the successor states of the initial states, and so
on.  With graph isomorphism one does not have any good starting place in
general.)

Note that this A_L is not necessarily minimal in the standard sense used
in automata textbooks because some authors define "deterministic" to
mean only that *at most* one s-transition leaves each state, whereas the
definition needed for the above treatment calls for *exactly* one
s-transition.  The difference is that the A_L produced above will in
general (but not always) include a state that leads to no final state;
the other definition eliminates that state (but no other) out of a sense
of efficiency.

============
>     2. Is there a (decidable) formal system for proving the above
> equalities?

The short answer is that no finite axiomatization is known that is
decidable.  If you omit decidable then there two good axiom systems,
each with its own merits.

1.  Kleene algebras.  A Kleene algebra is an idempotent semiring with
identity 1 and zero 0 satisfying 1+x+x*x* = x*, xy <= y  -->  x*y = y,
and yx <= y  -->  yx* = y.  All these conditions are equations save the
last two which are quasiequations (universal Horn clauses).  The
equations of this theory are exactly the identities of regular algebra,
such as a** = a*, a(ba)* = (ab)*a, and so on, namely what you asked for.
The first nontrivial result about axiomatizing these identities is due
to V. Redko in 1967 who showed that no finite set of equations could
axiomatize them.  Arturo Salomaa then gave an axiom schema for the
theory.  In 1971 John Conway (of Life and surreal-numbers fame) wrote a
book on the subject which includes the observation that the equational
theory includes a four element algebra in which aa = a yet a* is not
equal to a, i.e. a nonstandard model.  In 1990 Dexter Kozen gave a
finite equational theory together with a quasiequation expressing
induction, the equations of which were the desired equations and which
did not have Conway's nonstandard model.  Kozen defined a Kleene algebra
to be any model of this quasiequational (universal Horn) theory.  For
more details see the relevant Wikipedia article at
http://en.wikipedia.org/wiki/Kleene_algebra .

The equational fragment of Kleene algebras is decidable as per the first
section above, but Redko showed that fragment has no standalone finite
axiomatization.  The system axiomatized is not decidable, a common
outcome when passing from an equational theory to a larger
quasiequational theory.

2.  Action algebras.  In 1990 I posed and answered affirmatively the
question of whether the equational theory of regular expressions had a
finitely axiomatizable equational conservative extension.  That is,
could there be a superset of the regular operations and a finitely
axiomatized equational theory thereof, such that those equations in the
language of regular expressions are exactly the regular identities you
to the language of regular expressions, so one can think of the language
of action algebra as guarded regular expressions.  The two new
constructs are a --> b and a <-- b, understood as assertions "if a
before then b now" and "a now if ever b in the future".  Concatenation
ab is viewed as conjunction ("a and then b") while a+b is disjunction
("a or b").  The crucial axiom pinning down * is (a --> a)* |- a --> a,
where |- can be read as <= (less-or-equal), with a <= b in turn being
viewed as an abbreviation for a+b = b.  (Or more simply (a --> a)* = a
--> a which is equivalent but less elegant, asserting that a --> a is
its own reflexive transitive closure.) Conway's nonstandard model is
killed off by this equational theory, and moreover in every model of the
theory a* is reflexive transitive closure in the sense that a is
reflexive when 1 <= a and transitive when aa <= a.  For more details see
the relevant Wikipedia article at
http://en.wikipedia.org/wiki/Action_algebra .

It is a long-standing (17 years) open problem whether action algebra is
decidable.  This has been looked at by a number of algebraists, mainly
in Europe, so far without success.

Vaughan Pratt

```