# [FOM] regular expressions

Vaughan Pratt pratt at cs.stanford.edu
Tue Oct 30 21:29:19 EDT 2007

```Adam Kolany wrote:
> I am sorry for possibly trivial questions:
>     1. Is the problem of equality of languages given by two regular
> expresions equal decidable?
>     2. Is there a (decidable) formal system for proving the above
> equalities?

Normally one would just answer this in the affirmative and point to any
good automata text such as Hopcroft and Ullman, Lewis and Papadimitriou,
Sipser, etc.  However this is a nice example of where there is a really
slick proof that is essentially categorical in nature (exhibit the
initial automaton and then the final automaton in the category of all
deterministic automata accepting L) yet which doesn't require knowing
any category theory to appreciate.  (Another such is the proof that two
algebras freely generated by sets of the same cardinality are
isomorphic.)  I believe the late Joe Goguen was the first to point this
out.  I like this story enough that I can't resist telling it here.

For this purpose a deterministic automaton is any graph with set Q of
vertices, possibly infinite, with one start state and an arbitrary set
of final states (possibly empty, possibly all of Q), such that there are
exactly |S| edges leaving each state, called the *transitions* from that
state, each labeled with a different letter of the alphabet S.  Two or
more of the edges from one state may go to the same state (this is the
usual dispensation permitted for directed graphs whose edges are
labeled---without labels graph theorists only allow one edge per ordered
pair of vertices though category theorists don't feel so constrained).
There are no epsilon transitions, every edge must be labeled with a
single letter.

The crucial propositions are as follows.

Proposition 1.  Given any language L, as a subset of S*, there exists a
canonical deterministic automaton A_L recognizing L.

(L need not be regular, or recursive, or recursively enumerable, it can
be any language whatsoever.)

Proposition 2.  Any regular expression denoting a language L can be
effectively transformed into A_L with a number of states at most
exponential in the size of the regular expression.

For Proposition 1, take an infinite tree T all of whose vertices are of
out-degree the cardinality s of the alphabet S.  All are of in-degree 1
save the root vertex which is of in-degree zero.  There are no
leaves---all paths are infinite.

Label the s edges leaving each vertex of T with the respective s letters
of S.

Label the vertices of T with words from S* according to the labels on
the path to that vertex from the root.  Thus the root will be labeled
epsilon (the empty word), the s vertices immediately below the root will
be labeled with the s one-letter words, and the s^n vertices n levels
below the root will be labeled with the s^n n-letter words, for all
natural numbers n (which is why every path from the root must be infinite).

Make T an automaton by taking its start state to be the root vertex of T
and its set of final states to be the vertices of T labeled with members
of L.

It should be clear that T is a deterministic automaton accepting L.  (It
is in fact the initial such in a suitable category of all such.)

Now associate to every vertex v of T the language w\L defined as the set
of those words x in S* such that wx is in L, where w is the word
labeling v per the above procedure.  Call two vertices with labels v and
w *equivalent* when v\L = w\L, that is, when the two subtrees with
respective roots v and w are made equivalent automata by taking the
respective subtree's root to be the start state in each case.  (The
final states remain marked as originally.)

One can show that the automaton obtained by identifying equivalent
vertices is the final automaton A_L in the category of all deterministic
automata accepting L.  It has the further property that *any* automaton
accepting L can be turned into the final one by identifying states (the
sense in which it is final).  This is crucial for the last stage of the
construction in Proposition 2.

It is further the case that A_L as defined above is finite if and only
if L is regular.  (The initial automaton is always infinite as long as S
is nonempty, the usual though not necessary assumption in automata
theory.)  This automaton is the canonical deterministic finite state
automaton A_L recognizing L, as promised by Proposition 1.  (It is also
the minimal deterministic such, subject to a remark below.)

Proposition 2 is shown by a series of standard reductions.  First
transform the regular expression into a nondeterministic finite state
automaton A.  (The end product should have no epsilon transitions; a
popular two-step approach first produces one with epsilon transitions
and then eliminates them, but it can also be done in one step.)  Then
"determinize" A by taking the set of states of A' to be the power set
2^Q of the set Q of states of A.  Exercise: given a subset X of Q,
determine the unique subset Y of Q that the transition from X labeled s
must go to.  Then determine which subsets of Q should be the final
states of A' and which the start state of A'.

The last step is to identify the equivalent states of A'.  (This is far
from the fastest way but it is the easiest to describe.)  Start out with
the very coarse equivalence relation such that two states are equivalent
just when either both or neither are final (so there are only two
equivalence classes, or one when all or no states are final).  Now
refine this to the (new) equivalence relation holding between (old)
equivalent states that also have (old) equivalent successor states.
That is, if q is equivalent to q' before, and for all s the two states
reached by the s-transitions from q and q' are equivalent before, then q
and q' are still equivalent after, the new equivalence relation.  Refine
to convergence (i.e. when nothing changes), necessarily in at most |2^Q|
steps since no relation can be finer than the identity relation.  Now
identify equivalent states.  The result is A_L.

To decide if two regular expressions are equivalent (denote the same
language L), compute their respective canonical automata and test
whether they are isomorphic.  (Isomorphism of automata is trivial to
decide in comparison with graph isomorphism: start by pairing up the
initial states, then the successor states of the initial states, and so
on.  With graph isomorphism one does not have any good starting place in
general.)

Note that this A_L is not necessarily minimal in the standard sense used
in automata textbooks because some authors define "deterministic" to
mean only that *at most* one s-transition leaves each state, whereas the
definition needed for the above treatment calls for *exactly* one
s-transition.  The difference is that the A_L produced above will in
general (but not always) include a state that leads to no final state;
the other definition eliminates that state (but no other) out of a sense
of efficiency.

Also note that the initial and final automata are always connected and
countable (provided the alphabet is countable), but that other automata
in the category of automata accepting L need be neither.  These are
points that are almost never made in automata theory texts but that
should appeal to FOM subscribers.

============

Regarding the second question on formal systems, there are two good
axiom systems, each with its own merits.

1.  Kleene algebras.  This is a quasiequational theory which proves
exactly the identities of regular algebra, such as a** = a*, a(ba)* =
(ab)*a, and so on.  The first nontrivial result about axiomatizing these
identities is due to V. Redko in 1967 who showed that no finite set of
equations could axiomatize them.  Arturo Salomaa then gave an axiom
schema for the theory.  In 1971 John Conway (of Life and surreal-numbers
fame) wrote a book on the subject which includes the observation that
the equational theory includes a four element algebra in which aa = a
yet a* is not equal to a, i.e. a nonstandard model.  In 1990 Dexter
Kozen gave a finite equational theory together with a quasiequation
expressing induction, the equations of which were the desired equations
and which did not have Conway's nonstandard model.  Kozen defined a
Kleene algebra to be any model of this quasiequational (universal Horn)
theory.  For more details see the relevant Wikipedia article at
http://en.wikipedia.org/wiki/Kleene_algebra .

I'm embarrassed to say I don't know what's known about decidability of
Kleene algebra.  The Wikipedia article doesn't address the question.

2.  Action algebras.  In 1990 I posed and answered affirmatively the
question of whether the equational theory of regular expressions had a
finitely axiomatizable equational conservative extension.  That is,
could there be a superset of the regular operations and a finitely
axiomatized theory thereof, such that those equations in the language of
regular expressions were exactly the regular identities defined by
Kleene and studied by Redko, Salomaa, etc.  The extension amounts to
adding "guards" or "tests" to the language of regular expressions, so
one can think of the language of action algebra as guarded regular
expressions.  The two new constructs are a --> b and a <-- b, understood
as assertions "if a before then b now" and "a now if ever b in the
future".  Concatenation ab is viewed as conjunction ("a and then b")
while a+b is disjunction ("a or b").  The crucial axiom pinning down *
is (a --> a)* |- a --> a, where |- can be read as <= (less-or-equal),
with a <= b in turn being viewed as an abbreviation for a+b = b.  (Or
more simply (a --> a)* = a --> a which is equivalent but less elegant,
asserting that a --> a is its own reflexive transitive closure.)
Conway's nonstandard model is killed off by this equational theory, and
moreover in every model of the theory a* is reflexive transitive closure
in the sense that a is reflexive when 1 <= a and transitive when aa <=
a.  For more details see the relevant Wikipedia article at
http://en.wikipedia.org/wiki/Action_algebra .

It is a long-standing (17 years) open problem whether action algebra is
decidable.  This has been looked at by a number of algebraists, mainly
in Europe, so far without success.

Vaughan Pratt
```