Lemma 3: Let T be a cluster tree and let Z be an instantiation of T. Let a and b be symbols of T. Let N be the least common ancestor of a and b in T. Then od( Z(a), Z(b)) = odiam(Z(N)).
Proof: Since Z(a) and Z(b) are elements of Z(N), it follows from the definition of odiam that od( Z(a), Z(b)) <<= odiam(Z(N)). Suppose the inequality were strict; that is, od( Z(a), Z(b)) << odiam(Z(N)). Then let C be the set of all the symbols c of T such that od(Z(a),Z(c)) <<= od(Z(a),Z(b)). Then odiam(Z(C)) = od(Z(a),Z(b)) << odiam(Z(N)). It is easily shown that Z(C) is a cluster in Z(T). Therefore, by property (ii) of definition 5, there must be a node M such that M.symbols = C. Now, M is certainly not an ancestor of N, since odiam(Z(M)) << odiam(Z(N)) but M.symbols contains both a and b. But this contradicts the assumption that N was the least common ancestor of a and b.
Corollary 4: Let T be a cluster tree and let Z be an instantiation of T. Let a,b,c,d be symbols of T. Let N be the least common ancestor of c and d in T, and let M be the least common ancestor of a and b in T. Then od( Z(a), Z(b)) << od( Z(c), Z(d)) if and only if M.label < N.label.
Proof: Immediate from lemma 3 and property (iii) of definition 5 of instantiation.
Lemma 5: Let S be any set of constraints of the form od(a,b) << od(c,d). Let H be the connected components of the shorts of S. If S is consistent, then not every edge of S is in H.
Proof: Let Z be a valuation satisfying S. Find an edge pq in S for which od(Z(p),Z(q)) is maximal. Now, if ab is a short of S -- that is, there is a constraint od(a,b) << od(c,d) in S -- then od( Z(a), Z(b)) << od( Z(c), Z(d)) <<= od( Z(p), Z(q)).
Now, let ab be any edge in H, the connected components of the shorts of S. Then there is a path a1=a, a2 ... ak =b such that the edge aiai+1 is a short of S for i=1 ... k-1. Thus, by the om-triangle inequality, od( Z(a), Z(b)) <<= maxi=1..k-1(od (Z(ai), Z(ai+1))) <<od( Z(p), Z(q)). Hence pq <> ab, so pq is not in H.
Lemma 6: The values of S and H in any iteration are supersets of their values in any later iteration.
Proof: S is reset to a subset of itself at the end of each iteration. H is defined in terms of S in a monotonic manner.
Lemma 7: S cannot be the same in two successive iterations of the main loop.
Proof: by contradiction. Suppose that S is the same in two successive iterations. Then H will be the same, since it is defined in terms of S. H is constructed to contain all the shorts of S, Since the resetting of S at the end of the first iteration does not change S, H must contain all the longs as well. Thus, H contains all the edges in S. But that being the case, the algorithm should have terminated with failure at the beginning of the first iteration.
Lemma 8: Algorithm solve_constraints always terminates.
Proof: By lemma 7, if the algorithm does not exit with failure, then on each iteration some constraints are removed from S. Hence, the number of iterations of the main loop is at most the original size of S. Everything else in the algorithm is clearly bounded. (Note that this bound on the number of iterations is improved in section 6.1 to n-1, where n is the number of symbols.)
Lemma 9: If algorithm solve_constraints returns false, then S is inconsistent.
Proof: If the algorithm returns false, then the transitive closure of the shorts of S contains all the edges in S. By lemma 5, S is inconsistent.
Lemma 10: If constraint C of form od(a,b) << od(c,d) is in the initial value of S, and edge cd is in H in some particular iteration, then constraint C is in S at the start of that iteration.
Proof: Suppose that C is deleted from S on some particular iteration. Then edge cd, the long of C, cannot be in H in that iteration. That is, it is not possible for edge cd to persist in H in an iteration after C has been deleted from S. Note that, by lemma 6, once cd is eliminated from H, it remains out of H.
Lemma 11: The following loop invariant holds: At the end of each loop iteration, the values of L.symbols, where L is a leaf in the current state of the tree, are exactly the connected components of H.
Proof: In the first iteration, T is initially just the root R, containing all the symbols, and a child of R is created for each connected component of H.
Let Ti and Hi be the values of T and H at the end of the ith iteration. Suppose that the invariant holds at the end of the kth iteration. By lemma 6, Hk+1 is a subset of Hk. Hence, each connected component of Hk+1 is a subset of a connected component of Hk. Moreover, each connected component J of Hk is either a connected component of Hk+1 or is partitioned into several connected components of Hk+1. In the former case, the leaf of Tk corresponding to J is unchanged and remains a leaf in Tk+1. In the latter case, the leaf corresponding to J gets assigned one child for each connected component of Hk+1 that is a subset of J. Thus, the connected components of Hk+1 correspond to the leaves of Tk+1.
Lemma 12: If procedure solve_constraints does not return false, then it returns a well-formed cluster tree T.
Proof: Using lemma 11, and the cleanup section of solve_constraints which creates the final leaves for symbols, it follows that every symbol in S ends up in a single leaf of T. As m is decremented on each iteration, and as no iteration adds both a new node and children of that node, it follows that the label of each internal node is less than the label of its father. Hence the constraints on cluster trees (definition 3) are satisfied.
Lemma 13: Let a,b be two distinct symbols in S and let T be the cluster tree returned by solve_constraints for S. Let N be the least common ancestor of a,b in T. Then either N is assigned its label on the first iteration when the edge ab is not in H, or the edge ab is in the final value of H when the loop is exited and N is assigned its label in the final cleanup section.
Proof: As above, let Hi be the value of H in the ith iteration.
If N is the root, then it is assigned its label in the first iteration. Clearly, a and b, being in different subtrees of N, must be in different connected components of H1.
Suppose N is assigned its label in the kth iteration of the loop for k>1. By lemma 11, at the end of the previous iteration, N.symbols was a connected component of Hk-1, and it therefore contained the edge ab. Since N is the least common ancestor of a,b, it follows that a and b are placed in two different children of N; hence, they are in two different connected components of Hk. Thus the edge ab cannot be in Hk.
Suppose N is assigned its label in the cleanup section of the algorithm. Then by lemma 11, N.symbols is a connected component of the final value of H. Hence the edge ab was in the final value of H.
Lemma 14: Let S initially contain constraint C of form od(a,b) << od(c,d). Suppose that solve_constraints(S) returns a cluster tree T. Let M be the least common ancestor of a,b in T and let N be the least common ancestor of c,d. Then M.label < N.label.
Proof: Suppose N is given a label in a given iteration. By lemma 13, cd is eliminated from H in that same iteration. By lemma 10, constraint C must be in S at the start of the iteration. Hence ab is a short of S in the iteration, and is therefore in H. Hence M is not given a label until a later iteration, and therefore is given a lower label.
It is easily seen that cd cannot be in H in the final iteration of the loop, and hence N is not assigned its label in the cleanup section.
Lemma 15: Suppose that solve_constraints(S) returns a cluster tree T. Then any instantiation of T satisfies the constraints S.
Proof: Immediate from lemma 14 and corollary 4.
Theorem 1: The algorithm solve_constraints(S) returns a cluster tree satisfying S if S is consistent, and returns false if S is inconsistent.
Proof: If solve_constraints(S) returns false, then it is inconsistent (lemma 9). If it does not return false, then it returns a cluster tree T (lemma 12). Since T has an instantiation (lemma 2) and since every instantiation of T is a solution of S (lemma 15), it follows that S is consistent and T satisfies S.