Treediff: Approximate Tree Matching for Ordered (and Unordered) Trees

Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html

Jason Wang
Department of Computer Science
New Jersey Institute of Technology
jason@cis.njit.edu

Kaizhong Zhang
Department of Computer Science
University of Western Ontario
kzhang@csd.uwo.ca



Motivation

Approximate tree and subtree matching is important in many applications. We have worked out algorithms for these problems that are extremely efficient both practically and theoretically. The main algorithms used in the accompanying software are presented in the chapter of Zhang and Shasha in the book Pattern Matching in Strings, Trees, and Arrays edited by Alberto Apostolico and Zvi Galil. This web page describes the installation, use, and semantics of this software which you may use for research purposes.

If you have many trees or you are concerned about searching among trees that are unordered (where the order among the children of a node does not matter), please see our software that allows a large database of trees to be searched fast (time proportional to the number of paths in the query tree, on the average).

If you want an approximate editing metric tailored to trees that are unordered then use the +a flag of this software. If you want a non-editing unordered tree match, please see our cousin-distance based unordered comparison software

If trees are not quite general enough, then we also have an efficient, heuristic package for graph comparison.

You can find a software package (GraphGrep) to search for a query graph in a database of graphs here.

You can find our project home page here .


Installation

The approximate tree matcher runs in a high performance interpreted environment called K.

Semantics of the Result and of the Input

The edit distance is defined as the minimum number of label modifications, node deletes, and node inserts required to transform one tree to another. A label modification is simply a change of the label and entails no modification to the structure. For example, the following two trees (represented in outline format) differ by a single label modification:

a
 b
   d 
   e
 c
and
a
 x
   d
   e
 c
A delete of a node N removes node N and makes the children (if any) of N become the children of the parent of N (if any). Deleting node x in the example above gives:
a
 d
 e
 c
An insertion is some possible inverse of a deletion, so for example, an insertion of y below k in the following tree
 k
   b
   c
   d
   e
   f
   c
may yield
 k
   b
   c
   y
     d
     e
     f
   c
or it may yield
 k
   b
   y
     c
     d
   e
   f
   c
or any other configuration in which y is the parent of a consecutive subsequence of the children of k of length zero or greater.

The trees this software handles are ordered: meaning that the order among siblings matters. Thus, the following

a
  b
  c
and
a
  c
  b
are two different trees. In fact the transformation from the first to the second entails one deletion and one insertion. If you are interested in unordered trees and there is only one instance of any label in a given tree, then use our +a option. What that does is to order siblings initially based on the alphabetical order of labels, thus effectively eliminating any order dependency. After that it does normal editing distance comparison.

A tree-to-subtree match is an asymmetric distance calculation in which the question is "What is the smallest distance between T1 and some subtree of T2?" For example, a tree-to-subtree match from treec to treef yields a distance of 0, but that does not hold for a tree-to-subtree match from treef to treec.

Tree: treec
xray
  15
  z
    w
    v
    xray
      15
  q
  15
Tree: treef
newroot
  xray
    15
    z
      w
      v
      xray
        15
    q
    15
  newsibling1
  newsibling2
    nephew1
    nephew2

Input

As you can see by looking at treein, trees are described in two tables. The first one, having the schema
tree(treeid, parentid, childrenid)
describes the parent-child relationships using any self-consistent numbering of the nodes. Rows of the table are separated by a newline character. Here are the first few rows of that table:

#tree| treeid| parentid| childrenid
treea |0 |1 3 5 6 8
treea |3 |2 4
treea |6 |7
treeb |0 |1 2 5 6 8
treeb |2 |3 4
treeb |6 |7
treeb |7 |9
treec |0 |1 2 5 8
treec |2 |3 4 6

The table whose schema is
treelabel(treeid, nodeid, label)
gives the label associated with each node of the tree. Here are the first few rows of that table in treein:

#treelabel|treeid|nodeid|label
treea |0 |xray
treea |1 |15
treea |3 |z
treea |2 |w
treea |4 |v
treea |5 |q
treea |6 |xray
treea |7 |15
treea |8 |15
treeb |0 |xray
treeb |1 |15
treeb |2 |z
treeb |3 |w
treeb |4 |v
treeb |5 |q
treeb |6 |xray
treeb |7 |15
Labels can be characters, numbers, or strings and can be mixed. For example, treeb in the file treein has the format (see file treelayout after you run k tree ).
Tree: treeb
xray
  15
  z
    w
    v
  q
  xray
    15
      15
  15

Options

The program has a few command line options: k tree [+f filename] [+n num] [+s] [+a] [+b] [+r]

Support

This material is based upon work supported by the U.S. National Science Foundation under grants IIS-0414763 and DBI-0445666. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. This support is greatly appreciated.