Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html
Jason Wang
Department of Computer Science
New Jersey Institute of Technology
jason@cis.njit.edu
Kaizhong Zhang
Department of Computer Science
University of Western Ontario
kzhang@csd.uwo.ca
If you have many trees or you are concerned about searching among trees that are unordered (where the order among the children of a node does not matter), please see our software that allows a large database of trees to be searched fast (time proportional to the number of paths in the query tree, on the average).
If you want an approximate editing metric tailored to trees that are unordered then use the +a flag of this software. If you want a non-editing unordered tree match, please see our cousin-distance based unordered comparison software
If trees are not quite general enough, then we also have an efficient, heuristic package for graph comparison.
You can find a software package (GraphGrep) to search for a query graph in a database of graphs here.
You can find
our project home page here .
The approximate tree matcher runs in a high performance interpreted environment called K.
treea | treec | 4means that the edit distance from treea to treec is 4.
The edit distance is defined as the minimum number of label modifications, node deletes, and node inserts required to transform one tree to another. A label modification is simply a change of the label and entails no modification to the structure. For example, the following two trees (represented in outline format) differ by a single label modification:
a b d e cand
a x d e cA delete of a node N removes node N and makes the children (if any) of N become the children of the parent of N (if any). Deleting node x in the example above gives:
a d e cAn insertion is some possible inverse of a deletion, so for example, an insertion of y below k in the following tree
k b c d e f cmay yield
k b c y d e f cor it may yield
k b y c d e f cor any other configuration in which y is the parent of a consecutive subsequence of the children of k of length zero or greater.
The trees this software handles are ordered: meaning that the order among siblings matters. Thus, the following
a b cand
a c bare two different trees. In fact the transformation from the first to the second entails one deletion and one insertion. If you are interested in unordered trees and there is only one instance of any label in a given tree, then use our +a option. What that does is to order siblings initially based on the alphabetical order of labels, thus effectively eliminating any order dependency. After that it does normal editing distance comparison.
A tree-to-subtree match is an asymmetric distance calculation in which the question is "What is the smallest distance between T1 and some subtree of T2?" For example, a tree-to-subtree match from treec to treef yields a distance of 0, but that does not hold for a tree-to-subtree match from treef to treec.
Tree: treec xray 15 z w v xray 15 q 15 Tree: treef newroot xray 15 z w v xray 15 q 15 newsibling1 newsibling2 nephew1 nephew2
As you can see by looking at treein, trees are described in two
tables. The first one, having the schema
tree(treeid, parentid, childrenid)
describes the parent-child relationships using any self-consistent
numbering of the nodes.
Rows of the table are separated by a newline character.
Here are the first few rows of that table:
#tree| treeid| parentid| childrenid treea |0 |1 3 5 6 8 treea |3 |2 4 treea |6 |7 treeb |0 |1 2 5 6 8 treeb |2 |3 4 treeb |6 |7 treeb |7 |9 treec |0 |1 2 5 8 treec |2 |3 4 6
The table whose schema is
treelabel(treeid, nodeid, label)
gives the label associated with each node of the tree.
Here are the first few rows of that table in treein:
#treelabel|treeid|nodeid|label treea |0 |xray treea |1 |15 treea |3 |z treea |2 |w treea |4 |v treea |5 |q treea |6 |xray treea |7 |15 treea |8 |15 treeb |0 |xray treeb |1 |15 treeb |2 |z treeb |3 |w treeb |4 |v treeb |5 |q treeb |6 |xray treeb |7 |15Labels can be characters, numbers, or strings and can be mixed. For example, treeb in the file treein has the format (see file treelayout after you run k tree ).
Tree: treeb xray 15 z w v q xray 15 15 15
The program has a few command line options: k tree [+f filename] [+n num] [+s] [+a] [+b] [+r]
This material is based upon work supported by the U.S. National Science Foundation under grants IIS-0414763 and DBI-0445666. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. This support is greatly appreciated.