Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html
Kaizhong Zhang
Department of Computer Science
University of Western Ontario
kzhang@csd.uwo.ca
Jason Wang
Department of Computer Science
New Jersey Institute of Technology
jason@cis.njit.edu
In applications ranging from taxonomy comparisons to searches within class hierarchies to query processing in XML query languages, trees patterns are important, but the order among siblings (children of the same node) is not. The package described here allows rapid exact and approximate search on unordered trees. The one assumption this makes is that for all nodes n, the children of n all have distinct names. Of course, the children of n' could have the same names as the children of n for two different nodes n and n'. (If two children of n have the same nodes, then the system will not miss any matches, but may return false positives -- data trees having the same leaf-to-root paths as the query tree but not the same shape.) As of February, 2001, the package also offers the possibility of variable length don't cares (represented by *) and single label don't cares (represented by ?). This may be useful for querying based on XPATH or its extensions.
The algorithms proposed here may require quadratic time in the size of the trees to construct the database. Search however is proportional to the sum of the lengths of the paths in the database times a small constant. In my experiments, searches of 100 50 node trees take under a second. Scaling is still an issue to be sure.
Here is a web demo of the capabilities . You can find an application to a phylogenetic database. Bill Piel's full phylogenetic database, called treebase is found here .
If trees are not quite general enough, then we also have an efficient (quadratic in the number of edges), heuristic package for graph comparison.
If you want edit distance among ordered trees then look for package for approximate tree comparison.
If you want a non-editing metric tailored to trees that are unordered
(such as phylogenetic trees), please see
our cousin-distance based unordered comparison software
You can find a software package (GraphGrep)
to search for a query graph in a database of graphs here.
You can find
our project home page here .
The approximate tree matcher runs in a high performance interpreted environment
called K.
As you can see by looking at treein, trees are described in two
tables. The first one, having the schema
The table whose schema is
You can find the general syntax of the command by typing
The general syntax in fact is:
Here are some more examples with commentary:
In the case where you have many trees to search through,
you may want to use a second program in addition to pathfix
called pathfilter.
Intended use:
This work has been partly supported by the U.S. National Science
Foundation under grants
IIS-0414763
and
DBI-0445666.
Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the authors and do not
necessarily reflect the views of the National Science Foundation.
This support is greatly appreciated.
Installation
k pathfix +f treein
which will compare the first tree in treein with the rest of them.
Each row of output will tell you where in the other trees the first
tree finds a match.
(The output can also be found in a file called data.out
for possible post-processing.)
For example:
treeb_0, i.e. node 0 in treeb, using the numbering given in treein.
k pathfix +f treeinquest
Input
tree(treeid, parentid, childrenid)
describes the parent-child relationships using any self-consistent
numbering of the nodes.
Rows of the table are separated by a newline character.
Here are the first few rows of that table:
#tree| treeid| parentid| childrenid
treea |0 |1 3 5 6 8
treea |3 |2 4
treea |6 |7
treeb |0 |1 2 5 6 8
treeb |2 |3 4
treeb |6 |7
treeb |7 |9
treec |0 |1 2 5 8
treec |2 |3 4 6
treelabel(treeid, nodeid, label)
gives the label associated with each node of the tree.
Here are the first few rows of that table in treein:
# treelabel|treeid|nodeid|label
treea |0 |xray
treea |1 |15
treea |3 |z
treea |2 |w
treea |4 |v
treea |5 |q
treea |6 |xray
treea |7 |15
treea |8 |15
treeb |0 |xray
treeb |1 |15
treeb |2 |z
treeb |3 |w
treeb |4 |v
treeb |5 |q
treeb |6 |xray
treeb |7 |15
Labels can be characters, numbers, or strings and can be mixed.
For example, treeb in the file treein has the format (see files dumpedquery
and dumpeddb)
after you run k pathfix +f treein +dumpquery +dumpdb ).
Tree: treeb
xray (0)
15 (1)
z (2)
w (3)
v (4)
q (5)
xray (6)
15 (7)
15 (9)
15 (8)
Options
k pathfix +help
k pathfix (+b filenamefordb | +q queryfilename +d dbfilename | +f filename ) [+m maxdist (0)] [+dumpquery] [+dumpdb]
will form treeindb.l or treeindb.K
will find the query tree in treeindb.
will find the query tree in treeindb within distance 2 and put the
query tree in indented form in the file dumpedquery.
will make the first tree in treein be the query tree;
the remaining trees in treein be the database trees;
will look for the query tree among the database trees
within distance 2; will put the query tree in
indented form in the file dumpedquery, and the
database trees in indented form in the file dumpedb.
Large Numbers of Trees
k pathfilter +q queryfile +o outfile
or if you want to allow a distance of 2 then
k pathfilter +q queryfile + m 2 +o outfile
Support