Treesearch: Searching among Unordered Trees

Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html

Kaizhong Zhang
Department of Computer Science
University of Western Ontario
kzhang@csd.uwo.ca

Jason Wang
Department of Computer Science
New Jersey Institute of Technology
jason@cis.njit.edu



Motivation

In applications ranging from taxonomy comparisons to searches within class hierarchies to query processing in XML query languages, trees patterns are important, but the order among siblings (children of the same node) is not. The package described here allows rapid exact and approximate search on unordered trees. The one assumption this makes is that for all nodes n, the children of n all have distinct names. Of course, the children of n' could have the same names as the children of n for two different nodes n and n'. (If two children of n have the same nodes, then the system will not miss any matches, but may return false positives -- data trees having the same leaf-to-root paths as the query tree but not the same shape.) As of February, 2001, the package also offers the possibility of variable length don't cares (represented by *) and single label don't cares (represented by ?). This may be useful for querying based on XPATH or its extensions.

The algorithms proposed here may require quadratic time in the size of the trees to construct the database. Search however is proportional to the sum of the lengths of the paths in the database times a small constant. In my experiments, searches of 100 50 node trees take under a second. Scaling is still an issue to be sure.

Here is a web demo of the capabilities . You can find an application to a phylogenetic database. Bill Piel's full phylogenetic database, called treebase is found here .

If trees are not quite general enough, then we also have an efficient (quadratic in the number of edges), heuristic package for graph comparison.

If you want edit distance among ordered trees then look for package for approximate tree comparison.

If you want a non-editing metric tailored to trees that are unordered (such as phylogenetic trees), please see our cousin-distance based unordered comparison software

You can find a software package (GraphGrep) to search for a query graph in a database of graphs here.

You can find our project home page here .


Installation

The approximate tree matcher runs in a high performance interpreted environment called K.

Input

As you can see by looking at treein, trees are described in two tables. The first one, having the schema
tree(treeid, parentid, childrenid)
describes the parent-child relationships using any self-consistent numbering of the nodes. Rows of the table are separated by a newline character. Here are the first few rows of that table:

#tree| treeid| parentid| childrenid
treea |0 |1 3 5 6 8
treea |3 |2 4
treea |6 |7
treeb |0 |1 2 5 6 8
treeb |2 |3 4
treeb |6 |7
treeb |7 |9
treec |0 |1 2 5 8
treec |2 |3 4 6

The table whose schema is
treelabel(treeid, nodeid, label)
gives the label associated with each node of the tree. Here are the first few rows of that table in treein:

# treelabel|treeid|nodeid|label
treea |0 |xray
treea |1 |15
treea |3 |z
treea |2 |w
treea |4 |v
treea |5 |q
treea |6 |xray
treea |7 |15
treea |8 |15
treeb |0 |xray
treeb |1 |15
treeb |2 |z
treeb |3 |w
treeb |4 |v
treeb |5 |q
treeb |6 |xray
treeb |7 |15
Labels can be characters, numbers, or strings and can be mixed. For example, treeb in the file treein has the format (see files dumpedquery and dumpeddb) after you run k pathfix +f treein +dumpquery +dumpdb ).
Tree: treeb
xray (0)
  15 (1)
  z (2)
    w (3)
    v (4)
  q (5)
  xray (6)
    15 (7)
      15 (9)
  15 (8)

Options

You can find the general syntax of the command by typing
k pathfix +help

The general syntax in fact is:
k pathfix (+b filenamefordb | +q queryfilename +d dbfilename | +f filename ) [+m maxdist (0)] [+dumpquery] [+dumpdb]

Here are some more examples with commentary:

Large Numbers of Trees

In the case where you have many trees to search through, you may want to use a second program in addition to pathfix called pathfilter. Intended use: