Treesearch: Searching among Unordered Trees

Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html

Kaizhong Zhang
Department of Computer Science
University of Western Ontario
kzhang@csd.uwo.ca

Jason Wang
Department of Computer Science
New Jersey Institute of Technology
jason@cis.njit.edu

Motivation

In applications ranging from taxonomy comparisons to searches within class hierarchies to query processing in XML query languages, tree patterns are important, but the order among siblings (children of the same node) is not. The package described here allows rapid exact and approximate search on unordered trees. Its one assumption is that, for every node n, the children of n all have distinct names. Of course, the children of a different node n' may have the same names as the children of n. (If two children of n have the same name, then the system will not miss any matches, but may return false positives -- data trees having the same leaf-to-root paths as the query tree but not the same shape.) As of February 2001, the package also offers variable-length don't cares (represented by *) and single-label don't cares (represented by ?). This may be useful for querying based on XPath or its extensions.
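The false-positive case mentioned above can be illustrated with a small sketch. It is not the package's code: the (label, children) tuple representation and the function name root_to_leaf_paths are assumptions made only for this example. It shows two trees of different shape that nevertheless have identical sets of root-to-leaf label paths, which can only happen when two siblings share a name.

```python
def root_to_leaf_paths(tree):
    """Return the set of label paths from the root to every leaf.

    A tree is a (label, children) pair, where children is a list of trees.
    This representation is hypothetical, chosen only for illustration.
    """
    label, children = tree
    if not children:
        return {(label,)}
    paths = set()
    for child in children:
        for tail in root_to_leaf_paths(child):
            paths.add((label,) + tail)
    return paths


# Two siblings both named "b": this violates the distinct-names assumption.
split = ("a", [("b", [("c", [])]), ("b", [("d", [])])])

# A single "b" child carrying both leaves: a different shape.
merged = ("a", [("b", [("c", []), ("d", [])])])

# Both trees yield the paths a/b/c and a/b/d, so a purely path-based
# comparison cannot tell them apart.
same_paths = root_to_leaf_paths(split) == root_to_leaf_paths(merged)
print(same_paths)  # True
```

When sibling names are distinct, this ambiguity cannot arise, which is why the assumption matters.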

The algorithms proposed here may require time quadratic in the size of the trees to construct the database. Search, however, takes time proportional to the sum of the lengths of the paths in the database times a small constant. In my experiments, searches of 100 trees of 50 nodes each take under a second. Scaling is still an issue, to be sure.

Here is a web demo of the capabilities. You can find an application to a phylogenetic database. Bill Piel's full phylogenetic database, called treebase, is found here.

If trees are not quite general enough, then we also have an efficient (quadratic in the number of edges), heuristic package for graph comparison.

If you want edit distance among ordered trees, then look for the package for approximate tree comparison.

If you want a non-editing metric tailored to trees that are unordered (such as phylogenetic trees), please see our cousin-distance based unordered comparison software.

You can find a software package (GraphGrep) to search for a query graph in a database of graphs here.

You can find our project home page here.

Installation

The approximate tree matcher runs in a high performance interpreted environment called K.

To begin with, therefore, please download trial K: there is a Sun version, and a pair of files for Windows consisting of k.exe and k20.dll (if this doesn't work, then go to k20dll and then rename the file to k20.dll). K and our program run equally well on Linux and Windows. The tree matcher, our sample file, and all the K files fit comfortably in 0.5 megabytes of disk space.
Send email to shasha@cs.nyu.edu. If you care to describe your application, we'd be glad to hear about it. In any case, we will send you instructions for downloading two files:
- treein -- a sample database of trees without variable length don't cares.
  You can then run the program by typing
  k pathfix +f treein
  which will compare the first tree in treein with the rest of them. Each row of output will tell you where in the other trees the first tree finds a match. (The output can also be found in a file called data.out for possible post-processing.) For example, treeb_0 means node 0 in treeb, using the numbering given in treein.
- treeinquest -- a sample database of trees with variable length don't cares.
  You can then run the program by typing
  k pathfix +f treeinquest

Input

As you can see by looking at treein, trees are described in two tables. The first one, having the schema
tree(treeid, parentid, childrenid)
describes the parent-child relationships using any self-consistent numbering of the nodes. Rows of the table are separated by a newline character. Here are the first few rows of that table:

#tree| treeid| parentid| childrenid
treea |0 |1 3 5 6 8
treea |3 |2 4
treea |6 |7
treeb |0 |1 2 5 6 8
treeb |2 |3 4
treeb |6 |7
treeb |7 |9
treec |0 |1 2 5 8
treec |2 |3 4 6

The table whose schema is
treelabel(treeid, nodeid, label)
gives the label associated with each node of the tree. Here are the first few rows of that table in treein:

# treelabel|treeid|nodeid|label
treea |0 |xray
treea |1 |15
treea |3 |z
treea |2 |w
treea |4 |v
treea |5 |q
treea |6 |xray
treea |7 |15
treea |8 |15
treeb |0 |xray
treeb |1 |15
treeb |2 |z
treeb |3 |w
treeb |4 |v
treeb |5 |q
treeb |6 |xray
treeb |7 |15
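For post-processing or for generating input, the two tables shown above can be read with a few lines of Python. This is only a sketch based on the sample rows: it assumes fields are separated by | characters, children by spaces, and that lines starting with # are headers. How the two tables are delimited inside the file treein is not covered here, so each table's rows are parsed separately. The function names are hypothetical.

```python
def parse_tree_rows(text):
    """Map treeid -> {parentid: [childid, ...]} from rows of the tree table."""
    trees = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header row
        treeid, parent, children = [f.strip() for f in line.split("|", 2)]
        trees.setdefault(treeid, {})[int(parent)] = [int(c) for c in children.split()]
    return trees


def parse_label_rows(text):
    """Map treeid -> {nodeid: label} from rows of the treelabel table."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        treeid, node, label = [f.strip() for f in line.split("|", 2)]
        labels.setdefault(treeid, {})[int(node)] = label  # labels stay strings
    return labels


TREE_ROWS = """#tree| treeid| parentid| childrenid
treeb |0 |1 2 5 6 8
treeb |2 |3 4
treeb |6 |7
treeb |7 |9
"""

LABEL_ROWS = """# treelabel|treeid|nodeid|label
treeb |0 |xray
treeb |1 |15
treeb |2 |z
"""

trees = parse_tree_rows(TREE_ROWS)
labels = parse_label_rows(LABEL_ROWS)
print(trees["treeb"][0])   # [1, 2, 5, 6, 8]
print(labels["treeb"][2])  # z
```

Labels are kept as strings even when they look numeric, since the format allows characters, numbers, and strings to be mixed.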

Labels can be characters, numbers, or strings and can be mixed. For example, treeb in the file treein has the following indented form (see the files dumpedquery and dumpeddb after you run k pathfix +f treein +dumpquery +dumpdb).

Tree: treeb
xray (0)
  15 (1)
  z (2)
    w (3)
    v (4)
  q (5)
  xray (6)
    15 (7)
      15 (9)
  15 (8)
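The indented form above can be reproduced from the parent-child and label tables with a short recursive walk. The sketch below is an illustration rather than the package's own dump routine: the dictionaries are typed in by hand from the treeb listing, and the function names are hypothetical. It assumes the root is the one node that never appears as a child.

```python
# treeb, transcribed from the tables and the indented listing above.
children = {0: [1, 2, 5, 6, 8], 2: [3, 4], 6: [7], 7: [9]}
labels = {0: "xray", 1: "15", 2: "z", 3: "w", 4: "v",
          5: "q", 6: "xray", 7: "15", 8: "15", 9: "15"}


def find_root(children):
    """Return the node that is a parent but never a child (assumed unique)."""
    all_children = {c for kids in children.values() for c in kids}
    (root,) = set(children) - all_children
    return root


def indented(children, labels, node, depth=0):
    """Return the lines of the indented form, two spaces per level."""
    lines = ["  " * depth + f"{labels[node]} ({node})"]
    for child in children.get(node, []):
        lines.extend(indented(children, labels, child, depth + 1))
    return lines


lines = indented(children, labels, find_root(children))
print("\n".join(lines))
```

Children are printed in the order they are listed in the table; since the trees are unordered, that order carries no meaning for matching.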

Options

You can find the general syntax of the command by typing
k pathfix +help

The general syntax in fact is:
k pathfix (+b filenamefordb | +q queryfilename +d dbfilename | +f filename ) [+m maxdist (0)] [+dumpquery] [+dumpdb]

Program pathfix with the +b filename will build a database from that file and will call it filenamedb with an extension (either .l or .K depending on the operating system but you need not care).
Program pathfix with the +q queryfilename and +d dbfilename will take the query file and the database file already produced (but you need not specify the extension for the database file) and find out in which trees of the database and at which positions of those trees the query tree can be found.
Program pathfix with the +f filename will take a text file using the normal tree format (see Input above) and will compare the first tree against all others.
+m maxdist is the maximum number of differences to allow, where the number of differences is the number of root-to-leaf paths in the query tree that are missing in the database tree. Note that this is different from edit distance, e.g. a query tree will not match a database tree at a node whose label is different from the label of the root of the query tree.
+dumpquery says to dump the query tree in indented form to the file dumpedquery (unless one is using the +b option).
+dumpdb says to dump the database trees in indented form to the file dumpeddb (unless one is using the +d option).
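The distance that +m bounds, as described above, can be sketched as follows. This is an illustration under stated assumptions, not the algorithm pathfix uses: trees are hypothetical (label, children) tuples, a query path is taken to be present if it occurs as a downward label path from the candidate node (not necessarily ending at a leaf of the database tree), and don't cares are ignored.

```python
def leaf_paths(tree):
    """Root-to-leaf label paths of a (label, children) tree."""
    label, children = tree
    if not children:
        return {(label,)}
    return {(label,) + tail for child in children for tail in leaf_paths(child)}


def downward_paths(tree):
    """Every label path that starts at the root and follows child edges."""
    label, children = tree
    paths = {(label,)}
    for child in children:
        for tail in downward_paths(child):
            paths.add((label,) + tail)
    return paths


def path_distance(query, data):
    """Number of query root-to-leaf paths missing under the data node.

    Returns None when the root labels differ, since the query cannot
    match at such a node regardless of maxdist.
    """
    if query[0] != data[0]:
        return None
    return len(leaf_paths(query) - downward_paths(data))


query = ("xray", [("15", []), ("z", [("w", []), ("v", [])]), ("q", [])])
data = ("xray", [("15", []), ("z", [("w", [])]), ("q", [])])

print(path_distance(query, data))       # 1: the path xray/z/v is missing
print(path_distance(query, ("q", [])))  # None: root labels differ
```

Under this reading, the query above would not match with +m 0 but would match with +m 1.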

Here are some more examples with commentary:

k pathfix +b treein
will form treeindb.l or treeindb.K
k pathfix +q treeinquery +d treeindb
will find the query tree in treeindb.
k pathfix +q treeinquery +d treeindb +dumpquery +m 2
will find the query tree in treeindb within distance 2 and put the query tree in indented form in the file dumpedquery.
k pathfix +f treein +dumpquery +dumpdb +m 2
will make the first tree in treein the query tree and the remaining trees in treein the database trees; will look for the query tree among the database trees within distance 2; and will put the query tree in indented form in the file dumpedquery and the database trees in indented form in the file dumpeddb.
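If you drive pathfix from a script, the command lines above can be assembled programmatically. The helper below is a hypothetical convenience, not part of the package. It uses only the options listed in the general syntax, emits them in the order of that syntax line (whether option order matters to pathfix is not stated here), and only builds the argument list without running K.

```python
def build_pathfix_args(*, build=None, query=None, db=None, file=None,
                       maxdist=None, dumpquery=False, dumpdb=False):
    """Assemble a pathfix command line as a list of arguments.

    Exactly one mode must be chosen: build (+b), query with db (+q, +d),
    or file (+f).
    """
    modes = []
    if build is not None:
        modes.append(["+b", build])
    if query is not None or db is not None:
        if query is None or db is None:
            raise ValueError("+q and +d must be given together")
        modes.append(["+q", query, "+d", db])
    if file is not None:
        modes.append(["+f", file])
    if len(modes) != 1:
        raise ValueError("choose exactly one of +b, +q with +d, or +f")

    args = ["k", "pathfix"] + modes[0]
    if maxdist is not None:
        args += ["+m", str(maxdist)]
    if dumpquery:
        args.append("+dumpquery")
    if dumpdb:
        args.append("+dumpdb")
    return args


cmd = build_pathfix_args(file="treein", maxdist=2, dumpquery=True, dumpdb=True)
print(" ".join(cmd))  # k pathfix +f treein +m 2 +dumpquery +dumpdb
```

The resulting list could be passed to a process launcher of your choice once K is installed and on the path.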

Large Numbers of Trees

In the case where you have many trees to search through, you may want to use a second program in addition to pathfix called pathfilter. Intended use: