Unordered Tree Comparison Based on Cousin Distance

Programmer: Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html

Joint work with Jason Wang and Kaizhong Zhang.


Motivation

An unordered tree is one in which parent-child relationships are important, but there is no sibling order. An unrooted unordered tree is one in which even parent-child relationships are unimportant (there is no root). Such trees are particularly relevant in phylogenetic analysis. A sophisticated library for phylogenetic analysis can be found on Rod Page's site. The algorithms here are complementary to those of the Page Lab.

Unordered tree comparison based on tree-editing distance is NP-complete. This software instead offers a variety of algorithms and options for comparing trees based on cousin pairs. A sibling is a cousin of degree 0, a nephew (niece) is a cousin of degree 0.5, a first cousin is a cousin of degree 1, and so on. Two trees can then be compared based on the set of pairs of each degree, or we can ignore degree and ask for a comparison of pairs of any degree up to, say, 2.5, among many other options. This metric is most interesting for phylogenetic trees, where only leaves have labels. The algorithm has both a rooted and an unrooted option.
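To make the cousin-pair idea concrete, here is a minimal sketch in Python (not the K implementation). It rests on assumptions that are not in the text above: trees are encoded as nested tuples with string leaves, leaf labels are unique, and the degree is computed as (up_a + up_b)/2 - 1, where up_a and up_b are the number of links from each leaf up to their lowest common ancestor. That formula is chosen only because it reproduces the examples given (sibling 0, nephew 0.5, first cousin 1). Pairs whose heights differ by more than one are treated as undefined. The max_degree and exact parameters are loose analogs of the +m and +exact options, not their actual semantics.

```python
from itertools import combinations

# Hypothetical tree encoding (not the tool's input format):
# a leaf is a string label, an interior node is a tuple of children.

def leaf_paths(tree, path=(), out=None):
    """Map each leaf label to the tuple of child indices leading to it from the root."""
    if out is None:
        out = {}
    if isinstance(tree, str):
        out[tree] = path
    else:
        for i, child in enumerate(tree):
            leaf_paths(child, path + (i,), out)
    return out

def cousin_degree(path_a, path_b):
    """Return the assumed cousin degree of two nodes, or None if undefined."""
    shared = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        shared += 1
    up_a = len(path_a) - shared  # links from a up to the common ancestor
    up_b = len(path_b) - shared  # links from b up to the common ancestor
    if up_a == 0 or up_b == 0:
        return None  # same node, or one is an ancestor of the other
    if abs(up_a - up_b) > 1:
        return None  # assumption: relationship too lopsided to count
    return (up_a + up_b) / 2 - 1

def cousin_pairs(tree, max_degree):
    """Return {(leaf_a, leaf_b): degree} for all pairs up to max_degree."""
    paths = leaf_paths(tree)
    pairs = {}
    for a, b in combinations(sorted(paths), 2):
        degree = cousin_degree(paths[a], paths[b])
        if degree is not None and degree <= max_degree:
            pairs[(a, b)] = degree
    return pairs

def common_pairs(t1, t2, max_degree, exact=False):
    """Pairs present in both trees; with exact=True the degrees must also agree."""
    p1 = cousin_pairs(t1, max_degree)
    p2 = cousin_pairs(t2, max_degree)
    shared = set(p1) & set(p2)
    if exact:
        shared = {pair for pair in shared if p1[pair] == p2[pair]}
    return shared

t1 = (("a", "b"), ("c", "d"))
t2 = (("a", "c"), ("b", "d"))
print(cousin_pairs(t1, 2))
print(sorted(common_pairs(t1, t2, 2, exact=True)))
```

Because leaf labels are assumed unique, each pair occurs at most once per tree, so this sketch does not need to distinguish cardinality from existence.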

We have related software for searching in unordered trees, finding the editing distance among ordered trees, and searching in graphs.

A metric suggested by Ward Wheeler is to label interior nodes based on the alphabetical ordering of all their children and then to compare trees that way.


Installation, Brief Explanation, and Execution

Our software runs in a high-performance interpreted environment called K.

Some typical queries

k cousins +findcommon +f treeintmp +m 2
-- this finds the commonalities up to second cousins of the trees in treeintmp

k cousins +findcommon +f treein +m 2 +exact
-- again finds the commonalities, but this time preserving the extent of relationship, e.g. first cousins with first cousins.

k cousins +f treeintmp +m 2 +dumpmatch +allnodes +setcompare
-- compares the first tree against all other trees.
+dumpmatch makes it dump part of the match
+allnodes means not leaves only [this option is deprecated]
+setcompare means we don't care about cardinality, just existence.

k cousins +phylo +f treeinphylo +m 2 +findcommonsim 0.7 +exact
-- assumes that treeinphylo is in phylogenetic format and tries to find commonalities that appear in 70% of the trees.

You can also do something like
k cousins +b treein +m 2

to build a database. Then treeintmp becomes the queries against the database.
k cousins +q treeintmp +d treeindb +m 2

The +unrooted option enables the previous options to be tried ignoring which node is the root. The distance is measured as the number of links from one node to another. This currently works only on trees whose leaves are the only nodes with labels. It is mutually exclusive with the subtree label distance because the notion of a subtree makes no sense in an unrooted tree.
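As an illustration of measuring distance by the number of links in an unrooted tree, here is a minimal Python sketch. It is not the K implementation. The edge-list encoding, the function names, and the choice to treat every node with exactly one neighbor as a labeled leaf are all assumptions made for the example.

```python
from collections import deque

def build_adjacency(edges):
    """Turn a list of undirected (u, v) links into a neighbor map."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def link_distances(adj, start):
    """Breadth-first search: number of links from start to every node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def leaf_link_distances(edges):
    """Return {(leaf_a, leaf_b): links} for every pair of leaves."""
    adj = build_adjacency(edges)
    # Assumption: a leaf is any node with exactly one neighbor.
    leaves = sorted(n for n, nbrs in adj.items() if len(nbrs) == 1)
    result = {}
    for i, a in enumerate(leaves):
        dist = link_distances(adj, a)
        for b in leaves[i + 1:]:
            result[(a, b)] = dist[b]
    return result

# Hypothetical unrooted tree: interior nodes x and y, leaves a, b, c, d.
edges = [("x", "a"), ("x", "b"), ("x", "y"), ("y", "c"), ("y", "d")]
print(leaf_link_distances(edges))
```

No node is singled out as the root here: the result depends only on which nodes are linked, which is the property the unrooted option relies on.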

Finally, there is the subtree label distance. Each subtree root is labeled with its leaf descendants in alphabetical order. Comparing two trees means comparing those labels.

Example:
k cousins +f treeinphylo +subtreelabel +findcommon +phylo
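The subtree label idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the K implementation: trees are encoded as nested tuples with string leaves, only interior nodes receive a label, labels are the sorted leaf names joined by commas, and two trees are compared by intersecting their multisets of labels.

```python
from collections import Counter

# Hypothetical tree encoding (not the tool's input format):
# a leaf is a string label, an interior node is a tuple of children.

def collect_labels(tree, labels):
    """Return the sorted leaf descendants of tree, recording a label per interior node."""
    if isinstance(tree, str):
        return [tree]
    leaves = []
    for child in tree:
        leaves.extend(collect_labels(child, labels))
    leaves.sort()  # alphabetical order makes the label independent of sibling order
    labels[",".join(leaves)] += 1
    return leaves

def subtree_labels(tree):
    """Multiset of subtree labels for every interior node of the tree."""
    labels = Counter()
    collect_labels(tree, labels)
    return labels

def common_subtree_labels(t1, t2):
    """Labels that occur in both trees (multiset intersection)."""
    return subtree_labels(t1) & subtree_labels(t2)

t1 = (("a", "b"), ("c", "d"))
t2 = (("a", "c"), ("b", "d"))
t3 = (("d", "c"), ("b", "a"))  # t1 with siblings reordered
print(common_subtree_labels(t1, t2))
print(subtree_labels(t1) == subtree_labels(t3))
```

Sorting the leaves before building each label is what makes the comparison insensitive to sibling order, so t1 and t3 produce identical labels while t1 and t2 share only the label of the whole tree.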

For more options, just try
k cousins kjlkj

The software responds to errors in the input by giving a full help screen.

Support

This material is based upon work partly supported by the United States National Science Foundation under grants IIS-9988636, 0115586, and MCB-0209754. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. This support is greatly appreciated.