Graphdiff: Approximate Graph Matcher and Clusterer


Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html

Jason Wang
Department of Computer Science
New Jersey Institute of Technology
jason@cis.njit.edu



Motivation

Approximate graph isomorphism and subgraph isomorphism are known to be NP-complete. After reviewing the literature, we have combined heuristics for these problems (geometric hashing and valence heuristics mostly) that perform approximate graph matching and clustering extremely well on random data and on real data from the National Cancer Institute. This web page describes the installation and use of this software which you may use for research purposes. A presentation done at the University of Pennsylvania describing this work is contained here . Here is a web demo of the capabilities . (If graphs are more general than you need, then we also have an efficient, heuristic-free package for ordered tree comparison, our cousin-distance based unordered comparison software and a data structure for tree searching among unordered trees .) You can find our project home page here .

Installation

The approximate graph matcher runs in a high performance interpreted environment called K.

Semantics of the Result and of the Input

The score is defined as follows. Given a query graph q and a database graph d , the score is based on the best alignment found between q and d . Call that alignment a . An alignment is a 1:1 (partial) mapping from nodes in q to nodes in d . The score depends on how good an alignment a is.

In the case that all edges represent the same distance (defined below), an edge {i,j} in q matches an edge {i',j'} in d with respect to alignment a , if, first, alignment a maps i to i' and j to j' ; second, i and i' are of the same type and j and j' are of the same type. In the uniform distance case, the score is the number of matches between q and d with respect to alignment a divided by the total number of edges in q . Thus, each match is worth 1 and, in the first score, we are looking for the number of matching edges divided by the total number of edges in q . The second score is the same but is divided by the total number of edges in d .

When edges may have different distances, then an edge {i,j,s} in q matches an edge {i',j',s'} in d with respect to alignment a , if, first, alignment a maps i to i' and j to j' ; second, i and i' are of the same type and j and j' are of the same type. This is the same as for the uniform distance case. The contribution of this match is min(s/s', s'/s) . Thus, the contribution of a match cannot be greater than 1 and is often less, unlike in the uniform distance case. In the different distance case, the first score is the sum of the contributions of the matches between q and d with respect to alignment a divided by the total number of edges in q . The second score is the same but is divided by the total number of edges in d .

The notion of distance is an abstraction from molecular comparisons: small distances (e.g. 1) connote a stronger bond than big distances (e.g. 3). Thus, to the program, distance is the inverse of strength. So, if your application has some notion of strength (that may have nothing to do with physical or chemical strength), then invert it when assigning distances. Distances should be positive integers.

The notion of type is a way to constrain matches. Each node in one graph must map to a node of the same type in the target graph. For example, when defining an alignment between molecules, types might constrain mappings so each atom maps to another atom of the same type: oxygen to oxygen, carbon to carbon, etc. Types are defined using positive integers.

Input

As you can see by looking at tempindb or tempinprobes, each graph is characterized by a label with which it will be identified for matching purposes. This can be a number or a string.

Next, there is a list of types which can be placed on one or more lines (any number of spaces or newlines may separate types). The first type is the type of node 0, the next one of node 1, etc.

Finally, there is a list of edges consisting of node id pairs and the distance. We assume that all edges are symmetric. There should be one edge per line.

Options

The program has just a few options which you set by adjusting command line parameters or modifying graph.k. For the command line parameters, please type k graph +help You will see the following:
(+f filenameforalltoall (tempin) [+nc numclusters (1)] |
+p probefile (tempinprobes) +d dbfile (tempindb) |
+m manygraphs (-2) [+n nodecount (50)]
[+en extranodes (0)] [+ee extraedges (0)] [+t numtypes (2)]
)
[+a givemealignment (1)] [+h numberofhillclimbs (50)]
[+help]
Example 1: All to all comparison of file tempin with 100 hillclimbs.
k graph +f tempin +h 100
Example 2: five 70 node random graphs and lab2 with 8 extra nodes and 20 extra edges.
k graph +m 5 +n 70 +en 8 +ee 20

You can also modify variables directly in the program if you want. Here is the meaning of these options and variables.

A note about the current clustering algorithm:
Define the similarity radius between a ``centroid'' and a set of other nodes to be the minimum similarity between that centroid and the other nodes.
Define the overall similarity radius , given a collection of centroid-node set pairs, to be the minimum of the similarity radii of each member of the collection.
In k maxmin cluster formation, the program finds k centroids and k partitions such that (i) the partitions cover the graphs; and (ii) the overall similarity radius is as high as (heuristically) possible given k.

Support

This material is based upon work partly supported by the United States National Science Foundation under grant IIS-9988636 Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.