Graphdiff: Approximate Graph Matcher and Clusterer

Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html

Jason Wang
Department of Computer Science
New Jersey Institute of Technology
jason@cis.njit.edu

Motivation

Approximate graph isomorphism and subgraph isomorphism are known to be NP-complete. After reviewing the literature, we have combined heuristics for these problems (geometric hashing and valence heuristics mostly) that perform approximate graph matching and clustering extremely well on random data and on real data from the National Cancer Institute. This web page describes the installation and use of this software which you may use for research purposes. (If graphs are more general than you need, then we also have an efficient, heuristic-free package for tree comparison.)

Installation

The approximate graph matcher runs in a high performance interpreted environment called K.

To begin with, therefore, please download trial K from kx.com . The license is good for a limited duration, but is renewable. K and our program run equally well on linux and windows. All of K is under 300 kilobytes.
Send email to shasha@cs.nyu.edu. If you care to describe your application, we'd be glad to hear about it. In any case, we will send you instructions for downloading three files:
- graph.k -- the software
- tempindb -- a sample database of graphs
- tempinprobes -- a sample database of query graphs
You can then run the program by simply typing
k graph
Each row of the output will tell you: the query graph, the score of the query graph to the best match (i.e. size in edges of best alignment divided by the size in edges of the query graph, as explained below) the score of the best match to that query graph (i.e. size of best alignment divided by size of best match), the label of the database graph matching the query graph with that score, e.g.
12|0.8|0.6|7
means that query graph 12 has a matching score of 0.8 with respect to query graph 7 (intuitively, size of match divided by size of graph 12, see below), and graph 7 has a matching score of 0.6 with respect to graph 12. If an alignment is offered, it follows.

Semantics of the Result and of the Input

The score is defined as follows. Given a query graph q and a database graph d , the score is based on the best alignment found between q and d . Call that alignment a . An alignment is a 1:1 (partial) mapping from nodes in q to nodes in d . The score depends on how good an alignment a is.

In the case that all edges represent the same distance (defined below), an edge {i,j} in q matches an edge {i',j'} in d with respect to alignment a , if, first, alignment a maps i to i' and j to j' ; second, i and i' are of the same type and j and j' are of the same type. In the uniform distance case, the score is the number of matches between q and d with respect to alignment a divided by the total number of edges in q . Thus, each match is worth 1 and, in the first score, we are looking for the number of matching edges divided by the total number of edges in q . The second score is the same but is divided by the total number of edges in d .

When edges may have different distances, then an edge {i,j,s} in q matches an edge {i',j',s'} in d with respect to alignment a , if, first, alignment a maps i to i' and j to j' ; second, i and i' are of the same type and j and j' are of the same type. This is the same as for the uniform distance case. The contribution of this match is min(s/s', s'/s) . Thus, the contribution of a match cannot be greater than 1 and is often less, unlike in the uniform distance case. In the different distance case, the first score is the sum of the contributions of the matches between q and d with respect to alignment a divided by the total number of edges in q . The second score is the same but is divided by the total number of edges in d .

The notion of distance is an abstraction from molecular comparisons: small distances (e.g. 1) connote a stronger bond than big distances (e.g. 3). Thus, to the program, distance is the inverse of strength. So, if your application has some notion of strength (that may have nothing to do with physical or chemical strength), then invert it when assigning distances. Distances should be positive integers.

The notion of type is a way to constrain matches. Each node in one graph must map to a node of the same type in the target graph. For example, when defining an alignment between molecules, types might constrain mappings so each atom maps to another atom of the same type: oxygen to oxygen, carbon to carbon, etc. Types are defined using positive integers.

Input

As you can see by looking at tempindb or tempinprobes, each graph is characterized by a label with which it will be identified for matching purposes. This can be a number or a string.

Next, there is a list of types which can be placed on one or more lines (any number of spaces or newlines may separate types). The first type is the type of node 0, the next one of node 1, etc.

Finally, there is a list of edges consisting of node id pairs and the distance. We assume that all edges are symmetric. There should be one edge per line.

Options

The program has just a few options which you set by adjusting command line parameters or modifying graph.k. For the command line parameters, please type k graph +help You will see the following:
(+f filenameforalltoall (tempin) [+nc numclusters (1)] |
+p probefile (tempinprobes) +d dbfile (tempindb) |
+m manygraphs (-2) [+n nodecount (50)]
[+en extranodes (0)] [+ee extraedges (0)] [+t numtypes (2)]
)
[+a givemealignment (1)] [+h numberofhillclimbs (50)]
[+help]
Example 1: All to all comparison of file tempin with 100 hillclimbs.
k graph +f tempin +h 100
Example 2: five 70 node random graphs and lab2 with 8 extra nodes and 20 extra edges.
k graph +m 5 +n 70 +en 8 +ee 20

You can also modify variables directly in the program if you want. Here is the meaning of these options and variables.

You can take input from files or have it generated randomly.
- If you want to take a single file and compare every graph with every other one in that file: specify a file after the +f option. (That will force the manygraphs variable to the value -2 or you can explicitly set it to the value -2 using the +m option.) The default is that the input file is tempin and manygraphs is set to -2. In the case that you want the program to attempt a clustering, specify the number of clusters with the parameter +nc.
- If you want to take a query file consisting of graphs and compare those graphs with the graphs in a database file, then use the +p and +d options. (That will force the manygraphs variable to the value -3 or you can explicitly set it to the value -3 using the +m option.)
- If you want the system to generate random graphs, then set manygraphs to the number of random graphs you want in addition to a base random graph called lab1 and a permutation called lab2 . The program then uses lab1 to probe among the random graphs generated including lab2. You can generate these nodes with a certain number of types by setting the +t parameter (the higher the number of types, the easier the job for the graph matcher). You can also give lab2 extra nodes and extra edges using the +en and +ee parameters.
Produce the best alignment when a query graph matches a database graph. If you want this, set the variable givemealignment to "1" or specifying parameter +a 1.
Setting variable hillclimbing to 0 will give faster execution but will give poorer quality alignments. The command line option for the number of hillclimbs is +h. <<--
Setting variable manygraphs to "-3" allows you to compare each graph in file tempinprobes against every graph in tempindb .
Setting variable manygraphs to "-2" allows you to compare each graph in file tempin against every other graph in that file. Note that in this case you must create the file tempin . For testing, you might want to do this by copying tempindb to tempin.
Since many-to-many testing (manygraphs = -2) is a natural precursor to clustering, the program offers a k means clustering service. When the kmeans variable is set to any positive integer k and manygraphs is set to -2, the program will discover k clusters (whenever possible) having the following desirable properties.
Define the similarity radius between a ``centroid'' and a set of other nodes to be the minimum similarity between that centroid and the other nodes.
Define the overall similarity radius , given a collection of centroid-node set pairs, to be the minimum of the similarity radii of each member of the collection.
In k means cluster formation, the program finds k centroids and k partitions such that (i) the partitions cover the graphs; and (ii) the overall similarity radius is as high as (heuristically) possible given k.
When the kmeans variable is set to any positive integer k and manygraphs is set to -2 and reducecentroids is set to 1, the program will attempt to find the essential nodes and edges of each centroid that will improve the overall similarity radius. It doesn't always help. -->>
No other variables should be changed.

A note about the current clustering algorithm:
Define the similarity radius between a ``centroid'' and a set of other nodes to be the minimum similarity between that centroid and the other nodes.
Define the overall similarity radius , given a collection of centroid-node set pairs, to be the minimum of the similarity radii of each member of the collection.
In k means cluster formation, the program finds k centroids and k partitions such that (i) the partitions cover the graphs; and (ii) the overall similarity radius is as high as (heuristically) possible given k.

Support

This work was partly supported by grant 9531554 of the United States National Science Foundation. This support is greatly appreciated.