Graphdiff: Approximate Graph Matcher and Clusterer
Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html
Jason Wang
Department of Computer Science
New Jersey Institute of Technology
jason@cis.njit.edu
Motivation
Approximate graph isomorphism and subgraph isomorphism
are known to be NP-complete.
After reviewing the literature,
we have combined heuristics for these problems (geometric
hashing and valence heuristics mostly) that perform approximate
graph matching and clustering extremely
well on random data and on real data from the National Cancer
Institute.
This web page describes the installation and use of this software
which you may use for research purposes.
(If graphs are more general than you need, then we also have an
efficient, heuristic-free package for tree comparison.)
Installation
The approximate graph matcher runs in a high performance interpreted environment
called K.
-
To begin with, therefore, please download trial K from
kx.com .
The license is good for a limited duration, but is renewable.
K and our program run equally well on linux and windows.
All of K is under 300 kilobytes.
-
Send email to shasha@cs.nyu.edu.
If you care to describe your application, we'd be glad to hear
about it.
In any case, we will send you instructions for downloading three files:
-
graph.k -- the software
-
tempindb -- a sample database of graphs
-
tempinprobes -- a sample database of query graphs
-
You can then run the program by simply typing
k graph
-
Each row of the output will tell you:
the query graph,
the score of the query graph to the best match (i.e. size
in edges of best alignment
divided by the size in edges of the query graph, as explained below)
the score of the best match to that query graph
(i.e. size of best alignment divided by size of best match),
the label of the database graph matching the query graph
with that score, e.g.
12|0.8|0.6|7
means that query graph 12 has a matching score
of 0.8 with respect to query graph 7 (intuitively, size of match
divided by size of graph 12, see below),
and graph 7 has a matching score of 0.6 with respect to graph 12.
If an alignment is offered, it follows.
Semantics of the Result and of the Input
The score is defined as follows.
Given a query graph q and a database graph d ,
the score is based on the best alignment
found between q and d . Call that alignment a .
An alignment is a 1:1 (partial) mapping from nodes in
q to nodes in d .
The score depends on how good an alignment a is.
In the case that all
edges represent the same distance (defined below),
an edge {i,j} in q matches an edge
{i',j'} in d with respect
to
alignment a ,
if, first, alignment a
maps
i to i' and
j to j' ;
second, i and i' are of the same type
and
j and j' are of the same type.
In the uniform distance case, the score is the number of
matches between q
and d
with respect to alignment a divided by the total
number of edges in q .
Thus, each match is worth 1 and, in the first score,
we are looking for the number of
matching edges divided by the total number of edges in q .
The second score is the same but is divided by the total number of edges
in d .
When edges may have different distances, then
an edge {i,j,s} in q matches an edge
{i',j',s'} in d with respect
to
alignment a ,
if, first, alignment a
maps
i to i'
and
j to j' ;
second, i and i' are of the same type
and
j and j' are of the same type.
This is the same as for the uniform distance case.
The contribution of this match is min(s/s', s'/s) .
Thus, the contribution of a match cannot be greater than 1
and is often less, unlike in the uniform distance case.
In the different distance case, the first score is the sum
of the contributions of the matches between q
and d
with respect to alignment a divided by the total
number of edges in q .
The second score is the same but is divided by the total number of edges
in d .
The notion of distance is an abstraction from molecular comparisons:
small distances (e.g. 1) connote a stronger bond than
big distances (e.g. 3).
Thus, to the program, distance is the inverse of strength.
So, if your application has some notion of strength (that
may have nothing to do with physical or chemical strength), then
invert it when assigning distances.
Distances should be positive integers.
The notion of type is a way to constrain matches.
Each node in one graph must map to a node of the same type in the target graph.
For example, when defining an alignment between
molecules, types might constrain mappings so each atom
maps to another atom of the same type:
oxygen to oxygen, carbon to carbon, etc.
Types are defined using positive integers.
Input
As you can see by looking at tempindb or tempinprobes,
each graph is characterized by a label with which
it will be identified for matching purposes.
This can be a number or a string.
Next, there is a list of types which can be placed on
one or more lines (any number of spaces or newlines may separate types).
The first type is the type of node 0, the next one of node 1, etc.
Finally, there is a list of edges consisting of node id pairs and
the distance.
We assume that all edges are symmetric.
There should be one edge per line.
Options
The program has just a few options which you set by adjusting
command line parameters or modifying graph.k.
For the command line parameters, please type
k graph +help
You will see the following:
(+f filenameforalltoall (tempin) [+nc numclusters (1)] |
+p probefile (tempinprobes) +d dbfile (tempindb) |
+m manygraphs (-2) [+n nodecount (50)]
[+en extranodes (0)] [+ee extraedges (0)] [+t numtypes (2)]
)
[+a givemealignment (1)] [+h numberofhillclimbs (50)]
[+help]
Example 1: All to all comparison of file tempin with 100 hillclimbs.
k graph +f tempin +h 100
Example 2: five 70 node random graphs and lab2 with 8 extra nodes and 20 extra
edges.
k graph +m 5 +n 70 +en 8 +ee 20
You can also modify variables directly in the program if you want.
Here is the meaning of these options and variables.
-
You can take input from files or have it generated randomly.
-
If you want to take a single file and compare every graph with
every other one in that file:
specify a file after the +f option.
(That will force the manygraphs variable to the value -2 or you can
explicitly set it to the value -2 using the +m option.)
The default is that the input file is tempin and manygraphs is set to -2.
In the case that you want the program to attempt a clustering, specify
the number of clusters with the parameter +nc.
-
If you want to take a query file consisting of graphs and compare those
graphs with the graphs in a database file, then use the
+p and +d options.
(That will force the manygraphs variable to the value -3 or you can
explicitly set it to the value -3 using the +m option.)
-
If you want the system to generate random graphs, then set manygraphs
to the number of random graphs you want in addition to a base random
graph called lab1 and a permutation called lab2 .
The program then uses lab1 to probe among the random graphs generated
including lab2.
You can generate these nodes with a certain number of types by
setting the +t parameter (the higher the number of types, the easier
the job for the graph matcher).
You can also give lab2 extra nodes and extra edges using the +en
and +ee parameters.
-
Produce the
best alignment when a query graph matches a database graph.
If you want this, set the variable givemealignment to "1"
or specifying parameter +a 1.
-
Setting variable hillclimbing to 0
will give faster execution but will give poorer quality alignments.
The command line option for the number of hillclimbs is +h.
<<--
-
Setting variable manygraphs to "-3" allows you to compare
each graph in file tempinprobes against every graph in
tempindb .
Setting variable manygraphs to "-2" allows you to compare
each graph in file tempin against every other graph in that file.
Note that in this case you must create the file tempin .
For testing, you might want to do this by copying tempindb to tempin.
-
Since many-to-many testing (manygraphs = -2)
is a natural precursor to clustering,
the program offers a k means clustering service.
When the kmeans variable is set to any positive integer k
and manygraphs is set to -2, the program
will discover k clusters (whenever possible) having the following
desirable properties.
Define the similarity radius
between a ``centroid'' and a set of other nodes
to be the minimum similarity between that centroid and the other nodes.
Define the overall similarity radius , given a collection
of centroid-node set pairs, to be the minimum of the similarity radii
of each member of the collection.
In k means cluster formation,
the program finds k centroids and k partitions
such that (i)
the partitions cover the graphs; and
(ii) the overall similarity radius
is as high as (heuristically) possible given k.
-
When the kmeans variable is set to any positive integer k
and manygraphs is set to -2 and
reducecentroids is set to 1, the program
will attempt to find the essential nodes and edges of each centroid
that will improve the overall similarity radius.
It doesn't always help.
-->>
-
No other variables should be changed.
A note about the current clustering algorithm:
Define the similarity radius
between a ``centroid'' and a set of other nodes
to be the minimum similarity between that centroid and the other nodes.
Define the overall similarity radius , given a collection
of centroid-node set pairs, to be the minimum of the similarity radii
of each member of the collection.
In k means cluster formation,
the program finds k centroids and k partitions
such that (i)
the partitions cover the graphs; and
(ii) the overall similarity radius
is as high as (heuristically) possible given k.
Support
This work was partly supported by grant 9531554
of the United States National Science Foundation.
This support is greatly appreciated.