Jason Wang
Department of Computer Science
New Jersey Institute of Technology
jason@cis.njit.edu
The approximate graph matcher runs in a high performance interpreted environment called K.
The score is defined as follows. Given a query graph q and a database graph d , the score is based on the best alignment found between q and d . Call that alignment a . An alignment is a 1:1 (partial) mapping from nodes in q to nodes in d . The score depends on how good an alignment a is.
In the case that all edges represent the same distance (defined below), an edge {i,j} in q matches an edge {i',j'} in d with respect to alignment a , if, first, alignment a maps i to i' and j to j' ; second, i and i' are of the same type and j and j' are of the same type. In the uniform distance case, the score is the number of matches between q and d with respect to alignment a divided by the total number of edges in q . Thus, each match is worth 1 and, in the first score, we are looking for the number of matching edges divided by the total number of edges in q . The second score is the same but is divided by the total number of edges in d .
When edges may have different distances, then an edge {i,j,s} in q matches an edge {i',j',s'} in d with respect to alignment a , if, first, alignment a maps i to i' and j to j' ; second, i and i' are of the same type and j and j' are of the same type. This is the same as for the uniform distance case. The contribution of this match is min(s/s', s'/s) . Thus, the contribution of a match cannot be greater than 1 and is often less, unlike in the uniform distance case. In the different distance case, the first score is the sum of the contributions of the matches between q and d with respect to alignment a divided by the total number of edges in q . The second score is the same but is divided by the total number of edges in d .
The notion of distance is an abstraction from molecular comparisons: small distances (e.g. 1) connote a stronger bond than big distances (e.g. 3). Thus, to the program, distance is the inverse of strength. So, if your application has some notion of strength (that may have nothing to do with physical or chemical strength), then invert it when assigning distances. Distances should be positive integers.
The notion of type is a way to constrain matches. Each node in one graph must map to a node of the same type in the target graph. For example, when defining an alignment between molecules, types might constrain mappings so each atom maps to another atom of the same type: oxygen to oxygen, carbon to carbon, etc. Types are defined using positive integers.
As you can see by looking at tempindb or tempinprobes, each graph is characterized by a label with which it will be identified for matching purposes. This can be a number or a string.
Next, there is a list of types which can be placed on one or more lines (any number of spaces or newlines may separate types). The first type is the type of node 0, the next one of node 1, etc.
Finally, there is a list of edges consisting of node id pairs and the distance. We assume that all edges are symmetric. There should be one edge per line.
The program has just a few options which you set by adjusting
command line parameters or modifying graph.k.
For the command line parameters, please type
k graph +help
You will see the following:
(+f filenameforalltoall (tempin) [+nc numclusters (1)] |
+p probefile (tempinprobes) +d dbfile (tempindb) |
+m manygraphs (-2) [+n nodecount (50)]
[+en extranodes (0)] [+ee extraedges (0)] [+t numtypes (2)]
)
[+a givemealignment (1)] [+h numberofhillclimbs (50)]
[+help]
Example 1: All to all comparison of file tempin with 100 hillclimbs.
k graph +f tempin +h 100
Example 2: five 70 node random graphs and lab2 with 8 extra nodes and 20 extra
edges.
k graph +m 5 +n 70 +en 8 +ee 20
You can also modify variables directly in the program if you want. Here is the meaning of these options and variables.
A note about the current clustering algorithm:
Define the similarity radius
between a ``centroid'' and a set of other nodes
to be the minimum similarity between that centroid and the other nodes.
Define the overall similarity radius , given a collection
of centroid-node set pairs, to be the minimum of the similarity radii
of each member of the collection.
In k maxmin cluster formation,
the program finds k centroids and k partitions
such that (i)
the partitions cover the graphs; and
(ii) the overall similarity radius
is as high as (heuristically) possible given k.
This material is based upon work partly supported by the United States National Science Foundation under grant IIS-9988636 Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.