|
GraphClust |
Diego Reforgiato
Recupero
Universita' di Catania
Dipartimento di Matematica e
Informatica
diegoref@dmi.unict.it
Dennis
Shasha
Courant Institute of Mathematical Sciences
Department of
Computer Science
New York University
shasha@cs.nyu.edu
GraphClust is a tool that, given a labeled (directed and undirected) graph,
clusters the nodes in the graph based on their topology. The GraphGrep
software, by contrast, allows relatively small graphs to be used as queries
into databases of usually larger graphs. That software finds matching subgraphs
in the larger graphs very quickly.
GraphClust consists of 16 different algorithms broken down along four binary
dimensions:
- Number of clusters: The two options are: (1) specify the number of
clusters explicitly. (The underlying algorithm is k-means neighbor.) (2)
specify a "tightness" measure (an integer value in the range 1 to 4) where the
higher the tightness value the smaller the cluster radius and hence the larger the
number of clusters. (The underlying algorithm is the Antipole algorithm which
is faster than k-means.)
Options: -kmeans k for k-means where k is
an integer greater than 1 or -tightness k where k is between 1 and 4.
Default: Must be specified.
- Definition of substructures: If the -s S option is specified then the comparison is based on the Subdue-located common substructures,
developed at the University of Texas at Arlington, which finds common substructures in graphs. Otherwise, it looks for common paths up to a small length (currently 4). It works best with the -s S option.
Options: -s S for Subdue or -s P for paths
Default:
Common Paths.
- Graph Type: Graphs can be either directed or undirected. If
undirected, the edge specification is interpreted as a set of two-element sets
as in LNE. If directed, the edge specification is interpreted as a set of
ordered pairs.
Options: -g D for directed graph or -g U for
undirected.
Default: undirected.
- Distance metric: For each graph, we record the number of times each
substructure is present, thus constituting a vector of non-negative integers.
The metric between graphs is either (1) the inner product of the vectors for
each graph; or (2) the Euclidean distance between those vectors.
Options: -m I for inner product or -m E for Euclidean.
Default: inner product.
For all
algorithms, the procedure starts in the same way. First, all substructures are
found for each graph. Then a matrix is formed whose columns consist of all
possible substructures and for which there is one row for each graph. Each entry
[i,j] represents the number of substructures j in graph i. The following example
illustrates this when the substructures are paths, the graphs are considered to
be undirected, the number of clusters is 15, and the distance metric is
Euclidean. So the options would be -g U -m E -kmeans 15 -s S.
|
|
Graph 1 |
Graph 2 |
For Graph1, we have the following shortest paths
of length 1 up to LP=3.
starting from the upper node A :
{A,AC,AB,ABA}
starting from the node B : {B,BC,BA,BA}
starting from the
downer node A : {A,AB,ABC,ABA}
starting from the node C :
{C,CA,CB,CBA}
For Graph2, we have the following shortest paths of length
1 up to LP=3.
starting from the node A : {A,AB,AC}
starting from the node
B : {B,BA,BAC}
starting from the node C : {C,CA,CAB}
When edges are
undirected, the path XYZ is equal to path ZYX we form the following matrix
Graph |
A |
AC |
AB |
ABA |
B |
BC |
ABC |
BAC |
Graph1 |
2 |
2 |
4 |
2 |
1 |
2 |
2 |
0 |
Graph2 |
1 |
2 |
2 |
0 |
1 |
0 |
0 |
2 |
Once the matrix is
created both algorithms take all rows and cluster them using distances - either
inner product or Euclidean distance (Euclidean in this example), chosen by the
user.
To use any algorithm of GraphClust you have to: (1) create a dataset
file; (2) choose your options.
At the end of the process, a form of correlated substructures pairs is displayed.
The following image shows an example of such a form:
The substructures pairs are sorted decreasingly according to the correlation value.
Greater the correlation
value is, more correlated the two substructures are. The substructures pair chosen by the user in the left panel is displayed on the right.
For each substructure, the nodes, painted as circles, have a label and the identification number. A circle
containing the dot symbol is used as dummy and it means that the proposed substructure has only 1 node (the
one linked to the circle with the dot symbol).
Two input parameters are here involved:
- -min m. Only substructures with the number of nodes greater or equal to m are showed.
Options: integer value.
Default: 0.
- -SVD r. If this parameter is not used, then the values of the substructure-substructure
correlation matrix X are used (the ones present in the left column of the form).
Else, the correlation matrix X is broken apart into the product of 3 matrices T0,
S0 and D'0 by using the singular value decomposition (SVD) algorithm.
The first r values in S0 are retained whereas the remaining values are set to 0.
Finally, the reduced correlation matrix is the product of the three matrices and its values are showed
in the form.
Options: integer value.
Default: null.
The output of "graphclust" includes a file named 'output' and a directory named
'correlation_files'.
The first one displays all the
generated clusters and for each cluster it displays the centroid, all its
elements and the distances of each element from the centroid.
The second one contains all the information already showed in the form.
Clustering is done in many areas of computer science, and there are a lot of
different techniques. There are many applications for clustering graphs. Some
examples are: Internet search engines, knowledge management systems, document
databases and XML documents.
Home
GraphClust is implemented in ANSI C, the graphical interface is implemented in JAVA and
the substructures images are built by using the springgraph command.
Home
To download GraphClust, please send email to shasha@cs.nyu.edu and to diegoref@dmi.unict.it.
If you care to
describe your application, we'd be glad to hear about it. In any case, we will
send you instructions for downloading the program.
Home
GraphClust is implemented
in ANSI C. It has been
ported on Unix, and Windows platforms.
To install
GraphClust:
- Unzip the GraphClust package
- type "cd GraphClust"
- type "cd src"
- type "make"
- type "cd .."
- Check if there is "graphclust_main", "graphclustA", "graphclustA_S",
"graphclustB", "graphclustB_S" existing in the current folder
- Download the SubDue5.1 release from the download section at http://ailab.uta.edu/subdue/
- unzip the zip file where you want
- go to subdue-5.1.0/src directory
- type make
- copy binary files sgiso and subdue to the directory that graphclust_main,
graphclustA, graphclustA_S and graphclustB_S are in.
- type ./graphclust_main without arguments to see all the possible options.
- Check if the springgraph command is present in own unix system. (Even without it, the sotware
still works but it will not show the substructures images).
For bugs and questions please contact diegoref@dmi.unict.it or shasha@cs.nyu.edu.
Home
Usage of
GraphClust
File
formats
- dataset file
The file consists of a collection of graph
specifications (LNE=List of Nodes and Edges ids format). The first
line of each graph must begin with the character '#' and contains the
label of the graph; it can be a number or a string. The next line contains the
number of nodes in the graph. Subsequent lines contain the nodes'
labels, one label per line. The first label is the label of node 0, the next
one is the label of node 1, etc. Next is the number of edges in the
graph, followed by a list of edges consisting of node id pairs. Each line
contains only one edge. An example of dataset file is given in here.
- output
file
The output is given in a file called "output". Each
Cluster is identified by an index number (starting from 1) with the number of
graphs in the cluster and the radius in square brackets. In the next row we
have the centroid of the cluster. In the subsequent rows we have the other
elements, if any, and their distance from the centroid in brackets.
(see
example here ).
The "correlation_files" directory includes all the information already showed in the form.
Running
GraphClust
To run the examples go to the GraphClust directory and do the
following
- Execution
type
graphclust_main dataset_file_name".
You have to specify if you want to run the kmeans algorithm with the option
-kmeans k where k is the number of cluster you want or with the
Antipole algorithm with the option -tightness measure where
measure is 1..4.
Other options are:
- -s S in order to use subdue algorithm to find the substructures
- -s P in order to use shortest path to find the substructures
- -g D to deal with directed graphs
- -g U to deal with undirected graphs
- -m I to use the inner product metric
- -m E to use the Euclidean distance
- -min m to show the substructures whose number of nodes is greater or equal than m
- -SVD r to reduce the substructure-substructure correlation matrix by using the SVD
algorithm retaining the most r meaningful rows.
Home
Clustering References