GraphBlast

Rosalba Giugno
Universita' di Catania
Dipartimento di Matematica e Informatica
giugno@dmi.unict.it

Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu

What is GraphGrep?

Compare GraphBlast Demo

References Collaborators

What is GraphGrep?

GraphGrep is a software package to search for a query graph in a database of graphs. Given a collection of graphs and a pattern graph, GraphGrep finds all the occurrences of the pattern in each graph. The pattern is a subgraph and it can be also a tree, a path, or a node. The pattern is expressed as a list of nodes and a list of edges.

The components of GraphGrep are the following:

graphbuild, to index the graphs in the collection and store their abstract representation;
graphgrep, a graph query language;
2.1 algebra, a collection of operations to manipulate and select data.

GraphGrep manipulates heterogeneous graph collections and does not depend on a specific application.

Why GraphGrep? Many applications in industry, science, and engineering depend on efficient information retrieval methods from structured or semistructured databases. These databases contain items with a complex structure where the relationships between the components (of the items) have to be maintained. The items are typically represented by graphs. For example, shapes, images (computer vision), XML or OEM data (web exchange data model), molecules and proteins (computational biology), graphical sentences and grammars (visual languages) are represented by graphs. By using graphs the problem of information retrieval on a database is transformed to that of finding subgraphs in a collection of graphs. This problem is well known to be NP complete. Although graph-to-graph matching algorithms can be used, efficiency reasons dictate the use of special techniques in order to reduce the search space and the time complexity.

Compare GraphGrep

To use any version of GraphGrep you have to: (1) create a dataset file; (2) preprocess the collection of graphs; and (3) run a query. The output of "graphgrep" contains all the occurrences of the query in the collection of graphs.

GraphGrep2.0 is implemented in C++, it uses an efficient storage system, it is fast and accurate. Compared to GraphGrep1.0 or GraphGrep1.1, GraphGrep2.0 is faster but it does not allow inexact matchings.

	Input Data Format	Input Query Format	Output Format	Language	Storage System	Filtering System	Wildcards
GraphGrep2.1	LNE	LNE, Glide	List of corresponding nodes	C++	Berkeley DB	Compacted Hashed Paths Vector	Yes
GraphGrep2.0	LNE	LNE	List of corresponding nodes	C++	Berkeley DB	Compacted Hashed Paths Vector	No
GraphGrep1.0 GraphGrep1.1	LNE	LNE,Glide	List of corresponding paths	C, K C	Binary Files	Extended Hashed paths	Yes

Download

In order to build a very fast system we are exploring different filtering algorithms, storage and matching techniques. Please specify to us which version you would like to use.

To download GraphGrep, send email to shasha@cs.nyu.edu and to giugno@dmi.unict.it.
If you care to describe your application, we'd be glad to hear about it. In any case, we will send you instructions for downloading the program.

Installation of GraphGrep2.0

GraphGrep2.0 is implemented in ANSI C++. GraphGrep2.0 interfaces BerkeleyDB as the embedded database system to handle the data. It has been ported on Unix, and Windows platforms.

Follow these instructions to install GraphGrep2.0.

Install BerkeleyDB

1. Download BerkeleyDB 4.01 at http://www.sleepycat.com/update/snapshot/db-4.1.25.NC.tar.gz
2. Unzip it to a folder
3. Switch to the inside folder "build_unix"
4. Type "../dist/configure --enable-cxx" and execute
5. Type "make" and execute
6. Type "make install" and execute
7. Make sure there is a directory "/usr/local/BerkeleyDB.4.1/" in your system.
8. Make sure "libdb_cxx-4.1.a" " libdb_cxx.a" are in the "/usr/lib" folder. You need this library to build GraphGrep

If you have any trouble building BerkelyDB or to see detail information, please check "http://www.sleepycat.com/docs/ref/build_unix/intro.html"
Build GraphGrep2.0

1. Install successfully BerkeleyDB.
2. Unzip the graphgrep2.0 package
3. Type "cd graphgrep2.0"
4. Type "cd src; make; cd .. "
5. Check if there are "graphgrep" and "graphbuild" existing in the current folder.

For bugs and questions please contact the giugno@dmi.unict.it and shasha@cs.nyu.edu.

Usage of GraphGrep2.0

dataset file
The file consists of a collection of graph specifications (LNE=List of Nodes and Edges ids format). The first line of each graph must begin with the character '#' and contains the label of the graph; it can be a number or a string. The next line contains the number of nodes in the graph. Subsequent lines contain the nodes' labels, one label per line. The first label is the label of node 0, the next one is the label of node 1, etc. Next is the number of edges in the graph, followed by a list of edges consisting of node id pairs. Each line contains only one edge. GraphGrep2.0 assumes that all edges are symmetric and no double edges are given. An example of dataset file is given in here.|

query file
It has the same format of the dataset file; only one graph per file is allowed (see an example here) .
output file
The output is given in a file called "query_file_name.out". For example, if the query file name is query.dat the output is given in query.dat.out. The first rows of the output file contain the list of the node ids of the query.
Each following row of the output file is a subgraph matching the query. The rows contain an identification of the graph, the number of matches in each graph if not zero and a list of data nodes matching the query nodes given in the first row (see example here ).

Run GraphGrep2.0

To run the examples go to the GraphGrep2.0 directory and do the following (If you want to run GraphGrep2.0 in a different directory: make sure that graphgrep and graphbuild are in your path):

Database creation
In order to build a GraphGrep2.0 database, run "graphbuild -f dataset_file_name". This is a preprocessing step that creates an abstract representation of the data.
- Parameters
  graphbuild -f graphDBfile [ -lp lengthpath -fp length of FP ]
  -lp lengthpath value, default is 4
  -fp length of FP vector, default is 256
Query Execution
run "graphgrep -m query_file_name"
- Parameters
  graphgrep -m queryfile [-o -v -noFL -n numGraphs -lp lengthpath -fp length of FP]
  -"o output count"     -----it outputs the number of total matches
  -"v verbose mode"   -----it outputs the steps of opfile
  -"noFL disable filtering"
  -"n numGraphs"       -----input number of graphs in DB(only with option -noFL)
  -"lp lengthpath"        ------input different LP, default is 4
  -"fp length of FP"     -------input different length of FP, default is 256

R. Giugno, D. Shasha, GraphGrep: A Fast and Universal Method for Querying Graphs.
Proceeding of the International Conference in Pattern recognition (ICPR), Quebec, Canada, August 2002. pdf

D. Shasha, J.T-L Wang, R. Giugno, Algorithmics and Applications of Tree and Graph Searching
Proceeding of the ACM Symposium on Principles of Database Systems (PODS), Madison, Wisconsin, June 2002. pdf

Chih-Yi, HSU, master in computer science at New York University 2002. Now he is working at AT&T lab, New Jersey.
This project is supported by New York University, USA and University of Catania, Italy.
Visit our CT-NYU project page for a further useful package of algorithms with applications in Bioinformatics and Mobile Stations.

Grant Support

This work has been partly supported by the U.S. National Science Foundation under grants IIS-0414763, DBI-0445666, N2010 IOB-0519985, N2010 DBI-0519984, DBI-0421604, and MCB-0209754. This support is greatly appreciated.