SIGMOD 2008
2008 ACM SIGMOD International Conference on Management of Data
June 9-12 2008
Vancouver
Canada
Paper ID: 154
Title: WHIM: Frequent Pattern Mining in Weighted Directed Graphs
Track Name: Research
Review Count: 3

Review: 1
Reviewer: Gao Cong
Email: gcong@inf.ed.ac.uk
Organization: Microsoft Research Asia
Review:
 QuestionResponse
1Overall Rating Weak Reject
2Reject due to technical incorrectness No
3Novelty Low
4Technical Depth Medium
5Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? to a limited extent
6Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? adequate
7If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? yes
8Presentation Adequate
9Reviewer Confidence Medium
10Name of External Reviewer (if applicable)
11Summary of the paper's main contributions and impact (up to one paragraph) This paper describes an algorithm for mining frequent patterns in weighted directed graphs. Previous algorithms for mining frequent subgraphs consider undirected graphs without weight, although it is easy to extend them to mine directed graphs.
12Three strong points of the paper (please number them S1,S2,S3) S1. This paper introduces the problem of mining frequent subgraphs from weight directed graphs.
S2. The idea of encoding the problem of finding a maximum-weight occurrence of a pattern into constraint satisfaction problem seems to be interesting.
13Three weak points of the paper (please number them W1,W2,W3) W1. The motivation for mining frequent subgraphs is weak. This paper does not give appealing application. The so-called real-life data used in the experiment is actually not real, but a dataset generated using real data.
W2. The techniques used for mining frequent subgraphs from undirected graphs can be easily extended to mine frequent subgraphs from directed graphs. Previous work actually mentioned it.
W3. Some related work is missed in this paper.
14Detailed comments (please number each point) D1. This paper does not give an appealing application of mining frequent subgraphs from weighted directed graphs. The biological interaction is often not one way processes.
D2. The technical part on generating candidate graphs is simple extension of previous work, although the section on finding a maximum-weight occurrence of a pattern in a graph seems to be interesting.
D3. A number of papers on graph mining are missed in the related work of this paper. E.g..Zhiping Zeng, Jianyong Wang, Lizhu Zhou, George Karypis: Coherent closed quasi-clique discovery from large dense graph databases KDD 2006

Yan and Han. CloseGraph: Mining Closed Frequent Graph Patterns KDD2003-- this work discusses mining directed graphs

Efficient Mining of Frequent Rooted Continuous Directed Subgraphs. Advanced Computing and Communications, 2006. ADCOM 2006.




15Comments for the Program Committee
16Is this paper a candidate for the Best Paper Award? No
17Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) No
18List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17)

Review: 2
Reviewer: Jianzhong Li
Email: lijzh@hit.edu.cn
Organization: Harbin Institute of Technology, China
Review:
 QuestionResponse
1Overall Rating Reject
2Reject due to technical incorrectness No
3Novelty Medium
4Technical Depth Low
5Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? no
6Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? not good
7If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? yes
8Presentation Adequate
9Reviewer Confidence High
10Name of External Reviewer (if applicable)
11Summary of the paper's main contributions and impact (up to one paragraph) This paper investigates the problem of frequent (labeled and directed) subgraph mining from a database of weighted, labeled and directed graphs. The new problem of mining frequent subgraphs from a set of weighted, labeled and directed graphs is defined. A canonical form for representing directed acyclic graphs (DAGs) is devised. Some probabilistic subgraph isomorphism testing methods are designed. Based on these techniques, an algorithm WHIM is proposed to discover frequent subgraphs from a database.
12Three strong points of the paper (please number them S1,S2,S3) S1. A new subgraph mining problem is formally defined, which takes the weights of edges into consideration. A new support measure (maximal occurrence) is proposed.
S2. A canonical form for representing directed acyclic graphs (DAGs) is devised, which prevents the algorithms from searching for the same DAG multiple times.
S3. To reduce the time consumed by subgraph isomorphism testing in mining process, several heuristic methods, greedy, full random and prioritized random were proposed to find the maximum-weight occurrence of a pattern in a database graph without inspecting all occurrences.
13Three weak points of the paper (please number them W1,W2,W3) W1. As a foundation of the formulated problem and the consequent proposed model in this paper, the assumption that the probabilities attached in edges are independent of each other is not consistent with many realistic applications. That is, there are no practical applications mentioned explicitly in the paper. In more detail, the model assumes that "the weighted (probabilities) assigned to edges are independent of one another"(Page 3) and "the occurrence of a pattern in a graph is the product of the individual probabilities of all its edges"(Page 3). However, edges in a graph are usually correlated with one another. For example, the edges between the vertices in a community of a graph are very likely to be dependant of each other.
W2. The paper is not strict in theory. Some important conclusions in the paper are not proved. For example, the uniqueness of the canonical form of a DAG is not proved. The completeness of the discovered set of frequent subgraphs is not proved. The correctness and the approximation ratio of the probabilistic subgraph isomorphism testing algorithm are not proved. These proofs are very important for the validity of the algorithms and methods in the paper.
W3. The paper doesn't study the existing works very well. The existing frequent subgraph mining algorithms, such as gSpan, can deal with directed graphs. ("designed to run on undirected graphs" in Page 2 is not true.) And it is a direct solution to the problem to easily adapt these existing algorithms by computing the product of the weights of edges in each embedding. It can perform efficiently, and the correctness can be guaranteed as well. However, these existing algorithms are not sufficiently considered in the paper.
W4. The experiment part of the paper has the following problems:
1. No comparison with existing algorithms.
2. The dataset used in section 5.1 (Page 9) is obtained by randomly adding weights to edges in the KEGG PATHWAY dataset. Hence the dataset isn't real any more.
3. For the recall and the precision, only the number of patterns are considered. The influence of other factors on the recall and the precision should be considered.
14Detailed comments (please number each point) (1) The proposed model of weighted, labeled and directed graph doesn't stand on the very solid theoretical and application backgrounds. The model assumes that "the weighted (probabilities) assigned to edges are independent of one another" and "the occurrence of a pattern in a graph is the product of the individual probabilities of all its edges". However, edges in a graph are usually correlated with one another. For example, the edges between the vertices in a community of a graph are very likely to be dependant of each other.
(2). The uniqueness of the canonical form of a DAG, i.e. two DAGs are isomorphic to each other if and only if the canonical forms of the DAGs are the same, is not proved.
(3). The completeness of the discovered set of frequent weighted, labeled and directed subgraphs, i.e., all the frequent weighted, labeled and directed subgraphs should occur in the mining result, is not proved.
(4). The correctness and the approximation ratio of the probabilistic subgraph isomorphism testing algorithm are not proved. Due to the downward closure property of frequent weighted, labeled and directed subgraphs, if a frequent subgraph g is considered as infrequent by the probabilistic subgraph isomorphism testing algorithm, then all the frequent subgraphs that can be extended from g will be missed. So, the applicability of probabilistic subgraph isomorphism testing algorithm in frequent subgraph mining needs to be verified.
(5). The existing frequent subgraph mining algorithms such as gSpan, etc, can be easily adapted to solve this problem by computing the product of the weights of edges in each embedding. They can also perform very efficiently, and the correctness can be guaranteed. However, these algorithms are not studied in the paper.
(6). In the paper, the method of how to set the maximum number of iterations (the "max-iterations" parameter of algorithm 2 in Page 8) is not provided. The only mentioned method is given in Page 11, which is derived from the observation of the experimental results on the given datasets. But there isn't any guidance for other datasets or general datasets.
15Comments for the Program Committee
16Is this paper a candidate for the Best Paper Award? No
17Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) No
18List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17)

Review: 3
Reviewer: Alfredo Ferro
Email: ferro@dmi.unict.it
Organization: Catania University
Review:
 QuestionResponse
1Overall Rating Weak Accept
2Reject due to technical incorrectness No
3Novelty Medium
4Technical Depth Medium
5Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? yes, allowing for space limitations
6Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? adequate
7If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? yes
8Presentation Adequate
9Reviewer Confidence High
10Name of External Reviewer (if applicable)
11Summary of the paper's main contributions and impact (up to one paragraph) Authors propose a method to mine weighted directed graphs. It is the first attempt to do that. They introduce a priori criteria to adapt the concept of support to the treated problem. Moreover, a canonical form for DAG is defined together with a probabilistic pattern-matching algorithm. The mining is based on a very promising idea: pruning the possible patterns by using DAG.
12Three strong points of the paper (please number them S1,S2,S3) S1: First method for mining weighted directed graphs.
S2: It defines a canonical form for directed acyclic graphs.
S3: Mining using simple but meaningful structures such as trees (DAG).
13Three weak points of the paper (please number them W1,W2,W3) W1: Running time should be reported (Section 5.1). Some users may be interested and it would be useful to provide a complete evaluation of the system.
W2: Concepts in the Section 4.3 should be presented in a simpler way. Authors could sketch a discussion referring also to subgraph isomorphism notions cited in the related work.
14Detailed comments (please number each point) 1. Experiments on a larger data set of graphs should be provided, giving timing and accuracy for each step of the method.
2. Explain sentence on page 3  Adding a zero-weight …all of its…
3. Check the paper for typos: e.g. page 10  a graphs, page 9 -> on on.
15Comments for the Program Committee
16Is this paper a candidate for the Best Paper Award? No
17Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) No
18List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17)