SIGMOD 2008
2008 ACM SIGMOD International Conference on Management of Data
June 9-12 2008
Vancouver
Canada
Paper ID: 533
Title: End-to-End Support for Joins in Large-Scale Publish/Subscribe Systems
Track Name: Research
Review Count: 3

Review: 1
Reviewer: Karl Aberer
Email: karl.aberer@epfl.ch
Organization: EPFL Lausanne
Review:
 QuestionResponse
1Overall Rating Weak Reject
2Reject due to technical incorrectness No
3Novelty Medium
4Technical Depth High
5Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? to a limited extent
6Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? adequate
7If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? no or not applicable
8Presentation Not Good
9Reviewer Confidence High
10Name of External Reviewer (if applicable) Oana Jurca
11Summary of the paper's main contributions and impact (up to one paragraph) The paper presents the problem of supporting select-join subscriptions in a distributed system over a multitude of data sources stored in a centralized location. The authors identify different types of data redundancy stemming from naïve approaches and propose a number of solutions for tuple (event) selection and dissemination. These solutions are detailed for the simpler case of two-way select-joins, and the extension to multi-way select-joins is briefly discussed. Finally, experiments show the improvement achieved by the various solutions over the naïve approaches.
12Three strong points of the paper (please number them S1,S2,S3) S1 Good formulation of the problem
S2 Large range of possible solutions discussed
S3 Extensive consideration of the optimization opportunities
13Three weak points of the paper (please number them W1,W2,W3) W1 Typical publish/subscribe systems allow many subscribers and publishers; here, the data sources are centralized, allowing for only one publisher
W2 The solutions are explained for the simple case of two-way select-joins, while the more probable case of multi-way select-joins is only briefly discussed; further more, it is not clear how straight-forward and what is the overall complexity of the extension.
W3 The presentation lacks clarity: there are many notations, implicit definitions and different solutions and it is difficult to follow. A table recapitulating them would have been very useful. A number of things are omitted due to space constraints and we are referred to a technical report, which for obvious reasons, cannot be cited; maybe this type of submission was not the most appropriate for this work.
14Detailed comments (please number each point) 1. The problem is very well presented, though I am surprised that there is no mention of the case when data sources are in different locations, as is the typical setting for publish/subscribe systems, to allow for multiple subscribers AND publishers
2. The presentation quality needs to be improved, as it is very difficult to grasp the differences between the many approaches. As a general comment, the settings must be specified more clearly, as should be the computational complexity of each solution and the space required for the data structures that is not discussed for any of them; only in the experiments chapter directory sizes are discussed, but it is not clear where the B-trees for semijoins are kept and how much space they take. Tables recapitulating the most important notations and the assumptions/ settings/ complexities/ data structures required, together with advantages and disadvantages of each solution would help.
3. Going back to the settings issue, the terms “server”, “nodes”, “brokers” must be clearly explained, as sometimes they are used interchangeably and some other times they describe different concepts.
4. The “inter-subscription redundancy” is typically transparently resolved in traditional publish-subscribe systems based on a network of brokers; this is not the case for pub/sub over structured overlays and must be treated separately, which is what happens in your solution; again, the setting should be clearer; still, you could have considered such kind of solutions, as in the beginning of the paper it seems that you keep state only on the server and not on the brokers (like in traditional pub/sub) and a lot of work has been done on minimizing such redundancy in pub/sub
5. Not clear where do you actually keep state, depending on the solution
6. Typos: page 2, col1, when defining redundancies: “the some information”; section 2.1 par 3 “cross” instead of “across; sect. 3.1.1 “clusteredness” ; last 2 parags 3.2.1, missing “)”, “to” and “s”; sect 4, par 3 "the contents them"; sect 5, explanation of fig 4-center: “with the payload” instead of “without the payload”; sect 6.1 par1 "use main memory use main memory";, last but one par: "normal distributions distributions N_1..."
7. No point in mentioning the technical report when you are not allowed to reference it.
8. What is the difference between a semijoin in the case of two-way join and a binary semijoin in multi-way join? If there is no difference, then you only need as many semijoins as there are links in the join graph, not tables; if there are differences, than the brief extension explanation in section 4 is using undefined concepts (“t_s affected R^S-semijoins”)
9. What is the complexity of the greedy decomposition algorithm and how does it impact the overall computation complexity?
10. There is no actual comparison to related work, even though the cited papers use at least some of the same techniques you employ; maybe you could have made parallels between the related works and the naïve solutions you compare with.
15Comments for the Program Committee The authors propose a very complex solution, for which they can only provide the details in a very restrictive case (two-way and not multi-way select-joins), due to space constraints. The extension is not straight-forward and their brief discussion is not sufficient. I do not think they can clearly present everything in 12 pages, as the presentation already lacks clarity. Further more, they do not really compare with similar existing approaches, just with some naïve ones, from which it is extremely clear and expected that they will achieve better results.
16Is this paper a candidate for the Best Paper Award? No
17Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) Yes
18List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17) See detailed comments 2,3,5,8,9,10

Review: 2
Reviewer: Alan Fekete
Email: fekete@it.usyd.edu.au
Organization: University of Sydney
Review:
 QuestionResponse
1Overall Rating Weak Accept
2Reject due to technical incorrectness No
3Novelty Medium
4Technical Depth High
5Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? yes, allowing for space limitations
6Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? adequate
7If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? yes
8Presentation Adequate
9Reviewer Confidence Medium
10Name of External Reviewer (if applicable)
11Summary of the paper's main contributions and impact (up to one paragraph) A reasonable algorithm to optimize the computation of joins among event streams, taking account of sharing among subscriptions; this uses semijoins and skyline techniques. The problem mixes ideas from stream processing, distributed query processing, and pub/sub systems, and seems well-motivated; a range of algorithms are explored extensively, and there is a clear winner over much of the design space.
12Three strong points of the paper (please number them S1,S2,S3) S1. An interesting problem

S2. A good solution, melding several ideas from different parts of the db literature

S3. Extensive experiments to show the quality of the solution
13Three weak points of the paper (please number them W1,W2,W3) W1. There are no error bars or confidence intervals in the graphs or discussed in the text (though the smoothness of the curves and the large gaps, suggests to me that the differences are indeed significant)

W2. It would be good to do experiments comparing to the performance from previous work like [14] or [17]; even though each is not aimed at the same problem, it can be applied here, and it would be good to know how it performs in your context.
14Detailed comments (please number each point) 1. The problem is a complex mix of requirements from pub/sub, view maintainance, stream processing; it would help I think to put together all the requirements in one place (what inputs the system gets, how often they change, and what should happen)

2. It would be good to evaluate existing algorithms that were not designed with all these requirements, but can be applied in your context

3. It would be good to evaluate performance as the amount of overlap between subscriptions increases (while the number of subs is constant)
15Comments for the Program Committee
16Is this paper a candidate for the Best Paper Award? No
17Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) No
18List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17)

Review: 3
Reviewer: Nick Roussopoulos
Email: nick@cs.umd.edu
Organization: University of Maryland
Review:
 QuestionResponse
1Overall Rating Weak Accept
2Reject due to technical incorrectness No
3Novelty Medium
4Technical Depth Medium
5Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? to a limited extent
6Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? adequate
7If experiments were presented, would you want them checked for repeatability (as might be required in a future sigmod)? no or not applicable
8Presentation Not Good
9Reviewer Confidence Medium
10Name of External Reviewer (if applicable)
11Summary of the paper's main contributions and impact (up to one paragraph) The paper presents a reduction in maintenance of notification of publish/subscribe systems. Many of the redundancies in such systems are discussed and solutions are provided for avoiding them.
12Three strong points of the paper (please number them S1,S2,S3) s1. Characterization of redundancies
s2. Use and maintenance of dual semi-joins
13Three weak points of the paper (please number them W1,W2,W3) w1. The formulation could have been simpler
w2. The presentation is too coarse in some areas and too dense in others.
14Detailed comments (please number each point) The presentation is not very intuitive. There is some redundancy in it in many areas but not well explained in others. It may due to the space limitation but it requires an effort to follow the paper.
15Comments for the Program Committee
16Is this paper a candidate for the Best Paper Award? No
17Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) No
18List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17)