SIGMOD 2008
2008 ACM SIGMOD International Conference on Management of Data
June 9-12, 2008, Vancouver, Canada

Review: 1
Reviewer: Guy Lohman
Email: lohman@almaden.ibm.com
Organization: IBM Almaden Research Center

Review:
1. Overall Rating: Weak Reject
2. Reject due to technical incorrectness: No
3. Novelty: Medium
4. Technical Depth: Medium
5. Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? To a limited extent.
6. Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? Not good.
7. If experiments were presented, would you want them checked for repeatability (as might be required in a future SIGMOD)? No or not applicable.
8. Presentation: Adequate
9. Reviewer Confidence: High
10. Name of External Reviewer (if applicable): Torsten Steinbach

11. Summary of the paper's main contributions and impact (up to one paragraph):
This paper presents an extreme approach to query scheduling, in the sense that the DBMS is treated as a black box. It develops both off-line and on-line scheduling algorithms for a workload, taking into consideration the effect that concurrently running queries have on each query's run time. To schedule a set of queries off-line, the algorithm builds a linear programming model whose factors are derived experimentally, by measuring the database system as a black box while running various mixes of those queries. The authors evaluate their scheduling algorithm using the 12 longest-running queries in a 1 GB TPC-H database. Even though I am skeptical that this approach would work well for real-world workloads, it provides an interesting contrast to scheduling approaches based on database performance models.

12. Three strong points of the paper (please number them S1, S2, S3):
S1. The model takes the interaction of concurrent queries into consideration.
S2. Provides both off-line and on-line scheduling algorithms.
S3. The scheduler is outside the DBMS, so it is portable.
S4. NRO seems to be a good metric for the approach chosen, and is independent of the many resources and their constraints.

13. Three weak points of the paper (please number them W1, W2, W3):
W1. The development and exposition of the model in Section 6, and especially of the constraints in Section 7, are very non-intuitive and poorly explained. Why are the constraints in terms of the number of instances being greater than I(j), the number of instances of query type Q(j) in the queue? Try explaining what you're trying to do in English first, and then, when the reader's intuition is better prepared, provide the formal formulation. (A rough sketch of how I read the formulation appears after this list.)
W2. No references to, or discussion of the relevance of this technique to, the huge literature on scheduling in Operations Research. There are numerous books on this topic, e.g., the classic text by Conway, Maxwell, and Miller.
W3. The model has impractical data collection requirements. In particular, one has to enumerate all the possible job mixes, and then for each job mix one has to measure the performance of each of the queries in that mix as it executes concurrently, in order to get all the A(i,j) factors, which denote "the average completion time of a query of type Q(j) when running concurrently with other queries in mix m(i)". So you'll have to run the entire workload at least i * R times, where R is the number of repetitions needed to get reasonably accurate measurements, and i is the number of possible mixes of the j different query types, which even the authors admit could be around 50 and in an industrial setting is likely to be in the hundreds (see the back-of-the-envelope sketch after this list). Little matter that the resulting LP can be solved in reasonable time -- by the time you do all that data collection, the workload will have changed!
W4. A 1 GB TPC-H database fits into memory on today's machines, i.e., it is a trivial database. This means that I/O and buffer pool considerations were not measured by the experiments, and these are a major source of interaction between concurrently running queries. In fact, the interaction can be positive if the queries share data in the buffer pool, something that wasn't even explicitly mentioned.
W5. The off-line approach is applicable to only very few real-life scenarios.
W6. The approach taken in this paper is not properly compared, analytically or experimentally, against the scheduling tools used by real database products, e.g., DB2's Workload Manager (WLM) or Oracle's ADDM (VLDB 2004).
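
To make W1 concrete, here is my rough reading of what the off-line formulation seems to be doing. The symbols N(i,j) (copies of query type Q(j) contained in mix m(i)) and C(i) (estimated duration of one execution of mix m(i), derived from the measured A(i,j) factors) are my own shorthand, not necessarily the paper's notation:

```latex
% Rough sketch of how I read the off-line LP (my notation, not the paper's):
%   n_i    = number of times mix m_i is scheduled (decision variables)
%   N_{ij} = copies of query type Q_j contained in mix m_i        (assumed)
%   C_i    = estimated duration of one execution of mix m_i       (assumed)
%   I_j    = instances of query type Q_j waiting in the queue
\begin{align*}
\min_{n \ge 0} \quad & \sum_i n_i \, C_i
  && \text{total estimated time of the chosen mixes} \\
\text{s.t.} \quad & \sum_i n_i \, N_{ij} \;\ge\; I_j \quad \forall j
  && \text{every queued instance of } Q_j \text{ is covered}
\end{align*}
```

If this reading is right, the constraint the question refers to is simply a covering constraint: the selected mixes, taken together, must contain at least as many executions of each query type as there are instances of that type in the queue. The paper should say so in plain English before presenting the formalism.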
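
To quantify the W3 concern, here is a back-of-the-envelope sketch of the calibration cost. The multiprogramming level, the number of sampled mixes, and the repetitions per mix are illustrative assumptions of mine, not numbers taken from the paper:

```python
from math import comb

def mix_space(num_types: int, mpl: int) -> int:
    """Distinct query mixes (multisets) of `mpl` concurrent queries drawn
    from `num_types` query types: C(mpl + num_types - 1, mpl)."""
    return comb(mpl + num_types - 1, mpl)

def calibration_runs(sampled_mixes: int, repetitions: int) -> int:
    """Measured workload executions needed to estimate the A(i,j) factors
    if each sampled mix is run `repetitions` times."""
    return sampled_mixes * repetitions

# Illustrative numbers only (not taken from the paper):
print(mix_space(12, 10))        # 352716 possible mixes for 12 types at MPL 10
print(calibration_runs(50, 5))  # 250 measured runs even when sampling only 50 mixes
```

Even restricted to a small sample of the full mix space, the product of sampled mixes and repetitions amounts to hundreds of controlled runs before any scheduling benefit is seen.
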
14. Detailed comments (please number each point):
C1. What guarantees integral values for the n(i), given that you solved the problem as an LP, not an IP? Lemma 7.1 tells you how many of the n(i) will be non-zero, but what will you do if the n(i) are non-integral? Or do you care? (See the toy example after this list.)
C2. With only 12 distinct queries used to validate the model, it's not terribly surprising that you can get "very high accuracy by using simple linear models and training these models with a small number of query mixes (50 or less)..." That's not exactly an industrial-strength workload.
C3. Even so, Figs. 7 and 8 show a disturbing lack of convergence in the linear model, with 40% error and growing as the MPL increases.
C4. Figs. 10-12 are not particularly enlightening. The lines are hard to distinguish, and they use a weird metric: "slowdown compared to the off-line algorithm", defined as (completion time of off-line schedule) / (completion time of algorithm being tested). And why would the on-line schedule suddenly improve relative to the off-line schedule at an MPL of 60, when you argue that "at MPL 60 there are almost no good schedules"?
C5. I was surprised you didn't cite the quite old Hot Set model (Sacco and Schkolnick, 1986), which tried to schedule concurrent jobs at the "knee in the curve" of performance, where diminishing returns and system thrashing reduce the gain in throughput as more queries run concurrently.
C6. I don't buy your assumption that different values of parameter markers won't change a query's performance characteristics. That might be the case for the contrived and purified world of TPC-H, where everything is uniform, but certainly not in general. In fact, there's a fairly large literature on how to optimize such queries because of these differences.
C7. You say in Section 5.3 that you have experimentally determined a good value for theta (NRO) to be 0.7, but was that obtained over a wide variety of hardware, workloads, and databases? If not, then it could be bogus information for someone trying to use your model.
C8. Section 5.2: "The 30 Q1 queries finish in 163.7 seconds, while it would take 30 * 10.07 = 302.1 seconds to run them sequentially." This is an example of positive correlation, due to the 29 other executions of Q1 piggybacking on the first execution bringing the pages into the buffer pool. Q1 can only be done via a table scan. Here's where treating the DBMS as a black box bites you, because you lose this insight.
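
To illustrate the integrality question in C1, here is a toy instance (invented numbers, not the paper's data) showing that the LP relaxation of a mix-covering problem can return fractional n(i):

```python
# Toy mix-covering LP (invented data): minimize total mix time subject to
# covering the queued query instances; the optimum need not be integral.
import numpy as np
from scipy.optimize import linprog

C = np.array([10.0, 12.0, 9.0])   # assumed duration of one run of each mix
N = np.array([[2, 1, 0],          # N[i][j]: copies of query type j in mix i
              [0, 2, 1],
              [1, 1, 1]])
I = np.array([3, 3, 2])           # queued instances of each query type

# minimize C @ n  subject to  N.T @ n >= I  and  n >= 0
res = linprog(c=C, A_ub=-N.T, b_ub=-I, bounds=[(0, None)] * 3, method="highs")
print(res.x)  # fractional optimum, roughly [0.67, 0.33, 1.67] for these numbers
```

Rounding the n(i) up is always feasible in a covering formulation like this (it only over-covers), but it costs extra mix executions; Lemma 7.1 bounds how many n(i) are non-zero, not whether they are integral, which is why C1 asks.
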
15. Comments for the Program Committee:
This review merges both Guy Lohman's initial review (which resulted in a Weak Reject) and a quick review by Torsten Steinbach (who gave it a Weak Accept). Torsten's comment:
I am a bit skeptical that this approach can be of value in real-life scenarios. It completely ignores very important factors for concurrent query performance, such as synergies on cached data. But I think it should be presented and discussed as an interesting example of a "black box" scheduling approach.

16. Is this paper a candidate for the Best Paper Award? No
17. Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) No
18. List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17):

Review: 2
Reviewer: Steve Schlosser
Email: steven.w.schlosser@intel.com
Organization: Intel Research

Review:
1. Overall Rating: Weak Reject
2. Reject due to technical incorrectness: No
3. Novelty: Medium
4. Technical Depth: Medium
5. Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? Yes, allowing for space limitations.
6. Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? Adequate.
7. If experiments were presented, would you want them checked for repeatability (as might be required in a future SIGMOD)? No or not applicable.
8. Presentation: Very Good
9. Reviewer Confidence: Low
10. Name of External Reviewer (if applicable):

11. Summary of the paper's main contributions and impact (up to one paragraph):
The authors present a system, QShuffler, for scheduling database query mixes to maximize total system throughput by minimizing inter-query interactions. Workloads are restricted to collections of queries drawn from a fixed set of query types, although the number of queries of each type varies. Two scheduling algorithms are described, one on-line and one off-line, along with a cost metric, Normalized Runtime Overhead (NRO). The implementation consists of the scheduler ordering requests from a work queue to the DBMS, in this case DB2.
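
For readers unfamiliar with this architecture, the loop below is a minimal sketch of how I understand such an external, black-box scheduler to operate; the names (choose_next, execute_on_dbms) and the MPL value are placeholders of mine, not QShuffler's actual API:

```python
# Minimal sketch of an external admission-control loop (my reconstruction,
# not the paper's code): keep up to MPL queries running and let an opaque
# mix model decide which queued query to admit next.
import threading

MPL = 10                                  # assumed target multiprogramming level
slots = threading.Semaphore(MPL)

def execute_on_dbms(query):
    """Placeholder: submit `query` to the DBMS (DB2 in the paper) over a client connection."""
    pass

def scheduler_loop(queue, choose_next):
    """`choose_next(queue)` stands in for the black-box mix model's decision."""
    while queue:
        slots.acquire()                   # wait for a free execution slot
        query = choose_next(queue)        # pick the query predicted to mix best
        queue.remove(query)
        threading.Thread(target=_run, args=(query,)).start()

def _run(query):
    try:
        execute_on_dbms(query)
    finally:
        slots.release()                   # free the slot when the query finishes
```
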
12. Three strong points of the paper (please number them S1, S2, S3):
S1: Attacking an important problem.
S2: Good approach using black-box models.
S3: Strong writing.

13. Three weak points of the paper (please number them W1, W2, W3):
W1: I was surprised to see that the results were underwhelming in the end. The greatest performance improvement I saw for the on-line algorithm was about 25% over FCFS.
W2: The time and effort it takes to build the models seem high, and this cost felt obscured in the writing.
W3: I would like to see more analysis of the effects of data skew. The results shown in Section 8.5 seem like an afterthought and don't lay the issue to rest.

14. Detailed comments (please number each point):
1. The graphs were hard to read in black and white.
2. The 1 GB scale of the TPC-H benchmark is too small. How well does QShuffler handle the load imposed by larger datasets?
3. The evidence presented in Section 5.1, showing that one-size-fits-all configurations don't exist for all workloads, is too disconnected from the rest of the paper. For example, I can't immediately see how using NRO would help me solve the configuration problem presented in the example. It would be a stronger example if the next section, which describes NRO, connected back to it, so that I could tell exactly what problem is being solved by using NRO.
4. The time to build the data by running repeated experiments seems high, and it was difficult for me to get a clear picture of how expensive it was. What would Figures 4 and 5 look like as a function of time? Can new models be built incrementally as new templates are created, or does everything have to be regenerated?
5. In Section 8.4 the authors mention that choosing theta dynamically might be interesting, but in the previous sentence they claim that their model is not too sensitive to it. Would this really be worth it? On the other hand, 0.7 seems like a magic number derived just for the TPC-H benchmark used here. How could we expect this to change with other workloads?
6. I was glad to see the result in Figure 14 use all of the query types in TPC-H, but this led me to wonder why the authors didn't use all of the queries for the other results, especially because their algorithm does better with higher T. I would have liked to see more results with T=12.
7. This is a well-written paper: detailed throughout, with a good explanation of the problem and solution, and algorithms presented at the right level of detail.
8. The experimental setup was well described. There is no information about statistical significance (number of runs, confidence intervals, etc.). Standard benchmarks (TPC-H) were used. I would have liked more results on, or at least more discussion of, data skew.

15. Comments for the Program Committee:
1. In the end, I was positive about this paper.

16. Is this paper a candidate for the Best Paper Award? No
17. Would author feedback be useful for this Review? (if "Yes", please answer Q. 18) No
18. List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17):