SIGMOD 2008
2008 ACM SIGMOD International Conference on Management of Data
June 9-12, 2008
Vancouver, Canada
Paper ID: 34
Title: Similarity-based Search over Uncertain Time Series
Track Name: Research
Review Count: 3

Review: 1
Reviewer: Naoko Kosugi
Email: kosugi.naoko@lab.ntt.co.jp
Organization: NTT
Review:
1. Overall Rating: Weak Accept
2. Reject due to technical incorrectness: No
3. Novelty: High
4. Technical Depth: High
5. Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? Yes, allowing for space limitations.
6. Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? Adequate.
7. If experiments were presented, would you want them checked for repeatability (as might be required in a future SIGMOD)? No, or not applicable.
8. Presentation: Adequate
9. Reviewer Confidence: Medium
10. Name of External Reviewer (if applicable): Yoshiharu Ishikawa
11. Summary of the paper's main contributions and impact (up to one paragraph): As the title says, the paper proposes an approach to similarity-based search over uncertain time series. An uncertain time series is an ordered sequence of probability distributions. The authors propose a dimensionality reduction technique and an effective pruning strategy for searching, and they also present methods for probabilistic range queries and k-nearest neighbor queries. They evaluate pruning power and index query performance on two kinds of datasets; the evaluation shows good cost performance.
12. Three strong points of the paper (please number them S1, S2, S3): S1: The paper considers the retrieval problem for uncertain time series, which appears to be a new problem in this field.
S2: The techniques for uncertain range queries, especially the pruning strategy, are interesting, since they elegantly use the properties of probabilistic modeling.
S3: As the experimental results show, the proposed strategies work well.
13. Three weak points of the paper (please number them W1, W2, W3): W1: I am not sure that a practical example of an uncertain time series is actually available, or that the proposed way of modeling an uncertain time series (e.g., for stock prices) would be acceptable in the real world. (I do not, however, consider this a serious problem of the paper.)
W2: The authors evaluated only the cost performance of their proposed methods; they did not evaluate the accuracy of the retrieved results.
14. Detailed comments (please number each point): The paper is well written. The target problem is novel and clearly stated, and the proposed methods are, to the reviewer's knowledge, new.

The short explanations after the definitions, theorems, and lemmas are helpful to readers.

As I said in comment W1 above, I wonder whether the proposed problem and solutions are applicable to real-world data. The authors are encouraged to discuss whether the proposed approach to modeling an uncertain time series is actually effective in real applications.

The most interesting part of the paper for me is Section 3.5, which describes the pruning technique; however, pruning conditions 1 and 2 are applicable only in limited cases. It would be better to provide statistics describing how often the conditions apply to the target data.

There were a few typos:
- Eq. (10): a "-" is missing in the second line.
- Page 6, right column, the paragraph "One remaining issue ...": the formulas need to be checked carefully.
15. Comments for the Program Committee: I recommend a rollover to VLDB.
16. Is this paper a candidate for the Best Paper Award? Yes
17. Would author feedback be useful for this Review? (If "Yes", please answer Q. 18.) No
18. List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17):

Review: 2
Reviewer: Philip Yu
Email: psyu@cs.uic.edu
Organization: IBM T. J. Watson Research Center
Review:
1. Overall Rating: Reject
2. Reject due to technical incorrectness: No
3. Novelty: Medium
4. Technical Depth: Low
5. Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? Yes, allowing for space limitations.
6. Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? Adequate.
7. If experiments were presented, would you want them checked for repeatability (as might be required in a future SIGMOD)? No, or not applicable.
8. Presentation: Not Good
9. Reviewer Confidence: High
10. Name of External Reviewer (if applicable): Spiros Papadimitriou
11. Summary of the paper's main contributions and impact (up to one paragraph): The paper defines the problem of uncertain range queries (URQs) for time series data and extends it to kNN queries. It studies some theoretical properties, proposes a lower-bounding distance and an indexing scheme, and evaluates the performance of the algorithms.
12. Three strong points of the paper (please number them S1, S2, S3): S1. Adapts URQ concepts to time series, employing a common query model (distance and probability threshold pairs).
S2. Proposes a kNN variant for time series.
S3. Thorough coverage of related work.
13. Three weak points of the paper (please number them W1, W2, W3): W1. Assumptions and models are never stated clearly.
W2. Technical correctness is not clear (it depends on the assumptions made).
W3. Synthetic datasets (esp. the uncertainty component).
W4. The practical significance of the first claimed contribution (URQs on time series) is not really demonstrated or evaluated.
14. Detailed comments (please number each point): C1. The assumptions made in the paper and the model for time series should be stated up front. For example:
- Do you assume that the probability measures have bounded or compact support?
- Do you assume independence between the time series in the database? For instance, in the stock example, the daily movements of two stocks (between their high and low prices) are presumably not independent. Hence, some model to capture this dependence might be useful.
- Do you assume that there is no time dependence? If so, do you assume that the distributions for different time points in one series are the same or different? Making both assumptions (no time dependence and identical per-point distributions) seems unlikely to be intended, since the model would then apply only to white noise sequences, which is not very interesting in practice.
- The method claims to make no assumptions about the distributions at each time point. To the best of my knowledge, given only mean and variance information, it is not possible to obtain bounds significantly tighter than those provided by the Chebyshev inequality for a single variable (the classical bound is recalled below). For multiple variables, it is unclear (it depends on the precise assumptions made about dependences).
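
For concreteness, the single-variable bound I have in mind is Chebyshev's inequality, which is tight over the class of all distributions with a given mean and variance: for any random variable $X$ with mean $\mu$ and variance $\sigma^2$, and any $k > 0$,

$$\Pr\bigl(|X - \mu| \ge k\sigma\bigr) \;\le\; \frac{1}{k^2}.$$

If the paper's bounds are substantially tighter than this while truly assuming nothing beyond mean and variance, that needs an explicit justification.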

C2. The central limit theorem (CLT) assumes iid variables (its classical statement is recalled below). Clearly, the \lambda_i cannot be iid unless you are modeling pure white noise processes (in which case mean and variance are indeed sufficient to characterize a sum of many such variables). Thus, I am somewhat confused about the technical correctness of the bounding approach. Furthermore, reduction to two dimensions is too good to be true for real time series (i.e., non-white-noise processes): for "certain" time series, two dimensions do not provide enough pruning power, so I do not see how they can be sufficient for indexing uncertain series, which is a more difficult problem.
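
For reference, the classical (Lindeberg-Lévy) statement, so that the authors can check which hypotheses their argument actually needs: for iid random variables $X_1, \dots, X_n$ with mean $\mu$ and finite variance $\sigma^2$,

$$\frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} (X_i - \mu) \;\xrightarrow{d}\; \mathcal{N}(0, 1) \quad \text{as } n \to \infty.$$

If the \lambda_i are dependent or non-identically distributed, a normal approximation to their sum requires a different justification (e.g., a CLT for weakly dependent sequences, whose conditions would have to be verified explicitly).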

C3. The definitions of pdf_{Q,T_i,dist_U} and pdf_{Q,T_i,LB_U} (and, consequently, of the respective cdfs) are not clear. At the end of Section 3.1, you state that “calculating CDF_{Q,T_i,dist_U} is not trivial”. How to do that depends heavily on the definition of pdf_{Q,T_i,dist_U}, which in turn depends on assumptions about independence (or the lack thereof). On what space is this probability measure defined? You integrate over distance only in equation (3), so it would seem this is a marginal for a pair of time points.
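
My working reading of these definitions, which the authors should confirm or correct, treats $dist_U(Q, T_i)$ as a random variable over distance values, so that

$$CDF_{Q,T_i,dist_U}(r) \;=\; \Pr\bigl[\, dist_U(Q, T_i) \le r \,\bigr] \;=\; \int_0^r pdf_{Q,T_i,dist_U}(x)\, dx,$$

and a URQ with distance threshold $\epsilon$ and probability threshold $\tau$ returns $T_i$ iff $CDF_{Q,T_i,dist_U}(\epsilon) \ge \tau$ (the thresholds $\epsilon$ and $\tau$ are my notation). If this is the intended reading, the paper should spell out how this measure, and hence the pdf, is constructed from the per-time-point distributions.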

Also, are the probability measures for dist_U and LB_U defined on the same space? If so, what is the relationship between them? The proof of Lemma 3.1 is hard to follow: I fail to see how the integral over the measure (i.e., the cdf) implies anything about the range of integration (i.e., the LB and dist values).

C4. A range search problem for uncertain series is proposed, but its practical significance is not clearly demonstrated or evaluated. Since this is claimed as the first main contribution and the authors additionally claim that traditional approaches are not sufficient, it would be nice to demonstrate some practical examples where the new methods achieve something that was not possible with previous approaches. For example, if I were to run a traditional kNN query, how different would the result sets be? Currently, the evaluation is limited only to performance, but since this is a new problem and a new metric, with potentially high significance, at least an example to demonstrate intuitively _why_ URQs are fundamentally different would help.
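
To illustrate the kind of qualitative difference I would like to see demonstrated, here is a toy sketch (entirely my own construction, with hypothetical data and parameters, not the paper's method): two uncertain series for which a traditional query on expected distances and a probabilistic threshold query give different answers.

import numpy as np

# Toy data, entirely hypothetical: each uncertain series is represented by
# Monte Carlo samples, one independent Gaussian per time point.
rng = np.random.default_rng(0)
n, n_samples, eps, tau = 16, 5000, 1.9, 0.95
query = np.zeros(n)

def realize(mean, std):
    # draw n_samples realizations of an uncertain series of length n
    return rng.normal(mean, std, size=(n_samples, n))

A = realize(0.35, 0.25)  # closer to the query on average, but noisy
B = realize(0.46, 0.02)  # slightly farther on average, nearly certain

for name, r in (("A", A), ("B", B)):
    d = np.linalg.norm(r - query, axis=1)  # distance of each realization to Q
    p = (d <= eps).mean()                  # Monte Carlo estimate of Pr[dist <= eps]
    print(f"{name}: mean dist = {d.mean():.2f}, "
          f"Pr[dist <= {eps}] = {p:.2f}, passes URQ threshold: {p >= tau}")

# A wins a traditional query on expected distances (smaller mean distance),
# while only B satisfies Pr[dist <= eps] >= tau: the result sets differ.

Under these artificial parameters, A is the 1-NN by mean distance, yet only B satisfies the URQ predicate; an analogous contrast on real motivating data would make the practical case much stronger.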

C5. Related to the previous comment, it would be nice if the authors could obtain some real-world uncertain data. The candlestick example in the introduction is very nice and convincing as a motivating application. However, in the experiments, instead of using real high/low prices, the uncertainty is artificially added as Gaussian noise.
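
A minimal sketch of what I understand the data generation to be (my reading; the noise scale is hypothetical), which discards the genuine uncertainty information that real high/low prices would provide:

import numpy as np

rng = np.random.default_rng(0)
base = np.cumsum(rng.normal(size=256))  # stand-in for a real (certain) series
sigma = 0.1                             # hypothetical noise scale
# each time point of the "uncertain" series is modeled as N(base[t], sigma^2);
# one realization of the resulting uncertain series is then
noisy_sample = base + rng.normal(0.0, sigma, size=base.shape)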

C6. Proofreading would help (esp. for the grammatical and syntactical errors, although these do not make the paper hard to follow).
15. Comments for the Program Committee: I am not checking the “technically incorrect” box because the assumptions are not clearly stated; thus, strictly speaking, the results may be correct if (unreasonably) strict assumptions are made. The authors should clarify these points.

Also, based on personal experience, I am not entirely convinced how meaningful URQs are for very high-dimensional data (the first claimed contribution). The query model itself is not new, but it hasn’t been applied to time series before, so it would seem that this is the main contribution.

I am requesting clarifications from the authors; however, even if these address the present concerns adequately (in which case I would have no problem recommending acceptance), the paper may still need a significant rewrite, so I am not sure it makes sense to go through the ordeal. I would probably recommend a rollover to VLDB as the best course of action.
16. Is this paper a candidate for the Best Paper Award? No
17. Would author feedback be useful for this Review? (If "Yes", please answer Q. 18.) Yes
18. List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17): Q1. Please state your assumptions precisely, both up front and in the various definitions and proofs (see comments above).

Review: 3
Reviewer: Xiaofang Zhou
Email: zxf@itee.uq.edu.au
Organization: University of Queensland
Review:
1. Overall Rating: Weak Reject
2. Reject due to technical incorrectness: No
3. Novelty: Medium
4. Technical Depth: Medium
5. Is the relevant and reasonably well known state of the art fairly treated (note this may come from outside the database literature and even outside computer science)? Yes, allowing for space limitations.
6. Experimental results should meet the following expectations: deal with materially relevant cases (e.g., updates as well as queries, different scales) allowing for space limitations; statistically meaningful results; and use standard datasets or benchmarks (as opposed to a cherry-picked one) when possible. How did the experiments rate? Not good.
7. If experiments were presented, would you want them checked for repeatability (as might be required in a future SIGMOD)? Yes.
8. Presentation: Adequate
9. Reviewer Confidence: High
10. Name of External Reviewer (if applicable): Heng Tao Shen
11. Summary of the paper's main contributions and impact (up to one paragraph): This paper proposes a method for processing uncertain range queries and uncertain kNN queries against uncertain time series data. The problem is new and well explained. Its technical foundation is strong.
12. Three strong points of the paper (please number them S1, S2, S3): S1: This paper is the first piece of database research on query processing over uncertain time series data. It investigates a couple of new problems that progress naturally from other actively investigated problems, such as efficient query processing for time series data and handling data uncertainty.

S2: Concepts, new challenges, differences from previous work, intuitive ideas and algorithms are clearly defined and explained, followed by proofs and an experimental evaluation.

S3: The optimization strategies are based on a strong technical foundation.
13. Three weak points of the paper (please number them W1, W2, W3): W1: The motivational examples are not convincing, and the data used in the experiments are all synthetic (no real data?).

W2: The experimental results reported in the paper fail to show satisfactory pruning power for the proposed methods, and some results are not explained clearly.

W3: There is no discussion of the characteristics of the data obtained through the proposed dimensionality reduction (an R-tree index may well be unsuitable here if the derived low-dimensional data are highly skewed).
14. Detailed comments (please number each point): W1. The problem itself is not motivated convincingly. Why is it important to deal with uncertain time series data? Are there any practical applications? Although data uncertainty is currently a hot topic in the DB research community, with many real applications, we still need real and strong motivations to propose a new type of query. Personally, I do not think the uncertain stock time series discussed in the paper is a good example. From Figure 1, a single closing stock price for each day is enough to indicate the up or down patterns, as brokers usually analyze not a single day's price bounds but the overall trend over time. Furthermore, if any practical application exists, some real uncertain time series data should be used. However, in the experiments, all the uncertain datasets are synthetic. So the question is: does this problem have any practical value?

W2: The experiments conducted are not able to show the effectiveness of the methods. First, the pruning power does not seem good enough: the results show 75-95% pruning power. However, if the remaining candidates incur random accesses, the pruning power has to be consistently very high in order to outperform a sequential scan (see the rough cost model below). Second, the dataset sizes used are not reported. They seem very small, as indicated by Figure 12, since the random accesses number only in the tens (each I/O takes 10 ms). Third, I am not sure whether there are errors in some of the performance figures. For example, in Figure 13(a), Gaussian uncertainty improves on linear scan by a factor of 726, i.e., the linear scan takes 726 * 0.1 = 72.6 s. However, for uniform uncertainty, it takes only 95 * 0.13 = 12.4 s. Since both datasets have the same size, similar distance computation costs, and similar pruning power (Figure 11(a)), why does the linear scan take about six times longer under Gaussian uncertainty than under the uniform distribution?
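
A rough cost model (my own back-of-envelope, with $c_r$ and $c_s$ the per-series costs of a random access and of a sequential read, $p$ the pruning power, and $N$ the database size): the index beats a sequential scan only when

$$(1 - p)\, N\, c_r \;<\; N\, c_s \quad\Longleftrightarrow\quad p \;>\; 1 - \frac{c_s}{c_r}.$$

With $c_r / c_s$ on the order of 100, even $p = 0.95$ is insufficient, which is why the reported 75-95% pruning power is worrying.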

Figure 3 seems redundant. References 8 and 10 are not cited correctly.
15. Comments for the Program Committee:
16. Is this paper a candidate for the Best Paper Award? No
17. Would author feedback be useful for this Review? (If "Yes", please answer Q. 18.) No
18. List specific clarifications you seek from the Authors (if you have answered "Yes" to Q. 17):