From: "Data Mining and Knowledge Discovery (DAMI)" Dear Dr Kolev, We are pleased to inform you that your manuscript, "ParCorr: Efficient Parallel Methods to Identify Similar Time Series Pairs across Sliding Windows", has been accepted for publication in Data Mining and Knowledge Discovery subject to you making required changes by 16 Apr 2018. When preparing your revised manuscript, you are asked to carefully consider the editor and reviewer comments which are attached, and submit a list of responses to the comments. Your list of responses should be uploaded as a file in addition to your revised manuscript. In order to submit your revised manuscript please log on and you will find your submissions in submissions needing revision box. Click submissions needing revision, click edit submission, click attach files, and upload your revised version. It is essential that you use this process and not submit the manuscript as a new submission. Please note: When uploading your revised files, please make sure only to submit your editable source files (i. E. Word, tex). Also, if you opt for a tex file, only one .tex file per submission should be used to guarantee proper building. If it doesn't build, it is because the .sty files also have to be uploaded. EM does support most new versions of LaTex, but the older ones are not. More details are listed at the bottom of this email should you need it. https://dami.editorialmanager.com/ Your username is: bkolev If you have forgotten your password, kindly use the Send Login Details link on the login page. Click "Author Login" to submit your revision. We look forward to receiving your revised manuscript. Best regards, Keerthana Govindarajan Springer Journals Editorial Office Data Mining and Knowledge Discovery COMMENTS FOR THE AUTHOR: Dear Boyan Kolev, Thank you for your submission to Data Mining and Knowledge Discovery. I am delighted to accept it for publication, subject to completion of the following revisions. Please try to address all the issues raised by the reviews, but you must make the following changes: *Please add a comparison to MASS and include one more real data set in the evaluation *Please expand on the results of seismic data, e.g., by showing some real examples of matches in the seismic data (see review 1 & 2). *Please try to improve the clarity of Subsection 4.4 *Please discuss the connection to tiling *Address the concern about normalization in a streaming setting as raised by multiple reviewers *Please answer the 3 questions revolving around lemma 1 raised by Reviewer 3 (i.e., Is "50%" really sufficient to conclude "very likely", how does the random vector approach interact with lemma 1, and "a large fraction" of the sketch-similar series are actually similar) *Please resolve the speed up issue raised by Reviewer 2 Please submit a point-wise list of responses to the issues with your manuscript We look forward to receiving your revised manuscript. Sincerely, Bjorn, Jesse, Elisa and Derek ECMLPKDD 2018 Journal Track Co Chairs Reviewer #1: SUMMARY. PROVIDE A BRIEF SUMMARY OF THE REASONS FOR YOUR RECOMMENDATION. WHAT ARE THE MAJOR STRENGTHS AND WEAKNESSES OF THE MANUSCRIPT? See below RELEVANCE. HOW RELEVANT IS THE PAPER TO THE DISCIPLINES OF DATA MINING AND KNOWLEDGE DISCOVERY? good SIGNIFICANCE [NOT APPLICABLE TO SURVEY PAPERS]. HOW SIGNIFICANTLY DOES THE PAPER ADVANCE THE CURRENT STATE-OF-THE-ART? HOW BIG AN IMPACT WILL THE PAPER HAVE, ON WHAT AREAS, AND WHY? WILL THE PAPER STIMULATE FURTHER RESEARCH?' good RELATED RESEARCH. 
Reviewer #1:

SUMMARY. PROVIDE A BRIEF SUMMARY OF THE REASONS FOR YOUR RECOMMENDATION. WHAT ARE THE MAJOR STRENGTHS AND WEAKNESSES OF THE MANUSCRIPT?

See below.

RELEVANCE. HOW RELEVANT IS THE PAPER TO THE DISCIPLINES OF DATA MINING AND KNOWLEDGE DISCOVERY?

good

SIGNIFICANCE [NOT APPLICABLE TO SURVEY PAPERS]. HOW SIGNIFICANTLY DOES THE PAPER ADVANCE THE CURRENT STATE-OF-THE-ART? HOW BIG AN IMPACT WILL THE PAPER HAVE, ON WHAT AREAS, AND WHY? WILL THE PAPER STIMULATE FURTHER RESEARCH?

good

RELATED RESEARCH. IS ALL RELEVANT PRIOR WORK DISCUSSED? IS THE CURRENT WORK A DISTINCT AND NEW CONTRIBUTION RELATIVE TO THE AUTHOR(S)' PREVIOUS WORK? [NOTE: DMKD ENCOURAGES SUBMISSION OF EXPANDED VERSIONS OF SIGNIFICANT WORK PREVIOUSLY PUBLISHED IN CONFERENCES, SO LONG AS THE RELATIONSHIP TO THE EARLIER WORK IS ACKNOWLEDGED, BUT STRONGLY CONDEMNS BOTH PLAGIARISM AND SELF-PLAGIARISM. SEE INSTRUCTIONS FOR AUTHORS FOR MORE DETAILS.]
[NOTE ALSO THAT DMKD CLOSELY SCRUTINIZES ANY SUGGESTION BY A REVIEWER THAT THEIR OWN WORK SHOULD BE REFERENCED AND RESERVES THE RIGHT TO DELETE ANY SUCH COMMENTS FROM A REVIEW. WHERE A REVIEWER BELIEVES THEIR OWN WORK SHOULD BE CITED, WE ADVISE WRITING A COMMENT TO THE EDITORS UNDER CONFIDENTIAL COMMENTS TO THE EDITOR.]

should compare to MASS

... We have done that, but have also added a comment about MASS in the future work.

RELATIONSHIP TO PREVIOUS RESEARCH [NOT APPLICABLE TO SURVEY PAPERS]. IS THE NOVEL CONTRIBUTION OF THE NEW WORK MADE EXPLICIT TOGETHER WITH ITS RELATIONSHIP TO PRIOR WORK? IS THE NEED FOR THE NEW CONTRIBUTION SUBSTANTIATED? DOES THE PAPER MAKE CLEAR NOT ONLY THE STRENGTH BUT ALSO THE LIMITATIONS OF THE NEW CONTRIBUTION?

good

THEORETICAL AND EXPERIMENTAL EVALUATION. IS THE THEORETICAL AND EXPERIMENTAL EVALUATION SOUND AND APPROPRIATE? ARE ALL CLAIMS THAT ARE MADE BACKED BY SUFFICIENT THEORETICAL OR EXPERIMENTAL SUPPORT? IS APPROPRIATE STATISTICAL ANALYSIS PROVIDED OF EXPERIMENTAL RESULTS AND ARE APPROPRIATE CONCLUSIONS DRAWN?

EXPRESSION. IS THE PAPER CLEARLY WRITTEN AND ACCESSIBLE TO A WIDE AUDIENCE OF DATA MINING RESEARCHERS? IF NOT, WHAT TYPES OF CHANGES ARE REQUIRED? IS THE ENGLISH EXPRESSION OF A SUITABLE STANDARD FOR PUBLICATION? IS THE PAPER OF APPROPRIATE LENGTH? IF NOT, WHAT SHOULD BE ADDED, EXPANDED, DELETED OR COMPRESSED? DO THE TITLE AND ABSTRACT APPROPRIATELY REFLECT THE CONTENTS?

TECHNICAL DETAIL. IS SUFFICIENT DETAIL PROVIDED ABOUT ALGORITHMS AND TECHNIQUES, INCLUDING EXPERIMENTAL TECHNIQUES? IS THERE SUFFICIENT DETAIL TO ALLOW REPLICATION OF THE WORK?

FIGURES AND TABLES. ARE THE FIGURES AND TABLES NECESSARY AND SUFFICIENT?

No, see below.

REFERENCES. ARE THE REFERENCES ACCURATE AND COMPLETE?

good

FOR SURVEY PAPERS ONLY:
- IS A SURVEY IN THIS AREA TIMELY? [IS THE AREA OF INTEREST TO THE DATA MINING COMMUNITY AND IS THERE NO RECENT AND COMPREHENSIVE REVIEW ALREADY AVAILABLE?]
- IS THE COVERAGE OF THE AREA BALANCED, COMPLETE AND UP-TO-DATE?
- DOES THE SURVEY PROVIDE A SUITABLE FRAMEWORK FOR UNDERSTANDING THE AREA?

ANY OTHER COMMENTS NOT COVERED ABOVE.

SUMMARIZE ANY CHANGES THAT MUST BE MADE FOR A REVISED VERSION OF THIS PAPER TO BE ACCEPTABLE FOR PUBLICATION.

SUMMARIZE ANY FURTHER CHANGES THAT YOU RECOMMEND THE AUTHOR(S) CONSIDER.

Very nice paper, except…

1) This is about time series, an inherently visual domain, and there are zero images of time series. You need images of time series: to motivate "1 Introduction", to explain "2 Problem Definition", and to show results in "5 Experiments".

... Boyan: seismic.

2) "We compare with iSAX [7] because it does not require time series to be cooperative so is a general method." It does assume that the time series is slowly changing; you could not use it to efficiently index D = rand(10000000,1);

... We note this limitation of iSAX.

3) "Quantitatively, the matrix profile SCRIMP algorithm takes 20000 seconds to analyze a time series of size one million and compute similarities across all pairs of subsequences (almost one million), where each subsequence is of size 500." This seems a little misleading; I tried it…

data1 = cumsum(randn(1000000,1));
[matrixProfile, profileIndex, motifIndex, discordIndex] = interactiveMatrixProfileVer2(data1, 500);

And after about 10 seconds, the motifs have converged (they stop changing). The whole point of SCRIMP is that it is an anytime algorithm. In any case, if it is not a solution to your problem, why do we care if it is slow?

... Good point. We have changed the text to note the 10 seconds.

4) "A data stream, for our purposes, is a set of streaming time series. Each time series is normalized to have zero mean and unit standard deviation." I am confused: you can normalize a batch time series, but can you normalize a stream?

... This is a good point. We suggest normalizing based on a prefix of the data. We don't address this explicitly in our experiments, but suggest periodic re-normalization to enhance recall.
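... For illustration, one way to realize "normalize on a prefix, re-normalize periodically" for a single stream. This is purely a sketch in Python (the class name and the refresh policy are ours, not code from the paper):

import numpy as np

class PrefixNormalizer:
    # Estimate mean/std on a prefix of the stream, apply that transform
    # to incoming points, and refresh the estimates every `refresh` points.
    def __init__(self, prefix, refresh=10000):
        self.mu = float(np.mean(prefix))
        self.sigma = max(float(np.std(prefix)), 1e-9)
        self.refresh = refresh
        self.buf = []
    def push(self, x):
        self.buf.append(x)
        if len(self.buf) >= self.refresh:   # periodic re-normalization
            self.mu = float(np.mean(self.buf))
            self.sigma = max(float(np.std(self.buf)), 1e-9)
            self.buf.clear()
        return (x - self.mu) / self.sigma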
5) I just don't understand why you compare to iSAX, which seems unsuited to this problem. Keep the comparison if you like, but can you also compare to MASS? http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html It should be trivial to "slot" it into where you use iSAX.

... Boyan.
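... For reference while we slot this in: MASS computes the z-normalized Euclidean distance profile of a query against every subsequence of a series via sliding dot products. A condensed numpy rendering of that published idea follows (our own sketch; Mueen's reference code at the link above uses an FFT for the dot products and should be what we actually benchmark):

import numpy as np

def mass_profile(query, ts):
    # z-normalized Euclidean distance from `query` to every
    # length-m subsequence of the 1-D numpy array `ts`
    m = len(query)
    q = (query - query.mean()) / query.std()
    qt = np.convolve(ts, q[::-1], mode='valid')      # sliding dot products
    s = np.cumsum(np.concatenate(([0.0], ts)))
    s2 = np.cumsum(np.concatenate(([0.0], ts ** 2)))
    mu = (s[m:] - s[:-m]) / m                        # rolling means
    var = np.maximum((s2[m:] - s2[:-m]) / m - mu ** 2, 1e-12)
    corr = qt / (m * np.sqrt(var))                   # Pearson correlations
    return np.sqrt(np.maximum(2 * m * (1 - corr), 0.0))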
6) Can you show some real examples of matches in the seismic data?

... Boyan.

7) "the seismic dataset contains 5 million time series of 2000 values each." That is about 5 GB in Matlab or zipped format, right? Can you put it online for the reviewers?

... Boyan.

8) iSax or iSAX? (you have both)

... Boyan.

9) Can you flesh out the motivation a little?

... Boyan: once we have some nice results from seismic, this will be taken care of.

10) "Missing some correlations is acceptable, because a major event will reveal many correlations so a recall of 90% or more is quite enough." I take it you don't work for an insurance company ;-)

... Boyan: as we show in the example, we can detect the seismic event.


Reviewer #2:

SUMMARY.
The paper is about finding correlations among multiple time series over a sliding window. The main idea is to compute sketches and to update them with a parallel approach over sliding windows. The proposed approach avoids computing sketches from scratch by updating them on the fly. Moreover, the time series segments in each window are mapped to a vector space. The overall approach manages to minimize the number of computations needed for detecting candidate matching time series.
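... Schematically, the on-the-fly update amounts to maintaining each sketch entry as a running dot product with a random vector indexed by absolute time (e.g. via the looped construction Reviewer 3 asks about below), so a window slide only touches the entering and expiring points. A Python sketch of the idea under that assumption (ours, not the paper's code):

import numpy as np

w = 256                                          # sliding window length
R = np.random.choice([-1.0, 1.0], size=2 * w)    # doubled, looped random vector

def r(t):
    return R[t % (2 * w)]                        # random value at absolute time t

def slide_sketch(s, x_new, t_new, x_old, t_old):
    # one sketch entry after the window advances by one point:
    # add the entering point's contribution, drop the expired one
    return s + x_new * r(t_new) - x_old * r(t_old)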
Strong points:
- This is an interesting and challenging problem.
- The paper presents a fresh and attractive solution that is technically good enough and of adequate depth for this venue.
- The experimental evaluation is sufficient.

Weak points:
- The authors could improve the clarity of Section 4.4.
- Some relevance to the concept of "tiling" could be provided and discussed.
- Inclusion of a case study and some more real datasets would benefit the results section of the paper and increase its motivation.
- The paper contains some minor language errors which should be fixed.

RELEVANCE.
The paper is highly relevant to DAMI and the ECML/PKDD audience.

SIGNIFICANCE.
The paper is highly significant.

RELATED RESEARCH.
The related research is discussed in detail.

RELATIONSHIP TO PREVIOUS RESEARCH [NOT APPLICABLE TO SURVEY PAPERS].
This is covered; however, some relevance to the concept of "tiling" would also be interesting to see in the paper.

... We have added this in both the related work and the future work.

THEORETICAL AND EXPERIMENTAL EVALUATION.
This is quite elaborate.

EXPRESSION.
The expression is decent.

TECHNICAL DETAIL.
The technical details as well as the experimental details are satisfactory.

FIGURES AND TABLES.
They seem sufficient.

REFERENCES.
The references seem accurate.

SUMMARIZE ANY CHANGES THAT MUST BE MADE FOR A REVISED VERSION OF THIS PAPER TO BE ACCEPTABLE FOR PUBLICATION.

This is a very interesting paper with a strong motivation and an attractive approach. However, the authors could consider some improvements:

A. The authors could improve the clarity of Section 4.4. The three points/strategies suggested here should be given a more in-depth and clearer description. Moreover, some examples should be provided to support the intuition behind these strategies.

B. Some relevance to the concept of "tiling" could be provided and discussed. The concept of tiling has been studied earlier, in some cases with the objective of identifying sequence or time series segments that are highly similar or correlated. It would be great to mention their relevance (if the authors deem that necessary) and contrast them with the present approach. Specifically, the authors could examine the following papers:
- Geometric and combinatorial tiles in 0-1 data, PKDD 2004
- Semigeometric Tiling of Event Sequences, ECML/PKDD 2016
- Tiling databases, DS 2004

... Discussed. We note the possible complementarity between the two problems.

C. Please provide some case studies of correlated time series segments from the real datasets you have used. It would be nice to see some real findings from the seismic dataset.

... Boyan.

D. It would also be great to try your approach on other datasets of multi-dimensional time series, for example ECGs. Please include at least one more real dataset in your experimental evaluation.

... ECGs are not so appropriate because there is no time criticality, but we also look at finance.

E. The paper contains some minor language errors which should be fixed. Please proofread the paper carefully.

... Dennis will do this when we're done.


Reviewer #3:

>> Summary:

The article presents a method for parallelizing the search for similar windows, at the same time step, between any two of a large set of time series. The main idea is to summarize windows into sketches and to distribute the decision of which pairs could be similar, based on partial information. A number of experiments have been conducted to compare the proposed method with iSAX, the current state of the art. The results indicate that this new method is significantly faster than iSAX, at the cost of recall.

In its current form the paper lacks a lot of details (which can likely be supplied in a comparatively short time frame). My main concern here is the seeming overselling of the performance of the proposed approach (1000-fold speedup in the abstract).

>>> Detailed justification:

I find the first contribution (listed on page 2) somewhat lackluster. Upon reading what it actually entails, it seems straightforward and to follow standard techniques. I would certainly keep it in the paper, but would suggest leaving it out of the list of contributions, or making it very clear why this is novel and interesting.

... Yes, we agree that it is simple, but we prefer to leave it in as a contribution, because we think its simplicity is a virtue. We mention that it is simple.

As a practical remark, Section 2 states that the time series are assumed to be normalized. To what degree is this achievable when the data is streamed in? Is it re-normalized over the current window, or over all data received so far, etc.? As a minor note, it also makes me wonder why x and y are normalized in eq. (1) when the assumption is that they are already normalized.

... Yes, thank you for this observation. We have rewritten to make it clear that we normalize based on the prefix of the time series and then recompute the normalization transform periodically. Boyan: we might have to try this in our experiments.

The approach seems to largely build on Lemma 1, to assume that sketches are good representatives of the actual similarity. Is "50%" really sufficient to conclude "very likely" on line 31 (p5)?

... We agree with the reviewer, so we have changed "very likely" to "likely" and justified this by noting that the lemma basically says that the ratio of the sketch distance and the real distance is close to 1.

I'm setting serious question marks against the approach described at the very end of Section 4.2. In particular, a technique to avoid redistributing new random vectors is described, by taking a random vector that is twice the necessary length and looping over it (presumably circularly?). It is unclear why this is a valid way of doing this, w.r.t., say, Lemma 1. The randomization is no longer independent. What are the effects of this choice?

... We agree. We have noted that this is a hackish approach, but also that this compromise on randomness has no particular impact. Boyan: we should confirm that this has no impact on our results by actually generating a sufficiently long random vector and showing that the results don't change.
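... A quick offline version of that confirmation could compare the sketch-to-real distance ratios under fresh i.i.d. random vectors and under reuse of one doubled, circularly shifted vector. All parameters below are made up for illustration, and the circular shifts only approximate the paper's looped reuse:

import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 512, 32, 2000       # window length, sketch size, sampled pairs
for looped in (False, True):
    ratios = []
    for _ in range(trials):
        x, y = rng.standard_normal((2, n))
        if looped:   # k circular shifts of one doubled +/-1 vector
            base = rng.choice([-1.0, 1.0], size=2 * n)
            V = np.stack([np.roll(base, i)[:n] for i in range(k)])
        else:        # k independent +/-1 vectors
            V = rng.choice([-1.0, 1.0], size=(k, n))
        sx, sy = (V @ x) / np.sqrt(k), (V @ y) / np.sqrt(k)
        ratios.append(np.linalg.norm(sx - sy) / np.linalg.norm(x - y))
    print('looped' if looped else 'i.i.d.',
          round(np.mean(ratios), 3), round(np.std(ratios), 3))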
What is the effect of grid boundaries? In the current approach, it seems that two time series can be arbitrarily close together but fall into two different grid cells and thus remain undetected. How does, e.g., extending the check to distances with points in neighboring cells affect results?

... Thank you for this comment. Because our technique is heuristic, some pairs of sketch points may incorrectly be thought to be far apart just because they happen to fall in different cells. However, the sketch vectors are large enough that we can miss a few cells and still get good recall. Overall, the recall is not perfect though. Boyan: we should try varying grid sizes and say what we get.
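... To make the boundary effect concrete, a neighboring-cell probe on top of a simple grid over (low-dimensional groups of) sketch values could look as follows. This is illustrative only; the cell side c and the grouping of sketch entries into d dimensions are parameters we would have to match to the paper's actual setup:

import numpy as np
from itertools import product

def cell_id(point, c):
    # integer grid cell of a d-dimensional sketch point, cell side c
    return tuple(np.floor(np.asarray(point) / c).astype(int))

def cells_to_probe(cid):
    # the cell itself plus its 3^d - 1 neighbors; probing these catches
    # near pairs that a cell boundary would otherwise split
    return [tuple(np.asarray(cid) + np.asarray(off))
            for off in product((-1, 0, 1), repeat=len(cid))]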
The paper introduces three communication strategies. It is, however, unclear to me why the first two are introduced (and later also shown in the first experiment). It seems to me, at least from the description, that the "opt" variant should be preferable under any circumstance (better/equal asymptotic behavior and lower constants). I also don't find the difference that surprising in the experiment, but I suppose that a lot of the reported time actually revolves around aspects other than the communication? What happens if you single out the actual time spent on communication?

... Boyan: Yes, we should do this.

P11 makes the claim that Step 2 is cheaper than Step 1: I wonder in terms of what? I'd assume this refers purely to running time, and that these steps are local and thus have significantly smaller "constants" hidden in the asymptotic claims?

... Yes, thank you. We have made this clear. This is linear in the same way.

The bottom of page 11 makes the claim that "a large fraction" of the sketch-similar series are also actually similar. What is this large fraction? Upon what is this claim based? It seems that this is related to Lemma 1 again (is 50% indeed a large fraction?), but the change in the random vectors possibly undermines this Lemma. Are there experiments to support this claim?

... Boyan: we need to show these ratios.

In Section 5, the experiments are described. ParCorr is compared to Parallel Linear Search and iSAX. The justification for iSAX seems clear (state of the art), but why is PLS introduced and compared to? Also, the claim is that iSAX local indices have to be rebuilt constantly. Is this inherent in the iSAX method, or is iSAX easily amenable to an incremental update as well?

... We have compared to PLS as a baseline. iSAX can be made incremental, though not easily. We are studying this concurrently.

What repetition is the error bar based on? (page 14) Is this a number of runs of the same algorithm on different data, the same data, etc.? Does this actually produce a 100% confidence interval? (Wouldn't that nearly always be 0 to +infinity for running times?)

... Boyan: we have to say how many repetitions.

What fraction f do you use to get the 95-ish % recall?

... Boyan: please put this in.

Section 5.4 makes some claims on different thresholds for the Pearson correlation. Why are the speedup factors reported here around 18, whereas the actual speedup (reported also in the abstract) states a 1000-fold gain? In fact, reading through the results, I see no mention of the 1000-fold speedup at all (the best I can find seems to be below 58).

... Yes, point taken.

Out of curiosity: ParCorr seems very much designed towards many but short time series. How would it perform if there are few but long time series? Is this simply a matter of cutting them up into shorter ones to leverage parallelism more?

... It would not work that well.

>>> Presentation:

Overall the paper is well written. In the following, there are a few suggestions that may help to improve it further.

The abstract starts with trying to define the problem in one sentence (over 4 lines). This could be improved, since it's currently not clear (at least to me) that the windows that you want to compare must happen at the same time. Mostly because of the part about successive windows and that these have to start shortly after one another. I had the impression that you wanted to find correlated sequences of windows (with each window only shortly after the end of the previous) in time series that match another such sequence.

... The method could in fact be used for non-contemporaneous windows, as we explain in the future work.

The introduction frequently refers to sketches, without making it clear what this precisely is. The explanation follows later; I'd suggest a simple forward reference at the first occurrence.

... Good idea. Done.

There are some statements that could use substantiation/citations:

- p2: Fourier & Wavelet Transforms sacrifice too much accuracy or reduce dimensionality too little.

... Boyan: please refer to the Cole et al. paper.

- p2: speed turns out to be of greater importance.

... Toned down.

- p3: synchronous variation (do you mean variant here?); in particular, is this distinction a new one, or established?

... Thank you. Clarified.

- p3: I was under the impression that (for general high-dimensional data) Euclidean distance is fairly meaningless; is this impression wrong, or are time series different in this regard?

... You are correct. The Euclidean distance is related to correlation, and that's what we're focusing on.
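... In the revision it may be worth spelling out the standard identity behind that answer: for z-normalized series \hat{x}, \hat{y} of length m,

    \|\hat{x} - \hat{y}\|^2 = 2m (1 - corr(x, y)),

so a correlation threshold t translates directly into the Euclidean distance threshold \sqrt{2m(1 - t)}, and distance on normalized series is exactly as meaningful as the correlation itself.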
Some further remarks on presentation:

- In many instances, formulas are written out very verbally. Personally, I find this awkward to read, and I lose track if it is more than a multiplication. The most extreme case seems to be on p12, in the paragraph just before Section 5. Another ambiguous case is on p11, lines 43-44.

... Boyan: I've noted where the problem is just before Section 5. Please re-write. I like the way we have it on page 11.

- Section 4.3: the idea is given with an example, then generalized into the actual computational process. I found the example difficult to understand without first reading the general idea; you may want to consider swapping these.

- There are some typos throughout the paper, most notably some instances of "ParrCorr" rather than "ParCorr".

... Thank you. Done.

- p7, line 12: Spark is suddenly mentioned here. It's unclear why the consideration of it being "awkward to do in Spark" is relevant.

... We make it clear that this is our implementation platform.

- p14: claims on the ratio of window size vs. basic window size: why are these reasonable for many applications? Why is 50 more reasonable for high-frequency measurements? etc. (As a minor side note, I'm not quite sure why the basic window is a window to begin with; it seems more like a step/increment size.)

... Yes, we have re-expressed this, noting that the incremental approach gains the most from short basic windows.

I found the conclusion a bit short: a more nuanced conclusion from the experimental results would be welcome here, and the future work is extremely brief. Are these the only remaining avenues of further work? Why are these important or relevant? Are there other steps to improve the algorithm (e.g. achieve better recall)?

... We agree and we have changed this.

___________________________________________________________


STYLE FILE INFORMATION:


Questions on submitting TeX files:

What submission item should I use for a TeX file? The best submission item to use for a TeX file is "Manuscript" or a similarly worded submission item description.

What if my TeX file(s) doesn't build? First, look in the PDF to see if an error message is available. If the error message suggests you are missing a sty file, then this may be the reason the TeX submission will not build. If your submission still is not building after attempting to fix your TeX file from the suggested error message, you may want to verify the following:

Are your images referenced correctly? Images cannot be referenced in subfolders, otherwise they will not appear in the PDF.

Are all your accompanying files referenced correctly in your TeX file?

Why can't I submit TeX files and DVI files together? The system will accept either TeX file(s) _or_ a DVI file. If your DVI file does not build, then it is suggested you instead submit a TeX file. If you submit a DVI file and it does not build correctly, then you will need to provide the TeX file in order for the building problem to be researched.

What if my figures are not appearing? It may be that your images are referenced in subfolders. Images cannot be referenced in subfolders, otherwise they will not appear in the PDF. An example of a correctly referenced image is: epsfig{figure=alld.eps,width=.5\textwidth}. An example of an incorrectly referenced image is: epsfig{figure=images/alld.eps,width=.5\textwidth}

What if I view large or cut-off EPS images? If you view large or cut-off images, then you will need to resize the image to fit on one 8.5 x 11 page. When there is a problem caused by not resizing PostScript files (the images get cut off), the author will either need to resize the images or save the files in a format that EM can recognize as an image (perhaps as a TIFF or a JPEG). Please note that EPS files are the best choice for image files in TeX submissions.
Do I need to use my STY files too? Yes. The sty files will be needed to properly build your submission.

What if I see (?) question marks in my PDF? If you see question marks in the references of your PDF, most likely your TeX file(s) are in subdirectories. TeX submissions cannot include subdirectories if the submission is to build properly. All associated files must be in one directory for the submission to build.

Please note:
1) The best image/figure file types for TeX submissions are EPS files. The author should attempt to submit EPS images for all their figure files. The author should verify that their EPS images are not in subfolders, otherwise the files may not build into the submission.
2) The author should look at the error generated at the end of their TeX submission. The author should verify the format of their TeX submission, as the error message suggests the TeX file may be corrupt.
3) The author should re-verify that the submission and all of the accompanying TeX-related files (STY files, etc.) are uploaded to the submission.
4) The author should then re-build their submission.
5) The author should verify that their images are not in subfolders, otherwise the images may not appear.

For your reference, a sty file is:

STY is the file extension for a style sheet file. A STYle template may be used by different publishers to define what should be bold/centered/italic in the paper.

The link below to Springer's own site can be used for LaTeX references:
http://www.springer.com/sgw/cda/frontpage/0,,5-164-2-72376-0,00.html

We rely on the authors to create their sty files. Authors need to create their sty files to govern their own work.

There is a link below, found via Google, that may assist the author in completing their submission.

http://www.sci.usq.edu.au/staff/robertsa/LaTeX/latexintro.html


TeX Web Resources

There has been a significant increase in the number of TeX submissions to journals using Editorial Manager. While we do not offer direct technical support for TeX, just as we don't offer direct technical support for Microsoft Word, we have compiled a list of TeX-related web sites for journals to use; please feel free to distribute this information to your authors if you deem it helpful.

Helpful TeX Links:

Beginners' guide to TeX
This excellent introduction to TeX contains links to a basic explanation of TeX, a more thorough overview, and FAQs. You'll also find user help, documentation, sample documents, and a list of recommended reference books.
http://www.tug.org/begin.html

The Comprehensive TeX Archive Network
If you know absolutely nothing about TeX and would like to learn about what TeX is and where it came from, be sure to take a look at the article entitled "What is TeX?" There is a search function for files and documentation on the site, as well as links to sign up for TeX user groups and announcement lists.
http://www.ctan.org/

TeX Guides
An excellent resource offering a variety of TeX guides, including guides for Mathematical Symbols in TeX and TeX for Word Processor Users.
http://www.mcs.vuw.ac.nz/~david/latex/

LaTeX Encyclopedia
An online LaTeX "encyclopedia". The site contains a table of contents with links to information on documentation, installation, and typography, and a Navigator for the site.
http://tex.loria.fr/

LaTeX Math Guide
The American Mathematical Society's Short Math Guide for LaTeX.
ftp://ftp.ams.org/pub/tex/doc/amsmath/short-math-guide.pdf

--
Boyan Kolev

Zenith team
INRIA Sophia-Antipolis Méditerranée
LIRMM
860 rue St Priest - Bâtiment 5
34095 Montpellier Cedex 5 France
