High Performance Data Mining in Time Series: Techniques and Case Studies

Author: Yunyue Zhu

Advisor: Dennis Shasha

As extremely large time series data sets grow more prevalent in a wide variety of settings, we face the significant challenge of developing efficient analysis methods. This dissertation addresses the problem in designing fast, scalable algorithms for the analysis of time series.

The first part of this dissertation describes the framework for high performance time series data mining based on important primitives. Data reduction trasform such as the Discrete Fourier Transform, the Discrete Wavelet Transform, Singular Value Decomposition and Random Projection, can reduce the size of the data without substantial loss of information, therefore provides a synopsis of the data. Indexing methods organize data so that the time series data can be retrieved efficiently. Transformation on time series, such as shifting, scaling, time shifting, time scaling and dynamic time warping, facilitates the discovery of flexible patterns from time series.

The second part of this dissertation integrates the above primitives into useful applications ranging from music to physics to finance to medicine.

StatStream

StatStream is a system based on fast algorithms for finding the most highly correlated pairs of time series from among thousands of time series streams and doing so in a moving window fashion. It can be used to find correlations in time series in finance and in scientific applications.

HumFinder

Most people hum rather poorly. Nevertheless, somehow people have some idea what we are humming when we hum. The goal of the query by humming program, HumFinder, is to make a computer do what a person can do. Using pitch translation, time dilation, and dynamic time warping, one can match an inaccurate hum to a melody remarkably accurately.

OmniBurst

Burst detection is the activity of finding abnormal aggregates in data streams. Our software, OmniBurst, can detect bursts of varying durations. Our example applications are monitoring gamma rays and stock market price volatility. The software makes use of a shifted wavelet structure to create a linear time filter that can guarantee that no bursts will be missed at the same time that it guarantees (under a reasonable statistical model) that the filter eliminates nearly all false positives.