Data arriving in time order (a data stream) arises in fields ranging from
physics to finance to medicine to music, to name a few. Often the data
comes from sensors (in physics and medicine for example) whose data rates
continue to improve dramatically as sensor technology improves.
Furthermore, the number of sensors is increasing, so analyzing data
between sensors becomes ever more critical in order to distill knowledge
from the data. Fast response is desirable in many applications (e.g. to
aim a telescope at an activity of interest or to perform a stock trade).
In applications such as finance, recent information, e.g. correlation, is
of far more interest than older information, so analysis over sliding
windows is a desired operation.
These three factors -- huge data size, fast response, and windowed computation -- motivated this work. Our intent is to build a foundational library of primitives to perform online or near online statistical analysis, e.g. windowed correlation, incremental matching pursuit, burst detection, on thousands or even millions of time series. Beside the algorithms, we also propose the concept of ``uncooperative'' time series, whose power spectra are spread over all frequencies with any regularity.
Previous work showed how to do windowed correlation with Fast Fourier Transforms and Wavelet Transforms, but such techniques don't work for uncooperative time series. This thesis will show how to use sketches (random projections) in a way that combines several simple techniques -- sketches, convolution, structured random vectors, grid structures, combinatorial design, and bootstrapping -- to achieve high performance, windowed correlation over a variety of data sets. Experiments confirm the asymptotic analysis.
To conduct matching pursuit (MP) over time series windows, an incremental scheme is designed to reduce the computational effort. Our empirical study demonstrates a substantial improvement in speed.
In previous work, Zhu and Shasha introduced an efficient algorithm to monitor bursts within windows of multiple sizes. We implemented it in a physical system by overcoming several practical challenges. Experimental results support the authors' linear running time analysis.