- Problem statement
- Cost of the direct approach
- The StatStream optimization
- Running the system
- For the advanced user
- Send back your feedback

This package computes windowed correlations between many pairs of time
series, using an efficient algorithm.

This document was written by David Tanzer.

Given:

s1,...,sn -- n time series

corr -- correlation cutoff, where 0 <= corr <= 1.0

sw -- size of sliding window

bw -- size of basic window (time between consecutive reports of results)

At every reporting time t, return the set of all pairs of streams si,sj such that the subsequences si[t-sw+1 .. t] and sj[t-sw+1 .. t] have a correlation that exceeds corr, or is less than -corr.

This algorithm operates on a possibly large number of streams in an
efficient, incremental fashion, so that as new points are appended to the
time series, the set of correlated pairs can be recomputed and updated in
real time.

For each pair of streams si and sj, the direct approach requires computing a correlation over all of the sw points of the two subsequences. This is done for every pair of streams. So if there are N streams, the computation time needed at every reporting period is O(N*N*sw).
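For concreteness, the direct approach can be sketched as follows. This is an illustrative Python sketch, not the K implementation; the function and variable names are our own.

```python
import numpy as np

def correlated_pairs_direct(streams, t, sw, corr):
    """Direct approach: at reporting time t, correlate the last sw points
    of every pair of streams. With N streams this costs O(N*N*sw) per
    reporting period."""
    # Extract the sliding window si[t-sw+1 .. t] from each stream
    # (0-based indexing here, so the slice is [t-sw+1 : t+1]).
    windows = [s[t - sw + 1 : t + 1] for s in streams]
    pairs = []
    n = len(windows)
    for i in range(n):
        for j in range(i + 1, n):
            r = np.corrcoef(windows[i], windows[j])[0, 1]
            # Report the pair if the correlation exceeds corr
            # or is less than -corr.
            if abs(r) > corr:
                pairs.append((i, j, r))
    return pairs
```

The nested loop over all pairs is exactly what makes this approach expensive; the optimization described next attacks the O(sw) per-pair cost.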

A second, smaller window, called the basic window, is introduced:

bw -- size of the basic window, bw is a divisor of sw

Compared to a naive recompute-at-every-time-point approach, whose complexity is O(sw) for each time series and each time point, ours requires O(sw/bw) integer additions and O(1) floating point operations per time point for each pair of time series. Further, we use data structures to eliminate the need to compare every pair, so our expected time complexity is near linear overall. That is, provided the correlation threshold is high enough that few pairs are returned (depending on the application, this may mean a correlation of 0.95 or 0.4), the time needed at every reporting point is O(N*sw/(bw*bw)), which is linear in N.
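The core of the basic-window idea can be illustrated with a small sketch (again in Python, with invented names; the real system maintains analogous statistics in K). The sums a correlation needs over the sliding window are kept as sw/bw partial sums, one per basic window, so rolling the window forward by one basic window means adding the newest partial sums and subtracting the expired ones.

```python
from collections import deque

class SlidingSums:
    """Maintain the sum and sum of squares of one stream over a sliding
    window of sw points, updated one basic window (bw points) at a time.
    Digesting a new basic window costs O(bw); rolling the sliding window
    forward is O(1), since we only add and subtract partial sums."""
    def __init__(self, sw, bw):
        assert sw % bw == 0, "bw must be a divisor of sw"
        self.k = sw // bw        # number of basic windows per sliding window
        self.partials = deque()  # (sum, sum of squares) per basic window
        self.s = 0.0             # running sum over the sliding window
        self.s2 = 0.0            # running sum of squares

    def add_basic_window(self, points):
        p, p2 = sum(points), sum(x * x for x in points)
        self.partials.append((p, p2))
        self.s += p
        self.s2 += p2
        if len(self.partials) > self.k:
            # The oldest basic window has expired; subtract its partials.
            old, old2 = self.partials.popleft()
            self.s -= old
            self.s2 -= old2
```

A pairwise correlation also needs the cross-product sum over the two streams' windows, which can be maintained the same way per pair; that per-pair bookkeeping is what the data structures mentioned above help avoid doing exhaustively.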

First, download statstream2.zip for your platform (linux/unix/windows), then unpack it into a directory of your choice, as described on the download page.

There are two main programs that you will invoke.

data_provider.k is the generator
of data, and statstream2.k computes the output data.

They communicate through a port specified by the user (by default, port 2000).

You will run each of these in a different window.

To get started, open up a command window, and change to the statstream directory.

Type in the command:

k data_provider

This starts the K interpreter and tells it to run the program data_provider.k, which generates a set of sample time series using a random walk simulation. The output is written to .csv files. You are now at the K prompt.

Before we run statstream to get the output results, let's first look at the generated .csv files, so that we can understand the structure of the input. Open the file newdata.csv.

It has the generated data:

streamID,timepoint,value

s1,1,52.97

s2,1,77.56

s3,1,95.99

...

s50,1,52.44

s1,2,54.24

s2,2,76.42

s3,2,96.75

...
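If it helps to see the format programmatically, here is a sketch that writes a random-walk data set in the same csv layout (an illustrative Python sketch with invented names; the actual generator is data_provider.k):

```python
import csv
import random

def write_newdata(path, ns=50, npoints=100, start=50.0, seed=0):
    """Generate random-walk sample data in the layout shown above:
    streamID,timepoint,value rows, grouped by time point."""
    rng = random.Random(seed)
    # Each stream starts at a random level and then takes Gaussian steps.
    levels = [start + rng.uniform(0, 50) for _ in range(ns)]
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["streamID", "timepoint", "value"])
        for t in range(1, npoints + 1):
            for i in range(ns):
                levels[i] += rng.gauss(0, 1)  # one random-walk step
                w.writerow([f"s{i + 1}", t, round(levels[i], 2)])
```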

Now, in a second command window, run the main statstream program:

k statstream2 +corr 0.90 +sw 50 +bw 10

This runs the analysis, using a correlation threshold of 0.90, sliding window size of 50 points, and basic window size of 10 points. The output is put into the file output.csv.

Note: you must have the K interpreter for data_provider running at the time that you start the K interpreter for statstream2.

You are again left at a K prompt. Now exit from each of the K interpreters, by typing \\ followed by enter.

Here is the beginning of
a generated output.csv:

streamID1,streamID2,Begin Time Point,End Time Point,Correlation Coefficient

4,3,21,70,-0.9131029

5,1,71,120,0.9256776

7,5,71,120,-0.9124727

7,4,71,120,-0.904117

4,0,81,130,0.9028802

5,4,81,130,0.910386

7,4,81,130,-0.9471997

...

The output points are ordered by the third column, which is the time point when
the sliding window begins. Notice that all the sliding windows are of length 50, as
we specified, and also that the correlations
are reported every 10 points, the
basic window size that we specified.

The first line tells us that in the window 21-70, streams 4 and 3 were highly correlated -- in this case negatively correlated, with correlation coefficient -0.9131029.
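A small sketch for consuming output.csv downstream, which also checks the invariant just described, that every reported window spans exactly sw points (an illustrative Python sketch; the column layout is the one shown above):

```python
import csv

def read_output(path, sw=50):
    """Parse output.csv rows (streamID1, streamID2, begin, end, corr)
    and verify that every reported window has length sw."""
    pairs = []
    with open(path) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for s1, s2, begin, end, r in reader:
            begin, end, r = int(begin), int(end), float(r)
            assert end - begin + 1 == sw  # all sliding windows span sw points
            pairs.append((s1, s2, begin, end, r))
    return pairs
```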

For the advanced user: there are other parameters that can be passed to statstream2, see below.

Now let's run the system
on our own data. Please note that without a K license, you will have
to break up a large data set into smaller ones.

First, replace newdata.csv with your own data -- keeping the
same structure for the file, of course.

Then, in the first shell, run the command

k data_provider +u +ns **

Here ** should be replaced with the number of streams. For example, if you have 1000 time series to be processed, ** should be 1000.

This tells data_provider to use user-provided data, instead of simulated data. In this case, data_provider reads from the .csv files, rather than writing to them.

In the second window, run, e.g.,

k statstream2 +corr 0.90 +sw 50 +bw 10 +ns **

Here ** is the same as above.

The output.csv is now calculated from your input files.

**Note**: due to the memory limitation of the K trial version, if the user's data takes over 50 megabytes when the program runs, an error message about the eval limit will appear.

We call data *uncooperative* if spectral methods such as the Discrete
Fourier Transform, Singular Value Decomposition, and Wavelet Transform
give poor performance. In that case, we use a technique called
"random projections" as described in
this paper.
This technique requires a few parameters to be set.
The basic statstream2 call will use empirically good settings of these
parameters stored in goodparam.csv, but if you plan to use the data a lot, you may want
to find settings of the parameters that fit your data.
To do that, you need to add one parameter to your call:

k statstream2 +sw 128 +bw 32 +corr 0.85 +ns 50 +recall 0.99 +train 1 +debug 1

"+recall 0.99" means the final system will output correlated pairs with a recall larger than or equal to 0.99. "+train 1" indicates that parameter training is needed. Adding "+debug 1" will display all the message to the screen. Besides being used in that call, system parameters will be stored in goodparam.csv as good empirical parameters for next use. You may want to consider to train the system whenever the data has materially changed.

If you have any questions or feedback, please let us know. Email: shasha@cs.nyu.edu or xiaojian@cs.nyu.edu.

Maintained by shasha@cs.nyu.edu

Last Updated Jan. 11, 2006