# StatStream 2.0 Manual

This package computes windowed correlations between many pairs of time series, using an efficient incremental algorithm. This document was written by David Tanzer.

### Problem statement

Given:
s1,...,sn -- n time series
corr -- correlation cutoff, where 0 <= corr <= 1.0
sw -- size of sliding window
bw -- size of basic window (time between consecutive reports of results)

At every reporting time t, return the set of all pairs of streams si,sj such that the subsequences si[t-sw+1 .. t] and sj[t-sw+1 .. t] have a correlation that exceeds corr, or is less than -corr.

The algorithm handles a possibly large number of streams in an efficient incremental fashion, so that as new points are appended to the time series, the set of correlated pairs can be recomputed and reported in real time.

### Cost of the direct approach

For each pair of streams si and sj, the direct approach computes a correlation over all sw points of the two subsequences. With N streams there are O(N*N) pairs, so the computation time needed at every reporting period is O(N*N*sw).
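To make that cost concrete, here is a minimal Python sketch of the direct approach (illustrative only; statstream itself is written in K, and the function names here are our own):

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences -- O(sw) work."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def direct_report(streams, t, sw, corr):
    """O(N*N*sw): at reporting time t, compare every pair of streams
    over the last sw points and keep those beyond the cutoff."""
    pairs = []
    ids = sorted(streams)
    for i, si in enumerate(ids):
        for sj in ids[i + 1:]:
            c = pearson(streams[si][t - sw + 1:t + 1],
                        streams[sj][t - sw + 1:t + 1])
            if abs(c) > corr:
                pairs.append((si, sj, c))
    return pairs
```

Each call to `direct_report` redoes the full O(sw) correlation for every pair; the statstream optimization below is precisely about avoiding this recomputation.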

### The statstream optimization

bw -- size of the basic window; bw must be a divisor of sw

Compared to a naive recompute-at-every-time-point approach, whose complexity is O(sw) for each time series pair at each time point, ours is O(sw/bw) integer additions and O(1) floating point operations for each time point and time series pair. Further, we use data structures that eliminate the need to compare every pair. Our expected time complexity is therefore near linear overall. That is, provided the correlation threshold is high enough that few pairs are returned (depending on the application, this may mean a correlation of 0.95 or 0.4), the time needed at every reporting point is O(N*sw/(bw*bw)), which is linear in N.
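The flavor of the optimization can be conveyed with a small Python sketch (a simplified illustration, not the actual statstream2 data structures). Each stream keeps one digest per basic window, so a new basic window is absorbed in O(bw) work and only sw/bw digests are retained; given the running sums, the correlation itself is an O(1) formula. The cross-product term sxy is the part that statstream approximates per basic window rather than recomputing; here we only show the aggregate formula.

```python
import math
from collections import deque

class StreamDigest:
    """One (sum, sum-of-squares) digest per basic window; the sliding
    window holds exactly sw/bw of them."""
    def __init__(self, sw, bw):
        self.sums = deque(maxlen=sw // bw)
        self.sqs = deque(maxlen=sw // bw)

    def add_basic_window(self, points):
        # O(bw) to digest the new basic window; the oldest falls off the deque.
        self.sums.append(sum(points))
        self.sqs.append(sum(p * p for p in points))

def corr_from_sums(n, sx, sy, sxx, syy, sxy):
    """Pearson correlation from running sums -- O(1) floating point ops."""
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den
```

Because the per-basic-window sums are reused across reporting periods, only the newest basic window costs O(bw) additions, which is the O(sw/bw)-digests, O(1)-floating-point behavior claimed above.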

### Running the system

There are two main programs that you will invoke.

data_provider.k generates the data, and statstream2.k computes the output data.
They communicate through the port specified by the user (by default, port 2000).

You will run each of these in a different window.

To get started, open up a command window, and change to the statstream directory.

Type in the command:

k data_provider

This starts the K interpreter and tells it to run the program data_provider.k, which generates a set of sample time series using a random walk simulation. The output is written to .csv files. You are now at the K prompt.

Before we run statstream to get the output results, let's first look at the generated .csv files, so that we can understand the structure of the input. Open the file newdata.csv.

It has the generated data:

streamID,timepoint,value
s1,1,52.97
s2,1,77.56
s3,1,95.99
...
s50,1,52.44
s1,2,54.24
s2,2,76.42
s3,2,96.75
...
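If you want to inspect newdata.csv programmatically, a short Python helper (a hypothetical utility, not part of the package) can group the rows into one series per stream. It reads the columns positionally, in the streamID, timepoint, value order shown above:

```python
import csv
from collections import defaultdict

def load_streams(path="newdata.csv"):
    """Group rows of (streamID, timepoint, value) into one
    time-ordered list of values per stream."""
    series = defaultdict(dict)
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)                       # skip the header row
        for sid, t, v in reader:
            series[sid][int(t)] = float(v)
    # return each stream's values ordered by time point
    return {sid: [pts[t] for t in sorted(pts)] for sid, pts in series.items()}
```

Note that the rows in the file are grouped by time point, not by stream, so a grouping step like this is needed before per-stream analysis.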

Now open a second command window, change to the statstream directory, and run the main statstream program:

k statstream2 +corr 0.90 +sw 50 +bw 10

This runs the analysis, using a correlation threshold of 0.90, sliding window size of 50 points, and basic window size of 10 points. The output is put into the file output.csv.

Note: the K interpreter for data_provider must still be running at the time that you start the K interpreter for statstream2.

You are again left at a K prompt. Now exit from each of the K interpreters, by typing \\ followed by enter.

Here is the beginning of a generated output.csv:

streamID1,streamID2,Begin Time Point,End Time Point,Correlation Coefficient
4,3,21,70,-0.9131029
5,1,71,120,0.9256776
7,5,71,120,-0.9124727
7,4,71,120,-0.904117
4,0,81,130,0.9028802
5,4,81,130,0.910386
7,4,81,130,-0.9471997
...

The output points are ordered by the third column, which is the time point when the sliding window begins. Notice that all the sliding windows are of length 50, as we specified, and also that the correlations are reported every 10 points, the basic window size that we specified.

The first line tells us that in the window 21-70, streams 4 and 3 were highly correlated -- in this case negatively correlated, with correlation coefficient -0.9131029.
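The invariants just described can be checked mechanically. Here is a small Python sanity check over parsed output rows (a hypothetical helper, not part of the package): every window spans exactly sw points, begins on a basic-window boundary, and exceeds the cutoff in magnitude.

```python
def check_output(rows, sw, bw, corr):
    """Sanity-check parsed output.csv rows of the form
    (streamID1, streamID2, begin, end, coefficient)."""
    for _s1, _s2, begin, end, c in rows:
        assert end - begin + 1 == sw        # window spans exactly sw points
        assert (begin - 1) % bw == 0        # starts on a basic-window boundary
        assert abs(c) > corr                # beyond the cutoff, either sign
```

Running it with the parameters used above (sw=50, bw=10, corr=0.90) against the sample rows shown should raise no assertion.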

For the advanced user: there are other parameters that can be passed to statstream2, see below.

Now let's run the system on our own data. Please note that without a K license, you will have to break up a large data set into smaller ones.
First, replace newdata.csv with your own data -- keeping the same structure for the file, of course.

Then, in the first shell, run the command

k data_provider +u +ns **

Here ** should be replaced with the number of streams. For example, if you have 1000 time series to be processed, ** should be 1000.

This tells data_provider to use user-provided data, instead of simulated data. In this case, data_provider reads from the .csv files, rather than writing to them.

In the second window, run, e.g.,

k statstream2 +corr 0.90 +sw 50 +bw 10 +ns **

where ** is the same stream count as above.

The output.csv is now calculated from your input files.

Note: due to the memory limitation of the K trial version, if the user's data takes over 50 megabytes when the program runs, the interpreter will stop with an error message beginning "eval limit".

### Training for Uncooperative Time Series: for the advanced user

We call data uncooperative if spectral methods such as Discrete Fourier Transform, Singular Value Decomposition, and Wavelet Transform give poor performance. In that case, we use a technique called "random projections" as described in this paper. This technique requires a few parameters to be set. The basic statstream2 call will use empirically good settings of these parameters stored in goodparam.csv, but if you plan to use the data a lot, you may want to find settings of the parameters that fit your data. To do that, you need to add one parameter to your call:

k statstream2 +sw 128 +bw 32 +corr 0.85 +ns 50 +recall 0.99 +train 1 +debug 1

"+recall 0.99" means the final system will output correlated pairs with a recall larger than or equal to 0.99. "+train 1" indicates that parameter training is needed. Adding "+debug 1" will display all the message to the screen. Besides being used in that call, system parameters will be stored in goodparam.csv as good empirical parameters for next use. You may want to consider to train the system whenever the data has materially changed.