StatStream: A Monitor for High Speed Time Series Data Stream Statistics
Yunyue Zhu
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
yunyue@cs.nyu.edu
Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html
Motivation
Data streams are important in many applications,
such as telecommunications, network monitoring,
environmental monitoring, and financial analysis.
Data streams provide information that is ordered by time.
Unfortunately, such data are difficult to process
in set-oriented data management systems.
Our system's goal is to compute in
near constant time the statistics for time series data streams,
such as moving average, moving standard deviation and moving correlation.
This web page describes the installation, use, and
semantics of the software based on these algorithms,
which you may use for research purposes.
Installation
The Data Streams Monitor runs in a high performance
interpreted environment called K.
- To begin with, please download a trial version of K from
kx.com.
Both K and our program run equally well on Linux and Windows.
- Send email to shasha@cs.nyu.edu.
If you care to describe your application,
we'd be glad to hear about it.
In any case, we will send you instructions
for downloading two files:
- stat.k -- the main program of the software
- server.k -- the database server; it generates the
sample time series database and feeds the data stream to the
main program.
- You can then run the program by typing
- k server
to generate a sample time series database and to start
the server.
- k stat
to monitor the statistics of the time series
data streams.
Semantics of the Input
The incoming time series streams consist of many substreams.
Each substream has a stream ID.
We simulate the environment of data streams
by running a server that feeds data to the main program.
You can use the server.k script to
generate a sample time series database,
or create and use your own database by
providing tables in CSV (comma-separated value) file format.
The database consists of the following tables.
The first table, metainfo.csv, contains information about the streams.
Its schema is metainfo (streamID, [betaStream], ...).
The betaStream field specifies the streamID of another stream.
Using this table, the user can categorize the streams
based on their different properties.
The category table has the schema
category (streamID, category).
For example, a data stream of stock prices
of different companies may have the following metainfo table.
metainfo: StreamID, Industry, Market Capital, Exchange, betaStream
Dell, Computer, 72.5, NASDAQ, SP500
Intel, Computer, 203.6, NASDAQ, DOW
Lucent, Computer, 22.31, NYSE, DOW
GM, Auto, 34.12, NYSE, DOW
Based on the information of the streams,
the user can put the substreams into different categories.
category: streamID, category
Dell, C1
Intel, C2
Lucent, C1
GM, C2
The second table newdata.csv has the time series data with
schema newdata (streamID, timepoint, value).
The substreams come in time order, which we abstract as timepoints.
The timepoints start from one and increase by one
each time a new value of the substream is available.
Across different substreams, timepoints also arrive in non-decreasing order.
Here are some sample data files:
metainfo.csv,
category.csv,
newdata.csv.
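To make the file formats concrete, here is a small Python sketch (hypothetical helper code, not part of StatStream) that writes and reads back a tiny newdata.csv in the schema above; the stream IDs and values are invented for illustration:

```python
import csv

# Rows in the schema newdata (streamID, timepoint, value).
# The streams and prices below are made up for illustration.
rows = [
    ("Dell",  1, 26.10),
    ("Intel", 1, 30.25),
    ("Dell",  2, 26.40),
    ("Intel", 2, 30.10),
    ("Dell",  3, 26.55),
    ("Intel", 3, 30.42),
]

with open("newdata.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read the file back, grouping values by stream ID in time order.
streams = {}
with open("newdata.csv", newline="") as f:
    for stream_id, timepoint, value in csv.reader(f):
        streams.setdefault(stream_id, []).append(float(value))

print(streams["Dell"])   # values for the Dell substream, in time order
```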
Our program will monitor two kinds of statistics:
- Statistics of a single substream. These statistics
include
- Moving Average
- Standard deviation
- Maximum and Minimum
- Beta: the sensitivity of a substream's values
to the values of another substream,
specified in the metainfo table.
- First Order Autocorrelation: the correlation of the
series with itself when the time period is one apart, i.e.,
at time points t and t+1.
- Best Fit Slope: the slope of the line that best fits the
time series.
Our statistics are computed based on a sliding window parameter.
For example, if the length of sliding window is 100 data points
and the current time point for substream ibm_share_price is 1000,
the average for this substream is the average over the
timepoints from 991 to 1000.
- Correlation between pairs of substreams.
Our program will report only highly correlated pairs
because the number of pairs is otherwise too large.
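As a concrete illustration, the single-substream statistics above can be computed over one sliding window as follows. This is a plain-Python sketch of the definitions only, not the incremental K implementation used by StatStream; the function name and the population (divide-by-n) conventions are our own choices.

```python
import math

def window_stats(window, beta_window=None):
    """Compute the single-substream statistics over one sliding window.

    window      -- the last l values of a substream
    beta_window -- the matching l values of its beta stream (optional)
    """
    n = len(window)
    mean = sum(window) / n
    var = sum((x - mean) ** 2 for x in window) / n
    stats = {
        "average": mean,
        "stddev": math.sqrt(var),
        "max": max(window),
        "min": min(window),
    }

    # First order autocorrelation: correlation of the series with itself
    # one timepoint apart (t and t+1).
    num = sum((window[t] - mean) * (window[t + 1] - mean) for t in range(n - 1))
    stats["autocorrelation"] = num / (n * var) if var else 0.0

    # Best fit slope: least-squares slope against timepoints 1..n.
    t_mean = (n + 1) / 2
    st = sum((t + 1 - t_mean) * (window[t] - mean) for t in range(n))
    stt = sum((t + 1 - t_mean) ** 2 for t in range(n))
    stats["slope"] = st / stt

    # Beta: sensitivity of this substream to its beta stream,
    # cov(x, b) / var(b).
    if beta_window is not None:
        b_mean = sum(beta_window) / n
        cov = sum((x - mean) * (b - b_mean)
                  for x, b in zip(window, beta_window)) / n
        b_var = sum((b - b_mean) ** 2 for b in beta_window) / n
        stats["beta"] = cov / b_var if b_var else 0.0

    return stats
```

With a sliding window of length l and current time point t, StatStream applies these definitions to the values at timepoints t-l+1 through t.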
Parameters
The following parameters are specified by the user.
- Length of Sliding Window l
- Frequency to Update Statistics f
This parameter tells the program to compute and output the
statistics every f timepoints.
For example, if f=5, then the program will output the result
every 5 timepoints of a substream.
For efficiency's sake, f should divide
l.
- Correlation Threshold
Only substream pairs whose correlation coefficient has an
absolute value larger than a specified threshold will be reported.
The default value is 0.9.
- Duration over Threshold
Some users might be interested in only those pairs with
correlation coefficients above the
threshold for a pre-defined period.
This parameter provides such users with an
option to specify a minimum duration.
The default value is 1.
- Categories Comparing Method (Optional)
Some users might not be interested in the comparison of all
substream pairs. They can then divide these substreams into
different categories and the computation of
the correlation will be based upon these categories.
There are two category options:
inter-categories and intra-categories.
With the inter-categories option, the two streams in a pair
will be compared only if they are from different categories.
With the intra-categories option, the two streams in a pair
will be compared only if they are from the same category.
For example, suppose we have 5 substreams in 2 categories:
C1: (s1,s2,s3), C2: (s4,s5)
- With the inter-categories option, the following
pairs will be compared:
(s1,s4),(s1,s5),(s2,s4),(s2,s5),(s3,s4),(s3,s5)
- With the intra-categories option, the following
pairs will be compared:
(s1,s2),(s1,s3),(s2,s3),(s4,s5)
Without the category table, we will compute the correlation of
all pairs with different streamIDs.
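The pair selection described above can be sketched as follows (an illustrative Python sketch; `pairs_to_compare` is our own hypothetical helper, not part of the K program):

```python
from itertools import combinations

def pairs_to_compare(categories, mode="all"):
    """Enumerate the substream pairs whose correlation will be computed.

    categories -- dict mapping streamID to its category
    mode       -- "all" (every pair of distinct streamIDs),
                  "inter" (different categories only),
                  or "intra" (same category only)
    """
    result = []
    for a, b in combinations(sorted(categories), 2):
        same = categories[a] == categories[b]
        if mode == "all" or (mode == "inter" and not same) \
                         or (mode == "intra" and same):
            result.append((a, b))
    return result

# The 5-substream, 2-category example from the text.
cats = {"s1": "C1", "s2": "C1", "s3": "C1", "s4": "C2", "s5": "C2"}
print(pairs_to_compare(cats, "inter"))
# [('s1', 's4'), ('s1', 's5'), ('s2', 's4'), ('s2', 's5'), ('s3', 's4'), ('s3', 's5')]
print(pairs_to_compare(cats, "intra"))
# [('s1', 's2'), ('s1', 's3'), ('s2', 's3'), ('s4', 's5')]
```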
Semantics of the Result
The program will output the statistics of single and
paired substreams to two files.
Both files will be in CSV format.
- The schema for the statistics of a single stream
will be as follows:
Stream ID, Begin Time Point, End Time Point, Average,
Standard Deviation [,Max,Min][,First Order Autocorrelation]
[,Beta][,Best Fit Slope]
- The schema for the correlation of substreams will be as
follows:
streamID1,streamID2,Begin Time Point, End Time Point,
Correlation Coefficient, Duration
Begin Time Point and End Time Point are, respectively, the
first and the last time points of the sliding window.
Options
The main program has several command line options:
k stat [+h host] [+p port] [+out1 file1] [+out2 file2]
[+l sliding window length] [+f frequency to update]
[+m] [+a] [+b] [+s] [+inter|intra] [+t threshold] [+d minimum
duration]
- With no options specified, the input stream comes from
localhost on port 2000. The statistics of single substreams,
including average and standard deviation, go to the file
output1. The correlation statistics
between substreams go to the file output2.
- +h host -- specifies the host the input stream comes
from.
- +p port -- specifies the port the input stream comes
from.
- +out1 file1 -- specifies the file to which the
statistics of single streams are written.
- +out2 file2 -- specifies the file to which the
correlations of stream pairs are written.
- +l sliding window length -- specifies the length of
the sliding window
- +f frequency to update -- specifies the frequency to
update and output the statistics
- +inter|intra -- specifies whether substream
pairs are compared inter-category or intra-category.
Without this option, we compute all pairs of
substreams.
- +m -- specifies that the program outputs the statistics
of maximum and minimum as well.
- +a -- specifies that the program outputs the statistics
of first order autocorrelation as well.
- +b -- specifies that the program outputs the beta
of the substream as well.
- +s -- specifies that the program outputs the statistics
of best fit slopes as well.
The database server has several command line options:
k server [+p port] [+u] [+s number of substreams] [+f
number of time points]
- With no options, the database server will listen on port
2000. The server will generate sample data of 10
substreams with 1000 time points for each substream.
- +p port -- specifies the port the server listens
on.
- +s number of substreams -- specifies the number of
substreams
- +f number of time points -- specifies the number of time
points for each substream.
- +u -- specifies that the user-defined data be used instead of
the generated sample data. These user-defined data are in the
tables metainfo.csv and newdata.csv. This option
overrides the +s and +f options.
Algorithm and Performance of StatStream
StatStream uses our innovative algorithms for data streams
to achieve high performance.
One of our innovations is to use Discrete Fourier Transforms
(DFT) for the computation of correlation.
Our algorithm extracts summary data from the data streams.
The summary data include the Fourier coefficients of single
streams and running cross-products for pairs of streams.
StatStream maintains these digests incrementally and thus
guarantees constant-time computation of statistics for data streams.
The code documents what we do to some extent.
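The incremental idea can be illustrated for the moving average and standard deviation. This is a simplified Python sketch of the digest technique, not StatStream's K code, and it omits the DFT-based correlation digest:

```python
import math
from collections import deque

class SlidingDigest:
    """Maintain sum(x) and sum(x^2) over a sliding window so that the
    moving average and standard deviation can be read off in O(1) work
    per arriving point, independent of the window length."""

    def __init__(self, window_length):
        self.l = window_length
        self.window = deque()
        self.sum_x = 0.0
        self.sum_x2 = 0.0

    def add(self, x):
        self.window.append(x)
        self.sum_x += x
        self.sum_x2 += x * x
        if len(self.window) > self.l:       # slide: retire the oldest point
            old = self.window.popleft()
            self.sum_x -= old
            self.sum_x2 -= old * old

    def average(self):
        return self.sum_x / len(self.window)

    def stddev(self):
        n = len(self.window)
        var = self.sum_x2 / n - (self.sum_x / n) ** 2
        return math.sqrt(max(var, 0.0))     # clamp tiny negative rounding
```

The same trick extends to pairs: a running cross-product sum(x*y), together with the per-stream sums, yields the correlation coefficient in constant time, and the DFT coefficients let StatStream screen pairs without touching the raw window data.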
Here are some figures measured on a high performance PC
-- a Dell Dimension 8100 (P4 1.5GHz) with 128M of RAM.
The total time is the time to compute all statistics
on all members of all streams.
The time per time point is the time to compute all statistics
for each time point on all members of all streams
when the program is running in a steady state.
Number of Streams | Number of Categories | Category Comparing Method | Number of Time Points (each stream) | Length of Sliding Window | Frequency to Update | Total Time (seconds) | Time per Time Point (seconds)
1000 |  1 | All              |   6,400 |  3,200 |   64 | 16,005 | 4.8
 100 |  1 | All              |   6,400 |  3,200 |   64 |    134 | 0.036
  10 |  1 | All              |   6,400 |  3,200 |   64 |      4 | 0.0006
 100 | 10 | All              |  32,000 |  6,400 |   64 |  1,229 | 0.046
 100 | 10 | Inter Categories |  32,000 |  6,400 |   64 |  1,083 | 0.041
 100 | 10 | Intra Categories |  32,000 |  6,400 |   64 |    303 | 0.010
  50 |  1 | All              | 102,400 | 51,200 | 1024 |    103 | 0.0013
  50 |  1 | All              | 102,400 | 51,200 |  256 |    274 | 0.0042
  50 |  1 | All              | 102,400 | 51,200 |   64 |  1,314 | 0.020