StatStream: A Monitor for High Speed Time Series Data Stream Statistics

Yunyue Zhu
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
yunyue@cs.nyu.edu

Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html


Motivation

Data streams are important in many applications, such as telecommunications, network monitoring, environmental monitoring, and financial analysis. Data streams provide information that is ordered by time. Unfortunately, such data are difficult to process in set-oriented data management systems. Our system's goal is to compute, in near constant time, statistics for time series data streams, such as the moving average, moving standard deviation, and moving correlation.

This web page describes the installation, use, and semantics of the software based on these algorithms, which you may use for research purposes.


Installation

The Data Streams Monitor runs in a high performance interpreted environment called K.

Semantics of the Input

The incoming time series streams consist of many substreams. Each substream has a stream ID. We simulate the environment of data streams by running a server that feeds data to the main program. The server.k script can be used to generate a sample time series database. You can then create and use your own database by providing tables in CSV (comma-separated value) file format. The database consists of the following tables.

The first table, metainfo.csv, contains information about the streams. Its schema is metainfo (streamID, [betaStream], ...). The betaStream field specifies the streamID of another stream. From this table the user can categorize the streams based on their different properties. The category table has the schema category (streamID, category). For example, a data stream of stock prices of different companies may have the following metainfo table.

metainfo: streamID, Industry, Market Capital, Exchange, betaStream
Dell, Computer, 72.5, NASDAQ, SP500
Intel, Computer, 203.6, NASDAQ, DOW
Lucent, Computer, 22.31, NYSE, DOW
GM, Auto, 34.12, NYSE, DOW

Based on the information about the streams, the user can put the substreams into different categories.

category: streamID, category
Dell, C1
Intel, C2
Lucent, C1
GM, C2
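To make the categorization step concrete, here is a small Python sketch (not part of StatStream; the `categorize` helper and the hard-coded rows are illustrative assumptions) that derives a category table from one field of the metainfo table:

```python
import csv
from io import StringIO

# Hypothetical metainfo rows in the CSV layout shown above.
METAINFO = """\
Dell,Computer,72.5,NASDAQ,SP500
Intel,Computer,203.6,NASDAQ,DOW
Lucent,Computer,22.31,NYSE,DOW
GM,Auto,34.12,NYSE,DOW
"""

def categorize(metainfo_csv, field_index):
    """Assign each stream a category label based on one metainfo field,
    given here as a column index into the CSV row."""
    labels = {}   # field value -> category name (C1, C2, ...)
    rows = []
    for row in csv.reader(StringIO(metainfo_csv)):
        value = row[field_index]
        if value not in labels:
            labels[value] = "C%d" % (len(labels) + 1)
        rows.append((row[0], labels[value]))
    return rows

# Categorize by Industry (column 1): Computer -> C1, Auto -> C2.
for stream_id, cat in categorize(METAINFO, 1):
    print("%s,%s" % (stream_id, cat))
```

Grouping by a different field, such as Exchange, would yield a different category table; the choice is entirely up to the user.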

The second table, newdata.csv, holds the time series data with schema newdata (streamID, timepoint, value). The substreams come in time order, which we abstract as timepoints. For each substream, the timepoints start from one and increase by one each time a new value of that substream is available. Across substreams, values also arrive in non-decreasing timepoint order.
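The timepoint convention can be sketched as follows (the `NewDataWriter` class is a hypothetical illustration, not part of StatStream):

```python
from collections import defaultdict

# Minimal sketch of the timepoint convention: each substream's clock
# starts at 1 and advances by 1 with every new value for that substream.
class NewDataWriter:
    def __init__(self):
        self.clock = defaultdict(int)   # streamID -> last timepoint used

    def row(self, stream_id, value):
        """Return a (streamID, timepoint, value) row for newdata.csv."""
        self.clock[stream_id] += 1
        return (stream_id, self.clock[stream_id], value)

w = NewDataWriter()
print(w.row("Dell", 24.1))   # ('Dell', 1, 24.1)
print(w.row("Intel", 30.5))  # ('Intel', 1, 30.5)
print(w.row("Dell", 24.3))   # ('Dell', 2, 24.3)
```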

Here are some sample data files: metainfo.csv, category.csv, newdata.csv.

Our program will monitor two kinds of statistics:

  1. Statistics of a single substream. These include the moving average, standard deviation, and the optional statistics listed below (maximum and minimum, first rank autocorrelation, beta, and best fit slope). All statistics are computed over a sliding window. For example, if the length of the sliding window is 100 data points and the current timepoint for substream ibm_share_price is 1000, the average for this substream is the average over the timepoints from 901 to 1000.
  2. Correlation between pairs of substreams. Our program will report only highly correlated pairs because the number of pairs is otherwise too large.
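The single-substream statistics above can be sketched in Python. This illustrative `WindowStats` class (ours, not StatStream's K code) keeps running sums so that each new point costs constant time, the same incremental idea StatStream relies on:

```python
from collections import deque
from math import sqrt

# Sketch of sliding-window statistics: maintain running sums so that
# average and standard deviation cost O(1) per new data point.
class WindowStats:
    def __init__(self, length):
        self.length = length
        self.window = deque()
        self.total = 0.0
        self.total_sq = 0.0

    def update(self, x):
        self.window.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.window) > self.length:      # slide the window
            old = self.window.popleft()
            self.total -= old
            self.total_sq -= old * old

    def average(self):
        return self.total / len(self.window)

    def std(self):
        n = len(self.window)
        var = self.total_sq / n - self.average() ** 2
        return sqrt(max(var, 0.0))              # guard against rounding

# Window of length 100: at timepoint 1000 the window covers points 901-1000.
s = WindowStats(100)
for t in range(1, 1001):
    s.update(float(t))
print(s.average())  # 950.5
```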

Parameters

The following parameters are specified by the user.

Semantics of the Result

The program will output the statistics of single and paired substreams to two files. Both files will be in CSV format.

  1. The schema for the statistics of a single stream will be as follows:
    Stream ID, Begin Time Point, End Time Point, Average, Standard Deviation [,Max,Min][,First Rank Autocorrelation] [,Beta][,Best Fit Slope]
  2. The schema for the correlation of substreams will be as follows:
    streamID1,streamID2,Begin Time Point, End Time Point, Correlation Coefficient, Duration
Begin Time Point and End Time Point are, respectively, the first and the last time points of the sliding window.
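For illustration, here is a hypothetical fragment of the paired-substream output and a small parser for it (both the sample rows and the `parse_pairs` helper are our assumptions, not files StatStream ships):

```python
import csv
from io import StringIO

# Hypothetical sample of the paired-substream output, in the schema:
# streamID1, streamID2, Begin Time Point, End Time Point,
# Correlation Coefficient, Duration
REPORT = """\
Dell,Intel,901,1000,0.97,5
Dell,Lucent,901,1000,0.92,2
"""

def parse_pairs(text):
    """Parse correlation report rows into typed tuples."""
    out = []
    for s1, s2, begin, end, corr, dur in csv.reader(StringIO(text)):
        out.append((s1, s2, int(begin), int(end), float(corr), int(dur)))
    return out

for rec in parse_pairs(REPORT):
    print(rec)  # ('Dell', 'Intel', 901, 1000, 0.97, 5) etc.
```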

Options

The main program has several command line options:

k stat [+h host] [+p port] [+out1 file1] [+out2 file2] [+l sliding window length] [+f frequency to update] [+m] [+a] [+b] [+s] [+inter|intra] [+t threshold] [+d minimum duration]

The database server has several command line options:

k server [+p port] [+u] [+s number of substreams] [+f number of time points]

Algorithm and Performance of StatStream

StatStream uses our innovative algorithms for data streams to achieve high performance. One of our innovations is the use of the Discrete Fourier Transform (DFT) in the computation of correlation. Our algorithm extracts summary data from the data streams: the Fourier coefficients for single streams and running cross-products for pairs of streams. StatStream maintains these digests incrementally and thus guarantees constant-time computation of statistics for data streams. The code documents what we do to some extent.
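The DFT idea can be sketched in Python with numpy (an illustrative approximation, not StatStream's actual K implementation). For normalized windows the inner product, and hence the correlation, is preserved in the frequency domain (Parseval's theorem), so keeping only the first few DFT coefficients gives a cheap estimate that is accurate when most of the energy lies in the low frequencies:

```python
import numpy as np

def normalize(x):
    """Zero-mean, unit-norm version of a window, so that the inner
    product of two normalized windows is their correlation."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return x / np.linalg.norm(x)

def approx_corr(x, y, k=8):
    """Approximate corr(x, y) from the first k DFT coefficients of each
    normalized window.  By Parseval's theorem the time-domain inner
    product equals (1/n) * sum over all bins of X * conj(Y)."""
    X = np.fft.fft(normalize(x))
    Y = np.fft.fft(normalize(y))
    n = len(x)
    # Keep bins 0..k-1; for real input the bins n-j are conjugates of
    # bins j, so double every kept term except the DC term.
    terms = (X[:k] * np.conj(Y[:k])).real
    return (terms[0] + 2 * terms[1:].sum()) / n

# Two noisy sinusoids at DFT bin 4, slightly out of phase.
rng = np.random.default_rng(0)
t = np.arange(256)
x = np.sin(2 * np.pi * 4 * t / 256) + 0.05 * rng.standard_normal(256)
y = np.sin(2 * np.pi * 4 * t / 256 + 0.2) + 0.05 * rng.standard_normal(256)

exact = float(normalize(x) @ normalize(y))
print(exact, approx_corr(x, y))  # the two values should agree closely
```

Keeping a handful of coefficients per window, rather than the whole window, is what makes it feasible to compare many pairs of streams quickly.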

Here are some figures measured on a high performance PC: a Dell Dimension 8100 (Pentium 4, 1.5 GHz) with 128 MB of RAM. The total time is the time to compute all statistics on all members of all streams. The time per time point is the time to compute all statistics for each time point on all members of all streams when the program is running in a steady state.

Number of Streams | Number of Categories | Category Comparing Method | Number of Time Points (each stream) | Length of Sliding Window | Frequency to Update | Total Time (seconds) | Time per Time Point (seconds)
1,000 |  1 | All              |   6,400 |  3,200 |    64 | 16,005 | 4.8
  100 |  1 | All              |   6,400 |  3,200 |    64 |    134 | 0.036
   10 |  1 | All              |   6,400 |  3,200 |    64 |      4 | 0.0006
  100 | 10 | All              |  32,000 |  6,400 |    64 |  1,229 | 0.046
  100 | 10 | Inter Categories |  32,000 |  6,400 |    64 |  1,083 | 0.041
  100 | 10 | Intra Categories |  32,000 |  6,400 |    64 |    303 | 0.010
   50 |  1 | All              | 102,400 | 51,200 | 1,024 |    103 | 0.0013
   50 |  1 | All              | 102,400 | 51,200 |   256 |    274 | 0.0042
   50 |  1 | All              | 102,400 | 51,200 |    64 |  1,314 | 0.020