StatStream: A Monitor for High Speed Time Series Data Stream Statistics
Yunyue Zhu
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
yunyue@cs.nyu.edu
Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html
Motivation
Data streams are important in many applications,
such as telecommunications, network monitoring,
environmental monitoring, and financial analysis.
Data streams provide information that is ordered by time.
Unfortunately, such data are difficult to process
in set-oriented data management systems.
Our system's goal is to compute in
near constant time the statistics for time series data streams,
such as moving average, moving standard deviation and moving correlation.
This web page describes the installation, use, and
semantics of the software based on these algorithms,
which you may use for research purposes.
Installation
The Data Streams Monitor runs in a high performance
interpreted environment called K.
- To begin with, please download a trial version of K from
kx.com.
Both K and our program run equally well on Linux and Windows.
- Send email to shasha@cs.nyu.edu.
If you care to describe your application,
we'd be glad to hear about it.
In any case, we will send you instructions
for downloading two files:
- stat.k -- the main program of the software
- server.k -- the database server; it generates the
sample time series database and feeds the data stream to the
main program.
- You can then run the program by typing
- k server
to generate a sample time series database and to start
the server.
- k stat
to monitor the statistics of the time series
data streams.
Semantics of the Input
The incoming time series streams consist of many substreams.
Each substream has a stream ID.
We simulate the environment of data streams
by running a server that feeds data to the main program.
You can use the server.k script to
generate a sample time series database,
or create and use your own database by
providing tables in CSV (comma-separated value) file format.
The database consists of the following tables.
The first table, metainfo.csv, contains information about the streams.
Its schema is metainfo (streamID, [betaStream], ...).
The betaStream field specifies the streamID of another stream.
Using this table, the user can categorize the streams
based on their different properties.
The category table has the schema
category (streamID, category).
For example, a data stream of stock prices
of different companies may have the following metainfo table.
metainfo: StreamID, Industry, Market Capital, Exchange, betaStream
Dell, Computer, 72.5, NASDAQ, SP500
Intel, Computer, 203.6, NASDAQ, DOW
Lucent, Computer, 22.31, NYSE, DOW
GM, Auto, 34.12, NYSE, DOW
Based on the information of the streams,
the user can put the substreams into different categories.
category: streamID, category
Dell, C1
Intel, C2
Lucent, C1
GM, C2
The second table newdata.csv has the time series data with
schema newdata (streamID, timepoint, value).
The substreams come in time order, which we abstract as timepoints.
The timepoints start from one and increase by one
each time a new value of the substream is available.
Across different substreams, timepoints also arrive in non-decreasing order.
Here are some sample data files:
metainfo.csv,
category.csv,
newdata.csv.
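To make the file formats concrete, here is a small Python sketch (hypothetical helper code, not part of StatStream) that writes and reads back a tiny newdata.csv in the schema above; the stream IDs and values are invented for illustration:

```python
import csv

# Rows in the schema newdata (streamID, timepoint, value).
# The streams and prices below are made up for illustration.
rows = [
    ("Dell",  1, 26.10),
    ("Intel", 1, 30.25),
    ("Dell",  2, 26.40),
    ("Intel", 2, 30.10),
    ("Dell",  3, 26.55),
    ("Intel", 3, 30.42),
]

with open("newdata.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read the file back, grouping values by stream ID in time order.
streams = {}
with open("newdata.csv", newline="") as f:
    for stream_id, timepoint, value in csv.reader(f):
        streams.setdefault(stream_id, []).append(float(value))

print(streams["Dell"])   # values for the Dell substream, in time order
```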
Our program will monitor two kinds of statistics:
- Statistics of a single substream. These statistics
include
- Moving Average
- Standard deviation
- Maximum and Minimum
- Beta: the sensitivity of a substream's values
to the values of another substream,
specified in the metainfo table.
- First Order Autocorrelation: the correlation of the
series with itself when the time period is one apart, i.e.,
at time points t and t+1.
- Best Fit Slope: the slope of the line that best fits the
time series.
Our statistics are computed based on a sliding window parameter.
For example, if the length of sliding window is 100 data points
and the current time point for substream ibm_share_price is 1000,
the average for this substream is the average over the
timepoints from 991 to 1000.
- Correlation between pairs of substreams.
Our program will report only highly correlated pairs
because the number of pairs is otherwise too large.
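As a concrete illustration, the single-substream statistics above can be computed over one sliding window as follows. This is a plain-Python sketch of the definitions only, not the incremental K implementation used by StatStream; the function name and the population (divide-by-n) conventions are our own choices.

```python
import math

def window_stats(window, beta_window=None):
    """Compute the single-substream statistics over one sliding window.

    window      -- the last l values of a substream
    beta_window -- the matching l values of its beta stream (optional)
    """
    n = len(window)
    mean = sum(window) / n
    var = sum((x - mean) ** 2 for x in window) / n
    stats = {
        "average": mean,
        "stddev": math.sqrt(var),
        "max": max(window),
        "min": min(window),
    }

    # First order autocorrelation: correlation of the series with itself
    # one timepoint apart (t and t+1).
    num = sum((window[t] - mean) * (window[t + 1] - mean) for t in range(n - 1))
    stats["autocorrelation"] = num / (n * var) if var else 0.0

    # Best fit slope: least-squares slope against timepoints 1..n.
    t_mean = (n + 1) / 2
    st = sum((t + 1 - t_mean) * (window[t] - mean) for t in range(n))
    stt = sum((t + 1 - t_mean) ** 2 for t in range(n))
    stats["slope"] = st / stt

    # Beta: sensitivity of this substream to its beta stream,
    # cov(x, b) / var(b).
    if beta_window is not None:
        b_mean = sum(beta_window) / n
        cov = sum((x - mean) * (b - b_mean)
                  for x, b in zip(window, beta_window)) / n
        b_var = sum((b - b_mean) ** 2 for b in beta_window) / n
        stats["beta"] = cov / b_var if b_var else 0.0

    return stats
```

With a sliding window of length l and current time point t, StatStream applies these definitions to the values at timepoints t-l+1 through t.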
Parameters
The following parameters are specified by the user.
- Length of Sliding Window l
- Frequency to Update Statistics f
This parameter tells the program to compute and output the
statistics every f timepoints.
For example, if f=5, then the program will output the result
every 5 timepoints of a substream.
For efficiency's sake, f should divide
l.
- Correlation Threshold
Only substream pairs whose correlation coefficient has an
absolute value larger than a specified threshold will be reported.
The default value is 0.9.
- Duration over Threshold
Some users might be interested in only those pairs with
correlation coefficients above the
threshold for a pre-defined period.
This parameter provides such users with an
option to specify a minimum duration.
The default value is 1.
- Categories Comparing Method (Optional)
Some users might not be interested in the comparison of all
substream pairs. They can then divide these substreams into
different categories and the computation of
the correlation will be based upon these categories.
There are two category options:
inter-categories and intra-categories.
With the inter-categories option, the two streams in a pair
will be compared only if they are from different categories.
With the intra-categories option, the two streams in a pair
will be compared only if they are from the same category.
For example, suppose we have 5 substreams in 2 categories:
C1: (s1,s2,s3), C2: (s4,s5)
- With the inter-categories option, the following
pairs will be compared:
(s1,s4),(s1,s5),(s2,s4),(s2,s5),(s3,s4),(s3,s5)
- With the intra-categories option, the following
pairs will be compared:
(s1,s2),(s1,s3),(s2,s3),(s4,s5)
Without the category table, we will compute the correlation of
all pairs with different streamIDs.
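The pair selection described above can be sketched as follows (an illustrative Python sketch; `pairs_to_compare` is our own hypothetical helper, not part of the K program):

```python
from itertools import combinations

def pairs_to_compare(categories, mode="all"):
    """Enumerate the substream pairs whose correlation will be computed.

    categories -- dict mapping streamID to its category
    mode       -- "all" (every pair of distinct streamIDs),
                  "inter" (different categories only),
                  or "intra" (same category only)
    """
    result = []
    for a, b in combinations(sorted(categories), 2):
        same = categories[a] == categories[b]
        if mode == "all" or (mode == "inter" and not same) \
                         or (mode == "intra" and same):
            result.append((a, b))
    return result

# The 5-substream, 2-category example from the text.
cats = {"s1": "C1", "s2": "C1", "s3": "C1", "s4": "C2", "s5": "C2"}
print(pairs_to_compare(cats, "inter"))
# [('s1', 's4'), ('s1', 's5'), ('s2', 's4'), ('s2', 's5'), ('s3', 's4'), ('s3', 's5')]
print(pairs_to_compare(cats, "intra"))
# [('s1', 's2'), ('s1', 's3'), ('s2', 's3'), ('s4', 's5')]
```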
Semantics of the Result
The program will output the statistics of single and
paired substreams to two files.
Both files will be in CSV format.
- The schema for the statistics of a single stream
will be as follows:
Stream ID, Begin Time Point, End Time Point, Average,
Standard Deviation [,Max,Min][,First Order Autocorrelation]
[,Beta][,Best Fit Slope]
- The schema for the correlation of substreams will be as
follows:
streamID1,streamID2,Begin Time Point, End Time Point,
Correlation Coefficient, Duration
Begin Time Point and End Time Point are, respectively, the
first and the last time points of the sliding window.
Options
The main program has several command line options:
k stat [+h host] [+p port] [+out1 file1] [+out2 file2]
[+l sliding window length] [+f frequency to update]
[+m] [+a] [+b] [+s] [+inter|intra] [+t threshold] [+d minimum
duration]
- With no options specified, the input stream comes from
localhost on port 2000. The statistics of single substreams,
including average and standard deviation, go to the file
output1. The correlation statistics
between substreams go to the file output2.
- +h host -- specifies the host the input stream comes
from.
- +p port -- specifies the port the input stream comes
from.
- +out1 file1 -- specifies the file to which the
statistics of single streams are written.
- +out2 file2 -- specifies the file to which the
correlations of stream pairs are written.
- +l sliding window length -- specifies the length of
the sliding window
- +f frequency to update -- specifies the frequency to
update and output the statistics
- +inter|intra -- specifies whether substream
pairs are compared inter-category or intra-category.
Without this option, we compute all pairs of
substreams.
- +m -- specifies that the program outputs the statistics
of maximum and minimum as well.
- +a -- specifies that the program outputs the statistics
of first order autocorrelation as well.
- +b -- specifies that the program outputs the beta
of the substream as well.
- +s -- specifies that the program outputs the statistics
of best fit slopes as well.
The database server has several command line options:
k server [+p port] [+u] [+s number of substreams] [+f
number of time points]
- With no options, the database server will listen on port
2000. The server will generate sample data of 10
substreams with 1000 time points for each substream.
- +p port -- specifies the port the server listens
on.
- +s number of substreams -- specifies the number of
substreams
- +f number of time points -- specifies the number of time
points for each substream.
- +u -- specifies that the user-defined data be used instead of
the generated sample data. These user-defined data are in the
tables metainfo.csv and newdata.csv. This option
overrides the +s and +f options.
Algorithm and Performance of StatStream
StatStream uses our innovative algorithms for data streams
to achieve high performance.
One of our innovations is to use Discrete Fourier Transforms
(DFT) for the computation of correlation.
Our algorithm extracts summary data from the data streams.
The summary data include the Fourier coefficients of single
streams and running cross-products for pairs of streams.
StatStream maintains these digests incrementally and thus
guarantees constant-time computation of statistics for data streams.
The code documents what we do to some extent.
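The incremental idea can be illustrated for the moving average and standard deviation. This is a simplified Python sketch of the digest technique, not StatStream's K code, and it omits the DFT-based correlation digest:

```python
import math
from collections import deque

class SlidingDigest:
    """Maintain sum(x) and sum(x^2) over a sliding window so that the
    moving average and standard deviation can be read off in O(1) work
    per arriving point, independent of the window length."""

    def __init__(self, window_length):
        self.l = window_length
        self.window = deque()
        self.sum_x = 0.0
        self.sum_x2 = 0.0

    def add(self, x):
        self.window.append(x)
        self.sum_x += x
        self.sum_x2 += x * x
        if len(self.window) > self.l:       # slide: retire the oldest point
            old = self.window.popleft()
            self.sum_x -= old
            self.sum_x2 -= old * old

    def average(self):
        return self.sum_x / len(self.window)

    def stddev(self):
        n = len(self.window)
        var = self.sum_x2 / n - (self.sum_x / n) ** 2
        return math.sqrt(max(var, 0.0))     # clamp tiny negative rounding
```

The same trick extends to pairs: a running cross-product sum(x*y), together with the per-stream sums, yields the correlation coefficient in constant time, and the DFT coefficients let StatStream screen pairs without touching the raw window data.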
Here are some figures measured on a high performance PC
-- a Dell Dimension 8100 (P4 1.5GHz) with 128M of RAM.
The total time is the time to compute all statistics
on all members of all streams.
The time per time point is the time to compute all statistics
for each time point on all members of all streams
when the program is running in a steady state.
Number of Streams | Number of Categories | Category Comparing Method | Number of Time Points (each stream) | Length of Sliding Window | Frequency to Update | Total Time (seconds) | Time per Time Point (seconds)
1000 |  1 | All              |   6,400 |  3,200 |   64 | 16,005 | 4.8
 100 |  1 | All              |   6,400 |  3,200 |   64 |    134 | 0.036
  10 |  1 | All              |   6,400 |  3,200 |   64 |      4 | 0.0006
 100 | 10 | All              |  32,000 |  6,400 |   64 |  1,229 | 0.046
 100 | 10 | Inter Categories |  32,000 |  6,400 |   64 |  1,083 | 0.041
 100 | 10 | Intra Categories |  32,000 |  6,400 |   64 |    303 | 0.010
  50 |  1 | All              | 102,400 | 51,200 | 1024 |    103 | 0.0013
  50 |  1 | All              | 102,400 | 51,200 |  256 |    274 | 0.0042
  50 |  1 | All              | 102,400 | 51,200 |   64 |  1,314 | 0.020