Elastic Burst detection
Yunyue Zhu
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
yunyue@cs.nyu.edu
http://cs.nyu.edu/yunyue/index.html
Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html
Motivation
In many applications, people are interested in discovering intervals
with unusually large numbers of events.
- In astrophysics, the sky is constantly observed for high-energy particles.
When a particular astrophysical event happens, a shower of
high-energy particles
arrives in addition to the background noise.
This yields an unusually high number of detectable events
over a time period.
- In telecommunication, if the number of packages lost within a certain time
period exceeds some threshold, it might indicate some network anomaly.
- In finance, stocks with unusual high trading volumes should attract the
notice of traders (or perhaps regulators).
Formally, the problem of interest is to discover a subsequence of the time
series that have a particular aggregate significantly different from that
aggregate in most time periods of the same size.
In the case of burst detection, the aggregate
is sum.
(In fact, we can be more general than this.
The aggregate function must be monotonic with window inclusion,
i.e. if window1 is
included in window2, then agg(window1) is less than agg(window2).
Also, the test for interestingness of the aggregate is magnitude,
where larger is most interesting.
Sum is one example, but so is, for example, spread
i.e., the max value in a window less the min value in the window.)
If we know the duration of the time interval,
we can maintain the
sum over sliding windows of the known window size and raise an alarm when
the moving sum is above a threshold. Unfortunately, in many cases, we cannot
predict the duration of the burst period. For example, in astrophysics, a burst
of high-energy particles associated with a special event might last for a few
milliseconds or a few hours or even a few days.
Similarly, in finance, a burst may result from insider trading,
the attempt to buy or sell a large bunch of stocks,
or from price collapse of a significant
stock as happened for example on January 18, 2002 to
Imclone stock after news of the Martha Stuart scandal.
The formal definition of
the problem of burst detection is as follows.
Given a time series of positive numbers x_1,x_2,...,x_n, and a threshold
function f(w), w=1,2,...,n, find the subsequences of any size such that
their sums are above the thresholds, i.e.,
A brute force search has to examine all the sliding window sizes and starting
positions.
Because there are O(n^2) windows for a sequence of size
n, the lower bound of the time complexity is O(n^2). This is
very slow for many applications. However, in most applications, we are
interested only in those few windows that experience bursts. We
present here a burst
searching program whose time complexity is approximately proportional to the
size of the output, i.e. the number of windows with bursts, and that
requires one pass over the input.
Here is a pictorial description of the algorithm.
Here is a paper.
Installation
The burst detector runs in a high performance interpreted environment called K.
-
To begin with, therefore, please download trial K from
kx.com .
The license is good for a limited duration, but is renewable
an unlimited number of times.
K and our program run equally well on linux and windows.
-
Send email to shasha@cs.nyu.edu.
If you care to describe your application, we'd be glad to hear
about it.
In any case, we will send you instructions for downloading the program:
Semantics of the Input
The input time series is of the following format:
x1
x2
x3
...
The input threshold is of the following format:
w1,f(w1)
w2,f(w2)
w3,f(w3)
...
If the sum of the subsequence in a window with size w_i
is larger than f(w_i), we say that the window
experiences burst.
Semantics of the Result
The output of the program is list of tuples as follows.
Window Size, Starting Position, Ending Position
Each tuple means that in the time series, between positions
Starting Position and Ending Position,
all the sliding windows with size Window Size
experience bursts.
Options
The program has a few command line options:
k burst [+f filename1 filename2]
-
With no options, the input time series comes from the file burstin
and thresholds come from the file threshin .
-
+f foo bar -- specifies that the input file of time series and threshold are
foo and bar
instead of burstin and threshin .
Support
This material is based upon work partly supported by
the United States National Science Foundation under
grants IIS-9988636 and N2010-0115586.
Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the authors and do not
necessarily reflect the views of the National Science Foundation.
This support is greatly appreciated.