Better Burst detection
Xin Zhang
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
xinzhang@cs.nyu.edu
Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu
http://cs.nyu.edu/cs/faculty/shasha/index.html
Motivation
Many aspects of our lives are described by events.
An unexpectedly large number of events occurring
within some certain temporal or spatial region
is called a burst,
suggesting unusual behaviors or activities.
Bursts, as noteworthy phenomena, come up in many natural and social processes.
-
A burst of trading volume in some stock might indicate insider trading.
-
A burst of gamma rays may reflect the occurrence of a supernova.
-
A burst of methane may anticipate a coming volcanic eruption.
Detecting a burst over any of $k$ window
sizes, a problem we call
elastic burst detection,
in a stream of length $N$ naively requires $O(kN)$ time.
Our previous work showed
that a simple Shifted Binary Tree structure can reduce this time
substantially (in very favorable cases near to $O(N)$) by filtering away
obvious non-bursts.
Unfortunately, for certain data distributions, the filter
marks many windows of events as possible bursts, even though
a detailed check shows them to be non-bursts.
This work explores efficient data structures and algorithms
for high performance burst detection.
We present a family of data structures,
called Shifted Aggregation Trees (SAT),
and heuristic algorithm to search
for an efficient data structure given the
input time series and the window thresholds.
The advantage of this framework is that it is adaptive to different inputs.
A technical report can be found
here (TR2005-876).
Software
The software package can be found here.
The code is written in standard C++ and the binary is built using Microsoft Visual Studio 6.0.
Usage
To train a shifted aggregation tree from a small sample data:
search thresholdFile sampleDataFile satFile [maxNumberActiveState numberFinalStateToStop]
Where
- thresholdFile contains the window sizes of interest and the corresponding thresholds,
- sampleDataFile contains the training data in binary format,
- satFile defines the structure of the shifted aggregation tree.
- maxNumberActiveState: maximum number of active states in the best-first search in the state space algorithm,
i.e. only the best maxNumberActiveState states are kept in the state space, others are pruned.
The default value is 500.
- numberFinalStateToStop: after numberFinalStateToStop states are visited, the algorithm stops.
The default value is 500.
The formats of these files are explained below.
Please refer to the technical report for these two state space algorithm parameters.
To detect bursts:
detect thresholdFile dataFile satFile
dynamicDetect thresholdFile dataFile satFile
Where dataFile contains the time series to be detected,
and satFile contains the shifted aggregation tree used for the detection.
The output is printed to the standard output.
Semantics of the input
The input time series is stored in a binary date file of the following format:
x1
x2
x3
...
The input thresholds are of the following text format:
w1 | | f(w1)
|
w2 | | f(w2)
|
w3 | | f(w3)
|
... | | ...
|
Where w1 < w2 < w3 < ... and f(w1) < f(w2) < f(w3) < ...
If the sum of the subsequence in a window with size w_i
is larger than f(w_i), we say that the window
experiences a burst.
An example:
4 | | 10.4
|
9 | | 21.5
|
18 | | 43.2
|
40 | | 90
|
... | | ...
|
Semantics of a SAT file
A SAT file defines the structure of a shifted aggregation tree.
It has the following format:
N | | W_N
|
S_1 | | W_1 | | C^1_1 | | C^2_1 | | ...
|
S_2 | | W_2 | | C^1_2 | | C^2_2 | | ...
|
... | | ... | | ... | | ... | | ...
|
S_N | | W_N | | C^1_N | | C^2_N | | ...
|
Where N is the number of levels in this SAT, S_i is the shift at level i,
W_i is the window size of level i and C^i_j is the window size of the ith child's at level j.
For example, the following SAT file corresponds to a shifted binary tree of maximum window size of 32:
6 | | 32
|
1 | | 1 | | 1
|
1 | | 2 | | 1 | | 1
|
2 | | 4 | | 2 | | 2
|
4 | | 8 | | 4 | | 4
|
8 | | 16 | | 8 | | 8
|
16 | | 32 | | 16 | | 16
|
Semantics of the output
The output of the program is a list of tuples as follows.
Starting Position, Window Size, Aggregate within this window
Each tuple means that in the time series, the window starting at
Starting Position with size Window Size
experiences a burst having Aggregate value.
Contact
Please email shasha@cs.nyu.edu or xinzhang@cs.nyu.edu for any questions or comments.