Title: Efficient Parallel Methods to Identify (Most of the) Similar Time Series Pairs Across Sliding Windows

Abstract: Consider the problem of finding the closest pairs of time series over a time window and then sliding that window to find the closest pairs over multiple windows, where each successive window starts only a little after the previous one. Doing this efficiently and in parallel could help in many applications involving sensor, financial, or communications data, to name a few. We have developed a parallel incremental sketching approach for this problem and compared it with the state-of-the-art nearest neighbor method ISAX \cite{}. Whereas ISAX achieves 100\% recall and precision, the sketching approach is 1000 times faster and achieves 95\% recall and 100\% precision on real and simulated data. For some applications this 1000-fold speedup may be worth the minor reduction in recall.

\section{Algorithmic Contribution}

Our basic approach to approximate similarity (whether Euclidean distance or Pearson correlation, for starters) is to compute the dot product of each time series over a window of size $w$ with a set of random vectors. That is, for each time series $t_i$, window size $w$, and time period $k \,..\, k+w-1$, we compute the dot product of $t_i[k..k+w-1]$ with, say, $r$ random $-1/+1$ vectors of size $w$. The $r$ dot products thus computed constitute the ``sketch'' of $t_i$ at time period $k$. Next we compare the sketches of the various time series to see which ones are close in sketch space (if $w \gg r$, which is often the case, this is cheaper than working directly on the time series) and then identify those close pairs.

In the case of sliding windows, we want to find the most similar time series pairs at jumps of a basic window $b$, e.g. $0 \,..\, w-1$, $b \,..\, b+w-1$, $2b \,..\, 2b+w-1, \ldots$, where $b \ll w$. There are several points here that require discussion:
\begin{enumerate}
\item If we compute the sketches from scratch at each basic window, we are doing redundant computation. We call that the naive method. We can do better.
\item When scaling this to a parallel system, we want to reduce communication costs as much as possible. We need to develop good strategies for this.
\end{enumerate}

\subsection{Incremental computation of sketches}

To explain the incremental algorithm, consider a small example. The sketch entry for random vector $v_1$ is the dot product of the time series with $v_1$, i.e. $1 \cdot (-1) + 2 \cdot 1 + \ldots + 4 \cdot (-1) = 4$. Now if the basic window is of size 2, then we simply take the next two points of the time series (in this case having values 2 and 1) and generate two more random $\pm 1$ numbers for each random vector (in this case $-1, -1$ for $v_1$ and $-1, 1$ for $v_2$). To update the dot product for $v_1$, all we need to do is subtract the contribution of the two oldest points, $1 \cdot (-1) + 2 \cdot 1 = 1$, and add the contribution of the two new points, $2 \cdot (-1) + 1 \cdot (-1) = -3$, yielding a new sketch entry of $4 - 1 + (-3) = 0$.

In general, the pseudo-code is this:
\begin{enumerate}
\item Partition the time series among the parallel sites. Replicate $r$ random $+1/-1$ vectors, each of size $w$, to all sites. These random vectors will later be extended in a replicated fashion.
\item For each site,
\begin{enumerate}
\item Initially, take the first $w$ data points of each time series at that site and form the dot product with all $r$ random vectors. So each time series $t$ will be represented by $r$ dot products. They constitute sketch($t$).
\item While data comes in: when the data for the $i$th basic window of size $b$ has appeared for all time series, extend each random vector by a new random $+1/-1$ vector of size $b$. Then for each time series $t$ and random vector $v$, update the dot product of $t$ with $v$ by subtracting $v[(i-1)b \,..\, ib-1] \cdot t[(i-1)b \,..\, ib-1]$ (the contribution of the oldest $b$ points) and adding $v[w+(i-1)b \,..\, w+ib-1] \cdot t[w+(i-1)b \,..\, w+ib-1]$ (the contribution of the newest $b$ points). Update sketch($t$) with the new dot products.
\end{enumerate}
\end{enumerate}
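To make the update concrete, here is a minimal single-site sketch in Python, assuming numpy. The class and method names are ours, purely illustrative. Rather than letting the random vectors grow without bound, we keep only the length-$w$ slice of each vector that overlaps the current window; this is equivalent to the growing-vector description above.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

class IncrementalSketch:
    """Maintain sketch(t) = (current window of t) . (r random +/-1 vectors)."""
    def __init__(self, first_window, r):
        # first_window: (n_series, w) array holding the first w points
        n, w = first_window.shape
        self.vectors = rng.choice([-1.0, 1.0], size=(r, w))  # live slices
        self.window = first_window.astype(float)
        self.sketch = self.window @ self.vectors.T           # (n_series, r)

    def advance(self, new_points):
        # new_points: (n_series, b) values of the next basic window
        b = new_points.shape[1]
        ext = rng.choice([-1.0, 1.0], size=(self.vectors.shape[0], b))
        # Subtract the contribution of the b oldest points ...
        self.sketch -= self.window[:, :b] @ self.vectors[:, :b].T
        # ... and add the contribution of the b newest points.
        self.sketch += new_points @ ext.T
        self.window = np.hstack([self.window[:, b:], new_points])
        self.vectors = np.hstack([self.vectors[:, b:], ext])
        return self.sketch
\end{verbatim}

Each call to advance costs $O(n \cdot r \cdot b)$ arithmetic per site, versus $O(n \cdot r \cdot w)$ for the naive from-scratch recomputation.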
\subsection{Parallel Mixing}

Once the sketch vectors have been constructed incrementally, the next step is to find pairs of sketch vectors that are close to one another. Such pairs indicate that their corresponding time series may be highly correlated (or similar under some other distance metric). We first explain how this works by example and then show the pseudo-code.

As we explained above, for every time series we construct a sketch vector in parallel and incrementally. Suppose we have seven time series with sketch values:
\begin{verbatim}
sketch(ts1) = (11, 12, 23, 24, 15, 16)
sketch(ts2) = (11, 12, 13, 14, 15, 16)
sketch(ts3) = (21, 22, 13, 14, 25, 26)
sketch(ts4) = (21, 22, 13, 14, 25, 26)
sketch(ts5) = (11, 12, 33, 34, 25, 26)
sketch(ts6) = (31, 32, 33, 34, 15, 16)
sketch(ts7) = (21, 22, 33, 34, 15, 16)
\end{verbatim}
We partition these into pairs of adjacent entries. We send
\begin{verbatim}
sketch(ts1)[0 1] = (11, 12)
sketch(ts2)[0 1] = (11, 12)
sketch(ts3)[0 1] = (21, 22)
sketch(ts4)[0 1] = (21, 22)
sketch(ts5)[0 1] = (11, 12)
sketch(ts6)[0 1] = (31, 32)
sketch(ts7)[0 1] = (21, 22)
\end{verbatim}
to site 1, where these mini-vectors will be formed into a grid. We send
\begin{verbatim}
sketch(ts1)[2 3] = (23, 24)
sketch(ts2)[2 3] = (13, 14)
sketch(ts3)[2 3] = (13, 14)
sketch(ts4)[2 3] = (13, 14)
sketch(ts5)[2 3] = (33, 34)
sketch(ts6)[2 3] = (33, 34)
sketch(ts7)[2 3] = (33, 34)
\end{verbatim}
to site 2, where these will be formed into a second grid. We send
\begin{verbatim}
sketch(ts1)[4 5] = (15, 16)
sketch(ts2)[4 5] = (15, 16)
sketch(ts3)[4 5] = (25, 26)
sketch(ts4)[4 5] = (25, 26)
sketch(ts5)[4 5] = (25, 26)
sketch(ts6)[4 5] = (15, 16)
sketch(ts7)[4 5] = (15, 16)
\end{verbatim}
to site 3, where these will be formed into a third grid.

In the first grid, ts1, ts2, and ts5 map to the same grid cell; ts6 is by itself; and ts3, ts4, and ts7 all map to the same cell. Thus we have a partitioning consisting of three members, each containing time series identifiers. In the second grid we have the members \{ts2, ts3, ts4\}; \{ts1\}; and \{ts5, ts6, ts7\}. In the third we have \{ts3, ts4, ts5\} and \{ts1, ts2, ts6, ts7\}. If two time series are in the same partition member, then they are candidates for similarity.
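To make the gridding concrete, here is a small Python sketch that reproduces the three partitionings above. The uniform cell width of 10 is an illustrative assumption; in practice the cell size is a tuning parameter of the grid.

\begin{verbatim}
from collections import defaultdict

sketches = {
    "ts1": (11, 12, 23, 24, 15, 16), "ts2": (11, 12, 13, 14, 15, 16),
    "ts3": (21, 22, 13, 14, 25, 26), "ts4": (21, 22, 13, 14, 25, 26),
    "ts5": (11, 12, 33, 34, 25, 26), "ts6": (31, 32, 33, 34, 15, 16),
    "ts7": (21, 22, 33, 34, 15, 16),
}

def grid_partition(sketches, indices, cell=10):
    """Group series ids by the grid cell of their mini-sketch at indices."""
    cells = defaultdict(list)
    for ts, sk in sketches.items():
        key = tuple(sk[i] // cell for i in indices)  # coarse grid cell
        cells[key].append(ts)
    return list(cells.values())

# Site s holds indices I_s; here k = 2, so I_1 = {0,1}, I_2 = {2,3}, ...
for s, idx in enumerate([(0, 1), (2, 3), (4, 5)], start=1):
    print("site", s, grid_partition(sketches, idx))
# site 1 [['ts1', 'ts2', 'ts5'], ['ts3', 'ts4', 'ts7'], ['ts6']]
# site 2 [['ts1'], ['ts2', 'ts3', 'ts4'], ['ts5', 'ts6', 'ts7']]
# site 3 [['ts1', 'ts2', 'ts6', 'ts7'], ['ts3', 'ts4', 'ts5']]
\end{verbatim}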
Now we construct a mapping tsToNode that maps time series identifiers to nodes (i.e. computational sites). For the sake of simplicity in this example, let us say it is the identity function. So we send the partition member \{ts1, ts2, ts5\} to nodes 1, 2, and 5; we send \{ts3, ts4, ts7\} to nodes 3, 4, and 7; and so on. Altogether:
\begin{itemize}
\item We send \{ts1, ts2, ts5\}; \{ts1\}; and \{ts1, ts2, ts6, ts7\} to node 1.
\item We send \{ts1, ts2, ts5\}; \{ts2, ts3, ts4\}; and \{ts1, ts2, ts6, ts7\} to node 2.
\item We send \{ts3, ts4, ts7\}; \{ts2, ts3, ts4\}; and \{ts3, ts4, ts5\} to node 3.
\item We send \{ts3, ts4, ts7\}; \{ts2, ts3, ts4\}; and \{ts3, ts4, ts5\} to node 4.
\item We send \{ts1, ts2, ts5\}; \{ts5, ts6, ts7\}; and \{ts3, ts4, ts5\} to node 5.
\item We send \{ts6\}; \{ts5, ts6, ts7\}; and \{ts1, ts2, ts6, ts7\} to node 6.
\item We send \{ts3, ts4, ts7\}; \{ts5, ts6, ts7\}; and \{ts1, ts2, ts6, ts7\} to node 7.
\end{itemize}
Let us say we require that 2/3 of the grids put two time series in the same grid cell before we are willing to check that pair in detail. We call this fraction the f factor. Each node takes care of those time series that map to it. So, for example, node 1 shows that ts1 and ts2 satisfy the requirement (they share a cell in the first and third grids). Node 2 shows nothing new concerning ts2. Node 3 shows that ts3 and ts4 satisfy the requirement (they share a cell in all three grids). Node 4 shows nothing new concerning ts4. Node 5 shows nothing new concerning ts5. Node 6 shows that ts6 and ts7 satisfy the requirement (they share a cell in the second and third grids). Node 7 shows nothing new. All pairs that satisfy the requirement can then be tested for direct correlation. If instead we demand that all grids agree (i.e. f = 1), the only non-trivial cluster we get is \{ts3, ts4\}, found at node tsToNode(ts3); all other clusters are singletons. In general, for f less than 1, we require that at least a fraction f of the partition members received at some node contain both time series i and i' before declaring that i and i' might be correlated.
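The per-node counting can be sketched in Python as follows. The function name is ours, and we assume each node receives, from every grid, the partition member containing its home time series.

\begin{verbatim}
from collections import Counter

def candidates_at_node(home_ts, received, n_grids, f):
    """Pairs (home_ts, other) co-occurring in >= f of the n_grids grids."""
    counts = Counter()
    for member in received:            # one partition member per grid
        if home_ts in member:
            for other in member:
                if other != home_ts:
                    counts[other] += 1
    return [(home_ts, o) for o, c in counts.items() if c / n_grids >= f]

# Node 1 receives the members containing ts1, one from each of the 3 grids:
received = [["ts1", "ts2", "ts5"], ["ts1"], ["ts1", "ts2", "ts6", "ts7"]]
print(candidates_at_node("ts1", received, n_grids=3, f=2/3))
# [('ts1', 'ts2')]  -- ts1 and ts2 share a cell in 2 of the 3 grids
\end{verbatim}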
Generalizing from this example, here is the intuitive algorithm. Partition each sketch vector into subsequences and send the first subsequence of every sketch to one site, the second subsequence to a second site, and so on. Each site computes a grid and places time series identifiers in grid cells. We estimate that two time series are close if they fall into the same grid cell in a fraction f of the grids. To determine this, we construct ``candidate clusters of time series'' based on each grid (i.e. each grid cell represents a candidate cluster) and then count the candidates in a distributed way.

Here are the details. Construct the candidate clusters of time series at each site. Then send each cluster to every node corresponding to a time series in that cluster, where the mapping function between time series identifiers and nodes is called tsToNode. For this discussion, we assume that tsToNode is one-to-one. If it is not, then a node holding the time series groups corresponding to, say, i1, i2, and i3 keeps those groups separate. At each destination node, two time series are candidates for explicit analysis if they are in the same cluster (i.e. the same grid cell) for some fraction f of the grids. There are several ways to make sending these messages less expensive: (i) never send information about grid cells that have just one member; (ii) send less than the full cluster to each node. For example, if a cluster contains ts1, ts2, ts3, and ts4, then send all of them to tsToNode(ts1), but only ts2, ts3, ts4 to tsToNode(ts2), and so on.

Parallel Mixing Step: given a threshold f and N time series.
\begin{enumerate}
\item Partition the sketch vectors, which all have length r, into groups of size k. For example, if r is 60 and k is 2, the index groups would be \{0,1\}, \{2,3\}, \{4,5\}, \ldots, \{58,59\}; we would take indexes 0 and 1 of each sketch vector and put them in the first group, so each group consists of N mini-vectors of size 2.
\item For each site s,
\begin{enumerate}
\item For each time series t, place the identifier of t in the grid cell corresponding to sketch(t)[$I_s$], where $I_s$ is the set of indexes assigned to site s.
\item Next, form a partitioning of the time series identifiers such that each member of the partitioning corresponds to a non-empty grid cell. Thus two time series t1 and t2 fall into the same member if sketch(t1)[$I_s$] maps to the same grid cell as sketch(t2)[$I_s$]. Denote the partitioning induced by this grid on site s as partitioning(s).
\item Each member p of partitioning(s) represents a set of time series. Sorting them by identifier, p can be written as $ts_{p_1}, ts_{p_2}, \ldots$.
\end{enumerate}
\item Establish a mapping between time series identifiers and sites, denoted tsToNode.
\item For each site s and each member p of partitioning(s) consisting of $ts_{p_1}, ts_{p_2}, \ldots, ts_{p_{|p|}}$, send all of p to tsToNode($ts_{p_1}$), tsToNode($ts_{p_2}$), \ldots, tsToNode($ts_{p_{|p|}}$).
\item At each receiving node n, for every identifier $ts_i$ such that tsToNode($ts_i$) $=$ n, if $ts_i$ and $ts_j$ appear together in at least a fraction f of the received partition members (equivalently, of the grids), then they are candidates to be highly correlated.
\item Check all candidate pairs explicitly using Pearson correlation or cosine similarity.
\end{enumerate}
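The final check (step 6) is a direct computation over the current windows of the surviving pairs. A minimal Python sketch follows, assuming numpy; the threshold of 0.9 and the toy windows are hypothetical.

\begin{verbatim}
import numpy as np

def verify_candidates(windows, pairs, threshold=0.9):
    """Keep candidate pairs whose exact Pearson correlation clears threshold."""
    confirmed = []
    for a, b in pairs:
        r = np.corrcoef(windows[a], windows[b])[0, 1]  # exact Pearson
        if r >= threshold:
            confirmed.append((a, b, r))
    return confirmed

# Hypothetical current windows for the candidate pair found above.
windows = {"ts1": np.array([1.0, 2.0, 3.0, 4.0]),
           "ts2": np.array([1.1, 2.0, 2.9, 4.2])}
print(verify_candidates(windows, [("ts1", "ts2")]))
# [('ts1', 'ts2', 0.995...)]
\end{verbatim}

Only the (typically few) candidate pairs that pass the sketch filter incur this exact computation.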