Title: Efficient Parallel Methods to Identify (Most of the) Similar Time Series Pairs Across Sliding Windows

Abstract: Consider the problem of finding the closest pairs of time series over a time window and then sliding that window to find the closest pairs over multiple windows, where each successive window starts only a little after the previous one. Doing this efficiently and in parallel could help in many applications involving sensor, financial, or communications data, to name a few. We have developed a parallel incremental sketching approach for this problem and compared it with the state-of-the-art nearest neighbor method ISAX \cite{}. Whereas ISAX achieves 100\% recall and precision, the sketching approach is 1000 times faster and achieves 95\% recall and 100\% precision on real and simulated data. For some applications this 1000-fold speedup may be worth the minor reduction in recall.

\section{Algorithmic Contribution}

Our basic approach to approximate similarity (whether Euclidean distance or Pearson correlation, for starters) is to compute the dot product of each time series over a window of size $w$ with a set of random vectors. That is, for each time series $t_i$, window size $w$, and time period $k \,..\, k+w-1$, we compute the dot product of $t_i[k..k+w-1]$ with, say, $r$ random $-1/+1$ vectors of size $w$. The $r$ dot products thus computed constitute the ``sketch'' of $t_i$ at time period $k$. Next we compare the sketches of the various time series to see which ones are close in sketch space (if $w \gg r$, which is often the case, this is cheaper than working directly on the time series) and then identify those close pairs.

In the case of sliding windows, we want to find the most similar time series pairs at jumps of a basic window $b$, e.g. $0 \,..\, w-1$, $b \,..\, b+w-1$, $2b \,..\, 2b+w-1, \ldots$, where $b \ll w$. There are several points here that require discussion:
\begin{enumerate}
\item If we compute the sketches from scratch at each basic window, we are doing redundant computation. We call that the naive method. We can do better.
\item When scaling this to a parallel system, we want to reduce communication costs as much as possible. We need to develop good strategies for this.
\end{enumerate}

\subsection{Incremental computation of sketches}

To explain the incremental algorithm, consider a small example. The sketch entry for random vector $v_1$ is the dot product of the time series with $v_1$, i.e. $1 \cdot (-1) + 2 \cdot 1 + \ldots + 4 \cdot (-1) = 4$. Now if the basic window is of size 2, then we simply take the next two points of the time series (in this case having values 2 and 1) and generate two more random $\pm 1$ numbers for each random vector (in this case $-1, -1$ for $v_1$ and $-1, 1$ for $v_2$). To update the dot product for $v_1$, all we need to do is subtract the contribution of the two oldest points, $1 \cdot (-1) + 2 \cdot 1 = 1$, and add the contribution of the two new points, $2 \cdot (-1) + 1 \cdot (-1) = -3$, yielding a new sketch entry of $4 - 1 + (-3) = 0$.

In general, the pseudo-code is this:
\begin{enumerate}
\item Partition the time series among the parallel sites. Replicate $r$ random $+1/-1$ vectors, each of size $w$, to all sites. These random vectors will later be extended in a replicated fashion.
\item For each site,
\begin{enumerate}
\item Initially, take the first $w$ data points of each time series at that site and form the dot product with all $r$ random vectors. So each time series $t$ will be represented by $r$ dot products. They constitute sketch($t$).
\item While data comes in: when the data for the $i$th basic window of size $b$ has appeared for all time series, extend each random vector by a new random $+1/-1$ vector of size $b$. Then for each time series $t$ and random vector $v$, update the dot product of $t$ with $v$ by subtracting $v[(i-1)b \,..\, ib-1] \cdot t[(i-1)b \,..\, ib-1]$ (the contribution of the oldest $b$ points) and adding $v[w+(i-1)b \,..\, w+ib-1] \cdot t[w+(i-1)b \,..\, w+ib-1]$ (the contribution of the newest $b$ points). Update sketch($t$) with the new dot products.
\end{enumerate}
\end{enumerate}
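To make the update concrete, here is a minimal single-site sketch in Python, assuming numpy. The class and method names are ours, purely illustrative. Rather than letting the random vectors grow without bound, we keep only the length-$w$ slice of each vector that overlaps the current window; this is equivalent to the growing-vector description above.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

class IncrementalSketch:
    """Maintain sketch(t) = (current window of t) . (r random +/-1 vectors)."""
    def __init__(self, first_window, r):
        # first_window: (n_series, w) array holding the first w points
        n, w = first_window.shape
        self.vectors = rng.choice([-1.0, 1.0], size=(r, w))  # live slices
        self.window = first_window.astype(float)
        self.sketch = self.window @ self.vectors.T           # (n_series, r)

    def advance(self, new_points):
        # new_points: (n_series, b) values of the next basic window
        b = new_points.shape[1]
        ext = rng.choice([-1.0, 1.0], size=(self.vectors.shape[0], b))
        # Subtract the contribution of the b oldest points ...
        self.sketch -= self.window[:, :b] @ self.vectors[:, :b].T
        # ... and add the contribution of the b newest points.
        self.sketch += new_points @ ext.T
        self.window = np.hstack([self.window[:, b:], new_points])
        self.vectors = np.hstack([self.vectors[:, b:], ext])
        return self.sketch
\end{verbatim}

Each call to advance costs $O(n \cdot r \cdot b)$ arithmetic per site, versus $O(n \cdot r \cdot w)$ for the naive from-scratch recomputation.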
\subsection{Parallel Mixing}

Once the sketch vectors have been constructed incrementally, the next step is to find pairs of sketch vectors that are close to one another. Such pairs indicate that their corresponding time series may be highly correlated (or similar under some other distance metric). We first explain how this works by example and then show the pseudo-code.

As we explained above, for every time series we construct a sketch vector in parallel and incrementally. Suppose we have seven time series with sketch values:
\begin{verbatim}
sketch(ts1) = (11, 12, 23, 24, 15, 16)
sketch(ts2) = (11, 12, 13, 14, 15, 16)
sketch(ts3) = (21, 22, 13, 14, 25, 26)
sketch(ts4) = (21, 22, 13, 14, 25, 26)
sketch(ts5) = (11, 12, 33, 34, 25, 26)
sketch(ts6) = (31, 32, 33, 34, 15, 16)
sketch(ts7) = (21, 22, 33, 34, 15, 16)
\end{verbatim}
We partition these into pairs of adjacent entries. We send
\begin{verbatim}
sketch(ts1)[0 1] = (11, 12)
sketch(ts2)[0 1] = (11, 12)
sketch(ts3)[0 1] = (21, 22)
sketch(ts4)[0 1] = (21, 22)
sketch(ts5)[0 1] = (11, 12)
sketch(ts6)[0 1] = (31, 32)
sketch(ts7)[0 1] = (21, 22)
\end{verbatim}
to site 1, where these mini-vectors will be formed into a grid. We send
\begin{verbatim}
sketch(ts1)[2 3] = (23, 24)
sketch(ts2)[2 3] = (13, 14)
sketch(ts3)[2 3] = (13, 14)
sketch(ts4)[2 3] = (13, 14)
sketch(ts5)[2 3] = (33, 34)
sketch(ts6)[2 3] = (33, 34)
sketch(ts7)[2 3] = (33, 34)
\end{verbatim}
to site 2, where these will be formed into a second grid. We send
\begin{verbatim}
sketch(ts1)[4 5] = (15, 16)
sketch(ts2)[4 5] = (15, 16)
sketch(ts3)[4 5] = (25, 26)
sketch(ts4)[4 5] = (25, 26)
sketch(ts5)[4 5] = (25, 26)
sketch(ts6)[4 5] = (15, 16)
sketch(ts7)[4 5] = (15, 16)
\end{verbatim}
to site 3, where these will be formed into a third grid.

In the first grid, ts1, ts2, and ts5 map to the same grid cell; ts6 is by itself; and ts3, ts4, and ts7 all map to the same cell. Thus we have a partitioning consisting of three members, each containing time series identifiers. In the second grid we have the members \{ts2, ts3, ts4\}; \{ts1\}; and \{ts5, ts6, ts7\}. In the third we have \{ts3, ts4, ts5\} and \{ts1, ts2, ts6, ts7\}. If two time series are in the same partition member, then they are candidates for similarity.
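To make the gridding concrete, here is a small Python sketch that reproduces the three partitionings above. The uniform cell width of 10 is an illustrative assumption; in practice the cell size is a tuning parameter of the grid.

\begin{verbatim}
from collections import defaultdict

sketches = {
    "ts1": (11, 12, 23, 24, 15, 16), "ts2": (11, 12, 13, 14, 15, 16),
    "ts3": (21, 22, 13, 14, 25, 26), "ts4": (21, 22, 13, 14, 25, 26),
    "ts5": (11, 12, 33, 34, 25, 26), "ts6": (31, 32, 33, 34, 15, 16),
    "ts7": (21, 22, 33, 34, 15, 16),
}

def grid_partition(sketches, indices, cell=10):
    """Group series ids by the grid cell of their mini-sketch at indices."""
    cells = defaultdict(list)
    for ts, sk in sketches.items():
        key = tuple(sk[i] // cell for i in indices)  # coarse grid cell
        cells[key].append(ts)
    return list(cells.values())

# Site s holds indices I_s; here k = 2, so I_1 = {0,1}, I_2 = {2,3}, ...
for s, idx in enumerate([(0, 1), (2, 3), (4, 5)], start=1):
    print("site", s, grid_partition(sketches, idx))
# site 1 [['ts1', 'ts2', 'ts5'], ['ts3', 'ts4', 'ts7'], ['ts6']]
# site 2 [['ts1'], ['ts2', 'ts3', 'ts4'], ['ts5', 'ts6', 'ts7']]
# site 3 [['ts1', 'ts2', 'ts6', 'ts7'], ['ts3', 'ts4', 'ts5']]
\end{verbatim}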
Now we construct a mapping tsToNode that maps time series identifiers to nodes (i.e. computational sites). For the sake of simplicity in this example, let us say it is the identity function. So we send the partition member \{ts1, ts2, ts5\} to nodes 1, 2, and 5; we send \{ts3, ts4, ts7\} to nodes 3, 4, and 7; and so on. Altogether:
\begin{itemize}
\item We send \{ts1, ts2, ts5\}; \{ts1\}; and \{ts1, ts2, ts6, ts7\} to node 1.
\item We send \{ts1, ts2, ts5\}; \{ts2, ts3, ts4\}; and \{ts1, ts2, ts6, ts7\} to node 2.
\item We send \{ts3, ts4, ts7\}; \{ts2, ts3, ts4\}; and \{ts3, ts4, ts5\} to node 3.
\item We send \{ts3, ts4, ts7\}; \{ts2, ts3, ts4\}; and \{ts3, ts4, ts5\} to node 4.
\item We send \{ts1, ts2, ts5\}; \{ts5, ts6, ts7\}; and \{ts3, ts4, ts5\} to node 5.
\item We send \{ts6\}; \{ts5, ts6, ts7\}; and \{ts1, ts2, ts6, ts7\} to node 6.
\item We send \{ts3, ts4, ts7\}; \{ts5, ts6, ts7\}; and \{ts1, ts2, ts6, ts7\} to node 7.
\end{itemize}
Let us say we require that 2/3 of the grids put two time series in the same grid cell before we are willing to check that pair in detail. We call this fraction the f factor. Each node takes care of those time series that map to it. So, for example, node 1 shows that ts1 and ts2 satisfy the requirement (they share a cell in the first and third grids). Node 2 shows nothing new concerning ts2. Node 3 shows that ts3 and ts4 satisfy the requirement (they share a cell in all three grids). Node 4 shows nothing new concerning ts4. Node 5 shows nothing new concerning ts5. Node 6 shows that ts6 and ts7 satisfy the requirement (they share a cell in the second and third grids). Node 7 shows nothing new. All pairs that satisfy the requirement can then be tested for direct correlation. If instead we demand that all grids agree (i.e. f = 1), the only non-trivial cluster we get is \{ts3, ts4\}, found at node tsToNode(ts3); all other clusters are singletons. In general, for f less than 1, we require that at least a fraction f of the partition members received at some node contain both time series i and i' before declaring that i and i' might be correlated.
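The per-node counting can be sketched in Python as follows. The function name is ours, and we assume each node receives, from every grid, the partition member containing its home time series.

\begin{verbatim}
from collections import Counter

def candidates_at_node(home_ts, received, n_grids, f):
    """Pairs (home_ts, other) co-occurring in >= f of the n_grids grids."""
    counts = Counter()
    for member in received:            # one partition member per grid
        if home_ts in member:
            for other in member:
                if other != home_ts:
                    counts[other] += 1
    return [(home_ts, o) for o, c in counts.items() if c / n_grids >= f]

# Node 1 receives the members containing ts1, one from each of the 3 grids:
received = [["ts1", "ts2", "ts5"], ["ts1"], ["ts1", "ts2", "ts6", "ts7"]]
print(candidates_at_node("ts1", received, n_grids=3, f=2/3))
# [('ts1', 'ts2')]  -- ts1 and ts2 share a cell in 2 of the 3 grids
\end{verbatim}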
Generalizing from this example, here is the intuitive algorithm. Partition each sketch vector into subsequences and send the first subsequence of every sketch to one site, the second subsequence to a second site, and so on. Each site computes a grid and places time series identifiers in grid cells. We estimate that two time series are close if they fall into the same grid cell in a fraction f of the grids. To determine this, we construct ``candidate clusters of time series'' based on each grid (i.e. each grid cell represents a candidate cluster) and then count the candidates in a distributed way.

Here are the details. Construct the candidate clusters of time series at each site. Then send each cluster to every node corresponding to a time series in that cluster, where the mapping function between time series identifiers and nodes is called tsToNode. For this discussion, we assume that tsToNode is one-to-one. If it is not, then a node holding the time series groups corresponding to, say, i1, i2, and i3 keeps those groups separate. At each destination node, two time series are candidates for explicit analysis if they are in the same cluster (i.e. the same grid cell) for some fraction f of the grids. There are several ways to make sending these messages less expensive: (i) never send information about grid cells that have just one member; (ii) send less than the full cluster to each node. For example, if a cluster contains ts1, ts2, ts3, and ts4, then send all of them to tsToNode(ts1), but only ts2, ts3, ts4 to tsToNode(ts2), and so on.

Parallel Mixing Step: given a threshold f and N time series.
\begin{enumerate}
\item Partition the sketch vectors, which all have length r, into groups of size k. For example, if r is 60 and k is 2, the index groups would be \{0,1\}, \{2,3\}, \{4,5\}, \ldots, \{58,59\}; we would take indexes 0 and 1 of each sketch vector and put them in the first group, so each group consists of N mini-vectors of size 2.
\item For each site s,
\begin{enumerate}
\item For each time series t, place the identifier of t in the grid cell corresponding to sketch(t)[$I_s$], where $I_s$ is the set of indexes assigned to site s.
\item Next, form a partitioning of the time series identifiers such that each member of the partitioning corresponds to a non-empty grid cell. Thus two time series t1 and t2 fall into the same member if sketch(t1)[$I_s$] maps to the same grid cell as sketch(t2)[$I_s$]. Denote the partitioning induced by this grid on site s as partitioning(s).
\item Each member p of partitioning(s) represents a set of time series. Sorting them by identifier, p can be written as $ts_{p_1}, ts_{p_2}, \ldots$.
\end{enumerate}
\item Establish a mapping between time series identifiers and sites, denoted tsToNode.
\item For each site s and each member p of partitioning(s) consisting of $ts_{p_1}, ts_{p_2}, \ldots, ts_{p_{|p|}}$, send all of p to tsToNode($ts_{p_1}$), tsToNode($ts_{p_2}$), \ldots, tsToNode($ts_{p_{|p|}}$).
\item At each receiving node n, for every identifier $ts_i$ such that tsToNode($ts_i$) $=$ n, if $ts_i$ and $ts_j$ appear together in at least a fraction f of the received partition members (equivalently, of the grids), then they are candidates to be highly correlated.
\item Check all candidate pairs explicitly using Pearson correlation or cosine similarity.
\end{enumerate}
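The final check (step 6) is a direct computation over the current windows of the surviving pairs. A minimal Python sketch follows, assuming numpy; the threshold of 0.9 and the toy windows are hypothetical.

\begin{verbatim}
import numpy as np

def verify_candidates(windows, pairs, threshold=0.9):
    """Keep candidate pairs whose exact Pearson correlation clears threshold."""
    confirmed = []
    for a, b in pairs:
        r = np.corrcoef(windows[a], windows[b])[0, 1]  # exact Pearson
        if r >= threshold:
            confirmed.append((a, b, r))
    return confirmed

# Hypothetical current windows for the candidate pair found above.
windows = {"ts1": np.array([1.0, 2.0, 3.0, 4.0]),
           "ts2": np.array([1.1, 2.0, 2.9, 4.2])}
print(verify_candidates(windows, [("ts1", "ts2")]))
# [('ts1', 'ts2', 0.995...)]
\end{verbatim}

Only the (typically few) candidate pairs that pass the sketch filter incur this exact computation.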