# SSH Similarity Search Algorithm
* We start with certain points (think rows) in a high dimensional space, where the dimensions have a time component. So these could be *nodes* represented by their dart time series vectors (say one dart per hour for the past $n = 40$ days)
* We want to find points (e.g. nodes) that are *similar* to each other. If we simply have 2000 nodes and each node is represented by $24 * 40 = 960$ darts in time, we could compute a distance matrix between the 2000 nodes and then run a clustering or similarity search over it.
* However, in a time series context this wouldn't be very robust! For instance, imagine that nodes $A$ and $B$ have identical time series shifted by a single hour: that shift alone could leave them far apart in Euclidean space.
* So we want a way to handle the specific time series nature of our problem (accounting for shifts along the time dimension)
* [This paper](https://arxiv.org/pdf/1610.07328.pdf), SSH (Sketch, Shingle, & Hash), proposes the following:


* So this is basically:
1. Take a **sliding dot product** (effectively a convolution) of some random (Gaussian) noise vector with your time series, and take the sign of each resulting dot product. Doing this at every position across the entire time series yields a bit vector of 0's and 1's (this is our **sketch**) for each *row* (e.g. node, our points). So the result looks like: `[0,1,1,0,1,0,0,1]`
2. We then **shingle** this vector, effectively taking n-grams (we can specify the n-gram length, e.g. 2): `[(0,1), (1,1), (1,0), (0,1), (1,0), (0,0), (0,1)]` (see the code sketch after this list)
3. At this point we have *sparse* vectors: each row is a set of shingles, weighted by how often each shingle occurs. We would like to convert these into *dense* vectors, which we call **signatures**. We do this via weighted minhashing.
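A minimal sketch of steps 1 and 2 (NumPy assumed). The window length, stride, and shingle length are illustrative choices, not values from the paper; the key detail is that the *same* random filter must be shared across all rows so their sketches are comparable:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ssh_sketch(series: np.ndarray, filt: np.ndarray, stride: int = 1) -> np.ndarray:
    """Step 1: sign of the sliding dot product with a random filter -> bit vector."""
    segments = sliding_window_view(series, len(filt))[::stride]
    return (segments @ filt >= 0).astype(np.int8)

def shingle(bits: np.ndarray, n: int = 2) -> dict:
    """Step 2: overlapping n-grams of the bit vector, with occurrence counts."""
    counts = {}
    for i in range(len(bits) - n + 1):
        gram = tuple(int(b) for b in bits[i : i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return counts

filt = np.random.default_rng(0).standard_normal(8)  # one filter shared by all rows
t = np.arange(24 * 40)                              # 960 hourly points
a = np.sin(2 * np.pi * t / 24)                      # daily cycle
b = np.roll(a, 1)                                   # same series, shifted one hour
print(shingle(ssh_sketch(a, filt)))                 # counts barely differ from b's
print(shingle(ssh_sketch(b, filt)))
```

Because a one-hour shift only shifts the bit vector, the two rows end up with nearly identical shingle counts, which is exactly the robustness we wanted.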
[Minhashing](https://ekzhu.com/datasketch/minhash.html) lets you estimate the Jaccard similarity between sets. [Weighted MinHash](https://ekzhu.com/datasketch/weightedminhash.html) can be used to estimate the *weighted Jaccard similarity* between weighted sets. Both are *sketches*: if we have sets 1 and 2, each set gets its own MinHash, and we can then *compare* the resulting MinHashes to estimate their overlap.
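For instance, with `datasketch` (documented at the links above), the shingle counts can be fed to a `WeightedMinHashGenerator`. The bit-pattern-to-index mapping below is just one convenient way to lay the $2^n$ possible bit shingles out as a fixed-length weight vector:

```python
import numpy as np
from datasketch import WeightedMinHashGenerator

# Toy shingle-count dicts, as produced by the shingling step above.
counts_a = {(0, 1): 5, (1, 1): 3, (1, 0): 5, (0, 0): 2}
counts_b = {(0, 1): 4, (1, 1): 4, (1, 0): 5, (0, 0): 2}

def to_weight_vector(counts: dict, n: int = 2) -> np.ndarray:
    """Lay the 2**n possible bit shingles out as a fixed-length weight vector."""
    vec = np.zeros(2 ** n)
    for gram, count in counts.items():
        vec[int("".join(str(b) for b in gram), 2)] = count  # e.g. (1, 0) -> index 2
    return vec

wmg = WeightedMinHashGenerator(dim=4, sample_size=128, seed=42)
wm_a = wmg.minhash(to_weight_vector(counts_a))
wm_b = wmg.minhash(to_weight_vector(counts_b))
print(wm_a.jaccard(wm_b))  # estimated weighted Jaccard similarity of a and b
```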
We can then combine this with [MinHash LSH](https://ekzhu.com/datasketch/lsh.html):
> Suppose you have a very large collection of [sets](https://en.wikipedia.org/wiki/Set_(mathematics)). Giving a query, which is also a set, you want to find sets in your collection that have Jaccard similarities above certain threshold, and you want to do it with many other queries. To do this efficiently, you can create a MinHash for every set, and when a query comes, you compute the Jaccard similarities between the query MinHash and all the MinHash of your collection, and return the sets that satisfy your threshold.
>
> The said approach is still an O(n) algorithm, meaning the query cost increases linearly with respect to the number of sets. A popular alternative is to use Locality Sensitive Hashing (LSH) index. LSH can be used with MinHash to achieve sub-linear query cost - that is a huge improvement. The details of the algorithm can be found in [Chapter 3, Mining of Massive Datasets](http://infolab.stanford.edu/~ullman/mmds/ch3.pdf).
> This package includes the classic version of MinHash LSH. It is important to note that the query does not give you the exact result, due to the use of MinHash and LSH. There will be false positives - sets that do not satisfy your threshold but returned, and false negatives - qualifying sets that are not returned. However, the property of LSH assures that sets with higher Jaccard similarities always have higher probabilities to get returned than sets with lower similarities. Moreover, LSH can be optimized so that there can be a “jump” in probability right at the threshold, making the qualifying sets much more likely to get returned than the rest.
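A quick sketch of that workflow with `datasketch`'s `MinHash` and `MinHashLSH` (toy sets, illustrative threshold):

```python
from datasketch import MinHash, MinHashLSH

def minhash(tokens, num_perm=128):
    """Build a MinHash from a set of string tokens."""
    m = MinHash(num_perm=num_perm)
    for tok in tokens:
        m.update(tok.encode("utf8"))
    return m

sets = {"s1": {"a", "b", "c", "d"},
        "s2": {"a", "b", "c", "e"},
        "s3": {"x", "y", "z"}}

# Index every set once; queries are then sub-linear in the collection size.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for key, s in sets.items():
    lsh.insert(key, minhash(s))

# Query with a new set; s1 (Jaccard 1.0) and s2 (Jaccard 0.6) should come back.
print(lsh.query(minhash({"a", "b", "c", "d"})))
```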
The TLDR is as follows:
* We start with some points that have associated (high dimensional) time series.
* We compute sketches and shingles that have nice properties for time series (they are robust to shifting and warping)
* Each row's weighted shingle set is then given its own WeightedMinHash (its **signature**)
* These WeightedMinHashes can be used in combination with LSH to find similar rows with sub-linear query cost (see the end-to-end sketch below)
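Putting it together, a toy end-to-end sketch. It assumes `datasketch`'s `MinHashLSH` accepts WeightedMinHash signatures (with `num_perm` matching the generator's `sample_size`); the window, shingle length, and threshold are illustrative:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from datasketch import MinHashLSH, WeightedMinHashGenerator

def signature(series, filt, wmg, n=4, stride=1):
    """Series -> bit sketch -> weighted shingle vector -> WeightedMinHash."""
    bits = (sliding_window_view(series, len(filt))[::stride] @ filt >= 0).astype(int)
    vec = np.zeros(2 ** n)
    for i in range(len(bits) - n + 1):
        vec[int("".join(str(b) for b in bits[i : i + n]), 2)] += 1
    return wmg.minhash(vec)

rng = np.random.default_rng(0)
filt = rng.standard_normal(8)                          # one filter shared by all rows
wmg = WeightedMinHashGenerator(dim=16, sample_size=128, seed=1)

t = np.arange(24 * 40)                                 # 960 hourly points
nodes = {"A": np.sin(2 * np.pi * t / 24),              # daily cycle
         "B": np.roll(np.sin(2 * np.pi * t / 24), 1),  # A shifted by one hour
         "C": rng.standard_normal(t.size)}             # unrelated noise

lsh = MinHashLSH(threshold=0.9, num_perm=128)
for key, series in nodes.items():
    lsh.insert(key, signature(series, filt, wmg))

# B (the shifted copy of A) should be returned; C usually should not be.
print(lsh.query(signature(nodes["A"], filt, wmg)))
```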
---
Date: 20230414
Links to:
Tags:
References:
* [https://arxiv.org/pdf/1610.07328.pdf](https://arxiv.org/pdf/1610.07328.pdf)