# Locality Sensitive Hashing (LSH)
### Overview
> **Locality-sensitive hashing** (**LSH**) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. (The number of buckets is much smaller than the universe of possible input items.) Since similar items end up in the same buckets, this technique can be used for [data clustering](https://en.wikipedia.org/wiki/Cluster_analysis "Cluster analysis") and [nearest neighbor search](https://en.wikipedia.org/wiki/Nearest_neighbor_search "Nearest neighbor search"). It differs from [conventional hashing techniques](https://en.wikipedia.org/wiki/Hash_function "Hash function") in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to [reduce the dimensionality](https://en.wikipedia.org/wiki/Dimension_reduction "Dimension reduction") of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
If I put this into my own words, and to help make this concrete, imagine that we have the following:
* $N$ data points that live in a high-dimensional space of dimension $m$
* If we want to determine how close each point is to every other point, that requires on the order of $N^2$ pairwise distance calculations
* So we are dealing with $O(N^2)$ algorithmic complexity at this point
* The idea of locality sensitive hashing is to avoid the need for this pairwise comparison!
* The general approach is to take each of our $N$ data points and hash it in a very particular way, one that *maximizes* the number of collisions, with the probability of a collision being higher the closer two points are in the $m$ dimensional space. The end result is that data points that were close by (*neighbors*) in $m$ dimensional space end up grouped together in the same bucket.
* Hence, with only $O(N)$ hash computations we have managed to group our data points by some measure of closeness (a minimal sketch of this idea is given below)
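Here is a minimal sketch of this idea using random-hyperplane (sign) LSH for cosine similarity. The data, dimensionality, and number of hyperplanes below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

N, m = 1000, 64   # N points living in an m dimensional space (arbitrary sizes)
k = 8             # number of random hyperplanes -> up to 2^k buckets
X = rng.normal(size=(N, m))

# One random hyperplane per hash bit: the sign of the projection onto a random
# direction tells us which side of that hyperplane the point falls on.
R = rng.normal(size=(m, k))

def lsh_bucket(x, R):
    """Hash a single point to a bucket id by concatenating its k sign bits."""
    bits = (x @ R) >= 0
    return int("".join("1" if b else "0" for b in bits), 2)

# O(N): one hash per point, no pairwise distance computations anywhere.
buckets = {}
for i, x in enumerate(X):
    buckets.setdefault(lsh_bucket(x, R), []).append(i)

# Points sharing a bucket are likely (not guaranteed) to be close in angle.
print(f"{len(buckets)} non-empty buckets for {N} points")
```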
**Drawbacks**
There is one main drawback that should be addressed: there is no notion of similarity *between buckets*. We know that the points within a bucket have a high probability of being similar. However, there is no concept of similarity, adjacency, or closeness of one bucket relative to another.
**This is cool!**
I should make this clear: this is a very cool and surprising result. Consider the information theoretic perspective:
* The bucket/cluster that each point ends up in depends directly on its relationship with the other points (i.e. your neighbors are exactly the set of points that are close to you)
* However, we don't make use of that information here! We simply hash the data point and it ends up in a bucket with similar points
* It is as though, informationally speaking, the hash function contains information about which group a point belongs to! For instance, imagine that the hash function was an all-knowing oracle that simply had a table (containing information) with two columns: one containing every data point, and the other containing that point's neighbors. Then of course it could just read off the group of any given point!
However, our hash function of course doesn't have this type of table, so how in the world does this work??
### Types of LSH
See this great post [4 Pictures that Explain LSH - Locality Sensitive Hashing Tutorial - Randorithms](https://randorithms.com/2019/09/19/Visual-LSH.html). His pictures provide a nice visual overview of the main LSH families.
### Compare Angular LSH to Euclidean LSH
1. **Euclidean Distance (L2 norm) LSH:**
- For Euclidean distance, LSH typically involves projecting high-dimensional data onto a lower-dimensional subspace. The classical approach is to use p-stable distributions, where 'p' refers to the norm (with p=2 for Euclidean). The basic idea is to hash data points so that the probability of collision is higher for points that are closer together.
- **Bucketing via Rounding:** One common method involves rounding the projected points to a grid. This means that points that fall into the same grid cell after rounding are hashed into the same bucket. This approach approximates the Euclidean distance because points that are close together are more likely to fall into the same cell after rounding.
2. **Cosine Similarity:**
- The Random Projection method for cosine similarity involves projecting data onto a randomly chosen direction (equivalently, splitting the space with a random hyperplane). Here, the focus is on the angle between the vectors.
- **Bucketing via Hyperplane Side:** In this approach, points are hashed based on which side of the hyperplane they fall on. Points with a small angle between them (i.e. high cosine similarity) are likely to fall on the same side of a random hyperplane, and are therefore likely to be hashed to the same bucket.
The key difference lies in how **closeness** is *defined* and hashed in these two approaches. In the Euclidean distance LSH, "closeness" is based on the magnitude of the difference between points, whereas in the cosine similarity LSH, it is based on the angle between vectors. The former is more sensitive to the absolute differences in dimensions, while the latter is more about the relative orientation of the data points in the vector space.
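As a rough sketch (not a reference implementation), the two hash functions might look like the following; the dimensionality and the grid width $w$ are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 32   # input dimensionality (arbitrary)

# Euclidean (L2) LSH: project onto a random direction, then round to a grid.
# h(x) = floor((a . x + b) / w), with a drawn from a normal (2-stable)
# distribution, b ~ Uniform[0, w), and w the grid width.
a = rng.normal(size=m)
w = 4.0
b = rng.uniform(0, w)

def euclidean_hash(x):
    return int(np.floor((a @ x + b) / w))

# Angular (cosine) LSH: hash by which side of a random hyperplane x falls on.
# h(x) = sign(r . x); only the direction of x matters, not its magnitude.
r = rng.normal(size=m)

def angular_hash(x):
    return int(r @ x >= 0)

x = rng.normal(size=m)
print(euclidean_hash(x), angular_hash(x))
print(euclidean_hash(2 * x), angular_hash(2 * x))  # rescaling typically changes
                                                   # the L2 hash, never the angular one
```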
### LSH Kernel
This is directly needed to understand [Race Sketches](Race%20Sketches.md).
An LSH kernel, or Locality-Sensitive Hashing kernel, combines two ideas: kernel methods, which are powerful tools in machine learning for dealing with high-dimensional data, and Locality-Sensitive Hashing, a technique for dimensionality reduction and efficient approximate nearest neighbor search.
Kernel methods typically involve computing a similarity measure, known as a kernel function, between pairs of input data points. The kernel function implicitly maps the input data into a high-dimensional feature space without ever computing the coordinates in that space explicitly. This is known as the "kernel trick," and it allows for efficient computation of similarities in the high-dimensional space.
The LSH kernel is defined such that the kernel value between two points is equivalent to the probability that these points collide in the LSH scheme. In other words, it measures the similarity between two points based on the idea that similar items are likely to be hashed to the same bucket.
To construct an LSH kernel, the process can be thought of as follows:
1. **LSH Family Selection**: First, choose an LSH family appropriate for the similarity measure you're interested in preserving. For example, if you're interested in angular similarity, you might choose a hash function family that is sensitive to the angle between vectors.
2. **Projection and Hashing**: Map the data points into a lower-dimensional space using the selected LSH functions. The idea is that if two points are close together (according to the chosen similarity measure), they will have a high probability of being mapped to the same hash value.
3. **Kernel Definition**: Define the kernel function in terms of the LSH functions. The LSH kernel between two data points $x$ and $y$ can be estimated by the fraction of hash functions for which $x$ and $y$ have the same hash value. Formally, the kernel function $K$ can be defined as: $K(x, y)= P[h(x) = h(y)]$ where $h$ is a hash function from the chosen LSH family, and the probability is estimated by using multiple hash functions from the family.
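A minimal sketch of this estimate, using sign (random hyperplane) hashes as the LSH family; the number of hash functions $L$ is arbitrary, and the closed-form collision probability $1 - \theta/\pi$ for sign LSH is printed alongside as a sanity check:

```python
import numpy as np

rng = np.random.default_rng(2)
m, L = 16, 2000   # dimensionality and number of hash functions (arbitrary)

# L independent sign-hash functions (random hyperplane / angular LSH).
R = rng.normal(size=(m, L))

def lsh_kernel(x, y, R):
    """Estimate K(x, y) = P[h(x) = h(y)] as the fraction of hash functions
    on which x and y collide."""
    hx = (x @ R) >= 0
    hy = (y @ R) >= 0
    return np.mean(hx == hy)

x, y = rng.normal(size=m), rng.normal(size=m)

# For sign LSH the collision probability has a known closed form:
# P[h(x) = h(y)] = 1 - theta(x, y) / pi, where theta is the angle between x and y.
cos_xy = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(np.clip(cos_xy, -1.0, 1.0))
print(lsh_kernel(x, y, R), 1 - theta / np.pi)
```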
This kernel function can be used in any machine learning algorithm that utilizes kernel methods, such as Support Vector Machines (SVM) or kernel PCA, enabling these methods to be applied efficiently in cases where explicit computation in the high-dimensional feature space would be computationally infeasible.
The LSH kernel thus serves as a bridge between the efficiency of LSH for high-dimensional data and the effectiveness of kernel methods for non-linear pattern discovery. It enables the approximation of kernel values in large-scale and high-dimensional datasets, where traditional kernel methods would struggle both computationally and in terms of memory requirements.
### A Note on the term "Projection"
In the context of Random Sign Projection used in Locality-Sensitive Hashing (LSH), the term "projection" is used in a somewhat looser, more intuitive sense than its strict definition in linear algebra. Let's break this down:
1. **The Context of LSH**: In LSH, the goal is to hash input items so that similar items are more likely to be hashed to the same bucket. This is useful in high-dimensional spaces for approximate nearest neighbor searches, as it reduces the computational complexity.
2. **Random Sign Projection**: This technique involves projecting high-dimensional data onto a lower-dimensional subspace using a random matrix (often with entries drawn from a normal distribution). The "projection" here is a reduction in dimensionality, achieved by multiplying the high-dimensional data vectors with a random matrix.
3. **Use of "Projection" Term**: In this context, "projection" refers to the transformation of data from a high-dimensional space to a lower-dimensional space. The term is used more for its intuitive sense of "casting" or "mapping" data onto a new space, rather than its strict mathematical definition. It's a form of dimensionality reduction, not an orthogonal or oblique projection in the linear algebraic sense.
4. **Why Not a Projection Matrix**: The random matrix used in Random Sign Projection does not have the properties of a projection matrix (it's neither idempotent nor necessarily symmetric). Its purpose is not to project data onto a subspace in the way a projection matrix in linear algebra would, but rather to reduce dimensionality while preserving relative distances (in a probabilistic sense) between data points.
In summary, the use of the term "projection" in Random Sign Projection LSH is more about the intuitive idea of reducing dimensions and less about the strict mathematical properties of projection matrices. It's a common occurrence in different fields to use mathematical terms in ways that align with their general intuitive sense rather than their precise definitions.
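To make this concrete, here is a small sketch (with arbitrary dimensions) showing that the "projection" is simply multiplication by a random Gaussian matrix followed by taking signs, and that relative distances are preserved only in a probabilistic sense:

```python
import numpy as np

rng = np.random.default_rng(3)
m, k = 128, 64   # original and reduced dimensions (arbitrary)

# R is just a random Gaussian matrix; it is not idempotent or symmetric, so it
# is not a projection matrix in the strict linear-algebra sense.
R = rng.normal(size=(m, k))

def sign_code(x):
    """Map an m-dimensional point to a k-bit code."""
    return (x @ R) >= 0

x = rng.normal(size=m)
y = x + 0.1 * rng.normal(size=m)   # a near neighbor of x
z = rng.normal(size=m)             # an unrelated point

# Hamming distance between the codes tracks the angle between the original
# vectors, which is the probabilistic sense in which distances are preserved.
print(np.sum(sign_code(x) != sign_code(y)))   # small for the near neighbor
print(np.sum(sign_code(x) != sign_code(z)))   # around k/2 for the unrelated point
```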
### Learned Metrics
[Fast Similarity Search for Learned Metrics](https://www.cs.utexas.edu/~grauman/research/projects/hash/SemisupervisedHashing.html)
[cs.utexas.edu/~pjain/pubs/hashing_tr.pdf](https://www.cs.utexas.edu/~pjain/pubs/hashing_tr.pdf)
---
Date: 20210813
Links to: [Computer Science MOC](Computer%20Science%20MOC.md) [Algorithms and Data Structures](Algorithms%20and%20Data%20Structures.md)
Tags:
References:
* [Fantastic Blog Post](http://tylerneylon.com/a/lsh1/)
* [Wiki](https://en.wikipedia.org/wiki/Locality-sensitive_hashing)
* Notability, Computer Science Section
* [4 Pictures that Explain LSH - Locality Sensitive Hashing Tutorial - Randorithms](https://randorithms.com/2019/09/19/Visual-LSH.html)
* [Implementing LSH Functions in Tensorflow - Randorithms](https://randorithms.com/2022/02/11/tensorflow-lsh-functions.html)
* [Fast Similarity Search for Learned Metrics](https://www.cs.utexas.edu/~grauman/research/projects/hash/SemisupervisedHashing.html)