UMAP - Nate's Notes

# UMAP ### Intuition * Let us start with a data set consisting of $N$ observations of $D$ dimensions. UMAP assumes that these observations came from some **manifold** and that they were uniformly distributed on that manifold * UMAP starts by taking this $D$ dimensional dataset and constructing a high dimensional **weighted graph**. This can be thought of as a *weighted distance matrix* of sorts. Regardless a key idea here is that: * We can use UMAP on a discrete graph of nodes (because there is clearly a way to make this a weighted graph, seeing as it is a graph after all) * We can also use UMAP on our $NxD$ points, since we can compute a weighted distance matrix. * This idea of mapping from our mathematical structure (input graph or input dataset) to another structure (in this case it is likely higher dimensional, $N$ dimensional instead of $D$ dimensional in the data set case), is known as an **[embedding](Projection%20vs%20Embedding.md)**. * UMAP then takes this high dimensional embedding and **projects** it down into a low dimensional space, preserving as much structure as possible. ### Breakdown UMAP can be understood cleanly in two steps: 1. **Embedding Step**: Construct a high dimensional representation of our graph. In other words, we are embedding our discrete graph structure into a continuous, high dimensional space. Note: I should make it clear that our data need not be a graph (discrete)! It could just as easily be $n$ observations from some $d$ dimensional space. We can pass this in to UMAP and it will again just try and construct a high 2. **Projection Step**: Project from our high dimensional space down into a lower dimensional space, say 2 dimensions. UMAP intuition: > I think that is the right view. UMAP is based on the metric and topological structure of the data. It's position or configuration in space doesn't really matter; it is the inter-relationships among the data samples that matter. ### UMAP Theory v1 > UMAP, at its core, works very similarly to t-SNE - both use graph layout algorithms to arrange data in low-dimensional space. In the simplest sense, UMAP constructs a high dimensional graph representation of the data then optimizes a low-dimensional graph to be as structurally similar as possible. While the mathematics UMAP uses to construct the high-dimensional graph is advanced, the intuition behind them is remarkably simple. > > In order to construct the initial high-dimensional graph, UMAP builds something called a "fuzzy simplicial complex". This is really just a representation of a weighted graph, with edge weights representing the likelihood that two points are connected. To determine connectedness, UMAP extends a radius outwards from each point, connecting points when those radii overlap. Choosing this radius is critical - too small a choice will lead to small, isolated clusters, while too large a choice will connect everything together. UMAP overcomes this challenge by choosing a radius locally, based on the distance to each point's $n$th nearest neighbor. UMAP then makes the graph "fuzzy" by decreasing the likelihood of connection as the radius grows. Finally, by stipulating that each point must be connected to at least its closest neighbor, UMAP ensures that local structure is preserved in balance with global structure. > > Once the high-dimensional graph is constructed, UMAP optimizes the layout of a low-dimensional analogue to be as similar as possible. This process is essentially the same as in t-SNE, but using a few clever tricks to speed up the process. > > The key to effectively using UMAP lies in understanding the construction of the initial, high-dimensional graph. Though the ideas behind the process are very intuitive, the algorithm relies on some advanced mathematics to give strong theoretical guarantees about how well this graph actually represents the data. > > See more [here](https://pair-code.github.io/understanding-umap/index.html). ### UMAP Theory v2 > In its simplest sense, the UMAP algorithm consists of two steps: construction of a graph in high dimensions followed by an optimization step to find the most similar graph in lower dimensions. > > UMAP essentially constructs a weighted graph from the high dimensional data, with edge strength representing how “close” a given point is to another, then projects this graph down to a lower dimensionality. > > See more [here](https://pair-code.github.io/understanding-umap/supplement.html). ### UMAP Implementation It looks like, based on the medium article referenced below, that one of the first things that is done by umap is building a distance matrix. If this is normalized we can view it as a weighted graph. This is exactly what was communicated when reading the umap theory above: > "UMAP essentially constructs a weighted graph from the high dimensional data" See: * [Good medium article](https://towardsdatascience.com/how-to-program-umap-from-scratch-e6eff67f55fe) * [My personal implementation](https://github.com/NathanielDake/intuitiveml/blob/master/notebooks/unsupervised%20learning/umap.ipynb) ### Bryce explanation UMAP works by creating a graphical representation of the high dimensional data, then projecting it to a lower dimensional graph (usually initializing with a spectral embedding), then minimizing cross-entropy to attempt to map the structure or 'nearness' as closely as possible. The notion of 'nearness' used is essentially a truncated version of what TSNE does which says the nearness between point i and point j is 'What is the probability that point i is a neighbor of point j? (restricted to the k nearest points)'. --- Date: 20210518 Links to: [Projection vs Embedding](Projection%20vs%20Embedding.md) References: * [UMAP Intro](https://pair-code.github.io/understanding-umap/) * [UMAP Uniform Manifold Approximation and Projection for Dimension Reduction | SciPy 2018](https://www.youtube.com/watch?v=nq6iPZVUxZU&t=0s) * [My notebook](https://github.com/NathanielDake/intuitiveml/blob/master/notebooks/unsupervised%20learning/umap.ipynb) * [Answer from creator to github question about distance matrix](https://github.com/lmcinnes/umap/issues/428) * [Quora answer](https://qr.ae/pGr9EY()