# Projection vs Embedding

**Embedding**
Low dimensional to high dimensional. Or, discrete to continuous.

**Projection**
A **projection** is a mapping of a [set](https://en.wikipedia.org/wiki/Set_(mathematics) "Set (mathematics)") (or other [mathematical structure](https://en.wikipedia.org/wiki/Mathematical_structure "Mathematical structure")) into a subset (or sub-structure), which is equal to its square for [mapping composition](https://en.wikipedia.org/wiki/Function_composition "Function composition") (or, in other words, which is [idempotent](https://en.wikipedia.org/wiki/Idempotence "Idempotence")).

### Embedding

#### In Mathematics
In mathematics, an embedding is one instance of some mathematical structure contained within another instance. When some object $X$ is said to be embedded in another object $Y$, the embedding is given by some *[injective](Injective-Surjective-Bijective.md)* (one-to-one) and structure-preserving map $f: X \rightarrow Y$.

So the key idea behind an embedding is that no information should be lost when passing from $X$ to $Y$, because the map is **injective** (one-to-one). In other words, no two items (e.g. points) in $X$ should be mapped to the same item in $Y$ (note, however, that there may well be items in $Y$ that nothing in $X$ maps to).

#### In Machine Learning
In the context of machine learning, an embedding is a low-dimensional, learned, continuous vector representation of discrete variables into which you can translate high-dimensional vectors. See note *Embeddings - 04242021* in Notability.

Whenever you see the word "embedding" in machine learning, it means the authors are exploring some technique to take non-vectorized data and "embed" it into a vector space (taken from [this response](https://qr.ae/pGNs69)). Another word that may be used to mean the same thing is "vectorize".

Let's take a couple of examples to get the hang of the concept. Say I want to "embed" a graph in a vector space. How would I go about doing that? A simple way to think about this is to imagine drawing the graph on paper. That exercise forces you to take every vertex and plonk it down somewhere on the paper. So, in effect, you have constructed a mapping from the set of vertices $V$ of a graph $G$ to pairs of real numbers (the $(x, y)$ coordinates of the points with respect to some arbitrary coordinate system). In other words, you have constructed an embedding of the graph.

Similarly, for word embeddings, suppose I want to know what the word "Democrat" means. One way to ascribe meaning to this word is to find some way to map "Democrat" into an N-dimensional vector (say, N = 100), so that every instance of "Democrat" in a text document is replaced by that 100-dimensional vector.

The term "embedding" in machine learning actually [comes from topology](https://en.wikipedia.org/wiki/Embedding "en.wikipedia.org"), and deals with the general concept of one structure contained within another (e.g. a subgroup within a group), where these terms have precise mathematical meanings. In ML we generally talk about embeddings into metric spaces, i.e. taking an object, such as a word, item, or user, and mapping it into a metric space. Note that an embedding can, in general, also be made into a non-metric space, i.e. a space without a well-defined metric.

#### Technical Notes
* The colloquial "embedding" refers to a map which ideally _should_ be injective and _should_ be approximately structure-preserving.
* Also consider that people talk about the embedding of a particular word, when really the "embedding" should refer to the mapping rather than to the result of applying it.
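To make the ML sense concrete, here is a minimal numpy sketch of the "Democrat" example above: an embedding as a lookup table from discrete tokens to dense vectors. The vocabulary, dimension, and random vectors are made up for illustration; real embedding tables are learned (e.g. word2vec or an embedding layer in a neural network).

```python
# A minimal sketch of the ML sense of "embedding": a lookup table mapping each
# discrete token to a dense vector. Vocabulary, dimension, and vectors are
# made up here; in practice the vectors would be learned.
import numpy as np

vocab = ["Democrat", "Republican", "senator", "apple"]
embedding_dim = 100

rng = np.random.default_rng(42)
# One row per word: the whole table *is* the (finite) map f: words -> R^100.
embedding_table = rng.normal(size=(len(vocab), embedding_dim))
word_to_index = {word: i for i, word in enumerate(vocab)}

def embed(word):
    """Replace a discrete token with its 100-dimensional vector."""
    return embedding_table[word_to_index[word]]

print(embed("Democrat").shape)   # (100,)

# Injective (one-to-one): distinct words get distinct vectors, so no
# information about word identity is lost by the mapping itself.
assert not np.allclose(embed("Democrat"), embed("Republican"))
```

Note that the table itself is the map; "the embedding of Democrat" is just the result of applying it, echoing the second technical note above.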
#### Bryce Intuition
* An **embedding** is a function that takes one thing and makes it a sub-object of another thing (this is usually topological in nature).
* A **projection** is a function that reduces the dimension of something (this *needs* the notion of dimension).
* In machine learning, "embeddings" often do *both*: they take a set of points in a high-dimensional space and make them a sub-object of a lower-dimensional space. In most cases the words are used interchangeably because they have similar meanings and nobody cares enough to be careful.
* In general, I could take a bunch of points in high dimensions and specify the same number of points in lower dimensions in any way (even randomly). This is what I would call a **projection** (just a reduction of dimension). However, this may not take into consideration any of the local or global structure of the points in higher dimensions.
* An **embedding** is something that does this same process but attempts to preserve some kind of 'nearness' structure, usually either local or global.
* So an embedding, like [UMAP](UMAP.md), will not just create the same number of points in a lower dimension, it will optimize for preserving the 'topology' of the points. I.e. the high- and low-dimensional spaces have a structure (a notion of 'nearness' between points), and an embedding attempts to take the points in high dimension, *along with their structure*, and make them a sub-structure of the lower-dimensional space, *maintaining this structure*, rather than just randomly assigning each point to a point in the sub-structure.
* [UMAP has its own unique way of doing this](UMAP.md#Bryce%20explanation).

### Bryce Key explanation
* Consider the following: We have **you** $\rightarrow$ **your shadow** $\rightarrow$ **3-d world**. The first map **projects** you into 2 dimensions and preserves some info, like your outline, but loses some, like your depth. The second is an **embedding** of your 2-d shadow into the 3-d world (i.e. it is on the wall or sidewalk or whatever). It preserves all of the information that the shadow contains but just adds another dimension that the shadow doesn't use. In theory you could embed it in 1000-dimensional space and it would still have the same information.
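Here is a minimal numpy sketch of the shadow analogy; the point values and function names are made up for illustration. The projection drops the depth coordinate and cannot be undone, while the embedding places the 2-d shadow into 3-d without losing anything.

```python
# A minimal numpy sketch of the shadow analogy above (illustrative only; the
# point values and function names are made up for this example).
import numpy as np

# "You": a few points in 3-d space.
you_3d = np.array([
    [0.0, 1.0, 2.0],
    [1.0, 1.5, 0.5],
    [2.0, 0.5, 1.0],
])

def project_to_shadow(points):
    """Projection: keep only the first two coordinates, dropping depth.
    Applying it to a shadow changes nothing, but depth is lost for good."""
    return points[:, :2]

def embed_in_3d(points_2d, z=0.0):
    """Embedding: place the 2-d shadow back into 3-d (e.g. on the floor z=0).
    Injective: distinct shadow points stay distinct, nothing is lost."""
    z_col = np.full((points_2d.shape[0], 1), z)
    return np.hstack([points_2d, z_col])

shadow = project_to_shadow(you_3d)        # 3-d -> 2-d, loses depth
shadow_in_world = embed_in_3d(shadow)     # 2-d -> 3-d, loses nothing

# The shadow can be recovered exactly from its 3-d embedding...
assert np.allclose(project_to_shadow(shadow_in_world), shadow)
# ...but "you" cannot be recovered from the shadow (the depth is gone).
assert not np.allclose(shadow_in_world, you_3d)
```

The embedding is invertible on its image, which is exactly the "no information lost" property; the projection is not.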
### Bryce question answer pairs
1. **Question**: I realize now that projections don't need to preserve structure, but they often do, right? Even the projection of a person (3-d) to a shadow (2-d) preserves structure? In that case the projection (reducing dimensionality from 3 to 2) does indeed preserve structure. In this case would it be correct to say that our shadow is embedded in the 3-d world?
   **Answer**: Consider the following: We have **you** $\rightarrow$ **your shadow** $\rightarrow$ **3-d world**. The first map **projects** you into 2 dimensions and preserves some info, like your outline, but loses some, like your depth. The second is an **embedding** of your 2-d shadow into the 3-d world (i.e. it is on the wall or sidewalk or whatever). It preserves all of the information that the shadow contains but just adds another dimension that the shadow doesn't use. In theory you could embed it in 1000-dimensional space and it would still have the same information.
2. **Question**: I have seen (and believe) that when trying to reduce dimensionality, there are really only two options (see [here](https://youtu.be/nq6iPZVUxZU?t=135)): matrix factorization (PCA, word2vec, etc.) and neighbor graphs (t-SNE, spectral embeddings, UMAP, etc.). I am wondering, from your point of view, what makes these fundamentally different techniques? I guess I am asking: if we are going to reduce these techniques down into two groups, it isn't a huge leap to reduce them down further into one. For instance, spectral embeddings (neighbor graph group) seem to have a ton in common with PCA (matrix factorization group). Specifically, PCA finds the eigenvectors/values of the _covariance matrix_ of the data (then we truncate, trying to keep as much variance as we desire), while spectral embeddings find the eigenvectors/values of the _Laplacian_ matrix (before again truncating). But both are just working with eigenvectors/values of matrices. Yet they are deemed to be different approaches/groups. Can you provide a bit of intuition as to why?
   **Answer**: Almost all of these are really just matrix projections. You are right that PCA and spectral embedding are basically the same thing, except the matrix you use is different and the eigenspaces you project onto are different. t-SNE and UMAP claim to be graphical, but in the end they initialize their projection with a matrix projection and then optimize it to minimize some loss function. I don't actually know all the details of all embeddings, but most rely in some way on a matrix factorization and projection. (See the sketch after this list.)
3. **Question**: Related specifically to UMAP: the last time I looked under the hood, it looked like if we start with an `N x D` dimensional data set, where `D` is large, say `D=1000`, UMAP starts by building a high-dimensional graph representation of this dataset, specifically a weighted graph based on a distance matrix that is `N x N`. My question is, often `N >> D`, so what is it, theoretically, that makes it such a good idea to effectively _increase the dimensionality_ by building this high-dimensional graph (higher dimensional than we even started with)? Intuitively, I have this idea that reducing dimensions while preserving structure is hard, so purposefully increasing the dimensionality from `D` to `N` seems like a step in the wrong direction. (Obviously I realize that UMAP works very well; I am just trying to build intuition, likely around a lack of deep understanding of graphs and connectivity, and what a distance matrix can provide compared to a list of length `N` of `D`-dimensional vectors. Fundamentally, it seems that all of the _information_ present in our `N x D` data set must be present in the high-dimensional graph representation? Or, and maybe this is where I have been erring, is it that when we construct our high-dimensional graph we make certain assumptions that allow us to actually encode more information, allowing us to better capture the 'true' structure of the data?)
   **Answer**: UMAP doesn't actually create a full `N x N` distance matrix (this is the fact that makes it way more effective than t-SNE). It only considers the `k` nearest neighbors for each point and makes a sparse matrix with `Nk` elements in it, which represents the high-dimensional graph. It then projects onto the smallest eigenspaces and optimizes via cross entropy.
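To make the point in answer 2 concrete, here is a rough numpy-only sketch contrasting the two eigendecompositions. The synthetic data, the plain k-NN graph, and the unnormalized Laplacian are my simplifying choices for illustration, not the exact recipe of UMAP, scikit-learn, or any particular paper: PCA keeps the eigenvectors of the `D x D` covariance matrix with the largest eigenvalues, while a basic spectral embedding keeps the eigenvectors of the `N x N` graph Laplacian with the smallest nonzero eigenvalues.

```python
# Rough sketch: PCA and spectral embedding both reduce to an eigendecomposition,
# just of different matrices (covariance vs. graph Laplacian of a k-NN graph).
# Scaling, normalized vs. unnormalized Laplacian, and solvers are simplified.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # N=200 points in D=10 dimensions
n_components = 2
k = 10

# --- PCA: eigenvectors of the covariance matrix (D x D) ---
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(X) - 1)
evals, evecs = np.linalg.eigh(cov)             # eigenvalues in ascending order
pca_coords = Xc @ evecs[:, -n_components:]     # keep the LARGEST eigenvectors

# --- Spectral embedding: eigenvectors of the graph Laplacian (N x N) ---
# Build a symmetric k-nearest-neighbor adjacency matrix from pairwise distances.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
knn_idx = np.argsort(d2, axis=1)[:, 1:k + 1]   # skip self (column 0)
A = np.zeros((len(X), len(X)))
rows = np.repeat(np.arange(len(X)), k)
A[rows, knn_idx.ravel()] = 1.0
A = np.maximum(A, A.T)                         # symmetrize
L = np.diag(A.sum(axis=1)) - A                 # unnormalized graph Laplacian
evals, evecs = np.linalg.eigh(L)
spec_coords = evecs[:, 1:n_components + 1]     # keep the SMALLEST nonzero ones

print(pca_coords.shape, spec_coords.shape)     # both (200, 2)
```

Note also, per answer 3, that `A` has only on the order of `Nk` nonzero entries, so in practice (e.g. in UMAP) it is stored sparsely rather than as a dense `N x N` array.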
---
Links to: [UMAP](UMAP.md) [Graph Embeddings](Graph%20Embeddings.md) [Spectral-Embeddings](Spectral-Embeddings.md)

Tags: #review

References:
* ["Embedding" tends to be used loosely](https://stats.stackexchange.com/questions/487545/what-is-embedding-in-the-context-of-dimensionality-reduction)
* [Embedding - mathematics definition](https://en.wikipedia.org/wiki/Embedding)
* [Projection - mathematics definition](https://en.wikipedia.org/wiki/Projection_(mathematics))
* [Great Quora answer](https://qr.ae/pGNs69)
* [Another good Quora answer, even if not 100% correct](https://qr.ae/pGNwz1)
* [Another solid Quora answer](https://qr.ae/pGri4m)
* [Worthwhile thoughts on Reddit](https://www.reddit.com/r/MachineLearning/comments/6vhf8b/d_word_embeddings_are_projections/)