# Variational Autoencoder

Autoencoders and Variational Autoencoders (VAEs) are neural network architectures used for unsupervised learning tasks, particularly in the context of dimensionality reduction and generative modeling. While they share some similarities, they have fundamental differences, primarily in the way they encode information and in the flexibility and characteristics of the codes they produce.

**Autoencoders (AE):**

1. **Basic Structure:** An autoencoder consists of two primary components:
    - **Encoder:** Takes the input data and compresses it into a latent-space representation (latent variables). It captures the most relevant features of the input data.
    - **Decoder:** Tries to reconstruct the original data from the latent-space representation, ideally without loss, though typically some loss occurs.
2. **Objective:** The primary goal is to minimize the difference between the input and the reconstructed output, often using a loss function like Mean Squared Error (MSE). This forces the AE to retain as much information as possible in the latent-space representation.
3. **Usage:** Autoencoders are mainly used for dimensionality reduction, feature extraction, and data denoising. They are not explicitly designed for generating new data that wasn't in the training set.

**Variational Autoencoders (VAE):**

1. **Basic Structure:** Much like basic autoencoders, VAEs also have an encoder and a decoder. However, they introduce an element of stochasticity to the encoding. Instead of encoding an input as a single point in the latent space, they encode it as a distribution over the latent space.
2. **Objective:** The VAE's loss function has two parts:
    - **Reconstruction Loss:** Like the AE, the VAE measures how accurately the decoder reconstructs the input.
    - **KL Divergence:** This term forces the encoded distributions to approximate a target distribution, typically a standard normal distribution. It ensures that similar inputs produce nearby points in the latent space, and it regularizes the encoder's outputs, promoting generative capabilities and continuity in the latent space.
3. **Stochasticity and Generation:** The VAE randomly samples points from the latent-space distribution when decoding. This means VAEs not only reconstruct inputs but can also generate new data by sampling from the latent space.
4. **Usage:** VAEs are used for generative tasks where you not only want to learn an efficient representation of the data but also want to generate new samples similar to the training data. They help in understanding the probabilistic distribution of the data.

**Summary of Differences:**

- **Encoding:** Regular AEs encode inputs as deterministic points in the latent space, while VAEs encode inputs as distributions, introducing randomness into the encoding process.
- **Loss Function:** AEs use a loss function based on data reconstruction, while VAEs use a loss function that also includes a divergence term to regularize the distribution of the latent space.
- **Generative Capability:** VAEs can generate new data by sampling from the latent space, something traditional AEs don't explicitly do.
- **Continuity and Smoothness:** The latent space of a VAE tends to be more structured and continuous, making it suitable for generative modeling and for exploring transitions between different data points.

Each type has its applications depending on the requirements of the task at hand, whether that is encoding and reconstructing information with traditional autoencoders or understanding data distributions and generating new samples with VAEs.
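To make the encoding difference concrete, here is a minimal PyTorch sketch (the class names, layer sizes, and 16-dimensional latent space are illustrative assumptions, not from this note): a plain autoencoder's encoder returns a single deterministic code, while a VAE's encoder returns the parameters of a Gaussian and samples from it via the reparameterization trick.

```python
import torch
import torch.nn as nn

class AEEncoder(nn.Module):
    """Plain autoencoder encoder: each input maps to one deterministic latent point."""
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))

    def forward(self, x):
        return self.net(x)  # a fixed code z for a given x

class VAEEncoder(nn.Module):
    """VAE encoder: each input maps to a distribution (mu, logvar), then we sample."""
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU())
        self.mu_head = nn.Linear(128, latent_dim)
        self.logvar_head = nn.Linear(128, latent_dim)

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterization trick
        return z, mu, logvar

x = torch.randn(4, 784)              # toy batch, purely for illustration
z_ae = AEEncoder()(x)                # same z every time for the same x
z_vae, mu, logvar = VAEEncoder()(x)  # a different z on each forward pass
```

Calling the VAE encoder twice on the same input generally yields two different codes, which is exactly the stochasticity described above.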
### Understanding Notation

![](Graph%20Variational%20Autoencoders%206-45%20screenshot.png)

For more useful context, check out this conversation: [Understanding VAE Main Equations](https://chat.openai.com/share/34da8b32-eb0e-41a1-81ef-40367aabff74).

##### Input: $X$

$X$ is our input (say, an image).

##### Latent Representation: $Z$

$Z$ is our learned latent representation.

##### Encoder: $q_{\theta}$

Our **encoder** is a function $q_{\theta}: X \rightarrow Z$. It is meant to model a *probability distribution*: the probability of $Z$ conditional on $X$. Hence we write it as $q_{\theta}(Z \mid X)$. It yields a distribution (of a specific functional form, in this case normal) parameterized by $\mu$ and $\sigma$. $q$ itself is parameterized by weights $\theta$.

##### Decoder: $p_{\phi}$

Our **decoder** is a function $p_{\phi}: Z \rightarrow X$. It is meant to model a probability distribution: the probability of $X$ conditional on $Z$. Hence we write it as $p_{\phi}(X \mid Z)$. It takes in a sample $z \in Z$ and yields a probability distribution over $X$. It does *not* take in the distribution yielded by $q_{\theta}$; it takes in a sample from that distribution. If we want a single reconstructed point, we can sample from the distribution it outputs.

##### Loss: $L$

Our **loss** is then defined as follows:

$L = - \mathbb{E}_{Z \sim q_{\theta}(Z \mid X)} \Big[ \log\big( p_{\phi}(X \mid Z) \big) \Big] + \mathrm{KL} \Big( q_{\theta}(Z \mid X) \; \| \; p(Z) \Big)$

##### Loss: First term

Let's start by focusing on the first term of our loss:

$- \mathbb{E}_{Z \sim q_{\theta}(Z \mid X)} \Big[ \log\big( p_{\phi}(X \mid Z) \big) \Big]$

This looks complex, but it is actually very straightforward. Let's break it down as follows. First, recall that when an expectation carries a subscript, that subscript names the random variable we are taking the expectation with respect to (see the last section of [Expected Value](Expected%20Value.md)). So, the first thing to realize is that the notation $\mathbb{E}_{Z \sim q_{\theta}(Z \mid X)}$ is really saying that the only random variable *inside the brackets* is $Z$, and it is distributed according to $q_{\theta}(Z \mid X)$.

As a quick aside, why does the [Expected Value](Expected%20Value.md) need to be defined with respect to a distribution? Recall the expected value's definition:

$E[Y] = \sum_y y P_Y(y)$

Here the expected value is *specifically* defined with respect to a distribution, $P_Y$. If we instead used, say, $Q_Y$, an approximation of $P_Y$, then our expected value would change! Clearly we need to know which distribution we are using.

Back to the problem at hand. Now wait, isn't $X$ a random variable? Well, in one sense, yes. At times we will model $X$ as a random variable. However, in this case it represents our *data set*, a concrete set of points (e.g. images).

Alright, now what about the term inside the bracket, the value we are taking the expectation of: $\log\big( p_{\phi}(X \mid Z) \big)$? This is the log probability of $X$ given $Z$ (where $Z$ is distributed according to $q_{\theta}$).
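As an aside (my own notation, not from the note): in practice this expectation is approximated by drawing a finite number of samples from $q_{\theta}$. A standard Monte Carlo estimate, with an illustrative sample count $S$, looks like:

$\mathbb{E}_{Z \sim q_{\theta}(Z \mid X)} \Big[ \log p_{\phi}(X \mid Z) \Big] \approx \frac{1}{S} \sum_{s=1}^{S} \log p_{\phi}(X \mid z_s), \quad z_s \sim q_{\theta}(Z \mid X)$

The pseudocode further below follows exactly this pattern, with an extra outer loop over the data points in $X$.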
Let's pause for a moment to highlight a potential confusion. Generally, inside of an expectation you have a random variable and its distribution. You then multiply those together and sum them up, as seen in the traditional example:

$E[X] = \sum_x x P_X(x)$

However, what is interesting in this case is that the thing we want to compute the expectation of *is a probability itself* (well, technically a log probability). That isn't fundamentally a problem; it just may take a moment to get used to.

Now let's look at how this expectation may be computed via pseudocode:

```python
# Pseudocode: encoder, decoder, num_samples, and fixed_variance are assumed to be defined
likelihoods = []

# Iterate over all observed data points in X
for x in X:
    # Encode x via q. This yields a mu and sigma
    mu, std = encoder(x)

    # Sample concrete values of z from q
    for _ in range(num_samples):
        # Use the reparameterization trick
        eps = torch.randn_like(std)
        z_sample = mu + std * eps

        # Reconstruct parameter mu via the decoder
        reconstructed_mu, _ = decoder(z_sample)

        # Compute the likelihood (probability density) of the actual data, x,
        # under the reconstructed distribution (assuming a Gaussian density with
        # fixed variance; strictly, the loss uses the log of this quantity)
        likelihood = torch.exp(
            -torch.pow(x - reconstructed_mu, 2) / (2 * fixed_variance)
        )
        likelihoods.append(likelihood)

expectation = torch.mean(torch.stack(likelihoods))
```

Now, given the above pseudocode, we can see that our expectation expression is really an elegant way of incorporating quite a bit of information and process! What is very interesting to note is the *flow* of the process:

* We start with an observed $x \in X$.
* We then use this $x$ to compute a sample $z$. We do this by passing $x$ through the **encoder** $q_{\theta}(Z \mid X)$ and then sampling (via the reparameterization trick) from the resulting distribution.
* Given $z$, we can pass it through the **decoder**, $p_{\phi}(X \mid Z)$, and get an attempt at a reconstruction of the original input point, $x$. Specifically, we get a *distribution* of possible $x$ values.
* From this distribution we can see how likely our actual $x$ was. This is the **likelihood**.
* For this specific $x$ we repeat the process for many sampled $z$ values, collecting the resulting likelihoods along the way.
* We then repeat this process for all $x \in X$.
* At the end we have a collection of likelihoods and take their mean. This is the final expectation that we wished to compute!
* Our main objective is then to *maximize this likelihood*! We can do this by changing the parameters of our encoder and decoder, $\theta$ and $\phi$.

Now, what about this information flow is interesting? Well, just as with an [Autoencoder](Autoencoders.md), we start by passing in $x$, and that is the final point that we wish to reconstruct. So graphically it goes:

$X \rightarrow Z \rightarrow X$

The key is to realize that the *only thing that makes this hard* is that we have a bottleneck and force the dimensionality of $Z$ to be far smaller than that of $X$. This forces the encoder to be creative in how it models $P(Z \mid X)$ and to find useful, compressed representations.

### Loss Function Incentives

![](Screenshot%202023-10-27%20at%207.23.41%20AM.png)

#### Notes

* The decoder that we learn, $p_{\phi}$, is *deterministic*. It is *not* probabilistic. However, by sampling a $z$ from the output of $q_{\theta}$, we can pass this sampled (random) $z$ into the decoder and get back an "induced" random variable $X$.
* When training a [Generative Model](Generative%20Model.md), it is natural to try to *maximize the likelihood* of the observed data.
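To make the two incentives concrete, here is a minimal sketch of how the loss $L$ above is often computed in practice, assuming a diagonal-Gaussian encoder, a standard normal prior $p(Z)$, and a Bernoulli decoder (so the reconstruction term is a binary cross-entropy; an MSE term would play the same role). The function and variable names are my own, not from this note.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, reconstructed_x, mu, logvar):
    """A common closed-form version of the VAE loss (a sketch, not this note's exact setup).

    First term:  -E_{z ~ q(z|x)}[log p(x|z)], estimated with the single z that produced
                 `reconstructed_x` and a Bernoulli decoder (inputs assumed to lie in [0, 1]).
    Second term: KL(q(z|x) || N(0, I)), which has a closed form when q is a
                 diagonal Gaussian with parameters (mu, logvar).
    """
    recon = F.binary_cross_entropy(reconstructed_x, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Hypothetical usage (encoder and decoder are assumed to exist):
#   z, mu, logvar = encoder(x)                    # q_theta(Z | X)
#   reconstructed_x = torch.sigmoid(decoder(z))   # parameters of p_phi(X | Z)
#   loss = vae_loss(x, reconstructed_x, mu, logvar)
```

Minimizing this sum balances the two incentives: the reconstruction term pulls the encoded distributions apart so inputs can be told apart, while the KL term pulls them toward the prior, keeping the latent space continuous.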
### Intractability

Consider the Bayesian model of our data:

![](Deep%20Learning%20-%20Lecture%2011.4%20(Autoencoders_%20Variational%20Autoencoders)%201-12%20screenshot.png)

In order to develop this model, we need to make some assumptions:

![](Deep%20Learning%20-%20Lecture%2011.4%20(Autoencoders_%20Variational%20Autoencoders)%201-58%20screenshot.png)

![](Deep%20Learning%20-%20Lecture%2011.4%20(Autoencoders_%20Variational%20Autoencoders)%204-34%20screenshot.png)

The problem is that computing the expectation above is still intractable, because our search space is too large. We would need to draw far too many samples from a very broad distribution over $z$ (where the dimensionality of this latent space may be, say, 60) just to make a single gradient step, hence it is intractable.

VAEs get around this problem by introducing another model: the **recognition model** (the encoder)! By introducing a model $q$ that attempts to model the true posterior $P(z \mid x)$, we are able to drastically reduce our search space compared to just sampling from the marginal $P(z)$. In other words, we are making specific use of the fact that we can *condition on* $x$!

Now let's gain some intuition for why this problem is intractable (until we bring in the recognition model).

![](Deep%20Learning%20-%20Lecture%2011.4%20(Autoencoders_%20Variational%20Autoencoders)%208-57%20screenshot.png)

Consider the image below:

* We want to compute $p(x)$, which is the expectation over $z$ of $p(x \mid z)$.
* If we look at the prior distribution over $z$ (left-hand plot in blue), we see it is rather broad. Now assume we want to compute the likelihood of our particular data point $x_i$ (dashed black line).
* We would then draw samples from $p(z)$ and use those samples to compute (estimate) our expectation. Specifically, we would get the 3 conditional distributions (orange lines) that are slices through the joint distribution (green).
* Now, given these 3 conditional distributions, we can compute the likelihood of $x_i$ according to each of them and then take the average. But we see that according to these three orange curves $x_i$ is very unlikely; none of them actually yields a high probability! That is because we did not take enough samples to correctly estimate it: the prior over $z$ is too broad!
* On the other hand, if we have a recognition model that is conditioned on $x_i$ (the right-hand graph), we have a much narrower distribution over $z$, namely $q(z \mid x_i)$. If we then sample from this, we get a much more useful set of conditional probability curves (in orange), $p(x \mid z_i)$.

![](Deep%20Learning%20-%20Lecture%2011.4%20(Autoencoders_%20Variational%20Autoencoders)%2011-34%20screenshot.png)

---
Date: 20230709
Links to:
Tags:
References:
* [ChatGPT](https://chat.openai.com/c/945b5e3d-a1cd-46e3-ad92-c9b38d2af550)
* [Variational Autoencoders - YouTube](https://www.youtube.com/watch?v=9zKuYvjFFS8)
* [Deep Learning - Lecture 11.4 (Autoencoders: Variational Autoencoders) - YouTube](https://youtu.be/Myz8UPECgdI?t=725)
* [Variational Autoencoders - YouTube](https://www.youtube.com/watch?v=c27SHdQr4lw)