# Variational Autoencoder
Autoencoders and Variational Autoencoders (VAEs) are neural network architectures used for unsupervised learning tasks, particularly dimensionality reduction and generative modeling. While they share the same basic encoder-decoder structure, they differ fundamentally in how they encode information and in the structure and flexibility of the latent codes they produce.
**Autoencoders (AE):**
1. **Basic Structure:** An autoencoder consists of two primary components:
- **Encoder:** Takes the input data and compresses it into a latent-space representation (latent variables). It captures the most relevant features of the input data.
- **Decoder:** Tries to reconstruct the original data from the latent space representation, ideally without loss, though typically some loss occurs.
2. **Objective:** The primary goal is to minimize the difference between the input and the reconstructed output, often using a loss function like Mean Squared Error (MSE). This forces the AE to retain as much information as possible in the latent space representation (a minimal sketch follows after this list).
3. **Usage:** Autoencoders are mainly used for dimensionality reduction, feature extraction, and data denoising. They are not explicitly designed for generating new data that wasn't in the training set.
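To make this concrete, here is a minimal sketch of such an autoencoder in PyTorch. The dimensions (`input_dim`, `latent_dim`) and layer sizes are illustrative assumptions, not anything prescribed above:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes (assumptions for the sketch)
input_dim, latent_dim = 784, 32

# Encoder: compress the input to a deterministic latent code
encoder = nn.Sequential(
    nn.Linear(input_dim, 256),
    nn.ReLU(),
    nn.Linear(256, latent_dim),
)

# Decoder: reconstruct the input from the latent code
decoder = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(),
    nn.Linear(256, input_dim),
)

x = torch.randn(16, input_dim)       # a dummy batch of inputs
x_hat = decoder(encoder(x))          # deterministic reconstruction
loss = F.mse_loss(x_hat, x)          # reconstruction (MSE) objective
```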
**Variational Autoencoders (VAE):**
1. **Basic Structure:** Much like basic autoencoders, VAEs also have an encoder and a decoder. However, they introduce an element of stochasticity to the encoding. Instead of encoding an input as a single point in the latent space, they encode it as a distribution over the latent space.
2. **Objective:** The VAE's loss function has two parts:
- **Reconstruction Loss:** Like the AE, the VAE measures how accurately the decoder reconstructs the input.
- **KL Divergence:** This term forces the encoded distributions to approximate a target distribution, typically a standard normal distribution. It ensures that similar inputs produce nearby points in the latent space, and it regularizes the encoder's outputs, promoting generative capabilities and continuity in the latent space.
3. **Stochasticity and Generation:** The VAE randomly samples points from the latent space distribution when decoding. This property means VAEs not only reconstruct inputs but can also generate new data by sampling from the latent space.
4. **Usage:** VAEs are used for generative tasks where you not only want to learn an efficient representation of the data but also generate new samples similar to the training data. They help in understanding the probabilistic distribution of the data.
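As a small illustration of the generative use, here is a hedged sketch of drawing new samples from a trained VAE decoder. The `decoder` architecture and `latent_dim` below are stand-in assumptions (an untrained module is used only so the snippet runs):
```python
import torch
import torch.nn as nn

# Hypothetical trained decoder (an untrained stand-in here, for illustration only)
latent_dim, data_dim = 32, 784
decoder = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(),
    nn.Linear(256, data_dim),
)

# Because the KL term pushes the encoded distributions toward a standard normal,
# new data can be generated by sampling z ~ N(0, I) and decoding it.
z = torch.randn(8, latent_dim)
with torch.no_grad():
    new_samples = decoder(z)   # shape: (8, data_dim)
```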
**Summary of Differences:**
- **Encoding Differences:** Regular AEs encode inputs as deterministic points in the latent space, while VAEs encode inputs as distributions, introducing randomness in the encoding process.
- **Loss Function:** AEs use a loss function based on data reconstruction, while VAEs use a loss function that also includes a divergence term to regularize the distribution of the latent space.
- **Generative Capability:** VAEs can generate new data by sampling from the latent space, something that traditional AEs don't explicitly do.
- **Continuity and Smoothness:** The latent space of a VAE tends to be more structured and continuous, making it suitable for generative models and exploration of the transitions between different data points.
Each type has its applications based on the requirements of the task at hand, whether it's focusing on encoding and reconstructing information with traditional autoencoders or understanding data distributions and generating new samples with VAEs.
### Understanding Notation

For more useful context, check out this conversation: [Understanding VAE Main Equations](https://chat.openai.com/share/34da8b32-eb0e-41a1-81ef-40367aabff74).
##### Input: $X$
$X$ is our input (say an image).
##### Latent Representation: $Z$
$Z$ is our learned latent representation.
##### Encoder: $q_{\theta}$
Our **encoder** is a function $q_{\theta}: X \rightarrow Z$. It is meant to model a *probability distribution*, the probability of $Z$ conditional on $X$. Hence we write it as $q_{\theta}(Z \mid X)$.
It yields a distribution (of a specific functional form, in this case normal) parameterized by $\mu$ and $\sigma$. $q$ itself is parameterized by weights $\theta$.
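As a rough sketch (the architecture, layer sizes, and the name `Encoder` are illustrative assumptions), such an encoder might look like this in PyTorch, returning the $\mu$ and $\sigma$ of the distribution:
```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input x to the parameters (mu, std) of q_theta(Z | x)."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.hidden = nn.Linear(input_dim, 256)
        self.mu_head = nn.Linear(256, latent_dim)
        self.log_var_head = nn.Linear(256, latent_dim)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        mu = self.mu_head(h)
        # Predict log-variance and exponentiate so std stays positive
        std = torch.exp(0.5 * self.log_var_head(h))
        return mu, std
```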
##### Decoder: $p_{\phi}$
Our **decoder** is a function $p_{\phi}: Z \rightarrow X$. It is meant to model a probability distribution, the probability of $X$ conditional on $Z$. Hence we write it as $p_{\phi}(X \mid Z)$.
It takes in a sample $z \in Z$ and yields a probability distribution over $X$. It does *not* take in the distribution yielded via $q_{\theta}$, it takes in a sample from that distribution. If we want a single reconstructed point, we can sample from the distribution it outputs.
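Correspondingly, a rough sketch of the decoder (again, the architecture is an assumption; here it outputs only the mean of a Gaussian over $X$, with the variance treated as fixed, matching the pseudocode later in this note):
```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Maps a latent sample z to the mean of p_phi(X | z), with a fixed variance assumed."""
    def __init__(self, latent_dim=32, output_dim=784):
        super().__init__()
        self.hidden = nn.Linear(latent_dim, 256)
        self.mu_head = nn.Linear(256, output_dim)

    def forward(self, z):
        h = torch.relu(self.hidden(z))
        reconstructed_mu = self.mu_head(h)
        return reconstructed_mu
```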
##### Loss: $L$
Our **loss** is then defined as follows:
$L = - \mathbb{E}_{Z \sim q_{\theta}(Z \mid X)} \Big[ \log\big( p_{\phi}(X \mid Z) \big) \Big] + KL \Big( q_{\theta}(Z \mid X) \;\|\; p(Z) \Big)$
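For reference, when $q_{\theta}(Z \mid X)$ is a diagonal normal $\mathcal{N}(\mu, \sigma^2)$ and the target prior is the standard normal $p(Z) = \mathcal{N}(0, I)$, the KL term has a well-known closed form (summing over latent dimensions $j$):
$KL \Big( q_{\theta}(Z \mid X) \;\|\; p(Z) \Big) = \frac{1}{2} \sum_{j} \big( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \big)$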
##### Loss: First term
Let's start by focusing on the first term of our loss:
$- \mathbb{E}_{Z \sim q_{\theta}(Z \mid X)} \Big[ \log\big( p_{\phi}(X \mid Z) \big) \Big]$
This looks complex but it is actually very straightforward. Let's break it down. First, recall that with expected values, a subscript on the expectation tells us which random variable we are taking the expectation wrt, and under which distribution (see the last section of [Expected Value](Expected%20Value.md)). So, the first thing to realize is that the notation $\mathbb{E}_{Z \sim q_{\theta}(Z \mid X)}$ is really saying that the only random variable *inside of the brackets* is $Z$, and it is distributed according to $q_{\theta}(Z \mid X)$.
As a quick aside, why does the [Expected Value](Expected%20Value.md) need to be defined wrt a distribution? Recall the expected value's definition:
$E[Y] = \sum_y y P_Y(y)$
Here the expected value is *specifically* defined wrt a distribution, $P_Y$. If we instead used, say, $Q_Y$, an approximation of $P_Y$, then our expected value would change! Clearly we need to know what distribution we are using.
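As a quick numeric example: if $Y \in \{0, 1\}$ with $P_Y(1) = 0.9$, then $E[Y] = 0.9$ under $P_Y$, but under an approximation $Q_Y$ with $Q_Y(1) = 0.5$ the expectation would be $0.5$ instead.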
Back to our problem at hand. Now wait, isn't $X$ a random variable? Well, in one sense yes. At times we will model $X$ as a random variable. However, in this case it represents our *data set*, a concrete set of points (e.g. images).
Alright, now what about the term inside the brackets, the value we are taking the expectation of: $\log\big( p_{\phi}(X \mid Z) \big)$? This is the log probability of $X$ given $Z$ (where $Z$ is distributed according to $q_{\theta}$).
Let's pause for a moment to highlight a potential confusion. Generally, inside of an expectation you have a random variable and its distribution; you weight each value by its probability and sum, as in the definition we recalled above:
$E[Y] = \sum_y y P_Y(y)$
However, what is interesting in this case is that the thing we want to compute the expectation of *is a probability itself* (well, technically a log probability). That isn't fundamentally a problem; it just may take a moment to get used to.
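Concretely, in practice we approximate this expectation with a Monte Carlo estimate: draw $S$ samples $z^{(s)}$ from $q_{\theta}(Z \mid X)$ and average the resulting log probabilities:
$\mathbb{E}_{Z \sim q_{\theta}(Z \mid X)} \Big[ \log\big( p_{\phi}(X \mid Z) \big) \Big] \approx \frac{1}{S} \sum_{s=1}^{S} \log p_{\phi}\big(X \mid z^{(s)}\big), \quad z^{(s)} \sim q_{\theta}(Z \mid X)$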
Now let's look at how this expectation may be computed via pseudocode:
```python
import torch

log_likelihoods = []
# Iterate over all observed data points in X
for x in X:
    # Encode x via q. This yields the mu and std of q(Z | x)
    mu, std = encoder(x)
    # Sample concrete values of z from q
    for _ in range(num_samples):
        # Use the reparameterization trick
        eps = torch.randn_like(std)
        z_sample = mu + std * eps
        # Reconstruct the mean of p(X | z) via the decoder
        reconstructed_mu = decoder(z_sample)
        # Compute the log likelihood (log density) of the actual data, x,
        # under the reconstructed distribution (assuming a gaussian density
        # with fixed variance; the normalization constant is dropped)
        log_likelihood = -torch.sum(
            torch.pow(x - reconstructed_mu, 2) / (2 * fixed_variance)
        )
        log_likelihoods.append(log_likelihood)
expectation = torch.mean(torch.stack(log_likelihoods))
```
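In practice this double loop is vectorized over a minibatch, and it is common to use just a single sample of $z$ per data point per training step, relying on averaging across many minibatches to smooth out the sampling noise.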
Now given the above pseudocode we can see that our expectation expression is really an elegant way of incorporating quite a bit of information and process! What is very interesting to note is the *flow* of the process:
* We start with an observed $x \in X$
* We then use this $x$ to compute a sample $z$. We do this by passing $x$ through the **encoder** $q_{\theta}(Z \mid X)$ and then sampling (via the reparameterization trick) from the resulting distribution
* Given $z$ we can pass it through the **decoder**, $p_{\phi}(X \mid Z)$, and get an attempt at reconstructing the original input point, $x$. Specifically, we get a *distribution* over possible values of $x$; if we want a single reconstructed point, we can sample from that distribution (or simply take its mean).