# Stable Diffusion
### Fast AI Intuition (thanks Jeremy Howard)
For a deep dive into what is discussed below, see here: [Lesson 9: Deep Learning Foundations to Stable Diffusion, 2022 - YouTube](https://youtu.be/_7rMfsA24Ls).
Imagine that we have access to some function, $f$, that can tell us the probability that some image is a handwritten digit:

We could use $f$ to in turn *generate* handwritten digits. To do this we will (of course) make use of the [Gradient](Gradient.md) of the probability that $X$ is a handwritten digit, with respect to the pixels of $X$:
$\nabla_X P(X)$
We can then simply change the values of the pixels in the direction of the gradient so as to increase the probability that the image is a handwritten digit. Note that this gradient (of the log probability, $\nabla_X \log P(X)$) is often called the **score function**.
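As a rough PyTorch sketch of this idea, assuming we somehow had a `score_model` (a hypothetical name) that approximates this gradient, generation would just be repeatedly nudging the pixels:

```python
import torch

def generate_digit(score_model, steps=200, step_size=0.1, shape=(1, 1, 28, 28)):
    """Start from random noise and repeatedly nudge the pixels in the direction
    of the (approximate) score so the image becomes more digit-like."""
    x = torch.randn(shape)                 # start from pure noise
    for _ in range(steps):
        with torch.no_grad():
            score = score_model(x)         # approximates the gradient w.r.t. the pixels
        x = x + step_size * score          # move the pixels "uphill" in probability
    return x
```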
### Where do we get $f$?
The problem is - where do we get $f$ from!? Well, in general, whenever there is a magic $f$ that does not exist but we would like it to, we train a neural net and treat it as $f$! But how exactly would we go about training a neural net to approximate $f$?
Here is the approach:
1. We start with a data set of real hand written digits
2. We then add differing amounts of noise to them
3. Now we could then try to come up with a score representing "how much does this image resemble a handwritten digit?". But that seems rather hard and arbitrary. So instead, why not try to determine *how much noise was added to the image*? The reason this is such a great approach is that we *know the answer to this question*!
4. We can then reason as follows: given that we know that these images are made up of handwritten digits, the amount of noise present in an image determines how likely it is to be a handwritten digit. If there is zero noise in the image, we know that it has a probability of 1 of being a handwritten digit.
Alright but it still may seem a tad bit fuzzy as to how this would work in practice. Well consider that we have a neural network with the following:
* **Inputs**: Images of handwritten digits with varying amounts of noise. The noise applied to the image will be known. Let's say it is of the form $\mathcal{N}(0, \sigma)$.
* **Outputs**: One option would be to predict the amount of variance, $\sigma$, associated with the noise applied to the image. This would mean that for each image we are trying to predict a *single number*. But better still, we could predict the actual noise itself. This means that our output will be an image itself (of noise). So it would not be a single number, but rather an $m \times m$ grid of values (where $m$ is the number of vertical/horizontal pixels)
* **Loss**: So we have the actual noise, call it $n$, and the predicted noise, $\hat{n}$, from which we can compute a simple loss, such as MSE: $\sum (\hat{n} - n)^2$ (a minimal training sketch is shown just below)
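Here is a minimal PyTorch sketch of that training step. The `model` name and the way the noise level is sampled are illustrative, and real implementations also feed the noise level (or timestep) into the model, which is omitted here for simplicity:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, clean_images):
    """One training step: add known Gaussian noise, predict it, score with MSE."""
    sigma = torch.rand(clean_images.shape[0], 1, 1, 1)   # a random noise level per image
    noise = torch.randn_like(clean_images) * sigma       # the noise we add -- we know the answer!
    noisy_images = clean_images + noise

    predicted_noise = model(noisy_images)                 # output is an m x m grid, not a single number
    loss = F.mse_loss(predicted_noise, noise)             # compare predicted noise to the true noise

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```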
Does this provide us with what we need? It does! Remember, we wanted some ability to know how much we needed to change the pixels in our image by in order to make it more digit-like. Well, in order to turn our noisy digit into the true digit, we need to remove the added noise. And our neural network will be trained to tell us exactly what that noise is. This provides us with $\nabla_X P(X)$.
Now we are in a position where we can pass our trained neural net pure noise, and it will tell us which portions look most like noise (to be removed) and which portions look least like noise (to be left in place).
One example of the underlying NN that could be used here is the **U-Net**.
Now it is worth noting that in terms of implementation, say we were dealing with images of size 512x512x3 and we want to train a U-Net on 10 million pictures of this size. We may first pass our images through an [autoencoder](Autoencoders.md) that compresses them down to fewer pixels (reduces the dimensionality), and then pass these reduced representations to the U-Net. This makes the U-Net's job easier and far faster to train. However, note that the U-Net will *not* be getting passed images anymore; it will be passed the output of the autoencoder. We can call this output the **latents**. So, to be even more specific, the input to the U-Net will be the latents with some noise applied, and the output will be the U-Net's prediction of the noise applied to those latents. We can then take the predicted noise output from the U-Net, subtract it from the latents, and pass that through the decoding portion of the autoencoder to get our final image. It is worth noting that this autoencoder will generally be a [Variational Autoencoder](Variational%20Autoencoder.md). This addition of a VAE is not strictly necessary, but it requires far less compute, so from a practical standpoint it is more efficient.
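A very rough sketch of how these pieces fit together (the `vae.encode` / `vae.decode` / `unet` names and signatures are assumptions for illustration, not a real library API, and the timestep/text conditioning is omitted):

```python
import torch

@torch.no_grad()
def latent_round_trip(vae, unet, image):
    """Compress to latents, add noise, predict that noise with the U-Net, then decode."""
    latents = vae.encode(image)                    # 512x512x3 pixels -> a much smaller latent grid
    noise = torch.randn_like(latents)
    noisy_latents = latents + noise                # the U-Net only ever sees (noisy) latents
    predicted_noise = unet(noisy_latents)          # the U-Net's job: predict that noise
    denoised_latents = noisy_latents - predicted_noise
    return vae.decode(denoised_latents)            # the decoder maps latents back to pixels
```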
### Guidance: How do text prompts fit in?
In general we don't interact with diffusion models by simply saying "produce me an image"; rather, we say "produce me an image of a teddy bear riding a horse". So how does that portion work? In some way we need our text input to have a representation (embedding) that is close to our desired image representation.
One way of doing this is to pass in *two things* to our model:
1. A picture of a handwritten digit
2. The literal string associated with that digit (e.g. if our handwritten digit image is of a 7, we pass in the string "7". This will then be one-hot encoded)
Now our model will learn how to predict *what the noise is*, but it will also get a bit of *extra information* to help with that task, namely which digit we passed in! So this model should be *better* at predicting noise than the previous one because we are giving it more information.
Why exactly is this useful? Well, now when we feed in the string "3" alongside an image of a 3 with some noise applied, the model is going to say that the noise is everything that does not represent the number 3. That is what it has learned to do!
This process is known as applying **guidance** (guidance to help it generate the type of image that we are trying to create).
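In code, this guidance amounts to handing the model one extra conditioning input. A hedged sketch (the `model(noisy_images, context)` signature is an assumption for illustration):

```python
import torch
import torch.nn.functional as F

def guided_train_step(model, optimizer, clean_images, digit_labels, num_classes=10):
    """Same noise-prediction objective, but the model also sees which digit it is looking at."""
    context = F.one_hot(digit_labels, num_classes).float()   # e.g. the label 7 -> a one-hot vector
    noise = torch.randn_like(clean_images)
    noisy_images = clean_images + noise

    predicted_noise = model(noisy_images, context)            # the extra input is the guidance
    loss = F.mse_loss(predicted_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```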
But there still lies a problem here. What if we wanted to do something a bit more complex, such as passing in a string of text: "a cute teddy bear"? Previously we just had 10 digits to represent and we could pass in a one-hot encoded vector representing each. But what about now? How do we even have a picture of the word "cute"? How would we one-hot encode every sentence that could ever occur and match it with its associated picture? We clearly cannot.
What we can do though is create another model that can take a sentence like "a cute teddy" and then return a vector of numbers that represents what cute teddies look like. To allow for this we will surf the internet for images and grab their associated alt tags. This gives us a nice labeled dataset of images and their associated captions.
This then allows us to create two models:
1. A text encoder
2. An image encoder
We can then have our two encoders try to output final representations that are *close* to each other for a given image + caption input pair. Closeness can be defined many ways, but a simple way would be to just take the [Dot Product](Dot%20Product.md). A large dot product indicates similar embeddings; a small/negative dot product indicates dissimilar embeddings.
But we can extend this even further (this is a beautiful theme that crops up again and again in the DL world). We know that an image and its caption should have *similar representations*, but we know that an image and *other captions* should (likely) have very different representations.
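Taken together, this gives roughly a CLIP-style contrastive objective. A sketch (names, the temperature value, and the assumption that matching pairs share an index are illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Pull each image toward its own caption and push it away from all other captions."""
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # similarity[i, j] = dot product between image i and caption j
    similarity = image_embeddings @ text_embeddings.T / temperature

    # the matching image/caption pairs sit on the diagonal
    targets = torch.arange(similarity.shape[0])
    loss_images = F.cross_entropy(similarity, targets)    # images -> captions
    loss_texts = F.cross_entropy(similarity.T, targets)   # captions -> images
    return (loss_images + loss_texts) / 2
```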

At this point what we have done is create two models that put text and images into the same space. This is known as a [MultiModal Model](MultiModal%20Model.md). Now we can take our text "a cute teddy", pass it into our Text Encoder, get out a vector representing that string of text (that was specifically trained to be close to the associated image), and then pass that into our U-Net model as the context (instead of a one-hot encoded vector as we had done in the handwritten digit case).
The models that are used to do this in practice are known as [CLIP Embeddings](CLIP%20Embeddings.md).
### Noise Schedule
The last thing to address is: how do we remove noise (and how much noise should we add, and when)? See the Fast AI video on this.
The important thing to note is that once our model predicts the noise in an image, we do not remove all of the predicted noise. Rather we multiply the predicted noise by a constant, $C$ (similar to a learning rate), and then remove that. The reason for this is that our model only knows how to deal with latents (image representations) with some noise added. If we remove all of the predicted noise at once we will likely create latents that never occurred in our training set!
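A minimal sampling-loop sketch, with $C$ playing the role of a learning rate (the `unet` call and the fixed constant are simplifications; real samplers vary the step size and re-inject noise according to a schedule):

```python
import torch

@torch.no_grad()
def sample(unet, steps=50, c=0.3, shape=(1, 4, 64, 64)):
    """Remove only a fraction (c) of the predicted noise at each step, so the
    latents always stay in the noisy regime the model was trained on."""
    latents = torch.randn(shape)                   # start from pure noise
    for _ in range(steps):
        predicted_noise = unet(latents)
        latents = latents - c * predicted_noise    # c acts like a learning rate
    return latents
```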
Questions such as:
1. What value do we pick for $C$?
2. How do we add noise?
3. How do we subtract noise?
These are all addressed in the *diffusion sampler*. It is worth noting that this framework looks a lot like Deep Learning optimizers! For instance, if we think about how much to change a learning rate over time, that leads us to the concept of **momentum** in the optimizers world. Similarly, we can ask what happens if the variance changes, and that gives us **Adam** (another type of optimizer).
Can we use these kinds of tricks for stable diffusion? Early research says yes we can!
It is worth noting that the original set of ideas for stable diffusion came from the math world of differential equations. This is all about trying to take little steps and then trying to figure out how to take bigger steps. These end up taking steps similar to those of optimizers.
However, most differential equation solvers will have the concept of time, $t$. But, as discussed in fast ai, a theory present today is that this may not be necessary.
### The Math
* [Lesson 9B - the math of diffusion - YouTube](https://www.youtube.com/watch?v=mYpjmM7O-30)
* [diffusion_reading_group/#1 DDPM paper.pdf at main · tmabraham/diffusion_reading_group · GitHub](https://github.com/tmabraham/diffusion_reading_group/blob/main/%231%20DDPM%20paper.pdf)
* [Diffusion Study Group #1 - EleutherAI - YouTube](https://www.youtube.com/watch?v=B5gfJF8mOPo)
It is worth noting that stable diffusion (in the literature) has what is known as a **forward** and a **reverse** process. The forward process is where we add noise to our image; the reverse process is where we remove noise from our image.
**Forward Process**
The forward process adds noise to our image by following a Markov Process where the transition probabilities follow a normal distribution with parameters $\mu$ and $\sigma$.
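For reference, in the DDPM formulation each forward step is typically written as
$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$
where $\beta_t$ is the (small) variance of the noise added at step $t$, following a fixed schedule.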
### Textual Inversion
[An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion](https://textual-inversion.github.io/)

[diffusion-nbs/Stable Diffusion Deep Dive.ipynb at master · fastai/diffusion-nbs · GitHub](https://github.com/fastai/diffusion-nbs/blob/master/Stable%20Diffusion%20Deep%20Dive.ipynb)
### Interesting notes
An interesting thing to note about diffusion models: they aren't that large! You can put them on your hard drive. Yet they seem to be able to generate a *vast* space of images. So if there is a lower dimensional manifold that images live along (not simply random noise in the higher dimensional space) we can use this to get an empirical sense of the size of this lower dimensional manifold. See [here](https://youtu.be/sJXn4Cl4oww?t=2932).
### Standard Intuition
Diffusion models are **generative** models. An important idea of generative modeling is the idea of *data distribution*. This simply describes how likely some data point is to be observed in reality. It is referred to as $p(X)$:

In the event that we are trying to generate images, we would expect images that could be observed in reality to have a high $p(X)$ and those that are unlikely to have a low $p(X)$:

Now if our goal was to generate new datapoints (e.g. images), we have had two main approaches up until now:
1. Approximately model $p(X)$ and sample from it - VAEs do this.
2. Model an approximate sampling function of $p(X)$ but not $p(X)$ itself - GANs do this.
Diffusion models propose an alternative approach! What if we knew how changing our data point $X$ changes its probability of being observed, $p(X)$? This change in probability is the [Gradient](Gradient.md), $\nabla p(X)$. If you knew $\nabla p(X)$ then you could start at a random point (e.g. a random image) and *iteratively* update the point by following $\nabla p(X)$:

Okay but how on earth can we determine $\nabla p(X)$? We don’t even know $p(X)$! Well we can actually model it via a neural network! Our neural network will specifically estimate $\nabla p(X)$ which we can use in an iterative sampling scheme.
But how can we structure our NN and problem such that it can learn to estimate $\nabla p(X)$? Well, there is actually a beautifully simple idea we can take advantage of: if we add random noise to a datapoint, it results in a data point with lower probability, because images with random noise are not likely to be observed:

Then we can train a model to *denoise* noisy images by predicting the noise that you need to remove. By subtracting out the predicted noise you are making the image a more likely data point, increasing $p(X)$. The hope is that this is closely related to the rate of change, $\nabla p(X)$.

It can be proven mathematically that noise prediction is equivalent to predicting $\nabla \log \big[ p(X) \big]$ (which is termed the **score** in the stats literature).
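A quick way to see why (sketching the standard argument): if we create the noisy sample as $\tilde{x} = x + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, then the score of the noising distribution is
$$\nabla_{\tilde{x}} \log q(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2} = -\frac{\epsilon}{\sigma}$$
so a network that predicts the noise $\epsilon$ is, up to a scaling of $-1/\sigma$, predicting the score.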

However, if you look carefully above you will see that this yields the score for the *noisy* distribution, not the actual data distribution! The idea is instead to denoise at multiple noise levels, and as we iteratively sample, the noise level decreases.
Iteratively following this approach of subtracting out noise can result in:

### Technical Notes
[This blog post](https://yang-song.net/blog/2021/score/) does a great job diving into some of the technical details. One thing I really appreciated was the clear explanation of why we would prefer to work with $\nabla \log p(X)$ instead of $p(X)$ directly.
Below, we can see what it looks like to parameterize probability density functions directly. No matter how you change the model family and parameters, the density has to be normalized (the area under the curve must integrate to one):

But we can see that the parameterization of the score function requires no worrying about normalization!
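One way to see this (following the energy-based parameterization in that blog post): write the density as
$$p_\theta(x) = \frac{e^{-f_\theta(x)}}{Z_\theta}, \qquad \nabla_x \log p_\theta(x) = -\nabla_x f_\theta(x) - \nabla_x \log Z_\theta = -\nabla_x f_\theta(x)$$
Since the normalizing constant $Z_\theta$ does not depend on $x$, its gradient vanishes, so modeling the score never requires computing the (generally intractable) $Z_\theta$.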

---
Date: 20221128
Links to:
Tags: #review
References:
* [Fantastic twitter thread](https://twitter.com/iScienceLuvr/status/1592860019057250304)
* [Generative Modeling by Estimating Gradients of the Data Distribution | Yang Song](https://yang-song.net/blog/2021/score/)
* [fast.ai - 1st Two Lessons of From Deep Learning Foundations to Stable Diffusion](https://www.fast.ai/posts/part2-2022-preview.html)