" (i.e. the next *infinitesimally small* instance of time). We can do that with the equation above, which says: "the derivative of the *hidden state* is now parameterized by a neural network". Let's say that another way to make sure we are clear: normally we update our hidden state according to $(1)$, which involves taking "large" step sizes. If we think about it, $(1)$ is really just a rule that tells us how $\mathbf{h}$ changes over time. That is the same concept as a derivative, only a derivative deals with infinitesimally small time steps. So if we move to the continuous case, our update rule is no longer discrete; rather, it is a derivative that is **parameterized** by a **neural network**.

An ODE solver then works by essentially saying (a minimal code sketch of this loop appears at the end of this note):

* I know the start state, $\mathbf{h}_0$
* At each point in time I know what the derivative looks like
* So at time $t=0$ the derivative is $\frac{d \mathbf{h}(0)}{dt} = f\big( \mathbf{h}(0), 0, \theta \big)$, which is just a **vector** telling me how $\mathbf{h}$ is changing at that instant!
* Let me take a step in that direction, updating $\mathbf{h}_0$ to $\mathbf{h}_1$ and $t = 0$ to $t=1$
* Now let me repeat the process, given that I am in state $\mathbf{h}_1$ at $t=1$...

Note that the smaller the steps the ODE solver takes, the more accurate the resulting evolution of $\mathbf{h}$ will be. The cool part of Neural ODEs is that the start state, your input (say an image), is evolved according to the rules of the ODE (each step of this evolution is akin to a hidden layer in the traditional discrete case), and it ends up at the correct class.

We can think of the training process as follows:

1. We start with an input image.
2. We train the neural network to give us the correct derivative at each point in time, such that when we solve the ODE, the end point corresponds to the correct label.

Note the key difference here:

> We are no longer training the neural network to determine how to go from *input* to *output*. Rather, we are training the neural network to parameterize how to go from each point in time to the next; i.e. what the derivative is at each point in time.

---
Date: 20230804
Links to:
Tags:
References:
* [Neural Ordinary Differential Equations - YouTube](https://www.youtube.com/watch?v=jltgNGt8Lpg)
* [Neural Ordinary Differential Equations (Chen et al., 2018)](https://arxiv.org/pdf/1806.07366.pdf)

[^1]: [Differential equations, a tourist's guide | DE1 - YouTube](https://youtu.be/p_di4Zn4wz4?t=50)
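
To make the solver loop described above concrete, here is a minimal sketch of a fixed-step Euler solve in which the derivative is given by a small neural network. This is illustration only: the names (`f`, `euler_solve`) and the toy two-layer parameterization are my own assumptions, not the setup from the paper, which uses black-box adaptive ODE solvers and trains via the adjoint method.

```python
import numpy as np

# Hypothetical stand-in for the learned dynamics f(h, t, theta):
# a tiny two-layer MLP whose output has the same shape as h.
def f(h, t, theta):
    W1, b1, W2, b2 = theta
    # Append time to the state so the dynamics can depend on t.
    x = np.concatenate([h, [t]])
    return W2 @ np.tanh(W1 @ x + b1) + b2

def euler_solve(h0, t0, t1, n_steps, theta):
    """Fixed-step Euler integration of dh/dt = f(h, t, theta)."""
    h, t = h0, t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        h = h + dt * f(h, t, theta)  # step in the direction of the derivative
        t = t + dt
    return h

# Example: evolve a 4-dimensional hidden state from t=0 to t=1.
rng = np.random.default_rng(0)
dim, hidden = 4, 8
theta = (rng.normal(size=(hidden, dim + 1)), np.zeros(hidden),
         rng.normal(size=(dim, hidden)), np.zeros(dim))
h0 = rng.normal(size=dim)
h1 = euler_solve(h0, t0=0.0, t1=1.0, n_steps=100, theta=theta)
print(h1)
```

Shrinking the step size (increasing `n_steps`) makes the recovered evolution of $\mathbf{h}$ more accurate, exactly as noted above; training would then consist of adjusting `theta` so that the final state maps to the correct label.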