" (i.e. the next *infinitesimally small* instance of time). We can do that with the equation above, which says: "the derivative of the *hidden state* is now parameterized by a neural network". Let's say that another way to make sure we are clear: normally we update our hidden state according to $(1)$, which involves taking "large" step sizes. If we think about it, $(1)$ is really just a rule that tells us how $\mathbf{h}$ changes over time. That is the same concept as a derivative, only a derivative deals with infinitesimally small time steps. So if we move to the continuous case, our update rule is no longer discrete; rather, it is a derivative that is **parameterized** by a **neural network**.

An ODE solver then works by essentially saying (a minimal code sketch of this loop appears at the end of this note):

* I know the start state, $\mathbf{h}_0$
* At each point in time I know what the derivative looks like
* So at time $t=0$ the derivative is $\frac{d \mathbf{h}(0)}{dt} = f\big( \mathbf{h}(0), 0, \theta \big)$, which is just a **vector** telling me how $\mathbf{h}$ is changing at that instant!
* Let me take a step in that direction, updating $\mathbf{h}_0$ to $\mathbf{h}_1$ and $t = 0$ to $t=1$
* Now let me repeat the process, given that I am in state $\mathbf{h}_1$ at $t=1$...

Note that the smaller the steps the ODE solver takes, the more accurate the resulting evolution of $\mathbf{h}$ will be. The cool part of Neural ODEs is that the start state, your input (say an image), is evolved according to the rules of the ODE (each step of this evolution is akin to a hidden layer in the traditional discrete case), and it ends up at the correct class.

We can think of the training process as follows:

1. We start with an input image.
2. We train the neural network to give us the correct derivative at each point in time, such that when we solve the ODE, the end point corresponds to the correct label.

Note the key difference here:

> We are no longer training the neural network to determine how to go from *input* to *output*. Rather, we are training the neural network to parameterize how to go from each point in time to the next; i.e. what the derivative is at each point in time.

---
Date: 20230804
Links to:
Tags:
References:
* [Neural Ordinary Differential Equations - YouTube](https://www.youtube.com/watch?v=jltgNGt8Lpg)
* [Neural Ordinary Differential Equations (Chen et al., 2018)](https://arxiv.org/pdf/1806.07366.pdf)

[^1]: [Differential equations, a tourist's guide | DE1 - YouTube](https://youtu.be/p_di4Zn4wz4?t=50)
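
To make the solver loop described above concrete, here is a minimal sketch of a fixed-step Euler solve in which the derivative is given by a small neural network. This is illustration only: the names (`f`, `euler_solve`) and the toy two-layer parameterization are my own assumptions, not the setup from the paper, which uses black-box adaptive ODE solvers and trains via the adjoint method.

```python
import numpy as np

# Hypothetical stand-in for the learned dynamics f(h, t, theta):
# a tiny two-layer MLP whose output has the same shape as h.
def f(h, t, theta):
    W1, b1, W2, b2 = theta
    # Append time to the state so the dynamics can depend on t.
    x = np.concatenate([h, [t]])
    return W2 @ np.tanh(W1 @ x + b1) + b2

def euler_solve(h0, t0, t1, n_steps, theta):
    """Fixed-step Euler integration of dh/dt = f(h, t, theta)."""
    h, t = h0, t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        h = h + dt * f(h, t, theta)  # step in the direction of the derivative
        t = t + dt
    return h

# Example: evolve a 4-dimensional hidden state from t=0 to t=1.
rng = np.random.default_rng(0)
dim, hidden = 4, 8
theta = (rng.normal(size=(hidden, dim + 1)), np.zeros(hidden),
         rng.normal(size=(dim, hidden)), np.zeros(dim))
h0 = rng.normal(size=dim)
h1 = euler_solve(h0, t0=0.0, t1=1.0, n_steps=100, theta=theta)
print(h1)
```

Shrinking the step size (increasing `n_steps`) makes the recovered evolution of $\mathbf{h}$ more accurate, exactly as noted above; training would then consist of adjusting `theta` so that the final state maps to the correct label.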