This is heavily based on the teachings of Andrej Karpathy.
### Lecture 1 → Compute Graphs and Scalar Gradients
Any mathematical expression, regardless of complexity, can be broken down into a set of atomic parts, connected via a **computational graph**. We can see this via my implementation of [micrograd](https://github.com/NathanielDake/micrograd). Note that in this computational graph, the *children* are considered to the be the nodes that *feed into* the *parent*.

**Gradients**
When thinking about gradients, we can think of a node whose gradient is $0$ as having *no effect* on our loss function. It simply means that as you wiggle the node's value slightly, it yields no change in the loss function.
Consider the compute graph below. We have a node $L$ that is impacted directly by $d$ and $f$, and indirectly by $e, c, a, b$. The entire crux of backpropagation (and subsequently the training of neural networks) lies in the following: how do we determine how a *change to* $c$ will impact $L$? Well, say we know how changing $d$ impacts $L$. Then, if we can determine how changing $c$ and $e$ impacts $d$ (we can think of these as the *local derivatives*), we should, intuitively, be able to compute how that change impacts $L$. The answer of course is the [**Chain Rule**](Chain%20Rule.md).

Now consider the visualization below:

We wish to understand how changing $J$ impacts $L$, $\color{darkred}\frac{\partial L}{\partial J}$. This can easily be accomplished via the chain rule. We can break it down as:
$\color{darkred}\frac{\partial L}{\partial J} \color{black} = \frac{\partial L}{\partial K} \frac{\partial K}{\partial J}$
So if we know how $K$ impacts $L$, and we can determine how $J$ impacts $K$, we can multiply those two together to get $J$’s impact on $L$.
In this case $J$ and $L$ are only separated by a single intermediary variable $K$. However, in reality they could be separated by *hundreds* or *thousands* of nodes in a large compute graph. But that is the *beauty of backpropagation*! If we topologically sort our compute graph, treating the final loss as the root node, and then iteratively compute our local gradients and propagate them backwards via the chain rule, we will always have a gradient that represents the **accumulation of impact** on the loss function via the path we are on! Below we see this clearly. If we are interested in $K$’s impact on the loss, as long as we have backpropagated the gradients along the paths through which $K$ impacts $L$, we will be able to use **local derivative** rules at $K$.

To be even more concrete, consider our previous example:

Here we are focused on $J$’s and $H$’s impact on $L$. Assume that we have already computed $K$’s impact on $L$ to be $\frac{\partial L}{\partial K}$. And we know, via our compute graph, that:
$K = J+H$
So, we simply need to compute $J$‘s impact on $K$ and $H$‘s impact on $K$:
$\frac{\partial K}{\partial J} \; , \; \frac{\partial K}{\partial H}$
Which are simply:
$\frac{\partial K}{\partial J} = \frac{\partial (J + H)}{\partial J} = \frac{\partial J}{\partial J} + \frac{\partial H}{\partial J} = 1 + 0 = 1 $
$\frac{\partial K}{\partial H} = \frac{\partial (J + H)}{\partial H} = \frac{\partial J}{\partial H} + \frac{\partial H}{\partial H} = 0 + 1 = 1 $
So, the partial derivatives are simply $1$. This means that the full derivative of $L$ with respect to $J$ and $H$ is:
$\frac{\partial L}{\partial J} = \frac{\partial L}{\partial K} \frac{\partial K}{\partial J} = \frac{\partial L}{\partial K} \cdot 1 = \frac{\partial L}{\partial K}$
$\frac{\partial L}{\partial H} = \frac{\partial L}{\partial K} \frac{\partial K}{\partial H} = \frac{\partial L}{\partial K} \cdot 1 = \frac{\partial L}{\partial K}$
Great, we just worked through a nice quick chain rule example! But there is an even greater takeaway worth calling out here. Notice the effect of the *plus node* in our backpropagation framework. We had $K$’s impact on $L$, and then to get $J$’s and $H$’s impact on $L$, because each of their local derivatives is $1$, their partials are simply $K$’s impact on $L$! This means we can think about a plus node in a mathematical expression as **routing gradient information** to its children nodes *equally* (see more [here](https://youtu.be/VMj-3S1tku0?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&t=2750)). This routing becomes more meaningful when our nodes $J$ and $H$ are sitting in a massive compute graph (above they simply look like leaf nodes). In a neural network, plus nodes will technically sit anywhere that a *dot product* occurs.
A really nice way to intuitively think about backpropagation is as:
> We are *backpropagating* a **signal** that is carrying the information of the derivative of $L$ with respect to all of the intermediate nodes. We can imagine this signal **flowing** backwards through the graph. A plus node will simply distribute all of the information of a parent node to all children nodes.
And what about a multiplication node? For instance, below we have $F$ and $G$ feeding into a multiplication node to yield $H$. Assume we already have access to $\frac{\partial L}{\partial H}$. How do $F$ and $G$ impact $H$ (and in turn impact $L$) via the multiplication node?
$\frac{\partial H}{\partial F} = \; ? \; , \; \frac{\partial H}{\partial G} = \; ?$

Well, we know that $H = F \times G$, so we can compute our derivatives as:
$\frac{\partial H}{\partial F} = G \; , \; \frac{\partial H}{\partial G} = F$
So, our application of the chain rule will yield:
$\frac{\partial L}{\partial F} = \frac{\partial L}{\partial H} \frac{\partial H}{\partial F} = \frac{\partial L}{\partial H} \cdot G$
$\frac{\partial L}{\partial G} = \frac{\partial L}{\partial H} \frac{\partial H}{\partial G} = \frac{\partial L}{\partial H} \cdot F$
So if a plus node can be thought of as simply *routing* gradient information, a multiplication node can be thought of as distributing information about one incoming node *to the other incoming node*. Above, all nodes that lead into $F$ will now have $G$’s information flowing backwards in terms of gradients.
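Both behaviors are easy to sanity check numerically. Here is a quick sketch using `torch.autograd` (toy values and variable names of my own, mirroring the $K = J + H$ and multiplication examples above):
```python
import torch

# Tiny graph: K = J + H, then L = K * M.
# Expectation: the plus node routes dL/dK to J and H unchanged, and the times node
# hands each factor the *other* factor's value.
J = torch.tensor(2.0, requires_grad=True)
H = torch.tensor(-3.0, requires_grad=True)
M = torch.tensor(4.0, requires_grad=True)

K = J + H
L = K * M
L.backward()

print(J.grad, H.grad)  # tensor(4.) tensor(4.) -> both equal dL/dK = M
print(M.grad)          # tensor(-1.)           -> equals K = J + H
```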
Taking a moment, let’s recap what we have done:
1. We iterate through all nodes one by one, locally applying the chain rule.
2. We always know the derivative of $L$ with respect to a specific output (the *parent* node), and then look at how that output was produced (say via the addition or multiplication of two inputs).
3. The output will always be produced via some operation, and we will always (by our technical implementation, see [here](https://github.com/NathanielDake/micrograd)) have *pointers* to the *children* nodes.
4. This makes it very straightforward to compute the local derivatives of the parent with respect to the children, and then we use the chain rule to multiply these local derivatives onto the total derivative (gradient) of the parent with respect to the loss.
5. This process yields the impact of the children nodes on the loss.
6. We *recursively* apply this process through the entire graph.
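That recursive procedure is essentially the whole of a scalar autograd engine. Below is a minimal, micrograd-style sketch of the idea (heavily simplified relative to the actual [micrograd](https://github.com/NathanielDake/micrograd) implementation; only `+` and `*` are implemented):
```python
class Value:
    """A minimal, micrograd-style scalar node (a simplified sketch, not the repo's exact code)."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children          # pointers to the nodes that fed into this one
        self._backward = lambda: None      # closure applying one local chain rule step

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # plus node: route the parent's gradient to both children (accumulating: total derivative)
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # times node: each child receives the *other* child's value times the parent's gradient
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topologically sort the graph with this node (the loss) as the root...
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        # ...then propagate gradients backwards, one local chain rule step per node
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

# Usage: L = a*b + c  ->  dL/da = b, dL/db = a, dL/dc = 1
a, b, c = Value(2.0), Value(-3.0), Value(10.0)
L = a * b + c
L.backward()
print(a.grad, b.grad, c.grad)  # -3.0 2.0 1.0
```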
**Total Derivative**
Consider a situation like that below. Notice how $F$ impacts both $J$ and $H$. During the backward pass how do we combine the gradient information that will flow into $F$ from $J$ and $H$? We utilize the [Total derivative](https://en.wikipedia.org/wiki/Total_derivative), which states that you sum up the derivatives.

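A quick sketch of this gradient accumulation with `torch.autograd` (toy expressions of my own, where $F$ feeds two downstream nodes):
```python
import torch

# F feeds both J and H; its gradient is the *sum* of the contributions from both paths.
F = torch.tensor(3.0, requires_grad=True)
J = F * 2.0       # path 1: dJ/dF = 2
H = F ** 2        # path 2: dH/dF = 2F = 6
L = J + H
L.backward()
print(F.grad)     # tensor(8.) = 2 + 6, the two paths summed
```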
**Neural Network Bias nodes**
The role of a bias node in a neural network can be thought of as that neuron’s **innate trigger happiness**. The bias can make the neuron a bit more or less trigger happy *regardless of the input*.
**Composite Functions**
In order to perform backpropagation, the only thing that matters is that for any single operation node (such as $+, \times, exp, tanh$) we know what the *inputs* are to the operation and what the *output* is. As long as you know these two things you can perform local backpropagation via:
1. Compute the derivative of the output with respect to the input (where $f$ is our operation):
$\frac{\partial(output)}{\partial(input)} = \frac{\partial(f(input))}{\partial(input)}$
2. Multiply the above local gradient by the gradient of the output node (its gradient with respect to the loss!)
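As a concrete (toy) example of these two steps, consider a single $tanh$ node with an assumed upstream gradient:
```python
import math

# out = tanh(x); local derivative d(out)/dx = 1 - tanh(x)**2,
# then chain rule: multiply by the gradient of the loss with respect to out.
x = 0.7
out = math.tanh(x)

dloss_dout = 0.5                  # assumed upstream gradient (dloss/dout)
dout_dx = 1.0 - out ** 2          # local derivative of the tanh node
dloss_dx = dout_dx * dloss_dout   # chain rule: the input's impact on the loss
print(dloss_dx)
```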
**Gradients and Loss**
We can think of the gradient as a *vector* (with dimensionality equal to the number of parameters of our neural network), that specifically describes the *direction* to move in our parameter space that will *increase* the loss. Of course we want to *decrease* the loss, so we will want to move in the opposite direction. Hence, our update rule will always look like:
```python
data += -(learning_rate * gradient)
```
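As a tiny self-contained sketch of that update (a toy parameter tensor and toy loss of my own, purely for illustration):
```python
import torch

# Toy parameters and a toy loss; the gradient points in the direction of *increasing* loss,
# so we step against it.
params = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)
loss = (params ** 2).sum()
loss.backward()

learning_rate = 0.1
with torch.no_grad():
    params += -learning_rate * params.grad   # the update rule above
    params.grad.zero_()                      # reset gradients before the next forward/backward pass
print(params)                                # every entry moved towards 0: the loss decreased
```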
### Lecture 5 → Vectorized Backprop
Let us start with the following expression, which we need to backprop through:
```python
# Forward pass, "chunkated" into smaller steps that are possible to backward one at a time
emb = C[Xb] # embed the characters into vectors
# concatenate the vectors: transform our 3 inputs of 2 dimensions each into a single 6 dimensional vector
# (no copying is taking place, just a change in how we view the tensor -- very cheap operation)
embcat = emb.view(emb.shape[0], -1)
# Linear layer 1
hprebn = embcat @ W1 + b1 # hidden layer pre-activation
# BatchNorm layer
bnmeani = 1/n*hprebn.sum(0, keepdim=True)
bndiff = hprebn - bnmeani
bndiff2 = bndiff**2
bnvar = 1/(n-1)*(bndiff2).sum(0, keepdim=True) # note: Bessel's correction (dividing by n-1, not n)
bnvar_inv = (bnvar + 1e-5)**-0.5
bnraw = bndiff * bnvar_inv
hpreact = bngain * bnraw + bnbias
# Non-linearity
h = torch.tanh(hpreact) # hidden layer
# Linear layer 2
logits = h @ W2 + b2 # output layer
# cross entropy loss (same as F.cross_entropy(logits, Yb))
logit_maxes = logits.max(1, keepdim=True).values
norm_logits = logits - logit_maxes # subtract max for numerical stability
counts = norm_logits.exp()
counts_sum = counts.sum(1, keepdims=True)
counts_sum_inv = counts_sum**-1 # if I use (1.0 / counts_sum) instead then I can't get backprop to be bit exact...
probs = counts * counts_sum_inv
logprobs = probs.log()
loss = -logprobs[range(n), Yb].mean() # Pluck out the logprobs for the correct labels, and take the mean
```
Before stepping through the details, it’s worth pointing out a few key ideas:
1. A large part of the backpropagation process does *not* deal with gradients of *weights*, the parameters we have control over. Rather, a large part of the process deals with propagating *signal* from the loss through all of the operations performed in the network, back *to the weights*. This gradient information then allows us to update our weights to decrease the loss.
2. For the sake of creating clear visualizations, it is worth noting that anytime I reference the forward pass I will utilize blue, and the backward pass will be noted in red. If a cell is left white/blank, it is simply a value I have not filled in, in order to reduce visual noise.
#### 1. Gradient of `loss` with respect to `logprobs`
Before we can even start our backpropagation process, we must start at the end of our forward pass: the final computation of the *loss*. Visually it can be seen below:

And in code it is simply:
```python
loss = -logprobs[range(n), Yb].mean()
```
How does the **loss** change as we change **logprobs**? We can think about this via a simple example. Consider the mean of 3 numbers:
$loss = \frac{a + b + c}{3} = \frac{1}{3}a + \frac{1}{3}b + \frac{1}{3}c$
We see it can be rewritten in terms of the right most expression. If we then wanted to know how changing $a$ impacted the loss, we would arrive at a gradient of:
$\frac{ \partial \text{ loss} }{\partial a} = \frac{1}{3}$
And so more generally, each value that is being considered as part of a mean will have a gradient of $\frac{1}{n}$, where $n$ is the number of values under consideration (in our example this will be the batch size, $32$):
$\frac{ \partial \text{ loss} }{\partial \text{ logprobs}} = -\frac{1}{n}$
So in our example each value will have a gradient shown below. Note: all other values that are not plucked out simply have a derivative of 0 because they do not contribute to the loss (this is due to the loss we are utilizing, the log loss, i.e. the [Cross Entropy](Entropy,%20Cross%20Entropy%20and%20KL%20Divergence.md) - see more in my old blog
post [here](https://www.nathanieldake.com/Mathematics/05-Information_Theory-01-Cross-Entropy-and-MLE-walkthrough.html)). Visually this looks like (where grey is simply $0$):

And our gradient will be (a negative being introduced due to its use in the final equation used to compute loss):
```python
dlogprobs = torch.zeros_like(logprobs)
dlogprobs[range(n), Yb] = -1.0 / n
```
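One way to sanity check a manual gradient like this is against PyTorch's autograd. Below is a self-contained toy version of that comparison (made-up `logprobs` and targets, not the lecture's tensors):
```python
import torch

# Self-contained toy check of the dlogprobs formula (n = 4 examples, vocab of 5).
n = 4
logprobs = torch.randn(n, 5, requires_grad=True)   # stand-in for the real logprobs
Yb = torch.tensor([1, 0, 3, 2])                    # made-up targets

loss = -logprobs[range(n), Yb].mean()
loss.backward()

dlogprobs = torch.zeros_like(logprobs)
dlogprobs[range(n), Yb] = -1.0 / n
print(torch.allclose(dlogprobs, logprobs.grad))    # True
```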
#### 2. Gradient of `logprobs` with respect to `probs`
Next we need to compute the gradient of:
```python
logprobs = probs.log()
```
This is much easier since it is an element wise application of a log:
$\text{logprobs} = log(\text{probs})$
Meaning our gradient will just look like:
$\frac{\partial\text{ logprobs}}{\partial{\text{ probs}}} = \frac{1}{\text{probs}}$
Which in code is (after applying the chain rule and multiplying the gradient above by the gradient of loss wrt to log probs):
```python
dprobs = (1 / probs) * dlogprobs
```
Visually this looks like (where white is simply $\frac{1}{probs_{i,j}}$, I did not fill it out in order to reduce visual noise. The red is just to help guide your eye. At this point no masking or zeroing has occurred. That will be happening momentarily):

And of course remember that `dprobs` is representing the gradient of the `loss` with respect to `probs`:
$\frac{\partial \text{ loss}}{\partial \text{ probs}} = \frac{1}{\text{probs}} \times \frac{\partial \text{ loss}}{\partial \text{ logprobs}} = \frac{1}{\text{ probs}} \times -\frac{1}{n}$
Visually this looks like a simple element wise product (where we can see how the zeroing out of all entries that did not correspond to the true next character occurs):

It is worth taking a moment to discuss some of the intuition for the above block. We start with an incoming `dlogprobs` (the gradient of the loss with respect to the log probabilities) and _multiply_ it by `1 / probs`. Remember, `dlogprobs` is zero at every index that does not correspond to the true target, so those products remain 0 (the greyed out blocks below). This *kills any gradient information* about how probability mass is distributed across the incorrect characters. Why is this the case? And is that a bad thing? I have outlined my thoughts on the matter in [Cross Entropy vs Wasserstein Loss Function](Cross%20Entropy%20vs%20Wasserstein%20Loss%20Function.md).
Where the index _does correspond_ to the true target, `dlogprobs` has a non-zero entry, which then gets multiplied by `1 / probs`.
If the predicted probability of the true target is _high_ and _close to 1_ (meaning it is a good prediction) then we will have `1` divided by some number close to `1`, and our `dlogprobs` value will effectively just be passed through. However, if our prediction is _bad_ and our probability of the true target was some number far from `1` (and in turn close to `0`), then our `dlogprobs` will be ***boosted*** when multiplied by `1 / probs`. In other words, the *gradient* of the loss with respect to probs will be *boosted* where the true next character had a low probability.
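A quick numeric illustration of that boosting, assuming our batch size of $32$ (so the plucked-out `dlogprobs` entry is $-1/32$):
```python
# Compare a confident, correct prediction (prob 0.99) with a poor one (prob 0.01):
upstream = -1.0 / 32
print(upstream * (1 / 0.99))   # ~ -0.0316 -> gradient roughly passes through
print(upstream * (1 / 0.01))   # ~ -3.125  -> gradient boosted ~100x for the bad prediction
```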
#### 3. Gradient of `probs` with respect to `logits`
Notice that in order to create `probs` we utilize several lines of code:
```python
# cross entropy loss (same as F.cross_entropy(logits, Yb))
logit_maxes = logits.max(1, keepdim=True).values
norm_logits = logits - logit_maxes # subtract max for numerical stability
counts = norm_logits.exp()
counts_sum = counts.sum(1, keepdims=True)
counts_sum_inv = counts_sum**-1
probs = counts * counts_sum_inv
```
But these lines are just a broken out version of the **softmax** function. This function takes in **logits** (the final output of the neural network) and effectively normalizes them into probabilities. So for clarity I will group all of these sub-calculations together as part of the same section.
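As a quick sanity check (with toy logits of my own), the broken out lines above should reproduce `F.softmax` exactly:
```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 27)                         # toy logits standing in for the network output
logit_maxes = logits.max(1, keepdim=True).values
norm_logits = logits - logit_maxes                   # max subtraction does not change the softmax
counts = norm_logits.exp()
counts_sum = counts.sum(1, keepdims=True)
probs = counts * counts_sum**-1

print(torch.allclose(probs, F.softmax(logits, dim=1)))  # True
```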
##### 3.1 Gradient of `probs` with respect to `counts_sum_inv`
We start by determining the gradient of `probs` with respect to `counts_sum_inv`:
```python
probs = counts * counts_sum_inv
```
The first thing we must be aware of is how `counts` and `counts_sum_inv` combine. To start, `counts` is of shape `(32, 27)`, while `counts_sum_inv` is of shape `(32, 1)`. In other words, each example in our batch of `32` will have a single entry in `counts_sum_inv`.
This is important to understand because in order to perform an element wise multiplication, under the hood pytorch needs to:
1. Broadcast the `counts_sum_inv` to be shape `(32, 27)`
2. Perform an element wise multiplication between the two `(32, 27)` tensors
So, we can start by backpropagating through the multiplication first. Recall in the case of scalar multiplication how our local gradient is constructed:
$c = a \times b$
$\frac{\partial c}{\partial a} = b$
In our case this will look like:
$\text{probs} = \text{counts} \times \text{counts\_sum\_inv} $
$\frac{\partial \text{ probs} }{\partial \text{ counts\_sum\_inv} } = \text{counts}$
We then take into account the chain rule and have:
$\frac{\partial \text{ loss}}{\partial \text{ counts\_sum\_inv}} = \text{counts} \times \frac{\partial \text{ loss}}{\partial \text{ probs}}$
We now need to figure out how to handle the broadcast. We can think of our broadcast as taking a single value and having it branch out. When looking at micrograd we talked about how in the backward pass we need to take all the gradients that arrive at any one node and *sum* them (i.e. the total derivative). We can do this via:
```python
dcounts_sum_inv = (counts * dprobs).sum(1, keepdim=True)
```
This is worth a visual walk through. We start with `counts` and `counts_sum_inv`, in their original shapes:

We then broadcast `counts_sum_inv` to match the shape of `counts`:

And we then can perform a simple element wise multiplication:

We are trying to compute how `counts_sum_inv` impacts our `loss` (via `probs`). What we are seeing is that, while `counts_sum_inv` is a single column, it gets broadcasted to be 27 identical columns so the element wise multiplication works properly. Let us call the broadcasted version `counts_sum_inv_broad`. We then have:
```python
probs = counts * counts_sum_inv_broad
```
It is *easy* to compute the gradient of `probs` with respect to `counts_sum_inv_broad`. We simply use the multiplication rule:
```python
dprobs_wrt_counts_sum_inv_broad = counts
```
But we wish to know the gradient with respect to `counts_sum_inv` itself, the `(32, 1)` column vector. A way to think of this is that we have a branching process: we take `counts_sum_inv`, copy it 27 times, and pass each copy to be element wise multiplied with a single column of `counts`.

Because each copy of `counts_sum_inv` contains the same entries, those exact same entries participate in every column of the element wise product with `counts`. The result of course impacts `probs`, which impacts `loss`. So we must accumulate the *total impact* of `counts_sum_inv`, and that is done via the total derivative (summing across the 27 columns).
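This “broadcast forward, sum backward” duality is easy to verify on toy shapes (my own toy tensors below, with a `(4, 1)` column playing the role of `counts_sum_inv`):
```python
import torch

# counts: (4, 3), col: (4, 1). Broadcasting col across the 3 columns in the forward pass
# means its gradient in the backward pass is the row-wise *sum* of counts * upstream.
counts = torch.randn(4, 3)
col = torch.randn(4, 1, requires_grad=True)           # plays the role of counts_sum_inv

probs = counts * col                                  # col is broadcast to (4, 3) here
upstream = torch.randn(4, 3)                          # stand-in for dloss/dprobs
probs.backward(upstream)

manual = (counts * upstream).sum(1, keepdim=True)     # same rule as dcounts_sum_inv above
print(torch.allclose(col.grad, manual))               # True
```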
### Lecture 6 → [Building a Wavenet](https://www.youtube.com/watch?v=t3YJ5hKiMQ0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=6&t=102s)
The challenge with our current implementation is that we are crushing all of our characters (currently 3) into a single layer at the beginning of the network. Even if we make our hidden layers wider (more neurons) or add additional hidden layers, it is still silly to squash all of our information in a single step. A better approach would be to use something like [Wavenet](https://arxiv.org/pdf/1609.03499.pdf). Its architecture is as follows:

We can think of this as taking our input and grouping it into pairs of two characters, i.e. *bigrams*. We *learn* the bigram representation (seen in hidden layer 1), and then *fuse* the bigrams together! The result would be a four-character-level chunk. So we are *slowly* squashing our information until we arrive at the final layer.
The “technical” name for this type of layer is a *dilated causal convolutional layer*. This makes it sound very scary, but the core idea is actually quite simple:
> This architecture allows us to have **progressive fusion of information**.
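Mechanically, this progressive fusion is little more than a reshape that groups consecutive time steps before each linear layer. Here is a minimal sketch with toy shapes (names and dimensions are mine, not the lecture's exact module):
```python
import torch

# Toy batch: 4 examples, 8 characters of context, 10-dimensional character embeddings.
B, T, C = 4, 8, 10
emb = torch.randn(B, T, C)

# Fuse consecutive pairs of characters ("bigrams"): (4, 8, 10) -> (4, 4, 20).
# A linear layer applied to the last dimension now learns one representation per bigram.
pairs = emb.view(B, T // 2, 2 * C)
print(pairs.shape)   # torch.Size([4, 4, 20])

# Repeating the same grouping after each hidden layer fuses bigrams into 4-grams, then
# 4-grams into 8-grams: information is squashed progressively rather than all at once.
```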
Another idea that is talked about in this lecture is that of *controlling for the number of parameters* as you experiment with different architectures. For instance, if we compare a wavenet to a standard, flat neural network with several hidden layers, we would like to be able to compare how *useful* it is to transmit information via the wavenet approach compared to the standard approach. In order to ensure that we are *only comparing architecture* and not simply *which has more parameters* we must control for the number of parameters.
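In PyTorch, counting parameters is just a sum of element counts over all parameter tensors. A sketch with a toy flat model (hidden sizes would then be tuned until the two architectures under comparison match on this number):
```python
import torch.nn as nn

# A toy "flat" model; to compare fairly against a wavenet-style model, the hidden sizes of
# both would be adjusted until these totals match.
flat_model = nn.Sequential(nn.Linear(30, 200), nn.Tanh(), nn.Linear(200, 27))
num_params = sum(p.numel() for p in flat_model.parameters())
print(num_params)   # 30*200 + 200 + 200*27 + 27 = 11627
```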