# Cross Entropy and Neural Network Mechanics

Cross entropy is simply one way of measuring the difference between two probability distributions, say $p$ and $q$:

![](Screenshot%202023-01-15%20at%209.50.24%20AM.png)

Formally, for the discrete case we can define the cross entropy, $H$, as:

$H(p, q)= \sum_{x \in X} p(x) \log \Big[ \frac{1}{q(x)} \Big]$

Visually we can think of overlaying the two distributions, and then pairing up the corresponding bins and their probability masses as shown below:

![](Screenshot%202023-01-15%20at%209.53.28%20AM.png)

This is all well and good, but simply looking at these formulas fails to capture crucial intuitions about how this type of loss (particularly the discrete case) will interplay with neural networks.

## 1. The Andrej Karpathy Vantage Point

### 1.1 The Softmax: From Logits to Probabilities

To start, one of the interesting things about using cross entropy in a predictive context is that $p$ is always going to have all probability mass concentrated on a single discrete bin (whatever the true value was for the example).

![](Screenshot%202023-01-15%20at%209.56.21%20AM.png)

This drastically reduces the cross entropy loss calculation! Effectively the only thing that matters at this point is how much probability mass our predicted distribution, $q$, assigned to the correct bin:

![](Screenshot%202023-01-15%20at%209.57.29%20AM.png)

How can we improve this loss? Better put, what approaches does a neural network have at its disposal to reduce the loss? To reason about this effectively we need to think about *where* these probabilities come from. Mechanically, the probabilities output by a neural network are simply the final layer **logits** run through the **softmax**. We then “pluck out” the predicted value of the true target bin, take the log of it, multiply by $-1$, and voila: we have our loss.

![](Screenshot%202023-01-15%20at%2010.01.57%20AM.png)

So our softmax effectively applies the **constraint** that our probabilities must sum to $1$. If $\mathbf{z}$ is our logits vector, then the softmax is formally defined as:

$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$

We can now return to our question: how can a neural network decrease the loss? Well, we can clearly see that the final cross entropy loss is a function of the predicted probability of the true bin. That is a function of *two things*:

1. The *logit* of the true bin
2. The *logits* of the incorrect bins, due to the fact that the *probability* of the correct bin is generated via the softmax.

So our neural network, in an attempt to minimize loss, should tune its weights such that the logit of the correct bin is *increased*, and the logits of the incorrect bins are *decreased*. To see how it does this in more detail, we are going to need to introduce the **[Gradient](Gradient.md)**.

### 1.2 Enter: The Gradient

I have quite a few notes on the [gradient and neural network intuitions](Neural%20Network%20Intuitions.md), so for now I’ll simply gloss over the key intuition. A neural network can be thought of as a **computational graph**. It consists of nodes connected via edges. The nodes are *variable values* and the edges connect these values via some sort of operation. For example, if we have $a + b = c$, then $a, b, c$ would be nodes, and the $+$ operation would have two incoming edges, one from $a$ and one from $b$, and one outgoing edge, to $c$. In a neural network this large computational graph will generate a *prediction*.
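To make this concrete, here is a minimal plain-Python sketch (all values invented) of a tiny graph, $c = a + b$ followed by $d = c \cdot w$, together with the per-node gradients that backpropagation, covered next, would compute via the chain rule:

```python
# A tiny computational graph built by hand: nodes are values, edges
# are operations. Forward pass:
a, b, w = 2.0, 3.0, 0.5
c = a + b            # node c, produced by the + operation
d = c * w            # node d, the graph's output (our "prediction")

# Backward pass: attach to every node the gradient of the output d
# with respect to that node, using the chain rule:
dd_dd = 1.0          # gradient of d w.r.t. itself
dd_dc = w * dd_dd    # d = c * w  =>  dd/dc = w
dd_dw = c * dd_dd    # d = c * w  =>  dd/dw = c
dd_da = 1.0 * dd_dc  # c = a + b  =>  dc/da = 1
dd_db = 1.0 * dd_dc  # c = a + b  =>  dc/db = 1

print(d)                    # 2.5
print(dd_da, dd_db, dd_dw)  # 0.5 0.5 5.0
```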
This prediction is compared to the true *target* in some way, producing a *loss*. The worse the prediction, the worse the loss. The key algorithm that makes modern neural networks so powerful is [Backpropagation](Backpropagation.md). Again, I have discussed its details at length elsewhere, but for our purposes the key thing I want us to think about is that every node in the compute graph will have a *gradient* associated with it. This gradient holds information telling us how changing the node’s value will impact the loss. So, in our case, we will have:

* Gradient information for how the *predicted probabilities* impact the loss
* Gradient information for how the *predicted logits* impact the loss

Why is it worth making this key distinction? Well, consider the gradient of the loss with respect to the predicted probabilities. There are two distinct cases here: when the predicted probability is for the true class, and when it is not. When the predicted probability is for the true class there will be gradient information, and as we improve our prediction it will reduce the loss. However, in the event that the predicted probability is *not* for the correct class, the cross entropy formula multiplies the term containing that predicted probability by $0$, the true probability of that class. This *kills* any gradient information present. So at first glance you may think that the predicted probabilities associated with any classes other than the true class have *no impact on the loss*. However, that is why we must also look at the *logits*!

There is a [great explanation of this by Andrej Karpathy](https://youtu.be/q8SA3rM6ckI?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&t=5440), but for our purposes let me just jump straight to the final gradients:

$\text{if i is not correct class:} \;\;\;\;\;\frac{\partial \text{ loss}}{\partial l_i} = p_i$

$\text{if i is correct class:} \;\;\;\;\;\;\frac{\partial \text{ loss}}{\partial l_i} = p_i - 1$

The interpretation here is that if we increase a logit $l_i$ that is not associated with the true class, that will increase our loss by $p_i$ for each unit increase of $l_i$. On the other hand, if we increase the logit $l_i$ that *is* associated with the true class, that will change our loss by $p_i - 1$, a negative number, for each unit increase of $l_i$. In other words, it will *decrease* our loss by $1 - p_i$.

> This means that the network has a strong incentive to *increase* the logit associated with the true class and *decrease* the logits associated with the incorrect classes.

Visually this ends up looking like the images below:

![](Screenshot%202023-01-15%20at%2011.33.50%20AM.png)

![](Screenshot%202023-01-15%20at%2011.34.00%20AM.png)

A nice physical analogy is to think of this gradient as a *force*. At each cell (colored above) there exists a force, the gradient, that will *pull down* on the probabilities of the incorrect classes, and there will be a force *pulling up* on the probability of the correct class. We can see that the amount of push and pull is *exactly equal* because the sum of the individual gradients (the sum of the above vector) is $0$. This allows us to think of our neural network as a massive pulley system that pulls up on the probability of the correct class and pulls down on the probabilities of the incorrect classes. So there is effectively a tension translating through our entire system (neural net, i.e. computational graph) that ends up making its way to our tunable weights and biases, at which time an update is performed that gives in to the tug generated via the process above.
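As a quick sanity check of these gradients and the pulley claim, here is a small numpy sketch (the logits are made up) that compares the analytic gradients $p_i$ and $p_i - 1$ against finite-difference estimates, and confirms the gradient components sum to zero:

```python
import numpy as np

# A numeric sanity check of the logit gradients (made-up logits).
# Analytic claim: d loss / d l_i = p_i for incorrect classes,
# p_i - 1 for the correct class, and the components sum to zero.
logits = np.array([2.0, 1.0, 0.1, -0.5])
target = 0                                   # index of the true class

def ce_loss(z):
    p = np.exp(z) / np.exp(z).sum()          # softmax
    return -np.log(p[target])                # cross entropy loss

p = np.exp(logits) / np.exp(logits).sum()
grad = p.copy()
grad[target] -= 1.0                          # analytic gradient

# Finite-difference check of each component:
eps = 1e-6
for i in range(len(logits)):
    bumped = logits.copy()
    bumped[i] += eps
    numeric = (ce_loss(bumped) - ce_loss(logits)) / eps
    print(f"l_{i}: analytic {grad[i]:+.4f}, numeric {numeric:+.4f}")

print("sum of gradient components:", grad.sum())  # ~0: pushes and pulls balance
```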
A very key point must be called out here:

> The amount of “force” that we are applying is *proportional* to the probabilities that came out in the forward pass. The amount by which our prediction is off is the amount by which there will be a push or pull in that dimension.

## 2. The Michael Nielsen Viewpoint

All of this time we have not really *motivated* why we would want to use cross entropy as a loss function as opposed to, say, mean squared error. Moving forward I am going to excerpt heavily from Michael Nielsen’s deep learning book, particularly [chapter 3](http://neuralnetworksanddeeplearning.com/chap3.html#the_cross-entropy_cost_function).

> Most of us find it unpleasant to be wrong. Soon after beginning to learn the piano I gave my first performance before an audience. I was nervous, and began playing the piece an octave too low. I got confused, and couldn't continue until someone pointed out my error. I was very embarrassed. Yet while unpleasant, we also learn quickly when we're decisively wrong. You can bet that the next time I played before an audience I played in the correct octave! By contrast, **we learn more slowly when our errors are less well-defined**.
>
> Ideally, we hope and expect that our neural networks will learn fast from their errors. Is this what happens in practice?

Nielsen provides two fantastic animations showing that when our starting weight and bias are very bad (i.e. they produce *poor* predictions) the learning of our network starts out much more slowly! However, when our parameters are *close* to the optimal values (yielding good predictions) the learning starts out very fast.

> This behavior is strange when contrasted to human learning. As I said at the beginning of this section, we often learn fastest when we're badly wrong about something. But we've just seen that our artificial neuron has a lot of difficulty learning when it's badly wrong - far more difficulty than when it's just a little wrong. What's more, it turns out that this behaviour occurs not just in this toy model, but in more general networks. Why is learning so slow? And can we find a way of avoiding this slowdown?
>
> To understand the origin of the problem, consider that our neuron learns by changing the weight and bias at a rate determined by the partial derivatives of the cost function, $\frac{\partial Loss}{\partial w}$ and $\frac{\partial Loss}{\partial b}$. So saying "learning is slow" is really the same as saying that those partial derivatives are small. The challenge is to understand why they are small.

You can follow along with Nielsen’s argument in the book, but the TLDR is that the gradients are small because we are using a quadratic loss function (e.g. mean squared error):

$Loss = (y - \hat{y})^2$

The challenge arises when we look at *the gradient* of the $Loss$ with respect to our prediction, $\hat{y}$:

$\frac{\partial Loss}{\partial \hat{y}} = 2(\hat{y} - y)$

This is a **linear function** of our prediction, $\hat{y}$. Imagine the true value is $y=1$. Whether we predict $\hat{y} = 0.98$ or $\hat{y} = 0.01$, a small improvement in our prediction will shift the *gradient* by the same amount!
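The arithmetic shown next can be reproduced in a couple of lines. A minimal sketch, assuming $y = 1$:

```python
# The quadratic loss gradient is linear in the prediction: nudging
# y_hat by 0.01 shifts the gradient by the same 0.02 regardless of
# how wrong the prediction is (y = 1 assumed).
y = 1.0
quad_grad = lambda y_hat: 2 * (y_hat - y)

for y_hat in (0.01, 0.02, 0.98, 0.99):
    print(f"y_hat = {y_hat:.2f}  ->  gradient = {quad_grad(y_hat):+.2f}")
```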
We can see this clearly below (the change in the gradient in both the top and bottom rows is the same, $0.02$):

$2(0.01 - 1) = -1.98 \longrightarrow 2(0.02 - 1) = -1.96$

$2(0.98 - 1) = -0.04 \longrightarrow 2(0.99 - 1) = -0.02$

Nielsen goes on to mention that this situation can be made even worse if you use a sigmoid activation function (I won’t cover that here since it is common practice to use ReLUs at this point). You are probably one step ahead of me, realizing that cross entropy can help us here! At its core, it really comes down to the way the *logarithm* behaves. Consider the derivative of the cross entropy loss (only looking at the predicted value for the true target, where $p(x) = 1$):

$CE = p(x) \times -\log \Big[ q(x) \Big] = -\log \Big[ q(x) \Big]$

$\frac{\partial CE}{\partial q(x)} = -\frac{1}{q(x)}$

We can immediately see that this gradient is *not a linear function of* $q(x)$. Consider the same situation we discussed earlier:

$-\frac{1}{0.01} = -100 \longrightarrow -\frac{1}{0.02} = -50$

$-\frac{1}{0.98} = -1.02 \longrightarrow -\frac{1}{0.99} = -1.01$

We see that the gradient is **nonlinear**. This means that for predictions that are *very wrong* the network will produce a large gradient and hence yield larger updates to the guilty parameters. Nielsen once again has [an animation](http://neuralnetworksanddeeplearning.com/chap3.html#exercise_35813) that shows how this allows for faster learning.

> In particular, when we use the quadratic cost learning is _slower_ when the neuron is unambiguously wrong than it is later on, as the neuron gets closer to the correct output; while with the cross-entropy learning is faster when the neuron is unambiguously wrong.

Nielsen continues, next touching on the softmax function. I have to give him credit, the [animation](http://neuralnetworksanddeeplearning.com/chap3.html#softmax) he has is one of the best I’ve seen. The biggest takeaway here is that the outputs of the softmax are *dependent* on each other! This is most readily demonstrated if we look at the ways to *increase* $a_1$. Let us start with the following configuration:

![400](Screenshot%202023-01-15%20at%203.47.04%20PM.png)

We can then watch what happens to $a_1$ as we *increase* $z_1$ by $0.4$ (left-most image below) or *decrease* one of $z_2, z_3, z_4$:

![](Screenshot%202023-01-15%20at%203.51.45%20PM.png)

We can see that a nudge of $0.4$ has a *very different impact* on $a_1$ depending on which $z$ the nudge was applied to!

* An increase of $0.4$ to $z_1$ yields the largest increase to $a_1$
* A decrease of $0.4$ to $z_3$ yields the second largest increase to $a_1$
* The decreases to $z_2$ and $z_4$ both yield very small increases to $a_1$

This starts to give us an idea of where the name “softmax” comes from! The idea is that if one value is even slightly larger than the rest it will rapidly be given most of the probability mass. This is similar to how a $\max$ function would assign all probability mass, $1$, to the maximum value and $0$ to the rest.

### Putting it all together: The animation and the intuition

We can now take the notion of our neural network and the loss as forces pulling on our probabilities, and tie it to the animation, where we actually see that this is the case. It also starts to make more intuitive sense how the gradient of the loss with respect to the logits comes about. Remember, the gradient is $p_i$ in the event that it is not the correct class, and $p_i - 1$ in the event that it is.
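To tie the two views together numerically, here is a small numpy sketch of the nudge experiment (the logits are invented, chosen so the mass is spread similarly to Nielsen’s example; class $1$ is treated as the correct class, so the loss is $-\log a_1$):

```python
import numpy as np

# A sketch of the softmax coupling described above: nudge each logit
# by 0.4 and watch how a_1 (and hence the loss -log(a_1)) responds.
def softmax(z):
    e = np.exp(z - z.max())              # shift for numerical stability
    return e / e.sum()

z = np.array([2.5, -1.0, 3.2, 0.5])      # invented logits z_1 .. z_4
a1 = softmax(z)[0]

for j, delta in [(0, +0.4), (1, -0.4), (2, -0.4), (3, -0.4)]:
    z_new = z.copy()
    z_new[j] += delta
    a1_new = softmax(z_new)[0]
    print(f"nudge z_{j+1} by {delta:+.1f}: "
          f"a_1 {a1:.3f} -> {a1_new:.3f}, "
          f"loss {-np.log(a1):.3f} -> {-np.log(a1_new):.3f}")
```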
We can see this clearly when looking at logits $z_2$ and $z_4$: their corresponding probabilities are very low, and when we nudge the logits themselves the change in the probability of the correct class, $a_1$, is very small (and hence the change in the loss is small)! So we can see visually that for incorrect classes with a small predicted probability, making that probability even smaller does not lead to a meaningful decrease in the loss.

### Ignoring Distance Between Classes

One key thing to realize about the above breakdown is that we do not take into consideration any notion of distance between classes. In order to decrease the loss, the neural network simply knows that it should reduce the logits of incorrect classes in a way that is proportional to their predicted probabilities, and increase the logit of the correct class in a way that is proportional to its predicted probability minus $1$. In other words:

> This process treats all incorrect classes **equally**.

But that may not be desired! Consider if we are trying to predict a discrete distribution. Our classes are bins, $[0,1], [1,2], [2,3], [3,4]$. If the true bin is $[0,1]$, then it may very well be that if our model predicts $0.8$ probability mass to fall in $[1,2]$, that is *better* than predicting the same mass to fall in $[3,4]$. The question is: do we also want to scale how much we reduce the probability of an incorrect class based on how close that class is to the *true* class? With simple vanilla cross entropy, if we place $0.4$ probability mass on an incorrect bin but it is right next to the true bin, do we want to *slightly* reduce the corresponding gradient of the loss with respect to that logit, such as:

$\text{if i is not correct class:} \;\;\;\;\;\frac{\partial \text{ loss}}{\partial l_i} = p_i \times \frac{1}{\text{distance of i to true class}}$

For more on this see [Cross Entropy vs Wasserstein Loss Function](Cross%20Entropy%20vs%20Wasserstein%20Loss%20Function.md).

---

Date: 20230114

Links to: [Neural Networks MOC](Neural%20Networks%20MOC.md) [Neural Network Intuitions](Neural%20Network%20Intuitions.md) [Entropy, Cross Entropy and KL Divergence](Entropy,%20Cross%20Entropy%20and%20KL%20Divergence.md)

Tags:

References:
* [Building makemore Part 4: Becoming a Backprop Ninja - YouTube](https://youtu.be/q8SA3rM6ckI?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&t=5629)
* [Neural networks and deep learning](http://neuralnetworksanddeeplearning.com/chap3.html#the_cross-entropy_cost_function)
* [The Softmax function and its derivative - Eli Bendersky's website](https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/)
* [Nathaniel Dake Blog](https://www.nathanieldake.com/Deep_Learning/01-Neural_Networks-02-Neural-Network-Training.html)