# Cross Entropy vs Wasserstein Loss Function

### TLDR

We can think of a loss function as a means of _weighting_ each piece of probability mass that we predict will land in a bin.

**Cross entropy** places _all_ weight on the mass assigned to the _true bin_.

**Wasserstein** places _no_ weight on the mass assigned to the true bin (since the probability mass predicted on that bin has a distance of 0 to that bin), and all weight on the incorrect bins. The more mass assigned to those incorrect bins, the greater the loss. However, Wasserstein also takes into account the *distance* from each incorrect bin to the true bin. This distance is the *weight* used.

These two metrics are of course related via the constraint:

$$\text{probability mass assigned to true bin} + \text{probability mass assigned to incorrect bins} = 1$$

When coupled with neural networks we can interpret this as:

* Cross entropy has a high loss when low probability is assigned to the true bin. In that case, a lot of mass must have been assigned to incorrect bins. A neural network with a cross entropy loss will end up tuning its weights so as to *remove mass from incorrect bins in proportion to the mass in each bin*. Bins that were assigned more mass will have more removed (via tuning, so that that mass is then assigned to the true bin). Cross entropy *forces* the network to move mass to the true bin. Over many training examples this ends up producing a finely calibrated distribution.
* Wasserstein has a high loss when high probability is assigned to bins far away from the true bin. A neural network with a Wasserstein loss will end up tuning its weights so as to remove as much mass as possible from bins far away from the true bin. Due to our constraint above, when mass is removed from far-away bins it has nowhere to go but *closer to* the true bin. To minimize the loss, all mass would of course fall in the true bin. Wasserstein forces the network to move mass *closer to* the true bin. A minimal numeric sketch of this difference follows.
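To make the weighting concrete, here is a minimal sketch (not from the original note; the bins and predictions are made up for illustration). It assumes unit-width bins and a one-hot target, in which case the 1D Wasserstein distance reduces to the distance-weighted sum of the mass on the incorrect bins:

```python
import numpy as np

def cross_entropy(p, true_bin):
    # All weight on the mass assigned to the true bin.
    return -np.log(p[true_bin])

def wasserstein_1d(p, true_bin):
    # With a one-hot target and unit-width bins, every unit of mass on an
    # incorrect bin must travel |i - true_bin| bins to reach the true bin,
    # so that distance acts as the weight on the bin's mass.
    distances = np.abs(np.arange(len(p)) - true_bin)
    return np.sum(p * distances)

true_bin = 2  # index of the bin containing the true target

# Two predictions with the SAME mass (0.5) on the true bin, but with the
# leftover mass placed on nearby vs. distant bins.
p_near = np.array([0.00, 0.25, 0.5, 0.25, 0.00])
p_far  = np.array([0.25, 0.00, 0.5, 0.00, 0.25])

print(cross_entropy(p_near, true_bin), cross_entropy(p_far, true_bin))    # 0.693  0.693
print(wasserstein_1d(p_near, true_bin), wasserstein_1d(p_far, true_bin))  # 0.5    1.0
```

Cross entropy cannot tell these two predictions apart, while Wasserstein penalizes the second one twice as much.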
### Longer TLDR

Earth mover's distance is a *slightly* more appropriate loss function when you have a classification problem with *ordinal* classes (or, more generally, whenever you can define a *distance* between your classes).

Why is it more appropriate? The intuition is as follows. Imagine we are trying to predict the target distribution for an input point, $x$. We have bins representing target values, where bin $b_1 = [-1000, -100]$, $b_7 = [1,2]$, and $b_8 = [2,3]$. The $y$ corresponding to our $x$ of interest falls in $b_7$. Now consider two possible scenarios:

1. Our model assigns $0.4$ probability mass to $b_1$ and $0.01$ probability mass to $b_8$
2. Our model assigns $0.01$ probability mass to $b_1$ and $0.4$ probability mass to $b_8$

How can we reason about these scenarios? Is (1) or (2) the better prediction? We must account for the following factors:

* In the case of (1), it is possible that so much mass was assigned to $b_1$ because in *other training examples* that looked *similar* to $x$ the true value fell in $b_1$.
* If this *is* the case, should we actually want more probability mass to fall in $b_8$ for this example? Of course we want more mass in $b_7$, the true bin. But do we want more mass in $b_8$?
* We can answer this by making an argument based on *noise*:
    * Say we added a bit of *noise* to the bins we chose; e.g., our bin edges were subject to a bit of noise when we selected them. So maybe the two edges of $b_7$ follow the normal distributions $\mathcal{N}(1, 0.5)$ and $\mathcal{N}(2, 0.5)$, which means we could easily end up with $b_7 = [1.2, 2.5]$. We want to conduct our modeling in such a way that the specifics of the bin edges do not drastically change the outcome.
    * Assume that our input $x$ has some noise, $\epsilon$, associated with it. If a small amount of noise would move the true target from $b_7$ to $b_8$, then assigning more mass to $b_8$ in our prediction is indeed a good thing! It makes our prediction mechanism *more robust* to noise.
    * We can then ask: is the problem that we are trying to model susceptible to noise? If it is, then we want to incorporate distance between classes in some way.

In the end, it really boils down to:

> Use **cross entropy** if you wish for probability mass to be transferred from incorrect bins to the correct one based *only* on the probability assigned to those bins.
>
> Use **earth movers** if you wish for probability mass to be transferred from incorrect bins to the correct one based on the probability assigned to those bins *and* the distance from the incorrect bins to the correct one.

In a neural network context, note that when computing the gradients, we accumulate how specific weights impact the loss. Say we have a good deal of probability mass assigned to an incorrect bin. This does not *directly* impact the cross entropy loss. However, it does *indirectly* impact it! How? Well, if we end up with a large logit for an incorrect class, that *increases* the denominator of the softmax, which *decreases* the probability of the true class! To *increase* the probability of the true class, we can increase its logit and *decrease* the logits of the other classes, hence decreasing the denominator of the softmax. So, with a cross entropy loss we tweak our weights so as to simply reduce the logits of *all other classes*, penalizing the *largest* logits the most! But we may want to penalize the logits of *nearby* classes *less* than those of *distant* classes! If there is a *preference* in how we penalize logits, and we can encode that preference via a *distance* between classes, then earth movers is a great choice! Earth movers can be interpreted almost as a form of regularization that prioritizes placing mass near the true target.

### Deep Dive…

Why is the [cross entropy/log loss](Entropy,%20Cross%20Entropy%20and%20KL%20Divergence.md) a better choice than the [Wasserstein (Earth Movers) Distance](Wasserstein%20(Earth%20Movers)%20Distance.md) for a loss? Or at the very least *equivalent*?

![](Screenshot%202022-12-29%20at%2010.09.17%20AM.png)
![](Screenshot%202022-12-29%20at%2010.09.33%20AM.png)
![](Screenshot%202022-12-29%20at%2010.09.49%20AM.png)
![](Screenshot%202022-12-29%20at%2010.10.08%20AM.png)

### When you are specifically dealing with binning

Ideally, there should be no oddities or artifacts in our loss that arise purely from how we chose our bins. But consider another example: we have a true target that falls in the bin $[2,3]$. We predict that the value falls in bin $[1,2]$ with probability $0.5$ and in bin $[3,4]$ with probability $0.5$. Under cross entropy, this would yield *the same loss* as if we predicted the true value fell in bin $[-100, -99]$ with probability $1$! That is clearly a problem: both predictions assign zero mass to the true bin, yet the first is far closer to the truth. The sketch below works through this example.
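A small sketch of this failure mode (illustrative; the bin edges and predictions come from the example above, and a tiny epsilon stands in for zero probability so the cross entropy stays finite):

```python
import numpy as np

eps = 1e-12  # stand-in for zero mass so -log() stays finite

bin_centers = np.array([-99.5, 1.5, 2.5, 3.5])  # bins [-100,-99], [1,2], [2,3], [3,4]
true_bin = 2  # the true target falls in [2,3]

# Prediction A: all mass split between the two bins adjacent to the true bin.
p_adjacent = np.array([eps, 0.5, eps, 0.5])
# Prediction B: all mass on a bin roughly 100 units away.
p_distant  = np.array([1.0, eps, eps, eps])

for p in (p_adjacent, p_distant):
    ce = -np.log(p[true_bin])  # identical for both predictions
    # 1D Wasserstein against a one-hot target: distance-weighted mass.
    w1 = np.sum(p * np.abs(bin_centers - bin_centers[true_bin]))
    print(f"cross entropy: {ce:.1f}   wasserstein: {w1:.1f}")
# cross entropy: 27.6   wasserstein: 1.0
# cross entropy: 27.6   wasserstein: 102.0
```

Cross entropy sees both predictions as equally (and maximally) wrong, while Wasserstein correctly flags the second as roughly a hundred times worse.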
---
Date: 20221229
Links to: [Neural Networks MOC](Neural%20Networks%20MOC.md) [Machine Learning MOC](Machine%20Learning%20MOC.md)
Tags: #review

References:
* Good Notes: Earth Movers vs Cross Entropy Loss
* [Squared Earth Mover’s Distance-based Loss for Training Deep Neural Networks](https://arxiv.org/pdf/1611.05916.pdf)