# Loss Function

The loss function, a function of our model's predictions compared to the target, can also be thought of directly as a function of the model parameters. This provides us with a loss landscape, shown below:

![Loss Function|300](Screen%20Shot%202021-12-03%20at%206.43.35%20AM.png)

This could lead us to a straightforward algorithm such as:

1. Find a *direction* $v$ in which $L(\theta)$ decreases
2. Update theta as: $\theta \leftarrow \theta + \alpha v$

### An example

As a basic example, suppose $L$ is the log-likelihood of the data:

$L(\theta) = \sum_i \log p_{\theta}(y_i \mid x_i)$

Since we want to *maximize* a likelihood, we can equivalently *minimize* its negation:

$\theta^* \leftarrow \arg\min_{\theta} \big\{ -L(\theta) \big\}$

### Key Idea

Regardless of the optimization method we choose, for the most part it will use *local* information to move in a direction of decrease. Thankfully we have a wonderful set of mathematical tools that can help us in this regard, such as the [Gradient](Gradient.md). The gradient returns a *vector*; applied across our entire loss function it yields a [vector field](Vector-Fields.md). It is useful to note that the negative gradient is *not* the only direction that works: any vector with a *positive* [dot product](Dot%20Product.md) with the descent direction $v$ (equivalently, a *negative* dot product with the gradient) will also move us in a direction of decrease.

### Increasing the signal in the context of neural networks

Consider a common neural network trained to classify images with a cross entropy loss function. Recall that [Cross Entropy](Entropy,%20Cross%20Entropy%20and%20KL%20Divergence.md) effectively only provides supervision in two ways:

1. Direct supervision of the *true class* (i.e. if the true class is $y = cat$ then the model will directly be incentivized to update its weights in a way that *increases* the probability of $y = cat$).
2. Indirect supervision of the *non-true classes* by means of a *constraint* (our probabilities must sum to 1, so one way of *increasing* the probability of $y = cat$ is by *decreasing* the probability of $y = \{ c \in C \mid c \neq cat \}$. This can be done by updating the weights such that the final logits for those classes decrease; due to the softmax layer, the probability of $y = cat$ will then increase).

More on the above can be seen in the paper [Calibrating Deep Neural Networks by Pairwise Constraints](https://openaccess.thecvf.com/content/CVPR2022/papers/Cheng_Calibrating_Deep_Neural_Networks_by_Pairwise_Constraints_CVPR_2022_paper.pdf), but there is a key component I want to call out here: this notion of loss (particularly cross entropy and/or KL divergence) is mainly suited for classification problems where you have no notion of *ordinality*. In other words, it is not entirely clear whether $cat$ is closer to $fire \; hydrant$ or $bus$.

However, what if we were trying to train a neural network to predict classes where there *is* a notion of ordinality? For instance, say we have *discretized* a continuous target into bins. There is clearly a notion of ordinality between those bins (a clear example of this is in [Google's MetNet paper](https://arxiv.org/pdf/2003.12140.pdf), and it is also outlined in ML Design Patterns). In this scenario we are really trying to predict a *distribution*.

![](Screen%20Shot%202022-11-18%20at%207.28.46%20AM.png)

But here is the problem: cross entropy is not well suited to this task! Why is that? Because it leaves a key piece of **geometric information** on the table. Namely, our domain (our binned target, the "classes") has ordinality; the bins are embedded in the real line, $\mathbb{R}$.
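To make this concrete, below is a minimal sketch showing the problem numerically. The `cross_entropy` helper and the toy binned distributions are purely illustrative assumptions, not from any library or from the papers above:

```python
import numpy as np


def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_i p_i * log(q_i); eps guards against log(0)
    return -np.sum(p * np.log(q + eps))


# Three distributions over 10 ordered bins with non-overlapping support:
# one in the middle, one just to its left, one far off to its right.
middle = np.array([0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0])
near   = np.array([0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
far    = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5])

# Both print the same value (~27.6): once the supports stop overlapping,
# cross entropy no longer registers how far apart the bins are.
print(cross_entropy(middle, near))
print(cross_entropy(middle, far))
```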
Consider the distributions below.

![](Screen%20Shot%202022-11-18%20at%207.31.44%20AM.png)

We would clearly say that the left-most distribution is closer to the middle one than the right-most is. However, cross entropy and KL divergence would consider them **equidistant**!

Now this is not only a bad loss in terms of descriptive accuracy, it can be a disaster in terms of training! Why is that? Well, consider the **gradient** of the loss function as we let the predicted distribution shift along the input space: it will be effectively $0$ whenever our model's predicted distribution is too far from the true distribution. This gradient is the *only* piece of information we have access to in order to update our model to make better predictions, and because we have a poorly chosen loss function, we will struggle to make progress.

![](Screen%20Shot%202022-11-18%20at%207.32.56%20AM.png)

A good gradient will tell you that the way to improve is to move from the top plot below to the bottom plot. However, if our divergence measure thinks these two distributions are almost the same, then of course it won't provide much gradient.

![](Screen%20Shot%202022-11-18%20at%207.38.32%20AM.png)

The solution is to use the [Wasserstein (Earth Movers) Distance](Wasserstein%20(Earth%20Movers)%20Distance.md).
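As a quick sanity check, here is a sketch of that fix using `scipy.stats.wasserstein_distance`, reusing the toy bins and distributions from the cross entropy sketch above:

```python
import numpy as np
from scipy.stats import wasserstein_distance

bins = np.arange(10)  # bin centers, embedded in the real line

middle = np.array([0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0])
near   = np.array([0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
far    = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5])

# Unlike cross entropy, EMD uses the geometry of the support:
print(wasserstein_distance(bins, bins, middle, near))  # 2.0: mass moved 2 bins
print(wasserstein_distance(bins, bins, middle, far))   # 4.0: mass moved 4 bins
```

The distance now grows with how far the probability mass must travel, so the training signal no longer saturates once the predicted and true distributions stop overlapping.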
### MSE vs Custom Loss

It's worth asking why we are even performing analyses that measure model performance in the first place. Is it to build a model that achieves the lowest mean squared error? The lowest cross entropy? No: it is because we want to **use the results of this inference**. We want to convey signal to a downstream agent (a portfolio builder in this case). We would like a method that stresses the importance of the **payoffs of decisions**, not just the accuracy of the estimation alone. This is summed up perfectly by Carveth Read:

> It is better to be roughly right than precisely wrong.

Historically, loss/error functions have been motivated by (1) mathematical ease and (2) their robustness to application (that is, they are objective measures of loss). The first motivation has really held back the full breadth of loss functions. With computers being agnostic to mathematical convenience, we are free to design our own loss functions. By shifting our focus from trying to be incredibly precise about parameter estimation to focusing on the outcomes of our parameter estimation, we can customize our estimates to be optimized for our application. This requires us to design new loss functions that reflect our goals and outcomes. Some examples of more interesting loss functions follow.

### Goal

When running error analysis we want to see information that helps us prioritize the next best action to take. That can be a new feature to add, a model change to make (an upstream change), or a portfolio building update (a downstream change). This requires **clarity** in what we are seeing. It must be *easy* to reason about and **map to our problem**.

### Criteria

We can outline the different criteria that we must satisfy as follows:

1. **0 threshold.** Zero has a special place in our problem due to the fact that at the end of the day we must place buy/sell trades. So as we cross the 0 threshold our error should nonlinearly increase.
2. **Input dependent slopes.** If we predict 50 and the dart is 60, that is **far better and more useful** than if we predict 5 and the dart is 2. Yet the latter has a lower MAE and MSE. This shows that in our problem the error should be **input** dependent.
3. **What bias is present in our model?** This reduces down to the idea of **precision** and **recall** in a continuous context. E.g. if our model always predicts -5 it has high recall when the true value is -5 (it will never produce a single false negative), but incredibly low precision (it will predict -5 for many examples that are not -5). We can formalize this in a continuous context.
4. **What is the difficulty of a prediction?** Given that most true darts are small and near 0, our model can reduce its overall loss by simply improving its ability to predict in that region, and it will often be better at predicting there. Now say we are running error analysis and see that during hour 23 our model has far lower error. We must ask: is that because our model is more effective in that region, having valuable signal that allows it to make challenging predictions? Or is it because more true darts are near 0 during hour 23 (meaning that simply reverting to the mean is an effective strategy), which does not yield much useful information?

![mse_loss_func](mse_loss_func.png)

![custom_loss_func](custom_loss_func.png)

```python
import numpy as np


def loss(y, yhat, alpha=1, beta=1, gamma=0.5, s=10):
    """Scalar version of the custom loss."""
    y_scale = np.abs(y) / s
    # Shrink the slopes as |y| grows past s, so errors on small targets
    # (near the 0 threshold) cost relatively more. Note: max, not min,
    # to match loss_np below and to avoid dividing by zero when y == 0.
    scaler = max(1, y_scale)
    b = beta / scaler
    g = gamma / scaler
    if np.sign(y * yhat) == -1:  # flip, wrong sign, red region
        return alpha * yhat**2 + np.abs(y - yhat) * b
    elif ((y > 0) and (yhat < y)) or ((y < 0) and (yhat > y)):  # green region
        return np.abs(y - yhat) * b
    else:  # yellow region
        return np.abs(y - yhat) * g


def loss_np(y, yhat, alpha=1, beta=1, gamma=0.5, s=10):
    """Vectorized version of the custom loss (y == 0 falls through to np.select's default of 0)."""
    y_scale = np.abs(y) / s
    scaler = np.maximum(1, y_scale)
    b = beta / scaler
    g = gamma / scaler
    return np.select(
        [
            np.sign(y * yhat) == -1,  # flip, wrong sign, red region
            ((y > 0) & (yhat < y)) | ((y < 0) & (yhat > y)),  # green region
            ((y > 0) & (yhat >= y)) | ((y < 0) & (yhat <= y)),  # yellow region
        ],
        [
            alpha * yhat**2 + np.abs(y - yhat) * b,
            np.abs(y - yhat) * b,
            np.abs(y - yhat) * g,
        ],
    )
```

We can update this to be continuous; see the function below, which is based on the paper [Precision and Recall for Regression by Torgo and Ribeiro](https://www.dcc.fc.up.pt/~ltorgo/Papers/tr09.pdf).

```python
import numpy as np
import pandas as pd


def precision_and_recall_prep(df: pd.DataFrame, prediction_col, suffix, t=10, eps=4):
    """Flag each row as relevant / predicted-relevant so that precision and
    recall can be computed for a regression model:

    True positive:  relevant, predicted relevant
    False positive: not relevant, predicted relevant
    True negative:  not relevant, predicted not relevant
    False negative: relevant, predicted not relevant
    """
    # A row is relevant when the true target is large in magnitude.
    df = df.assign(relevant=np.where(np.abs(df.target) > t, 1, 0))
    df = df.assign(predicted_sign=np.sign(df[prediction_col]))
    # A prediction is relevant when it has the right sign and is large
    # enough in magnitude (within a tolerance of eps).
    df = df.assign(
        **{
            f"prediction_relevant{suffix}": (
                (np.sign(df.target) == df.predicted_sign)
                & (np.abs(df[prediction_col]) > t - eps)
            )
        }
    )
    return df
```

---
Date: 20211203
Links to: [Machine-Learning](Machine-Learning.md) [Machine-Learning-Method](Machine-Learning-Method.md)
Tags:
References: