# Accuracy, Precision, Recall, F1
We can start by looking at a **confusion matrix**, which holds all possible outcomes of our predictions compared to the actual values.

True positives and true negatives are the observations that are correctly predicted (typically shaded green in a confusion-matrix diagram). We want to minimize false positives and false negatives (typically shaded red). These terms can be confusing, so let's take each one in turn and understand it fully.
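For reference, a plain-text version of the standard 2×2 layout (with "survived" as the positive class) looks like:

|                 | Predicted: Yes | Predicted: No |
| --------------- | -------------- | ------------- |
| **Actual: Yes** | TP             | FN            |
| **Actual: No**  | FP             | TN            |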
**True Positives (TP)** - These are the correctly predicted positive values: the actual class is yes and the predicted class is also yes. E.g. the actual class indicates that this passenger survived, and the predicted class tells you the same thing.
**True Negatives (TN)** - These are the correctly predicted negative values: the actual class is no and the predicted class is also no. E.g. the actual class says this passenger did not survive, and the predicted class tells you the same thing.
False positives and false negatives occur when the actual class contradicts the predicted class.
**False Positives (FP)** – When the actual class is no and the predicted class is yes. E.g. the actual class says this passenger did not survive, but the predicted class tells you that this passenger survived. Note, this is also referred to as a **Type I error**.
**False Negatives (FN)** – When the actual class is yes but the predicted class is no. E.g. the actual class indicates that this passenger survived, but the predicted class tells you that the passenger did not survive. Note, this is also referred to as a **Type II error**.
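As a minimal sketch (in Python, with made-up survival labels rather than the actual Titanic data), the four counts can be tallied directly:
```python
# 1 = survived, 0 = did not survive (toy labels for illustration)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # actual yes, predicted yes
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # actual no,  predicted no
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # Type I error
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # Type II error

print(TP, TN, FP, FN)  # 4 3 2 1
```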
Once you understand these four quantities, we can calculate Accuracy, Precision, Recall, and the F1 score.
**Accuracy** - Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total observations. One might think that high accuracy means our model is best. Accuracy is a great measure, but only when you have a symmetric dataset, where the counts of false positives and false negatives are almost the same. Otherwise, you have to look at other parameters to evaluate the performance of your model. For our model we got 0.803, which means our model is approximately 80% accurate.
$Accuracy = \frac{TP+TN}{TP+FP+FN+TN}$
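Continuing the toy counts from the sketch above:
```python
accuracy = (TP + TN) / (TP + TN + FP + FN)  # (4 + 3) / 10 = 0.7
```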
**Precision** - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It is meant to capture the scenario of: "Yeah, your model got all the true positives, but it also got a TON of false positives. Hence, it has low precision." Low precision $\rightarrow$ many false positives. Precision is all about **false positives**.
$Precision = \frac{TP}{TP+FP}$
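With the same toy counts:
```python
precision = TP / (TP + FP)  # 4 / (4 + 2) ≈ 0.667 -- dragged down by false positives
```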
**Recall (Sensitivity)** - Recall is the ratio of correctly predicted positive observations to all observations in the positive class. The intuition here is that a low recall means you missed a bunch of positive cases, classifying them as negative. I.e. recall is all about **false negatives**.
$Recall = \frac{TP}{TP+FN}$
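Again with the toy counts:
```python
recall = TP / (TP + FN)  # 4 / (4 + 1) = 0.8 -- dragged down by false negatives
```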

**Specificity** - Specificity is the ratio of correctly predicted negative observations to all observations in the negative class (the true negative rate); it is the negative-class counterpart of recall.
$Specificity = \frac{TN}{TN+FP}$

> **Sensitivity** is the *probability of detecting a condition when it is truly present*, and **specificity** is the *probability of not detecting it when it is truly absent*.
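
And with the same toy counts:
```python
specificity = TN / (TN + FP)  # 3 / (3 + 2) = 0.6
```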
**F1 score** - The F1 score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar costs; if the costs of false positives and false negatives are very different, it's better to look at both Precision and Recall. In our case, the F1 score is 0.701.
$F1 = \frac{2*(Recall * Precision)}{(Recall + Precision)}$
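Using the toy precision and recall from above:
```python
f1 = 2 * precision * recall / (precision + recall)  # 2 * 0.667 * 0.8 / 1.467 ≈ 0.727
```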
It is useful to visualize the F1 score as a function of Recall and Precision: the F1 score is simply an attempt to capture when we have both high recall and high precision. If either is low, the F1 score is low (as a harmonic mean, it behaves much like a [Geometric Mean](Arithmetic%20vs%20Geometric%20Mean.md) in this respect).
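A quick numeric check of that behavior, with hypothetical precision/recall values:
```python
def f1(p, r):
    return 2 * p * r / (p + r)

print(f1(0.9, 0.9))  # 0.9  -> high only when both inputs are high
print(f1(0.9, 0.1))  # 0.18 -> one low value drags the whole score down
```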

### ROC
Note that a ROC curve is an alternative to a whole stack of confusion matrices, one per decision threshold: it plots the true positive rate against the false positive rate as the threshold varies. See more in this video: [ROC and AUC, Clearly Explained! - YouTube](https://www.youtube.com/watch?v=4jRBRDbJemM)
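A rough sketch of that idea: each threshold yields one confusion matrix, and hence one (FPR, TPR) point on the ROC curve. The labels and scores below are made up for illustration.
```python
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]  # predicted probabilities

for threshold in [0.2, 0.5, 0.75]:
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn)  # recall / sensitivity
    fpr = fp / (fp + tn)  # 1 - specificity
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
# threshold=0.2: TPR=1.00, FPR=0.75
# threshold=0.5: TPR=0.75, FPR=0.50
# threshold=0.75: TPR=0.25, FPR=0.25
```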
### AUC
The AUC makes it easy to compare one ROC curve to another. It allows us to compare different models without committing to a decision threshold (which is essentially a hyperparameter, distinct from the model).
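For example, assuming scikit-learn is available, two candidate models can be compared directly on their raw scores, with no threshold chosen anywhere (scores are made up):
```python
from sklearn.metrics import roc_auc_score

y_true         = [1, 0, 1, 1, 0, 0, 1, 0]
scores_model_a = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
scores_model_b = [0.6, 0.7, 0.8, 0.5, 0.40, 0.3, 0.9, 0.2]

print(roc_auc_score(y_true, scores_model_a))  # 0.6875
print(roc_auc_score(y_true, scores_model_b))  # 0.875
```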

### Shortcomings
See this fantastic talk: [Safe Handling Instructions for Probabilistic Classification | SciPy 2019 | Gordon Chen - YouTube](https://www.youtube.com/watch?v=RXMu96RJj_s). Specifically, the ROC curve (and hence the AUC) is calculated by *ranking* the predicted probabilities; the *probability values themselves* never enter the calculation.

Hence, the AUC score doesn't care about the probabilities, only about **the ranking**! The AUC score is itself a probability: if you randomly pick one positive observation and one negative observation, the AUC is the probability that the positive observation has a higher forecasted probability than the negative one. This is the real reason that the AUC score is not a good metric for probabilistic classification problems.
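A sketch of that pairwise-ranking view (same made-up scores as before); note that any monotone transform changes the probability values but not the ranking, and therefore not the AUC:
```python
from itertools import product

def auc_by_ranking(y_true, y_score):
    # Fraction of (positive, negative) pairs where the positive is ranked higher (ties count as half).
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

print(auc_by_ranking(y_true, y_score))                     # 0.6875
print(auc_by_ranking(y_true, [s ** 10 for s in y_score]))  # 0.6875 -- same ranking, same AUC
```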
### Frank Harrell, Short Comings of Confusion Matrix
Harrell describes the issue [here](https://www.fharrell.com/post/mlconfusion/) as one of *transpose conditionals*. He writes:
> While sensitivity and specificity sound like the right thing and a good thing, there is an essential misdirection for prediction. The problem for prediction with focusing on sensitivity and specificity is that you are conditioning on the wrong thing: the true underlying condition; you are conditioning on the thing you actually want information about. In Pr(true positive ∣ true condition positive) and Pr(true negative ∣ true condition negative) these measures make fixed the aspect that should be free to vary to provide the information actually needed to assess something meaningful about the future performance of the algorithm. To understand how the algorithm will actually perform in new data, the measures required are Pr(true positive ∣ ascribed class positive) and Pr(true negative ∣ ascribed class negative), i.e., the other dimension of the confusion matrix. To measure something meaningful about future performance, the outcome of interest (or what will be found to be true) must not be fixed by conditioning.
Associated reasoning can be seen via my whiteboarding [here](https://photos.google.com/photo/AF1QipNrxMBYXiFs045LUae285ecV1gJW7BzRY7aCTGm).
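A toy numeric illustration of Harrell's point (numbers made up): hold sensitivity and specificity fixed, and watch the quantity you actually need at prediction time, Pr(condition present | predicted positive), collapse as prevalence drops.
```python
sensitivity = 0.90  # Pr(predict + | truly +)
specificity = 0.90  # Pr(predict - | truly -)

for prevalence in [0.50, 0.10, 0.01]:
    tp = prevalence * sensitivity              # rate of true positives in the population
    fp = (1 - prevalence) * (1 - specificity)  # rate of false positives in the population
    ppv = tp / (tp + fp)                       # Pr(truly + | predicted +), i.e. precision
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.2f}")
# prevalence=0.50  PPV=0.90
# prevalence=0.10  PPV=0.50
# prevalence=0.01  PPV=0.08
```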

---
Date: 20211201
Links to:
Tags:
References:
* []()