# Predictive Measures
The key ideas behind probability calibration are:
* On a probability calibration curve (reliability diagram), we want our predicted probabilities to lie as close to the diagonal line as possible; a point on the diagonal means the predicted probability matches the observed event rate.

In practice we want to:
* Measure calibration via probability calibration curves
* Use proper scoring rules such as the Brier Score and the average log-likelihood
* Understand what a scoring rule is: [Scoring rule](https://en.wikipedia.org/wiki/Scoring_rule)
* Calculate the **Brier Score** (where smaller is better). However, we must note that the Brier Score can be misleading in terms of calibration: in the example from the talk linked below, Logistic Regression has the lowest Brier Score even though it is not the most well-calibrated model.

* Hence, we also want to measure the components that *make up the Brier Score*! See [this fantastic talk here](https://youtu.be/RXMu96RJj_s?t=656). The Brier Score decomposes into three components (a small numeric sketch of this decomposition follows this list):
* **Reliability (Calibration)**: How close are our predicted probabilities to the observed event rates?
* **Resolution (Purity)**: How far are the predictions from the overall mean response (the base rate)?
* **Uncertainty (Noise)**: The inherent noise level of the data.
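As a rough numeric sketch of that decomposition (the function name and the equal-width binning below are my own illustration, not taken from the talk), the Brier Score approximately satisfies $\text{BS} \approx \text{Reliability} - \text{Resolution} + \text{Uncertainty}$ once predictions are grouped into probability bins:

```python
import numpy as np

def brier_decomposition(y_true, y_prob, n_bins=10):
    """Binned (Murphy-style) decomposition of the Brier Score into
    reliability, resolution and uncertainty components."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    n = len(y_true)
    base_rate = y_true.mean()

    # Assign each prediction to an equal-width probability bin
    bin_idx = np.clip((y_prob * n_bins).astype(int), 0, n_bins - 1)

    reliability = resolution = 0.0
    for k in range(n_bins):
        mask = bin_idx == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        f_k = y_prob[mask].mean()   # mean predicted probability in the bin
        o_k = y_true[mask].mean()   # observed event rate in the bin
        reliability += n_k * (f_k - o_k) ** 2        # calibration: prediction vs reality
        resolution += n_k * (o_k - base_rate) ** 2   # purity: distance from base rate
    reliability /= n
    resolution /= n
    uncertainty = base_rate * (1 - base_rate)        # noise level of the outcome itself

    brier = np.mean((y_prob - y_true) ** 2)
    return brier, reliability, resolution, uncertainty

# Perfectly calibrated synthetic forecasts: reliability should be near zero and
# brier should be close to (reliability - resolution + uncertainty).
rng = np.random.default_rng(0)
p = rng.uniform(size=5_000)
y = rng.binomial(1, p)
print(brier_decomposition(y, p))
```

For well-calibrated forecasts the reliability term sits near zero, so a low Brier Score can come from high resolution or low uncertainty rather than from good calibration, which is exactly why the score alone can mislead.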


## Extra Content and Excerpts
Some great measures (*not accuracy, precision, and recall; the references below explicitly argue against these measures*) are:
1. Smooth, flexible probability calibration curve
2. Frequentist: Log Likelihood
3. Bayesian: Log Likelihood + log prior
4. Explained outcome heterogeneity
    1. Heterogeneity of predictions (Kent & O'Quigley-type measures): the variance of the predictions.
    2. Relative explained variation (relative $R^2$): the ratio of the variance of $\hat{Y}$ from a subset model to that from the full model (see the formula after this list).
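One way to write the relative explained variation from item 4.2, assuming $\hat{Y}$ denotes the predicted values (e.g., the linear predictor) from each model:

$$
R^2_{\text{rel}} = \frac{\operatorname{Var}\left(\hat{Y}_{\text{subset}}\right)}{\operatorname{Var}\left(\hat{Y}_{\text{full}}\right)}
$$

A value near 1 suggests the subset model captures most of the outcome heterogeneity explained by the full model.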
The reason the references below (specifically, Frank Harrell) argue so heavily against the standard measures (accuracy, sensitivity, specificity, precision, and recall) is that they are discontinuous improper accuracy scores: they can be optimized by a bogus model, one that is known to be wrong.
Additionally, ROC curves are highly problematic. The coordinates (sensitivity and 1-specificity) are improper scores, and they are transposed conditionals, meaning we are conditioning on the wrong thing: we are using the *unknown* (the true outcome) to predict the *known* (the test result), as the small numeric sketch below illustrates.
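A minimal numeric sketch of why transposed conditionals mislead (the sensitivity, specificity, and prevalence values here are hypothetical, chosen only for illustration):

```python
# Sensitivity and specificity condition on the *unknown* true state.
# What a decision maker actually needs is P(disease | prediction),
# which also depends on prevalence (Bayes' rule).
sensitivity = 0.90   # P(test positive | disease)     -- hypothetical value
specificity = 0.90   # P(test negative | no disease)  -- hypothetical value
prevalence  = 0.01   # P(disease)                     -- hypothetical value

p_test_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
ppv = sensitivity * prevalence / p_test_pos   # P(disease | test positive)
print(round(ppv, 3))   # ~0.083: a "90% sensitive/specific" test implies only ~8% risk
```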
### Measures that relate to Decision Making
The **Optimum Bayes** decision is the one that maximizes **expected utility**. Expected utility is a convolution of a utility function with the posterior distribution that describes what we know about the parameters of the model; for a patient, it combines the posterior distribution of the outcome probability with the consequences of possible wrong actions.
In general, it appears that we should use a proper [Scoring rule - Wikipedia](https://en.wikipedia.org/wiki/Scoring_rule) together with a calibration plot.
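A minimal sketch of that decision rule, assuming we already have a calibrated predicted probability for a patient; the utility numbers are hypothetical and exist only to show the mechanics:

```python
import numpy as np

# Hypothetical utilities: rows = actions, columns = true state (disease, no disease)
utilities = np.array([
    [ 10.0, -2.0],   # treat:    benefit if diseased, small cost otherwise
    [-50.0,  0.0],   # no treat: large loss if diseased, neutral otherwise
])

def bayes_action(p_disease, utilities):
    """Return the action maximizing expected utility for a predicted risk."""
    posterior = np.array([p_disease, 1.0 - p_disease])
    expected_utility = utilities @ posterior   # one expected utility per action
    return int(expected_utility.argmax()), expected_utility

# The optimal action flips as the predicted risk changes, which is why different
# utility functions imply different risk thresholds for different end users.
for p in (0.02, 0.10, 0.50):
    action, eu = bayes_action(p, utilities)
    print(p, ["treat", "no treat"][action], eu.round(2))
```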
### Plots to Consider
See this section of the [Frank Harrell talk](https://youtu.be/DF1WsYZ94Es?t=2387) and the associated [blog post](https://www.fharrell.com/post/addvalue/), which show two distributions of *predictive risk*.


There we can also see that the cholesterol variable tells us more about patients who are younger (again, watch that section of the talk for more details).

> The presumption comes from the fallacious view that ultimately end-users need to make a binary decision, so binary classification is needed. Optimum decisions require making full use of available data, developing expectations quantitatively expressed as individual probabilities on a continuous scale, and applying an independently derived loss/utility/cost function to make a decision that minimizes expected loss or maximizes expected utility. Different end users have different utility functions which leads to their having different risk thresholds for action.
### When to use classification
> For all applications it is well to distinguish and clearly differentiate prediction and classification. Formally, for ML, classification is using labels you have for data in hand to correctly label new data. This is feature recognition and class or category attribution. Strictly understood, it is about _identification_, and not about stochastic outcomes. Classification is best used with non-stochastic mechanistic or deterministic processes that yield outcomes that occur frequently. Classification should be used when outcomes are inherently distinct and predictors are strong enough to provide, for all subjects, a probability closely approximating 1.0 for one of the outcomes. A classification does not account well for gray zones. Classification techniques are appropriate in situations in which there is a known gold standard and replicate observations with approximately the same result each time, for instance in pattern recognition (e.g., optical character recognition algorithms, etc.). In such situations the process generating the data are primarily non-stochastic, with high signal:noise ratios. - [In Machine Learning Predictions for Health Care the Confusion Matrix is a Matrix of Confusion | Statistical Thinking](https://www.fharrell.com/post/mlconfusion/)
### Shortcomings of AUROC
> The AUROC (or its equivalent for the case of binary response variable, the c-index or c-statistic) is conventionally employed as a measure of the discrimination capacity of a model: the ability to correctly classify observations into categories of interest. Setting aside the question of the appropriateness of classification-focused measures (sensitivity, specificity and their summary in the ROC) of performance for prediction models, I speculate that the AUROC and the c-statistic do not really reflect what people generally think it does. And here again, nuances and behavioral economics (inconsistencies in perceptions, cognition, behavior and logic) are pertinent.
>
> Discrimination literally indicates the ability to identify a meaningful difference between things and connotes the ability to put observations into groups correctly. As applied to a prediction model the area under the ROC curve or the c-statistic, however, is based on the _ranks_ of the predicted probabilities and compares these ranks between observations in the classes of interest. The AUC is closely related to the Mann–Whitney U, which tests whether positives are ranked higher than negatives, and to the Wilcoxon rank-sum statistic. Because this is a rank based statistic, the area under the curve is the probability that a randomly chosen subject from one outcome group will have a higher score than a randomly chosen subject from the other outcome group—that’s all.
>
> In health care, discrimination is frequently concerned with the ability of a test to correctly classify those with and without the disease. Consider the situation in which patients are already correctly classified into two groups by some gold standard of pathology. To realize the AUROC measure, you randomly pick one patient from the disease group and one from the non-disease group and perform the test on both. The patient with the more abnormal test result should be the one from the disease group. The area under the curve is the percentage of randomly drawn pairs for which the test correctly rank orders the test measures for the two patients in the random pair. It is something like accuracy, but _not_ accuracy.
- [In Machine Learning Predictions for Health Care the Confusion Matrix is a Matrix of Confusion | Statistical Thinking](https://www.fharrell.com/post/mlconfusion/)
In general, we can see that AUROC is about ranking. In an insurance context, however, ranking isn't our goal, and hence this metric isn't appropriate:
> Also frequently important in health care is the ability of a model to prognosticate something like death. For the binary logistic prediction model the area under the curve is the probability that a random sample of the deceased will have a greater rank estimated probability of death than a randomly chosen survivor. This is only a corollary of what is desirable for evaluation of the performance of a prediction model: a measure of quantitative absolute agreement between observed and predicted mortality. The probability of correctly ranking a pair seems of secondary interest.
>
> If you develop a model indicating that I am likely to develop a cancer, and you tell me that you assert this because the model has an AUROC of 0.9, you really have only told me something about the expected relative ranking of my predicted value; that on average people who do not go on to develop the cancer tend to have a lower predicted value. This seems like something less than strong inference. Whether the absolute risks are 0.19 vs. 0.17, or 0.9 vs. 0.2 does not enter into the information.
Another key insight:
> **But the area under the curve is the probability that a randomly chosen subject from one outcome group will have a higher score than a randomly chosen subject from the other outcome group, nothing more. I do not feel that tells us enough.**
This measure can mislead us:
> And there are various ways we may be misled by this measure ( [Cook, 2007](http://circ.ahajournals.org/content/115/7/928.long); [Lobo, 2007](https://www2.unil.ch/biomapper/Download/Lobo-GloEcoBioGeo-2007.pdf)). As the ROC curve does not use the estimated probabilities themselves, only ranks, it may be insensitive to absolute differences in predicted probabilities. Hence, a well discriminating model can have poor calibration. And perfect calibration is possible with poor discrimination when the range of predicted probabilities is small (as with a homogeneous population case-mix), as discrimination is sensitive to the variance in the predictor variables. Over-fitted models can show both poor discrimination and calibration when validated in new patients. Inferential tests for comparing AUROC are problematic ( [Seshan, 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3617074/)), and other disadvantages with the AUROC are noted ( [Halligan, 2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4356897/)). For various reasons, the AUROC and the c-index or c-statistic are problematic and of limited value for comparing among tests or models, though unfortunately, still widely used for such.
Additionally, we see that calibration and classification can be at odds:
> As the ROC curve does not use the estimated probabilities themselves, only ranks, it may be insensitive to absolute differences in predicted probabilities. Hence, a well discriminating model can have poor calibration. And perfect calibration is possible with poor discrimination when the range of predicted probabilities is small (as with a homogeneous population case-mix), as discrimination is sensitive to the variance in the predictor variables.
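A quick numerical check of this point (assuming scikit-learn is available; the monotone distortion below is arbitrary): applying any rank-preserving transformation to the predicted probabilities leaves the AUROC untouched while wrecking calibration, as the Brier Score shows.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(1)
p_true = rng.uniform(size=10_000)   # true underlying risks
y = rng.binomial(1, p_true)         # observed binary outcomes

p_calibrated = p_true               # a perfectly calibrated "model"
p_distorted = p_true ** 4           # monotone transform: ranks unchanged

# Identical AUROC (it only sees ranks) ...
print(roc_auc_score(y, p_calibrated), roc_auc_score(y, p_distorted))
# ... but a much worse Brier Score for the distorted, miscalibrated probabilities.
print(brier_score_loss(y, p_calibrated), brier_score_loss(y, p_distorted))
```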
### Probability Calibration
An interesting approach to probability calibration can be seen in [Probability Calibration : Data Science Concepts - YouTube](https://www.youtube.com/watch?v=AunotauS5yI).
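A minimal sketch of post-hoc calibration with scikit-learn (the model, dataset, and isotonic choice here are arbitrary illustrations, not recommendations from the video):

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An uncalibrated model vs. the same model wrapped in cross-validated isotonic calibration
raw = RandomForestClassifier(random_state=0).fit(X_train, y_train)
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="isotonic", cv=5
).fit(X_train, y_train)

for name, model in [("raw", raw), ("calibrated", calibrated)]:
    prob_true, prob_pred = calibration_curve(
        y_test, model.predict_proba(X_test)[:, 1], n_bins=10
    )
    # Mean gap between observed and predicted frequencies: closer to 0 is better
    print(name, round(float(abs(prob_true - prob_pred).mean()), 3))
```

Plotting `prob_pred` against `prob_true` gives the reliability diagram discussed at the top of this note; the calibrated model's points should sit closer to the diagonal.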
### Why is the AUC score bad in probabilistic classification problems?
See [Accuracy Precision Recall F1, Shortcomings](Accuracy%20Precision%20Recall%20F1.md#Shortcomings) to get a good understanding of this.
---
Date: 20220202
Links to:
Tags:
References:
* [Classification vs. Prediction | Statistical Thinking](https://www.fharrell.com/post/classification/)
* [Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules | Statistical Thinking](https://www.fharrell.com/post/class-damage/)
* [classification - How does logistic regression "elegantly" handle unbalanced classes? - Cross Validated](https://stats.stackexchange.com/questions/403239/how-does-logistic-regression-elegantly-handle-unbalanced-classes)
* [machine learning - Reduce Classification Probability Threshold - Cross Validated](https://stats.stackexchange.com/questions/312119/reduce-classification-probability-threshold)
* [Why R? 2020 Keynote - Frank Harrell - Controversies in Predictive Modeling and Machine Learning - YouTube](https://youtu.be/DF1WsYZ94Es?t=1902)
* [Statistically Efficient Ways to Quantify Added Predictive Value of New Measurements | Statistical Thinking](https://www.fharrell.com/post/addvalue/)
* [In Machine Learning Predictions for Health Care the Confusion Matrix is a Matrix of Confusion | Statistical Thinking](https://www.fharrell.com/post/mlconfusion/)
* [Scoring rule - Wikipedia](https://en.wikipedia.org/wiki/Scoring_rule)
* [Probability Calibration : Data Science Concepts - YouTube](https://www.youtube.com/watch?v=AunotauS5yI)
* [Model Calibration - is your model ready for the real world? - Inbar Naor - PyCon Israel 2018 - YouTube](https://www.youtube.com/watch?v=FkfDlOnQVvQ&t=35s)
* [Safe Handling Instructions for Probabilistic Classification | SciPy 2019 | Gordon Chen - YouTube](https://www.youtube.com/watch?v=RXMu96RJj_s)
* [model evaluation - Worse AUC but better metrics (Recall, Precision) on a classification problem - How can this happen? - Cross Validated](https://stats.stackexchange.com/questions/547410/worse-auc-but-better-metrics-recall-precision-on-a-classification-problem-h)
* [Clinicians' Misunderstanding of Probabilities Makes Them Like Backwards Probabilities Such As Sensitivity, Specificity, and Type I Error | Statistical Thinking](https://www.fharrell.com/post/backwards-probs/)
* [Google AI Blog: Can You Trust Your Model’s Uncertainty?](https://ai.googleblog.com/2020/01/can-you-trust-your-models-uncertainty.html)
* [machine learning - Proper scoring rule when there is a decision to make (e.g. spam vs ham email) - Cross Validated](https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email)