# Continuous Ranked Probability Score

# TLDR

The Continuous Ranked Probability Score (CRPS) is a metric used to measure the quality of probabilistic predictions of a continuous random variable. A lower CRPS is better; the score is non-negative, with 0 indicating a perfect forecast. It can be decomposed into three terms (reliability, resolution, uncertainty) where:

- Lower reliability is better
- Higher resolution is better
- Uncertainty is constant (it depends only on the observations)

# Brier Score

The best way to understand the **[Continuous Ranked Probability Score](https://journals.ametsoc.org/view/journals/wefo/15/5/1520-0434_2000_015_0559_dotcrp_2_0_co_2.xml)** is by building intuition around the **[Brier Score](https://en.wikipedia.org/wiki/Brier_score)**. The Brier Score is a [proper scoring rule](https://en.wikipedia.org/wiki/Scoring_rule#StrictlyProperScoringRules) that measures the quality of probabilistic predictions. It specifically deals with predicting the probability of a _**binary event**_ occurring. For instance: will it rain tomorrow, yes or no (1 or 0)? It is defined as:

$$
BS = \frac{1}{N} \sum_{i=1}^N (q_i - y_i)^2
$$

Where $y_i$ is what actually happened (did it rain or not) and $q_i$ is the predicted probability of that event. We take the squared error between these terms and average it across all $N$ observation-prediction pairs.

## Decomposition

Part of the beauty of the Brier score lies in its decomposition. It can be decomposed into 3 additive components: Reliability, Resolution, and Uncertainty:

$$
BS = \overbrace{ \frac{1}{N}\sum_{k=1}^K n_k (\bar{q}_k - \bar{y}_k)^2}^{\text{Reliability}} - \overbrace{ \frac{1}{N} \sum_{k=1}^K n_k (\bar{y}_k - \bar{y})^2 }^{\text{Resolution}} + \overbrace{\bar{y}(1 - \bar{y})}^{\text{Uncertainty}}
$$

Where:

- $K$ = the number of predicted-probability bins
- $n_k$ = number of predictions falling in bin $k$
- $\bar{q}_k$ = the average predicted probability in bin $k$
- $\bar{y}_k$ = the average observation (the true frequency) in bin $k$
- $\bar{y}$ = the average observation (the true frequency) across the entire dataset

We can visualize this decomposition nicely via a calibration plot (images generated via [this notebook](https://github.com/NathanielDake/intuitiveml/blob/master/notebooks/Machine-Learning-Perspective/Loss-Functions/decomposition_intuitions.ipynb)):

![](Pasted%20image%2020240311195716.png)

### **Reliability**

> The reliability measures how close the forecasted probabilities are to the true probabilities, given the predicted probabilities.

It can be thought of as follows:

- Let us focus on bin $k = [0.18, 0.2]$
- _**Given**_ that the predicted probability was in bin $k = [0.18, 0.2]$:
    - What was the observed frequency of the event? This is $\bar{y}_k$
    - What was the average predicted probability? This is $\bar{q}_k$
- Compute the squared difference between $\bar{y}_k$ and $\bar{q}_k$
    - This is the reliability for bin $k$
- Repeat this for all $K$ bins and take the weighted average of the resulting reliabilities (weighting each bin by $n_k$)

> For example, say our event is “rain: yes or no”. In that case we would look at all cases where we forecasted that the probability of rain was in the range [0.18, 0.2], and then compute the observed frequency that it actually rained. Then take the squared difference with the average predicted probability in that bin, yielding the reliability.

The lower the reliability the better. Note that this means reliability is defined in direct opposition to its common English meaning: if the reliability is 0, the predictions are perfectly reliable. So you may frequently hear yourself saying that you want to decrease the reliability of a forecast 😅.
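To make this concrete, here is a minimal NumPy sketch of the score and its reliability term. The helper names (`brier_score`, `reliability`) and the synthetic rain data are my own for illustration, not from any of the linked references:

```python
import numpy as np

def brier_score(q, y):
    """Mean squared error between predicted probabilities q and binary outcomes y."""
    return np.mean((q - y) ** 2)

def reliability(q, y, n_bins=10):
    """Weighted average over bins of (avg predicted prob - observed frequency)^2."""
    edges = np.linspace(0, 1, n_bins + 1)
    # Assign each prediction to a bin; clip so q == 1.0 falls in the last bin.
    idx = np.clip(np.digitize(q, edges) - 1, 0, n_bins - 1)
    rel = 0.0
    for k in range(n_bins):
        in_bin = idx == k
        n_k = in_bin.sum()
        if n_k == 0:
            continue
        q_bar_k = q[in_bin].mean()  # average predicted probability in bin k
        y_bar_k = y[in_bin].mean()  # observed frequency in bin k
        rel += n_k * (q_bar_k - y_bar_k) ** 2
    return rel / len(q)

# Synthetic "rain" data: outcomes drawn from the forecast probabilities themselves,
# so the forecasts are well calibrated by construction.
rng = np.random.default_rng(0)
q = rng.uniform(0, 1, 10_000)                      # predicted rain probabilities
y = (rng.uniform(0, 1, 10_000) < q).astype(float)  # 1 = rained, 0 = did not

print(brier_score(q, y))  # ~1/6 for this setup
print(reliability(q, y))  # near 0: calibrated forecasts are "perfectly reliable"
```

Because the outcomes are drawn from the forecast probabilities themselves, the reliability lands near zero up to sampling noise, matching the walkthrough above.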
### **Resolution**

> Resolution measures how dissimilar the observed frequencies are to the overall average observed frequency, given the predicted probabilities.

It can be thought of as follows:

- We start by computing the overall observed frequency of our event, $\bar{y}$
- Then, again let us focus on bin $k = [0.18, 0.2]$
- _**Given**_ that the predicted probability was in bin $k = [0.18, 0.2]$:
    - What was the observed frequency of the event? This is $\bar{y}_k$
- Compute the squared difference between $\bar{y}_k$ and $\bar{y}$
    - This is the resolution of bin $k$
- Repeat this for all $K$ bins and take the weighted average of the resulting resolutions (again weighting each bin by $n_k$)

> For example, again let our event be “rain: yes or no”. Say that overall, _given no information_, the probability of rain is 0.4. Then, we would look at all cases where we are _given the information_ that we predicted the probability of rain was in the range [0.18, 0.2], and then compute the observed frequency that it actually rained. Say this conditional observed frequency was 0.23. We then take the squared difference between 0.23 and 0.4, yielding the resolution for that bin.

The key thing to realize about resolution is that the predicted probability is only used to _condition on_. Given that conditioning, you compute the squared difference between the overall observed frequency and the conditional observed frequency.

Why is this useful? It specifically penalizes always predicting the overall average. If you always predicted the overall average observed frequency, then $n_k$ would be $0$ for every bin except the one containing $\bar{y}$. In that one occupied bin, all observations fall together, so $\bar{y}_k = \bar{y}$ and its squared difference is $0$. The total resolution is therefore $0$. This is visualized below:

![](Pasted%20image%2020240311195749.png)

Because the resolution term is subtracted in the decomposition, _the higher the resolution the better_.

### **Uncertainty**

The uncertainty term measures the inherent uncertainty in the outcomes of the event. For binary events, it is at a maximum when each outcome occurs 50% of the time, and is minimal (zero) if a single outcome always occurs (e.g. it rains every day). It depends only on the observations, which is why it is constant for a given dataset.

### Building intuition

![](CRPS1.png)

![](CRPS2.png)

In the top row we see two poorly calibrated sets of predictions: notice a lot of red (poor reliability) and not a lot of green (poor resolution). On the other hand, in the bottom row we can see what good predictions look like: very little red and a lot of green.
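Putting the three terms together: below is a sketch extending the earlier snippet (same synthetic setup and naming conventions, which are my own) that computes reliability, resolution, and uncertainty and recombines them:

```python
import numpy as np

def brier_decomposition(q, y, n_bins=10):
    """Return (reliability, resolution, uncertainty) for binned forecasts."""
    edges = np.linspace(0, 1, n_bins + 1)
    idx = np.clip(np.digitize(q, edges) - 1, 0, n_bins - 1)
    y_bar = y.mean()  # overall observed frequency
    rel = res = 0.0
    for k in range(n_bins):
        in_bin = idx == k
        n_k = in_bin.sum()
        if n_k == 0:
            continue
        rel += n_k * (q[in_bin].mean() - y[in_bin].mean()) ** 2
        res += n_k * (y[in_bin].mean() - y_bar) ** 2
    unc = y_bar * (1 - y_bar)  # variance of a Bernoulli(y_bar) outcome
    return rel / len(q), res / len(q), unc

rng = np.random.default_rng(1)
q = rng.uniform(0, 1, 50_000)
y = (rng.uniform(0, 1, 50_000) < q).astype(float)

rel, res, unc = brier_decomposition(q, y)
bs = np.mean((q - y) ** 2)
print(f"BS = {bs:.4f}")
print(f"REL - RES + UNC = {rel - res + unc:.4f}")  # close, but see the note below
```

One caveat: the identity $BS = REL - RES + UNC$ is exact only when all forecasts within a bin share the same value. With continuous forecasts binned as above, the recombination matches the Brier score only approximately, up to a small within-bin variance term.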
# CRPS

Whereas the Brier Score measures the quality of a probabilistic prediction for a binary event, the **[Continuous Ranked Probability Score](https://en.wikipedia.org/wiki/Scoring_rule#Continuous_ranked_probability_score)** measures the quality of a predicted distribution for a _**continuous**_ random variable. An article providing a basic overview is [here](https://towardsdatascience.com/crps-a-scoring-function-for-bayesian-machine-learning-models-dd55a7a337a8). Note the notation below is slightly altered from the section above in order to match the CRPS decomposition notation [here](https://journals.ametsoc.org/view/journals/wefo/15/5/1520-0434_2000_015_0559_dotcrp_2_0_co_2.xml).

It is defined as:

$$
CRPS = \int_{-\infty}^{\infty} \big[ P(x) - P_a(x) \big]^2 dx
$$

Where $P(x)$ is the CDF of the predicted distribution:

$$
P(x) = \int_{-\infty}^x \rho(y)dy
$$

And $P_a(x)$ is the CDF of the _observed_ value (the [Heaviside function](https://en.wikipedia.org/wiki/Heaviside_step_function)):

$$
P_a(x) = H(x - x_a)
$$

Visually, it is the area (yellow) between the predicted distribution's CDF (red) and the observation's CDF (black):

![](Pasted%20image%2020240311195950.png)

## Decomposition

The CRPS also has a decomposition similar to the Brier score's; it is derived in “[Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems](https://journals.ametsoc.org/view/journals/wefo/15/5/1520-0434_2000_015_0559_dotcrp_2_0_co_2.xml)”.

## Function of Brier Score

The CRPS can also be viewed as an integral of Brier scores over all possible thresholds $x_t$, where $BS(x_t)$ is the Brier score for the binary event $x \le x_t$:

$$
\overline{CRPS} = \int_{-\infty}^{\infty} BS(x_t) dx_t
$$

By casting our predicted distribution to events, we gain access to the decomposition as a function of $x$ (i.e. the calibration as a function of the threshold). By casting to events, all I mean is taking our predicted distribution and converting it into binary events of the form below (the sketch that follows uses the second, threshold form):

- `1 if x in [a, b] otherwise 0`
- `1 if x <= a otherwise 0`
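To tie the two views together numerically, here is a small sketch. The Gaussian predictive distribution is purely an assumption for illustration; it is convenient because `scipy` provides its CDF and a closed-form CRPS is known for it:

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

# Assumed setup for illustration: a Gaussian predictive distribution
# Normal(mu, sigma) and a single observed value x_a.
mu, sigma, x_a = 0.0, 1.0, 0.5

# Threshold grid wide enough that both CDFs are ~0 and ~1 at the ends.
x_t = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 20_001)

# CRPS from the definition: integrate the squared difference between the
# predicted CDF P(x) and the observation's step-function CDF H(x - x_a).
P = norm.cdf(x_t, mu, sigma)
P_a = (x_t >= x_a).astype(float)
crps_def = trapezoid((P - P_a) ** 2, x_t)

# CRPS as an integral of Brier scores: cast the forecast to the binary
# event "x <= x_t" at each threshold. The forecast probability is P(x_t)
# and the outcome is 1 if x_a <= x_t, so for this single
# observation-prediction pair BS(x_t) = (P(x_t) - P_a(x_t))^2.
event_prob = P
event_outcome = (x_a <= x_t).astype(float)
crps_from_bs = trapezoid((event_prob - event_outcome) ** 2, x_t)

# Closed form for a Gaussian forecast (Gneiting & Raftery, 2007).
z = (x_a - mu) / sigma
crps_exact = sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

print(crps_def, crps_from_bs, crps_exact)  # all three agree to within grid error
```

The middle computation is exactly the “casting to events” described above: at each threshold the forecast becomes a probability for a binary event, and integrating the resulting Brier scores over thresholds recovers the CRPS.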