# Statistical Rethinking
### Bayesian Inference
Bayesian inference is really just counting and comparing possibilities.
When we don’t know what caused the data, potential causes that may produce the data in more ways are more plausible.
Designing a simple Bayesian model benefits from a design loop with three steps.
1. **Data story**: Motivate the model by narrating how the data might arise.
2. **Update**: Educate your model by feeding it the data.
3. **Evaluate**: All statistical models require supervision, leading possibly to model revision.
For the next few sections we are going to work through an example involving estimating the proportion of water on the surface of the earth via randomly tossing a globe. Our data story will make use of the [Binomial Probability Distribution](https://en.wikipedia.org/wiki/Binomial_distribution):
$P(w | n, \theta) = \frac{n!}{w!(n - w)!}\theta^w (1 - \theta)^{n-w}$
So, we are tossing a globe in order to determine the proportion of water covering the surface of the earth. We know that there is some *true* proportion of water that we can call $\theta$. If we randomly toss the globe into the air and observe water when we catch it, that is a data point, $x$. We perform $n$ total tosses and observe $w$ events where water was face up on the globe. The probability distribution [looks like](https://github.com/NathanielDake/intuitiveml/blob/master/notebooks/Machine-Learning-Perspective/Loss-Functions/decomposition_intuitions.ipynb):

It is this binomial distribution that captures the data generating process of globe tossing quite well. It is specifically defined to *fix* $n$ and $\theta$, and as we scan over $w$ it computes a valid probability distribution.
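We can check this "by construction" property numerically. The sketch below (with illustrative values $n = 9$, $\theta = 0.7$) fixes $n$ and $\theta$ and scans over $w$; the resulting probabilities sum to $1$:

```python
from math import comb

def binomial_pmf(w, n, theta):
    """P(w | n, theta): probability of w water observations in n tosses."""
    return comb(n, w) * theta**w * (1 - theta)**(n - w)

# With n and theta fixed, scanning over w yields a valid distribution.
n, theta = 9, 0.7
probs = [binomial_pmf(w, n, theta) for w in range(n + 1)]
print(sum(probs))  # ≈ 1.0 (up to floating-point error)
```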
Let's now move on to the different components of the Bayesian inference process.
### Likelihood
> **Likelihood** is a mathematical formula that specifies the plausibility of the data.
It maps each *conjecture* (such as the proportion of water on the globe - $\theta$ in our example) onto the relative number of ways the data could occur, given that conjecture.
It specifically answers the question: "*given the state of the world*, what is the plausibility of observing some specific *data point*?" The state of the world can be some *parameter*, and a data point can be anything observed.
Say we perform five tosses and observe $x = [\text{water, water, land, water, land}]$, meaning that $w=3, n=5$. The likelihood will compute the plausibility of observing that specific $x$ given some value of $\theta$. Intuitively, we know that if $\theta = 0.01$ then the likelihood of observing that sequence would be very low! We can actually compute it:
$0.01 \times 0.01 \times (1 - 0.01) \times 0.01 \times (1 - 0.01) = 0.0000009801$
So what the likelihood tells us is that if $\theta$ actually was $0.01$, it would be *incredibly unlikely* to have observed the $x$ that we did. A bit more formally, the likelihood can be written as:
$P(x | \theta)$
Where $x$ is always the observed data and $\theta$ represents our parameter(s).
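The hand calculation above can be verified with a short script. Here water is encoded as $1$ and land as $0$, and we multiply the per-toss probabilities of the observed sequence:

```python
theta = 0.01
x = [1, 1, 0, 1, 0]  # observed sequence: water, water, land, water, land

# Likelihood of this exact sequence: the product of per-toss probabilities.
likelihood = 1.0
for obs in x:
    likelihood *= theta if obs == 1 else (1 - theta)

print(likelihood)  # ≈ 9.801e-07, matching the hand calculation above
```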
Now it is useful to note that choosing a likelihood is *always an assumption* that you are making; there is no free lunch, you cannot avoid it.
It may also have stood out that we are *not* using the term "probability" when speaking of the likelihood; we are referring to "plausibilities". Why is that? Let's take a step back and remember our data generating process, the binomial distribution. This is a probability distribution that is a function of three input variables:
$P: w \times n \times \theta \rightarrow [0,1]$
So it maps a 3-tuple, some specific $(w, n, \theta)$, to a real number in the range of $0$ to $1$. No matter what combination of those three inputs (subject to certain constraints, such as $\theta \in [0,1]$ and $w \leq n$) we pass in, we will get out a valid probability.
However, producing a valid *[probability distribution](https://en.wikipedia.org/wiki/Probability_distribution)* is another beast altogether. Typically, when we think about a probability *distribution* we think of a function that maps some input domain into an output range satisfying certain properties: each output is non-negative and lies in $[0,1]$, and the outputs sum to $1$ (the area under the curve is $1$ in the case of a density). These are known as the [Kolmogorov axioms](https://en.wikipedia.org/wiki/Probability_axioms).
Now think back to our binomial distribution. This was constructed as a valid probability distribution when the input we are varying is $w$, with $n$ and $\theta$ *fixed*! It is under these specific conditions that our output satisfies the axioms.
What happens if we instead vary $\theta$, holding $w$ and $n$ fixed? We are no longer given any guarantee of producing a valid probability distribution (and we almost certainly will not)! We can visualize this nicely below. On the left we see the resulting function when we fix $n$ and $w$ and vary $\theta$. The key thing to note is that the area under the curve *does not* equal $1$, so it does not satisfy the required axioms of a probability distribution. But if we look at the right, we see that we can very simply normalize the purple curve in order to get a properly normalized probability distribution. The shape is perfectly maintained, but now the area under the curve is indeed $1$.

We can overlay the non-normalized and normalized in order to get a sense of the difference between them. Keep in mind their shape and curvature is identical, the yellow is simply scaled up.

As you may have already figured out, the purple curve is our likelihood and the yellow is the likelihood converted into a valid probability distribution! The key point is that the binomial distribution is a data generating process that properly maps onto the physical situation we wish to model. However, it is only as a function of $w$, with $n$ and $\theta$ fixed, that it produces a valid probability *distribution*. This is based on its *definition*; put another way, it is this way *by construction*. If we hold $w$ and $n$ fixed and vary $\theta$, we are still working with the same data generating process, but we are now looking at a different *slice* of the resulting surface, a slice that was *not* guaranteed to have any of the properties that came with the original definition. But we can quite easily *normalize* this resulting curve in order to yield a valid probability distribution.
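This normalization can be sketched numerically. Below we fix illustrative values $w = 6$, $n = 9$ and vary $\theta$ over a grid; the area under the likelihood curve is not $1$ (for the binomial it works out to $1/(n+1)$), but dividing by that area yields a valid density:

```python
import numpy as np
from math import comb

n, w = 9, 6
theta_grid = np.linspace(0, 1, 1001)
dtheta = theta_grid[1] - theta_grid[0]

# The likelihood as a function of theta, with w and n fixed: NOT a density.
likelihood = comb(n, w) * theta_grid**w * (1 - theta_grid)**(n - w)

# Approximate the area under the curve with a Riemann sum.
area = likelihood.sum() * dtheta
print(area)  # ≈ 0.1 = 1/(n+1), not 1

# Normalizing by the area yields a valid probability density over theta.
density = likelihood / area
print(density.sum() * dtheta)  # ≈ 1.0
```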
To summarize this yet another way: we took an object (probability distribution) with a *specific purpose and interpretation* and are now using it for *a different purpose* with *a different interpretation*.
A useful intuition to have here is that of *slicing*. Looking at the two surfaces below, we see that we can hold either $y$ or $x$ constant and produce a slice with some area under the curve in the other dimension. When we do this with our binomial distribution, it is only when we fix $n$ and $\theta$ and slice along the $w$ dimension that we produce a valid distribution.


### Parameters
> **Parameters** are quantities that we wish to estimate.
They represent the different *conjectures* for causes or explanations of the data. In our globe tossing example above, $x$ is *data* that we believe we have observed without error. That leaves $\theta$ as our unknown parameter, and it is our Bayesian machine's job to tell us what the data say about it.
### Prior
For every parameter you intend the Bayesian machine to estimate, you must provide the machine a **prior**. This is simply the initial plausibility of that given parameter. Priors are engineering assumptions, chosen to help the machine learn. The prior is written as $P(\theta)$.
### Posterior
Once you have chosen a likelihood, which parameters are to be estimated, and a prior for each parameter, a Bayesian model treats the estimates as a purely logical consequence of those assumptions. For every unique combination of data, likelihood, parameters, and prior, there is a unique set of estimates. The resulting estimates—the relative plausibility of different parameter values, conditional on the data—are known as the **posterior distribution**. The posterior distribution takes the form of the probability of the parameters, conditional on the data: $P(\theta | x)$.
$\text{Posterior} = P(\theta | x) = \frac{\text{Likelihood} \times \text{Prior}}{\text{Average Likelihood}}= \frac{P(x | \theta) P(\theta)}{P(x)} = \frac{P(x | \theta) P(\theta)}{\int P(x | \theta) P(\theta)d\theta}$
An example will help elucidate what this "average likelihood" term (sometimes called the "evidence" or "probability of the data") is. Remember that earlier we touched on the fact that our likelihood was *not* a valid probability distribution (see that discussion for more details). For now, let's say that our prior, $P(\theta)$, is a uniform distribution and can effectively be ignored (it is $1$ for all values of $\theta$). Then we must normalize our numerator (the likelihood) in order to end up with valid probabilities and a valid posterior probability distribution. That is all the denominator is doing! It is performing the same normalization that we saw earlier:
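The whole pipeline can be sketched with a grid approximation, where the integral in the denominator becomes a sum over a grid of $\theta$ values. The data values here ($w = 6$, $n = 9$) are illustrative:

```python
import numpy as np
from math import comb

w, n = 6, 9
theta_grid = np.linspace(0, 1, 1000)

prior = np.ones_like(theta_grid)  # uniform prior: P(theta) = 1 everywhere
likelihood = comb(n, w) * theta_grid**w * (1 - theta_grid)**(n - w)

# Posterior = likelihood * prior, normalized by the "average likelihood"
# (here approximated by a sum over the grid).
unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()

print(theta_grid[np.argmax(posterior)])  # peak near w/n ≈ 0.667
```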

### Misc
* [ChatGPT](https://chat.openai.com/c/74e90c18-997a-4baf-96fe-1b53b9661072)
* Fundamental modeling outperforming pure statistical approaches (see more [here](http://www.stat.columbia.edu/~gelman/stuff_for_blog/reilly1.pdf), Statistical Rethinking page 94)
* Model criticism and revision
---
Date: 20240303
Links to:
Tags:
References:
* All images were generated via this notebook [here](https://github.com/NathanielDake/intuitiveml/blob/master/notebooks/Machine-Learning-Perspective/Loss-Functions/decomposition_intuitions.ipynb)