# Conditional Probability
Given a probability distribution $p(x | z)$, we can think of it as a **slice** taken from the full **joint distribution**, $p(x, z)$.
But let's take a step back for a moment. Say we have the following equation, which simply relates the joint to the conditional probability:
$p(x, z) = p(x | z) p(z)$
Let's start by saying that each of these terms is a **function**. That may not be immediately obvious, but each one literally is a function, which we could write out as:
$p(x, z) = g: X \times Z \rightarrow [0, 1]$
$p(x|z) = h: X \times Z \rightarrow [0, 1]$
$p(z) = k: Z \rightarrow [0, 1]$
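To make the claim that each term is literally a function concrete, here is a minimal sketch with a small, made-up discrete example (the sets `X_vals` and `Z_vals` and the numbers in `joint` are arbitrary, purely for illustration). Each term becomes a Python function whose outputs land in $[0, 1]$, and the factorization above can be checked numerically:

```python
# A tiny, made-up discrete example: x takes values in X = {-2, 0, 2},
# z takes values in Z = {0, 4}. All numbers here are purely illustrative.
X_vals = [-2, 0, 2]
Z_vals = [0, 4]

# p(x, z): a joint table whose entries sum to 1.
joint = {
    (-2, 0): 0.05, (-2, 4): 0.10,
    ( 0, 0): 0.40, ( 0, 4): 0.05,
    ( 2, 0): 0.35, ( 2, 4): 0.05,
}

def p_joint(x, z):   # g: X x Z -> [0, 1]
    return joint[(x, z)]

def p_z(z):          # k: Z -> [0, 1], here obtained by summing the joint over x
    return sum(joint[(x, z)] for x in X_vals)

def p_cond(x, z):    # h: X x Z -> [0, 1], i.e. p(x | z)
    return joint[(x, z)] / p_z(z)

# The factorization p(x, z) = p(x | z) p(z) holds for every (x, z) pair.
for x in X_vals:
    for z in Z_vals:
        assert abs(p_joint(x, z) - p_cond(x, z) * p_z(z)) < 1e-12
```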
So again, each of these terms takes in some input and yields an output in the **range** $[0, 1]$. As is always the case with functions, we can think of them in two ways:
1. Without being passed a *specific* input. In this case we think of the function against *all inputs*. Consider $f(x) = x^2$. You can visualize this as a parabola - in this case we are thinking of the function evaluated against all inputs.
2. Being passed a *specific* input. In our parabola example this could mean $f(x = 2) = 2^2 = 4$. (Both views are sketched in code below.)
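The same distinction in code, using the parabola from the list above: we can look at $f$ evaluated over a whole grid of inputs (the first view), or at one specific input (the second view).

```python
import numpy as np

def f(x):
    return x ** 2

# View 1: the function considered against (a grid of) all inputs.
xs = np.linspace(-3, 3, 7)
print(f(xs))   # [9. 4. 1. 0. 1. 4. 9.]

# View 2: the function evaluated at one specific input.
print(f(2))    # 4
```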
In a probabilistic setting, it is often less clear whether we are thinking of a function across all inputs or of it being evaluated at a single input.
Why are these functions so often combined with sums and integrals? For instance, you may see:
$p(x) = \int p(x, z) dz$
We can start by considering just the two marginal distributions $p(x)$ and $p(z)$. These are two functions that *know nothing about each other*. But if we suddenly have access to $p(x, z)$, that is a function explicitly encoding the probability of specific pairs $(x, z)$ occurring together. It is a means of **information sharing** across these variables.
Given this joint distribution, we can intuitively think of $p(x|z)$ as yielding a **slice** of $p(x, z)$, *given a specific $z$*. In other words, we *fix* $z$ to some specific value and look at the cross-section of the joint at that value.
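Continuing with the toy `joint` table from the sketch above (still purely illustrative numbers), fixing $z$ picks out one slice of the joint; dividing that slice by $p(z)$ rescales it into the conditional distribution over $x$:

```python
# Reuses `joint`, `X_vals`, and `p_z` from the earlier sketch.
z_fixed = 4

# The slice of the joint at z = 4. Note that it is not itself a
# distribution over x: its values sum to p(z = 4), not to 1.
slice_at_z = {x: joint[(x, z_fixed)] for x in X_vals}
print(slice_at_z)                # {-2: 0.1, 0: 0.05, 2: 0.05}
print(sum(slice_at_z.values()))  # ~0.20, i.e. p(z = 4)

# Dividing by p(z = 4) turns the slice into the conditional p(x | z = 4).
cond_at_z = {x: joint[(x, z_fixed)] / p_z(z_fixed) for x in X_vals}
print(cond_at_z)                 # ~{-2: 0.5, 0: 0.25, 2: 0.25}
```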
If we want to *recover* what the marginal distribution of $x$ looks like, we need to somehow get rid of / accumulate the effects of $z$ on $x$. Remember, we are specifically working with distributions that *encode* information about the relationship between $x$ and $z$. To make this concrete, imagine that we know that if $z = 4$, the probability that $x = -2$ is $0.95$:
$p(x=-2 | z = 4) = 0.95$
We may think that $x$ is very likely to be $-2$! However, what if $z$ is *very unlikely* to be $4$? In that case $x$ is actually not that likely to be $-2$, hence we would see:
$p(x=-2) \ll p(x = -2 | z=4)$
So to get the marginal distribution, $p(x)$, we need to **accumulate** all of the effects of $z$. In the discrete case this is simply a sum, $p(x) = \sum_z p(x | z)\, p(z)$, and in the continuous case it is the integral written above.
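As a minimal sketch of that accumulation in the discrete case: the conditional value $0.95$ below comes from the example above, while the prior $p(z)$ values are made up solely to illustrate how a large conditional can still contribute very little to the marginal.

```python
# Hypothetical conditional p(x = -2 | z) and prior p(z) over a few z values.
# Only p(x = -2 | z = 4) = 0.95 comes from the example above; the rest is made up.
p_x_given_z = {0: 0.02, 2: 0.05, 4: 0.95}   # p(x = -2 | z)
p_z_prior   = {0: 0.70, 2: 0.29, 4: 0.01}   # p(z); z = 4 is very unlikely

# Accumulate over z: p(x = -2) = sum_z p(x = -2 | z) p(z)
p_x_minus_2 = sum(p_x_given_z[z] * p_z_prior[z] for z in p_z_prior)
print(p_x_minus_2)   # ~0.038, far below p(x = -2 | z = 4) = 0.95
```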
---
Date: 20231031
Links to:
Tags:
References:
* An introduction to variational autoencoders
* [Conditional Probability and Slicing](https://chat.openai.com/share/15b864d6-650e-42bd-9d2c-fb693ed4b327)