# Conditional Probability
Given a probability distribution $p(x | z)$, we can think of it as a **slice** taken from the full **joint distribution**, $p(x, z)$.
But let's take a step back for a moment. Say we have the following equation, which simply relates the joint to the conditional probability:
$p(x, z) = p(x | z) p(z)$
Let's start by saying that each of these terms is a **function**. That may not be immediately obvious, but each one literally is a function, which we could write out as:
$p(x, z) = g: X \times Z \rightarrow [0, 1]$
$p(x|z) = h: X \times Z \rightarrow [0, 1]$
$p(z) = k: Z \rightarrow [0, 1]$
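To make the claim that each term is literally a function concrete, here is a minimal sketch with a small, made-up discrete example (the sets `X_vals` and `Z_vals` and the numbers in `joint` are arbitrary, purely for illustration). Each term becomes a Python function whose outputs land in $[0, 1]$, and the factorization above can be checked numerically:

```python
# A tiny, made-up discrete example: x takes values in X = {-2, 0, 2},
# z takes values in Z = {0, 4}. All numbers here are purely illustrative.
X_vals = [-2, 0, 2]
Z_vals = [0, 4]

# p(x, z): a joint table whose entries sum to 1.
joint = {
    (-2, 0): 0.05, (-2, 4): 0.10,
    ( 0, 0): 0.40, ( 0, 4): 0.05,
    ( 2, 0): 0.35, ( 2, 4): 0.05,
}

def p_joint(x, z):   # g: X x Z -> [0, 1]
    return joint[(x, z)]

def p_z(z):          # k: Z -> [0, 1], here obtained by summing the joint over x
    return sum(joint[(x, z)] for x in X_vals)

def p_cond(x, z):    # h: X x Z -> [0, 1], i.e. p(x | z)
    return joint[(x, z)] / p_z(z)

# The factorization p(x, z) = p(x | z) p(z) holds for every (x, z) pair.
for x in X_vals:
    for z in Z_vals:
        assert abs(p_joint(x, z) - p_cond(x, z) * p_z(z)) < 1e-12
```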
So again, each of these terms takes in some input and yields an output in the **range** $[0, 1]$. As is always the case with functions, we can think of them in two ways:
1. Without being passed a *specific* input. In this case we think of the function against *all inputs*. Consider $f(x) = x^2$. You can visualize this as a parabola - in this case we are thinking of the function evaluated against all inputs.
2. Being passed a *specific* input. In our parabola example this could mean $f(x = 2) = 2^2 = 4$. (Both views are sketched in code below.)
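The same distinction in code, using the parabola from the list above: we can look at $f$ evaluated over a whole grid of inputs (the first view), or at one specific input (the second view).

```python
import numpy as np

def f(x):
    return x ** 2

# View 1: the function considered against (a grid of) all inputs.
xs = np.linspace(-3, 3, 7)
print(f(xs))   # [9. 4. 1. 0. 1. 4. 9.]

# View 2: the function evaluated at one specific input.
print(f(2))    # 4
```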
In a probabilistic setting, it is often less clear whether we are thinking of a function across all inputs or of it being evaluated at a single input.
Why are these functions so often combined with sums and integrals? For instance, you may see:
$p(x) = \int p(x, z) dz$
We can start by considering just the two marginal distributions $p(x)$ and $p(z)$. These are two functions that *know nothing about each other*. But if we suddenly have access to $p(x, z)$, that is a function explicitly encoding the probability of specific pairs $(x, z)$ occurring together. It is a means of **information sharing** across these variables.
Given this joint distribution, we can intuitively think of $p(x|z)$ as yielding a **slice** of $p(x, z)$, *given a specific $z$*. In other words, we *fix* $z$ to some specific value and look at the cross-section of the joint at that value.
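Continuing with the toy `joint` table from the sketch above (still purely illustrative numbers), fixing $z$ picks out one slice of the joint; dividing that slice by $p(z)$ rescales it into the conditional distribution over $x$:

```python
# Reuses `joint`, `X_vals`, and `p_z` from the earlier sketch.
z_fixed = 4

# The slice of the joint at z = 4. Note that it is not itself a
# distribution over x: its values sum to p(z = 4), not to 1.
slice_at_z = {x: joint[(x, z_fixed)] for x in X_vals}
print(slice_at_z)                # {-2: 0.1, 0: 0.05, 2: 0.05}
print(sum(slice_at_z.values()))  # ~0.20, i.e. p(z = 4)

# Dividing by p(z = 4) turns the slice into the conditional p(x | z = 4).
cond_at_z = {x: joint[(x, z_fixed)] / p_z(z_fixed) for x in X_vals}
print(cond_at_z)                 # ~{-2: 0.5, 0: 0.25, 2: 0.25}
```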
If we want to *recover* what the marginal distribution of $x$ looks like, we need to somehow get rid of / accumulate the effects of $z$ on $x$. Remember, we are specifically working with distributions that *encode* information about the relationship between $x$ and $z$. To make this concrete, imagine that we know that if $z = 4$, the probability that $x = -2$ is $0.95$:
$p(x=-2 | z = 4) = 0.95$
We may think that $x$ is very likely to be $-2$! However, what if $z$ is *very unlikely* to be $4$? In that case $x$ is actually not that likely to be $-2$, hence we would see:
$p(x=-2) \ll p(x = -2 | z=4)$
So to get the marginal distribution, $p(x)$, we need to **accumulate** all of the effects of $z$. In the discrete case this is simply a sum, $p(x) = \sum_z p(x | z)\, p(z)$, and in the continuous case it is the integral written above.
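As a minimal sketch of that accumulation in the discrete case: the conditional value $0.95$ below comes from the example above, while the prior $p(z)$ values are made up solely to illustrate how a large conditional can still contribute very little to the marginal.

```python
# Hypothetical conditional p(x = -2 | z) and prior p(z) over a few z values.
# Only p(x = -2 | z = 4) = 0.95 comes from the example above; the rest is made up.
p_x_given_z = {0: 0.02, 2: 0.05, 4: 0.95}   # p(x = -2 | z)
p_z_prior   = {0: 0.70, 2: 0.29, 4: 0.01}   # p(z); z = 4 is very unlikely

# Accumulate over z: p(x = -2) = sum_z p(x = -2 | z) p(z)
p_x_minus_2 = sum(p_x_given_z[z] * p_z_prior[z] for z in p_z_prior)
print(p_x_minus_2)   # ~0.038, far below p(x = -2 | z = 4) = 0.95
```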
---
Date: 20231031
Links to:
Tags:
References:
* An introduction to variational autoencoders
* [Conditional Probability and Slicing](https://chat.openai.com/share/15b864d6-650e-42bd-9d2c-fb693ed4b327)