# Entropy, Cross Entropy and KL Divergence
We can define **Entropy** as:
$\text{Entropy} = H(X)= \sum_{x \in X} p(x) \overbrace{\log\left(\frac{1}{p(x)}\right)}^{\text{surprisal}}$
This is specifically meant to capture the average **Shannon information (surprisal)** of a random variable $X$ with distribution $p(x)$.
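As a quick sanity check, here is a minimal NumPy sketch computing $H(X)$ in bits (the `entropy` helper and the example coin distributions are just illustrative, not from any of the linked articles):
```python
import numpy as np

def entropy(p):
    """Average surprisal log2(1 / p(x)) under the true distribution p (in bits)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with p(x) = 0 contribute nothing to the sum
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))  # biased coin: ~0.469 bits (less surprising on average)
```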
We can then measure **Cross Entropy** as:
$\text{Cross Entropy} = H(p, q)= \sum_{x \in X} p(x) \log\left(\frac{1}{q(x)}\right)$
This is the average Shannon information if we use a guessed distribution $q$ in place of the true distribution $p$. Note that, as a function of $q$, cross entropy is minimized when $q = p$, which makes it a proper scoring rule ([Scoring rule - Wikipedia](https://en.wikipedia.org/wiki/Scoring_rule)).
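A similar sketch for cross entropy (again, the `cross_entropy` helper and the example distributions are made up for illustration), showing that $H(p, q)$ equals $H(p)$ when $q = p$ and is larger otherwise:
```python
import numpy as np

def cross_entropy(p, q):
    """Average surprisal log2(1 / q(x)) when x is actually drawn from p (in bits)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(1.0 / q[mask])))

p = [0.7, 0.2, 0.1]
print(cross_entropy(p, p))                # ~1.157 bits, equals H(p)
print(cross_entropy(p, [0.4, 0.4, 0.2]))  # ~1.422 bits, strictly larger
```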
Finally, we can define **KL Divergence** as:
$\text{KL Divergence} = D_{KL}(p \,\|\, q)= H(p, q) - H(X) = \sum_{x \in X} p(x) \log\left(\frac{1}{q(x)}\right) - \sum_{x \in X} p(x) \log\left(\frac{1}{p(x)}\right)$
We see that the second term, $H(X)$, doesn't depend on the guessed distribution $q$, so if we wish to make $q$ close to $p$, minimizing the KL divergence over $q$ is equivalent to minimizing the cross entropy. This is why cross entropy, rather than KL divergence, is typically the loss function we minimize in machine learning contexts. See more [here](http://www.awebb.info/probability/2017/05/18/cross-entropy-and-log-likelihood.html#K-L-divergence) and [here](https://glassboxmedicine.com/2019/12/07/connections-log-likelihood-cross-entropy-kl-divergence-logistic-regression-and-neural-networks/).
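A small self-contained sketch (the helpers and distributions are illustrative assumptions) confirming the identity $D_{KL}(p \,\|\, q) = H(p, q) - H(X)$ numerically:
```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log2(1.0 / p)))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(1.0 / q)))

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))

p = [0.7, 0.2, 0.1]  # "true" distribution (made up)
q = [0.4, 0.4, 0.2]  # guessed distribution (made up)

# D_KL(p || q) computed directly and as H(p, q) - H(p) give the same number;
# since H(p) is constant in q, minimizing cross entropy over q minimizes D_KL.
print(kl_divergence(p, q))               # ~0.265 bits
print(cross_entropy(p, q) - entropy(p))  # same value
```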
### In terms of coding schemes
See this article: [Cross Entropy and Log Likelihood | Andrew M. Webb](http://www.awebb.info/probability/2017/05/18/cross-entropy-and-log-likelihood.html).
* Let us define Entropy as the expected number of bits needed to communicate the value taken by $X$ if we use the optimal encoding scheme, based on the true distribution $p$.
* Cross entropy is then the expected number of bits needed to communicate the value taken by $X$ if we use an encoding scheme based on some other distribution $q$.
* KL Divergence is then the number of *additional bits*, on average, needed to communicate the value taken by $X$ if we use a code built for $q$ instead of $p$ (see the sketch below).
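A tiny numerical illustration of the bullets above, using idealized code lengths of $-\log_2$ bits per symbol and made-up distributions (ignoring the integer rounding a real code would need):
```python
import numpy as np

# Idealized code lengths: an optimal code for a distribution assigns a symbol of
# probability r a codeword of roughly -log2(r) bits (ignoring integer rounding).
p = np.array([0.5, 0.25, 0.125, 0.125])  # true distribution (made up)
q = np.array([0.25, 0.25, 0.25, 0.25])   # distribution the code was built for

bits_code_for_p = np.sum(p * -np.log2(p))       # entropy H(p)         = 1.75 bits/symbol
bits_code_for_q = np.sum(p * -np.log2(q))       # cross entropy H(p,q) = 2.00 bits/symbol
extra_bits = bits_code_for_q - bits_code_for_p  # KL divergence        = 0.25 bits/symbol

print(bits_code_for_p, bits_code_for_q, extra_bits)
```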
### Notes

---
Date: 20220206
Links to:
Tags:
References:
* [Nathaniel Dake Blog](https://www.nathanieldake.com/Mathematics/05-Information_Theory-01-Cross-Entropy-and-MLE-walkthrough.html)
* [Connections: Log Likelihood, Cross Entropy, KL Divergence, Logistic Regression, and Neural Networks – Glass Box](https://glassboxmedicine.com/2019/12/07/connections-log-likelihood-cross-entropy-kl-divergence-logistic-regression-and-neural-networks/)
* Entropy & Information Paper notes
* Entropy notes in notability
* [Cross Entropy and Log Likelihood | Andrew M. Webb](http://www.awebb.info/probability/2017/05/18/cross-entropy-and-log-likelihood.html)