# Classical relative entropy

In probability theory and information theory, the Kullback-Leibler divergence, or relative entropy, is a quantity that measures how one probability distribution differs from another. It is named after Solomon Kullback and Richard Leibler. The term "divergence" is unrelated to the divergence operator of vector calculus. Nor is the quantity a distance metric: it is not symmetric and does not satisfy the triangle inequality.

The Kullback-Leibler divergence between two probability distributions p and q is defined as

$$\mathrm{KL}(p,q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}$$

for distributions of a discrete variable, and as

$$\mathrm{KL}(p,q) = \int_{-\infty}^{\infty} p(x) \log_2 \frac{p(x)}{q(x)} \, dx$$

for distributions of a continuous random variable. The logarithms in these formulae are conventionally taken to base 2, so that the quantity can be interpreted in units of bits; the other important properties of the KL divergence hold irrespective of log base.
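As an illustration, the discrete definition can be sketched in a few lines of Python. The function name `kl_divergence` and the handling of zero probabilities are assumptions made for this sketch: terms with p(x) = 0 are taken to contribute nothing, and the result is taken to be infinite when p assigns mass to an outcome that q excludes.

```python
import math

def kl_divergence(p, q, base=2.0):
    """Discrete KL(p, q) for two probability vectors over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # assumed convention: 0 * log(0 / q) = 0
        if qx == 0.0:
            return math.inf  # assumed convention: p has mass where q has none
        total += px * math.log(px / qx, base)
    return total

p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, q))  # in bits (base 2)
print(kl_divergence(q, p))  # a different value: the divergence is not symmetric
```

Passing `base=math.e` gives the same quantity in nats instead of bits.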

It can be seen from the definition that

$$\mathrm{KL}(p,q) = -\sum_x p(x) \log_2 q(x) + \sum_x p(x) \log_2 p(x) = H(p,q) - H(p)$$

where H(p,q) denotes the cross entropy of p and q, and H(p) the entropy of p. Since the cross entropy is always greater than or equal to the entropy, the Kullback-Leibler divergence is nonnegative; furthermore, KL(p,q) is zero if and only if p = q, a result known as Gibbs' inequality.
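The decomposition into cross entropy minus entropy can be checked numerically on a small example. The sketch below uses hypothetical helper names (`entropy`, `cross_entropy`, `kl_direct`) and assumes that q(x) > 0 wherever p(x) > 0; the two distributions are arbitrary choices for illustration.

```python
import math

def entropy(p):
    """H(p) in bits, skipping zero-probability outcomes."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_direct(p, q):
    """KL(p, q) in bits, computed straight from the definition."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

kl_from_entropies = cross_entropy(p, q) - entropy(p)
print(entropy(p), cross_entropy(p, q))   # 1.5 and 1.75 bits
print(kl_from_entropies, kl_direct(p, q))  # both 0.25 bits
```

Swapping in q for p in both arguments gives H(p,p) = H(p), so the difference is zero, in line with Gibbs' inequality.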

In coding theory, the KL divergence KL(p,q) can be interpreted as the expected extra message length per datum needed to send messages distributed as p, if the messages are encoded using a code that is optimal for distribution q.
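The coding interpretation can be made concrete with idealised code lengths. The sketch below assumes a code that spends −log2 m(x) bits on symbol x when designed for a distribution m; the names `source` (the distribution that actually generates the symbols) and `model` (the distribution the code was designed for) are hypothetical. The expected overhead of the mismatched code comes out equal to KL(source, model).

```python
import math

def ideal_code_lengths(dist):
    """Idealised code length, in bits, of each symbol for a code matched to dist."""
    return [-math.log2(px) for px in dist]

def expected_length(source, lengths):
    """Average bits per symbol when symbols are drawn from source."""
    return sum(px * length for px, length in zip(source, lengths))

source = [0.5, 0.25, 0.125, 0.125]   # what the data actually follow
model = [0.25, 0.25, 0.25, 0.25]     # what the code was designed for

matched = expected_length(source, ideal_code_lengths(source))    # 1.75 bits
mismatched = expected_length(source, ideal_code_lengths(model))  # 2.0 bits
overhead = mismatched - matched

kl = sum(px * math.log2(px / qx) for px, qx in zip(source, model))
print(overhead, kl)  # both 0.25 bits per symbol
```

Dyadic probabilities are used so that the idealised lengths are whole numbers of bits; for general distributions a real prefix code can only approximate them.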

In Bayesian statistics, the KL divergence can be used as a measure of the "distance" between the prior distribution and the posterior distribution. The KL divergence is also the gain in Shannon information involved in going from the prior to the posterior. In Bayesian experimental design, a design that is optimised to maximise the KL divergence between the prior and the posterior is said to be Bayes d-optimal.
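A small discrete example may help to illustrate the information gain. The setup below is hypothetical: a coin whose bias is one of three candidate values, a uniform prior over them, and a single observed head. The gain is computed here as KL(posterior, prior), which is the usual direction for information gain and is an assumption about the intended convention.

```python
import math

def kl_bits(p, q):
    """KL(p, q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def posterior(prior, likelihood):
    """Bayes' rule over a finite set of hypotheses."""
    unnormalised = [pr * lk for pr, lk in zip(prior, likelihood)]
    evidence = sum(unnormalised)
    return [u / evidence for u in unnormalised]

biases = [0.25, 0.5, 0.75]      # hypothetical values of P(heads)
prior = [1 / 3, 1 / 3, 1 / 3]   # uniform prior over the three hypotheses

# Observing one head: the likelihood of the data under each hypothesis
# is simply that hypothesis's bias.
post = posterior(prior, biases)
gain = kl_bits(post, prior)

print(post)  # approximately [1/6, 1/3, 1/2]
print(gain)  # information gained from the observation, in bits
```

An experimental design in this spirit would compare the gain expected from each candidate experiment and choose the largest.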

In quantum information science, its quantum analogue, the quantum relative entropy, minimised over all separable states, is used as a measure of entanglement in a state.

Kullback and Leibler themselves defined the divergence as the symmetrised sum

$$\mathrm{KL}(p,q) + \mathrm{KL}(q,p)$$

which is symmetric and nonnegative.
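The symmetry of this sum, in contrast to the asymmetry of each term, can be seen in a short sketch. The helper names `kl_bits` and `symmetrised_kl` are hypothetical, and both distributions are assumed to be strictly positive.

```python
import math

def kl_bits(p, q):
    """KL(p, q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def symmetrised_kl(p, q):
    """KL(p, q) + KL(q, p): unchanged when the arguments are swapped."""
    return kl_bits(p, q) + kl_bits(q, p)

p = [0.5, 0.5]
q = [0.25, 0.75]

print(kl_bits(p, q), kl_bits(q, p))                # two different values
print(symmetrised_kl(p, q), symmetrised_kl(q, p))  # the same value twice
```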

### References

• S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics 22(1):79–86, March 1951.