In [[probability theory]] and [[information theory]], the '''Kullback-Leibler divergence''', or '''relative entropy''', is a quantity which measures the difference between two [[probability distributions]]. It is named after [[Solomon Kullback]] and [[Richard Leibler]]. The term "divergence" is a misnomer; it is not the same as [[divergence]] in [[vector calculus|calculus]]. One might be tempted to call it a "[[metric space|distance metric]]", but this would also be a misnomer as the Kullback-Leibler divergence is not [[symmetric]] and does not satisfy the [[triangle inequality]].
The Kullback-Leibler divergence between two probability distributions ''p'' and ''q'' is defined as
:<math>KL(p,q) = \sum_x p(x) \log \frac{p(x)}{q(x)}</math>
for distributions of a [[discrete]] variable, and as
:<math>KL(p,q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx</math>
for distributions of a [[continuous random variable]]. The logarithms in these formulae are conventionally taken to base 2, so that the quantity can be interpreted in units of [[bit|bits]]; the other important properties of the KL divergence hold irrespective of log base.
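As a rough illustration, the discrete formula can be evaluated directly. The following Python sketch (the function name <code>kl_divergence</code> is only illustrative) computes the divergence in bits for two small distributions over the same outcomes:
<pre>
import math

def kl_divergence(p, q):
    """KL divergence in bits between two discrete distributions,
    given as sequences of probabilities over the same outcomes."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A fair coin p versus a biased coin q.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # ~0.737 bits
print(kl_divergence(q, p))  # ~0.531 bits -- the divergence is not symmetric
</pre>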
It can be seen from the definition that
:<math>KL(p,q) = H(p,q) - H(p)</math>
where ''H''(''p'',''q'') denotes the [[cross entropy]] of ''p'' and ''q'', and ''H''(''p'') the [[information entropy|entropy]] of ''p''. Since the cross entropy is always greater than or equal to the entropy, the Kullback-Leibler divergence is nonnegative; furthermore, ''KL''(''p'',''q'') is zero [[iff]] ''p'' = ''q''. This result is known as [[Gibbs' inequality]].
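The identity above can be checked numerically; the short Python sketch below (reusing the two example distributions from the earlier sketch, so it is a sanity check rather than a proof) computes the cross entropy, the entropy and the KL divergence in bits:
<pre>
import math

p = [0.5, 0.5]
q = [0.9, 0.1]

cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))      # H(p,q)
entropy       = -sum(pi * math.log2(pi) for pi in p)                  # H(p)
kl            = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))  # KL(p,q)

print(kl)                        # ~0.737 bits
print(cross_entropy - entropy)   # the same value, as the identity requires
</pre>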
In [[coding theory]], the KL divergence can be interpreted as the expected extra message-length per datum incurred when messages distributed as ''p'' are encoded using a code that is optimal for the distribution ''q''.
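To make the coding interpretation concrete, the sketch below uses Shannon code lengths, i.e. the integer lengths <math>\lceil -\log_2 q(x) \rceil</math>. With the dyadic probabilities chosen here the expected overhead of the mismatched code equals the KL divergence exactly; for general distributions the two agree only up to the rounding of code lengths to whole bits. This is an illustrative sketch, not a full coding scheme:
<pre>
import math

# Dyadic example: ceil(-log2 prob) gives exact Shannon code lengths.
p = [0.5, 0.25, 0.25]   # true distribution of the messages
q = [0.25, 0.25, 0.5]   # distribution the code was designed for

len_for_q = [math.ceil(-math.log2(x)) for x in q]   # mismatched code
len_for_p = [math.ceil(-math.log2(x)) for x in p]   # optimal code for p

extra_bits = sum(pi * (lq - lp) for pi, lq, lp in zip(p, len_for_q, len_for_p))
kl         = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(extra_bits)  # 0.25 extra bits per message with the mismatched code
print(kl)          # 0.25 bits -- equal to the KL divergence in this case
</pre>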
In [[Bayesian statistics]] the KL divergence can be used as a measure of the "distance" between the [[prior distribution]] and the [[posterior distribution]]. The KL divergence is also the gain in [[information entropy|Shannon information]] involved in going from the prior to the posterior. In [[Bayesian experimental design]] a design which is optimised to maximise the expected KL divergence between the prior and the posterior is said to be [[Bayes d-optimality|Bayes d-optimal]].
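As a small worked example of this Bayesian reading (with made-up numbers), the sketch below updates a two-hypothesis prior after a single observation and reports the KL divergence from the posterior to the prior as the information gained, in bits:
<pre>
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two hypotheses about a coin (fair, or biased towards heads), equally likely a priori.
prior = [0.5, 0.5]
likelihood_heads = [0.5, 0.9]    # P(heads | hypothesis)

# Posterior after observing one "heads", by Bayes' theorem.
unnormalised = [pr * lh for pr, lh in zip(prior, likelihood_heads)]
posterior = [u / sum(unnormalised) for u in unnormalised]

print(posterior)             # [0.357..., 0.642...]
print(kl(posterior, prior))  # ~0.06 bits gained from the observation
</pre>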
In [[quantum information science]] its quantum generalisation, the quantum relative entropy, is used as a measure of the entanglement in a state.
Kullback and Leibler themselves actually defined the divergence as
:<math>KL(p,q) + KL(q,p) = \sum_x \left( p(x) - q(x) \right) \log \frac{p(x)}{q(x)},</math>
which is symmetric and nonnegative.
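For comparison with the asymmetric form used above, the following sketch evaluates this original symmetrised quantity on the coin example from earlier; swapping the arguments leaves the value unchanged:
<pre>
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetrised_divergence(p, q):
    # Kullback and Leibler's original quantity: KL(p,q) + KL(q,p).
    return kl(p, q) + kl(q, p)

p, q = [0.5, 0.5], [0.9, 0.1]
print(symmetrised_divergence(p, q))  # ~1.268 bits
print(symmetrised_divergence(q, p))  # the same value -- the quantity is symmetric
</pre>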
==References==
* S. Kullback and R. A. Leibler. On information and sufficiency. ''Annals of Mathematical Statistics'' 22(1):79–86, March 1951.
{{FromWikipedia}}
[[Category:Classical Information Theory]]
[[Category:Handbook of Quantum Information]]
[[Category:Entropy]]