# Classical Entropy Measures

====Introduction====

Classical entropy quantities are a way to quantify how much information is revealed in a random event. Shannon first introduced a way to quantify information by associating to an event occuring with probability p, an amount of information -\log p (as is standard in information theory, logarithms are taken in base 2 and so information is measured in bits). If an unlikely event occurs, one gains a large amount of information. The Shannon entropy is the average amount of information gained by observing the outcome of an event. In some cases, in particular ones relevant to cryptography, the average amount of information is not a good quantity, since it assumes many independent repetitions of the same experiment.

##### Random Events

When we discuss random events, we assume that they occur according to a pre-defined ensemble of possible outcomes of that event, and associated probabilities. Event X is a single instance drawn from the ensemble \{0,1,\ldots,|X|\} with probabilities \{p_0,p_1,\ldots,p_{|X|}\}. We call this probability distribution P_X. The terminology X=x refers to a single instance drawn from this distribution taking the value x. One similarly defines distributions over more than one random variable. For instance, P_{XY} is the joint distribution of X and Y, and P_{X|Y=y} is the distribution of X conditioned on the fact that Y takes the value Y=y.

##### Shannon Entropy

It was Shannon who pioneered the mathematical formulation of information. In essence his insight was that an event that occurs with probability p could be associated with an amount of information -\log p. Consider many independent repetitions of random event X. The average information revealed by each instance of X is given by the Shannon entropy of X defined as follows.

The Shannon entropy associated with an event x drawn from random distribution X is H(X)\equiv\sum_{x\in X}-P_X(x)\log P_X(x).

Likewise, one can define conditional Shannon entropies. H(X|Y=y) denotes the Shannon entropy of X given Y=y. It measures the average amount of information one learns from a single instance of X if one possesses string y\in Y, where X,Y are chosen according to joint distribution P_{XY}. One can average this quantity to form H(X|Y), the conditional Shannon entropy.

The conditional Shannon entropy of an event X given Y is defined by H(X|Y)\equiv\sum_{x\in X, y\in Y}-P_Y(y)P_{X|Y=y}(x)\log P_{X|Y=y}(x).

This leads one to define the mutual Shannon information between X and Y by I(X:Y)\equiv H(X)-H(X|Y)=H(Y)-H(Y|X). In some sense, this is the amount of information in common to the two strings X and Y.

Shannon information was first used to solve problems of compression, and communication over a noisy channel, as given in the following theorems.

Source coding theorem : Consider a source emitting independent and indentically distributed (i.i.d.) random variables drawn from distribution P_X. For any \epsilon>0 and R>H(X), there exists an encoder such that for sufficiently large N, any sequence drawn from P_X^N can be compressed to length NR, and a decoder such that, except with probability <\epsilon, the original sequence can be restored from the compressed string.

Furthermore, if one tries to compress the same source using R bits per instance, it is virtually certain that information will be lost.

For a discrete, memoryless channel, in which Alice sends a random variable drawn from X to Bob who receives Y, the channel capacity is defined by C\equiv \max_{P_X}I(X:Y).

Noisy channel coding theorem : Consider Alice communicate with Bob via a discrete memoryless channel which has the property that if Alice draws from an i.i.d. source X, Bob receives Y. For any \epsilon>0 and R, for large enough N, there exists an encoding of length N and a decoder such that \geq RN bits of information are conveyed by the channel for each encoder-channel-decoder cycle, except with probability <\epsilon.

Notice that in the noisy channel coding theorem, the channel is memoryless, and Alice has an i.i.d. source. In other words, all uses of the channel are independent of one another. This is the situation in which Shannon information is useful. However, in cryptographic scenarios where the channel may be controlled by an eavesdropper, such an assumption is not usually valid. Instead, other entropy measures have been developed that apply for these cases.

##### Beyond Shannon entropy

Rényi introduced the following generalization of the Shannon entropy.

The Rényi entropy of order \alpha is defined by H_{\alpha}(X)\equiv\frac{1}{1-\alpha}\log\sum_{x\in X}P_X(x)^{\alpha}.

We have, H_1(X)\equiv\lim_{\alpha\rightarrow 1}H_{\alpha}(X)=H(X). Two other important cases are H_0(X)=\log|X| and H_{\infty}(X)=-\log\max_{x\in X}P_X(x). A useful property is that, for \alpha\leq\beta, H_{\alpha}(X)\geq H_{\beta}(X).

H_0(X) is sometimes called the max entropy of X. It is important for information reconciliation, which in essence is error correction.

H_{\infty}(X) is sometimes called the min entropy of X. It is important for privacy amplification. There, the presence of an eavesdropper means that it no longer suffices to consider each use of the channel as independent. The min entropy represents the maximum amount of information that could be learned from the event X, so describes the worst case scenario. In a cryptographic application, one wants to be assured security even in the worst case.

In general, Rényi entropies are strongly discontinuous. As an example, consider the two distributions P_X and Q_X defined on x\in\{1,\ldots,2^n\}. Take P_X(1)=2^{-\frac{n}{4}}, P_X(x\neq 1)=\frac{1-2^{-\frac{n}{4}}}{2^n-1}, and Q_X to be the uniform distribution. The difference between min entropies is \frac{3n}{4}. In the large n limit, the two distributions have distance \approx 2^{-\frac{n}{4}}, which is exponentially small, while the difference in min entropies becomes arbitrarily large.

Smoothed versions of these quantities have been introduced which remove such discontinuities. In essence, these smoothed quantities involve optimizing such quantities over a small region of probability space. They have operational significance in cryptography in that they provide the relevant quantities for information reconciliation and privacy amplification.

The conditional versions of these smoothed entropies are relevant in cryptography, hence we provide a definition of these directly.

For a distribution P_{XY}, and smoothing parameter \epsilon>0, we define the following '''smoothed Rényi entropies: ''' \begin{array}{ccc} H_0^{\epsilon}(X|Y)\equiv\min_{\Omega}\max_y\log|\{x:P_{X\Omega|Y=y}(x)>0\}|\\ H_{\infty}^{\epsilon}(X|Y)\equiv\max_{\Omega}\left(-\log\max_y\max_x P_{X\Omega|Y=y}(x)\right), \end{array} where \Omega is a set of events with total probability at least 1-\epsilon.

More generally, the smooth Rényi entropy of order \alpha can be defined. However, up to an additive constant these equal either H_0^{\epsilon} (for \alpha<1) or H_{\infty}^{\epsilon} (for \alpha>1). It is also worth noting that for a large number of independent repetitions of the same experiment, the Rényi entropies tend to the Shannon entropy, that is, \lim_{\epsilon\rightarrow 0}\lim_{n\rightarrow\infty}\frac{H_{\alpha}^{\epsilon}(X^n|Y^n)}{n}=H(X|Y).

##### Significance to cryptography

Information Reconciliation

The task of information reconciliation can be stated as follows. Alice has string X and Bob Y, these being chosen with joint distribution P_{XY}. Alice also possesses some additional independent random string R. What is the minimum length of string S=f(X,R) that Alice can compute such that X is uniquely obtainable by Bob using Y, S and R, except with probability less than \epsilon?

Renner and Wolf denote this quantity H_{enc}^{\epsilon}(X|Y). It is tightly bounded by the relation H_0^{\epsilon}(X|Y)\leq H_{enc}^{\epsilon}(X|Y)\leq H_0^{\epsilon_1}(X|Y)+\log\frac{1}{\epsilon_2},

where \epsilon_1+\epsilon_2=\epsilon.

It is intuitively clear why H_0^{\epsilon}(X|Y) is the correct quantity. Recall the definition

H_0^{\epsilon}(X|Y)\equiv\min_{\Omega}\max_y\log|\{x:P_{X\Omega|Y=y}(x)>0\}|,

where \Omega is a set of events with total probability at least 1-\epsilon. The size of the set of strings x that could have generated Y=y given \Omega is |\{x:P_{X\Omega|Y=y}(x)>0\}|. Alice's additional information needs to point to one of these. It hence requires \log |\{x:P_{X\Omega|Y=y}(x)>0\}| bits to encode. Since Alice does not know y, she must assume the worst, hence we maximize on y. Furthermore, since some error is tolerable, we minimize on \Omega, by cutting away unlikely events from the probability distribution.

Privacy Amplification

In essence, this task seeks to find the maximum length of string Alice and Bob can form from their shared string such that Eve has no information on this string.

This task can be stated more formally as follows. Alice possesses string X and Eve Z, distributed according to P_{XZ}. Alice also has some uncorrelated random string R. What is the maximum length of a binary string S=f(X,R), such that for a uniform random variable U that is independent of Y and R, we have S=U, except with probability less than \epsilon?

Again, this quantity, denoted H_{ext}^{\epsilon}(X|Z), has been defined and bounded by Renner and Wolf :

H_{\infty}^{\epsilon_1}(X|Z)-2\log\frac{1}{\epsilon_2}\leq H_{ext}^{\epsilon}(X|Z)\leq H_{\infty}^{\epsilon}(X|Z),

where \epsilon_1+\epsilon_2=\epsilon.

The reason that this quantity is relevant follows from the definition of extractors, which are functions commonly exploited for privacy amplification. In essence they take a non-uniform random string, and some additional catalytic randomness, to form a smaller string which is arbitrarily close to being uniform. The length reduction in the string is bounded by its min-entropy.

C. E. Shannon, Bell System Technical Journal 27, 379 (1948).

A. Rényi, in Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability (1961), vol. 1.

R. Renner and S. Wolf, in Advances in Cryptology --- ASIACRYPT 2005, edited by B. Roy (Springer-Verlag, 2005), vol. 3788, pp. 199--216.