
Entropy (information theory)


This is an old revision of this page, as edited by 208.255.152.227 (talk) at 21:27, 5 September 2006 (Derivation of Shannon's entropy). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Entropy of a Bernoulli trial as a function of success probability, often called the binary entropy function.

Entropy is a concept in thermodynamics (see thermodynamic entropy), statistical mechanics and information theory. The concepts of information and entropy have deep links with one another, although it took many years for the development of the theories of statistical mechanics and information theory to make this apparent. This article is about information entropy, the information-theoretic formulation of entropy. Information entropy is occasionally called Shannon's entropy in honor of Claude E. Shannon, who formulated many of the key ideas of information theory.

Introduction

The concept of entropy in information theory describes how much information there is in a signal or event. Shannon introduced the idea of information entropy in his 1948 paper "A Mathematical Theory of Communication".

An intuitive understanding of information entropy relates to the amount of uncertainty about an event associated with a given probability distribution. As an example, consider a box containing many coloured balls. If the balls are all of different colours and no colour predominates, then our uncertainty about the colour of a randomly drawn ball is maximal. On the other hand, if the box contains more red balls than any other colour, then there is slightly less uncertainty about the result: the ball drawn from the box has more chances of being red (if we were forced to place a bet, we would bet on a red ball). Telling someone the colour of every new drawn ball provides them with more information in the first case than it does in the second case, because there is more uncertainty about what might happen in the first case than there is in the second. Intuitively, if there were no uncertainty as to the outcome, then we would learn nothing by drawing the next ball, and so the information content would be zero. As a result, the entropy of the "signal" (the sequence of balls drawn, as calculated from the probability distribution) is higher in the first case than in the second.

Shannon, in fact, defined entropy as a measure of the average information content associated with a random outcome.

Shannon's definition of information entropy makes this intuitive distinction mathematically precise. His definition satisfies these desiderata:

  • The measure should be continuous — i.e., changing the value of one of the probabilities by a very small amount should only change the entropy by a small amount.
  • If all the outcomes (ball colours in the example above) are equally likely, then entropy should be maximal.
  • If the outcome is a certainty, then the entropy should be zero.
  • The amount of entropy should be the same regardless of how the process is divided into parts.

(Note: The Shannon/Weaver book makes reference to Tolman (1938) who in turn credits Pauli (1933) with the definition of entropy Shannon used. Elsewhere in statistical mechanics, the literature includes references to von Neumann having derived the same form of entropy in 1927, which may explain why von Neumann favoured the use of the existing term 'entropy'.)

Formal definitions

Shannon defines entropy in terms of a discrete random event x, with possible states (or outcomes) 1...n as:

H(x) = \sum_{i=1}^{n} p(i)\,\log_2\!\left(\frac{1}{p(i)}\right) = -\sum_{i=1}^{n} p(i)\,\log_2 p(i).

That is, the entropy of the event x is the sum, over all possible outcomes i of x, of the probability of outcome i times the log of the inverse of that probability (a quantity also called the surprisal of i); the entropy of x is thus the expected value of the surprisal of its outcome. The same definition applies to a general probability distribution, rather than a discrete-valued event.
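The definition can be sketched in a few lines of code; `shannon_entropy` is our name for the helper, not a function from the article:

```python
import math

def shannon_entropy(probs, base=2):
    """H = -sum_i p(i) * log_b p(i); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin: two equally likely outcomes, one bit of uncertainty.
print(shannon_entropy([0.5, 0.5]))   # 1.0
# A certain outcome carries no information.
print(shannon_entropy([1.0]))        # 0.0
```

Note the convention 0 log 0 = 0, implemented by skipping zero-probability terms.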

Shannon shows that any definition of entropy satisfying his assumptions will be of the form:

-K \sum_{i=1}^{n} p(i)\,\log p(i),

where K is a constant (and is really just a choice of measurement units).

Shannon defined a measure of entropy, H = − p1 log2 p1 − … − pn log2 pn, that, when applied to an information source, determines the minimum channel capacity required to reliably transmit the source as encoded binary digits. The formula can be derived by calculating the mathematical expectation of the amount of information contained in a digit from the information source. Shannon's entropy measure came to be taken as a measure of the uncertainty about the realization of a random variable. It thus served as a proxy for the concept of information contained in a message, as opposed to the portion of the message that is strictly determined (hence predictable) by inherent structures, such as redundancy in the structure of languages or the statistical properties relating to the occurrence frequencies of letter or word pairs, triplets, etc. (see Markov chain).

Relationship to thermodynamic entropy

Main article: Entropy in thermodynamics and information theory

Shannon's definition of entropy is closely related to thermodynamic entropy as defined in physics and chemistry. Boltzmann and Gibbs did considerable work on statistical thermodynamics, which became the inspiration for adopting the word entropy in information theory. There are relationships between thermodynamic and informational entropy. In fact, in the view of Jaynes (1957), thermodynamics should be seen as an application of Shannon's information theory: the thermodynamic entropy is interpreted as being an estimate of the amount of further Shannon information (needed to define the detailed microscopic state of the system) that remains uncommunicated by a description solely in terms of the macroscopic variables of classical thermodynamics. For example, adding heat to a system increases its thermodynamic entropy because it increases the number of possible microscopic states that it could be in, thus making any complete state description longer. (See article: MaxEnt thermodynamics). Maxwell's demon (hypothetically) reduces the thermodynamic entropy of a system using information about the states of individual molecules; however, the demon himself increases his own entropy in the process, and so the total entropy does not decrease (which resolves the paradox).

Entropy as information content

It is important to remember that entropy is a quantity defined in the context of a probabilistic model for a data source. Independent fair coin flips have an entropy of 1 bit per flip. A source that always generates a long string of A's has an entropy of 0, since the next character will always be an 'A'.
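The coin-flip case can be checked directly with the binary entropy function pictured at the top of the article (a sketch; `binary_entropy` is our name for it):

```python
import math

def binary_entropy(p):
    """Entropy, in bits, of a Bernoulli trial with success probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome has zero entropy
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0 bit: maximal uncertainty
print(binary_entropy(0.9))   # ~0.469 bits: a biased coin is more predictable
```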

The entropy rate of a data source is the average number of bits per symbol needed to encode it. Empirically, the entropy of English text appears to be between 1.1 and 1.6 bits per character, though it clearly varies from one source of text to another. Experiments with human predictors show an information rate of 1.1 or 1.6 bits per character, depending on the experimental setup; the PPM compression algorithm can achieve a compression rate of 1.5 bits per character.

From the preceding example, note the following points:

  1. The amount of entropy is not always an integer number of bits.
  2. Many data bits may not convey information. For example, data structures often store information redundantly, or have identical sections regardless of the information in the data structure.

Data compression

Entropy effectively bounds the performance of the strongest lossless (or nearly lossless) compression possible, which can be realized in theory by using the typical set or in practice using Huffman, Lempel-Ziv or arithmetic coding. The performance of existing data compression algorithms is often used as a rough estimate of the entropy of a block of data.
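As a rough illustration of this practice, one can compare an order-0 empirical entropy estimate with what a standard compressor actually achieves. This is a sketch using Python's `zlib`; the helper name and the choice of test data are ours:

```python
import math
import os
import zlib
from collections import Counter

def order0_entropy(data: bytes) -> float:
    """Empirical order-0 entropy of a byte string, in bits per byte."""
    n = len(data)
    counts = Counter(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

repetitive = b"A" * 10_000          # low entropy: a single repeated symbol
random_ish = os.urandom(10_000)     # high entropy: close to 8 bits per byte

for name, data in [("repetitive", repetitive), ("random", random_ish)]:
    h = order0_entropy(data)
    rate = 8 * len(zlib.compress(data)) / len(data)  # compressed bits per input byte
    print(f"{name}: entropy ~ {h:.2f} bits/byte, zlib ~ {rate:.2f} bits/byte")
```

The repetitive string compresses to almost nothing, while the random bytes cannot be compressed below about 8 bits per byte, consistent with the entropy bound.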

Data as a Markov process

A common way to define entropy for text is based on the Markov model of text. For an order-0 source (one in which each character is selected independently of the preceding characters), the binary entropy is:

H(\mathcal{S}) = -\sum p_i\,\log_2 p_i,

where pi is the probability of i. For a first-order Markov source (one in which the probability of selecting a character is dependent only on the immediately preceding character), the entropy rate is:

H(\mathcal{S}) = -\sum_i p_i \sum_j p_i(j)\,\log_2 p_i(j),

where i is a state (certain preceding characters) and p_i(j) is the probability of j given i as the previous character(s).
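The first-order formula can be estimated from bigram counts of a sample text. The sketch below uses a hypothetical helper name and estimates both p_i and p_i(j) empirically:

```python
import math
from collections import Counter

def markov1_entropy_rate(text: str) -> float:
    """Estimate H = -sum_i p_i sum_j p_i(j) log2 p_i(j) from bigram frequencies."""
    bigrams = Counter(zip(text, text[1:]))
    state_counts = Counter(text[:-1])     # occurrences of each preceding character
    total = len(text) - 1
    h = 0.0
    for (i, j), c in bigrams.items():
        p_i = state_counts[i] / total     # p_i: probability of state i
        p_j_given_i = c / state_counts[i] # p_i(j): probability of j after i
        h -= p_i * p_j_given_i * math.log2(p_j_given_i)
    return h

# Each character fully determines the next, so the entropy rate is zero:
print(markov1_entropy_rate("abababab"))  # 0.0
# After each character the next is one of two equally likely choices (~1 bit):
print(markov1_entropy_rate("aabb" * 100))
```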

For a second order Markov source, the entropy rate is

H(\mathcal{S}) = -\sum_i p_i \sum_j p_i(j) \sum_k p_{i,j}(k)\,\log p_{i,j}(k).

In general, the b-ary entropy of a source \mathcal{S} = (S, P) with source alphabet S = {a1, …, an} and discrete probability distribution P = {p1, …, pn}, where pi is the probability of ai (say pi = p(ai)), is defined by:

H_b(\mathcal{S}) = -\sum_{i=1}^{n} p_i\,\log_b p_i

Note: the b in "b-ary entropy" is the number of distinct symbols in the "ideal alphabet" being used as the standard yardstick to measure source alphabets. In information theory, two symbols are necessary and sufficient for an alphabet to encode information; therefore the default is to let b = 2 ("binary entropy"). Thus, the entropy of the source alphabet, with its given empirical probability distribution, is a number equal to the (possibly fractional) number of symbols of the "ideal alphabet", with an optimal probability distribution, needed to encode each symbol of the source alphabet. Also note that "optimal probability distribution" here means a uniform distribution: a source alphabet with n symbols has the highest possible entropy (for an alphabet with n symbols) when its probability distribution is uniform. This optimal entropy turns out to be \log_b n.
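The role of the base b can be illustrated in a short sketch (`b_ary_entropy` is our name, not from the article):

```python
import math

def b_ary_entropy(probs, b=2):
    """H_b = -sum_i p_i log_b p_i, measured against a b-symbol ideal alphabet."""
    return -sum(p * math.log(p, b) for p in probs if p > 0)

# A uniform 4-symbol alphabet measured against the binary yardstick:
print(b_ary_entropy([0.25] * 4, b=2))   # 2.0 (= log2 4: two ideal binary symbols per source symbol)
# Measured against its own 4-symbol yardstick, the optimal entropy is 1:
print(b_ary_entropy([0.25] * 4, b=4))   # 1.0 (= log4 4)
```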

Alternative definition

Another way to define the entropy function H (not using the Markov model) is to show that H is uniquely determined (up to a constant factor, as mentioned earlier) by the following conditions:

1. H(p1, …, pn) is defined and continuous for all p1, …, pn with pi ∈ [0, 1] for all i = 1, …, n and p1 + … + pn = 1. (Note that the function depends solely on the probability distribution, not on the alphabet.)

2. For all positive integers n, H satisfies

H\underbrace{\left(\frac{1}{n},\ldots,\frac{1}{n}\right)}_{n\ \text{arguments}} < H\underbrace{\left(\frac{1}{n+1},\ldots,\frac{1}{n+1}\right)}_{n+1\ \text{arguments}}.

3. For positive integers bi where b1 + … + bk = n, H satisfies

H\underbrace{\left(\frac{1}{n},\ldots,\frac{1}{n}\right)}_{n} = H\underbrace{\left(\frac{b_1}{n},\ldots,\frac{b_k}{n}\right)}_{k} + \sum_{i=1}^{k}\frac{b_i}{n}\,H\underbrace{\left(\frac{1}{b_i},\ldots,\frac{1}{b_i}\right)}_{b_i}.
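Condition 3 can be checked numerically. The sketch below (with a hypothetical helper `H`) verifies the identity for n = 6 split into groups of sizes b = (2, 4):

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Condition 3 with n = 6, b = (2, 4):
# H(1/6,...,1/6) = H(2/6, 4/6) + (2/6) H(1/2, 1/2) + (4/6) H(1/4,...,1/4)
lhs = H([1/6] * 6)
rhs = H([2/6, 4/6]) + (2/6) * H([1/2] * 2) + (4/6) * H([1/4] * 4)
print(abs(lhs - rhs) < 1e-9)  # True: both sides equal log2(6)
```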

Efficiency

A source alphabet encountered in practice will generally have a probability distribution that is less than optimal. If the source alphabet has n symbols, it can be compared to an "optimized alphabet" with n symbols whose probability distribution is uniform. The ratio of the entropy of the source alphabet to the entropy of its optimized version is the efficiency of the source alphabet, which can be expressed as a percentage.

This implies that the efficiency of a source alphabet with n symbols can be defined simply as being equal to its n-ary entropy. See also Redundancy (information theory).
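Computing the efficiency is a one-liner once the entropy is available; this sketch uses a hypothetical helper name:

```python
import math

def efficiency(probs):
    """Entropy of the distribution relative to the uniform optimum log2(n)."""
    n = len(probs)
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return h / math.log2(n)

print(f"{efficiency([0.25] * 4):.0%}")           # 100%: already uniform
print(f"{efficiency([0.7, 0.1, 0.1, 0.1]):.0%}") # ~68%: skewed, hence redundant
```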

Derivation of Shannon's entropy

Since the entropy was given as a definition, it does not need to be derived. On the other hand, a "derivation" can be given which gives a sense of the motivation for the definition as well as the link to thermodynamic entropy.

Q. Given a roulette wheel with n pockets that are all equally likely to be landed on by the ball, what is the probability of obtaining a distribution (A1, A2, …, An), where Ai is the number of times pocket i was landed on and

P = \sum_{i=1}^{n} A_i

is the total number of ball-landing events?

A. The probability is a multinomial distribution, viz.

p = \frac{\Omega}{T} = \frac{P!}{A_1!\,A_2!\,A_3!\cdots A_n!}\left(\frac{1}{n}\right)^{P}

where

\Omega = \frac{P!}{A_1!\,A_2!\,A_3!\cdots A_n!}

is the number of possible combinations of outcomes (for the events) which fit the given distribution, and

T = n^{P}

is the number of all possible combinations of outcomes for the set of P events.

Q. And what is the entropy?

A. The entropy of the distribution is obtained from the logarithm of Ω:

H = \log\Omega = \log\frac{P!}{A_1!\,A_2!\,A_3!\cdots A_n!}
= \log P! - \log A_1! - \log A_2! - \log A_3! - \cdots - \log A_n!
= \sum_{i=1}^{P}\log i - \sum_{i=1}^{A_1}\log i - \sum_{i=1}^{A_2}\log i - \cdots - \sum_{i=1}^{A_n}\log i

The summations can be approximated closely by being replaced with integrals:

H = \int_1^{P}\log x\,dx - \int_1^{A_1}\log x\,dx - \int_1^{A_2}\log x\,dx - \cdots - \int_1^{A_n}\log x\,dx.

The integral of the logarithm is

\int\log x\,dx = x\log x - \int x\,\frac{dx}{x} = x\log x - x.

So the entropy is

H = (P\log P - P + 1) - (A_1\log A_1 - A_1 + 1) - (A_2\log A_2 - A_2 + 1) - \cdots - (A_n\log A_n - A_n + 1)
= (P\log P + 1) - (A_1\log A_1 + 1) - (A_2\log A_2 + 1) - \cdots - (A_n\log A_n + 1)
= P\log P - \sum_{x=1}^{n} A_x\log A_x + (1-n)

By substituting p_x = A_x/P and using \sum_x A_x = P, some simple algebra gives

H = -P\sum_{x=1}^{n} p_x\log p_x + (1-n)

The additive term (1 − n) can be dropped, since it is a constant independent of the p_x distribution, and dividing by the number of events P gives the entropy per event:

H = -\sum_{x=1}^{n} p_x\log p_x .

Thus, the Shannon entropy is a consequence of the equation

H = \log\Omega

which relates to Boltzmann's definition,

S = k\ln\Omega ,

of thermodynamic entropy, where k is the Boltzmann constant.
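The key approximation in the derivation (replacing the sums of logarithms by integrals, i.e. Stirling's approximation) can be checked numerically. The counts below are hypothetical, chosen only to illustrate that log Ω approaches −P Σ p_x log p_x as P grows:

```python
import math

A = [5000, 3000, 2000]   # hypothetical pocket counts A_x
P = sum(A)               # total number of ball-landing events

# Exact log(Omega) = log(P! / (A_1! ... A_n!)) via the log-gamma function.
log_omega = math.lgamma(P + 1) - sum(math.lgamma(a + 1) for a in A)

# Stirling-based approximation: -P * sum(p_x * ln p_x) with p_x = A_x / P.
stirling = -P * sum((a / P) * math.log(a / P) for a in A)

print(log_omega, stirling)   # both ~1.03e4; relative error well under 1%
```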

Properties of Shannon's information entropy

We write H(X) as Hn(p1,...,pn). The Shannon entropy satisfies the following properties:

  • For any n, Hn(p1,...,pn) is a continuous and symmetric function on variables p1, p2,...,pn.
  • An event of probability zero does not contribute to the entropy, i.e., for any n,
H_{n+1}(p_1,\ldots,p_n,0) = H_n(p_1,\ldots,p_n) .
  • Entropy is maximized when the probability distribution is uniform. For all n,
H_n(p_1,\ldots,p_n) \leq H_n\!\left(\frac{1}{n},\ldots,\frac{1}{n}\right) .

This follows from Jensen's inequality:

H(X) = E\left[\log_b\!\left(\frac{1}{p(X)}\right)\right] \leq \log_b\!\left(E\left[\frac{1}{p(X)}\right]\right) = \log_b(n) .
  • If p_{ij}, 1 \leq i \leq m, 1 \leq j \leq n, are non-negative real numbers summing to one, and q_i = \sum_{j=1}^{n} p_{ij}, then
H_{mn}(p_{11},\ldots,p_{mn}) = H_m(q_1,\ldots,q_m) + \sum_{i=1}^{m} q_i\,H_n\!\left(\frac{p_{i1}}{q_i},\ldots,\frac{p_{in}}{q_i}\right) .

If we partition the mn outcomes of the random experiment into m groups with each group containing n elements, we can do the experiment in two steps: first, determine the group to which the actual outcome belongs; then, find the outcome within that group. The probability of observing group i is qi. The conditional probability distribution for group i is (pi1/qi, …, pin/qi). The entropy

H_n\!\left(\frac{p_{i1}}{q_i},\ldots,\frac{p_{in}}{q_i}\right)

is the entropy of the probability distribution conditioned on group i. This property means that the total information is the sum of the information gained in the first step, Hm(q1, …, qm), and a weighted sum of the entropies conditioned on each group.

Khinchin in 1957 showed that the only function satisfying the above assumptions is of the form

H_n(p_1,\ldots,p_n) = -k\sum_{i=1}^{n} p_i\log p_i ,

where k is a positive constant representing the desired unit of measurement.

Extending discrete entropy to the continuous case: differential entropy

The Shannon entropy is restricted to finite sets. It seems that the formula

h[f] = -\int_{-\infty}^{\infty} f(x)\log f(x)\,dx, \quad (*)

where f denotes a probability density function on the real line, is analogous to the Shannon entropy and could thus be viewed as an extension of the Shannon entropy to the domain of real numbers. Formula (*) is usually referred to as the continuous entropy, or differential entropy. Although the analogy between the two functions is suggestive, the following question must be asked: is the continuous entropy a valid extension of the Shannon entropy? To answer this question, we must establish a connection between the two functions:

We wish to obtain a generally finite measure as the bin size goes to zero. In the discrete case, the bin size is the (implicit) width of each of the n (finitely or infinitely many) bins whose probabilities are denoted by p_i. As we generalize to the continuous domain, we must make this width explicit.

To do this, start with a continuous function f discretized as shown in the figure. As the figure indicates, by the mean-value theorem there exists a value xi in each bin such that

f(x_i)\,\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx

and thus the integral of the function f can be approximated (in the Riemannian sense) by

\int_{-\infty}^{\infty} f(x)\,dx = \lim_{\Delta\to 0}\sum_{i=-\infty}^{\infty} f(x_i)\,\Delta

where this limit is equivalent to letting the bin size go to zero.

We will denote

H^{\Delta} := -\sum_{i=-\infty}^{\infty} \Delta f(x_i)\log\bigl(\Delta f(x_i)\bigr)

and expanding the logarithm, we have

H^{\Delta} = -\sum_{i=-\infty}^{\infty} \Delta f(x_i)\log\bigl(\Delta f(x_i)\bigr)
= -\sum_{i=-\infty}^{\infty} \Delta f(x_i)\log f(x_i) - \sum_{i=-\infty}^{\infty} f(x_i)\,\Delta\log\Delta .

As Δ 0 {\displaystyle \Delta \to 0} , we have

\sum_{i=-\infty}^{\infty} f(x_i)\,\Delta \to \int f(x)\,dx = 1

and so

\sum_{i=-\infty}^{\infty} \Delta f(x_i)\log f(x_i) \to \int f(x)\log f(x)\,dx .

But note that \log\Delta \to -\infty as \Delta \to 0, so we need a special definition of the differential (or continuous) entropy:

h[f] = \lim_{\Delta\to 0}\left[H^{\Delta} + \log\Delta\right] = -\int_{-\infty}^{\infty} f(x)\log f(x)\,dx ,

which is, as said before, referred to as the differential entropy. This means that the differential entropy is not a limit of the Shannon entropy for n → ∞; rather, it differs from that limit by the infinite offset log Δ.
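This divergence is easy to see numerically. Our sketch below takes f uniform on [0, 1] (so h[f] = 0) and shows that the binned discrete entropy H^Δ grows without bound while H^Δ + log Δ stays fixed:

```python
import math

def H_delta(k: int) -> float:
    """Discrete entropy (bits) of f uniform on [0, 1] binned into k bins of width 1/k.
    Each bin has probability Delta * f(x_i) = 1/k."""
    delta = 1.0 / k
    return -sum(delta * math.log2(delta) for _ in range(k))

for k in [4, 16, 64]:
    delta = 1.0 / k
    # H_delta grows like log2(k), while H_delta + log2(delta) stays at h[f] = 0.
    print(k, H_delta(k), H_delta(k) + math.log2(delta))
```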

As a result, unlike the Shannon entropy, the differential entropy is not in general a good measure of uncertainty or information. For example, the differential entropy can be negative; moreover, it is not invariant under continuous coordinate transformations.
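Both defects can be seen with the uniform density on [0, a], whose differential entropy works out (directly from formula (*)) to log a; the helper below is our own sketch:

```python
import math

def h_uniform(a: float) -> float:
    """Differential entropy (bits) of the uniform density f(x) = 1/a on [0, a]:
    h[f] = -integral of (1/a) log2(1/a) over [0, a] = log2(a)."""
    return math.log2(a)

print(h_uniform(2.0))    # 1.0 bit
print(h_uniform(0.25))   # -2.0: differential entropy can be negative
# Non-invariance: rescaling x -> 2x shifts the entropy by one bit.
print(h_uniform(0.5) - h_uniform(0.25))   # 1.0
```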

More useful for the continuous case is the relative entropy of a distribution, defined as the Kullback-Leibler divergence from the distribution to a reference measure m(x),

D_{\mathrm{KL}}(f\|m) = \int f(x)\log\frac{f(x)}{m(x)}\,dx

The relative entropy carries over directly from discrete to continuous distributions, and is invariant under co-ordinate reparametrisations.

References

Shannon's entropy at PlanetMath.
