
ENTROPY

ALI E. ABBAS
University of Illinois at Urbana-Champaign, Urbana, Illinois

1. INTRODUCTION: WHAT IS ENTROPY?

There are many ways to describe the entropy of a system. One method interprets the entropy as the quantity of heat, Q, that is absorbed in a reversible system when the temperature is T,

Entropy = Q / T.

Another method interprets entropy as the amount of energy in a system that is unavailable to do work. The entropy of a system is also a measure of its disorder. The second law of thermodynamics states that the entropy of an isolated system is non-decreasing. As a result, the longer that time elapses, the more disordered the system becomes, until it reaches maximum disorder (or maximum entropy). Consider, for example, the set of molecules in the closed system shown in Fig. 1. The second law of thermodynamics suggests that the system on the left-hand side would precede the system on the right-hand side, as it has less disorder and less entropy.

Entropy is also a measure of the multiplicity of the states of a system. To illustrate this point further, consider tossing a die twice and observing the sum of the two throws. For two throws of the die, the sum can be any of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. Assume that the sum is a state variable. Figure 2 shows the possible realizations for each of these states. Each of the states can be realized in a different number of ways, and some of these states are more likely to occur than others. If the sum is considered as a macrostate for this system and the possible realizations are considered as microstates, the total number of macrostates found is 11, and the number of microstates is 36. As can be seen, it is more likely that the sum of 7 will appear, as it has a larger number of ways by which it can be realized. If the multiplicity of each state, Ω, is defined as the number of its microstates, it is said that state 7 has a larger multiplicity than any of the other states. The entropy of each macrostate is proportional to the logarithm of its multiplicity,

entropy ∝ ln Ω.

In this example, the multiplicity of each macrostate is proportional to its probability of occurrence, because the probability of a macrostate is simply the multiplicity of this macrostate divided by the total multiplicity of the system. This observation leads to a statistical definition of the entropy of a given macrostate as a measure of the probability of its occurrence. The average entropy of the macrostates can also be defined as the average entropy of the system.

The previous definition of entropy using the probability of its macrostates is the basis of Claude Shannon's article "A mathematical theory of communication" (1). Shannon's work set the Magna Carta of the information age and the dawn of a new science called information theory. Information theory started with a discovery of the fundamental laws of data compression and transmission. For example, in communication systems, the two main applications of information theory are:

1. Optimal coding of a message, thereby setting a limit on the amount of compression for a source of information or a random process, and
2. The maximum permissible transmission bandwidth for a channel, known as the channel capacity.

Information theory has now become a unifying theory with profound intersections with probability, statistics, computer science, management science, economics, bioinformatics, and politics, and it has numerous applications in many other fields. In the following sections, an intuitive interpretation of entropy as a measure of information is provided. Other elements of information theory are also presented, together with several interpretations and applications for their use.

2. ENTROPY AS A MEASURE OF INFORMATION

The birth of information theory started when Shannon made the key observation that a source of information should be modeled as a random process. The following is a quote from his original paper:
We can think of a discrete source as generating the message, symbol by symbol. It will choose successive symbols according to certain probabilities depending, in general, on preceding choices as well as the particular symbols in question. A physical system, or a mathematical model of a system which produces such a sequence of symbols governed by a set of probabilities, is known as a stochastic process. We may consider a discrete source, therefore, to be represented by a stochastic process. Conversely, any stochastic process, which produces a discrete sequence of symbols chosen from a finite set, may be considered a discrete source.

Figure 1. Entropy as a measure of disorder (left panel: low entropy; right panel: high entropy).



Figure 2. Entropy as a measure of multiplicity. The multiplicities of the sums are Ω(2) = 1, Ω(3) = 2, Ω(4) = 3, Ω(5) = 4, Ω(6) = 5, Ω(7) = 6, Ω(8) = 5, Ω(9) = 4, Ω(10) = 3, Ω(11) = 2, and Ω(12) = 1; the number of macrostates is 11 and the number of microstates is 36.

Shannon realized the importance of a measure that quantifies the amount of information (or uncertainty) about a discrete random process, as it will also describe the amount of information that it produces. He then posed the following question:

Can we define a quantity which will measure, in some sense, how much information is produced by such a process, or better, at what rate information is produced?

In an attempt to answer his own question, Shannon suggested that the measure of information should satisfy three essential axioms, and proposed an information measure that uniquely satisfies these axioms. The following is taken from Shannon's original paper:

Suppose we have a set of possible events whose probabilities of occurrence are p_1, p_2, ..., p_n. These probabilities are known but that is all we know concerning which event will occur. Can we find a measure of how uncertain we are of the outcome? If there is such a measure, say H(p_1, p_2, ..., p_n), it is reasonable to require of it the following properties:
1. H should be continuous in the p_i.
2. If all the p_i are equal (p_i = 1/n), then H should be a monotonic increasing function of n.
3. If an event can be further decomposed into two successive events, the original H should be the weighted sum of the individual values of each H.

Figure 3. Decomposability of the entropy measure: H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3).

The third axiom is an important property of this measure of information. It will be explained further using the following example from Shannon's original paper. Consider the equivalent decision trees of Fig. 3. The tree on the left-hand side describes events that will occur with probabilities {1/2, 1/3, 1/6}. The tree on the right-hand side is an equivalent tree that has the same probabilities for the events but is further decomposed into conditional probabilities that are assigned to match the original tree on the left-hand side.

Shannon's third axiom requires that the measure of information calculated for the tree on the left-hand side, H(1/2, 1/3, 1/6), be numerically equal to that calculated using the marginal and conditional probabilities from the tree on the right-hand side. The information measure, he proposed, should be calculated from the right-hand-side tree using the marginal-conditional probabilities as H(1/2, 1/2) + (1/2) H(2/3, 1/3), where the coefficient 1/2 appears because the decomposition occurs half the time.

Shannon pointed out that the only measure that satisfies his three axioms is

H(X) = −k Σ_{i=1}^{n} p(x_i) log p(x_i),

where k is a positive constant that will be assigned a value of unity. Shannon called this term the entropy and proposed it as a measure of information (or uncertainty) about a discrete random variable having a probability mass function p(x). To get an intuitive feel for this entropy expression, consider the following example.

2.1. Example 1: Interpretation of the Entropy

Consider a random variable, X, with four possible outcomes, {0, 1, 2, 3}, and with probabilities of occurrence {1/2, 1/4, 1/8, 1/8}, respectively. Calculate the entropy of X using Shannon's entropy definition and using base two for the logarithm in the entropy expression:


H(X) = −(1/2) log_2(1/2) − (1/4) log_2(1/4) − (1/8) log_2(1/8) − (1/8) log_2(1/8) = 1.75 bits.

One intuitive way to explain this number is to consider the minimum expected number of binary (yes/no) questions needed to describe an outcome of X. The most efficient question-selection procedure in this example is to start with the outcome that has the highest probability of occurrence: the first question is, is X = 0? If it is, then X has been determined in one question. If it is not, the next question is, is X = 1? Again, if it is, then X has been determined in two questions; if it is not, is X = 2? The question "is X = 3?" does not need to be asked, because if X is not 0, 1, or 2, then it must be 3. The expected number of binary questions needed to determine X is then

E = 1·(1/2) + 2·(1/4) + 3·(1/8) + 3·(1/8) = 1.75.

Note that the expected number of binary questions needed to determine X in this example was equal to the entropy of the random variable X:

H(X) = E.

Although this equality does not always hold, an optimal question-selection process always exists, using entropy coding principles, such that the expected number of questions is bounded by the following well-known data compression inequality (explained in further detail in Cover and Thomas (2), page 87):

H(X) ≤ Expected number of questions ≤ H(X) + 1.
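To make this concrete, here is a short Python sketch (the helper function and question counts are illustrative, not part of the original article) that reproduces both numbers from Example 1:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution; zero terms are skipped."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Example 1: X takes values {0, 1, 2, 3} with probabilities 1/2, 1/4, 1/8, 1/8
p = [0.5, 0.25, 0.125, 0.125]
print(entropy_bits(p))                 # 1.75 bits

# Greedy yes/no questioning in order of decreasing probability:
# X=0 is resolved after 1 question, X=1 after 2, X=2 and X=3 after 3.
questions = [1, 2, 3, 3]
print(sum(q * pi for q, pi in zip(questions, p)))   # 1.75, equal to H(X) here
```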

The entropy of a discrete random variable thus has an intuitive interpretation: it provides some notion of the expected number of questions needed to describe the outcome of this random variable. The fact that binary questions were used in this example relates to the base two in the logarithm of the entropy expression. For example, if the entropy had been calculated using base four in the logarithm, the result would provide an idea of the expected number of four-valued (quaternary) questions needed to determine X. In coding theory, entropy also provides some notion of the expected length of the optimal code needed to describe the output of a random process. Note that, with these interpretations, the entropy of a discrete random variable cannot be negative.

2.2. Entropy as a Measure of Spread

The entropy of a discrete random variable is also a measure of the spread of its probability mass function. The minimum entropy (minimum spread) occurs when the variable is deterministic and has no uncertainty, which is equivalent to a probability mass function where one outcome has a probability equal to one and the remaining outcomes have a probability of zero. In that case,

H(X) = −1·log(1) = 0.

The maximum entropy (maximum spread) probability mass function of a discrete random variable with m outcomes occurs when all the outcomes have equal probability of 1/m. This probability mass function has a maximum entropy value of

H(X) = −Σ_{i=1}^{m} (1/m) log(1/m) = log m.

As the number of outcomes, m, increases, the entropy of a probability mass function with equally likely outcomes also increases logarithmically, which agrees with Shannon's second axiom.

2.3. Entropy as a Measure of Realizations

The entropy of a discrete random variable also has an interpretation in terms of the number of ways that the outcomes of an experiment can be realized. Consider, for example, a random variable X with m possible outcomes, x_1, x_2, ..., x_m. Now consider an experiment in which n independent trials have been observed, with each trial giving a possible realization of the variable X, and consider the number of possible observations that can be obtained with this experiment. Using factorials, the number of possible sequences that can be obtained for n trials with m possible outcomes for each trial is

W = n! / (n_1! n_2! ··· n_m!),

where n_1, n_2, ..., n_m are the number of observations of outcomes x_1, x_2, ..., x_m, respectively, such that Σ_{i=1}^{m} n_i = n. If the number of observations, n, as well as n_1, n_2, ..., n_m are large compared with m, then Stirling's approximation can be used for the factorial,

n! ≈ e^(−n) n^n √(2πn).

Alternatively, it can be written

log n! = n log n − n + O(log n).


Now, it can be written

log W = log n! − Σ_{i=1}^{m} log n_i!
≈ (n log n − n) − Σ_{i=1}^{m} (n_i log n_i − n_i) + O(log n)
= −Σ_{i=1}^{m} n_i log(n_i / n) + O(log n).

For large n,

(1/n) log W ≈ −Σ_{i=1}^{m} (n_i / n) log(n_i / n) = −Σ_{i=1}^{m} f_i log f_i,

where f_i is the fraction of observations for outcome x_i, which converges to the probability of x_i as n increases. The entropy is thus the limit of the logarithm of the number of possible realizations of the experiment divided by the total number of trials, as the number of trials approaches infinity.
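A brief numerical sketch of this limit (the empirical fractions below are arbitrary illustrative values): the logarithm of the multinomial count per trial approaches −Σ f_i log f_i as n grows.

```python
from math import lgamma, log

def log_multinomial(counts):
    """Natural log of n! / (n_1! n_2! ... n_m!), computed stably with log-gamma."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)

def entropy_nats(freqs):
    return -sum(f * log(f) for f in freqs if f > 0)

fractions = [0.5, 0.3, 0.2]            # assumed empirical fractions f_i
for n in (10, 100, 10_000, 1_000_000):
    counts = [round(f * n) for f in fractions]
    n_eff = sum(counts)
    print(n_eff, log_multinomial(counts) / n_eff, entropy_nats(fractions))
# (1/n) log W converges to about 1.0297 nats, the entropy of (0.5, 0.3, 0.2)
```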

2.4. Entropy as a Measure of the Probability of a Sequence: The Asymptotic Equipartition Theorem (AEP)

The AEP theorem states that if a set of values X_1, X_2, ..., X_n is drawn independently from a random variable X distributed according to P(x), then the joint probability P(X_1, X_2, ..., X_n) of observing the sequence satisfies

−(1/n) log_2 P(X_1, X_2, ..., X_n) → H(X).

Rearranging the terms in the AEP expression gives

P(X_1, X_2, ..., X_n) → 2^(−nH(X)).

The AEP theorem also implies that all sequences drawn from an independent and identically distributed distribution will have approximately the same probability in the long run. Of course, for small n this may not be true, but as n gets larger, more and more sequences satisfy this equipartition property. Let us define A_ε^(n) as the set of all sequences that satisfy the equipartition theorem, for a given n and a given ε, such that

2^(−n(H(X)+ε)) ≤ p(X_1, X_2, ..., X_n) ≤ 2^(−n(H(X)−ε)).

The set A_ε^(n) is also known as the typical set. The cardinality of the set A_ε^(n) is bounded by

2^(n(H(X)−ε)) ≤ |A_ε^(n)| ≤ 2^(n(H(X)+ε)).

Furthermore, because all sequences have approximately the same probability for large n, the number of possible realizations of the sequences, W, is

W ≈ 1 / P(X_1, X_2, ..., X_n) ≈ 2^(nH(X)),

which is the result obtained earlier for entropy as a measure of the long-run number of possible realizations of a sequence.

2.5. Differential Entropy

Now that the entropy of a discrete random variable has been discussed, the extension of this definition to the continuous case is considered. In continuous form, the integral

h(x) = −∫ f(x) ln f(x) dx

is known as the differential entropy of a random variable having a probability density function, f(x). As opposed to the discrete case, however, the differential entropy can be negative, which is illustrated by the following example.

2.5.1. Example 2: Entropy of a Uniform Bounded Random Variable

Consider a random variable x with a uniform probability density function on the domain [0, a],

f(x) = 1/a, 0 ≤ x ≤ a.

The entropy of this random variable is

h(x) = −∫_0^a f(x) log f(x) dx = −∫_0^a (1/a) log(1/a) dx = log a.

Note that the entropy expression can be positive or negative depending on the value of a: if 0 < a < 1, the differential entropy is negative; if a = 1, the differential entropy is zero; and if a > 1, the differential entropy is positive. The reader can also verify that the differential entropy of a uniform distribution over the domain [a, b] is equal to log(b − a). Thus, the entropy of a uniform distribution is determined only by the width of its interval and does not change when a shift is added to the values of a and b; the entropy is invariant to a shift transformation but not invariant to scale.

2.5.2. Example 3: Entropy of a Normal Distribution

If a random variable x has a normal distribution, with mean m and variance σ², its differential entropy using the natural logarithm is equal to

h(x) = (1/2) log(2πeσ²),

where e is the base of the natural logarithm. It is again noted that the entropy is a function of the variance only and not of the mean. As such, it is invariant under a shift transformation but is not invariant under a scale transformation. Note further that when the standard deviation σ < 1/√(2πe), the entropy is negative, and it is positive when σ > 1/√(2πe).
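The two closed forms above can be checked with scipy's differential entropy routine; a sketch, with the parameter values a = 0.5 and σ = 0.1 chosen purely for illustration:

```python
import numpy as np
from scipy.stats import uniform, norm

a, sigma = 0.5, 0.1                     # assumed example parameters

# Uniform on [0, a]: h = log(a), negative here because a < 1
print(uniform(loc=0.0, scale=a).entropy(), np.log(a))

# Normal with standard deviation sigma: h = 0.5 * log(2*pi*e*sigma^2)
print(norm(loc=3.0, scale=sigma).entropy(),
      0.5 * np.log(2 * np.pi * np.e * sigma**2))
# The mean (3.0 here) has no effect, and the result is negative because
# sigma is below 1/sqrt(2*pi*e).
```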


2.6. Entropy of a Discretized Variable

The possibility of a negative sign in the last two examples shows the absence of the expected-number-of-questions interpretation of entropy in the continuous case. Nevertheless, numerous applications of entropy exist in the continuous case. Furthermore, the differential entropy can be related to the discrete entropy by discretizing the continuous density into intervals of length Δ, as shown in Fig. 4, and using the substitution p_i = f(x_i)Δ in the discrete entropy expression as follows:

H(X^Δ) = −Σ_{i=−∞}^{∞} p_i log p_i = −Σ_{i=−∞}^{∞} f(x_i)Δ log(f(x_i)Δ) = −Σ_{i=−∞}^{∞} Δ f(x_i) log f(x_i) − log Δ,

where H(X^Δ) is the entropy of the discretized random variable. If f(x) log f(x) is Riemann integrable, then the first term on the right-hand side of the discretized entropy expression approaches the integral of −f(x) log f(x), by definition of Riemann integrability. Thus,

H(X^Δ) + log Δ → h(x), as Δ → 0.
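A sketch of this limit for a standard normal density, with an illustrative set of Δ values and a grid wide enough to capture essentially all of the probability mass:

```python
import numpy as np
from scipy.stats import norm

h_exact = 0.5 * np.log(2 * np.pi * np.e)     # differential entropy of N(0,1), nats

for delta in (0.5, 0.1, 0.01):
    x = np.arange(-10, 10, delta)            # grid covering nearly all the mass
    p = norm.pdf(x) * delta                  # p_i = f(x_i) * delta
    p = p / p.sum()                          # absorb the tiny truncation error
    H_disc = -(p * np.log(p)).sum()          # discrete entropy H(X^delta)
    print(delta, H_disc + np.log(delta), h_exact)
# H(X^delta) + log(delta) approaches h(x) as delta shrinks
```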
Now, several other terms of information theory are presented that relate to the entropy expression.

2.7. Relative Entropy

Kullback and Leibler (3) generalized Shannon's definition of entropy and introduced a measure that is used frequently to determine the closeness between two probability distributions. It is now known as the Kullback–Leibler (KL) measure of divergence, and is also known as the relative entropy or cross entropy. The KL measure from a distribution, P, to a distribution, Q, is defined as

KL measure = Σ_{i=1}^{n} p(x_i) log [p(x_i) / q(x_i)].

The KL measure is non-negative and is zero if and only if the two distributions are identical, which makes it an attractive candidate to measure the divergence of one distribution from another. However, the KL measure is not symmetric: the KL measure from P to Q is not necessarily equal to the KL measure from Q to P. Furthermore, the KL measure does not follow the triangle inequality. For example, if three distributions, P, Q, and Z, are given, the KL measure from P to Q can be greater than the sum of the KL measures from P to Z and from Z to Q.

The following example provides an intuitive interpretation for the KL measure.

2.7.1. Example 4: Interpretation of the Relative Entropy

Refer back to Example 1 and consider another person wishing to determine the value of the outcome of X. However, suppose that this person does not know the generating distribution (from Example 1, P = {1/2, 1/4, 1/8, 1/8}) and believes that the outcomes are generated by a different probability distribution, Q = {1/8, 1/8, 1/2, 1/4}. The KL measure (relative entropy) from P to Q is

KL measure from P to Q = (1/2) log_2[(1/2)/(1/8)] + (1/4) log_2[(1/4)/(1/8)] + (1/8) log_2[(1/8)/(1/2)] + (1/8) log_2[(1/8)/(1/4)] = 7/8.

Now, calculate the expected number of binary questions needed for this second person to determine X. His question selection, based on the distribution Q, would be: is X = 2?, is X = 3?, is X = 0? The expected number of binary questions needed to determine X is given by

E = 1·(1/8) + 2·(1/8) + 3·(1/2) + 3·(1/4) = 21/8.

Note that the difference in the expected number of questions needed to determine X under the two question-selection strategies is 21/8 − 7/4 = 7/8, which is the KL measure from P to Q. This example provides an intuitive explanation of the KL measure as a measure of closeness between a distribution, P, and another distribution, Q. Think of the KL measure as the increase in the expected number of questions that distribution Q would introduce if it were used to determine the outcome of a random variable that was generated by the distribution P. As entropy is also a measure of the compression limit for a discrete random variable, one can think of the KL measure as the inefficiency in compression introduced by distribution Q when used to code a stochastic process generated by a distribution P.
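The numbers in Example 4 can be reproduced in a few lines (a sketch; rel_entr returns the elementwise terms p·ln(p/q) in nats, so the sum is converted to bits):

```python
import numpy as np
from scipy.special import rel_entr

p = np.array([1/2, 1/4, 1/8, 1/8])    # generating distribution P over {0,1,2,3}
q = np.array([1/8, 1/8, 1/2, 1/4])    # assumed distribution Q

kl_bits = rel_entr(p, q).sum() / np.log(2)
print(kl_bits)                        # 0.875 = 7/8 bit

# Expected number of questions when the question order follows Q
# (ask "X = 2?", then "X = 3?", then "X = 0?"):
questions_under_q = np.array([3, 3, 1, 2])   # questions to identify 0, 1, 2, 3
print(float(questions_under_q @ p))          # 2.625 = 1.75 + 7/8
```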

3. RELATIVE ENTROPY OF CONTINUOUS VARIABLES

Similar to the entropy expression, the relative entropy also has an expression in continuous form, from a probability density, f(x), to a target (reference) density, q(x):

KL measure = ∫_a^b f(x) ln [f(x) / q(x)] dx.

3.1. Joint Entropy

Shannon's entropy definition works well for describing the uncertainty (or information) about a single random variable. It is natural to extend this definition to describe the entropy of two or more variables using their joint distribution.


Figure 4. A continuous probability density function (left) and the corresponding discretized probability mass function (right).

For example, when considering two discrete random variables, X and Y, one can measure the amount of information (or uncertainty) associated with them using the definition of their joint entropy as

H(X, Y) = −Σ_{i=1}^{m1} Σ_{j=1}^{m2} p(x_i, y_j) log p(x_i, y_j),

where X and Y may take m1 and m2 possible values, respectively, and p(x_i, y_j) is their joint probability. The joint entropy is a symmetric measure that ranges from zero (when both variables are deterministic) to a maximum value of log m1 + log m2 (when the variables are independent and each is uniformly distributed). In other words, the joint entropy of two discrete random variables is bounded by

0 ≤ H(X, Y) ≤ log m1 + log m2.

3.2. Mutual Information

The mutual information, I(X; Y), between two variables X and Y is the KL measure between their joint distribution and the product of their marginal distributions:

I(X; Y) = Σ_{i=1}^{m1} Σ_{j=1}^{m2} p(x_i, y_j) log [p(x_i, y_j) / (p(x_i) p(y_j))].

The value of the mutual information is larger when knowing the outcome of one variable leads to a larger expected reduction in the entropy of the other. In computational biology, for example, the mutual information quantifies how much information the expression level of one gene provides about the expression level of another. The mutual information is thus a symmetric measure of dependence between the two variables and is zero if and only if the two variables are independent, which is a direct result of the properties of the KL measure, it being zero if and only if the two distributions are identical.

The mutual information can also be defined for continuous variables, using a joint probability density and marginal probability densities, and a double integral to replace the summation:

I(X; Y) = ∫∫ f(x, y) log [f(x, y) / (f(x) f(y))] dx dy.

The mutual information measure is invariant to any invertible transformation of the individual variables. The mutual information can be generalized to a larger number of variables, yielding the multi-information measure

∫···∫ f(x_1, x_2, ..., x_n) log [f(x_1, x_2, ..., x_n) / (f(x_1) f(x_2) ··· f(x_n))] dx_1 dx_2 ··· dx_n.

3.3. Conditional Entropy

The conditional entropy of Y given X is defined as

H(Y|X) = −Σ_{i=1}^{m1} Σ_{j=1}^{m2} p(x_i, y_j) log p(y_j | x_i),

where p(y_j | x_i) is the conditional probability of y_j given x_i. The conditional entropy can also be defined as the joint entropy of X and Y minus the entropy of X. In other words, it is the difference between the joint information of X and Y and the information brought by X:

H(Y|X) = H(X, Y) − H(X).

One can also define the conditional entropy of Y given X as the difference between the entropy (information) in Y and the mutual information of X and Y:

H(Y|X) = H(Y) − I(X; Y).

This equation provides an interpretation of the mutual information as the reduction of uncertainty about Y brought by knowledge of X. If the two variables are probabilistically independent, their mutual information is zero and the conditional entropy of Y given X is simply equal to the entropy of Y, which places an upper bound on the conditional entropy:

H(Y|X) ≤ H(Y).
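As a sketch of these definitions, consider an arbitrary illustrative 2 × 2 joint distribution:

```python
import numpy as np

pxy = np.array([[0.4, 0.1],            # assumed joint distribution p(x, y)
                [0.1, 0.4]])
px = pxy.sum(axis=1)                   # marginal p(x)
py = pxy.sum(axis=0)                   # marginal p(y)

# Mutual information: KL measure between the joint and the product of marginals
I_xy = (pxy * np.log2(pxy / np.outer(px, py))).sum()

# Conditional entropy H(Y|X) = -sum_ij p(x_i, y_j) log p(y_j | x_i)
p_y_given_x = pxy / px[:, None]
H_y_given_x = -(pxy * np.log2(p_y_given_x)).sum()

H_y = -(py * np.log2(py)).sum()
print(I_xy, H_y_given_x, H_y)          # H(Y|X) = H(Y) - I(X;Y) <= H(Y)
```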


Contrary to the joint entropy or the mutual information, conditional entropy is not a symmetric measure: conditioning on one variable or the other does not give the same result. It is useful to think of the relationships between the joint entropy, the conditional entropy, the mutual information, and the individual entropies using the Venn diagram depicted in Fig. 5. This diagram helps deduce many expressions for entropy relations by inspection of the areas. For example, one can see that

I(X; Y) = H(X) + H(Y) − H(X, Y).

Figure 5. Venn diagram for entropy expressions, relating H(X), H(Y), H(X, Y), H(X|Y), H(Y|X), and I(X; Y).

Another measure of interest is the mutual information index (4), which measures the expected fraction of entropy reduction of variable Y because of variable X and is given as

Mutual Information Index (MII) = I(X; Y) / H(Y) = [H(Y) − H(Y|X)] / H(Y).

The mutual information index is thus normalized to range from zero (when the variables are independent) to one (when the variables are functionally related).

3.4. The Chain Rule for Joint Entropy

From the definition of the conditional entropy for the two-variable case, it has been shown that the joint entropy is equal to

H(X, Y) = H(X) + H(Y|X).

For three variables, one can express the joint entropy as

H(X, Y, Z) = H(X) + H(Y|X) + H(Z|X, Y),

which is very simple to prove, because the joint entropy can be written in terms of the conditional entropy in two steps:

H(X, Y, Z) = H(X) + H(Y, Z|X) = H(X) + H(Y|X) + H(Z|X, Y).

By induction, for n variables, X_1, X_2, ..., X_n, the joint entropy can be written as

H(X_1, X_2, ..., X_n) = H(X_1) + H(X_2|X_1) + ··· + H(X_n|X_1, X_2, ..., X_{n−1}) = Σ_{i=1}^{n} H(X_i|X_1, ..., X_{i−1}).

3.5. The Chain Rule for Mutual Information

A chain rule for mutual information also exists, as shown below. For the case of three variables, one has

I(X, Y; Z) = H(X, Y) − H(X, Y|Z).

Using the chain rule for joint entropy gives

I(X, Y; Z) = H(X) + H(Y|X) − H(X|Z) − H(Y|X, Z) = [H(X) − H(X|Z)] + [H(Y|X) − H(Y|X, Z)] = I(X; Z) + I(Y; Z|X),

where I(Y; Z|X) = H(Y|X) − H(Y|X, Z) is the conditional mutual information of Y and Z given X. For n variables, X_1, X_2, ..., X_n, the chain rule for mutual information is

I(X_1, X_2, ..., X_n; Y) = Σ_{i=1}^{n} I(X_i; Y|X_1, X_2, ..., X_{i−1}).
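A quick numerical sanity check of the chain rule for mutual information, on a randomly generated three-variable joint distribution (a sketch; every quantity is computed from joint and marginal entropies):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 3, 2))
p /= p.sum()                           # joint distribution p(x, y, z)

def H(q):
    q = q.ravel()
    q = q[q > 0]
    return -(q * np.log2(q)).sum()

Hx, Hz = H(p.sum(axis=(1, 2))), H(p.sum(axis=(0, 1)))
Hxy, Hxz, Hxyz = H(p.sum(axis=2)), H(p.sum(axis=1)), H(p)

# I(X,Y;Z) = I(X;Z) + I(Y;Z|X)
I_xy_z = Hxy + Hz - Hxyz                       # I((X,Y); Z)
I_x_z = Hx + Hz - Hxz                          # I(X; Z)
I_y_z_given_x = (Hxy - Hx) - (Hxyz - Hxz)      # H(Y|X) - H(Y|X,Z)
print(np.isclose(I_xy_z, I_x_z + I_y_z_given_x))   # True
```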

3.6. The Maximum Entropy Principle

Jaynes (5) built on the entropy concept and proposed a method to assign probability values with partial information. This method is known as the maximum entropy principle, and it is stated in his original paper as follows:
In making inferences on the basis of partial information we must use that probability distribution which has maximum entropy subject to whatever is known. This is the only unbiased assignment we can make; to use any other would amount to arbitrary assumption of information, which by hypothesis we do not have.


The partial information may have several forms, such as moment constraints, probability constraints, and shape constraints. Jaynes' maximum entropy principle is considered to be an extension of Laplace's principle of insufficient reason to assign unbiased probability distributions based on partial information. The term unbiased means


the assignment satisfies the partial information constraints and adds the least amount of information, as measured by Shannon's entropy measure. To illustrate this further, consider the examples below.

3.6.1. Example 5: Maximum Entropy Discrete Formulation

The maximum entropy probability mass function for a discrete variable, X, having n outcomes, when no further information is available, is

p_maxent = arg max_{p(x_i)} −Σ_{i=1}^{n} p(x_i) log p(x_i)

such that

Σ_{i=1}^{n} p(x_i) = 1,
p(x_i) ≥ 0, i = 1, ..., n.

The solution to this formulation yields a probability mass function with equal probability values for all outcomes,

p(x_i) = 1/n, i = 1, ..., n.

The maximum entropy value for this probability mass function is equal to log n, which is the maximum value that can be attained for n outcomes. Recall also that this result agrees with Laplace's principle of insufficient reason discussed above.
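A sketch of this discrete formulation using scipy's general-purpose constrained optimizer: only the normalization and non-negativity constraints are imposed, so the optimum should be the uniform distribution (n = 4 and the starting point are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

n = 4                                       # assumed number of outcomes

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)             # guard against log(0)
    return float(np.sum(p * np.log(p)))     # negative Shannon entropy (nats)

result = minimize(
    neg_entropy,
    x0=np.array([0.1, 0.2, 0.3, 0.4]),      # any starting probability mass function
    bounds=[(0.0, 1.0)] * n,
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
)
print(result.x)                  # approximately [0.25, 0.25, 0.25, 0.25]
print(-result.fun, np.log(n))    # the maximum entropy value is log(n)
```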

3.6.2. Example 6: Maximum Entropy Given Moments and Percentiles of the Distribution

The maximum entropy formulation for the probability density function of a continuous variable having prescribed moments and percentile constraints is

f_maxent(x) = arg max_{f(x)} −∫_a^b f(x) ln f(x) dx

subject to

∫_a^b h_i(x) f(x) dx = m_i, i = 1, ..., n,
∫_a^b f(x) dx = 1 and f(x) ≥ 0,

where [a, b] is the domain of the variable, h_i(x) is x raised to a certain power for moment constraints or an indicator function for percentile constraints, and the m_i are a given set of moments or percentiles of the distribution. Using the method of Lagrange multipliers, one obtains

L(f(x)) = −∫_a^b f(x) ln f(x) dx + a_0 {∫_a^b f(x) dx − 1} + Σ_{i=1}^{n} a_i {∫_a^b h_i(x) f(x) dx − m_i},

where a_i is the Lagrange multiplier for each constraint. Taking the partial derivative with respect to f(x) and equating it to zero gives

∂L(f(x))/∂f(x) = −ln f(x) − 1 + a_0 + Σ_{i=1}^{n} a_i h_i(x) = 0.

Rearranging this equation gives

f_maxent(x) = e^(a_0 − 1 + a_1 h_1(x) + a_2 h_2(x) + ··· + a_n h_n(x)).

When only the normalization and non-negativity constraints are available, the maximum entropy solution is uniform over a bounded domain:

f_maxent(x) = e^(a_0 − 1) = 1/(b − a), a ≤ x ≤ b.

If the first moment is available, the maximum entropy solution on the non-negative domain is

f_maxent(x) = e^(a_0 − 1 + a_1 x) = (1/m_1) e^(−x/m_1), x ≥ 0.

If the first and second moments are available over the interval (−∞, ∞), the maximum entropy solution is a Gaussian distribution,

f_maxent(x) = e^(a_0 − 1 + a_1 x + a_2 x²) = [1/(σ√(2π))] e^(−(x−m)²/(2σ²)), −∞ < x < ∞,

where m is the first moment and σ² is the variance. When percentiles of the distribution are available, the maximum entropy solution is a staircase probability density function (Fig. 6b) that satisfies the percentile constraints and is uniform over each interval, which integrates to a piecewise linear cumulative probability distribution that has the shape of a taut string. An example of a maximum entropy distribution given five common percentiles used in decision analysis practice (1%, 25%, 50%, 75%, and 99%), with a bounded support, is shown in Fig. 6.

3.6.3. Example 7: Inverse Maximum Entropy Problem

The inverse maximum entropy problem is the problem of finding, for any given probability density function, the constraints in the maximum entropy formulation that lead to its assignment. The inverse maximum entropy problem can be solved by first placing any probability density function in the form

f(x) = e^(a_0 − 1 + a_1 h_1(x) + a_2 h_2(x) + ··· + a_n h_n(x)).

Now, the inverse maximum entropy problem can be solved by inspection, because the constraint set needed for its assignment is

∫_a^b h_i(x) u(x) dx = m_i, i = 0, ..., n,

where u(x) denotes the given density.



Figure 6. (a) Percentile maximum entropy distribution obtained using the 0.01, 0.25, 0.5, 0.75, 0.99 percentiles. (b) Probability density function corresponding to the given percentile constraints.

For example, a Beta probability density function can be rewritten in the form

f(x) = [1/Beta(m, n)] x^(m−1) (1 − x)^(n−1) = e^(−ln Beta(m, n) + (m−1) ln x + (n−1) ln(1−x)), 0 ≤ x ≤ 1.

By inspection, it can be seen that the constraint set needed to assign a Beta density function is

∫_0^1 ln(x) u(x) dx = m_1,
∫_0^1 ln(1 − x) u(x) dx = m_2,

where m_1 and m_2 are given constants.

3.6.4. Dependence on the Choice of the Coordinate System

In the continuous case, the maximum entropy solution depends on the choice of the coordinate system. For example, suppose there is a random variable x on the domain [0, 1] and another random variable, y, on the same domain, where y = x². Suppose also that nothing is known about the random variables except their domains. If the problem is solved in the x coordinate system, the maximum entropy distribution for x is uniform,

f_maxent(x) = 1, 0 ≤ x ≤ 1.

The probability density function for variable y can then be obtained using the well-known formula for a change of variables,

f_y(y) = [f_x(x) / |dy/dx|]_{x=√y} = 1/(2x) = 1/(2√y).

On the other hand, if the problem is solved directly in the y coordinate system, one would get a maximum entropy distribution for variable y of

f_maxent(y) = 1, 0 ≤ y ≤ 1,

which is not equal to the solution obtained in the first case. This may seem to pose a problem for entropy formulations at first; however, it will be shown that this dependence on the choice of the coordinate system has a simple remedy using the minimum cross-entropy principle. This principle is discussed in more detail below.
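A short simulation sketch of this coordinate dependence (the sample size and bin edges are arbitrary): sampling X uniformly, which is the maximum entropy choice in the x coordinate, and transforming to Y = X² produces the density 1/(2√y) rather than the uniform density that maximizing entropy directly in the y coordinate would give.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)   # maximum entropy density in x
y = x**2                                     # induced samples of Y = X^2

# Compare an empirical histogram of Y with f_y(y) = 1 / (2*sqrt(y))
edges = np.linspace(0.01, 1.0, 12)
hist, _ = np.histogram(y, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.column_stack([centers, hist, 1.0 / (2.0 * np.sqrt(centers))]))
# The last two columns agree closely, and neither is constant in y.
```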



3.7. The Minimum Cross-Entropy Principle

In many situations, additional knowledge may exist about the shape of the probability density function or its relation to a certain family of probability functions. In this case, the relative entropy measure to a known probability density function is minimized (3). Minimum cross-entropy formulations for a probability density, f(x), and a target density, q(x), take the form

f_minXent(x) = arg min_{f(x)} ∫_a^b f(x) ln [f(x)/q(x)] dx

such that

∫_a^b h_i(x) f(x) dx = m_i, i = 1, ..., n,
∫_a^b f(x) dx = 1 and f(x) ≥ 0.

Using the method of Lagrange multipliers, one has

L(f(x)) = −∫_a^b f(x) ln [f(x)/q(x)] dx + a_0 {∫_a^b f(x) dx − 1} + Σ_{i=1}^{n} a_i {∫_a^b h_i(x) f(x) dx − m_i}.

Taking the first partial derivative with respect to f(x) and equating it to zero gives

∂L(f(x))/∂f(x) = −ln [f(x)/q(x)] − 1 + a_0 + Σ_{i=1}^{n} a_i h_i(x) = 0.

Rearranging shows that the minimum cross-entropy solution has the form

f_minXent(x) = q(x) e^(a_0 − 1 + a_1 h_1(x) + a_2 h_2(x) + ··· + a_n h_n(x)),

where a_i is the Lagrange multiplier for each constraint and f_minXent(x) is the minimum cross-entropy probability density. One can see that maximizing the entropy of f(x) is, therefore, a special case of minimizing the cross entropy when the target density, q(x), is uniform. The minimum cross-entropy principle generalizes the maximum entropy principle and assigns a probability distribution that minimizes the cross-entropy relative to a target (or reference) distribution.

3.7.1. Independence of the Choice of the Coordinate System

Now refer back to the maximum entropy solution in the continuous case and recall its dependence on the choice of the coordinate system. A remedy to this problem is obtained using the minimum cross-entropy principle. When using minimum cross-entropy, a target density function that represents our prior knowledge about any of the variables is first decided on. If the problem in the x coordinate system is then solved, one obtains a minimum cross-entropy distribution equal to

f_minXent(x) = q(x),

where q(x) is the target probability density function that represents our prior knowledge about variable x. On the other hand, if one solves the problem in the y coordinate system, the change-of-variables formula is first used to determine the target density for variable y,

q_y(y) = [q_x(x) / |dy/dx|]_{x=√y}.

Next, one finds the minimum cross-entropy distribution for y. If no further information exists except its domain and the target density, the minimum cross-entropy solution is equal to

f_minXent(y) = q_y(y),

a result that is consistent with the solution in the x coordinate system. Thus, solving a minimum cross-entropy problem (1) generalizes the maximum entropy approach and reduces to the maximum entropy solution if the target density, q(x), is uniform; and (2) solves the problem of coordinate system dependence for the continuous case.

4. ENTROPY RATE OF A STOCHASTIC PROCESS

The entropy rate of a stochastic process, {X_i}, is defined by

H(W) = lim_{n→∞} (1/n) H(X_1, X_2, ..., X_n),

when the limit exists. The entropy rate is well defined for stationary processes. When X_1, X_2, ..., X_n are independent and identically distributed random variables, one can use the chain rule of joint entropy to show that

H(W) = lim_{n→∞} (1/n) Σ_{i=1}^{n} H(X_i) = lim_{n→∞} n H(X_1)/n = H(X_1).

Another definition of the entropy rate is

H(W) = lim_{n→∞} H(X_n | X_{n−1}, X_{n−2}, ..., X_1).

For a stationary process, both definitions exist and are equivalent. Furthermore, for a stationary ergodic process, one has

−(1/n) log_2 P(X_1, X_2, ..., X_n) → H(W).

Using the results of the asymptotic equipartition theorem, one sees that a generated sequence of a stationary ergodic stochastic process has a probability of approximately 2^(−nH(W)), and that approximately 2^(nH(W)) possible sequences exist. Therefore, a typical sequence of length n can be represented using log 2^(nH(W)) = nH(W) bits. Thus, the entropy rate is the average description length per symbol for a stationary ergodic process. Of special interest is also the entropy rate of a Markov process, which can be easily calculated as

H(W) = lim_{n→∞} H(X_n | X_{n−1}, X_{n−2}, ..., X_1) = lim_{n→∞} H(X_n | X_{n−1}) = H(X_2 | X_1).

For a given stationary distribution, μ, and transition matrix, P, the entropy rate of a Markov process is

H(W) = −Σ_{i=1}^{n} Σ_{j=1}^{n} μ_i P_ij log P_ij.
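A sketch computing the entropy rate of a two-state Markov chain (the transition matrix is an arbitrary illustration): the stationary distribution μ is obtained from the left eigenvector of P with eigenvalue one, and the rate follows the formula above.

```python
import numpy as np

P = np.array([[0.9, 0.1],        # assumed transition matrix of a two-state chain
              [0.4, 0.6]])

# Stationary distribution: left eigenvector of P with eigenvalue 1
w, v = np.linalg.eig(P.T)
mu = np.real(v[:, np.argmin(np.abs(w - 1.0))])
mu = mu / mu.sum()

# Entropy rate H = -sum_i sum_j mu_i P_ij log2 P_ij
H_rate = -np.sum(mu[:, None] * P * np.log2(P))
print(mu, H_rate)                # about [0.8, 0.2] and 0.57 bits per step
```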

5. ENTROPY APPLICATIONS IN BIOLOGY

Maximum entropy applications have found wide use in various fields. In computational biology, for example, maximum entropy methods have been used extensively for modeling biological sequences; in particular, for modeling the probability distribution p(a | h), the probability that a will be the next amino acid given the history h of amino acids that have already occurred (6,7). Maximum entropy methods have also been applied to the modeling of short sequence motifs (8), in which short sequence motif distributions are approximated with the maximum entropy distribution (MED) consistent with low-order marginal constraints estimated from available data, which may include dependencies between nonadjacent as well as adjacent positions. Entropy methods have also been used in computing new distance measures between sequences (9). Mixtures of conditional maximum




entropy distributions have also been used to model biological sequences (10) and for modeling splicing sites with pairwise correlations (11). Maximum entropy analysis of biomedical literature has also been used for associating genes with gene ontology codes (12). Maximum entropy applications have also been used to assess the accuracy of structure analysis (13). Maximum entropy methods have also been used in the prediction of protein secondary structure in multiple sequence alignments, by combining the single-sequence predictions using a maximum entropy weighting scheme (14).

Recently, other expressions related to the entropy of a signal have found applications in biomedical engineering. For example, the approximate entropy, ApEn, is a term that quantifies regularities in data and time series (15). The approximate entropy has been applied in biology to discriminate atypical EEGs and respiratory patterns from normative counterparts (16–18) and for physiological time series analysis (19). Finally, it is pointed out that other forms of nonextensive entropy measures also exist, such as the Tsallis entropy (20) and Rényi's entropy (21), special cases of which reduce to Shannon's entropy.

BIBLIOGRAPHY
1. C. E. Shannon, A mathematical theory of communication. Bell Sys. Tech. J. 1948; 27:379–423, 623–656.
2. T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: John Wiley, 1991.
3. S. Kullback and R. A. Leibler, On information and sufficiency. Ann. Math. Statist. 1951; 22:79–86.
4. H. Joe, Relative entropy measures of multivariate dependence. J. Am. Statist. Assoc. 1989; 84:157–164.
5. E. T. Jaynes, Information theory and statistical mechanics. Phys. Rev. 1957; 106:620–630.
6. E. C. Buehler and L. H. Ungar, Maximum entropy methods for biological sequence modeling. J. Comput. Biol., in press.
7. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press, 1998.
8. G. Yeo and C. B. Burge, Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology (RECOMB '03), April 2003.

9. G. Benson, A new distance measure for comparing sequence profiles based on path lengths along an entropy surface. Bioinformatics 2002; 18(Suppl. 2):S44–S53.
10. D. Pavlov, Sequence modeling with mixtures of conditional maximum entropy distributions. Third IEEE International Conference on Data Mining (ICDM), 2003:251–258.
11. M. Arita, K. Tsuda, and K. Asai, Modeling splicing sites with pairwise correlations. Bioinformatics 2002; 18(Suppl. 2):S27–S34.
12. S. Raychaudhuri, J. T. Chang, P. D. Sutphin, and R. B. Altman, Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature. Genome Res. 2002; 12:203–214.
13. M. Sakata and M. Sato, Accurate structure analysis by the maximum-entropy method. Acta Cryst. 1990; A46:263–270.
14. A. Krogh and G. Mitchison, Maximum entropy weighting of aligned sequences of proteins or DNA. In: C. Rawlings, D. Clark, R. Altman, L. Hunter, T. Lengauer, and S. Wodak, eds., Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology. Menlo Park, CA: AAAI Press, 1995, pp. 215–221.
15. S. Pincus, Approximate entropy as a measure of system complexity. Proc. Natl. Acad. Sci. USA 1991; 88:2297–2301.
16. J. Bruhn, H. Ropcke, B. Rehberg, T. Bouillon, and A. Hoeft, Electroencephalogram approximate entropy correctly classifies the occurrence of burst suppression pattern as increasing anesthetic drug effect. Anesthesiology 2000; 93:981–985.
17. M. Engoren, Approximate entropy of respiratory rate and tidal volume during weaning from mechanical ventilation. Crit. Care Med. 1998; 26:1817–1823.
18. Hornero et al., 2005.
19. J. S. Richman and J. R. Moorman, Physiological time-series analysis using approximate entropy and sample entropy. Am. J. Physiol. Heart Circ. Physiol. 2000; 278:H2039–H2049.
20. C. Tsallis, J. Stat. Phys. 1988; 52:479.
21. A. Rényi, On measures of entropy and information. Proc. Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. I. Berkeley, CA: University of California Press, 1961:547.

READING LIST
E. T. Jaynes, Prior probabilities. IEEE Trans. Syst. Sci. Cybernet. 1968; 4:227–241.
