22010780
Department of Mathematics and Statistics
University of Reading
June 8, 2016
Abstract
This paper describes various density estimation techniques that yield probability distributions most consistent with our knowledge, and illustrates the Principle of Maximum Entropy, which is central to this report. The concept of maximum entropy is old yet powerful. Nowadays, computers have become potent enough that we can apply this principle to a wide range of real-life scenarios in statistical estimation and pattern recognition.
MA3PR
Dr Patrick Ilg

Contents

List of Tables
List of Figures
1 Introduction
2 Entropy and the Principle of Maximum Entropy
  2.1 Entropy
  2.2 The Principle of Maximum Entropy
  2.3 A Simple Example
3 Density Estimation
  3.1 Lagrange Multipliers
  3.2 Approximating Algorithms
4 Applications
  4.1 Image Restoration
5 Conclusion
6 Bibliography

List of Tables

1 Speedy Savers

List of Figures

1 Histogram of sample frequencies against sample values
2 Algorithm 1 estimates compared with the true maximum entropy estimate
3 Algorithm 2 estimates compared with the true maximum entropy estimate
4 Algorithm 3 estimates compared with the true maximum entropy estimate
Introduction
The concept of entropy dates back to the mid-1800s, when it was first introduced into thermodynamics [Dr. Nailong Wu (1997)]. One of the main motivations behind the principle of maximum entropy is the question: how do we go about finding the best probabilistic model given our limited knowledge/data? Boltzmann (1844-1906) had an answer to this problem: out of all the possible models, pick the one with the largest entropy. In other words, maximise the entropy of the model. This thought process, in a nutshell, brought about the answer to Boltzmann's underdetermined problem, which was to infer/predict the distribution of the phase space (positions and velocities) of gas particles.
Whether by luck or inspiration, he (Boltzmann) put into his equation only the
dynamical information (average energies and particle numbers) that happened
to be relevant to the questions he was asking [E.T. Jaynes (1979)]
With those microscopic measurements, Boltzmann used the Maximum Entropy Method (MEM) to identify the distribution of the phase space of the gas particles. However, he was criticised for ignoring the explicit dynamics of the particles. Later, Shannon (1916-2001) proposed another way of explaining the principle of maximum entropy. Through his studies in information theory, in 1948 Shannon discovered a unique quantity H that measured the uncertainty of an information source. We call this quantity H entropy, or more famously Shannon entropy, which we will come across in the next section.
E.T. Jaynes (1922-1998), a more recent pioneer of maximum entropy studies, stated simply that to justify the use of a distribution for inference (future predictions), it needs to agree with what we do know while carefully avoiding assumptions about what we do not: "an ancient principle of wisdom" [A.L. Berger et al (1996)]. We will expand further on this concept in the next section. In Section 3 we take a step-by-step look at some density estimation techniques yielding maximum entropy distributions consistent with our knowledge. Applications of the MEM in real-life scenarios range widely, from image restoration to spectral analysis and even natural language processing. It is interesting that our computers have become potent enough that we can apply the principle of maximum entropy to a plethora of real-world problems in statistical estimation and pattern recognition. We will briefly demonstrate and explain an example of this in Section 4.
Our main focus remains to look at density estimation methods through the application
of the principle of maximum entropy. Techniques will be demonstrated, compared and
critiqued.
Entropy and the Principle of Maximum Entropy

In this section we focus on defining entropy for the discrete and continuous cases and explain the principle of maximum entropy in a clear and direct manner. A probability density function (p.d.f.) is a function of a continuous random variable X (e.g. the height of male students at a university) whose integral over an interval yields the probability that X lies within that interval. Note that in this paper we will generally deal with entropy for a continuous p.d.f. p(x) rather than the discrete case.
2.1 Entropy
Entropy (or uncertainty) is represented quantitatively by the information (in terms of probability distributions) we do not possess about the state the system is in.
For a discrete probability distribution p on the finite set {x_1, x_2, ...}, with p_i = p(x_i), the entropy of p is defined as:

h(p) = −∑_{i≥1} p_i ln p_i        (1)

Generally speaking, for a uniform p on a finite set {x_1, x_2, ..., x_n} (i.e. p(x_i) = 1/n for all i),

h(p) = ln n.        (2)

Every p.d.f. with n finite outcomes has an entropy h(p) ≤ ln n, as an equal chance of every outcome occurring brings about maximum uncertainty. Therefore we can say that the uniform distribution is a maximum entropy distribution, as h(p) for p(x_i) = 1/n on a finite countable set is equal to

−∑_{i=1}^{n} (1/n) ln(1/n) = −n · (1/n) ln(1/n) = ln n.        (3)
From the probabilistic point of view, h(p) is viewed as a measure of the information carried by p, with larger entropy telling us that p carries less information (more uncertainty). Although it may seem that less information in our model is not a positive outcome, the principle of maximum entropy explains succinctly why it is, as we will see later in this section. Now we will look at some examples of the entropy of different densities satisfying certain constraints, mainly to show the effect the constraints have on the value of the entropy.
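As a quick numerical illustration of equations (1)-(3), the following Python sketch (the two example distributions are chosen purely for illustration) computes h(p) for a uniform and a skewed distribution on four outcomes:

```python
import math

def entropy(p):
    """h(p) = -sum p_i ln p_i, with 0*ln(0) taken as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
print(entropy(uniform))   # ln 4 ≈ 1.3863, the maximum for four outcomes
print(entropy(skewed))    # ≈ 0.9404, smaller: the skewed distribution carries more information
```

The uniform case attains the bound ln n of equation (2), while any departure from uniformity lowers the entropy.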
Example 2.1.1 (Gaussian Distribution): Let p(x) = (1/(σ√(2π))) e^{−(1/2)((x−μ)/σ)²}. We compute the entropy of the Gaussian density on the real line R with respect to our constraints, the mean μ and variance σ²:

h(p) = −∫_R (1/(σ√(2π))) e^{−(1/2)((x−μ)/σ)²} ln( (1/(σ√(2π))) e^{−(1/2)((x−μ)/σ)²} ) dx

     = −∫_R (1/(σ√(2π))) e^{−(1/2)((x−μ)/σ)²} ( −ln(σ√(2π)) − (1/2)((x−μ)/σ)² ) dx

     = ln(σ√(2π)) ∫_R p(x) dx + (1/2) ∫_R ((x−μ)/σ)² p(x) dx

(these can be computed using the generalisations of the integral of a Gaussian function)

     = (1/2)(1 + ln(2πσ²))
We witness here that our mean μ has no effect on the entropy of the Gaussian density; in fact, we can state from our result that all Gaussians with the same σ have the same entropy. If we focus on the effect σ has on h(p), we observe that for significantly small σ the entropy of a Gaussian is negative [K. Conrad (2013)]. We can also ask whether it is reasonable that the mean does not enter the entropy; this question will be answered at the end of Example 3.1.1.
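To make the result concrete, here is a small Python sketch (the parameter values are illustrative) comparing the closed-form entropy (1/2)(1 + ln(2πσ²)) against a direct numerical integration of −∫ p ln p, which also confirms that shifting the mean leaves the entropy unchanged and that small σ gives negative entropy:

```python
import math

def gaussian_entropy_formula(sigma):
    """Closed form: h = (1/2)(1 + ln(2 pi sigma^2))."""
    return 0.5 * (1 + math.log(2 * math.pi * sigma ** 2))

def gaussian_entropy_numeric(mu, sigma, n=200000, width=12.0):
    """Trapezoid-rule integration of -p ln p over mu +/- width*sigma."""
    a = mu - width * sigma
    h = 2 * width * sigma / n
    total = 0.0
    for k in range(n + 1):
        x = a + k * h
        p = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        f = -p * math.log(p)
        total += f / 2 if k in (0, n) else f
    return total * h

print(gaussian_entropy_formula(1.0))        # ≈ 1.4189
print(gaussian_entropy_numeric(3.0, 1.0))   # ≈ 1.4189: independent of the mean
print(gaussian_entropy_formula(0.05))       # negative, for small sigma
```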
2.2 The Principle of Maximum Entropy
The principle of maximum entropy was introduced to us by E.T. Jaynes. It states that, subject to our knowledge (represented by constraints), the probability distribution that best reflects our knowledge is the one with the largest entropy [Wikipedia (2016)]. This principle aids our selection of densities most consistent with what we know, without introducing unjustified information. By contrast, a smaller entropy asserts something stronger than what is being assumed and can lead to surprising predictions, which we do not want. Think of the MEM as a guide that can also advise us what to do when our experimental data are not consistent with our predictions [K. Conrad (2013)]. We can ask ourselves questions like: are there missing constraints we should consider, or current constraints we should omit? An unseen constraint may be a huge factor affecting an experiment, so we include it in our p.d.f. and maximise entropy over the distributions satisfying our improved constraints. We will also witness that the maximum entropy principle explains a natural connection between the more familiar distributions and other distributions.
2.3 A Simple Example

Here we approach an example that describes the principle in a simple case of one constraint and three input events. In this case the maximum entropy method can be carried out analytically; the more general technique will be shown in the next chapter.
Example 2.3.1 (Fast Food Restaurant): Before we can use the principle, the problem domain must first be constructed. It consists of the various states that the system can exist in, alongside all the parameters involved in the known constraints [P. Penfield (2003)] (e.g. the quantities associated with each state are assumed known, whether energy, speed, etc.). Assuming that we do not know which particular state is occupied (our lack of knowledge), we deal instead with the probabilities of each state being occupied. Therefore, probabilities help us work with our incomplete knowledge.
Suppose we have been employed to analyse the revenue of a fast food franchise (Speedy Savers) in the UK (where prices are homogeneous). Through some primary research we deduce that customers spend an average of £1.75 per meal deal. The price, calories per meal, and probability of the food being served hot or cold are displayed in the table below:
Table 1: Speedy Savers

Meal Deal   Main      Cost (£)   Calories   Probability of   Probability of
                                            arriving hot     arriving cold
1           Burger    1.00       800        0.6              0.4
2           Chicken   2.00       650        0.8              0.2
3           Fish      3.00       500        0.9              0.1
Now, with our problem set up, we still require knowledge of which state our system is in. We assume that this fast food franchise model has three outcomes, whose probabilities tell us how likely a customer is to purchase any one of the three available meal deals. These underlying probabilities are denoted p(A_1), p(A_2) and p(A_3) for the respective meal deals 1-3. Each possible outcome A_i (i = 1, 2, 3) has a probability p(A_i), and the outcomes are mutually exclusive and exhaustive. Therefore:

∑_{i=1}^{3} p(A_i) = 1        (4)
In this example we are playing the role of an observer. Different observers may acquire
varied information and because of that difference in knowledge, they may attain different
probability distributions as a result. In this sense the probability distributions are subjective.
Constraints: The principle of maximum entropy is only useful when applied to testable information [J. Liu (2012)]. If we had no additional information, then the assumption that all p(A_i) are equal would be reasonable and we would have maximum uncertainty (see equation (2)). However, since we do have additional information, the better choice is to use it to construct our constraints. In one sense we have some certainty (our current knowledge), although we are seeking maximum uncertainty. In this example, we found through research that the average price of a meal deal amounted to £1.75. We want to estimate the separate probabilities p(A_i) using only what we know.
Our constraints are:
p(A_1) + p(A_2) + p(A_3) = 1
1.00 p(A_1) + 2.00 p(A_2) + 3.00 p(A_3) = 1.75        (5)
Now what we have are two simultaneous equations with three unknowns, with insufficient
information available to solve for these unknowns. Our unknowns are the values of the
underlying probabilities sought-after from the maximum entropy distribution (where h(p)
is largest).
h(p) = p(A_1) ln(1/p(A_1)) + p(A_2) ln(1/p(A_2)) + p(A_3) ln(1/p(A_3))        (6)
Again, we are simply searching for the probability distribution that uses nothing other than what is already known. Tracking back to our constraints, we can express p(A_1) and p(A_2) in terms of p(A_3):

p(A_1) = 0.25 + p(A_3),        p(A_2) = 0.75 − 2p(A_3)        (7)
Next, we determine the ranges of our probabilities. Since 0 ≤ p(A_i) ≤ 1, this is very easy to do (using the equations in (7) and the fact that p(A_i) ≥ 0):

0 ≤ p(A_3) ≤ 0.375,        0 ≤ p(A_2) ≤ 0.75,        0.25 ≤ p(A_1) ≤ 0.625        (8)
Now we can rewrite the entropy in (6) in terms of p(A_3); we need only find the value of p(A_3) for which h(p) is largest to yield our maximum entropy distribution. Our entropy is:

h(p) = (0.25 + p(A_3)) ln(1/(0.25 + p(A_3))) + (0.75 − 2p(A_3)) ln(1/(0.75 − 2p(A_3))) + p(A_3) ln(1/p(A_3))        (9)
There are many techniques (see Section 3) that can be used to find the value of p(A_3) at which h(p) is maximal. In this example the maximum occurs at p(A_3) = 0.216 (found using the method of Lagrange multipliers), which leads to p(A_1) = 0.466, p(A_2) = 0.318 and a maximised entropy of h(p) = 1.051. What we can learn from this is that Meal Deal 1 is the most popular purchase (most probably because of its price), followed by Meal Deal 2 and finally Meal Deal 3. The result is a maximum entropy distribution consistent with our constraints that introduces zero bias. The constraints play a vital role when calculating our distributions; next we see where they make this impact in a mathematical sense.
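The numbers above can be reproduced with a short Python sketch. Rather than the full Lagrange machinery, this sketch eliminates p(A_1) and p(A_2) using the two constraints in (5) and then finds the stationary point of the resulting one-variable entropy by bisection (a choice made for this sketch, not the method named in the text):

```python
import math

def dh(p3):
    """Derivative of h with respect to p(A3) after eliminating p(A1) and p(A2)."""
    return 2 * math.log(0.75 - 2 * p3) - math.log(0.25 + p3) - math.log(p3)

# dh is strictly decreasing on (0, 0.375), so bisection finds the unique maximiser
lo, hi = 1e-9, 0.375 - 1e-9
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if dh(mid) > 0:
        lo = mid
    else:
        hi = mid

p3 = 0.5 * (lo + hi)
p1, p2 = 0.25 + p3, 0.75 - 2 * p3
h = -(p1 * math.log(p1) + p2 * math.log(p2) + p3 * math.log(p3))
print(round(p1, 3), round(p2, 3), round(p3, 3), round(h, 3))   # 0.466 0.318 0.216 1.051
```

Both constraints are satisfied by the result: the probabilities sum to 1 and the expected price comes out at £1.75.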
Density Estimation
There are various density estimation techniques for maximising entropy that we will encounter in this section. The aim is to maximise h(p) over all density functions satisfying certain constraints, yielding the p(x) with the largest entropy. We first look at the method of Lagrange multipliers, explaining it step by step with a worked example. Afterwards we encounter some computationally efficient (fast) approximating algorithms and evaluate them accordingly.
3.1 Lagrange Multipliers
This method is named after the Italian-born French mathematician Joseph-Louis Lagrange (1736-1813). Instead of attempting to reduce the number of unknowns through our constraint equations, we actually increase the number of unknowns using the multipliers λ_i [P. Penfield (2003)]. To demonstrate that a probability distribution is a maximum entropy distribution, we start with the method of Lagrange multipliers; we will visit an example to clarify how this idea is carried out.
Consider a countable set of polynomials {r_1(x), r_2(x), ..., r_n(x)}, where x belongs to a continuous outcome space with unspecified density p(x), with the assumption that

∫_a^b p(x) r_i(x) dx = M_i,        i = 1, ..., n,  a, b ∈ R        (10)

holds, where the M_i represent the moments (constraints) and [a, b] is the interval the p.d.f. is defined on. Remember, these constraints and the interval our proposed distribution is defined on are the ingredients that will help us find the p.d.f. with maximised h(p). For validity and completeness we take r_1(x) ≡ 1 and M_1 = 1, then include the further constraints required for p(x) to be a p.d.f. (e.g. ∫_R x p(x) dx = μ, where μ is the mean).
Now the task is to acquire the p(x) for which h(p) is maximised. To begin solving this problem (for the n constraints) we construct a function F, using a Lagrange coefficient λ_i for each of the constraints M_i. We have:

F(p, λ_1, λ_2, ..., λ_n) = −∫_a^b p(x) ln p(x) dx + ∑_{i=1}^{n} λ_i ( ∫_a^b p(x) r_i(x) dx − M_i ) = ∫_a^b L(x, p, λ_1, ..., λ_n) dx        (11)

(each bracketed term equals zero when the constraints hold). Therefore

∂L/∂p = −1 − ln p + λ_1 + λ_2 r_2(x) + ... + λ_n r_n(x)        (12)
Example 3.1.1: In this example we let p(x) be a density on R subject to the constraints mean μ and variance σ²:

F(p, λ_1, λ_2, λ_3) = −∫_R p(x) ln p(x) dx + λ_1( ∫_R p(x) dx − 1 ) + λ_2( ∫_R x p(x) dx − μ ) + λ_3( ∫_R (x − μ)² p(x) dx − σ² )

Therefore

∂L/∂p = −1 − ln p + λ_1 + λ_2 x + λ_3 (x − μ)²

∂L/∂p = 0  ⟹  p(x) = e^{λ_1 − 1 + λ_2 x + λ_3 (x − μ)²}

For ∫_R p(x) dx to be finite we set λ_2 = 0 and λ_3 < 0. This leads to p(x) = e^{λ_1 − 1 + λ_3 (x − μ)²}. Writing a = λ_1 − 1 and b = −λ_3 > 0,

∫_R p(x) dx = ∫_R e^{a − b(x − μ)²} dx = e^a √(π/b) = 1  ⟹  e^a = √(b/π)  ⟹  a = ln √(b/π)

Now p(x) = √(b/π) e^{−b(x − μ)²}, and

∫_R x p(x) dx = ∫_R x √(b/π) e^{−b(x − μ)²} dx = μ

and

∫_R (x − μ)² p(x) dx = ∫_R (x − μ)² √(b/π) e^{−b(x − μ)²} dx = 1/(2b) = σ²  ⟹  b = 1/(2σ²)

⟹  p(x) = √(1/(2πσ²)) e^{−(x − μ)²/(2σ²)} = (1/(σ√(2π))) e^{−(1/2)((x − μ)/σ)²}

which is a Gaussian distribution, as we encountered in Example 2.1.1. Since we have already calculated the entropy h(p) of the Gaussian p.d.f., we can state that for a continuous p.d.f. p on R subject to mean μ and variance σ²,

h(p) ≤ (1/2)(1 + ln(2πσ²))

with equality iff p is the Gaussian (i.e. the maximum entropy distribution satisfying the constraints). Going back to our question in Example 2.1.1, it is reasonable that the mean does not enter the entropy: tracing back our steps in this example, we set λ_2 = 0 to keep the integral of p finite, which is the same as omitting μ from the entropy (as a moment). However, we must notice that the variance σ² still depends on μ through the variance constraint.
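The inequality above can be checked numerically. This sketch compares the closed-form entropies of a Gaussian, a uniform and a Laplace density, each scaled to the same variance σ² = 1 (the uniform and Laplace entropy formulas are standard results, not derived in this report):

```python
import math

sigma = 1.0
# closed-form differential entropies, each density scaled to variance sigma^2
h_gauss = 0.5 * (1 + math.log(2 * math.pi * sigma ** 2))   # maximum entropy bound, ≈ 1.4189
h_uniform = math.log(math.sqrt(12.0) * sigma)              # uniform of width sqrt(12)*sigma, ≈ 1.2425
h_laplace = 1 + math.log(math.sqrt(2.0) * sigma)           # Laplace with scale sigma/sqrt(2), ≈ 1.3466
print(h_gauss > h_laplace > h_uniform)   # True: the Gaussian attains the bound
```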
Overall, the notion behind this technique is easy to understand. It allows us to effectively maximise our entropy function h(p) subject to our moment constraints (the information we have gathered about the distribution we are seeking), no matter how many we would like to include. Clearly, as we have stressed before, it is better to include what we know, so that we calculate a distribution most consistent with that knowledge. However, the computational efficiency has a strong relationship with the number of constraints included: for every additional moment constraint, the difficulty of yielding the maximum entropy distribution also increases. Intuitively this makes sense, as each additional constraint means we must solve for an extra Lagrange multiplier λ_i. Fortunately, several computational methods have been developed to counter this issue. In the following part of this section we will see three algorithms that help us quickly compute maximum entropy distributions subject to what we know.
3.2 Approximating Algorithms
Here we encounter three different algorithms whose purpose is to yield approximations to the p.d.f. with maximum entropy, on the basis of a finite number of sampled data. The results of these algorithms will be compared with the exact maximum entropy estimates in terms of accuracy and efficiency.
Let p(x) be the p.d.f. we want to compute, defined over an interval [a, b] where a, b ∈ R, subject to the constraints:

p(x) ≥ 0 for all x ∈ [a, b],        ∫_a^b p(x) dx = 1,        (13)

and n extra moment constraints of the form (the same as encountered in Section 3.1):

∫_a^b p(x) r_i(x) dx = M_i,        i = 1, ..., n        (14)

By Jaynes' result, the maximum entropy solution takes the exponential form

p(x) = e^{−λ_0 − ∑_{j=1}^{n} λ_j r_j(x)}        (15)

where normalisation requires

∫_a^b e^{−∑_{j=1}^{n} λ_j r_j(x)} dx = e^{λ_0}        (16)

and the moment constraints become

( ∫_a^b r_i(x) e^{−∑_{j=1}^{n} λ_j r_j(x)} dx ) / ( ∫_a^b e^{−∑_{j=1}^{n} λ_j r_j(x)} dx ) = M_i,        i = 1, ..., n        (17)
To avoid confusion, note that equations (14) and (17) are equivalent. We see that the determination of a maximum entropy distribution has been reduced to the solution of a system of n non-linear equations (17) in n unknowns.
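As a sketch of what solving (17) involves, the following Python code treats the case n = 2 on [0, 1] with r_i(x) = x^i, using trapezoidal quadrature and a Newton iteration with a finite-difference Jacobian. The target moments M_1 = 0.4, M_2 = 0.2 and all numerical choices are illustrative assumptions, and the sign convention p(x) ∝ e^{−λ_1 x − λ_2 x²} is one common choice (conventions vary):

```python
import math

def moments(lams, n_grid=2000):
    """Moments M1, M2 of p(x) proportional to exp(-(l1*x + l2*x^2)) on [0, 1], trapezoid rule."""
    h = 1.0 / n_grid
    Z = m1 = m2 = 0.0
    for k in range(n_grid + 1):
        x = k * h
        w = 0.5 if k in (0, n_grid) else 1.0   # trapezoid end-point weights
        p = math.exp(-(lams[0] * x + lams[1] * x * x))
        Z += w * p
        m1 += w * x * p
        m2 += w * x * x * p
    return m1 / Z, m2 / Z

def solve_multipliers(targets, iters=50, eps=1e-6):
    """Newton iteration on the 2x2 moment-matching system (17), finite-difference Jacobian."""
    lam = [0.0, 0.0]
    for _ in range(iters):
        f = [m - t for m, t in zip(moments(lam), targets)]
        J = [[0.0, 0.0], [0.0, 0.0]]
        for j in range(2):
            bumped = lam[:]
            bumped[j] += eps
            fb = [m - t for m, t in zip(moments(bumped), targets)]
            for i in range(2):
                J[i][j] = (fb[i] - f[i]) / eps
        det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
        lam[0] -= (f[0] * J[1][1] - f[1] * J[0][1]) / det   # Cramer's rule for the Newton step
        lam[1] -= (J[0][0] * f[1] - J[1][0] * f[0]) / det
    return lam

lam = solve_multipliers([0.4, 0.2])
print(moments(lam))   # the recovered density matches the target moments
```

Even for n = 2 the coupling between the multipliers is visible, which motivates the decoupled algorithms below.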
Three computationally efficient algorithms for obtaining a solution to (17) will now be illustrated. Throughout Section 3.2 it is assumed that all the functions r_i are of the form r_i(x) = x^i, i = 1, ..., n. Our available data represent what we know, namely the constants M_i in (14). The algorithms work towards yielding an approximation to p (with maximum h(p)) in the form of a linear combination of basis functions:
p̂(x) = ∑_{j=1}^{n} α_j p_j(x) ≈ p(x)        (18)

Substituting p̂ into the constraints gives the linear system (the coefficients are denoted α_j here):

α_1 ∫_a^b p_1(x) dx + ... + α_n ∫_a^b p_n(x) dx = 1
α_1 ∫_a^b x p_1(x) dx + ... + α_n ∫_a^b x p_n(x) dx = M_1
⋮
α_1 ∫_a^b x^n p_1(x) dx + ... + α_n ∫_a^b x^n p_n(x) dx = M_n        (19)
The basis functions p_j will differ when calculated through the respective algorithms; however, it is vital that we establish the relation between the approximating function p̂ and the maximum entropy p.d.f. p.
Let the functional E be defined by E(y, x^j) = ∫_a^b x^j y(x) dx for any generic function y and j ∈ Z. By construction:

E(p, x^j) = E(p̂, x^j),        j = 1, ..., n        (20)
The values of the functionals E are equal, and if all their constraints are also equal then p̂ → p as n → ∞. This tells us that the more moment constraints (more knowledge) we have, the closer our approximation p̂ is to the true maximum entropy distribution p. We now approach the three algorithms with the intention of yielding the basis functions p_j, which we substitute into the system of equations (19) to find the solutions α_j. The final step is to substitute our α_j and p_j(x) into (18) to obtain our approximation to the maximum entropy distribution.
Algorithm 1: Tchebycheff polynomials are used in this algorithm to determine our p_j, after normalisation of the interval [a, b] to [0, 1]:

p_1(x) = 1
p_2(x) = x
p_3(x) = 2x² − 1
p_j(x) = 2x p_{j−1}(x) − p_{j−2}(x),        j = 3, 4, ...        (21)
This algorithm is very easy to use, and with this choice of basis functions the system of linear equations (19) can be solved directly. The limitation, however, is that the basis functions yielded are independent of the available data (i.e. of the coefficients M_i).
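The recurrence in (21) is straightforward to implement; a minimal sketch evaluating the first few basis functions at a point (the function name is ours, chosen for illustration):

```python
def tcheb_basis(x, n):
    """First n Tchebycheff basis functions p_1..p_n evaluated at x via the recurrence (21)."""
    vals = [1.0, float(x)]
    for _ in range(n - 2):
        vals.append(2 * x * vals[-1] - vals[-2])   # p_j = 2x p_{j-1} - p_{j-2}
    return vals[:n]

print(tcheb_basis(0.5, 4))   # [1.0, 0.5, -0.5, -1.0]
```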
Algorithm 2: Here each basis function p_j is taken as the solution of a simplified maximum entropy problem that uses only one of the known constraints, i.e. maximise h(p_j) subject to:

∫_a^b p_j(x) dx = 1        (22)

∫_a^b x^j p_j(x) dx = M_j,        j = 1, ..., n        (23)
By Jaynes' result the solution has the exponential form p_j(x) = e^{−λ_{0j} − λ_j x^j}, where the Lagrange multipliers λ_{0j} and λ_j are obtained by solving the system of non-linear equations:

∫_a^b e^{−λ_j x^j} dx = e^{λ_{0j}},        ( ∫_a^b x^j e^{−λ_j x^j} dx ) / ( ∫_a^b e^{−λ_j x^j} dx ) = M_j,        j = 1, ..., n        (24)
The benefit of this method is that, instead of the burden of solving a non-linear system in n unknowns (see equation (17)), we solve n independent non-linear equations in one unknown each. After we have solved our n non-linear equations (n depending on how many basis functions we want to calculate), the p_j are determined, and now (19) can be solved.
Algorithm 3: This algorithm is more or less a replica of Algorithm 2, except that, instead of each basis function p_j being taken as the solution of a simplified maximum entropy problem using only one of the constraints, we now use two of the known constraints.
The refined target now is to maximise h(p_j) subject to:

∫_a^b p_j(x) dx = 1,        ∫_a^b x^q p_j(x) dx = M_q,        ∫_a^b x^t p_j(x) dx = M_t,        q, t ∈ [1, n]        (25)
For each p_j it is important that a different couple (M_q, M_t) is selected. The solution for each function p_j is again given by Jaynes' result; however, the non-linear system yielding the values of our Lagrange multipliers is now 2-dimensional.
The limitation of this method is that, instead of solving n independent non-linear equations in one unknown as in Algorithm 2, we solve n 2-dimensional non-linear systems. Again, after we have solved these, our p_j are determined. This algorithm introduces increased complexity both in determining our p_j and in solving (19).
Summary: Each of the stated algorithms is based on the use of basis functions in order to yield a maximum entropy estimate p̂ which converges to p as n → ∞.
Summarised steps:
1. Choose an algorithm.
2. Calculate the basis functions p_j through the algorithmic technique.
3. Use the calculated p_j to determine the coefficients α_j by solving a linear system of equations (see equation (19)).
4. Substitute the calculated p_j and α_j into equation (18) to find the approximation p̂ to the maximum entropy distribution p.
Algorithms 2 and 3 increase complexity by forming basis functions on the basis of our available data, linking each of these functions to a simplified maximum entropy problem. An interesting point is that we could proceed to define Algorithms 4, 5, 6, ..., n if we wanted to. For Algorithm 4, for example, the solution for each function p_j is still given by Jaynes' result, but the non-linear system yielding our Lagrange multipliers is now 3-dimensional (the computational effort has again increased). If all n constraints are applied, we simply return to Jaynes' original solution (15).
In the following part of this section we will set each technique we have discussed (in Section
3.2) in motion. The efficiency of the proposed algorithms will also be assessed and evaluated.
The following example is based on research conducted by the DSEA, University of Pisa.
Example 3.2.1: In this example we consider data sampled from a hyperbolic distribution on the interval [0, 1]; there are 1500 samples. Figure 1 shows the histogram of sample frequency against sample value. The moments of the dataset are again assumed known, and each of the algorithms we have just defined has been applied to them. Comparisons between the true maximum entropy estimate and the results from the algorithms are made and displayed in Figures 2, 3 and 4. It is interesting to see each algorithm's accuracy and convergence to the true maximum entropy distribution.
Below is the series of graphs taken from the research conducted [A. Balestrino et al (2003)]. Figures 2, 3 and 4 display the true maximum entropy estimate (obtained using four constraints) as a solid black line. The dotted, dash-dotted and dashed lines represent each algorithm's approximations using 2, 3 and 4 constraints respectively.
Figure 1: Histogram plot of the sample frequencies against their respective values
If we look closely at Figure 1, we see that the shape formed by our histogram is very similar to that of the hyperbolic distribution, which makes sense since that is where our data were generated from. One thing that is consistent between the results of the three algorithms is the convergence to the true maximum entropy estimate (solid black line) as the number of moment constraints increases. This reflects the maximum entropy principle: applying more constraints (more knowledge) generally brings us closer to the maximum entropy distribution. Figure 2 shows the p.d.f. obtained with Algorithm 1; there is a relatively close relationship between the estimates and the true maximum entropy estimate. However, if we visually compare this with Figures 3 and 4, it is easy to see the differences in approximation error. The estimates in the other figures are closer to the true maximum entropy estimate than those in Figure 2, explained by the fact that the basis functions p_j yielded by Algorithms 2 and 3 depend on the constraint values. Figure 4 shows the estimate closest to the true maximum entropy estimate; the dashed line in the figure is practically indistinguishable from the black line. Therefore we deduce that Algorithm 3 is the most proficient, followed by 2 and then 1. Again this is understandable, as the third algorithm yields each of the basis functions from a different couple of constraints, instead of a different single constraint (see Algorithm 2).
Applications

In the real world we like to predict which events are most likely to occur in the times to come. For example, world stock market speculators might want to predict market trends, physicists may want to predict occurrences such as whether civilisation would be wiped out by an asteroid in a given time frame, betting agencies might want to predict the chances of a football team winning the championship, and so on.
4.1 Image Restoration
The MEM is becoming an increasingly popular and more general approach to restoring
images from noisy and incomplete data. In astronomy, this method has been used across
the electromagnetic spectrum for radio aperture synthesis [J. Skilling & R.K. Bryan (1984)],
x-ray imaging, gamma-ray imaging and much more. The result of the MEM applied to an
image reconstruction problem is an image of optimal quality.
Below is an image of a woman before and after the image restoration process. [Image omitted.]

A related application is maximum entropy spectral analysis, where the estimated spectrum S(f) of a time series takes the form

S(f) = exp( ∑_{l=−a}^{a} λ_l e^{2πifl} )        (26)

where the complex conjugates of the Lagrange multipliers are denoted λ_l and are also real functions of the time series x(n):

λ_l = 2 IFT[ ln |FT[x(n)]| ]        (27)

where IFT means inverse Fourier transform and FT means Fourier transform. The entropy here is given as h(S(f)) = −∫_R S(f) ln S(f) df.
Conclusion
Bibliography
N. Wu (1997). The Maximum Entropy Method. Springer Series in Information Sciences.
E.T. Jaynes (1979). The Maximum Entropy Formalism. MIT, Cambridge, MA.
A.L. Berger, V.J. Della Pietra & S.A. Della Pietra (1996). A Maximum Entropy Approach to Natural Language Processing. Dept. of Computer Science, Columbia University.
P. Penfield (2003). Information and Entropy. Dept. of Electrical Engineering and Computer Science, MIT, Cambridge, MA, Chapter 9.
K. Conrad (2013). Probability Distributions and Maximum Entropy.
Wikipedia (2016). Principle of Maximum Entropy [online]. Available at https://en.wikipedia.org/wiki/Principle_of_maximum_entropy (Accessed 2nd February).
A. Balestrino, A. Caiti, A. Noe & F. Parenti (2003). Maximum Entropy Based Numerical Algorithms for Approximation of Probability Density Functions. Dept. of Electrical Systems and Automation, University of Pisa, Italy.
E.T. Jaynes (1957). Information Theory and Statistical Mechanics. Physical Review, vol. 106, pp. 361-373.
J. Liu (2012). Baidu [online]. Available at http://www.slideshare.net/JiawangLiu/maxent (Accessed 1st March).
P. Penfield (2003). Information and Entropy. Dept. of Electrical Engineering and Computer Science, MIT, Cambridge, MA, Chapter 10.
J. Skilling & R.K. Bryan (1984). Maximum Entropy Image Reconstruction: General Algorithm. Dept. of Applied Mathematics and Theoretical Physics, Cambridge, UK.