
Maximum Entropy: Density Estimation

22010780
Department of Mathematics and Statistics
University of Reading
June 8, 2016

Abstract
This paper describes in detail various density estimation techniques that yield probability distributions most consistent with our knowledge, and illustrates the Principle of Maximum Entropy, which is central to this report. The concept of maximum entropy is old yet very powerful. Nowadays, computers have become so powerful that we can apply this concept to a wide range of real-life scenarios in statistical estimation and pattern recognition.


Contents

List of Tables
List of Figures
1 Introduction
2 Entropy & The Principle of Maximum Entropy
  2.1 Entropy
  2.2 The Principle of Maximum Entropy
  2.3 Maximum Entropy Principle Applied
3 Density Estimation
  3.1 Lagrange Multipliers
  3.2 Fast Approximating Algorithms
4 Applications In Real Life Scenarios
  4.1 Image Restoration
5 Conclusion
6 Bibliography

List of Tables

1 Speedy Savers

List of Figures

1 Histogram plot of the sample frequencies against their respective values
2 P.d.f. estimates obtained with Algorithm 1
3 P.d.f. estimates obtained with Algorithm 2
4 P.d.f. estimates obtained with Algorithm 3
5 Plot of the average computational time against the number of moments used in estimation
6 Photograph before and after the Image Restoration process

1 Introduction

The concept of entropy is old, dating back to the mid-1800s when it was first introduced in thermodynamics [Dr. Nailong Wu (1997)]. One of the main motivations behind the principle of maximum entropy is the question: how do we go about finding the best probabilistic model given our limited knowledge/data? Boltzmann (1844-1906) had an answer to this problem, stating that out of all the possible models we should pick the one with the largest entropy. In other words, maximise the entropy of the model. This thought process, in a nutshell, brought about the answer to Boltzmann's underdetermined problem, which was to infer/predict the distribution of the phase space (positions and velocities) of gas particles.

    Whether by luck or inspiration, he (Boltzmann) put into his equation only the dynamical information (average energies and particle numbers) that happened to be relevant to the questions he was asking. [E.T. Jaynes (1979)]

With those measurements, Boltzmann used the Maximum Entropy Method (MEM) to identify the distribution of the phase space of the gas particles. However, he was criticised for ignoring the explicit dynamics of the particles. Shannon (1916-2001) later proposed another way of explaining the principle of maximum entropy. Through his studies in information theory, Shannon discovered in 1948 a unique quantity H that measures the uncertainty of an information source. We call this quantity H entropy, or more famously Shannon entropy, and we will come across it in the next section.

E.T. Jaynes (1922-1998), a more recent pioneer of studies in maximum entropy, simply stated that to justify the use of a distribution for inference (future predictions), it needs to agree with what we do know and carefully avoid assuming what we do not: "an ancient principle of wisdom" [A.L. Berger et al (1996)]. We will expand further on this concept in the next section. In Section 3 we will take a step-by-step look at some density estimation techniques yielding maximum entropy distributions consistent with our knowledge. Applications of the MEM in real-life scenarios range widely, from image restoration to spectral analysis and even natural language processing. It is interesting that our computers have become powerful enough that we can apply this concept (the principle of maximum entropy) to a plethora of real-world problems in statistical estimation and pattern recognition. We will briefly demonstrate and explain an example of this in Section 4.

Our main focus remains on density estimation methods through the application of the principle of maximum entropy. Techniques will be demonstrated, compared and critiqued.

2 Entropy & The Principle of Maximum Entropy

In this section we focus on defining entropy for the discrete and continuous cases and explain the principle of maximum entropy in a clear and direct manner. A probability density function (p.d.f.) is a function of a continuous random variable X (e.g. the height of male students at a university), whose integral over an interval yields the probability that X lies within that interval. Note that in this paper we will generally be dealing with entropy for a continuous p.d.f. p(x) rather than the discrete case.

2.1 Entropy

Entropy (or uncertainty) is represented quantitatively by the information (in terms of probability distributions) we do not possess about the state the system is in.
We calculate the entropy for a discrete probability distribution p on the finite set \{x_1, x_2, \ldots, x_n\}, with p_i = p(x_i); the entropy of p is defined as:

h(p) = -\sum_{i} p_i \ln p_i    (1)

Generally speaking, for a uniform p on a finite set \{x_1, x_2, \ldots, x_n\} (i.e. p(x_i) = 1/n for all i),

h(p) = \ln n.    (2)

Every p.d.f. with n finite outcomes has an entropy h(p) \le \ln n, as an equal chance of every outcome occurring brings about maximum uncertainty. Therefore we can say that the uniform distribution is a maximum entropy distribution, since h(p) for p(x_i) = 1/n on a finite countable set equals \sum_{i=1}^{n} -\tfrac{1}{n}\ln\tfrac{1}{n} = -n \cdot \tfrac{1}{n}\ln\tfrac{1}{n} = \ln n.
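As a small illustration of equations (1) and (2), the following snippet (a minimal sketch, assuming NumPy is available; the example distributions are invented for illustration) computes h(p) for a few discrete distributions over four outcomes; the uniform one attains the maximum value ln 4:

```python
import numpy as np

def entropy(p):
    """Discrete entropy h(p) = -sum_i p_i ln p_i, as in equation (1); zero-probability terms are skipped."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # ln 4 = 1.386..., the maximum
print(entropy([0.7, 0.1, 0.1, 0.1]))      # 0.940...
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0, no uncertainty at all
```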

For a continuous p.d.f. p(x) on an interval I, its entropy is defined as:

h(p) = -\int_{I} p(x) \ln p(x)\,dx.    (3)

From the probabilistic point of view, h(p) is viewed as a measure of the information carried by p, with larger entropy telling us that p carries less information (more uncertainty). Although it may seem that less information in our model cannot be a positive outcome, the principle of maximum entropy explains succinctly why it is, as we will see later in this section. Now we will look at some examples of the entropy of different densities that satisfy certain constraints. This is mainly to show the effect the constraints have on the value of the entropy.
Example 2.1.1 (Gaussian Distribution): Let p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}. We compute the entropy of the Gaussian density on the real line \mathbb{R} with respect to our constraints, the mean \mu and variance \sigma^2:

h(p) = -\int_{\mathbb{R}} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \ln\!\left( \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \right) dx

= -\int_{\mathbb{R}} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \left( -\ln(\sigma\sqrt{2\pi}) - \tfrac{1}{2}\left(\tfrac{x-\mu}{\sigma}\right)^2 \right) dx

= \ln(\sigma\sqrt{2\pi}) \int_{\mathbb{R}} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \;+\; \tfrac{1}{2}\int_{\mathbb{R}} \left(\tfrac{x-\mu}{\sigma}\right)^2 \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx

(This can be computed using the generalisations of the integral of a Gaussian function.)

= \tfrac{1}{2}\left(1 + \ln(2\pi\sigma^2)\right)

We witness here that the mean \mu has no effect on the entropy of the Gaussian density; in fact, we can state from our result that all Gaussians with the same \sigma have the same entropy. Now, if we focus on the effect \sigma has on h(p), we observe that for significantly small \sigma the entropy of a Gaussian is negative [K. Conrad (2013)]. We can also ask the question of whether it is reasonable that the mean does not enter the entropy. This question will be answered at the end of Example 3.1.1.
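As a quick sanity check on this result, the closed form h(p) = \frac{1}{2}(1 + \ln(2\pi\sigma^2)) can be compared against direct numerical integration of -\int p \ln p\,dx. The following sketch is illustrative only (it assumes NumPy and SciPy are available, and the \pm 12\sigma integration range is an arbitrary truncation):

```python
import numpy as np
from scipy.integrate import quad

def gaussian_entropy_numeric(mu, sigma):
    """Numerically evaluate h(p) = -int p(x) ln p(x) dx for a Gaussian density."""
    p = lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return quad(lambda x: -p(x) * np.log(p(x)), mu - 12 * sigma, mu + 12 * sigma)[0]

for mu, sigma in [(0.0, 1.0), (5.0, 1.0), (0.0, 0.1)]:
    analytic = 0.5 * (1 + np.log(2 * np.pi * sigma ** 2))
    print(mu, sigma, round(gaussian_entropy_numeric(mu, sigma), 4), round(analytic, 4))
# The mean has no effect on h(p); for sigma = 0.1 the entropy is negative.
```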

2.2 The Principle of Maximum Entropy

The principle of maximum entropy was introduced to us by E.T. Jaynes. It states that, subject to our knowledge (represented by constraints), the probability distribution that best reflects our knowledge is the one with the largest entropy [Wikipedia (2016)]. This principle aids our selection of densities most consistent with what we know and does not introduce unjustified information. By contrast, a distribution with smaller entropy asserts something stronger than what is actually being assumed and can lead to surprising predictions, which we do not want. Think of the MEM as a guide that can also advise us on what to do when our experimental data are not consistent with our predictions [K. Conrad (2013)]. We can ask ourselves questions like: "Are there missing constraints we should consider?" or "Are there current constraints we should omit?". An unseen constraint may be a huge factor affecting an experiment, so we include it in our p.d.f. and maximise entropy over the distributions satisfying our improved constraints. We will also witness that the maximum entropy principle explains a natural connection between the more familiar distributions and other distributions.

2.3 Maximum Entropy Principle Applied

Here we will approach an example that describes the principle in one simple case with one constraint and three input events. In this case the maximum entropy method can be carried out analytically. The more general technique will be shown in the next chapter.

Example 2.3.1 (Fast Food Restaurant): Before we can use the principle, the problem domain must be constructed first. It consists of the various states that the system can exist in, alongside all the parameters involved in the known constraints [P. Penfield (2003)] (e.g. the quantities associated with each state are assumed known, whether they be energy, speed, etc.). Assuming that we do not know which particular state is being occupied (our lack of knowledge), we deal instead with the probabilities of each state being occupied. Therefore, probabilities help us work with our incomplete knowledge.

Suppose we have been employed to analyse the revenue of a fast food franchise (Speedy Savers) in the UK (where prices are homogeneous). Through some primary research we deduce that customers spend an average of £1.75 per meal deal. The price, calories per meal, and probability of the food being served hot or cold are displayed in the table below:


Meal Deal   Main      Cost (£)   Calories   Probability of   Probability of
                                            arriving hot     arriving cold
1           Burger    1.00       800        0.6              0.4
2           Chicken   2.00       650        0.8              0.2
3           Fish      3.00       500        0.9              0.1

Table 1: Speedy Savers

Now with our problem set up, we still require knowledge of which state our system is in. We assume that for this fast food franchise model we have three outcomes. The probabilities of these outcomes tell us how likely a customer is to purchase any one of the three available meal deals. These underlying probabilities are denoted p(A_1), p(A_2) and p(A_3) for the respective meal deals 1-3. Each of the possible outcomes A_i (where i = 1, 2, 3) has a probability p(A_i), and these outcomes are mutually exclusive and exhaustive. Therefore:

\sum_{i=1}^{3} p(A_i) = 1    (4)

In this example we are playing the role of an observer. Different observers may acquire varied information, and because of that difference in knowledge they may attain different probability distributions as a result. In this sense the probability distributions are subjective.

Constraints: The principle of maximum entropy is only useful when applied to testable information [J. Liu (2012)]. If we had no additional information, then the assumption that all p(A_i) are equal would be reasonable; we would have maximum uncertainty (see equation (2)). However, if we do have additional information, the better choice is to use it to construct our constraints. In one sense we have some certainty (our current knowledge), although we are seeking maximum uncertainty. In this example we found out through research that the average price of a meal deal amounted to £1.75. We want to estimate the separate probabilities p(A_i) using only what we know. Our constraints are:

p(A_1) + p(A_2) + p(A_3) = 1
1.00\,p(A_1) + 2.00\,p(A_2) + 3.00\,p(A_3) = 1.75    (5)

Now what we have are two simultaneous equations with three unknowns, with insufficient information available to solve for them directly. Our unknowns are the values of the underlying probabilities sought from the maximum entropy distribution (where h(p) is largest):

h(p) = p(A_1) \ln\frac{1}{p(A_1)} + p(A_2) \ln\frac{1}{p(A_2)} + p(A_3) \ln\frac{1}{p(A_3)}    (6)

Again, we are simply searching for the probability distribution that uses nothing other than what is already known. Tracking back to our constraints, we can express p(A_1) and p(A_2) in terms of p(A_3). We have:

p(A_1) = 0.25 + p(A_3)
p(A_2) = 0.75 - 2\,p(A_3)    (7)

Next, we determine the ranges of our probabilities. Since 0 \le p(A_i) \le 1, this is very easy to do (using the equations in (7) and the fact that p(A_i) \ge 0):

0 \le p(A_3) \le 0.375
0 \le p(A_2) \le 0.75
0.25 \le p(A_1) \le 0.625    (8)

Now we can rewrite our entropy (6) in terms of p(A_3); we then only need to find the value of p(A_3) for which h(p) is largest to yield our maximum entropy distribution. Our entropy is:

h(p) = (0.25 + p(A_3)) \ln\frac{1}{0.25 + p(A_3)} + (0.75 - 2p(A_3)) \ln\frac{1}{0.75 - 2p(A_3)} + p(A_3) \ln\frac{1}{p(A_3)}    (9)

There are many techniques (see Section 3) that can be used to find the value of p(A_3) for which h(p) is maximum. In this example the maximum occurs at p(A_3) = 0.216 (found using the method of Lagrange multipliers), which leads us to p(A_1) = 0.466, p(A_2) = 0.318 and h(p) = 1.051, our maximised entropy. What we can learn from this is that Meal Deal 1 is the most popular purchase (most probably because of its price), followed by Meal Deal 2 and finally Meal Deal 3. The result is a maximum entropy distribution consistent with our constraints, introducing zero bias. The constraints play a vital role when calculating our distributions; next, we see where the constraints make this impact in a mathematical sense.
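As an aside, the value p(A_3) = 0.216 quoted above can be reproduced with a one-dimensional numerical search over the feasible range (8), maximising the entropy (9) directly. The sketch below is illustrative only (it assumes NumPy and SciPy are available) and is not the Lagrange-multiplier calculation itself:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def entropy(q):
    # q = p(A3); p(A1) and p(A2) follow from the constraint equations (7)
    p = np.array([0.25 + q, 0.75 - 2 * q, q])
    return -np.sum(p * np.log(p))

# Maximise h by minimising -h over the feasible range 0 < p(A3) < 0.375 from (8)
res = minimize_scalar(lambda q: -entropy(q), bounds=(1e-9, 0.375 - 1e-9), method="bounded")
q = res.x
print("p(A3) =", round(q, 3))            # ~0.216
print("p(A1) =", round(0.25 + q, 3))     # ~0.466
print("p(A2) =", round(0.75 - 2 * q, 3)) # ~0.318
print("h(p)  =", round(entropy(q), 3))   # ~1.051
```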

3 Density Estimation

There are various density estimation techniques for maximising entropy that we will encounter in this section. The aim is to maximise h(p) over all density functions satisfying certain constraints and so yield the p(x) with the largest entropy. We will first look at the method of Lagrange multipliers, explaining it step by step with a worked example. Afterwards we will encounter some computationally efficient (fast) approximating algorithms and evaluate them accordingly.

3.1 Lagrange Multipliers

This method is named after the Italian-born French mathematician Joseph-Louis Lagrange (1736-1813). Instead of attempting to reduce the number of unknowns through our constraint equations, we actually increase the number of unknowns using the multipliers \lambda_i [P. Penfield (2003)]. In order to demonstrate that a probability distribution is a maximum entropy distribution, we start off using the method of Lagrange multipliers. We will visit an example to clarify how this idea is carried out.


Consider a countable set of polynomials \{r_1(x), r_2(x), \ldots, r_n(x)\}, where x lies in a continuous outcome space with unspecified density p(x), with the assumption that

\int_a^b p(x)\, r_i(x)\,dx = M_i, \qquad i = 1, \ldots, n, \quad a, b \in \mathbb{R}    (10)

holds, where the M_i represent the moments (constraints) and I = [a, b] is the interval the p.d.f. is defined on. Remember, these constraints and the interval our proposed distribution is defined on are the ingredients that will help us find the p.d.f. with maximised h(p). For validity and completeness we take r_1(x) \equiv 1 and M_1 = 1, then include any further required constraints regarding p(x) being a p.d.f. (e.g. \int x\,p(x)\,dx = \mu, where \mu is the mean).

Now the task is to acquire p(x) subject to h(p) being maximised. To begin solving this problem (for the n constraints) we construct a function F and use a Lagrange multiplier \lambda_i for each of the constraints M_i. We have:

F(p, \lambda_1, \lambda_2, \ldots, \lambda_n) = -\int_a^b p(x) \ln p(x)\,dx + \sum_{i=1}^{n} \lambda_i \underbrace{\left( \int_a^b p(x) r_i(x)\,dx - M_i \right)}_{=0}    (11)

Expanding out the equation leads us to

F = \int_I \underbrace{\Big( -p(x) \ln p(x) + \lambda_1 p(x) + \lambda_2 r_2(x) p(x) + \ldots + \lambda_n r_n(x) p(x) \Big)}_{=L(x,\,p,\,\lambda_1,\,\lambda_2,\ldots,\lambda_n)}\,dx \;-\; \lambda_1 - \lambda_2 M_2 - \ldots - \lambda_n M_n

Therefore

\frac{\partial L}{\partial p} = -1 - \ln p + \lambda_1 + \lambda_2 r_2(x) + \ldots + \lambda_n r_n(x)    (12)

We have that \frac{\partial L}{\partial p} = 0 at a maximum entropy distribution, for which p(x) = e^{\lambda_1 - 1 + \lambda_2 r_2(x) + \ldots + \lambda_n r_n(x)}. The objective now is to seek the values of \lambda_i in terms of x and our constraints to attain our p with maximised entropy h(p). We will now go through an example illustrating the method of Lagrange multipliers to yield a maximum entropy distribution.

Example 3.1.1: In this example, we let p(x) be a density on \mathbb{R} subject to the constraints of mean \mu and variance \sigma^2:

F(p, \lambda_1, \lambda_2, \lambda_3) = -\int_{\mathbb{R}} p(x) \ln p(x)\,dx + \lambda_1\!\left(\int_{\mathbb{R}} p(x)\,dx - 1\right) + \lambda_2\!\left(\int_{\mathbb{R}} x\, p(x)\,dx - \mu\right) + \lambda_3\!\left(\int_{\mathbb{R}} (x-\mu)^2 p(x)\,dx - \sigma^2\right)

= \int_{\mathbb{R}} \Big( -p(x) \ln p(x) + \lambda_1 p(x) + \lambda_2 x\, p(x) + \lambda_3 (x-\mu)^2 p(x) \Big)\,dx - \lambda_1 - \lambda_2 \mu - \lambda_3 \sigma^2

Therefore

\frac{\partial L}{\partial p} = -1 - \ln p + \lambda_1 + \lambda_2 x + \lambda_3 (x-\mu)^2

\frac{\partial L}{\partial p} = 0 \;\Rightarrow\; p(x) = e^{\lambda_1 - 1 + \lambda_2 x + \lambda_3 (x-\mu)^2}

For \int p(x)\,dx to be finite we set \lambda_2 = 0 and \lambda_3 < 0. This leads to p(x) = e^{\lambda_1 - 1 + \lambda_3 (x-\mu)^2}; now we let a = \lambda_1 - 1 and b = -\lambda_3 > 0, which gives us p(x) = e^{a - b(x-\mu)^2}.

\int_{\mathbb{R}} p(x)\,dx = \int_{\mathbb{R}} e^{a - b(x-\mu)^2}\,dx = e^{a}\sqrt{\pi/b} = 1 \;\Rightarrow\; e^{a} = \sqrt{b/\pi} \;\Rightarrow\; a = \ln\!\left(\sqrt{b/\pi}\right)

Now p(x) = \sqrt{b/\pi}\, e^{-b(x-\mu)^2}, and

\int_{\mathbb{R}} x\, p(x)\,dx = \int_{\mathbb{R}} x \sqrt{b/\pi}\, e^{-b(x-\mu)^2}\,dx = \mu

\int_{\mathbb{R}} (x-\mu)^2 p(x)\,dx = \int_{\mathbb{R}} (x-\mu)^2 \sqrt{b/\pi}\, e^{-b(x-\mu)^2}\,dx = \frac{1}{2b} = \sigma^2 \;\Rightarrow\; b = \frac{1}{2\sigma^2}

\Rightarrow\; p(x) = e^{\ln\left(\sqrt{1/(2\pi\sigma^2)}\right) - (x-\mu)^2/(2\sigma^2)} = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2},

which is a Gaussian distribution like the one we encountered in Example 2.1.1. Since we have already calculated the entropy h(p) of the Gaussian p.d.f., we can state that for a continuous p.d.f. p on \mathbb{R} subject to mean \mu and variance \sigma^2,

h(p) \le \tfrac{1}{2}\left(1 + \ln(2\pi\sigma^2)\right),

with equality iff p is the Gaussian (i.e. the maximum entropy distribution satisfying the constraints). Going back to our question in Example 2.1.1, it is reasonable that the mean does not enter the entropy. Tracing back our steps in this example, we set \lambda_2 = 0 to keep the integral of p finite; this is the same as omitting \mu from the entropy (as a moment). However, we must notice that the variance constraint still depends on \mu.
Overall, the notion behind this technique is easy to understand. It allows us to effectively maximise our entropy function h(p) subject to our moment constraints (the information we have gathered about the distribution we are seeking), no matter how many we would like to include. Clearly, as we have stressed before, it is better to include what we know so that we calculate a distribution most consistent with that knowledge. However, the computational efficiency has a strong relationship with the number of constraints that we include. Notably, for every additional moment constraint included, the difficulty associated with yielding the maximum entropy distribution also increases. Intuitively, this makes very much sense, as each additional constraint implies that we need to solve for an extra Lagrange multiplier \lambda_i. Fortunately, several computational methods have been developed to counter this issue. In the following part of this section we will witness three algorithms that help us quickly compute maximum entropy distributions subject to what we know.


3.2 Fast Approximating Algorithms

Here we will encounter three different algorithms whose purpose is to yield approximations to the p.d.f. with maximum entropy, on the basis of a finite number of sampled data. Comparisons of the results of these proposed algorithms will be made with the exact maximum entropy estimates in terms of accuracy and efficiency.

Let p(x) be the p.d.f. we want to compute, defined over an interval [a, b] where a, b \in \mathbb{R}, subject to the constraints:

\int_a^b p(x)\,dx = 1, \qquad p(x) \ge 0, \; x \in [a, b]    (13)

and n extra moment constraints of the form (the same as we encountered in Section 3.1):

\int_a^b p(x)\, r_i(x)\,dx = M_i, \qquad i = 1, \ldots, n    (14)

where the functions r_i(x) and the constants M_i are both known.


Since the method we are currently looking at is an extension of the method of Lagrange multipliers from Section 3.1, we can state that the maximisation of h(p) subject to the constraints (13) and (14) gives us the solution (which Jaynes [E.T. Jaynes (1957)] has shown) in the form:

p(x) = e^{-\lambda_0 - \lambda_1 r_1(x) - \ldots - \lambda_n r_n(x)}    (15)

where the elements \lambda_i of the set of Lagrange multipliers \{\lambda_0, \lambda_1, \ldots, \lambda_n\} satisfy:

\int_a^b e^{-\sum_{j=1}^{n} \lambda_j r_j(x)}\,dx = e^{\lambda_0}    (16)

\frac{\displaystyle\int_a^b r_i(x)\, e^{-\sum_{j=1}^{n} \lambda_j r_j(x)}\,dx}{\displaystyle\int_a^b e^{-\sum_{j=1}^{n} \lambda_j r_j(x)}\,dx} = M_i, \qquad i = 1, \ldots, n    (17)

So as not to confuse ourselves: equations (14) and (17) are equivalent. We notice that the determination of a maximum entropy distribution has been reduced to the solution of a system of n non-linear equations (17) in n unknowns.
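To make this reduction concrete, a direct (if naive) way to solve (16)-(17) is to treat the n moment-matching conditions as a root-finding problem for \lambda_1, \ldots, \lambda_n. The sketch below only illustrates this idea and is not one of the algorithms of the cited paper; the function names, the hypothetical moment values and the zero initial guess are assumptions, and convergence is not guaranteed for every moment set:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve

def exact_maxent(moments, a=0.0, b=1.0):
    """Solve the non-linear system (17) for lambda_1..lambda_n (with r_i(x) = x^i),
    recover e^{lambda_0} from (16), and return the density p(x) of (15)."""
    n = len(moments)

    def kernel(x, lams):
        return np.exp(-sum(l * x ** (j + 1) for j, l in enumerate(lams)))

    def residuals(lams):
        Z = quad(lambda x: kernel(x, lams), a, b)[0]
        return [quad(lambda x: x ** (i + 1) * kernel(x, lams), a, b)[0] / Z - Mi
                for i, Mi in enumerate(moments)]

    lams = fsolve(residuals, x0=np.zeros(n))
    Z = quad(lambda x: kernel(x, lams), a, b)[0]   # Z = e^{lambda_0}, from (16)
    return lambda x: kernel(x, lams) / Z

# Hypothetical first two moments of some data on [0, 1]
p = exact_maxent([0.35, 0.18])
print(quad(p, 0, 1)[0])   # integrates to ~1 by construction
```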
Three computationally efficient algorithms for obtaining a solution to (17) will now be illustrated. Throughout Section 3.2 it is assumed that all the functions r_i are of the form r_i(x) = x^i, i = 1, \ldots, n. Our available data represent what we know, namely the constants M_i in (14). The algorithms will work towards yielding an approximation to p (with maximum h(p)) in the form of a linear combination of basis functions:

\hat{p}(x) = \sum_{j=1}^{n} c_j\, p_j(x) \approx p(x)    (18)

The coefficients c_j are computed by solving a system of linear equations:

c_1 \int_a^b p_1(x)\,dx + \ldots + c_n \int_a^b p_n(x)\,dx = 1
c_1 \int_a^b x\, p_1(x)\,dx + \ldots + c_n \int_a^b x\, p_n(x)\,dx = M_1
\vdots
c_1 \int_a^b x^n p_1(x)\,dx + \ldots + c_n \int_a^b x^n p_n(x)\,dx = M_n    (19)

The basis functions p_j will differ according to the algorithm used to calculate them; however, it is vital that we establish the relation between the approximating function \hat{p} and the maximum entropy p.d.f. p. Let the functional E be defined by E(y, x^j) = \int_a^b x^j y(x)\,dx for any generic function y and j \in \mathbb{Z}. By construction:

E(p, x^j) = E(\hat{p}, x^j), \qquad j = 1, \ldots, n    (20)

The values of the functionals E are equal, and if all the constraints are matched in this way then \hat{p} \to p as n \to \infty. This tells us that the more moment constraints (more knowledge) we have, the closer our approximation \hat{p} is to the true maximum entropy distribution p. Now we will approach the three algorithms with the intention of yielding the basis functions p_j, which we will substitute into the system of equations (19) to find the coefficients c_j. The final step will be to substitute our c_j and p_j(x) into (18) to get our approximation to the maximum entropy distribution.
Algorithm 1: Tchebycheff (Chebyshev) polynomials are used in this algorithm to determine the p_j, after normalising the interval [a, b] to [0, 1]:

p_1(x) = 1
p_2(x) = x
p_3(x) = 2x^2 - 1
p_j(x) = 2x\,p_{j-1}(x) - p_{j-2}(x), \qquad j = 3, 4, \ldots    (21)

This algorithm is very easy to use, and with this choice of basis functions the system of linear equations (19) can be solved directly. However, the limitation is that the basis functions are independent of the available data (i.e. of the moments M_i).
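As a rough illustration of how Algorithm 1 could be put together, the sketch below builds the Tchebycheff basis (21), fills in the integrals of the linear system (19) and solves for the coefficients of (18). It is a sketch under assumptions: the moment values are hypothetical, the helper names are invented, and a least-squares solve is used because the system as written above has one more equation than unknowns:

```python
import numpy as np
from scipy.integrate import quad

def tchebycheff_basis(n):
    """First n basis functions p_1, ..., p_n of (21) on the normalised interval [0, 1]."""
    basis = [lambda x: np.ones_like(np.asarray(x, dtype=float)), lambda x: x]
    for _ in range(2, n):
        pm1, pm2 = basis[-1], basis[-2]
        basis.append(lambda x, pm1=pm1, pm2=pm2: 2 * x * pm1(x) - pm2(x))
    return basis[:n]

def algorithm1(moments, a=0.0, b=1.0):
    """Approximate the maximum entropy p.d.f. from the moments M_1..M_n via (19) and (18)."""
    n = len(moments)
    basis = tchebycheff_basis(n)
    rhs = np.array([1.0] + list(moments))          # normalisation plus the n moments
    A = np.zeros((n + 1, n))
    for i in range(n + 1):                         # rows: powers x^0 .. x^n
        for j, pj in enumerate(basis):             # columns: basis functions p_j
            A[i, j] = quad(lambda x: x ** i * pj(x), a, b)[0]
    c, *_ = np.linalg.lstsq(A, rhs, rcond=None)    # coefficients c_j of (18)
    return lambda x: sum(cj * pj(x) for cj, pj in zip(c, basis))

# Hypothetical first three moments of some data on [0, 1]
p_hat = algorithm1([0.35, 0.18, 0.11])
print(p_hat(0.5))
```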


Algorithm 2: Jaynes' solution p to the maximum entropy estimation problem is displayed in (15). Here each of the basis functions p_j is taken as the solution of a simplified maximum entropy problem using only one of the known moment constraints. The target is to maximise h(p_j) = -\int_a^b p_j(x) \ln p_j(x)\,dx subject to the constraints:

\int_a^b p_j(x)\,dx = 1, \qquad \int_a^b x^j p_j(x)\,dx = M_j, \qquad j = 1, \ldots, n    (22)

Our solution according to Jaynes' result is:

p_j(x) = e^{-\lambda_{0j} - \lambda_j x^j}    (23)

where the Lagrange multipliers \lambda_{0j} and \lambda_j are obtained by solving the system of non-linear equations:

\int_a^b e^{-\lambda_j x^j}\,dx = e^{\lambda_{0j}}, \qquad \frac{\displaystyle\int_a^b x^j e^{-\lambda_j x^j}\,dx}{\displaystyle\int_a^b e^{-\lambda_j x^j}\,dx} = M_j, \qquad j = 1, \ldots, n    (24)

The benefit of this method is that, instead of going through the havoc of solving a non-linear system in n unknowns (see equation (17)), we instead solve n independent non-linear equations in one unknown each. After we have solved our n non-linear equations (n depending on how many basis functions we want to calculate), our p_j are determined, and (19) can be solved.
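A sketch of the one-dimensional sub-problem (24) that Algorithm 2 solves for each basis function is given below. It is illustrative only: the bracketing interval for the root and the example moment value are assumptions, not part of the algorithm's specification:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def basis_function_alg2(j, Mj, a=0.0, b=1.0):
    """Solve (24) for the j-th basis function p_j(x) = exp(-lambda_0j - lambda_j * x^j)."""
    def moment_residual(lam):
        num = quad(lambda x: x ** j * np.exp(-lam * x ** j), a, b)[0]
        den = quad(lambda x: np.exp(-lam * x ** j), a, b)[0]
        return num / den - Mj
    lam = brentq(moment_residual, -50.0, 50.0)         # assumed bracket for the single unknown
    Z = quad(lambda x: np.exp(-lam * x ** j), a, b)[0]  # Z = e^{lambda_0j}
    return lambda x: np.exp(-lam * x ** j) / Z

# Hypothetical first moment M_1 = 0.35 on [0, 1]
p1 = basis_function_alg2(1, 0.35)
print(quad(p1, 0, 1)[0])   # integrates to ~1
```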
Algorithm 3: This algorithm is more or less a replica of the previous one (Algorithm 2). Instead of the basis functions p_j being taken as the solution of a simplified maximum entropy problem using only one of the constraints, we now use two of the known constraints. The refined target now is to maximise h(p_j) subject to:

\int_a^b p_j(x)\,dx = 1, \qquad \int_a^b x^q p_j(x)\,dx = M_q, \qquad \int_a^b x^t p_j(x)\,dx = M_t, \qquad q, t \in \{1, \ldots, n\}    (25)


For each p_j it is important that a different couple (M_q, M_t) is selected. The solution for each function p_j is again given through Jaynes' result; however, the non-linear system yielding the values of the Lagrange multipliers is now two-dimensional. The limitation of this method is that, instead of solving n independent non-linear equations in one unknown as in Algorithm 2, we solve n non-linear systems in two unknowns. Again, after we have solved these n systems, our p_j are determined. This algorithm introduces increased complexity in determining the p_j and, furthermore, in solving (19).
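In the same spirit, each basis function of Algorithm 3 requires a two-dimensional non-linear solve, matching a couple of moments (M_q, M_t) simultaneously. The sketch below is again illustrative: the zero initial guess and the hypothetical moment couple are assumptions:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve

def basis_function_alg3(q, t, Mq, Mt, a=0.0, b=1.0):
    """Solve the 2-D moment-matching problem for p_j(x) = exp(-l0 - lq*x^q - lt*x^t), following (15)."""
    def residuals(lams):
        lq, lt = lams
        den = quad(lambda x: np.exp(-lq * x ** q - lt * x ** t), a, b)[0]
        rq = quad(lambda x: x ** q * np.exp(-lq * x ** q - lt * x ** t), a, b)[0] / den - Mq
        rt = quad(lambda x: x ** t * np.exp(-lq * x ** q - lt * x ** t), a, b)[0] / den - Mt
        return [rq, rt]
    lq, lt = fsolve(residuals, x0=[0.0, 0.0])
    Z = quad(lambda x: np.exp(-lq * x ** q - lt * x ** t), a, b)[0]
    return lambda x: np.exp(-lq * x ** q - lt * x ** t) / Z

# Hypothetical couple of moments (M_1, M_2) on [0, 1]
p12 = basis_function_alg3(1, 2, 0.35, 0.18)
print(quad(p12, 0, 1)[0])   # integrates to ~1
```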
Summary: Each of the stated algorithms is based on the use of basis functions in order to yield the maximum entropy estimate \hat{p}, which converges to p as n \to \infty.
Summarised steps:
1. Choose an algorithm.
2. Calculate the basis functions p_j through the algorithmic technique.
3. Use the calculated p_j to determine the coefficients c_j by solving a linear system of equations (see equation (19)).
4. Substitute the calculated p_j and c_j into equation (18) to find the approximation \hat{p} to the maximum entropy distribution p.
Algorithms 2 and 3 increase complexity by forming basis functions on the basis of our available data, linking each of these functions to a simplified maximum entropy problem. An interesting point is that we could proceed to define Algorithms 4, 5, 6, ..., n if we wanted to. For Algorithm 4, for example, the solution for each function p_j is still given through Jaynes' result, but the non-linear system yielding our Lagrange multipliers is now three-dimensional (the computational effort has again increased). If all n constraints are applied, we simply return to Jaynes' original solution (15).
In the following part of this section we will set each technique we have discussed (in Section 3.2) in motion. The efficiency of the proposed algorithms will also be assessed and evaluated.


The following example is based on research conducted by the DSEA, University of Pisa.
Example 3.2.1: In this example we consider data sampled from a hyperbolic distribution on the interval [0, 1]; there are 1500 samples. Figure 1 shows the histogram of sample frequency against sample value. The moments of the dataset are again assumed known, and each of the algorithms we have just defined has been applied to them. Comparisons between the true maximum entropy estimate and the results from the algorithms are made and displayed in Figures 2, 3 and 4. It is interesting to see the accuracy of, and convergence to, the true maximum entropy distribution across the algorithms.
Below are the graphs taken from the research conducted [A. Balestrino et al (2003)]. Figures 2, 3 and 4 display the true maximum entropy estimate (obtained using four constraints) as a solid black line. The dotted, dashed-dotted and dashed lines represent each algorithm's approximations using 2, 3 and 4 constraints respectively.

Figure 1: Histogram plot of the sample frequencies against their respective values
Figure 2: P.d.f. estimates obtained with Algorithm 1
Figure 3: P.d.f. estimates obtained with Algorithm 2
Figure 4: P.d.f. estimates obtained with Algorithm 3


If we look closely at Figure 1 we see that the shape formed by our histogram is very similar to that of the hyperbolic distribution, which makes sense since that is where our data are generated from. One thing that is consistent between the results of the three algorithms is the convergence to the true maximum entropy estimate (solid black line) as the number of moment constraints increases. This emphasises the maximum entropy principle, telling us that applying more constraints generally brings us closer to the maximum entropy distribution consistent with the data. Figure 2 shows the p.d.f. obtained with Algorithm 1; there is a relatively close relationship between the estimates and the true maximum entropy estimate. However, if we visually compare this to Figures 3 and 4, it is easy enough to see the differences in approximation error. The estimates in the other figures are closer to the true maximum entropy estimate than the ones seen in Figure 2. This is explained by the fact that the basis functions p_j yielded by Algorithms 2 and 3 depend on the constraint values. Figure 4 shows the estimate closest to the true maximum entropy estimate; the dashed line in that figure is practically indistinguishable from the black line. Therefore, we deduce that Algorithm 3 is the most proficient, followed by Algorithms 2 and 1. Again, this is understandable, as the third algorithm yields each of the basis functions from a different couple of constraints, instead of a different single constraint (see Algorithm 2).

Figure 5: Plot of the average computational time against the number of moments used in estimation. The solid (with circles), dashed-dotted (with diamonds), dashed (with stars) and dotted (with triangles) lines represent the true maximum entropy estimate and Algorithms 3, 2 and 1 respectively.
Although we have witnessed a direct positive correlation between the number of moments and convergence to the true maximum entropy estimate, it is interesting to see at what computational cost (in terms of time, in seconds) this comes. Figure 5 shows us that as the complexity of the algorithms increases, the computational time taken to obtain the estimates also increases. Notably, we see the greatest change in time as the number of moments increases from 5 to 6. In fact, as the number of moments increases, the rate of change in the time taken also increases.


4 Applications In Real Life Scenarios

In the real world we like to predict which events are most likely to occur in the time to come. For example, stock market speculators might want to predict market trends, physicists may want to predict occurrences such as whether civilisation would be wiped out by an asteroid in a given time frame, betting agencies might want to predict the chances of a football team winning the championship, and so on.

4.1 Image Restoration

The MEM is becoming an increasingly popular and general approach to restoring images from noisy and incomplete data. In astronomy, this method has been used across the electromagnetic spectrum for radio aperture synthesis [J. Skilling & R.K. Bryan (1984)], x-ray imaging, gamma-ray imaging and much more. The result of applying the MEM to an image reconstruction problem is an image of optimal quality in the sense that it is consistent with the measured data while introducing no unjustified information.
Below is an image of a woman before and after the Image Restoration process:

Figure 6: Photograph before and after the Image Restoration process
Here we will only view the form of the solution. S(f) is the power spectrum density of the time series, and f denotes the frequency [Dr. Nailong Wu (1997)]:

S(f) = \exp\!\left( \sum_{l=-a}^{a} \lambda_l^{*}\, e^{-2\pi i f l} \right)    (26)

where the complex conjugates of the Lagrange multipliers are denoted by \lambda_l^{*} and are real functions of the time series x(n):

\lambda_l^{*} = 2\,\mathrm{IFT}\!\left[ \ln \left| \mathrm{FT}[x(n)] \right| \right]    (27)

where IFT means inverse Fourier transform and FT means Fourier transform. The entropy here is given as h(S(f)) = -\int S(f) \ln S(f)\,df.
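As a very rough illustration of the formulas above, the \lambda_l coefficients can be computed with a discrete Fourier transform of the log-magnitude spectrum of a sampled time series, and the form (26) then evaluated on a frequency grid. This is only a sketch of the stated formulas under several assumptions (the truncation to 2a+1 coefficients, the small offset inside the logarithm, and the test signal are all illustrative choices, not part of the cited method):

```python
import numpy as np

def mem_spectrum(x, a):
    """Evaluate S(f) = exp(sum_{l=-a}^{a} lambda_l e^{-2*pi*i*f*l}) with
    lambda_l = 2 * IFT[ln |FT[x(n)]|], following (26)-(27)."""
    lam = 2.0 * np.fft.ifft(np.log(np.abs(np.fft.fft(x)) + 1e-12))   # small offset avoids log(0)
    ls = np.arange(-a, a + 1)
    coeffs = np.concatenate((lam[-a:], lam[:a + 1]))                 # lambda_{-a}, ..., lambda_{a}
    freqs = np.linspace(0.0, 0.5, 256)                               # normalised frequency grid
    S = np.array([np.exp(np.sum(coeffs * np.exp(-2j * np.pi * f * ls))).real for f in freqs])
    return freqs, S

# Hypothetical time series: a noisy sinusoid at normalised frequency 0.1
n = np.arange(512)
x = np.sin(2 * np.pi * 0.1 * n) + 0.3 * np.random.default_rng(0).standard_normal(512)
freqs, S = mem_spectrum(x, a=20)
print(freqs[np.argmax(S)])   # expected to peak near 0.1
```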


5 Conclusion

Various density estimation techniques have been presented to aid us in computing or approximating densities with the largest entropy. The MEM gives us exactly what we ask for: a distribution that is a reflection of our knowledge. The beauty of this technique is that there is no limit to the amount of knowledge we can feed our model. In practice, we should not encounter many surprises when comparing our experimental data with data from our maximum entropy distributions. We saw this in Example 3.2.1, where there was a direct correlation between the number of moments and convergence towards the true maximum entropy estimate. The method of Lagrange multipliers was seen as the foundation for yielding maximum entropy distributions, playing a central role in our fast approximating algorithms. However, we also met a drawback of the method: when many constraints were included, a solution became hard to compute. In short, the MEM can take a priori data and lead us to probable inferences of events to come.


6 Bibliography

Dr. Nailong Wu (1997) The Maximum Entropy Method, Springer Series in Information Sciences.
E.T. Jaynes (1979) The Maximum Entropy Formalism, MIT, Cambridge, MA.
A.L. Berger, V.J. Della Pietra & S.A. Della Pietra (1996) A Maximum Entropy Approach to Natural Language Processing, Dept. of Computer Science, Columbia University.
Paul Penfield (2003) Information and Entropy, Dept. of Electrical Engineering and Computer Science, MIT, Cambridge, MA, Chapter 9.
Keith Conrad (2013) Probability Distributions and Maximum Entropy.
Wikipedia (2016) Principle of Maximum Entropy [online]. Available at https://en.wikipedia.org/wiki/Principle_of_maximum_entropy (Accessed on 2nd February).
A. Balestrino, A. Caiti, A. Noe & F. Parenti (2003) Maximum Entropy Based Numerical Algorithms for Approximation of Probability Density Functions, Dept. of Electrical Systems and Automation, University of Pisa, Italy.
E.T. Jaynes (1957) Information Theory and Statistical Mechanics, vol. 106, pp. 361-373.
Jiawang Liu (2012) Baidu [online]. Available at http://www.slideshare.net/JiawangLiu/maxent (Accessed on 1st March).
Paul Penfield (2003) Information and Entropy, Dept. of Electrical Engineering and Computer Science, MIT, Cambridge, MA, Chapter 10.
J. Skilling & R.K. Bryan (1984) Maximum Entropy Image Reconstruction: General Algorithm, Dept. of Applied Mathematics and Theoretical Physics, Cambridge, UK.
