This article is about the statistical techniques. For computer data storage, see Partial response maximum likelihood.

In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model's parameters.

The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example, one may be interested in the heights of adult female penguins, but be unable to measure the height of every single penguin in a population due to cost or time constraints. Assuming that the heights are normally distributed with some unknown mean and variance, the mean and variance can be estimated with MLE while only knowing the heights of some sample of the overall population. MLE would accomplish this by taking the mean and variance as parameters and finding the particular parametric values that make the observed results the most probable given the model.

In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the agreement of the selected model with the observed data, and for discrete random variables it indeed maximizes the probability of the observed data under the resulting distribution. Maximum-likelihood estimation gives a unified approach to estimation, which is well-defined in the case of the normal distribution and many other problems. However, in some complicated problems difficulties do occur: in such problems, maximum-likelihood estimators are unsuitable or do not exist.
1 Principles

Suppose there is a sample x_1, x_2, \ldots, x_n of n independent and identically distributed observations, coming from a distribution with an unknown probability density function f_0(\cdot). It is however surmised that the function f_0 belongs to a certain family of distributions \{ f(\cdot \mid \theta),\ \theta \in \Theta \} (where \theta is a vector of parameters for this family), so that f_0 = f(\cdot \mid \theta_0) for some unknown true value \theta_0. It is desirable to find an estimator \hat\theta which would be as close to the true value \theta_0 as possible. Either or both the observed variables x_i and the parameter \theta can be vectors.

To use the method of maximum likelihood, one first specifies the joint density function for all observations. For an independent and identically distributed sample, this joint density function is

f(x_1, x_2, \ldots, x_n \mid \theta) = f(x_1 \mid \theta)\, f(x_2 \mid \theta) \cdots f(x_n \mid \theta).

Now we look at this function from a different perspective, by considering the observed values x_1, \ldots, x_n to be fixed and \theta to be the variable; viewed this way, it is called the likelihood:

L(\theta\,; x_1, \ldots, x_n) = f(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta).

Note that the semicolon denotes a separation between the two kinds of input arguments: the parameter \theta and the vector-valued observations x_1, \ldots, x_n.

In practice it is often more convenient to work with the logarithm of the likelihood function, the log-likelihood

\ln L(\theta\,; x_1, \ldots, x_n) = \sum_{i=1}^n \ln f(x_i \mid \theta),

or with the average log-likelihood \hat\ell(\theta\,; x) = \frac{1}{n} \ln L(\theta\,; x_1, \ldots, x_n). The method of maximum likelihood estimates \theta_0 by finding a value of \theta that maximizes this criterion; the maximum-likelihood estimator (MLE) of \theta_0 is

\{\hat\theta_{\mathrm{mle}}\} \subseteq \{\arg\max_{\theta \in \Theta} \hat\ell(\theta\,; x_1, \ldots, x_n)\},

if a maximum exists. The MLE is the same regardless of whether we maximize the likelihood or the log-likelihood function, since log is a strictly monotonically increasing function.
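To make the definitions concrete, the following minimal Python sketch evaluates the likelihood and the log-likelihood of an i.i.d. normal sample with known variance over a grid of candidate means. The data, grid, and true parameter value are illustrative assumptions, not part of the article; the point is that, log being strictly increasing, both criteria pick out the same maximizer.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)   # i.i.d. sample, assumed true mu = 5

mus = np.linspace(3, 7, 401)                   # grid of candidate parameter values
# log-likelihood: sum of log-densities; likelihood: its exponential
log_lik = np.array([norm.logpdf(x, loc=m, scale=2.0).sum() for m in mus])
lik = np.exp(log_lik)                          # tiny numbers; logs avoid underflow

# both objectives are maximized at the same parameter value
assert mus[np.argmax(lik)] == mus[np.argmax(log_lik)]
print("MLE of mu over the grid:", mus[np.argmax(log_lik)])
```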
For many models, a maximum likelihood estimator can be found as an explicit function of the observed data x_1, \ldots, x_n. For many other models, however, no closed-form solution to the maximization problem is known or available, and an MLE has to be found numerically using optimization methods. For some problems, there may be multiple estimates that maximize the likelihood. For other problems, no maximum likelihood estimate exists (meaning that the log-likelihood function increases without attaining the supremum value).
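As one hedged example of the numerical route, the shape parameter of the gamma distribution has no closed-form MLE, so the fit below minimizes the negative log-likelihood with a general-purpose optimizer. The simulated data, the starting point, and the choice of Nelder-Mead are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

rng = np.random.default_rng(1)
x = rng.gamma(shape=3.0, scale=2.0, size=500)  # sample; true parameters are assumptions

def neg_log_lik(params):
    a, scale = params
    if a <= 0 or scale <= 0:                   # keep the search inside the parameter space
        return np.inf
    return -gamma.logpdf(x, a=a, scale=scale).sum()

res = minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
print("numerical MLE (shape, scale):", res.x)
```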
In the exposition above, it is assumed that the data are independent and identically distributed. The method can be applied, however, in a broader setting, as long as it is possible to write the joint density function f(x_1, \ldots, x_n \mid \theta), and its parameter \theta has a finite dimension which does not depend on the sample size n. In a simpler extension, an allowance can be made for data heterogeneity, so that the joint density is equal to f_1(x_1 \mid \theta)\, f_2(x_2 \mid \theta) \cdots f_n(x_n \mid \theta). Put another way, we are now assuming that each observation x_i comes from a random variable that has its own distribution function f_i. In the more complicated case of time-series models, the independence assumption may have to be dropped as well.
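A minimal sketch of such a heterogeneous model, assuming normally distributed observations that share a mean but have known, observation-specific variances (all specifics below are illustrative): the joint log-density is a sum of different terms \ln f_i(x_i \mid \theta), and the MLE of the common mean works out to the precision-weighted average.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(2)
sigmas = rng.uniform(0.5, 3.0, size=200)       # known, observation-specific scales
x = rng.normal(loc=10.0, scale=sigmas)         # heterogeneous observations, common mean

# joint log-density is a sum of *different* log-densities f_i(x_i | theta)
def neg_log_lik(theta):
    return -norm.logpdf(x, loc=theta, scale=sigmas).sum()

res = minimize_scalar(neg_log_lik)
weighted_mean = np.sum(x / sigmas**2) / np.sum(1 / sigmas**2)
print(res.x, weighted_mean)                    # the two agree: the MLE is the precision-weighted mean
```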
2 Properties

A maximum-likelihood estimator is an extremum estimator obtained by maximizing, as a function of \theta, the objective function (cf. the loss function)

\hat\ell(\theta \mid x) = \frac{1}{n} \sum_{i=1}^n \ln f(x_i \mid \theta),

this being the sample analogue of the expected log-likelihood \ell(\theta) = \mathrm{E}[\ln f(x_i \mid \theta)], where this expectation is taken with respect to the true density f(\cdot \mid \theta_0).

Maximum-likelihood estimators have no optimum properties for finite samples, in the sense that (when evaluated on finite samples) other estimators may have greater concentration around the true parameter value.[1] However, like other estimation methods, maximum-likelihood estimation possesses a number of attractive limiting properties. As the sample size increases to infinity, sequences of maximum-likelihood estimators have these properties:

Consistency: the sequence of MLEs converges in probability to the value being estimated.

Asymptotic normality: as the sample size increases, the distribution of the MLE tends to the Gaussian distribution with mean \theta_0 and covariance matrix equal to the inverse of the Fisher information matrix.

Efficiency: the MLE achieves the Cramér–Rao lower bound when the sample size tends to infinity. This means that no consistent estimator has lower asymptotic mean squared error than the MLE (or than other estimators attaining this bound).

2.1 Consistency

Under the conditions outlined below, the maximum likelihood estimator is consistent. Consistency means that, given a sufficiently large number of observations n, it is possible to find the value of \theta_0 with arbitrary precision. In mathematical terms this means that as n goes to infinity the estimator converges in probability to its true value:

\hat\theta_{\mathrm{mle}} \xrightarrow{p} \theta_0.

Under slightly stronger conditions, the estimator converges almost surely:

\hat\theta_{\mathrm{mle}} \xrightarrow{a.s.} \theta_0.
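For an empirical feel for consistency, the sketch below fits the exponential rate by its closed-form MLE 1/x̄ at growing sample sizes; the true rate and the sample sizes are illustrative assumptions. The absolute error tends to shrink as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)
lam0 = 2.0                                    # assumed true rate of an exponential model
for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.exponential(scale=1 / lam0, size=n)
    lam_hat = 1 / x.mean()                    # closed-form MLE of the rate
    print(n, abs(lam_hat - lam0))             # error shrinks as n grows (consistency)
```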
To establish consistency, conditions such as the following are used.

Identification. The parameter must be identified, that is, distinct parameter values must correspond to distinct distributions within the model: there must be no \theta_1 \neq \theta_0 with

f(\cdot \mid \theta_1) = f(\cdot \mid \theta_0).

If this condition did not hold, there would be some value \theta_1 such that \theta_0 and \theta_1 generate an identical distribution of the observable data. Then we would not be able to distinguish between these two parameters even with an infinite amount of data; these parameters would have been observationally equivalent. The identification condition is absolutely necessary for the ML estimator to be consistent. When this condition holds, the limiting likelihood function \ell(\theta \mid \theta_0) has a unique global maximum at \theta_0.

Compactness. Compactness of the parameter space implies that the likelihood cannot approach the maximum value arbitrarily closely at some other point (as demonstrated, for example, in the picture on the right). Compactness is only a sufficient condition and not a necessary condition. It can be replaced by some other conditions, such as:

both concavity of the log-likelihood function and compactness of some (nonempty) upper level sets of the log-likelihood function, or

existence of a compact neighborhood N of \theta_0 such that outside of N the log-likelihood function is less than the maximum by at least some \varepsilon > 0.

Dominance. The dominance condition can be employed in the case of i.i.d. observations. In the non-i.i.d. case, the uniform convergence in probability can be checked by showing that the sequence \hat\ell(\theta \mid x) is stochastically equicontinuous.

2.2 Asymptotic normality

Maximum-likelihood estimators are, under the regularity conditions given below, asymptotically normal. This property can fail in several ways, some of which are described here.

Estimate on boundary. Sometimes the maximum likelihood estimate lies on the boundary of the set of possible parameters, or (if the boundary is not, strictly speaking, allowed) the likelihood gets larger and larger as the parameter approaches the boundary. Standard asymptotic theory needs the assumption that the true parameter value lies away from the boundary. If we have enough data, the maximum likelihood estimate will keep away from the boundary too. But with smaller samples, the estimate can lie on the boundary. In such cases, the asymptotic theory clearly does not give a practically useful approximation. Examples here would be variance-component models, where each component of variance \sigma^2 must satisfy the constraint \sigma^2 \geq 0.
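A small simulation of the boundary issue, under the simplest variance-component setup one can state (an assumption for illustration, not the article's own example): observations are modeled as N(0, σ_a² + 1) with σ_a² ≥ 0, so the MLE of σ_a² is the unconstrained estimate clipped at zero, and with small samples it frequently lands exactly on the boundary.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma_a2 = 0.2                                 # small true variance component
n, reps = 10, 10_000
hits = 0
for _ in range(reps):
    x = rng.normal(scale=np.sqrt(sigma_a2 + 1.0), size=n)
    # unconstrained MLE of the total variance is mean(x^2); the component
    # estimate is clipped at the boundary sigma_a^2 >= 0
    sigma_a2_hat = max(0.0, np.mean(x**2) - 1.0)
    hits += (sigma_a2_hat == 0.0)
print("fraction of samples with boundary estimate:", hits / reps)
```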
Nuisance parameters. For the asymptotic behaviour outlined above to hold, the number of nuisance parameters should not increase with the number of observations. A well-known example is where observations occur as pairs, where the observations in each pair have a different (unknown) mean but otherwise the observations are independent and normally distributed with a common variance. Here for 2N observations, there are N + 1 parameters. It is well known that the maximum likelihood estimate for the variance does not converge to the true value of the variance.

Increasing information. For the asymptotics to hold in cases where the assumption of independent identically distributed observations does not hold, a basic requirement is that the amount of information in the data increases indefinitely as the sample size increases. Such a requirement may not be met if either there is too much dependence in the data (for example, if new observations are essentially identical to existing observations), or if new independent observations are subject to an increasing observation error.

Some regularity conditions which ensure this behavior are:

1. The first and second derivatives of the log-likelihood function must be defined.

2. The Fisher information matrix must not be zero, and must be continuous as a function of the parameter.

3. The maximum likelihood estimator is consistent.

4. I = \mathrm{E}\big[\nabla_\theta \ln f(x \mid \theta_0)\, \nabla_\theta \ln f(x \mid \theta_0)'\big] exists and is nonsingular.

5. \mathrm{E}\big[\sup_{\theta \in N} \|\nabla_{\theta\theta} \ln f(x \mid \theta)\|\big] < \infty for some neighborhood N of \theta_0.

Then the maximum likelihood estimator has an asymptotically normal distribution:

\sqrt{n}\,\big(\hat\theta_{\mathrm{mle}} - \theta_0\big) \xrightarrow{d} \mathcal{N}\big(0,\, I^{-1}\big).

Proof, skipping the technicalities: since the log-likelihood is differentiable and \hat\theta lies in the interior of the parameter space, the first-order condition is satisfied at the maximum:

0 = \frac{\partial}{\partial\theta}\, \hat\ell(\hat\theta \mid x) = \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial\theta} \ln f(x_i \mid \hat\theta).

When the log-likelihood is twice differentiable, this expression can be expanded into a Taylor series around the point \theta = \theta_0:

0 = \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial\theta} \ln f(x_i \mid \theta_0) + \left[\frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta'} \ln f(x_i \mid \tilde\theta)\right] (\hat\theta - \theta_0),

where \tilde\theta is some point intermediate between \theta_0 and \hat\theta. From this expression we can derive that

\sqrt{n}\,(\hat\theta - \theta_0) = \left[-\frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta'} \ln f(x_i \mid \tilde\theta)\right]^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{\partial}{\partial\theta} \ln f(x_i \mid \theta_0).

Here the expression in square brackets converges in probability to H = \mathrm{E}\big[-\frac{\partial^2}{\partial\theta\,\partial\theta'} \ln f(x \mid \theta_0)\big] by the law of large numbers. The continuous mapping theorem ensures that the inverse of this expression also converges in probability, to H^{-1}. The second sum, by the central limit theorem, converges in distribution to a multivariate normal with mean zero and variance matrix equal to the Fisher information I. Thus, applying Slutsky's theorem to the whole expression, we obtain that

\sqrt{n}\,(\hat\theta - \theta_0) \xrightarrow{d} \mathcal{N}\big(0,\, H^{-1} I H^{-1}\big).

Finally, the information equality guarantees that when the model is correctly specified, H = I, so that the asymptotic variance simplifies to I^{-1}.

2.3 Functional invariance

The maximum likelihood estimator selects the parameter value which gives the observed data the largest possible probability (or probability density, in the continuous case). If the parameter consists of a number of components, then we define their separate maximum likelihood estimators as the corresponding components of the MLE of the complete parameter. Consistent with this, if \hat\theta is the MLE for \theta, and if g(\theta) is any transformation of \theta, then the MLE for \alpha = g(\theta) is by definition

\hat\alpha = g(\hat\theta\,).

It maximizes the so-called profile likelihood

\bar{L}(\alpha) = \sup_{\theta:\, \alpha = g(\theta)} L(\theta).

The MLE is also invariant with respect to certain transformations of the data. If Y = g(X) where g is one-to-one and does not depend on the parameters to be estimated, then the density functions satisfy f_Y(y) = f_X(x) / |g'(x)|, and hence the likelihood functions for X and Y differ only by a factor that does not depend on the model parameters.
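A quick sketch of functional invariance, assuming an exponential model (all specifics illustrative): the MLE of the rate λ is 1/x̄, and the MLE of the mean g(λ) = 1/λ is obtained by simply transforming the estimate.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=0.5, size=1_000)     # exponential data, assumed true rate lambda = 2

lam_hat = 1 / x.mean()                         # MLE of the rate parameter lambda
mean_hat = 1 / lam_hat                         # invariance: MLE of g(lambda) = 1/lambda
print(lam_hat, mean_hat, x.mean())             # mean_hat equals the sample mean exactly
```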
2.4 Higher-order properties

The standard asymptotics says that the maximum-likelihood estimator is \sqrt{n}-consistent and asymptotically efficient, meaning that it reaches the Cramér–Rao bound:

\sqrt{n}\,\big(\hat\theta_{\mathrm{mle}} - \theta_0\big) \xrightarrow{d} \mathcal{N}\big(0,\, I^{-1}\big),

where I is the Fisher information matrix:

I_{jk} = -\mathrm{E}\left[\frac{\partial^2 \ln f_{\theta_0}(X_t)}{\partial\theta_j\, \partial\theta_k}\right].

In particular, it means that the bias of the maximum-likelihood estimator is equal to zero up to the order n^{-1/2}. However, when we consider the higher-order terms in the expansion of the distribution of this estimator, it turns out that \hat\theta_{\mathrm{mle}} has bias of order n^{-1}. This bias is equal to (componentwise)[5]

b_s \equiv \mathrm{E}\big[(\hat\theta_{\mathrm{mle}} - \theta_0)_s\big] = \frac{1}{n}\, I^{si} I^{jk} \left(\tfrac{1}{2} K_{ijk} + J_{j,ik}\right),

where Einstein's summation convention over the repeated indices has been adopted, and I^{jk} denotes the (j,k)-th component of the inverse Fisher information matrix I^{-1}. Subtracting an estimate of this bias gives the corrected estimator

\hat\theta^{*}_{\mathrm{mle}} = \hat\theta_{\mathrm{mle}} - \hat{b}.

This estimator is unbiased up to the terms of order n^{-1}, and is called the bias-corrected maximum likelihood estimator. The bias-corrected estimator is second-order efficient (at least within the curved exponential family), meaning that it has minimal mean squared error among all second-order bias-corrected estimators, up to the terms of the order n^{-2}. It is possible to continue this process, that is, to derive the third-order bias-correction term, and so on. However, as was shown by Kano (1996), the maximum-likelihood estimator is not third-order efficient.

3 Examples

3.1 Discrete uniform distribution

Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random, so that the sample size is 1. If n is unknown, then the maximum-likelihood estimator \hat{n} of n is the number m on the drawn ticket: the likelihood is 0 for n < m and 1/n for n \geq m, which is greatest when n = m. The expected value of the number m on the drawn ticket, and therefore the expected value of \hat{n}, is (n + 1)/2. As a result, with a sample size of 1, the maximum likelihood estimator for n will systematically underestimate n by (n - 1)/2.

3.2 Discrete distribution, finite parameter space

Suppose one wishes to determine how biased an unfair coin is, calling the probability of tossing a HEAD p. Suppose the coin is tossed 80 times, yielding 49 HEADS and 31 TAILS, and suppose the coin came from a box of three coins with p = 1/3, p = 1/2, and p = 2/3. For the same observed count of 49 but different values of p (the probability of success), the likelihood function (defined below) takes one of three values:

\Pr(\mathrm{H} = 49 \mid p = 1/3) = \binom{80}{49} (1/3)^{49} (1 - 1/3)^{31} \approx 0.000,

\Pr(\mathrm{H} = 49 \mid p = 1/2) = \binom{80}{49} (1/2)^{49} (1 - 1/2)^{31} \approx 0.012,

\Pr(\mathrm{H} = 49 \mid p = 2/3) = \binom{80}{49} (2/3)^{49} (1 - 2/3)^{31} \approx 0.054.

The likelihood is maximized when p = 2/3, and so this is the maximum-likelihood estimate for p.
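These three candidate likelihoods can be checked directly; a short sketch using scipy's binomial pmf (the library choice is an assumption):

```python
from scipy.stats import binom

# evaluate the three candidate likelihoods from the coin example above
for p in (1/3, 1/2, 2/3):
    print(p, binom.pmf(49, n=80, p=p))   # approximately 0.000, 0.012, 0.054
```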
3.3 Discrete distribution, continuous parameter space

Now suppose that there was only one coin but its p could have been any value 0 \leq p \leq 1. The likelihood function to be maximised is

L(p) = f_D(\mathrm{H} = 49 \mid p) = \binom{80}{49}\, p^{49} (1 - p)^{31},

and the maximisation is over all possible values 0 \leq p \leq 1.

[Figure: likelihood function for the proportion parameter of a binomial process (n = 10), plotted for t = 1, t = 3, and t = 6 observed successes.]

One way to maximize this function is by differentiating with respect to p and setting to zero:

0 = \frac{\partial}{\partial p} \left[\binom{80}{49}\, p^{49} (1 - p)^{31}\right],

which has solutions p = 0, p = 1, and p = 49/80. The solution which maximizes the likelihood is clearly p = 49/80 (since p = 0 and p = 1 result in a likelihood of zero). Thus the maximum-likelihood estimator for p is 49/80.

This result is easily generalized by substituting a letter such as t in the place of 49 to represent the observed number of 'successes' of our Bernoulli trials, and a letter such as n in the place of 80 to represent the number of Bernoulli trials. Exactly the same calculation yields the maximum-likelihood estimator t/n for any sequence of n Bernoulli trials resulting in t 'successes'.
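The closed-form answer can be cross-checked numerically; a sketch using bounded scalar minimization of the negative likelihood (the optimizer choice is an assumption):

```python
from scipy.optimize import minimize_scalar
from scipy.stats import binom

# maximize the binomial likelihood L(p) from the coin example numerically
res = minimize_scalar(lambda p: -binom.pmf(49, n=80, p=p),
                      bounds=(0.0, 1.0), method="bounded")
print(res.x, 49 / 80)                          # both give p = 0.6125
```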
3.4 Continuous distribution, continuous parameter space

For the normal distribution \mathcal{N}(\mu, \sigma^2), which has probability density function

f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),

the corresponding probability density function for a sample of n independent identically distributed normal random variables (the likelihood) is

f(x_1, \ldots, x_n \mid \mu, \sigma^2) = \prod_{i=1}^n f(x_i \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left(-\frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2}\right),

or more conveniently

f(x_1, \ldots, x_n \mid \mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left(-\frac{\sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2}{2\sigma^2}\right),

where \bar{x} is the sample mean.

This family of distributions has two parameters, \theta = (\mu, \sigma), so we maximize the likelihood over both parameters simultaneously, or if possible, individually. Differentiating the log-likelihood with respect to \mu and equating to zero gives

0 = \frac{\partial}{\partial\mu} \log L(\mu, \sigma) = 0 - \frac{-2n(\bar{x} - \mu)}{2\sigma^2}.

This is solved by

\hat\mu = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i.

This is indeed the maximum of the function, since it is the only turning point in \mu and the second derivative is strictly less than zero. Its expectation value is equal to the parameter \mu of the given distribution,
\mathrm{E}[\hat\mu] = \mu,

which means that the maximum-likelihood estimator \hat\mu is unbiased.

Similarly we differentiate the log-likelihood with respect to \sigma and equate to zero:

0 = \frac{\partial}{\partial\sigma} \log\!\left[\left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\!\left(-\frac{\sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2}{2\sigma^2}\right)\right]
= \frac{\partial}{\partial\sigma} \left[\frac{n}{2} \log\!\left(\frac{1}{2\pi\sigma^2}\right) - \frac{\sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2}{2\sigma^2}\right]
= -\frac{n}{\sigma} + \frac{\sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2}{\sigma^3},

which is solved by

\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2.

Inserting the estimate \mu = \hat\mu we obtain

\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 = \frac{1}{n} \sum_{i=1}^n x_i^2 - \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n x_i x_j.

To calculate its expected value, it is convenient to rewrite the expression in terms of zero-mean random variables (the statistical errors) \delta_i \equiv x_i - \mu. Expressing the estimate in these variables yields

\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (\mu + \delta_i)^2 - \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n (\mu + \delta_i)(\mu + \delta_j).

Simplifying the expression above, utilizing the facts that \mathrm{E}[\delta_i] = 0 and \mathrm{E}[\delta_i^2] = \sigma^2, allows us to obtain

\mathrm{E}[\hat\sigma^2] = \frac{n - 1}{n} \sigma^2.

This means that the estimator \hat\sigma^2 is biased. However, \hat\sigma^2 is consistent.

Formally we say that the maximum-likelihood estimator for \theta = (\mu, \sigma^2) is

\hat\theta = \big(\hat\mu, \hat\sigma^2\big).

In this case the MLEs could be obtained individually. In general this may not be the case, and the MLEs would have to be obtained simultaneously.

The normal log-likelihood at its maximum takes a particularly simple form:

\log L(\hat\mu, \hat\sigma) = -\frac{n}{2} \big(\log(2\pi\hat\sigma^2) + 1\big).

This maximum log-likelihood can be shown to be the same for more general least squares, even for nonlinear least squares. This is often used in determining likelihood-based approximate confidence intervals and confidence regions, which are generally more accurate than those using the asymptotic normality discussed above.

4 Non-independent variables

It may be the case that variables are correlated, that is, not independent. Two random variables X and Y are independent only if their joint probability density function is the product of the individual probability density functions, i.e.

f(x, y) = f(x)\, f(y).

Suppose one constructs an order-n Gaussian vector out of random variables (x_1, \ldots, x_n), where each variable has mean given by (\mu_1, \ldots, \mu_n), and let the covariance matrix be denoted by \Sigma. The joint probability density function of these n random variables is then given by

f(x_1, \ldots, x_n) = \frac{1}{(2\pi)^{n/2} \sqrt{\det(\Sigma)}} \exp\!\left(-\frac{1}{2} [x_1 - \mu_1, \ldots, x_n - \mu_n]\, \Sigma^{-1}\, [x_1 - \mu_1, \ldots, x_n - \mu_n]^{\mathrm{T}}\right).

In the two-variable case, the joint probability density function is given by

f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}} \exp\!\left[-\frac{1}{2(1 - \rho^2)} \left(\frac{(x - \mu_x)^2}{\sigma_x^2} - \frac{2\rho(x - \mu_x)(y - \mu_y)}{\sigma_x\sigma_y} + \frac{(y - \mu_y)^2}{\sigma_y^2}\right)\right].

In this and other cases where a joint density function exists, the likelihood function is defined as above, in the section Principles, using this density.
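As an illustration of working with a joint density for correlated data, the sketch below evaluates the bivariate normal log-likelihood and compares it with a (misspecified) independence model on the same correlated data; all parameter values are assumptions made for the demonstration.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(6)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                 # correlated components (rho != 0)
x = rng.multivariate_normal(mu, Sigma, size=500)

# log-likelihood of the correlated model vs. a (wrong) independence model
ll_joint = multivariate_normal.logpdf(x, mean=mu, cov=Sigma).sum()
ll_indep = multivariate_normal.logpdf(x, mean=mu, cov=np.diag(np.diag(Sigma))).sum()
print(ll_joint, ll_indep)                      # the joint model fits the data better
```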
5 Iterative procedures

Consider problems where both states x_i and parameters such as \sigma^2 require to be estimated. Iterative procedures such as expectation-maximization (EM) algorithms may be used to solve such joint state-parameter estimation problems.
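A compact sketch of one such iterative procedure, an EM loop for a two-component Gaussian mixture; the initial values, true component parameters, and iteration count are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
# synthetic data from a two-component Gaussian mixture (parameters are assumptions)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 700)])

# initial guesses for weights, means, and standard deviations
w, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(200):
    # E-step: posterior responsibility of each component for each point
    dens = w * norm.pdf(x[:, None], loc=mu, scale=sd)      # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted MLE updates of the mixture parameters
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(w, mu, sd)   # approaches the mixing weights and component parameters
```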
6 Applications

Maximum-likelihood estimation is used for a wide range of statistical models, including:

communication systems;
psychometrics;
econometrics;
time-delay of arrival (TDOA) in acoustic or electromagnetic detection;
data modeling in nuclear and particle physics;
magnetic resonance imaging;[9][10]
computational phylogenetics;
origin/destination and path-choice modeling in transport networks;
geographical satellite-image classification;
power system state estimation.

7 History

Maximum-likelihood estimation was recommended, analyzed (with flawed attempts at proofs) and vastly popularized by R. A. Fisher between 1912 and 1922[11] (although it had been used earlier by Gauss, Laplace, T. N. Thiele, and F. Y. Edgeworth).[12] Reviews of the development of maximum likelihood have been provided by a number of authors.[13]

8 See also

Mean squared error, a measure of how 'good' an estimator of a distributional parameter is (be it the maximum likelihood estimator or some other estimator).

The Rao–Blackwell theorem, a result which yields a process for finding the best possible unbiased estimator (in the sense of having minimal mean squared error). The MLE is often a good starting place for the process.

Sufficient statistic, a function of the data through which the MLE (if it exists and is unique) will depend on the data.
9 References
10 Further reading
Aldrich, John (1997). "R. A. Fisher and the making of maximum likelihood 1912–1922". Statistical Science 12 (3): 162–176. doi:10.1214/ss/1030037906. MR 1617519.

Andersen, Erling B. (1970). "Asymptotic Properties of Conditional Maximum Likelihood Estimators". Journal of the Royal Statistical Society B 32: 283–301.

Andersen, Erling B. (1980). Discrete Statistical Models with Social Science Applications. North Holland.

Basu, Debabrata (1988). Statistical Information and Likelihood: A Collection of Critical Essays by Dr. D. Basu. In Ghosh, Jayanta K. (ed.), Lecture Notes in Statistics, Volume 45. Springer-Verlag.

Cox, David R.; Snell, E. Joyce (1968). "A general definition of residuals". Journal of the Royal Statistical Society, Series B: 248–275. JSTOR 2984505.

Edgeworth, Francis Y. (Sep 1908). "On the probable errors of frequency-constants". Journal of the Royal Statistical Society 71 (3): 499–512. doi:10.2307/2339293. JSTOR 2339293.

Edgeworth, Francis Y. (Dec 1908). "On the probable errors of frequency-constants". Journal of the Royal Statistical Society 71 (4): 651–678. doi:10.2307/2339378. JSTOR 2339378.

Einicke, G. A. (2012). Smoothing, Filtering and Prediction: Estimating the Past, Present and Future. Rijeka, Croatia: Intech. ISBN 978-953-307-752-9.

Ferguson, Thomas S. (1982). "An inconsistent maximum likelihood estimate". Journal of the American Statistical Association 77 (380): 831–834. doi:10.1080/01621459.1982.10477894. JSTOR 2287314.

Ferguson, Thomas S. (1996). A Course in Large Sample Theory. Chapman & Hall. ISBN 0-412-04371-8.

Hald, Anders (1998). A History of Mathematical Statistics from 1750 to 1930. New York, NY: Wiley. ISBN 0-471-17912-4.

Hald, Anders (1999). "On the history of maximum likelihood in relation to inverse probability and least squares". Statistical Science 14 (2): 214–222. doi:10.1214/ss/1009212248. JSTOR 2676741.

Kano, Yutaka (1996). "Third-order efficiency implies fourth-order efficiency". Journal of the Japan Statistical Society 26: 101–117. doi:10.14490/jjss1995.26.101.
11 External links