
Properties of Estimators

Suppose you were given a random sample of observations from a normal distribution, and you
wish to use the sample data to estimate the population mean. This is a simple example of an
estimation problem. The quantity you are seeking to estimate is a population parameter. A parameter
is a number, generally unknown, that describes some interesting characteristic of the population.
In a more general setting the generic notation for an unknown population parameter is $\theta$.
Suppose you decide to use the sample mean of the data, $\bar{X}$, to guess the true value of $\mu$. $\bar{X}$ is
an example of an estimator; the generic notation for an estimator is $\hat{\theta}$. More generally, an
estimator is a function of sample data used to guess an unknown population parameter. The
difficulty arises because there are many plausible ways to use the sample data to guess the same
unknown population parameter. Since the normal distribution is symmetric, the population mean
is the same as the population median. Therefore it makes sense to believe you could also guess
using the sample median. For that matter, because the normal is symmetric, you could equally
plausibly use the sample midrange, the average of the largest and smallest observations in the
data. Which of these alternatives is best, and what exactly does one mean by "best"?

Desirable Properties of Estimators: What do we mean by best?

There are three properties of estimators that are commonly used to judge their quality.
1) Unbiasedness. An unbiased estimator has no tendency to over- or underestimate the
truth. The mathematical statement of unbiasedness is that $E(\hat{\theta}) = \theta$. $E(\hat{\theta})$ is the
average guess, and unbiasedness means the average guess is correct. "Average over
what?" you might ask. The answer is the average over all the possible samples of
size $n$ that might be drawn to construct the estimate. If an estimator is biased, the
size of that bias is given by $\mathrm{bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$.
2) Consistency. A consistent estimator has the property that as the sample size goes to
infinity, the estimator homes in on the true parameter value. To state this property
more precisely, it is that $P(|\hat{\theta}_n - \theta| < \delta) \to 1$ as $n \to \infty$. In English, the probability that the guess,
$\hat{\theta}_n$, will be within an arbitrarily small smidgen, $\delta$, of the true $\theta$ approaches 1 as the
sample size grows to infinity. (A simulation sketch illustrating this appears just after
this list.) Consistency is a very important property for an
estimator to have, but Studenmund muddles it together with unbiasedness.
3) Efficiency. An efficient estimator is one that, on average, takes on values close to
the true $\theta$. There is more than one way to make this notion precise, but a very
common way is to say that an estimator is efficient if it has the smallest mean squared
error, where $MSE(\hat{\theta}) = E(\hat{\theta} - \theta)^2$. In English, this means the average squared mistakes
(averaged, again, over all the possible samples that might be drawn) are smaller using
this guessing technique than with the alternative guessing techniques.
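
To see what the consistency definition means in practice, here is a minimal simulation sketch. It is written in Python with NumPy, which is my own choice and an assumption on my part (the experiments later in this note were run in Minitab and Stata), and the population values mu = 10, sigma = 2 and the smidgen delta = 0.25 are likewise just illustrative. The sketch estimates $P(|\bar{X}_n - \mu| < \delta)$ for several sample sizes.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, delta = 10.0, 2.0, 0.25   # population mean, population sd, and the "smidgen" (illustrative values)
reps = 10_000                        # number of simulated samples per sample size

for n in (5, 50, 500):
    # draw `reps` samples of size n and compute each sample mean
    xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    # fraction of sample means that land within delta of the true mean
    prob = np.mean(np.abs(xbar - mu) < delta)
    print(f"n = {n:4d}   P(|Xbar - mu| < {delta}) is about {prob:.3f}")

With these settings the estimated probability should rise from roughly 0.2 at n = 5 to nearly 1 at n = 500, which is exactly what consistency of the sample mean asserts.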


There is a useful identity which allows us to decompose mean squared error into two
components.
$$
\begin{aligned}
MSE(\hat{\theta}) = E(\hat{\theta} - \theta)^2
  &= E\big[(\hat{\theta} - E(\hat{\theta})) + (E(\hat{\theta}) - \theta)\big]^2 \\
  &= E\big(\hat{\theta} - E(\hat{\theta})\big)^2 + \big(E(\hat{\theta}) - \theta\big)^2 \\
  &= \mathrm{var}(\hat{\theta}) + \mathrm{bias}^2(\hat{\theta})
\end{aligned}
$$
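
The identity is easy to check numerically. Below is a small sketch (same Python/NumPy assumption as elsewhere in these notes) that builds a deliberately biased estimator (the sample mean shrunk by an arbitrary factor of 0.8, chosen only to create a known bias) and confirms that its simulated MSE equals its simulated variance plus its squared bias.

import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 10.0, 2.0, 7, 100_000

# a deliberately biased estimator of mu: shrink the sample mean toward zero
theta_hat = 0.8 * rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

mse  = np.mean((theta_hat - mu) ** 2)
var  = np.var(theta_hat)            # spread of the guesses around their own average
bias = np.mean(theta_hat) - mu      # average guess minus the truth

print(f"MSE           = {mse:.4f}")
print(f"var + bias^2  = {var + bias**2:.4f}")   # agrees with the MSE apart from floating-point rounding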

Although smallest mean squared error is a sensible measure of efficiency, it is impossible to ever
have an estimator which is always superior in mean squared error. The reason is that it is
impossible to beat an inspired guess. One way to estimate $\theta$ is to ignore the data entirely and just
guess a number, such as $\hat{\theta} = 3$. An estimator like this can't be beaten if it so happens that $\theta = 3$,
and it does very well in a mean squared error sense whenever $\theta$ is close to 3. So if you want a
theory that can say "this is the best estimator," you have to put on some additional limitations.
The most popular one is to restrict the competition to unbiased estimators. Examining the
decomposition above, one can see that if $\mathrm{bias}(\hat{\theta}) = 0$, then the estimator with the smallest mean
squared error is the estimator with the smallest variance. Hence there are many theorems about
best estimators that tout an estimator as the minimum variance unbiased estimator.

Now let's examine some simple estimation examples using this theory. To keep things concrete
these issues will be explored using a technique known as Monte Carlo simulation. The idea is
very simple. We simulate some pretend data where we know the right answer. Then we try
guessing the unknown parameter using various rules, and we see which works best.

Suppose you want to estimate the center of the normal distribution. Should you use the sample
mean, or the sample median? Let's try an experiment to see which of these works best. I used
Minitab to simulate 1000 samples of size 7, and for each of these samples of size seven I guessed
the population mean using the sample mean, the sample median, and the sample midrange. Since
the data were created by sampling from a normal distribution with mean 10, a good guess is a
guess that is close to 10. Then I made histograms of the thousand guesses.
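
For readers who want to reproduce this experiment, here is a minimal sketch of the same Monte Carlo in Python with NumPy. This is my own choice of tool and an assumption on my part: the histograms below were produced in Minitab. The population standard deviation of 2 is also an assumption; the note does not state it, although it is consistent with the spreads reported below.

import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n, reps = 10.0, 2.0, 7, 1000   # sigma = 2 is assumed, not stated in the note

samples   = rng.normal(mu, sigma, size=(reps, n))
means     = samples.mean(axis=1)
medians   = np.median(samples, axis=1)
midranges = (samples.max(axis=1) + samples.min(axis=1)) / 2

for name, est in (("means", means), ("medians", medians), ("midranges", midranges)):
    print(f"{name:9s}  mean = {est.mean():6.3f}   StDev = {est.std(ddof=1):.4f}")

Re-running the same sketch with n = 200 reproduces the second experiment further down.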

Here are the results.
[Histograms of the 1000 means, medians, and midranges (normal samples, n = 7):
   Means:      mean 10.02, StDev 0.7275, N = 1000
   Medians:    mean 10.05, StDev 0.9063, N = 1000
   Midranges:  mean 10.00, StDev 0.9021, N = 1000]


Examining these we can see that when we guessed using the sample means the average guess
was 10.02, when we guessed with the sample medians it was 10.05, and when we guessed with
sample midranges it was 10.00. All of these are quite close to the true value of 10, and there are
theorems that prove that the small deviations we see in this experiment are just sampling error.
All these estimators are unbiased, as the experimental results suggest.

Note, however, that the means are more tightly distributed about the central value of 10. On
average, the errors you'd have made guessing with the means are smaller. This is evident if you
compare the standard deviations of the means, the medians, and the midranges. The means have a
standard deviation of 0.7275, which is considerably less than for the medians (0.9063) and the
midranges (0.9021). Therefore, in this application, the mean has a smaller variance and a smaller
mean squared error than either of the other two estimators. The midrange and median are roughly
comparable, although the midrange seems slightly superior to the median when the sample size
is 7.

Let's consider what happens if the sample size is increased from 7 to 200. The experiment was
repeated, with the following result.
[Histograms of the 1000 means, medians, and midranges (normal samples, n = 200):
   Means:      mean 10.00, StDev 0.1375, N = 1000
   Medians:    mean 10.01, StDev 0.1735, N = 1000
   Midranges:  mean 10.02, StDev 0.5766, N = 1000]


Again, each of these estimators produces an average guess in our experiment which is within a
smidgen of its theoretical value of 10: Means 10.00; Medians 10.01; Midranges 10.02. That
is, they seem to be unbiased. Notice that the spread of these different estimators fell as sample
size increased. The standard deviation of the means fell from 0.7275 to 0.1375; the standard
deviation of the medians fell from 0.9063 to 0.1735; the standard deviation of the midranges fell
from 0.9021 to 0.5766. With larger and larger samples the standard deviation of each of these
estimators would continue to shrink until in the limit, all the distributions would collapse to a
spike on the true value of 10, which is what it means for an estimator to be consistent. In a
sample of size 200, the mean is clearly superior to the median, and the midrange is by far the
worst of the three estimators. There is a theorem in statistics which proves that the sample mean
is the minimum variance unbiased estimator of $\mu$ when sampling from a normal population.

The sample mean isn't always the best estimator, however. Let's repeat the experiment, but
sampling from a Laplace distribution. The Laplace distribution, like the normal, is symmetric.
Unlike the normal distribution, however, the Laplace distribution has fat tails, so that outliers are
relatively more common. Here is how the distributions compare.

[Distribution plot: normal density (mean 10, StDev 2) overlaid with Laplace density (location 10, scale 2).]


As in our first example, we draw 1000 samples, each of size seven. The true value of the mean is
still ten. Here is the result.


[Histograms of the 1000 means, medians, and midranges (Laplace samples, n = 7):
   Means:      mean 10.01, StDev 1.105, N = 1000
   Medians:    mean 10.04, StDev 0.9582, N = 1000
   Midranges:  mean 9.997, StDev 1.882, N = 1000]

Looking at the histograms and the means, we can see all three estimators appear to be unbiased.
The average guess using sample means is 10.01; sample medians 10.04; sample midranges
9.997. All these are very close to ten, and the discrepancy can be explained as sampling error.
All these estimators are unbiased. The standard deviation of the medians, however, is the
smallest -- 0.9582, better than the means (1.105), with the midrange taking a distant third place
(1.882). The reason the medians now outperform the means is that the Laplace generates
more outliers. The median is insensitive to outliers, but the mean is more affected, and the
midrange is affected even more severely. The result of this experiment suggests a result which
theory is capable of proving to be true: when sampling from a Laplace, the sample median is a
better unbiased estimator of $\mu$ than the sample mean. The sample mean is still the best linear
unbiased estimator (a.k.a. BLUE) of $\mu$, beating alternatives such as unequally weighted averages
of the observations. However, being the best linear unbiased estimator in this example is a
dubious honor, because there is a simple nonlinear unbiased estimator that can beat it.
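
Here is a corresponding sketch for the Laplace experiment (again Python/NumPy by assumption; the figures above were made in Minitab). NumPy's Laplace generator is parameterized by location and scale, matching the location 10, scale 2 shown in the distribution plot above.

import numpy as np

rng = np.random.default_rng(7)
loc, scale, n, reps = 10.0, 2.0, 7, 1000

samples   = rng.laplace(loc, scale, size=(reps, n))
means     = samples.mean(axis=1)
medians   = np.median(samples, axis=1)
midranges = (samples.max(axis=1) + samples.min(axis=1)) / 2

for name, est in (("means", means), ("medians", medians), ("midranges", midranges)):
    # for Laplace data the medians should show the smallest spread
    print(f"{name:9s}  mean = {est.mean():6.3f}   StDev = {est.std(ddof=1):.4f}")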

Sometimes the mean isn't even BLUE. In particular, suppose that the original $X_1, X_2, \ldots, X_n$ are
a simple random sample from a Cauchy distribution. The Cauchy distribution is a special case of
the t-distribution: it is a t-distribution with one degree of freedom. Here is a picture comparing
the Cauchy with the normal density.

[Distribution plot, "Comparison of the Normal and the Cauchy": standard normal density (mean 0, StDev 1) overlaid with standard Cauchy density (location 0, scale 1).]


As you can see, the Cauchy distribution has a shape that resembles the normal, but with much
fatter tails. Both the standard Cauchy and the standard normal (shown here) are symmetric and
centered at zero, but the Cauchy does not have a mean of zero because the Cauchy distribution
doesn't have a mean. How is this possible? Well, remember that for a continuous random
variable the mean is defined as $\mu = \int x f(x)\,dx$, where the integral is taken over the values the
random variable can take on, which in the case of the Cauchy is $(-\infty, +\infty)$. The Cauchy has such
fat tails that $\int_0^{+\infty} x f(x)\,dx = +\infty$ and $\int_{-\infty}^{0} x f(x)\,dx = -\infty$. According to the Lebesgue definition of the
integral (which is the correct one for this branch of statistics), $\mu = \infty - \infty$, which is undefined. On
account of this the Cauchy has other bizarre properties. If you take a random sample of size $n$
and compute $\bar{X}$, $\bar{X}$ has the same distribution as any of the individual observations. In other
words, if you try to discover the center of the distribution by using $\bar{X}$, increasing the sample size
will do you no good at all. Your estimator will have the same distribution whether you use one
observation or a million observations. Suppose we call the center of the Cauchy bell curve $\theta$.
This means $\bar{X}$ is not a consistent estimator of $\theta$. You can't even say $\bar{X}$ is an unbiased estimator
of $\theta$, because the definition of unbiasedness is $\mathrm{bias} = E(\hat{\theta}) - \theta = 0$, and for $\bar{X}$ the bias isn't even
defined. There is a proof, however, that the minimum variance unbiased estimator of $\theta$ is the
sample median. Strictly speaking, Studenmund's statement of the Gauss-Markov Theorem on the
bottom of page 106 is incorrect, because one needs to rule out error terms that are very fat tailed
in the regression context. For Gauss-Markov to be true, the error terms have to have distributions
for which the population mean and variance exist. In the case of Cauchy error terms neither the
mean nor the variance exists.
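
The failure of the sample mean for Cauchy data is easy to see by simulation. The sketch below (Python/NumPy, again an assumption about tooling) compares the spread of sample means and sample medians computed from standard Cauchy samples as the sample size grows. Because the variance of the means does not exist, the spread is measured with the interquartile range instead of the standard deviation.

import numpy as np

rng = np.random.default_rng(3)
reps = 2000

def iqr(x):
    # interquartile range: a spread measure that exists even when the variance does not
    q75, q25 = np.percentile(x, [75, 25])
    return q75 - q25

for n in (10, 100, 1000):
    samples = rng.standard_cauchy(size=(reps, n))   # location 0, scale 1
    means   = samples.mean(axis=1)
    medians = np.median(samples, axis=1)
    print(f"n = {n:5d}   IQR of means = {iqr(means):6.3f}   IQR of medians = {iqr(medians):.3f}")

The interquartile range of the means should stay roughly constant at about 2 no matter how large n gets, while the interquartile range of the medians shrinks toward zero.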

Unbiasedness isn't all that.

Now suppose we want to estimate the population variance of a normally distributed population.
The usual estimator is

$$ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\big(X_i - \bar{X}\big)^2 , $$

but students encountering this formula for the first
time are always flummoxed by the $n-1$ divisor. Since a variance by definition is the average
size of a squared deviation from the mean, it seems more natural to estimate a variance by
dividing by $n$ -- that is, using

$$ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\big(X_i - \bar{X}\big)^2 . $$

How do $s^2$ and $\hat{\sigma}^2$ compare according to the
criteria introduced here? If $n$ is large, the two estimators are almost the same, so to keep the
difference appreciable, I have simulated 5000 samples, each with $n = 3$, and computed 5000
sample estimates using both $s^2$ and $\hat{\sigma}^2$. The sample is taken from a normal distribution with
$\sigma^2 = 4$, so we want our estimators to give answers close to four. Here are histograms of the
actual results.

[Two density histograms: observed values of s_squared (values from 0 to about 40) and observed values of sigma_hat_sq (values from 0 to about 25).]

Since these distributions aren't symmetric, it is hard to eyeball the average value. However, here
are the summary statistics.

. summarize s_squared sigma_hat_sq

    Variable |    Obs        Mean    Std. Dev.        Min        Max
-------------+-------------------------------------------------------
   s_squared |   5000    3.970821    3.932645       .0002    34.6301
sigma_hat_sq |   5000    2.647214    2.621763    .0001333   23.08673

Recall that the true value of $\sigma^2$ is four, and note that the average guess when one uses $s^2$ is 3.97;
when one uses $\hat{\sigma}^2$ the average guess is 2.65. This suggests (correctly) that $s^2$ is unbiased but that
$\hat{\sigma}^2$ is downwardly biased, so that it tends to underestimate the true value of $\sigma^2$. Both of these
estimators are consistent, however. It is easy to see that if one of them homes in on $\sigma^2$ as the
sample size goes to infinity, the other must also, because the two differ by a factor of $(n-1)/n$,
and $(n-1)/n \to 1$ as $n \to \infty$. If one looks at the summary table and the histograms one can see a weakness
in $s^2$; namely, that while it is correct on average, it sometimes overestimates the true value by a
large margin and is considerably more variable than $\hat{\sigma}^2$ -- it has a standard deviation of 3.93
compared to just 2.62 for $\hat{\sigma}^2$. Which estimator has the smaller mean squared error?
$$ MSE(\hat{\theta}) = E(\hat{\theta} - \theta)^2 = \mathrm{var}(\hat{\theta}) + \mathrm{bias}^2(\hat{\theta}) $$

Applying this to the estimator with the $n-1$ divisor, we can approximate the MSE by using the
results of our simulation:

$$ MSE(s^2) \approx (3.932)^2 + (3.971 - 4)^2 = 15.46 . $$

Applying this to the estimator with the $n$ divisor, we can approximate its MSE:

$$ MSE(\hat{\sigma}^2) \approx (2.621)^2 + (2.647 - 4)^2 = 8.70 . $$

For efficiency, measured by MSE, we actually did better dividing by n than dividing by n-1. As
a matter of fact, there is a theorem that says using a divisor of n+1 actually gives the best MSE
when the draws come from a normal distribution. This is a good example of a case where our
criteria diverge: using a divisor of n-1 gives an unbiased estimate, but using a divisor of n (or
n+1) gives a more efficient estimate.
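
Here is a sketch of this variance experiment (Python/NumPy by assumption; the output above came from Stata). It draws 5000 samples of size 3 from a normal population with variance 4 (the population mean is irrelevant for variance estimation, so it is set to 0) and compares the divisors n-1, n, and n+1.

import numpy as np

rng = np.random.default_rng(11)
sigma2, n, reps = 4.0, 3, 5000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
# sum of squared deviations from each sample's own mean
ss = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

for divisor, label in ((n - 1, "n-1"), (n, "n"), (n + 1, "n+1")):
    est  = ss / divisor
    bias = est.mean() - sigma2
    mse  = np.mean((est - sigma2) ** 2)
    print(f"divisor {label:3s}: mean = {est.mean():.3f}   bias = {bias:+.3f}   MSE = {mse:.2f}")

The n-1 divisor should come out essentially unbiased, while the n and n+1 divisors trade a little downward bias for a noticeably smaller MSE.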

Actually, insisting on an unbiased estimator is often of limited value. In this example, it might
really be the standard deviation that is of interest, not the variance. Recall that $\sigma = \sqrt{\sigma^2}$ and
$s = \sqrt{s^2}$. Usually, if $Y = f(X)$, then $E(Y) \neq f(E(X))$ unless the function $f$ is linear. In this very
example, $\sigma = 2$, but the average value of $s$ is 1.769.

. summarize stdevs

    Variable |    Obs        Mean    Std. Dev.     Min        Max
-------------+----------------------------------------------------
      stdevs |   5000     1.76923    .9169594    .0157    5.88473


In other words, even though $s^2$ is an unbiased estimator of $\sigma^2$, $s$ is not an unbiased estimator of
$\sigma$. However, $s$ is a consistent estimator of $\sigma$.
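
A last quick sketch (Python/NumPy by assumption) illustrates both halves of that claim: for n = 3 the average value of s falls well short of sigma = 2, and the shortfall shrinks as n grows, which is what it means for s to be biased but consistent.

import numpy as np

rng = np.random.default_rng(13)
sigma, reps = 2.0, 5000

for n in (3, 30, 300):
    samples = rng.normal(0.0, sigma, size=(reps, n))
    s = samples.std(axis=1, ddof=1)          # sample standard deviation with the n-1 divisor
    print(f"n = {n:4d}   average s = {s.mean():.3f}   (true sigma = {sigma})")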
