

ETF3600/5600 Quantitative Models for Business Research
Lecture 3:
Maximum Likelihood Estimation
&
Linear Probability Model
References:
Long Chapter 2 (Section 2.6)
Powers 2.2.2
18 March, 2013
Outline
Review of Lecture 2
Maximum Likelihood Estimation
Intuition & Examples
MLE Normal Linear Model
MLE General Case
Linear Probability Model
Wrap-up
Estimation Methods
How to estimate the parameters $\beta_0$ and $\beta_1$?
An estimate of an unknown parameter in a model is a guess
based on the observed data.
Let $(x_i, y_i)$ be a random sample of size $n$ from the population
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \qquad (1)$$
To estimate the relationship between $y$ and the $x$'s we need to
estimate the $\beta$'s.
Various frameworks are used to estimate the $\beta$'s:
Least Squares Estimation
Maximum Likelihood Estimation
Method of Moments
Bayesian Methods
Method of Least Squares
Summary
Different samples of size $T$ will produce different values of the
$\hat\beta$'s. The sampling properties of $\hat\beta$ are:
1. $E(\hat\beta) = \beta$
2. $\mathrm{Var}(\hat\beta) = \sigma^2/T$
3. $\hat\beta \sim N(\beta, \sigma^2/T)$
4. BLUE: if $\tilde\beta$ is any other unbiased linear estimator of $\beta$, then
$\mathrm{var}(\hat\beta) \le \mathrm{var}(\tilde\beta)$.
Problems with LS Method
We have to make very specific assumptions about $\varepsilon_i$ in order
to get OLS estimators with desirable properties.
We want the errors to be random, so we assume:
$E(\varepsilon_i) = 0$
$E(\varepsilon_i^2) = \sigma^2 = \mathrm{var}(\varepsilon_i)$
$E(\varepsilon_i \varepsilon_j) = 0 = \mathrm{cov}(\varepsilon_i, \varepsilon_j)$ for $i \ne j$
If these assumptions are violated then LS estimators are not
necessarily BLUE.
Goodness of fit: $R^2$ (= ESS/TSS) measures how well the
estimated model fits the underlying data.
Problems with LS Method
Serially dependent errors, i.e., $E(\varepsilon_i \varepsilon_j) \ne 0$: Autocorrelation.
Error terms from different (usually adjacent) periods (time
periods or cross-section observations) are correlated.
Example: Rain forecast. The probability of tomorrow being
rainy is greater if today is rainy than if today is dry.
Error variance is not constant, $E(\varepsilon_i^2) = \mathrm{var}(\varepsilon_i) = \sigma_i^2$ varies across observations:
Heteroskedasticity.
In Greek, "hetero" means differing and "skedastic" refers to variance.
Example: Income and food expenditure. As income increases,
the variability of food expenditure will increase.
The function is non-linear.
The problem of outliers and extreme values.
The missing variable problem.
Relation to LS estimation
MLE is another form of estimation procedure.
LSE seeks the parameter values that provide the most
accurate description of the data, measured in terms of how
closely the model fits the data under the squared loss function,
i.e., minimizing the sum of squared errors between the
observations and the predictions.
MLE seeks the parameter values that are most likely to have
produced the data.
LSE estimates differ from MLE when the data are not normally
distributed.
Maximum Likelihood Estimation
Intuition
Consider two possible outcomes, 1 and 0, where the probability
of obtaining 1 is $\pi$ and the probability of 0 is $(1 - \pi)$.
Ex. 1 Take a random sample of values of size $n$. Suppose $n = 5$ and
that the sample is $(y_1 = 1, y_2 = 1, y_3 = 1, y_4 = 1, y_5 = 1)$.
What is the most likely value of $\pi$ to have generated this
sample?
Ex. 2 Suppose the probability of getting a head upon flipping a particular
coin is $\pi$. We flip the coin independently 10 times and get the
following sample: HHTHHHTTHH.
The probability of obtaining this sequence is a function of $\pi$.
What is the most likely value of $\pi$ to have generated this
sample?
The intuition behind these questions is the intuition behind
MLE: what is the most likely value of the parameter
to have generated the observed sample?
MLE Example
But as we have already collected the data, the sample is fixed.
The parameter $\pi$ also has a fixed value, but that value is
currently unknown.
We need to work out the value of $\pi$ most likely to have generated
this sample, i.e., the probability of the observed data is a function of $\pi$.
The probability of this sample being generated is:
$$\Pr(\text{data}\mid\text{parameter}) = \Pr(\text{HHTHHHTTHH}\mid\pi)
= \pi\cdot\pi\cdot(1-\pi)\cdot\pi\cdot\pi\cdot\pi\cdot(1-\pi)\cdot(1-\pi)\cdot\pi\cdot\pi$$
Suppose $\pi = 0.1$; then the probability of obtaining our sample in a
random experiment would be:
$0.1 \times 0.1 \times (1-0.1) \times 0.1 \times 0.1 \times 0.1 \times (1-0.1) \times (1-0.1) \times 0.1 \times 0.1 = 0.0000000729$
Suppose $\pi = 0.2$; then the probability of obtaining our sample in a
random experiment would be:
$0.2 \times 0.2 \times (1-0.2) \times 0.2 \times 0.2 \times 0.2 \times (1-0.2) \times (1-0.2) \times 0.2 \times 0.2 = 0.00000655$
Thus it is more likely, given this sample, that $\pi$ is 0.2 than 0.1.
MLE
Example
For different values of $\pi$:

Value of $\pi$   Prob. of sample
0.0              0
0.1              0.0000000729
0.2              0.00000655
0.3              0.0000750
0.4              0.000354
0.5              0.000977
0.6              0.00179
0.7              0.00222
0.8              0.00168
0.9              0.000478
1.0              0.0
Plot these on a graph. The maximum likelihood estimate of $\pi$
is the value of $\pi$ with the highest probability.
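A minimal sketch of this grid evaluation (Python; not part of the original slides):

```python
import numpy as np

# Likelihood of the observed sequence (7 heads, 3 tails) as a function of pi
pi = np.linspace(0, 1, 101)          # candidate values of pi
likelihood = pi**7 * (1 - pi)**3     # Pr(HHTHHHTTHH | pi)

best = pi[np.argmax(likelihood)]
print(best)                          # 0.7, the sample proportion of heads
```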
Likelihood Function
Example 2
Figure 1. Likelihood of observing 7 heads and 3 tails in a particular sequence,
for different values of the probability of observing a head, $\pi$.
ML Estimation
Likelihood Function for example 2
The likelihood function:
$$L(\text{parameter}\mid\text{data}) = L(\pi\mid\text{HHTHHHTTHH}) = \pi^{7}(1-\pi)^{3} \qquad (2)$$
The probability function and the likelihood function are given
by the same equation, but the probability function is a
function of the data with the value of the parameter fixed,
while the likelihood function is a function of the parameter
with the data fixed.
Although each value of $L(\pi\mid\text{data})$ is a notional probability, the
function $L(\pi\mid\text{data})$ is not a probability density function: it
does not enclose an area of 1.
ML Estimation
Likelihood Function for example 2
The probability of obtaining the sample data in hand
(HHTHHHTTHH) is small regardless of the value of $\pi$.
The value of $\pi$ that is most supported by the data is the one
for which the likelihood is the largest.
This value is the maximum likelihood estimate (MLE), denoted
by $\hat\pi$; here $\hat\pi = 0.7$, which is the sample proportion of heads, 7/10.
One way to see this is that for $n$ independent flips of the
coin, the probability of producing a particular sequence that includes $x$ heads
and $(n-x)$ tails is:
$$L(\pi\mid\text{data}) = \Pr(\text{data}\mid\pi) = \pi^{x}(1-\pi)^{(n-x)}$$
Multiplying the probabilities for all observations gives the
joint density of the sample: the probability that this
sample would arise in a random experiment.
We want the value of $\pi$ that maximizes $L(\pi\mid\text{data}) \equiv L(\pi)$.
ML Estimation
Example 2
It is simpler to maximize the log of the likelihood:
$$\log L(\pi) = x\log\pi + (n-x)\log(1-\pi) \qquad (3)$$
Differentiating $\log L(\pi)$ with respect to $\pi$ gives:
$$\frac{d\log L(\pi)}{d\pi} = \frac{x}{\pi} + (n-x)\frac{1}{1-\pi}(-1) \qquad (4)$$
$$\frac{d\log L(\pi)}{d\pi} = \frac{x}{\pi} - \frac{n-x}{1-\pi} \qquad (5)$$
Setting the derivative to 0 and solving produces the MLE, which in this
case is the sample proportion:
$$\hat\pi = \frac{x}{n} = \frac{X}{n}$$
which is the maximum likelihood estimator.
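A small symbolic check of this derivation (Python with SymPy; not part of the original notes):

```python
import sympy as sp

# Symbols: pi is the head probability, x the number of heads, n the number of flips
pi, x, n = sp.symbols('pi x n', positive=True)

logL = x * sp.log(pi) + (n - x) * sp.log(1 - pi)   # equation (3)
score = sp.diff(logL, pi)                          # equations (4)-(5)

# Solving the likelihood equation score = 0 gives the sample proportion x/n
print(sp.solve(sp.Eq(score, 0), pi))               # [x/n]
```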
MLE of Normal Linear Model
We want to estimate the $\beta$'s in:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \qquad (6)$$
$\varepsilon_i$ is assumed to be identically and independently distributed
(i.i.d.); its distribution is the Normal distribution, $N(0, \sigma^2)$.
$$\varepsilon_i = y_i - \beta_0 - \beta_1 x_i \qquad (7)$$
which is a normally distributed variable.
We need to find the $\beta$'s that are most likely to have generated
this sample.
The probability density function of each of the normally
distributed errors is:
$$f(\varepsilon) = \left(\frac{1}{2\pi\sigma^2}\right)^{0.5}\exp\left(-\frac{\varepsilon^2}{2\sigma^2}\right) \qquad (8)$$
Maximum Likelihood Estimation
Likelihood function when normally distributed y
Substituting $\varepsilon = y - \beta_0 - \beta_1 x$:
$$f(y - \beta_0 - \beta_1 x) = \left(\frac{1}{2\pi\sigma^2}\right)^{0.5}\exp\left(-\frac{(y - \beta_0 - \beta_1 x)^2}{2\sigma^2}\right) \qquad (9)$$
Transform this density function into a likelihood.
The likelihood is the joint p.d.f. of the sampled data.
Since the $\varepsilon$'s are assumed to be independent of each other, the
joint p.d.f. is the product of the individual densities.
Multiplying the individual density functions for each observation
together gives the likelihood function of the sample:
$$L = \prod_{t=1}^{n}\left(\frac{1}{2\pi\sigma^2}\right)^{0.5}\exp\left(-\frac{1}{2\sigma^2}(y_t - \beta_0 - \beta_1 x_t)^2\right) \qquad (10)$$
Maximum Likelihood Estimation
Log likelihood when normally distributed y
The likelihood function is a function of the parameters (the $\beta$'s and
$\sigma$) and of all the data on the independent and dependent
variables (the $y$'s and $x$'s), i.e., it is the formula for the joint p.d.f.
of the sample.
This likelihood function can be used to find the set of parameter values
that, given the data, maximise the value of the likelihood function.
Simplify this function by taking logs (this transforms the
products into sums).
Then maximise by equating the first-order derivative of $\log L$ with
respect to each parameter to zero.
Maximum Likelihood Estimation
Normally distributed y
1. Log likelihood:
$$\ln L = \ln\left[(2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{t=1}^{n}(y_t-\beta_0-\beta_1 x_t)^2\right)\right]
= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^{n}(y_t-\beta_0-\beta_1 x_t)^2$$
$$\ln L = -\frac{n}{2}\ln(2\pi) - n\ln\sigma - \frac{\sum_{t=1}^{n}(y_t-\beta_0-\beta_1 x_t)^2}{2\sigma^2} \qquad (11)$$
2. Score function: take the first derivative of the log likelihood with respect
to each parameter, $\partial\ln L/\partial\theta$. This derivative is called the score function.
3. Solve the equation $\partial\ln L/\partial\theta = 0$ to obtain the maximum
likelihood estimate. This equation is called the likelihood equation.
Maximum Likelihood Estimation
Normally distributed y
For $\beta_k$:
$$\frac{\partial\log L}{\partial\beta_k} = \frac{\partial}{\partial\beta_k}\left(-\frac{1}{2\sigma^2}\sum_{t=1}^{n}(y_t-\beta_0-\beta_1 x_t)^2\right) \qquad (12)$$
$$\frac{\partial\log L}{\partial\beta_k} = \frac{1}{\sigma^2}\sum_{t=1}^{n}x_{kt}(y_t-\beta_0-\beta_1 x_t) \qquad (13)$$
Setting this to zero gives the same first-order condition as OLS; hence, if
$y_t$ is normally distributed, MLE = OLS.
If $y_t$ is not normally distributed then there is no simple closed-form
solution.
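A minimal numerical check (Python with NumPy/SciPy; simulated data, not from the lecture) that maximising this normal log-likelihood reproduces the OLS estimates:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)   # true beta0 = 1, beta1 = 2

def neg_log_lik(theta):
    b0, b1, log_sigma = theta
    sigma = np.exp(log_sigma)            # keep sigma positive
    resid = y - b0 - b1 * x
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

mle = minimize(neg_log_lik, x0=np.array([0.0, 0.0, 0.0])).x
ols = np.polyfit(x, y, 1)                # [slope, intercept]

print(mle[:2])        # MLE of (beta0, beta1)
print(ols[::-1])      # OLS (intercept, slope): essentially identical
```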
Maximum Likelihood Estimation
Variance of MLE: Information matrix
The second-order derivative of the log-likelihood function
gives the asymptotic sampling variance of the MLE:
$$\mathrm{Var}(L'(\theta)) = E\{L'(\theta)L'(\theta)^{T}\} = -E\left(\frac{\partial^2 L(\theta)}{\partial\theta_i\,\partial\theta_j}\right) = I(\theta) \qquad (14)$$
The right-hand side is called the expected Fisher information matrix.
The distribution of the parameter estimates is asymptotically normal, with
variance-covariance matrix given by:
$$\mathrm{Var}(\hat\theta) = I(\theta)^{-1} = \left[-E\left(\frac{\partial^2 L(\theta)}{\partial\theta_i\,\partial\theta_j}\right)\right]^{-1} \qquad (15)$$
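For the normal linear model the inverse information for the $\beta$'s takes the familiar closed form $\sigma^2(X'X)^{-1}$. A small numerical sketch (Python; simulated data, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Design matrix with a constant and x
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n                      # ML estimate of sigma^2

# Inverse Fisher information for beta: sigma^2 (X'X)^{-1}
var_beta = sigma2_hat * np.linalg.inv(X.T @ X)
print(np.sqrt(np.diag(var_beta)))                   # asymptotic standard errors
```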
Maximum Likelihood Estimation
General Approach
1. Define the density function.
2. Express the joint probability of the data.
3. Convert the joint probability into a likelihood.
4. Simplify the log-likelihood expression.
5. Differentiate the log likelihood with respect to the parameters.
6. Solve for the unknown parameters, or write a program that
uses numerical analysis to produce maximum likelihood
estimates for the unknowns:
Use successive approximation: start with a starting value of $\beta_k$
and calculate $\log L$.
Adjust the values of $\beta_k$ and recalculate $\log L$.
Continue doing this to get new values of $\beta_k$.
Choose the values of $\beta_k$ that give the highest $\log L$.
7. There are a number of algorithms to find the ML estimates
numerically, e.g., the Newton-Raphson method, quadratic hill
climbing, the Berndt-Hall-Hall-Hausman (BHHH) method; see the
sketch after this list.
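A minimal Newton-Raphson sketch for the coin example (Python; not from the original notes), iterating the update $\pi \leftarrow \pi - \ell'(\pi)/\ell''(\pi)$:

```python
# Newton-Raphson for the Bernoulli log-likelihood: x heads in n flips
x, n = 7, 10
pi = 0.5                                   # starting value

for _ in range(20):
    score = x / pi - (n - x) / (1 - pi)            # first derivative of log L
    hessian = -x / pi**2 - (n - x) / (1 - pi)**2   # second derivative of log L
    step = score / hessian
    pi = pi - step
    if abs(step) < 1e-10:                          # stop once the update is tiny
        break

print(pi)   # converges to 0.7 = x/n
```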
MLE Properties
MLE has many optimal properties in estimation:
Unbiasedness: MLEs are asymptotically unbiased, although
they may be biased in finite samples.
Sufficiency: complete information about the parameter of
interest is contained in its MLE. That is, if there is a
sufficient statistic for a parameter, then the maximum
likelihood estimator of the parameter is a function of a
sufficient statistic.
Consistency: the true parameter value that generated the data is
recovered asymptotically.
Efficiency: the lowest possible variance of the parameter estimates is
achieved asymptotically.
ML estimators are asymptotically normally distributed.
MLE Properties
Implementation problems:
Starting values are essential for most numerical algorithms.
The choice of algorithm seems esoteric, but it can be important for
some problems.
Scaling of variables seems trivial, but it matters in numerical
estimation.
Step sizes of the algorithm need to be small (so as not to skip over
maxima and minima).
A flat likelihood may imply there is no clearly identifiable unique
solution and may be a real consequence of collinearity. There may
be no single maximum!
Linear Probability Model
Introduction
Consider a dependent variable of a qualitative nature, coded
as a dummy variable $Y_i \in \{0, 1\}$.
Examples:
driving to work versus public transport
employed versus unemployed
being single versus married.
We will analyse two models:
Linear probability model
Non-linear models i.e., logit, probit.
Linear Probability Model
The Model
The linear probability model is a linear regression model applied to a
binary dependent variable:
$$Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \varepsilon_i, \qquad Y_i \in \{0, 1\} \qquad (16)$$
The multiple linear regression model with a binary dependent
variable is called the linear probability model (LPM) because
the response probability is linear in the $\beta$'s.
The zero conditional mean assumption, $E(\varepsilon\mid x) = 0$, gives:
$$\Pr(Y_i = 1) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} \qquad (17)$$
This is called the response probability.
Linear Probability Model
Since $Y$ can take only two values, $\beta_j$ cannot be interpreted as the
change in $Y$ given a one-unit increase in $x_j$.
The $\beta$'s are the expected change in the response probability for a
unit increase in $x_j$:
$$\frac{\partial\Pr(Y_i = 1)}{\partial x_j} = \beta_j \qquad (18)$$
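A minimal sketch (Python with NumPy; simulated data, an assumption rather than anything from the lecture) of fitting an LPM by OLS on a binary outcome:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
p = np.clip(0.5 + 0.2 * x, 0, 1)          # true response probability, kept in [0, 1]
y = rng.binomial(1, p)                     # binary outcome

# The LPM is just OLS with the binary y on the left-hand side
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)    # slope approximates the change in Pr(y=1) per unit of x (~0.2)
```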
Linear Probability Model
Example 1: Decision to drive and commuting time
Consider the decision to drive and the mode of transport to be
related as:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \qquad (19)$$
where $x_i$ = (commuting time by bus - commuting time by car).
We expect individuals to drive more as $x$ increases.
Linear Probability Model
Example 1: Decision to drive and commuting time
Dependent Variable: Y
Method: Least Squares
Date: 03/08/07   Time: 11:43
Sample: 1 21
Included observations: 21

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          0.484795      0.071449     6.785151      0.0000
X          0.007031      0.001286     5.466635      0.0000

R-squared            0.611326    Mean dependent var      0.476190
Adjusted R-squared   0.590869    S.D. dependent var      0.511766
S.E. of regression   0.327343    Akaike info criterion   0.694776
Sum squared resid    2.035914    Schwarz criterion       0.794254
Log likelihood       -5.295144   F-statistic             29.88410
Durbin-Watson stat   1.978844    Prob(F-statistic)       0.000028

The slope (0.007) means that for every one-minute increase in the commuting
time by bus relative to car, the probability of driving increases by 0.007.
Linear Probability Model
Example 2: Womens labour force participation (Long Page 37)
Variable   Description
LFP        Whether the woman works [1 if in paid labour force; 0 otherwise]
k6         Number of children younger than 6
k618       Number of children aged 6 to 18
age        Age in years
wc         1 if the wife attended college; 0 otherwise
hc         1 if the husband attended college; 0 otherwise
lwg        log(woman's wage rate)
Income     Family income excluding the woman's wages
Linear Probability Model
Example 2: Womens labour force participation (WLFP) (Long Page 37)
Long estimates a linear model in which the variable "whether the
woman works" is a linear function of all the other variables:

Variable      $\beta$   $\beta^{S}_{x}$   t-value   P(>|t|)
Constant       1.14     -                  9.00     0.00
Child(k6)     -0.29     -0.154            -8.21     0.00
Child(k618)   -0.01     -0.015            -0.80     0.42
Age           -0.01     -0.103            -5.02     0.00
wc             0.16     -                  3.57     0.00
hc             0.02     -                  0.45     0.66
lwg            0.12      0.072             4.07     0.00
Income        -0.01     -0.079            -4.30     0.00

$\beta$ is the coefficient; $\beta^{S}_{x} = \sigma_x\beta$ is the x-standardised coefficient.
Linear Probability Model
Example 2: Womens labour force participation (WLFP) (Long Page 37)
WLFP is lower for families with a larger number of preschool
children (under 6 years old): each additional young child reduces the
probability of WLFP by about 0.30 (in 1975), holding all other
variables constant.
For a one standard deviation increase in family income, the predicted
probability of being employed decreases by 0.08, holding other
variables constant.
If the wife attended college, the predicted probability of being in the
labour force increases by 0.16, holding all other variables constant.
Problems with LPM
Don't use it. Never!
Heteroskedasticity: the variance of the errors depends on the x's and is
not constant, so the LPM is heteroskedastic, making OLS inefficient
and the usual standard errors biased.
$\mathrm{Var}(y\mid x) = f(x)$: a binary variable with $\Pr(y = 1) = \pi$
has mean $E(y\mid x) = x\beta = \pi$ and variance $\pi(1 - \pi)$, which varies with $x$.
Normality: the errors are not normally distributed.
Since $y$ can be either 0 or 1, the error can take only two values:
$\varepsilon_1 = 1 - E(y\mid x)$ or $\varepsilon_0 = 0 - E(y\mid x)$.
Nonsensical predictions: predicted $y$'s can be $< 0$ or $> 1$.
In the WLFP example, a 35-year-old woman who did not attend
college, has four children, and whose husband did not attend college
has a predicted probability of being employed of -0.48.
An unreasonable prediction!
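A small illustration (Python with NumPy; simulated data, not the WLFP data) of how LPM fitted values can fall outside [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=n)
y = (2 * x + rng.normal(size=n) > 0).astype(float)   # binary outcome

# Fit the LPM by OLS and look at the fitted "probabilities"
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
fitted = X @ beta_hat

print((fitted < 0).sum(), (fitted > 1).sum())   # both counts are typically nonzero
```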
Problems with LPM
Functional form: the model is linear, i.e., a unit increase in $x_k$ results
in a constant change of $\beta_k$ in the probability of the event,
holding all other variables constant. E.g., in the LPM each
additional young child decreases the probability of
employment by the same amount.
The WLFP sample had 753 married white women, 428 of whom were
in the labour force: $\Pr(\text{emp}) = p_e = 428/753 = 0.568$;
$\Pr(\text{not in labour force}) = 1 - p_e = 0.432$.
Odds of employment:
$$\Omega = \frac{p_e}{1 - p_e}$$

                      Number of young children
In Labour Force       0      1      2      3
No                    231    72     19     3
Yes                   375    46     7      0
Odds of employment    1.6    0.64   0.37   0.0

$\Pr(\text{emp with 2 children}) = 7/(7+19) = 0.27$; odds $= 0.27/(1-0.27) = 0.37$.
The odds of being employed are negatively related to having children
(BUT the effect is not strictly linear).
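The odds row can be recomputed from the counts in the table; a small sketch (Python, using the counts above):

```python
import numpy as np

children = [0, 1, 2, 3]
not_lf = np.array([231, 72, 19, 3])     # not in labour force
in_lf = np.array([375, 46, 7, 0])       # in labour force

p_emp = in_lf / (in_lf + not_lf)        # probability of being in the labour force
odds = p_emp / (1 - p_emp)              # odds of employment

for k, o in zip(children, odds):
    print(k, round(o, 2))               # 1.62, 0.64, 0.37, 0.0
```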
Wrap-up
Maximum Likelihood Estimation
The method of maximum likelihood provides estimators that
have both a reasonable intuitive basis and many desirable
statistical properties.
The method is very broadly applicable and simple to apply.
Once a maximum likelihood estimate has been obtained, the method
also provides standard errors, statistical tests and other results useful
for statistical inference.
A disadvantage of the method is that it frequently requires
strong assumptions about the structure of the data.
Wrap-up
LPM and Limitations
The multiple linear regression model with a binary dependent
variable is called the linear probability model (LPM).
It has many limitations:
The interpretation of the parameters remains unaffected by having a
binary outcome.
The effect of a variable is the same regardless of the values of the
other variables.
The effect of a unit change in a variable is the same regardless of
the current value of that variable.
Summary
The limitations of the LPM can be overcome by using more
sophisticated response models:
$$\Pr(Y = 1) = G(\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}) \qquad (20)$$
where $G$ is a function taking values strictly between zero and
one: $0 < G(z) < 1$ for any real $z$.
Two common functional forms are:
Logit Model
Probit Model
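A brief sketch (Python with SciPy; a generic illustration, not from the slides) of the two link functions, both of which map any real index into (0, 1):

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(-4, 4, 9)

logit_G = 1 / (1 + np.exp(-z))   # logistic CDF, used by the logit model
probit_G = norm.cdf(z)           # standard normal CDF, used by the probit model

print(np.round(logit_G, 3))
print(np.round(probit_G, 3))     # every value lies strictly between 0 and 1
```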