1.1 Definitions
• Idempotent matrix: $AA = A^2 = A$
Matrix Addition:
• $A + B = [a_{ij} + b_{ij}]$
• $(A + B) + C = A + (B + C)$
• $(A + B)' = A' + B'$
• $\operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B)$
Matrix Multiplication
• $AB \neq BA$ in general:
$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} a_{11}b_{11}+a_{12}b_{21} & a_{11}b_{12}+a_{12}b_{22} \\ a_{21}b_{11}+a_{22}b_{21} & a_{21}b_{12}+a_{22}b_{22} \end{bmatrix}$
while $BA$ interchanges the roles of $A$ and $B$ and is in general different.
• $(AB)C = A(BC)$
• $A(B + C) = AB + AC$
• $(AB)' = B'A'$
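These algebraic rules can be spot-checked numerically. A minimal sketch in NumPy (the matrices are illustrative, not from the notes):

```python
import numpy as np

# Generic 2x2 matrices showing AB != BA and the transpose rule (AB)' = B'A'.
A = np.array([[1.0, 3.0], [0.0, 2.0]])
B = np.array([[2.0, 1.0], [1.0, 0.0]])

noncommutative = not np.allclose(A @ B, B @ A)      # AB != BA in general
transpose_rule = np.allclose((A @ B).T, B.T @ A.T)  # (AB)' = B'A'
```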
$y = x\beta + z\gamma + u$
Hence, with $M_z$ the annihilator matrix of $z$ (so $M_z z = 0$), we have
$M_z y = M_z x\beta + M_z z\gamma + M_z u = M_z x\beta + M_z u$
Vector
• Orthogonal vectors: Two nonzero vectors $a$ and $b$ are orthogonal, written $a \perp b$, iff $a'b = b'a = 0$.
$\hat b = (X'X)^{-1}X'y$
$\quad = (X'X)^{-1}X'Xb + (X'X)^{-1}X'u$
$\quad = b + (X'X)^{-1}X'u$
Hence we have
$X'\hat u = X'\left(u - X(\hat b - b)\right)$
$\quad = X'u - X'X(\hat b - b)$
$\quad = X'u - X'X(X'X)^{-1}X'u$
$\quad = 0$
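The orthogonality result $X'\hat u = 0$ is easy to verify numerically. A minimal sketch with simulated data (the DGP below is an assumption for illustration):

```python
import numpy as np

# Simulate y = Xb + u, fit OLS, and check that X'u_hat = 0 up to rounding.
rng = np.random.default_rng(0)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
b = np.array([1.0, 2.0, -0.5])
y = X @ b + rng.normal(size=n)

b_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y
u_hat = y - X @ b_hat
orthogonal = np.allclose(X.T @ u_hat, 0.0)  # X'u_hat = 0
```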
Matrix Inverse
• Partitioned inverse: $\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1} = ?$ (see p.966, A-74)
Kronecker Products. Let $A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}$; then
$A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B \\ a_{21}B & a_{22}B \end{bmatrix}$
• If $A$ is $m \times n$ and $B$ is $p \times q$, then $A \otimes B$ is $(mp) \times (nq)$.
• A ⊗ (B + C) = A ⊗ B + A ⊗ C
• (A + B) ⊗ C = A ⊗ C + B ⊗ C
• (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C)
• $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$
• $(A \otimes B)' = A' \otimes B'$
• $\operatorname{tr}(A \otimes B) = \operatorname{tr}(A)\operatorname{tr}(B)$
• $(A \otimes B)(C \otimes D) = AC \otimes BD$
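The Kronecker identities above can be spot-checked with small, well-conditioned fixed matrices (chosen purely for illustration):

```python
import numpy as np

# Verify shape, mixed-product, inverse, transpose, and trace rules for kron.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
B = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 2.0]])
C = np.array([[0.0, 1.0], [2.0, 1.0]])
D = np.array([[1.0, 0.0, 1.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]])

shape_rule = np.kron(A, B).shape == (6, 6)   # (m p) x (n q)
mixed_product = np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))
inverse_rule = np.allclose(np.linalg.inv(np.kron(A, B)),
                           np.kron(np.linalg.inv(A), np.linalg.inv(B)))
transpose_rule = np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))
trace_rule = np.isclose(np.trace(np.kron(A, B)), np.trace(A) * np.trace(B))
```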
$Ac = \lambda c$
$(A - \lambda I)c = 0$
$|A - \lambda I| = 0$ for a nonzero $c$
Example: Find the eigenvalues of $A$:
$A = \begin{bmatrix} 1 & 3 \\ 0 & 2 \end{bmatrix}$
$|A - \lambda I| = \begin{vmatrix} 1-\lambda & 3 \\ 0 & 2-\lambda \end{vmatrix} = 0$
Solutions:
$\lambda = 1,\ 2$
Eigenvectors: The characteristic vectors of a symmetric matrix are orthogonal. That is,
$C'C = I$
or
$AC = C\Lambda$
$C'AC = C'C\Lambda = \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$
Alternatively we have
$A = C\Lambda C'$
Some Facts: Prove the following:
1. $\operatorname{tr}(A) = \operatorname{tr}(\Lambda)$
2. $|A| = |\Lambda|$
3. $AA = A^2 = C\Lambda^2 C'$
4. $A^{-1} = C\Lambda^{-1}C'$
$A^{1/2} = C\Lambda^{1/2}C'$
then
$P = \Lambda^{-1/2}C'$
LU Decomposition
$A = LU$
Spectral (Eigen) Decomposition
$A = C\Lambda C'$
Schur Decomposition
$A = USU'$
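The spectral-decomposition facts can be checked numerically for a small symmetric matrix (an illustrative choice):

```python
import numpy as np

# Check A = C L C', C'C = I, A^{-1} = C L^{-1} C', and A^2 = C L^2 C'.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, C = np.linalg.eigh(A)          # columns of C: orthonormal eigenvectors
L = np.diag(lam)

decomposition = np.allclose(A, C @ L @ C.T)
orthonormal = np.allclose(C.T @ C, np.eye(2))
inverse_fact = np.allclose(np.linalg.inv(A), C @ np.diag(1.0 / lam) @ C.T)
square_fact = np.allclose(A @ A, C @ np.diag(lam ** 2) @ C.T)
```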
Quadratic forms: Let $A$ be a symmetric matrix. If all eigenvalues of $A$ are positive (negative), then $A$ is a positive (negative) definite matrix. If $A$ has both negative and positive eigenvalues, then $A$ is indefinite.
$\frac{\partial(Ax)}{\partial x'} = A, \qquad \frac{\partial(Ax)}{\partial x} = A'$
$\frac{\partial(x'Ax)}{\partial x} = 2Ax$
$\frac{\partial(x'Ax)}{\partial A} = xx'$
1.4 Sample Questions:
Part I: Calculation. $A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}$, $B = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$
$\arg\min_{b}\ u'u$
2 Probability and Distribution Theory
$f(x) = \operatorname{Prob}(X = x)$
1. $0 \le \operatorname{Prob}(X = x) \le 1$
2. $\sum_x f(x) = 1$ (discrete case), $\int_{-\infty}^{\infty} f(x)\,dx = 1$ (continuous case)
For the cdf $F(x)$:
1. $0 \le F(x) \le 1$
2. $F(+\infty) = 1$
3. $F(-\infty) = 0$
Functional expectation: Let $g(X)$ be a function of $X$. Then
$E[g(X)] = \begin{cases} \sum_x g(x)f(x) & \text{discrete} \\ \int g(x)f(x)\,dx & \text{continuous} \end{cases}$
Variance
$\operatorname{Var}(X) = E(X - \mu)^2 = \begin{cases} \sum_x (x-\mu)^2 f(x) & \text{discrete} \\ \int (x-\mu)^2 f(x)\,dx & \text{continuous} \end{cases}$
Note that
$\operatorname{Var}(X) = E(X - \mu)^2 = E\left(X^2 - 2\mu X + \mu^2\right) = E\left(X^2\right) - \mu^2$
Skewness: $E(X - \mu)^3$
Kurtosis: $E(X - \mu)^4$
and, for a symmetric distribution,
$E(X - \mu)^3 = 0$
Properties:
$x \sim N(0, 1)$
$f(x \mid 0, 1) = \phi(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)$
[Figure: the standard normal density $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$]
Chi-squared distribution:
$f(x; n) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\,x^{n/2-1}\exp\left(-\frac{x}{2}\right)\cdot 1[x \ge 0]$
2. If $x_1, \ldots, x_n$ are independent $\chi_1^2$ variables, then
$\sum_{i=1}^{n} x_i \sim \chi_n^2$
$x_1 + x_2 \sim \chi_{n_1+n_2}^2$ for independent $x_1 \sim \chi_{n_1}^2$, $x_2 \sim \chi_{n_2}^2$
$F = \frac{\chi_{n_1}^2/n_1}{\chi_{n_2}^2/n_2} \sim F(n_1, n_2)$
$t_n = \frac{z}{\sqrt{\chi_n^2/n}}$ where $z \sim N(0, 1)$
8. $t_n \to N(0, 1)$ as $n \to \infty$
9. If $t \sim t_n$, then $t^2 \sim F(1, n)$
10. Noncentral $\chi^2$ distribution: If $x \sim N(\mu, \sigma^2)$, then $(x/\sigma)^2$ has a noncentral $\chi_1^2$ distribution.
11. If $x$ and $y$ have a joint normal distribution, then $w = (x, y)'$ has
$w \sim N(0, \Sigma)$, where $\Sigma = \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix}$
12. If $w \sim N(\mu, \Sigma)$ where $w$ has $n$ elements, then $w'\Sigma^{-1}w$ has a noncentral $\chi_n^2$. The noncentrality parameter is $\mu'\Sigma^{-1}\mu/2$.
Lognormal distribution
$f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left[\frac{\ln x - \mu}{\sigma}\right]^2\right)$
Note that
$E(x) = \exp\left(\mu + \frac{\sigma^2}{2}\right)$
$\operatorname{Var}(x) = e^{2\mu+\sigma^2}\left(e^{\sigma^2} - 1\right)$
Hence
$\mu = \ln E(x) - \frac{1}{2}\ln\left(1 + \frac{\operatorname{Var}(x)}{E(x)^2}\right)$
$\sigma^2 = \ln\left(1 + \frac{\operatorname{Var}(x)}{E(x)^2}\right)$
Mode: $e^{\mu-\sigma^2}$
Median: $e^{\mu}$
[Figure: the lognormal density $\frac{1}{x\sqrt{2\pi}}\exp\left(-\frac{1}{2}(\ln x - 1)^2\right)$]
Gamma Distribution
$f(x) = \frac{\lambda^P}{\Gamma(P)}\exp(-\lambda x)\,x^{P-1} \quad \text{for } x \ge 0,\ \lambda > 0,\ P > 0$
where
$\Gamma(P) = \int_0^{\infty} t^{P-1}e^{-t}\,dt$
Note that if $P$ is a positive integer, then
$\Gamma(P) = (P - 1)!$
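The integer-argument identity $\Gamma(P) = (P-1)!$ can be checked with the standard library's Gamma implementation:

```python
import math

# math.gamma implements the Gamma function; compare against factorials.
checks = [math.isclose(math.gamma(p), math.factorial(p - 1)) for p in range(1, 8)]
all_match = all(checks)
```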
Beta distribution: For a variable constrained between $0$ and $c > 0$, the beta distribution has density
$f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\left(\frac{x}{c}\right)^{\alpha-1}\left(1 - \frac{x}{c}\right)^{\beta-1}\frac{1}{c}$
Usually $x$'s range becomes $x \in (0, 1)$, that is, $c = 1$.
1. symmetric if $\alpha = \beta$
2. $\alpha = 1$, $\beta = 1$: becomes $U(0, 1)$
3. $\alpha < 1$, $\beta < 1$: becomes U-shaped
4. $\alpha = 1$, $\beta > 2$: strictly convex
5. $\alpha = 1$, $\beta = 2$: straight line
6. $\alpha = 1$, $1 < \beta < 2$: strictly concave
7. Mean: $\frac{c\alpha}{\alpha+\beta}$; $\frac{\alpha}{\alpha+\beta}$ for $c = 1$
8. Variance: $\frac{c^2\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}$; $\frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}$ for $c = 1$
Figure 1:
Logistic Distribution
$f(x) = \frac{\lambda e^{-\lambda(x-\mu)}}{\left(1 + e^{-\lambda(x-\mu)}\right)^2}, \qquad \lambda > 0$
$F(x) = \frac{1}{1 + e^{-\lambda(x-\mu)}}$
When $\lambda = 1$ and $\mu = 0$, we have
$f(x) = \frac{e^{-x}}{(1 + e^{-x})^2}, \qquad F(x) = \frac{1}{1 + e^{-x}}$
[Figures: the standard logistic density $\frac{e^{-x}}{(1+e^{-x})^2}$ and the cdfs $\frac{1}{1+e^{-x}}$ and $\frac{1}{1+e^{-0.1x}}$]
Exponential distribution
$f(x) = \lambda e^{-\lambda x}$
Weibull Distribution
$f(x) = \begin{cases} \frac{\alpha}{\beta}\left(\frac{x}{\beta}\right)^{\alpha-1}e^{-(x/\beta)^{\alpha}} & \text{for } x \ge 0 \\ 0 & \text{for } x < 0 \end{cases}$
Cauchy Distribution
$f(x) = \frac{1}{\pi}\left[\frac{\gamma}{(x - x_0)^2 + \gamma^2}\right]$
where $x_0$ is the location parameter and $\gamma$ is the scale parameter. The standard Cauchy distribution is the case where $x_0 = 0$ and $\gamma = 1$:
$f(x) = \frac{1}{\pi(1 + x^2)}$
[Figure: the standard Cauchy density $\frac{1}{\pi(1+x^2)}$]
Note that the $t$ distribution with $n = 1$ becomes a standard Cauchy distribution. Also note that the mean and variance of the Cauchy distribution do not exist.
Survival Function
$S(t) = 1 - F(t) = \operatorname{Prob}[T \ge t]$
For the exponential distribution, this implies that the hazard rate is constant with respect to time. However, for the Weibull distribution or the lognormal distribution, the hazard function is no longer constant.
Note that the mgf is an alternative characterization of a probability distribution; hence there is a one-to-one relationship between the pdf and the mgf. However, the mgf sometimes does not exist. For example, the mgf for the Cauchy distribution cannot be defined.
The characteristic function is
$\varphi(t) = E\left(e^{itx}\right), \qquad i^2 = -1$
For example, the cf for the Cauchy distribution is $\exp\left(ix_0t - \gamma|t|\right)$.
Cumulants: The cumulants of a random variable are defined by the cumulant generating function, which is the logarithm of the mgf:
$\kappa(t) = \log\left[E\left(e^{tx}\right)\right]$
$\kappa_1 = \mu = \kappa'(0)$
$\kappa_2 = \sigma^2 = \kappa''(0)$
$\kappa_r = \kappa^{(r)}(0)$
2.4 Joint Distributions
The bivariate normal density is
$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\exp\left(-\frac{1}{2}\,\frac{z_x^2 + z_y^2 - 2\rho z_xz_y}{1-\rho^2}\right)$
where
$z_x = \frac{x - \mu_x}{\sigma_x}, \qquad z_y = \frac{y - \mu_y}{\sigma_y}, \qquad \rho = \frac{\sigma_{xy}}{\sigma_x\sigma_y}$
Suppose that $\sigma_x = \sigma_y = 1$, $\mu_x = \mu_y = 0$, and $\rho = 0.5$. Then we have
$f(x, y) = \frac{1}{2\pi\sqrt{1 - 0.5^2}}\exp\left(-\frac{1}{2}\left(x^2 + y^2 - xy\right)/\left(1 - 0.5^2\right)\right)$
[Figure: the bivariate normal density with $\rho = 0.5$]
We denote
$\begin{pmatrix} x \\ y \end{pmatrix} \sim N\left(\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix}\right)$
Marginal distribution: It is defined as
$\operatorname{Prob}(x = x_0) = \sum_{y_0}\operatorname{Prob}(x = x_0 \mid y = y_0)\operatorname{Prob}(y = y_0)$, or
$f_x(x) = \begin{cases} \sum_y f(x, y) & \text{discrete} \\ \int f(x, y)\,dy & \text{continuous} \end{cases}$
Note that
$f(x, y) = f_x(x)f_y(y)$ iff $x$ and $y$ are independent; alternatively,
$\operatorname{Prob}(x \le a,\ y \le b) = \operatorname{Prob}(x \le a)\operatorname{Prob}(y \le b)$
2.5 Conditioning in a bivariate distribution
$f(y \mid x) = \frac{f(x, y)}{f_x(x)}$
For a bivariate normal distribution, the conditional distribution is given by
$f(y \mid x) = N\left(\alpha + \beta x,\ \sigma_y^2\left(1 - \rho^2\right)\right)$
where $\alpha = \mu_y - \beta\mu_x$, $\beta = \sigma_{xy}/\sigma_x^2$.
If $\sigma_{xy} = 0$, then $x$ and $y$ are independent.
Regression: The Conditional Mean. The conditional mean is the mean of the conditional distribution, which is defined as
$E(y \mid x) = \begin{cases} \sum_y y f(y \mid x) & \text{discrete} \\ \int y f(y \mid x)\,dy & \text{continuous} \end{cases}$
$y = E(y \mid x) + (y - E(y \mid x)) = E(y \mid x) + \varepsilon$
Example:
$y = \alpha + \beta x + \varepsilon$
Then
$E(y \mid x) = \alpha + \beta x$
Conditional Variance
$\operatorname{Var}(y \mid x) = E\left[(y - E(y \mid x))^2 \mid x\right] = E\left(y^2 \mid x\right) - E(y \mid x)^2$
2.6 The Multivariate Normal Distribution
Let $x = (x_1, \ldots, x_n)'$ have a multivariate normal distribution. Then we have
$f(x) = (2\pi)^{-n/2}|\Sigma|^{-1/2}\exp\left(-\frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\right)$
1. If $x \sim N(\mu, \Sigma)$, then
$Ax + b \sim N\left(A\mu + b,\ A\Sigma A'\right)$
2. If $x \sim N(\mu, \Sigma)$, then
$(x - \mu)'\Sigma^{-1}(x - \mu) \sim \chi_n^2$
3. If $x \sim N(\mu, \Sigma)$, then
$\Sigma^{-1/2}(x - \mu) \sim N(0, I_n)$
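Fact 1 can be illustrated by Monte Carlo: simulated draws of $Ax + b$ should have mean $A\mu + b$ and covariance $A\Sigma A'$. All numbers below are illustrative:

```python
import numpy as np

# Simulate x ~ N(mu, Sigma) and check the mean/covariance of y = Ax + b.
rng = np.random.default_rng(2)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
A = np.array([[1.0, 1.0], [0.0, 2.0]])
b = np.array([0.5, 0.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + b

mean_ok = np.allclose(y.mean(axis=0), A @ mu + b, atol=0.03)
cov_ok = np.allclose(np.cov(y, rowvar=False), A @ Sigma @ A.T, atol=0.08)
```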
2.7 Sample Questions
Q1: Write down the definitions of skewness and kurtosis. What is the value of skewness for a symmetric distribution?
Q2: Let $x_i \sim N(0, \sigma^2)$ for $i = 1, \ldots, n$. Further assume that the $x_i$ are independent of each other. Then
1. $x_i^2 \sim$
2. $\sum_{i=1}^{n} x_i^2 \sim$
3. $\sum_{i=1}^{n} x_i^2/\sigma^2 \sim$
4. $\frac{\chi_1^2}{\chi_2^2} \sim$
5. $\frac{\chi_1^2 + \chi_2^2}{\chi_3^2} \sim$
6. $\frac{x_1}{\sqrt{\chi_2^2}} \sim$
7. $\frac{x_1}{\sqrt{\chi_2^2 + \chi_3^2}} \sim$
1. $\ln(x) \sim$
Q6: Write down the density function of the logistic distribution.
Q7: Write down the density function of the Cauchy distribution. Write down the value of $n$ at which the $t$ distribution becomes the Cauchy distribution.
Q8: Write down the definitions of the moment generating function and the characteristic function.
Q9: Suppose that $x = (x_1, x_2, x_3)'$ and $x \sim N(\mu, \Sigma)$.
2. $y = Ax + b \sim$
3. $z = \Sigma^{-1}(x - c) \sim$ where $c \neq \mu$
3 Estimation and Inference
3.1 Definitions
1. Random variable and constant: A random variable is believed to change over time and across individuals. A constant is believed not to change in either dimension. This becomes an issue in panel data.
$x_{it} = \mu_i + \varepsilon_{it}$
Here we decompose $x_{it}$ ($i = 1, \ldots, N$; $t = 1, \ldots, T$) into its mean (time invariant) and time varying components. Now, is $\mu_i$ random or constant? According to the definition of random variables, $\mu_i$ can be a constant since it does not change over time. However, if $\mu_i$ has a pdf, then it becomes a random variable.
$x = (x_1, x_2, x_3) = (1, 2, 3)$
Now we are asking if each number is a random variable or a constant. If they are random, then we have to ask for the pdf of each number. Suppose that
$x_i \sim N\left(0, \sigma_i^2\right)$
Now we have to know whether each $x_i$ is an independent event. If they are independent, then next we have to know whether $\sigma_i^2$ is identical or not. The typical assumption is IID.
3. Mean:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
5. Covariance:
$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
6. Correlation:
$r_{xy} = \frac{s_{xy}}{s_x s_y}$
$y_i = \mu + \varepsilon_i, \qquad \varepsilon_i \sim N\left(0, \sigma^2\right)$
1. Estimate: a statistic computed from a sample ($\hat\theta$). The sample mean is the statistic for the population mean ($\mu$).
2. Standard deviation and error: $\sigma$ is the standard deviation of the population; the standard error is the standard deviation of an estimator (e.g., $\sigma/\sqrt{n}$ for the sample mean).
4. Estimator: a rule or method for using the data to estimate the parameter. "The OLS estimator is consistent" should read "the estimation method using OLS estimates the parameter consistently."
2. Efficient: An unbiased estimator $\hat\theta_1$ is more efficient than another unbiased estimator $\hat\theta_2$ if the sampling variance of $\hat\theta_1$ is less than that of $\hat\theta_2$:
$\operatorname{Var}\left(\hat\theta_1\right) < \operatorname{Var}\left(\hat\theta_2\right)$
3. Mean Squared Error:
$\operatorname{MSE}\left(\hat\theta\right) = E\left[\left(\hat\theta - \theta\right)^2\right]$
$= E\left[\left(\hat\theta - E\hat\theta + E\hat\theta - \theta\right)^2\right]$
$= \operatorname{Var}\left(\hat\theta\right) + \left[E\hat\theta - \theta\right]^2$
Likelihood:
$f(u_1, \ldots, u_n \mid \theta) = f(u_1 \mid \theta)f(u_2 \mid \theta)\cdots f(u_n \mid \theta) = \prod_{i=1}^{n} f(u_i \mid \theta) = L(\theta \mid u)$
The function $L(\theta \mid u)$ is called the likelihood function for $\theta$ given the data $u$.
Also, the quantity $I(\theta)$ is the information number for the sample.
We denote
$x_n \to_p c$ or $\operatorname{plim}_{n\to\infty} x_n = c$
Carefully look at the subscript '$n$'. This means $x_n$ depends on the sample size $n$. For example, the sample mean $\bar{x}_n = n^{-1}\sum_{i=1}^{n} x_i$ is a function of $n$, and we denote
$\bar{x}_n \to_p \mu$
where $\bar{x}_n = n^{-1}\sum_{i=1}^{n} x_i$.
7. Kolmogorov's Strong LLN: If $\{x_i\}$ is a sequence of independently distributed random variables such that $E(x_i) = \mu_i < \infty$ and $\operatorname{Var}(x_i) = \sigma_i^2 < \infty$ with $\sum_{i=1}^{\infty}\sigma_i^2/i^2 < \infty$, then as $n \to \infty$
$\bar{x}_n - \bar\mu_n \to_{a.s.} 0$
Properties of plim:
$\operatorname{plim}(x_n + y_n) = \operatorname{plim} x_n + \operatorname{plim} y_n$
$\operatorname{plim}(x_ny_n) = \operatorname{plim} x_n \cdot \operatorname{plim} y_n$
$\operatorname{plim}(x_n/y_n) = \operatorname{plim} x_n/\operatorname{plim} y_n$ if $\operatorname{plim} y_n \neq 0$
$\operatorname{plim} W_n^{-1} = \Omega^{-1}$ if $\operatorname{plim} W_n = \Omega$
$\operatorname{plim} X_nY_n = BC$ if $\operatorname{plim} X_n = B$ and $\operatorname{plim} Y_n = C$
Convergence in Distribution
$x_n \to_d x$
Cramér-Wold device:
$c'x_n \to_d c'x$ for every $c \in \mathbb{R}^k$
3. Lindeberg-Levy CLT (central limit theorem): If $x_1, \ldots, x_n$ are a random sample from a probability distribution with finite mean $\mu$ and finite variance $\sigma^2$, then the sample mean $\bar{x}_n = n^{-1}\sum_{i=1}^{n} x_i$ has the following limiting distribution:
$\sqrt{n}\left(\bar{x}_n - \mu\right) \to_d N\left(0, \sigma^2\right)$
4. Lindeberg-Feller CLT: Suppose that $x_1, \ldots, x_n$ are independent draws from probability distributions with finite means $\mu_i$ and finite variances $\sigma_i^2$. Let
$\bar\sigma_n^2 = \frac{1}{n}\sum_{i=1}^{n}\sigma_i^2, \qquad \bar\mu_n = \frac{1}{n}\sum_{i=1}^{n}\mu_i$
Then
$\sqrt{n}\left(\bar{x}_n - \bar\mu_n\right) \to_d N\left(0, \bar\sigma^2\right)$
or
$\frac{\sqrt{n}\left(\bar{x}_n - \bar\mu_n\right)}{\bar\sigma_n} \to_d N(0, 1)$
5. Liapounov CLT: Suppose that $\{x_i\}$ is a sequence of independent random variables with finite means $\mu_i$ and finite positive variances $\sigma_i^2$ such that $E|x_i - \mu_i|^{2+\delta} < \infty$ for some $\delta > 0$. If $\bar\sigma_n$ is positive and finite for all $n$ sufficiently large, then
$\frac{\sqrt{n}\left(\bar{x}_n - \bar\mu_n\right)}{\bar\sigma_n} \to_d N(0, 1)$
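The Lindeberg-Levy CLT can be illustrated by simulation: standardized sample means of a skewed (exponential) population look approximately $N(0,1)$. Sample sizes here are illustrative:

```python
import numpy as np

# Exponential(1) has mean 1 and standard deviation 1; standardize its means.
rng = np.random.default_rng(3)
n, reps = 200, 50_000
mu, sigma = 1.0, 1.0

draws = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (draws.mean(axis=1) - mu) / sigma   # sqrt(n)(xbar - mu)/sigma

mean_near_zero = abs(z.mean()) < 0.05
var_near_one = abs(z.var() - 1.0) < 0.05
```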
Order of A Sequence
1. A sequence $\{a_n\}$ is of order $n^{\lambda}$, denoted $O(n^{\lambda})$, iff $\lim_{n\to\infty} a_n/n^{\lambda} < \infty$.
(a) Example: $a_n = 1$ is $O(1)$.
Order in Probability
(a) If $x_n \sim N(0, \sigma^2)$, then $x_n = O_p(1)$, since given $\varepsilon$ there is always some $N_\varepsilon$ such that
$\operatorname{Prob}\left(|x_n| < N_\varepsilon\right) > 1 - \varepsilon$
(b) $O_p(n^a)\cdot O_p(n^b) = O_p(n^{a+b})$
(c) If $\sqrt{n}\left(\bar{x}_n - \bar\mu_n\right) \to_d N\left(0, \bar\sigma^2\right)$, then $\left(\bar{x}_n - \bar\mu_n\right) = O_p\left(n^{-1/2}\right)$ but $\bar{x}_n = O_p(1)$.
Sample Questions
M1: $y_t = \alpha + u_t$, $t = 1, \ldots, T$
M2: $y_t = \alpha + \beta x_t + u_t$
Suppose that
$\sigma_1 = \infty$ but $\sigma_t = 0$ for all $t > 1$.
$y = Xb + u$
5 Chapters 1 through 4: The Classical Assumptions
1. Linearity: $y = Xb + u$
4. $E(u) = 0$
5. $u \sim N\left(0, \sigma^2 I\right)$
$y = \alpha + \beta x + u$
$y = \alpha + \beta x + \gamma x^2 + u$
Consider log, semilog, and level specifications:
$y = \alpha + \beta x + u$
$\ln y = \alpha + \beta\ln x + u$
$\ln y = \alpha + \beta x + u$
5.1 Brainstorm I: Sample Mean
Notation:
$y_t$ = earnings at time $t$
Question 1: How can we test the difference between male and female earnings?
Question 2: How can we explain the earnings difference between male and female by using skill data? If so, how?
Question 4: Form a null hypothesis to test whether there is a male-female earnings difference.
5.2 Brainstorm II: Trend Regression
$y_t = \beta_1 t + u_t$
$y_t = \beta_2\sqrt{t} + u_t$
5.3 Brainstorm I Continue: Dummy Regression
5.4 Assignment II: 2 Extra Credits
1. Let $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$. Show that $\hat\sigma^2$ is biased but consistent. Obtain the unbiased estimator.
4. Let $\hat\sigma_1^2$, $\hat\sigma_2^2$, $\hat\sigma_3^2$ be the unbiased variance estimators for Q1, Q2, Q3. Find the one with the smallest variance.
$y_i = \alpha + u_i$
3. Show that
$\frac{1}{n}\sum_{i=1}^{n}\hat{u}_i = 0$
$y_i = X_i'\beta + u_i = \beta_1 x_{1i} + \beta_2 x_{2i} + u_i$
or equivalently
$y = X\beta + u = X_1\beta_1 + X_2\beta_2 + u$
1. Show that the OLS estimator $\hat\beta$ is unbiased.
2. $\min_{\beta}\ u'u$
3. Show that
$X'\hat{u} = 0$
4. Show that
$\hat\beta = (X'X)^{-1}X'y$
$\hat\beta_1 = \left(X_1'M_2X_1\right)^{-1}X_1'M_2y$
$\hat\beta_2 = \left(X_2'M_1X_2\right)^{-1}X_2'M_1y$
$y = X_1\beta_1 + e \qquad (3)$
$y = X_1\beta_1 + X_2\beta_2 + u \qquad (4)$
(c) Suppose that $\beta_2 \neq 0$ and the $t$ ratio for $\hat\beta_2$ is greater than 1. Show that $\bar{R}^2$ in (4) is greater than $\bar{R}^2$ in (3).
7. Consider (3) as the true model, with $e \sim N\left(0, \sigma^2\right)$. Let $s^2 = \frac{\hat{u}'\hat{u}}{n-1}$.
8. Consider (4) as the true model. Let $x_2 = x_1 + v$, where $v$ is independent of $x_1$ and $u$ is independent of $v$.
(b) Let $v \sim N\left(0, n^{-a}\right)$ for $a > 0$. Find $\operatorname{plim}\hat\beta_1$ and $\operatorname{plim}\hat\beta_2$.
9. Consider (4) as the true model, $\beta_2 \neq 0$, and $x_1$ is independent of $x_2$. Suppose that you are interested in testing
$H_0:\ \gamma = \frac{\beta_1}{\beta_2} = 0$
Derive the limiting distribution of $\hat\gamma = \hat\beta_1/\hat\beta_2$.
5.5 Answer and Additional Notes
Regression:
$y_t = \alpha + \beta x_t^* + u_t \qquad (5)$
Averaging over $t$,
$\frac{1}{T}\sum y_t = \alpha + \beta\frac{1}{T}\sum x_t^* + \frac{1}{T}\sum u_t \qquad (6)$
(5) − (6) yields
$\tilde{y}_t^* = \beta\tilde{x}_t^* + \tilde{u}_t$
or equivalently
$\tilde{y}_t = \beta\tilde{x}_t + \tilde{u}_t$
since
$\tilde{x}_t^* = x_t^* - \frac{1}{T}\sum x_s^* = \tilde{x}_t$
$\hat\beta = \beta + \frac{\sum\tilde{x}_t\tilde{u}_t}{\sum\tilde{x}_t^2}$
$\hat\beta - \beta = \frac{\sum\tilde{x}_t\tilde{u}_t}{\sum\tilde{x}_t^2} = \frac{\frac{1}{T}\sum\tilde{x}_t\tilde{u}_t}{\frac{1}{T}\sum\tilde{x}_t^2}$
Since we assume $x$ is nonstochastic, we have
$E\left(\hat\beta - \beta\right) = \frac{\frac{1}{T}\sum\tilde{x}_tE(\tilde{u}_t)}{\frac{1}{T}\sum\tilde{x}_t^2} = 0$
$E\left(\hat\beta - \beta\right)^2 = E\left[\frac{\frac{1}{T}\sum\tilde{x}_t\tilde{u}_t}{\frac{1}{T}\sum\tilde{x}_t^2}\right]^2 = \frac{\frac{1}{T^2}E\left(\sum\tilde{x}_t\tilde{u}_t\right)^2}{\left[\frac{1}{T}\sum\tilde{x}_t^2\right]^2} \qquad (7)$
Note that
$\left(\sum\tilde{x}_t\tilde{u}_t\right)^2 = \left(\tilde{x}_1\tilde{u}_1 + \cdots + \tilde{x}_T\tilde{u}_T\right)^2 = \left(\tilde{x}_1^2\tilde{u}_1^2 + \cdots + \tilde{x}_T^2\tilde{u}_T^2\right) + 2\left(\tilde{x}_1\tilde{x}_2\tilde{u}_1\tilde{u}_2 + \cdots + \tilde{x}_T\tilde{x}_{T-1}\tilde{u}_T\tilde{u}_{T-1}\right)$
We will assume
$E u_t^2 = \sigma^2$: identical.
Then we have
$E\tilde{u}_t^2 = E\left(u_t - \frac{1}{T}\sum_s u_s\right)^2 = Eu_t^2 - \frac{2}{T}E\left(u_t\sum_s u_s\right) + \frac{1}{T^2}E\left(\sum_s u_s\right)^2 \qquad (8)$
$= \sigma^2 - \frac{2}{T}\sigma^2 + \frac{1}{T}\sigma^2 = \sigma^2\left(1 - \frac{1}{T}\right) = \frac{T-1}{T}\sigma^2$
since
$Eu_tu_s = 0$ for $t \neq s$.
Now, we have
$E\left(\sum\tilde{x}_t\tilde{u}_t\right)^2 = \sum_{t=1}^{T}\tilde{x}_t^2E\tilde{u}_t^2 + 2\left(\tilde{x}_1\tilde{x}_2E\tilde{u}_1\tilde{u}_2 + \cdots + \tilde{x}_T\tilde{x}_{T-1}E\tilde{u}_T\tilde{u}_{T-1}\right)$
$= \frac{T-1}{T}\sigma^2\sum_{t=1}^{T}\tilde{x}_t^2 - \frac{2}{T}\sigma^2\left(\tilde{x}_1\tilde{x}_2 + \cdots + \tilde{x}_T\tilde{x}_{T-1}\right)$
$= \sigma^2\left[\sum_{t=1}^{T}\tilde{x}_t^2 - \frac{1}{T}\sum_{t=1}^{T}\tilde{x}_t^2 - \frac{2}{T}\left(\tilde{x}_1\tilde{x}_2 + \cdots + \tilde{x}_T\tilde{x}_{T-1}\right)\right]$
$= \sigma^2\left[\sum_{t=1}^{T}\tilde{x}_t^2 - \frac{1}{T}\left(\sum_{t=1}^{T}\tilde{x}_t\right)^2\right]$
$= \sigma^2\sum_{t=1}^{T}\left(\tilde{x}_t - \frac{1}{T}\sum_{s=1}^{T}\tilde{x}_s\right)^2 = \sigma^2\sum_{t=1}^{T}\tilde{x}_t^2$
since
$\tilde{x}_t - \frac{1}{T}\sum_{s=1}^{T}\tilde{x}_s = \left(x_t - \frac{1}{T}\sum_s x_s\right) - \frac{1}{T}\sum_{s=1}^{T}\left(x_s - \frac{1}{T}\sum_r x_r\right) = x_t - \frac{1}{T}\sum_s x_s = \tilde{x}_t$
Note that
$\lim_{T\to\infty}\frac{\sigma^2}{\sum\tilde{x}_t^2} = 0 \quad \text{if } \frac{1}{T}\sum\tilde{x}_t^2 = O(1)$
In order to have a nondegenerate limiting variance, we need to scale:
$\sqrt{T}\left(\hat\beta - \beta\right) = \frac{\frac{1}{\sqrt{T}}\sum\tilde{x}_t\tilde{u}_t}{\frac{1}{T}\sum\tilde{x}_t^2}$
Now it is easy to show (by using the Lindeberg-Levy CLT) that
$\sqrt{T}\left(\hat\beta - \beta\right) \to_d N\left(0,\ \frac{\sigma^2}{\lim_{T\to\infty}\frac{1}{T}\sum\tilde{x}_t^2}\right)$
or
$\frac{\hat\beta - \beta}{\sqrt{\sigma^2/\sum\tilde{x}_t^2}} \to_d N(0, 1)$
To Students:
$y_t = \alpha + \beta x_t + u_t \qquad (9)$
$y_t = \alpha + u_t \qquad (10)$
(10) is a regression of $y$ on 1: that is, set $x_t = 1$ for all $t$ in (9) to obtain (10). Find the limiting distribution of $\hat\alpha$.
When $x$ is stochastic: We don't work with expectations here. Instead, we take probability limits (since we are obtaining the limiting distribution, we only care about consistency; in the previous case, unbiasedness and consistency were the same problem since $x$ was nonstochastic).
$\operatorname{plim}_{T\to\infty}\left(\hat\beta - \beta\right) = \frac{\operatorname{plim}_{T\to\infty}\frac{1}{T}\sum\tilde{x}_t\tilde{u}_t}{\operatorname{plim}_{T\to\infty}\frac{1}{T}\sum\tilde{x}_t^2} = 0$
Note that
$E\tilde{x}_t^2 = \frac{T-1}{T}\sigma_x^2, \qquad E\tilde{x}_t\tilde{x}_{t+1} = -\frac{1}{T}\sigma_x^2$
Similarly, we have
$E\tilde{u}_t^2 = \frac{T-1}{T}\sigma_u^2, \qquad E\tilde{u}_t\tilde{u}_{t+1} = -\frac{1}{T}\sigma_u^2$
Hence
$E\left(\sum\tilde{x}_t\tilde{u}_t\right)^2 = \sum_{t=1}^{T}E\tilde{x}_t^2E\tilde{u}_t^2 + 2\left(E\tilde{x}_1\tilde{x}_2E\tilde{u}_1\tilde{u}_2 + \cdots + E\tilde{x}_T\tilde{x}_{T-1}E\tilde{u}_T\tilde{u}_{T-1}\right)$
$= T\left(\frac{T-1}{T}\sigma_x^2\right)\left(\frac{T-1}{T}\sigma_u^2\right) + T(T-1)\left(-\frac{1}{T}\sigma_x^2\right)\left(-\frac{1}{T}\sigma_u^2\right)$
$= \sigma_x^2\sigma_u^2\left[\frac{(T-1)^2}{T} + \frac{T-1}{T}\right] = \sigma_x^2\sigma_u^2(T-1) = \sigma_x^2\sigma_u^2T\left(1 + O\left(T^{-1}\right)\right)$
Then we have
$\frac{1}{T}\sum\tilde{x}_t\tilde{u}_t \to_p 0$
$\frac{1}{\sqrt{T}}\sum\tilde{x}_t\tilde{u}_t \to_d N\left(0,\ \sigma_x^2\sigma_u^2\right)$
Usually in textbooks we don't follow the above derivation; instead, one uses conditional expectations. Let $x = (x_1, \ldots, x_T)'$. Then we consider
$E\left(\hat\beta - \beta \mid x\right)$
and
$E\left[\left(\hat\beta - \beta\right)^2 \mid x\right]$
which is equivalent to treating $x$ like a nonstochastic variable. Asymptotically we note that
$\sqrt{T}\left(\hat\beta - \beta\right)\mid x \to_d N\left(0,\ \sigma^2Q^{-1}\right)$
where
$Q = \lim_{T\to\infty}\frac{x'x}{T}$
Dummy Variable Regression: Let's go back to our original example
$y_i = \alpha + \beta D_i + u_i$
where
$D_i = 0$ for females, $D_i = 1$ for males,
$n = n_1 + n_1 = 2n_1$, or $n_1 = \frac{n}{2}$
$\hat\beta = \beta + \frac{\sum\tilde{D}_i\tilde{u}_i}{\sum\tilde{D}_i^2}$
Treat $D_i$ as if it is nonrandom. Then we have
$E\sum\tilde{D}_i\tilde{u}_i = \sum\tilde{D}_iE\tilde{u}_i = 0$
Next,
$\sum\tilde{D}_i^2 = \sum_{i=1}^{n}\left(D_i - \frac{1}{n}\sum_j D_j\right)^2 = \sum D_i^2 - \frac{1}{n}\left(\sum D_i\right)^2$
Note
$D_i^2 = D_i = \begin{cases} 0 & \text{if female} \\ 1 & \text{if male} \end{cases}$
so that
$\sum D_i^2 = \sum D_i = n_1$: the total number of males (= number of females).
Hence we have
$\sum\tilde{D}_i^2 = \sum D_i^2 - \frac{1}{n}\left(\sum D_i\right)^2 = n_1 - \frac{n_1^2}{n} = \frac{n}{2} - \frac{n}{4} = \frac{n}{4} \qquad (11)$
Consider
$\frac{1}{\sqrt{n}}\sum\tilde{D}_i\tilde{u}_i$
We know
$\operatorname{plim}\frac{1}{n}\sum\tilde{D}_i\tilde{u}_i = 0$
also we know
$\frac{1}{\sqrt{n}}\sum\tilde{D}_i\tilde{u}_i = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left(D_i - \frac{1}{n}\sum_j D_j\right)\tilde{u}_i = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}D_i\tilde{u}_i - \frac{1}{2}\cdot\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\tilde{u}_i$
since
$D_i\tilde{u}_i = 0$ if $i$ is female, $= \tilde{u}_i$ if $i$ is male.
Hence
$\frac{1}{\sqrt{n}}\sum\tilde{D}_i\tilde{u}_i = \frac{1}{\sqrt{n}}\sum_{i\in\text{male}}\tilde{u}_i - \frac{1}{2}\cdot\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\tilde{u}_i = \frac{1}{\sqrt{n}}\left(\frac{1}{2}\sum_{i\in\text{male}}\tilde{u}_i - \frac{1}{2}\sum_{i\in\text{female}}\tilde{u}_i\right) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\tilde{u}_i^*$
where
$\tilde{u}_i^* = \begin{cases} \frac{1}{2}\tilde{u}_i & \text{if } i \text{ is male} \\ -\frac{1}{2}\tilde{u}_i & \text{if } i \text{ is female} \end{cases}$
Note that the stochastic properties of $\tilde{u}_i^*$ are the same as those of $\frac{1}{2}\tilde{u}_i$ as long as $u_i$ is independent and identically distributed.
Next, since we already know the value of $E(\tilde{u}_i)^2$ from (8),
$E\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\tilde{u}_i^*\right)^2 = \frac{1}{4}\sigma^2 - \frac{1}{4n}\sigma^2 = \frac{1}{4}\sigma^2 + O\left(n^{-1}\right)$
Hence
$\frac{1}{\sqrt{n}}\sum\tilde{D}_i\tilde{u}_i \to_d N\left(0,\ \frac{1}{4}\sigma^2\right)$
and
$\sqrt{n}\left(\hat\beta - \beta\right) = \frac{\frac{1}{\sqrt{n}}\sum\tilde{D}_i\tilde{u}_i}{\frac{1}{n}\sum\tilde{D}_i^2} = \frac{\frac{1}{\sqrt{n}}\sum\tilde{D}_i\tilde{u}_i}{1/4} = \frac{4}{\sqrt{n}}\sum\tilde{D}_i\tilde{u}_i$
since $\frac{1}{n}\sum\tilde{D}_i^2 = \frac{1}{4}$ from (11). Therefore finally we have
$\sqrt{n}\left(\hat\beta - \beta\right) \to_d N\left(0,\ 4\sigma^2\right)$
or
$\frac{\sqrt{n}\left(\hat\beta - \beta\right)}{\sqrt{4\sigma^2}} \to_d N(0, 1)$
Note that the asymptotic variance of $\hat\beta$ is
$\operatorname{Asy\,Var}\left(\hat\beta\right) = \frac{4\sigma^2}{n}$
$y_i = \alpha + \beta D_i + \gamma S_i + u_i$
$y_i = \alpha + \beta D_i + u_i$
Now
$u_i = \gamma S_i + v_i$
where
$S_i = \begin{cases} 0 & \text{if } i \text{ is non-skilled} \\ 1 & \text{if } i \text{ is skilled} \end{cases}$
$n_{11} = 2n_{21}, \qquad 2n_{12} = n_{22}$
Then we have the count table

         Unskilled    Skilled     Total
Female   $2n_{11}$    $n_{22}$    $2n_{11} + n_{22}$
Male     $n_{11}$     $2n_{22}$   $n_{11} + 2n_{22}$
Total    $3n_{11}$    $3n_{22}$

and the probability matrix becomes
Consider the following expectations.
Expected earnings:
Female & unskilled: $D_i = S_i = 0$: $E(y_i) = \alpha$
Female & skilled: $D_i = 0$, $S_i = 1$: $E(y_i) = \alpha + \gamma$
Male & unskilled: $D_i = 1$, $S_i = 0$: $E(y_i) = \alpha + \beta$
Male & skilled: $D_i = 1$, $S_i = 1$: $E(y_i) = \alpha + \beta + \gamma$
$H_{01}: \beta = 0$
$H_{02}: \gamma = 0$
$H_{03}: \beta = \gamma = 0$
$y_i = \alpha + \beta D_i + u_i \qquad (13)$
the OLS estimator $\hat\beta$ may not be zero. But the OLS estimator $\hat\beta$ in (12) could be zero. Why?
Omitted Variable: If $E(D_iS_i) \neq 0$, or in other words $\operatorname{Cov}(D_i, S_i) \neq 0$, then the OLS estimator $\hat\beta$ in (13) becomes inconsistent. (To students: Prove it.)
Cross Dummies: Suppose that among unskilled workers there is no gender earnings difference, but among skilled workers there is. How do we test this?
Expected earnings:
Female & unskilled: $D_i = S_i = 0$: $E(y_i) = \alpha$
Female & skilled: $D_i = 0$, $S_i = 1$: $E(y_i) = \alpha + \gamma$
Male & unskilled: $D_i = 1$, $S_i = 0$: $E(y_i) = \alpha + \beta$
Male & skilled: $D_i = 1$, $S_i = 1$: $E(y_i) = \alpha + \beta + \gamma + \delta$
Expected earnings:
Ph.D. & taking Econometrics III: $E(y_i) = \alpha + \beta + \gamma$
Ph.D. & not taking Econometrics III: $E(y_i) = \alpha + \beta$
No Ph.D.: $E(y_i) = \alpha$
Another Example of a Nonstochastic Regressor
$y_t = \beta t + u_t, \qquad u_t \sim N\left(0, \sigma^2\right)$
Consider
$E\sum_{t} tu_t = 0$
$E\left(\sum tu_t\right)^2 = E\left(u_1 + 2u_2 + \cdots + Tu_T\right)^2$
$= E\left(u_1^2 + 2^2u_2^2 + \cdots + T^2u_T^2\right) + 2E\left(2u_1u_2 + \cdots\right)$
$= \sigma^2\left(1 + 4 + \cdots + T^2\right) = \sigma^2\sum_{t=1}^{T}t^2$
Note that
$\sum_{t=1}^{T}t^2 = \frac{1}{6}T(2T+1)(T+1)$
Hence we have
$\sum tu_t \to_d N\left(0,\ \sigma^2\frac{T(2T+1)(T+1)}{6}\right)$
and
$\frac{\sum tu_t}{\sqrt{\frac{T(2T+1)(T+1)}{6}}} \to_d N\left(0, \sigma^2\right)$
Now find $a_T$ which makes
$\frac{\sum tu_t}{a_T} \to_d N\left(0, \sigma^2\right)$
The answer is
$a_T = \sqrt{\frac{T(2T+1)(T+1)}{6}} = \sqrt{\frac{T^3}{3} + \frac{T^2}{2} + \frac{T}{6}} = \frac{T^{3/2}}{\sqrt{3}} + O\left(T^{1/2}\right)$
Hence we have
$T^{3/2}\left(\hat\beta - \beta\right) \to_d N\left(0,\ 3\sigma^2\right)$
Linear and Nonlinear Restrictions (Chapter 5). Consider the following regression
$y = Xb + u = x_1\beta_1 + x_2\beta_2 + u$
with
$\sqrt{n}\begin{pmatrix} \hat\beta_1 - \beta_1 \\ \hat\beta_2 - \beta_2 \end{pmatrix} \to_d N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_{11}^2 & \sigma_{12} \\ \sigma_{12} & \sigma_{22}^2 \end{bmatrix}\right)$
Next consider the following linear restriction:
$\gamma = c_0\beta_1 + c_1\beta_2 + c_2, \qquad \hat\gamma = c_0\hat\beta_1 + c_1\hat\beta_2 + c_2$
so that
$\hat\gamma = c_0\beta_1 + c_1\beta_2 + c_2 + c_0\left(\hat\beta_1 - \beta_1\right) + c_1\left(\hat\beta_2 - \beta_2\right) = \gamma + (c_0\ \ c_1)\begin{bmatrix} \hat\beta_1 - \beta_1 \\ \hat\beta_2 - \beta_2 \end{bmatrix}$
or
$\sqrt{n}\left(\hat\gamma - \gamma\right) = (c_0\ \ c_1)\begin{bmatrix} \sqrt{n}\left(\hat\beta_1 - \beta_1\right) \\ \sqrt{n}\left(\hat\beta_2 - \beta_2\right) \end{bmatrix} \to_d N\left(0,\ (c_0\ \ c_1)\begin{bmatrix} \sigma_{11}^2 & \sigma_{12} \\ \sigma_{12} & \sigma_{22}^2 \end{bmatrix}\begin{bmatrix} c_0 \\ c_1 \end{bmatrix}\right) = N\left(0, \omega^2\right)$
Finally we have
$\frac{\sqrt{n}\left(\hat\gamma - \gamma\right)}{\sqrt{\omega^2}} \to_d N(0, 1)$
Now compare this limiting result with Greene (page 84). Let
$R = (c_0\ \ c_1)$ and $q = c_2$
Then we have
$W = (Rb - q)'\left[\sigma^2R(X'X)^{-1}R'\right]^{-1}(Rb - q) \to_d \chi_1^2$
Therefore, for a nonlinear function $\gamma = f(\beta_1, \beta_2)$, the delta method gives
$\sqrt{n}\left(\hat\gamma - \gamma\right) = \frac{\partial f(\beta_1, \beta_2)}{\partial\beta_1}\sqrt{n}\left(\hat\beta_1 - \beta_1\right) + \frac{\partial f(\beta_1, \beta_2)}{\partial\beta_2}\sqrt{n}\left(\hat\beta_2 - \beta_2\right) + O_p\left(\frac{1}{\sqrt{n}}\right)$
and
$\sqrt{n}\left(\hat\gamma - \gamma\right) = \left(\frac{\partial f}{\partial\beta_1}\ \ \frac{\partial f}{\partial\beta_2}\right)\begin{bmatrix} \sqrt{n}\left(\hat\beta_1 - \beta_1\right) \\ \sqrt{n}\left(\hat\beta_2 - \beta_2\right) \end{bmatrix} + O_p\left(\frac{1}{\sqrt{n}}\right)$
$\to_d N\left(0,\ \left(\frac{\partial f}{\partial\beta_1}\ \ \frac{\partial f}{\partial\beta_2}\right)\begin{bmatrix} \sigma_{11}^2 & \sigma_{12} \\ \sigma_{12} & \sigma_{22}^2 \end{bmatrix}\begin{pmatrix} \partial f/\partial\beta_1 \\ \partial f/\partial\beta_2 \end{pmatrix}\right)$
Example: $\gamma = f(\beta_1, \beta_2) = \frac{\beta_1}{\beta_2}$
Then we have
$\frac{\partial f(\beta_1, \beta_2)}{\partial\beta_1} = \frac{1}{\beta_2}, \qquad \frac{\partial f(\beta_1, \beta_2)}{\partial\beta_2} = -\frac{\beta_1}{\beta_2^2}$
6 Time Series Models (Ref: Chap 19 & 21)
Consider the following regression
$y_t = \beta x_t + u_t, \qquad t = 1, \ldots, T$
where
$u_t = \rho u_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim N\left(0, \sigma_\varepsilon^2\right)$
Let's assume that $E(x_tx_s) = 0$ for $t \neq s$ and $E(x_t\varepsilon_s) = 0$ for all $t$ and $s$. Q1: Find the limiting distribution of $\hat\beta$.
We know
$\hat\beta - \beta = \frac{\sum x_tu_t}{\sum x_t^2}$
and
$E\left(\sum x_tu_t\right) = 0$
Consider
$E\left(\sum x_tu_t\right)^2$
Observe this:
$E\left(\sum x_tu_t\right)^2 = E\left(x_1u_1 + \cdots + x_Tu_T\right)^2 = \sum Ex_t^2u_t^2 + 2E\left(x_1x_2u_1u_2 + \cdots + x_{T-1}x_Tu_{T-1}u_T\right) \qquad (16)$
Now
$\sum Ex_t^2u_t^2 = \sigma_x^2\sum Eu_t^2$
so that, iterating the AR(1) recursion,
$u_t = \rho^2u_{t-2} + \rho\varepsilon_{t-1} + \varepsilon_t$
Next,
$Eu_t^2 = E\left(\varepsilon_t + \rho\varepsilon_{t-1} + \rho^2\varepsilon_{t-2} + \cdots\right)^2 = \sum_{j=0}^{\infty}\rho^{2j}E\varepsilon_{t-j}^2 + E(\text{cross product terms})$
so that
$Eu_t^2 = \left(E\varepsilon_t^2 + \rho^2E\varepsilon_{t-1}^2 + \rho^4E\varepsilon_{t-2}^2 + \cdots\right) + E(\text{cross})$
Since
$E\varepsilon_t\varepsilon_s = 0$ for $t \neq s$, $E(\text{cross}) = 0$
Hence we have
$Eu_t^2 = \sigma_\varepsilon^2\left(1 + \rho^2 + \rho^4 + \cdots\right)$
Note that
$1 + \rho^2 + \rho^4 + \cdots = \frac{1}{1-\rho^2}$
$1 + \rho + \rho^2 + \cdots = \frac{1}{1-\rho}$
$1 + \rho + \rho^2 + \cdots + \rho^T = \frac{1-\rho^{T+1}}{1-\rho}$
Finally
$Eu_t^2 = \frac{\sigma_\varepsilon^2}{1-\rho^2} = \sigma_u^2$
Also note that
$Eu_tu_{t-1} = E\left(\varepsilon_t + \rho\varepsilon_{t-1} + \rho^2\varepsilon_{t-2} + \cdots\right)\left(\varepsilon_{t-1} + \rho\varepsilon_{t-2} + \rho^2\varepsilon_{t-3} + \cdots\right)$
$= \rho E\varepsilon_{t-1}^2 + \rho^3E\varepsilon_{t-2}^2 + \cdots = \rho\sigma_\varepsilon^2\left(1 + \rho^2 + \cdots\right) = \frac{\rho\sigma_\varepsilon^2}{1-\rho^2} = \rho\sigma_u^2$
and
$Eu_tu_{t-2} = E\left(\varepsilon_t + \rho\varepsilon_{t-1} + \rho^2\varepsilon_{t-2} + \cdots\right)\left(\varepsilon_{t-2} + \rho\varepsilon_{t-3} + \rho^2\varepsilon_{t-4} + \cdots\right)$
$= \rho^2E\varepsilon_{t-2}^2 + \rho^4E\varepsilon_{t-3}^2 + \cdots = \rho^2\sigma_\varepsilon^2\left(1 + \rho^2 + \cdots\right) = \frac{\rho^2\sigma_\varepsilon^2}{1-\rho^2} = \rho^2\sigma_u^2$
In general,
$Eu_tu_{t-j} = \rho^j\sigma_u^2$
Then, with $E(x_tx_s) = 0$ for $t \neq s$, we have
$E\left(\sum x_tu_t\right)^2 = \sum Ex_t^2Eu_t^2 + 2E\left(x_1x_2u_1u_2 + \cdots\right) = T\sigma_x^2\sigma_u^2$
Now let's instead assume that $E(x_t\varepsilon_s) = 0$ for all $t$ and $s$ but $E(x_tx_{t-j}) = \rho^j\sigma_x^2$. Then we have
$E\left(x_tu_tx_{t-j}u_{t-j}\right) = E\left(x_tx_{t-j}\right)E\left(u_tu_{t-j}\right) = \rho^{2j}\sigma_x^2\sigma_u^2$
The cross-product terms accumulate as
$2\sigma_x^2\sigma_u^2\sum_{j=1}^{T-1}(T - j)\rho^{2j} = \frac{2T\rho^2\sigma_x^2\sigma_u^2}{1-\rho^2} + O(1)$
Hence the total sum becomes
$E\left(\sum x_tu_t\right)^2 = \sum Ex_t^2Eu_t^2 + 2E\left(x_1x_2u_1u_2 + \cdots + x_{T-1}x_Tu_{T-1}u_T\right)$
$= T\sigma_x^2\sigma_u^2 + \frac{2T\rho^2\sigma_x^2\sigma_u^2}{1-\rho^2} + O(1) = T\sigma_x^2\sigma_u^2\left(\frac{1+\rho^2}{1-\rho^2}\right) + O(1)$
Finally we have
$\sqrt{T}\left(\hat\beta - \beta\right) \to_d N\left(0, \omega^2\right)$
where
$\omega^2 = \frac{\sigma_x^2\sigma_u^2\left(\frac{1+\rho^2}{1-\rho^2}\right)}{\left(\sigma_x^2\right)^2} = \left(\frac{1+\rho^2}{1-\rho^2}\right)\frac{\sigma_u^2}{\sigma_x^2} \ge \frac{\sigma_u^2}{\sigma_x^2}$
In other words, the typical limiting distribution such as
$\sqrt{T}\left(\hat\beta - \beta\right) \to_d N\left(0,\ \sigma^2\left(\operatorname{plim}\frac{X'X}{T}\right)^{-1}\right)$
no longer holds.
6.1 Definitions
1. Strong Stationarity: The joint probability distribution of any set of observations in the sequence $\{y_t, y_{t+1}, \ldots, y_{t+k}\}$ is the same regardless of the origin in the time scale.
2. Weak Stationarity: $\{y_t\}$ is weakly stationary if (i) $E(y_t)$ is finite, (ii) $\operatorname{Cov}(y_t, y_{t-j})$ is a finite function only of $j$ and model parameters. (In other words, it should not be time varying.)
3. Ergodicity (asymptotic independence):
$\lim_{k\to\infty}\left|E\left[f\left(y_t, \ldots, y_{t+a}\right)g\left(y_{t+k}, \ldots, y_{t+k+b}\right)\right]\right| = \left|Ef\left(y_t, \ldots, y_{t+a}\right)\right|\cdot\left|Eg\left(y_{t+k}, \ldots, y_{t+k+b}\right)\right|$
4. The Ergodic Theorem: If $\{y_t\}$ is strongly stationary and ergodic and $E|y_t|$ is a finite constant, then $\bar{y}_T = T^{-1}\sum y_t \to_{a.s.} \mu = E(y_t)$.
6.2 Long Run Variance
Q1: Consider $x_t = \rho x_{t-1} + \varepsilon_t$, where $\varepsilon_t$ is a white noise process with finite variance $\sigma^2$. Find the limiting distribution of the sample mean of $x_t$,
$\bar{x} = \frac{1}{T}\sum_{t=1}^{T}x_t$
Mean: $E\bar{x} = 0$
Variance:
$E\left(\frac{1}{T}\sum_{t=1}^{T}x_t\right)^2 = \frac{1}{T^2}E\left[\sum_{t=1}^{T}x_t^2 + 2\sum_{t=1}^{T-1}\sum_{s=t+1}^{T}x_tx_s\right]$
$= \frac{1}{T^2}\left[T\frac{\sigma^2}{1-\rho^2} + 2\frac{\sigma^2}{1-\rho^2}\cdot\frac{\rho}{1-\rho}T + O(1)\right]$
$= \frac{1}{T}\cdot\frac{\sigma^2}{1-\rho^2}\left(1 + \frac{2\rho}{1-\rho}\right) + O\left(T^{-2}\right) = \frac{1}{T}\cdot\frac{\sigma^2}{1-\rho^2}\cdot\frac{1+\rho}{1-\rho} + O\left(T^{-2}\right) = \frac{1}{T}\cdot\frac{\sigma^2}{(1-\rho)^2} + O\left(T^{-2}\right) := \frac{1}{T}\omega^2$
Hence we have
$\sqrt{T}\bar{x} \to_d N\left(0, \omega^2\right)$
or
$\frac{1}{\sqrt{T}}\sum_{t=1}^{T}x_t \to_d N\left(0,\ \frac{\sigma^2}{(1-\rho)^2}\right) \qquad (17)$
We call $\omega^2$ the long-run variance of $x_t$.
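The long-run variance result (17) can be checked by simulation: the variance of $T^{-1/2}\sum x_t$ for an AR(1) process should approach $\sigma^2/(1-\rho)^2$. Parameter values below are illustrative:

```python
import numpy as np

# Simulate many AR(1) paths and compare Var(T^{-1/2} * sum) to sigma^2/(1-rho)^2.
rng = np.random.default_rng(4)
rho, sigma, T, reps = 0.5, 1.0, 2_000, 2_000
lrv_true = sigma**2 / (1.0 - rho) ** 2          # = 4.0 here

e = rng.normal(scale=sigma, size=(reps, T))
x = np.empty((reps, T))
x[:, 0] = e[:, 0]
for t in range(1, T):                           # AR(1) recursion, vectorized over paths
    x[:, t] = rho * x[:, t - 1] + e[:, t]

stats = x.sum(axis=1) / np.sqrt(T)              # T^{-1/2} * sum_t x_t per path
lrv_hat = stats.var()
close = abs(lrv_hat - lrv_true) / lrv_true < 0.15
```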
6.3 Estimation of Long Run Variance (HAC Estimation)
How do we estimate the long-run variance of $x_t$ in (17)? The unknowns are $\sigma^2$ and $\rho$. How many observations do we have? So it is easy to estimate.
Now what if the parametric structure is unknown? Say $x_t$ follows an ARMA($p$, $q$) where $p$ and $q$ are unknown. Is it possible to estimate $\omega^2$? No: the total number of unknowns becomes $\frac{T(T-1)}{2} + 1$. The first term is the number of cross-product terms and the last term, 1, is the unknown variance term (diagonal term). If the variance is time varying, then it becomes $\frac{T(T-1)}{2} + T$. It is simply impossible to estimate the long-run variance in this case.
Therefore we impose regularity: an ergodic and stationary process. And then we assume that
$E(x_tx_{t-j}) \simeq 0$ for large $j$
$\frac{1}{T}\sum_{t=1}^{T}\sum_{s=1}^{T}Ex_tx_s = \frac{1}{T}\sum_{t=1}^{T}\sum_{s:\,|t-s|\le M}Ex_tx_s + \frac{1}{T}\sum_{t=1}^{T}\sum_{s:\,|t-s|>M}Ex_tx_s = \frac{1}{T}\sum_{t=1}^{T}\sum_{s:\,|t-s|\le M}Ex_tx_s + o(1) \qquad (18)$
where $\gamma_j = E(x_tx_{t-j})$ denotes the $j$-th autocovariance. The truncated estimator is then
$\hat\omega^2 = \hat\gamma_0 + \sum_{j=1}^{M}\left(\hat\gamma_j + \hat\gamma_{-j}\right)$
According to Andrews (1991), we can modify the estimator further in an elegant way:
$\hat\omega^2 = \hat\gamma_0 + \sum_{j=1}^{M}w_j\left(\hat\gamma_j + \hat\gamma_{-j}\right)$
$w_j = 1 - \frac{j}{M+1}, \qquad M = O\left(T^{1/3}\right)$
We call such a weight the Bartlett kernel weight. This type of estimator can be shown to be consistent.
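The Bartlett-kernel estimator can be sketched in a few lines; the function name and the AR(1) test case are my own choices, not from the notes:

```python
import numpy as np

# Newey-West / Bartlett long-run variance: g0 + sum_j w_j (g_j + g_{-j}),
# with w_j = 1 - j/(M+1).
def bartlett_lrv(u, M):
    u = np.asarray(u, dtype=float) - np.mean(u)
    T = len(u)
    omega2 = u @ u / T                       # gamma_0
    for j in range(1, M + 1):
        w = 1.0 - j / (M + 1.0)              # Bartlett weight
        gamma_j = u[j:] @ u[:-j] / T         # j-th sample autocovariance
        omega2 += 2.0 * w * gamma_j          # gamma_j and gamma_{-j} coincide
    return omega2

# AR(1) with rho = 0.5 and unit innovations: true long-run variance is
# 1/(1 - rho)^2 = 4.
rng = np.random.default_rng(5)
rho, T = 0.5, 100_000
e = rng.normal(size=T)
x = np.empty(T)
x[0] = e[0]
for t in range(1, T):
    x[t] = rho * x[t - 1] + e[t]
M = int(T ** (1 / 3))                        # bandwidth M = O(T^{1/3})
lrv_hat = bartlett_lrv(x, M)
```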
For
$x_t = \rho x_{t-1} + \varepsilon_t$
we have
$E\left(\frac{1}{\sqrt{T}}\sum x_t\right)^2 \to \frac{\sigma^2}{(1-\rho)^2} = \omega^2$
where $\omega^2$ is the long-run variance of $x_t$. Now we estimate $\hat\rho$ and $\hat\sigma^2$ and plug them in. That is,
$\hat\omega^2 = \frac{\hat\sigma^2}{(1-\hat\rho)^2}$
For the regression case,
$y_t = X_t'b + u_t$
$\sqrt{T}\left(\hat{b} - b\right) \to_d N(0, V)$
Then
$\Omega = \Omega_0 + \sum_{j}\left(\Omega_j + \Omega_{-j}\right)$
$\hat\Omega = \hat\Omega_0 + \sum_{j=1}^{M}w_j\left(\hat\Omega_j + \hat\Omega_{-j}\right)$
Alternative Approach: Let's assume
$y_t = X_t'b + u_t, \qquad u_t = \sum_{j=1}^{p}\rho_ju_{t-j} + \varepsilon_t$
Then, substituting the error dynamics into the regression,
$y_t = X_t'b + \sum_{j=1}^{p}\rho_j\left(y_{t-j} - X_{t-j}'b\right) + \varepsilon_t$
Let's rewrite it as
$y = Z\gamma + e$
where $Z_t$ collects $X_t$, the lags $X_{t-1}, \ldots, X_{t-p}$, and $y_{t-1}, \ldots, y_{t-p}$, and
$Q = \operatorname{plim}_{T\to\infty}\frac{Z'Z}{T}$
Conventional Approach (Generalized Least Squares GLS: Chapter 8) Suppose
that we know the AR order, say AR(1). Then we have
$u_t = \rho u_{t-1} + \varepsilon_t$
so that
$E(uu') = \Omega = \frac{\sigma_\varepsilon^2}{1-\rho^2}\begin{bmatrix} 1 & \rho & \cdots & \rho^{T-1} \\ \rho & 1 & \cdots & \rho^{T-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho^{T-1} & \rho^{T-2} & \cdots & 1 \end{bmatrix}$
Now we know
$\Omega = C\Lambda C'$
where $C'C = I$, and
$\Omega^{-1} = C\Lambda^{-1}C' = P'P$
where
$P = \Lambda^{-1/2}C'$
Transforming the regression,
$Py = PXb + Pu$
or
$y^* = X^*b + u^* \qquad (19)$
$\hat{b} = \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}y^*$
or alternatively we can say
$X^{*\prime}X^* = X'P'PX = X'\Omega^{-1}X$
and
$X^{*\prime}y^* = X'P'Py = X'\Omega^{-1}y$
so that we have
$\hat{b} = \left(X'\Omega^{-1}X\right)^{-1}X'\Omega^{-1}y$
or
$\sqrt{T}\left(\hat{b} - b\right) \to_d N\left(0,\ \sigma_\varepsilon^2\left(\operatorname{plim}\frac{X'\Omega^{-1}X}{T}\right)^{-1}\right)$
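GLS as OLS on transformed data can be sketched as follows, using a Cholesky factor of $\Omega^{-1}$ for $P$ (so $P'P = \Omega^{-1}$); the AR(1) covariance and all parameter values are illustrative assumptions:

```python
import numpy as np

# Build an AR(1) error covariance, simulate, and run GLS two equivalent ways.
rng = np.random.default_rng(6)
T, rho, sigma_e = 500, 0.5, 1.0
b_true = np.array([1.0, 2.0])

idx = np.arange(T)                      # Omega[s,t] = sig^2 rho^|s-t| / (1-rho^2)
Omega = sigma_e**2 * rho ** np.abs(idx[:, None] - idx[None, :]) / (1 - rho**2)

X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ b_true + rng.multivariate_normal(np.zeros(T), Omega)

Omega_inv = np.linalg.inv(Omega)
P = np.linalg.cholesky(Omega_inv).T     # P'P = Omega^{-1}
Xs, ys = P @ X, P @ y                   # transformed data: y* = X* b + u*
b_gls = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
b_direct = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
same = np.allclose(b_gls, b_direct)     # (X*'X*)^{-1}X*'y* = (X'O^{-1}X)^{-1}X'O^{-1}y
```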
7 Heteroskedasticity (Chapter 8 Continue)
Now we allow a heterogeneous variance for each $i$ or $t$. That is,
$Eu_i^2 = \sigma_i^2 \neq Eu_j^2 = \sigma_j^2$
Then we have
$E(uu') = \Omega = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & \vdots \\ \vdots & \vdots & \ddots & 0 \\ 0 & \cdots & 0 & \sigma_n^2 \end{bmatrix}$
Note that
$E(X'uu'X) = X'\Omega X \neq \sigma^2X'X$
Therefore we have
$\sqrt{n}\left(\hat{b} - b\right) \to_d N(0, V)$
where
$V = (X'X)^{-1}(X'\Omega X)(X'X)^{-1} = (X'X)^{-1}\left(\sum_{i=1}^{n}\sigma_i^2x_ix_i'\right)(X'X)^{-1}$
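The sandwich formula can be sketched directly; the heteroskedastic DGP below is an illustrative assumption:

```python
import numpy as np

# White robust covariance: (X'X)^{-1} (sum_i uhat_i^2 x_i x_i') (X'X)^{-1}.
rng = np.random.default_rng(7)
n = 5_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
u = rng.normal(size=n) * np.abs(x)           # error variance depends on x
y = X @ np.array([1.0, 2.0]) + u

b_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ b_hat

XtX_inv = np.linalg.inv(X.T @ X)
meat = (X * u_hat[:, None] ** 2).T @ X       # sum_i uhat_i^2 x_i x_i'
V_white = XtX_inv @ meat @ XtX_inv           # robust covariance of b_hat
se_robust = np.sqrt(np.diag(V_white))
```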
8 Instrumental Variables
Consider the following data generating process
$y_t = \alpha x_t + u_t$
where
$x_t = \gamma u_t + v_t$
so that
$E\left(\hat\alpha - \alpha \mid x\right) \neq 0$
We say that $x$ is endogenous in this case. Note that the concept of endogeneity in general is somewhat different; we will explain it later in Chapter 20.
For a lagged dependent variable,
$y_t = (1-\rho)\alpha + \rho y_{t-1} + \varepsilon_t$
Note that $E\left(\tilde{y}_{t-1}\tilde\varepsilon_t\right) \neq 0$. However, as $T \to \infty$, this bias goes away at the $O\left(T^{-1}\right)$ rate.
4. Measurement error: True model
$y_t = \beta x_t^* + u_t \qquad (20)$
and we observe
$x_t = x_t^* + e_t$
8.2 Solution I
8.3 Solution II
$E(x_tu_t) \neq 0$
but
$E(z_tu_t) = 0$
$\hat\alpha = (z'x)^{-1}z'y = (z'x)^{-1}z'(x\alpha + u) = \alpha + (z'x)^{-1}z'u$
Next,
$\hat\alpha - \alpha = (z'x)^{-1}z'u$
and
$\operatorname{plim}\left(\hat\alpha - \alpha\right) = \left(\operatorname{plim}\frac{z'x}{T}\right)^{-1}\operatorname{plim}\frac{z'u}{T} = Q_{zx}^{-1}\cdot 0 = 0$
$E\left[\left(\hat\alpha - \alpha\right)\left(\hat\alpha - \alpha\right)'\right] = (z'x)^{-1}z'E(uu')z(x'z)^{-1} = (z'x)^{-1}z'\Omega z(x'z)^{-1}$
where $\Omega = E(uu')$.
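The contrast between OLS and IV under endogeneity can be illustrated by simulation (the DGP is my own choice):

```python
import numpy as np

# x is endogenous (correlated with u); z is a valid instrument (E(zu) = 0).
rng = np.random.default_rng(8)
n, alpha = 100_000, 2.0
z = rng.normal(size=n)
u = rng.normal(size=n)
x = z + 0.8 * u + rng.normal(size=n)        # E(xu) != 0, E(zu) = 0
y = alpha * x + u

a_ols = (x @ y) / (x @ x)                   # inconsistent: plim > alpha
a_iv = (z @ y) / (z @ x)                    # (z'x)^{-1} z'y: consistent
ols_biased = abs(a_ols - alpha) > 0.1
iv_close = abs(a_iv - alpha) < 0.05
```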
9 Method of Moments (Chap 15)
Consider moment conditions such that
$E(x - \mu) = 0$
where $x$ is a random variable and $\mu$ is the unknown mean of $x$. The parameter of interest here is $\mu$. Consider the following minimum criterion:
$\arg\min_{\mu}\ \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$
which is the minimum variance of $x$ with respect to $\mu$. Of course, the simple solution is the sample mean for $\mu$, since we have
$\frac{\partial}{\partial\mu}\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 = -\frac{2}{n}\sum_{i=1}^{n}(x_i - \mu) = 0 \implies \hat\mu = \frac{1}{n}\sum_{i=1}^{n}x_i$
More moment conditions may be available, for example
$E(x - \mu) = 0$
$E\left[(x - \mu)^2 - m_2\right] = 0$
$E\left[x^{-1} - m_{-1}\right] = 0$
$E\left[x^{-2} - m_{-2}\right] = 0$
Then we have the four unknowns $\mu$, $m_2$, $m_{-1}$, $m_{-2}$, and four sample moments:
$\frac{1}{n}\sum x_i, \quad \frac{1}{n}\sum x_i^2, \quad \frac{1}{n}\sum x_i^{-1}, \quad \frac{1}{n}\sum x_i^{-2}$
so that the total number of unknowns may reduce. We can also add cross moment conditions. Let $\bar{m} = \left(\frac{1}{n}\sum x_i,\ \frac{1}{n}\sum x_i^2,\ \frac{1}{n}\sum x_i^{-1},\ \frac{1}{n}\sum x_i^{-2}\right)'$.
Then we have, for example,
$\frac{1}{n}\sum\left(x_i - \hat\mu\right)^2 = \frac{1}{n}\sum x_i^2 - \hat\mu^2$
so that
$\frac{1}{n}\sum x_i^2 = \hat{m}_2 + \hat\mu^2$
and so on.
Hence we may consider the following estimation:
$\hat\theta = \arg\min_{\theta}\ \left[\bar{m} - \mu(\theta)\right]'\left[\bar{m} - \mu(\theta)\right] \qquad (21)$
where $\theta$ is the parameter of interest (true parameter $\theta_0$) and $\mu(\theta)$ stacks the corresponding population moments. The resulting estimator is called the 'method of moments estimator'. Note that the MM estimator is a kind of minimum distance estimator.
In general, the MM estimator can be used in many cases. However, this method has one weakness. Suppose that the second moment is much larger than the first moment. Since the criterion assigns the same weight across moments, the minimization problem in (21) tries to minimize the second moment rather than both moments. Hence we need to design an optimally weighted method of moments, which becomes the generalized method of moments (GMM).
To understand the nature of GMM, we have to study the asymptotic properties of the MM estimator (in order to find the optimal weighting matrix). To get the asymptotic distribution of $\hat\theta$, we need a Taylor expansion.
$\bar{m} = \mu(\theta) + G(\theta)\left(\hat\theta - \theta\right) + O_p\left(\frac{1}{n}\right)$
so that we have
$\sqrt{n}\left(\hat\theta - \theta\right) = G(\theta)^{-1}\sqrt{n}\left[\bar{m} - \mu(\theta)\right] + O_p\left(\frac{1}{\sqrt{n}}\right)$
where $G(\theta) = \frac{\partial\mu(\theta)}{\partial\theta'}$.
Note that we know that
$\sqrt{n}\left[\bar{m} - \mu(\theta)\right] \to_d N(0, \Phi)$
Hence we have
$\sqrt{n}\left(\hat\theta - \theta\right) \to_d N\left(0,\ G(\theta)^{-1}\Phi\,G(\theta)^{\prime-1}\right)$
9.1 GMM
$\hat\theta = \arg\min_{\theta}\ \left[\bar{m} - \mu(\theta)\right]'W\left[\bar{m} - \mu(\theta)\right]$
Thus the first-order condition is
$G\left(\hat\theta\right)'W\left[\bar{m} - \mu\left(\hat\theta\right)\right] = 0$
Expanding $\mu(\hat\theta)$ around $\theta$,
$G\left(\hat\theta\right)'W\left[\bar{m} - \mu(\theta)\right] - G\left(\hat\theta\right)'WG(\theta)\left(\hat\theta - \theta\right) = 0$
Hence
$\hat\theta - \theta = \left\{G\left(\hat\theta\right)'WG(\theta)\right\}^{-1}G\left(\hat\theta\right)'W\left[\bar{m} - \mu(\theta)\right]$
and
$\sqrt{n}\left(\hat\theta - \theta\right) \to_d N(0, V)$
where
$V = \left\{G'WG\right\}^{-1}G'W\Phi WG\left\{G'WG\right\}^{-1}$
When $W = \Phi^{-1}$, then we have
$V = \left\{G'\Phi^{-1}G\right\}^{-1}$
10 Panel Data Analysis (Chapter 9)
Latent Data Generating Process
$y_{it} = a_i + \theta_t + \varepsilon_{it}$
where
$a_i$ = time invariant individual characteristics
$\theta_t$ = cross-sectionally invariant common factor
$\varepsilon_{it}$ = idiosyncratic term
(a) Usually try to explain the different averages: Examples; gender wage difference,
race wage difference etc. Use survey data.
(b) Typical regression setting: Let $y_{it}$ be the $i$th individual's wage (or income) at a particular time $t$ (survey year).
Hence, without including the second moments, the regression suffers from misspecification.
(c) Don't run a cross-section regression to examine time-series behavior.
Assume
$\beta_i = \beta$
In fact,
$\beta_i \neq \beta$
(a) Examining parities (PPP, UIP, CIP, Fisher Hypothesis, etc). Dynamic stability
becomes the main issue.
11 Pooling Panel and Random Effects (Estimation: Micro Panel)
Model
$y_{it} = \alpha + \beta x_{it} + u_{it}$
1. Why pooling?
(b) More data: either more cross sectional or time series observations. Pooling means
more ‘efficient’ and ‘powerful’ (will explain later)
(a) Account for individual heterogeneity. So at least we have to allow some level of heterogeneity, such as
$y_{it} = \alpha_i + \beta x_{it} + u_{it}$
(c) What if $\alpha_i$ is observable, like gender, education, age, etc.? You may want to include such variables. How?
Model:
$y_{it} = \alpha + \beta x_{it} + u_{it}, \qquad \mu_i = \alpha_i - \alpha, \qquad u_{it} = \mu_i + v_{it}$
Assumption:
Under A1 and A2, note that the pooled OLS estimator is consistent but not efficient. Consistency (here we are assuming $N \to \infty$ or $T \to \infty$, or both) requires that
$\operatorname{plim}\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}x_{it}u_{it} = 0$
Indeed, under A1 and A2, we can prove that the POLS estimator satisfies the above condition. However, the regression errors are no longer i.i.d.:
$Eu_{it}^2 = \sigma_\mu^2 + \sigma_v^2$
In this case, the pooled GLS estimator becomes efficient and consistent. Here is how to obtain the feasible GLS estimator:
1. Run
$y_{it} = \alpha + \beta x_{it} + u_{it}$
and get the pooled OLS residuals $\hat{u}_{it}$. Let $\hat\alpha_{\text{pols}}$ and $\hat\beta_{\text{pols}}$ be the POLS estimates for $\alpha$ and $\beta$.
2. Construct
$\hat\sigma^2 = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left(y_{it} - \hat\alpha_{\text{pols}} - \hat\beta_{\text{pols}}x_{it}\right)^2 = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{u}_{it}^2$
$\hat\mu_i = \frac{1}{T}\sum_{t=1}^{T}\hat{u}_{it}, \qquad \hat{v}_{it} = \hat{u}_{it} - \hat\mu_i$
$\hat\sigma_\mu^2 = \frac{1}{N}\sum_{i=1}^{N}\left(\hat\mu_i - \frac{1}{N}\sum_{j=1}^{N}\hat\mu_j\right)^2$
$\hat\sigma_v^2 = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left(\hat{v}_{it} - \frac{1}{NT}\sum_{j=1}^{N}\sum_{s=1}^{T}\hat{v}_{js}\right)^2$
Remark 1: (Inconsistency relies on A1) If A1 does not hold (usually it does not), that is, if individual characteristics are correlated with the regressors, then the POLS estimator becomes inconsistent, and the random effects (FGLS) estimator is also inconsistent. Because of this, many researchers in practice don't run the random effects model (FGLS estimator). We will study the alternative estimation method below (the fixed effects model).
Remark 2: (Including Observed Individual Effects) Even when A1 does not hold, if $\alpha_i$ is observable, then the observed $\alpha_i$ can be entered into the regression as a regressor. That is,
$y_{it} = \alpha + \gamma_1z_{1i} + \gamma_2z_{2i} + \cdots + \beta x_{it} + u_{it}$
We will study this model later (after studying the fixed effects model) in detail.
12 Fixed Effects (Estimation: Micro Panel)
You need to draw some graphs (for your dissertation or journal article). Why? They look good and give more direct information. Try to draw one nice graph which explains the main theme of the paper.
Target: We want to explain the relationship between $y_{it}$ and $x_{it}$. Plot $y_{it}$ on $x_{it}$. Use a different color for each $i$.
1. See if there is one unique relationship between $y_{it}$ and $x_{it}$ across $i$.
Demean:
Demean:
12.1.2 More than two variables
If $E\left(\tilde{x}_{it}\tilde{z}_{it}\right) \neq 0$, then $\hat\beta$ becomes inconsistent. Worst case: $\beta = 0$ but $E\left(\tilde{x}_{it}\tilde{z}_{it}\right) \neq 0$; then $\operatorname{plim}\hat\beta \neq 0$.
3. Solution: Run
$\tilde{y}_{it} = \beta_1\tilde{x}_{it} + \beta_2\tilde{z}_{it} + \tilde{u}_{it}$
2. Rewrite this as
$\tilde{y}_{it} = \beta\tilde{x}_{it} + \tilde{a}_i + \tilde{u}_{it} \qquad (23)$
$\frac{1}{N}\sum_{i=1}^{N}\tilde{y}_{it} = \beta\frac{1}{N}\sum_{i=1}^{N}\tilde{x}_{it} + \frac{1}{N}\sum_{i=1}^{N}\tilde{a}_i + \frac{1}{N}\sum_{i=1}^{N}\tilde{u}_{it} \qquad (24)$
4. Finally evaluate
$y_{it}^{\dagger} = \tilde{y}_{it} - \frac{1}{N}\sum_{i=1}^{N}\tilde{y}_{it} = y_{it} - \frac{1}{T}\sum_{t=1}^{T}y_{it} - \frac{1}{N}\sum_{i=1}^{N}y_{it} + \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}y_{it} \qquad (25)$
Note: The fixed effects estimator is called either the 'Least Squares Dummy Variable (LSDV)' estimator or the 'Within Group' estimator.
Questions Consider the following data generating process
where
1. Suppose that you run the following cross-section regression for $t = 1$. Prove that the OLS estimate becomes inconsistent in general. That is,
$\operatorname{plim}_{N\to\infty}\hat\beta_1 \neq \beta$
2. Rather than running (29), you run the following cross-sectional regression with time-series averages:
$\bar{y}_i = \alpha + \beta\bar{x}_i + \bar{u}_i \qquad (30)$
where
$\bar{y}_i = \frac{1}{T}\sum_{t=1}^{T}y_{it}, \qquad \bar{x}_i = \frac{1}{T}\sum_{t=1}^{T}x_{it}$
Derive the limiting distribution of $\hat\beta$ in (30). Is the convergence rate equal to $\sqrt{N}$ or $\sqrt{NT}$?
Prove that when 1 the POLS estimator becomes inconsistent. Derive the exact
bias.
Part III (Dynamic Panel Regression I): Consider the following DGP
$y_{it} = a_i + u_{it}, \qquad u_{it} = \rho u_{it-1} + \varepsilon_{it}, \qquad \varepsilon_{it} \sim N\left(0, \sigma^2\right)$
12.3 Dynamic Panel Regression
Read: Bertrand, M., E. Duflo and S. Mullainathan, 2004, How much should we trust differences-in-differences estimates?, Quarterly Journal of Economics, 249-275.
Model:
$y_{it} = a_i + \theta_t + \beta x_{it} + u_{it} \qquad (31)$
Remark 1: As long as $x_{it}$ is exogenous, the LSDV estimator in (31) is consistent. However, the statistical inference (in other words, the $p$-value for $\hat\beta$) becomes an issue (in other words, the critical value for the $t$ statistic on $\hat\beta$ must differ from the ordinary critical value). We will suggest a solution for the statistical inference later (see section 3).
Remark 2: If $T$ is large, then a more efficient estimator can be obtained by running the dynamic panel regression,
or
$y_{it} = a_i + \theta_t + \rho y_{it-1} + \beta x_{it} + \delta x_{it-1} + u_{it} \qquad (33)$
$y_{it}^{\dagger} = \rho y_{it-1}^{\dagger} + \beta x_{it}^{\dagger} + \delta x_{it-1}^{\dagger} + u_{it}^{\dagger}$
See (25) for the definition of '$\dagger$'.
[Figure: sampling distributions of $\hat\beta$ from $y_{it} = a_i + X_{it}'\beta + X_{it-1}'\delta + \rho y_{it-1} + u_{it}$ and from $y_{it} = a_i + X_{it}'\beta + u_{it}$]
Remark 3 (Consistency for $\rho$ and $\delta$): The LSDV estimators for $\rho$ and $\delta$ are inconsistent, but the LSDV estimator for $\beta$ is consistent. So the parameter of interest here is assumed to be $\beta$. Since the estimators for $\rho$ and $\delta$ are inconsistent, the statistical inference for $\beta$ should be constructed carefully. In the next section, we will study how to obtain robust statistical inference regardless of the error term structure.
13 Pooling Panel and Random Effects (Testing: Micro
Panel)
Model
y_it = a_i + βx_it + u_it   (34)
Assumptions
Here we are interested in testing the null hypothesis H₀ : β = 0. To test this null
hypothesis, we need a statistic. Usually we use a formal t-statistic defined by
t̂ = β̂ / √(V̂(β̂))
where V̂(β̂) stands for the sample variance of the point estimate β̂, which depends on the
parametric assumptions for the regression errors.
In what follows, we will study various hypothesis tests and statistics. Before that, I will
address why panel data is useful (and powerful) compared with either cross sectional or
time series regressions.
General statistical panel theory states that the panel gain comes from the use of more data.
However, this statement is not quite right. One may have either a lengthy time series or a
large cross section. Whenever one uses panel data, s/he can use either a short time
series across many individuals, or a few individuals over a somewhat long time series.
For example, many empirical growth regressions have been based on cross sectional studies
due to data limitations. Even though the PWT provides panel data on more than 150 countries,
it is often very hard to obtain a full set of panel data for all 150 countries. Here we consider
which data sets (larger N or T) we should use to increase the panel gain.
To attack this issue, we first consider the rate of convergence concept. Consider the
following simple regression
y_t = βx_t + u_t
where we assume the strong exogeneity of x_t. Typical limiting distribution theory says
β̂ = ∑_{t=1}^T x_t y_t / ∑_{t=1}^T x_t² = β + ∑_{t=1}^T x_t u_t / ∑_{t=1}^T x_t²
√T (β̂ − β) = [(1/√T)∑_{t=1}^T x_t u_t] / [(1/T)∑_{t=1}^T x_t²] := A_T / B_T, let's say,
with A_T ⇒ N(0, Ω²) and B_T →p Q, where '⇒' stands for convergence in distribution and '→p' means convergence in proba-
bility. Then we finally have (by Cramér's theorem)
√T (β̂ − β) ⇒ N(0, Q⁻¹Ω²Q⁻¹)
Alternatively
√T (β̂ − β) / √(Q⁻¹Ω²Q⁻¹) ⇒ N(0, 1)
Meanwhile the testing hypothesis is given by
H₀ : β = 0 usually,
H₁ : β ≠ 0.
Then we have
√T β̂ / √(Q⁻¹Ω²Q⁻¹) ⇒ N( √T β / √(Q⁻¹Ω²Q⁻¹), 1 )
so that the power of the test (how frequently a test can reject the null hypothesis when the
alternative is true) gets larger if
1. the true value of |β| is larger,
2. the asymptotic variance Q⁻¹Ω²Q⁻¹ is smaller, or
3. the number of observations T is larger.
Among them, the last item, 3, is the only thing we can control. We don't know the true value
of β or the true variance of u_t either. However, we can increase the number of observations
(by putting more labor hours into digging out the data).
Now, when we have both N and T dimensions, we can rewrite the pooled estimate of β
as
β̂_panel = ∑_{i=1}^N ∑_{t=1}^T x_it y_it / ∑_{i=1}^N ∑_{t=1}^T x_it² = β + ∑_{i=1}^N ∑_{t=1}^T x_it u_it / ∑_{i=1}^N ∑_{t=1}^T x_it²
and similarly
√(NT) (β̂_panel − β) = A_NT / B_NT, let's say,
and
A_NT ⇒ N(0, Ω²),  B_NT →p Q as N, T → ∞ jointly.   (35)
Then we have
√(NT) (β̂_panel − β) ⇒ N(0, Q⁻¹Ω²Q⁻¹)   (36)
Now consider the above three criteria for the power of the test. Does panel data enable us
to know either the true value of β or the variance of u? The answer is no. Then what about the last one?
Does panel data enable us to use more observations? The answer is not straightforward.
In practice, one often faces a situation like this. When one uses one-dimensional data (for
example a time series), one may choose or select the longest available time series for y_t and x_t.
Denote the size of that sample as T₀. Now if s/he has to use panel data, usually s/he sacrifices
the lengthy time series in order to increase the cross section units. Denote the time series
length s/he will use for the panel data as T. From direct calculation, we have the condition for
the panel gain given by N·T ≥ T₀.
That is, if you have 300 observations of a single series but have to use only 30 of them in order to use the
panel data, then the minimum number of cross sections you have to obtain should
be larger than 10.
However, we may need much larger cross sections if y_t and x_t are I(1) (in other words,
nonstationary). In this case, the limiting distribution for β̂ is different from a normal distri-
bution (it actually becomes a functional of the form (∫W_x dW_y)(∫W_x² dr)⁻¹) and the convergence rate becomes T
rather than √T. Hence the minimum condition for the panel gain changes to
√N · T ≥ T₀
In the above example, you need at least √N ≥ 10, or N ≥ 100.
Unfortunately, most macro data are nonstationary. So the important question
becomes how many observations should be sacrificed to use the panel data. Let λ be the
fraction of the sample you have to sacrifice to use additional cross sections. Then we have
√N (1 − λ) T₀ ≥ T₀,  or  √N = 1/(1 − λ)
so that
N ≥ (1/(1 − λ))²
To decode this formula, let's say you have 120 monthly time series observations initially.
In order to use the panel data, if you have to use 10 annual observations, then 1/(1 − λ) =
120/10 = 12, so that the minimum N becomes 144. Remember that the power of a test
with N = 144 and T = 10 will be exactly the same as the power of the test with T = 120 and
N = 1. However, if you can still use monthly observations but lose 2 years of observations,
then 1/(1 − λ) = 120/96 = 1.25, so that the minimum N becomes 1.5625, which is less than 2.
Hence the power of a test with N = 2 and T = 96 will be larger than that with N = 1 and
T = 120.
So the conclusion follows:
1. When you are interested in the correlation among level (nonstationary) variables, you should use the
panel data set with the largest N × T², rather than the largest N × T.
2. When you are interested in the correlation among (quasi) differenced variables (such
as growth rates), you should use the panel data whose total number of observations
(= N × T) is largest.
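The back-of-the-envelope conditions above reduce to trivial arithmetic; a minimal sketch (T0 denotes the length of the longest available single series):

```python
# Minimum number of cross sections for a panel gain, per the two cases above.
def min_N_stationary(T0, T):
    return T0 / T               # from N*T >= T0

def min_N_nonstationary(T0, T):
    return (T0 / T) ** 2        # from sqrt(N)*T >= T0

print(min_N_stationary(300, 30))      # stationary example: 10 cross sections
print(min_N_nonstationary(300, 30))   # nonstationary example: 100
print(min_N_nonstationary(120, 10))   # monthly -> annual example: 144
print(min_N_nonstationary(120, 96))   # lose 2 years of monthly data: 1.5625
```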
Here we are asking how to estimate Ω² and Q in (35) and (36). First consider Ω², which
can be defined as
Ω² = E[ (1/√(NT)) ∑_{i=1}^N ∑_{t=1}^T x_it u_it ]²
= (1/NT) E(x₁₁u₁₁ + ⋯ + x_1T u_1T + x₂₁u₂₁ + ⋯ + x_NT u_NT)²   (37)
= (1/NT) [ E(x₁₁² u₁₁²) + ⋯ + E(x_1T² u_1T²) + E(x₂₁² u₂₁²) + ⋯ + E(x_NT² u_NT²) ]
+ E(cross products)
If E(u_i1 u_i2) ≠ 0 due to serial correlation, then in general the expected values of the cross
product terms are not equal to zero.
White (1980) suggests the use of the so-called 'heteroskedasticity consistent estimator',
which is given by
Ω̂² = (1/NT) ∑_{i=1}^N x̃′_i û_i û′_i x̃_i
where û_i = (û_i1, …, û_iT)′ and x̃_i = (x̃_i1, …, x̃_iT)′. So the sample covariance matrix becomes
V̂(β̂) = (∑_{i=1}^N x̃′_i x̃_i)⁻¹ (∑_{i=1}^N x̃′_i û_i û′_i x̃_i) (∑_{i=1}^N x̃′_i x̃_i)⁻¹   (38)
and its associated t-statistic becomes
t̂ = β̂ / [ (∑_{i=1}^N x̃′_i x̃_i)⁻¹ (∑_{i=1}^N x̃′_i û_i û′_i x̃_i) (∑_{i=1}^N x̃′_i x̃_i)⁻¹ ]^{1/2}   (39)
Note that if the u_it are iid and homoskedastic, then the above formula can be simplified as
V̂(β̂) = σ̂² (∑_{i=1}^N x̃′_i x̃_i)⁻¹   (40)
which is the sample variance reported in canned statistical packages.
Here are a couple of very important facts:
Recommendation
1. Usually the sample variance in (38) is larger than that in (40). This implies that when
there is either heteroskedasticity or autocorrelation, the standard t-ratio based on (40) is much larger
than its true value.
2. When T is fixed but N is large, the t̂ in (39) is distributed as a normal, so the
standard critical value can be used here. However, when T is large but N is small, the
ratio asymptotically follows a t-distribution with N − 1 degrees of freedom under
homoskedasticity. (Hansen, 2007, Journal of Econometrics, 'Asymptotic Properties of
a Robust Variance Matrix Estimator for Panel Data when T is Large')
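The gap between (38) and (40) is easy to see by simulation. A sketch with a single demeaned regressor, clustering by individual as in (38) (function names and parameter values are illustrative):

```python
# Cluster (panel) robust SE as in (38) versus the iid SE of (40).
import numpy as np

def robust_and_simple_se(xt, ut):
    """xt, ut: (N, T) demeaned regressor and residuals, one cluster per row."""
    sxx = (xt ** 2).sum()
    score_i = (xt * ut).sum(axis=1)              # x_i' u_i, as in (38)
    se_robust = np.sqrt((score_i ** 2).sum()) / sxx
    se_simple = np.sqrt((ut ** 2).mean() / sxx)  # iid formula, as in (40)
    return se_robust, se_simple

rng = np.random.default_rng(1)
N, T, rho = 500, 8, 0.8
x = np.zeros((N, T)); u = np.zeros((N, T))
x[:, 0] = rng.normal(size=N); u[:, 0] = rng.normal(size=N)
for t in range(1, T):                            # both serially correlated
    x[:, t] = rho * x[:, t - 1] + rng.normal(size=N)
    u[:, t] = rho * u[:, t - 1] + rng.normal(size=N)
se_r, se_s = robust_and_simple_se(x - x.mean(1, keepdims=True),
                                  u - u.mean(1, keepdims=True))
print(se_r, se_s)    # the robust SE is clearly larger here
```

With serial correlation in both x and u, the iid formula (40) badly understates the true sampling variance, which is exactly recommendation 1 above.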
13.2 Testing
Size and Power The size of a test stands for the rejection rate of the null when the null
hypothesis is true, meanwhile the power of a test implies the rejection rate of the null when
the alternative is true. Usually we set the size of a test at the significance level. For example,
the critical value for the 5% significance level for a normal distribution (two-sided test)
is 1.96. In other words, we permit ourselves to make a wrong decision at the
5% level (5 out of 100 times). Setting a smaller size means that you want to be more
conservative or don't want to make any mistake, but at the same time it also implies that
the power of the test will be reduced.
Size Distortion You set the size at the 5% significance level. However (especially in the
finite sample), a test does not produce exactly the 5% of the size. If a test over-rejects
the null (when the null is true), then we say that the test suffers from oversize distortion.
The opposite case is undersize distortion. Usually an undersized test is acceptable, since it
simply implies that you will make fewer mistakes. The oversize problem is more serious: an
oversized test rejects the null very often even when the null is true.
Size Problem in Panel Data In univariate case, usually a well designed statistic does
not suffer from the size distortion as (the number of observations) goes to infinity. For
example, the standard t-test for the univariate AR(1) regression produces somewhat serious
size distortion with small T, but as T → ∞ the size distortion goes away.
y_t = a + ρy_t−1 + e_t,  t̂_ρ = (ρ̂ − ρ) / √(V̂(ρ̂)) for ρ < 1   (41)
It is because the asymptotic variance of ρ̂ is designed in this way. However, in the panel
data, the t-ratio produces more size distortion as N → ∞ for fixed T:
y_it = a_i + ρy_it−1 + e_it,  t̂_ρ = (ρ̂_lsdv − ρ) / √(V̂(ρ̂_lsdv)) for ρ < 1   (42)
The underlying reason is simple. When T is small, the test statistic in (41) produces a small
size distortion. In the panel data, this size distortion accumulates as N increases.
Similarly, as T → ∞ for a fixed N, the usual panel statistic in (39) produces more size
distortion if there is heteroskedasticity in the error terms.
The LSDV estimator is 'robust' and consistent whether or not the fixed effects a_i in (34) are
correlated with x_it. Meanwhile the GLS (or random effects) estimator is 'efficient' but con-
sistent only when a_i is not correlated with the regressors. When the number of observations
is small (such as moderately small N and T), GLS becomes an attractive estimator if
a_i is not correlated with the regressors. Naturally, econometricians have developed various test
statistics to investigate whether this condition holds or not.
There are broadly two ways to test the orthogonality between a_i and x_it. The first method
is based on the pooled OLS regression residuals, and the second method is based on the
difference between the LSDV and GLS estimators. We discuss the first method first.
H₀ : E(a_i x_it) = 0 for all t,  H₁ : E(a_i x_it) ≠ 0 for some t   (43)
When u_it in (34) is not serially correlated, these hypotheses can be rewritten in terms of the
pooled OLS residuals v̂_it, where v_it = a_i + u_it. Note that
E(∑_{t=1}^T v̂_it)² = E(∑_{t=1}^T v̂²_it) + E(∑_{t≠s} v̂_it v̂_is)
since the expectation of the cross product terms becomes zero under the null. For large N, also note
that under the alternative and no serial correlation among u_it, we have
E(∑_{t=1}^T v̂_it)² ≥ E(∑_{t=1}^T v̂²_it)
since
E(v_it v_is) = E(a_i + u_it)(a_i + u_is) = σ_a² > 0
It is important to note that if is serially correlated, then BP’s LM test fails.
Hausman's Specification Test  The Hausman test is a fairly general test for misspecification,
and it can be applied to test the null hypothesis in (43). Under the null hypothesis
plim_{N→∞} (β̂_GLS − β̂_LSDV) = 0
since the two estimators are both consistent. However, under the alternative, we have
plim_{N→∞} β̂_GLS ≠ plim_{N→∞} β̂_LSDV
so that
plim_{N→∞} (β̂_GLS − β̂_LSDV) ≠ 0
Hence we can test H₀ by examining whether the distance between β̂_GLS and β̂_LSDV is equal to
zero or not. A typical test statistic in this case is given by
(β̂_GLS − β̂_LSDV)′ [ V̂(β̂_GLS − β̂_LSDV) ]⁻¹ (β̂_GLS − β̂_LSDV) ⇒ χ²_k
where ω̂_ij is the (i, j)th element of Ω̂⁻¹ and Ω̂ is defined at (22).
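A minimal numeric illustration of the idea (not the full Hausman statistic): compare the within (FE) slope with a pooled slope that is consistent only when a_i is uncorrelated with x_it. All names and parameter values here are made up:

```python
# Under the null the two slopes agree; under the alternative they diverge.
import numpy as np

def fe_slope(y, x):
    yd = y - y.mean(axis=1, keepdims=True)
    xd = x - x.mean(axis=1, keepdims=True)
    return (xd * yd).sum() / (xd ** 2).sum()

def pooled_slope(y, x):
    yd = y - y.mean(); xd = x - x.mean()
    return (xd * yd).sum() / (xd ** 2).sum()

rng = np.random.default_rng(2)
N, T, beta = 400, 6, 1.0
a = rng.normal(size=(N, 1))
gaps = {}
for label, x in [("null", rng.normal(size=(N, T))),     # x independent of a_i
                 ("alt", a + rng.normal(size=(N, T)))]: # x correlated with a_i
    y = a + beta * x + rng.normal(size=(N, T))
    gaps[label] = abs(fe_slope(y, x) - pooled_slope(y, x))
print(gaps)   # near zero under the null, large under the alternative
```

The Hausman statistic simply standardizes this gap by its estimated variance and compares it to a chi-square critical value.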
14 Dynamic Panel Regression I (Issues and Problems)
More than a quarter of theoretical studies on panel data focus on dynamic
panel regression. Modelling the 'dynamics' in panel data is critically important. First
we address where the 'dynamic adjustment form' comes from.
Many economic variables such as income, consumption, wage, etc. have the following transi-
tional path
y_t = y* + (y₀ − y*) e^{−λt}
where y* is the steady state outcome. Note that all variables are in logarithms. Rewrite this
model as
y_t = [y* + (y₀ − y*) e^{−λ(t−1)}] e^{−λ} + y* (1 − e^{−λ}) = y* (1 − e^{−λ}) + e^{−λ} y_t−1
By letting ρ = e^{−λ}, a = y*, and adding a random error u_t, which can be exogenous i.i.d.
measurement errors, transitory shocks, etc., we have
y_t = a(1 − ρ) + ρy_t−1 + u_t
This simple growth model generates the time dependence between y_t and y_t−1.
In other words, all growing variables (such as wage, income, height, etc.) are
serially correlated during transition periods.
In general regression models, serial correlation occurs whenever the regression is not
balanced. To understand the balancing concept, consider a simple regression model given by
y_t = a + βx_t + u_t   (44)
Suppose that
1. y_t has a linear trend (b_y ≠ 0 below). Regardless of whether x_t has a linear trend or not, u_t contains a
linear trend. So you have to include a trend in the regression. Why?
(a) Let y_t = a_y + b_y t + y°_t and x_t = a_x + b_x t + x°_t. You don't want to assume that
the deterministic trend terms have a common relationship, since you can't write
it as
y_t = a + βx_t + u_t   (45)
simply because the trend part of the dependent variable is purely nonstochastic. Even when the
dependent variable has a stochastic component (such as y_t = b_y t + y°_t and y_t =
a + βx_t + u_t), as long as b_y ≠ βb_x for any β, the error term includes a linear trend
component.
(b) The relation of interest must be between y°_t and x°_t. In this case, you have to elim-
inate the trend term in the first place by including a linear trend component in
the regression.
(c) If you are interested in analyzing growth rates in y_t and x_t, then you have to
take the first difference to approximate the stochastic growth components. That
is,
Δy_t = b + βΔx_t + error   (46)
2. x_t is serially correlated but y_t is not. Then u_t is serially correlated (the opposite is
not true). In this case, you may want to run the dynamic regression instead.
Here I assume that the error term follows an AR(1) structure for simplicity.
3. y_t is not serially correlated but x_t is very persistent (ρ_x is near unity). And more
importantly, β ≠ 0. Then the regression in (44) is not well specified; it simply becomes an
unbalanced regression. In this case, to balance out the serial correlation, u_t should be
negatively correlated with x_t.
Example: y_t is either a stock return or a depreciation rate, which are almost white noise;
x_t is either an interest rate differential (for UIP) or a dividend ratio (for stock returns).
Both the interest rate differential and the dividend ratio are highly serially correlated. If
β ≠ 0, then u_t should be negatively correlated with x_t.
There are several types of dynamic panel regressions. Depending on the regression type, the
properties of LSDV estimators are quite different. Hence modeling the dynamic panel regression
becomes very important.
M1: y_it = a_i + ρy_it−1 + βx_it + ε_it   (51)
M2: y_it = a_i + βx_it + u_it,  u_it = ρu_it−1 + ε_it
where I didn't include common time effects or linear trend components either. Note that
M1 and M2 can be restated as
M1: y_it − ρy_it−1 = a_i + βx_it + ε_it   (52)
M2: y_it − ρy_it−1 = a_i(1 − ρ) + β(x_it − ρx_it−1) + ε_it   (53)
Note that in M1, the quasi-differenced y_it is related to x_it in level. Meanwhile in M2, it is related to
the quasi-differenced x_it. Alternatively we can nest M1 and M2 as
y_it = a*_i + ρy_it−1 + βx_it + γx_it−1 + ε_it   (54)
Hence if (51) is true, then (54) is not misspecified; γ simply becomes zero when (51) is true.
However, if (54) with γ ≠ 0 is true, then (51) becomes misspecified, which results in inconsistent
estimators for ρ as well as β in (51). In this sense, (54) nests (51).
The economic interpretations differ across the models. M1 states that the quasi-
difference (y_it − ρy_it−1) is explained by x_it. Meanwhile M2 implies that the level of y_it is
explained by x_it. Hence u_it in (51) is usually assumed to follow a white noise process (no
serial correlation). Meanwhile u_it in (54) does not have such a restriction.
Here we analyze why the LSDV estimator under fixed effects becomes inconsistent as N → ∞
with fixed T. The model we study is given by
y_it = a_i + ρy_it−1 + u_it,  u_it ∼ iid(0, σ²)
Nickell Bias (1981, Econometrica)  Nickell extends the so-called Kendall (1954, Biometrika)
bias to the panel data setting.
= [1 − ( )]
1. Write ρ̂ = ρ + A/B, with A = ∑_t ỹ_t−1 ũ_t and B = ∑_t ỹ²_t−1. Then
E(A/B) ≈ E(A)/E(B) − Cov(A, B)/[E(B)]² + ⋯
Note that E(A/B) ≠ E(A)/E(B) usually, due to the asymmetric distribution of ρ̂. In the finite
sample, the empirical distribution of ρ̂ is not normal but skewed to the left a little bit.
This asymmetric distribution yields a small sample bias, but it usually goes away
quickly as the sample size increases.
2. The major bias arises from the first term, E(A)/E(B). To see this,
ρ̂ = ρ + ∑_{t=2}^T ỹ_t−1 ũ_t / ∑_{t=2}^T ỹ²_t−1
Note that
∑_{t=2}^T ỹ_t−1 ũ_t = ∑_{t=2}^T (y_t−1 − ȳ_−1)(u_t − ū) = ∑_{t=2}^T y_t−1 u_t − (1/(T−1)) (∑_{t=2}^T y_t−1)(∑_{t=2}^T u_t)
so that, in expectation,
E ∑_{t=2}^T ỹ_t−1 ũ_t = 0 − (1/(T−1)) E(∑_{t=2}^T y_t−1)(∑_{t=2}^T u_t)
Since
y_t = a + ρy_t−1 + u_t = a/(1 − ρ) + ∑_{j=0}^∞ ρ^j u_t−j
we have E(y_t−1 u_t) = 0 for all t. However
E(∑_{t=2}^T y_t−1)(∑_{t=2}^T u_t) = E(y₁ + ⋯ + y_T−1)(u₂ + ⋯ + u_T) ≠ 0
3. Finally we have
E(ρ̂) = ρ + B₁(ρ, T) + B₂(ρ, T)
where
B₁(ρ, T) = E[∑_{t=2}^T ỹ_t−1 ũ_t] / E[∑_{t=2}^T ỹ²_t−1] = −(1 + ρ)/T + O(T⁻²)
and B₂(ρ, T) = O(T⁻¹) is the higher-order term coming from the ratio expansion.
It is important to know that the first bias, B₁(ρ, T), comes from the correlation
between ỹ_t−1 and ũ_t (which are the regressor and the regression error after the de-
meaning transformation), and the second bias, B₂(ρ, T), comes from the asymmetric
distribution of ρ̂.
4. In panel regressions, the first part of the small-T bias remains perma-
nently as N → ∞. However, the second part of the small sample bias goes away. The
underlying reason is straightforward. As N → ∞, the distribution of ρ̂_LSDV be-
comes symmetric, hence the bias arising from the asymmetric distribution simply
goes away. However, the first bias B₁(ρ, T) does not go away, since this bias arises
from the time series correlation between the regressor, ỹ_it−1, and the regression
error, ũ_it. More formally, we state
plim_{N→∞} (ρ̂_LSDV − ρ) = plim_{N→∞} [ (1/NT) ∑_{i=1}^N ∑_{t=2}^T ỹ_it−1 ũ_it ] / [ (1/NT) ∑_{i=1}^N ∑_{t=2}^T ỹ²_it−1 ]
= [ plim_{N→∞} (1/NT) ∑_{i=1}^N ∑_{t=2}^T ỹ_it−1 ũ_it ] / [ plim_{N→∞} (1/NT) ∑_{i=1}^N ∑_{t=2}^T ỹ²_it−1 ]
= E[ (1/T) ∑_{t=2}^T ỹ_it−1 ũ_it ] / E[ (1/T) ∑_{t=2}^T ỹ²_it−1 ]
= −(1 + ρ)/T + O(T⁻²)
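A small Monte Carlo makes the −(1+ρ)/T approximation concrete; a sketch with N large and T = 10 (parameter values are arbitrary):

```python
# Monte Carlo check of the Nickell bias of the LSDV estimator in
# y_it = a_i(1-rho) + rho*y_it-1 + u_it, for fixed T and large N.
import numpy as np

rng = np.random.default_rng(3)
N, T, rho = 2000, 10, 0.5
a = rng.normal(size=N)
y = np.empty((N, T + 1))
y[:, 0] = a + rng.normal(size=N) / np.sqrt(1 - rho ** 2)  # stationary start
for t in range(1, T + 1):
    y[:, t] = a * (1 - rho) + rho * y[:, t - 1] + rng.normal(size=N)
yl = y[:, :-1] - y[:, :-1].mean(axis=1, keepdims=True)    # demeaned lag
yc = y[:, 1:] - y[:, 1:].mean(axis=1, keepdims=True)
rho_lsdv = (yl * yc).sum() / (yl ** 2).sum()
print(rho_lsdv - rho, -(1 + rho) / T)   # both noticeably negative
```

The simulated bias stays away from zero no matter how large N is, which is the point of the derivation above: only growing T shrinks it.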
Asymptotic Bias when ρ = 1  Nickell (1981) shows the asymptotic bias (i.e., the inconsistency
of ρ̂_LSDV) when ρ < 1. Here we study how the expression of the bias formula badly fails
when ρ = 1.
1. Rewrite the model as
y_t = a(1 − ρ) + ρy_t−1 + u_t
so that, if ρ = 1, then
y_t = y_t−1 + u_t = ⋯ = y₁ + u₂ + ⋯ + u_t
2. In the panel data, we have
E[ (1/(T−1)) ∑_{t=2}^T y²_it−1 ] = (1/(T−1)) ∑_{t=2}^T E(y_i1 + u_i2 + ⋯ + u_it−1)² ≈ σ² (1/(T−1)) ∑_{t=2}^T (t − 2) ≈ σ² T/2
3. Prove that
E(ρ̂_LSDV − 1) = −3/T + O(T⁻²)
E(ρ̂_POLS) = ?
1. We are running
= + −1 + = − +
Note that
¡
¢
[ − ] + −1 ([ − ] (1 − ) + ) = 2
and à !2
1 XX 1 XX
−1 − −1 = 2 + 2
=1 =2 =1 =2
14.5 Asymptotic Distribution of LSDV estimator
ρ̂_LSDV − ρ = ∑_{i=1}^N ∑_{t=2}^T ỹ_it−1 ũ_it / ∑_{i=1}^N ∑_{t=2}^T ỹ²_it−1
= −(1/T) ∑_{i=1}^N (∑_{t=2}^T y_it−1)(∑_{t=2}^T u_it) / ∑_{i=1}^N ∑_{t=2}^T ỹ²_it−1 + ∑_{i=1}^N ∑_{t=2}^T y_it−1 u_it / ∑_{i=1}^N ∑_{t=2}^T ỹ²_it−1
For the second term,
(1/√(NT)) ∑_{i=1}^N ∑_{t=2}^T y_it−1 u_it / (1/NT) ∑_{i=1}^N ∑_{t=2}^T ỹ²_it−1 ⇒ N(0, 1 − ρ²)
since
(1/√(NT)) ∑_{i=1}^N ∑_{t=2}^T y_it−1 u_it ⇒ N(0, σ⁴/(1 − ρ²))
Now we have
√(NT) (ρ̂_LSDV − ρ) = −√(N/T) (1 + ρ) + (1/√(NT)) ∑_{i=1}^N ∑_{t=2}^T y_it−1 u_it / (1/NT) ∑_{i=1}^N ∑_{t=2}^T ỹ²_it−1 + o_p(1)
If N/T → c as (N, T) → ∞, then
√(NT) (ρ̂_LSDV − ρ) ⇒ −√c (1 + ρ) + N(0, 1 − ρ²)
If N/T → ∞ as (N, T) → ∞, then
√(NT) (ρ̂_LSDV − ρ) → ∞
If N/T → 0 as (N, T) → ∞, then
√(NT) (ρ̂_LSDV − ρ) ⇒ N(0, 1 − ρ²)
Empirical Example  Nominal wage = y_it; d_it = treatment variable or dummy.
y_it = a_i + δd_it + u_it,  u_it = ρu_it−1 + ε_it,  ε_it ∼ iid(0, σ²)
Then
δ̂ = ∑_{i=1}^N (∑_{t=1}^T d̃_it ỹ_it) / ∑_{i=1}^N ∑_{t=1}^T d̃²_it = δ + ∑_{i=1}^N (∑_{t=1}^T d̃_it ũ_it) / ∑_{i=1}^N ∑_{t=1}^T d̃²_it
Let
δ̂ − δ = (1/NT) ∑_{i=1}^N (∑_{t=1}^T d̃_it ũ_it) / (1/NT) ∑_{i=1}^N ∑_{t=1}^T d̃²_it
Assume that
d_it = 0 if i ∈ N₁ or t = 1, …, T/2
d_it = 1 if i ∉ N₁ and t = T/2 + 1, …, T
" Ã !#2 " #2
X X X
̃ ̃ = ̃ ̄
=1 =1 =1
where
1X
̄ = ̃
=1
Observe this
" Ã !#2
X X
̃ ̃
=1 =1
⎡ ⎤2
2
1 XX X X
1
= ⎣− ̃ + ̃ ⎦
2 =1 =1 2
=2+1 =1
⎡ ⎛ ⎞2 ⎛ ⎞2 ⎛ ⎞⎛ ⎞⎤
2 2
1 X X 1 X X
1 X X X
X
= ⎣ ⎝ ̃ ⎠ + ⎝ ̃ ⎠ − ⎝ ̃ ⎠ ⎝ ̃ ⎠⎦
4 =1 =1 4 =1
2 =1 =1 =1
=2+1 =2+1
109
Note that if there is no cross section dependence, then the last (third) term becomes zero.
Hence we have
" Ã !#2 ⎛ ⎞2 ⎛ ⎞2
2
X X 1 ⎝ X X 1 X X
̃ ̃ = ̃ ⎠ + ⎝ ̃ ⎠
=1
4 =1 =1
4
=1 =2+1 =1
⎛ ⎞2 ⎛ ⎞2
2
1 ⎝ 2 X X 1 ⎝ 2 X X
= ̃ ⎠ + ̃ ⎠
42 =1 =1 42 =1 =2+1
2 2
2
= + =
8 (1 − )2 8 (1 − )2 4 (1 − )2
where we use the fact
à !2
X 2
̃ = (To students: Prove this)
=1
(1 − )2
Note that à !2
X X
2
̃ ̃2 =
=1 =1
1 − 2
Solution: Use the panel robust HAC estimator. Prove this.
Next, consider the convergence rate: it must be √N here. Why?
Limiting Distribution: major (nice) term and nuisance term  For LSDV,
nice term:  A = ∑∑ y_it−1 u_it / ∑∑ ỹ²_it−1
nuisance term:  B = −(1/T) ∑_i (∑_t y_it−1)(∑_t u_it) / ∑∑ ỹ²_it−1
Note that
ρ̂ − ρ = ∑∑ ỹ_it−1 ũ_it / ∑∑ ỹ²_it−1 = A + B
and A is O_p(1/√(NT)):
√(NT) A = (1/√(NT)) ∑∑ y_it−1 u_it / (1/NT) ∑∑ ỹ²_it−1 ⇒ N(0, 1 − ρ²)
but
B = −(1/T) · [ (1/NT) ∑_i (∑_t y_it−1)(∑_t u_it) ] / [ (1/NT) ∑∑ ỹ²_it−1 ] = −(1/T) · O_p(1)/O_p(1) = O_p(1/T)
Hence
√(NT) (ρ̂ − ρ) = √(NT) A + √(NT) B = O_p(1) + √(N/T) · O_p(1)
so that when N/T → 0 we can ignore the second term.
Sample Final Exam:
(a) Show that the pooled OLS estimator in (56) becomes consistent for fixed and large
. That is,
plim→∞ ̂pols = 1
(c) Show that the within group estimator in (57) becomes inconsistent (for fixed T and large
N).
(d) Suppose that N/T → 0 as (N, T) → ∞. Derive the limiting distribution of ρ̂_FE.
1 Stationary Process
Model: Time invariant mean
y_t = μ + ε_t
Definition:
1. Autocovariance:
γ_jt = E(y_t − μ)(y_t−j − μ) = E(ε_t ε_t−j)
2. Stationarity: If neither the mean nor the autocovariance depends on the date t, then the process
y_t is said to be covariance stationary or weakly stationary.
3. Ergodicity: γ_j → 0 sufficiently fast as j → ∞ (e.g. ∑_j |γ_j| < ∞), so that time averages converge.
For the examples below, ε_t is white noise:
E(ε_t) = 0,  E(ε_t²) = σ²,  E(ε_t ε_s) = 0 for all t ≠ s.
MA(1):
y_t = μ + ε_t + θε_t−1,  ε_t ∼ iid(0, σ_ε²)
γ₀ = E(y_t − μ)² = (1 + θ²) σ_ε²
γ₁ = E(y_t − μ)(y_t−1 − μ) = θ σ_ε²
γ₂ = E(y_t − μ)(y_t−2 − μ) = 0
MA(2)
y_t = μ + ε_t + θ₁ε_t−1 + θ₂ε_t−2
γ₀ = (1 + θ₁² + θ₂²) σ_ε²
γ₁ = (θ₁ + θ₂θ₁) σ_ε²
γ₂ = θ₂ σ_ε²
γ₃ = γ₄ = ⋯ = 0
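The MA(2) autocovariances above are easy to verify by simulation (the θ values here are arbitrary):

```python
# Sample autocovariances of a simulated MA(2) versus the formulas above.
import numpy as np

rng = np.random.default_rng(4)
t1, t2, sig2 = 0.6, 0.3, 1.0
e = rng.normal(scale=np.sqrt(sig2), size=2_000_002)
y = e[2:] + t1 * e[1:-1] + t2 * e[:-2]          # y_t = e_t + t1*e_{t-1} + t2*e_{t-2}

def acov(y, j):
    yc = y - y.mean()
    return (yc ** 2).mean() if j == 0 else (yc[j:] * yc[:len(yc) - j]).mean()

print(acov(y, 0), (1 + t1**2 + t2**2) * sig2)   # gamma_0
print(acov(y, 1), (t1 + t2 * t1) * sig2)        # gamma_1
print(acov(y, 2), t2 * sig2)                    # gamma_2
print(acov(y, 3))                               # ~ 0
```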
MA(∞)
y_t = μ + ∑_{j=0}^∞ θ_j ε_t−j
y_t is stationary if
∑_{j=0}^∞ θ_j² < ∞ : square summable
yt = a (1 )+( 1 + 2 ) yt 1 2 yt 1 + 2 yt 2 + "t
= a (1 ) + yt 1 2 yt 1 + "t
yt = a (1 ) + (1 ) yt 1 2 yt 1 + "t
1.3 Source of MA term
Example 1:
y_t = ρy_t−1 + u_t,  x_t = φx_t−1 + e_t,
where
u_t, e_t = white noise.
Let z_t = y_t + x_t. Then
z_t = ρ(y_t−1 + x_t−1) + (φ − ρ) x_t−1 + u_t + e_t = ρz_t−1 + ε_t
where
ε_t = (φ − ρ) ∑_{j=0}^∞ φ^j e_t−j−1 + u_t + e_t
Example 2:
y_s = ρy_s−1 + u_s,  s = 1, …, S
y_s = ρ² y_s−2 + ρu_s−1 + u_s
Let
x_t = y_s for t = 1, …, T; s = 2, …, S (skip sampling).
Then we have
x_t = ρ² x_t−1 + ε_t,  ε_t = ρu_t−1/2 + u_t
c_T(k) = −2 ln L(k)/T + k φ(T)/T
where φ(T) is a deterministic function. The model (lag length) is selected by minimizing the above criteria
function with respect to k. That is,
k̂ = arg min_k c_T(k)
There are three famous criteria functions:
AIC: φ(T) = 2
BIC (Schwarz): φ(T) = ln T
Hannan–Quinn: φ(T) = 2 ln(ln T)
Let k* be the true lag length. Then the likelihood function must be maximized at k* asymptotically.
That is,
plim_{T→∞} ln L(k*)/T > plim_{T→∞} ln L(k)/T for any k < k*.
For k > k*,
lim_{T→∞} Pr[c_T(k*) ≥ c_T(k)] = lim_{T→∞} Pr[ −2 ln L(k*)/T + k* φ(T)/T ≥ −2 ln L(k)/T + k φ(T)/T ]
= lim_{T→∞} Pr[ 2(ln L(k) − ln L(k*)) ≥ (k − k*) φ(T) ]
Since
2[ln L(k) − ln L(k*)] →d χ²_{k−k*}
for the AIC (φ(T) = 2) we have
lim_{T→∞} Pr[c_T(k*) ≥ c_T(k)] = Pr[ χ²_{k−k*} ≥ 2(k − k*) ] > 0
so the AIC overestimates the true lag length with positive probability even in the limit. Meanwhile, for BIC and Hannan–Quinn,
lim_{T→∞} φ(T) = ∞
so that
lim_{T→∞} Pr[c_T(k*) ≥ c_T(k)] = lim_{T→∞} Pr[ χ²_{k−k*} ≥ (k − k*) φ(T) ] = 0
Hence BIC and Hannan–Quinn's criteria consistently estimate the true lag length.
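The criteria above can be sketched with OLS and the Gaussian concentrated likelihood, for which −2 ln L(k)/T equals ln σ̂²(k) up to a constant (a minimal illustration; the true order and coefficients are made up):

```python
# AR lag selection by minimizing c_T(k) = ln(sigma_hat^2(k)) + k*phi(T)/T.
import numpy as np

rng = np.random.default_rng(5)
T = 5000
y = np.zeros(T + 2)
for t in range(2, T + 2):                  # true model: AR(2), so k* = 2
    y[t] = 0.5 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()
y = y[2:]
kmax = 8

def crit(k, phi):
    yy = y[kmax:]                          # common estimation sample for all k
    cols = [np.ones(len(yy))] + [y[kmax - j - 1:T - j - 1] for j in range(k)]
    X = np.column_stack(cols)
    b, *_ = np.linalg.lstsq(X, yy, rcond=None)
    s2 = ((yy - X @ b) ** 2).mean()
    return np.log(s2) + k * phi / len(yy)  # -2 lnL/T = ln s2 + const

n = T - kmax
bic = [crit(k, np.log(n)) for k in range(kmax + 1)]
aic = [crit(k, 2.0) for k in range(kmax + 1)]
print(int(np.argmin(bic)), int(np.argmin(aic)))
```

With a long sample BIC picks k* = 2, while AIC may pick 2 or more, consistent with its positive limiting overfit probability derived above.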
In practice, the so-called general-to-specific method is also popularly used. The GS method involves the following
sequential steps.
Step 1 Run AR(k_max) and test if the last coefficient is significantly different from zero.
Step 2 If not, let k_max = k_max − 1, and repeat Step 1 until the last coefficient is significant.
The general-to-specific methodology applies conventional statistical tests. So if the significance level for
the tests is fixed, then the order estimator inevitably allows for a nonzero probability of overestimation.
Furthermore, as is typical in sequential tests, this overestimation probability is bigger than the significance
level when there are multiple steps between k_max and p, because the probability of false rejection accumulates
as k steps down from k_max to p.
These problems can be mitigated (and overcome, at least asymptotically) by letting the level of the test
depend on the sample size. More precisely, following Bauer, Pötscher and Hackl (1988), we can set
the critical value C_T in such a way that (i) C_T → ∞, and (ii) C_T/√T → 0 as T → ∞. The critical value
corresponds to the standard normal critical value for the significance level α_T = 1 − Φ(C_T), where Φ(·) is the
standard normal c.d.f. Conditions (i) and (ii) are equivalent to the requirement that the significance level
α_T → 0 and (log α_T)/T → 0 (proved in equation (22) of Pötscher, 1983).
Bauer, P., Pötscher, B. M., and P. Hackl (1988). Model Selection by Multiple Test Procedures. Statistics,
19, 39–44.
Pötscher, B. M. (1983). Order Estimation in ARMA-models by Lagrangian Multiplier Tests. Annals of
Statistics, 11, 872–885.
Asymptotic Distribution for Stationary Process
E(ȳ_T − μ)² = (1/T²) E{ ∑_{t=1}^T (y_t − μ) }² = (1/T²) [ T γ₀ + 2(T − 1) γ₁ + ⋯ + 2 γ_{T−1} ]
= (1/T) [ γ₀ + 2 ((T − 1)/T) γ₁ + ⋯ + (2/T) γ_{T−1} ]
Hence we have
lim_{T→∞} T · E(ȳ_T − μ)² = ∑_{j=−∞}^∞ γ_j
Example: y_t = u_t, u_t = ρu_t−1 + ε_t, ε_t ∼ iid(0, σ²). Then, using γ_j = ρ^j σ²/(1 − ρ²), we have
E(ū_T)² = (1/T) [ γ₀ + 2((T − 1)/T) γ₁ + ⋯ + (2/T) γ_{T−1} ]
= (σ²/(T(1 − ρ²))) [ 1 + 2 ∑_{t=1}^{T−1} ((T − t)/T) ρ^t ]
= (σ²/(T(1 − ρ²))) [ 1 + 2ρ/(1 − ρ) ] + O(T⁻²)
= (1/T) · σ²(1 + ρ)/((1 − ρ²)(1 − ρ)) + O(T⁻²)
= σ²/(T (1 − ρ)²) + O(T⁻²)
where note that
∑_{t=1}^T ((T − t)/T) ρ^t = (ρ/(1 − ρ)²) · (T(1 − ρ) − (1 − ρ^T))/T
Now as T → ∞, we have
lim_{T→∞} T · E(ū_T)² = σ²/(1 − ρ)²
CLT for a mds  Let {y_t}_{t=1}^T be a scalar mds and ȳ_T = T⁻¹ ∑_{t=1}^T y_t. Suppose that (a) E(y_t²) = σ_t² > 0
with T⁻¹ ∑_{t=1}^T σ_t² → σ² > 0, (b) E|y_t|^r < ∞ for some r > 2 and all t, and (c) T⁻¹ ∑_{t=1}^T y_t² →p σ². Then
√T ȳ_T →d N(0, σ²).
Example 1:
y_t = a + u_t,  u_t = ρu_t−1 + e_t,  e_t ∼ iid(0, σ_e²)
Then we have
y_t = a + ∑_{j=0}^∞ ρ^j e_t−j.
Hence
√T (ȳ_T − a) →d N(0, σ_e²/(1 − ρ)²)
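A quick Monte Carlo check of the limiting variance σ_e²/(1 − ρ)² (here 1/(1 − 0.5)² = 4; all parameter values are arbitrary):

```python
# Variance of the scaled sample mean of an AR(1): T*Var(ybar) -> 1/(1-rho)^2.
import numpy as np

rng = np.random.default_rng(6)
a, rho, T, reps = 1.0, 0.5, 2000, 4000
e = rng.normal(size=(reps, T))
u = np.empty((reps, T))
u[:, 0] = e[:, 0] / np.sqrt(1 - rho ** 2)        # stationary start
for t in range(1, T):
    u[:, t] = rho * u[:, t - 1] + e[:, t]
ybar = a + u.mean(axis=1)
v_mc = T * ybar.var()
print(v_mc, 1.0 / (1 - rho) ** 2)                # should both be close to 4
```

Note the long-run variance 1/(1 − ρ)² = 4, not the marginal variance 1/(1 − ρ²) ≈ 1.33: serial correlation inflates the variance of the mean.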
Example 2:
y_t = a(1 − ρ) + ρy_t−1 + ε_t,  ε_t ∼ iid(0, σ²)
ρ̂ = ρ + ∑ (y_t−1 − ȳ_{T,−1})(ε_t − ε̄_T) / ∑ (y_t−1 − ȳ_{T,−1})²
Show the condition that
lim_{T→∞} (1/T) E ∑ (y_t−1 − ȳ_{T,−1})² = Q² < ∞
where Q² = σ²/(1 − ρ²). Calculate
lim_{T→∞} E[ (1/√T) ∑ (y_t−1 − ȳ_{T,−1})(ε_t − ε̄_T) ]²
Show that
√T (ρ̂ − ρ) →d N(0, 1 − ρ²).
y_t = a + ρy_t−1 + e_t,  e_t ∼ iid N(0, 1)
Now consider
E ρ̂ = E[ ∑ ỹ_t ỹ_t−1 / ∑ ỹ²_t−1 ] = ?
Note that in this example, we have
E(A) = E(B) = σ_e²/(1 − ρ²) − σ_e²/(T(1 − ρ)²) + O(1/T²)
and
E(x_t x_{t+k} x_{t+k+l} x_{t+k+l+m}) = ρ^{k+m} (1 + 2ρ^{2l}) / (1 − ρ²)²
if u_t is normal (with σ² normalized to 1).
From this, we can calculate all moments. For an example, we have
" T
#
1 X 2 1 3T X1 1 + 2 2t
E x2t = 2 +2 (T i)
T2 T (1 2 )2 (1 2 )2
t=1
where m stands for a vector of moments. Then taking a Taylor expansion yields
√T e(m) = √T [ e_r m_r + (1/2) e_rs m_r m_s + (1/6) e_rst m_r m_s m_t ] + O_p(1/T²)
where
e_r = ∂e(0)/∂m_r, … etc.
Solving all moments yields
Pr[ √T(ρ̂ − ρ)/√(1 − ρ²) ≤ w ] = Φ(w) + (φ(w)/√T) · (ρ/√(1 − ρ²)) (w² + 1)
where w = x/√(1 − ρ²).
4 Univariate Processes with Unit Roots
y_t = a + u_t,  u_t = u_t−1 + ε_t
Then we have
y_t = y_t−1 + ε_t
Let
y_t = ε₁ + ⋯ + ε_t
where
ε_t ∼ iid N(0, 1).
Then we have
y_t ∼ N(0, t)
and
y_t − y_t−1 = ε_t ∼ N(0, 1),  y_t − y_s ∼ N(0, t − s) for t > s
1. W(0) = 0
2. W(r) − W(s) ∼ N(0, r − s) for r > s, with independent increments
3. W(1) ∼ N(0, 1)
(1/T²) ∑_{t=1}^T y_t² →d ∫₀¹ W² dr
(1/T^{5/2}) ∑_{t=1}^T t y_t−1 →d ∫₀¹ r W dr
If ε_t ∼ N(0, σ²), then we have
T^{−3/2} ∑ y_t →d σ ∫₀¹ W dr
etc.
y_t = y_t−1 + e_t,  e_t ∼ iid(0, σ²),  y₀ = O_p(1)
As T → ∞, we have
T^{−1/2} ∑ e_t →d σ W(1) = N(0, σ²)
T^{−1} ∑ y_t−1 e_t →d (σ²/2) [W(1)² − 1]
Next, consider
(1/√t) y_t = (1/√t) ∑_{s=1}^t e_s →d N(0, σ²)
so that
(1/(σ² t)) y_t² = [ (1/(σ√t)) ∑_{s=1}^t e_s ]² →d N(0, 1)² = χ²₁ for large t.
Also, we have the exact identity
y_t−1 e_t = (1/2)(y_t² − y_t−1² − e_t²)
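The identity above holds exactly for every realization, which one short simulation confirms:

```python
# Check y_{t-1}*e_t = (y_t^2 - y_{t-1}^2 - e_t^2)/2 on a simulated random walk.
import numpy as np

rng = np.random.default_rng(7)
e = rng.normal(size=1000)
y = np.cumsum(e)                                 # y_t = e_1 + ... + e_t, y_0 = 0
lhs = y[:-1] * e[1:]                             # y_{t-1} e_t for t = 2..T
rhs = 0.5 * (y[1:] ** 2 - y[:-1] ** 2 - e[1:] ** 2)
print(np.max(np.abs(lhs - rhs)))                 # zero up to rounding error
```

Summing both sides and normalizing by T is what delivers the (σ²/2)[W(1)² − 1] limit above.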
Now we are ready to prove
T^{−3/2} ∑ t e_t →d σ [ W(1) − ∫₀¹ W(r) dr ]
Proof: since ∑_{t=1}^T t e_t = T y_T − ∑_{t=1}^T y_t−1,
T^{−3/2} ∑_{t=1}^T t e_t = T^{−1/2} y_T − T^{−3/2} ∑_{t=1}^T y_t−1 →d σ W(1) − σ ∫₀¹ W(r) dr.
y_t = y_t−1 + e_t.
Hence we have
T(ρ̂ − 1) = [ T⁻¹ ∑ y_t−1 e_t ] / [ T⁻² ∑ y²_t−1 ] →d (∫₀¹ W² dr)⁻¹ · (1/2)[W(1)² − 1] = ∫₀¹ W dW / ∫₀¹ W² dr.
Consider the t-ratio statistic given by
t_ρ = (ρ̂ − 1)/√(V̂(ρ̂)) = (ρ̂ − 1) / [ s_T² (∑ y²_t−1)⁻¹ ]^{1/2}
where
s_T² = (1/T) ∑ (y_t − ρ̂ y_t−1)² →p σ².
Hence we have
t_ρ →d (1/2)[W(1)² − 1] / (∫₀¹ W² dr)^{1/2}
Note that the lower and upper 2.5 (5.0) % critical values are −2.23 (−1.95) and 1.62 (1.28), which are
very different from ±1.96 (±1.65).
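These nonstandard critical values are easy to reproduce by simulation; a sketch of the no-constant Dickey–Fuller t-ratio (quantiles below are Monte Carlo output, not exact tabulated values):

```python
# Monte Carlo distribution of the DF t-ratio under a pure random walk.
import numpy as np

rng = np.random.default_rng(8)
T, reps = 500, 4000
tstats = np.empty(reps)
for r in range(reps):
    e = rng.normal(size=T)
    y = np.cumsum(e)
    yl, yc = y[:-1], y[1:]
    rho = (yl * yc).sum() / (yl ** 2).sum()      # OLS without constant
    s2 = ((yc - rho * yl) ** 2).mean()
    tstats[r] = (rho - 1.0) / np.sqrt(s2 / (yl ** 2).sum())
print(np.quantile(tstats, 0.025), np.quantile(tstats, 0.975))
```

The simulated 2.5% and 97.5% quantiles sit well to the left of the normal values ±1.96, which is why using standard normal critical values here badly distorts the test size.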
Consider
( √T(â − a), T(ρ̂ − 1) )′ = [ 1, T^{−3/2} ∑ y_t−1 ; T^{−3/2} ∑ y_t−1, T⁻² ∑ y²_t−1 ]⁻¹ [ T^{−1/2} ∑ e_t ; T⁻¹ ∑ y_t−1 e_t ]
→d [ 1, ∫W(r) dr ; ∫W(r) dr, ∫W(r)² dr ]⁻¹ [ W(1) ; (1/2)(W(1)² − 1) ]
Hence
T(ρ̂ − 1) →d [ (1/2)(W(1)² − 1) − W(1) ∫W dr ] / [ ∫W² dr − (∫W dr)² ] = ∫W̃ dW / ∫W̃² dr
where W̃(r) = W(r) − ∫₀¹ W(s) ds is demeaned Brownian motion. Similarly the t-ratio statistic is given by
t_ρ →d [ (1/2)(W(1)² − 1) − W(1) ∫W dr ] / { ∫W² dr − (∫W dr)² }^{1/2}.
Note that
(1/T) ∑ Δy_t−j e_t = O_p(1/√T)
so that as T → ∞, the augmented term goes away. Hence the limiting distribution does not change at all.
1. No steady state. No static mean or average exists. The series wanders randomly around its 'true mean'
and never converges to it.
2. No equilibrium, since there is no steady state. One cannot forecast or predict its future value without
considering other nonstationary variables.
Further let ρ = 0. Now x_t is not weakly stationary, since its variance is time varying. However, at the same
time, x_t does not follow a unit root process, since ρ = 0. To see this, let's derive the limiting distribution of ρ̂:
ρ̂ = ∑ x_t−1 ε_t / ∑ x²_t−1
Under the time-varying variance assumed here, one can show
(1/T) ∑ x_t−1 ε_t →d N(0, σ⁴/2)  and  (1/T²) ∑ x²_t−1 →p σ²/2.
Hence we have
T(ρ̂ − ρ) = [ (1/T) ∑ x_t−1 ε_t ] / [ (1/T²) ∑ x²_t−1 ] →d N(0, 2) = √2 W(1).
Therefore, we can see that the convergence rate is still T, but the limiting distribution is not a function of
Brownian motion at all.
Unit Root Test Considering Finite Sample Bias  Consider the following recursive mean adjustment
ȳ*_t = (1/(t − 1)) ∑_{s=1}^{t−1} y_s
y_t − ȳ*_t = a + ρ(y_t−1 − ȳ*_t) + (ρ − 1) ȳ*_t + e_t.
y_t = ρ_n y_t−1 + u_t,  t = 1, …, n,
where
ρ_n = 1 − c/n^α,  0 < α < 1 and c > 0.
Then we have
√n (ρ̂_n − ρ_n) →d N(0, 1 − ρ_n²)
so that
[ √n / √(1 − ρ_n²) ] (ρ̂_n − ρ_n) →d N(0, 1).
Note that
1 − ρ_n² = 1 − (1 − 2c/n^α + c²/n^{2α}) = 2c/n^α − c²/n^{2α} = O(n^{−α}) + O(n^{−2α}),
hence
√(1 − ρ_n²) = (√(2c)/n^{α/2}) √(1 − c/(2n^α)),
and
√n / √(1 − ρ_n²) = n^{1/2} n^{α/2} / √(2c) + O(n^{−α+1/2}).
Finally we have
n^{(α+1)/2} (ρ̂_n − ρ_n) →d N(0, 2c)
For the case of α = 1, we call it local to unity, for which the limiting distribution is different (Phillips 1987,
Biometrika; 1988, Econometrica). Consider the following simple DGP
y_t = ρ_n y_t−1 + u_t,  ρ_n = exp(c/T) ≈ 1 + c/T
Now, define
J_c(r) = ∫₀^r e^{(r−s)c} dW(s)
where J_c(r) is a Gaussian process which, for fixed r > 0, has the distribution
J_c(r) ∼ N(0, (e^{2rc} − 1)/(2c)),
so that
n(ρ̂_n − 1 − c) →d [ ∫ J_c dW + (1/2)(1 − σ_u²/σ²) ] / ∫ J_c² dr
and, letting σ_u² = σ² (the AR(1) case),
n(ρ̂_n − 1) →d c + ∫ J_c dW / ∫ J_c² dr  for c < 0
Explosive Series
ρ_n = 1 + c/n, for c > 0.
As n → ∞, ρ_n → 1, but for fixed n, ρ_n > 1. Note that if ρ > 1 and y₀ = 0, then the limiting distribution
(derived by White, 1958) is given by
[ ρⁿ/(ρ² − 1) ] (ρ̂ − ρ) →d C as n → ∞
where C is a Cauchy distribution. From this, consider
ρ_n² − 1 = 2c/n + c²/n²
so that
ρ_nⁿ/(ρ_n² − 1) = n ρ_nⁿ/(2c + c²/n) ≈ n ρ_nⁿ/(2c).
Hence we have
(n ρ_nⁿ/(2c)) (ρ̂_n − ρ_n) →d C.
Note: White considered the moment generating function first and converted it to the pdf.