Econometrics

Dec 16, 2015

Contents

1 Review of simple regression
  1.1 The Sample Regression Function
  1.2 Interpretation of regression as prediction
  1.3 Regression in Eviews
  1.4 Goodness of fit
  1.5 Derivations
    1.5.1 Summation notation
    1.5.2 Derivation of OLS
    1.5.3 Properties of predictions and residuals

2 Statistical Inference and the Population Regression Function
  2.1 Simple random sample
  2.2 Population distributions and parameters
  2.3 Population vs Sample
  2.4 Conditional Expectation
  2.5 The Population Regression Function
  2.6 Statistical Properties of OLS
    2.6.1 Properties of Expectations
    2.6.2 Unbiasedness
    2.6.3 Variance
    2.6.4 Asymptotic normality
  2.7 Summary

  3.1 Hypothesis testing
    3.1.1 The null hypothesis
    3.1.2 The alternative hypothesis
    3.1.3 The null distribution
    3.1.4 The alternative distribution
    3.1.5 Decision rules and the significance level
    3.1.6 The t test: theory
    3.1.7 The t test: two sided example
    3.1.8 The t test: one sided example
    3.1.9 p-values
    3.1.10 Testing other null hypotheses
  3.2 Confidence intervals
  3.3 Prediction intervals
    3.3.1 Derivations

4 Multiple Regression
  4.1 Population Regression Function
  4.2 Sample Regression Function and OLS
  4.3 Example: house price modelling
  4.4 Statistical Inference
  4.5 Applications to house price regression
  4.6 Joint hypothesis tests
  4.7 Multicollinearity
    4.7.1 Perfect multicollinearity
    4.7.2 Imperfect multicollinearity

5 Dummy Variables
  5.1 Estimating two means
  5.2 Estimating several means
  5.3 Dummy variables in general regressions
    5.3.1 Dummies for intercepts
    5.3.2 Dummies for slopes

  6.1 Quadratic regression
    6.1.1 Example: wages and work experience
  6.2 Regression with logs: explanatory variable
    6.2.1 Example: wages and work experience
  6.3 Regression with logs: dependent variable
    6.3.1 Example: modelling the log of wages
    6.3.2 Choosing between levels and logs for the dependent variable
  6.4 Practical summary of functional forms

7 Comparing regressions
  7.1 Adjusted R2
  7.2 Information criteria
  7.3 Adjusted R2 as an IC

8 Functional form

  9.1 Notation
  9.2 Regression for prediction
  9.3 Omitted variables
  9.4 Simultaneity
  9.5 Sample selection

  10.1 Dynamic regressions
    10.1.1 Finite Distributed Lag model
    10.1.2 Autoregressive Distributed Lag model
    10.1.3 Forecasting
    10.1.4 Application
  10.2 OLS estimation
    10.2.1 Bias
  10.3 Checking weak dependence
  10.4 Model specification
  10.5 Interpretation
    10.5.1 Interpretation of FDL models
    10.5.2 Interpretation of ARDL models

11 Regression in matrix notation
  11.1 Definitions
  11.2 Addition and Subtraction
  11.3 Multiplication
  11.4 The PRF
  11.5 Matrix Inverse
  11.6 OLS in matrix notation
    11.6.1 Proof
  11.7 Unbiasedness of OLS
  11.8 Time series regressions

1 Review of simple regression

1.1 The Sample Regression Function

Regression is the primary statistical tool used in econometrics to understand the relationship between variables. To illustrate, consider the dataset introduced in Example 2.3 of Wooldridge for relating the salary paid to corporate chief executive officers to the return on equity achieved by their firms. Data is available for 209 firms. The idea is to examine whether the salaries paid to CEOs are related to the earnings of their firms, and specifically whether firms with higher incomes reward their CEOs with higher salaries. A scatter plot of the possible relationship is shown in Figure 1, which reveals the possibility of higher returns on equity corresponding to higher CEO salaries, but with some apparently high salaries for a small number of CEOs also included (these are known as outliers, to be discussed later).

A regression line can be fit to these data using the method of Ordinary Least Squares (OLS), as shown in Figure 2. The OLS method works as follows. The dependent variable for the regression is denoted $y_i$, where the subscript $i$ refers to the number of the observation for $i = 1, \ldots, n$. In the example we have $n = 209$ and $y_i$ corresponds to the CEO salary for each of the 209 firms. The explanatory variable, or regressor, is denoted $x_i$ for $i = 1, \ldots, n$ and corresponds to the Return on Equity for each of the 209 firms. The data are shown in Table 1. The first observation in the dataset is $y_1 = 1095$ and $x_1 = 14.10$, meaning that the CEO of the first firm earns $1,095,000 and the firm's Return on Equity is 14.10%. The second observation is $y_2 = 1001$ and $x_2 = 10.90$, and so on, down to the last observation $y_{209} = 626$ and $x_{209} = 14.40$.

The regression line is a linear function of $x_i$ that is used to calculate a prediction of $y_i$, denoted $\hat{y}_i$. This regression line is expressed

$$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i, \qquad i = 1, \ldots, n. \tag{1}$$

This is called the Sample Regression Function (SRF). The hat on top of any quantity implies that it is a prediction or an estimate that is calculated from the data. The method of OLS is used to calculate $\hat\beta_0$ and $\hat\beta_1$, respectively the intercept and the slope of the regression line. The prediction errors, or regression residuals, are denoted

$$\hat{u}_i = y_i - \hat{y}_i, \qquad i = 1, \ldots, n, \tag{2}$$

and OLS chooses the values of $\hat\beta_0$ and $\hat\beta_1$ such that the overall residuals $(\hat{u}_1, \ldots, \hat{u}_n)$ are minimised, in the sense that the Sum of Squared Residuals (SSR)

$$\mathrm{SSR} = \sum_{i=1}^n \hat{u}_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

is as small as possible. This is the sense in which the OLS regression line is known as the line of best fit.

The formulae for $\hat\beta_0$ and $\hat\beta_1$ are given by

$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \tag{3}$$

and

$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}, \tag{4}$$

where

$$\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i.$$
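As a numerical cross-check of formulas (3) and (4), they can be implemented in a few lines. The sketch below is in Python (not part of the course software, which is Eviews) and uses a small invented dataset rather than the CEO data:

```python
# OLS slope and intercept computed directly from formulas (3) and (4).
def ols(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Formula (3): sum of cross-deviations over sum of squared x-deviations.
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    # Formula (4): intercept chosen so the line passes through the means.
    b0 = ybar - b1 * xbar
    return b0, b1

# Invented illustrative data, roughly y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = ols(x, y)
print(round(b0, 3), round(b1, 3))  # → 0.05 1.99
```

Note that formula (4) says the fitted line always passes through the point of sample means $(\bar{x}, \bar{y})$.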

For the CEO salary data, the coefficients of the regression line can be calculated to be $\hat\beta_0 = 963.191$ and $\hat\beta_1 = 18.501$, so the regression line can be written

$$\hat{y}_i = 963.191 + 18.501\, x_i,$$

or equivalently using the names of the variables:

$$\widehat{\mathrm{salary}}_i = 963.191 + 18.501\, \mathrm{RoE}_i.$$

The interpretation of this regression line is that it gives a prediction of CEO salary in terms of the return on equity of the firm. For example, for the first firm the predicted salary on the basis of return on equity is

$$\hat{y}_1 = 963.191 + 18.501\, x_1 = 963.191 + 18.501 \times 14.10 = 1224.1,$$

or $1,224,100, and the residual is

$$\hat{u}_1 = y_1 - \hat{y}_1 = 1095 - 1224.1 = -129.1,$$

or $-129,100$. That is, the CEO of the first company in the dataset is earning $129,100 less than predicted by the firm's return on equity. Table 2 gives some of the values of $\hat{y}_i$ and $\hat{u}_i$ corresponding to those values of $y_i$ and $x_i$ given in Table 1.
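The fitted values and residuals in Table 2 can be reproduced from the reported coefficients. A short Python sketch (coefficients and data rows taken from the tables in this section):

```python
# Reported OLS coefficients from the fitted CEO salary regression.
b0, b1 = 963.191, 18.501

# (observation i, salary y_i, return on equity x_i) from Table 1.
rows = [(1, 1095, 14.10), (2, 1001, 10.90), (3, 1122, 23.50)]

for i, y, x in rows:
    y_hat = b0 + b1 * x   # prediction from the SRF, equation (1)
    u_hat = y - y_hat     # residual, equation (2)
    print(i, round(y_hat, 1), round(u_hat, 1))
```

The printed values match the Predicted Salary and Residual columns of Table 2 up to rounding.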

[Figure 1: Scatter plot of SALARY against ROE.]

[Figure 2: Scatter plot of SALARY against ROE with fitted OLS regression line.]

Table 1: CEO salaries and Return on Equity

Observation (i)   Salary (yi)   Return on Equity (xi)
      1              1095              14.10
      2              1001              10.90
      3              1122              23.50
     ...              ...                ...
    208               555              13.70
    209               626              14.40

Table 2: CEO salaries and Return on Equity, with regression predictions and residuals

Observation (i)   Salary (yi)   Return on Equity (xi)   Predicted Salary (y^i)   Residual (u^i)
      1              1095              14.10                  1224.1                 -129.1
      2              1001              10.90                  1164.9                 -163.9
      3              1122              23.50                  1398.0                 -276.0
     ...              ...                ...                    ...                    ...
    208               555              13.70                  1216.7                 -661.7
    209               626              14.40                  1229.6                 -603.6

1.2 Interpretation of regression as prediction

The interpretation of the regression coefficients $\hat\beta_0$ and $\hat\beta_1$ relies on the interpretation of regression as giving predictions for $y_i$ using $x_i$. For a general regression equation

$$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i,$$

the interpretation of $\hat\beta_0$ is that it is the predicted value of $y_i$ when $x_i = 0$. It depends on the application whether $x_i = 0$ is practically relevant. In the CEO salary example, a firm with zero return on equity (i.e. net income of zero) is predicted to have a CEO with a salary of $963,191. Such a prediction has some value in this case because it is possible for a firm to have zero net income in a particular year, and the data contains observations where the return on equity is quite close to zero. As a different example, if we had a regression of individual wages on age of the form $\widehat{\mathrm{wage}}_i = \hat\beta_0 + \hat\beta_1\, \mathrm{age}_i$, it would make no practical sense to predict the wage of an individual of age zero! In this case the intercept coefficient $\hat\beta_0$ does not have a natural interpretation.

The slope coefficient $\hat\beta_1$ measures the change in the predicted value $\hat{y}_i$ that would follow from a one unit increase in the regressor $x_i$. The predicted value of $y_i$ when the regressor takes the value $x_i$ is $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$, while the predicted value when the regressor takes the value $x_i + 1$ is $\hat\beta_0 + \hat\beta_1 (x_i + 1)$. The change in prediction based on this one unit change in $x_i$ is therefore $\hat\beta_0 + \hat\beta_1 (x_i + 1) - (\hat\beta_0 + \hat\beta_1 x_i) = \hat\beta_1$. In the CEO salary example, an increase of one percentage point in a firm's return on equity corresponds to a predicted increase of 18.501 ($18,501) in CEO salary. This quantifies how increases in firm income change our prediction for CEO salary. Econometrics is especially concerned with the estimation and interpretation of such slope coefficients.

1.3 Regression in Eviews

Eviews is statistical software designed specifically for econometric analysis. Data can be read in from Excel files and then easily analysed using OLS regression. The steps to carry out the CEO salary analysis in the previous section are presented here.

Figure 3 shows part of an Excel spreadsheet containing the CEO salary data. The variable names are in the first row, followed by the observations for each variable. To open this file in Eviews, go to File - Open - Foreign Data as Workfile... as shown in Figure 4, and select the Excel file in the subsequent window. On opening the file, the dialog boxes in Figures 5, 6 and 7 can often be left unchanged. The first specifies the range of the data within the workfile (in this case the first two columns of Sheet 1), the second specifies that the variable names are contained in the first row of the spreadsheet, and the third specifies that a new workfile be created in Eviews to contain the data. For simple data sets such as this, the defaults in these dialog boxes will be correct. More involved data sets will be considered later. On clicking Finish in the final dialog box, the new workfile is displayed in Eviews, see Figure 8.

The Range of the workfile specifies the total number of observations available for analysis, in this case 209. The Sample of the workfile specifies which observations are currently being used for analysis, and this defaults to the full range of the workfile unless otherwise specified. There are four objects displayed in the workfile: c, resid, roe and salary. The first two of these will be present in any workfile. The c and resid objects contain the coefficient values and residuals from the most recently estimated regression. The objects roe and salary contain the data on those two variables. For example, double clicking on salary gives the object view shown in Figure 9, where the observations can be seen. Many other views are possible, but a common and important first step is to obtain some graphical and statistical summaries by selecting View - Descriptive Statistics & Tests - Histogram and Stats as shown in Figure 10. This results in Figure 11, where the histogram gives an idea of the distribution of the variable and the descriptive statistics provide an idea of the measures of central tendency (mean, median), dispersion (maximum, minimum, standard deviation) and other measures. The mean CEO salary is $1,281,120 while the median is $1,039,000. The substantial difference between these two statistics is because there are at least three very large salaries that are very influential on the mean, but not the median. These observations were also evident in the scatter plot in Figure 1. The same descriptive statistics can be obtained for the Return on Equity variable.

Figure 6: ... specify that the first header line contains the variable names (salary and ROE)...

The scatter plots in Figure 1 or 2 can be obtained by selecting Quick - Graph... as shown in Figure 12, entering roe salary into the resulting Series List box as shown in Figure 13, and then specifying a Scatter with regression line (if desired) as shown in Figure 14. The result is Figure 2.

The regression equation itself can be computed by selecting Quick - Estimate Equation... as shown in Figure 15, and then specifying the equation as shown in Figure 16. The dependent variable (salary) for the regression equation goes first, the c refers to the intercept of the equation $\hat\beta_0$, and then comes the explanatory variable (roe). The results of the regression calculation are shown in Figure 17. In particular the values of the intercept $\hat\beta_0 = 963.191$ and the coefficient on RoE $\hat\beta_1 = 18.50119$ can be read from the Coefficient column of the tabulated results. The equation can be named as shown in Figure 18, which means that it will appear as an object in the workfile and can be saved for future reference.

To obtain the predicted values $\widehat{\mathrm{salary}}_i$ for the regression, click on the Forecast button and enter a new variable name in the Forecast name box, say salary_hat, as shown in Figure 19. A new object called salary_hat is created in the workfile and double clicking on it reveals the values shown in Figure ??, the first three of which correspond to the values given in Table 2 for $\hat{y}_i$.

To obtain the residuals $\hat{u}_i$ for the regression select Proc - Make Residual Series in the equation window as shown in Figure 20 and name the new residuals object as shown in Figure 21. The resulting residuals for the CEO salary regression are shown in Figure 22, the first three of which correspond to the values given in Table 2 for $\hat{u}_i$.

Figure 11: Descriptive statistics and histogram for the CEO salaries (Series: SALARY, Sample 1 209, Observations 209; Mean 1281.120, Median 1039.000, Maximum 14822.00, Minimum 223.0000, Std. Dev. 1372.345, Skewness 6.854923, Kurtosis 60.54128)

Figure 13: Variables to plot; roe first because it goes on the x-axis.


Figure 16: Specifying a regression of CEO salary on an intercept and Return on Equity


Figure 19: Use the Forecast procedure to calculate predicted values from the regression.


1.4 Goodness of fit

The equation (2) that defines the regression residuals can be written

$$y_i = \hat{y}_i + \hat{u}_i, \tag{5}$$

which states that the regression decomposes each observation into a prediction $\hat{y}_i$ that is a function of $x_i$, and the residual $\hat{u}_i$. Let $\widehat{\mathrm{var}}(y_i)$ denote the sample variance of $y_1, \ldots, y_n$:

$$\widehat{\mathrm{var}}(y_i) = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2,$$

and similarly $\widehat{\mathrm{var}}(\hat{y}_i)$ and $\widehat{\mathrm{var}}(\hat{u}_i)$ are the sample variances of $\hat{y}_1, \ldots, \hat{y}_n$ and $\hat{u}_1, \ldots, \hat{u}_n$. Some simple algebra (in section 1.5.3 below) shows that

$$\widehat{\mathrm{var}}(y_i) = \widehat{\mathrm{var}}(\hat{y}_i) + \widehat{\mathrm{var}}(\hat{u}_i). \tag{6}$$

(Note that (6) does not follow automatically from (5) and requires the additional property that $\sum_{i=1}^n \hat{y}_i \hat{u}_i = 0$.) Equation (6) shows that the variation in $y_i$ can be decomposed into the sum of the variation in the regression predictions $\hat{y}_i$ and the variation in the residuals $\hat{u}_i$. The variation of the regression predictions is referred to as the variation in $y_i$ that is explained by the regression. A common descriptive statistic is

$$R^2 = \frac{\widehat{\mathrm{var}}(\hat{y}_i)}{\widehat{\mathrm{var}}(y_i)},$$

which measures the goodness of fit of a regression as the proportion of variation in the dependent variable that is explained by the variation in $x_i$. The $R^2$ is known as the coefficient of determination and lies between 0 and 1. The closer $R^2$ is to one, the better the regression is said to fit. Note that this is just one criterion by which to evaluate the quality of a regression, and others will be given during the course.

The $R^2$ can equivalently be expressed in terms of sums of squares rather than sample variances. Equation (6) can be written

$$\frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2 = \frac{1}{n-1}\sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \frac{1}{n-1}\sum_{i=1}^n \hat{u}_i^2,$$

where use is made of $\frac{1}{n}\sum_{i=1}^n \hat{u}_i = 0$ and $\frac{1}{n}\sum_{i=1}^n \hat{y}_i = \bar{y}$ (see section 1.5.3 for the derivation). Cancelling the $1/(n-1)$ gives

$$\mathrm{SST} = \mathrm{SSE} + \mathrm{SSR},$$

where

$$\mathrm{SST} = \sum_{i=1}^n (y_i - \bar{y})^2, \qquad \mathrm{SSE} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2, \qquad \mathrm{SSR} = \sum_{i=1}^n \hat{u}_i^2,$$

and

$$R^2 = \frac{\mathrm{SSE}}{\mathrm{SST}}.$$
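The decomposition SST = SSE + SSR and the definition R^2 = SSE/SST are easy to verify numerically. A Python sketch on invented data:

```python
# Fit OLS on invented data, then check SST = SSE + SSR (equation (6))
# and compute R^2 = SSE / SST.
x = [1, 2, 3, 4, 5, 6]
y = [1.2, 2.6, 2.9, 4.4, 4.9, 6.3]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
     sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar

y_hat = [b0 + b1 * a for a in x]
u_hat = [b - yh for b, yh in zip(y, y_hat)]

SST = sum((b - ybar) ** 2 for b in y)
SSE = sum((yh - ybar) ** 2 for yh in y_hat)
SSR = sum(u ** 2 for u in u_hat)
R2 = SSE / SST

assert abs(SST - (SSE + SSR)) < 1e-9  # the decomposition holds
print(round(R2, 3))
```

Because the invented data are nearly linear in x, the resulting R^2 is close to one; less linear data would give a smaller value.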

The $R^2$ for the CEO salary regression in Figure 17 is 0.0132, so that just 1.32% of the variation in CEO salaries is explained by the Return on Equity of the firm. This low $R^2$ (i.e. close to zero) need not imply the regression is useless, but it does imply that CEO salaries are determined by other important factors besides just the profitability of the firm.

Some intuition for what $R^2$ is measuring can be found in Figures 23 and 24, which show two hypothetical regressions with $R^2 = 0.185$ and $R^2 = 0.820$ respectively. The data in Figure 24 are less dispersed around the regression line, so that changes in $x_i$ more precisely predict changes in $y_{2,i}$ than $y_{1,i}$. There is more variation in $y_{1,i}$ that is left unexplained by the regression.

Figure 25 gives one example of how $R^2$ does not always provide a foolproof measure of the quality of a regression. The regression in Figure 25 has $R^2 = 0.975$, very close to the maximum possible value of one, but the scatter plot clearly reveals that the regression does not explain an important feature of the relationship between $y_{3,i}$ and $x_i$: there is some curvature or nonlinearity that is not captured by the regression. A high $R^2$ is a nice property for a regression to have, but is neither necessary nor sufficient for a regression to be useful.

1.5 Derivations

1.5.1 Summation notation

It will be necessary to know some simple properties of summation operators to follow the derivations. The summation operator is defined by

$$\sum_{i=1}^n a_i = a_1 + a_2 + \cdots + a_n.$$

[Figure 23: Scatter plot of Y1 against X with fitted regression line.]

[Figure 24: Scatter plot of Y2 against X with fitted regression line.]

[Figure 25: Scatter plot of Y3 against X with fitted regression line.]

It follows that

$$\sum_{i=1}^n (a_i + b_i) = \sum_{i=1}^n a_i + \sum_{i=1}^n b_i.$$

Similarly, for any constant $c$,

$$\sum_{i=1}^n c = \underbrace{c + c + \cdots + c}_{n\ \text{times}} = nc, \qquad \sum_{i=1}^n c a_i = c \sum_{i=1}^n a_i.$$

The sample mean of $a_1, \ldots, a_n$ is

$$\bar{a} = \frac{1}{n}\sum_{i=1}^n a_i,$$

and then

$$\sum_{i=1}^n (a_i - \bar{a}) = \sum_{i=1}^n a_i - n\bar{a} = n\bar{a} - n\bar{a} = 0,$$

so that the sum of deviations around the sample mean is always exactly zero.

A related identity is

$$\sum_{i=1}^n (a_i - \bar{a})^2 = \sum_{i=1}^n \left( a_i^2 - 2 a_i \bar{a} + \bar{a}^2 \right) = \sum_{i=1}^n a_i^2 - 2\bar{a} \sum_{i=1}^n a_i + n \bar{a}^2 = \sum_{i=1}^n a_i^2 - 2n\bar{a}^2 + n\bar{a}^2 = \sum_{i=1}^n a_i^2 - n \bar{a}^2. \tag{7}$$
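Both summation identities (the zero sum of deviations, and identity (7)) can be confirmed numerically on arbitrary data. A Python sketch using random draws:

```python
import random

# Any list of numbers will do; here 100 standard normal draws.
random.seed(0)
a = [random.gauss(0, 1) for _ in range(100)]
n = len(a)
abar = sum(a) / n

# Sum of deviations around the sample mean is zero (up to float error).
dev_sum = sum(ai - abar for ai in a)

# Identity (7): sum of squared deviations = sum of squares - n * abar^2.
lhs = sum((ai - abar) ** 2 for ai in a)
rhs = sum(ai ** 2 for ai in a) - n * abar ** 2

print(abs(dev_sum) < 1e-9, abs(lhs - rhs) < 1e-9)  # → True True
```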

1.5.2 Derivation of OLS

Consider an arbitrary linear predictor of $y_i$,

$$\tilde{y}_i = b_0 + b_1 x_i,$$

where $b_0$ and $b_1$ could be any coefficients. The residuals from this predictor are

$$\tilde{u}_i = y_i - \tilde{y}_i = y_i - b_0 - b_1 x_i.$$

The idea of OLS is to choose the values of $b_0$ and $b_1$ that minimise the sum of squared residuals

$$\mathrm{SSR}(b_0, b_1) = \sum_{i=1}^n \tilde{u}_i^2 = \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2.$$

The minimisation can be done using calculus. The first derivatives of $\mathrm{SSR}(b_0, b_1)$ with respect to $b_0$ and $b_1$ are

$$\frac{\partial\, \mathrm{SSR}(b_0, b_1)}{\partial b_0} = -2 \sum_{i=1}^n (y_i - b_0 - b_1 x_i), \qquad \frac{\partial\, \mathrm{SSR}(b_0, b_1)}{\partial b_1} = -2 \sum_{i=1}^n x_i (y_i - b_0 - b_1 x_i).$$

Setting these first derivatives to zero at the desired estimators $\hat\beta_0$ and $\hat\beta_1$ gives the first order conditions

$$\sum_{i=1}^n \left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right) = 0 \tag{8}$$

$$\sum_{i=1}^n x_i \left( y_i - \hat\beta_0 - \hat\beta_1 x_i \right) = 0. \tag{9}$$

(See equations (2.14) and (2.15) of Wooldridge, who takes a different approach to arrive at these equations.)

The first equation can be written

$$\sum_{i=1}^n y_i - n\hat\beta_0 - \hat\beta_1 \sum_{i=1}^n x_i = 0,$$

and dividing through by $n$,

$$\bar{y} - \hat\beta_0 - \hat\beta_1 \bar{x} = 0,$$

which rearranges to equation (4) for $\hat\beta_0$.

Substituting this expression for $\hat\beta_0$ into the second equation gives

$$\sum_{i=1}^n x_i \left( (y_i - \bar{y}) - \hat\beta_1 (x_i - \bar{x}) \right) = 0,$$

or

$$\sum_{i=1}^n x_i (y_i - \bar{y}) - \hat\beta_1 \sum_{i=1}^n x_i (x_i - \bar{x}) = 0.$$

Notice that

$$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n x_i (y_i - \bar{y}), \qquad \text{since } \sum_{i=1}^n (y_i - \bar{y}) = 0,$$

and similarly

$$\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i (x_i - \bar{x}).$$

Hence

$$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) - \hat\beta_1 \sum_{i=1}^n (x_i - \bar{x})^2 = 0,$$

so that solving for $\hat\beta_1$ gives equation (3).
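The first order conditions (8) and (9) are two linear equations in the two unknowns, so they can also be solved directly as a 2x2 system of "normal equations". The Python sketch below (invented data) checks that this route gives the same answer as formula (3):

```python
# First order conditions as a linear system:
#   (8): n*b0 + b1*sum(x)   = sum(y)
#   (9): b0*sum(x) + b1*sum(x^2) = sum(x*y)
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.8]
n = len(x)

Sx, Sy = sum(x), sum(y)
Sxx = sum(a * a for a in x)
Sxy = sum(a * b for a, b in zip(x, y))

# Solve the 2x2 system by Cramer's rule.
det = n * Sxx - Sx * Sx
b1 = (n * Sxy - Sx * Sy) / det
b0 = (Sy - b1 * Sx) / n

# Compare against formula (3).
xbar, ybar = Sx / n, Sy / n
b1_formula = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
             sum((a - xbar) ** 2 for a in x)
assert abs(b1 - b1_formula) < 1e-12
print(round(b0, 4), round(b1, 4))  # → 0.15 0.94
```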

1.5.3 Properties of predictions and residuals

The OLS residuals

$$\hat{u}_i = y_i - \hat{y}_i = y_i - \hat\beta_0 - \hat\beta_1 x_i$$

satisfy

$$\sum_{i=1}^n \hat{u}_i = 0 \tag{10}$$

because of (8), and hence the residuals have sample mean $\bar{\hat{u}} = 0$. Also, because of (9),

$$\sum_{i=1}^n x_i \hat{u}_i = 0. \tag{11}$$

Then

$$\sum_{i=1}^n \hat{y}_i = \sum_{i=1}^n (y_i - \hat{u}_i) = \sum_{i=1}^n y_i - \sum_{i=1}^n \hat{u}_i = \sum_{i=1}^n y_i, \tag{12}$$

so the OLS predictions $\hat{y}_i$ have the same sum, and hence the same sample mean, as the original dependent variable $y_i$. Also

$$\sum_{i=1}^n \hat{y}_i \hat{u}_i = \hat\beta_0 \sum_{i=1}^n \hat{u}_i + \hat\beta_1 \sum_{i=1}^n x_i \hat{u}_i = 0, \tag{13}$$

using (10) and (11).

$$\begin{aligned}
\mathrm{SST} &= \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 \\
&= \sum_{i=1}^n (\hat{u}_i + \hat{y}_i - \bar{y})^2 \\
&= \sum_{i=1}^n \hat{u}_i^2 + 2\sum_{i=1}^n \hat{u}_i (\hat{y}_i - \bar{y}) + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 \\
&= \sum_{i=1}^n \hat{u}_i^2 + 2\sum_{i=1}^n \hat{u}_i \hat{y}_i - 2\bar{y} \sum_{i=1}^n \hat{u}_i + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 \\
&= \mathrm{SSR} + \mathrm{SSE}.
\end{aligned}$$

The last step uses (13) and (10) to cancel the middle two terms. It also uses (12) to identify $\mathrm{SSE} = \sum_{i=1}^n (\hat{y}_i - \bar{\hat{y}})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$. Also (10) implies that $\bar{\hat{u}} = 0$, so that $\mathrm{SSR} = \sum_{i=1}^n \hat{u}_i^2$ does not require centering around $\bar{\hat{u}}$. Dividing this equality through by $n-1$ gives (6).
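Properties (10), (11), (12) and (13) hold (up to floating point error) for any OLS fit, and are easy to check numerically. A Python sketch on invented data:

```python
# Fit OLS, then verify the residual and prediction properties (10)-(13).
x = [0.5, 1.5, 2.0, 3.5, 4.0, 5.5]
y = [1.0, 2.2, 2.1, 3.9, 4.2, 6.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
     sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar

y_hat = [b0 + b1 * a for a in x]
u_hat = [b - yh for b, yh in zip(y, y_hat)]

checks = {
    "(10) sum of residuals": sum(u_hat),
    "(11) sum of x_i * u_i": sum(a * u for a, u in zip(x, u_hat)),
    "(12) mean(y_hat) - mean(y)": sum(y_hat) / n - ybar,
    "(13) sum of y_hat_i * u_i": sum(yh * u for yh, u in zip(y_hat, u_hat)),
}
for name, value in checks.items():
    print(name, abs(value) < 1e-9)  # each check prints True
```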

2 Statistical Inference and the Population Regression Function

The review of regression in section 1 suggests that regression is a useful tool for summarising the relationship between two observed variables and for calculating predictions for one variable based on observations on the other. In econometrics we want to do more than this. We want to use the information contained in a sample to carry out inductive inference (statistical inference) on the underlying population from which the sample was drawn. For example, we want to take the sample of 209 CEO salaries in section 1 as being representative of the salaries of CEOs in the population of all firms. In practice it is necessary to be very careful about the definition of this population. This dataset, taken from the American textbook of Wooldridge, would best be taken as being representative of only US firms, rather than all firms in the world, or all firms in OECD countries. In fact the population may be US publicly listed firms, since firms unlisted on the stock market may have quite different processes for executive salaries. Nevertheless, with the population carefully defined, the idea of statistical inference is to make statistical statements about that population, not only the sample that has been observed.

2.1 Simple random sample

Suppose there is a well defined population in which we are interested, e.g. the population of publicly listed firms in the US. A simple random sample is one in which each firm in the population has an equal probability of being included in the sample. Moreover each firm in the sample is chosen independently of all the others. That is, the probability of inclusion or exclusion of one firm from the sample does not depend on the inclusion or exclusion of any other firm.

For each firm included in the sample, we take one or more measurements of interest (e.g. CEO salary and the firm's Return on Equity). Mathematically these are represented as the random variables $y_i$ and $x_i$ for $i = 1, \ldots, n$, where $n$ is the sample size. The concept of a random variable reflects the idea that the values taken in the sample would have been different if a different random sample had been drawn. In the observed sample we had $y_1 = 1095$, $y_2 = 1001$, etc., but if another simple random sample had been drawn then different firms would (most likely) have been chosen and the $y_i$ values would have been different. That is, random variables take different values if a different sample is drawn.

In the population of all firms, there is a distribution of CEO salaries. The random variables $y_1, y_2, \ldots$ are independent random drawings from this distribution. That is, each of $y_1, y_2, \ldots$ are independent of each other and are drawn from the same underlying distribution. They are therefore called independent and identically distributed random variables, always abbreviated as i.i.d..

There are many other sampling schemes that may arise in practice, some of which will be introduced later. Our initial discussion of regression modelling will be confined to cross-sectional data drawn as a simple random sample, i.e. confined to i.i.d. random variables.

2.2 Population distributions and parameters

For example, the distribution of CEO salaries for the population of publicly listed firms in the US will have some population mean that could be denoted μ. The population mean is defined mathematically as the expected value of the population distribution. This expected value is the weighted average of all possible values in the population, with weights given by the probability distribution denoted f(y), i.e. μ = ∫ y f(y) dy. (The evaluation of such integrals will not be required here, but see Appendix B.3 of Wooldridge for some more details.) Since each of the random variables y_1, y_2, ... represents a random drawing from the population distribution, each of them also has a mean of μ. This is written

E(y_i) = μ,  i = 1, 2, ....  (14)

Similarly the population distribution of CEO salaries has some variance that could be denoted σ², which is defined in terms of μ as

σ² = var(y_i) = E[(y_i − μ)²],  i = 1, 2, ....

(Or, in terms of integrals, as σ² = ∫ (y − μ)² f(y) dy.)

2.3 Population vs Sample

It will be important throughout to be clear on the distinction between the population and the sample. The population is too large or unwieldy or simply impossible to fully observe and measure. Therefore a quantity such as the population mean μ = E(y_i) is also impossible to observe. Instead we take a sample, which is a subset of the population, and attempt to estimate the population mean based on that sample. An obvious (but not the only) statistic to use to estimate μ is the sample mean ȳ = (1/n) Σ_{i=1}^n y_i. A sample statistic such as ȳ is observable, e.g. ȳ = 1281.12 for the CEO salary data (see Figure 11).

It is vital at all times to keep clear the distinction between an unobservable population parameter like μ = E(y_i) about which we wish to learn, and an observable sample statistic ȳ that we use to estimate μ. More generally we want to use ȳ (and perhaps other statistics) to draw statistical inferences about μ.

2.4 Conditional Expectation

In econometrics we are nearly always interested in at least two random variables in a population, e.g. y_i for CEO salary and x_i for Return on Equity, and the relationships between them. Of central interest in econometrics is the conditional distribution of y_i given x_i. That is, rather than being interested in the distribution of CEO salaries in isolation (the so-called marginal distribution of y_i), we are interested in how the distribution of CEO salaries changes as the Return on Equity of the firm changes. For regression analysis, the fundamental population quantity of interest is the conditional expectation of y_i given x_i, which is denoted by the function

E(y_i | x_i) = μ(x_i).  (15)

(As outlined in Appendix B.4 of Wooldridge, this conditional expectation is defined as μ(x) = ∫ y f_{Y|X}(y|x) dy, where f_{Y|X} is the conditional distribution of y_i given x_i.) Much of econometrics is devoted to estimating conditional expectation functions.

The idea is that E(y_i | x_i) provides the prediction of y_i corresponding to a given value of x_i (i.e. the value of y_i that we would expect given some value of x_i). For example μ(10) is the population mean of CEO salary for a firm with Return on Equity of 10%. This conditional mean will be different (perhaps lower?) than μ(20), which is the population mean of CEO salary for a firm with Return on Equity of 20%. If the population mean of y_i changes when we change the value of x_i, there is a potentially interesting relationship between y_i and x_i to explore.

Consider the difference between the unconditional mean μ = E(y_i) given in (14) and the conditional mean μ(x_i) = E(y_i | x_i) given in (15). These are different population quantities with different uses. The unconditional mean μ provides an overall measure of central tendency for the distribution of y_i but provides no information on the relationship between y_i and x_i. The conditional mean μ(x_i), by contrast, describes how the predicted/mean value of y_i changes with x_i. For example, μ is of interest if we want to investigate the overall average level of CEO salaries (perhaps to compare them to other occupations, say), while μ(x_i) is of interest if we want to start to try to understand what factors may help explain the level of CEO salaries.

Note also that μ is, by definition, a single number. On the other hand μ(x_i) is a function, that is, it is able to take different values for different values of x_i.

2.5 The Population Regression Function

The Population Regression Function (PRF) is, by definition, the conditional expectation function (15). In a simple regression analysis, it is assumed that this function is linear, i.e.

E(y_i | x_i) = β_0 + β_1 x_i.  (16)

This linearity assumption need not always be true and is discussed more later. This PRF specifies the conditional mean of y_i in the population for any value of x_i. It specifies one important aspect of the relationship between y_i and x_i.

Statistical inference in regression models is about using sample information to learn about E(y_i | x_i), which in the case of (16) amounts to learning about β_0 and β_1. Consider the SRF introduced in (1), restated here:

ŷ_i = β̂_0 + β̂_1 x_i.  (17)

The idea is that β̂_0 and β̂_1 are the sample OLS estimators that we calculate to estimate the unobserved population coefficients β_0 and β_1. Then for any x_i we can use the sample predicted value ŷ_i to estimate the conditional expectation E(y_i | x_i).

2.6 Statistical Properties of OLS

An important question is whether the OLS SRF (17) provides a good estimator of the PRF (16) in some sense. In this section we address this question assuming that

A1 (y_i, x_i), i = 1, ..., n, are i.i.d. random variables (i.e. from a simple random sample)

A2 the linear form (16) of the PRF is correct.

Estimators in statistics (such as a sample mean ȳ or regression coefficients β̂_0, β̂_1) can be considered to be random variables since they are functions of the random variables that represent the data. For example the sample mean ȳ = (1/n) Σ_{i=1}^n y_i is a random variable because it is defined in terms of the random variables y_1, ..., y_n. That is, if a different random sample had been drawn for y_1, ..., y_n then a different value for ȳ would be obtained. The distribution of an estimator is called the sampling distribution of the estimator. The statistical properties of an estimator are derived from its sampling distribution.

2.6.1 Properties of Expectations

The properties of a sampling distribution are often defined in terms of its mean and variance and other similar quantities. To work these out, it is necessary to use some simple properties of the expectations operator E and the conditional expectations operator, summarised here.

Suppose z_1, ..., z_n are i.i.d. random variables and c_1, ..., c_n are non-random. Then

E1 E(Σ_{i=1}^n c_i z_i) = Σ_{i=1}^n c_i E(z_i)

E2 var(Σ_{i=1}^n c_i z_i) = Σ_{i=1}^n c_i² var(z_i)

E3 E(c_i) = c_i and var(c_i) = 0.

Property E1 continues to hold if z_i are not i.i.d. (for example, if they are correlated with each other) but Property E2 does not continue to hold if z_i are correlated. Recall from Assumption A1 that, at least for now, we are assuming that the random variables y_i and x_i are each i.i.d. across i. Property E3 simply states that the expectation of a constant (c_i) is itself, and that a constant has no variation.

In view of the definition of the PRF (16), conditional expectations are fundamental to regression analysis. It turns out to be useful to be able to work with not only E(y_i | x_i) but E(y_i | x_1, ..., x_n), which is the conditional expectation of y_i given information on the explanatory variables for all observations, not only observation i. The reason for this becomes clear in the following section. Under Assumption A1

E(y_i | x_i) = E(y_i | x_1, ..., x_n).  (18)

This can be proven formally, but the intuition is simply that under independent sampling, information in explanatory variables x_j for j ≠ i is not informative about y_i since (y_i, x_i) and (y_j, x_j) are independent for all j ≠ i. That is, knowing x_j for j ≠ i does not change our prediction of y_i. For example, our prediction of the CEO salary for firm 1 is not improved by knowing the Return on Equity of any other firms; it is assumed to be explained only by the performance of firm 1. That is

E(salary_i | RoE_i) = E(salary_i | RoE_1, ..., RoE_n).

Equation (18) is reasonable under Assumption A1, but not in other sampling situations such as time series data considered later.

The conditional variance of a random variable is a measure of its conditional dispersion around its conditional mean. For example

var(y_i | x_i) = E[ (y_i − E(y_i | x_i))² | x_i ].

(Compare this to the unconditional variance var(y_i) = E[(y_i − E(y_i))²].) The conditional variance of y_i is the variation in y_i that remains when x_i is given a fixed value. The unconditional variance of y_i is the overall variation in y_i, averaged across all x_i values. It follows that var(y_i | x_i) ≤ var(y_i). If y_i and x_i are independent then var(y_i | x_i) = var(y_i). It is frequently the case in practice that var(y_i | x_i) may vary in important ways with x_i. For example, it may be that CEO salaries are more highly variable for more profitable firms than less profitable firms. Or, if y_i is wages and x_i is individual age, then it is likely that the variation in wages across individuals becomes greater as age increases. If var(y_i | x_i) varies with x_i then this is called heteroskedasticity. If var(y_i | x_i) is constant across x_i then this is called homoskedasticity.

Under Assumption A1, the conditional expectations operator has properties similar to E1 and E2. Suppose c_1, ..., c_n are either non-random or functions of x_1, ..., x_n only (i.e. not functions of y_1, ..., y_n). Then

CE1 E(Σ_{i=1}^n c_i y_i | x_1, ..., x_n) = Σ_{i=1}^n c_i E(y_i | x_i)

CE2 var(Σ_{i=1}^n c_i y_i | x_1, ..., x_n) = Σ_{i=1}^n c_i² var(y_i | x_i)

Without i.i.d. sampling (e.g. time series), CE1 would continue to hold in the form E(Σ_{i=1}^n y_i | x_1, ..., x_n) = Σ_{i=1}^n E(y_i | x_1, ..., x_n), while CE2 would generally not be true.

The final very useful property of conditional expectations is the Law of Iterated Expectations:

LIE For any random variables z and x, E[z] = E[E(z|x)].

The LIE may appear odd at first but is very useful and has some intuition. Leaving aside the regression context, let z represent the outcome from a roll of a die, i.e. a number from 1, 2, ..., 6. The expected value of this random variable is E(z) = (1/6)(1 + 2 + ... + 6) = 3.5, since the probability of each possible outcome is 1/6. Now suppose we define another random variable x that takes the value 0 if z is even and 1 if z is odd. That is, x = 0 if z = 2, 4, 6 and x = 1 if z = 1, 3, 5, so that Pr(x = 0) = 1/2 and Pr(x = 1) = 1/2. It should be clear that E(z|x = 0) = 4 and E(z|x = 1) = 3, which illustrates the idea that conditional expectations can take different values (4 or 3) when the conditioning variables take different values (0 or 1). The expected value of the random variable E(z|x) is taken as an average over the possible x values, that is, E[E(z|x)] = (1/2)(4 + 3) = 3.5, since the probability of each possible outcome of E(z|x) is 1/2. This illustrates the LIE, i.e. E(z) = E[E(z|x)] = 3.5. While E[E(z|x)] may appear more complicated than E(z), it frequently turns out to be easier to work with.

The LIE also has a version in variances:

LIEvar var(z) = E[var(z|x)] + var[E(z|x)].

This shows that the variance of a random variable can be decomposed into its average conditional variance given x and the variance of the regression function on x.
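Both the LIE and LIEvar can be checked exactly for the die example above, using exact rational arithmetic (a Python sketch; the encoding z mod 2 reproduces the even/odd variable x):

```python
from fractions import Fraction as F

# The die example: z is a fair die roll; x = 0 if z is even, x = 1 if z is odd.
outcomes = [1, 2, 3, 4, 5, 6]
p = F(1, 6)

Ez = sum(p * z for z in outcomes)                      # E(z) = 7/2
# Conditional means E(z|x): average of the three outcomes consistent with each x.
E_z_given = {x: F(sum(z for z in outcomes if z % 2 == x), 3) for x in (0, 1)}

# LIE: E(z) = E[E(z|x)], averaging the conditional means over Pr(x=0)=Pr(x=1)=1/2.
lie = F(1, 2) * (E_z_given[0] + E_z_given[1])

# LIEvar: var(z) = E[var(z|x)] + var[E(z|x)].
var_z = sum(p * z * z for z in outcomes) - Ez ** 2
var_given = {x: F(sum(z * z for z in outcomes if z % 2 == x), 3) - E_z_given[x] ** 2
             for x in (0, 1)}
decomposition = (F(1, 2) * (var_given[0] + var_given[1])
                 + F(1, 2) * ((E_z_given[0] - lie) ** 2 + (E_z_given[1] - lie) ** 2))

print(Ez, lie, var_z, decomposition)  # 7/2 7/2 35/12 35/12
```

The unconditional mean and the iterated mean agree exactly (7/2), as do the variance and its LIEvar decomposition (35/12).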

2.6.2 Unbiasedness

An estimator is defined to be unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated. If θ̂ is any estimator of a parameter θ, it is unbiased if E(θ̂) = θ. The idea is that an unbiased estimator is one that does not systematically under-estimate or over-estimate the true value θ. Some samples from the population will give values of θ̂ below θ and some samples will give values of θ̂ above θ, and these differences average out. In practice we only get to observe a single value of θ̂ of course, and this single value may differ from θ by being too large or small. It is only on average over all possible samples that the estimator gives θ. So unbiasedness is a desirable property for a statistical estimator, although not one that occurs very often. However in linear regression models there are situations where the OLS estimator can be shown to be unbiased. We consider the unbiasedness of the sample mean first, and then the OLS estimator of the slope coefficient in a simple regression.

Let μ_y = E(y_i) denote the population mean of the i.i.d. random variables y_1, ..., y_n. Then

E(ȳ) = E( (1/n) Σ_{i=1}^n y_i ) = (1/n) Σ_{i=1}^n E(y_i) = (1/n) Σ_{i=1}^n μ_y = μ_y,  (19)

where the second step uses Property E1 above, and this shows that the sample mean is an unbiased estimator of the population mean.
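Unbiasedness is easy to see in a small simulation: the average of ȳ over many independent samples settles near μ_y. A Python sketch (the exponential population with mean 4 is an arbitrary illustrative choice, not from the notes):

```python
import random
import statistics

random.seed(0)

# Illustrative population: exponential distribution with mean mu_y = 4.
mu_y, n, reps = 4.0, 25, 4000

# Draw many independent samples of size n and record ybar for each; the average
# of the ybar values estimates E(ybar), which unbiasedness says equals mu_y.
ybars = [statistics.fmean(random.expovariate(1 / mu_y) for _ in range(n))
         for _ in range(reps)]

print(round(statistics.fmean(ybars), 2))
```

Any single ȳ misses μ_y, but averaged over repeated samples the misses cancel out.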

Under Assumptions A1 and A2 above, the OLS estimators β̂_0 and β̂_1 can be shown to be unbiased. Just β̂_1 is considered here. First recall the property of zero sums around sample means (7), which implies

Σ_{i=1}^n (x_i − x̄) = 0,  (20)

and

Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) = Σ_{i=1}^n (x_i − x̄) y_i,  (21)

and similarly

Σ_{i=1}^n (x_i − x̄)² = Σ_{i=1}^n (x_i − x̄) x_i.  (22)

Using (21) in the OLS formula for the slope gives

β̂_1 = Σ_{i=1}^n (x_i − x̄) y_i / Σ_{i=1}^n (x_i − x̄)² = Σ_{i=1}^n a_{n,i} y_i.  (23)

This shows that β̂_1 is a weighted sum of y_1, ..., y_n, with the weight on each observation y_i being given by

a_{n,i} = (x_i − x̄) / Σ_{j=1}^n (x_j − x̄)²,  (24)

which for each i depends on all of x_1, ..., x_n (hence the subscript n included in the a_{n,i} notation). Now, to apply the LIE, first use Property CE1 to write

E(β̂_1 | x_1, ..., x_n) = Σ_{i=1}^n a_{n,i} E(y_i | x_i) = Σ_{i=1}^n a_{n,i} (β_0 + β_1 x_i) = β_0 Σ_{i=1}^n a_{n,i} + β_1 Σ_{i=1}^n a_{n,i} x_i,  (25)

where the first step uses (18), which holds under Assumption A1. Using (20) gives

Σ_{i=1}^n a_{n,i} = Σ_{i=1}^n (x_i − x̄) / Σ_{j=1}^n (x_j − x̄)² = 0,

and using (22) gives

Σ_{i=1}^n a_{n,i} x_i = Σ_{i=1}^n (x_i − x̄) x_i / Σ_{j=1}^n (x_j − x̄)² = 1,

so that

E(β̂_1 | x_1, ..., x_n) = β_1.  (26)

Finally, applying the LIE,

E(β̂_1) = E[ E(β̂_1 | x_1, ..., x_n) ] = E[β_1] = β_1,

which shows that β̂_1 is an unbiased estimator of β_1.
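The two weight identities used in the proof, Σ a_{n,i} = 0 and Σ a_{n,i} x_i = 1, hold exactly for any data set, and the weighted-sum form of the slope agrees with the usual OLS formula. A Python sketch with made-up numbers:

```python
# Verify the weight identities behind the unbiasedness proof for arbitrary data.
x = [1.0, 2.0, 4.0, 5.0, 8.0]
y = [2.1, 3.9, 8.2, 9.8, 16.1]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

# The OLS weights a_{n,i} of equation (24).
a = [(xi - xbar) / sxx for xi in x]

# Slope as a weighted sum of the y_i, equation (23), vs the usual formula.
b1_weights = sum(ai * yi for ai, yi in zip(a, y))
b1_formula = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx

print(sum(a), sum(ai * xi for ai, xi in zip(a, x)))   # ~0 and ~1
print(b1_weights, b1_formula)                         # identical slopes
```

Since the identities are algebraic, they hold (up to floating-point rounding) whatever x and y values are supplied.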

2.6.3 Variance

The variance of an estimator measures how dispersed values of the estimator can be around the

mean. In general it is preferred for an estimator to have a small variance, implying that it tends

not to produce estimates very far from its mean. This is especially so for an unbiased estimator,

for which a small variance implies the distribution of the estimator is closely concentrated around

the true population value of the parameter of interest.

For the sample mean, consider again the i.i.d. random variables y_1, ..., y_n, each with population mean μ_y = E(y_i) and population variance σ²_y. Then

var(ȳ) = var( (1/n) Σ_{i=1}^n y_i ) = (1/n²) Σ_{i=1}^n var(y_i) = σ²_y / n,  (27)

the second equality following from Property E2. This formula shows what factors influence the precision of the sample mean: the variance σ²_y and the sample size n. Specifically, having a population with a small variance σ²_y leads to a more precise estimator ȳ of μ_y, which makes intuitive sense. Similarly intuitively, a larger sample size n implies a smaller variance of ȳ, implying that more precise estimates are obtained from larger sample sizes.

Now consider the variance of the OLS slope estimator β̂_1. Using Property LIEvar above, the variance of β̂_1 can be expressed

var(β̂_1) = E[ var(β̂_1 | x_1, ..., x_n) ] + var[ E(β̂_1 | x_1, ..., x_n) ]
          = E[ var(β̂_1 | x_1, ..., x_n) ] + var[β_1]
          = E[ var(β̂_1 | x_1, ..., x_n) ],

where (26) is used to get the second line and then Property E3 (the variance of a constant is zero) to get the third line. The conditional variance of β̂_1 given x_1, ..., x_n is

var(β̂_1 | x_1, ..., x_n) = Σ_{i=1}^n a_{n,i}² var(y_i | x_i) = Σ_{i=1}^n (x_i − x̄)² var(y_i | x_i) / ( Σ_{i=1}^n (x_i − x̄)² )²,

using Property CE2 to obtain the first equality and then substituting for a_{n,i} to obtain the second. This implies

var(β̂_1) = E[ Σ_{i=1}^n (x_i − x̄)² var(y_i | x_i) / ( Σ_{i=1}^n (x_i − x̄)² )² ],  (28)

which is a fairly complicated formula that doesn't shed a lot of light on the properties of β̂_1, but it does have later practical use when we talk about hypothesis testing.

A simplification of the variance occurs under homoskedasticity, that is when var(y_i | x_i) = σ² for every i. If the conditional variance is constant then

var(β̂_1) = E[ σ² / Σ_{i=1}^n (x_i − x̄)² ] = σ² / ( (n − 1) s²_x ),  (29)

where

s²_x = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)²

is the usual sample variance of the explanatory variable x_i. Formula (29) is simple enough to understand what factors in a regression influence the precision of β̂_1. The variance will be small for small values of σ² and large values of n − 1 and s²_x. This implies practically that slope coefficients can be precisely estimated in situations where the sample size is large, where the regressor x_i is highly variable, and where the dependent variable y_i has small variation around the regression function (i.e. small σ²).
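The message of (29) can be checked by simulation: for a fixed regressor design with homoskedastic errors, the variance of β̂_1 across repeated samples is close to σ²/((n−1)s²_x). A Python sketch (the design over [0, 10] and σ = 2 are arbitrary illustrative choices):

```python
import random

random.seed(3)

def slope(x, y):
    # OLS slope for the simple regression of y on x.
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx

# Fixed regressor design, homoskedastic N(0, sigma^2) errors.
n, sigma, reps = 40, 2.0, 4000
x = [i / (n - 1) * 10 for i in range(n)]          # x spread over [0, 10]
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)           # equals (n - 1) * s_x^2

# Repeatedly generate y = 1 + 2x + u and re-estimate the slope.
slopes = [slope(x, [1 + 2 * xi + random.gauss(0, sigma) for xi in x])
          for _ in range(reps)]

mean_b1 = sum(slopes) / reps
var_b1 = sum((b - mean_b1) ** 2 for b in slopes) / (reps - 1)

print(round(var_b1, 4), round(sigma ** 2 / sxx, 4))
```

The empirical variance of the simulated slopes and the theoretical value σ²/Σ(x_i − x̄)² should agree closely, and either shrinking σ or spreading out the x design makes both smaller.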

2.6.4 Asymptotic normality

Having discussed the mean and variance of a sampling distribution, it is also possible to consider the entire sampling distribution. This becomes important when we discuss hypothesis testing.

First consider the sample mean of some i.i.d. random variables y_1, ..., y_n with mean μ_y and variance σ²_y. Recall from (19) and (27) that the sample mean ȳ has mean E(ȳ) = μ_y and variance var(ȳ) = σ²_y / n. In general the sampling distribution of ȳ is not known, but in the special case where it is known that each y_i is normally distributed, it also follows that ȳ is normally distributed. That is, if y_i ~ i.i.d. N(μ_y, σ²_y) then

ȳ ~ N(μ_y, σ²_y / n).  (30)

If the distribution of y_i is not normal, then the distribution of ȳ is also not normal. In econometrics it is very rare to know that each y_i is normally distributed, so it would appear that (30) has only theoretical interest. However, there is a powerful result in probability called the Central Limit Theorem that states that even if y_i is not normally distributed, the sample mean ȳ can still be taken to be approximately normally distributed, with the approximation generally working better for larger values of n. Technically we say that ȳ converges to a normal distribution as n → ∞, or that ȳ is asymptotically normal, and we will write this in the form

ȳ ~ᵃ N(μ_y, σ²_y / n),  (31)

with the "a" denoting the fact that the normal distribution for ȳ is asymptotic (i.e. as n → ∞) or, more simply, is approximate.

[Figure 26: Density of the Gamma(2,2) distribution (series F_GAMMA).]

The proof of the Central Limit Theorem goes beyond our scope, but it can be illustrated using simulated data. Suppose that y_1, ..., y_n are i.i.d. random variables with a Gamma distribution as shown in Figure 26. The mean of this distribution is μ_y = 4. The details of the Gamma distribution are not important for this discussion, although it is a well-known distribution for modelling certain types of data in econometrics. For example, the skewed shape of the distribution can make it suitable for income distribution modelling, in which many people or households make low to moderate incomes and a relative few make high to very high incomes. Clearly this Gamma distribution is very different in shape from a normal distribution! We can use Eviews to draw a sample of size n from this Gamma distribution and to compute ȳ. Repeating this many times builds up a picture of the sampling distribution of ȳ for the given n. The results of doing this are given in Figures 27-31.

Figure 27 shows the simulated sampling distribution of ȳ when n = 5. The skewness of the population distribution of y_i in Figure 26 remains evident in the distribution of ȳ in Figure 27, but to a reduced extent. The approximation (31), which is meant to hold for large n, does not work very well for n = 5. As n increases, however, through n = 10, 20, 40, 80 in Figures 28-31, it is clear that the sampling distribution of ȳ becomes more and more like a normal distribution, even though the underlying data from y_i is very far from being normal. This is the Central Limit Theorem at work and is why, for reasonable sample sizes, we are prepared to rely on an approximate distribution such as (31) to carry out statistical inference.

Two other features of the sampling distributions in Figures 27-31 are worth noting. Firstly, the mean of each sampling distribution is known to be μ_y = 4 because ȳ is unbiased for every n. Secondly, the variance of the sampling distribution becomes smaller as n increases because var(ȳ) = σ²_y / n. That is, the sampling distribution becomes more concentrated around μ_y = 4 as n increases (note carefully the scale on the horizontal axis changing as n increases).
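The Eviews experiment behind Figures 27-31 can be replicated in a few lines of Python (Gamma(2,2) has mean 4 and variance 8, so var(ȳ) should be close to 8/n):

```python
import random
import statistics

random.seed(2)

# Repeatedly draw samples of size n from a Gamma(2,2) population (mean 4,
# variance 8) and collect the sample means, as in Figures 27-31.
def ybar_draws(n, reps=3000):
    return [statistics.fmean(random.gammavariate(2.0, 2.0) for _ in range(n))
            for _ in range(reps)]

for n in (5, 20, 80):
    ms = ybar_draws(n)
    # The mean stays near 4 (unbiasedness); the variance shrinks like 8/n.
    print(n, round(statistics.fmean(ms), 2), round(statistics.variance(ms), 2))
```

Plotting a histogram of the ybar values for each n reproduces the increasingly normal, increasingly concentrated shapes in Figures 27-31.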

The same principle applies to the regression coefficients β̂_0 and β̂_1. Each can be shown to be asymptotically normal because of the Central Limit Theorem. For β̂_1, the Central Limit Theorem

Figure 27: Sampling distribution of ȳ with n = 5 observations from the Gamma(2,2) distribution.

Figure 28: Sampling distribution of ȳ with n = 10 observations from the Gamma(2,2) distribution.

Figure 29: Sampling distribution of ȳ with n = 20 observations from the Gamma(2,2) distribution.

Figure 30: Sampling distribution of ȳ with n = 40 observations from the Gamma(2,2) distribution.

Figure 31: Sampling distribution of ȳ with n = 80 observations from the Gamma(2,2) distribution.

applies to the sum (23) and gives the approximate distribution

β̂_1 ~ᵃ N(β_1, ω²_{1,n}),  (32)

where in general

ω²_{1,n} = var(β̂_1) = E[ Σ_{i=1}^n (x_i − x̄)² var(y_i | x_i) / ( Σ_{i=1}^n (x_i − x̄)² )² ],  (33)

or under homoskedasticity

ω²_{1,n} = σ² / ( (n − 1) s²_x ),  (34)

as shown in (29).

2.7 Summary

In introductory econometrics the topic of statistical inference and its theory is typically the most difficult to grasp, both in its concepts and formulae. What follows is a summary of the important concepts of this section.

Populations and Samples

- Statistical inference is the process of attempting to learn about some characteristics of a population based on a sample drawn from that population.
- The most straightforward sampling approach is a simple random sample, in which every element in the population has an equal chance of being included in the sample.

Mean and variance in the Population and the Sample

- Population characteristics such as means and variances are defined as the expectations μ_y = E(y_i) and σ²_y = E[(y_i − μ_y)²].
- The corresponding sample statistics are ȳ = (1/n) Σ_{i=1}^n y_i and s²_y = (1/(n − 1)) Σ_{i=1}^n (y_i − ȳ)².
- The Population Regression Function (PRF) is defined in terms of the conditional expectations operator: E(y_i | x_i) = β_0 + β_1 x_i.
- The Sample Regression Function (SRF) is defined in terms of the OLS regression line ŷ_i = β̂_0 + β̂_1 x_i.

Statistical properties

Under simple random sampling

- ȳ ~ᵃ N(μ_y, σ²_y / n)
- β̂_1 ~ᵃ N(β_1, ω²_{1,n}), where in general

  ω²_{1,n} = E[ Σ_{i=1}^n (x_i − x̄)² var(y_i | x_i) / ( Σ_{i=1}^n (x_i − x̄)² )² ],

  or under homoskedasticity

  ω²_{1,n} = σ² / ( (n − 1) s²_x ).

The idea of statistical inference is that we use the observable sample information summarised by the SRF

ŷ_i = β̂_0 + β̂_1 x_i

to make inferences about the unobservable PRF

E(y_i | x_i) = β_0 + β_1 x_i.

For example, in the CEO salary regression in Figure 17, we take β̂_1 = 18.50 to be the point estimate of the unknown population coefficient β_1. This point estimate is very useful, but on its own it doesn't communicate the uncertainty that is implicit in having taken a sample of just n = 209 firms from all firms in the population. If we had taken a different sample of firms, we would have obtained a different value for β̂_1. This uncertainty is summarised in the sampling distribution of β̂_1 in equation (32), which quantifies (approximately) the entire distribution of β̂_1 that could have been obtained by taking different samples from the underlying population. The techniques of hypothesis testing and confidence intervals provide ways of making probabilistic statements about β_1 that are more informative, and more honest about the statistical uncertainty, than a simple point estimate.

3.1 Hypothesis testing

3.1.1 The null hypothesis

The approach in hypothesis testing is to specify a null hypothesis about a particular value for a population parameter (say β_0 or β_1) and then to investigate whether the observed data provide evidence for the rejection of this hypothesis. For example, in the CEO salary regression, we might specify a null hypothesis that firm profitability has no predictive power for CEO salary. In the PRF

E(Salary_i | RoE_i) = β_0 + β_1 RoE_i,  (35)

the null hypothesis would be expressed

H_0 : β_1 = 0.  (36)

If the null hypothesis were true then E(Salary_i | RoE_i) = β_0, which states that average CEO salaries are constant (β_0) across all levels of firm profitability.

Note that the hypothesis is expressed in terms of the population parameter β_1, not the sample estimate β̂_1. Since we know that β̂_1 = 18.50, it would be nonsense to investigate whether β̂_1 = 0.... it isn't! Instead we are interested in testing to see whether β̂_1 = 18.50 differs sufficiently from zero such that we can conclude that β_1 also differs from zero, albeit with some level of uncertainty that acknowledges the sampling variability inherent in β̂_1.

3.1.2 The alternative hypothesis

After specifying the null hypothesis, the next requirement is the alternative hypothesis. The alternative hypothesis is specified as an inequality, as opposed to the null hypothesis which is an equality. In the case of a null hypothesis specified as (36), the alternative hypothesis would be one of the following three possibilities:

H_1 : β_1 ≠ 0 or H_1 : β_1 > 0 or H_1 : β_1 < 0.

H_1 : β_1 ≠ 0 is called a two-sided alternative (falling on both sides of the null hypothesis) while H_1 : β_1 > 0 and H_1 : β_1 < 0 are called one-sided alternatives. A one-sided alternative would be specified in situations where the only reasonable or interesting deviations from the null hypothesis lie on one side. In the case of the null hypothesis H_0 : β_1 = 0 in (36), we might specify H_1 : β_1 > 0 if the only interest were in the hypothesis that profitable firms reward their CEOs with higher salaries. However there is also a possibility that some less profitable firms might try to improve their fortunes by attempting to attract proven CEOs with offers of a higher salary. With two conflicting stories like this, the sign of the possible relationship would be unclear and we would specify H_1 : β_1 ≠ 0. One very important point is that we must not use the sign of the sample estimate to specify the alternative hypothesis; the hypothesis testing methodology requires that the hypotheses be specified before looking at any sample information. The hypotheses must be specified on the basis of the practical questions of interest. Both one- and two-sided testing will be discussed below.

3.1.3 The null distribution

The idea in hypothesis testing is to make a decision whether or not to reject H_0 in favour of H_1 on the basis of the evidence in the data. For testing (36), an approach to making this decision can be based on the sampling distribution (32). Specifically, if H_0 : β_1 = 0 is true then

β̂_1 ~ᵃ N(0, ω²_{1,n}),

where ω²_{1,n} is given in (33), or (34) in the special case where homoskedasticity can be assumed. This sampling distribution can also be written

β̂_1 / ω_{1,n} ~ᵃ N(0, 1),

which is useful because the distribution on the right hand side is now a very well known one, the standard normal distribution, for which derivations and computations are relatively straightforward. However this expression is not yet usable in practice because ω_{1,n} depends on population expectations (i.e. it contains an E and a var) and is not observable. It can, however, be estimated using

ω̂_{1,n} = sqrt( Σ_{i=1}^n (x_i − x̄)² û_i² ) / Σ_{i=1}^n (x_i − x̄)²,  (37)

which is obtained from (33) by dropping the outside expectation, replacing var(y_i | x_i) by the squared residuals û_i², and then taking the square root to turn ω̂²_{1,n} into ω̂_{1,n}. This quantity ω̂_{1,n} is called the standard error of β̂_1. It can then be shown (using derivations beyond our scope) that

β̂_1 / ω̂_{1,n} ~ᵃ N(0, 1).

That is, replacing the unknown standard deviation ω_{1,n} with the observable standard error ω̂_{1,n} does not change the approximate distribution of β̂_1. However, often a practically better approximation is provided by

β̂_1 / ω̂_{1,n} ~ᵃ t_{n−2},  (38)

where t_{n−2} denotes the t distribution with n − 2 degrees of freedom. For large values of n the t_{n−2} and N(0, 1) distributions are almost indistinguishable (indeed lim_{n→∞} t_{n−2} = N(0, 1)) but for smaller n using (38) can often give a more accurate approximation. Equation (38) provides a practically usable approximate null distribution for β̂_1 (it's called the null distribution because recall we imposed the null hypothesis to obtain β̂_1 ~ᵃ N(0, ω²_{1,n}) in the first step above).

If it is known that the conditional distribution of y_i given x_i is homoskedastic (i.e. that var(y_i | x_i) is constant) then (34) can be used to justify the alternative estimator

ω̂_{1,n} = sqrt( σ̂² / Σ_{i=1}^n (x_i − x̄)² ),  (39)

where

σ̂² = (1/(n − 2)) Σ_{i=1}^n û_i²

is the sample variance of the OLS residuals. In small samples the standard error estimated using (39) may be more precise than that estimated by (37), provided the assumption of homoskedasticity is correct. If the assumption of homoskedasticity is incorrect, however, the standard error in (39) is not valid. In econometrics the standard error in (37) is referred to as White's standard error while (39) is referred to as the OLS standard error. Modern econometric practice is to favour the robustness of (37), and we will generally follow that practice.

If it is known that the conditional distribution of y_i given x_i is both homoskedastic and normally distributed (written y_i | x_i ~ N(β_0 + β_1 x_i, σ²)) then the null distribution (38) with ω̂_{1,n} given in (39) is exact, no longer an approximation. This is a beautiful theoretical result, but since it is very rarely known that y_i | x_i ~ N(β_0 + β_1 x_i, σ²) in practice, we should acknowledge that (38) is an approximation.
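The two standard errors (37) and (39), and the resulting t ratio, can be computed directly from data. A Python sketch on simulated data (the data generating process y = 1 + 0.5x + u is an arbitrary illustration):

```python
import math
import random

random.seed(7)

# Illustrative data from y = b0 + b1*x + u with b1 = 0.5 and N(0, 4) errors.
n = 50
x = [random.uniform(0, 10) for _ in range(n)]
y = [1.0 + 0.5 * xi + random.gauss(0, 2) for xi in x]

# OLS estimates and residuals.
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
u = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

# White's standard error, equation (37): sqrt(sum((x_i - xbar)^2 u_i^2)) / sxx.
se_white = math.sqrt(sum((xi - xbar) ** 2 * ui ** 2 for xi, ui in zip(x, u))) / sxx

# OLS standard error, equation (39), valid under homoskedasticity.
sigma2 = sum(ui ** 2 for ui in u) / (n - 2)
se_ols = math.sqrt(sigma2 / sxx)

t_stat = b1 / se_white
print(round(b1, 3), round(se_white, 3), round(se_ols, 3), round(t_stat, 2))
```

Because the simulated errors here are homoskedastic, the two standard errors come out similar; under heteroskedasticity only the White version remains valid.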

3.1.4

If H0 :

= 0 is not true then the approximate sampling distribution (32) can be written

^

a

1

+ N 0; ! 21;n ;

which is informal notation that represents a normal distribution with a constant 1 added to it

(which is identical to a N 1 ; ! 21;n distribution). Then repeating the steps leading to (38) gives

^

! 1;n

! 1;n

+ N (0; 1) ;

^ 1;n gives

^

!

^ 1;n

! 1;n

+ tn

2:

(40)

This equation says that if the null hypothesis is false then the distribution of the ratio ^ 1 =^

! 1;n

is no longer approximately tn 2 , but instead is tn 2 with a constant 1 =! 1;n added to it. That

is, the distribution is shifted (either positively or negatively depending on the sign of 1 ) relative

to tn 2 distribution. The dierence between (38) under the null and (40) under the alternative

provides the basis for the hypothesis test.

39

3.1.5

In hypothesis testing we either reject $H_0$ or do not reject $H_0$ (we don't accept hypotheses, more on this soon). A hypothesis test requires a decision rule that specifies when $H_0$ is to be rejected. Because we have only partial information, i.e. a random sample rather than the entire population, there is some probability that any decision we make will be incorrect. That is, there is a chance we might reject $H_0$ when $H_0$ is in fact true, which is called a Type I error. There is also a chance that we might not reject $H_0$ when $H_0$ is in fact false, which is called a Type II error. The four possibilities are summarised in this table.

  Decision                  $H_0$ true       $H_0$ false
  Reject $H_0$              Type I error     Correct
  Do not reject $H_0$       Correct          Type II error

Clearly we would like a hypothesis test to minimise the probabilities of both Type I and II errors, but there is no unique way of doing this. The convention is to set the significance level of the hypothesis test to a small fixed probability $\alpha$, which specifies the probability of a Type I error. The most common choice is $\alpha = 0.05$, although $\alpha = 0.01$ and $\alpha = 0.10$ are sometimes used.

3.1.6

The test statistic for testing $H_0 : \beta_1 = 0$ is

$$t = \frac{\hat\beta_1}{\hat\omega_{1,n}}. \tag{41}$$

From (38) we know that $t \overset{a}{\sim} t_{n-2}$ if $H_0$ is true, while from (40) we know that $t$ is shifted away from the $t_{n-2}$ distribution if $H_0$ is false. First consider testing $H_0 : \beta_1 = 0$ against the one-sided alternative $H_1 : \beta_1 > 0$, implying the interesting deviations from the null hypothesis induce a positive shift of $t$ away from the $t_{n-2}$ distribution. We will therefore define a decision rule based on $t$ that states that $H_0$ is rejected if $t$ takes a larger value than would be thought reasonable from the $t_{n-2}$ distribution. The way we formalise the statement "$t$ takes a larger value than would be thought reasonable from the $t_{n-2}$ distribution" is to use the significance level. The decision rule is defined to reject $H_0$ if $t$ takes a larger value than a critical value $c_\alpha$, which is defined by the probability

$$\Pr(t_{n-2} > c_\alpha) = \alpha$$

for significance level $\alpha$. The distribution of $t$ under $H_0$ is $t_{n-2}$, so the value of $c_\alpha$ can be computed from the $t_{n-2}$ distribution, as shown graphically in Figure 32 for $\alpha = 0.05$ and $n - 2 = 30$. The critical value in this case is $c_{0.05} = 1.697$, which can be found in Table G.2 of Wooldridge (p.833) or computed in Eviews.

Figure 32: $t_{n-2}$ distribution with critical value for testing $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 > 0$.

For testing $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 < 0$, the procedure is essentially a mirror image. The decision rule is to reject $H_0$ if $t$ takes a smaller value than the critical value, which is shown in Figure 33. If $c_\alpha$ is the $\alpha$-significance critical value for testing against $H_1 : \beta_1 > 0$, then $-c_\alpha$ is the $\alpha$-significance critical value for testing against $H_1 : \beta_1 < 0$. That is

$$\Pr(t_{n-2} < -c_\alpha) = \alpha.$$

The critical value for $\alpha = 0.05$ and $n - 2 = 30$ is therefore simply $-c_{0.05} = -1.697$.

For testing $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 \neq 0$, the potentially interesting deviations from the null hypothesis might induce either a positive or negative shift of $t$ away from the $t_{n-2}$ distribution. Therefore we need to check in either direction. That is, we will reject $H_0$ if either $t$ takes a larger value than considered reasonable for the $t_{n-2}$ distribution, or a smaller value. The decision rule is to reject $H_0$ if $t > c_{\alpha/2}$ or $t < -c_{\alpha/2}$, which can be expressed more simply as $|t| > c_{\alpha/2}$, where $c_{\alpha/2}$ satisfies

$$\Pr(t_{n-2} > c_{\alpha/2}) = \alpha/2, \quad \text{or equivalently} \quad \Pr\left(|t_{n-2}| > c_{\alpha/2}\right) = \alpha.$$

The critical value for $\alpha = 0.05$ and $n - 2 = 30$ is $c_{\alpha/2} = 2.042$.
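For readers without the table at hand, critical values like these can be reproduced numerically. The following is a minimal pure-Python sketch; the integration-plus-bisection scheme is an illustrative choice of my own, not something used in the course, where one would simply read Table G.2 or let Eviews compute the value.

```python
import math

def t_tail_prob(c, df, steps=20000, hi=60.0):
    """Pr(t_df > c): upper-tail probability by trapezoid-rule integration of the t density."""
    if c >= hi:
        return 0.0
    k = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: k * (1.0 + u * u / df) ** (-(df + 1) / 2)
    h = (hi - c) / steps
    total = 0.5 * (pdf(c) + pdf(hi))
    total += sum(pdf(c + i * h) for i in range(1, steps))
    return total * h

def t_critical(alpha, df):
    """Smallest c with Pr(t_df > c) = alpha, found by bisection on the tail probability."""
    lo, hi = 0.0, 60.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if t_tail_prob(mid, df) > alpha:
            lo = mid  # tail still too heavy: move the critical value up
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(t_critical(0.05, 30), 3))   # one-sided 5% critical value for 30 df (about 1.697)
print(round(t_critical(0.025, 30), 3))  # two-sided 5% critical value for 30 df (about 2.042)
```

The same function reproduces the other table entries used later in these notes, e.g. `t_critical(0.025, 120)` for the CEO salary regression.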

3.1.7

Every hypothesis test consists of the following components.

1. The null hypothesis $H_0$.

2. The alternative hypothesis $H_1$.

3. A significance level $\alpha$.

4. A test statistic (in this case $t$, but we will see others soon).

5. A decision rule that states when $H_0$ is rejected.

6. The decision, and its interpretation.

Figure 33: $t_{n-2}$ distribution with critical value for testing $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 < 0$.

Figure 34: $t_{n-2}$ distribution with critical values for testing $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 \neq 0$.

Consider the CEO salary regression, which has PRF

$$E(\mathit{Salary}_i | \mathit{RoE}_i) = \beta_0 + \beta_1 \mathit{RoE}_i, \tag{42}$$


Figure 35: Choosing to use White standard errors that allow for heteroskedasticity

and the hypotheses $H_0 : \beta_1 = 0$ and $H_1 : \beta_1 \neq 0$, so that we are interested in either positive or negative deviations from the null hypothesis, i.e. any role for firm profitability in predicting CEO salaries, whether positively or negatively. We will choose $\alpha = 0.05$, which is the default choice unless specified otherwise.

The test statistic will be the $t$ statistic given in (41). This statistic can be computed in Eviews using either of (37) or (39), with the default choice being (39), which imposes the homoskedasticity assumption. This assumption can frequently be violated in practice, and can be tested for, but we will play it safe for now and use the (37) version of $\hat\omega_{1,n}$, which allows for heteroskedasticity. This requires an additional option to be changed in Eviews. When specifying the regression in Eviews in Figure 16, click on the Options tab to reveal the options shown in Figure 35, and select White for the coefficient covariance matrix as shown. The resulting regression is shown in Figure 36, with the selection of the appropriate White standard errors highlighted. We now have enough information to carry out the hypothesis test. The details are as follows.

1. $H_0 : \beta_1 = 0$

2. $H_1 : \beta_1 \neq 0$

3. Significance level: $\alpha = 0.05$

4. Test statistic: $t = 18.501 / 6.829 = 2.709$

5. Reject $H_0$ if $|t| > c_{0.025} = 1.980$

6. $H_0$ is rejected, so Return on Equity is a significant predictor for CEO Salary.

The critical value of $c_{0.025} = 1.980$ is found from the table of critical values on p.833 of Wooldridge, reproduced in Figure 37. For this regression with $n = 209$, the relevant $t$ distribution has $n - 2 = 207$ degrees of freedom. This many degrees of freedom is not included in the table, so we choose the closest degrees of freedom that is less than this number, i.e. 120. The test is two-sided with significance level of $\alpha = 0.05$, so the critical value of $c_{0.025} = 1.980$ can be read from the third column of critical values in the table.


3.1.8

The assessment for ETC2410/ETC3440 in semester two of 2013 consisted of 40% assignments during the semester and a 60% final exam. Descriptive statistics for these marks, both expressed as percentages, are shown in Figures 38 and 39. It may be of interest to investigate how well assignment marks earned during the semester predict final exam marks. In particular, we would expect that those students who do better on assignments during the semester will go on to also do better on their final exams. The scatter plot in Figure 40 shows that such a relationship potentially does exist in the data, so we will carry out a formal hypothesis test in a regression.

The PRF has the form

$$E(\mathit{exam}_i | \mathit{asgnmt}_i) = \beta_0 + \beta_1 \mathit{asgnmt}_i, \tag{43}$$

and we will test $H_0 : \beta_1 = 0$ (that assignment marks have no predictive power for exam marks) against the one-sided alternative $H_1 : \beta_1 > 0$ (that higher assignment marks predict higher exam marks). The estimates are given in Figure 41, in which the SRF is

$$\widehat{\mathit{exam}}_i = \underset{(5.360)}{23.763} + \underset{(0.095)}{0.548}\, \mathit{asgnmt}_i.$$

The numbers in parentheses below the coefficients are the standard errors. This is a common way of reporting an estimated regression equation, since it provides sufficient information for the reader to carry out some inference themselves if they wish. The hypothesis test of interest proceeds as follows.

1. $H_0 : \beta_1 = 0$

2. $H_1 : \beta_1 > 0$

3. Significance level: $\alpha = 0.05$

4. Test statistic: $t = 0.548 / 0.095 = 5.77$

[Figure 38 histogram: Series ASGNMT, Sample 1 118, Observations 118. Mean 59.674, Median 61.922, Maximum 83.625, Minimum 19.250, Std. Dev. 13.271, Skewness -0.738, Kurtosis 3.340, Jarque-Bera probability 0.0035.]

Figure 38: Assignment marks for ETC2410 / ETC3440 in semester two of 2013.

[Figure 39 histogram: Series EXAM, Sample 1 118, Observations 118. Mean 56.492, Median 56.620, Maximum 93.247, Minimum 0.000, Std. Dev. 16.743, Skewness -0.790, Kurtosis 4.710, Jarque-Bera probability 0.000002.]

Figure 39: Exam marks for ETC2410 / ETC3440 in semester two of 2013.

5. Reject $H_0$ if $t > c_{0.05} = 1.662$

6. $H_0$ is rejected, so there is evidence that higher assignment marks predict significantly higher final exam marks.

The critical value in this case is found from the table in Figure 37 using 90 degrees of freedom ($n - 2 = 116$ in this case) and the column corresponding to the $\alpha = 0.05$ level of significance for a one-sided test.

3.1.9 p-values

A convenient alternative way to express a decision rule for a hypothesis test is to use p-values rather than critical values, where they are available.

First consider testing $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 > 0$. The critical value for this $t$ test is $c_{0.05}$ as shown in Figure 32. Recall that $c_{0.05}$ is defined to satisfy $\Pr(t_{n-2} > c_{0.05}) = 0.05$, which means that the area under the $t_{n-2}$ distribution to the right of $c_{0.05}$ is 0.05. Any value of the test statistic $t$ that falls above $c_{0.05}$ leads to a rejection of the null hypothesis, and the area under the $t_{n-2}$ distribution to the right of such a value of $t$ must be less than 0.05. So instead of defining a decision


[Figure 40: scatter plot of EXAM (vertical axis) against ASGNMT (horizontal axis).]

rule in terms of $t > c_{0.05}$, we could equivalently define the decision in terms of $\Pr(t_{n-2} > t) < 0.05$. That is, the decision rules "reject $H_0$ if $t > c_{0.05}$" and "reject $H_0$ if $\Pr(t_{n-2} > t) < 0.05$" yield identical tests. Similarly if we are testing $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 < 0$, the decision rules "reject $H_0$ if $t < -c_{0.05}$" and "reject $H_0$ if $\Pr(t_{n-2} < t) < 0.05$" yield identical tests.

For the two-sided problem $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 \neq 0$, the decision rule is "reject $H_0$ if $|t| > c_{0.025}$". Recall that $c_{0.025}$ is defined to satisfy $\Pr(t_{n-2} > c_{0.025}) = 0.025$, see Figure 34. The condition $|t| > c_{0.025}$ therefore implies that $\Pr(t_{n-2} > |t|) < 0.025$, because $|t|$ is further out into the tail of the $t_{n-2}$ distribution than $c_{0.025}$. Multiplying this inequality by 2 gives $2\Pr(t_{n-2} > |t|) < 0.05$, so the critical value decision rule "reject $H_0$ if $|t| > c_{0.025}$" is equivalent to "reject $H_0$ if $2\Pr(t_{n-2} > |t|) < 0.05$".

It is conventional in econometrics and statistics (and in Eviews!) to define the p-value for a regression $t$ statistic as

$$p = 2\Pr(t_{n-2} > |t|). \tag{44}$$

Therefore the decision rule for testing $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 \neq 0$ is

reject $H_0$ if $p < 0.05$,

where $p$ is the value printed out by Eviews under the "Prob." column of the regression output. The two-sided test of the significance of $\mathit{RoE}_i$ in the model (42) can be re-expressed in terms of p-values as follows.

1. $H_0 : \beta_1 = 0$

2. $H_1 : \beta_1 \neq 0$

3. Significance level: $\alpha = 0.05$

5. Reject $H_0$ if $p < 0.05$

6. $H_0$ is rejected, so Return on Equity is a significant predictor for CEO Salary.

The p-value in item 4 is read directly from the regression output in Figure 36. Clearly having a p-value available makes the hypothesis test more convenient to carry out because it is not necessary to look up or compute a critical value. The vast majority of hypothesis tests computed in modern econometrics and statistics software are accompanied by a p-value for easy testing.

For testing against one-sided alternative hypotheses, a small modification is required. In the introductory discussion it was shown that the decision rule for testing $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 > 0$ is to reject $H_0$ if $\Pr(t_{n-2} > t) < 0.05$. If $t \geq 0$ then $\Pr(t_{n-2} > t) = \Pr(t_{n-2} > |t|) = p/2$ using (44). On the other hand if $t < 0$ then $\Pr(t_{n-2} > t) = 1 - \Pr(t_{n-2} > |t|) > 0.5$ (by the symmetry of the $t_{n-2}$ distribution), so $H_0$ will never be rejected if $t < 0$. This makes intuitive sense since $t < 0$ can only occur if $\hat\beta_1 < 0$, and an estimate $\hat\beta_1 < 0$ cannot provide evidence to reject $H_0 : \beta_1 = 0$ in favour of $H_1 : \beta_1 > 0$. So the decision rule for testing $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 > 0$ is to reject $H_0$ if $t > 0$ and $p/2 < 0.05$, or more simply

reject $H_0$ if $t > 0$ and $p < 0.10$.

That is, to carry out a one-sided test at the 5% level of significance, the comparison of the p-value is made with 0.10, not 0.05. The reason is that the p-value provided by Eviews is (44), which is for testing against two-sided alternatives.
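These decision rules can be sketched in code. The following pure-Python example is not part of the course materials: the tail-probability routine is an illustrative numerical integration of the $t$ density, and the statistic $t = 2.709$ is the CEO salary value $18.501/6.829$ with $n - 2 = 207$ degrees of freedom. It evaluates the two-sided p-value (44) and applies the one-sided rule "reject if $t > 0$ and $p < 0.10$".

```python
import math

def t_upper_tail(x, df, steps=20000, hi=60.0):
    """Pr(t_df > x) by trapezoid-rule integration of the t density."""
    if x >= hi:
        return 0.0
    k = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: k * (1.0 + u * u / df) ** (-(df + 1) / 2)
    h = (hi - x) / steps
    total = 0.5 * (pdf(x) + pdf(hi))
    total += sum(pdf(x + i * h) for i in range(1, steps))
    return total * h

def two_sided_p(t, df):
    """p = 2 Pr(t_df > |t|), the p-value defined in (44)."""
    return 2.0 * t_upper_tail(abs(t), df)

def reject_upper_one_sided(t, df, alpha=0.05):
    """Decision rule for H1: beta_1 > 0 -- reject if t > 0 and p < 2 * alpha."""
    return t > 0 and two_sided_p(t, df) < 2 * alpha

t_stat, df = 2.709, 207  # CEO salary regression: t statistic and degrees of freedom
print(round(two_sided_p(t_stat, df), 4))   # two-sided p-value, well below 0.05
print(reject_upper_one_sided(t_stat, df))  # True
print(reject_upper_one_sided(-t_stat, df)) # False: a negative t never rejects H1: beta_1 > 0
```

The last line makes the point in the text concrete: however small the two-sided p-value, a negative estimate cannot support the upper-tailed alternative.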

The decision rule for testing $H_0 : \beta_1 = 0$ against $H_1 : \beta_1 < 0$ is the mirror image of the upper-tailed version, that is,

reject $H_0$ if $t < 0$ and $p < 0.10$,

so that the null is rejected only for negative estimates of $\beta_1$ whose p-value is less than 0.10.

The one-sided test of the significance of assignment marks in (43) can therefore be re-expressed as follows.

1. $H_0 : \beta_1 = 0$

2. $H_1 : \beta_1 > 0$

3. Significance level: $\alpha = 0.05$

5. Reject $H_0$ if $t > 0$ and $p < 0.10$

6. $H_0$ is rejected, so there is evidence that higher assignment marks predict significantly higher final exam marks.

The outcome of a hypothesis test carried out using critical values or p-values will always be the same; the choice comes down to one of convenience. Most often p-values are more convenient and most often used in practice, and we will generally rely on them from now on.

3.1.10

By far the most common hypotheses tested in regression models have the null in the form $H_0 : \beta_1 = 0$. However there are other null hypotheses that can also be of interest. For example, in the exam marks application we might want to test whether an extra 1% gained on assignment marks predicts an extra 1% gained on the final exam. In the regression model (43), this would translate to a null hypothesis of the form $H_0 : \beta_1 = 1$.

In general, consider testing a null hypothesis of the form $H_0 : \beta_1 = b_1$, where $b_1$ is a specified number (e.g. 0, 1, etc.). The $t$ statistic for testing this null hypothesis is

$$t = \frac{\hat\beta_1 - b_1}{\hat\omega_{1,n}}. \tag{45}$$

Obviously this reduces to (41) when $b_1 = 0$. The decision rules presented above remain unchanged, both for critical values and p-values.

To illustrate, consider testing $H_0 : \beta_1 = 1$ against $H_1 : \beta_1 \neq 1$ in the exam regression (43). From the results in Figure 41 we can calculate

$$t = \frac{0.548 - 1}{0.095} = -4.758. \tag{46}$$

The hypothesis test using critical values can then proceed as follows.

1. $H_0 : \beta_1 = 1$

2. $H_1 : \beta_1 \neq 1$

3. Significance level: $\alpha = 0.05$

4. Test statistic: $t = -4.758$

5. Reject $H_0$ if $|t| > c_{0.025} = 1.987$

6. $H_0$ is rejected, so the predicted change in final exam scores corresponding to a 1% higher assignment score is significantly different from 1%.

Note that a two-sided alternative was used in this case because there is no prior expectation before the analysis whether the coefficient should be greater than one or less than one. Having estimated the regression it appears the coefficient is less than one, but we must not use that information to formulate the alternative hypothesis.

In order to avoid re-calculating the $t$ statistic manually as we did above, and also in order to obtain convenient p-values, the regression model can be re-estimated in a form that makes testing a null hypothesis $H_0 : \beta_1 = b_1$ very easy. Suppose in general we have a PRF of the form

$$E(y_i | x_i) = \beta_0 + \beta_1 x_i.$$

Subtracting $b_1 x_i$ from both sides gives

$$E(y_i - b_1 x_i | x_i) = \beta_0 + (\beta_1 - b_1) x_i,$$

which is a PRF with dependent variable $(y_i - b_1 x_i)$ and slope coefficient $(\beta_1 - b_1)$. The null hypothesis $H_0 : \beta_1 = b_1$ can therefore be re-expressed as the hypothesis that the slope is zero in the regression of $(y_i - b_1 x_i)$ on $x_i$.

For testing $H_0 : \beta_1 = 1$ against $H_1 : \beta_1 \neq 1$ in the exam regression (43), the PRF is re-written

$$E(\mathit{exam}_i - \mathit{asgnmt}_i | \mathit{asgnmt}_i) = \beta_0 + (\beta_1 - 1)\, \mathit{asgnmt}_i.$$

The results from regressing $(\mathit{exam}_i - \mathit{asgnmt}_i)$ on $\mathit{asgnmt}_i$ are given in Figure 42, which shows a slope estimate of $-0.452$ with $t = -4.746$. (This latter $t$ statistic differs from (46) only because of the rounding error induced by the calculation in (46) being carried out using three decimal places in numerator and denominator. Without this rounding error, the two would be identical.) The hypothesis test in terms of p-values then proceeds as follows.

1. $H_0 : \beta_1 = 1$

2. $H_1 : \beta_1 \neq 1$

3. Significance level: $\alpha = 0.05$

5. Reject $H_0$ if $p < 0.05$

6. $H_0$ is rejected, so the predicted change in final exam scores corresponding to a 1% higher assignment score is significantly different from 1%.

The same conclusions will always be found from this approach (re-specifying the regression) and the previous approach that manually computes the $t$ statistic and uses a critical value. It will usually be more convenient in practice to re-specify the regression and use the p-value that is then automatically provided.
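The algebra behind the re-specification trick is easy to verify numerically. In this Python sketch the data are hypothetical toy numbers, used only to illustrate the point: regressing $(y_i - b_1 x_i)$ on $x_i$ produces a slope exactly equal to $\hat\beta_1 - b_1$.

```python
def ols_line(x, y):
    """OLS intercept and slope of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

# Hypothetical toy data, purely to illustrate the algebra.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1]

_, b1_hat = ols_line(x, y)
b1_null = 1.0  # hypothesised value under H0: beta_1 = 1
_, shifted_slope = ols_line(x, [yi - b1_null * xi for xi, yi in zip(x, y)])

# The slope of the transformed regression equals b1_hat - b1_null exactly.
print(abs(shifted_slope - (b1_hat - b1_null)) < 1e-12)  # True
```

Because the transformed regression has the same standard error for its slope, its printed $t$ statistic and p-value are exactly those for $H_0 : \beta_1 = b_1$, which is why the Eviews trick in the text works.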

3.2 Confidence intervals

Confidence intervals provide an alternative method for summarising the uncertainty due to sampling in coefficient estimates. A confidence interval is a pair of numbers that form an interval within which the true value of the parameter is contained with a pre-specified probability. This probability, called the confidence level, is typically chosen to be $1 - \alpha$, where $\alpha$ is the usual significance level used in hypothesis tests. So, for a regression coefficient $\beta_1$, the aim is to find numbers $\underline\beta_1$ and $\bar\beta_1$ such that

$$\Pr\left(\underline\beta_1 \leq \beta_1 \leq \bar\beta_1\right) = 1 - \alpha. \tag{47}$$

The derivation of the confidence interval follows from hypothesis tests of the form $H_0 : \beta_1 = b_1$ against $H_1 : \beta_1 \neq b_1$. If we imagine testing these hypotheses for all possible values of $b_1$, the confidence interval is formed by those values of $b_1$ for which $H_0 : \beta_1 = b_1$ is not rejected using a two-sided $t$ test with significance level $\alpha$. To show where this leads, for any $b_1$ the null hypothesis $H_0 : \beta_1 = b_1$ is not rejected if the $t$ statistic in (45) satisfies $|t| \leq c_{\alpha/2}$, which implies

$$-c_{\alpha/2} \leq \frac{\hat\beta_1 - b_1}{\hat\omega_{1,n}} \leq c_{\alpha/2}. \tag{48}$$

That is, $H_0 : \beta_1 = b_1$ is not rejected for any $b_1$ satisfying

$$\hat\beta_1 - c_{\alpha/2}\hat\omega_{1,n} \leq b_1 \leq \hat\beta_1 + c_{\alpha/2}\hat\omega_{1,n}.$$

This gives

$$\left[\underline\beta_1, \bar\beta_1\right] = \left[\hat\beta_1 - c_{\alpha/2}\hat\omega_{1,n},\ \hat\beta_1 + c_{\alpha/2}\hat\omega_{1,n}\right], \tag{49}$$

which is the desired confidence interval. It has the desired level because when $b_1$ is the true value of the parameter, the null hypothesis $H_0 : \beta_1 = b_1$ is rejected with probability $\alpha$ (this is the definition of the significance level of the test), which implies that it is not rejected with probability $1 - \alpha$. Therefore the true value $\beta_1$ is included in the confidence interval (49) with probability $1 - \alpha$, as required.

To illustrate, consider a confidence interval for the slope coefficient in the salary PRF (42). From the results in Figure 36 we see that $\hat\beta_1 = 18.501$ and $\hat\omega_{1,n} = 6.829$. The critical value for a two-sided $t$ test with significance level $\alpha = 0.05$ is $c_{0.025} = 1.980$. The 95% confidence interval for $\beta_1$ is therefore

$$\left[\underline\beta_1, \bar\beta_1\right] = [18.501 - 1.980 \times 6.829,\ 18.501 + 1.980 \times 6.829] = [4.980,\ 32.022]. \tag{50}$$

The interpretation of this interval is that it contains the true value of $\beta_1$ with probability 95%. (In fact this probability of 95% is an approximation because the distribution of $t$ in (38) on which it is based is also approximate. In practice though we usually just talk about a 95% confidence interval, rather than an "approximate" or "asymptotic" 95% confidence interval.) The 95% confidence interval (or interval estimate) of the coefficient implies that an increase of 1% in a firm's Return on Equity predicts an increase in CEO salary of between $4,980 and $32,022.

A confidence interval provides a convenient and informative way to report the findings of a regression. The mid-point of the interval is the point estimate $\hat\beta_1$, while its width represents how much uncertainty there is about the estimate. A narrow confidence interval implies the sample has provided a precise estimate of the coefficient. The width of the confidence interval is determined by the standard error $\hat\omega_{1,n}$, so a small standard error implies a precise estimate and a narrow confidence interval.
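The calculation in (50) is simple arithmetic, sketched here in Python using the numbers quoted from Figure 36; the `inside` helper ties the interval back to two-sided hypothesis tests.

```python
def confidence_interval(estimate, std_err, crit):
    """Two-sided confidence interval (49): estimate +/- crit * std_err."""
    return estimate - crit * std_err, estimate + crit * std_err

def inside(ci, b):
    """Is the hypothesised value b inside the interval (i.e. H0: beta = b not rejected)?"""
    return ci[0] <= b <= ci[1]

# CEO salary regression: estimate, standard error, and two-sided 5% critical value.
ci = confidence_interval(18.501, 6.829, 1.980)
print(round(ci[0], 3), round(ci[1], 3))  # approximately 4.980 and 32.022
print(inside(ci, 0.0))   # False: H0 beta_1 = 0 is rejected
print(inside(ci, 10.0))  # True: H0 beta_1 = 10 is not rejected
```

The two `inside` calls preview the point made below: the interval summarises at a glance which null values a two-sided $t$ test would and would not reject.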

From a hypothesis testing perspective, the confidence interval provides a nice summary of all the null hypotheses that would not be rejected by a two-sided $t$ test (those values within the interval) and all of the null hypotheses that would be rejected (those values outside the interval). Clearly this is much more informative than simply reporting a coefficient estimate and whether or not it is significantly different from zero (which does happen sometimes...). A confidence interval that does not include zero, such as (50) above, immediately conveys the information that the coefficient estimate is significantly different from zero, but it contains much more information as well.

These ideas also emphasise why in a hypothesis test we never claim to accept $H_0$, we only say that we do not reject $H_0$. Consider the confidence interval $\left[\underline\beta_1, \bar\beta_1\right] = [4.980, 32.022]$ constructed above. This implies that $H_0 : \beta_1 = b_1$ would not be rejected for all $b_1$ between 4.980 and 32.022. It would be illogical to say that we accept $H_0 : \beta_1 = 5$ and $H_0 : \beta_1 = 10$ and $H_0 : \beta_1 = 25$ and so on; we cannot accept that $\beta_1$ is equal to several different values at once! Instead we say that the sample does not provide sufficient evidence to reject those values at the specified level of significance.

3.3 Prediction intervals

Suppose we want to make a prediction of $y_i$ for a particular fixed value $x$ of $x_i$. For example, to predict average CEO salary for Return on Equity of $x = 15\%$, or final exam marks for an assignment mark of $x = 75\%$. The prediction is given by

$$\hat y(x) = \hat\beta_0 + \hat\beta_1 x, \tag{51}$$

which estimates the population conditional mean $\mu_y(x) = E(y_i | x_i = x) = \beta_0 + \beta_1 x$. A confidence interval can be calculated for this conditional mean, i.e. an interval $\left[\underline\mu_y(x), \bar\mu_y(x)\right]$ such that

$$\Pr\left(\underline\mu_y(x) \leq \mu_y(x) \leq \bar\mu_y(x)\right) = 1 - \alpha;$$

compare to (47) for $\beta_1$.

The distribution of $\hat y(x)$ as an estimator of $\mu_y(x)$ is

$$\hat y(x) \overset{a}{\sim} N\left(\mu_y(x), \omega_{n,\mu}^2\right),$$

where

$$\omega_{n,\mu}^2 = E\left[\sum_{i=1}^n \left(\frac{1}{n} + (x - \bar x)\, a_{n,i}\right)^2 u_i^2\right]$$

and $a_{n,i}$ was given in (24). This leads to the prediction interval

$$\left[\underline\mu_y(x), \bar\mu_y(x)\right] = \left[\hat y(x) - c_{\alpha/2}\hat\omega_{n,\mu},\ \hat y(x) + c_{\alpha/2}\hat\omega_{n,\mu}\right], \tag{52}$$

where

$$\hat\omega_{n,\mu}^2 = \sum_{i=1}^n \left(\frac{1}{n} + (x - \bar x)\, a_{n,i}\right)^2 \hat u_i^2.$$

There is a convenient way to compute $\hat y(x)$ and $\hat\omega_{n,\mu}^2$ without dealing with the formula directly. If we take the usual SRF $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$ and subtract the prediction formula at $x$ given by (51), we obtain

$$\hat y_i = \hat y(x) + \hat\beta_1 (x_i - x).$$

This shows that an OLS regression of $y_i$ on an intercept and $(x_i - x)$ will provide an intercept that corresponds to $\hat y(x)$, and then the $\hat\omega_{n,\mu}$ required for the confidence interval is simply the standard error on this estimate.

As an example, consider making a prediction for the average final exam mark for an assignment mark of $x = 75\%$. A regression in Eviews specified as "exam c (asgnmt-75)" will produce an intercept corresponding to $\hat y(75)$. The Eviews output is shown in Figure 43. The prediction is $\hat y(75) = 64.90\%$, with standard error $\hat\omega_{n,\mu} = 2.34$. The 95% prediction interval based on (52) is therefore

$$\left[\underline\mu_y(75), \bar\mu_y(75)\right] = [64.90 - 1.987 \times 2.34,\ 64.90 + 1.987 \times 2.34] = [60.25,\ 69.55],$$

where $c_{0.025} = 1.987$ is obtained from the $t$ distribution table with 90 degrees of freedom (the closest to $n - 2 = 116$ in this example). The interpretation of this interval is that it contains the population conditional mean $\mu_y(75) = E(\mathit{exam}_i | \mathit{asgnmt}_i = 75)$ with probability of 95%.

3.3.1 Derivations

These derivations of the distribution of the prediction follow easily from the preceding derivations we did for $\bar y$ and $\hat\beta_1$, but this subsection is not required for the course.

Figure 43: Predicting final exam mark for an assignment mark of 75%

First recall the representation

$$\hat\beta_1 = \sum_{i=1}^n a_{n,i}\, y_i,$$

where

$$a_{n,i} = \frac{x_i - \bar x}{\sum_{i=1}^n (x_i - \bar x)^2},$$

in which $\sum_{i=1}^n a_{n,i} = 0$ and $\sum_{i=1}^n a_{n,i} x_i = 1$. Also

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x = \sum_{i=1}^n \left(\frac{1}{n} - \bar x\, a_{n,i}\right) y_i.$$

Substituting these into (51) gives

$$\hat y(x) = \sum_{i=1}^n \left(\frac{1}{n} + (x - \bar x)\, a_{n,i}\right) y_i,$$

which shows that $\hat y(x)$ is a weighted sum of $y_1, \ldots, y_n$. Its mean and variance can therefore be derived in the same way as we did for $\hat\beta_1$.

The mean of $\hat y(x)$ is

$$
\begin{aligned}
E[\hat y(x)] &= E\left[\sum_{i=1}^n \left(\frac{1}{n} + (x - \bar x)\, a_{n,i}\right) E(y_i | x_1, \ldots, x_n)\right] && \text{by the LIE} \\
&= E\left[\sum_{i=1}^n \left(\frac{1}{n} + (x - \bar x)\, a_{n,i}\right) (\beta_0 + \beta_1 x_i)\right] && \text{substituting the PRF} \\
&= E\left[\beta_0 + \beta_1 \bar x + \beta_1 (x - \bar x)\right] && \text{using } \textstyle\sum_{i=1}^n a_{n,i} = 0,\ \sum_{i=1}^n a_{n,i} x_i = 1 \\
&= \beta_0 + \beta_1 x = \mu_y(x),
\end{aligned}
$$

so $\hat y(x)$ is unbiased for $\mu_y(x)$.

The variance is

$$\omega_{n,\mu}^2 = \operatorname{var}(\hat y(x)) = E\left[\sum_{i=1}^n \left(\frac{1}{n} + (x - \bar x)\, a_{n,i}\right)^2 u_i^2\right],$$

which is estimated by

$$\hat\omega_{n,\mu}^2 = \sum_{i=1}^n \left(\frac{1}{n} + (x - \bar x)\, a_{n,i}\right)^2 \hat u_i^2,$$

where $\hat u_i$ are the OLS residuals.

The approximate normality of $\hat y(x)$ follows from the Central Limit Theorem.

4 Multiple Regression

An extremely useful feature of regression modelling is that it easily allows for the inclusion of

more than one explanatory variable. This is very useful for interpreting the roles of individual

explanatory variables and potentially for improving predictions. The techniques for OLS estimation and inference that we have discussed for simple regression extend straightforwardly to

the multiple regression setting. The models and methods will be discussed here, with formulae

postponed until the section on matrix notation for regression.

4.1

A linear PRF with multiple explanatory variables $x_{1,i}, \ldots, x_{k,i}$ takes the form

$$E(y_i | x_{1,i}, \ldots, x_{k,i}) = \beta_0 + \beta_1 x_{1,i} + \ldots + \beta_k x_{k,i}. \tag{53}$$

That is, the population conditional mean of $y_i$ given $x_{1,i}, \ldots, x_{k,i}$ is specified as a weighted sum of $x_{1,i}, \ldots, x_{k,i}$.

The interpretation of the coefficients $\beta_1, \ldots, \beta_k$ is similar to that in a simple regression, with an important qualification. To interpret $\beta_1$, consider the predicted value of $y_i$ with $x_{1,i}$ increased by one unit and with $x_{2,i}, \ldots, x_{k,i}$ unchanged:

$$E(y_i | x_{1,i} + 1, \ldots, x_{k,i}) = \beta_0 + \beta_1 (x_{1,i} + 1) + \ldots + \beta_k x_{k,i}.$$

Then

$$E(y_i | x_{1,i} + 1, \ldots, x_{k,i}) - E(y_i | x_{1,i}, \ldots, x_{k,i}) = \beta_1,$$

so that we interpret $\beta_1$ as the change in the prediction of $y_i$ corresponding to a one unit increase in $x_{1,i}$, holding $x_{2,i}, \ldots, x_{k,i}$ constant. This aspect of holding all of the other explanatory variables constant leads to the regression coefficient being called a marginal effect or partial effect. In general, for any $j = 1, \ldots, k$, the parameter $\beta_j$ is the change in the predicted value of $y_i$ corresponding to a one unit increase in $x_{j,i}$, holding $x_{h,i}$ constant for all $h \neq j$.

The intercept $\beta_0$ only has a meaningful interpretation if it makes sense for all of $x_{1,i}, \ldots, x_{k,i}$ to take the value zero. In that case $\beta_0$ is the predicted value of $y_i$ when $x_{1,i} = \ldots = x_{k,i} = 0$.

4.2

The SRF takes the form

$$\hat y_i = \hat\beta_0 + \hat\beta_1 x_{1,i} + \ldots + \hat\beta_k x_{k,i}, \tag{54}$$

where $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ are the values that minimise the sum of squared residuals

$$\mathrm{SSR}(b_0, b_1, \ldots, b_k) = \sum_{i=1}^n (y_i - b_0 - b_1 x_{1,i} - \ldots - b_k x_{k,i})^2.$$

The separate formulae for $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ are messy and omitted for now, but can easily be expressed in matrix notation later. The OLS residuals are denoted

$$\hat u_i = y_i - \hat y_i = y_i - \hat\beta_0 - \hat\beta_1 x_{1,i} - \ldots - \hat\beta_k x_{k,i}.$$

The $R^2$ for the regression is

$$R^2 = \frac{\mathrm{SSE}}{\mathrm{SST}} = \frac{\sum_{i=1}^n (\hat y_i - \bar y)^2}{\sum_{i=1}^n (y_i - \bar y)^2},$$

which has the same derivation, properties and interpretation as the $R^2$ in a simple regression. That is, $R^2$ measures the proportion of the variance in $y_i$ explained by the regression.

4.3

An example data set from Chapter 4 of Wooldridge contains the following data on house prices and explanatory variables.

price: selling price of the house ($000)
assess: assessed value prior to sale ($000)
lotsize: size of the block in square feet
sqrft: size of the house in square feet
bdrms: number of bedrooms

The histogram and descriptive statistics for the dependent variable price are shown in Figure 44. It may be expected that increases in each of the explanatory variables assess, lotsize, sqrft, bdrms would predict an increase in the selling price of a house. The PRF in this case is

$$E(\mathit{price}_i | \mathit{assess}_i, \mathit{lotsize}_i, \mathit{sqrft}_i, \mathit{bdrms}_i) = \beta_0 + \beta_1 \mathit{assess}_i + \beta_2 \mathit{lotsize}_i + \beta_3 \mathit{sqrft}_i + \beta_4 \mathit{bdrms}_i. \tag{55}$$

The specification of a multiple regression in Eviews simply involves a list of variables as shown in Figure 45, with the dependent variable price first, followed by the explanatory variables. The results are shown in Figure 46. The SRF can be written

$$\widehat{\mathit{price}}_i = \underset{(23.77)}{-38.89} + \underset{(0.119)}{0.908}\, \mathit{assess}_i + \underset{(0.000210)}{0.000587}\, \mathit{lotsize}_i - \underset{(0.0174)}{0.000517}\, \mathit{sqrft}_i + \underset{(5.55)}{11.60}\, \mathit{bdrms}_i,$$

$$n = 88, \quad R^2 = 0.83.$$

The intercept of $\hat\beta_0 = -38.89$ has no meaningful interpretation since none of the explanatory variables would reasonably take the value zero. The slope coefficients are interpreted as follows.

1. $\hat\beta_1 = 0.908$: an increase in assessed value of a house of $1,000 predicts an increase in the sale price of $908, holding lot size, house size and number of bedrooms fixed. That is, the coefficient measures the effect of variations in assessed value for a house of particular size. It is therefore capturing variations in other aspects that affect the price of the house besides its size, for example, its kitchen and bathroom quality, its suburb, proximity to transport, shops, schools and major roads, architectural style, renovated or not, and so on.


[Figure 44 histogram: Series PRICE, Sample 1 88, Observations 88. Mean 293.546, Median 265.500, Maximum 725.000, Minimum 111.000, Std. Dev. 102.713, Skewness 1.999, Kurtosis 8.394, Jarque-Bera 165.279, Probability 0.000000.]

2. $\hat\beta_2 = 0.000587$: each extra square foot of lot size predicts an increase in sale price of 58.7 cents, holding the other explanatory variables fixed. The interpretation could equivalently be expressed as saying that an extra 1000 square feet of lot size predicts an increase in sale price of $587, which may make the magnitudes more relevant. Note that this coefficient measures the effect of lot size on average sale price while holding house size and bedrooms fixed. That is, it measures the effect of a larger lot for a house of a given size. It does not measure the effect of a larger lot size with a larger house on it. It isolates the effect of lot size alone.

3. $\hat\beta_3 = -0.000517$: each extra square foot of house size predicts a decrease in sale price of 51.7 cents, holding the other explanatory variables fixed. This finding is highly counter-intuitive, but when we look at $t$ tests in this regression it will be seen that the coefficient is not significantly different from zero, so this interpretation can be ignored.

4. $\hat\beta_4 = 11.60$: each extra bedroom in a house predicts an increase in sale price of $11,600. Note that this interpretation holds house size constant, so it is specifically measuring the effect of number of bedrooms, not overall size of house. Generally these two variables would be positively related (a correlation of 0.53 in this sample) but the multiple regression allows their effects to be estimated separately.

The regression has an $R^2$ of 83% and so explains a high proportion of the variation in selling prices of houses in this sample.

4.4 Statistical Inference

The derivations of OLS properties in multiple regression are simple in matrix notation, but messy otherwise. For now they are simply stated. If $(y_i, x_i)_{i=1}^n$ are i.i.d. and the PRF is given by (53), each OLS coefficient $\hat\beta_j$ for $j = 0, 1, \ldots, k$ is unbiased and satisfies

$$\hat\beta_j \overset{a}{\sim} N\left(\beta_j, \omega_{j,n}^2\right), \tag{56}$$

where $\omega_{j,n}^2$ is a variance that depends on the conditional variance $\operatorname{var}(y_i | x_{1,i}, \ldots, x_{k,i})$. The implications are the same as in the simple regression.

Figure 45: Specifying the multiple regression for house prices in Eviews


The $t$ statistic for testing $H_0 : \beta_j = b_j$ is

$$t = \frac{\hat\beta_j - b_j}{\hat\omega_{j,n}},$$

where $\hat\omega_{j,n}$ is the standard error of $\hat\beta_j$ that is computed to estimate $\omega_{j,n}$. As in simple regressions, the computation can be done imposing homoskedasticity (OLS standard errors) or allowing for heteroskedasticity (White's standard errors). The approximate null distribution of this statistic can be derived from (56) and is given by

$$t \overset{a}{\sim} t_{n-k-1},$$

which is the $t$ distribution with $n - k - 1$ degrees of freedom. The degrees of freedom in a multiple regression is the sample size less the number of regression coefficients estimated. The decision rules for a hypothesis test at the $\alpha = 0.05$ significance level are summarised in the following table, in which $c_{0.025}$ and $c_{0.05}$ are critical values from the $t_{n-k-1}$ distribution.

  Rejection rule for $H_0 : \beta_j = b_j$
                               Critical value        p-value
  $H_1 : \beta_j \neq b_j$     $|t| > c_{0.025}$     $p < 0.05$
  $H_1 : \beta_j > b_j$        $t > c_{0.05}$        $t > 0$ and $p < 0.10$
  $H_1 : \beta_j < b_j$        $t < -c_{0.05}$       $t < 0$ and $p < 0.10$

The $1 - \alpha$ confidence interval for $\beta_j$ is given by

$$\left[\underline\beta_j, \bar\beta_j\right] = \left[\hat\beta_j - c_{\alpha/2}\hat\omega_{j,n},\ \hat\beta_j + c_{\alpha/2}\hat\omega_{j,n}\right].$$

To make a prediction from a multiple regression, values $x_1, \ldots, x_k$ need to be specified for the explanatory variables. Then

$$\hat y(x_1, \ldots, x_k) = \hat\beta_0 + \hat\beta_1 x_1 + \ldots + \hat\beta_k x_k.$$

Subtracting this from (54) and rearranging gives

$$\hat y_i = \hat y(x_1, \ldots, x_k) + \hat\beta_1 (x_{1,i} - x_1) + \ldots + \hat\beta_k (x_{k,i} - x_k). \tag{57}$$

That is, $\hat y(x_1, \ldots, x_k)$ can be calculated as the intercept in a regression of $y_i$ on an intercept and $(x_{1,i} - x_1), \ldots, (x_{k,i} - x_k)$.

4.5

The signicance of the regression coe cients reported in Figure 46 can tested using t tests. Here

is the test of whether increased lot size predicts increased sale price.

1. H0 :

=0

2. H1 :

>0

3.

= 0:05

59

5. Decision rule : reject H0 if t > 0 and p < 0:10

6. Reject H0, so increased lot size does predict increased selling price, holding the other three regressors fixed.

The same analysis shows that assessed value and number of bedrooms also predict increased selling price. However the house size (sqrft), with p value of 0.9764, is not significant; the implication is that once we control for the size of the block the house is on and the number of bedrooms, the overall size of the house itself has no further predictive power for the selling price.

It may be of interest to test the null hypothesis H0 : β1 = 1. Under this null we could take the assessed value as being an unbiased predictor of the sale price, in the sense that changes in the assessed value would be matched one-for-one by changes in the predictor of the sale price. The test is as follows.

1. H0 : β1 = 1

2. H1 : β1 ≠ 1

3. α = 0.05

4. t = (β̂1 − 1)/0.119 = −0.773

6. Do not reject H0, so there is no evidence to suggest that assessed value is not an unbiased predictor of the sale price.

A 95% confidence interval can be constructed for β4 in order to give an interval estimate of the contribution of each bedroom to the selling price. The calculation is

  [β̂4 − c0.025 ω̂4,n , β̂4 + c0.025 ω̂4,n] = [11.602 − 2.000 × 5.552, 11.602 + 2.000 × 5.552]
                                          = [0.498, 22.706]

which states that the predicted increase in selling value corresponding to an extra bedroom lies in the interval [$498, $22,706] with confidence level of 95%.

Suppose we want to predict the selling price of a four bedroom house with assessed value of $350,000, lot size of 6000 square feet and house size of 2000 square feet. Following (57), we specify the SRF

  pricêi = pricê(350, 6000, 2000, 4) + β̂1 (assessi − 350) + β̂2 (lotsizei − 6000)
            + β̂3 (sqrfti − 2000) + β̂4 (bdrmsi − 4).

The specification of this SRF in Eviews is shown in Figure 47, with results in Figure 48. This gives pricê(350, 6000, 2000, 4) = 327.913, or a predicted selling price of $327,913. The 95% prediction interval is

  pricê(350, 6000, 2000, 4) ± c0.025 ω̂n = [327.913 − 2.000 × 8.035, 327.913 + 2.000 × 8.035]
                                          = [311.843, 343.983],

so the predicted selling price lies within $311,843 and $343,983 with confidence level of 95%.


Figure 47: Specification of the regression for prediction of selling price of a four bedroom house with assessed value of $350,000, lot size of 6000 square feet and house size of 2000 square feet

Figure 48: OLS regression for predicting the selling price of a four bedroom house with assessed

value of $350,000, lot size of 6000 square feet, house size of 2000 square feet


4.6

In multiple regressions it can be interesting to test hypotheses about more than one coefficient at a time. The most common example is to jointly test that all slope coefficients are equal to zero, which implies that none of the explanatory variables have any predictive power for the dependent variable. The null hypothesis in (53) takes the form

  H0 : β1 = ... = βk = 0,

i.e. all k slope coefficients are set to zero. The alternative hypothesis is

  H1 : at least one of β1, ..., βk not equal to 0,

which covers the possibilities that one or some or all of the slope coefficients are not zero. The alternative hypothesis implies that the regression provides some explanatory power for yi. The most common way of testing this null is using an F test. The test statistic is

  F = [(SSR0 − SSR1)/k] / [SSR1/(n − k − 1)],

where SSR0 is the sum of squared residuals from the SRF under H0:

  ỹi = β̃0

and SSR1 is the sum of squared residuals from the SRF under H1:

  ŷi = β̂0 + β̂1 x1,i + ... + β̂k xk,i.
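The F statistic can be computed directly from the two sums of squared residuals. Below is a minimal numpy sketch with simulated data (hypothetical variables, not the text's house price workfile); it also checks the algebraically equivalent R-squared form F = (R²/k) / ((1 − R²)/(n − k − 1)), which holds because SSR0 here is the total sum of squares.

```python
import numpy as np

# Simulated data for illustration
rng = np.random.default_rng(1)
n, k = 120, 3
X = rng.normal(size=(n, k))
y = 0.5 + X @ np.array([1.0, 0.0, -0.8]) + rng.normal(size=n)

# Restricted model under H0 (intercept only): residuals are y - ybar
ssr0 = np.sum((y - y.mean()) ** 2)

# Unrestricted model: intercept plus all k regressors
Xf = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(Xf, y, rcond=None)[0]
ssr1 = np.sum((y - Xf @ beta) ** 2)

F = ((ssr0 - ssr1) / k) / (ssr1 / (n - k - 1))

# Equivalent R-squared form of the same statistic
r2 = 1 - ssr1 / ssr0
F_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))
print(F, F_r2)  # the two forms agree
```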

The null distribution of the F statistic is an F_{k,n−k−1} distribution; that is, an F distribution with k and n − k − 1 degrees of freedom. Tables of critical values are provided for this distribution in Wooldridge, but Eviews provides convenient p values. For the regression results for house prices in Figure 46, the F statistic is reported (F = 100.7409) along with its p-value (p = 0.0000). The test proceeds as follows.

1. H0 : β1 = β2 = β3 = β4 = 0

2. H1 : at least one of β1, β2, β3, β4 not equal to 0

3. α = 0.05

5. Decision rule : reject H0 if p < 0.05.

6. Reject H0, at least one of the regressors has significant explanatory power for housing prices.

This F test is very convenient and hence popular for this hypothesis, but unfortunately is not valid in the presence of heteroskedasticity.

Just as White's standard errors can be used to construct a t test that is valid in the presence of heteroskedasticity, there is a modification of the F test to allow for heteroskedasticity. The formula is not easily expressed without matrix notation, but the implementation in Eviews is straightforward. To test H0 : β1 = β2 = β3 = β4 = 0 in (55), make sure the regression has been estimated using White's standard errors, and then select View -- Coefficient Diagnostics -- Wald Test - Coefficient Restrictions... as shown in Figure 49. In the subsequent dialogue box shown in Figure 50, specify the null hypothesis for the test. Eviews uses the syntax


Figure 51: Results of the Wald F test on the house price regression.

c(1), c(2), ... corresponding to our regression coefficients β0, β1, ..., so the null hypothesis β1 = β2 = β3 = β4 = 0 is entered as shown in the Figure. The results of the test are shown in Figure 51. The heteroskedasticity-robust F statistic is F = 55.90, with p = 0.0000. The presentation and outcome of the test are therefore unchanged from those given above, but at least now we know that the result is still valid even if there is heteroskedasticity.

It is possible to test other joint hypotheses as well. For example, in the housing regression (55) we might be interested in testing H0 : β2 = β3 = β4 = 0, which would imply that the assessed value of the house fully takes into account all of the information about the size of the house and its block. That is, under H0 the PRF would be

  E(pricei | assessi, lotsizei, sqrfti, bdrmsi) = β0 + β1 assessi,

which states that once we have the assessor's valuation, there is no extra explanatory power in the block size, house size, or number of bedrooms. This could be taken as a test of the efficiency of the assessor's valuation. The Wald test is carried out in Eviews following the same steps as in Figures 49 and 50, except that the hypothesis is now entered as c(3)=0, c(4)=0, c(5)=0. The results are given in Figure 52, resulting in the following hypothesis test.

1. H0 : β2 = β3 = β4 = 0

2. H1 : at least one of β2, β3, β4 not equal to 0

3. α = 0.05

5. Decision rule : reject H0 if p < 0.05.

6. Reject H0, so the size of the house and block have significant predictive power for house prices in addition to that in the assessed value. This is evidence that the assessed value does not efficiently capture all information about the pricing of the house.

The question might be raised: why do we need this joint test of β2, β3, β4 when we can already see from the individual t tests that β̂2 and β̂4 are significantly different from zero? The answer to this lies in the significance level α. In hypothesis testing, we aim to make a decision about a hypothesis with probability of type I error equal to α. If we do three separate t tests to test a single hypothesis about three coefficients then each of these t tests has a significance level of α, so the three of them together have a significance level that will be greater than α. Intuitively there are three opportunities for this procedure to make a type I error instead of just one. So in order to test a hypothesis about three coefficients and keep the significance level controlled at α, it is necessary to do a single test (a Wald test) and not three separate tests.
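The inflation of the overall type I error rate can be quantified under a simplifying independence assumption (three t statistics from one regression are generally correlated, so this is only indicative, not the exact rate):

```python
# Family-wise error rate of m separate tests at level alpha,
# assuming (for illustration) the tests are independent
alpha = 0.05
m = 3  # three separate t tests

fwer = 1 - (1 - alpha) ** m
print(round(fwer, 4))  # about 0.14, well above the nominal 0.05
```

So even in the most favourable independent case, three 5% tests jointly operate at roughly a 14% significance level, which is the motivation for a single joint Wald test.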


As a final example of a joint test, we can combine the hypotheses about the unbiasedness and efficiency of the assessed value of the house as a predictor of the selling price. We can test the null hypothesis

  H0 : β1 = 1 and β0 = β2 = β3 = β4 = 0,

under which the PRF reduces to

  E(pricei | assessi, lotsizei, sqrfti, bdrmsi) = assessi,

which states that the assessor's value is an unbiased predictor of the selling price (adjusts one-for-one and is not systematically too high or low) and is efficient in the sense of capturing all of the size characteristics of the house. The alternative hypothesis is

  H1 : β1 ≠ 1 and/or at least one of β0, β2, β3, β4 not equal to 0.

The Wald test in Eviews is carried out by specifying the null hypothesis as in Figure 53 to give

the results in Figure 54. The hypothesis test is therefore done as follows.

1. H0 : β1 = 1 and β0 = β2 = β3 = β4 = 0

2. H1 : β1 ≠ 1 and/or at least one of β0, β2, β3, β4 not equal to 0

3. α = 0.05

5. Decision rule : reject H0 if p < 0.05.

6. Reject H0, so that joint unbiasedness and efficiency of the assessor's value as a predictor of selling price is rejected.

4.7

Multicollinearity

In multiple regression there are potentially issues associated with the degree of correlation between

the explanatory variables. These go under the general heading of multicollinearity, which refers

to linear relationships among the explanatory variables.

4.7.1

Perfect multicollinearity

Perfect multicollinearity means that there is an exact linear relationship among some or all of the

explanatory variables. This makes the computation of the OLS estimator impossible and Eviews

will return an error message if this is attempted.


A version of the problem can occur in a simple regression if there is no variation in the explanatory variable xi. Suppose we want to estimate the PRF

  E(yi | xi) = β0 + β1 xi,

but that every xi in the sample takes the same value, xi = c for some constant c. For example, we might want to regress yi = wagei on xi = agei, but in our sample every individual is the same age. Without variation in age in our sample, we can't expect to be able to estimate the effect of variations in age on wages. In the formula for the OLS estimator

  β̂1 = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{n} (xi − x̄)²,

note that xi = c for every i implies that x̄ = c and hence that xi − x̄ = 0, giving

  β̂1 = 0/0,

which is undefined. Even in more complicated multiple regression cases, perfect multicollinearity induces this sort of divide-by-zero problem for the OLS estimator.
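The no-variation case can be made concrete with a small numpy sketch (simulated data, not the text's example): the denominator of the OLS slope formula is exactly zero, and a least-squares routine sees the same problem as a rank-deficient design matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = np.full(n, 3.0)            # xi = c for every i: no variation
y = 2.0 + rng.normal(size=n)

# Denominator of the OLS slope formula: sum((xi - xbar)^2) is exactly zero,
# so the slope estimate is the undefined ratio 0/0
denom = np.sum((x - x.mean()) ** 2)
print(denom)

# Equivalently, the design matrix [1, x] has rank 1 rather than 2:
# the column x is an exact multiple of the intercept column
X = np.column_stack([np.ones(n), x])
print(np.linalg.matrix_rank(X))
```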

To illustrate in the house price example, suppose we wanted to include the size of the garden as a possible predictor of selling price. The size of the garden can be taken to be

  gardeni = lotsizei − sqrfti,

that is, that part of the block not taken up by the house. The PRF

  E(pricei | assessi, lotsizei, sqrfti, bdrmsi, gardeni) = β0 + β1 assessi + β2 lotsizei + β3 sqrfti + β4 bdrmsi + β5 gardeni

is then subject to the perfect multicollinearity problem because of the perfect linear relationship between gardeni, lotsizei and sqrfti. To see what happens in Eviews if we try to estimate this regression, we first generate the gardeni variable as in Figure 55 and then specify the regression as in Figure 56. Attempting to estimate this equation gives the error message shown in Figure 57.

Perfect multicollinearity can easily be fixed by removing one or more explanatory variables until the problem disappears. In the example, any one of gardeni, lotsizei or sqrfti can be removed from the regression. The choice of which to drop depends on the practical interpretation of the variables in each case. In this example, it might be argued that the most natural variable to drop is lotsizei, since in the original PRF (55) the interpretation of β2 is really capturing garden size anyway. Recall that β2 measures the change in predicted selling price for a one square foot increase in lot size, holding all the other explanatory variables constant. A one square foot increase in lot size holding house size constant must be a one square foot increase in garden size, so the clarity of the practical interpretation of the model could be improved by including gardeni instead of lotsizei. The results are shown in Figure 58. Nearly all of the results are identical, except for the coefficient on sqrfti, which can be explained by substituting lotsizei = sqrfti + gardeni into (55) to obtain

  E(pricei | assessi, lotsizei, sqrfti, bdrmsi) = β0 + β1 assessi + β2 (sqrfti + gardeni) + β3 sqrfti + β4 bdrmsi
                                                = β0 + β1 assessi + β2 gardeni + (β2 + β3) sqrfti + β4 bdrmsi.

This shows the PRF with gardeni included is the same as that with lotsizei included except that the coefficient on sqrfti is changed to β2 + β3. Therefore the coefficient on sqrfti in Figure 58 (i.e. 0.0000693) is the sum of the coefficients on lotsizei (i.e. 0.000587) and sqrfti (i.e. −0.000517) in Figure 46. The other coefficient estimates and goodness of fit are unchanged, so overall the two regressions are statistically equivalent and the choice can be made on the grounds of which choice of variables is more meaningfully interpretable.
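This reparameterisation result can be checked numerically. The sketch below uses simulated house data (illustrative names and coefficients, not the text's workfile): fitting with lotsize and then with garden = lotsize − sqrft gives the same garden/lotsize coefficient, and the sqrft coefficient in the second fit equals the sum of the lotsize and sqrft coefficients from the first.

```python
import numpy as np

# Simulated data obeying the exact identity lotsize = sqrft + garden
rng = np.random.default_rng(3)
n = 100
assess = rng.uniform(100, 500, n)
sqrft = rng.uniform(1000, 3000, n)
garden = rng.uniform(2000, 8000, n)
lotsize = sqrft + garden
bdrms = rng.integers(2, 6, n).astype(float)
price = 20 + 0.9 * assess + 0.01 * lotsize + 0.05 * sqrft + 5 * bdrms + rng.normal(size=n)

def ols(cols, y):
    # OLS with an intercept prepended to the given regressor columns
    X = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_lot = ols([assess, lotsize, sqrft, bdrms], price)  # [b0, b1, b2, b3, b4]
b_gar = ols([assess, garden, sqrft, bdrms], price)   # garden in place of lotsize

# garden coefficient = lotsize coefficient; sqrft coefficient becomes b2 + b3
print(b_gar[2] - b_lot[2])
print(b_gar[3] - (b_lot[2] + b_lot[3]))
```

Both regressions span the same column space, so fitted values, residuals and goodness of fit are identical, exactly as the text observes for Figures 46 and 58.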


Figure 56: Attempting to estimate the house price regression with gardeni included.

Figure 57: The Eviews error message when there is perfect multicollinearity.


Figure 58: House price regression with garden size instead of lot size.

4.7.2

Imperfect multicollinearity

Imperfect multicollinearity is a situation where some or all of the regressors are highly correlated with each other, but without the exact linear relationship that implies perfect multicollinearity. Imperfect multicollinearity does not invalidate any assumptions of OLS estimation, so computation of the estimator can proceed and its unbiasedness and distributional properties hold. The issue with imperfect multicollinearity is that the standard errors of the estimated regression coefficients can be quite large as a result, implying the estimates are not very precise and hence confidence intervals will be quite wide. One symptom of imperfect multicollinearity is a regression whose coefficients are insignificant according to the individual t tests (because of the large standard errors) but are significant according to the joint F test (or its Wald heteroskedasticity-consistent variant). More details are given in Wooldridge pp. 94-97 for the homoskedastic case.
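The standard-error inflation can be illustrated by simulation. The sketch below (simulated data; it uses the homoskedastic OLS covariance formula s²(X'X)⁻¹, matching the case Wooldridge discusses) compares the standard error of a slope when the two regressors are uncorrelated versus very highly correlated.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)

def slope_se(rho):
    # Build x2 with correlation roughly rho to x1, then fit y = 1 + x1 + x2 + noise
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
    y = 1 + x1 + x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    s2 = resid @ resid / (n - 3)
    cov = s2 * np.linalg.inv(X.T @ X)   # homoskedastic OLS covariance
    return np.sqrt(cov[1, 1])           # standard error of the x1 slope

se_low, se_high = slope_se(0.0), slope_se(0.99)
print(se_low, se_high)  # the second is several times larger
```

The fit itself remains unbiased in both cases; only the precision of the individual coefficients deteriorates, which is exactly the symptom described above.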

5 Dummy Variables

A dummy variable (or indicator variable) can be used to include qualitative or categorical variables

in a regression. In the simplest case this refers to variables for which there are two categories,

for example an individual can be categorised as male/female or employed/unemployed or have

some/no private health insurance and so on. The inclusion of such characteristics in regression

models can be extremely informative.

5.1

Consider the CEO salary data again. The sample of n = 209 CEOs has been drawn from several industries, one of which is summarised as Utility, which includes firms in the transport and utilities industries (utilities includes electricity, gas and water firms). We can define a dummy

variable, or indicator variable, as

  utilityi = 1, if firm i is in transport or utilities
             0, if firm i is in any other industry

[Figure 59: Histogram and descriptive statistics for UTILITY (Sample 1 209, Observations 209): mean 0.172249, median 0.000000, maximum 1.000000, minimum 0.000000, std. dev. 0.378503, skewness 1.735986, kurtosis 4.013648.]

A histogram of this variable is shown in Figure 59, where it can be seen the variable takes only the values 0 or 1. There are 36 firms in the sample in transport and utilities, with 173 in other industries. The mean of this variable is therefore (1/209) Σ_{i=1}^{209} utilityi = 36/209 = 0.1722, as shown in the Figure.

Consider the simple regression

  E(salaryi | utilityi) = β0 + β1 utilityi.   (58)

If firm i is not in transport or utilities then utilityi = 0, giving

  E(salaryi | utilityi = 0) = β0,

so that β0 is the population mean of CEO salaries across all industries except transport and utilities. If firm i is in either transport or utilities then utilityi = 1, giving

  E(salaryi | utilityi = 1) = β0 + β1,

so that the population mean of CEO salaries in the transport and utilities industries is β0 + β1. Therefore β1 measures the difference between average salaries in transport and utilities versus all other industries. We can therefore use (58) to estimate the mean salaries for these two industry groups and also test for differences between them.
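The algebra above says OLS on an intercept and a dummy reproduces the two group means exactly; that is easy to confirm with a sketch (simulated data and illustrative numbers, not the CEO salary workfile):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
d = (rng.uniform(size=n) < 0.3).astype(float)  # dummy: 1 for roughly 30% of the sample
y = 10 + 4 * d + rng.normal(size=n)

# OLS of y on an intercept and the dummy
X = np.column_stack([np.ones(n), d])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]

# Intercept = sample mean of the d=0 group;
# intercept + slope = sample mean of the d=1 group
print(b0, y[d == 0].mean())
print(b0 + b1, y[d == 1].mean())
```

The t test on the slope is therefore a test of equality of the two group means, which is how the regression output in Figure 60 is used below.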

Figure 60 shows the results of an OLS regression of CEO salary on an intercept and utilityi. The SRF is

  salarŷi = 1396.225 − 668.523 utilityi
            (112.402)   (12.229)

implying β̂0 = 1396.225 and β̂1 = −668.523. The estimated average CEO salary across all industries other than transport and utilities is therefore $1,396,225, while the estimated average CEO salary in transport and utilities is $1,396,225 − $668,523 = $727,702. A test of the significance of β̂1 is a test of whether average salaries differ in the transport and utilities industries from the others.


Figure 60: SRF of CEO salary on an intercept and utility dummy variable

1. H0 : β1 = 0

2. H1 : β1 ≠ 0

3. α = 0.05

5. Decision rule: reject H0 if p < 0.05

6. Reject H0, so average CEO salaries are significantly different in transport and utilities compared to other industries.

5.2

Dummy variables can be used to estimate several means (or differences between means) at once. In the CEO salary dataset the firms are classified into four industries: utilities/transport, financial, industrial and consumer products. The additional dummy variables are

  financei = 1, if firm i is in finance
             0, if firm i is in any other industry

  indusi = 1, if firm i is in industrial production
           0, if firm i is in any other industry

  consprodi = 1, if firm i is in consumer products
              0, if firm i is in any other industry

Each firm in the sample falls into one of these four categories. The following PRF can be specified:

  E(salaryi | utilityi, financei, indusi) = β0 + β1 utilityi + β2 financei + β3 indusi.   (59)

The implied mean for CEO salaries in the consumer product industry is found from setting utilityi = financei = indusi = 0, giving

  E(salaryi | utilityi = 0, financei = 0, indusi = 0) = β0.

The average salaries for the other three industries are defined relative to the consumer product industry (the base category in this case). For utilities/transport we have utilityi = 1 and financei = indusi = 0, so

  E(salaryi | utilityi = 1, financei = 0, indusi = 0) = β0 + β1.

For finance we have financei = 1 and utilityi = indusi = 0, so

  E(salaryi | utilityi = 0, financei = 1, indusi = 0) = β0 + β2.

For industrial production we have indusi = 1 and utilityi = financei = 0, so

  E(salaryi | utilityi = 0, financei = 0, indusi = 1) = β0 + β3.

We do not also include the consprodi dummy variable in the PRF because this would cause a perfect multicollinearity problem: because each firm in the sample is categorised as one (and only one) of utilities, financial, industrial or consumer products, we have the perfect linear relationship

  utilityi + financei + indusi + consprodi = 1,

where 1 is the regressor for an intercept term. Therefore a PRF

  E(salaryi | utilityi, financei, indusi, consprodi) = β0 + β1 utilityi + β2 financei + β3 indusi + β4 consprodi

has an exact linear relationship among its regressors and therefore has perfect multicollinearity and cannot be estimated. One of the five explanatory variables needs to be omitted in order for the PRF to be estimated. In (59) we chose to omit consprodi, but any one of the other regressors could have been omitted, including the intercept.
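This "dummy variable trap" shows up numerically as a rank-deficient design matrix. A small numpy sketch with simulated category labels (illustrative, not the CEO data):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 120
# Assign each observation to exactly one of four categories
cat = rng.integers(0, 4, size=n)
dummies = np.column_stack([(cat == j).astype(float) for j in range(4)])

# The four dummies sum to the intercept column: exact linear dependence
ones = np.ones(n)
print(np.allclose(dummies.sum(axis=1), ones))

# Intercept plus all four dummies: 5 columns but rank only 4
X_trap = np.column_stack([ones, dummies])
print(np.linalg.matrix_rank(X_trap))

# Dropping one dummy (the base category) restores a full-rank design
X_ok = np.column_stack([ones, dummies[:, :3]])
print(np.linalg.matrix_rank(X_ok))  # 4 columns, rank 4
```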

The SRF corresponding to (59) is shown in Figure 61. The estimated average salary for CEOs in the consumer products industry is β̂0 = 1722.417, or $1,722,417. The estimated average salary for CEOs in the finance industry is β̂0 + β̂2 = 1722.417 − 377.5036 = 1344.913, or $1,344,913. The finance dummy variable is not significant at the 5% level (p = 0.2473, so that H0 : β2 = 0 would not be rejected against H1 : β2 ≠ 0) so there is no evidence of a significant difference between CEO salaries in the financial and consumer products industries. The interpretations for the utilities and industrial dummies follow similarly, with average CEO salaries in utilities differing significantly (p = 0.0009) from the consumer products industry, while those in industrial production do not differ significantly from consumer products (only just, with p = 0.0519).

5.3

Dummy variables are useful for more than just estimating means; they can be used in more general regression models as well. The dataset cochlear.wf1 contains observations on n = 91 severely

hearing impaired children who have received Cochlear Implants (CIs) to enable some form of

hearing. Some children have a single CI in one ear (a unilateral CI) while others have received

two CIs, one in each ear (bilateral CIs). It is believed that bilateral CIs provide an advantage

to children in real world listening and learning situations because they allow better directional

recognition of sounds and voices and also better hearing in background noise. However, CIs are

expensive (approximately $25,000 per implant), which must be borne by public or private health

insurance or the families themselves. Also the implantation of a CI involves damage to the inner

ear that then rules out the use of any newly discovered surgical procedure or device in the future

that might deliver improved performance. This background provides motivation for why it is

important to be able to detect and quantify improvements in listening and language that children

can achieve through the use of either one or two CIs.


The datafile contains outcomes for young children (ages 5-8) with either unilateral or bilateral CIs on the standardised Peabody Picture Vocabulary Test (PPVT). The histogram and descriptive statistics for this dependent variable are shown in Figure 62. The datafile also contains the dummy variable bilati, which takes the value 1 if child i has bilateral CIs and the value 0 if they have unilateral CIs. The PRF

  E(PPVTi | bilati) = β0 + β1 bilati

allows a test of the difference of means between the bilateral and unilateral outcomes. The SRF in Figure 63 shows that β̂0 = 85.21 is the estimated average score for unilateral children, while β̂0 + β̂1 = 85.21 + 9.36 = 94.57 is the estimated average score for bilateral children. The null hypothesis H0 : β1 = 0 is rejected against H1 : β1 ≠ 0 at the 5% level of significance (p = 0.0045) so there is a significant difference in outcomes between bilateral and unilateral children.

5.3.1

There is also clinical experience that children should not be made to wait too long to receive their CIs. There is a window early in life when the young brain needs to receive sounds and language inputs in order to develop best to be able to hear and understand language. Delaying the CIs can result in developmental delays that are very difficult to later catch up. A PRF to analyse this question is

  E(PPVTi | bilati, ageCI1i, ageCI2i) = β0 + β1 bilati + β2 ageCI1i + β3 ageCI2i,   (60)

where ageCI1i and ageCI2i are the respective ages in years when the first and second CIs were switched on. (For children with only a unilateral CI, ageCI2i = 0.) Histograms and descriptive

statistics for these ages are shown in Figures 64 and 65 and the SRF is shown in Figure 66. It has the form

  ppvt̂i = 92.77 + 16.22 bilati − 3.81 ageCI1i − 3.13 ageCI2i.
          (4.64)  (5.91)        (1.86)         (1.53)

A way to interpret this SRF is to think of it as containing two different SRFs: one for unilateral children (bilati = 0 and ageCI2i = 0):

  ppvt̂i = 92.77 − 3.81 ageCI1i,

and one for bilateral children (bilati = 1):

  ppvt̂i = 108.99 − 3.81 ageCI1i − 3.13 ageCI2i.

The role of the bilati dummy variable in this SRF is to allow the regression to have different intercepts for unilateral and bilateral children; in this case the intercept for bilateral children is higher (108.99 vs 92.77), reflecting their higher average scores relative to unilateral children. The bilati dummy variable is significant at the 5% level (p = 0.0074) so the difference between the regression lines is a statistically significant one.

To interpret its practical significance, suppose we compare the difference between the predicted outcomes for a unilateral child with ageCI1 = a1 against a bilateral child also with ageCI1 = a1 and who received their second CI at age 2 (the average age of second implant being 2.16 years). The unilateral prediction would be

  ppvt̂(0, a1, 0) = 92.77 − 3.81 a1,

while the bilateral prediction would be

  ppvt̂(1, a1, 2) = 108.99 − 3.81 a1 − 3.13 × 2
                  = 102.73 − 3.81 a1.

The difference between these two is the predicted difference due to the bilateral CI:

  ppvt̂(1, a1, 2) − ppvt̂(0, a1, 0) = 102.73 − 92.77 = 9.96.   (61)

Relative to the average standardised score of 100, the bilateral child is predicted to score approximately 10% better on the PPVT language test. A method of computing this difference and its standard error is to recognise that

  ppvt̂(1, a1, 2) − ppvt̂(0, a1, 0) = β̂1 + 2 β̂3,

and that

  E(PPVTi | bilati, ageCI1i, ageCI2i) = β0 + (β1 + 2β3) bilati + β2 ageCI1i + β3 (ageCI2i − 2 bilati).

That is, the coefficient on bilati in a regression of PPVTi on an intercept, bilati, ageCI1i and (ageCI2i − 2 bilati) delivers the estimated effect of a second CI at age 2 versus sticking with a unilateral CI. The results of this regression are shown in Figure 67, where it can be seen that β̂1 + 2β̂3 = 9.95 (which differs from (61) only because of rounding) with standard error 3.86. The 95% confidence interval for this effect is therefore

  [9.95 − 2.00 × 3.86, 9.95 + 2.00 × 3.86] = [2.23, 17.67].

74

12

Series: PPVT

Sample 1 91

Observations 91

10

Mean

Median

Maximum

Minimum

Std. Dev.

Skewness

Kurtosis

92.09890

94.00000

139.0000

59.00000

16.43104

0.244310

3.076533

Probability

0.628932

0

60

70

80

90

100

110

120

130

140

Figure 62: Histogram of language scores for children with Cochlear Implants


[Histogram of AGE_CI1 (Sample 1 91, Observations 91): mean 1.532857, median 1.430000, maximum 3.790000, minimum 0.340000, std. dev. 0.815059, skewness 0.725634, kurtosis 2.778721; Jarque-Bera 8.171589, probability 0.016810.]

Figure 64: Histogram and descriptive statistics for age at rst CI.

[Figure 65: Histogram and descriptive statistics for age at second CI (AGE_CI2; Sample 1 91, Observations 91): mean 2.162637, median 2.020000, maximum 5.820000, minimum 0.000000, std. dev. 1.754301, skewness 0.250492, kurtosis 1.876018; probability 0.056648.]


Figure 66: SRF for PPVT language scores on bilateral CIs and implant ages.


5.3.2

Dummy variables can be used to allow slope coefficients to change for different categories of observations as well. For example we could specify

  E(PPVTi | bilati, ageCI1i, ageCI2i) = β0 + β1 bilati + β2 (bilati × ageCI1i) + β3 (unilati × ageCI1i) + β4 ageCI2i,   (62)

where unilati = 1 − bilati is a dummy variable that takes the value 1 for a child with a unilateral CI and 0 for a child with a bilateral CI. Regressors that involve products of explanatory variables like this are often called interactions. In this case the effect of the bilateral CI is being allowed to interact with the age of first implant. The PRF for a unilateral child is then

  E(PPVTi | bilati = 0, ageCI1i, ageCI2i) = β0 + β3 ageCI1i,

while for a bilateral child it is

  E(PPVTi | bilati = 1, ageCI1i, ageCI2i) = (β0 + β1) + β2 ageCI1i + β4 ageCI2i.

The unilateral and bilateral PRFs therefore have potentially different intercepts and slope coefficients on ageCI1i. The statistical significance of these differences can be tested.

The SRF is shown in Figure 68. The slope coefficients on ageCI1i are different for unilateral and bilateral children, and in fact the slope is only significant (p = 0.0299) for bilateral children. A year of delay in the first CI predicts a fall in the PPVT outcome of 5.43 points for bilateral children but only 0.62 points for unilateral children. The prediction equations are

  ppvt̂i = 86.43 − 0.62 ageCI1i − 2.76 ageCI2i

for unilateral children and

  ppvt̂i = (86.43 + 23.69) − 5.43 ageCI1i − 2.76 ageCI2i
         = 110.12 − 5.43 ageCI1i − 2.76 ageCI2i

for bilateral children.

A Wald test can be used to test H0 : β2 = β3 against H1 : β2 ≠ β3. Under H0 the slope coefficients on ageCI1i are equal for unilateral and bilateral children and the PRF would simplify back to (60). The results of the Wald test (specified as c(3)=c(4)) are shown in Figure 69. The details are

1. H0 : β2 = β3 in (62)

2. H1 : β2 ≠ β3

3. α = 0.05

5. Decision rule : reject H0 if p < 0.05.

6. Do not reject H0, the PRF could be simplified back to (60).


Figure 68: SRF of regression for language outcomes with bilateral interactions


In all cases so far we have assumed that the PRF is a linear function of the explanatory variables. Non-linearities can be introduced in an endless variety of ways. The interactions involving dummy variables in the previous section were one step in this direction. Here we look at some common examples of non-linear regression models that can be handled using the methods of OLS estimation and inference considered so far.

6.1

Quadratic regression

Some non-linearity can be introduced into a PRF by including the square of an explanatory variable. Consider a simple regression

  E(yi | xi) = β0 + β1 xi + β2 xi².   (63)

It no longer makes sense to interpret β1 as the change in E(yi | xi) from a unit increase in xi, since a unit increase in xi necessarily increases xi² as well. The term β1 xi + β2 xi² needs to be interpreted as a whole. The sign of β2 dictates the shape of the parabola: a positive sign giving a convex shape (a valley) and a negative sign giving a concave shape (a hill). The general approach to regression interpretation is to consider the difference between the conditional mean at some value x:

  E(yi | xi = x) = β0 + β1 x + β2 x²   (64)

and the conditional mean at x + 1:

  E(yi | xi = x + 1) = β0 + β1 (x + 1) + β2 (x + 1)².

This gives an expression for the change in the conditional mean that results from a one unit increase in xi:

  E(yi | xi = x + 1) − E(yi | xi = x) = β1 + β2 (2x + 1).   (65)

For a linear regression (i.e. one without the xi² term) the expression is E(yi | xi = x + 1) − E(yi | xi = x) = β1. The effect of the quadratic term is to introduce β2 (2x + 1) into the marginal effect of xi. An important difference from the linear model is that this marginal effect now depends on x, implying that the effect of increasing xi by one unit now depends on what value of xi we start from. The following application will illustrate the sense of this property. Sometimes the derivative of the regression function with respect to x is used as an approximation to the marginal effect shown in (65). The derivative is

  dE(yi | xi = x)/dx = β1 + 2 β2 x,

which differs from (65) by β2. Estimated values of β2 are often small, so that the derivative is close to (65). We will focus on (65) as the exact marginal effect of a unit change in xi.
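The relationship between the exact marginal effect (65) and the derivative approximation can be checked directly with illustrative coefficient values (hypothetical numbers, not estimates from the text):

```python
# Illustrative coefficients for the quadratic PRF (63)
b0, b1, b2 = 2.0, 1.2, -0.05

def cond_mean(x):
    # E(y | x) = b0 + b1*x + b2*x^2
    return b0 + b1 * x + b2 * x ** 2

x = 10.0
# Exact effect of a one-unit increase in x, computed by differencing
exact = cond_mean(x + 1) - cond_mean(x)
# Closed form (65) and the derivative approximation
formula = b1 + b2 * (2 * x + 1)
deriv = b1 + 2 * b2 * x

print(exact, formula, deriv)  # formula matches exact; derivative differs by b2
```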

For a given value x, the estimation of E(yi | xi = x) and its standard error follows by subtracting (64) from (63) to obtain

  E(yi | xi) = E(yi | xi = x) + β1 (xi − x) + β2 (xi² − x²),

so that computing this prediction involves a regression of yi on an intercept, (xi − x) and (xi² − x²), and taking the intercept estimate and its standard error. The approach is identical to that for

linear models. The estimation of the marginal effect in (65) for a given value x can proceed by re-arranging (63) to give

  E(yi | xi) = β0 + β1 xi + β2 xi²
             = β0 + (β1 + β2 (2x + 1)) xi + β2 (xi² − (2x + 1) xi).

That is, the marginal effect (65) for a given x is estimated from a regression of yi on an intercept, xi and (xi² − (2x + 1) xi), and taking the coefficient on xi and its standard error.

A quadratic function has a turning point, either a maximum (for β2 < 0) or a minimum (for β2 > 0). The location of this turning point is found by setting the derivative equal to zero:

  dE(yi | xi = x)/dx = 0  ⟹  x = −β1/(2β2).

This is the point at which the effect of xi on E(yi | xi) changes from being positive to negative (for β2 < 0) or negative to positive (for β2 > 0). The estimated turning point is therefore x̂ = −β̂1/(2β̂2), where β̂1 and β̂2 are the usual OLS estimators of the PRF (63).

6.1.1

A quadratic term is commonly included when modelling wages in terms of labour market experience. The workfile wages.wf1 contains data on n = 1260 individuals with their wages ($/hour) and various potential explanatory variables. A straightforward linear PRF would take the form

  E(wagei | femalei, educi, experi) = β0 + β1 femalei + β2 educi + β3 experi,

where educi is years of education, experi is years of labour force experience and femalei is a dummy variable that takes the value 1 if individual i is female and 0 otherwise. The SRF is shown in Figure 70. Each of the slope coefficients is significant at the 5% level and each has interpretable signs and magnitudes. An extra year of education increases the estimated conditional mean of wages by $0.45, an extra year of experience increases the estimated conditional mean of wages by $0.08, and the average wage of females is $2.57 below that of males with the same levels of education and experience.

However, experience is generally not modelled in a linear form in a wage equation like this. The idea is that the initial years of work experience involve the greatest learning and greatest increases in productivity for an employee, resulting in the greatest increases in wages at that time. As experience increases, the rates of growth in productivity and hence wages slow. This effect can be captured by including a quadratic term in the PRF

  E(wagei | femalei, educi, experi) = β0 + β1 femalei + β2 educi + β3 experi + β4 experi².   (66)

The interpretations of the female_i and educ_i variables in this model are unchanged, but the interpretation of experience must be altered as shown above. The coefficients β₃ and β₄ are no longer individually interpretable because it is impossible to make a one-year increase in exper_i while holding exper_i² fixed (or vice versa). Instead, the marginal effect on E(wage_i | female_i, educ_i, exper_i) of one extra year of work experience follows from (65):

E(wage_i | female_i, educ_i, exper_i + 1) − E(wage_i | female_i, educ_i, exper_i) = β₃ + β₄(2 exper_i + 1).  (67)

The effect on the average wage of an extra year of work experience therefore depends on the amount of experience obtained so far. For an individual with one year of work experience, a second year of work experience will change the expected wage by β₃ + 3β₄ dollars per hour. For an individual with 20 years of work experience, the next year of experience will change the expected wage by β₃ + 41β₄ dollars per hour.

The SRF for (66) is shown in Figure 71, with β̂₃ = 0.2527 and β̂₄ = −0.0039. The quadratic term in experience is significant at the 5% level, so it adds explanatory power for wages. Figure 72 gives a graphical representation of the contribution of the experience variables to the fitted wage equation, given by β̂₃ exper_i + β̂₄ exper_i² plotted over the range of observed values of exper_i. Also shown for comparison is the linear term in experience from the SRF in Figure 70, given by 0.0847 exper_i. The quadratic function has a positive slope for all levels of experience from one year up to the turning point exper_i = −β̂₃/(2β̂₄) = 0.2527/(2 × 0.0039) = 32.40 years. After that the quadratic has a negative slope. The implication is that extra experience increases average wages, at a decreasing rate, until experience reaches about 32.4 years; after that, extra work experience has a negative effect on average wages. The same information is also displayed in Figure 73, which graphs the effect of an extra year of work experience on average wages. The effect in the linear case is forced to be constant at β̂₃ = 0.0847 for all levels of experience, while the effect in the quadratic case is β̂₃ + β̂₄(2 exper_i + 1) = 0.2527 − 0.0039(2 exper_i + 1). Again this shows that an extra year of experience raises average wages until experience reaches 32.4 years, at which point the effect crosses the x-axis and implies decreases in average wages.

Prediction in a quadratic regression works in the same way as in a linear model. Suppose we want to calculate the average wage for a female with 15 years of education and 10 years of work experience. Figure 74 shows the regression for this purpose, re-specified in terms of (female_i − 1), (educ_i − 15), (exper_i − 10) and (exper_i² − 10²). The resulting prediction is wage-hat(1, 15, 10) = $5.08 with a standard error of 0.23, which can be used to compute the 95% prediction interval

[5.0827 ± 1.980 × 0.23] ≈ [$4.63, $5.53].

Suppose we also want a 95% confidence interval for the effect of an extra year of work experience on average wages for an individual with these characteristics. The desired effect is (67) with experience set to 10 years, i.e. β₃ + 21β₄. The PRF (66) can be re-written as

E(wage_i | female_i, educ_i, exper_i) = β₀ + β₁ female_i + β₂ educ_i + (β₃ + 21β₄) exper_i + β₄ (exper_i² − 21 exper_i),

so that a regression of wage_i on an intercept, female_i, educ_i, exper_i and exper_i² − 21 exper_i will provide the desired coefficient on exper_i. The SRF is shown in Figure 75, which shows that β̂₃ + 21β̂₄ = 0.17, with 95% confidence interval computed as

[0.1707 ± 1.980 × se(β̂₃ + 21β̂₄)],

where the standard error is read from the SRF output in Figure 75.

6.2 Log x

It is common practice to work with variables in logs rather than their original levels. Consider a PRF

E(y_i | x_i) = β₀ + β₁ log x_i.  (68)

Taking logs is only possible when x_i takes values greater than zero, so this specification is not always available. It can be used for a positive variable like years of work experience, though. The interpretation of the effect of x_i in this PRF needs to be derived. Following the same general approach as in the quadratic model, we consider a fixed value x and compare

E(y_i | x_i = x) = β₀ + β₁ log x  (69)


[Figure 72: Experience components of the fitted wage equations, linear vs quadratic, plotted against EXPER (0 to 50 years).]

[Figure 73: Estimated effect of an extra year of work experience on average wages, linear vs quadratic, plotted against EXPER (0 to 50 years); vertical axis from −.15 to .25.]


Figure 74: SRF to predict the wages for females with 15 years of education and 10 years of work

experience

Figure 75: SRF to estimate the effect on average wages of an extra year of experience for an individual with 10 years of experience


and

E(y_i | x_i = x + 1) = β₀ + β₁ log(x + 1).

The difference between the two is

E(y_i | x_i = x + 1) − E(y_i | x_i = x) = β₁ (log(x + 1) − log x) = β₁ log(1 + 1/x).  (70)

To estimate this effect, the PRF can be re-written as

E(y_i | x_i) = β₀ + β₁ log x_i = β₀ + β₁ log(1 + 1/x) × [log x_i / log(1 + 1/x)],

so the desired marginal effect (70) for a given value x is estimated as the slope coefficient from a regression of y_i on an intercept and log x_i / log(1 + 1/x).

An alternative and very common interpretation of this PRF is to consider a 1% increase in x_i rather than a one-unit increase. That is, instead of comparing E(y_i | x_i) at x_i = x and x_i = x + 1, we compare it at x_i = x and x_i = 1.01x. This gives

E(y_i | x_i = 1.01x) − E(y_i | x_i = x) = β₁ (log(1.01x) − log x) = β₁ (log 1.01 + log x − log x) = β₁ log 1.01 ≈ β₁/100,

where the last step uses log 1.01 = 0.00995 ≈ 0.01 = 1/100. Therefore a 1% increase in x_i results in a change of approximately β₁/100 in E(y_i | x_i). This interpretation is common because the result does not depend on x, so it gives β₁ a convenient interpretation without reference to some starting value x.
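A quick numerical check of these two interpretations, using a hypothetical slope value:

```python
import numpy as np

b1 = 2.0   # hypothetical coefficient on log(x)
x = 10.0

# Exact effect on E(y|x) of a one-unit increase in x, from (70): b1 * log(1 + 1/x)
one_unit_effect = b1 * np.log(1 + 1 / x)

# Exact effect of a 1% increase: b1 * log(1.01), approximately b1/100 = 0.02
one_pct_effect = b1 * np.log(1.01)
print(round(one_unit_effect, 4), round(one_pct_effect, 4))  # 0.1906 0.0199
```

Note that the one-unit effect depends on the starting value x, while the 1% effect does not.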

Prediction is carried out in the usual way by subtracting (69) from (68) to obtain

E(y_i | x_i) = E(y_i | x_i = x) + β₁ (log x_i − log x),

so the estimate of E(y_i | x_i = x) is the estimated intercept in a regression of y_i on an intercept and (log x_i − log x).

6.2.1 Example

In the context of wages and work experience, using the log of work experience provides an alternative non-linear specification to a quadratic. Consider the PRF

E(wage_i | female_i, educ_i, exper_i) = β₀ + β₁ female_i + β₂ educ_i + β₃ log(exper_i).  (71)

Including experience in logs instead of linearly allows the effect of an extra year of experience on average wages to decrease as experience increases. The effect of an extra year of experience (holding female_i and educ_i constant) on the conditional mean of wages is obtained from (70) to be

β₃ log(1 + 1/exper_i).  (72)

Alternatively, a 1% increase in experience changes the conditional mean of wages by approximately β₃/100 dollars per hour.

The SRF for (71) is shown in Figure 76, and the experience component is illustrated in Figure 77, with the quadratic component from Figure 71 for comparison. Both functional forms capture the initially larger gains to work experience at the beginning of a career and the reduction of those gains as experience grows. The difference is that the log specification does not imply a negative effect of experience at any level, instead showing a continuing gradual increase in wages with experience at all levels. Figure 78 shows the estimated effect on average wages of an extra year of work experience, comparing the results from the log and quadratic models. The biggest differences occur at the ends of the distribution of experience, where data are most sparse, so choosing between the two specifications will not be simple. For now we consider each as providing a reasonable approximation to the role of experience and return later to the topics of model comparison and selection.

The average wage can again be estimated for a female with 15 years of education and 10 years of work experience; see Figure 79. This gives wage-hat(1, 15, 10) = $5.27 with 95% prediction interval

[5.2728 ± 1.980 × 0.2231] = [$4.83, $5.71].

This interval mostly overlaps that from the quadratic model (i.e. [$4.63, $5.53]), so the predictions from the two models are very similar for this type of individual. We can expect them to differ more for very small or very large values of experience.

To estimate the effect of an extra year of experience for this individual, which from (72) would be

β₃ log(1 + 1/10) = β₃ log(1.1),

consider the re-specified PRF

E(wage_i | female_i, educ_i, exper_i) = β₀ + β₁ female_i + β₂ educ_i + θ₃ [log(exper_i) / log(1.1)],

where θ₃ = β₃ log(1.1). The results are shown in Figure 80, from which we find the estimated increase in average wages is $0.12, with 95% confidence interval

[0.1246 ± 1.980 × se(θ̂₃)],

computed using the standard error reported in Figure 80. This interval does not overlap that constructed with the quadratic model, so the two models give different answers in this case.

For any values of the explanatory variables (i.e. regardless of work experience), an increase of work experience by 1% increases the average wage by approximately $0.013 (= β̂₃/100). In this case a one-year change in experience is the more natural unit to consider, but in other applications a percentage change is very natural.

6.3 Log y

It is standard in econometrics to model a variable like wages in log form rather than in levels. There is no definite rule for choosing between logs and levels, but variables like wages or incomes that are positive (logs only apply to positive numbers) and generally highly positively skewed and non-normal are rendered less skewed and closer to normal by a log transformation. Figures 81 (wages in levels) and 82 (wages in logs) illustrate the point. This can lessen the impact of the small number of very large incomes. Also, the approximate normal, t and F distributions that rely on the Central Limit Theorem will tend to work better for more symmetrically distributed data.



Figure 77: Comparison of experience in quadratic and log form in the wage equations



Figure 78: Estimated effect of an extra year of work experience on average wages for the log and quadratic models

Figure 79: Prediction for a female with 15 years education and 10 years of work experience


Figure 80: Estimating the effect of an extra year of work experience for an individual with 10 years of work experience

[Figure 81: Histogram of WAGE (sample 1–1260, 1260 observations). Mean 6.3067, median 5.30, maximum 77.72, minimum 1.02, std. dev. 4.6606, skewness 4.8135, kurtosis 54.013; Jarque-Bera 141489.9 (p = 0.0000).]

[Figure 82: Histogram of LWAGE (sample 1–1260, 1260 observations). Mean 1.6588, median 1.6677, maximum 4.3531, minimum 0.0198, std. dev. 0.5945, skewness 0.0832, kurtosis 3.4250; Jarque-Bera 10.938 (p = 0.0042).]

Consider the simple regression

E(log y_i | x_i) = β₀ + β₁ x_i.  (73)

The interpretation of this regression in terms of log y_i is simple. However, we are rarely interested in log y_i for practical purposes; we are interested in y_i. So we want to work out the implications of this model in log y_i for y_i itself. That is, we would like to deduce an expression for E(y_i | x_i), but the fundamental difficulty is that E(log y_i | x_i) ≠ log E(y_i | x_i). The log is a non-linear function, so it cannot be interchanged with the expectations operator. In fact we know from Jensen's inequality that E(log y_i | x_i) < log E(y_i | x_i). Instead we write (73) as

log y_i = β₀ + β₁ x_i + (log y_i − E(log y_i | x_i)) = β₀ + β₁ x_i + u_i,  (74)

where

u_i = log y_i − E(log y_i | x_i).

Taking the exponential of both sides of (74) gives

y_i = exp(β₀ + β₁ x_i) exp(u_i).  (75)

Making any progress with the interpretation of this model requires the assumption that u_i is independent of x_i. This is a difficult assumption to interpret or test, but it tends to be made in practice without any discussion, so we will do so here. Under this assumption, taking the conditional expectation of both sides of (75) gives

E(y_i | x_i) = exp(β₀ + β₁ x_i) E[exp(u_i) | x_i] = exp(β₀ + β₁ x_i) E[exp(u_i)] = α₀ exp(β₀ + β₁ x_i),

where α₀ = E[exp(u_i)]. For a fixed value x this gives

E(y_i | x_i = x) = α₀ exp(β₀ + β₁ x)  (76)

and

E(y_i | x_i = x + 1) = α₀ exp(β₀ + β₁(x + 1)) = α₀ exp(β₀ + β₁ x) exp(β₁) = E(y_i | x_i = x) exp(β₁),

so

E(y_i | x_i = x + 1) − E(y_i | x_i = x) = E(y_i | x_i = x) (exp(β₁) − 1).

Expressed as a percentage change,

[E(y_i | x_i = x + 1) − E(y_i | x_i = x)] / E(y_i | x_i = x) × 100% = (exp(β₁) − 1) × 100%.  (77)

That is, a one-unit increase in x_i produces a (exp(β₁) − 1) × 100% change in E(y_i | x_i).

It is common to approximate exp(β₁) − 1 by β₁ (an approximation that works best for small β₁), so that the interpretation of the model becomes

[E(y_i | x_i = x + 1) − E(y_i | x_i = x)] / E(y_i | x_i = x) × 100% ≈ β₁ × 100%.  (78)

That is, a one-unit increase in x_i produces an approximate β₁ × 100% change in E(y_i | x_i). The convenience of using β₁ instead of having to compute exp(β₁) − 1 means this approximate interpretation is more often used in practice. The approximation can also be derived using calculus:

(d/dx) E(y_i | x_i = x) = α₀ exp(β₀ + β₁ x) β₁ = E(y_i | x_i = x) β₁,

which implies

[1 / E(y_i | x_i = x)] (d/dx) E(y_i | x_i = x) × 100% = β₁ × 100%,

each side of which is an approximation to each side of (77). We will proceed with (78) as the interpretation of (73).
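The quality of this approximation can be checked directly; it deteriorates as the slope grows (slope values here are hypothetical):

```python
import numpy as np

# Exact percentage change (exp(b1) - 1) x 100% vs the approximation b1 x 100%
for b1 in [0.01, 0.05, 0.10, 0.25]:
    exact = (np.exp(b1) - 1) * 100
    approx = b1 * 100
    print(f"b1 = {b1:.2f}: exact {exact:.2f}%, approx {approx:.0f}%")
```

For b1 = 0.01 the two are essentially identical, while for b1 = 0.25 the exact change is noticeably larger than 25%.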

Estimation of E(y_i | x_i) is more difficult in a model that is expressed in terms of log y_i. It is straightforward to obtain the SRF

log-hat(y_i) = β̂₀ + β̂₁ x_i

from an OLS regression of log y_i on an intercept and x_i. The prediction equation

log-hat(y)(x) = β̂₀ + β̂₁ x

provides the right way to estimate E(log y_i | x_i = x). It is then tempting to use

exp(log-hat(y)(x)) = exp(β̂₀ + β̂₁ x)

to estimate E(y_i | x_i = x), but this would not be right because it omits the α₀ term in the correct expression (76). Since α₀ is necessarily greater than one (because E(log y_i | x_i) < log E(y_i | x_i)), using ŷ(x) = exp(β̂₀ + β̂₁ x) will systematically under-estimate E(y_i | x_i = x), i.e. it will be negatively biased. An estimator of α₀ is required to correct this bias. Since α₀ = E[exp(u_i)] (the population mean of exp(u_i)), a natural estimator is the sample mean

α̂₀ = (1/n) Σ_{i=1}^n exp(û_i),

where

û_i = log y_i − β̂₀ − β̂₁ x_i

are the usual OLS residuals from the SRF corresponding to (73). The prediction equation

ŷ(x) = α̂₀ exp(β̂₀ + β̂₁ x)

can then be used.
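The sketch below simulates this retransformation bias and its correction; all numbers are hypothetical. The key point is that the naive prediction exp(β̂₀ + β̂₁x) falls short of E(y|x) by the factor α₀, which the sample mean of the exponentiated residuals estimates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Log-linear model: log y = 0.5 + 0.1x + u, with u ~ N(0, 0.8^2) independent of x,
# so the true alpha0 = E[exp(u)] = exp(0.8^2 / 2), about 1.38.
n = 5000
x = rng.uniform(0, 10, n)
logy = 0.5 + 0.1 * x + rng.normal(0, 0.8, n)

# OLS of log y on an intercept and x
X = np.column_stack([np.ones(n), x])
b0h, b1h = np.linalg.lstsq(X, logy, rcond=None)[0]
resid = logy - (b0h + b1h * x)

# Estimate alpha0 by the sample mean of the exponentiated residuals
alpha0_hat = np.mean(np.exp(resid))

# Naive vs corrected prediction of E(y|x) at x = 5
naive = np.exp(b0h + b1h * 5.0)
corrected = alpha0_hat * naive
print(alpha0_hat > 1, corrected > naive)  # True True
```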

6.3.1 Example

A common base specification for a wage equation takes the form of the PRF

E(log(wage_i) | female_i, educ_i, exper_i) = β₀ + β₁ female_i + β₂ educ_i + β₃ exper_i + β₄ exper_i².

The interpretation of this PRF is simple in terms of log(wage_i), but it is generally wage_i which is the quantity of interest. Using (77), an extra year of education increases the average wage by (exp(β₂) − 1) × 100%. The SRF for this model is given in Figure 83, in which β̂₂ = 0.0679. The estimated effect of an extra year of education in this regression is therefore to increase the average wage by (exp(β̂₂) − 1) × 100% = (exp(0.0679) − 1) × 100% = 7.03%. The approximation (78) gives the effect as 6.79%, which is practically very similar. Moreover, the standard error for this latter estimate β̂₂ is immediately available for computing a confidence interval, whereas computing a standard error for exp(β̂₂) − 1 goes beyond our scope.

The interpretation of the effect of work experience on wages involves a non-linear transformation of both variables. The general form is

E(log y_i | x_i) = β₀ + β₁ x_i + β₂ x_i²,

which, under the same assumptions as above, implies

E(y_i | x_i) = α₀ exp(β₀ + β₁ x_i + β₂ x_i²).

For a fixed value x,

E(y_i | x_i = x) = α₀ exp(β₀ + β₁ x + β₂ x²)

and

E(y_i | x_i = x + 1) = α₀ exp(β₀ + β₁(x + 1) + β₂(x + 1)²) = α₀ exp(β₀ + β₁ x + β₂ x²) exp(β₁ + β₂(2x + 1)) = E(y_i | x_i = x) exp(β₁ + β₂(2x + 1)).

Therefore

[E(y_i | x_i = x + 1) − E(y_i | x_i = x)] / E(y_i | x_i = x) × 100% = (exp(β₁ + β₂(2x + 1)) − 1) × 100% ≈ (β₁ + β₂(2x + 1)) × 100%.

This shows that the interpretation of a quadratic regression with a logged dependent variable simply combines the two elements of each of the transformations. The marginal effect of the quadratic regression is (β₁ + β₂(2x + 1)) as before, but the presence of the logged dependent variable means that this effect needs to be interpreted as a percentage change in E(y_i | x_i), rather than an absolute change in E(y_i | x_i).

To compute the marginal effect of an extra year of experience for an individual with x years of experience, we re-arrange the PRF to give

E(log(wage_i) | female_i, educ_i, exper_i) = β₀ + β₁ female_i + β₂ educ_i + (β₃ + β₄(2x + 1)) exper_i + β₄ (exper_i² − (2x + 1) exper_i),

so the required marginal effect is the coefficient on exper_i in a regression of log(wage_i) on an intercept, female_i, educ_i, exper_i and exper_i² − (2x + 1) exper_i. For individuals with 10 years of experience, the SRF is shown in Figure 84. The interpretation of the estimate is that an extra year of work experience increases average wages by approximately 2.73%. A confidence interval for this is obtained from

0.0273 ± 1.980 × 0.00238 = [0.0226, 0.0320],

or [2.26%, 3.20%].

6.3.2 Comparing R2 for levels and logs

Comparing (71) and (83) shows that the model in log wages has a much higher R², which would appear to suggest it is superior. However, models with different dependent variables cannot be compared using R². Instead, the predictions from one of the models need to be transformed to match those of the other model to allow a valid comparison.

Suppose we want to compare

E(y_i | x_i) = β₀ + β₁ x_i

and

E(log y_i | x_i) = γ₀ + γ₁ x_i.

Figure 84: SRF for computing the marginal effect of one extra year of experience on wages for an individual with 10 years of experience

Let

ŷ_i = β̂₀ + β̂₁ x_i  (79)

denote the usual SRF for the levels of y_i. For the log of y_i, write the SRF as

log-tilde(y_i) = γ̃₀ + γ̃₁ x_i,

where γ̃₀ and γ̃₁ are the usual OLS estimators from a regression of log y_i on an intercept and x_i, the different notation only being used to distinguish them from β̂₀ and β̂₁ in the levels SRF. Now transform the fitted values for log y_i into fitted values for y_i, denoted

ỹ_i = exp(γ̃₀ + γ̃₁ x_i).

These fitted values are not unbiased estimators of E(y_i | x_i), since they omit consideration of α₀ in (76), but this turns out not to matter for the R² comparison. The comparison is made by computing the R² from a regression of y_i on an intercept and ỹ_i, and then comparing this with the R² from (79). Whichever is larger suggests whether y_i should be logged or not. Note that this comparison is valid only when the two regressions (for y_i and log y_i) contain the same explanatory variables.

This procedure is made more convenient in Eviews because it offers the option of computing fitted values for y_i directly from a regression estimated in log y_i. In the regression for log(wage_i), choose Proc - Forecast... as shown in Figure 85, and then ensure that the fitted values are obtained for wage, and not log(wage), as shown in Figure 86. This creates a variable called wagef in the workfile (its name can be changed if desired). Figure 87 shows the regression used to compute R² = 0.211, which is the percentage of variation in wage_i explained by the regression for log(wage_i). This R² is slightly higher than that for the regression for wage_i, implying the log transformation for wages is to be preferred in this case.
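Outside Eviews, the same comparison can be sketched in a few lines. The data-generating process here is hypothetical and chosen so that the log model is correct, so the comparison should favour the log specification:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(0, 10, n)
y = np.exp(0.2 + 0.3 * x + rng.normal(0, 0.5, n))  # log y is genuinely linear in x

def fitted(X, y):
    """OLS fitted values from a regression of y on X."""
    return X @ np.linalg.lstsq(X, y, rcond=None)[0]

def r_squared(y, yhat):
    return 1 - np.sum((y - yhat)**2) / np.sum((y - y.mean())**2)

X = np.column_stack([np.ones(n), x])

# R^2 from the levels SRF of y on an intercept and x
r2_levels = r_squared(y, fitted(X, y))

# Fit log y, transform the fitted values with exp(.), then take the R^2 from a
# regression of y on an intercept and the transformed fitted values
y_tilde = np.exp(fitted(X, np.log(y)))
r2_log = r_squared(y, fitted(np.column_stack([np.ones(n), y_tilde]), y))

print(r2_log > r2_levels)  # the log specification is preferred here
```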


Figure 86: Choosing to calculate fitted values for wage, not log(wage)


Figure 87: Regression to compute the R² for wage_i from the log(wage_i) regression

6.4 Summary

Define the general notation μ_y(x) = E(y_i | x_i = x). The following gives a summary of the functional forms and their marginal effects and prediction equations. In each case, the "prediction" line gives the re-specified PRF whose intercept equals the quantity to be predicted.

Linear
- PRF: E(y_i | x_i) = β₀ + β₁ x_i
- SRF: ŷ_i = β̂₀ + β̂₁ x_i
- Marginal effect: μ_y(x + 1) − μ_y(x) = β₁
- Prediction of μ_y(x): E(y_i | x_i) = μ_y(x) + β₁ (x_i − x)

Quadratic x
- PRF: E(y_i | x_i) = β₀ + β₁ x_i + β₂ x_i²
- SRF: ŷ_i = β̂₀ + β̂₁ x_i + β̂₂ x_i²
- Marginal effect: μ_y(x + 1) − μ_y(x) = β₁ + β₂(2x + 1) = θ₁
- Marginal effect estimation: E(y_i | x_i) = β₀ + θ₁ x_i + β₂ (x_i² − (2x + 1) x_i)
- Prediction of μ_y(x): E(y_i | x_i) = μ_y(x) + β₁ (x_i − x) + β₂ (x_i² − x²)
- Turning point: x = −β₁ / (2β₂)

Log x
- PRF: E(y_i | x_i) = β₀ + β₁ log x_i
- SRF: ŷ_i = β̂₀ + β̂₁ log x_i
- Marginal effect: μ_y(1.01x) − μ_y(x) ≈ β₁/100
- Prediction of μ_y(x): E(y_i | x_i) = μ_y(x) + β₁ log(x_i / x)

Log y
- PRF: E(log y_i | x_i) = β₀ + β₁ x_i
- SRF: log-hat(y_i) = β̂₀ + β̂₁ x_i
- Marginal effect: [μ_y(x + 1) − μ_y(x)] / μ_y(x) × 100% ≈ β₁ × 100%
- Prediction of μ_log y(x): E(log y_i | x_i) = μ_log y(x) + β₁ (x_i − x)
- Prediction of μ_y(x): ŷ(x) = α̂₀ exp(log-hat(y)(x)), with α̂₀ = (1/n) Σ_{i=1}^n exp(log y_i − log-hat(y_i))
- R² for y_i: R² from SRF of y_i on an intercept and exp(log-hat(y_i))

Log y + quadratic x
- PRF: E(log y_i | x_i) = β₀ + β₁ x_i + β₂ x_i²
- SRF: log-hat(y_i) = β̂₀ + β̂₁ x_i + β̂₂ x_i²
- Marginal effect: [μ_y(x + 1) − μ_y(x)] / μ_y(x) × 100% ≈ (β₁ + β₂(2x + 1)) × 100%
- Marginal effect estimation: E(log y_i | x_i) = β₀ + θ₁ x_i + β₂ (x_i² − (2x + 1) x_i)
- Prediction of μ_log y(x): E(log y_i | x_i) = μ_log y(x) + β₁ (x_i − x) + β₂ (x_i² − x²)
- Prediction of μ_y(x): ŷ(x) = α̂₀ exp(log-hat(y)(x)), with α̂₀ = (1/n) Σ_{i=1}^n exp(log y_i − log-hat(y_i))
- R² for y_i: R² from SRF of y_i on an intercept and exp(log-hat(y_i))

7 Comparing regressions

Comparing the fit of different regressions for the same dependent variable can be done in many different ways; there is not one correct approach. Four statistics will be discussed here for the purpose. First note, however, that while R² is a useful descriptive statistic for a single regression, it has only very limited use for comparing different regressions: it can only be used for comparing regressions with the same number of explanatory variables. The problem with R² is that it will never decrease when a new explanatory variable is added to a regression, no matter how little explanatory power that variable has. So comparing regressions with R² will always end up giving preference to the largest model. The four closely related statistics given here do not have this problem and should be used for regression comparison.

7.1 Adjusted R²

For the SRF

ŷ_i = β̂₀ + β̂₁ x_{1,i} + … + β̂_k x_{k,i},

recall the definitions

SST = Σ_{i=1}^n (y_i − ȳ)²,  SSE = Σ_{i=1}^n (ŷ_i − ȳ)²,  SSR = Σ_{i=1}^n û_i²,

which satisfy

SST = SSE + SSR.

The R² is defined as

R² = SSE/SST = (SST − SSR)/SST = 1 − SSR/SST.

The adjusted R² is defined as

R̄² = 1 − [SSR/(n − k − 1)] / [SST/(n − 1)],

the adjustment being the inclusion of the degrees of freedom as the divisor of SSR in the numerator. A result of this change is that R̄² may decrease if an explanatory variable with little predictive power is added to a regression, so it is a legitimate strategy to compare regressions with different numbers of explanatory variables based on R̄² (as long as they have the same dependent variable). Another result of the change is that R̄² ≥ 0 need not always hold as it does for R². A negative R̄² is a sign of a regression with very little overall explanatory power.
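These definitions are easy to compute directly. The sketch below (numpy only, synthetic data) adds an irrelevant regressor and shows that R² cannot fall, while the adjusted version penalizes the extra coefficient:

```python
import numpy as np

def r2_and_adjusted(y, X):
    """R^2 and adjusted R^2 for OLS of y on X (X includes the intercept column)."""
    n, kp1 = X.shape
    k = kp1 - 1
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = np.sum(resid**2)
    sst = np.sum((y - y.mean())**2)
    return 1 - ssr / sst, 1 - (ssr / (n - k - 1)) / (sst / (n - 1))

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 1 + 0.5 * x + rng.normal(size=n)
junk = rng.normal(size=n)  # a regressor with no explanatory power

r2_small, adj_small = r2_and_adjusted(y, np.column_stack([np.ones(n), x]))
r2_big, adj_big = r2_and_adjusted(y, np.column_stack([np.ones(n), x, junk]))

print(r2_big >= r2_small)  # True: R^2 never decreases when a regressor is added
```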

7.2 Information criteria

There are three closely related information criteria that can be used for comparisons of regression models: the Akaike, Schwarz and Hannan-Quinn criteria. They have the general form

IC = log(SSR/n) + (k + 1) p/n,

where

Akaike: p = 2
Schwarz: p = log n
Hannan-Quinn: p = 2 log log n.

A regression is preferred to another if it has a smaller IC, whichever of the three is used.

The problem with having four different criteria for model comparison is that it is unclear which to rely on. All four methods are widely used in practice and each of them is derived from different principles and has different desirable (and undesirable) properties. In order, the R̄² is most inclined to prefer the larger of two regression models, followed by the Akaike IC, the Hannan-Quinn IC and then the Schwarz IC, which is the most likely of the four to prefer the smaller of two regression models. We will rely on the Akaike criterion in this subject.
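A sketch of the three criteria in the form given above, using hypothetical values of SSR, n and k:

```python
import numpy as np

def info_criteria(ssr, n, k):
    """Akaike, Schwarz and Hannan-Quinn criteria, IC = log(SSR/n) + (k+1)p/n."""
    base = np.log(ssr / n)
    return {
        "akaike": base + (k + 1) * 2 / n,
        "schwarz": base + (k + 1) * np.log(n) / n,
        "hannan_quinn": base + (k + 1) * 2 * np.log(np.log(n)) / n,
    }

ic = info_criteria(ssr=50.0, n=100, k=2)

# At n = 100 the penalties are ordered log(n) > 2*log(log(n)) > 2, so the Schwarz
# criterion penalizes the coefficients hardest and the Akaike the least.
print(ic["schwarz"] > ic["hannan_quinn"] > ic["akaike"])  # True
```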

7.3 Adjusted R² as an IC

The R̄² at first sight appears quite different from the other three ICs, but in fact it is very closely related. Choosing a model with a larger value of R̄² is identical to choosing a model with a smaller value of log(1 − R̄²), and

log(1 − R̄²) = log(SSR/(n − k − 1)) − log(SST/(n − 1))
            = log(SSR/n) − log((n − k − 1)/n) − log(SST/(n − 1))
            = log(SSR/n) − log(1 − (k + 1)/n) − log(SST/(n − 1))
            ≈ log(SSR/n) + (k + 1)/n − log(SST/(n − 1)),

using the approximation log(1 − (k + 1)/n) ≈ −(k + 1)/n. The final term log(SST/(n − 1)) is the same for every regression with the same dependent variable. Therefore choosing a regression with larger R̄² is (almost) equivalent to choosing a regression with smaller

log(SSR/n) + (k + 1)/n,

which is an information criterion of the general form above with p = 1.

8 Functional form

An additional issue that can occur is one of incorrect functional form. This has implications for the estimation of a conditional mean, whether or not causal inference is of interest. In general, suppose the true conditional expectation is

E(y_i | x_i) = β₀ + β₁ x_i + g(x_i)

for some function g, but the linear SRF

ŷ_i = β̂₀ + β̂₁ x_i

is estimated. Recall the slope coefficient β̂₁ has the representation

β̂₁ = Σ_{i=1}^n a_{n,i} y_i,

where

a_{n,i} = (x_i − x̄) / Σ_{i=1}^n (x_i − x̄)²

and Σ_{i=1}^n a_{n,i} = 0 and Σ_{i=1}^n a_{n,i} x_i = 1. Then

E[β̂₁] = E[Σ_{i=1}^n a_{n,i} E(y_i | x_i)]
       = E[β₀ Σ_{i=1}^n a_{n,i} + β₁ Σ_{i=1}^n a_{n,i} x_i + Σ_{i=1}^n a_{n,i} g(x_i)]
       = β₁ + E[Σ_{i=1}^n a_{n,i} g(x_i)]
       ≠ β₁,

where Σ_{i=1}^n a_{n,i} g(x_i) has the interpretation of being the slope coefficient from a regression of g(x_i) on x_i.

The general conclusion from this is that a misspecified functional form results in biased estimates of the conditional mean E(y_i | x_i). This is a different problem from omitted variables, which does not bias conditional mean estimates, although it will generally bias estimates of causal effects if these are of interest.

9 Regression and causality

A regression is a statistical model for the conditional mean of a dependent variable given some explanatory variables. To take a simple example, the PRF

E(wage_i | educ_i) = β₀ + β₁ educ_i  (80)

measures how the average wage changes with different values of education. With β₁ > 0, we would find the average wage of individuals with 15 years of education is higher than the average wage of individuals with 12 years of education, and the difference between these two averages would be 3β₁.

It is common in practice to want to take the interpretation of a regression further and to claim a causal relationship: for example, that an individual who undertakes a university degree (hence increasing their years of education from 12 to 15) can expect to increase their wages by 3β₁ as a result of this extra education. This causal statement is a much stronger interpretation of (80) than simply saying that higher educated individuals have higher average wages, and is far more difficult to justify. Much research at the frontier of econometrics focusses on if and how different statistical models might be given causal interpretations. It is generally necessary to go beyond statistical arguments to a clear understanding of the nature of the practical question and the way that the data has been obtained.

In order to give (80) a causal interpretation, it is necessary that an individual's wages be caused in a manner that satisfies the mathematical relationship

wage_i = β₀ + β₁ educ_i + u_i,  (81)

where u_i is the disturbance term that captures all of the other factors that cause wages besides education, and it is necessary that this disturbance term satisfy

E(u_i | educ_i) = 0.  (82)

Taking the conditional expectation of both sides of (81) given educ_i and applying (82) gives (80). It is necessary that both (81) and (82) hold in order for the regression (80) to be given the interpretation that an extra year of education causes an individual's wage to rise by β₁. Sometimes this interpretation may be possible, but there are many ways in which (81) and especially (82) may be violated, even though (80) may be a valid representation of the conditional mean of wages. Note that (82) requires that education have no explanatory power for any of the factors that make up the disturbance term u_i, a requirement that can be very difficult to satisfy in practice.

9.1 Notation

One aspect of the notation here differs from that of the textbook. A regression is a statistical model of a conditional expectation, and so for our purposes is always represented explicitly as a conditional expectation, as in (80). In Wooldridge and other textbooks, it is common to represent a regression in the form (81) as well as in the form (80). In these notes the notation (81) will be reserved for an equation representing how the dependent variable is caused. This causal equation may or may not correspond to a regression equation, as we will now discuss.

To be clear, a regression model represents the conditional mean of the dependent variable and is therefore written in terms of that conditional mean (e.g. E(wage_i | educ_i)). A causal equation represents how the dependent variable itself is determined and is therefore written in terms of that dependent variable (e.g. wage_i). The regression model always measures the conditional mean, but if the regression model and the causal equation happen to coincide, then the regression can also be given a causal interpretation.

9.2 Regressions for prediction

Before discussing causal interpretations further, it should be noted that many regressions are not meant to be causal in the first place. Regressions for prediction / forecasting are a leading example. Consider the PRF for final exam marks

E(exam_i | asgnmt_i) = β₀ + β₁ asgnmt_i.  (83)

This provides a statistical model for how predicted final exam marks vary with assignment marks. It may be of interest to both students and teachers in summarising the relationship between on-course assessment and the final exam. It is clearly not a causal regression though: assignment marks do not cause exam marks. A better causal story would be that both assignment and exam marks are caused by some combination of study during the semester (including lecture and tutorial participation, reading, revision and so on) and pre-existing ability (extent of previous exposure to statistics, general intelligence and so on). A highly stylised causal model of this might be

exam_i = δ₀ + δ₁ study_i + δ₂ ability_i + u_i
asgnmt_i = γ₀ + γ₁ study_i + γ₂ ability_i + v_i,

where u_i and v_i represent the disturbances capturing all the other causal factors that influence individual marks. Presumably all of δ₁, δ₂, γ₁, γ₂ are positive, so that the causal model generates a positive statistical relationship between assignment and exam marks, and this statistical relationship is captured by (83). So estimates of (83) may be useful for predicting final exam marks, but they do not attempt to uncover any causal factors that produce either of those marks in the first place. Regression (83) is an example of the saying that correlation need not imply causation.

This discussion reveals one way in which an attempt at causal modelling may fail. A regression model E(y_i | x_i) = β₀ + β₁ x_i may be specified in the belief that x_i causes y_i, when the true story is that some other factor z_i causes both y_i and x_i and produces a purely statistical relationship between them.

9.3 Omitted variables

Omitted explanatory variables are a common reason that regression models fail to measure causal effects. The case of wages and education is famous for this problem in econometrics. Suppose wages are truly caused by

wage_i = β₀ + β₁ educ_i + β₂ ability_i + u_i,  (84)

where

E(u_i | educ_i, ability_i) = 0.  (85)

This is a highly simplified model of wages, but it is sufficient for this discussion. Natural ability is a difficult concept involving intelligence of various sorts, persistence, resilience and other such factors. Numerical measurement of natural ability is probably impossible and wage regressions do not contain this variable in practice. Nevertheless, ability is surely an important causal factor for an individual's productivity, and hence their wages, implying β₂ > 0 in (84).

In addition, more able individuals will generally obtain higher levels of education, since they can use their ability to qualify for higher education opportunities and also will benefit more from taking up such opportunities. We might therefore expect to find a statistical relationship between education and ability of the form

E(ability_i | educ_i) = γ₀ + γ₁ educ_i,  (86)

with γ₁ > 0. This education / ability relationship may or may not be causal, or causation may run in the opposite direction, but that does not matter for the discussion of the interpretation of (84).

Now suppose we specify a PRF of the form

E (wagei jeduci ) =

102

1 educi ;

(87)

not including ability. The omission of ability does not introduce a problem for the SRF as an

estimator of this PRF (it is still unbiased, asymptotically normal coe cients and t statistics and

so on), so the estimation of the conditional mean of wages given education is correct. The question

is whether 1 measure the causal eect of education on wages, i.e. whether 1 = 1 in (84)?

To answer this requires an extension of the LIE E[y] = E[E(y|x)]. A more general version is

E[y|z] = E[E(y|x, z) | z].

This has exactly the same structure as the basic LIE, but each of the expectations has z as an additional conditioning variable. In the current context this extended LIE can be used to write

E(wage_i | educ_i) = E[E(wage_i | educ_i, ability_i) | educ_i].   (88)

Taking the conditional expectation of (84) given educ_i and ability_i and applying (85) gives

E(wage_i | educ_i, ability_i) = β0 + β1 educ_i + β2 ability_i,

so that (88) becomes

E(wage_i | educ_i) = E[β0 + β1 educ_i + β2 ability_i | educ_i]
= β0 + β1 educ_i + β2 E(ability_i | educ_i).

Substituting (86) for E(ability_i | educ_i) then gives

E(wage_i | educ_i) = β0 + β1 educ_i + β2 (γ0 + γ1 educ_i)
= (β0 + β2 γ0) + (β1 + β2 γ1) educ_i.

Comparing this with (87) shows that

δ0 = β0 + β2 γ0,   δ1 = β1 + β2 γ1.

That is, δ1 does not measure the causal effect β1. Instead it measures a mixture of coefficients from both (84) and (86). The fact that the SRF for (87) estimates δ1 and not β1 is generally referred to as omitted variable bias. It is not bias in the statistical sense, since δ̂1 is unbiased for δ1 regardless. The so-called bias is not an estimator property, but is really the fact that the model (87) does not match the causal mechanism (84) and therefore has different parameter values.

In this case it is plausible to think that β2 > 0 and γ1 > 0, which implies δ1 > β1. That is, the regression (87) will over-state the causal effect of education on wages. For some intuition for this, imagine comparing average wages between two groups of individuals, the first group with 12 years of education, the second group with 15 years of education. The average wage for the second group will be higher (by 3δ1). But this difference is due to two factors: the second group has extra education, but will also consist of individuals of generally higher ability. So the average wage difference between the two groups is due to both education and ability differences, not education alone. Attributing the entire average wage difference to education is an error because the comparison fails to control for ability differences.

Note that omitted variables would not be a problem for causal estimation if γ1 = 0. (It is assumed that β2 ≠ 0 for this discussion, otherwise ability_i would be irrelevant anyway and could be safely omitted.) That is, if the included explanatory variable has no explanatory power for the omitted variable, there will be no omitted variable bias, i.e. δ1 = β1.
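The algebra above is easy to check by simulation. The following sketch (Python, with entirely made-up parameter values; none of these numbers come from the notes) generates data from a causal equation of the form (84) together with an ability/education relationship of the form (86), and confirms that the short regression of wage on education alone estimates β1 + β2 γ1 rather than β1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Made-up parameters for the causal wage equation (84):
# wage = beta0 + beta1*educ + beta2*ability + u
beta0, beta1, beta2 = 1.0, 0.8, 0.5
# Made-up ability/education relationship (86): E(ability|educ) = gamma0 + gamma1*educ
gamma0, gamma1 = 0.0, 0.6

educ = rng.normal(12.0, 2.0, n)
ability = gamma0 + gamma1 * educ + rng.normal(0.0, 1.0, n)
u = rng.normal(0.0, 1.0, n)
wage = beta0 + beta1 * educ + beta2 * ability + u

# Short regression of wage on educ alone: its slope estimates
# delta1 = beta1 + beta2*gamma1, not the causal beta1
X = np.column_stack([np.ones(n), educ])
delta_hat = np.linalg.lstsq(X, wage, rcond=None)[0]
print(delta_hat[1], beta1 + beta2 * gamma1)
```

With β2 γ1 > 0 the short-regression slope settles near β1 + β2 γ1 = 1.1, above the causal β1 = 0.8, exactly as the derivation of δ1 predicts.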


9.4 Simultaneity

Another problem with causal interpretations of regression models arises when the causality between two variables runs in both directions. That is, there is causality from the explanatory variable to the dependent variable of the regression, but also causality in the other direction, from the dependent variable to the explanatory variable. In this case we say the variables are simultaneously determined.

Consider the CEO salary example, where firm profitability (as measured by Return on Equity) was used as an explanatory variable. It was found that average CEO salary varied with firm profitability. This is not the same thing, however, as saying that the level of CEO salary is caused by firm profitability. This may be true, or it may be that highly paid CEOs are more competent and cause firms to be more highly profitable, or a mixture of the two effects. If firms determine their CEO's salary on the basis of their profitability, and highly paid CEOs also cause higher profits, we would say the two outcomes are simultaneously determined. This might be represented in equation form as

salary_i = β0 + β1 roe_i + u_i,   (89)
roe_i = γ0 + γ1 salary_i + v_i,   (90)

where u_i represents the other factors that determine CEO salary and v_i represents all the other factors that determine the firm's return on equity. In order for each of these equations to be given some sort of statistical interpretation, it is necessary to say something about u_i and v_i. In the first equation we would like to assume that E(u_i | roe_i) = 0, while in the second E(v_i | salary_i) = 0. These assumptions would allow each of these equations to be given regression representations. Unfortunately neither assumption is possible when there is simultaneity. For example, E(u_i | roe_i) = 0 implies that u_i and roe_i must be uncorrelated, but the simultaneous structure of the equations dictates that any factor that causes the CEO's salary must then also be a factor causing Return on Equity, because of salary's presence in the second equation. This can be made explicit by substituting the equation for salary_i into the equation for roe_i and re-arranging to give

roe_i = (γ0 + γ1 β0) / (1 − γ1 β1) + [γ1 / (1 − γ1 β1)] u_i + [1 / (1 − γ1 β1)] v_i.

The resulting correlation between roe_i and u_i implies that E(u_i | roe_i) = 0 is not possible. Therefore the PRF

E(salary_i | roe_i) = δ0 + δ1 roe_i

does not have the same parameters as the causal equation (89), i.e. δ1 ≠ β1. The PRF provides a representation of the conditional mean of CEO salary given Return on Equity, and an unbiased estimate is provided by the SRF, but the conditional mean differs from the causal equation because of the simultaneity.
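A simulation makes the mechanism concrete. The sketch below (Python; all coefficient values are invented for illustration) generates data from a simultaneous system like (89)-(90) via its reduced form, then shows that u_i is correlated with roe_i and that OLS does not recover β1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Made-up coefficients for the simultaneous system (89)-(90)
b0, b1 = 1.0, 0.5     # salary = b0 + b1*roe + u
g0, g1 = 0.5, 0.4     # roe    = g0 + g1*salary + v

u = rng.normal(0.0, 1.0, n)
v = rng.normal(0.0, 1.0, n)

# Reduced form: substitute the salary equation into the roe equation and solve
denom = 1.0 - b1 * g1
roe = (g0 + g1 * b0 + g1 * u + v) / denom
salary = b0 + b1 * roe + u

# u is correlated with roe, so E(u|roe) = 0 fails ...
print(np.corrcoef(u, roe)[0, 1])

# ... and the OLS slope of salary on roe does not recover b1
X = np.column_stack([np.ones(n), roe])
slope = np.linalg.lstsq(X, salary, rcond=None)[0][1]
print(slope, b1)
```

The OLS slope here converges to β1 plus a simultaneity term, not to the causal β1.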

9.5 Sample selection

Sample selection problems can result in differences between the parameters of a PRF and the underlying causal mechanism. The problem arises when a simple random sample is not available, and instead the sample is chosen at least partly based on the dependent variable itself, or some other factor correlated with the dependent variable.

In Tutorial 5 it was found that a firm's CEO salary was a positive predictor of the risk of the firm's stock. Suppose there is a causal relationship

|return_i| = β0 + β1 salary_i + u_i,   (91)

with β1 > 0 implying that higher CEO salaries cause higher risk in the stocks (greater magnitude movements in share price, either positive or negative). Further suppose for this story that

E(u_i | salary_i) = 0,

so that

E(|return_i| | salary_i) = β0 + β1 salary_i.   (92)

However, the risks undertaken by some highly paid CEOs may have been so large and gone so wrong that their firms went bankrupt. Such firms with very large negative returns may therefore be excluded from the sample (if, for example, their bankruptcy resulted in them being removed from a database of currently trading firms). To make the story simple, suppose we only observe firms for whom return_i > −90, say, such that firms that lost more than 90% of their value went bankrupt and were excluded from the database. (This 90% figure is just made up for this story; firm bankruptcy is more complicated in practice of course!) In that case our regression model for E(|return_i| | salary_i) is in fact a regression model for E(|return_i| | salary_i, return_i > −90). That is, if firms with return_i ≤ −90 are unavailable for our sample, our regression model is really

E(|return_i| | salary_i, return_i > −90) = δ0 + δ1 salary_i.   (93)

This conditional expectation differs from E(|return_i| | salary_i), because the latter averages over some larger absolute returns that are excluded from the former. The main point is that the PRF (93) based on the available sample would not match the PRF (92) derived from the causal equation (91) for all firms, so the coefficients in (93) would differ from the causal coefficients in (91).
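A stylised simulation can illustrate how selecting the sample on the dependent variable shifts the regression coefficients. The sketch below (Python; all numbers are invented, and it is a simple truncation example rather than the returns model itself) drops observations with large values of the dependent variable and shows that the OLS slope in the selected sample is attenuated relative to the causal slope.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Stylised causal relationship y = 1 + 1*x + u, with observations
# having y >= 4 missing from the sample (selection on the dependent variable)
x = rng.uniform(0.0, 4.0, n)
u = rng.normal(0.0, 1.0, n)
y = 1.0 + 1.0 * x + u

def ols_slope(xv, yv):
    """Slope from an OLS regression of yv on an intercept and xv."""
    X = np.column_stack([np.ones(len(xv)), xv])
    return np.linalg.lstsq(X, yv, rcond=None)[0][1]

keep = y < 4.0                             # the selection rule
slope_full = ols_slope(x, y)               # close to the causal slope of 1
slope_kept = ols_slope(x[keep], y[keep])   # attenuated by the selection

print(slope_full, slope_kept)
```

The selected-sample slope is well below the causal slope because large y values are systematically removed at high x, just as bankrupt high-risk firms are removed in the returns story.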

10 Regression with time series data

Time series data differs in important respects from cross-sectional data. Time series data on a variable is collected over a period of time, as opposed to a cross-section, which is collected at (at least approximately) a single point in time. Examples of time series data include observations on a share price or market index recorded each minute or each day or at any other frequency, or exchange rates measured similarly, or macroeconomic variables like price inflation or GDP growth that are measured monthly or quarterly, and so on. This time series aspect introduces different features to the data compared to a cross section. Firstly, the observations are ordered, meaning that there is a natural ordering in time that does not apply to cross sections. When we take a simple random sample of individuals or firms or countries there is no single order of observations that is naturally imposed (although they can of course be ordered according to any criterion we wish after they are collected).

Statistically, a very interesting feature of time series data is that there is generally some form of temporal dependence that is interesting to model. Temporal dependence means there is statistical dependence (i.e. correlation or predictability) between observations at different points in time. For example there may be information in today's stock prices that is useful for predicting movements in prices tomorrow, or information in this month's inflation figure about next month's inflation or interest rates or GDP growth, and so on. Modelling this dependence over time is of great interest both for forecasting/prediction purposes and also for attempts at causal modelling with time series. The dependence also means that the theory underlying regression using OLS is different, because the i.i.d. assumption is generally no longer applicable. That is, time series data cannot be collected using a simple random sample.

Variables with time series data are generally denoted y_t and x_t rather than y_i and x_i. The difference is purely convention, but helps as a reminder of which type of data is in use for a particular model. Following Wooldridge, it will be useful to begin by denoting the dependent variable as y_t and an explanatory variable as z_t. The reason for the switch to z_t instead of x_t will become clear, but isn't very important. A simple static regression with time series then looks like

E(y_t | z_t) = β0 + β1 z_t,   (94)

which has the same structure as a cross-sectional regression, but needs different theoretical underpinnings without the structure of an i.i.d. sample. We will not pursue this, but instead discuss some more interesting time series models that are used both for forecasting and causal modelling.

10.1 Dynamic regressions

A dynamic regression is one that models the ways in which the relationships between variables can evolve over time. There are many ways of doing this, but just two of the most popular approaches will be covered here.

10.1.1 Finite distributed lag models

In specifying a regression model, the concept of conditioning is obviously fundamental. A regression model is a model of a conditional mean. In time series analysis it becomes important to be clear about exactly what is being conditioned on in any regression model. It is often of most practical interest to condition not just on z_t as in (94), but also on past values as well. That is, a time series regression model is often specified conditional on all values of the explanatory variable that are observable at time t. The conditional expectation is written E(y_t | z_t, z_{t−1}, ..., z_1). The idea is that previous values of z_t might also be useful for explaining y_t. A regression of the form

E(y_t | z_t, z_{t−1}, ..., z_1) = α0 + δ0 z_t + δ1 z_{t−1} + ... + δq z_{t−q}   (95)

is often used. A variable of the form z_{t−j} (for any j > 0) is called a lag of z_t. The regression (95) is called a Finite Distributed Lag (FDL) model (the "Finite" part not being used by all authors). The number of lags q to include in this model can be determined on the basis of the sample size (using few lags if few observations are available), the frequency of the data (sometimes q = 12 for monthly data, q = 5 for daily data, etc.) or, most commonly, on the basis of statistical analysis of the model to see what value of q seems most appropriate for explaining y_t.

An FDL model captures the idea that the full effect of a change in z_t on the mean of y_t may not occur immediately, but may take several time periods. For example, a central bank may raise official interest rates in order to attempt to reduce the level of inflation, but it is well known that there are lags in adjustment in the economy, such that interest rate changes take some months (as many as 12-18 months) for their effects to be fully felt. A very simple FDL model with monthly data to capture this idea would take the form

E(inf_t | r_t, r_{t−1}, ..., r_1) = α0 + δ0 r_t + δ1 r_{t−1} + ... + δ12 r_{t−12},   (96)

which allows current inflation (inf_t) to be explained by interest rate changes that were made up to 12 months ago. FDL models are typically used for policy analysis questions such as these.
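Estimating an FDL model mainly involves constructing the lagged regressors. A minimal sketch (Python; the coefficients and the AR(1) process for z_t are made up for illustration) simulates an FDL(2) and recovers its parameters by OLS, losing the first q observations to the lags.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
q = 2

# Hypothetical FDL(2) data-generating process (made-up coefficients):
# E(y_t | z_t, z_{t-1}, ...) = 0.5 + 1.0 z_t + 0.6 z_{t-1} + 0.3 z_{t-2}
z = np.empty(n)
z[0] = rng.normal()
for t in range(1, n):                 # z_t follows a weakly dependent AR(1)
    z[t] = 0.5 * z[t - 1] + rng.normal()
y = 0.5 + 1.0 * z + 0.6 * np.roll(z, 1) + 0.3 * np.roll(z, 2) + rng.normal(size=n)
y[:q] = np.nan                        # first q observations have no valid lags

# Regressor matrix for t = q+1, ..., n: the first q observations are lost
X = np.column_stack([np.ones(n - q), z[q:], z[q - 1:-1], z[q - 2:-2]])
coef = np.linalg.lstsq(X, y[q:], rcond=None)[0]
print(coef)   # roughly [0.5, 1.0, 0.6, 0.3]
```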

10.1.2 Autoregressive distributed lag models

FDL models can be extended to allow for lags of the dependent variable. The conditioning set for the regression is extended to cover not only present and past explanatory variables, but also past values of the dependent variable. The model is

E(y_t | z_t, y_{t−1}, z_{t−1}, ..., y_1, z_1) = α0 + ρ1 y_{t−1} + ... + ρp y_{t−p} + δ0 z_t + δ1 z_{t−1} + ... + δq z_{t−q},   (97)

so that past values of y_t are permitted to have explanatory power for y_t. This is an additional way to introduce a concept of lagged effects or inertia into a model of a dynamic situation. Model (97) is called an Autoregressive Distributed Lag (ARDL) model. It is a flexible way of capturing dynamic effects, but requires more effort to interpret than the FDL model.
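As a sketch of how a model like (97) behaves, the following Python simulation (with coefficients invented for illustration) generates an ARDL(1,1) process and estimates it by OLS over the usable sample.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000

# Hypothetical ARDL(1,1) process (made-up coefficients):
# E(y_t | z_t, y_{t-1}, z_{t-1}, ...) = 0.2 + 0.5 y_{t-1} + 1.0 z_t + 0.4 z_{t-1}
z = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.2 + 0.5 * y[t - 1] + 1.0 * z[t] + 0.4 * z[t - 1] + rng.normal()

# OLS over t = 2, ..., n (one observation lost to the lags, since p = q = 1)
X = np.column_stack([np.ones(n - 1), y[:-1], z[1:], z[:-1]])
coef = np.linalg.lstsq(X, y[1:], rcond=None)[0]
print(coef)   # close to [0.2, 0.5, 1.0, 0.4]
```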

10.1.3 Forecasting

A small variation on the ARDL model is often used for forecasting. Forecasting is the attempt to predict a variable in the future. In the simplest case, a forecast is made one time period into the future. Regressions such as (95) and (97) are not useful for forecasting because they contain z_t as an explanatory variable for y_t, which means a forecast for the future value of y_t is being expressed in terms of the future value of z_t. A forecasting model needs to remove any variables at time t from its set of conditioning variables, and may take a form such as

E(y_t | y_{t−1}, z_{t−1}, ..., y_1, z_1) = α0 + ρ1 y_{t−1} + ... + ρp y_{t−p} + δ1 z_{t−1} + ... + δq z_{t−q}.   (98)

In this model the forecast of y_t (i.e. E(y_t | y_{t−1}, z_{t−1}, ..., y_1, z_1)) is expressed purely in terms of variables that are available in the previous time period t − 1.
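A one-step-ahead forecast from a model of this form uses only information dated t − 1. The sketch below (Python; the process and coefficients are made up) fits the forecasting regression and forms the forecast for the period after the sample.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3_000

# Made-up process in which y_t depends on its own lag and lagged z only,
# matching the forecasting form (98) with p = q = 1
z = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.1 + 0.6 * y[t - 1] + 0.8 * z[t - 1] + rng.normal()

# Estimate E(y_t | y_{t-1}, z_{t-1}) by OLS over t = 2, ..., n
X = np.column_stack([np.ones(n - 1), y[:-1], z[:-1]])
coef = np.linalg.lstsq(X, y[1:], rcond=None)[0]

# One-step-ahead forecast of the next period uses only information dated n
y_forecast = coef[0] + coef[1] * y[-1] + coef[2] * z[-1]
print(coef, y_forecast)
```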

10.1.4 Application

Especially since the GFC, there has been considerable discussion in economics about the various possible effects of government debt on economic growth. The "austerity" story is that government debt crowds out private sector economic activity and undermines confidence, and hence prolongs the recession, giving rise to calls to cut government spending and hence government debt. The "fiscal stimulus" story is that at a time of recession the government should spend more than they otherwise might (going into further debt if necessary) in an effort to stimulate the economy and end the recession, leaving the task of then reducing the debt to when strong economic growth has resumed.

We will look at a simple dynamic model relating government debt and economic growth in Australia using annual data from 1971 to 2012 on the real GDP growth rate per year (from the RBA) and on the net government debt as a percentage of GDP (from the Australian government budget papers). Time series plots are shown in Figures 88 and 10.1.4. Observe that the evolution of the debt/GDP ratio is much smoother than that of GDP growth, a fact that will inform our regression modelling later.

A time series regression model for the question of interest is the ARDL PRF

E(growth_t | debt_t, growth_{t−1}, debt_{t−1}, ...) = α0 + ρ1 growth_{t−1} + ... + ρp growth_{t−p} + δ0 debt_t + δ1 debt_{t−1} + ... + δq debt_{t−q}.

The debt variables in this model can be used to measure the effect of government debt on economic growth. The inclusion of the lagged dependent variables is a simple way for the model to allow for the other dynamics in the economy.

[Figure: time series plot of GROWTH, annual, 1975-2010]

[Figure: time series plot of DEBT_GDP, annual, 1975-2010]

10.2 OLS estimation

The algebra of OLS estimation in these regressions is almost identical to that for cross-sectional regressions, with one exception. The presence of lags in these regressions requires adjustments to be made at the start of the sample. To illustrate, suppose we have observations for t = 1, ..., n, and specify the first-order FDL model

E(y_t | z_t, z_{t−1}, ..., z_1) = α0 + δ0 z_t + δ1 z_{t−1}.

This equation has a problem for t = 1 because it involves the variable z_{t−1} = z_0 on the right hand side, and this is unavailable. Strictly speaking we should write down these models as applying only to values of t for which the variables are available. For example

E(y_t | z_t, z_{t−1}, ..., z_1) = α0 + δ0 z_t + δ1 z_{t−1},   t = 2, ..., n

or

E(y_t | z_t, z_{t−1}, ..., z_1) = α0 + δ0 z_t + δ1 z_{t−1} + ... + δq z_{t−q},   t = q + 1, ..., n

or

E(y_t | z_t, y_{t−1}, z_{t−1}, ..., y_1, z_1) = α0 + ρ1 y_{t−1} + ... + ρp y_{t−p} + δ0 z_t + δ1 z_{t−1} + ... + δq z_{t−q},   t = max(p, q) + 1, ..., n

and so on.

OLS estimation therefore does not use all n observations; it only uses those observations for which the regression is well-defined in the sample. In the preceding examples, OLS will use respectively n − 1, n − q and n − max(p, q) observations.

The theory for time series regressions is more difficult than for i.i.d. regressions. Only an outline of some practically important points is given here.

10.2.1 Bias

Unbiasedness is more difficult to show in time series regressions, and often does not hold. Recall the unbiasedness proof for an i.i.d. regression

E(y_i | x_i) = β0 + β1 x_i,

in which the OLS estimator is written

β̂1 = [Σ_{i=1}^n (x_i − x̄) y_i] / [Σ_{i=1}^n (x_i − x̄)²] = Σ_{i=1}^n a_{n,i} y_i,

and then the independence part of the i.i.d. conditions is used to deduce that

E(y_i | x_i) = E(y_i | x_1, ..., x_n),   (99)

so that

E[β̂1] = E[Σ_{i=1}^n a_{n,i} E(y_i | x_1, ..., x_n)]   (100)
= E[Σ_{i=1}^n a_{n,i} (β0 + β1 x_i)]   (101)
= β0 E[Σ_{i=1}^n a_{n,i}] + β1 E[Σ_{i=1}^n a_{n,i} x_i]
= β1

(using Σ_{i=1}^n a_{n,i} = 0 and Σ_{i=1}^n a_{n,i} x_i = 1). The crucial condition is (99), since without that the step from (100) to (101) cannot happen.

Small biases Consider a simple time series regression

E(y_t | x_t, ..., x_1) = δ0 + δ1 x_t,   (102)

where x_t might be an explanatory variable such as z_t in (95), or x_t might just be the lagged dependent variable x_t = y_{t−1}, in which case we would have the so-called AR(1) model

E(y_t | y_{t−1}, y_{t−2}, ..., y_1) = δ0 + δ1 y_{t−1},   (103)

which is often used as a very simple forecasting model. Now the crucial condition, analogous to (99), that is required is

E(y_t | x_t, ..., x_1) = E(y_t | x_n, ..., x_1).   (104)

If this is true then x_t is said to be a strictly exogenous regressor and the OLS estimator of δ1 is unbiased. However, in a time series setting without independence across time, (104) can easily fail.

The simplest situation in which (104) is certain to fail is in the AR(1) model. In that case we have

E(y_t | y_{n−1}, ..., y_1) = y_t,

since y_t is included in the conditioning set y_{n−1}, ..., y_1. Thus E(y_t | y_{n−1}, ..., y_1) differs from E(y_t | y_{t−1}, y_{t−2}, ..., y_1) in (103) for all t = 2, ..., n − 1, implying that strict exogeneity does not hold in this model. The OLS estimator of an AR(1) model is biased, and more generally the OLS estimator of any model with a lagged dependent variable (e.g. any ARDL model) will also be biased. It turns out, however, that this bias is "small" in the sense that it does not arise from a misspecification of the model and disappears as the sample size grows. That is, in a reasonable sized sample we can expect the bias to be practically unimportant (just as in a reasonable sized sample we can treat the OLS coefficients and t statistics as being approximately normal and t distributed).
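The size of this AR(1) bias can be gauged by Monte Carlo. The sketch below (Python; the value ρ = 0.5 and the sample sizes are arbitrary choices) averages the OLS slope over many simulated AR(1) samples, showing a clear downward bias at n = 25 that largely disappears by n = 200.

```python
import numpy as np

rng = np.random.default_rng(6)

def ar1_slope_mean(n, reps=2_000, rho=0.5):
    """Average OLS slope estimate in the AR(1) model y_t = a + rho*y_{t-1} + e_t."""
    est = np.empty(reps)
    for r in range(reps):
        y = np.zeros(n)
        for t in range(1, n):
            y[t] = rho * y[t - 1] + rng.normal()
        X = np.column_stack([np.ones(n - 1), y[:-1]])
        est[r] = np.linalg.lstsq(X, y[1:], rcond=None)[0][1]
    return est.mean()

# The downward bias is visible for n = 25 but shrinks markedly by n = 200
m_small, m_large = ar1_slope_mean(25), ar1_slope_mean(200)
print(m_small, m_large)
```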

In (102) it is also possible for there to be bias if x_t is not a lagged dependent variable, depending on the nature of the relationships between x_t and y_t. If y_t has some explanatory value for future values of x_t (i.e. the leads x_{t+j} for some j > 0) then (104) will fail. For example, if y_t is correlated with x_{t+1} as well as x_t then we may have the relationship

E(y_t | x_n, ..., x_1) = β0 + β1 x_t + β2 x_{t+1},

which is not equal to (102). This latter relationship is usually not interesting from a practical perspective, since saying that y_t is explained by future values of x_t is useless for forecasting and is unlikely to be meaningful in causal modelling. The point is that we can use it to derive an expression for the coefficients in (102) by taking expectations conditional on x_t, ..., x_1 and applying the LIE:

E(y_t | x_t, ..., x_1) = E[E(y_t | x_n, ..., x_1) | x_t, ..., x_1]
= β0 + β1 x_t + β2 E[x_{t+1} | x_t, ..., x_1].

Suppose, for example, that x_t itself follows an AR(1) model, so that

E[x_{t+1} | x_t, ..., x_1] = γ0 + γ1 x_t.

Then

E(y_t | x_t, ..., x_1) = β0 + β1 x_t + β2 (γ0 + γ1 x_t)
= (β0 + β2 γ0) + (β1 + β2 γ1) x_t
= δ0 + δ1 x_t.

That is, the slope δ1 in (102) is given by δ1 = β1 + β2 γ1.

The expectation of the OLS estimator δ̂1 = Σ_{t=1}^n a_{n,t} y_t of the slope in (102) is then

E[δ̂1] = E[Σ_{t=1}^n a_{n,t} (β0 + β1 x_t + β2 x_{t+1})]
= β1 + β2 E[Σ_{t=1}^n a_{n,t} x_{t+1}]
= β1 + β2 E[γ̂1],

where γ̂1 = Σ_{t=1}^n a_{n,t} x_{t+1} is the OLS estimator of γ1 in the AR(1) model for x_t. As the OLS estimator of an AR(1) model, E[γ̂1] ≠ γ1, so that E[δ̂1] ≠ β1 + β2 γ1 = δ1, and this is the source of the bias in δ̂1. Again this bias is small, so that for reasonable sample sizes the bias in δ̂1 can be treated as unimportant.

Such explanatory power of y_t for future x_t is quite realistic. For example, in (96), current values of inflation may be useful for forecasting future interest rate movements because the central bank may set interest rates partly in response to observed inflation. Some bias may be present in the OLS estimation of the FDL model (96) as a result.

Large biases Biases can also arise from mis-specifying the conditional expectation. For example, suppose (102) is the assumed model, while in truth

E(y_t | x_n, ..., x_1) = θ0 + θ1 x_t + θ2 x_{t−1},

with the form of the conditioning set implying that x_t is a strictly exogenous regressor. This looks a lot like an omitted variables problem (i.e. x_{t−1} is omitted in (102)) but the consequences of omitting x_{t−1} are more like a functional form misspecification. That is, the estimates of the conditional mean E(y_t | x_t, ..., x_1) can be biased by the omission of x_{t−1}. To see this, we take the same approach as in the analysis of functional form misspecification. The OLS estimator δ̂1 in the SRF

ŷ_t = δ̂0 + δ̂1 x_t

can be written

δ̂1 = Σ_{t=1}^n a_{n,t} y_t

as usual, with Σ_{t=1}^n a_{n,t} = 0 and Σ_{t=1}^n a_{n,t} x_t = 1. Now

E[δ̂1] = E[Σ_{t=1}^n a_{n,t} E(y_t | x_n, ..., x_1)]
= E[Σ_{t=1}^n a_{n,t} (θ0 + θ1 x_t + θ2 x_{t−1})]
= θ1 + θ2 E[Σ_{t=1}^n a_{n,t} x_{t−1}],

where Σ_{t=1}^n a_{n,t} x_{t−1} is the slope coefficient in a regression of x_{t−1} on an intercept and x_t. If there is temporal dependence in x_t then this regression coefficient will generally be non-zero, implying that δ̂1 is not an unbiased estimator of θ1. This bias does not disappear with larger samples. In attempting to model E(y_t | x_t, ..., x_1) (i.e. in any FDL or ARDL model), it is necessary to have a method of choosing enough lags to include in the regression in order to avoid inducing biases in the estimates.

Summary of biases A time series regression is a conditional expectation E(y_t | x_t, x_{t−1}, ..., x_1). The explanatory variables x_t may include lagged dependent variables y_{t−1}, y_{t−2}, ... and/or other explanatory variables z_t, z_{t−1}, .... That is, E(y_t | x_t, x_{t−1}, ..., x_1) can include FDL, AR and ARDL models.

A "large" bias occurs if E(y_t | x_t, x_{t−1}, ..., x_1) is not correctly specified, i.e. if insufficient lags are included or if the incorrect functional form is specified. A large bias is one that does not disappear no matter how large the sample size, and is one we should try to avoid by careful specification.

If E(y_t | x_t, x_{t−1}, ..., x_1) is correctly specified then the OLS estimates of its parameters may still be subject to "small" biases arising from the temporal dependence in the variables. This bias is difficult to avoid (i.e. it arises even in well-specified models) but will disappear for larger samples and is usually not worried about in practical work.

10.2.2 Consistency and asymptotic normality

A general theoretical result that underpins much of practical time series analysis is as follows. If

1. the true conditional expectation for the PRF is

E(y_t | z_t, y_{t−1}, z_{t−1}, ..., y_1, z_1) = α0 + ρ1 y_{t−1} + ... + ρp y_{t−p} + δ0 z_t + δ1 z_{t−1} + ... + δq z_{t−q}, and

2. y_t and z_t are weakly dependent time series,

then the parameters of the OLS SRF

ŷ_t = α̂0 + ρ̂1 y_{t−1} + ... + ρ̂p y_{t−p} + δ̂0 z_t + δ̂1 z_{t−1} + ... + δ̂q z_{t−q}

are consistent and asymptotically normal estimators of the parameters of the PRF.

There are some new terms in this. A consistent estimator is asymptotically unbiased, so that any bias disappears as the sample size grows. That is, a consistent estimator may exhibit the small bias discussed above, but not large bias. An asymptotically normal estimator is one that obeys the Central Limit Theorem, just like the cross-sectional case. Then the OLS estimators are approximately normal and subsequent t and Wald tests are valid. The practical implication of this result is that we can use the OLS estimators (and resulting t and Wald tests) in just the same way as they are used in cross-sectional regressions.

There are two important conditions to be satisfied. The first is that sufficient lags have been included in the ARDL model to remove any large bias, as discussed above. The second condition is that y_t and z_t are weakly dependent, which is a new concept. A time series x_t is weakly dependent if any dependence between x_t and x_{t−h} decreases quickly to zero as h increases to infinity. An implication is that the correlation between x_t and x_{t−h} must quickly decrease to zero as h increases, which we will use to check for weak dependence. In a time series plot, a strongly dependent time series may exhibit a trend (a persistent upwards or downwards movement) and/or evolve very smoothly, and needs to be transformed before being included in a time series regression.

The practical steps for time series regression are therefore the following.

1. Check each variable for weak dependence, transforming (e.g. differencing) any that are strongly dependent.

2. Choose an FDL/AR/ARDL specification with sufficient lags.

3. Carry out estimation and inference by OLS methods as usual.

This is not the final word on time series regression; there are more complications that can arise, but this approach is often sufficient.

10.3 Checking for weak dependence

Deciding whether or not a time series displays weak or strong dependence can be a difficult and inexact process. The first piece of evidence to check is the time series plot. A strongly dependent time series may display a trend or a very smooth plot, while a weakly dependent time series will be less smooth. Figures 88 and 10.1.4 suggest that GDP growth is weakly dependent because its plot is not smooth at all, while the plot of debt/GDP is quite smooth and suggests strong dependence.

The other piece of evidence we will use is the correlogram. For a weakly dependent time series the correlation corr(x_t, x_{t−h}) will decrease quickly to zero as h increases, while for a strongly dependent time series corr(x_t, x_{t−h}) will decrease much more slowly. This is not a clear-cut criterion to apply, but is often informative. To obtain the correlogram of a time series in Eviews, choose View - Correlogram... for that series as shown in Figure 89, then select Level for now. The correlograms for growth and debt/GDP are shown in Figures 90 and 91. The relevant correlations are in the graph under the heading Autocorrelation and in the table under the heading AC. The autocorrelations for growth are all quite small and support the graphical evidence that GDP growth is weakly dependent. The autocorrelations for debt/GDP are considerably larger and decrease much more slowly towards zero. This evidence, together with the time series plot, leads us to treat debt/GDP as strongly dependent.

If a variable is judged to be strongly dependent then the usual next step is to take its first difference in order to achieve weak dependence. The difference is defined as

Δdebt_t = debt_t − debt_{t−1},

i.e. the amount by which debt/GDP changes from one year to the next. Usually one difference is sufficient, but occasionally differencing twice may be required. Eviews uses the letter D to generate a difference. The time series plot of Δdebt_t is shown in Figure 92, where it can be seen to be substantially less smooth than its undifferenced version. The correlogram of Δdebt_t is shown in Figure 93, where the autocorrelations can be seen to decrease towards zero faster than for the undifferenced version. These pieces of evidence are sufficient for us to proceed using debt_t in its first-differenced version.
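The level-versus-difference contrast can be mimicked with a toy series. In the sketch below (Python), a random walk stands in for a smooth, strongly dependent series like debt/GDP (this is an illustration, not the actual data): its autocorrelations stay large at long lags, while those of its first difference are near zero.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

def autocorr(x, h):
    """Sample correlation between x_t and x_{t-h}."""
    return np.corrcoef(x[h:], x[:-h])[0, 1]

# A random walk stands in for a smooth, strongly dependent level series
x = np.cumsum(rng.normal(size=n))
dx = np.diff(x)                      # first difference, like D(DEBT_GDP)

acf_level = [autocorr(x, h) for h in (1, 5, 10)]
acf_diff = [autocorr(dx, h) for h in (1, 5, 10)]
print(acf_level)   # stays large as h grows: strong dependence
print(acf_diff)    # near zero at all lags: weak dependence
```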

10.4 Model specification

To illustrate the specification and interpretation of these models, we will first select an FDL model and then an ARDL model. This is just for illustration, and usually the selection process could cover all possibilities together.

For any model of E(y_t | z_t, y_{t−1}, z_{t−1}, ..., y_1, z_1), it is a necessary condition for correct specification that the residuals of the SRF not display significant temporal dependence, in particular no autocorrelation. If we define the prediction error of the PRF as

e_t = y_t − E(y_t | z_t, y_{t−1}, z_{t−1}, ..., y_1, z_1),

[Figure: time series plot of D(DEBT_GDP), annual, 1975-2010]

then the definition of the prediction error implies

E(e_t | z_t, y_{t−1}, z_{t−1}, ..., y_1, z_1) = 0,

and therefore, for any j > 0,

cov(e_t, e_{t−j}) = E(e_t e_{t−j})
= E[E(e_t e_{t−j} | z_t, y_{t−1}, z_{t−1}, ..., y_1, z_1)]   (by the LIE)
= E[E(e_t | z_t, y_{t−1}, z_{t−1}, ..., y_1, z_1) e_{t−j}]
= 0,

so that e_t has no correlation with any lags of itself. The important step in this proof is from the second line to the third, where e_{t−j} is taken outside of the conditional expectation. This is possible because e_{t−j} is a function of y_{t−j}, z_{t−j}, y_{t−j−1}, z_{t−j−1}, ..., y_1, z_1, all of which are contained in the conditioning set z_t, y_{t−1}, z_{t−1}, ..., y_1, z_1 for any j > 0. The practical implication of this is that any evidence of autocorrelation in the residuals of the SRF, which would imply cov(e_t, e_{t−j}) ≠ 0, suggests that the PRF has been misspecified and requires additional lags. A convenient check for autocorrelation is provided by the residual correlogram in Eviews, see Figure 94.

Figures 95-97 show the results of FDL regressions for growth including zero, one and two lags respectively. The residual correlograms are shown in Figures 98-100. The last two columns of the residual correlograms are useful for autocorrelation testing. The null hypothesis for the Q-stat in row r is that there is no correlation between e_t and e_{t−j} for all j = 1, ..., r, with the p value for the test in the last column. For example, in Figure 98 we could set out a test for correlation at lags 1-4 as

1. H0 : cov(e_t, e_{t−j}) = 0 for j = 1, 2, 3, 4

2. H1 : cov(e_t, e_{t−j}) ≠ 0 for at least one j = 1, 2, 3, 4

3. Test statistic: the Q-stat in row four of the correlogram, with its p value in the final column

4. Significance level: α = 0.05

5. Decision rule: reject H0 for p < 0.05

6. Do not reject H0, so there is no evidence of autocorrelation at any lags less than or equal to four.

It is generally unnecessary to set out the full test like this for an autocorrelation check. It is sufficient to look down the last column of p values, and if any of them are less than 0.05 then consider the model as being misspecified and move on to another that attempts to address the problem, using more lags for example. In this case, there is no evidence of residual autocorrelation in any of the three FDL models.
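The Q-stat reported in the correlogram is a Ljung-Box statistic, which can be computed directly from the residual autocorrelations. The sketch below (Python; note that Eviews adjusts the degrees of freedom when the model contains ARMA terms, while this simple version uses df = r) computes Q and its chi-squared p-value for a white-noise series and for a strongly autocorrelated one.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(8)

def ljung_box(e, r):
    """Ljung-Box Q statistic and its chi-squared p-value for lags 1..r."""
    n = len(e)
    e = e - e.mean()
    denom = np.sum(e**2)
    rho = np.array([np.sum(e[j:] * e[:-j]) / denom for j in range(1, r + 1)])
    q = n * (n + 2) * np.sum(rho**2 / (n - np.arange(1, r + 1)))
    return q, chi2.sf(q, df=r)

# White-noise "residuals": no autocorrelation, so a large p-value is typical
e_ok = rng.normal(size=200)
q_ok, p_ok = ljung_box(e_ok, 4)

# Strongly autocorrelated "residuals": the test should clearly reject
e_bad = np.zeros(200)
for t in range(1, 200):
    e_bad[t] = 0.9 * e_bad[t - 1] + rng.normal()
q_bad, p_bad = ljung_box(e_bad, 4)

print(q_ok, p_ok, q_bad, p_bad)
```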

Since all three models pass the autocorrelation test, we can compare them using their AIC values. The FDL model with a single lag has the smallest AIC value and so would be chosen from among these three models.

ARDL models are also specified for growth; see Figures 101, 103 and 105. These are, respectively, ARDL(1,0), ARDL(1,1) and ARDL(1,2) models, implying each has a single lagged dependent variable (the AR(1) part) and respectively 0, 1 and 2 lagged explanatory variables. The residual correlogram for the ARDL(1,0) model in Figure 102 shows a significant lag nine autocorrelation, so this model is excluded from further comparisons. The other two models pass the residual autocorrelation tests. The ARDL(1,1) model has a lower AIC than the ARDL(1,2) model and is therefore preferred. The ARDL(1,1) model is also preferred to the FDL(1) model in Figure 96 according to the AIC. Of the six models considered here, the ARDL(1,1) would therefore be the one preferred overall. We will interpret both the ARDL(1,1) and FDL(1) models for illustrative purposes.

Figure 98: Residual correlogram for the FDL model with zero lags

Figure 99: Residual correlogram for the FDL model with one lag

Figure 100: Residual correlogram for the FDL model with two lags

10.5 Interpretation

The interpretation of models with lags is not quite as straightforward as in static regressions.

10.5.1 FDL models

Consider the FDL model with one lag,

$$E(y_t \mid x_t, x_{t-1}, \ldots, x_1) = \alpha + \beta_0 x_t + \beta_1 x_{t-1}.$$

The individual coefficients have similar interpretations to usual regressions. If $x_t$ is increased by one unit then the conditional mean of $y_t$ changes by $\beta_0$ units, and in these regressions this is called the impact multiplier. If $x_{t-1}$ is increased by one unit then the conditional mean of $y_t$ changes by $\beta_1$ units, so this change takes one time period before it occurs. The coefficient $\beta_1$ is called the lag one multiplier. These interpretations carry over to longer lags in FDL models.

Now suppose $x_{t-1}$ were increased by one unit and this increase were allowed to remain in time $t$ as well. In that case both $x_t$ and $x_{t-1}$ have been increased by one unit, so the effect on the conditional mean of $y_t$ is $\beta_0 + \beta_1$. This is called the long run multiplier. This joint interpretation of the coefficients (i.e. allowing both $x_t$ and $x_{t-1}$ to increase, rather than increasing one and holding the other constant) makes practical sense. If $x_t$ were the official interest rate and $y_t$ inflation, for example, the central bank would be interested to measure the effect on inflation if the interest rate were increased by 1% this month and the increase allowed to stay in place next month. In that context, the long run multiplier has more practical meaning than the lag one multiplier, which measures the effect of a 1% increase in the interest rate in one month that is then reversed the following month.

The long run multiplier can be estimated directly by re-writing the FDL regression as

$$E(y_t \mid x_t, x_{t-1}, \ldots, x_1) = \alpha + (\beta_0 + \beta_1)\, x_t - \beta_1\, \Delta x_t.$$

That is, regressing $y_t$ on an intercept, $x_t$ and $\Delta x_t$ will give a direct estimate of the long run multiplier $\beta_0 + \beta_1$, along with its standard error for t statistics and confidence intervals.

The FDL(1) model in Figure 96 can be written

$$\widehat{\mathit{growth}}_t = 3.198 - 0.586\, \Delta \mathit{debt}_t + 0.563\, \Delta \mathit{debt}_{t-1},$$

with standard errors $(0.186)$, $(0.153)$ and $(0.161)$ respectively; $n = 40$, $R^2 = 0.333$.

The two slope coefficients are significant at the 5% level (i.e. have $p < 0.05$ on their t statistics). The impact multiplier is $-0.586$, so that a 1% increase in the rate of change of the debt/GDP ratio predicts a 0.586% fall in the growth rate. This is both statistically and economically significant and would be consistent with the "austerity" story. However, the lag one multiplier is $+0.563$, so that one period later the effect of the increase in the rate of change of the debt/GDP ratio is of opposite sign and approximately the same magnitude. The long run multiplier is $-0.586 + 0.563 = -0.023\%$, so that an initial negative effect on growth is almost completely offset the following year by a positive effect on growth, with the net effect of the increase in debt on growth being very small. The transformed regression to estimate the long run multiplier directly involves a regression of growth on an intercept, $\Delta \mathit{debt}_t$ and $\Delta^2 \mathit{debt}_t$ (i.e. the second difference of debt), the results of which are shown in Figure 107. The long run multiplier estimate of $-0.023\%$ is insignificant, implying very little long run effect of government debt changes on predictions for economic growth, despite the significant changes in the short run (at lags 0 and 1).


Figure 107: Direct estimation of long run multiplier on the FDL model for growth
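The algebra behind the transformed regression is easy to verify numerically: regressing on $x_t$ and $\Delta x_t$ reproduces exactly the sum of the FDL coefficients as the coefficient on $x_t$. The following is a sketch with simulated data (the numbers below are illustrative, not the Figure 107 estimates):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=201)
y = 3.2 - 0.586 * x[1:] + 0.563 * x[:-1] + rng.normal(scale=0.5, size=200)

x_t, x_lag = x[1:], x[:-1]

# FDL parameterisation: y_t on (1, x_t, x_{t-1})
X1 = np.column_stack([np.ones(200), x_t, x_lag])
b1, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Long-run parameterisation: y_t on (1, x_t, dx_t) with dx_t = x_t - x_{t-1}
X2 = np.column_stack([np.ones(200), x_t, x_t - x_lag])
b2, *_ = np.linalg.lstsq(X2, y, rcond=None)

# The coefficient on x_t in the second fit equals beta0_hat + beta1_hat
print(b2[1], b1[1] + b1[2])  # identical up to rounding error
```

Because the two design matrices span the same column space, the equality is exact rather than approximate, which is why the transformed regression also delivers a valid standard error for the long run multiplier.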

10.5.2 ARDL models

Interpretation of ARDL models is more complicated because the dynamic effects are formed by a mixture of the lagged dependent and lagged explanatory variables. For the purposes of interpretation here, we will assume that $z_t$ is strictly exogenous. This makes the derivations simpler and may be a reasonable assumption when $z_t$ is a policy variable such as government debt or an official interest rate.

ARDL(1,0) Consider first the ARDL(1,0) model

$$E(y_t \mid y_{t-1}, \ldots, y_1, z_n, \ldots, z_1) = \delta + \alpha_1 y_{t-1} + \beta_0 z_t, \qquad (105)$$

where the conditioning on all of $z_n, \ldots, z_1$ reflects the assumption that $z_t$ is strictly exogenous. The impact multiplier of a one unit increase in $z_t$ on the conditional mean of $y_t$ is $\beta_0$. That is a standard interpretation.

Looking for effects at higher lags requires some derivations. First take (105) and lag it by one time period:

$$E(y_{t-1} \mid y_{t-2}, \ldots, y_1, z_n, \ldots, z_1) = \delta + \alpha_1 y_{t-2} + \beta_0 z_{t-1}. \qquad (106)$$

Now take the expectation of both sides of (105) conditional on $y_{t-2}, \ldots, y_1, z_n, \ldots, z_1$ and use the LIE on the left hand side and (106) on the right hand side to obtain

$$\begin{aligned}
E(y_t \mid y_{t-2}, \ldots, y_1, z_n, \ldots, z_1)
&= \delta + \alpha_1 E(y_{t-1} \mid y_{t-2}, \ldots, y_1, z_n, \ldots, z_1) + \beta_0 z_t \\
&= \delta + \alpha_1 (\delta + \alpha_1 y_{t-2} + \beta_0 z_{t-1}) + \beta_0 z_t \\
&= \delta(1 + \alpha_1) + \alpha_1^2 y_{t-2} + \alpha_1 \beta_0 z_{t-1} + \beta_0 z_t.
\end{aligned}$$

This representation shows that the lag one multiplier for a one unit increase in $z_t$ is $\alpha_1 \beta_0$.

To find the lag two multiplier, lagging (106) by another time period gives

$$E(y_{t-2} \mid y_{t-3}, \ldots, y_1, z_n, \ldots, z_1) = \delta + \alpha_1 y_{t-3} + \beta_0 z_{t-2}, \qquad (107)$$

and taking expectations of both sides of the previous representation conditional on $y_{t-3}, \ldots, y_1, z_n, \ldots, z_1$ gives

$$\begin{aligned}
E(y_t \mid y_{t-3}, \ldots, y_1, z_n, \ldots, z_1)
&= \delta(1 + \alpha_1) + \alpha_1^2 E(y_{t-2} \mid y_{t-3}, \ldots, y_1, z_n, \ldots, z_1) + \alpha_1 \beta_0 z_{t-1} + \beta_0 z_t \\
&= \delta(1 + \alpha_1) + \alpha_1^2 (\delta + \alpha_1 y_{t-3} + \beta_0 z_{t-2}) + \alpha_1 \beta_0 z_{t-1} + \beta_0 z_t \\
&= \delta(1 + \alpha_1 + \alpha_1^2) + \alpha_1^3 y_{t-3} + \alpha_1^2 \beta_0 z_{t-2} + \alpha_1 \beta_0 z_{t-1} + \beta_0 z_t.
\end{aligned}$$

This shows that the lag two multiplier for a one unit increase in $z_t$ is $\alpha_1^2 \beta_0$.

This process can be repeated as often as desired. The pattern is clearly that the lag $j$ multiplier for a one unit increase in $z_t$ is $\alpha_1^j \beta_0$. The long run multiplier is the sum over all of the individual lag $j$ multipliers. It is both simple and conventional to sum over all $j = 0, 1, 2, \ldots$ without upper limit, giving

$$\beta_0 + \beta_0 \alpha_1 + \beta_0 \alpha_1^2 + \beta_0 \alpha_1^3 + \cdots = \frac{\beta_0}{1 - \alpha_1},$$

which uses the geometric series $\sum_{j=0}^{\infty} \alpha_1^j = 1/(1 - \alpha_1)$ for $|\alpha_1| < 1$, the latter condition being satisfied when $y_t$ is weakly dependent. This is the long run effect on the conditional mean of $y_t$ of a permanent one unit increase in $z_t$.

ARDL(1,1) The same approach to interpretation applies to the ARDL(1,1) model

$$E(y_t \mid y_{t-1}, \ldots, y_1, z_n, \ldots, z_1) = \delta + \alpha_1 y_{t-1} + \beta_0 z_t + \beta_1 z_{t-1}. \qquad (108)$$

Repeating the substitutions used for the ARDL(1,0) model gives

$$E(y_t \mid y_{t-j}, \ldots, y_1, z_n, \ldots, z_1) = \delta(1 + \alpha_1 + \cdots + \alpha_1^{j-1}) + \alpha_1^j y_{t-j} + \beta_0 z_t + (\alpha_1 \beta_0 + \beta_1) z_{t-1} + \alpha_1(\alpha_1 \beta_0 + \beta_1) z_{t-2} + \alpha_1^2(\alpha_1 \beta_0 + \beta_1) z_{t-3} + \cdots,$$

from which it can be seen that the lag $j$ multiplier for a one unit increase in $z_t$ is $\beta_0$ for $j = 0$ and $\alpha_1^{j-1}(\alpha_1 \beta_0 + \beta_1)$ for $j > 0$. The long run multiplier is therefore

$$\beta_0 + (\alpha_1 \beta_0 + \beta_1) \sum_{j=1}^{\infty} \alpha_1^{j-1} = \beta_0 + \frac{\alpha_1 \beta_0 + \beta_1}{1 - \alpha_1} = \frac{\beta_0 + \beta_1}{1 - \alpha_1}.$$

Returning to the ARDL(1,1) model in Figure 103, the SRF can be written

$$\widehat{\mathit{growth}}_t = 4.070 - 0.269\, \mathit{growth}_{t-1} - 0.726\, \Delta \mathit{debt}_t + 0.617\, \Delta \mathit{debt}_{t-1},$$

with standard errors $(0.470)$, $(0.153)$, $(0.181)$ and $(0.176)$ respectively; $n = 40$, $R^2 = 0.383$. This gives

$$\hat\alpha_1 = -0.269, \qquad \hat\beta_0 = -0.726, \qquad \hat\beta_1 = 0.617,$$

so the estimated multipliers are:

Impact (lag 0): $\hat\beta_0 = -0.726$
Lag 1: $\hat\alpha_1 \hat\beta_0 + \hat\beta_1 = (-0.269)(-0.726) + 0.617 = 0.812$
Lag 2: $\hat\alpha_1 (\hat\alpha_1 \hat\beta_0 + \hat\beta_1) = (-0.269)(0.812) = -0.218$
Lag 3: $\hat\alpha_1^2 (\hat\alpha_1 \hat\beta_0 + \hat\beta_1) = (-0.269)(-0.218) = 0.059$

and so on.

The long run multiplier is

$$\frac{\hat\beta_0 + \hat\beta_1}{1 - \hat\alpha_1} = \frac{-0.726 + 0.617}{1 - (-0.269)} = -0.086.$$
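The recursion for the lag multipliers and their sum is simple enough to check numerically from the coefficient estimates above (only the point estimates are computed here; assessing significance requires the Wald test shown in Figure 108):

```python
# Lag multipliers for an ARDL(1,1), using the estimates reported above
a1, b0, b1 = -0.269, -0.726, 0.617

def lag_multiplier(j):
    """Effect on the conditional mean of y_{t+j} of a one unit increase in z_t."""
    return b0 if j == 0 else a1 ** (j - 1) * (a1 * b0 + b1)

# Multipliers oscillate in sign and shrink geometrically toward zero
print([round(lag_multiplier(j), 3) for j in range(4)])

# The long run multiplier equals the infinite sum of the lag multipliers
long_run = (b0 + b1) / (1 - a1)
print(round(long_run, 3))
```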

The evidence from this regression is that changes in government debt predict short run changes in economic growth, greatest in the first 2-3 years, but quickly decreasing to zero thereafter. Also, the effects tend to oscillate in sign so that they cancel out when added, producing a long run effect that is quite small and, according to the Wald test in Figure 108, statistically insignificant.

These results are about the predictions of economic growth on the basis of changes in government debt. All of the difficulties outlined above in making causal interpretations apply here. In particular, there are other variables besides government debt that may influence economic growth. Also there may be simultaneity between government debt and growth: for example, a slowdown in economic growth may increase government expenditures (unemployment benefits) and decrease tax receipts (reduced company and income taxes and GST because of reduced economic activity) and hence increase the government debt. So the results here are informative about the dynamic structure of the conditional expectations that relate economic growth and government debt, but must be treated with very great caution for causal inference.

11 Matrix Algebra

11.1 Definitions

A matrix is a rectangular array of numbers, for example

$$A = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}.$$

The dimension of a matrix is denoted $r \times c$, where $r$ is the number of rows and $c$ is the number of columns. The matrix $A$ has dimension $3 \times 2$.

The individual elements of the matrix $A$ are denoted $a_{i,j}$, where $i = 1, \ldots, r$ indexes the row and $j = 1, \ldots, c$ indexes the column. So $a_{2,1} = 2$ and $a_{3,2} = 6$.

Two matrices are defined to be equal if they have the same dimensions and their individual elements are all equal.

A square matrix has the same number of columns as rows; that is, $r = c$.

A column vector, or simply a vector, is a matrix consisting of a single column. A row vector consists of a single row. For example, if we define

$$B = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad C = \begin{pmatrix} 1 & 2 & 3 \end{pmatrix},$$

then $B$ is a (column) vector and $C$ is a row vector. Their dimensions are $2 \times 1$ and $1 \times 3$ respectively.

A scalar is a $1 \times 1$ matrix, that is, a single number.

The transpose of a matrix turns its columns into rows (equivalently, its rows into columns). The transpose of $A$ is denoted $A'$ (though some denote the transpose by $A^T$). For example,

$$A' = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}, \qquad B' = \begin{pmatrix} 1 & 2 \end{pmatrix}, \qquad C' = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}.$$

Note that $(A')' = A$ for any matrix, and that the transpose of a scalar is just the scalar (e.g. $2' = 2$). If $A$ has dimension $r \times c$ then $A'$ has dimension $c \times r$.

A square matrix $M$ is symmetric if $M = M'$. For example,

$$M = \begin{pmatrix} 1 & 4 & 6 \\ 4 & 2 & 0 \\ 6 & 0 & 3 \end{pmatrix}$$

is a symmetric matrix. If $M$ has elements $m_{i,j}$ then it is symmetric if $m_{i,j} = m_{j,i}$ for all $i$ and $j$.

The main diagonal of a square matrix consists of the elements running from the top left corner to the bottom right corner of the matrix, denoted $\mathrm{diag}(M)$. In the example,

$$\mathrm{diag}(M) = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}.$$

That is, $\mathrm{diag}(M)$ is the vector of elements $m_{i,i}$ for all $i$. A symmetric matrix is symmetric about its main diagonal, meaning those elements below the main diagonal are reflected above the main diagonal.

11.2 Addition and Subtraction

Two matrices can be added or subtracted if they have the same dimensions; that is, if they are conformable for addition. Addition and subtraction are element-wise. For example, if

$$D = \begin{pmatrix} 3 & 8 \\ 2 & 9 \\ 1 & -2 \end{pmatrix},$$

then $A$ and $D$ are conformable for addition and

$$A + D = \begin{pmatrix} 4 & 12 \\ 4 & 14 \\ 4 & 4 \end{pmatrix}, \qquad A - D = \begin{pmatrix} -2 & -4 \\ 0 & -4 \\ 2 & 8 \end{pmatrix}.$$

Sums and differences of matrices with different dimensions, such as $A + B$, are not defined.

11.3 Multiplication

If $x$ is a scalar then the product $Ax$ means each element of $A$ is multiplied by $x$. For example, if $x = 2$ then

$$Ax = \begin{pmatrix} 2 & 8 \\ 4 & 10 \\ 6 & 12 \end{pmatrix}.$$

Suppose we have two matrices $A$ and $B$ with respective dimensions $r_A \times c_A$ and $r_B \times c_B$. The matrix product $AB$ can be defined if $c_A = r_B$; that is, if the number of columns of $A$ matches the number of rows of $B$. In this case, $AB$ is a matrix of dimension $r_A \times c_B$, with individual elements of the form

$$(AB)_{i,j} = \sum_{k=1}^{c_A} a_{i,k} b_{k,j}.$$

For example, with $A$ and $B$ as defined above, we have $c_A = r_B = 2$ so that the product $AB$ is defined, and the result will have dimension $r_A \times c_B = 3 \times 1$:

$$AB = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 1 \cdot 1 + 4 \cdot 2 \\ 2 \cdot 1 + 5 \cdot 2 \\ 3 \cdot 1 + 6 \cdot 2 \end{pmatrix} = \begin{pmatrix} 9 \\ 12 \\ 15 \end{pmatrix}.$$

Unlike with scalars, matrix multiplication is not commutative; that is, $AB \neq BA$ in general. In fact, $AB$ may be defined but $BA$ not defined. The current definitions of $A$ and $B$ illustrate this: $BA$ is not defined since $B$ has one column and $A$ has three rows. Even if $AB$ and $BA$ are both defined, they may not be the same dimension, and even if they are the same dimension, $AB$ and $BA$ will generally be different. For example, if

$$E = \begin{pmatrix} 2 & 3 \end{pmatrix},$$

then

$$BE = \begin{pmatrix} 2 & 3 \\ 4 & 6 \end{pmatrix}, \qquad EB = 8.$$
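These conformability rules are exactly what numpy's `@` operator enforces, which makes it a convenient way to check the examples in this section (numpy is used here purely for illustration; the notes themselves use EViews):

```python
import numpy as np

A = np.array([[1, 4], [2, 5], [3, 6]])   # 3 x 2
B = np.array([[1], [2]])                 # 2 x 1
E = np.array([[2, 3]])                   # 1 x 2

print(A @ B)   # 3 x 1 product: [[9], [12], [15]]
print(B @ E)   # 2 x 2 product: [[2, 3], [4, 6]]
print(E @ B)   # 1 x 1 product: [[8]]

# B @ A is not defined: B has 1 column but A has 3 rows
try:
    B @ A
except ValueError as err:
    print("not conformable:", err)
```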

11.4 The PRF

The PRF for the multiple regression model is

$$E(y_i \mid x_{1,i}, \ldots, x_{k,i}) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}.$$

The right hand side of this can be compactly written in matrix form. Define the $(k+1) \times 1$ vectors

$$x_i = \begin{pmatrix} 1 \\ x_{1,i} \\ \vdots \\ x_{k,i} \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}.$$

Then

$$x_i' \beta = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i},$$

so the PRF can be written

$$E(y_i \mid x_i) = x_i' \beta.$$

This representation is very useful for theoretical and computational purposes.

11.5 Matrix Inverse

The determinant of a $2 \times 2$ matrix

$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$$

is

$$|A| = a_{11} a_{22} - a_{12} a_{21}.$$

If $|A| = 0$ then $A$ is called singular, while if $|A| \neq 0$ then $A$ is non-singular. For example,

$$\begin{vmatrix} 1 & 2 \\ 2 & 4 \end{vmatrix} = 0, \qquad \begin{vmatrix} 1 & -2 \\ 2 & 4 \end{vmatrix} = 8,$$

so the first matrix is singular and the second is non-singular. A singular matrix satisfies $Ac = 0$ for some vector $c \neq 0$, while a non-singular matrix has $Ac \neq 0$ for all $c \neq 0$. For example,

$$A = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}, \qquad c = \begin{pmatrix} 2 \\ -1 \end{pmatrix}, \qquad Ac = \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$

while if

$$A = \begin{pmatrix} 1 & -2 \\ 2 & 4 \end{pmatrix}$$

then $Ac \neq 0$ for every $c \neq 0$.

The identity matrix is a square matrix of dimension $r$, denoted $I_r$, such that

$$A I_r = I_r A = A$$

for any $r \times r$ matrix $A$. It has ones on the main diagonal and zeros elsewhere, that is,

$$I_r = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}.$$

The inverse of an $r \times r$ square matrix $A$ is an $r \times r$ matrix denoted $A^{-1}$ that satisfies

$$A^{-1} A = A A^{-1} = I_r.$$

If $A$ is $2 \times 2$ then

$$A^{-1} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}^{-1} = \frac{1}{|A|} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}.$$

For example,

$$\begin{pmatrix} 1 & -2 \\ 2 & 4 \end{pmatrix}^{-1} = \frac{1}{8} \begin{pmatrix} 4 & 2 \\ -2 & 1 \end{pmatrix} = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{4} \\ -\tfrac{1}{4} & \tfrac{1}{8} \end{pmatrix},$$

but

$$\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}$$

is singular and has no inverse.

A matrix inverse can be used to solve a system of linear equations of the form

$$Ax = b;$$

if $A$ is non-singular then pre-multiplying both sides by $A^{-1}$ gives $A^{-1} A x = A^{-1} b$, so the solution is

$$x = A^{-1} b.$$
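These calculations can be checked with numpy (used here only as an illustration). In practice the inverse is rarely formed explicitly: `np.linalg.solve` computes $x = A^{-1}b$ directly and more stably.

```python
import numpy as np

A = np.array([[1.0, -2.0], [2.0, 4.0]])
print(np.linalg.det(A))    # 8.0, so A is non-singular
print(np.linalg.inv(A))    # [[0.5, 0.25], [-0.25, 0.125]], as derived above

b = np.array([1.0, 6.0])
x = np.linalg.solve(A, b)  # solves Ax = b without forming the inverse
print(x)

S = np.array([[1.0, 2.0], [2.0, 4.0]])
try:
    np.linalg.inv(S)       # |S| = 0, so no inverse exists
except np.linalg.LinAlgError:
    print("singular matrix")
```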

11.6 The OLS Estimator

For a PRF $E(y_i \mid x_i) = x_i' \beta$, the OLS estimator $\hat\beta$ is the choice of vector $b$ that minimises the sum of squared residuals

$$SSR(b) = \sum_{i=1}^n \left( y_i - x_i' b \right)^2.$$

The minimising choice is

$$\hat\beta = \left( \sum_{i=1}^n x_i x_i' \right)^{-1} \sum_{i=1}^n x_i y_i.$$

The matrix $\sum_{i=1}^n x_i x_i'$ is a square $(k+1) \times (k+1)$ matrix. For it to be non-singular, there must be no vector $c \neq 0$ such that

$$\sum_{i=1}^n x_i x_i' c = 0.$$

Since $c' \sum_{i=1}^n x_i x_i' c = \sum_{i=1}^n (x_i' c)^2$, having $\sum_{i=1}^n x_i x_i' c = 0$ would require $x_i' c = 0$ for all $i$, which would imply a perfect linear relationship among the elements of the $x_i$ vector, i.e. perfect multicollinearity. So the condition that there is no perfect multicollinearity implies that $\sum_{i=1}^n x_i x_i'$ is non-singular and has an inverse, and hence that $\hat\beta$ can be computed.
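The formula translates directly into numpy: stacking the $x_i'$ as the rows of a matrix $X$, we have $\sum_i x_i x_i' = X'X$ and $\sum_i x_i y_i = X'y$. A sketch with simulated data (names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # rows are x_i'
beta = np.array([1.0, 2.0, -3.0])
y = X @ beta + rng.normal(size=n)

# beta_hat = (sum_i x_i x_i')^{-1} sum_i x_i y_i = (X'X)^{-1} X'y;
# solve() is used instead of an explicit inverse for numerical stability
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [1.0, 2.0, -3.0]
```

The fitted residuals satisfy $\sum_i x_i (y_i - x_i'\hat\beta) = 0$ by construction, which is the fact used in the proof below.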

11.6.1 Proof

The OLS estimator can be derived using vector calculus, or it can be shown to minimise $SSR(b)$ as follows. Write

$$\begin{aligned}
SSR(b) &= \sum_{i=1}^n \left( y_i - x_i' \hat\beta - x_i' (b - \hat\beta) \right)^2 \\
&= \sum_{i=1}^n \left( y_i - x_i' \hat\beta \right)^2 - 2 (b - \hat\beta)' \sum_{i=1}^n x_i \left( y_i - x_i' \hat\beta \right) + (b - \hat\beta)' \left( \sum_{i=1}^n x_i x_i' \right) (b - \hat\beta).
\end{aligned}$$

The second term satisfies

$$\sum_{i=1}^n x_i \left( y_i - x_i' \hat\beta \right) = \sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i x_i' \left( \sum_{i=1}^n x_i x_i' \right)^{-1} \sum_{i=1}^n x_i y_i = \sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i y_i = 0,$$

so

$$SSR(b) = \sum_{i=1}^n \left( y_i - x_i' \hat\beta \right)^2 + (b - \hat\beta)' \left( \sum_{i=1}^n x_i x_i' \right) (b - \hat\beta).$$

If $b = \hat\beta$ then $(b - \hat\beta)' \sum_{i=1}^n x_i x_i' (b - \hat\beta) = 0$. If $b \neq \hat\beta$ then writing $c = b - \hat\beta$ gives

$$(b - \hat\beta)' \sum_{i=1}^n x_i x_i' (b - \hat\beta) = c' \sum_{i=1}^n x_i x_i' c = \sum_{i=1}^n z_i^2 > 0,$$

since the $z_i = c' x_i$ are not all zero when there is no perfect multicollinearity. Thus $SSR(b) > SSR(\hat\beta)$ when $b \neq \hat\beta$. This shows that $SSR(b)$ is minimised by $b = \hat\beta$.

11.7 Unbiasedness of OLS

Suppose $(y_i, x_i)$ are i.i.d. for $i = 1, \ldots, n$ and $E(y_i \mid x_i) = x_i' \beta$. Then the independence part of i.i.d. implies that

$$E(y_i \mid x_i) = E(y_i \mid x_1, \ldots, x_n).$$

Then

$$\begin{aligned}
E\left[ \hat\beta \right] &= E\left[ \left( \sum_{i=1}^n x_i x_i' \right)^{-1} \sum_{i=1}^n x_i E(y_i \mid x_1, \ldots, x_n) \right] \\
&= E\left[ \left( \sum_{i=1}^n x_i x_i' \right)^{-1} \sum_{i=1}^n x_i E(y_i \mid x_i) \right] \\
&= E\left[ \left( \sum_{i=1}^n x_i x_i' \right)^{-1} \sum_{i=1}^n x_i x_i' \beta \right] \\
&= \beta,
\end{aligned}$$

showing the OLS estimator is unbiased. The proof is clearly far simpler when expressed in matrix notation.
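Unbiasedness can also be illustrated by a small Monte Carlo: averaging $\hat\beta$ over many simulated samples drawn from a model with $E(y_i \mid x_i) = x_i' \beta$ should recover $\beta$. This sketch uses an illustrative two-parameter design:

```python
import numpy as np

rng = np.random.default_rng(3)
beta = np.array([1.0, -2.0])
n, reps = 50, 2000

estimates = np.empty((reps, 2))
for r in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ beta + rng.normal(size=n)  # E(y_i | x_i) = x_i' beta
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(estimates.mean(axis=0))  # close to [1.0, -2.0]
```

The average of the estimates settles near the true $\beta$ as the number of replications grows, which is exactly what unbiasedness asserts about $E[\hat\beta]$.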

11.8 Time Series Regressions

Matrix notation provides a convenient way to represent time series regressions. For example, the ARDL(p,q) model

$$E(y_t \mid z_t, y_{t-1}, z_{t-1}, \ldots, y_1, z_1) = \delta + \alpha_1 y_{t-1} + \cdots + \alpha_p y_{t-p} + \beta_0 z_t + \beta_1 z_{t-1} + \cdots + \beta_q z_{t-q}$$

can be written as

$$E(y_t \mid x_t, \ldots, x_1) = x_t' \theta,$$

where

$$x_t = \begin{pmatrix} 1 \\ y_{t-1} \\ \vdots \\ y_{t-p} \\ z_t \\ \vdots \\ z_{t-q} \end{pmatrix}, \qquad \theta = \begin{pmatrix} \delta \\ \alpha_1 \\ \vdots \\ \alpha_p \\ \beta_0 \\ \vdots \\ \beta_q \end{pmatrix}.$$

If the regressors are strictly exogenous, so that $E(y_t \mid x_n, \ldots, x_1) = E(y_t \mid x_t, \ldots, x_1) = x_t' \theta$, then the same steps as in the cross-sectional case give

$$\begin{aligned}
E\left[ \hat\theta \right] &= E\left[ \left( \sum_{t=1}^n x_t x_t' \right)^{-1} \sum_{t=1}^n x_t E(y_t \mid x_n, \ldots, x_1) \right] \\
&= E\left[ \left( \sum_{t=1}^n x_t x_t' \right)^{-1} \sum_{t=1}^n x_t E(y_t \mid x_t, \ldots, x_1) \right] \\
&= E\left[ \left( \sum_{t=1}^n x_t x_t' \right)^{-1} \sum_{t=1}^n x_t x_t' \theta \right] \\
&= \theta,
\end{aligned}$$

so the OLS estimator is unbiased in this case as well.
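Building the $x_t$ vectors amounts to stacking lagged copies of the series. A sketch in numpy (the function name and the simulated series are illustrative):

```python
import numpy as np

def ardl_design(y, z, p, q):
    """Rows x_t' = [1, y_{t-1}, ..., y_{t-p}, z_t, z_{t-1}, ..., z_{t-q}]."""
    m = max(p, q)  # observations lost to lagging
    rows = []
    for t in range(m, len(y)):
        rows.append(
            [1.0]
            + [y[t - j] for j in range(1, p + 1)]
            + [z[t - j] for j in range(0, q + 1)]
        )
    return np.array(rows), np.asarray(y[m:])

rng = np.random.default_rng(4)
y = rng.normal(size=100)
z = rng.normal(size=100)

X, y_t = ardl_design(y, z, p=1, q=1)             # ARDL(1,1) regressors
theta_hat = np.linalg.solve(X.T @ X, X.T @ y_t)  # OLS as in section 11.6
print(X.shape)  # (99, 4): intercept, y_{t-1}, z_t, z_{t-1}
```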
