For example, consider the following data on advertisement expenditure (X) and sales revenue (Y).
  X      Y      X²      XY       Y²          (X−X̄)   (Y−Ȳ)   (X−X̄)²    (Y−Ȳ)²     dx·dy     Ŷ      e
  0    3000      0        0     9000000     −42.5    −650    1806.25    422500     27625    3826   −826
 20    3600    400    72000    12960000     −22.5     −50     506.25      2500      1125    3743   −143
 50    5500   2500   275000    30250000       7.5    1850      56.25   3422500     13875    3619   1881
100    2500  10000   250000     6250000      57.5   −1150    3306.25   1322500    −66125    3412   −912
Σ 170 14600  12900   597000    58460000        0       0       5675    5170000    −23500
Means: X̄ = 42.5, Ȳ = 3650
To solve for α̂ and β̂ we need to compute ΣXᵢ, ΣXᵢ², ΣXᵢYᵢ and ΣYᵢ.
β̂ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)² = (−23500)/(5675) = −4.141 and
α̂ = Ȳ − β̂X̄ = 3650 + 4.141(42.5) ≈ 3826
The residuals (estimated errors) are then given by
eᵢ = Yᵢ − α̂ − β̂Xᵢ = Yᵢ − 3826 + 4.141Xᵢ
which can be calculated for each observation in our sample, and are presented in the last column in the first
table above.
The residuals tell us how far off we would be if we tried to predict the value of Y on the basis of the estimated regression equation and the values of X, within the range of the sample. We do not get residuals for X outside this range simply because we do not have the corresponding values of Y. By virtue of the first condition we put in the normal equations, Σeᵢ = 0. The sum of squares of these residuals, Σeᵢ², is the quantity minimised by the method of least squares, described next.
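These hand calculations can be verified with a short script. The notes' examples use Stata, but the same arithmetic in Python (an illustrative sketch, not part of the original material) is:

```python
# OLS by hand for the advertising (X) / sales (Y) example in the table above.
X = [0, 20, 50, 100]
Y = [3000, 3600, 5500, 2500]
n = len(X)
xbar = sum(X) / n                      # 42.5
ybar = sum(Y) / n                      # 3650
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))   # -23500
Sxx = sum((x - xbar) ** 2 for x in X)                      # 5675
beta = Sxy / Sxx                       # slope, about -4.141
alpha = ybar - beta * xbar             # intercept, about 3826
e = [y - alpha - beta * x for x, y in zip(X, Y)]           # residuals
print(round(beta, 3), round(alpha), round(sum(e), 6))
```

The residuals sum to zero (up to floating-point error), exactly as the first normal equation requires.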
2.2b. Method of least squares
The method of least squares, often termed Ordinary Least Squares (OLS), requires us to choose α̂ and β̂ such that the sum of squared errors is minimised. Thus, given the population regression function
Yᵢ = α + βXᵢ + εᵢ,  i = 1, 2, …, n    (1)
we substitute α̂, β̂ and eᵢ for α, β and εᵢ to get
Yᵢ = α̂ + β̂Xᵢ + eᵢ,  i = 1, 2, …, n    (2)
where the eᵢ are called the residuals and the equation is the sample regression equation.
From 2, we can write
eᵢ = Yᵢ − α̂ − β̂Xᵢ
Square the eᵢ and sum over the observations to get
Σeᵢ² = Σ(Yᵢ − α̂ − β̂Xᵢ)²    (3)
Σeᵢ² is called the residual sum of squares.
The intuitive idea behind the procedure of least squares is given by looking at the following figure.
[Figure: scatter plot of Y against X with the fitted regression line ("Fitted values") superimposed.]
We want the regression line to pass through points in such a way that it is ‘as close as possible’ to the
points of the data. Closeness could mean different things. The minimisation procedure of the OLS implies
that we minimise the sum of squares of the vertical distances of the points from the line.
Minimisation of equation 3,
min_{α̂,β̂} Σeᵢ² = min_{α̂,β̂} Σ(Yᵢ − α̂ − β̂Xᵢ)²,
requires that we differentiate it with respect to α̂ and β̂ and equate the derivatives to zero. The equations so obtained are known as the first order conditions for minimisation. This procedure yields
∂Σeᵢ²/∂α̂ = 0 → −2Σ(Yᵢ − α̂ − β̂Xᵢ) = 0
→ Σeᵢ = 0
→ ΣYᵢ = nα̂ + β̂ΣXᵢ    (3a)
Dividing by n: Ȳ = α̂ + β̂X̄ → α̂ = Ȳ − β̂X̄, as before.
Similarly,
∂Σeᵢ²/∂β̂ = 0 → −2ΣXᵢ(Yᵢ − α̂ − β̂Xᵢ) = 0
→ ΣXᵢeᵢ = 0
→ ΣXᵢYᵢ = α̂ΣXᵢ + β̂ΣXᵢ²    (3b)
Equations 3a and 3b are known as the normal equations. Substituting α̂ = Ȳ − β̂X̄ into 3b and solving for β̂ gives
ΣXᵢYᵢ = (Ȳ − β̂X̄)ΣXᵢ + β̂ΣXᵢ²
ΣXᵢYᵢ = ȲΣXᵢ − β̂X̄ΣXᵢ + β̂ΣXᵢ² →
ΣXᵢYᵢ − nX̄Ȳ = β̂(ΣXᵢ² − nX̄²) →
β̂ = (ΣXᵢYᵢ − nX̄Ȳ)/(ΣXᵢ² − nX̄²) = (nΣXᵢYᵢ − ΣXᵢΣYᵢ)/(nΣXᵢ² − (ΣXᵢ)²)    (4)
Equation 4 can be simplified as follows.
1st we take the numerator to get
nΣXᵢYᵢ − ΣXᵢΣYᵢ = n(ΣXᵢYᵢ − nX̄Ȳ) = nΣ(Xᵢ − X̄)(Yᵢ − Ȳ)
since
Σ(Xᵢ − X̄)(Yᵢ − Ȳ) = Σ(XᵢYᵢ − X̄Yᵢ − XᵢȲ + X̄Ȳ)
= ΣXᵢYᵢ − X̄ΣYᵢ − ȲΣXᵢ + nX̄Ȳ
= ΣXᵢYᵢ − nX̄Ȳ − nX̄Ȳ + nX̄Ȳ
= ΣXᵢYᵢ − nX̄Ȳ
2nd we similarly take the denominator to get
nΣXᵢ² − (ΣXᵢ)² = n(ΣXᵢ² − nX̄²) = nΣ(Xᵢ − X̄)²
since
ΣXᵢ² − nX̄² = Σ(Xᵢ − X̄)²
Therefore, equation 4 reduces to
β̂ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)² = Σxᵢyᵢ / Σxᵢ² = Cov(X, Y)/Var(X)
where xᵢ = Xᵢ − X̄ and yᵢ = Yᵢ − Ȳ denote deviations from the means, and
α̂ = Ȳ − β̂X̄
For the wheat yield data analysed below, these formulas give
Ŷᵢ = 1075.8 + 4.03Xᵢ, or Yᵢ = 1075.8 + 4.03Xᵢ + eᵢ
i.e., α̂ = 1075.8 and β̂ = 4.03.
2.3 Residuals and goodness of fit
Given the residuals
eᵢ = Yᵢ − α̂ − β̂Xᵢ
we have
Σeᵢ = 0 → ē = 0    (6a)
ΣXᵢeᵢ = 0    (6b)
In this formulation
Ŷᵢ = α̂ + β̂Xᵢ
is the estimated (fitted) value of Yᵢ. Equations 6a and 6b imply that
1. the mean of the residuals is zero, and
2. the residuals and the explanatory variable are uncorrelated.
Given this, it follows that
ΣŶᵢeᵢ = 0
that is, the residuals and the estimated values of Y are uncorrelated.
Proof:
ΣŶᵢeᵢ = Σ(α̂ + β̂Xᵢ)eᵢ = α̂Σeᵢ + β̂ΣXᵢeᵢ = 0
Recall
Yᵢ = Ŷᵢ + eᵢ    (7)
Observed value = predicted value + residual.
This holds for all observations.
Sum equation 7 over i, the sampled observations, to get
ΣYᵢ = ΣŶᵢ + Σeᵢ
Thus
ΣYᵢ = ΣŶᵢ, because Σeᵢ = 0
Now given equation 7, we subtract Ȳ from its left-hand side and right-hand side to get
Yᵢ − Ȳ = (Ŷᵢ − Ȳ) + eᵢ
yᵢ = ŷᵢ + eᵢ    (8)
where yᵢ = Yᵢ − Ȳ and ŷᵢ = Ŷᵢ − Ȳ. Squaring both sides and then summing over i we get
yᵢ² = (ŷᵢ + eᵢ)²
Σyᵢ² = Σ(ŷᵢ + eᵢ)²
Σyᵢ² = Σŷᵢ² + Σeᵢ² + 2Σŷᵢeᵢ
Now, Σŷᵢeᵢ = 0, which implies that
Σyᵢ² = Σŷᵢ² + Σeᵢ²    (9)
Note that the left-hand side of equation 9 is the Total Sum of Squares (TSS) of the dependent variable. The first component of the right-hand side is the Explained Sum of Squares (ESS) and the second element is the Residual Sum of Squares (RSS). Thus
Σyᵢ² = Σŷᵢ² + Σeᵢ²
TSS = ESS + RSS
The ratio of ESS to TSS is called the coefficient of determination, denoted R², i.e.,
R² = ESS/TSS = Σŷᵢ²/Σyᵢ²; note that by definition 0 ≤ R² ≤ 1.
We know that
β̂ = Σxᵢyᵢ/Σxᵢ², and
ŷᵢ = Ŷᵢ − Ȳ
= α̂ + β̂Xᵢ − Ȳ
but we know that
α̂ = Ȳ − β̂X̄
Therefore
ŷᵢ = Ȳ − β̂X̄ + β̂Xᵢ − Ȳ
= β̂(Xᵢ − X̄)
= β̂xᵢ
Therefore
Σŷᵢ² = Σβ̂²xᵢ² = β̂²Σxᵢ²
Equivalently, since yᵢ = ŷᵢ + eᵢ,
Σŷᵢ² = β̂Σxᵢŷᵢ
= β̂Σxᵢ(yᵢ − eᵢ) = β̂Σxᵢyᵢ − β̂Σxᵢeᵢ (= 0)
= β̂Σxᵢyᵢ    (10)
or equivalently, since
R² = Σŷᵢ²/Σyᵢ²
it follows that
R² = β̂²Σxᵢ²/Σyᵢ² = (Σxᵢyᵢ/Σxᵢ²)²(Σxᵢ²/Σyᵢ²)
= (Σxᵢyᵢ)² / (Σxᵢ² Σyᵢ²)    (11)
Since β̂ = Σxᵢyᵢ/Σxᵢ², it also follows that
R² = β̂Σxᵢyᵢ/Σyᵢ²    (12)
Denote the correlation coefficient between Y and Ŷ, i.e., the observed and predicted values of Y, by r; thus
r = Σyᵢŷᵢ / √(Σyᵢ² Σŷᵢ²)
Proposition:
R² = r²
Proof:
Start with the fact that
yᵢ = ŷᵢ + eᵢ
Multiply this equation throughout by ŷᵢ and sum over i to get
Σyᵢŷᵢ = Σŷᵢ² + Σŷᵢeᵢ (= 0)
= Σŷᵢ²    (13)
Now
r = Σyᵢŷᵢ / √(Σyᵢ² Σŷᵢ²)
= Σŷᵢ² / √(Σyᵢ² Σŷᵢ²)
= √(Σŷᵢ²) / √(Σyᵢ²)
= √(Σŷᵢ²/Σyᵢ²)
= √R²
Therefore
r² = R²
Moreover, R² also equals the square of the correlation coefficient between X and Y. The proof goes as follows:
r_XY = Σxᵢyᵢ / √(Σxᵢ² Σyᵢ²)
so that
r²_XY = (Σxᵢyᵢ)² / (Σxᵢ² Σyᵢ²)
= R²
by equation 11.
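The decomposition TSS = ESS + RSS and the identity R² = r²_XY can be confirmed numerically; a Python sketch using the advertising data from the earlier example (illustrative only):

```python
# Verify TSS = ESS + RSS and R^2 = (corr(X, Y))^2 on the advertising data.
X = [0, 20, 50, 100]
Y = [3000, 3600, 5500, 2500]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
Sxx = sum((x - xbar) ** 2 for x in X)
Syy = sum((y - ybar) ** 2 for y in Y)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
beta = Sxy / Sxx
alpha = ybar - beta * xbar
yhat = [alpha + beta * x for x in X]
ESS = sum((yh - ybar) ** 2 for yh in yhat)          # explained sum of squares
RSS = sum((y - yh) ** 2 for y, yh in zip(Y, yhat))  # residual sum of squares
TSS = Syy                                           # total sum of squares
R2 = ESS / TSS
r = Sxy / (Sxx * Syy) ** 0.5                        # correlation between X and Y
print(R2, r * r)
```

Both quantities agree, and the decomposition holds to floating-point precision.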
Example: Consider the yield data
use "C:\Users\6440\Documents\DocA\Econometrics\Econ352\Kefyalew\Data_wheat_Yield.dta", clear
keep if yield>0
(6 observations deleted)
drop if yield ==.
(8 observations deleted)
twoway (scatter yield fertha) (lfit yield fertha)
[Figure: scatter of yield against fertha with the fitted regression line.]
Ȳ = ΣYᵢ/n = 1681.074;  X̄ = ΣXᵢ/n = 150.2757
use "C:\Users\6440\Documents\DocA\Econometrics\Econ352\Kefyalew\Data_wheat_Yield.dta", clear
keep if yield>0
drop if yield ==.
gen ymybar = yield - 1681.0734
gen xmxbar = fertha-150.27568
gen ymbtxmb = ymybar*xmxbar
gen xmxbsq = xmxbar*xmxbar
gen ymybsq = ymybar*ymybar
Σyᵢ² = Σ(Yᵢ − Ȳ)² = 1.03×10⁹
Σxᵢ² = Σ(Xᵢ − X̄)² = 5291075
Σxᵢyᵢ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) = 2.13×10⁷
Thus, we have
β̂ = Σxᵢyᵢ/Σxᵢ² = 2.13×10⁷/5291075 = 4.02565
Using the unrounded sums, β̂ = 4.027833.
Since the slope is about 4, the equation implies that yield will increase by this amount if fertilizer use increases by 1 unit. We can go further and estimate
r = Σxᵢyᵢ / √(Σxᵢ² Σyᵢ²) = 2.13×10⁷ / √((5291075)(1.03×10⁹)) = 0.28852871
. display 2.13e+07/(5291075*1.03e+09)^0.5
.28852871
Therefore
. display .28852871*.28852871
.08324882
r² = .08324882 ≈ R²
Or, equivalently,
R² = β̂Σxᵢyᵢ/Σyᵢ² = (4.027833)(2.13×10⁷)/(1.03×10⁹) = .08329402
. display 4.027833 * 2.13e+07/ 1.03e+09
.08329402
(The small difference between the two values is due to rounding of the sums.)
2.4 Properties of LS estimates and Gauss-Markov theorem
Given our regression model
Yᵢ = α + βXᵢ + εᵢ;  i = 1, 2, …, n    (1)
The classical assumptions we put forward earlier can be divided into two parts: those that are made on Xᵢ and those made on εᵢ.
a) Assumptions imposed on X
a1) The values of X: X₁, X₂, …, Xₙ are fixed in advance (i.e., they are non-stochastic).
a2) Not all Xᵢ are equal.
b) Assumptions on the error term
b1) E(εᵢ|Xᵢ) = 0 ∀i
b2) Var(εᵢ|Xᵢ) = σ² ∀i (homoskedasticity)
b3) Cov(εᵢ, εⱼ) = 0 ∀i ≠ j (assumption of no autocorrelation)
Note: so far we have not introduced normality.
Given these assumptions, we propose that the Least Squares Estimators are Best Linear
Unbiased Estimators (BLUE)
Recall: an estimator of β, say β̂, is BLUE if it is:
1. a linear function of the random variable Y,
2. unbiased, and
3. among all the linear unbiased estimators it has minimum variance.
This result is known as the Gauss-Markov theorem, which is stated as follows:
Gauss-Markov theorem: Given the assumptions of the classical linear regression model, the least-squares
estimators, in the class of unbiased linear estimators, have minimum variance, i.e., they are BLUE.
Proof: we shall provide a proof for β̂; try to prove this for α̂ as an exercise.
1. β̂ is a linear estimator of β
We know that
β̂ = Σxᵢyᵢ/Σxᵢ² = Σxᵢ(Yᵢ − Ȳ)/Σxᵢ² = ΣxᵢYᵢ/Σxᵢ² − ȲΣxᵢ/Σxᵢ² = ΣxᵢYᵢ/Σxᵢ²    (2a)
(since Σxᵢ = 0), and similarly
α̂ = Ȳ − β̂X̄ = ΣYᵢ/n − X̄ΣxᵢYᵢ/Σxᵢ² = Σ(1/n − X̄xᵢ/Σxᵢ²)Yᵢ    (2b)
Take equation 2a and let
wᵢ = xᵢ/Σxᵢ²
then
β̂ = ΣwᵢYᵢ    (3)
This shows that β̂ is linear in Yᵢ; it is in fact a weighted average of the Yᵢ, with the wᵢ serving as weights. Note the following properties of wᵢ:
a) since the X variable is assumed non-stochastic, the wᵢ are also fixed in advance and are not random;
b) Σwᵢ = 0
c) Σwᵢ² = 1/Σxᵢ²
d) Σwᵢxᵢ = ΣwᵢXᵢ = 1
Assignment: show that properties b, c and d are true.
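A quick numerical check of properties b, c and d (not a proof; the assignment still asks for the algebra), in Python with an arbitrary illustrative set of X values:

```python
# Check the properties of the OLS weights w_i = x_i / sum(x_i^2).
X = [0, 20, 50, 100]          # any fixed design points will do
n = len(X)
xbar = sum(X) / n
Sxx = sum((x - xbar) ** 2 for x in X)
w = [(x - xbar) / Sxx for x in X]
sum_w = sum(w)                                    # property b: 0
sum_w2 = sum(wi * wi for wi in w)                 # property c: 1 / Sxx
sum_wX = sum(wi * x for wi, x in zip(w, X))       # property d: 1
print(sum_w, sum_w2 * Sxx, sum_wX)
```

All three properties hold to floating-point precision for any choice of non-identical X values.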
2. β̂ is an unbiased estimator of β
If we substitute equation 1 into equation 3 we get
β̂ = ΣwᵢYᵢ
= Σwᵢ(α + βXᵢ + εᵢ)
= αΣwᵢ + βΣwᵢXᵢ + Σwᵢεᵢ
= β + Σwᵢεᵢ
since b and d above are true. If we now take the expectation of this result, we get
E[β̂] = E[β + Σwᵢεᵢ]
= E[β] + E[Σwᵢεᵢ]
= β + ΣwᵢE[εᵢ] = β
since E[εᵢ] = 0. Therefore, β̂ is an unbiased estimator of β.
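Unbiasedness can also be illustrated by simulation: with X fixed (assumption a1) and fresh errors drawn each replication, the average of β̂ across replications should be close to the true β. A Python sketch, where the design points and parameter values are arbitrary choices for illustration:

```python
import random

random.seed(0)
X = [0, 20, 50, 100, 150, 200]     # fixed regressors (assumption a1)
beta_true, alpha_true, sigma = 2.0, 5.0, 3.0
n = len(X)
xbar = sum(X) / n
Sxx = sum((x - xbar) ** 2 for x in X)
reps, betas = 20000, []
for _ in range(reps):
    # fresh normal errors each replication (assumptions b1-b3)
    Y = [alpha_true + beta_true * x + random.gauss(0, sigma) for x in X]
    ybar = sum(Y) / n
    Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
    betas.append(Sxy / Sxx)
mean_beta = sum(betas) / reps
print(mean_beta)                   # close to beta_true = 2.0
```

Individual β̂ draws scatter around 2.0, but their average converges to it, which is exactly what E[β̂] = β says.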
3. Among the set of linear unbiased estimators of the parameters in a regression model, the least squares estimators have minimum variance. To show this we need to derive the variances and covariances of the least squares estimators. We do this for the slope parameter, β̂. At the same time, we shall also need the least squares estimator of σ².
a) By the definition of variance, we have the variance of β̂ as
Var(β̂) = E[(β̂ − E[β̂])²]
= E[(β̂ − β)²]
= E[(Σwᵢεᵢ)²]
= E[(w₁ε₁ + w₂ε₂ + ⋯ + wₙεₙ)²]
= E[w₁²ε₁² + ⋯ + wₙ²εₙ² + 2w₁w₂ε₁ε₂ + ⋯ + 2wₙ₋₁wₙεₙ₋₁εₙ]
= w₁²E[ε₁²] + ⋯ + wₙ²E[εₙ²] + 2w₁w₂E[ε₁ε₂] + ⋯ + 2wₙ₋₁wₙE[εₙ₋₁εₙ]
= w₁²E[ε₁²] + ⋯ + wₙ²E[εₙ²] (the cross terms vanish by assumption b3)
= w₁²σ² + ⋯ + wₙ²σ²
= σ²Σwᵢ²
= σ²/Σxᵢ²    (4)
b) By the definition of covariance we have
Cov(α̂, β̂) = E[(α̂ − E[α̂])(β̂ − E[β̂])] = E[(α̂ − α)(β̂ − β)]    (5)
We know that
α̂ = Ȳ − β̂X̄ and Ȳ = α + βX̄ + ε̄
so that
α̂ − α = ε̄ − (β̂ − β)X̄
Using this result and the ones in a) we can write equation 5 as follows:
Cov(α̂, β̂) = E[(ε̄ − (β̂ − β)X̄)(β̂ − β)]
= E[ε̄(β̂ − β)] − X̄E[(β̂ − β)²]
= 0 − X̄Var(β̂) = −X̄σ²/Σxᵢ²    (6)
(The first term is zero because E[ε̄Σwᵢεᵢ] = (σ²/n)Σwᵢ = 0.)
c) To derive the least squares estimator of σ², which is unbiased, we proceed as follows.
Given the population regression equation, equation 1,
Yᵢ = α + βXᵢ + εᵢ,  i = 1, 2, …, n    (1)
averaging over the sample gives
Ȳ = α + βX̄ + ε̄    (7)
Subtracting equation 7 from 1, we get
Yᵢ − Ȳ = β(Xᵢ − X̄) + (εᵢ − ε̄),  i = 1, 2, …, n
yᵢ = βxᵢ + (εᵢ − ε̄),  i = 1, 2, …, n    (a)
Moreover, from the sample regression we have
Yᵢ = α̂ + β̂Xᵢ + eᵢ,  i = 1, 2, …, n
Then Ȳ = α̂ + β̂X̄ (since Σeᵢ = 0), implying Yᵢ − Ȳ = β̂(Xᵢ − X̄) + eᵢ, i.e.,
yᵢ = β̂xᵢ + eᵢ,  i = 1, 2, …, n    (b)
Subtract b from a to get
eᵢ = (εᵢ − ε̄) − (β̂ − β)xᵢ,  i = 1, 2, …, n    (8)
Squaring equation 8, one gets
eᵢ² = (εᵢ − ε̄)² + (β̂ − β)²xᵢ² − 2(β̂ − β)xᵢ(εᵢ − ε̄)
Summing this over the sample and taking expectations we get
E[Σeᵢ²] = E[Σ(εᵢ − ε̄)²] + E[(β̂ − β)²]Σxᵢ² − 2E[(β̂ − β)Σxᵢ(εᵢ − ε̄)]    (9)
Equation 9 has three component parts which can be reduced as follows.
a) The 1st element in the equation could be written as follows:
E[Σ(εᵢ − ε̄)²] = E[Σεᵢ² + nε̄² − 2ε̄Σεᵢ]
= E[Σεᵢ² − nε̄²] (since Σεᵢ = nε̄)
= ΣE[εᵢ²] − nE[ε̄²]
= nσ² − n(σ²/n)
E[Σ(εᵢ − ε̄)²] = (n − 1)σ²
b) Using the variance result derived earlier,
E[(β̂ − β)²]Σxᵢ² = (σ²/Σxᵢ²)Σxᵢ² = σ²
c) Since β̂ − β = Σwᵢεᵢ = Σxᵢεᵢ/Σxᵢ² and Σxᵢ(εᵢ − ε̄) = Σxᵢεᵢ,
−2E[(β̂ − β)Σxᵢ(εᵢ − ε̄)] = −2E[(Σxᵢεᵢ)²]/Σxᵢ² = −2σ²Σxᵢ²/Σxᵢ² = −2σ²
Collecting the results obtained in a), b) and c) above, we get
E[Σeᵢ²] = (n − 1)σ² + σ² − 2σ² = (n − 2)σ²
It easily follows that if we set
σ̂² = Σeᵢ²/(n − 2)    (11)
we have an unbiased estimator of σ².
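The result E[Σeᵢ²] = (n − 2)σ² can likewise be illustrated by simulation: the average of Σeᵢ²/(n − 2) over many replications should approach σ². A Python sketch with arbitrary illustrative parameters:

```python
import random

random.seed(1)
X = list(range(10))                # n = 10 fixed design points
n = len(X)
alpha_t, beta_t, sigma = 1.0, 0.5, 2.0
xbar = sum(X) / n
Sxx = sum((x - xbar) ** 2 for x in X)
reps, rss_sum = 20000, 0.0
for _ in range(reps):
    Y = [alpha_t + beta_t * x + random.gauss(0, sigma) for x in X]
    ybar = sum(Y) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / Sxx
    a = ybar - b * xbar
    rss_sum += sum((y - a - b * x) ** 2 for x, y in zip(X, Y))
avg = rss_sum / reps / (n - 2)     # average of sigma-hat^2 across replications
print(avg)                         # close to sigma**2 = 4.0
```

Dividing the residual sum of squares by n (rather than n − 2) would bias the estimate downward, which is exactly why the two estimated parameters cost two degrees of freedom.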
To show that the LS estimator of β is BLUE, in addition to the conditions of linearity and unbiasedness we need to show that it has minimum variance. To show this, we proceed as follows: we already know that
β̂ = ΣwᵢYᵢ
where wᵢ = xᵢ/Σxᵢ².
Now define an alternative linear estimator of β, say
β* = ΣcᵢYᵢ
This makes β* linear, in that it is a linear function of Yᵢ. For this estimator to be unbiased its expected value must be equal to β, i.e.,
E[β*] = ΣcᵢE[Yᵢ] = ΣcᵢE[α + βXᵢ + εᵢ] = Σcᵢ(α + βXᵢ) = αΣcᵢ + βΣcᵢXᵢ
Therefore, for β* to be unbiased,
αΣcᵢ + βΣcᵢXᵢ = β
For this to be true,
Σcᵢ = 0 and ΣcᵢXᵢ = 1
Note also that the variance of β* is
Var(β*) = Var(ΣcᵢYᵢ)
= Σcᵢ²Var(Yᵢ) = σ²Σcᵢ²
We now compare the variances of β̂ and β*. To do so, let
dᵢ = cᵢ − wᵢ; note that Σdᵢ = Σcᵢ − Σwᵢ = 0
cᵢ = wᵢ + dᵢ, implying that cᵢ² = (wᵢ + dᵢ)²
Therefore
Σcᵢ² = Σwᵢ² + Σdᵢ² + 2Σwᵢdᵢ
Σcᵢ² = Σwᵢ² + Σdᵢ²
because
Σwᵢdᵢ = 0,
as
Σwᵢdᵢ = Σwᵢ(cᵢ − wᵢ) = Σwᵢcᵢ − Σwᵢ², and
Σwᵢcᵢ = Σxᵢcᵢ/Σxᵢ² = (ΣcᵢXᵢ − X̄Σcᵢ)/Σxᵢ² = 1/Σxᵢ² = Σwᵢ²
Therefore
σ²Σcᵢ² = σ²Σwᵢ² + σ²Σdᵢ²
so that
Var(β*) = Var(β̂) + σ²Σdᵢ²
Therefore
Var(β*) ≥ Var(β̂), since σ²Σdᵢ² ≥ 0
This establishes that the OLS estimator, β̂, is the Best Linear Unbiased Estimator (BLUE) of β.
2.5. Confidence intervals and hypothesis testing
In this section, we shall discuss issues of interval estimation and hypothesis testing—what is known as
statistical inference in the statistics literature. For this and related aspects we need:
1. the variances of the OLS estimators,
2. the covariance between the OLS estimators,
3. the unbiased estimator of σ², all of which were derived earlier, and
4. the implications of the normality assumption on the error terms. This assumption is particularly crucial for inference: without it we cannot do any statistical testing on the parameters, nor can we construct interval estimates.
The variances and covariances of the OLS estimators
In our earlier discussions we showed that the variance of β̂ is
Var(β̂) = σ²/Σxᵢ²    (1a)
and you must have obtained the variance of α̂ to be
Var(α̂) = σ²(1/n + X̄²/Σxᵢ²)    (1b)
The covariance between α̂ and β̂ is given as
Cov(α̂, β̂) = −X̄σ²/Σxᵢ²    (1c)
The least squares estimator of σ², which is unbiased, is given by
σ̂² = Σeᵢ²/(n − 2)
For our example raised earlier we obtained
Yield = 1075.788 + 4.027833 fertha + e,  R² = 0.0835
Recall that
Σyᵢ² = Σŷᵢ² + Σeᵢ²
Σeᵢ² = Σyᵢ² − Σŷᵢ²
Σeᵢ² = Σyᵢ² − β̂Σxᵢyᵢ
= 1028163547.315 − (4.027833)(21311566.26134)
= 942324102.28
Now
σ̂² = Σeᵢ²/(n − 2) = 942324102.28/(516 − 2) = 1833315.4
Implications of the normality assumption on the distribution of the parameters of interest
Assume now that the errors are normally distributed: εᵢ ~ N(0, σ²), i = 1, 2, …, n. Comparing the population and sample regression equations gives εᵢ = eᵢ + (α̂ − α) + (β̂ − β)Xᵢ, so that
Σεᵢ² = Σeᵢ² + n(α̂ − α)² + (β̂ − β)²ΣXᵢ² + a cross term in (α̂ − α)(β̂ − β)
(the terms involving eᵢ drop out because Σeᵢ = 0 and ΣXᵢeᵢ = 0). Dividing the whole equation by σ², the left-hand side is a χ² variable with n degrees of freedom, and the quadratic terms in (α̂ − α) and (β̂ − β) absorb two of them. It follows that
Σeᵢ²/σ² = (n − 2)σ̂²/σ² ~ χ²ₙ₋₂
and hence that (α̂ − α)/se(α̂) and (β̂ − β)/se(β̂) each follow a t distribution with n − 2 degrees of freedom.
This result is used for both estimating confidence intervals and hypothesis testing. Notice the switch from the variance of β̂, which involves the unknown σ², to the estimator of that variance, which uses σ̂².
We shall use the data in our previous example to calculate the variances and standard errors of the estimators:
Yield = 1075.788 + 4.027833 fertha + e,  R² = 0.0835
The standard errors are obtained by
1. calculating the variances of α̂ and β̂,
2. substituting σ̂² for σ², and
3. taking the square root of the resulting expression.
Now
Var(β̂) = σ̂²/Σxᵢ² = σ̂²/5291074.3
Var(α̂) = σ̂²(1/n + X̄²/Σxᵢ²) = σ̂²(1/516 + (150.2757)²/5291074.3) = σ̂²(0.00620605834363549)
Recall
σ̂² = Σeᵢ²/(n − 2) = 942324102.28/514 = 1833315.4
Therefore
Var(α̂) = 1833315.4(0.00620606) = 11377.684
se(α̂) = √11377.684 = 106.66623
Var(β̂) = 1833315.4/5291074.3 = 0.34649209
se(β̂) = √0.34649209 = 0.5886
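The same standard errors can be reproduced from the summary statistics reported in the text; a Python cross-check of formulas 1a and 1b (illustrative only):

```python
# Standard errors of the OLS estimators from the reported summary statistics.
n = 516
xbar = 150.2757          # mean of fertha
Sxx = 5291074.3          # sum of squared deviations of fertha
sigma2_hat = 1833315.4   # RSS / (n - 2)
var_b = sigma2_hat / Sxx                            # formula 1a
var_a = sigma2_hat * (1 / n + xbar ** 2 / Sxx)      # formula 1b
se_b, se_a = var_b ** 0.5, var_a ** 0.5
print(se_a, se_b)        # about 106.67 and 0.5886
```

These match the hand calculations above up to rounding of the inputs.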
Usually, the complete result of the regression is written as follows:
Yield = 1075.788 + 4.0278833 fertha,  R² = 0.0835
        (106.66623)  (0.5886)
with standard errors in parentheses.
It is usually prefered to put the results on a table when we have a large number of explanatory variables;
given in stata as follows
Source SS df MS Number of obs = 516
F(1, 514) = 46.82
Model 85839433.6 1 85839433.6 Prob > F = 0
Residual 942324116 514 1833315.4 R-squared = 0.0835
Adj R-squared = 0.0817
Total 1.03E+09 515 1996434.08 Root MSE = 1354
We can then easily obtain the confidence intervals for α̂ and β̂ by using the t distribution with n − 2 degrees of freedom.
For instance, in our example we know that (α̂ − α)/se(α̂) has a t distribution with n − 2 degrees of freedom; therefore
Pr[−t < (α̂ − α)/se(α̂) < t] = 1 − γ
where γ is the significance level. If we let our significance level be 0.05, we read the following critical value for 516 − 2 (= 514) degrees of freedom from the statistical tables:
Pr[−1.96 < (α̂ − α)/se(α̂) < 1.96] = 0.95
Therefore
Pr[−α̂ − 1.96 se(α̂) < −α < −α̂ + 1.96 se(α̂)] = 0.95
Pr[α̂ + 1.96 se(α̂) > α > α̂ − 1.96 se(α̂)] = 0.95
Pr[1075.788 + 1.96(106.66623) > α > 1075.788 − 1.96(106.66623)] = 0.95
Pr[1285 > α > 866] = 0.95
Thus, α's 95% confidence interval is (866, 1285).
Similarly, since (β̂ − β)/se(β̂) has a t distribution with n − 2 degrees of freedom, it follows that
Pr[−t < (β̂ − β)/se(β̂) < t] = 1 − γ
where γ is the significance level. With a significance level of 0.05 and 516 − 2 (= 514) degrees of freedom, we have
Pr[−1.96 < (β̂ − β)/se(β̂) < 1.96] = 0.95
Therefore
Pr[−β̂ − 1.96 se(β̂) < −β < −β̂ + 1.96 se(β̂)] = 0.95
Pr[β̂ + 1.96 se(β̂) > β > β̂ − 1.96 se(β̂)] = 0.95
Pr[4.027833 + 1.96(0.5886358) > β > 4.027833 − 1.96(0.5886358)] = 0.95
Pr[5.18 > β > 2.87] = 0.95
Thus, β's 95% confidence interval is (2.87, 5.18).
The confidence interval can also be used to test whether the parameter of interest is statistically different from zero. If zero lies within the interval, then the parameter is not statistically significantly different from zero; if zero lies outside the interval, then the parameter is statistically significantly different from zero.
Hypothesis testing
The most common hypotheses tested regarding the parameters in the simple linear regression model are whether the parameters of interest are different from zero or not. Such hypotheses can be tested using the following null and alternative hypotheses:
H₀: β = 0 against H₁: β ≠ 0, and
H₀: α = 0 against H₁: α ≠ 0.
Interpretation: the 1st set of hypotheses tests whether the slope parameter is different from zero. If the data support the null, whereby we say we accept it, then the alternative is rejected. This implies that the relationship we formulated is not supported by the data; as a result there is no relationship between the explanatory and dependent variables.
The 2nd set of hypotheses likewise tests whether the intercept is different from zero or not. However, the interpretation is different, in that we are now asking whether the function passes through the origin or through a different point on the y axis.
Now (β̂ − β)/se(β̂) ~ t with n − 2 degrees of freedom.
Therefore, if the null hypothesis (H₀: β = 0) is true, it follows that β̂/se(β̂) follows this t distribution:
t = 4.027833/0.5886358 = 6.84 → |t| = 6.84
Now, from the t table for 514 degrees of freedom we read that
Pr[t > 0.674] = 0.25, Pr[t > 1.282] = 0.10, Pr[t > 1.645] = 0.05, Pr[t > 1.96] = 0.025, and so on. In fact, Pr[t > 6.84] ≈ 0. Since this is a very low probability, we reject the null hypothesis. Thus, the slope parameter we calculated is statistically different from zero. People customarily say that their parameters have been found to be significant. It is customary to use probability levels of 0.05 and 0.01 as cut-offs for rejecting the null hypothesis; i.e., we reject the null hypothesis if the probability obtained is less than 0.05 or 0.01.
Assignment:
1. Test whether the intercept is significantly different from zero.
2. Given the following
Ŷ = 30 + 6X
     (5.33) (2.55)
with standard errors in parentheses, n = 200 and R² = 0.95.
a) interpret the results
b) test the hypotheses that the parameters are different from zero
2.6. Analysis of variance
The analysis of variance is yet another way of presenting results in regression analysis that complements
statistical inference. It uses the decomposition of variation in Y into the explained and residual sum of
squares. Under the assumption of normality, we obtained the following facts:
RSS/σ² = Σeᵢ²/σ² ~ χ²ₙ₋₂
and
ESS/σ² = β̂²Σxᵢ²/σ² ~ χ²₁
Of course, the latter holds if the true β is zero, i.e., the null hypothesis holds. It can be shown further that these two distributions are independent. Thus, under the assumption that β = 0, dividing both quantities by their respective degrees of freedom and taking their ratio one gets
(ESS/1)/(RSS/(n − 2)) ~ F(1, n − 2)
This result can be used to test whether β = 0. The sketch for presenting the analysis of variance is given in the following table.

Source of variation | Sum of Squares | df    | Mean Square
Model               | ESS = β̂²Σxᵢ²  | 1     | ESS/1
Residual            | RSS = Σeᵢ²     | n − 2 | RSS/(n − 2)
Total               | TSS = Σyᵢ²     | n − 1 |

Since (ESS/1)/(RSS/(n − 2)) ~ F(1, n − 2), the F statistic we obtain from this data is
F = 85839433.6/1833315.4 = 46.82
Recall the t statistic we obtained for testing the significance of the slope parameter:
t = β̂/se(β̂) = 4.027833/0.5886358 = 6.84
Note that t² = 6.84² ≈ 46.8 = F, as expected with a single regressor.
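A quick numeric check of the relationship between the two test statistics, using the figures reported above (illustrative only):

```python
# With one regressor, the F statistic equals the square of the t statistic.
beta_hat, se_beta = 4.027833, 0.5886358
t_stat = beta_hat / se_beta                   # about 6.84
ms_model, ms_resid = 85839433.6, 1833315.4    # from the ANOVA table above
F_stat = ms_model / ms_resid                  # about 46.82
print(t_stat ** 2, F_stat)
```

The two agree up to rounding of the reported inputs, so the t test and the F test of β = 0 are equivalent in the simple regression model.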
2.7. Prediction in the simple regression model
Let the given value of X be X*; then we predict the corresponding value Y* of Y by solving
Ŷ* = α̂ + β̂X*    (2)
where Ŷ* is the predicted value of Y*.
Now, the true value of Y* is
Y* = α + βX* + ε*    (3)
where ε* is the disturbance (error) term. We now try to look at the desirable properties of Ŷ*.
First, we note that Ŷ* is a linear function of Y₁, Y₂, …, Yₙ, since α̂ and β̂ are linear in Yᵢ. Thus Ŷ* is a linear predictor of Y*.
Second, Ŷ* is unbiased. This follows from considering the prediction error
u = Ŷ* − Y*
But Ŷ* = α̂ + β̂X*
and Y* = α + βX* + ε*.
Therefore, the prediction error is
u = Ŷ* − Y* = (α̂ − α) + (β̂ − β)X* − ε*
Thus
E[u] = E[Ŷ* − Y*] = E[α̂ − α] + E[β̂ − β]X* − E[ε*] = 0
Thus,
E[Ŷ* − Y*] = 0 → E[Ŷ*] = E[Y*]
Thus it is unbiased.
Third, though we shall not show this here, Ŷ* has the smallest variance among the linear unbiased predictors of Y*. Thus, Ŷ* is the Best Linear Unbiased Predictor (BLUP) of Y*.
We will, however, derive the variance of the prediction error, Var[u], which is obtained as follows.
Now
u = (α̂ − α) + (β̂ − β)X* − ε*
Thus
Var[u] = Var(α̂) + X*²Var(β̂) + 2X*Cov(α̂, β̂) + Var(ε*)
= σ²(1/n + X̄²/Σxᵢ²) + X*²σ²/Σxᵢ² − 2X*X̄σ²/Σxᵢ² + σ²
= σ²(1 + 1/n + (X*² − 2X*X̄ + X̄²)/Σxᵢ²)
= σ²(1 + 1/n + (X* − X̄)²/Σxᵢ²)
As a result, we observe that
1. The variance of the prediction error increases as X* moves further away from the mean of X, X̄ (i.e., the mean of the observations on the basis of which α̂ and β̂ have been computed): as the distance between X* and X̄ increases, the variance of the error of prediction increases.
2. The variance of the error in prediction increases with the variance of the regression, σ².
3. It decreases with n (the number of observations used in the estimation of the parameters).
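Observation 1 can be made concrete: evaluating the prediction-error variance formula at X* = X̄ and at a point away from the mean shows the variance growing with distance. A Python sketch using the summary statistics of the yield example (illustrative only):

```python
# Prediction-error variance: sigma^2 * (1 + 1/n + (X* - Xbar)^2 / Sxx).
def pred_var(x_star, sigma2, n, xbar, Sxx):
    return sigma2 * (1 + 1 / n + (x_star - xbar) ** 2 / Sxx)

# Summary statistics from the yield regression in the text.
sigma2, n, xbar, Sxx = 1833315.4, 516, 150.2757, 5291074.3
v_at_mean = pred_var(xbar, sigma2, n, xbar, Sxx)   # smallest possible variance
v_at_200 = pred_var(200, sigma2, n, xbar, Sxx)     # larger, X* is off the mean
print(v_at_mean ** 0.5, v_at_200 ** 0.5)
```

The standard error is smallest at the mean of X and widens steadily as X* moves away from it, which is why prediction bands fan out at the edges of the data.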
Interval prediction
Given the assumption of normality, it can be shown that Ŷ* follows a normal distribution with mean
E[Ŷ*] = α + βX*
and, from the previous result, the variance of the prediction error is
Var(Ŷ* − Y*) = σ²(1 + 1/n + (X* − X̄)²/Σxᵢ²)
Substituting σ̂² for σ², it follows that
(Ŷ* − Y*)/se(Ŷ* − Y*)
follows a t distribution with n − 2 degrees of freedom. This result can be used for both interval estimation and other inference purposes. Let us try to obtain confidence intervals for the predictor using the results in our earlier example. Recall that the regression results were
Yield = 1075.788 + 4.0278833 fertha
        (106.66623)  (0.5886)
σ̂² = 1833315.4; X̄ = 150.2757; Σxᵢ² = 5291074.3. Recall also that fertha ranges from 0 to 700.
Now, suppose we are interested in predicting Yield* for fertha* = 150.2757, i.e., the mean. In this case
Yield* = α̂ + β̂ fertha*
= 1075.788 + 4.0278833(150.2757)
= 1681.0736
And given
Var(Ŷ* − Y*) = σ̂²(1 + 1/n + (X* − X̄)²/Σxᵢ²)
the standard error of the predictor is
se(Ŷ*) = √[σ̂²(1 + 1/n + (X* − X̄)²/Σxᵢ²)]
se(Ŷ*) = √[1833315.4(1 + 1/516 + (150.2757 − 150.2757)²/5291074.3)]
se(Ŷ*) = √[1833315.4(1 + 1/516)] = 1355.311
The t value for 95% confidence with 514 degrees of freedom is 1.960. Thus, the 95% confidence interval for Y* is
1681.0736 ± 1.96(1355.311) = (−975, 4337)
Now, suppose we want to predict the value of Yield* at a point away from the mean of fertha, say fertha* = 200. Then the predicted value of Yield* is
Yield* = α̂ + β̂ fertha*
= 1075.788 + 4.0278833(200)
= 1881.36
Then, the standard error of the predictor is
se(Ŷ*) = √[1833315.4(1 + 1/516 + (200 − 150.2757)²/5291074.886233)]
= 1355.6 > 1355.311
confirming that the prediction becomes less precise as X* moves away from X̄.