Using Econometrics: A Practical Guide
A.H. Studenmund
6th Edition, Addison-Wesley
Chapter 1
An Overview of Regression Analysis
What is Econometrics?
Econometrics literally means "economic measurement"
It is the quantitative measurement and analysis of actual economic and business phenomena, and so involves:
economic theory
statistics
math
observation/data collection
Example
Consider the general and purely theoretical relationship:
Q = f(P, Ps, Yd)    (1.1)
(1.2)
Regression analysis attempts to explain movements in one variable, the dependent variable, as a function of movements in a set of other variables, the independent (or explanatory) variables, through the quantification of a single equation
Example
Return to the example from before:
Q = f(P, Ps, Yd)    (1.1)
(1.3)
Figure 1.1
Graphical Representation of the Coefficients of the Regression Line
(1.4)
Z = X^2    (1.5)
Y = β0 + β1Z    (1.6)
is not linear
No, there are at least four sources of variation in Y other than the variation in the included Xs:
Measurement error
(1.7)
Why deterministic?
Indicates the value of Y that is determined by a given value of X (which is assumed to be non-stochastic)
Alternatively, the deterministic component can be thought of as the expected value of Y given X, namely E(Y|X), i.e. the mean (or average) value of the Ys associated with a particular value of X
This is also denoted the conditional expectation (that is, the expectation of Y conditional on X)
Example: Aggregate Consumption Function
Aggregate consumption as a function of aggregate income may be lower (or higher) than it would otherwise have been due to:
consumer uncertainty, which is hard (impossible?) to measure, i.e. is an omitted variable
Observed consumption may be different from actual consumption due to measurement error
The true consumption function may be nonlinear but a linear one is estimated (see Figure 1.2 for a graphical illustration)
Human behavior always contains some element of pure chance; unpredictable, i.e. random, events may increase or decrease consumption at any given time
Figure 1.2
Errors Caused by Using a Linear Functional Form to Model a Nonlinear Relationship
(1.10)
Indexing Conventions
Subscript i for data on individuals (so-called cross-section data)
Subscript t for time-series data (e.g., series of years, months, or days; daily exchange rates, for example)
Subscript it when we have both (for example, panel data)
It has to be estimated:
Ŷi = β̂0 + β̂1Xi    (1.16)
The signs on top of the estimates are denoted "hat," so that we have Y-hat, for example
Figure 1.3
True and Estimated Regression Lines
(1.23)
Figure 1.4
A Weight-Guessing Equation
Slope coefficient
Dependent variable
Multivariate regression
model
Independent (or explanatory) variable(s)
Expected value
Causality
Residual
Stochastic error term
Time series
Linear
Cross-sectional data set
Intercept term
Chapter 2
Ordinary Least Squares
(i = 1, 2, ..., N)    (2.3)
OLS minimizes the sum of squared residuals, Σ(Yi − Ŷi)²
produced by OLS is an estimate
(2.4)
(2.5)
2011 Pearson Addison-Wesley. All rights reserved.
Estimating Multivariate Regression Models with OLS
In the real world, one explanatory variable is not enough
The general multivariate regression model with K independent variables is:
Yi = β0 + β1X1i + β2X2i + ... + βKXKi + εi    (i = 1, 2, ..., N)    (1.13)
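A minimal numerical sketch of estimating such a model with OLS in NumPy; the data-generating process and coefficient values below are invented purely for illustration:

```python
import numpy as np

# Simulated data for an invented model: Y = 2.0 + 1.5*X1 - 0.7*X2 + eps
rng = np.random.default_rng(0)
N = 100
X1 = rng.normal(size=N)
X2 = rng.normal(size=N)
Y = 2.0 + 1.5 * X1 - 0.7 * X2 + rng.normal(scale=0.1, size=N)

# Regressor matrix with a leading column of ones for the intercept beta0
X = np.column_stack([np.ones(N), X1, X2])

# OLS: beta_hat minimizes the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)  # close to the true values [2.0, 1.5, -0.7]
```

With N = 100 observations and small noise, the estimates land very close to the coefficients used to generate the data.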
where:
PARENTi: the amount (in dollars) that the parents of the ith student are judged able to contribute to college expenses
HSRANKi: the ith student's GPA rank in high school, measured as a percentage (i.e. between 0 and 100)
(2.12)
(2.13)
Table 2.1a
The Calculation of Estimated Regression Coefficients for the Weight/Height Example
Table 2.1b
The Calculation of Estimated Regression Coefficients for the Weight/Height Example
Table 2.2a
Data for the Financial Aid Example
Table 2.2b
Data for the Financial Aid Example
Table 2.2c
Data for the Financial Aid Example
Table 2.2d
Data for the Financial Aid Example
Chapter 3
Learning to Use Regression Analysis
Steps in Applied Regression Analysis
The first step is choosing the dependent variable; this step is determined by the purpose of the research (see Chapter 11 for details)
After choosing the dependent variable, it's logical to follow this sequence:
1. Review the literature and develop the theoretical model
2. Specify the model: select the independent variables and the functional form
3. Hypothesize the expected signs of the coefficients
4. Collect the data. Inspect and clean the data
5. Estimate and evaluate the equation
6. Document the results
For example, only theoretically relevant explanatory variables should be included
Example:
When estimating a demand equation, theory informs us that the prices of complements and substitutes of the good in question are important explanatory variables
But which complements, and which substitutes?
Examples:
Does a student have a 7.0 GPA on a 4.0 scale?
Is consumption negative?
Objective:
How to decide location using the six basic steps of applied regression analysis, discussed earlier?
You take the data set and enter it into the computer
You then run an OLS regression (after thinking the model over one
last time!)
(3.5)
Values for N, P, and I for each potential new location are then
obtained and plugged into (3.5) to predict Y
Table 3.1a
Data for the Woody's Restaurants Example (Using the EViews Program)
Table 3.1b
Data for the Woody's Restaurants Example (Using the EViews Program)
Table 3.1c
Data for the Woody's Restaurants Example (Using the EViews Program)
Table 3.2a
Actual Computer Output (Using the EViews Program)
Table 3.2b
Actual Computer Output (Using the EViews Program)
Table 3.3a
Data for the Woody's Restaurants Example (Using the Stata Program)
Table 3.3b
Data for the Woody's Restaurants Example (Using the Stata Program)
Table 3.4a
Actual Computer Output (Using the Stata Program)
Table 3.4b
Actual Computer Output (Using the Stata Program)
Chapter 4
The Classical Model
The classical assumptions must be met in order for OLS estimators to be the best available
The seven classical assumptions are:
I. The regression model is linear, is correctly specified, and has an additive error term
II. The error term has a zero population mean
III. All explanatory variables are uncorrelated with the error term
IV. Observations of the error term are uncorrelated with each other (no serial correlation)
V. The error term has a constant variance (no heteroskedasticity)
VI. No explanatory variable is a perfect linear function of any other explanatory variable(s) (no perfect multicollinearity)
VII. The error term is normally distributed (this assumption is optional but usually is invoked)
(4.1)
This model:
is linear (in the coefficients)
has an additive error term
An increase in the error term in one time period (a random shock, for example) is likely to be followed by an increase in the next period, also
Example: Hurricane Katrina
If, over all the observations of the sample, εt+1 is correlated with εt, then the error term is said to be serially correlated (or autocorrelated), and Assumption IV is violated
Violations of this assumption are considered in more detail in Chapter 9
V: Constant Variance / No Heteroskedasticity in the Error Term
The error term must have a constant variance
That is, the variance of the error term cannot change for each observation or range of observations
If it does, there is heteroskedasticity present in the error term
An example of this can be seen in Figure 4.2
Example:
Including both annual sales (in dollars) and the annual sales tax paid in a regression at the level of an individual store, all in the same city
Since the stores are all in the same city, there is no variation in the percentage sales tax, so sales tax paid is an exact multiple of sales (perfect multicollinearity)
Figure 4.3
Normal Distributions
The Sampling Distribution of β̂
We saw earlier that the error term follows a probability distribution (Classical Assumption VII)
But so do the estimates of β!
The probability distribution of these values across different samples is called the sampling distribution of β̂
Figure 4.4
Distributions of β̂
Properties of the Standard Error
The standard error of the estimated coefficient, SE(β̂), is the square root of the estimated variance of the estimated coefficients.
Hence, it is similarly affected by the sample size and the other factors discussed previously
For example, an increase in the sample size will decrease the standard error
Similarly, the larger the sample, the more precise the coefficient estimates will be
Table 4.1a
Notation Conventions
Table 4.1b
Notation Conventions
Chapter 5
Hypothesis Testing
Example:
H0: β ≤ 0 (the values you do not expect)
HA: β > 0 (the values you do expect)
Example: Suppose we have the following null and alternative hypotheses:
H0: β ≤ 0
HA: β > 0
Even if the true β really is not positive, in any one sample we might still observe an estimate of β that is sufficiently positive to lead to the rejection of the null hypothesis
Decision Rules of Hypothesis Testing
The t-Test
The t-test is the test that econometricians usually use to test hypotheses about individual regression slope coefficients
Tests of more than one coefficient at a time (joint hypotheses) are typically done with the F-test, presented in Section 5.6
The t-test is the appropriate test to use when the stochastic error term is normally distributed and when the variance of that distribution must be estimated
Since these usually are the case, the use of the t-test for hypothesis testing has become standard practice in econometrics
The t-Statistic
For a typical multiple regression equation:
(5.1)
we can calculate t-values for each of the estimated
coefficients
Usually these are only calculated for the slope coefficients, though
(see Section 7.1)
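A sketch of how these t-values are computed, t = β̂k / SE(β̂k) for the null H0: βk = 0, using simulated data (the model and all numbers below are invented for illustration):

```python
import numpy as np

# Simulated single-regressor model: Y = 1.0 + 2.0*X1 + eps
rng = np.random.default_rng(1)
N = 60
X1 = rng.normal(size=N)
Y = 1.0 + 2.0 * X1 + rng.normal(size=N)

X = np.column_stack([np.ones(N), X1])
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat

# Estimated error variance uses N - 2 degrees of freedom here (one slope)
s2 = resid @ resid / (N - 2)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

# t-score for H0: beta_k = 0 is the estimate divided by its standard error
t_scores = beta_hat / se
print(t_scores)
```

Since the simulated slope is far from zero relative to its standard error, its t-score is large and the null would be rejected.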
One-sided tests:
H0: βk ≤ 0    HA: βk > 0
H0: βk ≤ S    HA: βk > S
H0: βk ≥ 0    HA: βk < 0
H0: βk ≥ S    HA: βk < S
Two-sided tests:
H0: βk = 0    HA: βk ≠ 0
H0: βk = S    HA: βk ≠ S
Choosing a Level of Significance
The level of significance must be chosen before a critical value can be found, using Statistical Table B
The level of significance indicates the probability of observing an estimated t-value greater than the critical t-value if the null hypothesis were correct
It also measures the amount of Type I Error implied by a particular critical t-value
Which level of significance is chosen?
5 percent is recommended, unless you know something unusual about the relative costs of making Type I and Type II Errors
Confidence Intervals
A confidence interval is a range that contains the true value of an item a specified percentage of the time
It is calculated using the estimated regression coefficient, the two-sided critical t-value, and the standard error of the estimated coefficient as follows:
Confidence interval = β̂ ± tc · SE(β̂)    (5.5)
What's the relationship between confidence intervals and two-sided hypothesis testing?
If a hypothesized value falls within the confidence interval, then we cannot reject the null hypothesis
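The interval in (5.5) can be computed directly; the estimate and standard error below are hypothetical, and the critical value 2.045 is the two-sided 5% value for 29 degrees of freedom used later in this chapter:

```python
# Hypothetical estimate and standard error for a slope coefficient
beta_hat = 0.50
se_beta = 0.10
t_crit = 2.045  # two-sided 5% critical t-value for 29 degrees of freedom

# Equation 5.5: beta_hat +/- t_crit * SE(beta_hat)
lower = beta_hat - t_crit * se_beta
upper = beta_hat + t_crit * se_beta
print(round(lower, 4), round(upper, 4))  # 0.2955 0.7045

# A hypothesized value inside the interval cannot be rejected at the 5% level
hypothesized = 0.4
reject = not (lower <= hypothesized <= upper)
print(reject)  # False
```

This makes the link to two-sided testing concrete: 0.4 lies inside the interval, so H0: β = 0.4 is not rejected.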
p-Values
Examples of t-Tests: One-Sided
The most common use of the one-sided t-test is to determine whether a regression coefficient is significantly different from zero (in the direction predicted by theory!)
This involves four steps:
1. Set up the null and alternative hypotheses
2. Choose a level of significance and therefore a critical t-value
3. Run the regression and obtain an estimated t-value (or t-score)
4. Apply the decision rule by comparing the calculated t-value with the critical t-value in order to reject or not reject the null hypothesis
Let's look at each step in more detail for a specific example:
Examples of t-Tests: One-Sided (cont.)
Consider the following simple model of the aggregate retail sales of new cars:
(5.6)
where:
Y = sales of new cars
X1 = real disposable income
X2 = average retail price of a new car adjusted by the consumer price index
X3 = number of sport utility vehicles sold
The four steps for this example then are as follows:
1. H0: β1 ≤ 0    HA: β1 > 0
2. H0: β2 ≥ 0    HA: β2 < 0
3. H0: β3 ≥ 0    HA: β3 < 0
Examples of t-Tests: Two-Sided
The two-sided test is used when the hypotheses should be rejected if estimated coefficients are significantly different from zero, or a specific nonzero value, in either direction
So, there are two cases:
1. Two-sided tests of whether an estimated coefficient is significantly different from zero, and
2. Two-sided tests of whether an estimated coefficient is significantly different from a specific nonzero value
Examples of t-Tests: Two-Sided (cont.)
Again, in the Woody's restaurant equation of Section 3.2, the impact of the average income of an area on the expected number of Woody's customers in that area is ambiguous:
A high-income neighborhood might have more total customers going out to dinner (positive sign), but those customers might decide to eat at a more formal restaurant than Woody's (negative sign)
The appropriate (two-sided) t-test therefore is:
Examples of t-Tests: Two-Sided (cont.)
1. Set up the null and alternative hypotheses
H0: βk = 0
HA: βk ≠ 0
2. Choose a level of significance and therefore a critical t-value
Keep the level of significance at 5 percent, but this now must be distributed between two rejection regions for 29 degrees of freedom; hence the correct critical t-value is 2.045 (found in Statistical Table B-1 for 29 degrees of freedom and a 5-percent, two-sided test)
3. Run the regression and obtain an estimated t-value:
The t-value remains at 2.37 (from Equation 5.4)
4. Apply the decision rule:
For the two-sided case, this simplifies to:
Reject H0 if |2.37| > 2.045; so, reject H0
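The decision rule in step 4 can be written directly in code, using the numbers from this example:

```python
# Two-sided decision rule: reject H0 if |t| exceeds the critical value
t_value = 2.37   # estimated t-score (Equation 5.4)
t_crit = 2.045   # two-sided 5% critical value, 29 degrees of freedom

reject_H0 = abs(t_value) > t_crit
print(reject_H0)  # True: reject H0
```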
F Tables
Usually you will see several pages of these; one or two pages at each specific level of significance (.10, .05, .01)
Each table is laid out by numerator d.f. and denominator d.f.; each entry is the value of F at a specific significance level
F Test Hypotheses
H0: β1 = β2 = ... = βK = 0 (none of the Xs help explain Y)
HA: not all βs are 0 (at least one X is useful)
H0: R² = 0 is an equivalent hypothesis
Decision rule:
Reject H0 if F ≥ Fc
Do not reject H0 if F < Fc
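Because H0: R² = 0 is an equivalent hypothesis, the overall F-statistic can be computed directly from R²; the R², N, and K values below are invented for illustration:

```python
# Overall F-test from R^2: F = (R^2 / K) / ((1 - R^2) / (N - K - 1))
R2, N, K = 0.6, 33, 3
F = (R2 / K) / ((1 - R2) / (N - K - 1))
print(round(F, 2))  # 14.5
```

An F this large exceeds any conventional 5% critical value for (3, 29) degrees of freedom, so H0 would be rejected.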
Alternative hypothesis
Critical value
Type I Error
t-statistic
Level of significance
Confidence interval
Two-sided test
p-value
Chapter 6
Specification: Choosing the Independent Variables
Specifying an Econometric Equation and Specification Error
Before any equation can be estimated, it must be completely specified
Specifying an econometric equation consists of three parts, namely choosing the correct:
independent variables
functional form
form of the stochastic error term
Again, this is part of the first classical assumption from Chapter 4
A specification error results when one of these choices is made incorrectly
This chapter will deal with the first of these choices (the two others will be discussed in subsequent chapters)
Omitted Variables
Two reasons why an important explanatory variable might have been left out:
we forgot
it is not available in the dataset we are examining
The Consequences of an Omitted Variable
Hence, the explanatory variables in the estimated regression (6.2) are not independent of the error term (unless the omitted variable is uncorrelated with all the included variables, something which is very unlikely)
Note here that dropping a variable is not a viable strategy to help cure omitted variable bias:
If anything, you'll just generate even more omitted variable bias on the remaining coefficients!
What if:
You have an unexpected result, which leads you to believe that you have an omitted variable
You have two or more theoretically sound explanatory variables as potential candidates
Similarly, when these signs differ, the variable is extremely unlikely to have caused the unexpected result
Irrelevant Variables
Specification Searches
Almost any result can be obtained from a given dataset by simply specifying different regressions until estimates with the desired properties are obtained
Hence, the integrity of all empirical work is open to question
To counter this, the following three points of best practices in specification searches are suggested:
1. Rely on theory rather than statistical fit as much as possible when choosing variables, functional forms, and the like
2. Minimize the number of equations estimated (except for sensitivity analysis, to be discussed later in this section)
3. Reveal, in a footnote or appendix, all alternative specifications estimated
Sequential Specification Searches
In the first case there is no bias, but in the second case there is bias
Sensitivity Analysis
Data Mining
Data mining involves exploring a data set to try to uncover empirical regularities that can inform economic theory
That is, the role of data mining is opposite that of traditional econometrics, which instead tests economic theory on a data set
Be careful, however!
A hypothesis developed using data mining techniques must be tested on a different data set (or in a different context) than the one used to develop the hypothesis
Not doing so would be highly unethical: after all, the researcher already knows ahead of time what the results will be!
Chapter 7
Specification: Choosing a Functional Form
(7.1)
(7.2)
Linear Form
This is based on the assumption that the slope of the relationship between the independent variable and the dependent variable is constant:
For the linear case, the elasticity of Y with respect to X (the percentage change in the dependent variable caused by a 1-percent increase in the independent variable, holding the other variables in the equation constant) is:
Elasticity = βk(Xk/Y)
What Is a Log?
One useful property of natural logs in econometrics is that they make it easier to figure out impacts in percentage terms (we'll see this when we get to the double-log specification)
Double-Log Form
Here, the natural log of Y is the dependent variable and the natural log of X is the independent variable:
(7.5)
In a double-log equation, an individual regression coefficient can be interpreted as an elasticity because:
(7.6)
Note that the elasticities of the model are constant and the slopes are not
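A quick simulation of the constant-elasticity property; the data-generating process, with a true elasticity of 0.8, is invented for illustration. Regressing ln Y on ln X recovers the elasticity as the slope:

```python
import numpy as np

# Constant-elasticity relationship: Y = 3.0 * X^0.8, with multiplicative noise
rng = np.random.default_rng(5)
X = rng.uniform(1, 10, size=300)
Y = 3.0 * X**0.8 * np.exp(rng.normal(scale=0.05, size=300))

# Double-log regression: the slope on ln(X) estimates the elasticity
Z = np.column_stack([np.ones(300), np.log(X)])
b, *_ = np.linalg.lstsq(Z, np.log(Y), rcond=None)
print(b[1])  # close to the true elasticity of 0.8
```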
Figure 7.2
Double-Log Functions
Semilog Form
The semilog functional form is a variant of the double-log equation in which some but not all of the variables (dependent and independent) are expressed in terms of their natural logs.
It can be on the right-hand side, as in:
Yi = β0 + β1lnX1i + β2X2i + εi    (7.7)
(7.9)
Figure 7.3
Semilog Functions
Polynomial Form
(7.10)
Figure 7.4
Polynomial Functions
Inverse Form
The inverse functional form expresses Y as a function of the reciprocal (or inverse) of one or more of the independent variables (in this case, X1):
Yi = β0 + β1(1/X1i) + β2X2i + εi    (7.13)
Figure 7.5
Inverse Functions
Lagged Independent Variables
Virtually all the regressions we've studied so far have been instantaneous in nature
In other words, they have included independent and dependent variables from the same time period, as in:
Yt = β0 + β1X1t + β2X2t + εt    (7.15)
(7.16)
Figure 7.6
An Intercept Dummy
(7.20)
The slope depends on the value of D:
When D = 0: ΔY/ΔX = β1
When D = 1: ΔY/ΔX = β1 + β3
Graphical illustration of how this works in Figure 7.7
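A simulation of a slope-dummy (interaction-term) specification; the coefficient values are invented for illustration. The fitted slope differs across the two values of D exactly as above:

```python
import numpy as np

# Y = 1.0 + 0.5*X + 2.0*D + 1.5*(D*X) + eps, so the slope is 0.5 when D = 0
# and 0.5 + 1.5 = 2.0 when D = 1
rng = np.random.default_rng(6)
N = 400
X = rng.uniform(0, 10, size=N)
D = (rng.random(N) < 0.5).astype(float)
Y = 1.0 + 0.5 * X + 2.0 * D + 1.5 * D * X + rng.normal(scale=0.1, size=N)

Z = np.column_stack([np.ones(N), X, D, D * X])
b, *_ = np.linalg.lstsq(Z, Y, rcond=None)
slope_D0 = b[1]         # slope when D = 0
slope_D1 = b[1] + b[3]  # slope when D = 1
print(slope_D0, slope_D1)  # close to 0.5 and 2.0
```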
1. Fits are difficult to compare when the dependent variable has been transformed
2. An incorrect functional form may provide a reasonable fit within the sample but have the potential to make large forecast errors when used outside the range of the sample
The first of these is essentially due to the fact that when the dependent variable is transformed, the total sum of squares (TSS) changes as well
The second is essentially due to the fact that using an incorrect functional form amounts to a specification error similar to the omitted variables bias discussed in Section 6.1
This second case is illustrated in Figure 7.8
Slope dummy
Double-log functional form
Natural log
Semilog functional form
Interaction term
Polynomial functional form
Inverse functional form
Omitted condition
Linear in the variables
Linear in the coefficients
Chapter 8
Multicollinearity
Perfect Multicollinearity
The word "perfect" in this context implies that the variation in one explanatory variable can be completely explained by movements in another explanatory variable
A special case is that of a dominant variable: an explanatory variable that is definitionally related to the dependent variable
X1i = α0 + α1X2i    (8.1)
where the αs are constants and the Xs are independent variables in:
Yi = β0 + β1X1i + β2X2i + εi    (8.2)
Figure 8.1
Perfect Multicollinearity
Perfect Multicollinearity (cont.)
You cannot hold all the other independent variables in the equation constant if every time one variable changes, another changes in an identical manner!
Solution: one of the collinear variables must be dropped (they are essentially identical, anyway)
Imperfect Multicollinearity
Imperfect multicollinearity occurs when two (or more) explanatory variables are imperfectly linearly related, as in:
X1i = α0 + α1X2i + ui    (8.7)
Compare Equation 8.7 to Equation 8.1
Notice that Equation 8.7 includes ui, a stochastic error term
Figure 8.2
Imperfect Multicollinearity
The Consequences of Multicollinearity
There are five major consequences of multicollinearity:
1. Estimates will remain unbiased
2. The variances and standard errors of the estimates will increase
The Consequences of Multicollinearity (cont.)
3. The computed t-scores will fall:
a. Recalling Equation 5.2, this is a direct consequence of 2. above
4. Estimates will become very sensitive to changes in specification:
a. For example, if you drop a variable, even one that appears to be statistically insignificant, the coefficients of the remaining variables in the equation sometimes will change dramatically
5. The overall fit of the equation and the estimation of the coefficients of nonmulticollinear variables will be largely unaffected
The Detection of Multicollinearity
First, realize that some multicollinearity exists in every equation: all variables are correlated to some degree (even if completely at random)
So it's really a question of how much multicollinearity exists in an equation, rather than whether any multicollinearity exists
There are basically two characteristics that help detect the degree of multicollinearity for a given application:
1. High simple correlation coefficients
2. High Variance Inflation Factors (VIFs)
We will now go through each of these in turn:
(8.15)
VIF(β̂k) = 1/(1 − R²k)    (8.16)
where R²k is the unadjusted R² from regressing Xk on the other explanatory variables
From Equation 8.16, the higher the VIF, the more severe the effects of multicollinearity
While there is no table of formal critical VIF values, a common rule of thumb is that if a given VIF is greater than 5, the multicollinearity is severe
Note that some authors replace the VIF with its reciprocal, called tolerance, or TOL
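Equation 8.16 can be sketched directly: regress each Xk on the remaining explanatory variables, then compute 1/(1 − R²k). The nearly collinear data below are invented for illustration:

```python
import numpy as np

def vif(X):
    """VIF for each column: regress it on the remaining columns (plus a
    constant) and return 1 / (1 - R^2) of that auxiliary regression."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ b
        tss = (y - y.mean()) @ (y - y.mean())
        r2 = 1.0 - (resid @ resid) / tss
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)             # essentially unrelated
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # VIFs for x1 and x2 well above the rule-of-thumb 5; x3 near 1
```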
Remedies for Multicollinearity
Essentially three remedies for multicollinearity:
1. Do nothing:
a. Multicollinearity will not necessarily reduce the t-scores enough to make them statistically insignificant and/or change the estimated coefficients to make them differ from expectations
b. The deletion of a multicollinear variable that belongs in an equation will cause specification bias
2. Drop a redundant variable:
a. Viable strategy when two variables measure essentially the same thing
b. Always use theory as the basis for this decision!
Remedies for Multicollinearity (cont.)
3. Increase the sample size:
a. This is frequently impossible, but a useful alternative to be considered if feasible
b. The idea is that the larger sample normally will reduce the variance of the estimated coefficients, diminishing the impact of the multicollinearity
Table 8.1a
Table 8.1b
Table 8.2a
Table 8.2b
Table 8.2c
Table 8.2d
Table 8.3a
Table 8.3b
Chapter 9
Serial Correlation
(9.1)
(9.2)
Negative:
implies that the error term has a tendency to switch signs from
negative to positive and back again in consecutive observations
this is called negative serial correlation
Figure 9.1a
Positive Serial Correlation
Figure 9.1b
Positive Serial Correlation
Figure 9.2
No Serial Correlation
Figure 9.3a
Negative Serial Correlation
Figure 9.3b
Negative Serial Correlation
The new error term ε* is now a function of the true error term and of the differences between the linear and the polynomial functional forms
1. Pure serial correlation does not cause bias in the coefficient estimates
2. Serial correlation causes OLS to no longer be the minimum variance estimator (of all the linear unbiased estimators)
3. Serial correlation causes the OLS estimates of the SE(β̂) to be biased, leading to unreliable hypothesis testing
The Durbin-Watson d Test (cont.)
The equation for the Durbin-Watson d statistic for T observations is:
d = Σt=2..T (et − et−1)² / Σt=1..T et²    (9.10)
where the ets are the OLS residuals
There are three main cases:
1. Extreme positive serial correlation: d ≈ 0
2. Extreme negative serial correlation: d ≈ 4
3. No serial correlation: d ≈ 2
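Equation 9.10 in code, applied to simulated residuals; the AR(1) coefficient of 0.9 below is invented to produce strong positive serial correlation:

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson d: sum of squared successive differences of the
    residuals divided by the sum of squared residuals (Equation 9.10)."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(3)

# Uncorrelated residuals: d should be near 2
e_none = rng.normal(size=5000)
print(durbin_watson(e_none))

# Strongly positively autocorrelated residuals: d well below 2
e_pos = np.zeros(5000)
for t in range(1, 5000):
    e_pos[t] = 0.9 * e_pos[t - 1] + rng.normal()
print(durbin_watson(e_pos))
```

The two printed values illustrate cases 3 and 1 above, respectively.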
The Durbin-Watson d Test (cont.)
To test for positive serial correlation (note that we rarely, if ever, test for negative!), the following steps are required:
1. Obtain the OLS residuals from the equation to be tested and calculate the d statistic by using Equation 9.10
2. Determine the sample size and the number of explanatory variables and then consult Statistical Tables B-4, B-5, or B-6 in Appendix B to find the upper critical d value, dU, and the lower critical d value, dL, respectively (instructions for the use of these tables are also in that appendix)
The Durbin-Watson d Test (cont.)
3. Set up the test hypotheses and decision rule:
H0: ρ ≤ 0 (no positive serial correlation)
HA: ρ > 0
Reject H0 if d < dL
Do not reject H0 if d > dU
Inconclusive if dL ≤ d ≤ dU
The Durbin-Watson d Test (cont.)
3. Set up the test hypotheses and decision rule:
H0: ρ = 0
HA: ρ ≠ 0 (serial correlation)
Reject H0 if d < dL
Reject H0 if d > 4 − dL
Do not reject H0 if 4 − dU > d > dU
Inconclusive otherwise
Multiply Equation 9.15 by ρ and then lag the new equation by one period, obtaining:
(9.17)
This is a two-step iterative technique that first produces an estimate of ρ and then estimates the GLS equation using that estimate.
These two steps are repeated (iterated) until further iteration results in little change in ρ̂
Once ρ̂ has converged (usually in just a few iterations), the last estimate of step 2 is used as a final estimate of Equation 9.18
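This iterative procedure (commonly known as Cochrane-Orcutt) can be sketched as follows; the simulated model and the true ρ = 0.7 are invented for illustration:

```python
import numpy as np

def cochrane_orcutt(X, y, max_iter=50, tol=1e-6):
    """Iterate: estimate rho from the residuals, quasi-difference the
    data by rho, re-estimate the coefficients, until rho converges."""
    rho = 0.0
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(max_iter):
        e = y - X @ beta                                  # residuals on original data
        rho_new = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])    # step 1: estimate rho
        ys = y[1:] - rho_new * y[:-1]                     # step 2: quasi-difference
        Xs = X[1:] - rho_new * X[:-1]
        beta = np.linalg.lstsq(Xs, ys, rcond=None)[0]
        if abs(rho_new - rho) < tol:
            break
        rho = rho_new
    return beta, rho_new

# Simulated model y = 1 + 2x + eps with AR(1) errors, true rho = 0.7
rng = np.random.default_rng(7)
T = 2000
x = rng.normal(size=T)
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = 0.7 * eps[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + eps
X = np.column_stack([np.ones(T), x])
beta, rho_hat = cochrane_orcutt(X, y)
print(beta, rho_hat)  # slope near 2, rho_hat near 0.7
```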
Chapter 10
Heteroskedasticity
Pure Heteroskedasticity
Pure heteroskedasticity occurs when Classical Assumption V, which assumes constant variance of the error term, is violated (in a correctly specified equation!)
Classical Assumption V assumes that:
VAR(εi) = σ²    (10.1)
With heteroskedasticity, this error term variance is not constant
Pure Heteroskedasticity (cont.)
Instead, the variance of the distribution of the error term depends on exactly which observation is being discussed:
VAR(εi) = σi²    (10.2)
The simplest case is that of discrete heteroskedasticity, where the observations of the error term can be grouped into just two different distributions, "wide" and "narrow"
This case is illustrated in Figure 10.1
Pure Heteroskedasticity (cont.)
Impure Heteroskedasticity
The Consequences of Heteroskedasticity
The existence of heteroskedasticity in the error term of an equation violates Classical Assumption V, and the estimation of the equation with OLS has at least three consequences:
1. Pure heteroskedasticity does not cause bias in the coefficient estimates
2. Heteroskedasticity typically causes OLS to no longer be the minimum variance estimator (of all the linear unbiased estimators)
3. Heteroskedasticity causes the OLS estimates of the SE(β̂) to be biased, leading to unreliable hypothesis testing. Typically the bias in the SE estimate is negative, meaning that OLS underestimates the standard errors (and thus overestimates the t-scores)
Testing for Heteroskedasticity
Econometricians do not all use the same test for heteroskedasticity because heteroskedasticity takes a number of different forms, and its precise manifestation in a given equation is almost never known
Before using any test for heteroskedasticity, however, ask the following:
1. Are there any obvious specification errors?
Fix those before testing!
2. Use these residuals (squared) as the dependent variable in a second equation that includes as explanatory variables each X from the original equation, the square of each X, and the product of each X times every other X; for example, in the case of three explanatory variables:
(10.9)
Remedies for Heteroskedasticity
Heteroskedasticity-Corrected Standard Errors
Heteroskedasticity-corrected standard errors take account of heteroskedasticity by correcting the standard errors without changing the estimated coefficients
The logic behind heteroskedasticity-corrected standard errors is as follows:
If heteroskedasticity does not cause bias in the estimated coefficients but does impact the standard errors, then it makes sense to adjust the estimated equation in a way that changes the standard errors but not the coefficients
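A sketch of White's heteroskedasticity-consistent (HC0) standard errors, one common version of this correction; the heteroskedastic data below, with error variance growing in x, are simulated for illustration:

```python
import numpy as np

def ols_with_hc0(X, y):
    """OLS coefficients plus White (HC0) standard errors:
    cov = (X'X)^-1 X' diag(e^2) X (X'X)^-1."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta
    XtX_inv = np.linalg.inv(X.T @ X)
    cov = XtX_inv @ (X.T * e**2) @ X @ XtX_inv
    return beta, np.sqrt(np.diag(cov))

# Simulated heteroskedastic data: error standard deviation grows with x
rng = np.random.default_rng(4)
N = 500
x = rng.uniform(1, 5, size=N)
y = 1.0 + 2.0 * x + rng.normal(scale=x)  # scale=x: non-constant variance
X = np.column_stack([np.ones(N), x])
beta, robust_se = ols_with_hc0(X, y)
print(beta, robust_se)  # coefficients identical to plain OLS; SEs corrected
```

Note that only the standard errors change: the coefficient estimates are exactly the plain OLS estimates.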
Heteroskedasticity-Corrected Standard Errors (cont.)
The heteroskedasticity-corrected SEs are biased but generally more accurate than uncorrected standard errors in large samples in the face of heteroskedasticity
As a result, heteroskedasticity-corrected standard errors can be used for t-tests and other hypothesis tests in most samples without the errors of inference potentially caused by heteroskedasticity
Typically, heteroskedasticity-corrected SEs are larger than OLS SEs, thus producing lower t-scores
Table 10.1a
Table 10.1b
Table 10.1c
Chapter 11
Running Your Own Regression Project
Table 11.1a
Sources of Potential Topic Ideas
Table 11.1b
Sources of Potential Topic Ideas
Before any quantitative analysis can be done, the data must be:
collected
organized
entered into a computer
But time spent thinking about and collecting the data is well spent, since a
researcher who knows the data sources and definitions is much less likely
to make mistakes using or interpreting regressions run on that data
We will now discuss three data collection issues in a bit more detail
1. Checking for data availability means deciding what specific variables you
want to study:
the dependent variable
all relevant independent variables
2. Measuring quantity:
If the market and/or quality of a given variable has changed over time, it makes
little sense to measure quantity in physical units
Example: TVs have changed so much over time that it makes more sense to measure
quantity in terms of monetary equivalent: more comparable across time
2011 Pearson Addison-Wesley. All rights reserved.
5. Be careful when reading (and creating!) descriptions of data:
Where did the data originate?
Are prices and/or income measured in nominal or real terms?
Are prices retail or wholesale?
Missing Data
Suppose the data aren't there?
What happens if you choose the perfect variable and
look in all the right sources and can't find the data?
The answer to this question depends on how much
data is missing:
1. A few observations:
in a cross-section study:
Can usually afford to drop these observations from the
sample
in a time-series study:
May interpolate value (taking the mean of adjacent values)
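For the time-series case, this kind of interpolation is one line in pandas; the series below is hypothetical, and the default linear interpolation fills the gap with the mean of the adjacent values:

```python
import numpy as np
import pandas as pd

# Hypothetical annual series with one missing observation
gdp = pd.Series([3.1, 3.4, np.nan, 4.0, 4.3],
                index=pd.period_range("2001", periods=5, freq="Y"))

# Linear interpolation fills 2003 with the mean of its neighbors
filled = gdp.interpolate()
print(filled)
```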
It turns out, however, that:
1. time-series and cross-sectional data can be pooled to form panel
data
2. data can be generated through surveys
Surveys
Surveys are everywhere in our society and are
used for many different purposes; examples include:
marketing firms using surveys to learn more about
products and competition
political candidates using surveys to fine-tune their
campaign advertising or strategies
governments using surveys for all sorts of purposes,
including keeping track of their citizens with instruments
like the U.S. Census
Surveys (cont.)
While running your own survey might be tempting as a
way of obtaining data for your own project, running a survey
is not as easy as it might seem; surveys:
must be carefully thought through; it's virtually impossible to go
back to the respondents and add another question later
must be worded precisely (and pretested) to avoid confusing the
respondent or "leading" the respondent to a particular answer
must have samples that are random and avoid the selection,
survivor, and nonresponse biases explained in Section 17.2
Panel Data
Again, panel data are formed when cross-sectional and
time-series data sets are pooled to create a single data
set
Two main reasons for using panel data:
To increase the sample size
To provide an insight into an analytical question that can't be
obtained by using time-series or cross-sectional data alone
The 10 Commandments of
Applied Econometrics
1. Use common sense and economic theory:
Example: match per capita variables with per capita
variables, use real exchange rates to
explain real imports or exports, etc.
The 10 Commandments of
Applied Econometrics (cont.)
5. Keep it sensibly simple:
a. Begin with a simple model and only complicate it if it fails
b. This goes both for the specifications, functional forms, etc. and for the
estimation method
6. Look long and hard at your results:
a. Check that the results make sense, including signs and magnitudes
b. Apply the laugh test
7. Understand the costs and benefits of data mining:
a. Bad data mining: deliberately searching for a specification that works
(i.e. torturing the data)
b. Good data mining: experimenting with the data to discover empirical
regularities that can inform economic theory and be tested on a second data
set
The 10 Commandments of
Applied Econometrics (cont.)
8. Be prepared to compromise:
a. The Classical Assumptions are only rarely satisfied
b. Applied econometricians are therefore forced to compromise and adopt
suboptimal solutions, the characteristics and consequences of which are
not always known
c. Applied econometrics is necessarily ad hoc: we develop our analysis,
including responses to potential problems, as we go along
9. Do not confuse statistical significance with meaningful magnitude:
a. If the sample size is large enough, any (two-sided) null hypothesis can be
rejected, since a large enough sample makes the SEs small enough
b. Substantive significance (i.e., how large is the effect?) is also important, not
just statistical significance
The 10 Commandments of
Applied Econometrics (cont.)
10. Report a sensitivity analysis:
a. Dimensions to examine:
i. the sample period
ii. the functional form
iii. the set of explanatory variables
iv. the choice of proxies
b. If results are not robust across the examined dimensions, then
this casts doubt on the conclusions of the research
Table 11.2a
Regression User's Checklist
Table 11.2b
Regression User's Checklist
Table 11.2c
Regression User's Checklist
Table 11.2d
Regression User's Checklist
Table 11.3a
Regression User's Guide
Table 11.3b
Regression User's Guide
Table 11.3c
Regression User's Guide
Data collection
Missing data
Surveys
Panel data
A Regression User's Guide
Chapter 12
Time-Series Models
Dynamic Models:
Distributed Lag Models
(12.2)
Dynamic Models:
Distributed Lag Models (cont.)
2. In large part because of this multicollinearity, there is no
guarantee that the estimated coefficients will follow the smoothly
declining pattern that economic theory would suggest
Instead, it's quite typical to get something like:
(12.3)
β1 = λβ0
β2 = λ²β0
β3 = λ³β0
.
.
βp = λᵖβ0
(12.2)
(12.8)
As long as λ is between 0 and 1, these coefficients will indeed smoothly
decline, as shown in Figure 12.1
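The geometric decline can be sketched numerically (the β0 and λ values below are hypothetical):

```python
# With 0 < lam < 1, the implied lag coefficients beta_k = beta0 * lam**k
# decline smoothly toward zero (hypothetical beta0 and lam values)
beta0, lam = 2.0, 0.6
betas = [beta0 * lam**k for k in range(6)]
print(betas)  # each coefficient is 60% of the previous one
```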
2. Dynamic models:
Now serial correlation causes bias in the coefficients produced by OLS
Compounding all this is the fact that the consequences, detection, and
remedies for serial correlation that we discussed in Chapter 9 are all either
incorrect or need to be modified in the presence of a lagged dependent
variable
We will now discuss the issues of testing and correcting for serial correlation
in dynamic models in a bit more detail
et = Yt − Ŷt = Yt − α̂0 − β̂0X1t − λ̂Yt−1
2.
(12.19)
instrumental variables:
substituting an instrument (a variable that is highly correlated with Yt−1 but
is uncorrelated with ut) for Yt−1 in the original equation effectively eliminates
the correlation between Yt−1 and ut
Problem: good instruments are hard to come by (also see Section 14.3)
modified GLS:
Technique similar to the GLS procedure outlined in Section 9.4
Potential issues: sample must be large and the standard
Granger Causality
Granger causality, or precedence, is a circumstance in which
one time-series variable consistently and predictably changes
before another variable
A word of caution: even if one variable precedes (Granger-causes)
another, this does not mean that the first variable
causes the other to change
There are several tests for Granger causality
They all involve distributed lag models in one form or another,
however
We'll discuss an expanded version of a test originally
developed by Granger
(12.20)
and test the null hypothesis that the coefficients of the lagged As
jointly equal zero
If we can reject this null hypothesis using the F-test, then we
have evidence that A Granger-causes Y
Note that if p = 1, Equation 12.20 is similar to the dynamic
model, Equation 12.3
Applications of this test involve running two Granger tests, one
in each direction
Independent variables can appear to be more significant than they actually are
if they have the same underlying trend as the dependent variable
Example: In a country with rampant inflation almost any nominal variable will
appear to be highly correlated with all other nominal variables
Why?
Nominal variables are unadjusted for inflation, so every nominal variable will have
a powerful inflationary component
Such a problem is an example of spurious correlation:
a strong relationship between two or more variables that is not caused by a real
underlying causal relationship
If you run a regression in which the dependent variable and one or more independent
variables are spuriously correlated, the result is a spurious regression, and the
t-scores and overall fit of such spurious regressions are likely to be overstated and
untrustworthy
If a series is nonstationary, that problem is often referred to as
nonstationarity
(12.22)
Can you see that if |γ| < 1, then the expected value of Yt will eventually
approach 0 (and therefore be stationary) as the sample size gets bigger and
bigger? (Remember, since vt is a classical error term, its expected value = 0)
Similarly, can you see that if |γ| > 1, then the expected value of Yt will
continuously increase, making Yt nonstationary?
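This can be illustrated by simulating Yt = γYt−1 + vt for a stationary and a unit-root value of γ (both values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

def ar1(gamma):
    # Simulate Y_t = gamma * Y_{t-1} + v_t with a classical error term
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = gamma * y[t - 1] + rng.normal()
    return y

stationary = ar1(0.7)   # |gamma| < 1: hovers around its mean of 0
random_walk = ar1(1.0)  # gamma = 1: a nonstationary random walk

print(np.var(stationary[:n // 2]), np.var(stationary[n // 2:]))
print(np.var(random_walk[:n // 2]), np.var(random_walk[n // 2:]))
```

The stationary series keeps roughly the same variance in both halves of the sample, while the random walk wanders.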
(12.23)
(12.22)
(12.26)
(12.27)
where β1 = γ − 1
Note: alternative Dickey-Fuller tests additionally include
a constant and/or a constant and a trend term
2. Set up the test hypotheses:
H0: β1 = 0 (unit root)
HA: β1 < 0 (stationary)
Cointegration
Cointegration (cont.)
Cointegration (cont.)
Next, perform a Dickey-Fuller test on the residuals
Remember to use the critical values from the Dickey-Fuller Table!
Chapter 13
Thus, R² is likely to be much lower than 1 even if the model actually does an
exceptional job of explaining the choices involved
As an alternative, one can instead use R²p, a measure based on the
percentage of the observations in the sample that a particular estimated
equation explains correctly
To use this approach, consider a D̂i > .5 to predict that Di = 1 and a D̂i < .5
to predict that Di = 0 and then simply compare these predictions with the
actual Di
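A minimal sketch of this computation (the predicted probabilities and observed choices are hypothetical):

```python
import numpy as np

# Hypothetical predicted probabilities and observed 0/1 choices
p_hat = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.1])
d = np.array([1, 1, 0, 0, 1, 1])

predictions = (p_hat > 0.5).astype(int)  # predict D_i = 1 when p_hat > .5
rp2 = np.mean(predictions == d)          # share of observations predicted correctly
print(rp2)
```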
2.
Figure 13.1
A Linear Probability Model
is bounded by 1 and 0
Figure 13.2
Is Bounded by 0
and 1 in a Binomial Logit Model
Interpreting Estimated
Logit Coefficients
The signs of the coefficients in the logit model have the
same meaning as in the linear probability (i.e. OLS) model
The interpretation of the magnitude of the coefficients
differs, though, since the dependent variable has changed
dramatically
That the marginal effects are not constant can be seen
from Figure 13.2: the slope (i.e. the change in probability)
of the graph of the logit changes as the predicted probability moves from 0 to 1!
We'll consider three ways for helping to interpret logit
coefficients meaningfully:
From this, again, the marginal impact of X does indeed depend on the value of D̂i
Chapter 14
Simultaneous Equations
(14.1)
(14.2)
(14.3)
The main problem with simultaneous systems is that
they violate Classical Assumption III (the error term
and each explanatory variable should be uncorrelated)
Reduced-Form Equations
An alternative way of expressing a simultaneous equations
system is through the use of reduced-form equations
Reduced-form equations express a particular
endogenous variable solely in terms of an error term and all
the predetermined (exogenous plus lagged endogenous)
variables in the simultaneous system
Reduced-Form Equations
(cont.)
The reduced-form equations for the structural
Equations 14.2 and 14.3 would thus be:
Y1t = π0 + π1X1t + π2X2t + π3X3t + v1t   (14.6)
Y2t = π4 + π5X1t + π6X2t + π7X3t + v2t   (14.7)
Reduced-Form Equations
(cont.)
There are at least three reasons for using reduced-form equations:
1. Since the reduced-form equations have no inherent simultaneity, they
do not violate Classical Assumption III
Therefore, they can be estimated with OLS without encountering the
problems discussed in this chapter
The reason for this is that the two error terms of Equations 14.11
and 14.12 are correlated with the endogenous variables when
they appear as explanatory variables
Figure 14.3
A Shifting Supply Curve
Figure 14.4
When Both Curves Shift
Table 14.1a
Data for a Small Macromodel
Table 14.1b
Data for a Small Macromodel
Chapter 15
Forecasting
What Is Forecasting?
In econometrics, forecasting is the estimation of the expected value of
a dependent variable for observations that are not part of the same
data set
In most forecasts, the values being predicted are for time periods in
the future, but cross-sectional predictions of values for countries or
people not in the sample are also common
(15.2)
Figure 15.1 illustrates two examples
Figure 15.1a
Forecasting Examples
Figure 15.1b
Forecasting Examples
More Complex
Forecasting Problems
Forecasting Confidence
Intervals
The techniques we use to test hypotheses can also be
adapted to create forecasting confidence intervals
Given a point forecast,
all we need to generate a
confidence interval around that forecast are tc, the critical
t-value (for the desired level of confidence), and SF, the
estimated standard error of the forecast:
(15.11)
The critical t-value, tc, can be found in Statistical Table
B-1 (for a two-tailed test with T-K-1 degrees of freedom)
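A sketch of Equation 15.11 with made-up numbers (the point forecast, SF, and degrees of freedom are hypothetical), using scipy to look up the critical t-value instead of Table B-1:

```python
from scipy import stats

# Hypothetical point forecast and estimated standard error of the forecast
y_hat, s_f = 120.0, 8.0
df = 30                       # T - K - 1 degrees of freedom
t_c = stats.t.ppf(0.975, df)  # two-tailed 95% critical t-value

# Confidence interval: point forecast +/- t_c * S_F
lower, upper = y_hat - t_c * s_f, y_hat + t_c * s_f
print(f"95% forecast interval: ({lower:.1f}, {upper:.1f})")
```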
Forecasting Confidence
Intervals (cont.)
Lastly, the standard error of the forecast, SF, for an equation with just
one independent variable, equals the square root of the forecast error
variance:
(15.13)
where:
s2
Figure 15.2
A Confidence Interval for
2.
ARIMA Models
Chapter 16
Random Assignment
Experiments
When medical researchers want to examine the effect of a new drug, they
use an experimental design called a random assignment experiment
(16.1)
where:
OUTCOMEi = a measure of the desired outcome in the ith individual
TREATMENTi = a dummy variable equal to 1 for individuals in
the treatment group and 0 for individuals in the control group
Random Assignment
Experiments (cont.)
But random assignment can't always control for all possible
other factors, though sometimes we may be able to identify some
of these factors and add them to our equation
Let's say that the treatment is job training:
Suppose that random assignment, by chance, results in one group having more
males and being slightly older than the other group
If gender and age matter in determining earnings, then we can control for the
different composition of the two groups by including gender and age in our
regression equation:
(16.2)
Random Assignment
Experiments (cont.)
1. Non-Random Samples:
Most subjects in economic experiments are volunteers, and samples of
volunteers often aren't random and therefore may not be representative of the
overall population
As a result, our conclusions may not apply to everyone
2. Unobservable Heterogeneity:
In Equation 16.2, we added observable factors to the equation to avoid omitted
variable bias, but not all omitted factors in economics are observable
This unobservable omitted variable problem is called unobserved
heterogeneity
Random Assignment
Experiments (cont.)
3. The Hawthorne Effect:
Human subjects typically know that they're being studied, and they
usually know whether they're in the treatment group or the control group
The fact that human subjects know that they're being observed
sometimes can change their behavior, and this change in behavior could
clearly change the results of the experiment
4. Impossible Experiments:
It's often impossible (or unethical) to run a random assignment
experiment in economics
Think about how difficult it would be to use a random assignment
experiment to study the impact of marriage on earnings!
Natural Experiments
Natural experiments (or quasi-experiments) are similar
to random assignment experiments, except:
observations fall into treatment and control groups
naturally (because of an exogenous event) instead of
being randomly assigned by the researcher
By "exogenous event" is meant that the natural event
must not be under the control of either of the two groups
Panel (or longitudinal) data combine time-series and cross-sectional
data such that observations on the same variables from the
same cross-sectional sample are followed over two or more
different time periods
2. Variables that change over time but are the same for all
individuals in a given time period:
e.g., the retail price index and the national unemployment rate
(16.4)
where:
D2 = intercept dummy equal to 1 for the second cross-sectional entity
and 0 otherwise
DN = intercept dummy equal to 1 for the Nth cross-sectional entity
and 0 otherwise
One major advantage of the fixed effects model is that it avoids bias
due to omitted variables that don't change over time
e.g., race or gender
Such time-invariant omitted variables often are referred to as unobserved
heterogeneity or a fixed effect
To understand how this works, consider what Equation 16.4 would look
like with only two years' worth of data:
Yit = β0 + β1Xit + β2D2i + vit   (16.5)
(16.6)
(16.7)
Next, average Equation 16.7 over time for each observation i, thus
producing:
Ȳi = β0 + β1X̄i + β2D2i + ε̄i + ai   (16.8)
where the bar over a variable indicates the mean of that variable
across time
Note that ai, β2D2i, and β0 don't have bars over them because they're
constant over time
Note that ai, β2D2i, and β0 are subtracted out because they're in both
equations
We've therefore shown that estimating panel data with the fixed effects
model does indeed drop the ai out of the equation
Hence, the fixed effects model will not experience bias due to time-invariant omitted variables!
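The subtraction above is just within-entity demeaning; a tiny hypothetical panel makes this concrete:

```python
import pandas as pd

# Tiny hypothetical panel: two entities observed in two years
df = pd.DataFrame({
    "entity": ["A", "A", "B", "B"],
    "year":   [1,    2,   1,   2],
    "y":      [10.0, 12.0, 5.0, 9.0],
    "x":      [1.0,  3.0,  2.0, 4.0],
})

# Demeaning within each entity subtracts out anything constant over
# time (the fixed effect a_i), leaving only within-entity variation
demeaned = df.groupby("entity")[["y", "x"]].transform(lambda g: g - g.mean())
print(demeaned)
```

Any entity-level constant, observed or not, is removed by this step before the slope is estimated.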
Figure 16.3
In a Panel Data Model, the Murder
Rate Decreases with Executions
One key is the nature of the relationship between ai and the Xs:
If they're likely to be correlated, then it makes sense to use the fixed
effects model
If not, then it makes sense to use the random effects model
If the two sets of estimates are not different, then the random effects model is
preferred (or estimates of both the fixed effects and random effects models are
provided)
Table 16.1a
Table 16.1b
Table 16.1c
Table 16.1d
Table 16.1e
Chapter 17
Statistical Principles
Probability
For example, when a fair six-sided die is rolled, there are six equally
likely outcomes, each with a 1/6 probability of occurring
(17.1)
(17.2)
The standard deviation is the square root of the variance
Continuous Random
Variables
Our examples to this point have involved discrete random variables,
for which we can count the number of possible outcomes:
The coin can be heads or tails; the die can be 1, 2, 3, 4, 5, or 6
Figure 17.2
Pick a Number, Any Number
Standardized Variables
Z=
(17.3)
No matter what the initial units of X, the standardized random variable Z has
a mean of 0 and a standard deviation of 1
Figures 17.4 and 17.5 illustrates this for the case of dice and fair coin flips,
respectively
ti l
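A quick numerical check of this property (the values of X are arbitrary):

```python
import numpy as np

# Standardizing: Z = (X - mean) / standard deviation
x = np.array([2.0, 4.0, 6.0, 8.0])
z = (x - x.mean()) / x.std()

print(z.mean(), z.std())  # mean 0 and standard deviation 1
```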
Figure 17.6
The Normal Distribution
Sampling
First, lets define some key terms:
Population: the entire group of items that interests us
Sample: the part of this population that we actually
observe
Statistical inference involves using the sample to draw
conclusions about the characteristics of the population
from which the sample came
Selection Bias
Any sample that differs systematically from the population that it is
intended to represent is called a biased sample
One of the most common causes of biased samples is selection bias,
which occurs when the selection of the sample systematically
excludes or underrepresents certain groups
Selection bias often happens when we use a convenience sample
consisting of data that are readily available
Survivor and
Nonresponse Bias
A retrospective study looks at past data for a contemporaneously
selected sample
for example, an examination of the lifetime medical records of 65-year-olds
Estimation
First, some terminology:
Estimator: a sample statistic that will be used to estimate the value of
the population parameter
Sampling Distributions
The sampling distribution of a statistic is the probability distribution
or density curve that describes the population of all possible values
of this statistic
For example, it can be shown mathematically that if the individual
observations are drawn from a normal distribution, then the sampling
distribution for the sample mean is also normal
Even if the population does not have a normal distribution, the sampling
distribution of the sample mean will approach a normal distribution as the
sample size increases
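A short simulation of this result (the population, sample size, and number of samples are arbitrary): even for a decidedly non-normal population, the sample means cluster tightly around the population mean.

```python
import numpy as np

rng = np.random.default_rng(6)

# Population far from normal: uniform on [0, 1], population mean 0.5
# Draw 10,000 samples of size N = 50 and compute each sample mean
sample_means = rng.uniform(size=(10000, 50)).mean(axis=1)

# The sampling distribution is centered on the population mean
# with a much smaller spread (the standard error) than the population
print(sample_means.mean(), sample_means.std())
```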
The t-Distribution
Confidence Intervals
X̄ ± t*s/√N
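A sketch of this interval with a hypothetical sample, using scipy for the critical t-value:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of N = 25 observations
x = np.array([9.8, 10.2, 10.5, 9.9, 10.1] * 5)
n, mean, s = len(x), x.mean(), x.std(ddof=1)

t_star = stats.t.ppf(0.975, df=n - 1)  # 95% two-tailed critical value
lower = mean - t_star * s / np.sqrt(n)
upper = mean + t_star * s / np.sqrt(n)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```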
Selection, survivor,
and nonresponse bias
Sampling distribution
Population mean
Sample mean
Standard deviation
Population standard
deviation
Standardized
random variable
Population
Sample
Degrees of freedom
Confidence interval