Documente Academic
Documente Profesional
Documente Cultură
Joseph
Original
Regression
Blackwell
Oxford,
Teaching
TEST
© G.
3100141-982X
25
2003 Article
00Teaching
UK Eisenhauer
through
Statisticsthe
Statistics
Publishing LtdOrigin
Trust 2001
Yi = β0 + β1xi + ei (1)
ª INTRODUCTION ª
where β0 is the intercept, β1 is the slope and ei
lthough ordinary least-squares (OLS) regres- denotes the ith residual. Lagging observations
A sion is one of the most familiar statistical
tools, far less has been written − especially in the
and taking first differences (i.e. subtracting each
observation from its successor) to correct for serial
pedagogical literature − on regression through correlation in the errors requires transforming equa-
the origin (RTO). Indeed, the subject is surpris- tion (1) into an RTO equation of the form
ingly controversial. The present note highlights
situations in which RTO is appropriate, discusses Yi − Yi −1 = β1(xi − xi −1) + (ei − ei −1)
the implementation and evaluation of such models
and compares RTO functions among three pop-
Alternatively, applying weighted least squares
ular statistical packages. Some examples gleaned
to correct for heteroscedasticity will result in a
from past Teaching Statistics articles are used as
model with no intercept if the weighting factor
illustrations. For expository convenience, OLS
(z) is not an independent variable. In that case,
and RTO refer here to linear regressions obtained
β0 becomes a coefficient and equation (1) is
by least-squares methods with and without a con-
replaced by a multiple linear regression without
stant term, respectively.
a constant:
where the left-hand side is the sum of squares total ∑Y i2 = ∑(Yi − Yi )2 + ∑Y i2 (3′)
(SST), the first term on the right is the sum of
squares due to error (SSE) and the final term is
Applying equation (3′ ) rather than equation (3) to
the sum of squares due to regression (SSR). The
RTO, one finds that SSE is unchanged, but SST =
coefficient of determination for OLS is then
∑Y i2 and SSR = ∑Y i2. Redefining SST and SSR in
defined by the ratio of SSR to SST
this manner results in
∑(Yi − y )2
R2 = ∑Y i2
∑(Yi − y )2 R2 = (4′)
∑Y i2
or equivalently
a strictly non-negative coefficient of determina-
∑(Yi − Yi ) 2 tion that equals or exceeds the measure in equa-
R2 = 1 − (4) tion (4). Of course, these definitions also affect the
∑(Yi − y )2
adjusted R2 and F statistics, but do not alter the
standard error of the regression (Se). Note that,
Some authors maintain that because this diag-
without a constant, the degrees of freedom for
nostic measure is based on an identity, it should
SST, SSR and SSE are n, k and n – k, respectively,
not depend on the inclusion or exclusion of a
where n is the sample size and k is the number of
constant term in the regression. From that per-
spective, equation (4) is equally valid for RTO independent variables; thus, Se = SSE /(n − k )
and OLS.
regardless of how SST is defined.
However, when there is no constant in the regres-
The controversy over SST is not merely academic:
sion, ∑(Yi − Yi )(Yi − y ) will generally take a
practitioners (and students) running RTO will
nonzero value, so equation (3) is not a valid
obtain various outputs depending on which com-
basis for analysis of variance in RTO. And if
puter packages they use. Indeed, as Prvan et al.
the RTO model provides a sufficiently poor fit,
(2002, p. 74) observed in a recent comparison of
the data may exhibit more variation around the
Minitab, SPSS and Excel, ‘Obtaining a simple
regression line than around y, in which case
linear regression is easy in all three packages, and
∑(Yi − Yi)2 > ∑(Yi − y )2. Heedlessly applying
all three give the standard output options (such as
equation (4) would then result in an implausibly
regression through the origin)’. But in fact the three
negative (and thus uninterpretable) coefficient
packages all give different outputs for RTO! Two
of determination as well as a negative F ratio.
illustrative examples are provided below.
Moreover, it is often argued that defining SST as
the sum of squared deviations from the mean is
inappropriate when the regression line is forced
through the origin but does not necessarily pass
ª EXAMPLES ª
through (x,y ); when so viewed, equation (2) is
replaced by the identity
In Kimber’s essay on the shape of birds’ eggs,
(Yi − 0) = (Yi − Yi ) + (Yi − 0) (2′) egg height is regressed on width, both with and
without an intercept (Kimber 1995). Her study of
Squaring and summing yields 281 species can be approximately replicated using
the 96 observations provided in the Data Bank
∑Yi2 = ∑(Yi − Yi )2 + ∑Y 2i + 2∑Yi (Yi − Yi ) section of the Summer 1990 issue of Teaching
Statistics (Data Bank 1990). Regardless of which
but the final (cross-product) term in this equation computer package is used, OLS yields the follow-
equals zero under RTO, because ing output: