Regression for Scientists and Engineers
Thomas P. Ryan
NIST, external

Course Outline
• robust regression
• nonlinear regression
Chi-Square Distribution
• The transformation $\chi^2 = \dfrac{(n-1)s^2}{\sigma^2}$

t-Distribution
• The transformation $t = \dfrac{\bar{y} - \mu}{s/\sqrt{n}}$

F-distribution
Confidence Intervals

Prediction Intervals

• A short, necessary excursion:
$Z = \dfrac{\bar{Y} - \mu}{\sigma/\sqrt{n}}$ is N(0, 1).
Since $\dfrac{\bar{Y} - \mu}{s/\sqrt{n}}$ is $t_{n-1}$, we then have

$$P\left(-t_{\alpha/2,\,n-1} \le \frac{\bar{Y} - \mu}{s/\sqrt{n}} \le t_{\alpha/2,\,n-1}\right) = 1 - \alpha$$

so that

Lower Limit: $\bar{Y} - t_{\alpha/2,\,n-1}\; s/\sqrt{n}$
Upper Limit: $\bar{Y} + t_{\alpha/2,\,n-1}\; s/\sqrt{n}$
Hypothesis Tests
• p-value
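The interval and p-value computations above are easy to verify numerically. Below is a minimal Python sketch (not part of the original course materials); the sample values and the hypothesized mean mu0 are made up for illustration.

```python
# Minimal sketch: t-based confidence limits and a two-sided p-value
# for a single mean. The data and mu0 below are hypothetical.
import numpy as np
from scipy import stats

y = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4])  # made-up sample
n, ybar, s = len(y), y.mean(), y.std(ddof=1)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
lower = ybar - t_crit * s / np.sqrt(n)   # Lower Limit
upper = ybar + t_crit * s / np.sqrt(n)   # Upper Limit

mu0 = 10.0                                # H0: mu = mu0
t_stat = (ybar - mu0) / (s / np.sqrt(n))
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
print(lower, upper, t_stat, p_value)
```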
• NIST applications:
  • Alaska pipeline

General Applications
• Prediction
  ("Statistics is prediction", a quote from Ed Deming)
• Control
  This is a seldom-mentioned but important use of regression analysis, as it is often necessary to try to control the value of the dependent variable, such as a river pollutant, at a particular level. (See Section 1.8.1 on page 30 for details.)
(B) Specific:
• Calibration
  Such as instrument calibration using inverse regression, the classical theory of calibration (Section 1.8.2), or Bayesian calibration. This will be discussed later in these notes.
• Process Monitoring
  A regression control chart or a cause-selecting chart might be used. Both employ regression methods. See Sections 12.7 and 12.8 of Statistical Methods for Quality Improvement, 2nd ed., by T.P. Ryan for details.
Regression Models

Simple Linear Regression (linear in the parameters):
$$Y = \beta_0 + \beta_1 X + \epsilon$$
($\beta_1$ is the slope; $\beta_0$ is the Y-intercept. Paradoxically, $\beta_0$ is viewed as a nuisance parameter in most applications, but no-intercept models are rarely used.)

Prediction equation: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$

Multiple Linear Regression:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_m X_m + \epsilon$$

Prediction equation: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_m X_m$
Regression Basics

OLS minimizes $\sum (Y - \hat{Y})^2$, with $\sum (Y - \hat{Y}) = 0$.
[Regression plot of Y against X1: Y = -75.4468 + 0.0207766 X1, with S = 12.4929, R-Sq = 41.0%, R-Sq(adj) = 40.8%.]
$$\hat{\beta}_1 = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{S_{xy}}{S_{xx}} \qquad\qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}$$
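A minimal Python sketch of the least squares formulas above (the data are made up for illustration); it also confirms the property from the previous page that the residuals sum to zero.

```python
# Sketch of the closed-form least squares estimates (made-up data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical regressor
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical response

Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Sxx = np.sum((x - x.mean()) ** 2)

b1 = Sxy / Sxx                   # beta1_hat = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()    # beta0_hat = ybar - beta1_hat * xbar

e = y - (b0 + b1 * x)
print(b0, b1, e.sum())           # residuals sum to (numerically) zero
```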
• Additional terminology:
  The X's will in these notes additionally be referred to as regressors and as predictors.
Residuals

$Y_i - \hat{Y}_i = e_i$ is the ith (raw) residual. The $e_i$ are substitutes for the (unobservable) model errors $\epsilon_i$.
Model Assumptions

(1) that the model being used is an appropriate one, and
(2) that $\epsilon \sim \mathrm{NID}(0, \sigma^2)$.

In words, the errors are assumed to be normally distributed (N), independent (ID), and have a variance ($\sigma^2$) that is constant and doesn't depend on any factors in or not in the model.
Checking Assumptions

The standardized residuals are $e_i / \left(s\sqrt{1 - h_i}\right)$, where the leverages $h_i$ are given by
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}$$
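A sketch of the leverage and standardized-residual computations above (the data are invented; the last point is placed far to the right so that its leverage stands out).

```python
# Leverages h_i and standardized residuals for a straight-line fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 12.0])   # made-up data
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9, 24.0])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)                       # raw residuals

h = 1 / n + (x - x.mean()) ** 2 / Sxx       # leverages
s = np.sqrt(np.sum(e ** 2) / (n - 2))       # residual standard deviation
std_resid = e / (s * np.sqrt(1 - h))        # standardized residuals
print(h.round(3))
print(std_resid.round(2))
```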
Simulation Envelopes
EXAMPLE

Alaska pipeline data (calibration data)
http://www.itl.nist.gov/div898/handbook/pmd/section6/pmd621.htm
• First step? Plot the values of field against lab.

[Scatter plot of field against lab.]
There is a linear relationship, but the vertical spread increases as X increases.

[Plot of the standardized residuals against lab, showing the increasing spread.]
Predictor    Coef    SE Coef       T      P
Constant    4.994      1.126    4.44   0.00
lab         0.731      0.025   29.78   0.00

S = 6.081    R-Sq = 89.4%    R-Sq(adj) = 89.3%

Analysis of Variance
Source            DF      SS      MS       F      P
Regression         1   32789   32789  886.74   0.00
Residual Error   105    3883      37
Total            106   36672
• SE(constant) = $s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}}$
• SE(lab) = $\dfrac{s}{\sqrt{S_{xx}}}$

T = Coef/SE(Coef)

S = $\sqrt{MS_{\text{residual error}}}$
• DF(regression) = the number of the predictor(s)
• DF(total) = $n - 1$
• SS denotes Sum of Squares
• SS(residual error) = $\sum(Y - \hat{Y})^2$
• SS(Total) = $\sum(Y - \bar{Y})^2$
• MS = SS/DF
• F = MS(regression)/MS(residual error)
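A minimal sketch of the decomposition just defined, for a straight-line fit (the data are made up; any equal-length x, y arrays would do).

```python
# Sums of squares, mean squares, and the ANOVA F-statistic.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.8, 4.2, 5.7, 8.3, 9.6, 12.1])
n = len(y)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)   # SS(Total),          DF = n - 1
ss_resid = np.sum((y - yhat) ** 2)       # SS(residual error), DF = n - 2
ss_reg = ss_total - ss_resid             # SS(regression),     DF = 1

F = (ss_reg / 1) / (ss_resid / (n - 2))  # MS(regression)/MS(residual error)
p = stats.f.sf(F, 1, n - 2)              # ANOVA p-value
print(round(F, 2), round(p, 4))
```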
Unusual Observations

Obs    lab    field
 15   81.5   50.00
 17   81.5   50.00
 35   80.4   50.00
 37   80.9   50.00
 55   80.0   85.00
100   77.4   45.00

Fit is $\hat{Y}$; Std Resid is $e_i / s_{e_i}$, as previously defined.
[Regression plot of field against lab: field = 4.99368 + 0.731111 lab, with S = 6.08092, R-Sq = 89.4%, R-Sq(adj) = 89.3%.]
Analysis of Variance
Source            DF      SS      MS       F      P
Regression         1   32789   32789  886.74   0.00
Residual Error   105    3883    37.0
  Lack of Fit     76    2799    36.8    0.99   0.54
  Pure Error      29    1084    37.4
Total            106   36672

60 rows with no replicates

Pure error is a measure of the vertical spread in the data, with the sum of squares for pure error ($SS_{\text{pe}}$) computed using Eq. (1.17) on page 25.
$$SS_{\text{lof}} = SS_{\text{residual error}} - SS_{\text{pe}}$$

• The test for lack of fit is an F-test given by
$$F = \frac{MS_{\text{lof}}}{MS_{\text{pe}}}$$
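A sketch of this decomposition: SS(pure error) comes from the spread of y within replicated x values, and SS(lof) is what remains of SS(residual error). The data below are made up, with replicates at x = 1, 2, 3.

```python
# Lack-of-fit F-test for a straight-line fit with some replicated x's.
import numpy as np
from scipy import stats

x = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0])
y = np.array([1.9, 2.3, 4.4, 3.8, 5.6, 6.0, 8.5, 9.4])
n = len(y)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
ss_resid = np.sum((y - (b0 + b1 * x)) ** 2)

# Pure error: within-group spread of y at each replicated x value.
ss_pe = sum(np.sum((y[x == v] - y[x == v].mean()) ** 2) for v in np.unique(x))
df_pe = sum(np.sum(x == v) - 1 for v in np.unique(x))

ss_lof, df_lof = ss_resid - ss_pe, (n - 2) - df_pe
F = (ss_lof / df_lof) / (ss_pe / df_pe)
print(round(F, 2), round(stats.f.sf(F, df_lof, df_pe), 3))
```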
• Consequence: consider transforming the model. Two candidate forms are
(a) $Y^{(\lambda)} = \beta_0 + \beta_1 X + \epsilon$
(b) $Y = \left(\beta_0 + \beta_1 X\right)^{(\lambda)} + \epsilon$
[Table: for λ = −1.0, −0.9, ..., 0.8, the profile loglikelihood together with $r^2_{Y,\hat{Y}_{raw}}$, $R^2_{raw}$, $r_{ZZ}$, and SPREAD-RATIO for each candidate transformation of Y. The $r_{ZZ}$ column rises from 0.866 at λ = −1.0 to a maximum of 0.993 near λ = 0.7 before declining.]
c DEFINITION OF TERMS:
(1) r@ @V --- This is the square of the
correlation between Y
and the predicted values
converted back to the original
scale.
raw
(2)
Rraw
@ ^@V
@ ^@
1 ^
raw
(4)
rZ Z
(5)
---
65
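A sketch of how such a λ scan can be produced. Note the assumption: scipy's boxcox_llf profiles the Box-Cox loglikelihood for a single positive-valued sample, whereas the table above profiles it for the regression model, but the mechanics of scanning a λ grid are the same. The data below are simulated.

```python
# Scan the Box-Cox loglikelihood over the lambda grid used above.
import numpy as np
from scipy import stats

y = np.random.default_rng(1).gamma(shape=2.0, scale=3.0, size=50) + 0.1

lambdas = np.arange(-1.0, 0.81, 0.1)
loglik = [stats.boxcox_llf(lmb, y) for lmb in lambdas]
best = lambdas[int(np.argmax(loglik))]
print(f"lambda maximizing the loglikelihood: {best:.1f}")
```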
[Table: the corresponding statistics ($r^2_{Y,\hat{Y}}$, $R^2$, SPREAD-RATIO, and the loglikelihood) for six candidate power transformations of X.]

The transformation that performs well and is easiest to justify is preferred.

DEFINITIONS:
(1) λ --- the power transformation of X
[Plots of the standardized residuals against $\hat{Y}$ for the transformed model.]
• Definitions:

(1) Regression Outlier
  A point that deviates from the linear relationship determined from the other n − 1 points.

(2) Residual Outlier
  A point that has a large (standardized) residual.

(3) X-outlier
  This is a point that is outlying only in regard to the x-coordinate(s). An X-outlier could also be a regression and/or residual outlier.

(4) Y-outlier
  This is a point that is outlying only because its y-coordinate is extreme. The manner and extent to which such an outlier will affect the parameter estimates will depend upon both its x-coordinate and the general configuration of the other points. Thus, the point might also be a regression and/or residual outlier.
Pontius Data
Y is Deflection
X is Load
Y (Deflection)    X (Load)
0.11019            150000
0.21956            300000
0.32949            450000
0.43899            600000
0.54803            750000
0.65694            900000
0.76562           1050000
0.87487           1200000
0.98292           1350000
1.09146           1500000
1.20001           1650000
1.30822           1800000
1.41599           1950000
1.52399           2100000
1.63194           2250000
1.73947           2400000
1.84646           2550000
1.95392           2700000
2.06128           2850000
2.16844           3000000
0.11052            150000
0.22018            300000
0.32939            450000
0.43886            600000
0.54798            750000
0.65739            900000
0.76596           1050000
0.87474           1200000
0.98300           1350000
1.09150           1500000
1.20004           1650000
1.30818           1800000
1.41613           1950000
1.52408           2100000
1.63159           2250000
1.73965           2400000
1.84696           2550000
1.95445           2700000
2.06177           2850000
2.16829           3000000
[Scatter plot of the Pontius data: Deflection against Load.]
Y          |Y − Ŷ|
0.11019    0.0042751
0.21956    0.0032205
0.32949    0.0016058
0.43899    0.004212
0.54803    0.0003034
0.65694    0.0008980
0.76562    0.0012626
0.87487    0.0021972
0.98292    0.0019318
1.09146    0.0021564
1.20001    0.0023911
1.30822    0.0022857
1.41599    0.0017403
1.52399    0.0014249
1.63194    0.0010595
1.73947    0.0002741
1.84646    0.0010513
1.95392    0.0019067
2.06128    0.0028620
2.16844    0.0040174
0.11052    0.0039451
0.22018    0.0026005
0.32939    0.0017058
0.43886    0.0005512
0.54798    0.0002534
0.65739    0.0013480
0.76596    0.0016026
0.87474    0.0020672
0.98300    0.0020118
1.09150    0.0021964
1.20004    0.0024211
1.30818    0.0022457
1.41613    0.0018803
1.52408    0.0015149
1.63159    0.0007095
1.73965    0.0004541
1.84696    0.0005513
1.95445    0.0013767
2.06177    0.0023720
2.16829    0.0041674

Average of |Y − Ŷ| = 0.00183
[Plot of the standardized residuals against X for the straight-line fit.]
[Partial residual plot against X.]
Y          |Y − Ŷ| (quadratic model)
0.11019    0.0002213
0.21956    0.0004468
0.32949    0.0000299
0.43899    0.0002188
0.54803    0.0000900
0.65694    0.0000265
0.76562    0.0002309
0.87487    0.0002770
0.98292    0.0002728
1.09146    0.0001905
1.20001    0.0000441
1.30822    0.0000810
1.41599    0.0001799
1.52399    0.0000686
1.63194    0.0001350
1.73947    0.0000608
1.84646    0.0004112
1.95392    0.0002709
2.06128    0.0000884
2.16844    0.0000363
0.11052    0.0001087
0.22018    0.0001732
0.32939    0.0000701
0.43886    0.0000888
0.54798    0.0000400
0.65739    0.0004235
0.76596    0.0001091
0.87474    0.0001470
0.98300    0.0001928
1.09150    0.0001505
1.20004    0.0000741
1.30818    0.0000410
1.41613    0.0000399
1.52408    0.0000214
1.63159    0.0002150
1.73965    0.0002408
1.84696    0.0000888
1.95445    0.0002591
2.06177    0.0004016
2.16829    0.0001137

Average of |Y − Ŷ| = 0.00016 for the model with linear and quadratic terms,
vs.
Average of |Y − Ŷ| = 0.00183 for the model with the linear term only.
Multiple Regression
Consequences of Multicollinearity
Orthogonal Regressors

 Y     X1   X2
23.3    5   17
24.5    6   14
27.2    8   14
27.1    9   17
24.1    7   13
23.4    5   17
24.3    6   14
24.1    7   13
27.2    9   17
27.3    8   14
27.4    8   14
27.3    9   17
24.3    6   14
23.4    5   17
24.1    7   13
27.0    9   17
23.5    5   17
24.3    6   14
27.3    8   14
23.7    7   13
Correlated Regressors

 Y     X1   X2
23.3    5   13
24.5    6   14
27.2    8   17
27.1    9   17
24.1    7   14
23.4    5   13
24.3    6   14
24.1    7   14
27.2    9   17
27.3    8   17
27.4    8   17
27.3    9   17
24.3    6   14
23.4    5   13
24.1    7   14
27.0    9   17
23.5    5   13
24.3    6   14
27.3    8   17
23.7    7   14
Note that the sign of $\hat{\beta}_1$ is now negative, even though neither Y nor X1 has changed. Only X2 has changed.
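This is easy to verify numerically; the sketch below fits both data sets above (the arrays are exactly the Y, X1, and two X2 columns just listed).

```python
# Fit Y on (X1, X2) for both data sets; only X2 differs between fits.
import numpy as np

Y = np.array([23.3, 24.5, 27.2, 27.1, 24.1, 23.4, 24.3, 24.1, 27.2, 27.3,
              27.4, 27.3, 24.3, 23.4, 24.1, 27.0, 23.5, 24.3, 27.3, 23.7])
X1 = np.array([5, 6, 8, 9, 7, 5, 6, 7, 9, 8, 8, 9, 6, 5, 7, 9, 5, 6, 8, 7])
X2_orth = np.array([17, 14, 14, 17, 13, 17, 14, 13, 17, 14,
                    14, 17, 14, 17, 13, 17, 17, 14, 14, 13])
X2_corr = np.array([13, 14, 17, 17, 14, 13, 14, 14, 17, 17,
                    17, 17, 14, 13, 14, 17, 13, 14, 17, 14])

for X2 in (X2_orth, X2_corr):
    X = np.column_stack([np.ones(len(Y)), X1, X2])
    b, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(b.round(3))      # watch the sign of the X1 coefficient
```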
Correlation Form

$$Y' = \frac{Y - \bar{Y}}{\sqrt{S_{yy}}} \quad \text{with} \quad S_{yy} = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$$

$$X'_j = \frac{X_j - \bar{X}_j}{\sqrt{S_{x_j x_j}}} \quad \text{with} \quad S_{x_j x_j} = \sum_{i=1}^{n}(X_{ij} - \bar{X}_j)^2 \qquad (j = 1, 2, \dots, m)$$
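A quick numerical check of the correlation form (the data below are simulated): after the centering and scaling above, $(X')^T X'$ is exactly the correlation matrix of the regressors.

```python
# Correlation form: (X')^T X' equals the regressors' correlation matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=20)

Xc = X - X.mean(axis=0)
Xd = Xc / np.sqrt(np.sum(Xc ** 2, axis=0))                  # X' columns
yd = (y - y.mean()) / np.sqrt(np.sum((y - y.mean()) ** 2))  # Y'

print(Xd.T @ Xd)             # (X')^T X'
print(np.corrcoef(X.T))      # agrees with the usual correlation matrix
```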
For two regressors in correlation form,

$$(X')^T X' = \begin{bmatrix} 1 & r_{12} \\ r_{12} & 1 \end{bmatrix} \qquad (X')^T Y' = \begin{bmatrix} r_{1y} \\ r_{2y} \end{bmatrix}$$

so

$$\left((X')^T X'\right)^{-1} = \frac{1}{1 - r_{12}^2}\begin{bmatrix} 1 & -r_{12} \\ -r_{12} & 1 \end{bmatrix}$$

and

$$\hat{\beta}' = \left((X')^T X'\right)^{-1}(X')^T Y' = \frac{1}{1 - r_{12}^2}\begin{bmatrix} 1 & -r_{12} \\ -r_{12} & 1 \end{bmatrix}\begin{bmatrix} r_{1y} \\ r_{2y} \end{bmatrix}$$

Thus,

$$\hat{\beta}'_1 = \frac{r_{1y} - r_{12}\, r_{2y}}{1 - r_{12}^2}$$

so $\hat{\beta}'_1$ will be negative whenever $r_{1y} - r_{12}\, r_{2y}$ is negative. $\hat{\beta}'_1$ has the same sign as $\hat{\beta}_1$, since

$$\hat{\beta}_1 = \hat{\beta}'_1\sqrt{\frac{S_{yy}}{S_{x_1 x_1}}}$$
http://www2.tltc.ttu.edu/westfall/images/5349/wrongsign.htm
$r_{jy}$     $\hat{\beta}_j$
 .640         0.01
 .353      -1.1E08
 .376         0.46
 .359        11.0
 .101        -0.16
-.289         0.28

The coefficients can disagree sharply with the correlation coefficients, with $\hat{\beta}_2$ being an enormous negative number even though $r_{2y}$ is not close to being negative.
Outlier-induced Multicollinearity
[Regression plot of X2 against X1: X2 = 3.58403 + 0.533613 X1, with S = 2.63210, R-Sq = 23.7%, R-Sq(adj) = 12.8%.]
[Regression plot of X2 against X1 with the outlying point included: X2 = 2.09192 + 0.921917 X1, with S = 2.67975, R-Sq = 87.7%, R-Sq(adj) = 86.2%.]
Detecting Multicollinearity
$Var(\hat{\beta}'_j) = \sigma'^2\, c_{jj}$, where VIF(j) = $c_{jj}$ is the jth diagonal element of the inverse of the correlation matrix of the regressors. Equivalently, VIF(j) = $1/(1 - R_j^2)$, with $R_j^2$ the $R^2$ value from regressing $X_j$ on the other regressors, and $c_{jj}$ can be written as

$$c_{jj} = \sum_{k=1}^{m} \frac{v_{jk}^2}{\lambda_k}$$

with $\lambda_k$ and $v_k$ the eigenvalues and eigenvectors of the correlation matrix.
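A sketch of the diagnostic (the data below are simulated; x1 and x2 are built to be highly correlated, x3 roughly orthogonal to both).

```python
# VIFs as the diagonal of the inverse correlation matrix.
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=30)
x2 = x1 + rng.normal(scale=0.2, size=30)   # highly correlated with x1
x3 = rng.normal(size=30)                   # roughly orthogonal
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(R))   # VIF(j) = 1 / (1 - R_j^2)
print(vif.round(2))               # large for x1, x2; near 1 for x3
```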
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j}(x_j - \bar{x})^2}$$
as given previously.
EXAMPLE:
Assume that two highly correlated regressors are combined with (r − 2) regressors, with the latter being orthogonal to the former. The (r − 2) Var($\hat{\beta}_j$) will be the same with or without the other two highly correlated regressors. This of course is because the (r − 2) $R_j^2$ values will not change when predictors are added that are orthogonal to the predictors already in the model.
• Yes, and no.
• First, the no:
Under the model assumption,
$$E\left(\sum(Y - \hat{Y})^2\right) = E(SSE) = (n - p - 1)\,\sigma^2$$
which does not depend upon the degree of multicollinearity (discussed on page 406 of my book).
• Options:
(1) place them at equal intervals between the lowest desired value and the highest desired value
(2) place the points at random between the two extreme values
(3) place half of the points at the largest value and half at the smallest value
Consider (3):
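Since Var($\hat{\beta}_1$) = $\sigma^2/S_{xx}$ for a straight-line fit, option (3) maximizes $S_{xx}$ and therefore estimates the slope most precisely. A small sketch (the range [0, 10] and n = 10 are arbitrary illustrative choices):

```python
# Compare Sxx, and hence Var(beta1_hat), under the three placements.
import numpy as np

lo, hi, n = 0.0, 10.0, 10
designs = {
    "(1) equal intervals":  np.linspace(lo, hi, n),
    "(2) random":           np.random.default_rng(3).uniform(lo, hi, n),
    "(3) half at each end": np.array([lo] * (n // 2) + [hi] * (n // 2)),
}
for name, x in designs.items():
    Sxx = np.sum((x - x.mean()) ** 2)
    print(f"{name:22s} Sxx = {Sxx:6.1f}   Var(b1) prop. to {1 / Sxx:.4f}")
```

Option (3) wins on variance, but with only two distinct x values there is no way to check the linearity assumption.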
Criterion                         Weight (%)
Academic reputation               25
Graduation rate                   16
Financial resources               10
Faculty compensation               7
% classes under 20                 6
SAT score                          6
% students in top 10% HS class     5.25
Graduation rate performance        5
Alumni giving                      5
Freshman retention                 4
% faculty with terminal degree     3
Acceptance rate                    2.25
Robust Regression
Nonlinear Regression

• Or will it work?
Taking logarithms of the multiplicative-error model $Y = \alpha X^{\beta}\,\epsilon$ gives a linear model, with
$Y^* = \ln(Y)$, $X^* = \ln(X)$, $\beta_0^* = \ln(\alpha)$, $\beta_1^* = \beta$, and $\epsilon^* = \ln(\epsilon)$.

• Question:
What if the error structure isn't multiplicative? We cannot linearize the model
$$Y = \alpha X^{\beta} + \epsilon$$

• The transformation
$$Y^{-1} = \beta_0 + \beta_1\left(X^{-1}\right) + \epsilon$$
linearizes the model
$$Y = \frac{X}{\beta_1 + \beta_0 X}$$
(apart from the error term).
• Thus, one must iterate, using the Gauss-Newton method, until the chosen convergence criterion is satisfied.
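To make the iteration concrete, here is a minimal Gauss-Newton sketch (not from the course materials) for the illustrative model y = a·exp(bx) + error, with simulated data and arbitrary starting values.

```python
# Gauss-Newton iteration for y = a*exp(b*x) + error.
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 2.0, 25)
y = 2.0 * np.exp(1.3 * x) + rng.normal(scale=0.1, size=x.size)

theta = np.array([1.0, 1.0])                       # starting values (a, b)
for _ in range(50):
    a, b = theta
    f = a * np.exp(b * x)                          # current fitted values
    Z = np.column_stack([np.exp(b * x),            # df/da
                         a * x * np.exp(b * x)])   # df/db
    step, *_ = np.linalg.lstsq(Z, y - f, rcond=None)
    theta = theta + step
    if np.max(np.abs(step)) < 1e-8:                # convergence criterion
        break
print(theta.round(3))                              # close to (2.0, 1.3)
```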
The Gauss-Newton iteration uses the Jacobian, with elements
$$Z_{ij} = \frac{\partial f(x_i, \theta)}{\partial \theta_j}$$
and each step solves a linear least squares problem in Z, computed via the QR decomposition. That is, $Z = QR$.
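A toy sketch of that step (Z and the residual vector r below are made-up values standing in for the Jacobian and the current residuals):

```python
# Each Gauss-Newton step solves min ||Z @ step - r|| via QR.
import numpy as np

Z = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
r = np.array([0.2, -0.1, 0.05, 0.3])

Q, R = np.linalg.qr(Z)                # Z = QR
step = np.linalg.solve(R, Q.T @ r)    # the least squares solution
print(step)
```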
Part of StatLib:
http://lib.stat.cmu.edu/S/rms.curvatures
Confidence Intervals:
100(1 − α)% confidence intervals for the $\theta_j$ are obtained as
$$\hat{\theta}_j \pm t_{\alpha/2,\,n-p}\; s\sqrt{c_{jj}}$$

Prediction Intervals:
$$\hat{Y} \pm t_{\alpha/2,\,n-p}\; s\sqrt{1 + v'\left(\hat{Z}'\hat{Z}\right)^{-1}v}$$
with v given by
$$v = \frac{\partial f}{\partial \theta}\bigg|_{\theta = \hat{\theta}}$$
Hypothesis Tests:
Approximate t-tests could be constructed to test the $\theta_j$, with the tests of the form
$$t = \frac{\hat{\theta}_j}{\sqrt{s^2 c_{jj}}}$$
with the $c_{jj}$ the diagonal elements of the matrix $\left(\hat{Z}'\hat{Z}\right)^{-1}$, $\hat{Z}$ the Jacobian matrix at convergence.
Multicollinearity Diagnostics:
Model Validation

The best approach is to obtain new data, if possible, but care must be exercised to check that the new data are compatible with the data that were used to construct the model, and this can be hard to do.
Additional Topics

(1) Need for terms that are sums and products in regression models:
(A) Sums: Constructing sums has sometimes been used to address multicollinearity: if two predictors are deemed necessary but are highly correlated, their sum might be used instead of the individual terms (see the top of page 470).
Suggested references:
Applied Logistic Regression, 2nd ed. (2000) by Hosmer and Lemeshow