
ST3232: Design and Analysis of Experiments

Chen Zehua

Department of Statistics & Applied Probability

2:00-4:00 pm, Monday, January 28, 2013

Chen Zehua ST3232: Design and Analysis of Experiments


Lecture 5: Model Diagnosis and Data Transformation

Residual Analysis
For a general linear model

$$y_i = x_i^\tau \beta + \epsilon_i, \quad i = 1, \dots, N,$$

the residuals are defined as

$$r_i = y_i - \hat{y}_i = y_i - x_i^\tau \hat{\beta}, \quad i = 1, \dots, N.$$

For the one-way ANOVA model, the residuals are given by

$$r_{ij} = y_{ij} - \hat{\mu}_i = y_{ij} - \bar{y}_{i\cdot}, \quad i = 1, \dots, g, \; j = 1, \dots, n_i.$$

The model assumptions can be checked by residual analysis.



Properties of residuals
Let $r = (r_1, r_2, \dots, r_N)^\tau$. Then

$$r = y - \hat{y}, \quad \hat{y} = X\hat{\beta},$$

where $y$ and $X$ are, respectively, the response vector and the design matrix as defined before, and $\hat{\beta}$ is the least squares estimate (LSE) of $\beta$. The residuals have the following properties:
- $E(r) = 0$ and $\mathrm{Var}(r) = \sigma^2 (I - H)$, where $H = X (X^\tau X)^{-1} X^\tau$.
- When $\mathrm{rk}(H) = p + 1 \ll N$, the residuals are almost uncorrelated with each other and have about the same variance.
- For the one-way ANOVA model, $H = \mathrm{Diag}\left(\frac{1}{n_1} 1_{n_1} 1_{n_1}^\tau, \dots, \frac{1}{n_g} 1_{n_g} 1_{n_g}^\tau\right)$. When the $n_i$'s are equal, the residuals have equal variances.
- $r^\tau \hat{y} = 0$, so $r$ and $\hat{y}$ are uncorrelated.
- If normality holds, $r \sim N(0, \sigma^2 (I - H))$, and $r$ and $\hat{y}$ are independent.
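These properties can be checked numerically. The following is a minimal sketch in R, assuming a hypothetical unbalanced one-way layout with simulated responses (the group sizes, means, and seed are illustrative choices, not data from the lecture). It builds $H$ from the design matrix and compares it with the block-diagonal form.

```r
# Hypothetical layout: g = 3 treatments with n_i = 2, 3, 4 (illustrative only)
grp <- rep(1:3, times = c(2, 3, 4))
tmt <- factor(grp)
n_i <- tabulate(grp)

set.seed(1)                                # simulated responses, not lecture data
y <- rnorm(length(grp), mean = c(5, 7, 9)[grp])

X <- model.matrix(~ tmt)                   # design matrix (treatment contrasts)
H <- X %*% solve(crossprod(X)) %*% t(X)    # H = X (X'X)^{-1} X'

y_hat <- drop(H %*% y)                     # fitted values = treatment means
r <- y - y_hat                             # residuals

# Block-diagonal form: entry (j, k) is 1/n_i if j and k are in the same group i
H_blocks <- outer(grp, grp, "==") / n_i[grp]

max_diff <- max(abs(H - H_blocks))         # should be numerically zero
cross    <- sum(r * y_hat)                 # r' y_hat, should be numerically zero
```

Since $\mathrm{Var}(r_{ij}) = \sigma^2 (1 - 1/n_i)$ is read off the diagonal of $I - H$, unequal $n_i$ give unequal residual variances, which is why the equal-variance property requires a balanced design.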
Residual plots useful for experimental design data
- Plot $r_i$ vs. $\hat{y}_i$: checks the assumption of constant variance.
  The points should form a horizontal band of roughly constant width around 0. Any other pattern suggests a model violation. If the spread of $r_i$ increases as $\hat{y}_i$ increases, the error variance of $y$ increases with the mean of $y$, and a transformation of $y$ is needed.
- Plot $r_i$ vs. time sequence: checks the assumption of independence.
  If the data are collected at different times, the plot can reveal whether the residuals are correlated with time. If they are, the independence assumption is violated. Non-independence is difficult to correct in the analysis.
- Normal probability plot: checks the assumption of normality.
  If the normality assumption holds, the plot should appear roughly as a straight line. A clear departure from a straight line indicates a violation of the normality assumption.



More on Normal Probability Plot
- The normal probability plot is also called the Q-Q plot. It plots the order statistics of the residuals,

  $$r_{(1)} \le r_{(2)} \le \dots \le r_{(N)},$$

  against the quantiles of the standard normal distribution,

  $$z_{(1-0.5)/N}, \; z_{(2-0.5)/N}, \; \dots, \; z_{(N-0.5)/N},$$

  where $z_{(i-0.5)/N}$ is the $(i - 0.5)/N$th quantile of $N(0, 1)$.
- The rationale of the Q-Q plot is the following. If $r \sim N(\mu, \sigma^2)$, then the $(i - 0.5)/N$th quantile of $r$ is $r_{(i-0.5)/N} = \mu + \sigma z_{(i-0.5)/N}$. The $i$th order statistic of a random sample of size $N$ from a distribution provides a reasonable estimate of the $(i - 0.5)/N$th quantile of the distribution. Thus the Q-Q plot is roughly linear if normality holds.
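The construction can be reproduced by hand. The sketch below uses the meat storage measurements from this lecture's example and computes the residuals directly as deviations from the treatment means. The comparison with `qqnorm` relies on the fact that R's `ppoints` uses the plotting positions $(i - 0.5)/N$ when $N > 10$; for smaller samples `qqnorm` uses slightly different positions.

```r
# Meat storage data: 4 treatments, 3 observations each
x   <- c(7.66, 6.98, 7.80,  5.26, 5.44, 5.80,
         7.41, 7.33, 7.04,  3.51, 2.91, 3.66)
tmt <- factor(rep(1:4, each = 3))
r   <- x - ave(x, tmt)                  # one-way ANOVA residuals y_ij - ybar_i.

N        <- length(r)
r_sorted <- sort(r)                     # order statistics r_(1) <= ... <= r_(N)
z        <- qnorm((seq_len(N) - 0.5) / N)   # N(0,1) quantiles at (i - 0.5)/N

# Hand-made Q-Q plot: roughly linear if the residuals are normal
plot(z, r_sorted, xlab = "N(0,1) quantiles", ylab = "Ordered residuals")

# Same theoretical quantiles as qqnorm() (N = 12 > 10 here)
qq <- qqnorm(r, plot.it = FALSE)
same_quantiles <- isTRUE(all.equal(sort(qq$x), z))
```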



Residual plots for the meat storage example
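- R code for generating the plots:
x1 = c(7.66, 6.98, 7.80)
x2 = c(5.26, 5.44, 5.80)
x3 = c(7.41, 7.33, 7.04)
x4 = c(3.51, 2.91, 3.66)
x = c(x1, x2, x3, x4)
# treatment labels 1,1,1,2,2,2,3,3,3,4,4,4
tmt = factor(rep(1:4, each = 3))
options(contrasts = c("contr.treatment", "contr.poly"))
fit = lm(x ~ tmt)
xhat = fitted(fit)   # fitted values (treatment means)
r = resid(fit)       # residuals
ind = 1:12           # observation order
plot(xhat, r)        # residuals (vertical axis) vs. fitted values
plot(ind, r)         # residuals (vertical axis) vs. observation order
qqnorm(r)            # normal probability plot
- Go to the R session for the demonstration of the plots.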
Strategies for non-normal responses

For non-normal responses, there are two strategies for the data analysis:
- Data transformation: transform the responses so that the transformed response has a constant variance.
  We first discuss transformations for some particular types of response (proportions, counts, time-to-event, etc.) and then the general methodology.
- Non-parametric method: the Kruskal-Wallis test for one-way layout data.
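As an illustration of the second strategy, the sketch below applies base R's `kruskal.test` to the meat storage data from the earlier example. The choice of data set is an assumption made here for illustration; the lecture does not say which example the test is demonstrated on.

```r
# Meat storage data: 4 treatments, 3 observations each
x   <- c(7.66, 6.98, 7.80,  5.26, 5.44, 5.80,
         7.41, 7.33, 7.04,  3.51, 2.91, 3.66)
tmt <- factor(rep(1:4, each = 3))

# Rank-based test of equal treatment distributions (no normality assumption)
kw      <- kruskal.test(x ~ tmt)
kw_stat <- unname(kw$statistic)   # Kruskal-Wallis H statistic
kw_df   <- unname(kw$parameter)   # degrees of freedom, g - 1
kw_p    <- kw$p.value             # chi-square approximation to the p-value

# The same statistic from the rank sums: H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
N       <- length(x)
R_i     <- tapply(rank(x), tmt, sum)
H_byhand <- 12 / (N * (N + 1)) * sum(R_i^2 / 3) - 3 * (N + 1)
```

The hand formula agrees with `kruskal.test` here because the data contain no ties; with ties, `kruskal.test` applies a correction.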



The response — a proportion

Example 5.1
A clinical trial was conducted to compare two methods for the training of senile dementia patients.
- Two groups of patients, of sizes 11 and 8, were recruited.
- The two groups were trained with different methods.
- After completing the training, each patient was asked to take 20 tests involving activities of daily living, such as unlocking a door or tying one's shoelaces.
- The response from each patient is the proportion of his or her tests that were successful.



Example 5.1 (cont.)
The data (under column head X) are given below:

            Group 1             Group 2
          X       A           X       A
        0.05    0.226       0       0
        0.15    0.398       0.15    0.398
        0.35    0.633       0       0
        0.25    0.524       0.05    0.226
        0.20    0.464       0       0
        0.05    0.226       0       0
        0.10    0.322       0.05    0.226
        0.05    0.226       0.10    0.322
        0.30    0.580
        0.05    0.226
        0.25    0.524
Mean    0.164   0.395       0.044   0.147
sd      0.112   0.158       0.056   0.166

The variance of Group 1 ($0.112^2 = 0.012544$) is 4 times that of Group 2 ($0.056^2 = 0.003136$).
Transformation of proportions
- A common feature of proportions: if the mean is close to either 0 or 1, the variance is smaller; if the mean is close to 0.5, the variance is larger. The variance of such a response can be approximated by

  $$V(p) = c\,p(1 - p),$$

  where $c$ is a constant and $p$ is the mean proportion.
- The transformation is given by

  $$h(p) = \arcsin\sqrt{p}.$$

  The transformed data in Example 5.1 are given in the previous table under column head A. The transformed data have quite similar variances.
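The column A values and the comparison of variances can be reproduced with a few lines of R. This is a minimal sketch using the Example 5.1 proportions; the helper name `arcsine_sqrt` is introduced here for illustration.

```r
# Proportions of successful tests from Example 5.1
g1 <- c(0.05, 0.15, 0.35, 0.25, 0.20, 0.05, 0.10, 0.05, 0.30, 0.05, 0.25)
g2 <- c(0, 0.15, 0, 0.05, 0, 0, 0.05, 0.10)

# h(p) = arcsin(sqrt(p)), in radians (hypothetical helper name)
arcsine_sqrt <- function(p) asin(sqrt(p))

a1 <- arcsine_sqrt(g1)          # column A, Group 1
a2 <- arcsine_sqrt(g2)          # column A, Group 2

var_ratio_raw   <- var(g1) / var(g2)   # variance ratio on the original scale
var_ratio_trans <- var(a1) / var(a2)   # variance ratio after the transformation

round(c(sd(g1), sd(g2), sd(a1), sd(a2)), 3)
```

On the original scale the variance ratio is close to 4, while on the transformed scale it is close to 1, in line with the sd row of the table.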



The response — a count
Examples: number of microorganisms, number of attacks of angina pectoris, etc.

Example 5.2
- Purpose: to assess the effect of vaccination with heat-killed bacilli.
- Case group ($n_1 = 7$): vaccinated.
- Control group ($n_2 = 6$): not vaccinated.
- Measurements: number of oral lactobacilli in the saliva.



Example 5.2 (cont.)
Data:

              Case                    Control
           X       Y = √X          X       Y = √X
         7,925      89.02        3,158      56.20
        15,643     125.07        3,669      60.57
        17,462     132.14        5,930      77.01
        10,805     103.95        5,697      75.48
         9,300      96.44        8,331      91.27
         7,538      86.82       11,822     108.73
         6,297      79.35
Mean    10,710.0   101.827       6,434.5    78.21
sd       4,266.4    19.946       3,218.8    19.527



Transformation of counts
It can be seen from the data that

$$\frac{s_1}{\sqrt{\bar{X}_1}} = 41.2 \approx 40.1 = \frac{s_2}{\sqrt{\bar{X}_2}},$$

i.e., $s^2 \propto \bar{X}$.
- A common feature of counts: the variance is proportional to the mean.
- Most count responses can be approximated by a Poisson random variable.
- The transformation is given by

  $$h(\mu) = \sqrt{\mu}.$$
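The two ratios and the effect of the square-root transformation can be checked directly from the Example 5.2 counts. A minimal sketch in R (variable names are chosen here for illustration):

```r
# Lactobacilli counts from Example 5.2
case    <- c(7925, 15643, 17462, 10805, 9300, 7538, 6297)
control <- c(3158, 3669, 5930, 5697, 8331, 11822)

# s / sqrt(mean): roughly equal across groups when variance is proportional to the mean
ratio_case    <- sd(case) / sqrt(mean(case))
ratio_control <- sd(control) / sqrt(mean(control))

# Standard deviations before and after the square-root transformation
sd_raw  <- c(case = sd(case), control = sd(control))
sd_sqrt <- c(case = sd(sqrt(case)), control = sd(sqrt(control)))
```

The raw standard deviations differ noticeably between the groups, whereas the standard deviations of the square-root values are nearly equal, matching the sd row of the table.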



Residual analysis
- Meat storage example.
- Example 5.1.
- Example 5.2.
Go to the R session for the residual analysis of the original and transformed data of the above examples.



The response — a time to event
Examples: time to failure, survival times, etc.
Transformation of time-to-event
- For time-to-event responses, the variance usually depends on a certain power of the mean.
- Once the power is determined, the transformation can be obtained from a general variance stabilization transformation formula, to be discussed later.
- A reciprocal transformation is appropriate when the variance is proportional to the fourth power of the mean.
- The reciprocal transformation usually has a physical meaning, e.g., the reciprocal of the time to death is the death rate, and the reciprocal of the time to the occurrence of a reaction is the speed of the reaction.



Variance stabilization transformation
- For a random variable whose variance $\sigma^2$ depends on its mean $\mu$, i.e., $\sigma^2 = V(\mu)$, a transformation can be found such that the variance of the transformed variable does not depend on its mean.
- The proper transformation is derived as

  $$h(\mu) = \int \frac{1}{\sqrt{V(\mu)}} \, d\mu.$$
- This transformation is called the variance stabilization transformation.
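The formula can be motivated by a first-order Taylor expansion (the delta method). The following is a sketch of that standard argument, assuming $h$ is smooth and the response $Y$ varies little around its mean $\mu$; it is not a derivation taken from the lecture.

```latex
\begin{align*}
h(Y) &\approx h(\mu) + h'(\mu)\,(Y - \mu), \\
\mathrm{Var}\{h(Y)\} &\approx \{h'(\mu)\}^2 \, \mathrm{Var}(Y) = \{h'(\mu)\}^2 \, V(\mu).
\end{align*}
% Require the approximate variance to equal a constant c^2 for every \mu:
\[
\{h'(\mu)\}^2 \, V(\mu) = c^2
\quad\Longrightarrow\quad
h'(\mu) = \frac{c}{\sqrt{V(\mu)}}
\quad\Longrightarrow\quad
h(\mu) = c \int \frac{1}{\sqrt{V(\mu)}} \, d\mu .
\]
```

Taking $c = 1$ gives the stated formula; constant factors and additive constants in $h$ do not affect whether the variance is stabilized.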



Examples
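- Proportion: $V(\mu) = c\mu(1 - \mu)$,

  $$h(\mu) = \int \frac{1}{\sqrt{\mu(1 - \mu)}} \, d\mu \propto \arcsin\sqrt{\mu}.$$

- Count: $V(\mu) = \mu$,

  $$h(\mu) = \int \frac{1}{\sqrt{\mu}} \, d\mu \propto \sqrt{\mu}.$$

- Time-to-event: $V(\mu) = \mu^4$,

  $$h(\mu) = \int \frac{1}{\sqrt{\mu^4}} \, d\mu \propto 1/\mu.$$

Constant factors and additive constants are dropped, since they do not affect variance stabilization.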
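The three integrals can be checked numerically against their closed forms, which are $2\arcsin\sqrt{\mu}$, $2\sqrt{\mu}$ and $-1/\mu$ before constant factors are dropped. The sketch below uses base R's `integrate`; the integration limits are arbitrary values chosen for illustration.

```r
# Numerical integral of 1 / sqrt(V(mu)) between two limits
vst_integral <- function(V, lower, upper) {
  integrate(function(m) 1 / sqrt(V(m)), lower, upper)$value
}

# Proportion: V(mu) = mu (1 - mu); antiderivative 2 * asin(sqrt(mu))
num_prop <- vst_integral(function(m) m * (1 - m), 0.2, 0.6)
cf_prop  <- 2 * (asin(sqrt(0.6)) - asin(sqrt(0.2)))

# Count: V(mu) = mu; antiderivative 2 * sqrt(mu)
num_count <- vst_integral(function(m) m, 1, 4)
cf_count  <- 2 * (sqrt(4) - sqrt(1))

# Time-to-event: V(mu) = mu^4; antiderivative -1 / mu
num_time <- vst_integral(function(m) m^4, 1, 4)
cf_time  <- -(1 / 4 - 1 / 1)

max_err <- max(abs(c(num_prop - cf_prop, num_count - cf_count, num_time - cf_time)))
```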



Log transformation
A log transformation is appropriate in the following situations:
- Mean values are more sensibly compared in terms of their ratios than in terms of their differences.
- The variance of the responses is proportional to the square of their mean.
- The responses have a log-normal distribution.

Box-Cox transformation

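The Box-Cox transformation is a general family of transformations, which is given by

$$h(\mu) = \frac{\mu^\lambda - 1}{\lambda}.$$

For $\lambda = 0$ it is taken to be $\ln \mu$, the limit as $\lambda \to 0$. The parameter $\lambda$ can be determined from the data.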
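A minimal sketch of the family as an R function follows. The function name `box_cox` and the tolerance used to detect $\lambda = 0$ are choices made here for illustration; the $\lambda = 0$ case uses the logarithm, which is the limit of $(\mu^\lambda - 1)/\lambda$ as $\lambda \to 0$.

```r
# Box-Cox transformation of positive values y for a given lambda
# (hypothetical helper; the 1e-8 tolerance for "lambda is zero" is an assumption)
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) {
    log(y)                         # limiting case lambda -> 0
  } else {
    (y^lambda - 1) / lambda
  }
}

y <- c(0.5, 1, 2, 4)
bc_sqrt <- box_cox(y, 0.5)         # equivalent to the square root, up to shift and scale
bc_log  <- box_cox(y, 0)           # log transformation
bc_rec  <- box_cox(y, -1)          # equals 1 - 1/y, a decreasing function of 1/y
```

Because $(\mu^\lambda - 1)/\lambda$ is a linear function of $\mu^\lambda$, using it in place of the plain power $\mu^\lambda$ changes only the location and scale of the transformed response.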



Determination of λ in the Box-Cox transformation
Let $\sigma_i \propto \mu_i^\alpha$, where $\sigma_i$ and $\mu_i$ are, respectively, the standard deviation and the mean of the response under the $i$th treatment. A few variance stabilization transformations are given below:

σ_i ∝ µ_i^α          α       λ = 1 − α    Transformation

σ_i ∝ µ_i^3          3       -2           reciprocal squared
σ_i ∝ µ_i^2          2       -1           reciprocal
σ_i ∝ µ_i^(3/2)      3/2     -1/2         reciprocal square root
σ_i ∝ µ_i            1       0            log
σ_i ∝ µ_i^(1/2)      1/2     1/2          square root
σ_i ∝ constant       0       1            no transformation
σ_i ∝ µ_i^(-1/2)     -1/2    3/2          3/2 power
σ_i ∝ µ_i^(-1)       -1      2            square



Determination of λ in the Box-Cox transformation
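- Empirical determination of $\alpha$: compute $s_i$ and $\bar{y}_{i\cdot}$ for each treatment. Fit the regression model

  $$\ln s_i = \beta_0 + \alpha \ln \bar{y}_{i\cdot} + e_i.$$

  Take the estimate $\hat{\alpha}$ from the fitted model and set $\lambda = 1 - \hat{\alpha}$.
- An ad hoc method: select a few $\alpha$ values, say $\alpha_k$, $k = 1, \dots, K$. For each $k$, compute

  $$R_k = \frac{\max_i \left( s_i / \bar{y}_{i\cdot}^{\alpha_k} \right)}{\min_i \left( s_i / \bar{y}_{i\cdot}^{\alpha_k} \right)}.$$

  Select the $\alpha_k$ with the smallest $R_k$.
- Analytic determination of $\lambda$: see Chapter 3, Montgomery.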
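The first two methods can be sketched as small R functions. The function names and the group summaries below are hypothetical: the summaries are constructed so that $s_i = 0.2\,\bar{y}_{i\cdot}^{1.5}$ holds exactly, which makes the expected answers ($\alpha = 1.5$, $R_k = 1$ at $\alpha_k = 1.5$) known in advance.

```r
# Empirical method: slope of ln(s_i) on ln(ybar_i)
estimate_alpha <- function(s, ybar) {
  unname(coef(lm(log(s) ~ log(ybar)))[2])
}

# Ad hoc method: R_k = max_i(s_i / ybar_i^alpha) / min_i(s_i / ybar_i^alpha)
spread_ratio <- function(s, ybar, alpha) {
  scaled <- s / ybar^alpha
  max(scaled) / min(scaled)
}

# Hypothetical group summaries with s_i = 0.2 * ybar_i^1.5 exactly
ybar <- c(2, 4, 8, 16)
s    <- 0.2 * ybar^1.5

alpha_hat  <- estimate_alpha(s, ybar)
lambda_hat <- 1 - alpha_hat                  # Box-Cox parameter

alphas     <- c(0.5, 1, 1.5, 2)              # candidate alpha_k values
R          <- sapply(alphas, function(a) spread_ratio(s, ybar, a))
best_alpha <- alphas[which.min(R)]
```

With real data the relation between $s_i$ and $\bar{y}_{i\cdot}$ is noisy, so $\hat{\alpha}$ is usually rounded to a nearby value that corresponds to an interpretable transformation.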



Example 5.3
- Purpose: to compare the effects of several combinations of poisons and treatments on the survival of animals.
- Group assignment: the combinations are randomly assigned to the animals.
- Measurements: survival time in hours after treatment.
- The data are given in the table that follows.



Example 5.3 (cont.)

Group    Measurements              X̄_i      s_i

  1      3.1,  4.5,  4.6,  4.3     4.125    0.695
  2      8.2, 11.0,  8.8,  7.2     8.800    1.608
  3      4.3,  4.5,  6.3,  7.6     5.675    1.567
  4      4.5,  7.1,  6.6,  6.2     6.100    1.128
  5      3.6,  2.9,  4.0,  2.3     3.200    0.753
  6      9.2,  6.1,  4.9, 12.4     8.150    3.363
  7      4.4,  3.5,  3.1,  4.0     3.750    0.569
  8      5.6, 10.2,  7.1,  3.8     6.675    2.710
  9      2.2,  2.1,  1.8,  2.3     2.100    0.216
 10      3.0,  3.7,  3.8,  2.9     3.350    0.465
 11      2.3,  2.5,  2.4,  2.2     2.350    0.129
 12      3.0,  3.6,  3.1,  3.3     3.250    0.265

Go to the R session for the residual analysis of Example 5.3.



Transformation of data in Example 5.3.
The residual analysis shows that the assumptions of constant variance and normality are violated.
- Determination of the transformation by the empirical method: in the regression model

  $$\ln s_i = \beta_0 + \alpha \ln \bar{y}_{i\cdot} + e_i,$$

  the estimated $\alpha$ is $1.977 \approx 2$, so $\lambda = 1 - \alpha \approx -1$ and the transformation is the reciprocal.
- Determination of the transformation by the ad hoc method: consider the $\alpha$ values 1/2, 1, 2, 3. The $R_k$'s are given below:

  α_k     1/2      1       2       3
  R_k    13.99    7.51    3.54    9.88

  The smallest $R_k$ occurs at $\alpha = 2$.
- Both methods yield the reciprocal transformation.
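Both calculations can be reproduced from the raw measurements of Example 5.3. The following is a minimal sketch in R; the variable names are chosen here for illustration, and the last lines add a simple before/after comparison of the group standard deviations that is not part of the lecture.

```r
# Survival times from Example 5.3, groups 1 to 12, four animals each
y <- c(3.1, 4.5, 4.6, 4.3,    8.2, 11.0, 8.8, 7.2,    4.3, 4.5, 6.3, 7.6,
       4.5, 7.1, 6.6, 6.2,    3.6, 2.9, 4.0, 2.3,     9.2, 6.1, 4.9, 12.4,
       4.4, 3.5, 3.1, 4.0,    5.6, 10.2, 7.1, 3.8,    2.2, 2.1, 1.8, 2.3,
       3.0, 3.7, 3.8, 2.9,    2.3, 2.5, 2.4, 2.2,     3.0, 3.6, 3.1, 3.3)
grp <- factor(rep(1:12, each = 4))

ybar <- tapply(y, grp, mean)     # group means
s    <- tapply(y, grp, sd)       # group standard deviations

# Empirical method: slope of ln(s_i) on ln(ybar_i)
alpha_hat <- unname(coef(lm(log(s) ~ log(ybar)))[2])

# Ad hoc method: R_k for the candidate alpha values 1/2, 1, 2, 3
alphas <- c(0.5, 1, 2, 3)
R <- sapply(alphas, function(a) {
  scaled <- s / ybar^a
  max(scaled) / min(scaled)
})
best_alpha <- alphas[which.min(R)]

# Extra check (not from the lecture): spread of group sds before and after 1/y
s_rec         <- tapply(1 / y, grp, sd)
sd_ratio_raw  <- max(s) / min(s)
sd_ratio_rec  <- max(s_rec) / min(s_rec)
```

The ratio of the largest to the smallest group standard deviation is much smaller on the reciprocal scale than on the original scale, which is what the variance stabilization is meant to achieve.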
