

Chapter 3: A Closer Look at Assumptions


STAT 3022, School of Statistics, University of Minnesota

Spring 2013


Introduction
In Chapter 2, we discussed the mechanics of using t-procedures to perform statistical inference, namely t-tests and confidence intervals. We base these procedures on certain assumptions:
- we have random samples, representative of the populations
- the data come from Normal populations
- samples are drawn independently
- in pooled two-sample settings, the population standard deviations are equal (σ1 = σ2 = σ)

In practice, these assumptions are usually not strictly met. When are these procedures still appropriate?


Case Study: Making it Rain


- Data were collected in southern Florida between 1968 and 1972 to test the hypothesis that massive injection of silver iodide (AgI) into cumulus clouds can lead to increased rainfall. This process is called cloud seeding.
- Over 52 days, a target cloud was either seeded or left unseeded (as a control). Treatment was assigned randomly.
- Researchers were blind to the treatment: pilots flew through the cloud every day, whether treatment or control, and a mechanism in the plane either seeded the cloud or left it unseeded.
- Question: Did cloud seeding have an effect on rainfall? If so, how much?

Graphical Summaries
library("Sleuth2") boxplot(Rainfall ~ Treatment, ylab='Rainfall (acre-feet)', data=case0301)

[Figure: side-by-side boxplots of Rainfall (acre-feet) for the Unseeded and Seeded groups]


Graphical Summaries
par(mfrow=c(2,1), mar=c(4,4,1,0.5))
hist(case0301$Rainfall[case0301$Treatment=="Seeded"], breaks=10,
     main="Seeded - Rainfall", xlim=c(0,3000), col="gray", xlab="")
hist(case0301$Rainfall[case0301$Treatment=="Unseeded"], breaks=8,
     main="Unseeded - Rainfall", xlim=c(0,3000), col="gray", xlab="")

[Figure: histograms of rainfall on seeded days (top) and unseeded days (bottom), both drawn on a 0 to 3000 acre-feet scale]


Numerical Summaries and Interpretations

Numerical summaries: do it yourself (follow the R code on page 42 of the Chapter 2 slides); one possible sketch is shown below. Graphical and numerical summaries indicate that rainfall tended to be greater on seeded days. However, there are problems with our necessary assumptions:
- both distributions are very skewed
- both distributions have outliers
- variability is much greater in the seeded group than in the unseeded group

Can we use our usual t-tools to analyze these data? How?
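One way to compute the group-wise summaries (a minimal sketch of my own, not the code from the Chapter 2 slides):

library(Sleuth2)
# five-number summaries plus means, and standard deviations, by treatment group
with(case0301, tapply(Rainfall, Treatment, summary))
with(case0301, tapply(Rainfall, Treatment, sd))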


Can we do this?

> t.test(Rainfall ~ Treatment, alternative="two.sided",
+        var.equal=TRUE, data=case0301)

        Two Sample t-test

data:  Rainfall by Treatment
t = -1.9982, df = 50, p-value = 0.05114
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -556.224179    1.431851
sample estimates:
mean in group Unseeded   mean in group Seeded
              164.5885               441.9846

How much did the violations of our assumptions affect these results?


Robustness

The t-tools may be used even when the assumptions are violated to a certain degree, because the t-tools are robust.

Robustness: A statistical procedure is robust to departures from a particular assumption if it is valid even when the assumption is not met.


Type 1: Robustness Against Departures from Normality

Recall that the Central Limit Theorem (CLT) states that sample averages have approximately Normal sampling distributions for large samples, regardless of the shape of the population distribution. As long as the samples are large enough, the t-ratio will follow an approximate t-distribution even if the data are non-Normal.
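As a rough illustration (my own simulation sketch, not from the slides or the textbook), we can check that the equal-variance t interval still has close to 95% coverage when both samples are moderately large draws from the same skewed population:

set.seed(3022)
cover <- replicate(2000, {
  x <- rexp(50)                       # skewed (exponential) population
  y <- rexp(50)                       # same population, so the true mean difference is 0
  ci <- t.test(x, y, var.equal = TRUE)$conf.int
  ci[1] < 0 & 0 < ci[2]               # does the 95% CI cover the true difference?
})
mean(cover)                           # should be close to 0.95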


Type 1: Robustness Against Departures from Normality


Effects of Skewness

- If the two populations have the same standard deviations and approximately the same shapes, and if n1 ≈ n2, then the validity of the t-tools is affected very little by skewness.
- If the two populations have the same standard deviations and approximately the same shapes, but n1 ≠ n2, then the validity of the t-tools is affected substantially by skewness. Larger sample sizes diminish this effect.
- If the skewness in the two populations differs considerably, the tools can be very misleading with small and moderate sample sizes.
- See Display 3.4 in the textbook for simulation results; a rough simulation sketch also follows below.
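A rough simulation sketch (my own illustration in the spirit of Display 3.4, not a reproduction of it): with a skewed population and very unequal sample sizes, we can estimate the actual size of a nominal 5% test and compare it with 0.05.

set.seed(3022)
reject <- replicate(5000, {
  x <- rexp(10)                        # small sample from a skewed population
  y <- rexp(100)                       # much larger sample from the same population
  t.test(x, y, var.equal = TRUE)$p.value < 0.05
})
mean(reject)                           # estimated size; compare with the nominal 0.05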

Type 2: Robustness Against Differing Standard Deviations


When we cannot assume σ1 = σ2, more serious problems may arise:
- sp no longer estimates any population parameter
- SE(x̄1 − x̄2) no longer estimates the standard deviation of x̄1 − x̄2, the difference between the averages
- the t-ratio no longer follows a t-distribution

What can we do?
- If n1 ≈ n2, the t-tools remain fairly valid even when σ1 ≠ σ2.
- When n1 and n2 are very different, we need the ratio σ1/σ2 to be between 1/2 and 2 to have reliable results.
- See Display 3.5 in the textbook for simulation results.
- Or drop the equal-variance assumption and use the unequal-variance (Welch) form of the t-test:
> t.test(x1, x2, alternative = 'two.sided', var.equal = FALSE)
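For illustration (this call is not in the original slides), the same unequal-variance form applied to the rainfall data:

library(Sleuth2)
# Welch two-sample t-test: does not assume equal standard deviations
t.test(Rainfall ~ Treatment, alternative = "two.sided",
       var.equal = FALSE, data = case0301)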

Type 3: Robustness Against Departures from Independence


There are two types of dependence (i.e., lack of independence) that commonly arise:
1. A cluster effect occurs when the data have been collected in subgroups. Observations in the same subgroup tend to be more similar in their responses than observations in different subgroups.
2. A serial effect occurs when measurements are taken over time and observations close together in time tend to be more similar (or more different) than observations collected at distant time points.

When the assumption of independence is violated, the standard error becomes very inaccurate. t-tools are usually not recommended in such cases.

Resistance and Outliers


An outlier is an observation judged to be far from its group average. A statistical procedure is resistant if it does not change very much when a small part of the data changes, perhaps drastically. Whether or not we should simply remove such observations depends on how resistant our tools are to changes in the data.

Question: Can you tell the difference between robustness and resistance?

Example of Outlier
[Figure: scatterplot of y against x with one observation lying far from the rest of the data, illustrating an outlier]


Example of Resistance
Consider a hypothetical sample: 10, 20, 30, 50, 70. The sample mean is 36, and the sample median is 30.
Now consider the sample: 10, 20, 30, 50, 700. What happens to the sample mean? What about the sample median?

The sample median is resistant to any change in a single observation, while the sample mean is not.
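A quick check in R, using the numbers from the slide:

x1 <- c(10, 20, 30, 50, 70)
x2 <- c(10, 20, 30, 50, 700)    # same data, except the largest value is now 700
c(mean(x1), median(x1))         # 36, 30
c(mean(x2), median(x2))         # 162, 30: the mean jumps, the median does not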

Resistance of t-Tools

Since the t-tools are based on the mean, they are not resistant. A small portion of the data can have a major influence on the results. One or two outliers can affect a 95% CI or change a p-value enough to alter a conclusion.
Solution: when you have an outlier, it is good practice to run your analysis with and without the outlier in the data set. Compare your results to see how influential the outlier in question is.
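A small sketch of that with/without comparison (the data and the cutoff below are hypothetical, chosen only for illustration):

y_all  <- c(12, 15, 14, 13, 16, 90)    # 90 is a suspected outlier
y_trim <- y_all[y_all < 50]            # same sample with the outlier removed
t.test(y_all,  mu = 10)$p.value        # analysis including the outlier
t.test(y_trim, mu = 10)$p.value        # analysis excluding the outlier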


Practical Strategies for the Two-Sample Problem

Our task is to size up the actual conditions, using the available data, and evaluate the appropriateness of the t-tools:
1. think about possible cluster and serial effects
2. evaluate the suitability of the t-tools by examining graphical displays (side-by-side histograms or box plots)
3. consider alternatives:
   a. transform the data (Section 3.5) to see if the transformed data look nicer
   b. use alternative tools that do not require model assumptions (Chapter 4)


Transformations of Data
For positive data, the most useful transformation is the logarithm (log), particularly the natural (base e) logarithm (e = 2.71828...).
log(1) = 0
log(e^x) = x
[Figure: the natural log function log(x) plotted for x between 0 and 10]


When Do We Use Log Transformation


In a set of data, if the ratio max/min > 10, then a log transformation might be useful.
If the samples are skewed, with the group with the larger average also having the greater spread, then a log transformation could be a good choice.
[Figure: side-by-side plots of a skewed variable y before transformation (values up to about 1200) and of log(y) after transformation]


Cloud Seeding - Transformation


Recall both groups are skewed, with the seeded days having a larger average and a greater spread.
> max(case0301$Rainfall[case0301$Treatment=="Seeded"])/
+   min(case0301$Rainfall[case0301$Treatment=="Seeded"])
[1] 669.6586
> max(case0301$Rainfall[case0301$Treatment=="Unseeded"])/
+   min(case0301$Rainfall[case0301$Treatment=="Unseeded"])
[1] 1202.6
> case0301$logRain <- with(case0301, log(Rainfall))
> head(case0301)
  Rainfall Treatment  logRain
1   1202.6  Unseeded 7.092241
2    830.1  Unseeded 6.721546
3    372.4  Unseeded 5.919969
4    345.5  Unseeded 5.844993
5    321.2  Unseeded 5.772064
6    244.3  Unseeded 5.498397
[Figure: boxplots of Rainfall by Treatment before transformation (left) and of log(Rainfall) by Treatment after transformation (right), Unseeded vs. Seeded]


Two-Sample t-Analysis
Before:
> t.test(Rainfall ~ Treatment, alternative="two.sided",
+        var.equal=TRUE, data=case0301)

        Two Sample t-test

data:  Rainfall by Treatment
t = -1.9982, df = 50, p-value = 0.05114
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -556.224179    1.431851
sample estimates:
mean in group Unseeded   mean in group Seeded
              164.5885               441.9846

After:
> t.test(logRain ~ Treatment, data=case0301,
+        alternative="less", var.equal=TRUE)

        Two Sample t-test

data:  logRain by Treatment
t = -2.5444, df = 50, p-value = 0.007041
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
       -Inf -0.3904045
sample estimates:
mean in group Unseeded   mean in group Seeded
              3.990406               5.134187

There is convincing evidence that seeding increased rainfall.


Multiplicative Treatment Effect


Definition: Suppose Z = log(Y). It is estimated that the response of an experimental unit to treatment 2 will be e^(Z̄2 − Z̄1) times as large as its response to treatment 1 (where Z̄1 = average of log(Y1), and similarly for Z̄2).
> m1 <- mean(case0301$logRain[case0301$Treatment=='Seeded'])
> m2 <- mean(case0301$logRain[case0301$Treatment=='Unseeded'])
> (diffmeans <- m1 - m2)
[1] 1.143781
> (est.mult.effect <- exp(diffmeans))
[1] 3.138614

We interpret this in the following way: The volume of rainfall produced by a seeded cloud is estimated to be 3.14 times as large as the volume that would have been produced in the absence of seeding.


Confidence Interval
> (test <- t.test(logRain ~ Treatment, data=case0301, var.equal=TRUE))

        Two Sample t-test

data:  logRain by Treatment
t = -2.5444, df = 50, p-value = 0.01408
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.0466973 -0.2408651
sample estimates:
mean in group Unseeded   mean in group Seeded
              3.990406               5.134187

> test$conf.int
[1] -2.0466973 -0.2408651
attr(,"conf.level")
[1] 0.95
> exp(test$conf.int)
[1] 0.1291608 0.7859476
attr(,"conf.level")
[1] 0.95

A 95% confidence interval for the multiplicative effect (unseeded relative to seeded) is 0.129 to 0.786 times.

A Strategy for Dealing with Outliers


If the outlying observation resulted from measurement error or contamination from another population:
- if the right value is known, correct the observation
- if not, leave the observation out of the analysis

Often there is no way to know what caused the outlier(s). Two tools exist:
- employ a resistant statistical tool (Chapter 4)
- adopt a careful strategy; see Display 3.6 in the text

The idea is to perform the analysis with and without the suspected outliers.
- If both analyses give the same answer, only report results INCLUDING the suspected outliers.
- If not, report both results.


Removing Outliers and Other Data Points


> library(Sleuth2); ex0327[15:17, ]
        Country Life Income           Type
15     Portugal 68.1    956 Industrialized
16 South_Africa 68.2    NaN Industrialized
17       Sweden 74.7   5596 Industrialized
> range(ex0327$Income, na.rm=TRUE)
[1]  110 5596
> data <- ex0327; data[16, 'Income'] <- 10000   # set it as an outlier
> data[15:17, ]
        Country Life Income           Type
15     Portugal 68.1    956 Industrialized
16 South_Africa 68.2  10000 Industrialized
17       Sweden 74.7   5596 Industrialized
> d1 <- subset(data, Income < 8000); d1[15:17, ]
        Country Life Income           Type
15     Portugal 68.1    956 Industrialized
17       Sweden 74.7   5596 Industrialized
18  Switzerland 72.1   2963 Industrialized

> ### dealing with Missing data ###
> (cc <- complete.cases(ex0327))
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE ...
[16] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE ...
> data2 <- ex0327[cc, ]; data2[15:17, ]
        Country Life Income           Type
15     Portugal 68.1    956 Industrialized
17       Sweden 74.7   5596 Industrialized
18  Switzerland 72.1   2963 Industrialized


Q: How many conservative economists does it take to change a light bulb?


A: None, they're all waiting for the unseen hand of the market to correct the lighting disequilibrium.

