

Chapter 3: A Closer Look at Assumptions


STAT 3022, School of Statistics, University of Minnesota

Spring 2013


Introduction
In Chapter 2, we discussed the mechanics of using t-procedures to perform statistical inference, namely t-tests and confidence intervals. We base these procedures on certain assumptions:
- we have random samples, representative of the populations
- the data come from Normal populations
- samples are drawn independently
- in pooled two-sample settings, the population standard deviations are equal (σ1 = σ2 = σ)

In practice, these assumptions are usually not strictly met. When are these procedures still appropriate?


Case Study: Making it Rain


- Data were collected in southern Florida between 1968 and 1972 to test the hypothesis that massive injection of silver iodide (AgI) into cumulus clouds can lead to increased rainfall. This process is called cloud seeding.
- Over 52 days, a target cloud was either seeded or left unseeded (as a control). Treatment was assigned randomly.
- Researchers were blind to the treatment: pilots flew through the cloud every day, whether treatment or control, and a mechanism in the plane either seeded the cloud or left it unseeded.
- Question: Did cloud seeding have an effect on rainfall? If so, how much?

Graphical Summaries
library("Sleuth2") boxplot(Rainfall ~ Treatment, ylab='Rainfall (acre-feet)', data=case0301)

[Figure: side-by-side boxplots of Rainfall (acre-feet) for the Unseeded and Seeded groups]


Graphical Summaries
par(mfrow=c(2,1), mar=c(4,4,1,0.5))
hist(case0301$Rainfall[case0301$Treatment=="Seeded"], breaks=10,
     main="Seeded - Rainfall", xlim=c(0,3000), col="gray", xlab="")
hist(case0301$Rainfall[case0301$Treatment=="Unseeded"], breaks=8,
     main="Unseeded - Rainfall", xlim=c(0,3000), col="gray", xlab="")

[Figure: histograms of rainfall on seeded days (top) and unseeded days (bottom), both drawn on a 0 to 3000 acre-feet scale]


Numerical Summaries and Interpretations

Numerical summaries: do it yourself (follow the R code on page 42 of the Chapter 2 slides); one possible sketch is shown below. Graphical and numerical summaries indicate that rainfall tended to be greater on seeded days. However, there are problems with our necessary assumptions:
- both distributions are very skewed
- both distributions have outliers
- variability is much greater in the seeded group than in the unseeded group

Can we use our usual t-tools to analyze these data? How?
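One way to compute the group-wise summaries (a minimal sketch of my own, not the code from the Chapter 2 slides):

library(Sleuth2)
# five-number summaries plus means, and standard deviations, by treatment group
with(case0301, tapply(Rainfall, Treatment, summary))
with(case0301, tapply(Rainfall, Treatment, sd))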


Can we do this?

> t.test(Rainfall ~ Treatment, alternative="two.sided",
+        var.equal=TRUE, data=case0301)

        Two Sample t-test

data:  Rainfall by Treatment
t = -1.9982, df = 50, p-value = 0.05114
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -556.224179    1.431851
sample estimates:
mean in group Unseeded   mean in group Seeded
              164.5885               441.9846

How much did the violations of our assumptions affect these results?


Robustness

The t-tools may be used even when the assumptions are violated to a certain degree, because the t-tools are robust.

Robustness: A statistical procedure is robust to departures from a particular assumption if it is valid even when the assumption is not met.


Type 1: Robustness Against Departures from Normality

Recall that the Central Limit Theorem (CLT) states that sample averages have approximately Normal sampling distributions for large samples, regardless of the shape of the population distribution. As long as the samples are large enough, the t-ratio will follow an approximate t-distribution even if the data are non-Normal.
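As a rough illustration (my own simulation sketch, not from the slides or the textbook), we can check that the equal-variance t interval still has close to 95% coverage when both samples are moderately large draws from the same skewed population:

set.seed(3022)
cover <- replicate(2000, {
  x <- rexp(50)                       # skewed (exponential) population
  y <- rexp(50)                       # same population, so the true mean difference is 0
  ci <- t.test(x, y, var.equal = TRUE)$conf.int
  ci[1] < 0 & 0 < ci[2]               # does the 95% CI cover the true difference?
})
mean(cover)                           # should be close to 0.95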


Type 1: Robustness Against Departures from Normality


Effects of Skewness

- If the two populations have the same standard deviations and approximately the same shapes, and if n1 ≈ n2, then the validity of the t-tools is affected very little by skewness.
- If the two populations have the same standard deviations and approximately the same shapes, but n1 ≠ n2, then the validity of the t-tools is affected substantially by skewness. Larger sample sizes diminish this effect.
- If the skewness in the two populations differs considerably, the tools can be very misleading with small and moderate sample sizes.
- See Display 3.4 in the textbook for simulation results; a rough simulation sketch also follows below.
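A rough simulation sketch (my own illustration in the spirit of Display 3.4, not a reproduction of it): with a skewed population and very unequal sample sizes, we can estimate the actual size of a nominal 5% test and compare it with 0.05.

set.seed(3022)
reject <- replicate(5000, {
  x <- rexp(10)                        # small sample from a skewed population
  y <- rexp(100)                       # much larger sample from the same population
  t.test(x, y, var.equal = TRUE)$p.value < 0.05
})
mean(reject)                           # estimated size; compare with the nominal 0.05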

Type 2: Robustness Against Differing Standard Deviations


When we cannot assume σ1 = σ2, more serious problems may arise:
- sp no longer estimates any population parameter
- SE(x̄1 − x̄2) no longer estimates the standard deviation of x̄1 − x̄2, the difference between the averages
- the t-ratio no longer follows a t-distribution

What can we do?
- If n1 ≈ n2, the t-tools remain fairly valid even when σ1 ≠ σ2.
- When n1 and n2 are very different, we need the ratio σ1/σ2 to be between 1/2 and 2 to have reliable results.
- See Display 3.5 in the textbook for simulation results.
- Or drop the equal-variance assumption and use the unequal-variance (Welch) form of the t-test:
> t.test(x1, x2, alternative = 'two.sided', var.equal = FALSE)
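For illustration (this call is not in the original slides), the same unequal-variance form applied to the rainfall data:

library(Sleuth2)
# Welch two-sample t-test: does not assume equal standard deviations
t.test(Rainfall ~ Treatment, alternative = "two.sided",
       var.equal = FALSE, data = case0301)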

Type 3: Robustness Against Departures from Independence


There are two types of dependence (i.e., lack of independence) that commonly arise:
1. A cluster effect occurs when the data have been collected in subgroups. Observations in the same subgroup tend to be more similar in their responses than observations in different subgroups.
2. A serial effect occurs when measurements are taken over time and observations close together in time tend to be more similar (or more different) than observations collected at distant time points.

When the assumption of independence is violated, the standard error becomes very inaccurate. t-tools are usually not recommended in such cases.

Resistance and Outliers


An outlier is an observation judged to be far from its group average. A statistical procedure is resistant if it does not change very much when a small part of the data changes, perhaps drastically. Whether or not we should simply remove such observations depends on how resistant our tools are to changes in the data.

Question: Can you tell the difference between robustness and resistance?

Example of Outlier
[Figure: scatterplot of y against x with one observation lying far from the rest of the data, illustrating an outlier]


Example of Resistance
Consider a hypothetical sample: 10, 20, 30, 50, 70. The sample mean is 36, and the sample median is 30.
Now consider the sample: 10, 20, 30, 50, 700. What happens to the sample mean? What about the sample median?

The sample median is resistant to any change in a single observation, while the sample mean is not.
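A quick check in R, using the numbers from the slide:

x1 <- c(10, 20, 30, 50, 70)
x2 <- c(10, 20, 30, 50, 700)    # same data, except the largest value is now 700
c(mean(x1), median(x1))         # 36, 30
c(mean(x2), median(x2))         # 162, 30: the mean jumps, the median does not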

Resistance of t-Tools

Since the t-tools are based on the mean, they are not resistant. A small portion of the data can have a major influence on the results. One or two outliers can affect a 95% CI or change a p-value enough to alter a conclusion.
Solution: when you have an outlier, it is good practice to run your analysis with and without the outlier in the data set. Compare your results to see how influential the outlier in question is.
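A small sketch of that with/without comparison (the data and the cutoff below are hypothetical, chosen only for illustration):

y_all  <- c(12, 15, 14, 13, 16, 90)    # 90 is a suspected outlier
y_trim <- y_all[y_all < 50]            # same sample with the outlier removed
t.test(y_all,  mu = 10)$p.value        # analysis including the outlier
t.test(y_trim, mu = 10)$p.value        # analysis excluding the outlier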


Practical Strategies for the Two-Sample Problem

Our task is to size up the actual conditions, using the available data, and evaluate the appropriateness of the t-tools:
1. think about possible cluster and serial effects
2. evaluate the suitability of the t-tools by examining graphical displays (side-by-side histograms or box plots)
3. consider alternatives:
   a. transform the data (Section 3.5) to see if the transformed data look nicer
   b. use alternative tools that do not require model assumptions (Chapter 4)


Transformations of Data
For positive data, the most useful transformation is the logarithm (log), particularly the natural (base e) logarithm (e = 2.71828...).
log(1) = 0
log(e^x) = x
[Figure: the natural log function log(x) plotted for x between 0 and 10]


When Do We Use Log Transformation


In a set of data, if the ratio max/min > 10, then a log transformation might be useful.
If the samples are skewed, with the group with the larger average also having the greater spread, then a log transformation could be a good choice.
[Figure: side-by-side plots of a skewed variable y before transformation (values up to about 1200) and of log(y) after transformation]


Cloud Seeding - Transformation


Recall both groups are skewed, with the seeded days having a larger average and a greater spread.
> max(case0301$Rainfall[case0301$Treatment=="Seeded"])/
+   min(case0301$Rainfall[case0301$Treatment=="Seeded"])
[1] 669.6586
> max(case0301$Rainfall[case0301$Treatment=="Unseeded"])/
+   min(case0301$Rainfall[case0301$Treatment=="Unseeded"])
[1] 1202.6
> case0301$logRain <- with(case0301, log(Rainfall))
> head(case0301)
  Rainfall Treatment  logRain
1   1202.6  Unseeded 7.092241
2    830.1  Unseeded 6.721546
3    372.4  Unseeded 5.919969
4    345.5  Unseeded 5.844993
5    321.2  Unseeded 5.772064
6    244.3  Unseeded 5.498397
[Figure: boxplots of Rainfall by Treatment before transformation (left) and of log(Rainfall) by Treatment after transformation (right), Unseeded vs. Seeded]


Two-Sample t-Analysis
Before:
> t.test(Rainfall ~ Treatment, alternative="two.sided",
+        var.equal=TRUE, data=case0301)

        Two Sample t-test

data:  Rainfall by Treatment
t = -1.9982, df = 50, p-value = 0.05114
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -556.224179    1.431851
sample estimates:
mean in group Unseeded   mean in group Seeded
              164.5885               441.9846

After:
> t.test(logRain ~ Treatment, data=case0301,
+        alternative="less", var.equal=TRUE)

        Two Sample t-test

data:  logRain by Treatment
t = -2.5444, df = 50, p-value = 0.007041
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
       -Inf -0.3904045
sample estimates:
mean in group Unseeded   mean in group Seeded
              3.990406               5.134187

There is convincing evidence that seeding increased rainfall.


Multiplicative Treatment Effect


Definition: Suppose Z = log(Y). It is estimated that the response of an experimental unit to treatment 2 will be e^(Z̄2 − Z̄1) times as large as its response to treatment 1 (where Z̄1 = average of log(Y1), and similarly for Z̄2).
> m1 <- mean(case0301$logRain[case0301$Treatment=='Seeded'])
> m2 <- mean(case0301$logRain[case0301$Treatment=='Unseeded'])
> (diffmeans <- m1 - m2)
[1] 1.143781
> (est.mult.effect <- exp(diffmeans))
[1] 3.138614

We interpret this in the following way: The volume of rainfall produced by a seeded cloud is estimated to be 3.14 times as large as the volume that would have been produced in the absence of seeding.


Confidence Interval
> (test <- t.test(logRain ~ Treatment, data=case0301, var.equal=TRUE))

        Two Sample t-test

data:  logRain by Treatment
t = -2.5444, df = 50, p-value = 0.01408
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.0466973 -0.2408651
sample estimates:
mean in group Unseeded   mean in group Seeded
              3.990406               5.134187

> test$conf.int
[1] -2.0466973 -0.2408651
attr(,"conf.level")
[1] 0.95
> exp(test$conf.int)
[1] 0.1291608 0.7859476
attr(,"conf.level")
[1] 0.95

A 95% confidence interval for the multiplicative effect (unseeded relative to seeded) is 0.129 to 0.786 times.

A Strategy for Dealing with Outliers


If the outlying observation resulted from measurement error or contamination from another population:
- if the right value is known, correct the observation
- if not, leave the observation out of the analysis

Often there is no way to know what caused the outlier(s). Two tools exist:
- employ a resistant statistical tool (Chapter 4)
- adopt a careful strategy; see Display 3.6 in the text

The idea is to perform the analysis with and without the suspected outliers.
- If both analyses give the same answer, only report results INCLUDING the suspected outliers.
- If not, report both results.


Removing Outliers and Other Data Points


> library(Sleuth2); ex0327[15:17, ]
        Country Life Income           Type
15     Portugal 68.1    956 Industrialized
16 South_Africa 68.2    NaN Industrialized
17       Sweden 74.7   5596 Industrialized
> range(ex0327$Income, na.rm=TRUE)
[1]  110 5596
> data <- ex0327; data[16, 'Income'] <- 10000   # set it as an outlier
> data[15:17, ]
        Country Life Income           Type
15     Portugal 68.1    956 Industrialized
16 South_Africa 68.2  10000 Industrialized
17       Sweden 74.7   5596 Industrialized
> d1 <- subset(data, Income < 8000); d1[15:17, ]
        Country Life Income           Type
15     Portugal 68.1    956 Industrialized
17       Sweden 74.7   5596 Industrialized
18  Switzerland 72.1   2963 Industrialized

> ### dealing with Missing data ###
> (cc <- complete.cases(ex0327))
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE ...
[16] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE ...
> data2 <- ex0327[cc, ]; data2[15:17, ]
        Country Life Income           Type
15     Portugal 68.1    956 Industrialized
17       Sweden 74.7   5596 Industrialized
18  Switzerland 72.1   2963 Industrialized


Q: How many conservative economists does it take to change a light bulb?


A: None, they're all waiting for the unseen hand of the market to correct the lighting disequilibrium.

