
STATISTICAL INFERENCE

Prepared by:
Antonio E. Chan, M.D.

Session 1
Introduction to Statistical Inference
Learning objectives
I. Introduce the topic of statistical inference
A. Differentiate descriptive statistics from inferential statistics
B. Enumerate and discuss the components of inferential statistics
C. Discuss population parameters and sample statistics
D. Define some terms
1. Sampling distribution of a statistic
2. Standard deviation of the sampling distribution of a statistic
3. Mean of sample means
E. Enumerate the probability distributions
F. Discuss the normal distribution
G. Discuss estimation and hypothesis testing
II. Discuss and give examples of test statistics for one population
III. Discuss and give examples of test statistics for two populations
IV. Discuss and give examples of test statistics for more than two populations
V. Discuss and give examples of correlation and regression
Descriptive statistics

refers to the different methods applied in
order to organize, summarize and present
data in a form which will make them easier
to analyze and interpret (tabulation, graphical
presentation, computation of averages as well
as measures of variability).

Inferential statistics

is the process of generalizing or drawing
conclusions about the target population on the
basis of results obtained from a sample
Definition of Terms
1. Parameter
is a numerical constant obtained by
observing the total population
2. Statistic
is a numerical variable obtained by
observing a random sample from
the population

Conventional Notations for the
Common Summarizing Figures

Summarizing Figure                    Parameter    Statistic
Mean                                  μ            x̄
Variance                              σ²           s²
Standard deviation                    σ            s
Proportion                            π            p
Difference between two means          μ1 − μ2      x̄1 − x̄2
Difference between two proportions    π1 − π2      p1 − p2

Statistical Inference
1. Estimation
- The process by which a statistic computed for a
random sample is used to approximate
(estimate) the corresponding parameter.

a. Point estimate - is a single numerical value
used to approximate the population parameter
(sample mean or sample proportion)
b. Interval estimate (Confidence interval) -
consists of two numbers, a lower limit and an
upper limit which serve as the bounding values
within which the parameter is expected to lie
with a certain degree of confidence
Statistical Inference

2. Hypothesis testing
- Comprises a set of procedures
- A hypothesis is either rejected or not,
based on the probability of occurrence of
the sample results if the null hypothesis is
true


Steps in Hypothesis Testing
1. State the research question in
terms of statistical hypotheses
2. Decide on the appropriate test
statistic
3. Select the level of significance for
the statistical test
4. Determine the value the test
statistic must attain to be declared
significant (Critical ratio or value)
5. Perform the calculation
6. Draw and state the conclusion.
Definition of Terms
1. Sampling variation
2. Sampling distribution of statistic
3. Standard error (SE)
4. Sampling distribution of sample
means
5. Mean of sample means


Population of months since last examination

Patient    Number of Months Since Last Examination
1          12
2          13
3          14
4          15
5          16

Population mean (μ) = 70 / 5 = 14
Twenty-five samples of size 2 patients each

Sample    Patients Selected    Number of Months for each    Mean
1         1,1                  12, 12                       12.0
2         1,2                  12, 13                       12.5
3         1,3                  12, 14                       13.0
4         1,4                  12, 15                       13.5
5         1,5                  12, 16                       14.0
6         2,1                  13, 12                       12.5
7         2,2                  13, 13                       13.0
8         2,3                  13, 14                       13.5
9         2,4                  13, 15                       14.0
10        2,5                  13, 16                       14.5
11        3,1                  14, 12                       13.0
12        3,2                  14, 13                       13.5
13        3,3                  14, 14                       14.0
14        3,4                  14, 15                       14.5
15        3,5                  14, 16                       15.0
16        4,1                  15, 12                       13.5
17        4,2                  15, 13                       14.0
18        4,3                  15, 14                       14.5
19        4,4                  15, 15                       15.0
20        4,5                  15, 16                       15.5
21        5,1                  16, 12                       14.0
22        5,2                  16, 13                       14.5
23        5,3                  16, 14                       15.0
24        5,4                  16, 15                       15.5
25        5,5                  16, 16                       16.0
The frequency distribution of a statistic taken over all
possible samples is called the sampling distribution of
the statistic.

(Here the statistic is the sample mean.)


Properties of Sampling Distribution
1. The mean of sample means = Population mean (μ)
350 / 25 = 14
2. The standard deviation of the sampling distribution of the
mean (called the standard error)

$SE = \sigma / \sqrt{n}$ or $SE = s / \sqrt{n}$

3. The sampling distribution of the sample means is normally
distributed if the population from which the samples are
drawn is itself normally distributed.
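The first two properties can be checked directly on the five-patient population. The following is a minimal sketch that enumerates all 25 samples of size 2 (drawn with replacement, as in the table) and compares the spread of the sample means with σ/√n; the variable names are illustrative only.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

# Population of months since last examination (patients 1 to 5).
population = [12, 13, 14, 15, 16]
n = 2  # sample size

# All 5 x 5 = 25 ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=n))
sample_means = [mean(s) for s in samples]

mu = mean(population)                 # population mean
sigma = pstdev(population)            # population standard deviation
mean_of_means = mean(sample_means)    # property 1: equals mu
se_empirical = pstdev(sample_means)   # spread of the 25 sample means
se_formula = sigma / sqrt(n)          # property 2: sigma / sqrt(n)

print(f"mean of sample means = {mean_of_means}")
print(f"SE from the 25 means = {se_empirical:.4f}")
print(f"sigma / sqrt(n)      = {se_formula:.4f}")
```

Both standard-error figures agree because every possible sample is included; with a random subset of samples they would only agree approximately.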
Applications of Sampling Distribution
1. Determine the probability of obtaining
a sample statistic with a pre-specified
magnitude from a given population
2. Estimate parameters
3. Test hypothesis regarding parameters
Probability Distribution
1. Binomial Distribution
2. Poisson Distribution
3. Normal or Gaussian Distribution

Normal Distribution Curve

[Figure: normal curve with the horizontal axis marked at −3σ, −2σ, −1σ, μ, +1σ, +2σ, +3σ]
Characteristics of the Normal Distribution Curve
Bell-shaped and symmetrical about the mean
Completely determined by two parameters,
i.e., the mean and the standard deviation.
Mean, median and mode are equal
Total area under the curve is equal to 1 or 100%
Long tapering tails that extend infinitely in both
directions
Area within μ ± 1σ = 68.3% ; μ ± 2σ = 95.4% ; μ ± 3σ = 99.7%
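The Normal Distribution Curve
To obtain a probability,
the normal distribution is transformed to a z-distribution

Normal distribution curve: the actual measurement (x-value), with mean μ and standard deviation σ
Standard normal curve: the z-value (standard variate or standard score), with μ = 0, σ = 1

[Figure: a normal distribution curve on the x-axis and the corresponding standard normal curve on the z-axis, with the same shaded area]

NORMAL DISTRIBUTION CURVE
Formula :

Individual observation: $z = \frac{x - \mu}{\sigma}$
Sample from a sampling distribution: $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

[Figure: areas under the standard normal curve above z, below z, and between z_1 and z_2]
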
Application of the Normal Distribution Curve
1. Computation of proportions, percentages or
probabilities of x-values falling in a given range.

$z = \frac{x - \mu}{\sigma}$

2. Determination of the x-values that
bound a specified area under the normal
curve.

$x = \mu + z\sigma$
$x = \mu - z\sigma$
Example
Assume that the distribution of systolic blood
pressure of non-hypertensive men has a mean of
110 mm Hg and a standard deviation of 15 mm Hg.
What is the proportion of non-hypertensive
men who have systolic blood pressure above 120
mm Hg?

[Figure: normal curve with μ = 110 and σ = 15; the area above x = 120 is shaded]

$z = \frac{x - \mu}{\sigma} = \frac{120 - 110}{15} = 0.67$ (area above z = 0.67 is .2514)
What is the proportion of non-hypertensive men with
systolic blood pressure of less than 90 mm Hg?

[Figure: normal curve with μ = 110 and σ = 15; the area below x = 90 is shaded]

$z = \frac{x - \mu}{\sigma} = \frac{90 - 110}{15} = -1.33$ (area below z = −1.33 is 0.0918)
What is the proportion of non-hypertensive men
who have systolic blood pressure between 90 and
120 mm Hg?

[Figure: normal curve with μ = 110 and σ = 15; the area between x_1 = 90 and x_2 = 120 is shaded]

$P(90 < x < 120) = 1 - (.2514 + .0918) = .6568$
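The three blood-pressure proportions can be reproduced without a z table. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the results differ slightly from the table-based figures because z is not rounded to two decimals here.

```python
from statistics import NormalDist

# Systolic BP of non-hypertensive men: mean 110 mm Hg, SD 15 mm Hg.
bp = NormalDist(mu=110, sigma=15)

p_above_120 = 1 - bp.cdf(120)          # area to the right of x = 120
p_below_90 = bp.cdf(90)                # area to the left of x = 90
p_between = bp.cdf(120) - bp.cdf(90)   # area between 90 and 120

print(f"P(x > 120)      = {p_above_120:.4f}")
print(f"P(x < 90)       = {p_below_90:.4f}")
print(f"P(90 < x < 120) = {p_between:.4f}")
```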
Example
Suppose a health care provider studies a randomly
selected group of 25 men and women between 20
and 39 years of age and finds that their mean
systolic BP is 124 mm Hg. Assume that systolic BP
is a normally distributed random variable with a
known mean of 120 mm Hg and a standard deviation
of 10 mm Hg in the population of normal healthy
adults. How often would a sample of 25 patients
have a mean systolic BP this high or higher?

[Figure: sampling distribution of the mean with μ = 120 and σ = 10; the area above x̄ = 124 is shaded]

$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{124 - 120}{10 / \sqrt{25}} = \frac{4}{2} = 2.0$ (area above z = 2.0 is 0.023)
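Example
Suppose a health care provider wants to detect
adverse effects on systolic BP in a random sample
of 25 patients using a drug that causes
vasoconstriction. The provider decides that a mean
systolic BP in the upper 5% of the distribution is
cause for alarm; therefore, the provider must
determine the value that divides the upper 5% of
the sampling distribution from the lower 95%.

[Figure: sampling distribution of the mean with μ = 120 and σ = 10; the cut-off x̄ separates the lower 95% from the upper 5%, and the same split is shown on the z-axis]

The z-value that leaves an area of .05 in the upper tail is 1.645.

$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$, so $1.645 = \frac{\bar{x} - 120}{10 / \sqrt{25}}$

$\bar{x} = \mu + z \left( \frac{\sigma}{\sqrt{n}} \right) = 120 + 1.645 \left( \frac{10}{\sqrt{25}} \right)$

$\bar{x} = 120 + 1.645(2)$
$\bar{x} = 123.29$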
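Both sampling-distribution examples (the probability of a mean of 124 or higher, and the cut-off for the upper 5%) can be sketched with the standard library. The code assumes the same population values as the text (μ = 120, σ = 10, n = 25); variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 120, 10, 25
se = sigma / sqrt(n)  # standard error of the mean = 10 / 5 = 2

# Sampling distribution of the mean for samples of 25 patients.
sampling_dist = NormalDist(mu, se)

# How often would a sample mean be 124 or higher?
p_mean_at_least_124 = 1 - sampling_dist.cdf(124)

# Which sample mean separates the upper 5% from the lower 95%?
z_95 = NormalDist().inv_cdf(0.95)   # about 1.645
cutoff = mu + z_95 * se

print(f"P(mean >= 124) = {p_mean_at_least_124:.4f}")
print(f"z for upper 5% = {z_95:.3f}")
print(f"cut-off mean   = {cutoff:.2f}")
```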
Definition of Terms
Confidence coefficient refers to the
degree of reliability one may place in the
estimate of the parameter (1 − α)

Estimation
Components of an estimate:
Sample statistic (x̄ or p)
Reliability coefficient (z-value or t-value)
Standard error

1. Point estimate
The sample statistic itself (x̄ or p)

2. Interval estimate
General formula:
Sample statistic ± reliability coefficient x standard error
Ex: $\bar{x} \pm z \frac{\sigma}{\sqrt{n}}$

Step 1 State the hypothesis
1. Null hypothesis
- a working hypothesis: hypothesis of no
difference
- Formulated for the purpose of being rejected
- Can be stated in words or symbols (use
parameter symbols)
2. Alternative hypothesis
- The experimenter's research hypothesis
- May or may not have direction (one-sided or
two-sided or one-tailed or two-tailed)
- Can be stated also in words or symbols
Step 2 Select the appropriate test statistic
1. Parametric test statistic
- Is based on assumptions made concerning the parameters
of the population from which the sample was drawn
- The validity of the tests depends on whether these
assumptions about the nature of the sampled population
are satisfied or not
- Scales of measurement: interval or ratio
2. Non-parametric test statistic (Distribution-free)
- Has fewer and less stringent assumptions
- No attempt is made to specify and identify the form of the
population from which the sample was drawn
- Scales of measurement: nominal or ordinal
Grouping of Statistical tests
1. Number of populations studied
2. Types of samples (Independent or
related samples)
3. Nature of the study (Test of significance
or relationship)

Test of Significance for One Population

Parametric tests
Mean
- Z-test
- Student's t-test
- Paired t-test (Related)
Proportion
- Z-test

Non-parametric tests
- Binomial test
- χ² test
- Kolmogorov-Smirnov one-sample test

Test of Significance for Two Populations

Parametric tests
Mean
- Z test
- t test
Proportion
- Z test or χ² test

Non-parametric tests
- McNemar test (Related)
- Fisher's Exact Probability test
- χ² test for independent samples
- Sign test (Related)
- Wilcoxon Matched Pairs Signed Rank Test (Related)
- Median test
- Mann-Whitney U test
- Kolmogorov-Smirnov Two Sample Test
Test of Significance for Three or More Populations

Parametric tests
Mean
- Analysis of Variance (ANOVA)
Proportion
- χ² test

Non-parametric tests
- χ² test
- Friedman Two-Way Analysis of Variance by Rank (Related)
- Kruskal-Wallis One-Way Analysis of Variance by Rank
Test of Relationship

Parametric tests
- Simple Linear Regression and Correlation

Non-parametric tests
- Spearman Rank Correlation Coefficient
Criteria for Test Statistic Selection
1. Scales of measurement used
2. The aim for doing the tests
3. Types of Samples (Independent vs
Paired or related samples)
4. Assumptions of the test
Step 3 Select the level of significance
The probability of committing a Type I error
The error of rejecting a true null hypothesis
Symbol: α
Common values used: .10, .05, .01

Level of Confidence    Level of Significance    z-value for alternative hypothesis
(1 − α)                (α)                      One-tailed    Two-tailed
90%                    .10                      1.28          1.645
95%                    .05                      1.645         1.96
99%                    .01                      2.33          2.58
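The z-values in this table come from the inverse of the standard normal cumulative distribution. The sketch below recomputes them; `critical_z` is a hypothetical helper name introduced only for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def critical_z(alpha: float, two_tailed: bool) -> float:
    """Return the positive critical z-value for a given significance level.

    A two-tailed test splits alpha equally between the two tails.
    """
    tail_area = alpha / 2 if two_tailed else alpha
    return std_normal.inv_cdf(1 - tail_area)


for alpha in (0.10, 0.05, 0.01):
    one = critical_z(alpha, two_tailed=False)
    two = critical_z(alpha, two_tailed=True)
    print(f"alpha = {alpha:.2f}: one-tailed {one:.3f}, two-tailed {two:.3f}")
```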
Step 4 Determine the Critical ratio
(Critical value)
Factors considered in determining the
critical ratio
1. Level of significance
2. Test statistic used
3. Direction of the alternative hypothesis
4. Sample size
[Figure: critical regions for a two-sided (two-tailed) test: a rejection region of area α/2 in each tail beyond the critical ratios, with the non-rejection region between them]
[Figure: critical regions for a one-sided (one-tailed) test: a single rejection region of area α beyond the critical ratio, with the non-rejection region on the other side]
Two-tailed test
is one wherein the alternative hypothesis (H1)
simply states that there is a difference between
the two study groups
does not specify the direction of the difference
One-tailed test
is one wherein the alternative hypothesis (H1)
specifies the direction of the difference
Step 6 Draw and state the conclusion
Decision rule:
Make decision whether to reject or not to reject the null
hypothesis based on the computed value of the test
statistic or based on equivalent p-value of the computed
test statistic
Reject Ho if the computed value > tabulated value or
critical value
Do not reject Ho if the computed value ≤ tabulated value
or critical value
Reject Ho if the computed p-value < α
Do not reject Ho if the computed p-value ≥ α

P-value
The probability of obtaining, when Ho is true, a value of the
test statistic as extreme as or more extreme (in the
appropriate direction) than the one actually computed.
The smallest value of α for which the null hypothesis can be
rejected
Calculated after the statistical test has been performed
Reporting the actual p-value communicates the significance
of the findings more precisely
If the p-value is less than α, the null hypothesis is rejected
Possible Errors in Hypothesis Testing

                               TRUE STATE OF NATURE
DECISION                       Ho is true                  Ho is false (H1 is true)
Accept Ho                      Correct decision (1 − α)    Type II error (β)
Reject Ho (assume H1 is true)  Type I error (α)            Correct decision (1 − β)
Session 2
Statistical Tests for One Population
Statistical test for one population
1. Single population mean
- z-test
- t-test
2. Single population proportion
- z-test
Hypothesis testing about a population mean
under three different conditions
1. When sampling is from a normally
distributed population of values with
known variance
2. When sampling is from a normally
distributed population with unknown
variance
3. When sampling is from a population that
is not normally distributed
First assumption for the use of the
z test for one population mean
It is assumed that the sample which is
randomly selected comes from a
population that is normally distributed
and the population variance is known
Example
Suppose a researcher, interested in obtaining an
estimate of the average level of some enzyme in
a certain human population, takes a sample of 10
individuals, determines the level of the enzyme in
each, and computes a sample mean x̄ = 22.
Suppose further it is known that the variable of
interest is approximately normally distributed
with a variance of 45. Can we conclude that the
mean enzyme level in this population is different
from 25? Use α = .05
1. State the hypothesis
Ho: The population mean enzyme level is equal to 25 (μ = 25)
HA: The population mean enzyme level is not equal to 25 (μ ≠ 25)

2. Select the appropriate test statistic: z test

$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$

3. Select the level of significance: α = .05
4. Determine the critical ratio or critical value: z(α/2) = ±1.96
5. Do the computation of the test statistic

$z = \frac{22 - 25}{\sqrt{45 / 10}} = \frac{-3}{2.1213} = -1.41$

6. Draw and state the conclusion
We are unable to reject the null hypothesis since -1.41 is not in the
rejection region. The computed value of the test statistic is not
significant at the .05 level.
We conclude that μ may be equal to 25
Testing Ho by means of a 95% Confidence Interval
95% Confidence Interval =
Estimator ± Reliability Coefficient x Standard error

$\bar{x} \pm z \frac{\sigma}{\sqrt{n}}$

Interpretation
When testing a null hypothesis by means of a two-sided
confidence interval, we reject Ho at the α level of significance if
the hypothesized parameter is not contained within the 100
(1 − α) percent confidence interval. If the hypothesized parameter
is contained within the interval, Ho cannot be rejected at the α
level of significance.

For the enzyme example:

$\bar{x} \pm z \frac{\sigma}{\sqrt{n}} = 22 \pm 1.96 \sqrt{45 / 10}$
= 22 ± 1.96 (2.1213)
= 22 ± 4.16
17.84, 26.16

The hypothesized parameter μ0 = 25 lies within the interval, so Ho cannot be rejected.
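The enzyme example (z test with known variance, followed by the equivalent confidence-interval check) can be sketched as follows. `one_sample_z` is a hypothetical helper written for this illustration, not part of any library.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(xbar, mu0, variance, n, alpha=0.05):
    """Two-sided z test for one mean when the population variance is known."""
    se = sqrt(variance / n)                        # standard error of the mean
    z = (xbar - mu0) / se                          # test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    ci = (xbar - z_crit * se, xbar + z_crit * se)  # 100(1 - alpha)% interval
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
    return z, z_crit, ci, p_value


# Enzyme example: x-bar = 22, hypothesized mean 25, variance 45, n = 10.
z, z_crit, ci, p_value = one_sample_z(22, 25, 45, 10)
reject = abs(z) > z_crit

print(f"z = {z:.2f}, critical value = ±{z_crit:.2f}, reject Ho: {reject}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), p-value = {p_value:.3f}")
```

The two approaches agree: the statistic falls inside the non-rejection region exactly when the hypothesized mean falls inside the interval.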
Assumption for the use of t test
for one population mean
It is assumed that the sample which is
randomly selected comes from a
population that is normally distributed
but the population variance is unknown
Example
Researchers collected serum amylase values from a
random sample of 15 apparently healthy subjects.
They want to know whether they can conclude that
the mean of the population from which the sample
of serum amylase determinations came is different
from 120. The mean and standard deviation
computed from the sample are 96 and 35 units/100
ml, respectively. Use α = .05.
1. State the hypothesis
Ho: The population mean serum amylase is equal to 120 units/100 ml
HA: The population mean serum amylase is not equal to 120 units/100 ml

2. Select the appropriate test statistic: t-test

$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$

3. Select the level of significance: α = .05
4. Determine the critical ratio or critical value: t(α/2, n−1) = ±2.1448
5. Do the computation of the test statistic

$t = \frac{96 - 120}{35 / \sqrt{15}} = \frac{-24}{9.04} = -2.65$

6. Draw and state the conclusion

Reject Ho since -2.65 falls in the rejection region.

The population mean is not equal to 120.
Testing Ho by means of a 95% Confidence Interval
95% Confidence Interval =

$\bar{x} \pm t \frac{s}{\sqrt{n}} = 96 \pm 2.1448 \left( \frac{35}{\sqrt{15}} \right)$

96 ± 2.1448 (9.04)
96 ± 19.4
76.6, 115.4
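The serum amylase example can be recomputed from its summary statistics. This sketch takes the critical value 2.1448 (14 degrees of freedom) from the text rather than computing it, since the Python standard library has no t distribution; `one_sample_t` is a hypothetical helper name. The unrounded statistic is about −2.66, which the slide reports as −2.65 after rounding the standard error to 9.04.

```python
from math import sqrt


def one_sample_t(xbar, mu0, s, n, t_crit):
    """t statistic and confidence interval from summary statistics."""
    se = s / sqrt(n)                               # estimated standard error
    t = (xbar - mu0) / se                          # test statistic, n - 1 df
    ci = (xbar - t_crit * se, xbar + t_crit * se)  # two-sided interval
    return t, ci


# Critical value for alpha = .05, two-tailed, 14 df, as quoted in the text.
T_CRIT_14DF = 2.1448

t, ci = one_sample_t(xbar=96, mu0=120, s=35, n=15, t_crit=T_CRIT_14DF)
reject = abs(t) > T_CRIT_14DF

print(f"t = {t:.3f}, reject Ho: {reject}")
print(f"95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```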
Second assumption for the use of the
z test for one population mean
It is assumed that the sample which is
randomly selected comes from a
population that is normally distributed and
the population variance is unknown,
however, the sample size is large.
(Central limit theorem)
Example
In a health survey of a certain community 150
persons were interviewed. One of the items of
information obtained was the number of
prescriptions each person had had filled during the
past year. The average number for the 150 people
was 5.8 with a standard deviation of 3.1. The
investigator wishes to know if these data provide
sufficient evidence to indicate that the population
mean is greater than 5. Let α = .05
1. State the hypothesis
Ho: The average number of prescriptions is ≤ 5
HA: The average number of prescriptions is > 5

2. Select the appropriate test statistic: z test

$z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$

3. Select the level of significance: α = .05
4. Determine the critical ratio or critical value (one-tailed) = 1.645
5. Do the computation of the test statistic

$z = \frac{5.8 - 5.0}{3.1 / \sqrt{150}} = \frac{0.8}{0.2531} \approx 3.2$

6. Draw and state the conclusion

Reject Ho since 3.2 is in the region of rejection

The average number of prescriptions is greater than 5
Hypothesis testing: A Single Population Proportion
Assumption:

The sampling distribution of p is
approximately normally distributed in
accordance with the central limit theorem
Example
Suppose we are interested in knowing what
proportion of automobile drivers regularly
wear seat belts. In a survey of 300 adult
drivers, 123 said they regularly wear seat
belts. Can we conclude from these data that
in the sampled population the proportion
who regularly wear seat belts is not .50?
Let α = .05.

1. State the hypothesis
Ho: The population proportion who regularly wear seat belts = .50
HA: The population proportion who regularly wear seat belts ≠ .50

2. Select the appropriate test statistic: z test

$z = \frac{p - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}}$

or

$z = \frac{p - p_0}{\sqrt{p_0 q_0 / n}}$

3. Select the level of significance: α = .05
4. Determine the critical ratio or critical value = ±1.96
5. Do the calculation for the test statistic

$z = \frac{.41 - .50}{\sqrt{(.5)(.5) / 300}} = \frac{-.09}{.0289} = -3.11$

6. Draw and state the conclusion
Reject Ho since -3.11 is in the region of rejection
The population proportion who regularly wear seat belts is not
equal to .50
Testing Ho by means of a 95% Confidence Interval
95% Confidence Interval =

$p \pm z \sqrt{\frac{p(1 - p)}{n}}$

Computed here with the standard error from the test, $\sqrt{(.5)(.5) / 300} = .0289$:

.41 ± 1.96 x .0289
.41 ± .06
.35, .47
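The seat-belt example can be sketched in a few lines. `one_proportion_z` is a hypothetical helper; like the slide, it uses the hypothesized proportion in the standard error. Without intermediate rounding the statistic is about −3.12 rather than −3.11.

```python
from math import sqrt


def one_proportion_z(successes, n, pi0):
    """z test for a single proportion, using pi0 in the standard error."""
    p = successes / n                  # sample proportion
    se = sqrt(pi0 * (1 - pi0) / n)     # standard error under Ho
    z = (p - pi0) / se                 # test statistic
    return p, se, z


p, se, z = one_proportion_z(successes=123, n=300, pi0=0.50)

Z_CRIT = 1.96  # two-tailed critical value for alpha = .05
reject = abs(z) > Z_CRIT
ci = (p - Z_CRIT * se, p + Z_CRIT * se)

print(f"p = {p:.2f}, SE = {se:.4f}, z = {z:.2f}, reject Ho: {reject}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```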
Session Three
Statistical Tests for Two Populations
Hypothesis testing for two populations
1. Difference between two population
means
- z test
- t test
- Paired t test
2. Difference between two population
proportions
- z test
- χ² test
z test for the difference between
two population means
Assumptions 1:
1.Each of the two simple random samples
has been drawn from a normally
distributed population with known
variances
2.Each of the two random samples are
independent from each other
Test Statistic

$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
Different hypotheses
1. Ho: μ1 − μ2 = 0, HA: μ1 − μ2 ≠ 0
2. Ho: μ1 − μ2 ≥ 0, HA: μ1 − μ2 < 0
3. Ho: μ1 − μ2 ≤ 0, HA: μ1 − μ2 > 0

Example
Researchers wish to know if the data they have
collected provide sufficient evidence to indicate a
difference in mean serum uric acid levels between
normal individuals and individuals with mongolism.
The data consist of serum uric acid readings on 12
mongoloid individuals and 15 normal individuals. The
means are x̄1 = 4.5 mg/100 ml and x̄2 = 3.4 mg/100
ml. The data constitute two independent simple
random samples each drawn from a normally
distributed population with a variance equal to 1. Let
α = .05

1. State the hypothesis
Ho: The mean difference in the serum uric acid between mongoloid
and normal individuals is equal to zero
HA: The mean difference in the serum uric acid between mongoloid
and normal individuals is not equal to zero
2. Select the appropriate test statistic

$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$

3. Select the level of significance: α = .05
4. Determine the critical ratio or critical value = ±1.96
5. Perform the calculation for the test statistic

$z = \frac{(4.5 - 3.4) - 0}{\sqrt{\frac{1}{12} + \frac{1}{15}}} = \frac{1.1}{.39} = 2.82$

6. Draw and state the conclusion
Reject Ho since 2.82 is in the rejection region
The mean serum uric acid levels of the two populations are not equal
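Testing Ho by means of a 95% Confidence Interval
for the difference between two means

$(\bar{x}_1 - \bar{x}_2) \pm z \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$

$(4.5 - 3.4) \pm 1.96 \sqrt{\frac{1}{12} + \frac{1}{15}}$
1.1 ± .8

.3, 1.9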
z test for the difference between two
population means
Assumption 2:
When each of two independent simple random
samples has been drawn from a population that is
not normally distributed. The results of the central
limit theorem may be employed if sample sizes are
large (say > 30)

Test statistic is

$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
Example
A hospital administrator wished to know if the
population which patronizes hospital A has a larger
mean family income than does the population which
patronizes hospital B. The data consist of the family
incomes of 75 patients admitted to hospital A and
of 80 patients admitted to hospital B. The sample
means are x̄1 = Php 6,800.00 and x̄2 = Php 5,450.00.
The standard deviations are σ1 = Php 600.00 and
σ2 = Php 500.00. Let α = .01.
1. State the hypothesis
Ho: The mean difference in family income between the two
populations is ≤ zero
HA: The mean difference in family income between the two
populations is > zero
2. Select the appropriate test statistic

$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$

3. Select the level of significance: α = .01
4. Determine the critical ratio or critical value: z = 2.33 (one-tailed)
5. Perform the calculation for the test statistic

$z = \frac{(6800 - 5450) - 0}{\sqrt{\frac{(600)^2}{75} + \frac{(500)^2}{80}}} = \frac{1350}{89} = 15.17$

6. Draw and state the conclusion
Reject Ho since 15.17 is in the rejection region
These data indicate that the population patronizing hospital A has
a larger mean family income than does the population patronizing
hospital B.
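Both two-sample z examples (serum uric acid and family income) use the same statistic, so one small function covers them. `two_sample_z` is a hypothetical helper written for this sketch. The uric acid result comes out near 2.84 instead of 2.82 because the standard error is not rounded to .39 first.

```python
from math import sqrt


def two_sample_z(xbar1, xbar2, var1, var2, n1, n2, delta0=0.0):
    """z statistic for the difference between two independent means.

    var1 and var2 are the (known or large-sample) variances;
    delta0 is the hypothesized difference mu1 - mu2.
    """
    se = sqrt(var1 / n1 + var2 / n2)       # standard error of the difference
    z = ((xbar1 - xbar2) - delta0) / se
    return z, se


# Serum uric acid: two-tailed test, critical value 1.96.
z_uric, se_uric = two_sample_z(4.5, 3.4, 1, 1, 12, 15)

# Family income: one-tailed test, critical value 2.33.
z_income, se_income = two_sample_z(6800, 5450, 600**2, 500**2, 75, 80)

print(f"uric acid: z = {z_uric:.2f} (SE = {se_uric:.3f})")
print(f"income:    z = {z_income:.2f} (SE = {se_income:.2f})")
```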
t test for the difference between the
two population means
Assumption:
- The two random samples are independent
- Each drawn from a normally distributed
population
- The population variances are unknown but
are assumed to be equal.
When the population variances are unknown,
but assumed to be equal, it is appropriate to
pool the sample variances by means of the
following formula

$s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$

Test statistic is

$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}}$

or

$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

Example
A research team collected serum amylase data from
a sample of healthy subjects and from a sample of
hospitalized subjects. They wish to know if they
would be justified in concluding that the population
means are different. The data consist of serum
amylase determinations on n2 = 15 healthy subjects
and n1 = 22 hospitalized subjects. The sample means
and standard deviations are as follows. Let α = .05
x̄1 = 120 units/ml, s1 = 40 units/ml
x̄2 = 96 units/ml, s2 = 35 units/ml

1. State the hypothesis
Ho: The mean difference in serum amylase between the healthy and
hospitalized subjects is equal to zero
HA: The mean difference in serum amylase between the healthy and
hospitalized subjects is not equal to zero

2. Select the appropriate test statistic

$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}}$

3. Select the level of significance: α = .05
4. Determine the critical ratio or critical value = ±2.0301 (35 degrees of freedom)
5. Perform the calculation of the test statistic

$s_p^2 = \frac{(22 - 1)(40)^2 + (15 - 1)(35)^2}{22 + 15 - 2} = 1450$

$t = \frac{(120 - 96) - 0}{\sqrt{\frac{1450}{22} + \frac{1450}{15}}} = \frac{24}{12.75} = 1.88$

6. Draw and state the conclusion
Unable to reject Ho since 1.88 is in the region of non-rejection
Cannot conclude that the two population means are different
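The pooled-variance calculation for the serum amylase example can be sketched from the summary statistics alone. `pooled_t` is a hypothetical helper; the critical value 2.0301 is taken from the text.

```python
from math import sqrt


def pooled_t(xbar1, s1, n1, xbar2, s2, n2, delta0=0.0):
    """Equal-variance (pooled) t statistic from summary statistics."""
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    se = sqrt(sp2 / n1 + sp2 / n2)
    t = ((xbar1 - xbar2) - delta0) / se
    return sp2, t, df


# Sample 1: hospitalized subjects; sample 2: healthy subjects.
sp2, t, df = pooled_t(xbar1=120, s1=40, n1=22, xbar2=96, s2=35, n2=15)

T_CRIT_35DF = 2.0301  # two-tailed, alpha = .05, as quoted in the text
reject = abs(t) > T_CRIT_35DF

print(f"pooled variance = {sp2:.0f}, df = {df}")
print(f"t = {t:.2f}, reject Ho: {reject}")
```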
t test for the difference between the
two population means
Assumption:
The data constitute two independent
random samples, each drawn from a
normally distributed population. The
population variances are unknown and
unequal
Test Statistic

$t' = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

Compute the critical ratio or critical value of t'

$t'_{1-\alpha/2} = \frac{w_1 t_1 + w_2 t_2}{w_1 + w_2}$

where
$w_1 = s_1^2 / n_1$, $t_1 = t_{1-\alpha/2}$ for $n_1 - 1$ degrees of freedom
$w_2 = s_2^2 / n_2$, $t_2 = t_{1-\alpha/2}$ for $n_2 - 1$ degrees of freedom
Example
Researchers wish to know if two populations differ
with respect to the mean value of total serum
complement activity (CH50). The data consist of
total serum complement activity (CH50)
determinations on n2 = 20 apparently normal
subjects and n1 = 10 subjects with disease. The
sample means and standard deviations are:

x̄1 = 62.6, s1 = 33.8
x̄2 = 47.2, s2 = 10.1

1. State the hypothesis
Ho: The mean difference in the value of total serum complement
activity between normal subjects and subjects with disease is
equal to zero
HA: The mean difference in the value of total serum complement
activity between normal subjects and subjects with disease is not
equal to zero

2. Select the appropriate test statistic

$t' = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

3. Select the level of significance: α = .05
4. Determine the critical ratio or critical value of the test statistic

$w_1 = \frac{(33.8)^2}{10} = 114.244$, $w_2 = \frac{(10.1)^2}{20} = 5.1005$

t1 = 2.2622 (9 degrees of freedom); t2 = 2.0930 (19 degrees of freedom)

$t'_{1-\alpha/2} = \frac{w_1 t_1 + w_2 t_2}{w_1 + w_2} = \frac{114.244(2.2622) + 5.1005(2.0930)}{114.244 + 5.1005} = 2.255$

5. Perform the calculation for the test statistic

$t' = \frac{(62.6 - 47.2) - 0}{\sqrt{\frac{(33.8)^2}{10} + \frac{(10.1)^2}{20}}} = \frac{15.4}{10.92} = 1.41$

6. Draw and state the conclusion
Cannot reject Ho since 1.41 is in the non-rejection region

We cannot conclude that the mean value of total complement
activity differs between normal subjects and subjects with disease
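The unequal-variance calculation, including the weighted critical value, can be sketched as below. `unequal_variance_t` is a hypothetical helper; the tabulated values t1 = 2.2622 and t2 = 2.0930 are taken from the text, and x̄2 = 47.2 is the value used in the text's own calculation of 15.4.

```python
from math import sqrt


def unequal_variance_t(xbar1, s1, n1, t1, xbar2, s2, n2, t2):
    """t' statistic with a critical value weighted by w_i = s_i^2 / n_i.

    t1 and t2 are tabulated t values for n1 - 1 and n2 - 1 degrees of freedom.
    """
    w1 = s1**2 / n1
    w2 = s2**2 / n2
    t_prime = (xbar1 - xbar2) / sqrt(w1 + w2)     # hypothesized difference is 0
    t_crit = (w1 * t1 + w2 * t2) / (w1 + w2)      # weighted critical value
    return w1, w2, t_prime, t_crit


w1, w2, t_prime, t_crit = unequal_variance_t(
    xbar1=62.6, s1=33.8, n1=10, t1=2.2622,   # subjects with disease
    xbar2=47.2, s2=10.1, n2=20, t2=2.0930,   # apparently normal subjects
)
reject = abs(t_prime) > t_crit

print(f"w1 = {w1:.3f}, w2 = {w2:.4f}")
print(f"t' = {t_prime:.2f}, critical value = {t_crit:.3f}, reject Ho: {reject}")
```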
t test for paired comparisons
Involves related, matched or paired observations
Same subjects may be measured before and after
receiving some treatment
Litter mates of the same sex, pairs of twins or
pairs may be formed by matching individuals on
some characteristic.
The objective in paired comparisons tests is to
eliminate a maximum number of sources of
extraneous variation by making the pairs similar
with respect to as many variables as possible
t test for paired comparisons
Variable of interest in this test is the
difference between individual pairs of
observations
Assumption:
The observed differences constitute a simple
random sample from a normally distributed
population of differences that could be
generated under the same circumstances
Test statistic is

$t = \frac{\bar{d} - \mu_d}{s_{\bar{d}}}$

Standard error: $s_{\bar{d}} = \frac{s_d}{\sqrt{n}}$
Example
Twelve subjects participated in an experiment to
study the effectiveness of a certain diet,
combined with a program of exercise, in reducing
serum cholesterol levels. Do the data provide
sufficient evidence to conclude that the
diet-exercise program is effective in reducing serum
cholesterol levels? Let α = .05

Serum Cholesterol Levels for 12 Subjects Before and After
Diet-Exercise Program

Subject    Before (x1)    After (x2)    Difference (after − before)
1          201            200           -1
2          231            236           +5
3          221            216           -5
4          260            233           -27
5          228            224           -4
6          237            216           -21
7          326            296           -30
8          235            195           -40
9          240            207           -33
10         267            247           -20
11         284            210           -74
12         201            209           +8

$\bar{d} = \frac{\sum d_i}{n} = \frac{(-1) + (5) + \dots + (8)}{12} = \frac{-242}{12} = -20.17$

$s_d^2 = \frac{\sum (d_i - \bar{d})^2}{n - 1} = \frac{n \sum d_i^2 - (\sum d_i)^2}{n(n - 1)} = \frac{12(10766) - (-242)^2}{12(11)} = 535.06$

1. State the hypothesis
Ho: The mean difference in serum cholesterol between before
and after diet-exercise program is ≥ zero
HA: The mean difference in serum cholesterol between before
and after diet-exercise program is < zero

2. Select the appropriate test statistic

$t = \frac{\bar{d} - \mu_d}{s_{\bar{d}}}$, with $s_d^2 = 535.06$

3. Select the level of significance: α = .05
4. Determine the critical ratio or critical value of t (one-tailed, 11 degrees of freedom) = -1.7959
5. Perform the calculation for the test statistic

$t = \frac{-20.17 - 0}{\sqrt{535.06 / 12}} = \frac{-20.17}{6.68} = -3.02$

6. Draw and state the conclusion
Reject Ho since -3.02 is in the rejection region
The mean difference in the serum cholesterol between before
and after diet-exercise program is less than zero.
Therefore, the diet-exercise program is effective.
Testing Ho by means of a 95% Confidence Interval
for the mean difference

$\bar{d} \pm t_{(1-\alpha/2)} \, s_{\bar{d}}$

-20.17 ± 2.201 (6.68)
-20.17 ± 14.70
-34.87, -5.47
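The paired calculation can be reproduced from the twelve before and after readings. This is a minimal sketch using only the standard library; the critical values (−1.7959 one-tailed, 2.201 two-tailed, 11 degrees of freedom) are taken from the text rather than computed.

```python
from math import sqrt
from statistics import mean, variance

before = [201, 231, 221, 260, 228, 237, 326, 235, 240, 267, 284, 201]
after = [200, 236, 216, 233, 224, 216, 296, 195, 207, 247, 210, 209]

# Differences are taken as after minus before, as in the table.
d = [a - b for a, b in zip(after, before)]
n = len(d)

d_bar = mean(d)            # mean difference
s2_d = variance(d)         # sample variance of the differences (n - 1)
se = sqrt(s2_d / n)        # standard error of the mean difference
t = (d_bar - 0) / se       # hypothesized mean difference is 0

T_CRIT_ONE_TAILED = -1.7959   # lower-tail critical value from the text
T_CRIT_TWO_TAILED = 2.201     # used for the 95% confidence interval
reject = t < T_CRIT_ONE_TAILED
ci = (d_bar - T_CRIT_TWO_TAILED * se, d_bar + T_CRIT_TWO_TAILED * se)

print(f"mean difference = {d_bar:.2f}, variance = {s2_d:.2f}")
print(f"t = {t:.2f}, reject Ho: {reject}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```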
Difference between two proportions
Assumption:
It is assumed that the sampling distribution
of p1 − p2 is approximately normally distributed.

Test statistic is

$z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\hat{\sigma}_{p_1 - p_2}}$

Standard error:

$\hat{\sigma}_{p_1 - p_2} = \sqrt{\frac{\bar{p}(1 - \bar{p})}{n_1} + \frac{\bar{p}(1 - \bar{p})}{n_2}}$
Example
In a study designed to compare a new
treatment for migraine headache with the
standard treatment, 78 of 100 subjects who
received the standard treatment responded
favorably. Of the 100 subjects who received
the new treatment 90 responded favorably.
Do these data provide sufficient evidence to
indicate that the new treatment is more
effective than the standard? Let α = .05
1. State the hypothesis
Ho: The difference in proportion between the new and standard
treatment (π2 − π1) is ≤ zero
HA: The difference in proportion between the new and standard
treatment (π2 − π1) is > zero

2. Select the test statistic

$z = \frac{(p_2 - p_1) - (\pi_2 - \pi_1)}{\hat{\sigma}_{p_1 - p_2}}$

3. Select the level of significance: α = .05
4. Determine the critical ratio or critical value of the test = 1.645
5. Perform the calculation for the test statistic
p1 = 78 / 100 = .78 (standard treatment), p2 = 90 / 100 = .90 (new treatment)

Pooled estimate of the hypothesized common proportion

$\bar{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{78 + 90}{100 + 100} = .84$

$z = \frac{(.90 - .78) - 0}{\sqrt{\frac{(.84)(.16)}{100} + \frac{(.84)(.16)}{100}}} = \frac{.12}{.0518} = 2.32$

6. Draw and state the conclusion
Reject Ho since 2.32 > 1.645
The new treatment is more effective than the standard
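The migraine example can be sketched with a small function that pools the two samples to estimate the common proportion. `two_proportion_z` is a hypothetical helper; it takes the difference as new minus standard, matching the calculation above. Unrounded, the statistic is about 2.31, which the slide reports as 2.32 after rounding the standard error to .0518.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """z statistic for p2 - p1 using the pooled proportion under Ho."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)                       # pooled estimate
    se = sqrt(p_bar * (1 - p_bar) / n1 + p_bar * (1 - p_bar) / n2)
    z = (p2 - p1) / se                                  # hypothesized difference is 0
    return p_bar, se, z


# Sample 1: standard treatment; sample 2: new treatment.
p_bar, se, z = two_proportion_z(x1=78, n1=100, x2=90, n2=100)

Z_CRIT = 1.645  # one-tailed critical value for alpha = .05
reject = z > Z_CRIT

print(f"pooled p = {p_bar:.2f}, SE = {se:.4f}")
print(f"z = {z:.2f}, reject Ho: {reject}")
```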
Using Chi-Square to Compare Frequencies
or Proportions in Two Groups
The chi-square test can be used to test
for the following:
1. Goodness of fit
2. Independence
3. Homogeneity
Test of Goodness of fit
- Determines whether or not a sample of observed values
of some random variable is compatible with the
hypothesis that it was drawn from a population of values
that is normally distributed (or follows some other specified distribution).
Test of Independence
- The most frequently used
- Tests the null hypothesis that two criteria of classification,
when applied to the same set of entities, are
independent

Test of Homogeneity
- Determines whether the samples drawn from
populations are homogeneous with respect to
some criterion of classification
Despite these differences in concept and
sampling procedure, the three tests are
mathematically identical; that is, they use
the same formula:

$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$
Tests of Goodness-Of-Fit
We make use of our knowledge of normal
distribution to determine the frequencies for each
category that one could expect if the sample had
come from a normal distribution.
If the discrepancy between what was observed
and what one would expect, given that sampling
was from a normal distribution, is too great to be
attributed to chance, we conclude that the
sample did not come from a normal distribution.
If the discrepancy is of such magnitude that it
could have come about due to chance, we
conclude that the sample may have come from a
normal distribution

Example
Suppose that a research team making a study of
hospitals in the United States collects data on a
sample of 250 hospitals which enables the team
to compute for each the inpatient occupancy
ratio, a variable that shows, for a 12-month
period, the ratio of average daily census to the
average number of beds maintained. Suppose
the sample yielded the distribution of ratio
(expressed as percents)
Inpatient occupancy ratio     Number of hospitals
0.0 to 39.9                    16
40.0 to 49.9                   18
50.0 to 59.9                   22
60.0 to 69.9                   51
70.0 to 79.9                   62
80.0 to 89.9                   55
90.0 to 99.9                   22
100.0 to 109.9                  4
Total                         250
Results of Study Described In Example
We wish to know whether these data provide sufficient
evidence to indicate that the sample did not come from
a normally distributed population.
1. State the hypothesis
Ho: In the population from which the sample was drawn, inpatient
occupancy ratios are normally distributed
HA: The sampled population is not normally distributed

2. Select the appropriate test statistic



3. Select the level of significance = .005
4. Determine the critical ratio or critical value of the test statistic
χ²(.005, 6 df) = 18.548
5. Perform the calculation of the test statistic

Compute the mean and standard deviation of the sample:
x̄ = 69.91 and s = 19.02

Compute the expected frequencies in each cell


Test statistic:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

Class interval       z = (x_i - x̄)/s at lower    Expected relative    Expected
                     limit of interval           frequency            frequency
< 40.0                   -                       .0582                 14.55
40.0 to 49.9           -1.57                     .0887                 22.18
50.0 to 59.9           -1.05                     .1546                 38.65
60.0 to 69.9            -.52                     .1985                 49.62
70.0 to 79.9             .00                     .2019                 50.48
80.0 to 89.9             .53                     .1535                 38.38
90.0 to 99.9            1.06                     .0875                 21.88
100.0 to 109.9          1.58                     .0397                  9.92
110.0 and greater       2.11                     .0174                  4.35
Total                                           1.0000                250.00
Class Intervals and Expected Frequencies

Class interval       Observed frequency    Expected frequency    (O_i - E_i)² / E_i
                     (O_i)                 (E_i)
< 40.0                16                    14.55                  .145
40.0 to 49.9          18                    22.18                  .788
50.0 to 59.9          22                    38.65                 7.173
60.0 to 69.9          51                    49.62                  .038
70.0 to 79.9          62                    50.48                 2.629
80.0 to 89.9          55                    38.38                 7.197
90.0 to 99.9          22                    21.88                  .001
100.0 to 109.9         4                     9.92                 3.533
110.0 and greater      0                     4.35                 4.350
Total                250                   250.00                25.854
Observed and Expected Frequencies and (O_i - E_i)² / E_i

6. Draw and state the conclusion
Reject Ho since 25.854 > 18.548
Conclude that in the sampled population, inpatient occupancy ratios
are not normally distributed.
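As a rough check on the table above, the statistic can be recomputed from the observed and expected frequencies. The sketch below takes the expected frequencies and the critical value directly from the example rather than deriving them; the helper name `chi_square` is hypothetical.

```python
# Observed and expected frequencies for the nine class intervals in the example
observed = [16, 18, 22, 51, 62, 55, 22, 4, 0]
expected = [14.55, 22.18, 38.65, 49.62, 50.48, 38.38, 21.88, 9.92, 4.35]

def chi_square(obs, exp):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

stat = chi_square(observed, expected)
# 9 classes, minus 1, minus 2 parameters (mean and SD) estimated from the sample
df = len(observed) - 1 - 2
critical = 18.548  # chi-square(.005, 6 df), as given in step 4
print(round(stat, 3), df, stat > critical)
```

Small differences in the third decimal place relative to the slide total are expected, because the slide sums terms that were already rounded.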
Tests of Independence
Are the two criteria of classification independent?
The expected frequencies are calculated for each cell. The
expected frequencies and observed frequencies are
compared.
If the discrepancy is small, the null hypothesis is tenable,
that is, the two criteria of classification are independent.
If the discrepancy is large, the null hypothesis is rejected
and we conclude that the two criteria of classification are
not independent


Second criterion of      First criterion of classification level
classification level     1       2       3       ...     c         Total
1                        n_11    n_12    n_13    ...     n_1c      n_1.
2                        n_21    n_22    n_23    ...     n_2c      n_2.
3                        n_31    n_32    n_33    ...     n_3c      n_3.
.                        .       .       .               .         .
r                        n_r1    n_r2    n_r3    ...     n_rc      n_r.
Total                    n_.1    n_.2    n_.3    ...     n_.c      n
Two-Way Classification of a Sample of Entities
For contingency tables with more than 1 degree of freedom a
minimum expectation of 1 is allowable if no more than 20 percent
of the cells have expected frequencies of less than 5
Degree of freedom = (r-1)(c-1)
Second criterion       First criterion of classification
of classification      1          2          Total
1                      a          b          a + b
2                      c          d          c + d
Total                  a + c      b + d      n
A 2 X 2 Contingency Table
Small frequencies for a 2 x 2 table
The χ² test should not be used if n < 20, or if 20 < n < 40 and any expected
frequency is less than 5. When n > 40, an expected cell frequency as
small as 1 can be tolerated.

$$\chi^2_{corrected} = \frac{n\left(|ad - bc| - .5n\right)^2}{(a + c)(b + d)(a + b)(c + d)}$$
Example
A sample of 500 elementary school children
in a certain school system were cross-
classified by nutritional status and academic
performance. The researchers wished to
know if they could conclude that there is a
relationship between nutritional status and
academic performance. Let α = .05
Academic           Nutritional Status
Performance        Poor       Good       Total
Poor               105         15        120
Satisfactory        80        300        380
Total              185        315        500
1. State the hypothesis
Ho: Nutritional status and academic performance are independent
HA: The two variables are not independent

2. Select the appropriate test statistic



or




3. Select the level of significance = .05
4. Determine the critical ratio or critical value of the test statistic=3.841
5. Perform the calculation of the test statistic




6. Draw and state conclusion
Reject Ho since 169.907 is in the rejection region
Conclude that the two variables are not independent
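Test statistic:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

or, with the correction for continuity,

$$\chi^2_{corrected} = \frac{n\left(|ad - bc| - .5n\right)^2}{(a + c)(b + d)(a + b)(c + d)}$$

Calculation with the corrected formula:

$$\chi^2_{corrected} = \frac{500\left[|(105)(300) - (15)(80)| - (.5)(500)\right]^2}{(185)(315)(120)(380)} = 169.907$$

Calculation with the uncorrected formula:

Observed frequency    Expected frequency    (O_i - E_i)²    (O_i - E_i)² / E_i
(O_i)                 (E_i)
105                    44.4                 3672.36          82.71
 15                    75.6                 3672.36          48.58
 80                   140.6                 3672.36          26.12
300                   239.4                 3672.36          15.34
Total                                                       172.75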
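Both versions of the 2 x 2 calculation can be reproduced with the shortcut formula. The following is a minimal sketch using the cell labels a, b, c, d from the 2 x 2 contingency table; the function name `chi_square_2x2` and its `corrected` flag are illustrative choices, not part of the original material.

```python
def chi_square_2x2(a, b, c, d, corrected=True):
    """Chi-square for a 2 x 2 table laid out as rows (a, b) and (c, d).

    With corrected=True, .5n is subtracted from |ad - bc| before squaring
    (the continuity correction shown in the formula above).
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if corrected:
        diff -= 0.5 * n
    return n * diff ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

# Rows: academic performance (poor, satisfactory); columns: nutritional status (poor, good)
corrected = chi_square_2x2(105, 15, 80, 300)
uncorrected = chi_square_2x2(105, 15, 80, 300, corrected=False)
print(round(corrected, 3), round(uncorrected, 2))
```

Either value is far beyond the critical value of 3.841, so the conclusion does not depend on whether the correction is applied.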
Session Four
Statistical tests for Three or
More Populations

Test of Significance for Three or More Populations
1. Population mean (Analysis of Variance)
a. Completely randomized design (One-way ANOVA)
b. Randomized block design
c. Factorial experiment (Two-way ANOVA)
2. Population proportion
χ² test
Analysis of Variance
Defined as a technique whereby the total
variation present in a set of data is
partitioned into several components.
Associated with each of these components
is a specific source of variation, so that in
the analysis it is possible to ascertain the
magnitude of the contributions of each of
these sources to the total variation
Different Experimental Designs
1. Completely Randomized Design
2. Randomized Complete Block Design
3. Two-Factor Completely Randomized Experiment
4. Latin Square Design
5. Incomplete Block Design
6. Split Plot Design
Definition of Terms
One-Way Analysis of Variance
Only one source of variation, or factor, is
investigated
Two-Way Analysis of Variance
Two factors are analyzed in the study.
Fixed effect model or model I
restriction placed on our inference goal, that
is any inferences we make apply only to the
treatments under study
Completely Randomized Design
(One-Way ANOVA)
Test the null hypothesis that three or
more treatments are equally effective.
The treatments of interest are assigned
completely at random to the subjects or
objects on which the measurements to
determine treatment effectiveness are to
be made.
Table of Sample Values for the
Completely Randomized Design
            Treatment
            1           2           3           ...     k
            x_11        x_12        x_13        ...     x_1k
            x_21        x_22        x_23        ...     x_2k
            x_31        x_32        x_33        ...     x_3k
            .           .           .                   .
            x_{n1,1}    x_{n2,2}    x_{n3,3}    ...     x_{nk,k}
Total       T_.1        T_.2        T_.3        ...     T_.k        T_..
Mean        x̄_.1        x̄_.2        x̄_.3        ...     x̄_.k        x̄_..
x_ij = the i-th observation resulting from the j-th treatment
i = 1, 2, ..., n_j;  j = 1, 2, ..., k

$$T_{.j} = \sum_{i=1}^{n_j} x_{ij} = \text{total of the } j\text{th column}$$

$$\bar{x}_{.j} = \frac{T_{.j}}{n_j} = \text{mean of the } j\text{th column}$$

$$T_{..} = \sum_{j=1}^{k} T_{.j} = \sum_{j=1}^{k}\sum_{i=1}^{n_j} x_{ij} = \text{total of all observations}$$

$$\bar{x}_{..} = \frac{T_{..}}{N}, \qquad N = \sum_{j=1}^{k} n_j$$
Completely Randomized Design
Model:

$$x_{ij} = \mu + \tau_j + e_{ij}$$

x_ij = the observation that deviates from the
group mean by the amount e_ij
μ = mean of the k group means
μ_j = μ + τ_j = group mean
τ_j = treatment effect
e_ij = error
Assumptions of the Model
1. The k sets of observed data constitute k
independent random samples from the
respective populations
2. Each of the populations from which the samples
come is normally distributed with mean μ_j and
variance σ_j².
3. Each of the populations has the same variance.
That is, σ_1² = σ_2² = σ_3² = ... = σ_k² = σ², the common
variance
4. The τ_j's are unknown constants and Στ_j = 0,
since the sum of all deviations of the μ_j from
their mean, μ, is zero
Computational formula
1. Total Sum of Squares (SST)

$$SST = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{..})^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j} x_{ij}^2 - \frac{T_{..}^2}{N}$$

2. Within Groups Sum of Squares (SSW) or (SSE)

$$SSW = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{.j})^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j} x_{ij}^2 - \sum_{j=1}^{k}\frac{T_{.j}^2}{n_j}$$

3. Among Groups Sum of Squares (SSA)

$$SSA = \sum_{j=1}^{k} n_j(\bar{x}_{.j} - \bar{x}_{..})^2 = \sum_{j=1}^{k}\frac{T_{.j}^2}{n_j} - \frac{T_{..}^2}{N}$$

SST = SSA + SSW
Computational formula
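4. Within groups variance or within groups mean square (MSW)

$$MSW = \frac{\sum_{j=1}^{k}\sum_{i=1}^{n_j} x_{ij}^2 - \sum_{j=1}^{k}\frac{T_{.j}^2}{n_j}}{\sum_{j=1}^{k}(n_j - 1)} = \frac{SSW}{\sum_{j=1}^{k}(n_j - 1)} = \frac{SSW}{N - k}$$

5. Among groups mean square (MSA)

$$MSA = \frac{\sum_{j=1}^{k}\frac{T_{.j}^2}{n_j} - \frac{T_{..}^2}{N}}{k - 1} = \frac{SSA}{k - 1}$$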
Source of          Sum of      Degrees of     Mean                    Variance
variation          squares     freedom        square                  ratio
Among samples      SSA         k - 1          MSA = SSA / (k - 1)     V.R. = MSA / MSW
Within samples     SSW         N - k          MSW = SSW / (N - k)
Total              SST         N - 1
Analysis of Variance Table for the Completely Randomized Design
Reject Ho if the computed Variance ratio (V.R) is greater than
the Critical ratio or Critical value or

Reject Ho if the computed Variance ratio (VR) is in the rejection
region
Example
In a study of the effect of glucose on insulin
release, specimens of pancreatic tissue from
experimental animals were randomly assigned to
be treated with one of five different stimulants.
Later, a determination was made on the amount
of insulin released. The experimenters wished to
know if they could conclude that there is a
difference among the five treatments with
respect to the mean amount of insulin released.
Let α = .05
           Stimulants
           1         2         3         4         5
           1.53      3.15      3.89      8.18      5.86
           1.61      3.96      3.68      5.64      5.46
           3.75      3.59      5.70      7.36      5.69
           2.89      1.89      5.62      5.33      6.49
           3.26      1.45      5.79      8.82      7.81
                     1.56      5.33      5.26      9.03
                                         7.10      7.49
                                                   8.98
Total      13.04     15.60     30.01     47.69     56.81     163.15
Mean        2.61      2.60      5.00      6.81      7.10       5.10
Insulin Released
1. State the hypothesis
Ho: All population or treatment means are equal
HA: Not all population or treatment means are equal
2. Select the test statistic : Variance ratio or F test
3. Select the level of significance = .05
4. Determine the critical ratio or critical value of the test statistic = 2.73
5. Perform the calculation of the test statistic





$$SST = (1.53)^2 + (1.61)^2 + \dots + (8.98)^2 - \frac{(163.15)^2}{32} = 994.3529 - \frac{26617.923}{32} = 994.3529 - 831.81008 = 162.54282$$

$$SSW = (1.53)^2 + (1.61)^2 + \dots + (8.98)^2 - \left[\frac{(13.04)^2}{5} + \frac{(15.60)^2}{6} + \frac{(30.01)^2}{6} + \frac{(47.69)^2}{7} + \frac{(56.81)^2}{8}\right]$$

$$= 994.3529 - (34.00832 + 40.56 + 150.10002 + 324.90516 + 403.42201) = 994.3529 - 952.99551 = 41.35739$$

$$SSA = \frac{(13.04)^2}{5} + \frac{(15.60)^2}{6} + \frac{(30.01)^2}{6} + \frac{(47.69)^2}{7} + \frac{(56.81)^2}{8} - \frac{(163.15)^2}{32} = 952.99551 - 831.81008 = 121.18543$$

As a check, SSA = SST - SSW
= 162.54282 - 41.35739
= 121.18543
Source             SS            d.f.     MS             V.R.
Among samples      121.18543      4       30.2963580     19.78
Within samples      41.35739     27        1.5317552
Total              162.54282     31
ANOVA Table
6. Draw and state the conclusion
Reject Ho since 19.78 is in the rejection region
Conclude that not all treatment means are equal
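The sums of squares and the variance ratio for this example can be reproduced with the computational formulas given earlier. The sketch below is one possible arrangement; the function name `one_way_anova` and the dictionary keys are hypothetical, and the critical value 2.73 is taken from step 4.

```python
# Insulin released, one list per stimulant (columns of the data table)
groups = [
    [1.53, 1.61, 3.75, 2.89, 3.26],
    [3.15, 3.96, 3.59, 1.89, 1.45, 1.56],
    [3.89, 3.68, 5.70, 5.62, 5.79, 5.33],
    [8.18, 5.64, 7.36, 5.33, 8.82, 5.26, 7.10],
    [5.86, 5.46, 5.69, 6.49, 7.81, 9.03, 7.49, 8.98],
]

def one_way_anova(groups):
    """Completely randomized design: return SST, SSA, SSW and the variance ratio."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    correction = sum(sum(g) for g in groups) ** 2 / n_total   # T..^2 / N
    sum_sq = sum(x * x for g in groups for x in g)             # sum of all x_ij^2
    between = sum(sum(g) ** 2 / len(g) for g in groups)        # sum of T.j^2 / n_j
    ssa = between - correction
    ssw = sum_sq - between
    msa = ssa / (k - 1)
    msw = ssw / (n_total - k)
    return {"SST": sum_sq - correction, "SSA": ssa, "SSW": ssw, "VR": msa / msw}

result = one_way_anova(groups)
reject = result["VR"] > 2.73  # critical value of F(.05; 4, 27) from step 4
print({name: round(value, 3) for name, value in result.items()}, reject)
```

Unequal group sizes need no special handling here, because each column total is divided by its own n_j.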
A Word of Caution :
The completely randomized design is simple and,
therefore, widely used. It should be used, however,
only when the units receiving the treatments are
homogeneous
Testing for Significant Differences Between
Individual Pairs of Means
(Multiple Comparison Tests)
1. Tukey's HSD Procedure (Honestly
Significant Difference)
2. Scheffé's Procedure (most versatile)
3. Newman-Keuls Procedure
4. Dunnett's Procedure
Tukey's HSD Test
Frequently used for testing the null hypotheses that all
possible pairs of treatment means are equal when the
samples are all of the same size.
Makes use of a single value against which all mean
differences are compared.
This value, called the HSD, is given by
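$$HSD = q_{\alpha,k,N-k}\sqrt{\frac{MSE}{n}}$$

α = the chosen level of significance
k = the number of means in the experiment
N = the total number of observations in the experiment
n = the number of observations in each treatment
MSE = the error (within groups) mean square from the ANOVA table
q = obtained from Table H with α, k, and N - k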
                 x̄_.2     x̄_.1     x̄_.3     x̄_.4     x̄_.5
x̄_.2 = 2.60      -        .01      2.40     4.21     4.50
x̄_.1 = 2.61               -        2.39     4.20     4.49
x̄_.3 = 5.00                        -        1.81     2.10
x̄_.4 = 6.81                                 -         .29
x̄_.5 = 7.10                                          -
Differences Between Means (Absolute Value)
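Determine the value of (q) based on α = .05, k = 5, and N - k = 27.
The table lists q only for 24 and 30 degrees of freedom, so interpolate:

Degrees of freedom (g)     q (d)
g1 = 24                    d1 = 4.17
g  = 27                    d
g2 = 30                    d2 = 4.10

Formula for interpolation

$$d = d_1 + \frac{g - g_1}{g_2 - g_1}(d_2 - d_1)$$

$$d = 4.17 + \frac{27 - 24}{30 - 24}(4.10 - 4.17) = 4.135 \text{ or } 4.14$$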
When the samples are not all of the same size:

$$HSD^* = q_{\alpha,k,N-k}\sqrt{\frac{MSE}{n_j^*}}$$

where n_j* is the smaller of the two sample sizes being compared.

$$HSD^* = 4.14\sqrt{\frac{1.5317552}{5}} = 2.29 \qquad HSD^* = 4.14\sqrt{\frac{1.5317552}{6}} = 2.09 \qquad HSD^* = 4.14\sqrt{\frac{1.5317552}{7}} = 1.94$$

Hypothesis        Difference     n_j*     HSD*     Statistical Decision
Ho: μ1 = μ2        .01           5        2.29     Do not reject Ho
Ho: μ1 = μ3       2.39           5        2.29     Reject Ho
Ho: μ1 = μ4       4.20           5        2.29     Reject Ho
Ho: μ1 = μ5       4.49           5        2.29     Reject Ho
Ho: μ2 = μ3       2.40           6        2.09     Reject Ho
Ho: μ2 = μ4       4.21           6        2.09     Reject Ho
Ho: μ2 = μ5       4.50           6        2.09     Reject Ho
Ho: μ3 = μ4       1.81           6        2.09     Do not reject Ho
Ho: μ3 = μ5       2.10           6        2.09     Reject Ho
Ho: μ4 = μ5        .29           7        1.94     Do not reject Ho
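The ten pairwise decisions can be generated mechanically once q and MSE are known. The sketch below assumes the interpolated q = 4.14 and the MSE from the ANOVA table, and uses the rounded treatment means from the data table; the names `hsd_star` and `decisions` are illustrative only.

```python
import math
from itertools import combinations

means = {1: 2.61, 2: 2.60, 3: 5.00, 4: 6.81, 5: 7.10}  # rounded treatment means
sizes = {1: 5, 2: 6, 3: 6, 4: 7, 5: 8}                 # observations per treatment
mse = 1.5317552   # within-samples mean square from the ANOVA table
q = 4.14          # interpolated table value for alpha = .05, k = 5, N - k = 27

def hsd_star(n_smaller):
    """HSD* for a pair whose smaller sample has n_smaller observations."""
    return q * math.sqrt(mse / n_smaller)

decisions = {}
for i, j in combinations(sorted(means), 2):
    difference = abs(means[i] - means[j])
    threshold = hsd_star(min(sizes[i], sizes[j]))
    decisions[(i, j)] = difference > threshold  # True means "reject Ho"

for pair, rejected in decisions.items():
    print(pair, "Reject Ho" if rejected else "Do not reject Ho")
```

The comparison of treatments 3 and 5 is close to the threshold (2.10 against 2.09), so that decision is sensitive to how q is interpolated and rounded.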
The Randomized Complete Block Design
The experimental units to which the treatments
are applied are subdivided into homogeneous
groups called blocks.
The treatments are then assigned at random to
the experimental units within each block
The objective of this design is to isolate and
remove from the error term the variation
attributable to the blocks, while assuring that
treatment means will be free of block effects.
Table of Sample Values for the Randomized Complete Block Design
Blocks     Treatments                                    Total     Mean
           1        2        3        ...     k
1          x_11     x_12     x_13     ...     x_1k       T_1.      x̄_1.
2          x_21     x_22     x_23     ...     x_2k       T_2.      x̄_2.
3          x_31     x_32     x_33     ...     x_3k       T_3.      x̄_3.
.          .        .        .                .          .         .
n          x_n1     x_n2     x_n3     ...     x_nk       T_n.      x̄_n.
Total      T_.1     T_.2     T_.3     ...     T_.k       T_..
Mean       x̄_.1     x̄_.2     x̄_.3     ...     x̄_.k                 x̄_..
Randomized Complete Block Design
Fixed-Effects Model:

$$x_{ij} = \mu + \beta_i + \tau_j + e_{ij}$$

i = 1, 2, ..., n;  j = 1, 2, ..., k
x_ij = a typical value from the overall population
μ = an unknown constant
β_i = represents a block effect
τ_j = represents a treatment effect
e_ij = a residual component representing all sources of
variation other than treatments and blocks
Randomized Complete Block Design
Assumptions:
1. Each x_ij that is observed constitutes a random
independent sample of size 1 from one of the
kn populations represented
2. Each of these kn populations is normally
distributed with mean μ_ij and the same variance σ²
3. The block and treatment effects are additive,
that is, there is no interaction between
treatments and blocks
Computational Formulas

$$SST = \sum_{j=1}^{k}\sum_{i=1}^{n}(x_{ij} - \bar{x}_{..})^2 = \sum_{j=1}^{k}\sum_{i=1}^{n} x_{ij}^2 - C$$

$$SSBl = \sum_{i=1}^{n}\frac{T_{i.}^2}{k} - C$$

$$SSTr = \sum_{j=1}^{k}\frac{T_{.j}^2}{n} - C$$

$$C = \left(\sum_{j=1}^{k}\sum_{i=1}^{n} x_{ij}\right)^2 \Big/ kn$$

SSE = SST - SSBl - SSTr

The appropriate degrees of freedom for each component:
total        blocks       treatments     residual (error)
kn - 1   =   (n - 1)   +  (k - 1)     +  (n - 1)(k - 1)
SST = SSBl + SSTr + SSE
Source         SS       d.f.              MS                             V.R.
Treatments     SSTr     (k - 1)           MSTr = SSTr / (k - 1)          MSTr / MSE
Blocks         SSBl     (n - 1)           MSBl = SSBl / (n - 1)
Residual       SSE      (n - 1)(k - 1)    MSE = SSE / [(n - 1)(k - 1)]
Total          SST      kn - 1
ANOVA Table for the Randomized Complete Block Design
Example
A physical therapist wished to compare three methods for
teaching patients to use a certain prosthetic device. He felt that
the rate of learning would be different for patients of different
ages and wished to design an experiment in which the influence
of age could be taken into account. The randomized complete
block design is the appropriate design for achieving this goal.
Three patients in each of five age groups were selected to
participate in the experiment, and one patient in each age
group was randomly assigned to each of the teaching methods.
The methods of instruction constitute the three treatments, and
the five age groups are the blocks. Let α = .05

Age group        Teaching Method
                 A        B        C        Total     Mean
Under 20          7        9       10        26        8.67
20 to 29          8        9       10        27        9.00
30 to 39          9        9       12        30       10.00
40 to 49         10        9       12        31       10.33
50 and over      11       12       14        37       12.33
Total            45       48       58       151
Mean              9.0      9.6     11.6                10.07
Time (in Days) Required to Learn the Use of a Certain Prosthetic Device
1. State the hypothesis
Ho: All treatment effects are equal to zero (all treatment means are equal)
HA: Not all treatment effects are equal to zero (not all treatment means are equal)
2. Select the appropriate test statistic: F test (two-way ANOVA)
3. Select the level of significance : .05
4. Determine the critical ratio or critical value: 4.46
5. Perform the calculation for the test statistic

$$C = \frac{(151)^2}{(3)(5)} = \frac{22801}{15} = 1520.0667$$

$$SST = 7^2 + 9^2 + \dots + 14^2 - 1520.0667 = 46.9333$$

$$SSBl = \frac{26^2 + 27^2 + \dots + 37^2}{3} - 1520.0667 = 24.9333$$

$$SSTr = \frac{45^2 + 48^2 + 58^2}{5} - 1520.0667 = 18.5333$$

SSE = SST - SSBl - SSTr
= 46.9333 - 24.9333 - 18.5333 = 3.4667
Source         SS          d.f.     MS           V.R.
Treatments     18.5333      2       9.26665      21.38
Blocks         24.9333      4       6.233325
Residual        3.4667      8        .4333375
Total          46.9333     14
ANOVA Table for the Randomized Complete Block Design
6. Draw and state the conclusion
Reject Ho since the 21.38 is in the rejection region
Not all treatment means are equal
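The randomized complete block calculation can be sketched the same way. In the illustration below each row is a block (age group) and each column a treatment (teaching method); the function name `rcbd_anova` and the dictionary keys are hypothetical, and the critical value 4.46 is taken from step 4.

```python
# Days required to learn the device; rows = age-group blocks, columns = methods A, B, C
data = [
    [7, 9, 10],
    [8, 9, 10],
    [9, 9, 12],
    [10, 9, 12],
    [11, 12, 14],
]

def rcbd_anova(rows):
    """Randomized complete block design with one observation per cell."""
    n = len(rows)        # number of blocks
    k = len(rows[0])     # number of treatments
    c = sum(sum(r) for r in rows) ** 2 / (k * n)               # correction term C
    sst = sum(x * x for r in rows for x in r) - c
    ssbl = sum(sum(r) ** 2 for r in rows) / k - c              # block totals
    sstr = sum(sum(col) ** 2 for col in zip(*rows)) / n - c    # treatment totals
    sse = sst - ssbl - sstr
    mstr = sstr / (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    return {"SST": sst, "SSBl": ssbl, "SSTr": sstr, "SSE": sse, "VR": mstr / mse}

result = rcbd_anova(data)
reject = result["VR"] > 4.46  # critical value of F(.05; 2, 8) from step 4
print({name: round(value, 4) for name, value in result.items()}, reject)
```

Removing the block sum of squares from the error term is what makes the variance ratio this large; treating the same data as a completely randomized design would leave the age-group variation inside the error.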
Two-Factor Completely Randomized Experiment
The experiment in which two or more
factors are investigated simultaneously is
called a factorial experiment
In a factorial experiment we may study
not only the effects of individual factors
but also, the interaction between factors
Two-Factor Completely Randomized Experiment
Fixed-Effects Model:

$$x_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + e_{ijk}$$

Assumptions:
1. The observations in each of the ab cells constitute a
random independent sample of size n drawn from the
population defined by the particular combination of the
levels of the two factors
2. Each of the ab populations is normally distributed
3. The populations all have the same variance

Factor A     Factor B                                                         Totals     Means
             1                  2                  ...     b
1            x_111 ... x_11n    x_121 ... x_12n    ...     x_1b1 ... x_1bn    T_1..      x̄_1..
2            x_211 ... x_21n    x_221 ... x_22n    ...     x_2b1 ... x_2bn    T_2..      x̄_2..
.            .                  .                          .                  .          .
a            x_a11 ... x_a1n    x_a21 ... x_a2n    ...     x_ab1 ... x_abn    T_a..      x̄_a..
Totals       T_.1.              T_.2.              ...     T_.b.              T_...
Means        x̄_.1.              x̄_.2.              ...     x̄_.b.                         x̄_...
Table of Sample Data from a Two-Factor Completely Randomized Experiment
Computational Formulas
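SST = SSTr + SSE
SSE = SST - SSTr
SSTr = SSA + SSB + SSAB
SSAB = SSTr - SSA - SSB

$$C = \left(\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} x_{ijk}\right)^2 \Big/ abn$$

$$SST = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} x_{ijk}^2 - C$$

$$SSTr = \sum_{i=1}^{a}\sum_{j=1}^{b}\frac{T_{ij.}^2}{n} - C$$

$$SSA = \sum_{i=1}^{a}\frac{T_{i..}^2}{bn} - C$$

$$SSB = \sum_{j=1}^{b}\frac{T_{.j.}^2}{an} - C$$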
Source         SS       d.f.              MS                               V.R.
A              SSA      a - 1             MSA = SSA / (a - 1)              MSA / MSE
B              SSB      b - 1             MSB = SSB / (b - 1)              MSB / MSE
AB             SSAB     (a - 1)(b - 1)    MSAB = SSAB / [(a - 1)(b - 1)]   MSAB / MSE
Treatments     SSTr     ab - 1
Residual       SSE      ab(n - 1)         MSE = SSE / [ab(n - 1)]
Total          SST      abn - 1
ANOVA Table for a Two-Factor Completely Randomized Experiment
EXAMPLE OF INTERACTION ANALYSIS
Suppose, in terms of effect on reaction time, that the
true relationship between three dosage levels of some
drug and the age of human subjects taking the drug is
known. Suppose further that age occurs at two levels
young (under 65) and old (65 and over). Let us
assume that effect is measured in terms of reduction
in reaction time to some stimulus

Factor A - Age      Factor B - Drug Dosage
                    j = 1        j = 2        j = 3
Young (i = 1)       μ11 = 5      μ12 = 10     μ13 = 20
Old (i = 2)         μ21 = 10     μ22 = 15     μ23 = 25
Mean Reduction in Reaction Time (Milliseconds) of Subjects in Two Age
Groups at Three Drug Dosage Levels
1. For both levels of factor A the difference between the means for any
two levels of factor B is the same. That is, for both levels of factor
A, the difference between means for levels 1 and 2 is 5, for levels 2
and 3 the difference is 10, and for levels 1 and 3 the difference is
15.
2. For all levels of factor B the difference between means for the two
levels of factor A is the same. The difference is 5 at all three levels
of factor B.

[Figure: reduction in reaction time plotted against drug dosage (b1, b2, b3) with one line per age group (a1, a2), and against age (a1, a2) with one line per dosage level (b1, b2, b3)]
3. A third characteristic is revealed when the data are plotted.
We note that the curves corresponding to the different levels
of a factor are all parallel.

Factor A - Age      Factor B - Drug Dosage
                    j = 1        j = 2        j = 3
Young (i = 1)       μ11 = 5      μ12 = 10     μ13 = 20
Old (i = 2)         μ21 = 15     μ22 = 10     μ23 = 5
Altered to Show the Effect of One Type of Interaction
1. The difference between means for any two levels of factor B
is not the same for both levels of factor A. The difference
between levels 1 and 2 of factor B is -5 for the young age
group and +5 for the old age group.
2. The difference between means for both levels of factor A is
not the same at all levels of factor B. The differences between
factor A means are -10, 0, and 15 for levels 1, 2, and 3,
respectively, of factor B
[Figure: reduction in reaction time plotted against age (a1, a2) with one line per dosage level (b1, b2, b3), and against drug dosage (b1, b2, b3) with one line per age group (a1, a2)]
3. The factor level curves are not parallel


There is interaction between two factors if a change in one
of the factors produces a change in response at one level
of the other factor different from that produced at other
levels of this factor
Example
In a study of length of time spent on individual home visits by
public health nurses, data were reported on length of home
visit, in minutes, by a sample of 80 nurses. A record was
made also of each nurse's age and the type of illness of each
patient visited. The researchers wished to obtain from their
investigation answers to the following questions:
1. Does the mean length of home visit differ among the
different age groups of nurses?
2. Does type of patient affect the mean length of home visit?
3. Is there interaction between nurse's age and type of
patient?
Let α = .05
Factor A              Factor B (Nurse's Age Group)
(type of patient)     1                 2                 3                 4
levels                (20 to 29)        (30 to 39)        (40 to 49)        (50 & over)       Totals     Means
1 (Cardiac)           20 25 22 27 21    25 30 29 28 30    24 28 24 25 30    28 31 26 29 32     534       26.70
2 (Cancer)            30 45 30 35 36    30 29 31 30 30    39 42 36 42 40    40 45 50 45 60     765       38.25
3 (C.V.A.)            31 30 40 35 30    32 35 30 40 30    41 45 40 40 35    42 50 40 55 45     766       38.30
4 (Tuberculosis)      20 21 20 20 19    23 25 28 30 31    24 25 30 26 23    29 30 28 27 30     509       25.45
Totals                557               596               659               762               2574
Means                 27.85             29.8              32.95             38.10                        32.18
Length of Home Visit in Minutes by Public Health Nurses by Nurse's Age Group and Type of Patient
1. State the hypothesis
a. Ho: α1 = α2 = α3 = α4 = 0
HA: Not all α_i = 0

b. Ho: β1 = β2 = β3 = β4 = 0
HA: Not all β_j = 0

c. Ho: All (αβ)_ij = 0
HA: Not all (αβ)_ij = 0

2. Select the appropriate test statistic: F test or Two-Way ANOVA
3. Select the level of significance: .05
4. Determine the critical ratio or critical value of the test statistic =
2.76, 2.76, and 2.04

Cell       a1b1     a1b2     a1b3     a1b4     a2b1     a2b2     a2b3     a2b4
Totals     115      142      131      146      176      150      199      240
Means      23.0     28.4     26.2     29.2     35.2     30.0     39.8     48.0

Cell       a3b1     a3b2     a3b3     a3b4     a4b1     a4b2     a4b3     a4b4
Totals     166      167      201      232      100      137      128      144
Means      33.2     33.4     40.2     46.4     20.0     27.4     25.6     28.8
5. Perform the calculation for the test statistic

$$C = \frac{(2574)^2}{80} = 82818.45$$

$$SST = (20^2 + 25^2 + \dots + 30^2) - 82818.45 = 5741.55$$

$$SSTr = \frac{115^2 + 142^2 + \dots + 144^2}{5} - 82818.45 = 4801.95$$

$$SSA = \frac{534^2 + 765^2 + 766^2 + 509^2}{20} - 82818.45 = 2992.45$$

$$SSB = \frac{557^2 + 596^2 + 659^2 + 762^2}{20} - 82818.45 = 1201.05$$

SSAB = SSTr - SSA - SSB
= 4801.95 - 2992.45 - 1201.05 = 608.45
SSE = SST - SSTr
= 5741.55 - 4801.95 = 939.60
Source         SS          d.f.     MS         V.R.
A              2992.45      3       997.48     67.95
B              1201.05      3       400.35     27.27
AB              608.45      9        67.61      4.61
Treatments     4801.95     15
Residual        939.60     64        14.68
Total          5741.55     79
ANOVA Table
6. Draw and state the conclusions
a. Ho is rejected since 67.95 is in the rejection region (2.76)
There are differences in the average amount of time spent
in home visits with different types of patients
b. Ho is rejected since 27.27 is in the rejection region (2.76)
There are differences in the average amount of time spent
on home visits among the different nurses when grouped
by age
c. Ho is rejected since 4.61 is in the rejection region (2.04)
Conclude that factors A and B interact; that is, different
combinations of levels of the two factors produce different
effects.
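The three variance ratios can be recomputed from the raw home-visit data with the computational formulas for the two-factor design. The sketch below is one way to arrange it; the nested-list layout `cells[i][j]` (patient type i, age group j) and the function name `two_factor_anova` are assumptions made for the illustration.

```python
# cells[i][j] holds the 5 visit lengths for patient type i and nurse age group j
cells = [
    [[20, 25, 22, 27, 21], [25, 30, 29, 28, 30], [24, 28, 24, 25, 30], [28, 31, 26, 29, 32]],
    [[30, 45, 30, 35, 36], [30, 29, 31, 30, 30], [39, 42, 36, 42, 40], [40, 45, 50, 45, 60]],
    [[31, 30, 40, 35, 30], [32, 35, 30, 40, 30], [41, 45, 40, 40, 35], [42, 50, 40, 55, 45]],
    [[20, 21, 20, 20, 19], [23, 25, 28, 30, 31], [24, 25, 30, 26, 23], [29, 30, 28, 27, 30]],
]

def two_factor_anova(cells):
    """Two-factor completely randomized design with n observations per cell."""
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    values = [x for row in cells for cell in row for x in cell]
    c = sum(values) ** 2 / (a * b * n)                                   # correction term C
    sst = sum(x * x for x in values) - c
    sstr = sum(sum(cell) ** 2 for row in cells for cell in row) / n - c  # cell totals
    ssa = sum(sum(sum(cell) for cell in row) ** 2 for row in cells) / (b * n) - c
    ssb = sum(sum(sum(cells[i][j]) for i in range(a)) ** 2 for j in range(b)) / (a * n) - c
    ssab = sstr - ssa - ssb
    sse = sst - sstr
    mse = sse / (a * b * (n - 1))
    return {
        "SSA": ssa, "SSB": ssb, "SSAB": ssab, "SSE": sse,
        "F_A": ssa / (a - 1) / mse,
        "F_B": ssb / (b - 1) / mse,
        "F_AB": ssab / ((a - 1) * (b - 1)) / mse,
    }

result = two_factor_anova(cells)
print({name: round(value, 2) for name, value in result.items()})
```

Computed without intermediate rounding, the ratio for factor A is about 67.94; the 67.95 in the ANOVA table comes from dividing the rounded mean squares 997.48 and 14.68.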
Three or More Population Proportion
Second criterion of      First criterion of classification level
classification level     1       2       3       ...     c         Total
1                        n_11    n_12    n_13    ...     n_1c      n_1.
2                        n_21    n_22    n_23    ...     n_2c      n_2.
3                        n_31    n_32    n_33    ...     n_3c      n_3.
.                        .       .       .               .         .
r                        n_r1    n_r2    n_r3    ...     n_rc      n_r.
Total                    n_.1    n_.2    n_.3    ...     n_.c      n
Two-Way Classification of a Sample of Entities
For contingency tables with more than 1 degree of freedom a
minimum expectation of 1 is allowable if no more than 20 percent
of the cells have expected frequencies of less than 5
Degree of freedom = (r-1)(c-1)
Example
A research team studying the relationship
between blood type and severity of a certain
condition in a population collected data on
1500 subjects. The researchers wished to
know if these data were compatible with the
hypothesis that severity of condition and
blood type are independent. Let α = .01
Severity of       Blood Type
condition         A        B        AB       O        Total
Absent            543      211       90      476      1320
Mild               44       22        8       31       105
Severe             28        9        7       31        75
Total             615      242      105      538      1500
Fifteen-Hundred Subjects Classified by
Severity of Condition and Blood Type
1. Ho: Blood type and severity of the condition are independent
HA: The two variables are not independent
2. Select the appropriate test statistic: χ² test with d.f. = (r-1)(c-1) = (3-1)(4-1) = 6

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

3. Select the level of significance: .01
4. Determine the critical ratio or critical value of the test statistic
χ²(.01, 6) = 16.812
Observed frequencies, with expected frequencies in parentheses:

Severity of       Blood Type
condition         A               B               AB             O               Total
Absent            543 (541.2)     211 (212.96)    90 (92.40)     476 (473.44)    1320
Mild               44 (43.05)      22 (16.94)      8 (7.35)       31 (37.66)      105
Severe             28 (30.75)       9 (12.10)      7 (5.25)       31 (26.90)       75
Total             615             242             105            538             1500
5. Perform the calculation for the test statistic
$$\chi^2 = \frac{(543 - 541.2)^2}{541.2} + \frac{(211 - 212.96)^2}{212.96} + \dots + \frac{(31 - 26.90)^2}{26.90}$$

$$= .005987 + .018039 + \dots + .624907 = 5.12$$
6. Draw and state the conclusion
Do not reject Ho since 5.12 is in the region of non-rejection
Conclude that these data are compatible with the hypothesis that
severity of the condition and blood type are independent
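For an r x c table the expected frequency of each cell is (row total)(column total)/n, and the statistic sums (O - E)²/E over all cells. A minimal sketch for the blood type data follows; the function name `chi_square_independence` is hypothetical.

```python
# Rows: severity (absent, mild, severe); columns: blood type (A, B, AB, O)
observed = [
    [543, 211, 90, 476],
    [44, 22, 8, 31],
    [28, 9, 7, 31],
]

def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency under independence
            stat += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

stat, df = chi_square_independence(observed)
print(round(stat, 2), df)
```

The smallest expected frequency here is 5.25, so the small-frequency rule stated above for tables with more than 1 degree of freedom is satisfied.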
Session Five
Correlation and Regression
Correlation
A measure of the linear relationship between two
variables measured on a numerical scale
Both Y and X are random variables
Correlation Coefficient is the statistic used to
measure the strength of the linear relationship
between X and Y
ρ is the symbol used for the population correlation
coefficient
r is the symbol used for sample correlation
coefficient
Assumptions in Correlation
1. For each value of X there is a normally
distributed subpopulation of Y values
2. For each value of Y there is a normally
distributed subpopulation of X values
3. The joint distribution of X and Y is a normal
distribution called the bivariate normal
distribution
4. The subpopulations of Y values all have the
same variance
5. The subpopulations of X values all have the
same variance

Correlation Coefficient
May assume any value between -1
and +1
If ρ = 1 there is a perfect direct linear
correlation between the two variables
If ρ = -1 there is perfect inverse linear
correlation
If ρ = 0 the two variables are not
correlated

Computational Formula of
Correlation Coefficient
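$$r = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$$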
Example
Blood pressure readings by two different
methods were made on 25 patients with
essential hypertension. The clinician
wished to investigate the strength of the
relationship between the two measurements
Patient                              Patient
number     Method I     Method II    number     Method I     Method II
 1         132          130          14         180          174
 2         138          134          15         188          186
 3         144          132          16         194          172
 4         146          140          17         194          182
 5         148          150          18         200          178
 6         152          144          19         200          196
 7         158          150          20         204          188
 8         130          122          21         210          180
 9         162          160          22         210          196
10         168          150          23         216          210
11         172          160          24         220          190
12         174          178          25         220          202
13         180          168
Systolic Blood Pressure Readings (mm Hg) by Two Methods in
25 Patients with Essential Hypertension
Method I (x)     Method II (y)     x²         y²         xy
132              130               17424      16900      17160
138              134               19044      17956      18492
144              132               20736      17424      19008
146              140               21316      19600      20440
148              150               21904      22500      22200
152              144               23104      20736      21888
158              150               24964      22500      23700
130              122               16900      14884      15860
162              160               26244      25600      25920
168              150               28224      22500      25200
172              160               29584      25600      27520
174              178               30276      31684      30972
180              168               32400      28224      30240
180              174               32400      30276      31320
188              186               35344      34596      34968
194              172               37636      29584      33368
194              182               37636      33124      35308
200              178               40000      31684      35600
200              196               40000      38416      39200
204              188               41616      35344      38352
210              180               44100      32400      37800
210              196               44100      38416      41160
216              210               46656      44100      45360
220              190               48400      36100      41800
220              202               48400      40804      44440
4440             4172              808408     710952     757276
$$r = \frac{25(757{,}276) - (4440)(4172)}{\sqrt{25(808{,}408) - (4440)^2}\,\sqrt{25(710{,}952) - (4172)^2}}$$

$$r = .95$$
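The column sums and r can be checked directly from the 25 pairs of readings. This is a small sketch of the computational formula; the function name `pearson_r` is hypothetical.

```python
import math

# Systolic blood pressure readings for the 25 patients
x = [132, 138, 144, 146, 148, 152, 158, 130, 162, 168, 172, 174, 180,
     180, 188, 194, 194, 200, 200, 204, 210, 210, 216, 220, 220]   # Method I
y = [130, 134, 132, 140, 150, 144, 150, 122, 160, 150, 160, 178, 168,
     174, 186, 172, 182, 178, 196, 188, 180, 196, 210, 190, 202]   # Method II

def pearson_r(xs, ys):
    """Sample correlation coefficient from the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(v * v for v in xs)
    sum_y2 = sum(v * v for v in ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

r = pearson_r(x, y)
print(sum(x), sum(y), round(r, 2))
```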
Test of Significance for (r)
1. Ho: ρ = 0
HA: ρ ≠ 0
2. Select the appropriate test statistic:

$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$

3. Select the level of significance: .05
4. Determine the critical ratio or critical value:
t(.05, df = n - 2 = 23) = 2.0687
5. Perform the calculation for the test statistic

$$t = .954605\sqrt{\frac{25 - 2}{1 - .954605^2}} = .954605\sqrt{\frac{23}{1 - .9112707}} = 15.37$$

6. Draw and state the conclusion
Reject Ho since 15.37 > 2.0687
Conclude that the two variables are correlated
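The t statistic for Ho: ρ = 0 takes one line once r and n are known. The sketch below uses the value of r carried in the slide calculation (.954605) and the tabled critical value 2.0687; the function name `t_for_correlation` is hypothetical.

```python
import math

def t_for_correlation(r, n):
    """t statistic with n - 2 degrees of freedom for testing rho = 0."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t = t_for_correlation(0.954605, 25)
reject = abs(t) > 2.0687  # two-sided critical value for alpha = .05, 23 df
print(round(t, 2), reject)
```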
Fisher's z Transformation to Test the Correlation
Used when the interest lies in whether the correlation is
equal to a specific value other than zero
r is transformed to z_r:

$$z_r = \frac{1}{2}\ln\frac{1 + r}{1 - r}$$

z_r is approximately normally distributed with a mean of

$$z_\rho = \frac{1}{2}\ln\frac{1 + \rho}{1 - \rho}$$

and a standard deviation of

$$\frac{1}{\sqrt{n - 3}}$$

The test statistic for the test of significance is

$$Z = \frac{z_r - z_\rho}{1 / \sqrt{n - 3}}$$

The 95% confidence interval for z_ρ can also be determined:

$$z_r \pm z\left(\frac{1}{\sqrt{n - 3}}\right)$$
Example
Suppose in our present example we wish to test
1. State the hypothesis
Ho: ρ = .98
HA: ρ ≠ .98
2. Select the appropriate test statistic

$$Z = \frac{z_r - z_\rho}{1 / \sqrt{n - 3}}$$

3. Select the level of significance: .05
4. Determine the critical ratio or critical value of the test statistic:
±1.96
5. Perform the calculation for the test statistic
r = .95     z_r = 1.83178
ρ = .98     z_ρ = 2.29756

$$Z = \frac{1.83178 - 2.29756}{1 / \sqrt{25 - 3}} = -2.18$$

6. Draw and state the conclusion
Reject Ho since -2.18 is less than -1.96
Conclude that the population correlation coefficient is not .98
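The transformation and test statistic can be sketched as follows, using r = .95, the hypothesized ρ = .98, and n = 25 from the example. The function names `fisher_z` and `fisher_test` are hypothetical.

```python
import math

def fisher_z(r):
    """Fisher's z transformation: one half of ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def fisher_test(r, rho0, n):
    """Z statistic for Ho: rho = rho0, using a standard deviation of 1/sqrt(n - 3)."""
    return (fisher_z(r) - fisher_z(rho0)) * math.sqrt(n - 3)

z_r = fisher_z(0.95)
z_rho = fisher_z(0.98)
z = fisher_test(0.95, 0.98, 25)
reject = abs(z) > 1.96  # two-sided test at alpha = .05
print(round(z_r, 5), round(z_rho, 5), round(z, 2), reject)
```

Multiplying by the square root of n - 3 is the same as dividing by the standard deviation 1/sqrt(n - 3).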
Interpretation of Correlation Coefficient (r)
0 to 0.25 (or -0.25)
- indicate little or no relationship
0.25 to 0.50 (or -0.25 to -0.50)
- indicate a fair degree of relationship
0.50 to 0.75 (or -0.50 to -0.75)
- indicate moderate to good relationship
> 0.75 (or < -0.75)
- a very good or excellent relationship
Simple Linear Regression
The main objective is to predict or estimate
the value of one variable corresponding to a
given value of another variable.
Analyzes two variables: one is independent
(x) and the other one is dependent (y).

Simple Linear Regression
Assumptions:
1. Values of the independent variable (x) are said to be
fixed (non-random) the values of x are pre-selected
by the investigator so that in the collection of the data
they are not allowed to vary from these pre-selected
values
2. The variable (x) is measured without error.
3. For each value of (x) there is a subpopulation of (y)
values. These subpopulations are normally distributed.
4. The variances of the subpopulations of (y) are all equal
5. The means of the subpopulations of (y) all lie on the
same straight line. (Assumption of linearity)
6. The (y) values are statistically independent (the values of y
chosen at one value of (x) in no way depend on the
values of (y) chosen at another value of (x))

Regression model:
y = α + βx + e

y = a typical value from one of the subpopulations of Y
α and β = population regression coefficients
e = error term

Sample Regression Equation
Describes the true relationship between the
dependent variable (y) and the independent
variable (x)
In the absence of extensive information
regarding the nature of the variables of interest,
a frequently employed strategy is to assume
initially that they are linearly related.
Subsequent analysis involves the
following steps
1. Determine whether or not the assumptions underlying a
linear relationship are met in the data available for
analysis
2. Obtain the equation for the line that best fits the sample
data
3. Evaluate the equation to obtain some idea of the strength
of the relationship and the usefulness of the equation for
predicting and estimating
4. If the data appear to conform satisfactorily to the linear
model, use the equation obtained from the sample data to
predict and to estimate
Example
A team of professional mental health workers in a
long-stay psychiatric hospital wished to measure
the level of response of withdrawn patients to a
program of remotivation therapy. A standardized
test was available for this purpose, but it was
expensive and time-consuming to administer. To
overcome this obstacle the team developed a test
that was much easier to administer. To test the
usefulness of the new instrument for measuring the
level of patient response, the team decided to
examine the relationship between scores made on
the new test and scores made on the standardized
test.
The objective was to use the new test if it could
be shown that it was a good predictor of a
patient's score on the standardized test. The
team was interested only in carrying out the
analysis for standardized scores between 50 and
100, since a score below 50 did not represent a
significant level of response, and scores above
100, although possible, were seldom made by
the type of patient under consideration. The
team also felt that the use of scores in
increments of 5 would give a good coverage of
the range of scores between 50 and 100.
Accordingly, 11 patients who had made scores
on the new test of 50, 55, 60, 65, 70, 75, 80, 85,
90, 95, and 100, respectively, were selected to
take the standardized test. The independent
and dependent variables, respectively, are
scores made on the new test and scores made
on the standardized test.


Patients' Scores on Standardized Test and New Test

Patient number   Score on new test (x)   Score on standardized test (y)
 1                50                      61
 2                55                      61
 3                60                      59
 4                65                      71
 5                70                      80
 6                75                      76
 7                80                      90
 8                85                     106
 9                90                      98
10                95                     100
11               100                     114
Assumptions met by the data
1. The assumption of fixed values of (x) is
satisfied
2. The values of (x) were selected in
advance and were not allowed to vary
A first step that is usually useful in studying
the relationship between two variables is to
prepare a scatter diagram
[Scatter diagram of the patients' scores on the new test and on the standardized test; both axes run from 0 to 120]
The pattern made by the points plotted on the
scatter diagram usually suggests the basic nature of
the relationship

The points seem to be scattered around an invisible
straight line.

It also shows that, in general, patients who score
high on the new test also make high scores on the
standardized test
The method usually employed for obtaining the
desired line is known as the method of least
squares, and the resulting line is called the least-
squares line
The general equation for a straight line is:
y = a + bx
y = a value on the vertical axis
x = a value on the horizontal axis
a = the point where the line crosses the vertical axis
( y-intercept)
b = shows the amount by which y changes for each unit
change in x (slope of the line)
To draw a line, we need the numerical values of the
constants (a) and (b)
$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$$
$$a = \frac{\sum y - b\sum x}{n}$$
Given these constants, we may substitute various values of
(x) into the equation to obtain corresponding values of y
The resulting points may be plotted. Two points determine a
straight line
Intermediate Computations for Normal Equations

Score on new test (x)   Score on standardized test (y)   x²      y²      xy
  50                      61                              2500    3721    3050
  55                      61                              3025    3721    3355
  60                      59                              3600    3481    3540
  65                      71                              4225    5041    4615
  70                      80                              4900    6400    5600
  75                      76                              5625    5776    5700
  80                      90                              6400    8100    7200
  85                     106                              7225   11236    9010
  90                      98                              8100    9604    8820
  95                     100                              9025   10000    9500
 100                     114                             10000   12996   11400
Total 825                916                             64625   80076   71790
$$b = \frac{11(71790) - (825)(916)}{11(64625) - (825)^2} = 1.1236$$
$$a = \frac{916 - 1.1236(825)}{11} = -.9973$$
The linear equation for the least-squares line that describes
the relationship between scores on the standardized test
and scores on the new test may be written as

$\hat{y}$ = -.9973 + 1.1236x
Obtain the ($\hat{y}$) for x = 50 and x = 100

$\hat{y}$ = -.9973 + 1.1236(50) = 55.1827

$\hat{y}$ = -.9973 + 1.1236(100) = 111.3627

With these two values, we can draw the least-squares line

The least-squares line does not pass through any of the
observed points that are plotted on the scatter diagram

The sum of the squared vertical deviations of the observed
data points ($y_i$) from the least-squares line is smaller than
the sum of the squared vertical deviations of the data points
from any other line
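As a check on the hand computation, the following sketch fits the line from the eleven score pairs and illustrates the least-squares property by comparing against two slightly shifted lines. The helper names `least_squares` and `sse` and the size of the shifts are illustrative choices, not part of the slides.

```python
# Scores from the slide table: new test (x) and standardized test (y)
x = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
y = [61, 61, 59, 71, 80, 76, 90, 106, 98, 100, 114]

def least_squares(x, y):
    """Return (a, b) for y-hat = a + b*x using the formulas for b and a."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

def sse(a, b, x, y):
    """Sum of squared vertical deviations of the points from the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

a, b = least_squares(x, y)
print(f"a = {a:.4f}, b = {b:.4f}")
print(f"y-hat at x=50: {a + b * 50:.2f}, at x=100: {a + b * 100:.2f}")

# Any other line (here: hypothetical small shifts) has a larger sum of squares.
assert sse(a, b, x, y) < sse(a + 1, b, x, y)
assert sse(a, b, x, y) < sse(a, b + 0.01, x, y)
```

Carried at full precision the intercept works out to -1.0000; the slide's -.9973 comes from rounding b to 1.1236 before computing a. The predicted values agree to two decimals either way.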
Evaluating The Regression Equation
If in the population the relationship between X
and Y is linear, β, the slope of the line that
describes this relationship, will be either positive,
negative, or zero.
If β is zero, sample data drawn from the
population will, in the long run, yield regression
equations that are of little or no value for
prediction and estimation purposes
Evaluating The Regression Equation
If hypothesis testing does not reject the null
hypothesis that β equals zero, the
conclusion could be either:
1. That although the relationship between X and Y may
be linear, it is not strong enough for X to be of much
value in predicting and estimating Y.
2. That the relationship between X and Y is not linear;
that is, some curvilinear model provides a better fit to
the data
Evaluating The Regression Equation
Thus, before using a sample regression equation
to predict and estimate, it is desirable to test Ho:
β = 0.
Either the F test or the t test can be used as the test
statistic
Coefficient of Determination ($r^2$)
tells us how strong the relationship really is
(the strength of the regression equation)
determined by comparing the scatter of the
points about the regression line with the
scatter about $\bar{y}$, the mean of the values of Y.

Total deviation = the vertical distance of an observed point
from the $\bar{y}$ line, designated ($y_i - \bar{y}$)

Explained deviation = the vertical distance from the regression
line to the $\bar{y}$ line, designated ($\hat{y} - \bar{y}$)

Unexplained deviation = the vertical distance of the observed
point from the regression line, designated ($y_i - \hat{y}$)


Total deviation = explained deviation + unexplained deviation
$$(y_i - \bar{y}) = (\hat{y} - \bar{y}) + (y_i - \hat{y})$$



If we measure these deviations for each value of $y_i$ and $\hat{y}$,
square each deviation, and add up the squared deviations,
we have
$$\sum(y_i - \bar{y})^2 = \sum(\hat{y} - \bar{y})^2 + \sum(y_i - \hat{y})^2$$
total sum of squares (SST) = explained sum of squares (SSR) + unexplained sum of squares (SSE)

These quantities may be considered measures of dispersion or
variability.

SST - measures the total variation in the observed values of Y
SSR - measures the amount of the total variability in the
observed values of Y that is accounted for by the linear
relationship between the observed values of X and Y.
SSE - measures the dispersion of the observed Y values about
the regression line (sometimes called the error sum of
squares or the residual sum of squares)
Computational Formulas
$$SST = \sum(y_i - \bar{y})^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$$
$$SSR = \sum(\hat{y} - \bar{y})^2 = b^2\left[\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right]$$
$$SSE = SST - SSR$$
Coefficient of Determination ($r^2$)
$$r^2 = \frac{SSR}{SST}$$


The sample coefficient of determination measures
the closeness of fit of the sample regression
equation to the observed values of Y.
When the quantities ($y_i - \hat{y}$), the vertical
distances of the observed values of Y from the
equation, are small, the unexplained sum of
squares is small.
This leads to a large explained sum of squares
that leads, in turn, to a large value of $r^2$

$$SST = 61^2 + 61^2 + \dots + 114^2 - \frac{(916)^2}{11} = 3798.1818$$
$$SSR = (1.1236)^2\left[50^2 + 55^2 + \dots + 100^2 - \frac{(825)^2}{11}\right] = 3471.8116$$
$$SSE = 3798.1818 - 3471.8116 = 326.3702$$
$$r^2 = \frac{SSR}{SST} = \frac{3471.8116}{3798.1818} = .91$$
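The sums of squares can be reproduced from the column totals alone. This is a minimal sketch using the computational formulas; the variable names are illustrative, and b is carried at full precision here rather than rounded.

```python
# Column totals from the slide's intermediate-computation table
n = 11
sum_x, sum_y = 825, 916
sum_x2, sum_y2, sum_xy = 64625, 80076, 71790

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

sst = sum_y2 - sum_y ** 2 / n               # total sum of squares
ssr = b ** 2 * (sum_x2 - sum_x ** 2 / n)    # explained (regression) sum of squares
sse = sst - ssr                             # unexplained (residual) sum of squares
r_squared = ssr / sst

print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}, r^2 = {r_squared:.2f}")
```

With unrounded b, SSR is about 3472.04 and SSE about 326.15. The slide's 3471.8116 and 326.3702 result from rounding b to 1.1236 first; r² is .91 in both cases.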
Testing Ho: β = 0 with the F Statistic
ANOVA Table for Simple Linear Regression

Source of variation   SS    d.f.    MS              V.R.
Linear regression     SSR   1       SSR / 1         MSR / MSE
Residual              SSE   n - 2   SSE / (n - 2)
Total                 SST   n - 1
1. State the hypothesis
Ho: β = 0
HA: β ≠ 0
2. Select the appropriate test statistic: F test
3. Select the level of significance: .05
4. Determine the critical ratio or critical value of the test statistic: 5.12
5. Perform the calculation for the test statistic

Source of variation   SS          d.f.   MS          V.R.
Linear regression     3471.8116   1      3471.8116   95.74
Residual              326.3702    9      36.2634
Total                 3798.1818   10
6. Draw and state the conclusion
Ho is rejected since 95.74 > 5.12
Conclude that the linear model provides a good fit to the data
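The ANOVA arithmetic can be sketched as below. The function name `regression_anova` is illustrative, and the critical value is copied from the slide rather than computed from the F distribution.

```python
def regression_anova(ssr, sse, n):
    """Return (MSR, MSE, variance ratio) for simple linear regression."""
    msr = ssr / 1            # regression has 1 degree of freedom
    mse = sse / (n - 2)      # residual has n - 2 degrees of freedom
    return msr, mse, msr / mse

# Sums of squares as reported on the slide (computed there with b rounded to 1.1236)
msr, mse, f_ratio = regression_anova(3471.8116, 326.3702, 11)

F_CRITICAL = 5.12  # tabulated F(.95; 1, 9), taken from the slide, not computed here
reject_h0 = f_ratio > F_CRITICAL
print(f"MSE = {mse:.4f}, V.R. = {f_ratio:.2f}, reject Ho: {reject_h0}")
```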
Testing Ho: β = 0 with the t Statistic
When the assumptions of the regression model hold
true, the sampling distributions of (a) and (b) are each
normally distributed with means and variances as
follows:

$$\mu_a = \alpha \qquad \sigma_a^2 = \frac{\sigma_{y|x}^2 \sum x_i^2}{n\sum(x_i - \bar{x})^2}$$
$$\mu_b = \beta \qquad \sigma_b^2 = \frac{\sigma_{y|x}^2}{\sum(x_i - \bar{x})^2}$$
Testing Ho: β = 0 with the t Statistic
Inferences regarding α are usually not of interest
A great deal of interest centers on inferential
procedures with respect to β
β tells us so much about the form of the
relationship between X and Y
When X and Y are linearly related, a positive β indicates
that, in general, Y increases as X increases, and we say
that there is a direct linear relationship between X and Y
A negative β indicates that values of Y tend to decrease
as values of X increase, and we say that there is an
inverse linear relationship between X and Y.
When there is no linear relationship between X and Y, β
is equal to zero

Testing Ho: β = 0 with the t Statistic
For testing hypotheses about β, the test statistic
when $\sigma_{y|x}^2$ is known is
$$z = \frac{b - \beta_0}{\sigma_b}$$
As a rule $\sigma_{y|x}^2$ is unknown. When this is the case,
the test statistic is
$$t = \frac{b - \beta_0}{s_b}$$
where $s_b$ is an estimate of $\sigma_b$ and t is distributed as Student's t
with n - 2 degrees of freedom
Testing Ho: β = 0 with the t Statistic
To obtain $s_b$, we must first estimate $\sigma_{y|x}^2$
$s_{y|x}^2$ is an unbiased estimator of $\sigma_{y|x}^2$. This is the
unexplained variance computed from the
sample data
$$s_{y|x}^2 = \frac{\sum(y_i - \hat{y})^2}{n - 2}$$
or, equivalently,
$$s_{y|x}^2 = \frac{n-1}{n-2}\left(s_y^2 - b^2 s_x^2\right)$$
We can now compute $s_b$
$$s_b^2 = \frac{s_{y|x}^2}{\sum(x_i - \bar{x})^2} = \frac{s_{y|x}^2}{\sum x_i^2 - (\sum x_i)^2/n}$$
Testing Ho: β = 0 with the t Statistic
1. State the hypothesis
Ho: β = 0
Ha: β ≠ 0
2. Select the appropriate test statistic: t test
3. Select the level of significance: .05
4. Determine the critical ratio or critical value of the test statistic: 2.2622
5. Perform the calculation for the test statistic

$s_y^2$ = 379.8182, $s_x^2$ = 275.0000

$$s_{y|x}^2 = \frac{10}{9}\left[379.8182 - (1.1236)^2(275.0000)\right] = 36.2634$$
$$s_b^2 = \frac{36.2634}{50^2 + 55^2 + \dots + 100^2 - (825)^2/11} = .013$$
Testing Ho: β = 0 with the t Statistic
$$t = \frac{b - \beta_0}{s_b} = \frac{1.1236 - 0}{\sqrt{.013}} = 9.85$$
6. Draw and state the conclusion
Reject Ho since 9.85 > 2.2622 (p value is less than .01)
Conclude that the slope of the true regression line is not zero.
We can expect to get better predictions and estimates of Y if we
use the sample regression equation than we would get if we
ignore the relationship between X and Y.
Since β is positive, the relationship between X and Y is a direct
linear relationship
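The t computation can be traced step by step with the sketch below. It starts from the sample variances and the rounded slope given on the slides; the variable names are illustrative, and the critical value is copied from the slide rather than computed.

```python
import math

n = 11
b = 1.1236            # slope as rounded on the slide
s2_y = 379.8182       # sample variance of y
s2_x = 275.0          # sample variance of x

# Unexplained variance: s^2_{y|x} = (n - 1)/(n - 2) * (s_y^2 - b^2 * s_x^2)
s2_yx = (n - 1) / (n - 2) * (s2_y - b ** 2 * s2_x)

# Sum of squares of x about its mean equals (n - 1) * s_x^2
sxx = (n - 1) * s2_x
s2_b = s2_yx / sxx                 # estimated variance of b
t_stat = (b - 0) / math.sqrt(s2_b)

T_CRITICAL = 2.2622  # tabulated t(.975; 9 d.f.), taken from the slide
print(f"s2_yx = {s2_yx:.4f}, s2_b = {s2_b:.4f}, t = {t_stat:.2f}")
print("reject Ho:", abs(t_stat) > T_CRITICAL)
```

With the variance of b carried at full precision (about .0132), t comes out near 9.78, slightly below the slide's 9.85, which rounds that variance to .013 before taking the square root. The conclusion is unchanged, and the square of the unrounded t reproduces the variance ratio of about 95.74 from the F test.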
Confidence Interval for β
Estimator ± (reliability factor)(standard error of the
estimate)
b = estimator
z or t = reliability factor
If the variance $\sigma_{y|x}^2$ is known, the standard error of the estimate
is
$$\sigma_b = \sqrt{\frac{\sigma_{y|x}^2}{\sum(x_i - \bar{x})^2}}$$
If the variance $\sigma_{y|x}^2$ is unknown, $\sigma_b$ is estimated by
$$s_b = \sqrt{\frac{s_{y|x}^2}{\sum x_i^2 - (\sum x_i)^2/n}}$$
Example for Confidence Interval of β
$$b \pm t_{(1-\alpha/2)}\sqrt{\frac{s_{y|x}^2}{\sum x_i^2 - (\sum x_i)^2/n}}$$
$$1.1236 \pm 2.2622\sqrt{.013}$$
1.1236 ± .2579

.8657, 1.3815
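A short sketch of the same interval, using the rounded values from the slide. The function name is illustrative, and the reliability factor is the tabulated value quoted in the slides.

```python
import math

def slope_confidence_interval(b, s2_b, t_crit):
    """Return (lower, upper) limits for beta: b -/+ t * sqrt(s_b^2)."""
    margin = t_crit * math.sqrt(s2_b)
    return b - margin, b + margin

# Slide values: b = 1.1236, s_b^2 rounded to .013, t(.975; 9 d.f.) = 2.2622
lower, upper = slope_confidence_interval(1.1236, 0.013, 2.2622)
print(f"{lower:.4f}, {upper:.4f}")
```

The interval does not contain zero, which agrees with rejecting Ho: β = 0 at the .05 level.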
Using the Regression Equation
1. Predicting Y for a given X
2. Estimating the Mean of Y for a given X

A 95% confidence interval for both can be
determined
Example for prediction of y for a given x
Suppose we have a patient who makes a score
of 70 on the new test and we want to predict
his score on the standardized test. To obtain
the predicted value, we substitute 70 for x in
the sample regression equation
Predicted value
$\hat{y}$ = -.9973 + 1.1236 (70)
= 78
95% Confidence Interval of the predicted value
$$\hat{y} \pm t_{(1-\alpha/2)}\, s_{y|x}\sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum x_i^2 - (\sum x_i)^2/n}}$$
$$78 \pm 2.2622\sqrt{36.2634}\sqrt{1 + \frac{1}{11} + \frac{(70 - 75)^2}{2750}}$$
78 ± 2.2622 (6.02) (1.0488)

78 ± 14

64, 92
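The prediction interval can be sketched as follows, using the quantities already obtained in the slides. The function and parameter names are illustrative; `sxx` stands for the sum of squares of x about its mean (2750 here).

```python
import math

def prediction_interval(a, b, x_p, s2_yx, n, x_bar, sxx, t_crit):
    """Interval for a single new y at x_p; note the leading 1 under the root."""
    y_hat = a + b * x_p
    spread = math.sqrt(1 + 1 / n + (x_p - x_bar) ** 2 / sxx)
    margin = t_crit * math.sqrt(s2_yx) * spread
    return y_hat, y_hat - margin, y_hat + margin

y_hat, low, high = prediction_interval(
    a=-0.9973, b=1.1236, x_p=70, s2_yx=36.2634,
    n=11, x_bar=75, sxx=2750, t_crit=2.2622,
)
print(f"{y_hat:.2f} ({low:.1f}, {high:.1f})")
```

Without intermediate rounding the predicted value is about 77.65 and the margin about 14.3; the slide rounds these to 78 and 14 before forming the limits 64 and 92.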
Example for estimation of mean of y for a given x
Suppose we are interested in estimating the
mean Y score for a subpopulation of patients all
of whom make a score of 70 on the new test
Estimated mean of y
$\hat{y}$ = -.9973 + 1.1236 (70)
= 78

95% Confidence Interval for estimated mean of Y
$$\hat{y} \pm t_{(1-\alpha/2)}\, s_{y|x}\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum x_i^2 - (\sum x_i)^2/n}}$$
$$78 \pm 2.2622\sqrt{36.2634}\sqrt{\frac{1}{11} + \frac{(70 - 75)^2}{2750}}$$
78 ± 4

74, 82
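A sketch of the margin for the mean of Y, which differs from the prediction case only in dropping the leading 1 under the square root. The function name and the extra x values 50, 75, and 100 are illustrative additions used to show how the width changes across the range of x.

```python
import math

def mean_ci_margin(x_p, s2_yx, n, x_bar, sxx, t_crit):
    """Half-width of the interval for the mean of y at x_p."""
    return t_crit * math.sqrt(s2_yx) * math.sqrt(1 / n + (x_p - x_bar) ** 2 / sxx)

# Slide values: s^2_{y|x} = 36.2634, n = 11, mean of x = 75, sum of squares of x = 2750
params = dict(s2_yx=36.2634, n=11, x_bar=75, sxx=2750, t_crit=2.2622)

y_hat = -0.9973 + 1.1236 * 70
margin_70 = mean_ci_margin(70, **params)
print(f"{y_hat:.2f} +/- {margin_70:.2f}")

# The interval is narrowest at the mean of x and widens toward the ends of the range.
for x_p in (50, 75, 100):
    print(x_p, round(mean_ci_margin(x_p, **params), 2))
```

The margin at x = 70 is about 4.3, which the slide rounds to 4. It is much narrower than the margin of about 14 for an individual prediction because it leaves out the variability of a single observation about the line.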
THANK YOU
&
GOD BLESS
