Final Exam
You may bring 5 pages of notes
You MUST bring full copies of
statistical tables (on Blackboard)
You MUST bring a calculator
Logistic Regression
Survival analysis
One-Way ANOVA
Used when we want to compare the means
of three or more groups from independent
populations.
Continuous outcome measured on each
subject.
We set up an analysis of variance table and
compare the between-group variance with
the within-group variance.
An F-test is used with two different degrees
of freedom terms (between and within groups).
Chi-Square Test
Chi-square goodness of fit test
Assess whether responses fit a specified
distribution for one sample of people
Power
Linked to Type II error (β)
Power = 1 - β
= P(Reject H0 | H0 false)
= Probability of correctly
rejecting H0 when H0 is false.
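As a rough illustration of Power = 1 - β, the sketch below computes the power of an upper-tailed one-sample Z test. The function name `power_upper_tailed_z` and all numeric inputs (null mean 100, true mean 105, σ = 15, n = 36) are hypothetical values chosen for illustration and do not come from the slides.

```python
from statistics import NormalDist

def power_upper_tailed_z(mu0, mu1, sigma, n, alpha=0.05):
    """Power of an upper-tailed one-sample Z test of H0: mu = mu0
    when the true mean is mu1 (sigma assumed known)."""
    std_normal = NormalDist()
    se = sigma / n ** 0.5
    # H0 is rejected when the sample mean exceeds this cutoff.
    critical_mean = mu0 + std_normal.inv_cdf(1 - alpha) * se
    # Probability of landing in the rejection region when the true mean is mu1.
    return 1 - NormalDist(mu=mu1, sigma=se).cdf(critical_mean)

# Hypothetical inputs, for illustration only.
power = power_upper_tailed_z(mu0=100, mu1=105, sigma=15, n=36)
beta = 1 - power  # probability of a Type II error
print(f"power = {power:.3f}, beta = {beta:.3f}")
```

When the true mean equals the null mean, the same function returns α, which is a quick sanity check on the setup.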
Correlation
Correlation measures the nature and
strength of linear association between
two variables at a time.
Regression gives the equation that best
describes the relationship between the
variables.
Correlation Coefficient
Population correlation is ρ (rho)
Sample correlation is r where
-1 ≤ r ≤ +1
Sign indicates nature of relationship
(positive or direct, negative or inverse)
Linear Regression
A very popular method for describing
the linear relationship between two
variables (usually continuous
variables).
We use a scatterplot to display the
data graphically
y = b0 + b1 x
Linear Regression
Predictors can be continuous, indicator
variables (0/1) or a set of dummy variables
Confounding: the effect of a risk factor on
an outcome is somehow changed due to the
effect of another factor.
Effect Modification: a different relationship
between the risk factor and an outcome
depending on the level of another variable.
Logistic Regression
Used when the outcome is dichotomous
(binary), e.g. diseased , not diseased.
Our goals remain the same as for linear
regression:
is there an association between a
variable X and our outcome variable Y?
If so, what type?
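$$p = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$$
$$\text{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 x$$
$$\ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$$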
Exp(bi) = OR
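The relationships above can be sketched in a few lines of Python. The helper names and the coefficients b0 = -2.0 and b1 = 0.5 are hypothetical, picked only to show that the logit undoes the logistic transformation and that exp(b1) is the odds ratio per one-unit increase in x.

```python
import math

def logistic_probability(b0, b1, x):
    """p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    linear = b0 + b1 * x
    return math.exp(linear) / (1.0 + math.exp(linear))

def logit(p):
    """Log odds: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def odds_ratio(b):
    """Odds ratio for a one-unit increase in the predictor."""
    return math.exp(b)

# Hypothetical coefficients, for illustration only.
b0, b1 = -2.0, 0.5
p = logistic_probability(b0, b1, x=3.0)
print(f"p = {p:.4f}")
print(f"logit(p) = {logit(p):.4f}")   # equals b0 + b1*x = -0.5
print(f"OR = {odds_ratio(b1):.4f}")
```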
Survival Analysis
Outcome is the time to an event.
An event could be a heart attack,
cancer remission or death.
Survival Analysis
Censoring: incomplete follow-up information
We measure follow-up time and not time to
event
We know only that survival time > follow-up time
Problem 1.
Suppose a cross-sectional study is
conducted to investigate cardiovascular risk
factors among a sample of patients seeking
medical care at one of three local hospitals.
A total of 300 patients are enrolled. Using
the following data, test if there is an
association between enrollment site (i.e.,
hospital) and family history of CVD. Run
the appropriate test at a 5% level of
significance.
Problem 1.
Family Hx    Hosp 1   Hosp 2   Hosp 3
Definite         24       14       22
Probable          8       14        8
No               68       72       70
Total           100      100      100
Problem 1.
H0: Site and family history are
independent
H1: H0 is false
α = 0.05
df = (r-1)(c-1) = (3-1)(3-1) = 4.
Reject H0 if χ² > 9.49
Problem 1.
Expected counts are shown in parentheses.
Family Hx    Hosp 1     Hosp 2     Hosp 3
Definite     24 (20)    14 (20)    22 (20)
Probable      8 (10)    14 (10)     8 (10)
No           68 (70)    72 (70)    70 (70)
Total        100        100        100
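Problem 1.
$$
\chi^2 = \frac{(24-20)^2}{20} + \frac{(14-20)^2}{20} + \frac{(22-20)^2}{20} + \frac{(8-10)^2}{10} + \frac{(14-10)^2}{10} + \frac{(8-10)^2}{10} + \frac{(68-70)^2}{70} + \frac{(72-70)^2}{70} + \frac{(70-70)^2}{70} = 5.31
$$
Do not reject H0 since 5.31 < 9.49. We do not have significant
evidence, α = 0.05, of an association between site and family history.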
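The Problem 1 calculation can be checked with a short sketch. The function `chi_square_independence` is a hypothetical helper written for this example; the observed counts and the critical value 9.49 are taken from the slides.

```python
def chi_square_independence(observed):
    """Chi-square test of independence for a table given as a list of rows.
    Returns the statistic, the degrees of freedom and the expected counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected count = row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df, expected

# Rows: Definite, Probable, No. Columns: Hosp 1, Hosp 2, Hosp 3.
observed = [[24, 14, 22], [8, 14, 8], [68, 72, 70]]
chi2, df, expected = chi_square_independence(observed)
critical_value = 9.49  # chi-square table, df = 4, alpha = 0.05 (from the slide)
reject_h0 = chi2 > critical_value
print(f"chi2 = {chi2:.2f}, df = {df}, reject H0: {reject_h0}")
```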
Problem 2.
The following table summarizes data collected
in the study described in problem 1. The
variable summarized below is body mass
index (BMI) computed as the ratio of weight
in kilograms to height in meters squared.
BMI        N      Mean    Std Dev
Overall    300    24.8    2.5
Hosp 1     100    21.6    2.1
Hosp 2     100    24.8    1.8
Hosp 3     100    27.9    1.3
Problem 2.
Test if there is a significant difference in the mean BMI
scores among hospitals. Show all parts of the test and
use a 5% level of significance. (HINT: MSE = 3.1).
H0: μ1 = μ2 = μ3
H1: means not all equal
α = 0.05
$$SS_b = \sum n_j (\bar{X}_j - \bar{X})^2$$
$$= 100\left((21.6-24.8)^2 + (24.8-24.8)^2 + (27.9-24.8)^2\right)$$
$$= 100(10.24 + 0 + 9.61) = 1985$$
Problem 2.
Source     SS        Df     MS       F
Between    1985      2      992.5    320.2
Error      920.7     297    3.1
Total      2905.7    299
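The ANOVA table can be rebuilt from the summary statistics with a small sketch. It follows the slide's approach of using the overall mean 24.8 and the hinted MSE = 3.1 rather than recomputing the error sum of squares from the group standard deviations; the variable names are illustrative only.

```python
# Summary statistics from the Problem 2 table.
group_n = [100, 100, 100]
group_means = [21.6, 24.8, 27.9]
overall_mean = 24.8
mse = 3.1  # given in the problem's hint

k = len(group_n)
n_total = sum(group_n)
df_between = k - 1        # 2
df_error = n_total - k    # 297

# Between-group sum of squares: sum of n_j * (mean_j - overall mean)^2.
ss_between = sum(n * (m - overall_mean) ** 2 for n, m in zip(group_n, group_means))
ms_between = ss_between / df_between
ss_error = mse * df_error
ss_total = ss_between + ss_error
f_statistic = ms_between / mse

print(f"SSb = {ss_between:.1f}, MSb = {ms_between:.1f}, F = {f_statistic:.1f}")
print(f"SSe = {ss_error:.1f}, SStotal = {ss_total:.1f}")
```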
Problem 3.
Suppose each participant in the study
described in problem 1 is assigned a
cardiovascular risk score (a value between 0 and
100 with higher scores indicative of more
risk of cardiovascular disease). The mean
cardiovascular risk is 21.7 with a standard
deviation of 5.6. Suppose that the
covariance between BMI and cardiovascular
risk is 4.5.
Problem 3.
Compute the sample correlation coefficient between
BMI and cardiovascular risk.
Var(BMI) = s_x² = 2.5²
Var(Risk) = s_y² = 5.6²
$$r = \frac{\text{Cov}(X,Y)}{\sqrt{s_x^2 s_y^2}} = \frac{4.5}{\sqrt{(2.5)^2 (5.6)^2}} \approx 0.3$$
$$Z = r\sqrt{\frac{n - 2}{1 - r^2}}$$
α = 0.05
Reject H0 if Z < -1.96 or if Z > 1.96
$$Z = 0.3\sqrt{\frac{298}{1 - (0.3)^2}} = 5.4$$
Reject H0 since 5.4 > 1.96. We have significant
evidence, α = 0.05, to show that ρ ≠ 0.
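A short sketch of the Problem 3 arithmetic, using only the summary values given in the problem. The unrounded correlation is about 0.32; the slide rounds it to 0.3 before computing the test statistic, and the code does the same so the numbers line up.

```python
import math

cov_xy = 4.5   # covariance between BMI and cardiovascular risk
s_x = 2.5      # standard deviation of BMI
s_y = 5.6      # standard deviation of cardiovascular risk
n = 300

# Sample correlation: covariance divided by the product of the standard deviations.
r = cov_xy / (s_x * s_y)
r_rounded = round(r, 1)  # 0.3, as used on the slide

# Test statistic for H0: rho = 0.
z = r_rounded * math.sqrt((n - 2) / (1 - r_rounded ** 2))
reject_h0 = abs(z) > 1.96

print(f"r = {r:.4f} (rounded {r_rounded}), Z = {z:.2f}, reject H0: {reject_h0}")
```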
Problem 4.
Compute the equation of the line that best describes
the relationship between BMI and cardiovascular risk
(Assume that cardiovascular risk is the dependent
variable).
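$$b_1 = r\frac{s_y}{s_x} = 0.3 \cdot \frac{5.6}{2.5} = 0.67$$
$$b_0 = \bar{Y} - b_1\bar{X} = 21.7 - 0.67(24.8) = 5.08$$
$$\hat{y} = 5.08 + 0.67X$$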
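The slope and intercept can be reproduced from the summary statistics as follows. The rounding mirrors the slide, and the BMI value of 30 passed to `predict_risk` (a hypothetical helper) is an arbitrary input chosen for illustration.

```python
r = 0.3                      # sample correlation from Problem 3
mean_x, s_x = 24.8, 2.5      # BMI (independent variable)
mean_y, s_y = 21.7, 5.6      # cardiovascular risk (dependent variable)

# Slope and intercept of the least squares line, rounded as on the slide.
b1 = round(r * s_y / s_x, 2)
b0 = round(mean_y - b1 * mean_x, 2)

def predict_risk(bmi):
    """Predicted cardiovascular risk for a given BMI."""
    return b0 + b1 * bmi

print(f"y-hat = {b0} + {b1} X")
print(f"predicted risk at BMI 30: {predict_risk(30):.2f}")
```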
Problem 5.
Suppose we restrict our attention to the
subgroup of patients at high risk for
cardiovascular disease (cardiovascular
risk score of 30 or more).
Using the following data, test if BMI is
significantly different in men versus
women. Use a 5% level of significance.
Problem 5.
H0: μ1 = μ2
H1: μ1 ≠ μ2
α = 0.05
$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
BMI        Men     Women
N          20      10
Mean       31.6    28.1
Std Dev    1.7     2.1
df = 20 + 10 - 2 = 28
Reject H0 if t < -2.048 or if t > 2.048
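Problem 5.
$$S_p = \sqrt{\frac{19(1.7)^2 + 9(2.1)^2}{20 + 10 - 2}} = 1.84$$
$$t = \frac{31.6 - 28.1}{1.84\sqrt{\frac{1}{20} + \frac{1}{10}}} = 4.91$$
Reject H0 since 4.91 > 2.048.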
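The pooled two-sample t statistic can be verified with the sketch below. It keeps full precision for Sp, so the result (about 4.92) differs slightly from the slide's 4.91, which uses Sp rounded to 1.84. The critical value 2.048 comes from the slide.

```python
import math

n1, mean1, sd1 = 20, 31.6, 1.7   # men
n2, mean2, sd2 = 10, 28.1, 2.1   # women

df = n1 + n2 - 2
# Pooled standard deviation: weighted average of the two sample variances.
sp = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
t = (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))

critical_value = 2.048  # t table, df = 28, two-sided alpha = 0.05 (from the slide)
reject_h0 = abs(t) > critical_value

print(f"Sp = {sp:.2f}, t = {t:.2f}, df = {df}, reject H0: {reject_h0}")
```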
Problem 6.
How many men and women would be required to
estimate a difference in mean BMI with a 95%
confidence interval and a margin of error not
exceeding 1 unit? (Use data from problem 5 as
needed.)
$$n_i = 2\left(\frac{Z\sigma}{E}\right)^2$$
Use Sp from #5
$$n_i = 2\left(\frac{1.96(1.84)}{1}\right)^2 = 26.01$$
Round up: 27 men and 27 women are required.
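A minimal sketch of the sample size formula, with the pooled standard deviation 1.84 standing in for σ as the slide suggests. The function name `n_per_group` is hypothetical; the result is rounded up because a fractional participant is not possible.

```python
import math

def n_per_group(z, sigma, margin_of_error):
    """Sample size per group for estimating a difference in means:
    n_i = 2 * (z * sigma / E)^2, before rounding."""
    return 2 * (z * sigma / margin_of_error) ** 2

raw_n = n_per_group(z=1.96, sigma=1.84, margin_of_error=1)
required_n = math.ceil(raw_n)  # always round up

print(f"n per group = {raw_n:.2f}, rounded up to {required_n}")
```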
Problem 7.
The following table was constructed based on a
comparison of various sociodemographic
characteristics between men and women enrolled in
the study of cardiovascular risk factors.
Which, if any, of the characteristics shown
above are significantly different between men
and women? Justify.
Problem 7.
Characteristic       Men (n=160)   Women (n=140)   p
[label missing]      45            47              0.7256
Race                                               0.0354
  % White            32            38
  % Black            41            37
  % Hispanic         25            19
  % Other            [missing]     [missing]
% HS Graduate        78            64              0.0245
[label missing]      47            31              0.0001
% No Insurance       [missing]     [missing]       0.9876
Problem 8.
Problem 9.
Two different scales are used in a particular
laboratory. There is some concern that one
scale gives different readings than the other.
Ten specimens are randomly selected and
weighed on each scale. The data are shown
below.
Problem 9.
Specimen   Scale 1   Scale 2
1          1.2       2.1
2          3.5       3.6
3          1.8       1.9
4          4.0       4.0
5          5.0       4.9
6          1.9       2.0
7          2.7       2.7
8          2.2       2.3
9          2.8       2.9
10         3.5       3.7
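Let diff = Scale 2 - Scale 1 for each specimen.
$$\bar{X}_d = \frac{\sum \text{diff}}{n} = \frac{1.5}{10} = 0.15$$
$$s_d = \sqrt{\frac{\sum \text{diff}^2 - (\sum \text{diff})^2 / n}{n - 1}} = \sqrt{\frac{0.91 - (1.5)^2/10}{9}} = 0.276$$
H0: μd = 0
H1: μd ≠ 0, α = 0.05
$$t = \frac{\bar{X}_d}{s_d / \sqrt{n}}, \quad df = n - 1$$
$$t = \frac{0.15}{0.276 / \sqrt{10}} = 1.72$$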
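The paired analysis can be reproduced from the raw readings. The slides do not show the decision rule for this problem, so the critical value 2.262 used below is an assumption taken from a standard t table (df = 9, two-sided α = 0.05); under that assumption H0 is not rejected.

```python
import math
import statistics

scale_1 = [1.2, 3.5, 1.8, 4.0, 5.0, 1.9, 2.7, 2.2, 2.8, 3.5]
scale_2 = [2.1, 3.6, 1.9, 4.0, 4.9, 2.0, 2.7, 2.3, 2.9, 3.7]

# Paired differences, Scale 2 minus Scale 1.
diffs = [b - a for a, b in zip(scale_1, scale_2)]
n = len(diffs)

mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)        # sample standard deviation (n - 1)
t = mean_diff / (sd_diff / math.sqrt(n))

# Assumed critical value from a standard t table (not shown on the slide).
critical_value = 2.262
reject_h0 = abs(t) > critical_value

print(f"mean diff = {mean_diff:.2f}, sd = {sd_diff:.3f}, t = {t:.2f}, reject H0: {reject_h0}")
```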
Problem 10.
Patients with hypertension are generally
recommended to follow a low salt diet.
Surveys report that approximately 75% of
patients adhere to these diets. In a random
sample of 100 patients with hypertension,
70% report following a low-salt diet. Is
adherence among these patients significantly
lower than reported? Run the test at α = 0.05.
Problem 10.
H0: p = 0.75
H1: p < 0.75
α = 0.05
$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}} = \frac{0.70 - 0.75}{\sqrt{\frac{0.75(1 - 0.75)}{100}}} = -1.15$$
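A quick check of the one-sample test for a proportion. The slides do not show the rejection rule here, so the lower-tailed critical value -1.645 is an assumption taken from a standard normal table for α = 0.05; under that assumption H0 is not rejected.

```python
import math

p0 = 0.75      # adherence reported in surveys (null value)
p_hat = 0.70   # sample proportion
n = 100

# Z statistic using the standard error under H0.
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Assumed lower-tailed critical value for alpha = 0.05 (not shown on the slide).
critical_value = -1.645
reject_h0 = z < critical_value

print(f"Z = {z:.2f}, reject H0: {reject_h0}")
```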
Problem 11.
The following table was presented in a journal and describes
the associations between demographic and clinical risk
factors and systolic blood pressure.
Risk Factors
Intercept
Age
Male Sex
Current Smoker
Number of Hrs Exercise/Week
Problem 11.
a) What type of analysis generated the results summarized
above?
Multiple linear regression analysis because the outcome
(systolic blood pressure) is continuous.
b) Which of the risk factors are significantly associated with
systolic blood pressure?
Problem 11.
c) What is the relative importance of the risk factors?
The most important (statistically significant) risk factor is number of
hours of exercise per week, followed by age and then male sex.
Current smoking status is not statistically significant.
d) How would you interpret the regression coefficient associated with
male sex? With number of hours of exercise per week?
Men's systolic blood pressure is, on average, 4.5 units higher than women's,
holding age, smoking status and number of hours of exercise
constant. Each additional hour of exercise per week is associated
with a reduction of 2.4 units of systolic blood pressure, holding age,
sex and current smoking status constant.
Problem 12.
The following table was presented in a journal and describes
the associations between demographic and clinical risk factors
and hypertension.
Outcome = Hypertension
Risk Factors                   Regression Coefficient   p
Intercept                      3.5                      0.0001
Age                            0.02                     0.0357
Male Sex                       0.27                     0.0264
Current Smoker                 -0.005                   0.7564
Number of Hrs Exercise/Week    -0.36                    0.0111
Problem 12.
a) What type of analysis generated the results summarized above?
Multiple logistic regression analysis because the outcome
(hypertension) is dichotomous.
b) Which of the risk factors are significantly associated with
hypertension?
Age, male sex and number of hours of exercise are statistically
significant at the 5% level (all three have p values < 0.05).
c) What is the relative importance of the risk factors?
The most important (statistically significant) risk factor is number of
hours of exercise per week, followed by male sex and then age.
Current smoking status is not statistically significant.
Problem 12.
d) Compute odds ratios for each of the risk factors.
Outcome = Hypertension
Risk Factors                   Regression Coefficient   Odds Ratio
Age                            0.02                     1.02
Male Sex                       0.27                     1.31
Current Smoker                 -0.005                   0.995
Number of Hrs Exercise/Week    -0.36                    0.70
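Each odds ratio is the exponential of the corresponding regression coefficient, which the following sketch confirms for the four risk factors. The dictionary layout is just one convenient way to hold the coefficients.

```python
import math

# Regression coefficients from the Problem 12 table.
coefficients = {
    "Age": 0.02,
    "Male Sex": 0.27,
    "Current Smoker": -0.005,
    "Number of Hrs Exercise/Week": -0.36,
}

# Odds ratio = exp(coefficient).
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, value in odds_ratios.items():
    print(f"{name}: OR = {value:.3f}")
```

An odds ratio below 1 (exercise) indicates lower odds of hypertension per unit increase; an odds ratio above 1 (age, male sex) indicates higher odds.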
Problem 13.
A study is conducted to assess whether there is a difference in physicians'
opinions regarding the treatment of early stage throat cancer. Specifically,
physicians were asked if they would recommend radiation, surgery or
neither upon initial diagnosis. Based on the data below, is there a
relationship between treatment recommendation and physician's age?
Run the test at a 5% level of significance.
Age      Radiation   Surgery   Neither   Total
<40      35          15        50        100
40-59    29          30        41        100
60-79    40          43        22        105
Total    104         88        113       305
Problem 13.
H0: Age and treatment recommendation are independent
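H1: H0 is false
α = 0.05
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
df = (r-1)(c-1) = (3-1)(3-1) = 4.
Reject H0 if χ² > 9.49
$$
\chi^2 = \frac{(35-34.1)^2}{34.1} + \frac{(15-28.9)^2}{28.9} + \frac{(50-37)^2}{37} + \frac{(29-34.1)^2}{34.1} + \frac{(30-28.9)^2}{28.9} + \frac{(41-37)^2}{37} + \frac{(40-35.8)^2}{35.8} + \frac{(43-30.3)^2}{30.3} + \frac{(22-38.9)^2}{38.9} = 25.67
$$
Reject H0 since 25.67 > 9.49.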
Expected counts are shown in parentheses.
Age      Radiation    Surgery      Neither      Total
<40      35 (34.1)    15 (28.9)    50 (37.0)    100
40-59    29 (34.1)    30 (28.9)    41 (37.0)    100
60-79    40 (35.8)    43 (30.3)    22 (38.9)    105
Total    104          88           113          305
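The expected counts and the test statistic for Problem 13 can be recomputed as a check. The sketch evaluates the statistic twice: once with the expected counts rounded to one decimal, as printed in the table, and once with unrounded expected counts, which gives a slightly smaller value. Either way the statistic is well above the critical value 9.49 from the slide.

```python
# Rows: <40, 40-59, 60-79. Columns: Radiation, Surgery, Neither.
observed = [
    [35, 15, 50],
    [29, 30, 41],
    [40, 43, 22],
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Exact expected counts: row total * column total / grand total.
expected_exact = [[r * c / grand_total for c in col_totals] for r in row_totals]
# Expected counts rounded to one decimal, as printed in the table.
expected_rounded = [[round(e, 1) for e in row] for row in expected_exact]

def chi_square(obs, exp):
    """Sum of (O - E)^2 / E over all cells."""
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(obs, exp)
        for o, e in zip(o_row, e_row)
    )

chi2_rounded = chi_square(observed, expected_rounded)
chi2_exact = chi_square(observed, expected_exact)
reject_h0 = chi2_exact > 9.49  # critical value for df = 4, alpha = 0.05

print(f"chi2 (rounded E) = {chi2_rounded:.2f}, chi2 (exact E) = {chi2_exact:.2f}")
print(f"reject H0: {reject_h0}")
```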
Problem 14.
For each of the following scenarios,
indicate which test would be used. Use
the letters below to indicate the test in
the space provided. Note that the same
test might be used for more than one
scenario.
Problem 14.
a)
b)
c)
d)
e)
f)
g)
h)
i)
j)
k)
Problem 14.
Scenario and Test
1. We want to test if there is a significant association between BMI (kg/m2) and
incident myocardial infarction adjusting for age, sex, systolic blood pressure and
smoking.
Test: j
2. We want to test if a new environmental intervention is effective in reducing
exposure to second-hand smoke. Each participant in the study has levels of exposure
measured before and after the intervention is implemented.
Test: d
3. We wish to test if there is a significant association between GRE scores and first
year GPA in MPH students who matriculated in fall 2011.
Test: h or i
4. We want to determine if there are significant differences in ages of participants
enrolled in a study comparing those with a family history of cardiovascular disease to
those without.
Test: c
5. A study reports that 15% of college freshmen smoke. We want to test if
significantly more BU freshmen smoke.
Test: b
6. We want to test if there is a difference in preterm versus term deliveries among
women of black, Hispanic and white race.
Test: g
7. We want to test if nutritional supplements prolong life (maximize time to death) in
persons over 65 years of age, adjusted for sex and other comorbid conditions.
Test: k
8. A clinical trial is run to assess the safety of a new drug compared to a standard
drug and the outcome is development of skin rash or not.
Test: g or j
9. We want to test if there is a difference in mean time to complete a physical task
when comparing 12, 13, 14 and 15 year olds.
Test: e
10. We want to test whether smoking in pregnancy increases the risk of infection in
newborns.
Test: g or k