Take Home Final

Madeline Lasell
SPH 245
PROBLEM 1:
1. (a) Null Hypothesis:
H0: There is no association between CA-125 level and having ovarian cancer (the
coefficient of CA-125 is equal to zero, β = 0).
Alternative Hypothesis:
HA: There is an association between CA-125 level and having ovarian cancer (the
coefficient of CA-125 is not equal to zero, β ≠ 0).
(b) Per the output tables below, the estimated coefficient for CA125 is b=0.00932. The
standard error is 0.00190, and the p-value is <0.0001*.
(c) At a significance level of 0.05, we reject the null hypothesis because the p-value
(<0.0001) is below 0.05. In other words, we conclude that there is a statistically
significant association between CA-125 level and having ovarian cancer.
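As a rough check on part (c), the test statistic can be recomputed from the rounded estimate and standard error in part (b). This sketch assumes the reported p-value comes from a Wald chi-square test with 1 degree of freedom; the symbols z and \chi^2_W are introduced here only for illustration.

```latex
z = \frac{b}{\mathrm{SE}(b)} = \frac{0.00932}{0.00190} \approx 4.91,
\qquad
\chi^2_{W} = z^2 \approx 24.1
```

A chi-square value of about 24.1 on 1 degree of freedom is far beyond the 0.05 critical value of 3.84, which is consistent with p < 0.0001.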
2. The odds ratio for CA-125 is 1.009 (95% confidence interval: 1.006 to 1.013). So, for
each one unit increase in CA-125, the odds of having ovarian cancer are multiplied by
1.009 (a 0.9% increase in the odds, not in the probability).
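The odds ratio and its interval can be reproduced, up to rounding, by exponentiating the coefficient from part 1(b). The lines below are a sketch that assumes a Wald interval with the normal critical value 1.96.

```latex
\begin{aligned}
\mathrm{OR} &= e^{b} = e^{0.00932} \approx 1.009 \\
95\%\ \mathrm{CI} &= e^{\,b \pm 1.96\,\mathrm{SE}(b)} = e^{\,0.00932 \pm 1.96(0.00190)} \approx (1.006,\ 1.013)
\end{aligned}
```

Because the interval excludes 1, it agrees with the test result in part 1(c).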
3. (a) For a woman with a CA-125 level of 20, the predicted probability of having ovarian
cancer is 0.20183, or 20.183% (95% confidence interval: 0.13674 to 0.28755).
(b) For a woman with a CA-125 level of 600, the predicted probability of having ovarian
cancer is 0.98255, or 98.255% (95% confidence interval: 0.89111 to 0.99743).
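The predicted probabilities in part 3 come from applying the inverse logit to the fitted linear predictor, which is what the ILINK option in the scoring step requests. In the sketch below, b_0 stands for the fitted intercept (not reported in this write-up), b_1 = 0.00932 is the CA-125 coefficient from part 1(b), and x is the CA-125 level. L and U are illustrative names for the confidence limits of b_0 + b_1 x, assuming the limits are computed on the logit scale and then back-transformed.

```latex
\hat{p}(x) = \frac{1}{1 + e^{-(b_0 + b_1 x)}},
\qquad
95\%\ \mathrm{CI} = \left( \frac{1}{1 + e^{-L}},\ \frac{1}{1 + e^{-U}} \right)
```

Back-transforming the limits keeps the interval between 0 and 1, which is why the interval for a CA-125 level of 600 is not symmetric around 0.98255.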
4. Pictured here is the ROC curve for the logistic regression model (the accompanying
classification table uses a probability level of 0.5). The area under the curve (AUC) is
0.8848.
5. While I began with a probability level of 0.5 (see classification table above), I also ran the
model at probability levels of 0.4, 0.3, and 0.6 (see classification tables below). I concluded
that a probability level of 0.3 would be the best probability threshold for a diagnostic test
for ovarian cancer.
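As a sketch, the four thresholds compared above could also be requested in a single PROC LOGISTIC call by giving PPROB= a list, which places all of the classification rows in one table. This assumes the OVARIAN dataset and the DIAGNOSIS and CA125 variables used elsewhere in this report.

```sas
*Sketch: one classification table covering probability levels 0.3, 0.4, 0.5, and 0.6;
PROC LOGISTIC DATA=OVARIAN;
MODEL DIAGNOSIS (EVENT="Cancer") = CA125 / CTABLE PPROB=(0.3 0.4 0.5 0.6);
RUN;
```

Having the rows side by side makes it easier to compare sensitivity, specificity, and percent correct across thresholds.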
For a diagnostic test intended to detect ovarian cancer, I would want high specificity so
that women who truly do not have cancer are correctly ruled out (true negatives) and few
are falsely labeled as having cancer. High sensitivity matters more for a screening test,
which should capture as many potential cases of cancer as possible to determine who is at
the greatest risk. However, it is also important to balance sensitivity and specificity so
that cancer cases are not under-captured while overall accuracy stays high.
At a probability level of 0.3, the test correctly classifies 79.8% of subjects, the
second highest accuracy behind 80.9% (for a probability level of 0.5). Although the 0.5
level had a higher specificity than the 0.3 level (93.8% versus 77.3%), the 0.3 level
identified a higher number of true positives (correct events = 67) and a lower number of
false negatives (incorrect non-events = 14). Hence, I would choose the threshold level of
0.3 for this diagnostic test.
Based on its levels of accuracy (79.8%), specificity (77.3%), and sensitivity (82.7%), this
diagnostic test (with a threshold of 0.3) could certainly be better. I’d like to see a higher
accuracy rate, which would be indicative of higher sensitivity and specificity (greater
amounts of both true positives and true negatives).
There appears to be a significant association between CA-125 and ovarian cancer,
suggesting that it is a valuable biomarker. However, as a biomarker used as part of a
diagnostic test for ovarian cancer, CA-125 may not be the most appropriate indicator of the
disease because of its overall detection accuracy. It may instead be a better biomarker for
use in a sensitive screening test rather than for use in a more specific diagnostic test.
2x2 Confusion Table for Probability Level 0.3

                 Actual Positive (Cancer)     Actual Negative (Benign)
Test Positive    67 (correct events)          22 (incorrect events)
Test Negative    14 (incorrect non-events)    75 (correct non-events)
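The sensitivity, specificity, and accuracy quoted for the 0.3 threshold follow directly from the four counts in the table. In the worked lines below, TP, FN, TN, FP, and N are shorthand introduced here for true positives, false negatives, true negatives, false positives, and the total of 178 subjects.

```latex
\begin{aligned}
\text{Sensitivity} &= \frac{TP}{TP + FN} = \frac{67}{67 + 14} \approx 0.827 \\
\text{Specificity} &= \frac{TN}{TN + FP} = \frac{75}{75 + 22} \approx 0.773 \\
\text{Accuracy} &= \frac{TP + TN}{N} = \frac{67 + 75}{178} \approx 0.798
\end{aligned}
```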
PROBLEM 2:
1. For parts (1a)-(1e), the code used for the data processing and preparation is included in the
Full Code Appendix.
2. (a) For part 2a, see Appendix A for accompanying output tables.
With a 95% confidence interval of 0.841 to 1.043, the odds of developing hypertension
are multiplied by 0.937 for each one unit increase in exercise level (6.3% decrease in
odds); (OR = 0.937, p=0.2343).
The odds of developing hypertension are multiplied by 1.039 for each one unit increase
in age (3.9% increase in odds); (OR = 1.039, p<0.0001*).
The odds of developing hypertension for a female are 0.588 times those of a male (41%
decrease in odds); (OR = 0.588, p=0.0334*).
The odds of developing hypertension for a smoker are 0.713 times those of a nonsmoker
(29% decrease in odds); (OR = 0.713, p=0.2316).
The odds of developing hypertension are multiplied by 1.025 for each one unit increase
in waist circumference (2.5% increase in odds); (OR = 1.025, p=0.0024*).
The odds of developing hypertension are multiplied by 0.921 for each one unit increase
in number of alcoholic drinks (8% decrease in odds); (OR = 0.921, p=0.2640).
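The percentages quoted with each odds ratio are percent changes in the odds, not in the probability. A short worked sketch of the conversion, using three of the odds ratios reported in this part:

```latex
\begin{aligned}
\%\,\Delta\,\text{odds} &= (\mathrm{OR} - 1) \times 100 \\
\text{Exercise:}\quad (0.937 - 1) \times 100 &= -6.3\% \\
\text{Age:}\quad (1.039 - 1) \times 100 &= +3.9\% \\
\text{Female vs. male:}\quad (0.588 - 1) \times 100 &= -41.2\%
\end{aligned}
```

The corresponding change in probability depends on the baseline probability, so it cannot be read off the odds ratio alone.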
With p-values <0.05, the predictors age, gender, and waist circumference are significantly
related to hypertension. Surprisingly, the estimates suggest that smokers have lower odds
of hypertension and that each additional alcoholic drink is associated with lower odds of
hypertension, although neither association is statistically significant (p=0.2316 and
p=0.2640). Also, because waist circumference can be influenced by physical activity
levels, it is surprising that exercise is not significantly related to hypertension while
waist circumference is.
3. (a) To identify the best model for predicting the probability of hypertension based on
exercise, age, gender, smoke status, waist circumference, and drinks per day, I used the
backward selection procedure. In the Type 3 Analysis of Effects table (below), this procedure
indicated that the predictors age and gender are significant (p=0.0032, p=0.0086). Therefore,
I’d include both age and gender in my final best model.
(b) After fitting the final model with age and gender, I generated the ROC curve for the
model, which gives an AUC of 0.6997. An AUC of 0.5 corresponds to classification no better
than chance, and an AUC of 1.0 corresponds to perfect classification. Because 0.6997 is
closer to 0.5 than to 1.0, I would not say that this model predicts the presence of
hypertension very well. However, the AUC does indicate that the model predicts
hypertension better than random chance alone.
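One general way to read the AUC, offered here as an interpretation rather than something taken from the SAS output: it is the probability that a randomly chosen hypertensive subject receives a higher predicted probability than a randomly chosen normal subject, with ties counted as one half. The subscripts H and N are labels introduced for this sketch.

```latex
\mathrm{AUC} = P(\hat{p}_{H} > \hat{p}_{N}) + \tfrac{1}{2}\, P(\hat{p}_{H} = \hat{p}_{N})
```

With an AUC of 0.6997, the model would rank such a pair correctly about 70% of the time.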
4. (a) I used the backwards selection procedure in part 3 to determine the best model by
eliminating predictors one at a time. Exercise was eliminated first, followed by drinks, waist
circumference, and smoke status. I discovered that the most significant predictors were age
and gender. After re-running the model including only age and gender, I obtained the
following: With a 95% confidence interval of 1.019-1.055, the odds ratio for age is 1.037
(p<0.0001). With a 95% confidence interval of 0.262-0.926, the odds ratio for gender is 0.511
(p=0.0286).
(b) Controlling for gender, the odds of developing hypertension are multiplied by 1.037
for each one unit increase in age (3.7% increase in odds); (OR = 1.037, p<0.0001*).
Controlling for age, the odds of developing hypertension for a female are 0.511 times
those of a male (49% decrease in odds); (OR = 0.511, p=0.0286*).
Yes, the significant predictors slightly varied between multiple logistic regression and
the individual predictor models. The individual predictor models (Part 2) showed that
significant predictors included age, gender and waist circumference (p<0.0001,
p=0.0334, p=0.0024). In contrast, the multiple logistic regression (Part 3) showed
significant predictors to solely be age and gender (p<0.0001, p=0.0286).
From univariate to multiple logistic regression, the increase in the odds of developing
hypertension per one unit increase in age changed from 3.9% to 3.7% (a difference of 0.2
percentage points). Similarly, a female’s odds of developing hypertension relative to a
male’s changed from about 41% lower (OR = 0.588) to about 49% lower (OR = 0.511), a
difference of about 8 percentage points.
Lastly, the association between waist circumference and hypertension changed from
significant to non-significant. This could be attributed to the different p-value cut-offs
from model to model (i.e. 0.05 cut-off for univariate regression, 0.1 cut-off for staying
in the backward selection model) and, potentially, collinearity among the predictors in
the multiple logistic regression model.
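A minimal sketch of one way to look into the possible collinearity, assuming the NHANES5 dataset and the variable names from the Full Code Appendix. It only inspects pairwise correlations among the continuous predictors and was not part of the original analysis.

```sas
*Sketch: pairwise correlations among the continuous predictors;
PROC CORR DATA=NHANES5;
VAR EXERCISE AGE WAIST DRINKS;
RUN;
```

A sizeable correlation between WAIST and AGE, for example, would lend some support to the collinearity explanation.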
FULL CODE APPENDIX:

*SPH 245 Take-Home Final;
*Problem 1;
FILENAME REFFILE '/folders/myshortcuts/SAS_Scripts/Ovarian.xlsx';
PROC IMPORT DATAFILE="/folders/myshortcuts/SAS_Scripts/Ovarian.xlsx"
REPLACE
DBMS=XLSX
OUT=OVARIAN;
GETNAMES=YES;
RUN;
PROC CONTENTS DATA=OVARIAN; RUN;
*Number 1;
PROC LOGISTIC DATA=OVARIAN DESCENDING;
MODEL DIAGNOSIS (EVENT="Cancer") = CA125;
STORE OUT = OVARIANGLM;
RUN;
*Number 3;
DATA OVARIANPREDICT;
INPUT CA125;
DATALINES;
20
600
;
PROC PLM SOURCE=OVARIANGLM;
SCORE DATA = OVARIANPREDICT OUT= OVARIANPREDOUT PREDICTED LCLM= LOWER UCLM=UPPER/
ILINK;
PROC PRINT DATA=OVARIANPREDOUT;
RUN;
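*Number 4;
*ROC curve and classification table at a probability level of 0.5;
PROC LOGISTIC DATA=OVARIAN PLOTS=ROC;
CLASS DIAGNOSIS (REF="Benign")/ PARAM=REF;
MODEL DIAGNOSIS (EVENT="Cancer") = CA125 / LINK = LOGIT CTABLE PPROB=0.5;
RUN;
*Number 5: same model rerun at probability levels of 0.4, 0.3, and 0.6;
PROC LOGISTIC DATA=OVARIAN PLOTS=ROC;
CLASS DIAGNOSIS (REF="Benign")/ PARAM=REF;
MODEL DIAGNOSIS (EVENT="Cancer") = CA125 / LINK = LOGIT CTABLE PPROB=0.4;
RUN;
PROC LOGISTIC DATA=OVARIAN PLOTS=ROC;
CLASS DIAGNOSIS (REF="Benign")/ PARAM=REF;
MODEL DIAGNOSIS (EVENT="Cancer") = CA125 / LINK = LOGIT CTABLE PPROB=0.3;
RUN;
PROC LOGISTIC DATA=OVARIAN PLOTS=ROC;
CLASS DIAGNOSIS (REF="Benign")/ PARAM=REF;
MODEL DIAGNOSIS (EVENT="Cancer") = CA125 / LINK = LOGIT CTABLE PPROB=0.6;
RUN;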
*Problem 2;
*The NA values were excluded from the NHANES dataset below;
FILENAME REFFILE '/folders/myshortcuts/SAS_Scripts/NHANES.csv';
PROC IMPORT DATAFILE="/folders/myshortcuts/SAS_Scripts/NHANES.csv"

REPLACE
DBMS=CSV
OUT=NHANES;
GETNAMES=YES;
RUN;
PROC CONTENTS DATA=NHANES; RUN;
DATA NHANES1;
SET NHANES;
*Recode out-of-range response codes to numeric missing (a period without quotes);
IF PAD630 GE 7777 THEN PAD630 = .;
IF PAD645 GE 7777 THEN PAD645 = .;
IF PAD660 GE 7777 THEN PAD660 = .;
IF PAD675 GE 7777 THEN PAD675 = .;
IF ALQ130 GE 26 THEN ALQ130 = .;
IF SMQ040 GE 7 THEN SMQ040 = .;
RUN;
DATA NHANES2;
SET NHANES1;
*1a;
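LENGTH BLOODPRESSURE $12; *long enough for HYPERTENSIVE, which would otherwise be cut to 6 characters;
BLOODPRESSURE ="NORMAL";
IF BPXSY1 GE 120 AND BPXSY2 GE 120 AND BPXSY3 GE 120 THEN BLOODPRESSURE = "HYPERTENSIVE";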
*1b;
EXERCISE = SUM(PAD630, PAD645, PAD660, PAD675)/60;
RUN;
*1c;
DATA NHANES2A;
SET NHANES2;
IF SMQ040 = 3 THEN SMK = "NONSMOKER";
IF SMQ040 = 2 THEN SMK = "CURRENT";
RUN;
PROC PRINT;
VAR SMK;
RUN;
*1d;
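PROC MEANS DATA=NHANES2;
VAR BMXWAIST RIDAGEYR;
RUN;
*Center waist circumference and age at the sample means reported by PROC MEANS;
DATA NHANES3;
SET NHANES2;
WAIST=BMXWAIST-98.7905923;
AGE=RIDAGEYR-48.1206897;
RUN;
*1e;
*Read NHANES3 (not NHANES2) so that WAIST and AGE are kept;
DATA NHANES3;
SET NHANES3;
DRINKS=ALQ130;
RUN;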
DATA NHANES4;
SET NHANES3;
LENGTH GENDER $6; *long enough for FEMALE, which would otherwise be cut to 4 characters;
if RIAGENDR = 1 then GENDER = "MALE";
if RIAGENDR = 2 then GENDER = "FEMALE";
RUN;
PROC PRINT;
VAR RIAGENDR;
RUN;
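DATA NHANES5;
*Set lengths before SET so that longer values are not truncated;
LENGTH BLOODPRESSURE $12 GENDER $6 SMK $9;
SET NHANES4;
BLOODPRESSURE ="NORMAL";
IF BPXSY1 GE 120 AND BPXSY2 GE 120 AND BPXSY3 GE 120 THEN BLOODPRESSURE = "HYPERTENSIVE";
IF SMQ040 = 3 THEN SMK = "NONSMOKER";
*Same coding as in 1c, needed so that SMK has two levels;
IF SMQ040 = 2 THEN SMK = "CURRENT";
EXERCISE = SUM(PAD630, PAD645, PAD660, PAD675)/60;
WAIST=BMXWAIST-98.7905923;
AGE=RIDAGEYR-48.1206897;
DRINKS=ALQ130;
if RIAGENDR = 1 then GENDER = "MALE";
if RIAGENDR = 2 then GENDER = "FEMALE";
RUN;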
*Number 2;
*2a;
*Exercise;
PROC LOGISTIC DATA = NHANES2 DESCENDING;
CLASS BLOODPRESSURE (ref = "NORMAL") / PARAM = REF;
MODEL BLOODPRESSURE = EXERCISE / LINK = GLOGIT;
RUN;
*Age;
PROC LOGISTIC DATA = NHANES3 DESCENDING;
CLASS BLOODPRESSURE (ref = "NORMAL") / PARAM = REF;
MODEL BLOODPRESSURE = AGE / LINK = GLOGIT;
RUN;
*Gender;
PROC LOGISTIC DATA = NHANES4 DESCENDING;
CLASS BLOODPRESSURE (ref = "NORMAL") GENDER (ref = "MALE")/ PARAM = REF;
MODEL BLOODPRESSURE = GENDER / LINK = GLOGIT;
RUN;
*Smoking status;
PROC LOGISTIC DATA = NHANES2A DESCENDING;
CLASS BLOODPRESSURE (ref = "NORMAL") SMK (ref = "NONSMOKER")/ PARAM = REF;
MODEL BLOODPRESSURE = SMK / LINK = GLOGIT;
RUN;
*Waist circumference;
PROC LOGISTIC DATA = NHANES3 DESCENDING;
CLASS BLOODPRESSURE (ref = "NORMAL")/ PARAM = REF;
MODEL BLOODPRESSURE = WAIST / LINK = GLOGIT;
RUN;
*Number of alcoholic drinks;
PROC LOGISTIC DATA = NHANES3 DESCENDING;
CLASS BLOODPRESSURE (ref = "NORMAL")/ PARAM = REF;
MODEL BLOODPRESSURE = DRINKS / LINK = GLOGIT;
RUN;
*3a;
PROC LOGISTIC DATA = NHANES5;
CLASS BLOODPRESSURE (ref = "NORMAL") GENDER (ref = "MALE") SMK (REF = "NONSMOKER")/
PARAM = REF;
MODEL BLOODPRESSURE = EXERCISE|AGE|GENDER|SMK|WAIST|DRINKS / SELECTION=BACKWARD
SLSTAY=0.1 LINK = GLOGIT;
RUN;
PROC LOGISTIC DATA = NHANES5;

CLASS BLOODPRESSURE (ref = "NORMAL") GENDER (ref = "MALE") SMK (ref = "NONSMOKER") /
PARAM = REF;
MODEL BLOODPRESSURE = EXERCISE AGE GENDER SMK WAIST DRINKS / SELECTION=BACKWARD
SLSTAY=0.1 LINK = GLOGIT;
RUN;
*3b;
PROC LOGISTIC DATA = NHANES5 DESCENDING PLOTS = ROC;
CLASS BLOODPRESSURE (ref = "NORMAL") GENDER (ref = "MALE");
MODEL BLOODPRESSURE = AGE GENDER;
RUN;
APPENDIX A:
Exercise Level
Age
Gender
Smoking Status
Waist Circumference
Number of Alcoholic Drinks
