Austin Kinion

STA 138
December 4, 2014
Final Project
Introduction: The data analyzed in this report record whether a patient was diagnosed with
depression in any visit during one year of care. A patient can be diagnosed with depression in
many ways, so many variables that may greatly affect the results are not accounted for in this
model. For the sake of this project, however, I will try to predict whether a patient will be
diagnosed with depression using stepwise logistic regression. The variables in the data are as
follows:
Variable   Description
DAV        Diagnosis of depression in any visit during one year of care (0 = Not diagnosed, 1 = Diagnosed)
PCS        Physical component of SF-36 measuring health status of the patient
MCS        Mental component of SF-36 measuring health status of the patient
BECK       The Beck depression score of the patient
PGEND      The gender of the patient
AGE        The patient's age in years
EDUCAT     The number of years of formal schooling

The response variable is DAV. The explanatory variables are PCS, MCS, BECK, PGEND (gender),
AGE (age in years), and EDUCAT (years of formal schooling).
Materials and Methods: For this project, I test whether we can predict whether a patient will
be diagnosed with depression based on the variables above, and I pick the best model using
stepwise logistic regression. 400 patients were randomly selected from primary care facilities,
and the 7 variables above were recorded for each patient.
SAS Code and Results:
To read in the data and format:

The very first thing I did was check that the model with main effects did not suffer from
multicollinearity. I did this with the following code:

And partial output:


From the output, it is clear that there is no multicollinearity in the model, since the variance
inflation factor for each of the variables is much lower than 10.

Next, I want to create the logistic model. I will first show the model with only the main effects
(no interactions) that were chosen by the stepwise logistic regression in SAS, though this is not
the model that I will use. The code used to obtain the best model through forward stepwise
regression was:

And the output for the best model was:


So the best model picked by SAS, with no interactions, is:
$$\log\left(\frac{\pi}{1-\pi}\right) = -2.3093 - 0.047\,\mathrm{MCS} + 0.0721\,\mathrm{BECK} - 0.6633\,\mathrm{PGEND} + 0.1785\,\mathrm{EDUCAT}$$
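To make the fitted equation concrete, here is a minimal Python sketch (Python is used only for illustration; the analysis itself was done in SAS) that turns the main-effects coefficients into a predicted probability with the inverse logit. The function name and the example patient values are hypothetical and are not taken from the data.

```python
import math

# Coefficients of the main-effects model reported in the text.
COEF = {
    "intercept": -2.3093,
    "MCS": -0.047,
    "BECK": 0.0721,
    "PGEND": -0.6633,
    "EDUCAT": 0.1785,
}


def predicted_probability(mcs, beck, pgend, educat):
    """Return the fitted probability of a depression diagnosis (inverse logit)."""
    logit = (
        COEF["intercept"]
        + COEF["MCS"] * mcs
        + COEF["BECK"] * beck
        + COEF["PGEND"] * pgend
        + COEF["EDUCAT"] * educat
    )
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical patient: these values are invented for illustration only.
p_example = predicted_probability(mcs=40, beck=15, pgend=1, educat=12)
print(f"predicted probability of diagnosis: {p_example:.3f}")
```

Raising BECK while holding the other inputs fixed raises the predicted probability, which is what the positive BECK coefficient implies.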

I chose to use a model similar to the one above, with two extra interaction terms. I chose
these interaction terms for two reasons. First, I believe the interaction between PGEND and
EDUCAT can help with the prediction of depression because the effect of education may differ
depending on the gender of the patient. Second, I believe the interaction between PGEND and
BECK may help with the prediction of depression because the effect of the Beck depression
score may differ depending on gender as well. So the SAS code for the model described above
is:

And the partial output from this, used to obtain the model, is:
So the model is:
$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 y$$

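$$\log\left(\frac{\pi}{1-\pi}\right) = -2.7921 - 0.0487\,\mathrm{MCS} + 0.0813\,\mathrm{BECK} - 1.3659\,\mathrm{PGEND} + 0.2151\,\mathrm{EDUCAT} - 0.1296\,\mathrm{EDUCAT}\times\mathrm{PGEND} - 0.0419\,\mathrm{BECK}\times\mathrm{PGEND}$$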

An explanation of the variables used in my final model is as follows. The intercept $\beta_0$ is
-2.7921, which is the log-odds of a depression diagnosis when all of the explanatory variables
are equal to zero. The slope estimate for MCS is -0.0487, which means that when MCS increases
by one unit, with the other variables held fixed, the odds of the patient being diagnosed with
depression are multiplied by e^(-0.0487) ≈ 0.952. For EDUCAT, each extra year of formal
schooling multiplies the odds of the patient being diagnosed with depression by
e^(0.2151) ≈ 1.24. For BECK, when a patient's Beck depression score increases by one unit, the
odds of that patient being diagnosed with depression are multiplied by e^(0.0813) ≈ 1.085. For
PGEND, the odds of being diagnosed with depression as a male are e^(1.3659) ≈ 3.92 times the
odds of being diagnosed with depression as a female. Because the model includes interactions
with PGEND, the EDUCAT and BECK odds ratios apply to patients with PGEND = 0, and the PGEND
odds ratio applies when EDUCAT and BECK are both zero.
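The odds ratios quoted above are just the exponentiated coefficients. A small Python sketch (for illustration only; the exponents are copied from the text as it states them, and the dictionary names are my own) reproduces the arithmetic:

```python
import math

# Exponents exactly as used in the interpretation above.
exponents = {"MCS": -0.0487, "EDUCAT": 0.2151, "BECK": 0.0813, "PGEND": 1.3659}

# Odds ratio for a one-unit increase is exp(coefficient).
odds_ratios = {name: math.exp(value) for name, value in exponents.items()}

for name, ratio in odds_ratios.items():
    percent_change = 100.0 * (ratio - 1.0)
    print(f"{name}: odds ratio = {ratio:.3f} ({percent_change:+.1f}% change in odds)")
```

An odds ratio below 1 (MCS) means the odds fall as the variable rises, while a ratio above 1 means the odds rise.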

From the output, we can see the 95% Wald confidence intervals reported by SAS for the variables
described above. For MCS, we can be 95% confident that when MCS increases by one unit, the
odds of that person being diagnosed with depression decrease by between 2.6% and 6.3%. For
EDUCAT, we can be 95% confident that with one additional year of formal education, the odds
of that person being diagnosed with depression increase by between 5.9% and 34.9%. For BECK,
we can be 95% confident that when the Beck depression score increases by one unit, the odds of
that person being diagnosed with depression increase by between 1.0% and 14.3%. For PGEND,
the 95% confidence interval for the odds ratio contains 1, so its effect is not statistically
significant.
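As a rough check on how a Wald interval for an odds ratio is built, the Python sketch below recovers a standard error from a coefficient and its Wald chi-square statistic and then exponentiates the interval endpoints. It rests on two assumptions that the report does not state explicitly: that the EDUCAT and BECK intervals come from the main-effects model (coefficients 0.1785 and 0.0721), and that the Wald chi-square statistics 8.36009 and 5.2214 quoted later in this report belong to EDUCAT and BECK respectively. The function name is hypothetical.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def wald_ci_odds_ratio(coef, wald_chi_square):
    """95% Wald interval for exp(coef), using se = |coef| / sqrt(chi-square)."""
    se = abs(coef) / math.sqrt(wald_chi_square)
    return math.exp(coef - Z_975 * se), math.exp(coef + Z_975 * se)


# Assumed pairing of coefficients and Wald statistics (see the note above).
educat_ci = wald_ci_odds_ratio(0.1785, 8.36009)
beck_ci = wald_ci_odds_ratio(0.0721, 5.2214)

for name, (low, high) in {"EDUCAT": educat_ci, "BECK": beck_ci}.items():
    print(f"{name}: odds ratio between {low:.3f} and {high:.3f}")
```

Under these assumptions the endpoints correspond to increases of about 5.9% to 34.9% for EDUCAT and about 1.0% to 14.3% for BECK.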
-Residual Analysis:
It is clear from the chart that there are many outliers with Pearson and deviance residuals
greater than 2.0 in absolute value, so they are influencing the coefficients and the goodness
of fit. After looking at the data output (which is not displayed because it is too big), I can
see that the following observations have Pearson and deviance residuals greater than 2.0 in
absolute value: observations 22, 115, 173, 194, 255, 260, 286, 316, 323, 325, 333, 353, and 368.
These observation numbers correspond to the output from SAS, in which observation 1 is the
header rather than a patient. So if I adjust the observation numbers to match the data exactly,
they are the numbers listed above minus 1: 21, 114, 172, 193, 254, 259, 287, 315, 322, 324,
332, 352, and 367. Out of these adjusted observations, the 5 with the highest Pearson and
deviance residuals are (with Pearson residual, deviance residual): 193 (5.47, 2.62),
259 (3.54, 2.28), 315 (4.20, 2.42), 324 (4.31, 2.44), and 352 (5.75, 2.65).
-Influential Observations
Looking at the hat matrix diagonal column from the SAS output (as we were told to do in
class), it is clear after careful inspection that there is really only one influential
observation, and it is not even among the observations with large residuals. It is observation
378, with a hat matrix diagonal equal to .0892, which is much higher than any of the others
(the next highest is .02). A hat matrix diagonal this high means that this observation is
affecting the parameter estimates.

-Goodness of Fit
The percent concordant is 76.5 and the percent discordant is 23.1. This is relatively good,
with Somers' D, Gamma, and c all relatively high (.535, .537, and .767, respectively). Somers' D
is the difference between the proportion of concordant pairs and the proportion of discordant
pairs, so the model orders about 53.5 percentage points more pairs correctly than incorrectly.
This isn't an excellent number, but it still implies that there is some association, which
means our model is doing an okay job at predicting. We could do a similar analysis for Gamma
and say that, since it is positive and relatively large (.537), there is some association.
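The three association measures follow directly from the concordant and discordant percentages. The Python sketch below (illustrative only) applies the usual definitions; the percent tied is not quoted in the text, so it is assumed here to be the remaining 0.4%.

```python
pct_concordant = 76.5
pct_discordant = 23.1
# Assumption: whatever is neither concordant nor discordant is tied.
pct_tied = 100.0 - pct_concordant - pct_discordant

# Somers' D: (concordant - discordant) / all pairs.
somers_d = (pct_concordant - pct_discordant) / 100.0
# Gamma: same numerator, but tied pairs are dropped from the denominator.
gamma = (pct_concordant - pct_discordant) / (pct_concordant + pct_discordant)
# c statistic: concordant pairs plus half of the tied pairs.
c_statistic = (pct_concordant + 0.5 * pct_tied) / 100.0

print(f"Somers' D = {somers_d:.3f}, Gamma = {gamma:.3f}, c = {c_statistic:.3f}")
```

The results agree with the SAS values to within the rounding of the published percentages (.535, .537, and .767).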
Because it has the lowest AIC (305.201), I chose to work with the best model chosen by SAS
with stepwise regression over the model with only the intercept (AIC: 353.736) and over the
best model with the interaction terms (AIC: 308.519). With the Hosmer and Lemeshow
goodness-of-fit test in SAS, we can see that the test statistic is $\chi^2 = 7.4172$ with 8
degrees of freedom and p-value = 0.4924. Since the p-value is so large, we fail to reject
$H_0$: the model fits the data well, and conclude that the model is an adequate fit for the
data.
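Both steps in this paragraph can be checked by hand. The Python sketch below (illustrative, with hypothetical names) picks the model with the smallest AIC and recomputes the Hosmer and Lemeshow p-value using the closed-form upper tail of a chi-square distribution, which is available when the degrees of freedom are even.

```python
import math


def chi_square_upper_tail_even_df(x, df):
    """P(X > x) for a chi-square variable with a positive, even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total


# AIC values quoted in the text; the labels are my own shorthand.
aic = {
    "intercept only": 353.736,
    "stepwise main effects": 305.201,
    "with interactions": 308.519,
}
best_model = min(aic, key=aic.get)

hl_p_value = chi_square_upper_tail_even_df(7.4172, 8)
print(f"lowest AIC: {best_model}; Hosmer-Lemeshow p-value = {hl_p_value:.4f}")
```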

I have repeated the analysis of maximum likelihood estimates table from above for the best-fit
model so we can test the $\beta$s to see if they are statistically significant.

To start, let's take a look at $\beta_0$. We test $H_0: \beta_0 = 0$ against
$H_a: \beta_0 \neq 0$. Since our Wald chi-square statistic is 3.897, with p-value = 0.0484
(p-value < .05), we conclude that $\beta_0$ is statistically significant
($H_a: \beta_0 \neq 0$).

Let's take a look at $\beta_1$. We test $H_0: \beta_1 = 0$ against $H_a: \beta_1 \neq 0$.
Since our Wald chi-square statistic is 9.773, with p-value = 0.0018 (p-value < .05), we
conclude that $\beta_1$ is statistically significant ($H_a: \beta_1 \neq 0$).

Let's take a look at $\beta_2$. We test $H_0: \beta_2 = 0$ against $H_a: \beta_2 \neq 0$.
Since our Wald chi-square statistic is 5.2214, with p-value = 0.0223 (p-value < .05), we
conclude that $\beta_2$ is statistically significant ($H_a: \beta_2 \neq 0$).

Let's take a look at $\beta_3$. We test $H_0: \beta_3 = 0$ against $H_a: \beta_3 \neq 0$.
Since our Wald chi-square statistic is 3.8280, with p-value = 0.0504 (p-value > .05), we
conclude that $\beta_3$ is NOT statistically significant ($H_0: \beta_3 = 0$). Failing this
test does not mean the variable should be dropped from the model, though. It is right on the
edge of being statistically significant and it can play a part in predicting one's depression
diagnosis, so I believe it should stay in the model.

Lastly, let's take a look at $\beta_4$. We test $H_0: \beta_4 = 0$ against
$H_a: \beta_4 \neq 0$. Since our Wald chi-square statistic is 8.36009, with p-value = 0.0038
(p-value < .05), we conclude that $\beta_4$ is statistically significant
($H_a: \beta_4 \neq 0$).
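Each Wald test above compares the statistic with a chi-square distribution on 1 degree of freedom. A short Python sketch (for illustration; the labels beta0 to beta4 simply follow the order of the tests in the text) recomputes the p-values from the reported statistics:

```python
import math


def wald_p_value(chi_square):
    """Upper-tail probability of a chi-square(1) variable: P(|Z| > sqrt(x))."""
    return math.erfc(math.sqrt(chi_square / 2.0))


# (Wald chi-square, p-value reported by SAS) for each coefficient.
reported = {
    "beta0": (3.897, 0.0484),
    "beta1": (9.773, 0.0018),
    "beta2": (5.2214, 0.0223),
    "beta3": (3.8280, 0.0504),
    "beta4": (8.36009, 0.0038),
}

p_values = {name: wald_p_value(stat) for name, (stat, _) in reported.items()}
significant = {name: p < 0.05 for name, p in p_values.items()}

for name, p in p_values.items():
    verdict = "significant" if significant[name] else "not significant"
    print(f"{name}: p = {p:.4f} ({verdict} at the .05 level)")
```

Only the fourth test (beta3) lands above the .05 cutoff, and only barely, which is why that variable is a borderline case.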
-Goodness-of-link function
This was done with the code:

Output:

From the output, we can see that the estimated coefficient of the new variable (linkf) is
1.4111, with a Wald chi-square statistic of .4469 and p-value = .5038. Since p-value > .05, we
can conclude that this variable is statistically insignificant, so the link function is
appropriate.
Conclusion and Discussion: A logistic regression model was fit to the data using stepwise
logistic regression. I found that we can be 95% confident that when MCS increases by one unit,
the odds of that person being diagnosed with depression decrease by between 2.6% and 6.3%.
For EDUCAT, we can be 95% confident that with one additional year of formal education, the
odds of that person being diagnosed with depression increase by between 5.9% and 34.9%. For
BECK, we can be 95% confident that when the Beck depression score increases by one unit, the
odds of that person being diagnosed with depression increase by between 1.0% and 14.3%. For
PGEND, the 95% confidence interval for the odds ratio contains 1, so its effect is not
statistically significant.
I also did a residual analysis on the data and searched for influential observations. I found
one influential observation that was not among the observations with large residuals, which
was observation 378. Even though there were several large residuals and an influential
observation, the model was still found to be an adequate fit for the data, as determined by the
Hosmer and Lemeshow goodness-of-fit test.
As a result of the statements above, I was able to conclude that the variables BECK, EDUCAT,
MCS, and PGEND are all associated with the depression diagnosis of a patient. Based on the
result, we cannot reject the null hypothesis that the model fits the data, yet I would be more
comfortable with a model that provides more support of fit. Therefore, I recommend researching
additional covariates in order to make more reliable predictions, such as how many people the
patient talks to on a daily basis, whether the patient has a hobby, and so on.