
Organics Decision Tree Analysis

In making a decision tree for the organics data set, we attempted to distinguish between different traits that could identify whether or not a customer would buy organic foods at the grocery store. A decision tree segments data through a series of splits on different variables, dividing records into groups according to the expected outcome. This allows us to see which variables combine to produce different outcomes, such as whether a customer will or will not buy organic foods. A single variable often cannot split the data accurately on its own; decision trees let us combine multiple variables to create more distinct subgroups that yield more accurate classifications.
Our response variable was whether or not organics were purchased, and we used loyalty card class, gender, affluence grade, and age as predictor variables. Our decision tree was four levels deep, and each split produced two branches, giving a binary result at every node. The leaf size was set to 1,000, meaning each terminal node had to contain at least 1,000 data points.
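
The tree itself was built in SAS Enterprise Miner. As a minimal sketch of the same configuration in Python with scikit-learn (the file name and column names below are hypothetical stand-ins for the actual data set labels):

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical file and column names; drop missing values for simplicity.
    organics = pd.read_csv("organics.csv").dropna()
    predictors = ["LoyaltyClass", "Gender", "AffluenceGrade", "Age"]
    X = pd.get_dummies(organics[predictors])    # one-hot encode categorical inputs
    y = organics["OrganicsPurchased"]           # assumed 0/1 target column

    # Mirror the SAS settings: four levels of binary splits and at
    # least 1,000 observations in every terminal leaf.
    tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=1000)
    tree.fit(X, y)
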
Once our model was created, we used a misclassification chart to analyze the results and found that the model was good at identifying true negatives and avoiding false positives. However, it was poor at avoiding false negatives, misclassifying 207,252 actual purchasers as non-purchasers. By comparison, the model identified only 154,812 true positives, less than half of the overall positives. From the leaf statistics in SAS, 76.4% of the customers in node 3 were organic purchasers, which indicates that node would be helpful in identifying organics purchases. From the ROC chart, the KS statistic for maximum separation was .35. This is fairly good, since greater separation in the ROC chart indicates a stronger model.
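
SAS produces the misclassification chart and KS statistic directly; the same quantities can be recomputed from the sketch above, where the KS statistic is the maximum vertical gap between the ROC curve and the chance diagonal:

    from sklearn.metrics import confusion_matrix, roc_curve

    pred = tree.predict(X)
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    print(f"true neg={tn}  false pos={fp}  false neg={fn}  true pos={tp}")

    # KS statistic: max(TPR - FPR) across all probability thresholds.
    fpr, tpr, _ = roc_curve(y, tree.predict_proba(X)[:, 1])
    print("KS =", (tpr - fpr).max())
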
This information could be used to market organic products to consumers who match the traits found to indicate a propensity for purchasing organics (for example, an affluence grade over 10.2, an age under 42, and a silver loyalty card status). By using the decision tree in this way, a company can reduce unnecessary advertising costs by specifically targeting the customers most likely to purchase organic products.
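
As a brief illustration, this high-propensity profile translates into a simple filter for building a target list (the cutoffs come from the tree above; the column names remain hypothetical):

    # Customers matching the traits the tree associated with organic purchases.
    likely_buyers = organics[
        (organics["AffluenceGrade"] > 10.2)
        & (organics["Age"] < 42)
        & (organics["LoyaltyClass"] == "Silver")
    ]
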

Coronary Heart Disease Decision Tree Analysis


Using the Hearts data set, we attempted to produce a decision tree that would distinguish traits indicative of heart disease from traits that were not.
To create our decision tree, we used cause of death as the response variable and started with cholesterol, diastolic blood pressure, weight, systolic blood pressure, and smoking as our predictor variables. However, we quickly realized that we could improve the lift chart and segment the data more effectively. After assessing multiple combinations of predictors, we found that the best combination included systolic blood pressure, cholesterol, height, the age at which coronary heart disease was diagnosed, metropolitan relative weight, and weight. This combination produced the best lift chart, with our model tracking very close to the best possible model. Moreover, the KS statistic of .69 shows excellent separation in the ROC chart.
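
We assessed these combinations by refitting the tree in SAS and re-reading the lift chart each time. A rough sketch of the same search in Python, scoring every candidate subset by its KS statistic (file, column, and target names are again hypothetical, and the leaf size here is arbitrary):

    from itertools import combinations
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_curve

    hearts = pd.read_csv("hearts.csv").dropna()   # hypothetical file name
    target = hearts["CHDDeath"]                   # assumed 0/1 target column
    candidates = ["Systolic", "Diastolic", "Cholesterol", "Height",
                  "AgeCHDDiag", "MRW", "Weight", "Smoking"]

    best_ks, best_model, best_X = 0.0, None, None
    for k in range(3, len(candidates) + 1):
        for subset in combinations(candidates, k):
            X = pd.get_dummies(hearts[list(subset)])   # encode any categorical inputs
            model = DecisionTreeClassifier(min_samples_leaf=25).fit(X, target)
            fpr, tpr, _ = roc_curve(target, model.predict_proba(X)[:, 1])
            ks = (tpr - fpr).max()
            # In practice, score on held-out data so overfit trees are not rewarded.
            if ks > best_ks:
                best_ks, best_model, best_X = ks, model, X
    print(best_ks, list(best_X.columns))
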
The misclassification chart also shows only 30 false negatives for coronary heart disease compared to 551 true positives, and 210 true negatives compared to 73 false positives. This indicates that our model does a very good job of detecting both true positives and true negatives and makes relatively few errors when predicting deaths from coronary heart disease. The leaf statistics show that nodes 6, 15, 20, 26, 27, and 31 are the most likely to predict coronary heart disease deaths. The tree map shows which combinations of data produce the most likely outcomes for heart disease: for instance, certain ranges of cholesterol, of the age at which coronary heart disease was diagnosed, of metropolitan relative weight, and of weight yield more coronary heart disease deaths than other ranges. To use this in a real-world setting, patients whose measurements fall in the high-risk ranges could be given medicines such as nitroglycerin to counteract the added risk of death from coronary heart disease. This model is shown under the decision tree tab.
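
In the Python sketch, the analogue of reading the SAS leaf statistics and tree map is to print the fitted split rules and look up each record's leaf:

    from sklearn.tree import export_text

    # Print the split rules so the value ranges behind the high-risk
    # leaves are visible, much like reading the tree map.
    print(export_text(best_model, feature_names=list(best_X.columns)))

    # Leaf assignment per patient, analogous to the SAS node IDs.
    leaf_ids = best_model.apply(best_X)
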
One concern we had with the predictors we chose was the potential for multicollinearity. In particular, we were concerned that metropolitan relative weight and weight could be multicollinear; these two variables had a correlation of .7672 (found in the multicollinearity check tab), which indicates a strong possibility of multicollinearity. We also felt the diastolic and systolic variables had the potential to be multicollinear, but since SAS automatically removed diastolic from the decision tree, we did not have to address that pair. We found that removing weight as a predictor had the least overall impact on the lift chart. Removing weight also resulted in fewer false negatives, meaning the model would do a better job of identifying actual coronary heart disease victims. However, that model also had far more false positives (137) than the original model shown above. The ROC for this model still produces a KS statistic of .55 and a good lift chart similar to the original model's. For the adjusted model, nodes 6, 8, 17, 21, 25, and 27 are best at indicating coronary heart disease. As before, certain ranges of the predictor variables are better at predicting eventual deaths from coronary heart disease. This model can be found under the adjusted decision tree tab.
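
The multicollinearity check itself is a pairwise correlation, which can be sketched in one line with the hypothetical column names used above:

    # Pairwise correlations among the suspect predictors; a value near 1,
    # such as the reported .7672 between weight and metropolitan relative
    # weight, flags possible multicollinearity.
    print(hearts[["Weight", "MRW", "Systolic", "Diastolic"]].corr())
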
