Score 10 out of 10
1. (Question #2, page 30) For each of the following problem scenarios,
decide if a solution would best be addressed with supervised learning,
unsupervised clustering, or database query. As appropriate, state any
initial hypothesis you would like to test. If you decide that supervised
learning or unsupervised clustering is the best answer, list several input
attributes you believe to be relevant for solving the problem.
Score 10 out of 10
2. (Question #3, page 30) Medical doctors are experts at disease diagnosis
and surgery. Explain how medical doctors use induction to help develop
their skills.
Score 10 out of 10
Monica Nusskern 3
CSIS 5420 – Data Mining
Week 1 Assignment
June 3, 2005
3. (Question #6, page 31) What happens when you try to build a decision
tree for the data in Table 1.1 without employing the attributes Swollen
Glands and Fever?
Let's pick sore throat as the top-level node. The only possible values are yes and no.
Instances 1, 3, 4, 8, and 10 follow the yes path; instances 2, 5, 6, 7, and 9 follow the
no path. Both the sore throat = yes path and the sore throat = no path contain
representatives from all three classes.
Next we follow the sore throat = yes path and choose headache. We need only concern
ourselves with instances 1, 3, 4, 8, and 10. For headache = yes we have instances 1 (strep
throat), 8 (allergy), and 10 (cold). For headache = no we have instances 3 (cold) and 4
(strep throat).
Next we follow headache = yes and choose congestion, the only remaining attribute. All
three instances show congestion = yes, so the tree cannot differentiate them any
further. A similar problem arises on the headache = no path. Therefore, the
path following sore throat = yes is unable to differentiate any of the five instances. The
problem repeats itself for the path sore throat = no. In general, any top-level node choice
of sore throat, congestion, or headache gives a similar result.
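The dead end described above can be checked mechanically. The following is a minimal sketch that uses only the three instances whose attribute values are fully stated in the answer (instances 1, 8, and 10); the field names and the helper `conflicting_groups` are illustrative assumptions, not part of the assignment or the textbook.

```python
from collections import defaultdict

# Only the instances whose remaining attribute values are spelled out in the
# answer: all three have sore throat, headache, and congestion equal to "yes".
# Field names are hypothetical.
instances = [
    {"id": 1, "sore_throat": "yes", "congestion": "yes", "headache": "yes",
     "diagnosis": "strep throat"},
    {"id": 8, "sore_throat": "yes", "congestion": "yes", "headache": "yes",
     "diagnosis": "allergy"},
    {"id": 10, "sore_throat": "yes", "congestion": "yes", "headache": "yes",
     "diagnosis": "cold"},
]

ATTRIBUTES = ("sore_throat", "congestion", "headache")


def conflicting_groups(rows, attributes, target):
    """Group rows by their attribute values and keep the groups that
    contain more than one target class (no tree can split these)."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[name] for name in attributes)
        groups[key].add(row[target])
    return {key: classes for key, classes in groups.items() if len(classes) > 1}


conflicts = conflicting_groups(instances, ATTRIBUTES, "diagnosis")
for key, classes in conflicts.items():
    print(dict(zip(ATTRIBUTES, key)), "->", sorted(classes))
```

Any group reported here holds instances that are identical on every available attribute yet belong to different classes, so no ordering of the three attributes can separate them.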
Score 10 out of 10
4. (Question #6, page 63) Suppose you have used data mining to develop
two alternative models designed to accept or reject home mortgage
applications. Both models show an 85% test set classification correctness.
The majority of errors made by model A are false accepts whereas the
majority of errors made by model B are false rejects. Which model should
you choose? Justify your answer.
I would choose model B because false rejects would cost the firm much less
money than false accepts. If the majority of the errors for model A were false
accepts, then people who were not qualified candidates for home loans would be
accepted anyway. This could be detrimental to the company, as these applicants
would not pay their mortgage bills, resulting in less income for the company
while large expenses would be incurred.
OK, but consider this perspective: since a mortgage is secured credit, is
there much risk in false accepts?
Score 10 out of 10
5. (Question #7, page 63) Suppose you have used data mining to develop
two alternative models designed to decide whether or not to drill for oil.
Both models show an 85% test set classification correctness. The majority
of errors made by model A are false accepts whereas the majority of errors
made by model B are false rejects. Which model should you choose?
Justify your answer.
I would choose model A because the company could miss out on a large income
by not drilling in an area where it should have drilled. If the company
drilled for oil where none existed, the result could still be used as
knowledge to improve the models that were developed. However, while I chose
model A for this question, I do see the benefits of model B, such as avoiding
the environmental impact of drilling where no oil exists.
OK, but consider: if the cost of drilling for oil is very high, model B is
the best choice.
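Because both models have the same 85% correctness, the choice turns on what each kind of error costs. The sketch below makes that explicit under assumed numbers: a hypothetical test set of 1,000 decisions (so 150 errors per model), an assumed 120/30 split between the majority and minority error type, and made-up cost units. None of these figures come from the assignment.

```python
# Hypothetical error counts: each model makes 150 errors on 1,000 test cases.
# Model A errs mostly by false accepts, model B mostly by false rejects.
MODEL_A = {"false_accepts": 120, "false_rejects": 30}
MODEL_B = {"false_accepts": 30, "false_rejects": 120}


def total_error_cost(errors, cost_false_accept, cost_false_reject):
    """Total cost of a model's errors given a cost per error type."""
    return (errors["false_accepts"] * cost_false_accept
            + errors["false_rejects"] * cost_false_reject)


def cheaper_model(cost_false_accept, cost_false_reject):
    """Return "A", "B", or "tie" depending on which model's errors cost less."""
    cost_a = total_error_cost(MODEL_A, cost_false_accept, cost_false_reject)
    cost_b = total_error_cost(MODEL_B, cost_false_accept, cost_false_reject)
    if cost_a == cost_b:
        return "tie"
    return "A" if cost_a < cost_b else "B"


# Dry wells are expensive relative to missed oil: false accepts dominate.
print(cheaper_model(cost_false_accept=10, cost_false_reject=1))
# Missed oil is expensive relative to a dry well: false rejects dominate.
print(cheaper_model(cost_false_accept=1, cost_false_reject=10))
```

With these assumed counts, the model whose dominant error type is the cheaper one always wins, which is the point of the instructor's comment about high drilling costs.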
Score 10 out of 10
6. (Question #8, page 63) Explain how unsupervised clustering can be used
to evaluate the likely success of a supervised learner model.
If our unsupervised learner determines that the same input attributes will
form clusters that differentiate the values of the output attribute, then the
complementary results verify the supervised learner assumptions.
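One way to make "clusters that differentiate the values of the output attribute" measurable is to check how well each cluster lines up with a single class. The sketch below assumes the clustering has already been run and only compares cluster assignments with known class labels; the function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_class_agreement(cluster_ids, class_labels):
    """Fraction of instances that fall in the majority class of their cluster.

    A value near 1.0 suggests the input attributes separate the classes well;
    a value near the largest class proportion suggests they do not.
    """
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster[cluster][label] += 1
    matched = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return matched / len(class_labels)


# Hypothetical case 1: clusters coincide with the output attribute values.
good = cluster_class_agreement([0, 0, 0, 1, 1, 1],
                               ["yes", "yes", "yes", "no", "no", "no"])

# Hypothetical case 2: every cluster is an even mix of both classes.
poor = cluster_class_agreement([0, 0, 0, 0, 1, 1, 1, 1],
                               ["yes", "no", "yes", "no", "yes", "no", "yes", "no"])

print(good, poor)
```

A high agreement score is encouraging evidence for the supervised model's choice of input attributes; a low one suggests revisiting them before training.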
Score 10 out of 10
7. (Question #9, page 63) Explain how supervised learning can be used to
help evaluate the results of an unsupervised clustering model.
Treat each cluster as a class, give it a name, and randomly sample instances
from each class. Build a supervised learner model with the class name as the
output attribute, using the randomly sampled instances as training data.
Employ the remaining instances to test the supervised model for
classification correctness.
Very good
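The procedure above can be sketched in a few lines. This example is illustrative only: it assumes numeric input attributes, uses a nearest-centroid classifier as a stand-in for whatever supervised learner is actually chosen, and samples training instances within each cluster so that every cluster is represented. All names and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def split_by_cluster(cluster_labels, train_fraction, rng):
    """Randomly pick a training sample inside each cluster; the rest is test."""
    by_cluster = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        by_cluster[label].append(index)
    train, test = [], []
    for indices in by_cluster.values():
        rng.shuffle(indices)
        n_train = max(1, int(len(indices) * train_fraction))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test


def clustering_test_accuracy(points, cluster_labels, train_fraction=0.5, seed=0):
    """Train on sampled instances (cluster name = class), test on the rest."""
    rng = random.Random(seed)
    train, test = split_by_cluster(cluster_labels, train_fraction, rng)

    # "Training": one centroid per cluster name, from training instances only.
    members = defaultdict(list)
    for i in train:
        members[cluster_labels[i]].append(points[i])
    centroids = {
        label: [sum(coords) / len(pts) for coords in zip(*pts)]
        for label, pts in members.items()
    }

    def predict(point):
        return min(centroids, key=lambda lab: squared_distance(point, centroids[lab]))

    correct = sum(predict(points[i]) == cluster_labels[i] for i in test)
    return correct / len(test)


# Hypothetical clustering result: two well-separated clusters.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2"]
accuracy_separated = clustering_test_accuracy(points, labels)
print(accuracy_separated)
```

High test-set correctness suggests the clusters are well defined by the input attributes; low correctness suggests the clustering found structure that a supervised model cannot reproduce.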
Score 7 out of 10
Computed Decision
86% Good
2 Republicans Good
0 Independents Good
Score 7 out of 10
80 individuals 40 instances
80 individuals 10 instances
Score 10 out of 10
Model X      Computed Accept    Computed Reject
Accept                    46                 54
Reject                 2,245              7,655

Model Y      Computed Accept    Computed Reject
Accept                    45                 55
Reject                 1,955              7,945
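The two confusion matrices can be compared numerically. The sketch below assumes that rows are actual classes and columns are computed classes, and it computes overall correctness along with lift for the Accept class, taken as the accept rate among computed accepts divided by the accept rate in the whole population. The row orientation and the function names are assumptions, not stated in the assignment.

```python
# (c11, c12, c21, c22), read row by row from the tables above.
# Assumption: rows are actual classes, columns are computed classes.
MODEL_X = (46, 54, 2245, 7655)
MODEL_Y = (45, 55, 1955, 7945)


def accuracy(c11, c12, c21, c22):
    """Share of all instances on the main diagonal."""
    return (c11 + c22) / (c11 + c12 + c21 + c22)


def lift(c11, c12, c21, c22):
    """Accept rate among computed accepts over accept rate in the population."""
    sample_rate = c11 / (c11 + c21)
    population_rate = (c11 + c12) / (c11 + c12 + c21 + c22)
    return sample_rate / population_rate


for name, matrix in (("Model X", MODEL_X), ("Model Y", MODEL_Y)):
    print(f"{name}: accuracy={accuracy(*matrix):.4f}, lift={lift(*matrix):.3f}")
```

Under these assumptions Model Y comes out ahead on both measures, with accuracy 0.799 against 0.7701 and lift 2.25 against roughly 2.008.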
Score 9 out of 10
Lift = (C11 / Sum(ComputedSend)) / (Sum(Send) / Sum(Total))
where Sum(Send) / Sum(Total) is P(C11 | Population)