
Data Mining Techniques in the Diagnosis of Coronary Artery Disease (CAD)
Steve Iduye
Xiaoqing Zhuang
HINF 6210 Data Mining

Contents
Coronary Heart Disease in a Nutshell
Description of the Datasets
Case 1
Case 2
Case 3
Discussion
Conclusion

Heart Disease in a Nutshell


Coronary Artery Disease (CAD) happens when the arteries that supply blood to the heart muscle become hardened and narrowed.
As a result, the heart muscle cannot get the blood or oxygen it needs, which can lead to chest pain (angina) or a heart attack.
Current research has established that heart disease is not a single condition, but refers to any condition in which the heart and blood vessels are injured and do not function properly, resulting in serious and fatal health problems (Chilnick, 2008; HEALTHS, 2010; King, 2004; Silverstein et al., 2006).

Heart Disease in a Nutshell


The causes of heart disease are unclear, but age, gender, family history, and ethnic background are all considered major causes in different investigations (Chilnick, 2008; HEALTHS, 2010; King, 2004; Silverstein et al., 2006).
Other factors such as eating habits, fatty foods, lack of exercise, high cholesterol, hypertension, pollution, lifestyle factors, obesity, stress, diabetes, and lack of awareness have also been claimed to increase the chance of developing heart disease (Chilnick, 2008; HEALTHS, 2010).
Heart research has further found that the majority of disease occurrence is seen in people between the ages of 50 and 60 (Chilnick, 2008; HEALTHS, 2010).

Case 1
The case study investigates the risk factors that contribute to Coronary Artery Disease in males and females
(Article published by Jesmin Nahar, Tasadduq Imam, Kevin S. Tickle, and Yi-Ping Phoebe Chen)
UCI Cleveland Dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/)
Predictive Apriori (association rules) was used to identify those risk factors

Apriori Algorithm (Case 1)


The learning process looks for the following:
Rules whose support and confidence are greater than or equal to the minimum thresholds
All possible association rules that meet these requirements are listed
Confidence and support are used in this study because of their accuracy in ranking the rules in Apriori (Agrawal et al., 1993; Mutter, Hall, & Frank, 2005; Taihua & Fan, 2010)
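As a rough illustration of the two measures named above, the sketch below computes support and confidence for one candidate rule. The records and item names (`exang=false`, `class=healthy`, and so on) are made up for the example and are not taken from the Cleveland dataset.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(lhs: Iterable[str], rhs: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """support(LHS and RHS) / support(LHS); 0.0 if the LHS never occurs."""
    lhs_support = support(lhs, transactions)
    if lhs_support == 0:
        return 0.0
    both = frozenset(lhs) | frozenset(rhs)
    return support(both, transactions) / lhs_support


# Hypothetical toy records, one frozenset of attribute=value items per patient.
records = [
    frozenset({"sex=female", "exang=false", "class=healthy"}),
    frozenset({"sex=female", "exang=false", "class=healthy"}),
    frozenset({"sex=male", "exang=true", "class=sick"}),
    frozenset({"sex=male", "exang=false", "class=healthy"}),
    frozenset({"sex=male", "exang=true", "class=sick"}),
]

rule_lhs = {"exang=false"}
rule_rhs = {"class=healthy"}
rule_support = support(rule_lhs | rule_rhs, records)        # 3 of 5 records
rule_confidence = confidence(rule_lhs, rule_rhs, records)   # 3 of the 3 LHS matches

# A rule is kept only if both values reach the chosen minimum thresholds.
keep_rule = rule_support >= 0.5 and rule_confidence >= 0.9
```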

Attributes of Interest in the Dataset


These attributes are the combination of symptoms, characteristics of
heart disease, diagnostic techniques and probable causes.
Let X represent all the attributes
Let Y represent the class vector (CAD = unhealthy, No_CAD = healthy)

Attributes of Interest in the Dataset

Prior Setting
Rules with confidence levels above 90%, accuracy levels above 99%, and confirmation levels above 79% were selected for Predictive Apriori.
As there can be many such rules, only the rules containing the sick or healthy class on the right-hand side (RHS) were considered.
If no such rules were available, rules containing the sick or healthy class on the left-hand side (LHS) were reported.
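The selection policy described above can be sketched as a small filter. This is only an illustration under assumptions: the `Rule` type, the item names, and the single confidence threshold are hypothetical, and the accuracy and confirmation scores mentioned in the slide are not modelled.

```python
from typing import FrozenSet, List, NamedTuple


class Rule(NamedTuple):
    lhs: FrozenSet[str]
    rhs: FrozenSet[str]
    confidence: float


# Hypothetical names for the two class items.
CLASS_ITEMS = frozenset({"class=healthy", "class=sick"})


def select_rules(rules: List[Rule], min_confidence: float = 0.90) -> List[Rule]:
    """Prefer strong rules with the class on the RHS; otherwise fall back to the LHS."""
    strong = [r for r in rules if r.confidence > min_confidence]
    class_on_rhs = [r for r in strong if r.rhs & CLASS_ITEMS]
    if class_on_rhs:
        return class_on_rhs
    return [r for r in strong if r.lhs & CLASS_ITEMS]


# Made-up rules for demonstration only.
rules = [
    Rule(frozenset({"sex=female"}), frozenset({"class=healthy"}), 0.95),
    Rule(frozenset({"class=sick"}), frozenset({"thal=rev"}), 0.97),
    Rule(frozenset({"exang=false"}), frozenset({"class=healthy"}), 0.80),
]
selected = select_rules(rules)
fallback = select_rules([rules[1]])
```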

Apriori Rules


Summary: Case 1
Four of the five rules for the healthy class indicate that females, in this particular dataset, have a greater chance of being free from coronary heart disease.
The results also show that exercise-induced angina (chest pain) being false was a good indicator of a person being healthy, irrespective of gender (exercise induced angina = false appeared in the LHS of all the high-confidence rules).
The number of coloured vessels being zero and thal (heart status) being normal were also shown to be good indicators of health.

Case 1 Summary
Rules mined for the sick class, on the other hand, showed that chest pain type being asymptomatic and thal being a reversible defect were probable indicators of a person being sick (both high-confidence rules have these two factors in the LHS).

Building Classification Rules


Objectives
Build classification rules from the attributes identified by the association rules (A.R.) above
Training data are analyzed by a classification algorithm
The learned model or classifier is expressed as rules
Training data are used to estimate the accuracy of the rules
The rules can then be applied to classify new data tuples
(Han, Kamber, & Pei, 2012)

Step 1: Training Data


Healthy Class
SEX     EXERCISE_INDUCED_ANGINA  NO_VESSEL_COLORED  THAL (HEART STATUS)  FASTING_BLOOD_SUGAR  CLASS
Female  False                    0                  Normal               -                    Healthy (no_CAD)
Female  False                    0                  -                    False                Healthy (no_CAD)
Female  False                    -                  Normal               False                Healthy (no_CAD)
Female  False                    0                  -                    -                    Healthy (no_CAD)
M or F  False                    0                  Normal               -                    Healthy (no_CAD)
(- = attribute not used in the corresponding rule)

Step 1: Training Data


Unhealthy Class
CHEST_PAIN_TYPE  SLOPE  EXERCISE_INDUCED_ANGINA  THAL (HEART STATUS)  CLASS
asymptomatic     flat   -                        reversible defect    Unhealthy (CAD)
asymptomatic     -      true                     reversible defect    Unhealthy (CAD)

Step 2 : Create Classification Rules


The learned model or classifier is expressed as rules
If {Sex = female AND exercise_induced_angina = false AND number_of_vessels_colored = 0 AND thal = normal} => Then, no CAD.
If {Sex = female AND fasting_blood_sugar = false AND exercise_induced_angina = false AND number_of_vessels_colored = 0} => Then, no CAD.

C. Rules
If {Sex = female AND fasting_blood_sugar = false AND exercise_induced_angina = false AND thal = normal} => Then, no CAD.
If {Resting_blood_pres in (115.2, 136.4] AND exercise_induced_angina = false AND number_of_vessels_colored = 0 AND thal = normal} => Then, no CAD.
If {Sex = female AND exercise_induced_angina = false AND number_of_vessels_colored = 0} => Then, no CAD.

C. Rules
If {Chest_pain_type = asymptomatic AND slope = flat AND thal = reversible defect} => Then, CAD is present.
If {Chest_pain_type = asymptomatic AND exercise_induced_angina = true AND thal = reversible defect} => Then, CAD is present.
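The seven rules above can be applied to a new record with a few lines of code. The sketch below is one possible encoding, not the presenters' implementation: the dictionary keys, the value spellings, the "unknown" fallback, and the choice to check the CAD rules first are all assumptions made for illustration.

```python
# Each rule maps an attribute to a required value or to a predicate on the value.
HEALTHY_RULES = [
    {"sex": "female", "exercise_induced_angina": False,
     "number_of_vessels_colored": 0, "thal": "normal"},
    {"sex": "female", "fasting_blood_sugar": False,
     "exercise_induced_angina": False, "number_of_vessels_colored": 0},
    {"sex": "female", "fasting_blood_sugar": False,
     "exercise_induced_angina": False, "thal": "normal"},
    {"resting_blood_pres": lambda v: 115.2 < v <= 136.4,
     "exercise_induced_angina": False,
     "number_of_vessels_colored": 0, "thal": "normal"},
    {"sex": "female", "exercise_induced_angina": False,
     "number_of_vessels_colored": 0},
]

SICK_RULES = [
    {"chest_pain_type": "asymptomatic", "slope": "flat", "thal": "reversible"},
    {"chest_pain_type": "asymptomatic", "exercise_induced_angina": True,
     "thal": "reversible"},
]


def matches(rule, record):
    """True if the record satisfies every condition of the rule."""
    for attribute, condition in rule.items():
        if attribute not in record:
            return False
        value = record[attribute]
        satisfied = condition(value) if callable(condition) else value == condition
        if not satisfied:
            return False
    return True


def classify(record):
    """Assumed policy: CAD rules are checked first, then the no-CAD rules."""
    if any(matches(rule, record) for rule in SICK_RULES):
        return "CAD"
    if any(matches(rule, record) for rule in HEALTHY_RULES):
        return "no CAD"
    return "unknown"


# Hypothetical patients used only to exercise the rules.
patient_a = {"sex": "female", "exercise_induced_angina": False,
             "number_of_vessels_colored": 0, "thal": "normal"}
patient_b = {"sex": "male", "chest_pain_type": "asymptomatic",
             "slope": "flat", "thal": "reversible"}
patient_c = {"sex": "male", "resting_blood_pres": 120.0,
             "exercise_induced_angina": False,
             "number_of_vessels_colored": 0, "thal": "normal"}
```

Records that match no rule are reported as "unknown" rather than forced into a class, since the rules only cover high-confidence patterns.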

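Step 3: Estimate the Accuracy of the Rules Using a Decision Tree
Find the information gain of each attribute
$Info(D) = -\frac{5}{7}\log_2\frac{5}{7} - \frac{2}{7}\log_2\frac{2}{7} = 0.8631$ (A)
$Info_{sex}(D) = \frac{4}{7}\left(-\frac{4}{5}\log_2\frac{4}{5} - \frac{1}{5}\log_2\frac{1}{5}\right) = 0.4125$ (B)
$Info_{exercise\_induced\_angina}(D) = \frac{6}{7}\left(-\frac{5}{6}\log_2\frac{5}{6} - \frac{1}{6}\log_2\frac{1}{6}\right) = 0.5572$ (C)
$Info_{heart\ status}(D) = \frac{5}{7}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) = 0.6935$ (D)
A - B = 0.4506 bits (sex), A - C = 0.3060 bits (exercise_induced_angina), A - D = 0.1696 bits (heart status)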
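A quick way to check the arithmetic of these expressions is to evaluate them directly. The sketch below does so with the fractions exactly as they appear in the slide. The helper name `entropy` and the variable names are illustrative, and the code takes the weights (4/7, 6/7, 5/7) as given rather than re-deriving them from the training table.

```python
from math import log2


def entropy(*probabilities: float) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


# (A) entropy of the class distribution: 5 healthy and 2 unhealthy rows
info_d = entropy(5 / 7, 2 / 7)

# (B)-(D) weighted entropies, using the fractions given in the slide
info_sex = 4 / 7 * entropy(4 / 5, 1 / 5)
info_angina = 6 / 7 * entropy(5 / 6, 1 / 6)
info_heart_status = 5 / 7 * entropy(3 / 5, 2 / 5)

# Information gain is the reduction in entropy for each attribute
gains = {
    "sex": info_d - info_sex,
    "exercise_induced_angina": info_d - info_angina,
    "heart_status": info_d - info_heart_status,
}
best_attribute = max(gains, key=gains.get)
```

Because the entropy of a two-class distribution cannot exceed 1 bit, each value computed this way lies between 0 and 1.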

Case 2: Diagnosing Coronary Artery Disease via Data Mining Algorithms by Considering Laboratory and Echocardiography Features

Case 3: A data mining approach for diagnosis of coronary artery disease

Dataset
Case 2: Z-Alizadeh Sani dataset, 303 patients (54 features each)
Case 3: Z-Alizadeh Sani dataset, 303 patients (54 features each)

Objective
Case 2: Use non-invasive, less costly methods (various data mining algorithms) to predict the stenosis of each artery separately.
Case 3: Use affordable feature measurements and apply the proposed approach to estimate the probability of CAD.

Features
Case 2: Demographic features, laboratory and echo features
Case 3: Features in 4 groups: demographic, symptom and examination, ECG, and laboratory and echo features
2 possible categories: CAD or Normal
(IF a patient's diameter narrowing is >= 50% THEN CAD, ELSE Normal)
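The labeling rule above is a single threshold test. A minimal sketch follows; the function name and the percentage parameter are hypothetical.

```python
def cad_label(diameter_narrowing_percent: float) -> str:
    """Label a patient as CAD when diameter narrowing is at least 50%."""
    return "CAD" if diameter_narrowing_percent >= 50 else "Normal"


# Hypothetical narrowing percentages for three patients.
labels = [cad_label(p) for p in (20.0, 50.0, 75.5)]
```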

Methods
Case 2
Classification algorithms: C4.5, Bagging algorithm
Information gain, Gini Index, ten-fold cross-validation, confusion matrix, performance measures
Case 3
Classification algorithms: SMO, Naïve Bayes classifier, Bagging algorithm, Neural Network algorithm
Feature selection and feature creation, information gain, Gini Index, association rule mining
Case 2 (METHODS)

C4.5 classification algorithm
Builds decision trees and adds refinements that improve their performance
Handles continuous values by breaking them down into sub-intervals
Uses pruning methods to improve accuracy
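To make the handling of continuous values concrete, the sketch below searches for the cut point on one numeric attribute that gives the largest information gain. It is a simplified illustration rather than C4.5 itself: it uses midpoints between neighbouring values as candidate cuts, ignores gain ratio and pruning, and the ages and labels are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (cut, gain) for the binary split 'value <= cut' with the highest gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_cut, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no cut point between equal values
        cut = (low + high) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain


# Hypothetical ages and diagnoses.
ages = [38, 45, 52, 58, 63, 67]
diagnoses = ["Normal", "Normal", "Normal", "CAD", "CAD", "CAD"]
cut, gain = best_threshold(ages, diagnoses)
```

The continuous attribute is then treated as two sub-intervals, `age <= cut` and `age > cut`, when the tree is grown.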

Case 2 (METHODS)

Bagging Algorithm
Classifies each sample based on the output of a set of diverse base classifiers.
Base classifiers can be selected from C4.5, Naïve Bayes, ID3, and other data mining algorithms.
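A minimal sketch of the bagging idea follows: several base classifiers are trained on bootstrap samples and their outputs are combined by majority vote. The base learner here is a one-attribute decision stump chosen for brevity, not C4.5 or any of the algorithms named above, and the patient data are invented.

```python
import random
from collections import Counter


def train_stump(samples):
    """Fit a one-threshold classifier on (value, label) pairs."""
    best = None
    distinct = sorted({x for x, _ in samples})
    for i in range(1, len(distinct)):
        cut = (distinct[i - 1] + distinct[i]) / 2
        left = Counter(label for x, label in samples if x <= cut)
        right = Counter(label for x, label in samples if x > cut)
        left_label = left.most_common(1)[0][0]
        right_label = right.most_common(1)[0][0]
        errors = sum(1 for x, label in samples
                     if (left_label if x <= cut else right_label) != label)
        if best is None or errors < best[0]:
            best = (errors, cut, left_label, right_label)
    if best is None:
        # Only one distinct value in the sample: always predict the majority label.
        majority = Counter(label for _, label in samples).most_common(1)[0][0]
        return lambda x: majority
    _, cut, left_label, right_label = best
    return lambda x: left_label if x <= cut else right_label


def bagging(samples, n_models=15, seed=0):
    """Train stumps on bootstrap samples and combine them by majority vote."""
    rng = random.Random(seed)
    models = [train_stump([rng.choice(samples) for _ in samples])
              for _ in range(n_models)]

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Hypothetical (age, diagnosis) pairs.
patients = [(35, "Normal"), (40, "Normal"), (45, "Normal"), (50, "Normal"),
            (60, "CAD"), (65, "CAD"), (70, "CAD"), (75, "CAD")]
predict = bagging(patients)
```

Calling `predict(30)` or `predict(80)` returns the label chosen by most of the stumps.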

Case 3 (METHODS)
Sequential Minimal Optimization (SMO): an algorithm for efficiently solving the optimization problem that arises during the training of Support Vector Machines (SVMs)
Naïve Bayes classifier: a simple probabilistic classifier based on applying Bayes' theorem with a strong independence assumption
Bagging algorithm
Neural Network algorithm: an Artificial Neural Network (ANN) is an interconnected group of artificial neurons that use a mathematical or computational model for information processing based on a connectionist approach. It can model complex relationships between inputs and outputs.
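Of the classifiers listed above, Naïve Bayes is the simplest to sketch in a few lines. The example below handles categorical features with add-one (Laplace) smoothing, which is an assumption of this sketch. The two features and all records are invented.

```python
from collections import Counter, defaultdict
from math import log


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, feature index) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict(model, row):
    """Return the class with the highest log-posterior under independence."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = log(n / total)  # log prior
        for i, value in enumerate(row):
            count = value_counts[(label, i)][value]
            # Add-one smoothing keeps unseen values from zeroing the product.
            score += log((count + 1) / (n + len(feature_values[i])))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical records: (chest pain type, exercise-induced angina)
rows = [
    ("asymptomatic", True), ("asymptomatic", True), ("asymptomatic", False),
    ("typical", False), ("typical", False), ("atypical", True),
]
labels = ["CAD", "CAD", "CAD", "Normal", "Normal", "Normal"]
model = train_naive_bayes(rows, labels)
```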

Case 3 (METHODS)
Feature Selection
Uses the coefficients of the normal vector of a linear SVM as feature weights
The attribute values still have to be numerical
34 features had a weight > 0.6; these were selected and the algorithms were applied to them
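The thresholding step can be sketched as follows. The coefficients are made up, and scaling the absolute coefficients so that the largest weight equals 1 is an assumption of this sketch; the slide does not say how the weights were normalized.

```python
def select_features(coefficients, threshold=0.6):
    """Keep features whose normalized absolute SVM coefficient exceeds the threshold."""
    largest = max(abs(w) for w in coefficients.values())
    if largest == 0:
        return []
    weights = {name: abs(w) / largest for name, w in coefficients.items()}
    return [name for name, weight in weights.items() if weight > threshold]


# Hypothetical coefficients of a linear SVM's normal vector.
coefficients = {
    "age": 1.8,
    "typical_chest_pain": 2.4,
    "bmi": 0.3,
    "ejection_fraction": -1.6,
}
selected = select_features(coefficients)
```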

Case 3 (METHODS)
Feature creation
Three new features are created: the LAD (Left Anterior Descending) recognizer, the LCX (Left Circumflex) recognizer, and the RCA (Right Coronary Artery) recognizer, used to recognize whether the LAD, LCX, or RCA is blocked. The higher the value, the higher the risk.
The available features of the dataset are first discretized into binary variables
A value of 1 for a feature indicates a higher probability of the record being in the CAD class, while a value of 0 indicates otherwise.

Case 3(METHODS)
Association rule mining (Mentioned in Case 1)
Support
Confidence

Case 2 and Case 3


Information gain
Measures the reduction in entropy of the data records due to a single split over a given attribute
The entropy before and after the split is computed
c is the class value, which can be CAD or Normal
P(c) is the probability of a record being in class c
If a feature separates the two classes completely, it has the highest information gain and is the best feature for classification
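For reference, the standard definitions behind these bullets can be written as follows, using c and P(c) as defined above. The symbols D (the set of records), A (the attribute being split on), and D_v (the records for which A takes value v) are notation introduced here.

```latex
H(D) = -\sum_{c \in \{\mathrm{CAD},\ \mathrm{Normal}\}} P(c)\,\log_2 P(c)

IG(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
```

When a feature separates the two classes completely, every H(D_v) is zero, so the gain equals H(D), its maximum possible value.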

Case 2 and Case 3


Gini Index
A measure of how often a randomly chosen element from a set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset
The probability of correctly labeling an item is equal to the probability of choosing that item
Higher values of the Gini Index for a feature indicate its prevalence in causing the disease.
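The definition in the first bullet corresponds to the Gini impurity of a set of labels, which can be computed as one minus the sum of squared class proportions. The sketch below shows that computation on invented labels. How the case studies turn it into a per-feature score is not specified here.

```python
from collections import Counter


def gini_impurity(labels):
    """Probability that a random element is mislabeled by a random label draw."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


# Hypothetical label sets.
mixed = gini_impurity(["CAD", "CAD", "Normal", "Normal"])   # evenly mixed
pure = gini_impurity(["CAD", "CAD", "CAD", "CAD"])          # single class
```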

Case 1 and Case 2


Performance measure: Accuracy, sensitivity, and specificity are the
most important performance measures in the medical field
Confusion matrix: a table that allows visualization of the performance
of an algorithm
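The three measures follow directly from the four cells of the confusion matrix. The sketch below treats CAD as the positive class. The actual and predicted labels are invented for the example.

```python
def confusion_counts(actual, predicted, positive="CAD"):
    """Return (TP, FP, TN, FN) with `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def performance(actual, predicted, positive="CAD"):
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # share of true CAD cases detected
        "specificity": tn / (tn + fp),   # share of Normal cases correctly cleared
    }


# Hypothetical ground truth and classifier output.
actual = ["CAD", "CAD", "CAD", "Normal", "Normal", "CAD", "Normal", "Normal"]
predicted = ["CAD", "CAD", "Normal", "Normal", "CAD", "CAD", "Normal", "Normal"]
counts = confusion_counts(actual, predicted)
scores = performance(actual, predicted)
```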

Discussion (Improve Accuracy of CAD Diagnosis by Using Data Mining Techniques)

[Diagram: workflow for CAD diagnosis with data mining]
Understand CAD
Dataset with Effective Features: Feature Selection, Feature Creation, Information Gain, Gini Index
Data Mining Methods (RapidMiner): C4.5, Bagging Algorithm, SMO Algorithm, Naive Bayes algorithm, Neural Network algorithm, Association Rule Mining
Results: CAD Risk Features, Rules Extracted (Confidence)
Performance Measurement: Confusion Matrix, Sensitivity, Specificity, Accuracy

Conclusion

Feature selection methods can increase the accuracy of CAD diagnosis (though they may sometimes decrease the accuracy of LAD and RCA stenosis diagnosis).
To enrich the dataset, we may need to create new features that have a vital influence on the accuracy of CAD diagnosis.
Rules extracted by association rule mining may not be 100% correct; more testing data are needed to test the rules.
The results of the standard angiographic method are still needed as the basis of comparison to assess the prediction capability of the classification algorithms.
