Data Mining Techniques On Heart Failure Diagnosis (Case Presentation)

Data Mining Techniques in the Diagnosis of Coronary Artery Disease (CAD)
Steve Iduye
Xiaoqing Zhuang
HINF 6210 Data Mining
Contents
Coronary Heart Disease in a Nutshell
Description of the Datasets
Case 1
Case 2
Case 3
Discussion
Conclusion
Heart Disease in a Nutshell

Coronary Artery Disease (CAD) happens when the arteries that
supply blood to the heart muscle become hardened and narrowed.

As a result, the heart muscle cannot get the blood or oxygen it
needs, which can lead to chest pain (angina) or a heart attack.
Current research has established that heart disease is not a single
condition, but refers to any condition in which the heart and blood
vessels are injured and do not function properly, resulting in serious
and potentially fatal health problems (Chilnick, 2008; HEALTHS,
2010; King, 2004; Silverstein et al., 2006).
Heart Disease in a Nutshell

The causes of heart disease are unclear, but age, gender, family
history, and ethnic background have all been identified as major
contributing factors in different investigations (Chilnick, 2008;
HEALTHS, 2010; King, 2004; Silverstein et al., 2006).
Other factors such as eating habits, fatty foods, lack of exercise, high
cholesterol, hypertension, pollution, lifestyle factors, obesity,
stress, diabetes, and lack of awareness have also been claimed to
increase the chance of developing heart disease
(Chilnick, 2008; HEALTHS, 2010).
Research has further found that most cases of the disease
occur in people between the ages of 50 and 60
(Chilnick, 2008; HEALTHS, 2010).
Case 1
The case study investigates the risk factors that contribute to
Coronary Artery Disease in males and females

(Article by Jesmin Nahar, Tasadduq Imam, Kevin S.
Tickle, and Yi-Ping Phoebe Chen)
UCI Cleveland Dataset
(https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/)
Predictive Apriori (association rules) was used to identify those risk
factors
Apriori Algorithm (Case 1)

The learning process looks for the following:
Support and confidence greater than or equal to the minimum thresholds
List all possible association rules that meet these requirements
Confidence and support are used in this study because of their
accuracy in ranking the rules in Apriori (Agrawal et al., 1993; Mutter,
Hall, & Frank, 2005; Taihua & Fan, 2010)
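The support and confidence checks can be illustrated with a minimal Python sketch. The records, item names, and function names here are hypothetical and only show how the two measures are computed for one candidate rule; they are not taken from the study.

```python
# Hypothetical toy records: each record is a set of "attribute=value" items.
records = [
    {"sex=female", "exang=false", "class=healthy"},
    {"sex=female", "exang=false", "class=healthy"},
    {"sex=male", "exang=true", "class=sick"},
    {"sex=male", "exang=false", "class=healthy"},
    {"sex=female", "exang=true", "class=sick"},
]


def support(itemset, records):
    """Fraction of records that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= record for record in records) / len(records)


def confidence(lhs, rhs, records):
    """support(LHS and RHS) / support(LHS); 0.0 when the LHS never occurs."""
    lhs_support = support(lhs, records)
    if lhs_support == 0:
        return 0.0
    return support(set(lhs) | set(rhs), records) / lhs_support


def rule_passes(lhs, rhs, records, min_support, min_confidence):
    """True when the rule LHS => RHS meets both minimum thresholds."""
    rule_support = support(set(lhs) | set(rhs), records)
    return (rule_support >= min_support
            and confidence(lhs, rhs, records) >= min_confidence)


lhs = {"sex=female", "exang=false"}
rhs = {"class=healthy"}
print(support(lhs | rhs, records), confidence(lhs, rhs, records))
```

A full Apriori run would generate candidate itemsets level by level and keep only those passing the support threshold; the sketch only scores a single rule.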
Attributes of Interest in the Dataset

These attributes are the combination of symptoms, characteristics of
heart disease, diagnostic techniques and probable causes.
Let X represent all the attributes
Let Y represent the class vector (CAD = unhealthy, No_CAD = healthy)
Attributes of Interest in the Dataset
Prior Setting
Rules with confidence levels above 90%, accuracy levels above
99%, and confirmation levels above 79% were selected, respectively,
for Predictive Apriori.
As there can be many such rules, only the rules containing the sick
or healthy class on the right-hand side (RHS) were considered.
If no such rules were available, rules containing the sick or
healthy class on the left-hand side (LHS) were reported.
Apriori Rules
Summary: Case 1
Four of the five rules for the healthy class indicate that, in this
particular dataset, females have a greater chance of
being free from coronary heart disease.
The results also show that exercise-induced angina (chest
pain) being false was a good indicator of a person being healthy,
irrespective of gender (exercise induced angina = false
appeared in the LHS of all the high-confidence rules).
The number of coloured vessels being zero and thal (heart status)
being normal were also shown to be good indicators of health.
Case 1 Summary
Rules mined for the sick class, on the other hand, showed that
chest pain type being asymptomatic and thal being reversible were
probable indicators of a person being sick (both high-confidence
rules have these two factors in the LHS).
Building Classification Rules

Objectives
Build classification rules from the attribute data of the previous association rules
Training data are analyzed by a classification algorithm
The learned model or classifier is represented as rules
Test data are used to estimate the accuracy of the rules
The rules can be applied to the classification of new data tuples
(Han, Kamber, & Pei, 2012)
Step 1: Training Data

Healthy Class
("-" = attribute not part of the rule)
SEX | EXERCISE_INDUCED_ANGINA | NO_VESSEL_COLORED | THAL (HEART STATUS) | FASTING BLOOD SUGAR | CLASS
Female | False | 0 | Normal | - | Healthy (no_CAD)
Female | False | 0 | - | False | Healthy (no_CAD)
Female | False | - | Normal | False | Healthy (no_CAD)
Female | False | 0 | - | - | Healthy (no_CAD)
M or F | False | 0 | Normal | - | Healthy (no_CAD)
Step 1: Training Data

Unhealthy Class
("-" = attribute not part of the rule)
CHEST_PAIN_TYPE | SLOPE | EXERCISE_INDUCED_ANGINA | THAL (HEART STATUS) | CLASS
asymptomatic | flat | - | reversible defect | Unhealthy (CAD)
asymptomatic | - | true | reversible defect | Unhealthy (CAD)
Step 2 : Create Classification Rules

The learned model or classifier is represented as rules
If {Sex = female AND exercise_induced_angina = false AND
number_of_vessels_colored = 0 AND thal = normal} then no CAD
If {Sex = female AND fasting_blood_sugar = false AND
exercise_induced_angina = false AND number_of_vessels_colored = 0}
then no CAD
Classification Rules
If {Sex = female AND fasting_blood_sugar = false AND
exercise_induced_angina = false AND thal = normal} then no CAD
If {Resting_blood_pres in (115.2, 136.4] AND
exercise_induced_angina = false AND number_of_vessels_colored = 0 AND
thal = normal} then no CAD
If {Sex = female AND exercise_induced_angina = false AND
number_of_vessels_colored = 0} then no CAD
Classification Rules
If {Chest_pain_type = asymptomatic AND slope = flat AND thal = reversible defect} then
CAD is present
If {Chest_pain_type = asymptomatic AND exercise_induced_angina = true AND
thal = reversible defect} then CAD is present
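The seven rules can be applied to new records with a small rule-based classifier. The following is a minimal Python sketch under stated assumptions: the dictionary record format, the lower-case attribute names, the fully spelled-out values, the first-match strategy, and the "unknown" fallback are illustrative choices, not part of the original study.

```python
def in_interval(low, high):
    """Return a test for the half-open interval (low, high]."""
    return lambda value: low < value <= high


# (conditions, predicted class); a condition is a literal value or a test.
RULES = [
    ({"sex": "female", "exercise_induced_angina": False,
      "number_of_vessels_colored": 0, "thal": "normal"}, "no CAD"),
    ({"sex": "female", "fasting_blood_sugar": False,
      "exercise_induced_angina": False,
      "number_of_vessels_colored": 0}, "no CAD"),
    ({"sex": "female", "fasting_blood_sugar": False,
      "exercise_induced_angina": False, "thal": "normal"}, "no CAD"),
    ({"resting_blood_pres": in_interval(115.2, 136.4),
      "exercise_induced_angina": False,
      "number_of_vessels_colored": 0, "thal": "normal"}, "no CAD"),
    ({"sex": "female", "exercise_induced_angina": False,
      "number_of_vessels_colored": 0}, "no CAD"),
    ({"chest_pain_type": "asymptomatic", "slope": "flat",
      "thal": "reversible defect"}, "CAD"),
    ({"chest_pain_type": "asymptomatic", "exercise_induced_angina": True,
      "thal": "reversible defect"}, "CAD"),
]


def matches(conditions, record):
    """True when every condition of the rule holds for the record."""
    for attribute, expected in conditions.items():
        if attribute not in record:
            return False
        value = record[attribute]
        ok = expected(value) if callable(expected) else value == expected
        if not ok:
            return False
    return True


def classify(record, rules=RULES, default="unknown"):
    """Return the class of the first matching rule, else `default`."""
    for conditions, label in rules:
        if matches(conditions, record):
            return label
    return default


patient = {"sex": "female", "exercise_induced_angina": False,
           "number_of_vessels_colored": 0, "thal": "normal"}
print(classify(patient))
```

Records that match no rule fall through to the default label, which makes it visible how much of a dataset the mined rules actually cover.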
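Step 3: Estimate the Accuracy of the Rules Using a Decision Tree
Find the information gain of each attribute
$\mathrm{Info}(D) = -\tfrac{5}{7}\log_2\tfrac{5}{7} - \tfrac{2}{7}\log_2\tfrac{2}{7} \approx 0.8631$ bits (A)
$\mathrm{Info}_{\text{sex}}(D) = \tfrac{4}{7}\left(-\tfrac{4}{5}\log_2\tfrac{4}{5} - \tfrac{1}{5}\log_2\tfrac{1}{5}\right) \approx 0.4125$ bits (B)
$\mathrm{Info}_{\text{exercise induced angina}}(D) = \tfrac{6}{7}\left(-\tfrac{5}{6}\log_2\tfrac{5}{6} - \tfrac{1}{6}\log_2\tfrac{1}{6}\right) \approx 0.5572$ bits (C)
$\mathrm{Info}_{\text{heart status}}(D) = \tfrac{5}{7}\left(-\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5}\right) \approx 0.6935$ bits (D)
Gain: $A - B \approx 0.4506$ bits (sex), $A - C \approx 0.3060$ bits (exercise induced angina), $A - D \approx 0.1696$ bits (heart status)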
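The entropy expressions in Step 3 can be evaluated numerically. The sketch below is a minimal Python illustration: the weights and class counts are copied from the expressions as written on the slide, while the function and variable names are assumptions introduced here. A general `information_gain` helper is included for the usual case where every partition of the split is known.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, partitions):
    """Info(D) minus the size-weighted entropy of the partitions."""
    total = sum(parent_counts)
    remainder = sum(sum(part) / total * entropy(part) for part in partitions)
    return entropy(parent_counts) - remainder


# Info(D): 5 healthy and 2 unhealthy records.
info_d = entropy([5, 2])

# (weight, class counts) for each attribute term, as written on the slide.
slide_terms = {
    "sex": (4 / 7, [4, 1]),
    "exercise_induced_angina": (6 / 7, [5, 1]),
    "heart_status": (5 / 7, [3, 2]),
}

gains = {name: info_d - weight * entropy(counts)
         for name, (weight, counts) in slide_terms.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:.4f} bits")
```

Because entropy is measured in bits with base-2 logarithms, a two-class distribution can never exceed 1 bit, which is a quick sanity check on any hand calculation.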
Case 2: Diagnosing Coronary Artery Disease via Data Mining Algorithms by Considering Laboratory and Echocardiography Features
Case 3: A data mining approach for diagnosis of coronary artery disease
Dataset
Case 2: Z-Alizadeh Sani dataset: 303 patients (54 features each)
Case 3: Z-Alizadeh Sani dataset: 303 patients (54 features each)
Objective
Case 2: Use various data mining algorithms, as a non-invasive and less
costly method, to predict the stenosis of each artery separately.
Case 3: Use low-cost, affordable feature measurements and apply the
proposed approach to estimate the probability of the CAD state.
Features
Case 2: Demographic features, laboratory and echo features
Case 3: Four feature groups: demographic, symptom and examination, ECG,
and laboratory and echo features
Two possible categories: CAD or Normal
(IF a patient's diameter narrowing is >= 50% THEN CAD, ELSE Normal)
Methods
Case 2:
Classification algorithms: C4.5, Bagging algorithm
Information gain, Gini Index, ten-fold cross-validation, confusion
matrix, performance measures
Case 3:
Classification algorithms: SMO, Naive Bayes classifier, Bagging algorithm,
Neural Network algorithm
Feature selection and feature creation, information gain, Gini Index,
association rule mining
Case 2 (METHODS)
C4.5 classification algorithm

A decision tree algorithm that extends ID3 and improves on its performance
Handles continuous attribute values by breaking
them down into sub-intervals
Uses pruning methods to improve accuracy
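Splitting a continuous attribute into sub-intervals can be illustrated by searching for the threshold with the highest information gain. This is a minimal Python sketch of the general idea only: C4.5 itself applies further refinements (such as gain ratio), and the age values, labels, and function names below are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def best_threshold(values, labels):
    """Try midpoints between sorted distinct values.

    Returns (threshold, gain) for the binary split `value <= threshold`
    with the highest information gain, or (None, 0.0) if no split helps.
    """
    pairs = list(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical ages and diagnoses, for illustration only.
ages = [40, 45, 50, 55, 60, 65]
diagnoses = ["Normal", "Normal", "Normal", "CAD", "CAD", "CAD"]
print(best_threshold(ages, diagnoses))
```

Applying the search recursively inside each resulting interval yields more than two sub-intervals when a single cut is not enough.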
Case 2 (METHODS)
Bagging Algorithm
Classifies each sample based on the output of a set of diverse base
classifiers.
Base classifiers can be selected from C4.5, Naive Bayes, ID3, and
other data mining algorithms.
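The bagging idea (train each base classifier on a bootstrap sample, then take a majority vote) can be sketched in plain Python. The decision stump used as base classifier, the toy measurements, and all function names are assumptions made for illustration; the studies used base classifiers such as C4.5 rather than stumps.

```python
import random
from collections import Counter


def train_stump(rows, labels):
    """Fit a one-feature threshold rule that minimizes training errors."""
    majority = Counter(labels).most_common(1)[0][0]
    best = None  # (errors, feature, threshold, left_label, right_label)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels)
                    if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels)
                     if row[feature] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = (Counter(right).most_common(1)[0][0]
                           if right else majority)
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    _, feature, threshold, left_label, right_label = best
    return lambda row: left_label if row[feature] <= threshold else right_label


def bagging_fit(rows, labels, n_estimators=11, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx]))
    return models


def bagging_predict(models, row):
    """Majority vote over the base classifiers."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical (age, systolic blood pressure) measurements.
rows = [[45, 110], [50, 120], [48, 115], [62, 150], [66, 160], [70, 155]]
labels = ["Normal", "Normal", "Normal", "CAD", "CAD", "CAD"]
models = bagging_fit(rows, labels)
print(bagging_predict(models, [40, 105]), bagging_predict(models, [75, 170]))
```

An odd number of estimators avoids tied votes in the two-class case.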
Case 3 (METHODS)
Sequential Minimal Optimization (SMO): an algorithm for efficiently
solving the optimization problem that arises during the training
of Support Vector Machines (SVMs)
Naive Bayes classifier: a simple probabilistic classifier based on
applying Bayes' theorem with strong independence assumptions
Bagging algorithm
Neural Network algorithm: an Artificial Neural Network (ANN) is an
interconnected group of artificial neurons that uses a mathematical or
computational model for information processing based on a
connectionist approach. It models complex relationships between
inputs and outputs.
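Of the classifiers listed, Naive Bayes is the simplest to show end to end. The following minimal Python sketch handles categorical features and assumes Laplace (add-one) smoothing, which the slides do not mention; the feature values, labels, and function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def train_naive_bayes(rows, labels):
    """Count classes and per-class feature values."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> values
    feature_values = defaultdict(set)    # feature index -> seen values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict(model, row):
    """Pick the class with the highest log posterior score.

    Features are treated as independent given the class, and add-one
    smoothing keeps unseen values from producing a zero probability.
    """
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = log(count / total)
        for i, value in enumerate(row):
            numerator = value_counts[(label, i)][value] + 1
            denominator = count + len(feature_values[i])
            score += log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical records: (chest pain type, exercise-induced angina).
rows = [("typical", True), ("typical", True), ("typical", False),
        ("none", False), ("none", False), ("none", True)]
labels = ["CAD", "CAD", "CAD", "Normal", "Normal", "Normal"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("typical", True)), predict(model, ("none", False)))
```

Working in log space avoids numerical underflow when many feature probabilities are multiplied together.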
Case 3 (METHODS)
Feature Selection
Uses the coefficients of the normal vector of a linear SVM as feature
weights
The attribute values still have to be numerical.
34 features had a weight above 0.6; these were selected and the algorithms were
applied to them.
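The selection step itself can be sketched in Python once the SVM weights are available. This sketch assumes the weights have already been obtained from a trained linear SVM and that they are normalized by the largest absolute weight before the 0.6 cut-off is applied; both the normalization and the feature names and values are assumptions for illustration.

```python
def select_features(weights, threshold=0.6):
    """Keep features whose normalized absolute weight exceeds `threshold`.

    `weights` maps feature name -> coefficient of the SVM normal vector.
    """
    largest = max(abs(w) for w in weights.values())
    normalized = {name: abs(w) / largest for name, w in weights.items()}
    return [name for name, w in normalized.items() if w > threshold]


# Hypothetical coefficients, not taken from the study.
svm_weights = {
    "age": 1.8,
    "typical_chest_pain": 2.4,
    "bmi": 0.3,
    "region_rwma": -1.6,
}
print(select_features(svm_weights))
```

The absolute value is used because a large negative coefficient is as informative for the separating hyperplane as a large positive one.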
Case 3 (METHODS)
Feature creation
3 new features: LAD (Left Anterior Descending) recognizer, LCX (Left
Circumflex) recognizer, and RCA (Right Coronary Artery) recognizer are
used to recognize whether the LAD, LCX, or RCA is blocked. The higher the
value, the higher the risk.
The available features of the dataset are first discretized into binary
variables
A value of 1 for a feature indicates a higher probability of the record being in
the CAD class, while a value of 0 indicates otherwise.
Case 3 (METHODS)
Association rule mining (mentioned in Case 1)
Support
Confidence
Case 2 and Case 3

Information gain
Measures the reduction in entropy of the data records because of a
single split over a given attribute.
The entropy before and after the split is computed
c is the class value, which can be CAD or Normal
P(c) is the probability of a record being in class c
If a feature separates the two classes completely, it has the highest
information gain and is the best feature for classification
Case 2 and Case 3

Gini Index
A measure of how often a randomly chosen element from a set of
elements would be incorrectly labeled if it were randomly labeled
according to the distribution of labels in the subset
The probability of correctly labeling an item is equal to the probability
of choosing that item
A higher Gini Index value for a feature indicates a stronger association
with the disease.
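The Gini measure can be computed directly from class labels. In this minimal Python sketch, a feature is scored by the reduction in Gini impurity that its split achieves, so that a higher score means a more useful feature; that scoring convention, the toy labels, and the function names are assumptions made for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Chance of mislabeling a random element drawn from `labels`."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_after_split(groups):
    """Size-weighted Gini impurity of the groups produced by a split."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * gini_impurity(group)
               for group in groups if group)


def gini_score(labels, groups):
    """Reduction in impurity achieved by splitting `labels` into `groups`."""
    return gini_impurity(labels) - gini_after_split(groups)


# Hypothetical labels: a feature that separates the classes perfectly.
all_labels = ["CAD"] * 5 + ["Normal"] * 5
perfect_split = [["CAD"] * 5, ["Normal"] * 5]
print(gini_impurity(all_labels), gini_score(all_labels, perfect_split))
```

For two equally frequent classes the impurity is 0.5, and a split that separates them completely recovers all of it.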
Case 1 and Case 2

Performance measure: Accuracy, sensitivity, and specificity are the
most important performance measures in the medical field
Confusion matrix: a table that allows visualization of the performance
of an algorithm
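Accuracy, sensitivity, and specificity all follow from the four cells of a two-class confusion matrix. The sketch below is a minimal Python illustration with CAD as the positive class; the example labels and function names are hypothetical.

```python
def confusion_matrix(actual, predicted, positive="CAD"):
    """Return (tp, fp, tn, fn) for a two-class problem."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def performance(tp, fp, tn, fn):
    """Accuracy, sensitivity (true positive rate), and specificity."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical diagnoses, for illustration only.
actual = ["CAD"] * 4 + ["Normal"] * 6
predicted = ["CAD", "CAD", "CAD", "Normal",
             "Normal", "Normal", "Normal", "Normal", "CAD", "CAD"]
cells = confusion_matrix(actual, predicted)
print(cells, performance(*cells))
```

Sensitivity reports how many true CAD cases were caught, while specificity reports how many healthy patients were correctly cleared, which is why both are tracked alongside accuracy.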
Discussion (Improve Accuracy of CAD Diagnosis by Using Data Mining Techniques)
Diagram elements:
Understand CAD
CAD Risk Features
Dataset with Effective Features: Feature Selection, Feature Creation, Information Gain, Gini Index
Data Mining Methods (RapidMiner): C4.5, Bagging Algorithm, SMO Algorithm, Naive Bayes algorithm, Neural Network algorithm, Association Rule Mining
Rules Extracted: Confidence
Performance Measurement Results: Confusion Matrix, Sensitivity, Specificity, Accuracy
Conclusion
Using feature selection methods can increase the accuracy of CAD diagnosis
(though it may sometimes decrease the accuracy of LAD and RCA stenosis diagnosis).
To enrich the dataset, we may need to create new features that have a vital
influence on the accuracy of CAD diagnosis.
Rules extracted by association rule mining may not be 100% correct; we
need more testing data to test the rules.
The results of the standard angiographic method are still needed as the
basis of comparison to assess the prediction capability of the classification algorithms.
