Documente Academic
Documente Profesional
Documente Cultură
Data Modeling
Data Science
• Multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract
knowledge and insight from structured and
unstructured data
• Central concept is gaining insight from data
• Machine learning uses data to extract knowledge
Data Preprocessing
Data
Feature
Database Cleaning and
Representation
Cleansing
2
1
15-09-2019
Data Science
• Multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract
knowledge and insight from structured and
unstructured data
• Central concept is gaining insight from data
• Machine learning uses data to extract knowledge
Data Preprocessing
Data
Feature
Database Cleaning and
Representation
Cleansing
3
2
15-09-2019
Data Science
• Multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract
knowledge and insight from structured and
unstructured data
• Central concept is gaining insight from data
• Machine learning uses data to extract knowledge
Data Preprocessing
Data
Feature
Database Cleaning and
Representation
Cleansing
5
3
15-09-2019
Pattern Classification
Classification
• Problem of identifying to which of a set of categories a
new observation belongs
• Predicts categorical labels
• Example:
– Assigning a given email to the "spam" or "non-spam"
class
– Assigning a diagnosis (disease) to a given patient based
on observed characteristics of the patient
• Classification is a two step process
– Step1: Building a classifier (data modeling)
• Learning from data (training phase)
– Step2: Using classification model for classification
• Testing phase
4
15-09-2019
10
5
15-09-2019
2-class Classification
• Example: Classifying a person as child or adult
Weight (x2)
Adult
Height, x1 Class
Adult/Child
Classifier Child
Weight, x2
Adult :Class C1
Child :Class C2 Height (x1)
x = [x1 x2]T
11
Weight
in Kg
Height in cm 12
6
15-09-2019
13
7
15-09-2019
15
Feature
Training extraction
98 28.43 Child Classifier
Phase
Feature
extraction
183 90 Adult
Feature
extraction
163 67.45 Adult
16
8
15-09-2019
17
Feature
extraction Training Examples
90 21.5 Child
Feature
extraction
100 32.45 Child
Class label
Feature
(Adult)
Training extraction
98 28.43 Child Classifier
Phase
Feature
extraction
183 90 Adult
Feature
extraction
163 67.45 Adult
Feature
extraction
Testing
150 50.6
Phase
18
9
15-09-2019
x2 x2 x2
x1 x1 x1
Linearly Nonlinearly Overlapping
separable separable classes
classes classes
19
Nearest-Neighbour Method
• Training data with N samples: D {x n , y n }n 1 ,
N
x n R d and y n {1, 2, , M }
– d: dimension of input example
– M: Number of classes
• Step 1: Compute Euclidian distance for a test example
x with every training examples, x1, x2, …, xn, …, xN
Euclidean distance x n x
(x n x)T (x n x)
x2 d
T
x [ x1 x2 ]
(x
i 1
ni xi ) 2
x1
20
10
15-09-2019
Nearest-Neighbour Method
• Training data:D {x n , y n }n 1 ,
N
x n R d and yn {1, 2, , M }
– d: dimension of input example
– M: Number of classes
• Step 1: Compute Euclidian distance for a test example
x with every training examples, x1, x2, …, xn, …, xN
• Step 2: Sort the examples
in the training set in the
ascending order of the
x2 distance to x
Weight
in Kg
Height in cm
• Step 1: Compute Euclidian distance
(ED) will each training examples
22
11
15-09-2019
Weight
in Kg
Height in cm
• Step 2: Sort the examples in the
training set in the ascending order
of the distance to test example
23
Weight
in Kg
Height in cm
• Step 3: Assign the class of the
training example with the
minimum distance to the test
example
– Class: Adult
24
12
15-09-2019
Nearest-Neighbour Method
• Training data:D {x n , y n }n 1 ,
N
x n R d and yn {1, 2, , M }
– d: dimension of input example
– M: Number of classes
• Step 1: Compute Euclidian distance for a test example
x with every training examples, x1, x2, …, xn, …, xN
• Step 2: Sort the examples
in the training set in the
ascending order of the
x2 distance to x
Weight
in Kg
Height in cm
• Step 1: Compute Euclidian distance
(ED) will each training examples
26
13
15-09-2019
Weight
in Kg
Height in cm
• Step 2: Sort the examples in the
training set in the ascending order
of the distance to test example
27
Weight
in Kg
Height in cm
• Step 3: Assign the class of the
training example with the minimum
distance to the test example
– Class: Adult
28
14
15-09-2019
Euclidean distance x n x
(x n x)T (x n x)
x2 d
T
x [ x1 x2 ] (x
i 1
ni xi ) 2
x1
29
15
15-09-2019
Weight
in Kg
Height in cm
• Consider K=5
• Step 3: Choose the first K=5
examples in the sorted list
31
Weight
in Kg
Height in cm
• Consider K=5
• Step 4: Test example is assigned
the most common class among its
K neighbours
– Class: Adult
32
16
15-09-2019
33
Data Normalization
• Since the distance measure is used, K-NN classifier
require normalising the values of each attribute
• Normalising the training data:
– Compute the minimum and maximum values of each of
the attributes in the training data
– Store the minimum and maximum values of each of the
attributes
– Perform the min-max normalization on training data set’
• Normalizing the test data:
– Use the stored minimum and maximum values of each
of the attributes from training set to normalise the test
examples
• NOTE: Ensure that test examples are not causing out-
of-bound error
34
17
15-09-2019
18
15-09-2019
Confusion Matrix
Actual Class
Class1 Class2
(Positive) (Negative)
Predicted
Class Class1
Class
True Positive False Positive
(Positive)
Class2 False
True Negative
(Negative) Negative
Confusion Matrix
Actual Class
Class1 Class2
Class Predicted
(Positive) (Negative)
Total
samples
in class1
38
19
15-09-2019
Confusion Matrix
Actual Class
Class1 Class2
Class Predicted
(Positive) (Negative)
Total
samples
in class2
39
Confusion Matrix
Actual Class
Class1 Class2
Class Predicted
(Positive) (Negative)
40
20
15-09-2019
Confusion Matrix
Actual Class
Class1 Class2
Class Predicted
(Positive) (Negative)
Total samples
Class2 False True
predicted as
(Negative) Negative Negative
class1
41
Accuracy
•
Actual Class
Class1 Class2
(Positive) (Negative)
Predicted
Class1
Class
Class
42
21
15-09-2019
Predicted Class
Class1
C11 C21 C31
Class2
C12 C22 C32
Actual Class
Class1
C11 C21 C31 predicted as
class1
Total samples
Class2
C12 C22 C32 predicted as
class2
Total samples
Class2
C13 C23 C33 predicted as
class3
Total samples Total samples Total samples in
Total
in class1 in class2 class3
22
15-09-2019
Actual Class
Class1
C11 C21 C31
Class2
C12 C22 C32
45
Text Books
1. J. Han and M. Kamber, Data Mining: Concepts and
Techniques, Third Edition, Morgan Kaufmann Publishers,
2011.
46
23
15-09-2019
Thank You
47
24