[Figure: example of classifying documents into categories: Academic, Professional, Culture]
Likelihood of $s_k$ belonging to $C_i$:

$$\mathrm{Prob}(C_i \mid v_1, v_2, \ldots, v_m) = \frac{P(v_1, v_2, \ldots, v_m \mid C_i)\,P(C_i)}{P(v_1, v_2, \ldots, v_m)}$$

Likelihood of $s_k$ belonging to $C_j$:

$$\mathrm{Prob}(C_j \mid v_1, v_2, \ldots, v_m) = \frac{P(v_1, v_2, \ldots, v_m \mid C_j)\,P(C_j)}{P(v_1, v_2, \ldots, v_m)}$$

Under the naive conditional-independence assumption, the class-conditional probability factorizes over the attributes:

$$P(v_1, v_2, \ldots, v_m \mid C_j) = \prod_{h=1}^{m} P(A_h = v_h \mid C_j)$$
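To make the scoring rule concrete, here is a minimal Python sketch of the naive Bayes decision: multiply the class prior by the per-attribute conditional probabilities and pick the class with the largest product. The shared denominator $P(v_1, \ldots, v_m)$ can be ignored when comparing classes. The function and variable names are illustrative, not from the slides.

```python
def naive_bayes_predict(instance, priors, cond_probs):
    """Pick the class c maximizing P(c) * prod over attributes of P(A_h = v_h | c).

    priors:     dict mapping class -> P(c)
    cond_probs: dict mapping (attribute, value, class) -> P(A = v | c)
    instance:   dict mapping attribute -> observed value
    """
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for attr, value in instance.items():
            score *= cond_probs[(attr, value, c)]
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score
```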
Conditional probabilities estimated from the 14-day weather data (each fraction is the count within a class divided by the class size):

outlook        play=yes   play=no
  sunny          2/9        3/5
  overcast       4/9        0/5
  rainy          3/9        2/5

temperature    play=yes   play=no
  hot            2/9        2/5
  mild           4/9        2/5
  cool           3/9        1/5

humidity       play=yes   play=no
  high           3/9        4/5
  normal         6/9        1/5

windy          play=yes   play=no
  false          6/9        2/5
  true           3/9        3/5

play: P(yes) = 9/14, P(no) = 5/14
A new day:

outlook   temperature   humidity   windy   play
sunny     cool          high       true    ?

Likelihood of yes:

$$\frac{2}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{9}{14} = 0.0053$$

Likelihood of no:

$$\frac{3}{5} \times \frac{1}{5} \times \frac{4}{5} \times \frac{3}{5} \times \frac{5}{14} = 0.0206$$

Since 0.0206 > 0.0053, the prediction is no.
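A quick sketch that reproduces these two numbers from the table above, using nothing beyond the fractions already shown:

```python
from fractions import Fraction as F

# P(sunny|yes) * P(cool|yes) * P(high|yes) * P(true|yes) * P(yes)
likelihood_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)
# P(sunny|no) * P(cool|no) * P(high|no) * P(true|no) * P(no)
likelihood_no = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)

print(float(likelihood_yes))  # ~0.0053
print(float(likelihood_no))   # ~0.0206
print("no" if likelihood_no > likelihood_yes else "yes")  # no
```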
When an attribute is numeric, the raw values are kept per class and summarized by a mean and a standard deviation (same 14-day data, with temperature and humidity now numeric):

outlook        play=yes   play=no
  sunny          2/9        3/5
  overcast       4/9        0/5
  rainy          3/9        2/5

temperature    play=yes                             play=no
  values         83, 70, 68, 64, 69, 75, 75, 72, 81   85, 80, 65, 72, 71
  mean           73                                   74.6
  std dev        6.2                                  7.9

humidity       play=yes                             play=no
  values         86, 96, 80, 65, 70, 80, 70, 90, 75   85, 90, 70, 95, 91
  mean           79.1                                 86.2
  std dev        10.2                                 9.7

windy          play=yes   play=no
  false          6/9        2/5
  true           3/9        3/5

play: P(yes) = 9/14, P(no) = 5/14
The per-class mean and (sample) standard deviation are

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \mu)^2$$

and a numeric attribute value $w$ is scored with the normal density

$$f(w) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(w-\mu)^2}{2\sigma^2}}$$
For example,

$$f(\text{temperature} = 66 \mid \text{yes}) = \frac{1}{\sqrt{2\pi}\cdot 6.2}\, e^{-\frac{(66-73)^2}{2\cdot 6.2^2}} = 0.0340$$
For a new day with outlook = sunny, temperature = 66, humidity = 90, and windy = true:

$$\text{Likelihood of yes} = \frac{2}{9} \times 0.0340 \times 0.0221 \times \frac{3}{9} \times \frac{9}{14} = 0.000036$$

$$\text{Likelihood of no} = \frac{3}{5} \times 0.0291 \times 0.0380 \times \frac{3}{5} \times \frac{5}{14} = 0.000136$$

Since the likelihood of no is larger, the prediction is no.
Instance-Based Learning
In instance-based learning, we take the k nearest training samples of a new instance (v1, v2, ..., vm) and assign the new instance to the class with the most instances among those k nearest training samples. Classifiers that adopt instance-based learning are commonly called k-nearest-neighbor (KNN) classifiers.
If the data set is noiseless, the 1NN classifier should work well.
In general, the noisier the data set, the higher k should be set.
In practice, the optimal k value should be determined through cross validation.
The ranges of the attribute values should be normalized before the KNN classifier is applied, so that no single attribute dominates the distance computation (a sketch follows below). There are two common normalization approaches. The first is min-max scaling:

$$w = \frac{v - v_{\min}}{v_{\max} - v_{\min}}$$
The second is z-score standardization:

$$w = \frac{v - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the attribute's mean and standard deviation.
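A minimal sketch of a KNN classifier with min-max normalization, assuming numeric attributes and Euclidean distance (common defaults, though the slides do not fix a distance measure); all names are illustrative:

```python
from collections import Counter
import math

def min_max_normalize(rows):
    """Rescale each attribute to [0, 1] using w = (v - v_min) / (v_max - v_min)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi))
        for row in rows
    ]

def knn_predict(train_x, train_y, query, k):
    """Assign the query to the majority class among its k nearest training samples."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]
```

Note that a query instance must be rescaled with the training set's minima and maxima before `knn_predict` is called, so that it lies on the same [0, 1] scale as the training samples.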
Cross Validation
Most data classification algorithms require some parameters to be set, e.g., k in the KNN classifier and the tree-pruning threshold in the decision tree.
One way to find an appropriate parameter setting is through k-fold cross validation, normally with k = 10 (this k is unrelated to the k of the KNN classifier).
In k-fold cross validation, the training data set is divided into k subsets. Then k runs of the classification algorithm are conducted, with each subset serving as the test set once while the remaining (k-1) subsets serve as the training set.
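A minimal sketch of k-fold cross validation under the description above; `train_and_test` is a hypothetical stand-in for any classification algorithm that trains on the given folds and returns predictions for the held-out fold:

```python
def k_fold_cross_validation(data, labels, train_and_test, k=10):
    """Average accuracy over k runs, each holding out one fold as the test set.

    train_and_test(train_x, train_y, test_x) must return predicted labels
    for test_x; it stands in for any classification algorithm.
    """
    n = len(data)
    folds = [range(i * n // k, (i + 1) * n // k) for i in range(k)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train_x = [x for i, x in enumerate(data) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        test_x = [data[i] for i in fold]
        test_y = [labels[i] for i in fold]
        preds = train_and_test(train_x, train_y, test_x)
        correct = sum(p == t for p, t in zip(preds, test_y))
        accuracies.append(correct / len(test_y))
    return sum(accuracies) / k
```

In practice the data should be shuffled before the folds are formed, and the parameter value with the best cross-validated accuracy is then used to train the final classifier on the full training set.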