
Classification and prediction

Data Mining: Concepts and Techniques, Chapters 8.1-8.3 and 9.5.1
Partly based on slides prepared by Jiawei Han

Type of method
Infrastructure, preparation, exploration, analysis, interpretation
Exploration, supervised, unsupervised
Classification, prediction

Process

Process (1): Model Construction


Classification Algorithms

Training Data

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classifier (Model)

IF rank = professor OR years > 6
THEN tenured = yes

Process (2): Using the Model in Prediction

Classifier

Testing Data

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data

(Jeff, Professor, 4): Tenured?
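The two process steps can be sketched in a few lines of Python. This is only an illustration: the rule is the one stated on the slide, while the function and variable names (`predict_tenured`, `accuracy`) are made up for this sketch.

```python
# Rule from the slide: IF rank = professor OR years > 6 THEN tenured = yes
def predict_tenured(rank: str, years: int) -> str:
    return "yes" if rank == "Professor" or years > 6 else "no"

# Training data (used to build the model) and testing data (used to check it)
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]
testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def accuracy(rows):
    # Fraction of rows where the rule agrees with the known label
    correct = sum(predict_tenured(rank, years) == label
                  for _, rank, years, label in rows)
    return correct / len(rows)

train_acc = accuracy(training)
test_acc = accuracy(testing)
jeff = predict_tenured("Professor", 4)  # the unseen tuple (Jeff, Professor, 4)

print(f"training accuracy: {train_acc:.2f}")
print(f"testing accuracy:  {test_acc:.2f}")
print(f"Jeff tenured? {jeff}")
```

The rule fits all six training tuples but misclassifies Merlisa in the testing data, which is why accuracy is measured on data the model has not seen.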

Decision trees

Information gain
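Information gain:

\[ \mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \]

Information before split:

\[ \mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i) \]

Information after split:

\[ \mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, \mathrm{Info}(D_j) \]

Here p_i is the fraction of tuples in D that belong to class i (of m classes), and D_1, ..., D_v are the partitions of D produced by splitting on attribute A.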
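As a rough sketch, the three quantities can be computed directly from class counts. The example below assumes the 14-tuple weather ("play") data set that these slides use for the naive Bayes exercise; the function names `info`, `info_after_split`, and `gain` are illustrative, not taken from any library.

```python
from collections import Counter
from math import log2

ATTRS = ["outlook", "temp", "humidity", "windy"]
# Each row: outlook, temp, humidity, windy, play (class label)
DATA = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]

def info(labels):
    # Info(D) = -sum_i p_i * log2(p_i) over the classes present in labels
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_after_split(rows, attr_index):
    # Info_A(D) = sum_j |D_j| / |D| * Info(D_j), one partition per attribute value
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[-1])
    n = len(rows)
    return sum(len(p) / n * info(p) for p in partitions.values())

def gain(rows, attr_index):
    # Gain(A) = Info(D) - Info_A(D)
    return info([r[-1] for r in rows]) - info_after_split(rows, attr_index)

gains = {name: gain(DATA, i) for i, name in enumerate(ATTRS)}
for name, g in gains.items():
    print(f"Gain({name}) = {g:.3f}")
best = max(gains, key=gains.get)
print("split on:", best)
```

On this data the largest gain belongs to the outlook attribute, so a gain-based tree would split on it first.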

Try it: decision tree induction

Concepts
Overfitting
Pruning: prepruning and postpruning

Naive Bayes


Naive Bayes
Bayes' theorem:

\[ P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} \]

Naive Bayes classification:

Class Ci is hypothesis H
Other attributes are evidence X

Independence assumption:

\[ P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \dots \times P(x_n \mid C_i) \]

Estimate from training set

P(Ci) from class frequency
Nominal attributes:
P(xk|Ci) from occurrence of xk with instances in Ci

Numerical continuous attributes:

Gaussian distribution with mean μ_Ci and standard deviation σ_Ci
μ_Ci and σ_Ci from values of xk with instances in Ci

\[ P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \exp\!\left(-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}\right) \]
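For a continuous attribute, the estimate could be sketched as below. The attribute values are hypothetical numbers invented for this illustration (the slides give none), and using the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption.

```python
from math import exp, pi, sqrt
from statistics import mean, stdev

def gaussian(x, mu, sigma):
    # g(x, mu, sigma) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# HYPOTHETICAL values of one numeric attribute x_k for the instances of a class C_i
values_in_class = [66, 70, 72, 68]

mu = mean(values_in_class)      # estimate of the class mean
sigma = stdev(values_in_class)  # sample standard deviation (an assumption)

# Density used in place of P(x_k | C_i) for a new value x_k = 69
likelihood = gaussian(69, mu, sigma)
print(f"mu = {mu}, sigma = {sigma:.3f}, g(69) = {likelihood:.4f}")
```

The returned value is a density rather than a probability, so it can exceed 1 for small standard deviations; only its relative size across classes matters.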

Try it:

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

New day: Predict play

Outlook   Temp.  Humidity  Windy  Play
Sunny     Cool   High      True   ?


Attribute    Value     Yes  No  Fraction of Yes  Fraction of No
Outlook      Sunny     2    3   2/9              3/5
Outlook      Overcast  4    0   4/9              0/5
Outlook      Rainy     3    2   3/9              2/5
Temperature  Hot       2    2   2/9              2/5
Temperature  Mild      4    2   4/9              2/5
Temperature  Cool      3    1   3/9              1/5
Humidity     High      3    4   3/9              4/5
Humidity     Normal    6    1   6/9              1/5
Windy        False     6    2   6/9              2/5
Windy        True      3    3   3/9              3/5

Play: Yes 9 (9/14), No 5 (5/14)

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Likelihood of the two classes
For yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(no) = 0.0206 / (0.0053 + 0.0206) = 0.795
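The hand calculation can be checked with a small counting-based sketch. It is a minimal illustration for nominal attributes only, without smoothing; the names `train` and `classify` are made up here and do not refer to any library.

```python
from collections import Counter, defaultdict

# Each row: outlook, temp, humidity, windy, play (class label)
DATA = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]

def train(rows):
    # Count class frequencies and, per (attribute index, class), value frequencies
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)
    for row in rows:
        for i, value in enumerate(row[:-1]):
            value_counts[(i, row[-1])][value] += 1
    return class_counts, value_counts

def classify(x, class_counts, value_counts):
    # Likelihood per class: P(Ci) * product over k of P(xk | Ci)
    n = sum(class_counts.values())
    scores = {}
    for c, count in class_counts.items():
        score = count / n
        for i, value in enumerate(x):
            score *= value_counts[(i, c)][value] / count
        scores[c] = score
    total = sum(scores.values())
    probs = {c: s / total for c, s in scores.items()}  # normalization
    return scores, probs

class_counts, value_counts = train(DATA)
new_day = ("Sunny", "Cool", "High", "True")
scores, probs = classify(new_day, class_counts, value_counts)
print({c: round(s, 4) for c, s in scores.items()})
print({c: round(p, 3) for c, p in probs.items()})
```

Because no smoothing is applied, a value that never occurs with a class (such as Overcast with No) would force that class's likelihood to zero.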

Concepts
Zero-frequency problem
Smoothing / Laplacian correction
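In the weather data, Outlook = Overcast never occurs with Play = No, so the unsmoothed estimate is 0/5. A minimal sketch of the Laplacian correction, assuming the common add-one variant (the parameter name `alpha` and the function name are illustrative):

```python
from fractions import Fraction

def laplace_estimate(count, class_total, n_values, alpha=1):
    # (count + alpha) / (class_total + alpha * number of distinct attribute values)
    return Fraction(count + alpha, class_total + alpha * n_values)

# Counts of each Outlook value among the 5 instances with Play = No
outlook_given_no = {"Sunny": 3, "Overcast": 0, "Rainy": 2}
class_total = sum(outlook_given_no.values())

smoothed = {
    value: laplace_estimate(count, class_total, len(outlook_given_no))
    for value, count in outlook_given_no.items()
}
for value, p in smoothed.items():
    print(f"P({value} | No) = {p}")
```

The smoothed estimates still sum to 1, and Overcast now gets 1/8 instead of 0.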


K-nearest neighbor


Concepts
Lazy learner
Distance function
Which ones?
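A lazy learner stores the training tuples and does all its work at prediction time. The sketch below assumes one possible distance function for nominal attributes, the Hamming distance (number of attributes that differ), and reuses the weather data from the naive Bayes exercise; it is not meant as the answer to "which ones?", and the function names are illustrative.

```python
from collections import Counter

# Each row: outlook, temp, humidity, windy, play (class label)
DATA = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]

def hamming(a, b):
    # Number of positions where the two tuples differ (assumed distance choice)
    return sum(x != y for x, y in zip(a, b))

def knn_predict(rows, query, k=3, distance=hamming):
    # No training step: sort stored tuples by distance to the query at prediction time.
    # sorted() is stable, so ties in distance are broken by original row order.
    neighbors = sorted(rows, key=lambda row: distance(row[:-1], query))[:k]
    votes = Counter(row[-1] for row in neighbors)
    return votes.most_common(1)[0][0]

new_day = ("Sunny", "Cool", "High", "True")
pred_k1 = knn_predict(DATA, new_day, k=1)
pred_k3 = knn_predict(DATA, new_day, k=3)
print("k=1:", pred_k1)
print("k=3:", pred_k3)
```

With many tuples at the same Hamming distance, the result for larger k depends on how ties are broken, which is one reason the choice of distance function matters.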


And now
Assignment classification, classification 2

