
CSE446: Naive Bayes

Winter 2012
Luke Zettlemoyer


Slides adapted from Carlos Guestrin and Dan Klein
Supervised Learning: find f

Given: Training set $\{(x^i, y^i) \mid i = 1, \dots, N\}$
Find: A good approximation to $f : X \to Y$
Examples: what are X and Y?

Spam Detection
Map email to {Spam, Ham}
Digit recognition
Map pixels to {0,1,2,3,4,5,6,7,8,9}
Stock Prediction
Map news, historic prices, etc. to the real numbers
Classification
Example: Spam Filter
Input: email
Output: spam/ham
Setup:
Get a large collection of
example emails, each
labeled spam or ham
Note: someone has to hand
label all this data!
Want to learn to predict
labels of new, future emails
Features: The attributes used to
make the ham / spam decision
Words: FREE!
Text Patterns: $dd, CAPS
Non-text: SenderInContacts
Dear Sir.

First, I must solicit your confidence in this
transaction, this is by virture of its nature
as being utterly confidencial and top
secret. !
TO BE REMOVED FROM FUTURE
MAILINGS, SIMPLY REPLY TO THIS
MESSAGE AND PUT "REMOVE" IN THE
SUBJECT.

99 MILLION EMAIL ADDRESSES
FOR ONLY $99
Ok, I know this is blatantly OT but I'm
beginning to go insane. Had an old Dell
Dimension XPS sitting in the corner and
decided to put it to use, I know it was
working pre being stuck in the corner, but
when I plugged it in, hit the power nothing
happened.
Example: Digit Recognition
Input: images / pixel grids
Output: a digit 0-9
Setup:
Get a large collection of example
images, each labeled with a digit
Note: someone has to hand label all
this data!
Want to learn to predict labels of new,
future digit images
Features: The attributes used to make the
digit decision
Pixels: (6,8)=ON
Shape Patterns: NumComponents,
AspectRatio, NumLoops
[Figure: example digit images labeled 0, 1, 2, 1, and one unlabeled image: ??]
Other Classification Tasks
In classification, we predict labels y (classes) for inputs x
Examples:
Spam detection (input: document, classes: spam / ham)
OCR (input: images, classes: characters)
Medical diagnosis (input: symptoms, classes: diseases)
Automatic essay grader (input: document, classes: grades)
Fraud detection (input: account activity, classes: fraud / no fraud)
Customer service email routing
... many more
Classification is an important commercial technology!
Let's take a probabilistic approach!!!
Can we directly estimate the data
distribution P(X,Y)?
How do we represent these?
How many parameters?
Prior, P(Y):
Suppose Y is composed of k classes
Likelihood, P(X|Y):
Suppose X is composed of n binary
features
Complex model → High variance
with limited data!!!
mpg cylinders displacement horsepower weight acceleration modelyear maker
good 4 low low low high 75to78 asia
bad 6 medium medium medium medium 70to74 america
bad 4 medium medium medium low 75to78 europe
bad 8 high high high low 70to74 america
bad 6 medium medium medium medium 70to74 america
bad 4 low medium low medium 70to74 asia
bad 4 low medium low low 70to74 asia
bad 8 high high high low 75to78 america
: : : : : : : :
: : : : : : : :
: : : : : : : :
bad 8 high high high low 70to74 america
good 8 high medium high high 79to83 america
bad 8 high high high low 75to78 america
good 4 low low low low 79to83 america
bad 6 medium medium medium high 75to78 america
good 4 medium low low low 79to83 america
good 4 low low medium high 79to83 america
bad 8 high high high low 70to74 america
good 4 low medium low medium 75to78 europe
bad 5 medium medium medium medium 75to78 europe
Conditional Independence
X is conditionally independent of Y given Z, if
the probability distribution governing X is
independent of the value of Y, given the value
of Z:

$(\forall x, y, z)\;\; P(X = x \mid Y = y, Z = z) = P(X = x \mid Z = z)$

e.g., $P(\text{Thunder} \mid \text{Rain}, \text{Lightning}) = P(\text{Thunder} \mid \text{Lightning})$

Equivalent to: $P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$
Naive Bayes
Naive Bayes assumption:
Features are independent given class:

$P(X_1, X_2 \mid Y) = P(X_1 \mid X_2, Y)\, P(X_2 \mid Y) = P(X_1 \mid Y)\, P(X_2 \mid Y)$

More generally:

$P(X_1, \dots, X_n \mid Y) = \prod_i P(X_i \mid Y)$

How many parameters now?
Suppose X is composed of n binary features
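
To make the comparison concrete, here is a sketch of the parameter count, assuming k classes and n binary features (the numeric example is illustrative; the exact totals were not in the extracted slides):

\begin{align*}
\text{Full joint } P(X_1, \dots, X_n \mid Y) &: \quad k\,(2^n - 1) \text{ parameters} \\
\text{Naive Bayes } \textstyle\prod_i P(X_i \mid Y) &: \quad k\,n \text{ parameters} \\
\text{e.g., } n = 30,\; k = 2 &: \quad 2\,(2^{30} - 1) \approx 2 \times 10^9 \;\text{ vs. } 60
\end{align*}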
The Naive Bayes Classifier
Given:
Prior P(Y)
n conditionally independent features X given the class Y
For each X_i, we have likelihood P(X_i | Y)
Decision rule:

$y^* = h_{NB}(x) = \arg\max_y P(y) \prod_i P(x_i \mid y)$

If certain assumption holds, NB is optimal classifier! Will
discuss at end of lecture!

[Figure: Naive Bayes graphical model -- class node Y with feature nodes X_1, X_2, ..., X_n as its children]
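
As a concrete illustration of the decision rule above, here is a minimal Python sketch, assuming the prior and likelihood tables have already been estimated (function and argument names are illustrative, not from the slides); log-probabilities are summed instead of multiplying, to avoid numerical underflow on long feature vectors:

import math

def nb_predict(x, priors, likelihoods):
    """Naive Bayes decision rule: argmax_y P(y) * prod_i P(x_i | y).

    priors:      dict mapping class y -> P(Y = y)
    likelihoods: dict mapping (i, x_i, y) -> P(X_i = x_i | Y = y)
    """
    best_y, best_score = None, -math.inf
    for y, p_y in priors.items():
        score = math.log(p_y)
        for i, x_i in enumerate(x):
            # unseen (feature, value, class) triples get a tiny floor
            score += math.log(likelihoods.get((i, x_i, y), 1e-12))
        if score > best_score:
            best_y, best_score = y, score
    return best_y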

A Digit Recognizer
Input: pixel grids
Output: a digit 0-9
Naïve Bayes for Digits (Binary Inputs)
Simple version:
One feature F_ij for each grid position <i,j>
Possible feature values are on / off, based on whether intensity
is more or less than 0.5 in underlying image
Each input maps to a feature vector, e.g.
Here: lots of features, each is binary valued
Naïve Bayes model:

$P(Y \mid F_{0,0}, \dots, F_{n,n}) \propto P(Y) \prod_{i,j} P(F_{i,j} \mid Y)$

Are the features independent given class?
What do we need to learn?
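
A small sketch of the on/off feature extraction described above, assuming the image arrives as a 2-D array of floats in [0, 1] (the function name and threshold argument are illustrative):

def binarize(image, threshold=0.5):
    """Map a 2-D grid of pixel intensities to on/off features F_ij.

    image: list of rows, each a list of floats in [0, 1].
    Returns a dict (i, j) -> 1 if intensity exceeds the threshold, else 0.
    """
    return {(i, j): int(pixel > threshold)
            for i, row in enumerate(image)
            for j, pixel in enumerate(row)}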
Example Distributions

(Left column: the prior P(Y). Middle and right columns: P(F = on | Y) for two
example grid positions; which positions they were was lost in extraction.)

 y    P(Y=y)   P(F=on|Y=y)   P(F'=on|Y=y)
 1     0.1        0.01           0.05
 2     0.1        0.05           0.01
 3     0.1        0.05           0.90
 4     0.1        0.30           0.80
 5     0.1        0.80           0.90
 6     0.1        0.90           0.90
 7     0.1        0.05           0.25
 8     0.1        0.60           0.85
 9     0.1        0.50           0.60
 0     0.1        0.80           0.80
MLE for the parameters of NB
Given dataset:
Count(A=a, B=b) = number of examples
where A=a and B=b
MLE for discrete NB, simply:
Prior:

$P(Y = y) = \frac{\text{Count}(Y = y)}{\sum_{y'} \text{Count}(Y = y')}$

Likelihood:

$P(X_i = x \mid Y = y) = \frac{\text{Count}(X_i = x, Y = y)}{\sum_{x'} \text{Count}(X_i = x', Y = y)}$
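
A compact sketch of these two counting estimates (names are illustrative; `data` is assumed to be a list of (feature_tuple, label) pairs):

from collections import Counter

def mle_discrete_nb(data):
    """MLE for discrete naive Bayes: normalized co-occurrence counts.

    data: list of (x, y) pairs, where x is a tuple of discrete feature values.
    Returns (priors, likelihoods) as plain dicts of probabilities.
    """
    y_counts = Counter(y for _, y in data)
    xy_counts = Counter((i, x_i, y) for x, y in data
                        for i, x_i in enumerate(x))
    n = len(data)
    priors = {y: c / n for y, c in y_counts.items()}
    likelihoods = {(i, x_i, y): c / y_counts[y]
                   for (i, x_i, y), c in xy_counts.items()}
    return priors, likelihoods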
Subtleties of NB classifier 1 -- Violating
the NB assumption
Usually, features are not conditionally independent:

$P(X_1, \dots, X_n \mid Y) \neq \prod_i P(X_i \mid Y)$

Actual probabilities P(Y|X) often biased towards 0 or 1
Nonetheless, NB is the single most used classifier out
there
NB often performs well, even when assumption is violated
[Domingos & Pazzani '96] discuss some conditions for good
performance
Subtleties of NB classifier 2: Overfitting
[Example on the digit data: with unsmoothed estimates, "2 wins!!"]
For binary features: we
already know the answer!
MAP: use most likely parameter
Beta prior equivalent to extra observations for each
feature
As N → ∞, prior is "forgotten"
But, for small sample sizes, prior is important!
Recall, for a Bernoulli parameter:

$\hat{\theta} = \arg\max_\theta \ln P(D \mid \theta)$

$\frac{d}{d\theta} \ln P(D \mid \theta) = \frac{d}{d\theta}\left[\alpha_H \ln\theta + \alpha_T \ln(1-\theta)\right] = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1-\theta} = 0$

With a Beta prior, $P(\theta) \propto \theta^{\beta_H - 1}(1-\theta)^{\beta_T - 1}$, and
$P(\theta \mid D) \propto P(D \mid \theta)\,P(\theta)$:

$P(\theta \mid D) \propto \theta^{\alpha_H + \beta_H - 1}(1-\theta)^{\alpha_T + \beta_T - 1} = \text{Beta}(\alpha_H + \beta_H,\; \alpha_T + \beta_T)$

$\hat{\theta}_{MAP} = \frac{\alpha_H + \beta_H - 1}{\alpha_H + \beta_H + \alpha_T + \beta_T - 2}$
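
A two-line sketch of the MAP estimate above, treating the Beta hyperparameters as extra observations (names and default hyperparameters are illustrative):

def theta_map(heads, tails, beta_h=2, beta_t=2):
    """MAP estimate of a Bernoulli parameter under a Beta(beta_h, beta_t) prior.

    Equivalent to adding (beta_h - 1) extra heads and (beta_t - 1) extra tails
    to the observed counts; the prior is forgotten as the counts grow.
    """
    return (heads + beta_h - 1) / (heads + tails + beta_h + beta_t - 2)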
Multinomials: Laplace Smoothing
Laplace's estimate:
Pretend you saw every outcome k
extra times:

$P_{LAP,k}(x) = \frac{c(x) + k}{N + k|X|}$

What's Laplace with k = 0?
k is the strength of the prior
Can derive this as a MAP estimate
for multinomial with Dirichlet priors
Laplace for conditionals:
Smooth each condition
independently:

$P_{LAP,k}(x \mid y) = \frac{c(x, y) + k}{c(y) + k|X|}$
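
A minimal sketch of the unconditional Laplace estimate (the function name is illustrative; `counts` is assumed to enumerate the full outcome space X, so that len(counts) equals |X|):

def laplace_estimate(counts, k=1):
    """Laplace-smoothed multinomial estimate: P(x) = (c(x) + k) / (N + k|X|).

    counts: dict mapping each outcome x in X to its observed count c(x).
    """
    n = sum(counts.values())
    denom = n + k * len(counts)
    return {x: (c + k) / denom for x, c in counts.items()}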
Text classification
Classify e-mails
Y = {Spam, NotSpam}
Classify news articles
Y = {what is the topic of the article?}
Classify webpages
Y = {Student, professor, project, ...}
What about the features X?
The text!
Features X are the entire document -- X_i for the i-th
word in the article
NB for Text classification
P(X|Y) is huge!!!
Article at least 1000 words, X = {X_1, ..., X_1000}
X_i represents the i-th word in the document, i.e., the domain of X_i is the entire
vocabulary, e.g., Webster Dictionary (or more), 10,000 words, etc.
NB assumption helps a lot!!!
P(X_i = x_i | Y = y) is just the probability of observing word x_i in a document on
topic y
Bag of words model
Typical additional assumption --
Position in document doesn't matter:
P(X_i = x_i | Y = y) = P(X_k = x_i | Y = y) (all positions have the
same distribution)
"Bag of words" model -- order of words on the page
ignored
Sounds really silly, but often works very well!
When the lecture is over, remember to wake up the
person sitting next to you in the lecture room.
Bag of words model
(the example sentence above, as an unordered bag of words:)

in is lecture lecture next over person remember room
sitting the the the to to up wake when you
Bag of Words Approach

aardvark  0
about     2
all       2
Africa    1
apple     0
anxious   0
...
gas       1
...
oil       1
...
Zaire     0
NB with Bag of Words for text classification
Learning phase:
Prior P(Y)
Count how many documents from each topic (prior)
P(X_i | Y)
For each topic, count how many times you saw each word in
documents of this topic (+ prior); remember this distribution
is shared across all positions i
Test phase:
For each document
use naive Bayes decision rule
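
A self-contained sketch of the learning and test phases above, with Laplace smoothing for the shared word distribution (class and function names are illustrative, not from the slides):

import math
from collections import Counter, defaultdict

def train_nb_bow(docs, k=1):
    """Learning phase: docs is a list of (list_of_words, topic) pairs.

    Returns (log_priors, log_like, vocab); the word distribution is shared
    across positions (bag of words) and Laplace-smoothed with strength k.
    """
    topic_counts = Counter(topic for _, topic in docs)
    word_counts = defaultdict(Counter)      # topic -> Counter of words
    vocab = set()
    for words, topic in docs:
        word_counts[topic].update(words)
        vocab.update(words)
    log_priors = {t: math.log(c / len(docs)) for t, c in topic_counts.items()}
    log_like = {}
    for t, counts in word_counts.items():
        total = sum(counts.values()) + k * len(vocab)
        for w in vocab:
            log_like[(w, t)] = math.log((counts[w] + k) / total)
    return log_priors, log_like, vocab

def classify_nb_bow(words, log_priors, log_like, vocab):
    """Test phase: naive Bayes decision rule, in log space."""
    scores = {t: lp + sum(log_like[(w, t)] for w in words if w in vocab)
              for t, lp in log_priors.items()}
    return max(scores, key=scores.get)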
Twenty Newsgroups results
[Figure: results plot]
Learning curve for Twenty Newsgroups
[Figure: learning curve]
What if we have continuous X_i?
E.g., character recognition: X_i is the i-th pixel

Gaussian Naive Bayes (GNB):

$P(X_i = x \mid Y = y_k) = \frac{1}{\sigma_{ik}\sqrt{2\pi}}\, e^{-\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2}}$

Sometimes assume variance
is independent of Y (i.e., $\sigma_i$),
or independent of X_i (i.e., $\sigma_k$),
or both (i.e., $\sigma$)
Estimating Parameters: Y discrete, X_i continuous
Maximum likelihood estimates:

Mean:

$\hat{\mu}_{ik} = \frac{\sum_j X_i^j\, \delta(Y^j = y_k)}{\sum_j \delta(Y^j = y_k)}$

Variance:

$\hat{\sigma}_{ik}^2 = \frac{\sum_j (X_i^j - \hat{\mu}_{ik})^2\, \delta(Y^j = y_k)}{\sum_j \delta(Y^j = y_k)}$

where the superscript j indexes the j-th training
example, and $\delta(x) = 1$ if x true,
else 0
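
A sketch of these estimates plus the resulting per-class Gaussian score (names are illustrative; the variance is the plain MLE above, with a small floor added to avoid division by zero):

import math
from collections import defaultdict

def fit_gnb(examples):
    """MLE means and variances per (feature, class) from (x_vector, y) pairs."""
    by_class = defaultdict(list)
    for x, y in examples:
        by_class[y].append(x)
    params = {}
    for y, rows in by_class.items():
        n = len(rows)
        for i in range(len(rows[0])):
            vals = [row[i] for row in rows]
            mu = sum(vals) / n
            var = sum((v - mu) ** 2 for v in vals) / n
            params[(i, y)] = (mu, max(var, 1e-9))  # floor avoids zero variance
    return params, {y: len(rows) for y, rows in by_class.items()}

def gnb_log_score(x, y, params, class_counts, n_total):
    """log P(y) + sum_i log N(x_i; mu_iy, sigma_iy^2); argmax over y classifies."""
    score = math.log(class_counts[y] / n_total)
    for i, xi in enumerate(x):
        mu, var = params[(i, y)]
        score += -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
    return score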
Example: GNB for classifying mental states

~1 mm resolution
~2 images per sec.
15,000 voxels/image
non-invasive, safe

measures Blood
Oxygen Level
Dependent (BOLD)
response

[Figure: typical impulse response, ~10 sec]
[Mitchell et al.]
Brain scans can
track activation
with precision
and sensitivity
[Mitchell et al.]

Gaussian Naive Bayes: Learned $\mu_{voxel,word}$
P(BrainActivity | WordCategory = {People, Animal})
[Mitchell et al.]

Learned Bayes Models -- Means for
P(BrainActivity | WordCategory)

[Figure: mean activation maps for Animal words vs. People words]
Pairwise classification accuracy: 85%
[Mitchell et al.]
Bayes Classifier is Optimal!
Learn: h : X → Y
X -- features
Y -- target classes
Suppose: you know true P(Y|X):
Bayes classifier:

$h_{Bayes}(x) = \arg\max_y P(Y = y \mid X = x)$

Why?
Optimal classification
Theorem:
Bayes classifier h_Bayes is optimal!
That is:

$\text{error}_{true}(h_{Bayes}) \le \text{error}_{true}(h) \quad \forall h$

Why? For any classifier h, the probability of error is

$p_h(\text{error}) = \int_x p_h(\text{error} \mid x)\, p(x)\, dx = \int_x \sum_y \delta(h(x) \neq y)\, p(y \mid x)\, p(x)\, dx$

and for each x the inner sum is minimized by predicting the y with the
largest $p(y \mid x)$ -- which is exactly what $h_{Bayes}$ does.
What you need to know about Naive
Bayes
Naive Bayes classifier
What's the assumption
Why we use it
How do we learn it
Why Bayesian estimation is important
Text classification
Bag of words model
Gaussian NB
Features are still conditionally independent
Each feature has a Gaussian distribution given class
Optimal decision using Bayes Classifier
