
Bayesian Learning

CSL465/603 - Fall 2016


Narayanan C Krishnan
ckn@iitrpr.ac.in

Outline
Bayes Theorem
MAP Learners
Bayes optimal classifier
Naïve Bayes classifier
Example: text classification
Bayesian networks
EM algorithm


Features of Bayesian Learning


Practical learning algorithms
Naïve Bayes learning
Bayesian network learning
Combine prior knowledge with observations
Require prior probabilities

Useful conceptual framework
Gold standard for evaluating other classifiers
Tools for analysis


Bayes Theorem
If X and Y are two random variables,
P(Y|X) = P(X|Y) P(Y) / P(X)

In the context of a classifier: hypothesis h and training data D
P(h|D) = P(D|h) P(h) / P(D)
P(h) - prior probability of hypothesis h
P(D) - prior probability of training data D
P(h|D) - probability of h given D (posterior)
P(D|h) - probability of D given h (likelihood)


Choosing the Hypotheses
Given the training data D, we are interested in the most probable hypothesis.
Maximum a posteriori (MAP) hypothesis:
h_MAP = argmax_{h ∈ H} P(h|D)
      = argmax_{h ∈ H} P(D|h) P(h) / P(D)
      = argmax_{h ∈ H} P(D|h) P(h)
If every hypothesis is a priori equally probable, P(h_i) = P(h_j) for all h_i, h_j ∈ H, then we can simplify this to the
Maximum likelihood (ML) hypothesis:
h_ML = argmax_{h ∈ H} P(D|h)

Example
Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.


P(cancer) = 0.008, P(¬cancer) = 0.992
P(+|cancer) = 0.98, P(−|cancer) = 0.02
P(+|¬cancer) = 0.03, P(−|¬cancer) = 0.97

P(cancer|+) = P(+|cancer) P(cancer) / P(+)
            = (0.98 × 0.008) / (0.98 × 0.008 + 0.03 × 0.992)
            = 0.00784 / 0.03760 ≈ 0.21
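A quick numeric check of this calculation, as a minimal Python sketch (the probabilities are exactly the ones given in the problem statement):

```python
# Priors and test characteristics from the problem statement.
p_cancer = 0.008
p_not_cancer = 1 - p_cancer          # 0.992
p_pos_given_cancer = 0.98            # sensitivity
p_pos_given_not_cancer = 0.03        # 1 - specificity (specificity = 0.97)

# Evidence: total probability of a positive test result.
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_not_cancer * p_not_cancer)

# Bayes theorem: posterior probability of cancer given a positive test.
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(p_cancer_given_pos)            # ~0.21
```

Despite the positive test, the posterior is only about 0.21, because the prior probability of cancer is so low.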


Brute-Force MAP Hypothesis Learner (1)
Given D = {<x_1, y_1>, ..., <x_N, y_N>}, the examples and their class labels:
For each hypothesis h ∈ H, calculate the posterior probability
P(h|D) = P(D|h) P(h) / P(D)
Output the hypothesis h_MAP that has the highest posterior probability
h_MAP = argmax_{h ∈ H} P(h|D)


Brute-Force MAP Hypothesis Learner (2)
Given D = {<x_1, y_1>, ..., <x_N, y_N>}, the examples and their class labels, choose P(D|h):
P(D|h) = 1 if h is consistent with D
P(D|h) = 0 otherwise

Choose P(h) to be the uniform distribution:
P(h) = 1 / |H| for all h ∈ H

Then P(D) = Σ_{h ∈ H} P(D|h) P(h) = |VS_{H,D}| / |H|,
where VS_{H,D} is the version space: the subset of H consistent with D.


Brute-Force MAP Hypothesis Learner (3)
Given D = {<x_1, y_1>, ..., <x_N, y_N>}, the examples and their class labels, choose P(D|h):
P(D|h) = 1 if h is consistent with D
P(D|h) = 0 otherwise

Choose P(h) to be the uniform distribution:
P(h) = 1 / |H| for all h ∈ H

Then
P(h|D) = 1 / |VS_{H,D}|, if h is consistent with D
P(h|D) = 0, otherwise
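A minimal sketch of the brute-force MAP learner over a toy finite hypothesis space; the threshold classifiers and the two training examples below are illustrative, not from the lecture:

```python
# Brute-force MAP: P(D|h) = 1 iff h is consistent with every (x, y) in D,
# uniform prior P(h) = 1/|H|, and P(D) = |VS_{H,D}| / |H|.
def brute_force_map(hypotheses, data):
    """hypotheses: dict name -> function x -> label. Returns (h_MAP, posterior)."""
    prior = 1.0 / len(hypotheses)
    unnormalized = {
        name: prior * all(h(x) == y for x, y in data)
        for name, h in hypotheses.items()
    }
    p_data = sum(unnormalized.values())          # = |VS_{H,D}| / |H|
    posterior = {name: p / p_data for name, p in unnormalized.items()}
    return max(posterior, key=posterior.get), posterior

# Toy hypothesis space: threshold classifiers on one real-valued feature.
hypotheses = {f"x>{t}": (lambda x, t=t: x > t) for t in (0, 1, 2, 3)}
data = [(0.5, False), (2.5, True)]
print(brute_force_map(hypotheses, data))
# Only "x>1" and "x>2" are consistent, so each gets posterior 0.5.
```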

Evolution of Posterior Probabilities

[Figure: the posterior distribution over hypotheses evolves from the uniform prior P(h), to P(h|D1), to P(h|D1, D2); as training examples arrive, the probability mass concentrates on the hypotheses that remain consistent with the data.]

Classifying new instances
Given a new instance x, what is the most probable classification?
One solution: h_MAP(x)

But can we do better?
Consider the following example containing three hypotheses:
P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
Given a new instance x,
h1(x) = +, h2(x) = −, h3(x) = −
What is the most probable classification for x?

Bayes Optimal Classifier (1)
Combine the predictions of all hypotheses, weighted by their posterior probabilities.
Bayes optimal classification:
argmax_{y ∈ Y} Σ_{h_i ∈ H} P(y|h_i) P(h_i|D)

Example
P(h1|D) = 0.4, P(−|h1) = 0, P(+|h1) = 1
P(h2|D) = 0.3, P(−|h2) = 1, P(+|h2) = 0
P(h3|D) = 0.3, P(−|h3) = 1, P(+|h3) = 0
Σ_{h_i ∈ H} P(+|h_i) P(h_i|D) = 0.4
Σ_{h_i ∈ H} P(−|h_i) P(h_i|D) = 0.6
So the Bayes optimal classification of x is −, even though h_MAP = h1 predicts +.
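A minimal sketch that reproduces this three-hypothesis example; the helper name bayes_optimal and the string labels are illustrative choices:

```python
# Bayes optimal classification: each hypothesis votes for its predicted label
# with weight equal to its posterior probability P(h_i|D).
def bayes_optimal(posteriors, predictions, labels=("+", "-")):
    """posteriors[i] = P(h_i|D); predictions[i] = label h_i assigns to x."""
    score = {y: sum(p for p, pred in zip(posteriors, predictions) if pred == y)
             for y in labels}
    return max(score, key=score.get), score

posteriors = [0.4, 0.3, 0.3]         # P(h1|D), P(h2|D), P(h3|D)
predictions = ["+", "-", "-"]        # h1(x), h2(x), h3(x)
print(bayes_optimal(posteriors, predictions))
# ('-', {'+': 0.4, '-': 0.6}) -- disagrees with h_MAP = h1, which predicts '+'
```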


Bayes Optimal Classifier (2)
Optimal in the sense that no other classification method using the same hypothesis space and the same prior knowledge can outperform it on average.
The method maximizes the probability that the new instance is classified correctly, given the available data, the hypothesis space, and the prior probabilities over the hypotheses.
But it is inefficient:
it computes the posterior probability of every hypothesis and combines the predictions of all of them.

Gibbs Classifier
Gibbs Algorithm
Choose a hypothesis h ∈ H at random, according to the posterior probability distribution over H.
Use h to classify the new instance x.

Observation: assume the target concepts are drawn at random from H according to the priors on H. Then
E[error_Gibbs] ≤ 2 E[error_BayesOptimal]
Haussler et al., ML 1994
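The Gibbs classifier is a one-step sampler; a minimal sketch on the same toy posterior as the previous slide:

```python
import random

# Gibbs classifier: sample one hypothesis from the posterior P(h|D)
# and let it classify x alone.
posteriors = [0.4, 0.3, 0.3]         # P(h1|D), P(h2|D), P(h3|D)
predictions = ["+", "-", "-"]        # h1(x), h2(x), h3(x)

h = random.choices(range(3), weights=posteriors, k=1)[0]
print(predictions[h])   # '+' with probability 0.4, '-' with probability 0.6
```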


Naïve Bayes Classifier (1)
Bayes rule, in a slightly different application:
Let Y = {y_1, y_2, ..., y_K} be the different class labels.
The label for an instance x:
P(y_k|x) = P(x|y_k) P(y_k) / P(x)
P(y_k|x) - posterior probability that instance x belongs to class y_k
P(x|y_k) - probability that an instance drawn from class y_k would be x (likelihood)
P(y_k) - probability of class y_k (prior)
P(x) - probability of instance x (evidence)

Naïve Bayes Classifier (2)
Classify instance x as the class with the maximum posterior probability:
ŷ = argmax_{y_k ∈ Y} P(y_k|x)

Ignore the denominator (since we are only interested in the maximum):
ŷ = argmax_{y_k ∈ Y} P(x|y_k) P(y_k)

If the prior is uniform:
ŷ = argmax_{y_k ∈ Y} P(x|y_k)


Naïve Bayes Classifier (3)
Look at the classifier:
ŷ = argmax_{y_k ∈ Y} P(x|y_k)

What is each instance x?
A d-dimensional tuple (x_1, ..., x_d)

Estimate the joint probability distribution P(x_1, ..., x_d | y_k)
Practical issue: we would need to know the probability of every possible instance given every possible class.
With d Boolean features and K classes, that is on the order of K · 2^d probability values!!!


Naïve Bayes Classifier (4)
Make the naïve Bayes assumption:
features/attributes are conditionally independent given the target attribute (class label)
P(x_1, ..., x_d | y_k) = Π_{j=1}^{d} P(x_j | y_k)

This results in the naïve Bayes classifier (NBC)!
ŷ = argmax_{y_k ∈ Y} P(y_k) Π_{j=1}^{d} P(x_j | y_k)
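A minimal sketch of the resulting classifier for discrete features, using raw frequency estimates (the practical issues addressed on the next slides, such as zero counts, are ignored here); the tiny dataset is illustrative:

```python
from collections import Counter, defaultdict

def train_nb(X, y):
    """X: list of feature tuples, y: list of class labels."""
    class_counts = Counter(y)
    prior = {c: class_counts[c] / len(y) for c in class_counts}
    # cond[c][j][v] = count of feature j taking value v within class c
    cond = defaultdict(lambda: defaultdict(Counter))
    for xs, c in zip(X, y):
        for j, v in enumerate(xs):
            cond[c][j][v] += 1
    return prior, cond, class_counts

def predict_nb(x, prior, cond, class_counts):
    # argmax over classes of P(y_k) * prod_j P(x_j | y_k)
    def score(c):
        s = prior[c]
        for j, v in enumerate(x):
            s *= cond[c][j][v] / class_counts[c]
        return s
    return max(prior, key=score)

# Tiny illustrative dataset: (outlook, windy) -> play
X = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
y = ["yes", "no", "yes", "no"]
prior, cond, counts = train_nb(X, y)
print(predict_nb(("sunny", "no"), prior, cond, counts))   # 'yes'
```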


NBC Practical Issues (1)
Estimating the probabilities from D:
Prior probabilities
P(y_k) = |{x_i : y_i = y_k}| / |D|
If the features are discrete:
P(x_j = v | y_k) = |{x_i : x_ij = v, y_i = y_k}| / |{x_i : y_i = y_k}|


NBC Practical Issues (2)
If the features are continuous?
Assume some parameterized distribution for P(x_j | y_k), e.g., Normal.
Learn the parameters of the distribution from data, e.g., the mean and variance of the feature values.
Determine the parameters that maximize the likelihood:
x_j | y_k ~ N(μ, σ²), where μ and σ² are unknown.
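A minimal sketch of the Gaussian case for one feature and one class: estimate the mean and (maximum-likelihood) variance from that class's feature values, then evaluate the class-conditional density. The sample values are made up:

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

values_in_class = [5.1, 4.9, 5.4, 5.0]      # feature values observed in class y_k
mu = sum(values_in_class) / len(values_in_class)
var = sum((v - mu) ** 2 for v in values_in_class) / len(values_in_class)  # ML estimate

print(gaussian_pdf(5.2, mu, var))           # class-conditional likelihood of a new value
```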


Maximum Likelihood Estimate
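This slide left room for the board derivation; the standard result, assuming N i.i.d. samples x_1, ..., x_N ~ N(μ, σ²), is:

ln L(μ, σ²) = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^{N} (x_i − μ)²

Setting ∂/∂μ = 0 and ∂/∂σ² = 0 gives the maximum-likelihood estimates

μ̂ = (1/N) Σ_{i=1}^{N} x_i
σ̂² = (1/N) Σ_{i=1}^{N} (x_i − μ̂)²

(Note the 1/N rather than 1/(N−1): the ML variance estimate is biased.)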


NBC Practical Issues (3)
If the features are continuous?
Assume some parameterized distribution for P(x_j | y_k), e.g., Normal.
Learn the parameters of the distribution from data, e.g., the mean and variance of the feature values.
Determine the parameters that maximize the likelihood.

Or discretize the feature
E.g., bin a continuous feature such as price into ranges like {low, medium, high}


NBC Practical Issues (4)
If there are no examples in class y_k for which x_j = v:
P(x_j = v | y_k) = 0
⟹ P(y_k) Π_{j=1}^{d} P(x_j | y_k) = 0

Use the m-estimate, defined as follows:
P(x_j = v | y_k) = (|{x_i : x_ij = v, y_i = y_k}| + m·p) / (|{x_i : y_i = y_k}| + m)
p - prior estimate of the probability
m - equivalent sample size (how heavily to weight p relative to the observed data)
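A minimal sketch of the m-estimate as a function; the counts and hyperparameters in the usage line are illustrative:

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of P(x_j = v | y_k).
    n_c: examples in class y_k with x_j = v; n: examples in class y_k;
    p: prior estimate of the probability; m: equivalent sample size."""
    return (n_c + m * p) / (n + m)

# Even with no observed examples of this value (n_c = 0), the estimate
# stays nonzero, so one unseen feature value no longer zeroes the product.
print(m_estimate(n_c=0, n=10, p=0.5, m=2))   # 1/12 ≈ 0.083
```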


Example: Learn to Classify Text
Problem Definition
Given a set of news articles that are of interest, we would like to learn to classify the articles by topic.

Naïve Bayes is among the most effective algorithms for this task.
What attributes will represent the documents?
Vector of words: one attribute per word position in the document

What is the target concept?
Is the document interesting?
Topic of the document

Algorithm: Learn Naïve Bayes
Collect all words and tokens that occur in the examples (D)
Vocabulary ← all distinct words and tokens in D

Compute the probabilities P(y_k) and P(w | y_k)
D_k ← examples for which the target label is y_k
P(y_k) = |D_k| / |D|
n ← total number of words in D_k (counting duplicates multiple times)
For each word w in Vocabulary
n_w ← number of times word w occurs in D_k
P(w | y_k) = (n_w + 1) / (n + |Vocabulary|)

Algorithm: Classify Naïve Bayes
Given a test instance (document):
Compute the frequency of occurrence in the test instance of each term in the vocabulary.
Apply the naïve Bayes classification rule!
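A minimal sketch of the learn/classify pair for text, following the algorithm above with the (n_w + 1) smoothing; summing log-probabilities instead of multiplying avoids numerical underflow on long documents. The two-topic corpus is made up:

```python
import math
from collections import Counter

def learn_nb_text(docs):
    """docs: list of (list_of_words, label) pairs."""
    vocab = {w for words, _ in docs for w in words}
    prior, word_prob = {}, {}
    for c in {lab for _, lab in docs}:
        class_docs = [words for words, lab in docs if lab == c]
        prior[c] = len(class_docs) / len(docs)
        counts = Counter(w for words in class_docs for w in words)
        n = sum(counts.values())                 # total words in class c
        word_prob[c] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return prior, word_prob

def classify_nb_text(words, prior, word_prob):
    # argmax over classes of log P(y_k) + sum of log P(w | y_k)
    def log_score(c):
        return math.log(prior[c]) + sum(
            math.log(word_prob[c][w]) for w in words if w in word_prob[c])
    return max(prior, key=log_score)

docs = [(["game", "score", "win"], "sports"),
        (["election", "vote"], "politics"),
        (["team", "win"], "sports")]
prior, word_prob = learn_nb_text(docs)
print(classify_nb_text(["win", "score"], prior, word_prob))   # 'sports'
```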


Example: 20 Newsgroups
Given 1000 training documents from each group,
learn to classify new documents according to the newsgroup they came from.
NBC: 89% accuracy


Bayesian Network (1)
The naïve Bayes assumption of conditional independence is too restrictive,
but the problem is intractable without some conditional independence assumptions.
Bayesian networks describe conditional independence among subsets of variables.
This allows combining prior knowledge about (in)dependencies among variables with training data.
Recollect: Conditional Independence


Bayesian Network - Example

[Figure: directed acyclic graph with nodes Storm, BusTourGroup, Lightning, Campfire, Thunder, ForestFire. Storm and BusTourGroup are parents of Campfire; Storm is a parent of Lightning; Lightning is a parent of Thunder; Storm, Lightning, and Campfire are parents of ForestFire.]

Conditional probability table for Campfire, given its parents Storm (S) and BusTourGroup (B):

            S,B    S,¬B   ¬S,B   ¬S,¬B
 Campfire   0.4    0.1    0.8    0.2
¬Campfire   0.6    0.9    0.2    0.8

Bayes Network (2)
The network represents the joint probability distribution over all the variables:
P(Storm, BusTourGroup, ..., ForestFire)
In general,
P(x_1, x_2, ..., x_n) = Π_{i=1}^{n} P(x_i | Parents(x_i))
where Parents(x_i) denotes the immediate predecessors of x_i in the graph.

What is the Bayes network corresponding to the naïve Bayes classifier?
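A minimal sketch of reading one factor of the joint distribution off the network, using the Campfire CPT from the example slide; the factorization in the comment assumes the edge structure shown in the figure:

```python
# P(Campfire = True | Storm, BusTourGroup), from the example slide's table.
p_campfire = {
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def campfire_factor(campfire, storm, bus_tour_group):
    """One factor of the joint: P(Campfire | Storm, BusTourGroup)."""
    p_true = p_campfire[(storm, bus_tour_group)]
    return p_true if campfire else 1 - p_true

# The full joint is a product of one such factor per node:
# P(S, B, L, C, T, F) = P(S) P(B) P(L|S) P(C|S,B) P(T|L) P(F|S,L,C)
print(campfire_factor(campfire=True, storm=False, bus_tour_group=True))  # 0.8
```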

Bayes Network (3)
Inference
A Bayes network encodes all the information required for inference.
Exact inference methods
Work well for some structures
Monte Carlo methods
Simulate the network randomly to calculate approximate solutions

Learning
If the structure is known and there are no missing values, it is easy to learn a Bayes network.
If the network structure is known and there are some missing values: the expectation maximization (EM) algorithm.
If the structure is unknown, the problem is very difficult.

Summary
Bayes rule
Bayes Optimal Classifier
Practical Naïve Bayes Classifier
Example: text classification task

Maximum-likelihood estimates
Bayesian networks
