
3 - BAYESIAN CLASSIFICATION

Bayesian classifiers are statistical classifiers: they can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes' theorem. A simple Bayesian classifier, known as the naïve Bayesian classifier, has been found to be comparable in performance with decision tree classifiers, and Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.

Why Bayesian Classification


A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.

Foundation: based on Bayes' theorem.

Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers.

Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.

Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.

Bayesian Classification consists of:

1. Bayes' Theorem
2. Naïve Bayesian Classification
3. Bayesian Belief Networks
4. Training Bayesian Belief Networks

FIG: Bayesian classification (the four components listed above).

1. Bayes' Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$
where:

X is the evidence (e.g., a data tuple),
H is some hypothesis (e.g., that X belongs to a specified class),
P(H|X) is the posterior probability of H conditioned on X,
P(X|H) is the posterior probability of X conditioned on H (the likelihood),
P(H) is the prior probability of H, and
P(X) is the prior probability of X.
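As a quick numerical illustration (not part of the original slides; the prior, likelihood, and evidence probabilities below are assumed values chosen only for this example), Bayes' theorem can be applied directly:

```python
# Minimal sketch of Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X).
# All three input probabilities are assumed values, used only for illustration.
p_h = 0.3            # P(H): prior probability of the hypothesis
p_x_given_h = 0.8    # P(X|H): likelihood of the evidence given the hypothesis
p_x = 0.5            # P(X): prior probability of the evidence

p_h_given_x = p_x_given_h * p_h / p_x   # P(H|X): posterior probability
print(p_h_given_x)                      # 0.48
```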

2. Naïve Bayesian Classifier


Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn).

Suppose there are m classes, C1, C2, ..., Cm. Classification derives the maximum posterior, i.e., the class Ci with maximal P(Ci|X). This can be derived from Bayes' theorem:
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$

Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized:


$$P(C_i \mid X) \propto P(X \mid C_i)\,P(C_i)$$

Derivation of the Naïve Bayes Classifier


A simplifying assumption: attributes are conditionally independent given the class (i.e., there are no dependence relations between attributes):
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i)$$

This greatly reduces the computation cost: only the class distribution needs to be counted.

If Ak is categorical, P(xk|Ci) is the number of tuples in class Ci having value xk for attribute Ak, divided by |Ci,D| (the number of tuples of class Ci in D).

If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ, defined by

$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

so that

$$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$$
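A minimal sketch of how these two cases can be estimated in code (the function names and the sample values are assumptions made for illustration, not part of the slides):

```python
import math
from collections import Counter

def categorical_likelihood(class_values, x_k):
    """P(x_k | C_i) for a categorical attribute:
    the fraction of tuples in class C_i having value x_k."""
    counts = Counter(class_values)
    return counts[x_k] / len(class_values)

def gaussian_likelihood(class_values, x_k):
    """P(x_k | C_i) for a continuous attribute, using a Gaussian
    with the mean and standard deviation of the attribute within C_i."""
    mu = sum(class_values) / len(class_values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in class_values) / len(class_values))
    return math.exp(-((x_k - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Sample values (assumed for illustration only).
print(categorical_likelihood(["yes", "no", "yes", "yes"], "yes"))  # 0.75
print(gaussian_likelihood([25.0, 30.0, 35.0, 40.0], 28.0))         # Gaussian density at 28
```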

EX: Data Set in the AllElectronics Customer Database

Fig: Class-labeled training tuples from the AllElectronics customer database.

The data tuples are described by the attributes age, income, student, and credit_rating. The class label attribute, buys_computer, has two distinct values (namely, {yes, no}). Let C1 correspond to the class buys_computer = yes and C2 correspond to buys_computer = no.

The tuple we wish to classify is


X = (age <= 30, income = medium, student = yes, credit_rating = fair)

We need to maximize P(X|Ci)P(Ci), for i = 1, 2. P(Ci), the prior probability of each class, can be computed from the training tuples:

P(Ci):

P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357

To compute P(X|Ci) for each class, we first compute the following conditional probabilities:

P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
P(age <= 30 | buys_computer = no) = 3/5 = 0.600
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.400
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.200
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400

X = (age <= 30, income = medium, student = yes, credit_rating = fair)


P(X|Ci):
P(X | buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019

P(X|Ci) * P(Ci):
P(X | buys_computer = yes) * P(buys_computer = yes) = 0.044 x 0.643 = 0.028
P(X | buys_computer = no) * P(buys_computer = no) = 0.019 x 0.357 = 0.007
Therefore, the naïve Bayesian classifier predicts buys_computer = yes for tuple X.
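The same arithmetic can be checked with a short script (a sketch that simply hard-codes the priors and conditional probabilities computed above):

```python
# Naive Bayes score for X = (age <= 30, income = medium, student = yes, credit_rating = fair),
# using the priors and conditional probabilities computed in the example above.
priors = {"yes": 9 / 14, "no": 5 / 14}
conditionals = {
    "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],   # age, income, student, credit_rating | yes
    "no":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],   # age, income, student, credit_rating | no
}

scores = {}
for label, prior in priors.items():
    likelihood = 1.0
    for p in conditionals[label]:
        likelihood *= p                     # conditional independence assumption
    scores[label] = likelihood * prior      # P(X|Ci) * P(Ci)

print(scores)                               # approx {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))          # 'yes' -> buys_computer = yes
```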

Naïve Bayesian Classifier: Comments


Advantages:
Easy to implement.
Good results obtained in most cases.

Disadvantages:
Assumes class-conditional independence, which can cause a loss of accuracy.
In practice, dependencies exist among variables. E.g., in hospital patient data: profile attributes (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.). Such dependencies cannot be modeled by the naïve Bayesian classifier.

How can these dependencies be handled? With Bayesian belief networks.

3. Bayesian Belief Networks


A Bayesian belief network allows class conditional independencies to be defined between subsets of variables.

A graphical model of causal relationships.


It represents dependencies among the variables and gives a specification of the joint probability distribution.

Nodes: random variables
Links: dependencies

[FIG: A simple belief network with nodes X, Y, Z, and P.]

X and Y are the parents of Z, and Y is the parent of P. There is no direct dependency between Z and P. The network has no loops or cycles (it is a directed acyclic graph).
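One simple way to encode such a structure in code is a mapping from each node to its parents; the sketch below (an assumed representation, not something prescribed by the slides) captures the small example above:

```python
# The structure of the small example network as a parent map:
# X and Y are the parents of Z, and Y is the parent of P.
parents = {
    "X": [],
    "Y": [],
    "Z": ["X", "Y"],
    "P": ["Y"],
}

# Each node is conditionally independent of its non-descendants given its parents,
# so Z and P have no direct dependency in this graph.
print(parents["Z"])   # ['X', 'Y']
```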


Eg: Bayesian Belief Network

[FIG: A belief network over six variables: FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea. FamilyHistory and Smoker are the parents of LungCancer.]

The conditional probability table (CPT) for the variable LungCancer:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC        0.8       0.5        0.7        0.1
~LC       0.2       0.5        0.3        0.9

The CPT shows the conditional probability of LungCancer for each possible combination of the values of its parents.

Derivation of the probability of a particular combination of values of X, from CPT:


$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(Y_i))$$
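As a sketch of this factorization for a fragment of the network above (only the LungCancer CPT comes from the slide; the priors for FamilyHistory and Smoker below are assumed placeholder values):

```python
# P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S), factored over each node's parents.
p_fh = {True: 0.1, False: 0.9}   # assumed prior for FamilyHistory (placeholder)
p_s = {True: 0.3, False: 0.7}    # assumed prior for Smoker (placeholder)
p_lc = {                         # CPT for LungCancer given (FH, S), from the slide above
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def joint(fh, s, lc):
    """Joint probability of one assignment to FamilyHistory, Smoker, LungCancer."""
    p_lc_given_parents = p_lc[(fh, s)] if lc else 1.0 - p_lc[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc_given_parents

print(joint(fh=True, s=True, lc=True))   # 0.1 * 0.3 * 0.8 = 0.024
```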

4. Training Bayesian Belief Networks


The network topology may be given in advance or inferred from the data, and the network variables may be observable or hidden in all or some of the training tuples. The case of hidden data is also referred to as missing values or incomplete data.

Several training scenarios arise:

1. Both the network structure and all variables observable: learn only the CPTs (see the counting sketch after this list).
2. Network structure known, some variables hidden: use a gradient descent (greedy hill-climbing) method, analogous to neural network learning.
3. Network structure unknown, all variables observable: search through the model space to reconstruct the network topology.
4. Unknown structure, all variables hidden: no good algorithms are known for this case.
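For the first scenario (structure known, all variables observable), each CPT entry is just a relative frequency counted from the data. The sketch below illustrates this with a tiny, assumed dataset of (FamilyHistory, Smoker, LungCancer) observations:

```python
from collections import defaultdict

# Tiny fully observed training set (values assumed for illustration):
# each row is (FamilyHistory, Smoker, LungCancer).
data = [
    (True, True, True), (True, True, False), (True, True, True),
    (False, True, True), (False, False, False), (False, False, False),
]

counts = defaultdict(lambda: [0, 0])      # (FH, S) -> [count of LC=True, total count]
for fh, s, lc in data:
    counts[(fh, s)][1] += 1
    if lc:
        counts[(fh, s)][0] += 1

# P(LC = True | FH, S) estimated as a relative frequency for each parent combination.
cpt = {parent_vals: pos / total for parent_vals, (pos, total) in counts.items()}
print(cpt)   # e.g. {(True, True): 0.667, (False, True): 1.0, (False, False): 0.0}
```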

Issues of Classification

1. Accuracy
2. Training time
3. Robustness
4. Interpretability
5. Scalability

Typical applications

1. Credit approval
2. Target marketing
3. Medical diagnosis
4. Fraud detection
5. Weather forecasting
6. Stock marketing
