
A Seminar Report on

Naive Bayes classifier

Submitted by
Himanshu Patel
15CS09F
II Sem M.Tech (CSE)
Under CS867
in partial fulfillment for the award of the degree of
MASTER OF TECHNOLOGY
in
COMPUTER SCIENCE & ENGINEERING

Department of Computer Science & Engineering


National Institute of Technology Karnataka, Surathkal.

April 2016

1. INTRODUCTION
Naive Bayes is a collection of classification algorithms based on Bayes' theorem. It is
not a single algorithm but a family of algorithms that share a common principle: every
feature being classified is assumed to be independent of the value of any other feature. A Naive
Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity,
Naive Bayes is known to perform comparably to, and sometimes outperform, highly sophisticated classification methods. Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c).
Look at the equation below:

P(c|x) = P(x|c) * P(c) / P(x)

In the above equation:


P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
P(c) is the prior probability of class.
P(x|c) is the likelihood which is the probability of predictor given class.
P(x) is the prior probability of predictor.
A simple example best explains the application of Naive Bayes for classification [2]. It
illustrates the concept well and walks through the simple maths behind it. Suppose we
have data on 1000 pieces of fruit. Each fruit is a Banana, an Orange or some Other fruit, and
we know three features of each fruit: whether it is Long or not, Sweet or not and Yellow or
not, as displayed in the table below:

Fruit   | Long | Sweet | Yellow | Total
--------|------|-------|--------|------
Banana  |  400 |   350 |    450 |   500
Orange  |    0 |   150 |    300 |   300
Other   |  100 |   150 |     50 |   200
Total   |  500 |   650 |    800 |  1000

[Figure 1.1: Table of fruit features and counts]


From the table it is clear that 50% of the fruits are bananas, 30% are oranges and 20% are
other fruits. Based on our training set we can also say the following:

From 500 bananas, 400 (0.8) are Long, 350 (0.7) are Sweet and 450 (0.9) are Yellow.

Out of 300 oranges, 0 are Long, 150 (0.5) are Sweet and 300 (1.0) are Yellow.

From the remaining 200 fruits, 100 (0.5) are Long, 150 (0.75) are Sweet and 50 (0.25)
are Yellow.

Now suppose we are given the features of a new piece of fruit and we need to predict its
class. If we are told that this fruit is Long, Sweet and Yellow, we can classify it by substituting
the values for each class (Banana, Orange or Other fruit) into the formula above; the class
with the highest score is the winner.

1. P(Banana | Long, Sweet, Yellow) ∝ P(Long|Banana) * P(Sweet|Banana) * P(Yellow|Banana) * P(Banana) = 0.8 * 0.7 * 0.9 * 0.5 = 0.252

2. P(Orange | Long, Sweet, Yellow) ∝ 0 * 0.5 * 1.0 * 0.3 = 0

3. P(Other | Long, Sweet, Yellow) ∝ 0.5 * 0.75 * 0.25 * 0.2 = 0.01875

Based on the scores (0 < 0.01875 < 0.252), we can conclude that this Long, Sweet and Yellow
fruit is classified as a Banana.
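
The same computation can be expressed in a few lines of Python. This is a minimal illustrative sketch; the priors and likelihoods are read directly from the table above:

# Priors and per-class feature likelihoods taken from the fruit table above.
priors = {"Banana": 0.5, "Orange": 0.3, "Other": 0.2}
likelihoods = {
    "Banana": {"Long": 0.8, "Sweet": 0.7, "Yellow": 0.9},
    "Orange": {"Long": 0.0, "Sweet": 0.5, "Yellow": 1.0},
    "Other":  {"Long": 0.5, "Sweet": 0.75, "Yellow": 0.25},
}

def score(features, cls):
    # Unnormalized posterior: P(class) * product of P(feature | class).
    p = priors[cls]
    for f in features:
        p *= likelihoods[cls][f]
    return p

scores = {c: score(["Long", "Sweet", "Yellow"], c) for c in priors}
print(max(scores, key=scores.get))  # prints "Banana" (score approximately 0.252)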

1.1. Pros and Cons of Naive Bayes:


Pros:
1. It is easy and fast to predict the class of a test data set.
2. It also performs well in multi-class prediction. When the assumption of independence
holds, a Naive Bayes classifier performs better compared to other models such as logistic
regression, and it needs less training data.
3. It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (a bell curve, which is a
strong assumption); a sketch follows this Pros and Cons list.
Cons:
1. If a categorical variable has a category in the test data set that was not observed in the training
data set, the model will assign it a zero probability and will be unable to make a prediction. This is often known as the Zero Frequency problem. To solve it, we can use a smoothing
technique; one of the simplest smoothing techniques is Laplace estimation.

2. Another limitation of Naive Bayes is the assumption of independent predictors. In real
life, it is almost impossible to get a set of predictors that are completely independent.
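
As referenced in Pros item 3, the following is a minimal sketch of the numerical case, assuming scikit-learn is available and using made-up feature values. Gaussian Naive Bayes models each numerical feature with a per-class normal distribution:

from sklearn.naive_bayes import GaussianNB

# Toy numerical data: two features per sample, two classes (labels 0 and 1).
X = [[5.0, 1.2], [4.8, 1.0], [1.1, 3.4], [0.9, 3.9]]
y = [0, 0, 1, 1]

model = GaussianNB()   # fits a normal distribution per feature and class
model.fit(X, y)
print(model.predict([[5.1, 1.1]]))  # expected to print [0]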
1.2. Applications of Naive Bayes Algorithms:

Real-time Prediction: Naive Bayes is an eager learning classifier and it is fast.
Thus, it can be used for making predictions in real time.

Multi-class Prediction: This algorithm is also well known for its multi-class prediction
capability. Here we can predict the probabilities of multiple classes of the target variable.

Text Classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are
mostly used in text classification (due to good results in multi-class problems and the
independence rule) and have a higher success rate compared to other algorithms. As a
result, Naive Bayes is widely used in spam filtering (identifying spam e-mail) and sentiment analysis
(in social media analysis, to identify positive and negative customer sentiments).

Recommendation System: A Naive Bayes classifier and Collaborative Filtering
together build a recommendation system that uses machine learning and data
mining techniques to filter unseen information and predict whether a user would like a
given resource or not.


2. SENTIMENT ANALYSIS USING NAIVE BAYES


Sentiment analysis is among the most researched topics in natural language processing.
It involves extracting subjective information from documents such as online
reviews to determine their polarity with respect to certain objects. It is useful for identifying
trends of public opinion in social media for the purposes of marketing and consumer research. It also has uses in getting customer feedback about new product launches, in political
campaigns and even in financial markets [4]. Sentiment analysis is a complicated problem, but
experiments have been done using Naive Bayes and maximum entropy classifiers. Naive Bayes
is a very simple probabilistic model that tends to work well on text classification and usually
takes orders of magnitude less time to train than models like support vector
machines. A high degree of accuracy is obtained using the Naive Bayes model, comparable to the current state-of-the-art models in sentiment classification [3].
2.1 Naive Bayes Strategy
The Naive Bayes model involves a simplifying conditional independence assumption:
given a class (positive or negative), the words are conditionally independent of each
other. This assumption does not affect the accuracy in text classification by much, but it makes
really fast classification algorithms applicable to the problem. The maximum likelihood
probability of a word belonging to a particular class is given by the expression below; the
frequency counts of the words are stored in hash tables during the training phase [4].

P(word | class) = count(word, class) / (total number of words in class)
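
A minimal sketch of this training phase, assuming tokenized documents are available as (words, label) pairs (the names below are illustrative, not taken from the paper):

from collections import Counter, defaultdict

word_counts = defaultdict(Counter)  # class label -> word -> frequency
total_words = Counter()             # class label -> total number of words

def train(documents):
    # documents: iterable of (list_of_words, class_label) pairs
    for words, label in documents:
        word_counts[label].update(words)
        total_words[label] += len(words)

def mle_probability(word, label):
    # Maximum likelihood estimate of P(word | class), as in the expression above.
    return word_counts[label][word] / total_words[label]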

The usual definition of Naive Bayes classification is as follows [3]: given an item with
feature values (a_1, a_2, ..., a_n), predict the class c that maximizes P(c | a_1, ..., a_n). If each
feature is conditionally independent of the others given the class, then according to Bayes' formula:

P(c | a_1, ..., a_n) ∝ P(c) * P(a_1 | c) * P(a_2 | c) * ... * P(a_n | c)

Laplacian Smoothing

If the classifier encounters a word that has not been seen in the training set, the probability of both classes would become zero and there would be nothing to compare.
This problem can be solved by Laplacian smoothing, which adds a constant k to every count:

P(word | class) = (count(word, class) + k) / (total words in class + k * |V|)

where |V| is the vocabulary size. Usually, k is chosen as 1. This way, there is an equal,
non-zero probability for a new word to belong to either class. Since Bernoulli Naive Bayes
is used, the total number of words in a class is computed differently.
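
Continuing the hypothetical count tables from the training sketch in Section 2.1, the smoothed estimate could look like this (vocab_size, the number of distinct words seen in training, is an assumed parameter):

def smoothed_probability(word, label, k=1, vocab_size=10000):
    # Laplace smoothing: an unseen word gets a small non-zero probability
    # instead of zero, so classification never breaks down entirely.
    return (word_counts[label][word] + k) / (total_words[label] + k * vocab_size)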
[Figure: flowchart of Naive Bayes classification [3]]

Negation Handling

Negation handling was one of the factors that contributed significantly to the accuracy
of the Naive Bayes classifier. A major problem faced during sentiment classification is
the handling of negations. Since each word is used as a feature, the word "good" in the phrase
"not good" will contribute to positive sentiment rather than negative sentiment, as the
presence of "not" before it is not taken into account. The algorithm uses a state variable to
store the negation state: it transforms a word following a "not" or "n't" into "not_" + word.
Whenever the negation state variable is set, the words read are treated as "not_" + word. The
state variable is reset when a punctuation mark is encountered or when there is a double negation. A runnable Python version of the algorithm's pseudo code is given below:


PYTHON CODE:

import string

def handle_negation(words):
    negated = False
    result = []
    for word in words:
        if word in string.punctuation:
            negated = False                  # a punctuation mark resets the negation state
        out = "not_" + word if negated else word
        if word in ("not", "n't") or word.endswith("n't"):
            negated = not negated            # a double negation cancels out
        result.append(out)
    return result
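
For example, applying the routine above to a tokenized sentence:

print(handle_negation("the movie was not good .".split()))
# -> ['the', 'movie', 'was', 'not', 'not_good', '.']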
The number of negated forms seen in training might not be adequate for correct classification: it is possible that many words with strong sentiment occur only in their normal forms in the training
set, even though their negated forms would also be of strong polarity.

The Naive Bayes classification algorithm is a good choice for this task: it is fast and achieves high
classification precision, which is significant for this type of text categorization. See the
result below, obtained in the paper by Yaguang Wang, Wenlong Fu, Aina Sui and
Yuqing Ding [3].

[Figure 2.1: Time consumption of four classifiers]


References

[1] "Naive Bayes classifier", Wikipedia. https://en.wikipedia.org/wiki/Naive_Bayes_classifier

[2] Data Science Central, www.datasciencecentral.com/profiles/blogs/

[3] Yaguang Wang, Wenlong Fu, Aina Sui, Yuqing Ding, "Comparison of Four Text Classifiers on Movie Reviews", 2015 3rd International Conference on Applied Computing and Information Technology / 2nd International Conference on Computational Science and Intelligence, 2015.

[4] Vivek Narayanan, Ishan Arora, Arjun Bhatia, "Fast and accurate sentiment classification using an enhanced Naive Bayes model."
