Submitted by
Himanshu Patel
15CS09F
II Sem M.Tech (CSE)
Under CS867
in partial fulfillment of the requirements for the award of the degree of
MASTER OF TECHNOLOGY
in
COMPUTER SCIENCE & ENGINEERING
April 2016
1. INTRODUCTION
Naive Bayes is a collection of classification algorithms based on Bayes' Theorem. It is
not a single algorithm but a family of algorithms that share a common principle: every
feature being classified is assumed to be independent of the value of any other feature. A Naive
Bayes model is easy to build and particularly useful for very large data sets. Despite its
simplicity, Naive Bayes often performs comparably to far more sophisticated classification
methods. Bayes' theorem provides a way of calculating the posterior probability P(c|x) from
P(c), P(x) and P(x|c):

P(c|x) = P(x|c) * P(c) / P(x)

where P(c) is the prior probability of class c, P(x|c) is the likelihood of the features given
the class, and P(x) is the prior probability of the features.
Suppose a training set of 1,000 fruits contains 500 bananas, 300 oranges and 200 other fruits:
Of the 500 bananas, 400 (0.8) are Long, 350 (0.7) are Sweet and 450 (0.9) are Yellow.
Of the 300 oranges, 0 are Long, 150 (0.5) are Sweet and 300 (1.0) are Yellow.
Of the remaining 200 fruits, 100 (0.5) are Long, 150 (0.75) are Sweet and 50 (0.25) are Yellow.
Now, given the features of a new piece of fruit, we need to predict its class. If we are
told that the fruit is Long, Sweet and Yellow, we can classify it by substituting the values
for each class (Banana, Orange or Other Fruit) into the formula above; the class with the
highest probability (score) wins. The priors are P(Banana) = 0.5, P(Orange) = 0.3 and
P(Other) = 0.2. Multiplying each prior by the corresponding feature likelihoods gives scores
of 0.252 for Banana, 0 for Orange and 0.01875 for Other Fruit. Since 0.01875 < 0.252, this
Long, Sweet and Yellow fruit is classified as a Banana.
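The calculation above can be reproduced in a few lines of Python; this is a sketch that uses exactly the priors and per-class feature probabilities stated in the text:

```python
# Naive Bayes score for each class: P(class) * product of P(feature | class).

# Priors: 500 bananas, 300 oranges, 200 other fruits out of 1,000 total.
priors = {"Banana": 0.5, "Orange": 0.3, "Other": 0.2}

# P(feature | class) taken from the counts given in the text.
likelihoods = {
    "Banana": {"Long": 0.8, "Sweet": 0.7, "Yellow": 0.9},
    "Orange": {"Long": 0.0, "Sweet": 0.5, "Yellow": 1.0},
    "Other":  {"Long": 0.5, "Sweet": 0.75, "Yellow": 0.25},
}

def score(cls, features):
    """Unnormalised posterior: P(cls) * product of P(f | cls)."""
    s = priors[cls]
    for f in features:
        s *= likelihoods[cls][f]
    return s

features = ["Long", "Sweet", "Yellow"]
scores = {cls: score(cls, features) for cls in priors}
best = max(scores, key=scores.get)
print(scores)  # Banana: 0.252, Orange: 0.0, Other: 0.01875
print(best)    # Banana
```

Note that P(x) in the denominator is the same for every class, so it can be dropped when only the winning class is needed.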
Some common applications of Naive Bayes:
1. Real-time prediction: Naive Bayes is an eager learning classifier and it is very fast,
so it can be used for making predictions in real time.
2. Multi-class prediction: the algorithm is also well known for its multi-class prediction
capability; it can predict the probability of each of several classes of the target variable.
3. Recommendation systems: a Naive Bayes classifier combined with collaborative filtering
builds a recommendation system that uses machine learning and data mining techniques to
filter unseen information and predict whether a user would like a given resource or not.
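As a sketch of the multi-class prediction use case, scikit-learn's MultinomialNB can return a probability for every class at once (this assumes scikit-learn is installed; the word-count matrix below is toy data invented for illustration):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count features for six documents in three classes (0, 1, 2).
X = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 0], [1, 1, 3], [0, 1, 4]])
y = np.array([0, 0, 1, 1, 2, 2])

clf = MultinomialNB()  # uses Laplace smoothing (alpha=1.0) by default
clf.fit(X, y)

# predict_proba returns one probability per class for each sample.
new_doc = np.array([[2, 0, 1]])
probs = clf.predict_proba(new_doc)
pred = clf.predict(new_doc)
print(probs)  # shape (1, 3): a probability for each of the three classes
print(pred)
```

The new document is dominated by the first word, which appears mostly in class-0 training documents, so class 0 receives the highest probability.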
If the feature properties are conditionally independent given the class, then by Bayes' formula:

P(c | F1, ..., Fn) = P(c) * P(F1|c) * ... * P(Fn|c) / P(F1, ..., Fn)
Laplacian Smoothing
If the classifier encounters a word that has not been seen in the training set, the probability of both classes would become zero and there would be nothing to compare.
This problem is solved by Laplacian smoothing:

P(w | c) = (count(w, c) + k) / (count(c) + k * |V|)

where count(w, c) is the number of occurrences of word w in class c, count(c) is the total
number of words in class c, and |V| is the vocabulary size. Usually k is chosen as 1, so
a new word has equal probability of belonging to either class. Since Bernoulli Naive Bayes
is used, the total number of words in a class is computed differently.
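A minimal sketch of add-k (Laplace) smoothing with k = 1, as in the text; the training counts and vocabulary size here are hypothetical:

```python
def smoothed_prob(word, counts, vocab_size, k=1):
    """P(word | class) = (count(word, class) + k) / (count(class) + k * |V|)."""
    total = sum(counts.values())
    return (counts.get(word, 0) + k) / (total + k * vocab_size)

pos_counts = {"good": 3, "great": 2}  # hypothetical word counts for one class
vocab_size = 6                        # hypothetical vocabulary size |V|

# A word never seen in training still gets a small non-zero probability
# instead of zero, so the class scores remain comparable.
print(smoothed_prob("good", pos_counts, vocab_size))    # (3 + 1) / (5 + 6)
print(smoothed_prob("unseen", pos_counts, vocab_size))  # (0 + 1) / (5 + 6)
```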
The flowchart of Naive Bayes classification is given in [3].
Negation Handling
Negation handling was one of the factors that contributed significantly to the accuracy
of the Naive Bayes classifier. A major problem in sentiment classification is handling
negations. Since each word is used as a feature, the word "good" in the phrase "not good"
will contribute to positive sentiment rather than negative sentiment unless the presence of
"not" before it is taken into account. The algorithm uses a state variable to store the
negation state: it transforms a word that follows "not" or an "n't" ending into "not_" + word.
Whenever the negation state variable is set, the words read are treated as "not_" + word. The
state variable is reset when a punctuation mark is encountered or when there is a double
negation. The pseudocode of the algorithm is given below:
PSEUDO CODE:

negated := False
for each word in document:
    if negated = True and word is not a negation word:
        transform word to "not_" + word
    if word is "not" or word ends with "n't":
        negated := not negated
    if word is a punctuation mark:
        negated := False
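The pseudocode above can be sketched as a small Python function; the choice of `string.punctuation` as the set of scope-closing tokens is an assumption for illustration:

```python
import string

def handle_negation(words):
    """Prefix each word inside a negation scope with "not_". The scope
    opens on "not" or an "n't" ending, closes at a punctuation token,
    and a double negation cancels it."""
    negated = False
    out = []
    for word in words:
        # A punctuation token resets the negation state.
        if all(ch in string.punctuation for ch in word):
            negated = False
            out.append(word)
            continue
        is_negator = word == "not" or word.endswith("n't")
        if negated and not is_negator:
            out.append("not_" + word)
        else:
            out.append(word)
            if is_negator:
                negated = not negated  # double negation flips the state back
    return out

print(handle_negation("the movie was not good , but it wasn't bad either .".split()))
```

With this transformation, "good" and "not_good" become distinct features, so the classifier can learn opposite polarities for them.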
The number of negated forms in the training set might not be adequate for correct classification: many words with strong sentiment may occur only in their normal forms in the training
set, even though their negated forms would carry equally strong (opposite) polarity.
The Naive Bayes classification algorithm is a good choice for this task: it is not only fast but also has high classification precision, which makes it well suited to this type of text categorization. See the results obtained in the paper by Yaguang Wang, Wenlong Fu, Aina Sui and
Yuqing Ding [3].
References
[1].
[2].
www.datasciencecentral.com/profiles/blogs/
[3]. Yaguang Wang, Wenlong Fu, Aina Sui, Yuqing Ding.
[4]. Vivek Narayanan, Ishan Arora, Arjun Bhatia, "Fast and accurate sentiment classification using an enhanced Naive Bayes model."