
Enhancing the Naïve Bayes Spam Filter through Intelligent Text Modification

Mrs. Pallavi N R¹, Spoorthi S², Sumaiya Taj³
¹Assistant Professor, ²,³UG Students
Department of Computer Science and Engineering
BGS Institute of Technology, B G Nagar, Mandya-571448
Pallavinr28@gmail.com, spoorthis19@gmail.com, sumaitaj671@gmail.com

Abstract

Spam emails have been a chronic issue in computer security. They are very costly economically and extremely dangerous for computers and networks. Despite the emergence of social networks and other Internet-based information exchange venues, dependence on email communication has increased over the years, and this dependence has resulted in an urgent need to improve spam filters. Although many spam filters have been created to help prevent spam emails from entering a user’s inbox, there is a lack of research focusing on text modifications. Currently, Naive Bayes is one of the most popular methods of spam classification because of its simplicity and efficiency. Naive Bayes is also very accurate; however, it is unable to correctly classify emails when they contain leetspeak or diacritics. Thus, in this paper, we implement a novel algorithm for enhancing the accuracy of the Naive Bayes spam filter so that it can detect text modifications and correctly classify an email as spam or ham. Our Python algorithm combines semantic-based, keyword-based, and machine learning algorithms to increase the accuracy of Naive Bayes compared to Spamassassin by over two hundred percent. Additionally, we have discovered a relationship between the length of the email and the spam score, indicating that Bayesian poisoning, a controversial topic, is actually a real phenomenon and is utilized by spammers.

Index Terms—Email, Spam, Spam Filter, Bayes Spam Filter, Naive Bayes Classifier, Spamassassin, Text Classification, Bayesian Poisoning

I. INTRODUCTION

As the digitization of communication grows, electronic mail, or email, has become increasingly popular; in 2016, an estimated 2.3 million people used email. In 2015, 205 billion emails were sent and received daily, a figure expected to grow at an annual rate of 3% and reach over 246 billion by 2019. However, the growth in email has also led to an unprecedented increase in the amount of illegitimate mail, or spam (49.7% of emails sent are spam), because current spam detection methods lack an accurate spam classifier. Spam is problematic not only because it is often a carrier of malware, but also because spam emails hoard network bandwidth, storage space, and computational power. Additionally, the commercial world has significant interest in spam detection because spam causes loss of work productivity and financial loss. It is estimated that American firms and consumers lose $20 billion annually, even while sustained by private firms’ investment in anti-spam software. On the other hand, spam advertising earns $200
million per year. Although extensive work has been done on spam filter improvement over the years, many of the spam filters today have limited success because of the dynamic nature of spam. Spammers are constantly developing new techniques to bypass filters, some of which include word obfuscation and statistical poisoning.

Although these two text classification issues are recognized, research today has largely neglected to provide a successful method to improve spam detection by counteracting word obfuscation and Bayesian poisoning, and many common spam filters are unable to detect them.

In the remainder of this paper, we will discuss related methods, definitions, our new method, results, and future work. Section II reviews other machine learning spam filter techniques as well as related work and definitions. Section III proposes a new algorithm that will effectively increase the accuracy of Naive Bayes and reduce false positives, and describes our methodology. In Section IV we describe our implementation, testing results, and performance issues. Concluding remarks and further research work are presented in Section V.

II. BACKGROUND

Many existing methods of spam detection are ineffective, as exemplified by the increase in spam mail. The two categories of email-filtering techniques are knowledge engineering and machine learning. In knowledge engineering, a set of rules to determine legitimacy is established; this is rule-based spam filtration. However, rule-based spam detection has slight disadvantages because this method is based exclusively on spammers’ previous methods, while spammers’ methods are continuously expanding. On the other hand, machine learning methods are customized based on the user and are able to adapt to changing spamming methods, yet are slower.

Another major issue with knowledge engineering spam detection is that, although some of the rules are often characteristic of spam emails, they do not necessarily imply that the message is spam. Since emails are text in the form of strings, they must be converted into objects such as vectors of numbers, or feature vectors, so that there is some measure of similarity between the objects. In this conversion process, there may be loss of information. Feature selection is a prominent yet neglected issue in modern spam filtering because spam and ham emails with the same feature vector will be incorrectly classified, resulting in a high false positive rate, that is, ham emails being misclassified as spam. Thus, most effective spam detection methods utilize some form of machine learning.

Non-machine learning methods of spam detection include using the IP numbers of senders, calculating the correlation of text against a preset list of words used to find spam, and so on. The differentiating characteristic between machine learning and non-machine learning methods is that, after being trained on a data set, the machine is able to make more accurate predictions on its own instead of constantly requiring human programming. Spam detection methods employing machine learning include Naive Bayes, Vector Space Models, clustering, neural networks, and rule-based classification.

A. Naive Bayes

The most effective spam detection methods utilize some form of machine learning. Some machine learning spam filtration methods include Naive
Bayes, Vector Space Models, clustering, neural networks, and rule-based classification. The Naive Bayes spam detection method is a supervised machine learning probabilistic model for spam classification based on Bayes’ theorem. Supervised machine learning is a method of teaching computers without direct programming that uses a known data set (a training set), which contains input and response values, to make predictions on another data set, the testing set. We have chosen Naive Bayes for its speed, multi-class prediction ability, and small training set requirement. Additionally, since Naive Bayes is the baseline for most spam filters, improving Naive Bayes will inevitably improve most spam filters overall.

B. Current Issues

Although the Naive Bayes classifier has high accuracy in testing, several major flaws surface in real-life spam detection, as exemplified by the increase in spam. One way spammers can easily bypass spam filters is by using tokenization attacks. A tokenization attack disrupts the feature selection process by inserting forms of word obfuscation. Different forms of word obfuscation include embedding special characters, using HTML comments, character-entity encoding, or ASCII codes. One important aspect to note here is that although humans are able to discern the actual words, the computer is not. Spam senders are commonly able to bypass spam detectors by using leetspeak and diacritics. Leetspeak is an alternative alphabet that is primarily used on the Internet. Diacritics are accents placed on words to modify their appearance. Leetspeak allows spam senders to change letters into symbols or a series of symbols. For example, “A” can be written as “/-\”. When a word is modified using leetspeak, spam detectors are not able to identify the email as spam, which creates a false negative. Naive Bayes filters are also susceptible to a form of statistical attack called Bayesian poisoning, which occurs when random but harmless words are inserted into spam messages, causing a spam email to be incorrectly classified as ham.

Additionally, there was, and is, speculation about the existence of Bayesian poisoning because it requires knowledge of which words are considered ham. However, studies have shown that in both a passive attack, when the spammer speculates about ham words, and an active attack, when the spammer knows which words are ham words, the performance of Naive Bayes severely decreases. Bayesian poisoning results in a high false negative rate, that is, spam emails incorrectly classified as ham. Furthermore, there are some inherent flaws within Bayes’ theorem and the Naive Bayes classifier itself. Naive Bayes makes the naive assumption that all feature vectors are independent of one another; in other words, Naive Bayes is not able to detect whether certain words or phrases are related.

C. Previous Work

Some advanced methods of text classification that have been proposed include word stemming and preprocessing for Naive Bayes.

Data preprocessing is a data mining technique that transforms raw data into a more understandable format. It is applied to the dataset of emails to account for feature obstruction, tokenization attacks, and Bayesian poisoning. Three of the most popular preprocessing techniques include
word stemming, lemmatization, and stop word removal.

The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Stemming is a heuristic process that removes derivational affixes, while lemmatization is a more refined process that considers the context to return the lemma of the word. Another method is the removal of stop words. The stoplist is a list of 100 of the most common words in the English language, such as “the”, “and”, “of”, etc. Since Naive Bayes is a probability model, these stop words would affect the accuracy of classifying the emails.

A keyword-based statistical method like Naive Bayes relies on the strict lexical correctness of the words. In other words, the Naive Bayes algorithm is unable to detect all forms of word obfuscation. Very little research has been done on the forms of word obfuscation. One previous research paper created a very simple algorithm for word stemming, essentially removing the non-alpha characters, vowels, and repeated characters to get the phonetic syllables. However, although the efficiency gain of Spam-Assassin with their addition was approximately 20%, the study failed to account for word boundary detection or optimization.

The research in proposed a two-step feature selection

III. METHODOLOGY

In order to minimize false positives and increase the accuracy of Naive Bayes, we created an addition to the existing Naive Bayes method. Our addition is able to convert symbols inside words to possible letters, use a spell-check function to ensure the corrected symbol forms a word, and then run the word through the Naive Bayes spam filter.

A. Naive Bayes Classifier

Naive Bayes is known for its simplicity, multi-class predictive ability, and small training set requirement. Naive Bayes uses probabilistic models based on Bayes’ theorem, which states that:

P(c|x) = P(x|c)P(c) / P(x)

where

• P(c|x) is the posterior probability that document c is in a class given evidence x;
• P(x|c) is the conditional probability that evidence x is in document c;
• P(c) is the class prior probability that any document is in the certain class;
• P(x) is the predictor prior probability that evidence x is true.

B. Bayesian Classifier

For spam detection, we employ the Bayesian classifier. There are two classes: let S stand for spam and L stand for ham (legitimate mail). P(c|x) is the a-posteriori probability that, given a message x, category c produced it, if a word is in x:

P(c|x) = P(x|c)P(c) / (P(x|S)P(S) + P(x|L)P(L))

where

• P(c) is the overall probability that a message is spam;
• P(x|c) is the probability that the word appears in spam messages;
• P(L) is the probability that a given message is legitimate mail;
• P(x|L) is the probability that the word is in ham.
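The two-class form above can be checked numerically. The sketch below is illustrative only; the word likelihoods are made-up values, not figures from this paper.

```python
def spam_posterior(p_x_given_s, p_x_given_l, p_s=0.5):
    """P(S|x) = P(x|S)P(S) / (P(x|S)P(S) + P(x|L)P(L)) for the two
    classes S (spam) and L (ham), as in the Bayesian classifier above."""
    p_l = 1.0 - p_s
    return (p_x_given_s * p_s) / (p_x_given_s * p_s + p_x_given_l * p_l)

# A word seen in 80% of spam but only 5% of ham is strong evidence of spam:
print(round(spam_posterior(0.8, 0.05), 3))  # -> 0.941
```

With equal priors the posterior reduces to the likelihood ratio, which is why rare-in-ham words dominate the score.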
C. Multinomial Naive Bayes

We chose the Multinomial Naive Bayes optimization of the Naive Bayes classifier mainly for its multi-class predictive ability. Multinomial Naive Bayes is also more accurate than the other optimization methods. Multinomial Naive Bayes (MNB) models the conditional probability of the evidence tk, or words, within a message d given the class of the message, spam or ham. It assumes that the message is a bag of tokens, or words, such that the order of the tokens is irrelevant. Multinomial Naive Bayes essentially counts the relative occurrences of a particular token within the message to determine the conditional probability:

P(c|d) = P(c) ∏ P(tk|c), for 1 ≤ k ≤ nd

where

• d is the document;
• c is the class, either spam or ham;
• tk is the evidence, where t1 to tnd are tokens.

Theoretically, the best class is determined by multiplying together the probabilities that each individual word is spam, as shown in the equation, to get an overall probability, with probabilities closer to 1 indicating spam. However, there are instances where a spam word does not occur at all in a message; Laplacian smoothing ameliorates this problem.

D. Our Algorithm

Besides preprocessing the data, we developed another algorithm that combines semantic-based, keyword-based, and machine learning methods in Python. First, we preprocess the messages using Algorithm 1, the leetspeak and diacritics preprocessing. Then, we utilize Algorithms 2 and 3 as explained by Algorithm 4. The keyword search command used in Algorithm 4 searches for common spam keywords. Our addition to Naive Bayes is described in the four pseudo-code algorithms below.

Algorithm 1 Leetspeak and Diacritics Preprocessing
1: if a = leetspeak
2: then replace(a, c)
3: if b = diacritic
4: then replace(b, c)

Algorithm 2 TrainMultinomialNB(C, D)
1: V ← ExtractVocabulary(D)
2: N ← CountDocs(D)
3: for each c ∈ C do
4: Nc ← CountDocsInClass(D, c)
5: prior[c] ← Nc / N
6: textc ← ConcatenateTextOfAllDocsInClass(D, c)
7: for each t ∈ V do
8: Tct ← CountTokensOfTerm(textc, t)
9: for each t ∈ V do
10: condprob[t][c] ← (Tct + 1) / Σt′(Tct′ + 1)
11: return V, prior, condprob

Algorithm 3 ApplyMultinomialNB(C, V, prior, condprob, d)
1: W ← ExtractTokensFromDoc(V, d)
2: for each c ∈ C do
3: score[c] ← log prior[c]
4: for each t ∈ W do
5: score[c] += log condprob[t][c]
6: return argmax c∈C score[c]
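Algorithms 2 and 3 follow the standard multinomial Naive Bayes training and application procedure. A compact runnable sketch, with the same add-one (Laplace) smoothing, might look like the following; the toy training messages are invented for illustration.

```python
import math
from collections import Counter

def train_mnb(docs):
    """Train multinomial NB. docs is a list of (tokens, label) pairs.
    Mirrors Algorithm 2, including add-one (Laplace) smoothing."""
    vocab = {t for tokens, _ in docs for t in tokens}
    prior, condprob = {}, {}
    for c in {label for _, label in docs}:
        class_tokens = [t for tokens, label in docs if label == c for t in tokens]
        prior[c] = sum(1 for _, label in docs if label == c) / len(docs)
        counts = Counter(class_tokens)
        denom = len(class_tokens) + len(vocab)  # smoothing denominator
        condprob[c] = {t: (counts[t] + 1) / denom for t in vocab}
    return vocab, prior, condprob

def apply_mnb(vocab, prior, condprob, tokens):
    """Score each class in log space and return the argmax (Algorithm 3)."""
    def score(c):
        return math.log(prior[c]) + sum(
            math.log(condprob[c][t]) for t in tokens if t in vocab)
    return max(prior, key=score)

train = [(["free", "cash", "now"], "spam"),
         (["cheap", "cash", "free"], "spam"),
         (["meeting", "at", "noon"], "ham"),
         (["lunch", "meeting", "tomorrow"], "ham")]
model = train_mnb(train)
print(apply_mnb(*model, ["free", "cash", "now"]))  # -> spam
```

Summing logs instead of multiplying raw probabilities avoids the numerical underflow that the product in the equation above would cause on long messages.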
Fig. 1. Real-Time Testing Environment

Algorithm 4 A Novel Algorithm for Naive Bayes
1: Convert all text to lowercase using lower()
2: Replace leetspeak and diacritics
3: Determine P(x|c) through the training set
4: Use fuzzy matching
5: Use Multinomial Naive Bayes

IV. IMPLEMENTATION, TESTING, AND PERFORMANCE

A. Real-Time Test Environment

We set up a real-time testing suite to verify the viability of our spam filtering algorithm. Our goal is to observe and compare the behavior of the default spam filtering functionality against our new spam filtering algorithm. The test environment consists of a Linux server running the latest Debian Linux. The server hosts an Exim4 email server integrated with the well-known Spamassassin spam detection/filtering software. Exim4 is the default mail transfer agent in Debian Linux, and 56% of all publicly reachable mail servers run Exim4.

The test environment is shown in Figure 1. Each email message is sent twice, once through the normal reception path (which has Spamassassin built in) to User 1, and once again through our Python script to User 2. The Python script that implements the new spam filtering algorithm is embedded into the Exim4 email server. The test setup allows us to easily evaluate the performance of our spam algorithm in a real Internet environment.

B. Spamassassin

Leetspeak spam, diacritics spam, and ham emails were first sent through the Spamassassin server. From Figure 2, the majority of spam scores fall under the orange threshold of 5. Spamassassin classified only 26.9% of the leetspeak emails as spam even though all of the emails were spam. For diacritics emails, Spamassassin assigned a mode score of 0.

Fig. 2. Spamassassin Leetspeak Spam Scores

Fig. 3. Spamassassin Diacritics Results
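The steps of Algorithm 4 can be sketched as a small pipeline. The three helper callables below are hypothetical stand-ins for the components described in Section III, not the exact implementation.

```python
def novel_filter(text, normalize, fuzzy_correct, classify):
    """Sketch of Algorithm 4: lowercase the text, replace leetspeak and
    diacritics, fuzzy-correct near-miss spam words, then apply a trained
    multinomial Naive Bayes classifier via the supplied callable."""
    tokens = text.lower().split()                 # step 1: lowercase
    tokens = [normalize(t) for t in tokens]       # step 2: leetspeak/diacritics
    tokens = [fuzzy_correct(t) for t in tokens]   # step 4: fuzzy matching
    return classify(tokens)                       # steps 3 and 5: trained MNB

# Toy stand-ins, for illustration only:
verdict = novel_filter(
    "FR33 C4SH",
    normalize=lambda t: t.replace("3", "e").replace("4", "a"),
    fuzzy_correct=lambda t: t,
    classify=lambda ts: "spam" if {"free", "cash"} & set(ts) else "ham",
)
print(verdict)  # -> spam
```

Passing the stages in as callables keeps the pipeline order of Algorithm 4 visible while leaving the trained components interchangeable.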
In addition, Figure 3 depicts how the vast majority of spam emails are given a score of 0, falsely indicating non-spam.

Next, we tested ham emails and encountered a similar issue. The distribution in Figure 4 shows that about half the emails received scores indicating ham while the other half received scores indicating spam; furthermore, there are a couple of outliers with extremely high spam scores. Thus, the false positive rate, or the percentage of misclassified ham emails, is 44.66%.

C. Optimal Threshold: Spam Score

Most spam servers have a pre-programmed value that determines whether an email should be classified as spam or ham. Depending on the elements incorporated within different servers, that spam threshold varies. Thus, when proposing a new addition to the spam servers, we tested for the optimal threshold.

D. Optimal Threshold: Fuzzy Matching

Although many elements of accuracy are implemented, such as text reversal and detection, we still have to account for spelling mistakes that cannot register as spam words through the spam servers. For example, our Python code would not be able to detect a small change in the spelling of a key spam word, and would then not count it as a spam word. Thus, we have incorporated a fuzzy matching algorithm that allows a greater degree of misspellings to identify spam words. We found this code matches about 80% of letters in corresponding order to words in the spam dataset.

E. Error Rate and Accuracy

ErrorRate = (NS→L + NL→S) / (NS + NL)

Accuracy = 1 − ErrorRate

where NS→L is the number of false negatives, NL→S is the number of false positives, NS is the total spam, and NL is the total ham. Error rate is the probability of misclassification by a classifier, and accuracy is defined as the percentage of the dataset correctly classified by the method. Using our Python filter we achieved an error rate of 38% and an accuracy of 62%. This is a drastic improvement over the accuracy and error rate of the original Spamassassin, which has an error rate of 76.1% and an accuracy of 23.9%.

V. CONCLUSION AND FURTHER WORK

We proposed a novel algorithm for enhancing the accuracy of the Naive Bayes spam filter. The algorithm was implemented and tested in real-time environments over the Internet. We carried out our testing on two main types of spam obfuscations: leetspeak and diacritics. The algorithm significantly improved the accuracy of the spam server Spamassassin through a new addition coded into the current servers. We analyzed a particular aspect of spam emails that causes many challenges for individuals and companies. We demonstrated that our algorithm consistently reduced the amount of spam emails misclassified as ham. The algorithm not only increased the accuracy of email sorting but also proved to be a valuable addition to the current systems.

Through the Naive Bayes classifier, we were able to improve upon the baseline for spam filtering. Naive Bayes has a very fast processing speed and allows for a small training set, hence it is suitable for real-time spam filtering. Previous research showed that most spam classifiers failed when exposed to text modifications. By creating an addition to Naive Bayes, we were able to improve the multi-class prediction ability.
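The fuzzy matching and error-rate computations above can be sketched as follows. Python's standard `difflib` is one way to get a ratio-based match (the 0.8 cutoff mirrors the 80% letter-match figure above); the keyword list and confusion counts are hypothetical, chosen only to reproduce the reported 38%/62% rates.

```python
import difflib

SPAM_KEYWORDS = ["lottery", "winner", "viagra"]  # hypothetical keyword list

def fuzzy_spam_match(token, threshold=0.8):
    """True if the token matches a spam keyword with similarity >= threshold,
    tolerating the small misspellings discussed in Section IV-D."""
    return bool(difflib.get_close_matches(token.lower(), SPAM_KEYWORDS,
                                          n=1, cutoff=threshold))

def error_rate(n_spam_as_ham, n_ham_as_spam, n_spam, n_ham):
    """ErrorRate = (NS->L + NL->S) / (NS + NL); Accuracy = 1 - ErrorRate."""
    err = (n_spam_as_ham + n_ham_as_spam) / (n_spam + n_ham)
    return err, 1.0 - err

print(fuzzy_spam_match("lotery"))        # one letter dropped, still flagged
err, acc = error_rate(50, 26, 100, 100)  # hypothetical confusion counts
print(err, acc)                          # error rate 38%, accuracy 62%
```

`get_close_matches` ranks candidates by `SequenceMatcher` ratio, so "lotery" still clears the 0.8 cutoff against "lottery" while unrelated words do not.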
Using our improved Spamassassin spam server, we tested hundreds of emails, both spam and ham, on the servers. Our test results showed that the addition did, indeed, drastically improve accuracy, from about 23.9% to 62%. This is almost a 259% increase in accuracy and can save corporations millions of dollars. Thus, the discovery of an addition not only improves the overall detection abilities of Spamassassin, but has enormous impacts on the customers of email servers, potentially preventing millions of dollars in malware damages.

Fig. 4. Spamassassin Ham Emails

Fig. 5. Comparison of Spamassassin Accuracy to Spamassassin with New Algorithm

A. Bayesian Poisoning Phenomenon

As discussed earlier, the viability of Bayesian poisoning, or the addition of harmless stop words to decrease the spam score, is still a debatable topic. In our research we have found an interesting correlation between the length of the email and the spam score. As shown in Figure 6, there is an exponential regression between email length and spam score, meaning that as the length increases, the score decreases. This phenomenon partially confirms that Bayesian poisoning does negatively influence the spam score. Dealing with Bayesian poisoning is still a major challenge.

Fig. 6. Comparison of Length to Score

One of the key features of our research is applying text classification techniques to spam filtering/detection. While email spam detection has fundamental differences from text classification, because the opponents of spam detection are dynamic, our work on machine learning text classification techniques could also be applied to other natural language processing tasks such as news filtering or user preference prediction.

Our findings show that the improvements to the spam email servers can drastically enhance spam filtering and, in turn, create less of a burden on corporations to hand-sort unwanted emails. In addition, the simulation of the spam servers has significant implications for preventing malicious activity and viruses from entering computer systems and costing a company thousands of dollars.

B. Future Work

Since our addition successfully enhances the Naive Bayes spam filter, we will try to
implement the addition onto other machine learning spam filters such as Vector Space Models, clustering, and artificial neural networks. Combining these other methods will allow the improvement of spam detection across many different systems, to ultimately create a well-developed spam detector for text modifications.

We currently have two additional research topics:

1) Speed and Efficiency: While our novel spam algorithm does improve the accuracy of the Naive Bayes classifier, we are currently expanding our research to include speed and efficiency optimization. These two factors are crucial simply because of the sheer number of processed emails, which induces high energy consumption; lowering energy consumption not only increases environmental well-being, but also reduces the cost of these spam filters while maintaining the same accuracy.

2) Expansion to HTML and XML: We plan to extend our techniques from detecting text modification in pure text to other forms of email messages. For example, we will try to improve spam detection for messages encoded in HTML and XML. Currently, many spam emails are sent in HTML/XML to bypass spam detectors. We plan on proposing new methodologies that will have the ability to turn encoded HTML into a computer-friendly message, which will then be sent through our proposed filtering algorithm.

REFERENCES

[1] “A Bayesian approach to filtering junk e-mail,” https://robotics.stanford.edu/users/sahami/papers-dir/spam.pdf, Stanford University, accessed: 2017, 2018.

[2] A. Bhowmick and S. Hazarika, “Machine learning for e-mail spam filtering: review, techniques and trends,” https://arxiv.org/abs/1606.0104, 2016, accessed: 2017.

[3] “An analyst review of Hotmail anti-spam technology,” https://www.lifewire.com, Radicati Group, Inc., 2010, accessed: 2017.

[4] H. Tschabitscher, “How many emails are sent every day?” https://www.lifewire.com/how-many-emails-are-sent-every-day-117121, 2017, accessed: 2017.

[5] K. Tretyakov, “Machine learning techniques in spam filtering,” in Data Mining Problem-Oriented Seminar, 2004.

[6] A. Aski and N. Sourti, “Proposed efficient algorithm to filter spam using machine,” in Pacific Science Review A: Natural Science and Engineering, vol. 18, 2016, pp. 145–149.

[7] J. Rao and D. Reiley, “The economics of spam,” in Journal of Economic Perspectives, vol. 26, no. 3, 2012.

[8] J. Graham-Cumming, “Does Bayesian poisoning exist?” https://www.virusbulletin.com/virusbulletin/2006/02/does-bayesianpoisoning-exist/, 2006.

[9] B. Biggio, P. Laskov, and B. Nelson, “Poisoning attacks against support vector machines,” in Proc. of the 29th International Conference on Machine Learning, 2012, pp. 1467–1474.

[10] J. Eberhardt, “Bayesian spam detection,” in University of Minnesota Morris Digital Well, vol. 2, no. 1, 2015.

[11] K. Asa and L. Eikvil, “Text categorisation: a survey,” https://www.nr.no/eikvil/tm survey.pdf, 1999.

[12] M. Sprengers, “The effects of different Bayesian poison methods on the quality of the Bayesian spam filter,” B.S. thesis, Radboud University Nijmegen, Nijmegen, Netherlands, 2009.

[13] S. Raschka, “Naive Bayes and text classification I: introduction and theory,” https://arxiv.org/abs/1410.5329, Cornell University Library, 2014, accessed: 2017, 2018.

[14] N. NaveenKumar and A. Saritha, “Effective classification of text,” in International Journal of Computer Trends and Technology, vol. 11, no. 1, 2014.

[15] “The shifting tactics of spammers: What you need to know about new email threats,” https://secureemailplus.com/wp/WP10-01-0406 Postini Connections.pdf, Postini, Inc., 2004, accessed: 2017, 2018.

[16] “The shifting tactics of spammers: What you need to know about new email threats,” https://emailmarketing.comm100.com/email-marketingebook/spam-words.aspx, accessed: 2017, 2018.

[17] C. Li and L. Li, “Research and improvement of a spam filter based on naive Bayes,” in Proc. of the 7th International Conference on Intelligent Human-Machine Systems and Cybernetics, 2015.

[18] K. Netti and Y. Rahdika, “A novel method for minimizing loss of accuracy in naive Bayes classifier,” in Proc. of the IEEE International Conference on Computational Intelligence and Computing Research, 2015.

[19] S. Ahmed and F. Mithun, “Word stemming to enhance spam filtering,” in Proc. of the First Conference on Email and Anti-Spam (CEAS), 2004.

[20] G. Qiang, “An effective algorithm for improving the performance of naive Bayes for text classification,” in Proc. of the IEEE International Conference on Computational Intelligence and Computing Research, 2010.

[21] S. Sarkar, G. S., A. A., and J. Aktar, “A novel feature selection technique for text classification using naive Bayes,” International Scholarly Research Notices, vol. 2014, 2014.

[22] “Text classification and naive Bayes, the task of text classification,” https://web.stanford.edu/~jurafsky/slp3/slides/7 NB.pdf, Stanford University, 2011, accessed: 2017, 2018.

[23] G. Cormack, “Email spam filtering: a systematic review, foundation and trends,” in Journal Foundations and Trends in Information Retrieval Archive, vol. 1, no. 4, 2007, pp. 335–455.

[24] “Exim internet mailer,” https://www.exim.org, accessed: 2017, 2018.

[25] “Spamassassin: Welcome to Spamassassin,” https://spamassassin.apache.org, accessed: 2017, 2018.

[26] csmining.org, “Ling-spam datasets - csmining group,” http://csmining.org/index.php/ling-spam-datasets.html, accessed: 2017, 2018.
