Abstract

Spam emails have been a chronic issue in computer security. They are very costly economically and extremely dangerous for computers and networks. Despite the emergence of social networks and other Internet-based information exchange venues, dependence on email communication has increased over the years, and this dependence has resulted in an urgent need to improve spam filters. Although many spam filters have been created to help prevent spam emails from entering a user's inbox, there is a lack of research focusing on text modifications. Currently, Naive Bayes is one of the most popular methods of spam classification because of its simplicity and efficiency. Naive Bayes is also very accurate; however, it is unable to correctly classify emails that contain leetspeak or diacritics. Thus, in this paper, we implement a novel algorithm for enhancing the accuracy of the Naive Bayes spam filter so that it can detect text modifications and correctly classify an email as spam or ham. Our Python algorithm combines semantic-based, keyword-based, and machine learning algorithms to increase the accuracy of Naive Bayes compared to SpamAssassin by over two hundred percent. Additionally, we have discovered a relationship between the length of an email and its spam score, indicating that Bayesian poisoning, a controversial topic, is a real phenomenon that is utilized by spammers.

Index Terms—Email, Spam, Spam Filter, Bayes Spam Filter, Naive Bayes Classifier, SpamAssassin, Text Classification, Bayesian Poisoning

I. INTRODUCTION

As the digitization of communication grows, electronic mail, or email, has become increasingly popular; in 2016, an estimated 2.3 million people used email. In 2015, 205 billion emails were sent and received daily, a figure expected to grow at an annual rate of 3% and to exceed 246 billion by 2019. However, the growth in email has also led to an unprecedented increase in illegitimate mail, or spam (49.7% of emails sent are spam), because current spam detection methods lack an accurate spam classifier. Spam is problematic not only because it is often a carrier of malware, but also because spam emails hoard network bandwidth, storage space, and computational power. Additionally, the commercial world has significant interest in spam detection because spam causes loss of work productivity and financial loss. It is estimated that American firms and consumers lose $20 billion annually, even with private firms' sustained investment in anti-spam software.
On the other hand, spam advertising earns $200 million per year. Although extensive work has been done on spam filter improvement over the years, many of the spam filters in use today have limited success because of the dynamic nature of spam. Spammers are constantly developing new techniques to bypass filters, some of which include word obfuscation and statistical poisoning. Although these two text classification issues are recognized, research today has largely neglected to provide a successful method of improving spam detection by counteracting word obfuscation and Bayesian poisoning, and many common spam filters are unable to detect them.

In the remainder of this paper, we discuss related methods, definitions, our new method, results, and future work. Section II reviews other machine learning spam filter techniques as well as related work and definitions. Section III proposes a new algorithm that effectively increases the accuracy of Naive Bayes and reduces false positives, and describes our methodology. In Section IV we describe our implementation, testing results, and performance. Concluding remarks and further research directions are presented in Section V.

II. BACKGROUND

Many existing methods of spam detection are ineffective, as exemplified by the increase in spam mail. The two categories of email-filtering techniques are knowledge engineering and machine learning. In knowledge engineering, a set of rules to determine legitimacy is established (rule-based spam filtration). However, rule-based spam detection has disadvantages because the method is based exclusively on spammers' previous methods, while spammers' methods are continuously expanding. On the other hand, machine learning methods are customized to the user and are able to adapt to changing spamming methods, yet are slower.

Another major issue with knowledge-engineering spam detection is that, although some of the rules often characterize spam emails, they do not necessarily imply that a message is spam. Since emails are text in the form of strings, they must be converted into objects such as vectors of numbers, or feature vectors, so that there is some measure of similarity between the objects. In this conversion process there may be loss of information. Feature selection is a prominent yet neglected issue in modern spam filtering because spam and ham emails with the same feature vector will be incorrectly classified, resulting in a high false positive rate, that is, ham emails being misclassified as spam. Thus, most effective spam detection methods utilize some form of machine learning.

Non-machine-learning methods of spam detection include using the IP addresses of senders and calculating the correlation of text to a preset list of words used to find spam, among others. The differentiating characteristic between machine learning techniques and non-machine-learning methods is that, after being trained on a data set, the machine is able to make increasingly accurate predictions on its own instead of constantly requiring human programming. Spam detection methods employing machine learning include Naive Bayes, Vector Space Models, clustering, neural networks, and rule-based classification.

A. Naive Bayes

The most effective spam detection methods utilize some form of machine learning.
Machine learning spam filtration methods include Naive Bayes, Vector Space Models, clustering, neural networks, and rule-based classification. The Naive Bayes spam detection method is a supervised machine learning probabilistic model for spam classification based on Bayes' Theorem. Supervised machine learning is a method of teaching computers without direct programming that uses a known data set (a training set), which contains input and response values, to make predictions on another data set (the testing set). We have chosen Naive Bayes for its speed, its multi-class prediction ability, and its small training set requirement. Additionally, since Naive Bayes is the baseline for most spam filters, improving Naive Bayes will inevitably improve most spam filters overall.

B. Current Issues

Although the Naive Bayes classifier has high accuracy in testing, several major flaws surface in real-life spam detection, as exemplified by the increase in spam. One way spammers can easily bypass spam filters is through tokenization attacks. A tokenization attack disrupts the feature selection process by inserting forms of word obfuscation. Forms of word obfuscation include embedding special characters, using HTML comments, character-entity encoding, and ASCII codes. One important aspect to note here is that although humans are able to discern the actual words, the computer is not. Spam senders are able to bypass spam detectors by using leetspeak and diacritics. Leetspeak is an alternative alphabet that is primarily used on the Internet. Diacritics are accents placed on letters that modify their appearance. Leetspeak allows spam senders to change letters into symbols or series of symbols; for example, "A" can be written as "/-\". When a word is modified using leetspeak, spam detectors are not able to identify the email as spam, which creates a false negative (spam passing as ham). Naive Bayes filters are also susceptible to a form of statistical attack called Bayesian poisoning, which occurs when random but harmless words are inserted into spam messages, causing a spam email to be incorrectly classified as ham.

Additionally, there was, and is, speculation about the practicality of Bayesian poisoning because it requires knowledge of which words are considered ham. However, studies have shown that in both a passive attack (when the spammer speculates about ham words) and an active attack (when the spammer knows which words are ham words), the performance of Naive Bayes decreases severely. Bayesian poisoning results in a high false negative rate, that is, spam emails incorrectly classified as ham. Furthermore, there are some inherent flaws in Bayes' Theorem and the Naive Bayes classifier itself: Naive Bayes makes the naive assumption that all features are independent of one another; in other words, Naive Bayes is not able to detect whether certain words or phrases are related.

C. Previous Work

Some advanced methods of text classification that have been proposed include word stemming and preprocessing for Naive Bayes.

Data preprocessing is a data mining technique that transforms raw data into a more understandable format. It is applied to the dataset of emails to account for feature obfuscation, tokenization attacks, and Bayesian poisoning. Three of the most popular preprocessing techniques are word stemming, lemmatization, and stop word removal.
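The leetspeak and diacritics normalization discussed above can be sketched in a few lines of Python. The leetspeak table and function names below are illustrative assumptions of ours, not the paper's code; a real filter would use a much larger substitution table:

```python
import unicodedata

# Small illustrative leetspeak table; a production filter would use a larger one.
LEET_MAP = {"/-\\": "a", "|3": "b", "3": "e", "1": "l",
            "0": "o", "5": "s", "7": "t", "@": "a", "$": "s"}

def strip_diacritics(text):
    # Decompose accented characters (e.g. "e" + combining acute accent)
    # and drop the combining marks, so "café" becomes "cafe".
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def replace_leetspeak(text):
    # Replace longer tokens first so "/-\" wins over single characters.
    for token in sorted(LEET_MAP, key=len, reverse=True):
        text = text.replace(token, LEET_MAP[token])
    return text

def normalize(text):
    return replace_leetspeak(strip_diacritics(text.lower()))
```

After normalization, an obfuscated token such as "v1agr@" collapses back to plain letters and can be matched by an ordinary keyword or Bayesian check.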
The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Stemming is a heuristic process that removes derivational affixes, while lemmatization is a more refined process that considers context to return the lemma of the word. Another method is the removal of stop words. The stoplist is a list of roughly 100 of the most common words in the English language, such as "the", "and", and "of". Since Naive Bayes is a probability model, these stop words would distort the accuracy of classification.

A keyword-based statistical method like Naive Bayes relies on the strict lexical correctness of words; in other words, the Naive Bayes algorithm is unable to detect all forms of word obfuscation, and very little research has been done on these forms of word obfuscation. One previous research paper created a very simple algorithm for word stemming, essentially removing non-alpha characters, vowels, and repeated characters to obtain phonetic syllables. However, although the efficiency gain over SpamAssassin from their addition was approximately 20%, the study failed to account for word boundary detection or optimization. Other research has proposed a two-step feature selection method.

III. METHODOLOGY

In order to minimize false positives and increase the accuracy of Naive Bayes, we created an addition to the existing Naive Bayes method. Our addition converts symbols inside words to possible letters, uses a spell-check function to ensure the corrected symbols form a word, and then runs the word through the Naive Bayes spam filter.

A. Naive Bayes Classifier

Naive Bayes is known for its simplicity, multi-class predictive ability, and small training set requirement. Naive Bayes uses probabilistic models based on Bayes' Theorem, which states that:

P(c|x) = P(x|c)P(c) / P(x)

where

• P(c|x) is the posterior probability that document c is in a class given evidence x;

• P(x|c) is the conditional probability that evidence x is in document c;

• P(c) is the class prior probability that any document is in the given class;

• P(x) is the predictor prior probability that evidence x is true.

B. Bayesian Classifier

For spam detection, we employ the Bayesian classifier. There are two classes: let S stand for spam and L for ham (legitimate mail). P(c|x) is the probability that, given a message x containing a given word, category c produced it:

P(c|x) = P(x|c)P(c) / (P(x|S)P(S) + P(x|L)P(L))

where

• P(S) is the overall probability that a message is spam;

• P(x|S) is the probability that the word appears in spam messages;

• P(L) is the probability that a given message is legitimate mail;

• P(x|L) is the probability that the word appears in ham.
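The two-class formula above can be made concrete with a small numeric sketch. The word statistics here are invented for illustration, not taken from the paper's dataset:

```python
def posterior_spam(p_x_given_s, p_x_given_l, p_s):
    # P(S|x) = P(x|S)P(S) / (P(x|S)P(S) + P(x|L)P(L)), with P(L) = 1 - P(S).
    p_l = 1.0 - p_s
    return (p_x_given_s * p_s) / (p_x_given_s * p_s + p_x_given_l * p_l)

# Suppose the word "winner" appears in 50% of spam and 1% of ham,
# and half of all incoming mail is spam:
p = posterior_spam(0.50, 0.01, 0.50)  # ≈ 0.98, strong evidence of spam
```

When a word is equally common in spam and ham, the same formula returns the prior P(S) unchanged, which is why stop words contribute nothing and are removed in preprocessing.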
C. Multinomial Naive Bayes

We chose the Multinomial Naive Bayes optimization of the Naive Bayes classifier mainly for its multi-class predictive ability. Multinomial Naive Bayes is also more accurate than the other optimization methods. Multinomial Naive Bayes (MNB) models the conditional probability of the evidence tk, the words within a message d, given the class of the message, spam or ham. It assumes that the message is a bag of tokens or words, such that the order of the tokens is irrelevant. Multinomial Naive Bayes essentially counts the relative occurrences of a particular token within the message to determine the conditional probability:

P(c|d) ∝ P(c) ∏ P(tk|c),  1 ≤ k ≤ nd

where

• d is the document;

• c is the class, either spam or ham;

• tk is the evidence, where t1 to tnd are the tokens.

Theoretically, the best class is determined by multiplying together, as shown in the equation, the probabilities that each individual word is spam, to obtain an overall probability, with probabilities closer to 1 indicating spam. However, there are instances where a spam word does not occur at all in a message; Laplacian smoothing ameliorates this problem.

D. Our Algorithm

Besides preprocessing the data, we developed another algorithm, implemented in Python, that combines semantic-based, keyword-based, and machine learning approaches. First, we preprocess the messages using Algorithm 1, the leetspeak and diacritics preprocessing. Then we utilize Algorithms 2 and 3 as explained by Algorithm 4. The keyword search command used in Algorithm 4 searches for common spam keywords. Our addition to Naive Bayes is described in the four pseudo-code algorithms below.

Algorithm 1 Leetspeak and Diacritics Preprocessing
1: if a = leetspeak
2: then replace(a, c)
3: if b = diacritic
4: then replace(b, c)

Algorithm 2 TrainMultinomialNB(C, D)
1: V ← ExtractVocabulary(D)
2: N ← CountDocs(D)
3: for each c ∈ C do
4: Nc ← CountDocsInClass(D, c)
5: prior[c] ← Nc / N
6: textc ← ConcatenateTextOfAllDocsInClass(D, c)
7: for each t ∈ V do
8: Tct ← CountTokensOfTerm(textc, t)
9: for each t ∈ V do
10: condprob[t][c] ← (Tct + 1) / Σt′ (Tct′ + 1)
11: return V, prior, condprob

Algorithm 3 ApplyMultinomialNB(C, V, prior, condprob, d)
1: W ← ExtractTokensFromDoc(V, d)
2: for each c ∈ C do
3: score[c] ← log prior[c]
4: for each t ∈ W do
5: score[c] += log condprob[t][c]
6: return argmax c∈C score[c]
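As a concrete illustration, the training and scoring procedures of Algorithms 2 and 3 can be rendered as a short Python sketch. This is a textbook-style implementation with Laplace (add-one) smoothing; the function and variable names are ours, not the paper's:

```python
import math
from collections import Counter

def train_mnb(docs):
    # docs: list of (token_list, label) pairs, with labels such as "spam"/"ham".
    classes = {label for _, label in docs}
    vocab = {t for tokens, _ in docs for t in tokens}
    prior, condprob = {}, {}
    for c in classes:
        in_class = [tokens for tokens, label in docs if label == c]
        prior[c] = len(in_class) / len(docs)
        counts = Counter(t for tokens in in_class for t in tokens)
        # Laplace smoothing: unseen words never zero out an entire class.
        total = sum(counts.values()) + len(vocab)
        condprob[c] = {t: (counts[t] + 1) / total for t in vocab}
    return vocab, prior, condprob

def apply_mnb(vocab, prior, condprob, tokens):
    # Sum log probabilities (as in Algorithm 3) and pick the best class.
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c])
        for t in tokens:
            if t in vocab:
                scores[c] += math.log(condprob[c][t])
    return max(scores, key=scores.get)
```

Working in log space avoids numeric underflow when multiplying many small per-word probabilities, which is why Algorithm 3 adds log condprob values rather than multiplying raw probabilities.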
Algorithm 4 A Novel Algorithm for Naive Bayes
1: Convert all text to lowercase using lower()
2: Replace leetspeak and diacritics
3: Determine P(x|c) through the training set
4: Use fuzzy matching
5: Use Multinomial Naive Bayes

IV. IMPLEMENTATION, TESTING, AND PERFORMANCE

A. Real-Time Test Environment

We set up a real-time testing suite to verify the viability of our spam filtering algorithm. Our goal is to observe and compare the behavior of the default spam filtering functionality with our new spam filtering algorithm. The test environment consists of a Linux server running the latest Debian Linux. The Linux server hosts an Exim4 email server integrated with the well-known SpamAssassin spam detection/filtering software. Exim4 is the default mail transfer agent in Debian Linux, and 56% of all publicly reachable mail servers run Exim4.

The test environment is shown in Figure 1. Each email message is sent twice, once through the normal reception path (which has SpamAssassin built in) to User 1, and once again through our Python script to User 2. The Python script that implements the new spam filtering algorithm is embedded into the Exim4 email server. This setup allows us to easily evaluate the performance of our spam algorithm in a real Internet environment.

Fig. 1. Real-Time Testing Environment

B. SpamAssassin

Leetspeak spam, diacritics spam, and ham emails were first sent through the SpamAssassin server. As Figure 2 shows, the majority of spam scores fall under the spam threshold of 5. SpamAssassin classified only 26.9% of the leetspeak emails as spam even though all of the emails were spam. For diacritics emails, SpamAssassin assigned a mode score of 0.

Fig. 2. SpamAssassin Leetspeak Spam Scores

Fig. 3. SpamAssassin Diacritics Results
In addition, Figure 3 depicts how the vast majority of diacritics spam emails are given a score of 0, falsely indicating non-spam.

Next, we tested ham emails and encountered a similar issue. The distribution in Figure 4 shows that about half the emails received scores indicating ham while the other half received scores indicating spam; furthermore, there are a couple of outliers with extremely high spam scores. Thus, the false positive rate, or the percentage of misclassified ham emails, is 44.66%.

C. Optimal Threshold: Spam Score

Most spam servers have a pre-programmed value that determines whether an email should be classified as spam or ham. Depending on the elements incorporated within different servers, that spam threshold varies. Thus, when proposing a new addition to the spam servers, we tested for the optimal threshold.

D. Optimal Threshold: Fuzzy Matching

Although many elements of accuracy, such as text reversal and detection, are implemented, we still have to account for spelling mistakes that do not register as spam words in the spam servers. For example, our Python code would not be able to detect a small change in the spelling of a key spam word and would then not count it as a spam word. Thus, we incorporated a fuzzy matching algorithm that allows a greater degree of misspellings to identify spam words. We found this code matches about 80% of letters, in corresponding order, to words in the spam dataset.

E. Error Rate and Accuracy

ErrorRate = (NS→L + NL→S) / (NS + NL)

Accuracy = 1 − ErrorRate

where NS→L is the number of false negatives, NL→S is the number of false positives, NS is the total number of spam emails, and NL is the total number of ham emails. The error rate is the probability of misclassification by a classifier, and accuracy is defined as the percentage of the dataset correctly classified by the method. Using our Python filter we achieved an error rate of 38% and an accuracy of 62%. This is a drastic improvement over the original SpamAssassin, which has an error rate of 76.1% and an accuracy of 23.9%.

V. CONCLUSION AND FURTHER WORK

We proposed a novel algorithm for enhancing the accuracy of the Naive Bayes spam filter. The algorithm was implemented and tested in a real-time environment over the Internet. We carried out our testing on two main types of spam obfuscation: leetspeak and diacritics. The algorithm significantly improved the accuracy of the spam server SpamAssassin through a new addition to the current servers. We analyzed a particular aspect of spam emails that causes many challenges for individuals and companies, and demonstrated that our algorithm consistently reduced the number of spam emails misclassified as ham. The algorithm not only increased the accuracy of email sorting but also proved to be a valuable addition to current systems.

Through the Naive Bayes classifier, we were able to improve upon the baseline for spam filtering. Naive Bayes has a very fast processing speed and allows for a small training set, and is hence suitable for real-time spam filtering. Previous research showed that most spam classifiers failed when exposed to text modifications. By creating an addition to Naive Bayes, we were able to improve its multi-class prediction ability.
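The fuzzy matching step described in Section IV-D can be illustrated with Python's standard difflib; the 0.8 cutoff mirrors the roughly 80% letter-match figure reported there, and the keyword list is a hypothetical stand-in for the spam dataset:

```python
import difflib

SPAM_KEYWORDS = ["viagra", "winner", "lottery", "free"]  # hypothetical list

def fuzzy_spam_words(tokens, cutoff=0.8):
    # Collect tokens that approximately match a known spam keyword,
    # so small misspellings like "winnner" still register as spam words.
    hits = []
    for token in tokens:
        if difflib.get_close_matches(token, SPAM_KEYWORDS, n=1, cutoff=cutoff):
            hits.append(token)
    return hits
```

difflib's ratio counts matching characters in corresponding order, which is the same notion of similarity the 80% figure above refers to.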
Using our improved SpamAssassin spam server, we tested hundreds of emails, both spam and ham, on the servers. Our test results showed that the addition did, indeed, drastically improve accuracy, from about 23.9% to 62%, almost a threefold improvement. We also discovered a relationship between the length of the email and the spam score. As shown in Figure 6, there is an exponential regression between email length and spam score, meaning that as the length increases, the score decreases. This phenomenon partially confirms that Bayesian poisoning does negatively influence the spam score. Dealing with Bayesian poisoning is still a major challenge.

Fig. 4. SpamAssassin Ham Emails

One of the key features of our research is applying text classification techniques to spam filtering/detection. While email spam detection has fundamental differences from text classification, because the opponents of spam detection are dynamic, our work on machine learning text classification techniques could also be applied to other natural language processing tasks.