
E-mail Spam Filter

Xi Chen, Jiawen Shen, Shidi Xu, Yiwen Chen, Guanyu Lu

Introduction

Background

Data Description
5762 observations of emails with 30 variables that describe email characteristics:
- Logical variables, e.g. isRe describes whether the string "Re" appears in the title
- Numeric variables, e.g. bodyCharacterCount counts the number of characters in the email body
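As an illustration of what such variables look like, here is a minimal sketch that derives a few of them from a raw subject and body. The helper `extract_features` and its exact definitions (case-sensitive match on "Re", capitals measured over the letters of the subject) are assumptions made for this example; the data set's own definitions may differ.

```python
def extract_features(subject: str, body: str) -> dict:
    """Compute a few illustrative email features (hypothetical helper)."""
    letters = [ch for ch in subject if ch.isalpha()]
    capitals = [ch for ch in letters if ch.isupper()]
    return {
        # Logical variable: does the subject contain the string "Re"?
        # (case-sensitive substring match is an assumption)
        "isRe": "Re" in subject,
        # Numeric variable: number of characters in the body
        "bodyCharacterCount": len(body),
        # Numeric variable: number of "!" characters in the subject
        "subjectExclamationCount": subject.count("!"),
        # Numeric variable: percentage of subject letters that are capitals
        "percentCapitals": 100.0 * len(capitals) / len(letters) if letters else 0.0,
    }


example = extract_features("Re: Meeting notes", "See you at 3pm.")
```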

Email spam, also known as junk mail, is a subset of spam in which the same message is sent to multiple recipients by email without authorization. In 2011, spam averaged 78% of all email sent. A 2007 survey estimated that lost productivity costs Internet users in the United States $21.58 billion annually. For this project, our purpose is to find the combination of variables that best predicts spam.

A world in need of a good spam filter:
- Spam averages 78% of all email sent in 2011
- 12.4 billion spam mails are sent daily
- Spam cost to all non-corp Internet users: $255 million in 2007
- 16% email address changes due to email spam
- Wasted corporate time per spam email: 4-5 seconds

Data Analysis

Single variables analysis


- isRe is a strong predictor: an email that is a reply is less likely to be spam than one that is not (figure 1)
- Priority is a weak predictor of spam: the proportion of spam with priority and the proportion of spam without priority are the same (figure 2)
- hourSent is a weak predictor: the distributions of ham and spam are similar (center at 2.6, upper tail at 3.0, lower tail at around 1.0) (figure 3)
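A single-variable check of this kind amounts to comparing the proportion of spam across the values of one variable. The sketch below shows the computation on made-up toy records (not the poster's data); the function name `spam_rate_by` and the label field `isSpam` are assumptions for illustration.

```python
from collections import Counter


def spam_rate_by(records, variable):
    """Return {value: proportion of spam} for one variable."""
    totals, spam = Counter(), Counter()
    for rec in records:
        value = rec[variable]
        totals[value] += 1
        if rec["isSpam"]:
            spam[value] += 1
    return {value: spam[value] / totals[value] for value in totals}


# Made-up toy records, NOT the poster's data.
toy = [
    {"isRe": True, "isSpam": False},
    {"isRe": True, "isSpam": False},
    {"isRe": True, "isSpam": True},
    {"isRe": False, "isSpam": True},
    {"isRe": False, "isSpam": True},
    {"isRe": False, "isSpam": False},
    {"isRe": False, "isSpam": True},
]
rates = spam_rate_by(toy, "isRe")
```

A large gap between the two rates suggests a strong predictor; near-equal rates, as reported for priority, suggest a weak one.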
Figure 2

Filter Construction
Three approaches tested to make a spam filter:
Generalized Linear Model
- Hard to use because data contain NAs
Contingency Table
- Fairly accurate for certain cases where obvious predictors stand out
- Yet cannot handle the majority of emails, for it makes arbitrary decisions
Classification Tree
- Best at identifying emails in general
- Except some cases that are handled well by the Contingency Table
Final Filter: Contingency Table and Classification Tree
1. Contingency table is first used to filter out highly confident cases
2. The rest of the emails are passed to the classification tree
3. The two models combined are tested to outperform any single classification method
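The two-stage idea can be sketched as a cascade: confident rules decide first, and everything else falls through to a second classifier. The rules below reuse thresholds from the single-variable findings purely for illustration; the poster does not give the actual rules, and `tree_stub` is a hypothetical placeholder standing in for a fitted classification tree.

```python
from typing import Callable, Optional

Email = dict  # feature name -> value; a missing value (NA) is stored as None


def _exceeds(email: Email, name: str, threshold: float) -> bool:
    """True only if the feature is present (not None) and above the threshold."""
    value = email.get(name)
    return value is not None and value > threshold


def confident_rules(email: Email) -> Optional[str]:
    """Stage 1: return 'ham' or 'spam' for clear-cut cases, else None.

    Illustrative rules only; the poster's exact rules are not stated.
    """
    if email.get("isInReplyTo"):
        return "ham"
    if _exceeds(email, "subjectExclamationCount", 1):
        return "spam"
    if _exceeds(email, "numRecipients", 3):
        return "spam"
    return None


def tree_stub(email: Email) -> str:
    """Stage 2 placeholder for a fitted classification tree (hypothetical)."""
    return "spam" if _exceeds(email, "percentCapitals", 30) else "ham"


def combined_filter(email: Email, fallback: Callable[[Email], str] = tree_stub) -> str:
    """Apply the confident rules first, then fall back to the second model."""
    decision = confident_rules(email)
    return decision if decision is not None else fallback(email)
```

Treating a missing value as "rule does not fire" is one simple way to cope with NAs in the first stage; other choices are possible.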
Figure 6

Figure 1

percentForwards is a strong predictor: emails with more than 20% forwards are spam (figure 4)

Strong predictors
- isRe
- isInReplyTo: almost all emails with a subject in reply to another email are ham; emails without isInReplyTo have a 40% chance of being spam
- sortedRecipients: nearly all spam emails have sorted recipients
- fromNumericEnd: 35% of emails whose headers end in a number are spam

Figure 3

Figure 4

- percentForwards
- subjectExclamationCount: emails with more than 1 exclamation mark in the subject are very likely to be spam
- numRecipients: emails with more than 3 recipients are spam
- percentCapitals: emails with more than 30% capital letters in the subject line are likely to be spam

The model is tuned on a tuning group and confirmed by consistent performance on a separate testing group.
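The counts reported for the filter on this poster (873 ham kept and 50 ham flagged; 69 spam missed and 270 spam caught) can be turned into summary rates. The sketch below does only that arithmetic; the variable names are assumptions, and the poster does not say which group of emails these counts come from.

```python
# Counts as reported on the poster.
ham_kept, ham_flagged = 873, 50      # actual ham: filtered as ham / as spam
spam_missed, spam_caught = 69, 270   # actual spam: filtered as ham / as spam

total = ham_kept + ham_flagged + spam_missed + spam_caught

# Share of all emails that were filtered correctly.
accuracy = (ham_kept + spam_caught) / total
# Share of ham wrongly filtered as spam (the poster's 5.4% of ham).
false_positive_rate = ham_flagged / (ham_kept + ham_flagged)
# Share of spam correctly filtered as spam (the poster's 79.6% of spam).
spam_recall = spam_caught / (spam_missed + spam_caught)
# Share of emails filtered as spam that really are spam.
spam_precision = spam_caught / (spam_caught + ham_flagged)
```

The overall accuracy works out to about 90.6%, and about 84.4% of the emails flagged as spam are actually spam.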

Multiple variables analysis


- High percentage of forwards: very likely a ham
- Percentage of capitals > 11%, and in reply to another email: surely a ham
- Small percentage of capitals, from an address ending with a number, and large percentage of blanks in subject: surely a spam
- Seldom-forwarded email from an address that does not end with a number, small percentage of capitals, small number of body characters: very likely a ham
- Percentage of capitals < 11%, original address ending with a number, and extremely large number of body characters: very likely a ham
- Percentage of blanks in subject < 34%, seldom-forwarded email from an address not ending with a number, large number of body characters, extremely small percentage of capitals: likely a ham

Good predictors: percentForwards, percentCapitals, and isInReplyTo

        Filtered to be Ham      Filtered to be Spam
Ham     873 (94.6% of Ham)      50 (5.4% of Ham)
Spam    69 (20.4% of Spam)      270 (79.6% of Spam)

Conclusion

- Strong predictors are: isRe, numRecipients, bodyCharacterCount, isInReplyTo, sortedRecipients, fromNumericEnd, percentCapitals, percentSubjectBlanks and percentForward
- A classification tree works the best among all 3 methods of classification
- Prediction could be inaccurate due to the inherent limitations of the classification model and the limited number of variables available

Reference: Wikipedia, Time, Spam Links
