Introduction
Email spam, also known as junk mail, is a subset of spam in which the same message is sent to multiple recipients by email without authorization. In 2011, spam averaged 78% of all email sent. In 2007, a survey estimated that lost productivity costs Internet users in the United States $21.58 billion annually. For this project, our purpose is to find the combination of variables that best predicts spam.
Background
A world in need of a good spam filter:
- Spam averaged 78% of all email sent in 2011
- 12.4 billion spam emails are sent daily
- Spam cost all non-corporate Internet users $255 million in 2007
- 16% of email address changes are due to email spam
- Wasted corporate time per spam email: 4-5 seconds
Data Description
5762 observations of emails with 30 variables that describe email characteristics:
- Logical variables, e.g. isRe describes whether the string "Re" appears in the title
- Numeric variables, e.g. bodyCharacterCount counts the number of characters in the email body
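To make the two example variables from the data description concrete, the sketch below derives isRe and bodyCharacterCount from a raw subject and body. It is only an illustration: the `EmailFeatures` class and `extract_features` function are hypothetical names, the exact rule used to build isRe in the data set may differ, and `None` stands in for NA.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmailFeatures:
    # Logical variable: does the subject contain the string "Re"?
    # None stands in for NA when the email has no subject.
    is_re: Optional[bool]
    # Numeric variable: number of characters in the email body.
    body_character_count: int


def extract_features(subject: Optional[str], body: str) -> EmailFeatures:
    """Derive two of the 30 variables from a raw email (illustrative only)."""
    is_re = None if subject is None else ("Re" in subject)
    return EmailFeatures(is_re=is_re, body_character_count=len(body))


# Made-up example email.
example = extract_features("Re: meeting notes", "See you at 10.")
```

A plain substring test also matches subjects such as "Report", so a real extraction would probably need a stricter pattern.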
Data Analysis
Filter Construction
Three approaches were tested to make a spam filter:
- Generalized Linear Model
  - Hard to use because the data contain NAs
- Contingency Table
  - Fairly accurate for certain cases where obvious predictors stand out
  - Yet cannot handle the majority of emails, for which it makes arbitrary decisions
- Classification Tree
  - Best at identifying emails in general
  - Except for some cases that the Contingency Table handles well
Final Filter: Contingency Table and Classification Tree
1. The contingency table is first used to filter out high-confidence cases
2. The rest of the emails are passed to the classification tree
3. In testing, the two models combined outperformed any single classification method
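The two-stage design of the final filter can be sketched as follows. This is a minimal illustration, not the project's implementation. The stage-1 rules reuse two thresholds quoted on this poster (percentForwards above 20%, isInReplyTo), but which rules the actual contingency stage applied, and in what order, is an assumption. `toy_tree` is a hypothetical stand-in for the fitted classification tree, and `None` represents NA.

```python
from typing import Callable, Dict, Optional

Email = Dict[str, Optional[float]]  # variable name -> value, None for NA


def contingency_stage(email: Email) -> Optional[str]:
    """Stage 1: return a label only for high-confidence cases, else None."""
    forwards = email.get("percentForwards")
    if forwards is not None and forwards > 20:  # assumed rule, threshold from the poster
        return "spam"
    if email.get("isInReplyTo"):  # assumed rule: replies are almost always ham
        return "ham"
    return None  # not confident: defer to the tree


def combined_filter(email: Email, tree: Callable[[Email], str]) -> str:
    """Stage 2: fall back to the classification tree for undecided emails."""
    decision = contingency_stage(email)
    return decision if decision is not None else tree(email)


def toy_tree(email: Email) -> str:
    # Hypothetical stand-in for the fitted classification tree.
    capitals = email.get("percentCapitals")
    return "spam" if capitals is not None and capitals > 30 else "ham"


label = combined_filter({"percentForwards": 35.0}, toy_tree)
```

Passing the tree in as a function keeps the two stages independent, so either one can be tuned or replaced without touching the other.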
Figure 6
Figure 1
percentForwards is a strong predictor: emails with more than 20% forwards are spam (Figure 4).
Strong predictors:
- isRe, isInReplyTo: almost all emails whose subject is a reply to another email are ham; emails without isInReplyTo have a 40% chance of being spam
- sortedRecipients: nearly all spam emails have sorted recipients
- fromNumericEnd: 35% of emails whose sender address ends in a number are spam
Figure 3
Figure 4
- percentForwards (Figure 4)
- subjectExclamationCount: emails with more than 1 exclamation mark in the subject are very likely to be spam
- numRecipients: emails with more than 3 recipients are spam
- percentCapitals: emails with more than 30% capital letters in the subject line are likely to be spam
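Figures such as "emails without isInReplyTo have a 40% chance of being spam" come from cross-tabulating a predictor against the spam label. The sketch below shows one way to compute such a rate for a logical predictor. The function name `spam_rate_by_value` is hypothetical, and the toy rows are made up so that the rate for emails without isInReplyTo comes out at 40%; they are not the project data.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def spam_rate_by_value(rows: Iterable[Tuple[Optional[bool], bool]]) -> Dict[bool, float]:
    """Cross-tabulate a logical predictor against the spam label.

    Each row is (predictor_value, is_spam). Rows whose predictor is None (NA)
    are skipped. Returns {predictor_value: fraction of those emails that are spam}.
    """
    counts = Counter((value, is_spam) for value, is_spam in rows if value is not None)
    rates = {}
    for value in (True, False):
        total = counts[(value, True)] + counts[(value, False)]
        if total:
            rates[value] = counts[(value, True)] / total
    return rates


# Made-up toy data: (isInReplyTo, is_spam), including one NA row.
toy_rows = (
    [(True, False)] * 9 + [(True, True)] * 1
    + [(False, True)] * 4 + [(False, False)] * 6
    + [(None, True)]
)
rates = spam_rate_by_value(toy_rows)
```

Skipping NA rows is one simple choice; it sidesteps the missing-value problem noted for the generalized linear model but discards those emails from the estimate.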
The model was tuned on a tuning group and validated by its consistent performance on a separate testing group.
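The tuning and testing procedure can be sketched as a shuffled split followed by per-class percentages of the kind this poster reports for spam (20.4% and 79.6%). The 60/20/20 proportions, the seed, and the function names are assumptions made for illustration. Reading the reported counts 69 and 270 as "spam predicted ham" and "spam predicted spam" is also an interpretation of the results table.

```python
import random
from typing import Dict, List, Sequence, Tuple


def three_way_split(items: Sequence, seed: int = 0,
                    fractions: Tuple[float, float] = (0.6, 0.2)) -> Tuple[List, List, List]:
    """Shuffle and split into training, tuning, and testing groups."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * fractions[0])
    n_tune = int(len(shuffled) * fractions[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_tune],
            shuffled[n_train + n_tune:])


def row_percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Turn one row of a confusion matrix into percentages of that class."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Indices of the 5762 observations; the split proportions are assumed.
train, tune, test = three_way_split(range(5762))

# Spam row using the counts reported on the poster (column meaning assumed).
spam_row = row_percentages({"predicted ham": 69, "predicted spam": 270})
```

Keeping the testing group untouched until tuning is finished is what makes its performance a fair check on the tuned model.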
Ham: 94.6% of ham classified as ham
Spam: 69 (20.4% of spam) classified as ham, 270 (79.6% of spam) classified as spam
Leaves of the classification tree:
- surely a ham: small percentage of capitals, from an address ending with a number, and a large percentage of blanks in the subject
- surely a spam: seldom-forwarded email from an address that does not end with a number, small percentage of capitals, small body character count
- very likely a ham: percentage of capitals < 11%, original address ending with a number, and an extremely large body character count
- very likely a ham: percentage of blanks in subject < 34%, seldom-forwarded email from an address not ending with a number, large body character count, extremely small percentage of capitals
- likely a ham
Conclusion
- Strong predictors are isRe, numRecipients, bodyCharacterCount, isInReplyTo, sortedRecipients, fromNumericEnd, percentCapitals, percentSubjectBlanks, and percentForwards
- A classification tree works best among the 3 classification methods
- Predictions could be inaccurate due to the inherent limitations of the classification model and the limited number of variables available
- Good predictors: percentForwards, percentCapitals, and isInReplyTo
References: Wikipedia, Time, Spam Links