• Abstract
• Problem Definition
• Objective
• Introduction
• Literature Survey
• Content Diagram
• Conclusion
ABSTRACT
• Spam is one of the major threats posed to email users. Spam refers to the use of electronic
messaging systems to send unsolicited or unwanted messages in bulk.
• The privacy and security of large amounts of sensitive data are threatened by malicious spam.
• Data mining offers many approaches and algorithms for email filtering. A classifier is a
supervised function where the learned attribute is categorical.
• Content-based methods analyze the content of the email to determine whether the email is spam
or not.
PROBLEM DEFINITION
• Each email provider has its own filtering system or technique, but most of them don’t work to their
full capacity.
• There is no integrated approach for classifying and filtering a mail as spam or legitimate.
• The aim is to compare the accuracy of the classifiers and identify the best classifier for
detecting spam mails.
• We use different data mining classifiers, such as Naive Bayes, KNN, Decision
Tree, and Logistic Regression, to categorize a mail as spam or legitimate.
INTRODUCTION
• Spam emails are emails that the receiver does not wish to receive. Spam messages not
only increase network traffic and consume memory space but can also be used to carry out
attacks.
• A large number of identical messages are sent to many email recipients. The increasing
volume of such spam emails is causing serious problems for internet users, Internet Service
Providers, and the whole Internet backbone network.
• The spam filter is a basic filter mechanism that is used to control spam in emails. It does not
allow strangers, that is, senders who are not pre-approved, to send an email to the user directly. In
response to an email sent by such an unapproved sender, it asks the sender to validate themselves before
the email is passed on.
• The logic behind this strategy is that spammers do not have time to validate their own
email ID for each of the thousands of emails they might have sent.
• Dealing with spam and classifying it is a very difficult task. Moreover, a single model cannot tackle the
problem, since spam is constantly evolving and is often actively tailored to evade detection,
which further impedes accurate detection.
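The challenge-response strategy described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `ChallengeResponseFilter`, its methods, and the example addresses are hypothetical and do not come from any particular mail system.

```python
class ChallengeResponseFilter:
    """Toy challenge-response filter: unknown senders must validate first.

    All names here are hypothetical and chosen for illustration only.
    """

    def __init__(self, approved=None):
        self.approved = set(approved or [])  # pre-approved sender addresses
        self.held = {}                       # sender -> messages awaiting validation

    def receive(self, sender, message):
        """Deliver mail from approved senders; hold and challenge everyone else."""
        if sender in self.approved:
            return "delivered"
        self.held.setdefault(sender, []).append(message)
        return "challenged"  # a validation request would be sent back here

    def validate(self, sender):
        """The sender answered the challenge: approve them and release held mail."""
        self.approved.add(sender)
        return self.held.pop(sender, [])


mail_filter = ChallengeResponseFilter(approved={"alice@example.com"})
status_known = mail_filter.receive("alice@example.com", "Lunch?")
status_unknown = mail_filter.receive("bob@example.com", "Hello")
released = mail_filter.validate("bob@example.com")
```

A bulk sender who never answers the challenge simply leaves its messages in `held`, which is the effect the strategy relies on.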
LITERATURE SURVEY
EXISTING SYSTEM:
• The arrival of the new email can be treated as the excitation input to each existing item, and the scale of the
input is analogous to the similarity between the new email and each existing email item in the database.
• The strength of each item is then accumulated, i.e. the more the item resembles the new email, the stronger
the stimulation is, and the faster the corresponding strength grows.
• Each text file is first partitioned into a sequence of chunks according to the TTTD (Two
Thresholds, Two Divisors) algorithm.
• The chunks of suspicious emails are encoded by a hash function, which provides privacy for email
users. A hash function maps data of arbitrary size to fixed-size values. The values returned by a hash
function are called hash values.
• They emphasized an integrated approach to classifying and filtering spam and legitimate mails.
• Applying the integrated approach instead of the traditional approach increased the accuracy by more than
1% on the real-world data set.
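As a rough illustration of the chunk-and-hash idea above, the sketch below splits a message into variable-size chunks in the spirit of TTTD and then fingerprints each chunk with SHA-256. It is a simplified sketch, not the surveyed system: the threshold and divisor values, the byte-sum window fingerprint (real implementations typically use a Rabin fingerprint), and the Jaccard similarity over chunk hashes are all assumptions made for the example.

```python
import hashlib


def tttd_chunks(data, t_min=16, t_max=64, d_main=16, d_backup=8, window=8):
    """Simplified TTTD-style chunking with two thresholds and two divisors."""
    chunks = []
    start = 0     # index where the current chunk begins
    backup = -1   # last backup breakpoint seen in the current chunk
    i = 0
    while i < len(data):
        length = i - start + 1
        if length >= t_min:
            # Assumed fingerprint: sum of the last `window` bytes of the chunk.
            fp = sum(data[max(start, i - window + 1): i + 1])
            if fp % d_backup == d_backup - 1:
                backup = i                      # remember a fallback cut point
            if fp % d_main == d_main - 1:
                chunks.append(data[start:i + 1])  # main divisor matched: cut here
                start, backup = i + 1, -1
            elif length >= t_max:
                # Maximum size reached: cut at the backup point if one exists.
                cut = backup if backup >= 0 else i
                chunks.append(data[start:cut + 1])
                start, backup = cut + 1, -1
                i = cut                         # resume scanning right after the cut
        i += 1
    if start < len(data):
        chunks.append(data[start:])             # trailing chunk may be shorter than t_min
    return chunks


def chunk_hashes(text):
    """Hash every chunk so that only digests, not content, need to be shared."""
    return {hashlib.sha256(c).hexdigest() for c in tttd_chunks(text.encode("utf-8"))}


def similarity(text_a, text_b):
    """Jaccard similarity between the chunk-hash sets of two messages."""
    ha, hb = chunk_hashes(text_a), chunk_hashes(text_b)
    return len(ha & hb) / len(ha | hb) if ha | hb else 0.0


sample = bytes(range(256)) * 4
sample_chunks = tttd_chunks(sample)
```

Because only the digests are compared, two parties can check whether emails share chunks without exchanging the email text itself.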
DISADVANTAGES OF EXISTING SYSTEM:
• It does not work with classifiers of different size and different density.
• The spam filter itself can create a problem for other users. This is the case when an
email is sent back to a legitimate sender asking them to authenticate themselves as a valid user.
• The spam filter on the other side may also stop the email sent by the first spam filter
from entering the inbox, as it was not sent by the intended recipient.
PROPOSED SYSTEM
The spam detection system can distinguish spam from non-spam emails based on a self-learning
algorithm.
NAIVE BAYES:
• Naive Bayes is a type of supervised learning algorithm. Supervised learning is defined as the task of
inferring a function from labeled training data. Naive Bayes is based on conditional probabilities.
• Conditional probability is the probability of the occurrence of an event, given that another event has
occurred. To implement Naive Bayes for e-mail spam filtering, we will use Bayes' theorem.
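Bayes' theorem gives P(spam | words) = P(words | spam) · P(spam) / P(words); the "naive" step is to treat the words as independent given the class. A minimal sketch of such a filter is shown below. The whitespace tokenization, the add-one (Laplace) smoothing, and the toy training messages are assumptions made for illustration, not the data or settings of this project.

```python
import math
from collections import Counter


def train_naive_bayes(samples):
    """samples: list of (text, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = {}   # label -> Counter of word frequencies
    vocab = set()
    for text, label in samples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def predict_naive_bayes(model, text):
    """Pick the label with the highest log-posterior (up to a shared constant)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, for illustration only.
toy_training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("project report attached", "ham"),
]
nb_model = train_naive_bayes(toy_training)
```

Working in log-probabilities avoids numerical underflow when a message contains many words.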
K NEAREST NEIGHBORS:
DECISION TREE:
• The decision tree algorithm aims to build a tree structure according to a set of rules from the
training dataset, which can be used to classify unlabeled data.
• This algorithm builds the decision tree based on entropy and the information gain.
• Entropy measures the impurity of an arbitrary collection of samples while the information gain
calculates the reduction in entropy by partitioning the sample according to a certain attribute.
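The two quantities above can be computed directly. The sketch below assumes each email is represented as a dictionary of categorical attributes; the attribute names `has_link` and `all_caps` and the four sample rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by partitioning the rows on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the partitions
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data, for illustration only.
toy_rows = [
    {"has_link": True, "all_caps": True},
    {"has_link": True, "all_caps": False},
    {"has_link": False, "all_caps": True},
    {"has_link": False, "all_caps": False},
]
toy_labels = ["spam", "spam", "ham", "ham"]
gain_link = information_gain(toy_rows, toy_labels, "has_link")
gain_caps = information_gain(toy_rows, toy_labels, "all_caps")
```

When building the tree, the attribute with the highest information gain is chosen for the split at each node; in this toy data that would be `has_link`, which separates the two classes completely.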
LOGISTIC REGRESSION:
CONCLUSION
• Spam is considered one of the main vectors for launching attacks such as stealing user identities
and spreading malware.
• Spam mails are used for spreading viruses or malicious code, for banking fraud, for phishing,
and for advertising. To avoid spam and irrelevant mails, we need effective spam filtering strategies.
• Applying the integrated approach instead of the traditional approach may increase the accuracy on
the real-world data set. It basically helps internet users to avoid spam mails.
• We will compare the running time and accuracy of the following algorithms: Naive Bayes, KNN,
Decision Tree, and Logistic Regression.
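A comparison of running time and accuracy could be organized along the following lines. This is a sketch under assumptions: the `evaluate` harness, the majority-class baseline, the small k-nearest-neighbours classifier with Jaccard word overlap, and the toy messages are hypothetical stand-ins, not the classifiers or data set used in this project.

```python
import time
from collections import Counter


def majority_classifier(train, text):
    """Baseline: always predict the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def knn_classifier(train, text, k=3):
    """k-nearest neighbours using Jaccard word overlap as similarity (an assumption)."""
    words = set(text.lower().split())

    def sim(other):
        other_words = set(other.lower().split())
        union = words | other_words
        return len(words & other_words) / len(union) if union else 0.0

    nearest = sorted(train, key=lambda s: sim(s[0]), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def evaluate(classifiers, train, test):
    """Return {name: (accuracy, seconds)} for each classifier function."""
    results = {}
    for name, predict in classifiers.items():
        start = time.perf_counter()
        correct = sum(predict(train, text) == label for text, label in test)
        elapsed = time.perf_counter() - start
        results[name] = (correct / len(test), elapsed)
    return results


# Hypothetical toy data, for illustration only.
train_set = [
    ("win free money now", "spam"),
    ("claim your free prize", "spam"),
    ("cheap loans win big", "spam"),
    ("meeting moved to monday", "ham"),
    ("please review the report", "ham"),
    ("lunch on monday", "ham"),
]
test_set = [("free money prize", "spam"), ("monday meeting report", "ham")]
results = evaluate(
    {"majority": majority_classifier, "knn": knn_classifier}, train_set, test_set
)
```

Any classifier that follows the same `predict(train, text)` signature can be added to the dictionary, so the four algorithms can be timed and scored on the same split.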
THANK YOU