
BATCH-C2

DETECTING MALICIOUS SPAM EMAILS


USING CLASSIFIERS AND LOGISTIC
REGRESSION MECHANISMS

PROJECT SUPERVISOR
Mrs. D. KAMAL KUMARI
Asst. Professor, Department of CSE

PROJECT BATCH
B. Hyndavi (16NM1A05D0)
V. HARSHINI (16NM1A05H8)
K. LAHARI (16NM1A05F3)
T. NISHITHA (16NM1A05H6)
CONTENTS

• Abstract
• Problem Definition
• Objective
• Introduction
• Literature Survey
• Content Diagram
• Conclusion
ABSTRACT

• Email is information stored on a computer that is exchanged between two users over a
telecommunication network.

• Spam is one of the major threats posed to email users. Spam refers to the use of electronic
messaging systems to send unrequested or unwanted messages in bulk.

• The privacy and security of large amounts of sensitive data are threatened by malicious spam.

• Data mining offers many approaches and algorithms for email filtering. A classifier is a
supervised function where the learned attribute is categorical.

• Content-based methods analyze the content of the email to determine whether the email is spam
or not.
PROBLEM DEFINITION

• Each email provider has its own filtering system or technique, but most of them do not work to their
full capacity.

• Cybercriminals use emails to obtain valuable credentials.

• There is no integrated approach for classifying and filtering a mail as spam or legitimate.

• Data integration is a preprocessing technique that involves combining data from multiple sources.


OBJECTIVE

• To compare the accuracy of the classifiers and identify the best classifier for
detecting spam mails.

• We use different data mining classifiers such as Naive Bayes, KNN, Decision
Tree, and Logistic Regression to categorize a mail as spam or legitimate.
INTRODUCTION

• Email is one of the crucial aspects of web data communication.

• Spam emails are emails that the receiver does not wish to receive. Spam messages not
only increase network traffic and consume memory space but can also be used to carry out
attacks.

• Spamming is done by sending unsolicited bulk messages to an indiscriminate set of
recipients, typically for advertising purposes.

• A large number of identical messages are sent to many email recipients. The increasing
volume of such spam emails is causing serious problems for internet users, Internet Service
Providers, and the whole Internet backbone network.
• The spam filter is a basic filter mechanism that is used to control the spam in the emails. It does not
allow strangers or unapproved persons to send an email to the user directly. In response to an email sent
by an unapproved person, it asks the sender to validate themselves before the email is passed on.

• The logic behind this strategy is that spammers do not have time to validate their own
email ID for each of the thousands of emails that they might have sent.

• Dealing with spam and classifying it is a very difficult task. Moreover, a single model cannot tackle the
problem, since new spam is constantly evolving and is often actively tailored to avoid detection,
which further impedes accurate detection.
LITERATURE SURVEY
EXISTING SYSTEM:

• The arrival of the new email can be treated as the excitation input to each existing item, and the scale of the
input is analogous to the similarity between the new email and each existing email item in the database.

• The strength of each item is then accumulated, i.e. the more the item resembles the new email, the stronger
the stimulation is, and the faster the corresponding strength grows.

• Each text file is first partitioned into a sequence of chunks using the TTTD (Two
Thresholds, Two Divisors) algorithm.

• The chunks of suspicious emails are encoded by a hash function, which provides privacy for email
users. A hash function maps data of arbitrary size to fixed-size values. The values returned by a hash
function are called hash values.

• Prior work emphasized an integrated approach for classifying and filtering spam and legitimate mails.

• Applying the integrated approach over the traditional approach has increased the accuracy by more than
1% with respect to the real world data set.
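
The chunk-and-hash step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses simple fixed-size chunking as a stand-in for TTTD (which picks chunk boundaries from the content itself), SHA-256 as the hash function, and a hypothetical `shared_fraction` helper to express how much two emails resemble each other. None of these choices come from the existing system itself.

```python
import hashlib

def chunk_hashes(text, chunk_size=16):
    """Split text into fixed-size byte chunks and hash each chunk.

    Assumption: fixed-size chunking replaces TTTD here for brevity.
    Only the hash values are kept, so the raw email text is not stored.
    """
    data = text.encode("utf-8")
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return {hashlib.sha256(chunk).hexdigest() for chunk in chunks}

def shared_fraction(hashes_a, hashes_b):
    """Hypothetical similarity score: shared chunk hashes / all chunk hashes."""
    if not hashes_a or not hashes_b:
        return 0.0
    return len(hashes_a & hashes_b) / len(hashes_a | hashes_b)

known_spam = chunk_hashes("WIN A FREE PRIZE NOW! Click the link below to claim.")
new_email = chunk_hashes("WIN A FREE PRIZE NOW! Click the link below to claim.")
unrelated = chunk_hashes("Minutes of the project review meeting are attached.")

similarity_same = shared_fraction(known_spam, new_email)
similarity_other = shared_fraction(known_spam, unrelated)
```

A higher score means the new email shares more hashed chunks with a stored item, which corresponds to a stronger stimulation of that item.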
DISADVANTAGES OF EXISTING SYSTEM:

• It is difficult to apply to categorical data.

• It does not work with classifiers of different size and different density.

• It also disregards the expectations of the users.

• The spam filter itself can create a problem for other users. This happens when an
authentication request is sent back to a legitimate sender asking them to prove they are a valid user.

• The spam filter on the other side may also stop the email sent by the spam filter
from entering the inbox, because it does not come from the intended recipient.
PROPOSED SYSTEM

A spam detecting system can distinguish spam from non-spam emails based on a self-learning
algorithm.

NAIVE BAYES:

• Naive Bayes is a supervised learning algorithm. Supervised learning is the task of
inferring a function from labeled training data. Naive Bayes is based on conditional probabilities.
• Conditional probability is the probability of an event occurring given that another event has
occurred. To implement Naive Bayes for e-mail spam filtering, we will use Bayes' theorem.
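
As a rough illustration of how Bayes' theorem can drive the classification, the sketch below scores each class as log P(class) plus the sum of log P(word | class) and picks the larger score. The toy messages, the function names, and the use of Laplace (add-one) smoothing are assumptions made for this example, not details of the project.

```python
import math
from collections import Counter

def train_naive_bayes(messages, labels):
    """Count word occurrences per class; each message is a list of tokens."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter(labels)
    for tokens, label in zip(messages, labels):
        word_counts[label].update(tokens)
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def predict_naive_bayes(model, tokens):
    """Return the class with the higher posterior score for the given tokens."""
    word_counts, class_counts, vocab = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Prior P(class), in log space to avoid underflow.
        score = math.log(class_counts[label] / total_messages)
        # Likelihood P(word | class) with add-one smoothing (an assumption).
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokens:
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data (illustrative only).
train_msgs = [
    ["win", "free", "prize"],
    ["free", "money", "now"],
    ["meeting", "at", "noon"],
    ["project", "report", "attached"],
]
train_labels = ["spam", "spam", "ham", "ham"]
nb_model = train_naive_bayes(train_msgs, train_labels)

label_a = predict_naive_bayes(nb_model, ["free", "prize"])
label_b = predict_naive_bayes(nb_model, ["project", "meeting"])
```

The "naive" part is the assumption that words are independent given the class, which is what allows the per-word probabilities to be multiplied (added in log space).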
K NEAREST NEIGHBORS:

• The k Nearest Neighbors algorithm is a non-parametric method used for pattern classification. It is an
instance-based learning algorithm.
• To implement the algorithm, we will use the Euclidean distance between the attributes of the
instances.
• We directly find the k nearest neighbors of each data point in the testing dataset. Then,
among these k neighbors, we check whether most of them are classified as ham or spam.
• If the majority of the k neighbors are classified as spam, the data point or instance is classified as
spam. Otherwise it is classified as ham.
• We then calculate the accuracy by comparing the predicted ham/spam value with the original
class value.
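
The steps above can be sketched in a few lines. The two numeric features per email (for instance, a count of the word "free" and a count of links), the toy data, and the function names are assumptions chosen for illustration only.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, point, k=3):
    """Classify point as spam if most of its k nearest neighbors are spam."""
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], point))
    nearest = order[:k]
    votes = Counter(train_y[i] for i in nearest)
    return "spam" if votes["spam"] > len(nearest) / 2 else "ham"

def knn_accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test points whose predicted label matches the true label."""
    correct = sum(
        knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y)
    )
    return correct / len(test_X)

# Toy data (illustrative only): [count of "free", count of links].
train_X = [[3, 2], [4, 3], [5, 1], [0, 0], [1, 0], [0, 1]]
train_y = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_X = [[4, 2], [0, 0]]
test_y = ["spam", "ham"]

knn_acc = knn_accuracy(train_X, train_y, test_X, test_y, k=3)
```

Because kNN is instance-based, there is no training phase: all the work happens at prediction time, when distances to every stored instance are computed.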
DECISION TREE:

• The decision tree algorithm aims to build a tree structure according to a set of rules from the
training dataset, which can be used to classify unlabeled data.
• This algorithm builds the decision tree based on entropy and the information gain.
• Entropy measures the impurity of an arbitrary collection of samples while the information gain
calculates the reduction in entropy by partitioning the sample according to a certain attribute.
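
To make the two quantities concrete, here is a minimal sketch that computes entropy and information gain for a toy set of emails. The attribute names `has_link` and `all_caps`, the data, and the function names are hypothetical and serve only to illustrate the calculation.

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a collection of labels, in bits."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(rows, labels, attribute):
    """Reduction in entropy after partitioning the rows by one attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Toy data (illustrative only).
rows = [
    {"has_link": True, "all_caps": True},
    {"has_link": True, "all_caps": False},
    {"has_link": False, "all_caps": True},
    {"has_link": False, "all_caps": False},
]
labels = ["spam", "spam", "ham", "ham"]

gain_link = information_gain(rows, labels, "has_link")
gain_caps = information_gain(rows, labels, "all_caps")
```

In this toy set, splitting on `has_link` separates the classes perfectly while `all_caps` tells us nothing, so a tree builder that maximizes information gain would choose `has_link` as the first split.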

LOGISTIC REGRESSION:

• Logistic regression is a statistical method used to model whether a binary response variable Y
depends on one or more independent variables X = (X1,··· ,Xn).
• It links the probability of an email being spam (πi) to the predictor variables (x1i,··· ,xij) through
a framework very similar to that of multiple regression.
• It is another way to determine a class label from the features. Logistic regression takes
features that can be continuous and translates them into discrete class values.
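
One way to picture this is the sketch below: a linear combination of the features is passed through the sigmoid function to give a spam probability, and the probability is thresholded at 0.5 to get a discrete label. The training loop (plain stochastic gradient descent), the learning rate, the epoch count, and the toy data are assumptions for illustration, not the project's actual setup.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Estimated probability that feature vector x belongs to class 1 (spam)."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            error = predict_proba(weights, bias, x) - target
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

# Toy data (illustrative only): [count of "free", count of links]; 1 = spam, 0 = ham.
X = [[3, 2], [4, 3], [5, 1], [0, 0], [1, 0], [0, 1]]
y = [1, 1, 1, 0, 0, 0]

lr_weights, lr_bias = train_logistic(X, y)
p_spam = predict_proba(lr_weights, lr_bias, [4, 2])
p_ham = predict_proba(lr_weights, lr_bias, [0, 0])
label = "spam" if p_spam >= 0.5 else "ham"
```

The continuous inputs stay continuous until the final threshold, which is the step that turns the probability into a discrete class.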
CONTENT DIAGRAM
CONCLUSION

• Spam is considered one of the main vectors for launching attacks such as stealing user identities
and spreading malware.

• Spam mails are used for spreading viruses or malicious code, for banking fraud, for phishing,
and for advertising. To avoid spam and irrelevant mails, we need effective spam filtering strategies.

• Applying the integrated approach over the traditional approach may increase the accuracy with
respect to the real world data set. It helps internet users to avoid spam mails.

• We will compare the running time and accuracy rates of the following algorithms: Naive Bayes, KNN,
Decision Tree, and Logistic Regression.
THANK YOU
