Image Spam Detection

Image spam is a type of e-mail spam that embeds spam text content into graphical images
to bypass traditional text-based e-mail spam filters. To effectively detect image spam, it is
desirable to leverage image content analysis technologies.
We propose a comprehensive solution that combines server-side filtering with client-side
detection to mitigate image spam. On the server side, we present a nonnegative
sparsity-induced similarity measure for cluster analysis of spam images, which is used to filter
the attack activities of spammers and to quickly trace back the spam sources.
The solution combines cluster analysis of spam images on the server side with active learning
classification on the client side. The nonnegative sparse representation induced similarity
measure is used together with a spectral clustering algorithm to cluster the spam images.
Relatively large groups of images are then suspected to be spam and can be analyzed further to
identify the spam sources. The spam sources can then be blocked directly on the server side,
before their messages reach e-mail users. For spam images that survive this server-side
filtering and reach the e-mail client, we propose a prototype active learning spam hunter that
enables users to filter out the spam images efficiently and interactively.
On the client side, we employ the principle of active learning, in which the learner guides the
users to label as few images as possible while maximizing the classification accuracy. The
server-side filtering identifies large image clusters as suspicious spam sources, and further
analysis can be performed to identify the real sources and block them at the origin. For
spam images that survive the server-side filter, our active learner on the client side
guides the users to filter them out interactively and efficiently.
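The active learning loop described above can be illustrated with a minimal sketch. It is not the classifier used in this work: it assumes a simple nearest-centroid model and a pool of precomputed feature vectors, and it always asks the user about the image whose distances to the two class centroids are most similar, that is, the image the current model is least sure about. The names `active_learn`, `ask_user`, and `predict` are hypothetical.

```python
import math

SPAM, NORMAL = 1, 0

def centroid(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]

def class_centroids(pool, labeled):
    """Centroids of the images labeled so far (needs one label per class)."""
    spam = centroid([pool[i] for i, y in labeled.items() if y == SPAM])
    normal = centroid([pool[i] for i, y in labeled.items() if y == NORMAL])
    return spam, normal

def uncertainty_margin(x, spam_c, normal_c):
    """Small margin means x lies close to the decision boundary."""
    return abs(math.dist(x, spam_c) - math.dist(x, normal_c))

def active_learn(pool, ask_user, seed_labels, budget):
    """Query the user for at most `budget` labels, most uncertain image first.

    `ask_user(index)` stands in for the interactive labeling step and must
    return SPAM or NORMAL. `seed_labels` must contain both classes.
    """
    labeled = dict(seed_labels)
    queried = []
    for _ in range(budget):
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        spam_c, normal_c = class_centroids(pool, labeled)
        query = min(unlabeled,
                    key=lambda i: uncertainty_margin(pool[i], spam_c, normal_c))
        labeled[query] = ask_user(query)
        queried.append(query)
    return labeled, queried

def predict(x, pool, labeled):
    """Classify x by the nearer class centroid."""
    spam_c, normal_c = class_centroids(pool, labeled)
    return SPAM if math.dist(x, spam_c) < math.dist(x, normal_c) else NORMAL
```

With a small label budget, the queries concentrate on images near the boundary between the two classes, which is where an extra label changes the classifier the most.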
INTRODUCTION
Spam is commonly defined as unsolicited email messages, and the goal of spam
categorization is to distinguish between spam and legitimate email messages. Spam used to be
considered a mere nuisance, but the sheer volume of spam sent today has turned it into a
major problem. Spam filtering can control the problem in a variety of ways. Much research
in spam filtering has centered on the more sophisticated classifier-related issues.
Spam filtering is the processing of e-mail to organize it according to specified criteria.
Most often this refers to the automatic processing of incoming messages, but the term also
applies to the intervention of human intelligence in addition to anti-spam techniques, and to
outgoing emails as well as those being received. We use pre-classified emails (prior knowledge)
to train the spam email detector. With the model generated in the training step, the detector
can decide whether an email is spam or an ordinary email.
Because image spam hides its message from text-based filters, anti-spam researchers have
turned to analyzing the visual content of image spam. Early on, several organizations and
companies worked on filtering image-based spam.
To foil spam filtering systems based on visual matching of existing images to previously
encountered spam images, spammers adopt CAPTCHA-like techniques, which include randomly
tiling images, varying borders or backgrounds, varying spacing or margins, adding speckles and
dots in the image background, randomly changing image file names, randomly inserting subject
lines, and slightly rotating the image.
By appending text made of randomly generated words that follow normal natural language
statistics to the body or subject line of the e-mail carrying the spam image, image spam can
successfully bypass text-based spam filters. We therefore have to leverage both image content
analysis and machine learning algorithms to visually recognize these spam images.
USER INTERFACE DESIGN
The application provides a graphical user interface (GUI) for user interaction. A user first
registers with a user name, a password, and personal details. After successful registration,
the user logs in with the user name and password. The application then offers three
functions: Compose Mail, Inbox, and Spam.
MAIL CLIENT
Clients are PCs or workstations on which users run applications. Clients rely on servers
for resources such as files, devices, and even processing power. In our process, the user
enters their details and sends them to the server as a request, and the server then grants
that user mail access. After a successful login, the client can check the inbox or spam
folder, or compose mail to another client addressed by username. A client can send a message
or an attachment to another client.
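The register, login, and send flow described above can be sketched in a few lines. This is an in-memory illustration only, not the application's actual code: the class `MailServer`, its method names, the salted PBKDF2 password hashing, and the pluggable `is_spam` callback are all assumptions made for the example.

```python
import hashlib
import hmac
import os

class MailServer:
    """Toy in-memory mail server: registration, login, and delivery."""

    def __init__(self):
        self._users = {}    # username -> (salt, password hash)
        self._folders = {}  # username -> {"inbox": [...], "spam": [...]}

    @staticmethod
    def _hash(password, salt):
        # Assumption: passwords are stored as salted PBKDF2 hashes.
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

    def register(self, username, password):
        """Create an account; returns False if the name is already taken."""
        if username in self._users:
            return False
        salt = os.urandom(16)
        self._users[username] = (salt, self._hash(password, salt))
        self._folders[username] = {"inbox": [], "spam": []}
        return True

    def login(self, username, password):
        """Compare the supplied credentials with the stored record."""
        record = self._users.get(username)
        if record is None:
            return False
        salt, stored = record
        return hmac.compare_digest(stored, self._hash(password, salt))

    def send(self, sender, password, recipient, body, is_spam=lambda body: False):
        """Deliver a message to the recipient's inbox or spam folder."""
        if not self.login(sender, password) or recipient not in self._folders:
            return False
        folder = "spam" if is_spam(body) else "inbox"
        self._folders[recipient][folder].append((sender, body))
        return True

    def folder(self, username, password, name):
        """Return a copy of the 'inbox' or 'spam' folder, or None if login fails."""
        if not self.login(username, password):
            return None
        return list(self._folders[username][name])
```

The `is_spam` callback is the point where a server-side classifier would plug in; by default every message is delivered to the inbox.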
MAIL SERVER
On each client request, the server stores the records in its database. Every time a client
logs in, the server checks the client's authentication by comparing the supplied details with
the records in its database. The server also transfers mail between clients. To filter or
classify mail, the server can use the steps of the naive Bayes algorithm.
The training stage of the spam email detector includes five steps:
1. Preparation of Training Set.
Training Set is divided into positive set (spam emails) and negative set (ordinary emails).
2. Email Preprocessing.
All emails are preprocessed by the preprocessor. We use main bodies of emails to train the spam
email detector. Header information is discarded.
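The two training steps listed above can be sketched with a small multinomial naive Bayes classifier. This is a minimal illustration under assumptions, not the detector's actual code: the tokenizer, the Laplace (add-one) smoothing, and the names `strip_headers` and `NaiveBayesSpamDetector` are all hypothetical, and headers are assumed to end at the first blank line of the raw message.

```python
import math
import re
from collections import Counter

def strip_headers(raw_message):
    """Keep only the body: everything after the first blank line."""
    _head, sep, body = raw_message.partition("\n\n")
    return body if sep else raw_message

def tokenize(body):
    """Lowercase word tokens; a deliberately simple preprocessor."""
    return re.findall(r"[a-z']+", body.lower())

class NaiveBayesSpamDetector:
    def fit(self, spam_bodies, ham_bodies):
        """Train on a positive set (spam) and a negative set (ordinary mail)."""
        docs = {"spam": spam_bodies, "ham": ham_bodies}
        self.counts = {label: Counter() for label in docs}
        for label, bodies in docs.items():
            for body in bodies:
                self.counts[label].update(tokenize(body))
        total = len(spam_bodies) + len(ham_bodies)
        self.log_prior = {label: math.log(len(bodies) / total)
                          for label, bodies in docs.items()}
        self.vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        return self

    def log_score(self, body, label):
        """Log prior plus add-one smoothed log likelihood of known words."""
        counts = self.counts[label]
        denom = sum(counts.values()) + len(self.vocab)
        score = self.log_prior[label]
        for token in tokenize(body):
            if token in self.vocab:  # words never seen in training are skipped
                score += math.log((counts[token] + 1) / denom)
        return score

    def is_spam(self, body):
        return self.log_score(body, "spam") > self.log_score(body, "ham")
```

A detector trained this way only sees message text, so it does nothing for spam whose message is carried entirely by an image.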
FEATURE EXTRACTION
When the input data to an algorithm is too large to be processed and is
suspected to be highly redundant (much data, but not much information), the
input data is transformed into a reduced representation set of features
(also called a feature vector). Transforming the input data into the set of
features is called feature extraction. If the extracted features are carefully
chosen, the feature set is expected to capture the relevant information from the
input data, so that the desired task can be performed using this reduced
representation instead of the full-size input.
Feature extraction involves reducing the amount of resources required to
describe a large set of data accurately. When performing analysis of complex data,
one of the major problems stems from the number of variables involved. Analysis
with a large number of variables generally requires a large amount of memory and
computation power, or leads to a classification algorithm that overfits the
training sample and generalizes poorly to new samples. Feature extraction is a
general term for methods of constructing combinations of the variables to get
around these problems while still describing the data with sufficient accuracy.
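As a minimal sketch of what such a reduced representation can look like for images, the example below turns a grayscale image into a short vector of two normalized histograms. The specific choice of an intensity histogram and a gradient-magnitude histogram, the bin count, and the function names are assumptions made for illustration; the image is assumed to be a list of rows of integer pixel values in 0..255.

```python
def intensity_histogram(image, bins=8):
    """Fraction of pixels falling into each of `bins` intensity ranges."""
    hist = [0] * bins
    count = 0
    for row in image:
        for pixel in row:
            hist[min(pixel * bins // 256, bins - 1)] += 1
            count += 1
    return [h / max(count, 1) for h in hist]

def gradient_histogram(image, bins=8):
    """Histogram of local contrast: the larger of the right and down differences."""
    hist = [0] * bins
    count = 0
    for y in range(len(image) - 1):
        for x in range(len(image[0]) - 1):
            gx = abs(image[y][x + 1] - image[y][x])
            gy = abs(image[y + 1][x] - image[y][x])
            grad = max(gx, gy)  # stays within 0..255 for 8-bit pixels
            hist[min(grad * bins // 256, bins - 1)] += 1
            count += 1
    return [h / max(count, 1) for h in hist]

def extract_features(image, bins=8):
    """Concatenate both histograms into one feature vector of length 2 * bins."""
    return intensity_histogram(image, bins) + gradient_histogram(image, bins)
```

Whatever the image size, the result has a fixed length of `2 * bins` values, which is what makes it usable as input to clustering or classification.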
These features are motivated by the fact that spam images usually present different visual
statistics from natural or normal images. Adopting them as visual representations may
therefore naturally discriminate spam images from normal images. To illustrate this, we
randomly pick a set of images from our data collection. The adopted statistical visual
features separate the normal images from the spam images very well: there is a clear
separation of modes between normal and spam images in all the feature distributions we plotted.
FEATURE MATCHING
We compare the proposed similarity measure with two other competitive measures. The
nonnegative sparse representation induced similarity measure is used together with a spectral
clustering algorithm for cluster analysis of spam images. Relatively large groups of images are
then suspected to be spam and can be further analyzed to identify the spam sources.
The spam sources can then be blocked directly on the server side, before their messages reach
e-mail users. For spam images that survive this server-side filtering and reach the e-mail
client, we propose a prototype active learning spam hunter that enables users to filter out
the spam images efficiently and interactively.
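The idea of a similarity measure induced by nonnegative sparse representation can be sketched as follows. The exact formulation used in this work is not given here, so the example makes its own assumptions: each feature vector is approximated as a nonnegative, L1-penalized combination of all the other vectors (solved by coordinate descent), and the symmetrized coefficients serve as similarities. To stay dependency-free, the final grouping step uses thresholded connected components as a simple stand-in for spectral clustering. The function names, the penalty `lam`, and the `threshold` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nonneg_sparse_code(x, atoms, lam=0.01, sweeps=50):
    """Minimize 0.5*||x - sum_j w_j*atom_j||^2 + lam*sum_j w_j subject to w_j >= 0."""
    weights = [0.0] * len(atoms)
    residual = list(x)
    for _ in range(sweeps):
        for j, atom in enumerate(atoms):
            norm_sq = dot(atom, atom)
            if norm_sq == 0.0:
                continue
            # Correlation with the residual once atom j's own share is added back.
            rho = dot(atom, residual) + weights[j] * norm_sq
            new_w = max(0.0, (rho - lam) / norm_sq)
            delta = new_w - weights[j]
            if delta != 0.0:
                residual = [r - delta * a for r, a in zip(residual, atom)]
                weights[j] = new_w
    return weights

def similarity_matrix(features, lam=0.01):
    """Code every vector over all others, then symmetrize the coefficients."""
    n = len(features)
    coef = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        code = nonneg_sparse_code(features[i], [features[j] for j in others], lam)
        for j, w in zip(others, code):
            coef[i][j] = w
    return [[0.5 * (coef[i][j] + coef[j][i]) for j in range(n)] for i in range(n)]

def group_by_similarity(sim, threshold=0.1):
    """Stand-in for spectral clustering: components of the thresholded graph."""
    n = len(sim)
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            i = stack.pop()
            members.append(i)
            for j in range(n):
                if j not in seen and sim[i][j] > threshold:
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(members))
    return groups
```

Because the coefficients are constrained to be nonnegative and sparse, each image tends to be explained by a few near-duplicates of itself, so images sent from the same template end up strongly connected while unrelated images receive weights near zero. The largest resulting groups are the ones that would be flagged for further analysis.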
