
Email Spam

Introduction
This document describes a system for discovering and preventing e-mail spam using collaboration and AI techniques. It also specifies the user interface, performance, and related requirements.

Purpose
The purpose of this project is to develop a web-based platform for discovering e-mail spam and filtering it.

Scope
The project "Discovering e-mail spam and anticipation via collaboration and AI techniques" is a web application. Spam is unsolicited e-mail, which is of serious concern because it wastes resources for the users and ISPs who have to handle it. Multiple techniques have been proposed to contain spam. A collaborative technique is widely used, whereby mail servers update each other on recently discovered spam. Each server stores digests of identified spam messages, against which digests of incoming mails are matched. This project implements spam detection algorithms based on identifying spam by its fingerprint, and uses a collaborative approach to update the other servers in the network. Collaborative identification of spam exploits the fact that a spam message is usually sent by an automated system to many recipients. Fingerprint processing compares fingerprint values to detect spam. Since a fingerprint is smaller than the e-mail message it represents, fingerprint values can be compared much more easily and efficiently than full messages.
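As a rough illustration of the idea above, the following Python sketch combines a substring-based fingerprint with collaborative sharing between servers. It is not the project's actual code: the names `fingerprint` and `MailServer`, the substring length of 5, the `top_k` cut-off, and the use of a set overlap ratio are all assumptions made for the example. The 0.8 threshold mirrors the "above 80%" rule stated later in this document.

```python
from collections import Counter


def fingerprint(text, length=5, top_k=20):
    """Return the top_k most frequent substrings of the given length."""
    # A text of n characters has n - length + 1 consecutive substrings.
    subs = [text[i:i + length] for i in range(len(text) - length + 1)]
    ranked = Counter(subs).most_common(top_k)
    return frozenset(sub for sub, _ in ranked)


class MailServer:
    """Hypothetical mail server that shares spam digests with its peers."""

    def __init__(self, name):
        self.name = name
        self.peers = []            # other MailServer objects in the network
        self.spam_digests = set()  # fingerprints of known spam

    def receive_digest(self, digest):
        self.spam_digests.add(digest)

    def report_spam(self, message):
        # Store the digest locally, then forward it to every peer.
        digest = fingerprint(message)
        self.receive_digest(digest)
        for peer in self.peers:
            peer.receive_digest(digest)

    def is_spam(self, message, threshold=0.8):
        incoming = fingerprint(message)
        if not incoming:
            return False  # message too short to fingerprint
        # Spam if more than `threshold` of the incoming fingerprint
        # is found in any stored spam fingerprint.
        return any(
            len(incoming & known) / len(incoming) > threshold
            for known in self.spam_digests
        )
```

In this sketch a spam report made on one server is visible to its peers immediately, so a second copy of the same message arriving at a peer is matched without any local report.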

Important Modules:
1. Authentication
2. Finger Print Processing
3. Finger Print Techniques: Collaborative, Bayesian
4. Legal Word Verification
5. Illegal Word Verification

Dept. of Computer Science and Engineering, SDMCET 2011-12


Existing system
In existing methods, spam mails that contain viruses or unwanted or repeated files or words are not properly identified by the servers, so they are let through without the spam being noticed. This may lead to loss of files or text, or to file corruption. The problem with spam is that one does not know a message is spam until the user recognizes it. What is spam for one user can be a legitimate e-mail for another, so the complexity only increases. Many systems are evolving to deal with this menace. Some of them use user reports and automated learning systems that block spam as and when it is reported. Unless spam is detected in the first place, monitoring and eliminating it is not possible. It is with this aspect in mind that the implemented system explores a combination of methods to detect spam. Spam is a problem because it wastes resources. A spam detection system reduces wasted mailbox capacity as well as wasted time and money. To overcome these problems, we implemented several methods for detecting e-mail spam, which are described under the proposed system.

Proposed system
The proposed system recognizes the need to integrate information about spam from every available source. Whereas some of the existing systems take a piecemeal approach, the proposed system tries to gather spam information from all legitimate sources, such as peer servers, client-side reporting, server-side reporting, and global spam declarations. It also applies fingerprint processing, which is a recent development in spam detection methods. It uses a combination of these approaches rather than the single-dimension focus of many of the systems currently in practice. In essence, it tries to leverage the best practices of as many approaches as possible and integrate them into a unified approach. It also compares incoming mail with previously received mail in order to find spam mails that arrive simultaneously.


Module description:
Authentication:
The user is provided with an ID and a password. Only authorized persons are allowed to log in. In this module, the server first checks the user name and password. If they do not match, the client is disconnected. If they match, the user is allowed to send and transfer mail. If the user does not yet have an account, the module also allows registration as a new user.

Finger Print Processing:
Spam can be detected by comparing fingerprint values. A fingerprint is a digest value that is unique for a string; specific spam e-mails are identified and a unique "fingerprint" is developed for each. It is calculated by a fingerprint algorithm that uses substrings, and its size is smaller than the e-mail size. The steps are:
1. Divide the document into all possible consecutive substrings of length L. A document of n characters will have (n-L+1) substrings.
2. Rank the substrings by frequency of occurrence.
3. Based on the rank, create the fingerprint from the unique substrings.

Finger Print Checking (Collaborative):
When a spam report is received by the server, the server updates its spam dictionary, creates the related digest for this dictionary, stores it, and forwards it to the other servers. This is known as the collaborative approach because the servers share the spam digests among themselves. Fingerprinting techniques examine the characteristics, or fingerprint, of e-mails previously identified as spam and use this information to identify the same or similar e-mail each time one is intercepted. These real-time fingerprint checks are continuously updated by ESP and provide a method of identifying spam with nearly zero false positives.

Finger Print Filtering (Bayesian AI Technique):
Bayesian filters are personalized to each user and adapt automatically to changes in spam. To determine the likelihood that an e-mail is spam, these filters use Bayesian analysis to compare the words or phrases in the e-mail in question to the frequency of the same words or phrases in the intended recipient's previous e-mails (both legitimate and spam). The received mail is checked against the previous mails, that is, against the words and characters for which digest values have already been created.
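The Bayesian comparison described above can be illustrated with a small naive Bayes word-frequency filter. This is only a sketch under stated assumptions: the class name `NaiveBayesFilter`, the whitespace tokenization, and the add-one smoothing are choices made for the example, not details taken from the project.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Hypothetical per-user filter trained on that user's previous mails."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.mail_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # label is "spam" or "ham" (legitimate mail)
        self.word_counts[label].update(text.lower().split())
        self.mail_counts[label] += 1

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_mails = sum(self.mail_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Prior probability of the class, with add-one smoothing.
            log_p = math.log((self.mail_counts[label] + 1) / (total_mails + 2))
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                # Smoothed word likelihood; unseen words get a small value.
                log_p += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_score[label] = log_p
        # Turn the two log scores into P(spam | text); the exponent is
        # clamped so that very long mails cannot overflow math.exp.
        diff = min(log_score["ham"] - log_score["spam"], 700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

Because each user trains a separate filter on their own mail, the same message can score differently for different recipients, which matches the observation that what is spam for one user may be legitimate for another.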

Threshold for mapping of digest:
1. If a mail maps above 80% to a digest stored in the server, it is spam.
2. HTML tags are discarded and only the content is considered.
3. Filtering is also based on user input, in which the user specifies some IDs as spam IDs.

Legal Word Verification:
In legal word verification, the words of a mail are looked up in a dictionary. A set of legal words is maintained and compared with the words of the received mail; if a word is found in the dictionary, it is considered a meaningful (legal) word. If less than 60% of the words in the mail are legal words, the mail is considered spam and is stored in the spam-id db.

Illegal Word Verification:
In this verification, each word of the message is looked up in the corresponding dictionary of illegal words. Finally, the values of the words are added, and if the resulting value is higher than the average value, the message is confirmed as spam; that is, if more than 60% of the words are illegal words, the mail is considered spam. It is stored in the spam-id db.
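The legal word check above can be sketched in a few lines of Python. The function names, the regular expressions used to strip HTML tags and split words, and the treatment of a mail with no words are assumptions made for illustration; only the 60% threshold and the rule of ignoring HTML tags come from the description above.

```python
import re


def strip_html(text):
    """Discard HTML tags so that only the content is considered."""
    return re.sub(r"<[^>]+>", " ", text)


def legal_word_ratio(text, dictionary):
    """Fraction of words in the mail that appear in the dictionary."""
    words = re.findall(r"[a-z']+", strip_html(text).lower())
    if not words:
        # Nothing to judge: assume this check should not flag the mail.
        return 1.0
    legal = sum(1 for word in words if word in dictionary)
    return legal / len(words)


def is_spam_by_legal_words(text, dictionary, min_legal_ratio=0.6):
    """Flag the mail when fewer than 60% of its words are legal words."""
    return legal_word_ratio(text, dictionary) < min_legal_ratio
```

The illegal word check would follow the same shape, with a dictionary of illegal words and the comparison reversed.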

Glossary
The two most common definitions of spam are "Unsolicited Bulk Email" and "Unsolicited Junk Email", mostly with a commercial purpose. The spam practice is basically tied to three factors: illegal and indiscriminate harvesting of e-mail addresses to create mailing lists; large-scale (bulk) distribution of unsolicited e-mail messages (for marketing, promotion, fraud, etc.); and use of open mail relays for bulk e-mail distribution. Open mail relays are SMTP servers that permit third-party relay. These servers permit spammers to connect to them from anywhere in the world, usually from a modem connection, and then forward the spam to its intended victims.


System Interfaces

Hardware requirements
Monitors
Memory : Approximately 512 MB of on-board memory.
I/O : Two- or three-button mouse and standard keyboard.
Processor : At least 700 MHz.
OS : Any operating system.

Software requirements
This project is capable of running on Windows operating systems, since it is Java based. Testing was done on the following Windows platforms: 98, 2000, and XP, using an MS-Access database.
Operating system : Windows XP or higher version
Database : ODBC-compliant data source
Java : JDK 1.4 or above

Communications Interfaces
HTTP: Hypertext Transfer Protocol is a transaction-oriented client/server protocol between a web browser and a web server.
HTTPS: Secure Hypertext Transfer Protocol is HTTP over SSL (Secure Sockets Layer).
TCP/IP: Transmission Control Protocol/Internet Protocol, the suite of communication protocols used to connect hosts on the Internet. TCP/IP uses several protocols, the two main ones being TCP and IP.


Working of the implemented system


Rule-based filtering
Rule-based filters assign a spam score to each email based on whether the email contains features typical of spam messages, such as fake SMTP components, key words, and HTML formatting like fancy fonts and background colors. A major problem with rule-based scores is that, since their semantics are not well defined, it is difficult to aggregate them and to establish a threshold that can actually limit the number of false positives. Also, experience has shown that spammers quickly learn feature-based rules and freely investigate ways to overcome them.

Filters used for implementing rule-based filtering

Preferred list
This list maintains the preferred list of e-mail IDs for each client separately. It is consulted before granting access to the client's inbox: if the preferred list that the client submitted to the service provider does not contain the e-mail ID of the incoming e-mail, the e-mail is filtered.

Master Spam Report
This is a comprehensive report that contains the list of spam reported across geographies and domains. The two most important sources are:
Source 1 : Clients of the server who report spam. This can be either intranetwork or internetwork.
Source 2 : The global spam report from other servers, also called an ALERT.
The illustration depicts some of the spam reports that the system recognizes.

Prototyping
The Object Oriented Rapid Prototyping (OORP) method will be used to implement a limited, functional prototype for the registration system. The prototype will be a working example of part of the system, for demonstration and proof-of-concept purposes only. It will include web-based forms as an end-user interface with the DB2 database. The prototype will be presented to the implementation team.
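To make the two rule-based filters above (the preferred list and the Master Spam Report) more concrete, here is a minimal Python sketch. It assumes that both lists are keyed by sender e-mail ID, which the text does not state explicitly for the Master Spam Report; the class and method names are hypothetical.

```python
class RuleFilter:
    """Hypothetical filter combining a preferred list and a spam report."""

    def __init__(self):
        self.preferred = {}               # client ID -> set of allowed sender IDs
        self.master_spam_report = set()   # sender IDs reported as spam

    def set_preferred(self, client, senders):
        self.preferred[client] = {s.lower() for s in senders}

    def report_spam(self, sender):
        # Source 1: a client of this server reports a spam sender.
        self.master_spam_report.add(sender.lower())

    def receive_alert(self, senders):
        # Source 2: a global spam report (ALERT) from another server.
        self.master_spam_report.update(s.lower() for s in senders)

    def accept(self, client, sender):
        sender = sender.lower()
        if sender in self.master_spam_report:
            return False
        # Mail is delivered only if the sender is on the client's preferred list.
        return sender in self.preferred.get(client, set())
```

In this sketch the spam report takes precedence over the preferred list, so a sender that has been reported is filtered even if a client once listed it as preferred. That ordering is a design assumption, not something the text specifies.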


Product Functions
The implemented system utilizes the following techniques to detect spam:
1. Fingerprint checking
2. Pattern matching
3. Space count filter
4. Scoring

The implemented system aims to utilize filter-based spam detection methods and classify messages accordingly. It uses this approach in unison with fingerprint processing, which increases efficiency and the spam detection hit ratio. The system also makes use of multiple sources of spam reporting: mail-server-side reporting, peer-to-peer reporting by other servers, in-house client-side reporting to mail servers, and server-side reporting to clients. Furthermore, care should be taken not to filter legitimate e-mails of the clients in the haste to contain spam. The implemented system uses these approaches to optimize performance. It makes use of fingerprint checking to mark a message as spam. Scoring is a further stage that adds to the previous steps outlined. Spam detection remains one of the focused areas of research in recent times, and no single complete solution has been found to be satisfactory. The implemented system aims to fulfill some of the objectives pertaining to spam detection.
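As an illustration of the scoring stage, the sketch below adds up weighted rule hits and compares the total with a threshold. The keyword list, the weights, the threshold of 3.0, and the interpretation of the space count filter as counting runs of three or more spaces are all assumptions made for the example; the document does not define them.

```python
import re

# Illustrative keyword list; a real deployment would maintain its own.
SPAM_KEYWORDS = {"free", "winner", "prize", "offer"}


def keyword_rule(text):
    """Count words that appear in the spam keyword list."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word in SPAM_KEYWORDS)


def html_rule(text):
    """Count fancy-font and background-color markup."""
    return len(re.findall(r"<font\b|bgcolor\s*=", text, flags=re.IGNORECASE))


def space_count_rule(text):
    """Count runs of three or more spaces (assumed meaning of the filter)."""
    return len(re.findall(r" {3,}", text))


# (rule function, weight) pairs; the weights are assumptions.
RULES = [(keyword_rule, 1.0), (html_rule, 1.5), (space_count_rule, 0.5)]


def spam_score(text):
    return sum(weight * rule(text) for rule, weight in RULES)


def is_spam(text, threshold=3.0):
    return spam_score(text) >= threshold
```

Choosing the weights and the threshold is the hard part of such a scheme, since a threshold that is too low filters legitimate e-mail and one that is too high lets spam through.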

Flow of spam filters working in the system:



[Block diagram: incoming emails to mail server -> mail server -> spam filter based on rules -> filtered emails forwarded to clients -> clients]

[Flow chart: client-server interaction between Client and Server]

High level design of the implemented system

[Flow chart: digest compare (above 50%) with Y/N branches; the stages shown are user identification, spam ID filter, address book filter, space count filter, and spam dictionary / dictionary check; the outcomes shown are Spam, Spam +ve, Spam -ve, and Not Spam]


Milestones:
Requirement gathering:
We need to collect all the requirements that help to authenticate different accounts into a single unique account. We will then analyze these requirements for feasibility, and based on this analysis we will decide which features make the final cut.

System and software design:
In this phase, we will design the whole system in an abstract way. We will freeze the E-R diagram, the data flow model, and all the other design models by the end of this phase. We will also decide on the look of the project during this phase.

Implementation:
This phase will deal with writing the complete code of the project. After we are done with this, we will move on to the front end. Here we will write the code for the website and then the business logic.

Testing:
In this phase we will test our product for robustness and completeness. We also intend to test the site for security by asking people to try hacking into it. This would expose the loopholes in the code.

Feasibility and Risk Analysis:


Feasibility:
Almost all the features we are trying to implement have already been implemented in different scenarios. Our main objective in this project is to find ways to implement these features under one roof. Hence the project is completely feasible.

Risks:
1. Filtering (risk level: HIGH). The biggest problem we might face is with the filtering, since we use different methods to filter spam.
2. Knowledge database (risk level: LOW). As mentioned before, there will be a knowledge database too.


Milestone table:
| Phase | Description | Start date | End date |
|---|---|---|---|
| Requirement gathering | This phase deals with the collection of all the requirements needed by the client. | 16-Dec-2011 | 16-Jan-2012 |
| System and software design | This phase deals with the designing of a system for the given problem that satisfies all the requirements collected. | 17-Jan-2012 | 15-Feb-2012 |
| Implementation | In this phase, we will do the actual implementation of the system design. | 16-Feb-2012 | 29-Feb-2012 |
| Testing | This phase deals with the testing of the program to check for completeness and robustness of the system. | 1-Mar-2012 | 23-Mar-2012 |
| Deployment and maintenance | This is the final step, where we deploy our system and hand it to the client. After this we will continuously maintain it as long as the client uses the system. | 24-Mar-2012 | Continues as long as the system is deployed |
