Report

A big data architecture for security data and its application for phishing characterization
01. Problem Statement
As use of the internet grows, cyber security problems also increase. Attackers carry
out various malicious activities to obtain sensitive information about a victim and
then use that information for illegal purposes. To address this, we propose a system
whose main objective is to detect phishing emails in data collected by a honeypot. In
this project we design an architecture, built on top of Big Data frameworks, that aims
to mitigate cyber security problems such as phishing.
PVG’s COET, Pune 1

02. Literature Survey
2.1 Present work related to the project topic

The existing architecture stores emails on top of HDFS in the form of mailboxes. Its
authors implemented an application to process large volumes of spam traffic collected
from all over the world. The honeypots are located in different countries and
continents and store the messages in mailboxes on top of HDFS. The main contribution
of the application is its capability to identify phishing emails in a set of spam
messages. Using Natural Language Processing (NLP) and Locality-Sensitive Hashing (LSH)
to inspect the text of the messages, they were able to detect phishing campaigns.
Their experiments showed that the method detects phishing campaigns with an accuracy
of 98.1%; we treat this accuracy as the drawback that our proposed work aims to
improve on.
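To illustrate how LSH can group near-identical messages into campaigns, the sketch
below uses MinHash, one common LSH family. It is only an assumption that the surveyed
work uses this variant; the function names, the 3-word shingles, and the 64-hash,
16-band settings are all hypothetical choices made for this example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a message (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_hashes=64):
    """Keep the minimum hash value per seeded hash function (seeds 0..num_hashes-1)."""
    signature = []
    for seed in range(num_hashes):
        prefix = str(seed).encode() + b":"
        signature.append(min(
            int.from_bytes(hashlib.md5(prefix + item.encode()).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions; this estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def lsh_buckets(signatures, bands=16):
    """Group message ids whose signatures agree on at least one whole band."""
    buckets = {}
    for msg_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(msg_id)
    return [ids for ids in buckets.values() if len(ids) > 1]
```

Messages that land in the same bucket are candidates for belonging to one campaign;
a real system would then compare those candidates in more detail.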
2.2 Proposed work of project topic

We place a honeypot on one computer, which is connected to all the other computers we
are using. The data collected from the honeypot, along with network traffic pcap
files, blacklist site data, and the Enron email data set, which contains about 5 lakh
(500,000) emails, are stored on top of HDFS. These are given as input to the project
architecture. In the processing part, we use an algorithm that differentiates
legitimate emails from illegitimate ones and then passes the suspected emails to the
phishing campaign analysis. This analysis conducts different tests, and based on their
results we finally detect the phishing emails. The algorithm we will use is Naive
Bayes, which is expected to give an accuracy of 99.5% in detecting such emails and so
overcome the drawback of the existing system.
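The classification step can be sketched as a multinomial Naive Bayes classifier with
Laplace smoothing. This is a minimal illustration and not the project's implementation:
the class name, the whitespace tokenizer, the smoothing value alpha = 1, and the label
strings are assumptions made for the example.

```python
import math
from collections import Counter


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over whitespace-separated words (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha            # Laplace smoothing constant (assumed value)
        self.word_counts = {}         # label -> Counter of word frequencies
        self.doc_counts = Counter()   # label -> number of training emails
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        return text.lower().split()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for word in self.tokenize(text):
                counts[word] += 1
                self.vocabulary.add(word)
        return self

    def log_scores(self, text):
        """Return log P(label) + sum of log P(word | label) for every label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for word in self.tokenize(text):
                if word not in self.vocabulary:
                    continue  # words never seen in training are ignored
                score += math.log(
                    (counts[word] + self.alpha)
                    / (total_words + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)
```

In use, the classifier would be fitted on labeled emails and then asked to predict the
label of each new message; working in log space avoids numeric underflow on long emails.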

03. Software Requirement Specification
3.1 Introduction
Security issues become more critical due to factors such as the large volumes and
variety of data that may be vulnerable, the diversity of data sources and formats, and
the velocity at which data are generated, typically as a high-volume stream.
Enterprises usually collect terabytes of security-relevant data,
including network traffic, and software application events, among others. However,
well established techniques, most of the time, are not scalable and typically produce
many false positives when dealing with large amounts of data, degrading their
efficacy. To face these emerging problems, big data analytics has attracted the interest
of the security community. The use of big data frameworks for security solutions
presents several benefits, such as the possibility of storing and using large quantities
of security data. Although analyzing logs, network flows, and system events has been
used for several decades in security solutions, conventional technologies are not
adequate for such long-term, large-scale volumes. In general, traditional
infrastructure keeps the data only for a limited period. Besides that,
traditional techniques are inefficient when performing analytics and complex queries
on large, unstructured datasets, while big data platforms perform these operations
efficiently. In this report we present an architecture for cybersecurity applications
based on big data frameworks. Our architecture is capable of collecting data from
different sources and storing, combining, and processing them effectively. For
example, pcap files and other logs from a honeypot, as well as data streams collected
from blacklist sites, can all be stored in our system.
3.1.1 Project Scope

The scope of the project is limited to four laptops connected in a LAN. A honeypot
will be installed on one of the laptops to collect data. The data will be stored on
HDFS, retrieved, and processed using the Naive Bayes algorithm to detect spam
messages.

3.1.2 User Classes and Characteristics
End User: The user who operates the system is known as the end user. A user
interacting with the system should be able to understand its operation.
Technical User: Any technical user will be able to operate the project and interact
with the system easily.
Non-Technical User: A non-technical user will also be able to operate the system,
because the GUI will be designed to make interaction easy. Documentation will also be
provided to help the user understand how the system works.
3.1.3 Operating Environment

The system is designed to work on the Ubuntu operating system. Four laptops will be
connected, of which one will be the master and the others will be slaves. The master
computer will have the honeypot installed on it. The computers will be connected in a
LAN so that the master can collect data from every computer.
3.1.4 Design and Implementation Constraints

The hardware requirement is a 1 TB hard disk so that all the collected emails can be
stored. We use a honeypot to collect the data, together with the Enron email data set,
which consists of about 5 lakh (500,000) emails. The language used for coding is Java.
The communication protocols used are POP3 and TCP/IP. Security must be taken into
consideration so that all data are stored properly and are not manipulated or lost
during retrieval.
3.1.5 Assumption and Dependencies

We assume that the data collected are exactly what we require. We also assume that the
algorithm detects the spam messages. The project depends on HDFS for store and
retrieve operations using MapReduce computation.
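The MapReduce style of computation that the project depends on can be illustrated with
a small local simulation. The sketch below does not use the Hadoop API; it only mimics
the map, shuffle, and reduce phases in plain Python to count words per email class.
The function names and the (label, text) record layout are assumptions made for the
example.

```python
from collections import defaultdict


def map_phase(record):
    """Emit ((label, word), 1) for every word of one labeled email."""
    label, text = record
    for word in text.lower().split():
        yield (label, word), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one (label, word) key."""
    return key, sum(values)


def run_job(records):
    """Run the three phases in memory and return {(label, word): count}."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Per-class word counts of this kind are the statistics a Naive Bayes classifier needs,
which is why the counting step fits the map and reduce pattern.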
3.2 System Features
3.2.1 System Features 1: Any project requirement needs to be well thought out,
balanced, and clearly understood by all involved, but perhaps most important is that
requirements are not dropped or compromised halfway through the project. A functional
requirement essentially specifies something the system should do. Some of the
functional requirements are:
REQ-1: The main requirement is the detection of spam emails in the big data set.
REQ-2: The time required for the algorithm to run successfully and detect the spam
emails should be low.
REQ-3: If invalid inputs are provided, such as data other than emails, a pop-up
message should report the invalid input.
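A check of the kind REQ-3 describes could be sketched with Python's standard email
parser. The rule used here, that an input counts as an email only if it carries From
and Date headers, is an assumption for illustration, as are the function name and the
message wording; the returned text is what a pop-up would display.

```python
from email import message_from_string

# Headers an input must carry to be accepted as an email (assumed rule).
REQUIRED_HEADERS = ("From", "Date")


def validate_email_input(raw_text):
    """Return (True, "") for email-like input, else (False, reason)."""
    if not raw_text or not raw_text.strip():
        return False, "Invalid input: empty data"
    message = message_from_string(raw_text)
    # Looking up a header that is absent returns None.
    missing = [name for name in REQUIRED_HEADERS if message[name] is None]
    if missing:
        return False, "Invalid input: missing header(s) " + ", ".join(missing)
    return True, ""
```

The GUI would call this function before storing the data and show the returned reason
whenever the first element is False.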
3.2.2 System Features 2:

Functional requirements are:
REQ-1: All the important information required for the execution of the system should
be mentioned clearly.
REQ-2: The system designed should be complete and should run through to its end point.
REQ-3: The inputs given to the system should be unambiguous.
REQ-4: The system should be verifiable. The output generated should be confirmed.
3.3 External interface requirements

3.3.1 User interface
The user interface is designed for the user's convenience in detecting spam emails.
The GUI will consist of buttons and menus. A button will be provided for storing the
data; on clicking it, the data will be stored on top of HDFS. After the data are
stored, a pop-up window will report whether or not they were stored successfully.
Buttons and menus are provided for further operations. After spam detection, another
pop-up window will report whether any spam emails were detected.

3.3.2 Hardware interface

LAN cable, Hard disk of 1TB
3.3.3 Software interface

Ubuntu, Java, Python
3.3.4 Communication interface

TCP/IP
3.4 Non Functional Requirements

3.4.1 Performance Requirements
The requirement here provides a detailed specification of the user interaction with the
software and measurements placed on the system performance.
3.4.2 Safety Requirements

The application does not affect any other application on the machine. There is no loss
of data while fetching the information from the database.
3.4.3 Security Requirements

Security is provided by detecting spam emails and viruses in emails before they are
downloaded to the computer, which protects the system.
3.4.4 Software Quality Attributes

Agility:
The system should work efficiently by updating the database before and after
detecting the spam emails and by staying synchronized with the system.
Reusability:
The system is network-host based and hence can be used anywhere on four
computers.
Consistency:
The data provided as input should be managed by the system, and the project
should execute on different types of computers without any modification of
the data.
3.5 Analysis Model
3.5.1 Data Flow Diagram
Figure 3.5.1. Data Flow Diagram

3.5.2 Entity Relationship Diagram
Figure 3.5.2. Entity Relationship Diagram

3.5.3 Mathematical Model

S = { I, O, Fn, S, F }
I = Set of Inputs
O = Set of Outputs
Fn = Set of Functions
S = Success
F = Failure
I = Input = { Emails, pcap files and other logs from a honeypot, data streams
collected from blacklist sites }
O = Output = { Successful detection of spam email }
S = Success = { Detection of spam email }
F = Failure = { Detection of spam email fails, connection loss }

3.6 System Implementation Plan



04. System Design
4.1 System Architecture
Figure 4.1. System Architecture

4.2 UML Diagram
Figure 4.2. UML Diagram

05. Technical Specifications
5.1 Advantage
• The spam emails are detected successfully.
• Provides security from harmful emails.
• Provides an expected accuracy of 99.5%.
• Reusable.
5.2 Disadvantage
• Complexity is average.
• Requires more time to find the spam messages because of the large data set.
5.3 Applications
• It works like the spam filtering of a Gmail account, but accepts more kinds of input.
• It can be used in a software company for detecting spam emails.

06. Results
We expect the designed project to be able to detect phishing/spam emails. It should
detect the spam emails within the stipulated time. The system should work and give
results in line with the expected accuracy.

07. Conclusion
The proliferation of data sources and data collecting structures has led to a large
increase in the data available to cyber security experts. To process such large
volumes of data, scalable massive data processing solutions are needed. As mentioned
in the literature survey, the present work on this topic uses the LSH algorithm,
which detects spam emails with an accuracy of 98.1%. Its system complexity is also
high. Our system is expected to reduce the complexity, increase the accuracy to
99.5%, and detect spam emails successfully in less time.

08. Bibliography
[1] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas, “Machine learning: a review
of classification and combining techniques,” Artificial Intelligence Review, 2006.
[2] Y. Yu, Y. Mu, and G. Ateniese, “Recent advances in security and privacy in big
data,” j-jucs, Mar 2015.
[3] P. H. B. Las-Casas, V. Santos Dias, R. Ferreira, W. Meira, and D. Guedes, “A
hadoop extension to process mail folders and its application to a spam dataset,” in
International Symposium on Computer Architecture and High Performance
Computing Workshop (SBAC-PADW), Oct 2014, pp. 108–113.
[4] A. A. Cardenas, P. K. Manadhata, and S. P. Rajan, “Big data analytics for
security,” IEEE Security & Privacy, 2013.

09. Annexure
Annex A:
Glossary
Hadoop: Open source implementation of frameworks for reliable, scalable,
distributed computing and data storage.
Big data: A collection of huge data sets.
Pcap files: Packet capture files that log network traffic.
HDFS: Hadoop Distributed File System.
Spam emails: Unsolicited emails, treated in this report as illegal emails.
Phisher: A person who sends illegal emails in order to gain information.
