
A big data architecture for security data and its application for phishing characterization

01. Problem Statement

As internet use grows, so do cyber security problems. Attackers carry out a variety
of malicious activities to obtain a victim's sensitive information, and then use that
information to perform illegal activities. The main objective of this project is to
detect phishing emails in the data collected by a honeypot. To this end, we design an
architecture built on top of big data frameworks that aims to mitigate cyber security
problems such as phishing.

PVG’s COET, Pune 1



02. Literature Survey

2.1 Present work related to the project topic


The architecture designed in the existing work stores emails on top of HDFS in the
form of mailboxes. Its authors implemented an application to process large volumes of
spam traffic collected from all over the world. The honeypots are located in different
countries and continents and store the messages in mailboxes on top of HDFS. The main
contribution of the application is its capability to identify phishing emails in a set
of spam messages. Using Natural Language Processing (NLP) and Locality-Sensitive
Hashing (LSH) to inspect the text present in the messages, they were able to detect
phishing campaigns. Their experiments showed that the method correctly detects
phishing campaigns with an accuracy of 98.1%; this accuracy figure is the drawback we
aim to improve on.
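
To make the idea of grouping similar messages with LSH concrete, the following is a minimal Python sketch of MinHash signatures with banding. It is not the implementation used in the surveyed work; the function names, the shingle size of 3 words, the 64 hash functions, and the 16 bands are all assumptions chosen for illustration.

```python
import hashlib
import re
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles of a message body."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(shingle, seed):
    """Seeded 64-bit hash; each seed acts as a different hash function."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(shingle_set, num_hashes=64):
    """Keep the minimum hash value per hash function."""
    if not shingle_set:
        raise ValueError("cannot sign an empty message")
    return [min(_hash(s, seed) for s in shingle_set)
            for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def candidate_groups(signatures, bands=16):
    """Group message ids whose signatures agree on at least one full band."""
    buckets = defaultdict(set)
    for msg_id, sig in signatures.items():
        rows = len(sig) // bands
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(msg_id)
    return {frozenset(ids) for ids in buckets.values() if len(ids) > 1}
```

Messages that land in the same bucket become candidates for the same campaign, so only those pairs need a detailed comparison instead of every pair in the data set.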

2.2 Proposed work of project topic


We place a honeypot on one computer, which is connected to all the other computers we
use. The data collected from the honeypot, along with network traffic pcap files,
blacklist site data, and the Enron email data set, which contains 5 lakh (500,000)
emails, is stored on top of HDFS. These are given as input to the project
architecture. In the processing part, we use an algorithm that differentiates
legitimate emails from illegitimate ones and then passes the suspected emails to the
phishing campaign stage. This stage conducts different tests, and based on their
results we finally detect the phishing emails. The algorithm we will use is the Naive
Bayes algorithm, which gives an accuracy of 99.5% in detecting the emails and thereby
overcomes the drawback of the existing system.
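
As an illustration of how a Naive Bayes classifier separates legitimate from suspected emails, here is a minimal Python sketch of a multinomial Naive Bayes model with add-one (Laplace) smoothing. The class name, the tokenizer, and the labels "phishing" and "legitimate" are assumptions made for this example; it is not the project's actual implementation and says nothing about the accuracy reached on real data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesEmailClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # emails seen per label
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocabulary = set()

    def train(self, text, label):
        self.class_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocabulary.add(word)

    def log_scores(self, text):
        """Return log P(label) + sum of log P(word | label) per label."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        words = tokenize(text)
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + vocab_size
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)
```

A short usage example with made-up training sentences:

```python
clf = NaiveBayesEmailClassifier()
clf.train("verify your account password immediately", "phishing")
clf.train("urgent click link to confirm your bank password", "phishing")
clf.train("meeting agenda for the project review", "legitimate")
clf.train("please find the project report attached", "legitimate")
label = clf.predict("please verify your password")
```

Working in log space avoids numeric underflow when a long email contributes many small word probabilities.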


03. Software Requirement Specification

3.1 Introduction
Security issues become more critical due to factors such as the large volume and
variety of data that may be vulnerable, the diversity of data sources and formats, and
the velocity at which data are generated, typically as high-volume streams.
Enterprises usually collect terabytes of security-relevant data, including network
traffic and software application events, among others. However, well-established
techniques are often not scalable and typically produce many false positives when
dealing with large amounts of data, which degrades their efficacy. To face these
emerging problems, big data analytics has attracted the interest of the security
community. The use of big data frameworks for security solutions presents several
benefits, such as the possibility of storing and using large quantities of security
data. Although analyzing logs, network flows, and system events has been used for
several decades in security solutions, conventional technologies are not adequate for
such long-term, large-scale volumes. In general, traditional infrastructure keeps the
data only for a limited period. Besides that, traditional techniques are inefficient
when performing analytics and complex queries on large, unstructured datasets, while
big data platforms perform these operations efficiently. In this project we present an
architecture for cybersecurity applications based on big data frameworks. Our
architecture is capable of collecting data from different sources and of storing,
combining, and processing them effectively. For example, pcap files and other logs
from a honeypot, as well as data streams collected from blacklist sites, can all be
stored in our system.
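
One way the blacklist data could be combined with the collected emails is to check every link in a message against the blacklisted domains. The Python sketch below assumes the blacklist is available as a set of domain names; the function names and the matching rule (exact domain or subdomain) are hypothetical choices for illustration, not part of the described architecture.

```python
import re
from urllib.parse import urlparse

# Matches http(s) URLs up to the next whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_hosts(body):
    """Return the host names of all http(s) URLs found in the body."""
    hosts = set()
    for url in URL_PATTERN.findall(body):
        host = urlparse(url).hostname
        if host:
            hosts.add(host.lower())
    return hosts


def blacklisted_hosts(body, blacklist):
    """Hosts that equal a blacklisted domain or are a subdomain of one."""
    hits = set()
    for host in extract_hosts(body):
        for domain in blacklist:
            if host == domain or host.endswith("." + domain):
                hits.add(host)
    return hits
```

An email with at least one blacklisted host can then be flagged as suspicious before the text-based classification runs.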

3.1.1 Project Scope


The scope of the project is limited to four laptops connected in a LAN. The honeypot
will be installed on one of these laptops to collect data. The data will be stored on
HDFS, retrieved, and processed using the Naive Bayes algorithm to detect the spam
messages.


3.1.2 User Classes and Characteristics

End User: The user who operates the system is known as the end user. A user
interacting with the system should be able to understand its operation.

Technical User: Any technical user will be able to operate the system and will find
it easy to interact with.

Non-Technical User: A non-technical user will also be able to operate the system,
because the GUI will be designed to make interaction easy. Documentation will also be
provided to help the user understand how the system works.

3.1.3 Operating Environment


The system will run on the Ubuntu operating system. Of the four connected laptops, one
will be the master and the others will be slaves. The honeypot will be installed on
the master computer. The computers will be connected in a LAN so that the master can
collect data from every computer.

3.1.4 Design and Implementation Constraints


The hardware requirement is a hard disk of 1 TB so that all the collected emails can
be stored. We use a honeypot to collect the data, together with the Enron email data
set, which consists of 5 lakh (500,000) emails. The language used for coding is Java.
The communication protocols used are POP3 and TCP/IP. Security must be taken into
consideration: all data must be stored properly, and data must not be manipulated or
lost during retrieval.
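
The constraint that data must not be manipulated or lost between storing and retrieval can be checked with a content digest. The following Python sketch keeps a SHA-256 digest next to each stored item and verifies it on retrieval. A plain dictionary stands in for the storage layer, and the function names are hypothetical; this is an illustration of the check, not the project's storage code.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


def store(storage: dict, key: str, data: bytes) -> None:
    """Save the data together with the digest computed at write time."""
    storage[key] = (data, sha256_digest(data))


def retrieve(storage: dict, key: str) -> bytes:
    """Return the data, or raise if it no longer matches its digest."""
    data, expected = storage[key]
    if sha256_digest(data) != expected:
        raise ValueError(f"integrity check failed for {key!r}")
    return data
```

A missing key raises `KeyError` (data lost), while a digest mismatch raises `ValueError` (data changed), so the two failure cases stay distinguishable.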

3.1.5 Assumption and Dependencies


We assume that the collected data is exactly the data we require and that the
algorithm will detect the spam messages in it. The project depends on HDFS for the
store and retrieve operations, which use MapReduce computation.
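
The MapReduce pattern the project depends on can be illustrated without a cluster. The Python sketch below simulates the map, shuffle, and reduce phases in memory to count how often each word occurs per email label, which is the kind of statistic a Naive Bayes model needs. It does not use the Hadoop API; all function names are assumptions for this example.

```python
import re
from collections import defaultdict


def map_email(label, body):
    """Map phase: emit ((label, word), 1) for every word in one email."""
    for word in re.findall(r"[a-z0-9']+", body.lower()):
        yield (label, word), 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce phase: sum the counts collected for one key."""
    return key, sum(values)


def run_job(emails):
    """Run map, shuffle, and reduce over (label, body) pairs."""
    pairs = (pair for label, body in emails
             for pair in map_email(label, body))
    return dict(reduce_counts(key, values)
                for key, values in shuffle(pairs).items())
```

Because each email is mapped independently and each key is reduced independently, both phases can be spread over the master and slave machines in a real deployment.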

3.2 System Features

3.2.1 System Features 1: Any project requirement needs to be well thought out,
balanced, and clearly understood by all involved, but perhaps most important is that
requirements are not dropped or compromised halfway through the project. A functional
requirement essentially specifies something the system should do. Some of the
functional requirements are:
REQ-1: The main requirement is the detection of spam emails in the big data set.
REQ-2: The time the algorithm takes to run successfully and detect the spam emails
should be as short as possible.
REQ-3: If invalid input is provided, such as data other than emails, a pop-up message
should report the invalid input.
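
REQ-3 could be met by validating each input before it is processed. The Python sketch below uses the standard `email` package to parse the text and reports which headers are missing. Requiring the `From` and `Subject` headers is an assumption made for this example, as are the function name and the message wording; the returned message is what a pop-up would display.

```python
from email import message_from_string

# Assumption for this sketch: a usable email has at least these headers.
REQUIRED_HEADERS = ("From", "Subject")


def validate_email_input(raw):
    """Return (True, "") for plausible email text, else (False, reason)."""
    if not isinstance(raw, str) or not raw.strip():
        return False, "Invalid input: empty or non-text data"
    message = message_from_string(raw.lstrip())
    missing = [name for name in REQUIRED_HEADERS if message[name] is None]
    if missing:
        return False, "Invalid input: missing header(s) " + ", ".join(missing)
    return True, ""
```

Returning the reason as text keeps the check independent of the GUI, so the same function can serve both the pop-up and any batch processing.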

3.2.2 System Features 2:


Functional requirements are:
REQ-1: All the important information required for the execution of the system should
be stated clearly.
REQ-2: The system should be complete and should run through to its end point.
REQ-3: The inputs given to the system should be unambiguous.
REQ-4: The system should be verifiable, that is, the generated output should be
confirmed.

3.3 External interface requirements


3.3.1 User interface
The user interface is designed to make it convenient for users to detect spam emails.
The GUI will consist of buttons and menus. A button will be provided for storing the
data; clicking it stores the data on top of HDFS. After the data has been stored, a
pop-up window will report whether the data was stored successfully or not. Buttons and
menus are provided for further operations. After spam detection has run, another
pop-up window will report whether any spam emails were found and whether detection
succeeded.


3.3.2 Hardware interface


LAN cable, Hard disk of 1TB

3.3.3 Software interface


Ubuntu, Java, Python

3.3.4 Communication interface


TCP/IP

3.4 Non Functional Requirements


3.4.1 Performance Requirements
The performance requirements specify how the user interacts with the software and the
measurements placed on system performance.

3.4.2 Safety Requirements


The application must not affect any other application on the machine, and no data may
be lost while fetching information from the database.

3.4.3 Security Requirements


Security is provided by detecting spam emails and viruses in emails before they are
downloaded onto the computer.

3.4.4 Software Quality Attributes


Agility:
The system should work efficiently by updating the database before and after
detecting the spam emails and by staying synchronized with the rest of the
system.
Reusability:
The system is network-host based and can therefore be deployed on any set of
four networked computers.


Consistency:
The data provided as input should be managed by the system, and the project
should execute on different types of computers without any modification of
the data.

3.5 Analysis Model

3.5.1 Data Flow Diagram

Figure 3.5.1. Data Flow Diagram


3.5.2 Entity Relationship Diagram

Figure 3.5.2. Entity Relationship Diagram


3.5.3 Mathematical Model


Let Sys denote the system:

Sys = {I, O, Fn, S, F}
I  = Set of Inputs
O  = Set of Outputs
Fn = Set of Functions
S  = Success
F  = Failure

I = Input = {Emails, pcap files and other logs from a honeypot, data streams
collected from blacklist sites}
O = Output = {Successful detection of spam emails}
S = Success = {Spam emails are detected}
F = Failure = {Detection of spam emails fails, connection loss}


3.6 System Implementation Plan


04. System Design

4.1 System Architecture

Figure 4.1. System Architecture


4.2 UML Diagram

Figure 4.2. UML Diagram


05. Technical Specifications

5.1 Advantages
• Spam emails are detected successfully.
• Provides security against harmful emails.
• Provides an accuracy of 99.5%.
• Reusable.

5.2 Disadvantages
• Complexity is average.
• Requires more time to find the spam messages because of the large data set.

5.3 Applications
• It works like the spam filtering of a Gmail account, but accepts more kinds of input.
• It can be used in a software company for detecting spam emails.


06. Results

We expect the system to detect phishing/spam emails within the stipulated time and to
deliver results at the stated accuracy.
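
To check the results against the expected accuracy, the predictions can be compared with known labels. The Python sketch below computes accuracy, precision, and recall; the function name and the default positive label "phishing" are assumptions for this example, and no measured results are implied.

```python
def evaluate(predicted, actual, positive="phishing"):
    """Compute accuracy, precision, and recall for one positive label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p == positive and a == positive:
            tp += 1          # phishing correctly flagged
        elif p == positive:
            fp += 1          # legitimate email wrongly flagged
        elif a == positive:
            fn += 1          # phishing email missed
        else:
            tn += 1          # legitimate email correctly passed
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

Accuracy alone can look high when phishing emails are rare in the data set, so precision and recall are reported alongside it.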


07. Conclusion

The proliferation of data sources and data collecting structures has led to a large
increase in the data available to cyber security experts. To process such large
volumes of data, scalable massive data processing solutions are needed. As mentioned
in the literature survey, the existing work uses the LSH algorithm to detect spam
emails, but it gives an accuracy of 98.1% and its system complexity is high. Our
system aims to reduce the complexity while increasing the accuracy to 99.5%, so that
spam emails are detected successfully in less time.




09. Annexure

Annex A: Glossary

Hadoop        Open-source implementation of frameworks for reliable, scalable,
              distributed computing and data storage
Big data      A collection of very large data sets
Pcap files    Packet capture files that record network traffic
HDFS          Hadoop Distributed File System
Spam emails   Unsolicited emails, usually sent in bulk
Phisher       A person who sends fraudulent emails in order to gain sensitive
              information
