
J Supercomput (2016) 72:3489–3510
DOI 10.1007/s11227-015-1615-5
Real time intrusion detection system for ultra-high-speed big data environments
M. Mazhar Rathore1 · Awais Ahmad1 · Anand Paul1
Published online: 23 February 2016

© Springer Science+Business Media New York 2016
Abstract In recent years, the number of people using the Internet and network services has been increasing day by day. Every day, a large amount of data, ranging from petabytes to zettabytes, is generated over the Internet at very high speed. At the same time, security threats to networks, the Internet, websites, and enterprise networks are growing. Detecting intrusion in such an ultra-high-speed environment in real time is therefore a challenging task. Many intrusion detection systems (IDSs) have been proposed for various types of network attacks using machine learning approaches. Most of them are unable to detect recent unknown attacks, whereas the others do not provide a real-time solution to the above-mentioned challenges. To address these problems, we propose a real-time intrusion detection system for ultra-high-speed big data environments using a Hadoop implementation. The proposed system includes a four-layered IDS architecture, which consists of the capturing layer, the filtration and load balancing layer, the processing or Hadoop layer, and the decision-making layer. Furthermore, a feature selection scheme is proposed that selects nine parameters for classification using forward selection ranking (FSR) and backward elimination ranking (BER), as well as from the analysis of DARPA datasets. In addition, six major machine learning approaches are used to evaluate the proposed system: J48, REPTree, random forest tree, conjunctive rule, support vector machine, and Naïve Bayes classifiers. Results show that among all these classifiers,
REPTree and J48 are the best classifiers in terms of accuracy as well as efficiency. The proposed system architecture is evaluated with respect to accuracy in terms of true positive (TP) and false positive (FP), with respect to efficiency in terms of processing time and by comparing results with traditional techniques. It has more than 99 % TP and less than 0.001 % FP on REPTree and J48. The system has overall higher accuracy than existing IDSs with the capability to work in real time in ultra-high-speed big data environment.
Keywords Machine learning · Intrusion detection · Threats · Big data · Network
Anand Paul
paul.editor@gmail.com
M. Mazhar Rathore
rathoremazhar@gmail.com
Awais Ahmad
awais.ahmad@live.com
1 School of Computer Science and Engineering, Kyungpook National University, Daegu, Korea
1 Introduction
In this technological era, network and Internet speeds have reached gigabytes and even terabytes per second. People from various fields, even those with little computer knowledge, benefit from Internet services. Companies profit by managing their resources and transactions on the network. People in fields ranging from health care to military applications are expanding their resources using various types of networks, such as sensor, vehicular, and cellular networks. However, the possibility of cyber-attacks that steal personal and secret information from computers and networks is increasing at the same rate.
The world is full of intruders who try to penetrate secret networks to steal data and destroy network resources. They might use a single hidden system of their own, or take illegal control of multiple ordinary users’ machines, without the users’ knowledge, and use them as zombies to launch an attack on the network. Attackers also use various other approaches to penetrate a network and gain illegal access by exploiting vulnerabilities that exist in the system. Many security mechanisms and intrusion detection systems (IDSs) have been proposed and are used to detect such intruders.
The concept of intrusion detection systems was first introduced by Denning [1,2] in 1986, together with the first intrusion detection model, which identifies abnormal behavior in the network. However, it is still an important topic for researchers due to the continuous evolution and changing structure of data, the speed of networks, and the changing adaptation techniques of intruders.
We define intrusion as any illegal computer activity that either passively gains access for information gathering, eavesdropping, etc., or actively causes harm through malicious packet forwarding, packet dropping, hole attacks, etc. Other researchers define IDS from different perspectives [3–6]. Butun et al. [3] define IDS as a collection of tools, methods, and resources that help to identify, assess, and report intrusions. Intrusion detection is usually one part of the whole network security system installed on a system and is not a separate protection measure [4]. Zhang et al. [5] describe intrusion as “any set of actions that attempt to compromise the integrity, confidentiality, or availability of a resource”; intrusion prevention techniques, such as encryption, authentication, access control, secure routing, etc., are offered as the first line of defense against attacks. Intrusion detection comes into play when prevention techniques have failed to protect resources from intruders. An IDS also makes the network more secure by detecting suspicious activity that affects the internal network and is performed by network members themselves. An IDS can support other systems in mitigating and remediating the effects of intrusion by providing information about the attacks launched by the intruder, such as the intruder’s identity and location, the time of intrusion, and the intrusion type (e.g., active or passive, or the attack name, such as wormhole, black hole, sinkhole, selective forwarding, etc.). IDSs are the cyberspace equivalent of the burglar alarms used in current physical security systems [7].
Technological advancement in cyberspace increases the usage of ubiquitous networks, wireless sensor networks, and Web technologies. This abundant use of technology results in an exponential increase in network data traffic and speed. According to one report, 65 % of UK households were connected to the Internet in 2008 [8], rising to 80 % in 2012 [9]. Moreover, the overall computer-generated data were estimated at 2.27 zettabytes in 2012, and 8 zettabytes is expected in the current year [10], of which more than 90 % was generated in the last 2 years [11]. On the Internet, these data are transmitted at very high speed in various ways. An efficient system is therefore needed that takes the high velocity of the data into account when analyzing such high-speed data. Data with high volume, high velocity, and high variety are usually termed big data; they can be structured, semi-structured, or unstructured. In the era of big data, an IDS should be efficient enough to process ultra-high-speed transmission in real time without losing any vital flow packets.
No existing solution is both powerful and generic while also offering high accuracy and efficiency. It is therefore vital to develop a better IDS that secures the valuable resources in the network by protecting machines from unauthorized or malicious actions, especially in high-speed network traffic environments. For any IDS, the main requirement is accuracy, followed by efficiency. In high-speed big data transmission, where data are transmitted at gigabytes per second, the efficiency of the IDS is most important.
To address the aforementioned challenges, the proposed system meets the need for efficiency with higher accuracy while operating continuously in the parallel environment of Hadoop and introducing no extra overhead that degrades the performance of the transmission. The proposed system comprises an ultra-high-speed IDS that detects network intrusions in real time with higher accuracy and efficiency. The contribution of the proposed scheme is manifold: (i) a Hadoop-based architecture is proposed for intrusion detection systems, (ii) an intrusion detection scheme is proposed that selects the nine best features of data flows and detects abnormal flows, (iii) the proposed intrusion detection system is implemented on Hadoop, and (iv) the proposed method is evaluated using machine learning classifiers. The proposed system is compared with existing techniques with respect to accuracy and efficiency while considering most of the reputed machine learning techniques.
The proposed system has higher accuracy and is more efficient than existing systems. Owing to these advantages over traditional systems, it can work in ultra-high-speed big data environments. The system can be implemented by capturing traffic from a switch, router, gateway, or any other high-speed network device with a high-speed capturing card. It detects intrusions into the system by any malicious user over the Internet. An abstract-level model of the system is shown in Fig. 1.
[Figure: networks N1, N2, N3, ..., NZ with PCs and mobile equipment (ME), MSC, BSC, femto cell, firewall/router/hub/switch (system implementation), and the Internet]
Fig. 1 Implementation model of the proposed IDS
The rest of the paper is organized as follows. Section 2 describes the background of IDS and related work. Section 3 presents the proposed system, including the datasets and tools used for analysis and testing, feature and parameter selection for IDS, the proposed IDS architecture, and algorithms. Section 4 describes the implementation details, results and discussion, and evaluation. Finally, Section 5 concludes the article and presents future work.
2 Background and related work
Security attacks on any network system can be broadly categorized as active or passive. In passive attacks, the attackers are typically hidden and either capture communication on a transmission link or disable functioning network elements; examples are eavesdropping, node malfunctioning, node tampering or destruction, and illegal traffic analysis. In active attacks, intruders disturb the operations of the attacked network to achieve their objectives, for example to degrade or terminate networking services. This can be achieved by denial of service (DoS), jamming, black hole, wormhole, sinkhole, flooding, and Sybil attacks. IDSs are mostly built for active attacks. The three major security strategies adopted against such attacks are prevention, detection, and mitigation. First, the prevention strategy ensures that no intruder can penetrate the system and gain illegal access; it works on the “prevent before it happens” rule. Second, detection is performed for those attacks that cannot be prevented: the system immediately starts the detection process, which identifies the attack and the compromised node. Third, mitigation reacts to the effects of the attack and repairs the affected node and the damage.
Intrusion detection systems can be categorized on the basis of the detection mechanism and the source(s) for which detection is provided. By detection mechanism, an IDS can be anomaly based, misuse based, or specification based; by the source of audit data, it can be network based, host based, or hybrid.
In anomaly-based IDS, a profile of the standard statistical behavior of the member or network is maintained. The statistical behavior is continuously monitored, and a significant deviation from the normal behavior is treated as an intrusion. Anomaly-based detection is very powerful against new attacks that are unknown and encountered for the first time. However, for such detection techniques, the profile of usual behavior must be updated periodically because the behavior of the network changes with usage. This periodic updating raises the overhead of the whole system. Anomaly-based IDS can be based on statistical measurements, in which the network traffic is captured and a stochastic, statistical, or probabilistic profile of its behavior is maintained. The profile can be per flow or for the traffic as a whole. An anomaly score is generated from the stochastic behavior profile, and a deviation beyond a particular threshold of this score is detected as an intrusion. Knowledge-based anomaly IDSs require prior knowledge of the network parameters under normal conditions as well as under attack. They can take the form of an expert system (based on rule classification), a description language (based on UML), a finite state machine (states and transitions are defined for available normal data as well as intrusions), or data clustering and outlier detection (data are grouped into clusters based on a specified similarity or distance measure). One of the major techniques used in anomaly-based detection is machine learning (ML). In ML-based anomaly IDS, patterns are generated for a normal profile and an attacked profile. The model is updated periodically to improve IDS performance and accuracy. ML-based IDSs use Bayesian networks (probabilistic relationships among the important parameters), Markov models (stochastic Markov theory, in which the topology and capabilities of the system are modeled as states that are interconnected through certain transition probabilities), fuzzy logic (measure estimation and uncertainty), genetic algorithms (based on the evolutionary theory of biology), neural networks (inspired by the human brain), principal component analysis (PCA; eigenvalues of a matrix and a dimensionality reduction technique), and support vector machines (SVM; matrices).
A rule-based technique is proposed in [12] that is based on a known radio propagation model, using a rule that describes the power decay of message transmission. The technique is very powerful against various attacks, such as DoS or flooding and wormhole. A message is treated as suspicious if its transmitted power is inconsistent with its sender’s geographical position. Puttini et al. [7] proposed a Bayesian classification statistical method to detect intrusion. Their main aim is to detect packet flooding that results in DoS. The proposed approach uses a behavioral model that maintains multiple users’ profiles and applies posterior Bayesian classification to them as the detection algorithm. In [13], the estimated congestion at the intermediary nodes is used as a decision-making mechanism to detect malicious behavior that causes packet dropping. The authors suggested that the traffic pattern can be one of the measurements chosen for intrusion detection from hop to hop. The proposed intrusion detection technique is general and suitable for bandwidth unlimited networks with strict security requirements, such as tactical systems. The IDSs proposed in [14–16] use ML methods and classifiers to detect intrusion. They used the KDDCUP99 datasets and introduced various parameters for ML classifiers for detecting various attacks.
Abbes et al. [17] also used an ML approach for active IDS by analyzing different application protocols. They used separate and distinct adaptive decision trees for each protocol, which classified records into two groups, benign and anomalous. Their system is used to identify DoS attacks, scan attacks, and botnets. Wagner et al. [18] and Khan et al. [19] use support vector machines (SVM) for intrusion classification. Wagner et al. use one-class SVM classifiers that can detect new anomalies [20]. Moreover, a two-state network IDS model is proposed in [21] that uses k-means clustering to group data into three clusters: C1 for attack data such as Probe, U2R, and R2L; C2 for DoS attack data; and C3 for normal data. Gaddam et al. [22] proposed an IDS scheme using k-means clustering and decision tree learning. Cho proposed the idea of using a Markov model for intrusion detection by comparing the intrusion model with the typical model [23]. The author used neural networks and fuzzy logic to make the system robust and flexible. Zhenwei et al. used the idea of an automatically tuned IDS in [24] for attack classification, involving operators when an FP occurs.
Misuse-based detection can be signature based or rule based. Signatures or patterns of previous attacks are identified and then used for future detection. For instance, the signature for a small brute force attack can be “more than five failed attempts to sign in”. Signature-based detection is very simple, accurate, and efficient for known attacks. However, it does not perform accurately for new kinds of attacks. Most antivirus programs use this type of detection mechanism. On the other hand, the authors of [25] identified rules for rule-based intrusion detection, such as the interval rule, retransmission rule, integrity rule, delay rule, repetition rule, radio transmission range rule, etc. Wai et al. proposed a hybrid IDS that can work on both wired and wireless ad hoc networks [26]. It uses misuse-based as well as anomaly-based detection mechanisms.
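The brute force signature mentioned above can be expressed in a few lines. The following is only a sketch: the event format, a sequence of (source, succeeded) pairs, and the function name are assumptions made for illustration, while the threshold of five comes from the example signature.

```python
from collections import Counter

MAX_FAILED_ATTEMPTS = 5  # threshold from the example signature in the text

def brute_force_suspects(login_events, max_failed=MAX_FAILED_ATTEMPTS):
    """Return the sources with more than `max_failed` failed sign-in attempts.

    login_events: iterable of (source, succeeded) pairs (hypothetical format).
    """
    failures = Counter(source for source, succeeded in login_events if not succeeded)
    return {source for source, count in failures.items() if count > max_failed}

# Hypothetical sign-in log: one source fails six times, another fails twice.
events = (
    [("10.0.0.5", False)] * 6
    + [("10.0.0.7", False)] * 2
    + [("10.0.0.7", True)]
)
suspects = brute_force_suspects(events)
```

A signature of this kind is cheap to evaluate, but, as noted, it says nothing about attacks whose pattern has not been recorded in advance.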
In specification-based IDS, specifications and constraints for a standard application are defined. The application is monitored against those constraints, and any deviation is detected as abnormal and as an intrusion. Nadkarni and Mishra proposed a specification-based technique that is mainly concerned with detecting attacks such as DoS and replay attacks, as well as compromised nodes, in distance-vector routing protocols such as DSDV [27].
Network-based IDS monitors and analyzes each incoming packet of the network traffic and identifies intrusions that occur on the network. It can be implemented on network devices, such as a switch, router, server, gateway, etc. Most of the above-mentioned work is network based. Francisco et al. [28] proposed a network intrusion detection system (NIDS) for the smart sensor-inspired device.
Host-based IDS is concerned with the events that occur at each node. It identifies intrusion activity on a single node resulting from any event, such as changes in critical system files on the host, repeated failed access attempts to the host, unusual process memory allocations, and unusual CPU or I/O activity. One host-based anomaly detection system, ADMIT, was developed by Sequeira and Zaki [29]; it creates user profiles from sequences of user or computer commands.
Hybrid IDSs have both network-based and host-based features. They perform intrusion detection on the host as well as on the network as a whole at the same time. El-Khatib proposed a hybrid IDS for 802.11 protocol-specific attacks [30]. The author uses information gain ratio for feature selection and a K-means classifier for intrusion detection.
As the speed of network traffic increases day by day, it results in high-speed big data generation. In such an era, we need a system that can work efficiently in the high-speed environment. Only limited work has been done on intrusion detection in big data environments, and it lacks real-time implementation and efficiency. Tan et al. [31] proposed a theoretical framework to improve the security as well as the privacy of big data by studying the vulnerabilities that exist in cloud computing. Huang, Kalbarczyk, and Nicol [32] developed a latent Dirichlet allocation (LDA)-based hybrid approach for intrusion detection through knowledge discovery in big data. Similarly, Ahn et al. [33] present a new model for detecting unknown attacks based on big data analysis techniques that extract information from various sources. In addition, Marchal et al. [34] proposed an architecture based on big data for large-scale security monitoring; however, the system did not consider accuracy and efficiency, and only limited analysis was performed. Overall, these systems present theoretical models, frameworks, architectures, etc., but lack practical implementation.
Therefore, based on the challenges identified in the literature, it is a challenging task to provide high-speed intrusion detection while designing a network in which an attacker is unable to find a way to break the security. ML is the most widely used approach to detect an intruder with high accuracy. However, the existing techniques are still not efficient enough to process high-speed big data in real time. Building on previous ML knowledge, the proposed system detects intrusions based on nine features with higher accuracy. Moreover, efficiency is achieved in a high-speed data network by implementing the proposed architecture, along with the proposed algorithms, using Hadoop (MapReduce). The details of the proposed system, as well as its implementation, are described in later sections.
3 Proposed model
1. Datasets, tool, and experimental environment We use publicly available and widely used benchmark datasets from three sources for analysis, testing, and evaluation. DARPA [35] is the basic dataset used for our analysis; it contains multiple complex attacks, including probing, breaking into the system by exploiting vulnerabilities, installing DDoS software on the compromised system, and launching a DDoS attack against another target. For testing and feature selection, the most widely used dataset, KDDCUP99 [36], is considered. The dataset is built on the traffic captured by DARPA [35], which contains various intrusions. Each flow is characterized by 41 parameters and labeled as normal or as an attack of a specific type. The KDD training dataset contains 24 specific types of intrusions, with 14 additional attacks in the testing dataset; these include denial of service (DoS) attacks, user to root (U2R) attacks, remote to local (R2L) attacks, and probing attacks. Moreover, the NSL-KDD dataset [37], which removes the issues that exist in the KDD dataset, is also used while testing the proposed system; in NSL-KDD, redundant and duplicate records are removed from KDD to make it more reliable for researchers. The DARPA and KDD datasets have sizes of almost 5.5 and 1 GB, respectively.
We use Java programming with the Weka 3.6.12 library for machine learning algorithm implementation. Moreover, we use Hadoop in a single-node setup with the Pcap-Input Format, Hadoop-pcap-lib, and Hadoop-pcap-serde libraries to process real-time traffic consisting of network packets, as well as large datasets, and to calculate the flow parameters for the machine learning (ML) classification algorithms used for intrusion detection. An Ubuntu 14.04 LTS system with 4 GB RAM and a Core i5 3.20 GHz processor is used for experiments and evaluation. However, only 2 GB RAM is used for the heap in ML classifiers.
2. Features and parameters selection KDD99 [36] suggested 41 parameters for IDS classification. However, this number is too large: it increases the computational cost of the ML algorithms while processing large datasets or ultra-high-speed network traffic for intrusion detection. Moreover, it also reduces the accuracy rate of the system. Various techniques have been used to select features for intrusion detection and to find relationships between them. Aljarrah proposed the RF-FSR and RF-BER [38] feature selection techniques to select the best 16 features among the 41 KDD-proposed features. Kayacik [14] and Araujo [15] reduced this number to 15 and 14, respectively. Still, 14 features are too many to process real-time traffic or large datasets efficiently, and some ML approaches take more time to process large datasets using those features. Kantor [16] finally selects the best 6 among those 41 features for detecting intrusions. Such a small number allows efficient processing of ultra-high-speed traffic; on the other hand, it reduces the accuracy of the system, especially for unknown future attacks. Keeping these requirements in mind, we use forward selection ranking (FSR) and backward elimination ranking (BER) [38] together to select the 4 best features among the 41, namely features 1, 2, 3, and 16. Instead of parameters 6 and 7, i.e., src_bytes and dst_bytes, we use “number of packets” and “packet size mean”. Furthermore, by analyzing DARPA TCPDump traffic, we observed that the packet size distribution of normal traffic differs from that of malicious flows for a particular application. Therefore, we added three more parameters, i.e., pkt_rate, pkt_sd_size, and range_pkt_size, to our selected feature list.
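To make the packet-based parameters concrete, the following sketch computes them for a single flow. It is an illustration only: the packet representation (a list of timestamp and size pairs), the use of the population standard deviation, and the fallback used when a flow has zero duration are our assumptions here, not details of the implementation.

```python
from statistics import mean, pstdev

def flow_features(packets):
    """Compute packet-based features for one flow.

    packets: list of (timestamp_seconds, size_bytes) pairs (assumed layout).
    """
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    duration = max(times) - min(times)
    return {
        "duration": duration,
        "num_packets": len(packets),
        # Packets per second; falls back to the packet count for a
        # zero-duration flow (an assumption made for this sketch).
        "pkt_rate": len(packets) / duration if duration > 0 else float(len(packets)),
        "pkt_size_mean": mean(sizes),
        "pkt_sd_size": pstdev(sizes),          # population standard deviation
        "range_pkt_size": max(sizes) - min(sizes),
    }

# Hypothetical flow with four packets captured over four seconds.
example = flow_features([(0.0, 100), (1.0, 200), (2.0, 300), (4.0, 200)])
```

For the example flow, the packet rate is 1 packet/second, the mean packet size is 200 bytes, and the range is 200 bytes.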
In both the FSR and BER feature selection techniques, the weight of a feature plays a major role in the selection procedure. The enhanced support vector decision function (ESVDF) [39] is used to identify the weights of all parameters depending on their importance in detection. After that, random forest [40] sorts all features by weight. In FSR, the two features with the highest weights among the 41 are selected initially and form a set called the selected features set (SFS). The SFS is then used to build the intrusion classification model and is evaluated on accuracy and efficiency while identifying intrusions. Afterward, the parameter with the highest weight among the remaining 39 is added to the SFS, and the evaluation is performed again using the SFS features. If the newly added parameter enhances the performance of the system in terms of accuracy and efficiency, it is kept in the SFS; otherwise it is removed. This process continues until all 41 parameters have been evaluated by putting them into the SFS one by one. In BER, all 41 parameters are initially kept in the SFS. The parameters are removed from the SFS one by one in order of weight, from lowest to highest. If removal of a parameter degrades the performance of the system, the parameter is added to the SFS again. We use FSR and BER together to select the 4–6 best parameters among all 41. Finally, we settle on the BER- and FSR-selected features together with our analysis-based features. Table 1 shows the details of all nine features that we use for intrusion classification.

Table 1 Selected features of the proposed IDS
Serial #   Features         Details
1          Duration         Whole duration of the flow/session
2          Protocol         Protocol (TCP, UDP, HTTP, etc.)
3          Service          Particular service the host is using
4          Num_root         Number of roots involved
5          No. of packets   Number of packets
6          Pkt_rate         Packet rate in packets/second transmitted by a flow
7          Pkt_size_mean    Mean value of the packet sizes exchanged in the flow
8          Pkt_sd_size      Standard deviation of the packet sizes
9          Range_pkt_size   The range of the packet sizes
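The two selection procedures described above can be sketched as follows. This is a simplified illustration rather than the implementation used in this work: the features are assumed to be already sorted by weight, and `evaluate` is a hypothetical callback that returns a single score combining accuracy and efficiency (higher is better). The toy scoring function exists only to make the example runnable.

```python
def forward_selection(ranked_features, evaluate):
    """FSR sketch. ranked_features is sorted by weight, highest first."""
    sfs = list(ranked_features[:2])            # start with the two heaviest features
    best = evaluate(sfs)
    for feature in ranked_features[2:]:
        score = evaluate(sfs + [feature])
        if score > best:                       # keep the feature only if it helps
            sfs.append(feature)
            best = score
    return sfs

def backward_elimination(ranked_features, evaluate):
    """BER sketch. Start from all features and drop the lightest ones first."""
    sfs = list(ranked_features)
    best = evaluate(sfs)
    for feature in reversed(ranked_features):  # lowest weight first
        trial = [f for f in sfs if f != feature]
        if not trial:
            break                              # never evaluate an empty set
        score = evaluate(trial)
        if score >= best:                      # removal did not degrade performance
            sfs, best = trial, score
    return sfs

# Toy scoring function (hypothetical): two features are informative,
# and every extra feature carries a small efficiency penalty.
USEFUL = {"duration", "service"}

def toy_score(features):
    return len(USEFUL & set(features)) - 0.01 * len(features)

ranked = ["duration", "protocol", "service", "flag"]
fsr_result = forward_selection(ranked, toy_score)
ber_result = backward_elimination(ranked, toy_score)
```

With this toy score, forward selection cannot discard "protocol" because it is part of the initial pair, whereas backward elimination removes it; this is one reason for using the two procedures together.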
3. Classification algorithms Various machine learning classifiers, such as naïve Bayes, support vector machine, conjunctive rule, random forest, J48, and REPTree, are used to identify intrusions by applying the selected features. A short description is given for each of the classifiers used in our work. Naïve Bayes is a technique for constructing classifiers, i.e., models that assign class labels to problem instances. Instances are represented as vectors of feature values, and class labels are drawn from some finite set. A naïve Bayes classifier assumes that the value of a specific feature is independent of the values of the other features, given the class variable; for instance, a fruit may be considered to be an orange if its color is orange, its shape is round, and its diameter is about 3”. A naïve Bayes classifier considers each of these features to contribute independently to the probability that the fruit is an orange, regardless of any correlations between the color, shape, and diameter features.
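The independence assumption can be seen in a minimal categorical naïve Bayes sketch. It is not the Weka implementation used in this work; the Laplace smoothing, the function names, and the tiny fruit dataset are assumptions chosen only to mirror the example above.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """rows: list of dicts mapping feature name to a categorical value."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, feature) -> counts per value
    values_seen = defaultdict(set)        # feature -> distinct values observed
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(label, feature)][value] += 1
            values_seen[feature].add(value)
    return class_counts, value_counts, values_seen

def predict(model, row):
    """Return the class with the highest posterior log-probability."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n in class_counts.items():
        logp = math.log(n / total)        # class prior
        for feature, value in row.items():
            # Each feature contributes independently (Laplace-smoothed estimate).
            count = value_counts[(label, feature)][value]
            logp += math.log((count + 1) / (n + len(values_seen[feature])))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Hypothetical training data mirroring the fruit example.
rows = [
    {"color": "orange", "shape": "round"},
    {"color": "orange", "shape": "round"},
    {"color": "yellow", "shape": "oval"},
    {"color": "yellow", "shape": "round"},
]
labels = ["orange", "orange", "lemon", "lemon"]
model = train_naive_bayes(rows, labels)
```

Calling `predict(model, {"color": "orange", "shape": "round"})` multiplies the per-feature probabilities for each class and selects "orange", without modeling any correlation between color and shape.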
The support vector machine is a supervised learning model that analyzes data and recognizes patterns, and it is used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.
The conjunctive rule classifier implements a simple conjunctive rule learner, which can predict numeric and nominal class labels. A rule consists of antecedents “AND”ed together and the consequent (class value) for the classification/regression. In this case, the consequent is the distribution of the available classes in the dataset. If a test example is not covered by this rule, it is predicted using the default class distribution of the training data not covered by the rule.
Random forest (tree based) belongs to the ensemble learning methods that are used for classification, regression, and other tasks. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
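The aggregation step of a random forest can be sketched in isolation, assuming the individual trees have already produced their outputs. The function names are hypothetical, and the training of the trees themselves is deliberately omitted.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_predictions):
    """Classification: return the most common class among the trees (the mode)."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_regress(tree_predictions):
    """Regression: return the mean of the individual tree predictions."""
    return mean(tree_predictions)

# Hypothetical outputs of three trees for one flow.
label = forest_classify(["attack", "normal", "attack"])
value = forest_regress([0.2, 0.4, 0.6])
```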
The C4.5 algorithm is used to generate a decision tree. It is an extension of Quinlan’s earlier ID3 algorithm. The decision trees it generates can be used for classification; for this reason, C4.5 can be referred to as a statistical classifier. In this paper, we use J48, the Java implementation in Weka that is based on C4.5.
REPTree is a fast decision tree learner that builds a decision/regression tree using information gain/variance and then prunes it using reduced-error pruning (with backfitting). It sorts the values of numeric attributes only once, and missing values are dealt with by splitting the corresponding instances into pieces (i.e., as in C4.5).
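Information gain, the splitting criterion mentioned above, can be computed as in the following sketch. It shows only the criterion, not tree construction or pruning; the function names and the four-record example are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction achieved by partitioning `labels` into `groups`."""
    total = len(labels)
    remainder = sum(len(group) / total * entropy(group) for group in groups)
    return entropy(labels) - remainder

# Hypothetical node with two attack flows and two normal flows.
labels = ["attack", "attack", "normal", "normal"]
perfect_split = [["attack", "attack"], ["normal", "normal"]]
useless_split = [["attack", "normal"], ["attack", "normal"]]
```

A tree learner would evaluate `information_gain` for every candidate split and choose the one with the highest value; here the perfect split gains one full bit and the useless split gains nothing.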
4. Proposed architecture The main objective of the proposed system is to process
network traffic in real time for intrusion detection with higher accuracy in a high-speed
big data environment. With this objective in mind, the proposed architecture is designed
so that it can be implemented on any network device, such as a switch or router, and
even on the gateways and firewalls of ISPs and telecommunication authorities. Initially,
the traffic is captured from the above-mentioned ultra-high-speed network with high-speed
capturing devices and drivers such as PF_RING and TNAPI [41], so that no packet remains
uncaptured. The captured traffic is sent to the next layer, the filtration and load
balancing server (FLBS). The FLBS has two primary responsibilities. First, it passes on
for analysis only the traffic of those flows that have not yet been classified as
intrusion or normal flows, using efficient searching and comparison in the In-Memory
intruders database. Second, it sends the traffic of the unidentified flows, together with
the required packet header information, to the master servers of the third layer (the
Hadoop layer). The FLBS also balances the load by deciding, based on the IP addresses,
which packets are sent to which master server. The master takes the network
traffic/packets and generates a sequence file for each flow, so that it can be processed
by the Hadoop data nodes. The master node extracts the necessary information from each
packet using the Pcap-Input Format, Hadoop-pcap-lib, and Hadoop-pcap-serde APIs and
stores that information in the sequence file, such that each packet corresponds to one
line. This process continues for a particular duration for each flow. Afterward, the
sequence file is sent to the data nodes, which are equipped with the feature value
calculation algorithm implemented in MapReduce. The MapReduce code of the algorithm
calculates the network flow features by processing the sequence file line by line in
parallel. To achieve real-time efficiency, we use Spark over the Hadoop ecosystem. Next,
the feature values are sent to the layer 4 decision server(s). The decision servers host
implementations of various classifiers, such as J48, REPTree, and SVM, which classify
the flows as normal or attack based on their parameter values. Finally, the decision
about a particular flow is stored in the In-Memory intrusion list, which the filtration
server uses for filtering the intruder’s traffic. The In-Memory database increases the
efficiency of the system by providing data at high speed for comparisons and searching.
Apart from the proposed architecture, there are a few existing big data processing
architectures [42,43] that can process high-speed data. However, the proposed
architecture is particularly designed
for intrusion detection systems. A complete picture of the architecture is shown in
Fig. 2.
Fig. 2 Proposed IDS architecture
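The two FLBS responsibilities described above (filtering already-decided flows and choosing a master server from the IP addresses) can be illustrated with a minimal sketch. The class name `FLBS`, the `FlowKey` tuple, and the rule of taking the source IP modulo the number of masters are assumptions made for illustration; the paper only states that balancing depends on the IP addresses.

```python
import ipaddress
from typing import Dict, NamedTuple, Optional


class FlowKey(NamedTuple):
    """Flow identifier; field names are illustrative."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


class FLBS:
    """Hypothetical filtration and load balancing server."""

    def __init__(self, num_masters: int) -> None:
        self.num_masters = num_masters
        # In-memory decision list: flow -> "attack" or "normal".
        self.decided: Dict[FlowKey, str] = {}

    def record_decision(self, flow: FlowKey, label: str) -> None:
        """Store a decision reported back by a decision server."""
        self.decided[flow] = label

    def route(self, flow: FlowKey) -> Optional[int]:
        """Return the master index for an undecided flow, or None if filtered."""
        if flow in self.decided:
            return None  # already classified, no further analysis needed
        # Assumption: balance on the source IP so that all packets of a
        # flow reach the same master server.
        return int(ipaddress.ip_address(flow.src_ip)) % self.num_masters
```

A dictionary lookup keeps the per-packet filtering cost roughly constant, which is in the spirit of the In-Memory database described in the text.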
5. Proposed algorithm A joint algorithm is proposed for all layers to identify intruder
flows. Flows are distinguished by a four-tuple, i.e., source IP, destination IP, source
port, and destination port (src_IP, dst_IP, src_port, dst_port). Algorithm 1 gives the
pseudocode of the proposed algorithm. Initially, for each captured packet, filtration is
performed at the FLBS, as described in step 2. The FLBS passes on for processing only
those packets that belong to flows not yet identified as intruder or normal flows. Step 3
is performed at a master node, which checks whether the incoming packet belongs to an
already registered flow. If it does not, the flow is registered as a new flow, identified
by (src_IP, dst_IP, src_port, dst_port), and a new sequence file is created for this flow
with the necessary packet information entered in its first line. If, on the other hand,
the packet belongs to a registered flow, the packet information is simply appended to
the existing sequence file of that flow. The master node continues to copy packet
information into the sequence file for a particular duration for each flow. Once the flow
duration reaches the threshold, the sequence file is sent to one of the data nodes for
the calculation of the flow parameters, as coded in step 4. The data node uses Map and
Reduce functions equipped with the parameter calculation code to compute the final values
of each of the nine features for intrusion detection. The Map and Reduce functions can
run in parallel on the Hadoop environment, taking the sequence file as input.
Since each data node processes the information of a distinct flow in parallel, the
overall performance is enhanced. Finally, the calculated feature values are sent to the
decision server, which is equipped with various ML classifiers to decide, based on the
feature values, whether the flow is an intrusion or a normal flow. The ML algorithms used
in this paper are described in Sect. 3 (3). The decisions made by the decision servers
are then reported to the In-Memory database at the FLBS to update the intruder flow list.
A complete picture of the flow of the system is depicted in Fig. 3.
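The map and reduce steps at the data node can be sketched in plain Python as follows. This is only an illustration under stated assumptions: each sequence-file line is assumed to have the form `timestamp,length`, and only the duration feature is taken from the paper's feature list; the packet count and byte total are stand-ins for the remaining features, which this section does not enumerate.

```python
from functools import reduce
from typing import Dict, Iterable, Tuple

# Partial result for a group of packets: (first_ts, last_ts, packets, bytes).
Partial = Tuple[float, float, int, int]


def map_packet(line: str) -> Partial:
    """Map step: turn one sequence-file line into a partial result."""
    ts_text, len_text = line.strip().split(",")
    ts = float(ts_text)
    return (ts, ts, 1, int(len_text))


def combine(a: Partial, b: Partial) -> Partial:
    """Reduce step: merge two partial results (associative and commutative)."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])


def flow_features(lines: Iterable[str]) -> Dict[str, float]:
    """Compute illustrative flow features from a non-empty sequence file."""
    first, last, packets, total = reduce(combine, map(map_packet, lines))
    return {"duration": last - first, "packets": packets, "bytes": total}
```

Because `combine` is associative and commutative, the lines of a sequence file can be split across workers and merged in any order, which is what allows the line-by-line processing to run in parallel.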
Algorithm 1. IDS Algorithm Pseudo Code
INPUT: Continuous real-time network traffic/packets
OUTPUT: result, intrusion flows/normal flows.
1. ForEach (packet) Do steps 2-8
2. If (Flow_already_classified) // at Filtration Server
       Return Next packet. // return to next incoming packet
3. If (Flow_not_registered) // at Master Node
       Flow_list ← Flow_list + new_flow(pkt_src_IP, pkt_dst_IP, pkt_src_port, pkt_dst_port).
       // register new flow in sequence file
       Add_packet_parameters(packet values). // add packet parameters into sequence file
       Return Next packet.
   Else
       Update_packets_parameters(Seq_file). // update in corresponding sequence file
4. If (Flow_duration < time_threshold)
       Return Next packet.
   Else // at Data Node
       a. Send_to_Dnode(seq_file). // send to data node for feature calculation
       b. Foreach (seq_file) // at Data Node
              Calculate(Flow parameters/features).
       c. Send_to_Dserver(feature values).
5. Result ← Machine_Learning_classifier(parameters values); // at Decision-making Server
6. Store(result); // in In-Memory DB, working memory
7. Return Next packet.
8. End.
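The per-packet control flow of Algorithm 1 can be sketched in a single Python class as follows. This is a simplified, single-process illustration, not the authors' Hadoop implementation: the names `Pipeline`, `extract`, and `classify` are hypothetical, the sequence file is modelled as an in-memory list of `(timestamp, length)` records, and the threshold is assumed to be positive so that the first packet of a new flow never triggers classification.

```python
from typing import Callable, Dict, List, Optional, Tuple

Flow = Tuple[str, str, int, int]   # (src_IP, dst_IP, src_port, dst_port)
Record = Tuple[float, int]         # (timestamp, packet length)


class Pipeline:
    """Hypothetical single-process rendering of steps 2-7 of Algorithm 1."""

    def __init__(self, time_threshold: float,
                 extract: Callable[[List[Record]], Dict[str, float]],
                 classify: Callable[[Dict[str, float]], str]) -> None:
        self.time_threshold = time_threshold
        self.extract = extract      # stands in for the data node
        self.classify = classify    # stands in for the decision server
        self.decisions: Dict[Flow, str] = {}            # in-memory DB at FLBS
        self.seq_files: Dict[Flow, List[Record]] = {}   # per-flow records at master

    def on_packet(self, flow: Flow, ts: float, length: int) -> Optional[str]:
        # Step 2: skip packets of flows that are already classified.
        if flow in self.decisions:
            return None
        # Step 3: register the flow if needed and append the packet record.
        records = self.seq_files.setdefault(flow, [])
        records.append((ts, length))
        # Step 4: keep collecting until the flow duration reaches the threshold.
        if ts - records[0][0] < self.time_threshold:
            return None
        features = self.extract(self.seq_files.pop(flow))
        # Steps 5-6: classify and store the result for later filtering.
        result = self.classify(features)
        self.decisions[flow] = result
        return result
```

For example, with `extract` returning a packet count and `classify` thresholding on it, the first packets of a flow return `None`, the packet that crosses the duration threshold returns the label, and all later packets of that flow are filtered at step 2.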
4 Implementation and evaluation
The proposed IDS is implemented using MapReduce programming in Java and Spark on top of
the Hadoop ecosystem, using the Hadoop-pcap-input, Hadoop-pcap-lib, and
Hadoop-pcap-serde APIs for real-time packet processing. The proposed system is
implemented on a single-node Hadoop setup, in which the one node serves as both the
master and the data node. The ML classifiers are implemented in Java at the decision
server, while Hadoop processes the sequence files and calculates the parameter values
for each incoming flow. The most widely used and efficient ML classifiers are selected
for evaluating the proposed system and features. The selected classifiers are naïve
Bayes, support vector machine (SVM), conjunctive rule, random forest tree, REPTree, and
J48 (the Java implementation of C4.5), described in Sect. 3. The proposed system is
evaluated for accuracy by considering the true positive (TP) and false positive (FP)
rates. The system is also evaluated for efficiency in terms of processing time with the
above-mentioned classifiers on the KDD datasets. The accuracy evaluation uses three KDD
[36] dataset files, i.e., the corrected dataset file with a total of 311030 flows/IPs,
the KDDCup.data.corrected file with its first 1048576 flows/IPs, and the
KDDcup.data.corrected.10 % file with 494021 flows, using the above-stated ML
Fig. 3 Flowchart of the IDS algorithm
Table 2 Accuracy of the proposed system on three files of the KDD99 dataset
Serial  Classifier        Corrected dataset file   KddCup.Data.Corrected   KddCup.Data.Corrected_10%   Overall
                          TP (%)    FP (%)         TP (%)    FP (%)        TP (%)    FP (%)            TP (%)    FP (%)
1       Naive Bayes       94.1      0.002          94.7      0.0001        95        0.0015            94.6      0.0012
2       Conjunctive rule  80.1      0.05           75        0.0001        78.95     0.061             78.02     0.037
3       SVM               97.7      0.005          94.3      0.0001        95.8      0.0001            95.93     0.001
4       Random forest     98.9      0.002          99.9      0.0001        99.9      0.00001           99.57     0.0007
5       J48               99.9      0              99.9      0.0001        99.9      0                 99.9      0.00003
6       RepTree           99.9      0.0005         99.9      0.0001        99.9      0                 99.9      0.0002
classifiers. Finally, a comparison is made with earlier IDSs with respect to accuracy
and efficiency. The techniques compared against are RF-FSR and RF-BER [38], Kayacik
[14], Araujo [15], and Kantor [16]. The proposed IDS has more than 99 % TP on all
intrusion datasets. The comprehensive accuracy results in terms of TP and FP are shown
in Table 2.
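This section does not state how the TP (%) and FP (%) values are computed. As an assumption, the sketch below uses the standard binary definitions, TP rate = TP / (TP + FN) and FP rate = FP / (FP + TN), with attack flows as the positive class; the function name `tp_fp_rates` and the label strings are illustrative.

```python
from typing import Sequence, Tuple


def tp_fp_rates(labels: Sequence[str], predictions: Sequence[str],
                positive: str = "attack") -> Tuple[float, float]:
    """Return (TP rate, FP rate) in percent for a binary classification."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(labels, predictions):
        if actual == positive and predicted == positive:
            tp += 1   # attack correctly detected
        elif actual == positive:
            fn += 1   # attack missed
        elif predicted == positive:
            fp += 1   # normal flow wrongly flagged
        else:
            tn += 1   # normal flow correctly passed
    tp_rate = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    fp_rate = 100.0 * fp / (fp + tn) if fp + tn else 0.0
    return tp_rate, fp_rate
```

Under these definitions, a high TP rate with a low FP rate means that most attack flows are detected while few normal flows are flagged.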
Intrusion detection using the proposed nine features performs well with the J48 and
REPTree classifiers. The accuracy in terms of TP is more than 99.9 % on the
KddCup.Data.Corrected and KddCup.Data.Corrected_10 % dataset files. The FP
Fig. 4 Time taken by each classifier to build a model of various files of the KDD99 dataset (model building time in seconds per classifier, for the Corrected Dataset File, KddCup.Data.Corrected, and KddCup.Data.Corrected_10% files)
Fig. 5 Time taken by each classifier to classify intrusion on various files of the KDD99 dataset (decision time in seconds per classifier, for the same three files)
rate of both of these classifiers is very low, i.e., less than 0.0001 % for both the
KddCup.Data.Corrected and KddCup.Data.Corrected_10 % dataset files. Moreover, the
proposed system also has an overall TP of more than 99 % and an FP of less than
0.0001 %. The accuracy results also show that the conjunctive rule classifier is a poor
choice for intrusion detection, as it has a very low TP rate and a very high FP rate
compared to the other ML classifiers.
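The text does not say how the Overall column of Table 2 is derived. The values appear consistent with an unweighted mean over the three dataset files, which the following sketch checks for three rows copied from Table 2; this reading is an assumption, and the SVM FP entry (0.001 in the table against a mean of about 0.0017) does not follow it exactly.

```python
from statistics import mean

# Per-file TP (%) and FP (%) values copied from Table 2, in the order:
# corrected file, KddCup.Data.Corrected, KddCup.Data.Corrected_10%.
TABLE2 = {
    "Naive Bayes":   ([94.1, 94.7, 95.0], [0.002, 0.0001, 0.0015]),
    "Random forest": ([98.9, 99.9, 99.9], [0.002, 0.0001, 0.00001]),
    "J48":           ([99.9, 99.9, 99.9], [0.0, 0.0001, 0.0]),
}


def overall(values):
    """Unweighted mean across the three files (assumed aggregation)."""
    return mean(values)


for name, (tp_values, fp_values) in TABLE2.items():
    print(name, round(overall(tp_values), 2), overall(fp_values))
```

Note that an unweighted mean gives each file equal weight even though the files differ in size (311030, 1048576, and 494021 flows).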
Regarding efficiency in terms of processing time, since the proposed solution uses fewer
parameters and is implemented on the parallel environment of Hadoop, it takes less time
to process larger datasets. The IDS implementation using the REPTree classifier is the
most efficient for both model building and decision-making, as shown in Figs. 4 and 5.
The naïve Bayes classifier also performed well in terms of processing time using the
proposed features. However, the naïve Bayes classifier is not efficient at
decision-making and is less accurate. The time
consumed by the different classifiers for model building using the proposed features on
the three dataset files is shown in Fig. 4.
Figure 5 shows the time taken by each classifier to make decisions after model building
for the KDDCup dataset files. The random forest, SVM, and naïve Bayes implementations
took more time to identify intrusions in the KDDcup.Data.Corrected file. The REPTree,
J48, and conjunctive rule classifiers took almost the same time to process the dataset
for intrusion detection. However, as shown in Table 2, the accuracy of the conjunctive
rule classifier is lower than that of the other classifiers. Moreover, the naïve Bayes
classifier is more efficient for model building but less efficient for decision-making.
The SVM is not efficient for either model building or decision-making. Finally, by
analyzing the accuracy and efficiency results of the various ML classifiers, we conclude
that REPTree and J48 are the two best choices for intrusion detection, with higher
accuracy and more efficiency using the proposed features on Hadoop.
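Figures 4 and 5 report two separate measurements, model building time and decision time. A minimal harness for taking both measurements could look like the sketch below. It is not the authors' Java setup: `time_classifier` is a hypothetical helper, and the majority-class model is a trivial stand-in used only so that the example is self-contained.

```python
import time
from collections import Counter
from typing import Any, Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]   # (feature values, label)


def time_classifier(build: Callable[[List[Sample]], Any],
                    classify: Callable[[Any, Sequence[float]], str],
                    train: List[Sample],
                    test: List[Sample]) -> Tuple[float, float, List[str]]:
    """Return (build seconds, decision seconds, predictions)."""
    start = time.perf_counter()
    model = build(train)
    build_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = [classify(model, features) for features, _ in test]
    decision_seconds = time.perf_counter() - start
    return build_seconds, decision_seconds, predictions


def build_majority(train: List[Sample]) -> str:
    """Stand-in 'model': the most frequent training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def classify_majority(model: str, features: Sequence[float]) -> str:
    """Stand-in classifier: always predict the majority label."""
    return model
```

Timing the two phases separately matters because, as the text notes for naïve Bayes, a classifier can be fast to build and still slow at decision time.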
Finally, a comparison is made with the existing techniques mentioned above, considering
efficiency in terms of processing time and accuracy in terms of TP and FP. The results
on the various datasets show that the proposed IDS is more accurate with most of the ML
classifiers, as shown in Tables 3, 4, and 5. Our technique has a higher accuracy than
most of the existing techniques on several dataset files, with higher TP and lower FP.
When detection is applied to the KDDcup.corrected.data file, Kantor’s system using the
naïve Bayes classifier outperforms the proposed system in terms of TP, as shown in
Table 4. On the other hand, the proposed system outperforms Kantor’s system by a large
margin in processing time. Similarly, the conjunctive rule classifier for RF-FSR also
gives a better TP than our system, but in this case its FP is considerably higher. When
we consider the accuracy results on the KDDcup.corrected.data_10 % dataset file shown in
Table 5, the accuracy of most of the techniques is equal to that of the proposed system;
however, the proposed system outperforms all these approaches in terms of processing
time.
The efficiency comparison is based on the time consumed in building a classification
model for intrusion detection as well as in decision-making, i.e., the classification
itself, for the corrected dataset file. Figure 6 shows the model building time in
seconds, and Fig. 7 shows the classification or decision-making time using the various
machine learning classifiers. For model building, only Kantor’s system takes the same
time as the proposed system. On the other hand, the proposed system outperforms Kantor’s
IDS with every classifier in decision-making, as shown in Fig. 7.
For every ML classifier, the proposed system takes less model building time as well as
less decision-making time than the existing techniques. With the REPTree and J48
classifiers, the proposed system is the most efficient and has higher accuracy than any
other system. The evaluation shows that the system is accurate and efficient and has the
capability to perform better in an ultra-high-speed big data environment.
Table 3 Accuracy comparison among different IDS on the corrected data file of the KDD99 dataset
                  TP (%)                                                   FP (%)
Classifiers       RF-FSR  RF-BER  Kayacik  Araujo  Kantor  Our system     RF-FSR  RF-BER  Kayacik  Araujo  Kantor  Our system
Naive Bayes       91.4    88      91.8     90.1    91.4    94.1           0.003   0.003   0.005    0.002   0.003   0.002
SVM               95.8    95.8    95.8     95.4    94.1    97.7           0.009   0.009   0.009    0.01    0.11    0.05
Conjunctive rule  72.2    72.2    72.2     72.2    72.2    80.1           0.067   0.067   0.067    0.067   0.067   0.005
Random forest     98.1    97.9    97.9     97.6    97.2    98.9           0.002   0.003   0.002    0.001   0.004   0.002
J48               98      99.9    97.9     97.5    97.2    99.9           0.002   0       0.002    0.001   0.004   0
REPTree           97.9    97.7    97.9     97.4    97.2    99.9           0.003   0.003   0.003    0.001   0.004   0.0005
Table 4 Accuracy comparison among different IDS on KDDcup.corrected.data file of the KDD99 dataset
                  TP (%)                                                   FP (%)
Classifiers       RF-FSR  RF-BER  Kayacik  Araujo  Kantor  Our system     RF-FSR  RF-BER  Kayacik  Araujo   Kantor   Our system
Naive Bayes       94.9    94.7    94.9     94.3    97.1    94.7           0.001   0       0.001    0.001    0.001    0.0001
SVM               90      94      93.7     93.6    93.4    94.3           0.008   0.061   0.008    0.01     0.006    0.0001
Conjunctive rule  77.9    74      77.9     73.9    74      75             0.078   0.074   0.078    0.074    0.074    0.0001
Random forest     99.9    99.9    99.9     99.9    99.8    99.9           0.0001  0       0        0.00001  0.00001  0
J48               99.9    99.9    99.9     99.9    99.8    99.9           0.0001  0       0        0.00001  0.00001  0
REPTree           99.9    99.9    99.9     99.9    99.8    99.9           0.0001  0       0        0.00001  0.00001  0
Table 5 Accuracy comparison among different IDS on KDDcup.corrected.data_10 % file of the KDD99 dataset
                  TP (%)                                                   FP (%)
Classifiers       RF-FSR  RF-BER  Kayacik  Araujo  Kantor  Our system     RF-FSR  RF-BER  Kayacik  Araujo  Kantor   Our system
Naive Bayes       95.5    94.8    94.5     93.6    92.8    95             0       0.001   0.001    0.001   0.007    0.0015
SVM               99.4    99.7    99.3     99.1    98.7    95.8           0.001   0.0001  0.002    0.002   0.004    0.061
Conjunctive rule  78.5    78.5    78.5     78.5    78.5    78.95          0.061   0.061   0.061    0.061   0.061    0.0001
Random forest     99.9    99.9    99.9     99.9    99.8    99.9           0       0       0        0       0.00001  0.00001
J48               99.9    99.9    99.9     99.9    99.8    99.9           0       0       0        0       0.00001  0
REPTree           99.9    99.9    99.9     99.9    99.8    99.9           0       0       0        0       0.00001  0
Fig. 6 Efficiency comparison of various IDS systems based on model building time for the corrected file of the KDD dataset (per classifier, for RF-FSR, RF-BER, Kayacik, Araujo, Kantor, and our system)
Fig. 7 Efficiency comparison of various IDS systems based on the classification (detection) time for the corrected file of the KDD dataset (per classifier, for the same six systems)
5 Conclusion
In this paper, we proposed a real-time intrusion detection system that includes a
four-layered Hadoop-based IDS architecture, a feature selection algorithm, machine
learning classifiers, and an intrusion detection algorithm with implementation details.
The proposed architecture is composed of various Hadoop master and data nodes, which
process high-speed real-time traffic more efficiently owing to the parallel processing
nature of Hadoop. We evaluated the proposed system by implementing it on a single Hadoop
node using MapReduce programming with various machine learning approaches. The system
generates its best results with the REPTree and J48 ML classifiers using the proposed
features, with an overall accuracy of more than 99 % TP and less than 0.0001 % FP.
Finally, we compared the proposed system with existing solutions with respect to
efficiency in terms of processing time and with respect to accuracy in terms of TP and
FP. The proposed system outperforms
the existing solutions in terms of accuracy and efficiency. The most widely used
intrusion datasets, such as DARPA, KDDCup99, and NSL-KDD, are used for evaluating and
testing the system. Finally, we recommend implementing the proposed system, with the
nine identified features for intrusion detection, on Hadoop using REPTree or J48 for
processing network traffic in a real-time high-speed big data environment.
Acknowledgments This study was supported by the Brain Korea 21 Plus project (SW Human Resource
Development Program for Supporting Smart Life) funded by the Ministry of Education, School of Computer
Science and Engineering, Kyungpook National University, Korea (21A20131600005). This work was also
supported by an Institute for Information and Communication Technology Promotion (IITP) grant funded by
the Korean government (MSIP) [No. 10041145, Self-Organized Software Platform (SoSp) for Welfare
Devices].
References
1. Denning D (1986) An intrusion-detection model. In: IEEE computer society Symposium on research
security and privacy, pp 118–131
2. Denning DE (1987) An intrusion-detection model. IEEE Trans Softw Eng 13(2):222–232.
doi:10.1109/TSE.1987.232894
3. Butun I, Morgera SD, Sankar R (2014) A survey of intrusion detection systems in wireless sensor
networks. IEEE Commun Surv Tutor 16(1):266–282
4. Ngadi M, Abdullah AH, Mandala S (2008) A survey on MANET intrusion detection. Int J Comput
Sci Secur 2(1):1–11
5. Zhang Y, Lee W, Huang YA (2003) Intrusion detection techniques for mobile wireless networks. J
Wirel Netw 9(5):545–556
6. Patcha A, Park JM (2007) An overview of anomaly detection techniques: existing solutions and latest
technological trends. Elsevier J Comput Netw 51(12):3448–3470
7. Puttini R, Hanashiro M, Miziara F, de Sousa R, Garcia-Villalba L, Barenco C (2006) On the anomaly
intrusion-detection in mobile ad hoc network environments. In: Proc. 11th IFIP TC6 international
conference on personal wireless communications. Springer, pp 182–193
8. Engen V (2010) Machine learning for network based intrusion. Ph.D. dissertation, Bournemouth
University, Poole
9. Ofcom (2013) Communications market report 2013 [Online]. http://www.ofcom.org.uk/cmruk/
10. Sagiroglu S, Sinanc D (2013) Big data: a review. In: Collaboration technologies and systems (CTS),
2013 International Conference on. IEEE, pp 42–47
11. Wu X, Zhu X, Wu G-Q, Ding W (2014) Data mining with big data. Knowl Data Eng IEEE Trans
26(1):97–107
12. Pires Jr. WR, de Paula Figueiredo TH, Wong HC, Loureiro AAF (2004) Malicious node detection in
wireless sensor networks. In: Proc. 18th Int. Parallel Distrib. Process. Symp.
13. Rao R, Kesidis G (2003) Detecting malicious packet dropping using statistically regular traffic patterns
in multihop wireless networks that are not bandwidth limited. In: Proc. IEEE GLOBECOM
14. Kayacik HG, Zincir-Heywood AN, Heywood MI (2005) Selecting features for intrusion detection: a
feature relevance analysis on kdd99 intrusion detection datasets. In: Proceedings of the third annual
conference on privacy, security and trust, Citeseer
15. Araujo N, de Oliveira R, Ferreira E-W, Shinoda A, Bhargava B (2010) Identifying important
characteristics in the kdd99 intrusion detection dataset by feature selection using a hybrid approach.
In: IEEE 17th international conference on telecommunications (ICT), pp 552–558. IEEE
16. Kantor P, Muresan G, Roberts F et al (2005) Analysis of three intrusion detection system benchmark
datasets using machine learning algorithms. In: Intelligence and security informatics, sec. 3, p 363.
Springer-Verlag, Berlin, Heidelberg
17. Abbes T, Bouhoula A, Rusinowitch M (2010) Efficient decision tree for protocol analysis in intrusion
detection. Int J Secur Netw 5(4):220–235
18. Wagner C, François J, State R, Engel T (2011) Machine learning approach for IP-flow record anomaly
detection. In: Proc. 10th International IFIP
19. Khan L, Awad M, Thuraisingham B (2007) A new intrusion detection system using support vector
machines and hierarchical clustering. VLDB J 16(4):507–521
20. Schölkopf B, Platt JC, Shawe-Taylor JC, Smola AJ, Williamson RC (2001) Estimating the support of
a high-dimensional distribution. Neural Comput 13(7):1443–1471
21. Muda Z, Yassin W, Sulaiman MN, Udzir NI (2011) A K-means and naive bayes learning approach for
better intrusion detection. Inf Technol J 10(3):648–655
22. Gaddam SR, Phoha VV, Balagani KS (2007) K-Means+ID3: a novel method for supervised anomaly
detection by cascading kmeans clustering and ID3 decision tree learning methods. IEEE Trans Knowl
Data Eng 19(3):345–354
23. Cho SB (2002) Incorporating soft computing techniques into a probabilistic intrusion detection system.
Syst Man Cybern Part C Appl Rev IEEE Trans 32(2):154–160
24. Yu Z, Tsai JJP, Weigert T (2007) An automatically tuning intrusion detection system. Syst Man Cybern
Part B Cybern IEEE Trans 37(2):373–384
25. da Silva AP, Martins M, Rocha B, Loureiro A, Ruiz L, Wong HC (2005) Decentralized intrusion
detection in wireless sensor networks. In: Proc. 1st ACM International workshop on quality of service
and security in wireless and mobile networks (Q2SWinet ’05), pp 16–23. ACM Press
26. Wai FH, Aye YN, James NH (2005) Intrusion detection in wireless ad-hoc networks. CS4274,
Introduction to Mobile Computing, term paper, School of Computing, National University of Singapore
27. Nadkarni K, Mishra A (2003) Intrusion detection in MANETs-the second wall of defense. In: Proc.
29th annual conference of the IEEE industrial electronics society
28. Francisco M-P et al (2011) Network intrusion detection system embedded on a smart sensor. Ind
Electron IEEE Trans 58(3):722–732
29. Sequeira K, Zaki M (2002) ADMIT: anomaly-based data mining for intrusions. In: Proc. eighth ACM
SIGKDD international conference on Knowledge discovery and data mining, pp 386–395. ACM, New
York
30. El-Khatib K (2010) Impact of feature reduction on the efficiency of wireless intrusion detection systems.
Parallel Distrib Syst IEEE Trans 21(8):1143–1149
31. Tan Z, Nagar UT, He X, Nanda P, Liu RP, Wang S, Hu J (2014) Enhancing
big data security with collaborative intrusion detection. Cloud Comput IEEE 1(3):27–33.
doi:10.1109/MCC.2014.53
32. Huang J, Kalbarczyk Z, Nicol DM (2014) Knowledge discovery from big data for intrusion detection
using LDA. In: Big data (BigData Congress), 2014 IEEE international congress on, June 27 2014-July
2 2014, pp 760–761. doi:10.1109/BigData.Congress.2014.111
33. Ahn S-H, Kim N-U, Chung T-M (2014) Big data analysis system concept for detecting unknown
attacks. In: Advanced communication technology (ICACT), 2014 16th International Conference on,
16–19 Feb 2014, pp 269–272. doi:10.1109/ICACT.2014.6778962
34. Marchal S, Jiang X, State R, Engel T (2014) A Big data architecture for large scale security monitoring.
In: Big data (BigData Congress), 2014 IEEE international congress on, June 27 2014–July 2 2014, pp
56–63. doi:10.1109/BigData.Congress.2014.18
35. I.S.T.G. MIT Lincoln Lab (2000) DARPA intrusion detection data sets.
http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/2000data.html
36. KDDcup99 (1999) Knowledge discovery in databases DARPA archive.
http://www.kdd.ics.uci.edu/databases/kddcup99/task.html
37. NSL-KDD (2009) NSL-KDD data set for network-based intrusion detection systems.
http://iscx.cs.unb.ca/NSL-KDD/
38. Al-Jarrah OY et al (2014) Machine-learning-based feature selection techniques for large-scale
network intrusion detection. In: Distributed computing systems workshops (ICDCSW), 2014 IEEE 34th
international conference on. IEEE
39. Engen V (2010) Machine learning for network based intrusion detection. Doctoral dissertation,
Bournemouth University
40. Zaman S, Karray F (2009) Features selection for intrusion detection systems based on support vector
machines. In: Consumer communications and networking conference, 2009. CCNC 2009. 6th IEEE,
pp 1–8
41. Fusco F, Deri L (2010) High speed network traffic analysis with commodity multi-core systems. ACM
IMC 2010
42. Rathore MMU, Paul A, Ahmad A, Chen B, Huang B, Ji W (2015) Real-time big data analytical
architecture for remote sensing application. Sel Top Appl Earth Observations Remote Sens, IEEE
J 8(10):4610–4621. doi:10.1109/JSTARS.2015.2424683
43. Ahmad A, Paul A, Rathore MM (2016) An efficient divide-and-conquer approach for big data analytics
in machine-to-machine communication. Neurocomputing 174:439–453