
International Journal of Computational Intelligence and Information Security December 2014 Vol. 5, No. 9

ISSN: 1837-7823

High Performance Network Intrusion Detection Model Using Graph

D. P. Jeyepalan (1) and E. Kirubakaran (2)
(1) Research Scholar, School of Computer Science, Engineering and Applications,
Bharathidasan University, Tiruchirappalli, Tamilnadu, India.
(2) SSTP (Systems), Bharat Heavy Electricals Ltd., Tiruchirappalli, India.

Abstract

Intrusion detection is a dynamic problem that constantly requires new mechanisms to keep pace with ever-changing
intrusions. Although various methods are available for detecting intrusions, they tend to become outdated very
quickly. What is needed is therefore a flexible system that adapts itself to the prevailing intrusion scenario.
Further, Big Data has emerged as a new challenge, owing to the increase in the amount of data being generated and
the inability of traditional data mining systems to cope with the volume and the velocity of that data. The current
study presents systems that can be used to perform effective intrusion detection in this dynamic environment, and
also discusses future enhancements that can make such a system flexible and effective.
Keywords: Intrusion Detection; Graph Database; Hadoop; Significant Terms Aggregation


Introduction

Intrusion is the attempted or successful illegal access to a computer system or a network. Intrusions have been
attempted (and have sometimes succeeded) ever since the first computers were created, and attempts to evade or stop
these attacks have been in place since the first intrusion attempt [1]. The evolution, however, has been mutual.
As systems evolve to fight off attacks, the attacks themselves grow more powerful in order to compromise those
systems. This cycle has repeated throughout the evolution of computers [2][3]. The intensity and the number of
attacks that an adversary can carry out have increased manifold due to the availability of botnets and similar
facilities. Moreover, these facilities are available online and are rented out for specific periods, which
increases the complexity of creating an Intrusion Detection System (IDS). Until recently, attacks and intrusion
attempts were carried out on a one-to-one basis. Intrusion is now available as a service to anyone interested
(even naive users), and there is no specified limit to the level of attack, since these services are created and
rented to users on an hourly basis [13][14].
In the current generation of high performance computing, the attacks that we witness have shown a tremendous
increase in terms of volume, velocity and variety (variations in attack). Even though the data returned from a
network is semi-structured, its volume and the speed at which it must be processed (velocity) make intrusion
detection a Big Data problem [25]. Hence, by the basic definition of Big Data, it becomes clear that intrusion
detection in the current scenario cannot be handled using traditional algorithms [26].
This paper discusses conventional IDS, the current intrusion scenario in networks and why conventional methods
tend to fail in that scenario, and the mechanisms that can be used for intrusion detection together with their pros
and cons. It finally proposes a graph data structure and a graph based intrusion detection model that facilitates
real time detection in networks.
The remainder of this paper is structured as follows: section II presents the related works dealing with
conventional IDS methods and parallel computation based IDS; section III presents the current demands of an IDS and
the probable architectures that can solve this problem, and discusses their pros and cons. Analysis is carried out
and it is concluded that graph databases work best for the process of intrusion detection. Section IV concludes the
paper.


Related Works


Conventional IDS Methods

It has been observed that intrusion detection has been facilitated and improved by the use of data mining
algorithms and statistical analysis tools. Clustering and classification algorithms have been the major
contributors to Intrusion Detection Systems [4][5]. Further, the field of data mining has also provided various
prediction algorithms to facilitate early detection of intrusions. These intrusion detection applications were
developed as single-threaded applications utilizing the capacity of a single CPU [6]. A graph based clustering
method was proposed in [20]. A graph representing the entire network is created, and the nodes are clustered
according to the transmission history of each node. An added advantage of this method is that it does not bind the
algorithm to a fixed number of clusters or to clusters of a predefined shape. An outlier detection method is then
used to separate out the outliers. The foremost drawback of systems in this category is the large amount of time
taken to complete the detection process. Further, these algorithms are not capable of handling dynamic datasets.
In order to reduce time and provide scalability, hybridized and metaheuristic based algorithms came into existence.
A game theoretic intrusion detection system that also identifies the candidate nodes for IDS deployment is proposed
in [21]. Candidate identification is carried out by clustering the network and determining the cluster heads, which
is performed by the Ant Colony Clustering algorithm. Similar strategies are discussed in [23][24]. All these
methodologies are single threaded. They were designed at an earlier stage and hence do not leverage the parallel
computation capabilities of current processors.

Parallel Computation based IDS

The next approach is therefore to improve performance by speeding up the algorithms through improvements in the
hardware architecture. This becomes possible due to the increase in the processing capabilities of CPUs and the
introduction of GPUs [7-9]. The combined use of CPUs and GPUs was widely expected to enhance processing
capability [10]. The use of the parallel processing nature of GPUs for packet inspection is proposed in [8]. It
leverages the ability of GPUs to perform faster processing in parallel. A similar but enhanced method is proposed in
[9], which performs stream categorization and intrusion detection on a GPU using CUDA. The feasibility of using a
GPU for intrusion detection is discussed in [7], which highlights the limits of GPUs in performing concurrent and
asynchronous processes for detecting intrusions and concludes that several modifications to graphics cards are
needed for efficient use. A probability distribution based pattern recognition method is presented in [10]. This
method detects network anomalies by analyzing measures such as the probability distribution and the relative and
normalized relative entropies. The authors of [10] also compare a serial algorithm on the CPU with its parallel
variants in GPU and MapReduce environments. A similar approach that uses an agent based IDS is described in [22].
Ant Colony Optimization (ACO), being an intrinsically parallel algorithm, performs well in parallel environments.
Experiments were carried out using multi-core CPUs and many-core GPUs, and the results show an increase in
performance when compared to single-threaded applications.
But even this scheme has its own limitations, governed by Amdahl's Law, Gustafson's Law and Moore's Law.
The maximum speedup that can be achieved for a given problem is limited by its sequential part (as stated by
Amdahl's Law [17]). The speedup is also limited by the speed of the storage device in which the required data is
stored, since data transfer rates have not kept pace with the processor improvements described by Moore's Law [15].
By Gustafson's Law [16], further speedup can be achieved only by increasing the number of processors along with the
problem size, and the feasibility of doing so is in question.
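The two speedup laws cited above can be made concrete with a short calculation. The following is a minimal sketch using the textbook forms of Amdahl's Law and Gustafson's Law; the parallel fraction of 0.9 is a hypothetical value chosen for illustration and is not a measurement from this study.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup for a fixed-size problem (Amdahl's Law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def gustafson_speedup(parallel_fraction: float, processors: int) -> float:
    """Scaled speedup when the problem grows with the processor count (Gustafson's Law)."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * processors


if __name__ == "__main__":
    p = 0.9  # hypothetical: 90% of the detection pipeline can run in parallel
    for n in (2, 8, 64, 1024):
        print(n, round(amdahl_speedup(p, n), 2), round(gustafson_speedup(p, n), 2))
```

With a fixed problem size, the Amdahl speedup for this parallel fraction can never exceed 1 / (1 - 0.9) = 10 regardless of how many processors are added, whereas the Gustafson figure keeps growing only because the workload is assumed to grow with the processor count.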
Further, the limited memory capacity of GPUs (1GB to 12GB) becomes a bottleneck when large datasets are to be
used. Even though this problem can be overcome by using secondary storage, the processing speed is then limited by
the data transfer time. Hence, even if patterns of intrusion/anomaly were found, they could not be stored in an
in-memory data structure so that the entire dataset could be searched for similar patterns. The fast constant
memory in a GPU is limited to 64KB and cannot hold all the patterns, while storing the data in the device memory
affects the performance considerably.

Our approach


Current IDS Requirement

Given the current network scenario, an IDS should be capable of handling a huge amount of traffic (that is, it
should be scalable). The algorithm being implemented should be capable of incorporating dynamic data (high and
unpredictable network traffic) and of providing accurate results in the shortest possible time. An online intrusion
detection system is therefore recommended. Online detection involves providing the detection results on the go,
unlike traditional systems, which take time to perform the prediction. This is implemented using a combination of
statistical, machine learning and data mining techniques.

Recommended architectures for IDS : Pros and Cons

The environment in which the applications are run also plays a major role in their performance. As discussed
earlier, the maximum performance that can be extracted from a single piece of hardware is limited; hence distributed
architectures have been proposed to improve the performance of the system. One such architecture is the Hadoop
ecosystem.

Hadoop : MapReduce

Hadoop is a reliable, scalable and distributed computing architecture. The Hadoop Distributed File System (HDFS)
manages the storage, while Hadoop YARN and MapReduce provide a framework for job scheduling, cluster resource
management and parallel processing of the data stored in HDFS. Though Hadoop can store and process a large amount
of data, it is not free of intermediate reads and writes to secondary storage. Writing the intermediate results to
disk introduces considerable latency, which makes it ineffective for the current scenario. A scale-up solution with
a large amount of memory could therefore be preferred for the current application over a cluster of commodity
machines. Since the dataset size is of the order of GBs and not TBs, it is better to adopt an in-memory approach,
such as the one Spark offers on Hadoop Version 2 through its Resilient Distributed Datasets (RDDs). Spark is a fast
and general computation engine for Hadoop. The advantage of Spark is that it provides a simple and expressive
programming model, which supports a wide range of applications, such as ETL, machine learning, stream processing,
and graph computation.
The major drawback of the Hadoop ecosystem is its orientation towards batch processing. An intrusion detection
mechanism requires real time predictions, which involve interactive processing. Hence the use of Hadoop for online
intrusion detection is not recommended.

Significant Terms Aggregation

Significant terms aggregation is the process of analyzing a dataset to determine interesting or unusual occurrences
of a particular item in an itemset. This method identifies the patterns that stand out from the background; in other
words, it identifies the anomalies that do not fit into the regular pattern of the data under study. These terms
are not absolute anomalies. They are more common in the results of certain queries, while their presence in the
entire dataset is scarce. They are generally termed the "uncommonly common" data items.
This method considers two categories of sets: the foreground set and the background set. The foreground set
contains the items of interest, and the background set is the base set used for statistical comparisons. The
background set contains the entire data, while the foreground set contains specific data about the items of
interest. The method works on the basic principle that the most commonly occurring term might not be the actual term
of interest. For example, the most frequently occurring term in any document would be "the", but it is hardly
significant.
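The foreground/background comparison can be sketched in a few lines. The example below is a minimal illustration, not the scoring used by any particular search engine: the score (absolute increase in frequency multiplied by relative increase) is an assumption made here, since no formula is specified in the text, and the sample sentences are invented.

```python
from collections import Counter


def doc_frequency(docs):
    """Fraction of documents in which each term appears at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return {term: n / len(docs) for term, n in counts.items()}


def significant_terms(foreground, background, top=3):
    """Rank terms that are markedly more frequent in the foreground than in the background."""
    fg = doc_frequency(foreground)
    bg = doc_frequency(background)
    scores = {}
    for term, fg_rate in fg.items():
        bg_rate = bg.get(term)
        if not bg_rate or fg_rate <= bg_rate:
            continue  # as common (or more common) in the background: not significant
        # Assumed score: absolute change multiplied by relative change.
        scores[term] = (fg_rate - bg_rate) * (fg_rate / bg_rate)
    return sorted(scores, key=scores.get, reverse=True)[:top]


if __name__ == "__main__":
    background = [
        "the cluster runs the hadoop job",
        "the report is late",
        "the meeting moved to the afternoon",
        "the hadoop cluster stores the bigdata",
        "the lunch menu changed",
        "the budget is approved",
    ]
    # Foreground: the subset of documents returned by a (hypothetical) query.
    foreground = [background[0], background[3]]
    print(significant_terms(foreground, background, top=2))
```

In this sample, "the" occurs in every foreground document but is equally common in the background, so it is discarded, while "cluster" and "hadoop" receive the highest scores.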
An example that analyzes the uncommonly common words in a document set is shown in Figures 1 and 2.
In Figure 1, the x axis is the % of documents containing the word and the y axis is the % of documents containing
the word in a random sample of documents. The words therefore fall along a diagonal connecting the bottom left and
the top right points, i.e., (0,0) and (100,100). The words in the top right corner are the frequently occurring
words. These words are of least help in the analysis, as they are contained in all the documents, while the words
in the bottom left correspond to the rare occurrences. They are the items of interest. In order to determine the
true interestingness of a word, the diagonal is overlaid with a plot whose x axis is the % of documents containing
the word and whose y axis is the % of search results containing the word (Figure 2). The overlaid graph displays the
uncommonly common data towards the top left, which is categorized as the area of interest.

Figure 1: Sample graph depicting common words (Diagonal Construction)
[x axis: % of documents containing the word; y axis: % of random document samples containing the word; plotted
words: the, cluster, bigdata, hadoop]

Figure 2: Sample graph depicting uncommonly common words
[x axis: % of documents containing the word; y axis: % of search results containing the word; plotted words:
hadoop, bigdata, cluster, but; the top left region is marked "Area of uncommonly common terms"]

The above description actually deals with the standard example of text mining. This can be directly mapped to our
scenario of Intrusion Detection by considering every transaction as a record and by plotting the aggregated
transaction result on the graph. Since intrusions are uncommon, they do not explicitly reveal themselves when
analyzed with the normal transactions, i.e., the background data. But when overlaid on the transactions containing
the intrusion records alone, they can be observed in the top left corner, which is the area for uncommonly common
data items.


Graph Databases

A graph database uses nodes and edges to store and represent data. The advantage of using such a structure is that
it provides index free adjacency. Every node holds direct references to its adjacent nodes, so traversal follows
these links directly rather than relying on index based lookups.
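Index free adjacency can be illustrated with a small in-memory structure. The sketch below is plain Python and does not reflect the internals of any graph database; the `Node` class, the `reachable` helper and the node names are all hypothetical and serve only to show traversal by direct reference.

```python
class Node:
    """A graph node that stores direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []  # direct object references, no index involved

    def connect(self, other):
        """Create an undirected edge between this node and `other`."""
        self.neighbours.append(other)
        other.neighbours.append(self)


def reachable(start, max_hops):
    """Names of nodes reachable from `start` within `max_hops` hops, following references only."""
    seen = {start.name}
    frontier = [start]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for neighbour in node.neighbours:
                if neighbour.name not in seen:
                    seen.add(neighbour.name)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return seen


if __name__ == "__main__":
    t1, t2 = Node("Transmission 1"), Node("Transmission 2")
    tcp, http = Node("tcp"), Node("http")
    t1.connect(tcp)
    t2.connect(tcp)
    t2.connect(http)
    print(sorted(reachable(t1, 2)))
```

Each hop costs only the size of the neighbour list being followed, independent of the total number of nodes, which is the property the text refers to.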
A further advantage of using a graph database is that it provides a partition free solution. Storing the data on
Hadoop inevitably leads to partitioning the data between systems. This leads to inaccurate solutions when
performing operations such as clustering or classification, which require the availability of the entire data
rather than a subset of it.
Graph databases were found to be a suitable structure for IDS applications because of the very nature of the
problem and because graph databases work well in scale-up configurations, although they are less suitable for
scale-out environments. A graph DB uses the entire graph data for processing, and hence allows us to match patterns
of intrusion/anomaly against an entire dataset of millions of nodes in the order of milliseconds, as opposed to
other mechanisms, which take seconds or even minutes depending on the dataset size. Graph DBs are optimized for such
pattern matching, and anomaly detection is a classic example of it.
The current modeling technique uses the NSL-KDD dataset for modeling graphs. The KDD CUP 99 [19] dataset has been
one of the most widely used benchmarks for anomaly detection (classification). It contains raw TCP dump data
covering nine weeks of observations, obtained from Lincoln Labs. It contains four categories of attacks (DoS, R2L,
U2R and probing), comprising 24 training attack types. Analysis of the KDD CUP 99 dataset revealed major drawbacks
[18][27]. The dataset contains a very large number of records with a high redundancy rate: about 78% of the records
are duplicated in the training set and about 75% in the test set. This biases learning algorithms towards the most
frequently occurring records, while infrequent records tend to be neglected, and evaluation results become biased
as a consequence. The NSL-KDD dataset was developed to overcome these issues and to provide a functional dataset
for researchers. The shortcomings of the KDD dataset that NSL-KDD overcomes are:

Redundancy in the training set and the test set is eliminated.

The count of the selected records from each difficulty level is inversely proportional to the percentage of
records in the original data set.

Reduction in the number of records in the training and the test sets

These justify the use of NSL KDD dataset for our study.

Proposed IDS Architecture

Extensive research has been carried out in the area of intrusion detection on the available network data using
standard algorithms, modified forms of those algorithms or hybridized forms of them. The use of a different data
structure, however, has not yet been explored. Hence the authors propose a graph database for processing intrusion
data. The problem of intrusion detection is modeled as a Neo4j property graph. A property graph is one that has
properties associated with its vertices or edges. The graph is created by considering every transmission as a node
and by taking the transaction properties as the properties of the node. Querying is performed naturally by means of
the Cypher query language that comes as part of the Neo4j Graph Database.

Graph Model

The initial phase of the detection process deals with grouping the nodes according to their properties. Similar
nodes are clustered such that one cluster contains normal transmission data, while each of the other clusters
contains data pertaining to a specific attack. Since we use a graph database, these tasks are accomplished through
queries and the results can be directly visualized. The query language used varies with the graph database.
Since we use the KDD CUP 99 dataset, the training data is already labelled; hence the attack property alone is
sufficient for grouping the data. For datasets without labels, clustering can be performed by considering all the
properties (numerical and nominal; strings are ignored), measuring the distance between nodes and grouping the
nodes that are nearest to each other.
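For the unlabelled case, a distance over mixed numerical and nominal properties is required. The sketch below uses a Gower-style average (range-normalised difference for numerical properties, 0/1 mismatch for nominal ones); this particular measure is an assumption made for illustration, as no distance function is specified in the text. The property names follow KDD-style feature names, and the value ranges are hypothetical.

```python
def mixed_distance(a, b, numeric_ranges, nominal_keys):
    """Gower-style distance in [0, 1] between two records given as dicts."""
    total = 0.0
    for key, (low, high) in numeric_ranges.items():
        span = high - low
        # Numerical property: absolute difference scaled by the value range.
        total += abs(a[key] - b[key]) / span if span else 0.0
    for key in nominal_keys:
        # Nominal property: 0 when the categories match, 1 otherwise.
        total += 0.0 if a[key] == b[key] else 1.0
    return total / (len(numeric_ranges) + len(nominal_keys))


def nearest(record, others, numeric_ranges, nominal_keys):
    """Return the record in `others` closest to `record`."""
    return min(others, key=lambda o: mixed_distance(record, o, numeric_ranges, nominal_keys))


if __name__ == "__main__":
    ranges = {"duration": (0, 100), "src_bytes": (0, 1000)}  # hypothetical ranges
    nominal = ["protocol_type", "service"]
    a = {"duration": 0, "src_bytes": 200, "protocol_type": "tcp", "service": "http"}
    b = {"duration": 0, "src_bytes": 250, "protocol_type": "tcp", "service": "http"}
    c = {"duration": 50, "src_bytes": 0, "protocol_type": "udp", "service": "domain_u"}
    print(mixed_distance(a, b, ranges, nominal), mixed_distance(a, c, ranges, nominal))
```

Records whose pairwise distance falls below a chosen threshold can then be placed in the same group; the threshold itself would have to be tuned for the dataset.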



Figure 3: Sample Process Graph
[Transmission 1 to Transmission 4 are drawn as nodes, each linked by "is of ProtocolType", "is of ServiceType" and
"is of AttackType" edges to shared property nodes; the attack type nodes that remain legible are "ftp write" and
"perl"]

The advantage of using a graph database is that interactive querying is made possible, in contrast to the static
nature of the existing approaches to intrusion detection, especially those based on data mining and machine
learning. Hence if new data is to be added to the system, all we need to do is to run the query that is used for
cluster creation, and the system is updated with the new data.
Classification deals with binding an element to a category by analyzing similarities. This is the process that is
carried out as soon as a transmission arrives. It is based on all the properties present in the node. The
properties of the current transmission are compared to the aggregated properties of each cluster. The cluster with
the minimum difference in properties is selected as the destination cluster, and its label is added to the current
transaction. This label marks the transaction as legitimate or anomalous. The typical running time for a query in
a graph database is of the order of milliseconds. Thus the database update is fast and results are obtained in real
time.
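The comparison against aggregated cluster properties can be sketched as follows. This is a minimal illustration under stated assumptions: the aggregate is taken to be the per-cluster mean of each numerical property and the difference is taken to be the squared Euclidean distance, neither of which is prescribed by the text. The records and values are hypothetical; "ftp_write" is used as an example attack label.

```python
from collections import defaultdict


def aggregate_clusters(labelled, keys):
    """Mean of each numerical property per cluster label (assumed aggregate)."""
    sums = defaultdict(lambda: [0.0] * len(keys))
    counts = defaultdict(int)
    for record, label in labelled:
        counts[label] += 1
        for i, key in enumerate(keys):
            sums[label][i] += record[key]
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def classify(record, aggregates, keys):
    """Label of the cluster whose aggregated properties differ least from the record."""
    def difference(label):
        return sum((record[k] - c) ** 2 for k, c in zip(keys, aggregates[label]))
    return min(aggregates, key=difference)


if __name__ == "__main__":
    keys = ["duration", "src_bytes"]
    training = [
        ({"duration": 0, "src_bytes": 200}, "normal"),
        ({"duration": 2, "src_bytes": 300}, "normal"),
        ({"duration": 30, "src_bytes": 900}, "ftp_write"),
        ({"duration": 40, "src_bytes": 1100}, "ftp_write"),
    ]
    aggregates = aggregate_clusters(training, keys)
    print(classify({"duration": 1, "src_bytes": 240}, aggregates, keys))
```

In practice the properties would need to be normalised before the difference is computed, otherwise properties with large numeric ranges dominate the result.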
Another major advantage of this approach is that it facilitates online prediction. As soon as a transaction is
recorded, it can be added as a node in the graph and a query can be run to assign it to the cluster to which it
belongs. This query returns whether the transaction is normal or anomalous.
Figure 3 shows a simple process graph for the KDD Cup 99 dataset. It can be observed that all the transmissions
and transmission properties are depicted as nodes. The edges connect each transmission node to its corresponding
property nodes. All nominal properties are constructed as vertices with the corresponding property labels, while
numerical properties are encoded as attributes of the transaction node. A similar graph is constructed for the
entire KDD Cup 99 dataset, containing 10.5 million records with 42 properties including the attack type.
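The construction described above can be sketched with plain Python data structures standing in for the graph database. The function name `build_process_graph`, the node identifiers and the sample records are hypothetical; only the modelling rule (nominal properties become shared nodes, numerical properties become attributes of the transmission node) is taken from the text.

```python
def build_process_graph(records, nominal_keys):
    """Build (nodes, edges) where nominal properties are shared nodes and numerical ones are attributes."""
    nodes = {}  # node id -> dict of attributes
    edges = []  # (transmission id, relationship, property node id)
    for i, record in enumerate(records, start=1):
        t_id = f"Transmission {i}"
        # Numerical properties stay on the transmission node itself.
        nodes[t_id] = {k: v for k, v in record.items() if k not in nominal_keys}
        for key in nominal_keys:
            p_id = f"{key}:{record[key]}"
            # A property node is created once and shared by all transmissions with that value.
            nodes.setdefault(p_id, {"label": key, "value": record[key]})
            edges.append((t_id, f"is of {key}", p_id))
    return nodes, edges


if __name__ == "__main__":
    nominal = ["ProtocolType", "ServiceType", "AttackType"]
    records = [
        {"duration": 0, "ProtocolType": "tcp", "ServiceType": "http", "AttackType": "normal"},
        {"duration": 5, "ProtocolType": "tcp", "ServiceType": "ftp", "AttackType": "ftp_write"},
    ]
    nodes, edges = build_process_graph(records, nominal)
    for edge in edges:
        print(edge)
```

Because both sample transmissions use "tcp", they end up connected to the same protocol node, which is what allows all transmissions sharing a property to be reached by a single traversal.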
Data mining approaches allow us to visualize the elements only at the end, whereas a graph DB provides the
necessary visualizations without any additional effort. Using the graph DB and the associated query language, we
can perform interactive analysis and querying, which allows us to further refine the results. The results can also
be filtered immediately without any additional processing, hence providing deeper insights. Performing these
operations on the obtained results is not very feasible with a machine learning approach.
A further advantage of this approach is that graph visualization can be used to manually browse through the
various clusters. This can provide a much better understanding of the problem at hand, and it can also reveal
hidden results that were not found by earlier approaches that work directly on the data. It is also possible to
mark and analyze areas of interest that are identified as containing too many outliers. Several visualization tools
are available for Big Data, and for graph data in particular, and these can be used for effective identification
and real time analysis of the obtained results. Some of the graph visualization tools available for Big Data are
Graphviz, Walrus and Gephi. Since our current modelling uses Neo4j as the graph DB, the inbuilt Neo4j Browser can
function as the visualization tool. This is a powerful and customizable data visualization tool that is based on
the D3.js library. If additional assistance in visualization is needed, SVG based graph interaction, KeyLines Neo4j
graph visualization or Tom Sawyer Perspectives can be utilized.


Conclusion

This paper presents several recent technologies that can be utilized for intrusion detection. This becomes a
necessity because, in the current scenario, network data has become so large that it is classified as Big Data.
Hence traditional data processing systems fail to work on this data. Due to the presence of the velocity component
along with the volume, it becomes mandatory for the processing system to be storage efficient and fast.
In this paper, we discuss three candidate approaches: Hadoop, Elasticsearch and graph databases. After analysis it
was found that various shortcomings exist in the architectures of Hadoop and Elasticsearch (with respect to our
application). Hence we zero in on graph databases and elaborate on their efficiency in processing graph structures.
Graph DB performance could be further improved by creating custom User Defined Functions (query functions) that are
executed on GPUs rather than on CPUs. The number crunching work could be done on the GPUs, thereby reducing the load
on the CPUs to a considerable extent. GPU based algorithms for graph processing could be considered for further
optimization of the problem solution.






References

[1] K. Jackson, (1989), "Viewgraphs on Intrusion Detection: User Authentication Profiles at Los Alamos", Los
Alamos National Laboratory.

[2] H. S. Javitz, A. Valdes, D. E. Denning and P. G. Neumann, (1986), "Analytical Techniques Development for a
Statistical Intrusion Detection System (SIDS) based on Accounting Records", SRI International.

[3] T. F. Lunt, (1988), "Automated Audit Trail Analysis and Intrusion Detection: A Survey", Proceedings of
the 11th National Computer Security Conference.

[4] Shieh, Shiuhpyng Winston, and Virgil D. Gligor, (1991), "A pattern-oriented intrusion-detection model and
its applications", Research in Security and Privacy, Proceedings. 1991 IEEE Computer Society Symposium
on. IEEE.

[5] Lunt, Teresa F., et al., (1989), "Knowledge-based intrusion detection", AI Systems in Government
Conference, Proceedings of the Annual. IEEE.

[6] Snapp, Steven, et al., (1991), "A system for distributed intrusion detection", COMPCON Spring '91: 170-176.

[7] Riedmüller, R., et al., (2010), "Constraints on autonomous use of standard GPU components for
asynchronous observations and intrusion detection", Security and Communication Networks (IWSCN),
2010 2nd International Workshop on. IEEE.

[8] Huang, Nen-Fu, et al., (2008), "A GPU-based multiple-pattern matching algorithm for network intrusion
detection systems", Advanced Information Networking and Applications-Workshops, AINAW 2008. 22nd
International Conference on. IEEE.

[9] Khabbazian, M. H., Eslami, H., Totoni, E., Khadem, A., (2010), "High-throughput stream categorization
and intrusion detection on GPU", Formal Methods and Models for Codesign (MEMOCODE), 8th
IEEE/ACM International Conference on, 81-84.

[10] Quan Qian, Hongyi Che, Rui Zhang, Mingjun Xin, (2010), "The Comparison of the Relative Entropy for
Intrusion Detection on CPU and GPU", Computer and Information Science (ICIS), IEEE/ACIS 9th
International Conference on, 141-146.

[11] Wu, Chengkun, et al., (2009), "An efficient pre-filtering mechanism for parallel intrusion detection based
on many-core GPU", Security Technology. Springer Berlin Heidelberg, 298-305.

[12] Vokorokos, Liberios, Michal Ennert, and Ján Radušovský, (2014), "A Survey of parallel intrusion detection
on graphical processors", Central European Journal of Computer Science 4.4: 222-230.

[13] Tiirmaa-Klaar, Heli, et al., (2013), "Botnets: How to Fight the Ever-Growing Threat on a Technical Level",
Botnets. Springer London, 41-97.

[14] Silva, Sérgio S. C., et al., (2013), "Botnets: A survey", Computer Networks 57.2: 378-403.

[15] Fairchild's Director of R & D, (2007), "'Moore's Law' Predicts the Future of Integrated Circuits", Computer
History Museum, Retrieved 2009-03-19.

[16] John L. Gustafson, (1988), "Reevaluating Amdahl's Law", Communications of the ACM 31(5), pp. 532-533.

[17] Rodgers, David P., (June 1985), "Improvements in multiprocessor system design", ACM SIGARCH
Computer Architecture News (New York, NY, USA: ACM) 13 (3): 225-231.
doi:10.1145/327070.327215. ISBN 0-8186-0634-7. ISSN 0163-5964.

[18] Tavallaee, Mahbod, et al., (2009), "A detailed analysis of the KDD CUP 99 data set", Proceedings of the
Second IEEE Symposium on Computational Intelligence for Security and Defence Applications.

[19] Referred on: 30 October 2014



[20] D. P. Jeyepalan, E. Kirubakaran, (April 2013), "A Novel Graph Based Clustering Approach for Network
Intrusion Detection", International Journal of Computational Intelligence and Information Security, Vol. 4
No. 4, ISSN: 1837-7823.

[21] D. P. Jeyepalan, E. Kirubakaran, (2014), "A Co-operative Game Theoretic Approach to Improve the
Intrusion Detection System in a Network using Ant Colony Clustering", International Journal of Computer
Applications, Volume 87 - Number 14.

[22] D. P. Jeyepalan, E. Kirubakaran, (2014), "Agent Based Parallelized Intrusion Detection System Using Ant
Colony Optimization", International Journal of Computer Applications (0975-8887), Volume 105 -
Number 10.

[23] Marinakis, Yannis, et al., (2011), "A hybrid ACO-GRASP algorithm for clustering analysis", Annals of
Operations Research 188.1: 343-358.

[24] Ganapathy, Sannasi, et al., (2013), "Intelligent feature selection and classification techniques for intrusion
detection in networks: a survey", EURASIP Journal on Wireless Communications and Networking 2013.1:

[25] daCosta, Francis, (2013), "Small Data, Big Data, and Human Interaction", Rethinking
the Internet of Things: A Scalable Approach to Connecting Everything: 77-94.

[26] Feifei Li, Suman Nath, (2014), "Scalable data summarization on big data", Distributed and Parallel
Databases, 32(3). DOI: 10.1007/s10619-014-7145-y.

[27] J. McHugh, (2000), "Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion
detection system evaluations as performed by Lincoln Laboratory", ACM Transactions on Information and
System Security, vol. 3, no. 4, pp. 262-294.