
World Congress on Software Engineering

The Design and Implementation of a Distributed Network Intrusion Detection System Based on Data Mining

Desheng Fu, Shu Zhou, Ping Guo


College of Computer and Software, Nanjing University of Information Science and Technology,
Nanjing 210044, China

Abstract

Data mining is applied to intrusion detection. This paper puts forward a system model based on data mining, improves the FP-Growth algorithm used for association analysis, and refines an FCM-based network intrusion detection technique that uses statistical binning. Experimental results show that the resulting network intrusion detection system runs stably on Ethernet, finds intrusion activities in time, effectively addresses the speed of data mining, and improves both detection capability and overall detection performance.

Keywords: intrusion detection system; distribution; data mining

1. Introduction

Most commercial distributed intrusion detection systems (IDS) today rely on matching traffic against rules derived from known intrusion activities. Detection engines are distributed across the networks or hosts to be monitored and detect intrusions independently. The central IDS management platform is responsible only for platform configuration. The way the detection engines are managed, and the way their results are reported, show a lack of cooperative analysis of the logged data. At the same time, the network IDS, firewall, and antivirus software work independently, which makes it difficult to reach a proper decision when facing complicated attacks. Researching and developing a distributed network IDS based on data mining is therefore of both theoretical significance and practical value.

Positive progress has been made, at home and abroad, in applying data mining to IDS, and some IDS models based on data mining have been built. However, these models are usually research or applications based on a single kind of method. In fact, different data mining techniques suit different intrusion detection tasks. For example, association analysis can extract the associative characteristics of an attacker's intrusion activities; sequence pattern analysis can find the sequential relationships among those activities; and data classification, which is usually used to support other data mining methods in IDS, handles preprocessing and post-processing. In addition, each data mining method has its limitations. Although applying many data mining methods to IDS has brought remarkable progress, problems remain: high false alarm rates, high missed alarm rates, and poor real-time performance.

Combining several data mining methods works better than using a single one. For example, applying cluster analysis to preprocess the input of association analysis, and classifying after the rules are produced, remarkably improves the effect of association analysis. Merging other data mining methods with rough set theory, genetic algorithms, and biologically inspired immune algorithms will improve the generality, real-time performance, and reliability of IDS.

This paper puts forward a distributed network IDS model based on data mining, discusses the algorithms, and analyzes the corresponding experimental results.

2. System model

The distributed IDS based on data mining is shown in Figure 2.1. It has three layers: the local layer, the network layer, and the system layer. The local layer contains the data acquisition analyzer and the data mining detector; the network layer contains the alarm optimizer; the system layer contains the logger and the central control platform. The data acquisition analyzer and data mining detector are deployed at key nodes of the LAN, where they monitor the data of the whole network segment, respond to intrusion activities, and send the alarm

978-0-7695-3570-8/09 $25.00 © 2009 IEEE 447


446
DOI 10.1109/WCSE.2009.225

Authorized licensed use limited to: Gudlavalleru Engineering College. Downloaded on November 24, 2009 at 22:40 from IEEE Xplore. Restrictions apply.
message to the alarm optimizer. The alarm optimizer is located in the WAN; it collects all alarm messages sent by the detection engines in the LAN and stores them in the logger. The central control platform sits in the system layer and provides the administrator with a friendly, visual control interface.

Figure 2.1 System general structure model
[Figure not reproduced. Recoverable labels: System layer (log pool, central control platform); Network layer (alarm optimizer, with modules for reading alarm messages and generating the format file); Local layer (data acquisition analyzer and data mining detector, comprising data acquisition, data preprocessing, data analyzer, data mining, detection, and alarm modules); input: network data packets.]

3. Improved FP-Growth algorithm

System data auditing uses an improved FP-Growth (frequent-pattern growth) algorithm.

The FP-Growth algorithm was put forward by Jiawei Han, Jian Pei and Yiwen Yin. It mines frequent patterns by pattern growth and does not need to generate a candidate set. It adopts a divide-and-conquer strategy: the database that supplies the frequent itemsets is compressed into a frequent-pattern tree (FP-tree) that preserves the association information; the compressed database is then divided into a group of conditional databases, each associated with one frequent item; and each conditional database is mined separately.

When mining frequent patterns, FP-Growth recursively generates conditional FP-trees: each time a frequent pattern is produced, a conditional FP-tree is generated. Under a low minimum support, even a very small database can produce hundreds of thousands of frequent patterns. Dynamically generating and releasing hundreds of thousands of FP-trees consumes a great deal of time and space. In addition, the FP-tree and the conditional FP-trees are built top-down, whereas frequent-pattern mining proceeds bottom-up. FP-Growth therefore does not have high space-time efficiency and needs improving.

This paper improves the FP-Growth algorithm by introducing a single-chain structure, the polymeric chain. The improved FP-tree is unidirectional: each node keeps only a pointer to its parent node, which saves tree space. The trace information of the different nodes that hold the same item is compressed into a polymeric chain, which avoids generating node chains and conditional pattern bases and so remarkably improves mining efficiency.

Experiments show that, for the same minimum support, the improved algorithm takes less time than FP-Growth, and that its advantage becomes more pronounced as the minimum support is reduced. As the minimum support falls, the number of node chains and conditional pattern bases that must be generated grows quickly, and FP-Growth spends much of its time generating them. The improved FP-Growth algorithm is more efficient and is well suited to a real-time network IDS.

4. FCM network intrusion detection technique based on statistical binning

The traditional fuzzy c-means (FCM) algorithm partitions data into clusters without guidance or supervision. It is prone to being trapped at a local extremum or saddle point, so it may yield neither an optimal solution nor even a satisfactory one. Moreover, for large volumes of data, such as network connection records, the clustering centers must be updated frequently, which makes the algorithm time-consuming. An FCM algorithm based on statistical binning is therefore proposed. Using the partition of known, labeled clusters to decide the type of new data

records, it not only avoids being trapped at a local extremum or saddle point, but also decides from the binning whether the cluster center needs updating. This removes the need to update the cluster center frequently and improves the speed of data processing.

The traditional FCM algorithm uses the membership function $\mu_{ki}$ to express the degree to which data record $x_k$ belongs to the clustering subset $X_i$ ($1 \le i \le c$), and uses the maximum membership principle to decide which cluster a record belongs to. Here, the judgment is made with a fuzzy approximation degree, as follows:

$$T_i = 1 - \frac{d_i}{D_i} \tag{4-1}$$

$T_i$ is computed for each known cluster. If $\min\{T_i\}$ is within the threshold range of the corresponding cluster, that cluster is taken as the associated class of the new data record; if $\min\{T_i\}$ is not within the threshold range of the corresponding cluster, the record is processed as a new class.

When a new network connection record is found to belong to some cluster, the clustering center is not updated immediately, as it would be in earlier clustering methods. Instead, the distance $d_i$ from the record to the associated cluster is logged and compared with the binning of that cluster; the clustering center of the cluster is updated only when the binning needs updating. This solves the problem of frequent clustering center updates, which earlier clustering methods cannot avoid.

The binning partitions differ between clusters: some are continuous, while others contain gaps of varying width. For continuous binning segments, updates are required at the end of the binning; for binning segments with gaps, updates occur at the bottom of the binning.

When the distance $d_{new}$ between a data record and the clustering center falls into some bin, there are two cases:

① If $d_{new} < min - scale$, the bottom of the bin is updated as $min_{new} = d_{new}$, and the bin average is updated with formula (4-2).

② If $d_{new} > max + scale$, the top of the bin is updated as $max_{new} = d_{new}$, and the bin average is updated with formula (4-2).

$$avg_{new} = \frac{avg_{old}\, S_{old} + d_{new}}{T_{old} + 1} \tag{4-2}$$

Here $S$ is the binning capacity in the table of clustering distance distribution, and $T$ is the total binning capacity ("totality") of the cluster in the table of binning ratio scale.

Binning capacity:

$$S_{new} = S_{old} + 1 \tag{4-3}$$

Total binning capacity:

$$T_{new} = T_{old} + 1 \tag{4-4}$$

The clustering center is made up of a discrete attribute vector and a continuous attribute vector. There are two cases for updating the discrete attribute vector:

① The discrete value $i$ of discrete attribute $m$ in the new data record is in the set $I$ of existing discrete values, that is, $i \in I$. The updated value $p_i'$ of the original probability statistic $p_i$ of discrete value $i$ is given by formula (4-5):

$$p_i' = \frac{p_i\, T_{old} + 1}{T_{old} + 1} \tag{4-5}$$

The updated value of the original probability statistic $p_j$ of every other discrete value $j$ of discrete attribute $m$ is given by formula (4-6):

$$p_j' = \frac{p_j\, T_{old}}{T_{old} + 1} \tag{4-6}$$

② The discrete value $i$ of discrete attribute $m$ in the new data record is not in the set $I$ of existing discrete values, that is, $i \notin I$. The probability statistic $p_i$ of the new discrete value $i$ is given by formula (4-7):

$$p_i = \frac{1}{1 + T_{old}} \tag{4-7}$$

The updated value of the original probability statistic $p_j$ of each existing discrete value $j$ of discrete attribute $m$ is given by formula (4-8):

$$p_j' = \frac{p_j}{1 + T_{old}} \tag{4-8}$$

The updated value of the center of the continuous attribute vector is given by formula (4-9).

$$Y_p' = \frac{Y_p\, T_{old} + X_p}{1 + T_{old}} \tag{4-9}$$

Here $Y_p$ is the value of continuous attribute $p$ of a clustering center $C$, with $11 \le p \le 32$, and $X_p$ is the value of continuous attribute $p$ of the data record.

If a new network connection record does not belong to any existing cluster, a new cluster and its corresponding binning are created from that record.

5. Experimental analysis

5.1. Experimental data set

The network intrusion detection data set from KDDCup99[1] is used for the experiment. With the assistance of the Defense Advanced Research Projects Agency, the MIT Lincoln Laboratory built an experimental network modeled on the LAN structure of the US Air Force. While imitating normal use of the network, they deliberately carried out four types of attack: DoS, R2L, U2R and Probing. The recorded data, such as traffic logs and host file system images, were given to the IDS taking part in the evaluation for off-line analysis. The training data set contains five million connection records and the testing data set contains two million. Each sample has forty-one attributes that describe a network connection, covering basic features, content features and traffic statistics. The data set contains labeled training data and unlabeled testing data. The training data has one normal label ("normal") and twenty-two attack types. In addition, fourteen attack types exist only in the testing data set.

5.2. Experimental results and analysis

From the kddcup_data_10percent file of the data set described in Section 5.1, three groups of data are formed to imitate three situations in a real network environment: few, a medium number of, and many attack activities. The composition of the training set and testing set in each group is shown in Table 5.1.

Table 5.1 Composition of the data set of sample

Group  Training/Testing  Totality  Normal  DoS  R2L  U2R  Probing
1      Training          2000      1960    24   10   2    4
1      Testing           2500      2430    42   18   4    6
2      Training          2500      2310    114  47   10   19
2      Testing           3000      2770    138  57   12   23
3      Training          3000      2460    324  135  27   54
3      Testing           3500      2870    378  157  31   64

(DoS, R2L, U2R and Probing are the abnormal classes.)

Using the data in Table 5.1, the data mining module based on the improved FP-Growth algorithm is trained on each of the three training sets; the abnormal patterns are extracted and the rule base is formed. The experiments are then carried out on the testing sets. The results are shown in Table 5.2.

Table 5.2 Experimental results of data mining based on the improved FP-Growth algorithm

Group  Detection Ratio  False Positive Ratio  False Negative Ratio  Training Time (second)
1      96.01%           3.71%                 14.28%                10.31
2      94.93%           4.33%                 13.91%                12.69
3      93.34%           5.85%                 10.31%                15.02

The experimental results of the FP-Growth algorithm without improvement are shown in Table 5.3.

Table 5.3 Experimental results of the FP-Growth algorithm without improvement

Group  Detection Ratio  False Positive Ratio  False Negative Ratio  Training Time (second)
1      94.88%           4.52%                 25.71%                20.25
2      93.73%           5.05%                 20.86%                24.94
3      91.74%           6.79%                 14.92%                31.16

Comparing Table 5.2 with Table 5.3, we conclude that:

(1) The training times of the three groups are reduced by 49.09%, 49.12% and 51.80% respectively compared with the FP-Growth algorithm without improvement, which shows that the improved FP-Growth algorithm greatly increases detection efficiency.

(2) For each group of data, with the improved FP-Growth algorithm the detection ratio increases by 1.19% to 1.75%, the false positive ratio is reduced by 13.84% to 17.52%, and the false negative ratio is reduced by 30.90% to 44.46%, all relative to the algorithm without improvement. The detection performance has improved greatly.

6. Conclusions

Applying data mining to intrusion detection systems is an important direction in IDS research. This paper presents an improved association analysis algorithm based on FP-Growth and an FCM network intrusion detection technique based on statistical binning. Applied in a NIDS, they increase mining speed, strengthen the detection performance of the IDS, and give system administrators a more solid foundation for network maintenance and support.

7. References

[1] http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. KDD Cup 1999 Data.
[2] Guan Jian, Study on Method of Data Analysis and Its Correlative Technologies in Intrusion Detection Systems, A Dissertation for the Degree of D. Eng, Harbin Engineering University, 2004.
[3] Shi Zhicai, Ji Zhenzhou, Hu Mingzeng, Research on Distributed Network Intrusion Detection Techniques, Computer Engineering, 2005, Vol.31, No.13.
[4] Rajeev Gopalakrishna, Eugene H. Spafford, A Framework for Distributed Intrusion Detection using Interest Driven Cooperating Agents, Department of Computer Science, Purdue University, May 2001.
[5] Fayyad U M, Piatetsky-Shapiro G, Smyth P, Advances in knowledge discovery and data mining, California: AAAI/MIT Press, 1996.
[6] S. Ramaswamy, R. Rastogi, K. Shim, Efficient algorithms for mining outliers from large data sets, Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 2000, pp. 427–438.
[7] Jiawei Han, Sonny H. S. Chee, Jenny Y. Chiang, Issues for On-Line Analytical Mining of Data Warehouses.
[8] Andreas Fuchsberger, Intrusion Detection Systems and Intrusion Prevention Systems, Information Security Technical Report 2005, 10:134-139.
[9] Kim G H, Spafford E H, Experiences with tripwire: Using integrity checkers for intrusion detection[R], West Lafayette, USA: Purdue University, Department of Computer Sciences, 1994.
[10] Lee W, Stolfo S J, Chan P K, et al, Real time data mining-based intrusion detection[A], Proceedings of 2nd DARPA Information Survivability Conference and Exposition (DISCEX).
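
As an illustration of the cluster-center updates described in Section 4, the following is a minimal Python sketch of formulas (4-5) to (4-9). It is not the authors' implementation. The names `ClusterCenter` and `update_center` are hypothetical, each discrete attribute is assumed to be stored as a dictionary from value to probability, and the binning test that decides when an update is triggered is left out. One further assumption: formula (4-8) as printed has only p_j in the numerator, whereas the sketch rescales existing values by T_old / (T_old + 1) as in formula (4-6), so that the probabilities of one attribute keep summing to one.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClusterCenter:
    """Running summary of one cluster (illustrative structure, not from the paper)."""
    total: int                        # T_old: number of records absorbed so far
    discrete: List[Dict[str, float]]  # per discrete attribute: value -> probability p_i
    continuous: List[float]           # Y_p for each continuous attribute


def update_center(center: ClusterCenter,
                  discrete_values: List[str],
                  continuous_values: List[float]) -> ClusterCenter:
    """Fold one new record into the cluster center and return the updated center."""
    t_old = center.total
    scale = t_old / (t_old + 1)

    new_discrete = []
    for probs, value in zip(center.discrete, discrete_values):
        # Existing values shrink by T_old / (T_old + 1), as in formula (4-6).
        # ASSUMPTION: the same factor is used where the paper prints (4-8).
        updated = {v: p * scale for v, p in probs.items()}
        # The observed value gains 1 / (T_old + 1). For a known value this gives
        # (p_i * T_old + 1) / (T_old + 1), formula (4-5); for an unseen value it
        # gives 1 / (1 + T_old), formula (4-7).
        updated[value] = updated.get(value, 0.0) + 1.0 / (t_old + 1)
        new_discrete.append(updated)

    # Formula (4-9): running mean of each continuous attribute.
    new_continuous = [(y * t_old + x) / (t_old + 1)
                      for y, x in zip(center.continuous, continuous_values)]

    return ClusterCenter(t_old + 1, new_discrete, new_continuous)
```

For example, a center built from three records with protocol probabilities {tcp: 2/3, udp: 1/3} moves to {tcp: 3/4, udp: 1/4} after absorbing one more tcp record, which matches what recounting the four records from scratch would give. Returning a new `ClusterCenter` rather than mutating the old one is a choice made here only to keep the example easy to check.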
