
Layered Approach Using Conditional Random Fields for Intrusion Detection


Kapil Kumar Gupta, Baikunth Nath, Senior Member, IEEE, and
Ramamohanarao Kotagiri, Member, IEEE
Abstract: Intrusion detection faces a number of challenges; an intrusion detection system must reliably detect malicious activities
in a network and must perform efficiently to cope with the large amount of network traffic. In this paper, we address these two issues
of Accuracy and Efficiency using Conditional Random Fields and Layered Approach. We demonstrate that high attack detection
accuracy can be achieved by using Conditional Random Fields and high efficiency by implementing the Layered Approach.
Experimental results on the benchmark KDD 99 intrusion data set show that our proposed system based on Layered Conditional
Random Fields outperforms other well-known methods such as the decision trees and the naive Bayes. The improvement in attack
detection accuracy is very high, particularly, for the U2R attacks (34.8 percent improvement) and the R2L attacks (34.5 percent
improvement). Statistical Tests also demonstrate higher confidence in detection accuracy for our method. Finally, we show that our
system is robust and is able to handle noisy data without compromising performance.
Index Terms: Intrusion detection, Layered Approach, Conditional Random Fields, network security, decision trees, naive Bayes.

1 INTRODUCTION
Intrusion detection, as defined by the SysAdmin, Audit,
Networking, and Security (SANS) Institute, is the art of
detecting inappropriate, inaccurate, or anomalous activity
[6]. Today, intrusion detection is one of the high-priority and
challenging tasks for network administrators and security
professionals. More sophisticated security tools mean that the
attackers come up with newer and more advanced penetra-
tion methods to defeat the installed security systems [4] and
[24]. Thus, there is a need to safeguard the networks from
known vulnerabilities and at the same time take steps to
detect new and unseen, but possible, system abuses by
developing more reliable and efficient intrusion detection
systems. Any intrusion detection system has some inherent
requirements. Its prime purpose is to detect as many attacks
as possible with minimum number of false alarms, i.e., the
system must be accurate in detecting attacks. However, an
accurate system that cannot handle large amount of network
traffic and is slow in decision making will not fulfill the
purpose of an intrusion detection system. We desire a system
that detects most of the attacks, gives very few false alarms,
copes with large amount of data, and is fast enough to make
real-time decisions.
Intrusion detection started around the 1980s after the
influential paper from Anderson [10]. Intrusion detection
systems are classified as network based, host based, or
application based depending on their mode of deployment
and data used for analysis [11]. Additionally, intrusion
detection systems can also be classified as signature based or
anomaly based depending upon the attack detection method.
The signature-based systems are trained by extracting specific
patterns (or signatures) from previously known attacks while
the anomaly-based systems learn from the normal data
collected when there is no anomalous activity [11].
Another approach for detecting intrusions is to consider
both the normal and the known anomalous patterns for
training a system and then performing classification on the
test data. Such a system incorporates the advantages of both
the signature-based and the anomaly-based systems and is
known as the Hybrid System. Hybrid systems can be very
efficient, subject to the classification method used, and can
also be used to label unseen or new instances as they assign
one of the known classes to every test instance. This is
possible because during training the system learns features
from all the classes. The only concern with the hybrid method
is the availability of labeled data. However, the data requirement
is also a concern for the signature- and the anomaly-based
systems as they require completely anomalous and attack-
free data, respectively, which are not easy to ensure.
The rest of this paper is organized as follows: In Section 2,
we discuss the related work with emphasis on various
methods and frameworks used for intrusion detection. We
describe the use of Conditional Random Fields (CRFs) for
intrusion detection [23] in Section 3 and the Layered
Approach [22] in Section 4. We then describe how to
integrate the Layered Approach and the CRFs in Section 5.
In Section 6, we give our experimental results and compare
our method with other approaches that are known to
perform well. We observe that our proposed system,
Layered CRFs, performs significantly better than other
systems. We study the robustness of our method in Section 7
by introducing noise in the system. We discuss feature
selection in Section 8 and draw conclusions in Section 9.
2 RELATED WORK
The field of intrusion detection and network security has
been around since the late 1980s. Since then, a number of
IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 7, NO. 1, JANUARY-MARCH 2010 35
. The authors are with the Department of Computer Science and Software
Engineering, and NICTA Victoria Research Laboratory, The University of
Melbourne, Parkville 3010, Australia.
E-mail: {kgupta, bnath, rao}@csse.unimelb.edu.au.
Manuscript received 6 Mar. 2007; revised 11 Dec. 2007; accepted 28 Jan.
2008; published online 12 Mar. 2008.
For information on obtaining reprints of this article, please send e-mail to:
tdsc@computer.org, and reference IEEECS Log Number TDSC-2007-03-0031.
Digital Object Identifier no. 10.1109/TDSC.2008.20.
1545-5971/10/$26.00 2010 IEEE Published by the IEEE Computer Society
Authorized licensed use limited to: Kongu Eningerring College. Downloaded on July 10,2010 at 09:43:34 UTC from IEEE Xplore. Restrictions apply.
methods and frameworks have been proposed and many
systems have been built to detect intrusions. Various
techniques such as association rules, clustering, naive Bayes
classifier, support vector machines, genetic algorithms,
artificial neural networks, and others have been applied to
detect intrusions. In this section, we briefly discuss these
techniques and frameworks.
Lee et al. introduced data mining approaches for
detecting intrusions in [30], [31], and [32]. Data mining
approaches for intrusion detection include association rules
and frequent episodes, which are based on building
classifiers by discovering relevant patterns of program
and user behavior. Association rules [8] and frequent
episodes are used to learn the record patterns that describe
user behavior. These methods can deal with symbolic data,
and the features can be defined in the form of packet and
connection details. However, mining of features is limited
to entry level of the packet and requires the number of
records to be large and sparsely populated; otherwise, they
tend to produce a large number of rules that increase the
complexity of the system [7].
Data clustering methods such as the k-means and the fuzzy
c-means have also been applied extensively for intrusion
detection [36] and [39]. One of the main drawbacks of the
clustering technique is that it is based on calculating numeric
distance between the observations, and hence, the observa-
tions must be numeric. Observations with symbolic features
cannot be easily used for clustering, resulting in inaccuracy.
In addition, the clustering methods consider the features
independently and are unable to capture the relationship
between different features of a single record, which further
degrades attack detection accuracy.
Naive Bayes classifiers have also been used for intrusion
detection [9]. However, they make strict independence
assumption between the features in an observation result-
ing in lower attack detection accuracy when the features are
correlated, which is often the case for intrusion detection.
Bayesian networks can also be used for intrusion detection
[28]. However, they tend to be attack specific and build a
decision network based on special characteristics of
individual attacks. Thus, the size of a Bayesian network
increases rapidly as the number of features and the type of
attacks modeled by a Bayesian network increases.
To detect anomalous traces of system calls in privileged
processes [20], hidden Markov models (HMMs) have been
applied in [17], [42], and [43]. However, modeling the system
calls alone may not always provide accurate classification as
in such cases various connection level features are ignored.
Further, HMMs are generative systems and fail to model
long-range dependencies between the observations [29]. We
further discuss this in detail in Section 3.
Decision trees have also been used for intrusion
detection [9]. The decision trees select the best features for
each decision node during the construction of the tree based
on some well-defined criteria. One such criterion is to use
the information gain ratio, which is used in C4.5. Decision
trees generally have very high speed of operation and high
attack detection accuracy.
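The gain ratio criterion mentioned above can be illustrated with a short Python sketch. It assumes discrete feature values and uses the textbook definition (information gain divided by split information); the function names and the toy inputs are hypothetical, and the sketch does not reproduce the additional heuristics of an actual C4.5 implementation.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Information gain of splitting on a feature, divided by its split information."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy after the split, weighted by group size.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0

# Toy inputs (illustrative only, not taken from the KDD 99 data).
labels = ["attack", "attack", "normal", "normal"]
informative = gain_ratio(["tcp", "tcp", "udp", "udp"], labels)
uninformative = gain_ratio(["tcp", "udp", "tcp", "udp"], labels)
```

A feature whose values separate the labels perfectly obtains the highest ratio, while a feature that leaves the label distribution unchanged in every branch obtains a ratio of zero.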
Debar et al. [14] and Zhang et al. [46] discuss the use of
artificial neural networks for network intrusion detection.
Though the neural networks can work effectively with noisy
data, they require a large amount of data for training and it is
often hard to select the best possible architecture for a neural
network. Support vector machines have also been used for
detecting intrusions [26]. Support vector machines map real-
valued input feature vector to a higher dimensional feature
space through nonlinear mapping and can provide real-time
detection capability, deal with large dimensionality of
data, and can be used for binary-class as well as multiclass
classification. Other approaches for detecting intrusion
include the use of genetic algorithm and autonomous and
probabilistic agents for intrusion detection [1] and [5]. These
methods are generally aimed at developing a distributed
intrusion detection system.
To overcome the weakness of a single intrusion detection
system, a number of frameworks have been proposed, which
describe the collaborative use of network-based and host-
based systems [45]. Systems that employ both signature-
based and behavior-based techniques are discussed in [19]
and [41]. In [32], the authors describe a data mining frame-
work for building adaptive intrusion detection models. A
distributed intrusion detection framework based on mobile
agents is discussed in [12].
The work most closely related to ours is that of Lee et al.
[30], [31], and [32]. They, however, consider a data mining
approach for mining association rules and finding frequent
episodes in order to calculate the support and confidence of
the rules separately. Instead, in our work, we define features
from the observations as well as from the observations and
the previous labels and perform sequence labeling via the
CRFs to label every feature in the observation. This setting is
sufficient for modeling the correlation between different
features of an observation. We also compare our work with
[21], which describes the use of maximum entropy principle
for detecting anomalies in the network traffic. The key
difference between [21] and our work is that the authors in
[21] use only the normal data during training and build a
baseline system, i.e., a behavior-based system, while we train
our system with both the normal and the anomalous data,
i.e., we build a hybrid system. Second, the system in [21] fails
to model long-range dependencies in the observations,
which can be easily represented in our model. We also
integrate the Layered Approach with the CRFs to gain the
benefits of computational efficiency and high accuracy of
detection in a single system.
We compare the Layered Approach with the works in [18],
[25], and [41]. The authors in [18] describe the combination of
strong classifiers using stacking, where the decision trees,
naive Bayes, and a number of other classification methods are
used as base classifiers. The authors show that the output
from these classifiers can be combined to generate a better
classifier rather than selecting the best one. In [25], the authors
use a combination of weak classifiers. The individual
classification power of weak classifiers is slightly better than
random guessing. The authors show that a number of such
classifiers when combined using simple majority voting
mechanism, provide good classification. In [41], the authors
apply a combination of anomaly and misuse detectors for
better qualification of analyzed events. However, our work is
not based upon classifier combination. Combination of
classifiers is expensive with regard to the processing time
and decision making. The purpose of classifier combination is
to improve accuracy. Rather, our system is based upon serial
layering of multiple hybrid detectors. From our experiments
in Section 6, we show that the Layered CRFs perform better
than individual classifiers and they are more efficient and
accurate than a system based on classifier combination. The
results from individual classifiers at a layer are not combined
at any later stage in the Layered Approach, and hence, an
attack can be blocked at the layer where it is detected. There is
no communication overhead among the layers and the central
decision-maker. In addition, since the layers are independent
they can be trained separately and deployed at critical
locations in a network depending upon the specific require-
ments of a network. Using a stacked system will not give us
the advantage of reduced processing when an attack is
detected at the initial layers in the sequential model.
In this paper, we show the effectiveness of CRFs for
intrusion detection. Motivated by our results in [23], we
perform detailed analysis and show that CRFs are a strong
candidate for building robust intrusion detection systems.
We then show that high efficiency can be achieved by
implementing the Layered Approach. Finally, we integrate
the Layered Approach and the CRFs to develop a system
that is accurate and performs efficiently.
3 CONDITIONAL RANDOM FIELDS FOR INTRUSION
DETECTION
Conditional models are probabilistic systems that are used
to model the conditional distribution over a set of random
variables. Such models have been extensively used in the
natural language processing tasks. Conditional models offer
a better framework as they do not make any unwarranted
assumptions on the observations and can be used to model
rich overlapping features among the visible observations.
Maxent classifiers [37], maximum entropy Markov models
[34], and CRFs [29] are such conditional models. The
advantage of CRFs is that they are undirected and are, thus,
free from the Label Bias and the Observation Bias [27]. The
simplest conditional classifier is the Maxent classifier based
upon maximum entropy classification, which estimates the
conditional distribution of every class given the observations
[37]. The training data is used to constrain this conditional
distribution while ensuring maximum entropy and hence
maximum uniformity. We now give a brief description of the
CRFs, which is motivated from the work in [29].
Let $X$ be the random variable over the data sequence to be
labeled and $Y$ the corresponding label sequence. In
addition, let $G = (V, E)$ be a graph such that $Y = (Y_v)_{v \in V}$,
so that $Y$ is indexed by the vertices of $G$. Then, $(X, Y)$ is a
CRF when, conditioned on $X$, the random variables $Y_v$
obey the Markov property with respect to the graph:
$p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$, where $w \sim v$ means
that $w$ and $v$ are neighbors in $G$, i.e., a CRF is a random field
globally conditioned on $X$. For a simple sequence (or chain)
modeling, as in our case, the joint distribution over the label
sequence $Y$ given $X$ has the following form:
$$p_\theta(y \mid x) \propto \exp\left( \sum_{e \in E,\, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k g_k(v, y|_v, x) \right), \qquad (1)$$
where $x$ is the data sequence, $y$ is a label sequence, and $y|_S$ is
the set of components of $y$ associated with the vertices or
edges in subgraph $S$. In addition, the features $f_k$ and $g_k$
are assumed to be given and fixed. For example, a
Boolean edge feature $f_k$ might be true if the observation $X_i$
is "protocol = tcp," tag $Y_{i-1}$ is "normal," and tag $Y_i$ is
"normal." Similarly, a Boolean vertex feature $g_k$ might be true
if the observation $X_i$ is "service = ftp" and tag $Y_i$ is "attack."
Further, the parameter estimation problem is to find the
parameters $\theta = (\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots)$ from the training data
$D = \{(x^i, y^i)\}_{i=1}^{N}$ with the empirical distribution $\tilde{p}(x, y)$ [29].
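To make the chain form of the model concrete, the following Python sketch evaluates the weighted feature sum for a two-element sequence and normalizes it by brute force over all label sequences. It is a minimal illustration under stated assumptions: the two Boolean features mirror the examples in the text, while the weights, the "name=value" token encoding, and all function names are hypothetical and are not taken from the paper.

```python
import itertools
import math

LABELS = ["normal", "attack"]

def edge_feature(prev_label, label, obs):
    # Boolean edge feature: observation is "protocol = tcp" and both tags are "normal".
    return 1.0 if obs == "protocol=tcp" and prev_label == "normal" and label == "normal" else 0.0

def vertex_feature(label, obs):
    # Boolean vertex feature: observation is "service = ftp" and the tag is "attack".
    return 1.0 if obs == "service=ftp" and label == "attack" else 0.0

EDGE_WEIGHT = 1.5    # assumed value, chosen only for illustration
VERTEX_WEIGHT = 0.8  # assumed value, chosen only for illustration

def score(labels, observations):
    """Weighted sum of edge and vertex features along the chain (the exponent)."""
    total = 0.0
    for i, obs in enumerate(observations):
        total += VERTEX_WEIGHT * vertex_feature(labels[i], obs)
        if i > 0:
            total += EDGE_WEIGHT * edge_feature(labels[i - 1], labels[i], obs)
    return total

def conditional_probability(labels, observations):
    """p(y | x), normalized by enumerating every possible label sequence."""
    z = sum(math.exp(score(y, observations))
            for y in itertools.product(LABELS, repeat=len(observations)))
    return math.exp(score(labels, observations)) / z

x = ["service=ftp", "protocol=tcp"]
probs = {y: conditional_probability(y, x)
         for y in itertools.product(LABELS, repeat=len(x))}
```

Brute-force normalization is exponential in the sequence length and is used here only because the example is tiny; practical implementations rely on dynamic programming instead.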
CRFs are undirected graphical models used for sequence
tagging. The prime difference between CRFs and other
graphical models such as the HMM is that the HMM, being
generative, models the joint distribution $p(y, x)$, whereas
CRFs are discriminative models and directly model the
conditional distribution $p(y \mid x)$, which is the distribution of
interest for the task of classification and sequence labeling.
Similar to the HMM, the naive Bayes is also generative and
models the joint distribution. Modeling the joint distribution
has two disadvantages. First, it is not the distribution of
interest, since the observations are completely visible and the
interest is in finding the correct class for the observations,
which is the conditional distribution $p(y \mid x)$. Second, inferring
the conditional probability $p(y \mid x)$ from the modeled joint
distribution, using the Bayes rule, requires the marginal
distribution $p(x)$. Estimating this marginal distribution is
difficult since the amount of training data is often limited and
the observation $x$ contains highly dependent features that are
difficult to model; therefore, strong independence
assumptions are made among the features of an observation.
This results in reduced accuracy [40]. CRFs, however, predict
the label sequence $y$ given the observation sequence $x$. This
allows them to model arbitrary relationships among different
features in an observation $x$ [15]. CRFs also avoid the
observation bias and the label bias problems, which are
present in other discriminative models, such as the maximum
entropy Markov models. This is because the maximum
entropy Markov models have a per-state exponential model
for the conditional probabilities of the next state given the
current state and the observation, whereas the CRFs have a
single exponential model for the joint probability of the entire
sequence of labels given the observation sequence [29].
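The contrast between the generative and the discriminative route can be written out explicitly. The block below is a notational sketch: the feature index $j$ and the feature count $m$ are symbols introduced here for illustration, and the naive Bayes line assumes a single class label $y$ per observation.

```latex
% Generative route: classify through the modeled joint distribution and the Bayes rule.
p(y \mid x) = \frac{p(y, x)}{p(x)} = \frac{p(y, x)}{\sum_{y'} p(y', x)}

% Naive Bayes: features x_1, ..., x_m are assumed independent given the class y.
p(y, x) = p(y) \prod_{j=1}^{m} p(x_j \mid y)

% Discriminative route (CRF): p(y | x) is modeled directly, so p(x) is never estimated.
```

When the features $x_j$ are correlated, the product in the second line is a poor approximation of the joint distribution, which is the source of the accuracy loss discussed above.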
The task of intrusion detection can be compared to many
problems in machine learning, natural language processing,
and bioinformatics. The CRFs have proven to be very
successful in such tasks, as they do not make any unwar-
ranted assumptions about the data. Hence, we explore the
suitability of CRFs for intrusion detection.
3.1 Motivating Example
The data analyzed by the intrusion detection system for
classification often has a number of features that are highly
correlated and complex relationships exist between them.
For example, when classifying network connections as
either normal or as attack, a system may consider features
such as "logged in" and "number of file creations." When
these features are analyzed individually, they do not
provide any information that can aid in detecting attacks.
However, when these features are analyzed together, they
can provide meaningful information, which can be helpful
for the classification task. Taking another example, the
connection level feature such as the service invoked at the
destination provides some information about the class label
(in case an attacker sends request to a service that is not
available). This information becomes more concrete and
aids in classification when analyzed with other features
such as protocol type and amount of data transferred
between source and destination (in case the client connects
to an available service such as the ftp and performs data
transfer). These relationships, between different features in
the observed data, if considered during classification can
significantly decrease classification error. The CRFs do not
consider features to be independent and hence perform
better when compared with other methods.
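The point that two features can be uninformative on their own yet informative together can be checked on a small synthetic example. The records below are fabricated for illustration (an XOR-style pattern, not drawn from the KDD 99 data), and the helper name is hypothetical.

```python
from collections import Counter, defaultdict

# Synthetic toy records (NOT from KDD 99): (logged_in, many_file_creations, label).
records = [
    (0, 0, "normal"), (0, 1, "attack"),
    (1, 0, "attack"), (1, 1, "normal"),
] * 25

def attack_rate_by(key_fn):
    """Fraction of records labeled "attack" for each value of key_fn(record)."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[key_fn(rec)][rec[2]] += 1
    return {k: c["attack"] / sum(c.values()) for k, c in counts.items()}

by_logged_in = attack_rate_by(lambda r: r[0])    # each value: half attacks
by_files = attack_rate_by(lambda r: r[1])        # each value: half attacks
by_both = attack_rate_by(lambda r: (r[0], r[1])) # each pair: all or no attacks
```

In this toy set, conditioning on either feature alone leaves the attack rate at one half, whereas conditioning on the pair determines the label completely. A model that treats the features as independent cannot exploit this.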
The data set used in our experiments represents features
of every session in relational form with only one label for
the entire record. In this case, using a conditional model
would result in a simple maximum entropy classifier [40].
However, we represent the data in the form of a sequence
and assign a label to every feature in the sequence using the
first-order Markov assumption instead of assigning a single
label to the entire observation. Though this increases the
complexity, it also increases the attack detection accuracy.
Each record represents a separate connection, and hence,
we consider every record as a separate sequence. We aim to
model the relationships among features of individual
connections using a CRF, as shown in Fig. 1. In the figure,
features such as duration, protocol, service, flag, and
src_bytes take some possible value for every connection.
During training, feature weights are learnt, and during
testing, features are evaluated for the given observation,
which is then labeled accordingly.
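One way to picture the sequence representation is sketched below in Python. The feature order follows the features named for Fig. 1; the example values, the "name=value" token format, and the choice to repeat the record label for every element are assumptions made for illustration and are not confirmed by the text.

```python
# Feature order as listed for Fig. 1.
FEATURES = ["duration", "protocol", "service", "flag", "src_bytes"]

def record_to_sequence(record):
    """Turn one connection record into an ordered sequence of feature tokens."""
    return [f"{name}={record[name]}" for name in FEATURES]

# Hypothetical connection record (values invented for the example).
connection = {"duration": 0, "protocol": "tcp", "service": "ftp",
              "flag": "SF", "src_bytes": 334}

sequence = record_to_sequence(connection)
# Assumption: during training, every element carries the label of its record.
labels = ["normal"] * len(sequence)
```

Each record yields its own short sequence, so the sequence length equals the number of features used.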
As is evident from the figure, every label is connected
to every input feature, which indicates that all the features
in an observation help in labeling, and thus, a CRF can
model dependencies among the features in an observation.
Present intrusion detection systems do not consider such
relationships among the features in the observations. They
either consider only one feature, such as in the case of
system call modeling, or assume conditional independence
among different features in the observation as in the case of
a naive Bayes classifier. As we will show from our
experimental results, the CRFs can effectively model such
relationships among different features of an observation
resulting in higher attack detection accuracy. Another
advantage of using CRFs is that every element in the
sequence is labeled such that the probability of the entire
labeling is maximized, i.e., all the features in the observa-
tion collectively determine the final labels. Hence, even if
some data is missing, the observation sequence can still be
labeled with fewer features.
Our first goal is to improve the attack detection accuracy.
We first compare the accuracy of CRFs for detecting attacks
with other methods in Section 6. We consider all the
41 features in the data set for each of the four attack groups
separately. As we shall observe, the CRFs outperform other
methods for detecting Unauthorized access to Root (U2R)
attacks. They are also effective in detecting the Probe,
Remote to Local (R2L), and Denial of Service (DoS)
attacks. However, CRFs can be expensive during training
and testing. For a simple linear chain structure, the time
complexity for training a CRF is $O(TL^2NI)$, where $T$ is the
length of the sequence, $L$ is the number of labels, $N$ is the
number of training instances, and $I$ is the number of
iterations. During inference, the Viterbi algorithm is
employed, which has a complexity of $O(TL^2)$. The quadratic
complexity is significant when the number of labels is
large as in language tasks. However, for intrusion detection,
there are only two labels, "normal" and "attack," and thus,
the system is very efficient. We further improve the overall
system performance by using the Layered Approach, which
decreases $T$, i.e., the length of the sequence. The Layered
Approach is described next.
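The source of the quadratic term in the inference cost can be seen in a compact Viterbi decoder. The Python sketch below is a generic textbook version for a linear chain, not the authors' implementation; the callback names `vertex_score` and `edge_score` and the example potentials are hypothetical.

```python
def viterbi(observations, labels, vertex_score, edge_score):
    """Return the highest-scoring label sequence for a linear chain.

    vertex_score(label, obs) and edge_score(prev_label, label, obs) are
    caller-supplied log-potentials. For T observations and L labels the
    nested loops below perform on the order of T * L * L score evaluations.
    """
    # best[l]: best score of any labeling of the prefix that ends in label l.
    best = {l: vertex_score(l, observations[0]) for l in labels}
    backpointers = []
    for obs in observations[1:]:                 # T - 1 steps
        new_best, pointers = {}, {}
        for l in labels:                         # L current labels
            # L candidate previous labels for each current label.
            prev = max(labels, key=lambda p: best[p] + edge_score(p, l, obs))
            new_best[l] = best[prev] + edge_score(prev, l, obs) + vertex_score(l, obs)
            pointers[l] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace the best path back from the best final label.
    path = [max(labels, key=lambda l: best[l])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

def vertex(label, obs):
    return 0.8 if obs == "service=ftp" and label == "attack" else 0.0

def edge(prev_label, label, obs):
    return 1.5 if obs == "protocol=tcp" and prev_label == label == "normal" else 0.0

decoded = viterbi(["service=ftp", "protocol=tcp"], ["normal", "attack"], vertex, edge)
```

With two labels the inner work per position is constant, so the cost grows essentially linearly with the sequence length, which is the quantity the Layered Approach reduces.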
4 LAYERED APPROACH FOR INTRUSION DETECTION
We now describe the Layer-based Intrusion Detection
System (LIDS) in detail. The LIDS draws its motivation
from what we call the Airport Security model, where a
number of security checks are performed one after the other
in a sequence. Similar to this model, the LIDS represents a
sequential Layered Approach and is based on ensuring
availability, confidentiality, and integrity of data and (or)
services over a network. Fig. 2 gives a generic representa-
tion of the framework.
The goal of using a layered model is to reduce computation
and the overall time required to detect anomalous events. The
time required to detect an intrusive event is significant and
can be reduced by eliminating the communication overhead
among different layers. This can be achieved by making the
layers autonomous and self-sufficient to block an attack
without the need of a central decision-maker. Every layer in
the LIDS framework is trained separately and then deployed
sequentially. We define four layers that correspond to the
four attack groups mentioned in the data set. They are Probe
layer, DoS layer, R2L layer, and U2R layer. Each layer is then
separately trained with a small set of relevant features.
Feature selection is significant for the Layered Approach and
is discussed in the next section. In order to make the layers
independent, some features may be present in more than one
layer. The layers essentially act as filters that block any
anomalous connection, thereby eliminating the need of
further processing at subsequent layers enabling quick
response to intrusion. The effect of such a sequence of layers is
that the anomalous events are identified and blocked as soon
as they are detected.
Fig. 1. Graphical representation of a CRF.
Fig. 2. Layered representation.
Our second goal is to improve the speed of operation of
the system. Hence, we implement the LIDS and select a
small set of features for every layer rather than using all the
41 features. This results in significant performance im-
provement during both the training and the testing of the
system. In many situations, there is a trade-off between
efficiency and accuracy of the system and there can be
various avenues to improve system performance. Methods
such as naive Bayes assume independence among the
observed data. This certainly increases system efficiency,
but it may severely affect the accuracy. To balance this
trade-off, we use the CRFs that are more accurate, though
expensive, but we implement the Layered Approach to
improve overall system performance. The performance of
our proposed system, Layered CRFs, is comparable to that
of the decision trees and the naive Bayes, and our system
has higher attack detection accuracy.
5 INTEGRATING LAYERED APPROACH WITH
CONDITIONAL RANDOM FIELD
In Section 1, we discussed two main requirements for an
intrusion detection system: accuracy of detection and
efficiency in operation. As discussed in Sections 3 and 4,
respectively, the CRFs can be effective in improving the
attack detection accuracy by reducing the number of false
alarms, while the Layered Approach can be implemented to
improve the overall system efficiency. Hence, a natural
choice is to integrate them to build a single system that is
accurate in detecting attacks and efficient in operation. Given
the data, we first select four layers corresponding to the four
attack groups (Probe, DoS, R2L, and U2R) and perform
feature selection for each layer, which is described next.
5.1 Feature Selection
Ideally, we would like to perform feature selection auto-
matically. However, as will be discussed later in Section 8,
the methods for automatic feature selection were not found
to be effective. In this section, we describe our approach for
selecting features for every layer and why some features
were chosen over others. In our system, every layer is
separately trained to detect a single type of attack category.
We observe that the attack groups are different in their
impact, and hence, it becomes necessary to treat them
differently. Hence, we select features for each layer based
upon the type of attacks that the layer is trained to detect.
5.1.1 Probe Layer
The probe attacks are aimed at acquiring information about
the target network from a source that is often external to the
network. Hence, basic connection level features such as the
"duration of connection" and "source bytes" are significant
while features like "number of file creations" and "number
of files accessed" are not expected to provide information
for detecting probes.
5.1.2 DoS Layer
The DoS attacks are meant to force the target to stop the
service(s) that is (are) provided by flooding it with
illegitimate requests. Hence, for the DoS layer, traffic
features such as the "percentage of connections having same
destination host and same service" and packet level features
such as the "source bytes" and "percentage of packets with
errors" are significant. To detect DoS attacks, it may not be
important to know whether a user is "logged in or not."
5.1.3 R2L Layer
The R2L attacks are one of the most difficult to detect as
they involve the network level and the host level features.
We therefore selected both the network level features such
as the "duration of connection" and "service requested"
and the host level features such as the "number of failed
login attempts" among others for detecting R2L attacks.
5.1.4 U2R Layer
The U2R attacks involve the semantic details that are very
difficult to capture at an early stage. Such attacks are often
content based and target an application. Hence, for U2R
attacks, we selected features such as "number of file
creations" and "number of shell prompts invoked," while
we ignored features such as "protocol" and "source bytes."
We used domain knowledge together with the practical
significance and the feasibility of each feature before
selecting it for a particular layer. Thus, from the total
41 features, we selected only 5 features for Probe layer,
9 features for DoS layer, 14 features for R2L layer, and
8 features for U2R layer. Since each layer is independent of
every other layer, the feature sets of the layers are not
disjoint. The selected features for all the four layers are
presented in Appendix A. We then use the CRFs for attack
detection as discussed in Section 3. However, the difference
is that we use only the selected features for each layer rather
than using all the 41 features. We now give the algorithm
for integrating CRFs with the Layered Approach.
Algorithm
Training
Step 1: Select the number of layers, i, for the complete
system.
Step 2: Separately perform feature selection for each layer.
Step 3: Train a separate model with CRFs for each layer
using the features selected from Step 2.
Step 4: Plug in the trained models sequentially such that
only the connections labeled as normal are passed
to the next layer.
Testing
Step 5: For each (next) test instance perform Steps 6
through 9.
Step 6: Test the instance and label it either as attack or
normal.
Step 7: If the instance is labeled as attack, block it and
identify it as an attack represented by the layer
name at which it is detected and go to Step 5. Else
pass the sequence to the next layer.
Step 8: If the current layer is not the last layer in the system,
test the instance and go to Step 7. Else go to Step 9.
Step 9: Test the instance and label it either as normal or as
an attack. If the instance is labeled as an attack,
block it and identify it as an attack corresponding
to the layer name.
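The testing steps above can be sketched in Python as a chain of filters. This is an illustrative sketch under stated assumptions: the per-layer feature subsets are small examples rather than the selections of Appendix A, and the threshold rules are arbitrary stand-ins for the trained CRF models. The function name `classify_layered` is hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Record = Dict[str, float]
# A layer is (name, selected feature names, model returning "attack" or "normal").
Layer = Tuple[str, Sequence[str], Callable[[Record], str]]

def classify_layered(record: Record, layers: List[Layer]) -> str:
    """Pass a record through the layers in order; stop at the first layer that flags it."""
    for name, features, model in layers:
        view = {f: record[f] for f in features}  # a layer sees only its own features
        if model(view) == "attack":
            return name                          # blocked here; later layers never run
    return "normal"

# Stand-in rules with arbitrary thresholds; a real system would plug in
# one separately trained CRF per layer at these positions.
layers: List[Layer] = [
    ("Probe", ["duration", "src_bytes"],
     lambda v: "attack" if v["duration"] == 0 and v["src_bytes"] == 0 else "normal"),
    ("DoS", ["src_bytes", "serror_rate"],
     lambda v: "attack" if v["serror_rate"] > 0.9 else "normal"),
    ("R2L", ["duration", "num_failed_logins"],
     lambda v: "attack" if v["num_failed_logins"] >= 3 else "normal"),
    ("U2R", ["num_file_creations", "num_shells"],
     lambda v: "attack" if v["num_shells"] > 0 and v["num_file_creations"] > 0 else "normal"),
]

record = {"duration": 12, "src_bytes": 300, "serror_rate": 0.0,
          "num_failed_logins": 5, "num_file_creations": 0, "num_shells": 0}
result = classify_layered(record, layers)
```

Because the function returns as soon as one layer reports an attack, a connection blocked at an early layer incurs none of the cost of the later layers, and no central component combines the layer outputs.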
Our final goal is to improve both the attack detection
accuracy and the efficiency of the system. Hence, we
integrate the CRFs and the Layered Approach to build a
single system. We perform detailed experiments and show
that our integrated system has a dual advantage. First, as
expected, the efficiency of the system increases
significantly. Second, since we select significant features for
each layer, the accuracy of the system further increases. This
is because not all of the 41 features are required for detecting
attacks belonging to a particular attack group. Using more
features than required can result in fitting irregularities in
the data, which has a negative effect on the attack detection
accuracy of the system.
6 EXPERIMENTS
For our experiments, we use the benchmark KDD 99
intrusion data set [3]. This data set is a version of the original
1998 DARPA intrusion detection evaluation program, which
is prepared and managed by the MIT Lincoln Laboratory.
The data set contains about five million connection records
as the training data and about two million connection
records as the test data. In our experiments, we use
10 percent of the total training data and 10 percent of the
test data (with corrected labels), which are provided
separately. This leads to 494,020 training and 311,029 test
instances. Each record in the data set represents a connection
between two IP addresses, starting and ending at some well-
defined times with a well-defined protocol. Further, every
record is represented by 41 different features. Each record
represents a separate connection and is hence considered to
be independent of any other record.
The training data is labeled either as normal or as one of
24 different kinds of attack. These 24 attacks can be
grouped into four classes: Probing, DoS, R2L, and U2R.
Similarly, the test data is also labeled as either normal or as
one of the attacks belonging to the four attack groups. It is
important to note that the test data is not from the same
probability distribution as the training data, and it includes
specific attack types not present in the training data. This
makes the intrusion detection task more realistic [3]. Table 1
gives the number of instances for each group of attack in the
data set.
For our experiments with CRFs, we use the CRF toolkit,
CRF++ [2]. We use the Weka tool [44] to perform experiments
with the decision trees and the naive Bayes classifier. We
develop Python and shell scripts for data formatting and
implementing the Layered Approach. For all our
experiments, we perform hybrid detection, as discussed in
Section 1, and use both the normal and the anomalous
connections for training the model. We perform our
experiments on a desktop running with Intel(R) Core(TM) 2,
CPU 2.4 GHz, and 2-Gbyte RAM under exactly the same
conditions. We are mainly interested in the test time efficiency
and not in the time required for training of the model, as the
real-time performance of the system depends upon the test
efficiency alone. We note that our system is very efficient
during testing. When we considered all the 41 features, the
time taken to test all the 250,436 attacks was 57 seconds,
which reduced to 17 seconds when we performed feature
selection and implemented the Layered Approach. More
details will be presented when we give the detailed results
for the experiments.
For our results, we give the Precision, Recall, and F-Value
and not the accuracy alone because, with the given data set,
it is easy to achieve very high accuracy by carefully selecting
the sample size. From Table 1, we note that the number of
instances for the U2R, Probes, and R2L attacks is very low.
Hence, if we use accuracy as a measure for testing the
performance of the system, the system can be biased and can
attain an accuracy of more than 99 percent for U2R attacks
[16]. However, Precision, Recall, and F-Value are not
dependent on the size of the training and the test samples.
They are defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN},$$

$$\text{F-Value} = \frac{(1 + \beta^2) \cdot \text{Recall} \cdot \text{Precision}}{\beta^2 \cdot (\text{Recall} + \text{Precision})},$$

where TP, FP, and FN are the number of True Positives,
False Positives, and False Negatives, respectively, and $\beta$
corresponds to the relative importance of precision versus
recall and is usually set to 1.
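As a small worked illustration of why accuracy alone can mislead on rare attack classes, the sketch below computes the three measures with the weighting parameter fixed at 1, so the F-Value is the harmonic mean of Precision and Recall. The counts are hypothetical and are not taken from Table 1.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of raised alarms that are real attacks."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of real attacks that raised an alarm."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_value(tp: int, fp: int, fn: int) -> float:
    """F-Value with the weighting parameter set to 1 (harmonic mean)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all instances labeled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical test set: 10,000 instances, only 10 of them attacks.
# A detector that labels everything as normal misses every attack ...
print(accuracy(tp=0, tn=9990, fp=0, fn=10))  # 0.999
print(f_value(tp=0, fp=0, fn=10))            # 0.0

# ... while a detector that finds 9 of the 10 attacks with 3 false alarms
# has almost the same accuracy but a very different F-Value.
print(accuracy(tp=9, tn=9987, fp=3, fn=1))
print(f_value(tp=9, fp=3, fn=1))
```

The first detector reaches 99.9 percent accuracy while detecting nothing, which is the bias the text warns about.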
We divide the training data into different groups:
Normal, Probe, DoS, R2L, and U2R. Similarly, we divide
the test data. We perform 10 experiments for each attack
class by randomly selecting data corresponding to that
attack class and normal data only. For example, to detect
Probe attacks, we train and test the system with Probe
attacks and normal data only. We do not add the DoS, R2L,
and U2R data when detecting Probes. Not including these
attacks while training allows the system to better learn the
features for Probe attacks and normal events. When such a
system is deployed online, other attacks such as DoS can
either be seen as normal or as Probes. If DoS attacks are
detected as normal, we expect them to be detected as attack
at other layers in the system. However, if the DoS attacks are
detected as Probe, it must be considered as an advantage
since the attack is detected at an early stage. Similarly, if
some Probe attacks are not detected at the Probe layer, they
may be detected at subsequent layers. Hence, for four attack
classes, we have four independent models, which are
trained separately with specific features to detect attacks
belonging to that particular group. For our experiments, we
report the best, the average, and the worst cases.
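The data preparation described above can be sketched as follows. This is an assumed reconstruction, not the authors' scripts: the record layout, the label strings, and the helper names `build_layer_dataset` and `summarize` are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Record = Tuple[Sequence[float], str]  # (feature vector, class label)


def build_layer_dataset(
    records: Sequence[Record],
    attack_class: str,
    n_normal: int,
    n_attack: Optional[int] = None,
    seed: int = 0,
) -> List[Record]:
    """Keep only normal records and one attack class, relabeled as a
    two-class problem. Records of the other attack classes are left out."""
    rng = random.Random(seed)
    normal = [r for r in records if r[1] == "normal"]
    attack = [r for r in records if r[1] == attack_class]
    picked_normal = rng.sample(normal, min(n_normal, len(normal)))
    if n_attack is None:
        picked_attack = attack  # use every record of the attack class
    else:
        picked_attack = rng.sample(attack, min(n_attack, len(attack)))
    data = [(f, "normal") for f, _ in picked_normal]
    data += [(f, "attack") for f, _ in picked_attack]
    rng.shuffle(data)
    return data


def summarize(scores: Sequence[float]) -> Tuple[float, float, float]:
    """Best, average, and worst score over repeated runs."""
    return max(scores), sum(scores) / len(scores), min(scores)


# Toy records: 6 normal, 3 probe, 4 dos (feature values are placeholders).
records = [([0.0], "normal")] * 6 + [([1.0], "probe")] * 3 + [([2.0], "dos")] * 4
probe_train = build_layer_dataset(records, "probe", n_normal=4)
print(len(probe_train))  # 7: four normal records plus all three probe records
```

Changing `seed` across the 10 runs gives a different random sample each time, and `summarize` then reports the best, average, and worst case.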
We represent a single layer, for example the Probe layer,
in Fig. 3. Other layers can be constructed similarly.
In Section 6.1, we perform experiments with individual
layers in the system. In Section 6.2, we show how to
implement the system in a real scenario, and in Section 6.3,
we compare our results with other systems. In Section 6.4, we
discuss the significance of our results.
TABLE 1
Data Set
6.1 Building Individual Layers of the System
We perform two sets of experiments. From the first
experiment, we wish to examine the accuracy of CRFs for
intrusion detection. The objective is to see how CRFs
compare with other techniques, which are known to
perform well. We do not consider feature selection, and
the systems are trained using all the 41 features. From this
experiment, we observe that CRFs perform much better for
U2R attacks while the decision trees achieve higher attack
detection for Probes and R2L. The difference in attack
detection accuracy for DoS is not significant. We note that
the reason for better performance of decision trees is that
they perform feature selection. This motivates us to perform
our second experiment where we perform feature selection
by selecting a small set of features for every attack group
instead of using all the 41 features. We perform the same
experiment with decision trees and naive Bayes and
compare the results. We call the integrated models
Layered CRFs, layered decision trees, and layered naive
Bayes, respectively. For better comparison and readability,
we give the results for both experiments together.
6.1.1 Detecting Probe Attacks with All 41 Features
We randomly select about 10,000 normal records and all the
Probe records from the training data as the training data for
detecting Probe attacks. We then use all the normal and
Probe records from the test data for testing. Hence, we have
15,000 training instances and 64,759 test instances. Table 2
gives the results for the experiments.
In Table 2, the testing time of 14.53 seconds represents
the total time taken to label 64,759 test instances. The results
show that the decision trees are more efficient than the CRFs
and the naive Bayes. This is because they have a small tree
structure, often with very few decision nodes, which is very
efficient. The attack detection accuracy is also higher for the
decision trees as they select the best possible features during
tree construction. However, as we show next, once we
perform feature selection, our system achieves much higher
accuracy and there is significant improvement in efficiency.
6.1.2 Detecting Probe Attacks with Feature Selection
We used the same set of instances for this experiment as
used in the previous experiment. However, we perform
feature selection for this experiment. Table 3 gives the
results for this experiment.
We observe that the system takes only 2.04 seconds to
label all the 64,759 test instances. The Layered CRFs
perform better and faster than our previous experiment
and are the best choice for detecting Probes. We also note
that there is no significant advantage with respect to time
for the layered decision trees as the number of features used
in normal decision trees and in the layered decision trees is
approximately the same, resulting in similar efficiency. We
further note that the Recall and hence the F-Value for the
layered naive Bayes decreases drastically. This can be
explained as follows: The classification accuracy with naive
Bayes generally improves as the number of features
increases. However, if the number of features increases to
a very large extent, the estimation tends to become
unreliable. As a result, when we use all the 41 features,
the naive Bayes performs well but when we decrease the
number of features to five, its classification accuracy
decreases. From this experiment, we conclude that the
Layered CRFs are a better choice for detecting Probe
attacks.
6.1.3 Detecting DoS Attacks with All 41 Features
We randomly select about 20,000 normal records and about
4,000 DoS records from the training data as the training data
for detecting DoS attacks. We then use all the normal and
DoS records from the test data for testing. Hence, we have
24,000 training instances and 290,446 test instances. Table 4
gives the results for the experiments.
In Table 4, the testing time of 64.42 seconds represents
the time taken to label all the 290,446 test instances. The
results show that all the three methods considered have
similar attack detection accuracy; however, decision trees
give a slight advantage with regard to test time efficiency.
6.1.4 Detecting DoS Attacks with Feature Selection
We used the same data for this experiment as used in the
previous experiment. However, we perform feature selec-
tion. Table 5 gives the results.
We observe that the system now takes only 15.17 seconds
to label all the 290,446 test instances. The results follow the
same trend as in the previous experiment with only a slight
improvement. However, if we consider the testing time, we
find that layered decision trees are a better choice. We also
note that there is a slight increase in the detection accuracy
when we perform feature selection, but this increase is not
significant. The real advantage is seen in the reduced time
for testing, which decreases about fourfold.
Fig. 3. Representation of a single layer (e.g., probe layer).
TABLE 2
Normal and Probes (All 41 Features)
TABLE 3
Normal and Probes (with Feature Selection)
6.1.5 Detecting R2L Attacks with All 41 Features
We randomly select about 1,000 normal records and all the
R2L records from the training data as the training data for
detecting R2L attacks. We then use all the normal and
R2L records from the test data for testing. Hence, we have
2,000 training instances and 76,942 test instances. Table 6 gives
the results.
In Table 6, the testing time of 17.16 seconds represents
the time taken to label all the 76,942 test instances. We
observe that the decision trees have a higher F-Value, but if
we look at the number of false alarms, we find that the CRFs
perform better and have high Precision compared to the
decision trees and the naive Bayes.
6.1.6 Detecting R2L Attacks with Feature Selection
Table 7 gives the results when we performed feature
selection for detecting R2L attacks.
We observe that the time taken to test all the
76,942 instances is only 5.96 seconds. Further, the
Layered CRFs perform much better than the CRFs (an
increase of about 60 percent), layered decision trees (an
increase of about 125 percent), decision trees (an increase
of about 17 percent), layered naive Bayes (an increase of
about 250 percent), and naive Bayes (an increase of
about 250 percent) and are the best choice for detecting
the R2L attacks. The Layered CRFs take slightly more
time, which is acceptable since we achieve much higher
detection accuracy.
6.1.7 Detecting U2R Attacks with All 41 Features
We randomly select about 1,000 normal records and all the
U2R records from the training data as the training data for
detecting the User to Root attacks. We then use all the
normal and U2R records from the test data for testing.
Hence, we have 1,000 training instances and 60,661 test
instances. Table 8 gives the results.
In Table 8, the testing time of 13.45 seconds represents
the time taken to label all the 60,661 test instances. In this
experiment, we find that the CRFs are far better than the
other two methods. The F-Value for CRFs is more than
150 percent with respect to the decision trees and more than
600 percent with respect to the naive Bayes. The U2R attacks
are very difficult to detect and most of the present intrusion
detection systems fail to detect such attacks with acceptable
reliability. We find that the CRFs can be used to reliably
detect such attacks.
TABLE 5
Normal and DoS (with Feature Selection)
TABLE 6
Normal and R2L (All 41 Features)
TABLE 7
Normal and R2L (with Feature Selection)
TABLE 4
Normal and DoS (All 41 Features)
TABLE 8
Normal and U2R (All 41 Features)
6.1.8 Detecting U2R Attacks with Feature Selection
In this experiment, we used exactly the same set of instances
as we used in the previous experiment. We also perform
feature selection. Table 9 gives the results for this experiment.
We observe that the system takes only 2.67 seconds to label
all the 60,661 test instances. The Layered CRFs are the best
choice for detecting the U2R attacks and are far better than
CRFs (an increase of about 8 percent), layered decision trees
(an increase of about 30 percent), decision trees (an increase
of about 184 percent), layered naive Bayes (an increase of
about 38 percent), and naive Bayes (an increase of about
675 percent). We observe that the attack detection capability
also increases for the decision trees and the naive Bayes.
It is evident from the results that the accuracy of Layered
CRFs is significantly higher for the U2R, R2L, and the Probe
attacks. The difference in accuracy is, however, not sig-
nificant for the DoS attacks. Further, regardless of the
method considered and particularly for the CRFs, the time
required for training and testing the system is drastically
reduced once we perform feature selection. We also note
that the increase in detection accuracy is not significant for
the layered decision trees and the layered naive Bayes for
the DoS group of attack. Their accuracy of detection
decreases for the Probe and R2L attacks while it increases
for the U2R attacks. However, we find that in all the cases,
the Layered CRFs perform significantly better and can better
learn a model when we use a small set of specific features for
training.
6.2 Implementing the System in Real Life
In a real scenario, we are not aware of the category of an
attack. Rather, we are interested in identifying the attack
category once the system detects an event as anomalous.
The Layered Approach not only improves the attack detection,
but it also helps identify the type of attack once detected
because every layer is trained to detect only a particular
category of attack. Hence, if an attack is detected at the U2R
layer, it is very likely that the attack is of U2R type. This
enables quick recovery and helps prevent similar
attacks. Fig. 4 gives the real-time system representation.
We integrate the four models (with feature selection) from
Section 6.1 to develop the final system. In this experiment,
we use the same data for training the individual models as
used in our previous experiments. However, the data in the
test set is relabeled either as normal or as attack, and all the
data from the test set is passed through the system starting
from the first layer. If layer 1 detects any connection as an
attack, it is blocked and labeled as Probe. Only the events
labeled as Normal are allowed to go to the next layer. The
same process is repeated at the next layers where an attack is
blocked and labeled as DoS, R2L, or U2R at layer 2,
layer 3, and layer 4, respectively. We perform all the
experiments 10 times and report their average. We give the
results for this experiment in Tables 10, 11, and 12. Table 10
gives the confusion matrix where the values represent the
percent detection with respect to each of the five classes.
TABLE 9
Normal and U2R (with Feature Selection)
Fig. 4. Real-time representation of the system.
TABLE 10
Confusion Matrix
TABLE 11
Attack Detection at Each Layer (Case 1)
TABLE 12
Attack Detection at Each Layer (Case 2)
From Table 10, we observe that our system can detect
most of the Probe (98.62 percent), DoS (97.40 percent),
and U2R (86.33 percent) attacks while giving very few
false alarms at each layer. The system can also detect R2L
attacks with much higher reliability (29.62 percent) when
compared with the previously reported systems, as we will
discuss later in Section 6.3. The confusion matrix shows that
only 71.90 percent of DoS attacks are labeled as DoS during
testing. However, it is very important to note that the
accuracy for detecting DoS attacks is not 71.90 percent, but
it is 25.50 + 71.90 + 0.00 + 0.00 = 97.40 percent. This is
because 25.50 percent of the DoS attacks are already
detected at the first layer, though our system identifies
them as Probes since they are detected at the first layer. This
is acceptable because in the real environment it is critical to
detect an attack as early as possible to minimize its impact.
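The DoS arithmetic can be checked with a short sketch: the detection rate for a class is the sum of its confusion-matrix row over all attack labels, whichever layer raised the alarm. The four percentages are the ones quoted in the text; the dictionary layout and function name are assumptions made for illustration.

```python
# Percent of true DoS attacks that received each attack label
# (values quoted in the text; the remainder was labeled normal).
dos_row = {"Probe": 25.50, "DoS": 71.90, "R2L": 0.00, "U2R": 0.00}


def detection_rate(row: dict) -> float:
    """An attack counts as detected if any layer blocks it,
    regardless of the label that layer assigns."""
    return sum(row.values())


detected = detection_rate(dos_row)
missed = 100.0 - detected  # share of DoS attacks that passed every layer
print(round(detected, 2), round(missed, 2))  # 97.4 2.6
```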
It is also important to note that most of the U2R attacks
are detected in the third layer itself and hence labeled as
R2L. However, if we remove the third layer, the fourth
layer can detect these attacks with similar accuracy. Further,
looking at the R2L and U2R columns in Table 10, it is
natural to think that the two layers can be merged. However,
this has two disadvantages. First, merging the two layers
results in increasing the number of features, which reduces
efficiency. The merged layer performs poorly with regard to
the total time taken when compared with both the
unmerged layers together. Second, when the layers are
merged, the U2R attacks are not detected effectively and
their individual attack detection accuracy decreases. This is
because the number of U2R instances is very low in the
training data and the system simply learns the features that
are specific to the R2L attacks. Hence, we prefer separate
layers for the two attack groups. Using our approach, we can
hope that any attack, even though its category is unknown,
can be detected at any one of the layers in the system. We
can also increase or decrease the number of layers depend-
ing upon the environment where the system is deployed.
We evaluate the performance of each layer in the
system in Table 11. From the table, we observe that out of
all the 250,436 attack instances in the test data set, more
than 25 percent of the attacks are blocked at layer 1, and
more than 90 percent of all the attacks have been blocked
by the end of layer 2. Thus, the Layered Approach is very
effective in reducing the attack traffic at each layer in the
system. The configuration takes 21 seconds to classify all
the 250,436 attacks.
We can optimize further by putting the DoS layer
before the Probe layer. We can do this because the data is
relational and each layer in our system is independent.
Putting the DoS layer before the Probe layer offers a dual
advantage: most of the attacks are detected at the first
layer itself, and the overall system performs efficiently. This
optimization becomes significant in severe attack situations
when the target is overwhelmed with illegitimate
connections. The results are presented in Table 12.
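The effect of layer order on the amount of work can be illustrated with a toy model. It rests on several assumptions: each instance is described only by the set of layers that would flag it, every detector call is treated as equally expensive (in the real system the layers use different numbers of features), and the traffic mix is invented rather than taken from Tables 11 and 12.

```python
from typing import Dict, List, Sequence, Set, Tuple


def layer_evaluations(
    instances: Sequence[Set[str]], order: Sequence[str]
) -> Tuple[Dict[str, int], int]:
    """Return how many instances each layer blocks and the total number
    of detector calls made for the given layer order."""
    blocked = {name: 0 for name in order}
    calls = 0
    for flagged_by in instances:
        for name in order:
            calls += 1
            if name in flagged_by:
                blocked[name] += 1
                break  # blocked here, later layers never see it
    return blocked, calls


# Hypothetical attack traffic dominated by DoS: 90 instances only the DoS
# layer would flag, 8 only the Probe layer would flag, 2 that no layer flags.
traffic: List[Set[str]] = [{"DoS"}] * 90 + [{"Probe"}] * 8 + [set()] * 2

_, probe_first = layer_evaluations(traffic, ["Probe", "DoS", "R2L", "U2R"])
_, dos_first = layer_evaluations(traffic, ["DoS", "Probe", "R2L", "U2R"])
print(probe_first, dos_first)  # 196 114
```

In this toy mix, moving the DoS layer to the front removes most of the traffic after a single detector call, which is the intuition behind the reordering described above.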
We observe that the Layered Approach can be very
effective in restricting the attack traffic to the initial layers in
the system. We also performed experiments when we do
not implement the Layered Approach, i.e., we consider only
a single system that is trained with two classes (normal and
attack). In this system, all the Probes, DoS, R2L, and U2R
attacks are labeled as attack. We perform experiments
both with and without feature selection. For feature
selection, we consider 21 features, which are selected by
applying the union operation on the feature sets of all the
four attack types. We compare these results with the
Layered Approach in Table 13.
We observe that a system implementing the Layered
Approach with feature selection is more efficient and more
accurate in detecting attacks particularly the U2R, the R2L,
and the Probes.
It is important to note that the time should be read in
relative terms rather than absolute, because we used scripts
for ease of experimentation. In a real environment, high
speed can be achieved by implementing the complete system
in a language with efficient compilers, such as C. Further,
pipelining can be implemented on multicore processors, where
each core may represent a single layer, and due to
pipelining, multiple I/O operations can be replaced by a
single I/O operation, providing very high speed of operation.
6.3 Comparison of Results
In this section, we compare our work with other well-known
methods based on the anomaly intrusion detection principle.
The anomaly-based systems primarily detect deviations
from the learnt normal data by using statistical methods,
machine learning, or data mining approaches. Standard
techniques such as the decision trees and naive Bayes are
known to perform well. However, our experiments show
that the Layered CRFs perform far better than these
techniques. The main reason for this is that the CRFs do not
consider the observation features to be independent. In [38],
the authors present a comparative study of various classifiers
when applied to the KDD 99 data set, and in [13], the authors
propose the use of Principal Component Analysis (PCA)
before applying a machine learning algorithm. Use of
support vector machines is discussed in [26]. We compare
our results with the results presented in these papers in
Table 14. The table gives the Probability of Detection
(PD) and False Alarm Rate (FAR) in percent for various
methods including the KDD 99 cup winners.
From the table, we observe that the Layered CRFs
perform significantly better than the previously reported
results including the winner of the KDD 99 cup and
various other methods applied to this data set. The most
impressive part of the Layered CRFs is the margin of improve-
ment as compared with other methods. Layered CRFs have very
high attack detection of 98.6 percent for Probes (5.8 percent
improvement) and 97.40 percent detection for DoS. They
outperform by a significant percentage for the R2L (34.5 percent
improvement) and the U2R (34.8 percent improvement) attacks.
TABLE 13
Layered versus Nonlayered Approach
6.4 Discussion and Issues
From our experiments and the comparison in Table 14, we
conclude that the Layered CRFs can be very effective in
detecting the Probe, the U2R, and the R2L attacks as well as
the DoS attacks. However, if we consider all the 41 features
given in the data set, we find that the time required to train
and test the model is high. To address this, we performed
experiments with our integrated system by implementing a
four-layer system. The four layers correspond to Probe,
DoS, R2L, and U2R. For each layer, we then selected a set of
features that is sufficient to detect attacks at that particular
layer. Feature selection for each layer enhances the
performance of the entire system. The runtime (testing)
performance of our model is comparable with other
methods; however, the time required to train the model is
slightly higher. We also observe that feature selection not
only decreases the time required to test an instance, but it
also increases the accuracy of attack detection. This is
because using more features than required can generate
superfluous rules often resulting in fitting irregularities in
the data, which can misguide classification. From our
experimental results, we conclude that the main strength
of our method lies in detecting the R2L and the U2R attacks,
which are not satisfactorily detected by other methods. Our
method gives slight improvement for detecting Probe
attacks and was similar in accuracy when compared with
other methods for detecting the DoS attacks.
The prime reason for better detection accuracy for the
CRFs is that they do not consider the observation features to
be independent. CRFs evaluate all the rules together, which
are applicable for a given observation. This results in
capturing the correlation among different features of the
observation resulting in higher accuracy. Considering both
the accuracy and the time required for testing, our system
scores better. Our integrated system also has the advantage
that any method can be used in the layers of the system. This
gives flexibility to the user to decide between the time and
accuracy trade-off. Furthermore, we can increase or decrease
the number of layers in the system depending upon the task
requirement. Finally, our system can be used for performing
analysis on attacks because the attack category can be
inferred from the layer at which the attack is detected.
To determine the statistical significance of our results, we
compare our proposed method (Layered CRFs) with others
for detecting Probes, DoS, R2L, and U2R attacks. We use the
Wilcoxon rank-sum test at the 95 percent confidence level
to discriminate the performance of these methods. Table 15
gives the ranking for various methods compared, where a
system with rank 1 is the best.
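For readers unfamiliar with the test, the sketch below implements a two-sided Wilcoxon rank-sum test with the normal approximation and average ranks for ties. It is a minimal illustration and may differ from the exact variant used for Table 15; the two lists of F-Values are invented. SciPy's `scipy.stats.ranksums` provides a library version of the same approximation.

```python
import math
from typing import Sequence, Tuple


def rank_sum_test(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float, float]:
    """Return (rank sum of a, z statistic, two-sided p-value)."""
    combined = sorted((v, i) for i, v in enumerate(list(a) + list(b)))
    ranks = [0.0] * len(combined)
    pos = 0
    while pos < len(combined):
        end = pos
        # Extend over a run of tied values so they share an average rank.
        while end + 1 < len(combined) and combined[end + 1][0] == combined[pos][0]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[combined[k][1]] = avg_rank
        pos = end + 1
    n1, n2 = len(a), len(b)
    w = sum(ranks[:n1])                  # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2        # expected rank sum under H0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w, z, p


# Hypothetical F-Values from 10 runs of two methods.
method_a = [0.90 + 0.001 * i for i in range(10)]
method_b = [0.80 + 0.001 * i for i in range(10)]
w, z, p = rank_sum_test(method_a, method_b)
print(w, p < 0.05)  # 155.0 True
```

A p-value below 0.05 means the difference between the two samples is significant at the 95 percent level; methods that cannot be separated this way would share a rank.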
The results of the Wilcoxon test indicate that the Layered
CRFs are much better (or equal) for detecting attacks. Thus,
we conclude that the Layered CRFs are a strong candidate
for building robust and efficient intrusion detection systems.
7 EFFECT OF NOISE
Ideally, we would like to perform similar experiments with
a large number of data sets. However, given the domain of
the problem, there are no other data sets that are freely
available, which can be used for our experiments. To
ameliorate this problem to some extent, we add a substantial
amount of noise to the training data and perform similar
experiments to study the robustness of these systems. By
experimenting with noisy data, we want to determine the
sensitivity of the proposed scheme with respect to noise. If
the system performs poorly with noisy data, the results
could be an artifact of the data set.
7.1 Addition of Noise to Data
Addition of noise was controlled by two parameters, the
probability of adding noise to a feature, p, and the scaling
factor, s, for a feature. We performed four sets of
experiments with noisy data, separately, one for each layer. For
each layer, we varied the parameter p between 0 and 0.95 (by
keeping it at values 0.10, 0.20, 0.33, 0.50, 0.75, 0.90, and 0.95)
and varied the parameter s between -1,000 and 1,000. In the
case when the original feature was 0, noise was added to
the feature by using an additive function (random value
between -1,000 and 1,000) instead of scaling. Figs. 5, 6, 7,
and 8 represent the effect of noise on each layer separately.
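The noise procedure can be sketched as follows. Two points are assumptions rather than statements from the text: the scaling factor is drawn uniformly at random for each perturbed feature, and features are perturbed independently of one another. The function name `add_noise` and its parameters are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def add_noise(
    features: Sequence[float],
    p: float,
    s_max: float = 1000.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Perturb each feature with probability p.

    Nonzero features are multiplied by a scaling factor drawn from
    [-s_max, s_max]; zero features receive an additive random value
    from the same range, since scaling zero would leave it unchanged.
    """
    rng = rng or random.Random()
    noisy = []
    for value in features:
        if rng.random() < p:
            if value == 0:
                value = rng.uniform(-s_max, s_max)          # additive noise
            else:
                value = value * rng.uniform(-s_max, s_max)  # scaling noise
        noisy.append(value)
    return noisy


record = [1.0, 0.0, 5.0]
print(add_noise(record, p=0.0))  # [1.0, 0.0, 5.0], nothing is perturbed
print(add_noise(record, p=0.5, rng=random.Random(42)))
```

Applying the function to every training record at each listed value of p would give one noisy training set per noise level.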
We find that our integrated system is robust to noise in
the training data and performs better than other methods
for all of the four attack groups.
TABLE 14
Comparison of Results
TABLE 15
Ranking for the Six Methods
8 AUTOMATIC FEATURE SELECTION
From our experiments in the previous sections, we showed
the advantages of performing feature selection and im-
plementing the Layered Approach for attack detection. We
performed our experiments by manually selecting features
for different layers. However, we want to compare the
results of manual feature selection with the results of
automatic feature selection (and no feature selection) for all
the layers. Hence, we investigated various methods for
automatic feature selection.
We experimented with a feed forward neural network to
determine the weights for all the 41 features. Features with
weights close to zero were discarded. As a result, only a
small set of features was selected for each layer. However,
when we performed the experiments on the reduced set of
features and compared the results, there was no significant
improvement in the detection accuracy, though there was
reduction in training and testing time.
We then used PCA for dimensionality reduction [13].
However, the main drawback of using PCA in our task is
that PCA transforms a large number of possibly correlated
features into a small number of uncorrelated features known
as the principal components. Hence, when we applied PCA
followed by CRFs in the newly transformed feature space,
the method did not provide a significant advantage, as the
strength of our approach is to model correlation among
features and the features in the new space are uncorrelated.
We also note that, to construct a decision tree, the
C4.5 algorithm performs feature selection. We selected the
same features as selected by the C4.5 algorithm and then
performed experiments with only those features. However,
there was no significant improvement in the results.
We also performed experiments with the method
proposed in [33] for efficiently inducing features for a
CRF. The method is based upon iteratively constructing
feature conjunctions that would significantly increase the
conditional log-likelihood if added to the model. We used
the Mallet tool [35] for performing these experiments and
compare the results with our previous results based on
Layered CRFs with manual feature selection in Table 16. We
observed that both the systems, with automatic and manual
feature selections, had similar test time performance, but
the accuracy of detection when features were induced
automatically was significantly lower than our system
based upon manual feature selection.
It was not surprising that manual feature selection
performed better than automatic feature selection.
However, we note that automatic feature selection for Layered
CRFs performed better than the decision trees, particularly
for the R2L and the U2R attacks. This suggests that the Layered
Approach using CRFs with automated feature selection is a
feasible scheme for building reliable intrusion detection
systems.
Fig. 5. Effect of noise on Probe layer.
Fig. 6. Effect of noise on DoS layer.
Fig. 7. Effect of noise on R2L layer.
Fig. 8. Effect of noise on U2R layer.
TABLE 16
Feature Selection
9 CONCLUSIONS
In this paper, we have addressed the dual problem of
Accuracy and Efficiency for building robust and efficient
intrusion detection systems. Our experimental results in
Section 6 show that CRFs are very effective in improving
the attack detection rate and decreasing the FAR. Having a
low FAR is very important for any intrusion detection
system. Further, feature selection and implementing the
Layered Approach significantly reduce the time required to
train and test the model. Even though we used a relational
data set for our experiments, we showed that the sequence
labeling methods such as the CRFs can be very effective in
detecting attacks and they outperform other methods that
are known to work well with the relational data. We
compared our approach with some well-known methods
and found that most of the present methods for intrusion
detection fail to reliably detect R2L and U2R attacks, while
our integrated system can effectively and efficiently detect
such attacks giving an improvement of 34.5 percent for the
R2L and 34.8 percent for the U2R attacks. We also discussed
how our system can be implemented in real life. Our system can
help in identifying an attack once it is detected at a
particular layer, which expedites the intrusion response
mechanism, thus minimizing the impact of an attack. We
showed that our system is robust to noise and performs
better than any other compared system even when the
training data is noisy. Finally, our system has the advantage
that the number of layers can be increased or decreased
depending upon the environment in which the system is
deployed, giving flexibility to the network administrators.
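The control flow summarized above can be sketched as follows. This is a minimal illustration, not our implementation: each layer's trained CRF is replaced by a hypothetical threshold rule, and the layer order, feature names, and thresholds are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Record = Dict[str, float]

@dataclass
class Layer:
    name: str                            # attack class this layer targets
    features: List[str]                  # the only features this layer may see
    is_attack: Callable[[Record], bool]  # stand-in for a trained per-layer model

def classify(record: Record, layers: List[Layer]) -> Optional[str]:
    """Return the name of the first layer that flags the record, else None."""
    for layer in layers:
        view = {f: record[f] for f in layer.features}  # restrict to layer features
        if layer.is_attack(view):
            return layer.name   # blocked here; later layers are never evaluated
    return None

# Hypothetical rules standing in for the per-layer models.
layers = [
    Layer("Probe", ["count"], lambda r: r["count"] > 100),
    Layer("DoS", ["serror_rate"], lambda r: r["serror_rate"] > 0.9),
    Layer("R2L", ["num_failed_logins"], lambda r: r["num_failed_logins"] > 2),
    Layer("U2R", ["num_root"], lambda r: r["num_root"] > 0),
]

record = {"count": 3, "serror_rate": 0.1, "num_failed_logins": 4, "num_root": 0}
print(classify(record, layers))      # flagged by the third layer
print(classify(record, layers[:2]))  # with only two layers deployed, it passes
```

Because the layer at which a record is blocked names the attack class, the response mechanism knows immediately what kind of attack it is handling, and adding or removing a layer amounts to editing the list.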
The areas for future research include the use of our
method for extracting features that can aid in the development
of signatures for signature-based systems. The
signature-based systems can be deployed at the periphery
of a network to filter out attacks that are frequent and
previously known, leaving the detection of new unknown
attacks for anomaly and hybrid systems. Sequence analysis
methods such as the CRFs when applied to relational data
give us the opportunity to employ the Layered Approach,
as shown in this paper. This can further be extended to
implement pipelining of layers in multicore processors,
which is likely to result in very high performance.
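One possible shape for such a pipeline is sketched below, with one worker per layer connected by queues so that different layers can examine different records at the same time. It is only a structural sketch under stated assumptions: the detectors are hypothetical predicates on integers, and Python threads are used for brevity even though they do not give true multicore parallelism for CPU-bound work; a real deployment would use processes or a language with native threads.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the record stream

def stage(name, is_attack, inbox, outbox, alerts):
    """One pipeline stage: block flagged records, forward the rest."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)          # propagate shutdown downstream
            return
        if is_attack(item):
            alerts.append((name, item))   # blocked at this layer
        else:
            outbox.put(item)

def run_pipeline(records, detectors):
    """Run records through the layers; return (alerts, records that passed)."""
    queues = [queue.Queue() for _ in range(len(detectors) + 1)]
    alerts = []
    threads = [
        threading.Thread(target=stage,
                         args=(name, fn, queues[i], queues[i + 1], alerts))
        for i, (name, fn) in enumerate(detectors)
    ]
    for t in threads:
        t.start()
    for r in records:
        queues[0].put(r)
    queues[0].put(SENTINEL)
    for t in threads:
        t.join()
    passed = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        passed.append(item)
    return alerts, passed

# Hypothetical detectors on integer "records", for illustration only.
detectors = [("Probe", lambda x: x % 2 == 0), ("DoS", lambda x: x % 3 == 0)]
alerts, passed = run_pipeline(range(1, 7), detectors)
print(sorted(alerts), passed)
```

While the second stage is still examining one record, the first stage can already work on the next, which is the source of the expected throughput gain.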
APPENDIX A
FEATURE SELECTION
A.1 Features Selected for Probe Layer
A.2 Features Selected for DoS Layer
A.3 Features Selected for R2L Layer
A.4 Features Selected for U2R Layer
ACKNOWLEDGMENTS
The authors sincerely thank the anonymous reviewers
whose comments have greatly helped clarify and improve
this paper.
REFERENCES
[1] Autonomous Agents for Intrusion Detection, http://www.cerias.purdue.edu/research/aafid/, 2010.
[2] CRF++: Yet Another CRF Toolkit, http://crfpp.sourceforge.net/, 2010.
[3] KDD Cup 1999 Intrusion Detection Data, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 2010.
[4] Overview of Attack Trends, http://www.cert.org/archive/pdf/attack_trends.pdf, 2002.
[5] Probabilistic Agent Based Intrusion Detection, http://www.cse.sc.edu/research/isl/agentIDS.shtml, 2010.
[6] SANS Institute - Intrusion Detection FAQ, http://www.sans.org/resources/idfaq/, 2010.
[7] T. Abraham, IDDM: Intrusion Detection Using Data Mining Techniques, http://www.dsto.defence.gov.au/publications/2345/DSTO-GD-0286.pdf, 2008.
[8] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM SIGMOD, vol. 22, no. 2, pp. 207-216, 1993.
[9] N.B. Amor, S. Benferhat, and Z. Elouedi, "Naive Bayes vs. Decision Trees in Intrusion Detection Systems," Proc. ACM Symp. Applied Computing (SAC '04), pp. 420-424, 2004.
[10] J.P. Anderson, Computer Security Threat Monitoring and Surveillance, http://csrc.nist.gov/publications/history/ande80.pdf, 2010.
[11] R. Bace and P. Mell, Intrusion Detection Systems, Computer Security Division, Information Technology Laboratory, Nat'l Inst. of Standards and Technology, 2001.
[12] D. Boughaci, H. Drias, A. Bendib, Y. Bouznit, and B. Benhamou, "Distributed Intrusion Detection Framework Based on Mobile Agents," Proc. Int'l Conf. Dependability of Computer Systems (DepCoS-RELCOMEX '06), pp. 248-255, 2006.
[13] Y. Bouzida and S. Gombault, "Eigenconnections to Intrusion Detection," Security and Protection in Information Processing Systems, pp. 241-258, 2004.
[14] H. Debar, M. Becke, and D. Siboni, "A Neural Network Component for an Intrusion Detection System," Proc. IEEE Symp. Research in Security and Privacy (RSP '92), pp. 240-250, 1992.
[15] T.G. Dietterich, "Machine Learning for Sequential Data: A Review," Proc. Joint IAPR Int'l Workshop Structural, Syntactic, and Statistical Pattern Recognition (SSPR/SPR '02), LNCS 2396, pp. 15-30, 2002.
[16] P. Dokas, L. Ertoz, A. Lazarevic, J. Srivastava, and P.-N. Tan, "Data Mining for Network Intrusion Detection," Proc. NSF Workshop Next Generation Data Mining (NGDM '02), pp. 21-30, 2002.
[17] Y. Du, H. Wang, and Y. Pang, "A Hidden Markov Models-Based Anomaly Intrusion Detection Method," Proc. Fifth World Congress on Intelligent Control and Automation (WCICA '04), vol. 5, pp. 4348-4351, 2004.
[18] S. Dzeroski and B. Zenko, "Is Combining Classifiers Better than Selecting the Best One," Proc. 19th Int'l Conf. Machine Learning (ICML '02), pp. 123-129, 2002.
[19] L. Ertoz, A. Lazarevic, E. Eilertson, P.-N. Tan, P. Dokas, V. Kumar, and J. Srivastava, "Protecting against Cyber Threats in Networked Information Systems," Proc. SPIE Battlespace Digitization and Network Centric Systems III, pp. 51-56, 2003.
[20] S. Forrest, S.A. Hofmeyr, A. Somayaji, and T.A. Longstaff, "A Sense of Self for Unix Processes," Proc. IEEE Symp. Research in Security and Privacy (RSP '96), pp. 120-128, 1996.
[21] Y. Gu, A. McCallum, and D. Towsley, "Detecting Anomalies in Network Traffic Using Maximum Entropy Estimation," Proc. Internet Measurement Conf. (IMC '05), pp. 345-350, USENIX Assoc., 2005.
[22] K.K. Gupta, B. Nath, and R. Kotagiri, "Network Security Framework," Int'l J. Computer Science and Network Security, vol. 6, no. 7B, pp. 151-157, 2006.
[23] K.K. Gupta, B. Nath, and R. Kotagiri, "Conditional Random Fields for Intrusion Detection," Proc. 21st Int'l Conf. Advanced Information Networking and Applications Workshops (AINAW '07), pp. 203-208, 2007.
[24] K.K. Gupta, B. Nath, R. Kotagiri, and A. Kazi, "Attacking Confidentiality: An Agent Based Approach," Proc. IEEE Int'l Conf. Intelligence and Security Informatics (ISI '06), vol. 3975, pp. 285-296, 2006.
[25] C. Ji and S. Ma, "Combinations of Weak Classifiers," IEEE Trans. Neural Networks, vol. 8, no. 1, pp. 32-42, 1997.
[26] D.S. Kim and J.S. Park, "Network-Based Intrusion Detection with Support Vector Machines," Proc. Information Networking, Networking Technologies for Enhanced Internet Services Int'l Conf. (ICOIN '03), pp. 747-756, 2003.
[27] D. Klein and C.D. Manning, "Conditional Structure versus Conditional Estimation in NLP Models," Proc. ACL Conf. Empirical Methods in Natural Language Processing (EMNLP '02), vol. 10, pp. 9-16, Assoc. for Computational Linguistics, 2002.
[28] C. Kruegel, D. Mutz, W. Robertson, and F. Valeur, "Bayesian Event Classification for Intrusion Detection," Proc. 19th Ann. Computer Security Applications Conf. (ACSAC '03), pp. 14-23, 2003.
[29] J. Lafferty, A. McCallum, and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," Proc. 18th Int'l Conf. Machine Learning (ICML '01), pp. 282-289, 2001.
[30] W. Lee and S. Stolfo, "Data Mining Approaches for Intrusion Detection," Proc. Seventh USENIX Security Symp. (Security '98), pp. 79-94, 1998.
[31] W. Lee, S. Stolfo, and K. Mok, "Mining Audit Data to Build Intrusion Detection Models," Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining (KDD '98), pp. 66-72, 1998.
[32] W. Lee, S. Stolfo, and K. Mok, "A Data Mining Framework for Building Intrusion Detection Model," Proc. IEEE Symp. Security and Privacy (SP '99), pp. 120-132, 1999.
[33] A. McCallum, "Efficiently Inducing Features of Conditional Random Fields," Proc. 19th Ann. Conf. Uncertainty in Artificial Intelligence (UAI '03), pp. 403-410, 2003.
[34] A. McCallum, D. Freitag, and F. Pereira, "Maximum Entropy Markov Models for Information Extraction and Segmentation," Proc. 17th Int'l Conf. Machine Learning (ICML '00), pp. 591-598, 2000.
[35] A.K. McCallum, MALLET: A Machine Learning for Language Toolkit, http://mallet.cs.umass.edu, 2010.
[36] L. Portnoy, E. Eskin, and S. Stolfo, "Intrusion Detection with Unlabeled Data Using Clustering," Proc. ACM Workshop Data Mining Applied to Security (DMSA), 2001.
[37] A. Ratnaparkhi, "A Maximum Entropy Model for Part-of-Speech Tagging," Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP '96), pp. 133-142, Assoc. for Computational Linguistics, 1996.
[38] M. Sabhnani and G. Serpen, "Application of Machine Learning Algorithms to KDD Intrusion Detection Dataset within Misuse Detection Context," Proc. Int'l Conf. Machine Learning, Models, Technologies and Applications (MLMTA '03), pp. 209-215, 2003.
[39] H. Shah, J. Undercoffer, and A. Joshi, "Fuzzy Clustering for Intrusion Detection," Proc. 12th IEEE Int'l Conf. Fuzzy Systems (FUZZ-IEEE '03), vol. 2, pp. 1274-1278, 2003.
[40] C. Sutton and A. McCallum, "An Introduction to Conditional Random Fields for Relational Learning," Introduction to Statistical Relational Learning, 2006.
[41] E. Tombini, H. Debar, L. Me, and M. Ducasse, "A Serial Combination of Anomaly and Misuse IDSes Applied to HTTP Traffic," Proc. 20th Ann. Computer Security Applications Conf. (ACSAC '04), pp. 428-437, 2004.
[42] W. Wang, X.H. Guan, and X.L. Zhang, "Modeling Program Behaviors by Hidden Markov Models for Intrusion Detection," Proc. Int'l Conf. Machine Learning and Cybernetics (ICMLC '04), vol. 5, pp. 2830-2835, 2004.
[43] C. Warrender, S. Forrest, and B. Pearlmutter, "Detecting Intrusions Using System Calls: Alternative Data Models," Proc. IEEE Symp. Security and Privacy (SP '99), pp. 133-145, 1999.
[44] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.
[45] Y.-S. Wu, B. Foo, Y. Mei, and S. Bagchi, "Collaborative Intrusion Detection System (CIDS): A Framework for Accurate and Efficient IDS," Proc. 19th Ann. Computer Security Applications Conf. (ACSAC '03), pp. 234-244, 2003.
[46] Z. Zhang, J. Li, C.N. Manikopoulos, J. Jorgenson, and J. Ucles, "HIDE: A Hierarchical Network Intrusion Detection System Using Statistical Preprocessing and Neural Network Classification," Proc. IEEE Workshop Information Assurance and Security (IAW '01), pp. 85-90, 2001.
Kapil Kumar Gupta received the BTech degree
in computer science and engineering from the
Guru Gobind Singh Indraprastha (GGSIP) University,
Delhi, India, in 2004. He worked for a
year at HCL Technologies, Noida, India. He is
currently a PhD student in the Department of
Computer Science and Software Engineering,
The University of Melbourne, Parkville, Australia.
His research interests include intrusion detection,
network security, data security and data
privacy, machine learning, data mining, and artificial intelligence.
Baikunth Nath received the MA degree from
Punjab University, Chandigarh, India and the
PhD degree from the University of Queensland,
Brisbane, Australia. He was with Monash University
for more than 25 years in various senior
positions including the director of research in the
Gippsland School of IT. In 2001, he joined the
Department of Computer Science and Software
Engineering, The University of Melbourne, Parkville,
Australia, as an associate professor and
the director of postgraduate studies. His research interests include
image processing, intrusion detection, scheduling, optimization, data
mining, evolutionary computing, neural networks, financial forecasting,
and operations research. He is the author of numerous research
publications in various well-reputed international journals and
conference proceedings. He is a senior member of the IEEE.
Ramamohanarao (Rao) Kotagiri received the
BE degree from Andhra University, the ME
degree from the Indian Institute of Science
(IISc), Bangalore, India, and the PhD degree
from Monash University. He was awarded the
Alexander von Humboldt Fellowship in 1983. He
joined the University of Melbourne in 1980 and
was appointed as a professor in computer
science in 1989. He has held several senior
positions including head of Computer Science
and Software Engineering, head of the School of Electrical Engineering
and Computer Science, deputy director of the Centre for Ultra
Broadband Information Networks, codirector of the Key Centre for
Knowledge-Based Systems, and research director for the Cooperative
Research Centre for Intelligent Decision Systems, The University of
Melbourne. He served as a member of the ARC Information Technology
Panel. He also served on the Prime Minister's Science, Engineering and
Innovation Council Working Party on Data for Scientists. He is currently
the associate dean for research in the Faculty of Engineering, The
University of Melbourne. He is on the editorial boards of the Universal
Computer Science, Journal of Knowledge and Information Systems,
IEEE Transactions on Knowledge and Data Engineering (TKDE),
Journal of Statistical Analysis and Data Mining, and Very Large Data
Bases (VLDB) Journal. He served as a program committee member of
numerous international conferences including the International
Conference on Management of Data (SIGMOD), International Conference
on Very Large Data Bases (VLDB), International Conference on Logic
Programming (ICLP), and International Conference on Data Engineering
(ICDE). He is a steering committee member of the IEEE International
Conference on Data Mining (ICDM), Pacific-Asia Conference on
Knowledge Discovery and Data Mining (PAKDD), and International
Conference on Database Systems for Advanced Applications
(DASFAA). He was the program cochair for VLDB, PAKDD, DASFAA,
and the International Conference on Deductive and Object-Oriented
Databases (DOOD). His research interests include database systems,
logic-based systems, agent-oriented systems, information retrieval, data
mining, intrusion detection, and machine learning. He has published
widely in conference proceedings and international journals. He is a
fellow of the Institute of Engineers Australia, Australian Academy of
Technological Sciences and Engineering, and Australian Academy of
Science. He is a member of the IEEE.