
Proceedings of AIAI2010

HIGHLY ACCURATE INTERNET TRAFFIC
CLASSIFICATION BASED ON CO-TRAINING SEMI-
SUPERVISED CLUSTERING
Xiang Li, Feng Qi, Likun Yu, Xuesong Qiu
State Key Laboratory of Networking and Switching Technology,
Beijing University of Posts and Telecommunications, Beijing, China
{lxxuanyuan, alabanxia}@bupt.cn,{qifeng, xsqiu}@bupt.edu.cn

Abstract

Currently, the popular methods of network traffic classification are classification based on payload and on supervised or unsupervised machine learning algorithms. In practical flow classification, however, traditional methods face growing challenges as applications multiply and labeled flows become difficult to obtain. This paper proposes a traffic classification method based on co-training semi-supervised clustering. The method uses a few labeled flows and two classifiers based on different feature evaluation metrics to achieve high classification performance. Finally, we capture data from a campus backbone and use open-source tools to implement the experiment, which shows higher accuracy, precision and recall than other classic clustering methods (such as K-means, DBSCAN and two-layer semi-supervised clustering).

Keywords: Internet Traffic; Network Traffic Classification; Machine Learning; Semi-Supervised; Clustering; Co-training

1 Introduction

A variety of network applications are running on today's networks, and new applications keep emerging. Application-layer traffic classification is the premise and basis of identifying network applications; it helps to analyze trends, control dynamic access, differentiate services, detect intrusions, monitor traffic, manage billing and analyze user behavior. It is also an important reference for network security and traffic engineering [1, 2].
In the early Internet, the traditional traffic classification method was to detect the port numbers registered with the IANA [3]. But some applications (such as P2P) now use dynamic port numbers to evade firewall restrictions and detection, which makes port-based traffic classification difficult and inaccurate [4, 5].
Another practicable traffic classification method is based on payload [6, 7]: it determines whether specific signatures of known applications are present by analyzing the payload of packets. Studies show that this method can identify different applications, even P2P, but it not only intrudes on individuals' private information and wastes bandwidth, it also cannot identify applications that encrypt their data.
Traffic classification based on machine learning has been considered the most promising approach and has attracted wide attention from scholars [8, 9, 10]. For machine-learning-based identification, the learning algorithm is the key to improving classification accuracy. Current studies focus on supervised and unsupervised learning algorithms. Methods based on supervised learning manually label samples and then model all samples, which not only involves a huge workload, but also depends on an understanding of the samples and is unable to identify unknown applications. Methods based on unsupervised learning can find the structural knowledge hidden in the training cases by learning from unlabeled samples. Although this approach does not require labeling the samples, it yields lower classification accuracy and a more difficult training process.
This paper proposes a method based on co-training semi-supervised clustering to classify Internet flows. Semi-supervised machine learning trains classifiers on many unlabeled samples with the help of a few labeled training samples, which reduces cost. In addition, the semi-supervised method utilizes the labeled flows to improve clustering performance, so its accuracy is higher than that of other classical methods.
The remainder of this paper is organized as follows: related work is presented in Section 2. The algorithm and method are described in Section 3. Section 4 introduces data set collection and preprocessing. The experiments and test results are presented in Section 5. Finally, Section 6 concludes the paper and outlines future work.
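The port-based baseline discussed above amounts to a table lookup on the flow's endpoints. A minimal sketch (the port table below is a small illustrative subset, not the full IANA registry):

```python
# Map a few well-known IANA-registered ports to the traffic classes
# used later in this paper. Illustrative subset only.
KNOWN_PORTS = {
    80: "WWW", 443: "WWW",
    21: "FTP", 53: "DNS",
    25: "Mail", 110: "Mail", 143: "Mail",
    22: "Interactive", 23: "Interactive",
}

def classify_by_port(src_port: int, dst_port: int) -> str:
    """Return the application class of a flow, or 'Unknown'.

    Both endpoints are checked, since the well-known port may sit on
    either side of the flow.
    """
    for port in (src_port, dst_port):
        if port in KNOWN_PORTS:
            return KNOWN_PORTS[port]
    return "Unknown"

print(classify_by_port(51324, 443))   # a typical HTTPS flow -> WWW
print(classify_by_port(51324, 6881))  # a BitTorrent-style port -> Unknown
```

As the introduction notes, this lookup fails exactly where ports are dynamic: the second call returns "Unknown" even though the flow is P2P.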
2 Related work

Although port-based traffic classification is the fastest and easiest method, studies have shown that its performance is poor [4, 5]. Measurement results show it cannot determine the specific situation of a given application, because port-based identification has been increasingly circumvented in recent years.
Payload-based traffic classification inspects the content of data packets to determine the true application. The method produces very accurate classification if a set of unique signature payloads is provided for each application, and it is often used in commercial bandwidth management and intrusion detection tools [5, 6]. However, this method has high computational complexity, involves user privacy, and cannot identify encrypted data flows.
Supervised learning has become one of the most studied traffic classification approaches. [13] used Nearest Neighbor and Linear Discriminant Analysis to map different applications to different QoS levels. [14] studied network application identification based on Naive Bayes. [15] identified application protocols through simple statistical fingerprints. [9] compared the performance of several supervised learning algorithms.
Unsupervised learning uses only unlabeled data, while labeled data serves as test data to evaluate learning performance. [13] constructed a flexible traffic generator through a clustering method based on flow communication patterns. [16] used Expectation Maximization to classify flows into different applications. [17] used Sequential Forward Selection and AutoClass clustering to identify applications, and was among the earliest to consider optimization of the feature set.
Semi-supervised classification is an emerging learning mechanism that has only recently been applied to network traffic classification. [19] first applied semi-supervised learning to network application identification, but used only a single clustering algorithm, without comparison against other algorithms.

3 Algorithm and method

3.1 Unsupervised Clustering

The K-means algorithm proposed by MacQueen is a well-known unsupervised clustering algorithm. It randomly initializes k cluster centers and aims to partition n observations into k clusters so that every observation belongs to the cluster whose center is nearest. The cluster centers are updated as observations are assigned, until they no longer change. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data.
First, k data objects are chosen from all n objects as the initial cluster centers. Second, according to their similarity (distance) to the cluster centers, the remaining objects are assigned to their most similar clusters (represented by the cluster centers). Third, the center of each cluster is recalculated as the mean of all objects in the cluster. The algorithm repeats this process until the standard measure function converges; generally the variance is adopted as the measure function. K-means clustering has the following characteristic: each cluster is as compact as possible, while the clusters are well separated from each other. The objective function of the clustering is:

J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where X = {x_1, ..., x_n} is the data set, C_1, ..., C_k are the k clusters, and \mu_1, ..., \mu_k are the k cluster centers.

3.2 Feature Selection Algorithm

Feature selection removes irrelevant or redundant features from the candidate feature set and selects an optimal feature subset while maintaining (or at least not reducing) classification accuracy. Feature selection methods are divided into filter mode and wrapper mode according to the relationship between the evaluation function and the classifier. In order to meet the independence requirement of the two views in co-training, we use two filter-mode feature evaluation metrics, Correlation-based Feature Selection (CFS) [20] and information gain (IG) [21], to obtain a streamlined feature set.

3.3 Co-training

The co-training algorithm is a semi-supervised learning technique [11, 12] that describes objects from multiple views using two independent, complete feature sets. In each round of co-training, every classifier selects and labels the samples with the highest confidence among the unlabeled samples, and then adds these newly labeled samples to the labeled training set of the other classifier, helping the other classifier update with the new labeled samples. The process iterates until a stopping condition is met. Under ideal conditions, co-training requires the two views to be independent and each feature set to be able to train a strong classifier. In this paper, we use the feature selection algorithm based on two evaluation metrics to obtain two feature sets, which largely satisfies the input requirement of the co-training semi-supervised algorithm.
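The agreement-based labeling loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: simple nearest-centroid classifiers and two disjoint feature-index sets (`view_a`, `view_b`) stand in for the paper's k-means classifiers over the IG and CFS subsets, and all names are ours.

```python
def centroid_classifier(labeled, view):
    """Train a nearest-centroid classifier on one view (a list of
    feature indices): one centroid per class, nearest centroid wins."""
    sums, counts = {}, {}
    for x, y in labeled:
        proj = [x[i] for i in view]
        s = sums.setdefault(y, [0.0] * len(view))
        for j, v in enumerate(proj):
            s[j] += v
        counts[y] = counts.get(y, 0) + 1
    cents = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x):
        proj = [x[i] for i in view]
        return min(cents, key=lambda y: sum((a - b) ** 2
                                            for a, b in zip(proj, cents[y])))
    return predict

def co_train(labeled, unlabeled, view_a, view_b, rounds=10):
    """Each round: train one classifier per view, then promote every
    unlabeled sample on which the two classifiers agree into the
    labeled pool; stop when nothing changes."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        h1 = centroid_classifier(labeled, view_a)
        h2 = centroid_classifier(labeled, view_b)
        still_unlabeled = []
        for x in unlabeled:
            c1, c2 = h1(x), h2(x)
            if c1 == c2:                    # confident agreement: promote
                labeled.append((x, c1))
            else:
                still_unlabeled.append(x)
        if len(still_unlabeled) == len(unlabeled):  # no progress: stop
            break
        unlabeled = still_unlabeled
    return labeled
```

With two seed flows and three unlabeled four-feature flows, `co_train(labeled, unlabeled, [0, 1], [2, 3])` grows the labeled pool wherever the two views agree on the class.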
3.4 Co-training Semi-supervised Clustering Algorithm

Algorithm CLFS: Semi-Supervised Clustering Traffic Classification
Input: data set X, labeled flows L, unlabeled flows U
Output: k disjoint clusters C in data set X
begin
    Gain feature set F1 with IG from the full feature set;
    Gain feature set F2 with CFS from the full feature set;
    while (not IsSteady(L))
        Train classifier h1 on L with F1 by k-means;
        Train classifier h2 on L with F2 by k-means;
        Choose a set S of flows from U;
        for (each flow x in S)
            c1 = Classify(h1, x);
            c2 = Classify(h2, x);
            if (c1 == c2)
                Label x and move it to L;
            end if
        end for
    end while
    return C;
end

Our Semi-Supervised Clustering Traffic Classification (CLFS) algorithm is based on co-training semi-supervised clustering. Step 1 of Algorithm CLFS utilizes the two feature evaluation metrics, IG and CFS, to gain two feature subsets that are approximately mutually independent, and then trains two classifiers with the co-training algorithm. Step 2 classifies unlabeled flows into the corresponding labeled flow clusters.

4 Data sets and experiment approach

4.1 Classification Object

From the perspective of resource utilization and QoS requirements, network applications are usually divided into a few categories. A typical classification is based on application characteristics. Table 1 shows 10 categories, including the unknown-application category, with example applications. We leave finer-grained application-layer traffic classification to future work.

Table 1. Internet traffic categories
Class        Representative Application/Protocol
WWW          http, https
FTP          ftp
DNS          dns
Mail         smtp, pop3, imap
Multimedia   voice, video streaming
Interactive  ssh, telnet, rlogin
Chat         qq, msn, yahoo
P2P          Kazaa, BitTorrent, Gnutella, Thunder, uTorrent
Game         WoW, WarCraft, Half-Life
Unknown      -

4.2 Data Sets

The data sets used in this work are described in this section. To facilitate our work, we use the Jpcap open-source toolkit, based on Winpcap/Libpcap, to collect data on the university backbone. Because the five-tuple determines a unique flow, we consider packets with the same five-tuple within a close interval to belong to the same flow. The data packets are first divided into unidirectional flows according to the five-tuple, and the unidirectional flows are then combined into bidirectional flows. Although flow statistics are used, we capture the complete packet contents, because the application-layer information is needed to determine the true categories of the flows during later analysis and training. Table 2 shows the data sets traced in the campus network.

Table 2. Data set for network flow experiment (Campus Traces)
Traffic Class  Bytes    Number of Packets  Number of Flows
WWW            2.92GB   7,538,462          63,406
FTP            6.33GB   12,198,334         9,847
DNS            0.85GB   2,913,896          22,485
Mail           0.48GB   1,371,355          13,049
Multimedia     4.73GB   7,080,415          3,520
Interactive    0.01GB   11,207             227
Chat           1.14GB   2,730,304          26,741
P2P            14.7GB   31,023,819         31,397
Game           1.36GB   3,890,237          21,482
Other          2.31GB   6,893,572          35,123
Total          34.47GB  75,651,601         227,277

Because of our limited disk space for complete packet capture, we collected flow data for one hour a day over a week from a Gbps Ethernet link of the campus network. We applied a filter to the captured traffic, collecting TCP and UDP packets with payload at the network layer.

4.3 Flow feature definitions

The flow features should preferably be discriminative and low-cost, obtaining maximum class separation at minimal cost. At the same time, flow feature selection is also restricted by the actual IP network resources. In our setting, low-level capture based on Libpcap can obtain the required packet data.
We selected 30 flow features according to the above criteria, drawn from the 248 bidirectional flow features of [22]. The 30 flow features, which we refer to as the full feature set, are listed in Table 3.
Our features are simple and well understood because they do not require payload. They represent a reasonable benchmark feature set to which more complex features might be added in the future.

Table 3. The full flow features
Bidirectional flow features:
The protocol (TCP or UDP)
The flow duration
Total number of packets in the flow
The average packet size of the flow
The version
The variance of the window size
The ratio of sent to received packets (by count)
The ratio of sent to received bytes
Unidirectional flow features (send or arrival):
Port
Flow volume in bytes and packets
Packet length (minimum, mean, maximum and variance)
Inter-arrival time between packets (minimum, mean, maximum and variance)

Figure 1. Influence of the size of training set on classification accuracy.
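The five-tuple aggregation of Section 4.2 and a few of the Table 3 statistics can be sketched as follows. This is an illustration only: Jpcap is a Java library, so the Python below merely mirrors the logic, and the packet-record fields (`src_ip`, `ts`, `length`, ...) are an assumed format, not Jpcap's API.

```python
from collections import defaultdict

def canonical_key(pkt):
    """Bidirectional flow key: the five-tuple with the two endpoints
    ordered, so both directions of a flow map to the same key."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    lo, hi = (a, b) if a <= b else (b, a)
    return (pkt["proto"],) + lo + hi

def flow_features(packets):
    """Aggregate packet records into bidirectional flows and compute a
    few of the Table 3 statistics for each flow."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[canonical_key(pkt)].append(pkt)
    feats = {}
    for key, pkts in flows.items():
        sizes = [p["length"] for p in pkts]
        times = [p["ts"] for p in pkts]
        n = len(sizes)
        mean = sum(sizes) / n
        feats[key] = {
            "proto": key[0],
            "duration": max(times) - min(times),
            "num_packets": n,
            "mean_pkt_len": mean,
            "min_pkt_len": min(sizes),
            "max_pkt_len": max(sizes),
            "var_pkt_len": sum((s - mean) ** 2 for s in sizes) / n,
        }
    return feats
```

Feeding in packets from both directions of one TCP connection yields a single bidirectional flow record, as in Section 4.2.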

5 Traffic Classification Experiment Based on Semi-supervised Clustering

In this section, we test and compare the K-means algorithm, the two-layer semi-supervised clustering algorithm [19], the DBSCAN algorithm [18] and the semi-supervised clustering algorithm proposed in this paper. The experiment is implemented with a program developed on top of WEKA, a popular machine learning tool. The DBSCAN algorithm is sensitive to its input parameters, Eps and minPts, because different input parameters may lead to different clustering results. We therefore set minPts = 4 and Eps = 0.04 based on [18] and our experience.
We run the SFS algorithm [20] on the data set with the two feature evaluation metrics. Table 4 shows the selection results. The elements of the two feature sets are very different, which satisfies our training requirement for the co-training semi-supervised algorithm.

Table 4. The feature subsets according to the CFS and IG methods
CFS subset: protocol, duration, averpacknum, arport, flowbyte, seminpkl, semeanpkl, arvarpkl, seminibp, arvaribp
IG subset: version, duration, varofws, nrofsrp, brofsrp, arport, flowpack, armeanpkl, armaxpkl, arvarpkl, semeanibp, sevaribp

5.1 Accuracy

In this section, we analyze how the overall accuracy of each clustering classifier changes with the size of the training set. Figure 1 shows the result using 5000 labeled flows. The overall accuracy of our semi-supervised clustering algorithm is the highest, followed by the two-layer semi-supervised clustering algorithm [19]. This is because in our method the two classifiers label flows for each other, extending the set of labeled flows, whereas the method of [19] only maps clusters to applications.

5.2 Precision and Recall

In this section, we discuss the precision and recall of the classifiers. We obtain the overall accuracy and the mean per-class recall/precision rates after each test. From Figure 2 and Figure 3 we can see that the precision and recall of the semi-supervised method improve evidently, ahead of the other clustering algorithms for most applications.

Figure 2. Precision per application: (a) WWW, FTP, DNS, Mail and Multimedia; (b) Interactive, Chat, P2P, Game and Unknown.
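The metrics compared throughout Section 5 reduce to simple counts over true and predicted class labels: accuracy over all flows, and precision(c) = TP/(TP+FP) and recall(c) = TP/(TP+FN) per class. A minimal sketch:

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Overall accuracy plus per-class precision and recall, the three
    criteria compared in Section 5."""
    tp = Counter()    # flows correctly predicted as class c
    pred = Counter()  # flows predicted as class c (TP + FP)
    true = Counter()  # flows truly of class c (TP + FN)
    for t, p in zip(y_true, y_pred):
        pred[p] += 1
        true[t] += 1
        if t == p:
            tp[t] += 1
    accuracy = sum(tp.values()) / len(y_true)
    precision = {c: tp[c] / pred[c] for c in pred}
    recall = {c: tp[c] / true[c] for c in true}
    return accuracy, precision, recall

acc, prec, rec = per_class_metrics(
    ["WWW", "WWW", "P2P", "DNS"],
    ["WWW", "P2P", "P2P", "DNS"])
print(acc)           # 0.75: three of four flows correct
print(prec["P2P"])   # 0.5: one of two P2P predictions is right
print(rec["WWW"])    # 0.5: one of two true WWW flows is found
```

Per-application bars such as those in Figures 2 and 3 are exactly these `precision`/`recall` dictionaries, one value per traffic class.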
Figure 3. Recall per application: (a) WWW, FTP, DNS, Mail and Multimedia; (b) Interactive, Chat, P2P, Game and Unknown.

5.3 The Relationship between Accuracy and Labeled Flows

Figure 4 shows the relationship between accuracy and the fraction of initially labeled flows among all flows for the two semi-supervised methods. The initially labeled flows improve accuracy greatly. But because labeled flows are expensive in practice, we must trade off cost against accuracy. Meanwhile, the semi-supervised method is found to be better than the two-layer semi-supervised clustering of [19] under the same conditions.

Figure 4. Influence of labeled flows on classification.

6 Conclusions and future work

This paper proposed and evaluated a semi-supervised clustering method based on flow statistics for classifying Internet traffic. The method makes use of some known flows and two classifiers (based on the IG and CFS evaluation metrics) to improve classification accuracy, so the performance of our classifier is better than K-means clustering, DBSCAN and two-layer semi-supervised clustering. Through the experiments, we find that co-training semi-supervised clustering achieves higher overall accuracy than the other three clustering methods; its precision and recall are also better than those of the other classical methods. Moreover, the experimental results show that the fraction of initially labeled flows has great influence on the accuracy of the semi-supervised classifier.
Co-training is a classical method in semi-supervised machine learning, but some important problems remain to be resolved (e.g., whether the two feature subsets are approximately sufficient and redundant). Our future work will study and improve other semi-supervised methods for Internet traffic classification.

Acknowledgements

This work was supported by the 973 project of China (2007CB310703), the Funds for Creative Research Groups of China (60821001), NSFC (60973108) and the National S&T Major Project (2009ZX03004-003-03).

References
[1] CAIDA: research: traffic analysis: classification overview. http://www.caida.org/research/traffic-analysis/classification-overview/.
[2] S. Sen, J. Wang, Analyzing Peer-to-Peer Traffic across Large Networks, IEEE/ACM Transactions on Networking, 2004.
[3] IANA, Internet Assigned Numbers Authority. http://www.iana.org/assignments/port-numbers.
[4] H. Dreger, A. Feldmann, M. Mai, V. Paxson, and R. Sommer, Dynamic application-layer protocol analysis for network intrusion detection, USENIX Security Symposium, July 2006.
[5] J. Erman, A. Mahanti, Byte Me: A Case for Byte Accuracy in Traffic Classification, ACM SIGMETRICS MineNet Workshop, June 2007.
[6] P. Haffner, S. Sen, O. Spatscheck, ACAS: Automated construction of application signatures, SIGCOMM MineNet Workshop, 2005.
[7] F. Risso, M. Baldi, O. Morandi, Lightweight, Payload-Based Traffic Classification: An Experimental Evaluation, IEEE ICC, 2008.
[8] N. Williams, S. Zander, G. Armitage, A Preliminary Performance Comparison of Five Machine Learning Algorithms for Practical IP Traffic Flow Classification, Computer Communication Review, 2006.
[9] H. Kim, K. Claffy, M. Fomenkov, Internet Traffic Classification Demystified: Myths, Caveats, and the Best Practices, CoNEXT, 2008.
[10] T. Nguyen and G. Armitage, A Survey of Techniques for Internet Traffic Classification using Machine Learning, IEEE Communications Surveys and Tutorials, 2008.
[11] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, COLT, 1998.
[12] O. Chapelle, B. Schölkopf, A. Zien, eds., Semi-Supervised Learning, Cambridge, MA: MIT Press, 2006.
[13] F. Hernández-Campos, F. D. Smith, Statistical Clustering of Internet Communications Patterns, Computing Science and Statistics, 2003.
[14] A. W. Moore and D. Zuev, Internet traffic classification using Bayesian analysis techniques, Proc. ACM SIGMETRICS, June 2005.
[15] M. Crotti, M. Dusi, F. Gringoli, Traffic classification through simple statistical fingerprinting, SIGCOMM Computer Communication Review, 2007.
[16] A. McGregor, M. Hall, P. Lorier, Flow clustering using machine learning techniques, PAM, 2004.
[17] S. Zander, T. Nguyen, and G. Armitage, Automated traffic classification and application identification using machine learning, LCN, 2005.
[18] J. Erman, M. Arlitt, and A. Mahanti, Traffic classification using clustering algorithms, MineNet '06: Proceedings of the 2006 SIGCOMM Workshop on Mining Network Data.
[19] J. Erman, A. Mahanti, M. Arlitt, Offline/Realtime Traffic Classification Using Semi-Supervised Learning, IFIP Performance, October 2007.
[20] H. Liu, L. Yu, Toward integrating feature selection algorithms for classification and clustering, IEEE Transactions on Knowledge and Data Engineering, 2005.
[21] L. Yu, H. Liu, Efficient feature selection via analysis of relevance and redundancy, Journal of Machine Learning Research, 2004.
[22] A. W. Moore and D. Zuev, Discriminators for use in flow-based classification, Technical report, Intel Research, Cambridge, 2005.