CLASSIFICATION BASED ON CO-TRAINING SEMI-SUPERVISED CLUSTERING

Xiang Li, Feng Qi, Likun Yu, Xuesong Qiu
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
{lxxuanyuan, alabanxia}@bupt.cn, {qifeng, xsqiu}@bupt.edu.cn
Abstract

Currently the popular methods of network traffic classification are based on payload inspection and on supervised or unsupervised machine learning. In practical flow classification, however, these traditional methods face more and more challenges, because new applications keep appearing and labeled flows are difficult to obtain. This paper proposes a traffic classification method based on co-training semi-supervised clustering. The method uses a few labeled flows and two classifiers built on different feature evaluation metrics to achieve high classification performance. Finally, we capture traffic from the campus backbone and implement the experiment with open-source tools; the results show higher accuracy, precision and recall than other classic clustering methods (such as K-means, DBSCAN and two-layer semi-supervised clustering).

Keywords: Internet Traffic; Network Traffic Classification; Machine Learning; Semi-Supervised; Clustering; Co-training

1 Introduction

A variety of network applications run on today's networks, and new applications are still emerging. Application-layer traffic classification is the premise and basis of identifying network applications: it helps to analyze trends, control dynamic access, differentiate services, detect intrusions, monitor traffic, manage billing and analyze user behavior. Moreover, it is also an important reference for network security and traffic engineering [1, 2].

In the early Internet, the traditional classification method was to inspect the port numbers registered with IANA [3]. But some applications (such as P2P) have started using dynamically changing port numbers to evade the restrictions and detection of firewalls, which makes port-based traffic classification difficult and inaccurate [4, 5].

Another practicable traffic classification method is based on payload [6, 7], which determines whether the specific signatures of known applications are contained in a packet by analyzing its payload. Studies show that this method is able to identify different applications, including P2P, but it involves individuals' private information, wastes bandwidth, and cannot identify applications that encrypt their data.

The study of traffic classification based on machine learning has been considered the most promising direction and has attracted wide attention from scholars [8, 9, 10]. For machine learning identification methods, the learning algorithm is the key to classification accuracy. Current studies focus on supervised and unsupervised learning algorithms. Methods based on supervised learning manually label training samples and then model all samples; this not only involves a huge labeling workload, but also depends on prior understanding of the samples and is unable to identify unknown applications. Methods based on unsupervised learning can find the structural knowledge hidden in training cases by learning from unlabeled samples. Although such methods do not require labeling the samples, they have lower classification accuracy and more difficult training processes.

This paper proposes a method based on co-training semi-supervised clustering to classify Internet flows. Semi-supervised machine learning uses many unlabeled samples to train classifiers with the help of a few labeled training samples, which reduces labeling cost. In addition, the semi-supervised method utilizes some labeled flows to improve clustering performance, so its accuracy is higher than that of other classical methods.

The remainder of this paper is organized as follows: related work is presented in Section 2. The algorithm and method are described in Section 3. Section 4 introduces data set collection and pretreatment. The experiments and testing results are presented in Section 5. In the end, Section 6 concludes the paper and forecasts future work.
2 Related work

Although port-based traffic classification is the fastest and easiest method, studies have shown that it performs poorly [4, 5]. Measurement results show it cannot determine the specific situation of a given application, because port-based identification has become increasingly restricted in recent years.

Payload-based traffic classification detects the content of data packets to determine the application. The method produces very accurate classification if a set of unique payload signatures is provided for each application program, and it is often used in commercial bandwidth management and intrusion detection tools [5, 6]. However, this method has high computational complexity, involves users' privacy, and cannot identify encrypted data flows.

Supervised learning has become one of the most studied traffic classification approaches. [13] made use of Nearest Neighbor and Linear Discriminant Analysis to map different applications to different QoS levels. [14] studied network application identification based on Naive Bayes. [15] identified application protocols through simple statistical fingerprints. [9] compared the performance of several supervised learning algorithms.

Unsupervised learning methods use only unlabeled data, while labeled data is reserved for testing learning performance. [13] constructed a flexible traffic generator through a clustering method based on flow communication patterns. [16] used Expectation Maximization to classify flows into different applications. [17] used Sequential Forward Selection and AutoClass clustering to identify applications, one of the earliest works to consider optimization of feature sets.

Semi-supervised classification is an emerging learning mechanism of recent years, which has only just been applied to network traffic classification. [19] first applied semi-supervised learning to network application identification, but the authors used only one clustering algorithm and did not compare it with other algorithms.

3 Algorithm and method

3.1 Unsupervised Clustering

The K-means algorithm proposed by MacQueen is a well-known unsupervised clustering algorithm. It randomly initializes k cluster centers and aims to partition n observations into k clusters, so that every observation belongs to the cluster whose center is nearest. The cluster centers are updated as observations are assigned, until they no longer change. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data.

Firstly, k data objects are chosen from the n objects as initial cluster centers. Secondly, according to their similarity (distance) to the cluster centers, the other objects are assigned to their most similar clusters (represented by the cluster centers). Thirdly, the new center of each cluster is computed again as the mean of all objects in the cluster. The algorithm repeats this process until the standard measure function converges; generally the variance is adopted as the measure function. K-means clustering has the following characteristic: every cluster itself is as compact as possible, while clusters are separated from each other. The objective function of the clustering is:

J = sum_{i=1}^{k} sum_{x in C_i} ||x - mu_i||^2

where X is the data set, C_1, ..., C_k are the k clusters, and mu_1, ..., mu_k are the k cluster centers.

3.2 Feature Selection Algorithm

Feature selection removes irrelevant or redundant features from the candidate feature set and selects an optimal feature subset under the condition of maintaining, or at least not reducing, classification accuracy. Feature selection methods are divided into filter mode and wrapper mode based on the relationship between the evaluation function and the classifier. In order to meet the independence requirement of the two views in co-training, we use two filter-mode feature evaluation metrics, Correlation-based Feature Selection (CFS) [20] and information gain (IG) [21], to obtain streamlined feature sets.

3.3 Co-training

Co-training is a semi-supervised learning technique [11, 12] which uses two independent, complete feature sets to describe objects from multiple views. In the process of co-training, each classifier selects and labels several samples with the highest degree of confidence from the unlabeled samples, and then adds these labeled samples to the labeled training set of the other classifier, which helps the other classifier update itself with the new labeled samples. The co-training process iterates continuously until a stopping condition is met. Under ideal conditions, co-training requires the two views to be independent and each feature set to yield a strong classifier. In this paper, we use the feature selection algorithm based on the two evaluation metrics to obtain two feature sets, which satisfies the input requirement of the co-training semi-supervised algorithm to a large extent.

3.4 Co-training Semi-supervised Clustering Algorithm

Algorithm CLFS: Semi-Supervised Clustering Traffic Classification
Input: Data set X, labeled flows L, unlabeled flows U
Output: k disjoint clusters C in data set X
begin
  Gain feature set F1 with IG from the full feature set;
  Gain feature set F2 with CFS from the full feature set;
  while (not IsSteady(L))
    Train classifier h1 with F1 on L by k-means;
    Train classifier h2 with F2 on L by k-means;
    Choose flows U' from U;
    for (each x in U')
      y1 = Classify(h1, x);
      y2 = Classify(h2, x);
      if (y1 == y2)
        Label x and move x to L;
      end if
    end for
  end while
  return C;
end
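As an illustrative sketch only (not the WEKA/Jpcap implementation used in our experiments), the co-training procedure described in this section can be written in Python. The function names, the nearest-centroid classifiers (a simplified stand-in for the per-view k-means-trained classifiers), and the "both views agree" test (a stand-in for the confidence check) are all assumptions for the sake of the example; for simplicity the sketch scans all unlabeled flows each round instead of choosing a batch U'.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def fit_centroids(rows, labels):
    """Nearest-centroid 'classifier': mean feature vector per class."""
    sums, counts = {}, {}
    for row, lab in zip(rows, labels):
        acc = sums.setdefault(lab, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def predict(centroids, row):
    """Assign a flow to the class whose centroid is nearest."""
    return min(centroids, key=lambda lab: dist(row, centroids[lab]))

def project(row, view):
    """Restrict a feature vector to one view (a list of column indices)."""
    return [row[j] for j in view]

def co_train(X, y, labeled, view1, view2, max_iter=50):
    """Co-training loop: two views label unlabeled flows for each other.
    y[i] is the class of flow i, or None if the flow is unlabeled."""
    labeled = set(labeled)
    for _ in range(max_iter):
        idx = sorted(labeled)
        h1 = fit_centroids([project(X[i], view1) for i in idx], [y[i] for i in idx])
        h2 = fit_centroids([project(X[i], view2) for i in idx], [y[i] for i in idx])
        moved = False
        for i in range(len(X)):
            if i in labeled:
                continue
            c1 = predict(h1, project(X[i], view1))
            c2 = predict(h2, project(X[i], view2))
            if c1 == c2:        # both views agree -> accept the label
                y[i] = c1
                labeled.add(i)
                moved = True
        if not moved:           # IsSteady: no new flow was labeled this round
            break
    return y, labeled
```

With two well-separated classes and one labeled seed flow per class, the loop propagates labels to all remaining flows in a single round.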
We call this the Semi-Supervised Clustering Traffic Classification (CLFS) algorithm, based on co-training semi-supervised clustering. Step 1 of Algorithm CLFS utilizes the two feature evaluation metrics, IG and CFS, to gain two feature subsets of approximate mutual independence, and then trains two classifiers with the co-training algorithm. Step 2 classifies unlabeled flows into the corresponding labeled flow clusters.

4 Data sets and experiment approach

4.1 Classification Object

From the perspective of resource utilization and QoS requirements, network applications are usually divided into a few categories. A typical classification is based on application characteristics. Table 1 shows 10 categories, including unknown applications, with representative examples. We leave more accurate traffic classification at the application layer to future work.

Table 1. Internet traffic categories
Class        | Representative Application/Protocol
WWW          | http, https
FTP          | ftp
DNS          | dns
Mail         | smtp, pop3, imap
Multimedia   | voice, video streaming
Interactive  | ssh, telnet, rlogin
Chat         | qq, msn, yahoo
P2P          | Kazaa, BitTorrent, Gnutella, Thunder, uTorrent
Game         | WoW, WarCraft, Half-Life
Unknown      |

4.2 Data Sets

The data sets used in this work are described in this section. To facilitate our work, we use the Jpcap open-source toolkit, based on Winpcap/Libpcap, to collect data on the university backbone. Because a five-tuple uniquely determines a flow, we consider packets with the same five-tuple within a close interval to belong to the same flow. The data packets are first divided into uni-directional flows according to the five-tuple, and the uni-directional flows are then combined into bi-directional flows. Although the classification uses flow statistics, we capture the complete packet contents, because we need the application-layer information to determine the true categories of the flows in later analysis and training. Table 2 shows the data sets traced on the campus network.

Table 2. Data set for network flow experiment (Campus Traces)
Traffic Class | Bytes   | Number of Packets | Number of Flows
WWW           | 2.92GB  | 7,538,462         | 63,406
FTP           | 6.33GB  | 12,198,334        | 9,847
DNS           | 0.85GB  | 2,913,896         | 22,485
Mail          | 0.48GB  | 1,371,355         | 13,049
Multimedia    | 4.73GB  | 7,080,415         | 3,520
Interactive   | 0.01GB  | 11,207            | 227
Chat          | 1.14GB  | 2,730,304         | 26,741
P2P           | 14.7GB  | 31,023,819        | 31,397
Game          | 1.36GB  | 3,890,237         | 21,482
Other         | 2.31GB  | 6,893,572         | 35,123
Total         | 34.47GB | 75,651,601        | 227,277

Because of our limited disk space for complete packet capture, we took one hour per day over one week to collect flow data of the campus network from a Gbps Ethernet link. We adopt a filter on the data packets and collect only TCP and UDP packets carrying payload at the network layer.

4.3 Flow feature definitions

The flow features should preferably be discriminative and low-cost, obtaining maximum class separation with minimal cost. At the same time, flow feature selection is restricted by the actual IP network resources. In our selection, low-level acquisition based on Libpcap can obtain the necessary packet data.

We have selected 30 flow features according to the above standards, drawn from the 248 bi-directional flow features of [22]. The 30 flow features, which we refer to as the full feature set, are listed in Table 3.

Our features are simple and well understood because they do not need payload. They represent a reasonable benchmark feature set to which more complex features might be added in the future.

Table 3. The full flow features
Bi-directional flow features:
  The protocol (TCP or UDP)
  The flow duration
  Total number of packets in the flow
  The average packet size of the flow
  The version
  The variance of window size
  The number ratio of sent and received packets
  The byte ratio of sent and received packets
Uni-directional flow features (send or arrival):
  Port
  Flow volume in bytes and packets
  Packet length (minimum, mean, maximum and variance)
  Inter-arrival time between packets (minimum, mean, maximum and variance)
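To make the per-direction statistics concrete, the following Python sketch computes a subset of the Table 3 features from a list of packet records. The record format (timestamp, size, direction), the field names, and the assumption that both directions contain at least two packets are illustrative simplifications, not the Jpcap capture format used in our collection.

```python
from statistics import mean, pvariance

def flow_features(packets):
    """Compute a few Table 3 statistics for one bi-directional flow.
    `packets` is a list of (timestamp, size, direction) tuples, sorted by
    timestamp, where direction is 'send' or 'arrival'. Assumes each
    direction holds at least two packets (illustrative simplification)."""
    feats = {
        "duration": packets[-1][0] - packets[0][0],
        "total_packets": len(packets),
        "mean_packet_size": mean(size for _, size, _ in packets),
    }
    for direction in ("send", "arrival"):
        sizes = [s for _, s, d in packets if d == direction]
        times = [t for t, _, d in packets if d == direction]
        gaps = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
        feats[direction] = {
            "bytes": sum(sizes),
            "packets": len(sizes),
            # (minimum, mean, maximum, variance), as in Table 3
            "pkt_len": (min(sizes), mean(sizes), max(sizes), pvariance(sizes)),
            "iat": (min(gaps), mean(gaps), max(gaps), pvariance(gaps)),
        }
    return feats
```

In the real pipeline these statistics are computed after the uni-directional flows keyed by five-tuple have been merged into bi-directional flows.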
5 Traffic Classification Experiment Based on Semi-supervised Clustering

In this section, we test and compare the K-means algorithm, the two-layer semi-supervised clustering algorithm [19], the DBSCAN algorithm [18], and the semi-supervised clustering algorithm proposed in this paper. The experiments are implemented with programs developed on top of WEKA, a popular machine learning tool. The DBSCAN algorithm is sensitive to its input parameters Eps and minPts, because different input parameters may lead to different clustering results; we therefore set minPts = 4 and Eps = 0.04 based on [18] and our experience. We run the SFS algorithm [20] on the data set with the two feature evaluation metrics; Table 4 shows the selection results. The elements of the two feature sets are very different, which satisfies the training requirement of the co-training semi-supervised algorithm.

Table 4. The feature subsets according to the CFS and IG methods
CFS subset | protocol, duration, averpacknum, arport, flowbyte, seminpkl, semeanpkl, arvarpkl, seminibp, arvaribp
IG subset  | arport, flowpack, armeanpkl, armaxpkl, arvarpkl, semeanibp, sevaribp

5.1 Accuracy

In this section, we analyze how the overall accuracy of each clustering classifier changes with the size of the training set. Figure 1 shows the result using 5000 labeled flows. From the figure we can see that the overall accuracy of our semi-supervised clustering algorithm is the highest, followed by the two-layer semi-supervised clustering algorithm [19]. This is because in our method the classifiers label flows for each other, extending the quantity of labeled flows, whereas the method of [19] is only a mapping from clusters to applications.

Figure 1. Influence of the size of the training set on classification accuracy.

5.2 Precision and Recall

In this section, we mainly discuss the precision and recall evaluation criteria of the classifiers. We obtain the overall accuracy and mean per-class recall/precision rates across the classes after each test. From Figure 2 and Figure 3 we can see that the precision and recall of the semi-supervised method have improved evidently, and are ahead of the other clustering algorithms for most applications.

Figure 2. Precision per application: (a) WWW, FTP, DNS, Mail and Multimedia; (b) Interactive, Chat, P2P, Game and Unknown.

Figure 3. Recall per application: (a) WWW, FTP, DNS, Mail and Multimedia; (b) Interactive, Chat, P2P, Game and Unknown.

5.3 The Relationship between Accuracy and Labeled Flows

Figure 4 shows the relationship between accuracy and the fraction of initially labeled flows among all flows for the two semi-supervised methods. From the figure we can see that initial labeled flows improve accuracy greatly. But because labeled flows are expensive in practice, we must trade off cost against accuracy. Meanwhile, under the same conditions, our semi-supervised method is found to be better than the two-layer semi-supervised clustering of [19].

Figure 4. Influence of labeled flows on classification.
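The per-class precision and recall reported in Section 5.2 are computed directly from the true and predicted labels of the test flows. As an illustrative sketch (not our actual WEKA-based evaluation code), the calculation is:

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Overall accuracy plus per-class precision and recall
    from two parallel lists of class labels."""
    pairs = Counter(zip(y_true, y_pred))   # confusion counts (true, predicted)
    classes = set(y_true) | set(y_pred)
    accuracy = sum(n for (t, p), n in pairs.items() if t == p) / len(y_true)
    metrics = {}
    for c in classes:
        tp = pairs[(c, c)]
        fp = sum(n for (t, p), n in pairs.items() if p == c and t != c)
        fn = sum(n for (t, p), n in pairs.items() if t == c and p != c)
        metrics[c] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return accuracy, metrics
```

For example, if one of two true WWW flows is misclassified as P2P, WWW precision stays at 1.0 while WWW recall drops to 0.5, which is exactly the distinction the per-application figures draw.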
6 Conclusions and future work

This paper proposed and evaluated a semi-supervised clustering method based on flow statistics for classifying Internet traffic. The method makes use of some known flows and two classifiers (based on the IG and CFS evaluation metrics) to improve classifier accuracy, so the performance of our classifier is better than K-means clustering, DBSCAN and two-layer semi-supervised clustering. Through the experiments, we find that co-training semi-supervised clustering has higher overall accuracy than the other three kinds of clustering methods; in addition, its precision and recall metrics are also better than those of the other classical methods. Moreover, the results of the experiments show that the fraction of initially labeled flows has a great influence on the accuracy of the semi-supervised classifier.

Co-training is a very classical method in semi-supervised machine learning; however, some important problems still need to be resolved (e.g. whether the two feature subsets are approximately sufficient and redundant). Our future work will research and improve other semi-supervised methods for Internet traffic classification.

Acknowledgements

Supported by the 973 project of China (2007CB310703), Funds for Creative Research Groups of China (60821001), NSFC (60973108) and the National S&T Major Project (2009ZX03004-003-03).

References
[1] CAIDA: research: traffic-analysis: classification-overview. http://www.caida.org/research/traffic-analysis/classification-overview/.
[2] S. Sen, J. Wang, Analyzing Peer-to-Peer Traffic across Large Networks, IEEE/ACM Transactions on Networking, 2004.
[3] IANA, Internet Assigned Numbers Authority. http://www.iana.org/assignments/port numbers
[4] H. Dreger, A. Feldmann, M. Mai, V. Paxson, and R. Sommer, Dynamic application-layer protocol analysis for network intrusion detection, in USENIX Security Symposium, July 2006.
[5] J. Erman, A. Mahanti, Byte Me: A Case for Byte Accuracy in Traffic Classification, in ACM SIGMETRICS MineNet Workshop, June 2007.
[6] P. Haffner, S. Sen, O. Spatscheck, ACAS: Automated construction of application signatures, in SIGCOMM MineNet Workshop, 2005.
[7] F. Risso, M. Baldi, O. Morandi, Lightweight, Payload-Based Traffic Classification: An Experimental Evaluation, IEEE ICC 2008.
[8] N. Williams, S. Zander, G. Armitage, A Preliminary Performance Comparison of Five Machine Learning Algorithms for Practical IP Traffic Flow Classification, Computer Communication Review, 2006.
[9] H. Kim, K. Claffy, M. Fomenkov, Internet Traffic Classification Demystified: Myths, Caveats, and the Best Practices, in CoNEXT '08.
[10] T. Nguyen and G. Armitage, A Survey of Techniques for Internet Traffic Classification using Machine Learning, IEEE Communications Surveys and Tutorials, 2008.
[11] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, COLT '98.
[12] O. Chapelle, B. Schölkopf, A. Zien, eds., Semi-Supervised Learning, Cambridge, MA: MIT Press, 2006.
[13] F. Hernández-Campos, F. D. Smith, Statistical Clustering of Internet Communications Patterns, Computing Science and Statistics, 2003.
[14] A. W. Moore and D. Zuev, Internet traffic classification using Bayesian analysis techniques, in Proc. ACM SIGMETRICS, June 2005.
[15] M. Crotti, M. Dusi, F. Gringoli, Traffic classification through simple statistical fingerprinting, SIGCOMM Comput. Commun. Rev., 2007.
[16] A. McGregor, M. Hall, P. Lorier, Flow clustering using machine learning techniques, PAM 2004.
[17] S. Zander, T. Nguyen, and G. Armitage, Automated traffic classification and application identification using machine learning, LCN 2005.
[18] J. Erman, M. Arlitt, and A. Mahanti, Traffic classification using clustering algorithms, in MineNet '06: Proceedings of the 2006 SIGCOMM workshop on Mining network data.
[19] J. Erman, A. Mahanti, M. Arlitt, Offline/Realtime Traffic Classification Using Semi-Supervised Learning, in IFIP Performance, October 2007.
[20] H. Liu, L. Yu, Towards integrating feature selection algorithms for classification and clustering, IEEE Trans. on Knowledge and Data Engineering, 2005.
[21] L. Yu, H. Liu, Efficient feature selection via analysis of relevance and redundancy, Journal of Machine Learning Research, 2004, 5.
[22] A. W. Moore and D. Zuev, Discriminators for use in flow-based classification, Technical report, Intel Research, Cambridge, 2005.