
2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications

Protocol Formats Reverse Engineering based on Association Rules in Wireless Environment
Yong Wang, Nan Zhang, Yan-mei Wu, Bin-bin Su, Yong-jian Liao
School of Computer Science and Engineering
University of Electronic Science and Technology of China, 611731
Chengdu, China
cla@uestc.edu.cn

In this paper, we propose a novel protocol formats reverse engineering framework, which can automatically detect unknown protocols from captured binary data. Our framework contains five components: data capturing, frames location, frequency finding, association analysis, and protocol format inference. The data capturing component captures binary data from wireless communication channels and converts it into a bit stream. The frames location component then segments the bit stream by identifying the preamble. Next, the frequency finding and association analysis components respectively find the frequent sequences and analyze their probability relationships by utilizing association rules. Finally, the protocol format inference component extracts and identifies the potential unknown protocols.
Our main contributions are:
• We propose a protocol formats reverse engineering framework, which can automatically analyze the binary data captured from wireless communication channels.
• We design a new algorithm to identify unknown protocols from binary data by utilizing feature sequences and association rules.
• Several experiments are carried out in real-world wireless environments to verify our framework's validity.
The rest of the paper is organized as follows. Section II is dedicated to the related work. Section III shows the system architecture. Our algorithms are introduced in Section IV. In Section V, we carry out experiments to evaluate our framework. We draw brief conclusions in Section VI.

Abstract: With the wide deployment of wireless networks, attackers may exploit Wi-Fi network vulnerabilities to transfer data secretly, or use covert communication channels to spread malicious code. The protocol formats reverse engineering technique can be used to detect such attacks. However, previous works focus on application layer protocol analysis and can hardly work when the captured data is only in binary format, due to the lack of semantics. In this paper, we propose a novel protocol formats reverse engineering framework, which utilizes the association rules of feature sequences to identify unknown protocols from captured binary data. We first convert the captured binary data into a bit stream and segment it into frames. The improved AC algorithm is adopted to analyze the binary sequences. After that, we extract the feature sequences and analyze their association rules to detect potential unknown protocols. The experimental results show that our framework can identify 100% of ARP packets and 98% of ICMP packets from captured binary data.
Index Terms: binary analysis; protocol formats; association rules; wireless network

I. INTRODUCTION
The wireless network has become one of the most popular ways to access the Internet. As many application protocols on the Internet are proprietary and have no publicly released specifications, protocol formats reverse engineering can help to detect these unknown protocols. It is widely used in many security applications, for example, intrusion prevention and detection systems that perform deep packet inspection, and penetration testing that generates network inputs to an application to detect potential vulnerabilities. It can also be applied to identify protocols and tunneling in monitored network traffic.
Traditional methods are mainly based on manual work, which is time-consuming and error-prone. For example, manually reverse engineering the Microsoft Server Message Block (SMB) protocol took 12 years in the open source SAMBA project. For a closed protocol, there are many fields to parse, and complex relationships may exist among them. Some researchers have proposed automatic protocol formats reverse engineering techniques, such as Biprominer [1] and ProDecoder [2]. These systems mainly focus on application layer protocols. For binary protocols, these methods can hardly work because of the lack of semantic information; furthermore, it is difficult to distinguish identical binary sequences that come from different protocol messages.
978-0-7695-5022-0/13 $26.00 2013 IEEE
DOI 10.1109/TrustCom.2013.21

II. RELATED WORKS


Protocol reverse engineering is a traditional technique for analyzing network packets [3]-[5]. Originally, it mainly depended on manual analysis of protocol specifications, which is slow and costly. Only recently has the field of automatic inference of protocol specifications developed. Marshall A. Beddoe presented Protocol Informatics (PI) [6] in 2004, which employs bioinformatics sequence alignment algorithms to reveal similarities between messages. Similar to PI, ScriptGen [7] was proposed in 2005 by Corrado Leita et al. to solve a problem in Honeyd. These techniques are the most widely used and are applicable to reverse engineering most unknown protocols. Another approach to automatic analysis is application dialogue replayers [8], which aim to replay an original protocol session to extract a partial session description. For binary analysis in protocol format inference, Juan Caballero et al. proposed Polyglot [9] in 2007 and Z. Lin et al. proposed AutoFormat [10] in 2008. They extract protocol information by observing the execution of a program while it processes a message, and use the execution traces to detect the fields which compose a message. Yipeng Wang et al. proposed Biprominer [1] to mine the features of binary protocols automatically. They present a transition probability model for a better description of the protocol. After that, they proposed another system, ProDecoder [2], which infers the protocol format from the semantics of protocols. They group the sequences with the same semantics and infer formats by keywords and cluster sequence alignment.
Machine learning can also be used to infer unknown protocol formats. Paolo Milani Comparetti et al. [11] focus on state machines. They cluster different types of messages based on their structures and behaviors. Xuejun Cai et al. [12] propose a machine learning and keyword-matching method to identify the protocol. They classify the data flow and generate the keyword vector in the training process, and then identify protocols.

different semantics in different protocols. The association rules contain the feature sequences and their probabilistic location relations. With them, we can improve the accuracy of unknown protocol identification.
B. System Architecture
Our framework consists of five components: data capturing, frames location, frequency finding, association analysis, and protocol format inference.
1) Data Capturing: This component captures and preprocesses the data we need to analyze. First, we intercept the signals transmitted in the wireless network. Then we preprocess these signals to filter out noisy data. The output is the bit stream in the same traffic flow.
2) Frames Location: This component segments the bit stream into frames using the preamble we identified. According to 802.11x, the preamble is a pseudorandom sequence that appears before a frame and is used to synchronize the transmitter and the receiver, ensuring that the sequences maintain their timing relation. From the position of the preamble, we can locate the frames in the bit stream. To identify the preamble, a multi-pattern matching algorithm that can extract the frequent sequences is adopted. Then, we concatenate them as far as possible to find the longest frequent sequences. Based on the preamble's characteristics, we define the preamble candidate set $forcanset = \{cset_1, cset_2, \ldots, cset_n\}$, where $cset_i$ represents a preamble approximate sequence. The outputs of this component are $forcanset$, a sequence we consider to be the preamble, and the frames we extract from the bit stream based on the preamble.
3) Frequency Finding: This component extracts the frequent sequences of frames. First, the frames are clustered. Frames belonging to the same protocol may have different formats; e.g., for ARP, broadcast packets partly differ from unicast packets. For higher accuracy, we process them separately. Second, the 4-bit frequent sequences are identified using the improved AC algorithm. Finally, for efficiency, we splice the 4-bit frequent sequences to obtain longer ones. We consider each frame as a long sequence. For a subsequence $P$ of sequence $S$, we define $\mathrm{Supp}(P)$ as the support of $P$ in $S$. If $S$ has $r$ subsequences and $P$ occurs $k$ times, we define $\mathrm{Supp}(P) = k/r$. We choose $\mathrm{Supp}_{min}$ as the threshold to filter out the infrequent sequences. For example, if $\mathrm{Supp}(P)$

III. PRELIMINARIES
A. Design Goals
The basic goal of our framework is to accurately identify unknown protocols from captured binary data in wireless environments.
Firstly, we need to locate the frames in the bit stream to identify the protocol formats. Frames of the same protocol may not be equal in length, so frame location methods based on length are not applicable. Instead, the preamble is used to identify the start position of the frames.
Secondly, we focus on an efficient method of extracting feature sequences. Our framework scans the raw data bit by bit, so the traditional multi-pattern matching algorithms are not suitable. For the framework's efficiency, the improved AC algorithm is introduced. It can scan the input data bit by bit and record all the frequent sequences.
Finally, the method based on association rules addresses the problem that using feature sequences alone to identify an unknown protocol may not be accurate enough. Due to the nature of the bit stream, there may be identical sequences which have
Fig. 1: The System Architecture

the pattern set, the one-to-one onto mapping state function g analyzes each bit, and the final output function O gives all the matching strings and statistical information. In addition, a failure function f handles the cases where g fails.
The details are given in Algorithm 1.
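To make the frame location step concrete, the following is a minimal Python sketch of cutting a bit stream into frames once a preamble has been chosen. It is only an illustration under stated assumptions: the bit stream is represented as a string of '0' and '1' characters, the preamble is already known, and the function name `split_frames` is hypothetical rather than part of the framework.

```python
def split_frames(bits: str, preamble: str) -> list[str]:
    """Return the bit strings found between consecutive preambles.

    Assumption: `bits` and `preamble` are strings over {'0', '1'} and the
    preamble does not occur inside a frame body.
    """
    if not preamble:
        raise ValueError("preamble must be non-empty")

    # Collect the start index of every non-overlapping preamble occurrence.
    starts = []
    pos = bits.find(preamble)
    while pos != -1:
        starts.append(pos)
        pos = bits.find(preamble, pos + len(preamble))

    # A frame runs from the end of one preamble to the start of the next.
    frames = []
    for i, start in enumerate(starts):
        body_start = start + len(preamble)
        body_end = starts[i + 1] if i + 1 < len(starts) else len(bits)
        frames.append(bits[body_start:body_end])
    return frames


frames = split_frames("001111011011110010", "1111")
print(frames)
```

If a frame body happens to end in bits that overlap the start of the preamble, the cut position shifts, which is one reason several near-identical preamble candidates can appear in practice.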

is larger than $\mathrm{Supp}_{min}$, $P$ is considered a frequent sequence.
4) Association Analysis: This component mines the association rules [13] of frequent sequences. The traditional approach of simply using feature sequences to identify an unknown protocol is limited on a bit stream, because we cannot know the meaning of these sequences. Besides, it can hardly distinguish the same sequence in different protocols. The association-rule-based method is proposed to identify the formats of unknown protocols. The association rules are probability distributions of all the feature sequences. We will show in the verification experiment that this method achieves higher accuracy. The FP-Growth algorithm is adopted as the association rule mining algorithm. Using this algorithm, we can obtain the latent and probabilistic relationships among frequent sequences.
5) Protocol Format Inference: With the association rules, we get the probability distribution of all the feature sequences and the latent relationships among them. Then we build the protocol message components and store them as the unknown protocol formats.

Algorithm 1 The improved AC algorithm

Input: bit-stream file Str, threshold, Supp_min
Output: The frequent sequences set D
n ← the length of Str
buffer1[], buffer2[], buffer3[], buffer4[]; set a filter to get the frequent sequences; Len ← 4
build a tree State.Tree of bit sequences of length 1 to Len by enumerating its nodes
build a root node, root ← null; build the child sequence nodes Q_t, Q_t' for root, Q_t ← 0, Q_t' ← 1
for each sequence node Q_m that has no child and whose length is < Len do
    build the child nodes Q_{m+1}, Q'_{m+1}
    Q_{m+1} ← add a bit 0 at the end of Q_m; Q'_{m+1} ← add a bit 1 at the end of Q_m
end for
set a counter for each four-bit state; Data ← current bit from Str
for Data is not null do
    if Data is within the first three bits of Str then
        buffer1[] ← current bit, buffer2[] ← the last two bits, buffer3[] ← the last three bits
    else
        buffer1[] ← current bit, buffer2[] ← the last two bits, buffer3[] ← the last three bits, buffer4[] ← the last four bits
        read buffer4[]; according to its bits, traverse State.Tree and get the state
        the counter for the current state ← counter + 1
    end if
    Data ← the next bit from Str
end for
Supp_min ← (n − Len + 1) / 2^Len × threshold
for each state with count > Supp_min do
    write the corresponding sequence into D
end for
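The counting part of Algorithm 1 can be illustrated with a short Python sketch. This is a simplification and not the authors' implementation: it swaps the state tree for a `collections.Counter` over a sliding window, represents the bit stream as a string of '0' and '1' characters, and reads the threshold line as Supp_min = (n − Len + 1) / 2^Len × threshold, which is an assumption about the original formula. The name `frequent_windows` is hypothetical.

```python
from collections import Counter


def frequent_windows(bits: str, threshold: float, length: int = 4) -> dict[str, int]:
    """Count every `length`-bit window and keep those above the support cutoff.

    Assumption: the cutoff is the expected count of a window under a uniform
    bit stream, (n - length + 1) / 2**length, scaled by `threshold`.
    """
    n = len(bits)
    if n < length:
        return {}

    # One counter per observed window, advancing the window one bit at a time.
    counts = Counter(bits[i:i + length] for i in range(n - length + 1))

    supp_min = (n - length + 1) / (2 ** length) * threshold
    return {seq: count for seq, count in counts.items() if count > supp_min}


print(frequent_windows("00000000", threshold=1.0))
```

Advancing the window by a single bit rather than by a byte is what allows frequent sequences to be found at arbitrary bit offsets.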

IV. OUR FRAMEWORK
We employ the improved AC algorithm in frames location and frequency finding, the approximate string matching algorithm [14]-[16] to compress the feature sequence set, and the FP-Growth algorithm [17] in association analysis.
A. The Data Capturing
The inputs of our framework are signals intercepted from the wireless network. We convert these signals into binary data, which is treated as the bit stream. This bit stream contains many frames, and each frame has multiple feature sequences defined in the specification of the protocol. A keyword is essentially a binary sequence of arbitrary length. For example, the keywords used in the ARP protocol include 0x0806, 0x0001, 0x0800, etc.

C. The Frequency Finding

The inputs of this component are the frames that we extract based on the preamble. Frames belonging to the same protocol may have different feature sequences. For example, in ARP packets the sequence 0x000000000000 appears frequently in the MAC address field of broadcast packets, but is infrequent in unicast packets.
The improved AC algorithm is adopted to identify frequent sequences that appear in every frame or in most frames. We define $\mathrm{Supp}(P) = k/r$, which means that sequence $P$ appears in $k$ of the $r$ total frames. $\mathrm{Supp}_{min}$ is defined to be the threshold.
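As a small worked illustration of the frame-level support Supp(P) = k/r, the sketch below counts in how many frames a candidate sequence occurs and keeps the candidates above Supp_min. The frames and candidate patterns are made-up bit strings, and the helper names are hypothetical.

```python
def frame_support(pattern: str, frames: list[str]) -> float:
    """Supp(P) = k / r: the fraction of the r frames that contain pattern P."""
    if not frames:
        return 0.0
    k = sum(1 for frame in frames if pattern in frame)
    return k / len(frames)


def frequent_patterns(patterns: list[str], frames: list[str], supp_min: float) -> list[str]:
    """Keep the patterns whose support is larger than supp_min."""
    return [p for p in patterns if frame_support(p, frames) > supp_min]


# Toy frames (assumed data, not taken from the experiments).
frames = ["0101", "0100", "1111", "0001"]
print(frame_support("01", frames))                      # 3 of 4 frames contain "01"
print(frequent_patterns(["01", "11"], frames, 0.6))
```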
For mass data, the feature sequences may be numerous and highly repetitive. To solve this, the approximate string matching algorithm is adopted to evaluate the similarity of two sequences. This algorithm is applied to find similar sequences in the pattern set. A key parameter is the edit distance $C_{|x||y|}$. $\mathrm{sim}(A, B)$ indicates the similarity between frequent sequences $A$ and $B$. We define it as:

B. The Frames Location

The function of this component is to segment the bit stream into frames based on the preamble. To identify the preamble, the multi-pattern matching algorithm is adopted to extract the frequent sequences; then we splice these sequences to get the preamble candidate set $forcanset = \{cset_1, cset_2, \ldots, cset_n\}$, which contains the long frequent sequences. Based on the preamble, we extract the sequences between its occurrences. This algorithm is improved from the multi-pattern matching algorithm to handle binary data. Multi-pattern matching algorithms, which can match several patterns in one scan, include Wu-Manber [19], based on hash functions, and Aho-Corasick (AC) [20], based on a finite state machine. Wu-Manber adds filtering to Boyer-Moore [21], and its core idea is to use bad-character blocks and good suffixes. AC is a classic multi-pattern matching algorithm, and its complexity O(n) is independent of the number of pattern sequences. Putting the bit stream sequence $(b_1, b_2, \ldots, b_n)$ into the automaton M constructed for

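$$\mathrm{sim}(A, B) = \frac{\mathrm{length}(x, y) - C_{|x||y|}}{\mathrm{length}(x, y)}$$

where $\mathrm{length}(x, y) = \frac{\mathrm{length}(x) + \mathrm{length}(y)}{2}$. The $C_{|x||y|}$ is computed as follows:

$$C_{0,i} = i, \quad C_{j,0} = j$$
$$C_{i,j} = C_{i-1,j-1}, \quad \text{if } P_i = T_j$$
$$C_{i,j} = 1 + \min(C_{i-1,j}, C_{i-1,j-1}, C_{i,j-1}), \quad \text{if } P_i \neq T_j$$
$$C_{|x||y|} = ed(x, y)$$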

The FP-tree building algorithm is described in detail in Algorithm 3, and the conditional FP-tree algorithm is described in detail in Algorithm 4.

$sim_{min}$ is defined to be the threshold. If $\mathrm{sim}(x, y)$ is larger than $sim_{min}$, we consider $x$ and $y$ to be similar. Based on this algorithm, we obtain the common sub-sequences among approximate sequences and compress the feature sequence set. The process is described in detail in Algorithm 2.
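The similarity measure can be checked with a compact Python sketch of the standard edit-distance recurrence. It follows the definitions given in this subsection (boundary values equal to the row or column index, and similarity normalized by the mean length); the function names are illustrative, and sequences are represented as strings.

```python
def edit_distance(x: str, y: str) -> int:
    """Edit distance via the recurrence C[i][j], kept one row at a time."""
    prev = list(range(len(y) + 1))          # boundary row: distance to empty prefix
    for i in range(1, len(x) + 1):
        cur = [i] + [0] * len(y)            # boundary column: distance to empty prefix
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1]
            else:
                cur[j] = 1 + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    return prev[len(y)]


def similarity(x: str, y: str) -> float:
    """sim = (mean_length - ed) / mean_length, with mean_length = (|x| + |y|) / 2."""
    mean_length = (len(x) + len(y)) / 2
    if mean_length == 0:
        return 1.0  # two empty sequences are treated as identical (an assumption)
    return (mean_length - edit_distance(x, y)) / mean_length


print(similarity("1010", "1011"))  # one substitution over mean length 4
```

Two sequences would then be merged when `similarity` exceeds the chosen threshold $sim_{min}$.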

E. The Format Inference


The input of this component is a set of association rules.
We build the protocol message models with association rules,
and store them as unknown protocol formats in the feature
database.

Algorithm 2 The approximate string matching algorithm

Input: Sequence X, Sequence Y, length(X), length(Y)
Output: sim(X, Y)
Matrix C_{|X||Y|} = NULL
for i = 0 : |X| do
    Let C_{i,0} = i
end for
for j = 0 : |Y| do
    Let C_{0,j} = j
end for
for the rest cells C_{i,j} in Matrix do
    Let P_i be the ith element of X, and T_j the jth element of Y
    if P_i = T_j then
        Let C_{i,j} = C_{i−1,j−1}
    else
        Let C_{i,j} = 1 + min(C_{i−1,j}, C_{i−1,j−1}, C_{i,j−1})
    end if
end for
Let the edit distance ed(X, Y) = C_{|X||Y|}
length(X, Y) = (length(X) + length(Y)) / 2
The similarity sim = (length(X, Y) − ed(X, Y)) / length(X, Y)

V. EXPERIMENTS AND EVALUATIONS

A. Evaluation Parameters
In the evaluation experiments, we define the following three sets:
1) True Positives: the set of type X frames where each frame matches an association rule generated by our framework.
2) False Positives: the set of frames not of type X where each frame matches an association rule generated by our framework.
3) False Negatives: the set of type X frames where each frame cannot match any association rule generated by our framework.
Next, the following two parameters are defined to quantitatively evaluate the effectiveness of our framework.

$$\mathrm{precision} = \frac{|TruePositives|}{|TruePositives| + |FalsePositives|}$$

$$\mathrm{recall} = \frac{|TruePositives|}{|TruePositives| + |FalseNegatives|}$$

B. Experiment Setup
In the frame location experiment, we use a USRP to intercept Beacon frames sent by a single AP (Access Point). In the frequent finding and association analysis experiments, ARP and ICMP are chosen as the target protocols. In these experiments, we capture the data with Wireshark and convert it into binary to simulate the data frames segmented in the frame location experiment. Our data set consists of 6000 ARP packets totaling 2.66 MB, 6000 ICMP packets totaling 3.54 MB, and 10000 non-ARP and non-ICMP packets totaling 10 MB. The average packet lengths are also small for both SMTP and SMB protocols because they mostly consist of command codes rather than payload data. We use 90% of the packet traces for training and the remaining 10% for measuring the precision and recall of our framework. Three thresholds are used:
1) $Supp_{min}$ is used in the improved AC algorithm.
2) $Sim_{min}$ is used in the approximate string matching algorithm.
3) $Conf_{min}$ is used in the FP-Growth algorithm.
After a number of experiments with different parameter values, we set $Supp_{min} = 0.6$, $Sim_{min} = 0.8$, $Conf_{min} = 0.9$.

D. The Association Analysis

The purpose of this component is to mine the association rules among frequent sequences. This task can be divided into two problems: discovering the frequent items and generating association rules. Apriori [22] and FP-Growth [17] are two basic algorithms for association rule mining. We define $I = \{I_1, I_2, I_3, \ldots, I_m\}$ as the set of feature sequences, where $I_i$ denotes the ith feature sequence. An association rule has the form $X \Rightarrow Y$, where $X, Y \subset I$, $X \cap Y = \emptyset$, $X$ is the antecedent, and $Y$ is the consequent. To evaluate an association rule $X \Rightarrow Y$, two parameters are defined: Support, the joint probability $P(X, Y)$ of $X$ and $Y$, and Confidence, the conditional probability $P(Y|X)$. The Confidence is computed as below:

$$\mathrm{Confidence}(X \Rightarrow Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)}$$

Three parameters, Lift, Leverage, and Conviction, are defined to assist the Confidence:

$$\mathrm{Lift} = \frac{P(L, R)}{P(L)P(R)}$$
$$\mathrm{Leverage} = P(L, R) - P(L)P(R)$$
$$\mathrm{Conviction} = \frac{P(L)P(!R)}{P(L, !R)}$$

The FP-tree building algorithm is the first step of the FP-Growth algorithm. It builds the tree by scanning the frequent sequence database. After the tree is built, the FP-Growth algorithm builds another tree, the conditional FP-tree, which illustrates the relationships among frequent sequences. Meanwhile, it filters out the infrequent sequences by $Supp_{min}$.
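The rule measures used in the association analysis (Support, Confidence, Lift, Leverage, and Conviction) can be computed directly from frame counts. The following Python sketch is only an illustration: each frame is assumed to be represented as the set of feature-sequence labels found in it, the toy data is made up, and `rule_metrics` is a hypothetical helper rather than part of FP-Growth itself.

```python
def rule_metrics(frames: list[set[str]], lhs: set[str], rhs: set[str]) -> dict[str, float]:
    """Evaluate the rule lhs => rhs over frames given as sets of sequence labels."""
    n = len(frames)
    if n == 0:
        raise ValueError("at least one frame is required")

    p_l = sum(lhs <= frame for frame in frames) / n           # P(L)
    p_r = sum(rhs <= frame for frame in frames) / n           # P(R)
    p_lr = sum((lhs | rhs) <= frame for frame in frames) / n  # P(L, R)
    p_l_not_r = p_l - p_lr                                    # P(L, !R)

    return {
        "support": p_lr,
        "confidence": p_lr / p_l if p_l else 0.0,
        "lift": p_lr / (p_l * p_r) if p_l and p_r else 0.0,
        "leverage": p_lr - p_l * p_r,
        # Conviction is unbounded when the rule is never violated.
        "conviction": p_l * (1 - p_r) / p_l_not_r if p_l_not_r else float("inf"),
    }


# Toy frames: labels stand for feature sequences (assumed data).
toy_frames = [{"A", "E"}, {"A", "E"}, {"A"}, {"B"}]
print(rule_metrics(toy_frames, lhs={"E"}, rhs={"A"}))
```

In this toy data every frame containing E also contains A, so the rule E ⇒ A has confidence 1, while the reverse rule A ⇒ E has confidence 2/3.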

We consider that the true preamble sequence is in this approximate set. So, according to Table I, we can determine the true preamble with prior knowledge and locate the frames.

Algorithm 3 The build FP-tree algorithm

Input: The frequent sequences database D, Supp_min
Output: The FP-tree
Scan the database D = {TID_i, EVENT_{TID_i}}
Each kind of EVENT_{TID_i} ∈ Items = {Item_1, Item_2, ...}
Record all the Item_i and their occurrences in the Frequent Set F = {Item_1: count; Item_2: count; ...}
Compute Supp(Item_i) for each Item_i
Sort the Frequent Set F in descending order of count, and denote it as Set L
Build the root node of the FP-tree, root = NULL
Create a Frequent.Item Table = {Item.name_i, Node.head_i}
Item.name_i ← Item_i of Set L
Scan the database D again
Filter the Set F by Set L and sort each EVENT_{TID} in the descending order of Set L
Each TID defines a [p | P], where p denotes EVENT_{TID_1} and P is the other EVENT_{TID}
Insert each ([p | P], T) into the FP-tree, where T is the current node in the FP-tree
for p ≠ NULL do
    if (T has a ChildNode N) ∧ (N.Item.name = p.Item.name) then
        N.Item.count ← N.Item.count + 1
    else
        Build a new ChildNode N'
        N'.Item.name ← p.Item.name
        N'.Item.count ← 1
        N'.Item.link ← T.ChildLink
        if the node is first built then
            N'.Item.link ← Node.head_{N'}
        end if
    end if
    if the new node N'.Item.name = the other node's T'.Item.name then
        N'.Item.link ← T'.Item.ChildLink
    end if
    p ← the next EVENT_{TID_i}
end for

TABLE I: The preamble approximate sequences

No   Length   Sequences
1    112      0xffffffffffffffffffffffffe0b9
2    104      0xffffffffffffffffffffffff82
3    136      0xfffffffffffffffffffffffffff05cf0a0
4    144      0xfffffffffffffffffffffffffffff05cf0a0
5    128      0xffffffffffffffffffffffffc173c281
6    120      0xfffffffffffffffffffffff05cf0a0

We verify the matching degree of the preamble approximate sequences. In Fig. 2, Sequence 1 fully matches the true preamble. However, the matching degrees of the other sequences are almost all above 91%, so they may be mistaken for the preamble. This is due to the characteristics of the bit stream: these approximate preamble sequences contain parts of the true preamble. Although their frequencies are higher, we can filter them out with prior information.

Algorithm 4 The condition FP-tree algorithm

Input: The FP-tree, Supp_min, and the Frequent Item Table
Output: The frequent set based on association
1: Create m ← the last Item.name of the Frequent Item Table
2: for m ≠ NULL do
3:     Search the FP-tree and record the paths that include m in the Path Set = {Item.name_i : Item.count, ..., m : m.count}
4:     Create the conditional pattern base CP = {Item.name_i, Item.name_j, ..., m : m.count, Item.name_x, Item.name_y, ..., m : m.count}
5:     Delete m from CP, so that CP = {Item.name_i, Item.name_j, ... : m.count}
6:     Build a conditional FP-tree Tree1 with root = NULL
7:     Insert (Item.name, T), where T is the current node in the conditional tree
8:     for Item.name = the first Item.name do
9:         if (T has a node N) ∧ (N.Item.name = Item.name) then
10:            N.Item.count ← N.Item.count + 1
11:        else
12:            Create a new node N'
13:            N'.Item.Link ← T.Item.ChildLink
14:            N'.Item.count ← 1
15:        end if
16:        Insert the next Item.name in CP
17:    end for
18:    if the new tree has one path then
19:        if a node's Item.count < Supp_min then
20:            Delete this node
21:        else
22:            Generate the Frequent Set based on association rules
23:        end if
24:    else
25:        Create the new conditional pattern base CP'
26:        Create the new conditional FP-tree Tree2
27:        Jump to step 4
28:    end if
29: end for

C. Experiment Results
1) The Frame Location Experiment: The experiment results are illustrated in Table I and Fig. 2. Table I shows the preamble approximate sequences we found.

Fig. 2: The frequency and matching degree of preamble approximate strings

2) The Frequent Finding Experiment: We extracted 38 frequent sequences in the experiment. We preprocess them with the approximate matching algorithm to ensure the quality of the frequent sequences. For example, among these frequent sequences are 0x020000 and 0x200000000000, where the former is clearly a sub-sequence of the latter. So we only need the common part of these two sequences to compress the feature sequence set. After this step, only 11 sequences remain in the set; Table II lists them.

TABLE II: Frequent sequences

No   Length   The frequent sequences   Frequency
A    28       0x0018100                100%
B    24       0x020000                 100%
C    24       0x040001                 98.6%
D    48       0xffffffffffff           98%
E    44       0x00000000000            100%
F    20       0x15002                  68.2%
G    20       0x10800                  100%
H    20       0x42000                  100%
I    20       0x08400                  100%
J    24       0x054008                 37.6%
K    24       0x800042                 100%

Among these 11 sequences, Sequence J's frequency is only 37.6%. The reason for the low frequency is that it has a three-bit offset. We propose three ways to solve this problem, all of which are verified below:
i) preprocess the data frames and process them separately by type;
ii) use a machine learning algorithm to filter out the less frequent sequences;
iii) adjust the frequent finding parameters to screen out the frequent sequences.
Analyzing Table II, we find two special feature sequences. Sequence D is made up of 48 '1' bits, and Sequence E of 44 '0' bits. Such sequences are clearly special and usually appear in typical packets. The special frames often belong to management frames or control frames, which have special characteristic sequences that distinguish them from others. We therefore preprocess the special frames separately to improve the true positives of the feature sequences and the efficiency. For ARP packets, we treat the broadcast packets as the special ones; the others are the unicast packets. Table III shows the frequent sequences of ARP broadcast packets, and Table IV those of unicast packets.

Compared with Table II, after separating the special frames, the frequencies of Sequences D and E rise to 100%. This illustrates that Sequences D and E are typical sequences in ARP broadcast packets. The frequencies of Sequence F and Sequence J are below 35%; we filter them out by machine learning. The same situation appears in Table IV.

below 30%. According to the specification of ARP, we find that they are sub-sequences of the IP address 192.168.1. The ARP operands include 0x00 and 0x01. The end of Sequence C represents 0x01, and the end of Sequence B represents 0x00. Besides, Sequence B has a one-bit offset and Sequence I has a three-bit offset, because our feature sequence algorithm is not accurate enough. After adjusting the offset, the frequencies of these sequences can reach 99.9%.
Based on the frequent sequences extracted above, we further filter the frequent sequence set with machine learning (with the threshold set to 0.7).
Fig. 3 shows that, after machine learning, the precision rates of the feature sequences of ARP broadcast packets reach 100%. Because of their simple structure, the feature sequences of broadcast packets are distinct and easy to identify. Likewise, the precision rates for unicast packets are also 100% in Fig. 4. Furthermore, the precision rate for ARP does not fluctuate as the number of packets increases.
Fig. 5 shows the efficiency of the frequent sequences. Sequence B and Sequence D have low recall rates. Sequence D in particular, which we have just shown to be a typical sequence in ARP broadcast packets, has a recall rate below 80% that declines significantly after 5000 packets. This is because Sequence D consists of 44 '0' bits and represents the MAC address of a packet, so it appears in many positions and is frequent enough. Fig. 6 illustrates the unicast packets, whose recall rates are almost all above 95%.
It can be concluded that the precision rate becomes low when a single feature sequence is used to identify ARP packets, unless we can accurately find the sequence with the highest precision rate. This is difficult to achieve for unknown protocol recognition in a bit stream environment. We therefore present a method that uses association rules to describe the features of protocols.
3) The Association Analysis Experiment: We use the data converted above as the input. The results are in Table V and Table VI. For example, the first association rule 0x00000000000 ⇒ 0x0018100, conf: 1 means that if sequence 0x00000000000 appears, 0x0018100 appears with a probability of 100%. In this experiment, we again separate the broadcast packets and unicast packets, and then compute their precision rates and recall rates.
TABLE III: The Frequent sequences of Broadcast packets

No   Length   The frequent sequences   Frequency
A    28       0x0018100                100%
B    24       0x020000                 100%
C    24       0x040001                 100%
D    48       0xffffffffffff           100%
E    44       0x00000000000            100%
F    20       0x15002                  34.9%
G    20       0x10800                  100%
H    20       0x42000                  100%
I    20       0x08400                  100%
J    24       0x054008                 32.5%
K    24       0x800042                 100%

TABLE IV: The Frequent sequences of Unicast packets

No   Length   The frequent sequences   Frequency
A    28       0x0018100                100%
B    24       0x020000                 30.2%
C    24       0x040001                 30.2%
D    44       0x00000000000            78.2%
E    20       0x15002                  52.9%
F    20       0x10800                  100%
G    20       0x42000                  100%
H    20       0x08400                  100%
I    24       0x054008                 25.2%
J    24       0x800042                 100%

TABLE V: The Association rules of Broadcast packets

NO.      Association rule
Asso.A   0x00000000000 ⇒ 0x0018100
Asso.B   0x020000 ⇒ 0x0018100
Asso.C   0x040001 ⇒ 0x0018100
Asso.D   0xffffffffffff ⇒ 0x0018100
Asso.E   0x00000000000,020000,040001,ffffffffffff ⇒ 0x0018100

To evaluate the correctness, we calculate the true positives in the ARP-only environment and the false negative rate in the comprehensive environment, in order to compute the precision rate and recall rate. Fig. 7 shows the precision rate of the association rules, and the recall rate is shown in Fig. 8.
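As a small worked example of how the two rates follow from the counts of true positives, false positives, and false negatives, the sketch below applies the precision and recall definitions from the evaluation parameters. The counts passed in are invented for illustration and are not experimental results.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts: 98 frames matched correctly, none matched wrongly, 2 missed.
print(precision_recall(tp=98, fp=0, fn=2))
```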

In Table IV, the frequencies of Sequences B, C, and I are

Fig. 3: The Precision Rate of ARP Broadcast Frequent Sequences

Fig. 4: The Precision Rate of ARP Unicast Frequent Sequences

Fig. 5: The ARP broadcast frequent sequences recall rate

Fig. 6: The ARP unicast frequent sequences recall rate

Fig. 7: The Precision Rate of Association rules in ARP broadcast packets

Fig. 8: The Precision Rate of Association rules in ARP unicast packets

Fig. 9: The Recall Rate of Association rules in ARP broadcast packets

Fig. 10: The Recall Rate of Association rules in ARP unicast packets

TABLE VI: The Association rules of Unicast packets

NO.      Association rule
Asso.A   0x80042 ⇒ 0x0018100
Asso.B   0x10800 ⇒ 0x0018100
Asso.C   0x42000 ⇒ 0x0018100
Asso.D   0x08400 ⇒ 0x0018100
Asso.E   0x10800,42000,08400,80042 ⇒ 0x0018100

As in the frequent finding experiment, we report the precision rate and recall rate of the association rules. The precision rate is shown in Fig. 7 and Fig. 8, and the recall rate in Fig. 9 and Fig. 10. The figures show that the precision rates of all association rules reach 100%, and the same holds for the recall rates. In this experiment, the precision rates can reach 100% because we separate the broadcast packets and unicast packets. Furthermore, the recall rates are better than those of the frequent finding experiment. Compared with Fig. 5, the recall rate of association rule A, which contains Sequence D, rises to 100% in Fig. 10. This improves the accuracy of the formats we extract.

4) The ICMP Experiment: Table VII lists part of the association rules for ICMP. Using the same method as for ARP, we compute the precision rate and recall rate to evaluate the experiment.


TABLE VII: The Association rules of ICMP

NO.   A   B   C   D

REFERENCES

[1] Y. Wang, X. Li et al. Biprominer: Automatic Mining of Binary Protocol Features. In: 12th International Conference on Parallel and Distributed Computing (PDC), 2011.
[2] Y. Wang et al. A Semantics Aware Approach to Automated Reverse Engineering Unknown Protocols. 2012 20th IEEE International Conference on Network Protocols (ICNP), 2012.
[3] G. Combs et al. Wireshark. Available at www.wireshark.org/, 2006.
[4] J. Rauch. PDB: The protocol debugger. BlackHat USA, 2006.
[5] T. Beardsley. Manual protocol reverse engineering. BreakingPoint Systems, 2009.
[6] Marshall A. Beddoe. Network Protocol Analysis using Bioinformatics Algorithms. In: Toorcon, 2004.
[7] C. Leita, K. Mermoud, and M. Dacier. ScriptGen: An Automated Script Generation Tool for Honeyd. In: 21st Annual Computer Security Applications Conference (ACSAC), 2005.
[8] J. Newsome, D. Brumley et al. Replayer: Automatic Protocol Replay by Binary Analysis. In: 13th ACM Conference on Computer and Communications Security (CCS), 2006.
[9] J. Caballero, Heng Yin et al. Polyglot: Automatic Extraction of Protocol Message Format using Dynamic Binary Analysis. In: ACM Conference on Computer and Communications Security (CCS), 2007.
[10] Z. Lin, X. Jiang et al. Automatic Protocol Format Reverse Engineering through Context-Aware Monitored Execution. In: 15th Symposium on Network and Distributed System Security (NDSS), 2008.
[11] Paolo Milani Comparetti et al. Prospex: Protocol specification extraction. 2009 30th IEEE Symposium on Security and Privacy. IEEE, 2009.
[12] Xuejun Cai, Ruoyuan Zhang, and Bin Wang. Machine Learning and keyword-matching integrated Protocol Identification. In: Proc. of 2010 3rd IEEE International Conference on Broadband Network and Multimedia Technology (IC-BNMT), 2010.
[13] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large database. ACM-SIGMOD 1993, 207-216.
[14] A. Blumer, J. Blumer, and D. Haussler et al. The smallest automaton recognizing the subwords of a text. Theoret. Comput. Sci., 40 (1985), pp. 31-55.
[15] W.I. Chang, E.L. Lawler. Approximate string matching in sublinear expected time. Proc. IEEE 1990 Ann. Symp. on Foundations of Computer Science (1990), pp. 116-124.
[16] G.R. Dowling, P. Hall. Approximate string matching. ACM Comput.
Surveys, 12 (1980), pp. 381C402.
[17] J.Han, J.Pei, and Y.Yin. Mining Frequent patterns without candidate
generation. SIGMOD 2000,1-12.
[18] IEEE 802.11. Wireless LAN Medium Access Control (MAC) and
Physical Layer (PHY) Specications.1999
[19] S. Wu, U. Manber. A fast algorithm for multi-pattern searching. Tech.
R. TR-94-17, Dept.of Comp. Science, Univ. of Arizona, 1994.
[20] Aho. A. V, M. J. Corasick. Efcient string matching: an aid to
bibliographic search. Comm. of the ACM 18, pp. 333-340.
[21] Boyer R.S, Moore J.S. A fast string searching algorithm. Comm. of
the ACM 20. pp.762-772, 1977.
[22] Agrawal.R and Srikant.S. Mining sequential patterns. ICDE 1995, 3-14.

Association rule
0x616263646566 0x2c4c6c8cacc
0x131b232b333b43 0x616263646566
0x131b232b333b43 0x2c4c6c8cacc
0x131b232b333b43 0x616263646566,2c4c6c8cacc

Fig. 11 shows that the frequencies of the four association rules are consistent, which demonstrates the effectiveness and accuracy of the association rules we have extracted. The precision rates of these association rules are around 99%. The average recall rate of the association rules is above 98% in Fig. 12, and it shows no large decline as the number of packets increases.

Fig. 11: The Precision Rate of ICMP in association rules

Fig. 12: The Recall Rate of ICMP in association rules

VI. CONCLUSION
We propose a novel framework for the automatic reverse engineering of unknown protocol formats. It is mainly based on association rules, which find the probability relationships among the frequent sequences. To analyze binary protocols, we introduce improved versions of the underlying algorithms. With the improved AC algorithm, we record the frequency of all sequences. Combined with the association rules, this allows us to detect unknown protocols. We carry out experiments that treat ARP and ICMP as black-box protocols. The experimental results show that the precision of our framework is about 98% and the recall rate is about 99%.
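
To make the frequency-recording step more concrete, the sketch below counts how often each candidate sequence occurs in a frame using the textbook Aho-Corasick automaton. It is not the improved variant used in the framework, whose details are given in Section IV. The function name `count_patterns` and the hex-string input are assumptions for illustration, and the patterns are assumed to be distinct and non-empty.

```python
from collections import Counter, deque
from typing import Iterable


def count_patterns(text: str, patterns: Iterable[str]) -> Counter:
    """Count overlapping occurrences of each pattern in text (Aho-Corasick)."""
    # Trie: goto[state] maps a character to the next state; state 0 is the root.
    goto = [{}]
    fail = [0]
    out = [[]]  # patterns recognized when a state is reached
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Breadth-first pass: compute failure links and inherit their outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]

    # Single left-to-right scan of the text.
    counts = Counter()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            counts[pat] += 1
    return counts


# Toy frame and candidate sequences, made up for illustration.
counts = count_patterns("1080042000", ["10800", "08004", "42000", "80042", "00"])
```

All patterns are matched in one pass over the frame, which is the main reason to prefer an automaton over searching for each candidate sequence separately.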
ACKNOWLEDGMENT
We thank the anonymous reviewers for their helpful comments. This work is supported by a SafeNet Research Award and by the Joint Funds of the National Natural Science Foundation of China (Grant No. U1230106).
