
International Journal of Computer Trends and Technology (IJCTT) - Volume 4, Issue 10, Oct 2013

ISSN: 2231-2803    http://www.ijcttjournal.org    Page 3717



Accelerating Multipattern Matching On Compressed HTTP Traffic

Abdul Mannan Virani, M.Tech
CSE Dept, DIMAT, Raipur (C.G.),

Abstract: Security is nowadays a key concern in nearly every aspect of computing, and many technologies have been developed to provide it. One of them, signature-based detection, struggles under heavy traffic, where the commercial stakes keep rising. This paper addresses compressed HTTP traffic. HTTP uses GZIP compression, so a decompression phase precedes string matching. We use an Aho-Corasick-based algorithm for Compressed HTTP (ACCH), which provides an advantage over the commonly used Aho-Corasick pattern-matching algorithm: it takes advantage of information gathered by the decompression phase in order to accelerate matching. We show that it is faster to perform pattern matching on the compressed data, even with the cost of decompression, than on regular traffic. To our knowledge, this is the first work that analyzes the problem of on-the-fly multipattern matching on compressed HTTP traffic and solves it.

I. INTRODUCTION
Security technologies, such as Network Intrusion Detection Systems (NIDS) and Web Application Firewalls (WAF), rely on signature-based detection techniques to identify attacks. Nowadays, a security tool is judged by the speed of the underlying string-matching algorithms that detect these signatures. HTTP compression, also called content encoding, is an openly available method to compress textual content transferred from Web servers to browsers. Many websites and social-networking sites use HTTP compression; surveys report that over 25% of sites use it, and the share is increasing. Support for compressed content is built into HTTP 1.1 and is supported by most browsers.
Somesh Kumar Dewangan
Associate Professor (CSE), DIMAT, Raipur (C.G.)

Most current security tools either ignore compressed traffic, which creates security holes, or disable compression by rewriting the client's HTTP header to indicate that compression is not supported by the client's browser, thus degrading overall performance and bandwidth.
A few security tools handle HTTP compressed traffic by decompressing the entire page at the proxy and performing a signature scan on the decompressed page before passing it to the client. This option is not applicable for security tools that operate at high speed or when additional delay is not an option. In this paper, we explore a novel algorithm, the Aho-Corasick-based algorithm for Compressed HTTP (ACCH). ACCH decompresses the traffic and then uses data from the decompression phase to accelerate the pattern matching. Specifically, the GZIP compression algorithm avoids repetitions of strings by using back-references (pointers) to the repeated strings. Our key insight is to store information produced by the pattern-matching algorithm for the already-scanned decompressed traffic; upon encountering a pointer, we use this information either to determine whether the referenced area contains a match or to safely skip scanning bytes within it. ACCH can skip up to 84% of the data and boost the performance of the multipattern-matching algorithm by up to 74%.
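To make the back-reference mechanism concrete, here is a toy LZ77 decoder (an illustrative sketch, not the paper's implementation): each token is either a literal or a (distance, length) pointer into the already-decompressed output, which is exactly the repetition information that ACCH reuses.

```python
def lz77_decode(tokens):
    """Decode a list of tokens: string literals or (distance, length) pairs.

    A pointer (dist, len) copies len symbols starting dist symbols back
    in the output produced so far; pointers may overlap themselves.
    """
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            start = len(out) - dist
            for j in range(length):        # copying byte-by-byte handles overlap
                out.append(out[start + j])
        else:
            out.append(tok)
    return "".join(out)

# "abcabcabc": the second and third "abc" are a single back-reference
print(lz77_decode(["a", "b", "c", (3, 6)]))  # -> abcabcabc
```

Note the overlapping case: a pointer with distance 1 and length 4 after a single "x" expands to "xxxxx", because each copied symbol becomes available to the next copy step.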
II. PROBLEM STATEMENT

An IDS detects intrusions using known attack patterns called signatures.
Every IDS has a large number of signatures (more than 5000).
If the pattern-matching algorithm is slow, the IDS attack-response time will be very high.
Existing efficient algorithms such as Boyer-Moore (BM) and Aho-Corasick (AC) do not improve the throughput of the IDS.

The proposed system is an implementation of a scalable look-ahead regular-expression detection system.
It works based on a look-ahead finite automaton.
It improves the detection speed and attack-response time.
The proposed system should be capable of processing a larger number of signatures, with more complex regular expressions, on every packet payload.
The attack-response time should be lower than with Deterministic Finite Automaton (DFA) pattern-matching procedures (Aho-Corasick).
It should provide pattern matching with assertions (back-references, look-ahead, look-behind, and conditional sub-patterns).
It should use less memory (low space complexity).
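As an illustration of the assertion types listed above, the following hypothetical signature-style patterns use Python's re syntax (real IDS rule languages differ in detail; these patterns are examples, not actual Snort rules):

```python
import re

# "cmd" only when immediately followed by ".exe" (look-ahead assertion)
lookahead = re.compile(r"cmd(?=\.exe)")

# "/admin" only when immediately preceded by "GET " (look-behind assertion)
lookbehind = re.compile(r"(?<=GET )/admin")

# a quoted string whose closing quote matches the opening one (back-reference)
backref = re.compile(r"(['\"]).*?\1")

print(bool(lookahead.search("run cmd.exe now")))       # -> True
print(bool(lookbehind.search("GET /admin HTTP/1.1")))  # -> True
print(bool(backref.search('x="payload"')))             # -> True
```

Assertions like these are exactly what plain DFA-based matchers such as Aho-Corasick cannot express directly, which motivates the look-ahead machinery above.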

III. SYSTEM DEVELOPMENT
1. Packet Capturing
2. Application Payload Extraction
3. HTTP Encoding Header Identification
4. Decompression of Payload
5. Alert Verification in SNORT.
Packet Capturing
The module opens the network interface card.
It reads every packet received by the Network Interface Card (NIC).
It queues all the packets in a buffer.
Application Payload Extraction
The buffered packets are in raw packet format.
The module identifies the headers and payloads at each layer of TCP/IP.
It decodes the payload according to its header format.
The application payload is buffered or stored for the next stage.
HTTP Encoding Header Identification
If the packet payload is in the HTTP protocol format, the module checks for the HTTP header:
Accept-Encoding: gzip or
Accept-Encoding: deflate or
Accept-Encoding: chunked, etc.

The presence of the Accept-Encoding HTTP header confirms that the payload is in an encoded format.
If the header is Accept-Encoding: gzip, then the payload is in compressed format.
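A minimal sketch of this check follows (function names are illustrative). One caveat worth noting: Accept-Encoding is sent by the client to advertise support, while the server's Content-Encoding response header states how the payload is actually compressed, so a scanner inspecting server-to-client payloads keys on the latter.

```python
def parse_http_headers(raw):
    """Parse the header block of an HTTP message into a dict (lowercased keys)."""
    head = raw.split("\r\n\r\n", 1)[0]
    headers = {}
    for line in head.split("\r\n")[1:]:      # skip the request/status line
        if ":" in line:
            name, value = line.split(":", 1)
            headers[name.strip().lower()] = value.strip()
    return headers

def payload_is_gzipped(raw_response):
    """True if the response declares a gzip-compressed body."""
    return parse_http_headers(raw_response).get("content-encoding") == "gzip"

resp = "HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\nContent-Length: 42\r\n\r\n..."
print(payload_is_gzipped(resp))  # -> True
```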

Decompression of Payload
If the payload uses gzip compression, the payloads of all the incoming packets have to be buffered.
The buffered payload is processed for gzip decompression.
The decompressed data is passed on to the pattern matching for attack identification.
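A sketch of the buffering-and-decompression step using Python's standard zlib module (illustrative; wbits=47, i.e. 32+15, tells zlib to auto-detect gzip or zlib framing, and the streaming decompressobj lets payload chunks be fed in as packets arrive):

```python
import gzip
import zlib

def decompress_gzip_stream(chunks):
    """Incrementally decompress buffered gzip payload chunks."""
    d = zlib.decompressobj(wbits=47)         # 32 + 15: auto-detect gzip/zlib
    out = b"".join(d.decompress(c) for c in chunks)
    return out + d.flush()

page = b"<html>" + b"benign content " * 50 + b"</html>"
compressed = gzip.compress(page)
# simulate arrival in packet-sized chunks
chunks = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]
assert decompress_gzip_stream(chunks) == page
```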

Alert Verification in SNORT
The SNORT IDS consists of several signatures.
The signatures cannot be applied to the compressed data.
The decompressed payload is passed on to SNORT.
The module confirms that the signatures do not hit on the compressed data but do hit on the decompressed data.

Multipattern Matching:
Pattern matching has been a topic of intensive research, resulting in several approaches; the two fundamental approaches are based on the Aho-Corasick (AC) and the Boyer-Moore algorithms. In this paper, we illustrate our technique using the AC algorithm. The basic AC algorithm constructs a deterministic finite automaton (DFA) for detecting all occurrences of the given patterns by processing the input in a single pass. The input is inspected symbol by symbol (usually each
symbol is a byte), such that each symbol results in a state transition. Thus, the AC algorithm has deterministic performance, which does not depend on the input, and is therefore not vulnerable to various attacks, making it very attractive for NIDS systems. Each arrow indicates a DFA transition made by a single-byte scan. The label of the destination state indicates the scanned byte. If there is no adequate destination state for the scanned byte, the next state is set to the root. For readability, transitions to the root were omitted. Note that this common encoding requires a large matrix of size |Σ| × |S| (where Σ is the set of ASCII symbols and S is the set of states), with one entry per DFA edge. In the typical case, the number of edges, and thus the number of entries, is 256 · |S|.

Fig.: Aho-Corasick DFA for patterns

For example, the Snort patterns used in Section VII require 16.2 MB for 1202 patterns that translate into 16,649 states. There are many compression algorithms for the DFA, but most of them are based on hardware solutions. At the bottom line, DFAs require a significant amount of memory; therefore, they are usually maintained in main memory and are characterized by random rather than consecutive memory accesses.
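For concreteness, a compact dictionary-based Aho-Corasick implementation is sketched below (illustrative only; the paper's DFA encoding uses a full 256-entry transition row per state rather than hash maps, which is exactly what makes it memory-hungry):

```python
from collections import deque

def build_ac(patterns):
    """Build the Aho-Corasick goto/fail/output tables for a pattern set."""
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:                       # phase 1: build the pattern trie
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)
    q = deque(goto[0].values())                # depth-1 states fail to the root
    while q:                                   # phase 2: BFS for failure links
        r = q.popleft()
        for ch, s in goto[r].items():
            q.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            out[s] |= out[fail[s]]             # inherit matches via failure link
    return goto, fail, out

def ac_scan(text, tables):
    """Scan the input in a single pass, one transition per symbol."""
    goto, fail, out = tables
    s, matches = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            matches.append((i - len(pat) + 1, pat))
    return matches

tables = build_ac(["he", "she", "his", "hers"])
print(sorted(ac_scan("ushers", tables)))  # -> [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The single-pass, one-transition-per-byte behavior shown in ac_scan is the deterministic-performance property the text describes.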
Challenges faced in multipattern matching:
1) Remove the HTTP header and store the Huffman dictionary of the specific session in memory. Note that different HTTP sessions have different Huffman dictionaries.
2) Decode the Huffman mapping of each symbol to the original byte or pointer representation using the session's Huffman dictionary table.
3) Decode the LZ77 part.
4) Perform multipattern matching on the decompressed traffic.
Space: One of the problems of decompression is its memory requirement: the straightforward approach requires a 32 kB sliding window for each HTTP session. Note that this requirement is difficult to avoid, since a back-reference pointer can refer to any point within the sliding window, and pointers may be recursive without limit (i.e., a pointer may


Algorithm 1: Naive decompression with Aho-Corasick pattern matching

Trf - the input, compressed traffic (after Huffman decompression)
SWin[1..32KB] - the sliding window of LZ77, where SWin[j] is the
information about the uncompressed byte located j bytes before the
current byte
FSM(state, byte) - the AC FSM; receives a state and a byte and returns
the next state, where startStateFSM is the initial FSM state
Match(state) - if state is a match state, it stores information about
the matched pattern; otherwise NULL

1: function scanAC(state, byte)
2:   state = FSM(state, byte)
3:   if Match(state) != NULL then
4:     act according to Match(state)
5:   end if
6:   return state

7: procedure GZIPDecompressPlusAC(Trf_1, ..., Trf_n)
8:   state = startStateFSM
9:   for i = 1 to n do
10:    if Trf_i is a pointer (dist, len) then
11:      for j = 0 to len - 1 do
12:        state = scanAC(state, SWin[dist - j])
13:      end for
14:      update SWin with the bytes SWin[dist], ..., SWin[dist - len + 1]
15:    else
16:      state = scanAC(state, Trf_i)
17:      update SWin with the byte Trf_i
18:    end if
19: end
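Algorithm 1 can be sketched in runnable form as follows. For brevity, the AC FSM is replaced by a deliberately simple stand-in in which the state is the longest suffix of the scanned text that is a prefix of some pattern; make_fsm and decompress_plus_ac are illustrative names, not from the paper.

```python
def make_fsm(patterns):
    """Stand-in for the AC FSM: the state is the longest suffix of the
    scanned text that is a prefix of some pattern (inefficient but simple)."""
    prefixes = {p[:i] for p in patterns for i in range(len(p) + 1)}
    def fsm(state, byte):
        s = state + byte
        while s not in prefixes:               # emulate AC failure transitions
            s = s[1:]
        return s
    def match(state):
        return [p for p in patterns if state.endswith(p)]
    return fsm, match

def decompress_plus_ac(tokens, patterns, window_size=32 * 1024):
    """Naive baseline: decompress LZ77 tokens and feed every byte,
    including bytes copied via back-references, through the FSM."""
    fsm, match = make_fsm(patterns)
    state, swin, matches = "", [], []
    for tok in tokens:
        if isinstance(tok, tuple):             # back-reference (dist, len)
            dist, length = tok
            for _ in range(length):
                b = swin[-dist]                # -dist slides, handling overlaps
                state = fsm(state, b)
                matches.extend(match(state))
                swin.append(b)
        else:                                  # literal byte
            state = fsm(state, tok)
            matches.extend(match(state))
            swin.append(tok)
        del swin[:-window_size]                # bound the LZ77 sliding window
    return "".join(swin), matches

tokens = list("attack at ") + [(10, 6)]        # second "attack" is a pointer
text, found = decompress_plus_ac(tokens, ["attack", "tack"])
print(text)   # -> attack at attack
print(found)  # -> ['attack', 'tack', 'attack', 'tack']
```

Note how the bytes copied through the pointer are rescanned from scratch; ACCH's contribution is precisely to skip most of that rescanning by remembering the earlier scan results.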

point to an area covered by another pointer). Indeed, the distribution of pointers in the real-life data set (see Section VII for details on the data set) is spread across the entire window. On the other hand, pattern matching on non-compressed traffic requires storing only one or two packets (to handle cross-packet data), where the maximum size of a TCP packet is 1.5 kB. Hence, dealing with compressed traffic poses a memory requirement higher by a factor of 10. Thus, a mid-range firewall, which handles 30K concurrent sessions, requires 1 GB of memory, while a high-end firewall with 300K concurrent sessions requires 10 GB. This memory requirement has implications not only for the price and feasibility of the architecture, but also for the capability to perform caching. The space requirement is not the focus of this paper. Still, recent work by Afek et al. has shown techniques that circumvent this problem and drastically reduce the space requirement by over 80%, with only a slight increase in time. It has also shown a method to combine that technique with ACCH, which achieves improvements of almost 80% in space and above 40% in time for the overall DPI processing of compressed Web traffic.
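A quick back-of-the-envelope check of these memory figures (using the 32 kB window and 1.5 kB packet sizes quoted above; the paper rounds 0.92 GB up to about 1 GB):

```python
KB, GB = 1024, 1024 ** 3
window = 32 * KB                          # LZ77 sliding window per session

mid_range = 30_000 * window / GB          # mid-range firewall, 30K sessions
high_end = 300_000 * window / GB          # high-end firewall, 300K sessions
uncompressed = 2 * 1.5 * KB               # vs. one or two 1.5 kB TCP packets

print(f"mid-range:  {mid_range:.2f} GB")  # ~0.92 GB (quoted as ~1 GB)
print(f"high-end:   {high_end:.1f} GB")   # ~9.2 GB (quoted as ~10 GB)
print(f"ratio vs. uncompressed: {window / uncompressed:.1f}x")  # ~10.7x
```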
Time: Recall that pattern matching is a dominant factor in the
performance of security tools, while performing
decompression further increases the overall time penalty.
Therefore, security tools tend to ignore compressed traffic.
This paper focuses on reducing the time requirement by using the information gathered during the decompression phase. We note that pattern matching with the AC algorithm requires significantly more time than decompression, since decompression is based on consecutive memory reads from the sliding window and hence has a low per-byte read cost. The AC algorithm, on the other hand, employs a very large DFA that is accessed with random memory reads and typically does not fit in cache, thus requiring main-memory accesses. Appendix A introduces a model that compares the time requirements of the decompression and the AC algorithm.
Experiments on real data show that decompression takes only a
negligible 3.5% of the time it takes to run the AC algorithm. For
that reason, we focus on improving AC performance. We show
that we can reduce the AC time by skipping more than 70% of
the DFA scans and hence reduce the total time requirement for
handling pattern matching in compressed traffic by more than
60%.
IV. OVERVIEW OF SYSTEM ARCHITECTURE
The packet-capturing module receives every packet.
The payload-extraction module extracts the application-layer payload.
Using the time-stamps module (TLM), each incoming character is cross-checked against non-repeating types of variable strings.
The character-lookup module (CLM) is responsible for identifying frequently accessed character strings.
The repetition-detection module is responsible for identifying repetitions that are not detected by the CLM.
The frequently-appearing-repetition module (FRM) reduces resource usage by creating opportunities to share the effort across frequent bases.
VIII. RELATED WORK
As per M. Fisk and G. Varghese, "An analysis of fast string matching applied to content-based forwarding and intrusion detection," Tech. Rep. CS2001-0670 (updated version), 2002: pattern matching is one of the most performance-critical components in network intrusion detection and prevention systems and needs to be accelerated by carefully designed architectures. That line of work presents a highly parameterized multilevel pattern-matching architecture (MPM), implemented on FPGA by exploiting redundant resources among patterns for less chip area. In practice, MPM can be partitioned into several pipelines for high frequency. It also presents a pattern-set compiler that can generate RTL code of the MPM for a given pattern set and predefined parameters. One MPM architecture was generated by this compiler from Snort rules on a Xilinx FPGA. The results show that MPM can achieve 4.3 Gbps throughput with only 0.22 slices per character, about half the chip area of the most area-efficient architecture in the literature. MPM can be parameterized, with potential for more than 100 Gbps throughput.
The problem of pattern matching on compressed data has received attention in the context of the Lempel-Ziv compression family. However, LZW/LZ78 are more attractive and simpler for pattern matching than LZ77. HTTP uses LZ77 compression, which has a simpler decompression algorithm, but performing pattern matching on it is a more complex task that requires some kind of decompression (see Section II). Hence, all the above works are not applicable to our case. Klein and Shapira suggest a modification to the LZ77 compression algorithm to make the matching task in files easier; however, the suggestion is not implemented in today's HTTP. The works of M. Farach and M. Thorup, "String matching in Lempel-Ziv compressed strings," in Proc. 27th Annu. ACM Symp. Theory Comput., 1995, pp. 703-712, and L. Gasieniec, M. Karpinski, W. Plandowski, and W. Rytter, "Efficient algorithms for Lempel-Ziv encoding (extended abstract)," in Proc. 4th Scandinavian Workshop Algor. Theory, 1996, pp. 392-403, are the only papers we are aware of that deal with pattern matching over LZ77. However, in those papers the algorithms are for a single pattern and require two passes over the compressed text (file), which is not applicable to network domains that require on-the-fly processing. One outcome of this paper is the surprising conclusion that pattern matching on compressed HTTP traffic, with the overhead of decompression, is faster than pattern matching on regular traffic. We note that other works in the context of pattern matching on compressed data, such as U. Manber, "A text compression scheme that allows fast searching directly in the compressed file," Trans. Inf. Syst., vol. 15, no. 2, pp. 124-136, Apr. 1997, and N. Ziviani, E. de Moura, G. Navarro, and R. Baeza-Yates, "Compression: A key for next-generation text retrieval systems," Computer, vol. 33, no. 11, pp. 37-44, 2000, have shown a similar conclusion, stating that compressing a file once and then performing pattern matching on the compressed file accelerates the scanning process.

Algorithm 2: ACCH - Optimization II

absPosition - the absolute position from the beginning of the data.
After line 38: absPosition += len
After line 49: absPosition++
MatchTable - a hash table where each entry represents a match. The key
is the match's absPosition, and the value is the list of patterns that
were located at that position.
Function scanAC - a new line is added after line 4: add the patterns in
Match(state) to MatchTable(absPosition)
Procedure ACCH - instead of the while loop (lines 41-47):
  handleInternalMatches(state, curPos, len - 1)
  scanSegment(state, curPos, len - 1)
Function scanSegment - should ignore matches found by scanAC, since all
matches within a pointer are located by the functions scanLeft and
handleInternalMatches.

1: function handleInternalMatches(start, end)
2:   for curPos = start to end do
3:     if refPtrInfo[curPos].status = Match then


4:       if MatchTable(curPos) contains patterns shorter than or equal
         to curPos then
5:         add those patterns to MatchTable(absPosition)
6:         curPtrInfo[curPos].status = Match
7:       else curPtrInfo[curPos].status = Check
8:       end if
9:     else
10:      curPtrInfo[curPos].status = refPtrInfo[curPos].status
11:    end if
12:  end for

Algorithm 3: ACCH - Optimization III

CDepth1, CDepth2 - instead of one constant parameter CDepth, we
maintain two, where CDepth1 < CDepth2
Function scanAC - line 8 changes to:
  else if Depth(status) <= CDepth1 then status = Uncheck1
  else status = Uncheck2
Function scanSegment - line 21: instead of searching for the maximal
Uncheck, it searches for the maximal Uncheck1 or Uncheck2
Function scanSegment - lines 23-30: the CDepth parameter changes to
CDepth1 or CDepth2, depending on whether the state found on line 21 is
Uncheck1 or Uncheck2, respectively

V. CONCLUSION
Nowadays, almost every modern security tool relies on a pattern-matching algorithm, and much Web traffic uses HTTP compression. Security tools typically either ignore this traffic, leaving security holes, or disable compression in the connection parameters, which hurts the performance and bandwidth of both the client side and the server side. Our algorithm skips up to 84% of the data scan based on information stored in the compressed data. Surprisingly, it is faster to perform pattern matching on compressed data, even with the cost of decompression, than on uncompressed traffic. We observe that ACCH is not intrusive to the AC algorithm, so all methods that improve the AC DFA are orthogonal to ACCH and remain applicable. This is the first paper that analyzes the problem of on-the-fly multipattern matching on compressed HTTP traffic and suggests a solution.

REFERENCES

[1] M. Fisk and G. Varghese, "An analysis of fast string matching applied to content-based forwarding and intrusion detection," Tech. Rep. CS2001-0670 (updated version), 2002.
[2] Port80 Software, San Diego, CA [Online]. Available: http://www.port80software.com/surveys/top1000compression
[3] Website Optimization, LLC, Ann Arbor, MI [Online]. Available: http://www.websiteoptimization.com
[4] P. Deutsch, "GZIP file format specification," RFC 1952, May 1996 [Online]. Available: http://www.ietf.org/rfc/rfc1952.txt
[5] P. Deutsch, "DEFLATE compressed data format specification," RFC 1951, May 1996 [Online]. Available: http://www.ietf.org/rfc/rfc1951.txt
[6] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Inf. Theory, vol. IT-23, no. 3, pp. 337-343, May 1977.
[7] D. Huffman, "A method for the construction of minimum-redundancy codes," Proc. IRE, vol. 40, no. 9, pp. 1098-1101, Sep. 1952.
[8] Zlib [Online]. Available: http://www.zlib.net
[9] A. Aho and M. Corasick, "Efficient string matching: An aid to bibliographic search," Commun. ACM, vol. 18, pp. 333-340, Jun. 1975.
[10] R. Boyer and J. Moore, "A fast string searching algorithm," Commun. ACM, vol. 20, no. 10, pp. 762-772, Oct. 1977.
[11] N. Ziviani, E. de Moura, G. Navarro, and R. Baeza-Yates, "Compression: A key for next-generation text retrieval systems," Computer, vol. 33, no. 11, pp. 37-44, 2000.
[12] T. Song, W. Zhang, D. Wang, and Y. Xue, "A memory efficient multiple pattern matching architecture for network security," in Proc. IEEE INFOCOM, Apr. 2008, pp. 166-170.
[13] J. van Lunteren, "High-performance pattern-matching for intrusion detection," in Proc. IEEE INFOCOM, Apr. 2006, pp. 1-13.
[14] V. Dimopoulos, I. Papaefstathiou, and D. Pnevmatikatos, "A memory-efficient reconfigurable Aho-Corasick FSM implementation for intrusion detection systems," in Proc. IC-SAMOS, Jul. 2007, pp. 186-193.
[15] N. Tuck, T. Sherwood, B. Calder, and G. Varghese, "Deterministic memory-efficient string matching algorithms for intrusion detection," in Proc. IEEE INFOCOM, 2004, vol. 4, pp. 2628-2639.
[16] M. Alicherry, M. Muthuprasanna, and V. Kumar, "High speed pattern matching for network IDS/IPS," in Proc. IEEE ICNP, 2006, pp. 187-196.



First Author: Abdul Mannan Virani received his B.E. (CSE) degree from RCET Bhilai, Pandit Ravi Shankar Shukla University (Pt. RSU), Raipur, in 2006. From 2006 to 2010 he worked in various multinational companies in consulting and customer-support roles. He is currently an M.Tech student in Computer Science Engineering at DIMAT Raipur, Chhattisgarh Swami Vivekananda University, Bhilai. His research interests are in the areas of wireless and network security, with a current focus on secure data services in cloud computing and secure computation outsourcing.



Second Author: Somesh Kumar Dewangan received his M.Tech in Computer Science and Engineering from RCET Bhilai, Chhattisgarh Swami Vivekananda University, Bhilai, in 2009, and before that an MCA degree in Computer Application from MPBO University, Bhopal, India, in 2005. He has served as lecturer, Assistant Professor, and Associate Professor at Disha Institute of Management and Technology, Chhattisgarh Swami Vivekananda Technical University, Bhilai, India, since 2005. His research interests include digital signal processing and image processing, natural language processing, neural networks, artificial intelligence, information and network security, mobile networking, cryptography, and Android-based applications.
