
String Matching

Given: two strings T[1..n] (the text) and P[1..m] (the pattern) over an alphabet Σ. Goal: find all occurrences of the pattern P[1..m] in the text T[1..n].

Example: Σ = {a, b, c}
[Figure: a text T and a pattern P over Σ, with P aligned under T at shift s = 3]

- P occurs with shift s.
- P occurs beginning at position s+1.
- s is a valid shift.

The string matching problem asks for all occurrences of the pattern P in the given text T.

Background
String matching: naive method
n = size of the input string, m = size of the pattern to be matched
Running time: O((n-m+1)m), which is Θ(n^2) if m = floor(n/2)

Naive-String-Matcher(T, P)
1. n = T.length
2. m = P.length
3. for s = 0 .. n-m
4.     if P[1..m] == T[s+1..s+m]
5.         print "Pattern occurs with shift" s
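The naive matcher above could be sketched in Python roughly as follows. The function name and the sample strings are illustrative assumptions, and shifts are reported 0-based, so shift s corresponds to position s+1 in the 1-based notation of the slides.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        # Compare the m-character window starting at shift s with the pattern.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


# Sample input chosen for illustration only.
print(naive_string_matcher("abcabaabcabac", "abaa"))  # [3]
```

Each of the n-m+1 shifts may cost up to m character comparisons, which is where the O((n-m+1)m) bound comes from.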

How it works
Consider a hashing scheme.
- Let the characters in both arrays T and P be digits in radix-Σ notation (for example, Σ = {0, 1, ..., 9}).
- Assume each character is a digit in radix-d notation (e.g. d = 10).
- Let p be the value of the characters in P.
- Choose a prime number q such that dq fits within a computer word, to speed up computations.
- Compute (p mod q). The value of p mod q is what we use to find all matches of the pattern P in T.

- Compute (T[s+1..s+m] mod q) for s = 0 .. n-m.
- Test against P only those sequences in T having the same (mod q) value.

Hash the pattern P into a numeric value. Let a string be represented by the sum of its digits.
Example: { A, B, C, ..., Z } → { 0, 1, 2, ..., 25 }
BAN → 1 + 0 + 13 = 14
CARD → 2 + 0 + 17 + 3 = 22

Problem: for long patterns, or for large alphabets, the number representing a given string may be too large to be practical.
Solution: use the MOD operation. When taken MOD q, values will be < q.
Example:
BAN = 1 + 0 + 13 = 14, 14 mod 13 = 1, so BAN → 1
CARD = 2 + 0 + 17 + 3 = 22, 22 mod 13 = 9, so CARD → 9
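The letter-sum hash with the MOD step can be checked with a few lines of Python. This is only a sketch of the toy scheme from the example: the helper name `additive_hash` is an assumption, and letters are coded A = 0 through Z = 25.

```python
def additive_hash(word: str, q: int) -> int:
    """Sum the letter codes (A=0, ..., Z=25) and reduce the sum modulo q."""
    total = sum(ord(ch) - ord("A") for ch in word.upper())
    return total % q


for word in ("BAN", "CARD"):
    print(word, additive_hash(word, 13))
# BAN 1
# CARD 9
```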

Example
pattern P = 3 1 4 1 5, with 31415 mod 13 = 7

text T = 2 3 5 9 0 2 3 1 4 1 5 2 6 7 3 9 9 2 1

- Window 3 1 4 1 5 (shift s = 6): 31415 mod 13 = 7, a valid match.
- Window 6 7 3 9 9 (shift s = 12): 67399 mod 13 = 7, a spurious hit.

Comp 750, Fall 2009
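To see where the hash hits in this example fall, one could scan every 5-digit window of the text and compare its value mod 13 with that of the pattern. The sketch below recomputes each window from scratch for clarity (it is not the rolling update used by the real algorithm), and the function name is an assumption.

```python
def classify_hash_hits(text: str, pattern: str, q: int) -> list[tuple[int, str, str]]:
    """List the shifts whose window has the same value mod q as the pattern."""
    target = int(pattern) % q
    m = len(pattern)
    hits = []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        if int(window) % q == target:
            # A hash hit is only a candidate; compare the digits to confirm it.
            kind = "valid match" if window == pattern else "spurious hit"
            hits.append((s, window, kind))
    return hits


for hit in classify_hash_hits("2359023141526739921", "31415", 13):
    print(hit)
# (6, '31415', 'valid match')
# (12, '67399', 'spurious hit')
```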

Searching
The pattern is M characters long.
hash_p = hash value of the pattern
hash_t = hash value of the first M letters in the body of text
do
    if (hash_p == hash_t)
        brute force comparison of pattern and selected section of text
    hash_t = hash value of next section of text, one character over
until (end of text or brute force comparison == true)

Spurious Hits
Question
Does a hash value match mean that the patterns match?

Answer
No. A hash match at a shift where the pattern does not actually occur is called a spurious hit.

Possible cases
- The MOD operation interfered with the uniqueness of hash values:
  14 mod 13 = 1
  27 mod 13 = 1
  The MOD value q is usually chosen as a prime such that 10q just fits within one computer word.
- Information is lost in generalization (addition):
  BAN → 1 + 0 + 13 = 14
  CAM → 2 + 0 + 12 = 14
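The second case can be made concrete by comparing a plain letter sum with a positional (radix-d) value, which is what the algorithm uses. This is a minimal sketch: both helper names are hypothetical, and d = 26 is assumed here because the example uses letters rather than decimal digits.

```python
def letter_sum(word: str) -> int:
    """Order-insensitive hash: add up the letter codes (A=0, ..., Z=25)."""
    return sum(ord(ch) - ord("A") for ch in word)


def radix_value(word: str, d: int = 26) -> int:
    """Positional value: treat the letters as digits of a radix-d number."""
    value = 0
    for ch in word:
        value = d * value + (ord(ch) - ord("A"))  # Horner's rule
    return value


print(letter_sum("BAN"), letter_sum("CAM"))    # 14 14
print(radix_value("BAN"), radix_value("CAM"))  # 689 1364
```

The sum cannot tell BAN from CAM, while the positional value can. Collisions only reappear once the positional value is reduced mod q.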

Assume each character is a digit in radix-d notation (e.g. d = 10).
p = decimal value of the pattern
t_s = decimal value of the substring T[s+1..s+m], for s = 0, 1, ..., n-m

t_s = p if and only if s is a valid shift.
We never explicitly compute a new value. We simply adjust the existing value as we move over one character.
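The adjustment can be sketched as a single function: remove the contribution of the outgoing high-order digit, shift everything by one radix position, and append the incoming digit. The name `roll` and its parameter names are assumptions; h stands for d^(m-1) mod q.

```python
def roll(t_s: int, outgoing: int, incoming: int, h: int, d: int = 10, q: int = 13) -> int:
    """Turn the hash of window s into the hash of window s+1 in O(1) time."""
    # Python's % always returns a value in [0, q), even if the difference is negative.
    return (d * (t_s - outgoing * h) + incoming) % q


h = pow(10, 4, 13)         # d^(m-1) mod q for m = 5, d = 10, q = 13
t = 31415 % 13             # hash of the window 3 1 4 1 5
t_next = roll(t, 3, 2, h)  # drop the leading 3, append 2 to get 1 4 1 5 2
print(t, t_next, 14152 % 13)  # 7 8 8
```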

Code
RABIN-KARP-MATCHER(T, P, d, q)
    n ← length[T]
    m ← length[P]
    h ← d^(m-1) mod q
    p ← 0
    t_0 ← 0
    for i ← 1 to m                                  (Preprocessing)
        do p ← (d*p + P[i]) mod q
           t_0 ← (d*t_0 + T[i]) mod q
    for s ← 0 to n - m                              (Matching)
        do if p = t_s
              then if P[1..m] = T[s+1..s+m]
                      then print "Pattern occurs with shift" s
           if s < n - m
              then t_(s+1) ← (d*(t_s - T[s+1]*h) + T[s+m+1]) mod q
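A runnable Python sketch of the same procedure is given below. It is one possible translation, not a reference implementation: characters are mapped to numbers with `ord`, the defaults d = 256 and q = 1000003 are assumptions chosen for illustration, and shifts are 0-based.

```python
def rabin_karp_matcher(text: str, pattern: str, d: int = 256, q: int = 1_000_003) -> list[int]:
    """Return every shift at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # weight of the high-order digit, d^(m-1) mod q
    p = t = 0
    # Preprocessing: hash the pattern and the first window with Horner's rule.
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    # Matching: verify character by character only when the hashes agree.
    for s in range(n - m + 1):
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Drop text[s], shift by one radix position, append text[s + m].
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("2359023141526739921", "31415", d=10, q=13))  # [6]
```

With d = 10 and q = 13 the hash also matches at shift 12, but the explicit comparison rejects that spurious hit, so only shift 6 is reported.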

Performance
Preprocessing (determining each pattern hash): Θ(m)
Worst-case running time: Θ((n-m+1)m), no better than the naive method
Expected case: if we assume the number of hits is constant compared to n, we expect O(n), because only the hash hits are pattern-matched, not all shifts

Rabin-Karp Complexity
If a sufficiently large prime number is used for the hash function, the hashed values of two different patterns will usually be distinct. If this is the case, searching takes O(N) time, where N is the number of characters in the larger body of text. It is always possible to construct a scenario with a worst case complexity of O(MN). This, however, is likely to happen only if the prime number used for hashing is small.
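The effect of the modulus size can be illustrated by counting spurious hits for a small and a large q on a digit string. This is a rough sketch under stated assumptions: the function name is hypothetical, each window is hashed from scratch rather than rolled, and the sample text and moduli are chosen only for illustration.

```python
def count_spurious_hits(text: str, pattern: str, q: int, d: int = 10) -> int:
    """Count the shifts where the hashes agree but the window differs from the pattern."""
    m = len(pattern)

    def digit_hash(s: str) -> int:
        value = 0
        for ch in s:
            value = (d * value + int(ch)) % q
        return value

    target = digit_hash(pattern)
    return sum(
        1
        for s in range(len(text) - m + 1)
        if digit_hash(text[s:s + m]) == target and text[s:s + m] != pattern
    )


text, pattern = "2359023141526739921", "31415"
print(count_spurious_hits(text, pattern, q=13))         # 1
print(count_spurious_hits(text, pattern, q=1_000_003))  # 0
```

With q = 1000003, every 5-digit window value is smaller than q, so two different windows can never share a hash in this particular example.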

Thank You
