Sunteți pe pagina 1din 22

2/16/2017

Bloom Filters

References
A. Broder and M. Mitzenmacher, “Network
Network applications of Bloom
filters: A survey,” Internet Mathematics, vol. 1 no. 4, pp. 485-509,
2004.

Li Fan, Pei Cao, Jussara Almeida, Andrei Broder, “Summary Cache:


A Scalable Wide-Area Web Cache Sharing Protocol,” IEEE/ACM
Transactions on Networking, Vol. 8, No. 3, June 2000.
o Origin
O i in of
f counting
ntin Bl
Bloom
m filt
filterss

2/16/2017 Bloom Filters (Simon S. Lam) 1

1
2/16/2017

Origin and applications


Randomized data structure introduced by
Burton Bloom [CACM 1970]
o It represents a set for membership queries, with
false positives
o Probability of false positive can be controlled by
design parameters
o When space efficiency is important, a Bloom filter
ma be used if the effect of
may f false p
positives
sitives can be
mitigated.

First applications in dictionaries and


databases

2/16/2017 Bloom Filters (Simon S. Lam) 2

2
2/16/2017

First application in networking:


distributed cache (2000)
Proxy 2
Cache 2
Proxy 1 Summary 1
Cache 1 Summary y3
Summary 2
Summary 3
Proxy 3
Cache 3
Summary 1
Summary 2

 N
Numerous applications
li ti iin networking
t ki since
i 2000
2/16/2017 Bloom Filters (Simon S. Lam) 3

3
2/16/2017

Standard Bloom Filter


 A Bloom filter is an array of m bits representing
a set S = { x1, x2, … , xn} of n elements
o Array set to 0 initially
 k independent hash functions h1, … , hk with range
{1 2
{1, 2, …, m}}
o Assume that each hash function maps each item in the
universe to a random number uniformly over the range
{1, 2, …, m}
 For each element x in S, the bit hi(x) in the array
i sett tto 11, f
is for 1 ≤ i ≤ k,
k
o A bit in the array may be set to 1 multiple times for
ff
different elements
m

2/16/2017 Bloom Filters (Simon S. Lam) 4

4
2/16/2017

A Bloom filter example


((three hash functions))

Insert X1 and X2

Check Y1 and Y2

2/16/2017 Bloom Filters (Simon S. Lam) 5

5
2/16/2017

Standard Bloom Filter (cont.)


To check membership of y in S, check
whether hi(y), 1≤i≤k,
≤ ≤k, are all set to 1
o If not, y is definitely not in S
o Else, we conclude that y is in S, but sometimes this
conclusion is wrong (false positive)
For many applications, false positives are
acceptable
t bl as long
l as the
th probability
b bilit of
fa
false positive is small enough

We will assume that kn < m

2/16/2017 Bloom Filters (Simon S. Lam) 6

6
2/16/2017

False positive probability


 After all members of S have been hashed to a Bloom
filter, the probability that a specific bit is still 0 is
1 kn − kn / m
p ' = (1 − )  e = p
m
 For a non member, it may be found to be a member
of S (all of its k bits are nonzero) with false positive
probability
(1 − p ') k
 (1 − p ) k

2/16/2017 Bloom Filters (Simon S. Lam) 7

7
2/16/2017

False positive probability (cont.)


Define

1 kn k
f ' = (1 − p ') = (1 − (1 − ) )
k

m
f = (1 − p ) k = (1 − e − kn / m ) k
 Two competing forces as k increases
o Larger k ->> (1 − p ')) k is smaller for a fixed p’
p

o Larger k -> p’= (1 − 1 / m ) kn is smaller -> 1-p’ larger

2/16/2017 Bloom Filters (Simon S. Lam) 8

8
2/16/2017

False positive rate vs. k


m
Number of bits per member =8
n

Number of

2/16/2017 Bloom Filters (Simon S. Lam) 9

9
2/16/2017

Optimal number k from derivative


Rewrite f as
f = exp(ln(1 − e − kn / m ) k ) = exp( k ln(1 − e − kn / m ))
Let g = k ln(1 − e − kn / m )
Minimizingg g will minimize f = exp(
p( g )
− kn / m
∂g − kn / m k ∂ (1 − e )
= ln(1 − e )+
∂k 1 − e − kn / m ∂k
− kn / m k n − kn / m
= ln(1 − e )+ − kn / m
e = − ln(2) + ln(2) = 0
1− e m
if we plug k = ( m / n ) ln 2 which is optimal
( is
(It i in
i fact
f a global
l b l optimum)
i )
2/16/2017 Bloom Filters (Simon S. Lam) 10

10
2/16/2017

Optimal k from symmetry


− kn / m
 Alternatively, from p=e we get
m
k = − ln( p )
n
From previous slide, we have
− kn / m m
g = k ln(1 − e ) = − ln( p ) ln(1 − p )
n
 From above, symmetry indicates that the
minimum value for g occurs when p=1/2.
p
Thus m m
kopt = − ln(1 / 2) = ln(2)
n n
2/16/2017 Bloom Filters (Simon S. Lam) 11

11
2/16/2017

Optimal k from symmetry


using the precise probability of false positive

f ' = ((1 − p ')) k = exp( ( − p '))


p( k ln(1 ))

From p ' = ((1 − 1 / m ) kn , solvingg for k


1
k= ln( p ')
l (1 − 1 / m )
n ln(1
Let g ' = k ln(1
( − p ')) ((in equation
q for f ' above))
1
= ln( p ') ln(1 − p ')
n ln(1 − 1 / m )
2/16/2017 Bloom Filters (Simon S. Lam) 12

12
2/16/2017

Using the precise probability of false


positive to get
p g optimal
p k (cont.)
( )
From previous slide
1
g' = ln( p ') ln(1 − p ')
n ln(1 − 1 / m )

By symmetry, g’ (also f’) minimized at p’=1/2

Optimal k is
1 1
k 'opt = ln( p ') = ln(1 / 2)
n ln(1 − 1 / m ) n ln(1 − 1 / m )

2/16/2017 Bloom Filters (Simon S. Lam) 13

13
2/16/2017

Optimal number of hash functions


 Using k = m ln(2) the false positive rate is
opt
n
m m
ln(2)
( ) ln(2)
( )
(1 − p ) n
= (0.5) n
 (0.6185) m / n , where ln(2) = 0.6931
 In practice, k should be an integer. May choose an integer
value
l smaller
ll than
h kopt to reduce
d h
hashing
hi overhead
h d
m/n denotes
False positive rate
bits per entry

2/16/2017 Bloom Filters (Simon S. Lam) 14

14
2/16/2017

False positive rate vs. bits per entry


False
positive
rate 4 hash functions

Using optimal number


of hash functions

m/n
2/16/2017 Bloom Filters (Simon S. Lam) 15

15
2/16/2017

Standard Bloom Filter tricks


Two Bloom filters representing sets S1 and
S2 with the same number of bits and using
g
the same hash functions.
o A Bloom filter that represents the union of S1 and
S2 can be obtained by taking the OR of the bit
vectors
A Bloom filter can be halved in size. Suppose
the
h size
i is
i a power off22.
o Just OR the first and second halves of the bit
vector
o When hashing to do a lookup, the highest order bit
is masked
Notation: OR denotes bitwise or
2/16/2017 Bloom Filters (Simon S. Lam) 16

16
2/16/2017

Counting Bloom filters


Proposed by Fan et al. [2000] for distributed
cach ng
caching
Every entry in a counting Bloom filter is a small
counter ((rather than a single
g bit).
)
o When an item is inserted into the set, the
corresponding counters are each incremented by 1
o When
h an item is deleted
d l d from
f the
h set, the
h
corresponding counters are each decremented by 1
To avoid counter overflow,
overflow its size must be
sufficiently large. It was found that 4 bits per
counter
u are enough.
ug .

2/16/2017 Bloom Filters (Simon S. Lam) 17

17
2/16/2017

Counter overflow probability


Consider a set of n elements, k hash
functions, and m counters
o C(i) is the count for the ith counter
nk − j
  1  
j
nk 1
P[c(i ) = j ] =      1 − 
 j  m   m

 nk  1
P[c(i ) ≥ j ] ≤   j
 j m
j
 enk 
≤  (a very loose upper bound)
 jm 

2/16/2017 Bloom Filters (Simon S. Lam) 18

18
2/16/2017

Counter overflow probability (cont.)


Choose k such that k ≤ m/n (ln 2)
Then j j
 enk   e ln 2 
P[c(i ) ≥ j ] ≤   ≤ 
 jm   j 
j
 e ln 2 
P[max c(i ) ≥ j ] ≤ m   for some i
1≤i ≤ m
 j 
Using 4 bits, each counter counts from 0 to
15
−15
P[max c(i ) ≥ 16] ≤ m × 1.37 × 10
1≤i ≤ m

2/16/2017 Bloom Filters (Simon S. Lam) 19

19
2/16/2017

Counter overflow consequences


When a counter does overflow, it may be left
at its
ts maximum
max mum value.
This can later cause a false negative only if
eventuallyy the counter goes
g down to 0 when it
should have remain at nonzero.
The expected
p time to this event is very
y large
g
but is something we need to keep in mind for
any application that does not allow false
negatives
ti

2/16/2017 Bloom Filters (Simon S. Lam) 20

20
2/16/2017

Conclusions
Wherever a list or set is used, and space is at
a premium,
prem um, a Bloom filter
f lter may be used if
f the
effect of false positives can be mitigated
o No false negative
With a counting Bloom filter, false negatives
are possible, albeit highly unlikely

2/16/2017 Bloom Filters (Simon S. Lam) 21

21
2/16/2017

The End

2/16/2017 Bloom Filters (Simon S. Lam) 22

22

S-ar putea să vă placă și