
Randomized Algorithms

Overview of the lecture

 What is a randomized algorithm ?

 Motivation

 Types of randomized algorithms

 Examples of Randomized Algorithm


2
What is a randomized algorithm ?

3
Deterministic Algorithm

(Diagram: Input → Algorithm → Output)
 The output as well as the running time are functions only of the input.
 Goal: Prove for all input instances the algorithm solves the problem
correctly and the number of steps is bounded by a polynomial in the size
of the input.
4
Randomized Algorithm

(Diagram: Input + random bits → Algorithm → Output)
 The output or the running time is a function of both the input and the random bits chosen.
 Behavior can vary even on a fixed input.
5
Motivation for Randomized Algorithms

 Simplicity;

 Performance;

 Reflects reality better (Online Algorithms);

 For many hard problems, randomization helps obtain better complexity bounds than deterministic approaches.

6
Types of Randomized Algorithms

Randomized Las Vegas Algorithms:


 Output is always correct
 Running time is a random variable
Example: Randomized Quick Sort

Randomized Monte Carlo Algorithms:


 Output may be incorrect with some probability
 Running time is deterministic.
Example: Randomized algorithm for approximate median

7
Las Vegas Randomized Algorithms

(Diagram: Input + random numbers → Algorithm → Output)

Goal: Prove that for all input instances the algorithm solves the
problem correctly and the expected number of steps is bounded
by a polynomial in the input size.

Note: The expectation is over the random choices made by the


algorithm.
8
Probabilistic Analysis of Algorithms

(Diagram: Random input distribution → Algorithm → Output)

Input is assumed to be from a probability distribution.

Goal: Show that for all inputs the algorithm works correctly and
for most inputs the number of steps is bounded by a polynomial in
the size of the input.

9
Monte Carlo Randomized Algorithms

(Diagram: Input + random numbers → Algorithm → Output)

Goal: Prove that the algorithm


– with high probability solves the problem correctly;
– for every input the expected number of steps is bounded by a
polynomial in the input size.

Note: The expectation is over the random choices made by the algorithm.
Monte Carlo versus Las Vegas
 A Monte Carlo algorithm produces an answer that is correct with non-zero probability, whereas a Las Vegas algorithm always produces the correct answer.

 The running time of both types of randomized algorithms is a random variable whose expectation is bounded, say, by a polynomial in the input size.

 These expectations are only over the random choices made by the algorithm, independent of the input. Thus independent repetitions of Monte Carlo algorithms drive down the failure probability exponentially.

11
Example 1 : Randomized Quick
Sort (Las Vegas Algorithm)

12
QuickSort(𝑺)

QuickSort(𝑺)
{  If (|𝑺| > 1)
      Pick and remove an element 𝒙 from 𝑺;
      (𝑺<𝒙 , 𝑺>𝒙 ) ← Partition(𝑺, 𝒙);
      return Concatenate(QuickSort(𝑺<𝒙 ), 𝒙, QuickSort(𝑺>𝒙 ));
}

13
QuickSort(𝑺)

QuickSort(𝑨,𝒍, 𝒓)
{  If (𝒍 < 𝒓)
      𝒙 ← 𝒍;
      𝒊 ← Partition(𝑨, 𝒍, 𝒓, 𝒙);
      QuickSort(𝑨, 𝒍, 𝒊 − 𝟏);
      QuickSort(𝑨, 𝒊 + 𝟏, 𝒓);
}

 Average case running time: O(n log n)
 Worst case running time: O(n²)
 Distribution sensitive: Time taken depends upon the initial permutation of 𝑨.
14
Randomized QuickSort(𝑺)

QuickSort(𝑨,𝒍, 𝒓)
{  If (𝒍 < 𝒓)
      Swap 𝑨[𝒍] with an element selected uniformly at random from 𝑨[𝒍..𝒓];
      𝒙 ← 𝒍;
      𝒊 ← Partition(𝑨, 𝒍, 𝒓, 𝒙);
      QuickSort(𝑨, 𝒍, 𝒊 − 𝟏);
      QuickSort(𝑨, 𝒊 + 𝟏, 𝒓);
}
 Distribution insensitive: Time taken does not depend on the initial permutation of 𝑨.
 Time taken depends upon the random choices of pivot elements.
1. For a given input, expected (average) running time: O(n log n)
2. Worst case running time: O(n²)
15
Partition
Partition(𝑨, 𝒍, 𝒓, 𝒙)
{
    swap(𝑨[𝒓], 𝑨[𝒙]);        // move the chosen pivot to position 𝒓
    pivot ← 𝑨[𝒓];
    𝒊 ← 𝒍 − 1;
    for (𝒋 ← 𝒍 to 𝒓 − 1) {
        if (𝑨[𝒋] ≤ pivot) {
            𝒊 ← 𝒊 + 1;
            swap(𝑨[𝒊], 𝑨[𝒋]);
        }
    }
    swap(𝑨[𝒊+1], 𝑨[𝒓]);
    return 𝒊 + 1;
}
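
Below is a minimal executable sketch of the same routine in Python, assuming 0-indexed lists; the names randomized_quicksort and partition are illustrative, not from the slides.

import random

def partition(A, l, r, x):
    """Partition A[l..r] around the pivot at index x; return the pivot's final index."""
    A[r], A[x] = A[x], A[r]          # move the chosen pivot to the end
    pivot = A[r]
    i = l - 1
    for j in range(l, r):
        if A[j] <= pivot:
            i += 1
            A[i], A[j] = A[j], A[i]
    A[i + 1], A[r] = A[r], A[i + 1]
    return i + 1

def randomized_quicksort(A, l=0, r=None):
    """Sort A[l..r] in place, choosing each pivot uniformly at random."""
    if r is None:
        r = len(A) - 1
    if l < r:
        x = random.randint(l, r)     # uniformly random pivot index
        i = partition(A, l, r, x)
        randomized_quicksort(A, l, i - 1)
        randomized_quicksort(A, i + 1, r)

# Usage: A = [2, 9, 8, 3, 5, 4, 1, 6, 10, 7]; randomized_quicksort(A)  # A is now sorted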
Analysis of Randomized Quick Sort

17
Linearity of Expectation

If X1, X2, …, Xn are random variables, then

    E[ Σ_{i=1}^{n} Xi ] = Σ_{i=1}^{n} E[Xi]
18
Notation
z2 z9 z8 z3 z5 z4 z1 z6 z10 z7
2 9 8 3 5 4 1 6 10 7

 Rename the elements of A as z1, z2, . . . , zn, with zi being the ith
smallest element (Rank “i”).

 Define Zij = {zi , zi+1, . . . , zj } to be the set of elements between zi and zj, inclusive.

19
Expected Number of Total Comparisons
in PARTITION
Let Xij = I{ zi is compared to zj } be an indicator random variable.
Let X be the total number of comparisons performed by the algorithm. Then

    X = Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} Xij

The expected number of comparisons performed by the algorithm is

    E[X] = E[ Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} Xij ]
         = Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} E[Xij]              (by linearity of expectation)
         = Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} Pr{ zi is compared to zj }
20
Comparisons in PARTITION
Observation 1: Each pair of elements is compared at most once
during the entire execution of the algorithm

 Elements are compared only to the pivot point!

 Pivot point is excluded from future calls to PARTITION

Observation 2: Only the pivot is compared with elements in both


partitions
z2 z9 z8 z3 z5 z4 z1 z6 z10 z7
2 9 8 3 5 4 1 6 10 7

Z1,6 = {1, 2, 3, 4, 5, 6}    {7} (pivot)    Z8,10 = {8, 9, 10}

Elements in different partitions are never compared.
Comparisons in PARTITION
z2 z9 z8 z3 z5 z4 z1 z6 z10 z7
2 9 8 3 5 4 1 6 10 7

Z1,6= {1, 2, 3, 4, 5, 6} {7} Z8,10 = {8, 9, 10}

Pr{zi is compared to z j }?
Case 1: a pivot x is chosen such that zi < x < zj
 zi and zj will never be compared

Case 2: zi or zj is the pivot


 zi and zj will be compared
 only if one of them is chosen as pivot before any other element
in range zi to zj
22
Expected Number of Comparisons in
PARTITION
Pr {Zi is compared with Zj}

= Pr{Zi or Zj is chosen as pivot before other elements in Zi,j} = 2 / (j-i+1)

n 1 n
E[ X ]    Pr{z
i 1 j  i 1
i is compared to z j }

n 1 n 1 n i n 1 n
n
2 2 2 n 1
E[ X ]          O(lg n)
i 1 j i 1 j  i  1 i 1 k 1 k  1 i 1 k 1 k i 1

= O(nlgn)
23
Example 2 : Max Cut (Monte Carlo
Algorithm)

24
Global Cut

 Given an undirected graph G = ( V , E ), a cut in G


is a pair ( S , V – S ) of two sets S and V – S that
split the nodes into two groups.
 The size or cost of a cut, denoted by c ( S, V – S),
is the number of edges with one endpoint in S and one
in V – S.
 A global min cut is a cut in G with the least total cost
(min edges). A global max cut is a cut in G with
maximum total cost (max edges).
Global Cut - Example

(Figure: the vertex set split into two groups, S and V − S; the cut edges are those with one endpoint in each group.)
Global Cut

 Interestingly: There are many polynomial-time algorithms


known for global min-cut.
 Global max-cut is NP -hard and no polynomial-time
algorithms are known for it.
 Today, we'll see an algorithm for approximating global
max-cut.
 Next, we'll see a randomized algorithm for finding a
global min-cut.
Approximating Max Cut

 For a maximization problem, an α-approximation


algorithm is an algorithm that produces a value that is
within a factor of α of the true value.
 A 0.5-approximation to max-cut would produce a cut whose size is at least 50% of the size of the true largest cut.
 Our goal will be to find a randomized approximation
algorithm for max-cut.
A Really Simple Algorithm

 Here is our algorithm (a code sketch follows the list):


 For each node, toss a fair coin.
 If it lands heads, place the node into one part of the cut.
 If it lands tails, place the node into the other part of the
cut.
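
As mentioned above, here is a minimal Python sketch of this coin-flip cut, assuming the graph is given as n vertices 0..n−1 and a list of edges; the name random_cut is illustrative.

import random

def random_cut(n, edges):
    """Toss a fair coin per vertex; return the set S and the number of crossing edges."""
    S = {v for v in range(n) if random.random() < 0.5}
    cut_size = sum(1 for (u, v) in edges if (u in S) != (v in S))
    return S, cut_size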
Analyzing the Algorithm

 On expectation, how large of a cut will this algorithm find?

 For each edge e, let Cₑ be an indicator random variable that is 1 if e crosses the cut and 0 otherwise.

 Then the number of edges X crossing the cut is given by X = Σₑ Cₑ.
What is Expected?

 The expected number of edges crossing the cut is given by

    E[X] = Σₑ E[Cₑ] = Σₑ Pr[e crosses the cut]
Four Possible Choices

 Let u and v be the endpoints of edge e.

 There are four different possibilities, each occurring with probability 1/4:
  Both u and v belong to S
  Both u and v belong to V − S
  u belongs to S and v belongs to V − S
  v belongs to S and u belongs to V − S

 The edge e crosses the cut in the last two cases, i.e., with probability 1/2.
What is Expected is Unexpected

 The expected number of edges crossing the cut is given by

    E[X] = Σₑ Pr[e crosses the cut] = m/2

 All cuts are of size ≤ m, so this is always within a factor of one half of the optimal!
Randomized Approximation
Algorithm
 This algorithm is a randomized 0.5-approximation of max-cut (in expectation).
 The algorithm runs in O(n) time.
 It is NP-hard to find a true max-cut, but it is not at all hard to find a cut that has size at least half of the max-cut on average.
Improving the Odds

 Running our algorithm will, on expectation, produce a cut


with size m / 2.
 However, we don't know the actual probability that our
cut has this size.
 We can use a standard technique to amplify the
probability of success.
Do it Again

 Since any individual run of the algorithm might not


produce a large cut, we could try this approach:
 Run the algorithm k times.
 Return the largest cut found.
 Goal: Show that with the right choice of k, this returns a
large cut with high probability.
 Specifically: Will show we get a cut of size m / 4 with high
probability.
 Runtime is O((m + n)k): k rounds of doing O(m + n) work
(n to build the cut, m to determine the size.)
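
A sketch of this repetition, reusing the hypothetical random_cut from the earlier sketch; best_cut_of_k is an illustrative name.

def best_cut_of_k(n, edges, k):
    """Run the coin-flip cut k times and keep the largest cut found."""
    best_S, best_size = set(), -1
    for _ in range(k):
        S, size = random_cut(n, edges)
        if size > best_size:
            best_S, best_size = S, size
    return best_S, best_size

As the analysis below shows, with k ≈ log_{3/2} m repetitions the returned cut has at least m/4 edges with probability at least 1 − 1/m.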
More Probabilistic Analysis

 Let X₁, X₂, …, Xₖ be random variables corresponding to the sizes of the cuts found by each run of the algorithm.

 Let Ɛ be the event that our algorithm produces a cut of size less than m / 4. Then

    Pr[Ɛ] = Pr[X₁ < m/4 and X₂ < m/4 and … and Xₖ < m/4]

 Since all Xᵢ variables are independent, we have

    Pr[Ɛ] = Pr[X₁ < m/4] · Pr[X₂ < m/4] · … · Pr[Xₖ < m/4]
More Probabilistic Analysis

 Let Y₁, Y₂, …, Yₖ be random variables defined by Yᵢ = m − Xᵢ (the number of edges not crossing the i-th cut), so each Yᵢ is nonnegative.

 Then E[Yᵢ] = m − E[Xᵢ] = m − m/2 = m/2.
Markov’s Inequality

 Markov's Inequality states that for any nonnegative random variable X and any a > 0,

    Pr[X ≥ a] ≤ E[X] / a

 Equivalently, for any c ≥ 1,

    Pr[X ≥ c · E[X]] ≤ 1 / c

 This holds for any nonnegative random variable X.
Markov’s Inequality

 With Yᵢ = m − Xᵢ and E[Yᵢ] = m/2, Markov's inequality gives

    Pr[Xᵢ < m/4] = Pr[Yᵢ > 3m/4] ≤ E[Yᵢ] / (3m/4) = (m/2) / (3m/4) = 2/3

 Then Pr[Ɛ] ≤ (2/3)ᵏ.
Final Comment

 If we run the algorithm k times and take the maximum cut


we find, then the probability that we don't get m / 4
edges or more is at most (2 / 3)k.
 The probability we do get at least m / 4 edges is at least
1 – (2 / 3)k.
 If we set k = log_{3/2} m, the probability we get at least m/4 edges is at least 1 – 1/m.
 There is a randomized, O((m + n) log m)-time algorithm
that finds a (0.25)-approximation to max-cut with
probability 1 – 1 / m.
Why it Works

 Given a randomized algorithm that has a probability p of


success, we can amplify that probability significantly by
repeating the algorithm multiple times.
 This technique is used extensively in randomized
algorithms; we'll see another example of this on min-cut.
Example 3 : Min Cut (Monte Carlo
Algorithm)

43
Graph Contraction
For an undirected graph G, we can construct a new graph G’ by
contracting two vertices u, v in G as follows:
 u and v become one vertex {u,v} and the edge (u,v) is removed;
 the other edges incident to u or v in G are now incident on the
new vertex {u,v} in G’;
Note: There may be multi-edges between two vertices. We just keep them.

(Figure: graph G with vertices a, b, c, d, e, u, v; contracting the edge (u,v) yields graph G’ in which u and v are merged into the single vertex {u,v}.)
Karger’s Min-cut Algorithm
(Figure: (i) graph G with vertices A, B, C, D; (ii) contract nodes C and D into CD; (iii) contract nodes A and CD into ACD; (iv) the resulting cut C = {(A,B), (B,C), (B,D)}. Note: C is a cut, but not necessarily a min-cut.)

Karger’s Min-cut Algorithm

For i = 1 to x:
    repeat
        randomly pick an edge (u, v)
        contract u and v
    until two vertices are left
    cᵢ ← the number of edges between them
Output minᵢ cᵢ

46
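
A minimal Python sketch of Karger's contraction, representing the multigraph as an edge list and assuming a connected graph; karger_min_cut, contract_once, and the trials parameter (playing the role of x above) are illustrative names.

import random

def contract_once(n, edges):
    """One run: contract random edges until two super-vertices remain; return the resulting cut size."""
    label = list(range(n))                   # label[v] = super-vertex currently containing v
    alive = list(edges)                      # remaining multigraph edges (no self-loops)
    groups = n
    while groups > 2:
        u, v = random.choice(alive)          # pick a remaining edge uniformly at random
        lu, lv = label[u], label[v]
        for w in range(n):                   # merge super-vertex lv into lu
            if label[w] == lv:
                label[w] = lu
        groups -= 1
        alive = [(a, b) for (a, b) in alive if label[a] != label[b]]   # drop self-loops, keep multi-edges
    return len(alive)                        # edges between the two remaining super-vertices

def karger_min_cut(n, edges, trials):
    """Repeat the contraction 'trials' times and return the smallest cut found."""
    return min(contract_once(n, edges) for _ in range(trials))

Each run succeeds with probability Ω(1/n²) (shown in the analysis below), so on the order of n² repetitions (times a log factor for high probability) drive the failure probability down.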
Key Idea

 Let C* = {c1*, c2*, …, ck*} be a min-cut in G and Ci be a cut


determined by Karger’s algorithm during some iteration i.
 Ci will be a min-cut for G if during iteration “i” none of
the edges in C* are contracted.
 If we can show that with prob. Ω(1/n2), where n = |V|, Ci
will be a min-cut, then by repeatedly obtaining min-cuts
O(n2) times and taking minimum gives the min-cut with
high prob.

47
Analysis of Karger’s Algorithm
Let k be the number of edges of the min cut (S, V − S).

If we never pick a crossing edge during a run of the algorithm, then the number of edges between the two last vertices is the correct answer.

The probability that in step 1 of an iteration a crossing edge is not picked = (|E| − k)/|E|.

By definition of the min cut, each vertex v has degree at least k; otherwise the cut ({v}, V − {v}) would be lighter.

Thus |E| ≥ nk/2 and (|E| − k)/|E| = 1 − k/|E| ≥ 1 − 2/n.
Analysis of Karger’s Algorithm
 In step 1, Pr [no crossing edge picked] >= 1 – 2/n
 Similarly, in step 2, Pr [no crossing edge picked] ≥ 1-2/(n-1)
 In general, in step j, Pr [no crossing edge picked] ≥ 1-2/(n-j+1)
 Pr {the n-2 contractions never contract a crossing edge}
 = Pr [first step good]
* Pr [second step good after surviving first step]
* Pr [third step good after surviving first two steps]
* …
* Pr [(n-2)-th step good after surviving first n-3 steps]
≥ (1 − 2/n) (1 − 2/(n−1)) … (1 − 2/3)
= [(n−2)/n] [(n−3)/(n−1)] … [1/3] = 2/[n(n−1)] = Ω(1/n²)
49
Example 4:Finding Similar Items:
Locality Sensitive Hashing
A Common Metaphor
 Many problems can be expressed as
finding “similar” sets:
 Find near-neighbors in high-dimensional space
 Examples:
 Pages with similar words
 For duplicate detection, classification by topic
 Customers who purchased similar products
 Products with similar customer sets
 Images with similar features

51
Problem for Today
 Given: High dimensional data points 𝒙𝟏 , 𝒙𝟐 , …
 For example: Image is a long vector of pixel colors
1 2 1
0 2 1 → [1 2 1 0 2 1 0 1 0]
0 1 0
 And some distance function 𝒅(𝒙𝟏 , 𝒙𝟐 )
 Which quantifies the “distance” between 𝒙𝟏 and 𝒙𝟐

 Goal: Find all pairs of data points (𝒙𝒊 , 𝒙𝒋 ) that are within some distance threshold: 𝒅(𝒙𝒊 , 𝒙𝒋 ) ≤ 𝒔
 Note: A naïve solution would take O(N²) comparisons, where N is the number of data points

 MAGIC: This can be done in O(N)!!

How?
Finding Similar Items
Distance Measures
 Goal: Find near-neighbors in high-dim.
space
 We formally define “near neighbors” as
points that are a “small distance” apart
 For each application, we first need to define what “distance” means
 Today: Jaccard distance/similarity
 The Jaccard similarity of two sets is the size of their intersection divided by
the size of their union:
sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
 Jaccard distance: d(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|

(Example: two sets with 3 elements in the intersection and 8 in the union have Jaccard similarity = 3/8 and Jaccard distance = 5/8.)
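
A one-function Python sketch of these two definitions (the names are illustrative):

def jaccard_similarity(c1, c2):
    """|C1 ∩ C2| / |C1 ∪ C2| for two Python sets."""
    return len(c1 & c2) / len(c1 | c2)

def jaccard_distance(c1, c2):
    return 1.0 - jaccard_similarity(c1, c2)

# Example: jaccard_similarity({1, 2, 3, 4}, {3, 4, 5, 6, 7}) == 2/7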
Task: Finding Similar Documents
 Goal: Given a large number (𝑵 in the millions or billions) of documents,
find “near duplicate” pairs
 Applications:
 Mirror websites, or approximate mirrors
 Don’t want to show both in search results

 Similar news articles at many news sites


 Cluster articles by “same story”

 Problems:
 Many small pieces of one document can appear
out of order in another
 Too many documents to compare all pairs
 Documents are so large or so many that they cannot
fit in main memory

55
3 Essential Steps for Similar Docs

1. Shingling: Convert documents to sets

2. Min-Hashing: Convert large sets to short signatures,


while preserving similarity

3. Locality-Sensitive Hashing: Focus on pairs of signatures


likely to be from similar documents
 Candidate pairs!

56
The Big Picture

(Pipeline: Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity.)
Shingling

Step 1: Shingling: Convert documents to sets
Documents as High-Dim Data

 Step 1: Shingling: Convert documents to sets

 Simple approaches:
 Document = set of words appearing in document
 Document = set of “important” words
 Don’t work well for this application. Why?

 Need to account for ordering of words!


 A different way: Shingles!
59
Define: Shingles
 A k-shingle (or k-gram) for a document is a sequence of
k tokens that appears in the doc
 Tokens can be characters, words or something else, depending
on the application
 Assume tokens = characters for examples

 Example: k=2; document D1 = abcab


Set of 2-shingles: S(D1) = {ab, bc, ca}
 Option: Shingles as a bag (multiset), count ab twice: S’(D1) =
{ab, bc, ca, ab}

60
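
A small Python sketch of character k-shingling (the function name shingles is illustrative):

def shingles(doc, k=2):
    """Return the set of character k-shingles of a string."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

# Example: shingles("abcab", 2) == {"ab", "bc", "ca"}, matching S(D1) above.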
Similarity Metric for Shingles
 Document D1 is a set of its k-shingles C1=S(D1)
 Equivalently, each document is a 0/1 vector in the space
of k-shingles
 Each unique shingle is a dimension
 Vectors are very sparse
 A natural similarity measure is the Jaccard similarity:
sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|

61
Working Assumption

 Documents that have lots of shingles in common have


similar text, even if the text appears in different order

 Caveat: You must pick k large enough, or most documents


will have most shingles
 k = 5 is OK for short documents
 k = 10 is better for long documents

62
Motivation for Minhash/LSH
 Suppose we need to find near-duplicate documents
among 𝑵 = 𝟏 million documents

 Naïvely, we would have to compute pairwise


Jaccard similarities for every pair of docs
 𝑵(𝑵 − 𝟏)/𝟐 ≈ 5·10¹¹ comparisons
 At 10⁵ secs/day and 10⁶ comparisons/sec, it would take 5 days

 For 𝑵 = 𝟏𝟎 million, it takes more than a year…

63
MinHashing

Step 2: Minhashing: Convert large sets to short signatures, while preserving similarity
Encoding Sets as Bit Vectors
 Many similarity problems can be
formalized as finding subsets that
have significant intersection
 Encode sets using 0/1 (bit, boolean) vectors
 One dimension per element in the universal set
 Interpret set intersection as bitwise AND, and
set union as bitwise OR

 Example: C1 = 10111; C2 = 10011


 Size of intersection = 3; size of union = 4
 Jaccard similarity (not distance) = 3/4
 Distance: d(C1, C2) = 1 − (Jaccard similarity) = 1/4
From Sets to Boolean Matrices
 Rows = elements (shingles)
 Columns = sets (documents)
 1 in row e and column s if and only if e is a member of s
 Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
 Typical matrix is sparse!
 Each document is a column. Example matrix (shingles × documents):

    C1 C2 C3 C4
     1  1  1  0
     1  1  0  1
     0  1  0  1
     0  0  0  1
     1  0  0  1
     1  1  1  0
     1  0  1  0

 Example: sim(C1, C2) = ?
  Size of intersection = 3; size of union = 6; Jaccard similarity (not distance) = 3/6
  d(C1, C2) = 1 − (Jaccard similarity) = 3/6
Outline: Finding Similar Columns
 So far:
 Documents  Sets of shingles
 Represent sets as boolean vectors in a matrix
 Next goal: Find similar columns while computing small
signatures
 Similarity of columns == similarity of signatures

67
Outline: Finding Similar Columns
 Next Goal: Find similar columns, Small signatures
 Naïve approach:
 1) Signatures of columns: small summaries of columns
 2) Examine pairs of signatures to find similar columns
 Essential: Similarities of signatures and columns are related
 3) Optional: Check that columns with similar signatures are really
similar
 Warnings:
 Comparing all pairs may take too much time: Job for LSH
 These methods can produce false negatives, and even false positives
(if the optional check is not made)
68
Hashing Columns (Signatures)
 Key idea: “hash” each column C to a small signature h(C), such that:
 (1) h(C) is small enough that the signature fits in RAM
 (2) sim(C1, C2) is the same as the “similarity” of signatures h(C1) and h(C2)

 Goal: Find a hash function h(·) such that:


 If sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
 If sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

 Hash docs into buckets. Expect that “most” pairs of near duplicate docs
hash into the same bucket!

69
Min-Hashing
 Goal: Find a hash function h(·) such that:
 if sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
 if sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

 Clearly, the hash function depends on the similarity


metric:
 Not all similarity metrics have a suitable hash function
 There is a suitable hash function for the Jaccard
similarity: It is called Min-Hashing

70
Min-Hashing
 Imagine the rows of the boolean matrix permuted under a random permutation π

 Define a “hash” function hπ(C) = the index of the first (in the permuted order π) row in which column C has value 1:

    hπ(C) = min π(C)

 Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature of a column
71
Min-Hashing Example

Permutations π1, π2, π3 (left), input matrix (shingles × documents, middle), signature matrix M (right):

    π1 π2 π3 | documents | signature matrix M
     2  4  3 | 1 0 1 0   | 2 1 2 1   (row for π1)
     3  2  4 | 1 0 0 1   | 2 1 4 1   (row for π2)
     7  1  7 | 0 1 0 1   | 1 2 1 2   (row for π3)
     6  3  2 | 0 1 0 1   |
     1  6  6 | 0 1 0 1   |
     5  7  1 | 1 0 1 0   |
     4  5  5 | 1 0 1 0   |

Under π1, the 2nd row (in permuted order) is the first to map to a 1 in the first column; under π2, the 4th row (in permuted order) is the first to map to a 1.
The Min-Hash Property
 Choose a random permutation π
 Claim: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
 Why?
  Let X be a doc (set of shingles), and let y ∈ X be a shingle
  Then: Pr[π(y) = min(π(X))] = 1/|X|
   It is equally likely that any y ∈ X is mapped to the min element
  Let y be such that π(y) = min(π(C1 ∪ C2))
  Then either: π(y) = min(π(C1)) if y ∈ C1, or
               π(y) = min(π(C2)) if y ∈ C2
   (one of the two columns had to have a 1 at position y)
  So the prob. that both are true is the prob. that y ∈ C1 ∩ C2
  Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)

73
Similarity for Signatures

 We know: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)


 Now generalize to multiple hash functions

 The similarity of two signatures is the fraction of the


hash functions in which they agree

 Note: Because of the Min-Hash property, the similarity of


columns is the same as the expected similarity of their
signatures

75
Min-Hashing Example

Permutations π1, π2, π3 (left), input matrix (shingles × documents, middle), signature matrix M (right):

    π1 π2 π3 | documents | signature matrix M
     2  4  3 | 1 0 1 0   | 2 1 2 1
     3  2  4 | 1 0 0 1   | 2 1 4 1
     7  1  7 | 0 1 0 1   | 1 2 1 2
     6  3  2 | 0 1 0 1   |
     1  6  6 | 0 1 0 1   |
     5  7  1 | 1 0 1 0   |
     4  5  5 | 1 0 1 0   |

Similarities:      1-3    2-4    1-2    3-4
    Col/Col        0.75   0.75   0      0
    Sig/Sig        0.67   1.00   0      0
76
Min-Hash Signatures
 Pick K=100 random permutations of the rows
 Think of sig(C) as a column vector
 sig(C)[i] = according to the i-th permutation, the index of the first row that has a 1 in column C

    sig(C)[i] = min πᵢ(C)


 Note: The sketch (signature) of document C is small ~𝟏𝟎𝟎 bytes!

 We achieved our goal! We “compressed” long bit vectors into short


signatures

77
Implementation Trick
 Permuting rows even once is prohibitive
 Row hashing!
  Pick K = 100 hash functions kᵢ
  Ordering under kᵢ gives a random row permutation!
 One-pass implementation
  For each column C and hash function kᵢ, keep a “slot” for the min-hash value
  Initialize all sig(C)[i] = ∞
  Scan rows looking for 1s
   Suppose row j has 1 in column C
   Then for each kᵢ:
    If kᵢ(j) < sig(C)[i], then sig(C)[i] ← kᵢ(j)

How to pick a random hash function h(x)? Universal hashing:
    h_{a,b}(x) = ((a·x + b) mod p) mod N
where: a, b … random integers; p … prime number (p > N)
78
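
A one-pass Python sketch of this trick, using universal hashing h_{a,b}(x) = ((a·x + b) mod p) mod N in place of true permutations; minhash_signatures, K, and p are illustrative choices.

import random

def minhash_signatures(columns, n_rows, K=100, p=2_147_483_647):
    """columns: list of sets of row indices holding a 1. Returns a K x len(columns) signature matrix."""
    # K universal hash functions, one per simulated permutation (p is a prime > n_rows)
    funcs = [(random.randrange(1, p), random.randrange(0, p)) for _ in range(K)]
    sig = [[float("inf")] * len(columns) for _ in range(K)]
    for j in range(n_rows):                                   # scan rows looking for 1s
        hashes = [((a * j + b) % p) % n_rows for (a, b) in funcs]
        for c, rows in enumerate(columns):
            if j in rows:                                     # row j has a 1 in column c
                for i in range(K):
                    if hashes[i] < sig[i][c]:                 # keep the min-hash value per slot
                        sig[i][c] = hashes[i]
    return sig

The fraction of rows on which two signature columns agree then estimates the Jaccard similarity of the underlying sets.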
Locality Sensitive Hashing

Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
LSH: First Cut
 Goal: Find documents with Jaccard similarity at least s
(for some similarity threshold, e.g., s=0.8)

 LSH – General idea: Use a function f(x,y) that tells


whether x and y is a candidate pair: a pair of elements
whose similarity must be evaluated

 For Min-Hash matrices:


 Hash columns of signature matrix M to many buckets
 Each pair of documents that hashes into the
same bucket is a candidate pair
80
Candidates from Min-Hash

 Pick a similarity threshold s (0 < s < 1)

 Columns x and y of M are a candidate pair if


their signatures agree on at least fraction s of
their rows:
M (i, x) = M (i, y) for at least frac. s values of i
 We expect documents x and y to have the same
(Jaccard) similarity as their signatures

81
LSH for Min-Hash
 Big idea: Hash columns of
signature matrix M several times

 Arrange that (only) similar columns are


likely to hash to the same bucket, with
high probability

 Candidate pairs are those that hash to


the same bucket

82
Partition M into b Bands

(Figure: the signature matrix M is divided into b bands of r rows each; one column of M is one signature.)
Partition M into Bands
 Divide matrix M into b bands of r rows

 For each band, hash its portion of each


column to a hash table with k buckets
 Make k as large as possible

 Candidate column pairs are those that hash


to the same bucket for ≥ 1 band

 Tune b and r to catch most similar pairs, but few non-similar pairs
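
A banding sketch in Python on top of the hypothetical minhash_signatures output above, assuming the signature matrix has exactly b·r rows; lsh_candidate_pairs is an illustrative name.

from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(sig, b, r):
    """Return pairs of columns of sig (a b*r x n matrix) that agree in at least one band."""
    n_cols = len(sig[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_cols):
            # hash the band's portion of column c; identical portions share a bucket
            key = tuple(sig[band * r + i][c] for i in range(r))
            buckets[key].append(c)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates

Hashing the r-tuple directly realizes the simplifying assumption below: columns share a bucket only if they are identical in that band.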


Hashing Bands

(Figure: for one band of matrix M, each column's r-row portion is hashed into buckets. Columns 2 and 6 land in the same bucket, so they are probably identical in that band — a candidate pair. Columns 6 and 7 land in different buckets, so they are surely different in that band.)
Simplifying Assumption

 There are enough buckets that columns are unlikely to


hash to the same bucket unless they are identical in a
particular band

 Hereafter, we assume that “same bucket” means


“identical in that band”

 Assumption needed only to simplify analysis, not for


correctness of algorithm
86
Example of Bands

Assume the following case:


 Suppose 100,000 columns of M (100k docs)
 Signatures of 100 integers (rows)
 Therefore, signatures take 40 MB
 Choose b = 20 bands of r = 5 integers/band

 Goal: Find pairs of documents that are at least s = 0.8 similar
C1, C2 are 80% Similar
 Find pairs with ≥ s = 0.8 similarity; set b = 20, r = 5
 Assume: sim(C1, C2) = 0.8
 Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: we want them to hash to at least 1 common bucket (at least one band is identical)
 Probability C1, C2 identical in one particular band: (0.8)⁵ = 0.328
 Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)²⁰ = 0.00035
  i.e., about 1/3000th of the 80%-similar column pairs are false negatives (we miss them)
  We would find 99.965% of the truly similar pairs of documents
88
C1, C2 are 30% Similar
 Find pairs with ≥ s = 0.8 similarity; set b = 20, r = 5
 Assume: sim(C1, C2) = 0.3
 Since sim(C1, C2) < s, we want C1, C2 to hash to NO common buckets (all bands should be different)
 Probability C1, C2 identical in one particular band: (0.3)⁵ = 0.00243
 Probability C1, C2 identical in at least 1 of 20 bands: 1 − (1 − 0.00243)²⁰ = 0.0474
 In other words, approximately 4.74% of pairs of docs with similarity 0.3 end up becoming candidate pairs
  They are false positives, since we will have to examine them (they are candidate pairs) but it will then turn out that their similarity is below threshold s
LSH Involves a Tradeoff
 Pick:
 The number of Min-Hashes (rows of M)
 The number of bands b, and
 The number of rows r per band
to balance false positives/negatives

 Example: If we had only 15 bands of 5 rows,


the number of false positives would go
down, but the number of false negatives
would go up
90
Analysis of LSH – What We Want

(Figure: the ideal behavior is a step function of the similarity t = sim(C1, C2): probability of sharing a bucket = 1 if t > s, and no chance of sharing a bucket if t < s, where s is the similarity threshold.)
What 1 Band of 1 Row Gives You

(Figure: with a single min-hash row, the probability of sharing a bucket grows linearly with the similarity t = sim(C1, C2). Remember: the probability of equal hash-values equals the similarity.)
b bands, r rows/band

 Columns C1 and C2 have similarity t

 Pick any band (r rows)
  Prob. that all rows in the band are equal = tʳ
  Prob. that some row in the band is unequal = 1 − tʳ

 Prob. that no band is identical = (1 − tʳ)ᵇ

 Prob. that at least 1 band is identical = 1 − (1 − tʳ)ᵇ

93
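
A tiny Python helper for this probability; prob_candidate is an illustrative name.

def prob_candidate(t, b, r):
    """Probability that two columns with similarity t become a candidate pair: 1 - (1 - t^r)^b."""
    return 1.0 - (1.0 - t ** r) ** b

# For b = 20, r = 5 this reproduces the table two slides below:
# prob_candidate(0.2, 20, 5) ≈ .006, prob_candidate(0.3, 20, 5) ≈ .047, ..., prob_candidate(0.8, 20, 5) ≈ .9996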
What b Bands of r Rows Gives You

(Figure: probability of sharing a bucket as a function of the similarity t = sim(C1, C2): the curve 1 − (1 − tʳ)ᵇ. It is near 0 when no band is identical (some row of every band unequal) and near 1 when at least one band is identical (all rows of some band equal). The curve rises steeply around the threshold s ≈ (1/b)^(1/r).)
94
Example: b = 20; r = 5
 Similarity threshold s
 Prob. that at least 1 band is identical:

    s      1 − (1 − sʳ)ᵇ
    .2     .006
    .3     .047
    .4     .186
    .5     .470
    .6     .802
    .7     .975
    .8     .9996
Picking r and b: The S-curve
 Picking r and b to get the best S-curve
  50 hash-functions (r = 5, b = 10)

(Figure: the S-curve of Prob. sharing a bucket vs. Similarity for r = 5, b = 10; the green area marks the false negative rate, the orange area the false positive rate.)
96
LSH Summary
 Tune M, b, r to get almost all pairs with similar signatures, but
eliminate most pairs that do not have similar signatures

 Check in main memory that candidate pairs really do have similar


signatures

 Optional: In another pass through data, check that the remaining


candidate pairs really represent similar documents

97
Summary: 3 Steps
 Shingling: Convert documents to sets
 We used hashing to assign each shingle an ID
 Min-Hashing: Convert large sets to short signatures, while preserving
similarity
 We used similarity-preserving hashing to generate signatures with the property
Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
 We used hashing to get around generating random permutations
 Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
 We used hashing to find candidate pairs of similarity ≥ s

98
