
Randomized Algorithms

Overview of the lecture

 What is a randomized algorithm ?

 Motivation

 Types of randomized algorithms

 Examples of Randomized Algorithm


2
What is a randomized algorithm ?

3
Deterministic Algorithm

(Diagram: Input → Algorithm → Output)
 The output as well as the running time are functions only of the input.
 Goal: Prove for all input instances the algorithm solves the problem
correctly and the number of steps is bounded by a polynomial in the size
of the input.
4
Randomized Algorithm

(Diagram: Input + random bits → Algorithm → Output)
 The output or the running time is a function of both the input and the random bits chosen.
 Behavior can vary even on a fixed input.
5
Motivation for Randomized Algorithms

 Simplicity;

 Performance;

 Reflects reality better (Online Algorithms);

 For many hard problems, randomization helps obtain better complexity bounds than deterministic approaches.

6
Types of Randomized Algorithms

Randomized Las Vegas Algorithms:


 Output is always correct
 Running time is a random variable
Example: Randomized Quick Sort

Randomized Monte Carlo Algorithms:


 Output may be incorrect with some probability
 Running time is deterministic.
Example: Randomized algorithm for approximate median

7
Las Vegas Randomized Algorithms

(Diagram: Input + random numbers → Algorithm → Output)

Goal: Prove that for all input instances the algorithm solves the
problem correctly and the expected number of steps is bounded
by a polynomial in the input size.

Note: The expectation is over the random choices made by the


algorithm.
8
Probabilistic Analysis of Algorithms

(Diagram: Random input distribution → Algorithm → Output)

Input is assumed to be from a probability distribution.

Goal: Show that for all inputs the algorithm works correctly and
for most inputs the number of steps is bounded by a polynomial in
the size of the input.

9
Monte Carlo Randomized Algorithms

(Diagram: Input + random numbers → Algorithm → Output)

Goal: Prove that the algorithm


– with high probability solves the problem correctly;
– for every input the expected number of steps is bounded by a
polynomial in the input size.

Note: The expectation is over the random choices made by the algorithm.
Monte Carlo versus Las Vegas
 A Monte Carlo algorithm produces an answer that is correct with non-zero probability, whereas a Las Vegas algorithm always produces the correct answer.

 The running time of both types of randomized algorithms is a random variable whose expectation is bounded, say, by a polynomial in the input size.

 These expectations are only over the random choices made by the algorithm, independent of the input. Thus independent repetitions of Monte Carlo algorithms drive down the failure probability exponentially.

11
Example 1 : Randomized Quick
Sort (Las Vegas Algorithm)

12
QuickSort(𝑺)

QuickSort(𝑺)
{  If (|𝑺| > 1)
      Pick and remove an element 𝒙 from 𝑺;
      (𝑺<𝒙 , 𝑺>𝒙 ) ← Partition(𝑺, 𝒙);
      return Concatenate(QuickSort(𝑺<𝒙 ), 𝒙, QuickSort(𝑺>𝒙 ));
}

13
QuickSort(𝑺)

QuickSort(𝑨,𝒍, 𝒓)
{  If (𝒍 < 𝒓)
      𝒙 ← 𝒍;
      𝒊 ← Partition(𝑨, 𝒍, 𝒓, 𝒙);
      QuickSort(𝑨, 𝒍, 𝒊 − 𝟏);
      QuickSort(𝑨, 𝒊 + 𝟏, 𝒓);
}

 Average case running time: O(n log n)
 Worst case running time: O(n²)
 Distribution sensitive: Time taken depends upon the initial permutation of 𝑨.
14
Randomized QuickSort(𝑺)

QuickSort(𝑨,𝒍, 𝒓)
{  If (𝒍 < 𝒓)
      Swap 𝑨[𝒍] with an element selected uniformly at random from 𝑨[𝒍..𝒓];
      𝒙 ← 𝒍;
      𝒊 ← Partition(𝑨, 𝒍, 𝒓, 𝒙);
      QuickSort(𝑨, 𝒍, 𝒊 − 𝟏);
      QuickSort(𝑨, 𝒊 + 𝟏, 𝒓);
}
 Distribution insensitive: Time taken does not depend on the initial permutation of 𝑨.
 Time taken depends upon the random choices of pivot elements.
1. For a given input, expected (average) running time: O(n log n)
2. Worst case running time: O(n²)
15
Partition
Partition(𝑨, 𝒍, 𝒓, 𝒙)
{
    swap(𝑨[𝒓], 𝑨[𝒙]);        // move the chosen pivot to position 𝒓
    pivot ← 𝑨[𝒓];
    𝒊 ← 𝒍 − 1;
    for (𝒋 ← 𝒍 to 𝒓 − 1) {
        if (𝑨[𝒋] ≤ pivot) {
            𝒊 ← 𝒊 + 1;
            swap(𝑨[𝒊], 𝑨[𝒋]);
        }
    }
    swap(𝑨[𝒊+1], 𝑨[𝒓]);
    return 𝒊 + 1;
}
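
Below is a minimal executable sketch of the same routine in Python, assuming 0-indexed lists; the names randomized_quicksort and partition are illustrative, not from the slides.

import random

def partition(A, l, r, x):
    """Partition A[l..r] around the pivot at index x; return the pivot's final index."""
    A[r], A[x] = A[x], A[r]          # move the chosen pivot to the end
    pivot = A[r]
    i = l - 1
    for j in range(l, r):
        if A[j] <= pivot:
            i += 1
            A[i], A[j] = A[j], A[i]
    A[i + 1], A[r] = A[r], A[i + 1]
    return i + 1

def randomized_quicksort(A, l=0, r=None):
    """Sort A[l..r] in place, choosing each pivot uniformly at random."""
    if r is None:
        r = len(A) - 1
    if l < r:
        x = random.randint(l, r)     # uniformly random pivot index
        i = partition(A, l, r, x)
        randomized_quicksort(A, l, i - 1)
        randomized_quicksort(A, i + 1, r)

# Usage: A = [2, 9, 8, 3, 5, 4, 1, 6, 10, 7]; randomized_quicksort(A)  # A is now sorted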
Analysis of Randomized Quick Sort

17
Linearity of Expectation

If X1, X2, …, Xn are random variables, then

    E[ Σ_{i=1}^{n} Xi ] = Σ_{i=1}^{n} E[Xi]
18
Notation
z2 z9 z8 z3 z5 z4 z1 z6 z10 z7
2 9 8 3 5 4 1 6 10 7

 Rename the elements of A as z1, z2, . . . , zn, with zi being the ith
smallest element (Rank “i”).

 Define Zij = {zi , zi+1, . . . , zj } to be the set of elements between zi and zj, inclusive.

19
Expected Number of Total Comparisons
in PARTITION
Let Xij = I{ zi is compared to zj } be an indicator random variable.
Let X be the total number of comparisons performed by the algorithm. Then

    X = Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} Xij

The expected number of comparisons performed by the algorithm is

    E[X] = E[ Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} Xij ]
         = Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} E[Xij]              (by linearity of expectation)
         = Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} Pr{ zi is compared to zj }
20
Comparisons in PARTITION
Observation 1: Each pair of elements is compared at most once
during the entire execution of the algorithm

 Elements are compared only to the pivot point!

 Pivot point is excluded from future calls to PARTITION

Observation 2: Only the pivot is compared with elements in both


partitions
z2 z9 z8 z3 z5 z4 z1 z6 z10 z7
2 9 8 3 5 4 1 6 10 7

Z1,6 = {1, 2, 3, 4, 5, 6}    {7} (pivot)    Z8,10 = {8, 9, 10}

Elements in different partitions are never compared.
Comparisons in PARTITION
z2 z9 z8 z3 z5 z4 z1 z6 z10 z7
2 9 8 3 5 4 1 6 10 7

Z1,6= {1, 2, 3, 4, 5, 6} {7} Z8,10 = {8, 9, 10}

Pr{zi is compared to z j }?
Case 1: a pivot x is chosen such that zi < x < zj
 zi and zj will never be compared

Case 2: zi or zj is the pivot


 zi and zj will be compared
 only if one of them is chosen as pivot before any other element
in range zi to zj
22
Expected Number of Comparisons in
PARTITION
Pr {Zi is compared with Zj}

= Pr{Zi or Zj is chosen as pivot before other elements in Zi,j} = 2 / (j-i+1)

n 1 n
E[ X ]    Pr{z
i 1 j  i 1
i is compared to z j }

n 1 n 1 n i n 1 n
n
2 2 2 n 1
E[ X ]          O(lg n)
i 1 j i 1 j  i  1 i 1 k 1 k  1 i 1 k 1 k i 1

= O(nlgn)
23
Example 2 : Max Cut (Monte Carlo
Algorithm)

24
Global Cut

 Given an undirected graph G = ( V , E ), a cut in G


is a pair ( S , V – S ) of two sets S and V – S that
split the nodes into two groups.
 The size or cost of a cut, denoted by c ( S, V – S),
is the number of edges with one endpoint in S and one
in V – S.
 A global min cut is a cut in G with the least total cost
(min edges). A global max cut is a cut in G with
maximum total cost (max edges).
Global Cut - Example

(Figure: the vertex set split into two groups, S and V − S; the cut edges are those with one endpoint in each group.)
Global Cut

 Interestingly: There are many polynomial-time algorithms


known for global min-cut.
 Global max-cut is NP -hard and no polynomial-time
algorithms are known for it.
 Today, we'll see an algorithm for approximating global
max-cut.
 Next, we'll see a randomized algorithm for finding a
global min-cut.
Approximating Max Cut

 For a maximization problem, an α-approximation


algorithm is an algorithm that produces a value that is
within a factor of α of the true value.
 A 0.5-approximation to max-cut would produce a cut whose size is at least 50% of the size of the true largest cut.
 Our goal will be to find a randomized approximation
algorithm for max-cut.
A Really Simple Algorithm

 Here is our algorithm (a code sketch follows the list):


 For each node, toss a fair coin.
 If it lands heads, place the node into one part of the cut.
 If it lands tails, place the node into the other part of the
cut.
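
As mentioned above, here is a minimal Python sketch of this coin-flip cut, assuming the graph is given as n vertices 0..n−1 and a list of edges; the name random_cut is illustrative.

import random

def random_cut(n, edges):
    """Toss a fair coin per vertex; return the set S and the number of crossing edges."""
    S = {v for v in range(n) if random.random() < 0.5}
    cut_size = sum(1 for (u, v) in edges if (u in S) != (v in S))
    return S, cut_size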
Analyzing the Algorithm

 On expectation, how large of a cut will this algorithm find?

 For each edge e, let Cₑ be an indicator random variable that is 1 if e crosses the cut and 0 otherwise.

 Then the number of edges X crossing the cut is given by X = Σₑ Cₑ.
What is Expected?

 The expected number of edges crossing the cut is given by

    E[X] = Σₑ E[Cₑ] = Σₑ Pr[e crosses the cut]
Four Possible Choices

 Let u and v be the endpoints of edge e.

 There are four different possibilities, each occurring with probability 1/4:
  Both u and v belong to S
  Both u and v belong to V − S
  u belongs to S and v belongs to V − S
  v belongs to S and u belongs to V − S

 The edge e crosses the cut in the last two cases, i.e., with probability 1/2.
What is Expected is Unexpected

 The expected number of edges crossing the cut is given by

    E[X] = Σₑ Pr[e crosses the cut] = m/2

 All cuts are of size ≤ m, so this is always within a factor of one half of the optimal!
Randomized Approximation
Algorithm
 This algorithm is a randomized 0.5-approximation of max-cut (in expectation).
 The algorithm runs in O(n) time.
 It is NP-hard to find a true max-cut, but it is not at all hard to find a cut that has size at least half of the max-cut on average.
Improving the Odds

 Running our algorithm will, on expectation, produce a cut


with size m / 2.
 However, we don't know the actual probability that our
cut has this size.
 We can use a standard technique to amplify the
probability of success.
Do it Again

 Since any individual run of the algorithm might not


produce a large cut, we could try this approach:
 Run the algorithm k times.
 Return the largest cut found.
 Goal: Show that with the right choice of k, this returns a
large cut with high probability.
 Specifically: Will show we get a cut of size m / 4 with high
probability.
 Runtime is O((m + n)k): k rounds of doing O(m + n) work
(n to build the cut, m to determine the size.)
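
A sketch of this repetition, reusing the hypothetical random_cut from the earlier sketch; best_cut_of_k is an illustrative name.

def best_cut_of_k(n, edges, k):
    """Run the coin-flip cut k times and keep the largest cut found."""
    best_S, best_size = set(), -1
    for _ in range(k):
        S, size = random_cut(n, edges)
        if size > best_size:
            best_S, best_size = S, size
    return best_S, best_size

As the analysis below shows, with k ≈ log_{3/2} m repetitions the returned cut has at least m/4 edges with probability at least 1 − 1/m.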
More Probabilistic Analysis

 Let X₁, X₂, …, Xₖ be random variables corresponding to the sizes of the cuts found by each run of the algorithm.

 Let Ɛ be the event that our algorithm produces a cut of size less than m / 4. Then

    Pr[Ɛ] = Pr[X₁ < m/4 and X₂ < m/4 and … and Xₖ < m/4]

 Since all Xᵢ variables are independent, we have

    Pr[Ɛ] = Pr[X₁ < m/4] · Pr[X₂ < m/4] · … · Pr[Xₖ < m/4]
More Probabilistic Analysis

 Let Y₁, Y₂, …, Yₖ be random variables defined by Yᵢ = m − Xᵢ (the number of edges not crossing the i-th cut), so each Yᵢ is nonnegative.

 Then E[Yᵢ] = m − E[Xᵢ] = m − m/2 = m/2.
Markov’s Inequality

 Markov's Inequality states that for any nonnegative random variable X and any a > 0,

    Pr[X ≥ a] ≤ E[X] / a

 Equivalently, for any c ≥ 1,

    Pr[X ≥ c · E[X]] ≤ 1 / c

 This holds for any nonnegative random variable X.
Markov’s Inequality

 With Yᵢ = m − Xᵢ and E[Yᵢ] = m/2, Markov's inequality gives

    Pr[Xᵢ < m/4] = Pr[Yᵢ > 3m/4] ≤ E[Yᵢ] / (3m/4) = (m/2) / (3m/4) = 2/3

 Then Pr[Ɛ] ≤ (2/3)ᵏ.
Final Comment

 If we run the algorithm k times and take the maximum cut


we find, then the probability that we don't get m / 4
edges or more is at most (2 / 3)k.
 The probability we do get at least m / 4 edges is at least
1 – (2 / 3)k.
 If we set k = log_{3/2} m, the probability we get at least m/4 edges is at least 1 – 1/m.
 There is a randomized, O((m + n) log m)-time algorithm
that finds a (0.25)-approximation to max-cut with
probability 1 – 1 / m.
Why it Works

 Given a randomized algorithm that has a probability p of


success, we can amplify that probability significantly by
repeating the algorithm multiple times.
 This technique is used extensively in randomized
algorithms; we'll see another example of this on min-cut.
Example 3 : Min Cut (Monte Carlo
Algorithm)

43
Graph Contraction
For an undirected graph G, we can construct a new graph G’ by
contracting two vertices u, v in G as follows:
 u and v become one vertex {u,v} and the edge (u,v) is removed;
 the other edges incident to u or v in G are now incident on the
new vertex {u,v} in G’;
Note: There may be multi-edges between two vertices. We just keep them.

(Figure: graph G with vertices a, b, c, d, e, u, v; contracting the edge (u,v) yields graph G’ in which u and v are merged into the single vertex {u,v}.)
Karger’s Min-cut Algorithm
(Figure: (i) graph G with vertices A, B, C, D; (ii) contract nodes C and D into CD; (iii) contract nodes A and CD into ACD; (iv) the resulting cut C = {(A,B), (B,C), (B,D)}. Note: C is a cut, but not necessarily a min-cut.)

Karger’s Min-cut Algorithm

For i = 1 to x:
    repeat
        randomly pick an edge (u, v)
        contract u and v
    until two vertices are left
    cᵢ ← the number of edges between them
Output minᵢ cᵢ

46
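
A minimal Python sketch of Karger's contraction, representing the multigraph as an edge list and assuming a connected graph; karger_min_cut, contract_once, and the trials parameter (playing the role of x above) are illustrative names.

import random

def contract_once(n, edges):
    """One run: contract random edges until two super-vertices remain; return the resulting cut size."""
    label = list(range(n))                   # label[v] = super-vertex currently containing v
    alive = list(edges)                      # remaining multigraph edges (no self-loops)
    groups = n
    while groups > 2:
        u, v = random.choice(alive)          # pick a remaining edge uniformly at random
        lu, lv = label[u], label[v]
        for w in range(n):                   # merge super-vertex lv into lu
            if label[w] == lv:
                label[w] = lu
        groups -= 1
        alive = [(a, b) for (a, b) in alive if label[a] != label[b]]   # drop self-loops, keep multi-edges
    return len(alive)                        # edges between the two remaining super-vertices

def karger_min_cut(n, edges, trials):
    """Repeat the contraction 'trials' times and return the smallest cut found."""
    return min(contract_once(n, edges) for _ in range(trials))

Each run succeeds with probability Ω(1/n²) (shown in the analysis below), so on the order of n² repetitions (times a log factor for high probability) drive the failure probability down.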
Key Idea

 Let C* = {c1*, c2*, …, ck*} be a min-cut in G and Ci be a cut


determined by Karger’s algorithm during some iteration i.
 Ci will be a min-cut for G if during iteration “i” none of
the edges in C* are contracted.
 If we can show that with prob. Ω(1/n2), where n = |V|, Ci
will be a min-cut, then by repeatedly obtaining min-cuts
O(n2) times and taking minimum gives the min-cut with
high prob.

47
Analysis of Karger’s Algorithm
Let k be the number of edges of the min cut (S, V − S).

If we never pick a crossing edge during a run of the algorithm, then the number of edges between the two last vertices is the correct answer.

The probability that in step 1 of an iteration a crossing edge is not picked = (|E| − k)/|E|.

By definition of the min cut, each vertex v has degree at least k; otherwise the cut ({v}, V − {v}) would be lighter.

Thus |E| ≥ nk/2 and (|E| − k)/|E| = 1 − k/|E| ≥ 1 − 2/n.
Analysis of Karger’s Algorithm
 In step 1, Pr [no crossing edge picked] >= 1 – 2/n
 Similarly, in step 2, Pr [no crossing edge picked] ≥ 1-2/(n-1)
 In general, in step j, Pr [no crossing edge picked] ≥ 1-2/(n-j+1)
 Pr {the n-2 contractions never contract a crossing edge}
 = Pr [first step good]
* Pr [second step good after surviving first step]
* Pr [third step good after surviving first two steps]
* …
* Pr [(n-2)-th step good after surviving first n-3 steps]
≥ (1 − 2/n) (1 − 2/(n−1)) … (1 − 2/3)
= [(n−2)/n] [(n−3)/(n−1)] … [1/3] = 2/[n(n−1)] = Ω(1/n²)
49
Example 4:Finding Similar Items:
Locality Sensitive Hashing
A Common Metaphor
 Many problems can be expressed as
finding “similar” sets:
 Find near-neighbors in high-dimensional space
 Examples:
 Pages with similar words
 For duplicate detection, classification by topic
 Customers who purchased similar products
 Products with similar customer sets
 Images with similar features

51
Problem for Today
 Given: High dimensional data points 𝒙𝟏 , 𝒙𝟐 , …
 For example: Image is a long vector of pixel colors
1 2 1
0 2 1 → [1 2 1 0 2 1 0 1 0]
0 1 0
 And some distance function 𝒅(𝒙𝟏 , 𝒙𝟐 )
 Which quantifies the “distance” between 𝒙𝟏 and 𝒙𝟐

 Goal: Find all pairs of data points (𝒙𝒊 , 𝒙𝒋 ) that are within some distance threshold: 𝒅(𝒙𝒊 , 𝒙𝒋 ) ≤ 𝒔
 Note: A naïve solution would take O(N²) comparisons, where N is the number of data points

 MAGIC: This can be done in O(N)!!

How?
Finding Similar Items
Distance Measures
 Goal: Find near-neighbors in high-dim.
space
 We formally define “near neighbors” as
points that are a “small distance” apart
 For each application, we first need to define what “distance” means
 Today: Jaccard distance/similarity
 The Jaccard similarity of two sets is the size of their intersection divided by
the size of their union:
sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
 Jaccard distance: d(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|

(Example: two sets with 3 elements in the intersection and 8 in the union have Jaccard similarity = 3/8 and Jaccard distance = 5/8.)
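
A one-function Python sketch of these two definitions (the names are illustrative):

def jaccard_similarity(c1, c2):
    """|C1 ∩ C2| / |C1 ∪ C2| for two Python sets."""
    return len(c1 & c2) / len(c1 | c2)

def jaccard_distance(c1, c2):
    return 1.0 - jaccard_similarity(c1, c2)

# Example: jaccard_similarity({1, 2, 3, 4}, {3, 4, 5, 6, 7}) == 2/7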
Task: Finding Similar Documents
 Goal: Given a large number (𝑵 in the millions or billions) of documents,
find “near duplicate” pairs
 Applications:
 Mirror websites, or approximate mirrors
 Don’t want to show both in search results

 Similar news articles at many news sites


 Cluster articles by “same story”

 Problems:
 Many small pieces of one document can appear
out of order in another
 Too many documents to compare all pairs
 Documents are so large or so many that they cannot
fit in main memory

55
3 Essential Steps for Similar Docs

1. Shingling: Convert documents to sets

2. Min-Hashing: Convert large sets to short signatures,


while preserving similarity

3. Locality-Sensitive Hashing: Focus on pairs of signatures


likely to be from similar documents
 Candidate pairs!

56
The Big Picture

(Pipeline: Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity.)
Shingling

Step 1: Shingling: Convert documents to sets
Documents as High-Dim Data

 Step 1: Shingling: Convert documents to sets

 Simple approaches:
 Document = set of words appearing in document
 Document = set of “important” words
 Don’t work well for this application. Why?

 Need to account for ordering of words!


 A different way: Shingles!
59
Define: Shingles
 A k-shingle (or k-gram) for a document is a sequence of
k tokens that appears in the doc
 Tokens can be characters, words or something else, depending
on the application
 Assume tokens = characters for examples

 Example: k=2; document D1 = abcab


Set of 2-shingles: S(D1) = {ab, bc, ca}
 Option: Shingles as a bag (multiset), count ab twice: S’(D1) =
{ab, bc, ca, ab}

60
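
A small Python sketch of character k-shingling (the function name shingles is illustrative):

def shingles(doc, k=2):
    """Return the set of character k-shingles of a string."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

# Example: shingles("abcab", 2) == {"ab", "bc", "ca"}, matching S(D1) above.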
Similarity Metric for Shingles
 Document D1 is a set of its k-shingles C1=S(D1)
 Equivalently, each document is a 0/1 vector in the space
of k-shingles
 Each unique shingle is a dimension
 Vectors are very sparse
 A natural similarity measure is the Jaccard similarity:
sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|

61
Working Assumption

 Documents that have lots of shingles in common have


similar text, even if the text appears in different order

 Caveat: You must pick k large enough, or most documents


will have most shingles
 k = 5 is OK for short documents
 k = 10 is better for long documents

62
Motivation for Minhash/LSH
 Suppose we need to find near-duplicate documents
among 𝑵 = 𝟏 million documents

 Naïvely, we would have to compute pairwise


Jaccard similarities for every pair of docs
 𝑵(𝑵 − 𝟏)/𝟐 ≈ 5·10¹¹ comparisons
 At 10⁵ secs/day and 10⁶ comparisons/sec, it would take 5 days

 For 𝑵 = 𝟏𝟎 million, it takes more than a year…

63
MinHashing

Step 2: Minhashing: Convert large sets to short signatures, while preserving similarity
Encoding Sets as Bit Vectors
 Many similarity problems can be
formalized as finding subsets that
have significant intersection
 Encode sets using 0/1 (bit, boolean) vectors
 One dimension per element in the universal set
 Interpret set intersection as bitwise AND, and
set union as bitwise OR

 Example: C1 = 10111; C2 = 10011


 Size of intersection = 3; size of union = 4
 Jaccard similarity (not distance) = 3/4
 Distance: d(C1, C2) = 1 − (Jaccard similarity) = 1/4
From Sets to Boolean Matrices
 Rows = elements (shingles)
 Columns = sets (documents)
 1 in row e and column s if and only if e is a member of s
 Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
 Typical matrix is sparse!
 Each document is a column. Example matrix (shingles × documents):

    C1 C2 C3 C4
     1  1  1  0
     1  1  0  1
     0  1  0  1
     0  0  0  1
     1  0  0  1
     1  1  1  0
     1  0  1  0

 Example: sim(C1, C2) = ?
  Size of intersection = 3; size of union = 6; Jaccard similarity (not distance) = 3/6
  d(C1, C2) = 1 − (Jaccard similarity) = 3/6
Outline: Finding Similar Columns
 So far:
 Documents  Sets of shingles
 Represent sets as boolean vectors in a matrix
 Next goal: Find similar columns while computing small
signatures
 Similarity of columns == similarity of signatures

67
Outline: Finding Similar Columns
 Next Goal: Find similar columns, Small signatures
 Naïve approach:
 1) Signatures of columns: small summaries of columns
 2) Examine pairs of signatures to find similar columns
 Essential: Similarities of signatures and columns are related
 3) Optional: Check that columns with similar signatures are really
similar
 Warnings:
 Comparing all pairs may take too much time: Job for LSH
 These methods can produce false negatives, and even false positives
(if the optional check is not made)
68
Hashing Columns (Signatures)
 Key idea: “hash” each column C to a small signature h(C), such that:
 (1) h(C) is small enough that the signature fits in RAM
 (2) sim(C1, C2) is the same as the “similarity” of signatures h(C1) and h(C2)

 Goal: Find a hash function h(·) such that:


 If sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
 If sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

 Hash docs into buckets. Expect that “most” pairs of near duplicate docs
hash into the same bucket!

69
Min-Hashing
 Goal: Find a hash function h(·) such that:
 if sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
 if sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

 Clearly, the hash function depends on the similarity


metric:
 Not all similarity metrics have a suitable hash function
 There is a suitable hash function for the Jaccard
similarity: It is called Min-Hashing

70
Min-Hashing
 Imagine the rows of the boolean matrix permuted under a random permutation π

 Define a “hash” function hπ(C) = the index of the first (in the permuted order π) row in which column C has value 1:

    hπ(C) = min π(C)

 Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature of a column
71
Min-Hashing Example

Permutations π1, π2, π3 (left), input matrix (shingles × documents, middle), signature matrix M (right):

    π1 π2 π3 | documents | signature matrix M
     2  4  3 | 1 0 1 0   | 2 1 2 1   (row for π1)
     3  2  4 | 1 0 0 1   | 2 1 4 1   (row for π2)
     7  1  7 | 0 1 0 1   | 1 2 1 2   (row for π3)
     6  3  2 | 0 1 0 1   |
     1  6  6 | 0 1 0 1   |
     5  7  1 | 1 0 1 0   |
     4  5  5 | 1 0 1 0   |

Under π1, the 2nd row (in permuted order) is the first to map to a 1 in the first column; under π2, the 4th row (in permuted order) is the first to map to a 1.
The Min-Hash Property
 Choose a random permutation π
 Claim: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
 Why?
  Let X be a doc (set of shingles), and let y ∈ X be a shingle
  Then: Pr[π(y) = min(π(X))] = 1/|X|
   It is equally likely that any y ∈ X is mapped to the min element
  Let y be such that π(y) = min(π(C1 ∪ C2))
  Then either: π(y) = min(π(C1)) if y ∈ C1, or
               π(y) = min(π(C2)) if y ∈ C2
   (one of the two columns had to have a 1 at position y)
  So the prob. that both are true is the prob. that y ∈ C1 ∩ C2
  Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)

73
Similarity for Signatures

 We know: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)


 Now generalize to multiple hash functions

 The similarity of two signatures is the fraction of the


hash functions in which they agree

 Note: Because of the Min-Hash property, the similarity of


columns is the same as the expected similarity of their
signatures

75
Min-Hashing Example

Permutations π1, π2, π3 (left), input matrix (shingles × documents, middle), signature matrix M (right):

    π1 π2 π3 | documents | signature matrix M
     2  4  3 | 1 0 1 0   | 2 1 2 1
     3  2  4 | 1 0 0 1   | 2 1 4 1
     7  1  7 | 0 1 0 1   | 1 2 1 2
     6  3  2 | 0 1 0 1   |
     1  6  6 | 0 1 0 1   |
     5  7  1 | 1 0 1 0   |
     4  5  5 | 1 0 1 0   |

Similarities:      1-3    2-4    1-2    3-4
    Col/Col        0.75   0.75   0      0
    Sig/Sig        0.67   1.00   0      0
76
Min-Hash Signatures
 Pick K=100 random permutations of the rows
 Think of sig(C) as a column vector
 sig(C)[i] = according to the i-th permutation, the index of the first row that has a 1 in column C

    sig(C)[i] = min πᵢ(C)


 Note: The sketch (signature) of document C is small ~𝟏𝟎𝟎 bytes!

 We achieved our goal! We “compressed” long bit vectors into short


signatures

77
Implementation Trick
 Permuting rows even once is prohibitive
 Row hashing!
  Pick K = 100 hash functions kᵢ
  Ordering under kᵢ gives a random row permutation!
 One-pass implementation
  For each column C and hash function kᵢ, keep a “slot” for the min-hash value
  Initialize all sig(C)[i] = ∞
  Scan rows looking for 1s
   Suppose row j has 1 in column C
   Then for each kᵢ:
    If kᵢ(j) < sig(C)[i], then sig(C)[i] ← kᵢ(j)

How to pick a random hash function h(x)? Universal hashing:
    h_{a,b}(x) = ((a·x + b) mod p) mod N
where: a, b … random integers; p … prime number (p > N)
78
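
A one-pass Python sketch of this trick, using universal hashing h_{a,b}(x) = ((a·x + b) mod p) mod N in place of true permutations; minhash_signatures, K, and p are illustrative choices.

import random

def minhash_signatures(columns, n_rows, K=100, p=2_147_483_647):
    """columns: list of sets of row indices holding a 1. Returns a K x len(columns) signature matrix."""
    # K universal hash functions, one per simulated permutation (p is a prime > n_rows)
    funcs = [(random.randrange(1, p), random.randrange(0, p)) for _ in range(K)]
    sig = [[float("inf")] * len(columns) for _ in range(K)]
    for j in range(n_rows):                                   # scan rows looking for 1s
        hashes = [((a * j + b) % p) % n_rows for (a, b) in funcs]
        for c, rows in enumerate(columns):
            if j in rows:                                     # row j has a 1 in column c
                for i in range(K):
                    if hashes[i] < sig[i][c]:                 # keep the min-hash value per slot
                        sig[i][c] = hashes[i]
    return sig

The fraction of rows on which two signature columns agree then estimates the Jaccard similarity of the underlying sets.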
Locality Sensitive Hashing

Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
LSH: First Cut
 Goal: Find documents with Jaccard similarity at least s
(for some similarity threshold, e.g., s=0.8)

 LSH – General idea: Use a function f(x,y) that tells


whether x and y is a candidate pair: a pair of elements
whose similarity must be evaluated

 For Min-Hash matrices:


 Hash columns of signature matrix M to many buckets
 Each pair of documents that hashes into the
same bucket is a candidate pair
80
Candidates from Min-Hash

 Pick a similarity threshold s (0 < s < 1)

 Columns x and y of M are a candidate pair if


their signatures agree on at least fraction s of
their rows:
M (i, x) = M (i, y) for at least frac. s values of i
 We expect documents x and y to have the same
(Jaccard) similarity as their signatures

81
LSH for Min-Hash
 Big idea: Hash columns of
signature matrix M several times

 Arrange that (only) similar columns are


likely to hash to the same bucket, with
high probability

 Candidate pairs are those that hash to


the same bucket

82
Partition M into b Bands

(Figure: the signature matrix M is divided into b bands of r rows each; one column of M is one signature.)
Partition M into Bands
 Divide matrix M into b bands of r rows

 For each band, hash its portion of each


column to a hash table with k buckets
 Make k as large as possible

 Candidate column pairs are those that hash


to the same bucket for ≥ 1 band

 Tune b and r to catch most similar pairs, but few non-similar pairs
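
A banding sketch in Python on top of the hypothetical minhash_signatures output above, assuming the signature matrix has exactly b·r rows; lsh_candidate_pairs is an illustrative name.

from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(sig, b, r):
    """Return pairs of columns of sig (a b*r x n matrix) that agree in at least one band."""
    n_cols = len(sig[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_cols):
            # hash the band's portion of column c; identical portions share a bucket
            key = tuple(sig[band * r + i][c] for i in range(r))
            buckets[key].append(c)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates

Hashing the r-tuple directly realizes the simplifying assumption below: columns share a bucket only if they are identical in that band.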


Hashing Bands

(Figure: for one band of matrix M, each column's r-row portion is hashed into buckets. Columns 2 and 6 land in the same bucket, so they are probably identical in that band — a candidate pair. Columns 6 and 7 land in different buckets, so they are surely different in that band.)
Simplifying Assumption

 There are enough buckets that columns are unlikely to


hash to the same bucket unless they are identical in a
particular band

 Hereafter, we assume that “same bucket” means


“identical in that band”

 Assumption needed only to simplify analysis, not for


correctness of algorithm
86
Example of Bands

Assume the following case:


 Suppose 100,000 columns of M (100k docs)
 Signatures of 100 integers (rows)
 Therefore, signatures take 40 MB
 Choose b = 20 bands of r = 5 integers/band

 Goal: Find pairs of documents that are at least s = 0.8 similar
C1, C2 are 80% Similar
 Find pairs with ≥ s = 0.8 similarity; set b = 20, r = 5
 Assume: sim(C1, C2) = 0.8
 Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: we want them to hash to at least 1 common bucket (at least one band is identical)
 Probability C1, C2 identical in one particular band: (0.8)⁵ = 0.328
 Probability C1, C2 are not identical in any of the 20 bands: (1 − 0.328)²⁰ = 0.00035
  i.e., about 1/3000th of the 80%-similar column pairs are false negatives (we miss them)
  We would find 99.965% of the truly similar pairs of documents
88
C1, C2 are 30% Similar
 Find pairs with ≥ s = 0.8 similarity; set b = 20, r = 5
 Assume: sim(C1, C2) = 0.3
 Since sim(C1, C2) < s, we want C1, C2 to hash to NO common buckets (all bands should be different)
 Probability C1, C2 identical in one particular band: (0.3)⁵ = 0.00243
 Probability C1, C2 identical in at least 1 of 20 bands: 1 − (1 − 0.00243)²⁰ = 0.0474
 In other words, approximately 4.74% of pairs of docs with similarity 0.3 end up becoming candidate pairs
  They are false positives, since we will have to examine them (they are candidate pairs) but it will then turn out that their similarity is below threshold s
LSH Involves a Tradeoff
 Pick:
 The number of Min-Hashes (rows of M)
 The number of bands b, and
 The number of rows r per band
to balance false positives/negatives

 Example: If we had only 15 bands of 5 rows,


the number of false positives would go
down, but the number of false negatives
would go up
90
Analysis of LSH – What We Want

(Figure: the ideal behavior is a step function of the similarity t = sim(C1, C2): probability of sharing a bucket = 1 if t > s, and no chance of sharing a bucket if t < s, where s is the similarity threshold.)
What 1 Band of 1 Row Gives You

(Figure: with a single min-hash row, the probability of sharing a bucket grows linearly with the similarity t = sim(C1, C2). Remember: the probability of equal hash-values equals the similarity.)
b bands, r rows/band

 Columns C1 and C2 have similarity t

 Pick any band (r rows)
  Prob. that all rows in the band are equal = tʳ
  Prob. that some row in the band is unequal = 1 − tʳ

 Prob. that no band is identical = (1 − tʳ)ᵇ

 Prob. that at least 1 band is identical = 1 − (1 − tʳ)ᵇ

93
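
A tiny Python helper for this probability; prob_candidate is an illustrative name.

def prob_candidate(t, b, r):
    """Probability that two columns with similarity t become a candidate pair: 1 - (1 - t^r)^b."""
    return 1.0 - (1.0 - t ** r) ** b

# For b = 20, r = 5 this reproduces the table two slides below:
# prob_candidate(0.2, 20, 5) ≈ .006, prob_candidate(0.3, 20, 5) ≈ .047, ..., prob_candidate(0.8, 20, 5) ≈ .9996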
What b Bands of r Rows Gives You

(Figure: probability of sharing a bucket as a function of the similarity t = sim(C1, C2): the curve 1 − (1 − tʳ)ᵇ. It is near 0 when no band is identical (some row of every band unequal) and near 1 when at least one band is identical (all rows of some band equal). The curve rises steeply around the threshold s ≈ (1/b)^(1/r).)
94
Example: b = 20; r = 5
 Similarity threshold s
 Prob. that at least 1 band is identical:

    s      1 − (1 − sʳ)ᵇ
    .2     .006
    .3     .047
    .4     .186
    .5     .470
    .6     .802
    .7     .975
    .8     .9996
Picking r and b: The S-curve
 Picking r and b to get the best S-curve
  50 hash-functions (r = 5, b = 10)

(Figure: the S-curve of Prob. sharing a bucket vs. Similarity for r = 5, b = 10; the green area marks the false negative rate, the orange area the false positive rate.)
96
LSH Summary
 Tune M, b, r to get almost all pairs with similar signatures, but
eliminate most pairs that do not have similar signatures

 Check in main memory that candidate pairs really do have similar


signatures

 Optional: In another pass through data, check that the remaining


candidate pairs really represent similar documents

97
Summary: 3 Steps
 Shingling: Convert documents to sets
 We used hashing to assign each shingle an ID
 Min-Hashing: Convert large sets to short signatures, while preserving
similarity
 We used similarity-preserving hashing to generate signatures with the property
Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
 We used hashing to get around generating random permutations
 Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
 We used hashing to find candidate pairs of similarity ≥ s

98
