Motivation
Deterministic Algorithm

[Diagram: Input → Algorithm → Output]

The output as well as the running time are functions only of the input.
Goal: Prove that for all input instances the algorithm solves the problem correctly and the number of steps is bounded by a polynomial in the size of the input.
Randomized Algorithm

[Diagram: Input + random bits → Algorithm → Output]

The output or the running time are functions of the input and the random bits chosen.
Behavior can vary even on a fixed input.
Motivation for Randomized Algorithms
Simplicity
Performance
Types of Randomized Algorithms
Las Vegas Randomized Algorithms

[Diagram: Input + random numbers → Algorithm → Output]

Goal: Prove that for all input instances the algorithm solves the problem correctly and the expected number of steps is bounded by a polynomial in the input size.
Alternative goal: Show that for all inputs the algorithm works correctly and for most inputs the number of steps is bounded by a polynomial in the size of the input.
Monte Carlo Randomized Algorithms

[Diagram: Input + random numbers → Algorithm → Output]

These expectations are only over the random choices made by the algorithm, independent of the input. Thus independent repetitions of Monte Carlo algorithms drive down the failure probability exponentially.
Example 1: Randomized QuickSort (Las Vegas Algorithm)
QuickSort(S)

QuickSort(S)
{ if (|S| > 1)
    pick and remove an element x from S;
    (S_{<x}, S_{>x}) ← Partition(S, x);
    return Concatenate(QuickSort(S_{<x}), x, QuickSort(S_{>x}));
}
QuickSort(S)

Deterministic version:

QuickSort(A, l, r)
{ if (l < r)
    x ← l;
    i ← Partition(A, l, r, x);
    QuickSort(A, l, i − 1);
    QuickSort(A, i + 1, r);
}

Randomized version:

QuickSort(A, l, r)
{ if (l < r)
    x ← an index selected uniformly at random from [l..r];
    i ← Partition(A, l, r, x);
    QuickSort(A, l, i − 1);
    QuickSort(A, i + 1, r);
}
Distribution insensitive: the time taken does not depend on the initial permutation of A.
The time taken depends on the random choices of pivot elements.
1. For a given input, expected (average) running time: O(n log n)
2. Worst-case running time: O(n²)
Partition

Partition(A, l, r, x)
{
    swap(A[r], A[x]);
    pivot ← A[r];
    i ← l − 1;
    for (j ← l to r − 1) {
        if (A[j] ≤ pivot) {
            i ← i + 1;
            swap(A[i], A[j]);
        }
    }
    swap(A[i+1], A[r]);
    return i + 1;
}
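The pseudocode above translates into a runnable sketch; the function names and the in-place interface are my own choices, with the pivot index chosen uniformly at random as in the randomized version:

```python
import random

def partition(A, l, r, p):
    """Lomuto-style partition: move the pivot at index p to the end,
    then split A[l..r] around the pivot value; return the pivot's final index."""
    A[p], A[r] = A[r], A[p]
    pivot = A[r]
    i = l - 1
    for j in range(l, r):
        if A[j] <= pivot:
            i += 1
            A[i], A[j] = A[j], A[i]
    A[i + 1], A[r] = A[r], A[i + 1]
    return i + 1

def quicksort(A, l=0, r=None):
    """Randomized quicksort: pivot index chosen uniformly from [l..r]."""
    if r is None:
        r = len(A) - 1
    if l < r:
        p = random.randint(l, r)   # uniform random pivot index
        i = partition(A, l, r, p)
        quicksort(A, l, i - 1)
        quicksort(A, i + 1, r)
    return A
```

The random pivot choice is exactly what makes the running time independent of the input permutation, as noted above.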
Analysis of Randomized Quick Sort
Linearity of Expectation

E[ Σ_{i=1}^{n} X_i ] = Σ_{i=1}^{n} E[X_i]
Notation

Rename the elements of A as z1, z2, . . . , zn, with zi being the ith smallest element (rank "i").
Example array: 2 9 8 3 5 4 1 6 10 7, i.e., z2 z9 z8 z3 z5 z4 z1 z6 z10 z7.
Expected Number of Total Comparisons in PARTITION

Let X_ij = I{z_i is compared to z_j} (an indicator random variable).
Let X be the total number of comparisons performed by the algorithm. Then

X = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} X_ij

The expected number of comparisons performed by the algorithm is

E[X] = E[ Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} X_ij ]
     = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} E[X_ij]          (by linearity of expectation)
     = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} Pr{z_i is compared to z_j}
Comparisons in PARTITION

Observation 1: Each pair of elements is compared at most once during the entire execution of the algorithm.

What is Pr{z_i is compared to z_j}?
Case 1: a pivot x is chosen such that z_i < x < z_j. Then z_i and z_j will never be compared.
Case 2: z_i or z_j is the first pivot chosen from the set {z_i, z_{i+1}, ..., z_j}. Then z_i and z_j are compared. Since each of the j − i + 1 elements of this set is equally likely to be the first one chosen as a pivot,

Pr{z_i is compared to z_j} = 2 / (j − i + 1)

Therefore

E[X] = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} 2/(j − i + 1)
     = Σ_{i=1}^{n−1} Σ_{k=1}^{n−i} 2/(k + 1)
     < Σ_{i=1}^{n−1} Σ_{k=1}^{n} 2/k
     = Σ_{i=1}^{n−1} O(lg n)
     = O(n lg n)
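The double sum Σ_{i<j} 2/(j − i + 1) can also be evaluated numerically to confirm the O(n lg n) growth; this helper function is my own, not part of the lecture:

```python
import math

def expected_comparisons(n):
    """Exact expected number of comparisons of randomized quicksort:
    E[X] = sum over pairs 1 <= i < j <= n of 2/(j - i + 1)."""
    return sum(2.0 / (j - i + 1)
               for i in range(1, n)
               for j in range(i + 1, n + 1))
```

For n = 2 there is a single pair and E[X] = 2/2 = 1 comparison; for larger n the value stays below the 2·n·ln(n) bound derived above.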
Example 2: Max Cut (Monte Carlo Algorithm)
Global Cut

[Figure: a cut (S, V − S) partitions the vertex set V into S and V − S]

Markov's Inequality
For a nonnegative random variable X and any a > 0: Pr[X ≥ a] ≤ E[X]/a.
Equivalently, Pr[X ≥ a·E[X]] ≤ 1/a.

Final Comment
Graph Contraction

For an undirected graph G, we can construct a new graph G' by contracting two vertices u, v in G as follows:
u and v become one vertex {u,v} and the edge (u,v) is removed;
the other edges incident to u or v in G are now incident on the new vertex {u,v} in G'.
Note: There may be multi-edges between two vertices. We just keep them.

[Figure: Graph G with vertices a, b, c, d, e and an edge (u,v); contracting u and v yields Graph G' with the merged vertex {u,v}]
Karger’s Min-cut Algorithm

[Figure: (i) Graph G with nodes A, B, C, D; (ii) contract nodes C and D; (iii) contract nodes A and CD]
For i = 1 to x:
    repeat
        randomly pick an edge (u, v)
        contract u and v
    until two vertices are left
    c_i ← the number of edges between them
Output min_i c_i
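The repeated-contraction loop above can be sketched as follows; the graph representation (an edge list plus a union-find structure to implement contraction) and all function names are my own:

```python
import random

def karger_one_trial(edges, n_vertices, rng):
    """Contract random edges until two super-vertices remain;
    return the number of edges crossing the resulting cut."""
    parent = list(range(n_vertices))      # union-find over vertices

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    remaining = n_vertices
    edge_list = list(edges)
    while remaining > 2:
        u, v = rng.choice(edge_list)      # uniform among original edges;
        ru, rv = find(u), find(v)         # self-loops are skipped, which is
        if ru != rv:                      # rejection sampling over live edges
            parent[ru] = rv               # contract u and v
            remaining -= 1
    # count original edges between the two remaining super-vertices
    return sum(1 for u, v in edges if find(u) != find(v))

def karger_min_cut(edges, n_vertices, trials=100, seed=0):
    """Repeat the contraction x (= trials) times, output the minimum cut found."""
    rng = random.Random(seed)
    return min(karger_one_trial(edges, n_vertices, rng) for _ in range(trials))
```

Each trial returns some cut of the graph, so the minimum over many trials only improves; the analysis below bounds how many trials are needed to find a minimum cut with good probability.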
Key Idea
Analysis of Karger’s Algorithm
Let k be the number of edges of the min cut (S, V − S).
Problem for Today

Given: high-dimensional data points x1, x2, ...
For example: an image is a long vector of pixel colors:

1 2 1
0 2 1  →  [1 2 1 0 2 1 0 1 0]
0 1 0

And some distance function d(x1, x2) which quantifies the "distance" between x1 and x2.

Goal: Find all pairs of data points (xi, xj) that are within some distance threshold: d(xi, xj) ≤ s.

Note: A naïve solution would take O(N²), where N is the number of data points.
Example: with 3 elements in the intersection and 8 in the union,
Jaccard similarity = 3/8
Jaccard distance = 5/8
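The Jaccard numbers above are straightforward to compute directly on sets; these small helpers (names are mine) reproduce the 3/8 and 5/8 example:

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)
```

For instance, {1,2,3,4,5} and {3,4,5,6,7,8} share 3 elements out of 8 total, giving similarity 3/8 and distance 5/8.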
Task: Finding Similar Documents
Goal: Given a large number (𝑵 in the millions or billions) of documents,
find “near duplicate” pairs
Applications:
Mirror websites, or approximate mirrors
Don’t want to show both in search results
Problems:
Many small pieces of one document can appear
out of order in another
Too many documents to compare all pairs
Documents are so large or so many that they cannot
fit in main memory
3 Essential Steps for Similar Docs
The Big Picture

[Pipeline: Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets, and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity]
Shingling
Step 1: Shingling: Convert documents to sets
Documents as High-Dim Data
Simple approaches:
Document = set of words appearing in document
Document = set of “important” words
Don’t work well for this application. Why?
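A character-level k-shingling sketch (the "set of strings of length k that appear in the document" from the pipeline above); hashing shingles to integer IDs is omitted here:

```python
def shingles(doc, k):
    """The set of all length-k substrings (character k-shingles) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}
```

For example, the 2-shingles of "abcab" are {"ab", "bc", "ca"}: the second occurrence of "ab" collapses because shingles form a set, which is what makes near-duplicate documents have heavily overlapping shingle sets.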
Similarity Metric for Shingles

Document D1 is represented by the set of its k-shingles: C1 = S(D1).
Equivalently, each document is a 0/1 vector in the space of k-shingles:
each unique shingle is a dimension, and the vectors are very sparse.
A natural similarity measure is the Jaccard similarity:

sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
Working Assumption
Motivation for Minhash/LSH
Suppose we need to find near-duplicate documents
among 𝑵 = 𝟏 million documents
MinHashing
Step 2: Minhashing: Convert large sets to short
signatures, while preserving similarity
Encoding Sets as Bit Vectors
Many similarity problems can be
formalized as finding subsets that
have significant intersection
Encode sets using 0/1 (bit, boolean) vectors
One dimension per element in the universal set
Interpret set intersection as bitwise AND, and
set union as bitwise OR
Shingles

Example: size of intersection = 3; size of union = 6.
Jaccard similarity (not distance): sim(C1, C2) = 3/6.
Jaccard distance: d(C1, C2) = 1 − (Jaccard similarity) = 1 − 3/6.
[Figure: the two 0/1 shingle columns C1 and C2 used in this example]
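Interpreting set intersection as bitwise AND and set union as bitwise OR, the example above can be reproduced on 0/1 vectors; the sample vectors here are my own, chosen to give intersection 3 and union 6:

```python
def jaccard_from_bits(c1, c2):
    """Jaccard similarity of two 0/1 vectors:
    intersection = bitwise AND, union = bitwise OR, counted per position."""
    inter = sum(x & y for x, y in zip(c1, c2))
    union = sum(x | y for x, y in zip(c1, c2))
    return inter / union
```

With c1 = [1,1,1,1,0,0] and c2 = [1,1,1,0,1,1] the AND has 3 ones and the OR has 6, giving sim = 3/6.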
Outline: Finding Similar Columns

So far:
Documents → sets of shingles
Represent sets as boolean vectors in a matrix

Next goal: Find similar columns while computing small signatures.
Similarity of columns == similarity of signatures.
Outline: Finding Similar Columns
Next Goal: Find similar columns, Small signatures
Naïve approach:
1) Signatures of columns: small summaries of columns
2) Examine pairs of signatures to find similar columns
Essential: Similarities of signatures and columns are related
3) Optional: Check that columns with similar signatures are really
similar
Warnings:
Comparing all pairs may take too much time: Job for LSH
These methods can produce false negatives, and even false positives
(if the optional check is not made)
Hashing Columns (Signatures)
Key idea: “hash” each column C to a small signature h(C), such that:
(1) h(C) is small enough that the signature fits in RAM
(2) sim(C1, C2) is the same as the “similarity” of signatures h(C1) and h(C2)
Hash docs into buckets. Expect that “most” pairs of near duplicate docs
hash into the same bucket!
Min-Hashing
Goal: Find a hash function h(·) such that:
if sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
if sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)
Min-Hashing

Imagine the rows of the boolean matrix permuted under a random permutation π.
Define a "hash" function h_π(C) = the index of the first (in the permuted order π) row in which column C has a 1.
Use several independent random permutations to create a signature of a column.
Min-Hashing Example

Permutations π1, π2, π3 and the input matrix (Shingles × Documents):

π1 π2 π3 | C1 C2 C3 C4
 2  4  3 |  1  0  1  0
 3  2  4 |  1  0  0  1
 7  1  7 |  0  1  0  1
 6  3  2 |  0  1  0  1
 1  6  6 |  0  1  0  1
 5  7  1 |  1  0  1  0
 4  5  5 |  1  0  1  0

Signature matrix M:
2 1 2 1
2 1 4 1
1 2 1 2

(E.g., under π1 the 2nd element of the permutation is the first to map to a 1 in C1, so sig(C1)[1] = 2; under π3 the 4th element of the permutation is the first to map to a 1.)
The Min-Hash Property

Choose a random permutation π.
Claim: Pr[h_π(C1) = h_π(C2)] = sim(C1, C2)
Why?
Let X be a doc (set of shingles), and let y ∈ X be a shingle.
Then: Pr[π(y) = min(π(X))] = 1/|X|
(it is equally likely that any y ∈ X is mapped to the min element).
Let y be such that π(y) = min(π(C1 ∪ C2)).
Then either π(y) = min(π(C1)) if y ∈ C1, or π(y) = min(π(C2)) if y ∈ C2
(one of the two columns had to have a 1 at position y).
So the probability that both are true is the probability that y ∈ C1 ∩ C2:

Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)
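The claim Pr[h_π(C1) = h_π(C2)] = sim(C1, C2) can be sanity-checked by sampling random permutations; this quick experiment is my own addition, not part of the slides:

```python
import random

def minhash(perm, column):
    """Min-hash of a 0/1 column under a row permutation perm:
    the smallest permuted index among rows holding a 1."""
    return min(perm[row] for row, bit in enumerate(column) if bit)

def estimate_collision_prob(c1, c2, trials=20000, seed=0):
    """Estimate Pr[h(C1) = h(C2)] over random row permutations."""
    rng = random.Random(seed)
    n = len(c1)
    hits = 0
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        hits += minhash(perm, c1) == minhash(perm, c2)
    return hits / trials
```

For two columns with Jaccard similarity 3/6, the estimated collision probability comes out close to 0.5, matching the claim.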
Similarity for Signatures

The similarity of two signatures is the fraction of the signature positions (hash functions) in which they agree.
Min-Hashing Example

Permutations π1, π2, π3 and the input matrix (Shingles × Documents):

π1 π2 π3 | C1 C2 C3 C4
 2  4  3 |  1  0  1  0
 3  2  4 |  1  0  0  1
 7  1  7 |  0  1  0  1
 6  3  2 |  0  1  0  1
 1  6  6 |  0  1  0  1
 5  7  1 |  1  0  1  0
 4  5  5 |  1  0  1  0

Signature matrix M:
2 1 2 1
2 1 4 1
1 2 1 2

Similarities:
          1-3   2-4   1-2   3-4
Col/Col   0.75  0.75  0     0
Sig/Sig   0.67  1.00  0     0
Min-Hash Signatures
Pick K=100 random permutations of the rows
Think of sig(C) as a column vector
sig(C)[i] = according to the i-th permutation, the index of the first row
that has a 1 in column C
Implementation Trick

Permuting rows even once is prohibitive.
Row hashing! Pick K = 100 hash functions h_i. Ordering under h_i gives a random row permutation.

One-pass implementation:
For each column C and hash function h_i, keep a "slot" for the min-hash value.
Initialize all sig(C)[i] = ∞.
Scan rows looking for 1s. Suppose row j has a 1 in column C. Then for each h_i:
if h_i(j) < sig(C)[i], then sig(C)[i] ← h_i(j).

How to pick a random hash function h(x)? Universal hashing:
h_{a,b}(x) = ((a·x + b) mod p) mod N
where a, b are random integers and p is a prime number (p > N).
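The one-pass scheme can be sketched as follows. The row index plays the role of x, p = 2³¹ − 1 is a prime, and details such as the column layout and the hash range are my own assumptions:

```python
import random

def minhash_signatures(columns, K=100, p=2**31 - 1, seed=0):
    """One-pass min-hash: columns is a list of 0/1 lists (one per document).
    Uses K universal hash functions h(x) = ((a*x + b) mod p) mod n_rows."""
    rng = random.Random(seed)
    n_rows = len(columns[0])
    funcs = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(K)]
    sig = [[float("inf")] * K for _ in columns]   # one slot per (column, h_i)
    for row in range(n_rows):                     # a single scan over the rows
        hashes = [((a * row + b) % p) % n_rows for a, b in funcs]
        for c, col in enumerate(columns):
            if col[row]:                          # row has a 1 in this column
                for i in range(K):
                    if hashes[i] < sig[c][i]:
                        sig[c][i] = hashes[i]
    return sig
```

Identical columns necessarily receive identical signatures, since every slot update is a function of the row indices holding a 1.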
LSH for Min-Hash

Big idea: Hash columns of signature matrix M several times.
Arrange that (only) similar columns are likely to hash to the same bucket, with high probability.
Partition M into b Bands

[Figure: the signature matrix M divided into b bands of r rows each; one column of M is one signature]

Divide matrix M into b bands of r rows.
For each band, hash its portion of each column to a hash table with many buckets.
Candidate column pairs are those that hash to the same bucket for at least one band.
Simplifying Assumption

[Figure: probability of sharing a bucket as a function of similarity t, drawn as a step function at the similarity threshold s: probability = 1 if t > s; no chance of sharing a bucket if t < s]

Remember: the probability of equal hash-values = similarity.
What b Bands of r Rows Gives You

For two sets with similarity t = sim(C1, C2):

t^r             = probability that all rows of a band are equal
1 − t^r         = probability that some row of a band is unequal
(1 − t^r)^b     = probability that no bands are identical
1 − (1 − t^r)^b = probability that at least one band is identical (the pair shares a bucket)

The threshold of this S-curve is at approximately s ~ (1/b)^{1/r}.
Example: b = 20; r = 5

Probability that at least 1 band is identical, as a function of the similarity s:

s    1 − (1 − s^r)^b
.2   .006
.3   .047
.4   .186
.5   .470
.6   .802
.7   .975
.8   .9996
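The table values follow directly from the formula; a one-line helper (mine) reproduces them:

```python
def prob_candidate(s, r, b):
    """Probability that two columns with similarity s share at least
    one identical band: 1 - (1 - s**r)**b."""
    return 1 - (1 - s ** r) ** b
```

Evaluating prob_candidate(s, r=5, b=20) over s = .2, .3, ..., .8 recovers the column on the right, showing the sharp jump of the S-curve between s = .4 and s = .6.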
Picking r and b: The S-curve

Picking r and b to get the best S-curve.
[Figure: S-curve for 50 hash-functions (r = 5, b = 10); x-axis: similarity (0 to 1); y-axis: probability of sharing a bucket; the orange area is the false positive rate]
LSH Summary
Tune M, b, r to get almost all pairs with similar signatures, but
eliminate most pairs that do not have similar signatures
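The banding step itself can be sketched directly. This is a minimal version of my own; real systems hash each band's chunk into a large table rather than using the chunk verbatim as a dictionary key:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """Bucket each band of each signature; any two signatures sharing
    a bucket in at least one band become a candidate pair."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in enumerate(signatures):
            chunk = tuple(sig[band * r:(band + 1) * r])   # this band's rows
            buckets[chunk].append(doc_id)
        for members in buckets.values():
            for pair in combinations(sorted(members), 2):
                candidates.add(pair)
    return candidates
```

Only candidate pairs are ever compared in full, which is what replaces the naïve O(N²) comparison of all pairs.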
Summary: 3 Steps
Shingling: Convert documents to sets
We used hashing to assign each shingle an ID
Min-Hashing: Convert large sets to short signatures, while preserving
similarity
We used similarity preserving hashing to generate signatures with property
Pr[h(C1) = h(C2)] = sim(C1, C2)
We used hashing to get around generating random permutations
Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from
similar documents
We used hashing to find candidate pairs of similarity ≥ s