3.4 Hash Tables
‣ hash functions
‣ separate chaining
‣ linear probing
‣ applications
Algorithms, 4th Edition · Robert Sedgewick and Kevin Wayne · Copyright © 2002–2010 · January 30, 2011 1:00:31 PM
Optimize judiciously
“We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil.” (Donald E. Knuth)
ST implementations: summary
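implementation       worst case (after N inserts)   average case (after N random inserts)   ordered      operations
                     search   insert   delete       search hit   insert   delete            iteration?   on keys
sequential search    N        N        N            N/2          N        N/2               no           equals()
(linked list)
binary search        lg N     N        N            lg N         N/2      N/2               yes          compareTo()
(ordered array)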
Q. Can we do better?
A. Yes, but with different access to the data.
Hashing: basic plan
[Figure: key-indexed array; hash("it") = 3, so "it" is stored at index 3.]
Issues.
• Computing the hash function.
• Equality test: Method for checking whether two keys are equal.
Hashing: basic plan
[Figure: key-indexed array; hash("it") = 3 and hash("times") = 3, so both keys compete for index 3.]
Issues.
• Computing the hash function.
• Equality test: Method for checking whether two keys are equal.
• Collision resolution: Algorithm and data structure
to handle two keys that hash to the same array index.
Computing the hash function
Ex 1. Phone numbers.
• Bad: first three digits.
Java’s hash code conventions
All Java classes inherit a method hashCode(), which returns a 32-bit int.
[Figure: two objects x and y and their hash codes x.hashCode() and y.hashCode().]
Implementing hash code: strings
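A minimal sketch of a string hash code computed with Horner's method. The multiplier 31 matches the formula documented for Java's String.hashCode(); the class name StringHashSketch is hypothetical and exists only for illustration.

```java
// Sketch: Horner's-method hash of a string with multiplier 31.
class StringHashSketch {
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i);   // int overflow wraps around; that is intended
        return h;
    }

    public static void main(String[] args) {
        // Both columns should agree for every input string.
        for (String s : new String[] { "it", "times", "call" })
            System.out.println(s + " " + hash(s) + " " + s.hashCode());
    }
}
```

Every character contributes to the result, so the cost is proportional to the length of the string.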
War story: String hashing in Java
http://www.cs.princeton.edu/introcs/13loop/Hello.java
http://www.cs.princeton.edu/introcs/13loop/Hello.class
http://www.cs.princeton.edu/introcs/13loop/Hello.html
http://www.cs.princeton.edu/introcs/12type/index.html
Implementing hash code: user-defined types
...
Hash code design
Basic rule. Need to use the whole key to compute hash code;
consult an expert for state-of-the-art hash codes.
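One common way to follow the basic rule is to fold every field into the result with the 31x + y recipe. The sketch below assumes a hypothetical value type TransactionSketch with three fields; the type, its fields, and the seed 17 are illustrative choices, not taken from the slides.

```java
// Sketch: a user-defined type whose hashCode() uses the whole key.
final class TransactionSketch {
    private final String who;     // hypothetical field
    private final int day;        // hypothetical field
    private final double amount;  // hypothetical field

    TransactionSketch(String who, int day, double amount) {
        this.who = who;
        this.day = day;
        this.amount = amount;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof TransactionSketch)) return false;
        TransactionSketch that = (TransactionSketch) o;
        return day == that.day
            && Double.compare(amount, that.amount) == 0
            && who.equals(that.who);
    }

    @Override
    public int hashCode() {
        int h = 17;                               // arbitrary nonzero seed
        h = 31 * h + who.hashCode();              // reference type: use its hashCode()
        h = 31 * h + Integer.hashCode(day);       // primitive: use the wrapper's static hashCode
        h = 31 * h + Double.hashCode(amount);
        return h;
    }
}
```

equals() and hashCode() read the same fields, so objects that compare equal always produce the same hash code.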
Modular hashing
[Code not preserved: three versions of the hash() method, annotated "bug", "1-in-a-billion bug", and "correct".]
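The three annotations plausibly refer to three ways of turning a 32-bit hash code into an index between 0 and M-1. The sketch below reconstructs them under that assumption; the method names and the table size M = 97 are hypothetical. Only the third form also appears later in this deck.

```java
// Sketch: three ways to map a hash code to a table index.
class ModularHashSketch {
    static final int M = 97;   // hypothetical table size

    // Bug: the % operator keeps the sign, so negative hash codes give negative indices.
    static int hashBuggy(Object key) {
        return key.hashCode() % M;
    }

    // Rare bug: Math.abs(Integer.MIN_VALUE) is still negative.
    static int hashAlmost(Object key) {
        return Math.abs(key.hashCode()) % M;
    }

    // Correct: mask off the sign bit first, then reduce modulo M.
    static int hashCorrect(Object key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

    public static void main(String[] args) {
        Integer worst = Integer.MIN_VALUE;   // an Integer's hash code is its own value
        System.out.println(hashBuggy(-5) + " " + hashAlmost(worst) + " " + hashCorrect(worst));
    }
}
```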
Uniform hashing assumption
[Figure: bins numbered 0 to 15 with balls tossed into them.]
Birthday problem. Expect two balls in the same bin after ~ √(π M / 2) tosses.
Java's String hash code distributes the keys of Tale of Two Cities uniformly.
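The birthday estimate can be checked empirically. The sketch below tosses balls into M bins uniformly at random and averages the number of tosses until the first collision; the class name, the seed, and the trial count are arbitrary choices made for illustration.

```java
import java.util.Random;

// Sketch: simulate the birthday problem for M bins.
class BirthdaySketch {
    // Number of tosses up to and including the first toss that lands in an occupied bin.
    static int tossesUntilCollision(int M, Random rng) {
        boolean[] occupied = new boolean[M];
        int tosses = 0;
        while (true) {
            int bin = rng.nextInt(M);
            tosses++;
            if (occupied[bin]) return tosses;
            occupied[bin] = true;
        }
    }

    static double averageTosses(int M, int trials, long seed) {
        Random rng = new Random(seed);
        long total = 0;
        for (int t = 0; t < trials; t++)
            total += tossesUntilCollision(M, rng);
        return (double) total / trials;
    }

    public static void main(String[] args) {
        int M = 10000;
        System.out.println("simulated: " + averageTosses(M, 2000, 42L));
        System.out.println("estimate:  " + Math.sqrt(Math.PI * M / 2.0));
    }
}
```

The two printed values should be close for large M, since the estimate is asymptotic.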
Collisions
[Figure: hash("it") = 3 and hash("times") = 3; both keys map to index 3, a collision.]
Separate chaining ST
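[Figure: separate-chaining trace for (key, hash, value) triples such as (M, 4, 9), (P, 3, 10), (L, 3, 11), (E, 0, 12); each array entry points (via first) to a linked list of key-value nodes, e.g. M 9, H 5, C 4, R 3.]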
Separate chaining ST: Java implementation
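public SeparateChainingHashST(int M)
{
   this.M = M;                                  // number of chains
   // one sequential-search symbol table (linked list) per array index
   st = (SequentialSearchST<Key, Value>[]) new SequentialSearchST[M];
   for (int i = 0; i < M; i++)
      st[i] = new SequentialSearchST<Key, Value>();
}

// map a key to an array index between 0 and M-1
private int hash(Key key)
{  return (key.hashCode() & 0x7fffffff) % M;  }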
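The fragment above relies on a SequentialSearchST class that is not shown. As a self-contained illustration, the sketch below stores the chains as hand-rolled linked lists instead; the class name ChainingSketch and its Node type are hypothetical, and the table never resizes.

```java
// Sketch: separate chaining with one linked list per array index.
class ChainingSketch<Key, Value> {
    private static class Node {
        final Object key;
        Object val;
        final Node next;
        Node(Object key, Object val, Node next) {
            this.key = key; this.val = val; this.next = next;
        }
    }

    private final int M;       // number of chains
    private final Node[] st;   // st[i] is the first node of chain i, or null
    private int n;             // number of key-value pairs

    ChainingSketch(int M) {
        this.M = M;
        this.st = new Node[M];
    }

    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

    @SuppressWarnings("unchecked")
    Value get(Key key) {
        for (Node x = st[hash(key)]; x != null; x = x.next)
            if (key.equals(x.key)) return (Value) x.val;   // search hit
        return null;                                       // search miss
    }

    void put(Key key, Value val) {
        int i = hash(key);
        for (Node x = st[i]; x != null; x = x.next)
            if (key.equals(x.key)) { x.val = val; return; }   // update existing key
        st[i] = new Node(key, val, st[i]);                    // insert at front of chain
        n++;
    }

    int size() { return n; }
}
```

Only the chain selected by hash(key) is scanned, so the cost of a search depends on that chain's length rather than on N.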
Analysis of separate chaining
[Figure: binomial distribution of list lengths (N = 10^4, M = 10^3, α = 10); x-axis from 0 to 30, peak at (10, .12511...).]
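implementation       worst case (after N inserts)   average case (after N random inserts)   ordered      operations
                     search   insert   delete       search hit   insert   delete            iteration?   on keys
sequential search    N        N        N            N/2          N        N/2               no           equals()
(linked list)
binary search        lg N     N        N            lg N         N/2      N/2               yes          compareTo()
(ordered array)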
Collision resolution: open addressing
st[0] jocularly
st[1] null
st[2] listen
st[3] suburban
null
st[30000] browsing
Linear probing
index:                   0  1  2  3  4  5  6  7  8  9 10 11 12
                         -  -  -  S  H  -  -  A  C  E  R  -  -
insert I, hash(I) = 11:  -  -  -  S  H  -  -  A  C  E  R  I  -
insert N, hash(N) = 8:   -  -  -  S  H  -  -  A  C  E  R  I  N
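The trace can be reproduced with a small open-addressing table. The sketch below is a minimal linear-probing table without resizing or deletion; the class name and the indexOf helper are hypothetical and exist only to expose where a key lands.

```java
// Sketch: linear probing in a fixed-size table (no resizing, no delete).
class LinearProbingSketch<Key, Value> {
    private final int M;         // table size
    private final Key[] keys;    // keys[i] == null means slot i is empty
    private final Value[] vals;
    private int n;               // number of keys stored

    @SuppressWarnings("unchecked")
    LinearProbingSketch(int M) {
        this.M = M;
        keys = (Key[]) new Object[M];
        vals = (Value[]) new Object[M];
    }

    private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

    void put(Key key, Value val) {
        int i = hash(key);
        while (keys[i] != null) {                 // probe i, i+1, i+2, ... with wraparound
            if (keys[i].equals(key)) { vals[i] = val; return; }
            i = (i + 1) % M;
        }
        if (n == M - 1)                           // keep one empty slot so searches terminate
            throw new IllegalStateException("table full; this sketch does not resize");
        keys[i] = key;
        vals[i] = val;
        n++;
    }

    Value get(Key key) {
        int i = indexOf(key);
        return i < 0 ? null : vals[i];
    }

    // Illustration helper: slot where key is stored, or -1 if absent.
    int indexOf(Key key) {
        for (int i = hash(key); keys[i] != null; i = (i + 1) % M)
            if (keys[i].equals(key)) return i;
        return -1;
    }
}
```

With M = 13 and Integer keys (whose hash code is their value), inserting 3, 4, 7, 8, 9, 10, then 11, then 21 mirrors the trace: 21 hashes to 8, finds slots 8 through 11 occupied, and settles in slot 12.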
Linear probing: trace of standard indexing client
Knuth's parking problem
displacement = 3
Analysis of linear probing
Parameters.
• M too large ⇒ too many empty array entries.
• M too small ⇒ search time blows up.
• Typical choice: α = N / M ~ ½.
# probes for search hit is about 3/2
# probes for search miss is about 5/2
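The two probe counts are consistent with Knuth's classical estimates for linear probing under the uniform hashing assumption. As a worked check, substituting α = 1/2:

```latex
\text{search hit} \;\sim\; \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)
  \;=\; \frac{1}{2}\,(1 + 2) \;=\; \frac{3}{2},
\qquad
\text{search miss / insert} \;\sim\; \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^{2}}\right)
  \;=\; \frac{1}{2}\,(1 + 4) \;=\; \frac{5}{2}.
```

Both expressions grow without bound as α approaches 1, which is why the table is kept about half full.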
ST implementations: summary
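implementation       worst case (after N inserts)   average case (after N random inserts)   ordered      operations
                     search   insert   delete       search hit   insert   delete            iteration?   on keys
sequential search    N        N        N            N/2          N        N/2               no           equals()
(linked list)
binary search        lg N     N        N            lg N         N/2      N/2               yes          compareTo()
(ordered array)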
War story: algorithmic complexity attacks
Diversion: one-way hash functions
One-way hash function. "Hard" to find a key that will hash to a desired value
(or two keys that hash to the same value).
Some one-way hash functions are known to be insecure.
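For illustration, Java exposes one-way hash functions through java.security.MessageDigest. The sketch below uses SHA-256, which is an assumption made here and not a function named in the text; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: compute a one-way hash of a string and print it in hexadecimal.
class OneWayHashSketch {
    static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest)
                hex.append(String.format("%02x", b));   // two hex digits per byte
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("abc"));
    }
}
```

Such functions are far more expensive than hashCode(), so they are generally not a good fit for symbol-table lookups.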
Separate chaining vs. linear probing
Separate chaining.
• Easier to implement delete.
• Performance degrades gracefully.
• Clustering less sensitive to poorly-designed hash function.
Linear probing.
• Less wasted space.
• Better cache performance.
Hashing: variations on the theme
Hashing vs. balanced search trees
Hashing.
• Simpler to code.
• No effective alternative for unordered keys.
• Faster for simple keys (a few arithmetic ops versus log N compares).
• Better system support in Java for strings (e.g., cached hash code).
3.5 Symbol Table Applications
‣ sets
‣ dictionary clients
‣ indexing clients
‣ sparse vectors
‣ geometric applications
Algorithms, 4th Edition · Robert Sedgewick and Kevin Wayne · Copyright © 2002–2010 · January 30, 2011 1:01:57 PM
Set API
Q. How to implement?
Exception filter
% more list.txt
was it the of          (list of exceptional words)
Exception filter applications
Exception filter: Java implementation
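public class WhiteList
{
   public static void main(String[] args)
   {
      SET<String> set = new SET<String>();    // create empty set of strings

      In in = new In(args[0]);
      while (!in.isEmpty())                   // read in whitelist
         set.add(in.readString());

      while (!StdIn.isEmpty())
      {
         String word = StdIn.readString();
         if (set.contains(word))              // print words in list
            StdOut.println(word);
      }
   }
}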
Exception filter: Java implementation
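public class BlackList
{
   public static void main(String[] args)
   {
      SET<String> set = new SET<String>();    // create empty set of strings

      In in = new In(args[0]);
      while (!in.isEmpty())                   // read in blacklist
         set.add(in.readString());

      while (!StdIn.isEmpty())
      {
         String word = StdIn.readString();
         if (!set.contains(word))             // print words not in list
            StdOut.println(word);
      }
   }
}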
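The two filters differ only in the sense of the membership test. The sketch below shows both with the standard java.util.HashSet in place of the course's SET, In, and StdIn classes; the class name, the method signature, and the in-memory word list are assumptions made so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: whitelist and blacklist filtering with a hash-based set.
class ExceptionFilterSketch {
    // keepListed = true behaves like a whitelist, false like a blacklist.
    static List<String> filter(Set<String> list, List<String> words, boolean keepListed) {
        List<String> out = new ArrayList<>();
        for (String word : words)
            if (list.contains(word) == keepListed)
                out.add(word);
        return out;
    }

    public static void main(String[] args) {
        Set<String> list = new HashSet<>(Arrays.asList("was", "it", "the", "of"));
        List<String> words = Arrays.asList("it", "was", "the", "best", "of", "times");
        System.out.println(filter(list, words, true));    // words in the list
        System.out.println(filter(list, words, false));   // words not in the list
    }
}
```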
Dictionary lookup
Command-line arguments.
• A comma-separated value (CSV) file.
• Key field.
• Value field.

Ex 1. DNS lookup.

% more ip.csv
www.princeton.edu,128.112.128.15
www.cs.princeton.edu,128.112.136.35
www.math.princeton.edu,128.112.18.11
www.cs.harvard.edu,140.247.50.127
www.harvard.edu,128.103.60.24
www.yale.edu,130.132.51.8
www.econ.yale.edu,128.36.236.74
www.cs.yale.edu,128.36.229.30
espn.com,199.181.135.201
yahoo.com,66.94.234.13
msn.com,207.68.172.246
google.com,64.233.167.99
baidu.com,202.108.22.33
yahoo.co.jp,202.93.91.141
sina.com.cn,202.108.33.32
ebay.com,66.135.192.87
adobe.com,192.150.18.60
163.com,220.181.29.154
passport.net,65.54.179.226
tom.com,61.135.158.237
nate.com,203.226.253.11
cnn.com,64.236.16.20
daum.net,211.115.77.211
blogger.com,66.102.15.100
fastclick.com,205.180.86.4
wikipedia.org,66.230.200.100
rakuten.co.jp,202.72.51.22
...

% java LookupCSV ip.csv 0 1          (URL is key, IP is value)
adobe.com
192.150.18.60
www.princeton.edu
128.112.128.15
ebay.edu
Not found

% java LookupCSV ip.csv 1 0          (IP is key, URL is value)
128.112.128.15
www.princeton.edu
999.999.999.99
Not found
Dictionary lookup
Ex 2. Amino acids.

amino.csv (excerpt):
TAC,Tyr,Y,Tyrosine
TAA,Stop,Stop,Stop
TAG,Stop,Stop,Stop
TGT,Cys,C,Cysteine
TGC,Cys,C,Cysteine
TGA,Stop,Stop,Stop
TGG,Trp,W,Tryptophan
CTT,Leu,L,Leucine
CTC,Leu,L,Leucine
CTA,Leu,L,Leucine
CTG,Leu,L,Leucine
CCT,Pro,P,Proline
CCC,Pro,P,Proline
CCA,Pro,P,Proline
CCG,Pro,P,Proline
CAT,His,H,Histidine
CAC,His,H,Histidine
CAA,Gln,Q,Glutamine
CAG,Gln,Q,Glutamine
CGT,Arg,R,Arginine
CGC,Arg,R,Arginine
...

% java LookupCSV amino.csv 0 3       (codon is key, name is value)
ACT
Threonine
TAG
Stop
CAT
Histidine
Dictionary lookup
Command-line arguments.
• A comma-separated value (CSV) file.
• Key field.
• Value field.

Ex 3. Class list.

% more classlist.csv
13,Berl,Ethan Michael,P01,eberl
11,Bourque,Alexander Joseph,P01,abourque
12,Cao,Phillips Minghua,P01,pcao
11,Chehoud,Christel,P01,cchehoud
10,Douglas,Malia Morioka,P01,malia
12,Haddock,Sara Lynn,P01,shaddock
12,Hantman,Nicole Samantha,P01,nhantman
11,Hesterberg,Adam Classen,P01,ahesterb
13,Hwang,Roland Lee,P01,rhwang
13,Hyde,Gregory Thomas,P01,ghyde
13,Kim,Hyunmoon,P01,hktwo
11,Kleinfeld,Ivan Maximillian,P01,ikleinfe
12,Korac,Damjan,P01,dkorac
11,MacDonald,Graham David,P01,gmacdona
10,Michal,Brian Thomas,P01,bmichal
12,Nam,Seung Hyeon,P01,seungnam
11,Nastasescu,Maria Monica,P01,mnastase
11,Pan,Di,P01,dpan
12,Partridge,Brenton Alan,P01,bpartrid
13,Rilee,Alexander,P01,arilee
13,Roopakalu,Ajay,P01,aroopaka
11,Sheng,Ben C,P01,bsheng
12,Webb,Natalie Sue,P01,nwebb
...

% java LookupCSV classlist.csv 4 1   (login is key, first name is value)
eberl
Ethan
nwebb
Natalie

% java LookupCSV classlist.csv 4 3   (login is key, precept is value)
dpan
P01
Dictionary lookup: Java implementation
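public class LookupCSV
{
   public static void main(String[] args)
   {
      // ... build symbol table st from the CSV file (not preserved in the source)

      while (!StdIn.isEmpty())                               // process lookups
      {                                                      // with standard I/O
         String s = StdIn.readString();
         if (!st.contains(s)) StdOut.println("Not found");
         else                 StdOut.println(st.get(s));
      }
   }
}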
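The table-building step is not shown in the fragment. A plausible version, sketched below with java.util.HashMap, splits each line on commas and stores the chosen key and value fields; the class name, the method names, and the in-memory list of lines are assumptions made so the example is self-contained.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: dictionary lookup over CSV lines held in memory.
class LookupCsvSketch {
    // Build a key -> value table from the given fields of each CSV line.
    static Map<String, String> build(List<String> lines, int keyField, int valField) {
        Map<String, String> st = new HashMap<>();
        for (String line : lines) {
            String[] tokens = line.split(",");
            st.put(tokens[keyField], tokens[valField]);
        }
        return st;
    }

    static String lookup(Map<String, String> st, String key) {
        return st.getOrDefault(key, "Not found");
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "www.princeton.edu,128.112.128.15",
            "adobe.com,192.150.18.60");
        Map<String, String> byName = build(lines, 0, 1);   // URL is key, IP is value
        Map<String, String> byIp   = build(lines, 1, 0);   // IP is key, URL is value
        System.out.println(lookup(byName, "adobe.com"));
        System.out.println(lookup(byIp, "999.999.999.99"));
    }
}
```

Swapping the two field indices reverses the direction of the lookup without any other change.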
File indexing
File indexing
% ls *.txt
aesop.txt magna.txt moby.txt
sawyer.txt tale.txt

% java FileIndex *.txt
freedom
magna.txt moby.txt tale.txt
whale
moby.txt
lamb
sawyer.txt aesop.txt

% ls *.java
BlackList.java Concordance.java
DeDup.java FileIndex.java ST.java
SET.java WhiteList.java

% java FileIndex *.java
import
FileIndex.java SET.java ST.java
Comparator
null
Solution. Key = query string; value = set of files containing that string.
File indexing
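public class FileIndex
{
   public static void main(String[] args)
   {
      // ... build symbol table st from the files (not preserved in the source)

      while (!StdIn.isEmpty())                    // process queries
      {
         String query = StdIn.readString();
         StdOut.println(st.get(query));
      }
   }
}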
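The index-building step follows from the stated solution: key = query string, value = set of files containing it. The sketch below builds such an inverted index with standard collections; documents are passed as an in-memory map from file name to text, which is an assumption made to avoid file I/O, and the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch: inverted index from word to the set of file names containing it.
class FileIndexSketch {
    static Map<String, Set<String>> build(Map<String, String> docs) {
        Map<String, Set<String>> st = new HashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                if (word.isEmpty()) continue;   // skip the empty token from leading whitespace
                st.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return st;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new HashMap<>();
        docs.put("a.txt", "the whale");   // hypothetical file contents
        docs.put("b.txt", "the lamb");
        Map<String, Set<String>> st = build(docs);
        System.out.println(st.get("the"));
        System.out.println(st.get("Comparator"));   // null when no file contains the word
    }
}
```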
Book index
Keyword-in-context search
Goal. Preprocess a text to support KWIC queries: given a word, find all
occurrences with their immediate contexts.
majesty
their turnkeys and the *majesty* of the law fired
me treason against the *majesty* of the people in
of his most gracious *majesty* king george the third
princeton
no matches
KWIC
      // ... read the text into words[] and build the index st (not preserved in the source)

      while (!StdIn.isEmpty())
      {
         String query = StdIn.readString();      // process queries
         SET<Integer> set = st.get(query);       // and print matches
         for (int k : set)
         {
            // print words[k-5] to words[k+5]
         }
      }
   }
}
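A self-contained sketch of the same idea: index each word's positions, then print a window of words around every occurrence. The class name, the radius parameter, and the explicit handling of a missing key are assumptions added for illustration; the fragment above does not show how a query with no matches is handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: keyword-in-context search over an array of words.
class KwicSketch {
    // Map each word to the list of positions where it occurs.
    static Map<String, List<Integer>> index(String[] words) {
        Map<String, List<Integer>> st = new HashMap<>();
        for (int k = 0; k < words.length; k++)
            st.computeIfAbsent(words[k], w -> new ArrayList<>()).add(k);
        return st;
    }

    // Return words[k-radius .. k+radius] (clipped to the array) for each occurrence k.
    static List<String> contexts(String[] words, Map<String, List<Integer>> st,
                                 String query, int radius) {
        List<String> out = new ArrayList<>();
        List<Integer> positions = st.get(query);
        if (positions == null) return out;   // no matches
        for (int k : positions) {
            int lo = Math.max(0, k - radius);
            int hi = Math.min(words.length - 1, k + radius);
            out.add(String.join(" ", Arrays.copyOfRange(words, lo, hi + 1)));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] words = "it was the best of times it was the worst of times".split(" ");
        for (String line : contexts(words, index(words), "best", 2))
            System.out.println(line);
    }
}
```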
Matrix-vector multiplication (standard implementation)
...
double[][] a = new double[N][N];
double[] x = new double[N];
double[] b = new double[N];
...
// initialize a[][] and x[]
...
for (int i = 0; i < N; i++)          // nested loops (N^2 running time)
{
   double sum = 0.0;
   for (int j = 0; j < N; j++)
      sum += a[i][j]*x[j];
   b[i] = sum;
}
Sparse matrix-vector multiplication
A * x = b
Vector representations
[Figure: a 20-entry vector (indices 0 to 19) in two representations: a standard array, and a symbol table st of key-value pairs.]
Sparse vector data type
[Figure: a 5-by-5 matrix stored as a 2D array a and as an array of independent symbol-table objects, one sparse vector st per row; the entry a[4][2] is marked.]

row   2D array a[i][0..4]          sparse vector st (key value pairs)
0     0.0  .90  0.0  0.0  0.0      1 .90
1     0.0  0.0  .36  .36  .18      2 .36   3 .36   4 .18
2     0.0  0.0  0.0  .90  0.0      3 .90
3     .90  0.0  0.0  0.0  0.0      0 .90
4     .45  0.0  .45  0.0  0.0      0 .45   2 .45
Matrix-vector multiplication
...
SparseVector[] a = new SparseVector[N];
double[] x = new double[N];
double[] b = new double[N];
...
// initialize a[] and x[]
...
for (int i = 0; i < N; i++)          // linear running time
   b[i] = a[i].dot(x);               // for sparse matrix
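The dot() call above is where the savings come from: it visits only the nonzero entries of a row. The sketch below is a minimal sparse vector backed by java.util.HashMap; the class name and its methods are hypothetical stand-ins for the SparseVector type, whose definition is not preserved here.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: sparse vector stored as index -> nonzero value.
class SparseVectorSketch {
    private final Map<Integer, Double> st = new HashMap<>();

    void put(int i, double value) {
        if (value == 0.0) st.remove(i);   // zeros are never stored
        else              st.put(i, value);
    }

    double get(int i) {
        return st.getOrDefault(i, 0.0);
    }

    // Dot product with a dense vector: time proportional to the number of nonzeros.
    double dot(double[] that) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : st.entrySet())
            sum += e.getValue() * that[e.getKey()];
        return sum;
    }

    public static void main(String[] args) {
        SparseVectorSketch row = new SparseVectorSketch();
        row.put(2, .36);
        row.put(3, .36);
        row.put(4, .18);
        double[] x = { 1, 2, 3, 4, 5 };
        System.out.println(row.dot(x));
    }
}
```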
‣ sets
‣ dictionary clients
‣ indexing clients
‣ sparse vectors
‣ challenges
‣ geometric applications
2d tree
[Figure: 2d tree on 10 points (numbered 1 to 10), shown as a subdivision of the plane and as the corresponding tree.]
2d tree implementation
[Figure: a node p splits the plane by x-coordinate into points left of p and points right of p; a node q splits by y-coordinate into points below q and points above q.]
2d tree: 2d orthogonal range search
2d tree: nearest neighbor search
Nearest neighbor search. Given a query point, find the closest point.
• Check distance from point in node to query point.
• Recursively search left/top subdivision (if it could contain a closer point).
• Recursively search right/bottom subdivision (if it could contain a closer point).
• Organize recursive method so that it begins by searching for query point.
closest point = 5
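The steps listed above can be sketched as follows. This is a minimal 2d tree that alternates between x and y as the splitting coordinate and prunes a subtree when the splitting line is farther away than the best point found so far; the class and method names are hypothetical.

```java
// Sketch: 2d tree with insert and nearest-neighbor search.
class TwoDTreeSketch {
    private static class Node {
        final double x, y;
        Node left, right;
        Node(double x, double y) { this.x = x; this.y = y; }
    }

    private Node root;

    void insert(double x, double y) {
        root = insert(root, x, y, true);
    }

    private Node insert(Node h, double x, double y, boolean splitOnX) {
        if (h == null) return new Node(x, y);
        double cmp = splitOnX ? x - h.x : y - h.y;
        if (cmp < 0) h.left  = insert(h.left,  x, y, !splitOnX);
        else         h.right = insert(h.right, x, y, !splitOnX);
        return h;
    }

    // Returns {x, y} of the stored point closest to the query, or null if the tree is empty.
    double[] nearest(double qx, double qy) {
        if (root == null) return null;
        Node best = nearest(root, qx, qy, true, root);
        return new double[] { best.x, best.y };
    }

    private static double dist2(Node n, double qx, double qy) {
        double dx = n.x - qx, dy = n.y - qy;
        return dx * dx + dy * dy;   // squared distance is enough for comparisons
    }

    private Node nearest(Node h, double qx, double qy, boolean splitOnX, Node best) {
        if (h == null) return best;
        if (dist2(h, qx, qy) < dist2(best, qx, qy)) best = h;
        double delta = splitOnX ? qx - h.x : qy - h.y;   // signed distance to the splitting line
        Node near = delta < 0 ? h.left : h.right;
        Node far  = delta < 0 ? h.right : h.left;
        best = nearest(near, qx, qy, !splitOnX, best);   // side containing the query first
        if (delta * delta < dist2(best, qx, qy))         // far side could still hold a closer point
            best = nearest(far, qx, qy, !splitOnX, best);
        return best;
    }
}
```

Searching the side that contains the query point first tends to shrink the best distance early, so more of the far subtrees can be skipped.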
Kd tree
[Figure: a node p at a level ≡ i (mod k) splits on the ith coordinate: one subtree holds the points whose ith coordinate is less than p's, the other the points whose ith coordinate is greater than p's.]
Orthogonal line segment intersection search
Simple version. 2d, all objects are horizontal or vertical line segments.
Brute force. Test all Θ(N²) pairs of line segments for intersection.
Orthogonal line segment intersection search: sweep-line algorithm
[Figure: sweep line; the current set of y-coordinates is 0, 1, 2, 3.]
Orthogonal line segment intersection search: sweep-line algorithm
[Figure: sweep line; the current set of y-coordinates is 0, 1, 3.]
Orthogonal line segment intersection search: sweep-line algorithm
[Figure: sweep line at a vertical segment; a 1d range search is performed over the current y-coordinates.]
Orthogonal line segment intersection search: sweep-line algorithm
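A sketch of the standard sweep-line formulation, stated here as an assumption because the slide text is not preserved: sweep a vertical line from left to right, insert a horizontal segment's y-coordinate at its left endpoint, remove it at its right endpoint, and run a 1d range search at each vertical segment. The class name, the int-array encoding of segments, and the requirement that horizontal segments have distinct y-coordinates are choices made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch: count intersections between horizontal and vertical segments.
class SweepLineSketch {
    // Horizontal segment: {x1, x2, y} with x1 <= x2. Vertical segment: {x, y1, y2} with y1 <= y2.
    // Assumes all horizontal segments have distinct y-coordinates.
    static int countIntersections(int[][] horizontals, int[][] verticals) {
        // Event: {x, type, a, b}; type 0 = insert y = a, 1 = range query [a, b], 2 = remove y = a.
        List<int[]> events = new ArrayList<>();
        for (int[] h : horizontals) {
            events.add(new int[] { h[0], 0, h[2], 0 });
            events.add(new int[] { h[1], 2, h[2], 0 });
        }
        for (int[] v : verticals)
            events.add(new int[] { v[0], 1, v[1], v[2] });

        // Sort by x; at equal x, insert before query before remove, so touching segments count.
        events.sort((e, f) -> e[0] != f[0] ? Integer.compare(e[0], f[0])
                                           : Integer.compare(e[1], f[1]));

        TreeSet<Integer> ys = new TreeSet<>();   // y-coordinates crossed by the sweep line
        int count = 0;
        for (int[] e : events) {
            if (e[1] == 0)      ys.add(e[2]);
            else if (e[1] == 2) ys.remove(e[2]);
            else                count += ys.subSet(e[2], true, e[3], true).size();   // 1d range search
        }
        return count;
    }

    public static void main(String[] args) {
        int[][] hs = { { 0, 10, 1 }, { 2, 6, 3 }, { 5, 9, 5 } };
        int[][] vs = { { 4, 0, 4 }, { 8, 2, 6 } };
        System.out.println(countIntersections(hs, vs));
    }
}
```

Sorting the events dominates the setup cost; each event then costs one ordered-set operation plus the size of its range-search result.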
‣ sets
‣ dictionary clients
‣ indexing clients
‣ sparse vectors
‣ challenges
Searching challenge 1A
Searching challenge 1B
Searching challenge 2A
Searching challenge 2B
Searching challenge 3
Searching challenge 4
Searching challenge 5
Searching challenge 6