Adapted from lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Prasad, L05: Tolerant IR
This lecture

Tolerant retrieval:
- Wild-card queries
- Spelling correction
- Soundex

Motivation:
- For expressiveness, to accommodate variants (e.g., automat*)
- To deal with incomplete information about spelling or multiple spellings (e.g., S*dney)
Sec. 3.1
Dictionary data structures: in what data structure do we store the term vocabulary, document frequencies, and pointers to each postings list?
A naïve dictionary

An array of structs:
- char[20]     (20 bytes)
- int          (4/8 bytes)
- Postings *   (4/8 bytes)

How do we store a dictionary in memory efficiently? How do we quickly look up elements at query time?
Dictionary data structures: the two main choices are hashtables and trees.
Hashtables

(We assume you've seen hashtables before)

Pros:
- Lookup is faster than for a tree: O(1)

Cons:
- No easy way to find minor variants: judgment/judgement
- No prefix search [tolerant retrieval]
- If the vocabulary keeps growing, we occasionally need to do the expensive operation of rehashing everything
[Figure: binary search tree over the dictionary, with leaf ranges a-hu, hy-m, n-sh, si-z]
Tree: B-tree

[Figure: B-tree root with child ranges a-hu, hy-m, n-z]

Definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate natural numbers, e.g., [2, 4].
Trees

- Simplest: binary tree
- More usual: B-trees
- Trees require a standard ordering of characters and hence strings, but we typically have one

Pros:
- Solves the prefix problem (terms starting with hyp)

Cons:
- Slower: O(log M) [and this requires a balanced tree]
- Rebalancing binary trees is expensive
Wild-card queries
Wild-card queries: *

mon*: find all docs containing any word beginning with "mon".
- Hashing is unsuitable because order is not preserved
- Easy with a binary tree (or B-tree) lexicon: retrieve all words w in the range mon ≤ w < moo

Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?
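The range retrieval above can be sketched with a sorted in-memory lexicon; the toy lexicon and helper name below are illustrative, not from the slides:

```python
import bisect

def prefix_range(lexicon, prefix):
    """Return all terms w in a sorted lexicon with prefix <= w < successor,
    e.g. mon <= w < moo for the query mon*."""
    lo = bisect.bisect_left(lexicon, prefix)
    # Upper bound: the prefix with its last character bumped by one (mon -> moo).
    hi = bisect.bisect_left(lexicon, prefix[:-1] + chr(ord(prefix[-1]) + 1))
    return lexicon[lo:hi]

lexicon = sorted(["media", "mole", "monday", "money", "month", "moon", "more"])
print(prefix_range(lexicon, "mon"))  # ['monday', 'money', 'month']
```

For the pro*cent exercise, one workable answer is to take the pro* range and keep only the terms that also end in cent.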
Query processing

At this point, we have an enumeration of all terms in the dictionary that match the wild-card query. We still have to look up the postings for each enumerated term. E.g., consider the query:

se*ate AND fil*er

This may result in the execution of many Boolean AND queries.
B-trees handle *'s at the end of a query term. How can we handle *'s in the middle of a query term? (Especially multiple *'s.)

Consider co*tion:
- We could look up co* AND *tion in a B-tree and intersect the two term sets
- Expensive

The solution: transform every wild-card query so that the *'s occur at the end. This gives rise to the Permuterm Index.
Permuterm index

Index every rotation of term$, where $ is a special end-of-term symbol; each rotation points back to the term.

Queries:
- X: lookup on X$
- X*: lookup on $X*
- *X: lookup on X$*
- *X*: lookup on X*
- X*Y: lookup on Y$X*

Example: query hel*o. X = hel, Y = o, so look up o$hel*.

Rotate the query wild-card to the right, then use B-tree lookup as before.
Permuterm problem: it quadruples lexicon size (an empirical observation for English).
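A minimal sketch of a permuterm index in code, on a toy lexicon (the helper names are illustrative). The rotations of hello$ are hello$, ello$h, llo$he, lo$hel, o$hell, $hello:

```python
def rotations(term):
    """All rotations of term$; each points back to the original term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(lexicon):
    return {rot: term for term in lexicon for rot in rotations(term)}

def permuterm_lookup(index, query):
    """X*Y: rotate the query so the * is at the end, i.e. look up Y$X*."""
    x, y = query.split("*")
    key = y + "$" + x  # rotated query with the wild-card moved to the end
    return sorted({term for rot, term in index.items() if rot.startswith(key)})

idx = build_permuterm(["hello", "help", "halo"])
print(permuterm_lookup(idx, "hel*o"))  # ['hello']
```

A production index would store the rotations in a B-tree so that the final prefix lookup (o$hel*) is a range scan rather than a full scan.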
Bigram (k-gram) indexes

Enumerate all k-grams (sequences of k chars) occurring in any term. E.g., from the text "April is the cruelest month" we get the 2-grams (bigrams):

$a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$

$ is a special word-boundary symbol. Maintain a second inverted index from bigrams to dictionary terms that match each bigram.
Processing wild-cards with a bigram index

The query mon* can now be run as the Boolean query: $m AND mo AND on

- This gets terms that match the AND version of our wild-card query.
- But we'd incorrectly enumerate moon as well.
- Must post-filter these terms against the query.
- Surviving enumerated terms are then looked up in the original term-document inverted index.
- Fast, space efficient (compared to permuterm).
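The whole pipeline (bigram extraction, AND-style intersection, post-filter) can be sketched as follows; the toy lexicon and function names are illustrative:

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def bigrams(s):
    """Bigrams of $s$, where $ marks the word boundary."""
    padded = "$" + s + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def build_bigram_index(lexicon):
    index = defaultdict(set)
    for term in lexicon:
        for bg in bigrams(term):
            index[bg].add(term)
    return index

def wildcard_lookup(index, query):
    # Keep only the query bigrams that contain no *.
    grams = [g for g in bigrams(query) if "*" not in g]
    candidates = set.intersection(*(index[g] for g in grams))
    # Post-filter: moon matches $m AND mo AND on but not the pattern mon*.
    return sorted(t for t in candidates if fnmatchcase(t, query))

idx = build_bigram_index(["moon", "month", "monday", "melon"])
print(wildcard_lookup(idx, "mon*"))  # ['monday', 'month']
```

Without the fnmatchcase post-filter, moon would survive the intersection and be returned incorrectly.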
As before, we must execute a Boolean query for each enumerated, filtered term. Wild-cards can result in expensive query execution (very large disjunctions). Search interfaces therefore often advise: "Type your search terms, use * if you need to. E.g., Alex* will match Alexander."
Advanced features

- Avoiding UI clutter is one reason to hide advanced features behind an "Advanced Search" button
- It also deters most users from unnecessarily hitting the engine with fancy queries
Spelling correction
Spell correction

Two principal uses:
- Correcting document(s) being indexed
- Retrieving matching documents when the query contains a spelling error

Two main flavors:
- Isolated word: check each word on its own for misspelling; will not catch typos that result in correctly spelled words, e.g., from typed as form
- Context-sensitive: look at surrounding words, e.g., "I flew form Heathrow to Narita."
Document correction

- Especially needed for OCR'ed documents, where correction must be tuned to the error source: e.g., OCR can confuse O and D more often than it would confuse O and I (O and I are adjacent on the QWERTY keyboard, so they are more likely interchanged in typing)
- But also: web pages and even printed material have typos
- Goal: the dictionary contains fewer misspellings
- But often we don't change the documents and instead fix the query-document mapping
Query mis-spellings

E.g., the query Alanis Morisett. We can either:
- Retrieve documents indexed by the correct spelling, OR
- Return several suggested alternative queries with the correct spelling
Isolated word correction

Fundamental premise: there is a lexicon from which the correct spellings come. Two basic choices for this:
- A standard lexicon, such as Webster's English Dictionary or a hand-maintained industry-specific lexicon
- The lexicon of the indexed corpus: e.g., all words on the web, all names, acronyms, etc. (including the mis-spellings)
Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. What's "closest"? We'll study several alternatives:
- Edit distance
- Weighted edit distance
- n-gram overlap
Edit distance

Given two strings S1 and S2, the edit distance is the minimum number of operations to convert one into the other. Basic operations are typically character-level:
- Insert, Delete, Replace, (Transposition)

From cat to act is 2 (just 1 with transpose); from cat to dog is 3.

Generally found by dynamic programming. See http://www.merriampark.com/ld.htm for a nice example plus an applet.
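The dynamic program can be sketched as follows (plain Levenshtein distance with insert, delete, and replace; the transposition variant is left out):

```python
def edit_distance(s1, s2):
    """Minimum number of insert/delete/replace operations turning s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # replace (or match)
    return dp[m][n]

print(edit_distance("cat", "act"))  # 2
print(edit_distance("cat", "dog"))  # 3
```

A weighted variant simply replaces the unit costs with character-dependent weights.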
Weighted edit distance

- The weight of an operation depends on the character(s) involved
- Meant to capture OCR or keyboard errors, e.g., m is more likely to be mis-typed as n than as q
- Therefore, replacing m by n is a smaller edit distance than replacing it by q
- This may be formulated as a probability model
Using edit distances

- Given a query, first enumerate all character sequences within a preset (weighted) edit distance (e.g., 2)
- Intersect this set with the list of correct words
- Show the terms you found to the user as suggestions

Alternatively:
- We can look up all possible corrections in our inverted index and return all docs ... slow
- We can run with a single most likely correction

The alternatives disempower the user, but save a round of interaction with the user.
Sec. 3.3.4
Edit distance to all dictionary terms?

Given a (mis-spelled) query, do we compute its edit distance to every dictionary term? That would be expensive and slow.

How do we cut the set of candidate dictionary terms?
- One possibility is to use n-gram overlap for this
- This can also be used by itself for spelling correction.
n-gram overlap

- Enumerate all the n-grams in the query string as well as in the lexicon
- Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams
- Threshold by the number of matching n-grams
Example with trigrams

- Suppose the text is november: trigrams are nov, ove, vem, emb, mbe, ber.
- The query is december: trigrams are dec, ece, cem, emb, mbe, ber.
- So 3 trigrams overlap (of 6 in each term). How can we turn this into a normalized measure of overlap?
One option: Jaccard coefficient

A commonly-used measure of overlap. Let X and Y be two sets; then the Jaccard coefficient is

|X ∩ Y| / |X ∪ Y|

- Equals 1 when X and Y have the same elements, and zero when they are disjoint
- X and Y don't have to be of the same size
- Always assigns a number between 0 and 1

Now threshold to decide if you have a match: e.g., if J.C. > 0.8, declare a match.
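Combining the trigram example with the Jaccard coefficient, a small sketch (function names are illustrative):

```python
def trigrams(term):
    return {term[i:i + 3] for i in range(len(term) - 2)}

def jaccard(x, y):
    """|X intersect Y| / |X union Y|: 1 for identical sets, 0 for disjoint ones."""
    return len(x & y) / len(x | y)

nov, dec = trigrams("november"), trigrams("december")
print(sorted(nov & dec))            # ['ber', 'emb', 'mbe']
print(round(jaccard(nov, dec), 3))  # 3 shared of 9 distinct trigrams -> 0.333
```

Note 0.333 is well below a threshold like 0.8, so november would not be suggested as a correction of december.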
Matching bigrams

Consider the query lord: we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)

lo → alone, lord, sloth
or → border, lord, morbid
rd → ardent, border, card
Matching bigrams (postings merge)

Consider the query lord: we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)

lo → alone, lore, sloth
or → border, lore, morbid
rd → ardent, border, card

A standard postings merge will enumerate these candidates. Adapt this to use the Jaccard (or another) measure.
Context-sensitive spell correction

Text: "I flew from Heathrow to Narita."
Consider the phrase query "flew form Heathrow".
We'd like to respond: Did you mean "flew from Heathrow"? (No docs matched the original query phrase.)
Context-sensitive correction

- First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
- Now try all possible resulting phrases with one word "fixed" at a time
- Hit-based spelling correction: suggest the alternative that has lots of hits.
Sec. 3.3.5
General issues in spell correction

We enumerate multiple alternatives for "Did you mean?" and need to figure out which to present to the user, e.g.:
- The alternative hitting the most docs
- Query log analysis

More formally, pick argmax_corr P(corr | query). From Bayes' rule, this is equivalent to

argmax_corr P(query | corr) * P(corr)

where P(query | corr) is the noisy channel and P(corr) is the language model.
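The argmax can be sketched with toy numbers; the probabilities below are invented purely for illustration, not estimates from any real query log or channel model:

```python
# P(corr): language model - prior probability of each candidate correction.
p_corr = {"flew from heathrow": 8e-4, "fled form heathrow": 1e-5}

# P(query | corr): noisy channel - probability the user typed the observed
# query "flew form heathrow" while intending each candidate
# (in practice estimated from a typo/edit model).
p_query_given_corr = {"flew from heathrow": 0.10, "fled form heathrow": 0.05}

def best_correction(candidates):
    # argmax_corr P(query | corr) * P(corr), from Bayes' rule
    return max(candidates, key=lambda c: p_query_given_corr[c] * p_corr[c])

print(best_correction(p_corr))  # flew from heathrow
```

The language-model prior is what lets the common phrase win even when the channel probabilities alone are close.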
Computational cost

- Spell-correction is computationally expensive
- Avoid running routinely on every query?
- Run only on queries that matched few docs
Thesauri
Query expansion

Docs frequently contain equivalences, but expansion can go wrong: expanding puma to jaguar retrieves documents on cars instead of on sneakers.
Soundex
Soundex

A class of heuristics to expand a query into phonetic equivalents; language specific; mainly used for names.
Soundex: typical algorithm

- Turn every token to be indexed into a 4-character reduced form
- Do the same with query terms
- Build and search an index on the reduced forms

(See e.g. http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top)
Soundex algorithm

1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): A, E, I, O, U, H, W, Y.
3. Change letters to digits as follows: B, F, P, V → 1; C, G, J, K, Q, S, X, Z → 2; D, T → 3; L → 4; M, N → 5; R → 6.

Soundex continued

4. Remove one digit from each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.

E.g., Herman becomes H655.
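The soundex reduction can be sketched directly in code, assuming the standard letter-to-digit mapping (B,F,P,V → 1; C,G,J,K,Q,S,X,Z → 2; D,T → 3; L → 4; M,N → 5; R → 6; vowels and H, W, Y → 0). Edge cases of full Soundex, such as a second letter sharing the first letter's code, are ignored here:

```python
SOUNDEX_MAP = {**dict.fromkeys("AEIOUHWY", "0"), **dict.fromkeys("BFPV", "1"),
               **dict.fromkeys("CGJKQSXZ", "2"), **dict.fromkeys("DT", "3"),
               "L": "4", **dict.fromkeys("MN", "5"), "R": "6"}

def soundex(name):
    name = name.upper()
    # Keep the first letter, map the rest of the letters to digits.
    coded = name[0] + "".join(SOUNDEX_MAP[c] for c in name[1:] if c.isalpha())
    # Keep one digit from each run of consecutive identical digits.
    collapsed = coded[0]
    for ch in coded[1:]:
        if ch != collapsed[-1]:
            collapsed += ch
    # Drop zeros, pad with trailing zeros, keep four positions.
    return (collapsed[0] + collapsed[1:].replace("0", "") + "000")[:4]

print(soundex("Herman"))  # H655
```

Herman maps to H06505, which collapses to H655 after zero removal, matching the worked example.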
Exercise

Using the algorithm described above, find the soundex code for your name. Do you know someone who spells their name differently from you, but whose name yields the same soundex code?
Sec. 3.4
Soundex

- Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, ...)
- How useful is soundex? Not very, for information retrieval
- Okay for "high recall" tasks (e.g., Interpol), though biased to names of certain nationalities
- Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR
Language detection

- For docs/paragraphs: at indexing time
- For query terms: at query time; much harder

- For docs/paragraphs, we generally have enough text to apply machine learning methods
- For queries, we lack sufficient text
  - Augment with other cues, such as client properties/specification from the application
  - Domain of query origination, etc.
We have

- Basic inverted index with skip pointers
- Wild-card index
- Spell-correction
- Soundex
Aside: results caching

If 25% of your users are searching for "britney AND spears", then you probably do need spelling correction, but you don't need to keep on intersecting those two postings lists. The web query distribution is extremely skewed, and you can usefully cache results for common queries.
B-Trees
Motivation

- Index structures for large datasets cannot be stored in main memory
- Storing them on disk requires a different approach to efficiency
- Assuming that a disk spins at 3600 RPM, one revolution occurs in 1/60 of a second, or 16.7 ms
- Crudely speaking, one disk access takes about the same time as 200,000 instructions
Motivation (cont.)

- Assume that we use an AVL tree to store about 20 million records
- We end up with a very deep binary tree with lots of different disk accesses; log2 20,000,000 is about 24, so this takes about 0.2 seconds
- We know we can't improve on the log n lower bound on search for a binary tree
- But the solution is to use more branches and thus reduce the height of the tree!
Definition of a B-tree

A B-tree of order m is an m-way tree (i.e., a tree where each node may have up to m children) in which:
1. the number of keys in each non-leaf node is one less than the number of its children, and these keys partition the keys in the children in the fashion of a search tree
2. all leaves are on the same level
3. all non-leaf nodes except the root have at least ⌈m/2⌉ children
4. the root is either a leaf node, or it has from two to m children
5. a leaf node contains no more than m - 1 keys
An example B-tree

[Figure: an example B-tree whose keys include 6, 12, 26, 27, 29, 45, 46, 48, 53, 55, 60, 64, 70, 90]
Constructing a B-tree

- Suppose we start with an empty B-tree and keys arrive in the following order: 1 12 8 2 25 5 14 28 17 7 52 16 48 68 3 26 29 53 55 45
- We want to construct a B-tree of order 5
- The first four items go into the root: [1 2 8 12]
- To put the fifth item in the root would violate condition 5
- Therefore, when 25 arrives, pick the middle key to make a new root
[Figures, animated in the original slides: the intermediate B-tree states as the remaining keys arrive, showing how full nodes split and their middle keys are promoted into the parent]
Exercise in inserting a B-tree

Insert the following keys into a 5-way B-tree: 3, 7, 9, 23, 45, 1, 5, 14, 25, 24, 13, 11, 8, 19, 4, 31, 35, 56
Removal from a B-tree

During insertion, a key always goes into a leaf. For deletion there are two basic cases:
1. if the key is already in a leaf node, and removing it doesn't cause that leaf node to have too few keys, simply remove the key
2. if the key is not in a leaf, then its predecessor or successor is guaranteed to be in a leaf; delete the key and promote that predecessor or successor into its position

If (1) or (2) lead to a leaf node containing less than the minimum number of keys, then we have to look at the siblings immediately adjacent to the leaf in question:
3. if one of them has more than the minimum number of keys, then we can promote one of its keys to the parent and take the parent key into our lacking leaf
4. if neither of them has more than the minimum number of keys, then the lacking leaf and one of its neighbours can be combined with their shared parent (the opposite of promoting a key) and the new leaf will have the correct number of keys; if this step leaves the parent with too few keys, then we repeat the process up to the root itself, if required
Worked deletion examples

[Figures, animated in the original slides: a 5-way B-tree with root keys 12, 29, 52 and leaves including 15 22, 31 43, and 56 69 72]

- Delete 2: since there are enough keys in the node, just delete it.
- Delete 72: too few keys! Leaves are joined back together with a parent key.
- Delete 22: keys are redistributed, leaving nodes 15 29 and 43 56 69.
Exercise in removal from a B-tree

- Given the 5-way B-tree created by these data (last exercise): 3, 7, 9, 23, 45, 1, 5, 14, 25, 24, 13, 11, 8, 19, 4, 31, 35, 56
- Add these further keys: 2, 6, 12
- Delete these keys: 4, 5, 7, 3, 14
Analysis of B-Trees

- The maximum number of items in a B-tree of order m and height h is obtained when every node is full: the root holds m - 1 keys, its m children hold m(m - 1), and so on, with m^h(m - 1) at depth h
- So the total number of items is
  (1 + m + m^2 + m^3 + ... + m^h)(m - 1) = [(m^(h+1) - 1)/(m - 1)](m - 1) = m^(h+1) - 1
- When m = 5 and h = 2 this gives 5^3 - 1 = 124
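The closed form can be checked numerically; a quick sketch (the function name is illustrative):

```python
def btree_capacity(m, h):
    """Maximum number of keys in a full B-tree of order m and height h:
    level i contributes m**i nodes with (m - 1) keys each."""
    total = sum((m ** level) * (m - 1) for level in range(h + 1))
    assert total == m ** (h + 1) - 1  # the telescoped closed form
    return total

print(btree_capacity(5, 2))    # 124, as on the slide
print(btree_capacity(101, 3))  # 104060400, i.e. 101**4 - 1 (about 100 million)
```

The second call corresponds to the order-101, height-3 example on the next slide.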
Reasons for using B-Trees

- When searching tables held on disc, the cost of each disc transfer is high but doesn't depend much on the amount of data transferred, especially if consecutive items are transferred
  - If we use a B-tree of order 101, say, we can transfer each node in one disc read operation
  - A B-tree of order 101 and height 3 can hold 101^4 - 1 items (approximately 100 million), and any item can be accessed with 3 disc reads (assuming we hold the root in memory)
- If we take m = 3, we get a 2-3 tree, in which non-leaf nodes have two or three children (i.e., one or two keys)
- B-Trees are always balanced (since the leaves are all at the same level), so 2-3 trees make a good type of balanced tree
Comparing Trees

Binary trees
- Can become unbalanced and lose their good time complexity (big O)
- AVL trees are strict binary trees that overcome the balance problem
- Heaps remain balanced, but only prioritise (not order) the keys

Multi-way trees
- B-Trees can be m-way; they can have any (odd) number of children
- One B-Tree, the 2-3 (or 3-way) B-Tree, approximates a permanently balanced binary tree ...