Adapted from lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Prasad, L05: Tolerant IR
This lecture

Tolerant retrieval:
- Wild-card queries
- Spelling correction
- Soundex

Motivation:
- For expressiveness, to accommodate variants (e.g., automat*)
- To deal with incomplete information about spelling or multiple spellings (e.g., S*dney)
Sec. 3.1
Dictionary data structures: in what data structure do we store the term vocabulary, document frequencies, and pointers to each postings list?
A naïve dictionary

An array of structs:
- char[20]     (20 bytes)
- int          (4/8 bytes)
- Postings *   (4/8 bytes)

How do we store a dictionary in memory efficiently? How do we quickly look up elements at query time?
Dictionary data structures: the two main choices are hashtables and trees.
Hashtables

(We assume you've seen hashtables before)

Pros:
- Lookup is faster than for a tree: O(1)

Cons:
- No easy way to find minor variants: judgment/judgement
- No prefix search [tolerant retrieval]
- If the vocabulary keeps growing, we occasionally need to do the expensive operation of rehashing everything
[Figure: binary search tree over the dictionary, with leaf ranges a-hu, hy-m, n-sh, si-z]
Tree: B-tree

[Figure: B-tree root with child ranges a-hu, hy-m, n-z]

Definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate natural numbers, e.g., [2, 4].
Trees

- Simplest: binary tree
- More usual: B-trees
- Trees require a standard ordering of characters and hence strings, but we typically have one

Pros:
- Solves the prefix problem (terms starting with hyp)

Cons:
- Slower: O(log M) [and this requires a balanced tree]
- Rebalancing binary trees is expensive
Wild-card queries
Wild-card queries: *

mon*: find all docs containing any word beginning with "mon".
- Hashing is unsuitable because order is not preserved
- Easy with a binary tree (or B-tree) lexicon: retrieve all words w in the range mon ≤ w < moo

Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?
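The range retrieval above can be sketched with a sorted in-memory lexicon; the toy lexicon and helper name below are illustrative, not from the slides:

```python
import bisect

def prefix_range(lexicon, prefix):
    """Return all terms w in a sorted lexicon with prefix <= w < successor,
    e.g. mon <= w < moo for the query mon*."""
    lo = bisect.bisect_left(lexicon, prefix)
    # Upper bound: the prefix with its last character bumped by one (mon -> moo).
    hi = bisect.bisect_left(lexicon, prefix[:-1] + chr(ord(prefix[-1]) + 1))
    return lexicon[lo:hi]

lexicon = sorted(["media", "mole", "monday", "money", "month", "moon", "more"])
print(prefix_range(lexicon, "mon"))  # ['monday', 'money', 'month']
```

For the pro*cent exercise, one workable answer is to take the pro* range and keep only the terms that also end in cent.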
Query processing

At this point, we have an enumeration of all terms in the dictionary that match the wild-card query. We still have to look up the postings for each enumerated term. E.g., consider the query:

se*ate AND fil*er

This may result in the execution of many Boolean AND queries.
B-trees handle *'s at the end of a query term. How can we handle *'s in the middle of a query term? (Especially multiple *'s.)

Consider co*tion:
- We could look up co* AND *tion in a B-tree and intersect the two term sets
- Expensive

The solution: transform every wild-card query so that the *'s occur at the end. This gives rise to the Permuterm Index.
Permuterm index

Index every rotation of term$, where $ is a special end-of-term symbol; each rotation points back to the term.

Queries:
- X: lookup on X$
- X*: lookup on $X*
- *X: lookup on X$*
- *X*: lookup on X*
- X*Y: lookup on Y$X*

Example: query hel*o. X = hel, Y = o, so look up o$hel*.

Rotate the query wild-card to the right, then use B-tree lookup as before.
Permuterm problem: it quadruples lexicon size (an empirical observation for English).
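A minimal sketch of a permuterm index in code, on a toy lexicon (the helper names are illustrative). The rotations of hello$ are hello$, ello$h, llo$he, lo$hel, o$hell, $hello:

```python
def rotations(term):
    """All rotations of term$; each points back to the original term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(lexicon):
    return {rot: term for term in lexicon for rot in rotations(term)}

def permuterm_lookup(index, query):
    """X*Y: rotate the query so the * is at the end, i.e. look up Y$X*."""
    x, y = query.split("*")
    key = y + "$" + x  # rotated query with the wild-card moved to the end
    return sorted({term for rot, term in index.items() if rot.startswith(key)})

idx = build_permuterm(["hello", "help", "halo"])
print(permuterm_lookup(idx, "hel*o"))  # ['hello']
```

A production index would store the rotations in a B-tree so that the final prefix lookup (o$hel*) is a range scan rather than a full scan.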
Bigram (k-gram) indexes

Enumerate all k-grams (sequences of k chars) occurring in any term. E.g., from the text "April is the cruelest month" we get the 2-grams (bigrams):

$a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$

$ is a special word-boundary symbol. Maintain a second inverted index from bigrams to dictionary terms that match each bigram.
Processing wild-cards with a bigram index

The query mon* can now be run as the Boolean query: $m AND mo AND on

- This gets terms that match the AND version of our wild-card query.
- But we'd incorrectly enumerate moon as well.
- Must post-filter these terms against the query.
- Surviving enumerated terms are then looked up in the original term-document inverted index.
- Fast, space efficient (compared to permuterm).
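The whole pipeline (bigram extraction, AND-style intersection, post-filter) can be sketched as follows; the toy lexicon and function names are illustrative:

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def bigrams(s):
    """Bigrams of $s$, where $ marks the word boundary."""
    padded = "$" + s + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def build_bigram_index(lexicon):
    index = defaultdict(set)
    for term in lexicon:
        for bg in bigrams(term):
            index[bg].add(term)
    return index

def wildcard_lookup(index, query):
    # Keep only the query bigrams that contain no *.
    grams = [g for g in bigrams(query) if "*" not in g]
    candidates = set.intersection(*(index[g] for g in grams))
    # Post-filter: moon matches $m AND mo AND on but not the pattern mon*.
    return sorted(t for t in candidates if fnmatchcase(t, query))

idx = build_bigram_index(["moon", "month", "monday", "melon"])
print(wildcard_lookup(idx, "mon*"))  # ['monday', 'month']
```

Without the fnmatchcase post-filter, moon would survive the intersection and be returned incorrectly.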
As before, we must execute a Boolean query for each enumerated, filtered term. Wild-cards can result in expensive query execution (very large disjunctions). Search interfaces therefore often advise: "Type your search terms, use * if you need to. E.g., Alex* will match Alexander."
Advanced features

- Avoiding UI clutter is one reason to hide advanced features behind an "Advanced Search" button
- It also deters most users from unnecessarily hitting the engine with fancy queries
Spelling correction
Spell correction

Two principal uses:
- Correcting document(s) being indexed
- Retrieving matching documents when the query contains a spelling error

Two main flavors:
- Isolated word: check each word on its own for misspelling; will not catch typos that result in correctly spelled words, e.g., from typed as form
- Context-sensitive: look at surrounding words, e.g., "I flew form Heathrow to Narita."
Document correction

- Especially needed for OCR'ed documents, where correction must be tuned to the error source: e.g., OCR can confuse O and D more often than it would confuse O and I (O and I are adjacent on the QWERTY keyboard, so they are more likely interchanged in typing)
- But also: web pages and even printed material have typos
- Goal: the dictionary contains fewer misspellings
- But often we don't change the documents and instead fix the query-document mapping
Query mis-spellings

E.g., the query Alanis Morisett. We can either:
- Retrieve documents indexed by the correct spelling, OR
- Return several suggested alternative queries with the correct spelling
Isolated word correction

Fundamental premise: there is a lexicon from which the correct spellings come. Two basic choices for this:
- A standard lexicon, such as Webster's English Dictionary or a hand-maintained industry-specific lexicon
- The lexicon of the indexed corpus: e.g., all words on the web, all names, acronyms, etc. (including the mis-spellings)
Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. What's "closest"? We'll study several alternatives:
- Edit distance
- Weighted edit distance
- n-gram overlap
Edit distance

Given two strings S1 and S2, the edit distance is the minimum number of operations to convert one into the other. Basic operations are typically character-level:
- Insert, Delete, Replace, (Transposition)

From cat to act is 2 (just 1 with transpose); from cat to dog is 3.

Generally found by dynamic programming. See http://www.merriampark.com/ld.htm for a nice example plus an applet.
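The dynamic program can be sketched as follows (plain Levenshtein distance with insert, delete, and replace; the transposition variant is left out):

```python
def edit_distance(s1, s2):
    """Minimum number of insert/delete/replace operations turning s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1,         # insert
                           dp[i - 1][j - 1] + cost)  # replace (or match)
    return dp[m][n]

print(edit_distance("cat", "act"))  # 2
print(edit_distance("cat", "dog"))  # 3
```

A weighted variant simply replaces the unit costs with character-dependent weights.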
Weighted edit distance

- The weight of an operation depends on the character(s) involved
- Meant to capture OCR or keyboard errors, e.g., m is more likely to be mis-typed as n than as q
- Therefore, replacing m by n is a smaller edit distance than replacing it by q
- This may be formulated as a probability model
Using edit distances

- Given a query, first enumerate all character sequences within a preset (weighted) edit distance (e.g., 2)
- Intersect this set with the list of correct words
- Show the terms you found to the user as suggestions

Alternatively:
- We can look up all possible corrections in our inverted index and return all docs ... slow
- We can run with a single most likely correction

The alternatives disempower the user, but save a round of interaction with the user.
Sec. 3.3.4
Edit distance to all dictionary terms?

Given a (mis-spelled) query, do we compute its edit distance to every dictionary term? That would be expensive and slow.

How do we cut the set of candidate dictionary terms?
- One possibility is to use n-gram overlap for this
- This can also be used by itself for spelling correction.
n-gram overlap

- Enumerate all the n-grams in the query string as well as in the lexicon
- Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams
- Threshold by the number of matching n-grams
Example with trigrams

- Suppose the text is november: trigrams are nov, ove, vem, emb, mbe, ber.
- The query is december: trigrams are dec, ece, cem, emb, mbe, ber.
- So 3 trigrams overlap (of 6 in each term). How can we turn this into a normalized measure of overlap?
One option: Jaccard coefficient

A commonly-used measure of overlap. Let X and Y be two sets; then the Jaccard coefficient is

|X ∩ Y| / |X ∪ Y|

- Equals 1 when X and Y have the same elements, and zero when they are disjoint
- X and Y don't have to be of the same size
- Always assigns a number between 0 and 1

Now threshold to decide if you have a match: e.g., if J.C. > 0.8, declare a match.
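Combining the trigram example with the Jaccard coefficient, a small sketch (function names are illustrative):

```python
def trigrams(term):
    return {term[i:i + 3] for i in range(len(term) - 2)}

def jaccard(x, y):
    """|X intersect Y| / |X union Y|: 1 for identical sets, 0 for disjoint ones."""
    return len(x & y) / len(x | y)

nov, dec = trigrams("november"), trigrams("december")
print(sorted(nov & dec))            # ['ber', 'emb', 'mbe']
print(round(jaccard(nov, dec), 3))  # 3 shared of 9 distinct trigrams -> 0.333
```

Note 0.333 is well below a threshold like 0.8, so november would not be suggested as a correction of december.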
Matching bigrams

Consider the query lord: we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)

lo → alone, lord, sloth
or → border, lord, morbid
rd → ardent, border, card
Matching bigrams (postings merge)

Consider the query lord: we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)

lo → alone, lore, sloth
or → border, lore, morbid
rd → ardent, border, card

A standard postings merge will enumerate these candidates. Adapt this to use the Jaccard (or another) measure.
Context-sensitive spell correction

Text: "I flew from Heathrow to Narita."
Consider the phrase query "flew form Heathrow".
We'd like to respond: Did you mean "flew from Heathrow"? (No docs matched the original query phrase.)
Context-sensitive correction

- First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
- Now try all possible resulting phrases with one word "fixed" at a time
- Hit-based spelling correction: suggest the alternative that has lots of hits.
Sec. 3.3.5
General issues in spell correction

We enumerate multiple alternatives for "Did you mean?" and need to figure out which to present to the user, e.g.:
- The alternative hitting the most docs
- Query log analysis

More formally, pick argmax_corr P(corr | query). From Bayes' rule, this is equivalent to

argmax_corr P(query | corr) * P(corr)

where P(query | corr) is the noisy channel and P(corr) is the language model.
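The argmax can be sketched with toy numbers; the probabilities below are invented purely for illustration, not estimates from any real query log or channel model:

```python
# P(corr): language model - prior probability of each candidate correction.
p_corr = {"flew from heathrow": 8e-4, "fled form heathrow": 1e-5}

# P(query | corr): noisy channel - probability the user typed the observed
# query "flew form heathrow" while intending each candidate
# (in practice estimated from a typo/edit model).
p_query_given_corr = {"flew from heathrow": 0.10, "fled form heathrow": 0.05}

def best_correction(candidates):
    # argmax_corr P(query | corr) * P(corr), from Bayes' rule
    return max(candidates, key=lambda c: p_query_given_corr[c] * p_corr[c])

print(best_correction(p_corr))  # flew from heathrow
```

The language-model prior is what lets the common phrase win even when the channel probabilities alone are close.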
Computational cost

- Spell-correction is computationally expensive
- Avoid running routinely on every query?
- Run only on queries that matched few docs
Thesauri
Query expansion

Docs frequently contain equivalences, but expansion can go wrong: expanding puma to jaguar retrieves documents on cars instead of on sneakers.
Soundex
Soundex

A class of heuristics to expand a query into phonetic equivalents; language specific; mainly used for names.
Soundex: typical algorithm

- Turn every token to be indexed into a 4-character reduced form
- Do the same with query terms
- Build and search an index on the reduced forms

(See e.g. http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top)
Soundex algorithm

1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): A, E, I, O, U, H, W, Y.
3. Change letters to digits as follows: B, F, P, V → 1; C, G, J, K, Q, S, X, Z → 2; D, T → 3; L → 4; M, N → 5; R → 6.

Soundex continued

4. Remove one digit from each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.

E.g., Herman becomes H655.
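The soundex reduction can be sketched directly in code, assuming the standard letter-to-digit mapping (B,F,P,V → 1; C,G,J,K,Q,S,X,Z → 2; D,T → 3; L → 4; M,N → 5; R → 6; vowels and H, W, Y → 0). Edge cases of full Soundex, such as a second letter sharing the first letter's code, are ignored here:

```python
SOUNDEX_MAP = {**dict.fromkeys("AEIOUHWY", "0"), **dict.fromkeys("BFPV", "1"),
               **dict.fromkeys("CGJKQSXZ", "2"), **dict.fromkeys("DT", "3"),
               "L": "4", **dict.fromkeys("MN", "5"), "R": "6"}

def soundex(name):
    name = name.upper()
    # Keep the first letter, map the rest of the letters to digits.
    coded = name[0] + "".join(SOUNDEX_MAP[c] for c in name[1:] if c.isalpha())
    # Keep one digit from each run of consecutive identical digits.
    collapsed = coded[0]
    for ch in coded[1:]:
        if ch != collapsed[-1]:
            collapsed += ch
    # Drop zeros, pad with trailing zeros, keep four positions.
    return (collapsed[0] + collapsed[1:].replace("0", "") + "000")[:4]

print(soundex("Herman"))  # H655
```

Herman maps to H06505, which collapses to H655 after zero removal, matching the worked example.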
Exercise

Using the algorithm described above, find the soundex code for your name. Do you know someone who spells their name differently from you, but whose name yields the same soundex code?
Sec. 3.4
Soundex

- Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, ...)
- How useful is soundex? Not very, for information retrieval
- Okay for "high recall" tasks (e.g., Interpol), though biased to names of certain nationalities
- Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR
Language detection

- For docs/paragraphs: at indexing time
- For query terms: at query time; much harder

- For docs/paragraphs, we generally have enough text to apply machine learning methods
- For queries, we lack sufficient text
  - Augment with other cues, such as client properties/specification from the application
  - Domain of query origination, etc.
We have

- Basic inverted index with skip pointers
- Wild-card index
- Spell-correction
- Soundex
Aside: results caching

If 25% of your users are searching for "britney AND spears", then you probably do need spelling correction, but you don't need to keep on intersecting those two postings lists. The web query distribution is extremely skewed, and you can usefully cache results for common queries.
B-Trees
Motivation

- Index structures for large datasets cannot be stored in main memory
- Storing them on disk requires a different approach to efficiency
- Assuming that a disk spins at 3600 RPM, one revolution occurs in 1/60 of a second, or 16.7 ms
- Crudely speaking, one disk access takes about the same time as 200,000 instructions
Motivation (cont.)

- Assume that we use an AVL tree to store about 20 million records
- We end up with a very deep binary tree with lots of different disk accesses; log2 20,000,000 is about 24, so this takes about 0.2 seconds
- We know we can't improve on the log n lower bound on search for a binary tree
- But the solution is to use more branches and thus reduce the height of the tree!
Definition of a B-tree

A B-tree of order m is an m-way tree (i.e., a tree where each node may have up to m children) in which:
1. the number of keys in each non-leaf node is one less than the number of its children, and these keys partition the keys in the children in the fashion of a search tree
2. all leaves are on the same level
3. all non-leaf nodes except the root have at least ⌈m/2⌉ children
4. the root is either a leaf node, or it has from two to m children
5. a leaf node contains no more than m - 1 keys
An example B-tree

[Figure: an example B-tree whose keys include 6, 12, 26, 27, 29, 45, 46, 48, 53, 55, 60, 64, 70, 90]
Constructing a B-tree

- Suppose we start with an empty B-tree and keys arrive in the following order: 1 12 8 2 25 5 14 28 17 7 52 16 48 68 3 26 29 53 55 45
- We want to construct a B-tree of order 5
- The first four items go into the root: [1 2 8 12]
- To put the fifth item in the root would violate condition 5
- Therefore, when 25 arrives, pick the middle key to make a new root
[Figures, animated in the original slides: the intermediate B-tree states as the remaining keys arrive, showing how full nodes split and their middle keys are promoted into the parent]
Exercise in inserting a B-tree

Insert the following keys into a 5-way B-tree: 3, 7, 9, 23, 45, 1, 5, 14, 25, 24, 13, 11, 8, 19, 4, 31, 35, 56
Removal from a B-tree

During insertion, a key always goes into a leaf. For deletion there are two basic cases:
1. if the key is already in a leaf node, and removing it doesn't cause that leaf node to have too few keys, simply remove the key
2. if the key is not in a leaf, then its predecessor or successor is guaranteed to be in a leaf; delete the key and promote that predecessor or successor into its position

If (1) or (2) lead to a leaf node containing less than the minimum number of keys, then we have to look at the siblings immediately adjacent to the leaf in question:
3. if one of them has more than the minimum number of keys, then we can promote one of its keys to the parent and take the parent key into our lacking leaf
4. if neither of them has more than the minimum number of keys, then the lacking leaf and one of its neighbours can be combined with their shared parent (the opposite of promoting a key) and the new leaf will have the correct number of keys; if this step leaves the parent with too few keys, then we repeat the process up to the root itself, if required
Worked deletion examples

[Figures, animated in the original slides: a 5-way B-tree with root keys 12, 29, 52 and leaves including 15 22, 31 43, and 56 69 72]

- Delete 2: since there are enough keys in the node, just delete it.
- Delete 72: too few keys! Leaves are joined back together with a parent key.
- Delete 22: keys are redistributed, leaving nodes 15 29 and 43 56 69.
Exercise in removal from a B-tree

- Given the 5-way B-tree created by these data (last exercise): 3, 7, 9, 23, 45, 1, 5, 14, 25, 24, 13, 11, 8, 19, 4, 31, 35, 56
- Add these further keys: 2, 6, 12
- Delete these keys: 4, 5, 7, 3, 14
Analysis of B-Trees

- The maximum number of items in a B-tree of order m and height h is obtained when every node is full: the root holds m - 1 keys, its m children hold m(m - 1), and so on, with m^h(m - 1) at depth h
- So the total number of items is
  (1 + m + m^2 + m^3 + ... + m^h)(m - 1) = [(m^(h+1) - 1)/(m - 1)](m - 1) = m^(h+1) - 1
- When m = 5 and h = 2 this gives 5^3 - 1 = 124
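The closed form can be checked numerically; a quick sketch (the function name is illustrative):

```python
def btree_capacity(m, h):
    """Maximum number of keys in a full B-tree of order m and height h:
    level i contributes m**i nodes with (m - 1) keys each."""
    total = sum((m ** level) * (m - 1) for level in range(h + 1))
    assert total == m ** (h + 1) - 1  # the telescoped closed form
    return total

print(btree_capacity(5, 2))    # 124, as on the slide
print(btree_capacity(101, 3))  # 104060400, i.e. 101**4 - 1 (about 100 million)
```

The second call corresponds to the order-101, height-3 example on the next slide.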
Reasons for using B-Trees

- When searching tables held on disc, the cost of each disc transfer is high but doesn't depend much on the amount of data transferred, especially if consecutive items are transferred
  - If we use a B-tree of order 101, say, we can transfer each node in one disc read operation
  - A B-tree of order 101 and height 3 can hold 101^4 - 1 items (approximately 100 million), and any item can be accessed with 3 disc reads (assuming we hold the root in memory)
- If we take m = 3, we get a 2-3 tree, in which non-leaf nodes have two or three children (i.e., one or two keys)
- B-Trees are always balanced (since the leaves are all at the same level), so 2-3 trees make a good type of balanced tree
Comparing Trees

Binary trees
- Can become unbalanced and lose their good time complexity (big O)
- AVL trees are strict binary trees that overcome the balance problem
- Heaps remain balanced, but only prioritise (not order) the keys

Multi-way trees
- B-Trees can be m-way; they can have any (odd) number of children
- One B-Tree, the 2-3 (or 3-way) B-Tree, approximates a permanently balanced binary tree ...