
Tolerant IR

Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)

Prasad

L05TolerantIR

This lecture

Tolerant retrieval
- Wild-card queries
- Spelling correction
- Soundex

Motivation
- For expressiveness, to accommodate variants (e.g., automat*)
- To deal with incomplete information about spelling, or multiple spellings (e.g., S*dney)

Sec. 3.1

Dictionary data structures for inverted indexes
- The dictionary data structure stores the term vocabulary, document frequency, and pointers to each postings list. In what data structure?

Sec. 3.1

A naïve dictionary
- An array of structs:
  - char[20] (20 bytes)
  - int (4/8 bytes)
  - Postings* (4/8 bytes)
- How do we store a dictionary in memory efficiently?
- How do we quickly look up elements at query time?

Sec. 3.1

Dictionary data structures
- Two main choices:
  - Hashtables
  - Trees
- Some IR systems use hashtables, some use trees

Sec. 3.1

Hashtables
- Each vocabulary term is hashed to an integer
  - (We assume you've seen hashtables before)
- Pros:
  - Lookup is faster than for a tree: O(1)
- Cons:
  - No easy way to find minor variants: judgment/judgement
  - No prefix search [tolerant retrieval]
  - If the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything

Sec. 3.1

Tree: binary tree
[Figure: a binary search tree over the vocabulary; the root splits a-m / n-z, with children covering a-hu, hy-m, n-sh, and si-z]

Sec. 3.1

Tree: B-tree
[Figure: a B-tree whose root's children cover the ranges a-hu, hy-m, and n-z]
- Definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate natural numbers, e.g., [2, 4].

Sec. 3.1

Trees
- Simplest: binary tree
- More usual: B-trees
- Trees require a standard ordering of characters, and hence of strings, but we typically have one
- Pros:
  - Solves the prefix problem (terms starting with hyp)
- Cons:
  - Slower: O(log M) [and this requires a balanced tree]
  - Rebalancing binary trees is expensive
    - But B-trees mitigate the rebalancing problem

Wild-card queries


Wild-card queries: *
- mon*: find all docs containing any word beginning with "mon".
  - Hashing is unsuitable because order is not preserved
  - Easy with a binary tree (or B-tree) lexicon: retrieve all words w in the range mon <= w < moo
- *mon: find words ending in "mon": harder
  - Maintain an additional B-tree for terms written backwards. Then retrieve all words w in the range nom <= w < non.
- Exercise: from this, how can we enumerate all terms meeting the wild-card query pro*cent?
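The prefix-range lookup above can be sketched with a sorted Python list standing in for the B-tree lexicon (the sample lexicon is made up for illustration):

```python
from bisect import bisect_left

# Sorted lexicon stands in for the B-tree's ordered keys.
lexicon = sorted(["moist", "mold", "monday", "money", "monitor", "month", "moo", "moon"])

def prefix_range(prefix):
    """Return all terms w with prefix <= w < next_prefix (e.g., mon <= w < moo)."""
    hi_key = prefix[:-1] + chr(ord(prefix[-1]) + 1)  # "mon" -> "moo"
    lo = bisect_left(lexicon, prefix)
    hi = bisect_left(lexicon, hi_key)
    return lexicon[lo:hi]

print(prefix_range("mon"))  # ['monday', 'money', 'monitor', 'month']
```

The same range trick over the reversed lexicon handles *mon, as the slide notes.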

Query processing
- At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.
- We still have to look up the postings for each enumerated term.
- E.g., consider the query: se*ate AND fil*er
  - This may result in the execution of many Boolean AND queries.

B-trees handle *'s at the end of a query term
- How can we handle *'s in the middle of a query term? (Especially multiple *'s)
  - Consider co*tion
  - We could look up co* AND *tion in a B-tree and intersect the two term sets
    - Expensive
- The solution: transform every wild-card query so that the *'s occur at the end
- This gives rise to the Permuterm index.

Permuterm index
- For the term hello, index it under:
  - hello$, ello$h, llo$he, lo$hel, o$hell
  - where $ is a special symbol.
- Queries:
  - X: lookup on X$
  - *X*: lookup on X*
  - X*Y: lookup on Y$X*
    - Query = hel*o, so X=hel, Y=o: lookup o$hel*
  - *X: lookup on X$*
  - X*Y*Z: ???
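The rotation scheme can be sketched as follows (the tiny two-term lexicon is a made-up illustration, and a plain dict with prefix scanning stands in for the B-tree a real permuterm index would use):

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; each rotation maps back to the original term."""
    aug = term + "$"
    return [aug[i:] + aug[:i] for i in range(len(aug))]

# Build the permuterm index: rotation -> set of terms
index = {}
for t in ["hello", "help"]:
    for rot in permuterm_rotations(t):
        index.setdefault(rot, set()).add(t)

# Query hel*o: rotate so the * is at the end -> o$hel*, then prefix-match rotations
prefix = "o$hel"
matches = {t for rot, ts in index.items() if rot.startswith(prefix) for t in ts}
print(matches)  # {'hello'}
```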

Permuterm query processing
- Rotate the query wild-card to the right
- Now use B-tree lookup as before.
- Permuterm problem: roughly quadruples the lexicon size
  - Empirical observation for English.

Bigram indexes
- Enumerate all k-grams (sequences of k chars) occurring in any term
- E.g., from the text "April is the cruelest month" we get the 2-grams (bigrams)
  $a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$
  - $ is a special word-boundary symbol
- Maintain a second inverted index from bigrams to the dictionary terms that match each bigram.

Bigram index example
- The k-gram index finds terms based on a query consisting of k-grams (here k=2).
[Figure: postings for three bigrams: $m -> mace, madden; mo -> among, amortize; on -> among, around]
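Building such a bigram index can be sketched as follows (a plain dict of sets stands in for the second inverted index; the term list reuses the terms from the figure):

```python
def kgrams(term, k=2):
    """All k-grams of a term, with $ marking the word boundaries."""
    aug = "$" + term + "$"
    return {aug[i:i+k] for i in range(len(aug) - k + 1)}

terms = ["mace", "madden", "among", "amortize", "around"]
index = {}
for t in terms:
    for g in kgrams(t):
        index.setdefault(g, set()).add(t)

print(sorted(index["$m"]))  # ['mace', 'madden']
print(sorted(index["mo"]))  # ['among', 'amortize']
```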

Processing n-gram wild-cards
- The query mon* can now be run as: $m AND mo AND on
- This gets the terms that match the AND version of our wildcard query.
- But we'd incorrectly enumerate moon as well.
- Must post-filter these terms against the query.
- Surviving enumerated terms are then looked up in the original term-document inverted index.
- Fast, space-efficient (compared to permuterm).
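The mon* processing above, including the moon false positive and the post-filter, can be sketched as follows (the five-term lexicon is a made-up illustration, and `fnmatch` stands in for a proper wild-card matcher):

```python
import fnmatch

def kgrams(term, k=2):
    aug = "$" + term + "$"
    return {aug[i:i+k] for i in range(len(aug) - k + 1)}

terms = ["moon", "month", "money", "moron", "melon"]
index = {}
for t in terms:
    for g in kgrams(t):
        index.setdefault(g, set()).add(t)

# mon* becomes the bigram conjunction: $m AND mo AND on
candidates = index["$m"] & index["mo"] & index["on"]
print(sorted(candidates))  # 'moon' and 'moron' slip in as false positives
survivors = sorted(t for t in candidates if fnmatch.fnmatchcase(t, "mon*"))
print(survivors)           # post-filtered: ['money', 'month']
```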

Processing wild-card queries
- As before, we must execute a Boolean query for each enumerated, filtered term.
- Wild-cards can result in expensive query execution (very large disjunctions)
- Avoid encouraging laziness in the UI:
  [Search box: "Type your search terms, use * if you need to. E.g., Alex* will match Alexander."]

Advanced features
- Avoiding UI clutter is one reason to hide advanced features behind an "Advanced Search" button
- It also deters most users from unnecessarily hitting the engine with fancy queries

Spelling correction


Spell correction
- Two principal uses
  - Correcting document(s) being indexed
  - Retrieving matching documents when the query contains a spelling error
- Two main flavors:
  - Isolated word
    - Check each word on its own for misspelling
    - Will not catch typos resulting in correctly spelled words, e.g., from -> form
  - Context-sensitive
    - Look at surrounding words, e.g., I flew form Heathrow to Narita.

Document correction
- Especially needed for OCR'ed documents
  - Correction algorithms tuned for this: rn vs m
  - Can use domain-specific knowledge
    - E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing).
- But also: web pages and even printed material have typos
- Goal: the dictionary contains fewer misspellings
- But often we don't change the documents; instead we fix the query-document mapping

Query mis-spellings
- Our principal focus here
  - E.g., the query Alanis Morisett
- We can either
  - Retrieve documents indexed by the correct spelling, OR
  - Return several suggested alternative queries with the correct spelling
    - "Did you mean ...?"

Isolated word correction
- Fundamental premise: there is a lexicon from which the correct spellings come
- Two basic choices for this
  - A standard lexicon, such as
    - Webster's English Dictionary
    - An industry-specific lexicon (hand-maintained)
  - The lexicon of the indexed corpus
    - E.g., all words on the web
    - All names, acronyms, etc.
    - (Including the mis-spellings)

Isolated word correction
- Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
- What's "closest"?
- We'll study several alternatives
  - Edit distance
  - Weighted edit distance
  - n-gram overlap

Edit distance
- Given two strings S1 and S2, the minimum number of operations to convert one to the other
- Basic operations are typically character-level
  - Insert, Delete, Replace, Transposition
- E.g., the edit distance from dof to dog is 1
  - From cat to act it is 2 (just 1 with transpose);
  - from cat to dog it is 3.
- Generally found by dynamic programming.
- See http://www.merriampark.com/ld.htm for a nice example plus an applet.
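The dynamic program mentioned above can be sketched without the transposition operation, i.e., plain Levenshtein distance over insert/delete/replace (which is why cat -> act costs 2 here):

```python
def edit_distance(s1, s2):
    """Levenshtein distance via dynamic programming (insert/delete/replace only)."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete everything
    for j in range(n + 1):
        d[0][j] = j                      # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,       # delete
                          d[i][j-1] + 1,       # insert
                          d[i-1][j-1] + cost)  # replace (or match)
    return d[m][n]

print(edit_distance("dof", "dog"))  # 1
print(edit_distance("cat", "act"))  # 2
print(edit_distance("cat", "dog"))  # 3
```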

Weighted edit distance
- As above, but the weight of an operation depends on the character(s) involved
  - Meant to capture OCR or keyboard errors, e.g., m is more likely to be mis-typed as n than as q
  - Therefore, replacing m by n is a smaller edit distance than replacing m by q
  - This may be formulated as a probability model
- Requires a weight matrix as input
- Modify the dynamic programming to handle weights

Using edit distances
- Given a query, first enumerate all character sequences within a preset (weighted) edit distance (e.g., 2)
- Intersect this set with the list of correct words
- Show the terms you found to the user as suggestions
- Alternatively,
  - We can look up all possible corrections in our inverted index and return all docs: slow
  - We can run with a single most likely correction
- These alternatives disempower the user, but save a round of interaction with the user

Sec. 3.3.4

Edit distance to all dictionary terms?
- Given a (mis-spelled) query, do we compute its edit distance to every dictionary term?
  - Expensive and slow
  - Alternative?
- How do we cut down the set of candidate dictionary terms?
  - One possibility is to use n-gram overlap for this
  - This can also be used by itself for spelling correction.

n-gram overlap
- Enumerate all the n-grams in the query string as well as in the lexicon
- Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams
- Threshold by the number of matching n-grams
  - Variants: weight by keyboard layout, etc.

Example with trigrams
- Suppose the text is november
  - Trigrams are nov, ove, vem, emb, mbe, ber.
- The query is december
  - Trigrams are dec, ece, cem, emb, mbe, ber.
- So 3 trigrams overlap (of 6 in each term)
- How can we turn this into a normalized measure of overlap?

One option: Jaccard coefficient
- A commonly-used measure of overlap
- Let X and Y be two sets; then the Jaccard coefficient is |X ∩ Y| / |X ∪ Y|
- Equals 1 when X and Y have the same elements, and 0 when they are disjoint
- X and Y don't have to be the same size
- Always assigns a number between 0 and 1
- Now threshold to decide if you have a match
  - E.g., if J.C. > 0.8, declare a match
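The november/december example above, with the Jaccard coefficient computed over trigram sets:

```python
def trigrams(term):
    return {term[i:i+3] for i in range(len(term) - 2)}

def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y|"""
    return len(x & y) / len(x | y)

t1, t2 = trigrams("november"), trigrams("december")
print(sorted(t1 & t2))            # ['ber', 'emb', 'mbe']: the 3 shared trigrams
print(round(jaccard(t1, t2), 3))  # 3 shared of 9 distinct: 0.333
```

At a 0.8 threshold, december would not be accepted as a match for november.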

Matching bigrams
- Consider the query lord: we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)
[Postings: lo -> alone, lord, sloth; or -> border, lord, morbid; rd -> ardent, border, card]
- A standard postings merge will enumerate the matches

Sec. 3.3.4

Matching bigrams (contd.)
- Consider the query lord: we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)
[Postings: lo -> alone, lore, sloth; or -> border, lore, morbid; rd -> ardent, border, card]
- A standard postings merge will enumerate the matches
- Adapt this to using the Jaccard (or another) measure.
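The 2-of-3 threshold can be sketched as a count over the merged postings, using the lexicon from the first matching-bigrams slide:

```python
from collections import Counter

def bigrams(term):
    return {term[i:i+2] for i in range(len(term) - 1)}

lexicon = ["alone", "ardent", "border", "card", "lord", "morbid", "sloth"]
index = {}
for t in lexicon:
    for g in bigrams(t):
        index.setdefault(g, set()).add(t)

query = bigrams("lord")  # {'lo', 'or', 'rd'}
counts = Counter(t for g in query for t in index.get(g, ()))
matches = sorted(t for t, c in counts.items() if c >= 2)  # threshold: 2 of 3
print(matches)  # ['border', 'lord']
```

Terms like alone or card each share only one bigram with lord and are filtered out by the threshold.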

Context-sensitive spell correction
- Text: I flew from Heathrow to Narita.
- Consider the phrase query "flew form Heathrow"
- We'd like to respond "Did you mean: flew from Heathrow?" because no docs matched the query phrase.

Context-sensitive correction
- Need surrounding context to catch this.
  - NLP is too heavyweight for this.
- First idea: retrieve dictionary terms close (in weighted edit distance) to each query term
- Now try all possible resulting phrases with one word "fixed" at a time
  - flew from heathrow
  - fled form heathrow
  - flea form heathrow
- Hit-based spelling correction: suggest the alternative that has lots of hits.

Sec. 3.3.5

General issues in spell correction
- We enumerate multiple alternatives for "Did you mean?"
- Need to figure out which to present to the user
  - The alternative hitting the most docs
  - Query log analysis
- More generally, rank the alternatives probabilistically: argmax_corr P(corr | query)
  - By Bayes' rule, this is equivalent to argmax_corr P(query | corr) * P(corr), the product of a noisy-channel model and a language model

Computational cost
- Spell-correction is computationally expensive
- Avoid running it routinely on every query?
- Run only on queries that matched few docs

Thesauri
- Thesaurus: a language-specific list of synonyms for terms likely to be queried
  - car = automobile, etc.
  - Machine learning methods can assist
- Can be viewed as a hand-made alternative to edit-distance, etc.

Query expansion
- Usually do query expansion rather than index expansion
  - No index blowup
  - Query processing is slowed down
    - Docs frequently contain equivalences
  - May retrieve more junk
    - puma = jaguar retrieves documents on cars instead of on sneakers.

Soundex


Soundex
- A class of heuristics to expand a query into phonetic equivalents
  - Language-specific: mainly for names
  - E.g., chebyshev = tchebycheff

Soundex: typical algorithm
- Turn every token to be indexed into a 4-character reduced form
- Do the same with query terms
- Build and search an index on the reduced forms
  - (when the query calls for a soundex match)
- See http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top

Soundex: typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
   - B, F, P, V -> 1
   - C, G, J, K, Q, S, X, Z -> 2
   - D, T -> 3
   - L -> 4
   - M, N -> 5
   - R -> 6
4. Remove one of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.
- E.g., Herman becomes H655.
- Will hermann generate the same code?
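A minimal sketch that follows the steps above literally (real Soundex variants differ in details, e.g., how H and W separate coded letters and how the first letter's own code is treated):

```python
def soundex(word):
    codes = {}
    for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                           ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    word = word.upper()
    coded = word[0] + "".join(codes[c] for c in word[1:] if c in codes)
    out = coded[0]                      # collapse runs of identical digits
    for c in coded[1:]:
        if c != out[-1]:
            out += c
    out = out[0] + out[1:].replace("0", "")  # drop zeros
    return (out + "000")[:4]                 # pad with zeros, keep 4 chars

print(soundex("Herman"))   # H655
print(soundex("hermann"))  # H655: the double n collapses to one code
```

This answers the slide's question: hermann yields the same code as Herman.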

Exercise
- Using the algorithm described above, find the soundex code for your name
- Do you know someone who spells their name differently from you, but whose name yields the same soundex code?

Sec. 3.4

Soundex
- Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, ...)
- How useful is soundex?
  - Not very, for information retrieval
  - Okay for "high recall" tasks (e.g., Interpol), though biased to names of certain nationalities
- Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR

Language detection
- Many of the components described above require language detection
  - For docs/paragraphs at indexing time
  - For query terms at query time: much harder
- For docs/paragraphs, we generally have enough text to apply machine learning methods
- For queries, we lack sufficient text
  - Augment with other cues, such as client properties/specification from the application
  - Domain of query origination, etc.

What queries can we process?
- We have
  - Basic inverted index with skip pointers
  - Wild-card index
  - Spell-correction
  - Soundex
- Queries such as
  (SPELL(moriset) /3 toron*to) OR SOUNDEX(chaikofski)

Aside: results caching
- If 25% of your users are searching for "britney AND spears", then you probably do need spelling correction, but you don't need to keep on intersecting those two postings lists
- Web query distribution is extremely skewed, and you can usefully cache results for common queries.
  - Query log analysis

B-Trees


Motivation for B-Trees
- Index structures for large datasets cannot be stored in main memory
- Storing them on disk requires a different approach to efficiency
- Assuming that a disk spins at 3600 RPM, one revolution occurs in 1/60 of a second, or 16.7 ms
- Crudely speaking, one disk access takes about the same time as 200,000 instructions

Motivation (cont.)
- Assume that we use an AVL tree to store about 20 million records
- We end up with a very deep binary tree with lots of different disk accesses; log2 20,000,000 is about 24, so this takes about 0.2 seconds
- We know we can't improve on the log n lower bound on search for a binary tree
- But the solution is to use more branches and thus reduce the height of the tree!
  - As branching increases, depth decreases
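The depth comparison can be checked directly (101-way branching anticipates the order-101 B-tree discussed later in these slides):

```python
import math

n = 20_000_000
print(round(math.log2(n), 1))      # about 24 levels for a binary tree
print(round(math.log(n, 101), 1))  # under 4 levels for a 101-way tree
```

At ~8 ms per disk access, 24 accesses is roughly the 0.2 seconds quoted above, while 3-4 accesses is an order of magnitude cheaper.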

Definition of a B-tree
- A B-tree of order m is an m-way tree (i.e., a tree where each node may have up to m children) in which:
  1. the number of keys in each non-leaf node is one less than the number of its children, and these keys partition the keys in the children in the fashion of a search tree
  2. all leaves are on the same level
  3. all non-leaf nodes except the root have at least ⌈m/2⌉ children
  4. the root is either a leaf node, or it has from two to m children
  5. a leaf node contains no more than m - 1 keys

An example B-Tree
- A B-tree of order 5 containing 26 items
[Figure: root: 26; internal nodes: (6 12) and (42 51 62); leaves include (13 15 18 25), (27 29), (45 46 48), (53 55 60), (64 70 90)]
- Note that all the leaves are at the same level

Constructing a B-tree
- Suppose we start with an empty B-tree and keys arrive in the following order: 1 12 8 2 25 6 14 28 17 7 52 16 48 68 3 26 29 53 55 45
- We want to construct a B-tree of order 5
- The first four items go into the root: (1 2 8 12)
- To put the fifth item in the root would violate condition 5
- Therefore, when 25 arrives, pick the middle key to make a new root

Constructing a B-tree (contd.)
[Figure: root (8); leaves (1 2) and (12 25)]
- 6, 14, 28 get added to the leaf nodes:
[Figure: root (8); leaves (1 2 6) and (12 14 25 28)]

Constructing a B-tree (contd.)
- Adding 17 to the right leaf node would over-fill it, so we take the middle key, promote it (to the root), and split the leaf
[Figure: root (8 17); leaves (1 2 6), (12 14), (25 28)]
- 7, 52, 16, 48 get added to the leaf nodes
[Figure: root (8 17); leaves (1 2 6 7), (12 14 16), (25 28 48 52)]

Constructing a B-tree (contd.)
- Adding 68 causes us to split the rightmost leaf, promoting 48 to the root; adding 3 causes us to split the leftmost leaf, promoting 3 to the root; 26, 29, 53, 55 then go into the leaves
[Figure: root (3 8 17 48); leaves (1 2), (6 7), (12 14 16), (25 26 28 29), (52 53 55 68)]
- Adding 45 causes a split of (25 26 28 29 45), and promoting 28 to the root then causes the root to split

Constructing a B-tree (contd.)
[Figure: the final tree: root (17); internal nodes (3 8) and (28 48); leaves (1 2), (6 7), (12 14 16), (25 26), (29 45), (52 53 55 68)]
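The construction traced on the preceding slides can be reproduced with an insertion-only sketch of a B-tree (an illustrative sketch, not production code):

```python
import bisect

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []  # empty list => leaf

class BTree:
    """Insertion-only B-tree of order m: up to m children, m-1 keys per node."""
    def __init__(self, m=5):
        self.m = m
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split:  # the root itself split: grow the tree one level
            mid, right = split
            self.root = Node([mid], [self.root, right])

    def _insert(self, node, key):
        if not node.children:                   # leaf: put the key in place
            bisect.insort(node.keys, key)
        else:                                   # internal: descend, absorb any split
            i = bisect.bisect_left(node.keys, key)
            split = self._insert(node.children[i], key)
            if split:
                mid, right = split
                node.keys.insert(i, mid)
                node.children.insert(i + 1, right)
        if len(node.keys) > self.m - 1:         # overflow: split, promote middle key
            mid_i = len(node.keys) // 2
            right = Node(node.keys[mid_i + 1:], node.children[mid_i + 1:])
            mid, node.keys = node.keys[mid_i], node.keys[:mid_i]
            node.children = node.children[:mid_i + 1] if node.children else []
            return mid, right
        return None

t = BTree(m=5)
for k in [1, 12, 8, 2, 25, 6, 14, 28, 17, 7, 52, 16, 48, 68, 3, 26, 29, 53, 55, 45]:
    t.insert(k)
print(t.root.keys)  # [17]: the root produced by the final root split
```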

Inserting into a B-Tree
- Attempt to insert the new key into a leaf
- If this would result in that leaf becoming too big, split the leaf into two, promoting the middle key to the leaf's parent
- If this would result in the parent becoming too big, split the parent into two, promoting the middle key
- This strategy might have to be repeated all the way to the top
- If necessary, the root is split in two and the middle key is promoted to a new root, making the tree one level higher

Exercise in inserting into a B-Tree
- Insert the following keys into a 5-way B-tree: 3, 7, 9, 23, 45, 1, 5, 14, 25, 24, 13, 11, 8, 19, 4, 31, 35, 56

Removal from a B-tree
- During insertion, the key always goes into a leaf. For deletion, we wish to remove from a leaf. There are several possible cases:
  1. If the key is already in a leaf node, and removing it doesn't cause that leaf node to have too few keys, then simply remove the key to be deleted.
  2. If the key is not in a leaf, then it is guaranteed (by the nature of a B-tree) that its predecessor or successor will be in a leaf. In this case we can delete the key and promote the predecessor or successor key into the deleted key's non-leaf position.

Removal from a B-tree (2)
- If (1) or (2) leads to a leaf node containing less than the minimum number of keys, then we have to look at the siblings immediately adjacent to the leaf in question:
  3. If one of them has more than the minimum number of keys, then we can promote one of its keys to the parent and take the parent key into our lacking leaf
  4. If neither of them has more than the minimum number of keys, then the lacking leaf and one of its neighbours can be combined with their shared parent key (the opposite of promoting a key), and the new leaf will have the correct number of keys; if this step leaves the parent with too few keys, then we repeat the process up to the root itself, if required

Type #1: Simple leaf deletion
- Assuming a 5-way B-Tree, as before...
[Figure: root (12 29 52); leaves include (15 22), (31 43), (56 69 72)]
- Delete 2: since there are enough keys in its leaf, just delete it

Type #2: Simple non-leaf deletion
- Delete 52
[Figure: root (12 29 52); leaves (15 22), (31 43), (56 69 72); 52 is replaced by 56 in the root]
- Borrow the predecessor or (in this case) the successor

Type #4: Too few keys in node and its siblings
- Delete 72
[Figure: root (12 29 56); leaves (15 22), (31 43), (69 72)]
- Deleting 72 leaves its leaf with too few keys, and its siblings have none to spare: join the leaf and its sibling back together

Type #4: Too few keys in node and its siblings (result)
[Figure: root (12 29); leaves (15 22), (31 43 56 69)]

Type #3: Enough siblings
- Delete 22
[Figure: root (12 29); leaves (15 22), (31 43 56 69)]
- Demote a root key and promote a leaf key

Type #3: Enough siblings (result)
[Figure: root (12 31); leaves (15 29), (43 56 69)]

Exercise in removal from a B-Tree
- Given the 5-way B-tree created from these data (last exercise): 3, 7, 9, 23, 45, 1, 5, 14, 25, 24, 13, 11, 8, 19, 4, 31, 35, 56
- Add these further keys: 2, 6, 12
- Delete these keys: 4, 5, 7, 3, 14

Analysis of B-Trees
- The maximum number of items in a B-tree of order m and height h:
  - root: m - 1
  - level 1: m(m - 1)
  - level 2: m^2 (m - 1)
  - ...
  - level h: m^h (m - 1)
- So the total number of items is
  (1 + m + m^2 + m^3 + ... + m^h)(m - 1) = [(m^(h+1) - 1) / (m - 1)] (m - 1) = m^(h+1) - 1
- When m = 5 and h = 2 this gives 5^3 - 1 = 124
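The closed form above can be checked level by level:

```python
def max_items(m, h):
    """Maximum number of keys in a B-tree of order m and height h: m^(h+1) - 1."""
    return m ** (h + 1) - 1

# term-by-term: each level l holds at most m^l nodes of m-1 keys
assert max_items(5, 2) == sum(5 ** lvl * (5 - 1) for lvl in range(3))

print(max_items(5, 2))    # 124
print(max_items(101, 3))  # 104060400, i.e. roughly 100 million (next slide)
```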

Reasons for using B-Trees
- When searching tables held on disc, the cost of each disc transfer is high but doesn't depend much on the amount of data transferred, especially if consecutive items are transferred
  - If we use a B-tree of order 101, say, we can transfer each node in one disc read operation
  - A B-tree of order 101 and height 3 can hold 101^4 - 1 items (approximately 100 million), and any item can be accessed with 3 disc reads (assuming we hold the root in memory)
- If we take m = 3, we get a 2-3 tree, in which non-leaf nodes have two or three children (i.e., one or two keys)
- B-Trees are always balanced (since the leaves are all at the same level), so 2-3 trees make a good balanced search structure

Comparing Trees
- Binary trees
  - Can become unbalanced and lose their good time complexity (big O)
  - AVL trees are strict binary trees that overcome the balance problem
  - Heaps remain balanced, but only prioritise (not order) the keys
- Multi-way trees
  - B-Trees can be m-way; they can have any (odd) number of children
  - One B-Tree, the 2-3 (or 3-way) B-Tree, approximates a permanently balanced binary tree
