
Computing Relevance, Similarity:
The Vector Space Model

Chapter 27, Part B
Based on Larson and Hearst's slides at UC Berkeley
http://www.sims.berkeley.edu/courses/is202/f00/

Document Vectors

- Documents are represented as "bags of words"
- Represented as vectors when used computationally
  - A vector is like an array of floating-point numbers
  - Has direction and magnitude
  - Each vector holds a place for every term in the collection
  - Therefore, most vectors are sparse
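A minimal sketch of such a sparse vector, assuming plain whitespace tokenization with no stemming (the example text is made up):

```python
from collections import Counter

def doc_vector(text: str) -> Counter:
    """Bag of words: map each term to its raw frequency.
    Terms that do not occur are simply absent, keeping the vector sparse."""
    return Counter(text.lower().split())

# Hypothetical mini-document
a = doc_vector("nova nova galaxy heat nova")
print(a)          # Counter({'nova': 3, 'galaxy': 1, 'heat': 1})
print(a["diet"])  # 0 -- absent terms count as zero occurrences
```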

Document Vectors:
One location for each word

docs  nova  galaxy  heat  hwood  film  role  diet  fur
 A     10     5      3
 B      5    10
 C                          10     8     7
 D                           9    10     5
 E                                             10    9
 F                                              9   10
 G      5     7              9
 H            6             10     2     8
 I                           7     5            1    3

"Nova" occurs 10 times in text A
"Galaxy" occurs 5 times in text A
"Heat" occurs 3 times in text A
(Blank means 0 occurrences.)

Document Vectors

- Document ids: each row label (A-I) in the table above is a document id; each column is one term of the collection.

We Can Plot the Vectors

[Figure: documents plotted in a two-term space with axes "Star" and "Diet". The doc about astronomy and the doc about movie stars sit near the Star axis; the doc about mammal behavior sits near the Diet axis.]

Assumption: Documents that are close in space are similar.

Vector Space Model

- Documents are represented as vectors in term space
  - Terms are usually stems
  - Documents represented by binary vectors of terms
- Queries represented the same as documents
- A vector distance measure between the query and documents is used to rank retrieved documents
- Query and document similarity is based on the length and direction of their vectors
- Vector operations can capture boolean query conditions

Vector Space Documents and Queries

docs  t1  t2  t3    RSV = Q.Di
D1     1   0   1        4
D2     1   0   0        1
D3     0   1   1        5
D4     1   0   0        1
D5     1   1   1        6
D6     1   1   0        3
D7     0   1   0        2
D8     0   1   0        2
D9     0   0   1        3
D10    0   1   1        5
D11    1   0   1        4
Q      1   2   3     (q1, q2, q3)

[Figure: documents D1-D11 plotted in the space spanned by terms t1, t2, t3.]

- Boolean term combinations
- Q is a query, also represented as a vector
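The RSV (retrieval status value) column is just the dot product Q.Di; a sketch in plain Python with the vectors copied from the table:

```python
docs = {
    "D1": (1, 0, 1), "D2": (1, 0, 0), "D3": (0, 1, 1), "D4": (1, 0, 0),
    "D5": (1, 1, 1), "D6": (1, 1, 0), "D7": (0, 1, 0), "D8": (0, 1, 0),
    "D9": (0, 0, 1), "D10": (0, 1, 1), "D11": (1, 0, 1),
}
q = (1, 2, 3)

# RSV = Q . Di, then rank documents by decreasing score
rsv = {name: sum(qt * dt for qt, dt in zip(q, vec)) for name, vec in docs.items()}
for name, score in sorted(rsv.items(), key=lambda kv: -kv[1]):
    print(name, score)  # D5 ranks first with 6; D2 and D4 last with 1
```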

Assigning Weights to Terms

1. Binary weights
2. Raw term frequency
3. tf x idf
   - Recall the Zipf distribution
   - Want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole

Binary Weights

- Only the presence (1) or absence (0) of a term is included in the vector

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1
D11    1   0   1

Raw Term Weights

- The frequency of occurrence of the term in each document is included in the vector

docs  t1  t2  t3
D1     2   0   3
D2     1   0   0
D3     0   4   7
D4     3   0   0
D5     1   6   3
D6     3   5   0
D7     0   8   0
D8     0  10   0
D9     0   0   1
D10    0   3   5
D11    4   0   1

TF x IDF Weights

- tf x idf measure:
  - Term Frequency (tf)
  - Inverse Document Frequency (idf) -- a way to deal with the problems of the Zipf distribution
- Goal: Assign a tf x idf weight to each term in each document

TF x IDF Calculation

w_{ik} = tf_{ik} \cdot \log(N / n_k)

where
  T_k     = term k in document D_i
  tf_{ik} = frequency of term T_k in document D_i
  idf_k   = inverse document frequency of term T_k in collection C
  N       = total number of documents in the collection C
  n_k     = the number of documents in C that contain T_k

idf_k = \log\left(\frac{N}{n_k}\right)
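A direct transcription of the formula as a sketch (base-10 logs are assumed, matching the idf examples on the next slide):

```python
import math

def tf_idf(tf_ik: int, N: int, n_k: int) -> float:
    """w_ik = tf_ik * log10(N / n_k)."""
    return tf_ik * math.log10(N / n_k)

# A term occurring 3 times in a document, and in 20 of 10000 documents overall
print(tf_idf(3, N=10000, n_k=20))  # 3 * 2.699 = 8.097...
```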

Inverse Document Frequency

- IDF provides high values for rare words and low values for common words

For a collection of 10000 documents:

  \log(10000 / 10000) = 0
  \log(10000 / 5000)  = 0.301
  \log(10000 / 20)    = 2.699
  \log(10000 / 1)     = 4
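A quick check of these values:

```python
import math

# idf for terms appearing in 10000, 5000, 20, and 1 of 10000 documents
for n_k in (10000, 5000, 20, 1):
    print(n_k, round(math.log10(10000 / n_k), 3))  # 0.0, 0.301, 2.699, 4.0
```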

TF x IDF Normalization

- Normalize the term weights (so longer documents are not unfairly given more weight)
  - The longer the document, the more likely it is for a given term to appear in it, and the more often a given term is likely to appear in it. So we want to reduce the importance attached to a term appearing in a document based on the length of the document.

w_{ik} = \frac{tf_{ik} \cdot \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 \cdot [\log(N / n_k)]^2}}
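A sketch of this normalization, consistent with the tf_idf helper above (the tf and n_k values are hypothetical):

```python
import math

def normalized_weights(tfs, n_ks, N):
    """Length-normalize a document's tf-idf weights:
    divide each weight by the Euclidean length of the weight vector."""
    raw = [tf * math.log10(N / n_k) for tf, n_k in zip(tfs, n_ks)]
    length = math.sqrt(sum(w * w for w in raw))
    return [w / length for w in raw]

# Document with three terms: tf = 3, 1, 2; document frequencies 20, 5000, 100
print(normalized_weights([3, 1, 2], [20, 5000, 100], N=10000))
```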

Pair-wise Document Similarity

docs  nova  galaxy  heat  hwood  film  role  diet  fur
 A     1      3      1
 B     5      2
 C                          2     1
 D                          4     1

How to compute document similarity?

Pair-wise Document Similarity

D_1 = w_{11}, w_{12}, ..., w_{1t}
D_2 = w_{21}, w_{22}, ..., w_{2t}

sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \cdot w_{2i}

For the table on the previous slide:

sim(A, B) = (1 x 5) + (3 x 2) = 11
sim(A, C) = 0
sim(A, D) = 0
sim(B, C) = 0
sim(B, D) = 0
sim(C, D) = (2 x 4) + (1 x 1) = 9
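The same computation as a sketch over sparse dictionaries (only shared terms contribute):

```python
A = {"nova": 1, "galaxy": 3, "heat": 1}
B = {"nova": 5, "galaxy": 2}
C = {"hwood": 2, "film": 1}
D = {"hwood": 4, "film": 1}

def sim(d1, d2):
    """Unnormalized similarity: inner product over shared terms."""
    return sum(w * d2[t] for t, w in d1.items() if t in d2)

print(sim(A, B), sim(C, D), sim(A, C))  # 11 9 0
```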

Pair-wise Document Similarity
(cosine normalization)

D_1 = w_{11}, w_{12}, ..., w_{1t}
D_2 = w_{21}, w_{22}, ..., w_{2t}

Unnormalized:
sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \cdot w_{2i}

Cosine normalized:
sim(D_1, D_2) = \frac{\sum_{i=1}^{t} w_{1i} \cdot w_{2i}}{\sqrt{\sum_{i=1}^{t} (w_{1i})^2} \cdot \sqrt{\sum_{i=1}^{t} (w_{2i})^2}}
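A sketch of the normalized measure over the same sparse vectors:

```python
import math

def cosine_sim(d1, d2):
    """Inner product over shared terms, divided by the product of vector lengths."""
    dot = sum(w * d2[t] for t, w in d1.items() if t in d2)
    len1 = math.sqrt(sum(w * w for w in d1.values()))
    len2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (len1 * len2)

A = {"nova": 1, "galaxy": 3, "heat": 1}
B = {"nova": 5, "galaxy": 2}
print(round(cosine_sim(A, B), 2))  # 0.62 -- scores now fall in [0, 1]
```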

Vector Space Relevance Measure

D_i = (w_{di1}, w_{di2}, ..., w_{dit})
Q   = (w_{q1}, w_{q2}, ..., w_{qt})

(w = 0 if a term is absent)

If term weights are normalized:

sim(Q, D_i) = \sum_{j=1}^{t} w_{qj} \cdot w_{dij}

Otherwise, normalize in the similarity comparison:

sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj} \cdot w_{dij}}{\sqrt{\sum_{j=1}^{t} (w_{qj})^2} \cdot \sqrt{\sum_{j=1}^{t} (w_{dij})^2}}

Computing Relevance Scores

Say we have query vector Q = (0.4, 0.8)
Also, document D_2 = (0.2, 0.7)
What does their similarity comparison yield?

sim(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.424}} \approx 0.98

Vector Space with Term Weights and Cosine Matching

D_i = (d_{i1}, w_{di1}; d_{i2}, w_{di2}; ...; d_{it}, w_{dit})
Q   = (q_1, w_{q1}; q_2, w_{q2}; ...; q_t, w_{qt})

[Figure: Term A / Term B plane showing Q = (0.4, 0.8), D1 = (0.8, 0.3), and D2 = (0.2, 0.7).]

sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj} \cdot w_{dij}}{\sqrt{\sum_{j=1}^{t} (w_{qj})^2} \cdot \sqrt{\sum_{j=1}^{t} (w_{dij})^2}}

sim(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.424}} \approx 0.98

sim(Q, D_1) = \frac{(0.4 \cdot 0.8) + (0.8 \cdot 0.3)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.8)^2 + (0.3)^2]}} = \frac{0.56}{\sqrt{0.584}} \approx 0.73
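A quick check of both scores:

```python
import math

def cosine(q, d):
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d)))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(Q, D2), 2))  # 0.98 -- D2 points in nearly the same direction as Q
print(round(cosine(Q, D1), 2))  # 0.73
```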

Text Clustering

- Finds overall similarities among groups of documents
- Finds overall similarities among groups of tokens
- Picks out some themes, ignores others

Text Clustering

Clustering is "the art of finding groups in data."
-- Kaufman and Rousseeuw

[Figure: points plotted against Term 1 and Term 2, falling into distinct clusters.]

Problems with Vector Space

- There is no real theoretical basis for the assumption of a term space
  - It is more for visualization than having any real basis
  - Most similarity measures work about the same
- Terms are not really orthogonal dimensions
  - Terms are not independent of all other terms; remember our discussion of correlated terms in text

Probabilistic Models

- A rigorous formal model that attempts to predict the probability that a given document will be relevant to a given query
- Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
- Relies on accurate estimates of probabilities

Probability Ranking Principle

"If a reference retrieval system's response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."

-- Stephen E. Robertson, J. Documentation, 1977

Iterative Query Refinement


Query Modification

- Problem: How can we reformulate the query to help a user who is trying several searches to get at the same information?
- Thesaurus expansion:
  - Suggest terms similar to query terms
- Relevance feedback:
  - Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant

Relevance Feedback

- Main idea:
  - Modify the existing query based on relevance judgements
  - Extract terms from relevant documents and add them to the query
  - AND/OR re-weight the terms already in the query
- There are many variations:
  - Usually positive weights for terms from relevant docs
  - Sometimes negative weights for terms from non-relevant docs
- Users, or the system, guide this process by selecting terms from automatically generated lists

Rocchio Method

- Rocchio automatically
  - Re-weights terms
  - Adds in new terms (from relevant docs)
    - Have to be careful when using negative terms
- Rocchio is not a machine learning algorithm

Rocchio Method

Q_1 = Q_0 + \frac{\beta}{n_1} \sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2} \sum_{i=1}^{n_2} S_i

where
  Q_0 = the vector for the initial query
  R_i = the vector for relevant document i
  S_i = the vector for non-relevant document i
  n_1 = the number of relevant documents chosen
  n_2 = the number of non-relevant documents chosen

\beta and \gamma tune the importance of relevant and non-relevant terms (in some studies it was best to set \beta to 0.75 and \gamma to 0.25).
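A minimal sketch of the update for dense vectors (an implicit coefficient of 1 on Q_0 is assumed, matching the formula above; the judged documents are hypothetical):

```python
def rocchio(q0, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """Q1 = Q0 + (beta / n1) * sum(Ri) - (gamma / n2) * sum(Si), component-wise."""
    def scaled_mean(vectors, coeff):
        if not vectors:
            return [0.0] * len(q0)
        return [coeff * sum(col) / len(vectors) for col in zip(*vectors)]
    r = scaled_mean(relevant, beta)
    s = scaled_mean(nonrelevant, gamma)
    return [q + ri - si for q, ri, si in zip(q0, r, s)]

# One relevant and one non-relevant judged document
print(rocchio([0.7, 0.3], relevant=[[0.2, 0.8]], nonrelevant=[[0.9, 0.1]]))
# [0.625, 0.875] -- the query moves toward the relevant document
```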

Rocchio/Vector Illustration

Q0 = "retrieval of information" = (0.7, 0.3)
D1 = "information science"      = (0.2, 0.8)
D2 = "retrieval systems"        = (0.9, 0.1)

Q'  = \alpha Q_0 + \beta D_1 = (0.45, 0.55)
Q'' = \alpha Q_0 + \beta D_2 = (0.80, 0.20)
(Here \alpha = \beta = 0.5.)

[Figure: Q0, D1, D2, Q', and Q'' plotted in the Information (y) / Retrieval (x) plane.]

Alternative Notions of Relevance Feedback

- Find people whose taste is similar to yours.
  - Will you like what they like?
- Follow a user's actions in the background.
  - Can this be used to predict what the user will want to see next?
- Track what lots of people are doing.
  - Does this implicitly indicate what they think is good and not good?

Collaborative Filtering (Social Filtering)

- "If Pam liked the paper, I'll like the paper"
- "If you liked Star Wars, you'll like Independence Day"
- Rating based on ratings of similar people
  - Ignores text, so it also works on sound, pictures, etc.
  - But: initial users can bias the ratings of future users

                  Sally  Bob  Chris  Lynn  Karen
Star Wars           7     7     3      4     7
Jurassic Park       6     4     7      4     4
Terminator II       3     4     7      6     3
Independence Day    7     7     2      2     ?

Ringo Collaborative Filtering

- Users rate items from like to dislike
  - 7 = like; 4 = ambivalent; 1 = dislike
  - A normal distribution; the extremes are what matter
- Nearest Neighbors strategy: find similar users and predict a (weighted) average of their ratings
- Pearson Algorithm: weight by the degree of correlation between user U and user J
  - 1 means similar, 0 means no correlation, -1 means dissimilar
  - Works better to compare against the ambivalent rating (4) than against the individual's average score

r_{UJ} = \frac{\sum (U - \bar{U})(J - \bar{J})}{\sqrt{\sum (U - \bar{U})^2 \cdot \sum (J - \bar{J})^2}}
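A sketch of the Pearson weighting over the ratings table above, with deviations taken from the ambivalent rating of 4 (per the slide) rather than from each user's mean:

```python
import math

def pearson(u, j, pivot=4.0):
    """Correlation of two rating vectors, measured around the 'ambivalent' pivot."""
    du = [x - pivot for x in u]
    dj = [x - pivot for x in j]
    num = sum(a * b for a, b in zip(du, dj))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dj))
    return num / den if den else 0.0

# First three films only, since Karen has not rated Independence Day
karen, sally, chris = [7, 4, 3], [7, 6, 3], [3, 7, 7]
print(round(pearson(karen, sally), 2))  # high positive: Sally's taste resembles Karen's
print(round(pearson(karen, chris), 2))  # negative: Chris's taste is dissimilar
```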
