Documente Academic
Documente Profesional
Documente Cultură
Document Vectors
Document Vectors:
One location for each word.
A
B
C
D
E
F
G
H
I
nova
10
5
galaxy heat
5
3
10
hwood film
diet
fur
10 in text
5
Nova occurs9 10 times
A
Galaxy occurs 5 times in text A 10
Heat occurs 3 times in text A 9
7 (Blank means 0 occurrences.)
9
10
10
10
10
2
7
8
5
role
Document Vectors
Document ids
A
B
C
D
E
F
G
H
I
nova
10
5
galaxy heat
5
3
10
hwood film
10
9
7
6
10
2
7
8
10
role
diet
fur
10
9
10
10
7
5
9
8
5
Diet
Assumption: Documents that are close in space are similar.
Database Management Systems, R. Ramakrishnan
t1
1
1
0
1
1
1
0
0
0
0
1
1
q1
t2
0
0
1
0
1
1
1
1
0
1
0
2
q2
t3
1
0
1
0
1
0
0
0
1
1
1
3
q3
t1
RSV=Q.Di
4
1
5
1
6
3
2
2
3
5
3
t3
D9
D2
D1
D4
D11
D5
D3
D6
D10
D7
D8
t2
Binary Weights
t1
1
1
0
1
1
1
0
0
0
0
1
t2
0
0
1
0
1
1
1
1
0
1
0
t3
1
0
1
0
1
0
0
0
1
1
1
t1
2
1
0
3
1
3
0
0
0
0
4
t2
0
0
4
0
6
5
8
10
0
3
0
t3
3
0
7
0
3
0
0
0
1
5
1
10
TF x IDF Weights
tf x idf measure:
Term Frequency (tf)
Inverse Document Frequency (idf) -- a way to deal
with the problems of the Zipf distribution
11
TF x IDF Calculation
wik = tf ik * log( N / nk )
Tk = term k in document Di
tf ik = frequency of term Tk in document Di
idf k = inverse document frequency of term Tk in C
N = total number of documents in the collection C
nk = the number of documents in C that contain Tk
idf k = log N
nk
Database Management Systems, R. Ramakrishnan
12
For a
collection
of 10000
documents
10000
log
=0
10000
10000
log
= 0.301
5000
10000
log
= 2.698
20
10000
log
=4
1
13
TF x IDF Normalization
wik =
ik
14
nova
1
5
galaxy
3
2
heat
1
hwood
film
role
2
4
1
1
diet
fur
15
sim( A, C ) = 0
sim( A, D) = 0
sim( B, C ) = 0
sim( B, D) = 0
sim(C , D) = ( 2 4) + (1 1) = 9
i =1
A
B
C
D
nova
1
5
galaxy heat
3
1
2
hwood film
2
4
role
1
1
diet
fur
16
sim( D1 , D2 ) =
1i
w2i
i =1
(w
2
1i )
i =1
cosine normalized
(w
2
2i )
i =1
17
w = 0 if a term is absent
sim(Q, Di ) = wqj wd ij
j =1
sim(Q, Di ) =
w
j =1
(w
j =1
qj
qj
wd ij
)2
(w
j =1
d ij
)2
18
0.64
= 0.98
0.42
19
Term B
1.0
0.8
0.6
D2
Q = (0.4,0.8)
D1=(0.8,0.3)
D2=(0.2,0.7)
sim(Q, D 2) =
0.4
0.2
0
D1
1
0.2
=
0.4 0.6
Term A
0.8
sim(Q, Di ) =
1.0
sim(Q, D1 ) =
j =1
wq j wd ij
( wq j ) 2 j =1 ( wdij ) 2
j =1
.56
= 0.74
0.58
20
Text Clustering
21
Text Clustering
Clustering is
The art of finding groups in data.
-- Kaufmann and Rousseeu
Term 1
Term
2
22
23
Probabilistic Models
24
25
26
Query Modification
27
Relevance Feedback
Main Idea:
Modify existing query based on relevance
judgements
Extract terms from relevant documents and add
them to the query
AND/OR re-weight the terms already in the
query
Database
Management Systems, R. Ramakrishnan
28
Rocchio Method
Rocchio automatically
Re-weights terms
Adds in new terms (from relevant docs)
have to be careful when using negative terms
Rocchio is not a machine learning algorithm
29
Rocchio Method
Q1 = Q0 +
n1
n1
n2
R n
i =1
2 i =1
Si
where
Q0 = the vector for the initial query
Ri = the vector for the relevant document i
Si = the vector for the non - relevant document i
n1 = the number of relevant documents chosen
n2 = the number of non - relevant documents chosen
30
Rocchio/Vector Illustration
Information
1.0
D1
Q
0.5
Q = *Q0+ * D1 = (0.45,0.55)
Q = *Q0+ * D2 = (0.80,0.20)
Q0
Q
0
Database Management Systems, R. Ramakrishnan
0.5
Retrieval
D2
1.0
31
32
Sally
Star Wars
7
Jurassic Park
6
Terminator II
3
Independence Day 7
Database Management Systems, R. Ramakrishnan
Bob
7
4
4
7
Chris
3
7
7
2
Lynn
4
4
6
2
Karen
7
4
3
?
33
rUJ =
(U U)(J J )
(U U) (J J )
2
2
34