
Documents similarity

Dmitriy Selivanov
2017-08-08

Documents similarity
Document similarity (or distance between documents) is one of the central themes in
Information Retrieval. How do humans usually decide whether two documents are similar?
Typically, documents are treated as similar if they are semantically close and describe similar
concepts. On the other hand, "similarity" can also be used in the context of duplicate
detection. We will review several common approaches.

API
The text2vec package provides two sets of functions for measuring various distances/similarities
in a unified way. All methods are written with special attention to computational performance
and memory efficiency. A toy sketch of the calling conventions follows the list below.

1. sim2(x, y, method) - calculates similarity between each row of matrix x and each
row of matrix y using the given method.
2. psim2(x, y, method) - calculates parallel similarity between rows of
matrix x and the corresponding rows of matrix y using the given method.
3. dist2(x, y, method) - calculates distance/dissimilarity between each row of
matrix x and each row of matrix y using the given method.
4. pdist2(x, y, method) - calculates parallel distance/dissimilarity between rows of
matrix x and the corresponding rows of matrix y using the given method.
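As promised above, here is a minimal sketch of the sim2()/psim2() contract on small dense
matrices (the values are random and purely illustrative):

# sim2() compares every row of x with every row of y;
# psim2() compares row i of x only with row i of y.
x = matrix(runif(12), nrow = 3)  # 3 "documents" in a 4-dimensional space
y = matrix(runif(12), nrow = 3)  # same shape as x
sim2(x, y, method = "cosine", norm = "l2")   # 3 x 3 similarity matrix
psim2(x, y, method = "cosine", norm = "l2")  # numeric vector of length 3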

The methods have the suffix 2 in their names because, in contrast to the base dist() function,
they work with two matrices instead of one.
The following methods are implemented at the moment:

1. Jaccard distance
2. Cosine distance
3. Euclidean distance
4. Relaxed Word Mover’s Distance

Practical examples
As usual we will use the built-in text2vec::movie_review dataset. Let's clean it up a little:

library(stringr)
library(text2vec)
data("movie_review")
# select 500 rows for faster running times
movie_review = movie_review[1:500, ]
prep_fun = function(x) {
  x %>%
    # make text lower case
    str_to_lower %>%
    # remove non-alphanumeric symbols
    str_replace_all("[^[:alnum:]]", " ") %>%
    # collapse multiple spaces
    str_replace_all("\\s+", " ")
}
movie_review$review_clean = prep_fun(movie_review$review)

Now let’s define two sets of documents on which we will evaluate our distance models:

doc_set_1 = movie_review[1:300, ]
it1 = itoken(doc_set_1$review_clean, progressbar = FALSE)

# deliberately take a different number of docs in the second set
doc_set_2 = movie_review[301:500, ]
it2 = itoken(doc_set_2$review_clean, progressbar = FALSE)

We will compare documents in a vector space, so we need to define a common space and project
the documents into it. We will use vocabulary-based vectorization for better
interpretability:

it = itoken(movie_review$review_clean, progressbar = FALSE)


v = create_vocabulary(it) %>%
  prune_vocabulary(doc_proportion_max = 0.1, term_count_min = 5)
vectorizer = vocab_vectorizer(v)

Jaccard similarity
Jaccard similarity is a simple but intuitive measure of similarity between two sets.
J(doc_1, doc_2) = \frac{|doc_1 \cap doc_2|}{|doc_1 \cup doc_2|}
For documents we measure it as the proportion of common words to the number of unique words
across both documents. In NLP, Jaccard similarity is particularly useful for duplicate
detection. text2vec, however, provides a generic and efficient implementation which can be used
in many other applications.
To calculate Jaccard similarity between two sets of documents, the user has to provide a DTM
for each of them (the DTMs must be in the same vector space!):
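As a quick hand check of the formula, consider two tiny made-up "documents" represented as
word sets:

a = c("the", "cat", "sat")
b = c("the", "cat", "ran")
# 2 common words ("the", "cat") out of 4 unique words in total
length(intersect(a, b)) / length(union(a, b))  # 0.5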

# they will be in the same space because we use the same vectorizer
# hash_vectorizer will also work fine
dtm1 = create_dtm(it1, vectorizer)
dim(dtm1)
## [1] 300 2338
dtm2 = create_dtm(it2, vectorizer)
dim(dtm2)
## [1] 200 2338

Once we have a representation of the documents in vector space we are almost done. One thing
remains: call sim2():

d1_d2_jac_sim = sim2(dtm1, dtm2, method = "jaccard", norm = "none")

Check result:

dim(d1_d2_jac_sim)
## [1] 300 200
d1_d2_jac_sim[1:2, 1:5]
## 2 x 5 sparse Matrix of class "dgCMatrix"
## 1 2 3 4 5
## 1 0.02142857 . 0.02362205 0.007575758 0.02597403
## 2 0.01219512 . 0.02941176 0.013888889 0.02083333

We can also compute "parallel" similarity - the similarity between corresponding rows of the
matrices (the matrices must have identical shapes):

dtm1_2 = dtm1[1:200, ]
dtm2_2 = dtm2[1:200, ]
d1_d2_jac_psim = psim2(dtm1_2, dtm2_2, method = "jaccard", norm = "none")
str(d1_d2_jac_psim)
## Named num [1:200] 0.02143 0 0.00735 0 0.03311 ...
## - attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
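
As a consistency check (a sketch, densifying only for the comparison), psim2() should match the
diagonal of the corresponding sim2() result:

all.equal(unname(d1_d2_jac_psim),
          diag(as.matrix(sim2(dtm1_2, dtm2_2, method = "jaccard", norm = "none"))))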

We define Jaccard distance, or Jaccard dissimilarity, as 1 - similarity(doc_1, doc_2).
sim2() and psim2() have corresponding companion functions dist2() and pdist2() which compute
dissimilarity. Note, however, that in many cases the similarity between documents is 0;
sim2() exploits this, so the result matrix will be sparse. Use dist2() on large sparse
matrices carefully, since 1 - 0 = 1 makes the result dense.
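
For completeness, a sketch of the companion call on our two DTMs (small enough here that
density is not a concern):

d1_d2_jac_dist = dist2(dtm1, dtm2, method = "jaccard", norm = "none")
dim(d1_d2_jac_dist)  # 300 x 200, entries equal to 1 - similarity
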
Cosine similarity
A classical approach from computational linguistics is to measure similarity based on the
content overlap between documents. For this we represent documents as bags-of-words, so each
document is a sparse vector, and we define the measure of overlap as the angle between the
vectors:
similarity(doc_1, doc_2) = \cos(\theta) = \frac{doc_1 \cdot doc_2}{\|doc_1\| \|doc_2\|}
By cosine distance/dissimilarity we mean the following:
distance(doc_1, doc_2) = 1 - similarity(doc_1, doc_2)
It is important to note, however, that this is not a proper distance metric in a mathematical
sense as it does not have the triangle inequality property and it violates the coincidence axiom.
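A quick illustration of the coincidence-axiom violation (toy vectors, made up for this sketch):
two distinct vectors pointing in the same direction have cosine distance 0.

u = matrix(c(1, 2), nrow = 1)
v = matrix(c(2, 4), nrow = 1)  # v = 2 * u: same direction, different vector
dist2(u, v, method = "cosine", norm = "l2")  # 0 although u != v
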
The calculation of cosine similarity mirrors that of Jaccard similarity:

d1_d2_cos_sim = sim2(dtm1, dtm2, method = "cosine", norm = "l2")

Check result:

dim(d1_d2_cos_sim)
## [1] 300 200
d1_d2_cos_sim[1:2, 1:5]
## 2 x 5 sparse Matrix of class "dgCMatrix"
## 1 2 3 4 5
## 1 0.02703999 . 0.05063299 0.009500143 0.02753954
## 2 0.02455143 . 0.06567587 0.034503278 0.04000800

Cosine similarity with Tf-Idf


It can be useful to measure similarity not on the vanilla bag-of-words matrix but on a
transformed one. One option is to apply a tf-idf transformation. First let's create the
tf-idf model:

dtm = create_dtm(it, vectorizer)
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)
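
The fitted model can also be reused on the two smaller document sets from above, keeping both
in the same weighted space (a sketch; the dtm1_tfidf/dtm2_tfidf names are illustrative):

dtm1_tfidf = transform(dtm1, tfidf)  # reuse the idf weights fitted above
dtm2_tfidf = transform(dtm2, tfidf)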

Calculate similarities between all rows of the dtm_tfidf matrix:

d1_d2_tfidf_cos_sim = sim2(x = dtm_tfidf, method = "cosine", norm = "l2")
d1_d2_tfidf_cos_sim[1:2, 1:5]
## 2 x 5 sparse Matrix of class "dgCMatrix"
## 1 2 3 4 5
## 1 1.000000000 0.007885415 0.02380155 0.02864296 0.01510648
## 2 0.007885415 1.000000000 0.01120455 . .
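
A common follow-up is to look up, for each document, its nearest neighbour by tf-idf cosine
similarity (a sketch; densifying is acceptable here because the matrix is only 500 x 500):

sim = as.matrix(d1_d2_tfidf_cos_sim)
diag(sim) = 0  # exclude trivial self-similarity
head(apply(sim, 1, which.max))  # index of the most similar other document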

Cosine similarity with LSA


Tf-idf and bag-of-words matrices usually contain a lot of noise. Applying an LSA model can
help with this problem, so you can achieve better-quality similarities:

lsa = LSA$new(n_topics = 100)
dtm_tfidf_lsa = fit_transform(dtm_tfidf, lsa)
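
As with tf-idf, the fitted LSA model can be applied to other data in the same pipeline, e.g.
projecting the two smaller DTMs into the same latent space (illustrative names, a hedged
sketch):

dtm1_tfidf_lsa = transform(transform(dtm1, tfidf), lsa)
dtm2_tfidf_lsa = transform(transform(dtm2, tfidf), lsa)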

Calculate similarities between all rows of the dtm_tfidf_lsa matrix:

d1_d2_tfidf_cos_sim = sim2(x = dtm_tfidf_lsa, method = "cosine", norm = "l2")
d1_d2_tfidf_cos_sim[1:2, 1:5]
## 1 2 3 4 5
## 1 1.0000000 0.1131078 0.2441785 0.22306293 0.2737060
## 2 0.1131078 1.0000000 0.1196555 -0.02698341 -0.0475149

And “parallel” similarities:

x = dtm_tfidf_lsa[1:250, ]
y = dtm_tfidf_lsa[251:500, ]
head(psim2(x = x, y = y, method = "cosine", norm = "l2"))
## 1 2 3 4 5 6
## 0.11315322 0.12302464 0.19952480 0.07238329 0.18239954 0.02230496

Euclidean distance
Euclidean distance is not as useful in NLP as Jaccard or cosine similarity, but it is always
worth trying different measures. In text2vec it can be computed only on dense matrices; here
is an example:

x = dtm_tfidf_lsa[1:300, ]
y = dtm_tfidf_lsa[1:200, ]
m1 = dist2(x, y, method = "euclidean")

We can also apply different row normalization techniques (the default in the example above
was "l2"):

m2 = dist2(x, y, method = "euclidean", norm = "l1")
m3 = dist2(x, y, method = "euclidean", norm = "none")
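
A hedged sanity check ties the two measures together: on l2-normalized rows, squared Euclidean
distance and cosine similarity satisfy d^2 = 2 - 2 * cos, so m1 carries the same information
as the cosine result:

cos_xy = sim2(x, y, method = "cosine", norm = "l2")
max(abs(m1^2 - (2 - 2 * cos_xy)))  # ~0 up to floating-point error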
