Dmitriy Selivanov
2017-08-08
Document similarity
Document similarity (or distance between documents) is one of the central themes in
Information Retrieval. How do humans usually decide whether documents are similar?
Usually documents are treated as similar if they are semantically close and describe
similar concepts. On the other hand, "similarity" can also be used in the context of
duplicate detection. We will review several common approaches.
API
The text2vec package provides two sets of functions for measuring various distances/similarities in a
unified way. All methods are written with special attention to computational performance and
memory efficiency.
1. sim2(x, y, method) - calculates similarity between each row of matrix x and each
row of matrix y using given method.
2. psim2(x, y, method) - calculates parallel similarity between rows of
matrix x and corresponding rows of matrix y using given method.
3. dist2(x, y, method) - calculates distance/dissimilarity between each row of
matrix x and each row of matrix y using given method.
4. pdist2(x, y, method) - calculates parallel distance/dissimilarity between rows of
matrix x and corresponding rows of matrix y using given method.
Methods have the suffix 2 in their names because, in contrast to the base dist() function, they work
with two matrices instead of one.
The following methods are implemented at the moment:
1. Jaccard distance
2. Cosine distance
3. Euclidean distance
4. Relaxed Word Mover’s Distance
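The difference between the pairwise and the "parallel" variants can be illustrated with a toy example (the matrices here are made up for illustration):

```r
library(text2vec)

# two small dense matrices with the same number of columns (toy data)
m1 = matrix(runif(6), nrow = 3)
m2 = matrix(runif(8), nrow = 4)

# sim2: every row of m1 against every row of m2
s = sim2(m1, m2, method = "cosine", norm = "l2")
dim(s)  # 3 x 4 matrix of similarities

# psim2: corresponding rows only, so shapes must be identical
p = psim2(m1, m2[1:3, ], method = "cosine", norm = "l2")
length(p)  # vector of 3 similarities, one per row pair
```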
Practical examples
As usual, we will use the built-in text2vec::movie_review dataset. Let's clean it up a little:
library(stringr)
library(text2vec)
data("movie_review")
# select 500 rows for faster running times
movie_review = movie_review[1:500, ]
prep_fun = function(x) {
x %>%
# make text lower case
str_to_lower %>%
# remove non-alphanumeric symbols
str_replace_all("[^[:alnum:]]", " ") %>%
# collapse multiple spaces
str_replace_all("\\s+", " ")
}
movie_review$review_clean = prep_fun(movie_review$review)
Now let’s define two sets of documents on which we will evaluate our distance models:
doc_set_1 = movie_review[1:300, ]
it1 = itoken(doc_set_1$review_clean, progressbar = FALSE)
We will compare documents in a vector space, so we need to define a common space and project the
documents into it. We will use vocabulary-based vectorization for better
interpretability:
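The vectorization code appears to have been dropped from this page; below is a minimal sketch of the standard text2vec pipeline. The second document set and its 301:500 split are assumptions inferred from the 300 x 200 result shown later:

```r
# second set of documents (assumed split, matching the 300 x 200 result below)
doc_set_2 = movie_review[301:500, ]
it2 = itoken(doc_set_2$review_clean, progressbar = FALSE)

# define the common space: vocabulary built on the first set
v = create_vocabulary(it1)
vectorizer = vocab_vectorizer(v)

# project both document sets into this common space
dtm1 = create_dtm(it1, vectorizer)
dtm2 = create_dtm(it2, vectorizer)
```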
Jaccard similarity
Jaccard similarity is a simple but intuitive measure of similarity between two sets.
$$J(doc_1, doc_2) = \frac{|doc_1 \cap doc_2|}{|doc_1 \cup doc_2|}$$
For documents, we measure it as the proportion of the number of common words to the number of unique
words in both documents. In NLP, Jaccard similarity can be particularly useful for
duplicate detection. text2vec, however, provides a generic, efficient implementation which can be used
in many other applications.
To calculate Jaccard similarity between 2 sets of documents, the user has to provide a DTM for
each of them (the DTMs should be in the same vector space!):
Once we have representations of the documents in vector space, we are almost done. One thing
remains - calling sim2():
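The call itself seems to be missing from this page; a sketch, assuming dtm1 and dtm2 are the two DTMs built in the same vector space:

```r
# jaccard is computed on sets of terms, so no row normalization is needed
d1_d2_jac_sim = sim2(dtm1, dtm2, method = "jaccard", norm = "none")
```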
Check result:
dim(d1_d2_jac_sim)
## [1] 300 200
d1_d2_jac_sim[1:2, 1:5]
## 2 x 5 sparse Matrix of class "dgCMatrix"
## 1 2 3 4 5
## 1 0.02142857 . 0.02362205 0.007575758 0.02597403
## 2 0.01219512 . 0.02941176 0.013888889 0.02083333
We can also compute "parallel" similarity - similarity between corresponding rows of two matrices
(the matrices should have identical shapes):
dtm1_2 = dtm1[1:200, ]
dtm2_2 = dtm2[1:200, ]
d1_d2_jac_psim = psim2(dtm1_2, dtm2_2, method = "jaccard", norm = "none")
str(d1_d2_jac_psim)
## Named num [1:200] 0.02143 0 0.00735 0 0.03311 ...
## - attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
Cosine similarity
Another widely used measure is cosine similarity - the cosine of the angle between two document vectors. It is computed with the same sim2() call:
d1_d2_cos_sim = sim2(dtm1, dtm2, method = "cosine", norm = "l2")
Check result:
dim(d1_d2_cos_sim)
## [1] 300 200
d1_d2_cos_sim[1:2, 1:5]
## 2 x 5 sparse Matrix of class "dgCMatrix"
## 1 2 3 4 5
## 1 0.02703999 . 0.05063299 0.009500143 0.02753954
## 2 0.02455143 . 0.06567587 0.034503278 0.04000800
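The next example operates on a matrix dtm_tfidf_lsa that is never constructed on this page. Here is a sketch of one plausible construction, as the name suggests - TF-IDF weighting followed by an LSA projection; the full 500-document DTM and the number of LSA topics are assumptions:

```r
# vectorize all 500 reviews in a single space (assumed setup)
it_all = itoken(movie_review$review_clean, progressbar = FALSE)
vectorizer = vocab_vectorizer(create_vocabulary(it_all))
dtm = create_dtm(it_all, vectorizer)

# TF-IDF weighting, then LSA projection to a dense low-dimensional space
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)
lsa = LSA$new(n_topics = 100)
dtm_tfidf_lsa = fit_transform(dtm_tfidf, lsa)
```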
x = dtm_tfidf_lsa[1:250, ]
y = dtm_tfidf_lsa[251:500, ]
head(psim2(x = x, y = y, method = "cosine", norm = "l2"))
## 1 2 3 4 5 6
## 0.11315322 0.12302464 0.19952480 0.07238329 0.18239954 0.02230496
Euclidean distance
Euclidean distance is not as useful in NLP as Jaccard or Cosine similarity, but it is always
worth trying different measures. In text2vec it can be computed only on dense matrices; here is
an example:
x = dtm_tfidf_lsa[1:300, ]
y = dtm_tfidf_lsa[1:200, ]
m1 = dist2(x, y, method = "euclidean")
We can also apply different row normalization techniques (the default "l2" was used in the example
above):
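The code for this step is cut off here; a sketch of what it likely shows (the variable name and the choice of norm = "none" are assumptions):

```r
# same euclidean distance, but without row normalization
m1_no_norm = dist2(x, y, method = "euclidean", norm = "none")
```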