
A Survey of Binary Similarity and Distance Measures

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert


Department of Computer Science, Pace University
New York, US

ABSTRACT

The binary feature vector is one of the most common representations of patterns, and similarity and distance measures play a critical role in many problems such as clustering and classification. Ever since Jaccard proposed a similarity measure to classify ecological species in 1901, numerous binary similarity and distance measures have been proposed in various fields. Applying appropriate measures results in more accurate data analysis. Nevertheless, few comprehensive surveys on binary measures have been conducted. Hence we collected 76 binary similarity and distance measures used over the last century and reveal their correlations through the hierarchical clustering technique.

Keywords: binary similarity measure, binary distance measure, hierarchical clustering, classification, operational taxonomic unit

1. INTRODUCTION

Binary similarity and dissimilarity (distance) measures play a critical role in pattern analysis problems such as classification and clustering. Since performance relies on the choice of an appropriate measure, researchers have made elaborate efforts for over a hundred years to find the most meaningful binary similarity and distance measures. Numerous binary similarity measures and distance measures have been proposed in various fields.

For example, the Jaccard similarity measure was used for clustering ecological species [20], and Forbes proposed a coefficient for clustering ecologically related species [13, 14]. The binary similarity measures were subsequently applied in biology [19, 23], ethnology [8], taxonomy [27], image retrieval [25], geology [24], and chemistry [29]. Recently, they have been actively used to solve identification problems in biometrics such as fingerprint [30], iris images [4], and handwritten character recognition [2, 3]. Many papers [7, 16, 17, 18, 19, 22, 26] discuss their properties and features.

Even though numerous binary similarity measures have been described in the literature, only a few comparative studies have collected a wide variety of binary similarity measures [4, 5, 19, 21, 28, 30, 31]. Hubalek collected 43 similarity measures, and 20 of them were used for cluster analysis on fungi data to produce five clusters of related coefficients [19]. Jackson et al. compared eight binary similarity measures to choose the best measure for ecological data on 25 fish species [21]. Tubbs summarized seven conventional similarity measures to solve the template matching problem [28], and Zhang et al. compared those seven measures to show the recognition capability in handwriting identification [31]. Willett evaluated 13 similarity measures for binary fingerprint code [30]. Cha et al. proposed weighted binary measurement to improve classification performance based on the comparative study [4].

Few studies, however, have enumerated or grouped the existing binary measures. The number of similarity or dissimilarity measures was often limited to those provided by several commercial statistical cluster analysis tools. We collected and analyzed 76 binary similarity and distance measures used over the last century, providing the most extensive survey on these measures.

This paper is organized as follows. Section 2 describes the definitions of 76 binary similarity and dissimilarity measures. Section 3 discusses the grouping of those measures using hierarchical clustering. Section 4 concludes this work.

2. DEFINITIONS

Table 1 OTUs Expression of Binary Instances i and j

j \ i            1 (Presence)                 0 (Absence)                        Sum
1 (Presence)     $a = i \cdot j$              $b = \bar{i} \cdot j$              a+b
0 (Absence)      $c = i \cdot \bar{j}$        $d = \bar{i} \cdot \bar{j}$        c+d
Sum              a+c                          b+d                                n=a+b+c+d

Suppose that two objects or patterns, i and j, are represented in binary feature vector form. Let n be the number of features (attributes), or the dimension of the feature vector. Definitions of binary similarity and distance measures are expressed by Operational Taxonomic Units (OTUs, as shown in Table 1) [9] in a 2 x 2 contingency table, where a is the number of features where the values of i and j are both 1 (or presence), meaning 'positive matches', b is the number of attributes where the value of i and j is (0,1), meaning 'i absence mismatches', c is the number of attributes where the value of i and j is (1,0), meaning 'j absence mismatches', and d is the number of attributes where both i and j have 0 (or absence), meaning 'negative matches'. The diagonal sum a+d represents the total number of matches between

ISSN: 1690-4524 SYSTEMICS, CYBERNETICS AND INFORMATICS VOLUME 8 - NUMBER 1 - YEAR 2010 43
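i and j, and the other diagonal sum b+c represents the total number of mismatches between i and j. The total sum of the 2x2 table, a+b+c+d, is always equal to n.

Table 2 [5] lists definitions of 76 binary similarity and distance measures used over the last century, where S and D are similarity and distance measures, respectively.

Table 2 Definitions of Measures for binary data

$$ S_{\text{JACCARD}} = \frac{a}{a+b+c} \tag{1} $$
$$ S_{\text{DICE}} = \frac{2a}{2a+b+c} \tag{2} $$
$$ S_{\text{CZEKANOWSKI}} = \frac{2a}{2a+b+c} \tag{3} $$
$$ S_{\text{3W-JACCARD}} = \frac{3a}{3a+b+c} \tag{4} $$
$$ S_{\text{NEI\&LI}} = \frac{2a}{(a+b)+(a+c)} \tag{5} $$
$$ S_{\text{SOKAL\&SNEATH-I}} = \frac{a}{a+2b+2c} \tag{6} $$
$$ S_{\text{SOKAL\&MICHENER}} = \frac{a+d}{a+b+c+d} \tag{7} $$
$$ S_{\text{SOKAL\&SNEATH-II}} = \frac{2(a+d)}{2a+b+c+2d} \tag{8} $$
$$ S_{\text{ROGER\&TANIMOTO}} = \frac{a+d}{a+2(b+c)+d} \tag{9} $$
$$ S_{\text{FAITH}} = \frac{a+0.5d}{a+b+c+d} \tag{10} $$
$$ S_{\text{GOWER\&LEGENDRE}} = \frac{a+d}{a+0.5(b+c)+d} \tag{11} $$
$$ S_{\text{INTERSECTION}} = a \tag{12} $$
$$ S_{\text{INNERPRODUCT}} = a+d \tag{13} $$
$$ S_{\text{RUSSELL\&RAO}} = \frac{a}{a+b+c+d} \tag{14} $$
$$ D_{\text{HAMMING}} = b+c \tag{15} $$
$$ D_{\text{EUCLID}} = \sqrt{b+c} \tag{16} $$
$$ D_{\text{SQUARED-EUCLID}} = \sqrt{(b+c)^{2}} \tag{17} $$
$$ D_{\text{CANBERRA}} = (b+c)^{2/2} \tag{18} $$
$$ D_{\text{MANHATTAN}} = b+c \tag{19} $$
$$ D_{\text{MEAN-MANHATTAN}} = \frac{b+c}{a+b+c+d} \tag{20} $$
$$ D_{\text{CITYBLOCK}} = b+c \tag{21} $$
$$ D_{\text{MINKOWSKI}} = (b+c)^{1/1} \tag{22} $$
$$ D_{\text{VARI}} = \frac{b+c}{4(a+b+c+d)} \tag{23} $$
$$ D_{\text{SIZEDIFFERENCE}} = \frac{(b+c)^{2}}{(a+b+c+d)^{2}} \tag{24} $$
$$ D_{\text{SHAPEDIFFERENCE}} = \frac{n(b+c)-(b-c)^{2}}{(a+b+c+d)^{2}} \tag{25} $$
$$ D_{\text{PATTERNDIFFERENCE}} = \frac{4bc}{(a+b+c+d)^{2}} \tag{26} $$
$$ D_{\text{LANCE\&WILLIAMS}} = \frac{b+c}{2a+b+c} \tag{27} $$
$$ D_{\text{BRAY\&CURTIS}} = \frac{b+c}{2a+b+c} \tag{28} $$
$$ D_{\text{HELLINGER}} = 2\sqrt{1-\frac{a}{\sqrt{(a+b)(a+c)}}} \tag{29} $$
$$ D_{\text{CHORD}} = \sqrt{2\left(1-\frac{a}{\sqrt{(a+b)(a+c)}}\right)} \tag{30} $$
$$ S_{\text{COSINE}} = \frac{a}{\left(\sqrt{(a+b)(a+c)}\right)^{2}} \tag{31} $$
$$ S_{\text{GILBERT\&WELLS}} = \log a - \log n - \log\left(\frac{a+b}{n}\right) - \log\left(\frac{a+c}{n}\right) \tag{32} $$
$$ S_{\text{OCHIAI-I}} = \frac{a}{\sqrt{(a+b)(a+c)}} \tag{33} $$
$$ S_{\text{FORBES-I}} = \frac{na}{(a+b)(a+c)} \tag{34} $$
$$ S_{\text{FOSSUM}} = \frac{n(a-0.5)^{2}}{(a+b)(a+c)} \tag{35} $$
$$ S_{\text{SORGENFREI}} = \frac{a^{2}}{(a+b)(a+c)} \tag{36} $$
$$ S_{\text{MOUNTFORD}} = \frac{a}{0.5(ab+ac)+bc} \tag{37} $$
$$ S_{\text{OTSUKA}} = \frac{a}{((a+b)(a+c))^{0.5}} \tag{38} $$
$$ S_{\text{MCCONNAUGHEY}} = \frac{a^{2}-bc}{(a+b)(a+c)} \tag{39} $$
$$ S_{\text{TARWID}} = \frac{na-(a+b)(a+c)}{na+(a+b)(a+c)} \tag{40} $$
$$ S_{\text{KULCZYNSKI-II}} = \frac{\frac{a}{2}(2a+b+c)}{(a+b)(a+c)} \tag{41} $$
$$ S_{\text{DRIVER\&KROEBER}} = \frac{a}{2}\left(\frac{1}{a+b}+\frac{1}{a+c}\right) \tag{42} $$
$$ S_{\text{JOHNSON}} = \frac{a}{a+b}+\frac{a}{a+c} \tag{43} $$
$$ S_{\text{DENNIS}} = \frac{ad-bc}{\sqrt{n(a+b)(a+c)}} \tag{44} $$
$$ S_{\text{SIMPSON}} = \frac{a}{\min(a+b,\,a+c)} \tag{45} $$
$$ S_{\text{BRAUN\&BANQUET}} = \frac{a}{\max(a+b,\,a+c)} \tag{46} $$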

$$ S_{\text{FAGER\&McGOWAN}} = \frac{a}{\sqrt{(a+b)(a+c)}} - \frac{\max(a+b,\,a+c)}{2} \tag{47} $$
$$ S_{\text{FORBES-II}} = \frac{na-(a+b)(a+c)}{n\min(a+b,\,a+c)-(a+b)(a+c)} \tag{48} $$
$$ S_{\text{SOKAL\&SNEATH-IV}} = \frac{\frac{a}{a+b}+\frac{a}{a+c}+\frac{d}{b+d}+\frac{d}{c+d}}{4} \tag{49} $$
$$ S_{\text{GOWER}} = \frac{a+d}{\sqrt{(a+b)(a+c)(b+d)(c+d)}} \tag{50} $$
$$ S_{\text{PEARSON-I}} = \chi^{2} \quad\text{where}\quad \chi^{2} = \frac{n(ad-bc)^{2}}{(a+b)(a+c)(c+d)(b+d)} \tag{51} $$
$$ S_{\text{PEARSON-II}} = \left(\frac{\chi^{2}}{n+\chi^{2}}\right)^{1/2} \tag{52} $$
$$ S_{\text{PEARSON-III}} = \left(\frac{\rho}{n+\rho}\right)^{1/2} \quad\text{where}\quad \rho = \frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}} \tag{53} $$
$$ S_{\text{PEARSON\&HERON-I}} = \frac{ad-bc}{\sqrt{(a+b)(a+c)(b+d)(c+d)}} \tag{54} $$
$$ S_{\text{PEARSON\&HERON-II}} = \cos\left(\frac{\pi\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}\right) \tag{55} $$
$$ S_{\text{SOKAL\&SNEATH-III}} = \frac{a+d}{b+c} \tag{56} $$
$$ S_{\text{SOKAL\&SNEATH-V}} = \frac{ad}{((a+b)(a+c)(b+d)(c+d))^{0.5}} \tag{57} $$
$$ S_{\text{COLE}} = \frac{\sqrt{2}\,(ad-bc)}{\sqrt{(ad-bc)^{2}-(a+b)(a+c)(b+d)(c+d)}} \tag{58} $$
$$ S_{\text{STILES}} = \log_{10}\frac{n\left(|ad-bc|-\frac{n}{2}\right)^{2}}{(a+b)(a+c)(b+d)(c+d)} \tag{59} $$
$$ S_{\text{OCHIAI-II}} = \frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}} \tag{60} $$
$$ S_{\text{YULEQ}} = \frac{ad-bc}{ad+bc} \tag{61} $$
$$ D_{\text{YULEQ}} = \frac{2bc}{ad+bc} \tag{62} $$
$$ S_{\text{YULEw}} = \frac{\sqrt{ad}-\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}} \tag{63} $$
$$ S_{\text{KULCZYNSKI-I}} = \frac{a}{b+c} \tag{64} $$
$$ S_{\text{TANIMOTO}} = \frac{a}{(a+b)+(a+c)-a} \tag{65} $$
$$ S_{\text{DISPERSON}} = \frac{ad-bc}{(a+b+c+d)^{2}} \tag{66} $$
$$ S_{\text{HAMANN}} = \frac{(a+d)-(b+c)}{a+b+c+d} \tag{67} $$
$$ S_{\text{MICHAEL}} = \frac{4(ad-bc)}{(a+d)^{2}+(b+c)^{2}} \tag{68} $$
$$ S_{\text{GOODMAN\&KRUSKAL}} = \frac{\sigma-\sigma'}{2n-\sigma'} \quad\text{where}\quad \sigma = \max(a,b)+\max(c,d)+\max(a,c)+\max(b,d),\quad \sigma' = \max(a+c,\,b+d)+\max(a+b,\,c+d) \tag{69} $$
$$ S_{\text{ANDERBERG}} = \frac{\sigma-\sigma'}{2n} \tag{70} $$
$$ S_{\text{BARONI-URBANI\&BUSER-I}} = \frac{\sqrt{ad}+a}{\sqrt{ad}+a+b+c} \tag{71} $$
$$ S_{\text{BARONI-URBANI\&BUSER-II}} = \frac{\sqrt{ad}+a-(b+c)}{\sqrt{ad}+a+b+c} \tag{72} $$
$$ S_{\text{PEIRCE}} = \frac{ab+bc}{ab+2bc+cd} \tag{73} $$
$$ S_{\text{EYRAUD}} = \frac{n^{2}(na-(a+b)(a+c))}{(a+b)(a+c)(b+d)(c+d)} \tag{74} $$
$$ S_{\text{TARANTULA}} = \frac{\;\frac{a}{a+b}\;}{\frac{c}{c+d}} = \frac{a(c+d)}{c(a+b)} \tag{75} $$
$$ S_{\text{AMPLE}} = \left|\frac{\;\frac{a}{a+b}\;}{\frac{c}{c+d}}\right| = \left|\frac{a(c+d)}{c(a+b)}\right| \tag{76} $$

The inclusion or exclusion of negative matches, d, in binary similarity measures has been an ongoing issue [9, 12, 15, 16, 17, 18, 26, 27]. The Sokal & Michener, the Roger & Tanimoto, the Faith, the Ochiai II, the Cole, the Gower, the Pearson I, and the Stiles, among others, are negative match inclusive measures. The Jaccard, the Tanimoto, the Dice & Sorenson, the Kulczynski I, the Ochiai I, the Mountford, the Sorgenfrei, and the Simpson, among others, are negative match exclusive measures. Sokal et al. argued that negative matches do not necessarily indicate any similarity between two objects [27], because an almost infinite number of attributes may be lacking in both objects.

In cases where the two binary states are not equally important, such as in the asymmetric type of binary data, the positive matches are usually more significant than the negative matches [1, 6, 10, 26]. Faith included the negative matches but gave them only half credit, while giving full credit to the positive matches, in eqn (10) [11]. In [4], different weights for positive and negative matches were studied. Weighted similarity measures such as the weighted Hamming distance or azzoo [4] are not covered in this paper.

Historically, all the binary measures observed above have had a meaningful performance in their respective fields. The binary similarity coefficients proposed by Peirce, Yule, and Pearson in the 1900s contributed to the evolution of the various correlation based binary similarity measures. The Jaccard coefficient, proposed in 1901, is still widely used in fields such as ecology and biology. The discussion of the inclusion or exclusion of negative matches was actively raised by Sokal & Sneath during the 1960s and by Goodman & Kruskal in the 1970s. In Figure 1, the measures are arranged in historical order.
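The OTU counts of Table 1 and the contrast between negative match inclusive and exclusive measures can be illustrated with a small sketch. It is not code from the survey: the function names and the two example vectors are made up for illustration, and only eqs. (1), (7), (10), and (15) are implemented. Padding both vectors with shared absences raises d, which inflates the Sokal & Michener value while leaving the Jaccard value unchanged.

```python
def otu_counts(i, j):
    """Return (a, b, c, d) of Table 1 for two equal-length 0/1 sequences."""
    if len(i) != len(j):
        raise ValueError("feature vectors must have the same length")
    pairs = list(zip(i, j))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)  # positive matches
    b = sum(1 for x, y in pairs if x == 0 and y == 1)  # i absent, j present
    c = sum(1 for x, y in pairs if x == 1 and y == 0)  # i present, j absent
    d = sum(1 for x, y in pairs if x == 0 and y == 0)  # negative matches
    return a, b, c, d


def jaccard(a, b, c, d):
    # eq. (1); ignores d. Undefined (division by zero) when a + b + c == 0.
    return a / (a + b + c)


def sokal_michener(a, b, c, d):
    # eq. (7); counts positive and negative matches equally.
    return (a + d) / (a + b + c + d)


def faith(a, b, c, d):
    # eq. (10); negative matches receive half credit.
    return (a + 0.5 * d) / (a + b + c + d)


def hamming(a, b, c, d):
    # eq. (15); number of mismatching features.
    return b + c


# Hypothetical example vectors (not data from the paper).
i = [1, 1, 0, 1, 0, 0]
j = [1, 0, 1, 1, 0, 0]
counts = otu_counts(i, j)
print(counts)                      # (2, 1, 1, 2)
print(jaccard(*counts))            # 0.5
print(sokal_michener(*counts))     # 4/6
print(faith(*counts))              # 0.5
print(hamming(*counts))            # 2

# Append 94 features that are absent in both objects: only d changes.
padded = otu_counts(i + [0] * 94, j + [0] * 94)
print(padded)                      # (2, 1, 1, 96)
print(jaccard(*padded))            # still 0.5
print(sokal_michener(*padded))     # 0.98
```

In this toy case the two objects disagree on the same two features before and after padding, yet the negative match inclusive measure moves from about 0.67 to 0.98. This is the behavior behind the argument that shared absences need not indicate similarity.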

[Figure 1: timeline chart spanning 1884 to 2005 that places the measures in three rows: Correlation Based, Non-Correlation Based, and Distance Based.]

Figure 1 Chronological Table of Binary Similarity Measures and Distance Measures by Year

3. HIERARCHICAL CLUSTERING

Hierarchical clustering is conducted to estimate the similarity among the measures collected. A random binary data set is used. The reference set consists of 30 binary instances, each of which has 100 binary features. When a test query is measured with the reference set data, 100 distance or similarity values are produced for each measure. The correlation coefficient values between two measures are used to build a dendrogram. The agglomerative single linkage with the average clustering method is used [9].

The dendrogram in Figure 2 is produced by averaging 30 independent trials. The vertical scale on the left side of the dendrogram represents the binary similarity or dissimilarity measures examined. The horizontal scale represents the closeness of two clusters of binary similarity or dissimilarity measures, where 0 ≤ r ≤ 1. The dendrogram provides intuitive semantic groupings of binary similarity measures and distance measures.

High correlations are found among most of the measures that include negative matches. They are identified as Group 1, which includes the Simple Matching, the Pearson's phi-like coefficients, and the Yule Q. The exception is the Yule w, whose numerator uses the square roots of ad and bc. It is clustered in Group 5, showing different behavior. All of the Hamming-like binary distance measures are categorized in Group 1, while the Lance & Williams and the Bray-Curtis distance measures are clustered in Group 2, closely related to the Hellinger and the Chord distance measures. Most of the negative match exclusive measures are clustered in Groups 2 and 3. Additive forms of negative match exclusive measures, such as the Jaccard, the Dice & Sorenson, or the Kulczynski I, have high correlation with the Cosine based measures such as the Ochiai I or the Sorgenfrei. Interestingly, the Faith is categorized in Group 2 even though it is a variation of the Sokal & Michener of Group 1. The Driver & Kroeber, the Forbes I, and the Fossum have high correlation with inner product based measures such as the Russell & Rao. They are clustered in Group 3. The probabilistic similarity measures such as the Goodman & Kruskal and the Anderberg behave identically and are clustered in Group 6. The Yule w, the Eyraud, the Fager & McGowan, the Stiles, the Tanimoto, and the Peirce differ from the others, as they are clustered in Groups 5, 7, 8, 9, 10, and 11, respectively. The Chi-square based measures such as the Pearson I and Pearson II are clustered separately, forming Group 4. The Tarantula has high correlation with the Sokal & Sneath III and is clustered in Group 1, while the AMPLE coefficient, the absolute value of the Tarantula, has high correlation with chi-square based measures and is clustered in Group 4.
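The procedure of this section can be sketched in a few lines of standard-library Python. This is an illustrative approximation, not the authors' implementation, and it rests on several assumptions: only five of the 76 measures are included, one random query is compared against 30 random reference instances of 100 features, the correlation is taken to be Pearson's r, the distance between two measures is taken as 1 - |r| (so that a distance and a similarity that are perfectly anti-correlated count as close), plain average linkage is used, and the averaging over 30 independent trials is omitted. All function names are hypothetical.

```python
import random
from math import sqrt


def otu_counts(x, y):
    """Counts (a, b, c, d) of Table 1 for two 0/1 lists of equal length."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    c = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d


# A small subset of Table 2: eqs. (1), (2), (7), (14), (15).
MEASURES = {
    "Jaccard": lambda a, b, c, d: a / (a + b + c),
    "Dice": lambda a, b, c, d: 2 * a / (2 * a + b + c),
    "Sokal&Michener": lambda a, b, c, d: (a + d) / (a + b + c + d),
    "Russell&Rao": lambda a, b, c, d: a / (a + b + c + d),
    "Hamming": lambda a, b, c, d: b + c,
}


def pearson_r(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    su = sqrt(sum((p - mu) ** 2 for p in u))
    sv = sqrt(sum((q - mv) ** 2 for q in v))
    return cov / (su * sv)


def measure_profiles(n_refs=30, n_features=100, seed=0):
    """Value of every measure between one random query and each reference."""
    rng = random.Random(seed)
    query = [rng.randint(0, 1) for _ in range(n_features)]
    refs = [[rng.randint(0, 1) for _ in range(n_features)]
            for _ in range(n_refs)]
    counts = [otu_counts(query, r) for r in refs]
    return {name: [f(*c) for c in counts] for name, f in MEASURES.items()}


def average_linkage(names, dist):
    """Agglomerative clustering; returns merges as (left, right, distance)."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                total = sum(dist[x][y]
                            for x in clusters[p] for y in clusters[q])
                avg = total / (len(clusters[p]) * len(clusters[q]))
                if best is None or avg < best[0]:
                    best = (avg, p, q)
        avg, p, q = best
        merges.append((clusters[p], clusters[q], avg))
        merged = clusters[p] + clusters[q]
        clusters = [cl for k, cl in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return merges


profiles = measure_profiles()
names = list(MEASURES)
dist = {p: {q: 1.0 - abs(pearson_r(profiles[p], profiles[q])) for q in names}
        for p in names}
merges = average_linkage(names, dist)
for left, right, height in merges:
    print(left, "+", right, "at distance", round(height, 4))
```

The list of merges plays the role of the dendrogram: measures joined at a small distance behave almost interchangeably on this data. Swapping in a library routine for the clustering step, or adding the remaining measures, would be the natural next step for anyone reproducing Figure 2.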

Figure 2 Hierarchical Clustering Result of Random Binary Data Set
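Two short identities, derived here from the definitions in Table 2 rather than taken from the paper, may help explain why certain measures fall into the same group. They assume that the correlation coefficient used for the dendrogram rewards linear or near-linear relationships, as Pearson's r does.

$$ S_{\text{DICE}} = \frac{2a}{2a+b+c} = \frac{2\,\frac{a}{a+b+c}}{1+\frac{a}{a+b+c}} = \frac{2\,S_{\text{JACCARD}}}{1+S_{\text{JACCARD}}} $$

$$ S_{\text{SOKAL\&MICHENER}} = \frac{a+d}{n} = \frac{n-(b+c)}{n} = 1-\frac{D_{\text{HAMMING}}}{n} $$

The first shows that the Dice value is a monotonically increasing function of the Jaccard value, so the two always rank pairs of objects in the same order. The second shows that, for fixed n, the Sokal & Michener similarity is an exact linear function of the Hamming distance, which is consistent with the Hamming-like distances sharing Group 1 with the Simple Matching coefficient.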

4. CONCLUSIONS

Numerous binary similarity measures and distance measures have been used in various fields. Each is defined differently, with its own synthetic properties. Some include negative matches and some do not. Some use a simple count difference and some utilize complicated correlation. In this survey, we collected 76 binary similarity and distance measures used over the last century, classified them through hierarchical clustering, and observed close relationships among some of the measures. We expect that the relationship between each pair of measures should help researchers select a more accurate measure for binary data analysis in various domains.

5. REFERENCES

[1] Baroni-Urbani, C., Buser, M.W., (1976), "Similarity of Binary Data", Systematic Zoology, Vol. 25, No. 3, pp. 251-259.
[2] Cha, S.-H., Srihari, S.N., (2000), "A fast nearest neighbor search algorithm by filtration", Pattern Recognition 35, pp. 515-525.
[3] Cha, S.-H., Tappert, C.C., (2003), "Optimizing Binary Feature Vector Similarity Measure using Genetic Algorithm", ICDAR, Edinburgh, Scotland.
[4] Cha, S.-H., Yoon, S., Tappert, C.C., (2006), "Enhancing Binary Feature Vector Similarity Measures", Journal of Pattern Recognition Research I.
[5] Choi, S.-S., (2008), "Correlation Analysis of Binary Similarity Measures and Dissimilarity Measures", Doctorate dissertation, Pace University.
[6] Clifford, H., Stephenson, W., (1975), "An Introduction to Numerical Taxonomy", Academic Press, New York.
[7] Cormack, R.M., (1971), "A review of classification", Journal of the Royal Statistical Society, Series A, 134, pp. 321-353.
[8] Driver, H.E., Kroeber, A.L., (1932), "Quantitative Expression of Cultural Relationships", University of California Press.
[9] Dunn, G., Everitt, B.S., (1982), "An Introduction to Mathematical Taxonomy", Cambridge University Press.
[10] Faith, D.P., (1983), "Asymmetric binary similarity measures", Oecologia, Vol. 57, No. 3, pp. 287-290.
[11] Faith, D.P., Minchin, P.R., Belbin, L., (1987), "Compositional dissimilarity as a robust measure of ecological distance", Journal of Plant Ecology, Volume 69, Numbers 1-3.
[12] Finley, J.P., (1884), "Tornado prediction", The American Meteorological Journal, 1, 85-8.
[13] Forbes, S.A., (1907), "On the local distribution of certain Illinois fishes. An essay in statistical ecology", Bulletin of the Illinois State Laboratory of Natural History.
[14] Forbes, S.A., (1925), "Method of determining and measuring the associative relations of species", Science 61, 524.
[15] Gilbert, G.K., (1884), "Finley's tornado predictions", The American Meteorological Journal, 1, 166-72.
[16] Goodman, L.A., Kruskal, W.H., (1954), "Measures of association for cross classifications", Journal of the American Statistical Association 49, 732-764.
[17] Goodman, L.A., Kruskal, W.H., (1959), "Measures of association for cross classifications II. Further discussion and references", Journal of the American Statistical Association 54, 123-163 (pp. 35-75).
[18] Goodman, L.A., Kruskal, W.H., (1963), "Measures of association for cross classifications III. Approximate sampling theory", Journal of the American Statistical Association 58, 310-364.
[19] Hubalek, Z., (1982), "Coefficients of Association and Similarity, Based on Binary (Presence-Absence) Data: An Evaluation", Biological Reviews, Vol. 57-4, 669-689.
[20] Jaccard, P., (1901), "Étude comparative de la distribution florale dans une portion des Alpes et des Jura", Bull Soc Vaudoise Sci Nat 37:547-579.
[21] Jackson, D.A., Somers, K.M., Harvey, H.H., (1989), "Similarity Coefficients: Measures of Co-Occurrence and Association or Simply Measures of Occurrence?", The American Naturalist, Vol. 133, No. 3, pp. 436-453.
[22] Kuhns, J.L., (1965), "The continuum of coefficients of association", Statistical Association Methods for Mechanized Documentation, (Edited by Stevens et al.) National Bureau of Standards, Washington, 33-39.
[23] Michael, E.L., (1920), "Marine ecology and the coefficient of association: a plea in behalf of quantitative biology", Ecology 8, 54-59.
[24] Michael H., (1976), "Binary coefficients: A theoretical and empirical study", Mathematical Geology, Volume 8, Number 2, April, 1976.
[25] Smith, J.R., Chang, S.-F., (1996), "Automated binary texture feature sets for image retrieval", International Conf. Acoust., Speech, Signal Processing, Atlanta, GA.
[26] Sneath, P.H.A., Sokal, R.R., (1973), "Numerical Taxonomy: The Principles and Practice of Numerical Classification", W.H. Freeman and Company, San Francisco.
[27] Sokal, R.R., Sneath, P.H., (1963), "Principles of Numerical Taxonomy", San Francisco, W.H. Freeman.
[28] Tubbs, J.D., (1989), "A note on binary template matching", Pattern Recognition, 22(4):359-365.
[29] Willett, P., Barnard, J.M., Downs, G.M., (1998), "Chemical similarity searching", J Chem Inf Comput Sci 38: 983-996.
[30] Willett, P., (2003), "Similarity-based approaches to virtual screening", Biochemical Society Transactions 31, 603-606.
[31] Zhang, B., Srihari, S.N., (2003), "Binary vector dissimilarities for handwriting identification", Proceedings of SPIE, Document Recognition and Retrieval X, p 15-166.

