Approximate String Distances in R

Approximate String Distances
Description
Compute the approximate string distance between character vectors. The distance is a generalized
Levenshtein (edit) distance, giving the minimal possibly weighted number of insertions, deletions and
substitutions needed to transform one string into another.
Usage
adist(x, y = NULL, costs = NULL, counts = FALSE, fixed = TRUE,
partial = !fixed, ignore.case = FALSE, useBytes = FALSE)
Arguments
x a character vector. Long vectors are not supported.

y a character vector, or NULL (default) indicating taking x as y.
costs a numeric vector or list with names partially
matching insertions, deletions andsubstitutions giving the respective costs
for computing the Levenshtein distance, or NULL(default) indicating using unit cost for
all three possible transformations.
counts a logical indicating whether to optionally return the transformation counts (numbers of
insertions, deletions and substitutions) as the "counts" attribute of the return value.
fixed a logical. If TRUE (default), the x elements are used as string literals. Otherwise, they
are taken as regular expressions and partial = TRUE is implied (corresponding to
the approximate string distance used by agrep with fixed = FALSE.
partial a logical indicating whether the transformed x elements must exactly match the
complete yelements, or only substrings of these. The latter corresponds to the
approximate string distance used by agrep (by default).
ignore.case a logical. If TRUE, case is ignored for computing the distances.
useBytes a logical. If TRUE distance computations are done byte-by-byte rather than character-by-
character.
Details
The (generalized) Levenshtein (or edit) distance between two strings s and t is the minimal possibly
weighted number of insertions, deletions and substitutions needed to transform s into t (so that the
transformation exactly matches t). This distance is computed for partial = FALSE, currently using a
dynamic programming algorithm (see, e.g., https://en.wikipedia.org/wiki/Levenshtein_distance) with space
and time complexity O(mn), where m andn are the lengths of s and t, respectively. Additionally computing
the transformation sequence and counts isO(\max(m, n)).
The generalized Levenshtein distance can also be used for approximate (fuzzy) string matching, in which
case one finds the substring of t with minimal distance to the pattern s (which could be taken as a regular
expression, in which case the principle of using the leftmost and longest match applies), see,
e.g.,https://en.wikipedia.org/wiki/Approximate_string_matching. This distance is computed for partial
= TRUEusing tre by Ville Laurikari (http://laurikari.net/tre/) and corresponds to the distance used
by agrep. In this case, the given cost values are coerced to integer.
Note that the costs for insertions and deletions can be different, in which case the distance
between s and t can be different from the distance between t and s.
Value
A matrix with the approximate string distances of the elements of x and y, with rows and columns
corresponding tox and y, respectively.
If counts is TRUE, the transformation counts are returned as the "counts" attribute of this matrix, as a
3-dimensional array with dimensions corresponding to the elements of x, the elements of y, and the type
of transformation (insertions, deletions and substitutions), respectively. Additionally, if partial =
FALSE, the transformation sequences are returned as the "trafos" attribute of the return value, as
character strings with elements M, I, D and S indicating a match, insertion, deletion and substitution,
respectively. If partial = TRUE, the offsets (positions of the first and last element) of the matched
substrings are returned as the "offsets"attribute of the return value (with both offsets -1 in case of no
match).
See Also
agrep for approximate string matching (fuzzy matching) using the generalized Levenshtein distance.
Examples
## Cf. https://en.wikipedia.org/wiki/Levenshtein_distance
adist("kitten", "sitting")
## To see the transformation counts for the Levenshtein distance:
drop(attr(adist("kitten", "sitting", counts = TRUE), "counts"))
## To see the transformation sequences:
attr(adist(c("kitten", "sitting"), counts = TRUE), "trafos")
## Cf. the examples for agrep:

adist("lasy", "1 lazy 2")
## For a "partial approximate match" (as used for agrep):
adist("lasy", "1 lazy 2", partial = TRUE)

Approximate String Distances in R

Încărcat de

Informații document

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Approximate String Distances in R

Încărcat de

Drepturi de autor:

Formate disponibile

Approximate String Distances

x a character vector. Long vectors are not supported.

## Cf. the examples for agrep:

S-ar putea să vă placă și