Document Similarity

- hamming: Compute the similarity between two strings using Hamming distance, which gives the number of characters at corresponding string indices that differ, including chars in the longer string that have no correspondents in the shorter.
- levenshtein: Measure the similarity between two strings using Levenshtein distance, which gives the minimum number of character insertions, deletions, and substitutions needed to change one string into the other.
- jaro: Measure the similarity between two strings using Jaro (not Jaro-Winkler) distance, which searches for common characters while taking transpositions and string lengths into account.
- character_ngrams: Measure the similarity between two strings using a character ngrams similarity metric, in which strings are transformed into trigrams of alnum-only characters, vectorized and weighted by tf-idf, then compared by cosine similarity.
- jaccard: Measure the similarity between two sequences of strings as sets using the Jaccard index.
- sorensen_dice: Measure the similarity between two sequences of strings as sets using the Sørensen-Dice index, which is similar to the Jaccard index.
- tversky: Measure the similarity between two sequences of strings as sets using the (symmetric) Tversky index, which is a generalization of the Jaccard and Sørensen-Dice indexes.
- cosine: Measure the similarity between two sequences of strings as sets using the Otsuka-Ochiai variation of cosine similarity (which is equivalent to the usual formulation when values are binary).
- bag: Measure the similarity between two sequences of strings (not as sets) using the “bag distance” measure, which can be considered an approximation of edit distance.
- matching_subsequences_ratio: Measure the similarity between two sequences of strings by finding contiguous matching subsequences without any “junk” elements and normalizing by the total number of elements.
- token_sort_ratio: Measure the similarity between two strings or sequences of strings using Levenshtein distance, only with non-alphanumeric characters removed and the ordering of tokens in each sorted before comparison.
- monge_elkan: Measure the similarity between two sequences of strings using the (symmetric) Monge-Elkan method, which takes the average of the maximum pairwise similarity between the tokens in each sequence as compared to those in the other sequence.
Edit-based Metrics

textacy.similarity.edits: Normalized similarity metrics built on edit-based algorithms that compute the number of operations (insertions, deletions, substitutions, …) needed to transform one string into another.
textacy.similarity.edits.hamming(str1: str, str2: str) → float
Compute the similarity between two strings using Hamming distance, which gives the number of characters at corresponding string indices that differ, including chars in the longer string that have no correspondents in the shorter.
- Parameters
  str1 –
  str2 –
- Returns
  Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
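The metric as described can be sketched in plain Python. This is a minimal illustration, assuming the normalization is 1 − Hamming distance / length of the longer string, with positions beyond the shorter string counted as mismatches; textacy's exact implementation may differ in details:

```python
def hamming_similarity(str1: str, str2: str) -> float:
    """Normalized Hamming similarity: 1 - (distance / length of longer string)."""
    if not str1 and not str2:
        return 1.0  # convention: two empty strings are identical
    # Positions past the end of the shorter string all count as mismatches.
    distance = abs(len(str1) - len(str2))
    distance += sum(c1 != c2 for c1, c2 in zip(str1, str2))
    return 1.0 - distance / max(len(str1), len(str2))
```

For example, "karolin" and "kathrin" differ at three positions, giving 1 − 3/7 ≈ 0.571.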
textacy.similarity.edits.levenshtein(str1: str, str2: str) → float
Measure the similarity between two strings using Levenshtein distance, which gives the minimum number of character insertions, deletions, and substitutions needed to change one string into the other.
- Parameters
  str1 –
  str2 –
- Returns
  Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
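A minimal sketch of this normalized metric, assuming similarity = 1 − edit distance / length of the longer string (textacy's own normalization may differ):

```python
def levenshtein_similarity(str1: str, str2: str) -> float:
    """Normalized Levenshtein similarity: 1 - (edit distance / longer length)."""
    if not str1 and not str2:
        return 1.0
    # Standard dynamic-programming edit distance, computed one row at a time.
    prev = list(range(len(str2) + 1))
    for i, c1 in enumerate(str1, start=1):
        curr = [i]
        for j, c2 in enumerate(str2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(str1), len(str2))
```

The classic pair "kitten" / "sitting" needs 3 edits, so the similarity is 1 − 3/7 ≈ 0.571.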
textacy.similarity.edits.jaro(str1: str, str2: str) → float
Measure the similarity between two strings using Jaro (not Jaro-Winkler) distance, which searches for common characters while taking transpositions and string lengths into account.
- Parameters
  str1 –
  str2 –
- Returns
  Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
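The standard Jaro computation described above can be sketched as follows; this follows the textbook definition (match window of ⌊max length / 2⌋ − 1, transpositions counted as half the out-of-order matches), not necessarily textacy's exact code:

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity: common characters within a window, penalizing transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s1_matched = [c for c, m in zip(s1, match1) if m]
    s2_matched = [c for c, m in zip(s2, match2) if m]
    transpositions = sum(a != b for a, b in zip(s1_matched, s2_matched)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3
```

For the classic example "martha" / "marhta": 6 matches, 1 transposition, giving (1 + 1 + 5/6) / 3 ≈ 0.944.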
textacy.similarity.edits.character_ngrams(str1: str, str2: str) → float
Measure the similarity between two strings using a character ngrams similarity metric, in which strings are transformed into trigrams of alnum-only characters, vectorized and weighted by tf-idf, then compared by cosine similarity.
- Parameters
  str1 –
  str2 –
- Returns
  Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
Note
This method has been used in cross-lingual plagiarism detection and authorship attribution, and seems to work better on longer texts. At the very least, it is slow on shorter texts relative to the other similarity measures.
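A simplified sketch of the idea, using raw trigram counts in place of the tf-idf weighting that the real metric applies (and a simple ASCII-only alnum filter, an assumption made for brevity):

```python
import math
import re
from collections import Counter

def char_trigram_cosine(str1: str, str2: str) -> float:
    """Simplified character-trigram similarity: strip non-alphanumerics,
    count trigrams, then compare the count vectors by cosine similarity.
    (The real metric additionally weights trigrams by tf-idf.)"""
    def trigrams(s: str) -> Counter:
        s = re.sub(r"[^a-zA-Z0-9]", "", s.lower())
        return Counter(s[i:i + 3] for i in range(len(s) - 2))

    v1, v2 = trigrams(str1), trigrams(str2)
    dot = sum(v1[g] * v2[g] for g in v1.keys() & v2.keys())
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0
```

Strings with no trigrams in common (e.g. "night" vs "nacht") score 0.0; strings that normalize to the same character sequence score 1.0.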
Token-based Metrics

textacy.similarity.tokens: Normalized similarity metrics built on token-based algorithms that identify and count similar tokens between one sequence and another, and don't rely on the ordering of those tokens.
textacy.similarity.tokens.jaccard(seq1: Iterable[str], seq2: Iterable[str]) → float
Measure the similarity between two sequences of strings as sets using the Jaccard index.
- Parameters
  seq1 –
  seq2 –
- Returns
  Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences of strings.
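As a formula, this is |A ∩ B| / |A ∪ B| over the two token sets; a minimal sketch (the empty/empty convention is an assumption):

```python
def jaccard_similarity(seq1, seq2) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    set1, set2 = set(seq1), set(seq2)
    if not set1 and not set2:
        return 1.0  # convention: two empty sets are identical
    return len(set1 & set2) / len(set1 | set2)
```

For example, {"a", "b", "c"} vs {"b", "c", "d"} share 2 of 4 distinct tokens, giving 0.5.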
textacy.similarity.tokens.sorensen_dice(seq1: Iterable[str], seq2: Iterable[str]) → float
Measure the similarity between two sequences of strings as sets using the Sørensen-Dice index, which is similar to the Jaccard index.
- Parameters
  seq1 –
  seq2 –
- Returns
  Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences.
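The index is 2·|A ∩ B| / (|A| + |B|), which weights the intersection more heavily than Jaccard does; a minimal sketch:

```python
def sorensen_dice_similarity(seq1, seq2) -> float:
    """Sørensen-Dice index: twice the intersection over the sum of set sizes."""
    set1, set2 = set(seq1), set(seq2)
    if not set1 and not set2:
        return 1.0  # convention: two empty sets are identical
    return 2 * len(set1 & set2) / (len(set1) + len(set2))
```

The same pair that scores 0.5 under Jaccard ({"a", "b", "c"} vs {"b", "c", "d"}) scores 2·2/6 ≈ 0.667 here.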
textacy.similarity.tokens.tversky(seq1: Iterable[str], seq2: Iterable[str], *, alpha: float = 1.0, beta: float = 1.0) → float
Measure the similarity between two sequences of strings as sets using the (symmetric) Tversky index, which is a generalization of Jaccard (alpha=0.5, beta=2.0) and Sørensen-Dice (alpha=0.5, beta=1.0).
- Parameters
  seq1 –
  seq2 –
  alpha –
  beta –
- Returns
  Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences.
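textacy implements a symmetric variant (with the parameterizations noted above); as a simpler illustration of how alpha and beta weight the two set differences, here is the classic, non-symmetric Tversky index, which recovers Jaccard at alpha=beta=1.0 and Sørensen-Dice at alpha=beta=0.5:

```python
def tversky_similarity(seq1, seq2, *, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Classic (non-symmetric) Tversky index: the intersection, with alpha
    weighting elements found only in seq1 and beta those found only in seq2."""
    set1, set2 = set(seq1), set(seq2)
    common = len(set1 & set2)
    only1, only2 = len(set1 - set2), len(set2 - set1)
    denom = common + alpha * only1 + beta * only2
    return common / denom if denom else 1.0
```

With {"a", "b", "c"} vs {"b", "c", "d"}: alpha=beta=1.0 gives 2/4 = 0.5 (Jaccard); alpha=beta=0.5 gives 2/3 (Sørensen-Dice).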
textacy.similarity.tokens.cosine(seq1: Iterable[str], seq2: Iterable[str]) → float
Measure the similarity between two sequences of strings as sets using the Otsuka-Ochiai variation of cosine similarity (which is equivalent to the usual formulation when values are binary).
- Parameters
  seq1 –
  seq2 –
- Returns
  Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences.
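The Otsuka-Ochiai coefficient is |A ∩ B| / √(|A| · |B|), i.e. cosine similarity over binary membership vectors; a minimal sketch (the empty-set conventions are assumptions):

```python
import math

def otsuka_ochiai_similarity(seq1, seq2) -> float:
    """Otsuka-Ochiai coefficient: intersection size over the geometric mean
    of the two set sizes (cosine similarity with binary values)."""
    set1, set2 = set(seq1), set(seq2)
    if not set1 or not set2:
        return 1.0 if not set1 and not set2 else 0.0
    return len(set1 & set2) / math.sqrt(len(set1) * len(set2))
```

With {"a", "b", "c"} vs {"b", "c", "d"}: 2 / √9 ≈ 0.667.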
textacy.similarity.tokens.bag(seq1: Iterable[str], seq2: Iterable[str]) → float
Measure the similarity between two sequences of strings (not as sets) using the “bag distance” measure, which can be considered an approximation of edit distance.
- Parameters
  seq1 –
  seq2 –
- Returns
  Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences.
- Reference:
  Bartolini, Ilaria, Paolo Ciaccia, and Marco Patella. “String matching with metric trees using an approximate distance.” International Symposium on String Processing and Information Retrieval. Springer, Berlin, Heidelberg, 2002.
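In the cited formulation, bag distance is the larger of the two multiset differences; a sketch assuming similarity = 1 − distance / length of the longer sequence:

```python
from collections import Counter

def bag_similarity(seq1, seq2) -> float:
    """Bag similarity: 1 - (bag distance / length of the longer sequence),
    where bag distance is the larger of the two multiset differences."""
    seq1, seq2 = list(seq1), list(seq2)
    if not seq1 and not seq2:
        return 1.0
    bag1, bag2 = Counter(seq1), Counter(seq2)
    # Counter subtraction keeps only positive counts, i.e. multiset difference.
    distance = max(sum((bag1 - bag2).values()), sum((bag2 - bag1).values()))
    return 1.0 - distance / max(len(seq1), len(seq2))
```

Unlike the set-based metrics above, repeated tokens matter here: ["a", "b", "c"] vs ["a", "b", "b"] gives distance 1 and similarity 2/3.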
Sequence-based Metrics

textacy.similarity.sequences: Normalized similarity metrics built on sequence-based algorithms that identify and measure the subsequences common to both sequences.
textacy.similarity.sequences.matching_subsequences_ratio(seq1: Sequence[str], seq2: Sequence[str], **kwargs) → float
Measure the similarity between two sequences of strings by finding contiguous matching subsequences without any “junk” elements and normalizing by the total number of elements.
- Parameters
  seq1 –
  seq2 –
  **kwargs – isjunk: Optional[Callable[[str], bool]] = None; autojunk: bool = True
- Returns
  Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences of strings.
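The description (and the isjunk/autojunk keyword arguments) matches the behavior of difflib.SequenceMatcher in the standard library; a sketch assuming that correspondence:

```python
from difflib import SequenceMatcher

def matching_subsequences_ratio(seq1, seq2, isjunk=None, autojunk=True) -> float:
    """ratio() returns 2 * (total matching elements) / (len(seq1) + len(seq2)),
    built from the contiguous matching blocks that contain no junk elements."""
    return SequenceMatcher(isjunk, seq1, seq2, autojunk=autojunk).ratio()
```

For example, the sequences list("abcd") and list("bcde") share the block "bcd", giving 2·3 / 8 = 0.75.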
Hybrid Metrics

textacy.similarity.hybrid: Normalized similarity metrics that combine edit-, token-, and/or sequence-based algorithms.
textacy.similarity.hybrid.token_sort_ratio(s1: str | Sequence[str], s2: str | Sequence[str]) → float
Measure the similarity between two strings or sequences of strings using Levenshtein distance, only with non-alphanumeric characters removed and the ordering of tokens in each sorted before comparison.
- Parameters
  s1 –
  s2 –
- Returns
  Similarity between s1 and s2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
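A sketch of the normalize-then-compare idea, with two stated assumptions: whitespace tokenization for plain strings, and difflib's ratio standing in for the Levenshtein-based ratio the real function uses:

```python
import re
from difflib import SequenceMatcher

def token_sort_ratio(s1, s2) -> float:
    """Keep alphanumeric characters in each token, sort the tokens, rejoin,
    then compare the two normalized strings."""
    def normalize(s) -> str:
        tokens = s.split() if isinstance(s, str) else list(s)
        tokens = [re.sub(r"[^a-z0-9]", "", tok.lower()) for tok in tokens]
        return " ".join(sorted(tok for tok in tokens if tok))
    return SequenceMatcher(None, normalize(s1), normalize(s2)).ratio()
```

Because tokens are sorted before comparison, reorderings score perfectly: "new york mets" vs "mets new york" gives 1.0.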
textacy.similarity.hybrid.monge_elkan(seq1: Sequence[str], seq2: Sequence[str], sim_func: Callable[[str, str], float] = <function levenshtein>) → float
Measure the similarity between two sequences of strings using the (symmetric) Monge-Elkan method, which takes the average of the maximum pairwise similarity between the tokens in each sequence as compared to those in the other sequence.
- Parameters
  seq1 –
  seq2 –
  sim_func – Callable that computes a string-to-string similarity metric; by default, Levenshtein edit distance.
- Returns
  Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
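The symmetric method described above can be sketched as follows, with difflib's ratio as a stand-in for the Levenshtein-based default:

```python
from difflib import SequenceMatcher

def monge_elkan(seq1, seq2, sim_func=None) -> float:
    """Symmetric Monge-Elkan: for each token, take its best-matching token in
    the other sequence; average those maxima in each direction, then average
    the two directed scores."""
    if sim_func is None:
        # Stand-in token similarity; textacy defaults to its Levenshtein metric.
        sim_func = lambda a, b: SequenceMatcher(None, a, b).ratio()
    if not seq1 or not seq2:
        return 0.0
    def directed(a, b):
        return sum(max(sim_func(t1, t2) for t2 in b) for t1 in a) / len(a)
    return (directed(seq1, seq2) + directed(seq2, seq1)) / 2
```

The symmetric averaging matters when sequence lengths differ: ["abc"] vs ["abc", "xyz"] scores 1.0 in one direction but 0.5 in the other, giving 0.75 overall.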