Document Similarity

- hamming: Compute the similarity between two strings using Hamming distance, which gives the number of characters at corresponding string indices that differ, including chars in the longer string that have no correspondents in the shorter.
- levenshtein: Measure the similarity between two strings using Levenshtein distance, which gives the minimum number of character insertions, deletions, and substitutions needed to change one string into the other.
- jaro: Measure the similarity between two strings using Jaro (not Jaro-Winkler) distance, which searches for common characters while taking transpositions and string lengths into account.
- character_ngrams: Measure the similarity between two strings using a character ngrams similarity metric, in which strings are transformed into trigrams of alnum-only characters, vectorized and weighted by tf-idf, then compared by cosine similarity.
- jaccard: Measure the similarity between two sequences of strings as sets using the Jaccard index.
- sorensen_dice: Measure the similarity between two sequences of strings as sets using the Sørensen-Dice index, which is similar to the Jaccard index.
- tversky: Measure the similarity between two sequences of strings as sets using the (symmetric) Tversky index, which is a generalization of the Jaccard and Sørensen-Dice indexes.
- cosine: Measure the similarity between two sequences of strings as sets using the Otsuka-Ochiai variation of cosine similarity (which is equivalent to the usual formulation when values are binary).
- bag: Measure the similarity between two sequences of strings (not as sets) using the “bag distance” measure, which can be considered an approximation of edit distance.
- matching_subsequences_ratio: Measure the similarity between two sequences of strings by finding contiguous matching subsequences without any “junk” elements and normalizing by the total number of elements.
- token_sort_ratio: Measure the similarity between two strings or sequences of strings using Levenshtein distance, only with non-alphanumeric characters removed and the ordering of tokens in each sorted before comparison.
- monge_elkan: Measure the similarity between two sequences of strings using the (symmetric) Monge-Elkan method, which takes the average of the maximum pairwise similarity between the tokens in each sequence as compared to those in the other sequence.
Edit-based Metrics

textacy.similarity.edits: Normalized similarity metrics built on edit-based algorithms that compute the number of operations (insertions, deletions, substitutions, …) needed to transform one string into another.
textacy.similarity.edits.hamming(str1: str, str2: str) → float
Compute the similarity between two strings using Hamming distance, which gives the number of characters at corresponding string indices that differ, including chars in the longer string that have no correspondents in the shorter.
- Parameters
  str1 –
  str2 –
- Returns
  Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
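The metric as described can be sketched in plain Python. This is a minimal illustration, assuming the normalization is 1 − Hamming distance / length of the longer string, with positions beyond the shorter string counted as mismatches; textacy's exact implementation may differ in details:

```python
def hamming_similarity(str1: str, str2: str) -> float:
    """Normalized Hamming similarity: 1 - (distance / length of longer string)."""
    if not str1 and not str2:
        return 1.0  # convention: two empty strings are identical
    # Positions past the end of the shorter string all count as mismatches.
    distance = abs(len(str1) - len(str2))
    distance += sum(c1 != c2 for c1, c2 in zip(str1, str2))
    return 1.0 - distance / max(len(str1), len(str2))
```

For example, "karolin" and "kathrin" differ at three positions, giving 1 − 3/7 ≈ 0.571.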
textacy.similarity.edits.levenshtein(str1: str, str2: str) → float
Measure the similarity between two strings using Levenshtein distance, which gives the minimum number of character insertions, deletions, and substitutions needed to change one string into the other.
- Parameters
  str1 –
  str2 –
- Returns
  Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
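A minimal sketch of this normalized metric, assuming similarity = 1 − edit distance / length of the longer string (textacy's own normalization may differ):

```python
def levenshtein_similarity(str1: str, str2: str) -> float:
    """Normalized Levenshtein similarity: 1 - (edit distance / longer length)."""
    if not str1 and not str2:
        return 1.0
    # Standard dynamic-programming edit distance, computed one row at a time.
    prev = list(range(len(str2) + 1))
    for i, c1 in enumerate(str1, start=1):
        curr = [i]
        for j, c2 in enumerate(str2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(str1), len(str2))
```

The classic pair "kitten" / "sitting" needs 3 edits, so the similarity is 1 − 3/7 ≈ 0.571.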
textacy.similarity.edits.jaro(str1: str, str2: str) → float
Measure the similarity between two strings using Jaro (not Jaro-Winkler) distance, which searches for common characters while taking transpositions and string lengths into account.
- Parameters
  str1 –
  str2 –
- Returns
  Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
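The standard Jaro computation described above can be sketched as follows; this follows the textbook definition (match window of ⌊max length / 2⌋ − 1, transpositions counted as half the out-of-order matches), not necessarily textacy's exact code:

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity: common characters within a window, penalizing transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s1_matched = [c for c, m in zip(s1, match1) if m]
    s2_matched = [c for c, m in zip(s2, match2) if m]
    transpositions = sum(a != b for a, b in zip(s1_matched, s2_matched)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3
```

For the classic example "martha" / "marhta": 6 matches, 1 transposition, giving (1 + 1 + 5/6) / 3 ≈ 0.944.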
textacy.similarity.edits.character_ngrams(str1: str, str2: str) → float
Measure the similarity between two strings using a character ngrams similarity metric, in which strings are transformed into trigrams of alnum-only characters, vectorized and weighted by tf-idf, then compared by cosine similarity.
- Parameters
  str1 –
  str2 –
- Returns
  Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
Note
This method has been used in cross-lingual plagiarism detection and authorship attribution, and seems to work better on longer texts. At the very least, it is slow on shorter texts relative to the other similarity measures.
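A simplified sketch of the idea, using raw trigram counts in place of the tf-idf weighting that the real metric applies (and a simple ASCII-only alnum filter, an assumption made for brevity):

```python
import math
import re
from collections import Counter

def char_trigram_cosine(str1: str, str2: str) -> float:
    """Simplified character-trigram similarity: strip non-alphanumerics,
    count trigrams, then compare the count vectors by cosine similarity.
    (The real metric additionally weights trigrams by tf-idf.)"""
    def trigrams(s: str) -> Counter:
        s = re.sub(r"[^a-zA-Z0-9]", "", s.lower())
        return Counter(s[i:i + 3] for i in range(len(s) - 2))

    v1, v2 = trigrams(str1), trigrams(str2)
    dot = sum(v1[g] * v2[g] for g in v1.keys() & v2.keys())
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0
```

Strings with no trigrams in common (e.g. "night" vs "nacht") score 0.0; strings that normalize to the same character sequence score 1.0.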
Token-based Metrics

textacy.similarity.tokens: Normalized similarity metrics built on token-based algorithms that identify and count similar tokens between one sequence and another, and don't rely on the ordering of those tokens.
textacy.similarity.tokens.jaccard(seq1: Iterable[str], seq2: Iterable[str]) → float
Measure the similarity between two sequences of strings as sets using the Jaccard index.
- Parameters
  seq1 –
  seq2 –
- Returns
  Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences of strings.
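As a formula, this is |A ∩ B| / |A ∪ B| over the two token sets; a minimal sketch (the empty/empty convention is an assumption):

```python
def jaccard_similarity(seq1, seq2) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    set1, set2 = set(seq1), set(seq2)
    if not set1 and not set2:
        return 1.0  # convention: two empty sets are identical
    return len(set1 & set2) / len(set1 | set2)
```

For example, {"a", "b", "c"} vs {"b", "c", "d"} share 2 of 4 distinct tokens, giving 0.5.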
textacy.similarity.tokens.sorensen_dice(seq1: Iterable[str], seq2: Iterable[str]) → float
Measure the similarity between two sequences of strings as sets using the Sørensen-Dice index, which is similar to the Jaccard index.
- Parameters
  seq1 –
  seq2 –
- Returns
  Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences.
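The index is 2·|A ∩ B| / (|A| + |B|), which weights the intersection more heavily than Jaccard does; a minimal sketch:

```python
def sorensen_dice_similarity(seq1, seq2) -> float:
    """Sørensen-Dice index: twice the intersection over the sum of set sizes."""
    set1, set2 = set(seq1), set(seq2)
    if not set1 and not set2:
        return 1.0  # convention: two empty sets are identical
    return 2 * len(set1 & set2) / (len(set1) + len(set2))
```

The same pair that scores 0.5 under Jaccard ({"a", "b", "c"} vs {"b", "c", "d"}) scores 2·2/6 ≈ 0.667 here.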
textacy.similarity.tokens.tversky(seq1: Iterable[str], seq2: Iterable[str], *, alpha: float = 1.0, beta: float = 1.0) → float
Measure the similarity between two sequences of strings as sets using the (symmetric) Tversky index, which is a generalization of Jaccard (alpha=0.5, beta=2.0) and Sørensen-Dice (alpha=0.5, beta=1.0).
- Parameters
  seq1 –
  seq2 –
  alpha –
  beta –
- Returns
  Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences.
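textacy implements a symmetric variant (with the parameterizations noted above); as a simpler illustration of how alpha and beta weight the two set differences, here is the classic, non-symmetric Tversky index, which recovers Jaccard at alpha=beta=1.0 and Sørensen-Dice at alpha=beta=0.5:

```python
def tversky_similarity(seq1, seq2, *, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Classic (non-symmetric) Tversky index: the intersection, with alpha
    weighting elements found only in seq1 and beta those found only in seq2."""
    set1, set2 = set(seq1), set(seq2)
    common = len(set1 & set2)
    only1, only2 = len(set1 - set2), len(set2 - set1)
    denom = common + alpha * only1 + beta * only2
    return common / denom if denom else 1.0
```

With {"a", "b", "c"} vs {"b", "c", "d"}: alpha=beta=1.0 gives 2/4 = 0.5 (Jaccard); alpha=beta=0.5 gives 2/3 (Sørensen-Dice).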
textacy.similarity.tokens.cosine(seq1: Iterable[str], seq2: Iterable[str]) → float
Measure the similarity between two sequences of strings as sets using the Otsuka-Ochiai variation of cosine similarity (which is equivalent to the usual formulation when values are binary).
- Parameters
  seq1 –
  seq2 –
- Returns
  Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences.
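The Otsuka-Ochiai coefficient is |A ∩ B| / √(|A| · |B|), i.e. cosine similarity over binary membership vectors; a minimal sketch (the empty-set conventions are assumptions):

```python
import math

def otsuka_ochiai_similarity(seq1, seq2) -> float:
    """Otsuka-Ochiai coefficient: intersection size over the geometric mean
    of the two set sizes (cosine similarity with binary values)."""
    set1, set2 = set(seq1), set(seq2)
    if not set1 or not set2:
        return 1.0 if not set1 and not set2 else 0.0
    return len(set1 & set2) / math.sqrt(len(set1) * len(set2))
```

With {"a", "b", "c"} vs {"b", "c", "d"}: 2 / √9 ≈ 0.667.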
textacy.similarity.tokens.bag(seq1: Iterable[str], seq2: Iterable[str]) → float
Measure the similarity between two sequences of strings (not as sets) using the “bag distance” measure, which can be considered an approximation of edit distance.
- Parameters
  seq1 –
  seq2 –
- Returns
  Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences.
- Reference:
  Bartolini, Ilaria, Paolo Ciaccia, and Marco Patella. “String matching with metric trees using an approximate distance.” International Symposium on String Processing and Information Retrieval. Springer, Berlin, Heidelberg, 2002.
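In the cited formulation, bag distance is the larger of the two multiset differences; a sketch assuming similarity = 1 − distance / length of the longer sequence:

```python
from collections import Counter

def bag_similarity(seq1, seq2) -> float:
    """Bag similarity: 1 - (bag distance / length of the longer sequence),
    where bag distance is the larger of the two multiset differences."""
    seq1, seq2 = list(seq1), list(seq2)
    if not seq1 and not seq2:
        return 1.0
    bag1, bag2 = Counter(seq1), Counter(seq2)
    # Counter subtraction keeps only positive counts, i.e. multiset difference.
    distance = max(sum((bag1 - bag2).values()), sum((bag2 - bag1).values()))
    return 1.0 - distance / max(len(seq1), len(seq2))
```

Unlike the set-based metrics above, repeated tokens matter here: ["a", "b", "c"] vs ["a", "b", "b"] gives distance 1 and similarity 2/3.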
Sequence-based Metrics

textacy.similarity.sequences: Normalized similarity metrics built on sequence-based algorithms that identify and measure the subsequences common to both sequences.
textacy.similarity.sequences.matching_subsequences_ratio(seq1: Sequence[str], seq2: Sequence[str], **kwargs) → float
Measure the similarity between two sequences of strings by finding contiguous matching subsequences without any “junk” elements and normalizing by the total number of elements.
- Parameters
  seq1 –
  seq2 –
  **kwargs – isjunk: Optional[Callable[[str], bool]] = None; autojunk: bool = True
- Returns
  Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences of strings.
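The description (and the isjunk/autojunk keyword arguments) matches the behavior of difflib.SequenceMatcher in the standard library; a sketch assuming that correspondence:

```python
from difflib import SequenceMatcher

def matching_subsequences_ratio(seq1, seq2, isjunk=None, autojunk=True) -> float:
    """ratio() returns 2 * (total matching elements) / (len(seq1) + len(seq2)),
    built from the contiguous matching blocks that contain no junk elements."""
    return SequenceMatcher(isjunk, seq1, seq2, autojunk=autojunk).ratio()
```

For example, the sequences list("abcd") and list("bcde") share the block "bcd", giving 2·3 / 8 = 0.75.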
Hybrid Metrics

textacy.similarity.hybrid: Normalized similarity metrics that combine edit-, token-, and/or sequence-based algorithms.
textacy.similarity.hybrid.token_sort_ratio(s1: str | Sequence[str], s2: str | Sequence[str]) → float
Measure the similarity between two strings or sequences of strings using Levenshtein distance, only with non-alphanumeric characters removed and the ordering of tokens in each sorted before comparison.
- Parameters
  s1 –
  s2 –
- Returns
  Similarity between s1 and s2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
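A sketch of the normalize-then-compare idea, with two stated assumptions: whitespace tokenization for plain strings, and difflib's ratio standing in for the Levenshtein-based ratio the real function uses:

```python
import re
from difflib import SequenceMatcher

def token_sort_ratio(s1, s2) -> float:
    """Keep alphanumeric characters in each token, sort the tokens, rejoin,
    then compare the two normalized strings."""
    def normalize(s) -> str:
        tokens = s.split() if isinstance(s, str) else list(s)
        tokens = [re.sub(r"[^a-z0-9]", "", tok.lower()) for tok in tokens]
        return " ".join(sorted(tok for tok in tokens if tok))
    return SequenceMatcher(None, normalize(s1), normalize(s2)).ratio()
```

Because tokens are sorted before comparison, reorderings score perfectly: "new york mets" vs "mets new york" gives 1.0.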
textacy.similarity.hybrid.monge_elkan(seq1: Sequence[str], seq2: Sequence[str], sim_func: Callable[[str, str], float] = <function levenshtein>) → float
Measure the similarity between two sequences of strings using the (symmetric) Monge-Elkan method, which takes the average of the maximum pairwise similarity between the tokens in each sequence as compared to those in the other sequence.
- Parameters
  seq1 –
  seq2 –
  sim_func – Callable that computes a string-to-string similarity metric; by default, Levenshtein edit distance.
- Returns
  Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
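The symmetric method described above can be sketched as follows, with difflib's ratio as a stand-in for the Levenshtein-based default:

```python
from difflib import SequenceMatcher

def monge_elkan(seq1, seq2, sim_func=None) -> float:
    """Symmetric Monge-Elkan: for each token, take its best-matching token in
    the other sequence; average those maxima in each direction, then average
    the two directed scores."""
    if sim_func is None:
        # Stand-in token similarity; textacy defaults to its Levenshtein metric.
        sim_func = lambda a, b: SequenceMatcher(None, a, b).ratio()
    if not seq1 or not seq2:
        return 0.0
    def directed(a, b):
        return sum(max(sim_func(t1, t2) for t2 in b) for t1 in a) / len(a)
    return (directed(seq1, seq2) + directed(seq2, seq1)) / 2
```

The symmetric averaging matters when sequence lengths differ: ["abc"] vs ["abc", "xyz"] scores 1.0 in one direction but 0.5 in the other, giving 0.75 overall.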