Document Similarity

edits.hamming

Compute the similarity between two strings using Hamming distance, which gives the number of characters at corresponding string indices that differ, including characters in the longer string that have no counterparts in the shorter.

edits.levenshtein

Measure the similarity between two strings using Levenshtein distance, which gives the minimum number of character insertions, deletions, and substitutions needed to change one string into the other.

edits.jaro

Measure the similarity between two strings using Jaro (not Jaro-Winkler) distance, which searches for common characters while taking transpositions and string lengths into account.

edits.character_ngrams

Measure the similarity between two strings using a character n-grams similarity metric, in which strings are transformed into trigrams of alphanumeric-only characters, vectorized and weighted by tf-idf, then compared by cosine similarity.

tokens.jaccard

Measure the similarity between two sequences of strings as sets using the Jaccard index.

tokens.sorensen_dice

Measure the similarity between two sequences of strings as sets using the Sørensen-Dice index, which is similar to the Jaccard index.

tokens.tversky

Measure the similarity between two sequences of strings as sets using the (symmetric) Tversky index, which is a generalization of Jaccard (alpha=0.5, beta=2.0) and Sørensen-Dice (alpha=0.5, beta=1.0).

tokens.cosine

Measure the similarity between two sequences of strings as sets using the Otsuka-Ochiai variation of cosine similarity (which is equivalent to the usual formulation when values are binary).

tokens.bag

Measure the similarity between two sequences of strings (not as sets) using the “bag distance” measure, which can be considered an approximation of edit distance.

sequences.matching_subsequences_ratio

Measure the similarity between two sequences of strings by finding contiguous matching subsequences without any “junk” elements and normalizing by the total number of elements.

hybrid.token_sort_ratio

Measure the similarity between two strings or sequences of strings using Levenshtein distance, but with non-alphanumeric characters removed and the tokens in each sorted alphabetically before comparison.

hybrid.monge_elkan

Measure the similarity between two sequences of strings using the (symmetric) Monge-Elkan method, which takes the average of the maximum pairwise similarity between the tokens in each sequence as compared to those in the other sequence.

Edit-based Metrics

textacy.similarity.edits: Normalized similarity metrics built on edit-based algorithms that compute the number of operations (insertions, deletions, substitutions, …) needed to transform one string into another.

textacy.similarity.edits.hamming(str1: str, str2: str) -> float

Compute the similarity between two strings using Hamming distance, which gives the number of characters at corresponding string indices that differ, including characters in the longer string that have no counterparts in the shorter.

Parameters
  • str1

  • str2

Returns

Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings
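
For example (a minimal usage sketch; the exact score for non-identical strings depends on how the distance is normalized):

>>> from textacy.similarity import edits
>>> edits.hamming("color", "color")   # identical strings score 1.0
>>> edits.hamming("color", "colour")  # one index mismatch plus one unmatched trailing char lower the score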

textacy.similarity.edits.levenshtein(str1: str, str2: str) -> float

Measure the similarity between two strings using Levenshtein distance, which gives the minimum number of character insertions, deletions, and substitutions needed to change one string into the other.

Parameters
  • str1

  • str2

Returns

Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings
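
For example (a minimal sketch; "kitten" → "sitting" requires three edits, so its normalized score falls between 0.0 and 1.0):

>>> from textacy.similarity import edits
>>> edits.levenshtein("kitten", "kitten")   # zero edits needed => 1.0
>>> edits.levenshtein("kitten", "sitting")  # 2 substitutions + 1 insertion reduce the score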

textacy.similarity.edits.jaro(str1: str, str2: str) -> float

Measure the similarity between two strings using Jaro (not Jaro-Winkler) distance, which searches for common characters while taking transpositions and string lengths into account.

Parameters
  • str1

  • str2

Returns

Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings
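
For example (a minimal sketch; Jaro rewards shared characters even when adjacent ones are swapped):

>>> from textacy.similarity import edits
>>> edits.jaro("martha", "marhta")  # only a "th"/"ht" transposition differs, so the score stays high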

textacy.similarity.edits.character_ngrams(str1: str, str2: str) -> float

Measure the similarity between two strings using a character n-grams similarity metric, in which strings are transformed into trigrams of alphanumeric-only characters, vectorized and weighted by tf-idf, then compared by cosine similarity.

Parameters
  • str1

  • str2

Returns

Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings

Note

This method has been used in cross-lingual plagiarism detection and authorship attribution, and seems to work better on longer texts. At the very least, it is slow on shorter texts relative to the other similarity measures.
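
For example (a minimal sketch; per the note above, longer texts are the better fit for this metric):

>>> from textacy.similarity import edits
>>> text1 = "the quick brown fox jumps over the lazy dog " * 10
>>> text2 = "the quick brown fox leaps over the lazy dog " * 10
>>> edits.character_ngrams(text1, text2)  # most character trigrams overlap, so the score is high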

Token-based Metrics

textacy.similarity.tokens: Normalized similarity metrics built on token-based algorithms that identify and count similar tokens between one sequence and another, without relying on the ordering of those tokens.

textacy.similarity.tokens.jaccard(seq1: Iterable[str], seq2: Iterable[str]) -> float

Measure the similarity between two sequences of strings as sets using the Jaccard index.

Parameters
  • seq1

  • seq2

Returns

Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences of strings

Reference:

https://en.wikipedia.org/wiki/Jaccard_index
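
For example, with two shared tokens out of four distinct tokens overall, the index is 2 / 4 = 0.5 (a minimal sketch):

>>> from textacy.similarity import tokens
>>> tokens.jaccard(["a", "b", "c"], ["b", "c", "d"])  # |{b, c}| / |{a, b, c, d}| = 0.5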

textacy.similarity.tokens.sorensen_dice(seq1: Iterable[str], seq2: Iterable[str]) -> float

Measure the similarity between two sequences of strings as sets using the Sørensen-Dice index, which is similar to the Jaccard index.

Parameters
  • seq1

  • seq2

Returns

Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences

Reference:

https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
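
For example, the Sørensen-Dice index doubles the intersection relative to the summed set sizes, so the same pair of token sets as above scores 2·2 / (3 + 3) ≈ 0.667 (a minimal sketch):

>>> from textacy.similarity import tokens
>>> tokens.sorensen_dice(["a", "b", "c"], ["b", "c", "d"])  # 4 / 6 ≈ 0.667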

textacy.similarity.tokens.tversky(seq1: Iterable[str], seq2: Iterable[str], *, alpha: float = 1.0, beta: float = 1.0) -> float

Measure the similarity between two sequences of strings as sets using the (symmetric) Tversky index, which is a generalization of Jaccard (alpha=0.5, beta=2.0) and Sørensen-Dice (alpha=0.5, beta=1.0).

Parameters
  • seq1

  • seq2

  • alpha

  • beta

Returns

Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences

Reference:

https://en.wikipedia.org/wiki/Tversky_index
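
For example (a minimal sketch; per the equivalences stated above, those alpha/beta choices reproduce the Jaccard and Sørensen-Dice scores for the same inputs):

>>> from textacy.similarity import tokens
>>> tokens.tversky(["a", "b", "c"], ["b", "c", "d"])                       # default alpha=1.0, beta=1.0
>>> tokens.tversky(["a", "b", "c"], ["b", "c", "d"], alpha=0.5, beta=1.0)  # Sørensen-Dice-equivalent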

textacy.similarity.tokens.cosine(seq1: Iterable[str], seq2: Iterable[str]) -> float

Measure the similarity between two sequences of strings as sets using the Otsuka-Ochiai variation of cosine similarity (which is equivalent to the usual formulation when values are binary).

Parameters
  • seq1

  • seq2

Returns

Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences

Reference:

https://en.wikipedia.org/wiki/Cosine_similarity#Otsuka-Ochiai_coefficient
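
For example, the Otsuka-Ochiai coefficient is |A ∩ B| / sqrt(|A| · |B|), so the same pair of token sets scores 2 / sqrt(3 · 3) ≈ 0.667 (a minimal sketch):

>>> from textacy.similarity import tokens
>>> tokens.cosine(["a", "b", "c"], ["b", "c", "d"])  # 2 / 3 ≈ 0.667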

textacy.similarity.tokens.bag(seq1: Iterable[str], seq2: Iterable[str]) -> float

Measure the similarity between two sequences of strings (not as sets) using the “bag distance” measure, which can be considered an approximation of edit distance.

Parameters
  • seq1

  • seq2

Returns

Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences

Reference:

Bartolini, Ilaria, Paolo Ciaccia, and Marco Patella. “String matching with metric trees using an approximate distance.” International Symposium on String Processing and Information Retrieval. Springer, Berlin, Heidelberg, 2002.
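
For example (a minimal sketch; unlike the set-based metrics above, repeated tokens affect the score):

>>> from textacy.similarity import tokens
>>> tokens.bag(["a", "b", "b"], ["a", "b"])      # the duplicate "b" counts against the score here...
>>> tokens.jaccard(["a", "b", "b"], ["a", "b"])  # ...but not here, where the sets are equal (1.0)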

Sequence-based Metrics

textacy.similarity.sequences: Normalized similarity metrics built on sequence-based algorithms that identify and measure the subsequences common to both sequences.

textacy.similarity.sequences.matching_subsequences_ratio(seq1: Sequence[str], seq2: Sequence[str], **kwargs) -> float

Measure the similarity between two sequences of strings by finding contiguous matching subsequences without any “junk” elements and normalizing by the total number of elements.

Parameters
  • seq1

  • seq2

  • **kwargs – passed through to the underlying matcher:
      isjunk: Optional[Callable[[str], bool]] = None
      autojunk: bool = True

Returns

Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences of strings

Reference:

https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher.ratio
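
For example (a minimal sketch; per the reference, the score follows difflib's 2 * M / T formula, where M counts matched elements and T the total across both sequences):

>>> from textacy.similarity import sequences
>>> s1 = "the quick brown fox".split()
>>> s2 = "the quick red fox".split()
>>> sequences.matching_subsequences_ratio(s1, s2)  # 2 * 3 / 8 = 0.75
>>> sequences.matching_subsequences_ratio(s1, s2, autojunk=False)  # kwargs pass through to the matcher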

Hybrid Metrics

textacy.similarity.hybrid: Normalized similarity metrics that combine edit-, token-, and/or sequence-based algorithms.

textacy.similarity.hybrid.token_sort_ratio(s1: str | Sequence[str], s2: str | Sequence[str]) -> float

Measure the similarity between two strings or sequences of strings using Levenshtein distance, but with non-alphanumeric characters removed and the tokens in each sorted alphabetically before comparison.

Parameters
  • s1

  • s2

Returns

Similarity between s1 and s2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
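
For example, reordered tokens should score as identical once sorted (a minimal sketch):

>>> from textacy.similarity import hybrid
>>> hybrid.token_sort_ratio("new york mets", "mets new york")  # same tokens after sorting => 1.0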

textacy.similarity.hybrid.monge_elkan(seq1: Sequence[str], seq2: Sequence[str], sim_func: Callable[[str, str], float] = edits.levenshtein) -> float

Measure the similarity between two sequences of strings using the (symmetric) Monge-Elkan method, which takes the average of the maximum pairwise similarity between the tokens in each sequence as compared to those in the other sequence.

Parameters
  • seq1

  • seq2

  • sim_func – Callable that computes a string-to-string similarity metric; by default, the normalized Levenshtein similarity (edits.levenshtein).

Returns

Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences of strings.
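
For example (a minimal sketch; swapping in edits.jaro assumes any (str, str) -> float callable is accepted, per the signature):

>>> from textacy.similarity import edits, hybrid
>>> hybrid.monge_elkan(["le", "chat", "noir"], ["le", "chien", "noir"])
>>> hybrid.monge_elkan(["le", "chat", "noir"], ["le", "chien", "noir"], sim_func=edits.jaro)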