Document Similarity¶
Compute the similarity between two strings using Hamming distance, which gives the number of characters at corresponding string indices that differ, including chars in the longer string that have no correspondents in the shorter. 

Measure the similarity between two strings using Levenshtein distance, which gives the minimum number of character insertions, deletions, and substitutions needed to change one string into the other. 

Measure the similarity between two strings using Jaro (not Jaro-Winkler) distance, which searches for common characters while taking transpositions and string lengths into account.

Measure the similarity between two strings using a character n-grams similarity metric, in which strings are transformed into trigrams of alphanumeric-only characters, vectorized and weighted by tf-idf, then compared by cosine similarity.

Measure the similarity between two sequences of strings as sets using the Jaccard index. 

Measure the similarity between two sequences of strings as sets using the Sørensen-Dice index, which is similar to the Jaccard index.

Measure the similarity between two sequences of strings as sets using the (symmetric) Tversky index, which is a generalization of Jaccard (alpha=0.5, beta=2.0) and Sørensen-Dice (alpha=0.5, beta=1.0).

Measure the similarity between two sequences of strings as sets using the Otsuka-Ochiai variation of cosine similarity (which is equivalent to the usual formulation when values are binary).

Measure the similarity between two sequences of strings (not as sets) using the “bag distance” measure, which can be considered an approximation of edit distance. 

Measure the similarity between two sequences of strings by finding contiguous matching subsequences without any “junk” elements and normalizing by the total number of elements. 

Measure the similarity between two strings or sequences of strings using Levenshtein distance, only with non-alphanumeric characters removed and the ordering of tokens in each sorted before comparison.

Measure the similarity between two sequences of strings using the (symmetric) Monge-Elkan method, which takes the average of the maximum pairwise similarity between the tokens in each sequence as compared to those in the other sequence.
Edit-based Metrics¶
textacy.similarity.edits
: Normalized similarity metrics built on edit-based
algorithms that compute the number of operations (insertions, deletions, …)
needed to transform one string into another.

textacy.similarity.edits.hamming(str1: str, str2: str) → float[source]¶
Compute the similarity between two strings using Hamming distance, which gives the number of characters at corresponding string indices that differ, including chars in the longer string that have no correspondents in the shorter.
 Parameters
str1 –
str2 –
 Returns
Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings
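The normalization described above can be sketched in plain Python; the function name `hamming_similarity` is mine for illustration, not the library's implementation:

```python
def hamming_similarity(str1: str, str2: str) -> float:
    """Sketch: 1 - (Hamming distance / length of the longer string)."""
    if not str1 and not str2:
        return 1.0
    # mismatches at shared indices, plus every extra char in the longer string
    distance = sum(c1 != c2 for c1, c2 in zip(str1, str2)) + abs(len(str1) - len(str2))
    return 1.0 - distance / max(len(str1), len(str2))

print(hamming_similarity("food", "fool"))  # one of four positions differs -> 0.75
```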

textacy.similarity.edits.levenshtein(str1: str, str2: str) → float[source]¶
Measure the similarity between two strings using Levenshtein distance, which gives the minimum number of character insertions, deletions, and substitutions needed to change one string into the other.
 Parameters
str1 –
str2 –
 Returns
Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings
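A minimal sketch of the normalized metric, using the standard dynamic-programming (Wagner-Fischer) recurrence; the helper name is hypothetical, not the library function:

```python
def levenshtein_similarity(str1: str, str2: str) -> float:
    """Sketch: 1 - (Levenshtein edit distance / length of the longer string)."""
    if not str1 and not str2:
        return 1.0
    prev = list(range(len(str2) + 1))
    for i, c1 in enumerate(str1, start=1):
        curr = [i]
        for j, c2 in enumerate(str2, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (c1 != c2),  # substitution (free if chars match)
            ))
        prev = curr
    return 1.0 - prev[-1] / max(len(str1), len(str2))

print(levenshtein_similarity("kitten", "sitting"))  # edit distance 3 over length 7
```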

textacy.similarity.edits.jaro(str1: str, str2: str) → float[source]¶
Measure the similarity between two strings using Jaro (not Jaro-Winkler) distance, which searches for common characters while taking transpositions and string lengths into account.
 Parameters
str1 –
str2 –
 Returns
Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings
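The standard Jaro computation (match window, then transposition count) can be sketched as follows; this is the textbook algorithm, not the library's own code:

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Sketch of plain Jaro similarity (no Winkler prefix bonus)."""
    if not s1 and not s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # characters "match" if equal and within half the longer length of each other
    window = max(len(s1), len(s2)) // 2 - 1
    flags1, flags2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not flags2[j] and s2[j] == c:
                flags1[i] = flags2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # transpositions: matched characters that appear in a different order
    t, k = 0, 0
    for i in range(len(s1)):
        if flags1[i]:
            while not flags2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3
```

For the classic example pair "MARTHA" / "MARHTA", all six characters match with one transposition, giving 17/18 ≈ 0.944.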

textacy.similarity.edits.character_ngrams(str1: str, str2: str) → float[source]¶
Measure the similarity between two strings using a character n-grams similarity metric, in which strings are transformed into trigrams of alphanumeric-only characters, vectorized and weighted by tf-idf, then compared by cosine similarity.
 Parameters
str1 –
str2 –
 Returns
Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings
Note
This method has been used in cross-lingual plagiarism detection and authorship attribution, and seems to work better on longer texts. At the very least, it is slow on shorter texts relative to the other similarity measures.
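A stripped-down sketch of this pipeline, using raw trigram counts rather than the tf-idf weighting the library applies (function name and details are mine):

```python
import re
from collections import Counter
from math import sqrt

def char_trigram_similarity(str1: str, str2: str) -> float:
    """Simplified sketch: cosine similarity over character-trigram counts.
    The library version additionally weights the vectors by tf-idf."""
    def trigrams(s: str) -> Counter:
        # keep only alphanumeric characters, lowercased, then slide a 3-char window
        s = "".join(re.findall(r"[a-z0-9]+", s.lower()))
        return Counter(s[i:i + 3] for i in range(len(s) - 2))

    v1, v2 = trigrams(str1), trigrams(str2)
    dot = sum(v1[g] * v2[g] for g in v1)
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0
```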
Token-based Metrics¶
textacy.similarity.tokens
: Normalized similarity metrics built on token-based
algorithms that identify and count similar tokens between one sequence and another,
and don’t rely on the ordering of those tokens.

textacy.similarity.tokens.jaccard(seq1: Iterable[str], seq2: Iterable[str]) → float[source]¶
Measure the similarity between two sequences of strings as sets using the Jaccard index.
 Parameters
seq1 –
seq2 –
 Returns
Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences of strings
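The Jaccard index is simply intersection over union of the two token sets; a one-liner sketch (hypothetical helper, not the library function):

```python
def jaccard(seq1, seq2) -> float:
    """Sketch: |intersection| / |union| of the two token sets."""
    set1, set2 = set(seq1), set(seq2)
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 1.0

print(jaccard(["cat", "dog", "bird"], ["dog", "bird", "fish"]))  # 2 shared of 4 total -> 0.5
```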

textacy.similarity.tokens.sorensen_dice(seq1: Iterable[str], seq2: Iterable[str]) → float[source]¶
Measure the similarity between two sequences of strings as sets using the Sørensen-Dice index, which is similar to the Jaccard index.
 Parameters
seq1 –
seq2 –
 Returns
Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences
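Sørensen-Dice weights the intersection twice relative to the set sizes, so it scores overlaps somewhat higher than Jaccard; a minimal sketch:

```python
def sorensen_dice(seq1, seq2) -> float:
    """Sketch: 2 * |intersection| / (|set1| + |set2|)."""
    set1, set2 = set(seq1), set(seq2)
    total = len(set1) + len(set2)
    return 2 * len(set1 & set2) / total if total else 1.0
```

On the same pair where Jaccard gives 0.5, Dice gives 2·2/6 ≈ 0.667.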

textacy.similarity.tokens.tversky(seq1: Iterable[str], seq2: Iterable[str], *, alpha: float = 1.0, beta: float = 1.0) → float[source]¶
Measure the similarity between two sequences of strings as sets using the (symmetric) Tversky index, which is a generalization of Jaccard (alpha=0.5, beta=2.0) and Sørensen-Dice (alpha=0.5, beta=1.0).
 Parameters
seq1 –
seq2 –
alpha –
beta –
 Returns
Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences
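The library's exact symmetric formulation is not shown here, but one common symmetric Tversky parametrization is consistent with the special cases quoted above (alpha=0.5 with beta=2.0 recovers Jaccard, beta=1.0 recovers Sørensen-Dice); treat this as an assumed sketch rather than textacy's code:

```python
def tversky(seq1, seq2, *, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Sketch of a symmetric Tversky variant (assumed parametrization)."""
    set1, set2 = set(seq1), set(seq2)
    c = len(set1 & set2)                          # shared elements
    a, b = len(set1 - set2), len(set2 - set1)     # elements unique to each side
    if not (a or b or c):
        return 1.0
    # symmetric variant: weight the smaller and larger differences by alpha
    denom = beta * (alpha * min(a, b) + (1 - alpha) * max(a, b)) + c
    return c / denom if denom else 0.0
```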

textacy.similarity.tokens.cosine(seq1: Iterable[str], seq2: Iterable[str]) → float[source]¶
Measure the similarity between two sequences of strings as sets using the Otsuka-Ochiai variation of cosine similarity (which is equivalent to the usual formulation when values are binary).
 Parameters
seq1 –
seq2 –
 Returns
Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences
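For binary (set-membership) vectors, cosine similarity reduces to the Otsuka-Ochiai coefficient: intersection size over the geometric mean of the set sizes. A minimal sketch:

```python
from math import sqrt

def otsuka_ochiai(seq1, seq2) -> float:
    """Sketch: |intersection| / sqrt(|set1| * |set2|)."""
    set1, set2 = set(seq1), set(seq2)
    if not set1 and not set2:
        return 1.0
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / sqrt(len(set1) * len(set2))
```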

textacy.similarity.tokens.bag(seq1: Iterable[str], seq2: Iterable[str]) → float[source]¶
Measure the similarity between two sequences of strings (not as sets) using the “bag distance” measure, which can be considered an approximation of edit distance.
 Parameters
seq1 –
seq2 –
 Returns
Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences
 Reference:
Bartolini, Ilaria, Paolo Ciaccia, and Marco Patella. “String matching with metric trees using an approximate distance.” International Symposium on String Processing and Information Retrieval. Springer, Berlin, Heidelberg, 2002.
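Bag distance keeps element multiplicities (unlike the set-based metrics above) but ignores order; the usual definition cancels shared occurrences and takes the larger leftover count. A sketch with a normalization analogous to the other metrics (helper name is mine):

```python
from collections import Counter

def bag_similarity(seq1, seq2) -> float:
    """Sketch: 1 - (bag distance / length of the longer sequence)."""
    c1, c2 = Counter(seq1), Counter(seq2)
    # elements left over after cancelling shared occurrences, in each direction
    distance = max(sum((c1 - c2).values()), sum((c2 - c1).values()))
    max_len = max(sum(c1.values()), sum(c2.values()))
    return 1.0 - distance / max_len if max_len else 1.0
```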
Sequence-based Metrics¶
textacy.similarity.sequences
: Normalized similarity metrics built on
sequence-based algorithms that identify and measure the subsequences common to both sequences.

textacy.similarity.sequences.matching_subsequences_ratio(seq1: Sequence[str], seq2: Sequence[str], **kwargs) → float[source]¶
Measure the similarity between two sequences of strings by finding contiguous matching subsequences without any “junk” elements and normalizing by the total number of elements.
 Parameters
seq1 –
seq2 –
**kwargs – isjunk: Optional[Callable[[str], bool]] = None; autojunk: bool = True
 Returns
Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar sequences of strings
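The kwargs (isjunk, autojunk) mirror those of the standard library's difflib.SequenceMatcher, whose ratio() computes exactly this kind of measure: 2·M/T, where M is the number of matched elements and T the total. A sketch under that assumption:

```python
from difflib import SequenceMatcher

def matching_subsequences_ratio(seq1, seq2, isjunk=None, autojunk=True) -> float:
    """Sketch using stdlib difflib: 2 * (matched elements) / (total elements)."""
    return SequenceMatcher(isjunk, seq1, seq2, autojunk).ratio()

# two of three tokens match in order: 2*2 / 6
print(matching_subsequences_ratio(["a", "b", "c"], ["a", "b", "d"]))
```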
Hybrid Metrics¶
textacy.similarity.hybrid
: Normalized similarity metrics that combine edit-,
token-, and/or sequence-based algorithms.

textacy.similarity.hybrid.token_sort_ratio(s1: str | Sequence[str], s2: str | Sequence[str]) → float[source]¶
Measure the similarity between two strings or sequences of strings using Levenshtein distance, only with non-alphanumeric characters removed and the ordering of tokens in each sorted before comparison.
 Parameters
s1 –
s2 –
 Returns
Similarity between s1 and s2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
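A sketch of the normalize-then-compare idea: clean and sort the tokens, then score the joined strings with normalized Levenshtein similarity (helper name and normalization details are assumptions, not the library's code):

```python
import re

def token_sort_ratio(s1, s2) -> float:
    """Sketch: strip non-alphanumerics, sort tokens, then Levenshtein similarity."""
    def normalize(s) -> str:
        tokens = s.split() if isinstance(s, str) else list(s)
        tokens = [re.sub(r"[^a-z0-9]", "", t.lower()) for t in tokens]
        return " ".join(sorted(t for t in tokens if t))

    n1, n2 = normalize(s1), normalize(s2)
    if not n1 and not n2:
        return 1.0
    # normalized Levenshtein similarity on the sorted, cleaned token strings
    prev = list(range(len(n2) + 1))
    for i, c1 in enumerate(n1, start=1):
        curr = [i]
        for j, c2 in enumerate(n2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return 1.0 - prev[-1] / max(len(n1), len(n2))

print(token_sort_ratio("world hello!", "hello world"))  # same tokens after sorting -> 1.0
```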

textacy.similarity.hybrid.monge_elkan(seq1: Sequence[str], seq2: Sequence[str], sim_func: Callable[[str, str], float] = <function levenshtein>) → float[source]¶
Measure the similarity between two sequences of strings using the (symmetric) Monge-Elkan method, which takes the average of the maximum pairwise similarity between the tokens in each sequence as compared to those in the other sequence.
 Parameters
seq1 –
seq2 –
sim_func – Callable that computes a string-to-string similarity metric; by default, normalized Levenshtein similarity.
 Returns
Similarity between seq1 and seq2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
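The symmetric method described above can be sketched directly: average each token's best match in the other sequence, in both directions. The exact-match `sim_func` below is a stand-in for illustration; in practice any string-to-string similarity (such as Levenshtein) would be plugged in:

```python
def monge_elkan(seq1, seq2, sim_func) -> float:
    """Sketch of the symmetric Monge-Elkan average of best pairwise matches."""
    def directed(a, b) -> float:
        # for each token in a, take its best similarity against all tokens in b
        return sum(max(sim_func(t1, t2) for t2 in b) for t1 in a) / len(a)

    # symmetric: average the two directed scores
    return 0.5 * (directed(seq1, seq2) + directed(seq2, seq1))

exact = lambda a, b: 1.0 if a == b else 0.0  # stand-in for a real sim_func
print(monge_elkan(["new", "york"], ["york", "city"], exact))  # one token of two matches -> 0.5
```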