
Text Statistics



textacy.text_stats.api: Compute various basic counts and readability statistics for documents.

class textacy.text_stats.api.TextStats(doc: spacy.tokens.doc.Doc)

Class to compute a variety of basic and readability statistics for a given doc, where each stat is a lazily-computed attribute.

>>> import textacy
>>> import textacy.datasets
>>> import textacy.text_stats
>>> text = next(textacy.datasets.CapitolWords().texts(limit=1))
>>> doc = textacy.make_spacy_doc(text)
>>> ts = textacy.text_stats.TextStats(doc)
>>> ts.n_words
>>> ts.n_unique_words
>>> ts.entropy
>>> ts.flesch_kincaid_grade_level
>>> ts.flesch_reading_ease

Some stats vary by language or are designed for use with specific languages:

>>> text = (
...     "Muchos años después, frente al pelotón de fusilamiento, "
...     "el coronel Aureliano Buendía había de recordar aquella tarde remota "
...     "en que su padre lo llevó a conocer el hielo."
... )
>>> doc = textacy.make_spacy_doc(text, lang="es")
>>> ts = textacy.text_stats.TextStats(doc)
>>> ts.n_words
>>> ts.perspicuity_index
>>> ts.mu_legibility_index

Each of these stats has a stand-alone function in textacy.text_stats.basics or textacy.text_stats.readability, with more detailed info and links in the docstrings – when in doubt, read the docs!


doc – A text document tokenized and (optionally) sentence-segmented by spaCy.
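
One way to read "lazily-computed attribute" is that a stat is computed on first access and then reused. The sketch below shows that pattern with the standard library's functools.cached_property; the LazyStats class and its n_computations counter are hypothetical names for illustration and are not textacy's implementation.

```python
from functools import cached_property


class LazyStats:
    """Hypothetical stand-in for a stats class with lazily computed attributes."""

    def __init__(self, words):
        self.words = list(words)
        self.n_computations = 0  # counts how often the stat is actually computed

    @cached_property
    def n_unique_words(self) -> int:
        # Runs only on first access; the result is cached on the instance.
        self.n_computations += 1
        return len(set(self.words))


stats = LazyStats(["the", "cat", "saw", "the", "dog"])
first = stats.n_unique_words   # computed now
second = stats.n_unique_words  # served from the cache
```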

property n_sents

Number of sentences in document.

property n_words

Number of words in document.

property n_unique_words

Number of unique words in document.

property n_long_words

Number of long words in document.

property n_chars_per_word

Number of characters for each word in document.

property n_chars

Total number of characters in document.

property n_syllables_per_word

Number of syllables for each word in document.

property n_syllables

Total number of syllables in document.

property n_monosyllable_words

Number of monosyllabic words in document.

property n_polysyllable_words

Number of polysyllabic words in document.

property entropy

Entropy of words in document.

property automated_readability_index

Readability test for English-language texts. Higher value => more difficult text.

property automatic_arabic_readability_index

Readability test for Arabic-language texts. Higher value => more difficult text.

property coleman_liau_index

Readability test, not language-specific. Higher value => more difficult text.

property flesch_kincaid_grade_level

Readability test, not language-specific. Higher value => more difficult text.

property flesch_reading_ease

Readability test with several language-specific formulations. Higher value => easier text.

property gulpease_index

Readability test for Italian-language texts. Higher value => easier text.

property gunning_fog_index

Readability test, not language-specific. Higher value => more difficult text.

property lix

Readability test for both English- and non-English-language texts. Higher value => more difficult text.

property mu_legibility_index

Readability test for Spanish-language texts. Higher value => easier text.

property perspicuity_index

Readability test for Spanish-language texts. Higher value => easier text.

property smog_index

Readability test, not language-specific. Higher value => more difficult text.

property wiener_sachtextformel

Readability test for German-language texts. Higher value => more difficult text.

textacy.text_stats.api.load_hyphenator(lang: str)

Load an object that hyphenates words at valid points, as used in LaTeX typesetting.



lang – Standard 2-letter language abbreviation. To get a list of valid values:

>>> import pyphen; pyphen.LANGUAGES



textacy.text_stats.basics: Functions for computing basic text statistics.

textacy.text_stats.basics.n_words(doc_or_words: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]]) → int

Compute the number of words in a document.


doc_or_words – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Tokens, all are included as-is.

textacy.text_stats.basics.n_unique_words(doc_or_words: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]]) → int

Compute the number of unique words in a document.


doc_or_words – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Tokens, all are included as-is.

textacy.text_stats.basics.n_chars_per_word(doc_or_words: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]]) → Tuple[int, ...]

Compute the number of characters for each word in a document.


doc_or_words – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Tokens, all are included as-is.

textacy.text_stats.basics.n_chars(n_chars_per_word: Tuple[int, ...]) → int

Compute the total number of characters in a document.


n_chars_per_word – Number of characters per word in a given document, as computed by n_chars_per_word().

textacy.text_stats.basics.n_long_words(n_chars_per_word: Tuple[int, ...], min_n_chars: int = 7) → int

Compute the number of long words in a document.

  • n_chars_per_word – Number of characters per word in a given document, as computed by n_chars_per_word().

  • min_n_chars – Minimum number of characters required for a word to be considered “long”.
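
The character-based counts chain together: per-word lengths feed both the total character count and the long-word count. A minimal sketch on plain strings, assuming the words have already been extracted; the helper names chars_per_word and count_long_words are illustrative, not the library's functions.

```python
from typing import Iterable, Tuple


def chars_per_word(words: Iterable[str]) -> Tuple[int, ...]:
    """Length in characters of each word, in order."""
    return tuple(len(word) for word in words)


def count_long_words(n_chars_per_word: Tuple[int, ...], min_n_chars: int = 7) -> int:
    """Number of words with at least min_n_chars characters."""
    return sum(1 for n in n_chars_per_word if n >= min_n_chars)


words = ["readability", "is", "a", "measurable", "property"]
lengths = chars_per_word(words)
total_chars = sum(lengths)
n_long = count_long_words(lengths)
```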

textacy.text_stats.basics.n_syllables_per_word(doc_or_words: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]], lang: str) → Tuple[int, ...]

Compute the number of syllables for each word in a document.


doc_or_words – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Tokens, all are included as-is.


Identifying syllables is tricky; this method relies on hyphenation, which is more straightforward but doesn’t always give the correct number of syllables. While all hyphenation points fall on syllable divisions, not all syllable divisions are valid hyphenation points.
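
As a rough illustration of the hyphenation approach, a word's syllable count can be estimated as one more than its number of hyphenation points. The helper below and its pre-hyphenated input strings are assumptions for illustration; in practice the hyphenation points would come from a hyphenator such as the one returned by load_hyphenator().

```python
def syllables_from_hyphenated(hyphenated_word: str) -> int:
    """Estimate syllables as (number of hyphenation points) + 1."""
    return hyphenated_word.count("-") + 1


# Illustrative input: "hyphenation" split at two hyphenation points.
# The spoken word has four syllables (hy-phen-a-tion), so the estimate
# undercounts, which is the caveat described in the note above.
estimate = syllables_from_hyphenated("hy-phen-ation")
```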

textacy.text_stats.basics.n_syllables(n_syllables_per_word: Tuple[int, ...]) → int

Compute the total number of syllables in a document.


n_syllables_per_word – Number of syllables per word in a given document, as computed by n_syllables_per_word().

textacy.text_stats.basics.n_monosyllable_words(n_syllables_per_word: Tuple[int, ...]) → int

Compute the number of monosyllabic words in a document.


n_syllables_per_word – Number of syllables per word in a given document, as computed by n_syllables_per_word().

textacy.text_stats.basics.n_polysyllable_words(n_syllables_per_word: Tuple[int, ...], min_n_syllables: int = 3) → int

Compute the number of polysyllabic words in a document.

  • n_syllables_per_word – Number of syllables per word in a given document, as computed by n_syllables_per_word().

  • min_n_syllables – Minimum number of syllables required for a word to be considered “polysyllabic”.
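
The syllable-based counts all derive from one per-word tuple. A minimal sketch, assuming the per-word syllable counts are already available; the function names are illustrative.

```python
from typing import Tuple


def count_monosyllable_words(n_syllables_per_word: Tuple[int, ...]) -> int:
    """Number of words with exactly one syllable."""
    return sum(1 for n in n_syllables_per_word if n == 1)


def count_polysyllable_words(
    n_syllables_per_word: Tuple[int, ...], min_n_syllables: int = 3
) -> int:
    """Number of words with at least min_n_syllables syllables."""
    return sum(1 for n in n_syllables_per_word if n >= min_n_syllables)


syllables = (1, 3, 1, 2, 4)  # hypothetical per-word syllable counts
total_syllables = sum(syllables)
n_mono = count_monosyllable_words(syllables)
n_poly = count_polysyllable_words(syllables)
```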

textacy.text_stats.basics.n_sents(doc: spacy.tokens.doc.Doc) → int

Compute the number of sentences in a document.


If doc has not been segmented into sentences, it will be modified in-place using spaCy’s rule-based Sentencizer pipeline component before counting.

textacy.text_stats.basics.entropy(doc_or_words: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]]) → float

Compute the entropy of words in a document.


doc_or_words – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Tokens, all are included as-is.
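
Word entropy is usually understood as the Shannon entropy of the word frequency distribution. The sketch below assumes base-2 logarithms and plain string words; this section does not state which base or word normalization the library uses.

```python
import math
from collections import Counter
from typing import Iterable


def word_entropy(words: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the empirical word distribution."""
    counts = Counter(words)
    n_words = sum(counts.values())
    return -sum(
        (count / n_words) * math.log2(count / n_words)
        for count in counts.values()
    )


uniform = word_entropy(["a", "b", "a", "b"])   # two equally likely words
constant = word_entropy(["a", "a", "a", "a"])  # a single repeated word
```

A text that repeats one word carries no uncertainty, while a text that spreads evenly over more distinct words scores higher.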

textacy.text_stats.readability: Functions for computing various measures of text “readability”.

textacy.text_stats.readability.automated_readability_index(n_chars: int, n_words: int, n_sents: int) → float

Readability test for English-language texts, particularly for technical writing, whose value estimates the U.S. grade level required to understand a text. Similar to several other tests (e.g. flesch_kincaid_grade_level()) but, like coleman_liau_index(), uses characters per word instead of syllables. Higher value => more difficult text.
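
The commonly published form of this index is 4.71 * (characters per word) + 0.5 * (words per sentence) - 21.43. A minimal sketch, assuming those published coefficients; the library's exact constants are not shown in this section and may differ.

```python
def ari_score(n_chars: int, n_words: int, n_sents: int) -> float:
    """Automated readability index from raw counts (published coefficients)."""
    return 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sents) - 21.43


# Hypothetical counts: 500 characters, 100 words, 5 sentences.
score = ari_score(n_chars=500, n_words=100, n_sents=5)
```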



textacy.text_stats.readability.automatic_arabic_readability_index(n_chars: int, n_words: int, n_sents: int) → float

Readability test for Arabic-language texts based on number of characters and average word and sentence lengths. Higher value => more difficult text.


Al Tamimi, Abdel Karim, et al. “AARI: automatic arabic readability index.” Int. Arab J. Inf. Technol. 11.4 (2014): 370-378.

textacy.text_stats.readability.coleman_liau_index(n_chars: int, n_words: int, n_sents: int) → float

Readability test whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level() and smog_index(), but using characters per word instead of syllables. Higher value => more difficult text.
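
The Coleman-Liau index is usually published as 0.0588 * L - 0.296 * S - 15.8, where L is the number of letters per 100 words and S is the number of sentences per 100 words. A minimal sketch, assuming those rounded coefficients; the library may use more precise constants.

```python
def coleman_liau_score(n_chars: int, n_words: int, n_sents: int) -> float:
    """Coleman-Liau index from raw counts (rounded published coefficients)."""
    letters_per_100_words = 100 * n_chars / n_words
    sents_per_100_words = 100 * n_sents / n_words
    return 0.0588 * letters_per_100_words - 0.296 * sents_per_100_words - 15.8


# Hypothetical counts: 500 characters, 100 words, 5 sentences.
score = coleman_liau_score(n_chars=500, n_words=100, n_sents=5)
```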



textacy.text_stats.readability.flesch_kincaid_grade_level(n_syllables: int, n_words: int, n_sents: int) → float

Readability test used widely in education, whose value estimates the U.S. grade level / number of years of education required to understand a text. Higher value => more difficult text.
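
The standard Flesch-Kincaid grade level formula is 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59. A minimal sketch from raw counts; the function name and sample counts are illustrative.

```python
def fk_grade_level(n_syllables: int, n_words: int, n_sents: int) -> float:
    """Flesch-Kincaid grade level from raw counts."""
    return 0.39 * (n_words / n_sents) + 11.8 * (n_syllables / n_words) - 15.59


# Hypothetical counts: 150 syllables, 100 words, 5 sentences.
grade = fk_grade_level(n_syllables=150, n_words=100, n_sents=5)
```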



textacy.text_stats.readability.flesch_reading_ease(n_syllables: int, n_words: int, n_sents: int, *, lang: Optional[str] = None) → float

Readability test used as a general-purpose standard in several languages, based on a weighted combination of avg. sentence length and avg. word length. Values usually fall in the range [0, 100], but may be arbitrarily negative in extreme cases. Higher value => easier text.


Coefficients in this formula are language-dependent; if lang is None, the English-language formulation is used.


  • English: https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_reading_ease

  • German: https://de.wikipedia.org/wiki/Lesbarkeitsindex#Flesch-Reading-Ease

  • Spanish: Fernández-Huerta formulation

  • French: ?

  • Italian: https://it.wikipedia.org/wiki/Formula_di_Flesch

  • Dutch: ?

  • Portuguese: https://pt.wikipedia.org/wiki/Legibilidade_de_Flesch

  • Turkish: Atesman formulation

  • Russian: https://ru.wikipedia.org/wiki/%D0%98%D0%BD%D0%B4%D0%B5%D0%BA%D1%81_%D1%83%D0%B4%D0%BE%D0%B1%D0%BE%D1%87%D0%B8%D1%82%D0%B0%D0%B5%D0%BC%D0%BE%D1%81%D1%82%D0%B8
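
For the English-language formulation, the standard formula is 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word). The sketch below covers only that case; the coefficients for other languages are not reproduced here, and the function name is illustrative.

```python
def flesch_reading_ease_en(n_syllables: int, n_words: int, n_sents: int) -> float:
    """English-language Flesch reading ease from raw counts."""
    return 206.835 - 1.015 * (n_words / n_sents) - 84.6 * (n_syllables / n_words)


# Hypothetical counts for an ordinary text.
typical = flesch_reading_ease_en(n_syllables=150, n_words=100, n_sents=5)

# One 50-word sentence averaging 3 syllables per word drives the value negative.
extreme = flesch_reading_ease_en(n_syllables=150, n_words=50, n_sents=1)
```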

textacy.text_stats.readability.gulpease_index(n_chars: int, n_words: int, n_sents: int) → float

Readability test for Italian-language texts, whose value is in the range [0, 100] similar to flesch_reading_ease(). Higher value => easier text.
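
The Gulpease index is commonly given as 89 + (300 * sentences - 10 * letters) / words. A minimal sketch from raw counts, assuming that published form; the function name and sample counts are illustrative.

```python
def gulpease_score(n_chars: int, n_words: int, n_sents: int) -> float:
    """Gulpease index from raw counts (published formula)."""
    return 89 + (300 * n_sents - 10 * n_chars) / n_words


# Hypothetical counts: 500 characters, 100 words, 5 sentences.
score = gulpease_score(n_chars=500, n_words=100, n_sents=5)
```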



textacy.text_stats.readability.gunning_fog_index(n_words: int, n_polysyllable_words: int, n_sents: int) → float

Readability test whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level() and smog_index(). Higher value => more difficult text.
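
The standard Gunning fog formula is 0.4 * ((words per sentence) + 100 * (polysyllabic words / words)). A minimal sketch from raw counts; the function name and sample counts are illustrative.

```python
def gunning_fog_score(n_words: int, n_polysyllable_words: int, n_sents: int) -> float:
    """Gunning fog index from raw counts."""
    return 0.4 * ((n_words / n_sents) + 100 * (n_polysyllable_words / n_words))


# Hypothetical counts: 100 words, 10 of them polysyllabic, 5 sentences.
score = gunning_fog_score(n_words=100, n_polysyllable_words=10, n_sents=5)
```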



textacy.text_stats.readability.lix(n_words: int, n_long_words: int, n_sents: int) → float

Readability test commonly used in Sweden on both English- and non-English-language texts, whose value estimates the difficulty of reading a foreign text. Higher value => more difficult text.
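
LIX is the sum of the average sentence length and the percentage of long words: (words / sentences) + 100 * (long words / words). A minimal sketch from raw counts; the function name and sample counts are illustrative.

```python
def lix_score(n_words: int, n_long_words: int, n_sents: int) -> float:
    """LIX from raw counts: mean sentence length plus percentage of long words."""
    return (n_words / n_sents) + 100 * (n_long_words / n_words)


# Hypothetical counts: 100 words, 25 of them long, 5 sentences.
score = lix_score(n_words=100, n_long_words=25, n_sents=5)
```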



textacy.text_stats.readability.mu_legibility_index(n_chars_per_word: Collection[int]) → float

Readability test for Spanish-language texts based on number of words and the mean and variance of their lengths in characters, whose value is in the range [0, 100]. Higher value => easier text.


Muñoz, M., and J. Muñoz. “Legibilidad Mµ.” Viña del Mar: CHL (2006).
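
The µ index is usually written as (n / (n - 1)) * (mean / variance) * 100, where n is the number of words and the mean and variance are taken over word lengths in characters. The sketch below assumes the sample variance (statistics.variance); whether the library uses the sample or population variance is not stated in this section.

```python
import statistics
from typing import Collection


def mu_legibility_score(n_chars_per_word: Collection[int]) -> float:
    """µ legibility from per-word character counts (sample variance assumed)."""
    n_words = len(n_chars_per_word)
    mean = statistics.mean(n_chars_per_word)
    variance = statistics.variance(n_chars_per_word)
    return (n_words / (n_words - 1)) * (mean / variance) * 100


# Hypothetical word lengths: mean 5, sample variance 10.
score = mu_legibility_score((1, 3, 5, 7, 9))
```

The formula needs at least two words with differing lengths; otherwise the variance is undefined or zero.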

textacy.text_stats.readability.perspicuity_index(n_syllables: int, n_words: int, n_sents: int) → float

Readability test for Spanish-language texts, whose value is in the range [0, 100]; very similar to the Spanish-specific formulation of flesch_reading_ease(), but included additionally since it’s become a common readability standard. Higher value => easier text.


Pazos, Francisco Szigriszt. Sistemas predictivos de legibilidad del mensaje escrito: fórmula de perspicuidad. Universidad Complutense de Madrid, Servicio de Reprografía, 1993.
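
The Szigriszt-Pazos perspicuity formula is commonly given as 206.835 - 62.3 * (syllables per word) - (words per sentence). A minimal sketch, assuming that published form; the function name and sample counts are illustrative.

```python
def perspicuity_score(n_syllables: int, n_words: int, n_sents: int) -> float:
    """Szigriszt-Pazos perspicuity from raw counts (published formula)."""
    return 206.835 - 62.3 * (n_syllables / n_words) - (n_words / n_sents)


# Hypothetical counts: 200 syllables, 100 words, 5 sentences.
score = perspicuity_score(n_syllables=200, n_words=100, n_sents=5)
```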

textacy.text_stats.readability.smog_index(n_polysyllable_words: int, n_sents: int) → float

Readability test commonly used in medical writing and the healthcare industry, whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level(), and intended as a substitute for gunning_fog_index(). Higher value => more difficult text.
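
The standard SMOG formula is 1.0430 * sqrt(polysyllabic words * 30 / sentences) + 3.1291. A minimal sketch from raw counts; the function name and sample counts are illustrative.

```python
import math


def smog_score(n_polysyllable_words: int, n_sents: int) -> float:
    """SMOG grade from raw counts."""
    return 1.0430 * math.sqrt(30 * n_polysyllable_words / n_sents) + 3.1291


# Hypothetical counts: 30 polysyllabic words across 30 sentences.
score = smog_score(n_polysyllable_words=30, n_sents=30)
```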



textacy.text_stats.readability.wiener_sachtextformel(n_words: int, n_polysyllable_words: int, n_monosyllable_words: int, n_long_words: int, n_sents: int, *, variant: int = 1) → float

Readability test for German-language texts, whose value estimates the grade level required to understand a text. Higher value => more difficult text.
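
The first variant of the Wiener Sachtextformel is commonly published as 0.1935 * MS + 0.1672 * SL + 0.1297 * IW - 0.0327 * ES - 0.875, where MS, IW and ES are the percentages of polysyllabic, long and monosyllabic words, and SL is the mean sentence length in words. The sketch below covers only that variant and assumes those published coefficients; the other values of variant are not reproduced here.

```python
def wiener_sachtextformel_v1(
    n_words: int,
    n_polysyllable_words: int,
    n_monosyllable_words: int,
    n_long_words: int,
    n_sents: int,
) -> float:
    """First Wiener Sachtextformel from raw counts (published coefficients)."""
    ms = 100 * n_polysyllable_words / n_words  # % words with 3+ syllables
    sl = n_words / n_sents                     # mean sentence length
    iw = 100 * n_long_words / n_words          # % long words
    es = 100 * n_monosyllable_words / n_words  # % one-syllable words
    return 0.1935 * ms + 0.1672 * sl + 0.1297 * iw - 0.0327 * es - 0.875


# Hypothetical counts for a 100-word, 5-sentence text.
score = wiener_sachtextformel_v1(
    n_words=100,
    n_polysyllable_words=20,
    n_monosyllable_words=50,
    n_long_words=30,
    n_sents=5,
)
```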






textacy.similarity: Collection of semantic + lexical similarity metrics between tokens, strings, and sequences thereof, returning values between 0.0 (totally dissimilar) and 1.0 (totally similar).

textacy.similarity.word_movers(doc1: spacy.tokens.doc.Doc, doc2: spacy.tokens.doc.Doc, metric: str = 'cosine') → float

Measure the semantic similarity between two documents using Word Mover’s Distance.

  • doc1

  • doc2

  • metric ({"cosine", "euclidean", "l1", "l2", "manhattan"})


Similarity between doc1 and doc2 in the interval [0.0, 1.0], where larger values correspond to more similar documents.


  • Ofir Pele and Michael Werman, “A linear time histogram metric for improved SIFT matching,” in Computer Vision - ECCV 2008, Marseille, France, 2008.

  • Ofir Pele and Michael Werman, “Fast and robust earth mover’s distances,” in Proc. 2009 IEEE 12th Int. Conf. on Computer Vision, Kyoto, Japan, 2009.

  • Kusner, Matt J., et al. “From word embeddings to document distances.” Proceedings of the 32nd International Conference on Machine Learning (ICML 2015). 2015. http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf

textacy.similarity.word2vec(obj1: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span, spacy.tokens.token.Token], obj2: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span, spacy.tokens.token.Token]) → float

Measure the semantic similarity between one spaCy Doc, Span, Token, or Lexeme and another like object using the cosine distance between the objects’ (average) word2vec vectors.

  • obj1

  • obj2


Similarity between obj1 and obj2 in the interval [0.0, 1.0], where larger values correspond to more similar objects.

textacy.similarity.jaccard(obj1: Union[str, Sequence[str]], obj2: Union[str, Sequence[str]], fuzzy_match: bool = False, match_threshold: float = 0.8) → float

Measure the similarity between two strings or sequences of strings using Jaccard distance, with optional fuzzy matching of not-identical pairs when obj1 and obj2 are sequences of strings.

  • obj1

  • obj2 – If str, both inputs are treated as sequences of characters, in which case fuzzy matching is not permitted

  • fuzzy_match – If True, allow for fuzzy matching in addition to the usual identical matching of pairs between input vectors

  • match_threshold – Value in the interval [0.0, 1.0]; fuzzy comparisons with a score >= this value will be considered matches


Similarity between obj1 and obj2 in the interval [0.0, 1.0], where larger values correspond to more similar strings or sequences of strings.

  • ValueError – if fuzzy_match is True but obj1 and obj2 are strings, or if match_threshold is not a valid float
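
Without fuzzy matching, Jaccard similarity reduces to the size of the set intersection divided by the size of the set union. A minimal sketch of that exact-match case; the function name is illustrative, returning 0.0 for two empty inputs is an assumption, and fuzzy matching is not covered.

```python
from typing import Iterable


def jaccard_similarity(items1: Iterable[str], items2: Iterable[str]) -> float:
    """Exact-match Jaccard similarity: |intersection| / |union|."""
    set1, set2 = set(items1), set(items2)
    union = set1 | set2
    if not union:
        return 0.0  # assumption: two empty inputs are treated as dissimilar
    return len(set1 & set2) / len(union)


# Strings are treated as sequences of characters.
char_level = jaccard_similarity("night", "nacht")

# Sequences of strings are compared element by element.
token_level = jaccard_similarity(["red", "green"], ["green", "blue"])
```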

textacy.similarity.levenshtein(str1: str, str2: str) → float

Measure the similarity between two strings using Levenshtein distance, which gives the minimum number of character insertions, deletions, and substitutions needed to change one string into the other.

  • str1

  • str2


Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings
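
The edit distance can be computed with the classic dynamic-programming recurrence and then mapped into [0.0, 1.0]. The sketch below assumes the normalization 1 - distance / max(len(str1), len(str2)); this section does not state which normalization the library applies, and the function names are illustrative.

```python
def levenshtein_distance(str1: str, str2: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(str2) + 1))
    for i, char1 in enumerate(str1, start=1):
        current = [i]
        for j, char2 in enumerate(str2, start=1):
            substitution_cost = 0 if char1 == char2 else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def levenshtein_similarity(str1: str, str2: str) -> float:
    """Distance rescaled so that identical strings score 1.0 (assumed scaling)."""
    longest = max(len(str1), len(str2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(str1, str2) / longest


distance = levenshtein_distance("kitten", "sitting")
similarity = levenshtein_similarity("kitten", "sitting")
```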

textacy.similarity.token_sort_ratio(str1: str, str2: str) → float

Measure the similarity between two strings based on levenshtein(), only with non-alphanumeric characters removed and the ordering of words in each string sorted before comparison.

  • str1

  • str2


Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.
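
The preprocessing step is what distinguishes this measure from plain levenshtein(): punctuation is dropped and words are sorted, so word order and punctuation stop affecting the score. A minimal sketch of that normalization only; the helper name and the ASCII-only regular expression are assumptions, and the distance computation is left out.

```python
import re


def sort_tokens(text: str) -> str:
    """Drop non-alphanumeric characters, then sort the remaining words."""
    cleaned = re.sub(r"[^0-9A-Za-z\s]", " ", text)
    return " ".join(sorted(cleaned.split()))


left = sort_tokens("world, hello!")
right = sort_tokens("hello world")
```

Because both inputs normalize to the same string, an edit-distance comparison of the normalized forms would report them as identical.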

textacy.similarity.character_ngrams(str1: str, str2: str) → float

Measure the similarity between two strings using a character ngrams similarity metric, in which strings are transformed into trigrams of alnum-only characters, vectorized and weighted by tf-idf, then compared by cosine similarity.

  • str1

  • str2


Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings


This method has been used in cross-lingual plagiarism detection and authorship attribution, and seems to work better on longer texts. At the very least, it is slow on shorter texts relative to the other similarity measures.

Semantic Networks

textacy.network: Represent documents as semantic networks, where nodes are individual terms or whole sentences and edges are weighted by the strength of their co-occurrence or similarity, respectively.

textacy.network.terms_to_semantic_network(terms: Union[Sequence[str], Sequence[spacy.tokens.token.Token]], *, normalize: Union[str, bool, Callable[[spacy.tokens.token.Token], str]] = 'lemma', window_width: int = 10, edge_weighting: str = 'cooc_freq') → networkx.classes.graph.Graph

Transform an ordered list of non-overlapping terms into a semantic network, where each term is represented by a node with weighted edges linking it to other terms that co-occur within window_width terms of itself.

  • terms

  • normalize

    If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in terms; if a callable, must accept a Token and return a str, e.g. textacy.spacier.utils.get_normalized_text().


    This is applied to the elements of terms only if it’s a list of Token.

  • window_width – Size of sliding window over terms that determines which are said to co-occur. If 2, only immediately adjacent terms have edges in the returned network.

  • edge_weighting – If ‘cooc_freq’, the nodes for all co-occurring terms are connected by edges with weight equal to the number of times they co-occurred within a sliding window; if ‘binary’, all such edges have weight = 1.


Networkx graph whose nodes represent individual terms, connected by edges based on term co-occurrence with weights determined by edge_weighting.


  • Be sure to filter out stopwords, punctuation, certain parts of speech, etc. from the terms list before passing it to this function

  • Multi-word terms, such as named entities and compound nouns, must be merged into single strings or Tokens beforehand

  • If terms are already strings, be sure to have normalized them so that like terms are counted together; for example, by applying textacy.spacier.utils.get_normalized_text()
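
The edge weights described above can be sketched without a graph library by counting term pairs inside each sliding window. This is an illustration under stated assumptions: the function name is hypothetical, terms are plain strings that have already been normalized, and the way overlapping windows are counted may differ from the library's exact behavior.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Sequence, Tuple


def cooccurrence_weights(
    terms: Sequence[str], window_width: int = 10, edge_weighting: str = "cooc_freq"
) -> Dict[Tuple[str, str], int]:
    """Map each co-occurring term pair to an edge weight."""
    if window_width < 2:
        raise ValueError("window_width must be >= 2")
    n_windows = max(len(terms) - window_width + 1, 1)
    counts: Counter = Counter()
    for start in range(n_windows):
        window = terms[start : start + window_width]
        for term1, term2 in combinations(window, 2):
            if term1 != term2:  # no self-loops
                counts[tuple(sorted((term1, term2)))] += 1
    if edge_weighting == "binary":
        return {pair: 1 for pair in counts}
    return dict(counts)


terms = ["a", "b", "c", "a", "b"]
adjacent_only = cooccurrence_weights(terms, window_width=2)
binary = cooccurrence_weights(terms, window_width=2, edge_weighting="binary")
```

Each key of the resulting mapping corresponds to an undirected edge, so it could be loaded into a graph by adding one weighted edge per pair.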

textacy.network.sents_to_semantic_network(sents: Union[Sequence[str], Sequence[spacy.tokens.span.Span]], *, normalize: Union[str, bool, Callable[[spacy.tokens.token.Token], str]] = 'lemma', edge_weighting: str = 'cosine') → networkx.classes.graph.Graph

Transform a list of sentences into a semantic network, where each sentence is represented by a node with edges linking it to other sentences weighted by the (cosine or jaccard) similarity of their constituent words.

  • sents

  • normalize

    If ‘lemma’, lemmatize words in sents; if ‘lower’, lowercase words in sents; if falsy, use the form of words as they appear in sents; if a callable, must accept a spacy.tokens.Token and return a str, e.g. textacy.spacier.utils.get_normalized_text().


    This is applied to the elements of sents only if it’s a list of Spans.

  • edge_weighting – Similarity metric to use for weighting edges between sentences. If ‘cosine’, use the cosine similarity between sentences represented as tf-idf word vectors; if ‘jaccard’, use the set intersection divided by the set union of all words in a given sentence pair.


Networkx graph whose nodes are the integer indexes of the sentences in sents, not the actual text of the sentences. Edges connect every node, with weights determined by edge_weighting.


  • If passing sentences as strings, be sure to filter out stopwords, punctuation, certain parts of speech, etc. beforehand

  • Consider normalizing the strings so that like terms are counted together (see textacy.spacier.utils.get_normalized_text())
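
For the ‘jaccard’ weighting, every pair of sentences gets an edge whose weight is the word-set intersection divided by the word-set union, with nodes identified by sentence index. A minimal sketch of those weights only, assuming sentences are already tokenized into lists of normalized words; the function name is hypothetical and the tf-idf cosine weighting is not covered.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def sentence_jaccard_edges(
    sents: Sequence[Sequence[str]],
) -> Dict[Tuple[int, int], float]:
    """Map each pair of sentence indexes to the Jaccard similarity of their words."""
    word_sets = [set(sent) for sent in sents]
    edges = {}
    for (i, words1), (j, words2) in combinations(enumerate(word_sets), 2):
        union = words1 | words2
        edges[(i, j)] = len(words1 & words2) / len(union) if union else 0.0
    return edges


sents = [["cat", "sat"], ["cat", "ran"], ["dog", "ran"]]
edges = sentence_jaccard_edges(sents)
```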