Text Statistics¶
TextStats | Class to compute a variety of basic and readability statistics for a given doc, where each stat is a lazily-computed attribute.
n_sents | Compute the number of sentences in a document.
n_words | Compute the number of words in a document.
n_unique_words | Compute the number of unique words in a document.
n_chars_per_word | Compute the number of characters for each word in a document.
n_chars | Compute the total number of characters in a document.
n_long_words | Compute the number of long words in a document.
n_syllables_per_word | Compute the number of syllables for each word in a document.
n_syllables | Compute the total number of syllables in a document.
n_monosyllable_words | Compute the number of monosyllabic words in a document.
n_polysyllable_words | Compute the number of polysyllabic words in a document.
entropy | Compute the entropy of words in a document.
automated_readability_index | Readability test for English-language texts, particularly for technical writing, whose value estimates the U.S. grade level required to understand a text.
automatic_arabic_readability_index | Readability test for Arabic-language texts based on number of characters and average word and sentence lengths.
coleman_liau_index | Readability test whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level() and smog_index(), but using characters per word instead of syllables.
flesch_kincaid_grade_level | Readability test used widely in education, whose value estimates the U.S. grade level / number of years of education required to understand a text.
flesch_reading_ease | Readability test used as a general-purpose standard in several languages, based on a weighted combination of avg. sentence length and avg. word length.
gulpease_index | Readability test for Italian-language texts, whose value is in the range [0, 100], similar to flesch_reading_ease().
gunning_fog_index | Readability test whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level() and smog_index().
lix | Readability test commonly used in Sweden on both English- and non-English-language texts, whose value estimates the difficulty of reading a foreign text.
mu_legibility_index | Readability test for Spanish-language texts based on number of words and the mean and variance of their lengths in characters, whose value is in the range [0, 100].
perspicuity_index | Readability test for Spanish-language texts, whose value is in the range [0, 100]; very similar to the Spanish-specific formulation of flesch_reading_ease().
smog_index | Readability test commonly used in medical writing and the healthcare industry, whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level() and intended as a substitute for gunning_fog_index().
wiener_sachtextformel | Readability test for German-language texts, whose value estimates the grade level required to understand a text.
textacy.text_stats.api: Compute basic and readability statistics of documents.

class textacy.text_stats.api.TextStats(doc: spacy.tokens.doc.Doc)[source]¶
Class to compute a variety of basic and readability statistics for a given doc, where each stat is a lazily-computed attribute.
>>> text = next(textacy.datasets.CapitolWords().texts(limit=1))
>>> doc = textacy.make_spacy_doc(text)
>>> ts = textacy.text_stats.TextStats(doc)
>>> ts.n_words
136
>>> ts.n_unique_words
80
>>> ts.entropy
6.00420319027642
>>> ts.flesch_kincaid_grade_level
11.817647058823532
>>> ts.flesch_reading_ease
50.707745098039254
Some stats vary by language or are designed for use with specific languages:
>>> text = (
...     "Muchos años después, frente al pelotón de fusilamiento, "
...     "el coronel Aureliano Buendía había de recordar aquella tarde remota "
...     "en que su padre lo llevó a conocer el hielo."
... )
>>> doc = textacy.make_spacy_doc(text, lang="es")
>>> ts = textacy.text_stats.TextStats(doc)
>>> ts.n_words
28
>>> ts.perspicuity_index
56.46000000000002
>>> ts.mu_legibility_index
71.18644067796609
Each of these stats has a stand-alone function in textacy.text_stats.basics and textacy.text_stats.readability with more detailed info and links in the docstrings – when in doubt, read the docs!

- Parameters
doc – A text document tokenized and (optionally) sentence-segmented by spaCy.
property n_sents¶
Number of sentences in document.

property n_words¶
Number of words in document.

property n_unique_words¶
Number of unique words in document.

property n_long_words¶
Number of long words in document.

property n_chars_per_word¶
Number of characters for each word in document.

property n_chars¶
Total number of characters in document.

property n_syllables_per_word¶
Number of syllables for each word in document.

property n_syllables¶
Total number of syllables in document.

property n_monosyllable_words¶
Number of monosyllabic words in document.

property n_polysyllable_words¶
Number of polysyllabic words in document.

property entropy¶
Entropy of words in document.
property automated_readability_index¶
Readability test for English-language texts. Higher value => more difficult text.

property automatic_arabic_readability_index¶
Readability test for Arabic-language texts. Higher value => more difficult text.

property coleman_liau_index¶
Readability test, not language-specific. Higher value => more difficult text.

property flesch_kincaid_grade_level¶
Readability test, not language-specific. Higher value => more difficult text.

property flesch_reading_ease¶
Readability test with several language-specific formulations. Higher value => easier text.

property gulpease_index¶
Readability test for Italian-language texts. Higher value => easier text.

property gunning_fog_index¶
Readability test, not language-specific. Higher value => more difficult text.

property lix¶
Readability test for both English- and non-English-language texts. Higher value => more difficult text.

property mu_legibility_index¶
Readability test for Spanish-language texts. Higher value => easier text.

property perspicuity_index¶
Readability test for Spanish-language texts. Higher value => easier text.

property smog_index¶
Readability test, not language-specific. Higher value => more difficult text.

property wiener_sachtextformel¶
Readability test for German-language texts. Higher value => more difficult text.
textacy.text_stats.api.load_hyphenator(lang: str)[source]¶
Load an object that hyphenates words at valid points, as used in LaTeX typesetting.

- Parameters
lang – Standard 2-letter language abbreviation. To get a list of valid values:
>>> import pyphen; pyphen.LANGUAGES
- Returns
pyphen.Pyphen()
Basic Stats¶
textacy.text_stats.basics: Low-level functions for computing basic text statistics, typically accessed via textacy.text_stats.TextStats.
textacy.text_stats.basics.n_words(doc_or_words: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]]) → int[source]¶
Compute the number of words in a document.

- Parameters
doc_or_words – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Tokens, all are included as-is.

textacy.text_stats.basics.n_unique_words(doc_or_words: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]]) → int[source]¶
Compute the number of unique words in a document.

- Parameters
doc_or_words – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Tokens, all are included as-is.

textacy.text_stats.basics.n_chars_per_word(doc_or_words: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]]) → Tuple[int, …][source]¶
Compute the number of characters for each word in a document.

- Parameters
doc_or_words – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Tokens, all are included as-is.
textacy.text_stats.basics.n_chars(n_chars_per_word: Tuple[int, …]) → int[source]¶
Compute the total number of characters in a document.

- Parameters
n_chars_per_word – Number of characters per word in a given document, as computed by n_chars_per_word().

textacy.text_stats.basics.n_long_words(n_chars_per_word: Tuple[int, …], min_n_chars: int = 7) → int[source]¶
Compute the number of long words in a document.

- Parameters
n_chars_per_word – Number of characters per word in a given document, as computed by n_chars_per_word().
min_n_chars – Minimum number of characters required for a word to be considered “long”.
textacy.text_stats.basics.n_syllables_per_word(doc_or_words: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]], lang: str) → Tuple[int, …][source]¶
Compute the number of syllables for each word in a document.

- Parameters
doc_or_words – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Tokens, all are included as-is.

Note
Identifying syllables is tricky; this method relies on hyphenation, which is more straightforward but doesn’t always give the correct number of syllables. While all hyphenation points fall on syllable divisions, not all syllable divisions are valid hyphenation points.
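To see why a hyphenation-based approach is preferred, consider the naive alternative of counting maximal vowel-letter groups. The sketch below is purely illustrative (it is not what textacy does) and shows how easily the naive approach miscounts:

```python
VOWELS = set("aeiouy")

def rough_n_syllables(word: str) -> int:
    """Rough syllable count: the number of maximal vowel-letter groups."""
    groups = 0
    prev_is_vowel = False
    for ch in word.lower():
        is_vowel = ch in VOWELS
        if is_vowel and not prev_is_vowel:
            groups += 1
        prev_is_vowel = is_vowel
    return max(groups, 1)  # every word has at least one syllable

print(rough_n_syllables("readability"))  # 5 -- happens to be correct
print(rough_n_syllables("care"))         # 2 -- wrong: the final "e" is silent
```

Hyphenation dictionaries avoid many such errors, at the cost of undercounting wherever a syllable division is not a valid hyphenation point, as the note above explains.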
textacy.text_stats.basics.n_syllables(n_syllables_per_word: Tuple[int, …]) → int[source]¶
Compute the total number of syllables in a document.

- Parameters
n_syllables_per_word – Number of syllables per word in a given document, as computed by n_syllables_per_word().

textacy.text_stats.basics.n_monosyllable_words(n_syllables_per_word: Tuple[int, …]) → int[source]¶
Compute the number of monosyllabic words in a document.

- Parameters
n_syllables_per_word – Number of syllables per word in a given document, as computed by n_syllables_per_word().

textacy.text_stats.basics.n_polysyllable_words(n_syllables_per_word: Tuple[int, …], min_n_syllables: int = 3) → int[source]¶
Compute the number of polysyllabic words in a document.

- Parameters
n_syllables_per_word – Number of syllables per word in a given document, as computed by n_syllables_per_word().
min_n_syllables – Minimum number of syllables required for a word to be considered “polysyllabic”.
textacy.text_stats.basics.n_sents(doc: spacy.tokens.doc.Doc) → int[source]¶
Compute the number of sentences in a document.

Warning
If doc has not been segmented into sentences, it will be modified in-place using spaCy’s rule-based Sentencizer pipeline component before counting.
textacy.text_stats.basics.entropy(doc_or_words: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]]) → float[source]¶
Compute the entropy of words in a document.

- Parameters
doc_or_words – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Tokens, all are included as-is.
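As a sketch of what this computation involves, here is Shannon entropy (base 2) over the empirical word-frequency distribution — the conventional definition, which is an assumption here; consult textacy's docstring for its exact formulation:

```python
import math
from collections import Counter

def word_entropy(words):
    """Shannon entropy (base 2) of the empirical word-frequency distribution."""
    counts = Counter(words)
    total = sum(counts.values())
    # H = -sum_w p(w) * log2 p(w), summed over distinct words
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(word_entropy(["the", "cat", "sat", "on", "the", "mat"]))  # ~2.2516
```

Higher entropy means the document spreads its probability mass over more distinct words, which is why the 136-word CapitolWords example above (80 unique words) scores around 6 bits.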
Readability Stats¶
textacy.text_stats.readability: Low-level functions for computing various measures of text “readability”, typically accessed via textacy.text_stats.TextStats.
textacy.text_stats.readability.automated_readability_index(n_chars: int, n_words: int, n_sents: int) → float[source]¶
Readability test for English-language texts, particularly for technical writing, whose value estimates the U.S. grade level required to understand a text. Similar to several other tests (e.g. flesch_kincaid_grade_level()), but uses characters per word instead of syllables, like coleman_liau_index(). Higher value => more difficult text.

References
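For reference, the standard published ARI formula can be sketched as follows (the textbook formula; textacy's implementation should agree, but check the linked source to be sure):

```python
def automated_readability_index(n_chars: int, n_words: int, n_sents: int) -> float:
    # Weighted sum of avg. characters per word and avg. words per sentence,
    # shifted so the result lands near a U.S. grade level.
    return 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sents) - 21.43

# 100 characters across 25 words and 2 sentences:
print(automated_readability_index(100, 25, 2))  # ~3.66
```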
textacy.text_stats.readability.automatic_arabic_readability_index(n_chars: int, n_words: int, n_sents: int) → float[source]¶
Readability test for Arabic-language texts based on number of characters and average word and sentence lengths. Higher value => more difficult text.

References
Al Tamimi, Abdel Karim, et al. “AARI: Automatic Arabic Readability Index.” Int. Arab J. Inf. Technol. 11.4 (2014): 370-378.
textacy.text_stats.readability.coleman_liau_index(n_chars: int, n_words: int, n_sents: int) → float[source]¶
Readability test whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level() and smog_index(), but using characters per word instead of syllables. Higher value => more difficult text.

References
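The published Coleman-Liau formula works on counts per 100 words; a sketch of that textbook formula (textacy's implementation may differ in minor details):

```python
def coleman_liau_index(n_chars: int, n_words: int, n_sents: int) -> float:
    L = 100 * n_chars / n_words  # avg. number of characters per 100 words
    S = 100 * n_sents / n_words  # avg. number of sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

# 500 characters across 100 words and 5 sentences:
print(coleman_liau_index(500, 100, 5))  # ~12.12
```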
textacy.text_stats.readability.flesch_kincaid_grade_level(n_syllables: int, n_words: int, n_sents: int) → float[source]¶
Readability test used widely in education, whose value estimates the U.S. grade level / number of years of education required to understand a text. Higher value => more difficult text.

References
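A sketch of the standard published Flesch-Kincaid grade-level formula (the textbook coefficients; verify against textacy's source for exact agreement):

```python
def flesch_kincaid_grade_level(n_syllables: int, n_words: int, n_sents: int) -> float:
    # Avg. sentence length and avg. syllables per word, mapped to a U.S. grade.
    return 0.39 * (n_words / n_sents) + 11.8 * (n_syllables / n_words) - 15.59

# 190 syllables across 100 words and 5 sentences:
print(flesch_kincaid_grade_level(190, 100, 5))  # ~14.63
```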
textacy.text_stats.readability.flesch_reading_ease(n_syllables: int, n_words: int, n_sents: int, *, lang: Optional[str] = None) → float[source]¶
Readability test used as a general-purpose standard in several languages, based on a weighted combination of avg. sentence length and avg. word length. Values usually fall in the range [0, 100], but may be arbitrarily negative in extreme cases. Higher value => easier text.

Note
Coefficients in this formula are language-dependent; if lang is null, the English-language formulation is used.

References
English: https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_reading_ease
German: https://de.wikipedia.org/wiki/Lesbarkeitsindex#Flesch-Reading-Ease
Spanish: Fernández-Huerta formulation
French: ?
Italian: https://it.wikipedia.org/wiki/Formula_di_Flesch
Dutch: ?
Portuguese: https://pt.wikipedia.org/wiki/Legibilidade_de_Flesch
Turkish: Atesman formulation
Russian: https://ru.wikipedia.org/wiki/%D0%98%D0%BD%D0%B4%D0%B5%D0%BA%D1%81_%D1%83%D0%B4%D0%BE%D0%B1%D0%BE%D1%87%D0%B8%D1%82%D0%B0%D0%B5%D0%BC%D0%BE%D1%81%D1%82%D0%B8
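For the English-language formulation specifically, the published formula is sketched below; as the note above says, other languages swap in different coefficients:

```python
def flesch_reading_ease_en(n_syllables: int, n_words: int, n_sents: int) -> float:
    # English-language coefficients; other formulations use different weights.
    return 206.835 - 1.015 * (n_words / n_sents) - 84.6 * (n_syllables / n_words)

# 190 syllables across 100 words and 5 sentences:
print(flesch_reading_ease_en(190, 100, 5))  # ~25.8 (fairly difficult)
```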
textacy.text_stats.readability.gulpease_index(n_chars: int, n_words: int, n_sents: int) → float[source]¶
Readability test for Italian-language texts, whose value is in the range [0, 100], similar to flesch_reading_ease(). Higher value => easier text.

References
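A sketch of the published Gulpease formula, which uses characters rather than syllables (textbook form; check textacy's source for its exact implementation):

```python
def gulpease_index(n_chars: int, n_words: int, n_sents: int) -> float:
    # Characters (not syllables) and sentences, both relative to word count.
    return 89 + (300 * n_sents - 10 * n_chars) / n_words

# 450 characters across 100 words and 5 sentences:
print(gulpease_index(450, 100, 5))  # 59.0
```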
textacy.text_stats.readability.gunning_fog_index(n_words: int, n_polysyllable_words: int, n_sents: int) → float[source]¶
Readability test whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level() and smog_index(). Higher value => more difficult text.

References
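The published Gunning fog formula can be sketched as follows (textbook form, with polysyllabic words standing in for "complex" words, matching this function's signature):

```python
def gunning_fog_index(n_words: int, n_polysyllable_words: int, n_sents: int) -> float:
    # 0.4 * (avg. sentence length + percentage of "complex" words)
    return 0.4 * ((n_words / n_sents) + 100 * (n_polysyllable_words / n_words))

# 100 words, 10 of them polysyllabic, across 5 sentences:
print(gunning_fog_index(100, 10, 5))  # ~12.0
```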
textacy.text_stats.readability.lix(n_words: int, n_long_words: int, n_sents: int) → float[source]¶
Readability test commonly used in Sweden on both English- and non-English-language texts, whose value estimates the difficulty of reading a foreign text. Higher value => more difficult text.

References
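LIX has a particularly simple published form, consistent with this signature (a sketch; "long" words default to 7+ characters per n_long_words() above):

```python
def lix(n_words: int, n_long_words: int, n_sents: int) -> float:
    # Avg. sentence length plus the percentage of long words.
    return (n_words / n_sents) + 100 * (n_long_words / n_words)

# 100 words, 30 of them long, across 5 sentences:
print(lix(100, 30, 5))  # 50.0
```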
textacy.text_stats.readability.mu_legibility_index(n_chars_per_word: Collection[int]) → float[source]¶
Readability test for Spanish-language texts based on number of words and the mean and variance of their lengths in characters, whose value is in the range [0, 100]. Higher value => easier text.

References
Muñoz, M., and J. Muñoz. “Legibilidad Mµ.” Viña del Mar: CHL (2006).
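A sketch of the commonly cited formulation: n/(n−1) times the ratio of the mean to the variance of word lengths, scaled to 100. The choice of population variance here is an assumption; verify both it and the overall form against textacy's source:

```python
import statistics

def mu_legibility_index(n_chars_per_word) -> float:
    n = len(n_chars_per_word)
    mean = statistics.mean(n_chars_per_word)
    var = statistics.pvariance(n_chars_per_word)  # population variance (assumed)
    return (n / (n - 1)) * (mean / var) * 100

# word lengths in characters for a toy 6-word document:
print(mu_legibility_index([2, 5, 7, 3, 10, 4]))  # ~86.85
```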
textacy.text_stats.readability.perspicuity_index(n_syllables: int, n_words: int, n_sents: int) → float[source]¶
Readability test for Spanish-language texts, whose value is in the range [0, 100]; very similar to the Spanish-specific formulation of flesch_reading_ease(), but included additionally since it’s become a common readability standard. Higher value => easier text.

References
Pazos, Francisco Szigriszt. Sistemas predictivos de legibilidad del mensaje escrito: fórmula de perspicuidad. Universidad Complutense de Madrid, Servicio de Reprografía, 1993.
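The Szigriszt-Pazos formula is a Flesch-style weighting; a sketch of the published form. As a consistency check, plugging in 55 syllables, 28 words, and 1 sentence reproduces the 56.46 value from the Spanish TextStats example near the top of this page (55 syllables being the count implied by that result):

```python
def perspicuity_index(n_syllables: int, n_words: int, n_sents: int) -> float:
    # Szigriszt-Pazos "clarity" formula, a Spanish variant of Flesch reading ease.
    return 206.835 - 62.3 * (n_syllables / n_words) - (n_words / n_sents)

print(perspicuity_index(55, 28, 1))  # ~56.46, matching the example above
```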
textacy.text_stats.readability.smog_index(n_polysyllable_words: int, n_sents: int) → float[source]¶
Readability test commonly used in medical writing and the healthcare industry, whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level() and intended as a substitute for gunning_fog_index(). Higher value => more difficult text.

References
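A sketch of the published SMOG formula, which normalizes the polysyllable count to a 30-sentence sample (textbook form; textacy's implementation should match, but check the source):

```python
import math

def smog_index(n_polysyllable_words: int, n_sents: int) -> float:
    # Polysyllabic-word count scaled to a 30-sentence sample, then square-rooted.
    return 1.0430 * math.sqrt(n_polysyllable_words * 30 / n_sents) + 3.1291

# 30 polysyllabic words across 30 sentences:
print(smog_index(30, 30))  # ~8.84
```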
textacy.text_stats.readability.wiener_sachtextformel(n_words: int, n_polysyllable_words: int, n_monosyllable_words: int, n_long_words: int, n_sents: int, *, variant: int = 1) → float[source]¶
Readability test for German-language texts, whose value estimates the grade level required to understand a text. Higher value => more difficult text.

References
https://de.wikipedia.org/wiki/Lesbarkeitsindex#Wiener_Sachtextformel
Pipeline Components¶
textacy.text_stats.components: Custom components to add to a spaCy language pipeline.

class textacy.text_stats.components.TextStatsComponent(attrs: Optional[Union[str, Collection[str]]] = None)[source]¶
A custom component to be added to a spaCy language pipeline that computes one, some, or all text stats for a parsed doc and sets the values as custom attributes on a spacy.tokens.Doc.

Add the component to a pipeline, after the parser and any subsequent components that modify the tokens/sentences of the doc (to be safe, just put it last):
>>> en = spacy.load("en_core_web_sm")
>>> en.add_pipe("textacy_text_stats", last=True)
Process a text with the pipeline and access the custom attributes via spaCy’s underscore syntax:
>>> doc = en(u"This is a test test someverylongword.")
>>> doc._.n_words
6
>>> doc._.flesch_reading_ease
73.84500000000001
Specify which attributes of the textacy.text_stats.TextStats() to add to processed documents:
>>> en = spacy.load("en_core_web_sm")
>>> en.add_pipe("textacy_text_stats", last=True, config={"attrs": "n_words"})
>>> doc = en(u"This is a test test someverylongword.")
>>> doc._.n_words
6
>>> doc._.flesch_reading_ease
AttributeError: [E046] Can't retrieve unregistered extension attribute 'flesch_reading_ease'. Did you forget to call the `set_extension` method?

- Parameters
attrs – If str, a single text stat to compute and set on a Doc; if Iterable[str], set multiple text stats; if None, all text stats are computed and set as extensions.
See also
textacy.text_stats.TextStats