Text Statistics

api.TextStats

Class to compute a variety of basic, readability, morphological, and lexical diversity statistics for a given document.

basics.n_sents

Compute the number of sentences in a document.

basics.n_words

Compute the number of words in a document.

basics.n_unique_words

Compute the number of unique words in a document.

basics.n_chars_per_word

Compute the number of characters for each word in a document.

basics.n_chars

Compute the total number of characters in a document’s words.

basics.n_long_words

Compute the number of long words in a document.

basics.n_syllables_per_word

Compute the number of syllables for each word in a document.

basics.n_syllables

Compute the total number of syllables in a document.

basics.n_monosyllable_words

Compute the number of monosyllabic words in a document.

basics.n_polysyllable_words

Compute the number of polysyllabic words in a document.

basics.entropy

Compute the entropy of words in a document.

counts.morph

Count the number of times each value for a morphological feature appears as a token annotation in doclike.

counts.tag

Count the number of times each fine-grained part-of-speech tag appears as a token annotation in doclike.

counts.pos

Count the number of times each coarse-grained universal part-of-speech tag appears as a token annotation in doclike.

counts.dep

Count the number of times each syntactic dependency relation appears as a token annotation in doclike.

diversity.ttr

Compute the Type-Token Ratio (TTR) of doc_or_tokens, a direct ratio of the number of unique words (types) to all words (tokens).

diversity.log_ttr

Compute the logarithmic Type-Token Ratio (TTR) of doc_or_tokens, a modification of TTR that uses log functions to better adapt for text length.

diversity.segmented_ttr

Compute the Mean Segmental TTR (MS-TTR) or Moving Average TTR (MA-TTR) of doc_or_tokens, in which the TTR of tumbling or rolling segments of words, respectively, each with length segment_size, are computed and then averaged.

diversity.mtld

Compute the Measure of Textual Lexical Diversity (MTLD) of doc_or_tokens, the average length of the longest consecutive sequences of words that maintain a TTR of at least min_ttr.

diversity.hdd

Compute the Hypergeometric Distribution Diversity (HD-D) of doc_or_tokens, which calculates the mean contribution that each unique word (aka type) makes to the TTR of all possible combinations of random samples of words of a given size, then sums all contributions together.

readability.automated_readability_index

Readability test for English-language texts, particularly for technical writing, whose value estimates the U.S. grade level required to understand a text.

readability.automatic_arabic_readability_index

Readability test for Arabic-language texts based on number of characters and average word and sentence lengths.

readability.coleman_liau_index

Readability test whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level() and smog_index(), but using characters per word instead of syllables.

readability.flesch_kincaid_grade_level

Readability test used widely in education, whose value estimates the U.S. grade level / number of years of education required to understand a text.

readability.flesch_reading_ease

Readability test used as a general-purpose standard in several languages, based on a weighted combination of avg. sentence length and avg. word length.

readability.gulpease_index

Readability test for Italian-language texts, whose value is in the range [0, 100] similar to flesch_reading_ease().

readability.gunning_fog_index

Readability test whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level() and smog_index().

readability.lix

Readability test commonly used in Sweden on both English- and non-English-language texts, whose value estimates the difficulty of reading a foreign text.

readability.mu_legibility_index

Readability test for Spanish-language texts based on number of words and the mean and variance of their lengths in characters, whose value is in the range [0, 100].

readability.perspicuity_index

Readability test for Spanish-language texts, whose value is in the range [0, 100]; very similar to the Spanish-specific formulation of flesch_reading_ease(), but included additionally since it’s become a common readability standard.

readability.smog_index

Readability test commonly used in medical writing and the healthcare industry, whose value estimates the number of years of education required to understand a text similar to flesch_kincaid_grade_level() and intended as a substitute for gunning_fog_index().

readability.wiener_sachtextformel

Readability test for German-language texts, whose value estimates the grade level required to understand a text.

utils.get_words

Get all non-punct, non-space tokens – “words” as we commonly understand them – from input Doc or Iterable[Token] object.

utils.compute_n_words_and_types

Compute the number of words and the number of unique words (aka types).

utils.load_hyphenator

Load an object that hyphenates words at valid points, as used in LaTeX typesetting.

textacy.text_stats.api: Compute a variety of text statistics for documents.

class textacy.text_stats.api.TextStats(doc: spacy.tokens.doc.Doc)[source]

Class to compute a variety of basic, readability, morphological, and lexical diversity statistics for a given document.

>>> text = next(textacy.datasets.CapitolWords().texts(limit=1))
>>> doc = textacy.make_spacy_doc(text, lang="en_core_web_sm")
>>> ts = textacy.text_stats.TextStats(doc)
>>> ts.n_words
137
>>> ts.n_unique_words
81
>>> ts.entropy
6.02267943673824
>>> ts.readability("flesch-kincaid-grade-level")
11.40259124087591
>>> ts.diversity("ttr")
0.5912408759124088

Some readability stats vary by language or are designed for use with specific languages:

>>> text = (
...     "Muchos años después, frente al pelotón de fusilamiento, "
...     "el coronel Aureliano Buendía había de recordar aquella tarde remota "
...     "en que su padre lo llevó a conocer el hielo."
... )
>>> doc = textacy.make_spacy_doc(text, lang="es_core_news_sm")
>>> ts = textacy.text_stats.TextStats(doc)
>>> ts.readability("perspicuity-index")
56.46000000000002
>>> ts.readability("mu-legibility-index")
71.18644067796609

Each of these stats has a stand-alone function in textacy.text_stats.basics , textacy.text_stats.readability , and textacy.text_stats.diversity with more detailed info and links in the docstrings – when in doubt, read the docs!

Parameters

doc – A text document tokenized and (optionally) sentence-segmented by spaCy.

Warning

The TextStats class is deprecated as of v0.12. Instead, call the stats functions directly – text_stats.TextStats(doc).n_sents => text_stats.n_sents(doc) – or set them as custom doc extensions and access them from the Doc: textacy.set_doc_extensions('text_stats'); doc._.n_sents .
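
A minimal sketch of the two non-deprecated access patterns described above, assuming the en_core_web_sm model is installed; the sample text and output values are purely illustrative:

>>> import textacy
>>> from textacy import text_stats
>>> doc = textacy.make_spacy_doc(
...     "Many years later, the colonel remembered that distant afternoon.",
...     lang="en_core_web_sm",
... )
>>> text_stats.n_sents(doc)  # option 1: call the stand-alone function directly
1
>>> textacy.set_doc_extensions("text_stats")  # option 2: register custom doc extensions
>>> doc._.n_sents
1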

property n_sents

Number of sentences in document.

property n_words

Number of words in document.

property n_unique_words

Number of unique words in document.

property n_long_words

Number of long words in document.

property n_chars_per_word

Number of characters for each word in document.

property n_chars

Total number of characters in document.

property n_syllables_per_word

Number of syllables for each word in document.

property n_syllables

Total number of syllables in document.

property n_monosyllable_words

Number of monosyllabic words in document.

property n_polysyllable_words

Number of polysyllabic words in document.

property entropy

Entropy of words in document.

counts(name: CountsNameType) → Dict[str, int] | Dict[str, Dict[str, int]][source]

Count the number of times each value for the feature specified by name appears as a token annotation.

readability(name: Literal['automated-readability-index', 'automatic-arabic-readability-index', 'coleman-liau-index', 'flesch-kincaid-grade-level', 'flesch-reading-ease', 'gulpease-index', 'gunning-fog-index', 'lix', 'mu-legibility-index', 'perspicuity-index', 'smog-index', 'wiener-sachtextformel'], **kwargs) → float[source]

Compute a measure of text readability using a method with specified name.

Higher values => more difficult text for the following methods:

  • automated readability index

  • automatic arabic readability index

  • coleman-liau index

  • flesch-kincaid grade level

  • gunning-fog index

  • lix

  • smog index

  • wiener-sachtextformel

Higher values => less difficult text for the following methods:

  • flesch reading ease

  • gulpease index

  • mu legibility index

  • perspicuity index

diversity(name: Literal['ttr', 'log-ttr', 'segmented-ttr', 'mtld', 'hdd'], **kwargs) → float[source]

Compute a measure of lexical diversity using a method with specified name , optionally specifying method variants and parameters.

Higher values => higher lexical diversity.
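
Method variants and parameters are forwarded to the underlying functions via **kwargs, as noted above; a brief sketch reusing the ts instance from the earlier examples (the parameter values shown here are illustrative):

>>> ms_ttr = ts.diversity("segmented-ttr", segment_size=50, variant="mean")
>>> ma_ttr = ts.diversity("segmented-ttr", segment_size=50, variant="moving-avg")
>>> hdd = ts.diversity("hdd", sample_size=42)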

Basic Stats

textacy.text_stats.basics: Low-level functions for computing basic text statistics, typically accessed via textacy.text_stats.TextStats.

textacy.text_stats.basics.n_sents(doc: spacy.tokens.doc.Doc) → int[source]

Compute the number of sentences in a document.

Parameters

doc

Warning

If doc has not been segmented into sentences, it will be modified in-place using spaCy’s rule-based Sentencizer pipeline component before counting.
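
A small sketch of that fallback behavior, assuming a blank English pipeline with no parser (the expected count is illustrative):

>>> import spacy
>>> from textacy.text_stats import basics
>>> nlp = spacy.blank("en")  # no parser, so no sentence boundaries yet
>>> doc = nlp("This is one sentence. This is another.")
>>> basics.n_sents(doc)  # applies spaCy's rule-based Sentencizer in-place first
2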

textacy.text_stats.basics.n_words(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]]) → int[source]

Compute the number of words in a document.

Parameters

doc_or_tokens – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Token s, all non-punct elements are used.

textacy.text_stats.basics.n_unique_words(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]]) → int[source]

Compute the number of unique words in a document.

Parameters

doc_or_tokens – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Token s, all non-punct elements are used.

textacy.text_stats.basics.n_chars_per_word(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]]) → Tuple[int, ...][source]

Compute the number of characters for each word in a document.

Parameters

doc_or_tokens – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Token s, all non-punct elements are used.

Note

This function is cached, since other functions rely upon its outputs to compute theirs. As such, doc_or_tokens must be hashable – for example, it may be a Doc or Tuple[Token, ...] , but not a List[Token] .
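
For example (a sketch assuming doc is an existing spaCy Doc), pass a hashable container such as a tuple when supplying tokens directly:

>>> from textacy.text_stats import basics
>>> words = tuple(tok for tok in doc if not tok.is_punct and not tok.is_space)
>>> per_word = basics.n_chars_per_word(words)  # a list here would fail the hashability requirement
>>> total = basics.n_chars(doc)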

textacy.text_stats.basics.n_chars(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]]) → int[source]

Compute the total number of characters in a document’s words.

Parameters

doc_or_tokens – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Token s, all non-punct elements are used.

textacy.text_stats.basics.n_long_words(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]], *, min_n_chars: int = 7) → int[source]

Compute the number of long words in a document.

Parameters
  • doc_or_tokens – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Token s, all non-punct elements are used.

  • min_n_chars – Minimum number of characters required for a word to be considered “long”.

textacy.text_stats.basics.n_syllables_per_word(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]], *, lang: Optional[str] = None) → Tuple[int, ...][source]

Compute the number of syllables for each word in a document.

Parameters
  • doc_or_tokens – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Token s, all non-punct elements are used.

  • lang – Standard 2-letter language code used to load hyphenator. If not specified and doc_or_tokens is a spaCy Doc , the value will be taken from Doc.lang_ .

Note

Identifying syllables is tricky; this method relies on hyphenation, which is more straightforward but doesn’t always give the correct number of syllables. While all hyphenation points fall on syllable divisions, not all syllable divisions are valid hyphenation points.

Also: This function is cached, since other functions rely upon its outputs to compute theirs. As such, doc_or_tokens must be hashable – for example, it may be a Doc or Tuple[Token, ...] , but not a List[Token] .
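
A usage sketch under the same caveats (doc is assumed to be an existing spaCy Doc; the counts are hyphenation-based approximations):

>>> from textacy.text_stats import basics
>>> per_word = basics.n_syllables_per_word(doc, lang="en")
>>> total = basics.n_syllables(doc, lang="en")
>>> poly = basics.n_polysyllable_words(doc, min_n_syllables=3)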

textacy.text_stats.basics.n_syllables(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]], *, lang: Optional[str] = None) → int[source]

Compute the total number of syllables in a document.

Parameters
  • doc_or_tokens – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Token s, all non-punct elements are used.

  • lang – Standard 2-letter language code used to load hyphenator. If not specified and doc_or_tokens is a spaCy Doc , the value will be taken from Doc.lang_ .

textacy.text_stats.basics.n_monosyllable_words(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]], *, lang: Optional[str] = None) → int[source]

Compute the number of monosyllabic words in a document.

Parameters
  • doc_or_tokens – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Token s, all non-punct elements are used.

  • lang – Standard 2-letter language code used to load hyphenator. If not specified and doc_or_tokens is a spaCy Doc , the value will be taken from Doc.lang_ .

textacy.text_stats.basics.n_polysyllable_words(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]], *, lang: Optional[str] = None, min_n_syllables: int = 3) → int[source]

Compute the number of polysyllabic words in a document.

Parameters
  • doc_or_tokens – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Token s, all non-punct elements are used.

  • lang – Standard 2-letter language code used to load hyphenator. If not specified and doc_or_tokens is a spaCy Doc , the value will be taken from Doc.lang_ .

  • min_n_syllables – Minimum number of syllables required for a word to be considered “polysyllabic”.

textacy.text_stats.basics.entropy(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]]) → float[source]

Compute the entropy of words in a document.

Parameters

doc_or_tokens – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Token s, all non-punct elements are used.
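
Illustrative only: a plain Shannon-entropy computation over word frequencies, which is presumably what this statistic reflects; the base-2 logarithm is an assumption consistent with the example value shown earlier (≈ 6.02 for 137 words), and doc is assumed to be an existing spaCy Doc:

>>> import math
>>> from collections import Counter
>>> words = [tok.text for tok in doc if not tok.is_punct and not tok.is_space]
>>> word_counts = Counter(words)
>>> n = sum(word_counts.values())
>>> h = -sum((c / n) * math.log2(c / n) for c in word_counts.values())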

Annotation Counts

textacy.text_stats.counts: Functions for computing the counts of morphological, part-of-speech, and dependency features on the tokens in a document.

textacy.text_stats.counts.morph(doclike: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span]) → Dict[str, Dict[str, int]][source]

Count the number of times each value for a morphological feature appears as a token annotation in doclike.

Parameters

doclike

Returns

Mapping of morphological feature to value counts of occurrence.

See also

spacy.tokens.MorphAnalysis

textacy.text_stats.counts.tag(doclike: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span]) → Dict[str, int][source]

Count the number of times each fine-grained part-of-speech tag appears as a token annotation in doclike.

Parameters

doclike

Returns

Mapping of part-of-speech tag to count of occurrence.

textacy.text_stats.counts.pos(doclike: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span]) → Dict[str, int][source]

Count the number of times each coarse-grained universal part-of-speech tag appears as a token annotation in doclike.

Parameters

doclike

Returns

Mapping of universal part-of-speech tag to count of occurrence.

textacy.text_stats.counts.dep(doclike: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span]) → Dict[str, int][source]

Count the number of times each syntactic dependency relation appears as a token annotation in doclike.

Parameters

doclike

Returns

Mapping of dependency relation to count of occurrence.
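
A usage sketch for the four counting functions, assuming doc has been processed by a pipeline with a tagger, morphologizer, and parser; the example values in the comments are hypothetical:

>>> from textacy.text_stats import counts
>>> counts.pos(doc)    # e.g. {"NOUN": 30, "VERB": 18, ...}
>>> counts.tag(doc)    # fine-grained tags, e.g. {"NN": 22, ...}
>>> counts.dep(doc)    # e.g. {"nsubj": 9, "dobj": 6, ...}
>>> counts.morph(doc)  # e.g. {"Number": {"Sing": 25, "Plur": 8}, ...}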

Lexical Diversity Stats

textacy.text_stats.diversity: Low-level functions for computing various measures of lexical diversity, typically accessed via textacy.text_stats.TextStats.diversity().

textacy.text_stats.diversity.ttr(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]], variant: Literal['standard', 'root', 'corrected'] = 'standard') → float[source]

Compute the Type-Token Ratio (TTR) of doc_or_tokens, a direct ratio of the number of unique words (types) to all words (tokens).

Higher values indicate higher lexical diversity.

Parameters
  • doc_or_tokens – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Token s, all non-punct elements are used.

  • variant – Particular variant of TTR:
    - “standard” => n_types / n_words
    - “root” => n_types / sqrt(n_words)
    - “corrected” => n_types / sqrt(2 * n_words)

Note

All variants of this measure are sensitive to document length, so values from texts with different lengths should not be compared.
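
A worked sketch of the three variants’ formulas, using the type/token counts from the earlier example (81 types, 137 tokens); the numbers are only illustrative:

>>> import math
>>> n_types, n_words = 81, 137
>>> standard = n_types / n_words                  # ≈ 0.591, matching ts.diversity("ttr") above
>>> root = n_types / math.sqrt(n_words)           # Guiraud's RTTR
>>> corrected = n_types / math.sqrt(2 * n_words)  # Carroll's CTTR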

References

  • Templin, M. (1957). Certain language skills in children. Minneapolis: University of Minnesota Press.

  • RTTR: Guiraud (1954, 1960)

  • CTTR: Carroll (1964)

textacy.text_stats.diversity.log_ttr(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]], variant: Literal['herdan', 'summer', 'dugast'] = 'herdan') → float[source]

Compute the logarithmic Type-Token Ratio (TTR) of doc_or_tokens, a modification of TTR that uses log functions to better adapt for text length.

Higher values indicate higher lexical diversity.

Parameters
  • doc_or_tokens – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Token s, all non-punct elements are used.

  • variant – Particular variant of log-TTR:
    - “herdan” => log(n_types) / log(n_words)
    - “summer” => log(log(n_types)) / log(log(n_words))
    - “dugast” => log(n_words) ** 2 / (log(n_words) - log(n_types))

Note

All variants of this measure are slightly sensitive to document length, so values from texts with different lengths should be compared with care.

The popular Maas variant of log-TTR is simply the reciprocal of Dugast’s: (log(n_words) - log(n_types)) / log(n_words) ** 2. It isn’t included as a variant because its interpretation differs: lower values indicate higher lexical diversity.
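
The same illustrative counts worked through the log-TTR formulas quoted above, with the Maas reciprocal included for reference:

>>> import math
>>> n_types, n_words = 81, 137
>>> herdan = math.log(n_types) / math.log(n_words)
>>> summer = math.log(math.log(n_types)) / math.log(math.log(n_words))
>>> dugast = math.log(n_words) ** 2 / (math.log(n_words) - math.log(n_types))
>>> maas = 1 / dugast  # lower values => higher lexical diversity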

References

  • Herdan, G. (1964). Quantitative linguistics. London: Butterworths.

  • Somers, H. H. (1966). Statistical methods in literary analysis. In J. Leeds (Ed.), The computer and literary style (pp. 128-140). Kent, OH: Kent State University.

  • Dugast, D. (1978). Sur quoi se fonde la notion d’étendue théorique du vocabulaire? Le Français Moderne, 46, 25-32.

textacy.text_stats.diversity.segmented_ttr(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]], segment_size: int = 50, variant: Literal['mean', 'moving-avg'] = 'mean') → float[source]

Compute the Mean Segmental TTR (MS-TTR) or Moving Average TTR (MA-TTR) of doc_or_tokens, in which the TTR of tumbling or rolling segments of words, respectively, each with length segment_size, are computed and then averaged.

Higher values indicate higher lexical diversity.

Parameters
  • doc_or_tokens – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Token s, all non-punct elements are used.

  • segment_size – Number of consecutive words to include in each segment.

  • variant – Variant of segmented TTR to compute:
    - “mean” => MS-TTR
    - “moving-avg” => MA-TTR

References

  • Johnson, W. (1944). Studies in language behavior: I. A program of research. Psychological Monographs, 56, 1-15.

  • Covington, M. A., & McFall, J. D. (2010). Cutting the Gordian knot: The moving-average type–token ratio (MATTR). Journal of quantitative linguistics, 17(2), 94-100.

textacy.text_stats.diversity.mtld(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]], min_ttr: float = 0.72) → float[source]

Compute the Measure of Textual Lexical Diversity (MTLD) of doc_or_tokens, the average length of the longest consecutive sequences of words that maintain a TTR of at least min_ttr.

Higher values indicate higher lexical diversity.

Parameters
  • doc_or_tokens – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Token s, all non-punct elements are used.

  • min_ttr – Minimum TTR for each segment in doc_or_tokens. When an ongoing segment’s TTR falls below this value, a new segment is started. Value should be in the range [0.66, 0.75].

References

McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior research methods, 42(2), 381-392.

textacy.text_stats.diversity.hdd(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]], sample_size: int = 42) → float[source]

Compute the Hypergeometric Distribution Diversity (HD-D) of doc_or_tokens, which calculates the mean contribution that each unique word (aka type) makes to the TTR of all possible combinations of random samples of words of a given size, then sums all contributions together.

Parameters
  • doc_or_tokens – If a spaCy Doc, non-punctuation tokens (words) are extracted; if an iterable of spaCy Token s, all non-punct elements are used.

  • sample_size – Number of words randomly sampled without replacement when computing unique word appearance probabilities. Value should be in the range [35, 50].

Note

The popular vocd-D index of lexical diversity is actually just an approximation of HD-D, and should not be used.

References

  • McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior research methods, 42(2), 381-392.

  • McCarthy, P. M., & Jarvis, S. (2007). A theoretical and empirical evaluation of vocd. Language Testing, 24, 459-488.

Readability Stats

textacy.text_stats.readability: Low-level functions for computing various measures of text “readability”, typically accessed via textacy.text_stats.TextStats.readability().

textacy.text_stats.readability.automated_readability_index(doc: spacy.tokens.doc.Doc) → float[source]

Readability test for English-language texts, particularly for technical writing, whose value estimates the U.S. grade level required to understand a text. Similar to several other tests (e.g. flesch_kincaid_grade_level()), but uses characters per word instead of syllables like coleman_liau_index().

Higher value => more difficult text.

Parameters

doc

References

https://en.wikipedia.org/wiki/Automated_readability_index

textacy.text_stats.readability.automatic_arabic_readability_index(doc: spacy.tokens.doc.Doc) → float[source]

Readability test for Arabic-language texts based on number of characters and average word and sentence lengths.

Higher value => more difficult text.

Parameters

doc

References

Al Tamimi, Abdel Karim, et al. “AARI: automatic arabic readability index.” Int. Arab J. Inf. Technol. 11.4 (2014): 370-378.

textacy.text_stats.readability.coleman_liau_index(doc: spacy.tokens.doc.Doc) → float[source]

Readability test whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level() and smog_index(), but using characters per word instead of syllables.

Higher value => more difficult text.

Parameters

doc

References

https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index

textacy.text_stats.readability.flesch_kincaid_grade_level(doc: spacy.tokens.doc.Doc) → float[source]

Readability test used widely in education, whose value estimates the U.S. grade level / number of years of education required to understand a text.

Higher value => more difficult text.

Parameters

doc

References

https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch.E2.80.93Kincaid_grade_level

textacy.text_stats.readability.flesch_reading_ease(doc: spacy.tokens.doc.Doc, *, lang: Optional[str] = None) → float[source]

Readability test used as a general-purpose standard in several languages, based on a weighted combination of avg. sentence length and avg. word length. Values usually fall in the range [0, 100], but may be arbitrarily negative in extreme cases.

Higher value => easier text.

Parameters
  • doc

  • lang

Note

Coefficients in this formula are language-dependent; if lang is not specified, the value of Doc.lang_ is used.
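
For example (a sketch; doc is assumed to be a Spanish-language Doc like the one built earlier), the language can also be forced explicitly:

>>> from textacy.text_stats import readability
>>> fre = readability.flesch_reading_ease(doc)            # coefficients chosen from doc.lang_
>>> fre_es = readability.flesch_reading_ease(doc, lang="es")  # or forced explicitly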

References

  • English: https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_reading_ease

  • German: https://de.wikipedia.org/wiki/Lesbarkeitsindex#Flesch-Reading-Ease

  • Spanish: Fernández-Huerta formulation

  • French: ?

  • Italian: https://it.wikipedia.org/wiki/Formula_di_Flesch

  • Dutch: ?

  • Portuguese: https://pt.wikipedia.org/wiki/Legibilidade_de_Flesch

  • Turkish: Atesman formulation

  • Russian: https://ru.wikipedia.org/wiki/%D0%98%D0%BD%D0%B4%D0%B5%D0%BA%D1%81_%D1%83%D0%B4%D0%BE%D0%B1%D0%BE%D1%87%D0%B8%D1%82%D0%B0%D0%B5%D0%BC%D0%BE%D1%81%D1%82%D0%B8

textacy.text_stats.readability.gulpease_index(doc: spacy.tokens.doc.Doc) → float[source]

Readability test for Italian-language texts, whose value is in the range [0, 100] similar to flesch_reading_ease().

Higher value => easier text.

Parameters

doc

References

https://it.wikipedia.org/wiki/Indice_Gulpease

textacy.text_stats.readability.gunning_fog_index(doc: spacy.tokens.doc.Doc) → float[source]

Readability test whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level() and smog_index().

Higher value => more difficult text.

Parameters

doc

References

https://en.wikipedia.org/wiki/Gunning_fog_index

textacy.text_stats.readability.lix(doc: spacy.tokens.doc.Doc) → float[source]

Readability test commonly used in Sweden on both English- and non-English-language texts, whose value estimates the difficulty of reading a foreign text.

Higher value => more difficult text.

Parameters

doc

References

https://en.wikipedia.org/wiki/Lix_(readability_test)

textacy.text_stats.readability.mu_legibility_index(doc: spacy.tokens.doc.Doc) → float[source]

Readability test for Spanish-language texts based on number of words and the mean and variance of their lengths in characters, whose value is in the range [0, 100].

Higher value => easier text.

Parameters

doc

References

Muñoz, M., and J. Muñoz. “Legibilidad Mµ.” Viña del Mar: CHL (2006).

textacy.text_stats.readability.perspicuity_index(doc: spacy.tokens.doc.Doc) → float[source]

Readability test for Spanish-language texts, whose value is in the range [0, 100]; very similar to the Spanish-specific formulation of flesch_reading_ease(), but included additionally since it’s become a common readability standard.

Higher value => easier text.

Parameters

doc

References

Pazos, Francisco Szigriszt. Sistemas predictivos de legibilidad del mensaje escrito: fórmula de perspicuidad. Universidad Complutense de Madrid, Servicio de Reprografía, 1993.

textacy.text_stats.readability.smog_index(doc: spacy.tokens.doc.Doc) → float[source]

Readability test commonly used in medical writing and the healthcare industry, whose value estimates the number of years of education required to understand a text similar to flesch_kincaid_grade_level() and intended as a substitute for gunning_fog_index().

Higher value => more difficult text.

Parameters

doc

References

https://en.wikipedia.org/wiki/SMOG

textacy.text_stats.readability.wiener_sachtextformel(doc: spacy.tokens.doc.Doc, *, variant: int = 1) → float[source]

Readability test for German-language texts, whose value estimates the grade level required to understand a text.

Higher value => more difficult text.

Parameters
  • doc

  • variant

References

https://de.wikipedia.org/wiki/Lesbarkeitsindex#Wiener_Sachtextformel

textacy.text_stats.utils: Utility functions for computing text statistics, called under the hood of many stats functions – and not typically accessed by users.

textacy.text_stats.utils.get_words(doc_or_tokens: Union[spacy.tokens.doc.Doc, Iterable[spacy.tokens.token.Token]]) → Iterable[spacy.tokens.token.Token][source]

Get all non-punct, non-space tokens – “words” as we commonly understand them – from input Doc or Iterable[Token] object.

textacy.text_stats.utils.compute_n_words_and_types(words: Iterable[spacy.tokens.token.Token]) → Tuple[int, int][source]

Compute the number of words and the number of unique words (aka types).

Parameters

words – Sequence of non-punct, non-space tokens – “words” – as output, say, by get_words().

Returns

(n_words, n_types)
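
A small sketch tying the two utilities together (doc is assumed to be an existing spaCy Doc):

>>> from textacy.text_stats import utils
>>> words = tuple(utils.get_words(doc))
>>> n_words, n_types = utils.compute_n_words_and_types(words)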

textacy.text_stats.utils.load_hyphenator(lang: str)[source]

Load an object that hyphenates words at valid points, as used in LaTeX typesetting.

Parameters

lang

Standard 2-letter language abbreviation. To get a list of valid values:

>>> import pyphen; pyphen.LANGUAGES

Returns

pyphen.Pyphen()
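
A usage sketch; pyphen’s Pyphen objects expose methods such as inserted() and positions() for inspecting hyphenation points (the example word and result are illustrative):

>>> from textacy.text_stats import utils
>>> hyphenator = utils.load_hyphenator("en")
>>> hyphenator.inserted("hyphenation")  # e.g. "hy-phen-ation"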