spaCy extensions

Doc extensions


Transform Doc into a semantic network, where nodes are either “words” or “sents” and edges between nodes may be weighted in different ways.

textacy.spacier.doc_extensions: Inspect, extend, and transform spaCy’s core data structure, spacy.tokens.Doc, either directly via functions that take a Doc as their first argument or as custom attributes / methods on instantiated docs prepended by an underscore:

>>> import textacy
>>> from textacy.spacier.doc_extensions import get_preview
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("This is a short text.")
>>> print(get_preview(doc))
Doc(6 tokens: "This is a short text.")
>>> print(doc._.preview)
Doc(6 tokens: "This is a short text.")

Set textacy’s custom property and method doc extensions on the global spacy.tokens.Doc.


Get textacy’s custom property and method doc extensions that can be set on or removed from the global spacy.tokens.Doc.


Remove textacy’s custom property and method doc extensions from the global spacy.tokens.Doc.

textacy.spacier.doc_extensions.get_lang(doc: spacy.tokens.doc.Doc) → str

textacy.spacier.doc_extensions.get_preview(doc: spacy.tokens.doc.Doc) → str

textacy.spacier.doc_extensions.get_tokens(doc: spacy.tokens.doc.Doc) → Iterable[spacy.tokens.token.Token]

textacy.spacier.doc_extensions.get_meta(doc: spacy.tokens.doc.Doc) → dict

textacy.spacier.doc_extensions.set_meta(doc: spacy.tokens.doc.Doc, value: dict) → None

textacy.spacier.doc_extensions.get_n_tokens(doc: spacy.tokens.doc.Doc) → int

textacy.spacier.doc_extensions.get_n_sents(doc: spacy.tokens.doc.Doc) → int

textacy.spacier.doc_extensions.to_tokenized_text(doc: spacy.tokens.doc.Doc) → List[List[str]]

If doc hasn’t been segmented into sentences, the entire document is treated as a single sentence.

textacy.spacier.doc_extensions.to_tagged_text(doc: spacy.tokens.doc.Doc) → List[List[Tuple[str, str]]]

Transform Doc into an ordered, nested list of (token-text, part-of-speech tag) pairs per sentence.


If doc hasn’t been segmented into sentences, the entire document is treated as a single sentence.
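The nested output shape and the single-sentence fallback can be illustrated with a minimal, library-free sketch. The helper name tagged_text_sketch and its sent_starts argument are hypothetical stand-ins for spaCy's sentence segmentation; this is not textacy's implementation.

```python
from typing import List, Optional, Sequence, Tuple

TaggedToken = Tuple[str, str]  # (token text, part-of-speech tag)


def tagged_text_sketch(
    tagged_tokens: Sequence[TaggedToken],
    sent_starts: Optional[Sequence[int]] = None,
) -> List[List[TaggedToken]]:
    """Group (text, tag) pairs into one inner list per sentence."""
    if not sent_starts:
        # No sentence boundaries known: treat everything as one sentence.
        return [list(tagged_tokens)]
    bounds = list(sent_starts) + [len(tagged_tokens)]
    return [list(tagged_tokens[s:e]) for s, e in zip(bounds, bounds[1:])]


tokens = [("Hi", "INTJ"), (".", "PUNCT"), ("Bye", "INTJ"), (".", "PUNCT")]
segmented = tagged_text_sketch(tokens, sent_starts=[0, 2])
unsegmented = tagged_text_sketch(tokens)
```

Here segmented holds two inner lists (one per sentence), while unsegmented holds a single inner list containing all four pairs.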

textacy.spacier.doc_extensions.to_terms_list(doc: spacy.tokens.doc.Doc, *, ngrams: Optional[Union[int, Collection[int]]] = (1, 2, 3), entities: Optional[bool] = True, normalize: Optional[Union[str, Callable[[Union[spacy.tokens.span.Span, spacy.tokens.token.Token]], str]]] = 'lemma', as_strings: bool = False, **kwargs) → Union[Iterable[int], Iterable[str]]

  • doc

  • ngrams – ngrams to include in the terms list. If {1, 2, 3}, unigrams, bigrams, and trigrams are included; if 2, only bigrams are included; if None, ngrams aren’t included, except for those belonging to named entities.

  • entities

    If True, entities are included in the terms list; if False, they are excluded from the list; if None, entities aren’t included or excluded at all.


    When both entities and ngrams are non-null, exact duplicates (based on start and end indexes) are handled: if entities is True, duplicate entities are kept and the matching ngrams are discarded to avoid double-counting; if entities is False, no entities are included and the duplicate ngrams are discarded as well.

  • normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in doc; if callable, must accept a Token or Span and return a str, e.g. get_normalized_text().

  • as_strings – If True, terms are returned as strings; if False, terms are returned as their unique integer ids.

  • kwargs

    • filter_stops (bool)

    • filter_punct (bool)

    • filter_nums (bool)

    • include_pos (str or Set[str])

    • exclude_pos (str or Set[str])

    • min_freq (int)

    • include_types (str or Set[str])

    • exclude_types (str or Set[str])

    • drop_determiners (bool)

    See textacy.extract.words(), textacy.extract.ngrams(), and textacy.extract.entities() for details.


Yields

The next term in the terms list, as either a unique integer id or a string.

Raises

  • ValueError – if neither entities nor ngrams are included, or if normalize has an invalid value

  • TypeError – if entities has an invalid type


Despite the name, this is a generator function; to get an actual list of terms, call list(to_terms_list(doc)).
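The generator behavior and the meaning of the ngrams argument can be sketched without spaCy. The function terms_sketch below is a hypothetical, simplified stand-in that works on plain strings and ignores entities, normalization, and filtering; it only illustrates lazy iteration versus materializing a list.

```python
from typing import Iterable, Iterator, Sequence, Union


def terms_sketch(
    tokens: Sequence[str],
    ngrams: Union[int, Iterable[int]] = (1, 2, 3),
) -> Iterator[str]:
    """Lazily yield n-grams of the requested sizes, one term at a time."""
    sizes = (ngrams,) if isinstance(ngrams, int) else tuple(ngrams)
    for n in sizes:
        for start in range(len(tokens) - n + 1):
            yield " ".join(tokens[start : start + n])


tokens = ["a", "short", "text"]
gen = terms_sketch(tokens, ngrams=2)
first = next(gen)                   # terms are produced on demand
terms = list(terms_sketch(tokens))  # wrap in list() to get an actual list
```

Because nothing is computed until the generator is consumed, a caller that only needs the first few terms never pays for the rest.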

textacy.spacier.doc_extensions.to_bag_of_terms(doc: spacy.tokens.doc.Doc, *, ngrams: Optional[Union[int, Collection[int]]] = (1, 2, 3), entities: Optional[bool] = True, normalize: Optional[Union[str, Callable[[Union[spacy.tokens.span.Span, spacy.tokens.token.Token]], str]]] = 'lemma', weighting: str = 'count', as_strings: bool = False, **kwargs) → Dict[Union[int, str], Union[int, float]]

  • doc

  • ngrams – Sizes of n-grams to include. (1, 2, 3) (default) includes unigrams (words), bigrams, and trigrams; 2 includes only bigrams; a falsy value (e.g. False) includes none.

  • entities – If True (default), include named entities. Note: if ngrams are also included, any ngrams that exactly overlap with an entity are skipped to prevent double-counting.

  • normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in doc; if a callable, must accept a Token or Span and return a str, e.g. textacy.spacier.utils.get_normalized_text().

  • weighting ({"count", "freq", "binary"}) – Type of weight to assign to terms. If “count” (default), weights are the absolute number of occurrences (count) of term in doc. If “binary”, all counts are set equal to 1. If “freq”, term counts are normalized by the total token count, giving their relative frequency of occurrence.

  • as_strings – If True, words are returned as strings; if False (default), words are returned as their unique integer ids.

  • kwargs

    • filter_stops (bool)

    • filter_punct (bool)

    • filter_nums (bool)

    • include_pos (str or Set[str])

    • exclude_pos (str or Set[str])

    • min_freq (int)

    • include_types (str or Set[str])

    • exclude_types (str or Set[str])

    • drop_determiners (bool)

    See textacy.extract.words(), textacy.extract.ngrams(), and textacy.extract.entities() for details.


Returns

Mapping of a unique term id or string (depending on the value of as_strings) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).

See also

to_terms_list(), which is used under the hood.
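The three weighting options described for to_bag_of_terms can be sketched in plain Python. The helper weight_terms_sketch and its n_tokens argument are hypothetical; the sketch assumes the terms have already been extracted as strings and that "freq" divides by the document's total token count, as the parameter description states.

```python
from collections import Counter
from typing import Dict, Iterable, Union


def weight_terms_sketch(
    terms: Iterable[str], n_tokens: int, weighting: str = "count"
) -> Dict[str, Union[int, float]]:
    """Map each term to a count, a relative frequency, or a binary flag."""
    counts = Counter(terms)
    if weighting == "count":
        return dict(counts)
    if weighting == "binary":
        return {term: 1 for term in counts}
    if weighting == "freq":
        return {term: count / n_tokens for term, count in counts.items()}
    raise ValueError(f"unknown weighting: {weighting!r}")


terms = ["test", "test", "word"]
by_count = weight_terms_sketch(terms, n_tokens=4, weighting="count")
by_binary = weight_terms_sketch(terms, n_tokens=4, weighting="binary")
by_freq = weight_terms_sketch(terms, n_tokens=4, weighting="freq")
```

With four tokens in the hypothetical document, "test" gets a count of 2, a binary weight of 1, and a relative frequency of 0.5.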

textacy.spacier.doc_extensions.to_bag_of_words(doc: spacy.tokens.doc.Doc, *, normalize: str = 'lemma', weighting: str = 'count', as_strings: bool = False, filter_stops: bool = True, filter_punct: bool = True, filter_nums: bool = False) → Dict[Union[int, str], Union[int, float]]

  • doc

  • normalize – If “lemma”, lemmatize words before counting; if “lower”, lowercase words before counting; otherwise, words are counted using the form with which they appear in doc.

  • weighting ({"count", "freq", "binary"}) – Type of weight to assign to words. If “count” (default), weights are the absolute number of occurrences (count) of word in doc. If “binary”, all counts are set equal to 1. If “freq”, word counts are normalized by the total token count, giving their relative frequency of occurrence. Note: The resulting set of frequencies won’t (necessarily) sum to 1.0, since punctuation and stop words are filtered out after counts are normalized.

  • as_strings (bool) – If True, words are returned as strings; if False (default), words are returned as their unique integer ids.

  • filter_stops (bool) – If True (default), stop words are removed after counting.

  • filter_punct (bool) – If True (default), punctuation tokens are removed after counting.

  • filter_nums (bool) – If True, tokens consisting of digits are removed after counting.


Returns

Mapping of a unique term id or string (depending on the value of as_strings) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).
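The note on “freq” weighting (frequencies need not sum to 1.0 because filtering happens after normalization) can be checked with a small, library-free sketch. The function name and the explicit stop_words and punct sets are hypothetical; spaCy's own token flags would play that role in practice.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def bag_of_words_freq_sketch(
    tokens: Iterable[str], stop_words: Set[str], punct: Set[str]
) -> Dict[str, float]:
    """Normalize counts by the total token count, then filter."""
    tokens = list(tokens)
    counts = Counter(tokens)
    n_tokens = len(tokens)
    # Step 1: relative frequency over ALL tokens, including stop words.
    freqs = {word: count / n_tokens for word, count in counts.items()}
    # Step 2: drop stop words and punctuation after normalizing.
    return {
        word: freq
        for word, freq in freqs.items()
        if word not in stop_words and word not in punct
    }


tokens = ["this", "is", "a", "test", "test", "."]
bow = bag_of_words_freq_sketch(tokens, stop_words={"this", "is", "a"}, punct={"."})
total = sum(bow.values())
```

Only "test" survives the filters, with a frequency of 2/6, so total is well below 1.0.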

textacy.spacier.doc_extensions.to_semantic_network(doc: spacy.tokens.doc.Doc, *, nodes: str = 'words', normalize: Optional[Union[str, Callable[[Union[spacy.tokens.span.Span, spacy.tokens.token.Token]], str]]] = 'lemma', edge_weighting: str = 'default', window_width: int = 10) → networkx.classes.graph.Graph

  • doc

  • nodes ({"words", "sents"}) – Type of doc component to use as nodes in the semantic network.

  • normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in doc; if a callable, must accept a Token or Span (if nodes = “words” or “sents”, respectively) and return a str, e.g. get_normalized_text().

  • edge_weighting – Type of weighting to apply to edges between nodes. If nodes = “words”, options are {“cooc_freq”, “binary”}; if nodes = “sents”, options are {“cosine”, “jaccard”}. If “default”, “cooc_freq” or “cosine” is used automatically, depending on nodes.

  • window_width – Size of sliding window over terms that determines which are said to co-occur; only applicable if nodes = “words”.


Returns

Graph where nodes represent either terms or sentences in doc; edges, the relationships between them.

Return type

networkx.classes.graph.Graph

Raises

ValueError – If nodes is neither “words” nor “sents”.
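To make the window_width and edge_weighting parameters concrete for nodes = “words”, here is a minimal sketch of sliding-window co-occurrence counting using only the standard library. It assumes that “cooc_freq” means the number of windows in which two terms appear together, which may differ in detail from textacy's implementation; the function name cooccurrence_edges_sketch is hypothetical, and a plain dict stands in for the networkx graph.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Sequence, Tuple

Edge = Tuple[str, str]


def cooccurrence_edges_sketch(
    terms: Sequence[str], window_width: int = 10, edge_weighting: str = "cooc_freq"
) -> Dict[Edge, int]:
    """Weight undirected term pairs by co-occurrence within a sliding window."""
    if window_width < 2:
        raise ValueError("window_width must be at least 2")
    if edge_weighting not in {"cooc_freq", "binary"}:
        raise ValueError(f"unknown edge_weighting: {edge_weighting!r}")
    weights: Counter = Counter()
    n_windows = max(len(terms) - window_width + 1, 1)
    for start in range(n_windows):
        window = terms[start : start + window_width]
        # Sort so that (a, b) and (b, a) map to the same undirected edge.
        for pair in combinations(sorted(set(window)), 2):
            weights[pair] += 1
    if edge_weighting == "binary":
        return {edge: 1 for edge in weights}
    return dict(weights)


terms = ["cat", "sat", "cat", "mat"]
cooc = cooccurrence_edges_sketch(terms, window_width=2)
binary = cooccurrence_edges_sketch(terms, window_width=2, edge_weighting="binary")
```

A wider window connects terms that are farther apart, which yields a denser graph; a narrow window keeps only near neighbors.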

Pipeline Components

textacy.spacier.components: Custom components to add to a spaCy language pipeline.

class textacy.spacier.components.TextStatsComponent(attrs=None)

A custom component to be added to a spaCy language pipeline that computes one, some, or all text stats for a parsed doc and sets the values as custom attributes on a spacy.tokens.Doc.

Add the component to a pipeline, after the parser (as well as any subsequent components that modify the tokens/sentences of the doc):

>>> import spacy
>>> from textacy.spacier.components import TextStatsComponent
>>> en = spacy.load('en')
>>> text_stats_component = TextStatsComponent()
>>> en.add_pipe(text_stats_component, after='parser')

Process a text with the pipeline and access the custom attributes via spaCy’s underscore syntax:

>>> doc = en(u"This is a test test someverylongword.")
>>> doc._.n_words
>>> doc._.flesch_reading_ease

Specify which attributes of the textacy.text_stats.TextStats() to add to processed documents:

>>> en = spacy.load('en')
>>> text_stats_component = TextStatsComponent(attrs='n_words')
>>> en.add_pipe(text_stats_component, last=True)
>>> doc = en(u"This is a test test someverylongword.")
>>> doc._.n_words
>>> doc._.flesch_reading_ease
AttributeError: [E046] Can't retrieve unregistered extension attribute 'flesch_reading_ease'. Did you forget to call the `set_extension` method?

attrs (str or Iterable[str] or None) – If str, a single text stat to compute and set on a Doc. If Iterable[str], multiple text stats. If None, all text stats are computed and set as extensions.


Default name of this component in a spaCy language pipeline, used to get and modify the component via various spacy.Language methods.



See also


spaCy Utils

textacy.spacier.utils: Helper functions for working with / extending spaCy’s core functionality.

textacy.spacier.utils.make_doc_from_text_chunks(text: str, lang: Union[str, spacy.language.Language], chunk_size: int = 100000) → spacy.tokens.doc.Doc

Make a single spaCy-processed document from 1 or more chunks of text. This is a workaround for processing very long texts, for which spaCy is unable to allocate enough RAM.

Although this function’s performance is pretty good, it’s inherently less performant than just processing the entire text in one shot. Only use it if necessary!

  • text – Text document to be chunked and processed by spaCy.

  • lang – A 2-letter language code (e.g. “en”), the name of a spaCy model for the desired language, or an already-instantiated spaCy language pipeline.

  • chunk_size

    Number of characters comprising each text chunk (excluding the last chunk, which is probably smaller). For best performance, value should be somewhere between 1e3 and 1e7, depending on how much RAM you have available.


    Since chunking is done by character, chunk edges probably won’t respect natural language segmentation, so spaCy is likely to make parsing errors around every chunk boundary, i.e. every chunk_size characters.


Returns

A single processed document, initialized from components accumulated chunk by chunk.
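The character-based chunking, and the boundary problem it causes, can be seen in a minimal sketch that needs no spaCy at all. The helper chunk_text_sketch is hypothetical and covers only the splitting step, not the processing or merging of chunks.

```python
from typing import List


def chunk_text_sketch(text: str, chunk_size: int = 100000) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


text = "Short sentences. Chunked by character."
chunks = chunk_text_sketch(text, chunk_size=10)
sizes = [len(chunk) for chunk in chunks]
```

With chunk_size=10 the first chunk ends in the middle of the word “sentences”, which is exactly the kind of boundary that can confuse a parser; only the last chunk is shorter than chunk_size.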

textacy.spacier.utils.merge_spans(spans: Iterable[spacy.tokens.span.Span], doc: spacy.tokens.doc.Doc) → None

Merge spans into single tokens in doc, in-place.

  • spans (Iterable[spacy.tokens.Span])

  • doc (spacy.tokens.Doc)

textacy.spacier.utils.preserve_case(token: spacy.tokens.token.Token) → bool

Return True if token is a proper noun or acronym; otherwise, False.


Raises

ValueError – If parent document has not been POS-tagged.

textacy.spacier.utils.get_normalized_text(span_or_token: Union[spacy.tokens.span.Span, spacy.tokens.token.Token]) → str

Get the text of a spaCy span or token, normalized depending on its characteristics. For proper nouns and acronyms, text is returned as-is; for everything else, text is lemmatized.
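The normalization rule (keep proper nouns and acronyms as-is, lemmatize everything else) can be sketched with a tiny stand-in token type. The Tok dataclass, both function names, and the acronym heuristic (two or more all-uppercase letters) are assumptions made for illustration; textacy's own acronym check may differ.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a POS-tagged, lemmatized spaCy token."""
    text: str
    lemma: str
    pos: str


def preserve_case_sketch(tok: Tok) -> bool:
    """True for proper nouns and (heuristically detected) acronyms."""
    if not tok.pos:
        raise ValueError("token has no part-of-speech tag")
    # Assumed heuristic: an acronym is 2+ letters, all uppercase.
    is_acronym = len(tok.text) >= 2 and tok.text.isalpha() and tok.text.isupper()
    return tok.pos == "PROPN" or is_acronym


def normalized_text_sketch(tok: Tok) -> str:
    """Return the text as-is if case should be preserved, else the lemma."""
    return tok.text if preserve_case_sketch(tok) else tok.lemma


proper = normalized_text_sketch(Tok("Paris", "paris", "PROPN"))
acronym = normalized_text_sketch(Tok("NASA", "nasa", "NOUN"))
verb = normalized_text_sketch(Tok("Running", "run", "VERB"))
```

Preserving case for names and acronyms keeps “Paris” and “NASA” distinct from ordinary lowercase words, while other tokens collapse to their lemma.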

textacy.spacier.utils.get_main_verbs_of_sent(sent: spacy.tokens.span.Span) → List[spacy.tokens.token.Token]

Return the main (non-auxiliary) verbs in a sentence.

textacy.spacier.utils.get_subjects_of_verb(verb: spacy.tokens.token.Token) → List[spacy.tokens.token.Token]

Return all subjects of a verb according to the dependency parse.

textacy.spacier.utils.get_objects_of_verb(verb: spacy.tokens.token.Token) → List[spacy.tokens.token.Token]

Return all objects of a verb according to the dependency parse, including open clausal complements.

textacy.spacier.utils.get_span_for_compound_noun(noun: spacy.tokens.token.Token) → Tuple[int, int]

Return document indexes spanning all (adjacent) tokens in a compound noun.
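As a rough illustration of what “document indexes spanning all adjacent tokens in a compound noun” means, the sketch below walks left from a noun while the neighboring token carries a “compound” dependency label. The Tok dataclass and function name are hypothetical, and the convention that the returned end index is the noun's own index is an assumption; consult the source for the exact behavior.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Tok:
    """Hypothetical stand-in for a dependency-parsed spaCy token."""
    text: str
    dep: str


def compound_noun_span_sketch(tokens: Sequence[Tok], noun_i: int) -> Tuple[int, int]:
    """Return (start, end) indexes covering the noun and its left compounds."""
    start = noun_i
    # Extend leftward over directly adjacent "compound" modifiers.
    while start > 0 and tokens[start - 1].dep == "compound":
        start -= 1
    return (start, noun_i)


tokens = [
    Tok("the", "det"),
    Tok("ice", "compound"),
    Tok("cream", "compound"),
    Tok("truck", "nsubj"),
    Tok("stopped", "ROOT"),
]
span = compound_noun_span_sketch(tokens, noun_i=3)
no_compound = compound_noun_span_sketch(tokens, noun_i=4)
```

Here span covers “ice cream truck” (indexes 1 through 3), while a token with no compound modifiers to its left maps to a span containing only itself.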

textacy.spacier.utils.get_span_for_verb_auxiliaries(verb: spacy.tokens.token.Token) → Tuple[int, int]

Return document indexes spanning all (adjacent) tokens around a verb that are auxiliary verbs or negations.