spaCy extensions

Doc extensions

get_lang

Get the standard, two-letter language code assigned to Doc and its associated spacy.vocab.Vocab.

get_preview

Get a short preview of the Doc, including the number of tokens and an initial snippet.

get_meta

Get custom metadata added to Doc.

set_meta

Add custom metadata to Doc.

get_tokens

Yield the tokens in Doc, one at a time.

get_n_tokens

Get the number of tokens (including punctuation) in Doc.

get_n_sents

Get the number of sentences in Doc.

to_tokenized_text

Transform Doc into an ordered, nested list of token-texts per sentence.

to_tagged_text

Transform Doc into an ordered, nested list of (token-text, part-of-speech tag) pairs per sentence.

to_terms_list

Transform Doc into a sequence of ngrams and/or entities — not necessarily in order of appearance — where each appears in the sequence as many times as it appears in Doc.

to_bag_of_terms

Transform Doc into a bag-of-terms: the set of unique terms in Doc mapped to their frequency of occurrence, where “terms” includes ngrams and/or entities.

to_bag_of_words

Transform Doc into a bag-of-words: the set of unique words in Doc mapped to their absolute, relative, or binary frequency of occurrence.

to_semantic_network

Transform Doc into a semantic network, where nodes are either “words” or “sents” and edges between nodes may be weighted in different ways.

textacy.spacier.doc_extensions: Inspect, extend, and transform spaCy’s core data structure, spacy.tokens.Doc, either directly via functions that take a Doc as their first argument or via custom attributes and methods on instantiated docs, prefixed with an underscore:

>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("This is a short text.")
>>> print(get_preview(doc))
Doc(6 tokens: "This is a short text.")
>>> print(doc._.preview)
Doc(6 tokens: "This is a short text.")
textacy.spacier.doc_extensions.set_doc_extensions()[source]

Set textacy’s custom property and method doc extensions on the global spacy.tokens.Doc.

textacy.spacier.doc_extensions.get_doc_extensions()[source]

Get textacy’s custom property and method doc extensions that can be set on or removed from the global spacy.tokens.Doc.

textacy.spacier.doc_extensions.remove_doc_extensions()[source]

Remove textacy’s custom property and method doc extensions from the global spacy.tokens.Doc.
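A minimal usage sketch, assuming textacy has already registered its extensions (as in the preview example above) and that the returned mapping is keyed by extension name:

>>> import textacy
>>> from textacy.spacier import doc_extensions
>>> sorted(doc_extensions.get_doc_extensions().keys())  # e.g. includes 'preview', 'meta', ...
>>> doc_extensions.remove_doc_extensions()   # unregister textacy's extensions from Doc
>>> doc_extensions.set_doc_extensions()      # register them again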

textacy.spacier.doc_extensions.get_lang(doc: spacy.tokens.doc.Doc) → str[source]

Get the standard, two-letter language code assigned to Doc and its associated spacy.vocab.Vocab.

textacy.spacier.doc_extensions.get_preview(doc: spacy.tokens.doc.Doc) → str[source]

Get a short preview of the Doc, including the number of tokens and an initial snippet.

textacy.spacier.doc_extensions.get_tokens(doc: spacy.tokens.doc.Doc) → Iterable[spacy.tokens.token.Token][source]

Yield the tokens in Doc, one at a time.

textacy.spacier.doc_extensions.get_meta(doc: spacy.tokens.doc.Doc) → dict[source]

Get custom metadata added to Doc.

textacy.spacier.doc_extensions.set_meta(doc: spacy.tokens.doc.Doc, value: dict) → None[source]

Add custom metadata to Doc.
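For example (a brief sketch; the underscore attribute meta is assumed to mirror the getter/setter names, like preview above):

>>> import textacy
>>> from textacy.spacier.doc_extensions import get_meta, set_meta
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("This is a short text.")
>>> set_meta(doc, {"title": "A Short Text", "year": 2020})
>>> get_meta(doc)
{'title': 'A Short Text', 'year': 2020}
>>> doc._.meta["title"]
'A Short Text'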

textacy.spacier.doc_extensions.get_n_tokens(doc: spacy.tokens.doc.Doc) → int[source]

Get the number of tokens (including punctuation) in Doc.

textacy.spacier.doc_extensions.get_n_sents(doc: spacy.tokens.doc.Doc) → int[source]

Get the number of sentences in Doc.

textacy.spacier.doc_extensions.to_tokenized_text(doc: spacy.tokens.doc.Doc) → List[List[str]][source]

Transform Doc into an ordered, nested list of token-texts per sentence.

Note

If doc hasn’t been segmented into sentences, the entire document is treated as a single sentence.

textacy.spacier.doc_extensions.to_tagged_text(doc: spacy.tokens.doc.Doc) → List[List[Tuple[str, str]]][source]

Transform Doc into an ordered, nested list of (token-text, part-of-speech tag) pairs per sentence.

Note

If doc hasn’t been segmented into sentences, the entire document is treated as a single sentence.
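For example (outputs shown for illustration; exact tokenization and tags depend on the loaded model):

>>> import textacy
>>> from textacy.spacier.doc_extensions import to_tokenized_text, to_tagged_text
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("Burton loves cats. Cats ignore Burton.")
>>> to_tokenized_text(doc)
[['Burton', 'loves', 'cats', '.'], ['Cats', 'ignore', 'Burton', '.']]
>>> to_tagged_text(doc)  # same structure, with each token paired with its part-of-speech tag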

textacy.spacier.doc_extensions.to_terms_list(doc: spacy.tokens.doc.Doc, *, ngrams: Optional[Union[int, Collection[int]]] = (1, 2, 3), entities: Optional[bool] = True, normalize: Optional[Union[str, Callable[[Union[spacy.tokens.span.Span, spacy.tokens.token.Token]], str]]] = 'lemma', as_strings: bool = False, **kwargs) → Union[Iterable[int], Iterable[str]][source]

Transform Doc into a sequence of ngrams and/or entities — not necessarily in order of appearance — where each appears in the sequence as many times as it appears in Doc.

Parameters
  • doc

  • ngrams – ngrams to include in the terms list. If {1, 2, 3}, unigrams, bigrams, and trigrams are included; if 2, only bigrams are included; if None, ngrams aren’t included, except for those belonging to named entities.

  • entities

    If True, entities are included in the terms list; if False, they are excluded from the list; if None, entities aren’t included or excluded at all.

    Note

    When both entities and ngrams are non-null, exact duplicates (based on start and end indexes) are handled. If entities is True, any duplicate entities are included while duplicate ngrams are discarded to avoid double-counting; if entities is False, no entities are included and duplicate ngrams are likewise discarded.

  • normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in doc; if callable, must accept a Token or Span and return a str, e.g. get_normalized_text().

  • as_strings – If True, terms are returned as strings; if False, terms are returned as their unique integer ids.

  • kwargs

    • filter_stops (bool)

    • filter_punct (bool)

    • filter_nums (bool)

    • include_pos (str or Set[str])

    • exclude_pos (str or Set[str])

    • min_freq (int)

    • include_types (str or Set[str])

    • exclude_types (str or Set[str])

    • drop_determiners (bool)

    See textacy.extract.words(), textacy.extract.ngrams(), and textacy.extract.entities() for details.

Yields

The next term in the terms list, as either a unique integer id or a string.

Raises
  • ValueError – if neither entities nor ngrams are included, or if normalize has an invalid value

  • TypeError – if entities has an invalid type

Note

Despite the name, this is a generator function; to get an actual list of terms, call list(to_terms_list(doc)).
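A usage sketch combining a few of the keyword arguments above (the exact terms returned depend on the loaded model):

>>> import textacy
>>> from textacy.spacier.doc_extensions import to_terms_list
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("Burton DeWilde is a data scientist based in Chicago.")
>>> terms = list(to_terms_list(
...     doc, ngrams=(1, 2), entities=True, normalize="lemma",
...     as_strings=True, filter_stops=True, filter_punct=True))
>>> terms  # e.g. includes 'Burton DeWilde', 'Chicago', 'data scientist', ...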

textacy.spacier.doc_extensions.to_bag_of_terms(doc: spacy.tokens.doc.Doc, *, ngrams: Optional[Union[int, Collection[int]]] = (1, 2, 3), entities: Optional[bool] = True, normalize: Optional[Union[str, Callable[[Union[spacy.tokens.span.Span, spacy.tokens.token.Token]], str]]] = 'lemma', weighting: str = 'count', as_strings: bool = False, **kwargs) → Dict[Union[int, str], Union[int, float]][source]

Transform Doc into a bag-of-terms: the set of unique terms in Doc mapped to their frequency of occurrence, where “terms” includes ngrams and/or entities.

Parameters
  • doc

  • ngrams – n of the n-grams to include. The default, (1, 2, 3), includes unigrams (words), bigrams, and trigrams; 2 includes only bigrams; a falsy value (e.g. False) excludes ngrams entirely.

  • entities – If True (default), include named entities; note: if ngrams are also included, any ngrams that exactly overlap with an entity are skipped to prevent double-counting

  • normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in doc; if a callable, must accept a Token or Span and return a str, e.g. textacy.spacier.utils.get_normalized_text().

  • weighting ({"count", "freq", "binary"}) – Type of weight to assign to terms. If “count” (default), weights are the absolute number of occurrences (count) of term in doc. If “binary”, all counts are set equal to 1. If “freq”, term counts are normalized by the total token count, giving their relative frequency of occurrence.

  • as_strings – If True, words are returned as strings; if False (default), words are returned as their unique integer ids.

  • kwargs

    • filter_stops (bool)

    • filter_punct (bool)

    • filter_nums (bool)

    • include_pos (str or Set[str])

    • exclude_pos (str or Set[str])

    • min_freq (int)

    • include_types (str or Set[str])

    • exclude_types (str or Set[str])

    • drop_determiners (bool)

    See textacy.extract.words(), textacy.extract.ngrams(), and textacy.extract.entities() for details.

Returns

Mapping of a unique term id or string (depending on the value of as_strings) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).

See also

to_terms_list(), which is used under the hood.
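For example (term strings and counts will vary with the model and normalization):

>>> import textacy
>>> from textacy.spacier.doc_extensions import to_bag_of_terms
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("Burton DeWilde is a data scientist. Burton lives in Chicago.")
>>> bot = to_bag_of_terms(doc, ngrams=(1, 2), entities=True,
...                       weighting="count", as_strings=True)
>>> sorted(bot.items(), key=lambda x: x[1], reverse=True)[:3]  # three most frequent terms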

textacy.spacier.doc_extensions.to_bag_of_words(doc: spacy.tokens.doc.Doc, *, normalize: str = 'lemma', weighting: str = 'count', as_strings: bool = False, filter_stops: bool = True, filter_punct: bool = True, filter_nums: bool = False) → Dict[Union[int, str], Union[int, float]][source]

Transform Doc into a bag-of-words: the set of unique words in Doc mapped to their absolute, relative, or binary frequency of occurrence.

Parameters
  • doc

  • normalize – If “lemma”, lemmatize words before counting; if “lower”, lowercase words before counting; otherwise, words are counted using the form in which they appear in doc.

  • weighting ({"count", "freq", "binary"}) – Type of weight to assign to words. If “count” (default), weights are the absolute number of occurrences (count) of word in doc. If “binary”, all counts are set equal to 1. If “freq”, word counts are normalized by the total token count, giving their relative frequency of occurrence. Note: The resulting set of frequencies won’t (necessarily) sum to 1.0, since punctuation and stop words are filtered out after counts are normalized.

  • as_strings (bool) – If True, words are returned as strings; if False (default), words are returned as their unique integer ids

  • filter_stops (bool) – If True (default), stop words are removed after counting.

  • filter_punct (bool) – If True (default), punctuation tokens are removed after counting.

  • filter_nums (bool) – If True, tokens consisting of digits are removed after counting.

Returns

Mapping of a unique term id or string (depending on the value of as_strings) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).
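For example (a sketch; the exact lemmas and frequencies depend on the loaded model's tokenizer and lemmatizer):

>>> import textacy
>>> from textacy.spacier.doc_extensions import to_bag_of_words
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("The cats chased the mouse, and the mouse escaped.")
>>> bow = to_bag_of_words(doc, normalize="lemma", weighting="freq", as_strings=True)
>>> bow  # e.g. {'cat': 0.09..., 'chase': 0.09..., 'mouse': 0.18..., 'escape': 0.09...}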

textacy.spacier.doc_extensions.to_semantic_network(doc: spacy.tokens.doc.Doc, *, nodes: str = 'words', normalize: Optional[Union[str, Callable[[Union[spacy.tokens.span.Span, spacy.tokens.token.Token]], str]]] = 'lemma', edge_weighting: str = 'default', window_width: int = 10) → networkx.classes.graph.Graph[source]

Transform Doc into a semantic network, where nodes are either “words” or “sents” and edges between nodes may be weighted in different ways.

Parameters
  • doc

  • nodes ({"words", "sents"}) – Type of doc component to use as nodes in the semantic network.

  • normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in doc; if a callable, must accept a Token or Span (if nodes = “words” or “sents”, respectively) and return a str, e.g. get_normalized_text()

  • edge_weighting – Type of weighting to apply to edges between nodes; if nodes = “words”, options are {“cooc_freq”, “binary”}, if nodes = “sents”, options are {“cosine”, “jaccard”}; if “default”, “cooc_freq” or “cosine” will be automatically used.

  • window_width – Size of sliding window over terms that determines which are said to co-occur; only applicable if nodes = “words”.

Returns

Graph whose nodes represent either terms or sentences in doc and whose edges represent the relationships between them.

Return type

networkx.Graph

Raises

ValueError – If nodes is neither “words” nor “sents”.
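For example (a sketch, assuming co-occurrence counts are stored under networkx's conventional "weight" edge attribute):

>>> import textacy
>>> from textacy.spacier.doc_extensions import to_semantic_network
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("Cats chase mice. Mice fear cats. Cats nap.")
>>> graph = to_semantic_network(doc, nodes="words", edge_weighting="cooc_freq", window_width=5)
>>> graph.number_of_nodes(), graph.number_of_edges()
>>> sorted(graph.edges(data="weight"))[:3]  # (word, word, co-occurrence count) triples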

Pipeline Components

textacy.spacier.components: Custom components to add to a spaCy language pipeline.

class textacy.spacier.components.TextStatsComponent(attrs=None)[source]

A custom component to be added to a spaCy language pipeline that computes one, some, or all text stats for a parsed doc and sets the values as custom attributes on a spacy.tokens.Doc.

Add the component to a pipeline after the parser, and after any other components that modify the doc’s tokens or sentences:

>>> en = spacy.load('en')
>>> text_stats_component = TextStatsComponent()
>>> en.add_pipe(text_stats_component, after='parser')

Process a text with the pipeline and access the custom attributes via spaCy’s underscore syntax:

>>> doc = en(u"This is a test test someverylongword.")
>>> doc._.n_words
6
>>> doc._.flesch_reading_ease
73.84500000000001

Specify which attributes of the textacy.text_stats.TextStats() to add to processed documents:

>>> en = spacy.load('en')
>>> text_stats_component = TextStatsComponent(attrs='n_words')
>>> en.add_pipe(text_stats_component, last=True)
>>> doc = en(u"This is a test test someverylongword.")
>>> doc._.n_words
6
>>> doc._.flesch_reading_ease
AttributeError: [E046] Can't retrieve unregistered extension attribute 'flesch_reading_ease'. Did you forget to call the `set_extension` method?
Parameters

attrs (str or Iterable[str] or None) – If str, a single text stat to compute and set on a Doc. If Iterable[str], multiple text stats. If None, all text stats are computed and set as extensions.

name

Default name of this component in a spaCy language pipeline, used to get and modify the component via various spacy.Language methods, e.g. https://spacy.io/api/language#get_pipe.

Type

str

See also

textacy.text_stats.TextStats

spaCy Utils

textacy.spacier.utils: Helper functions for working with / extending spaCy’s core functionality.

textacy.spacier.utils.make_doc_from_text_chunks(text: str, lang: Union[str, spacy.language.Language], chunk_size: int = 100000) → spacy.tokens.doc.Doc[source]

Make a single spaCy-processed document from 1 or more chunks of text. This is a workaround for processing very long texts, for which spaCy is unable to allocate enough RAM.

Although this function’s performance is pretty good, it’s inherently less performant than processing the entire text in one shot. Only use it if necessary!

Parameters
  • text – Text document to be chunked and processed by spaCy.

  • lang – A 2-letter language code (e.g. “en”), the name of a spaCy model for the desired language, or an already-instantiated spaCy language pipeline.

  • chunk_size

    Number of characters comprising each text chunk (excluding the last chunk, which is probably smaller). For best performance, value should be somewhere between 1e3 and 1e7, depending on how much RAM you have available.

    Note

    Since chunking is done by character, chunk edges probably won’t respect natural-language segmentation, which means that spaCy will likely get tripped up near each chunk boundary (i.e. every chunk_size characters) and make odd parsing errors there.

Returns

A single processed document, initialized from components accumulated chunk by chunk.
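A minimal sketch (the text here is synthetic; in practice this is only worthwhile for texts long enough to strain memory):

>>> from textacy.spacier.utils import make_doc_from_text_chunks
>>> long_text = "This is one short sentence. " * 50000   # roughly 1.4 million characters
>>> doc = make_doc_from_text_chunks(long_text, lang="en", chunk_size=100000)
>>> len(doc)   # token count of the single, fully processed document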

textacy.spacier.utils.merge_spans(spans: Iterable[spacy.tokens.span.Span], doc: spacy.tokens.doc.Doc) → None[source]

Merge spans into single tokens in doc, in-place.

Parameters
  • spans (Iterable[spacy.tokens.Span]) –

  • doc (spacy.tokens.Doc) –
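For example, merging each named-entity span into a single token (a sketch; entity boundaries depend on the model):

>>> import textacy
>>> from textacy.spacier.utils import merge_spans
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("Burton DeWilde lives in New York City.")
>>> merge_spans(doc.ents, doc)   # modifies doc in-place; returns None
>>> [tok.text for tok in doc]    # e.g. ['Burton DeWilde', 'lives', 'in', 'New York City', '.']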

textacy.spacier.utils.preserve_case(token: spacy.tokens.token.Token) → bool[source]

Return True if token is a proper noun or acronym; otherwise, False.

Raises

ValueError – If parent document has not been POS-tagged.

textacy.spacier.utils.get_normalized_text(span_or_token: Union[spacy.tokens.span.Span, spacy.tokens.token.Token]) → str[source]

Get the text of a spaCy span or token, normalized depending on its characteristics. For proper nouns and acronyms, text is returned as-is; for everything else, text is lemmatized.
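For example (the lemmas shown are illustrative and depend on the model’s lemmatizer):

>>> import textacy
>>> from textacy.spacier.utils import get_normalized_text
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("The cats were napping in Chicago.")
>>> [get_normalized_text(tok) for tok in doc]  # e.g. ['the', 'cat', 'be', 'nap', 'in', 'Chicago', '.']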

textacy.spacier.utils.get_main_verbs_of_sent(sent: spacy.tokens.span.Span) → List[spacy.tokens.token.Token][source]

Return the main (non-auxiliary) verbs in a sentence.

textacy.spacier.utils.get_subjects_of_verb(verb: spacy.tokens.token.Token) → List[spacy.tokens.token.Token][source]

Return all subjects of a verb according to the dependency parse.

textacy.spacier.utils.get_objects_of_verb(verb: spacy.tokens.token.Token) → List[spacy.tokens.token.Token][source]

Return all objects of a verb according to the dependency parse, including open clausal complements.

textacy.spacier.utils.get_span_for_compound_noun(noun: spacy.tokens.token.Token) → Tuple[int, int][source]

Return document indexes spanning all (adjacent) tokens in a compound noun.

textacy.spacier.utils.get_span_for_verb_auxiliaries(verb: spacy.tokens.token.Token) → Tuple[int, int][source]

Return document indexes spanning all (adjacent) tokens around a verb that are auxiliary verbs or negations.
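These helpers compose into a rough subject-verb-object sketch (results depend on the dependency parse; the verb-auxiliary span indexes are assumed here to be inclusive):

>>> import textacy
>>> from textacy.spacier import utils as spacier_utils
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("The little cats were chasing a very fast mouse.")
>>> sent = list(doc.sents)[0]
>>> for verb in spacier_utils.get_main_verbs_of_sent(sent):
...     subjects = spacier_utils.get_subjects_of_verb(verb)
...     objects = spacier_utils.get_objects_of_verb(verb)
...     start, end = spacier_utils.get_span_for_verb_auxiliaries(verb)
...     # assumption: (start, end) are inclusive token indexes into doc
...     print(subjects, doc[start : end + 1], objects)
[cats] were chasing [mouse]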