spaCy extensions

Doc extensions

get_lang

Get the standard, two-letter language code assigned to Doc and its associated spacy.vocab.Vocab.

get_preview

Get a short preview of the Doc, including the number of tokens and an initial snippet.

get_meta

Get custom metadata added to Doc.

set_meta

Add custom metadata to Doc.

get_tokens

Yield the tokens in Doc, one at a time.

get_n_tokens

Get the number of tokens (including punctuation) in Doc.

get_n_sents

Get the number of sentences in Doc.

to_tokenized_text

Transform Doc into an ordered, nested list of token-texts per sentence.

to_tagged_text

Transform Doc into an ordered, nested list of (token-text, part-of-speech tag) pairs per sentence.

to_terms_list

Transform Doc into a sequence of ngrams and/or entities — not necessarily in order of appearance — where each appears in the sequence as many times as it appears in Doc.

to_bag_of_terms

Transform Doc into a bag-of-terms: the set of unique terms in Doc mapped to their frequency of occurrence, where “terms” includes ngrams and/or entities.

to_bag_of_words

Transform Doc into a bag-of-words: the set of unique words in Doc mapped to their absolute, relative, or binary frequency of occurrence.

to_semantic_network

Transform Doc into a semantic network, where nodes are either “words” or “sents” and edges between nodes may be weighted in different ways.

textacy.spacier.doc_extensions: Inspect, extend, and transform spaCy’s core data structure, spacy.tokens.Doc, either directly via functions that take a Doc as their first argument or via custom attributes and methods on instantiated docs, prefixed with an underscore:

>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("This is a short text.")
>>> print(get_preview(doc))
Doc(6 tokens: "This is a short text.")
>>> print(doc._.preview)
Doc(6 tokens: "This is a short text.")
textacy.spacier.doc_extensions.set_doc_extensions()[source]

Set textacy’s custom property and method doc extensions on the global spacy.tokens.Doc.

textacy.spacier.doc_extensions.get_doc_extensions()[source]

Get textacy’s custom property and method doc extensions that can be set on or removed from the global spacy.tokens.Doc.

textacy.spacier.doc_extensions.remove_doc_extensions()[source]

Remove textacy’s custom property and method doc extensions from the global spacy.tokens.Doc.
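A minimal usage sketch, assuming textacy has already registered its extensions (as in the preview example above) and that the returned mapping is keyed by extension name:

>>> import textacy
>>> from textacy.spacier import doc_extensions
>>> sorted(doc_extensions.get_doc_extensions().keys())  # e.g. includes 'preview', 'meta', ...
>>> doc_extensions.remove_doc_extensions()   # unregister textacy's extensions from Doc
>>> doc_extensions.set_doc_extensions()      # register them again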

textacy.spacier.doc_extensions.get_lang(doc: spacy.tokens.doc.Doc) → str[source]

Get the standard, two-letter language code assigned to Doc and its associated spacy.vocab.Vocab.

textacy.spacier.doc_extensions.get_preview(doc: spacy.tokens.doc.Doc) → str[source]

Get a short preview of the Doc, including the number of tokens and an initial snippet.

textacy.spacier.doc_extensions.get_tokens(doc: spacy.tokens.doc.Doc) → Iterable[spacy.tokens.token.Token][source]

Yield the tokens in Doc, one at a time.

textacy.spacier.doc_extensions.get_meta(doc: spacy.tokens.doc.Doc) → dict[source]

Get custom metadata added to Doc.

textacy.spacier.doc_extensions.set_meta(doc: spacy.tokens.doc.Doc, value: dict) → None[source]

Add custom metadata to Doc.
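For example (a brief sketch; the underscore attribute meta is assumed to mirror the getter/setter names, like preview above):

>>> import textacy
>>> from textacy.spacier.doc_extensions import get_meta, set_meta
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("This is a short text.")
>>> set_meta(doc, {"title": "A Short Text", "year": 2020})
>>> get_meta(doc)
{'title': 'A Short Text', 'year': 2020}
>>> doc._.meta["title"]
'A Short Text'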

textacy.spacier.doc_extensions.get_n_tokens(doc: spacy.tokens.doc.Doc) → int[source]

Get the number of tokens (including punctuation) in Doc.

textacy.spacier.doc_extensions.get_n_sents(doc: spacy.tokens.doc.Doc) → int[source]

Get the number of sentences in Doc.

textacy.spacier.doc_extensions.to_tokenized_text(doc: spacy.tokens.doc.Doc) → List[List[str]][source]

Transform Doc into an ordered, nested list of token-texts per sentence.

Note

If doc hasn’t been segmented into sentences, the entire document is treated as a single sentence.

textacy.spacier.doc_extensions.to_tagged_text(doc: spacy.tokens.doc.Doc) → List[List[Tuple[str, str]]][source]

Transform Doc into an ordered, nested list of (token-text, part-of-speech tag) pairs per sentence.

Note

If doc hasn’t been segmented into sentences, the entire document is treated as a single sentence.
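For example (outputs shown for illustration; exact tokenization and tags depend on the loaded model):

>>> import textacy
>>> from textacy.spacier.doc_extensions import to_tokenized_text, to_tagged_text
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("Burton loves cats. Cats ignore Burton.")
>>> to_tokenized_text(doc)
[['Burton', 'loves', 'cats', '.'], ['Cats', 'ignore', 'Burton', '.']]
>>> to_tagged_text(doc)  # same structure, with each token paired with its part-of-speech tag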

textacy.spacier.doc_extensions.to_terms_list(doc: spacy.tokens.doc.Doc, *, ngrams: Optional[Union[int, Collection[int]]] = (1, 2, 3), entities: Optional[bool] = True, normalize: Optional[Union[str, Callable[[Union[spacy.tokens.span.Span, spacy.tokens.token.Token]], str]]] = 'lemma', as_strings: bool = False, **kwargs) → Union[Iterable[int], Iterable[str]][source]

Transform Doc into a sequence of ngrams and/or entities — not necessarily in order of appearance — where each appears in the sequence as many times as it appears in Doc.

Parameters
  • doc

  • ngrams – ngrams to include in the terms list. If {1, 2, 3}, unigrams, bigrams, and trigrams are included; if 2, only bigrams are included; if None, ngrams aren’t included, except for those belonging to named entities.

  • entities

    If True, entities are included in the terms list; if False, they are excluded from the list; if None, entities aren’t included or excluded at all.

    Note

    When both entities and ngrams are non-null, exact duplicates (based on start and end indexes) are handled. If entities is True, any duplicate entities are included while duplicate ngrams are discarded to avoid double-counting; if entities is False, no entities are included and duplicate ngrams are likewise discarded.

  • normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in doc; if callable, must accept a Token or Span and return a str, e.g. get_normalized_text().

  • as_strings – If True, terms are returned as strings; if False, terms are returned as their unique integer ids.

  • kwargs

    • filter_stops (bool)

    • filter_punct (bool)

    • filter_nums (bool)

    • include_pos (str or Set[str])

    • exclude_pos (str or Set[str])

    • min_freq (int)

    • include_types (str or Set[str])

    • exclude_types (str or Set[str])

    • drop_determiners (bool)

    See textacy.extract.words(), textacy.extract.ngrams(), and textacy.extract.entities() for details.

Yields

The next term in the terms list, as either a unique integer id or a string.

Raises
  • ValueError – if neither entities nor ngrams are included, or if normalize has an invalid value

  • TypeError – if entities has an invalid type

Note

Despite the name, this is a generator function; to get an actual list of terms, call list(to_terms_list(doc)).
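A usage sketch combining a few of the keyword arguments above (the exact terms returned depend on the loaded model):

>>> import textacy
>>> from textacy.spacier.doc_extensions import to_terms_list
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("Burton DeWilde is a data scientist based in Chicago.")
>>> terms = list(to_terms_list(
...     doc, ngrams=(1, 2), entities=True, normalize="lemma",
...     as_strings=True, filter_stops=True, filter_punct=True))
>>> terms  # e.g. includes 'Burton DeWilde', 'Chicago', 'data scientist', ...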

textacy.spacier.doc_extensions.to_bag_of_terms(doc: spacy.tokens.doc.Doc, *, ngrams: Optional[Union[int, Collection[int]]] = (1, 2, 3), entities: Optional[bool] = True, normalize: Optional[Union[str, Callable[[Union[spacy.tokens.span.Span, spacy.tokens.token.Token]], str]]] = 'lemma', weighting: str = 'count', as_strings: bool = False, **kwargs) → Dict[Union[int, str], Union[int, float]][source]

Transform Doc into a bag-of-terms: the set of unique terms in Doc mapped to their frequency of occurrence, where “terms” includes ngrams and/or entities.

Parameters
  • doc

  • ngrams – n of the n-grams to include. The default, (1, 2, 3), includes unigrams (words), bigrams, and trigrams; 2 includes only bigrams; a falsy value (e.g. False) excludes ngrams entirely.

  • entities – If True (default), include named entities; note: if ngrams are also included, any ngrams that exactly overlap with an entity are skipped to prevent double-counting

  • normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in doc; if a callable, must accept a Token or Span and return a str, e.g. textacy.spacier.utils.get_normalized_text().

  • weighting ({"count", "freq", "binary"}) – Type of weight to assign to terms. If “count” (default), weights are the absolute number of occurrences (count) of term in doc. If “binary”, all counts are set equal to 1. If “freq”, term counts are normalized by the total token count, giving their relative frequency of occurrence.

  • as_strings – If True, words are returned as strings; if False (default), words are returned as their unique integer ids.

  • kwargs

    • filter_stops (bool)

    • filter_punct (bool)

    • filter_nums (bool)

    • include_pos (str or Set[str])

    • exclude_pos (str or Set[str])

    • min_freq (int)

    • include_types (str or Set[str])

    • exclude_types (str or Set[str])

    • drop_determiners (bool)

    See textacy.extract.words(), textacy.extract.ngrams(), and textacy.extract.entities() for details.

Returns

Mapping of a unique term id or string (depending on the value of as_strings) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).

See also

to_terms_list(), which is used under the hood.
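For example (term strings and counts will vary with the model and normalization):

>>> import textacy
>>> from textacy.spacier.doc_extensions import to_bag_of_terms
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("Burton DeWilde is a data scientist. Burton lives in Chicago.")
>>> bot = to_bag_of_terms(doc, ngrams=(1, 2), entities=True,
...                       weighting="count", as_strings=True)
>>> sorted(bot.items(), key=lambda x: x[1], reverse=True)[:3]  # three most frequent terms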

textacy.spacier.doc_extensions.to_bag_of_words(doc: spacy.tokens.doc.Doc, *, normalize: str = 'lemma', weighting: str = 'count', as_strings: bool = False, filter_stops: bool = True, filter_punct: bool = True, filter_nums: bool = False) → Dict[Union[int, str], Union[int, float]][source]

Transform Doc into a bag-of-words: the set of unique words in Doc mapped to their absolute, relative, or binary frequency of occurrence.

Parameters
  • doc

  • normalize – If “lemma”, lemmatize words before counting; if “lower”, lowercase words before counting; otherwise, words are counted using the form in which they appear in doc.

  • weighting ({"count", "freq", "binary"}) – Type of weight to assign to words. If “count” (default), weights are the absolute number of occurrences (count) of word in doc. If “binary”, all counts are set equal to 1. If “freq”, word counts are normalized by the total token count, giving their relative frequency of occurrence. Note: The resulting set of frequencies won’t (necessarily) sum to 1.0, since punctuation and stop words are filtered out after counts are normalized.

  • as_strings (bool) – If True, words are returned as strings; if False (default), words are returned as their unique integer ids

  • filter_stops (bool) – If True (default), stop words are removed after counting.

  • filter_punct (bool) – If True (default), punctuation tokens are removed after counting.

  • filter_nums (bool) – If True, tokens consisting of digits are removed after counting.

Returns

Mapping of a unique term id or string (depending on the value of as_strings) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).
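For example (a sketch; the exact lemmas and frequencies depend on the loaded model's tokenizer and lemmatizer):

>>> import textacy
>>> from textacy.spacier.doc_extensions import to_bag_of_words
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("The cats chased the mouse, and the mouse escaped.")
>>> bow = to_bag_of_words(doc, normalize="lemma", weighting="freq", as_strings=True)
>>> bow  # e.g. {'cat': 0.09..., 'chase': 0.09..., 'mouse': 0.18..., 'escape': 0.09...}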

textacy.spacier.doc_extensions.to_semantic_network(doc: spacy.tokens.doc.Doc, *, nodes: str = 'words', normalize: Optional[Union[str, Callable[[Union[spacy.tokens.span.Span, spacy.tokens.token.Token]], str]]] = 'lemma', edge_weighting: str = 'default', window_width: int = 10) → networkx.classes.graph.Graph[source]

Transform Doc into a semantic network, where nodes are either “words” or “sents” and edges between nodes may be weighted in different ways.

Parameters
  • doc

  • nodes ({"words", "sents"}) – Type of doc component to use as nodes in the semantic network.

  • normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in doc; if a callable, must accept a Token or Span (if nodes = “words” or “sents”, respectively) and return a str, e.g. get_normalized_text()

  • edge_weighting – Type of weighting to apply to edges between nodes; if nodes = “words”, options are {“cooc_freq”, “binary”}, if nodes = “sents”, options are {“cosine”, “jaccard”}; if “default”, “cooc_freq” or “cosine” will be automatically used.

  • window_width – Size of sliding window over terms that determines which are said to co-occur; only applicable if nodes = “words”.

Returns

Graph whose nodes represent either terms or sentences in doc and whose edges represent the relationships between them.

Return type

networkx.Graph

Raises

ValueError – If nodes is neither “words” nor “sents”.
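For example (a sketch, assuming co-occurrence counts are stored under networkx's conventional "weight" edge attribute):

>>> import textacy
>>> from textacy.spacier.doc_extensions import to_semantic_network
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("Cats chase mice. Mice fear cats. Cats nap.")
>>> graph = to_semantic_network(doc, nodes="words", edge_weighting="cooc_freq", window_width=5)
>>> graph.number_of_nodes(), graph.number_of_edges()
>>> sorted(graph.edges(data="weight"))[:3]  # (word, word, co-occurrence count) triples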

Pipeline Components

textacy.spacier.components: Custom components to add to a spaCy language pipeline.

class textacy.spacier.components.TextStatsComponent(attrs=None)[source]

A custom component to be added to a spaCy language pipeline that computes one, some, or all text stats for a parsed doc and sets the values as custom attributes on a spacy.tokens.Doc.

Add the component to a pipeline after the parser, and after any other components that modify the doc’s tokens or sentences:

>>> en = spacy.load('en')
>>> text_stats_component = TextStatsComponent()
>>> en.add_pipe(text_stats_component, after='parser')

Process a text with the pipeline and access the custom attributes via spaCy’s underscore syntax:

>>> doc = en(u"This is a test test someverylongword.")
>>> doc._.n_words
6
>>> doc._.flesch_reading_ease
73.84500000000001

Specify which attributes of the textacy.text_stats.TextStats() to add to processed documents:

>>> en = spacy.load('en')
>>> text_stats_component = TextStatsComponent(attrs='n_words')
>>> en.add_pipe(text_stats_component, last=True)
>>> doc = en(u"This is a test test someverylongword.")
>>> doc._.n_words
6
>>> doc._.flesch_reading_ease
AttributeError: [E046] Can't retrieve unregistered extension attribute 'flesch_reading_ease'. Did you forget to call the `set_extension` method?
Parameters

attrs (str or Iterable[str] or None) – If str, a single text stat to compute and set on a Doc. If Iterable[str], multiple text stats. If None, all text stats are computed and set as extensions.

name

Default name of this component in a spaCy language pipeline, used to get and modify the component via various spacy.Language methods, e.g. https://spacy.io/api/language#get_pipe.

Type

str

See also

textacy.text_stats.TextStats

spaCy Utils

textacy.spacier.utils: Helper functions for working with / extending spaCy’s core functionality.

textacy.spacier.utils.make_doc_from_text_chunks(text: str, lang: Union[str, spacy.language.Language], chunk_size: int = 100000) → spacy.tokens.doc.Doc[source]

Make a single spaCy-processed document from 1 or more chunks of text. This is a workaround for processing very long texts, for which spaCy is unable to allocate enough RAM.

Although this function’s performance is pretty good, it’s inherently less performant than processing the entire text in one shot. Only use it if necessary!

Parameters
  • text – Text document to be chunked and processed by spaCy.

  • lang – A 2-letter language code (e.g. “en”), the name of a spaCy model for the desired language, or an already-instantiated spaCy language pipeline.

  • chunk_size

    Number of characters comprising each text chunk (excluding the last chunk, which is probably smaller). For best performance, value should be somewhere between 1e3 and 1e7, depending on how much RAM you have available.

    Note

    Since chunking is done by character, chunk edges probably won’t respect natural-language segmentation, which means that spaCy will likely get tripped up near each chunk boundary (i.e. every chunk_size characters) and make odd parsing errors there.

Returns

A single processed document, initialized from components accumulated chunk by chunk.
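A minimal sketch (the text here is synthetic; in practice this is only worthwhile for texts long enough to strain memory):

>>> from textacy.spacier.utils import make_doc_from_text_chunks
>>> long_text = "This is one short sentence. " * 50000   # roughly 1.4 million characters
>>> doc = make_doc_from_text_chunks(long_text, lang="en", chunk_size=100000)
>>> len(doc)   # token count of the single, fully processed document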

textacy.spacier.utils.merge_spans(spans: Iterable[spacy.tokens.span.Span], doc: spacy.tokens.doc.Doc) → None[source]

Merge spans into single tokens in doc, in-place.

Parameters
  • spans (Iterable[spacy.tokens.Span]) –

  • doc (spacy.tokens.Doc) –
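For example, merging each named-entity span into a single token (a sketch; entity boundaries depend on the model):

>>> import textacy
>>> from textacy.spacier.utils import merge_spans
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("Burton DeWilde lives in New York City.")
>>> merge_spans(doc.ents, doc)   # modifies doc in-place; returns None
>>> [tok.text for tok in doc]    # e.g. ['Burton DeWilde', 'lives', 'in', 'New York City', '.']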

textacy.spacier.utils.preserve_case(token: spacy.tokens.token.Token) → bool[source]

Return True if token is a proper noun or acronym; otherwise, False.

Raises

ValueError – If parent document has not been POS-tagged.

textacy.spacier.utils.get_normalized_text(span_or_token: Union[spacy.tokens.span.Span, spacy.tokens.token.Token]) → str[source]

Get the text of a spaCy span or token, normalized depending on its characteristics. For proper nouns and acronyms, text is returned as-is; for everything else, text is lemmatized.
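For example (the lemmas shown are illustrative and depend on the model’s lemmatizer):

>>> import textacy
>>> from textacy.spacier.utils import get_normalized_text
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("The cats were napping in Chicago.")
>>> [get_normalized_text(tok) for tok in doc]  # e.g. ['the', 'cat', 'be', 'nap', 'in', 'Chicago', '.']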

textacy.spacier.utils.get_main_verbs_of_sent(sent: spacy.tokens.span.Span) → List[spacy.tokens.token.Token][source]

Return the main (non-auxiliary) verbs in a sentence.

textacy.spacier.utils.get_subjects_of_verb(verb: spacy.tokens.token.Token) → List[spacy.tokens.token.Token][source]

Return all subjects of a verb according to the dependency parse.

textacy.spacier.utils.get_objects_of_verb(verb: spacy.tokens.token.Token) → List[spacy.tokens.token.Token][source]

Return all objects of a verb according to the dependency parse, including open clausal complements.

textacy.spacier.utils.get_span_for_compound_noun(noun: spacy.tokens.token.Token) → Tuple[int, int][source]

Return document indexes spanning all (adjacent) tokens in a compound noun.

textacy.spacier.utils.get_span_for_verb_auxiliaries(verb: spacy.tokens.token.Token) → Tuple[int, int][source]

Return document indexes spanning all (adjacent) tokens around a verb that are auxiliary verbs or negations.
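These helpers compose into a rough subject-verb-object sketch (results depend on the dependency parse; the verb-auxiliary span indexes are assumed here to be inclusive):

>>> import textacy
>>> from textacy.spacier import utils as spacier_utils
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("The little cats were chasing a very fast mouse.")
>>> sent = list(doc.sents)[0]
>>> for verb in spacier_utils.get_main_verbs_of_sent(sent):
...     subjects = spacier_utils.get_subjects_of_verb(verb)
...     objects = spacier_utils.get_objects_of_verb(verb)
...     start, end = spacier_utils.get_span_for_verb_auxiliaries(verb)
...     # assumption: (start, end) are inclusive token indexes into doc
...     print(subjects, doc[start : end + 1], objects)
[cats] were chasing [mouse]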