spaCy extensions¶

Doc extensions¶

- get_lang: Get the standard, two-letter language code assigned to Doc.
- get_preview: Get a short preview of the Doc, including the number of tokens and an initial snippet.
- get_meta: Get custom metadata added to Doc.
- set_meta: Add custom metadata to Doc.
- get_tokens: Yield the tokens in Doc, one at a time.
- get_n_tokens: Get the number of tokens (including punctuation) in Doc.
- get_n_sents: Get the number of sentences in Doc.
- to_tokenized_text: Transform Doc into an ordered, nested list of token-texts per sentence.
- to_tagged_text: Transform Doc into an ordered, nested list of (token-text, part-of-speech tag) pairs per sentence.
- to_terms_list: Transform Doc into a sequence of ngrams and/or entities.
- to_bag_of_terms: Transform Doc into a bag-of-terms, mapping unique terms to their frequency of occurrence.
- to_bag_of_words: Transform Doc into a bag-of-words, mapping unique words to their frequency of occurrence.
- to_semantic_network: Transform Doc into a semantic network, with either words or sentences as nodes.
textacy.spacier.doc_extensions: Inspect, extend, and transform spaCy’s core data structure, spacy.tokens.Doc, either directly via functions that take a Doc as their first argument, or via custom attributes / methods on instantiated docs, prefixed by an underscore:
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("This is a short text.")
>>> print(get_preview(doc))
Doc(6 tokens: "This is a short text.")
>>> print(doc._.preview)
Doc(6 tokens: "This is a short text.")
- textacy.spacier.doc_extensions.set_doc_extensions()[source]¶
  Set textacy’s custom property and method doc extensions on the global spacy.tokens.Doc.

- textacy.spacier.doc_extensions.get_doc_extensions()[source]¶
  Get textacy’s custom property and method doc extensions that can be set on or removed from the global spacy.tokens.Doc.

- textacy.spacier.doc_extensions.remove_doc_extensions()[source]¶
  Remove textacy’s custom property and method doc extensions from the global spacy.tokens.Doc.
- textacy.spacier.doc_extensions.get_lang(doc: spacy.tokens.doc.Doc) → str[source]¶
  Get the standard, two-letter language code assigned to Doc and its associated spacy.vocab.Vocab.

- textacy.spacier.doc_extensions.get_preview(doc: spacy.tokens.doc.Doc) → str[source]¶
  Get a short preview of the Doc, including the number of tokens and an initial snippet.

- textacy.spacier.doc_extensions.get_tokens(doc: spacy.tokens.doc.Doc) → Iterable[spacy.tokens.token.Token][source]¶
  Yield the tokens in Doc, one at a time.

- textacy.spacier.doc_extensions.get_meta(doc: spacy.tokens.doc.Doc) → dict[source]¶
  Get custom metadata added to Doc.

- textacy.spacier.doc_extensions.set_meta(doc: spacy.tokens.doc.Doc, value: dict) → None[source]¶
  Add custom metadata to Doc.

- textacy.spacier.doc_extensions.get_n_tokens(doc: spacy.tokens.doc.Doc) → int[source]¶
  Get the number of tokens (including punctuation) in Doc.

- textacy.spacier.doc_extensions.get_n_sents(doc: spacy.tokens.doc.Doc) → int[source]¶
  Get the number of sentences in Doc.
- textacy.spacier.doc_extensions.to_tokenized_text(doc: spacy.tokens.doc.Doc) → List[List[str]][source]¶
  Transform Doc into an ordered, nested list of token-texts per sentence.

  Note: If doc hasn’t been segmented into sentences, the entire document is treated as a single sentence.
- textacy.spacier.doc_extensions.to_tagged_text(doc: spacy.tokens.doc.Doc) → List[List[Tuple[str, str]]][source]¶
  Transform Doc into an ordered, nested list of (token-text, part-of-speech tag) pairs per sentence.

  Note: If doc hasn’t been segmented into sentences, the entire document is treated as a single sentence.
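To make the two return shapes concrete, here is a minimal sketch using plain Python lists in place of an actual spaCy Doc (the sentence data is invented for illustration):

```python
# Illustrative only: one sentence represented as a list of
# (token-text, POS-tag) pairs, standing in for a sentence-segmented Doc.
sentences = [
    [("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("text", "NN"), (".", ".")],
]

# Shape returned by to_tokenized_text: token-texts nested per sentence.
tokenized = [[text for text, _ in sent] for sent in sentences]

# Shape returned by to_tagged_text: (token-text, tag) pairs nested per sentence.
tagged = [[(text, tag) for text, tag in sent] for sent in sentences]

print(tokenized)  # [['This', 'is', 'a', 'text', '.']]
```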
- textacy.spacier.doc_extensions.to_terms_list(doc: spacy.tokens.doc.Doc, *, ngrams: Optional[Union[int, Collection[int]]] = (1, 2, 3), entities: Optional[bool] = True, normalize: Optional[Union[str, Callable[[Union[spacy.tokens.span.Span, spacy.tokens.token.Token]], str]]] = 'lemma', as_strings: bool = False, **kwargs) → Union[Iterable[int], Iterable[str]][source]¶
  Transform Doc into a sequence of ngrams and/or entities (not necessarily in order of appearance), where each term appears in the sequence as many times as it appears in Doc.

  - Parameters
    - doc –
    - ngrams – ngrams to include in the terms list. If {1, 2, 3}, unigrams, bigrams, and trigrams are included; if 2, only bigrams are included; if None, ngrams aren’t included, except for those belonging to named entities.
    - entities – If True, entities are included in the terms list; if False, they are excluded from the list; if None, entities aren’t included or excluded at all.

      Note: When both entities and ngrams are non-null, exact duplicates (based on start and end indexes) are handled. If entities is True, duplicate entities are included while duplicate ngrams are discarded, to avoid double-counting; if entities is False, no entities are included, and duplicate ngrams are discarded as well.
    - normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in doc; if callable, must accept a Token or Span and return a str, e.g. get_normalized_text().
    - as_strings – If True, terms are returned as strings; if False, terms are returned as their unique integer ids.
    - kwargs –
      - filter_stops (bool)
      - filter_punct (bool)
      - filter_nums (bool)
      - include_pos (str or Set[str])
      - exclude_pos (str or Set[str])
      - min_freq (int)
      - include_types (str or Set[str])
      - exclude_types (str or Set[str])
      - drop_determiners (bool)

      See textacy.extract.words(), textacy.extract.ngrams(), and textacy.extract.entities() for details.
  - Yields
    The next term in the terms list, as either a unique integer id or a string.
  - Raises
    - ValueError – If neither entities nor ngrams is included, or if normalize has an invalid value.
    - TypeError – If entities has an invalid type.

  Note: Despite the name, this is a generator function; to get an actual list of terms, call list(to_terms_list(doc)).
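The duplicate-handling rule described in the note above can be sketched in plain Python. This is not textacy's implementation; the (start, end, text) tuples are hypothetical stand-ins for spaCy Spans, keyed by their token indexes:

```python
# Hypothetical extracted spans: entities and ngrams as (start, end, text).
entities = [(0, 2, "New York")]
ngrams = [(0, 2, "New York"), (3, 4, "visited")]

# With entities=True: keep entities, and drop any ngram whose (start, end)
# indexes exactly duplicate an entity span, to avoid double-counting.
entity_idxs = {(start, end) for start, end, _ in entities}
terms = entities + [ng for ng in ngrams if (ng[0], ng[1]) not in entity_idxs]

print([text for _, _, text in terms])  # ['New York', 'visited']
```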
- textacy.spacier.doc_extensions.to_bag_of_terms(doc: spacy.tokens.doc.Doc, *, ngrams: Optional[Union[int, Collection[int]]] = (1, 2, 3), entities: Optional[bool] = True, normalize: Optional[Union[str, Callable[[Union[spacy.tokens.span.Span, spacy.tokens.token.Token]], str]]] = 'lemma', weighting: str = 'count', as_strings: bool = False, **kwargs) → Dict[Union[int, str], Union[int, float]][source]¶
  Transform Doc into a bag-of-terms: the set of unique terms in Doc mapped to their frequency of occurrence, where “terms” includes ngrams and/or entities.

  - Parameters
    - doc –
    - ngrams – n (or ns) of the ngrams to include. (1, 2, 3) (default) includes unigrams (words), bigrams, and trigrams; 2 includes only bigrams; falsy (e.g. False) includes none.
    - entities – If True (default), include named entities; note: if ngrams are also included, any ngrams that exactly overlap with an entity are skipped to prevent double-counting.
    - normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in doc; if a callable, must accept a Token or Span and return a str, e.g. textacy.spacier.utils.get_normalized_text().
    - weighting ({"count", "freq", "binary"}) – Type of weight to assign to terms. If “count” (default), weights are the absolute number of occurrences (count) of a term in doc. If “binary”, all counts are set equal to 1. If “freq”, term counts are normalized by the total token count, giving their relative frequency of occurrence.
    - as_strings – If True, terms are returned as strings; if False (default), terms are returned as their unique integer ids.
    - kwargs –
      - filter_stops (bool)
      - filter_punct (bool)
      - filter_nums (bool)
      - include_pos (str or Set[str])
      - exclude_pos (str or Set[str])
      - min_freq (int)
      - include_types (str or Set[str])
      - exclude_types (str or Set[str])
      - drop_determiners (bool)

      See textacy.extract.words(), textacy.extract.ngrams(), and textacy.extract.entities() for details.
  - Returns
    Mapping of a unique term id or string (depending on the value of as_strings) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).

  See also: to_terms_list(), which is used under the hood.
- textacy.spacier.doc_extensions.to_bag_of_words(doc: spacy.tokens.doc.Doc, *, normalize: str = 'lemma', weighting: str = 'count', as_strings: bool = False, filter_stops: bool = True, filter_punct: bool = True, filter_nums: bool = False) → Dict[Union[int, str], Union[int, float]][source]¶
  Transform Doc into a bag-of-words: the set of unique words in Doc mapped to their absolute, relative, or binary frequency of occurrence.

  - Parameters
    - doc –
    - normalize – If “lemma”, lemmatize words before counting; if “lower”, lowercase words before counting; otherwise, words are counted using the form in which they appear in doc.
    - weighting ({"count", "freq", "binary"}) – Type of weight to assign to words. If “count” (default), weights are the absolute number of occurrences (count) of a word in doc. If “binary”, all counts are set equal to 1. If “freq”, word counts are normalized by the total token count, giving their relative frequency of occurrence. Note: The resulting set of frequencies won’t (necessarily) sum to 1.0, since punctuation and stop words are filtered out after counts are normalized.
    - as_strings (bool) – If True, words are returned as strings; if False (default), words are returned as their unique integer ids.
    - filter_stops (bool) – If True (default), stop words are removed after counting.
    - filter_punct (bool) – If True (default), punctuation tokens are removed after counting.
    - filter_nums (bool) – If True, tokens consisting of digits are removed after counting.
  - Returns
    Mapping of a unique word id or string (depending on the value of as_strings) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).
- textacy.spacier.doc_extensions.to_semantic_network(doc: spacy.tokens.doc.Doc, *, nodes: str = 'words', normalize: Optional[Union[str, Callable[[Union[spacy.tokens.span.Span, spacy.tokens.token.Token]], str]]] = 'lemma', edge_weighting: str = 'default', window_width: int = 10) → networkx.classes.graph.Graph[source]¶
  Transform Doc into a semantic network, where nodes are either “words” or “sents” and edges between nodes may be weighted in different ways.

  - Parameters
    - doc –
    - nodes ({"words", "sents"}) – Type of doc component to use as nodes in the semantic network.
    - normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in doc; if a callable, must accept a Token or Span (if nodes = “words” or “sents”, respectively) and return a str, e.g. get_normalized_text().
    - edge_weighting – Type of weighting to apply to edges between nodes; if nodes = “words”, options are {“cooc_freq”, “binary”}; if nodes = “sents”, options are {“cosine”, “jaccard”}; if “default”, “cooc_freq” or “cosine” is used automatically, respectively.
    - window_width – Size of the sliding window over terms that determines which are said to co-occur; only applicable if nodes = “words”.
  - Returns
    Graph whose nodes represent either terms or sentences in doc and whose edges represent the relationships between them.
  - Return type
    networkx.Graph
  - Raises
    ValueError – If nodes is neither “words” nor “sents”.
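For nodes = “words” with “cooc_freq” edge weighting, the underlying idea is window-based co-occurrence counting. A rough, simplified sketch (not textacy's code) over a plain word list:

```python
from collections import Counter
from itertools import combinations

def cooc_weights(words, window_width=10):
    """Count, for each pair of words, how many sliding windows contain both."""
    weights = Counter()
    for i in range(max(1, len(words) - window_width + 1)):
        window = set(words[i : i + window_width])
        for a, b in combinations(sorted(window), 2):
            weights[(a, b)] += 1
    return weights

w = cooc_weights(["cat", "sat", "mat", "cat"], window_width=3)
print(w[("cat", "sat")])  # 2  (both length-3 windows contain "cat" and "sat")
```

These pair weights would then become edge weights on a networkx.Graph whose nodes are the (normalized) words.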
Pipeline Components¶

textacy.spacier.components: Custom components to add to a spaCy language pipeline.

- class textacy.spacier.components.TextStatsComponent(attrs=None)[source]¶
  A custom component to be added to a spaCy language pipeline that computes one, some, or all text stats for a parsed doc and sets the values as custom attributes on a spacy.tokens.Doc.

  Add the component to a pipeline, after the parser (as well as any subsequent components that modify the tokens/sentences of the doc):

  >>> en = spacy.load('en')
  >>> text_stats_component = TextStatsComponent()
  >>> en.add_pipe(text_stats_component, after='parser')

  Process a text with the pipeline and access the custom attributes via spaCy’s underscore syntax:

  >>> doc = en(u"This is a test test someverylongword.")
  >>> doc._.n_words
  6
  >>> doc._.flesch_reading_ease
  73.84500000000001

  Specify which attributes of the textacy.text_stats.TextStats() to add to processed documents:

  >>> en = spacy.load('en')
  >>> text_stats_component = TextStatsComponent(attrs='n_words')
  >>> en.add_pipe(text_stats_component, last=True)
  >>> doc = en(u"This is a test test someverylongword.")
  >>> doc._.n_words
  6
  >>> doc._.flesch_reading_ease
  AttributeError: [E046] Can't retrieve unregistered extension attribute 'flesch_reading_ease'. Did you forget to call the `set_extension` method?

  - Parameters
    attrs (str or Iterable[str] or None) – If str, a single text stat to compute and set on a Doc. If Iterable[str], multiple text stats. If None, all text stats are computed and set as extensions.

  - name¶
    Default name of this component in a spaCy language pipeline, used to get and modify the component via various spacy.Language methods, e.g. https://spacy.io/api/language#get_pipe.

  See also: textacy.text_stats.TextStats
spaCy Utils¶

textacy.spacier.utils: Helper functions for working with / extending spaCy’s core functionality.

- textacy.spacier.utils.make_doc_from_text_chunks(text: str, lang: Union[str, spacy.language.Language], chunk_size: int = 100000) → spacy.tokens.doc.Doc[source]¶
  Make a single spaCy-processed document from 1 or more chunks of text. This is a workaround for processing very long texts, for which spaCy is unable to allocate enough RAM.

  Although this function’s performance is pretty good, it’s inherently less performant than just processing the entire text in one shot. Only use it if necessary!

  - Parameters
    - text – Text document to be chunked and processed by spaCy.
    - lang – A 2-letter language code (e.g. “en”), the name of a spaCy model for the desired language, or an already-instantiated spaCy language pipeline.
    - chunk_size – Number of characters comprising each text chunk (excluding the last chunk, which is probably smaller). For best performance, the value should be somewhere between 1e3 and 1e7, depending on how much RAM you have available.

      Note: Since chunking is done by character, chunk edges probably won’t respect natural language segmentation, which means that every chunk_size characters, spaCy will probably get tripped up and make weird parsing errors.
  - Returns
    A single processed document, initialized from components accumulated chunk by chunk.
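The character-based chunking this function relies on amounts to slicing the text into chunk_size pieces; a sketch of that step alone (textacy then runs the spaCy pipeline on each piece and stitches the results into one Doc):

```python
def chunk_text(text: str, chunk_size: int = 100_000) -> list:
    # Slice into fixed-size character chunks; the last chunk may be shorter.
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

print(chunk_text("abcdefghij", chunk_size=4))  # ['abcd', 'efgh', 'ij']
```

The note above applies directly here: the slice boundaries fall at arbitrary character positions, not sentence or word boundaries.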
- textacy.spacier.utils.merge_spans(spans: Iterable[spacy.tokens.span.Span], doc: spacy.tokens.doc.Doc) → None[source]¶
  Merge spans into single tokens in doc, in-place.

  - Parameters
    - spans (Iterable[spacy.tokens.Span]) –
    - doc (spacy.tokens.Doc) –
- textacy.spacier.utils.preserve_case(token: spacy.tokens.token.Token) → bool[source]¶
  Return True if token is a proper noun or acronym; otherwise, False.

  - Raises
    ValueError – If the parent document has not been POS-tagged.
- textacy.spacier.utils.get_normalized_text(span_or_token: Union[spacy.tokens.span.Span, spacy.tokens.token.Token]) → str[source]¶
  Get the text of a spaCy span or token, normalized depending on its characteristics. For proper nouns and acronyms, text is returned as-is; for everything else, text is lemmatized.
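The rule it applies can be sketched as follows; the acronym test here is a deliberately simplified stand-in for textacy's actual check, and the function operates on bare strings rather than spaCy objects:

```python
import re

def is_acronym_simplified(text: str) -> bool:
    # Simplified: two or more uppercase letters/digits, optionally dotted.
    return re.fullmatch(r"[A-Z0-9](?:\.?[A-Z0-9])+\.?", text) is not None

def normalized_text(text: str, lemma: str, pos: str) -> str:
    # Proper nouns and acronyms keep their surface form; everything else
    # falls back to the lemma.
    return text if pos == "PROPN" or is_acronym_simplified(text) else lemma

print(normalized_text("U.S.A.", "u.s.a.", "NOUN"))  # U.S.A.
print(normalized_text("running", "run", "VERB"))    # run
```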
- textacy.spacier.utils.get_main_verbs_of_sent(sent: spacy.tokens.span.Span) → List[spacy.tokens.token.Token][source]¶
  Return the main (non-auxiliary) verbs in a sentence.

- textacy.spacier.utils.get_subjects_of_verb(verb: spacy.tokens.token.Token) → List[spacy.tokens.token.Token][source]¶
  Return all subjects of a verb according to the dependency parse.

- textacy.spacier.utils.get_objects_of_verb(verb: spacy.tokens.token.Token) → List[spacy.tokens.token.Token][source]¶
  Return all objects of a verb according to the dependency parse, including open clausal complements.
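Together, these helpers support simple subject-verb-object extraction over a dependency parse. A toy sketch with stand-in tokens; the dependency labels used here are assumptions about typical spaCy output, not textacy's exact label sets:

```python
from collections import namedtuple

# Stand-in for a spaCy Token: only the attributes this sketch needs.
Tok = namedtuple("Tok", ["text", "dep_"])

SUBJ_DEPS = {"nsubj", "nsubjpass"}     # assumed subject labels
OBJ_DEPS = {"dobj", "dative", "attr"}  # assumed object labels

def subjects_of_verb(children):
    # Children of the verb with a subject dependency label.
    return [t for t in children if t.dep_ in SUBJ_DEPS]

def objects_of_verb(children):
    # Children of the verb with an object dependency label.
    return [t for t in children if t.dep_ in OBJ_DEPS]

# Children of the verb "chased" in "The cat chased the mouse."
children = [Tok("cat", "nsubj"), Tok("mouse", "dobj")]
print([t.text for t in subjects_of_verb(children)])  # ['cat']
print([t.text for t in objects_of_verb(children)])   # ['mouse']
```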