Lang, Doc, Corpus¶
textacy.spacier.core: Convenient entry point for loading spaCy language pipelines and making spaCy docs.
- textacy.spacier.core.load_spacy_lang(name: str | pathlib.Path, **kwargs) → Language[source]¶

Load a spaCy Language — a shared vocabulary and language-specific data for tokenizing text, and (if available) model data and a processing pipeline containing a sequence of components for annotating a document — and cache the result for quick reloading as needed.

Note that as of spaCy v3, which no longer allows pipeline aliases, this function is just a convenient access point to the underlying spacy.load().

>>> en_nlp = textacy.load_spacy_lang("en_core_web_sm")
>>> en_nlp = textacy.load_spacy_lang("en_core_web_sm", disable=("parser",))
>>> textacy.load_spacy_lang("ar")
...
OSError: [E050] Can't find model 'ar'. It doesn't seem to be a Python package or a valid path to a data directory.
- Parameters
  - name – Name or path to the spaCy language pipeline to load.
  - **kwargs – Passed on to the underlying spacy.load().

    Note
    Although spaCy's API specifies some kwargs as List[str], here we require Tuple[str, ...] equivalents. Language pipelines are stored in an LRU cache with unique identifiers generated from the hash of the function name and args — and lists aren't hashable.
- Returns
  Loaded spaCy Language.
- Raises
  OSError – If name can't be found as an installed package or a valid path to a data directory.
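Because pipelines are cached by hashing the call's args, list-valued kwargs must be converted to tuples before calling. A minimal sketch of the pattern (the disabled components are illustrative, and we assume the cache returns the identical pipeline object for identical args):

>>> components = ["parser", "ner"]  # lists aren't hashable...
>>> en_nlp = textacy.load_spacy_lang("en_core_web_sm", disable=tuple(components))
>>> en_nlp_again = textacy.load_spacy_lang("en_core_web_sm", disable=("parser", "ner"))
>>> en_nlp is en_nlp_again  # same cached pipeline, no re-loading from disk
True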
- textacy.spacier.core.make_spacy_doc(data: Union[str, textacy.types.Record, spacy.tokens.doc.Doc], lang: Union[str, pathlib.Path, spacy.language.Language, Callable[[str], str], Callable[[str], pathlib.Path], Callable[[str], spacy.language.Language]], *, chunk_size: Optional[int] = None) → spacy.tokens.doc.Doc[source]¶

Make a spacy.tokens.Doc from valid inputs, and automatically load/validate spacy.language.Language pipelines to process data.

Make a Doc from text:

>>> text = "To be, or not to be, that is the question."
>>> doc = make_spacy_doc(text, "en_core_web_sm")
>>> doc._.preview
'Doc(13 tokens: "To be, or not to be, that is the question.")'
Make a Doc from a (text, metadata) pair, aka a "record":

>>> record = (text, {"author": "Shakespeare, William"})
>>> doc = make_spacy_doc(record, "en_core_web_sm")
>>> doc._.preview
'Doc(13 tokens: "To be, or not to be, that is the question.")'
>>> doc._.meta
{'author': 'Shakespeare, William'}
Specify the language pipeline used to process the text in a few different ways:
>>> make_spacy_doc(text, lang="en_core_web_sm")
>>> make_spacy_doc(text, lang=textacy.load_spacy_lang("en_core_web_sm"))
>>> make_spacy_doc(text, lang=lambda txt: "en_core_web_sm")
Ensure that an already-processed Doc is compatible with lang:

>>> spacy_lang = textacy.load_spacy_lang("en_core_web_sm")
>>> doc = spacy_lang(text)
>>> make_spacy_doc(doc, lang="en_core_web_sm")
>>> make_spacy_doc(doc, lang="es_core_news_sm")
...
ValueError: `spacy.Vocab` used to process document must be the same as that used by the `lang` pipeline ('es_core_news_sm')
- Parameters
  - data – Make a spacy.tokens.Doc from a text or (text, metadata) pair. If already a Doc, ensure that it's compatible with lang to avoid surprises downstream, and return it as-is.
  - lang – Language with which spaCy processes (or processed) data, represented as the full name of a spaCy language pipeline, the path on disk to it, an already instantiated pipeline, or a callable function that takes the text component of data and outputs one of the above representations.
  - chunk_size – Size of chunks in number of characters into which text will be split before processing each via spaCy and concatenating the results into a single Doc.

    Note
    This is intended as a workaround for processing very long texts, for which spaCy is unable to allocate enough RAM. For best performance, chunk size should be somewhere between 1e3 and 1e7 characters, depending on how much RAM you have available.
    Since chunking is done by character, chunks' boundaries likely won't respect natural language segmentation, and as a result spaCy's models may make mistakes on sentences/words that cross them.
- Returns
  Processed spaCy Doc.
- Raises
  ValueError – If an already-processed data Doc's vocab is not the same as that used by the lang pipeline (as in the example above).
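For very long texts, the chunk_size workaround described above can be applied as follows; a sketch, assuming ~1e5-character chunks fit comfortably in RAM (the input text is illustrative):

>>> very_long_text = "To be, or not to be, that is the question. " * 50000
>>> doc = make_spacy_doc(very_long_text, lang="en_core_web_sm", chunk_size=100000)
>>> n_tokens = len(doc)  # one Doc, concatenated from ~22 per-chunk results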
textacy.corpus: Class for working with a collection of spaCy Docs. Includes functionality for easily adding, getting, and removing documents; saving to / loading their data from disk; and tracking basic corpus statistics.
- class textacy.corpus.Corpus(lang: Union[str, pathlib.Path, spacy.language.Language], data: Optional[Union[str, textacy.types.Record, spacy.tokens.doc.Doc, Iterable[str], Iterable[textacy.types.Record], Iterable[spacy.tokens.doc.Doc]]] = None)[source]¶

An ordered collection of spacy.tokens.Docs, all of the same language and sharing the same spacy.language.Language processing pipeline and vocabulary, with data held in-memory.

Initialize from a Language name or instance and (optionally) one or a stream of texts or (text, metadata) pairs:

>>> ds = textacy.datasets.CapitolWords()
>>> records = ds.records(limit=50)
>>> corpus = textacy.Corpus("en_core_web_sm", data=records)
>>> print(corpus)
Corpus(50 docs, 32175 tokens)
Add or remove documents, with automatic updating of corpus statistics:
>>> texts = ds.texts(congress=114, limit=25)
>>> corpus.add(texts)
>>> corpus.add("If Burton were a member of Congress, here's what he'd say.")
>>> print(corpus)
Corpus(76 docs, 55906 tokens)
>>> corpus.remove(lambda doc: doc._.meta.get("speaker_name") == "Rick Santorum")
>>> print(corpus)
Corpus(61 docs, 48567 tokens)
Get subsets of documents matching your particular use case:
>>> match_func = lambda doc: doc._.meta.get("speaker_name") == "Bernie Sanders"
>>> for doc in corpus.get(match_func, limit=3):
...     print(doc._.preview)
Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")
Doc(336 tokens: "Mr. Speaker, I thank the gentleman for yielding...")
Doc(177 tokens: "Mr. Speaker, if we want to understand why in th...")
Get or remove documents by indexing, too:
>>> corpus[0]._.preview
'Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")'
>>> [doc._.preview for doc in corpus[:3]]
['Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")',
 'Doc(219 tokens: "Mr. Speaker, a relationship, to work and surviv...")',
 'Doc(336 tokens: "Mr. Speaker, I thank the gentleman for yielding...")']
>>> del corpus[:5]
>>> print(corpus)
Corpus(56 docs, 41573 tokens)
Compute basic corpus statistics:
>>> corpus.n_docs, corpus.n_sents, corpus.n_tokens
(56, 1771, 41573)
>>> word_counts = corpus.word_counts(as_strings=True)
>>> sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:5]
[('-PRON-', 2553), ('people', 215), ('year', 148), ('Mr.', 139), ('$', 137)]
>>> word_doc_counts = corpus.word_doc_counts(weighting="freq", as_strings=True)
>>> sorted(word_doc_counts.items(), key=lambda x: x[1], reverse=True)[:5]
[('-PRON-', 0.9821428571428571), ('Mr.', 0.7678571428571429), ('President', 0.5), ('people', 0.48214285714285715), ('need', 0.44642857142857145)]
Save corpus data to and load from disk:
>>> corpus.save("./cw_sample.bin.gz")
>>> corpus = textacy.Corpus.load("en_core_web_sm", "./cw_sample.bin.gz")
>>> print(corpus)
Corpus(56 docs, 41573 tokens)
- Parameters
  - lang – Language with which spaCy processes (or processed) all documents added to the corpus, whether as data now or later. Pass the name of a spaCy language pipeline (e.g. "en_core_web_sm"), or an already-instantiated spacy.language.Language object. A given / detected language string is then used to instantiate a corresponding Language with all default components enabled.
  - data – One or a stream of texts, records, or spacy.tokens.Docs to be added to the corpus.
- spacy_lang¶
  - Type
    spacy.language.Language

- docs¶
  - Type
    List[spacy.tokens.doc.Doc]
- add(data: Union[str, textacy.types.Record, spacy.tokens.doc.Doc, Iterable[str], Iterable[textacy.types.Record], Iterable[spacy.tokens.doc.Doc]], batch_size: int = 1000, n_process: int = 1)[source]¶

Add one or a stream of texts, records, or spacy.tokens.Docs to the corpus, ensuring that all processing is or has already been done by the Corpus.spacy_lang pipeline.

- Parameters
  - data – One or a stream of texts, records, or spacy.tokens.Docs to add to the corpus.
  - batch_size – Number of texts to buffer when processing with spaCy.
  - n_process – Number of parallel processors to run when processing. If -1, this is set to multiprocessing.cpu_count().

    Note
    This feature only applies when data is a sequence of texts or records.
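A usage sketch mixing input types (the texts and record are illustrative; n_process=-1 uses all available cores):

>>> corpus = textacy.Corpus("en_core_web_sm")
>>> corpus.add("A lone text, processed on its own.")
>>> corpus.add(("A text with metadata.", {"source": "example"}))
>>> texts = (f"Document number {i}." for i in range(10000))
>>> corpus.add(texts, batch_size=2000, n_process=-1)
>>> corpus.n_docs
10002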
- add_text(text: str) → None[source]¶

Add one text to the corpus, processing it into a spacy.tokens.Doc using the Corpus.spacy_lang pipeline.

- Parameters
  text (str) – Text to process and add to the corpus.
- add_texts(texts: Iterable[str], batch_size: int = 1000, n_process: int = 1) → None[source]¶

Add a stream of texts to the corpus, efficiently processing them into spacy.tokens.Docs using the Corpus.spacy_lang pipeline.

- Parameters
  - texts – Sequence of texts to process and add to the corpus.
  - batch_size – Number of texts to buffer when processing with spaCy.
  - n_process – Number of parallel processors to run when processing. If -1, this is set to multiprocessing.cpu_count().

    Note
    This feature is only available in spaCy 2.2.2+.
- add_record(record: textacy.types.Record) → None[source]¶

Add one record to the corpus, processing it into a spacy.tokens.Doc using the Corpus.spacy_lang pipeline.

- Parameters
  record – (text, metadata) pair to process and add to the corpus.
- add_records(records: Iterable[textacy.types.Record], batch_size: int = 1000, n_process: int = 1) → None[source]¶

Add a stream of records to the corpus, efficiently processing them into spacy.tokens.Docs using the Corpus.spacy_lang pipeline.

- Parameters
  - records – Sequence of records to process and add to the corpus.
  - batch_size – Number of texts to buffer when processing with spaCy.
  - n_process – Number of parallel processors to run when processing. If -1, this is set to multiprocessing.cpu_count().

    Note
    This feature is only available in spaCy 2.2.2+.
- add_doc(doc: spacy.tokens.doc.Doc) → None[source]¶

Add one spacy.tokens.Doc to the corpus, provided it was processed using the Corpus.spacy_lang pipeline.

- Parameters
  doc – Document to add to the corpus.
- add_docs(docs: Iterable[spacy.tokens.doc.Doc]) → None[source]¶

Add a stream of spacy.tokens.Docs to the corpus, provided they were processed using the Corpus.spacy_lang pipeline.

- Parameters
  docs – Sequence of documents to add to the corpus.
- get(match_func: Callable[[spacy.tokens.doc.Doc], bool], limit: Optional[int] = None) → Iterator[spacy.tokens.doc.Doc][source]¶

Get all (or N <= limit) docs in Corpus for which match_func(doc) is True.

- Parameters
  - match_func – Function that takes a spacy.tokens.Doc as input and returns a boolean value. For example, Corpus.get(lambda x: len(x) >= 100) gets all docs with at least 100 tokens, and Corpus.get(lambda doc: doc._.meta["author"] == "Burton DeWilde") gets all docs whose author was given as "Burton DeWilde".
  - limit – Maximum number of matched docs to return.
- Yields
  spacy.tokens.Doc – Next document passing match_func.

Tip
To get doc(s) by index, treat Corpus as a list and use Python's usual indexing and slicing: Corpus[0] gets the first document in the corpus; Corpus[:5] gets the first 5; etc.
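A sketch of get with a metadata-based match_func (the "author" key and records are illustrative):

>>> corpus = textacy.Corpus(
...     "en_core_web_sm",
...     data=[("Text one.", {"author": "Burton DeWilde"}),
...           ("Text two.", {"author": "Someone Else"})],
... )
>>> matches = list(corpus.get(lambda doc: doc._.meta.get("author") == "Burton DeWilde"))
>>> len(matches)
1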
- remove(match_func: Callable[[spacy.tokens.doc.Doc], bool], limit: Optional[int] = None) → None[source]¶

Remove all (or N <= limit) docs in Corpus for which match_func(doc) is True. Corpus doc/sent/token counts are adjusted accordingly.

- Parameters
  - match_func – Function that takes a spacy.tokens.Doc and returns a boolean value. For example, Corpus.remove(lambda x: len(x) >= 100) removes docs with at least 100 tokens, and Corpus.remove(lambda doc: doc._.meta["author"] == "Burton DeWilde") removes docs whose author was given as "Burton DeWilde".
  - limit – Maximum number of matched docs to remove.

Tip
To remove doc(s) by index, treat Corpus as a list and use Python's usual indexing and slicing: del Corpus[0] removes the first document in the corpus; del Corpus[:5] removes the first 5; etc.
- property vectors¶
Constituent docs' word vectors stacked in a 2d array.

- property vector_norms¶
Constituent docs' L2-normalized word vectors stacked in a 2d array.
- word_counts(*, by: str = 'lemma', weighting: str = 'count', **kwargs) → Dict[int, int | float] | Dict[str, int | float][source]¶

Map the set of unique words in Corpus to their counts as absolute, relative, or binary frequencies of occurrence, similar to Doc._.to_bag_of_words() but aggregated over all docs.

- Parameters
  - by – Attribute by which spaCy Tokens are grouped before counting, as given by getattr(token, by). If "lemma", tokens are grouped by their base form w/o inflections; if "lower", by the lowercase form of the token text; if "norm", by the normalized form of the token text; if "orth", by the token text exactly as it appears in documents. To output keys as strings, append an underscore to any of these options; for example, "lemma_" groups tokens by their lemmas as strings.
  - weighting – Type of weighting to assign to unique words given by by. If "count", weights are the absolute number of occurrences (i.e. counts); if "freq", weights are counts normalized by the total token count, giving their relative frequency of occurrence.
  - **kwargs – Passed directly on to textacy.extract.words():
    - filter_stops: If True, stop words are removed before counting.
    - filter_punct: If True, punctuation tokens are removed before counting.
    - filter_nums: If True, number-like tokens are removed before counting.
- Returns
  Mapping of a unique word id or string (depending on the value of by) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).
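A sketch with string keys and filters applied, continuing from the corpus built in the class examples above (the filter flags shown are optional):

>>> word_counts = corpus.word_counts(by="lemma_", weighting="freq", filter_stops=True, filter_punct=True)
>>> top3 = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)[:3]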
- word_doc_counts(*, by: str = 'lemma', weighting: str = 'count', smooth_idf: bool = True, **kwargs) → Dict[int, int | float] | Dict[str, int | float][source]¶

Map the set of unique words in Corpus to their document counts as absolute, relative, or inverse frequencies of occurrence.

- Parameters
  - by – Attribute by which spaCy Tokens are grouped before counting, as given by getattr(token, by). If "lemma", tokens are grouped by their base form w/o inflections; if "lower", by the lowercase form of the token text; if "norm", by the normalized form of the token text; if "orth", by the token text exactly as it appears in documents. To output keys as strings, append an underscore to any of these options; for example, "lemma_" groups tokens by their lemmas as strings.
  - weighting – Type of weighting to assign to unique words given by by. If "count", weights are the absolute number of documents in which each word occurs (i.e. doc counts); if "freq", weights are doc counts normalized by the total document count, giving their relative frequency of occurrence; if "idf", weights are the log of the inverse relative frequencies, i.e. log(n_docs / word_doc_count) or, if smooth_idf is True, log(1 + (n_docs / word_doc_count)).
  - smooth_idf – If True, add 1 to all word doc counts when calculating "idf" weighting, equivalent to adding a single document to the corpus containing every unique word.
- Returns
  Mapping of a unique word id or string (depending on the value of by) to the number of documents in which it appears, weighted as absolute, relative, or inverse frequency of occurrence (depending on the value of weighting).

See also
textacy.vsm.get_doc_freqs()
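As a worked example of the "idf" weighting: in a 56-doc corpus, a word appearing in 28 documents gets log(56 / 28) = log 2 ≈ 0.693 without smoothing, or log(1 + 56/28) = log 3 ≈ 1.099 with smooth_idf=True:

>>> import math
>>> n_docs, word_doc_count = 56, 28  # illustrative values
>>> round(math.log(n_docs / word_doc_count), 3)
0.693
>>> round(math.log(1 + n_docs / word_doc_count), 3)
1.099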
- agg_metadata(name: str, agg_func: Callable[[Iterable[Any]], Any], default: Optional[Any] = None) → Any[source]¶

Aggregate values for a particular metadata field over all documents in Corpus.

- Parameters
  - name – Name of metadata field (key) in Doc._.meta.
  - agg_func – Callable that accepts an iterable of field values and outputs a single, aggregated result.
  - default – Default field value to use if name is not found in a given document's metadata.
- Returns
  Aggregated value for metadata field.
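A sketch, assuming docs carry a numeric "year" metadata field (the key and defaults are illustrative):

>>> import statistics
>>> earliest_year = corpus.agg_metadata("year", min, default=9999)
>>> mean_year = corpus.agg_metadata("year", statistics.mean, default=0)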
- save(filepath: types.PathLike, attrs: Optional[str | Iterable[str]] = 'auto', store_user_data: bool = True)[source]¶

Save Corpus to disk as binary data.

- Parameters
  - filepath – Full path to file on disk where Corpus docs data will be saved as a binary file.
  - attrs – List of token attributes to serialize; if "auto", an appropriate list is inferred from annotations found on the first Doc; if None, spaCy's default values are used (https://spacy.io/api/docbin#init).
  - store_user_data – If True, store user data and values of custom extension attributes along with core spaCy attributes.

See also
- textacy.io.write_spacy_docs()
- spacy.tokens.DocBin
- classmethod load(lang: Union[str, pathlib.Path, spacy.language.Language], filepath: Union[str, pathlib.Path]) → Corpus[source]¶

Load previously saved Corpus binary data, reproduce the original spacy.tokens.Docs' tokens and annotations, and instantiate a new Corpus from them.

- Parameters
  - lang – Language with which spaCy processed the original documents, passed as the name of a spaCy language pipeline, the path on disk to it, or an already-instantiated spacy.language.Language object.
  - filepath – Full path to file on disk where Corpus data was previously saved as a binary file.
- Returns
  Initialized corpus.

See also
- textacy.io.read_spacy_docs()
- spacy.tokens.DocBin
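A round-trip sketch tying save and load together (the file path is illustrative):

>>> corpus.save("./cw_sample.bin.gz")
>>> restored = textacy.Corpus.load("en_core_web_sm", "./cw_sample.bin.gz")
>>> restored.n_docs == corpus.n_docs
True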
Doc Extensions¶

| get_preview       | Get a short preview of the Doc, including the number of tokens and an initial snippet. |
| get_meta          | Get custom metadata added to Doc. |
| set_meta          | Add custom metadata to Doc. |
| to_tokenized_text | Transform Doc into an ordered, nested list of token-texts for each sentence. |
| to_bag_of_words   | Transform a Doc or Span into a bag-of-words. |
| to_bag_of_terms   | Transform a Doc or Span into a bag-of-terms. |
textacy.extensions: Inspect, extend, and transform spaCy's core Doc data structure, either directly via functions that take a Doc as their first arg, or as custom attributes / methods on instantiated docs prefixed with an underscore:
>>> doc = textacy.make_spacy_doc("This is a short text.", "en_core_web_sm")
>>> print(get_preview(doc))
Doc(6 tokens: "This is a short text.")
>>> print(doc._.preview)
Doc(6 tokens: "This is a short text.")
- textacy.extensions.get_preview(doc: spacy.tokens.doc.Doc) → str[source]¶
Get a short preview of the Doc, including the number of tokens and an initial snippet.
- textacy.extensions.get_meta(doc: spacy.tokens.doc.Doc) → dict[source]¶
Get custom metadata added to Doc.
- textacy.extensions.set_meta(doc: spacy.tokens.doc.Doc, value: dict) → None[source]¶
Add custom metadata to Doc.
- textacy.extensions.to_tokenized_text(doc: spacy.tokens.doc.Doc) → List[List[str]][source]¶

Transform doc into an ordered, nested list of token-texts for each sentence.

- Parameters
  doc – Document from which to extract token-texts.
- Returns
  A list of tokens' texts for each sentence in doc.

Note
If doc hasn't been segmented into sentences, the entire document is treated as a single sentence.
- textacy.extensions.to_bag_of_words(doclike: types.DocLike, *, by: str = 'lemma_', weighting: str = 'count', **kwargs) → Dict[int, int | float] | Dict[str, int | float][source]¶

Transform a Doc or Span into a bag-of-words: the set of unique words therein mapped to their absolute, relative, or binary frequencies of occurrence.

- Parameters
  - doclike – Doc or Span from which to count words.
  - by – Attribute by which spaCy Tokens are grouped before counting, as given by getattr(token, by). If "lemma", tokens are grouped by their base form w/o inflectional suffixes; if "lower", by the lowercase form of the token text; if "norm", by the normalized form of the token text; if "orth", by the token text exactly as it appears in doc. To output keys as strings, simply append an underscore to any of these; for example, "lemma_" creates a bag whose keys are token lemmas as strings.
  - weighting – Type of weighting to assign to unique words given by by. If "count", weights are the absolute number of occurrences (i.e. counts); if "freq", weights are counts normalized by the total token count, giving their relative frequency of occurrence; if "binary", weights are set equal to 1.
  - **kwargs – Passed directly on to textacy.extract.words():
    - filter_stops: If True, stop words are removed before counting.
    - filter_punct: If True, punctuation tokens are removed before counting.
    - filter_nums: If True, number-like tokens are removed before counting.
- Returns
  Mapping of a unique word id or string (depending on the value of by) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).
Note
For “freq” weighting, the resulting set of frequencies won’t (necessarily) sum to 1.0, since all tokens are used when normalizing counts but some (punctuation, stop words, etc.) may be filtered out of the bag afterwards.
See also
textacy.extract.words()
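A sketch with string keys and relative frequencies; per the note above, filtered-out punctuation still counts toward the normalizing total:

>>> doc = textacy.make_spacy_doc("To be, or not to be, that is the question.", "en_core_web_sm")
>>> bow = doc._.to_bag_of_words(by="lemma_", weighting="freq", filter_punct=True)
>>> round(bow["question"], 3)  # 1 occurrence / 13 total tokens
0.077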
- textacy.extensions.to_bag_of_terms(doclike: types.DocLike, *, by: str = 'lemma_', weighting: str = 'count', ngs: Optional[int | Collection[int] | types.DocLikeToSpans] = None, ents: Optional[bool | types.DocLikeToSpans] = None, ncs: Optional[bool | types.DocLikeToSpans] = None, dedupe: bool = True) → Dict[str, int] | Dict[str, float][source]¶

Transform a Doc or Span into a bag-of-terms: the set of unique terms therein mapped to their absolute, relative, or binary frequencies of occurrence, where "terms" may be a combination of n-grams, entities, and/or noun chunks.

- Parameters
  - doclike – Doc or Span from which to extract terms.
  - by – Attribute by which spaCy Spans are grouped before counting, as given by getattr(token, by). If "lemma", tokens are counted by their base form w/o inflectional suffixes; if "lower", by the lowercase form of the token text; if "orth", by the token text exactly as it appears in doc. To output keys as strings, simply append an underscore to any of these; for example, "lemma_" creates a bag whose keys are token lemmas as strings.
  - weighting – Type of weighting to assign to unique terms given by by. If "count", weights are the absolute number of occurrences (i.e. counts); if "freq", weights are counts normalized by the total token count, giving their relative frequency of occurrence; if "binary", weights are set equal to 1.
  - ngs – N-gram terms to be extracted. If one or multiple ints, textacy.extract.ngrams(doclike, n=ngs) is used to extract terms; if a callable, ngs(doclike) is used to extract terms; if None, no n-gram terms are extracted.
  - ents – Entity terms to be extracted. If True, textacy.extract.entities(doclike) is used to extract terms; if a callable, ents(doclike) is used to extract terms; if None, no entity terms are extracted.
  - ncs – Noun chunk terms to be extracted. If True, textacy.extract.noun_chunks(doclike) is used to extract terms; if a callable, ncs(doclike) is used to extract terms; if None, no noun chunk terms are extracted.
  - dedupe – If True, deduplicate terms whose spans are extracted by multiple types (e.g. a span that is both an n-gram and an entity), as identified by identical (start, stop) indexes in doclike; otherwise, don't.
- Returns
  Mapping of a unique term id or string (depending on the value of by) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).
See also
textacy.extract.terms()
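A sketch combining bigram and entity terms with string keys (the text is illustrative, and the extracted terms depend on the pipeline's annotations):

>>> doc = textacy.make_spacy_doc(
...     "Burton DeWilde wrote textacy, a Python library built on spaCy.",
...     "en_core_web_sm",
... )
>>> bot = doc._.to_bag_of_terms(by="lemma_", weighting="count", ngs=2, ents=True)
>>> # keys mix bigram lemmas and (deduped) entity lemmas, e.g. "Burton DeWilde"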
- textacy.extensions.get_doc_extensions() → Dict[str, Dict[str, Any]][source]¶
Get textacy's custom property and method doc extensions that can be set on or removed from the global spacy.tokens.Doc.