Lang, Doc, Corpus

textacy.spacier.core: Convenient entry point for loading spaCy language pipelines and making spaCy docs.

textacy.spacier.core.load_spacy_lang(name: Union[str, pathlib.Path], disable: Optional[Tuple[str, ...]] = None, allow_blank: bool = False) → spacy.language.Language[source]

Load a spaCy Language: a shared vocabulary and language-specific data for tokenizing text, and (if available) model data and a processing pipeline containing a sequence of components for annotating a document. An LRU cache saves languages in memory for quick reloading.

>>> en_nlp = textacy.load_spacy_lang("en")
>>> en_nlp = textacy.load_spacy_lang("en_core_web_sm")
>>> en_nlp = textacy.load_spacy_lang("en", disable=("parser",))
>>> textacy.load_spacy_lang("ar")
...
OSError: [E050] Can't find model 'ar'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
>>> textacy.load_spacy_lang("ar", allow_blank=True)
<spacy.lang.ar.Arabic at 0x126418550>
Parameters
  • name – spaCy language to load. Could be a shortcut link, a full package name, a path to a model directory, or a 2-letter ISO language code for which spaCy has language data.

  • disable

    Names of pipeline components to disable, if any.

    Note

    Although spaCy’s API specifies this argument as a list, here we require a tuple: pipelines are stored in the LRU cache under unique identifiers generated from the hash of the function name and args, and lists aren’t hashable. (See the example after this parameter list.)

  • allow_blank – If True, allow loading of blank spaCy Language objects; if False, raise an OSError if a full processing pipeline isn’t available. Note that spaCy Doc objects produced by blank languages are missing key functionality, e.g. POS tags, entities, sentences.
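
For example, a minimal sketch of the tuple requirement for disable (assuming the en_core_web_sm package is installed):

>>> nlp = textacy.load_spacy_lang("en_core_web_sm", disable=("tagger", "parser"))
>>> # passing a list instead, e.g. disable=["tagger", "parser"], raises a TypeError,
>>> # since a list can't be hashed into a cache key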

Returns

A loaded spaCy Language.

Raises

OSError – If allow_blank is False and a full processing pipeline isn’t available for the given name.

textacy.spacier.core.make_spacy_doc(data: Union[str, Tuple[str, dict], spacy.tokens.doc.Doc], lang: Union[str, Callable[[str], str], spacy.language.Language] = <bound method LangIdentifier.identify_lang of <textacy.lang_utils.LangIdentifier object>>) → spacy.tokens.doc.Doc[source]

Make a spacy.tokens.Doc from valid inputs, and automatically load/validate spacy.language.Language pipelines to process data.

Make a Doc from text:

>>> text = "To be, or not to be, that is the question."
>>> doc = make_spacy_doc(text)
>>> doc._.preview
'Doc(13 tokens: "To be, or not to be, that is the question.")'

Make a Doc from a (text, metadata) pair, aka a “record”:

>>> record = (text, {"author": "Shakespeare, William"})
>>> doc = make_spacy_doc(record)
>>> doc._.preview
'Doc(13 tokens: "To be, or not to be, that is the question.")'
>>> doc._.meta
{'author': 'Shakespeare, William'}

Specify the language / Language pipeline used to process the text — or don’t:

>>> make_spacy_doc(text)
>>> make_spacy_doc(text, lang="en")
>>> make_spacy_doc(text, lang="en_core_web_sm")
>>> make_spacy_doc(text, lang=textacy.load_spacy_lang("en"))
>>> make_spacy_doc(text, lang=textacy.lang_utils.identify_lang)

Ensure that an already-processed Doc is compatible with lang:

>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang(text)
>>> make_spacy_doc(doc, lang="en")
>>> make_spacy_doc(doc, lang="es")
...
ValueError: lang of spacy pipeline used to process document ('en') must be the same as `lang` ('es')
Parameters
  • data – Make a spacy.tokens.Doc from a text or (text, metadata) pair. If already a Doc, ensure that it’s compatible with lang to avoid surprises downstream, and return it as-is.

  • lang

    Language with which spaCy processes (or processed) data.

    If known, pass a standard 2-letter language code (e.g. “en”), or the name of a spacy language pipeline (e.g. “en_core_web_md”), or an already-instantiated spacy.language.Language object. If not known, pass a function that takes unicode text as input and outputs a standard 2-letter language code (see the sketch after this parameter list).

    A given / detected language string is then used to instantiate a corresponding Language with all default components enabled.
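
For instance, a minimal sketch of a custom language-identifier callable (guess_lang here is a hypothetical stand-in; a real identifier would inspect the text):

>>> def guess_lang(text):
...     # hypothetical identifier: always guess English
...     return "en"
>>> doc = textacy.make_spacy_doc(text, lang=guess_lang)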

Returns

Processed spaCy Doc.

Raises

ValueError – If data is an already-processed Doc whose language pipeline doesn’t match lang.

textacy.corpus: Class for working with a collection of spaCy Doc objects. Includes functionality for easily adding, getting, and removing documents; saving their data to / loading it from disk; and tracking basic corpus statistics.

class textacy.corpus.Corpus(lang: Union[str, spacy.language.Language], data: Optional[Union[str, spacy.tokens.doc.Doc, Tuple[str, dict], Iterable[str], Iterable[spacy.tokens.doc.Doc], Iterable[Tuple[str, dict]]]] = None)[source]

An ordered collection of spacy.tokens.Doc, all of the same language and sharing the same spacy.language.Language processing pipeline and vocabulary, with data held in-memory.

Initialize from a language / Language and (optionally) one or a stream of texts or (text, metadata) pairs:

>>> ds = textacy.datasets.CapitolWords()
>>> records = ds.records(limit=50)
>>> corpus = textacy.Corpus("en", data=records)
>>> print(corpus)
Corpus(50 docs, 32175 tokens)

Add or remove documents, with automatic updating of corpus statistics:

>>> texts = ds.texts(congress=114, limit=25)
>>> corpus.add(texts)
>>> corpus.add("If Burton were a member of Congress, here's what he'd say.")
>>> print(corpus)
Corpus(76 docs, 55906 tokens)
>>> corpus.remove(lambda doc: doc._.meta.get("speaker_name") == "Rick Santorum")
>>> print(corpus)
Corpus(61 docs, 48567 tokens)

Get subsets of documents matching your particular use case:

>>> match_func = lambda doc: doc._.meta.get("speaker_name") == "Bernie Sanders"
>>> for doc in corpus.get(match_func, limit=3):
...     print(doc._.preview)
Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")
Doc(336 tokens: "Mr. Speaker, I thank the gentleman for yielding...")
Doc(177 tokens: "Mr. Speaker, if we want to understand why in th...")

Get or remove documents by indexing, too:

>>> corpus[0]._.preview
'Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")'
>>> [doc._.preview for doc in corpus[:3]]
['Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")',
 'Doc(219 tokens: "Mr. Speaker, a relationship, to work and surviv...")',
 'Doc(336 tokens: "Mr. Speaker, I thank the gentleman for yielding...")']
>>> del corpus[:5]
>>> print(corpus)
Corpus(56 docs, 41573 tokens)

Compute basic corpus statistics:

>>> corpus.n_docs, corpus.n_sents, corpus.n_tokens
(56, 1771, 41573)
>>> word_counts = corpus.word_counts(as_strings=True)
>>> sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:5]
[('-PRON-', 2553), ('people', 215), ('year', 148), ('Mr.', 139), ('$', 137)]
>>> word_doc_counts = corpus.word_doc_counts(weighting="freq", as_strings=True)
>>> sorted(word_doc_counts.items(), key=lambda x: x[1], reverse=True)[:5]
[('-PRON-', 0.9821428571428571),
 ('Mr.', 0.7678571428571429),
 ('President', 0.5),
 ('people', 0.48214285714285715),
 ('need', 0.44642857142857145)]

Save corpus data to and load from disk:

>>> corpus.save("~/Desktop/capitol_words_sample.bin.gz")
>>> corpus = textacy.Corpus.load("en", "~/Desktop/capitol_words_sample.bin.gz")
>>> print(corpus)
Corpus(56 docs, 41573 tokens)
Parameters
  • lang

    Language with which spaCy processes (or processed) all documents added to the corpus, whether as data now or later.

    Pass a standard 2-letter language code (e.g. “en”), or the name of a spacy language pipeline (e.g. “en_core_web_md”), or an already-instantiated spacy.language.Language object.

    A given language string is then used to instantiate a corresponding Language with all default components enabled.

  • data (obj or Iterable[obj]) –

    One or a stream of texts, records, or spacy.tokens.Doc objects to be added to the corpus.

    See also

    Corpus.add()

lang
spacy_lang
docs
n_docs
n_sents
n_tokens
add(data: Union[str, spacy.tokens.doc.Doc, Tuple[str, dict], Iterable[str], Iterable[spacy.tokens.doc.Doc], Iterable[Tuple[str, dict]]], batch_size: int = 1000, n_process: int = 1) → None[source]

Add one or a stream of texts, records, or spacy.tokens.Doc objects to the corpus, ensuring that all processing is or has already been done by the Corpus.spacy_lang pipeline.

Parameters
  • data – One or a stream of texts, records, or spacy.tokens.Doc objects to add to the corpus.

  • batch_size – Number of texts to buffer when processing with spaCy.

  • n_process

    Number of parallel processors to run when processing. If -1, this is set to multiprocessing.cpu_count().

    Note

    This feature is only available in spaCy 2.2.2+, and only applies when data is a sequence of texts or records.
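
For example, a short sketch of adding a stream of records with batched, parallel processing (assumes spaCy 2.2.2+ and the CapitolWords dataset used above):

>>> ds = textacy.datasets.CapitolWords()
>>> corpus = textacy.Corpus("en")
>>> corpus.add(ds.records(limit=100), batch_size=50, n_process=2)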

add_text(text: str) → None[source]

Add one text to the corpus, processing it into a spacy.tokens.Doc using the Corpus.spacy_lang pipeline.

Parameters

text (str) – Text to process and add to the corpus.

add_texts(texts: Iterable[str], batch_size: int = 1000, n_process: int = 1) → None[source]

Add a stream of texts to the corpus, efficiently processing them into spacy.tokens.Doc objects using the Corpus.spacy_lang pipeline.

Parameters
  • texts – Sequence of texts to process and add to corpus.

  • batch_size – Number of texts to buffer when processing with spaCy.

  • n_process

    Number of parallel processors to run when processing. If -1, this is set to multiprocessing.cpu_count().

    Note

    This feature is only available in spaCy 2.2.2+.

add_record(record: Tuple[str, Dict[Any, Any]]) → None[source]

Add one record to the corpus, processing it into a spacy.tokens.Doc using the Corpus.spacy_lang pipeline.

Parameters

record – Record, i.e. a (text, metadata) pair, to process and add to the corpus.

add_records(records: Iterable[Tuple[str, dict]], batch_size: int = 1000, n_process: int = 1) → None[source]

Add a stream of records to the corpus, efficiently processing them into spacy.tokens.Doc objects using the Corpus.spacy_lang pipeline.

Parameters
  • records – Sequence of records to process and add to corpus.

  • batch_size – Number of texts to buffer when processing with spaCy.

  • n_process

    Number of parallel processors to run when processing. If -1, this is set to multiprocessing.cpu_count().

    Note

    This feature is only available in spaCy 2.2.2+.

add_doc(doc: spacy.tokens.doc.Doc) → None[source]

Add one spacy.tokens.Doc to the corpus, provided it was processed using the Corpus.spacy_lang pipeline.

Parameters

doc – Document to add to the corpus; must have been processed using the Corpus.spacy_lang pipeline.

add_docs(docs: Iterable[spacy.tokens.doc.Doc]) → None[source]

Add a stream of spacy.tokens.Doc objects to the corpus, provided they were processed using the Corpus.spacy_lang pipeline.

Parameters

docs – Documents to add to the corpus; must have been processed using the Corpus.spacy_lang pipeline.

get(match_func: Callable[[spacy.tokens.doc.Doc], bool], limit: Optional[int] = None) → Iterator[spacy.tokens.doc.Doc][source]

Get all (or N <= limit) docs in Corpus for which match_func(doc) is True.

Parameters
  • match_func

    Function that takes a spacy.tokens.Doc as input and returns a boolean value. For example:

    Corpus.get(lambda x: len(x) >= 100)
    

    gets all docs with at least 100 tokens. And:

    Corpus.get(lambda doc: doc._.meta["author"] == "Burton DeWilde")
    

    gets all docs whose author was given as ‘Burton DeWilde’.

  • limit – Maximum number of matched docs to return.

Yields

spacy.tokens.Doc – Next document passing match_func.

Tip

To get doc(s) by index, treat Corpus as a list and use Python’s usual indexing and slicing: Corpus[0] gets the first document in the corpus; Corpus[:5] gets the first 5; etc.

remove(match_func: Callable[[spacy.tokens.doc.Doc], bool], limit: Optional[int] = None) → None[source]

Remove all (or N <= limit) docs in Corpus for which match_func(doc) is True. Corpus doc/sent/token counts are adjusted accordingly.

Parameters
  • match_func

    Function that takes a spacy.tokens.Doc and returns a boolean value. For example:

    Corpus.remove(lambda x: len(x) >= 100)
    

    removes docs with at least 100 tokens. And:

    Corpus.remove(lambda doc: doc._.meta["author"] == "Burton DeWilde")
    

    removes docs whose author was given as “Burton DeWilde”.

  • limit – Maximum number of matched docs to remove.

Tip

To remove doc(s) by index, treat Corpus as a list and use Python’s usual indexing and slicing: del Corpus[0] removes the first document in the corpus; del Corpus[:5] removes the first 5; etc.

property vectors

Constituent docs’ word vectors stacked in a 2d array.

property vector_norms

Constituent docs’ L2-normalized word vectors stacked in a 2d array.
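
For example, a minimal sketch of pairwise document similarity built on these arrays (assumes the pipeline provides word vectors, e.g. an _md or _lg model):

>>> import numpy as np
>>> mat = corpus.vector_norms        # shape: (n_docs, vector_dim); rows are unit-length
>>> sims = mat @ mat.T               # cosine similarity between every pair of docs
>>> np.argsort(sims[0])[::-1][1:4]   # indices of the docs most similar to corpus[0]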

word_counts(*, normalize: str = 'lemma', weighting: str = 'count', as_strings: bool = False, filter_stops: bool = True, filter_punct: bool = True, filter_nums: bool = False) → Dict[Union[int, str], Union[int, float]][source]

Map the set of unique words in Corpus to their counts as absolute, relative, or binary frequencies of occurrence, similar to Doc._.to_bag_of_words() but aggregated over all docs.

Parameters
  • normalize – If “lemma”, lemmatize words before counting; if “lower”, lowercase words before counting; otherwise, words are counted using the form with which they appear.

  • weighting ({"count", "freq"}) –

    Type of weight to assign to words. If “count” (default), weights are each word’s absolute number of occurrences (count) in the corpus. If “freq”, word counts are normalized by the total token count, giving their relative frequencies of occurrence.

    Note

    The resulting set of frequencies won’t (necessarily) sum to 1.0, since punctuation and stop words are filtered out after counts are normalized.

  • as_strings – If True, words are returned as strings; if False (default), words are returned as their unique integer ids.

  • filter_stops – If True (default), stop word counts are removed.

  • filter_punct – If True (default), punctuation counts are removed.

  • filter_nums – If True, number counts are removed.

Returns

Mapping of a unique word id or string (depending on the value of as_strings) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).
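
For instance, a short sketch of the “freq” weighting and the filtering caveat noted above:

>>> freqs = corpus.word_counts(weighting="freq", as_strings=True)
>>> sum(freqs.values())  # a bit less than 1.0: stop words and punctuation are dropped after normalization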

word_doc_counts(*, normalize: str = 'lemma', weighting: str = 'count', smooth_idf: bool = True, as_strings: bool = False, filter_stops: bool = True, filter_punct: bool = True, filter_nums: bool = True) → Dict[Union[int, str], Union[int, float]][source]

Map the set of unique words in Corpus to their document counts as absolute, relative, inverse, or binary frequencies of occurrence.

Parameters
  • normalize – If “lemma”, lemmatize words before counting; if “lower”, lowercase words before counting; otherwise, words are counted using the form with which they appear.

  • weighting ({"count", "freq", "idf"}) – Type of weight to assign to words. If “count” (default), weights are the absolute number (count) of documents in which word appears. If “freq”, word doc counts are normalized by the total document count, giving their relative frequencies of occurrence. If “idf”, weights are the log of the inverse relative frequencies: log(n_docs / word_doc_count) or (if smooth_idf is True) log(1 + (n_docs / word_doc_count)) .

  • smooth_idf – If True, add 1 to all word doc counts when calculating “idf” weighting, equivalent to adding a single document to the corpus containing every unique word.

  • as_strings – If True, words are returned as strings; if False (default), words are returned as their unique integer ids.

  • filter_stops – If True (default), stop word counts are removed.

  • filter_punct – If True (default), punctuation counts are removed.

  • filter_nums – If True (default), number counts are removed.

Returns

Mapping of a unique word id or string (depending on the value of as_strings) to the number of documents in which it appears weighted as absolute, relative, or binary frequencies (depending on the value of weighting).
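
For instance, a short sketch of the “idf” weighting (actual values depend on the corpus at hand):

>>> word_idfs = corpus.word_doc_counts(weighting="idf", smooth_idf=True, as_strings=True)
>>> sorted(word_idfs.items(), key=lambda x: x[1], reverse=True)[:5]  # rarest words get the largest weights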

save(filepath: Union[str, pathlib.Path], store_user_data: bool = True) → None[source]

Save Corpus to disk as binary data.

Parameters
  • filepath – Full path to file on disk where Corpus data will be saved as a binary file.

  • store_user_data – If True, store user data and values of custom extension attributes along with core spaCy attributes.

See also

Corpus.load()

classmethod load(lang: Union[str, spacy.language.Language], filepath: Union[str, pathlib.Path], store_user_data: bool = True) → textacy.corpus.Corpus[source]

Load previously saved Corpus binary data, reproduce the original spacy.tokens.Doc objects’ tokens and annotations, and instantiate a new Corpus from them.

Parameters
  • lang – Language with which spaCy processed the documents in the saved Corpus.

  • filepath – Full path to file on disk where Corpus data was previously saved as a binary file.

  • store_user_data – If True, load stored user data and values of custom extension attributes along with core spaCy attributes.

Returns

Corpus

See also

Corpus.save()