Lang, Doc, Corpus

textacy.spacier.core: Convenient entry point for loading spaCy language pipelines and making spaCy docs.

textacy.spacier.core.load_spacy_lang(name: Union[str, pathlib.Path], disable: Optional[Tuple[str, ...]] = None, allow_blank: bool = False) → spacy.language.Language[source]

Load a spaCy Language: a shared vocabulary and language-specific data for tokenizing text, and (if available) model data and a processing pipeline containing a sequence of components for annotating a document. An LRU cache saves languages in memory for quick reloading.

>>> en_nlp = textacy.load_spacy_lang("en")
>>> en_nlp = textacy.load_spacy_lang("en_core_web_sm")
>>> en_nlp = textacy.load_spacy_lang("en", disable=("parser",))
>>> textacy.load_spacy_lang("ar")
...
OSError: [E050] Can't find model 'ar'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
>>> textacy.load_spacy_lang("ar", allow_blank=True)
<spacy.lang.ar.Arabic at 0x126418550>
Parameters
  • name – spaCy language to load. Could be a shortcut link, a full package name, a path to a model directory, or a 2-letter ISO language code for which spaCy has language data.

  • disable

    Names of pipeline components to disable, if any.

    Note

    Although spaCy’s API specifies this argument as a list, here we require a tuple: pipelines are stored in the LRU cache under unique identifiers generated from the hash of the function name and args, and lists aren’t hashable. (See the example after this parameter list.)

  • allow_blank – If True, allow loading of blank spaCy Language objects; if False, raise an OSError if a full processing pipeline isn’t available. Note that spaCy Doc objects produced by blank languages are missing key functionality, e.g. POS tags, entities, sentences.
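
For example, a minimal sketch of the tuple requirement for disable (assuming the en_core_web_sm package is installed):

>>> nlp = textacy.load_spacy_lang("en_core_web_sm", disable=("tagger", "parser"))
>>> # passing a list instead, e.g. disable=["tagger", "parser"], raises a TypeError,
>>> # since a list can't be hashed into a cache key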

Returns

A loaded spaCy Language.

Raises

OSError – If allow_blank is False and a full processing pipeline isn’t available for the given name.

textacy.spacier.core.make_spacy_doc(data: Union[str, Tuple[str, dict], spacy.tokens.doc.Doc], lang: Union[str, Callable[[str], str], spacy.language.Language] = <bound method LangIdentifier.identify_lang of <textacy.lang_utils.LangIdentifier object>>) → spacy.tokens.doc.Doc[source]

Make a spacy.tokens.Doc from valid inputs, and automatically load/validate spacy.language.Language pipelines to process data.

Make a Doc from text:

>>> text = "To be, or not to be, that is the question."
>>> doc = make_spacy_doc(text)
>>> doc._.preview
'Doc(13 tokens: "To be, or not to be, that is the question.")'

Make a Doc from a (text, metadata) pair, aka a “record”:

>>> record = (text, {"author": "Shakespeare, William"})
>>> doc = make_spacy_doc(record)
>>> doc._.preview
'Doc(13 tokens: "To be, or not to be, that is the question.")'
>>> doc._.meta
{'author': 'Shakespeare, William'}

Specify the language / Language pipeline used to process the text — or don’t:

>>> make_spacy_doc(text)
>>> make_spacy_doc(text, lang="en")
>>> make_spacy_doc(text, lang="en_core_web_sm")
>>> make_spacy_doc(text, lang=textacy.load_spacy_lang("en"))
>>> make_spacy_doc(text, lang=textacy.lang_utils.identify_lang)

Ensure that an already-processed Doc is compatible with lang:

>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang(text)
>>> make_spacy_doc(doc, lang="en")
>>> make_spacy_doc(doc, lang="es")
...
ValueError: lang of spacy pipeline used to process document ('en') must be the same as `lang` ('es')
Parameters
  • data – Make a spacy.tokens.Doc from a text or (text, metadata) pair. If already a Doc, ensure that it’s compatible with lang to avoid surprises downstream, and return it as-is.

  • lang

    Language with which spaCy processes (or processed) data.

    If known, pass a standard 2-letter language code (e.g. “en”), or the name of a spacy language pipeline (e.g. “en_core_web_md”), or an already-instantiated spacy.language.Language object. If not known, pass a function that takes unicode text as input and outputs a standard 2-letter language code (see the sketch after this parameter list).

    A given / detected language string is then used to instantiate a corresponding Language with all default components enabled.
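
For instance, a minimal sketch of a custom language-identifier callable (guess_lang here is a hypothetical stand-in; a real identifier would inspect the text):

>>> def guess_lang(text):
...     # hypothetical identifier: always guess English
...     return "en"
>>> doc = textacy.make_spacy_doc(text, lang=guess_lang)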

Returns

Processed spaCy Doc.

Raises

ValueError – If data is an already-processed Doc whose language pipeline doesn’t match lang.

textacy.corpus: Class for working with a collection of spaCy Doc objects. Includes functionality for easily adding, getting, and removing documents; saving their data to / loading it from disk; and tracking basic corpus statistics.

class textacy.corpus.Corpus(lang: Union[str, spacy.language.Language], data: Optional[Union[str, spacy.tokens.doc.Doc, Tuple[str, dict], Iterable[str], Iterable[spacy.tokens.doc.Doc], Iterable[Tuple[str, dict]]]] = None)[source]

An ordered collection of spacy.tokens.Doc, all of the same language and sharing the same spacy.language.Language processing pipeline and vocabulary, with data held in-memory.

Initialize from a language / Language and (optionally) one or a stream of texts or (text, metadata) pairs:

>>> ds = textacy.datasets.CapitolWords()
>>> records = ds.records(limit=50)
>>> corpus = textacy.Corpus("en", data=records)
>>> print(corpus)
Corpus(50 docs, 32175 tokens)

Add or remove documents, with automatic updating of corpus statistics:

>>> texts = ds.texts(congress=114, limit=25)
>>> corpus.add(texts)
>>> corpus.add("If Burton were a member of Congress, here's what he'd say.")
>>> print(corpus)
Corpus(76 docs, 55906 tokens)
>>> corpus.remove(lambda doc: doc._.meta.get("speaker_name") == "Rick Santorum")
>>> print(corpus)
Corpus(61 docs, 48567 tokens)

Get subsets of documents matching your particular use case:

>>> match_func = lambda doc: doc._.meta.get("speaker_name") == "Bernie Sanders"
>>> for doc in corpus.get(match_func, limit=3):
...     print(doc._.preview)
Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")
Doc(336 tokens: "Mr. Speaker, I thank the gentleman for yielding...")
Doc(177 tokens: "Mr. Speaker, if we want to understand why in th...")

Get or remove documents by indexing, too:

>>> corpus[0]._.preview
'Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")'
>>> [doc._.preview for doc in corpus[:3]]
['Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")',
 'Doc(219 tokens: "Mr. Speaker, a relationship, to work and surviv...")',
 'Doc(336 tokens: "Mr. Speaker, I thank the gentleman for yielding...")']
>>> del corpus[:5]
>>> print(corpus)
Corpus(56 docs, 41573 tokens)

Compute basic corpus statistics:

>>> corpus.n_docs, corpus.n_sents, corpus.n_tokens
(56, 1771, 41573)
>>> word_counts = corpus.word_counts(as_strings=True)
>>> sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:5]
[('-PRON-', 2553), ('people', 215), ('year', 148), ('Mr.', 139), ('$', 137)]
>>> word_doc_counts = corpus.word_doc_counts(weighting="freq", as_strings=True)
>>> sorted(word_doc_counts.items(), key=lambda x: x[1], reverse=True)[:5]
[('-PRON-', 0.9821428571428571),
 ('Mr.', 0.7678571428571429),
 ('President', 0.5),
 ('people', 0.48214285714285715),
 ('need', 0.44642857142857145)]

Save corpus data to and load from disk:

>>> corpus.save("~/Desktop/capitol_words_sample.bin.gz")
>>> corpus = textacy.Corpus.load("en", "~/Desktop/capitol_words_sample.bin.gz")
>>> print(corpus)
Corpus(56 docs, 41573 tokens)
Parameters
  • lang

    Language with which spaCy processes (or processed) all documents added to the corpus, whether as data now or later.

    Pass a standard 2-letter language code (e.g. “en”), or the name of a spacy language pipeline (e.g. “en_core_web_md”), or an already-instantiated spacy.language.Language object.

    A given language string is then used to instantiate a corresponding Language with all default components enabled.

  • data (obj or Iterable[obj]) –

    One or a stream of texts, records, or spacy.tokens.Doc objects to be added to the corpus.

    See also

    Corpus.add()

lang
spacy_lang
docs
n_docs
n_sents
n_tokens
add(data: Union[str, spacy.tokens.doc.Doc, Tuple[str, dict], Iterable[str], Iterable[spacy.tokens.doc.Doc], Iterable[Tuple[str, dict]]], batch_size: int = 1000, n_process: int = 1) → None[source]

Add one or a stream of texts, records, or spacy.tokens.Doc objects to the corpus, ensuring that all processing is or has already been done by the Corpus.spacy_lang pipeline.

Parameters
  • data – One or a stream of texts, records, or spacy.tokens.Doc objects to add to the corpus.

  • batch_size – Number of texts to buffer when processing with spaCy.

  • n_process

    Number of parallel processors to run when processing. If -1, this is set to multiprocessing.cpu_count().

    Note

    This feature is only available in spaCy 2.2.2+, and only applies when data is a sequence of texts or records.
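
For example, a short sketch of adding a stream of records with batched, parallel processing (assumes spaCy 2.2.2+ and the CapitolWords dataset used above):

>>> ds = textacy.datasets.CapitolWords()
>>> corpus = textacy.Corpus("en")
>>> corpus.add(ds.records(limit=100), batch_size=50, n_process=2)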

add_text(text: str) → None[source]

Add one text to the corpus, processing it into a spacy.tokens.Doc using the Corpus.spacy_lang pipeline.

Parameters

text (str) – Text to process and add to the corpus.

add_texts(texts: Iterable[str], batch_size: int = 1000, n_process: int = 1) → None[source]

Add a stream of texts to the corpus, efficiently processing them into spacy.tokens.Doc objects using the Corpus.spacy_lang pipeline.

Parameters
  • texts – Sequence of texts to process and add to corpus.

  • batch_size – Number of texts to buffer when processing with spaCy.

  • n_process

    Number of parallel processors to run when processing. If -1, this is set to multiprocessing.cpu_count().

    Note

    This feature is only available in spaCy 2.2.2+.

add_record(record: Tuple[str, Dict[Any, Any]]) → None[source]

Add one record to the corpus, processing it into a spacy.tokens.Doc using the Corpus.spacy_lang pipeline.

Parameters

record – Record, i.e. a (text, metadata) pair, to process and add to the corpus.

add_records(records: Iterable[Tuple[str, dict]], batch_size: int = 1000, n_process: int = 1) → None[source]

Add a stream of records to the corpus, efficiently processing them into spacy.tokens.Doc objects using the Corpus.spacy_lang pipeline.

Parameters
  • records – Sequence of records to process and add to corpus.

  • batch_size – Number of texts to buffer when processing with spaCy.

  • n_process

    Number of parallel processors to run when processing. If -1, this is set to multiprocessing.cpu_count().

    Note

    This feature is only available in spaCy 2.2.2+.

add_doc(doc: spacy.tokens.doc.Doc) → None[source]

Add one spacy.tokens.Doc to the corpus, provided it was processed using the Corpus.spacy_lang pipeline.

Parameters

doc – Document to add to the corpus; must have been processed using the Corpus.spacy_lang pipeline.

add_docs(docs: Iterable[spacy.tokens.doc.Doc]) → None[source]

Add a stream of spacy.tokens.Doc objects to the corpus, provided they were processed using the Corpus.spacy_lang pipeline.

Parameters

docs – Documents to add to the corpus; must have been processed using the Corpus.spacy_lang pipeline.

get(match_func: Callable[[spacy.tokens.doc.Doc], bool], limit: Optional[int] = None) → Iterator[spacy.tokens.doc.Doc][source]

Get all (or N <= limit) docs in Corpus for which match_func(doc) is True.

Parameters
  • match_func

    Function that takes a spacy.tokens.Doc as input and returns a boolean value. For example:

    Corpus.get(lambda x: len(x) >= 100)
    

    gets all docs with at least 100 tokens. And:

    Corpus.get(lambda doc: doc._.meta["author"] == "Burton DeWilde")
    

    gets all docs whose author was given as ‘Burton DeWilde’.

  • limit – Maximum number of matched docs to return.

Yields

spacy.tokens.Doc – Next document passing match_func.

Tip

To get doc(s) by index, treat Corpus as a list and use Python’s usual indexing and slicing: Corpus[0] gets the first document in the corpus; Corpus[:5] gets the first 5; etc.

remove(match_func: Callable[[spacy.tokens.doc.Doc], bool], limit: Optional[int] = None) → None[source]

Remove all (or N <= limit) docs in Corpus for which match_func(doc) is True. Corpus doc/sent/token counts are adjusted accordingly.

Parameters
  • match_func

    Function that takes a spacy.tokens.Doc and returns a boolean value. For example:

    Corpus.remove(lambda x: len(x) >= 100)
    

    removes docs with at least 100 tokens. And:

    Corpus.remove(lambda doc: doc._.meta["author"] == "Burton DeWilde")
    

    removes docs whose author was given as “Burton DeWilde”.

  • limit – Maximum number of matched docs to remove.

Tip

To remove doc(s) by index, treat Corpus as a list and use Python’s usual indexing and slicing: del Corpus[0] removes the first document in the corpus; del Corpus[:5] removes the first 5; etc.

property vectors

Constituent docs’ word vectors stacked in a 2d array.

property vector_norms

Constituent docs’ L2-normalized word vectors stacked in a 2d array.
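
For example, a minimal sketch of pairwise document similarity built on these arrays (assumes the pipeline provides word vectors, e.g. an _md or _lg model):

>>> import numpy as np
>>> mat = corpus.vector_norms        # shape: (n_docs, vector_dim); rows are unit-length
>>> sims = mat @ mat.T               # cosine similarity between every pair of docs
>>> np.argsort(sims[0])[::-1][1:4]   # indices of the docs most similar to corpus[0]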

word_counts(*, normalize: str = 'lemma', weighting: str = 'count', as_strings: bool = False, filter_stops: bool = True, filter_punct: bool = True, filter_nums: bool = False) → Dict[Union[int, str], Union[int, float]][source]

Map the set of unique words in Corpus to their counts as absolute, relative, or binary frequencies of occurrence, similar to Doc._.to_bag_of_words() but aggregated over all docs.

Parameters
  • normalize – If “lemma”, lemmatize words before counting; if “lower”, lowercase words before counting; otherwise, words are counted using the form with which they appear.

  • weighting ({"count", "freq"}) –

    Type of weight to assign to words. If “count” (default), weights are each word’s absolute number of occurrences (count) in the corpus. If “freq”, word counts are normalized by the total token count, giving their relative frequencies of occurrence.

    Note

    The resulting set of frequencies won’t (necessarily) sum to 1.0, since punctuation and stop words are filtered out after counts are normalized.

  • as_strings – If True, words are returned as strings; if False (default), words are returned as their unique integer ids.

  • filter_stops – If True (default), stop word counts are removed.

  • filter_punct – If True (default), punctuation counts are removed.

  • filter_nums – If True, number counts are removed.

Returns

Mapping of a unique word id or string (depending on the value of as_strings) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).
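
For instance, a short sketch of the “freq” weighting and the filtering caveat noted above:

>>> freqs = corpus.word_counts(weighting="freq", as_strings=True)
>>> sum(freqs.values())  # a bit less than 1.0: stop words and punctuation are dropped after normalization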

word_doc_counts(*, normalize: str = 'lemma', weighting: str = 'count', smooth_idf: bool = True, as_strings: bool = False, filter_stops: bool = True, filter_punct: bool = True, filter_nums: bool = True) → Dict[Union[int, str], Union[int, float]][source]

Map the set of unique words in Corpus to their document counts as absolute, relative, inverse, or binary frequencies of occurrence.

Parameters
  • normalize – If “lemma”, lemmatize words before counting; if “lower”, lowercase words before counting; otherwise, words are counted using the form with which they appear.

  • weighting ({"count", "freq", "idf"}) – Type of weight to assign to words. If “count” (default), weights are the absolute number (count) of documents in which word appears. If “freq”, word doc counts are normalized by the total document count, giving their relative frequencies of occurrence. If “idf”, weights are the log of the inverse relative frequencies: log(n_docs / word_doc_count) or (if smooth_idf is True) log(1 + (n_docs / word_doc_count)) .

  • smooth_idf – If True, add 1 to all word doc counts when calculating “idf” weighting, equivalent to adding a single document to the corpus containing every unique word.

  • as_strings – If True, words are returned as strings; if False (default), words are returned as their unique integer ids.

  • filter_stops – If True (default), stop word counts are removed.

  • filter_punct – If True (default), punctuation counts are removed.

  • filter_nums – If True (default), number counts are removed.

Returns

Mapping of a unique word id or string (depending on the value of as_strings) to the number of documents in which it appears weighted as absolute, relative, or binary frequencies (depending on the value of weighting).
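
For instance, a short sketch of the “idf” weighting (actual values depend on the corpus at hand):

>>> word_idfs = corpus.word_doc_counts(weighting="idf", smooth_idf=True, as_strings=True)
>>> sorted(word_idfs.items(), key=lambda x: x[1], reverse=True)[:5]  # rarest words get the largest weights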

save(filepath: Union[str, pathlib.Path], store_user_data: bool = True) → None[source]

Save Corpus to disk as binary data.

Parameters
  • filepath – Full path to file on disk where Corpus data will be saved as a binary file.

  • store_user_data – If True, store user data and values of custom extension attributes along with core spaCy attributes.

See also

Corpus.load()

classmethod load(lang: Union[str, spacy.language.Language], filepath: Union[str, pathlib.Path], store_user_data: bool = True) → textacy.corpus.Corpus[source]

Load previously saved Corpus binary data, reproduce the original spacy.tokens.Doc objects’ tokens and annotations, and instantiate a new Corpus from them.

Parameters
  • lang – Language with which spaCy processed the documents in the saved Corpus.

  • filepath – Full path to file on disk where Corpus data was previously saved as a binary file.

  • store_user_data – If True, load stored user data and values of custom extension attributes along with core spaCy attributes.

Returns

Corpus

See also

Corpus.save()