Lang, Doc, Corpus¶
textacy.spacier.core: Convenient entry point for loading spaCy language pipelines and making spaCy docs.
- textacy.spacier.core.load_spacy_lang(name: str | pathlib.Path, **kwargs) → Language[source]¶

Load a spaCy Language — a shared vocabulary and language-specific data for tokenizing text, and (if available) model data and a processing pipeline containing a sequence of components for annotating a document — and cache the result for quick reloading as needed.

Note that as of spaCy v3, which no longer allows pipeline aliases, this function is just a convenient access point to the underlying spacy.load().

>>> en_nlp = textacy.load_spacy_lang("en_core_web_sm")
>>> en_nlp = textacy.load_spacy_lang("en_core_web_sm", disable=("parser",))
>>> textacy.load_spacy_lang("ar")
...
OSError: [E050] Can't find model 'ar'. It doesn't seem to be a Python package or a valid path to a data directory.
- Parameters
  - name – Name or path to the spaCy language pipeline to load.
  - **kwargs – Passed on to the underlying spacy.load().

    Note
    Although spaCy's API specifies some kwargs as List[str], here we require Tuple[str, ...] equivalents. Language pipelines are stored in an LRU cache with unique identifiers generated from the hash of the function name and args — and lists aren't hashable.
- Returns
  Loaded spaCy Language.
- Raises
  OSError – If name can't be found as an installed package or a valid path to a data directory.
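Because pipelines are cached by hashing the call's args, list-valued kwargs must be converted to tuples before calling. A minimal sketch of the pattern (the disabled components are illustrative, and we assume the cache returns the identical pipeline object for identical args):

>>> components = ["parser", "ner"]  # lists aren't hashable...
>>> en_nlp = textacy.load_spacy_lang("en_core_web_sm", disable=tuple(components))
>>> en_nlp_again = textacy.load_spacy_lang("en_core_web_sm", disable=("parser", "ner"))
>>> en_nlp is en_nlp_again  # same cached pipeline, no re-loading from disk
True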
- textacy.spacier.core.make_spacy_doc(data: Union[str, textacy.types.Record, spacy.tokens.doc.Doc], lang: Union[str, pathlib.Path, spacy.language.Language, Callable[[str], str], Callable[[str], pathlib.Path], Callable[[str], spacy.language.Language]], *, chunk_size: Optional[int] = None) → spacy.tokens.doc.Doc[source]¶

Make a spacy.tokens.Doc from valid inputs, and automatically load/validate spacy.language.Language pipelines to process data.

Make a Doc from text:

>>> text = "To be, or not to be, that is the question."
>>> doc = make_spacy_doc(text, "en_core_web_sm")
>>> doc._.preview
'Doc(13 tokens: "To be, or not to be, that is the question.")'
Make a Doc from a (text, metadata) pair, aka a "record":

>>> record = (text, {"author": "Shakespeare, William"})
>>> doc = make_spacy_doc(record, "en_core_web_sm")
>>> doc._.preview
'Doc(13 tokens: "To be, or not to be, that is the question.")'
>>> doc._.meta
{'author': 'Shakespeare, William'}
Specify the language pipeline used to process the text in a few different ways:
>>> make_spacy_doc(text, lang="en_core_web_sm")
>>> make_spacy_doc(text, lang=textacy.load_spacy_lang("en_core_web_sm"))
>>> make_spacy_doc(text, lang=lambda txt: "en_core_web_sm")
Ensure that an already-processed Doc is compatible with lang:

>>> spacy_lang = textacy.load_spacy_lang("en_core_web_sm")
>>> doc = spacy_lang(text)
>>> make_spacy_doc(doc, lang="en_core_web_sm")
>>> make_spacy_doc(doc, lang="es_core_news_sm")
...
ValueError: `spacy.Vocab` used to process document must be the same as that used by the `lang` pipeline ('es_core_news_sm')
- Parameters
  - data – Make a spacy.tokens.Doc from a text or (text, metadata) pair. If already a Doc, ensure that it's compatible with lang to avoid surprises downstream, and return it as-is.
  - lang – Language with which spaCy processes (or processed) data, represented as the full name of a spaCy language pipeline, the path on disk to it, an already instantiated pipeline, or a callable function that takes the text component of data and outputs one of the above representations.
  - chunk_size – Size of chunks in number of characters into which text will be split before processing each via spaCy and concatenating the results into a single Doc.

    Note
    This is intended as a workaround for processing very long texts, for which spaCy is unable to allocate enough RAM. For best performance, chunk size should be somewhere between 1e3 and 1e7 characters, depending on how much RAM you have available.
    Since chunking is done by character, chunks' boundaries likely won't respect natural language segmentation, and as a result spaCy's models may make mistakes on sentences/words that cross them.
- Returns
  Processed spaCy Doc.
- Raises
  ValueError – If an already-processed data Doc's vocab is not the same as that used by the lang pipeline (as in the example above).
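For very long texts, the chunk_size workaround described above can be applied as follows; a sketch, assuming ~1e5-character chunks fit comfortably in RAM (the input text is illustrative):

>>> very_long_text = "To be, or not to be, that is the question. " * 50000
>>> doc = make_spacy_doc(very_long_text, lang="en_core_web_sm", chunk_size=100000)
>>> n_tokens = len(doc)  # one Doc, concatenated from ~22 per-chunk results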
textacy.corpus: Class for working with a collection of spaCy Docs. Includes functionality for easily adding, getting, and removing documents; saving to / loading their data from disk; and tracking basic corpus statistics.
- class textacy.corpus.Corpus(lang: Union[str, pathlib.Path, spacy.language.Language], data: Optional[Union[str, textacy.types.Record, spacy.tokens.doc.Doc, Iterable[str], Iterable[textacy.types.Record], Iterable[spacy.tokens.doc.Doc]]] = None)[source]¶

An ordered collection of spacy.tokens.Docs, all of the same language and sharing the same spacy.language.Language processing pipeline and vocabulary, with data held in-memory.

Initialize from a Language name or instance and (optionally) one or a stream of texts or (text, metadata) pairs:

>>> ds = textacy.datasets.CapitolWords()
>>> records = ds.records(limit=50)
>>> corpus = textacy.Corpus("en_core_web_sm", data=records)
>>> print(corpus)
Corpus(50 docs, 32175 tokens)
Add or remove documents, with automatic updating of corpus statistics:
>>> texts = ds.texts(congress=114, limit=25)
>>> corpus.add(texts)
>>> corpus.add("If Burton were a member of Congress, here's what he'd say.")
>>> print(corpus)
Corpus(76 docs, 55906 tokens)
>>> corpus.remove(lambda doc: doc._.meta.get("speaker_name") == "Rick Santorum")
>>> print(corpus)
Corpus(61 docs, 48567 tokens)
Get subsets of documents matching your particular use case:
>>> match_func = lambda doc: doc._.meta.get("speaker_name") == "Bernie Sanders"
>>> for doc in corpus.get(match_func, limit=3):
...     print(doc._.preview)
Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")
Doc(336 tokens: "Mr. Speaker, I thank the gentleman for yielding...")
Doc(177 tokens: "Mr. Speaker, if we want to understand why in th...")
Get or remove documents by indexing, too:
>>> corpus[0]._.preview
'Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")'
>>> [doc._.preview for doc in corpus[:3]]
['Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")',
 'Doc(219 tokens: "Mr. Speaker, a relationship, to work and surviv...")',
 'Doc(336 tokens: "Mr. Speaker, I thank the gentleman for yielding...")']
>>> del corpus[:5]
>>> print(corpus)
Corpus(56 docs, 41573 tokens)
Compute basic corpus statistics:
>>> corpus.n_docs, corpus.n_sents, corpus.n_tokens
(56, 1771, 41573)
>>> word_counts = corpus.word_counts(as_strings=True)
>>> sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:5]
[('-PRON-', 2553), ('people', 215), ('year', 148), ('Mr.', 139), ('$', 137)]
>>> word_doc_counts = corpus.word_doc_counts(weighting="freq", as_strings=True)
>>> sorted(word_doc_counts.items(), key=lambda x: x[1], reverse=True)[:5]
[('-PRON-', 0.9821428571428571), ('Mr.', 0.7678571428571429), ('President', 0.5), ('people', 0.48214285714285715), ('need', 0.44642857142857145)]
Save corpus data to and load from disk:
>>> corpus.save("./cw_sample.bin.gz")
>>> corpus = textacy.Corpus.load("en_core_web_sm", "./cw_sample.bin.gz")
>>> print(corpus)
Corpus(56 docs, 41573 tokens)
- Parameters
  - lang – Language with which spaCy processes (or processed) all documents added to the corpus, whether as data now or later. Pass the name of a spaCy language pipeline (e.g. "en_core_web_sm"), or an already-instantiated spacy.language.Language object. A given / detected language string is then used to instantiate a corresponding Language with all default components enabled.
  - data – One or a stream of texts, records, or spacy.tokens.Docs to be added to the corpus.
- spacy_lang¶
  - Type
    spacy.language.Language

- docs¶
  - Type
    List[spacy.tokens.doc.Doc]
- add(data: Union[str, textacy.types.Record, spacy.tokens.doc.Doc, Iterable[str], Iterable[textacy.types.Record], Iterable[spacy.tokens.doc.Doc]], batch_size: int = 1000, n_process: int = 1)[source]¶

Add one or a stream of texts, records, or spacy.tokens.Docs to the corpus, ensuring that all processing is or has already been done by the Corpus.spacy_lang pipeline.

- Parameters
  - data – One or a stream of texts, records, or spacy.tokens.Docs to add to the corpus.
  - batch_size – Number of texts to buffer when processing with spaCy.
  - n_process – Number of parallel processors to run when processing. If -1, this is set to multiprocessing.cpu_count().

    Note
    This feature only applies when data is a sequence of texts or records.
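A usage sketch mixing input types (the texts and record are illustrative; n_process=-1 uses all available cores):

>>> corpus = textacy.Corpus("en_core_web_sm")
>>> corpus.add("A lone text, processed on its own.")
>>> corpus.add(("A text with metadata.", {"source": "example"}))
>>> texts = (f"Document number {i}." for i in range(10000))
>>> corpus.add(texts, batch_size=2000, n_process=-1)
>>> corpus.n_docs
10002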
- add_text(text: str) → None[source]¶

Add one text to the corpus, processing it into a spacy.tokens.Doc using the Corpus.spacy_lang pipeline.

- Parameters
  text (str) – Text to process and add to the corpus.
- add_texts(texts: Iterable[str], batch_size: int = 1000, n_process: int = 1) → None[source]¶

Add a stream of texts to the corpus, efficiently processing them into spacy.tokens.Docs using the Corpus.spacy_lang pipeline.

- Parameters
  - texts – Sequence of texts to process and add to the corpus.
  - batch_size – Number of texts to buffer when processing with spaCy.
  - n_process – Number of parallel processors to run when processing. If -1, this is set to multiprocessing.cpu_count().

    Note
    This feature is only available in spaCy 2.2.2+.
- add_record(record: textacy.types.Record) → None[source]¶

Add one record to the corpus, processing it into a spacy.tokens.Doc using the Corpus.spacy_lang pipeline.

- Parameters
  record – (text, metadata) pair to process and add to the corpus.
- add_records(records: Iterable[textacy.types.Record], batch_size: int = 1000, n_process: int = 1) → None[source]¶

Add a stream of records to the corpus, efficiently processing them into spacy.tokens.Docs using the Corpus.spacy_lang pipeline.

- Parameters
  - records – Sequence of records to process and add to the corpus.
  - batch_size – Number of texts to buffer when processing with spaCy.
  - n_process – Number of parallel processors to run when processing. If -1, this is set to multiprocessing.cpu_count().

    Note
    This feature is only available in spaCy 2.2.2+.
- add_doc(doc: spacy.tokens.doc.Doc) → None[source]¶

Add one spacy.tokens.Doc to the corpus, provided it was processed using the Corpus.spacy_lang pipeline.

- Parameters
  doc – Document to add to the corpus.
- add_docs(docs: Iterable[spacy.tokens.doc.Doc]) → None[source]¶

Add a stream of spacy.tokens.Docs to the corpus, provided they were processed using the Corpus.spacy_lang pipeline.

- Parameters
  docs – Sequence of documents to add to the corpus.
- get(match_func: Callable[[spacy.tokens.doc.Doc], bool], limit: Optional[int] = None) → Iterator[spacy.tokens.doc.Doc][source]¶

Get all (or N <= limit) docs in Corpus for which match_func(doc) is True.

- Parameters
  - match_func – Function that takes a spacy.tokens.Doc as input and returns a boolean value. For example, Corpus.get(lambda x: len(x) >= 100) gets all docs with at least 100 tokens, and Corpus.get(lambda doc: doc._.meta["author"] == "Burton DeWilde") gets all docs whose author was given as "Burton DeWilde".
  - limit – Maximum number of matched docs to return.
- Yields
  spacy.tokens.Doc – Next document passing match_func.

Tip
To get doc(s) by index, treat Corpus as a list and use Python's usual indexing and slicing: Corpus[0] gets the first document in the corpus; Corpus[:5] gets the first 5; etc.
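A sketch of get with a metadata-based match_func (the "author" key and records are illustrative):

>>> corpus = textacy.Corpus(
...     "en_core_web_sm",
...     data=[("Text one.", {"author": "Burton DeWilde"}),
...           ("Text two.", {"author": "Someone Else"})],
... )
>>> matches = list(corpus.get(lambda doc: doc._.meta.get("author") == "Burton DeWilde"))
>>> len(matches)
1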
- remove(match_func: Callable[[spacy.tokens.doc.Doc], bool], limit: Optional[int] = None) → None[source]¶

Remove all (or N <= limit) docs in Corpus for which match_func(doc) is True. Corpus doc/sent/token counts are adjusted accordingly.

- Parameters
  - match_func – Function that takes a spacy.tokens.Doc and returns a boolean value. For example, Corpus.remove(lambda x: len(x) >= 100) removes docs with at least 100 tokens, and Corpus.remove(lambda doc: doc._.meta["author"] == "Burton DeWilde") removes docs whose author was given as "Burton DeWilde".
  - limit – Maximum number of matched docs to remove.

Tip
To remove doc(s) by index, treat Corpus as a list and use Python's usual indexing and slicing: del Corpus[0] removes the first document in the corpus; del Corpus[:5] removes the first 5; etc.
- property vectors¶
Constituent docs' word vectors stacked in a 2d array.

- property vector_norms¶
Constituent docs' L2-normalized word vectors stacked in a 2d array.
- word_counts(*, by: str = 'lemma', weighting: str = 'count', **kwargs) → Dict[int, int | float] | Dict[str, int | float][source]¶

Map the set of unique words in Corpus to their counts as absolute, relative, or binary frequencies of occurrence, similar to Doc._.to_bag_of_words() but aggregated over all docs.

- Parameters
  - by – Attribute by which spaCy Tokens are grouped before counting, as given by getattr(token, by). If "lemma", tokens are grouped by their base form w/o inflections; if "lower", by the lowercase form of the token text; if "norm", by the normalized form of the token text; if "orth", by the token text exactly as it appears in documents. To output keys as strings, append an underscore to any of these options; for example, "lemma_" groups tokens by their lemmas as strings.
  - weighting – Type of weighting to assign to unique words given by by. If "count", weights are the absolute number of occurrences (i.e. counts); if "freq", weights are counts normalized by the total token count, giving their relative frequency of occurrence.
  - **kwargs – Passed directly on to textacy.extract.words():
    - filter_stops: If True, stop words are removed before counting.
    - filter_punct: If True, punctuation tokens are removed before counting.
    - filter_nums: If True, number-like tokens are removed before counting.
- Returns
  Mapping of a unique word id or string (depending on the value of by) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).
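A sketch with string keys and filters applied, continuing from the corpus built in the class examples above (the filter flags shown are optional):

>>> word_counts = corpus.word_counts(by="lemma_", weighting="freq", filter_stops=True, filter_punct=True)
>>> top3 = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)[:3]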
- word_doc_counts(*, by: str = 'lemma', weighting: str = 'count', smooth_idf: bool = True, **kwargs) → Dict[int, int | float] | Dict[str, int | float][source]¶

Map the set of unique words in Corpus to their document counts as absolute, relative, or inverse frequencies of occurrence.

- Parameters
  - by – Attribute by which spaCy Tokens are grouped before counting, as given by getattr(token, by). If "lemma", tokens are grouped by their base form w/o inflections; if "lower", by the lowercase form of the token text; if "norm", by the normalized form of the token text; if "orth", by the token text exactly as it appears in documents. To output keys as strings, append an underscore to any of these options; for example, "lemma_" groups tokens by their lemmas as strings.
  - weighting – Type of weighting to assign to unique words given by by. If "count", weights are the absolute number of documents in which each word occurs (i.e. doc counts); if "freq", weights are doc counts normalized by the total document count, giving their relative frequency of occurrence; if "idf", weights are the log of the inverse relative frequencies, i.e. log(n_docs / word_doc_count) or, if smooth_idf is True, log(1 + (n_docs / word_doc_count)).
  - smooth_idf – If True, add 1 to all word doc counts when calculating "idf" weighting, equivalent to adding a single document to the corpus containing every unique word.
- Returns
  Mapping of a unique word id or string (depending on the value of by) to the number of documents in which it appears, weighted as absolute, relative, or inverse frequency of occurrence (depending on the value of weighting).

See also
textacy.vsm.get_doc_freqs()
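As a worked example of the "idf" weighting: in a 56-doc corpus, a word appearing in 28 documents gets log(56 / 28) = log 2 ≈ 0.693 without smoothing, or log(1 + 56/28) = log 3 ≈ 1.099 with smooth_idf=True:

>>> import math
>>> n_docs, word_doc_count = 56, 28  # illustrative values
>>> round(math.log(n_docs / word_doc_count), 3)
0.693
>>> round(math.log(1 + n_docs / word_doc_count), 3)
1.099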
- agg_metadata(name: str, agg_func: Callable[[Iterable[Any]], Any], default: Optional[Any] = None) → Any[source]¶

Aggregate values for a particular metadata field over all documents in Corpus.

- Parameters
  - name – Name of metadata field (key) in Doc._.meta.
  - agg_func – Callable that accepts an iterable of field values and outputs a single, aggregated result.
  - default – Default field value to use if name is not found in a given document's metadata.
- Returns
  Aggregated value for metadata field.
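A sketch, assuming docs carry a numeric "year" metadata field (the key and defaults are illustrative):

>>> import statistics
>>> earliest_year = corpus.agg_metadata("year", min, default=9999)
>>> mean_year = corpus.agg_metadata("year", statistics.mean, default=0)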
- save(filepath: types.PathLike, attrs: Optional[str | Iterable[str]] = 'auto', store_user_data: bool = True)[source]¶

Save Corpus to disk as binary data.

- Parameters
  - filepath – Full path to file on disk where Corpus docs data will be saved as a binary file.
  - attrs – List of token attributes to serialize; if "auto", an appropriate list is inferred from annotations found on the first Doc; if None, spaCy's default values are used (https://spacy.io/api/docbin#init).
  - store_user_data – If True, store user data and values of custom extension attributes along with core spaCy attributes.

See also
- textacy.io.write_spacy_docs()
- spacy.tokens.DocBin
- classmethod load(lang: Union[str, pathlib.Path, spacy.language.Language], filepath: Union[str, pathlib.Path]) → Corpus[source]¶

Load previously saved Corpus binary data, reproduce the original spacy.tokens.Docs' tokens and annotations, and instantiate a new Corpus from them.

- Parameters
  - lang – Language with which spaCy processed the original documents, passed as the name of a spaCy language pipeline, the path on disk to it, or an already-instantiated spacy.language.Language object.
  - filepath – Full path to file on disk where Corpus data was previously saved as a binary file.
- Returns
  Initialized corpus.

See also
- textacy.io.read_spacy_docs()
- spacy.tokens.DocBin
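A round-trip sketch tying save and load together (the file path is illustrative):

>>> corpus.save("./cw_sample.bin.gz")
>>> restored = textacy.Corpus.load("en_core_web_sm", "./cw_sample.bin.gz")
>>> restored.n_docs == corpus.n_docs
True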
Doc Extensions¶

| get_preview       | Get a short preview of the Doc, including the number of tokens and an initial snippet. |
| get_meta          | Get custom metadata added to Doc. |
| set_meta          | Add custom metadata to Doc. |
| to_tokenized_text | Transform Doc into an ordered, nested list of token-texts for each sentence. |
| to_bag_of_words   | Transform a Doc or Span into a bag-of-words. |
| to_bag_of_terms   | Transform a Doc or Span into a bag-of-terms. |
textacy.extensions: Inspect, extend, and transform spaCy's core Doc data structure, either directly via functions that take a Doc as their first arg, or as custom attributes / methods on instantiated docs prefixed with an underscore:
>>> doc = textacy.make_spacy_doc("This is a short text.", "en_core_web_sm")
>>> print(get_preview(doc))
Doc(6 tokens: "This is a short text.")
>>> print(doc._.preview)
Doc(6 tokens: "This is a short text.")
- textacy.extensions.get_preview(doc: spacy.tokens.doc.Doc) → str[source]¶
Get a short preview of the Doc, including the number of tokens and an initial snippet.
- textacy.extensions.get_meta(doc: spacy.tokens.doc.Doc) → dict[source]¶
Get custom metadata added to Doc.
- textacy.extensions.set_meta(doc: spacy.tokens.doc.Doc, value: dict) → None[source]¶
Add custom metadata to Doc.
- textacy.extensions.to_tokenized_text(doc: spacy.tokens.doc.Doc) → List[List[str]][source]¶

Transform doc into an ordered, nested list of token-texts for each sentence.

- Parameters
  doc – Document from which to extract token-texts.
- Returns
  A list of tokens' texts for each sentence in doc.

Note
If doc hasn't been segmented into sentences, the entire document is treated as a single sentence.
- textacy.extensions.to_bag_of_words(doclike: types.DocLike, *, by: str = 'lemma_', weighting: str = 'count', **kwargs) → Dict[int, int | float] | Dict[str, int | float][source]¶

Transform a Doc or Span into a bag-of-words: the set of unique words therein mapped to their absolute, relative, or binary frequencies of occurrence.

- Parameters
  - doclike – Doc or Span from which to count words.
  - by – Attribute by which spaCy Tokens are grouped before counting, as given by getattr(token, by). If "lemma", tokens are grouped by their base form w/o inflectional suffixes; if "lower", by the lowercase form of the token text; if "norm", by the normalized form of the token text; if "orth", by the token text exactly as it appears in doc. To output keys as strings, simply append an underscore to any of these; for example, "lemma_" creates a bag whose keys are token lemmas as strings.
  - weighting – Type of weighting to assign to unique words given by by. If "count", weights are the absolute number of occurrences (i.e. counts); if "freq", weights are counts normalized by the total token count, giving their relative frequency of occurrence; if "binary", weights are set equal to 1.
  - **kwargs – Passed directly on to textacy.extract.words():
    - filter_stops: If True, stop words are removed before counting.
    - filter_punct: If True, punctuation tokens are removed before counting.
    - filter_nums: If True, number-like tokens are removed before counting.
- Returns
  Mapping of a unique word id or string (depending on the value of by) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).
Note
For “freq” weighting, the resulting set of frequencies won’t (necessarily) sum to 1.0, since all tokens are used when normalizing counts but some (punctuation, stop words, etc.) may be filtered out of the bag afterwards.
See also
textacy.extract.words()
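A sketch with string keys and relative frequencies; per the note above, filtered-out punctuation still counts toward the normalizing total:

>>> doc = textacy.make_spacy_doc("To be, or not to be, that is the question.", "en_core_web_sm")
>>> bow = doc._.to_bag_of_words(by="lemma_", weighting="freq", filter_punct=True)
>>> round(bow["question"], 3)  # 1 occurrence / 13 total tokens
0.077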
- textacy.extensions.to_bag_of_terms(doclike: types.DocLike, *, by: str = 'lemma_', weighting: str = 'count', ngs: Optional[int | Collection[int] | types.DocLikeToSpans] = None, ents: Optional[bool | types.DocLikeToSpans] = None, ncs: Optional[bool | types.DocLikeToSpans] = None, dedupe: bool = True) → Dict[str, int] | Dict[str, float][source]¶

Transform a Doc or Span into a bag-of-terms: the set of unique terms therein mapped to their absolute, relative, or binary frequencies of occurrence, where "terms" may be a combination of n-grams, entities, and/or noun chunks.

- Parameters
  - doclike – Doc or Span from which to extract terms.
  - by – Attribute by which spaCy Spans are grouped before counting, as given by getattr(token, by). If "lemma", tokens are counted by their base form w/o inflectional suffixes; if "lower", by the lowercase form of the token text; if "orth", by the token text exactly as it appears in doc. To output keys as strings, simply append an underscore to any of these; for example, "lemma_" creates a bag whose keys are token lemmas as strings.
  - weighting – Type of weighting to assign to unique terms given by by. If "count", weights are the absolute number of occurrences (i.e. counts); if "freq", weights are counts normalized by the total token count, giving their relative frequency of occurrence; if "binary", weights are set equal to 1.
  - ngs – N-gram terms to be extracted. If one or multiple ints, textacy.extract.ngrams(doclike, n=ngs) is used to extract terms; if a callable, ngs(doclike) is used to extract terms; if None, no n-gram terms are extracted.
  - ents – Entity terms to be extracted. If True, textacy.extract.entities(doclike) is used to extract terms; if a callable, ents(doclike) is used to extract terms; if None, no entity terms are extracted.
  - ncs – Noun chunk terms to be extracted. If True, textacy.extract.noun_chunks(doclike) is used to extract terms; if a callable, ncs(doclike) is used to extract terms; if None, no noun chunk terms are extracted.
  - dedupe – If True, deduplicate terms whose spans are extracted by multiple types (e.g. a span that is both an n-gram and an entity), as identified by identical (start, stop) indexes in doclike; otherwise, don't.
- Returns
  Mapping of a unique term id or string (depending on the value of by) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).
See also
textacy.extract.terms()
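A sketch combining bigram and entity terms with string keys (the text is illustrative, and the extracted terms depend on the pipeline's annotations):

>>> doc = textacy.make_spacy_doc(
...     "Burton DeWilde wrote textacy, a Python library built on spaCy.",
...     "en_core_web_sm",
... )
>>> bot = doc._.to_bag_of_terms(by="lemma_", weighting="count", ngs=2, ents=True)
>>> # keys mix bigram lemmas and (deduped) entity lemmas, e.g. "Burton DeWilde"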
- textacy.extensions.get_doc_extensions() → Dict[str, Dict[str, Any]][source]¶
Get textacy's custom property and method doc extensions that can be set on or removed from the global spacy.tokens.Doc.