Lang, Doc, Corpus
textacy.spacier.core: Convenient entry point for loading spaCy language pipelines and making spaCy docs.
-
textacy.spacier.core.load_spacy_lang(name: Union[str, pathlib.Path], disable: Optional[Tuple[str, ...]] = None, allow_blank: bool = False) → spacy.language.Language

Load a spaCy Language: a shared vocabulary and language-specific data for tokenizing text, and (if available) model data and a processing pipeline containing a sequence of components for annotating a document. An LRU cache saves languages in memory for quick reloading.

>>> en_nlp = textacy.load_spacy_lang("en")
>>> en_nlp = textacy.load_spacy_lang("en_core_web_sm")
>>> en_nlp = textacy.load_spacy_lang("en", disable=("parser",))
>>> textacy.load_spacy_lang("ar")
...
OSError: [E050] Can't find model 'ar'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
>>> textacy.load_spacy_lang("ar", allow_blank=True)
<spacy.lang.ar.Arabic at 0x126418550>
- Parameters
name – spaCy language to load. Can be a shortcut link, a full package name, a path to a model directory, or a 2-letter ISO language code for which spaCy has language data.
disable – Names of pipeline components to disable, if any.
Note
Although spaCy’s API specifies this argument as a list, here we require a tuple. Pipelines are stored in the LRU cache with unique identifiers generated from the hash of the function name and args, and lists aren’t hashable.
allow_blank – If True, allow loading of blank spaCy Languages; if False, raise an OSError if a full processing pipeline isn’t available. Note that spaCy Docs produced by blank languages are missing key functionality, e.g. POS tags, entities, sentences.
- Returns
A loaded spaCy Language.
- Raises
OSError – If a full processing pipeline isn’t available and allow_blank is False.
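
The tuple requirement on disable follows from how hash-based caches build their keys. The sketch below uses only the standard library and assumes `functools.lru_cache` as a stand-in for the cache described above; `load_pipeline` is a hypothetical placeholder, not a textacy or spaCy function.

```python
from functools import lru_cache


@lru_cache(maxsize=32)
def load_pipeline(name, disable=None):
    """Hypothetical stand-in for an expensive pipeline load."""
    # A real loader would read model data from disk here.
    return {"name": name, "disable": disable}


# Tuples are hashable, so the second call is served from the cache
# and returns the very same object as the first call.
first = load_pipeline("en", disable=("parser",))
second = load_pipeline("en", disable=("parser",))
served_from_cache = first is second

# Lists are not hashable, so the cache cannot build a key for this call.
try:
    load_pipeline("en", disable=["parser"])
    list_error = None
except TypeError as err:
    list_error = str(err)
```

With a list argument the call fails before the wrapped function ever runs, which is why the tuple form is required rather than merely preferred.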
-
textacy.spacier.core.make_spacy_doc(data: Union[str, Tuple[str, dict], spacy.tokens.doc.Doc], lang: Union[str, Callable[[str], str], spacy.language.Language] = <bound method LangIdentifier.identify_lang of <textacy.lang_utils.LangIdentifier object>>) → spacy.tokens.doc.Doc

Make a spacy.tokens.Doc from valid inputs, and automatically load/validate spacy.language.Language pipelines to process data.

Make a Doc from text:

>>> text = "To be, or not to be, that is the question."
>>> doc = make_spacy_doc(text)
>>> doc._.preview
'Doc(13 tokens: "To be, or not to be, that is the question.")'

Make a Doc from a (text, metadata) pair, aka a “record”:

>>> record = (text, {"author": "Shakespeare, William"})
>>> doc = make_spacy_doc(record)
>>> doc._.preview
'Doc(13 tokens: "To be, or not to be, that is the question.")'
>>> doc._.meta
{'author': 'Shakespeare, William'}

Specify the language / Language pipeline used to process the text, or don’t:

>>> make_spacy_doc(text)
>>> make_spacy_doc(text, lang="en")
>>> make_spacy_doc(text, lang="en_core_web_sm")
>>> make_spacy_doc(text, lang=textacy.load_spacy_lang("en"))
>>> make_spacy_doc(text, lang=textacy.lang_utils.identify_lang)

Ensure that an already-processed Doc is compatible with lang:

>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang(text)
>>> make_spacy_doc(doc, lang="en")
>>> make_spacy_doc(doc, lang="es")
...
ValueError: lang of spacy pipeline used to process document ('en') must be the same as `lang` ('es')
- Parameters
data – Make a spacy.tokens.Doc from a text or (text, metadata) pair. If already a Doc, ensure that it’s compatible with lang to avoid surprises downstream, and return it as-is.
lang – Language with which spaCy processes (or processed) data.
If known, pass a standard 2-letter language code (e.g. “en”), or the name of a spaCy language pipeline (e.g. “en_core_web_md”), or an already-instantiated spacy.language.Language object. If not known, pass a function that takes unicode text as input and outputs a standard 2-letter language code.
A given / detected language string is then used to instantiate a corresponding Language with all default components enabled.
- Returns
Processed spaCy Doc.
- Raises

textacy.corpus: Class for working with a collection of spaCy Docs. Includes functionality for easily adding, getting, and removing documents; saving to / loading their data from disk; and tracking basic corpus statistics.
-
class textacy.corpus.Corpus(lang: Union[str, spacy.language.Language], data: Optional[Union[str, spacy.tokens.doc.Doc, Tuple[str, dict], Iterable[str], Iterable[spacy.tokens.doc.Doc], Iterable[Tuple[str, dict]]]] = None)

An ordered collection of spacy.tokens.Doc, all of the same language and sharing the same spacy.language.Language processing pipeline and vocabulary, with data held in-memory.

Initialize from a language / Language and (optionally) one or a stream of texts or (text, metadata) pairs:

>>> ds = textacy.datasets.CapitolWords()
>>> records = ds.records(limit=50)
>>> corpus = textacy.Corpus("en", data=records)
>>> print(corpus)
Corpus(50 docs, 32175 tokens)

Add or remove documents, with automatic updating of corpus statistics:

>>> texts = ds.texts(congress=114, limit=25)
>>> corpus.add(texts)
>>> corpus.add("If Burton were a member of Congress, here's what he'd say.")
>>> print(corpus)
Corpus(76 docs, 55906 tokens)
>>> corpus.remove(lambda doc: doc._.meta.get("speaker_name") == "Rick Santorum")
>>> print(corpus)
Corpus(61 docs, 48567 tokens)

Get subsets of documents matching your particular use case:

>>> match_func = lambda doc: doc._.meta.get("speaker_name") == "Bernie Sanders"
>>> for doc in corpus.get(match_func, limit=3):
...     print(doc._.preview)
Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")
Doc(336 tokens: "Mr. Speaker, I thank the gentleman for yielding...")
Doc(177 tokens: "Mr. Speaker, if we want to understand why in th...")

Get or remove documents by indexing, too:

>>> corpus[0]._.preview
'Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")'
>>> [doc._.preview for doc in corpus[:3]]
['Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")',
 'Doc(219 tokens: "Mr. Speaker, a relationship, to work and surviv...")',
 'Doc(336 tokens: "Mr. Speaker, I thank the gentleman for yielding...")']
>>> del corpus[:5]
>>> print(corpus)
Corpus(56 docs, 41573 tokens)

Compute basic corpus statistics:

>>> corpus.n_docs, corpus.n_sents, corpus.n_tokens
(56, 1771, 41573)
>>> word_counts = corpus.word_counts(as_strings=True)
>>> sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:5]
[('-PRON-', 2553), ('people', 215), ('year', 148), ('Mr.', 139), ('$', 137)]
>>> word_doc_counts = corpus.word_doc_counts(weighting="freq", as_strings=True)
>>> sorted(word_doc_counts.items(), key=lambda x: x[1], reverse=True)[:5]
[('-PRON-', 0.9821428571428571), ('Mr.', 0.7678571428571429), ('President', 0.5), ('people', 0.48214285714285715), ('need', 0.44642857142857145)]

Save corpus data to and load from disk:

>>> corpus.save("~/Desktop/capitol_words_sample.bin.gz")
>>> corpus = textacy.Corpus.load("en", "~/Desktop/capitol_words_sample.bin.gz")
>>> print(corpus)
Corpus(56 docs, 41573 tokens)

- Parameters
lang – Language with which spaCy processes (or processed) all documents added to the corpus, whether as data now or later.
Pass a standard 2-letter language code (e.g. “en”), or the name of a spaCy language pipeline (e.g. “en_core_web_md”), or an already-instantiated spacy.language.Language object.
A given / detected language string is then used to instantiate a corresponding Language with all default components enabled.
data (obj or Iterable[obj]) – One or a stream of texts, records, or spacy.tokens.Docs to be added to the corpus.
See also
-
lang
-
spacy_lang
-
docs
-
n_docs
-
n_sents
-
n_tokens
-
add(data: Union[str, spacy.tokens.doc.Doc, Tuple[str, dict], Iterable[str], Iterable[spacy.tokens.doc.Doc], Iterable[Tuple[str, dict]]], batch_size: int = 1000, n_process: int = 1) → None

Add one or a stream of texts, records, or spacy.tokens.Docs to the corpus, ensuring that all processing is or has already been done by the Corpus.spacy_lang pipeline.
- Parameters
data –
batch_size – Number of texts to buffer when processing with spaCy.
n_process – Number of parallel processors to run when processing. If -1, this is set to multiprocessing.cpu_count().
Note
This feature is only available in spaCy 2.2.2+, and only applies when data is a sequence of texts or records.
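
The roles of batch_size and n_process can be illustrated with a small standard-library sketch. It is not spaCy’s or textacy’s implementation: `resolve_n_process` and `iter_batches` are hypothetical helpers that show how a stream might be buffered into fixed-size batches and how -1 maps to the CPU count.

```python
import itertools
import multiprocessing


def resolve_n_process(n_process):
    """Hypothetical helper: map -1 to the number of available CPUs."""
    return multiprocessing.cpu_count() if n_process == -1 else n_process


def iter_batches(items, batch_size):
    """Yield lists of at most ``batch_size`` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# A generator stands in for a stream of texts; it is consumed only once.
texts = (f"text {i}" for i in range(5))
batches = list(iter_batches(texts, batch_size=2))
workers = resolve_n_process(-1)
```

Buffering this way keeps only one batch in memory at a time, which matters when data is a large stream and not a list.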
-
add_text(text: str) → None

Add one text to the corpus, processing it into a spacy.tokens.Doc using the Corpus.spacy_lang pipeline.
- Parameters
text (str) –
-
add_texts(texts: Iterable[str], batch_size: int = 1000, n_process: int = 1) → None

Add a stream of texts to the corpus, efficiently processing them into spacy.tokens.Docs using the Corpus.spacy_lang pipeline.
- Parameters
texts – Sequence of texts to process and add to corpus.
batch_size – Number of texts to buffer when processing with spaCy.
n_process – Number of parallel processors to run when processing. If -1, this is set to multiprocessing.cpu_count().
Note
This feature is only available in spaCy 2.2.2+.
-
add_record(record: Tuple[str, Dict[Any, Any]]) → None

Add one record to the corpus, processing it into a spacy.tokens.Doc using the Corpus.spacy_lang pipeline.
- Parameters
record –
-
add_records(records: Iterable[Tuple[str, dict]], batch_size: int = 1000, n_process: int = 1) → None

Add a stream of records to the corpus, efficiently processing them into spacy.tokens.Docs using the Corpus.spacy_lang pipeline.
- Parameters
records – Sequence of records to process and add to corpus.
batch_size – Number of texts to buffer when processing with spaCy.
n_process – Number of parallel processors to run when processing. If -1, this is set to multiprocessing.cpu_count().
Note
This feature is only available in spaCy 2.2.2+.
-
add_doc(doc: spacy.tokens.doc.Doc) → None

Add one spacy.tokens.Doc to the corpus, provided it was processed using the Corpus.spacy_lang pipeline.
- Parameters
doc –
-
add_docs(docs: Iterable[spacy.tokens.doc.Doc]) → None

Add a stream of spacy.tokens.Docs to the corpus, provided they were processed using the Corpus.spacy_lang pipeline.
- Parameters
docs –
-
get(match_func: Callable[[spacy.tokens.doc.Doc], bool], limit: Optional[int] = None) → Iterator[spacy.tokens.doc.Doc]

Get all (or N <= limit) docs in Corpus for which match_func(doc) is True.
- Parameters
match_func – Function that takes a spacy.tokens.Doc as input and returns a boolean value. For example:
Corpus.get(lambda x: len(x) >= 100)
gets all docs with at least 100 tokens. And:
Corpus.get(lambda doc: doc._.meta["author"] == "Burton DeWilde")
gets all docs whose author was given as “Burton DeWilde”.
limit – Maximum number of matched docs to return.
- Yields
spacy.tokens.Doc – Next document passing match_func.
Tip
To get doc(s) by index, treat Corpus as a list and use Python’s usual indexing and slicing: Corpus[0] gets the first document in the corpus; Corpus[:5] gets the first 5; etc.
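
The match-and-limit behavior described for get can be sketched as a lazy generator. This is an illustration under stated assumptions, not textacy’s code: `get_matching` is a hypothetical function, and plain dicts stand in for spaCy Docs and their metadata.

```python
import itertools


def get_matching(docs, match_func, limit=None):
    """Lazily yield docs for which match_func(doc) is true, up to ``limit``."""
    matches = (doc for doc in docs if match_func(doc))
    # islice with a stop of None places no cap on the number of matches.
    yield from itertools.islice(matches, limit)


# Plain dicts stand in for spaCy Docs and their metadata.
docs = [
    {"author": "Burton DeWilde", "n_tokens": 120},
    {"author": "Someone Else", "n_tokens": 80},
    {"author": "Burton DeWilde", "n_tokens": 40},
]
by_author = list(get_matching(docs, lambda d: d["author"] == "Burton DeWilde"))
first_long = list(get_matching(docs, lambda d: d["n_tokens"] >= 50, limit=1))
```

Because the generator stops as soon as limit matches are found, later documents are never tested against match_func.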
-
remove(match_func: Callable[[spacy.tokens.doc.Doc], bool], limit: Optional[int] = None) → None

Remove all (or N <= limit) docs in Corpus for which match_func(doc) is True. Corpus doc/sent/token counts are adjusted accordingly.
- Parameters
match_func – Function that takes a spacy.tokens.Doc and returns a boolean value. For example:
Corpus.remove(lambda x: len(x) >= 100)
removes docs with at least 100 tokens. And:
Corpus.remove(lambda doc: doc._.meta["author"] == "Burton DeWilde")
removes docs whose author was given as “Burton DeWilde”.
limit – Maximum number of matched docs to remove.
Tip
To remove doc(s) by index, treat Corpus as a list and use Python’s usual indexing and slicing: del Corpus[0] removes the first document in the corpus; del Corpus[:5] removes the first 5; etc.
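
In-place removal with a limit, and the effect on document and token counts, can be sketched with plain lists. `remove_matching` is a hypothetical helper and lists of strings stand in for tokenized Docs; how textacy tracks its counts internally is not shown here.

```python
def remove_matching(docs, match_func, limit=None):
    """Remove up to ``limit`` matching docs in place; return how many were removed."""
    matched = [i for i, doc in enumerate(docs) if match_func(doc)]
    if limit is not None:
        matched = matched[:limit]
    # Delete from the end so that earlier indexes stay valid.
    for i in reversed(matched):
        del docs[i]
    return len(matched)


# Lists of tokens stand in for spaCy Docs.
docs = [["a", "b", "c"], ["d"], ["e", "f", "g", "h"], ["i", "j", "k"]]
n_tokens_before = sum(len(doc) for doc in docs)

removed = remove_matching(docs, lambda doc: len(doc) >= 3, limit=2)

n_docs_after = len(docs)
n_tokens_after = sum(len(doc) for doc in docs)
```

Three documents match the predicate here, but limit=2 removes only the first two of them, so the last matching document stays in the collection.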
-
property vectors
Constituent docs’ word vectors stacked in a 2d array.
-
property vector_norms
Constituent docs’ L2-normalized word vectors stacked in a 2d array.
-
word_counts(*, normalize: str = 'lemma', weighting: str = 'count', as_strings: bool = False, filter_stops: bool = True, filter_punct: bool = True, filter_nums: bool = False) → Dict[Union[int, str], Union[int, float]]

Map the set of unique words in Corpus to their counts as absolute, relative, or binary frequencies of occurrence, similar to Doc._.to_bag_of_words() but aggregated over all docs.
- Parameters
normalize – If “lemma”, lemmatize words before counting; if “lower”, lowercase words before counting; otherwise, words are counted using the form with which they appear.
weighting ({"count", "freq"}) – Type of weight to assign to words. If “count” (default), weights are the absolute number of occurrences (count) of word in corpus. If “freq”, word counts are normalized by the total token count, giving their relative frequencies of occurrence.
Note
The resulting set of frequencies won’t (necessarily) sum to 1.0, since punctuation and stop words are filtered out after counts are normalized.
as_strings – If True, words are returned as strings; if False (default), words are returned as their unique integer ids.
filter_stops – If True (default), stop word counts are removed.
filter_punct – If True (default), punctuation counts are removed.
filter_nums – If True, number counts are removed.
- Returns
Mapping of a unique word id or string (depending on the value of as_strings) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).
See also
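
The difference between “count” and “freq” weighting, and why filtered frequencies need not sum to 1.0, can be seen in a small standard-library sketch. This `word_counts` function is a simplified stand-in, not textacy’s method: token lists replace Docs, and `STOP_WORDS` is an illustrative stop list, not spaCy’s.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of"}  # illustrative stop list, not spaCy's


def word_counts(docs, weighting="count", filter_stops=True):
    """Count words over all docs, optionally as relative frequencies."""
    counts = Counter(word for doc in docs for word in doc)
    if weighting == "freq":
        n_tokens = sum(counts.values())
        weights = {word: count / n_tokens for word, count in counts.items()}
    else:
        weights = dict(counts)
    # Stop words are dropped after normalization, so the remaining
    # frequencies no longer account for every token.
    if filter_stops:
        weights = {w: v for w, v in weights.items() if w not in STOP_WORDS}
    return weights


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat"]]
counts = word_counts(docs)
freqs = word_counts(docs, weighting="freq")
freq_total = sum(freqs.values())
```

In this toy corpus three of the eight tokens are stop words, so the surviving frequencies add up to 5/8 and not 1.0.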
-
word_doc_counts(*, normalize: str = 'lemma', weighting: str = 'count', smooth_idf: bool = True, as_strings: bool = False, filter_stops: bool = True, filter_punct: bool = True, filter_nums: bool = True) → Dict[Union[int, str], Union[int, float]]

Map the set of unique words in Corpus to their document counts as absolute, relative, inverse, or binary frequencies of occurrence.
- Parameters
normalize – If “lemma”, lemmatize words before counting; if “lower”, lowercase words before counting; otherwise, words are counted using the form with which they appear.
weighting ({"count", "freq", "idf"}) – Type of weight to assign to words. If “count” (default), weights are the absolute number (count) of documents in which word appears. If “freq”, word doc counts are normalized by the total document count, giving their relative frequencies of occurrence. If “idf”, weights are the log of the inverse relative frequencies: log(n_docs / word_doc_count) or (if smooth_idf is True) log(1 + (n_docs / word_doc_count)).
smooth_idf – If True, add 1 to all word doc counts when calculating “idf” weighting, equivalent to adding a single document to the corpus containing every unique word.
as_strings – If True, words are returned as strings; if False (default), words are returned as their unique integer ids.
filter_stops – If True (default), stop word counts are removed.
filter_punct – If True (default), punctuation counts are removed.
filter_nums – If True (default), number counts are removed.
- Returns
Mapping of a unique word id or string (depending on the value of as_strings) to the number of documents in which it appears weighted as absolute, relative, or binary frequencies (depending on the value of weighting).
See also
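
The three weightings for document counts can be worked through by hand on a toy corpus. The sketch below is a simplified stand-in that applies the two idf formulas exactly as written in the weighting description; it skips normalization and filtering, and token lists replace Docs.

```python
import math
from collections import Counter


def word_doc_counts(docs, weighting="count", smooth_idf=True):
    """Weight each word by the number of docs in which it appears."""
    n_docs = len(docs)
    # set(doc) ensures a word is counted at most once per document.
    doc_counts = Counter(word for doc in docs for word in set(doc))
    if weighting == "freq":
        return {word: count / n_docs for word, count in doc_counts.items()}
    if weighting == "idf":
        if smooth_idf:
            return {w: math.log(1 + n_docs / c) for w, c in doc_counts.items()}
        return {w: math.log(n_docs / c) for w, c in doc_counts.items()}
    return dict(doc_counts)


docs = [["cat", "cat", "sat"], ["dog", "sat"], ["cat", "ran"], ["sat"]]
counts = word_doc_counts(docs)
freqs = word_doc_counts(docs, weighting="freq")
idf = word_doc_counts(docs, weighting="idf", smooth_idf=False)
smooth = word_doc_counts(docs, weighting="idf")
```

Here “cat” appears twice in the first document but counts once for it, giving a document count of 2 out of 4, an unsmoothed idf of log(2), and a smoothed idf of log(3).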
-
save(filepath: Union[str, pathlib.Path], store_user_data: bool = True) → None

Save Corpus to disk as binary data.
- Parameters
filepath – Full path to file on disk where Corpus data will be saved as a binary file.
store_user_data – If True, store user data and values of custom extension attributes along with core spaCy attributes.
See also
spacy.tokens.DocBin
-
classmethod load(lang: Union[str, spacy.language.Language], filepath: Union[str, pathlib.Path], store_user_data: bool = True) → textacy.corpus.Corpus

Load previously saved Corpus binary data, reproduce the original spacy.tokens.Docs’ tokens and annotations, and instantiate a new Corpus from them.
- Parameters
lang –
filepath – Full path to file on disk where Corpus data was previously saved as a binary file.
store_user_data – If True, load stored user data and values of custom extension attributes along with core spaCy attributes.
- Returns
See also
spacy.tokens.DocBin