Miscellany¶
identify_lang: Identify the most probable language of text, with or without the corresponding probability.
identify_topn_langs: Identify the topn most probable languages of text, with or without the corresponding probabilities.
get_config: Get key configuration info about dev environment: OS, python, spacy, and textacy.
print_markdown: Print items as a markdown-formatted list.
is_record: Check whether obj is a “record” – that is, a (text, metadata) 2-tuple.
to_collection: Validate and cast a value or values to a collection.
to_bytes: Coerce string s to bytes.
to_unicode: Coerce string s to unicode.
to_path: Coerce path to a pathlib.Path.
validate_set_members: Validate values that must be of a certain type and (optionally) found among a set of known valid values.
validate_and_clip_range: Validate and clip range values.
Language Identification¶
textacy.lang_id
: Interface for de/serializing a language identification model,
and using it to identify the most probable language(s) of a given text. Inspired by
Google’s Compact Language Detector v3 (https://github.com/google/cld3) and
implemented with thinc
v8.0.
Model¶
Character unigrams, bigrams, and trigrams are extracted separately from the first 1000 characters of lower-cased input text. Each collection of ngrams is hash-embedded into a 100-dimensional space, then averaged. The resulting feature vectors are concatenated into a single embedding layer, then passed on to a dense layer with ReLU activation and finally a Softmax output layer. The model’s predictions give the probabilities for a text to be written in ~140 ISO 639-1 languages.
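To make the data flow concrete, here is a minimal, hypothetical sketch of such an architecture using thinc v8 combinators. The layer names, vocab size, and hidden width are illustrative assumptions, the text-to-ngram-id featurization step is omitted, and the library's actual model may differ in its details:

    # Hypothetical sketch, not textacy's actual implementation. Assumes the
    # input is a list of int arrays in which columns 0/1/2 hold hash ids for
    # character unigrams/bigrams/trigrams (padded to equal length).
    from thinc.api import (
        HashEmbed, Relu, Softmax, chain, concatenate, list2ragged,
        reduce_mean, with_array)

    NUM_LANGS = 140  # ~140 ISO 639-1 languages

    def build_lang_id_model(embed_width: int = 100, hidden_width: int = 512):
        # One column per ngram size: hash-embed each ngram id into a
        # 100-dimensional space, then average over the sequence.
        ngram_columns = [
            chain(
                list2ragged(),
                with_array(HashEmbed(nO=embed_width, nV=10000, column=i, seed=i + 1)),
                reduce_mean(),
            )
            for i in range(3)
        ]
        # Concatenate the averaged embeddings, then apply a dense ReLU layer
        # and a Softmax output over the language classes.
        return chain(
            concatenate(*ngram_columns),
            Relu(nO=hidden_width),
            Softmax(nO=NUM_LANGS),
        )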
Dataset¶
The model was trained on a randomized, stratified subset of ~375k texts drawn from several sources:
WiLi: A public dataset of short text extracts from Wikipedias in over 230 languages. Style is relatively formal; subject matter is “encyclopedic”. Source: https://zenodo.org/record/841984
Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
UDHR: The UN’s Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
Performance¶
The trained model achieved F1 = 0.97 when averaged over all languages.
A few languages have worse performance; for example, the two Norwegians (“nb” and “no”), as well as Bosnian (“bs”), Serbian (“sr”), and Croatian (“hr”), which are extremely similar to each other. See the textacy-data releases for more details: https://github.com/bdewilde/textacy-data/releases/tag/lang-identifier-v2.0
class textacy.lang_id.lang_identifier.LangIdentifier(version: float | str, data_dir: str | pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/lang_identifier'), model_base: Model = <thinc.model.Model object>)[source]¶

Parameters:
    version – Version of the trained model to load and use, e.g. 2.0.
    data_dir – Directory on disk in which model data is stored.
    model_base – Base thinc Model into which trained model data is loaded.
model¶

classes¶
save_model()[source]¶
Save trained LangIdentifier.model to disk, as bytes.
load_model() → thinc.model.Model[source]¶
Load trained model from bytes on disk, using LangIdentifier.model_base as the framework into which the data is fit.
download(force: bool = False)[source]¶
Download version-specific model data as a binary file and save it to disk at LangIdentifier.model_fpath.

Parameters:
    force – If True, download the model data, even if it already exists on disk under self.data_dir; otherwise, don’t.
identify_lang(text: str, with_probs: bool = False) → str | Tuple[str, float][source]¶
Identify the most probable language of text, with or without the corresponding probability.

Parameters:
    text – Text whose language is to be identified.
    with_probs – If True, also return the probability of the prediction.

Returns:
    ISO 639-1 standard language code of the most probable language, optionally with its probability.
identify_topn_langs(text: str, topn: int = 3, with_probs: bool = False) → List[str] | List[Tuple[str, float]][source]¶
Identify the topn most probable languages of text, with or without the corresponding probabilities.

Parameters:
    text – Text whose languages are to be identified.
    topn – Number of most probable languages to return.
    with_probs – If True, also return the probability of each prediction.

Returns:
    ISO 639-1 standard language codes of the topn most probable languages, optionally with their probabilities.
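A brief usage sketch for the class; the version value and the predicted code are illustrative:

    >>> from textacy.lang_id.lang_identifier import LangIdentifier
    >>> lang_identifier = LangIdentifier(version=2.0)
    >>> lang_identifier.download()  # no-op if model data is already on disk
    >>> lang_identifier.identify_lang("Ich bin ein Berliner.")
    'de'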
textacy.lang_id.lang_identifier.identify_lang(text: str, with_probs: bool = False) → str | Tuple[str, float]¶
Identify the most probable language of text, with or without the corresponding probability.

Parameters:
    text – Text whose language is to be identified.
    with_probs – If True, also return the probability of the prediction.

Returns:
    ISO 639-1 standard language code of the most probable language, optionally with its probability.
textacy.lang_id.lang_identifier.identify_topn_langs(text: str, topn: int = 3, with_probs: bool = False) → List[str] | List[Tuple[str, float]]¶
Identify the topn most probable languages of text, with or without the corresponding probabilities.

Parameters:
    text – Text whose languages are to be identified.
    topn – Number of most probable languages to return.
    with_probs – If True, also return the probability of each prediction.

Returns:
    ISO 639-1 standard language codes of the topn most probable languages, optionally with their probabilities.
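Typical usage of these module-level convenience functions; the predicted codes and probabilities shown here are illustrative:

    >>> from textacy.lang_id.lang_identifier import identify_lang, identify_topn_langs
    >>> identify_lang("This is an English sentence.")
    'en'
    >>> identify_topn_langs("Merhaba dünya!", topn=2, with_probs=True)
    [('tr', 0.98), ('az', 0.01)]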
Utilities¶
textacy.utils
: Variety of general-purpose utility functions for inspecting /
validating / transforming args and facilitating meta package tasks.
textacy.utils.deprecated(message: str, *, action: str = 'always')[source]¶
Show a deprecation warning, optionally filtered.

Parameters:
    message – Message to display with DeprecationWarning.
    action – Filter controlling whether warning is ignored, displayed, or turned into an error. For reference, see the warning filter actions in the Python warnings module docs.
textacy.utils.get_config() → Dict[str, Any][source]¶
Get key configuration info about dev environment: OS, python, spacy, and textacy.

Returns:
    dict
textacy.utils.print_markdown(items: Union[Dict[Any, Any], Iterable[Tuple[Any, Any]]])[source]¶
Print items as a markdown-formatted list. Specifically useful when submitting config info on GitHub issues.

Parameters:
    items – Items to print, as a dict or an iterable of (key, value) pairs.
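These two functions are handy together: since get_config() returns a dict, its output can be passed straight to print_markdown():

    >>> from textacy import utils
    >>> utils.print_markdown(utils.get_config())  # paste the output into a GitHub issue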
textacy.utils.is_record(obj: Any) → bool[source]¶
Check whether obj is a “record” – that is, a (text, metadata) 2-tuple.
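For example:

    >>> from textacy.utils import is_record
    >>> is_record(("A short document.", {"author": "Jane Doe"}))
    True
    >>> is_record("just a string")
    False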
textacy.utils.to_collection(val: Any, val_type: Union[Type[Any], Tuple[Type[Any], …]], col_type: Type[Any]) → Optional[Collection[Any]][source]¶
Validate and cast a value or values to a collection.
textacy.utils.to_bytes(s: Union[str, bytes], *, encoding: str = 'utf-8', errors: str = 'strict') → bytes[source]¶
Coerce string s to bytes.
textacy.utils.to_unicode(s: Union[str, bytes], *, encoding: str = 'utf-8', errors: str = 'strict') → str[source]¶
Coerce string s to unicode.
textacy.utils.to_path(path: Union[str, pathlib.Path]) → pathlib.Path[source]¶
Coerce path to a pathlib.Path.

Parameters:
    path – Path to coerce, as a string or pathlib.Path.

Returns:
    pathlib.Path
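Illustrative examples for these three coercion helpers (the PosixPath repr assumes a POSIX system):

    >>> from textacy.utils import to_bytes, to_path, to_unicode
    >>> to_bytes("naïve")
    b'na\xc3\xafve'
    >>> to_unicode(b"na\xc3\xafve")
    'naïve'
    >>> to_path("data/texts.csv")
    PosixPath('data/texts.csv')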
textacy.utils.validate_set_members(vals: Union[Any, Set[Any]], val_type: Union[Type[Any], Tuple[Type[Any], …]], valid_vals: Optional[Set[Any]] = None) → Set[Any][source]¶
Validate values that must be of a certain type and (optionally) found among a set of known valid values.

Parameters:
    vals – Value or values to validate.
    val_type – Type(s) of which all vals must be instances.
    valid_vals – Set of valid values in which all vals must be found.

Returns:
    Validated values.

Return type:
    Set[obj]

Raises:
    TypeError – If any vals are not instances of val_type.
    ValueError – If any vals are not found among valid_vals.
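An illustrative example, validating a single value against a set of known options:

    >>> from textacy.utils import validate_set_members
    >>> validate_set_members("lemma", str, valid_vals={"lemma", "lower", "orth"})
    {'lemma'}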
textacy.utils.validate_and_clip_range(range_vals: Tuple[Any, Any], full_range: Tuple[Any, Any], val_type: Optional[Union[Type[Any], Tuple[Type[Any], …]]] = None) → Tuple[Any, Any][source]¶
Validate and clip range values.

Parameters:
    range_vals – Range values, i.e. [start_val, end_val), to validate and, if necessary, clip. If None, the value is set to the corresponding value in full_range.
    full_range – Full range of values, i.e. [min_val, max_val), within which range_vals must lie.
    val_type – Type(s) of which all range_vals must be instances, unless val is None.

Returns:
    Range for which null or too-small/large values have been clipped to the min/max valid values.

Raises:
    TypeError – If range values are not instances of val_type.
    ValueError – If range values are otherwise invalid.
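An illustrative example, where a too-small start value is clipped and a null end value falls back to the full range's bound:

    >>> from textacy.utils import validate_and_clip_range
    >>> validate_and_clip_range((-5, None), (0, 100), val_type=int)
    (0, 100)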
textacy.utils.get_kwargs_for_func(func: Callable, kwargs: Dict[str, Any]) → Dict[str, Any][source]¶
Get the subset of keyword arguments from kwargs that are used by func. Useful when calling a func from another func and inferring its signature from provided **kwargs.
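A sketch of the expected filtering behavior, using a hypothetical function:

    >>> from textacy.utils import get_kwargs_for_func
    >>> def tokenize(text, lowercase=True, strip_punct=False):
    ...     ...
    >>> get_kwargs_for_func(tokenize, {"lowercase": False, "unrelated_arg": 1})
    {'lowercase': False}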
textacy.utils.text_to_char_ngrams(text: str, n: int, *, pad: bool = False) → Tuple[str, …][source]¶
Convert a text string into an ordered sequence of character ngrams.

Parameters:
    text – Text to convert into character ngrams.
    n – Number of characters to concatenate in each n-gram.
    pad – If True, pad text by adding n - 1 “_” characters on either side; if False, leave text as-is.

Returns:
    Ordered sequence of character ngrams.
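For example:

    >>> from textacy.utils import text_to_char_ngrams
    >>> text_to_char_ngrams("spacy", 3)
    ('spa', 'pac', 'acy')
    >>> text_to_char_ngrams("spacy", 3, pad=True)
    ('__s', '_sp', 'spa', 'pac', 'acy', 'cy_', 'y__')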
textacy.types
: Definitions for common object types used throughout the package.
class textacy.types.Record(text, meta)¶

text¶
    Alias for field number 0

meta¶
    Alias for field number 1
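Record is a named tuple, so its fields are accessible by name or by index:

    >>> from textacy.types import Record
    >>> record = Record(text="A short document.", meta={"author": "Jane Doe"})
    >>> record.text
    'A short document.'
    >>> record[1]
    {'author': 'Jane Doe'}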
textacy.errors
: Helper functions for making consistent errors.
textacy.cache
: Functionality for caching language data and other NLP resources.
Loading data from disk can be slow; let’s just do it once and forget about it. :)
textacy.cache.LRU_CACHE = LRUCache([], maxsize=2147483648, currsize=0)¶
Least Recently Used (LRU) cache for loaded data.

The max cache size may be set by the TEXTACY_MAX_CACHE_SIZE environment variable, where the value must be an integer (in bytes). Otherwise, the max size is 2GB.

Type:
    cachetools.LRUCache
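For example, to raise the cap to 4GB, set the environment variable before textacy is first imported (assuming the cache is created at import time):

    import os
    os.environ["TEXTACY_MAX_CACHE_SIZE"] = str(4 * 1024**3)  # value in bytes

    import textacy  # cache is now created with the larger max size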
spaCy Utils¶
textacy.spacier.utils
: Helper functions for working with / extending spaCy’s
core functionality.
textacy.spacier.utils.make_doc_from_text_chunks(text: str, lang: Union[str, pathlib.Path, spacy.language.Language], chunk_size: int = 100000) → spacy.tokens.doc.Doc[source]¶
Make a single spaCy-processed document from 1 or more chunks of text. This is a workaround for processing very long texts, for which spaCy is unable to allocate enough RAM.

Parameters:
    text – Text document to be chunked and processed by spaCy.
    lang – Language with which spaCy processes text, represented as the full name of or path on disk to the pipeline, or an already instantiated pipeline instance.
    chunk_size – Number of characters comprising each text chunk (excluding the last chunk, which is probably smaller). For best performance, value should be somewhere between 1e3 and 1e7, depending on how much RAM you have available.

Note: Since chunking is done by character, chunk edges probably won’t respect natural language segmentation, which means that every chunk_size characters, spaCy’s models may make mistakes.

Returns:
    A single processed document, built from concatenated text chunks.
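A usage sketch with an illustrative pipeline name:

    from textacy.spacier.utils import make_doc_from_text_chunks

    very_long_text = "..."  # e.g. the full text of a novel
    doc = make_doc_from_text_chunks(
        very_long_text, lang="en_core_web_sm", chunk_size=100_000)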
textacy.spacier.utils.merge_spans(spans: Iterable[spacy.tokens.span.Span], doc: spacy.tokens.doc.Doc) → None[source]¶
Merge spans into single tokens in doc, in-place.

Parameters:
    spans – Spans to merge, each a spacy.tokens.Span.
    doc – Document in which to merge the spans, a spacy.tokens.Doc.
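For example, to collapse each named entity into a single token (the pipeline name is illustrative):

    import spacy
    from textacy.spacier.utils import merge_spans

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Burton DeWilde wrote the textacy package.")
    merge_spans(doc.ents, doc)  # doc is modified in-place; returns None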
textacy.spacier.utils.preserve_case(token: spacy.tokens.token.Token) → bool[source]¶
Return True if token is a proper noun or acronym; otherwise, False.

Raises:
    ValueError – If parent document has not been POS-tagged.
textacy.spacier.utils.get_normalized_text(span_or_token: Span | Token) → str[source]¶
Get the text of a spaCy span or token, normalized depending on its characteristics. For proper nouns and acronyms, text is returned as-is; for everything else, text is lemmatized.
textacy.spacier.utils.get_main_verbs_of_sent(sent: spacy.tokens.span.Span) → List[spacy.tokens.token.Token][source]¶
Return the main (non-auxiliary) verbs in a sentence.
textacy.spacier.utils.get_subjects_of_verb(verb: spacy.tokens.token.Token) → List[spacy.tokens.token.Token][source]¶
Return all subjects of a verb according to the dependency parse.
textacy.spacier.utils.get_objects_of_verb(verb: spacy.tokens.token.Token) → List[spacy.tokens.token.Token][source]¶
Return all objects of a verb according to the dependency parse, including open clausal complements.