

Language Identification

textacy.lang_id: Interface for de/serializing a language identification model, and using it to identify the most probable language(s) of a given text. Inspired by Google’s Compact Language Detector v3 (https://github.com/google/cld3) and implemented with thinc v8.0.


Character unigrams, bigrams, and trigrams are extracted separately from the first 1000 characters of lower-cased input text. Each collection of ngrams is hash-embedded into a 100-dimensional space, then averaged. The resulting feature vectors are concatenated into a single embedding layer, then passed on to a dense layer with ReLU activation and finally a Softmax output layer. The model’s predictions give the probabilities for a text to be written in ~140 ISO 639-1 languages.
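
The feature-extraction step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the model's actual code: the function names, the use of zlib.crc32 as the hash, and the bucket count are all hypothetical, and the real model learns a 100-dimensional vector per bucket via thinc rather than returning raw bucket ids.

```python
import zlib
from typing import Dict, List

MAX_CHARS = 1000  # "first 1000 characters", per the description above


def char_ngrams(text: str, n: int) -> List[str]:
    """Return every run of n consecutive characters, in order."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]


def hashed_ngram_ids(text: str, num_buckets: int = 5000) -> Dict[int, List[int]]:
    """Map unigrams, bigrams, and trigrams to hash-bucket ids (hypothetical sizes)."""
    snippet = text.lower()[:MAX_CHARS]
    return {
        n: [zlib.crc32(ng.encode("utf-8")) % num_buckets for ng in char_ngrams(snippet, n)]
        for n in (1, 2, 3)
    }


ids = hashed_ngram_ids("Hello")
# one list of bucket ids per ngram size: 5 unigrams, 4 bigrams, 3 trigrams
```

In a full model, each list of ids would index an embedding table, and the looked-up vectors would be averaged per ngram size before concatenation.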


The model was trained on a randomized, stratified subset of ~375k texts drawn from several sources:

  • WiLi: A public dataset of short text extracts from Wikipedias in over 230 languages. Style is relatively formal; subject matter is “encyclopedic”. Source: https://zenodo.org/record/841984

  • Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.

  • UDHR: The UN’s Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html

  • DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/


The trained model achieved F1 = 0.97 when averaged over all languages.

A few languages have worse performance; for example, the two Norwegians (“nb” and “no”), as well as Bosnian (“bs”), Serbian (“sr”), and Croatian (“hr”), which are extremely similar to each other. See the textacy-data releases for more details: https://github.com/bdewilde/textacy-data/releases/tag/lang-identifier-v2.0
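
To make "averaged over all languages" concrete, here is a minimal sketch of a per-language F1 score followed by an unweighted average. It assumes the average is unweighted (the text above does not say), and the labels in the example are made up purely for illustration; they are not evaluation data from the model.

```python
from typing import List, Sequence


def f1_for_label(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 score for a single language label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over the labels present in y_true."""
    labels: List[str] = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy example: confusing "nb" with "no" drags down both languages' scores.
score = macro_f1(["en", "en", "nb", "no"], ["en", "en", "no", "no"])
```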

class textacy.lang_id.lang_identifier.LangIdentifier(version: float | str, data_dir: str | pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.12.0/lib/python3.8/site-packages/textacy/data/lang_identifier'), model_base: Model = <thinc.model.Model object>)
  • version

  • data_dir

  • model_base


Save trained LangIdentifier.model to disk, as bytes.


Load trained model from bytes on disk, using LangIdentifier.model_base as the framework into which the data is fit.

download(force: bool = False)

Download version-specific model data as a binary file and save it to disk at LangIdentifier.model_fpath.


force – If True, download the model data, even if it already exists on disk under self.data_dir; otherwise, don’t.

identify_lang(text: str, with_probs: bool = False) -> str | Tuple[str, float]

Identify the most probable language of text, with or without the corresponding probability.

  • text

  • with_probs


ISO 639-1 standard language code of the most probable language, optionally with its probability.

identify_topn_langs(text: str, topn: int = 3, with_probs: bool = False) -> List[str] | List[Tuple[str, float]]

Identify the topn most probable languages of text, with or without the corresponding probabilities.

  • text

  • topn

  • with_probs


ISO 639-1 standard language codes of the topn most probable languages, optionally with their probabilities.

textacy.lang_id.lang_identifier.identify_lang(text: str, with_probs: bool = False) -> str | Tuple[str, float]

Identify the most probable language of text, with or without the corresponding probability.

  • text

  • with_probs


ISO 639-1 standard language code of the most probable language, optionally with its probability.

textacy.lang_id.lang_identifier.identify_topn_langs(text: str, topn: int = 3, with_probs: bool = False) -> List[str] | List[Tuple[str, float]]

Identify the topn most probable languages of text, with or without the corresponding probabilities.

  • text

  • topn

  • with_probs


ISO 639-1 standard language codes of the topn most probable languages, optionally with their probabilities.
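
The two return shapes can be illustrated with a small stand-alone sketch that ranks a dictionary of per-language probabilities. The helper name and the probability values are hypothetical; the real functions obtain probabilities from the trained model rather than from a dict.

```python
from typing import Dict, List, Tuple, Union


def rank_langs(
    probs: Dict[str, float], topn: int = 3, with_probs: bool = False
) -> Union[List[str], List[Tuple[str, float]]]:
    """Return the topn language codes by descending probability."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:topn]
    if with_probs:
        return ranked
    return [lang for lang, _ in ranked]


# Made-up probabilities for illustration only.
toy_probs = {"en": 0.9, "de": 0.06, "nl": 0.03, "fr": 0.01}
codes_only = rank_langs(toy_probs, topn=2)
codes_and_probs = rank_langs(toy_probs, topn=2, with_probs=True)
```

With with_probs=False the result is a flat list of codes; with with_probs=True each entry is a (code, probability) pair.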


textacy.utils: Variety of general-purpose utility functions for inspecting / validating / transforming args and facilitating meta package tasks.

textacy.utils.deprecated(message: str, *, action: str = 'always')

Show a deprecation warning, optionally filtered.

  • message – Message to display with DeprecationWarning.

  • action – Filter controlling whether warning is ignored, displayed, or turned into an error, as in the actions of Python’s warnings filter.
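
Python's standard warnings filter accepts the actions "default", "error", "ignore", "always", "module", and "once". A minimal sketch of a helper with the same signature, built only on the standard library, might look like the following; the function name is hypothetical and this is not necessarily how textacy implements it.

```python
import warnings


def show_deprecation(message: str, *, action: str = "always") -> None:
    """Emit a DeprecationWarning, filtered by the given warnings action."""
    # catch_warnings() restores the caller's filters when the block exits,
    # so the temporary filter below does not leak.
    with warnings.catch_warnings():
        warnings.simplefilter(action, DeprecationWarning)
        warnings.warn(message, DeprecationWarning, stacklevel=2)
```

Passing action="ignore" suppresses the warning, while action="error" raises it as an exception.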

textacy.utils.get_config() -> Dict[str, Any]

Get key configuration info about dev environment: OS, python, spacy, and textacy.



textacy.utils.print_markdown(items: Dict[Any, Any] | Iterable[Tuple[Any, Any]])

Print items as a markdown-formatted list. Specifically useful when submitting config info on GitHub issues.



textacy.utils.is_record(obj: Any) -> bool

Check whether obj is a “record” – that is, a (text, metadata) 2-tuple.
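
A rough stand-alone equivalent of this check is sketched below. It assumes that text must be a str and metadata a dict, matching the field types listed for textacy.types.Record; the function name is hypothetical and the library's exact rules may differ.

```python
from typing import Any


def looks_like_record(obj: Any) -> bool:
    """Return True for a (text, metadata) 2-tuple of (str, dict)."""
    return (
        isinstance(obj, tuple)
        and len(obj) == 2
        and isinstance(obj[0], str)
        and isinstance(obj[1], dict)
    )
```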

textacy.utils.to_collection(val: types.AnyVal | Collection[types.AnyVal], val_type: Type[Any] | Tuple[Type[Any], ...], col_type: Type[Any]) -> Collection[types.AnyVal]

Validate and cast a value or values to a collection.

  • val (object) – Value or values to validate and cast.

  • val_type (type) – Type of each value in collection, e.g. int or (str, bytes).

  • col_type (type) – Type of collection to return, e.g. tuple or set.


Collection of type col_type with values all of type val_type.
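
The validate-then-cast behaviour can be sketched as follows. The helper name is hypothetical, and choices such as which container types count as "values" and which exceptions are raised are assumptions rather than the library's documented behaviour.

```python
from typing import Any, Collection, Tuple, Type, Union


def as_collection(
    val: Any,
    val_type: Union[Type[Any], Tuple[Type[Any], ...]],
    col_type: Type[Any],
) -> Collection[Any]:
    """Wrap a single value, or validate many values, into col_type."""
    # Check the single-value case first, so that a str is not iterated by character.
    if isinstance(val, val_type):
        return col_type([val])
    if isinstance(val, (list, tuple, set, frozenset)):
        if not all(isinstance(v, val_type) for v in val):
            raise TypeError(f"all values must be of type {val_type}")
        return col_type(val)
    raise TypeError(f"value must be of type {val_type} or a collection of them")
```

For example, as_collection(1, int, list) wraps the value as [1], and as_collection([1, 2], int, tuple) casts the list to (1, 2).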



textacy.utils.to_bytes(s: Union[str, bytes], *, encoding: str = 'utf-8', errors: str = 'strict') -> bytes

Coerce string s to bytes.

textacy.utils.to_unicode(s: Union[str, bytes], *, encoding: str = 'utf-8', errors: str = 'strict') -> str

Coerce string s to unicode.

textacy.utils.to_path(path: Union[str, pathlib.Path]) -> pathlib.Path

Coerce path to a pathlib.Path.
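
The three coercion helpers follow one pattern: return the input unchanged if it already has the target type, convert it if it has the other accepted type, and reject anything else. A minimal sketch with hypothetical names, assuming a TypeError on unsupported input:

```python
from pathlib import Path
from typing import Union


def coerce_to_bytes(s: Union[str, bytes], *, encoding: str = "utf-8", errors: str = "strict") -> bytes:
    if isinstance(s, bytes):
        return s
    if isinstance(s, str):
        return s.encode(encoding, errors)
    raise TypeError(f"expected str or bytes, got {type(s)}")


def coerce_to_unicode(s: Union[str, bytes], *, encoding: str = "utf-8", errors: str = "strict") -> str:
    if isinstance(s, str):
        return s
    if isinstance(s, bytes):
        return s.decode(encoding, errors)
    raise TypeError(f"expected str or bytes, got {type(s)}")


def coerce_to_path(path: Union[str, Path]) -> Path:
    if isinstance(path, Path):
        return path
    if isinstance(path, str):
        return Path(path)
    raise TypeError(f"expected str or pathlib.Path, got {type(path)}")
```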





textacy.utils.validate_set_members(vals: types.AnyVal | Set[types.AnyVal], val_type: Type[Any] | Tuple[Type[Any], ...], valid_vals: Optional[Set[types.AnyVal]] = None) -> Set[types.AnyVal]

Validate values that must be of a certain type and (optionally) found among a set of known valid values.

  • vals – Value or values to validate.

  • val_type – Type(s) of which all vals must be instances.

  • valid_vals – Set of valid values in which all vals must be found.


Validated values.
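
As an illustration of the two checks (type, then membership), here is a sketch with a hypothetical helper name. Raising TypeError for a wrong type and ValueError for an unknown value is an assumption, not something the description above specifies.

```python
from typing import Any, Optional, Set, Tuple, Type, Union


def validate_members(
    vals: Any,
    val_type: Union[Type[Any], Tuple[Type[Any], ...]],
    valid_vals: Optional[Set[Any]] = None,
) -> Set[Any]:
    """Return vals as a set after checking types and (optionally) membership."""
    # A lone value (including a str) is wrapped rather than iterated.
    if not isinstance(vals, (set, frozenset, list, tuple)):
        vals = {vals}
    vals = set(vals)
    if not all(isinstance(v, val_type) for v in vals):
        raise TypeError(f"all values must be of type {val_type}")
    if valid_vals is not None and not vals.issubset(valid_vals):
        raise ValueError(f"values {vals - set(valid_vals)} are not among the valid values")
    return vals
```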



textacy.utils.validate_and_clip_range(range_vals: Tuple[types.AnyVal, types.AnyVal], full_range: Tuple[types.AnyVal, types.AnyVal], val_type: Optional[Type[Any] | Tuple[Type[Any], ...]] = None) -> Tuple[types.AnyVal, types.AnyVal]

Validate and clip range values.

  • range_vals – Range values, i.e. [start_val, end_val), to validate and, if necessary, clip. If None, the value is set to the corresponding value in full_range.

  • full_range – Full range of values, i.e. [min_val, max_val), within which range_vals must lie.

  • val_type – Type(s) of which all range_vals must be instances, unless val is None.


Range for which null or too-small/large values have been clipped to the min/max valid values.
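
The clipping rule can be shown with a short sketch: a None endpoint takes the corresponding bound of full_range, and an endpoint outside the full range is pulled back to the nearest bound. The helper name is hypothetical, type validation is omitted for brevity, and rejecting an inverted range with ValueError is an assumption.

```python
from typing import Any, Optional, Tuple


def clip_range(
    range_vals: Tuple[Optional[Any], Optional[Any]],
    full_range: Tuple[Any, Any],
) -> Tuple[Any, Any]:
    """Clip (start, end) to lie within (min_val, max_val)."""
    start, end = range_vals
    min_val, max_val = full_range
    start = min_val if start is None else max(start, min_val)
    end = max_val if end is None else min(end, max_val)
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    return (start, end)


clipped = clip_range((None, 2030), (1900, 2020))  # both ends fall back to the full range
```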

textacy.utils.get_kwargs_for_func(func: Callable, kwargs: Dict[str, Any]) -> Dict[str, Any]

Get the set of keyword arguments from kwargs that are used by func. Useful when calling a func from another func and inferring its signature from provided **kwargs.
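
One way to get this behaviour with the standard library is to compare the keys of kwargs against the parameter names reported by inspect.signature. The sketch below uses hypothetical names and may differ from textacy's own implementation.

```python
import inspect
from typing import Any, Callable, Dict


def kwargs_used_by(func: Callable, kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the keyword arguments that appear in func's signature."""
    param_names = inspect.signature(func).parameters
    return {key: val for key, val in kwargs.items() if key in param_names}


def tokenize(text: str, lowercase: bool = False, strip: bool = True):
    """Hypothetical inner function used only for this example."""
    text = text.strip() if strip else text
    return (text.lower() if lowercase else text).split()


def process(text: str, **kwargs):
    # Forward only the options tokenize() understands; "verbose" is dropped.
    return tokenize(text, **kwargs_used_by(tokenize, kwargs))


tokens = process(" Hello World ", lowercase=True, verbose=True)
```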

textacy.utils.text_to_char_ngrams(text: str, n: int, *, pad: bool = False) -> Tuple[str, ...]

Convert a text string into an ordered sequence of character ngrams.

  • text

  • n – Number of characters to concatenate in each n-gram.

  • pad – If True, pad text by adding n - 1 “_” characters on either side; if False, leave text as-is.


Ordered sequence of character ngrams.
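
The effect of pad is easiest to see on a short string. The following sketch reimplements the described behaviour with a hypothetical function name; it is an illustration, not the library source.

```python
from typing import Tuple


def char_ngrams(text: str, n: int, *, pad: bool = False) -> Tuple[str, ...]:
    """Slide a window of n characters over text, optionally padded with "_"."""
    if pad:
        padding = "_" * (n - 1)
        text = padding + text + padding
    return tuple(text[i : i + n] for i in range(len(text) - n + 1))


unpadded = char_ngrams("cat", 2)          # ("ca", "at")
padded = char_ngrams("cat", 2, pad=True)  # ("_c", "ca", "at", "t_")
```

Padding lets the ngrams mark where the text starts and ends, at the cost of n - 1 extra ngrams on each side.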

textacy.utils.get_function_names(module, ignore_private: bool = True) -> Iterable[str]

Get names of functions in module, optionally ignoring private members.

  • module

  • ignore_private


Alphabetically ordered sequence of function names.
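
A comparable result can be obtained with inspect.getmembers. The sketch below uses a hypothetical helper name and assumes that "private" means a name starting with an underscore.

```python
import inspect
from types import ModuleType
from typing import List


def function_names(module: ModuleType, ignore_private: bool = True) -> List[str]:
    """List the names of functions found in module, sorted alphabetically."""
    names = [name for name, _ in inspect.getmembers(module, inspect.isfunction)]
    if ignore_private:
        names = [name for name in names if not name.startswith("_")]
    return sorted(names)
```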

textacy.types: Definitions for common object types used throughout the package.

class textacy.types.Record(text, meta)
text: str

Alias for field number 0

meta: dict

Alias for field number 1
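
"Alias for field number N" is the wording Python generates for named-tuple fields, so a record can be accessed by name or by position. A stand-alone sketch of an equivalent type, assuming typing.NamedTuple (the class here is a local illustration, not an import from textacy):

```python
from typing import NamedTuple


class Record(NamedTuple):
    text: str
    meta: dict


record = Record(text="A short document.", meta={"source": "example"})
text, meta = record  # unpacks positionally, like any 2-tuple
```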

class textacy.types.AugTok(text: str, ws: str, pos: str, is_word: bool, syns: List[str])

Minimal token data required for data augmentation transforms.

text: str

Alias for field number 0

ws: str

Alias for field number 1

pos: str

Alias for field number 2

is_word: bool

Alias for field number 3

syns: List[str]

Alias for field number 4

class textacy.types.AugTransform(*args, **kwargs)
class textacy.types.DocExtFunc(*args, **kwargs)

textacy.errors: Helper functions for making consistent errors.

textacy.cache: Functionality for caching language data and other NLP resources. Loading data from disk can be slow; let’s just do it once and forget about it. :)

textacy.cache.LRU_CACHE = LRUCache([], maxsize=2147483648, currsize=0)

Least Recently Used (LRU) cache for loaded data.

The max cache size may be set by the TEXTACY_MAX_CACHE_SIZE environment variable, where the value must be an integer (in bytes). Otherwise, the max size is 2GB.
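
The environment-variable lookup with a 2GB fallback can be sketched as below. The helper name and the optional environ argument are hypothetical conveniences for illustration; note that the variable has to be set before the cache is created for it to have any effect.

```python
import os
from typing import Mapping, Optional

DEFAULT_MAX_CACHE_SIZE = 2 * 1024 ** 3  # 2147483648 bytes, i.e. 2GB


def max_cache_size(environ: Optional[Mapping[str, str]] = None) -> int:
    """Read TEXTACY_MAX_CACHE_SIZE (in bytes), falling back to the 2GB default."""
    environ = os.environ if environ is None else environ
    return int(environ.get("TEXTACY_MAX_CACHE_SIZE", DEFAULT_MAX_CACHE_SIZE))
```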




Clear textacy’s cache of loaded data.

spaCy Utils

textacy.spacier.utils: Helper functions for working with / extending spaCy’s core functionality.

textacy.spacier.utils.make_doc_from_text_chunks(text: str, lang: Union[str, pathlib.Path, spacy.language.Language], chunk_size: int = 100000) -> spacy.tokens.doc.Doc

Make a single spaCy-processed document from 1 or more chunks of text. This is a workaround for processing very long texts, for which spaCy is unable to allocate enough RAM.

  • text – Text document to be chunked and processed by spaCy.

  • lang – Language with which spaCy processes text, represented as the full name of or path on disk to the pipeline, or an already instantiated pipeline instance.

  • chunk_size

    Number of characters comprising each text chunk (excluding the last chunk, which is probably smaller). For best performance, value should be somewhere between 1e3 and 1e7, depending on how much RAM you have available.


    Since chunking is done by character, chunk edges probably won’t respect natural language segmentation, which means that every chunk_size characters, spaCy’s models may make mistakes.


A single processed document, built from concatenated text chunks.
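
The character-based chunking itself is simple to sketch without spaCy. The generator below uses a hypothetical name and covers only the splitting step; processing each chunk and joining the resulting documents is left out.

```python
from typing import Iterator


def iter_text_chunks(text: str, chunk_size: int = 100000) -> Iterator[str]:
    """Yield consecutive slices of text, each chunk_size characters long
    (the last one may be shorter)."""
    for start in range(0, len(text), chunk_size):
        yield text[start : start + chunk_size]


chunks = list(iter_text_chunks("The quick brown fox", chunk_size=8))
# chunks: ["The quic", "k brown ", "fox"]; note the split inside "quick"
```

The split inside "quick" shows why chunk edges can lead to mistakes: a word or sentence may be cut in two.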

textacy.spacier.utils.merge_spans(spans: Iterable[spacy.tokens.span.Span], doc: spacy.tokens.doc.Doc) -> None

Merge spans into single tokens in doc, in-place.

  • spans (Iterable[spacy.tokens.Span]) –

  • doc (spacy.tokens.Doc) –

textacy.spacier.utils.preserve_case(token: spacy.tokens.token.Token) -> bool

Return True if token is a proper noun or acronym; otherwise, False.


ValueError – If parent document has not been POS-tagged.

textacy.spacier.utils.get_normalized_text(span_or_token: Span | Token) -> str

Get the text of a spaCy span or token, normalized depending on its characteristics. For proper nouns and acronyms, text is returned as-is; for everything else, text is lemmatized.

textacy.spacier.utils.get_main_verbs_of_sent(sent: spacy.tokens.span.Span) -> List[spacy.tokens.token.Token]

Return the main (non-auxiliary) verbs in a sentence.

textacy.spacier.utils.get_subjects_of_verb(verb: spacy.tokens.token.Token) -> List[spacy.tokens.token.Token]

Return all subjects of a verb according to the dependency parse.

textacy.spacier.utils.get_objects_of_verb(verb: spacy.tokens.token.Token) -> List[spacy.tokens.token.Token]

Return all objects of a verb according to the dependency parse, including open clausal complements.

textacy.spacier.utils.get_span_for_compound_noun(noun: spacy.tokens.token.Token) -> Tuple[int, int]

Return document indexes spanning all (adjacent) tokens in a compound noun.

textacy.spacier.utils.get_span_for_verb_auxiliaries(verb: spacy.tokens.token.Token) -> Tuple[int, int]

Return document indexes spanning all (adjacent) tokens around a verb that are auxiliary verbs or negations.

textacy.spacier.utils.get_spacy_lang_morph_labels(lang: Union[str, pathlib.Path, spacy.language.Language]) -> Set[str]

Get the full set of morphological feature labels assigned by a spaCy language pipeline according to its “morphologizer” pipe’s metadata, or just get the default set of Universal Dependencies (v2) feature labels.


lang – Language with which spaCy processes text, represented as the full name of a spaCy language pipeline, the path on disk to it, or an already instantiated pipeline.


Set of morphological feature labels assigned/assignable by lang.