Miscellany

lang_id.lang_identifier.identify_lang

Identify the most probable language of a text, with or without the corresponding probability.

lang_id.lang_identifier.identify_topn_langs

Identify the topn most probable languages of a text, with or without the corresponding probabilities.

utils.get_config

Get key configuration info about the dev environment: OS, Python, spaCy, and textacy.

utils.print_markdown

Print items as a markdown-formatted list.

utils.is_record

Check whether obj is a “record” – that is, a (text, metadata) 2-tuple.

utils.to_collection

Validate and cast a value or values to a collection.

utils.to_bytes

Coerce string s to bytes.

utils.to_unicode

Coerce string s to unicode.

utils.to_path

Coerce path to a pathlib.Path.

utils.validate_set_members

Validate values that must be of a certain type and (optionally) found among a set of known valid values.

utils.validate_and_clip_range

Validate and clip range values.

Language Identification

textacy.lang_id: Interface for de/serializing a language identification model, and using it to identify the most probable language(s) of a given text. Inspired by Google’s Compact Language Detector v3 (https://github.com/google/cld3) and implemented with thinc v8.0.

Model

Character unigrams, bigrams, and trigrams are extracted separately from the first 1000 characters of lower-cased input text. Each collection of ngrams is hash-embedded into a 100-dimensional space, then averaged. The resulting feature vectors are concatenated into a single embedding layer, then passed on to a dense layer with ReLU activation and finally a Softmax output layer. The model’s predictions give the probabilities that a text is written in any of ~140 ISO 639-1 languages.
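
To make the front end of this pipeline concrete, here is a minimal, illustrative sketch of the character-ngram extraction step in plain Python (the hashing, embedding, and dense layers are omitted, and the helper name extract_char_ngram_features is hypothetical, not textacy’s actual implementation):

    def extract_char_ngram_features(text: str, max_chars: int = 1000) -> dict:
        # lower-case and truncate the input, as described above
        text = text[:max_chars].lower()
        # collect character unigrams, bigrams, and trigrams separately
        return {
            n: [text[i : i + n] for i in range(len(text) - n + 1)]
            for n in (1, 2, 3)
        }

    features = extract_char_ngram_features("¿Dónde está la biblioteca?")
    # features[1] -> characters, features[2] -> bigrams, features[3] -> trigrams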

Dataset

The model was trained on a randomized, stratified subset of ~375k texts drawn from several sources:

  • WiLi: A public dataset of short text extracts from Wikipedias in over 230 languages. Style is relatively formal; subject matter is “encyclopedic”. Source: https://zenodo.org/record/841984

  • Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.

  • UDHR: The UN’s Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html

  • DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/

Performance

The trained model achieved F1 = 0.97 when averaged over all languages.

A few languages have worse performance; for example, the two Norwegian variants (“nb” and “no”), as well as Bosnian (“bs”), Serbian (“sr”), and Croatian (“hr”), which are extremely similar to one another. See the textacy-data releases for more details: https://github.com/bdewilde/textacy-data/releases/tag/lang-identifier-v2.0

class textacy.lang_id.lang_identifier.LangIdentifier(version: float | str, data_dir: str | pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/lang_identifier'), model_base: Model = <thinc.model.Model object>)[source]
Parameters
  • version – Version of the language identification model data to load or download.

  • data_dir – Directory on disk under which model data is saved.

  • model_base – Base thinc Model into which trained model data is loaded.

model
classes
save_model()[source]

Save trained LangIdentifier.model to disk, as bytes.

load_model() → thinc.model.Model[source]

Load trained model from bytes on disk, using LangIdentifier.model_base as the framework into which the data is fit.

download(force: bool = False)[source]

Download version-specific model data as a binary file and save it to disk at LangIdentifier.model_fpath.

Parameters

force – If True, download the model data, even if it already exists on disk under self.data_dir; otherwise, don’t.

identify_lang(text: str, with_probs: bool = False) → str | Tuple[str, float][source]

Identify the most probable language of text, with or without the corresponding probability.

Parameters
  • text – Text whose language is to be identified.

  • with_probs – If True, return the corresponding probability along with the language code.

Returns

ISO 639-1 standard language code of the most probable language, optionally with its probability.

identify_topn_langs(text: str, topn: int = 3, with_probs: bool = False) → List[str] | List[Tuple[str, float]][source]

Identify the topn most probable languages of text, with or without the corresponding probabilities.

Parameters
  • text – Text whose language(s) are to be identified.

  • topn – Number of most probable languages to return.

  • with_probs – If True, return the corresponding probabilities along with the language codes.

Returns

ISO 639-1 standard language codes of the topn most probable languages, optionally with their probabilities.

textacy.lang_id.lang_identifier.identify_lang(text: str, with_probs: bool = False) → str | Tuple[str, float]

Identify the most probable language of text, with or without the corresponding probability.

Parameters
  • text – Text whose language is to be identified.

  • with_probs – If True, return the corresponding probability along with the language code.

Returns

ISO 639-1 standard language code of the most probable language, optionally with its probability.

textacy.lang_id.lang_identifier.identify_topn_langs(text: str, topn: int = 3, with_probs: bool = False) → List[str] | List[Tuple[str, float]]

Identify the topn most probable languages of text, with or without the corresponding probabilities.

Parameters
  • text – Text whose language(s) are to be identified.

  • topn – Number of most probable languages to return.

  • with_probs – If True, return the corresponding probabilities along with the language codes.

Returns

ISO 639-1 standard language codes of the topn most probable languages, optionally with their probabilities.
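
For reference, a minimal usage sketch of these module-level functions (the example text and the outputs shown in comments are illustrative, and the model data must already be available on disk, e.g. via LangIdentifier.download()):

    from textacy.lang_id import lang_identifier

    # single most probable language, as an ISO 639-1 code
    lang_identifier.identify_lang("Ceci n'est pas une pipe.")
    # e.g. "fr"

    # top-3 most probable languages, with probabilities
    lang_identifier.identify_topn_langs("Ceci n'est pas une pipe.", topn=3, with_probs=True)
    # e.g. [("fr", 0.98), ("ca", 0.01), ("ro", 0.01)]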

Utilities

textacy.utils: Variety of general-purpose utility functions for inspecting / validating / transforming args and facilitating meta package tasks.

textacy.utils.deprecated(message: str, *, action: str = 'always')[source]

Show a deprecation warning, optionally filtered.

Parameters
  • message – Message to display with DeprecationWarning.

  • action – Filter controlling whether the warning is ignored, displayed, or turned into an error; must be one of the standard warnings filter actions (e.g. "always", "ignore", "error").

textacy.utils.get_config() → Dict[str, Any][source]

Get key configuration info about the dev environment: OS, Python, spaCy, and textacy.

Returns

dict

textacy.utils.print_markdown(items: Union[Dict[Any, Any], Iterable[Tuple[Any, Any]]])[source]

Print items as a markdown-formatted list. Specifically useful when submitting config info on GitHub issues.

Parameters

items – Dictionary of items or iterable of (key, value) pairs to print as a markdown-formatted list.
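
A short sketch combining get_config() and print_markdown(), e.g. for pasting environment info into a GitHub issue (the exact keys in the returned dict may vary):

    from textacy import utils

    config = utils.get_config()    # dict of OS / Python / spaCy / textacy info
    utils.print_markdown(config)   # prints the dict as a markdown-formatted list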

textacy.utils.is_record(obj: Any) → bool[source]

Check whether obj is a “record” – that is, a (text, metadata) 2-tuple.
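
For example (the expected results in the comments assume a “record” is a str paired with a metadata dict; treat them as illustrative):

    from textacy import utils

    utils.is_record(("Some text.", {"title": "Example"}))   # True
    utils.is_record("Some text.")                           # False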

textacy.utils.to_collection(val: Any, val_type: Union[Type[Any], Tuple[Type[Any], ...]], col_type: Type[Any]) → Optional[Collection[Any]][source]

Validate and cast a value or values to a collection.

Parameters
  • val (object) – Value or values to validate and cast.

  • val_type (type) – Type of each value in collection, e.g. int or str.

  • col_type (type) – Type of collection to return, e.g. tuple or set.

Returns

Collection of type col_type with values all of type val_type.

Raises

TypeError
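
An illustrative sketch of the common cases; the outputs shown in comments are what the description above implies:

    from textacy import utils

    utils.to_collection(1, int, list)          # [1]
    utils.to_collection((1, 2, 3), int, set)   # {1, 2, 3}
    utils.to_collection(None, int, list)       # None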

textacy.utils.to_bytes(s: Union[str, bytes], *, encoding: str = 'utf-8', errors: str = 'strict') → bytes[source]

Coerce string s to bytes.

textacy.utils.to_unicode(s: Union[str, bytes], *, encoding: str = 'utf-8', errors: str = 'strict') → str[source]

Coerce string s to unicode.
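
For example, round-tripping a UTF-8 string (outputs shown as comments):

    from textacy import utils

    utils.to_bytes("naïve")             # b'na\xc3\xafve'
    utils.to_unicode(b"na\xc3\xafve")   # 'naïve'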

textacy.utils.to_path(path: Union[str, pathlib.Path]) → pathlib.Path[source]

Coerce path to a pathlib.Path.

Parameters

path – Path to coerce, as a string or pathlib.Path.

Returns

pathlib.Path

textacy.utils.validate_set_members(vals: Union[Any, Set[Any]], val_type: Union[Type[Any], Tuple[Type[Any], ...]], valid_vals: Optional[Set[Any]] = None) → Set[Any][source]

Validate values that must be of a certain type and (optionally) found among a set of known valid values.

Parameters
  • vals – Value or values to validate.

  • val_type – Type(s) of which all vals must be instances.

  • valid_vals – Set of valid values in which all vals must be found.

Returns

Validated values.

Return type

Set[obj]

Raises

TypeError or ValueError
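
An illustrative sketch (the outputs shown in comments follow the description above):

    from textacy import utils

    utils.validate_set_members("a", str, valid_vals={"a", "b", "c"})   # {'a'}
    utils.validate_set_members({"a", "b"}, str)                        # {'a', 'b'}
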
textacy.utils.validate_and_clip_range(range_vals: Tuple[Any, Any], full_range: Tuple[Any, Any], val_type: Optional[Union[Type[Any], Tuple[Type[Any], ...]]] = None) → Tuple[Any, Any][source]

Validate and clip range values.

Parameters
  • range_vals – Range values, i.e. [start_val, end_val), to validate and, if necessary, clip. If None, the value is set to the corresponding value in full_range.

  • full_range – Full range of values, i.e. [min_val, max_val), within which range_vals must lie.

  • val_type – Type(s) of which all range_vals must be instances, unless val is None.

Returns

Range for which null or too-small/large values have been clipped to the min/max valid values.

Raises

TypeError or ValueError
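
An illustrative sketch (the outputs shown in comments follow the clipping behavior described above):

    from textacy import utils

    utils.validate_and_clip_range((-5, 25), (0, 20), val_type=int)    # (0, 20)
    utils.validate_and_clip_range((None, 10), (0, 20), val_type=int)  # (0, 10)
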
textacy.utils.get_kwargs_for_func(func: Callable, kwargs: Dict[str, Any]) → Dict[str, Any][source]

Get the set of keyword arguments from kwargs that are used by func. Useful when calling a func from another func and inferring its signature from provided **kwargs.
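
A small sketch of forwarding only the relevant keyword arguments to a function (the resize function and its arguments are hypothetical):

    from textacy import utils

    def resize(width: int, height: int) -> tuple:
        return (width, height)

    kwargs = {"width": 100, "height": 50, "color": "red"}
    utils.get_kwargs_for_func(resize, kwargs)   # {'width': 100, 'height': 50}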

textacy.utils.text_to_char_ngrams(text: str, n: int, *, pad: bool = False) → Tuple[str, ...][source]

Convert a text string into an ordered sequence of character ngrams.

Parameters
  • text – Text to convert into a sequence of character ngrams.

  • n – Number of characters to concatenate in each n-gram.

  • pad – If True, pad text by adding n - 1 “_” characters on either side; if False, leave text as-is.

Returns

Ordered sequence of character ngrams.
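
For example, with and without padding (outputs shown as comments):

    from textacy import utils

    utils.text_to_char_ngrams("spam", 3)            # ('spa', 'pam')
    utils.text_to_char_ngrams("spam", 3, pad=True)  # ('__s', '_sp', 'spa', 'pam', 'am_', 'm__')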

textacy.types: Definitions for common object types used throughout the package.

class textacy.types.Record(text, meta)
meta

Alias for field number 1

text

Alias for field number 0
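
Since Record is a named tuple of (text, meta), usage looks like this (the metadata is arbitrary):

    from textacy.types import Record

    record = Record(text="Some text.", meta={"title": "Example"})
    record.text   # 'Some text.'
    record.meta   # {'title': 'Example'}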

textacy.errors: Helper functions for making consistent errors.

textacy.cache: Functionality for caching language data and other NLP resources. Loading data from disk can be slow; let’s just do it once and forget about it. :)

textacy.cache.LRU_CACHE = LRUCache([], maxsize=2147483648, currsize=0)

Least Recently Used (LRU) cache for loaded data.

The max cache size may be set by the TEXTACY_MAX_CACHE_SIZE environment variable, where the value must be an integer (in bytes). Otherwise, the max size is 2GB.

Type

cachetools.LRUCache
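
For example, to cap the cache at roughly 1GB, the environment variable can be set before textacy is imported (the value is in bytes; this sketch assumes the setting is read when the module is first loaded):

    import os
    os.environ["TEXTACY_MAX_CACHE_SIZE"] = str(1024 ** 3)   # ~1GB, in bytes

    import textacy.cache
    textacy.cache.clear()   # empty the cache of any loaded data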

textacy.cache.clear()[source]

Clear textacy’s cache of loaded data.

spaCy Utils

textacy.spacier.utils: Helper functions for working with / extending spaCy’s core functionality.

textacy.spacier.utils.make_doc_from_text_chunks(text: str, lang: Union[str, pathlib.Path, spacy.language.Language], chunk_size: int = 100000) → spacy.tokens.doc.Doc[source]

Make a single spaCy-processed document from 1 or more chunks of text. This is a workaround for processing very long texts, for which spaCy is unable to allocate enough RAM.

Parameters
  • text – Text document to be chunked and processed by spaCy.

  • lang – Language with which spaCy processes text, represented as the full name of or path on disk to the pipeline, or an already instantiated pipeline instance.

  • chunk_size

    Number of characters comprising each text chunk (excluding the last chunk, which is probably smaller). For best performance, value should be somewhere between 1e3 and 1e7, depending on how much RAM you have available.

    Note

    Since chunking is done by character, chunk edges probably won’t respect natural-language segmentation, which means that spaCy’s models may make mistakes around every chunk boundary (i.e. roughly every chunk_size characters).

Returns

A single processed document, built from concatenated text chunks.
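
A minimal usage sketch (assumes the en_core_web_sm pipeline is installed; the text is a stand-in for a genuinely long document):

    from textacy.spacier import utils as sputils

    very_long_text = "This sentence stands in for a very long document. " * 10_000
    doc = sputils.make_doc_from_text_chunks(very_long_text, "en_core_web_sm", chunk_size=100_000)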

textacy.spacier.utils.merge_spans(spans: Iterable[spacy.tokens.span.Span], doc: spacy.tokens.doc.Doc) → None[source]

Merge spans into single tokens in doc, in-place.

Parameters
  • spans (Iterable[spacy.tokens.Span]) – Spans to merge, each into a single token.

  • doc (spacy.tokens.Doc) – Document in which the spans occur; it is modified in-place.
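
For example, merging each named entity into a single token (assumes the en_core_web_sm pipeline is installed; the example text is arbitrary):

    import spacy
    from textacy.spacier import utils as sputils

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Burton DeWilde created textacy in New York City.")
    sputils.merge_spans(doc.ents, doc)   # each entity, e.g. "New York City", becomes one token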

textacy.spacier.utils.preserve_case(token: spacy.tokens.token.Token) → bool[source]

Return True if token is a proper noun or acronym; otherwise, False.

Raises

ValueError – If parent document has not been POS-tagged.

textacy.spacier.utils.get_normalized_text(span_or_token: Span | Token) → str[source]

Get the text of a spaCy span or token, normalized depending on its characteristics. For proper nouns and acronyms, text is returned as-is; for everything else, text is lemmatized.

textacy.spacier.utils.get_main_verbs_of_sent(sent: spacy.tokens.span.Span) → List[spacy.tokens.token.Token][source]

Return the main (non-auxiliary) verbs in a sentence.

textacy.spacier.utils.get_subjects_of_verb(verb: spacy.tokens.token.Token) → List[spacy.tokens.token.Token][source]

Return all subjects of a verb according to the dependency parse.

textacy.spacier.utils.get_objects_of_verb(verb: spacy.tokens.token.Token) → List[spacy.tokens.token.Token][source]

Return all objects of a verb according to the dependency parse, including open clausal complements.
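
The three functions above can be combined into a rough subject–verb–object sketch (assumes the en_core_web_sm pipeline is installed; the printed output is illustrative and depends on the parse):

    import spacy
    from textacy.spacier import utils as sputils

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The cat chased the mouse.")
    sent = next(doc.sents)
    for verb in sputils.get_main_verbs_of_sent(sent):
        subjects = sputils.get_subjects_of_verb(verb)
        objects = sputils.get_objects_of_verb(verb)
        print(verb.text, [t.text for t in subjects], [t.text for t in objects])
    # e.g. chased ['cat'] ['mouse']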

textacy.spacier.utils.get_span_for_compound_noun(noun: spacy.tokens.token.Token) → Tuple[int, int][source]

Return document indexes spanning all (adjacent) tokens in a compound noun.

textacy.spacier.utils.get_span_for_verb_auxiliaries(verb: spacy.tokens.token.Token) → Tuple[int, int][source]

Return document indexes spanning all (adjacent) tokens around a verb that are auxiliary verbs or negations.

Semantic Networks