Utilities

lang_utils.identify_lang

Identify the most probable language of text.

text_utils.is_acronym

Pass a single token as a string; return True if it is a valid acronym and False if not.

text_utils.keyword_in_context

Search for keyword in text via regular expression, return or print strings spanning window_width characters before and after each occurrence of keyword.

text_utils.KWIC

Alias of keyword_in_context.

text_utils.clean_terms

Clean up a sequence of single- or multi-word strings: strip leading/trailing junk chars, handle dangling parens and odd hyphenation, etc.

utils.get_config

Get key configuration info about dev environment: OS, python, spacy, and textacy.

utils.print_markdown

Print items as a markdown-formatted list.

utils.is_record

Check whether obj is a “record” – that is, a (text, metadata) 2-tuple.

utils.to_collection

Validate and cast a value or values to a collection.

utils.to_bytes

Coerce string s to bytes.

utils.to_unicode

Coerce string s to unicode.

utils.to_path

Coerce path to a pathlib.Path.

utils.validate_set_members

Validate values that must be of a certain type and (optionally) found among a set of known valid values.

utils.validate_and_clip_range

Validate and clip range values.

Language Identification

Pipeline for identifying the language of a text, using a model inspired by Google’s Compact Language Detector v3 (https://github.com/google/cld3) and implemented with scikit-learn>=0.20.

Model

Character unigrams, bigrams, and trigrams are extracted from input text, and their frequencies of occurrence within the text are counted. The full set of ngrams is then hashed into a 4096-dimensional feature vector with values given by the L2 norm of the counts. These features are passed into a Multi-layer Perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~140 different languages as ISO 639-1 language codes.

Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.
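
For illustration, a roughly equivalent (untrained) pipeline could be put together as sketched below. The 4096-dimensional hashing and the 512-unit hidden layer come from the description above; the character-ngram analyzer and all other hyperparameters are assumptions, not the trained model’s actual settings.

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import Pipeline

    # Character 1-/2-/3-grams hashed into 4096 L2-normalized features,
    # classified by an MLP with a single hidden layer of 512 ReLU units.
    # Settings not stated in the text above are illustrative guesses.
    pipeline = Pipeline([
        ("vectorizer", HashingVectorizer(
            analyzer="char", ngram_range=(1, 3), n_features=4096, norm="l2")),
        ("classifier", MLPClassifier(
            hidden_layer_sizes=(512,), activation="relu")),
    ])
    # pipeline.fit(train_texts, train_langs)  # texts and their ISO 639-1 codes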

Dataset

The pipeline was trained on a randomized, stratified subset of ~750k texts drawn from several sources:

  • Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.

  • Leipzig Corpora: A collection of corpora for many languages pulling from comparable sources – specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download

  • UDHR: The UN’s Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html

  • Twitter: A collection of tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html

  • DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/

Performance

The trained model achieved F1 = 0.96 when (macro and micro) averaged over all languages. A few languages have worse performance; for example, the two Norwegians (“nb” and “no”), Bosnian (“bs”) and Serbian (“sr”), and Bashkir (“ba”) and Tatar (“tt”) are often confused with each other. See the textacy-data releases for more details: https://github.com/bdewilde/textacy-data/releases/tag/lang_identifier_v1.1_sklearn_v21

class textacy.lang_utils.LangIdentifier(data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.10.1/lib/python3.8/site-packages/textacy/data/lang_identifier'), max_text_len=1000)[source]
Parameters
  • data_dir (str) – Directory on disk under which the pipeline data is saved and from which it is loaded.

  • max_text_len (int) – Maximum number of characters of text to use when identifying its language; longer texts are truncated.

pipeline
Type

sklearn.pipeline.Pipeline

download(force=False)[source]

Download the pipeline data as a Python version-specific compressed pickle file and save it to disk under the LangIdentifier.data_dir directory.

Parameters

force (bool) – If True, download the dataset, even if it already exists on disk under data_dir.

identify_lang(text)[source]

Identify the most probable language of text.

Parameters

text (str) –

Returns

2-letter language code of the most probable language.

Return type

str

identify_topn_langs(text, topn=3)[source]

Identify the topn most probable languages of text.

Parameters
  • text (str) –

  • topn (int) –

Returns

2-letter language code and its probability for the topn most probable languages.

Return type

List[Tuple[str, float]]

init_pipeline()[source]

Initialize a new language identification pipeline, overwriting any pre-trained pipeline loaded from disk under LangIdentifier.data_dir. Must be trained on (text, lang) examples before use.
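
A minimal usage sketch of the class; the language codes and probabilities shown in comments are illustrative, not guaranteed outputs.

    from textacy.lang_utils import LangIdentifier

    lang_identifier = LangIdentifier()  # uses the default data_dir
    lang_identifier.download()          # skipped if the data already exists on disk
    lang_identifier.identify_lang("This is an English sentence.")
    # "en" (illustrative)
    lang_identifier.identify_topn_langs("This is an English sentence.", topn=2)
    # [("en", 0.98), ("de", 0.01)] (illustrative)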

textacy.lang_utils.identify_lang(text)

Identify the most probable language of text.

Parameters

text (str) –

Returns

2-letter language code of the most probable language.

Return type

str
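
For example (the returned code in the comment is illustrative):

    from textacy import lang_utils

    lang_utils.identify_lang("Il est déjà parti.")  # "fr" (illustrative)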

Other Utils

textacy.text_utils: Set of small utility functions that take text strings as input.

textacy.text_utils.is_acronym(token: str, exclude: Optional[Set[str]] = None) → bool[source]

Pass a single token as a string; return True if it is a valid acronym and False if not.

Parameters
  • token – Single word to check for acronym-ness

  • exclude – If some tokens are technically valid acronyms but are known in advance not to be actual acronyms, pass them in as a set of strings; matching tokens will return False.

Returns

Whether or not token is an acronym.
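
A minimal usage sketch, using only the parameters documented above:

    from textacy import text_utils

    text_utils.is_acronym("NASA")                  # True
    text_utils.is_acronym("spam")                  # False
    text_utils.is_acronym("NBA", exclude={"NBA"})  # False, because it is excluded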

textacy.text_utils.keyword_in_context(text: str, keyword: str, *, ignore_case: bool = True, window_width: int = 50, print_only: bool = True) → Optional[Iterable[Tuple[str, str, str]]][source]

Search for keyword in text via regular expression, return or print strings spanning window_width characters before and after each occurrence of keyword.

Parameters
  • text – Text in which to search for keyword.

  • keyword – Technically, any valid regular expression string should work, but usually this is a single word or short phrase: “spam”, “spam and eggs”; to account for variations, use regex: “[Ss]pam (and|&) [Ee]ggs?”. Note: if keyword contains special characters, be sure to escape them!

  • ignore_case – If True, ignore letter case in keyword matching.

  • window_width – Number of characters on either side of keyword to include as “context”.

  • print_only – If True, print out all results with nice formatting; if False, return all (pre, kw, post) matches as generator of raw strings.

Yields

Next 3-tuple of prior context, the match itself, and posterior context.
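
A minimal usage sketch, showing both the print-only default and the generator form:

    from textacy import text_utils

    text = "The spam and eggs were fine, but the spam itself was better."
    # print each match with its surrounding context
    text_utils.keyword_in_context(text, "spam", window_width=20)
    # or collect (pre, kw, post) string 3-tuples instead of printing
    matches = list(
        text_utils.keyword_in_context(text, "spam", window_width=20, print_only=False)
    )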

textacy.text_utils.KWIC(text: str, keyword: str, *, ignore_case: bool = True, window_width: int = 50, print_only: bool = True) → Optional[Iterable[Tuple[str, str, str]]]

Alias of keyword_in_context.

textacy.text_utils.clean_terms(terms: Iterable[str]) → Iterable[str][source]

Clean up a sequence of single- or multi-word strings: strip leading/trailing junk chars, handle dangling parens and odd hyphenation, etc.

Parameters

terms – Sequence of terms such as “presidency”, “epic failure”, or “George W. Bush” that may be _unclean_ for whatever reason.

Yields

Next term in terms but with the cruft cleaned up, excluding terms that were _entirely_ cruft.

Warning

Terms with (intentionally) unusual punctuation may get “cleaned” into a form that changes or obscures the original meaning of the term.
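
A minimal usage sketch; the cleaned-up values in the comment are illustrative:

    from textacy import text_utils

    terms = ["  presidency", "epic failure)", "George W. Bush"]
    cleaned = list(text_utils.clean_terms(terms))
    # ["presidency", "epic failure", "George W. Bush"] (illustrative)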

textacy.utils.deprecated(message: str, *, action: str = 'always')[source]

Show a deprecation warning, optionally filtered.

Parameters
  • message – Message to display with DeprecationWarning.

  • action – Filter controlling whether the warning is ignored, displayed, or turned into an error. For reference, the standard library’s warnings filter actions are “default”, “error”, “ignore”, “always”, “module”, and “once”.

textacy.utils.get_config() → Dict[str, Any][source]

Get key configuration info about dev environment: OS, python, spacy, and textacy.

Returns

dict

textacy.utils.print_markdown(items: Union[Dict[Any, Any], Iterable[Tuple[Any, Any]]])[source]

Print items as a markdown-formatted list. Specifically useful when submitting config info on GitHub issues.

Parameters

items
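
For example, to dump environment info for a GitHub issue:

    from textacy import utils

    utils.print_markdown(utils.get_config())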

textacy.utils.is_record(obj: Any) → bool[source]

Check whether obj is a “record” – that is, a (text, metadata) 2-tuple.
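
For example:

    from textacy import utils

    utils.is_record(("Some text.", {"title": "Example"}))  # True
    utils.is_record("Some text.")                          # False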

textacy.utils.to_collection(val: Any, val_type: Union[Type[Any], Tuple[Type[Any], ...]], col_type: Type[Any]) → Optional[Collection[Any]][source]

Validate and cast a value or values to a collection.

Parameters
  • val (object) – Value or values to validate and cast.

  • val_type (type) – Type of each value in collection, e.g. int or str.

  • col_type (type) – Type of collection to return, e.g. tuple or set.

Returns

Collection of type col_type with values all of type val_type.

Raises

TypeError
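
A minimal usage sketch; the results in comments are illustrative of the behavior described above:

    from textacy import utils

    utils.to_collection(42, int, list)          # [42]
    utils.to_collection([1, 2, 3], int, tuple)  # (1, 2, 3)
    utils.to_collection("nope", int, list)      # raises TypeError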

textacy.utils.to_bytes(s: Union[str, bytes], *, encoding: str = 'utf-8', errors: str = 'strict') → bytes[source]

Coerce string s to bytes.

textacy.utils.to_unicode(s: Union[str, bytes], *, encoding: str = 'utf-8', errors: str = 'strict') → str[source]

Coerce string s to unicode.

textacy.utils.to_path(path: Union[str, pathlib.Path]) → pathlib.Path[source]

Coerce path to a pathlib.Path.

Parameters

path

Returns

pathlib.Path
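
A combined usage sketch for the three coercion helpers:

    from textacy import utils

    utils.to_bytes("café")            # b'caf\xc3\xa9'
    utils.to_unicode(b"caf\xc3\xa9")  # 'café'
    utils.to_path("data/texts.txt")   # PosixPath('data/texts.txt') on POSIX systems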

textacy.utils.validate_set_members(vals: Union[Any, Set[Any]], val_type: Union[Type[Any], Tuple[Type[Any], ...]], valid_vals: Optional[Set[Any]] = None) → Set[Any][source]

Validate values that must be of a certain type and (optionally) found among a set of known valid values.

Parameters
  • vals – Value or values to validate.

  • val_type – Type(s) of which all vals must be instances.

  • valid_vals – Set of valid values in which all vals must be found.

Returns

Validated values.

Return type

Set[Any]

Raises

TypeError, ValueError
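
A minimal usage sketch; the returned sets in comments are illustrative:

    from textacy import utils

    utils.validate_set_members("en", str, valid_vals={"en", "es", "fr"})
    # {"en"}
    utils.validate_set_members({"en", "es"}, str, valid_vals={"en", "es", "fr"})
    # {"en", "es"}
    # a value of the wrong type raises TypeError; one not in valid_vals raises ValueError
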
textacy.utils.validate_and_clip_range(range_vals: Tuple[Any, Any], full_range: Tuple[Any, Any], val_type: Optional[Union[Type[Any], Tuple[Type[Any], ...]]] = None) → Tuple[Any, Any][source]

Validate and clip range values.

Parameters
  • range_vals – Range values, i.e. [start_val, end_val), to validate and, if necessary, clip. If None, the value is set to the corresponding value in full_range.

  • full_range – Full range of values, i.e. [min_val, max_val), within which range_vals must lie.

  • val_type – Type(s) of which all range_vals must be instances, unless val is None.

Returns

Range for which null or too-small/large values have been clipped to the min/max valid values.

Raises

TypeError, ValueError
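
A minimal usage sketch; the clipped ranges in comments are illustrative of the behavior described above:

    from textacy import utils

    utils.validate_and_clip_range((5, 500), (0, 100), int)    # (5, 100): end clipped to max
    utils.validate_and_clip_range((None, 50), (0, 100), int)  # (0, 50): None set from full_range
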
textacy.utils.get_kwargs_for_func(func: Callable, kwargs: Dict[str, Any]) → Dict[str, Any][source]

Get the set of keyword arguments from kwargs that are used by func. Useful when calling a func from another func and inferring its signature from provided **kwargs.
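
A minimal usage sketch; tokenize here is a hypothetical function used only for illustration:

    from textacy import utils

    def tokenize(text, lowercase=True, strip_punct=False):
        ...

    kwargs = {"lowercase": False, "unknown_option": 1}
    utils.get_kwargs_for_func(tokenize, kwargs)
    # {"lowercase": False} -- only the kwargs that appear in tokenize's signature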

Cache

Functionality for caching language data and other NLP resources. Loading data from disk can be slow; let’s just do it once and forget about it. :)

textacy.cache.LRU_CACHE = LRUCache([], maxsize=2147483648, currsize=0)

Least Recently Used (LRU) cache for loaded data.

The max cache size may be set by the TEXTACY_MAX_CACHE_SIZE environment variable, where the value must be an integer (in bytes). Otherwise, the max size is 2GB.

Type

cachetools.LRUCache

textacy.cache.clear()[source]

Clear textacy’s cache of loaded data.
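
A minimal usage sketch; it is assumed that the environment variable only takes effect if set before textacy is imported, since LRU_CACHE is created at import time:

    import os

    # max cache size in bytes; assumed to apply only if set before import
    os.environ["TEXTACY_MAX_CACHE_SIZE"] = str(1024 ** 3)  # 1 GB

    import textacy.cache

    textacy.cache.clear()  # drop any loaded data from the cache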