Utilities¶
- Identify the most probable language of a text.
- is_acronym: Pass a single token as a string; return True/False if it is/is not a valid acronym.
- keyword_in_context: Search for keyword in text via regular expression; return or print strings spanning window_width characters before and after each occurrence.
- KWIC: Alias of keyword_in_context.
- clean_terms: Clean up a sequence of single- or multi-word strings: strip leading/trailing junk chars, handle dangling parens and odd hyphenation, etc.
- get_config: Get key configuration info about dev environment: OS, python, spacy, and textacy.
- print_markdown: Print items as a markdown-formatted list.
- is_record: Check whether obj is a “record” – that is, a (text, metadata) 2-tuple.
- to_collection: Validate and cast a value or values to a collection.
- to_bytes: Coerce string s to bytes.
- to_unicode: Coerce string s to unicode.
- to_path: Coerce path to a pathlib.Path.
- validate_set_members: Validate values that must be of a certain type and (optionally) found among a set of known valid values.
- validate_and_clip_range: Validate and clip range values.
Language Identification¶
Pipeline for identifying the language of a text, using a model inspired by
Google’s Compact Language Detector v3 (https://github.com/google/cld3) and
implemented with scikit-learn>=0.20.
Model¶
Character unigrams, bigrams, and trigrams are extracted from input text, and their frequencies of occurrence within the text are counted. The full set of ngrams is then hashed into a 4096-dimensional feature vector whose values are the counts normalized to unit L2 norm. These features are passed into a Multi-layer Perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~140 different languages as ISO 639-1 language codes.
Technically, the model was implemented as a sklearn.pipeline.Pipeline
with two steps: a sklearn.feature_extraction.text.HashingVectorizer
for vectorizing input texts and a sklearn.neural_network.MLPClassifier
for multi-class language classification.
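To make the featurization concrete, here is a minimal, self-contained sketch of the scheme described above. This is not textacy's actual code (which uses sklearn's HashingVectorizer); it uses Python's built-in hash as a stand-in for the vectorizer's hashing function, and the function name is hypothetical.

```python
from collections import Counter
import math

def char_ngram_features(text: str, dim: int = 4096) -> list:
    """Hash character 1-/2-/3-gram counts into a fixed-size, L2-normalized vector.

    A simplified sketch of the model's featurization step; the real pipeline
    uses sklearn.feature_extraction.text.HashingVectorizer.
    """
    # count all character unigrams, bigrams, and trigrams in the text
    counts = Counter()
    for n in (1, 2, 3):
        for i in range(len(text) - n + 1):
            counts[text[i : i + n]] += 1
    # hash each ngram into one of `dim` buckets, accumulating counts
    vec = [0.0] * dim
    for ngram, count in counts.items():
        vec[hash(ngram) % dim] += count
    # normalize the vector to unit L2 norm
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec
```

In the real pipeline, a vector like this is what gets fed into the MLP classifier.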
Dataset¶
The pipeline was trained on a randomized, stratified subset of ~750k texts drawn from several sources:
Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
Leipzig Corpora: A collection of corpora for many languages pulling from comparable sources – specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
UDHR: The UN’s Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
Twitter: A collection of tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
Performance¶
The trained model achieved F1 = 0.96 when (macro and micro) averaged over all languages. A few languages have worse performance; for example, the two Norwegians (“nb” and “no”), Bosnian (“bs”) and Serbian (“sr”), and Bashkir (“ba”) and Tatar (“tt”) are often confused with each other. See the textacy-data releases for more details: https://github.com/bdewilde/textacy-data/releases/tag/lang_identifier_v1.1_sklearn_v21
class textacy.lang_utils.LangIdentifier(data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.10.1/lib/python3.8/site-packages/textacy/data/lang_identifier'), max_text_len=1000)[source]¶

pipeline¶
Type: sklearn.pipeline.Pipeline

download(force=False)[source]¶
Download the pipeline data as a Python version-specific compressed pickle file and save it to disk under the LangIdentifier.data_dir directory.
Parameters:
force (bool) – If True, download the dataset, even if it already exists on disk under data_dir.
Other Utils¶
textacy.text_utils: Set of small utility functions that take text strings as input.
textacy.text_utils.is_acronym(token: str, exclude: Optional[Set[str]] = None) → bool[source]¶
Pass a single token as a string; return True/False if it is/is not a valid acronym.
Parameters:
token – Single word to check for acronym-ness.
exclude – If there are tokens that are technically valid acronyms but not actual acronyms, and they are known in advance, pass them in as a set of strings; matching tokens will return False.
Returns:
Whether or not token is an acronym.
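For intuition, here is a rough sketch of the kind of heuristic such a check might apply. This is a simplified illustration, not textacy's actual rules, and the function name is hypothetical.

```python
import re

def is_acronym_sketch(token: str, exclude=None) -> bool:
    """Rough acronym check: short, mostly uppercase, at least one letter.

    A simplified sketch for illustration only -- textacy's actual heuristics
    are more involved.
    """
    # known non-acronyms are rejected up front
    if exclude and token in exclude:
        return False
    # plausible acronyms are short: 2-10 characters
    if not 2 <= len(token) <= 10:
        return False
    # only uppercase letters, digits, and a few separators allowed
    if not re.fullmatch(r"[A-Z0-9.\-&/]+", token):
        return False
    # must contain at least one letter (so "2020" is not an acronym)
    return bool(re.search(r"[A-Z]", token))
```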
textacy.text_utils.keyword_in_context(text: str, keyword: str, *, ignore_case: bool = True, window_width: int = 50, print_only: bool = True) → Optional[Iterable[Tuple[str, str, str]]][source]¶
Search for keyword in text via regular expression; return or print strings spanning window_width characters before and after each occurrence of keyword.
Parameters:
text – Text in which to search for keyword.
keyword – Technically, any valid regular expression string should work, but usually this is a single word or short phrase: “spam”, “spam and eggs”; to account for variations, use regex: “[Ss]pam (and|&) [Ee]ggs?”. Note: if keyword contains special characters, be sure to escape them!
ignore_case – If True, ignore letter case in keyword matching.
window_width – Number of characters on either side of keyword to include as “context”.
print_only – If True, print out all results with nice formatting; if False, return all (pre, kw, post) matches as a generator of raw strings.
Yields:
Next 3-tuple of prior context, the match itself, and posterior context.
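The core of a keyword-in-context search can be sketched in a few lines. This is a minimal illustration of what the print_only=False mode yields, under the assumptions above; it is not the library's actual implementation, and the function name is hypothetical.

```python
import re
from typing import Iterable, Iterator, Tuple

def kwic_sketch(text: str, keyword: str, *, ignore_case: bool = True,
                window_width: int = 50) -> Iterator[Tuple[str, str, str]]:
    """Yield (pre, match, post) context windows around each keyword occurrence."""
    flags = re.IGNORECASE if ignore_case else 0
    for match in re.finditer(keyword, text, flags=flags):
        start, end = match.span()
        yield (
            text[max(0, start - window_width):start],  # prior context
            match.group(),                             # the match itself
            text[end:end + window_width],              # posterior context
        )
```

For example, searching "I like spam and Spam." for "spam" with a 7-character window yields two matches, thanks to case-insensitive matching.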
textacy.text_utils.KWIC(text: str, keyword: str, *, ignore_case: bool = True, window_width: int = 50, print_only: bool = True) → Optional[Iterable[Tuple[str, str, str]]]¶
Alias of keyword_in_context.
textacy.text_utils.clean_terms(terms: Iterable[str]) → Iterable[str][source]¶
Clean up a sequence of single- or multi-word strings: strip leading/trailing junk chars, handle dangling parens and odd hyphenation, etc.
Parameters:
terms – Sequence of terms such as “presidency”, “epic failure”, or “George W. Bush” that may be unclean for whatever reason.
Yields:
Next term in terms with the cruft cleaned up, excluding terms that were entirely cruft.
Warning: Terms with (intentionally) unusual punctuation may get “cleaned” into a form that changes or obscures the original meaning of the term.
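A much-simplified sketch of this kind of cleanup, shown only to illustrate the idea of stripping leading/trailing cruft and dropping all-cruft terms. textacy's actual implementation handles many more cases (dangling parens, odd hyphenation, etc.), and the function name here is hypothetical.

```python
import re
from typing import Iterable, Iterator

def clean_terms_sketch(terms: Iterable[str]) -> Iterator[str]:
    """Strip leading/trailing junk characters; drop terms that are all junk."""
    for term in terms:
        # remove non-word characters from either end (keeping balanced parens
        # would take more care -- this is deliberately simplistic)
        cleaned = re.sub(r"^[^\w(]+|[^\w)]+$", "", term.strip())
        if cleaned:
            yield cleaned
```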
textacy.utils.deprecated(message: str, *, action: str = 'always')[source]¶
Show a deprecation warning, optionally filtered.
Parameters:
message – Message to display with DeprecationWarning.
action – Filter controlling whether the warning is ignored, displayed, or turned into an error.
textacy.utils.get_config() → Dict[str, Any][source]¶
Get key configuration info about dev environment: OS, python, spacy, and textacy.
Returns:
dict
textacy.utils.print_markdown(items: Union[Dict[Any, Any], Iterable[Tuple[Any, Any]]])[source]¶
Print items as a markdown-formatted list. Specifically useful when submitting config info on GitHub issues.
Parameters:
items – Mapping or iterable of (key, value) pairs to print.
textacy.utils.is_record(obj: Any) → bool[source]¶
Check whether obj is a “record” – that is, a (text, metadata) 2-tuple.
textacy.utils.to_collection(val: Any, val_type: Union[Type[Any], Tuple[Type[Any], …]], col_type: Type[Any]) → Optional[Collection[Any]][source]¶
Validate and cast a value or values to a collection.
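Based on the signature, the behavior can be sketched as follows: a single value of the right type is wrapped into the collection, an iterable of such values is cast into it, and anything else raises. This is a plausible reading, not textacy's exact code, and the function name is hypothetical.

```python
def to_collection_sketch(val, val_type, col_type):
    """Validate and cast a value or iterable of values into col_type."""
    if val is None:
        return None
    # a single value of the expected type gets wrapped in the collection
    if isinstance(val, val_type):
        return col_type([val])
    # otherwise, treat val as an iterable and type-check each element
    if all(isinstance(v, val_type) for v in val):
        return col_type(val)
    raise TypeError(f"not all values are of type {val_type}")
```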
textacy.utils.to_bytes(s: Union[str, bytes], *, encoding: str = 'utf-8', errors: str = 'strict') → bytes[source]¶
Coerce string s to bytes.
textacy.utils.to_unicode(s: Union[str, bytes], *, encoding: str = 'utf-8', errors: str = 'strict') → str[source]¶
Coerce string s to unicode.
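These two coercions amount to encode/decode with pass-through, as sketched below. The function names are hypothetical stand-ins, not textacy's exact code.

```python
def to_bytes_sketch(s, *, encoding="utf-8", errors="strict") -> bytes:
    """Encode str to bytes; pass bytes through unchanged."""
    return s if isinstance(s, bytes) else s.encode(encoding, errors)

def to_unicode_sketch(s, *, encoding="utf-8", errors="strict") -> str:
    """Decode bytes to str; pass str through unchanged."""
    return s if isinstance(s, str) else s.decode(encoding, errors)
```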
textacy.utils.to_path(path: Union[str, pathlib.Path]) → pathlib.Path[source]¶
Coerce path to a pathlib.Path.
textacy.utils.validate_set_members(vals: Union[Any, Set[Any]], val_type: Union[Type[Any], Tuple[Type[Any], …]], valid_vals: Optional[Set[Any]] = None) → Set[Any][source]¶
Validate values that must be of a certain type and (optionally) found among a set of known valid values.
Parameters:
vals – Value or values to validate.
val_type – Type(s) of which all vals must be instances.
valid_vals – Set of valid values in which all vals must be found.
Returns:
Validated values.
Return type:
Set[obj]
-
textacy.utils.
validate_and_clip_range
(range_vals: Tuple[Any, Any], full_range: Tuple[Any, Any], val_type: Optional[Union[Type[Any], Tuple[Type[Any], …]]] = None) → Tuple[Any, Any][source]¶ Validate and clip range values.
- Parameters
range_vals – Range values, i.e. [start_val, end_val), to validate and, if necessary, clip. If None, the value is set to the corresponding value in
full_range
.full_range – Full range of values, i.e. [min_val, max_val), within which
range_vals
must lie.val_type – Type(s) of which all
range_vals
must be instances, unless val is None.
- Returns
Range for which null or too-small/large values have been clipped to the min/max valid values.
- Raises
-
textacy.utils.
get_kwargs_for_func
(func: Callable, kwargs: Dict[str, Any]) → Dict[str, Any][source]¶ Get the set of keyword arguments from
kwargs
that are used byfunc
. Useful when calling a func from another func and inferring its signature from provided**kwargs
.
Functionality for caching language data and other NLP resources. Loading data from disk can be slow; let’s just do it once and forget about it. :)
-
textacy.cache.
LRU_CACHE
= LRUCache([], maxsize=2147483648, currsize=0)¶ Least Recently Used (LRU) cache for loaded data.
The max cache size may be set by the TEXTACY_MAX_CACHE_SIZE environment variable, where the value must be an integer (in bytes). Otherwise, the max size is 2GB.
- Type
cachetools.LRUCache