Utilities¶
- Identify the most probable language of a text.
- is_acronym: Pass a single token as a string; return True/False if it is/is not a valid acronym.
- keyword_in_context: Search for keyword in text via regular expression; return or print strings spanning window_width characters before and after each occurrence.
- KWIC: Alias of keyword_in_context.
- clean_terms: Clean up a sequence of single- or multi-word strings: strip leading/trailing junk chars, handle dangling parens and odd hyphenation, etc.
- get_config: Get key configuration info about dev environment: OS, python, spacy, and textacy.
- print_markdown: Print items as a markdown-formatted list.
- is_record: Check whether obj is a “record” – that is, a (text, metadata) 2-tuple.
- to_collection: Validate and cast a value or values to a collection.
- to_bytes: Coerce string s to bytes.
- to_unicode: Coerce string s to unicode.
- to_path: Coerce path to a pathlib.Path.
- validate_set_members: Validate values that must be of a certain type and (optionally) found among a set of known valid values.
- validate_and_clip_range: Validate and clip range values.
Language Identification¶
Pipeline for identifying the language of a text, using a model inspired by
Google’s Compact Language Detector v3 (https://github.com/google/cld3) and
implemented with scikit-learn>=0.20.
Model¶
Character unigrams, bigrams, and trigrams are extracted from input text, and their frequencies of occurrence within the text are counted. The full set of ngrams is then hashed into a 4096-dimensional feature vector whose values are the counts normalized to unit L2 norm. These features are passed into a Multi-layer Perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~140 different languages as ISO 639-1 language codes.
Technically, the model was implemented as a sklearn.pipeline.Pipeline
with two steps: a sklearn.feature_extraction.text.HashingVectorizer
for vectorizing input texts and a sklearn.neural_network.MLPClassifier
for multi-class language classification.
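To make the featurization concrete, here is a minimal, self-contained sketch of the scheme described above. This is not textacy's actual code (which uses sklearn's HashingVectorizer); it uses Python's built-in hash as a stand-in for the vectorizer's hashing function, and the function name is hypothetical.

```python
from collections import Counter
import math

def char_ngram_features(text: str, dim: int = 4096) -> list:
    """Hash character 1-/2-/3-gram counts into a fixed-size, L2-normalized vector.

    A simplified sketch of the model's featurization step; the real pipeline
    uses sklearn.feature_extraction.text.HashingVectorizer.
    """
    # count all character unigrams, bigrams, and trigrams in the text
    counts = Counter()
    for n in (1, 2, 3):
        for i in range(len(text) - n + 1):
            counts[text[i : i + n]] += 1
    # hash each ngram into one of `dim` buckets, accumulating counts
    vec = [0.0] * dim
    for ngram, count in counts.items():
        vec[hash(ngram) % dim] += count
    # normalize the vector to unit L2 norm
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec
```

In the real pipeline, a vector like this is what gets fed into the MLP classifier.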
Dataset¶
The pipeline was trained on a randomized, stratified subset of ~750k texts drawn from several sources:
Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
Leipzig Corpora: A collection of corpora for many languages pulling from comparable sources – specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download
UDHR: The UN’s Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
Twitter: A collection of tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html
DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
Performance¶
The trained model achieved F1 = 0.96 when (macro and micro) averaged over all languages. A few languages have worse performance; for example, the two Norwegians (“nb” and “no”), Bosnian (“bs”) and Serbian (“sr”), and Bashkir (“ba”) and Tatar (“tt”) are often confused with each other. See the textacy-data releases for more details: https://github.com/bdewilde/textacy-data/releases/tag/lang_identifier_v1.1_sklearn_v21
class textacy.lang_utils.LangIdentifier(data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.10.1/lib/python3.8/site-packages/textacy/data/lang_identifier'), max_text_len=1000)[source]¶

pipeline¶
Type: sklearn.pipeline.Pipeline

download(force=False)[source]¶
Download the pipeline data as a Python version-specific compressed pickle file and save it to disk under the LangIdentifier.data_dir directory.
Parameters:
force (bool) – If True, download the dataset, even if it already exists on disk under data_dir.
Other Utils¶
textacy.text_utils: Set of small utility functions that take text strings as input.
textacy.text_utils.is_acronym(token: str, exclude: Optional[Set[str]] = None) → bool[source]¶
Pass a single token as a string; return True/False if it is/is not a valid acronym.
Parameters:
token – Single word to check for acronym-ness.
exclude – If there are tokens that are technically valid acronyms but not actual acronyms, and they are known in advance, pass them in as a set of strings; matching tokens will return False.
Returns:
Whether or not token is an acronym.
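For intuition, here is a rough sketch of the kind of heuristic such a check might apply. This is a simplified illustration, not textacy's actual rules, and the function name is hypothetical.

```python
import re

def is_acronym_sketch(token: str, exclude=None) -> bool:
    """Rough acronym check: short, mostly uppercase, at least one letter.

    A simplified sketch for illustration only -- textacy's actual heuristics
    are more involved.
    """
    # known non-acronyms are rejected up front
    if exclude and token in exclude:
        return False
    # plausible acronyms are short: 2-10 characters
    if not 2 <= len(token) <= 10:
        return False
    # only uppercase letters, digits, and a few separators allowed
    if not re.fullmatch(r"[A-Z0-9.\-&/]+", token):
        return False
    # must contain at least one letter (so "2020" is not an acronym)
    return bool(re.search(r"[A-Z]", token))
```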
textacy.text_utils.keyword_in_context(text: str, keyword: str, *, ignore_case: bool = True, window_width: int = 50, print_only: bool = True) → Optional[Iterable[Tuple[str, str, str]]][source]¶
Search for keyword in text via regular expression; return or print strings spanning window_width characters before and after each occurrence of keyword.
Parameters:
text – Text in which to search for keyword.
keyword – Technically, any valid regular expression string should work, but usually this is a single word or short phrase: “spam”, “spam and eggs”; to account for variations, use regex: “[Ss]pam (and|&) [Ee]ggs?”. Note: if keyword contains special characters, be sure to escape them!
ignore_case – If True, ignore letter case in keyword matching.
window_width – Number of characters on either side of keyword to include as “context”.
print_only – If True, print out all results with nice formatting; if False, return all (pre, kw, post) matches as a generator of raw strings.
Yields:
Next 3-tuple of prior context, the match itself, and posterior context.
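The core of a keyword-in-context search can be sketched in a few lines. This is a minimal illustration of what the print_only=False mode yields, under the assumptions above; it is not the library's actual implementation, and the function name is hypothetical.

```python
import re
from typing import Iterable, Iterator, Tuple

def kwic_sketch(text: str, keyword: str, *, ignore_case: bool = True,
                window_width: int = 50) -> Iterator[Tuple[str, str, str]]:
    """Yield (pre, match, post) context windows around each keyword occurrence."""
    flags = re.IGNORECASE if ignore_case else 0
    for match in re.finditer(keyword, text, flags=flags):
        start, end = match.span()
        yield (
            text[max(0, start - window_width):start],  # prior context
            match.group(),                             # the match itself
            text[end:end + window_width],              # posterior context
        )
```

For example, searching "I like spam and Spam." for "spam" with a 7-character window yields two matches, thanks to case-insensitive matching.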
textacy.text_utils.KWIC(text: str, keyword: str, *, ignore_case: bool = True, window_width: int = 50, print_only: bool = True) → Optional[Iterable[Tuple[str, str, str]]]¶
Alias of keyword_in_context.
textacy.text_utils.clean_terms(terms: Iterable[str]) → Iterable[str][source]¶
Clean up a sequence of single- or multi-word strings: strip leading/trailing junk chars, handle dangling parens and odd hyphenation, etc.
Parameters:
terms – Sequence of terms such as “presidency”, “epic failure”, or “George W. Bush” that may be unclean for whatever reason.
Yields:
Next term in terms with the cruft cleaned up, excluding terms that were entirely cruft.
Warning: Terms with (intentionally) unusual punctuation may get “cleaned” into a form that changes or obscures the original meaning of the term.
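A much-simplified sketch of this kind of cleanup, shown only to illustrate the idea of stripping leading/trailing cruft and dropping all-cruft terms. textacy's actual implementation handles many more cases (dangling parens, odd hyphenation, etc.), and the function name here is hypothetical.

```python
import re
from typing import Iterable, Iterator

def clean_terms_sketch(terms: Iterable[str]) -> Iterator[str]:
    """Strip leading/trailing junk characters; drop terms that are all junk."""
    for term in terms:
        # remove non-word characters from either end (keeping balanced parens
        # would take more care -- this is deliberately simplistic)
        cleaned = re.sub(r"^[^\w(]+|[^\w)]+$", "", term.strip())
        if cleaned:
            yield cleaned
```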
textacy.utils.deprecated(message: str, *, action: str = 'always')[source]¶
Show a deprecation warning, optionally filtered.
Parameters:
message – Message to display with DeprecationWarning.
action – Filter controlling whether the warning is ignored, displayed, or turned into an error.
textacy.utils.get_config() → Dict[str, Any][source]¶
Get key configuration info about dev environment: OS, python, spacy, and textacy.
Returns:
dict
textacy.utils.print_markdown(items: Union[Dict[Any, Any], Iterable[Tuple[Any, Any]]])[source]¶
Print items as a markdown-formatted list. Specifically useful when submitting config info on GitHub issues.
Parameters:
items – Mapping or iterable of (key, value) pairs to print.
textacy.utils.is_record(obj: Any) → bool[source]¶
Check whether obj is a “record” – that is, a (text, metadata) 2-tuple.
textacy.utils.to_collection(val: Any, val_type: Union[Type[Any], Tuple[Type[Any], …]], col_type: Type[Any]) → Optional[Collection[Any]][source]¶
Validate and cast a value or values to a collection.
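Based on the signature, the behavior can be sketched as follows: a single value of the right type is wrapped into the collection, an iterable of such values is cast into it, and anything else raises. This is a plausible reading, not textacy's exact code, and the function name is hypothetical.

```python
def to_collection_sketch(val, val_type, col_type):
    """Validate and cast a value or iterable of values into col_type."""
    if val is None:
        return None
    # a single value of the expected type gets wrapped in the collection
    if isinstance(val, val_type):
        return col_type([val])
    # otherwise, treat val as an iterable and type-check each element
    if all(isinstance(v, val_type) for v in val):
        return col_type(val)
    raise TypeError(f"not all values are of type {val_type}")
```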
textacy.utils.to_bytes(s: Union[str, bytes], *, encoding: str = 'utf-8', errors: str = 'strict') → bytes[source]¶
Coerce string s to bytes.
textacy.utils.to_unicode(s: Union[str, bytes], *, encoding: str = 'utf-8', errors: str = 'strict') → str[source]¶
Coerce string s to unicode.
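These two coercions amount to encode/decode with pass-through, as sketched below. The function names are hypothetical stand-ins, not textacy's exact code.

```python
def to_bytes_sketch(s, *, encoding="utf-8", errors="strict") -> bytes:
    """Encode str to bytes; pass bytes through unchanged."""
    return s if isinstance(s, bytes) else s.encode(encoding, errors)

def to_unicode_sketch(s, *, encoding="utf-8", errors="strict") -> str:
    """Decode bytes to str; pass str through unchanged."""
    return s if isinstance(s, str) else s.decode(encoding, errors)
```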
textacy.utils.to_path(path: Union[str, pathlib.Path]) → pathlib.Path[source]¶
Coerce path to a pathlib.Path.
textacy.utils.validate_set_members(vals: Union[Any, Set[Any]], val_type: Union[Type[Any], Tuple[Type[Any], …]], valid_vals: Optional[Set[Any]] = None) → Set[Any][source]¶
Validate values that must be of a certain type and (optionally) found among a set of known valid values.
Parameters:
vals – Value or values to validate.
val_type – Type(s) of which all vals must be instances.
valid_vals – Set of valid values in which all vals must be found.
Returns:
Validated values.
Return type:
Set[obj]
-
textacy.utils.
validate_and_clip_range
(range_vals: Tuple[Any, Any], full_range: Tuple[Any, Any], val_type: Optional[Union[Type[Any], Tuple[Type[Any], …]]] = None) → Tuple[Any, Any][source]¶ Validate and clip range values.
- Parameters
range_vals – Range values, i.e. [start_val, end_val), to validate and, if necessary, clip. If None, the value is set to the corresponding value in
full_range
.full_range – Full range of values, i.e. [min_val, max_val), within which
range_vals
must lie.val_type – Type(s) of which all
range_vals
must be instances, unless val is None.
- Returns
Range for which null or too-small/large values have been clipped to the min/max valid values.
- Raises
-
textacy.utils.
get_kwargs_for_func
(func: Callable, kwargs: Dict[str, Any]) → Dict[str, Any][source]¶ Get the set of keyword arguments from
kwargs
that are used byfunc
. Useful when calling a func from another func and inferring its signature from provided**kwargs
.
Functionality for caching language data and other NLP resources. Loading data from disk can be slow; let’s just do it once and forget about it. :)
-
textacy.cache.
LRU_CACHE
= LRUCache([], maxsize=2147483648, currsize=0)¶ Least Recently Used (LRU) cache for loaded data.
The max cache size may be set by the TEXTACY_MAX_CACHE_SIZE environment variable, where the value must be an integer (in bytes). Otherwise, the max size is 2GB.
- Type
cachetools.LRUCache