Miscellany¶
identify_lang: Identify the most probable language of text, with or without the corresponding probability.
identify_topn_langs: Identify the topn most probable languages of text, with or without the corresponding probabilities.
get_config: Get key configuration info about dev environment: OS, python, spacy, and textacy.
print_markdown: Print items as a markdown-formatted list.
is_record: Check whether obj is a “record” – that is, a (text, metadata) 2-tuple.
to_collection: Validate and cast a value or values to a collection.
to_bytes: Coerce string s to bytes.
to_unicode: Coerce string s to unicode.
to_path: Coerce path to a pathlib.Path.
validate_set_members: Validate values that must be of a certain type and (optionally) found among a set of known valid values.
validate_and_clip_range: Validate and clip range values.
Language Identification¶
textacy.lang_id
: Interface for de/serializing a language identification model,
and using it to identify the most probable language(s) of a given text. Inspired by
Google’s Compact Language Detector v3 (https://github.com/google/cld3) and
implemented with thinc
v8.0.
Model¶
Character unigrams, bigrams, and trigrams are extracted separately from the first 1000 characters of lower-cased input text. Each collection of ngrams is hash-embedded into a 100-dimensional space, then averaged. The resulting feature vectors are concatenated into a single embedding layer, then passed on to a dense layer with ReLU activation and finally a Softmax output layer. The model’s predictions give the probabilities for a text to be written in ~140 ISO 639-1 languages.
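To make the data flow concrete, here is a minimal, hypothetical sketch of such an architecture using thinc v8 combinators. The layer names, vocab size, and hidden width are illustrative assumptions, the text-to-ngram-id featurization step is omitted, and the library's actual model may differ in its details:

    # Hypothetical sketch, not textacy's actual implementation. Assumes the
    # input is a list of int arrays in which columns 0/1/2 hold hash ids for
    # character unigrams/bigrams/trigrams (padded to equal length).
    from thinc.api import (
        HashEmbed, Relu, Softmax, chain, concatenate, list2ragged,
        reduce_mean, with_array)

    NUM_LANGS = 140  # ~140 ISO 639-1 languages

    def build_lang_id_model(embed_width: int = 100, hidden_width: int = 512):
        # One column per ngram size: hash-embed each ngram id into a
        # 100-dimensional space, then average over the sequence.
        ngram_columns = [
            chain(
                list2ragged(),
                with_array(HashEmbed(nO=embed_width, nV=10000, column=i, seed=i + 1)),
                reduce_mean(),
            )
            for i in range(3)
        ]
        # Concatenate the averaged embeddings, then apply a dense ReLU layer
        # and a Softmax output over the language classes.
        return chain(
            concatenate(*ngram_columns),
            Relu(nO=hidden_width),
            Softmax(nO=NUM_LANGS),
        )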
Dataset¶
The model was trained on a randomized, stratified subset of ~375k texts drawn from several sources:
WiLi: A public dataset of short text extracts from Wikipedias in over 230 languages. Style is relatively formal; subject matter is “encyclopedic”. Source: https://zenodo.org/record/841984
Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.
UDHR: The UN’s Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html
DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/
Performance¶
The trained model achieved F1 = 0.97 when averaged over all languages.
A few languages have worse performance; for example, the two Norwegians (“nb” and “no”), as well as Bosnian (“bs”), Serbian (“sr”), and Croatian (“hr”), which are extremely similar to each other. See the textacy-data releases for more details: https://github.com/bdewilde/textacy-data/releases/tag/lang-identifier-v2.0
class textacy.lang_id.lang_identifier.LangIdentifier(version: float | str, data_dir: str | pathlib.Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/lang_identifier'), model_base: Model = <thinc.model.Model object>)[source]¶

Parameters:
    version – Version of the trained model to load and use, e.g. 2.0.
    data_dir – Directory on disk in which model data is stored.
    model_base – Base thinc Model into which trained model data is loaded.
model¶

classes¶
save_model()[source]¶
Save trained LangIdentifier.model to disk, as bytes.
load_model() → thinc.model.Model[source]¶
Load trained model from bytes on disk, using LangIdentifier.model_base as the framework into which the data is fit.
download(force: bool = False)[source]¶
Download version-specific model data as a binary file and save it to disk at LangIdentifier.model_fpath.

Parameters:
    force – If True, download the model data, even if it already exists on disk under self.data_dir; otherwise, don’t.
identify_lang(text: str, with_probs: bool = False) → str | Tuple[str, float][source]¶
Identify the most probable language of text, with or without the corresponding probability.

Parameters:
    text – Text whose language is to be identified.
    with_probs – If True, also return the probability of the prediction.

Returns:
    ISO 639-1 standard language code of the most probable language, optionally with its probability.
identify_topn_langs(text: str, topn: int = 3, with_probs: bool = False) → List[str] | List[Tuple[str, float]][source]¶
Identify the topn most probable languages of text, with or without the corresponding probabilities.

Parameters:
    text – Text whose languages are to be identified.
    topn – Number of most probable languages to return.
    with_probs – If True, also return the probability of each prediction.

Returns:
    ISO 639-1 standard language codes of the topn most probable languages, optionally with their probabilities.
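A brief usage sketch for the class; the version value and the predicted code are illustrative:

    >>> from textacy.lang_id.lang_identifier import LangIdentifier
    >>> lang_identifier = LangIdentifier(version=2.0)
    >>> lang_identifier.download()  # no-op if model data is already on disk
    >>> lang_identifier.identify_lang("Ich bin ein Berliner.")
    'de'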
textacy.lang_id.lang_identifier.identify_lang(text: str, with_probs: bool = False) → str | Tuple[str, float]¶
Identify the most probable language of text, with or without the corresponding probability.

Parameters:
    text – Text whose language is to be identified.
    with_probs – If True, also return the probability of the prediction.

Returns:
    ISO 639-1 standard language code of the most probable language, optionally with its probability.
textacy.lang_id.lang_identifier.identify_topn_langs(text: str, topn: int = 3, with_probs: bool = False) → List[str] | List[Tuple[str, float]]¶
Identify the topn most probable languages of text, with or without the corresponding probabilities.

Parameters:
    text – Text whose languages are to be identified.
    topn – Number of most probable languages to return.
    with_probs – If True, also return the probability of each prediction.

Returns:
    ISO 639-1 standard language codes of the topn most probable languages, optionally with their probabilities.
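Typical usage of these module-level convenience functions; the predicted codes and probabilities shown here are illustrative:

    >>> from textacy.lang_id.lang_identifier import identify_lang, identify_topn_langs
    >>> identify_lang("This is an English sentence.")
    'en'
    >>> identify_topn_langs("Merhaba dünya!", topn=2, with_probs=True)
    [('tr', 0.98), ('az', 0.01)]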
Utilities¶
textacy.utils
: Variety of general-purpose utility functions for inspecting /
validating / transforming args and facilitating meta package tasks.
textacy.utils.deprecated(message: str, *, action: str = 'always')[source]¶
Show a deprecation warning, optionally filtered.

Parameters:
    message – Message to display with DeprecationWarning.
    action – Filter controlling whether warning is ignored, displayed, or turned into an error. For reference, see the warning filter actions in the Python warnings module docs.
textacy.utils.get_config() → Dict[str, Any][source]¶
Get key configuration info about dev environment: OS, python, spacy, and textacy.

Returns:
    dict
textacy.utils.print_markdown(items: Union[Dict[Any, Any], Iterable[Tuple[Any, Any]]])[source]¶
Print items as a markdown-formatted list. Specifically useful when submitting config info on GitHub issues.

Parameters:
    items – Items to print, as a dict or an iterable of (key, value) pairs.
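These two functions are handy together: since get_config() returns a dict, its output can be passed straight to print_markdown():

    >>> from textacy import utils
    >>> utils.print_markdown(utils.get_config())  # paste the output into a GitHub issue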
textacy.utils.is_record(obj: Any) → bool[source]¶
Check whether obj is a “record” – that is, a (text, metadata) 2-tuple.
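For example:

    >>> from textacy.utils import is_record
    >>> is_record(("A short document.", {"author": "Jane Doe"}))
    True
    >>> is_record("just a string")
    False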
textacy.utils.to_collection(val: Any, val_type: Union[Type[Any], Tuple[Type[Any], …]], col_type: Type[Any]) → Optional[Collection[Any]][source]¶
Validate and cast a value or values to a collection.
textacy.utils.to_bytes(s: Union[str, bytes], *, encoding: str = 'utf-8', errors: str = 'strict') → bytes[source]¶
Coerce string s to bytes.
textacy.utils.to_unicode(s: Union[str, bytes], *, encoding: str = 'utf-8', errors: str = 'strict') → str[source]¶
Coerce string s to unicode.
textacy.utils.to_path(path: Union[str, pathlib.Path]) → pathlib.Path[source]¶
Coerce path to a pathlib.Path.

Parameters:
    path – Path to coerce, as a string or pathlib.Path.

Returns:
    pathlib.Path
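Illustrative examples for these three coercion helpers (the PosixPath repr assumes a POSIX system):

    >>> from textacy.utils import to_bytes, to_path, to_unicode
    >>> to_bytes("naïve")
    b'na\xc3\xafve'
    >>> to_unicode(b"na\xc3\xafve")
    'naïve'
    >>> to_path("data/texts.csv")
    PosixPath('data/texts.csv')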
textacy.utils.validate_set_members(vals: Union[Any, Set[Any]], val_type: Union[Type[Any], Tuple[Type[Any], …]], valid_vals: Optional[Set[Any]] = None) → Set[Any][source]¶
Validate values that must be of a certain type and (optionally) found among a set of known valid values.

Parameters:
    vals – Value or values to validate.
    val_type – Type(s) of which all vals must be instances.
    valid_vals – Set of valid values in which all vals must be found.

Returns:
    Validated values.

Return type:
    Set[obj]

Raises:
    TypeError – If any vals are not instances of val_type.
    ValueError – If any vals are not found among valid_vals.
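An illustrative example, validating a single value against a set of known options:

    >>> from textacy.utils import validate_set_members
    >>> validate_set_members("lemma", str, valid_vals={"lemma", "lower", "orth"})
    {'lemma'}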
textacy.utils.validate_and_clip_range(range_vals: Tuple[Any, Any], full_range: Tuple[Any, Any], val_type: Optional[Union[Type[Any], Tuple[Type[Any], …]]] = None) → Tuple[Any, Any][source]¶
Validate and clip range values.

Parameters:
    range_vals – Range values, i.e. [start_val, end_val), to validate and, if necessary, clip. If None, the value is set to the corresponding value in full_range.
    full_range – Full range of values, i.e. [min_val, max_val), within which range_vals must lie.
    val_type – Type(s) of which all range_vals must be instances, unless val is None.

Returns:
    Range for which null or too-small/large values have been clipped to the min/max valid values.

Raises:
    TypeError – If range values are not instances of val_type.
    ValueError – If range values are otherwise invalid.
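An illustrative example, where a too-small start value is clipped and a null end value falls back to the full range's bound:

    >>> from textacy.utils import validate_and_clip_range
    >>> validate_and_clip_range((-5, None), (0, 100), val_type=int)
    (0, 100)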
textacy.utils.get_kwargs_for_func(func: Callable, kwargs: Dict[str, Any]) → Dict[str, Any][source]¶
Get the subset of keyword arguments from kwargs that are used by func. Useful when calling a func from another func and inferring its signature from provided **kwargs.
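A sketch of the expected filtering behavior, using a hypothetical function:

    >>> from textacy.utils import get_kwargs_for_func
    >>> def tokenize(text, lowercase=True, strip_punct=False):
    ...     ...
    >>> get_kwargs_for_func(tokenize, {"lowercase": False, "unrelated_arg": 1})
    {'lowercase': False}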
textacy.utils.text_to_char_ngrams(text: str, n: int, *, pad: bool = False) → Tuple[str, …][source]¶
Convert a text string into an ordered sequence of character ngrams.

Parameters:
    text – Text to convert into character ngrams.
    n – Number of characters to concatenate in each n-gram.
    pad – If True, pad text by adding n - 1 “_” characters on either side; if False, leave text as-is.

Returns:
    Ordered sequence of character ngrams.
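For example:

    >>> from textacy.utils import text_to_char_ngrams
    >>> text_to_char_ngrams("spacy", 3)
    ('spa', 'pac', 'acy')
    >>> text_to_char_ngrams("spacy", 3, pad=True)
    ('__s', '_sp', 'spa', 'pac', 'acy', 'cy_', 'y__')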
textacy.types
: Definitions for common object types used throughout the package.
class textacy.types.Record(text, meta)¶

text¶
    Alias for field number 0

meta¶
    Alias for field number 1
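Record is a named tuple, so its fields are accessible by name or by index:

    >>> from textacy.types import Record
    >>> record = Record(text="A short document.", meta={"author": "Jane Doe"})
    >>> record.text
    'A short document.'
    >>> record[1]
    {'author': 'Jane Doe'}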
textacy.errors
: Helper functions for making consistent errors.
textacy.cache
: Functionality for caching language data and other NLP resources.
Loading data from disk can be slow; let’s just do it once and forget about it. :)
textacy.cache.LRU_CACHE = LRUCache([], maxsize=2147483648, currsize=0)¶
Least Recently Used (LRU) cache for loaded data.

The max cache size may be set by the TEXTACY_MAX_CACHE_SIZE environment variable, where the value must be an integer (in bytes). Otherwise, the max size is 2GB.

Type:
    cachetools.LRUCache
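For example, to raise the cap to 4GB, set the environment variable before textacy is first imported (assuming the cache is created at import time):

    import os
    os.environ["TEXTACY_MAX_CACHE_SIZE"] = str(4 * 1024**3)  # value in bytes

    import textacy  # cache is now created with the larger max size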
spaCy Utils¶
textacy.spacier.utils
: Helper functions for working with / extending spaCy’s
core functionality.
textacy.spacier.utils.make_doc_from_text_chunks(text: str, lang: Union[str, pathlib.Path, spacy.language.Language], chunk_size: int = 100000) → spacy.tokens.doc.Doc[source]¶
Make a single spaCy-processed document from 1 or more chunks of text. This is a workaround for processing very long texts, for which spaCy is unable to allocate enough RAM.

Parameters:
    text – Text document to be chunked and processed by spaCy.
    lang – Language with which spaCy processes text, represented as the full name of or path on disk to the pipeline, or an already instantiated pipeline instance.
    chunk_size – Number of characters comprising each text chunk (excluding the last chunk, which is probably smaller). For best performance, value should be somewhere between 1e3 and 1e7, depending on how much RAM you have available.

Note: Since chunking is done by character, chunk edges probably won’t respect natural language segmentation, which means that every chunk_size characters, spaCy’s models may make mistakes.

Returns:
    A single processed document, built from concatenated text chunks.
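A usage sketch with an illustrative pipeline name:

    from textacy.spacier.utils import make_doc_from_text_chunks

    very_long_text = "..."  # e.g. the full text of a novel
    doc = make_doc_from_text_chunks(
        very_long_text, lang="en_core_web_sm", chunk_size=100_000)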
textacy.spacier.utils.merge_spans(spans: Iterable[spacy.tokens.span.Span], doc: spacy.tokens.doc.Doc) → None[source]¶
Merge spans into single tokens in doc, in-place.

Parameters:
    spans – Spans to merge, each a spacy.tokens.Span.
    doc – Document in which to merge the spans, a spacy.tokens.Doc.
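For example, to collapse each named entity into a single token (the pipeline name is illustrative):

    import spacy
    from textacy.spacier.utils import merge_spans

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Burton DeWilde wrote the textacy package.")
    merge_spans(doc.ents, doc)  # doc is modified in-place; returns None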
textacy.spacier.utils.preserve_case(token: spacy.tokens.token.Token) → bool[source]¶
Return True if token is a proper noun or acronym; otherwise, False.

Raises:
    ValueError – If parent document has not been POS-tagged.
textacy.spacier.utils.get_normalized_text(span_or_token: Span | Token) → str[source]¶
Get the text of a spaCy span or token, normalized depending on its characteristics. For proper nouns and acronyms, text is returned as-is; for everything else, text is lemmatized.
textacy.spacier.utils.get_main_verbs_of_sent(sent: spacy.tokens.span.Span) → List[spacy.tokens.token.Token][source]¶
Return the main (non-auxiliary) verbs in a sentence.
textacy.spacier.utils.get_subjects_of_verb(verb: spacy.tokens.token.Token) → List[spacy.tokens.token.Token][source]¶
Return all subjects of a verb according to the dependency parse.
textacy.spacier.utils.get_objects_of_verb(verb: spacy.tokens.token.Token) → List[spacy.tokens.token.Token][source]¶
Return all objects of a verb according to the dependency parse, including open clausal complements.