Data Augmentation
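textacy.augmentation.augmenter.Augmenter | Randomly apply one or many data augmentation transforms to spaCy Docs to produce new docs with additional variety and/or noise in the data.
textacy.augmentation.transforms.substitute_word_synonyms | Randomly substitute words for which synonyms are available with a randomly selected synonym, up to num times or with a probability of num.
textacy.augmentation.transforms.insert_word_synonyms | Randomly insert random synonyms of tokens for which synonyms are available, up to num times or with a probability of num.
textacy.augmentation.transforms.swap_words | Randomly swap the positions of two adjacent words, up to num times or with a probability of num.
textacy.augmentation.transforms.delete_words | Randomly delete words, up to num times or with a probability of num.
textacy.augmentation.transforms.substitute_chars | Randomly substitute a single character in randomly-selected words with another, up to num times or with a probability of num.
textacy.augmentation.transforms.insert_chars | Randomly insert a character into randomly-selected words, up to num times or with a probability of num.
textacy.augmentation.transforms.swap_chars | Randomly swap two adjacent characters in randomly-selected words, up to num times or with a probability of num.
textacy.augmentation.transforms.delete_chars | Randomly delete a character in randomly-selected words, up to num times or with a probability of num.
textacy.augmentation.utils.to_aug_toks | Transform a spaCy Doc or Span into a list of AugTok objects, suitable for use in data augmentation transform functions.
textacy.augmentation.utils.get_char_weights | Get lang-specific character weights for use in certain data augmentation transforms, based on texts in textacy.datasets.UDHR.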
-
class textacy.augmentation.augmenter.Augmenter(transforms: Sequence[Callable], *, num: Optional[Union[int, float, Sequence[float]]] = None)
Randomly apply one or many data augmentation transforms to spaCy Docs to produce new docs with additional variety and/or noise in the data.
Initialize an Augmenter with multiple transforms, and customize the randomization of their selection when applying to a document:
>>> from textacy.augmentation import Augmenter, transforms
>>> tfs = [transforms.delete_words, transforms.swap_chars, transforms.delete_chars]
>>> Augmenter(tfs, num=None)  # all tfs applied each time
>>> Augmenter(tfs, num=1)  # one randomly-selected tf applied each time
>>> Augmenter(tfs, num=0.5)  # tfs randomly selected with 50% prob each time
>>> augmenter = Augmenter(tfs, num=[0.4, 0.8, 0.6])  # tfs randomly selected with 40%, 80%, 60% probs, respectively, each time
Apply transforms to a given Doc to produce new documents:
>>> import textacy
>>> text = "The quick brown fox jumps over the lazy dog."
>>> doc = textacy.make_spacy_doc(text, lang="en")
>>> augmenter.apply_transforms(doc)
The quick brown ox jupms over the lazy dog.
>>> augmenter.apply_transforms(doc)
The quikc brown fox over the lazy dog.
>>> augmenter.apply_transforms(doc)
quick brown fox jumps over teh lazy dog.
Parameters for individual transforms may be specified when initializing Augmenter or, if necessary, when applying to individual documents:
>>> from functools import partial
>>> tfs = [partial(transforms.delete_words, num=3), transforms.swap_chars]
>>> augmenter = Augmenter(tfs)
>>> augmenter.apply_transforms(doc)
brown fox jumps over layz dog.
>>> augmenter.apply_transforms(doc, lang=doc.lang_)  # (not actually needed for these tfs)
quick brown fox over teh lazy.
- Parameters
transforms – Ordered sequence of callables that must take List[AugTok] as their first positional argument and return another List[AugTok].
Note
Although the particular transforms applied may vary doc-by-doc, they are applied in order as listed here. Since some transforms may clobber text in a way that makes other transforms less effective, a stable ordering can improve the quality of augmented data.
num – If int, number of transforms to randomly select from transforms each time Augmenter.apply_transforms() is called. If float, probability that any given transform will be selected. If Sequence[float], the probability that the corresponding transform in transforms will be selected (these must be the same length). If None (default), num is set to len(transforms), which means that every transform is applied each time.
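The four accepted forms of num can be sketched in plain Python. This is a minimal illustration of the rules described above, not textacy's implementation; select_transforms and its rng argument are hypothetical names introduced here.

```python
import random
from typing import Callable, List, Optional, Sequence, Union


def select_transforms(
    transforms: Sequence[Callable],
    num: Optional[Union[int, float, Sequence[float]]] = None,
    rng: Optional[random.Random] = None,
) -> List[Callable]:
    """Hypothetical helper: pick a subset of ``transforms``, keeping their order."""
    rng = rng or random.Random()
    if num is None:
        # Default: every transform is applied each time.
        return list(transforms)
    if isinstance(num, int):
        # Pick `num` transforms at random, then restore the listed order.
        k = min(num, len(transforms))
        idxs = sorted(rng.sample(range(len(transforms)), k))
        return [transforms[i] for i in idxs]
    if isinstance(num, float):
        # One shared selection probability for every transform.
        probs = [num] * len(transforms)
    else:
        # One selection probability per transform.
        probs = list(num)
        if len(probs) != len(transforms):
            raise ValueError("num and transforms must be the same length")
    return [tf for tf, p in zip(transforms, probs) if rng.random() < p]
```

Sorting the sampled indexes is what keeps the stable ordering mentioned in the note on transforms.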
See also
A collection of general-purpose transforms is implemented in textacy.augmentation.transforms.
-
apply_transforms(doc: spacy.tokens.doc.Doc, **kwargs) → spacy.tokens.doc.Doc
Sequentially apply some subset of data augmentation transforms to doc, then return a new Doc created from the augmented text.
- Parameters
doc –
**kwargs – If, for whatever reason, you have to pass keyword argument values into transforms that vary or depend on characteristics of doc, specify them here. The transforms’ call signatures will be inspected, and values will be passed along, as needed.
- Returns
spacy.tokens.Doc
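One way to pass keyword arguments only to the transforms that accept them is to inspect each call signature first. The sketch below uses the standard library's inspect module; filter_kwargs and the two stand-in transforms are hypothetical names, and textacy's own logic may differ.

```python
import inspect
from typing import Any, Callable, Dict


def filter_kwargs(func: Callable, kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: keep only the kwargs that ``func`` can accept."""
    params = inspect.signature(func).parameters
    # A function that declares **kwargs of its own accepts everything.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {name: value for name, value in kwargs.items() if name in params}


def with_lang(aug_toks, *, num=1, lang=None):
    """Stand-in for a transform that takes ``lang``."""
    return aug_toks


def without_lang(aug_toks, *, num=1):
    """Stand-in for a transform that has no ``lang`` parameter."""
    return aug_toks


passed_on = filter_kwargs(with_lang, {"lang": "en"})
dropped = filter_kwargs(without_lang, {"lang": "en"})
```

Here passed_on keeps the lang value while dropped is empty, so the second transform is never called with an argument it does not declare.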
-
textacy.augmentation.transforms.substitute_word_synonyms(aug_toks: List[textacy.augmentation.utils.AugTok], *, num: Union[int, float] = 1, pos: Optional[Union[str, Set[str]]] = None) → List[textacy.augmentation.utils.AugTok]
Randomly substitute words for which synonyms are available with a randomly selected synonym, up to num times or with a probability of num.
- Parameters
aug_toks – Sequence of tokens to augment through synonym substitution.
num – If int, maximum number of words with available synonyms to substitute with a randomly selected synonym; if float, probability that a given word with synonyms will be substituted.
pos – Part of speech tag(s) of words to be considered for augmentation. If None, all words with synonyms are considered.
- Returns
New, augmented sequence of tokens.
Note
This transform requires textacy.resources.ConceptNet to be downloaded to work properly, since this is the data source for word synonyms to be substituted.
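The substitution step can be illustrated without ConceptNet by supplying synonyms by hand. In this sketch, AugTok is a local stand-in with the same five fields as textacy.augmentation.utils.AugTok, substitute_synonyms_sketch is a hypothetical name, only the int form of num is handled, and the synonym lists are made up for the example.

```python
import random
from collections import namedtuple
from typing import List, Optional

# Stand-in for textacy.augmentation.utils.AugTok (same field order).
AugTok = namedtuple("AugTok", ["text", "ws", "pos", "is_word", "syns"])


def substitute_synonyms_sketch(
    aug_toks: List[AugTok], num: int = 1, rng: Optional[random.Random] = None
) -> List[AugTok]:
    """Replace up to ``num`` words that have synonyms with a random synonym."""
    rng = rng or random.Random()
    # Only words with at least one synonym are candidates.
    candidates = [i for i, tok in enumerate(aug_toks) if tok.is_word and tok.syns]
    chosen = set(rng.sample(candidates, min(num, len(candidates))))
    return [
        tok._replace(text=rng.choice(tok.syns)) if i in chosen else tok
        for i, tok in enumerate(aug_toks)
    ]


toks = [
    AugTok("The", " ", "DET", True, []),
    AugTok("quick", " ", "ADJ", True, ["fast", "speedy"]),  # illustrative synonyms
    AugTok("fox", "", "NOUN", True, []),
    AugTok(".", "", "PUNCT", False, []),
]
new_toks = substitute_synonyms_sketch(toks, num=1, rng=random.Random(0))
```

Only "quick" has synonyms here, so it is the only token that can change; every other token is returned untouched.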
-
textacy.augmentation.transforms.insert_word_synonyms(aug_toks: List[textacy.augmentation.utils.AugTok], *, num: Union[int, float] = 1, pos: Optional[Union[str, Set[str]]] = None) → List[textacy.augmentation.utils.AugTok]
Randomly insert random synonyms of tokens for which synonyms are available, up to num times or with a probability of num.
- Parameters
aug_toks – Sequence of tokens to augment through synonym insertion.
num – If int, maximum number of words with available synonyms from which a random synonym is selected and randomly inserted; if float, probability that a given word with synonyms will provide a synonym to be inserted.
pos – Part of speech tag(s) of words to be considered for augmentation. If None, all words with synonyms are considered.
- Returns
New, augmented sequence of tokens.
Note
This transform requires textacy.resources.ConceptNet to be downloaded to work properly, since this is the data source for word synonyms to be inserted.
-
textacy.augmentation.transforms.swap_words(aug_toks: List[textacy.augmentation.utils.AugTok], *, num: Union[int, float] = 1, pos: Optional[Union[str, Set[str]]] = None) → List[textacy.augmentation.utils.AugTok]
Randomly swap the positions of two adjacent words, up to num times or with a probability of num.
- Parameters
aug_toks – Sequence of tokens to augment through position swapping.
num – If int, maximum number of adjacent word pairs to swap; if float, probability that a given word pair will be swapped.
pos – Part of speech tag(s) of words to be considered for augmentation. If None, all words are considered.
- Returns
New, augmented sequence of tokens.
-
textacy.augmentation.transforms.delete_words(aug_toks: List[textacy.augmentation.utils.AugTok], *, num: Union[int, float] = 1, pos: Optional[Union[str, Set[str]]] = None) → List[textacy.augmentation.utils.AugTok]
Randomly delete words, up to num times or with a probability of num.
- Parameters
aug_toks – Sequence of tokens to augment through word deletion.
num – If int, maximum number of words to delete; if float, probability that a given word will be deleted.
pos – Part of speech tag(s) of words to be considered for augmentation. If None, all words are considered.
- Returns
New, augmented sequence of tokens.
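The int and float readings of num, which all transforms on this page share, can be sketched for word deletion as follows. For brevity the sketch works on plain strings rather than AugTok objects and ignores pos; delete_words_sketch is a hypothetical name, not the library function.

```python
import random
from typing import List, Optional, Union


def delete_words_sketch(
    words: List[str], num: Union[int, float] = 1, rng: Optional[random.Random] = None
) -> List[str]:
    """Drop words: at most ``num`` if int, each with probability ``num`` if float."""
    rng = rng or random.Random()
    if isinstance(num, int):
        # int: choose up to `num` distinct positions to delete.
        drop = set(rng.sample(range(len(words)), min(num, len(words))))
    else:
        # float: decide independently for each word.
        drop = {i for i in range(len(words)) if rng.random() < num}
    return [w for i, w in enumerate(words) if i not in drop]


words = "The quick brown fox".split()
two_fewer = delete_words_sketch(words, num=2, rng=random.Random(1))
```

With an int the number of deletions is capped; with a float the number of deletions varies from call to call around num * len(words).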
-
textacy.augmentation.transforms.substitute_chars(aug_toks: List[textacy.augmentation.utils.AugTok], *, num: Union[int, float] = 1, lang: Optional[str] = None) → List[textacy.augmentation.utils.AugTok]
Randomly substitute a single character in randomly-selected words with another, up to num times or with a probability of num.
- Parameters
aug_toks – Sequence of tokens to augment through character substitution.
num – If int, maximum number of words to modify with a random character substitution; if float, probability that a given word will be modified.
lang – Standard, two-letter language code corresponding to aug_toks. Used to load a weighted distribution of language-appropriate characters that are randomly selected for substitution. More common characters are more likely to be chosen as the substitute. If not specified, ASCII letters and digits are randomly selected with equal probability.
- Returns
New, augmented sequence of tokens.
Note
This transform requires textacy.datasets.UDHR to be downloaded to work properly, since this is the data source for character weights when deciding which char(s) to substitute.
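The effect of lang can be sketched as a weighted versus a uniform choice of the replacement character. The function name and the sample weights below are illustrative assumptions; the real weights come from the UDHR dataset.

```python
import random
import string
from typing import List, Optional, Tuple


def substitute_char_sketch(
    word: str,
    char_weights: Optional[List[Tuple[str, int]]] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Replace one randomly chosen character of ``word``."""
    rng = rng or random.Random()
    if not word:
        return word
    if char_weights:
        # Weighted choice: more common characters are picked more often.
        chars, weights = zip(*char_weights)
        new_char = rng.choices(chars, weights=weights, k=1)[0]
    else:
        # No language given: ASCII letters and digits, all equally likely.
        new_char = rng.choice(string.ascii_letters + string.digits)
    idx = rng.randrange(len(word))
    return word[:idx] + new_char + word[idx + 1:]


# Made-up weights for illustration only.
noisy = substitute_char_sketch("jumps", [("e", 12), ("t", 9)], rng=random.Random(0))
```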
-
textacy.augmentation.transforms.insert_chars(aug_toks: List[textacy.augmentation.utils.AugTok], *, num: Union[int, float] = 1, lang: Optional[str] = None) → List[textacy.augmentation.utils.AugTok]
Randomly insert a character into randomly-selected words, up to num times or with a probability of num.
- Parameters
aug_toks – Sequence of tokens to augment through character insertion.
num – If int, maximum number of words to modify with a random character insertion; if float, probability that a given word will be modified.
lang – Standard, two-letter language code corresponding to aug_toks. Used to load a weighted distribution of language-appropriate characters that are randomly selected for insertion. More common characters are more likely to be inserted. If not specified, ASCII letters and digits are randomly selected with equal probability.
- Returns
New, augmented sequence of tokens.
Note
This transform requires textacy.datasets.UDHR to be downloaded to work properly, since this is the data source for character weights when deciding which char(s) to insert.
-
textacy.augmentation.transforms.swap_chars(aug_toks: List[textacy.augmentation.utils.AugTok], *, num: Union[int, float] = 1) → List[textacy.augmentation.utils.AugTok]
Randomly swap two adjacent characters in randomly-selected words, up to num times or with a probability of num.
- Parameters
aug_toks – Sequence of tokens to augment through character swapping.
num – If int, maximum number of words to modify with a random character swap; if float, probability that a given word will be modified.
- Returns
New, augmented sequence of tokens.
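The per-word operation behind this transform can be sketched in a few lines. swap_adjacent_chars is a hypothetical helper acting on a single string; choosing which words to modify, as governed by num, is left out.

```python
import random
from typing import Optional


def swap_adjacent_chars(word: str, rng: Optional[random.Random] = None) -> str:
    """Swap one randomly chosen pair of adjacent characters in ``word``."""
    rng = rng or random.Random()
    if len(word) < 2:
        # Nothing to swap in an empty or single-character word.
        return word
    idx = rng.randrange(len(word) - 1)
    return word[:idx] + word[idx + 1] + word[idx] + word[idx + 2:]


swapped = swap_adjacent_chars("jumps", rng=random.Random(0))
```

The result has the same characters as the input, with exactly one neighboring pair exchanged, which is how outputs such as "jupms" in the Augmenter example arise.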
-
textacy.augmentation.transforms.delete_chars(aug_toks: List[textacy.augmentation.utils.AugTok], *, num: Union[int, float] = 1) → List[textacy.augmentation.utils.AugTok]
Randomly delete a character in randomly-selected words, up to num times or with a probability of num.
- Parameters
aug_toks – Sequence of tokens to augment through character deletion.
num – If int, maximum number of words to modify with a random character deletion; if float, probability that a given word will be modified.
- Returns
New, augmented sequence of tokens.
-
class textacy.augmentation.utils.AugTok(text, ws, pos, is_word, syns)
tuple: Minimal token data required for data augmentation transforms.
is_word – Alias for field number 3
pos – Alias for field number 2
syns – Alias for field number 4
text – Alias for field number 0
ws – Alias for field number 1
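Because AugTok is a named tuple, each field can be read by name or by the field number listed above. The sketch below builds a local stand-in with the same field order instead of importing textacy, and it assumes that ws holds the whitespace following the token; the token values are made up for the example.

```python
from collections import namedtuple

# Stand-in with the same field order as AugTok; not imported from textacy.
AugTok = namedtuple("AugTok", ["text", "ws", "pos", "is_word", "syns"])

toks = [
    AugTok(text="The", ws=" ", pos="DET", is_word=True, syns=[]),
    AugTok(text="fox", ws="", pos="NOUN", is_word=True, syns=["vixen"]),
    AugTok(text=".", ws="", pos="PUNCT", is_word=False, syns=[]),
]
tok = toks[1]

# Fields are reachable by name or by position.
same = (tok[0] == tok.text, tok[2] == tok.pos, tok[4] == tok.syns)

# Assumption: ws is the trailing whitespace, so the text can be rebuilt by joining.
rebuilt = "".join(t.text + t.ws for t in toks)
```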
-
textacy.augmentation.utils.to_aug_toks(spacy_obj: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span]) → List[textacy.augmentation.utils.AugTok]
Transform a spaCy Doc or Span into a list of AugTok objects, suitable for use in data augmentation transform functions.
-
textacy.augmentation.utils.get_char_weights(lang: str) → List[Tuple[str, int]]
Get lang-specific character weights for use in certain data augmentation transforms, based on texts in textacy.datasets.UDHR.
- Parameters
lang – Standard two-letter language code.
- Returns
Collection of (character, weight) pairs, based on the distribution of characters found in the source text.
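The shape of the return value can be illustrated by counting characters in any collection of texts. This sketch does not read the UDHR dataset; char_weights_sketch is a hypothetical name, and keeping only alphanumeric characters is an assumption made for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def char_weights_sketch(texts: Iterable[str]) -> List[Tuple[str, int]]:
    """Count alphanumeric characters across ``texts``, most common first."""
    counts = Counter(ch for text in texts for ch in text if ch.isalnum())
    return counts.most_common()


weights = char_weights_sketch(["aab", "a c"])
```

Pairs in this form can be unzipped into a population and a weights sequence for random.choices, which makes frequent characters proportionally more likely to be drawn.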