Data Augmentation

augmenter.Augmenter

Randomly apply one or many data augmentation transforms to spaCy Docs to produce new docs with additional variety and/or noise in the data.

transforms.substitute_word_synonyms

Randomly substitute words for which synonyms are available with a randomly selected synonym, up to num times or with a probability of num.

transforms.insert_word_synonyms

Randomly insert random synonyms of tokens for which synonyms are available, up to num times or with a probability of num.

transforms.swap_words

Randomly swap the positions of two adjacent words, up to num times or with a probability of num.

transforms.delete_words

Randomly delete words, up to num times or with a probability of num.

transforms.substitute_chars

Randomly substitute a single character in randomly-selected words with another, up to num times or with a probability of num.

transforms.insert_chars

Randomly insert a character into randomly-selected words, up to num times or with a probability of num.

transforms.swap_chars

Randomly swap two adjacent characters in randomly-selected words, up to num times or with a probability of num.

transforms.delete_chars

Randomly delete a character in randomly-selected words, up to num times or with a probability of num.

utils.to_aug_toks

Transform a spaCy Doc or Span into a list of AugTok objects, suitable for use in data augmentation transform functions.

utils.get_char_weights

Get lang-specific character weights for use in certain data augmentation transforms, based on texts in textacy.datasets.UDHR.

class textacy.augmentation.augmenter.Augmenter(transforms: Sequence[AugTransform], *, num: Optional[int | float | Sequence[float]] = None)

Randomly apply one or many data augmentation transforms to spaCy Docs to produce new docs with additional variety and/or noise in the data.

Initialize an Augmenter with multiple transforms, and customize the randomization of their selection when applying to a document:

>>> import textacy
>>> from textacy.augmentation import transforms
>>> from textacy.augmentation.augmenter import Augmenter
>>> tfs = [transforms.delete_words, transforms.swap_chars, transforms.delete_chars]
>>> Augmenter(tfs, num=None)  # all tfs applied each time
>>> Augmenter(tfs, num=1)  # one randomly-selected tf applied each time
>>> Augmenter(tfs, num=0.5)  # tfs randomly selected with 50% prob each time
>>> augmenter = Augmenter(tfs, num=[0.4, 0.8, 0.6])  # tfs randomly selected with 40%, 80%, 60% probs, respectively, each time

Apply transforms to a given Doc to produce new documents:

>>> text = "The quick brown fox jumps over the lazy dog."
>>> doc = textacy.make_spacy_doc(text, lang="en_core_web_sm")
>>> augmenter.apply_transforms(doc, lang="en_core_web_sm")
The quick brown ox jupms over the lazy dog.
>>> augmenter.apply_transforms(doc, lang="en_core_web_sm")
The quikc brown fox over the lazy dog.
>>> augmenter.apply_transforms(doc, lang="en_core_web_sm")
quick brown fox jumps over teh lazy dog.

Parameters for individual transforms may be specified when initializing Augmenter or, if necessary, when applying to individual documents:

>>> from functools import partial
>>> tfs = [partial(transforms.delete_words, num=3), transforms.swap_chars]
>>> augmenter = Augmenter(tfs)
>>> augmenter.apply_transforms(doc, lang="en_core_web_sm")
brown fox jumps over layz dog.
>>> augmenter.apply_transforms(doc, lang="en_core_web_sm", pos={"NOUN", "ADJ"})
The jumps over the lazy odg.
Parameters
  • transforms

    Ordered sequence of callables that must take List[AugTok] as their first positional argument and return another List[AugTok].

    Note

    Although the particular transforms applied may vary doc-by-doc, they are applied in order as listed here. Since some transforms may clobber text in a way that makes other transforms less effective, a stable ordering can improve the quality of augmented data.

  • num – If int, number of transforms to randomly select from transforms each time Augmenter.apply_transforms() is called. If float, probability that any given transform will be selected. If Sequence[float], the probability that the corresponding transform in transforms will be selected (these must be the same length). If None (default), num is set to len(transforms), which means that every transform is applied each time.

See also

A collection of general-purpose transforms are implemented in textacy.augmentation.transforms.

apply_transforms(doc: spacy.tokens.doc.Doc, lang: Union[str, pathlib.Path, spacy.language.Language], **kwargs) → spacy.tokens.doc.Doc

Sequentially apply some subset of data augmentation transforms to doc, then return a new Doc created from the augmented text using lang.

Parameters
  • doc – spaCy Doc to which the transforms are applied.

  • lang – Language with which spaCy re-processes the augmented text into a new Doc: the name of or path to a spaCy language pipeline, or an already-loaded Language object.

  • **kwargs – If, for whatever reason, you have to pass keyword argument values into transforms that vary or depend on characteristics of doc, specify them here. The transforms’ call signatures are inspected, and matching values are passed along as needed.

Returns

spacy.tokens.Doc
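
Because the selection of transforms is randomized, repeatedly applying the same Augmenter to the same doc is a simple way to produce multiple augmented variants. A sketch, reusing augmenter and doc from the examples above (outputs vary from run to run):

>>> variants = [augmenter.apply_transforms(doc, lang="en_core_web_sm") for _ in range(5)]
>>> len(variants)
5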

textacy.augmentation.transforms.substitute_word_synonyms(aug_toks: List[aug_utils.AugTok], *, num: int | float = 1, pos: Optional[str | Set[str]] = None) → List[aug_utils.AugTok]

Randomly substitute words for which synonyms are available with a randomly selected synonym, up to num times or with a probability of num.

Parameters
  • aug_toks – Sequence of tokens to augment through synonym substitution.

  • num – If int, maximum number of words with available synonyms to substitute with a randomly selected synonym; if float, probability that a given word with synonyms will be substituted.

  • pos – Part of speech tag(s) of words to be considered for augmentation. If None, all words with synonyms are considered.

Returns

New, augmented sequence of tokens.

Note

This transform requires textacy.resources.ConceptNet to be downloaded to work properly, since this is the data source for word synonyms to be substituted.
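
For example, a minimal sketch of calling this transform directly, assuming the ConceptNet resource has been downloaded and reusing doc from the Augmenter examples above; joining each token's text and trailing whitespace is one simple way to turn the result back into a string:

>>> from textacy.augmentation import transforms, utils
>>> aug_toks = utils.to_aug_toks(doc)
>>> new_toks = transforms.substitute_word_synonyms(aug_toks, num=2, pos="NOUN")
>>> "".join(tok.text + tok.ws for tok in new_toks)  # exact output varies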

textacy.augmentation.transforms.insert_word_synonyms(aug_toks: List[aug_utils.AugTok], *, num: int | float = 1, pos: Optional[str | Set[str]] = None) → List[aug_utils.AugTok]

Randomly insert random synonyms of tokens for which synonyms are available, up to num times or with a probability of num.

Parameters
  • aug_toks – Sequence of tokens to augment through synonym insertion.

  • num – If int, maximum number of words with available synonyms from which a random synonym is selected and randomly inserted; if float, probability that a given word with synonyms will provide a synonym to be inserted.

  • pos – Part of speech tag(s) of words to be considered for augmentation. If None, all words with synonyms are considered.

Returns

New, augmented sequence of tokens.

Note

This transform requires textacy.resources.ConceptNet to be downloaded to work properly, since this is the data source for word synonyms to be inserted.
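
As with synonym substitution, this can be used outside of an Augmenter; a rough sketch with num given as a per-word probability, continuing from the imports and doc above (again assuming ConceptNet is available):

>>> new_toks = transforms.insert_word_synonyms(utils.to_aug_toks(doc), num=0.25)
>>> "".join(tok.text + tok.ws for tok in new_toks)  # exact output varies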

textacy.augmentation.transforms.swap_words(aug_toks: List[aug_utils.AugTok], *, num: int | float = 1, pos: Optional[str | Set[str]] = None) → List[aug_utils.AugTok]

Randomly swap the positions of two adjacent words, up to num times or with a probability of num.

Parameters
  • aug_toks – Sequence of tokens to augment through position swapping.

  • num – If int, maximum number of adjacent word pairs to swap; if float, probability that a given word pair will be swapped.

  • pos – Part of speech tag(s) of words to be considered for augmentation. If None, all words are considered.

Returns

New, augmented sequence of tokens.
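
For example, a minimal sketch continuing with the imports and doc from above:

>>> new_toks = transforms.swap_words(utils.to_aug_toks(doc), num=1)
>>> "".join(tok.text + tok.ws for tok in new_toks)  # exact output varies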

textacy.augmentation.transforms.delete_words(aug_toks: List[aug_utils.AugTok], *, num: int | float = 1, pos: Optional[str | Set[str]] = None) → List[aug_utils.AugTok]

Randomly delete words, up to num times or with a probability of num.

Parameters
  • aug_toks – Sequence of tokens to augment through word deletion.

  • num – If int, maximum number of words to delete; if float, probability that a given word will be deleted.

  • pos – Part of speech tag(s) of words to be considered for augmentation. If None, all words are considered.

Returns

New, augmented sequence of tokens.
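
For example, a quick sketch restricting deletions to adjectives and determiners via pos:

>>> new_toks = transforms.delete_words(utils.to_aug_toks(doc), num=2, pos={"ADJ", "DET"})
>>> "".join(tok.text + tok.ws for tok in new_toks)  # exact output varies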

textacy.augmentation.transforms.substitute_chars(aug_toks: List[aug_utils.AugTok], *, num: int | float = 1, lang: Optional[str] = None) → List[aug_utils.AugTok]

Randomly substitute a single character in randomly-selected words with another, up to num times or with a probability of num.

Parameters
  • aug_toks – Sequence of tokens to augment through character substitution.

  • num – If int, maximum number of words to modify with a random character substitution; if float, probability that a given word will be modified.

  • lang – Standard two-letter language code corresponding to aug_toks. Used to load a weighted distribution of language-appropriate characters that are randomly selected for substitution. More common characters are more likely to be substituted. If not specified, ASCII letters and digits are randomly selected with equal probability.

Returns

New, augmented sequence of tokens.

Note

This transform requires textacy.datasets.UDHR to be downloaded to work properly, since this is the data source for character weights when deciding which char(s) to substitute.
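
For example, a rough sketch using English character weights, assuming the UDHR dataset has been downloaded (omit lang to fall back to ASCII letters and digits):

>>> new_toks = transforms.substitute_chars(utils.to_aug_toks(doc), num=3, lang="en")
>>> "".join(tok.text + tok.ws for tok in new_toks)  # exact output varies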

textacy.augmentation.transforms.insert_chars(aug_toks: List[aug_utils.AugTok], *, num: int | float = 1, lang: Optional[str] = None) → List[aug_utils.AugTok]

Randomly insert a character into randomly-selected words, up to num times or with a probability of num.

Parameters
  • aug_toks – Sequence of tokens to augment through character insertion.

  • num – If int, maximum number of words to modify with a random character insertion; if float, probability that a given word will be modified.

  • lang – Standard two-letter language code corresponding to aug_toks. Used to load a weighted distribution of language-appropriate characters that are randomly selected for insertion. More common characters are more likely to be inserted. If not specified, ASCII letters and digits are randomly selected with equal probability.

Returns

New, augmented sequence of tokens.

Note

This transform requires textacy.datasets.UDHR to be downloaded to work properly, since this is the data source for character weights when deciding which char(s) to insert.
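
For example, a sketch that skips lang and so relies on the ASCII fallback rather than UDHR-based character weights:

>>> new_toks = transforms.insert_chars(utils.to_aug_toks(doc), num=0.2)
>>> "".join(tok.text + tok.ws for tok in new_toks)  # exact output varies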

textacy.augmentation.transforms.swap_chars(aug_toks: List[aug_utils.AugTok], *, num: int | float = 1) → List[aug_utils.AugTok]

Randomly swap two adjacent characters in randomly-selected words, up to num times or with a probability of num.

Parameters
  • aug_toks – Sequence of tokens to augment through character swapping.

  • num – If int, maximum number of words to modify with a random character swap; if float, probability that a given word will be modified.

Returns

New, augmented sequence of tokens.
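
For example, a minimal sketch on the example doc from above:

>>> new_toks = transforms.swap_chars(utils.to_aug_toks(doc), num=2)
>>> "".join(tok.text + tok.ws for tok in new_toks)  # exact output varies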

textacy.augmentation.transforms.delete_chars(aug_toks: List[aug_utils.AugTok], *, num: int | float = 1) → List[aug_utils.AugTok]

Randomly delete a character in randomly-selected words, up to num times or with a probability of num.

Parameters
  • aug_toks – Sequence of tokens to augment through character deletion.

  • num – If int, maximum number of words to modify with a random character deletion; if float, probability that a given word will be modified.

Returns

New, augmented sequence of tokens.
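
For example, with num given as a per-word probability:

>>> new_toks = transforms.delete_chars(utils.to_aug_toks(doc), num=0.3)
>>> "".join(tok.text + tok.ws for tok in new_toks)  # exact output varies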

class textacy.augmentation.utils.AugTok(text, ws, pos, is_word, syns)

Named tuple of the minimal token data required for data augmentation transforms.

is_word

Alias for field number 3: True if the token is a word (rather than punctuation or whitespace).

pos

Alias for field number 2: the token's part-of-speech tag.

syns

Alias for field number 4: list of synonyms available for the token, if any.

text

Alias for field number 0: the token's text.

ws

Alias for field number 1: the token's trailing whitespace.

textacy.augmentation.utils.to_aug_toks(doclike: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span]) → List[textacy.augmentation.utils.AugTok]

Transform a spaCy Doc or Span into a list of AugTok objects, suitable for use in data augmentation transform functions.
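
For example, a quick look at the resulting AugTok objects for the example doc from above (values shown are what en_core_web_sm typically produces; syns is only populated if the ConceptNet resource has been downloaded):

>>> from textacy.augmentation import utils
>>> aug_toks = utils.to_aug_toks(doc)
>>> aug_toks[1].text, aug_toks[1].ws, aug_toks[1].pos, aug_toks[1].is_word
('quick', ' ', 'ADJ', True)
>>> "".join(tok.text + tok.ws for tok in aug_toks)
'The quick brown fox jumps over the lazy dog.'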

textacy.augmentation.utils.get_char_weights(lang: str) → List[Tuple[str, int]]

Get lang-specific character weights for use in certain data augmentation transforms, based on texts in textacy.datasets.UDHR.

Parameters

lang – Standard two-letter language code.

Returns

Collection of (character, weight) pairs, based on the distribution of characters found in the source text.
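
For example, a minimal sketch assuming the UDHR dataset has been downloaded:

>>> char_weights = utils.get_char_weights("en")
>>> char_weights[:3]  # a few (character, weight) pairs; exact values depend on the source texts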