Text (Pre-)Processing

normalize.normalize_hyphenated_words

Normalize words in text that have been split across lines by a hyphen for visual consistency (aka hyphenated) by joining the pieces back together, sans hyphen and whitespace.

normalize.normalize_quotation_marks

Normalize all “fancy” single- and double-quotation marks in text to just the basic ASCII equivalents.

normalize.normalize_repeating_chars

Normalize repeating characters in text by truncating their number of consecutive repetitions to maxn.

normalize.normalize_unicode

Normalize unicode characters in text into canonical forms.

normalize.normalize_whitespace

Replace all contiguous zero-width spaces with an empty string, line-breaking spaces with a single newline, and non-breaking spaces with a single space, then strip any leading/trailing whitespace.

remove.remove_accents

Remove accents from any accented unicode characters in text, either by replacing them with ASCII equivalents or removing them entirely.

remove.remove_punctuation

Remove punctuation from text by replacing all instances of marks with whitespace.

replace.replace_currency_symbols

Replace all currency symbols in text with replace_with.

replace.replace_emails

Replace all email addresses in text with replace_with.

replace.replace_emojis

Replace all emoji and pictographs in text with replace_with.

replace.replace_hashtags

Replace all hashtags in text with replace_with.

replace.replace_numbers

Replace all numbers in text with replace_with.

replace.replace_phone_numbers

Replace all phone numbers in text with replace_with.

replace.replace_urls

Replace all URLs in text with replace_with.

replace.replace_user_handles

Replace all user handles in text with replace_with.

Normalize

textacy.preprocessing.normalize: Normalize aspects of raw text that may vary in problematic ways.

textacy.preprocessing.normalize.normalize_hyphenated_words(text: str) → str

Normalize words in text that have been split across lines by a hyphen for visual consistency (aka hyphenated) by joining the pieces back together, sans hyphen and whitespace.
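To make the joining behavior concrete, here is a minimal standard-library sketch. It is not textacy's implementation: the pattern and the name join_hyphenated_words are assumptions chosen for illustration, and the library may apply stricter rules about which fragments count as one word.

```python
import re

# Hypothetical pattern (an assumption, not textacy's own): a word fragment,
# a hyphen, a line break with optional surrounding spaces, then the rest.
RE_HYPHENATED = re.compile(r"(\w+)-[ \t]*\n[ \t]*(\w+)")


def join_hyphenated_words(text: str) -> str:
    """Join word pieces split across lines, dropping the hyphen and whitespace."""
    return RE_HYPHENATED.sub(r"\1\2", text)


joined = join_hyphenated_words("The implemen-\ntation is straight-\n  forward.")
print(joined)  # The implementation is straightforward.
```

Hyphens that are not followed by a line break, as in "well-known", are left untouched by this sketch.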

textacy.preprocessing.normalize.normalize_quotation_marks(text: str) → str

Normalize all “fancy” single- and double-quotation marks in text to just the basic ASCII equivalents. Note that this will also normalize fancy apostrophes, which are typically represented as single quotation marks.
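The mapping described above can be sketched with str.translate(). The table below is an assumption covering only the most common curly quotation marks; the set of characters textacy actually handles may be larger, and to_ascii_quotes is a hypothetical name.

```python
# Assumed subset of "fancy" quotation marks mapped to ASCII equivalents.
QUOTE_TRANSLATION = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark, also used as an apostrophe
    "\u201a": "'",  # single low-9 quotation mark
    "\u201b": "'",  # single high-reversed-9 quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u201e": '"',  # double low-9 quotation mark
    "\u201f": '"',  # double high-reversed-9 quotation mark
})


def to_ascii_quotes(text: str) -> str:
    """Replace curly single and double quotation marks with ' and "."""
    return text.translate(QUOTE_TRANSLATION)


print(to_ascii_quotes("\u201cIt\u2019s fine,\u201d she said."))
# "It's fine," she said.
```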

textacy.preprocessing.normalize.normalize_repeating_chars(text: str, *, chars: str, maxn: int = 1) → str

Normalize repeating characters in text by truncating their number of consecutive repetitions to maxn.

Parameters
  • text

  • chars – One or more characters whose consecutive repetitions are to be normalized, e.g. “.” or “?!”.

  • maxn – Maximum number of consecutive repetitions of chars to which longer repetitions will be truncated.

Returns

str
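As an illustration of how chars and maxn interact, the following standard-library sketch truncates runs with a regular expression. It assumes that chars is treated as a single repeating unit (so "?!" matches "?!?!?!"); truncate_repeating_chars is a hypothetical name, not the library function.

```python
import re


def truncate_repeating_chars(text: str, *, chars: str, maxn: int = 1) -> str:
    """Truncate runs of more than maxn consecutive repetitions of chars."""
    # Match maxn + 1 or more back-to-back copies of the (escaped) unit.
    pattern = re.compile(rf"(?:{re.escape(chars)}){{{maxn + 1},}}")
    return pattern.sub(chars * maxn, text)


print(truncate_repeating_chars("Wait......", chars=".", maxn=3))  # Wait...
print(truncate_repeating_chars("What?!?!?!", chars="?!"))         # What?!
```

Runs that already have maxn or fewer repetitions are left unchanged.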

textacy.preprocessing.normalize.normalize_unicode(text: str, *, form: str = 'NFC') → str

Normalize unicode characters in text into canonical forms.

Parameters
  • text

  • form ({"NFC", "NFD", "NFKC", "NFKD"}) – Form of normalization applied to unicode characters. For example, an “e” with acute accent “´” can be written as “e´” (canonical decomposition, “NFD”) or “é” (canonical composition, “NFC”). Unicode can be normalized to NFC form without any change in meaning, so it’s usually a safe bet. If “NFKC”, additional normalizations are applied that can change characters’ meanings, e.g. ellipsis characters are replaced with three periods.
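The difference between the forms can be checked directly with Python's standard unicodedata module. This sketch only demonstrates the normalization forms themselves; it makes no claim about how the library function applies them internally.

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by a combining acute accent (NFD form)
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00e9")            # True

# Canonical forms keep the ellipsis character; compatibility forms expand it.
ellipsis = "\u2026"
nfc_ellipsis = unicodedata.normalize("NFC", ellipsis)
nfkc_ellipsis = unicodedata.normalize("NFKC", ellipsis)
print(nfc_ellipsis == ellipsis)        # True
print(nfkc_ellipsis)                   # ...
```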

textacy.preprocessing.normalize.normalize_whitespace(text: str) → str

Replace all contiguous zero-width spaces with an empty string, line-breaking spaces with a single newline, and non-breaking spaces with a single space, then strip any leading/trailing whitespace.
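The three replacement steps followed by a strip can be sketched with the standard re module. The patterns below are assumptions chosen to match the description, and collapse_whitespace is a hypothetical name; in particular, this sketch reads "non-breaking spaces" as any run of whitespace that does not break the line.

```python
import re

# Assumed patterns, one per step described above.
RE_ZERO_WIDTH = re.compile(r"[\u200b\u2060\ufeff]+")  # zero-width characters
RE_LINEBREAK = re.compile(r"(?:\r\n|[\n\v])+")        # runs of line breaks
RE_INLINE_SPACE = re.compile(r"[^\S\n\v]+")           # whitespace that is not a line break


def collapse_whitespace(text: str) -> str:
    """Apply the three replacements in order, then strip both ends."""
    text = RE_ZERO_WIDTH.sub("", text)
    text = RE_LINEBREAK.sub("\n", text)
    text = RE_INLINE_SPACE.sub(" ", text)
    return text.strip()


raw = "  Hello\u200b  world\r\n\r\n\nbye\u00a0\u00a0now  "
print(repr(collapse_whitespace(raw)))  # 'Hello world\nbye now'
```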

Remove

textacy.preprocessing.remove: Remove aspects of raw text that may be unwanted for certain use cases.

textacy.preprocessing.remove.remove_accents(text: str, *, fast: bool = False) → str

Remove accents from any accented unicode characters in text, either by replacing them with ASCII equivalents or removing them entirely.

Parameters
  • text

  • fast

    If False, accents are removed from any unicode symbol with a direct ASCII equivalent; if True, accented chars for all unicode symbols are removed, regardless.

    Note

    fast=True can be significantly faster than fast=False, but its transformation of text is less “safe” and more likely to result in changes of meaning, spelling errors, etc.

Returns

str

Raises

ValueError – If method is not in {“unicode”, “ascii”}.

See also

For a more powerful (but slower) alternative, check out unidecode: https://github.com/avian2/unidecode
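The trade-off between the two modes can be illustrated with a standard-library sketch. This is an assumption about one plausible way to realize the behavior described above, not a statement of the library's internals; strip_accents is a hypothetical name.

```python
import unicodedata


def strip_accents(text: str, *, fast: bool = False) -> str:
    """Remove accents, either conservatively or by forcing text to ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    if fast:
        # Drops every non-ASCII character left after decomposition,
        # including letters that have no ASCII equivalent.
        return decomposed.encode("ascii", errors="ignore").decode("ascii")
    # Drops only combining marks, so unaccented non-ASCII letters survive.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Stra\u00dfe caf\u00e9"))             # Straße cafe
print(strip_accents("Stra\u00dfe caf\u00e9", fast=True))  # Strae cafe
```

In this sketch the fast path silently deletes "ß", which is the kind of spelling change the note above warns about.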

textacy.preprocessing.remove.remove_punctuation(text: str, *, marks: Optional[str] = None) → str

Remove punctuation from text by replacing all instances of marks with whitespace.

Parameters
  • text

  • marks – Remove only those punctuation marks specified here. For example, “,;:” removes commas, semi-colons, and colons. If None, all unicode punctuation marks are removed.

Returns

str

Note

When marks=None, Python’s built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used. The former is about 5-10x faster.
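The two code paths mentioned in the note can be sketched as follows. The translation table here is an assumption (every code point whose unicode category starts with "P"), and strip_punctuation is a hypothetical name rather than the library function.

```python
import re
import sys
import unicodedata
from typing import Optional

# Assumed table: map every unicode punctuation code point to a space.
PUNCT_TABLE = dict.fromkeys(
    (i for i in range(sys.maxunicode + 1)
     if unicodedata.category(chr(i)).startswith("P")),
    " ",
)


def strip_punctuation(text: str, *, marks: Optional[str] = None) -> str:
    """Replace punctuation with whitespace, either all marks or only `marks`."""
    if marks:
        # Regex path: each run of the given marks becomes one space.
        return re.sub(f"[{re.escape(marks)}]+", " ", text)
    # Translate path: each punctuation character becomes one space.
    return text.translate(PUNCT_TABLE)


print(repr(strip_punctuation("Hello, world!")))                       # 'Hello  world '
print(repr(strip_punctuation("Hello, world; ok: yes!", marks=",;:")))  # 'Hello  world  ok  yes!'
```

Because marks are replaced with whitespace rather than deleted, the output can contain doubled or trailing spaces; a whitespace normalization step afterwards is a common follow-up.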

Replace

textacy.preprocessing.replace: Replace parts of raw text that are semantically important as members of a group but not so much in the individual instances.

textacy.preprocessing.replace.replace_currency_symbols(text: str, replace_with: str = '_CUR_') → str

Replace all currency symbols in text with replace_with.

textacy.preprocessing.replace.replace_emails(text: str, replace_with: str = '_EMAIL_') → str

Replace all email addresses in text with replace_with.

textacy.preprocessing.replace.replace_emojis(text: str, replace_with: str = '_EMOJI_') → str

Replace all emoji and pictographs in text with replace_with.

Note

If your Python has a narrow unicode build (“UCS-2”), only dingbats and miscellaneous symbols are replaced because Python isn’t able to represent the unicode data for things like emoticons. Sorry!

textacy.preprocessing.replace.replace_hashtags(text: str, replace_with: str = '_TAG_') → str

Replace all hashtags in text with replace_with.

textacy.preprocessing.replace.replace_numbers(text: str, replace_with: str = '_NUMBER_') → str

Replace all numbers in text with replace_with.

textacy.preprocessing.replace.replace_phone_numbers(text: str, replace_with: str = '_PHONE_') → str

Replace all phone numbers in text with replace_with.

textacy.preprocessing.replace.replace_urls(text: str, replace_with: str = '_URL_') → str

Replace all URLs in text with replace_with.

textacy.preprocessing.replace.replace_user_handles(text: str, replace_with: str = '_USER_') → str

Replace all user handles in text with replace_with.
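The replace functions all follow the same shape: find every member of a group and substitute a placeholder. The sketch below shows that shape for three of the groups using deliberately simplified, hypothetical patterns; the library's real patterns are more thorough, and the name replace_social_tokens is an assumption. Only the placeholder strings are taken from the signatures above.

```python
import re

# Simplified stand-in patterns (assumptions, not the library's own).
RE_EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
RE_USER_HANDLE = re.compile(r"(?<!\w)@\w+")
RE_HASHTAG = re.compile(r"(?<!\w)#\w+")


def replace_social_tokens(text: str) -> str:
    """Replace emails, user handles, and hashtags with placeholder tokens."""
    # Emails go first so that the "@domain" part is never seen as a handle.
    text = RE_EMAIL.sub("_EMAIL_", text)
    text = RE_USER_HANDLE.sub("_USER_", text)
    text = RE_HASHTAG.sub("_TAG_", text)
    return text


print(replace_social_tokens("Ping @alice or mail bob@example.com about #nlp"))
# Ping _USER_ or mail _EMAIL_ about _TAG_
```

Replacing individual instances with a shared token keeps the signal that, say, an email address was present while discarding the specific address.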