Text (Pre-)Processing



textacy.preprocessing.normalize: Normalize aspects of raw text that may vary in problematic ways.

textacy.preprocessing.normalize.normalize_hyphenated_words(text: str) → str

Normalize words in text that have been split across lines by a hyphen for visual consistency (aka hyphenated) by joining the pieces back together, sans hyphen and whitespace.
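The joining step can be sketched with a single regular expression. This is a minimal standard-library illustration, not textacy's own code; the pattern and the name join_hyphenated are assumptions made for the example.

```python
import re

# Hypothetical pattern: a word fragment, a hyphen, optional trailing spaces,
# a line break, optional leading whitespace, then the rest of the word.
HYPHEN_BREAK = re.compile(r"(\w+)-[^\S\n]*\n\s*(\w+)")


def join_hyphenated(text: str) -> str:
    """Join word pieces split across lines, dropping hyphen and whitespace."""
    return HYPHEN_BREAK.sub(r"\1\2", text)


print(join_hyphenated("a well-known proce-\ndure"))
```

A pattern like this cannot tell a layout hyphen from a real one, so a compound such as "well-known" that happens to break at its hyphen would also be joined.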

textacy.preprocessing.normalize.normalize_quotation_marks(text: str) → str

Normalize all “fancy” single- and double-quotation marks in text to just the basic ASCII equivalents. Note that this will also normalize fancy apostrophes, which are typically represented as single quotation marks.
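A rough sketch of this kind of mapping, using only str.translate. The set of quotation marks covered here and the name to_ascii_quotes are assumptions for illustration; the library may handle a different set.

```python
# Hypothetical mapping from common "fancy" quotation marks to ASCII.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201a": "'", "\u201b": "'",
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u201f": '"',
})


def to_ascii_quotes(text: str) -> str:
    """Replace curly single/double quotes (and apostrophes) with ASCII ones."""
    return text.translate(QUOTE_MAP)


print(to_ascii_quotes("\u201cIt\u2019s fine,\u201d she said."))
```

The apostrophe in "It’s" is U+2019, the same code point as a right single quotation mark, which is why it is normalized along with the quotes.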

textacy.preprocessing.normalize.normalize_repeating_chars(text: str, *, chars: str, maxn: int = 1) → str

Normalize repeating characters in text by truncating their number of consecutive repetitions to maxn.

  • text

  • chars – One or more characters whose consecutive repetitions are to be normalized, e.g. “.” or “?!”.

  • maxn – Maximum number of consecutive repetitions of chars to which longer repetitions will be truncated.
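One way to picture how chars and maxn interact is the sketch below. It assumes chars is treated as a single repeating unit (so "?!" matches "?!?!?!" but not "??!!"); that reading, and the name truncate_repeats, are assumptions rather than confirmed library behavior.

```python
import re


def truncate_repeats(text: str, *, chars: str, maxn: int = 1) -> str:
    """Cut runs of more than `maxn` consecutive `chars` down to `maxn`."""
    # Match maxn + 1 or more back-to-back copies of the escaped unit.
    pattern = re.compile("(?:{}){{{},}}".format(re.escape(chars), maxn + 1))
    # A callable replacement avoids backslash processing of `chars`.
    return pattern.sub(lambda match: chars * maxn, text)


print(truncate_repeats("Wow!!!!", chars="!"))
print(truncate_repeats("Hmm......", chars=".", maxn=3))
print(truncate_repeats("What?!?!?!", chars="?!"))
```

Runs that are already at or below maxn are left untouched, so "ok.." with chars="." and maxn=3 comes back unchanged.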



textacy.preprocessing.normalize.normalize_unicode(text: str, *, form: str = 'NFC') → str

Normalize unicode characters in text into canonical forms.

  • text

  • form ({"NFC", "NFD", "NFKC", "NFKD"}) – Form of normalization applied to unicode characters. For example, an “e” with acute accent “´” can be written as “e´” (canonical decomposition, “NFD”) or “é” (canonical composition, “NFC”). Unicode can be normalized to NFC form without any change in meaning, so it’s usually a safe bet. If “NFKC”, additional normalizations are applied that can change characters’ meanings, e.g. ellipsis characters are replaced with three periods.
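The difference between the forms can be checked directly with Python's standard unicodedata module, which implements the same four normalization forms. The variable names below are only for illustration.

```python
import unicodedata

decomposed = "e\u0301"   # "e" followed by a combining acute accent
composed = "\u00e9"      # precomposed "é"
ellipsis = "\u2026"      # single ellipsis character

# Canonical forms convert between the two spellings of "é".
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Compatibility forms additionally rewrite characters such as the ellipsis.
nfc_ellipsis = unicodedata.normalize("NFC", ellipsis)
nfkc_ellipsis = unicodedata.normalize("NFKC", ellipsis)

print(nfc == composed, nfd == decomposed)
print(len(nfc_ellipsis), len(nfkc_ellipsis))
```

NFC leaves the ellipsis as one character, while NFKC expands it to three periods, which is the kind of meaning-affecting change the description warns about.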

textacy.preprocessing.normalize.normalize_whitespace(text: str) → str

Replace all contiguous zero-width spaces with an empty string, line-breaking spaces with a single newline, and non-breaking spaces with a single space, then strip any leading/trailing whitespace.
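The three replacements and the final strip can be sketched as follows. The patterns are assumptions for illustration: in particular, "non-breaking spaces" is read here as any run of whitespace that is not a line break, and the zero-width set is limited to a few common code points.

```python
import re

# Hypothetical patterns; the library's actual ones may differ.
ZERO_WIDTH = re.compile(r"[\u200b\u2060\ufeff]+")
LINE_BREAKS = re.compile(r"(?:\r\n|\n)+")
OTHER_SPACE = re.compile(r"[^\S\n]+")  # whitespace other than newline


def squeeze_whitespace(text: str) -> str:
    text = ZERO_WIDTH.sub("", text)      # drop zero-width spaces
    text = LINE_BREAKS.sub("\n", text)   # collapse runs of line breaks
    text = OTHER_SPACE.sub(" ", text)    # collapse remaining whitespace runs
    return text.strip()


print(repr(squeeze_whitespace("  Hello\u200b   world\r\n\r\n\nbye\u00a0\u00a0now  ")))
```

Order matters in this sketch: line breaks are collapsed before other whitespace so that a "\r" belonging to a Windows line ending is not turned into a stray space first.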


textacy.preprocessing.remove: Remove aspects of raw text that may be unwanted for certain use cases.

textacy.preprocessing.remove.remove_accents(text: str, *, fast: bool = False) → str

Remove accents from any accented unicode characters in text, either by replacing them with ASCII equivalents or removing them entirely.

  • text

  • fast

    If False, accents are removed from any unicode symbol with a direct ASCII equivalent; if True, accented chars for all unicode symbols are removed, regardless.


    Note: fast=True can be significantly faster than fast=False, but its transformation of text is less “safe” and more likely to result in changes of meaning, spelling errors, etc.




Raises: ValueError – If method is not in {“unicode”, “ascii”}.

See also

For a more powerful (but slower) alternative, check out unidecode: https://github.com/avian2/unidecode
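The two behaviors of fast can be approximated with the standard unicodedata module. This is a sketch under the assumption that both paths start from an NFKD decomposition; the name strip_accents is hypothetical and the library's implementation may differ.

```python
import unicodedata


def strip_accents(text: str, *, fast: bool = False) -> str:
    decomposed = unicodedata.normalize("NFKD", text)
    if fast:
        # Drop every non-ASCII code point, accent or not.
        return decomposed.encode("ascii", errors="ignore").decode("ascii")
    # Drop only combining marks, keeping other non-ASCII characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("El niño se asustó"))
print(strip_accents("straße"), strip_accents("straße", fast=True))
```

The second line shows why the fast path is less safe: "ß" has no decomposition, so the slower path keeps it while the ASCII round-trip silently deletes it.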

textacy.preprocessing.remove.remove_punctuation(text: str, *, marks: Optional[str] = None) → str

Remove punctuation from text by replacing all instances of marks with whitespace.

  • text

  • marks – Remove only those punctuation marks specified here. For example, “,;:” removes commas, semi-colons, and colons. If None, all unicode punctuation marks are removed.




Note: When marks=None, Python’s built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used. The former is about 5-10x faster.
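The two code paths can be sketched as below. The translation table is built from every code point whose Unicode category starts with "P"; that construction and the name strip_punctuation are assumptions for illustration, not the library's confirmed internals.

```python
import re
import sys
import unicodedata

# Map every Unicode punctuation code point to a space (built once at import).
PUNCT_TABLE = dict.fromkeys(
    (
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith("P")
    ),
    " ",
)


def strip_punctuation(text: str, *, marks=None) -> str:
    if marks is None:
        # Table lookup per character: no regex engine involved.
        return text.translate(PUNCT_TABLE)
    # Only the requested marks, via an escaped character class.
    return re.sub("[{}]".format(re.escape(marks)), " ", text)


print(strip_punctuation("Hello, world!").split())
print(strip_punctuation("a,b;c:d!", marks=",;:"))
```

Because each mark becomes a space rather than disappearing, words separated only by punctuation (such as "a,b") stay separate tokens.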


textacy.preprocessing.replace: Replace parts of raw text that are semantically important as members of a group but not so much in the individual instances.

textacy.preprocessing.replace.replace_currency_symbols(text: str, replace_with: str = '_CUR_') → str

Replace all currency symbols in text with replace_with.
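A simple way to approximate this is to test each character's Unicode category, since currency symbols fall under "Sc". The sketch below is an illustration with a hypothetical name (mask_currency); the library may instead use a fixed pattern of symbols.

```python
import unicodedata


def mask_currency(text: str, replace_with: str = "_CUR_") -> str:
    """Swap every character in Unicode category Sc for `replace_with`."""
    return "".join(
        replace_with if unicodedata.category(ch) == "Sc" else ch
        for ch in text
    )


print(mask_currency("$5 or €4"))
```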

textacy.preprocessing.replace.replace_emails(text: str, replace_with: str = '_EMAIL_') → str

Replace all email addresses in text with replace_with.
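As an illustration of the idea, the sketch below masks addresses with a deliberately simplified pattern. Real email syntax is much broader than this, and both the pattern and the name mask_emails are assumptions, not the library's own.

```python
import re

# Simplified, hypothetical pattern: local part, "@", dotted domain.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def mask_emails(text: str, replace_with: str = "_EMAIL_") -> str:
    return EMAIL.sub(replace_with, text)


print(mask_emails("Contact jane.doe@example.com today"))
```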

textacy.preprocessing.replace.replace_emojis(text: str, replace_with: str = '_EMOJI_') → str

Replace all emoji and pictographs in text with replace_with.


Note: If your Python has a narrow unicode build (“UCS-2”), only dingbats and miscellaneous symbols are replaced because Python isn’t able to represent the unicode data for things like emoticons. Sorry!

textacy.preprocessing.replace.replace_hashtags(text: str, replace_with: str = '_TAG_') → str

Replace all hashtags in text with replace_with.
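A hashtag can be approximated as "#" followed by word characters, provided the "#" does not sit in the middle of a word. The pattern and the name mask_hashtags below are assumptions for illustration.

```python
import re

# Hypothetical pattern: "#" not preceded by a word character, then a tag.
HASHTAG = re.compile(r"(?<!\w)#\w+")


def mask_hashtags(text: str, replace_with: str = "_TAG_") -> str:
    return HASHTAG.sub(replace_with, text)


print(mask_hashtags("Loving #NLP and #python3!"))
```

The lookbehind keeps strings like "issue#12" intact, on the assumption that a "#" glued to a preceding word is not a hashtag.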

textacy.preprocessing.replace.replace_numbers(text: str, replace_with: str = '_NUMBER_') → str

Replace all numbers in text with replace_with.
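What counts as a "number" is a design decision. The sketch below assumes standalone integers and decimals with optional sign and "," or "." separators, and skips digits embedded in words. Both the pattern and the name mask_numbers are hypothetical.

```python
import re

# Hypothetical pattern: optional sign, digits, optional ","/"." groups,
# not attached to a surrounding word.
NUMBER = re.compile(r"(?<!\w)[+-]?\d+(?:[.,]\d+)*(?!\w)")


def mask_numbers(text: str, replace_with: str = "_NUMBER_") -> str:
    return NUMBER.sub(replace_with, text)


print(mask_numbers("I paid 1,200.50 for 3 items"))
```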

textacy.preprocessing.replace.replace_phone_numbers(text: str, replace_with: str = '_PHONE_') → str

Replace all phone numbers in text with replace_with.

textacy.preprocessing.replace.replace_urls(text: str, replace_with: str = '_URL_') → str

Replace all URLs in text with replace_with.
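URL matching is notoriously fiddly, so the following is only a simplified sketch: it assumes URLs start with "http://", "https://", or "www." and excludes trailing sentence punctuation. The pattern and the name mask_urls are assumptions, not the library's pattern.

```python
import re

# Hypothetical pattern: scheme or "www." prefix, then non-space characters,
# ending on a character that is not sentence punctuation.
URL = re.compile(r"(?:https?://|www\.)[^\s<>\"']*[^\s<>\"'.,;:!?)]")


def mask_urls(text: str, replace_with: str = "_URL_") -> str:
    return URL.sub(replace_with, text)


print(mask_urls("Docs at https://example.com/a?b=1, thanks"))
```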

textacy.preprocessing.replace.replace_user_handles(text: str, replace_with: str = '_USER_') → str

Replace all user handles in text with replace_with.