Text Preprocessing

pipeline.make_pipeline

Make a callable pipeline that takes a text as input, passes it through one or more functions in sequential order, then outputs a single (preprocessed) text string.

normalize.bullet_points

Normalize all “fancy” bullet point symbols in text to just the basic ASCII "-", provided they are the first non-whitespace characters on a new line (like a list of items).

normalize.hyphenated_words

Normalize words in text that have been split across lines by a hyphen for visual consistency (aka hyphenated) by joining the pieces back together, sans hyphen and whitespace.

normalize.quotation_marks

Normalize all “fancy” single- and double-quotation marks in text to just the basic ASCII equivalents.

normalize.repeating_chars

Normalize repeating characters in text by truncating their number of consecutive repetitions to maxn.

normalize.unicode

Normalize unicode characters in text into canonical forms.

normalize.whitespace

Replace all contiguous zero-width spaces with an empty string, line-breaking spaces with a single newline, and non-breaking spaces with a single space, then strip any leading/trailing whitespace.

remove.accents

Remove accents from any accented unicode characters in text, either by replacing them with ASCII equivalents or removing them entirely.

remove.brackets

Remove text within curly {}, square [], and/or round () brackets, as well as the brackets themselves.

remove.html_tags

Remove HTML tags from text, returning just the text found between tags and other non-data elements.

remove.punctuation

Remove punctuation from text by replacing all instances of punctuation (or a subset thereof specified by only) with whitespace.

replace.currency_symbols

Replace all currency symbols in text with repl.

replace.emails

Replace all email addresses in text with repl.

replace.emojis

Replace all emoji and pictographs in text with repl.

replace.hashtags

Replace all hashtags in text with repl.

replace.numbers

Replace all numbers in text with repl.

replace.phone_numbers

Replace all phone numbers in text with repl.

replace.urls

Replace all URLs in text with repl.

replace.user_handles

Replace all (Twitter-style) user handles in text with repl.

Pipeline

textacy.preprocessing.pipeline: Basic functionality for composing multiple preprocessing steps into a single callable pipeline.

textacy.preprocessing.pipeline.make_pipeline(*funcs: Callable[[str], str]) → Callable[[str], str]

Make a callable pipeline that takes a text as input, passes it through one or more functions in sequential order, then outputs a single (preprocessed) text string.

This function is intended as a lightweight convenience for users, allowing them to flexibly specify which (and in which order) preprocessing functions are to be applied to raw texts, then treating the whole thing as a single callable.

>>> from textacy import preprocessing
>>> preproc = preprocessing.make_pipeline(
...     preprocessing.replace.hashtags,
...     preprocessing.replace.user_handles,
...     preprocessing.replace.emojis,
... )
>>> preproc("@spacy_io is OSS for industrial-strength NLP in Python developed by @explosion_ai 💥")
'_USER_ is OSS for industrial-strength NLP in Python developed by _USER_ _EMOJI_'
>>> preproc("hacking with my buddy Isaac Mewton 🥰 #PawProgramming")
'hacking with my buddy Isaac Mewton _EMOJI_ _TAG_'
Parameters
  • *funcs – One or more text-preprocessing functions, each taking a single string as input and returning a single string as output, to be applied in sequential order.

Returns

Pipeline composed of *funcs that applies each in sequential order.
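
Under the hood, such a pipeline is just left-to-right function composition. A minimal sketch of the idea (not textacy's actual implementation) could look like:

```python
from functools import reduce
from typing import Callable

def make_pipeline(*funcs: Callable[[str], str]) -> Callable[[str], str]:
    """Compose funcs left-to-right into a single text -> text callable."""
    def pipeline(text: str) -> str:
        # feed the text through each function, in the order given
        return reduce(lambda t, f: f(t), funcs, text)
    return pipeline

# toy steps standing in for textacy's preprocessing functions
preproc = make_pipeline(str.strip, str.lower)
print(preproc("  Hello World  "))  # -> 'hello world'
```

With no functions at all, the composed pipeline simply returns its input unchanged.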

Normalize

textacy.preprocessing.normalize: Normalize aspects of raw text that may vary in problematic ways.

textacy.preprocessing.normalize.bullet_points(text: str) → str

Normalize all “fancy” bullet point symbols in text to just the basic ASCII "-", provided they are the first non-whitespace characters on a new line (like a list of items).
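
The "first non-whitespace characters on a new line" condition can be illustrated with a small regex-based sketch; the bullet character set here is a sample, not textacy's full set:

```python
import re

# a few common "fancy" bullet characters; textacy's actual set is larger
_BULLETS = "•‣⁃◦▪▸"

_BULLET_RE = re.compile(rf"(?m)^([ \t]*)[{re.escape(_BULLETS)}][ \t]*")

def bullet_points(text: str) -> str:
    """Replace a leading fancy bullet on each line with an ASCII '-'."""
    return _BULLET_RE.sub(r"\1- ", text)

normalized = bullet_points("• first item\n  ‣ second item")
# normalized == '- first item\n  - second item'
```

Because the pattern is anchored at the start of each line, a bullet character used mid-sentence (e.g. as a multiplication dot) is left alone.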

textacy.preprocessing.normalize.hyphenated_words(text: str) → str

Normalize words in text that have been split across lines by a hyphen for visual consistency (aka hyphenated) by joining the pieces back together, sans hyphen and whitespace.
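
A rough sketch of the idea (simplified relative to textacy's implementation; this version simply rejoins the pieces and drops the line break):

```python
import re

# word chars, a hyphen, an (optionally padded) line break, then more word chars
_HYPHENATED_RE = re.compile(r"(\w+)-[ \t]*\n[ \t]*(\w+)")

def hyphenated_words(text: str) -> str:
    """Rejoin words split across a line break by a hyphen."""
    return _HYPHENATED_RE.sub(r"\1\2", text)

rejoined = hyphenated_words("pre-\nprocessing")
# rejoined == 'preprocessing'
```

Ordinary intra-word hyphens with no line break after them (e.g. "well-being") do not match the pattern and are preserved.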

textacy.preprocessing.normalize.quotation_marks(text: str) → str

Normalize all “fancy” single- and double-quotation marks in text to just the basic ASCII equivalents. Note that this will also normalize fancy apostrophes, which are typically represented as single quotation marks.
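
The core of such a normalization is a translation table from fancy quote characters to their ASCII equivalents; a small illustrative mapping (textacy's actual table covers more code points) looks like:

```python
# illustrative subset of fancy-quote code points
_QUOTE_TRANSLATION = str.maketrans({
    "‘": "'", "’": "'", "‚": "'", "‛": "'",   # single quotes / apostrophes
    "“": '"', "”": '"', "„": '"', "‟": '"',   # double quotes
})

def quotation_marks(text: str) -> str:
    return text.translate(_QUOTE_TRANSLATION)

ascii_quotes = quotation_marks("“It’s fine,” she said.")
# ascii_quotes == '"It\'s fine," she said.'
```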

textacy.preprocessing.normalize.repeating_chars(text: str, *, chars: str, maxn: int = 1) → str

Normalize repeating characters in text by truncating their number of consecutive repetitions to maxn.

Parameters
  • text

  • chars – One or more characters whose consecutive repetitions are to be normalized, e.g. “.” or “?!”.

  • maxn – Maximum number of consecutive repetitions of chars to which longer repetitions will be truncated.

Returns

str
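
As an illustration of the chars/maxn semantics, a minimal regex-based sketch (not textacy's actual implementation) might look like:

```python
import re

def repeating_chars(text: str, *, chars: str, maxn: int = 1) -> str:
    """Truncate runs of more than `maxn` consecutive repetitions of `chars`."""
    # match maxn+1 or more consecutive repetitions of the (escaped) sequence
    pattern = re.compile(rf"({re.escape(chars)}){{{maxn + 1},}}")
    return pattern.sub(chars * maxn, text)

print(repeating_chars("No way!!!!!", chars="!", maxn=1))    # -> 'No way!'
print(repeating_chars("What?!?!?!?!", chars="?!", maxn=2))  # -> 'What?!?!'
```

Note that chars is treated as one repeating unit, so "?!" truncates runs of the two-character sequence, not of "?" and "!" individually.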

textacy.preprocessing.normalize.unicode(text: str, *, form: str = 'NFC') → str

Normalize unicode characters in text into canonical forms.

Parameters
  • text

  • form ({"NFC", "NFD", "NFKC", "NFKD"}) – Form of normalization applied to unicode characters. For example, an “e” with acute accent “´” can be written as “e´” (canonical decomposition, “NFD”) or “é” (canonical composition, “NFC”). Unicode can be normalized to NFC form without any change in meaning, so it’s usually a safe bet. If “NFKC”, additional normalizations are applied that can change characters’ meanings, e.g. ellipsis characters are replaced with three periods.
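
These are the standard Unicode normalization forms exposed by the stdlib's unicodedata.normalize (which this function presumably wraps), so the difference can be seen directly:

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by a combining acute accent (NFD-style)
composed = unicodedata.normalize("NFC", decomposed)
print(composed == "é", len(decomposed), len(composed))  # -> True 2 1

# NFKC additionally applies compatibility mappings that can change meaning:
print(unicodedata.normalize("NFKC", "…"))  # -> '...'
```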

textacy.preprocessing.normalize.whitespace(text: str) → str

Replace all contiguous zero-width spaces with an empty string, line-breaking spaces with a single newline, and non-breaking spaces with a single space, then strip any leading/trailing whitespace.
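
The three replacement steps plus the final strip can be sketched with stdlib regexes (the exact character sets here are illustrative, not textacy's):

```python
import re

_ZERO_WIDTH_RE = re.compile(r"[\u200b\u2060\ufeff]+")       # zero-width spaces
_LINEBREAK_RE = re.compile(r"[ \t]*[\r\n\v\f]+[ \t]*")      # line-breaking runs
_NONBREAKING_RE = re.compile(r"[ \t\u00a0\u2007\u202f]+")   # horizontal spaces

def whitespace(text: str) -> str:
    text = _ZERO_WIDTH_RE.sub("", text)     # drop zero-width spaces entirely
    text = _LINEBREAK_RE.sub("\n", text)    # collapse line breaks to one '\n'
    text = _NONBREAKING_RE.sub(" ", text)   # collapse other runs to one ' '
    return text.strip()

cleaned = whitespace("Hello\u200b,\u00a0 world! \n\n Bye.")
# cleaned == 'Hello, world!\nBye.'
```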

Remove

textacy.preprocessing.remove: Remove aspects of raw text that may be unwanted for certain use cases.

textacy.preprocessing.remove.accents(text: str, *, fast: bool = False) → str

Remove accents from any accented unicode characters in text, either by replacing them with ASCII equivalents or removing them entirely.

Parameters
  • text

  • fast

    If False, accents are removed from any unicode symbol with a direct ASCII equivalent; if True, accented chars for all unicode symbols are removed, regardless of whether an ASCII equivalent exists.

    Note

    fast=True can be significantly faster than fast=False, but its transformation of text is less “safe” and more likely to result in changes of meaning, spelling errors, etc.

Returns

str

See also

For a more powerful (but slower) alternative, check out unidecode: https://github.com/avian2/unidecode
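
The fast path corresponds roughly to the classic decompose-and-strip-combining-marks approach, which can be sketched with the stdlib alone (an approximation, not textacy's exact code):

```python
import unicodedata

def remove_accents_fast(text: str) -> str:
    """Decompose characters, then drop the combining (accent) marks."""
    return "".join(
        ch for ch in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(ch)
    )

print(remove_accents_fast("café naïve"))  # -> 'cafe naive'
```

This is fast but blunt: for symbols with no ASCII equivalent it simply drops the marks, which is exactly the "less safe" behavior the note above warns about.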

textacy.preprocessing.remove.brackets(text: str, *, only: Optional[str | Collection[str]] = None) → str

Remove text within curly {}, square [], and/or round () brackets, as well as the brackets themselves.

Parameters
  • text

  • only – Remove only those bracketed contents as specified here: “curly”, “square”, and/or “round”. For example, "square" removes only those contents found between square brackets, while ["round", "square"] removes those contents found between square or round brackets, but not curly.

Returns

str

Note

This function relies on regular expressions, applied sequentially for curly, square, then round brackets; as such, it doesn’t handle nested brackets of the same type and may behave unexpectedly on text with “wild” use of brackets. It should be fine removing structured bracketed contents, as is often used, for instance, to denote in-text citations.
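
The sequential-regex approach described in the note can be sketched as follows (a simplified stand-in for textacy's implementation):

```python
import re

# one pattern per bracket type; [^...] guards keep matches non-greedy
_BRACKET_RES = {
    "curly": re.compile(r"\{[^{}]*\}"),
    "square": re.compile(r"\[[^\[\]]*\]"),
    "round": re.compile(r"\([^()]*\)"),
}

def remove_brackets(text: str, *, only=None) -> str:
    kinds = _BRACKET_RES if only is None else (
        [only] if isinstance(only, str) else only
    )
    for kind in kinds:  # applied sequentially: curly, square, then round
        text = _BRACKET_RES[kind].sub("", text)
    return text

print(remove_brackets("See [1] the results (p < 0.05) here."))
# -> 'See  the results  here.'
```

As the output shows, surrounding whitespace is left in place, and nested brackets of the same type would only be partially removed, matching the caveat in the note.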

textacy.preprocessing.remove.html_tags(text: str) → str

Remove HTML tags from text, returning just the text found between tags and other non-data elements.

Parameters
  • text

Returns

str

Note

This function relies on the stdlib html.parser.HTMLParser and doesn’t do anything fancy. For a better and potentially faster solution, consider using lxml and/or beautifulsoup4.
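
A simplified version of this HTMLParser-based approach (textacy's own subclass handles more cases) looks like:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only character data, skipping tags, comments, etc."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def remove_html_tags(html: str) -> str:
    extractor = _TextExtractor()
    extractor.feed(html)
    return "".join(extractor.parts)

print(remove_html_tags("<p>Hello, <b>world</b>!</p>"))  # -> 'Hello, world!'
```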

textacy.preprocessing.remove.punctuation(text: str, *, only: Optional[str | Collection[str]] = None) → str

Remove punctuation from text by replacing all instances of punctuation (or a subset thereof specified by only) with whitespace.

Parameters
  • text

  • only – Remove only those punctuation marks specified here. For example, "." removes only periods, while [",", ";", ":"] removes commas, semicolons, and colons; if None, all unicode punctuation marks are removed.

Returns

str

Note

When only=None, Python’s built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used. The former’s performance can be up to an order of magnitude faster.
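
The two code paths described in the note can be sketched as follows (a simplified stand-in, not textacy's actual code):

```python
import re
import sys
import unicodedata

# translation table mapping every unicode punctuation char ("P*" categories)
# to a space; built once up front, since the full code-point scan takes a moment
_PUNCT_TABLE = {
    cp: " " for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
}

def remove_punctuation(text: str, *, only=None) -> str:
    if only is None:
        # fast path: one pass through the precomputed translation table
        return text.translate(_PUNCT_TABLE)
    # subset path: build a character-class regex from the requested marks
    marks = "".join([only] if isinstance(only, str) else only)
    return re.sub(rf"[{re.escape(marks)}]", " ", text)

print(remove_punctuation("Hi, there!"))             # -> 'Hi  there '
print(remove_punctuation("Hi, there!", only=","))   # -> 'Hi  there!'
```

Precomputing the table is what makes the only=None path so much faster: each call is a single table lookup per character, with no regex machinery involved.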

Replace

textacy.preprocessing.replace: Replace parts of raw text that are semantically important as members of a group but not so much in the individual instances. Can also be used to remove such parts by specifying repl="" in function calls.

textacy.preprocessing.replace.currency_symbols(text: str, repl: str = '_CUR_') → str

Replace all currency symbols in text with repl.

textacy.preprocessing.replace.emails(text: str, repl: str = '_EMAIL_') → str

Replace all email addresses in text with repl.

textacy.preprocessing.replace.emojis(text: str, repl: str = '_EMOJI_') → str

Replace all emoji and pictographs in text with repl.

Note

If your Python has a narrow unicode build (“UCS-2”), only dingbats and miscellaneous symbols are replaced because Python isn’t able to represent the unicode data for things like emoticons. Sorry!

textacy.preprocessing.replace.hashtags(text: str, repl: str = '_TAG_') → str

Replace all hashtags in text with repl.

textacy.preprocessing.replace.numbers(text: str, repl: str = '_NUMBER_') → str

Replace all numbers in text with repl.

textacy.preprocessing.replace.phone_numbers(text: str, repl: str = '_PHONE_') → str

Replace all phone numbers in text with repl.

textacy.preprocessing.replace.urls(text: str, repl: str = '_URL_') → str

Replace all URLs in text with repl.

textacy.preprocessing.replace.user_handles(text: str, repl: str = '_USER_') → str

Replace all (Twitter-style) user handles in text with repl.
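
All of the replace.* functions above follow the same pattern: a precompiled regex plus re.sub with repl. As a representative sketch, a much-simplified handle matcher (textacy's real pattern also guards against other edge cases) could look like:

```python
import re

# simplified: an '@' not preceded by a word char or another '@',
# followed by up to 15 word chars (Twitter's handle length limit)
_HANDLE_RE = re.compile(r"(?<![\w@])@\w{1,15}\b")

def replace_user_handles(text: str, repl: str = "_USER_") -> str:
    return _HANDLE_RE.sub(repl, text)

print(replace_user_handles("thanks @spacy_io and @explosion_ai!"))
# -> 'thanks _USER_ and _USER_!'
```

The negative lookbehind is what keeps the "@" inside an email address (e.g. a@b.com) from being treated as a user handle.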