Text Preprocessing
- Make a callable pipeline that takes a text as input, passes it through one or more functions in sequential order, then outputs a single (preprocessed) text string.
- Normalize all “fancy” bullet point symbols in text to just the basic ASCII “-”.
- Normalize words in text that have been split across lines by a hyphen.
- Normalize all “fancy” single- and double-quotation marks in text to just the basic ASCII equivalents.
- Normalize repeating characters in text by truncating their number of consecutive repetitions.
- Normalize unicode characters in text into canonical forms.
- Replace all contiguous zero-width spaces with an empty string, line-breaking spaces with a single newline, and non-breaking spaces with a single space, then strip any leading/trailing whitespace.
- Remove accents from any accented unicode characters in text.
- Remove text within curly {}, square [], and/or round () brackets, as well as the brackets themselves.
- Remove HTML tags from text.
- Remove punctuation from text.
- Replace all currency symbols in text with repl.
- Replace all email addresses in text with repl.
- Replace all emoji and pictographs in text with repl.
- Replace all hashtags in text with repl.
- Replace all numbers in text with repl.
- Replace all phone numbers in text with repl.
- Replace all URLs in text with repl.
- Replace all (Twitter-style) user handles in text with repl.
Pipeline

textacy.preprocessing.pipeline: Basic functionality for composing multiple preprocessing steps into a single callable pipeline.
textacy.preprocessing.pipeline.make_pipeline(*funcs: Callable[[str], str]) → Callable[[str], str]

Make a callable pipeline that takes a text as input, passes it through one or more functions in sequential order, then outputs a single (preprocessed) text string.

This function is intended as a lightweight convenience for users, allowing them to flexibly specify which preprocessing functions are applied to raw texts, and in which order, then treating the whole thing as a single callable.

>>> from textacy import preprocessing
>>> preproc = preprocessing.make_pipeline(
...     preprocessing.replace.hashtags,
...     preprocessing.replace.user_handles,
...     preprocessing.replace.emojis,
... )
>>> preproc("@spacy_io is OSS for industrial-strength NLP in Python developed by @explosion_ai 💥")
'_USER_ is OSS for industrial-strength NLP in Python developed by _USER_ _EMOJI_'
>>> preproc("hacking with my buddy Isaac Mewton 🥰 #PawProgramming")
'hacking with my buddy Isaac Mewton _EMOJI_ _TAG_'

Parameters:
    *funcs – One or more preprocessing functions, each taking a str as input and returning a str.

Returns:
    Pipeline composed of *funcs that applies each in sequential order.
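The composition itself is straightforward; a minimal stdlib-only sketch of the same idea (not textacy’s actual implementation), with toy steps standing in for the preprocessing functions:

```python
from functools import reduce
from typing import Callable


def make_pipeline(*funcs: Callable[[str], str]) -> Callable[[str], str]:
    """Compose funcs left-to-right into a single str -> str callable."""
    def pipeline(text: str) -> str:
        # Apply each function to the running result, in the order given.
        return reduce(lambda t, f: f(t), funcs, text)
    return pipeline


preproc = make_pipeline(str.strip, str.lower)
print(preproc("  Hello World  "))  # -> hello world
```

With no functions at all, the pipeline is simply the identity, which makes it safe to build the function list dynamically.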
Normalize

textacy.preprocessing.normalize: Normalize aspects of raw text that may vary in problematic ways.
textacy.preprocessing.normalize.bullet_points(text: str) → str

Normalize all “fancy” bullet point symbols in text to just the basic ASCII “-”, provided they are the first non-whitespace characters on a new line (like a list of items).
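The line-anchored behavior can be approximated with a multiline regex; the bullet characters below are an illustrative subset, not textacy’s full set:

```python
import re

# A few common "fancy" bullet characters (illustrative, not exhaustive).
_BULLETS = "•‣⁃◦·*–"


def bullet_points(text: str) -> str:
    # Only match a bullet that is the first non-whitespace char on a line,
    # preserving the leading indentation via the captured group.
    return re.sub(rf"(?m)^(\s*)[{re.escape(_BULLETS)}]", r"\1-", text)


print(bullet_points("• one\n  ‣ two"))  # -> - one\n  - two
```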
textacy.preprocessing.normalize.hyphenated_words(text: str) → str

Normalize words in text that have been split across lines by a hyphen for visual consistency (aka hyphenated) by joining the pieces back together, sans hyphen and whitespace.
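A simple regex captures the idea: join a word piece ending in a hyphen at a line break with the piece that follows. A sketch of the described behavior, not textacy’s exact pattern:

```python
import re


def hyphenated_words(text: str) -> str:
    # Join word pieces split by a hyphen + line break, e.g. "pre-\nprocessing",
    # dropping the hyphen and the intervening whitespace.
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)


print(hyphenated_words("pre-\nprocessing is fun"))  # -> preprocessing is fun
```

Note that ordinary in-line hyphens (“well-known”) are untouched, since the pattern requires a newline after the hyphen.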
textacy.preprocessing.normalize.quotation_marks(text: str) → str

Normalize all “fancy” single- and double-quotation marks in text to just the basic ASCII equivalents. Note that this will also normalize fancy apostrophes, which are typically represented as single quotation marks.
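A single-pass character translation table is a natural fit here; the set of “fancy” marks below is illustrative, not textacy’s exact list:

```python
# Map fancy quotation marks to their basic ASCII equivalents.
_QUOTES = str.maketrans({
    "‘": "'", "’": "'", "‚": "'", "‛": "'",   # single quotes / apostrophes
    "“": '"', "”": '"', "„": '"', "‟": '"',   # double quotes
})


def quotation_marks(text: str) -> str:
    return text.translate(_QUOTES)


print(quotation_marks("“Don’t worry,” she said."))  # -> "Don't worry," she said.
```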
textacy.preprocessing.normalize.repeating_chars(text: str, *, chars: str, maxn: int = 1) → str

Normalize repeating characters in text by truncating their number of consecutive repetitions to maxn.

Parameters:
    text –
    chars – One or more characters whose consecutive repetitions are to be normalized, e.g. “.” or “?!”.
    maxn – Maximum number of consecutive repetitions of chars to which longer repetitions will be truncated.

Returns:
    str
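The truncation can be expressed with a bounded-repetition regex over the whole chars sequence; a sketch of the described behavior, not textacy’s implementation:

```python
import re


def repeating_chars(text: str, *, chars: str, maxn: int = 1) -> str:
    # Match maxn+1 or more consecutive repetitions of the chars sequence
    # and truncate them down to exactly maxn repetitions.
    pattern = f"(?:{re.escape(chars)}){{{maxn + 1},}}"
    return re.sub(pattern, lambda m: chars * maxn, text)


print(repeating_chars("wow!!!!! nice", chars="!", maxn=2))  # -> wow!! nice
```

Because chars is treated as a whole sequence, multi-character inputs like "?!" work too: runs of "?!?!?!" collapse to a single "?!" when maxn=1.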
textacy.preprocessing.normalize.unicode(text: str, *, form: str = 'NFC') → str

Normalize unicode characters in text into canonical forms.

Parameters:
    text –
    form ({"NFC", "NFD", "NFKC", "NFKD"}) – Form of normalization applied to unicode characters. For example, an “e” with acute accent “´” can be written as “e´” (canonical decomposition, “NFD”) or “é” (canonical composition, “NFC”). Unicode can be normalized to NFC form without any change in meaning, so it’s usually a safe bet. If “NFKC”, additional normalizations are applied that can change characters’ meanings, e.g. ellipsis characters are replaced with three periods.
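The same normalization forms are available directly from the stdlib’s unicodedata, which illustrates the NFD/NFC distinction described above:

```python
import unicodedata

# "e" + combining acute accent: two code points (decomposed, NFD-style).
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(composed, len(decomposed), len(composed))  # -> é 2 1

# NFKC also folds "compatibility" characters, e.g. the ellipsis character
# becomes three literal periods -- a change in meaning, not just form.
print(unicodedata.normalize("NFKC", "…"))  # -> ...
```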
Remove

textacy.preprocessing.remove: Remove aspects of raw text that may be unwanted for certain use cases.
textacy.preprocessing.remove.accents(text: str, *, fast: bool = False) → str

Remove accents from any accented unicode characters in text, either by replacing them with ASCII equivalents or removing them entirely.

Parameters:
    text –
    fast – If False, accents are removed from any unicode symbol with a direct ASCII equivalent; if True, accented chars for all unicode symbols are removed, regardless.

Note: fast=True can be significantly faster than fast=False, but its transformation of text is less “safe” and more likely to result in changes of meaning, spelling errors, etc.

Returns:
    str

See also: For a more powerful (but slower) alternative, check out unidecode: https://github.com/avian2/unidecode
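One common stdlib approach, similar in spirit to the behavior described above (decompose, drop combining marks), though it is not textacy’s code:

```python
import unicodedata


def remove_accents(text: str) -> str:
    # NFKD-decompose so accents become separate combining characters,
    # then drop every combining mark.
    return "".join(
        ch for ch in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(ch)
    )


print(remove_accents("naïve café"))  # -> naive cafe
```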
textacy.preprocessing.remove.brackets(text: str, *, only: Optional[str | Collection[str]] = None) → str

Remove text within curly {}, square [], and/or round () brackets, as well as the brackets themselves.

Parameters:
    text –
    only – Remove only those bracketed contents as specified here: “curly”, “square”, and/or “round”. For example, "square" removes only those contents found between square brackets, while ["round", "square"] removes those contents found between square or round brackets, but not curly.

Returns:
    str

Note: This function relies on regular expressions, applied sequentially for curly, square, then round brackets; as such, it doesn’t handle nested brackets of the same type and may behave unexpectedly on text with “wild” use of brackets. It should be fine removing structured bracketed contents, as is often used, for instance, to denote in-text citations.
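The sequential, non-nested strategy the note describes can be sketched like this (a sketch, not textacy’s exact implementation or patterns):

```python
import re


def remove_brackets(text: str) -> str:
    # One pattern per bracket type, applied in order: curly, square, round.
    # [^...]* inside each pattern is what prevents matching across nesting.
    for pattern in (r"\{[^{}]*\}", r"\[[^\[\]]*\]", r"\([^()]*\)"):
        text = re.sub(pattern, "", text)
    return text


print(remove_brackets("A claim [1] with an aside (see above)."))
# -> A claim  with an aside .
```

Note that, as in the real function, the surrounding whitespace is left alone, so removals can leave doubled spaces behind.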
textacy.preprocessing.remove.html_tags(text: str) → str

Remove HTML tags from text, returning just the text found between tags and other non-data elements.

Parameters:
    text –

Returns:
    str

Note: This function relies on the stdlib html.parser.HTMLParser and doesn’t do anything fancy. For a better and potentially faster solution, consider using lxml and/or beautifulsoup4.
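The HTMLParser-based approach the note mentions looks roughly like this; a minimal sketch that collects only character data, not textacy’s code (which also skips non-data elements such as scripts):

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only character data, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def remove_html_tags(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    return "".join(parser.parts)


print(remove_html_tags("<p>Hello <b>world</b></p>"))  # -> Hello world
```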
textacy.preprocessing.remove.punctuation(text: str, *, only: Optional[str | Collection[str]] = None) → str

Remove punctuation from text by replacing all instances of punctuation (or a subset thereof specified by only) with whitespace.

Parameters:
    text –
    only – Remove only those punctuation marks specified here. For example, "." removes only periods, while [",", ";", ":"] removes commas, semicolons, and colons; if None, all unicode punctuation marks are removed.

Returns:
    str

Note: When only=None, Python’s built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used. The former’s performance can be up to an order of magnitude faster.
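The only=None fast path can be sketched with a one-time translation table covering every unicode punctuation mark (category “P*”); an illustration of the technique, not textacy’s code:

```python
import sys
import unicodedata

# Built once: map every codepoint in a "P*" (punctuation) category to a space.
_PUNCT_TABLE = {
    cp: " "
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
}


def remove_punctuation(text: str) -> str:
    return text.translate(_PUNCT_TABLE)


print(remove_punctuation("Hello, world!"))  # -> Hello  world 
```

Building the table scans the whole codepoint range, so it is done once at import time; each call is then a single str.translate() pass, which is what makes this path fast.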
Replace

textacy.preprocessing.replace: Replace parts of raw text that are semantically important as members of a group but not so much in their individual instances. Can also be used to remove such parts by specifying repl="" in function calls.
textacy.preprocessing.replace.currency_symbols(text: str, repl: str = '_CUR_') → str

Replace all currency symbols in text with repl.
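Unicode assigns all currency symbols the category “Sc”, which makes the symbol set easy to collect; a sketch of the idea, not textacy’s exact pattern:

```python
import re
import sys
import unicodedata

# Every codepoint with category "Sc" (currency symbol), as one string.
_CURRENCIES = "".join(
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Sc"
)


def replace_currency_symbols(text: str, repl: str = "_CUR_") -> str:
    return re.sub(f"[{re.escape(_CURRENCIES)}]", repl, text)


print(replace_currency_symbols("coffee: $3.50 or €3?"))
# -> coffee: _CUR_3.50 or _CUR_3?
```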
textacy.preprocessing.replace.emails(text: str, repl: str = '_EMAIL_') → str

Replace all email addresses in text with repl.
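A deliberately simple pattern shows the shape of the replacement; real-world email matching (and textacy’s own regex) is considerably more involved:

```python
import re

# local-part @ domain, requiring at least one dot in the domain.
_EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def replace_emails(text: str, repl: str = "_EMAIL_") -> str:
    return _EMAIL_RE.sub(repl, text)


print(replace_emails("write to ada@example.com today"))  # -> write to _EMAIL_ today
```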
textacy.preprocessing.replace.emojis(text: str, repl: str = '_EMOJI_') → str

Replace all emoji and pictographs in text with repl.

Note: If your Python has a narrow unicode build (“UCS-2”), only dingbats and miscellaneous symbols are replaced because Python isn’t able to represent the unicode data for things like emoticons. Sorry!
textacy.preprocessing.replace.hashtags(text: str, repl: str = '_TAG_') → str

Replace all hashtags in text with repl.
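A minimal version treats a hashtag as “#” plus word characters, not preceded by a word character; a simplification of the real pattern for illustration:

```python
import re

# Negative lookbehind keeps "item#3" from being treated as a hashtag.
_HASHTAG_RE = re.compile(r"(?<!\w)#\w+")


def replace_hashtags(text: str, repl: str = "_TAG_") -> str:
    return _HASHTAG_RE.sub(repl, text)


print(replace_hashtags("loving #PawProgramming today"))  # -> loving _TAG_ today
```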
textacy.preprocessing.replace.numbers(text: str, repl: str = '_NUMBER_') → str

Replace all numbers in text with repl.
textacy.preprocessing.replace.phone_numbers(text: str, repl: str = '_PHONE_') → str

Replace all phone numbers in text with repl.