Text (Pre-)Processing

`normalize.normalize_hyphenated_words`	Normalize words in `text` that have been split across lines by a hyphen for visual consistency (aka hyphenated) by joining the pieces back together, sans hyphen and whitespace.
`normalize.normalize_quotation_marks`	Normalize all “fancy” single- and double-quotation marks in `text` to just the basic ASCII equivalents.
`normalize.normalize_repeating_chars`	Normalize repeating characters in `text` by truncating their number of consecutive repetitions to `maxn`.
`normalize.normalize_unicode`	Normalize unicode characters in `text` into canonical forms.
`normalize.normalize_whitespace`	Replace all contiguous zero-width spaces with an empty string, line-breaking spaces with a single newline, and non-breaking spaces with a single space, then strip any leading/trailing whitespace.
`remove.remove_accents`	Remove accents from any accented unicode characters in `text`, either by replacing them with ASCII equivalents or removing them entirely.
`remove.remove_punctuation`	Remove punctuation from `text` by replacing all instances of `marks` with whitespace.
`replace.replace_currency_symbols`	Replace all currency symbols in `text` with `replace_with`.
`replace.replace_emails`	Replace all email addresses in `text` with `replace_with`.
`replace.replace_emojis`	Replace all emoji and pictographs in `text` with `replace_with`.
`replace.replace_hashtags`	Replace all hashtags in `text` with `replace_with`.
`replace.replace_numbers`	Replace all numbers in `text` with `replace_with`.
`replace.replace_phone_numbers`	Replace all phone numbers in `text` with `replace_with`.
`replace.replace_urls`	Replace all URLs in `text` with `replace_with`.
`replace.replace_user_handles`	Replace all user handles in `text` with `replace_with`.

Normalize

textacy.preprocessing.normalize: Normalize aspects of raw text that may vary in problematic ways.

textacy.preprocessing.normalize.normalize_hyphenated_words(text: str) → str: Normalize words in text that have been split across lines by a hyphen for visual consistency (aka hyphenated) by joining the pieces back together, sans hyphen and whitespace.
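To make the joining behavior concrete, here is a minimal standard-library sketch of the idea. It is not textacy's implementation: the function name `join_hyphenated` and the regular expression are assumptions chosen for illustration, and the real function may apply stricter rules about which fragments count as a split word.

```python
import re

# Hypothetical pattern: a word fragment, a hyphen, a line break (with optional
# spaces or tabs around it), then the remainder of the word.
_HYPHEN_BREAK = re.compile(r"(\w+)-[ \t]*\n[ \t]*(\w+)")


def join_hyphenated(text: str) -> str:
    """Join words split across lines, dropping the hyphen and the whitespace."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


sample = "Text normal-\nization joins split words."
print(join_hyphenated(sample))
```

Hyphens that are not followed by a line break, as in "well-known", are left untouched by this sketch.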

textacy.preprocessing.normalize.normalize_quotation_marks(text: str) → str: Normalize all “fancy” single- and double-quotation marks in text to just the basic ASCII equivalents. Note that this will also normalize fancy apostrophes, which are typically represented as single quotation marks.
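The effect can be approximated with a translation table, as in the sketch below. The helper name `straighten_quotes` and the four-entry mapping are assumptions for illustration; the library presumably handles a wider set of quotation marks.

```python
# Assumed mapping that covers only the most common curly quotes.
_QUOTES = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark, also used as an apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Replace curly single and double quotes with their ASCII equivalents."""
    return text.translate(_QUOTES)


print(straighten_quotes("\u201cdon\u2019t\u201d"))
```

Because the apostrophe in "don’t" is the same code point as a right single quotation mark, it is straightened along with the quotes.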

textacy.preprocessing.normalize.normalize_repeating_chars(text: str, *, chars: str, maxn: int = 1) → str

Normalize repeating characters in text by truncating their number of consecutive repetitions to maxn.

Parameters

text –
chars – One or more characters whose consecutive repetitions are to be normalized, e.g. “.” or “?!”.
maxn – Maximum number of consecutive repetitions of chars to which longer repetitions will be truncated.

Returns

str
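As a rough illustration of the truncation described above, the following sketch treats `chars` as a single unit and cuts any run longer than `maxn` down to exactly `maxn` repetitions. The name `truncate_repeats` is hypothetical, and treating a multi-character `chars` such as "?!" as one repeated unit is an assumption about the intended behavior.

```python
import re


def truncate_repeats(text: str, *, chars: str, maxn: int = 1) -> str:
    """Truncate runs of `chars` longer than `maxn` to exactly `maxn` copies."""
    # Match maxn + 1 or more consecutive copies of `chars`, taken as a unit.
    pattern = r"(?:{}){{{},}}".format(re.escape(chars), maxn + 1)
    return re.sub(pattern, chars * maxn, text)


print(truncate_repeats("Wait.......", chars=".", maxn=3))
print(truncate_repeats("What?!?!?!", chars="?!", maxn=1))
```

Runs that are already at or below `maxn` repetitions do not match the pattern and are returned unchanged.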

textacy.preprocessing.normalize.normalize_unicode(text: str, *, form: str = 'NFC') → str

Normalize unicode characters in text into canonical forms.

Parameters

text –
form ({"NFC", "NFD", "NFKC", "NFKD"}) – Form of normalization applied to unicode characters. For example, an “e” with acute accent “´” can be written as “e´” (canonical decomposition, “NFD”) or “é” (canonical composition, “NFC”). Unicode can be normalized to NFC form without any change in meaning, so it’s usually a safe bet. If “NFKC”, additional normalizations are applied that can change characters’ meanings, e.g. ellipsis characters are replaced with three periods.
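The difference between the forms can be seen directly with Python's standard `unicodedata` module, which provides the same four normalization forms. This sketch only demonstrates the forms themselves; whether the library delegates to `unicodedata.normalize` is an assumption, and the variable names are illustrative.

```python
import unicodedata

# "e" followed by a combining acute accent (two code points).
decomposed = "e\u0301"

# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# NFD reverses the composition.
round_trip = unicodedata.normalize("NFD", composed)

# The ellipsis character survives NFC but is expanded by NFKC.
ellipsis = "\u2026"
ellipsis_nfc = unicodedata.normalize("NFC", ellipsis)
ellipsis_nfkc = unicodedata.normalize("NFKC", ellipsis)

print(len(decomposed), len(composed), repr(ellipsis_nfkc))
```

The compatibility forms ("NFKC", "NFKD") are the ones that rewrite characters such as the ellipsis, which is why the canonical forms are the safer default.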

Remove

textacy.preprocessing.remove: Remove aspects of raw text that may be unwanted for certain use cases.

textacy.preprocessing.remove.remove_accents(text: str, *, fast: bool = False) → str

Remove accents from any accented unicode characters in text, either by replacing them with ASCII equivalents or removing them entirely.

Parameters

text –
fast –
If False, accents are removed from any unicode symbol with a direct ASCII equivalent; if True, accented chars for all unicode symbols are removed, regardless.

Note

fast=True can be significantly faster than fast=False, but its transformation of text is less “safe” and more likely to result in changes of meaning, spelling errors, etc.

Returns

str

Raises

ValueError – If method is not in {“unicode”, “ascii”}.
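The trade-off between the two modes can be sketched with the standard library. This is one common way to strip accents, not necessarily what the library does: the helper name `strip_accents` and the mapping of `fast` onto these two strategies are assumptions made for illustration.

```python
import unicodedata


def strip_accents(text: str, *, fast: bool = False) -> str:
    """Remove accents, either conservatively or by forcing ASCII."""
    # Decompose accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    if fast:
        # Drops every non-ASCII code point, whether or not it is an accent.
        return decomposed.encode("ascii", errors="ignore").decode("ascii")
    # Drops only the combining marks; other non-ASCII characters survive.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("café ß"))
print(strip_accents("café ß", fast=True))
```

In this sketch the slower path keeps "ß" because it carries no combining mark, while the fast path silently deletes it, which is the kind of meaning-changing loss the note above warns about.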

Replace

textacy.preprocessing.replace: Replace parts of raw text that are semantically important as members of a group but not so much in the individual instances.

textacy.preprocessing.replace.replace_currency_symbols(text: str, replace_with: str = '_CUR_') → str: Replace all currency symbols in text with replace_with.

textacy.preprocessing.replace.replace_emails(text: str, replace_with: str = '_EMAIL_') → str: Replace all email addresses in text with replace_with.

textacy.preprocessing.replace.replace_emojis(text: str, replace_with: str = '_EMOJI_') → str: Replace all emoji and pictographs in text with replace_with.

Note

If your Python has a narrow unicode build (“UCS-2”), only dingbats and miscellaneous symbols are replaced because Python isn’t able to represent the unicode data for things like emoticons. Sorry!

textacy.preprocessing.replace.replace_hashtags(text: str, replace_with: str = '_TAG_') → str: Replace all hashtags in text with replace_with.

textacy.preprocessing.replace.replace_numbers(text: str, replace_with: str = '_NUMBER_') → str: Replace all numbers in text with replace_with.

textacy.preprocessing.replace.replace_phone_numbers(text: str, replace_with: str = '_PHONE_') → str: Replace all phone numbers in text with replace_with.

textacy.preprocessing.replace.replace_urls(text: str, replace_with: str = '_URL_') → str: Replace all URLs in text with replace_with.

textacy.preprocessing.replace.replace_user_handles(text: str, replace_with: str = '_USER_') → str: Replace all user handles in text with replace_with.
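All of the replace functions share one shape: find every instance of a class of token and substitute a placeholder. The sketch below shows that shape for hashtags and user handles using deliberately simplified patterns. The helper names `sub_hashtags` and `sub_user_handles` and both regular expressions are hypothetical; the library's own patterns are likely more thorough.

```python
import re

# Simplified, hypothetical patterns: a marker character that does not follow
# a word character, then one or more word characters.
_HASHTAG = re.compile(r"(?<!\w)#\w+")
_HANDLE = re.compile(r"(?<!\w)@\w+")


def sub_hashtags(text: str, replace_with: str = "_TAG_") -> str:
    """Replace each hashtag-like token with a placeholder."""
    return _HASHTAG.sub(replace_with, text)


def sub_user_handles(text: str, replace_with: str = "_USER_") -> str:
    """Replace each handle-like token with a placeholder."""
    return _HANDLE.sub(replace_with, text)


tweet = "Thanks @ana for the #nlp tips"
print(sub_user_handles(sub_hashtags(tweet)))
```

The lookbehind keeps the handle pattern from firing inside an email address such as a@b.com, which illustrates why the order and strictness of such replacements matter when several are chained.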
