Text (Pre-)Processing¶
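Normalize words in text that have been split across lines by a hyphen for visual consistency (aka hyphenated) by joining the pieces back together, sans hyphen and whitespace.
Normalize all “fancy” single- and double-quotation marks in text to just the basic ASCII equivalents.
Normalize repeating characters in text by truncating their number of consecutive repetitions to maxn.
Normalize unicode characters in text into canonical forms.
Replace all contiguous zero-width spaces with an empty string, line-breaking spaces with a single newline, and non-breaking spaces with a single space, then strip any leading/trailing whitespace.
Remove accents from any accented unicode characters in text, either by replacing them with ASCII equivalents or removing them entirely.
Remove punctuation from text by replacing all instances of marks with whitespace.
Replace all currency symbols in text with replace_with.
Replace all email addresses in text with replace_with.
Replace all emoji and pictographs in text with replace_with.
Replace all hashtags in text with replace_with.
Replace all numbers in text with replace_with.
Replace all phone numbers in text with replace_with.
Replace all URLs in text with replace_with.
Replace all user handles in text with replace_with.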
Normalize¶
textacy.preprocessing.normalize: Normalize aspects of raw text that may vary in problematic ways.
-
textacy.preprocessing.normalize.normalize_hyphenated_words(text: str) → str
Normalize words in text that have been split across lines by a hyphen for visual consistency (aka hyphenated) by joining the pieces back together, sans hyphen and whitespace.
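The joining step can be illustrated with a minimal standard-library sketch. The function name join_hyphenated_words and the regular expression are assumptions made for illustration, not textacy's implementation, which may handle more cases.

```python
import re

# Hypothetical pattern: a word fragment, a hyphen, a line break (with optional
# surrounding spaces or tabs), then the remainder of the word.
_HYPHEN_BREAK = re.compile(r"(\w+)-[ \t]*\r?\n[ \t]*(\w+)")


def join_hyphenated_words(text: str) -> str:
    """Join word pieces split across lines, dropping the hyphen and whitespace."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


example = "The algorithm is an implemen-\ntation of the idea."
print(join_hyphenated_words(example))
```

A pattern this simple also joins genuinely hyphenated compounds that happen to break at a line end (for example "well-" followed by "known"), so it is only a rough approximation.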
-
textacy.preprocessing.normalize.normalize_quotation_marks(text: str) → str
Normalize all “fancy” single- and double-quotation marks in text to just the basic ASCII equivalents. Note that this will also normalize fancy apostrophes, which are typically represented as single quotation marks.
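As a rough illustration of this kind of mapping, the sketch below translates four common curly quotes to ASCII with str.translate. The table and the name ascii_quotes are assumptions for this example; the set of marks the library actually covers is likely larger.

```python
# Assumed, non-exhaustive mapping of "fancy" quotes to ASCII equivalents.
QUOTE_TABLE = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark, also used as an apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def ascii_quotes(text: str) -> str:
    """Replace the curly quotes listed in QUOTE_TABLE with ASCII quotes."""
    return text.translate(QUOTE_TABLE)


print(ascii_quotes("\u201cIt\u2019s fine,\u201d she said."))
```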
-
textacy.preprocessing.normalize.normalize_repeating_chars(text: str, *, chars: str, maxn: int = 1) → str
Normalize repeating characters in text by truncating their number of consecutive repetitions to maxn.
- Parameters
text –
chars – One or more characters whose consecutive repetitions are to be normalized, e.g. “.” or “?!”.
maxn – Maximum number of consecutive repetitions of chars to which longer repetitions will be truncated.
- Returns
str
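The truncation described above might be sketched with a regular expression as follows. This assumes that a multi-character chars value such as “?!” is treated as one repeating unit; the name truncate_repeats is hypothetical and the library's actual behavior may differ.

```python
import re


def truncate_repeats(text: str, *, chars: str, maxn: int = 1) -> str:
    """Cut runs of more than `maxn` consecutive copies of `chars` down to `maxn`."""
    # Assumption: `chars` is matched as a single unit, so "?!" matches "?!?!?!".
    pattern = r"(?:{}){{{},}}".format(re.escape(chars), maxn + 1)
    # A function replacement avoids backslash interpretation in `chars`.
    return re.sub(pattern, lambda match: chars * maxn, text)


print(truncate_repeats("Wait.....", chars=".", maxn=3))
print(truncate_repeats("What?!?!?!", chars="?!", maxn=1))
```

Runs that are already at or below maxn are left untouched, since the pattern only matches maxn + 1 or more repetitions.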
-
textacy.preprocessing.normalize.normalize_unicode(text: str, *, form: str = 'NFC') → str
Normalize unicode characters in text into canonical forms.
- Parameters
text –
form ({"NFC", "NFD", "NFKC", "NFKD"}) – Form of normalization applied to unicode characters. For example, an “e” with acute accent “´” can be written as “e´” (canonical decomposition, “NFD”) or “é” (canonical composition, “NFC”). Unicode can be normalized to NFC form without any change in meaning, so it’s usually a safe bet. If “NFKC”, additional normalizations are applied that can change characters’ meanings, e.g. ellipsis characters are replaced with three periods.
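The difference between the forms can be checked directly with Python's standard unicodedata module. This sketch does not call textacy; it only demonstrates the normalization forms named above.

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by a combining acute accent (NFD style)
composed = unicodedata.normalize("NFC", decomposed)

# Canonical composition yields the single code point U+00E9, and NFD reverses it.
assert composed == "\u00e9"
assert unicodedata.normalize("NFD", composed) == decomposed

# Compatibility forms may change characters: the ellipsis becomes three periods.
ellipsis = "\u2026"
assert unicodedata.normalize("NFC", ellipsis) == ellipsis
assert unicodedata.normalize("NFKC", ellipsis) == "..."

print(len(decomposed), len(composed))
```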
Remove¶
textacy.preprocessing.remove: Remove aspects of raw text that may be unwanted for certain use cases.
-
textacy.preprocessing.remove.remove_accents(text: str, *, fast: bool = False) → str
Remove accents from any accented unicode characters in text, either by replacing them with ASCII equivalents or removing them entirely.
- Parameters
text –
fast – If False, accents are removed from any unicode symbol with a direct ASCII equivalent; if True, accented chars for all unicode symbols are removed, regardless.
Note
fast=True can be significantly faster than fast=False, but its transformation of text is less “safe” and more likely to result in changes of meaning, spelling errors, etc.
- Returns
str
- Raises
ValueError – If method is not in {“unicode”, “ascii”}.
See also
For a more powerful (but slower) alternative, check out unidecode: https://github.com/avian2/unidecode
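To make the trade-off between the two modes concrete, here is one plausible standard-library sketch of each. Both function names are hypothetical, and the assumption that the modes correspond to “drop combining marks” versus “encode to ASCII and ignore failures” is mine, not something the text above confirms.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Careful mode (assumed): decompose, then drop only combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_accents_fast(text: str) -> str:
    """Fast mode (assumed): decompose, then discard everything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


sample = "na\u00efve \u65e5\u672c"  # "naïve" followed by two CJK characters
print(repr(strip_accents(sample)))       # CJK characters are kept
print(repr(strip_accents_fast(sample)))  # CJK characters are discarded
```

Under these assumptions the fast variant silently deletes any character without an ASCII form, which is the kind of change in meaning the note above warns about.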
-
textacy.preprocessing.remove.remove_punctuation(text: str, *, marks: Optional[str] = None) → str
Remove punctuation from text by replacing all instances of marks with whitespace.
- Parameters
text –
marks – Remove only those punctuation marks specified here. For example, “,;:” removes commas, semi-colons, and colons. If None, all unicode punctuation marks are removed.
- Returns
str
Note
When marks=None, Python’s built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used. The former is about 5-10x faster.
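The two code paths mentioned in the note can be sketched as below. The name strip_punctuation, the way the translation table is built, and the choice to collapse a run of listed marks into one space are all assumptions for illustration rather than the library's confirmed behavior.

```python
import re
import sys
import unicodedata
from typing import Optional

# Assumed table: every code point in a Unicode "P*" (punctuation) category
# is mapped to a single space. Built once at import time.
_PUNCT_TABLE = {
    cp: " "
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
}


def strip_punctuation(text: str, *, marks: Optional[str] = None) -> str:
    """Replace punctuation with whitespace, optionally limited to `marks`."""
    if marks is None:
        # Table lookup per character via str.translate().
        return text.translate(_PUNCT_TABLE)
    # Regex path: a run of the listed marks becomes one space (an assumption).
    return re.sub("[{}]+".format(re.escape(marks)), " ", text)


print(repr(strip_punctuation("Hello, world!")))
print(repr(strip_punctuation("a,b;c", marks=",;")))
```

Because marks are replaced with spaces rather than deleted, the result can contain doubled or trailing spaces that a later whitespace-normalization step would clean up.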
Replace¶
textacy.preprocessing.replace: Replace parts of raw text that are semantically important as members of a group but not so much in the individual instances.
-
textacy.preprocessing.replace.replace_currency_symbols(text: str, replace_with: str = '_CUR_') → str
Replace all currency symbols in text with replace_with.
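A minimal sketch of the idea, using the Unicode “Sc” (currency symbol) category from the standard library. The name mask_currency is hypothetical, and the library may well match a fixed list of symbols instead of a Unicode category.

```python
import unicodedata


def mask_currency(text: str, replace_with: str = "_CUR_") -> str:
    """Replace each character in Unicode category "Sc" with `replace_with`."""
    return "".join(
        replace_with if unicodedata.category(ch) == "Sc" else ch
        for ch in text
    )


print(mask_currency("$5 or \u20ac4"))
```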
-
textacy.preprocessing.replace.replace_emails(text: str, replace_with: str = '_EMAIL_') → str
Replace all email addresses in text with replace_with.
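The same replace-with-placeholder pattern could be sketched for email addresses with a deliberately simple regular expression. Both the pattern and the name mask_emails are assumptions for illustration; real-world address matching, including whatever the library uses, is considerably more involved.

```python
import re

# Simplified, assumed pattern: local part, "@", then a dotted domain name.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_emails(text: str, replace_with: str = "_EMAIL_") -> str:
    """Replace anything that looks like an email address with `replace_with`."""
    return _EMAIL.sub(lambda match: replace_with, text)


print(mask_emails("Write to jane.doe@example.com today."))
```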
-
textacy.preprocessing.replace.replace_emojis(text: str, replace_with: str = '_EMOJI_') → str
Replace all emoji and pictographs in text with replace_with.
Note
If your Python has a narrow unicode build (“UCS-2”), only dingbats and miscellaneous symbols are replaced because Python isn’t able to represent the unicode data for things like emoticons. Sorry!
Replace all hashtags in text with replace_with.
-
textacy.preprocessing.replace.replace_numbers(text: str, replace_with: str = '_NUMBER_') → str
Replace all numbers in text with replace_with.
-
textacy.preprocessing.replace.replace_phone_numbers(text: str, replace_with: str = '_PHONE_') → str
Replace all phone numbers in text with replace_with.