Text Preprocessing
- Make a callable pipeline that takes a text as input, passes it through one or more functions in sequential order, then outputs a single (preprocessed) text string.
- Normalize all “fancy” bullet point symbols in text to just the basic ASCII “-”.
- Normalize words in text that have been split across lines by a hyphen.
- Normalize all “fancy” single- and double-quotation marks in text to just the basic ASCII equivalents.
- Normalize repeating characters in text by truncating their number of consecutive repetitions.
- Normalize unicode characters in text into canonical forms.
- Replace all contiguous zero-width spaces with an empty string, line-breaking spaces with a single newline, and non-breaking spaces with a single space, then strip any leading/trailing whitespace.
- Remove accents from any accented unicode characters in text.
- Remove text within curly {}, square [], and/or round () brackets, as well as the brackets themselves.
- Remove HTML tags from text.
- Remove punctuation from text.
- Replace all currency symbols in text with repl.
- Replace all email addresses in text with repl.
- Replace all emoji and pictographs in text with repl.
- Replace all hashtags in text with repl.
- Replace all numbers in text with repl.
- Replace all phone numbers in text with repl.
- Replace all URLs in text with repl.
- Replace all (Twitter-style) user handles in text with repl.
Pipeline

textacy.preprocessing.pipeline: Basic functionality for composing multiple preprocessing steps into a single callable pipeline.
textacy.preprocessing.pipeline.make_pipeline(*funcs: Callable[[str], str]) → Callable[[str], str]

Make a callable pipeline that takes a text as input, passes it through one or more functions in sequential order, then outputs a single (preprocessed) text string.

This function is intended as a lightweight convenience for users, allowing them to flexibly specify which preprocessing functions are applied to raw texts, and in which order, then treating the whole thing as a single callable.

>>> from textacy import preprocessing
>>> preproc = preprocessing.make_pipeline(
...     preprocessing.replace.hashtags,
...     preprocessing.replace.user_handles,
...     preprocessing.replace.emojis,
... )
>>> preproc("@spacy_io is OSS for industrial-strength NLP in Python developed by @explosion_ai 💥")
'_USER_ is OSS for industrial-strength NLP in Python developed by _USER_ _EMOJI_'
>>> preproc("hacking with my buddy Isaac Mewton 🥰 #PawProgramming")
'hacking with my buddy Isaac Mewton _EMOJI_ _TAG_'

Parameters:
    *funcs – One or more preprocessing functions, each taking a str as input and returning a str.

Returns:
    Pipeline composed of *funcs that applies each in sequential order.
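The composition itself is straightforward; a minimal stdlib-only sketch of the same idea (not textacy’s actual implementation), with toy steps standing in for the preprocessing functions:

```python
from functools import reduce
from typing import Callable


def make_pipeline(*funcs: Callable[[str], str]) -> Callable[[str], str]:
    """Compose funcs left-to-right into a single str -> str callable."""
    def pipeline(text: str) -> str:
        # Apply each function to the running result, in the order given.
        return reduce(lambda t, f: f(t), funcs, text)
    return pipeline


preproc = make_pipeline(str.strip, str.lower)
print(preproc("  Hello World  "))  # -> hello world
```

With no functions at all, the pipeline is simply the identity, which makes it safe to build the function list dynamically.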
Normalize

textacy.preprocessing.normalize: Normalize aspects of raw text that may vary in problematic ways.
textacy.preprocessing.normalize.bullet_points(text: str) → str

Normalize all “fancy” bullet point symbols in text to just the basic ASCII “-”, provided they are the first non-whitespace characters on a new line (like a list of items).
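The line-anchored behavior can be approximated with a multiline regex; the bullet characters below are an illustrative subset, not textacy’s full set:

```python
import re

# A few common "fancy" bullet characters (illustrative, not exhaustive).
_BULLETS = "•‣⁃◦·*–"


def bullet_points(text: str) -> str:
    # Only match a bullet that is the first non-whitespace char on a line,
    # preserving the leading indentation via the captured group.
    return re.sub(rf"(?m)^(\s*)[{re.escape(_BULLETS)}]", r"\1-", text)


print(bullet_points("• one\n  ‣ two"))  # -> - one\n  - two
```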
textacy.preprocessing.normalize.hyphenated_words(text: str) → str

Normalize words in text that have been split across lines by a hyphen for visual consistency (aka hyphenated) by joining the pieces back together, sans hyphen and whitespace.
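A simple regex captures the idea: join a word piece ending in a hyphen at a line break with the piece that follows. A sketch of the described behavior, not textacy’s exact pattern:

```python
import re


def hyphenated_words(text: str) -> str:
    # Join word pieces split by a hyphen + line break, e.g. "pre-\nprocessing",
    # dropping the hyphen and the intervening whitespace.
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)


print(hyphenated_words("pre-\nprocessing is fun"))  # -> preprocessing is fun
```

Note that ordinary in-line hyphens (“well-known”) are untouched, since the pattern requires a newline after the hyphen.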
textacy.preprocessing.normalize.quotation_marks(text: str) → str

Normalize all “fancy” single- and double-quotation marks in text to just the basic ASCII equivalents. Note that this will also normalize fancy apostrophes, which are typically represented as single quotation marks.
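A single-pass character translation table is a natural fit here; the set of “fancy” marks below is illustrative, not textacy’s exact list:

```python
# Map fancy quotation marks to their basic ASCII equivalents.
_QUOTES = str.maketrans({
    "‘": "'", "’": "'", "‚": "'", "‛": "'",   # single quotes / apostrophes
    "“": '"', "”": '"', "„": '"', "‟": '"',   # double quotes
})


def quotation_marks(text: str) -> str:
    return text.translate(_QUOTES)


print(quotation_marks("“Don’t worry,” she said."))  # -> "Don't worry," she said.
```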
textacy.preprocessing.normalize.repeating_chars(text: str, *, chars: str, maxn: int = 1) → str

Normalize repeating characters in text by truncating their number of consecutive repetitions to maxn.

Parameters:
    text –
    chars – One or more characters whose consecutive repetitions are to be normalized, e.g. “.” or “?!”.
    maxn – Maximum number of consecutive repetitions of chars to which longer repetitions will be truncated.

Returns:
    str
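The truncation can be expressed with a bounded-repetition regex over the whole chars sequence; a sketch of the described behavior, not textacy’s implementation:

```python
import re


def repeating_chars(text: str, *, chars: str, maxn: int = 1) -> str:
    # Match maxn+1 or more consecutive repetitions of the chars sequence
    # and truncate them down to exactly maxn repetitions.
    pattern = f"(?:{re.escape(chars)}){{{maxn + 1},}}"
    return re.sub(pattern, lambda m: chars * maxn, text)


print(repeating_chars("wow!!!!! nice", chars="!", maxn=2))  # -> wow!! nice
```

Because chars is treated as a whole sequence, multi-character inputs like "?!" work too: runs of "?!?!?!" collapse to a single "?!" when maxn=1.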
textacy.preprocessing.normalize.unicode(text: str, *, form: str = 'NFC') → str

Normalize unicode characters in text into canonical forms.

Parameters:
    text –
    form ({"NFC", "NFD", "NFKC", "NFKD"}) – Form of normalization applied to unicode characters. For example, an “e” with acute accent “´” can be written as “e´” (canonical decomposition, “NFD”) or “é” (canonical composition, “NFC”). Unicode can be normalized to NFC form without any change in meaning, so it’s usually a safe bet. If “NFKC”, additional normalizations are applied that can change characters’ meanings, e.g. ellipsis characters are replaced with three periods.
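The same normalization forms are available directly from the stdlib’s unicodedata, which illustrates the NFD/NFC distinction described above:

```python
import unicodedata

# "e" + combining acute accent: two code points (decomposed, NFD-style).
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(composed, len(decomposed), len(composed))  # -> é 2 1

# NFKC also folds "compatibility" characters, e.g. the ellipsis character
# becomes three literal periods -- a change in meaning, not just form.
print(unicodedata.normalize("NFKC", "…"))  # -> ...
```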
Remove

textacy.preprocessing.remove: Remove aspects of raw text that may be unwanted for certain use cases.
textacy.preprocessing.remove.accents(text: str, *, fast: bool = False) → str

Remove accents from any accented unicode characters in text, either by replacing them with ASCII equivalents or removing them entirely.

Parameters:
    text –
    fast – If False, accents are removed from any unicode symbol with a direct ASCII equivalent; if True, accented chars for all unicode symbols are removed, regardless.

Note: fast=True can be significantly faster than fast=False, but its transformation of text is less “safe” and more likely to result in changes of meaning, spelling errors, etc.

Returns:
    str

See also: For a more powerful (but slower) alternative, check out unidecode: https://github.com/avian2/unidecode
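One common stdlib approach, similar in spirit to the behavior described above (decompose, drop combining marks), though it is not textacy’s code:

```python
import unicodedata


def remove_accents(text: str) -> str:
    # NFKD-decompose so accents become separate combining characters,
    # then drop every combining mark.
    return "".join(
        ch for ch in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(ch)
    )


print(remove_accents("naïve café"))  # -> naive cafe
```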
textacy.preprocessing.remove.brackets(text: str, *, only: Optional[str | Collection[str]] = None) → str

Remove text within curly {}, square [], and/or round () brackets, as well as the brackets themselves.

Parameters:
    text –
    only – Remove only those bracketed contents as specified here: “curly”, “square”, and/or “round”. For example, "square" removes only those contents found between square brackets, while ["round", "square"] removes those contents found between square or round brackets, but not curly.

Returns:
    str

Note: This function relies on regular expressions, applied sequentially for curly, square, then round brackets; as such, it doesn’t handle nested brackets of the same type and may behave unexpectedly on text with “wild” use of brackets. It should be fine removing structured bracketed contents, as is often used, for instance, to denote in-text citations.
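The sequential, non-nested strategy the note describes can be sketched like this (a sketch, not textacy’s exact implementation or patterns):

```python
import re


def remove_brackets(text: str) -> str:
    # One pattern per bracket type, applied in order: curly, square, round.
    # [^...]* inside each pattern is what prevents matching across nesting.
    for pattern in (r"\{[^{}]*\}", r"\[[^\[\]]*\]", r"\([^()]*\)"):
        text = re.sub(pattern, "", text)
    return text


print(remove_brackets("A claim [1] with an aside (see above)."))
# -> A claim  with an aside .
```

Note that, as in the real function, the surrounding whitespace is left alone, so removals can leave doubled spaces behind.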
textacy.preprocessing.remove.html_tags(text: str) → str

Remove HTML tags from text, returning just the text found between tags and other non-data elements.

Parameters:
    text –

Returns:
    str

Note: This function relies on the stdlib html.parser.HTMLParser and doesn’t do anything fancy. For a better and potentially faster solution, consider using lxml and/or beautifulsoup4.
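The HTMLParser-based approach the note mentions looks roughly like this; a minimal sketch that collects only character data, not textacy’s code (which also skips non-data elements such as scripts):

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only character data, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def remove_html_tags(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    return "".join(parser.parts)


print(remove_html_tags("<p>Hello <b>world</b></p>"))  # -> Hello world
```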
textacy.preprocessing.remove.punctuation(text: str, *, only: Optional[str | Collection[str]] = None) → str

Remove punctuation from text by replacing all instances of punctuation (or a subset thereof specified by only) with whitespace.

Parameters:
    text –
    only – Remove only those punctuation marks specified here. For example, "." removes only periods, while [",", ";", ":"] removes commas, semicolons, and colons; if None, all unicode punctuation marks are removed.

Returns:
    str

Note: When only=None, Python’s built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used. The former’s performance can be up to an order of magnitude faster.
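The only=None fast path can be sketched with a one-time translation table covering every unicode punctuation mark (category “P*”); an illustration of the technique, not textacy’s code:

```python
import sys
import unicodedata

# Built once: map every codepoint in a "P*" (punctuation) category to a space.
_PUNCT_TABLE = {
    cp: " "
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
}


def remove_punctuation(text: str) -> str:
    return text.translate(_PUNCT_TABLE)


print(remove_punctuation("Hello, world!"))  # -> Hello  world 
```

Building the table scans the whole codepoint range, so it is done once at import time; each call is then a single str.translate() pass, which is what makes this path fast.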
Replace

textacy.preprocessing.replace: Replace parts of raw text that are semantically important as members of a group but not so much in their individual instances. Can also be used to remove such parts by specifying repl="" in function calls.
textacy.preprocessing.replace.currency_symbols(text: str, repl: str = '_CUR_') → str

Replace all currency symbols in text with repl.
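Unicode assigns all currency symbols the category “Sc”, which makes the symbol set easy to collect; a sketch of the idea, not textacy’s exact pattern:

```python
import re
import sys
import unicodedata

# Every codepoint with category "Sc" (currency symbol), as one string.
_CURRENCIES = "".join(
    chr(cp) for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Sc"
)


def replace_currency_symbols(text: str, repl: str = "_CUR_") -> str:
    return re.sub(f"[{re.escape(_CURRENCIES)}]", repl, text)


print(replace_currency_symbols("coffee: $3.50 or €3?"))
# -> coffee: _CUR_3.50 or _CUR_3?
```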
textacy.preprocessing.replace.emails(text: str, repl: str = '_EMAIL_') → str

Replace all email addresses in text with repl.
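A deliberately simple pattern shows the shape of the replacement; real-world email matching (and textacy’s own regex) is considerably more involved:

```python
import re

# local-part @ domain, requiring at least one dot in the domain.
_EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def replace_emails(text: str, repl: str = "_EMAIL_") -> str:
    return _EMAIL_RE.sub(repl, text)


print(replace_emails("write to ada@example.com today"))  # -> write to _EMAIL_ today
```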
textacy.preprocessing.replace.emojis(text: str, repl: str = '_EMOJI_') → str

Replace all emoji and pictographs in text with repl.

Note: If your Python has a narrow unicode build (“UCS-2”), only dingbats and miscellaneous symbols are replaced because Python isn’t able to represent the unicode data for things like emoticons. Sorry!
textacy.preprocessing.replace.hashtags(text: str, repl: str = '_TAG_') → str

Replace all hashtags in text with repl.
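A minimal version treats a hashtag as “#” plus word characters, not preceded by a word character; a simplification of the real pattern for illustration:

```python
import re

# Negative lookbehind keeps "item#3" from being treated as a hashtag.
_HASHTAG_RE = re.compile(r"(?<!\w)#\w+")


def replace_hashtags(text: str, repl: str = "_TAG_") -> str:
    return _HASHTAG_RE.sub(repl, text)


print(replace_hashtags("loving #PawProgramming today"))  # -> loving _TAG_ today
```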
textacy.preprocessing.replace.numbers(text: str, repl: str = '_NUMBER_') → str

Replace all numbers in text with repl.
textacy.preprocessing.replace.phone_numbers(text: str, repl: str = '_PHONE_') → str

Replace all phone numbers in text with repl.