Installation¶
The simplest way to install textacy is via pip:

$ pip install textacy
or conda:

$ conda install -c conda-forge textacy
If you prefer (or are obliged), you can download and unzip the source tar.gz from PyPI, then install manually:
$ python setup.py install
Dependencies¶
Given the breadth of functionality, textacy depends on a number of other Python packages. Most of these are common components in the PyData stack (numpy, scikit-learn, etc.), but a few are more niche. One heavy dependency has been made optional.
Specifically: To use visualization functionality in textacy.viz, you'll need to have matplotlib installed. You can do so via pip install textacy[viz] or pip install matplotlib.
Downloading Data¶
For most uses of textacy, language-specific model data in spaCy is required. Fortunately, spaCy makes the process of getting this data easy; just follow the instructions in their docs, which also include a list of currently-supported languages and their models.
Note: In previous versions of spaCy, users were able to link a specific model to a different name (e.g. "en_core_web_sm" => "en"), but this is no longer permitted. As such, textacy now requires users to fully specify which model to apply to a text, rather than leveraging automatic language identification to do it for them.
textacy itself features convenient access to several datasets comprising thousands of text + metadata records, as well as a couple of linguistic resources. Data can be downloaded via the .download() method on the corresponding dataset/resource classes (see Datasets and Resources for details) or directly from the command line.
$ python -m textacy download capitol_words
$ python -m textacy download depeche_mood
$ python -m textacy download lang_identifier --version 2.0
These commands download and save a compressed json file of ~11k speeches given by the main protagonists of the 2016 U.S. Presidential election; a set of emotion lexicons in English and Italian with various word representations; and a language identification model that works for 140 languages. For more information about a particular dataset or resource, use the info subcommand:
$ python -m textacy info capitol_words