Resources¶
ConceptNet¶
ConceptNet is a multilingual knowledge base, representing common words and phrases and the common-sense relationships between them. This information is collected from a variety of sources, including crowd-sourced resources (e.g. Wiktionary, Open Mind Common Sense), games with a purpose (e.g. Verbosity, nadya.jp), and expert-created resources (e.g. WordNet, JMDict).
The interface in textacy gives access to several key relationships between terms that are useful in a variety of NLP tasks:
antonyms: terms that are opposites of each other in some relevant way
hyponyms: terms that are subtypes or specific instances of other terms
meronyms: terms that are parts of other terms
synonyms: terms that are sufficiently similar that they may be used interchangeably
class textacy.resources.concept_net.ConceptNet(data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.10.1/lib/python3.8/site-packages/textacy/data/concept_net'), version='5.7.0')[source]¶

Interface to ConceptNet, a multilingual knowledge base representing common words and phrases and the common-sense relationships between them.
Download the data (one time only!), and save its contents to disk:
>>> import textacy.resources
>>> rs = textacy.resources.ConceptNet()
>>> rs.download()
>>> rs.info
{'name': 'concept_net', 'site_url': 'http://conceptnet.io', 'publication_url': 'https://arxiv.org/abs/1612.03975', 'description': 'An open, multilingual semantic network of general knowledge, designed to help computers understand the meanings of words.'}
Access other same-language terms related to a given term in a variety of ways:
>>> rs.get_synonyms("spouse", lang="en", sense="n")
['mate', 'married person', 'better half', 'partner']
>>> rs.get_antonyms("love", lang="en", sense="v")
['detest', 'hate', 'loathe']
>>> rs.get_hyponyms("marriage", lang="en", sense="n")
['cohabitation situation', 'union', 'legal agreement', 'ritual', 'family', 'marital status']
Note: The very first time a given relationship is accessed, the full ConceptNet db must be parsed and split for fast future access. This can take a couple of minutes; be patient.
When passing a spaCy Token or Span, the corresponding lang and sense are inferred automatically from the object:

>>> text = "The quick brown fox jumps over the lazy dog."
>>> doc = textacy.make_spacy_doc(text, lang="en")
>>> rs.get_synonyms(doc[1])  # quick
['flying', 'fast', 'rapid', 'ready', 'straightaway', 'nimble', 'speedy', 'warm']
>>> rs.get_synonyms(doc[4:5])  # jumps
['leap', 'startle', 'hump', 'flinch', 'jump off', 'skydive', 'jumpstart', ...]
Many terms won’t have entries, for actual linguistic reasons or because the db’s coverage of a given language’s vocabulary isn’t comprehensive:
>>> rs.get_meronyms(doc[3])  # fox
[]
>>> rs.get_antonyms(doc[7])  # lazy
[]
Parameters
    data_dir (str or pathlib.Path) – Path to directory on disk under which resource data is stored, i.e. /path/to/data_dir/concept_net.
    version ({"5.7.0", "5.6.0", "5.5.5"}) – Version string of the ConceptNet db to use. Since newer versions typically represent improvements over earlier versions, you'll probably want "5.7.0" (the default value).
download(*, force=False)[source]¶

Download resource data as a gzipped csv file, then save it to disk under the ConceptNet.data_dir directory.

Parameters
    force (bool) – If True, download resource data even if it already exists on disk; otherwise, don't re-download the data.
property filepath¶

Full path on disk for the ConceptNet gzipped csv file corresponding to the given ConceptNet.data_dir.

Type
    str
property antonyms¶

Mapping of language code to term to sense to set of the term's antonyms – opposites of the term in some relevant way, like being at opposite ends of a scale or fundamentally similar but with a key difference between them – such as black <=> white or hot <=> cold. Note that this relationship is symmetric.

Based on the "/r/Antonym" relation in ConceptNet.
property hyponyms¶

Mapping of language code to term to sense to set of the term's hyponyms – subtypes or specific instances of the term – such as car => vehicle or Chicago => city. Every A is a B.

Based on the "/r/IsA" relation in ConceptNet.
property meronyms¶

Mapping of language code to term to sense to set of the term's meronyms – parts of the term – such as gearshift => car.

Based on the "/r/PartOf" relation in ConceptNet.
property synonyms¶

Mapping of language code to term to sense to set of the term's synonyms – concepts sufficiently similar that they may be used interchangeably – such as sunlight <=> sunshine. Note that this relationship is symmetric.

Based on the "/r/Synonym" relation in ConceptNet.
DepecheMood¶
DepecheMood is a high-quality and high-coverage emotion lexicon for English and Italian text, mapping individual terms to their emotional valences. These word-emotion weights are inferred from crowd-sourced datasets of emotionally tagged news articles (rappler.com for English, corriere.it for Italian).
English terms are assigned weights to eight emotions:
AFRAID
AMUSED
ANGRY
ANNOYED
DONT_CARE
HAPPY
INSPIRED
SAD
Italian terms are assigned weights to five emotions:
DIVERTITO (~amused)
INDIGNATO (~annoyed)
PREOCCUPATO (~afraid)
SODDISFATTO (~happy)
TRISTE (~sad)
class textacy.resources.depeche_mood.DepecheMood(data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.10.1/lib/python3.8/site-packages/textacy/data/depeche_mood'), lang='en', word_rep='lemmapos', min_freq=3)[source]¶

Interface to DepecheMood, an emotion lexicon for English and Italian text.
Download the data (one time only!), and save its contents to disk:
>>> import textacy.resources
>>> rs = textacy.resources.DepecheMood(lang="en", word_rep="lemmapos")
>>> rs.download()
>>> rs.info
{'name': 'depeche_mood', 'site_url': 'http://www.depechemood.eu', 'publication_url': 'https://arxiv.org/abs/1810.03660', 'description': 'A simple tool to analyze the emotions evoked by a text.'}
Access emotional valences for individual terms:
>>> rs.get_emotional_valence("disease#n")
{'AFRAID': 0.37093526222120465, 'AMUSED': 0.06953745082761113, 'ANGRY': 0.06979683067736414, 'ANNOYED': 0.06465401081252636, 'DONT_CARE': 0.07080580707440012, 'HAPPY': 0.07537324330608403, 'INSPIRED': 0.13394731320662606, 'SAD': 0.14495008187418348}
>>> rs.get_emotional_valence("heal#v")
{'AFRAID': 0.060450319886187334, 'AMUSED': 0.09284046387491741, 'ANGRY': 0.06207816933776029, 'ANNOYED': 0.10027622719958346, 'DONT_CARE': 0.11259594401785, 'HAPPY': 0.09946106491457314, 'INSPIRED': 0.37794768332634626, 'SAD': 0.09435012744278205}
When passing multiple terms in the form of a List[str], Span, or Doc, emotion weights are averaged over all terms for which weights are available:

>>> rs.get_emotional_valence(["disease#n", "heal#v"])
{'AFRAID': 0.215692791053696, 'AMUSED': 0.08118895735126427, 'ANGRY': 0.06593750000756221, 'ANNOYED': 0.08246511900605491, 'DONT_CARE': 0.09170087554612506, 'HAPPY': 0.08741715411032858, 'INSPIRED': 0.25594749826648616, 'SAD': 0.11965010465848278}
>>> text = "The acting was sweet and amazing, but the plot was dumb and terrible."
>>> doc = textacy.make_spacy_doc(text, lang="en")
>>> rs.get_emotional_valence(doc)
{'AFRAID': 0.05272350876803627, 'AMUSED': 0.13725054992595098, 'ANGRY': 0.15787016147081184, 'ANNOYED': 0.1398733360688608, 'DONT_CARE': 0.14356943460620503, 'HAPPY': 0.11923217912716871, 'INSPIRED': 0.17880214720077342, 'SAD': 0.07067868283219296}
>>> rs.get_emotional_valence(doc[0:6])  # the acting was sweet and amazing
{'AFRAID': 0.039790959333750785, 'AMUSED': 0.1346884072825313, 'ANGRY': 0.1373596223131593, 'ANNOYED': 0.11391999698695347, 'DONT_CARE': 0.1574819173485831, 'HAPPY': 0.1552521762333925, 'INSPIRED': 0.21232264216449326, 'SAD': 0.049184278337136296}
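The averaging described above can be sketched in plain Python: terms without lexicon entries are skipped, and the remaining per-term weights are averaged emotion by emotion. The weight values below are invented for illustration, not actual DepecheMood data:

```python
def average_valence(term_weights, terms):
    """Average emotion weights over the terms that have entries; skip the rest."""
    found = [term_weights[t] for t in terms if t in term_weights]
    if not found:
        return {}
    emotions = found[0].keys()
    return {emo: sum(w[emo] for w in found) / len(found) for emo in emotions}

# Invented weights for illustration (real entries cover eight emotions).
weights = {
    "disease#n": {"AFRAID": 0.5, "HAPPY": 0.25},
    "heal#v": {"AFRAID": 0.25, "HAPPY": 0.75},
}
print(average_valence(weights, ["disease#n", "heal#v", "unknown#n"]))
# -> {'AFRAID': 0.375, 'HAPPY': 0.5}
```

Note how "unknown#n" contributes nothing to either the sums or the divisor, mirroring the behavior where only terms with available weights are averaged.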
For good measure, here's how Italian looks without POS-tagged words:
>>> rs = textacy.resources.DepecheMood(lang="it", word_rep="lemma")
>>> rs.get_emotional_valence("amore")
{'INDIGNATO': 0.11451408951814121, 'PREOCCUPATO': 0.1323655108545536, 'TRISTE': 0.18249663560400609, 'DIVERTITO': 0.33558928569110086, 'SODDISFATTO': 0.23503447833219815}
Parameters
    data_dir (str or pathlib.Path) – Path to directory on disk under which resource data is stored, i.e. /path/to/data_dir/depeche_mood.
    lang ({"en", "it"}) – Standard two-letter code for the language of terms for which emotional valences are to be retrieved.
word_rep ({"token", "lemma", "lemmapos"}) – Level of text processing used in computing terms’ emotion weights. “token” => tokenization only; “lemma” => tokenization and lemmatization; “lemmapos” => tokenization, lemmatization, and part-of-speech tagging.
min_freq (int) – Minimum number of times that a given term must have appeared in the source dataset for it to be included in the emotion weights dict. This can be used to remove noisy terms at the expense of reducing coverage. Researchers observed peak performance at 10, but anywhere between 1 and 20 is reasonable.
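The effect of min_freq amounts to a simple frequency cutoff over the source counts before weights are exposed. A sketch with invented counts and weights (not actual DepecheMood data):

```python
def filter_by_min_freq(term_counts, term_weights, min_freq=3):
    """Keep emotion weights only for terms seen at least min_freq times."""
    return {
        term: weights
        for term, weights in term_weights.items()
        if term_counts.get(term, 0) >= min_freq
    }

# Invented counts/weights for illustration.
counts = {"storm#n": 12, "zephyr#n": 1}
weights = {"storm#n": {"AFRAID": 0.6}, "zephyr#n": {"AFRAID": 0.2}}
print(filter_by_min_freq(counts, weights, min_freq=3))
# -> {'storm#n': {'AFRAID': 0.6}}  (the rare term is dropped)
```

Raising min_freq trades coverage for reliability: weights estimated from only a handful of occurrences are noisy, so cutting them improves precision at the cost of more empty lookups.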
property filepath¶

Full path on disk for the DepecheMood tsv file corresponding to the given lang and word_rep.

Type
    str
property weights¶

Mapping of term string (or term#POS, if DepecheMood.word_rep is "lemmapos") to the term's normalized weights on a fixed set of affective dimensions (aka "emotions").
download(*, force=False)[source]¶

Download resource data as a zip archive file, then save it to disk and extract its contents under the data_dir directory.

Parameters
    force (bool) – If True, download the resource even if it already exists on disk under data_dir.
get_emotional_valence(terms)[source]¶

Get average emotional valence over all terms in terms for which emotion weights are available.

Parameters
    terms (str or Sequence[str], Token or Sequence[Token]) – One or more terms over which to average emotional valences. Note that only nouns, adjectives, adverbs, and verbs are included.

    Note
    If the resource was initialized with word_rep="lemmapos", then string terms must have matching parts-of-speech appended to them like TERM#POS. Only "n" => noun, "v" => verb, "a" => adjective, and "r" => adverb are included in the data.

Returns
    Mapping of emotion to average weight.

Return type
    Dict[str, float]
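With word_rep="lemmapos", string terms must be keyed as TERM#POS using the n/v/a/r scheme noted above. A minimal sketch of building such keys from (lemma, coarse POS) pairs; the tag-to-letter mapping here is an assumption based on the note, not textacy's actual implementation:

```python
# Hypothetical mapping from coarse POS categories to DepecheMood's
# single-letter scheme ("n" => noun, "v" => verb, "a" => adjective, "r" => adverb).
POS_LETTER = {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}

def lemmapos_key(lemma, pos):
    """Build a 'TERM#POS' key, or None if the POS isn't covered by the lexicon."""
    letter = POS_LETTER.get(pos)
    return f"{lemma}#{letter}" if letter else None

print(lemmapos_key("disease", "NOUN"))  # -> disease#n
print(lemmapos_key("the", "DET"))       # -> None
```

Returning None for uncovered parts of speech (determiners, pronouns, etc.) matches the behavior above, where only nouns, verbs, adjectives, and adverbs contribute to the averaged valence.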