Datasets and Resources
- Stream a collection of Congressional speeches from a compressed json file on disk, either as texts or text + metadata pairs.
- Stream a collection of US Supreme Court decisions from a compressed json file on disk, either as texts or text + metadata pairs.
- Stream a collection of Wikipedia pages from a version- and language-specific database dump, either as texts or text + metadata pairs.
- Stream a collection of Wikinews pages from a version- and language-specific database dump, either as texts or text + metadata pairs.
- Stream a collection of Reddit comments from 1 or more compressed files on disk, either as texts or text + metadata pairs.
- Stream a collection of English-language literary works from text files on disk, either as texts or text + metadata pairs.
- Stream a collection of IMDB movie reviews from text files on disk, either as texts or text + metadata pairs.
- Stream a collection of UDHR translations from disk, either as texts or text + metadata pairs.
- Interface to ConceptNet, a multilingual knowledge base representing common words and phrases and the common-sense relationships between them.
- Interface to DepecheMood, an emotion lexicon for English and Italian text.
Capitol Words Congressional speeches
A collection of ~11k (almost all) speeches given by the main protagonists of the 2016 U.S. Presidential election that had previously served in the U.S. Congress – including Hillary Clinton, Bernie Sanders, Barack Obama, Ted Cruz, and John Kasich – from January 1996 through June 2016.
Records include the following data:
text
: Full text of the Congressperson’s remarks.
title
: Title of the speech, in all caps.
date
: Date on which the speech was given, as an ISO-standard string.
speaker_name
: First and last name of the speaker.
speaker_party
: Political party of the speaker: “R” for Republican, “D” for Democrat, “I” for Independent.
congress
: Number of the Congress in which the speech was given: ranges continuously between 104 and 114.
chamber
: Chamber of Congress in which the speech was given: almost all are either “House” or “Senate”, with a small number of “Extensions”.
This dataset was derived from data provided by the (now defunct) Sunlight Foundation’s Capitol Words API.
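Put together, a single record has roughly the following shape; the field values below are invented for illustration, not drawn from the dataset:

```python
# Illustrative Capitol Words record; values are made up, only the
# field names and types follow the documented schema.
record = {
    "text": "Mr. Speaker, I rise today to discuss ...",
    "title": "TRIBUTE TO AN EXEMPLARY CONSTITUENT",
    "date": "2012-03-15",
    "speaker_name": "Bernie Sanders",
    "speaker_party": "I",
    "congress": 112,
    "chamber": "Senate",
}

# The documented constraints on the coded fields:
assert record["speaker_party"] in {"R", "D", "I"}
assert 104 <= record["congress"] <= 114
assert record["chamber"] in {"House", "Senate", "Extensions"}
```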
class textacy.datasets.capitol_words.CapitolWords(data_dir: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/capitol_words'))

Stream a collection of Congressional speeches from a compressed json file on disk, either as texts or text + metadata pairs.
Download the data (one time only!) from the textacy-data repo (https://github.com/bdewilde/textacy-data), and save its contents to disk:
>>> import textacy.datasets
>>> ds = textacy.datasets.CapitolWords()
>>> ds.download()
>>> ds.info
{'name': 'capitol_words', 'site_url': 'http://sunlightlabs.github.io/Capitol-Words/', 'description': 'Collection of ~11k speeches in the Congressional Record given by notable U.S. politicians between Jan 1996 and Jun 2016.'}
Iterate over speeches as texts or records with both text and metadata:
>>> for text in ds.texts(limit=3):
...     print(text, end="\n\n")
>>> for text, meta in ds.records(limit=3):
...     print("\n{} ({})\n{}".format(meta["title"], meta["speaker_name"], text))
Filter speeches by a variety of metadata fields and text length:
>>> for text, meta in ds.records(speaker_name="Bernie Sanders", limit=3):
...     print("\n{}, {}\n{}".format(meta["title"], meta["date"], text))
>>> for text, meta in ds.records(speaker_party="D", congress={110, 111, 112},
...                              chamber="Senate", limit=3):
...     print(meta["title"], meta["speaker_name"], meta["date"])
>>> for text, meta in ds.records(speaker_name={"Barack Obama", "Hillary Clinton"},
...                              date_range=("2005-01-01", "2005-12-31")):
...     print(meta["title"], meta["speaker_name"], meta["date"])
>>> for text in ds.texts(min_len=50000):
...     print(len(text))
Stream speeches into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=100))
Corpus(100 docs; 70496 tokens)
Parameters
    data_dir – Path to directory on disk under which dataset is stored, i.e. /path/to/data_dir/capitol_words.
full_date_range
    First and last dates for which speeches are available, each as an ISO-formatted string (YYYY-MM-DD).

congresses
    All distinct numbers of the congresses in which speeches were given, e.g. 114.

    Type: Set[int]
property filepath
    Full path on disk for CapitolWords data as compressed json file; None if file is not found, e.g. has not yet been downloaded.
download(*, force: bool = False) → None

Download the data as a Python version-specific compressed json file and save it to disk under the data_dir directory.

Parameters
    force – If True, download the dataset, even if it already exists on disk under data_dir.
texts(*, speaker_name: Optional[Union[str, Set[str]]] = None, speaker_party: Optional[Union[str, Set[str]]] = None, chamber: Optional[Union[str, Set[str]]] = None, congress: Optional[Union[int, Set[int]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[str]

Iterate over speeches in this dataset, optionally filtering by a variety of metadata and/or text length, and yield texts only, in chronological order.

Parameters
    speaker_name – Filter speeches by the speakers’ name; see CapitolWords.speaker_names.
    speaker_party – Filter speeches by the speakers’ party; see CapitolWords.speaker_parties.
    chamber – Filter speeches by the chamber in which they were given; see CapitolWords.chambers.
    congress – Filter speeches by the congress in which they were given; see CapitolWords.congresses.
    date_range – Filter speeches by the date on which they were given. Both start and end date must be specified, but a null value for either will be replaced by the min/max date available for the dataset.
    min_len – Filter texts by the length (# characters) of their text content.
    limit – Yield no more than limit texts that match all specified filters.

Yields
    Full text of the next (by chronological order) speech in dataset passing all filter params.

Raises
    ValueError – If any filtering options are invalid.
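The date_range fallback described above (a null bound is replaced by the dataset's min/max date) can be sketched as a small helper; `resolve_date_range` is an illustrative name, not textacy's internal code:

```python
def resolve_date_range(date_range, full_date_range):
    """Replace null bounds of (start, end) with the dataset's own
    min/max dates. A sketch of the documented fallback behavior."""
    start, end = date_range
    full_start, full_end = full_date_range
    return (start or full_start, end or full_end)

# e.g. with speeches available from 1996-01-01 through 2016-06-30:
full = ("1996-01-01", "2016-06-30")
print(resolve_date_range((None, "2005-12-31"), full))
# -> ('1996-01-01', '2005-12-31')
```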
records(*, speaker_name: Optional[Union[str, Set[str]]] = None, speaker_party: Optional[Union[str, Set[str]]] = None, chamber: Optional[Union[str, Set[str]]] = None, congress: Optional[Union[int, Set[int]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[textacy.types.Record]

Iterate over speeches in this dataset, optionally filtering by a variety of metadata and/or text length, and yield text + metadata pairs, in chronological order.

Parameters
    speaker_name – Filter speeches by the speakers’ name; see CapitolWords.speaker_names.
    speaker_party – Filter speeches by the speakers’ party; see CapitolWords.speaker_parties.
    chamber – Filter speeches by the chamber in which they were given; see CapitolWords.chambers.
    congress – Filter speeches by the congress in which they were given; see CapitolWords.congresses.
    date_range – Filter speeches by the date on which they were given. Both start and end date must be specified, but a null value for either will be replaced by the min/max date available for the dataset.
    min_len – Filter speeches by the length (# characters) of their text content.
    limit – Yield no more than limit speeches that match all specified filters.

Yields
    Full text of the next (by chronological order) speech in dataset passing all filters, and its corresponding metadata.

Raises
    ValueError – If any filtering options are invalid.
Supreme Court decisions
A collection of ~8.4k (almost all) decisions issued by the U.S. Supreme Court from November 1946 through June 2016 – the “modern” era.
Records include the following data:
text
: Full text of the Court’s decision.
case_name
: Name of the court case, in all caps.
argument_date
: Date on which the case was argued before the Court, as an ISO-formatted string (“YYYY-MM-DD”).
decision_date
: Date on which the Court’s decision was announced, as an ISO-formatted string (“YYYY-MM-DD”).
decision_direction
: Ideological direction of the majority’s decision: one of “conservative”, “liberal”, or “unspecifiable”.
maj_opinion_author
: Name of the majority opinion’s author, if available and identifiable, as an integer code whose mapping is given in SupremeCourt.opinion_author_codes.
n_maj_votes
: Number of justices voting in the majority.
n_min_votes
: Number of justices voting in the minority.
issue
: Subject matter of the case’s core disagreement (e.g. “affirmative action”) rather than its legal basis (e.g. “the equal protection clause”), as a string code whose mapping is given in SupremeCourt.issue_codes.
issue_area
: Higher-level categorization of the issue (e.g. “Civil Rights”), as an integer code whose mapping is given in SupremeCourt.issue_area_codes.
us_cite_id
: Citation identifier for each case according to the official United States Reports. Note: There are ~300 cases with duplicate ids, and it’s not clear if that’s “correct” or a data quality problem.
The text in this dataset was derived from FindLaw’s searchable database of court cases: http://caselaw.findlaw.com/court/us-supreme-court.
The metadata was extracted without modification from the Supreme Court Database: Harold J. Spaeth, Lee Epstein, et al. 2016 Supreme Court Database, Version 2016 Release 1. http://supremecourtdatabase.org. Its license is CC BY-NC 3.0 US: https://creativecommons.org/licenses/by-nc/3.0/us/.
This dataset’s creation was inspired by a blog post by Emily Barry: http://www.emilyinamillion.me/blog/2016/7/13/visualizing-supreme-court-topics-over-time.
The two datasets were merged through much munging and a carefully trained model using the dedupe package. The model’s duplicate threshold was set so as to maximize the F-score where precision had twice as much weight as recall. Still, given occasionally baffling inconsistencies in case naming, citation ids, and decision dates, a very small percentage of texts may be incorrectly matched to metadata. (Sorry.)
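For reference, an F-score in which precision carries twice the weight of recall is the F-beta score with beta = 0.5. A minimal sketch of the standard formula (not code from the dedupe package):

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall.

    beta < 1 favors precision; beta = 0.5 gives precision twice
    the weight of recall, as in the threshold tuning described above.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A precision-heavy operating point scores higher than a recall-heavy one:
print(f_beta(0.9, 0.5) > f_beta(0.5, 0.9))  # True
```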
class textacy.datasets.supreme_court.SupremeCourt(data_dir: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/supreme_court'))

Stream a collection of US Supreme Court decisions from a compressed json file on disk, either as texts or text + metadata pairs.
Download the data (one time only!) from the textacy-data repo (https://github.com/bdewilde/textacy-data), and save its contents to disk:
>>> import textacy.datasets
>>> ds = textacy.datasets.SupremeCourt()
>>> ds.download()
>>> ds.info
{'name': 'supreme_court', 'site_url': 'http://caselaw.findlaw.com/court/us-supreme-court', 'description': 'Collection of ~8.4k decisions issued by the U.S. Supreme Court between November 1946 and June 2016.'}
Iterate over decisions as texts or records with both text and metadata:
>>> for text in ds.texts(limit=3):
...     print(text[:500], end="\n\n")
>>> for text, meta in ds.records(limit=3):
...     print("\n{} ({})\n{}".format(meta["case_name"], meta["decision_date"], text[:500]))
Filter decisions by a variety of metadata fields and text length:
>>> for text, meta in ds.records(opinion_author=109, limit=3):  # Notorious RBG!
...     print(meta["case_name"], meta["decision_direction"], meta["n_maj_votes"])
>>> for text, meta in ds.records(decision_direction="liberal",
...                              issue_area={1, 9, 10}, limit=3):
...     print(meta["case_name"], meta["maj_opinion_author"], meta["n_maj_votes"])
>>> for text, meta in ds.records(opinion_author=102, date_range=('1985-02-11', '1986-02-11')):
...     print("\n{} ({})".format(meta["case_name"], meta["decision_date"]))
...     print(ds.issue_codes[meta["issue"]], "=>", meta["decision_direction"])
>>> for text in ds.texts(min_len=250000):
...     print(len(text))
Stream decisions into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=25))
Corpus(25 docs; 136696 tokens)
Parameters
    data_dir (str or pathlib.Path) – Path to directory on disk under which the data is stored, i.e. /path/to/data_dir/supreme_court.
full_date_range
    First and last dates for which decisions are available, each as an ISO-formatted string (YYYY-MM-DD).
issue_area_codes
    Mapping of high-level issue area of the case’s core disagreement, from id code to description.

issue_codes
    Mapping of the specific issue of the case’s core disagreement, from id code to description.
property filepath
    Full path on disk for SupremeCourt data as compressed json file; None if file is not found, e.g. has not yet been downloaded.
download(*, force: bool = False) → None

Download the data as a Python version-specific compressed json file and save it to disk under the data_dir directory.

Parameters
    force – If True, download the dataset, even if it already exists on disk under data_dir.
texts(*, opinion_author: Optional[Union[int, Set[int]]] = None, decision_direction: Optional[Union[str, Set[str]]] = None, issue_area: Optional[Union[int, Set[int]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[str]

Iterate over decisions in this dataset, optionally filtering by a variety of metadata and/or text length, and yield texts only, in chronological order by decision date.

Parameters
    opinion_author – Filter decisions by the name(s) of the majority opinion’s author, coded as an integer whose mapping is given in SupremeCourt.opinion_author_codes.
    decision_direction – Filter decisions by the ideological direction of the majority’s decision; see SupremeCourt.decision_directions.
    issue_area – Filter decisions by the issue area of the case’s subject matter, coded as an integer whose mapping is given in SupremeCourt.issue_area_codes.
    date_range – Filter decisions by the date on which they were decided; both start and end date must be specified, but a null value for either will be replaced by the min/max date available for the dataset.
    min_len – Filter decisions by the length (# characters) of their text content.
    limit – Yield no more than limit decisions that match all specified filters.

Yields
    Text of the next decision in dataset passing all filters.

Raises
    ValueError – If any filtering options are invalid.
records(*, opinion_author: Optional[Union[int, Set[int]]] = None, decision_direction: Optional[Union[str, Set[str]]] = None, issue_area: Optional[Union[int, Set[int]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[textacy.types.Record]

Iterate over decisions in this dataset, optionally filtering by a variety of metadata and/or text length, and yield text + metadata pairs, in chronological order by decision date.

Parameters
    opinion_author – Filter decisions by the name(s) of the majority opinion’s author, coded as an integer whose mapping is given in SupremeCourt.opinion_author_codes.
    decision_direction – Filter decisions by the ideological direction of the majority’s decision; see SupremeCourt.decision_directions.
    issue_area – Filter decisions by the issue area of the case’s subject matter, coded as an integer whose mapping is given in SupremeCourt.issue_area_codes.
    date_range – Filter decisions by the date on which they were decided; both start and end date must be specified, but a null value for either will be replaced by the min/max date available for the dataset.
    min_len – Filter decisions by the length (# characters) of their text content.
    limit – Yield no more than limit decisions that match all specified filters.

Yields
    Text of the next decision in dataset passing all filters, and its corresponding metadata.

Raises
    ValueError – If any filtering options are invalid.
Wikimedia articles
All articles for a given Wikimedia project, specified by language and version.
Records include the following key fields (plus a few others):
text
: Plain text content of the wiki page – no wiki markup!
title
: Title of the wiki page.
wiki_links
: A list of other wiki pages linked to from this page.
ext_links
: A list of external URLs linked to from this page.
categories
: A list of categories to which this wiki page belongs.
dt_created
: Date on which the wiki page was first created.
page_id
: Unique identifier of the wiki page, usable in Wikimedia APIs.
Datasets are generated by the Wikimedia Foundation for a variety of projects, such as Wikipedia and Wikinews. The source files are meant for search indexes, so they’re dumped in Elasticsearch bulk insert format – basically, a compressed JSON file with one record per line. For more information, refer to https://meta.wikimedia.org/wiki/Data_dumps.
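Since each dump is just a compressed file of JSON records, one per line, it can also be inspected directly. A minimal sketch, assuming a locally downloaded dump file; `iter_dump_records` is a hypothetical helper, not part of textacy, and it skips the Elasticsearch bulk-insert action lines (shaped like {"index": {...}}) that precede each content record:

```python
import gzip
import json

def iter_dump_records(filepath):
    """Yield content records from a CirrusSearch-style dump file:
    gzip-compressed JSON, one record per line (sketch)."""
    with gzip.open(filepath, mode="rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Bulk-insert action lines look like {"index": {...}}; skip them.
            if set(record) == {"index"}:
                continue
            yield record
```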
class textacy.datasets.wikimedia.Wikimedia(name, meta, project, data_dir, lang='en', version='current', namespace=0)

Base class for project-specific Wikimedia datasets.
property filepath
    Full path on disk for the Wikimedia CirrusSearch db dump corresponding to the project, lang, and version.
download(*, force: bool = False) → None

Download the Wikimedia CirrusSearch db dump corresponding to the given project, lang, and version as a compressed JSON file, and save it to disk under the data_dir directory.

Parameters
    force – If True, download the dataset, even if it already exists on disk under data_dir.
Note
Some datasets are quite large (e.g. English Wikipedia is ~28GB) and can take hours to fully download.
texts(*, category: Optional[Union[str, Set[str]]] = None, wiki_link: Optional[Union[str, Set[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[str]

Iterate over wiki pages in this dataset, optionally filtering by a variety of metadata and/or text length, and yield texts only, in order of appearance in the db dump file.

Parameters
    category – Filter wiki pages by the categories to which they’ve been assigned. For multiple values (Set[str]), ANY rather than ALL of the values must be found among a given page’s categories.
    wiki_link – Filter wiki pages by the other wiki pages to which they’ve been linked. For multiple values (Set[str]), ANY rather than ALL of the values must be found among a given page’s wiki links.
    min_len – Filter wiki pages by the length (# characters) of their text content.
    limit – Yield no more than limit wiki pages that match all specified filters.

Yields
    Text of the next wiki page in dataset passing all filters.

Raises
    ValueError – If any filtering options are invalid.
records(*, category: Optional[Union[str, Set[str]]] = None, wiki_link: Optional[Union[str, Set[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[textacy.types.Record]

Iterate over wiki pages in this dataset, optionally filtering by a variety of metadata and/or text length, and yield text + metadata pairs, in order of appearance in the db dump file.

Parameters
    category – Filter wiki pages by the categories to which they’ve been assigned. For multiple values (Set[str]), ANY rather than ALL of the values must be found among a given page’s categories.
    wiki_link – Filter wiki pages by the other wiki pages to which they’ve been linked. For multiple values (Set[str]), ANY rather than ALL of the values must be found among a given page’s wiki links.
    min_len – Filter wiki pages by the length (# characters) of their text content.
    limit – Yield no more than limit wiki pages that match all specified filters.

Yields
    Text of the next wiki page in dataset passing all filters, and its corresponding metadata.

Raises
    ValueError – If any filtering options are invalid.
class textacy.datasets.wikimedia.Wikipedia(data_dir: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/wikipedia'), lang: str = 'en', version: str = 'current', namespace: int = 0)

Stream a collection of Wikipedia pages from a version- and language-specific database dump, either as texts or text + metadata pairs.
Download a database dump (one time only!) and save its contents to disk:
>>> import textacy.datasets
>>> ds = textacy.datasets.Wikipedia(lang="en", version="current")
>>> ds.download()
>>> ds.info
{'name': 'wikipedia', 'site_url': 'https://en.wikipedia.org/wiki/Main_Page', 'description': 'All pages for a given language- and version-specific Wikipedia site snapshot.'}
Iterate over wiki pages as texts or records with both text and metadata:
>>> for text in ds.texts(limit=5):
...     print(text[:500])
>>> for text, meta in ds.records(limit=5):
...     print(meta["page_id"], meta["title"])
Filter wiki pages by a variety of metadata fields and text length:
>>> for text, meta in ds.records(category="Living people", limit=5):
...     print(meta["title"], meta["categories"])
>>> for text, meta in ds.records(wiki_link="United_States", limit=5):
...     print(meta["title"], meta["wiki_links"])
>>> for text in ds.texts(min_len=10000, limit=5):
...     print(len(text))
Stream wiki pages into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(min_len=2000, limit=50))
Corpus(50 docs; 72368 tokens)
Parameters
    data_dir – Path to directory on disk under which database dump files are stored. Each file is expected as {lang}{project}/{version}/{lang}{project}-{version}-cirrussearch-content.json.gz immediately under this directory.
    lang – Standard two-letter language code, e.g. “en” => “English”, “de” => “German”. https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
    version – Database dump version to use. Either “current” for the most recently available version or a date formatted as “YYYYMMDD”. Dumps are produced weekly; check for available versions at https://dumps.wikimedia.org/other/cirrussearch/.
    namespace – Namespace of the wiki pages to include. Typical, public-facing content is in the 0 (default) namespace.
class textacy.datasets.wikimedia.Wikinews(data_dir: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/wikinews'), lang: str = 'en', version: str = 'current', namespace: int = 0)

Stream a collection of Wikinews pages from a version- and language-specific database dump, either as texts or text + metadata pairs.
Download a database dump (one time only!) and save its contents to disk:
>>> import textacy.datasets
>>> ds = textacy.datasets.Wikinews(lang="en", version="current")
>>> ds.download()
>>> ds.info
{'name': 'wikinews', 'site_url': 'https://en.wikinews.org/wiki/Main_Page', 'description': 'All pages for a given language- and version-specific Wikinews site snapshot.'}
Iterate over wiki pages as texts or records with both text and metadata:
>>> for text in ds.texts(limit=5):
...     print(text[:500])
>>> for text, meta in ds.records(limit=5):
...     print(meta["page_id"], meta["title"])
Filter wiki pages by a variety of metadata fields and text length:
>>> for text, meta in ds.records(category="Politics and conflicts", limit=5):
...     print(meta["title"], meta["categories"])
>>> for text, meta in ds.records(wiki_link="Reuters", limit=5):
...     print(meta["title"], meta["wiki_links"])
>>> for text in ds.texts(min_len=5000, limit=5):
...     print(len(text))
Stream wiki pages into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=100))
Corpus(100 docs; 33092 tokens)
Parameters
    data_dir – Path to directory on disk under which database dump files are stored. Each file is expected as {lang}{project}/{version}/{lang}{project}-{version}-cirrussearch-content.json.gz immediately under this directory.
    lang – Standard two-letter language code, e.g. “en” => “English”, “de” => “German”. https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
    version – Database dump version to use. Either “current” for the most recently available version or a date formatted as “YYYYMMDD”. Dumps are produced weekly; check for available versions at https://dumps.wikimedia.org/other/cirrussearch/.
    namespace – Namespace of the wiki pages to include. Typical, public-facing content is in the 0 (default) namespace.
Reddit comments
A collection of up to ~1.5 billion Reddit comments posted from October 2007 through May 2015.
Records include the following key fields (plus a few others):
body
: Full text of the comment.
created_utc
: Date on which the comment was posted.
subreddit
: Sub-reddit in which the comment was posted, excluding the familiar “/r/” prefix.
score
: Net score (upvotes - downvotes) on the comment.
gilded
: Number of times this comment received reddit gold.
The raw data was originally collected by /u/Stuck_In_the_Matrix via Reddit’s APIs, and stored for posterity by the Internet Archive. For more details, refer to https://archive.org/details/2015_reddit_comments_corpus.
class textacy.datasets.reddit_comments.RedditComments(data_dir: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/reddit_comments'))

Stream a collection of Reddit comments from 1 or more compressed files on disk, either as texts or text + metadata pairs.
Download the data (one time only!) or subsets thereof by specifying a date range:
>>> import textacy.datasets
>>> ds = textacy.datasets.RedditComments()
>>> ds.download(date_range=("2007-10", "2008-03"))
>>> ds.info
{'name': 'reddit_comments', 'site_url': 'https://archive.org/details/2015_reddit_comments_corpus', 'description': 'Collection of ~1.5 billion publicly available Reddit comments from October 2007 through May 2015.'}
Iterate over comments as texts or records with both text and metadata:
>>> for text in ds.texts(limit=5):
...     print(text)
>>> for text, meta in ds.records(limit=5):
...     print("\n{} {}\n{}".format(meta["author"], meta["created_utc"], text))
Filter comments by a variety of metadata fields and text length:
>>> for text, meta in ds.records(subreddit="politics", limit=5):
...     print(meta["score"], ":", text)
>>> for text, meta in ds.records(date_range=("2008-01", "2008-03"), limit=5):
...     print(meta["created_utc"])
>>> for text, meta in ds.records(score_range=(10, None), limit=5):
...     print(meta["score"], ":", text)
>>> for text in ds.texts(min_len=2000, limit=5):
...     print(len(text))
Stream comments into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=1000))
Corpus(1000 docs; 27582 tokens)
Parameters
    data_dir – Path to directory on disk under which the data is stored, i.e. /path/to/data_dir/reddit_comments. Each file covers a given month, as indicated in the filename like “YYYY/RC_YYYY-MM.bz2”.
full_date_range
    First and last dates for which comments are available, each as an ISO-formatted string (YYYY-MM-DD).
property filepaths
    Full paths on disk for all Reddit comments files found under the RedditComments.data_dir directory, sorted in chronological order.
download(*, date_range: Tuple[Optional[str], Optional[str]] = (None, None), force: bool = False) → None

Download 1 or more monthly Reddit comments files from archive.org and save them to disk under the data_dir directory.

Parameters
    date_range – Interval specifying the [start, end) dates for which comments files will be downloaded. Each item must be a str formatted as YYYY-MM or YYYY-MM-DD (the latter is converted to the corresponding YYYY-MM value). Both start and end values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.
    force – If True, download the dataset, even if it already exists on disk under data_dir.
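The month-level normalization described above (a YYYY-MM-DD value is truncated to YYYY-MM, and the interval is half-open) can be sketched as follows; `iter_months` is an illustrative helper, not textacy's implementation:

```python
def iter_months(start: str, end: str):
    """Yield "YYYY-MM" strings in the half-open interval [start, end).

    Accepts "YYYY-MM" or "YYYY-MM-DD"; any day component is dropped,
    mirroring the normalization described above (a sketch only).
    """
    y, m = (int(x) for x in start[:7].split("-"))
    end_y, end_m = (int(x) for x in end[:7].split("-"))
    while (y, m) < (end_y, end_m):
        yield f"{y:04d}-{m:02d}"
        m += 1
        if m == 13:
            y, m = y + 1, 1

# The end month itself is excluded, and a day suffix is ignored:
print(list(iter_months("2007-10", "2008-03-15")))
# -> ['2007-10', '2007-11', '2007-12', '2008-01', '2008-02']
```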
texts(*, subreddit: Optional[Union[str, Set[str]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, score_range: Optional[Tuple[Optional[int], Optional[int]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[str]

Iterate over comments (text-only) in 1 or more files of this dataset, optionally filtering by a variety of metadata and/or text length, in chronological order.

Parameters
    subreddit – Filter comments for those which were posted in the specified subreddit(s).
    date_range – Filter comments for those which were posted within the interval [start, end). Each item must be a str in ISO-standard format, i.e. some amount of YYYY-MM-DDTHH:mm:ss. Both start and end values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.
    score_range – Filter comments for those whose score (# upvotes minus # downvotes) is within the interval [low, high). Both start and end values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.
    min_len – Filter comments for those whose body length in chars is at least this long.
    limit – Maximum number of comments passing all filters to yield. If None, all comments are iterated over.

Yields
    Text of the next comment in dataset passing all filters.

Raises
    ValueError – If any filtering options are invalid.
records(*, subreddit: Optional[Union[str, Set[str]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, score_range: Optional[Tuple[Optional[int], Optional[int]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[textacy.types.Record]

Iterate over comments (including text and metadata) in 1 or more files of this dataset, optionally filtering by a variety of metadata and/or text length, in chronological order.

Parameters
    subreddit – Filter comments for those which were posted in the specified subreddit(s).
    date_range – Filter comments for those which were posted within the interval [start, end). Each item must be a str in ISO-standard format, i.e. some amount of YYYY-MM-DDTHH:mm:ss. Both start and end values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.
    score_range – Filter comments for those whose score (# upvotes minus # downvotes) is within the interval [low, high). Both start and end values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.
    min_len – Filter comments for those whose body length in chars is at least this long.
    limit – Maximum number of comments passing all filters to yield. If None, all comments are iterated over.

Yields
    Text of the next comment in dataset passing all filters, and its corresponding metadata.

Raises
    ValueError – If any filtering options are invalid.
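Both date_range and score_range above are half-open intervals, with None bounds treated as unbounded. The membership test this implies can be sketched as a hypothetical helper (not part of textacy):

```python
def in_half_open(value, low=None, high=None):
    """True if value falls in [low, high); a None bound is unbounded,
    matching the interval semantics of the filters described above."""
    return (low is None or value >= low) and (high is None or value < high)

# e.g. score_range=(10, None) keeps comments with score >= 10:
print(in_half_open(10, low=10))   # True: low bound is inclusive
print(in_half_open(10, high=10))  # False: high bound is exclusive
```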
Oxford Text Archive literary works
A collection of ~2.7k Creative Commons literary works from the Oxford Text Archive, containing primarily English-language 16th-20th century literature and history.
Records include the following data:
text
: Full text of the literary work.
title
: Title of the literary work.
author
: Author(s) of the literary work.
year
: Year that the literary work was published.
url
: URL at which literary work can be found online via the OTA.
id
: Unique identifier of the literary work within the OTA.
This dataset was compiled by David Mimno from the Oxford Text Archive and stored in his GitHub repo to avoid unnecessary scraping of the OTA site. It is downloaded from that repo and, aside from some light cleaning of its metadata, is reproduced here exactly.
class textacy.datasets.oxford_text_archive.OxfordTextArchive(data_dir: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/oxford_text_archive'))

Stream a collection of English-language literary works from text files on disk, either as texts or text + metadata pairs.
Download the data (one time only!), saving and extracting its contents to disk:
>>> import textacy.datasets
>>> ds = textacy.datasets.OxfordTextArchive()
>>> ds.download()
>>> ds.info
{'name': 'oxford_text_archive', 'site_url': 'https://ota.ox.ac.uk/', 'description': 'Collection of ~2.7k Creative Commons texts from the Oxford Text Archive, containing primarily English-language 16th-20th century literature and history.'}
Iterate over literary works as texts or records with both text and metadata:
>>> for text in ds.texts(limit=3):
...     print(text[:200])
>>> for text, meta in ds.records(limit=3):
...     print("\n{}, {}".format(meta["title"], meta["year"]))
...     print(text[:300])
Filter literary works by a variety of metadata fields and text length:
>>> for text, meta in ds.records(author="Shakespeare, William", limit=1):
...     print("{}\n{}".format(meta["title"], text[:500]))
>>> for text, meta in ds.records(date_range=("1900-01-01", "1990-01-01"), limit=5):
...     print(meta["year"], meta["author"])
>>> for text in ds.texts(min_len=4000000):
...     print(len(text))
Stream literary works into a
textacy.Corpus
:
>>> textacy.Corpus("en", data=ds.records(limit=5))
Corpus(5 docs; 182289 tokens)
- Parameters
data_dir (str or
pathlib.Path
) – Path to directory on disk under which dataset is stored, i.e./path/to/data_dir/oxford_text_archive
.
-
full_date_range
¶ First and last dates for which works are available, each as an ISO-formatted string (YYYY-MM-DD).
-
authors
¶ Full names of all distinct authors included in this dataset, e.g. “Shakespeare, William”.
- Type
Set[str]
-
download
(*, force: bool = False) → None[source]¶ Download the data as a zip archive file, then save it to disk and extract its contents under the
OxfordTextArchive.data_dir
directory.
- Parameters
force – If True, download the dataset, even if it already exists on disk under
data_dir
.
-
property
metadata
¶ Dict[str, dict]
-
texts
(*, author: Optional[Union[str, Set[str]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[str][source]¶ Iterate over works in this dataset, optionally filtering by a variety of metadata and/or text length, and yield texts only.
- Parameters
author – Filter texts by author name. For multiple values (Set[str]), ANY rather than ALL of the authors must be found among a given work’s authors.
date_range – Filter texts by the date on which they were published; both start and end dates must be specified, but a null value for either will be replaced by the min/max date available in the dataset.
min_len – Filter texts by the length (# characters) of their text content.
limit – Yield no more than
limit
texts that match all specified filters.
- Yields
Text of the next work in dataset passing all filters.
- Raises
ValueError – If any filtering options are invalid.
-
records
(*, author: Optional[Union[str, Set[str]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[textacy.types.Record][source]¶ Iterate over works in this dataset, optionally filtering by a variety of metadata and/or text length, and yield text + metadata pairs.
- Parameters
author – Filter texts by author name. For multiple values (Set[str]), ANY rather than ALL of the authors must be found among a given work’s authors.
date_range – Filter texts by the date on which they were published; both start and end dates must be specified, but a null value for either will be replaced by the min/max date available in the dataset.
min_len – Filter texts by the length (# characters) of their text content.
limit – Yield no more than
limit
texts that match all specified filters.
- Yields
Text of the next work in dataset passing all filters, and its corresponding metadata.
- Raises
ValueError – If any filtering options are invalid.
IMDB movie reviews¶
A collection of 50k highly polar movie reviews posted to IMDB, split evenly into training and testing sets, with 25k positive and 25k negative sentiment labels, as well as some unlabeled reviews.
Records include the following key fields (plus a few others):
text
: Full text of the review.
subset
: Subset of the dataset (“train” or “test”) into which the review has been split.
label
: Sentiment label (“pos” or “neg”) assigned to the review.
rating
: Numeric rating assigned by the original reviewer, ranging from 1 to 10. Reviews with a rating <= 5 are “neg”; the rest are “pos”.
movie_id
: Unique identifier for the movie under review within IMDB, useful for grouping reviews or joining with an external movie dataset.
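The relationship between rating and label stated above can be expressed directly. This helper is illustrative only — the dataset ships with labels precomputed — but it makes the cutoff explicit:

```python
def label_from_rating(rating: int) -> str:
    """Map an IMDB reviewer rating (1-10) to the dataset's sentiment label:
    ratings <= 5 are "neg", ratings >= 6 are "pos"."""
    if not 1 <= rating <= 10:
        raise ValueError(f"rating={rating} outside valid range [1, 10]")
    return "neg" if rating <= 5 else "pos"

print(label_from_rating(4))  # neg
print(label_from_rating(9))  # pos
```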
Reference: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
-
class
textacy.datasets.imdb.
IMDB
(data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/imdb'))[source]¶ Stream a collection of IMDB movie reviews from text files on disk, either as texts or text + metadata pairs.
Download the data (one time only!), saving and extracting its contents to disk:
>>> import textacy.datasets
>>> ds = textacy.datasets.IMDB()
>>> ds.download()
>>> ds.info
{'name': 'imdb', 'site_url': 'http://ai.stanford.edu/~amaas/data/sentiment', 'description': 'Collection of 50k highly polar movie reviews split evenly into train and test sets, with 25k positive and 25k negative labels. Also includes some unlabeled reviews.'}
Iterate over movie reviews as texts or records with both text and metadata:
>>> for text in ds.texts(limit=5):
...     print(text)
>>> for text, meta in ds.records(limit=5):
...     print("\n{} {}\n{}".format(meta["label"], meta["rating"], text))
Filter movie reviews by a variety of metadata fields and text length:
>>> for text, meta in ds.records(label="pos", limit=5):
...     print(meta["rating"], ":", text)
>>> for text, meta in ds.records(rating_range=(9, 11), limit=5):
...     print(meta["rating"], text)
>>> for text in ds.texts(min_len=1000, limit=5):
...     print(len(text))
Stream movie reviews into a
textacy.Corpus
:
>>> textacy.Corpus("en", data=ds.records(limit=100))
Corpus(100 docs; 24340 tokens)
- Parameters
data_dir – Path to directory on disk under which the data is stored, i.e.
/path/to/data_dir/imdb
.
-
full_rating_range
¶ Lowest and highest ratings for which movie reviews are available.
-
download
(*, force: bool = False) → None[source]¶ Download the data as a compressed tar archive file, then save it to disk and extract its contents under the
data_dir
directory.
- Parameters
force – If True, always download the dataset even if it already exists on disk under
data_dir
.
-
texts
(*, subset: Optional[str] = None, label: Optional[str] = None, rating_range: Optional[Tuple[Optional[int], Optional[int]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[str][source]¶ Iterate over movie reviews in this dataset, optionally filtering by a variety of metadata and/or text length, and yield texts only.
- Parameters
subset ({"train", "test"}) – Filter movie reviews by the dataset subset into which they’ve already been split.
label ({"pos", "neg", "unsup"}) – Filter movie reviews by the assigned sentiment label (or lack thereof, for “unsup”).
rating_range – Filter movie reviews by the rating assigned by the reviewer. Only those with ratings in the interval [low, high) are included. Both low and high values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.
min_len – Filter reviews by the length (# characters) of their text content.
limit – Yield no more than
limit
reviews that match all specified filters.
- Yields
Text of the next movie review in dataset passing all filters.
- Raises
ValueError – If any filtering options are invalid.
-
records
(*, subset: Optional[str] = None, label: Optional[str] = None, rating_range: Optional[Tuple[Optional[int], Optional[int]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[textacy.types.Record][source]¶ Iterate over movie reviews in this dataset, optionally filtering by a variety of metadata and/or text length, and yield text + metadata pairs.
- Parameters
subset ({"train", "test"}) – Filter movie reviews by the dataset subset into which they’ve already been split.
label ({"pos", "neg", "unsup"}) – Filter movie reviews by the assigned sentiment label (or lack thereof, for “unsup”).
rating_range – Filter movie reviews by the rating assigned by the reviewer. Only those with ratings in the interval [low, high) are included. Both low and high values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.
min_len – Filter reviews by the length (# characters) of their text content.
limit – Yield no more than
limit
reviews that match all specified filters.
- Yields
Text of the next movie review in dataset passing all filters, and its corresponding metadata.
- Raises
ValueError – If any filtering options are invalid.
UDHR translations¶
A collection of translations of the Universal Declaration of Human Rights (UDHR), a milestone document in the history of human rights that first formally established fundamental human rights to be universally protected.
Records include the following fields:
text
: Full text of the translated UDHR document.
lang
: ISO-639-1 language code of the text.
lang_name
: Ethnologue entry for the language (see https://www.ethnologue.com).
The source dataset was compiled and is updated by the Unicode Consortium
as a way to demonstrate the use of unicode in representing a wide variety of languages.
In fact, the UDHR was chosen because it’s been translated into more languages
than any other document! However, this dataset only provides access to records
translated into ISO-639-1 languages — that is, major living languages only,
rather than every language, major or minor, that has ever existed. If you need access
to texts in those other languages, you can find them at UDHR._texts_dirpath
.
For more details, go to https://unicode.org/udhr.
-
class
textacy.datasets.udhr.
UDHR
(data_dir: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/udhr'))[source]¶ Stream a collection of UDHR translations from disk, either as texts or text + metadata pairs.
Download the data (one time only!), saving and extracting its contents to disk:
>>> import textacy.datasets
>>> ds = textacy.datasets.UDHR()
>>> ds.download()
>>> ds.info
{'name': 'udhr', 'site_url': 'http://www.ohchr.org/EN/UDHR', 'description': 'A collection of translations of the Universal Declaration of Human Rights (UDHR), a milestone document in the history of human rights that first, formally established fundamental human rights to be universally protected.'}
Iterate over translations as texts or records with both text and metadata:
>>> for text in ds.texts(limit=5):
...     print(text[:500])
>>> for text, meta in ds.records(limit=5):
...     print("\n{} ({})\n{}".format(meta["lang_name"], meta["lang"], text[:500]))
Filter translations by language, and note that some languages have multiple translations:
>>> for text, meta in ds.records(lang="en"):
...     print("\n{} ({})\n{}".format(meta["lang_name"], meta["lang"], text[:500]))
>>> for text, meta in ds.records(lang="zh"):
...     print("\n{} ({})\n{}".format(meta["lang_name"], meta["lang"], text[:500]))
Note: Streaming translations into a
textacy.Corpus
doesn’t work as for other available datasets, since this dataset is multilingual.
- Parameters
data_dir (str or
pathlib.Path
) – Path to directory on disk under which the data is stored, i.e./path/to/data_dir/udhr
.
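Because this dataset is multilingual, a single textacy.Corpus can't hold all of its records at once; one workaround is to group records by their lang metadata and build one corpus per language. The grouping step is sketched below on stand-in (text, metadata) pairs — the sample records, and the suggestion to call textacy.Corpus once per language afterwards, are illustrative assumptions:

```python
from collections import defaultdict

# Stand-in for the (text, meta) pairs yielded by UDHR.records().
records = [
    ("All human beings are born free...", {"lang": "en", "lang_name": "English"}),
    ("Tous les êtres humains naissent libres...", {"lang": "fr", "lang_name": "French"}),
    ("All men are born free... (second English translation)", {"lang": "en", "lang_name": "English"}),
]

# Bucket records by language code; some languages have multiple translations.
by_lang = defaultdict(list)
for text, meta in records:
    by_lang[meta["lang"]].append((text, meta))

# Each monolingual batch could now feed e.g. textacy.Corpus(lang, data=batch).
print(sorted(by_lang))     # ['en', 'fr']
print(len(by_lang["en"]))  # 2
```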
-
download
(*, force: bool = False) → None[source]¶ Download the data as a zipped archive of language-specific text files, then save it to disk and extract its contents under the
data_dir
directory.
- Parameters
force – If True, download the dataset, even if it already exists on disk under
data_dir
.
-
texts
(*, lang: Optional[Union[str, Set[str]]] = None, limit: Optional[int] = None) → Iterable[str][source]¶ Iterate over records in this dataset, optionally filtering by language, and yield texts only.
- Parameters
lang – Filter records by the language in which they’re written; see
UDHR.langs
.limit – Yield no more than
limit
texts that match specified filter.
- Yields
Text of the next record in dataset passing filters.
- Raises
ValueError – If any filtering options are invalid.
-
records
(*, lang: Optional[Union[str, Set[str]]] = None, limit: Optional[int] = None) → Iterable[textacy.types.Record][source]¶ Iterate over records in this dataset, optionally filtering by language, and yield text + metadata pairs.
- Parameters
lang – Filter records by the language in which they’re written; see
UDHR.langs
.limit – Yield no more than
limit
texts that match specified filter.
- Yields
Text of the next record in dataset passing filters, and its corresponding metadata.
- Raises
ValueError – If any filtering options are invalid.
ConceptNet¶
ConceptNet is a multilingual knowledge base, representing common words and phrases and the common-sense relationships between them. This information is collected from a variety of sources, including crowd-sourced resources (e.g. Wiktionary, Open Mind Common Sense), games with a purpose (e.g. Verbosity, nadya.jp), and expert-created resources (e.g. WordNet, JMDict).
The interface in textacy gives access to several key relationships between terms that are useful in a variety of NLP tasks:
antonyms: terms that are opposites of each other in some relevant way
hyponyms: terms that are subtypes or specific instances of other terms
meronyms: terms that are parts of other terms
synonyms: terms that are sufficiently similar that they may be used interchangeably
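Each of these relations is exposed (per the property docs further down) as a nested mapping of language code => term => sense => set of related terms. A toy dict with that shape, plus a safe lookup that mirrors the "no entry" case — the data values here are invented for illustration, and `lookup` is not a textacy function:

```python
# lang -> term -> sense -> set of related terms (values invented for illustration)
synonyms = {
    "en": {
        "spouse": {"n": {"mate", "partner"}},
        "quick": {"a": {"fast", "rapid"}},
    },
}

def lookup(mapping, term, lang="en", sense="n"):
    """Walk the nested lang -> term -> sense mapping, returning an empty
    set when any level is missing (the 'no entry' case)."""
    return mapping.get(lang, {}).get(term, {}).get(sense, set())

print(sorted(lookup(synonyms, "spouse")))  # ['mate', 'partner']
print(lookup(synonyms, "fox"))             # set(): no entry
```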
-
class
textacy.resources.concept_net.
ConceptNet
(data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/concept_net'), version='5.7.0')[source]¶ Interface to ConceptNet, a multilingual knowledge base representing common words and phrases and the common-sense relationships between them.
Download the data (one time only!), and save its contents to disk:
>>> import textacy.resources
>>> rs = textacy.resources.ConceptNet()
>>> rs.download()
>>> rs.info
{'name': 'concept_net', 'site_url': 'http://conceptnet.io', 'publication_url': 'https://arxiv.org/abs/1612.03975', 'description': 'An open, multilingual semantic network of general knowledge, designed to help computers understand the meanings of words.'}
Access other same-language terms related to a given term in a variety of ways:
>>> rs.get_synonyms("spouse", lang="en", sense="n")
['mate', 'married person', 'better half', 'partner']
>>> rs.get_antonyms("love", lang="en", sense="v")
['detest', 'hate', 'loathe']
>>> rs.get_hyponyms("marriage", lang="en", sense="n")
['cohabitation situation', 'union', 'legal agreement', 'ritual', 'family', 'marital status']
Note: The very first time a given relationship is accessed, the full ConceptNet db must be parsed and split for fast future access. This can take a couple of minutes; be patient.
When passing a spaCy
Token
orSpan
, the correspondinglang
andsense
are inferred automatically from the object:>>> text = "The quick brown fox jumps over the lazy dog." >>> doc = textacy.make_spacy_doc(text, lang="en") >>> rs.get_synonyms(doc[1]) # quick ['flying', 'fast', 'rapid', 'ready', 'straightaway', 'nimble', 'speedy', 'warm'] >>> rs.get_synonyms(doc[4:5]) # jumps over ['leap', 'startle', 'hump', 'flinch', 'jump off', 'skydive', 'jumpstart', ...]
Many terms won’t have entries, for actual linguistic reasons or because the db’s coverage of a given language’s vocabulary isn’t comprehensive:
>>> rs.get_meronyms(doc[3])  # fox
[]
>>> rs.get_antonyms(doc[7])  # lazy
[]
- Parameters
data_dir (str or
pathlib.Path
) – Path to directory on disk under which resource data is stored, i.e. /path/to/data_dir/concept_net.
version ({"5.7.0", "5.6.0", "5.5.5"}) – Version string of the ConceptNet db to use. Since newer versions typically represent improvements over earlier versions, you’ll probably want “5.7.0” (the default value).
-
download
(*, force=False)[source]¶ Download resource data as a gzipped csv file, then save it to disk under the
ConceptNet.data_dir
directory.
- Parameters
force (bool) – If True, download resource data, even if it already exists on disk; otherwise, don’t re-download the data.
-
property
filepath
¶ Full path on disk for the ConceptNet gzipped csv file corresponding to the given
ConceptNet.data_dir
.
- Type
-
property
antonyms
¶ Mapping of language code to term to sense to set of term’s antonyms – opposites of the term in some relevant way, like being at opposite ends of a scale or fundamentally similar but with a key difference between them – such as black <=> white or hot <=> cold. Note that this relationship is symmetric.
Based on the “/r/Antonym” relation in ConceptNet.
-
property
hyponyms
¶ Mapping of language code to term to sense to set of term’s hyponyms – subtypes or specific instances of the term – such as car => vehicle or Chicago => city. Every A is a B.
Based on the “/r/IsA” relation in ConceptNet.
-
property
meronyms
¶ Mapping of language code to term to sense to set of term’s meronyms – parts of the term – such as gearshift => car.
Based on the “/r/PartOf” relation in ConceptNet.
-
property
synonyms
¶ Mapping of language code to term to sense to set of term’s synonyms – sufficiently similar concepts that they may be used interchangeably – such as sunlight <=> sunshine. Note that this relationship is symmetric.
Based on the “/r/Synonym” relation in ConceptNet.
DepecheMood¶
DepecheMood is a high-quality and high-coverage emotion lexicon for English and Italian text, mapping individual terms to their emotional valences. These word-emotion weights are inferred from crowd-sourced datasets of emotionally tagged news articles (rappler.com for English, corriere.it for Italian).
English terms are assigned weights to eight emotions:
AFRAID
AMUSED
ANGRY
ANNOYED
DONT_CARE
HAPPY
INSPIRED
SAD
Italian terms are assigned weights to five emotions:
DIVERTITO (~amused)
INDIGNATO (~annoyed)
PREOCCUPATO (~afraid)
SODDISFATTO (~happy)
TRISTE (~sad)
-
class
textacy.resources.depeche_mood.
DepecheMood
(data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/depeche_mood'), lang='en', word_rep='lemmapos', min_freq=3)[source]¶ Interface to DepecheMood, an emotion lexicon for English and Italian text.
Download the data (one time only!), and save its contents to disk:
>>> import textacy.resources
>>> rs = textacy.resources.DepecheMood(lang="en", word_rep="lemmapos")
>>> rs.download()
>>> rs.info
{'name': 'depeche_mood', 'site_url': 'http://www.depechemood.eu', 'publication_url': 'https://arxiv.org/abs/1810.03660', 'description': 'A simple tool to analyze the emotions evoked by a text.'}
Access emotional valences for individual terms:
>>> rs.get_emotional_valence("disease#n")
{'AFRAID': 0.37093526222120465, 'AMUSED': 0.06953745082761113, 'ANGRY': 0.06979683067736414, 'ANNOYED': 0.06465401081252636, 'DONT_CARE': 0.07080580707440012, 'HAPPY': 0.07537324330608403, 'INSPIRED': 0.13394731320662606, 'SAD': 0.14495008187418348}
>>> rs.get_emotional_valence("heal#v")
{'AFRAID': 0.060450319886187334, 'AMUSED': 0.09284046387491741, 'ANGRY': 0.06207816933776029, 'ANNOYED': 0.10027622719958346, 'DONT_CARE': 0.11259594401785, 'HAPPY': 0.09946106491457314, 'INSPIRED': 0.37794768332634626, 'SAD': 0.09435012744278205}
When passing multiple terms in the form of a List[str] or
Span
orDoc
, emotion weights are averaged over all terms for which weights are available:>>> rs.get_emotional_valence(["disease#n", "heal#v"]) {'AFRAID': 0.215692791053696, 'AMUSED': 0.08118895735126427, 'ANGRY': 0.06593750000756221, 'ANNOYED': 0.08246511900605491, 'DONT_CARE': 0.09170087554612506, 'HAPPY': 0.08741715411032858, 'INSPIRED': 0.25594749826648616, 'SAD': 0.11965010465848278} >>> text = "The acting was sweet and amazing, but the plot was dumb and terrible." >>> doc = textacy.make_spacy_doc(text, lang="en") >>> rs.get_emotional_valence(doc) {'AFRAID': 0.05272350876803627, 'AMUSED': 0.13725054992595098, 'ANGRY': 0.15787016147081184, 'ANNOYED': 0.1398733360688608, 'DONT_CARE': 0.14356943460620503, 'HAPPY': 0.11923217912716871, 'INSPIRED': 0.17880214720077342, 'SAD': 0.07067868283219296} >>> rs.get_emotional_valence(doc[0:6]) # the acting was sweet and amazing {'AFRAID': 0.039790959333750785, 'AMUSED': 0.1346884072825313, 'ANGRY': 0.1373596223131593, 'ANNOYED': 0.11391999698695347, 'DONT_CARE': 0.1574819173485831, 'HAPPY': 0.1552521762333925, 'INSPIRED': 0.21232264216449326, 'SAD': 0.049184278337136296}
For comparison, here is how Italian looks without POS-tagged words:
>>> rs = textacy.resources.DepecheMood(lang="it", word_rep="lemma")
>>> rs.get_emotional_valence("amore")
{'INDIGNATO': 0.11451408951814121, 'PREOCCUPATO': 0.1323655108545536, 'TRISTE': 0.18249663560400609, 'DIVERTITO': 0.33558928569110086, 'SODDISFATTO': 0.23503447833219815}
- Parameters
data_dir (str or
pathlib.Path
) – Path to directory on disk under which resource data is stored, i.e. /path/to/data_dir/depeche_mood.
lang ({"en", "it"}) – Standard two-letter code for the language of terms for which emotional valences are to be retrieved.
word_rep ({"token", "lemma", "lemmapos"}) – Level of text processing used in computing terms’ emotion weights. “token” => tokenization only; “lemma” => tokenization and lemmatization; “lemmapos” => tokenization, lemmatization, and part-of-speech tagging.
min_freq (int) – Minimum number of times that a given term must have appeared in the source dataset for it to be included in the emotion weights dict. This can be used to remove noisy terms at the expense of reducing coverage. Researchers observed peak performance at 10, but anywhere between 1 and 20 is reasonable.
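The effect of min_freq can be pictured as a simple threshold over the source-corpus term counts before the weights dict is built; the terms, counts, and `apply_min_freq` helper below are invented for illustration:

```python
# term -> (source-corpus frequency, emotion weights); values invented
raw = {
    "sunset#n": (42, {"HAPPY": 0.9}),
    "qux#n": (2, {"HAPPY": 0.5}),  # too rare: dropped at min_freq=3
}

def apply_min_freq(raw_weights, min_freq=3):
    """Keep only terms whose source-corpus frequency meets the threshold,
    trading coverage for less noisy emotion weights."""
    return {term: w for term, (freq, w) in raw_weights.items() if freq >= min_freq}

print(sorted(apply_min_freq(raw, min_freq=3)))  # ['sunset#n']
print(sorted(apply_min_freq(raw, min_freq=1)))  # ['qux#n', 'sunset#n']
```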
-
property
filepath
¶ Full path on disk for the DepecheMood tsv file corresponding to the
lang
andword_rep
.- Type
-
property
weights
¶ Mapping of term string (or term#POS, if
DepecheMood.word_rep
is “lemmapos”) to the terms’ normalized weights on a fixed set of affective dimensions (aka “emotions”).
-
download
(*, force=False)[source]¶ Download resource data as a zip archive file, then save it to disk and extract its contents under the
data_dir
directory.
- Parameters
force (bool) – If True, download the resource, even if it already exists on disk under
data_dir
.
-
get_emotional_valence
(terms)[source]¶ Get average emotional valence over all terms in
terms
for which emotion weights are available.
- Parameters
terms (str or Sequence[str],
Token
or Sequence[Token
]) –One or more terms over which to average emotional valences. Note that only nouns, adjectives, adverbs, and verbs are included.
Note
If the resource was initialized with
word_rep="lemmapos"
, then string terms must have matching parts-of-speech appended to them like TERM#POS. Only “n” => noun, “v” => verb, “a” => adjective, and “r” => adverb are included in the data.
- Returns
Mapping of emotion to average weight.
- Return type