Datasets and Resources

capitol_words.CapitolWords

Stream a collection of Congressional speeches from a compressed json file on disk, either as texts or text + metadata pairs.

supreme_court.SupremeCourt

Stream a collection of US Supreme Court decisions from a compressed json file on disk, either as texts or text + metadata pairs.

wikimedia.Wikipedia

Stream a collection of Wikipedia pages from a version- and language-specific database dump, either as texts or text + metadata pairs.

wikimedia.Wikinews

Stream a collection of Wikinews pages from a version- and language-specific database dump, either as texts or text + metadata pairs.

reddit_comments.RedditComments

Stream a collection of Reddit comments from 1 or more compressed files on disk, either as texts or text + metadata pairs.

oxford_text_archive.OxfordTextArchive

Stream a collection of English-language literary works from text files on disk, either as texts or text + metadata pairs.

imdb.IMDB

Stream a collection of IMDB movie reviews from text files on disk, either as texts or text + metadata pairs.

udhr.UDHR

Stream a collection of UDHR translations from disk, either as texts or text + metadata pairs.

concept_net.ConceptNet

Interface to ConceptNet, a multilingual knowledge base representing common words and phrases and the common-sense relationships between them.

depeche_mood.DepecheMood

Interface to DepecheMood, an emotion lexicon for English and Italian text.

Capitol Words Congressional speeches

A collection of ~11k (almost all) speeches given by the main protagonists of the 2016 U.S. Presidential election who had previously served in the U.S. Congress – including Hillary Clinton, Bernie Sanders, Barack Obama, Ted Cruz, and John Kasich – from January 1996 through June 2016.

Records include the following data:

  • text: Full text of the Congressperson’s remarks.

  • title: Title of the speech, in all caps.

  • date: Date on which the speech was given, as an ISO-standard string.

  • speaker_name: First and last name of the speaker.

  • speaker_party: Political party of the speaker: “R” for Republican, “D” for Democrat, “I” for Independent.

  • congress: Number of the Congress in which the speech was given: ranges continuously between 104 and 114.

  • chamber: Chamber of Congress in which the speech was given: almost all are either “House” or “Senate”, with a small number of “Extensions”.

This dataset was derived from data provided by the (now defunct) Sunlight Foundation’s Capitol Words API.

class textacy.datasets.capitol_words.CapitolWords(data_dir: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/capitol_words'))[source]

Stream a collection of Congressional speeches from a compressed json file on disk, either as texts or text + metadata pairs.

Download the data (one time only!) from the textacy-data repo (https://github.com/bdewilde/textacy-data), and save its contents to disk:

>>> import textacy.datasets
>>> ds = textacy.datasets.CapitolWords()
>>> ds.download()
>>> ds.info
{'name': 'capitol_words',
 'site_url': 'http://sunlightlabs.github.io/Capitol-Words/',
 'description': 'Collection of ~11k speeches in the Congressional Record given by notable U.S. politicians between Jan 1996 and Jun 2016.'}

Iterate over speeches as texts or records with both text and metadata:

>>> for text in ds.texts(limit=3):
...     print(text, end="\n\n")
>>> for text, meta in ds.records(limit=3):
...     print("\n{} ({})\n{}".format(meta["title"], meta["speaker_name"], text))

Filter speeches by a variety of metadata fields and text length:

>>> for text, meta in ds.records(speaker_name="Bernie Sanders", limit=3):
...     print("\n{}, {}\n{}".format(meta["title"], meta["date"], text))
>>> for text, meta in ds.records(speaker_party="D", congress={110, 111, 112},
...                          chamber="Senate", limit=3):
...     print(meta["title"], meta["speaker_name"], meta["date"])
>>> for text, meta in ds.records(speaker_name={"Barack Obama", "Hillary Clinton"},
...                              date_range=("2005-01-01", "2005-12-31")):
...     print(meta["title"], meta["speaker_name"], meta["date"])
>>> for text in ds.texts(min_len=50000):
...     print(len(text))
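
The streamed records also work well with ordinary Python tooling for quick aggregations. A minimal sketch, continuing from the examples above, that tallies how many speeches each speaker gave in a single congress:

>>> import collections
>>> speech_counts = collections.Counter(
...     meta["speaker_name"] for _, meta in ds.records(congress=114))
>>> speech_counts.most_common()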

Stream speeches into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=100))
Corpus(100 docs; 70496 tokens)
Parameters

data_dir – Path to directory on disk under which dataset is stored, i.e. /path/to/data_dir/capitol_words.

full_date_range

First and last dates for which speeches are available, each as an ISO-formatted string (YYYY-MM-DD).

Type

Tuple[str, str]

speaker_names

Full names of all speakers included in corpus, e.g. “Bernie Sanders”.

Type

Set[str]

speaker_parties

All distinct political parties of speakers, e.g. “R”.

Type

Set[str]

chambers

All distinct chambers in which speeches were given, e.g. “House”.

Type

Set[str]

congresses

All distinct numbers of the congresses in which speeches were given, e.g. 114.

Type

Set[int]

property filepath

Full path on disk for CapitolWords data as compressed json file. None if file is not found, e.g. has not yet been downloaded.

download(*, force: bool = False) → None[source]

Download the data as a Python version-specific compressed json file and save it to disk under the data_dir directory.

Parameters

force – If True, download the dataset, even if it already exists on disk under data_dir.

texts(*, speaker_name: Optional[Union[str, Set[str]]] = None, speaker_party: Optional[Union[str, Set[str]]] = None, chamber: Optional[Union[str, Set[str]]] = None, congress: Optional[Union[int, Set[int]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[str][source]

Iterate over speeches in this dataset, optionally filtering by a variety of metadata and/or text length, and yield texts only, in chronological order.

Parameters
  • speaker_name – Filter speeches by the speakers’ name; see CapitolWords.speaker_names.

  • speaker_party – Filter speeches by the speakers’ party; see CapitolWords.speaker_parties.

  • chamber – Filter speeches by the chamber in which they were given; see CapitolWords.chambers.

  • congress – Filter speeches by the congress in which they were given; see CapitolWords.congresses.

  • date_range – Filter speeches by the date on which they were given. Both start and end date must be specified, but a null value for either will be replaced by the min/max date available for the dataset.

  • min_len – Filter texts by the length (# characters) of their text content.

  • limit – Yield no more than limit texts that match all specified filters.

Yields

Full text of next (by chronological order) speech in dataset passing all filter params.

Raises

ValueError – If any filtering options are invalid.

records(*, speaker_name: Optional[Union[str, Set[str]]] = None, speaker_party: Optional[Union[str, Set[str]]] = None, chamber: Optional[Union[str, Set[str]]] = None, congress: Optional[Union[int, Set[int]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[textacy.types.Record][source]

Iterate over speeches in this dataset, optionally filtering by a variety of metadata and/or text length, and yield text + metadata pairs, in chronological order.

Parameters
  • speaker_name – Filter speeches by the speakers’ name; see CapitolWords.speaker_names.

  • speaker_party – Filter speeches by the speakers’ party; see CapitolWords.speaker_parties.

  • chamber – Filter speeches by the chamber in which they were given; see CapitolWords.chambers.

  • congress – Filter speeches by the congress in which they were given; see CapitolWords.congresses.

  • date_range – Filter speeches by the date on which they were given. Both start and end date must be specified, but a null value for either will be replaced by the min/max date available for the dataset.

  • min_len – Filter speeches by the length (# characters) of their text content.

  • limit – Yield no more than limit speeches that match all specified filters.

Yields

Full text of the next (by chronological order) speech in dataset passing all filters, and its corresponding metadata.

Raises

ValueError – If any filtering options are invalid.

Supreme Court decisions

A collection of ~8.4k (almost all) decisions issued by the U.S. Supreme Court from November 1946 through June 2016 – the “modern” era.

Records include the following data:

  • text: Full text of the Court’s decision.

  • case_name: Name of the court case, in all caps.

  • argument_date: Date on which the case was argued before the Court, as an ISO-formatted string (“YYYY-MM-DD”).

  • decision_date: Date on which the Court’s decision was announced, as an ISO-formatted string (“YYYY-MM-DD”).

  • decision_direction: Ideological direction of the majority’s decision: one of “conservative”, “liberal”, or “unspecifiable”.

  • maj_opinion_author: Name of the majority opinion’s author, if available and identifiable, as an integer code whose mapping is given in SupremeCourt.opinion_author_codes.

  • n_maj_votes: Number of justices voting in the majority.

  • n_min_votes: Number of justices voting in the minority.

  • issue: Subject matter of the case’s core disagreement (e.g. “affirmative action”) rather than its legal basis (e.g. “the equal protection clause”), as a string code whose mapping is given in SupremeCourt.issue_codes.

  • issue_area: Higher-level categorization of the issue (e.g. “Civil Rights”), as an integer code whose mapping is given in SupremeCourt.issue_area_codes.

  • us_cite_id: Citation identifier for each case according to the official United States Reports. Note: There are ~300 cases with duplicate ids, and it’s not clear if that’s “correct” or a data quality problem.

The text in this dataset was derived from FindLaw’s searchable database of court cases: http://caselaw.findlaw.com/court/us-supreme-court.

The metadata was extracted without modification from the Supreme Court Database: Harold J. Spaeth, Lee Epstein, et al. 2016 Supreme Court Database, Version 2016 Release 1. http://supremecourtdatabase.org. Its license is CC BY-NC 3.0 US: https://creativecommons.org/licenses/by-nc/3.0/us/.

This dataset’s creation was inspired by a blog post by Emily Barry: http://www.emilyinamillion.me/blog/2016/7/13/visualizing-supreme-court-topics-over-time.

The two datasets were merged through much munging and a carefully trained model using the dedupe package. The model’s duplicate threshold was set so as to maximize the F-score where precision had twice as much weight as recall. Still, given occasionally baffling inconsistencies in case naming, citation ids, and decision dates, a very small percentage of texts may be incorrectly matched to metadata. (Sorry.)

class textacy.datasets.supreme_court.SupremeCourt(data_dir: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/supreme_court'))[source]

Stream a collection of US Supreme Court decisions from a compressed json file on disk, either as texts or text + metadata pairs.

Download the data (one time only!) from the textacy-data repo (https://github.com/bdewilde/textacy-data), and save its contents to disk:

>>> import textacy.datasets
>>> ds = textacy.datasets.SupremeCourt()
>>> ds.download()
>>> ds.info
{'name': 'supreme_court',
 'site_url': 'http://caselaw.findlaw.com/court/us-supreme-court',
 'description': 'Collection of ~8.4k decisions issued by the U.S. Supreme Court between November 1946 and June 2016.'}

Iterate over decisions as texts or records with both text and metadata:

>>> for text in ds.texts(limit=3):
...     print(text[:500], end="\n\n")
>>> for text, meta in ds.records(limit=3):
...     print("\n{} ({})\n{}".format(meta["case_name"], meta["decision_date"], text[:500]))

Filter decisions by a variety of metadata fields and text length:

>>> for text, meta in ds.records(opinion_author=109, limit=3):  # Notorious RBG!
...     print(meta["case_name"], meta["decision_direction"], meta["n_maj_votes"])
>>> for text, meta in ds.records(decision_direction="liberal",
...                              issue_area={1, 9, 10}, limit=3):
...     print(meta["case_name"], meta["maj_opinion_author"], meta["n_maj_votes"])
>>> for text, meta in ds.records(opinion_author=102, date_range=('1985-02-11', '1986-02-11')):
...     print("\n{} ({})".format(meta["case_name"], meta["decision_date"]))
...     print(ds.issue_codes[meta["issue"]], "=>", meta["decision_direction"])
>>> for text in ds.texts(min_len=250000):
...     print(len(text))
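
The coded metadata lends itself to simple aggregations as well. A minimal sketch, continuing from the examples above, that tallies decision direction per issue area, using the issue_area_codes mapping to get readable labels:

>>> import collections
>>> direction_counts = collections.Counter(
...     (ds.issue_area_codes.get(meta["issue_area"]), meta["decision_direction"])
...     for _, meta in ds.records(limit=1000))
>>> direction_counts.most_common(5)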

Stream decisions into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=25))
Corpus(25 docs; 136696 tokens)
Parameters

data_dir (str or pathlib.Path) – Path to directory on disk under which the data is stored, i.e. /path/to/data_dir/supreme_court.

full_date_range

First and last dates for which decisions are available, each as an ISO-formatted string (YYYY-MM-DD).

Type

Tuple[str, str]

decision_directions

All distinct decision directions, e.g. “liberal”.

Type

Set[str]

opinion_author_codes

Mapping of majority opinion authors, from id code to full name.

Type

Dict[int, Optional[str]]

issue_area_codes

Mapping of high-level issue area of the case’s core disagreement, from id code to description.

Type

Dict[int, Optional[str]]

issue_codes

Mapping of the specific issue of the case’s core disagreement, from id code to description.

Type

Dict[str, str]

property filepath

Full path on disk for SupremeCourt data as compressed json file. None if file is not found, e.g. has not yet been downloaded.

download(*, force: bool = False) → None[source]

Download the data as a Python version-specific compressed json file and save it to disk under the data_dir directory.

Parameters

force – If True, download the dataset, even if it already exists on disk under data_dir.

texts(*, opinion_author: Optional[Union[int, Set[int]]] = None, decision_direction: Optional[Union[str, Set[str]]] = None, issue_area: Optional[Union[int, Set[int]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[str][source]

Iterate over decisions in this dataset, optionally filtering by a variety of metadata and/or text length, and yield texts only, in chronological order by decision date.

Parameters
  • opinion_author – Filter decisions by the name(s) of the majority opinion’s author, coded as an integer whose mapping is given in SupremeCourt.opinion_author_codes.

  • decision_direction – Filter decisions by the ideological direction of the majority’s decision; see SupremeCourt.decision_directions.

  • issue_area – Filter decisions by the issue area of the case’s subject matter, coded as an integer whose mapping is given in SupremeCourt.issue_area_codes.

  • date_range – Filter decisions by the date on which they were decided; both start and end date must be specified, but a null value for either will be replaced by the min/max date available for the dataset.

  • min_len – Filter decisions by the length (# characters) of their text content.

  • limit – Yield no more than limit decisions that match all specified filters.

Yields

Text of the next decision in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.

records(*, opinion_author: Optional[Union[int, Set[int]]] = None, decision_direction: Optional[Union[str, Set[str]]] = None, issue_area: Optional[Union[int, Set[int]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[textacy.types.Record][source]

Iterate over decisions in this dataset, optionally filtering by a variety of metadata and/or text length, and yield text + metadata pairs, in chronological order by decision date.

Parameters
  • opinion_author – Filter decisions by the name(s) of the majority opinion’s author, coded as an integer whose mapping is given in SupremeCourt.opinion_author_codes.

  • decision_direction – Filter decisions by the ideological direction of the majority’s decision; see SupremeCourt.decision_directions.

  • issue_area – Filter decisions by the issue area of the case’s subject matter, coded as an integer whose mapping is given in SupremeCourt.issue_area_codes.

  • date_range – Filter decisions by the date on which they were decided; both start and end date must be specified, but a null value for either will be replaced by the min/max date available for the dataset.

  • min_len – Filter decisions by the length (# characters) of their text content.

  • limit – Yield no more than limit decisions that match all specified filters.

Yields

Text of the next decision in dataset passing all filters, and its corresponding metadata.

Raises

ValueError – If any filtering options are invalid.

Wikimedia articles

All articles for a given Wikimedia project, specified by language and version.

Records include the following key fields (plus a few others):

  • text: Plain text content of the wiki page – no wiki markup!

  • title: Title of the wiki page.

  • wiki_links: A list of other wiki pages linked to from this page.

  • ext_links: A list of external URLs linked to from this page.

  • categories: A list of categories to which this wiki page belongs.

  • dt_created: Date on which the wiki page was first created.

  • page_id: Unique identifier of the wiki page, usable in Wikimedia APIs.

Datasets are generated by the Wikimedia Foundation for a variety of projects, such as Wikipedia and Wikinews. The source files are meant for search indexes, so they’re dumped in Elasticsearch bulk insert format – basically, a compressed JSON file with one record per line. For more information, refer to https://meta.wikimedia.org/wiki/Data_dumps.

class textacy.datasets.wikimedia.Wikimedia(name, meta, project, data_dir, lang='en', version='current', namespace=0)[source]

Base class for project-specific Wikimedia datasets; see Wikipedia and Wikinews below.

property filepath

Full path on disk for the Wikimedia CirrusSearch db dump corresponding to the project, lang, and version.

Type

str

download(*, force: bool = False) → None[source]

Download the Wikimedia CirrusSearch db dump corresponding to the given project, lang, and version as a compressed JSON file, and save it to disk under the data_dir directory.

Parameters

force – If True, download the dataset, even if it already exists on disk under data_dir.

Note

Some datasets are quite large (e.g. English Wikipedia is ~28GB) and can take hours to fully download.

texts(*, category: Optional[Union[str, Set[str]]] = None, wiki_link: Optional[Union[str, Set[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[str][source]

Iterate over wiki pages in this dataset, optionally filtering by a variety of metadata and/or text length, and yield texts only, in order of appearance in the db dump file.

Parameters
  • category – Filter wiki pages by the categories to which they’ve been assigned. For multiple values (Set[str]), ANY rather than ALL of the values must be found among a given page’s categories.

  • wiki_link – Filter wiki pages by the other wiki pages to which they’ve been linked. For multiple values (Set[str]), ANY rather than ALL of the values must be found among a given page’s wiki links.

  • min_len – Filter wiki pages by the length (# characters) of their text content.

  • limit – Yield no more than limit wiki pages that match all specified filters.

Yields

Text of the next wiki page in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.

records(*, category: Optional[Union[str, Set[str]]] = None, wiki_link: Optional[Union[str, Set[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[textacy.types.Record][source]

Iterate over wiki pages in this dataset, optionally filtering by a variety of metadata and/or text length, and yield text + metadata pairs, in order of appearance in the db dump file.

Parameters
  • category – Filter wiki pages by the categories to which they’ve been assigned. For multiple values (Set[str]), ANY rather than ALL of the values must be found among a given page’s categories.

  • wiki_link – Filter wiki pages by the other wiki pages to which they’ve been linked. For multiple values (Set[str]), ANY rather than ALL of the values must be found among a given page’s wiki links.

  • min_len – Filter wiki pages by the length (# characters) of their text content.

  • limit – Yield no more than limit wiki pages that match all specified filters.

Yields

Text of the next wiki page in dataset passing all filters, and its corresponding metadata.

Raises

ValueError – If any filtering options are invalid.

class textacy.datasets.wikimedia.Wikipedia(data_dir: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/wikipedia'), lang: str = 'en', version: str = 'current', namespace: int = 0)[source]

Stream a collection of Wikipedia pages from a version- and language-specific database dump, either as texts or text + metadata pairs.

Download a database dump (one time only!) and save its contents to disk:

>>> import textacy.datasets
>>> ds = textacy.datasets.Wikipedia(lang="en", version="current")
>>> ds.download()
>>> ds.info
{'name': 'wikipedia',
 'site_url': 'https://en.wikipedia.org/wiki/Main_Page',
 'description': 'All pages for a given language- and version-specific Wikipedia site snapshot.'}

Iterate over wiki pages as texts or records with both text and metadata:

>>> for text in ds.texts(limit=5):
...     print(text[:500])
>>> for text, meta in ds.records(limit=5):
...     print(meta["page_id"], meta["title"])

Filter wiki pages by a variety of metadata fields and text length:

>>> for text, meta in ds.records(category="Living people", limit=5):
...     print(meta["title"], meta["categories"])
>>> for text, meta in ds.records(wiki_link="United_States", limit=5):
...     print(meta["title"], meta["wiki_links"])
>>> for text in ds.texts(min_len=10000, limit=5):
...     print(len(text))

Stream wiki pages into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(min_len=2000, limit=50))
Corpus(50 docs; 72368 tokens)
Parameters
  • data_dir – Path to directory on disk under which database dump files are stored. Each file is expected as {lang}{project}/{version}/{lang}{project}-{version}-cirrussearch-content.json.gz immediately under this directory.

  • lang – Standard two-letter language code, e.g. “en” => “English”, “de” => “German”. https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

  • version – Database dump version to use. Either “current” for the most recently available version or a date formatted as “YYYYMMDD”. Dumps are produced weekly; check for available versions at https://dumps.wikimedia.org/other/cirrussearch/.

  • namespace – Namespace of the wiki pages to include. Typical, public-facing content is in the 0 (default) namespace.
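
For instance, a non-English, date-pinned snapshot can be targeted by combining the lang and version parameters above. A minimal sketch with hypothetical values (the dump for the chosen date must actually exist at https://dumps.wikimedia.org/other/cirrussearch/, and full Wikipedia dumps are large):

>>> import textacy.datasets
>>> # German Wikipedia, pinned to a hypothetical weekly dump date
>>> ds = textacy.datasets.Wikipedia(lang="de", version="20210101")
>>> ds.download()  # one time only; may take a while
>>> for text, meta in ds.records(min_len=1000, limit=3):
...     print(meta["title"])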

class textacy.datasets.wikimedia.Wikinews(data_dir: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/wikinews'), lang: str = 'en', version: str = 'current', namespace: int = 0)[source]

Stream a collection of Wikinews pages from a version- and language-specific database dump, either as texts or text + metadata pairs.

Download a database dump (one time only!) and save its contents to disk:

>>> import textacy.datasets
>>> ds = textacy.datasets.Wikinews(lang="en", version="current")
>>> ds.download()
>>> ds.info
{'name': 'wikinews',
 'site_url': 'https://en.wikinews.org/wiki/Main_Page',
 'description': 'All pages for a given language- and version-specific Wikinews site snapshot.'}

Iterate over wiki pages as texts or records with both text and metadata:

>>> for text in ds.texts(limit=5):
...     print(text[:500])
>>> for text, meta in ds.records(limit=5):
...     print(meta["page_id"], meta["title"])

Filter wiki pages by a variety of metadata fields and text length:

>>> for text, meta in ds.records(category="Politics and conflicts", limit=5):
...     print(meta["title"], meta["categories"])
>>> for text, meta in ds.records(wiki_link="Reuters", limit=5):
...     print(meta["title"], meta["wiki_links"])
>>> for text in ds.texts(min_len=5000, limit=5):
...     print(len(text))

Stream wiki pages into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=100))
Corpus(100 docs; 33092 tokens)
Parameters
  • data_dir – Path to directory on disk under which database dump files are stored. Each file is expected as {lang}{project}/{version}/{lang}{project}-{version}-cirrussearch-content.json.gz immediately under this directory.

  • lang – Standard two-letter language code, e.g. “en” => “English”, “de” => “German”. https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

  • version – Database dump version to use. Either “current” for the most recently available version or a date formatted as “YYYYMMDD”. Dumps are produced weekly; check for available versions at https://dumps.wikimedia.org/other/cirrussearch/.

  • namespace – Namespace of the wiki pages to include. Typical, public-facing content is in the 0 (default) namespace.

Reddit comments

A collection of up to ~1.5 billion Reddit comments posted from October 2007 through May 2015.

Records include the following key fields (plus a few others):

  • body: Full text of the comment.

  • created_utc: Date on which the comment was posted.

  • subreddit: Sub-reddit in which the comment was posted, excluding the familiar “/r/” prefix.

  • score: Net score (upvotes - downvotes) on the comment.

  • gilded: Number of times this comment received reddit gold.

The raw data was originally collected by /u/Stuck_In_the_Matrix via Reddit’s APIs, and stored for posterity by the Internet Archive. For more details, refer to https://archive.org/details/2015_reddit_comments_corpus.

class textacy.datasets.reddit_comments.RedditComments(data_dir: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/reddit_comments'))[source]

Stream a collection of Reddit comments from 1 or more compressed files on disk, either as texts or text + metadata pairs.

Download the data (one time only!) or subsets thereof by specifying a date range:

>>> import textacy.datasets
>>> ds = textacy.datasets.RedditComments()
>>> ds.download(date_range=("2007-10", "2008-03"))
>>> ds.info
{'name': 'reddit_comments',
 'site_url': 'https://archive.org/details/2015_reddit_comments_corpus',
 'description': 'Collection of ~1.5 billion publicly available Reddit comments from October 2007 through May 2015.'}

Iterate over comments as texts or records with both text and metadata:

>>> for text in ds.texts(limit=5):
...     print(text)
>>> for text, meta in ds.records(limit=5):
...     print("\n{} {}\n{}".format(meta["author"], meta["created_utc"], text))

Filter comments by a variety of metadata fields and text length:

>>> for text, meta in ds.records(subreddit="politics", limit=5):
...     print(meta["score"], ":", text)
>>> for text, meta in ds.records(date_range=("2008-01", "2008-03"), limit=5):
...     print(meta["created_utc"])
>>> for text, meta in ds.records(score_range=(10, None), limit=5):
...     print(meta["score"], ":", text)
>>> for text in ds.texts(min_len=2000, limit=5):
...     print(len(text))
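
The metadata also supports quick aggregations. A minimal sketch, continuing from the examples above, that averages comment scores per subreddit over a small sample:

>>> import collections
>>> totals = collections.defaultdict(lambda: [0, 0])  # subreddit => [score sum, count]
>>> for _, meta in ds.records(limit=10000):
...     totals[meta["subreddit"]][0] += meta["score"]
...     totals[meta["subreddit"]][1] += 1
>>> avg_scores = {sub: total / n for sub, (total, n) in totals.items()}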

Stream comments into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=1000))
Corpus(1000 docs; 27582 tokens)
Parameters

data_dir – Path to directory on disk under which the data is stored, i.e. /path/to/data_dir/reddit_comments. Each file covers a given month, as indicated in the filename like “YYYY/RC_YYYY-MM.bz2”.

full_date_range

First and last dates for which comments are available, each as an ISO-formatted string (YYYY-MM-DD).

Type

Tuple[str, str]

property filepaths

Full paths on disk for all Reddit comments files found under RedditComments.data_dir directory, sorted in chronological order.

download(*, date_range: Tuple[Optional[str], Optional[str]] = (None, None), force: bool = False) → None[source]

Download 1 or more monthly Reddit comments files from archive.org and save them to disk under the data_dir directory.

Parameters
  • date_range – Interval specifying the [start, end) dates for which comments files will be downloaded. Each item must be a str formatted as YYYY-MM or YYYY-MM-DD (the latter is converted to the corresponding YYYY-MM value). Both start and end values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.

  • force – If True, download the dataset, even if it already exists on disk under data_dir.

texts(*, subreddit: Optional[Union[str, Set[str]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, score_range: Optional[Tuple[Optional[int], Optional[int]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[str][source]

Iterate over comments (text-only) in 1 or more files of this dataset, optionally filtering by a variety of metadata and/or text length, in chronological order.

Parameters
  • subreddit – Filter comments for those which were posted in the specified subreddit(s).

  • date_range – Filter comments for those which were posted within the interval [start, end). Each item must be a str in ISO-standard format, i.e. some amount of YYYY-MM-DDTHH:mm:ss. Both start and end values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.

  • score_range – Filter comments for those whose score (# upvotes minus # downvotes) is within the interval [low, high). Both start and end values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.

  • min_len – Filter comments for those whose body length in chars is at least this long.

  • limit – Maximum number of comments passing all filters to yield. If None, all comments are iterated over.

Yields

Text of the next comment in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.

records(*, subreddit: Optional[Union[str, Set[str]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, score_range: Optional[Tuple[Optional[int], Optional[int]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[textacy.types.Record][source]

Iterate over comments (including text and metadata) in 1 or more files of this dataset, optionally filtering by a variety of metadata and/or text length, in chronological order.

Parameters
  • subreddit – Filter comments for those which were posted in the specified subreddit(s).

  • date_range – Filter comments for those which were posted within the interval [start, end). Each item must be a str in ISO-standard format, i.e. some amount of YYYY-MM-DDTHH:mm:ss. Both start and end values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.

  • score_range – Filter comments for those whose score (# upvotes minus # downvotes) is within the interval [low, high). Both start and end values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.

  • min_len – Filter comments for those whose body length in chars is at least this long.

  • limit – Maximum number of comments passing all filters to yield. If None, all comments are iterated over.

Yields

Text of the next comment in dataset passing all filters, and its corresponding metadata.

Raises

ValueError – If any filtering options are invalid.

Oxford Text Archive literary works

A collection of ~2.7k Creative Commons literary works from the Oxford Text Archive, containing primarily English-language 16th-20th century literature and history.

Records include the following data:

  • text: Full text of the literary work.

  • title: Title of the literary work.

  • author: Author(s) of the literary work.

  • year: Year that the literary work was published.

  • url: URL at which literary work can be found online via the OTA.

  • id: Unique identifier of the literary work within the OTA.

This dataset was compiled by David Mimno from the Oxford Text Archive and stored in his GitHub repo to avoid unnecessary scraping of the OTA site. It is downloaded from that repo and, aside from some light cleaning of its metadata, reproduced here exactly.

class textacy.datasets.oxford_text_archive.OxfordTextArchive(data_dir: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/oxford_text_archive'))[source]

Stream a collection of English-language literary works from text files on disk, either as texts or text + metadata pairs.

Download the data (one time only!), saving and extracting its contents to disk:

>>> import textacy.datasets
>>> ds = textacy.datasets.OxfordTextArchive()
>>> ds.download()
>>> ds.info
{'name': 'oxford_text_archive',
 'site_url': 'https://ota.ox.ac.uk/',
 'description': 'Collection of ~2.7k Creative Commons texts from the Oxford Text Archive, containing primarily English-language 16th-20th century literature and history.'}

Iterate over literary works as texts or records with both text and metadata:

>>> for text in ds.texts(limit=3):
...     print(text[:200])
>>> for text, meta in ds.records(limit=3):
...     print("\n{}, {}".format(meta["title"], meta["year"]))
...     print(text[:300])

Filter literary works by a variety of metadata fields and text length:

>>> for text, meta in ds.records(author="Shakespeare, William", limit=1):
...     print("{}\n{}".format(meta["title"], text[:500]))
>>> for text, meta in ds.records(date_range=("1900-01-01", "1990-01-01"), limit=5):
...     print(meta["year"], meta["author"])
>>> for text in ds.texts(min_len=4000000):
...     print(len(text))
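
Publication years in the metadata make it easy to bucket works by period. A minimal sketch, continuing from the examples above, that counts works per century (skipping works without a year):

>>> import collections
>>> century_counts = collections.Counter(
...     int(meta["year"]) // 100 + 1
...     for _, meta in ds.records()
...     if meta.get("year"))
>>> century_counts.most_common()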

Stream literary works into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=5))
Corpus(5 docs; 182289 tokens)
Parameters

data_dir (str or pathlib.Path) – Path to directory on disk under which dataset is stored, i.e. /path/to/data_dir/oxford_text_archive.

full_date_range

First and last dates for which works are available, each as an ISO-formatted string (YYYY-MM-DD).

Type

Tuple[str, str]

authors

Full names of all distinct authors included in this dataset, e.g. “Shakespeare, William”.

Type

Set[str]

download(*, force: bool = False) → None[source]

Download the data as a zip archive file, then save it to disk and extract its contents under the OxfordTextArchive.data_dir directory.

Parameters

force – If True, download the dataset, even if it already exists on disk under data_dir.

property metadata

Dict[str, dict]

texts(*, author: Optional[Union[str, Set[str]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[str][source]

Iterate over works in this dataset, optionally filtering by a variety of metadata and/or text length, and yield texts only.

Parameters
  • author – Filter texts by the authors’ name. For multiple values (Set[str]), ANY rather than ALL of the authors must be found among a given work’s authors.

  • date_range – Filter texts by the date on which it was published; both start and end date must be specified, but a null value for either will be replaced by the min/max date available in the dataset.

  • min_len – Filter texts by the length (# characters) of their text content.

  • limit – Yield no more than limit texts that match all specified filters.

Yields

Text of the next work in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.

records(*, author: Optional[Union[str, Set[str]]] = None, date_range: Optional[Tuple[Optional[str], Optional[str]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[textacy.types.Record][source]

Iterate over works in this dataset, optionally filtering by a variety of metadata and/or text length, and yield text + metadata pairs.

Parameters
  • author – Filter texts by the authors’ name. For multiple values (Set[str]), ANY rather than ALL of the authors must be found among a given work’s authors.

  • date_range – Filter texts by the date on which it was published; both start and end date must be specified, but a null value for either will be replaced by the min/max date available in the dataset.

  • min_len – Filter texts by the length (# characters) of their text content.

  • limit – Yield no more than limit texts that match all specified filters.

Yields

Text of the next work in dataset passing all filters, and its corresponding metadata.

Raises

ValueError – If any filtering options are invalid.

IMDB movie reviews

A collection of 50k highly polar movie reviews posted to IMDB, split evenly into training and testing sets, with 25k positive and 25k negative sentiment labels, as well as some unlabeled reviews.

Records include the following key fields (plus a few others):

  • text: Full text of the review.

  • subset: Subset of the dataset (“train” or “test”) into which the review has been split.

  • label: Sentiment label (“pos” or “neg”) assigned to the review.

  • rating: Numeric rating assigned by the original reviewer, ranging from 1 to 10. Reviews with a rating <= 5 are “neg”; the rest are “pos”.

  • movie_id: Unique identifier for the movie under review within IMDB, useful for grouping reviews or joining with an external movie dataset.

Reference: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

class textacy.datasets.imdb.IMDB(data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/imdb'))[source]

Stream a collection of IMDB movie reviews from text files on disk, either as texts or text + metadata pairs.

Download the data (one time only!), saving and extracting its contents to disk:

>>> import textacy.datasets
>>> ds = textacy.datasets.IMDB()
>>> ds.download()
>>> ds.info
{'name': 'imdb',
 'site_url': 'http://ai.stanford.edu/~amaas/data/sentiment',
 'description': 'Collection of 50k highly polar movie reviews split evenly into train and test sets, with 25k positive and 25k negative labels. Also includes some unlabeled reviews.'}

Iterate over movie reviews as texts or records with both text and metadata:

>>> for text in ds.texts(limit=5):
...     print(text)
>>> for text, meta in ds.records(limit=5):
...     print("\n{} {}\n{}".format(meta["label"], meta["rating"], text))

Filter movie reviews by a variety of metadata fields and text length:

>>> for text, meta in ds.records(label="pos", limit=5):
...     print(meta["rating"], ":", text)
>>> for text, meta in ds.records(rating_range=(9, 11), limit=5):
...     print(meta["rating"], text)
>>> for text in ds.texts(min_len=1000, limit=5):
...     print(len(text))
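
Since each review carries a subset and a sentiment label, assembling a small labeled sample for experimentation is straightforward. A minimal sketch, continuing from the examples above:

>>> # collect a balanced sample of labeled training reviews
>>> train_sample = [
...     (text, meta["label"])
...     for label in ("pos", "neg")
...     for text, meta in ds.records(subset="train", label=label, limit=500)
... ]
>>> len(train_sample)
1000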

Stream movie reviews into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=100))
Corpus(100 docs; 24340 tokens)
Parameters

data_dir – Path to directory on disk under which the data is stored, i.e. /path/to/data_dir/imdb.

full_rating_range

Lowest and highest ratings for which movie reviews are available.

Type

Tuple[int, int]

download(*, force: bool = False) → None[source]

Download the data as a compressed tar archive file, then save it to disk and extract its contents under the data_dir directory.

Parameters

force – If True, always download the dataset even if it already exists on disk under data_dir.

texts(*, subset: Optional[str] = None, label: Optional[str] = None, rating_range: Optional[Tuple[Optional[int], Optional[int]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[str][source]

Iterate over movie reviews in this dataset, optionally filtering by a variety of metadata and/or text length, and yield texts only.

Parameters
  • subset ({"train", "test"}) – Filter movie reviews by the dataset subset into which they’ve already been split.

  • label ({"pos", "neg", "unsup"}) – Filter movie reviews by the assigned sentiment label (or lack thereof, for “unsup”).

  • rating_range – Filter movie reviews by the rating assigned by the reviewer. Only those with ratings in the interval [low, high) are included. Both low and high values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.

  • min_len – Filter reviews by the length (# characters) of their text content.

  • limit – Yield no more than limit reviews that match all specified filters.

Yields

Text of the next movie review in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.

records(*, subset: Optional[str] = None, label: Optional[str] = None, rating_range: Optional[Tuple[Optional[int], Optional[int]]] = None, min_len: Optional[int] = None, limit: Optional[int] = None) → Iterable[textacy.types.Record][source]

Iterate over movie reviews in this dataset, optionally filtering by a variety of metadata and/or text length, and yield text + metadata pairs.

Parameters
  • subset ({"train", "test"}) – Filter movie reviews by the dataset subset into which they’ve already been split.

  • label ({"pos", "neg", "unsup"}) – Filter movie reviews by the assigned sentiment label (or lack thereof, for “unsup”).

  • rating_range – Filter movie reviews by the rating assigned by the reviewer. Only those with ratings in the interval [low, high) are included. Both low and high values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.

  • min_len – Filter reviews by the length (# characters) of their text content.

  • limit – Yield no more than limit reviews that match all specified filters.

Yields

Text of the next movie review in dataset passing all filters, and its corresponding metadata.

Raises

ValueError – If any filtering options are invalid.

UDHR translations

A collection of translations of the Universal Declaration of Human Rights (UDHR), a milestone document in the history of human rights that first formally established fundamental human rights to be universally protected.

Records include the following fields:

  • text: Full text of the translated UDHR document.

  • lang: ISO-639-1 language code of the text.

  • lang_name: Ethnologue entry for the language (see https://www.ethnologue.com).

The source dataset was compiled and is updated by the Unicode Consortium as a way to demonstrate the use of unicode in representing a wide variety of languages. In fact, the UDHR was chosen because it’s been translated into more languages than any other document! However, this dataset only provides access to records translated into ISO-639-1 languages — that is, major living languages only, rather than every language, major or minor, that has ever existed. If you need access to texts in those other languages, you can find them at UDHR._texts_dirpath.

For more details, go to https://unicode.org/udhr.

class textacy.datasets.udhr.UDHR(data_dir: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/udhr'))[source]

Stream a collection of UDHR translations from disk, either as texts or text + metadata pairs.

Download the data (one time only!), saving and extracting its contents to disk:

>>> import textacy.datasets
>>> ds = textacy.datasets.UDHR()
>>> ds.download()
>>> ds.info
{'name': 'udhr',
 'site_url': 'http://www.ohchr.org/EN/UDHR',
 'description': 'A collection of translations of the Universal Declaration of Human Rights (UDHR), a milestone document in the history of human rights that first, formally established fundamental human rights to be universally protected.'}

Iterate over translations as texts or records with both text and metadata:

>>> for text in ds.texts(limit=5):
...     print(text[:500])
>>> for text, meta in ds.records(limit=5):
...     print("\n{} ({})\n{}".format(meta["lang_name"], meta["lang"], text[:500]))

Filter translations by language, and note that some languages have multiple translations:

>>> for text, meta in ds.records(lang="en"):
...     print("\n{} ({})\n{}".format(meta["lang_name"], meta["lang"], text[:500]))
>>> for text, meta in ds.records(lang="zh"):
...     print("\n{} ({})\n{}".format(meta["lang_name"], meta["lang"], text[:500]))
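
A minimal sketch that collects one translation per language code for a handful of languages (if a language has multiple translations, only the last one seen is kept):

>>> sample = {}
>>> for text, meta in ds.records(lang={"en", "es", "fr", "de"}):
...     sample[meta["lang"]] = text
>>> sorted(sample.keys())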

Note: Streaming translations into a textacy.Corpus doesn’t work the same way as for the other datasets, since this dataset is multilingual.

Parameters

data_dir (str or pathlib.Path) – Path to directory on disk under which the data is stored, i.e. /path/to/data_dir/udhr.

langs

All distinct language codes with texts in this dataset, e.g. “en” for English.

Type

Set[str]

download(*, force: bool = False) → None[source]

Download the data as a zipped archive of language-specific text files, then save it to disk and extract its contents under the data_dir directory.

Parameters

force – If True, download the dataset, even if it already exists on disk under data_dir.

texts(*, lang: Optional[Union[str, Set[str]]] = None, limit: Optional[int] = None) → Iterable[str][source]

Iterate over records in this dataset, optionally filtering by language, and yield texts only.

Parameters
  • lang – Filter records by the language in which they’re written; see UDHR.langs.

  • limit – Yield no more than limit texts that match specified filter.

Yields

Text of the next record in dataset passing filters.

Raises

ValueError – If any filtering options are invalid.

records(*, lang: Optional[Union[str, Set[str]]] = None, limit: Optional[int] = None) → Iterable[textacy.types.Record][source]

Iterate over records in this dataset, optionally filtering by language, and yield text + metadata pairs.

Parameters
  • lang – Filter records by the language in which they’re written; see UDHR.langs.

  • limit – Yield no more than limit texts that match specified filter.

Yields

Text of the next record in dataset passing filters, and its corresponding metadata.

Raises

ValueError – If any filtering options are invalid.

ConceptNet

ConceptNet is a multilingual knowledge base, representing common words and phrases and the common-sense relationships between them. This information is collected from a variety of sources, including crowd-sourced resources (e.g. Wiktionary, Open Mind Common Sense), games with a purpose (e.g. Verbosity, nadya.jp), and expert-created resources (e.g. WordNet, JMDict).

The interface in textacy gives access to several key relationships between terms that are useful in a variety of NLP tasks:

  • antonyms: terms that are opposites of each other in some relevant way

  • hyponyms: terms that are subtypes or specific instances of other terms

  • meronyms: terms that are parts of other terms

  • synonyms: terms that are sufficiently similar that they may be used interchangeably

class textacy.resources.concept_net.ConceptNet(data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/concept_net'), version='5.7.0')[source]

Interface to ConceptNet, a multilingual knowledge base representing common words and phrases and the common-sense relationships between them.

Download the data (one time only!), and save its contents to disk:

>>> import textacy.resources
>>> rs = textacy.resources.ConceptNet()
>>> rs.download()
>>> rs.info
{'name': 'concept_net',
 'site_url': 'http://conceptnet.io',
 'publication_url': 'https://arxiv.org/abs/1612.03975',
 'description': 'An open, multilingual semantic network of general knowledge, designed to help computers understand the meanings of words.'}

Access other same-language terms related to a given term in a variety of ways:

>>> rs.get_synonyms("spouse", lang="en", sense="n")
['mate', 'married person', 'better half', 'partner']
>>> rs.get_antonyms("love", lang="en", sense="v")
['detest', 'hate', 'loathe']
>>> rs.get_hyponyms("marriage", lang="en", sense="n")
['cohabitation situation', 'union', 'legal agreement', 'ritual', 'family', 'marital status']

Note: The very first time a given relationship is accessed, the full ConceptNet db must be parsed and split for fast future access. This can take a couple minutes; be patient.

When passing a spaCy Token or Span, the corresponding lang and sense are inferred automatically from the object:

>>> text = "The quick brown fox jumps over the lazy dog."
>>> doc = textacy.make_spacy_doc(text, lang="en")
>>> rs.get_synonyms(doc[1])  # quick
['flying', 'fast', 'rapid', 'ready', 'straightaway', 'nimble', 'speedy', 'warm']
>>> rs.get_synonyms(doc[4:5])  # jumps
['leap', 'startle', 'hump', 'flinch', 'jump off', 'skydive', 'jumpstart', ...]

Many terms won’t have entries, for actual linguistic reasons or because the db’s coverage of a given language’s vocabulary isn’t comprehensive:

>>> rs.get_meronyms(doc[3])  # fox
[]
>>> rs.get_antonyms(doc[7])  # lazy
[]
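
Because lookups can come back empty, it can be handy to wrap them in a small helper that falls back across relations. A minimal sketch (the helper name and fallback order are illustrative, not part of textacy’s API):

>>> def expand_term(term, *, lang="en", sense="n"):
...     # return synonyms if any are found, otherwise hyponyms, otherwise []
...     return (
...         rs.get_synonyms(term, lang=lang, sense=sense)
...         or rs.get_hyponyms(term, lang=lang, sense=sense)
...     )
>>> expand_term("marriage")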
Parameters
  • data_dir (str or pathlib.Path) – Path to directory on disk under which resource data is stored, i.e. /path/to/data_dir/concept_net.

  • version ({"5.7.0", "5.6.0", "5.5.5"}) – Version string of the ConceptNet db to use. Since newer versions typically represent improvements over earlier versions, you’ll probably want “5.7.0” (the default value).

download(*, force=False)[source]

Download resource data as a gzipped csv file, then save it to disk under the ConceptNet.data_dir directory.

Parameters

force (bool) – If True, download resource data, even if it already exists on disk; otherwise, don’t re-download the data.

property filepath

Full path on disk for the ConceptNet gzipped csv file corresponding to the given ConceptNet.data_dir.

Type

str

property antonyms

Mapping of language code to term to sense to set of term’s antonyms – opposites of the term in some relevant way, like being at opposite ends of a scale or fundamentally similar but with a key difference between them – such as black <=> white or hot <=> cold. Note that this relationship is symmetric.

Based on the “/r/Antonym” relation in ConceptNet.

Type

Dict[str, Dict[str, Dict[str, List[str]]]]

get_antonyms(term, *, lang=None, sense=None)[source]
Parameters
  • term (str or spacy.tokens.Token or spacy.tokens.Span) –

  • lang (str) – Standard code for the language of term.

  • sense (str) – Sense in which term is used in context, which in practice is just its part of speech. Valid values: “n” or “NOUN”, “v” or “VERB”, “a” or “ADJ”, “r” or “ADV”.

Returns

List[str]

property hyponyms

Mapping of language code to term to sense to set of term’s hyponyms – subtypes or specific instances of the term – such as car => vehicle or Chicago => city. Every A is a B.

Based on the “/r/IsA” relation in ConceptNet.

Type

Dict[str, Dict[str, Dict[str, List[str]]]]

get_hyponyms(term, *, lang=None, sense=None)[source]
Parameters
  • term (str or spacy.tokens.Token or spacy.tokens.Span) –

  • lang (str) – Standard code for the language of term.

  • sense (str) – Sense in which term is used in context, which in practice is just its part of speech. Valid values: “n” or “NOUN”, “v” or “VERB”, “a” or “ADJ”, “r” or “ADV”.

Returns

List[str]

property meronyms

Mapping of language code to term to sense to set of term’s meronyms – parts of the term – such as gearshift => car.

Based on the “/r/PartOf” relation in ConceptNet.

Type

Dict[str, Dict[str, Dict[str, List[str]]]]

get_meronyms(term, *, lang=None, sense=None)[source]
Parameters
  • term (str or spacy.tokens.Token or spacy.tokens.Span) –

  • lang (str) – Standard code for the language of term.

  • sense (str) – Sense in which term is used in context, which in practice is just its part of speech. Valid values: “n” or “NOUN”, “v” or “VERB”, “a” or “ADJ”, “r” or “ADV”.

Returns

List[str]

property synonyms

Mapping of language code to term to sense to set of term’s synonyms – sufficiently similar concepts that they may be used interchangeably – such as sunlight <=> sunshine. Note that this relationship is symmetric.

Based on the “/r/Synonym” relation in ConceptNet.

Type

Dict[str, Dict[str, Dict[str, List[str]]]]

get_synonyms(term, *, lang=None, sense=None)[source]
Parameters
  • term (str or spacy.tokens.Token or spacy.tokens.Span) –

  • lang (str) – Standard code for the language of term.

  • sense (str) – Sense in which term is used in context, which in practice is just its part of speech. Valid values: “n” or “NOUN”, “v” or “VERB”, “a” or “ADJ”, “r” or “ADV”.

Returns

List[str]

DepecheMood

DepecheMood is a high-quality and high-coverage emotion lexicon for English and Italian text, mapping individual terms to their emotional valences. These word-emotion weights are inferred from crowd-sourced datasets of emotionally tagged news articles (rappler.com for English, corriere.it for Italian).

English terms are assigned weights to eight emotions:

  • AFRAID

  • AMUSED

  • ANGRY

  • ANNOYED

  • DONT_CARE

  • HAPPY

  • INSPIRED

  • SAD

Italian terms are assigned weights to five emotions:

  • DIVERTITO (~amused)

  • INDIGNATO (~annoyed)

  • PREOCCUPATO (~afraid)

  • SODDISFATTO (~happy)

  • TRISTE (~sad)

class textacy.resources.depeche_mood.DepecheMood(data_dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data/depeche_mood'), lang='en', word_rep='lemmapos', min_freq=3)[source]

Interface to DepecheMood, an emotion lexicon for English and Italian text.

Download the data (one time only!), and save its contents to disk:

>>> import textacy.resources
>>> rs = textacy.resources.DepecheMood(lang="en", word_rep="lemmapos")
>>> rs.download()
>>> rs.info
{'name': 'depeche_mood',
 'site_url': 'http://www.depechemood.eu',
 'publication_url': 'https://arxiv.org/abs/1810.03660',
 'description': 'A simple tool to analyze the emotions evoked by a text.'}

Access emotional valences for individual terms:

>>> rs.get_emotional_valence("disease#n")
{'AFRAID': 0.37093526222120465,
 'AMUSED': 0.06953745082761113,
 'ANGRY': 0.06979683067736414,
 'ANNOYED': 0.06465401081252636,
 'DONT_CARE': 0.07080580707440012,
 'HAPPY': 0.07537324330608403,
 'INSPIRED': 0.13394731320662606,
 'SAD': 0.14495008187418348}
>>> rs.get_emotional_valence("heal#v")
{'AFRAID': 0.060450319886187334,
 'AMUSED': 0.09284046387491741,
 'ANGRY': 0.06207816933776029,
 'ANNOYED': 0.10027622719958346,
 'DONT_CARE': 0.11259594401785,
 'HAPPY': 0.09946106491457314,
 'INSPIRED': 0.37794768332634626,
 'SAD': 0.09435012744278205}

When passing multiple terms in the form of a List[str] or Span or Doc, emotion weights are averaged over all terms for which weights are available:

>>> rs.get_emotional_valence(["disease#n", "heal#v"])
{'AFRAID': 0.215692791053696,
 'AMUSED': 0.08118895735126427,
 'ANGRY': 0.06593750000756221,
 'ANNOYED': 0.08246511900605491,
 'DONT_CARE': 0.09170087554612506,
 'HAPPY': 0.08741715411032858,
 'INSPIRED': 0.25594749826648616,
 'SAD': 0.11965010465848278}
>>> text = "The acting was sweet and amazing, but the plot was dumb and terrible."
>>> doc = textacy.make_spacy_doc(text, lang="en")
>>> rs.get_emotional_valence(doc)
{'AFRAID': 0.05272350876803627,
 'AMUSED': 0.13725054992595098,
 'ANGRY': 0.15787016147081184,
 'ANNOYED': 0.1398733360688608,
 'DONT_CARE': 0.14356943460620503,
 'HAPPY': 0.11923217912716871,
 'INSPIRED': 0.17880214720077342,
 'SAD': 0.07067868283219296}
>>> rs.get_emotional_valence(doc[0:6])  # the acting was sweet and amazing
{'AFRAID': 0.039790959333750785,
 'AMUSED': 0.1346884072825313,
 'ANGRY': 0.1373596223131593,
 'ANNOYED': 0.11391999698695347,
 'DONT_CARE': 0.1574819173485831,
 'HAPPY': 0.1552521762333925,
 'INSPIRED': 0.21232264216449326,
 'SAD': 0.049184278337136296}

For good measure, here’s how Italian without POS-tagged words looks:

>>> rs = textacy.resources.DepecheMood(lang="it", word_rep="lemma")
>>> rs.get_emotional_valence("amore")
{'INDIGNATO': 0.11451408951814121,
 'PREOCCUPATO': 0.1323655108545536,
 'TRISTE': 0.18249663560400609,
 'DIVERTITO': 0.33558928569110086,
 'SODDISFATTO': 0.23503447833219815}
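
Since get_emotional_valence() returns a plain dict of weights, picking out the dominant emotion is a one-liner. A minimal sketch, continuing from the Italian example above:

>>> weights = rs.get_emotional_valence("amore")
>>> max(weights, key=weights.get)
'DIVERTITO'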
Parameters
  • data_dir (str or pathlib.Path) – Path to directory on disk under which resource data is stored, i.e. /path/to/data_dir/depeche_mood.

  • lang ({"en", "it"}) – Standard two-letter code for the language of terms for which emotional valences are to be retrieved.

  • word_rep ({"token", "lemma", "lemmapos"}) – Level of text processing used in computing terms’ emotion weights. “token” => tokenization only; “lemma” => tokenization and lemmatization; “lemmapos” => tokenization, lemmatization, and part-of-speech tagging.

  • min_freq (int) – Minimum number of times that a given term must have appeared in the source dataset for it to be included in the emotion weights dict. This can be used to remove noisy terms at the expense of reducing coverage. Researchers observed peak performance at 10, but anywhere between 1 and 20 is reasonable.

property filepath

Full path on disk for the DepecheMood tsv file corresponding to the lang and word_rep.

Type

str

property weights

Mapping of term string (or term#POS, if DepecheMood.word_rep is “lemmapos”) to the terms’ normalized weights on a fixed set of affective dimensions (aka “emotions”).

Type

Dict[str, Dict[str, float]]

download(*, force=False)[source]

Download resource data as a zip archive file, then save it to disk and extract its contents under the data_dir directory.

Parameters

force (bool) – If True, download the resource, even if it already exists on disk under data_dir.

get_emotional_valence(terms)[source]

Get average emotional valence over all terms in terms for which emotion weights are available.

Parameters

terms (str or Sequence[str], Token or Sequence[Token]) – One or more terms over which to average emotional valences. Note that only nouns, adjectives, adverbs, and verbs are included.

Note

If the resource was initialized with word_rep="lemmapos", then string terms must have matching parts-of-speech appended to them like TERM#POS. Only “n” => noun, “v” => verb, “a” => adjective, and “r” => adverb are included in the data.

Returns

Mapping of emotion to average weight.

Return type

Dict[str, float]
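
If you’re passing plain strings to a resource initialized with word_rep="lemmapos", the TERM#POS keys can be built from a spaCy doc by mapping coarse part-of-speech tags down to the single-letter senses. A minimal sketch, assuming an English, lemmapos-initialized resource; the pos_map dict is an assumption based on the senses listed in the note above:

>>> import textacy
>>> pos_map = {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}  # assumed tag => sense mapping
>>> doc = textacy.make_spacy_doc("The acting was sweet and amazing.", lang="en")
>>> terms = [
...     "{}#{}".format(tok.lemma_.lower(), pos_map[tok.pos_])
...     for tok in doc
...     if tok.pos_ in pos_map
... ]
>>> rs.get_emotional_valence(terms)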