Information Extraction¶
words: Extract an ordered sequence of words from a document processed by spaCy, optionally filtering words by part-of-speech tag and frequency.
ngrams: Extract an ordered sequence of n-grams (n consecutive words) from a spacy-parsed doc, optionally filtering by the types and parts-of-speech of the constituent words.
entities: Extract an ordered sequence of named entities (PERSON, ORG, LOC, etc.) from a Doc, optionally filtering by entity types and frequencies.
noun_chunks: Extract an ordered sequence of noun chunks from a spacy-parsed doc, optionally filtering by frequency and dropping leading determiners.
pos_regex_matches: Extract sequences of consecutive tokens from a spacy-parsed doc whose part-of-speech tags match the specified regex pattern.
matches: Extract Spans from a Doc matching one or more patterns of per-token attr:value pairs, with optional quantity qualifiers.
subject_verb_object_triples: Extract an ordered sequence of subject-verb-object (SVO) triples from a spacy-parsed doc.
acronyms_and_definitions: Extract a collection of acronyms and their most likely definitions, if available, from a spacy-parsed doc.
semistructured_statements: Extract “semi-structured statements” from a spacy-parsed doc, each as an (entity, cue, fragment) triple.
direct_quotations: Baseline, not-great attempt at direct quotation extraction (no indirect or mixed quotations) using rules and patterns.
textacy.extract
: Functions to extract various elements of interest from documents
already parsed by spaCy, such as n-grams, named entities, subject-verb-object triples,
and acronyms.
-
textacy.extract.words(doc: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span], *, filter_stops: bool = True, filter_punct: bool = True, filter_nums: bool = False, include_pos: Optional[Union[str, Set[str]]] = None, exclude_pos: Optional[Union[str, Set[str]]] = None, min_freq: int = 1) → Iterable[spacy.tokens.token.Token][source]¶
Extract an ordered sequence of words from a document processed by spaCy, optionally filtering words by part-of-speech tag and frequency.
- Parameters
doc –
filter_stops – If True, remove stop words from word list.
filter_punct – If True, remove punctuation from word list.
filter_nums – If True, remove number-like words (e.g. 10, “ten”) from word list.
include_pos – Remove words whose part-of-speech tag IS NOT in the specified tags.
exclude_pos – Remove words whose part-of-speech tag IS in the specified tags.
min_freq – Remove words that occur in doc fewer than min_freq times.
- Yields
Next token from doc passing specified filters, in order of appearance in the document.
- Raises
TypeError – if include_pos or exclude_pos is not a str, a set of str, or a falsy value.
Note
Filtering by part-of-speech tag uses the universal POS tag set; for details, check spaCy’s docs: https://spacy.io/api/annotation#pos-tagging
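The interaction of these filters can be hard to picture from the parameter list alone. Below is a minimal pure-Python sketch of the same filtering logic, using (text, pos, is_stop, is_punct) tuples as hypothetical stand-ins for spaCy Token objects; it illustrates the documented behavior, not textacy's actual implementation:

```python
from collections import Counter

# Hypothetical stand-in tokens: (text, pos, is_stop, is_punct) tuples
# rather than real spaCy Token objects.
tokens = [
    ("the", "DET", True, False),
    ("cat", "NOUN", False, False),
    ("sat", "VERB", False, False),
    (".", "PUNCT", False, True),
    ("the", "DET", True, False),
    ("cat", "NOUN", False, False),
]

def words(tokens, *, filter_stops=True, filter_punct=True,
          include_pos=None, min_freq=1):
    # Count occurrences up front so the min_freq filter can be applied
    # while still yielding tokens in order of appearance.
    freqs = Counter(text.lower() for text, _, _, _ in tokens)
    for text, pos, is_stop, is_punct in tokens:
        if filter_stops and is_stop:
            continue
        if filter_punct and is_punct:
            continue
        if include_pos and pos not in include_pos:
            continue
        if freqs[text.lower()] < min_freq:
            continue
        yield text

result = list(words(tokens, include_pos={"NOUN", "VERB"}, min_freq=2))
# "cat" appears twice and passes all filters; "sat" occurs only once
```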
-
textacy.extract.ngrams(doc: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span], n: int, *, filter_stops: bool = True, filter_punct: bool = True, filter_nums: bool = False, include_pos: Optional[Union[str, Set[str]]] = None, exclude_pos: Optional[Union[str, Set[str]]] = None, min_freq: int = 1) → Iterable[spacy.tokens.span.Span][source]¶
Extract an ordered sequence of n-grams (n consecutive words) from a spacy-parsed doc, optionally filtering n-grams by the types and parts-of-speech of the constituent words.
- Parameters
doc –
n – Number of tokens per n-gram; 2 => bigrams, 3 => trigrams, etc.
filter_stops – If True, remove ngrams that start or end with a stop word
filter_punct – If True, remove ngrams that contain any punctuation-only tokens
filter_nums – If True, remove ngrams that contain any numbers or number-like tokens (e.g. 10, ‘ten’)
include_pos – Remove ngrams if any constituent tokens’ part-of-speech tags ARE NOT included in this param
exclude_pos – Remove ngrams if any constituent tokens’ part-of-speech tags ARE included in this param
min_freq – Remove ngrams that occur in doc fewer than min_freq times.
- Yields
Next ngram from doc passing all specified filters, in order of appearance in the document.
- Raises
ValueError – if n < 1.
TypeError – if include_pos or exclude_pos is not a str, a set of str, or a falsy value.
Note
Filtering by part-of-speech tag uses the universal POS tag set; for details, check spaCy’s docs: https://spacy.io/api/annotation#pos-tagging
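As a concrete illustration of the n-gram windowing and the filter_stops rule (drop n-grams that start or end with a stop word), here is a pure-Python sketch over plain strings rather than spaCy tokens; it is an illustration only, not textacy's actual code:

```python
def ngrams_sketch(words, n, *, filter_stops=True, stopwords=frozenset()):
    # Yield every run of n consecutive words, in order of appearance,
    # skipping runs that start or end with a stop word.
    if n < 1:
        raise ValueError("n must be >= 1")
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        if filter_stops and (gram[0] in stopwords or gram[-1] in stopwords):
            continue
        yield gram

bigrams = list(ngrams_sketch(["the", "quick", "brown", "fox"], 2,
                             stopwords={"the"}))
# ("the", "quick") is dropped because it starts with a stop word
```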
-
textacy.extract.entities(doc: spacy.tokens.doc.Doc, *, include_types: Optional[Union[str, Set[str]]] = None, exclude_types: Optional[Union[str, Set[str]]] = None, drop_determiners: bool = True, min_freq: int = 1) → Iterable[spacy.tokens.span.Span][source]¶
Extract an ordered sequence of named entities (PERSON, ORG, LOC, etc.) from a Doc, optionally filtering by entity types and frequencies.
- Parameters
doc –
include_types – Remove entities whose type IS NOT in this param; if “NUMERIC”, all numeric entity types (“DATE”, “MONEY”, “ORDINAL”, etc.) are included
exclude_types – Remove entities whose type IS in this param; if “NUMERIC”, all numeric entity types (“DATE”, “MONEY”, “ORDINAL”, etc.) are excluded
drop_determiners –
Remove leading determiners (e.g. “the”) from entities (e.g. “the United States” => “United States”).
Note
Entities from which a leading determiner has been removed are, effectively, new entities, and are not saved to the Doc from which they came. This is irritating but unavoidable, since this function is not meant to have side effects on document state. If you’re only using the text of the returned spans, this is no big deal, but watch out if you’re counting on determiner-less entities being associated with the doc downstream.
min_freq – Remove entities that occur in doc fewer than min_freq times.
- Yields
Next entity from doc passing all specified filters, in order of appearance in the document.
- Raises
TypeError – if include_types or exclude_types is not a str, a set of str, or a falsy value.
-
textacy.extract.noun_chunks(doc: spacy.tokens.doc.Doc, *, drop_determiners: bool = True, min_freq: int = 1) → Iterable[spacy.tokens.span.Span][source]¶
Extract an ordered sequence of noun chunks from a spacy-parsed doc, optionally filtering by frequency and dropping leading determiners.
- Parameters
doc –
drop_determiners – Remove leading determiners (e.g. “the”) from phrases (e.g. “the quick brown fox” => “quick brown fox”)
min_freq – Remove chunks that occur in doc fewer than min_freq times.
- Yields
Next noun chunk from doc, in order of appearance in the document.
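The drop_determiners behavior amounts to slicing off a single leading DET token. A toy sketch on (text, pos) pairs (hypothetical stand-ins for spaCy tokens) makes the “the quick brown fox” => “quick brown fox” transformation concrete:

```python
def drop_leading_determiner(chunk):
    # chunk is a list of (text, pos) pairs; strip one leading determiner,
    # if present, and leave everything else untouched.
    return chunk[1:] if chunk and chunk[0][1] == "DET" else chunk

chunk = [("the", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN")]
texts = [text for text, _ in drop_leading_determiner(chunk)]
```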
-
textacy.extract.pos_regex_matches(doc: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span], pattern: str) → Iterable[spacy.tokens.span.Span][source]¶
Extract sequences of consecutive tokens from a spacy-parsed doc whose part-of-speech tags match the specified regex pattern.
- Parameters
doc –
pattern –
Pattern of consecutive POS tags whose corresponding words are to be extracted, inspired by the regex patterns used in NLTK’s nltk.chunk.regexp. Tags are uppercase, from the universal tag set; delimited by < and >, which are basically converted to parentheses with spaces as needed to correctly extract matching word sequences; white space in the input doesn’t matter.
Examples (see constants.POS_REGEX_PATTERNS):
noun phrase: r’<DET>? (<NOUN>+ <ADP|CONJ>)* <NOUN>+’
compound nouns: r’<NOUN>+’
verb phrase: r’<VERB>?<ADV>*<VERB>+’
prepositional phrase: r’<PREP> <DET>? (<NOUN>+<ADP>)* <NOUN>+’
- Yields
Next span of consecutive tokens from doc whose parts-of-speech match pattern, in order of appearance.
Warning
DEPRECATED! For similar but more powerful and performant functionality, use textacy.extract.matches() instead.
-
textacy.extract.matches(doc: spacy.tokens.doc.Doc, patterns: Union[str, List[str], List[Dict[str, str]], List[List[Dict[str, str]]]], *, on_match: Callable = None) → Iterable[spacy.tokens.span.Span][source]¶
Extract Spans from a Doc matching one or more patterns of per-token attr:value pairs, with optional quantity qualifiers.
- Parameters
doc –
patterns –
One or multiple patterns to match against doc using a spacy.matcher.Matcher.
If List[dict] or List[List[dict]], each pattern is specified as attr: value pairs per token, with optional quantity qualifiers:
[{"POS": "NOUN"}] matches singular or plural nouns, like “friend” or “enemies”
[{"POS": "PREP"}, {"POS": "DET", "OP": "?"}, {"POS": "ADJ", "OP": "?"}, {"POS": "NOUN", "OP": "+"}] matches prepositional phrases, like “in the future” or “from the distant past”
[{"IS_DIGIT": True}, {"TAG": "NNS"}] matches numbered plural nouns, like “60 seconds” or “2 beers”
[{"POS": "PROPN", "OP": "+"}, {}] matches proper nouns and whatever word follows them, like “Burton DeWilde yaaasss”
If str or List[str], each pattern is specified as one or more per-token patterns separated by whitespace where attribute, value, and optional quantity qualifiers are delimited by colons. Note that boolean and integer values have special syntax — “bool(val)” and “int(val)”, respectively — and that wildcard tokens still need a colon between the (empty) attribute and value strings.
"POS:NOUN" matches singular or plural nouns
"POS:PREP POS:DET:? POS:ADJ:? POS:NOUN:+" matches prepositional phrases
"IS_DIGIT:bool(True) TAG:NNS" matches numbered plural nouns
"POS:PROPN:+ :" matches proper nouns and whatever word follows them
Also note that these pattern strings don’t support spaCy v2.1’s “extended” pattern syntax; if you need such complex patterns, it’s probably better to use a List[dict] or List[List[dict]], anyway.
on_match – Callback function to act on matches. Takes the arguments matcher, doc, i, and matches.
- Yields
Next matching Span in doc, in order of appearance.
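The string-pattern syntax described above maps mechanically onto the List[dict] form. Here is a sketch of such a converter; the function name parse_pattern_string is hypothetical, and textacy's internal parser may differ:

```python
def parse_pattern_string(pattern):
    # Convert a whitespace-delimited "ATTR:value[:OP]" pattern string into
    # the List[dict] form accepted by spacy.matcher.Matcher, following the
    # syntax documented above (bool(...)/int(...) values, ":" wildcards).
    token_dicts = []
    for part in pattern.split():
        attr, value, *op = part.split(":")
        d = {}
        if attr:  # wildcard tokens have an empty attr and value, e.g. ":"
            if value.startswith("bool(") and value.endswith(")"):
                value = value[5:-1] == "True"
            elif value.startswith("int(") and value.endswith(")"):
                value = int(value[4:-1])
            d[attr] = value
        if op:
            d["OP"] = op[0]
        token_dicts.append(d)
    return token_dicts

parsed = parse_pattern_string("POS:PREP POS:DET:? POS:ADJ:? POS:NOUN:+")
```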
-
textacy.extract.subject_verb_object_triples(doc: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span]) → Iterable[Tuple[spacy.tokens.span.Span, spacy.tokens.span.Span, spacy.tokens.span.Span]][source]¶
Extract an ordered sequence of subject-verb-object (SVO) triples from a spacy-parsed doc. Note that this only works for SVO languages.
- Parameters
doc –
- Yields
Next 3-tuple of spans from doc representing a (subject, verb, object) triple, in order of appearance.
-
textacy.extract.acronyms_and_definitions(doc: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span], known_acro_defs: Optional[Dict[str, str]] = None) → Dict[str, List[str]][source]¶
Extract a collection of acronyms and their most likely definitions, if available, from a spacy-parsed doc. If multiple definitions are found for a given acronym, only the most frequently occurring definition is returned.
- Parameters
doc –
known_acro_defs – If certain acronym/definition pairs are known, pass them in as {acronym (str): definition (str)}; algorithm will not attempt to find new definitions
- Returns
Unique acronyms (keys) with matched definitions (values)
References
Taghva, Kazem, and Jeff Gilbreth. “Recognizing acronyms and their definitions.” International Journal on Document Analysis and Recognition 1.4 (1999): 191-198.
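The core idea in the Taghva & Gilbreth approach is to check whether the words just before an “(ACRO)” mention spell out the acronym by their initial letters. A toy sketch of that single check; find_definition is a hypothetical helper, far simpler than textacy's actual algorithm:

```python
def find_definition(acronym, preceding_words):
    # Return the definition candidate if the last len(acronym) words'
    # initial letters spell out the acronym (case-insensitively),
    # else None.
    n = len(acronym)
    candidate = preceding_words[-n:]
    if len(candidate) == n and all(
        w[0].upper() == c for w, c in zip(candidate, acronym.upper())
    ):
        return " ".join(candidate)
    return None

defn = find_definition("HMM", ["a", "hidden", "Markov", "model"])
# "hidden Markov model" spells H-M-M
```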
-
textacy.extract.semistructured_statements(doc: spacy.tokens.doc.Doc, entity: str, *, cue: str = 'be', ignore_entity_case: bool = True, min_n_words: int = 1, max_n_words: int = 20) → Tuple[Union[spacy.tokens.span.Span, spacy.tokens.token.Token], Union[spacy.tokens.span.Span, spacy.tokens.token.Token], spacy.tokens.span.Span][source]¶
Extract “semi-structured statements” from a spacy-parsed doc, each as an (entity, cue, fragment) triple. This is similar to subject-verb-object triples.
- Parameters
doc –
entity – a noun or noun phrase of some sort (e.g. “President Obama”, “global warming”, “Python”)
cue – verb lemma with which entity is associated (e.g. “talk about”, “have”, “write”)
ignore_entity_case – If True, entity matching is case-independent
min_n_words – Min number of tokens allowed in a matching fragment
max_n_words – Max number of tokens allowed in a matching fragment
- Yields
Next matching triple, consisting of (entity, cue, fragment).
Notes
Inspired by N. Diakopoulos, A. Zhang, A. Salway. Visual Analytics of Media Frames in Online News and Blogs. IEEE InfoVis Workshop on Text Visualization. October, 2013.
Which itself was inspired by Salway, A.; Kelly, L.; Skadiņa, I.; and Jones, G. 2010. Portable Extraction of Partially Structured Facts from the Web. In Proc. ICETAL 2010, LNAI 6233, 345-356. Heidelberg, Springer.
-
textacy.extract.direct_quotations(doc: spacy.tokens.doc.Doc) → Iterable[Tuple[spacy.tokens.span.Span, spacy.tokens.token.Token, spacy.tokens.span.Span]][source]¶
Baseline, not-great attempt at direct quotation extraction (no indirect or mixed quotations) using rules and patterns. English only.
- Parameters
doc (spacy.tokens.Doc) –
- Yields
Next quotation in doc as a (speaker, reporting verb, quotation) triple.
Notes
Loosely inspired by Krestel, Bergler, Witte. “Minding the Source: Automatic Tagging of Reported Speech in Newspaper Articles”.
TODO: Better approach would use ML, but needs a training dataset.
Keyterm Extraction¶
textrank: Extract key terms from a document using the TextRank algorithm, or a variation thereof.
yake: Extract key terms from a document using the YAKE algorithm.
scake: Extract key terms from a document using the sCAKE algorithm.
sgrank: Extract key terms from a document using the SGRank algorithm.
-
textacy.ke.textrank.textrank(doc: spacy.tokens.doc.Doc, *, normalize: Optional[Union[str, Callable[[spacy.tokens.token.Token], str]]] = 'lemma', include_pos: Optional[Union[str, Collection[str]]] = ('NOUN', 'PROPN', 'ADJ'), window_size: int = 2, edge_weighting: str = 'binary', position_bias: bool = False, topn: Union[int, float] = 10) → List[Tuple[str, float]][source]¶
Extract key terms from a document using the TextRank algorithm, or a variation thereof. For example:
TextRank: window_size=2, edge_weighting="binary", position_bias=False
SingleRank: window_size=10, edge_weighting="count", position_bias=False
PositionRank: window_size=10, edge_weighting="count", position_bias=True
- Parameters
doc – spaCy Doc from which to extract keyterms.
normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if None, use the form of terms as they appeared in doc; if a callable, must accept a Token and return a str, e.g. textacy.spacier.utils.get_normalized_text().
include_pos – One or more POS tags with which to filter for good candidate keyterms. If None, include tokens of all POS tags (which also allows keyterm extraction from docs without POS-tagging).
window_size – Size of sliding window in which term co-occurrences are determined.
edge_weighting ({"count", "binary"}) – If “count”, the nodes for all co-occurring terms are connected by edges with weight equal to the number of times they co-occurred within a sliding window; if “binary”, all such edges have weight = 1.
position_bias – If True, bias the PageRank algorithm for weighting nodes in the word graph, such that words appearing earlier and more frequently in doc tend to get larger weights.
topn – Number of top-ranked terms to return as key terms. If an integer, represents the absolute number; if a float, value must be in the interval (0.0, 1.0], which is converted to an int by int(round(len(set(candidates)) * topn)).
- Returns
Sorted list of top topn key terms and their corresponding TextRank ranking scores.
References
Mihalcea, R., & Tarau, P. (2004, July). TextRank: Bringing order into texts. Association for Computational Linguistics.
Wan, Xiaojun and Jianguo Xiao. 2008. Single document keyphrase extraction using neighborhood knowledge. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pages 855–860.
Florescu, C. and Cornelia, C. (2017). PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents. In proceedings of ACL*, pages 1105-1115.
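The topn conversion described above (a float in (0.0, 1.0] treated as a fraction of distinct candidates) can be sketched directly from the documented formula; resolve_topn is a hypothetical helper name:

```python
def resolve_topn(topn, candidates):
    # An int topn is used as-is; a float topn in (0.0, 1.0] is converted
    # via int(round(len(set(candidates)) * topn)), per the parameter docs.
    if isinstance(topn, float):
        if not 0.0 < topn <= 1.0:
            raise ValueError("float topn must be in (0.0, 1.0]")
        return int(round(len(set(candidates)) * topn))
    return topn

n_terms = resolve_topn(0.25, ["a", "b", "c", "d", "a"])
# 4 distinct candidates * 0.25, rounded
```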
-
textacy.ke.yake.yake(doc: spacy.tokens.doc.Doc, *, normalize: Optional[str] = 'lemma', ngrams: Union[int, Collection[int]] = (1, 2, 3), include_pos: Optional[Union[str, Collection[str]]] = ('NOUN', 'PROPN', 'ADJ'), window_size: int = 2, topn: Union[int, float] = 10) → List[Tuple[str, float]][source]¶
Extract key terms from a document using the YAKE algorithm.
- Parameters
doc – spaCy Doc from which to extract keyterms. Must be sentence-segmented; optionally POS-tagged.
normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if None, use the form of terms as they appeared in doc.
Note
Unlike the other keyterm extraction functions, this one doesn’t accept a callable for normalize.
ngrams – Which n-grams to consider as keyterm candidates. For example, (1, 2, 3) includes all unigrams, bigrams, and trigrams, while 2 includes bigrams only.
include_pos – One or more POS tags with which to filter for good candidate keyterms. If None, include tokens of all POS tags (which also allows keyterm extraction from docs without POS-tagging).
window_size – Number of words to the right and left of a given word to use as context when computing the “relatedness to context” component of its score. Note that the resulting sliding window’s full width is 1 + (2 * window_size).
topn – Number of top-ranked terms to return as key terms. If an integer, represents the absolute number; if a float, value must be in the interval (0.0, 1.0], which is converted to an int by int(round(len(candidates) * topn)).
- Returns
Sorted list of top topn key terms and their corresponding YAKE scores.
References
Campos, Mangaravite, Pasquali, Jorge, Nunes, and Jatowt. (2018). A Text Feature Based Automatic Keyword Extraction Method for Single Documents. Advances in Information Retrieval. ECIR 2018. Lecture Notes in Computer Science, vol 10772, pp. 684-691.
-
textacy.ke.scake.scake(doc: spacy.tokens.doc.Doc, *, normalize: Optional[Union[str, Callable[[spacy.tokens.token.Token], str]]] = 'lemma', include_pos: Optional[Union[str, Collection[str]]] = ('NOUN', 'PROPN', 'ADJ'), topn: Union[int, float] = 10) → List[Tuple[str, float]][source]¶
Extract key terms from a document using the sCAKE algorithm.
- Parameters
doc – spaCy Doc from which to extract keyterms. Must be sentence-segmented; optionally POS-tagged.
normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if None, use the form of terms as they appeared in doc; if a callable, must accept a Token and return a str, e.g. textacy.spacier.utils.get_normalized_text().
include_pos – One or more POS tags with which to filter for good candidate keyterms. If None, include tokens of all POS tags (which also allows keyterm extraction from docs without POS-tagging).
topn – Number of top-ranked terms to return as key terms. If an integer, represents the absolute number; if a float, value must be in the interval (0.0, 1.0], which is converted to an int by int(round(len(candidates) * topn)).
- Returns
Sorted list of top topn key terms and their corresponding scores.
References
Duari, Swagata & Bhatnagar, Vasudha. (2018). sCAKE: Semantic Connectivity Aware Keyword Extraction. Information Sciences. 477. https://arxiv.org/abs/1811.10831v1
-
class textacy.ke.sgrank.Candidate(text, idx, length, count)¶
- text¶ Alias for field number 0
- idx¶ Alias for field number 1
- length¶ Alias for field number 2
- count¶ Alias for field number 3
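Given the field aliases listed (field 0 = text, 1 = idx, 2 = length, 3 = count), Candidate behaves like a standard collections.namedtuple; an equivalent sketch, with illustrative values:

```python
from collections import namedtuple

# Lightweight record for one keyterm candidate; the field order matches
# the documented aliases (text=0, idx=1, length=2, count=3).
Candidate = namedtuple("Candidate", ["text", "idx", "length", "count"])

cand = Candidate(text="machine learning", idx=7, length=2, count=3)
# Fields are accessible both by name and by position.
```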
-
-
textacy.ke.sgrank.sgrank(doc: spacy.tokens.doc.Doc, *, normalize: Optional[Union[str, Callable[[spacy.tokens.span.Span], str]]] = 'lemma', ngrams: Union[int, Collection[int]] = (1, 2, 3, 4, 5, 6), include_pos: Optional[Union[str, Collection[str]]] = ('NOUN', 'PROPN', 'ADJ'), window_size: int = 1500, topn: Union[int, float] = 10, idf: Dict[str, float] = None) → List[Tuple[str, float]][source]¶
Extract key terms from a document using the SGRank algorithm.
- Parameters
doc – spaCy Doc from which to extract keyterms.
normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if None, use the form of terms as they appeared in doc; if a callable, must accept a Span and return a str, e.g. textacy.spacier.utils.get_normalized_text().
ngrams – Which n-grams to include. For example, (1, 2, 3, 4, 5, 6) (default) includes all n-grams from 1 to 6; 2 includes only bigrams.
include_pos – One or more POS tags with which to filter for good candidate keyterms. If None, include tokens of all POS tags (which also allows keyterm extraction from docs without POS-tagging).
window_size – Size of sliding window in which term co-occurrences are determined to occur. Note: Larger values may dramatically increase runtime, owing to the larger number of co-occurrence combinations that must be counted.
topn – Number of top-ranked terms to return as keyterms. If int, represents the absolute number; if float, must be in the open interval (0.0, 1.0), and is converted to an integer by int(round(len(candidates) * topn)).
idf – Mapping of normalize(term) to inverse document frequency for re-weighting of unigrams (n-grams with n > 1 have df assumed = 1). Results are typically better with idf information.
- Returns
Sorted list of top topn key terms and their corresponding SGRank scores.
- Raises
ValueError – if topn is a float but not in (0.0, 1.0] or window_size < 2.
References
Danesh, Sumner, and Martin. “SGRank: Combining Statistical and Graphical Methods to Improve the State of the Art in Unsupervised Keyphrase Extraction.” Lexical and Computational Semantics (* SEM 2015) (2015): 117.
Keyterm Extraction Utils¶
-
textacy.ke.utils.normalize_terms(terms: Union[Iterable[spacy.tokens.span.Span], Iterable[spacy.tokens.token.Token]], normalize: Optional[Union[str, Callable[[Union[spacy.tokens.span.Span, spacy.tokens.token.Token]], str]]]) → Iterable[str][source]¶
Transform a sequence of terms from spaCy Tokens or Spans into strings, normalized by normalize.
- Parameters
terms –
normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if None, use the form of terms as they appear in terms; if a callable, must accept a Token or Span and return a str, e.g. textacy.spacier.utils.get_normalized_text().
- Yields
str
-
textacy.ke.utils.aggregate_term_variants(terms: Set[str], *, acro_defs: Optional[Dict[str, str]] = None, fuzzy_dedupe: bool = True) → List[Set[str]][source]¶
Take a set of unique terms and aggregate terms that are symbolic, lexical, and ordering variants of each other, as well as acronyms and fuzzy string matches.
- Parameters
terms – Set of unique terms with potential duplicates
acro_defs – If not None, terms that are acronyms will be aggregated with their definitions and terms that are definitions will be aggregated with their acronyms
fuzzy_dedupe – If True, fuzzy string matching will be used to aggregate similar terms of a sufficient length
- Returns
Each item is a set of aggregated terms.
Notes
Partly inspired by aggregation of variants discussed in Park, Youngja, Roy J. Byrd, and Branimir K. Boguraev. “Automatic glossary extraction: beyond terminology identification.” Proceedings of the 19th international conference on Computational linguistics-Volume 1. Association for Computational Linguistics, 2002.
-
textacy.ke.utils.get_longest_subsequence_candidates(doc: spacy.tokens.doc.Doc, match_func: Callable[[spacy.tokens.token.Token], bool]) → Iterable[Tuple[spacy.tokens.token.Token, …]][source]¶
Get candidate keyterms from doc, where candidates are the longest consecutive subsequences of tokens for which match_func(token) is True.
- Parameters
doc –
match_func – Function applied sequentially to each Token in doc that returns True for matching (“good”) tokens, False otherwise.
- Yields
Next longest consecutive subsequence candidate, as a tuple of constituent tokens.
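The “longest consecutive subsequences” behavior is essentially a groupby over match_func. A pure-Python sketch on plain strings, which stand in for spaCy Tokens (not textacy's implementation):

```python
from itertools import groupby

def longest_subsequence_candidates(tokens, match_func):
    # Yield each maximal run of consecutive tokens for which match_func
    # returns True, as a tuple, in order of appearance.
    for is_match, group in groupby(tokens, key=match_func):
        if is_match:
            yield tuple(group)

runs = list(longest_subsequence_candidates(
    ["big", "red", "car", "and", "tiny", "dog"],
    lambda w: w != "and",  # hypothetical stand-in for a POS/stopword check
))
# "and" fails the check, splitting the tokens into two maximal runs
```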
-
textacy.ke.utils.get_ngram_candidates(doc: spacy.tokens.doc.Doc, ns: Union[int, Collection[int]], *, include_pos: Optional[Union[str, Collection[str]]] = ('NOUN', 'PROPN', 'ADJ')) → Iterable[Tuple[spacy.tokens.token.Token, …]][source]¶
Get candidate keyterms from doc, where candidates are n-length sequences of tokens (for all n in ns) that don’t start/end with a stop word or contain punctuation tokens, and whose constituent tokens are filtered by POS tag.
- Parameters
doc –
ns – One or more n values for which to generate n-grams. For example, 2 gets bigrams; (2, 3) gets bigrams and trigrams.
gets bigrams and trigrams.include_pos – One or more POS tags with which to filter ngrams. If None, include tokens of all POS tags.
- Yields
Next ngram candidate, as a tuple of constituent Tokens.
-
textacy.ke.utils.get_pattern_matching_candidates(doc: spacy.tokens.doc.Doc, patterns: Union[str, List[str], List[dict], List[List[dict]]]) → Iterable[Tuple[spacy.tokens.token.Token, …]][source]¶
Get candidate keyterms from doc, where candidates are sequences of tokens that match any pattern in patterns.
- Parameters
- Parameters
doc –
patterns – One or multiple patterns to match against doc using a spacy.matcher.Matcher.
- Yields
Tuple[spacy.tokens.Token] – Next pattern-matching candidate, as a tuple of constituent Tokens.
-
textacy.ke.utils.get_filtered_topn_terms(term_scores: Iterable[Tuple[str, float]], topn: int, *, match_threshold: Optional[float] = None) → List[Tuple[str, float]][source]¶
Build up a list of the topn terms, filtering out any that are substrings of better-scoring terms and, optionally, any that are sufficiently similar to better-scoring terms.
- Parameters
term_scores – Iterable of (term, score) pairs, sorted in order of score from best to worst. Note that this may be from high to low value or low to high, depending on the scoring algorithm.
topn – Maximum number of top-scoring terms to get.
match_threshold – Minimal edit distance between a term and previously seen terms, used to filter out terms that are sufficiently similar to higher-scoring terms. Uses textacy.similarity.token_sort_ratio().
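The substring-filtering part of this function can be sketched in a few lines of pure Python; the optional fuzzy match_threshold branch is omitted, and this is an illustration rather than textacy's code:

```python
def filtered_topn_terms(term_scores, topn):
    # Keep the best-scoring terms, skipping any term that is a substring
    # of an already-kept (better-scoring) term. Assumes term_scores is
    # sorted best-first, as documented.
    kept = []
    for term, score in term_scores:
        if any(term in better for better, _ in kept):
            continue
        kept.append((term, score))
        if len(kept) == topn:
            break
    return kept

top = filtered_topn_terms(
    [("machine learning", 0.9), ("learning", 0.8), ("data", 0.7)], topn=2)
# "learning" is dropped as a substring of "machine learning"
```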
-
textacy.ke.utils.most_discriminating_terms(terms_lists: Iterable[Iterable[str]], bool_array_grp1: Iterable[bool], *, max_n_terms: int = 1000, top_n_terms: Union[int, float] = 25) → Tuple[List[str], List[str]][source]¶
Given a collection of documents assigned to 1 of 2 exclusive groups, get the top_n_terms most discriminating terms for group1-and-not-group2 and group2-and-not-group1.
- Parameters
terms_lists – Sequence of documents, each as a sequence of (str) terms; used as input to doc_term_matrix().
bool_array_grp1 – Ordered sequence of True/False values, where True corresponds to documents falling into “group 1” and False corresponds to those in “group 2”.
max_n_terms – Only consider terms whose document frequency is within the top max_n_terms out of all distinct terms; must be > 0.
top_n_terms – If int (must be > 0), the total number of most discriminating terms to return for each group; if float (must be in the interval (0, 1)), the fraction of max_n_terms to return for each group.
- Returns
List of the top top_n_terms most discriminating terms for grp1-not-grp2, and list of the top top_n_terms most discriminating terms for grp2-not-grp1.
References
King, Gary, Patrick Lam, and Margaret Roberts. “Computer-Assisted Keyword and Document Set Discovery from Unstructured Text.” (2014). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.458.1445&rep=rep1&type=pdf