Information Extraction

- words: Extract an ordered sequence of words from a document processed by spaCy, optionally filtering words by part-of-speech tag and frequency.
- ngrams: Extract an ordered sequence of n-grams (n consecutive tokens) from a spaCy Doc or Span, optionally filtering by the types and parts-of-speech of the constituent tokens.
- entities: Extract an ordered sequence of named entities (PERSON, ORG, LOC, etc.) from a Doc, optionally filtering by entity types and frequencies.
- noun_chunks: Extract an ordered sequence of noun chunks from a spacy-parsed doc, optionally filtering by frequency and dropping leading determiners.
- terms: Extract one or multiple types of terms – ngrams, entities, and/or noun chunks – from a document as a single, concatenated collection.
- token_matches: Extract spans from a document or sentence matching one or more patterns of per-token attr:value pairs.
- regex_matches: Extract spans from a document or sentence whose full texts match against a regular expression pattern.
- subject_verb_object_triples: Extract an ordered sequence of subject-verb-object triples from a document or sentence.
- semistructured_statements: Extract “semi-structured statements” from a document as a sequence of (entity, cue, fragment) triples.
- direct_quotations: Extract direct quotations with an attributable speaker from a document using simple rules and patterns.
- acronyms: Extract tokens whose text is “acronym-like” from a document or sentence, in order of appearance.
- acronyms_and_definitions: Extract a collection of acronyms and their most likely definitions, if available, from a spacy-parsed doc.
- keyword_in_context: Search for keyword matches in a document via regular expression, yielding matches along with their surrounding contexts.
- textrank: Extract key terms from a document using the TextRank algorithm, or a variation thereof.
- yake: Extract key terms from a document using the YAKE algorithm.
- scake: Extract key terms from a document using the sCAKE algorithm.
- sgrank: Extract key terms from a document using the SGRank algorithm.
Basics

textacy.extract.basics: Extract basic components from a document or sentence via spaCy, with bells and whistles for filtering the results.
-
textacy.extract.basics.words(doclike: types.DocLike, *, filter_stops: bool = True, filter_punct: bool = True, filter_nums: bool = False, include_pos: Optional[str | Collection[str]] = None, exclude_pos: Optional[str | Collection[str]] = None, min_freq: int = 1) → Iterable[Token]

Extract an ordered sequence of words from a document processed by spaCy, optionally filtering words by part-of-speech tag and frequency.

- Parameters
doclike –
filter_stops – If True, remove stop words from the word list.
filter_punct – If True, remove punctuation from the word list.
filter_nums – If True, remove number-like words (e.g. 10, “ten”) from the word list.
include_pos – Remove words whose part-of-speech tag IS NOT in the specified tags.
exclude_pos – Remove words whose part-of-speech tag IS in the specified tags.
min_freq – Remove words that occur in doclike fewer than min_freq times.
- Yields
Next token from doclike passing all specified filters, in order of appearance in the document.
- Raises
TypeError – if include_pos or exclude_pos is not a str, a set of str, or a falsy value

Note
Filtering by part-of-speech tag uses the universal POS tag set; for details, check spaCy’s docs: https://spacy.io/api/annotation#pos-tagging
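The filtering behavior described above can be sketched in plain Python over a list of token-like objects (a rough standalone sketch, not textacy's implementation; the Tok stand-in and its attributes mirror spaCy token attributes):

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy Token, carrying only what the filters need.
Tok = namedtuple("Tok", ["text", "pos", "is_stop", "is_punct", "like_num"])

def words_sketch(tokens, *, filter_stops=True, filter_punct=True,
                 filter_nums=False, include_pos=None, exclude_pos=None, min_freq=1):
    # A single POS tag may be given as a str; coerce to a set either way.
    if isinstance(include_pos, str):
        include_pos = {include_pos}
    if isinstance(exclude_pos, str):
        exclude_pos = {exclude_pos}
    freqs = Counter(t.text.lower() for t in tokens)
    for t in tokens:
        if filter_stops and t.is_stop:
            continue
        if filter_punct and t.is_punct:
            continue
        if filter_nums and t.like_num:
            continue
        if include_pos and t.pos not in include_pos:
            continue
        if exclude_pos and t.pos in exclude_pos:
            continue
        if freqs[t.text.lower()] < min_freq:
            continue
        yield t.text
```

Tokens are yielded lazily and in document order, matching the generator semantics of the real function.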
-
textacy.extract.basics.ngrams(doclike: types.DocLike, n: int | Collection[int], *, filter_stops: bool = True, filter_punct: bool = True, filter_nums: bool = False, include_pos: Optional[str | Collection[str]] = None, exclude_pos: Optional[str | Collection[str]] = None, min_freq: int = 1) → Iterable[Span]

Extract an ordered sequence of n-grams (n consecutive tokens) from a spaCy Doc or Span, for one or multiple n values, optionally filtering n-grams by the types and parts-of-speech of the constituent tokens.

- Parameters
doclike –
n – Number of tokens included per n-gram; for example, 2 yields bigrams and 3 yields trigrams. If multiple values are specified, then the collections of n-grams are concatenated together; for example, (2, 3) yields bigrams and then trigrams.
filter_stops – If True, remove ngrams that start or end with a stop word.
filter_punct – If True, remove ngrams that contain any punctuation-only tokens.
filter_nums – If True, remove ngrams that contain any numbers or number-like tokens (e.g. 10, ‘ten’).
include_pos – Remove ngrams if any constituent tokens’ part-of-speech tags ARE NOT included in this param.
exclude_pos – Remove ngrams if any constituent tokens’ part-of-speech tags ARE included in this param.
min_freq – Remove ngrams that occur in doclike fewer than min_freq times.
- Yields
Next ngram from doclike passing all specified filters, in order of appearance in the document.
- Raises
ValueError – if any n < 1
TypeError – if include_pos or exclude_pos is not a str, a set of str, or a falsy value

Note
Filtering by part-of-speech tag uses the universal POS tag set; for details, check spaCy’s docs: https://spacy.io/api/annotation#pos-tagging
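The core enumeration — all n-grams for one or multiple n values, concatenated per n — can be sketched over any token sequence (a simplified sketch without the filtering options):

```python
def ngrams_sketch(tokens, n):
    """Yield n-grams (as tuples of tokens) for one or multiple n values,
    concatenating the per-n collections in order, as described above."""
    ns = (n,) if isinstance(n, int) else tuple(n)
    if any(n_ < 1 for n_ in ns):
        raise ValueError("all n values must be >= 1")
    for n_ in ns:
        # A sequence of length L has L - n + 1 n-grams.
        for i in range(len(tokens) - n_ + 1):
            yield tuple(tokens[i : i + n_])
```

With n=(2, 3), all bigrams are yielded before any trigrams, matching the "concatenated together" behavior documented above.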
-
textacy.extract.basics.entities(doclike: types.DocLike, *, include_types: Optional[str | Collection[str]] = None, exclude_types: Optional[str | Collection[str]] = None, drop_determiners: bool = True, min_freq: int = 1) → Iterable[Span]

Extract an ordered sequence of named entities (PERSON, ORG, LOC, etc.) from a Doc, optionally filtering by entity types and frequencies.

- Parameters
doclike –
include_types – Remove entities whose type IS NOT in this param; if “NUMERIC”, all numeric entity types (“DATE”, “MONEY”, “ORDINAL”, etc.) are included
exclude_types – Remove entities whose type IS in this param; if “NUMERIC”, all numeric entity types (“DATE”, “MONEY”, “ORDINAL”, etc.) are excluded
drop_determiners – Remove leading determiners (e.g. “the”) from entities (e.g. “the United States” => “United States”).

Note
Entities from which a leading determiner has been removed are, effectively, new entities, and not saved to the Doc from which they came. This is irritating but unavoidable, since this function is not meant to have side-effects on document state. If you’re only using the text of the returned spans, this is no big deal, but watch out if you’re counting on determiner-less entities associated with the doc downstream.

min_freq – Remove entities that occur in doclike fewer than min_freq times
- Yields
Next entity from doclike passing all specified filters, in order of appearance in the document
- Raises
TypeError – if include_types or exclude_types is not a str, a set of str, or a falsy value
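The type filtering, including expansion of the "NUMERIC" shorthand, can be sketched over (text, label) pairs (the exact membership of the numeric-types set here is an assumption, not textacy's definitive list):

```python
# Illustrative set of numeric entity types; exact membership is an assumption.
NUMERIC_TYPES = {"DATE", "TIME", "MONEY", "QUANTITY", "PERCENT", "ORDINAL", "CARDINAL"}

def filter_entities(ents, *, include_types=None, exclude_types=None):
    """Filter (text, label) entity pairs by type, expanding "NUMERIC"."""
    def expand(types):
        if not types:
            return None
        types = {types} if isinstance(types, str) else set(types)
        if "NUMERIC" in types:
            types = (types - {"NUMERIC"}) | NUMERIC_TYPES
        return types
    inc, exc = expand(include_types), expand(exclude_types)
    for text, label in ents:
        if inc and label not in inc:
            continue
        if exc and label in exc:
            continue
        yield text, label
```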
-
textacy.extract.basics.noun_chunks(doclike: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span], *, drop_determiners: bool = True, min_freq: int = 1) → Iterable[spacy.tokens.span.Span]

Extract an ordered sequence of noun chunks from a spacy-parsed doc, optionally filtering by frequency and dropping leading determiners.

- Parameters
doclike –
drop_determiners – Remove leading determiners (e.g. “the”) from phrases (e.g. “the quick brown fox” => “quick brown fox”)
min_freq – Remove chunks that occur in doclike fewer than min_freq times
- Yields
Next noun chunk from doclike, in order of appearance in the document
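The min_freq filter shared by words, ngrams, entities, and noun_chunks amounts to a count-then-keep pass; a standalone sketch (case-insensitive matching here is an assumption):

```python
from collections import Counter

def filter_min_freq(phrases, min_freq=1):
    """Keep phrases occurring at least min_freq times, preserving order.
    Matching on lowercased text is an assumption of this sketch."""
    freqs = Counter(p.lower() for p in phrases)
    return [p for p in phrases if freqs[p.lower()] >= min_freq]
```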
-
textacy.extract.basics.terms(doclike: types.DocLike, *, ngs: Optional[int | Collection[int] | types.DocLikeToSpans] = None, ents: Optional[bool | types.DocLikeToSpans] = None, ncs: Optional[bool | types.DocLikeToSpans] = None, dedupe: bool = True) → Iterable[Span]

Extract one or multiple types of terms – ngrams, entities, and/or noun chunks – from doclike as a single, concatenated collection, with optional deduplication of spans extracted by more than one type.

>>> extract.terms(doc, ngs=2, ents=True, ncs=True)
>>> extract.terms(doc, ngs=lambda doc: extract.ngrams(doc, n=2))
>>> extract.terms(doc, ents=extract.entities)
>>> extract.terms(doc, ents=partial(extract.entities, include_types="PERSON"))

- Parameters
doclike –
ngs – N-gram terms to be extracted. If one or multiple ints, textacy.extract.ngrams(doclike, n=ngs) is used to extract terms; if a callable, ngs(doclike) is used to extract terms; if None, no n-gram terms are extracted.
ents – Entity terms to be extracted. If True, textacy.extract.entities(doclike) is used to extract terms; if a callable, ents(doclike) is used to extract terms; if None, no entity terms are extracted.
ncs – Noun chunk terms to be extracted. If True, textacy.extract.noun_chunks(doclike) is used to extract terms; if a callable, ncs(doclike) is used to extract terms; if None, no noun chunk terms are extracted.
dedupe – If True, deduplicate terms whose spans are extracted by multiple types (e.g. a span that is both an n-gram and an entity), as identified by identical (start, stop) indexes in doclike; otherwise, don’t.
- Yields
Next term from doclike, in order of n-grams then entities then noun chunks, with each collection’s terms given in order of appearance.

Note
This function is not to be confused with keyterm extraction, which leverages statistics and algorithms to quantify the “key”-ness of terms before returning the top-ranking terms. There is no such scoring or ranking here.

See also
textacy.extract.ngrams()
textacy.extract.entities()
textacy.extract.noun_chunks()
textacy.extract.keyterms
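The dedupe=True behavior — dropping later spans with identical (start, stop) indexes — can be sketched over (start, stop, text) tuples:

```python
def dedupe_spans(spans):
    """Yield spans in order, dropping any whose (start, stop) indexes were
    already seen -- the dedupe=True behavior described above."""
    seen = set()
    for start, stop, text in spans:
        if (start, stop) in seen:
            continue
        seen.add((start, stop))
        yield start, stop, text
```

Note that deduplication keys on token indexes, not text, so two different spans that happen to share the same text are both kept.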
Matches

textacy.extract.matches: Extract matching spans from a document or sentence using spaCy’s built-in matcher or regular expressions.

-
textacy.extract.matches.token_matches(doclike: types.DocLike, patterns: str | List[str] | List[Dict[str, str]] | List[List[Dict[str, str]]], *, on_match: Optional[Callable] = None) → Iterable[Span]

Extract Spans from a document or sentence matching one or more patterns of per-token attr:value pairs, with optional quantity qualifiers.

- Parameters
doclike –
patterns – One or multiple patterns to match against doclike using a spacy.matcher.Matcher.

If List[dict] or List[List[dict]], each pattern is specified as attr:value pairs per token, with optional quantity qualifiers:

[{"POS": "NOUN"}] matches singular or plural nouns, like “friend” or “enemies”
[{"POS": "PREP"}, {"POS": "DET", "OP": "?"}, {"POS": "ADJ", "OP": "?"}, {"POS": "NOUN", "OP": "+"}] matches prepositional phrases, like “in the future” or “from the distant past”
[{"IS_DIGIT": True}, {"TAG": "NNS"}] matches numbered plural nouns, like “60 seconds” or “2 beers”
[{"POS": "PROPN", "OP": "+"}, {}] matches proper nouns and whatever word follows them, like “Burton DeWilde yaaasss”

If str or List[str], each pattern is specified as one or more per-token patterns separated by whitespace, where attribute, value, and optional quantity qualifiers are delimited by colons. Note that boolean and integer values have special syntax — “bool(val)” and “int(val)”, respectively — and that wildcard tokens still need a colon between the (empty) attribute and value strings.

"POS:NOUN" matches singular or plural nouns
"POS:PREP POS:DET:? POS:ADJ:? POS:NOUN:+" matches prepositional phrases
"IS_DIGIT:bool(True) TAG:NNS" matches numbered plural nouns
"POS:PROPN:+ :" matches proper nouns and whatever word follows them

Also note that these pattern strings don’t support spaCy v2.1’s “extended” pattern syntax; if you need such complex patterns, it’s probably better to use a List[dict] or List[List[dict]], anyway.

on_match – Callback function to act on matches. Takes the arguments matcher, doclike, i and matches.
- Yields
Next matching Span in doclike, in order of appearance
- Raises
TypeError – if patterns is not a str, List[str], List[dict], or List[List[dict]]
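The colon-delimited pattern-string syntax above can be converted into the equivalent list-of-dicts form roughly like this (a sketch of the conversion, not textacy's own parser):

```python
def parse_pattern_string(pattern):
    """Convert a whitespace-separated, colon-delimited token-pattern string
    into a list of per-token dicts usable with spacy.matcher.Matcher."""
    def coerce(val):
        # Handle the special "bool(val)" and "int(val)" value syntax.
        if val.startswith("bool(") and val.endswith(")"):
            return val[5:-1] == "True"
        if val.startswith("int(") and val.endswith(")"):
            return int(val[4:-1])
        return val
    tok_patterns = []
    for tok in pattern.split():
        parts = tok.split(":")
        attr, val = parts[0], parts[1]
        # An empty attribute (the bare ":" token) is a wildcard token.
        d = {} if attr == "" else {attr: coerce(val)}
        if len(parts) == 3:  # optional quantity qualifier, e.g. "?", "+", "*"
            d["OP"] = parts[2]
        tok_patterns.append(d)
    return tok_patterns
```

For example, the string "POS:PROPN:+ :" from the list above parses to [{"POS": "PROPN", "OP": "+"}, {}].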
-
textacy.extract.matches.regex_matches(doclike: types.DocLike, pattern: str | Pattern, *, alignment_mode: str = 'strict') → Iterable[Span]

Extract Spans from a document or sentence whose full texts match against a regular expression pattern.

- Parameters
doclike –
pattern – Valid regular expression against which to match document text, either as a string or compiled pattern object.
alignment_mode – How character indices of regex matches snap to spaCy token boundaries. If “strict”, only exact alignments are included (no snapping); if “contract”, tokens completely within the character span are included; if “expand”, tokens at least partially covered by the character span are included.
- Yields
Next matching Span.
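The three alignment modes can be illustrated by snapping a character span to a list of per-token character offsets (a simplified sketch of the snapping logic, not spaCy's implementation):

```python
def snap_to_tokens(char_span, token_bounds, mode="strict"):
    """Snap a (start, end) character span to token boundaries. token_bounds
    is a list of (start, end) character offsets, one per token. Returns a
    token-index range (i, j) covering tokens i..j-1, or None if no
    alignment exists under the given mode."""
    start, end = char_span
    if mode == "strict":
        starts = [s for s, _ in token_bounds]
        ends = [e for _, e in token_bounds]
        if start in starts and end in ends:
            return starts.index(start), ends.index(end) + 1
        return None
    if mode == "contract":
        # only tokens completely within the character span
        idxs = [i for i, (s, e) in enumerate(token_bounds) if s >= start and e <= end]
    elif mode == "expand":
        # tokens at least partially covered by the character span
        idxs = [i for i, (s, e) in enumerate(token_bounds) if s < end and e > start]
    else:
        raise ValueError(f"unknown alignment_mode: {mode!r}")
    return (idxs[0], idxs[-1] + 1) if idxs else None
```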
Triples

textacy.extract.triples: Extract structured triples from a document or sentence through rule-based pattern-matching of the annotated tokens.

-
class textacy.extract.triples.SVOTriple(subject, verb, object)
subject – Alias for field number 0
verb – Alias for field number 1
object – Alias for field number 2

-
class textacy.extract.triples.SSSTriple(entity, cue, fragment)
entity – Alias for field number 0
cue – Alias for field number 1
fragment – Alias for field number 2

-
class textacy.extract.triples.DQTriple(speaker, cue, content)
speaker – Alias for field number 0
cue – Alias for field number 1
content – Alias for field number 2
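"Alias for field number N" indicates these triple classes are namedtuples; equivalent definitions (inferred from the documented fields and their order) and the tuple/attribute duality they provide:

```python
from collections import namedtuple

# Equivalent definitions, inferred from the documented fields.
SVOTriple = namedtuple("SVOTriple", ["subject", "verb", "object"])
SSSTriple = namedtuple("SSSTriple", ["entity", "cue", "fragment"])
DQTriple = namedtuple("DQTriple", ["speaker", "cue", "content"])

# Fields are accessible both by name and by positional index.
t = DQTriple(speaker="Smith", cue="said", content="We agree.")
```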
-
textacy.extract.triples.subject_verb_object_triples(doclike: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span]) → Iterable[textacy.extract.triples.SVOTriple]

Extract an ordered sequence of subject-verb-object triples from a document or sentence.

- Parameters
doclike –
- Yields
Next SVO triple as (subject, verb, object), in approximate order of appearance.
-
textacy.extract.triples.semistructured_statements(doclike: types.DocLike, *, entity: str | Pattern, cue: str, fragment_len_range: Optional[Tuple[Optional[int], Optional[int]]] = None) → Iterable[SSSTriple]

Extract “semi-structured statements” from a document as a sequence of (entity, cue, fragment) triples.

- Parameters
doclike –
entity – Noun or noun phrase of interest expressed as a regular expression pattern string (e.g. "[Gg]lobal [Ww]arming") or compiled object (e.g. re.compile("global warming", re.IGNORECASE)).
cue – Verb lemma with which entity is associated (e.g. “be”, “have”, “say”).
fragment_len_range – Filter statements to those whose fragment length in tokens is within the specified [low, high) interval. Both low and high values must be specified, but a null value for either is automatically replaced by safe default values. None (default) skips filtering by fragment length.
- Yields
Next matching triple, consisting of (entity, cue, fragment), in order of appearance.

Notes
Inspired by N. Diakopoulos, A. Zhang, A. Salway. Visual Analytics of Media Frames in Online News and Blogs. IEEE InfoVis Workshop on Text Visualization. October, 2013.
Which itself was inspired by Salway, A.; Kelly, L.; Skadiņa, I.; and Jones, G. 2010. Portable Extraction of Partially Structured Facts from the Web. In Proc. ICETAL 2010, LNAI 6233, 345-356. Heidelberg, Springer.
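The half-open [low, high) interval check for fragment_len_range can be sketched as follows (the substituted default bounds here are assumptions, not textacy's actual values):

```python
def in_fragment_len_range(fragment_tokens, len_range=None):
    """Check a fragment's token length against the half-open [low, high)
    interval, substituting permissive defaults for null bounds."""
    if len_range is None:
        return True
    low, high = len_range
    low = 1 if low is None else low                 # assumed default lower bound
    high = float("inf") if high is None else high   # assumed default upper bound
    return low <= len(fragment_tokens) < high
```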
-
textacy.extract.triples.direct_quotations(doc: spacy.tokens.doc.Doc) → Iterable[textacy.extract.triples.DQTriple]

Extract direct quotations with an attributable speaker from a document using simple rules and patterns. Does not extract indirect or mixed quotations!

- Parameters
doc –
- Yields
Next direct quotation in doc as a (speaker, cue, content) triple.

Notes
Loosely inspired by Krestel, Bergler, Witte. “Minding the Source: Automatic Tagging of Reported Speech in Newspaper Articles”.
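To make the "simple rules and patterns" idea concrete, here is a deliberately minimal regex-based sketch that handles only one quotation shape (quote, then cue verb, then capitalized speaker); textacy's actual rules cover many more quote characters, cue verbs, and speaker positions:

```python
import re

# One narrow quotation shape: "content" [,] said/says Speaker
QUOTE_RE = re.compile(r'"([^"]+)"\s*,?\s*(said|says)\s+([A-Z]\w+)')

def direct_quotations_sketch(text):
    """Yield (speaker, cue, content) triples for the single pattern above."""
    for m in QUOTE_RE.finditer(text):
        content, cue, speaker = m.group(1), m.group(2), m.group(3)
        yield (speaker, cue, content)
```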
Acronyms

textacy.extract.acros: Extract acronyms and their definitions from a document or sentence through rule-based pattern-matching of the annotated tokens.

-
textacy.extract.acros.acronyms(doclike: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span]) → Iterable[spacy.tokens.token.Token]

Extract tokens whose text is “acronym-like” from a document or sentence, in order of appearance.

- Parameters
doclike –
- Yields
Next acronym-like Token.

-
textacy.extract.acros.acronyms_and_definitions(doclike: Union[spacy.tokens.doc.Doc, spacy.tokens.span.Span], known_acro_defs: Optional[Dict[str, str]] = None) → Dict[str, List[str]]

Extract a collection of acronyms and their most likely definitions, if available, from a spacy-parsed doc. If multiple definitions are found for a given acronym, only the most frequently occurring definition is returned.

- Parameters
doclike –
known_acro_defs – If certain acronym/definition pairs are known, pass them in as {acronym (str): definition (str)}; the algorithm will not attempt to find new definitions
- Returns
Unique acronyms (keys) with matched definitions (values)

References
Taghva, Kazem, and Jeff Gilbreth. “Recognizing acronyms and their definitions.” International Journal on Document Analysis and Recognition 1.4 (1999): 191-198.

-
textacy.extract.acros.is_acronym(token: str, exclude: Optional[Set[str]] = None) → bool

Pass a single token as a string; return True if it is a valid acronym, else False.

- Parameters
token – Single word to check for acronym-ness
exclude – If technically valid but not actual acronyms are known in advance, pass them in as a set of strings; matching tokens will return False.
- Returns
Whether or not token is an acronym.
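A rough approximation of the checks such a validator might perform; the specific rules below (length bounds, allowed characters) are assumptions of this sketch, and textacy's actual rules differ:

```python
def is_acronym_sketch(token, exclude=None):
    """Heuristic acronym check: 2-10 characters, starts with a letter, only
    uppercase letters/digits/periods, and contains at least one uppercase
    letter. An exclude set short-circuits known false positives."""
    if exclude and token in exclude:
        return False
    if not 2 <= len(token) <= 10:
        return False
    if not token[0].isalpha():
        return False
    if not all(c.isupper() or c.isdigit() or c == "." for c in token):
        return False
    return any(c.isupper() for c in token)
```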
KWIC

textacy.extract.kwic: Extract keywords with their surrounding contexts from a text document using regular expressions.

-
textacy.extract.kwic.keyword_in_context(doc: Doc | str, keyword: str | Pattern, *, ignore_case: bool = True, window_width: int = 50, pad_context: bool = False) → Iterable[Tuple[str, str, str]]

Search for keyword matches in doc via regular expression and yield matches along with window_width characters of context before and after each occurrence.

- Parameters
doc – spaCy Doc or raw text in which to search for keyword. If a Doc, constituent text is grabbed via spacy.tokens.Doc.text. Note that spaCy annotations aren’t used at all here; they’re just a convenient owner of document text.
keyword – String or regular expression pattern defining the keyword(s) to match. Typically, this is a single word or short phrase (“spam”, “spam and eggs”), but to account for variations, use regex (r"[Ss]pam (and|&) [Ee]ggs?"), optionally compiled (re.compile(r"[Ss]pam (and|&) [Ee]ggs?")).
ignore_case – If True, ignore letter case in keyword matching; otherwise, use case-sensitive matching. Note that this argument is only used if keyword is a string; for pre-compiled regular expressions, the re.IGNORECASE flag is left as-is.
window_width – Number of characters on either side of keyword to include as “context”.
pad_context – If True, pad pre- and post-context strings to window_width chars in length; otherwise, use as many chars as are found in the text, up to the specified width.
- Yields
Next matching triple of (pre-context, keyword match, post-context).
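Keyword-in-context over raw text is a thin wrapper around regex search plus string slicing; a standalone sketch following the behavior described above (not textacy's implementation):

```python
import re

def kwic_sketch(text, keyword, *, ignore_case=True, window_width=50, pad_context=False):
    """Yield (pre-context, keyword match, post-context) triples for each
    regex match of keyword in text."""
    flags = re.IGNORECASE if (ignore_case and isinstance(keyword, str)) else 0
    pattern = re.compile(keyword, flags) if isinstance(keyword, str) else keyword
    for m in pattern.finditer(text):
        pre = text[max(0, m.start() - window_width) : m.start()]
        post = text[m.end() : m.end() + window_width]
        if pad_context:
            # Pad with spaces so every context string is exactly window_width chars.
            pre = pre.rjust(window_width)
            post = post.ljust(window_width)
        yield (pre, m.group(), post)
```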
Keyterms

textacy.extract.keyterms: Extract keyterms from documents using a variety of rule-based algorithms.

-
textacy.extract.keyterms.textrank.textrank(doc: Doc, *, normalize: Optional[str | Callable[[Token], str]] = 'lemma', include_pos: Optional[str | Collection[str]] = ('NOUN', 'PROPN', 'ADJ'), window_size: int = 2, edge_weighting: str = 'binary', position_bias: bool = False, topn: int | float = 10) → List[Tuple[str, float]]

Extract key terms from a document using the TextRank algorithm, or a variation thereof. For example:

TextRank: window_size=2, edge_weighting="binary", position_bias=False
SingleRank: window_size=10, edge_weighting="count", position_bias=False
PositionRank: window_size=10, edge_weighting="count", position_bias=True

- Parameters
doc – spaCy Doc from which to extract keyterms.
normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if None, use the form of terms as they appeared in doc; if a callable, must accept a Token and return a str, e.g. textacy.spacier.utils.get_normalized_text().
include_pos – One or more POS tags with which to filter for good candidate keyterms. If None, include tokens of all POS tags (which also allows keyterm extraction from docs without POS-tagging).
window_size – Size of sliding window in which term co-occurrences are determined.
edge_weighting ({"count", "binary"}) – If “count”, the nodes for all co-occurring terms are connected by edges with weight equal to the number of times they co-occurred within a sliding window; if “binary”, all such edges have weight = 1.
position_bias – If True, bias the PageRank algorithm for weighting nodes in the word graph, such that words appearing earlier and more frequently in doc tend to get larger weights.
topn – Number of top-ranked terms to return as key terms. If an integer, represents the absolute number; if a float, value must be in the interval (0.0, 1.0], which is converted to an int by int(round(len(set(candidates)) * topn)).
- Returns
Sorted list of top topn key terms and their corresponding TextRank ranking scores.

References
Mihalcea, R., & Tarau, P. (2004, July). TextRank: Bringing order into texts. Association for Computational Linguistics.
Wan, Xiaojun and Jianguo Xiao. 2008. Single document keyphrase extraction using neighborhood knowledge. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pages 855–860.
Florescu, C. and Caragea, C. (2017). PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents. In Proceedings of ACL 2017, pages 1105-1115.
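The window_size and edge_weighting parameters govern how the word graph's edges are built before PageRank is run over it; a standalone sketch of that edge-building step (not textacy's implementation, and omitting the ranking itself):

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_edges(words, window_size=2, edge_weighting="binary"):
    """Slide a window of window_size words over the sequence and connect each
    co-occurring pair. "count" accumulates co-occurrence counts as edge
    weights; "binary" caps every edge weight at 1."""
    weights = defaultdict(int)
    for i in range(len(words) - window_size + 1):
        window = words[i : i + window_size]
        for w1, w2 in combinations(window, 2):
            edge = tuple(sorted((w1, w2)))  # undirected edge
            if edge_weighting == "count":
                weights[edge] += 1
            else:  # "binary"
                weights[edge] = 1
    return dict(weights)
```

With window_size=2 and edge_weighting="binary" this matches the classic TextRank setup; widening the window and switching to "count" moves toward the SingleRank/PositionRank variants listed above.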
-
textacy.extract.keyterms.yake.yake(doc: Doc, *, normalize: Optional[str] = 'lemma', ngrams: int | Collection[int] = (1, 2, 3), include_pos: Optional[str | Collection[str]] = ('NOUN', 'PROPN', 'ADJ'), window_size: int = 2, topn: int | float = 10) → List[Tuple[str, float]]

Extract key terms from a document using the YAKE algorithm.

- Parameters
doc – spaCy Doc from which to extract keyterms. Must be sentence-segmented; optionally POS-tagged.
normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if None, use the form of terms as they appeared in doc.

Note
Unlike the other keyterm extraction functions, this one doesn’t accept a callable for normalize.

ngrams – The n values of the n-grams to consider as keyterm candidates. For example, (1, 2, 3) includes all unigrams, bigrams, and trigrams, while 2 includes bigrams only.
include_pos – One or more POS tags with which to filter for good candidate keyterms. If None, include tokens of all POS tags (which also allows keyterm extraction from docs without POS-tagging).
window_size – Number of words to the right and left of a given word to use as context when computing the “relatedness to context” component of its score. Note that the resulting sliding window’s full width is 1 + (2 * window_size).
topn – Number of top-ranked terms to return as key terms. If an integer, represents the absolute number; if a float, value must be in the interval (0.0, 1.0], which is converted to an int by int(round(len(candidates) * topn)).
- Returns
Sorted list of top topn key terms and their corresponding YAKE scores.

References
Campos, Mangaravite, Pasquali, Jorge, Nunes, and Jatowt. (2018). A Text Feature Based Automatic Keyword Extraction Method for Single Documents. Advances in Information Retrieval. ECIR 2018. Lecture Notes in Computer Science, vol 10772, pp. 684-691.
-
textacy.extract.keyterms.scake.scake(doc: Doc, *, normalize: Optional[str | Callable[[Token], str]] = 'lemma', include_pos: Optional[str | Collection[str]] = ('NOUN', 'PROPN', 'ADJ'), topn: int | float = 10) → List[Tuple[str, float]]

Extract key terms from a document using the sCAKE algorithm.

- Parameters
doc – spaCy Doc from which to extract keyterms. Must be sentence-segmented; optionally POS-tagged.
normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if None, use the form of terms as they appeared in doc; if a callable, must accept a Token and return a str, e.g. textacy.spacier.utils.get_normalized_text().
include_pos – One or more POS tags with which to filter for good candidate keyterms. If None, include tokens of all POS tags (which also allows keyterm extraction from docs without POS-tagging).
topn – Number of top-ranked terms to return as key terms. If an integer, represents the absolute number; if a float, value must be in the interval (0.0, 1.0], which is converted to an int by int(round(len(candidates) * topn)).
- Returns
Sorted list of top topn key terms and their corresponding scores.

References
Duari, Swagata & Bhatnagar, Vasudha. (2018). sCAKE: Semantic Connectivity Aware Keyword Extraction. Information Sciences. 477. https://arxiv.org/abs/1811.10831v1
-
class textacy.extract.keyterms.sgrank.Candidate(text, idx, length, count)
text – Alias for field number 0
idx – Alias for field number 1
length – Alias for field number 2
count – Alias for field number 3

-
textacy.extract.keyterms.sgrank.sgrank(doc: Doc, *, normalize: Optional[str | Callable[[Span], str]] = 'lemma', ngrams: int | Collection[int] = (1, 2, 3, 4, 5, 6), include_pos: Optional[str | Collection[str]] = ('NOUN', 'PROPN', 'ADJ'), window_size: int = 1500, topn: int | float = 10, idf: Dict[str, float] = None) → List[Tuple[str, float]]

Extract key terms from a document using the SGRank algorithm.

- Parameters
doc – spaCy Doc from which to extract keyterms.
normalize – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if None, use the form of terms as they appeared in doc; if a callable, must accept a Span and return a str, e.g. textacy.spacier.utils.get_normalized_text().
ngrams – The n values of the n-grams to include. For example, (1, 2, 3, 4, 5, 6) (default) includes all ngrams from 1 to 6; 2 if only bigrams are wanted.
include_pos – One or more POS tags with which to filter for good candidate keyterms. If None, include tokens of all POS tags (which also allows keyterm extraction from docs without POS-tagging).
window_size – Size of sliding window in which term co-occurrences are determined to occur. Note: Larger values may dramatically increase runtime, owing to the larger number of co-occurrence combinations that must be counted.
topn – Number of top-ranked terms to return as keyterms. If an integer, represents the absolute number; if a float, value must be in the interval (0.0, 1.0], which is converted to an int by int(round(len(candidates) * topn)).
idf – Mapping of normalize(term) to inverse document frequency for re-weighting of unigrams (n-grams with n > 1 have df assumed = 1). Results are typically better with idf information.
- Returns
Sorted list of top topn key terms and their corresponding SGRank scores
- Raises
ValueError – if topn is a float but not in (0.0, 1.0] or window_size < 2

References
Danesh, Sumner, and Martin. “SGRank: Combining Statistical and Graphical Methods to Improve the State of the Art in Unsupervised Keyphrase Extraction.” Lexical and Computational Semantics (*SEM 2015) (2015): 117.
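All of the keyterm extractors above share the same topn convention: an int is taken as an absolute count, while a float must fall in (0.0, 1.0] and is converted to a count. A sketch of that shared resolution step:

```python
def resolve_topn(topn, n_candidates):
    """Resolve the topn parameter shared by the keyterm extractors: an int is
    an absolute count; a float must be in (0.0, 1.0] and is converted via
    int(round(n_candidates * topn))."""
    if isinstance(topn, float):
        if not 0.0 < topn <= 1.0:
            raise ValueError(f"topn={topn} is invalid; float values must be in (0.0, 1.0]")
        return int(round(n_candidates * topn))
    return topn
```

So with 200 candidates, topn=10 returns 10 terms and topn=0.25 returns 50.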