Vectorization and Topic Modeling
Vectorizer – Transform one or more tokenized documents into a sparse document-term matrix of shape (# docs, # unique terms), with flexibly weighted and normalized values.
GroupVectorizer – Transform one or more tokenized documents into a group-term matrix of shape (# groups, # unique terms), with tf-, tf-idf, or binary-weighted values.
TopicModel – Train and apply a topic model to vectorized texts using scikit-learn’s implementations of LSA, LDA, and NMF models.
get_term_freqs – Compute frequencies for all terms in a document-term matrix, with optional sub-linear scaling.
get_inverse_doc_freqs – Compute inverse document frequencies for all terms in a document-term matrix, using one of several IDF formulations.
apply_idf_weighting – Apply inverse document frequency (idf) weighting to a term-frequency (tf) weighted document-term matrix, using one of several IDF formulations.
filter_terms_by_df – Filter out terms that are too common and/or too rare (by document frequency), and compactify the top max_n_terms in the id_to_term mapping accordingly.
Vectorizers
textacy.vsm.vectorizers: Transform a collection of tokenized documents into a document-term matrix of shape (# docs, # unique terms), with various ways to filter or limit included terms and flexible weighting schemes for their values.
A second option aggregates terms in tokenized documents by provided group labels, resulting in a “group-term-matrix” of shape (# unique groups, # unique terms), with filtering and weighting functionality as described above.
See the Vectorizer and GroupVectorizer docstrings for usage examples and explanations of the various weighting schemes.
-
class textacy.vsm.vectorizers.Vectorizer(*, tf_type='linear', apply_idf=False, idf_type='smooth', apply_dl=False, dl_type='sqrt', norm=None, min_df=1, max_df=1.0, max_n_terms=None, vocabulary_terms=None)
Transform one or more tokenized documents into a sparse document-term matrix of shape (# docs, # unique terms), with flexibly weighted and normalized values.
Stream a corpus with metadata from disk:
>>> ds = textacy.datasets.CapitolWords()
>>> records = ds.records(limit=1000)
>>> corpus = textacy.Corpus("en", data=records)
>>> corpus
Corpus(1000 docs; 538172 tokens)
Tokenize and vectorize the first 600 documents of this corpus:
>>> tokenized_docs = (
...     doc._.to_terms_list(ngrams=1, entities=True, as_strings=True)
...     for doc in corpus[:600])
>>> vectorizer = Vectorizer(
...     apply_idf=True, norm="l2",
...     min_df=3, max_df=0.95)
>>> doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
>>> doc_term_matrix
<600x4346 sparse matrix of type '<class 'numpy.float64'>'
        with 69673 stored elements in Compressed Sparse Row format>
Tokenize and vectorize the remaining 400 documents of the corpus, using only the terms and weights learned in the previous step:
>>> tokenized_docs = (
...     doc._.to_terms_list(ngrams=1, entities=True, as_strings=True)
...     for doc in corpus[600:])
>>> doc_term_matrix = vectorizer.transform(tokenized_docs)
>>> doc_term_matrix
<400x4346 sparse matrix of type '<class 'numpy.float64'>'
        with 38756 stored elements in Compressed Sparse Row format>
Inspect the terms associated with columns; they’re sorted alphabetically:
>>> vectorizer.terms_list[:5]
['', '$', '$ 1 million', '$ 1.2 billion', '$ 10 billion']
(Btw: That empty string shouldn’t be there. Somehow, spaCy is labeling it as a named entity…)
If known in advance, limit the terms included in vectorized outputs to a particular set of values:
>>> tokenized_docs = (
...     doc._.to_terms_list(ngrams=1, entities=True, as_strings=True)
...     for doc in corpus[:600])
>>> vectorizer = Vectorizer(
...     apply_idf=True, idf_type="smooth", norm="l2",
...     min_df=3, max_df=0.95,
...     vocabulary_terms=["president", "bill", "unanimous", "distinguished", "american"])
>>> doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
>>> doc_term_matrix
<600x5 sparse matrix of type '<class 'numpy.float64'>'
        with 844 stored elements in Compressed Sparse Row format>
>>> vectorizer.terms_list
['american', 'bill', 'distinguished', 'president', 'unanimous']
Specify different weighting schemes to determine values in the matrix, adding or customizing individual components, as desired:
>>> money_idx = vectorizer.vocabulary_terms["$"]
>>> doc_term_matrix = Vectorizer(
...     tf_type="linear", norm=None, min_df=3, max_df=0.95
...     ).fit_transform(tokenized_docs)
>>> print(doc_term_matrix[0:7, money_idx].toarray())
[[0]
 [0]
 [1]
 [4]
 [0]
 [0]
 [2]]
>>> doc_term_matrix = Vectorizer(
...     tf_type="sqrt", apply_dl=True, dl_type="sqrt", norm=None, min_df=3, max_df=0.95
...     ).fit_transform(tokenized_docs)
>>> print(doc_term_matrix[0:7, money_idx].toarray())
[[0.        ]
 [0.        ]
 [0.10101525]
 [0.26037782]
 [0.        ]
 [0.        ]
 [0.11396058]]
>>> doc_term_matrix = Vectorizer(
...     tf_type="bm25", apply_idf=True, idf_type="smooth", norm=None, min_df=3, max_df=0.95
...     ).fit_transform(tokenized_docs)
>>> print(doc_term_matrix[0:7, money_idx].toarray())
[[0.        ]
 [0.        ]
 [3.28353965]
 [5.82763722]
 [0.        ]
 [0.        ]
 [4.83933924]]
If you’re not sure what’s going on mathematically, Vectorizer.weighting gives the formula being used to calculate weights, based on the parameters set when initializing the vectorizer:
>>> vectorizer.weighting
'(tf * (k + 1)) / (k + tf) * log((n_docs + 1) / (df + 1)) + 1'
In general, weights may consist of a local component (term frequency), a global component (inverse document frequency), and a normalization component (document length). Individual components may be modified: they may have different scaling (e.g. tf vs. sqrt(tf)) or different behaviors (e.g. “standard” idf vs bm25’s version). There are many possible weightings, and some may be better for particular use cases than others. When in doubt, though, just go with something standard.
“tf”: Weights are simply the absolute per-document term frequencies (tfs), i.e. value (i, j) in an output doc-term matrix corresponds to the number of occurrences of term j in doc i. Terms appearing many times in a given doc receive higher weights than less common terms. Params:
tf_type="linear", apply_idf=False, apply_dl=False
“tfidf”: Doc-specific, local tfs are multiplied by their corpus-wide, global inverse document frequencies (idfs). Terms appearing in many docs have higher document frequencies (dfs), correspondingly smaller idfs, and in turn, lower weights. Params:
tf_type="linear", apply_idf=True, idf_type="smooth", apply_dl=False
“bm25”: This scheme includes a local tf component that increases asymptotically, so higher tfs have diminishing effects on the overall weight; a global idf component that can go negative for terms that appear in a sufficiently high proportion of docs; as well as a row-wise normalization that accounts for document length, such that terms in shorter docs hit the tf asymptote sooner than those in longer docs. Params:
tf_type="bm25", apply_idf=True, idf_type="bm25", apply_dl=True
“binary”: This weighting scheme simply replaces all non-zero tfs with 1, indicating the presence or absence of a term in a particular doc. That’s it. Params:
tf_type="binary", apply_idf=False, apply_dl=False
Slightly altered versions of these “standard” weighting schemes are common, and may have better behavior in general use cases:
“lucene-style tfidf”: Adds a doc-length normalization to the usual local and global components. Params:
tf_type="linear", apply_idf=True, idf_type="smooth", apply_dl=True, dl_type="sqrt"
“lucene-style bm25”: Uses a smoothed idf instead of the classic bm25 variant to prevent weights on terms from going negative. Params:
tf_type="bm25", apply_idf=True, idf_type="smooth", apply_dl=True, dl_type="linear"
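As a rough illustration of how the local (tf), global (idf), and normalization (dl) components combine, here is a minimal pure-Python sketch on a toy corpus. It is not textacy's implementation: the `weight` helper and the toy documents are hypothetical, and only the "smooth" idf and "sqrt" doc-length scaling described in this docstring are modeled.

```python
import math
from collections import Counter

# Toy tokenized corpus (hypothetical data, for illustration only).
docs = [
    ["tax", "bill", "tax", "vote"],
    ["bill", "vote"],
    ["tax", "budget", "budget", "budget", "deficit", "vote"],
]
n_docs = len(docs)
# Document frequency: number of docs in which each term appears at least once.
df = Counter(term for doc in docs for term in set(doc))


def weight(tf, term, doc_len, *, tf_type="linear", apply_idf=False, apply_dl=False):
    """Combine local (tf), global (idf), and normalization (dl) components."""
    value = {"linear": tf, "sqrt": math.sqrt(tf), "binary": 1.0}[tf_type]
    if apply_idf:
        # "smooth" idf: log((n_docs + 1) / (df + 1)) + 1
        value *= math.log((n_docs + 1) / (df[term] + 1)) + 1.0
    if apply_dl:
        # dl_type="sqrt": divide by the square root of the doc length
        value /= math.sqrt(doc_len)
    return value


counts = Counter(docs[0])
doc_len = len(docs[0])
tf_only = weight(counts["tax"], "tax", doc_len)                                 # "tf"
tfidf = weight(counts["tax"], "tax", doc_len, apply_idf=True)                   # "tfidf"
lucene = weight(counts["tax"], "tax", doc_len, apply_idf=True, apply_dl=True)   # "lucene-style tfidf"
print(tf_only, tfidf, lucene)
```

Each keyword switch corresponds to one of the parameter combinations listed above, so toggling `apply_idf` or `apply_dl` shows how a single cell of the matrix changes between schemes.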
- Parameters
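tf_type ({"linear", "sqrt", "log", "binary", "bm25"}) –
Type of term frequency (tf) to use for weights’ local component:
“linear”: tf (tfs are already linear, so left as-is)
“sqrt”: tf => sqrt(tf)
“log”: tf => log(tf) + 1
“binary”: tf => 1
“bm25”: tf => (tf * (k + 1)) / (k + tf), the saturating form reported by Vectorizer.weighting in the example above (without doc-length normalization)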
apply_idf (bool) – If True, apply global idfs to local term weights, i.e. multiply per-doc term frequencies by each term’s inverse document frequency, which shrinks as the number of documents containing the term grows (see idf_type); otherwise, don’t.
idf_type ({"standard", "smooth", "bm25"}) –
Type of inverse document frequency (idf) to use for weights’ global component:
”standard”: idf = log(n_docs / df) + 1.0
“smooth”: idf = log((n_docs + 1) / (df + 1)) + 1.0, i.e. 1 is added to the document count and to all document frequencies, as if a single document containing every unique term was added to the corpus. This prevents zero divisions!
”bm25”: idf = log((n_docs - df + 0.5) / (df + 0.5)), which is a form commonly used in information retrieval that allows for very common terms to receive negative weights.
apply_dl (bool) – If True, normalize local(+global) weights by doc length, i.e. divide by the total number of in-vocabulary terms appearing in a given doc; otherwise, don’t.
dl_type ({"linear", "sqrt", "log"}) –
Type of document-length scaling to use for weights’ normalization component:
”linear”: dl (dls are already linear, so left as-is)
”sqrt”: dl => sqrt(dl)
”log”: dl => log(dl)
norm ({"l1", "l2"} or None) – If “l1” or “l2”, normalize weights by the L1 or L2 norms, respectively, of row-wise vectors; otherwise, don’t.
vocabulary_terms (Dict[str, int] or Iterable[str]) – Mapping of unique term string to unique term id, or an iterable of term strings that gets converted into a suitable mapping. Note that, if specified, vectorized outputs will include only these terms as columns.
min_df (float or int) – If float, value is the fractional proportion of the total number of documents, which must be in [0.0, 1.0]. If int, value is the absolute number. Filter terms whose document frequency is less than min_df.
max_df (float or int) – If float, value is the fractional proportion of the total number of documents, which must be in [0.0, 1.0]. If int, value is the absolute number. Filter terms whose document frequency is greater than max_df.
max_n_terms (int) – Only include terms whose document frequency is within the top max_n_terms.
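To compare the three idf_type formulations side by side, the following sketch evaluates them for a rare, a moderately common, and a very common term. The `idf` helper is a hypothetical stand-in written from the formulas in the parameter list, not a textacy function.

```python
import math


def idf(n_docs, df, idf_type="smooth"):
    """Inverse document frequency under the formulations listed above."""
    if idf_type == "standard":
        return math.log(n_docs / df) + 1.0
    if idf_type == "smooth":
        return math.log((n_docs + 1) / (df + 1)) + 1.0
    if idf_type == "bm25":
        return math.log((n_docs - df + 0.5) / (df + 0.5))
    raise ValueError(f"unknown idf_type: {idf_type!r}")


n_docs = 600
for df in (3, 300, 570):
    values = [round(idf(n_docs, df, t), 3) for t in ("standard", "smooth", "bm25")]
    print(df, values)
```

Under these formulas the "bm25" value turns negative once a term appears in more than half of the documents, while "standard" and "smooth" never drop below 1.0; "smooth" also stays finite when df is 0, where "standard" would divide by zero.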
-
vocabulary_terms
Mapping of unique term string to unique term id, either provided on instantiation or generated by calling Vectorizer.fit() on a collection of tokenized documents.
-
property id_to_term
Mapping of unique term id (int) to unique term string (str), i.e. the inverse of Vectorizer.vocabulary_terms. This attribute is only generated if needed, and it is automatically kept in sync with the corresponding vocabulary.
-
property terms_list
List of term strings in column order of vectorized outputs. For example, terms_list[0] gives the term assigned to the first column in an output doc-term-matrix, doc_term_matrix[:, 0].
-
fit(tokenized_docs)
Count terms in tokenized_docs and, if not already provided, build up a vocabulary based on those terms. Fit and store global weights (IDFs) and, if needed for term weighting, the average document length.
- Parameters
tokenized_docs (Iterable[Iterable[str]]) –
A sequence of tokenized documents, where each is a sequence of (str) terms. For example:
>>> ([tok.lemma_ for tok in spacy_doc]
...  for spacy_doc in spacy_docs)
>>> ((ne.text for ne in extract.entities(doc))
...  for doc in corpus)
>>> (doc._.to_terms_list(as_strings=True)
...  for doc in docs)
- Returns
The instance that has just been fit.
- Return type
-
fit_transform(tokenized_docs)
Count terms in tokenized_docs and, if not already provided, build up a vocabulary based on those terms. Fit and store global weights (IDFs) and, if needed for term weighting, the average document length. Transform tokenized_docs into a document-term matrix with values weighted according to the parameters in Vectorizer initialization.
- Parameters
tokenized_docs (Iterable[Iterable[str]]) –
A sequence of tokenized documents, where each is a sequence of (str) terms. For example:
>>> ([tok.lemma_ for tok in spacy_doc]
...  for spacy_doc in spacy_docs)
>>> ((ne.text for ne in extract.entities(doc))
...  for doc in corpus)
>>> (doc._.to_terms_list(as_strings=True)
...  for doc in docs)
- Returns
The transformed document-term matrix. Rows correspond to documents and columns correspond to terms.
- Return type
-
transform(tokenized_docs)
Transform tokenized_docs into a document-term matrix with values weighted according to the parameters in Vectorizer initialization and the global weights computed by calling Vectorizer.fit().
- Parameters
tokenized_docs (Iterable[Iterable[str]]) –
A sequence of tokenized documents, where each is a sequence of (str) terms. For example:
>>> ([tok.lemma_ for tok in spacy_doc]
...  for spacy_doc in spacy_docs)
>>> ((ne.text for ne in extract.entities(doc))
...  for doc in corpus)
>>> (doc._.to_terms_list(as_strings=True)
...  for doc in docs)
- Returns
The transformed document-term matrix. Rows correspond to documents and columns correspond to terms.
- Return type
Note
For best results, the tokenization used to produce tokenized_docs should be the same as was applied to the docs used in fitting this vectorizer or in generating a fixed input vocabulary.
Consider an extreme case where the docs used in fitting consist of lowercased (non-numeric) terms, while the docs to be transformed are all uppercased: The output doc-term-matrix will be empty.
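The extreme case in the note can be reproduced with a small stand-in for fitting and transforming. The helpers below (`fit_vocabulary`, `transform`) are hypothetical and use plain dicts rather than sparse matrices; they only mimic the idea that terms missing from the fitted vocabulary are dropped.

```python
from collections import Counter


def fit_vocabulary(tokenized_docs):
    """Map each unique term to a column id, in alphabetical order."""
    terms = sorted({term for doc in tokenized_docs for term in doc})
    return {term: idx for idx, term in enumerate(terms)}


def transform(tokenized_docs, vocabulary):
    """Count in-vocabulary terms per doc; each row maps column id -> count."""
    rows = []
    for doc in tokenized_docs:
        counts = Counter(term for term in doc if term in vocabulary)
        rows.append({vocabulary[term]: n for term, n in counts.items()})
    return rows


train = [["senate", "bill"], ["bill", "vote"]]
vocab = fit_vocabulary(train)

same = transform([["bill", "bill", "vote"]], vocab)   # same tokenization as training
upper = transform([["BILL", "BILL", "VOTE"]], vocab)  # uppercased: nothing matches
print(same, upper)
```

The uppercased document produces an empty row because none of its terms exist as keys in the fitted vocabulary.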
-
property weighting
A mathematical representation of the overall weighting scheme used to determine values in the vectorized matrix, depending on the params used to initialize the Vectorizer.
- Type
-
class textacy.vsm.vectorizers.GroupVectorizer(*, tf_type='linear', apply_idf=False, idf_type='smooth', apply_dl=False, dl_type='linear', norm=None, min_df=1, max_df=1.0, max_n_terms=None, vocabulary_terms=None, vocabulary_grps=None)
Transform one or more tokenized documents into a group-term matrix of shape (# groups, # unique terms), with tf-, tf-idf, or binary-weighted values.
This is an extension of typical document-term matrix vectorization, where terms are grouped by the documents in which they co-occur. It allows for customized grouping, such as by a shared author or publication year, that may span multiple documents, without forcing users to merge those documents themselves.
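The grouping idea can be sketched in a few lines of plain Python: term counts from every document sharing a group label are summed into a single row. The speaker labels and documents here are invented, and the dense list-of-lists output is a simplification of the sparse matrix that GroupVectorizer returns.

```python
from collections import Counter, defaultdict

# Hypothetical tokenized docs and one group label per doc.
tokenized_docs = [["tax", "bill"], ["bill", "vote", "bill"], ["budget", "tax"]]
groups = ["Speaker A", "Speaker B", "Speaker A"]

# Accumulate term counts per group rather than per document.
group_counts = defaultdict(Counter)
for doc, grp in zip(tokenized_docs, groups):
    group_counts[grp].update(doc)

# Rows and columns are ordered alphabetically, as in the examples in this docstring.
grps_list = sorted(group_counts)
terms_list = sorted({term for counts in group_counts.values() for term in counts})
matrix = [[group_counts[g][t] for t in terms_list] for g in grps_list]

print(grps_list)
print(terms_list)
print(matrix)
```

Documents 1 and 3 both belong to "Speaker A", so their counts land in the same row without the caller having to concatenate the documents first.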
Stream a corpus with metadata from disk:
>>> ds = textacy.datasets.CapitolWords()
>>> records = ds.records(limit=1000)
>>> corpus = textacy.Corpus("en", data=records)
>>> corpus
Corpus(1000 docs; 538172 tokens)
Tokenize and vectorize the first 600 documents of this corpus, where terms are grouped not by documents but by a categorical value in the docs’ metadata:
>>> tokenized_docs, groups = textacy.io.unzip(
...     (doc._.to_terms_list(ngrams=1, entities=True, as_strings=True),
...      doc._.meta["speaker_name"])
...     for doc in corpus[:600])
>>> vectorizer = GroupVectorizer(
...     apply_idf=True, idf_type="smooth", norm="l2",
...     min_df=3, max_df=0.95)
>>> grp_term_matrix = vectorizer.fit_transform(tokenized_docs, groups)
>>> grp_term_matrix
<5x1793 sparse matrix of type '<class 'numpy.float64'>'
        with 6075 stored elements in Compressed Sparse Row format>
Tokenize and vectorize the remaining 400 documents of the corpus, using only the groups, terms, and weights learned in the previous step:
>>> tokenized_docs, groups = textacy.io.unzip(
...     (doc._.to_terms_list(ngrams=1, entities=True, as_strings=True),
...      doc._.meta["speaker_name"])
...     for doc in corpus[600:])
>>> grp_term_matrix = vectorizer.transform(tokenized_docs, groups)
>>> grp_term_matrix
<5x1793 sparse matrix of type '<class 'numpy.float64'>'
        with 4440 stored elements in Compressed Sparse Row format>
Inspect the terms associated with columns and groups associated with rows; they’re sorted alphabetically:
>>> vectorizer.terms_list[:5]
['$ 1 million', '$ 160 million', '$ 7 billion', '0', '1 minute']
>>> vectorizer.grps_list
['Bernie Sanders', 'John Kasich', 'Joseph Biden', 'Lindsey Graham', 'Rick Santorum']
If known in advance, limit the terms and/or groups included in vectorized outputs to a particular set of values:
>>> tokenized_docs, groups = textacy.io.unzip(
...     (doc._.to_terms_list(ngrams=1, entities=True, as_strings=True),
...      doc._.meta["speaker_name"])
...     for doc in corpus[:600])
>>> vectorizer = GroupVectorizer(
...     apply_idf=True, idf_type="smooth", norm="l2",
...     min_df=3, max_df=0.95,
...     vocabulary_terms=["legislation", "federal government", "house", "constitutional"],
...     vocabulary_grps=["Bernie Sanders", "Lindsey Graham", "Rick Santorum"])
>>> grp_term_matrix = vectorizer.fit_transform(tokenized_docs, groups)
>>> grp_term_matrix
<3x4 sparse matrix of type '<class 'numpy.float64'>'
        with 12 stored elements in Compressed Sparse Row format>
>>> vectorizer.terms_list
['constitutional', 'federal government', 'house', 'legislation']
>>> vectorizer.grps_list
['Bernie Sanders', 'Lindsey Graham', 'Rick Santorum']
For a discussion of the various weighting schemes that can be applied, check out the Vectorizer docstring.
- Parameters
tf_type ({"linear", "sqrt", "log", "binary"}) –
Type of term frequency (tf) to use for weights’ local component:
”linear”: tf (tfs are already linear, so left as-is)
”sqrt”: tf => sqrt(tf)
”log”: tf => log(tf) + 1
”binary”: tf => 1
apply_idf (bool) – If True, apply global idfs to local term weights, i.e. multiply per-doc term frequencies by each term’s inverse document frequency, which shrinks as the number of documents containing the term grows (see idf_type); otherwise, don’t.
idf_type ({"standard", "smooth", "bm25"}) –
Type of inverse document frequency (idf) to use for weights’ global component:
”standard”: idf = log(n_docs / df) + 1.0
“smooth”: idf = log((n_docs + 1) / (df + 1)) + 1.0, i.e. 1 is added to the document count and to all document frequencies, as if a single document containing every unique term was added to the corpus.
”bm25”: idf = log((n_docs - df + 0.5) / (df + 0.5)), which is a form commonly used in information retrieval that allows for very common terms to receive negative weights.
apply_dl (bool) – If True, normalize local(+global) weights by doc length, i.e. divide by the total number of in-vocabulary terms appearing in a given doc; otherwise, don’t.
dl_type ({"linear", "sqrt", "log"}) –
Type of document-length scaling to use for weights’ normalization component:
”linear”: dl (dls are already linear, so left as-is)
”sqrt”: dl => sqrt(dl)
”log”: dl => log(dl)
norm ({"l1", "l2"} or None) – If “l1” or “l2”, normalize weights by the L1 or L2 norms, respectively, of row-wise vectors; otherwise, don’t.
vocabulary_terms (Dict[str, int] or Iterable[str]) – Mapping of unique term string to unique term id, or an iterable of term strings that gets converted into a suitable mapping. Note that, if specified, vectorized outputs will include only these terms as columns.
vocabulary_grps (Dict[str, int] or Iterable[str]) – Mapping of unique group string to unique group id, or an iterable of group strings that gets converted into a suitable mapping. Note that, if specified, vectorized outputs will include only these groups as rows.
min_df (float or int) – If float, value is the fractional proportion of the total number of documents, which must be in [0.0, 1.0]. If int, value is the absolute number. Filter terms whose document frequency is less than min_df.
max_df (float or int) – If float, value is the fractional proportion of the total number of documents, which must be in [0.0, 1.0]. If int, value is the absolute number. Filter terms whose document frequency is greater than max_df.
max_n_terms (int) – Only include terms whose document frequency is within the top max_n_terms.
-
vocabulary_terms
Mapping of unique term string to unique term id, either provided on instantiation or generated by calling GroupVectorizer.fit() on a collection of tokenized documents.
-
vocabulary_grps
Mapping of unique group string to unique group id, either provided on instantiation or generated by calling GroupVectorizer.fit() on a collection of tokenized documents.
-
id_to_term
Mapping of unique term id to unique term string, i.e. the inverse of GroupVectorizer.vocabulary_terms. This mapping is only generated as needed.
-
property id_to_grp
Mapping of unique group id (int) to unique group string (str), i.e. the inverse of GroupVectorizer.vocabulary_grps. This attribute is only generated if needed, and it is automatically kept in sync with the corresponding vocabulary.
-
property grps_list
List of group strings in row order of vectorized outputs. For example, grps_list[0] gives the group assigned to the first row in an output group-term-matrix, grp_term_matrix[0, :].
-
fit(tokenized_docs, grps)
Count terms in tokenized_docs and, if not already provided, build up a vocabulary based on those terms; do the same for the groups in grps. Fit and store global weights (IDFs) and, if needed for term weighting, the average document length.
- Parameters
tokenized_docs (Iterable[Iterable[str]]) –
A sequence of tokenized documents, where each is a sequence of (str) terms. For example:
>>> ([tok.lemma_ for tok in spacy_doc]
...  for spacy_doc in spacy_docs)
>>> ((ne.text for ne in extract.entities(doc))
...  for doc in corpus)
>>> (doc._.to_terms_list(as_strings=True)
...  for doc in docs)
grps (Iterable[str]) – Sequence of group names by which the terms in tokenized_docs are aggregated, where the first item in grps corresponds to the first item in tokenized_docs, and so on.
- Returns
The instance that has just been fit.
- Return type
-
fit_transform(tokenized_docs, grps)
Count terms in tokenized_docs and, if not already provided, build up a vocabulary based on those terms; do the same for the groups in grps. Fit and store global weights (IDFs) and, if needed for term weighting, the average document length. Transform tokenized_docs into a group-term matrix with values weighted according to the parameters in GroupVectorizer initialization.
- Parameters
tokenized_docs (Iterable[Iterable[str]]) –
A sequence of tokenized documents, where each is a sequence of (str) terms. For example:
>>> ([tok.lemma_ for tok in spacy_doc]
...  for spacy_doc in spacy_docs)
>>> ((ne.text for ne in extract.entities(doc))
...  for doc in corpus)
>>> (doc._.to_terms_list(as_strings=True)
...  for doc in docs)
grps (Iterable[str]) – Sequence of group names by which the terms in tokenized_docs are aggregated, where the first item in grps corresponds to the first item in tokenized_docs, and so on.
- Returns
The transformed group-term matrix. Rows correspond to groups and columns correspond to terms.
- Return type
-
transform(tokenized_docs, grps)
Transform tokenized_docs and grps into a group-term matrix with values weighted according to the parameters in GroupVectorizer initialization and the global weights computed by calling GroupVectorizer.fit().
- Parameters
tokenized_docs (Iterable[Iterable[str]]) –
A sequence of tokenized documents, where each is a sequence of (str) terms. For example:
>>> ([tok.lemma_ for tok in spacy_doc]
...  for spacy_doc in spacy_docs)
>>> ((ne.text for ne in extract.entities(doc))
...  for doc in corpus)
>>> (doc._.to_terms_list(as_strings=True)
...  for doc in docs)
grps (Iterable[str]) – Sequence of group names by which the terms in tokenized_docs are aggregated, where the first item in grps corresponds to the first item in tokenized_docs, and so on.
- Returns
The transformed group-term matrix. Rows correspond to groups and columns correspond to terms.
- Return type
Note
For best results, the tokenization used to produce tokenized_docs should be the same as was applied to the docs used in fitting this vectorizer or in generating a fixed input vocabulary.
Consider an extreme case where the docs used in fitting consist of lowercased (non-numeric) terms, while the docs to be transformed are all uppercased: The output group-term-matrix will be empty.
Sparse Matrix Utils
textacy.vsm.matrix_utils: Functions for computing corpus-wide term- or document-based values, like term frequency, document frequency, and document length, and filtering terms from a matrix by their document frequency.
-
textacy.vsm.matrix_utils.get_term_freqs(doc_term_matrix, *, type_='linear')
Compute frequencies for all terms in a document-term matrix, with optional sub-linear scaling.
- Parameters
doc_term_matrix (scipy.sparse.csr_matrix) – M x N sparse matrix, where M is the # of docs and N is the # of unique terms. Values must be the linear, un-scaled counts of term n per doc m.
type_ ({'linear', 'sqrt', 'log'}) – Scaling applied to absolute term counts. If ‘linear’, term counts are left as-is, since the sums are already linear; if ‘sqrt’, tf => sqrt(tf); if ‘log’, tf => log(tf) + 1.
- Returns
Array of term frequencies, with length equal to the # of unique terms (# of columns) in doc_term_matrix.
- Return type
- Raises
ValueError – if doc_term_matrix doesn’t have any non-zero entries, or if type_ isn’t one of {“linear”, “sqrt”, “log”}.
-
textacy.vsm.matrix_utils.get_doc_freqs(doc_term_matrix)
Compute document frequencies for all terms in a document-term matrix.
- Parameters
doc_term_matrix (scipy.sparse.csr_matrix) –
M x N sparse matrix, where M is the # of docs and N is the # of unique terms.
Note
Weighting on the terms doesn’t matter! Could be binary or tf or tfidf, a term’s doc freq will be the same.
- Returns
Array of document frequencies, with length equal to the # of unique terms (# of columns) in doc_term_matrix.
- Return type
- Raises
ValueError – if doc_term_matrix doesn’t have any non-zero entries.
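The note about weighting can be checked with a tiny dense example: a document frequency only depends on which cells are non-zero, so any positive rescaling of the counts leaves it unchanged. The `doc_freqs_sketch` helper is hypothetical and works on nested lists rather than a scipy sparse matrix.

```python
def doc_freqs_sketch(rows):
    """Count, for each column, the number of rows with a non-zero value."""
    n_terms = len(rows[0])
    return [sum(1 for row in rows if row[j] != 0) for j in range(n_terms)]


# Rows are docs, columns are terms (toy data).
counts = [[2, 0, 1], [0, 0, 3], [1, 4, 0]]
binary = [[1 if value else 0 for value in row] for row in counts]
scaled = [[value * 0.37 for value in row] for row in counts]

print(doc_freqs_sketch(counts), doc_freqs_sketch(binary), doc_freqs_sketch(scaled))
```

All three matrices give the same result because the pattern of zero and non-zero cells is identical.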
-
textacy.vsm.matrix_utils.get_inverse_doc_freqs(doc_term_matrix, *, type_='smooth')
Compute inverse document frequencies for all terms in a document-term matrix, using one of several IDF formulations.
- Parameters
doc_term_matrix (scipy.sparse.csr_matrix) – M x N sparse matrix, where M is the # of docs and N is the # of unique terms. The particular weighting of matrix values doesn’t matter.
type_ ({'standard', 'smooth', 'bm25'}) – Type of IDF formulation to use. If ‘standard’, idfs => log(n_docs / dfs) + 1.0; if ‘smooth’, idfs => log((n_docs + 1) / (dfs + 1)) + 1.0, i.e. 1 is added to the document count and to all document frequencies, equivalent to adding a single document to the corpus containing every unique term; if ‘bm25’, idfs => log((n_docs - dfs + 0.5) / (dfs + 0.5)), which is a form commonly used in BM25 ranking that allows for extremely common terms to have negative idf weights.
- Returns
Array of inverse document frequencies, with length equal to the # of unique terms (# of columns) in doc_term_matrix.
- Return type
- Raises
ValueError – if type_ isn’t one of {“standard”, “smooth”, “bm25”}.
-
textacy.vsm.matrix_utils.get_doc_lengths(doc_term_matrix, *, type_='linear')
Compute the lengths (i.e. number of terms) for all documents in a document-term matrix.
- Parameters
doc_term_matrix (scipy.sparse.csr_matrix) – M x N sparse matrix, where M is the # of docs, N is the # of unique terms, and values are the absolute counts of term n per doc m.
type_ ({'linear', 'sqrt', 'log'}) – Scaling applied to absolute doc lengths. If ‘linear’, lengths are left as-is, since the sums are already linear; if ‘sqrt’, dl => sqrt(dl); if ‘log’, dl => log(dl) + 1.
- Returns
Array of document lengths, with length equal to the # of documents (# of rows) in doc_term_matrix.
- Return type
- Raises
ValueError – if type_ isn’t one of {“linear”, “sqrt”, “log”}.
-
textacy.vsm.matrix_utils.get_information_content(doc_term_matrix)
Compute information content for all terms in a document-term matrix. IC is a float in [0.0, 1.0], defined as -df * log2(df) - (1 - df) * log2(1 - df), where df is a term’s normalized document frequency.
- Parameters
doc_term_matrix (scipy.sparse.csr_matrix) –
M x N sparse matrix, where M is the # of docs and N is the # of unique terms.
Note
Weighting on the terms doesn’t matter! Could be binary or tf or tfidf, a term’s information content will be the same.
- Returns
Array of term information content values, with length equal to the # of unique terms (# of columns) in doc_term_matrix.
- Return type
- Raises
ValueError – if doc_term_matrix doesn’t have any non-zero entries.
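The information content formula above is the binary entropy of a term's normalized document frequency. A minimal sketch follows, with a hypothetical `information_content` helper; treating df values of exactly 0 or 1 as zero information is an assumption made here (the usual 0 * log 0 = 0 convention), not something stated in the docstring.

```python
import math


def information_content(df, n_docs):
    """Binary entropy of the fraction of docs that contain a term."""
    p = df / n_docs  # normalized document frequency
    if p in (0.0, 1.0):
        # Assumed convention: a term in no docs or in every doc carries no information.
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


n_docs = 100
for df in (1, 10, 50, 90, 100):
    print(df, round(information_content(df, n_docs), 4))
```

The value peaks at 1.0 for a term found in exactly half of the documents and falls toward 0.0 for very rare and very common terms alike, which is why a single min_ic threshold can remove both extremes.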
-
textacy.vsm.matrix_utils.apply_idf_weighting(doc_term_matrix, *, type_='smooth')
Apply inverse document frequency (idf) weighting to a term-frequency (tf) weighted document-term matrix, using one of several IDF formulations.
- Parameters
doc_term_matrix (scipy.sparse.csr_matrix) – M x N sparse matrix, where M is the # of docs and N is the # of unique terms.
type_ ({'standard', 'smooth', 'bm25'}) – Type of IDF formulation to use.
- Returns
Sparse matrix of shape M x N, where value (i, j) is the tfidf weight of term j in doc i.
- Return type
-
textacy.vsm.matrix_utils.filter_terms_by_df(doc_term_matrix, term_to_id, *, max_df=1.0, min_df=1, max_n_terms=None)
Filter out terms that are too common and/or too rare (by document frequency), and compactify the top max_n_terms in the id_to_term mapping accordingly. Borrows heavily from the sklearn.feature_extraction.text module.
- Parameters
doc_term_matrix (scipy.sparse.csr_matrix) – M x N matrix, where M is the # of docs and N is the # of unique terms.
term_to_id (Dict[str, int]) – Mapping of term string to unique term id, e.g. Vectorizer.vocabulary_terms.
min_df (float or int) – if float, value is the fractional proportion of the total number of documents and must be in [0.0, 1.0]; if int, value is the absolute number; filter terms whose document frequency is less than min_df
max_df (float or int) – if float, value is the fractional proportion of the total number of documents and must be in [0.0, 1.0]; if int, value is the absolute number; filter terms whose document frequency is greater than max_df
max_n_terms (int) – only include terms whose term frequency is within the top max_n_terms
- Returns
Sparse matrix of shape (# docs, # unique filtered terms), where value (i, j) is the weight of term j in doc i.
Dict[str, int]: Term to id mapping, where keys are unique filtered terms as strings and values are their corresponding integer ids.
- Return type
- Raises
ValueError – if max_df or min_df or max_n_terms < 0.
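The float-versus-int convention for min_df and max_df can be illustrated with a simplified filter over a plain dict of document frequencies. `filter_by_df` is a hypothetical helper: it skips the matrix slicing, id compaction, and max_n_terms handling that the real function performs.

```python
def filter_by_df(doc_freqs, n_docs, *, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    # Floats are read as proportions of n_docs, ints as absolute doc counts.
    lower = min_df * n_docs if isinstance(min_df, float) else min_df
    upper = max_df * n_docs if isinstance(max_df, float) else max_df
    return {term: df for term, df in doc_freqs.items() if lower <= df <= upper}


# Toy document frequencies for a 600-doc corpus (invented numbers).
doc_freqs = {"the": 598, "senate": 210, "bulgaria": 2, "bill": 340}
kept = filter_by_df(doc_freqs, 600, min_df=3, max_df=0.95)
print(sorted(kept))
```

With min_df=3 and max_df=0.95, as in the Vectorizer examples, "bulgaria" is dropped for appearing in fewer than 3 documents and "the" for appearing in more than 95% of them.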
-
textacy.vsm.matrix_utils.filter_terms_by_ic(doc_term_matrix, term_to_id, *, min_ic=0.0, max_n_terms=None)
Filter out terms that are too common and/or too rare (by information content), and compactify the top max_n_terms in the id_to_term mapping accordingly. Borrows heavily from the sklearn.feature_extraction.text module.
- Parameters
doc_term_matrix (scipy.sparse.csr_matrix) – M x N sparse matrix, where M is the # of docs and N is the # of unique terms.
term_to_id (Dict[str, int]) – Mapping of term string to unique term id, e.g. Vectorizer.vocabulary_terms.
min_ic (float) – filter terms whose information content is less than this value; must be in [0.0, 1.0]
max_n_terms (int) – only include terms whose information content is within the top max_n_terms
- Returns
Sparse matrix of shape (# docs, # unique filtered terms), where value (i, j) is the weight of term j in doc i.
Dict[str, int]: Term to id mapping, where keys are unique filtered terms as strings and values are their corresponding integer ids.
- Return type
- Raises
ValueError – if min_ic not in [0.0, 1.0] or max_n_terms < 0.
Topic Models
textacy.tm.topic_model: Convenient and consolidated topic-modeling, built on scikit-learn.
-
class textacy.tm.topic_model.TopicModel(model, n_topics=10, **kwargs)
Train and apply a topic model to vectorized texts using scikit-learn’s implementations of LSA, LDA, and NMF models. Any other topic model implementation that has components_, n_topics, and transform attributes can also be used. Inspect and visualize results. Save and load trained models to and from disk.
Prepare a vectorized corpus (i.e. document-term matrix) and corresponding vocabulary (i.e. mapping of term strings to column indices in the matrix). See textacy.vsm.Vectorizer for details. In short:
>>> vectorizer = Vectorizer(
...     tf_type="linear", apply_idf=True, idf_type="smooth", norm="l2",
...     min_df=3, max_df=0.95, max_n_terms=100000)
>>> doc_term_matrix = vectorizer.fit_transform(terms_list)
Initialize and train a topic model:
>>> model = textacy.tm.TopicModel("nmf", n_topics=10)
>>> model.fit(doc_term_matrix)
>>> model
TopicModel(n_topics=10, model=NMF)
Transform the corpus and interpret our model:
>>> doc_topic_matrix = model.transform(doc_term_matrix)
>>> for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, topics=[0,1]):
...     print("topic", topic_idx, ":", " ".join(top_terms))
topic 0 : people american go year work think $ today money america
topic 1 : rescind quorum order unanimous consent ask president mr. madam absence
>>> for topic_idx, top_docs in model.top_topic_docs(doc_topic_matrix, topics=[0,1], top_n=2):
...     print(topic_idx)
...     for j in top_docs:
...         print(corpus[j]._.meta["title"])
0
THE MOST IMPORTANT ISSUES FACING THE AMERICAN PEOPLE
55TH ANNIVERSARY OF THE BATTLE OF CRETE
1
CHEMICAL WEAPONS CONVENTION
MFN STATUS FOR CHINA
>>> for doc_idx, topics in model.top_doc_topics(doc_topic_matrix, docs=range(5), top_n=2):
...     print(corpus[doc_idx]._.meta["title"], ":", topics)
JOIN THE SENATE AND PASS A CONTINUING RESOLUTION : (9, 0)
MEETING THE CHALLENGE : (2, 0)
DISPOSING OF SENATE AMENDMENT TO H.R. 1643, EXTENSION OF MOST-FAVORED- NATION TREATMENT FOR BULGARIA : (0, 9)
EXAMINING THE SPEAKER'S UPCOMING TRAVEL SCHEDULE : (0, 9)
FLOODING IN PENNSYLVANIA : (0, 9)
>>> for i, val in enumerate(model.topic_weights(doc_topic_matrix)):
...     print(i, val)
0 0.302796022302
1 0.0635617650602
2 0.0744927472417
3 0.0905778808867
4 0.0521162262192
5 0.0656303769725
6 0.0973516532757
7 0.112907245542
8 0.0680659204364
9 0.0725001620636
Visualize the model:
>>> model.termite_plot(doc_term_matrix, vectorizer.id_to_term,
...                    topics=-1, n_terms=25, sort_terms_by="seriation")
Persist our topic model to disk:
>>> model.save("nmf-10topics.pkl")
- Parameters
model ({“nmf”, “lda”, “lsa”} or sklearn.decomposition.<model>) – name of the topic model to initialize, or an existing NMF, LatentDirichletAllocation, or TruncatedSVD instance
n_topics (int) – number of topics in the model to be initialized
**kwargs – variety of parameters used to initialize the model; see individual sklearn pages for full details
- Raises
ValueError – if model not in {"nmf", "lda", "lsa"} or is not an NMF, LatentDirichletAllocation, or TruncatedSVD instance
-
get_doc_topic_matrix
(doc_term_matrix, *, normalize=True)
Transform a document-term matrix into a document-topic matrix, where rows correspond to documents and columns to the topics in the topic model.
- Parameters
doc_term_matrix (array-like or sparse matrix) – Corpus represented as a document-term matrix with shape (n_docs, n_terms). LDA expects tf-weighting, while NMF and LSA may do better with tfidf-weighting.
normalize (bool) – if True, the values in each row are normalized, i.e. topic weights on each document sum to 1
- Returns
Document-topic matrix with shape (n_docs, n_topics).
- Return type
numpy.ndarray
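The effect of normalize=True can be sketched with plain NumPy. This is only an illustration of row normalization under stated assumptions, not the library's code: the helper name `normalize_rows` is hypothetical, and leaving all-zero rows as zeros is a choice made here to avoid division by zero.

```python
import numpy as np


def normalize_rows(doc_topic_matrix):
    """Rescale each row so that its topic weights sum to 1."""
    m = np.asarray(doc_topic_matrix, dtype=float)
    row_sums = m.sum(axis=1, keepdims=True)
    # Assumption of this sketch: rows with no topic weight at all stay all-zero.
    return np.divide(m, row_sums, out=np.zeros_like(m), where=row_sums != 0)


raw = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0]]
normalized = normalize_rows(raw)
```

After normalization, each non-empty row can be read as a distribution over topics, which makes weights comparable across documents of different lengths.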
-
top_topic_terms
(id2term, *, topics=-1, top_n=10, weights=False)
Get the top top_n terms by weight per topic in model.
- Parameters
id2term (list(str) or dict) – object that returns the term string corresponding to term id i through id2term[i]; could be a list of strings where the index represents the term id, such as that returned by sklearn.feature_extraction.text.CountVectorizer.get_feature_names(), or a mapping of term id: term string
topics (int or Sequence[int]) – topic(s) for which to return top terms; if -1 (default), all topics’ terms are returned
top_n (int) – number of top terms to return per topic
weights (bool) – if True, terms are returned with their corresponding topic weights; otherwise, terms are returned without weights
- Yields
Tuple[int, Tuple[str]] or Tuple[int, Tuple[Tuple[str, float]]] – next tuple corresponding to a topic; the first element is the topic’s index; if weights is False, the second element is a tuple of str representing the top top_n related terms; otherwise, the second is a tuple of (str, float) pairs representing the top top_n related terms and their associated weights wrt the topic; for example:
>>> list(TopicModel.top_topic_terms(id2term, topics=(0, 1), top_n=2, weights=False))
[(0, ('foo', 'bar')), (1, ('bat', 'baz'))]
>>> list(TopicModel.top_topic_terms(id2term, topics=0, top_n=2, weights=True))
[(0, (('foo', 0.1415), ('bar', 0.0986)))]
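The selection of top terms can be sketched as an argsort over each row of a topic-term weight matrix. The sketch assumes such a matrix of shape (n_topics, n_terms) is available (scikit-learn's decomposition models expose one as `components_`); the function name `top_terms_per_topic` and the toy data are hypothetical.

```python
import numpy as np


def top_terms_per_topic(components, id2term, top_n=10, weights=False):
    """Yield (topic index, top terms) pairs from a (n_topics, n_terms) weight matrix."""
    for topic_idx, topic in enumerate(np.asarray(components, dtype=float)):
        # argsort is ascending, so reverse it to put the heaviest terms first.
        top_ids = [int(i) for i in np.argsort(topic)[::-1][:top_n]]
        if weights:
            yield topic_idx, tuple((id2term[i], float(topic[i])) for i in top_ids)
        else:
            yield topic_idx, tuple(id2term[i] for i in top_ids)


components = [[0.1, 0.5, 0.2], [0.9, 0.0, 0.3]]
id2term = ["foo", "bar", "baz"]
top_terms = list(top_terms_per_topic(components, id2term, top_n=2))
```

Because id2term is only ever indexed with integer ids, a list of strings and a dict of id-to-term pairs work interchangeably here.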
-
top_topic_docs
(doc_topic_matrix, *, topics=-1, top_n=10, weights=False)
Get the top top_n docs by weight per topic in doc_topic_matrix.
- Parameters
doc_topic_matrix (numpy.ndarray) – document-topic matrix with shape (n_docs, n_topics), the result of calling TopicModel.get_doc_topic_matrix()
topics (int or Sequence[int]) – topic(s) for which to return top docs; if -1, all topics’ docs are returned
top_n (int) – number of top docs to return per topic
weights (bool) – if True, docs are returned with their corresponding (normalized) topic weights; otherwise, docs are returned without weights
- Yields
Tuple[int, Tuple[int]] or Tuple[int, Tuple[Tuple[int, float]]] – next tuple corresponding to a topic; the first element is the topic’s index; if weights is False, the second element is a tuple of ints representing the top top_n related docs; otherwise, the second is a tuple of (int, float) pairs representing the top top_n related docs and their associated weights wrt the topic; for example:
>>> list(TopicModel.top_topic_docs(dtm, topics=(0, 1), top_n=2, weights=False))
[(0, (4, 2)), (1, (1, 3))]
>>> list(TopicModel.top_topic_docs(dtm, topics=0, top_n=2, weights=True))
[(0, ((4, 0.3217), (2, 0.2154)))]
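Ranking docs per topic amounts to sorting each column of the document-topic matrix. A minimal sketch, assuming a dense (n_docs, n_topics) array; the function name `top_docs_per_topic` and the toy matrix are hypothetical and not part of the library.

```python
import numpy as np


def top_docs_per_topic(doc_topic_matrix, top_n=10, weights=False):
    """Yield (topic index, top doc indices) pairs, heaviest docs first."""
    m = np.asarray(doc_topic_matrix, dtype=float)
    for topic_idx in range(m.shape[1]):
        # One column holds every doc's weight for this topic.
        column = m[:, topic_idx]
        top_docs = [int(i) for i in np.argsort(column)[::-1][:top_n]]
        if weights:
            yield topic_idx, tuple((i, float(column[i])) for i in top_docs)
        else:
            yield topic_idx, tuple(top_docs)


doc_topic_matrix = [[0.2, 0.8], [0.7, 0.3], [0.9, 0.1]]
top_docs = list(top_docs_per_topic(doc_topic_matrix, top_n=2))
```

The yielded doc indices are row positions in the matrix, so they can be used directly to look up the corresponding documents in the corpus that produced it.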
-
top_doc_topics
(doc_topic_matrix, *, docs=-1, top_n=3, weights=False)
Get the top top_n topics by weight per doc for docs in doc_topic_matrix.
- Parameters
doc_topic_matrix (numpy.ndarray) – document-topic matrix with shape (n_docs, n_topics), the result of calling TopicModel.get_doc_topic_matrix()
docs (int or Sequence[int]) – docs for which to return top topics; if -1, all docs’ top topics are returned
top_n (int) – number of top topics to return per doc
weights (bool) – if True, topics are returned with their corresponding (normalized) weights; otherwise, topics are returned without weights
- Yields
Tuple[int, Tuple[int]] or Tuple[int, Tuple[Tuple[int, float]]] – next tuple corresponding to a doc; the first element is the doc’s index; if weights is False, the second element is a tuple of ints representing the top top_n related topics; otherwise, the second is a tuple of (int, float) pairs representing the top top_n related topics and their associated weights wrt the doc; for example:
>>> list(TopicModel.top_doc_topics(dtm, docs=(0, 1), top_n=2, weights=False))
[(0, (1, 4)), (1, (3, 2))]
>>> list(TopicModel.top_doc_topics(dtm, docs=0, top_n=2, weights=True))
[(0, ((1, 0.2855), (4, 0.2412)))]
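The per-doc ranking is the row-wise counterpart of the per-topic one, and can be done for all docs at once with a single argsort along axis 1. This is a sketch under the assumption of a dense (n_docs, n_topics) array; `top_topics_per_doc` is a hypothetical name.

```python
import numpy as np


def top_topics_per_doc(doc_topic_matrix, top_n=3, weights=False):
    """Yield (doc index, top topic indices) pairs, heaviest topics first."""
    m = np.asarray(doc_topic_matrix, dtype=float)
    # Negate so that an ascending argsort ranks the heaviest topics first.
    order = np.argsort(-m, axis=1)[:, :top_n]
    for doc_idx, topic_ids in enumerate(order):
        topic_ids = [int(j) for j in topic_ids]
        if weights:
            yield doc_idx, tuple((j, float(m[doc_idx, j])) for j in topic_ids)
        else:
            yield doc_idx, tuple(topic_ids)


doc_topic_matrix = [[0.1, 0.6, 0.3], [0.5, 0.2, 0.3]]
top_topics = list(top_topics_per_doc(doc_topic_matrix, top_n=2))
```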
-
topic_weights
(doc_topic_matrix)
Get the overall weight of topics across an entire corpus. Note: Values depend on whether the per-document topic weights in doc_topic_matrix were normalized; either convention is valid.
- Parameters
doc_topic_matrix (numpy.ndarray) – document-topic matrix with shape (n_docs, n_topics), the result of calling TopicModel.get_doc_topic_matrix()
- Returns
the ith element is the ith topic’s overall weight
- Return type
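The note about normalization can be made concrete with a small sketch. It assumes that a topic's overall weight is its column sum rescaled so that all topics sum to 1, which matches the example output earlier in this class (the ten values add up to roughly 1) but is not spelled out here; `overall_topic_weights` is a hypothetical name.

```python
import numpy as np


def overall_topic_weights(doc_topic_matrix):
    """Column sums of the doc-topic matrix, rescaled to sum to 1 (assumed definition)."""
    m = np.asarray(doc_topic_matrix, dtype=float)
    col_sums = m.sum(axis=0)
    return col_sums / col_sums.sum()


# Two docs, two topics: the second doc carries far more raw topic weight.
raw = np.array([[1.0, 1.0], [8.0, 0.0]])
row_normalized = raw / raw.sum(axis=1, keepdims=True)

weights_raw = overall_topic_weights(raw)
weights_normalized = overall_topic_weights(row_normalized)
```

With raw weights, heavily weighted docs dominate the result; with row-normalized weights, every doc contributes equally, so the two calls give different answers for the same corpus.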
-
termite_plot
(doc_term_matrix, id2term, *, topics=-1, sort_topics_by='index', highlight_topics=None, n_terms=25, rank_terms_by='topic_weight', sort_terms_by='seriation', save=False, rc_params=None)
Make a “termite” plot for assessing topic models using a tabular layout to promote comparison of terms both within and across topics.
- Parameters
doc_term_matrix (numpy.ndarray or sparse matrix) – corpus represented as a document-term matrix with shape (n_docs, n_terms); may have tf- or tfidf-weighting
id2term (List[str] or dict) – object that returns the term string corresponding to term id i through id2term[i]; could be a list of strings where the index represents the term id, such as that returned by sklearn.feature_extraction.text.CountVectorizer.get_feature_names(), or a mapping of term id: term string
topics (int or Sequence[int]) – topic(s) to include in termite plot; if -1, all topics are included
sort_topics_by ({'index', 'weight'})
highlight_topics (int or Sequence[int]) – indices for up to 6 topics to visually highlight in the plot with contrasting colors
n_terms (int) – number of top terms to include in termite plot
rank_terms_by ({'topic_weight', 'corpus_weight'}) – value used to rank terms; the top-ranked n_terms are included in the plot
sort_terms_by ({'seriation', 'weight', 'index', 'alphabetical'}) – method used to vertically sort the selected top n_terms terms; the default (“seriation”) groups similar terms together, which facilitates cross-topic assessment
save (str) – give the full /path/to/fname on disk to save figure
rc_params (dict, optional) – allow passing parameters to rc_context in matplotlib.pyplot; for details, see https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.pyplot.rc_context.html
- Returns
Axis on which termite plot is plotted.
- Return type
matplotlib.axes.Axes
- Raises
ValueError – if more than 6 topics are selected for highlighting, or an invalid value is passed for the sort_topics_by, rank_terms_by, and/or sort_terms_by params
References
Chuang, Jason, Christopher D. Manning, and Jeffrey Heer. “Termite: Visualization techniques for assessing textual topic models.” Proceedings of the International Working Conference on Advanced Visual Interfaces. ACM, 2012.
For sorting by “seriation”, see https://arxiv.org/abs/1406.5370
See also
viz.termite_plot
TODO: rank_terms_by other metrics, e.g. topic salience or relevance