Topic Modeling
==============

textacy.tm.topic_model: Convenient and consolidated topic modeling, built on scikit-learn.
class textacy.tm.topic_model.TopicModel(model, n_topics=10, **kwargs)

Train and apply a topic model to vectorized texts using scikit-learn's
implementations of LSA, LDA, and NMF, or any other topic-model implementation
that exposes components_, n_topics, and transform attributes. Inspect and
visualize results; save and load trained models to and from disk.
Prepare a vectorized corpus (i.e. document-term matrix) and corresponding
vocabulary (i.e. mapping of term strings to column indices in the matrix);
see textacy.representations.vectorizers.Vectorizer for details. In short:

>>> vectorizer = Vectorizer(
...     tf_type="linear", idf_type="smooth", norm="l2",
...     min_df=3, max_df=0.95, max_n_terms=100000)
>>> doc_term_matrix = vectorizer.fit_transform(terms_list)
Initialize and train a topic model:

>>> model = textacy.tm.TopicModel("nmf", n_topics=10)
>>> model.fit(doc_term_matrix)
>>> model
TopicModel(n_topics=10, model=NMF)
Transform the corpus and interpret the model:

>>> doc_topic_matrix = model.transform(doc_term_matrix)
>>> for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, topics=[0, 1]):
...     print("topic", topic_idx, ":", " ".join(top_terms))
topic 0 : people american go year work think $ today money america
topic 1 : rescind quorum order unanimous consent ask president mr. madam absence
>>> for topic_idx, top_docs in model.top_topic_docs(doc_topic_matrix, topics=[0, 1], top_n=2):
...     print(topic_idx)
...     for j in top_docs:
...         print(corpus[j]._.meta["title"])
0
THE MOST IMPORTANT ISSUES FACING THE AMERICAN PEOPLE
55TH ANNIVERSARY OF THE BATTLE OF CRETE
1
CHEMICAL WEAPONS CONVENTION
MFN STATUS FOR CHINA
>>> for doc_idx, topics in model.top_doc_topics(doc_topic_matrix, docs=range(5), top_n=2):
...     print(corpus[doc_idx]._.meta["title"], ":", topics)
JOIN THE SENATE AND PASS A CONTINUING RESOLUTION : (9, 0)
MEETING THE CHALLENGE : (2, 0)
DISPOSING OF SENATE AMENDMENT TO H.R. 1643, EXTENSION OF MOST-FAVORED-NATION TREATMENT FOR BULGARIA : (0, 9)
EXAMINING THE SPEAKER'S UPCOMING TRAVEL SCHEDULE : (0, 9)
FLOODING IN PENNSYLVANIA : (0, 9)
>>> for i, val in enumerate(model.topic_weights(doc_topic_matrix)):
...     print(i, val)
0 0.302796022302
1 0.0635617650602
2 0.0744927472417
3 0.0905778808867
4 0.0521162262192
5 0.0656303769725
6 0.0973516532757
7 0.112907245542
8 0.0680659204364
9 0.0725001620636
Visualize the model:

>>> model.termite_plot(doc_term_matrix, vectorizer.id_to_term,
...     topics=-1, n_terms=25, sort_terms_by="seriation")
Persist our topic model to disk:

>>> model.save("nmf-10topics.pkl")
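Saving writes a pickle of the trained model to disk, so it can be restored later (textacy provides a matching TopicModel.load classmethod). The round trip amounts to pickling the underlying scikit-learn estimator; a minimal sketch with plain pickle and NMF on toy data, not textacy's own serialization code:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.decomposition import NMF

# Fit a small NMF model on a toy document-term matrix.
rng = np.random.default_rng(0)
doc_term_matrix = rng.random((6, 8))
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
model.fit(doc_term_matrix)

# Round-trip the fitted model through a pickle on disk.
path = os.path.join(tempfile.mkdtemp(), "nmf-2topics.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model carries the same learned topic-term weights.
assert np.allclose(restored.components_, model.components_)
```

Because the estimator's fitted state (components_, hyperparameters) survives the round trip unchanged, the restored model can transform new document-term matrices exactly as the original would.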
Parameters:
    * model ({"nmf", "lda", "lsa"} or sklearn.decomposition.<model>) --
      name of the model to initialize, or an already-instantiated
      scikit-learn model
    * n_topics (int) -- number of topics in the model to be initialized
    * **kwargs -- variety of parameters used to initialize the model; see
      the individual scikit-learn pages for full details

Raises:
    ValueError -- if model is neither one of {"nmf", "lda", "lsa"} nor an
    NMF, LatentDirichletAllocation, or TruncatedSVD instance
get_doc_topic_matrix(doc_term_matrix, *, normalize=True)

Transform a document-term matrix into a document-topic matrix, where rows
correspond to documents and columns to the topics in the topic model.

Parameters:
    * doc_term_matrix (array-like or sparse matrix) -- corpus represented
      as a document-term matrix with shape (n_docs, n_terms); LDA expects
      tf-weighting, while NMF and LSA may do better with tf-idf weighting
    * normalize (bool) -- if True, the values in each row are normalized,
      i.e. topic weights on each document sum to 1

Returns:
    Document-topic matrix with shape (n_docs, n_topics).

Return type:
    numpy.ndarray
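Under the hood this is essentially a transform plus an optional row normalization. A minimal sketch with a plain scikit-learn NMF model on toy data (illustrative only, not textacy's actual implementation):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term matrix (illustrative random data).
rng = np.random.default_rng(1)
doc_term_matrix = rng.random((5, 12))

model = NMF(n_components=3, init="random", random_state=1, max_iter=500)
model.fit(doc_term_matrix)

# Rough equivalent of get_doc_topic_matrix(doc_term_matrix, normalize=True):
# project documents into topic space, then scale each row to sum to 1.
doc_topic_matrix = model.transform(doc_term_matrix)
doc_topic_matrix = doc_topic_matrix / doc_topic_matrix.sum(axis=1, keepdims=True)

print(doc_topic_matrix.shape)  # (5, 3)
```

With normalize=True each row becomes a distribution over topics, which makes per-document topic weights directly comparable across documents.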
top_topic_terms(id2term, *, topics=-1, top_n=10, weights=False)

Get the top top_n terms by weight per topic in the model.

Parameters:
    * id2term (list(str) or dict) -- object that returns the term string
      corresponding to term id i through id2term[i]; could be a list of
      strings where the index represents the term id, such as that returned
      by sklearn.feature_extraction.text.CountVectorizer.get_feature_names(),
      or a mapping of term id: term string
    * topics (int or Sequence[int]) -- topic(s) for which to return top
      terms; if -1 (default), all topics' terms are returned
    * top_n (int) -- number of top terms to return per topic
    * weights (bool) -- if True, terms are returned with their corresponding
      topic weights; otherwise, terms are returned without weights

Yields:
    Tuple[int, Tuple[str]] or Tuple[int, Tuple[Tuple[str, float]]] -- next
    tuple corresponding to a topic; the first element is the topic's index;
    if weights is False, the second element is a tuple of str representing
    the top top_n related terms; otherwise, the second is a tuple of
    (str, float) pairs representing the top top_n related terms and their
    associated weights wrt the topic; for example:

    >>> list(TopicModel.top_topic_terms(id2term, topics=(0, 1), top_n=2, weights=False))
    [(0, ('foo', 'bar')), (1, ('bat', 'baz'))]
    >>> list(TopicModel.top_topic_terms(id2term, topics=0, top_n=2, weights=True))
    [(0, (('foo', 0.1415), ('bar', 0.0986)))]
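The top terms per topic come from the fitted model's topic-term (components_) matrix: per topic, sort the terms by weight and keep the largest. A sketch of that selection with a plain scikit-learn NMF model on toy data (illustrative only, not textacy's implementation):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy vocabulary and document-term counts (illustrative data).
id2term = ["apple", "banana", "cpu", "disk", "fruit", "gpu"]
doc_term_matrix = np.array([
    [3.0, 2.0, 0.0, 0.0, 4.0, 0.0],
    [2.0, 3.0, 0.0, 1.0, 3.0, 0.0],
    [0.0, 0.0, 4.0, 3.0, 0.0, 2.0],
    [0.0, 1.0, 3.0, 4.0, 0.0, 3.0],
])

model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
model.fit(doc_term_matrix)

# Rough equivalent of top_topic_terms(id2term, top_n=3, weights=False):
# per topic, take the term ids with the largest weights in components_.
top_n = 3
top_topic_terms = []
for topic_idx, component in enumerate(model.components_):
    top_ids = np.argsort(component)[::-1][:top_n]
    top_topic_terms.append((topic_idx, tuple(id2term[i] for i in top_ids)))
```

The same argsort-and-slice pattern underlies top_topic_docs and top_doc_topics below, applied to columns or rows of the document-topic matrix instead of components_.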
top_topic_docs(doc_topic_matrix, *, topics=-1, top_n=10, weights=False)

Get the top top_n docs by weight per topic in doc_topic_matrix.

Parameters:
    * doc_topic_matrix (numpy.ndarray) -- document-topic matrix with shape
      (n_docs, n_topics), the result of calling
      TopicModel.get_doc_topic_matrix()
    * topics (int or Sequence[int]) -- topic(s) for which to return top
      docs; if -1, all topics' docs are returned
    * top_n (int) -- number of top docs to return per topic
    * weights (bool) -- if True, docs are returned with their corresponding
      (normalized) topic weights; otherwise, docs are returned without
      weights

Yields:
    Tuple[int, Tuple[int]] or Tuple[int, Tuple[Tuple[int, float]]] -- next
    tuple corresponding to a topic; the first element is the topic's index;
    if weights is False, the second element is a tuple of ints representing
    the top top_n related docs; otherwise, the second is a tuple of
    (int, float) pairs representing the top top_n related docs and their
    associated weights wrt the topic; for example:

    >>> list(TopicModel.top_topic_docs(dtm, topics=(0, 1), top_n=2, weights=False))
    [(0, (4, 2)), (1, (1, 3))]
    >>> list(TopicModel.top_topic_docs(dtm, topics=0, top_n=2, weights=True))
    [(0, ((4, 0.3217), (2, 0.2154)))]
top_doc_topics(doc_topic_matrix, *, docs=-1, top_n=3, weights=False)

Get the top top_n topics by weight per doc for docs in doc_topic_matrix.

Parameters:
    * doc_topic_matrix (numpy.ndarray) -- document-topic matrix with shape
      (n_docs, n_topics), the result of calling
      TopicModel.get_doc_topic_matrix()
    * docs (int or Sequence[int]) -- doc(s) for which to return top topics;
      if -1, all docs' top topics are returned
    * top_n (int) -- number of top topics to return per doc
    * weights (bool) -- if True, topics are returned with their
      corresponding (normalized) weights; otherwise, topics are returned
      without weights

Yields:
    Tuple[int, Tuple[int]] or Tuple[int, Tuple[Tuple[int, float]]] -- next
    tuple corresponding to a doc; the first element is the doc's index; if
    weights is False, the second element is a tuple of ints representing
    the top top_n related topics; otherwise, the second is a tuple of
    (int, float) pairs representing the top top_n related topics and their
    associated weights wrt the doc; for example:

    >>> list(TopicModel.top_doc_topics(dtm, docs=(0, 1), top_n=2, weights=False))
    [(0, (1, 4)), (1, (3, 2))]
    >>> list(TopicModel.top_doc_topics(dtm, docs=0, top_n=2, weights=True))
    [(0, ((1, 0.2855), (4, 0.2412)))]
topic_weights(doc_topic_matrix)

Get the overall weight of topics across an entire corpus. Note: the values
depend on whether the topic weights per document in doc_topic_matrix were
normalized; either convention is reasonable, but be consistent about which
one you use.

Parameters:
    * doc_topic_matrix (numpy.ndarray) -- document-topic matrix with shape
      (n_docs, n_topics), the result of calling
      TopicModel.get_doc_topic_matrix()

Returns:
    The ith element is the ith topic's overall weight.

Return type:
    numpy.ndarray
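One natural way to aggregate per-document weights into corpus-level topic weights is to take each topic's share of the total mass in doc_topic_matrix. A sketch under that assumption (toy data; not necessarily textacy's exact formula):

```python
import numpy as np

# Toy document-topic matrix; each row was normalized to sum to 1.
doc_topic_matrix = np.array([
    [0.8, 0.2],
    [0.5, 0.5],
    [0.1, 0.9],
])

# Each topic's share of the total weight summed over all documents.
topic_weights = doc_topic_matrix.sum(axis=0) / doc_topic_matrix.sum()
print(topic_weights)  # topic 0: 1.4/3 ≈ 0.467, topic 1: 1.6/3 ≈ 0.533
```

When rows are normalized, the result is itself a distribution over topics and sums to 1; with unnormalized rows, long documents pull the weights toward their topics.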
termite_plot(doc_term_matrix, id2term, *, topics=-1, sort_topics_by='index', highlight_topics=None, n_terms=25, rank_terms_by='topic_weight', sort_terms_by='seriation', save=False, rc_params=None)

Make a "termite" plot for assessing topic models, using a tabular layout
to promote comparison of terms both within and across topics.

Parameters:
    * doc_term_matrix (numpy.ndarray or sparse matrix) -- corpus
      represented as a document-term matrix with shape (n_docs, n_terms);
      may have tf- or tf-idf weighting
    * id2term (List[str] or dict) -- object that returns the term string
      corresponding to term id i through id2term[i]; could be a list of
      strings where the index represents the term id, such as that returned
      by sklearn.feature_extraction.text.CountVectorizer.get_feature_names(),
      or a mapping of term id: term string
    * topics (int or Sequence[int]) -- topic(s) to include in the termite
      plot; if -1, all topics are included
    * sort_topics_by ({'index', 'weight'}) -- method used to horizontally
      sort topics in the plot
    * highlight_topics (int or Sequence[int]) -- indices for up to 6 topics
      to visually highlight in the plot with contrasting colors
    * n_terms (int) -- number of top terms to include in the termite plot
    * rank_terms_by ({'topic_weight', 'corpus_weight'}) -- value used to
      rank terms; the top-ranked n_terms are included in the plot
    * sort_terms_by ({'seriation', 'weight', 'index', 'alphabetical'}) --
      method used to vertically sort the selected top n_terms terms; the
      default ("seriation") groups similar terms together, which
      facilitates cross-topic assessment
    * save (str) -- give the full /path/to/fname on disk at which to save
      the figure
    * rc_params (dict, optional) -- parameters passed to rc_context in
      matplotlib.pyplot; details at
      https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.pyplot.rc_context.html

Returns:
    Axes on which the termite plot is drawn.

Return type:
    matplotlib.axes.Axes

Raises:
    ValueError -- if more than 6 topics are selected for highlighting, or
    an invalid value is passed for the sort_topics_by, rank_terms_by,
    and/or sort_terms_by params
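The tabular layout itself is simple to reproduce: a grid of circles whose areas encode term weights per topic. A minimal matplotlib sketch of the idea on toy data (textacy's viz.termite_plot adds topic sorting, seriation, and highlighting on top of this):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Toy term-by-topic weight matrix (n_terms x n_topics), illustrative only.
rng = np.random.default_rng(2)
term_topic_weights = rng.random((8, 3))
terms = [f"term{i}" for i in range(8)]

# Core idea of a termite plot: one circle per (term, topic) cell,
# with circle area proportional to the term's weight in that topic.
fig, ax = plt.subplots()
for topic_idx in range(term_topic_weights.shape[1]):
    ax.scatter(
        np.full(len(terms), topic_idx),            # one column per topic
        np.arange(len(terms)),                     # one row per term
        s=600 * term_topic_weights[:, topic_idx],  # area encodes weight
    )
ax.set_xticks(range(term_topic_weights.shape[1]))
ax.set_yticks(range(len(terms)))
ax.set_yticklabels(terms)
ax.set_xlabel("topic")
```

Because every term occupies the same row across all topic columns, a reader can scan horizontally to compare one term's weight across topics and vertically to read off a topic's dominant terms.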
References:
    Chuang, Jason, Christopher D. Manning, and Jeffrey Heer. "Termite:
    Visualization Techniques for Assessing Textual Topic Models."
    Proceedings of the International Working Conference on Advanced Visual
    Interfaces. ACM, 2012.

    For sorting by "seriation", see https://arxiv.org/abs/1406.5370

See also: viz.termite_plot

TODO: rank_terms_by other metrics, e.g. topic salience or relevance