Topic Modeling

textacy.tm.topic_model: Convenient and consolidated topic-modeling, built on scikit-learn.

class textacy.tm.topic_model.TopicModel(model, n_topics=10, **kwargs)[source]

Train and apply a topic model to vectorized texts using scikit-learn’s implementations of LSA, LDA, and NMF models, as well as any other topic-model implementation that has components_, n_topics, and transform attributes. Inspect and visualize results; save and load trained models to and from disk.

Prepare a vectorized corpus (i.e. document-term matrix) and corresponding vocabulary (i.e. mapping of term strings to column indices in the matrix). See textacy.representations.vectorizers.Vectorizer for details. In short:

>>> vectorizer = Vectorizer(
...     tf_type="linear", idf_type="smooth", norm="l2",
...     min_df=3, max_df=0.95, max_n_terms=100000)
>>> doc_term_matrix = vectorizer.fit_transform(terms_list)

Initialize and train a topic model:

>>> model = textacy.tm.TopicModel("nmf", n_topics=10)
>>> model.fit(doc_term_matrix)
>>> model
TopicModel(n_topics=10, model=NMF)

Transform the corpus and interpret our model:

>>> doc_topic_matrix = model.transform(doc_term_matrix)
>>> for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, topics=[0,1]):
...     print("topic", topic_idx, ":", "   ".join(top_terms))
topic 0 : people   american   go   year   work   think   $   today   money   america
topic 1 : rescind   quorum   order   unanimous   consent   ask   president   mr.   madam   absence
>>> for topic_idx, top_docs in model.top_topic_docs(doc_topic_matrix, topics=[0,1], top_n=2):
...     print(topic_idx)
...     for j in top_docs:
...         print(corpus[j]._.meta["title"])
0
THE MOST IMPORTANT ISSUES FACING THE AMERICAN PEOPLE
55TH ANNIVERSARY OF THE BATTLE OF CRETE
1
CHEMICAL WEAPONS CONVENTION
MFN STATUS FOR CHINA
>>> for doc_idx, topics in model.top_doc_topics(doc_topic_matrix, docs=range(5), top_n=2):
...     print(corpus[doc_idx]._.meta["title"], ":", topics)
JOIN THE SENATE AND PASS A CONTINUING RESOLUTION : (9, 0)
MEETING THE CHALLENGE : (2, 0)
DISPOSING OF SENATE AMENDMENT TO H.R. 1643, EXTENSION OF MOST-FAVORED- NATION TREATMENT FOR BULGARIA : (0, 9)
EXAMINING THE SPEAKER'S UPCOMING TRAVEL SCHEDULE : (0, 9)
FLOODING IN PENNSYLVANIA : (0, 9)
>>> for i, val in enumerate(model.topic_weights(doc_topic_matrix)):
...     print(i, val)
0 0.302796022302
1 0.0635617650602
2 0.0744927472417
3 0.0905778808867
4 0.0521162262192
5 0.0656303769725
6 0.0973516532757
7 0.112907245542
8 0.0680659204364
9 0.0725001620636

Visualize the model:

>>> model.termite_plot(doc_term_matrix, vectorizer.id_to_term,
...                    topics=-1, n_terms=25, sort_terms_by="seriation")

Persist our topic model to disk:

>>> model.save("nmf-10topics.pkl")

Parameters
  • model ({“nmf”, “lda”, “lsa”} or sklearn.decomposition.<model>) – name of a default topic model to initialize, or an already-initialized instance of sklearn.decomposition.NMF, LatentDirichletAllocation, or TruncatedSVD

  • n_topics (int) – number of topics in the model to be initialized

  • **kwargs – variety of parameters used to initialize the model; see individual sklearn pages for full details

Raises

ValueError – if model not in {"nmf", "lda", "lsa"} or is not an NMF, LatentDirichletAllocation, or TruncatedSVD instance
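As a sketch, the three model strings appear to correspond to the scikit-learn estimators named in the Raises clause above; the mapping below is an assumption based on that clause, not a copy of textacy’s internals:

```python
# Sketch of the scikit-learn estimators behind the "nmf", "lda", and "lsa"
# strings -- an assumption inferred from the Raises clause, not textacy code.
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD

models = {
    "nmf": NMF(n_components=10),
    "lda": LatentDirichletAllocation(n_components=10),
    "lsa": TruncatedSVD(n_components=10),
}
for name, est in models.items():
    # each estimator exposes components_ (after fitting) and transform()
    print(name, "->", type(est).__name__)
```

Passing an already-initialized estimator instead of a string lets you control any sklearn-specific keyword arguments directly.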

get_doc_topic_matrix(doc_term_matrix, *, normalize=True)[source]

Transform a document-term matrix into a document-topic matrix, where rows correspond to documents and columns to the topics in the topic model.

Parameters
  • doc_term_matrix (array-like or sparse matrix) – Corpus represented as a document-term matrix with shape (n_docs, n_terms). LDA expects tf-weighting, while NMF and LSA may do better with tfidf-weighting.

  • normalize (bool) – if True, the values in each row are normalized, i.e. topic weights on each document sum to 1

Returns

Document-topic matrix with shape (n_docs, n_topics).

Return type

numpy.ndarray
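The transform-plus-normalize step can be sketched with plain scikit-learn and numpy (this mirrors the described behavior, not textacy’s exact implementation):

```python
# Minimal sketch of a document-topic transform with row normalization,
# using scikit-learn's NMF directly rather than textacy.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
doc_term_matrix = rng.random((8, 20))          # 8 docs, 20 terms (toy data)

model = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
model.fit(doc_term_matrix)

doc_topic = model.transform(doc_term_matrix)   # shape (n_docs, n_topics)
# normalize=True: rescale so topic weights on each document sum to 1
doc_topic = doc_topic / doc_topic.sum(axis=1, keepdims=True)
print(doc_topic.shape)
```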

top_topic_terms(id2term, *, topics=-1, top_n=10, weights=False)[source]

Get the top top_n terms by weight per topic in model.

Parameters
  • id2term (list(str) or dict) – object that returns the term string corresponding to term id i through id2term[i]; could be a list of strings where the index represents the term id, such as that returned by sklearn.feature_extraction.text.CountVectorizer.get_feature_names(), or a mapping of term id: term string

  • topics (int or Sequence[int]) – topic(s) for which to return top terms; if -1 (default), all topics’ terms are returned

  • top_n (int) – number of top terms to return per topic

  • weights (bool) – if True, terms are returned with their corresponding topic weights; otherwise, terms are returned without weights

Yields

Tuple[int, Tuple[str]] or Tuple[int, Tuple[Tuple[str, float]]] – next tuple corresponding to a topic; the first element is the topic’s index; if weights is False, the second element is a tuple of str representing the top top_n related terms; otherwise, the second is a tuple of (str, float) pairs representing the top top_n related terms and their associated weights wrt the topic; for example:

>>> list(TopicModel.top_topic_terms(id2term, topics=(0, 1), top_n=2, weights=False))
[(0, ('foo', 'bar')), (1, ('bat', 'baz'))]
>>> list(TopicModel.top_topic_terms(id2term, topics=0, top_n=2, weights=True))
[(0, (('foo', 0.1415), ('bar', 0.0986)))]
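Conceptually, the top terms per topic come from ranking each row of the fitted model’s components_ matrix; a numpy-only sketch of that ranking (with hypothetical toy weights, not textacy’s code) looks like:

```python
# Sketch: rank terms per topic by weight from a components_-style matrix.
import numpy as np

components = np.array([
    [0.1, 0.9, 0.0, 0.5],   # topic 0 weights over 4 terms
    [0.7, 0.0, 0.6, 0.1],   # topic 1
])
id2term = ["foo", "bar", "bat", "baz"]

def top_topic_terms(components, id2term, top_n=2):
    for topic_idx, row in enumerate(components):
        # indices of the top_n largest weights, descending
        top_ids = np.argsort(row)[::-1][:top_n]
        yield topic_idx, tuple(id2term[i] for i in top_ids)

print(list(top_topic_terms(components, id2term)))
# -> [(0, ('bar', 'baz')), (1, ('foo', 'bat'))]
```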
top_topic_docs(doc_topic_matrix, *, topics=-1, top_n=10, weights=False)[source]

Get the top top_n docs by weight per topic in doc_topic_matrix.

Parameters
  • doc_topic_matrix (numpy.ndarray) – document-topic matrix with shape (n_docs, n_topics), the result of calling TopicModel.get_doc_topic_matrix()

  • topics (int or Sequence[int]) – topic(s) for which to return top docs; if -1, all topics’ docs are returned

  • top_n (int) – number of top docs to return per topic

  • weights (bool) – if True, docs are returned with their corresponding (normalized) topic weights; otherwise, docs are returned without weights

Yields

Tuple[int, Tuple[int]] or Tuple[int, Tuple[Tuple[int, float]]] – next tuple corresponding to a topic; the first element is the topic’s index; if weights is False, the second element is a tuple of ints representing the top top_n related docs; otherwise, the second is a tuple of (int, float) pairs representing the top top_n related docs and their associated weights wrt the topic; for example:

>>> list(TopicModel.top_topic_docs(dtm, topics=(0, 1), top_n=2, weights=False))
[(0, (4, 2)), (1, (1, 3))]
>>> list(TopicModel.top_topic_docs(dtm, topics=0, top_n=2, weights=True))
[(0, ((4, 0.3217), (2, 0.2154)))]
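The per-topic document ranking is the same argsort idea applied to the columns of the document-topic matrix; a hedged numpy sketch (toy values, not textacy internals):

```python
# Sketch: rank docs per topic by weight from a document-topic matrix.
import numpy as np

doc_topic = np.array([
    [0.1, 0.8],
    [0.4, 0.3],
    [0.9, 0.0],
    [0.2, 0.6],
])  # 4 docs x 2 topics (toy values)

def top_topic_docs(doc_topic, top_n=2):
    for topic_idx in range(doc_topic.shape[1]):
        col = doc_topic[:, topic_idx]
        # doc indices with the top_n largest weights, descending
        top_docs = np.argsort(col)[::-1][:top_n]
        yield topic_idx, tuple(int(i) for i in top_docs)

print(list(top_topic_docs(doc_topic)))
# -> [(0, (2, 1)), (1, (0, 3))]
```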
top_doc_topics(doc_topic_matrix, *, docs=-1, top_n=3, weights=False)[source]

Get the top top_n topics by weight per doc for docs in doc_topic_matrix.

Parameters
  • doc_topic_matrix (numpy.ndarray) – document-topic matrix with shape (n_docs, n_topics), the result of calling TopicModel.get_doc_topic_matrix()

  • docs (int or Sequence[int]) – docs for which to return top topics; if -1, all docs’ top topics are returned

  • top_n (int) – number of top topics to return per doc

  • weights (bool) – if True, docs are returned with their corresponding (normalized) topic weights; otherwise, docs are returned without weights

Yields

Tuple[int, Tuple[int]] or Tuple[int, Tuple[Tuple[int, float]]] – next tuple corresponding to a doc; the first element is the doc’s index; if weights is False, the second element is a tuple of ints representing the top top_n related topics; otherwise, the second is a tuple of (int, float) pairs representing the top top_n related topics and their associated weights wrt the doc; for example:

>>> list(TopicModel.top_doc_topics(dtm, docs=(0, 1), top_n=2, weights=False))
[(0, (1, 4)), (1, (3, 2))]
>>> list(TopicModel.top_doc_topics(dtm, docs=0, top_n=2, weights=True))
[(0, ((1, 0.2855), (4, 0.2412)))]
topic_weights(doc_topic_matrix)[source]

Get the overall weight of topics across an entire corpus. Note: Values depend on whether topic weights per document in doc_topic_matrix were normalized or not; both conventions can be reasonable, so check how the matrix was produced before interpreting these weights.

Parameters

doc_topic_matrix (numpy.ndarray) – document-topic matrix with shape (n_docs, n_topics), the result of calling TopicModel.get_doc_topic_matrix()

Returns

the ith element is the ith topic’s overall weight

Return type

numpy.ndarray
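Since the example output above sums to roughly 1, the overall weights appear to be normalized column sums of the document-topic matrix; a numpy sketch under that assumption:

```python
# Sketch: overall topic weight as the normalized column sum of a
# document-topic matrix (an assumption based on the example output above).
import numpy as np

doc_topic = np.array([
    [0.2, 0.8],
    [0.5, 0.5],
    [0.9, 0.1],
])  # 3 docs x 2 topics (toy values)

col_sums = doc_topic.sum(axis=0)
topic_weights = col_sums / col_sums.sum()
print(topic_weights)   # -> approximately [0.533, 0.467]
```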

termite_plot(doc_term_matrix, id2term, *, topics=-1, sort_topics_by='index', highlight_topics=None, n_terms=25, rank_terms_by='topic_weight', sort_terms_by='seriation', save=False, rc_params=None)[source]

Make a “termite” plot for assessing topic models using a tabular layout to promote comparison of terms both within and across topics.

Parameters
  • doc_term_matrix (numpy.ndarray or sparse matrix) – corpus represented as a document-term matrix with shape (n_docs, n_terms); may have tf- or tfidf-weighting

  • id2term (List[str] or dict) – object that returns the term string corresponding to term id i through id2term[i]; could be a list of strings where the index represents the term id, such as that returned by sklearn.feature_extraction.text.CountVectorizer.get_feature_names(), or a mapping of term id: term string

  • topics (int or Sequence[int]) – topic(s) to include in termite plot; if -1, all topics are included

  • sort_topics_by ({'index', 'weight'}) – value used to sort the selected topics from left to right in the plot

  • highlight_topics (int or Sequence[int]) – indices for up to 6 topics to visually highlight in the plot with contrasting colors

  • n_terms (int) – number of top terms to include in termite plot

  • rank_terms_by ({'topic_weight', 'corpus_weight'}) – value used to rank terms; the top-ranked n_terms are included in the plot

  • sort_terms_by ({'seriation', 'weight', 'index', 'alphabetical'}) – method used to vertically sort the selected top n_terms terms; the default (“seriation”) groups similar terms together, which facilitates cross-topic assessment

  • save (str) – give the full /path/to/fname on disk to save figure

  • rc_params (dict, optional) – allow passing parameters to rc_context in matplotlib.pyplot; details in https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.pyplot.rc_context.html

Returns

Axis on which termite plot is plotted.

Return type

matplotlib.axes.Axes

Raises

ValueError – if more than 6 topics are selected for highlighting, or an invalid value is passed for the sort_topics_by, rank_terms_by, and/or sort_terms_by params

References

  • Chuang, Jason, Christopher D. Manning, and Jeffrey Heer. “Termite: Visualization techniques for assessing textual topic models.” Proceedings of the International Working Conference on Advanced Visual Interfaces. ACM, 2012.

  • for sorting by “seriation”, see https://arxiv.org/abs/1406.5370

See also

viz.termite_plot

TODO: rank_terms_by other metrics, e.g. topic salience or relevance