Topic Modeling

textacy.tm.topic_model: Convenient and consolidated topic-modeling, built on scikit-learn.

class textacy.tm.topic_model.TopicModel(model: Literal[nmf, lda, lsa] | NMF | LatentDirichletAllocation | TruncatedSVD, n_topics: int = 10, **kwargs)[source]

Train and apply a topic model to vectorized texts using scikit-learn’s implementations of LSA, LDA, and NMF, or any other topic model implementation that exposes components_, n_topics, and transform attributes. Inspect and visualize results; save and load trained models to and from disk.

Prepare a vectorized corpus (i.e. document-term matrix) and corresponding vocabulary (i.e. mapping of term strings to column indices in the matrix). See textacy.representations.vectorizers.Vectorizer for details. In short:

>>> from textacy.representations.vectorizers import Vectorizer
>>> vectorizer = Vectorizer(
...     tf_type="linear", idf_type="smooth", norm="l2",
...     min_df=3, max_df=0.95, max_n_terms=100000)
>>> doc_term_matrix = vectorizer.fit_transform(terms_list)
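
Here, terms_list is an iterable of per-document term lists. As a minimal sketch (assuming corpus is a textacy Corpus, i.e. an iterable of spaCy Doc objects, and that lemmatized non-stopword tokens are the desired terms), it could be built like this:

>>> terms_list = (
...     [tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_punct]
...     for doc in corpus
... )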

Initialize and train a topic model:

>>> model = textacy.tm.TopicModel("nmf", n_topics=10)
>>> model.fit(doc_term_matrix)
>>> model
TopicModel(n_topics=10, model=NMF)

Transform the corpus and interpret our model:

>>> doc_topic_matrix = model.transform(doc_term_matrix)
>>> for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, topics=[0,1]):
...     print("topic", topic_idx, ":", "   ".join(top_terms))
topic 0 : people   american   go   year   work   think   $   today   money   america
topic 1 : rescind   quorum   order   unanimous   consent   ask   president   mr.   madam   absence
>>> for topic_idx, top_docs in model.top_topic_docs(doc_topic_matrix, topics=[0,1], top_n=2):
...     print(topic_idx)
...     for j in top_docs:
...         print(corpus[j]._.meta["title"])
0
THE MOST IMPORTANT ISSUES FACING THE AMERICAN PEOPLE
55TH ANNIVERSARY OF THE BATTLE OF CRETE
1
CHEMICAL WEAPONS CONVENTION
MFN STATUS FOR CHINA
>>> for doc_idx, topics in model.top_doc_topics(doc_topic_matrix, docs=range(5), top_n=2):
...     print(corpus[doc_idx]._.meta["title"], ":", topics)
JOIN THE SENATE AND PASS A CONTINUING RESOLUTION : (9, 0)
MEETING THE CHALLENGE : (2, 0)
DISPOSING OF SENATE AMENDMENT TO H.R. 1643, EXTENSION OF MOST-FAVORED- NATION TREATMENT FOR BULGARIA : (0, 9)
EXAMINING THE SPEAKER'S UPCOMING TRAVEL SCHEDULE : (0, 9)
FLOODING IN PENNSYLVANIA : (0, 9)
>>> for i, val in enumerate(model.topic_weights(doc_topic_matrix)):
...     print(i, val)
0 0.302796022302
1 0.0635617650602
2 0.0744927472417
3 0.0905778808867
4 0.0521162262192
5 0.0656303769725
6 0.0973516532757
7 0.112907245542
8 0.0680659204364
9 0.0725001620636

Visualize the model:

>>> model.termite_plot(doc_term_matrix, vectorizer.id_to_term,
...                    topics=-1, n_terms=25, sort_terms_by="seriation")

Persist our topic model to disk:

>>> model.save("nmf-10topics.pkl")
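
And load it back from disk later (a minimal sketch of the round trip via the load classmethod, using the same file path as above):

>>> model = textacy.tm.TopicModel.load("nmf-10topics.pkl")
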
Parameters
  • model – Name or instance of an sklearn decomposition model.

  • n_topics – Number of topics in the model to be initialized.

  • **kwargs – Additional keyword arguments used to initialize the model; see the individual sklearn pages for full details.

Raises

ValueError – if model not in {"nmf", "lda", "lsa"} or is not an NMF, LatentDirichletAllocation, or TruncatedSVD instance
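
Instead of a name string, a pre-configured sklearn estimator may be passed directly; a brief sketch (the NMF keyword arguments here are illustrative, not required):

>>> from sklearn.decomposition import NMF
>>> model = textacy.tm.TopicModel(NMF(n_components=10, init="nndsvd", random_state=42))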

get_doc_topic_matrix(doc_term_matrix, *, normalize: bool = True) → numpy.ndarray[source]

Transform a document-term matrix into a document-topic matrix, where rows correspond to documents and columns to the topics in the topic model.

Parameters
  • doc_term_matrix (array-like or sparse matrix) – Corpus represented as a document-term matrix with shape (n_docs, n_terms). LDA expects tf-weighting, while NMF and LSA may do better with tfidf-weighting.

  • normalize – If True, the values in each row are normalized, i.e. topic weights on each document sum to 1.

Returns

Document-topic matrix with shape (n_docs, n_topics).
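
For example, continuing the workflow above (a sketch; with normalize=True, each row of the result sums to approximately 1):

>>> doc_topic_matrix = model.get_doc_topic_matrix(doc_term_matrix, normalize=True)
>>> row_sums = doc_topic_matrix.sum(axis=1)  # ~1.0 per document when normalized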

top_topic_terms(id2term: Sequence[str] | Dict[int, str], *, topics: int | Sequence[int] = -1, top_n: int = 10, weights: bool = False) → Iterable[Tuple[int, Tuple[str, ...]]] | Iterable[Tuple[int, Tuple[Tuple[str, float], ...]]][source]

Get the top top_n terms by weight per topic in model.

Parameters
  • id2term – Object that returns the term string corresponding to term id i through id2term[i]; could be a list of strings where the index represents the term id, such as that returned by sklearn.feature_extraction.text.CountVectorizer.get_feature_names(), or a mapping of term id to term string.

  • topics – Topic(s) for which to return top terms; if -1 (default), all topics’ terms are returned.

  • top_n – Number of top terms to return per topic

  • weights – If True, terms are returned with their corresponding topic weights; otherwise, terms are returned without weights

Yields

Next tuple corresponding to a topic; the first element is the topic’s index; if weights is False, the second element is a tuple of str representing the top top_n related terms; otherwise, the second is a tuple of (str, float) pairs representing the top top_n related terms and their associated weights wrt the topic; for example:

>>> list(TopicModel.top_topic_terms(id2term, topics=(0, 1), top_n=2, weights=False))
[(0, ('foo', 'bar')), (1, ('bat', 'baz'))]
>>> list(TopicModel.top_topic_terms(id2term, topics=0, top_n=2, weights=True))
[(0, (('foo', 0.1415), ('bar', 0.0986)))]

top_topic_docs(doc_topic_matrix: np.ndarray, *, topics: int | Sequence[int] = -1, top_n: int = 10, weights: bool = False) → Iterable[Tuple[int, Tuple[int, ...]]] | Iterable[Tuple[int, Tuple[Tuple[int, float], ...]]][source]

Get the top top_n docs by weight per topic in doc_topic_matrix.

Parameters
  • doc_topic_matrix – Document-topic matrix with shape (n_docs, n_topics), the result of calling TopicModel.get_doc_topic_matrix()

  • topics – Topic(s) for which to return top docs; if -1, all topics’ docs are returned.

  • top_n – Number of top docs to return per topic.

  • weights – If True, docs are returned with their corresponding (normalized) topic weights; otherwise, docs are returned without weights.

Yields

Next tuple corresponding to a topic; the first element is the topic’s index; if weights is False, the second element is a tuple of ints representing the top top_n related docs; otherwise, the second is a tuple of (int, float) pairs representing the top top_n related docs and their associated weights wrt the topic; for example:

>>> list(TopicModel.top_topic_docs(dtm, topics=(0, 1), top_n=2, weights=False))
[(0, (4, 2)), (1, (1, 3))]
>>> list(TopicModel.top_topic_docs(dtm, topics=0, top_n=2, weights=True))
[(0, ((4, 0.3217), (2, 0.2154)))]

top_doc_topics(doc_topic_matrix: np.ndarray, *, docs: int | Sequence[int] = -1, top_n: int = 3, weights: bool = False) → Iterable[Tuple[int, Tuple[int, ...]]] | Iterable[Tuple[int, Tuple[Tuple[int, float], ...]]][source]

Get the top top_n topics by weight per doc for docs in doc_topic_matrix.

Parameters
  • doc_topic_matrix – Document-topic matrix with shape (n_docs, n_topics), the result of calling TopicModel.get_doc_topic_matrix()

  • docs – Docs for which to return top topics; if -1, all docs’ top topics are returned.

  • top_n – Number of top topics to return per doc.

  • weights – If True, docs are returned with their corresponding topic weights; otherwise, docs are returned without weights.

Yields

Next tuple corresponding to a doc; the first element is the doc’s index; if weights is False, the second element is a tuple of ints representing the top top_n related topics; otherwise, the second is a tuple of (int, float) pairs representing the top top_n related topics and their associated weights wrt the doc; for example:

>>> list(TopicModel.top_doc_topics(dtm, docs=(0, 1), top_n=2, weights=False))
[(0, (1, 4)), (1, (3, 2))]
>>> list(TopicModel.top_doc_topics(dtm, docs=0, top_n=2, weights=True))
[(0, ((1, 0.2855), (4, 0.2412)))]

topic_weights(doc_topic_matrix: numpy.ndarray) → numpy.ndarray[source]

Get the overall weight of topics across an entire corpus. Note: the values depend on whether the per-document topic weights in doc_topic_matrix were normalized or not; either form is a reasonable measure of overall topic weight.

Parameters

doc_topic_matrix – Document-topic matrix with shape (n_docs, n_topics), the result of calling TopicModel.get_doc_topic_matrix()

Returns

Array, where the ith element is the ith topic’s overall weight.
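
For example, the overall weights can be used to rank topics by prevalence (a sketch continuing the workflow above):

>>> weights = model.topic_weights(doc_topic_matrix)
>>> ranked = weights.argsort()[::-1]  # topic indices, most to least prevalent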

termite_plot(doc_term_matrix: np.ndarray | sp.csr_matrix, id2term: List[str] | Dict[int, str], *, topics: int | Sequence[int] = -1, sort_topics_by: Literal[index, weight] = 'index', highlight_topics: Optional[int | Sequence[int]] = None, n_terms: int = 25, rank_terms_by: Literal[topic_weight, corpus_weight] = 'topic_weight', sort_terms_by: Literal[seriation, weight, index, alphabetical] = 'seriation', save: Optional[str] = None, rc_params: Optional[dict] = None)[source]

Make a “termite” plot for assessing topic models using a tabular layout to promote comparison of terms both within and across topics.

Parameters
  • doc_term_matrix – Corpus represented as a document-term matrix with shape (n_docs, n_terms); may have tf- or tfidf-weighting.

  • id2term – Object that returns the term string corresponding to term id i through id2term[i]. Could be a list of strings where the index represents the term id, such as that returned by sklearn.feature_extraction.text.CountVectorizer.get_feature_names(), or a mapping of term id to term string.

  • topics – Topic(s) to include in termite plot; if -1, all topics are included.

  • sort_topics_by – Measure used to sort the topics included in the plot: by topic index (default) or by overall topic weight.

  • highlight_topics – Indices for up to 6 topics to visually highlight in the plot with contrasting colors.

  • n_terms – Number of top terms to include in termite plot.

  • rank_terms_by – Value used to rank terms; the top-ranked n_terms are included in the plot.

  • sort_terms_by – Method used to vertically sort the selected top n_terms terms; the default (“seriation”) groups similar terms together, which facilitates cross-topic assessment.

  • save – The full /path/to/fname on disk to save figure, or None.

  • rc_params – Parameters to pass to matplotlib's rc_context when drawing the plot; see https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.pyplot.rc_context.html for details.

Returns

Axis on which termite plot is plotted.

Return type

matplotlib.axes.Axes

Raises

ValueError – if more than 6 topics are selected for highlighting, or an invalid value is passed for the sort_topics_by, rank_terms_by, and/or sort_terms_by params
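
A usage sketch (the highlighted topics and output path here are illustrative):

>>> model.termite_plot(
...     doc_term_matrix, vectorizer.id_to_term,
...     topics=-1, n_terms=25, highlight_topics=[0, 1],
...     save="termite_plot.png")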

References

  • Chuang, Jason, Christopher D. Manning, and Jeffrey Heer. “Termite: Visualization techniques for assessing textual topic models.” Proceedings of the International Working Conference on Advanced Visual Interfaces. ACM, 2012.

  • For sorting by “seriation”, see https://arxiv.org/abs/1406.5370

See also

viz.termite_plot

TODO: rank_terms_by other metrics, e.g. topic salience or relevance