Document Representations

network.build_cooccurrence_network

Transform an ordered sequence of strings (or a sequence of such sequences) into a graph, where each string is represented by a node with weighted edges linking it to other strings that co-occur within window_size elements of itself.

network.build_similarity_network

Transform a sequence of strings (or a sequence of such sequences) into a graph, where each element of the top-level sequence is represented by a node with edges linking it to all other elements weighted by their pairwise similarity.

sparse_vec.build_doc_term_matrix

Transform one or more tokenized documents into a document-term matrix of shape (# docs, # unique terms), with flexible weighting/normalization of values.

sparse_vec.build_grp_term_matrix

Transform one or more tokenized documents into a group-term matrix of shape (# unique groups, # unique terms), with flexible weighting/normalization of values.

vectorizers.Vectorizer

Transform one or more tokenized documents into a sparse document-term matrix of shape (# docs, # unique terms), with flexible weighting/normalization of values.

vectorizers.GroupVectorizer

Transform one or more tokenized documents into a group-term matrix of shape (# groups, # unique terms), with tf-, tf-idf, or binary-weighted values.

Networks

textacy.representations.network: Represent document data as networks, where nodes are terms, sentences, or even full documents and edges between them are weighted by the strength of their co-occurrence or similarity.

textacy.representations.network.build_cooccurrence_network(data: Sequence[str] | Sequence[Sequence[str]], *, window_size: int = 2, edge_weighting: str = 'count') → nx.Graph[source]

Transform an ordered sequence of strings (or a sequence of such sequences) into a graph, where each string is represented by a node with weighted edges linking it to other strings that co-occur within window_size elements of itself.

Input data can take a variety of forms. For example, as a Sequence[str] where elements are token or term strings from a single document:

>>> texts = [
...     "Mary had a little lamb. Its fleece was white as snow.",
...     "Everywhere that Mary went the lamb was sure to go.",
... ]
>>> docs = [make_spacy_doc(text, lang="en_core_web_sm") for text in texts]
>>> data = [tok.text for tok in docs[0]]
>>> graph = build_cooccurrence_network(data, window_size=2)
>>> sorted(graph.adjacency())[0]
('.', {'lamb': {'weight': 1}, 'Its': {'weight': 1}, 'snow': {'weight': 1}})

Or as a Sequence[Sequence[str]], where elements are token or term strings per sentence from a single document:

>>> data = [[tok.text for tok in sent] for sent in docs[0].sents]
>>> graph = build_cooccurrence_network(data, window_size=2)
>>> sorted(graph.adjacency())[0]
('.', {'lamb': {'weight': 1}, 'snow': {'weight': 1}})

Or as a Sequence[Sequence[str]], where elements are token or term strings per document from multiple documents:

>>> data = [[tok.text for tok in doc] for doc in docs]
>>> graph = build_cooccurrence_network(data, window_size=2)
>>> sorted(graph.adjacency())[0]
('.',
 {'lamb': {'weight': 1},
  'Its': {'weight': 1},
  'snow': {'weight': 1},
  'go': {'weight': 1}})

Note how the “.” token’s connections to other nodes change for each case. (Note that in real usage, you’ll probably want to remove stopwords, punctuation, etc. so that nodes in the graph represent meaningful concepts.)
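
For instance, here is a minimal sketch of that kind of preprocessing, using textacy's word extractor to drop stopwords and punctuation before building the network (the lowercased-lemma normalization is an illustrative choice, not a requirement):

from textacy import make_spacy_doc
from textacy.extract import words
from textacy.representations.network import build_cooccurrence_network

doc = make_spacy_doc(
    "Mary had a little lamb. Its fleece was white as snow.", lang="en_core_web_sm")
# keep content words only, normalized to lowercased lemmas
terms = [tok.lemma_.lower() for tok in words(doc, filter_stops=True, filter_punct=True)]
graph = build_cooccurrence_network(terms, window_size=2, edge_weighting="count")
print(sorted(graph.nodes()))  # concept-like nodes only, e.g. no "." or "a"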

Parameters
  • data – Ordered sequence of strings, or a sequence of such sequences, whose elements become nodes in the graph.

  • window_size

    Size of sliding window over data that determines which strings are said to co-occur. For example, a value of 2 means that only immediately adjacent strings will have edges in the network; larger values loosen the definition of co-occurrence and typically lead to a more densely-connected network.

    Note

    Co-occurrence windows are not permitted to cross sequences. So, if data is a Sequence[Sequence[str]], then co-occurrence counts are computed separately for each sub-sequence, then summed together.

  • edge_weighting – Method by which edges between nodes are weighted. If “count”, nodes are connected by edges with weights equal to the number of times they co-occurred within a sliding window; if “binary”, all such edges have weight set equal to 1.

Returns

Graph whose nodes correspond to individual strings from data; those that co-occur are connected by edges with weights determined by edge_weighting.

Reference:

https://en.wikipedia.org/wiki/Co-occurrence_network

textacy.representations.network.build_similarity_network(data: Sequence[str] | Sequence[Sequence[str]], edge_weighting: str) → nx.Graph[source]

Transform a sequence of strings (or a sequence of such sequences) into a graph, where each element of the top-level sequence is represented by a node with edges linking it to all other elements weighted by their pairwise similarity.

Input data can take a variety of forms. For example, as a Sequence[str] where elements are sentence texts from a single document:

>>> texts = [
...     "Mary had a little lamb. Its fleece was white as snow.",
...     "Everywhere that Mary went the lamb was sure to go.",
... ]
>>> docs = [make_spacy_doc(text, lang="en_core_web_sm") for text in texts]
>>> data = [sent.text.lower() for sent in docs[0].sents]
>>> graph = build_similarity_network(data, "levenshtein")
>>> sorted(graph.adjacency())[0]
('its fleece was white as snow.',
 {'mary had a little lamb.': {'weight': 0.24137931034482762}})

Or as a Sequence[str] where elements are full texts from multiple documents:

>>> data = [doc.text.lower() for doc in docs]
>>> graph = build_similarity_network(data, "jaro")
>>> sorted(graph.adjacency())[0]
('everywhere that mary went the lamb was sure to go.',
 {'mary had a little lamb. its fleece was white as snow.': {'weight': 0.6516002795248078}})

Or as a Sequence[Sequence[str]] where elements are tokenized texts from multiple documents:

>>> data = [[tok.lower_ for tok in doc] for doc in docs]
>>> graph = build_similarity_network(data, "jaccard")
>>> sorted(graph.adjacency())[0]
(('everywhere', 'that', 'mary', 'went', 'the', 'lamb', 'was', 'sure', 'to', 'go', '.'),
 {('mary', 'had', 'a', 'little', 'lamb', '.', 'its', 'fleece', 'was', 'white', 'as', 'snow', '.'): {'weight': 0.21052631578947367}})

Parameters
  • data – Sequence of strings, or a sequence of such sequences, whose top-level elements become nodes in the graph.

  • edge_weighting

    Similarity metric to use for weighting edges between elements in data, represented as the name of a function available in textacy.similarity.

    Note

    Different metrics are suited for different forms and contexts of data. You’ll have to decide which method makes sense. For example, when comparing a sequence of short strings, “levenshtein” is often a reasonable bet; when comparing a sequence of sequences of somewhat noisy strings (e.g. includes punctuation, cruft tokens), you might try “matching_subsequences_ratio” to help filter out the noise.

Returns

Graph whose nodes correspond to top-level sequence elements in data, connected by edges to all other nodes with weights determined by their pairwise similarity.

Reference:

https://en.wikipedia.org/wiki/Semantic_similarity_network – this is not the same as what’s implemented here, but they’re similar in spirit.
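
Building on the note above about metric choice, here is a small sketch that compares two tokenized texts containing cruft tokens using the "matching_subsequences_ratio" metric mentioned in the edge_weighting docs (the toy token lists are illustrative only):

from textacy.representations.network import build_similarity_network

# tokenized texts with some punctuation/cruft left in
data = [
    ["mary", "had", "a", "little", "lamb", "!", "!"],
    ["everywhere", "that", "mary", "went", ",", "the", "lamb", "followed", "."],
]
graph = build_similarity_network(data, "matching_subsequences_ratio")
# each node is a tuple of tokens; edge weights hold the pairwise similarity
for _, _, attrs in graph.edges(data=True):
    print(attrs["weight"])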

textacy.representations.network.rank_nodes_by_pagerank(graph: networkx.classes.graph.Graph, weight: str = 'weight', **kwargs) → Dict[Any, float][source]

Rank nodes in graph using the PageRank algorithm.

Parameters
  • graph

  • weight – Key in edge data that holds weights.

  • **kwargs

Returns

Mapping of node object to PageRank score.
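
A hedged sketch of one common use: rank sentences by centrality in a similarity network, TextRank-style (the sentences and the "levenshtein" metric are illustrative choices):

from textacy.representations.network import (
    build_similarity_network, rank_nodes_by_pagerank)

sents = [
    "mary had a little lamb.",
    "its fleece was white as snow.",
    "everywhere that mary went the lamb was sure to go.",
]
graph = build_similarity_network(sents, "levenshtein")
ranks = rank_nodes_by_pagerank(graph, weight="weight")
# highest-scoring sentences are the most "central" ones
for sent, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.3f}  {sent}")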

textacy.representations.network.rank_nodes_by_bestcoverage(graph: networkx.classes.graph.Graph, k: int, c: int = 1, alpha: float = 1.0, weight: str = 'weight') → Dict[Any, float][source]

Rank nodes in a network using the BestCoverage algorithm that attempts to balance between node centrality and diversity.

Parameters
  • graph

  • k – Number of results to return for top-k search.

  • c – l parameter for l-step expansion; best if 1 or 2

  • alpha – Float in [0.0, 1.0] specifying how much of central vertex’s score to remove from its l-step neighbors; smaller value puts more emphasis on centrality, larger value puts more emphasis on diversity

  • weight – Key in edge data that holds weights.

Returns

Top k nodes as ranked by the BestCoverage algorithm; keys are node identifiers, values are their corresponding ranking scores.

References

Küçüktunç, O., Saule, E., Kaya, K., & Çatalyürek, Ü. V. (2013, May). Diversified recommendation on graphs: pitfalls, measures, and algorithms. In Proceedings of the 22nd international conference on World Wide Web (pp. 715-726). International World Wide Web Conferences Steering Committee. http://www2013.wwwconference.org/proceedings/p715.pdf
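
A minimal sketch of diversified keyterm selection with BestCoverage over a toy co-occurrence graph (the term list is illustrative only):

from textacy.representations.network import (
    build_cooccurrence_network, rank_nodes_by_bestcoverage)

terms = ["mary", "lamb", "fleece", "white", "snow", "mary", "lamb", "go"]
graph = build_cooccurrence_network(terms, window_size=2)
# keep the 3 top-ranked nodes, trading off centrality against diversity
top_terms = rank_nodes_by_bestcoverage(graph, k=3, c=1, alpha=1.0)
print(top_terms)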

textacy.representations.network.rank_nodes_by_divrank(graph: networkx.classes.graph.Graph, r: Optional[numpy.ndarray] = None, lambda_: float = 0.5, alpha: float = 0.5) → Dict[str, float][source]

Rank nodes in a network using the DivRank algorithm that attempts to balance between node centrality and diversity.

Parameters
  • graph

  • r – The “personalization vector”; by default, r = ones(1, n)/n

  • lambda_ – Float in [0.0, 1.0]

  • alpha – Float in [0.0, 1.0] that controls the strength of self-links.

Returns

Mapping of node to score, ordered by descending DivRank score.

References

Mei, Q., Guo, J., & Radev, D. (2010, July). Divrank: the interplay of prestige and diversity in information networks. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1009-1018). ACM. http://clair.si.umich.edu/~radev/papers/SIGKDD2010.pdf
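
And a corresponding sketch for DivRank on the same kind of toy graph, keeping the default lambda_ and alpha values explicit:

from textacy.representations.network import (
    build_cooccurrence_network, rank_nodes_by_divrank)

terms = ["mary", "lamb", "fleece", "white", "snow", "mary", "lamb", "go"]
graph = build_cooccurrence_network(terms, window_size=2)
scores = rank_nodes_by_divrank(graph, lambda_=0.5, alpha=0.5)
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3])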

Sparse Vectors

textacy.representations.sparse_vec: Transform a collection of tokenized docs into a doc-term matrix of shape (# docs, # unique terms) or a group-term matrix of shape (# unique groups, # unique terms), with various ways to filter/limit included terms and flexible weighting/normalization schemes for their values.

Intended primarily as a simpler, higher-level API for sparse vectorization of docs.

textacy.representations.sparse_vec.build_doc_term_matrix(tokenized_docs: Iterable[Iterable[str]], *, tf_type: str = 'linear', idf_type: Optional[str] = None, dl_type: Optional[str] = None, **kwargs) → Tuple[scipy.sparse.csr.csr_matrix, Dict[str, int]][source]

Transform one or more tokenized documents into a document-term matrix of shape (# docs, # unique terms), with flexible weighting/normalization of values.

Parameters
  • tokenized_docs

    A sequence of tokenized documents, where each is a sequence of term strings. For example:

    >>> ([tok.lemma_ for tok in spacy_doc]
    ...  for spacy_doc in spacy_docs)
    >>> ((ne.text for ne in extract.entities(doc))
    ...  for doc in corpus)
    

  • tf_type

    Type of term frequency (tf) to use for weights’ local component:

    • ”linear”: tf (tfs are already linear, so left as-is)

    • ”sqrt”: tf => sqrt(tf)

    • ”log”: tf => log(tf) + 1

    • ”binary”: tf => 1

  • idf_type

    Type of inverse document frequency (idf) to use for weights’ global component:

    • ”standard”: idf = log(n_docs / df) + 1.0

    • ”smooth”: idf = log((n_docs + 1) / (df + 1)) + 1.0, i.e. 1 is added to all document frequencies, as if a single document containing every unique term was added to the corpus.

    • ”bm25”: idf = log((n_docs - df + 0.5) / (df + 0.5)), which is a form commonly used in information retrieval that allows for very common terms to receive negative weights.

    • None: no global weighting is applied to local term weights.

  • dl_type

    Type of document-length scaling to use for weights’ normalization component:

    • ”linear”: dl (dls are already linear, so left as-is)

    • ”sqrt”: dl => sqrt(dl)

    • ”log”: dl => log(dl)

    • None: no document-length normalization is applied to the local (or local * global) weights

  • **kwargs – Passed directly into the underlying Vectorizer class

Returns

Document-term matrix as a sparse row matrix, and the corresponding mapping of term strings to integer ids (column indexes).

Note

If you need to transform other sequences of tokenized documents in the same way, or if you need more access to the underlying vectorization process, consider using textacy.representations.vectorizers.Vectorizer directly.

Reference:

https://en.wikipedia.org/wiki/Document-term_matrix
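
As a quick sketch of the end-to-end call (the toy token lists stand in for real tokenized docs):

from textacy.representations.sparse_vec import build_doc_term_matrix

tokenized_docs = [
    ["mary", "have", "little", "lamb"],
    ["fleece", "white", "snow"],
    ["mary", "go", "lamb", "go"],
]
dtm, vocab = build_doc_term_matrix(
    tokenized_docs, tf_type="linear", idf_type="smooth", dl_type=None)
print(dtm.shape)                     # (3, # unique terms)
print(sorted(vocab, key=vocab.get))  # terms in column order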

textacy.representations.sparse_vec.build_grp_term_matrix(tokenized_docs: Iterable[Iterable[str]], grps: Iterable[str], *, tf_type: str = 'linear', idf_type: Optional[str] = None, dl_type: Optional[str] = None, **kwargs) → Tuple[scipy.sparse.csr.csr_matrix, Dict[str, int], Dict[str, int]][source]

Transform one or more tokenized documents into a group-term matrix of shape (# unique groups, # unique terms), with flexible weighting/normalization of values.

This is an extension of typical document-term matrix vectorization, where terms are grouped by the documents in which they co-occur. It allows for customized grouping, such as by a shared author or publication year, that may span multiple documents, without forcing users to merge those documents themselves.

Parameters
  • tokenized_docs

    A sequence of tokenized documents, where each is a sequence of term strings. For example:

    >>> ([tok.lemma_ for tok in spacy_doc]
    ...  for spacy_doc in spacy_docs)
    >>> ((ne.text for ne in extract.entities(doc))
    ...  for doc in corpus)
    

  • grps – Sequence of group names by which the terms in tokenized_docs are aggregated, where the first item in grps corresponds to the first item in tokenized_docs, and so on.

  • tf_type

    Type of term frequency (tf) to use for weights’ local component:

    • ”linear”: tf (tfs are already linear, so left as-is)

    • ”sqrt”: tf => sqrt(tf)

    • ”log”: tf => log(tf) + 1

    • ”binary”: tf => 1

  • idf_type

    Type of inverse document frequency (idf) to use for weights’ global component:

    • ”standard”: idf = log(n_docs / df) + 1.0

    • ”smooth”: idf = log((n_docs + 1) / (df + 1)) + 1.0, i.e. 1 is added to all document frequencies, as if a single document containing every unique term was added to the corpus.

    • ”bm25”: idf = log((n_docs - df + 0.5) / (df + 0.5)), which is a form commonly used in information retrieval that allows for very common terms to receive negative weights.

    • None: no global weighting is applied to local term weights.

  • dl_type

    Type of document-length scaling to use for weights’ normalization component:

    • ”linear”: dl (dls are already linear, so left as-is)

    • ”sqrt”: dl => sqrt(dl)

    • ”log”: dl => log(dl)

    • None: no document-length normalization is applied to the local (or local * global) weights

  • **kwargs – Passed directly into the underlying GroupVectorizer class

Returns

Group-term matrix as a sparse row matrix, and the corresponding mapping of term strings to integer ids (column indexes), and the corresponding mapping of group strings to integer ids (row indexes).

Note

If you need to transform other sequences of tokenized documents in the same way, or if you need more access to the underlying vectorization process, consider using textacy.representations.vectorizers.GroupVectorizer directly.

Reference:

https://en.wikipedia.org/wiki/Document-term_matrix
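
A similar sketch for the grouped variant, where toy docs are aggregated by an illustrative "speaker" label:

from textacy.representations.sparse_vec import build_grp_term_matrix

tokenized_docs = [
    ["mary", "have", "little", "lamb"],
    ["fleece", "white", "snow"],
    ["lamb", "sure", "go"],
]
grps = ["alice", "bob", "alice"]  # doc i belongs to group grps[i]
gtm, term_to_id, grp_to_id = build_grp_term_matrix(
    tokenized_docs, grps, tf_type="linear", idf_type=None)
print(gtm.shape)   # (# unique groups, # unique terms)
print(grp_to_id)   # mapping of group name to row index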

Vectorizers

textacy.representations.vectorizers: Transform a collection of tokenized docs into a doc-term matrix of shape (# docs, # unique terms), with various ways to filter or limit included terms and flexible weighting schemes for their values.

A second option aggregates terms in tokenized documents by provided group labels, resulting in a “group-term-matrix” of shape (# unique groups, # unique terms), with filtering and weighting functionality as described above.

See the Vectorizer and GroupVectorizer docstrings for usage examples and explanations of the various weighting schemes.

class textacy.representations.vectorizers.Vectorizer(*, tf_type: str = 'linear', idf_type: Optional[str] = None, dl_type: Optional[str] = None, norm: Optional[str] = None, min_df: int | float = 1, max_df: int | float = 1.0, max_n_terms: Optional[int] = None, vocabulary_terms: Optional[Dict[str, int] | Iterable[str]] = None)[source]

Transform one or more tokenized documents into a sparse document-term matrix of shape (# docs, # unique terms), with flexible weighting/normalization of values.

Stream a corpus with metadata from disk:

>>> ds = textacy.datasets.CapitolWords()
>>> records = ds.records(limit=1000)
>>> corpus = textacy.Corpus("en_core_web_sm", data=records)
>>> print(corpus)
Corpus(1000 docs, 538397 tokens)

Tokenize and vectorize the first 600 documents of this corpus:

>>> tokenized_docs = (
...     (term.lemma_ for term in textacy.extract.terms(doc, ngs=1, ents=True))
...     for doc in corpus[:600])
>>> vectorizer = Vectorizer(
...     tf_type="linear", idf_type="smooth", norm="l2",
...     min_df=3, max_df=0.95)
>>> doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
>>> doc_term_matrix
<600x4412 sparse matrix of type '<class 'numpy.float64'>'
        with 65210 stored elements in Compressed Sparse Row format>

Tokenize and vectorize the remaining 400 documents of the corpus, using only the groups, terms, and weights learned in the previous step:

>>> tokenized_docs = (
...     (term.lemma_ for term in textacy.extract.terms(doc, ngs=1, ents=True))
...     for doc in corpus[600:])
>>> doc_term_matrix = vectorizer.transform(tokenized_docs)
>>> doc_term_matrix
<400x4412 sparse matrix of type '<class 'numpy.float64'>'
        with 36212 stored elements in Compressed Sparse Row format>

Inspect the terms associated with columns; they’re sorted alphabetically:

>>> vectorizer.terms_list[:5]
['', '$', '$ 1 million', '$ 1.2 billion', '$ 10 billion']

(Btw: That empty string shouldn’t be there. Somehow, spaCy is labeling it as a named entity…?)

If known in advance, limit the terms included in vectorized outputs to a particular set of values:

>>> tokenized_docs = (
...     (term.lemma_ for term in textacy.extract.terms(doc, ngs=1, ents=True))
...     for doc in corpus[:600])
>>> vectorizer = Vectorizer(
...     idf_type="smooth", norm="l2",
...     min_df=3, max_df=0.95,
...     vocabulary_terms=["president", "bill", "unanimous", "distinguished", "american"])
>>> doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
>>> doc_term_matrix
<600x5 sparse matrix of type '<class 'numpy.float64'>'
        with 516 stored elements in Compressed Sparse Row format>
>>> vectorizer.terms_list
['american', 'bill', 'distinguished', 'president', 'unanimous']

Specify different weighting schemes to determine values in the matrix, adding or customizing individual components, as desired:

>>> tokenized_docs = [
...     [term.lemma_ for term in textacy.extract.terms(doc, ngs=1, ents=True)]
...     for doc in corpus[:600]]
>>> vectorizer = Vectorizer(
...     tf_type="linear", norm=None, min_df=3, max_df=0.95)
>>> doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
>>> print(doc_term_matrix[:8, vectorizer.vocabulary_terms["$"]].toarray())
[[0]
 [0]
 [1]
 [4]
 [0]
 [0]
 [2]
 [4]]
>>> vectorizer = Vectorizer(
...     tf_type="sqrt", dl_type="sqrt", norm=None, min_df=3, max_df=0.95)
>>> doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
>>> print(doc_term_matrix[:8, vectorizer.vocabulary_terms["$"]].toarray())
[[0.        ]
 [0.        ]
 [0.10660036]
 [0.2773501 ]
 [0.        ]
 [0.        ]
 [0.11704115]
 [0.24806947]]
>>> vectorizer = Vectorizer(
...     tf_type="bm25", idf_type="smooth", norm=None, min_df=3, max_df=0.95)
>>> doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
>>> print(doc_term_matrix[:8, vectorizer.vocabulary_terms["$"]].toarray())
[[0.        ]
 [0.        ]
 [2.68009606]
 [4.97732126]
 [0.        ]
 [0.        ]
 [3.87124987]
 [4.97732126]]

If you’re not sure what’s going on mathematically, Vectorizer.weighting gives the formula being used to calculate weights, based on the parameters set when initializing the vectorizer:

>>> vectorizer.weighting
'(tf * (k + 1)) / (k + tf) * log((n_docs + 1) / (df + 1)) + 1'

In general, weights may consist of a local component (term frequency), a global component (inverse document frequency), and a normalization component (document length). Individual components may be modified: they may have different scaling (e.g. tf vs. sqrt(tf)) or different behaviors (e.g. “standard” idf vs bm25’s version). There are many possible weightings, and some may be better for particular use cases than others. When in doubt, though, just go with something standard.

  • “tf”: Weights are simply the absolute per-document term frequencies (tfs), i.e. value (i, j) in an output doc-term matrix corresponds to the number of occurrences of term j in doc i. Terms appearing many times in a given doc receive higher weights than less common terms. Params: tf_type="linear", apply_idf=False, apply_dl=False

  • “tfidf”: Doc-specific, local tfs are multiplied by their corpus-wide, global inverse document frequencies (idfs). Terms appearing in many docs have higher document frequencies (dfs), correspondingly smaller idfs, and in turn, lower weights. Params: tf_type="linear", apply_idf=True, idf_type="smooth", apply_dl=False

  • “bm25”: This scheme includes a local tf component that increases asymptotically, so higher tfs have diminishing effects on the overall weight; a global idf component that can go negative for terms that appear in a sufficiently high proportion of docs; as well as a row-wise normalization that accounts for document length, such that terms in shorter docs hit the tf asymptote sooner than those in longer docs. Params: tf_type="bm25", apply_idf=True, idf_type="bm25", apply_dl=True

  • “binary”: This weighting scheme simply replaces all non-zero tfs with 1, indicating the presence or absence of a term in a particular doc. That’s it. Params: tf_type="binary", apply_idf=False, apply_dl=False

Slightly altered versions of these “standard” weighting schemes are common, and may have better behavior in general use cases:

  • “lucene-style tfidf”: Adds a doc-length normalization to the usual local and global components. Params: tf_type="linear", apply_idf=True, idf_type="smooth", apply_dl=True, dl_type="sqrt"

  • “lucene-style bm25”: Uses a smoothed idf instead of the classic bm25 variant to prevent weights on terms from going negative. Params: tf_type="bm25", apply_idf=True, idf_type="smooth", apply_dl=True, dl_type="linear"
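
For example, here is a hedged sketch of the "lucene-style tfidf" scheme expressed with the init parameters documented below (tf_type / idf_type / dl_type); mapping the apply_idf / apply_dl flags named in the scheme descriptions onto those parameters is an interpretation, and the tokenized_docs list is reused from the examples above:

# a "lucene-style tfidf" configuration: linear tf, smoothed idf,
# sqrt document-length normalization
vectorizer = Vectorizer(
    tf_type="linear",
    idf_type="smooth",
    dl_type="sqrt",
    min_df=3, max_df=0.95,
)
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)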

Parameters
  • tf_type

    Type of term frequency (tf) to use for weights’ local component:

    • ”linear”: tf (tfs are already linear, so left as-is)

    • ”sqrt”: tf => sqrt(tf)

    • ”log”: tf => log(tf) + 1

    • ”binary”: tf => 1

  • idf_type

    Type of inverse document frequency (idf) to use for weights’ global component:

    • ”standard”: idf = log(n_docs / df) + 1.0

    • ”smooth”: idf = log((n_docs + 1) / (df + 1)) + 1.0, i.e. 1 is added to all document frequencies, as if a single document containing every unique term was added to the corpus.

    • ”bm25”: idf = log((n_docs - df + 0.5) / (df + 0.5)), which is a form commonly used in information retrieval that allows for very common terms to receive negative weights.

    • None: no global weighting is applied to local term weights.

  • dl_type

    Type of document-length scaling to use for weights’ normalization component:

    • ”linear”: dl (dls are already linear, so left as-is)

    • ”sqrt”: dl => sqrt(dl)

    • ”log”: dl => log(dl)

    • None: no document-length normalization is applied to the local (or local * global) weights

  • norm – If “l1” or “l2”, normalize weights by the L1 or L2 norms, respectively, of row-wise vectors; otherwise, don’t.

  • min_df – Minimum number of documents in which a term must appear for it to be included in the vocabulary and as a column in a transformed doc-term matrix. If float, value is the fractional proportion of the total number of docs, which must be in [0.0, 1.0]; if int, value is the absolute number.

  • max_df – Maximum number of documents in which a term may appear for it to be included in the vocabulary and as a column in a transformed doc-term matrix. If float, value is the fractional proportion of the total number of docs, which must be in [0.0, 1.0]; if int, value is the absolute number.

  • max_n_terms – If specified, only include terms whose document frequency is within the top max_n_terms.

  • vocabulary_terms – Mapping of unique term string to unique term id, or an iterable of term strings that gets converted into such a mapping. Note that, if specified, vectorized outputs will include only these terms.

vocabulary_terms

Mapping of unique term string to unique term id, either provided on instantiation or generated by calling Vectorizer.fit() on a collection of tokenized documents.

Type

Dict[str, int]

property id_to_term

Mapping of unique term id (int) to unique term string (str), i.e. the inverse of Vectorizer.vocabulary_terms. This attribute is only generated if needed, and it is automatically kept in sync with the corresponding vocabulary.

property terms_list

List of term strings in column order of vectorized outputs. For example, terms_list[0] gives the term assigned to the first column in an output doc-term-matrix, doc_term_matrix[:, 0].

fit(tokenized_docs: Iterable[Iterable[str]]) → Vectorizer[source]

Count terms in tokenized_docs and, if not already provided, build up a vocabulary based on those terms. Fit and store global weights (IDFs) and, if needed for term weighting, the average document length.

Parameters

tokenized_docs

A sequence of tokenized documents, where each is a sequence of term strings. For example:

>>> ([tok.lemma_ for tok in spacy_doc]
...  for spacy_doc in spacy_docs)
>>> ((ne.text for ne in extract.entities(doc))
...  for doc in corpus)

Returns

Vectorizer instance that has just been fit.

fit_transform(tokenized_docs: Iterable[Iterable[str]]) → scipy.sparse.csr.csr_matrix[source]

Count terms in tokenized_docs and, if not already provided, build up a vocabulary based on those terms. Fit and store global weights (IDFs) and, if needed for term weighting, the average document length. Transform tokenized_docs into a document-term matrix with values weighted according to the parameters in Vectorizer initialization.

Parameters

tokenized_docs

A sequence of tokenized documents, where each is a sequence of term strings. For example:

>>> ([tok.lemma_ for tok in spacy_doc]
...  for spacy_doc in spacy_docs)
>>> ((ne.text for ne in extract.entities(doc))
...  for doc in corpus)

Returns

The transformed document-term matrix, where rows correspond to documents and columns correspond to terms, as a sparse row matrix.

transform(tokenized_docs: Iterable[Iterable[str]]) → scipy.sparse.csr.csr_matrix[source]

Transform tokenized_docs into a document-term matrix with values weighted according to the parameters in Vectorizer initialization and the global weights computed by calling Vectorizer.fit().

Parameters

tokenized_docs

A sequence of tokenized documents, where each is a sequence of term strings. For example:

>>> ([tok.lemma_ for tok in spacy_doc]
...  for spacy_doc in spacy_docs)
>>> ((ne.text for ne in extract.entities(doc))
...  for doc in corpus)

Returns

The transformed document-term matrix, where rows correspond to documents and columns correspond to terms, as a sparse row matrix.

Note

For best results, the tokenization used to produce tokenized_docs should be the same as was applied to the docs used in fitting this vectorizer or in generating a fixed input vocabulary.

Consider an extreme case where the docs used in fitting consist of lowercased (non-numeric) terms, while the docs to be transformed are all uppercased: The output doc-term-matrix will be empty.

property weighting

A mathematical representation of the overall weighting scheme used to determine values in the vectorized matrix, depending on the params used to initialize the Vectorizer.

class textacy.representations.vectorizers.GroupVectorizer(*, tf_type: str = 'linear', idf_type: Optional[str] = None, dl_type: Optional[str] = None, norm: Optional[str] = None, min_df: int | float = 1, max_df: int | float = 1.0, max_n_terms: Optional[int] = None, vocabulary_terms: Optional[Dict[str, int] | Iterable[str]] = None, vocabulary_grps: Optional[Dict[str, int] | Iterable[str]] = None)[source]

Transform one or more tokenized documents into a group-term matrix of shape (# groups, # unique terms), with tf-, tf-idf, or binary-weighted values.

This is an extension of typical document-term matrix vectorization, where terms are grouped by the documents in which they co-occur. It allows for customized grouping, such as by a shared author or publication year, that may span multiple documents, without forcing users to merge those documents themselves.

Stream a corpus with metadata from disk:

>>> ds = textacy.datasets.CapitolWords()
>>> records = ds.records(limit=1000)
>>> corpus = textacy.Corpus("en_core_web_sm", data=records)
>>> corpus
Corpus(1000 docs, 538397 tokens)

Tokenize and vectorize the first 600 documents of this corpus, where terms are grouped not by documents but by a categorical value in the docs’ metadata:

>>> tokenized_docs, groups = textacy.io.unzip(
...     ((term.lemma_ for term in textacy.extract.terms(doc, ngs=1, ents=True)),
...      doc._.meta["speaker_name"])
...     for doc in corpus[:600])
>>> vectorizer = GroupVectorizer(
...     tf_type="linear", idf_type="smooth", norm="l2",
...     min_df=3, max_df=0.95)
>>> grp_term_matrix = vectorizer.fit_transform(tokenized_docs, groups)
>>> grp_term_matrix
<5x1822 sparse matrix of type '<class 'numpy.float64'>'
        with 6139 stored elements in Compressed Sparse Row format>

Tokenize and vectorize the remaining 400 documents of the corpus, using only the groups, terms, and weights learned in the previous step:

>>> tokenized_docs, groups = textacy.io.unzip(
...     ((term.lemma_ for term in textacy.extract.terms(doc, ngs=1, ents=True)),
...      doc._.meta["speaker_name"])
...     for doc in corpus[600:])
>>> grp_term_matrix = vectorizer.transform(tokenized_docs, groups)
>>> grp_term_matrix
<5x1822 sparse matrix of type '<class 'numpy.float64'>'
        with 4414 stored elements in Compressed Sparse Row format>

Inspect the terms associated with columns and groups associated with rows; they’re sorted alphabetically:

>>> vectorizer.terms_list[:5]
['', '$ 1 million', '$ 160 million', '$ 5 billion', '$ 7 billion']
>>> vectorizer.grps_list
['Bernie Sanders', 'John Kasich', 'Joseph Biden', 'Lindsey Graham', 'Rick Santorum']

If known in advance, limit the terms and/or groups included in vectorized outputs to a particular set of values:

>>> tokenized_docs, groups = textacy.io.unzip(
...     ((term.lemma_ for term in textacy.extract.terms(doc, ngs=1, ents=True)),
...      doc._.meta["speaker_name"])
...     for doc in corpus[:600])
>>> vectorizer = GroupVectorizer(
...     tf_type="linear", idf_type="smooth", norm="l2",
...     min_df=3, max_df=0.95,
...     vocabulary_terms=["legislation", "federal government", "house", "constitutional"],
...     vocabulary_grps=["Bernie Sanders", "Lindsey Graham", "Rick Santorum"])
>>> grp_term_matrix = vectorizer.fit_transform(tokenized_docs, groups)
>>> grp_term_matrix
<3x4 sparse matrix of type '<class 'numpy.float64'>'
        with 9 stored elements in Compressed Sparse Row format>
>>> vectorizer.terms_list
['constitutional', 'federal government', 'house', 'legislation']
>>> vectorizer.grps_list
['Bernie Sanders', 'Lindsey Graham', 'Rick Santorum']

For a discussion of the various weighting schemes that can be applied, check out the Vectorizer docstring.

Parameters
  • tf_type

    Type of term frequency (tf) to use for weights’ local component:

    • ”linear”: tf (tfs are already linear, so left as-is)

    • ”sqrt”: tf => sqrt(tf)

    • ”log”: tf => log(tf) + 1

    • ”binary”: tf => 1

  • idf_type

    Type of inverse document frequency (idf) to use for weights’ global component:

    • ”standard”: idf = log(n_docs / df) + 1.0

    • ”smooth”: idf = log((n_docs + 1) / (df + 1)) + 1.0, i.e. 1 is added to all document frequencies, as if a single document containing every unique term was added to the corpus.

    • ”bm25”: idf = log((n_docs - df + 0.5) / (df + 0.5)), which is a form commonly used in information retrieval that allows for very common terms to receive negative weights.

    • None: no global weighting is applied to local term weights.

  • dl_type

    Type of document-length scaling to use for weights’ normalization component:

    • ”linear”: dl (dls are already linear, so left as-is)

    • ”sqrt”: dl => sqrt(dl)

    • ”log”: dl => log(dl)

    • None: no document-length normalization is applied to the local (or local * global) weights

  • norm – If “l1” or “l2”, normalize weights by the L1 or L2 norms, respectively, of row-wise vectors; otherwise, don’t.

  • min_df – Minimum number of documents in which a term must appear for it to be included in the vocabulary and as a column in a transformed doc-term matrix. If float, value is the fractional proportion of the total number of docs, which must be in [0.0, 1.0]; if int, value is the absolute number.

  • max_df – Maximum number of documents in which a term may appear for it to be included in the vocabulary and as a column in a transformed doc-term matrix. If float, value is the fractional proportion of the total number of docs, which must be in [0.0, 1.0]; if int, value is the absolute number.

  • max_n_terms – If specified, only include terms whose document frequency is within the top max_n_terms.

  • vocabulary_terms – Mapping of unique term string to unique term id, or an iterable of term strings that gets converted into such a mapping. Note that, if specified, vectorized output will include only these terms.

  • vocabulary_grps – Mapping of unique group string to unique group id, or an iterable of group strings that gets converted into such a mapping. Note that, if specified, vectorized output will include only these groups.

vocabulary_terms

Mapping of unique term string to unique term id, either provided on instantiation or generated by calling GroupVectorizer.fit() on a collection of tokenized documents.

Type

Dict[str, int]

vocabulary_grps

Mapping of unique group string to unique group id, either provided on instantiation or generated by calling GroupVectorizer.fit() on a collection of tokenized documents.

Type

Dict[str, int]

See also

Vectorizer

property id_to_grp

Mapping of unique group id (int) to unique group string (str), i.e. the inverse of GroupVectorizer.vocabulary_grps. This attribute is only generated if needed, and it is automatically kept in sync with the corresponding vocabulary.

property grps_list

List of group strings in row order of vectorized outputs. For example, grps_list[0] gives the group assigned to the first row in an output group-term-matrix, grp_term_matrix[0, :].

fit(tokenized_docs: Iterable[Iterable[str]], grps: Iterable[str]) → GroupVectorizer[source]

Count terms in tokenized_docs and, if not already provided, build up a vocabulary based on those terms; do the same for the groups in grps. Fit and store global weights (IDFs) and, if needed for term weighting, the average document length.

Parameters
  • tokenized_docs

    A sequence of tokenized documents, where each is a sequence of term strings. For example:

    >>> ([tok.lemma_ for tok in spacy_doc]
    ...  for spacy_doc in spacy_docs)
    >>> ((ne.text for ne in extract.entities(doc))
    ...  for doc in corpus)
    

  • grps – Sequence of group names by which the terms in tokenized_docs are aggregated, where the first item in grps corresponds to the first item in tokenized_docs, and so on.

Returns

GroupVectorizer instance that has just been fit.

fit_transform(tokenized_docs: Iterable[Iterable[str]], grps: Iterable[str]) → scipy.sparse.csr.csr_matrix[source]

Count terms in tokenized_docs and, if not already provided, build up a vocabulary based on those terms; do the same for the groups in grps. Fit and store global weights (IDFs) and, if needed for term weighting, the average document length. Transform tokenized_docs into a group-term matrix with values weighted according to the parameters in GroupVectorizer initialization.

Parameters
  • tokenized_docs

    A sequence of tokenized documents, where each is a sequence of term strings. For example:

    >>> ([tok.lemma_ for tok in spacy_doc]
    ...  for spacy_doc in spacy_docs)
    >>> ((ne.text for ne in extract.entities(doc))
    ...  for doc in corpus)
    

  • grps – Sequence of group names by which the terms in tokenized_docs are aggregated, where the first item in grps corresponds to the first item in tokenized_docs, and so on.

Returns

The transformed group-term matrix, where rows correspond to groups and columns correspond to terms, as a sparse row matrix.

transform(tokenized_docs: Iterable[Iterable[str]], grps: Iterable[str]) → scipy.sparse.csr.csr_matrix[source]

Transform tokenized_docs and grps into a group-term matrix with values weighted according to the parameters in GroupVectorizer initialization and the global weights computed by calling GroupVectorizer.fit().

Parameters
  • tokenized_docs

    A sequence of tokenized documents, where each is a sequence of term strings. For example:

    >>> ([tok.lemma_ for tok in spacy_doc]
    ...  for spacy_doc in spacy_docs)
    >>> ((ne.text for ne in extract.entities(doc))
    ...  for doc in corpus)
    

  • grps – Sequence of group names by which the terms in tokenized_docs are aggregated, where the first item in grps corresponds to the first item in tokenized_docs, and so on.

Returns

The transformed group-term matrix, where rows correspond to groups and columns correspond to terms, as a sparse row matrix.

Note

For best results, the tokenization used to produce tokenized_docs should be the same as was applied to the docs used in fitting this vectorizer or in generating a fixed input vocabulary.

Consider an extreme case where the docs used in fitting consist of lowercased (non-numeric) terms, while the docs to be transformed are all uppercased: The output group-term-matrix will be empty.

Matrix Utils

textacy.representations.matrix_utils: Functions for computing corpus-wide term- or document-based values, like term frequency, document frequency, and document length, and for filtering terms from a matrix by their document frequency.
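
A combined sketch of several helpers documented below, applied to a tiny count matrix (the toy values are illustrative only):

import scipy.sparse as sp
from textacy.representations import matrix_utils

# 3 docs x 4 terms, raw (linear) term counts
dtm = sp.csr_matrix(
    [[2, 1, 0, 0],
     [0, 1, 3, 0],
     [1, 0, 1, 1]])

print(matrix_utils.get_term_freqs(dtm))    # corpus-wide counts per term
print(matrix_utils.get_doc_freqs(dtm))     # number of docs containing each term
print(matrix_utils.get_doc_lengths(dtm))   # number of terms per doc
tfidf = matrix_utils.apply_idf_weighting(dtm, type_="smooth")
print(tfidf.toarray().round(2))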

textacy.representations.matrix_utils.get_term_freqs(doc_term_matrix, *, type_='linear')[source]

Compute frequencies for all terms in a document-term matrix, with optional sub-linear scaling.

Parameters
  • doc_term_matrix (scipy.sparse.csr_matrix) – M x N sparse matrix, where M is the # of docs and N is the # of unique terms. Values must be the linear, un-scaled counts of term n per doc m.

  • type_ ({'linear', 'sqrt', 'log'}) – Scaling applied to absolute term counts. If 'linear', term counts are left as-is, since the sums are already linear; if 'sqrt', tf => sqrt(tf); if 'log', tf => log(tf) + 1.

Returns

Array of term frequencies, with length equal to the # of unique terms (# of columns) in doc_term_matrix.

Return type

numpy.ndarray

Raises

ValueError – if doc_term_matrix doesn’t have any non-zero entries, or if type_ isn’t one of {“linear”, “sqrt”, “log”}.

textacy.representations.matrix_utils.get_doc_freqs(doc_term_matrix)[source]

Compute document frequencies for all terms in a document-term matrix.

Parameters

doc_term_matrix (scipy.sparse.csr_matrix) –

M x N sparse matrix, where M is the # of docs and N is the # of unique terms.

Note

Weighting of the terms doesn't matter: whether values are binary, tf, or tf-idf, a term's document frequency will be the same.

Returns

Array of document frequencies, with length equal to the # of unique terms (# of columns) in doc_term_matrix.

Return type

numpy.ndarray

Raises

ValueError – if doc_term_matrix doesn’t have any non-zero entries.

textacy.representations.matrix_utils.get_inverse_doc_freqs(doc_term_matrix, *, type_='smooth')[source]

Compute inverse document frequencies for all terms in a document-term matrix, using one of several IDF formulations.

Parameters
  • doc_term_matrix (scipy.sparse.csr_matrix) – M x N sparse matrix, where M is the # of docs and N is the # of unique terms. The particular weighting of matrix values doesn’t matter.

  • type_ ({'standard', 'smooth', 'bm25'}) – Type of IDF formulation to use. If 'standard', idfs => log(n_docs / dfs) + 1.0; if 'smooth', idfs => log((n_docs + 1) / (dfs + 1)) + 1.0, i.e. 1 is added to all document frequencies, equivalent to adding a single document to the corpus containing every unique term; if 'bm25', idfs => log((n_docs - dfs + 0.5) / (dfs + 0.5)), which is a form commonly used in BM25 ranking that allows for extremely common terms to have negative idf weights.

Returns

Array of inverse document frequencies, with length equal to the # of unique terms (# of columns) in doc_term_matrix.

Return type

numpy.ndarray

Raises

ValueError – if type_ isn’t one of {“standard”, “smooth”, “bm25”}.

textacy.representations.matrix_utils.get_doc_lengths(doc_term_matrix, *, type_='linear')[source]

Compute the lengths (i.e. number of terms) for all documents in a document-term matrix.

Parameters
  • doc_term_matrix (scipy.sparse.csr_matrix) – M x N sparse matrix, where M is the # of docs, N is the # of unique terms, and values are the absolute counts of term n per doc m.

  • type_ ({'linear', 'sqrt', 'log'}) – Scaling applied to absolute doc lengths. If 'linear', lengths are left as-is, since the sums are already linear; if 'sqrt', dl => sqrt(dl); if 'log', dl => log(dl) + 1.

Returns

Array of document lengths, with length equal to the # of documents (# of rows) in doc_term_matrix.

Return type

numpy.ndarray

Raises

ValueError – if type_ isn’t one of {“linear”, “sqrt”, “log”}.

textacy.representations.matrix_utils.get_information_content(doc_term_matrix)[source]

Compute information content for all terms in a document-term matrix. IC is a float in [0.0, 1.0], defined as -df * log2(df) - (1 - df) * log2(1 - df), where df is a term’s normalized document frequency.

Parameters

doc_term_matrix (scipy.sparse.csr_matrix) –

M x N sparse matrix, where M is the # of docs and N is the # of unique terms.

Note

Weighting of the terms doesn't matter: whether values are binary, tf, or tf-idf, a term's information content will be the same.

Returns

Array of term information content values, with length equal to the # of unique terms (# of columns) in doc_term_matrix.

Return type

numpy.ndarray

Raises

ValueError – if doc_term_matrix doesn’t have any non-zero entries.
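
To make the formula concrete, a quick worked check for a term appearing in 1 of 4 docs (normalized df = 0.25):

import numpy as np

df = 0.25  # normalized document frequency
ic = -df * np.log2(df) - (1 - df) * np.log2(1 - df)
print(round(ic, 3))  # 0.811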

textacy.representations.matrix_utils.apply_idf_weighting(doc_term_matrix, *, type_='smooth')[source]

Apply inverse document frequency (idf) weighting to a term-frequency (tf) weighted document-term matrix, using one of several IDF formulations.

Parameters
  • doc_term_matrix (scipy.sparse.csr_matrix) – M x N sparse matrix, where M is the # of docs and N is the # of unique terms.

  • type_ ({'standard', 'smooth', 'bm25'}) – Type of IDF formulation to use.

Returns

Sparse matrix of shape M x N, where value (i, j) is the tfidf weight of term j in doc i.

Return type

scipy.sparse.csr_matrix

textacy.representations.matrix_utils.filter_terms_by_df(doc_term_matrix, term_to_id, *, max_df=1.0, min_df=1, max_n_terms=None)[source]

Filter out terms that are too common and/or too rare (by document frequency), keep at most the top max_n_terms, and compactify the term_to_id mapping accordingly. Borrows heavily from the sklearn.feature_extraction.text module.

Parameters
  • doc_term_matrix (scipy.sparse.csr_matrix) – M X N matrix, where M is the # of docs and N is the # of unique terms.

  • term_to_id (Dict[str, int]) – Mapping of term string to unique term id, e.g. Vectorizer.vocabulary_terms.

  • min_df (float or int) – if float, value is the fractional proportion of the total number of documents and must be in [0.0, 1.0]; if int, value is the absolute number; filter terms whose document frequency is less than min_df

  • max_df (float or int) – if float, value is the fractional proportion of the total number of documents and must be in [0.0, 1.0]; if int, value is the absolute number; filter terms whose document frequency is greater than max_df

  • max_n_terms (int) – only include terms whose term frequency is within the top max_n_terms

Returns

Sparse matrix of shape (# docs, # unique filtered terms), where value (i, j) is the weight of term j in doc i.

Dict[str, int]: Term to id mapping, where keys are unique filtered terms as strings and values are their corresponding integer ids.

Return type

scipy.sparse.csr_matrix

Raises

ValueError – if max_df or min_df or max_n_terms < 0.
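
A short sketch continuing the toy dtm from the matrix_utils intro above, dropping terms that appear in fewer than 2 docs (the term_to_id mapping here is made up for illustration):

term_to_id = {"mary": 0, "lamb": 1, "fleece": 2, "snow": 3}
dtm_filtered, vocab_filtered = matrix_utils.filter_terms_by_df(
    dtm, term_to_id, min_df=2, max_df=1.0, max_n_terms=None)
print(dtm_filtered.shape, vocab_filtered)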

textacy.representations.matrix_utils.filter_terms_by_ic(doc_term_matrix, term_to_id, *, min_ic=0.0, max_n_terms=None)[source]

Filter out terms that are too common and/or too rare (by information content), keep at most the top max_n_terms, and compactify the term_to_id mapping accordingly. Borrows heavily from the sklearn.feature_extraction.text module.

Parameters
  • doc_term_matrix (scipy.sparse.csr_matrix) – M X N sparse matrix, where M is the # of docs and N is the # of unique terms.

  • term_to_id (Dict[str, int]) – Mapping of term string to unique term id, e.g. Vectorizer.vocabulary_terms.

  • min_ic (float) – filter terms whose information content is less than this value; must be in [0.0, 1.0]

  • max_n_terms (int) – only include terms whose information content is within the top max_n_terms

Returns

Sparse matrix of shape (# docs, # unique filtered terms), where value (i, j) is the weight of term j in doc i.

Dict[str, int]: Term to id mapping, where keys are unique filtered terms as strings and values are their corresponding integer ids.

Return type

scipy.sparse.csr_matrix

Raises

ValueError – if min_ic not in [0.0, 1.0] or max_n_terms < 0.