File I/O

text.read_text

Read the contents of a text file at filepath, either all at once or streaming line-by-line.

text.write_text

Write text data to disk at filepath, either all at once or streaming line-by-line.

json.read_json

Read the contents of a JSON file at filepath, either all at once or streaming item-by-item.

json.write_json

Write JSON data to disk at filepath, either all at once or streaming item-by-item.

csv.read_csv

Read the contents of a CSV file at filepath, streaming line-by-line, where each line is a list of strings and/or floats whose values are separated by delimiter.

csv.write_csv

Write rows of data to disk at filepath, where each row is an iterable or a dictionary of strings and/or numbers, written to one line with values separated by delimiter.

matrix.read_sparse_matrix

Read the data, indices, indptr, and shape arrays from a .npz file on disk at filepath, and return an instantiated sparse matrix.

matrix.write_sparse_matrix

Write sparse matrix data to disk at filepath, optionally compressed, into a single .npz file.

spacy.read_spacy_docs

Read the contents of a file at filepath, written in binary or pickle format.

spacy.write_spacy_docs

Write one or more Docs to disk at filepath in binary or pickle format.

http.read_http_stream

Read data from url in a stream, either all at once or line-by-line.

http.write_http_stream

Download data from url in a stream, and write successive chunks to disk at filepath.

utils.open_sesame

Open file filepath.

utils.split_records

Split records’ content (text) from associated metadata, but keep them paired together.

utils.unzip

Borrowed from toolz.sandbox.core.unzip, but using cytoolz instead of toolz to avoid the additional dependency.

utils.get_filepaths

Yield full paths of files on disk under directory dirpath, optionally filtering for or against particular patterns or file extensions and crawling all subdirectories.

utils.download_file

Download a file from url and save it to disk.

utils.unpack_archive

Extract data from a zip or tar archive file into a directory (or do nothing if the file isn’t an archive).

textacy.io.text: Functions for reading from and writing to disk records in plain text format, either as one text per file or one text per line in a file.

textacy.io.text.read_text(filepath: Union[str, pathlib.Path], *, mode: str = 'rt', encoding: Optional[str] = None, lines: bool = False) → Iterable[str]

Read the contents of a text file at filepath, either all at once or streaming line-by-line.

Parameters
  • filepath – Path to file on disk from which data will be read.

  • mode – Mode with which filepath is opened.

  • encoding – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.

  • lines – If False, all data is read in at once; otherwise, data is read in one line at a time.

Yields

Next line of text to read in.

If lines is False, wrap this output in next() to conveniently access the full text.
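
The lines switch and the next() tip can be illustrated with a minimal, stdlib-only sketch. The helper name read_text_sketch is hypothetical and this is not textacy's implementation; it only mirrors the behavior documented above.

```python
import os
import tempfile
from typing import Iterable, Optional


def read_text_sketch(
    filepath: str, *, encoding: Optional[str] = None, lines: bool = False
) -> Iterable[str]:
    """Hypothetical stand-in mirroring the documented read_text behavior."""
    with open(filepath, mode="rt", encoding=encoding) as f:
        if lines is False:
            # A single item: the entire file contents.
            yield f.read()
        else:
            # One item per line, read lazily; trailing newlines are kept.
            for line in f:
                yield line


# Demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.txt")
    with open(path, "wt", encoding="utf-8") as f:
        f.write("first line\nsecond line\n")
    # With lines=False the generator yields exactly once, so next() gets the text.
    full_text = next(read_text_sketch(path, encoding="utf-8"))
    all_lines = list(read_text_sketch(path, encoding="utf-8", lines=True))
```

Because the function is a generator in both modes, callers always iterate; next() is just the shortest way to pull the single item out when lines is False.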

textacy.io.text.write_text(data: str | Iterable[str], filepath: types.PathLike, *, mode: str = 'wt', encoding: Optional[str] = None, make_dirs: bool = False, lines: bool = False) → None

Write text data to disk at filepath, either all at once or streaming line-by-line.

Parameters
  • data

    If lines is False, a single string to write to disk; for example:

    "isnt rick and morty that thing you get when you die and your body gets all stiff"

    If lines is True, an iterable of strings to write to disk, one item per line; for example:

    ["isnt rick and morty that thing you get when you die and your body gets all stiff",
     "You're thinking of rigor mortis. Rick and morty is when you get trolled into watching 'never gonna give you up'",
     "That's rickrolling. Rick and morty is a type of pasta"]

  • filepath – Path to file on disk to which data will be written.

  • mode – Mode with which filepath is opened.

  • encoding – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.

  • make_dirs – If True, automatically create (sub)directories if not already present in order to write filepath.

  • lines – If False, all data is written at once; otherwise, data is written to disk one line at a time.
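
As a rough illustration of how make_dirs and lines interact, here is a stdlib-only sketch. The name write_text_sketch is hypothetical, and appending a newline after each item when lines is True is an assumption based on the "one item per line" wording above, not a statement about textacy's internals.

```python
import os
import tempfile
from typing import Iterable, Union


def write_text_sketch(
    data: Union[str, Iterable[str]],
    filepath: str,
    *,
    make_dirs: bool = False,
    lines: bool = False,
) -> None:
    """Hypothetical stand-in mirroring the documented write_text behavior."""
    if make_dirs:
        parent = os.path.dirname(filepath)
        if parent:
            # Create any missing intermediate directories.
            os.makedirs(parent, exist_ok=True)
    with open(filepath, mode="wt", encoding="utf-8") as f:
        if lines is False:
            # data is a single string, written in one call.
            f.write(data)
        else:
            # data is an iterable of strings; each goes on its own line.
            for item in data:
                f.write(item + "\n")


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "nested", "out.txt")
    write_text_sketch(["alpha", "beta"], path, make_dirs=True, lines=True)
    with open(path, "rt", encoding="utf-8") as f:
        written = f.read()
```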

textacy.io.json: Functions for reading from and writing to disk records in JSON format, as one record per file or one record per line in a file.

textacy.io.json.read_json(filepath: Union[str, pathlib.Path], *, mode: str = 'rt', encoding: Optional[str] = None, lines: bool = False) → Iterable

Read the contents of a JSON file at filepath, either all at once or streaming item-by-item.

Parameters
  • filepath – Path to file on disk from which data will be read.

  • mode – Mode with which filepath is opened.

  • encoding – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.

  • lines – If False, all data is read in at once; otherwise, data is read in one line at a time.

Yields

Next JSON item; could be a dict, list, int, float, str, depending on the data and the value of lines.
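
The difference between the two modes can be sketched with the standard json module. read_json_sketch is a hypothetical helper, not the library function; it assumes that lines=True means one JSON document per line.

```python
import json
import os
import tempfile
from typing import Any, Iterable


def read_json_sketch(filepath: str, *, lines: bool = False) -> Iterable[Any]:
    """Hypothetical stand-in mirroring the documented read_json behavior."""
    with open(filepath, mode="rt", encoding="utf-8") as f:
        if lines is False:
            # The whole file is one JSON document, yielded as a single item.
            yield json.load(f)
        else:
            # Each line is its own JSON document.
            for line in f:
                yield json.loads(line)


with tempfile.TemporaryDirectory() as tmpdir:
    # One record per line.
    lines_path = os.path.join(tmpdir, "records.jsonl")
    with open(lines_path, "wt", encoding="utf-8") as f:
        f.write('{"title": "2BR02B"}\n{"title": "Slaughterhouse-Five"}\n')
    records = list(read_json_sketch(lines_path, lines=True))

    # One document per file.
    whole_path = os.path.join(tmpdir, "numbers.json")
    with open(whole_path, "wt", encoding="utf-8") as f:
        f.write("[1, 2, 3]")
    whole = next(read_json_sketch(whole_path))
```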

textacy.io.json.read_json_mash(filepath: Union[str, pathlib.Path], *, mode: str = 'rt', encoding: Optional[str] = None, buffer_size: int = 2048) → Iterable

Read the contents of a JSON file at filepath one item at a time, where all of the items have been mashed together, end-to-end, on a single line.

Parameters
  • filepath – Path to file on disk from which data will be read.

  • mode – Mode with which filepath is opened.

  • encoding – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.

  • buffer_size – Number of bytes to read in as a chunk.

Yields

Next valid JSON object, converted to native Python equivalent.

Note

Storing JSON data in this format is Not Good. Reading it is doable, so this function is included for users’ convenience, but note that there is no analogous write_json_mash() function. Don’t do it.
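
One way to read such "mashed" data is to feed chunks into json.JSONDecoder.raw_decode, which reports where each decoded value ends. The sketch below is an assumption about a workable approach, not a copy of textacy's code; iter_mashed_json is a hypothetical name, and in text mode buffer_size counts characters rather than bytes.

```python
import io
import json
from typing import Any, Iterable, TextIO


def iter_mashed_json(stream: TextIO, buffer_size: int = 2048) -> Iterable[Any]:
    """Yield JSON values that were written end-to-end with no separators.

    Intended for objects and arrays; bare numbers split across chunks could
    be decoded prematurely. Undecodable trailing data is silently dropped.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    # Read fixed-size chunks until the stream returns an empty string.
    for chunk in iter(lambda: stream.read(buffer_size), ""):
        # raw_decode does not skip leading whitespace, so strip it here.
        buffer = (buffer + chunk).lstrip()
        while buffer:
            try:
                item, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                # The next item is probably split across chunks; read more.
                break
            yield item
            buffer = buffer[end:].lstrip()


mashed = io.StringIO('{"a": 1}{"b": [2, 3]}{"c": "x"}')
items = list(iter_mashed_json(mashed, buffer_size=4))
```

A deliberately tiny buffer_size is used in the demo to show that items spanning several chunks are still reassembled.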

textacy.io.json.write_json(data: Any, filepath: types.PathLike, *, mode: str = 'wt', encoding: Optional[str] = None, make_dirs: bool = False, lines: bool = False, ensure_ascii: bool = False, separators: Tuple[str, str] = (',', ':'), sort_keys: bool = False, indent: Optional[int | str] = None) → None

Write JSON data to disk at filepath, either all at once or streaming item-by-item.

Parameters
  • data

    JSON data to write to disk, including any Python objects encodable by default in json, as well as dates and datetimes. For example:

    [
        {"title": "Harrison Bergeron", "text": "The year was 2081, and everybody was finally equal."},
        {"title": "2BR02B", "text": "Everything was perfectly swell."},
        {"title": "Slaughterhouse-Five", "text": "All this happened, more or less."},
    ]
    

    If lines is False, all of data is written as a single object; if True, each item is written to a separate line in filepath.

  • filepath – Path to file on disk to which data will be written.

  • mode – Mode with which filepath is opened.

  • encoding – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.

  • make_dirs – If True, automatically create (sub)directories if not already present in order to write filepath.

  • lines – If False, all data is written at once; otherwise, data is written to disk one item at a time.

  • ensure_ascii – If True, all non-ASCII characters are escaped; otherwise, non-ASCII characters are output as-is.

  • separators – An (item_separator, key_separator) pair specifying how items and keys are separated in output.

  • sort_keys – If True, each output dictionary is sorted by key; otherwise, dictionary ordering is taken as-is.

  • indent – If a positive integer or non-empty string, items are pretty-printed with the specified indent level; if 0, negative, or “”, items are separated by newlines only; if None, the most compact representation is used when storing data.

class textacy.io.json.ExtendedJSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Sub-class of json.JSONEncoder, used to write JSON data to disk in write_json() while handling a broader range of Python objects.

default(obj)

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
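
Since write_json() is documented as also accepting dates and datetimes, a date-aware encoder might look roughly like the following. DateAwareJSONEncoder is a hypothetical class written for illustration; the ISO 8601 output format is an assumption and may differ from what ExtendedJSONEncoder actually emits.

```python
import datetime
import json


class DateAwareJSONEncoder(json.JSONEncoder):
    """Hypothetical encoder: serializes dates/datetimes as ISO 8601 strings."""

    def default(self, o):
        # datetime.datetime is a subclass of datetime.date, so one check covers both.
        if isinstance(o, datetime.date):
            return o.isoformat()
        # Let the base class raise TypeError for anything else.
        return super().default(o)


encoded = json.dumps(
    {"created": datetime.date(2020, 1, 31)},
    cls=DateAwareJSONEncoder,
    separators=(",", ":"),
)
```

The separators value matches the write_json() default shown above, which produces the most compact output.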

textacy.io.csv: Functions for reading from and writing to disk records in CSV format, where CSVs may be delimited not only by commas (the default) but tabs, pipes, and other valid one-char delimiters.

textacy.io.csv.read_csv(filepath: types.PathLike, *, encoding: Optional[str] = None, fieldnames: Optional[str | Sequence[str]] = None, dialect: str | Type[csv.Dialect] = 'excel', delimiter: str = ',', quoting: int = 2) → Iterable[list] | Iterable[dict]

Read the contents of a CSV file at filepath, streaming line-by-line, where each line is a list of strings and/or floats whose values are separated by delimiter.

Parameters
  • filepath – Path to file on disk from which data will be read.

  • encoding – Name of the encoding used to decode or encode the data in filepath.

  • fieldnames – If specified, gives names for columns of values, which are used as keys in an ordered dictionary representation of each line’s data. If ‘infer’, the first kB of data is analyzed to make a guess about whether the first row is a header of column names, and if so, those names are used as keys. If None, no column names are used, and each line is returned as a list of strings/floats.

  • dialect – Grouping of formatting parameters that determine how the data is parsed when reading/writing. If ‘infer’, the first kB of data is analyzed to get a best guess for the correct dialect.

  • delimiter – 1-character string used to separate fields in a row.

  • quoting – Type of quoting to apply to field values. See: https://docs.python.org/3/library/csv.html#csv.QUOTE_NONNUMERIC

Yields

List[obj] – Next row, whose elements are strings and/or floats, if fieldnames is None or ‘infer’ doesn’t detect a header row.

or

Dict[str, obj] – Next row, as an ordered dictionary of (key, value) pairs, where keys are column names and values are the corresponding strings and/or floats, if fieldnames is a list of column names or ‘infer’ detects a header row.
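
The default quoting value of 2 corresponds to csv.QUOTE_NONNUMERIC in the standard library, which is why unquoted fields come back as floats. The snippet below uses only the stdlib csv module on an in-memory sample (the sample rows are illustrative) to show the list-shaped and dict-shaped rows described above; it does not exercise the ‘infer’ options.

```python
import csv
import io

# Quoted fields stay strings; unquoted fields are converted to float.
raw = '"That was a great movie!",0.9\n"Worst. Movie. Ever.",-1.0\n'

# No fieldnames: each row is a list.
rows_as_lists = list(
    csv.reader(io.StringIO(raw), delimiter=",", quoting=csv.QUOTE_NONNUMERIC)
)

# Explicit fieldnames: each row is a dict keyed by column name.
rows_as_dicts = list(
    csv.DictReader(
        io.StringIO(raw),
        fieldnames=["text", "score"],
        delimiter=",",
        quoting=csv.QUOTE_NONNUMERIC,
    )
)
```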

textacy.io.csv.write_csv(data: Iterable[Dict[str, Any]] | Iterable[Iterable], filepath: types.PathLike, *, encoding: Optional[str] = None, make_dirs: bool = False, fieldnames: Optional[Sequence[str]] = None, dialect: str = 'excel', delimiter: str = ',', quoting: int = 2) → None

Write rows of data to disk at filepath, where each row is an iterable or a dictionary of strings and/or numbers, written to one line with values separated by delimiter.

Parameters
  • data

    If fieldnames is None, an iterable of iterables of strings and/or numbers to write to disk; for example:

    [['That was a great movie!', 0.9],
     ['The movie was okay, I guess.', 0.2],
     ['Worst. Movie. Ever.', -1.0]]
    

    If fieldnames is specified, an iterable of dictionaries with string and/or number values to write to disk; for example:

    [{'text': 'That was a great movie!', 'score': 0.9},
     {'text': 'The movie was okay, I guess.', 'score': 0.2},
     {'text': 'Worst. Movie. Ever.', 'score': -1.0}]
    

  • filepath – Path to file on disk to which data will be written.

  • encoding – Name of the encoding used to decode or encode the data in filepath.

  • make_dirs – If True, automatically create (sub)directories if not already present in order to write filepath.

  • fieldnames

    Sequence of keys that identify the order in which values in each row’s dictionary are written to filepath. These are included in filepath as a header row of column names.

    Note

    Only specify this if data is an iterable of dictionaries.

  • dialect – Grouping of formatting parameters that determine how the data is parsed when reading/writing.

  • delimiter – 1-character string used to separate fields in a row.

  • quoting – Type of quoting to apply to field values. See: https://docs.python.org/3/library/csv.html#csv.QUOTE_NONNUMERIC
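
To show what the fieldnames header row and the default quoting look like on disk, here is a stdlib-only sketch that writes to an in-memory buffer instead of a file. It uses csv.DictWriter directly and is only an approximation of what write_csv produces.

```python
import csv
import io

rows = [
    {"text": "That was a great movie!", "score": 0.9},
    {"text": "Worst. Movie. Ever.", "score": -1.0},
]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(
    buffer,
    fieldnames=["text", "score"],  # fixes column order and supplies the header
    dialect="excel",
    delimiter=",",
    quoting=csv.QUOTE_NONNUMERIC,  # quote every non-numeric field
)
writer.writeheader()
writer.writerows(rows)
output_lines = buffer.getvalue().splitlines()
```

With QUOTE_NONNUMERIC, strings (including the header names) are quoted while numbers are written bare, which lets a reader using the same setting recover floats.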

textacy.io.matrix: Functions for reading from and writing to disk CSC and CSR sparse matrices in numpy binary format.

textacy.io.matrix.read_sparse_matrix(filepath: types.PathLike, *, kind: str = 'csc') → sp.csc_matrix | sp.csr_matrix

Read the data, indices, indptr, and shape arrays from a .npz file on disk at filepath, and return an instantiated sparse matrix.

Parameters
  • filepath – Path to file on disk from which data will be read.

  • kind ({'csc', 'csr'}) – Kind of sparse matrix to instantiate.

Returns

An instantiated sparse matrix, whose type depends on the value of kind.

textacy.io.matrix.write_sparse_matrix(data: sp.csc_matrix | sp.csr_matrix, filepath: types.PathLike, *, compressed: bool = True, make_dirs: bool = False) → None

Write sparse matrix data to disk at filepath, optionally compressed, into a single .npz file.

Parameters
  • data – Sparse matrix (CSC or CSR) to write to disk.

  • filepath – Path to file on disk to which data will be written. If filepath does not end in .npz, that extension is automatically appended to the name.

  • compressed – If True, save arrays into a single file in compressed numpy binary format.

  • make_dirs – If True, automatically create (sub)directories if not already present in order to write filepath.

textacy.io.spacy: Functions for reading from and writing to disk spaCy documents in either pickle or binary format. Be warned: Both formats have pros and cons.

textacy.io.spacy.read_spacy_docs(filepath: Union[str, pathlib.Path], *, format: str = 'binary', lang: Optional[Union[str, pathlib.Path, spacy.language.Language]] = None) → Iterable[spacy.tokens.doc.Doc]

Read the contents of a file at filepath, written in binary or pickle format.

Parameters
  • filepath – Path to file on disk from which data will be read.

  • format ({"binary", "pickle"}) –

    Format of the data that was written to disk. If “binary”, uses spacy.tokens.DocBin to deserialize data; if “pickle”, uses Python’s stdlib pickle.

    Warning

    Docs written in pickle format were saved all together as a list, which means they’re all loaded into memory at once before streaming one by one. Mind your RAM usage, especially when reading many docs!

  • lang – Language with which spaCy originally processed docs, represented as the full name of or path on disk to the pipeline, or an already instantiated pipeline instance. Note that this is only required when format is “binary”.

Yields

Next deserialized document.

Raises

ValueError – if format is not “binary” or “pickle”, or if lang is None when format="binary"

textacy.io.spacy.write_spacy_docs(data: Doc | Iterable[Doc], filepath: types.PathLike, *, make_dirs: bool = False, format: str = 'binary', attrs: Optional[Iterable[str]] = None, store_user_data: bool = False) → None

Write one or more Docs to disk at filepath in binary or pickle format.

Parameters
  • data – A single Doc or a sequence of Docs to write to disk.

  • filepath – Path to file on disk to which data will be written.

  • make_dirs – If True, automatically create (sub)directories if not already present in order to write filepath.

  • format ({"pickle", "binary"}) –

    Format of the data written to disk. If “binary”, uses spacy.tokens.DocBin to serialize data; if “pickle”, uses Python’s stdlib pickle.

    Warning

    When writing docs in pickle format, all the docs in data must be saved as a list, which means they’re all loaded into memory. Mind your RAM usage, especially when writing many docs!

  • attrs – List of attributes to serialize if format is “binary”. If None, spaCy’s default values are used; see here: https://spacy.io/api/docbin#init

  • store_user_data – If True, write Doc.user_data and the values of custom extension attributes to disk; otherwise, don’t.

Raises

ValueError – if format is not “binary” or “pickle”

textacy.io.http: Functions for reading data from URLs via streaming HTTP requests and either reading it into memory or writing it directly to disk.

textacy.io.http.read_http_stream(url: str, *, lines: bool = False, decode_unicode: bool = False, chunk_size: int = 1024, auth: Optional[Tuple[str, str]] = None) → Iterable[str] | Iterable[bytes]

Read data from url in a stream, either all at once or line-by-line.

Parameters
  • url – URL to which a GET request is made for data.

  • lines – If False, yield all of the data at once; otherwise, yield data line-by-line.

  • decode_unicode – If True, yield data as unicode, where the encoding is taken from the HTTP response headers; otherwise, yield bytes.

  • chunk_size – Number of bytes read into memory per chunk. Because decoding may occur, this is not necessarily the length of each chunk.

  • auth

    (username, password) pair for simple HTTP authentication required (if at all) to access the data at url.

Yields

If lines is True, the next line in the response data, which is bytes if decode_unicode is False or unicode otherwise. If lines is False, yields the full response content, either as bytes or unicode.

textacy.io.http.write_http_stream(url: str, filepath: Union[str, pathlib.Path], *, mode: str = 'wt', encoding: Optional[str] = None, make_dirs: bool = False, chunk_size: int = 1024, auth: Optional[Tuple[str, str]] = None) → None

Download data from url in a stream, and write successive chunks to disk at filepath.

Parameters
  • url – URL to which a GET request is made for data.

  • filepath – Path to file on disk to which data will be written.

  • mode – Mode with which filepath is opened.

  • encoding

    Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.

    Note

    The encoding on the HTTP response is inferred from its headers, or set to ‘utf-8’ as a fall-back in the case that no encoding is detected. It is not set by encoding.

  • make_dirs – If True, automatically create (sub)directories if not already present in order to write filepath.

  • chunk_size – Number of bytes read into memory per chunk. Because decoding may occur, this is not necessarily the length of each chunk.

  • auth

    (username, password) pair for simple HTTP authentication required (if at all) to access the data at url.

I/O Utils

textacy.io.utils: Functions to help read and write data to disk in a variety of formats.

textacy.io.utils.open_sesame(filepath: Union[str, pathlib.Path], *, mode: str = 'rt', encoding: Optional[str] = None, errors: Optional[str] = None, newline: Optional[str] = None, compression: str = 'infer', make_dirs: bool = False) → IO

Open file filepath. Automatically handle file compression, relative paths and symlinks, and missing intermediate directory creation, as needed.

open_sesame may be used as a drop-in replacement for io.open().

Parameters
  • filepath – Path on disk (absolute or relative) of the file to open.

  • mode – The mode in which filepath is opened.

  • encoding – Name of the encoding used to decode or encode filepath. Only applicable in text mode.

  • errors – String specifying how encoding/decoding errors are handled. Only applicable in text mode.

  • newline – String specifying how universal newlines mode works. Only applicable in text mode.

  • compression – Type of compression, if any, with which filepath is read from or written to disk. If None, no compression is used; if ‘infer’, compression is inferred from the extension on filepath.

  • make_dirs – If True, automatically create (sub)directories if not already present in order to write filepath.

Returns

file object

Raises
  • TypeError – if filepath is not a string

  • ValueError – if encoding is specified but mode is binary

  • OSError – if filepath doesn’t exist but mode is read
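
The idea of inferring compression from the file extension can be sketched with the standard library's gzip, bz2, and lzma modules. open_inferred is a hypothetical helper, and the extension-to-opener mapping is an assumption made for illustration; it is not necessarily the set of formats open_sesame supports.

```python
import bz2
import gzip
import lzma
import os
import tempfile
from typing import IO, Optional

# Assumed mapping from file extension to a compatible open() function.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_inferred(
    filepath: str,
    *,
    mode: str = "rt",
    encoding: Optional[str] = None,
    make_dirs: bool = False,
) -> IO:
    """Open filepath, picking a (de)compressor from its extension."""
    if make_dirs and any(flag in mode for flag in "wax"):
        parent = os.path.dirname(filepath)
        if parent:
            os.makedirs(parent, exist_ok=True)
    extension = os.path.splitext(filepath)[1].lower()
    # Fall back to the built-in open() for unrecognized extensions.
    opener = _OPENERS.get(extension, open)
    return opener(filepath, mode=mode, encoding=encoding)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "nested", "notes.txt.gz")
    with open_inferred(path, mode="wt", encoding="utf-8", make_dirs=True) as f:
        f.write("hello")
    with open_inferred(path, mode="rt", encoding="utf-8") as f:
        round_trip = f.read()
    # Inspect the raw bytes to confirm the file really is gzip-compressed.
    with open(path, "rb") as f:
        magic = f.read(2)
```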

textacy.io.utils.coerce_content_type(content: str | bytes, file_mode: str) → str | bytes

If the content to be written to file and the file_mode used to open it are incompatible (either bytes with text mode or unicode with bytes mode), try to coerce the content type so it can be written.

textacy.io.utils.split_records(items: Iterable, content_field: str | int, itemwise: bool = False) → Iterable

Split records’ content (text) from associated metadata, but keep them paired together.

Parameters
  • items – An iterable of dicts, e.g. as read from disk by read_json(lines=True), or an iterable of lists, e.g. as read from disk by read_csv().

  • content_field – If str, key in each dict item whose value is the item’s content (text); if int, index of the value in each list item corresponding to the item’s content (text).

  • itemwise – If True, content + metadata are paired item-wise as an iterable of (content, metadata) 2-tuples; if False, content + metadata are paired by position in two parallel iterables in the form of a (iterable(content), iterable(metadata)) 2-tuple.

Returns

Generator(Tuple[str, dict]): If itemwise is True and items is Iterable[dict]; the first element in each tuple is the item’s content, the second element is its metadata as a dictionary.

Generator(Tuple[str, list]): If itemwise is True and items is Iterable[list]; the first element in each tuple is the item’s content, the second element is its metadata as a list.

Tuple[Iterable[str], Iterable[dict]]: If itemwise is False and items is Iterable[dict]; the first element of the tuple is an iterable of items’ contents, the second is an iterable of their metadata dicts.

Tuple[Iterable[str], Iterable[list]]: If itemwise is False and items is Iterable[list]; the first element of the tuple is an iterable of items’ contents, the second is an iterable of their metadata lists.
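
The two pairing styles are easier to see side by side. The sketch below is a simplified, eager approximation with hypothetical names (split_record, split_records_sketch); unlike the real function it builds lists rather than lazy iterables when itemwise is False.

```python
from typing import Any, Dict, Iterable, List, Tuple, Union

Record = Union[Dict[str, Any], List[Any]]


def split_record(item: Record, content_field: Union[str, int]) -> Tuple[Any, Record]:
    """Pop the content out of one record, returning (content, metadata)."""
    # Copy first so the caller's record is left untouched.
    metadata = dict(item) if isinstance(item, dict) else list(item)
    content = metadata.pop(content_field)  # key for dicts, index for lists
    return content, metadata


def split_records_sketch(
    items: Iterable[Record], content_field: Union[str, int], itemwise: bool = False
):
    pairs = (split_record(item, content_field) for item in items)
    if itemwise:
        # Stream of (content, metadata) 2-tuples.
        return pairs
    # Two parallel sequences: all contents, then all metadata.
    pairs = list(pairs)
    return [c for c, _ in pairs], [m for _, m in pairs]


records = [
    {"title": "Harrison Bergeron", "text": "The year was 2081, and everybody was finally equal."},
    {"title": "2BR02B", "text": "Everything was perfectly swell."},
]
paired = list(split_records_sketch(records, "text", itemwise=True))
texts, metadatas = split_records_sketch(records, "text")
row_pairs = list(split_records_sketch([["That was a great movie!", 0.9]], 0, itemwise=True))
```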

textacy.io.utils.unzip(seq: Iterable) → Tuple

Borrowed from toolz.sandbox.core.unzip, but using cytoolz instead of toolz to avoid the additional dependency.
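
Conceptually, unzip is the inverse of zip: it turns a sequence of tuples into a tuple of lazy sequences. The following stdlib-only sketch shows one way to do that with itertools.tee; unzip_sketch is a hypothetical name and the code is an illustration of the idea, not the borrowed implementation itself.

```python
import itertools
from operator import itemgetter
from typing import Iterable, Tuple


def unzip_sketch(seq: Iterable) -> Tuple:
    """Split an iterable of n-tuples into a tuple of n lazy iterators."""
    seq = iter(seq)
    try:
        # Peek at the first item to learn how many output streams are needed.
        first = tuple(next(seq))
    except StopIteration:
        return tuple()
    n = len(first)
    # tee() gives n independent iterators over the same underlying items.
    copies = itertools.tee(itertools.chain([first], seq), n)
    # The i-th output picks element i from every item.
    return tuple(map(itemgetter(i), copy) for i, copy in enumerate(copies))


letters, numbers = unzip_sketch([("a", 1), ("b", 2), ("c", 3)])
letters, numbers = list(letters), list(numbers)
```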

textacy.io.utils.get_filepaths(dirpath: Union[str, pathlib.Path], *, match_regex: Optional[str] = None, ignore_regex: Optional[str] = None, extension: Optional[str] = None, ignore_invisible: bool = True, recursive: bool = False) → Iterable[str]

Yield full paths of files on disk under directory dirpath, optionally filtering for or against particular patterns or file extensions and crawling all subdirectories.

Parameters
  • dirpath – Path to directory on disk where files are stored.

  • match_regex – Regular expression pattern. Only files whose names match this pattern are included.

  • ignore_regex – Regular expression pattern. Only files whose names do not match this pattern are included.

  • extension – File extension, e.g. “.txt” or “.json”. Only files whose extensions match are included.

  • ignore_invisible – If True, ignore invisible files, i.e. those that begin with a period; otherwise, include them.

  • recursive – If True, iterate recursively through subdirectories in search of files to include; otherwise, only return files located directly under dirpath.

Yields

Next file’s name, including the full path on disk.

Raises

OSError – if dirpath is not found on disk
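
The filtering rules above can be approximated with os.walk, os.listdir, and re. iter_filepaths is a hypothetical helper; using re.search (rather than a full match) on the bare filename is an assumption, so treat this as a sketch of the semantics rather than the library's code.

```python
import os
import re
import tempfile
from typing import Iterable, Optional


def iter_filepaths(
    dirpath: str,
    *,
    match_regex: Optional[str] = None,
    ignore_regex: Optional[str] = None,
    extension: Optional[str] = None,
    ignore_invisible: bool = True,
    recursive: bool = False,
) -> Iterable[str]:
    if not os.path.exists(dirpath):
        raise OSError(f"directory {dirpath!r} does not exist")
    match_re = re.compile(match_regex) if match_regex else None
    ignore_re = re.compile(ignore_regex) if ignore_regex else None

    def is_wanted(filename: str) -> bool:
        if ignore_invisible and filename.startswith("."):
            return False
        if match_re and not match_re.search(filename):
            return False
        if ignore_re and ignore_re.search(filename):
            return False
        if extension and os.path.splitext(filename)[-1] != extension:
            return False
        return True

    if recursive:
        # Walk the whole tree under dirpath.
        for dirname, _, filenames in os.walk(dirpath):
            for filename in filenames:
                if is_wanted(filename):
                    yield os.path.join(dirname, filename)
    else:
        # Only files directly under dirpath.
        for filename in os.listdir(dirpath):
            full_path = os.path.join(dirpath, filename)
            if os.path.isfile(full_path) and is_wanted(filename):
                yield full_path


with tempfile.TemporaryDirectory() as tmpdir:
    os.makedirs(os.path.join(tmpdir, "sub"))
    for name in ("a.txt", "b.json", ".hidden.txt", os.path.join("sub", "c.txt")):
        with open(os.path.join(tmpdir, name), "wt") as f:
            f.write("x")
    top_level = sorted(
        os.path.relpath(p, tmpdir) for p in iter_filepaths(tmpdir, extension=".txt")
    )
    everything = sorted(
        os.path.relpath(p, tmpdir)
        for p in iter_filepaths(tmpdir, extension=".txt", recursive=True)
    )
```

Results are sorted in the demo because neither os.listdir nor os.walk guarantees any particular order.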

textacy.io.utils.download_file(url: str, *, filename: Optional[str] = None, dirpath: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data'), force: bool = False) → Optional[str]

Download a file from url and save it to disk.

Parameters
  • url – Web address from which to download data.

  • filename – Name of the file to which downloaded data is saved. If None, a filename will be inferred from the url.

  • dirpath – Full path to the directory on disk under which downloaded data will be saved as filename.

  • force – If True, download the data even if it already exists at dirpath/filename; otherwise, only download if the data doesn’t already exist on disk.

Returns

Full path of file saved to disk.

textacy.io.utils.get_filename_from_url(url: str) → str

Derive a filename from a URL’s path.

Parameters

url – URL from which to extract a filename.

Returns

Filename in URL.
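
Deriving a filename from a URL's path needs nothing beyond urllib.parse. The helper below, filename_from_url, is a hypothetical stdlib-only sketch; the example URLs are placeholders, and decoding percent-escapes is an assumption about the desired behavior.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url: str) -> str:
    """Return the last path segment of url, ignoring query and fragment."""
    # urlparse separates the path from the query string and fragment.
    path = urlparse(url).path
    # Decode percent-escapes such as %20 before taking the final segment.
    return posixpath.basename(unquote(path))


archive_name = filename_from_url("https://example.com/data/archive.tar.gz?download=1")
spaced_name = filename_from_url("https://example.com/files/my%20file.txt")
```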

textacy.io.utils.unpack_archive(filepath: Union[str, pathlib.Path], *, extract_dir: Optional[Union[str, pathlib.Path]] = None) → Union[str, pathlib.Path]

Extract data from a zip or tar archive file into a directory (or do nothing if the file isn’t an archive).

Parameters
  • filepath – Full path to file on disk from which archived contents will be extracted.

  • extract_dir – Full path of the directory into which contents will be extracted. If not provided, the same directory as filepath is used.

Returns

Path to directory of extracted contents.