File I/O
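read_text – Read the contents of a text file at filepath, either all at once or streaming line-by-line.
write_text – Write text data to disk at filepath, either all at once or streaming line-by-line.
read_json – Read the contents of a JSON file at filepath, either all at once or streaming item-by-item.
write_json – Write JSON data to disk at filepath, either all at once or streaming item-by-item.
read_csv – Read the contents of a CSV file at filepath, streaming line-by-line.
write_csv – Write rows of data to disk at filepath, one row per line.
read_sparse_matrix – Read the data, indices, indptr, and shape arrays from a .npz file on disk at filepath, and return an instantiated sparse matrix.
write_sparse_matrix – Write sparse matrix data to disk at filepath, optionally compressed, into a single .npz file.
read_spacy_docs – Read the contents of a file at filepath, written in binary or pickle format.
write_spacy_docs – Write one or more Docs to disk at filepath in binary or pickle format.
read_http_stream – Read data from url in a stream, either all at once or line-by-line.
write_http_stream – Download data from url in a stream, and write successive chunks to disk at filepath.
open_sesame – Open file filepath.
split_records – Split records’ content (text) from associated metadata, but keep them paired together.
unzip – Borrowed from toolz.sandbox.core.unzip, but using cytoolz instead of toolz to avoid the additional dependency.
get_filepaths – Yield full paths of files on disk under directory dirpath, optionally filtering for or against particular patterns or file extensions and crawling all subdirectories.
download_file – Download a file from url and save it to disk.
unpack_archive – Extract data from a zip or tar archive file into a directory (or do nothing if the file isn’t an archive).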
textacy.io.text: Functions for reading from and writing to disk records in plain text format, either as one text per file or one text per line in a file.
-
textacy.io.text.read_text(filepath: Union[str, pathlib.Path], *, mode: str = 'rt', encoding: Optional[str] = None, lines: bool = False) → Iterable[str]
Read the contents of a text file at filepath, either all at once or streaming line-by-line.
- Parameters
filepath – Path to file on disk from which data will be read.
mode – Mode with which filepath is opened.
encoding – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.
lines – If False, all data is read in at once; otherwise, data is read in one line at a time.
- Yields
Next line of text to read in. If lines is False, wrap this output in next() to conveniently access the full text.
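The generator behavior described above can be illustrated with a stdlib-only sketch. This is not textacy's implementation: read_text_sketch and the temporary file name are hypothetical, and keeping the trailing newline on each yielded line is an assumption of this sketch.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Optional


def read_text_sketch(
    filepath, *, encoding: Optional[str] = None, lines: bool = False
) -> Iterator[str]:
    """Yield the whole file once, or one line at a time if ``lines`` is True."""
    with open(filepath, mode="rt", encoding=encoding) as f:
        if lines:
            for line in f:
                yield line
        else:
            yield f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"  # hypothetical file name
    path.write_text("first line\nsecond line\n", encoding="utf-8")
    # lines=False: the generator yields exactly one item, so next() unwraps it.
    full_text = next(read_text_sketch(path, encoding="utf-8"))
    # lines=True: one item per line (trailing newlines are kept in this sketch).
    all_lines = list(read_text_sketch(path, encoding="utf-8", lines=True))
```

Because the function is a generator in both modes, callers get a uniform iterable interface; next() is only a convenience for the single-item case.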
-
textacy.io.text.write_text(data: str | Iterable[str], filepath: types.PathLike, *, mode: str = 'wt', encoding: Optional[str] = None, make_dirs: bool = False, lines: bool = False) → None
Write text data to disk at filepath, either all at once or streaming line-by-line.
- Parameters
data – If lines is False, a single string to write to disk; for example:
"isnt rick and morty that thing you get when you die and your body gets all stiff"
If lines is True, an iterable of strings to write to disk, one item per line; for example:
["isnt rick and morty that thing you get when you die and your body gets all stiff", "You're thinking of rigor mortis. Rick and morty is when you get trolled into watching \"never gonna give you up\"", "That's rickrolling. Rick and morty is a type of pasta"]
filepath – Path to file on disk to which data will be written.
mode – Mode with which filepath is opened.
encoding – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.
make_dirs – If True, automatically create (sub)directories if not already present in order to write filepath.
lines – If False, all data is written at once; otherwise, data is written to disk one line at a time.
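As a rough stdlib-only illustration of the make_dirs and lines options described above, the sketch below writes an iterable of strings one per line into a directory that does not exist yet. The name write_text_sketch and the paths are hypothetical, and appending a newline after each item is an assumption of the sketch rather than a statement about textacy's internals.

```python
import tempfile
from pathlib import Path


def write_text_sketch(data, filepath, *, encoding=None, make_dirs=False, lines=False):
    """Write a single string, or one string per line if ``lines`` is True."""
    filepath = Path(filepath)
    if make_dirs:
        # Create any missing parent directories before opening the file.
        filepath.parent.mkdir(parents=True, exist_ok=True)
    with open(filepath, mode="wt", encoding=encoding) as f:
        if lines:
            for line in data:
                f.write(line + "\n")
        else:
            f.write(data)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "new" / "subdir" / "lines.txt"  # hypothetical path
    write_text_sketch(["one", "two"], target, encoding="utf-8", make_dirs=True, lines=True)
    written = target.read_text(encoding="utf-8")
```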
textacy.io.json: Functions for reading from and writing to disk records in JSON format, as one record per file or one record per line in a file.
-
textacy.io.json.read_json(filepath: Union[str, pathlib.Path], *, mode: str = 'rt', encoding: Optional[str] = None, lines: bool = False) → Iterable
Read the contents of a JSON file at filepath, either all at once or streaming item-by-item.
- Parameters
filepath – Path to file on disk from which data will be read.
mode – Mode with which filepath is opened.
encoding – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.
lines – If False, all data is read in at once; otherwise, data is read in one line at a time.
- Yields
Next JSON item; could be a dict, list, int, float, str, depending on the data and the value of lines.
-
textacy.io.json.read_json_mash(filepath: Union[str, pathlib.Path], *, mode: str = 'rt', encoding: Optional[str] = None, buffer_size: int = 2048) → Iterable
Read the contents of a JSON file at filepath one item at a time, where all of the items have been mashed together, end-to-end, on a single line.
- Parameters
filepath – Path to file on disk from which data will be read.
mode – Mode with which filepath is opened.
encoding – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.
buffer_size – Number of bytes to read in as a chunk.
- Yields
Next valid JSON object, converted to native Python equivalent.
Note
Storing JSON data in this format is Not Good. Reading it is doable, so this function is included for users’ convenience, but note that there is no analogous write_json_mash() function. Don’t do it.
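One way to parse items mashed together end-to-end is the standard library's json.JSONDecoder.raw_decode, which returns both the decoded value and the index where it ended. The sketch below works on an in-memory string and ignores buffer_size entirely; iter_mashed_json is a hypothetical name, and this is not necessarily how textacy reads such files.

```python
import json
from typing import Any, Iterator


def iter_mashed_json(text: str) -> Iterator[Any]:
    """Yield successive JSON values from a string with no separators between them."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            # raw_decode does not skip leading whitespace, so step over it here.
            pos += 1
            continue
        item, pos = decoder.raw_decode(text, pos)
        yield item


mashed = '{"id": 1, "text": "foo"}{"id": 2, "text": "bar"}[1, 2, 3]'
items = list(iter_mashed_json(mashed))
```

A chunked reader would additionally need to carry an incomplete trailing item over to the next buffer, which this sketch leaves out.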
-
textacy.io.json.write_json(data: Any, filepath: types.PathLike, *, mode: str = 'wt', encoding: Optional[str] = None, make_dirs: bool = False, lines: bool = False, ensure_ascii: bool = False, separators: Tuple[str, str] = (',', ':'), sort_keys: bool = False, indent: Optional[int | str] = None) → None
Write JSON data to disk at filepath, either all at once or streaming item-by-item.
- Parameters
data – JSON data to write to disk, including any Python objects encodable by default in json, as well as dates and datetimes. For example:
[
    {"title": "Harrison Bergeron", "text": "The year was 2081, and everybody was finally equal."},
    {"title": "2BR02B", "text": "Everything was perfectly swell."},
    {"title": "Slaughterhouse-Five", "text": "All this happened, more or less."},
]
If lines is False, all of data is written as a single object; if True, each item is written to a separate line in filepath.
filepath – Path to file on disk to which data will be written.
mode – Mode with which filepath is opened.
encoding – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.
make_dirs – If True, automatically create (sub)directories if not already present in order to write filepath.
lines – If False, all data is written at once; otherwise, data is written to disk one item at a time.
ensure_ascii – If True, all non-ASCII characters are escaped; otherwise, non-ASCII characters are output as-is.
separators – An (item_separator, key_separator) pair specifying how items and keys are separated in output.
sort_keys – If True, each output dictionary is sorted by key; otherwise, dictionary ordering is taken as-is.
indent – If a positive integer or non-empty string, items are pretty-printed with the specified indent level; if 0, negative, or “”, only newlines are inserted between items; if None, the most compact representation is used when storing data.
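The one-record-per-line layout that lines=True describes can be sketched with the standard json module alone. The snippet below is an illustration under assumptions (the file name records.jsonl is hypothetical, and the options mirror the defaults shown in the signature above); it does not call textacy.

```python
import json
import tempfile
from pathlib import Path

records = [
    {"title": "Harrison Bergeron", "text": "The year was 2081, and everybody was finally equal."},
    {"title": "2BR02B", "text": "Everything was perfectly swell."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "records.jsonl"  # hypothetical file name
    with open(path, mode="wt", encoding="utf-8") as f:
        for record in records:
            # Compact separators and ensure_ascii=False mirror the signature's defaults.
            f.write(json.dumps(record, ensure_ascii=False, separators=(",", ":")) + "\n")
    with open(path, mode="rt", encoding="utf-8") as f:
        first_line = f.readline()
        f.seek(0)
        # Each line is a complete JSON document, so it can be parsed on its own.
        loaded = [json.loads(line) for line in f]
```

Writing one item per line lets a reader stream records without loading the whole file, which is the trade-off the lines flag controls.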
-
class textacy.io.json.ExtendedJSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)
Sub-class of json.JSONEncoder, used to write JSON data to disk in write_json() while handling a broader range of Python objects.
datetime.datetime => ISO-formatted string
datetime.date => ISO-formatted string
-
default(obj)
Implement this method in a subclass such that it returns a serializable object for obj, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:
    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        # Let the base class default method raise the TypeError
        return JSONEncoder.default(self, o)
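For the date handling listed above, a comparable encoder can be sketched in a few lines. DateAwareEncoder is a hypothetical stand-in written for illustration, not the source of ExtendedJSONEncoder.

```python
import datetime
import json


class DateAwareEncoder(json.JSONEncoder):
    """Hypothetical encoder that serializes dates and datetimes as ISO strings."""

    def default(self, obj):
        # datetime.datetime is a subclass of datetime.date, so one check covers both.
        if isinstance(obj, datetime.date):
            return obj.isoformat()
        # Anything else falls through to the base class, which raises TypeError.
        return super().default(obj)


encoded = json.dumps(
    {"published": datetime.date(1961, 10, 1), "fetched": datetime.datetime(2021, 4, 12, 9, 30)},
    cls=DateAwareEncoder,
)
```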
textacy.io.csv: Functions for reading from and writing to disk records in CSV format, where CSVs may be delimited not only by commas (the default) but tabs, pipes, and other valid one-char delimiters.
-
textacy.io.csv.read_csv(filepath: types.PathLike, *, encoding: Optional[str] = None, fieldnames: Optional[str | Sequence[str]] = None, dialect: str | Type[csv.Dialect] = 'excel', delimiter: str = ',', quoting: int = 2) → Iterable[list] | Iterable[dict]
Read the contents of a CSV file at filepath, streaming line-by-line, where each line is a list of strings and/or floats whose values are separated by delimiter.
- Parameters
filepath – Path to file on disk from which data will be read.
encoding – Name of the encoding used to decode or encode the data in filepath.
fieldnames – If specified, gives names for columns of values, which are used as keys in an ordered dictionary representation of each line’s data. If ‘infer’, the first kB of data is analyzed to make a guess about whether the first row is a header of column names, and if so, those names are used as keys. If None, no column names are used, and each line is returned as a list of strings/floats.
dialect – Grouping of formatting parameters that determine how the data is parsed when reading/writing. If ‘infer’, the first kB of data is analyzed to get a best guess for the correct dialect.
delimiter – 1-character string used to separate fields in a row.
quoting – Type of quoting to apply to field values. See: https://docs.python.org/3/library/csv.html#csv.QUOTE_NONNUMERIC
- Yields
List[obj] – Next row, whose elements are strings and/or floats, if fieldnames is None or ‘infer’ doesn’t detect a header row; or
Dict[str, obj] – Next row, as an ordered dictionary of (key, value) pairs, where keys are column names and values are the corresponding strings and/or floats, if fieldnames is a list of column names or ‘infer’ detects a header row.
-
textacy.io.csv.write_csv(data: Iterable[Dict[str, Any]] | Iterable[Iterable], filepath: types.PathLike, *, encoding: Optional[str] = None, make_dirs: bool = False, fieldnames: Optional[Sequence[str]] = None, dialect: str = 'excel', delimiter: str = ',', quoting: int = 2) → None
Write rows of data to disk at filepath, where each row is an iterable or a dictionary of strings and/or numbers, written to one line with values separated by delimiter.
- Parameters
data – If fieldnames is None, an iterable of iterables of strings and/or numbers to write to disk; for example:
[['That was a great movie!', 0.9], ['The movie was okay, I guess.', 0.2], ['Worst. Movie. Ever.', -1.0]]
If fieldnames is specified, an iterable of dictionaries with string and/or number values to write to disk; for example:
[{'text': 'That was a great movie!', 'score': 0.9}, {'text': 'The movie was okay, I guess.', 'score': 0.2}, {'text': 'Worst. Movie. Ever.', 'score': -1.0}]
filepath – Path to file on disk to which data will be written.
encoding – Name of the encoding used to decode or encode the data in filepath.
make_dirs – If True, automatically create (sub)directories if not already present in order to write filepath.
fieldnames – Sequence of keys that identify the order in which the values in each row’s dictionary are written to filepath. These are included in filepath as a header row of column names.
Note
Only specify this if data is an iterable of dictionaries.
dialect – Grouping of formatting parameters that determine how the data is parsed when reading/writing.
delimiter – 1-character string used to separate fields in a row.
quoting – Type of quoting to apply to field values. See: https://docs.python.org/3/library/csv.html#csv.QUOTE_NONNUMERIC
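The default quoting value of 2 in both signatures equals the standard library's csv.QUOTE_NONNUMERIC, which quotes non-numeric fields on write and converts unquoted fields to floats on read. The sketch below shows that round trip with the csv module directly, for both list rows and dictionary rows; the file names are hypothetical and textacy itself is not called.

```python
import csv
import tempfile
from pathlib import Path

rows = [["That was a great movie!", 0.9], ["Worst. Movie. Ever.", -1.0]]
dict_rows = [{"text": text, "score": score} for text, score in rows]

with tempfile.TemporaryDirectory() as tmp:
    # Rows as lists: no header, strings are quoted and numbers are not.
    list_path = Path(tmp) / "rows.csv"  # hypothetical file name
    with open(list_path, mode="wt", encoding="utf-8", newline="") as f:
        csv.writer(f, delimiter=",", quoting=csv.QUOTE_NONNUMERIC).writerows(rows)
    with open(list_path, mode="rt", encoding="utf-8", newline="") as f:
        read_rows = list(csv.reader(f, delimiter=",", quoting=csv.QUOTE_NONNUMERIC))

    # Rows as dicts: fieldnames fix the column order and become the header row.
    dict_path = Path(tmp) / "dict_rows.csv"  # hypothetical file name
    with open(dict_path, mode="wt", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "score"], quoting=csv.QUOTE_NONNUMERIC)
        writer.writeheader()
        writer.writerows(dict_rows)
    with open(dict_path, mode="rt", encoding="utf-8", newline="") as f:
        read_dicts = [dict(row) for row in csv.DictReader(f, quoting=csv.QUOTE_NONNUMERIC)]
```

This is why rows come back as "strings and/or floats": with this quoting mode, every unquoted field is parsed as a float.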
textacy.io.matrix: Functions for reading from and writing to disk CSC and CSR sparse matrices in numpy binary format.
-
textacy.io.matrix.read_sparse_matrix(filepath: types.PathLike, *, kind: str = 'csc') → sp.csc_matrix | sp.csr_matrix
Read the data, indices, indptr, and shape arrays from a .npz file on disk at filepath, and return an instantiated sparse matrix.
- Parameters
filepath – Path to file on disk from which data will be read.
kind ({'csc', 'csr'}) – Kind of sparse matrix to instantiate.
- Returns
An instantiated sparse matrix, whose type depends on the value of kind.
-
textacy.io.matrix.write_sparse_matrix(data: sp.csc_matrix | sp.csr_matrix, filepath: types.PathLike, *, compressed: bool = True, make_dirs: bool = False) → None
Write sparse matrix data to disk at filepath, optionally compressed, into a single .npz file.
- Parameters
data – Sparse matrix (CSC or CSR) to write to disk.
filepath – Path to file on disk to which data will be written. If filepath does not end in .npz, that extension is automatically appended to the name.
compressed – If True, save arrays into a single file in compressed numpy binary format.
make_dirs – If True, automatically create (sub)directories if not already present in order to write filepath.
textacy.io.spacy: Functions for reading from and writing to disk spaCy documents in either pickle or binary format. Be warned: Both formats have pros and cons.
-
textacy.io.spacy.read_spacy_docs(filepath: Union[str, pathlib.Path], *, format: str = 'binary', lang: Optional[Union[str, pathlib.Path, spacy.language.Language]] = None) → Iterable[spacy.tokens.doc.Doc]
Read the contents of a file at filepath, written in binary or pickle format.
- Parameters
filepath – Path to file on disk from which data will be read.
format ({"binary", "pickle"}) – Format of the data that was written to disk. If “binary”, uses spacy.tokens.DocBin to deserialize data; if “pickle”, uses Python’s stdlib pickle.
Warning
Docs written in pickle format were saved all together as a list, which means they’re all loaded into memory at once before streaming one by one. Mind your RAM usage, especially when reading many docs!
lang – Language with which spaCy originally processed docs, represented as the full name of or path on disk to the pipeline, or an already instantiated pipeline instance. Note that this is only required when format is “binary”.
- Yields
Next deserialized document.
- Raises
ValueError – if format is not “binary” or “pickle”, or if lang is None when format="binary"
-
textacy.io.spacy.write_spacy_docs(data: Doc | Iterable[Doc], filepath: types.PathLike, *, make_dirs: bool = False, format: str = 'binary', attrs: Optional[Iterable[str]] = None, store_user_data: bool = False) → None
Write one or more Docs to disk at filepath in binary or pickle format.
- Parameters
data – A single Doc or a sequence of Docs to write to disk.
filepath – Path to file on disk to which data will be written.
make_dirs – If True, automatically create (sub)directories if not already present in order to write filepath.
format ({"pickle", "binary"}) – Format of the data written to disk. If “binary”, uses spacy.tokens.DocBin to serialize data; if “pickle”, uses Python’s stdlib pickle.
Warning
When writing docs in pickle format, all the docs in data must be saved as a list, which means they’re all loaded into memory. Mind your RAM usage, especially when writing many docs!
attrs – List of attributes to serialize if format is “binary”. If None, spaCy’s default values are used; see here: https://spacy.io/api/docbin#init
store_user_data – If True, write Doc.user_data and the values of custom extension attributes to disk; otherwise, don’t.
- Raises
ValueError – if format is not “binary” or “pickle”
textacy.io.http: Functions for reading data from URLs via streaming HTTP requests and either reading it into memory or writing it directly to disk.
-
textacy.io.http.read_http_stream(url: str, *, lines: bool = False, decode_unicode: bool = False, chunk_size: int = 1024, auth: Optional[Tuple[str, str]] = None) → Iterable[str] | Iterable[bytes]
Read data from url in a stream, either all at once or line-by-line.
- Parameters
url – URL to which a GET request is made for data.
lines – If False, yield all of the data at once; otherwise, yield data line-by-line.
decode_unicode – If True, yield data as unicode, where the encoding is taken from the HTTP response headers; otherwise, yield bytes.
chunk_size – Number of bytes read into memory per chunk. Because decoding may occur, this is not necessarily the length of each chunk.
auth – (username, password) pair for simple HTTP authentication required (if at all) to access the data at url.
- Yields
If lines is True, the next line in the response data, which is bytes if decode_unicode is False or unicode otherwise. If lines is False, yields the full response content, either as bytes or unicode.
-
textacy.io.http.write_http_stream(url: str, filepath: Union[str, pathlib.Path], *, mode: str = 'wt', encoding: Optional[str] = None, make_dirs: bool = False, chunk_size: int = 1024, auth: Optional[Tuple[str, str]] = None) → None
Download data from url in a stream, and write successive chunks to disk at filepath.
- Parameters
url – URL to which a GET request is made for data.
filepath – Path to file on disk to which data will be written.
mode – Mode with which filepath is opened.
encoding – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.
Note
The encoding on the HTTP response is inferred from its headers, or set to ‘utf-8’ as a fall-back in the case that no encoding is detected. It is not set by encoding.
make_dirs – If True, automatically create (sub)directories if not already present in order to write filepath.
chunk_size – Number of bytes read into memory per chunk. Because decoding may occur, this is not necessarily the length of each chunk.
auth – (username, password) pair for simple HTTP authentication required (if at all) to access the data at url.
I/O Utils
textacy.io.utils: Functions to help read and write data to disk in a variety of formats.
-
textacy.io.utils.open_sesame(filepath: Union[str, pathlib.Path], *, mode: str = 'rt', encoding: Optional[str] = None, errors: Optional[str] = None, newline: Optional[str] = None, compression: str = 'infer', make_dirs: bool = False) → IO
Open file filepath. Automatically handle file compression, relative paths and symlinks, and missing intermediate directory creation, as needed.
open_sesame may be used as a drop-in replacement for io.open().
- Parameters
filepath – Path on disk (absolute or relative) of the file to open.
mode – The mode in which filepath is opened.
encoding – Name of the encoding used to decode or encode filepath. Only applicable in text mode.
errors – String specifying how encoding/decoding errors are handled. Only applicable in text mode.
newline – String specifying how universal newlines mode works. Only applicable in text mode.
compression – Type of compression, if any, with which filepath is read from or written to disk. If None, no compression is used; if ‘infer’, compression is inferred from the extension on filepath.
make_dirs – If True, automatically create (sub)directories if not already present in order to write filepath.
- Returns
file object
- Raises
TypeError – if filepath is not a string
ValueError – if encoding is specified but mode is binary
OSError – if filepath doesn’t exist but mode is read
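The idea of inferring compression from the file extension and creating missing directories on write can be sketched with the standard gzip, bz2, and lzma modules. The helper name, the extension-to-opener table, and the file path below are assumptions made for illustration; the set of formats open_sesame actually supports is not stated in this section.

```python
import bz2
import gzip
import lzma
import tempfile
from pathlib import Path

# Assumed mapping for this sketch; each opener accepts mode and encoding like open().
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_inferring_compression(filepath, *, mode="rt", encoding=None, make_dirs=False):
    """Open a file, choosing a (de)compressing opener from the final extension."""
    filepath = Path(filepath)
    if make_dirs and any(flag in mode for flag in "wax"):
        # Only create directories when the file is opened for writing.
        filepath.parent.mkdir(parents=True, exist_ok=True)
    opener = OPENERS.get(filepath.suffix, open)
    return opener(filepath, mode=mode, encoding=encoding)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "nested" / "dir" / "notes.txt.gz"  # hypothetical path
    with open_inferring_compression(path, mode="wt", encoding="utf-8", make_dirs=True) as f:
        f.write("compressed text\n")
    with open_inferring_compression(path, mode="rt", encoding="utf-8") as f:
        round_tripped = f.read()
    magic = path.read_bytes()[:2]  # gzip files start with the bytes 1f 8b
```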
-
textacy.io.utils.coerce_content_type(content: str | bytes, file_mode: str) → str | bytes
If the content to be written to file and the file_mode used to open it are incompatible (either bytes with text mode or unicode with bytes mode), try to coerce the content type so it can be written.
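A minimal sketch of that coercion, assuming UTF-8 as the codec (the encoding parameter and the function name are hypothetical additions for this example, not part of the documented signature):

```python
def coerce_content_type_sketch(content, file_mode, encoding="utf-8"):
    """Return content as bytes for binary modes and as str for text modes."""
    is_binary_mode = "b" in file_mode
    if is_binary_mode and isinstance(content, str):
        return content.encode(encoding)
    if not is_binary_mode and isinstance(content, bytes):
        return content.decode(encoding)
    # Already compatible with the file mode: return unchanged.
    return content


as_bytes = coerce_content_type_sketch("héllo", "wb")
as_text = coerce_content_type_sketch(b"h\xc3\xa9llo", "wt")
unchanged = coerce_content_type_sketch("plain", "wt")
```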
-
textacy.io.utils.split_records(items: Iterable, content_field: str | int, itemwise: bool = False) → Iterable
Split records’ content (text) from associated metadata, but keep them paired together.
- Parameters
items – An iterable of dicts, e.g. as read from disk by read_json(lines=True), or an iterable of lists, e.g. as read from disk by read_csv().
content_field – If str, key in each dict item whose value is the item’s content (text); if int, index of the value in each list item corresponding to the item’s content (text).
itemwise – If True, content + metadata are paired item-wise as an iterable of (content, metadata) 2-tuples; if False, content + metadata are paired by position in two parallel iterables in the form of a (iterable(content), iterable(metadata)) 2-tuple.
- Returns
Generator(Tuple[str, dict]): If itemwise is True and items is Iterable[dict]; the first element in each tuple is the item’s content, the second element is its metadata as a dictionary.
Generator(Tuple[str, list]): If itemwise is True and items is Iterable[list]; the first element in each tuple is the item’s content, the second element is its metadata as a list.
Tuple[Iterable[str], Iterable[dict]]: If itemwise is False and items is Iterable[dict]; the first element of the tuple is an iterable of items’ contents, the second is an iterable of their metadata dicts.
Tuple[Iterable[str], Iterable[list]]: If itemwise is False and items is Iterable[list]; the first element of the tuple is an iterable of items’ contents, the second is an iterable of their metadata lists.
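The two pairing modes can be illustrated with a small stand-in. split_records_sketch below is a hypothetical re-implementation for illustration only; it copies each record before removing the content field, and its non-itemwise branch uses zip(*...), which assumes at least one record.

```python
def split_records_sketch(items, content_field, itemwise=False):
    """Separate each record's content from its remaining metadata."""

    def split_one(item):
        # Copy so the caller's record is left untouched, then pop the content out.
        metadata = dict(item) if isinstance(item, dict) else list(item)
        content = metadata.pop(content_field)
        return content, metadata

    pairs = (split_one(item) for item in items)
    if itemwise:
        return pairs  # iterable of (content, metadata) 2-tuples
    contents, metadatas = zip(*pairs)  # two parallel sequences
    return contents, metadatas


dict_items = [{"text": "foo", "author": "a"}, {"text": "bar", "author": "b"}]
list_items = [["foo", 1], ["bar", 2]]

itemwise_pairs = list(split_records_sketch(dict_items, "text", itemwise=True))
contents, metadatas = split_records_sketch(dict_items, "text")
list_contents, list_metadatas = split_records_sketch(list_items, 0)
```

The item-wise form suits streaming one record at a time, while the parallel form suits passing all texts to one consumer and all metadata to another.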
-
textacy.io.utils.unzip(seq: Iterable) → Tuple
Borrowed from toolz.sandbox.core.unzip, but using cytoolz instead of toolz to avoid the additional dependency.
-
textacy.io.utils.get_filepaths(dirpath: Union[str, pathlib.Path], *, match_regex: Optional[str] = None, ignore_regex: Optional[str] = None, extension: Optional[str] = None, ignore_invisible: bool = True, recursive: bool = False) → Iterable[str]
Yield full paths of files on disk under directory dirpath, optionally filtering for or against particular patterns or file extensions and crawling all subdirectories.
- Parameters
dirpath – Path to directory on disk where files are stored.
match_regex – Regular expression pattern. Only files whose names match this pattern are included.
ignore_regex – Regular expression pattern. Only files whose names do not match this pattern are included.
extension – File extension, e.g. “.txt” or “.json”. Only files whose extensions match are included.
ignore_invisible – If True, ignore invisible files, i.e. those whose names begin with a period; otherwise, include them.
recursive – If True, iterate recursively through subdirectories in search of files to include; otherwise, only return files located directly under dirpath.
- Yields
Next file’s name, including the full path on disk.
- Raises
OSError – if dirpath is not found on disk
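The filtering rules above can be sketched with os.walk and the re module. iter_filepaths is a hypothetical stand-in; whether textacy matches patterns with search or match semantics is not stated here, so the use of re.search against the bare file name is an assumption.

```python
import os
import re
import tempfile


def iter_filepaths(dirpath, *, match_regex=None, ignore_regex=None,
                   extension=None, ignore_invisible=True, recursive=False):
    """Yield full paths of files under dirpath that pass every filter."""
    if not os.path.isdir(dirpath):
        raise OSError(f"directory {dirpath!r} not found")
    match_re = re.compile(match_regex) if match_regex else None
    ignore_re = re.compile(ignore_regex) if ignore_regex else None
    for root, _dirnames, filenames in os.walk(dirpath):
        for name in sorted(filenames):
            if ignore_invisible and name.startswith("."):
                continue
            if extension and os.path.splitext(name)[1] != extension:
                continue
            if match_re and not match_re.search(name):
                continue
            if ignore_re and ignore_re.search(name):
                continue
            yield os.path.join(root, name)
        if not recursive:
            break  # os.walk visits dirpath first, so stop before descending


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    for rel in ("a.txt", "b.json", ".hidden.txt", os.path.join("sub", "c.txt")):
        with open(os.path.join(tmp, rel), "w", encoding="utf-8") as f:
            f.write("x")
    top_level_txt = sorted(os.path.basename(p) for p in iter_filepaths(tmp, extension=".txt"))
    recursive_txt = sorted(
        os.path.basename(p) for p in iter_filepaths(tmp, extension=".txt", recursive=True)
    )
    starts_with_b = [os.path.basename(p) for p in iter_filepaths(tmp, match_regex=r"^b")]
```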
-
textacy.io.utils.download_file(url: str, *, filename: Optional[str] = None, dirpath: Union[str, pathlib.Path] = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/textacy/envs/0.11.0/lib/python3.8/site-packages/textacy/data'), force: bool = False) → Optional[str]
Download a file from url and save it to disk.
- Parameters
url – Web address from which to download data.
filename – Name of the file to which downloaded data is saved. If None, a filename will be inferred from the url.
dirpath – Full path to the directory on disk under which downloaded data will be saved as filename.
force – If True, download the data even if it already exists at dirpath/filename; otherwise, only download if the data doesn’t already exist on disk.
- Returns
Full path of file saved to disk.
-
textacy.io.utils.get_filename_from_url(url: str) → str
Derive a filename from a URL’s path.
- Parameters
url – URL from which to extract a filename.
- Returns
Filename in URL.
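Deriving a filename from a URL's path can be done with urllib.parse, as in the sketch below. The function name and the example URL are hypothetical, and handling of edge cases such as URLs ending in a slash is left out.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring query string and fragment."""
    return posixpath.basename(urlparse(url).path)


# Hypothetical URL used only for illustration.
name = filename_from_url("https://example.com/files/corpus.tar.gz?download=1")
```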
-
textacy.io.utils.unpack_archive(filepath: Union[str, pathlib.Path], *, extract_dir: Optional[Union[str, pathlib.Path]] = None) → Union[str, pathlib.Path]
Extract data from a zip or tar archive file into a directory (or do nothing if the file isn’t an archive).
- Parameters
filepath – Full path to file on disk from which archived contents will be extracted.
extract_dir – Full path of the directory into which contents will be extracted. If not provided, the same directory as filepath is used.
- Returns
Path to directory of extracted contents.
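A rough equivalent can be put together from zipfile, tarfile, and shutil. The sketch below is an assumption-laden illustration: unpack_archive_sketch is a hypothetical name, it relies on the archive having a conventional extension so that shutil can pick the format, and returning the original path for a non-archive is a guess at what "do nothing" means for the return value.

```python
import shutil
import tarfile
import tempfile
import zipfile
from pathlib import Path


def unpack_archive_sketch(filepath, *, extract_dir=None):
    """Extract a zip or tar archive; leave any other file alone."""
    filepath = Path(filepath)
    extract_dir = Path(extract_dir) if extract_dir is not None else filepath.parent
    if zipfile.is_zipfile(str(filepath)) or tarfile.is_tarfile(str(filepath)):
        # shutil infers the archive format from the file extension.
        shutil.unpack_archive(str(filepath), extract_dir=str(extract_dir))
        return extract_dir
    return filepath  # assumption: non-archives are returned untouched


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "bundle.zip"  # hypothetical archive
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("docs/readme.txt", "hello")
    out_dir = unpack_archive_sketch(archive, extract_dir=Path(tmp) / "out")
    extracted_text = (out_dir / "docs" / "readme.txt").read_text(encoding="utf-8")

    plain = Path(tmp) / "notes.txt"
    plain.write_text("not an archive", encoding="utf-8")
    plain_result = unpack_archive_sketch(plain)
    plain_was_returned = plain_result == plain
```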