API Reference#

This page documents the public surface of zcollection. Everything listed here is importable from the top-level zcollection package (with a few sub-namespaces for grouped APIs: zcollection.codecs, zcollection.merge, zcollection.partitioning, zcollection.view, zcollection.indexing, zcollection.aio).

Building a schema#

A DatasetSchema is the immutable description of a dataset: dimensions, variables, attributes, child groups, and a format version. The recommended way to build one is the fluent Schema() factory:

import numpy
import zcollection as zc

schema = (
    zc.Schema()
    .with_dimension("time", chunks=4096)
    .with_dimension("x_ac", size=240, chunks=240)
    .with_variable("time", dtype="datetime64[ns]", dimensions=("time",))
    .with_variable(
        "ssh",
        dtype="float32",
        dimensions=("time", "x_ac"),
        fill_value=numpy.float32("nan"),
        codecs=zc.codecs.profile("cloud-balanced"),
    )
    .build()
)

Hierarchical schemas#

Variables can be placed inside nested groups by passing group= to with_dimension(), with_variable(), and with_attribute(). Use with_group() to attach group-level attributes ahead of time. Intermediate groups along the path are created on demand. Variables in a child group may reference dimensions declared on any ancestor (dimension inheritance):

schema = (
    zc.Schema()
    .with_dimension("time", chunks=4096)
    .with_variable("time", dtype="int64", dimensions=("time",))
    .with_group("/data_01/ku", attrs={"band": "Ku"})
    .with_dimension("range", size=240, chunks=240, group="/data_01/ku")
    .with_variable(
        "power",
        dtype="float32",
        dimensions=("time", "range"),  # ``time`` inherited from root
        group="/data_01/ku",
    )
    .build()
)

The resulting DatasetSchema exposes the hierarchy through groups and all_variables() (keys are absolute paths). with_partition_axis() recurses into nested groups, marking variables that don’t span the partitioning axis as immutable.
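
For example, a small sketch continuing the hierarchical schema above: bind the partition axis, then inspect which variables were tagged immutable.

partitioned = schema.with_partition_axis("time")
for path, var in partitioned.all_variables().items():
    print(path, var.immutable)  # e.g. "/data_01/ku/power" spans "time", so immutable is False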

zcollection.Schema()[source]#

Shorthand for SchemaBuilder.

Return type:

SchemaBuilder

class zcollection.SchemaBuilder[source]#

Mutable builder that produces an immutable DatasetSchema.

build()[source]#

Build and return the immutable DatasetSchema.

Return type:

DatasetSchema

with_attribute(name, value, *, group=None)[source]#

Register an attribute on the root or a nested group.

Parameters:
  • name (str) – Attribute name.

  • value (Any) – Attribute value.

  • group (str | None) – Optional path of the group this attribute belongs to. Defaults to the root group.

Returns:

This builder, to allow chaining.

Return type:

SchemaBuilder

with_dimension(name, *, size=None, chunks=None, shards=None, group=None)[source]#

Register a dimension on the root or a nested group.

Parameters:
  • name (str) – Dimension name.

  • size (int | None) – Fixed size, or None if unknown (e.g. partitioning axis).

  • chunks (int | None) – Chunk size along this dimension; None to use the full extent.

  • shards (int | None) – Shard size along this dimension; None for no sharding.

  • group (str | None) – Optional path of the group this dimension belongs to. Defaults to the root group.

Returns:

This builder, to allow chaining.

Return type:

SchemaBuilder

with_group(path, *, attrs=None)[source]#

Declare a nested group at path, optionally with attributes.

Intermediate groups along the path are created if missing. Calling this is only required to attach attributes to a group ahead of time; otherwise with_variable() and with_dimension() will create groups lazily.

Parameters:
  • path (str) – Absolute or relative group path (e.g. "/data_01/ku").

  • attrs (dict[str, Any] | None) – Optional group-level attributes.

Returns:

This builder, to allow chaining.

Return type:

SchemaBuilder

with_variable(name, *, dtype, dimensions, fill_value=None, codecs=None, attrs=None, role=VariableRole.USER, group=None)[source]#

Register a variable on the root or a nested group.

Parameters:
  • name (str) – Variable name.

  • dtype (Any) – NumPy dtype or anything numpy.dtype() accepts.

  • dimensions (Iterable[str]) – Names of the dimensions this variable spans. Each dimension must be declared on the same group or an ancestor.

  • fill_value (Any | None) – Optional fill value.

  • codecs (CodecStack | None) – Optional explicit codec stack; auto-detected when None.

  • attrs (dict[str, Any] | None) – Optional attribute mapping.

  • role (VariableRole) – Provenance role of the variable.

  • group (str | None) – Optional path of the group this variable belongs to. Defaults to the root group.

Returns:

This builder, to allow chaining.

Return type:

SchemaBuilder

class zcollection.DatasetSchema(name='/', dimensions=<factory>, variables=<factory>, groups=<factory>, attrs=<factory>, format_version=1)[source]#

Immutable description of a collection’s dataset.

Extends GroupSchema with a format_version field and the JSON envelope used by the on-disk _zcollection.json config. The root name is always "/".

all_variables_by_role(*, immutable=None)[source]#

Return all variables across the tree (keyed by absolute path).

Parameters:

immutable (bool | None)

Return type:

dict[str, VariableSchema]

property dim_chunks: dict[str, int | None]#

Return a mapping of dimension name to declared chunk size.

Walks the entire tree so chunk hints declared on ancestor groups are visible to descendants.

property dim_sizes: dict[str, int | None]#

Return a mapping of dimension name to declared size (root only).

format_version: int#

Format version of this schema; used for compatibility checks and upgrades.

classmethod from_json(payload)[source]#

Build a schema from a JSON-compatible dict, upgrading older formats.

Parameters:

payload (dict[str, Any])

Return type:

DatasetSchema

select(names)[source]#

Return a new schema restricted to the named variables.

names may be short names (resolved against the root group) or absolute paths (/grp/sub/var). Empty groups in the tree are pruned from the result.

Raises:

SchemaError – If any of names is not a known variable.

Parameters:

names (Iterable[str])

Return type:

DatasetSchema

to_json()[source]#

Serialize the schema (root + nested groups) to a JSON dict.

Return type:

dict[str, Any]

variables_by_role(*, immutable=None)[source]#

Return root-group variables, optionally filtered by immutable.

Parameters:

immutable (bool | None) – If given, keep only variables whose immutable attribute matches this value.

Returns:

A tuple of the matching variables in declaration order.

Return type:

tuple[VariableSchema, …]

with_partition_axis(axis)[source]#

Bind the partition axis and tag each variable accordingly.

Each variable is marked immutable iff it does not span axis.

The immutable tag means “constant across all partitions”: the variable is written once at the collection root (_immutable/) and merged into the dataset returned by every partition open.

A Collection only knows two kinds of variables:

  • Partitioned — variables that span axis. Their rows are split across partitions.

  • Immutable — variables whose dimensions are all declared with a fixed size. They are the same in every partition.

Anything else (an unbounded dimension other than axis) is forbidden: the collection has no rule to slice such a variable, no rule to merge it across granules, and no rule to deduplicate it. If a dataset carries two independent unbounded series (e.g. a 1 Hz and a 20 Hz time series), they belong in two different collections, not one.

Parameters:

axis (str) – Name of the partition axis (a root dimension).

Returns:

A new DatasetSchema with immutable flags set on every variable.

Raises:

SchemaError – If axis is not a root dimension, or if any variable references an unbounded dimension other than axis.

Return type:

DatasetSchema

class zcollection.GroupSchema(name='/', dimensions=<factory>, variables=<factory>, groups=<factory>, attrs=<factory>)[source]#

Immutable description of a (possibly nested) group of variables.

Parameters:
  • name (str) – Short name of the group ("/" for the root).

  • dimensions (Mapping[str, Dimension]) – Mapping of dimension name to dimension metadata declared on this group. Variables declared on this group or any descendant may reference dimensions declared on this group or any ancestor.

  • variables (Mapping[str, VariableSchema]) – Variables declared at this group level.

  • groups (Mapping[str, GroupSchema]) – Child groups, keyed by short name.

  • attrs (Mapping[str, Any]) – Group-level attributes.

all_variables()[source]#

Return every variable in the tree keyed by absolute path.

Return type:

dict[str, VariableSchema]

attrs: Mapping[str, Any]#

Optional attributes associated with this group.

dimensions: Mapping[str, Dimension]#

Mapping of dimension name to dimension metadata.

classmethod from_json(payload)[source]#

Build a group schema from a JSON-compatible dict.

Parameters:

payload (dict[str, Any])

Return type:

GroupSchema

get_group(path)[source]#

Return the group at path (absolute or relative).

Raises:

KeyError – If the path does not resolve to a known group.

Parameters:

path (str)

Return type:

GroupSchema

groups: Mapping[str, GroupSchema]#

Mapping of child group name to child GroupSchema.

iter_groups()[source]#

Yield all descendant groups depth-first (excluding self).

Return type:

Iterable[GroupSchema]

name: str#

Short name of the group; "/" for the root.

to_json()[source]#

Serialize this group (recursively) to a JSON-compatible dict.

Return type:

dict[str, Any]

variables: Mapping[str, VariableSchema]#

Mapping of variable name to variable metadata.

with_group(path, group)[source]#

Return a copy with group inserted at path (relative).

path is the parent path; the inserted group keeps its own name. Intermediate groups are created if missing.

Parameters:
  • path (str) – Parent path under which to insert (absolute or relative).

  • group (GroupSchema) – The group to insert; it keeps its own short name.

Return type:

GroupSchema

class zcollection.Dimension(name, size=None, chunks=None, shards=None)[source]#

A named axis of a dataset.

size is None when unknown (typical for the partitioning axis). chunks is None when unconstrained (use full extent on write). shards is None when unsharded; otherwise the shard size in elements along this dimension.

Parameters:
  • name (str)

  • size (int | None)

  • chunks (int | None)

  • shards (int | None)

chunks: int | None#

Chunk size along this dimension; None to use the full extent.

classmethod from_json(payload)[source]#

Build a dimension from a JSON-compatible dict.

Parameters:

payload (dict[str, Any])

Return type:

Dimension

name: str#

Dimension name, unique within the dataset.

shards: int | None#

Shard size along this dimension; None for no sharding.

size: int | None#

Fixed size, or None if unknown (e.g. partitioning axis).

to_json()[source]#

Serialize the dimension to a JSON-compatible dict.

Return type:

dict[str, Any]

class zcollection.VariableSchema(name, dtype, dimensions, fill_value=None, codecs=<factory>, attrs=<factory>, role=VariableRole.USER, immutable=False)[source]#

All information needed to create a Zarr v3 array for one variable.

attrs: dict[str, Any]#

Attributes associated with the variable.

codecs: CodecStack#

Codec stack for encoding/decoding the variable’s data.

dimensions: tuple[str, ...]#

Names of the dimensions this variable spans, in order.

dtype: dtype#

NumPy dtype of the variable.

fill_value: Any | None#

Optional fill value for missing data; should be compatible with the dtype.

classmethod from_json(payload)[source]#

Build a variable from a JSON-compatible dict.

Parameters:

payload (dict[str, Any])

Return type:

VariableSchema

immutable: bool#

Whether the variable is immutable.

name: str#

Variable name, unique within the dataset.

property ndim: int#

Return the number of dimensions of this variable.

role: VariableRole#

Role of the variable in the schema.

to_json()[source]#

Serialize the variable to a JSON-compatible dict.

Return type:

dict[str, Any]

with_immutable(immutable)[source]#

Return a copy of this variable with the immutable flag overridden.

Parameters:

immutable (bool)

Return type:

VariableSchema

class zcollection.VariableRole(*values)[source]#

Provenance / purpose of a variable in the schema.

Creating and opening collections#

A Collection is the persistent, partitioned container. The two entry points dispatch on URL scheme (file://, memory://, s3://, icechunk://):

zcollection.create_collection(path, *, schema, axis, partitioning, catalog_enabled=False, overwrite=False)[source]#

Create a new collection at path and return its handle.

Convenience wrapper around Collection.create(). If path is a string it is dispatched through open_store() using its URL scheme (file://, memory://, s3://, icechunk://).

Parameters:
  • path (str | Store) – A URL or a pre-built Store.

  • schema (DatasetSchema) – The dataset schema. Variables not depending on axis become immutable.

  • axis (str) – Name of the partition axis (a dimension of schema).

  • partitioning (Partitioning) – A Partitioning instance (e.g. Date, Sequence, GroupedSequence).

  • catalog_enabled (bool) – If True, maintain a sharded _catalog/ of partition paths to make cold opens and listings O(1).

  • overwrite (bool) – If True, replace any existing root at path.

Returns:

A writable Collection ready to insert.

Raises:

CollectionExistsError – If a collection already exists at path and overwrite=False.

Return type:

Collection

zcollection.open_collection(path, *, mode='r')[source]#

Open an existing collection in r (read-only) or rw mode.

Parameters:
  • path (str | Store) – A URL or a pre-built Store.

  • mode (str) – "r" for read-only access (default) or "rw" for full read-write.

Returns:

A Collection bound to the existing root. In "r" mode all mutating methods raise ReadOnlyError.

Return type:

Collection
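
A minimal end-to-end sketch, reusing the schema built in the first example and assuming an in-memory Dataset ds that spans the time axis (not constructed here):

import zcollection as zc

col = zc.create_collection(
    "memory://demo",
    schema=schema,
    axis="time",
    partitioning=zc.partitioning.Date("time", resolution="D"),
)
col.insert(ds)  # ds: a zcollection.Dataset covering one or more days
col = zc.open_collection("memory://demo", mode="rw")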

Working with a Collection#

class zcollection.Collection(store, *, schema, axis, partitioning, catalog_enabled=False, read_only=False)[source]#

A partitioned Zarr v3 collection on a Store.

A Collection materialises a DatasetSchema over a set of partitions on disk. Each partition is a self-contained Zarr v3 group whose key in the partitioning dimension is encoded into its path (e.g. year=2024/month=03). Variables that don’t span the partition axis are flagged immutable, written once under _immutable/, and merged back into every read.

Typical lifecycle:

  1. zcollection.create_collection() (or Collection.create()) writes the root config and returns a writable handle.

  2. insert() (and any merge strategy) populates partitions.

  3. zcollection.open_collection() (or Collection.open()) reopens an existing root, optionally read-only.

  4. query(), map(), update(), and drop_partitions() operate on subsets selected by partition filters; all but drop_partitions() have an _async sibling with identical semantics.

Direct use of __init__() is reserved for the library; user code should always go through create() / open() or the factory functions in zcollection.api.

property axis: str#

Return the name of the partition axis.

drop_partitions(*, filters=None)[source]#

Delete matching partitions.

Wrapped in a store session so transactional backends commit the whole drop atomically.

Unlike insert(), query(), map() and update(), this method has no _async sibling: deletes are sequential by design (the store-session commit is one transaction).

Parameters:

filters (str | None) – A partition-key predicate; see partitions(). If None, every partition is dropped — pass an explicit filter when you don’t mean that.

Returns:

The list of partition paths that were removed, in the order they were processed.

Raises:
  • ReadOnlyError – If the collection was opened with read_only=True.

  • ExpressionError – If filters is malformed; see partitions().

Return type:

list[str]

insert(dataset, *, overwrite=True, merge=None)[source]#

Insert dataset into the collection.

The dataset is split by the collection’s partitioning. Each slice is written under the matching partition path. Variables flagged immutable in the schema (those that don’t span the partition axis) are written once under _immutable/ and are merged into the dataset returned by every subsequent query() / map() / update().

On a transactional store (e.g. IcechunkStore) the entire insert is wrapped in a single commit — a crash mid-insert leaves no partial state.

Parameters:
  • dataset (Dataset) – The data to insert. Variables can be backed by numpy, Dask, or Zarr AsyncArray.

  • overwrite (bool) – When a partition already exists, whether to replace its chunks (True, default) or fail. When merge is set, overwrite=True is effectively required: the merged dataset must replace the existing partition, so passing overwrite=False with a non-None merge will raise once the writer hits the existing group.

  • merge (MergeCallable | str | None) – Strategy for combining the inserted data with an existing partition. Either a built-in alias ("replace" / "concat" / "time_series" / "upsert"), a MergeCallable, or None (default — the inserted slice is written as-is and replaces the existing partition’s chunks). For tolerance-aware nearest-neighbour matching, build a callable with zcollection.collection.merge.upsert_within() — it isn’t registered as a string alias because it needs the tolerance argument.

Returns:

The list of partition paths that were written, in the order they were produced.

Raises:
  • ReadOnlyError – If the collection was opened with read_only=True.

  • KeyError – If merge is a string that doesn’t match a built-in strategy.

  • PartitionError – If dataset is missing variables required by the partitioning.

Return type:

list[str]
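
For example, a sketch of re-inserting an overlapping granule with the time-aware built-in strategy (new_ds is an assumed Dataset that overlaps existing partitions):

written = col.insert(new_ds, merge="time_series")
print(written)  # partition paths rewritten by this insert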

async insert_async(dataset, *, overwrite=True, merge=None)[source]#

Async variant of insert().

Return type:

list[str]

map(fn, *, filters=None, variables=None)[source]#

Apply fn to each matching partition and collect the results.

fn receives the partition Dataset with the immutable group merged in. Read-only — the partition is not written back. Use update() to persist transformed data.

Parameters:
  • fn (Callable[[Dataset], Any]) – Callable applied to each partition dataset; its return value is stored against the partition path.

  • filters (str | None) – Partition-key predicate; see partitions().

  • variables (Iterable[str] | None) – Optional whitelist of variables to load.

Returns:

A mapping {partition_path: fn(dataset)}.

Return type:

dict[str, Any]

async map_async(fn, *, filters=None, variables=None)[source]#

Async variant of map().

Return type:

dict[str, Any]

property partitioning: Partitioning#

Return the partitioning strategy.

partitions(*, filters=None)[source]#

Yield relative partition paths in sorted order, optionally filtered.

Parameters:

filters (str | None) – An optional partition-key expression. The expression language is the small typed subset described in compile_filter(): comparisons, and/or/not, in / not in, integer/string literals, and partition-key names. Examples: "year == 2024 and month >= 3", "cycle in (1, 2)". Comparable types are whatever Partitioning.decode produces (integers and strings in the built-in partitionings). None yields every partition.

Yields:

Relative partition paths (e.g. "year=2024/month=03"), sorted lexicographically.

Raises:

ExpressionError – If filters has invalid syntax or uses disallowed AST nodes (raised at compile time), or if it references a partition-key name that doesn’t exist (raised the first time the predicate is evaluated against a partition).

Return type:

Iterator[str]
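
A short sketch of the filter syntax, assuming a Date partitioning with year and month components:

for path in col.partitions(filters="year == 2024 and month in (1, 2, 3)"):
    print(path)  # e.g. "year=2024/month=01"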

query(*, filters=None, variables=None)[source]#

Read matching partitions and return the concatenated dataset.

Partitions are loaded concurrently up to zcollection.config["partition.concurrency"]. The variables flagged immutable in the schema (read once from _immutable/) are merged into the result; on name conflict the partition’s data wins.

Parameters:
  • filters (str | None) – A partition-key predicate; see partitions(). None reads every partition.

  • variables (Iterable[str] | None) – Iterable of variable names to load. None (default) loads every variable. Loading a subset is the primary way to keep cold S3 reads cheap.

Returns:

The concatenated Dataset along the partitioning dimension, or None if no partition matched filters.

Raises:

ExpressionError – If filters is malformed; see partitions().

Return type:

Dataset | None
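
For example, loading only two variables from the 2024 partitions (names taken from the schema example above):

ds = col.query(filters="year == 2024", variables=["time", "ssh"])
if ds is not None:
    print(ds.variables["ssh"].shape)  # rows concatenated along the partition axis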

async query_async(*, filters=None, variables=None)[source]#

Async variant of query().

Return type:

Dataset | None
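
The async variants are plain coroutines; a minimal sketch:

import asyncio

async def main():
    ds = await col.query_async(filters="year == 2024")
    return None if ds is None else ds.nbytes

print(asyncio.run(main()))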

property read_only: bool#

Return whether the collection is read-only.

repair_catalog()[source]#

Rebuild the catalog by walking the store; return the new path list.

Use after a crash, or after manual edits to the partition tree.

Returns:

The freshly walked list of partition paths, in sorted order, after the catalog has been rewritten.

Raises:
  • RuntimeError – If the collection was not opened with catalog_enabled=True (no catalog to repair).

  • ReadOnlyError – If the collection was opened read-only.

Return type:

list[str]

property schema: DatasetSchema#

Return the bound dataset schema.

property store: Store#

Return the backing store.

update(fn, *, filters=None, variables=None)[source]#

Read each matching partition, transform it, and write it back.

fn is called per partition with the dataset returned by query() (immutable variables merged in). It must return a Dataset carrying every variable the partition should keep — the call rewrites the partition wholesale with whatever fn returns.

Warning

update is not a partial-write API. Each partition group is recreated with overwrite=True before the new dataset is written, so any variable absent from fn’s return value is dropped from disk. This holds regardless of variables: if you load only a subset and return only a subset, the unloaded variables are also lost. To update one variable without disturbing the others, return a Dataset that still carries the rest (e.g. start from the input ds and replace just the target variable).

fn should also preserve the partition’s length along the partitioning dimension; update does not refresh the catalog from per-partition geometry.

Parameters:
  • fn (Callable[[Dataset], Dataset]) – Function Dataset -> Dataset applied to each matching partition.

  • filters (str | None) – Partition-key predicate; see partitions().

  • variables (Iterable[str] | None) – Optional whitelist of variables to load before calling fn. Reduces I/O at read time, but does not protect unloaded variables from being dropped on write — see the warning above.

Returns:

The list of partition paths that were written, in the order they were processed.

Raises:
  • ReadOnlyError – If the collection was opened with read_only=True.

  • ExpressionError – If filters is malformed; see partitions().

Return type:

list[str]
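
A sketch of the pattern recommended above, assuming eager numpy-backed data: transform one variable in place and return the full dataset so nothing is dropped on rewrite.

def scale_ssh(ds):
    ds.get_variable("ssh").to_numpy()[:] *= 1.01  # in-place on the eager buffer
    return ds  # still carries every other variable

col.update(scale_ssh, filters="year == 2024")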

async update_async(fn, *, filters=None, variables=None)[source]#

Async variant of update().

Return type:

list[str]

Merge strategies#

Strategies for inserting into a partition that already has data on disk. Pass any of these to Collection.insert() via the merge= argument, either as the callable or by name ("replace", "concat", "time_series", "upsert").

zcollection.merge.replace(existing, inserted, *, axis, partitioning_dim)[source]#

Drop existing entirely; only inserted is written.

Parameters:
  • existing (Dataset) – Dataset already on disk (ignored).

  • inserted (Dataset) – New dataset to write.

  • axis (str) – Unused; kept for protocol compatibility.

  • partitioning_dim (str) – Unused; kept for protocol compatibility.

Returns:

inserted unchanged.

Return type:

Dataset

zcollection.merge.concat(existing, inserted, *, axis, partitioning_dim)[source]#

Append inserted after existing along partitioning_dim.

Parameters:
  • existing (Dataset) – Dataset already on disk for this partition.

  • inserted (Dataset) – New dataset to append.

  • axis (str) – Unused; kept for protocol compatibility.

  • partitioning_dim (str) – Dimension along which to concatenate.

Returns:

The concatenation existing || inserted. No deduplication or sorting is performed.

Return type:

Dataset

zcollection.merge.time_series(existing, inserted, *, axis, partitioning_dim)[source]#

Time-aware merge: drop the existing window covered by inserted.

Rows in existing whose axis value falls inside [inserted_min, inserted_max] are dropped wholesale, the remaining rows are concatenated with inserted, and the result is sorted by axis. The input axes do not need to be monotonic — the function sorts on the way out.

Empty-input short-circuits: if inserted has zero rows along axis the function returns existing unchanged; if existing is empty it returns inserted unchanged.

Parameters:
  • existing (Dataset) – Dataset already on disk.

  • inserted (Dataset) – New dataset.

  • axis (str) – Name of the variable used for the time-window comparison. Must be a 1-D variable present on both sides; comparable types include numeric and datetime64.

  • partitioning_dim (str) – Dimension along which to slice and concat.

Returns:

The merged, axis-sorted dataset.

Raises:

ValueError – If axis is not a variable on both sides.

Return type:

Dataset

zcollection.merge.upsert(existing, inserted, *, axis, partitioning_dim, tolerance=None)[source]#

Row-wise replace-or-add by axis proximity.

For each row in existing: keep it iff its axis value has no match in inserted[axis]. The kept rows are concatenated with inserted and the result is sorted by axis. Use this when a new acquisition may partially overlap (replace) and partially extend (add) an existing partition without wiping the gaps in between.

Empty-input short-circuits: if inserted has zero rows along axis the function returns existing unchanged; if existing is empty it returns inserted unchanged.

tolerance controls what counts as a match:

  • None (default) — exact equality (numpy.isin).

  • a scalar (e.g. numpy.timedelta64(500, "ms") for datetime axes, or a float for numeric axes) — an existing row matches the nearest inserted row when |existing - nearest_inserted| <= tolerance. Useful when re-acquired timestamps are jittered by clock drift.

The string alias "upsert" covers only exact equality. For a tolerance-aware merge passed via Collection.insert(merge=…) — which accepts a string alias or a callable, but not a string plus an argument — wrap the tolerance with upsert_within() and pass the returned callable:

col.insert(
    ds,
    merge=zcollection.merge.upsert_within(numpy.timedelta64(500, "ms")),
)

Parameters:
  • existing (Dataset) – Dataset already on disk.

  • inserted (Dataset) – New dataset.

  • axis (str) – Name of the matching variable. Must be a 1-D variable present on both sides; comparable types include numeric and datetime64.

  • partitioning_dim (str) – Dimension along which to slice rows.

  • tolerance (Any) – None for exact equality, or a scalar for nearest-neighbour matching. For datetime axes pass a numpy.timedelta64 (e.g. timedelta64(500, "ms")); for numeric axes pass a plain float.

Returns:

The merged, axis-sorted dataset.

Raises:

ValueError – If axis is not a variable on both sides.

Return type:

Dataset

zcollection.merge.upsert_within(tolerance)[source]#

Return an upsert() strategy bound to tolerance.

The returned MergeCallable is suitable for passing to Collection.insert(merge=…):

import numpy
from zcollection.collection import merge

# 500 ms clock-drift tolerance on a datetime axis.
col.insert(ds, merge=merge.upsert_within(numpy.timedelta64(500, "ms")))

# Or on a numeric axis:
col.insert(ds, merge=merge.upsert_within(1e-6))

Parameters:

tolerance (Any) – None for exact equality, or a scalar (numeric or numpy.timedelta64) controlling the nearest-neighbour match window in upsert().

Returns:

A MergeCallable that calls upsert() with the given tolerance.

Return type:

MergeCallable

class zcollection.merge.MergeCallable(*args, **kwargs)[source]#

Contract for a merge strategy.

Implementations are invoked by Collection.insert_async whenever the dataset slice destined for a partition collides with an existing on-disk partition. The callable receives the two datasets plus two metadata strings, and returns the dataset that should actually be written.
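
A sketch of a custom strategy honouring this contract (same keyword signature as the built-in strategies above):

def keep_existing(existing, inserted, *, axis, partitioning_dim):
    """Ignore the new slice whenever the partition already holds data."""
    return existing

col.insert(ds, merge=keep_existing)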

Partitioning#

Partitioning strategies decide how rows are bucketed onto disk. Pick one when you create the collection; it’s persisted and used for every subsequent open.
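
Illustrative constructions (variable names and bucket sizes are assumptions, not requirements):

import zcollection as zc

by_day = zc.partitioning.Date("time", resolution="D")
by_pass = zc.partitioning.Sequence(("cycle", "pass_number"), dimension="time")
by_bucket = zc.partitioning.GroupedSequence(("cycle",), dimension="time", size=10)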

class zcollection.partitioning.Date(variables, *, resolution, dimension=None)[source]#

Partition by truncating a 1-D datetime64 variable to resolution.

Component names match the layout (year=2024/month=03/day=01).

Parameters:
  • variables (tuple[str, ...] | str) – The variable(s) to partition by; must be exactly one.

  • resolution (str) – The partitioning resolution, one of “Y”, “M”, “D”, “h”, “m”, or “s”.

  • dimension (str | None) – The dimension to partition along; if None, inferred from the variable name.

property axis: tuple[str, ...]#

Return the partition-key component names.

decode(path)[source]#

Decode a relative storage path into a key.

Parameters:

path (str) – A string representing the relative storage path.

Returns:

A tuple of (component-name, value) pairs corresponding to the partition key.

Raises:

PartitionError – If the path is not properly formatted or if any component is missing or has an invalid value.

Return type:

tuple[tuple[str, int], …]

property dimension: str#

Return the dimension this partitioning splits.

encode(key)[source]#

Encode a key as a relative storage path.

Parameters:

key (tuple[tuple[str, int], ...]) – A tuple of (component-name, value) pairs corresponding to the partition key.

Returns:

A string representing the relative storage path for the given key, formatted according to the component names and zero-padding widths defined for this partitioning.

Return type:

str

classmethod from_json(payload)[source]#

Reconstruct a Date partitioning from its JSON payload.

Parameters:

payload (dict[str, Any]) – The JSON payload containing the partitioning information.

Returns:

An instance of the Date partitioning based on the provided payload.

Return type:

Date

property resolution: str#

Return the resolution code (Y/M/D/h/m/s).

split(dataset)[source]#

Yield (key, slice) for each contiguous bucket in dataset.

Parameters:

dataset (Dataset)

Return type:

Iterator[tuple[tuple[tuple[str, int], …], slice]]

to_json()[source]#

Return a JSON-serializable description of the partitioning.

Return type:

dict[str, Any]

class zcollection.partitioning.Sequence(variables, *, dimension)[source]#

Partition by the unique value tuples of one or more variables.

The variables must all share the same single dimension, which becomes the partitioning axis.

Parameters:
  • variables (tuple[str, ...]) – The variable(s) to partition by; must be at least one.

  • dimension (str) – The dimension to partition along; every partition-key variable must be 1-D along this dimension.

property axis: tuple[str, ...]#

Return the partition-key variable names.

decode(path)[source]#

Decode a relative storage path into a key.

Parameters:

path (str) – The relative storage path to decode.

Returns:

The decoded partition key.

Raises:

PartitionError – If the path is not in the expected format.

Return type:

tuple[tuple[str, int], …]

property dimension: str#

Return the dimension this partitioning splits.

encode(key)[source]#

Encode a key as a relative storage path.

Parameters:

key (tuple[tuple[str, int], ...]) – The partition key to encode.

Returns:

A relative storage path representing the encoded partition key.

Return type:

str

classmethod from_json(payload)[source]#

Reconstruct a Sequence partitioning from its JSON payload.

Parameters:

payload (dict[str, Any]) – The JSON payload containing the partitioning information.

Returns:

An instance of the Sequence partitioning based on the provided payload.

Return type:

Sequence

split(dataset)[source]#

Yield (key, slice) for each unique tuple in dataset.

Parameters:

dataset (Dataset) – The dataset to split; must contain the partition-key variables as 1-D arrays along the partitioning dimension.

Yields:

Tuples of the form (key, slice), where key is a partition key tuple (e.g. (("x", 1), ("y", 2))) representing the unique value combination for the partition, and slice is a slice object that can be used to index into the dataset along the partitioning dimension.

Return type:

Iterator[tuple[tuple[tuple[str, int], …], slice]]

to_json()[source]#

Return a JSON-serializable description of the partitioning.

Return type:

dict[str, Any]

class zcollection.partitioning.GroupedSequence(variables, *, dimension, size, start=0)[source]#

Like Sequence, but groups the last variable into buckets of size.

The first len(variables) - 1 variables continue to act as exact keys; the last variable is mapped to (value - start) // size * size + start before grouping. size must be ≥ 2 (otherwise prefer Sequence).

Parameters:
  • variables (tuple[str, ...]) – The variable(s) to partition by; must be at least one.

  • dimension (str) – The dimension to partition along; shared by every partition-key variable.

  • size (int) – The bucket size for the last variable; must be ≥ 2.

  • start (int) – The bucket origin for the last variable; defaults to 0.

property axis: tuple[str, ...]#

Return the partition-key variable names.

decode(path)#

Decode a relative storage path into a key.

Parameters:

path (str) – The relative storage path to decode.

Returns:

The decoded partition key.

Raises:

PartitionError – If the path is not in the expected format.

Return type:

tuple[tuple[str, int], …]

property dimension: str#

Return the dimension this partitioning splits.

encode(key)#

Encode a key as a relative storage path.

Parameters:

key (tuple[tuple[str, int], ...]) – The partition key to encode.

Returns:

A relative storage path representing the encoded partition key.

Return type:

str

classmethod from_json(payload)[source]#

Reconstruct a GroupedSequence from its JSON payload.

Parameters:

payload (dict[str, Any]) – The JSON payload containing the partitioning information.

Returns:

An instance of the GroupedSequence partitioning based on the provided JSON payload.

Return type:

GroupedSequence

property size: int#

Return the bucket size for the grouped variable.

split(dataset)[source]#

Yield (key, slice) for each contiguous bucket in dataset.

Parameters:

dataset (Dataset) – The dataset to partition, which must contain all variables in this partitioning and have the partitioning dimension in each.

Yields:

Tuples of (partition_key, slice) for each contiguous bucket in the dataset, where partition_key is a tuple of (component, value) pairs representing the partition key for that bucket, and slice is a slice object that can be used to index into the dataset along the partitioning dimension.

Return type:

Iterator[tuple[tuple[tuple[str, int], …], slice]]

property start: int#

Return the bucket origin for the grouped variable.

to_json()[source]#

Return a JSON-serializable description of the partitioning.

Return type:

dict[str, Any]

class zcollection.partitioning.Partitioning(*args, **kwargs)[source]#

Maps dataset rows along a partitioning axis to partition keys.

Implementations operate on plain numpy — Dask is layered higher up.

property axis: tuple[str, ...]#

Variables (along the partitioning dimension) used to derive the key.

decode(path)[source]#

Decode a relative storage path into a key.

Parameters:

path (str)

Return type:

tuple[tuple[str, int], …]

property dimension: str#

The dataset dimension this partitioning splits.

encode(key)[source]#

Encode a key as a relative storage path.

Parameters:

key (tuple[tuple[str, int], ...])

Return type:

str

name: str#

A human-readable name for this partitioning strategy, e.g. “sequence”.

split(dataset)[source]#

Yield (partition_key, slice) for each contiguous run.

Parameters:

dataset (Dataset)

Return type:

Iterator[tuple[tuple[tuple[str, int], …], slice]]

to_json()[source]#

Return a JSON-serializable description of the partitioning.

Return type:

dict[str, Any]

zcollection.partitioning.compile_filter(expr)[source]#

Compile a filter expression to a predicate over partition-key dicts.

expr=None or empty string returns a tautology.

Parameters:

expr (str | None) – The filter expression to compile. This should be a string containing a valid Python expression using only the allowed syntax.

Returns:

A predicate function that takes a partition-key dict and returns a bool indicating whether the partition key satisfies the filter expression.

Raises:

ExpressionError – If the expression contains syntax errors or uses disallowed syntax.

Return type:

Callable[[dict[str, Any]], bool]

The compile_filter() helper turns a filter string (the kind passed to Collection.partitions(filters=…)) into a typed predicate. You usually do not need to call it directly.
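
A minimal sketch of what it produces:

pred = zc.partitioning.compile_filter("year == 2024 and month in (1, 2)")
pred({"year": 2024, "month": 2})  # True
pred({"year": 2023, "month": 2})  # False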

Views#

A view overlays extra variables on top of an existing read-only base collection without copying its data. Views live in a separate store and share the partitioning of the base.

class zcollection.view.View(*, store, base, view_schema, reference, read_only=False)[source]#

Overlay of extra variables on top of a base Collection.

property base: Collection#

Return the underlying base collection.

classmethod create(store, *, base, variables, reference, overwrite=False)[source]#

Create a new view backed by store and overlaying base.

Parameters:
  • store (Store) – Backing store for the view’s overlay variables. Must be different from base.store.

  • base (Collection) – The underlying read-only base collection.

  • variables (Iterable[VariableSchema]) – Schemas for the new variables the view adds. Each must share at least the partitioning dimension with the base collection.

  • reference (ViewReference | str) – Either a ViewReference or a string URI identifying the base collection.

  • overwrite (bool) – If True, replace any existing view at this location.

Returns:

A writable View ready to update.

Raises:
  • CollectionExistsError – If a view already exists at store.root_uri and overwrite=False.

  • ZCollectionError – If a view variable’s dimensions are inconsistent with the base schema.

Return type:

View

classmethod open(store, *, base, read_only=False)[source]#

Open an existing view from store.

Parameters:
  • store (Store) – The store backing the view’s overlay variables.

  • base (Collection) – The base collection that this view extends. The caller is responsible for ensuring it matches reference.

  • read_only (bool) – If True, mutating methods (update) raise ReadOnlyError.

Returns:

A View bound to the existing overlay.

Raises:

CollectionNotFoundError – If no view config exists at store.root_uri.

Return type:

View

partitions(*, filters=None)[source]#

Yield partition paths from the base collection, optionally filtered.

Parameters:

filters (str | None)

Return type:

Iterator[str]

query(*, filters=None, variables=None)[source]#

Return the merged base+overlay dataset for matching partitions.

Variables are sourced from whichever side owns them. On a name collision the overlay (view) wins.

Parameters:
  • filters (str | None) – Partition-key predicate forwarded to the base collection’s partitions().

  • variables (Iterable[str] | None) – Optional whitelist mixing base and overlay names. None returns all base + overlay variables.

Returns:

The merged Dataset, or None if no base partition matched filters. If matching partitions exist but no overlay has been written for them yet, only the base is returned.

Return type:

Dataset | None

async query_async(*, filters=None, variables=None)[source]#

Async variant of query().

Return type:

Dataset | None

property read_only: bool#

Return whether the view is read-only.

property reference: ViewReference#

Return the reference to the base collection.

property store: Store#

Return the backing store for the view overlay.

update(fn, *, filters=None, variables=None)[source]#

Compute view-variable arrays for each base partition and write them.

fn runs once per matching base partition. It receives the merged base+view dataset and must return a mapping from view-variable name to a numpy array sized along the partitioning dimension. Returned keys must be a subset of view_schema’s variables; missing keys are ignored, unknown keys raise.

Parameters:
  • fn (Callable[[Dataset], dict[str, Any]]) – Pure function Dataset -> {name: numpy.ndarray}.

  • filters (str | None) – Partition-key predicate (same syntax as Collection.partitions()).

  • variables (Iterable[str] | None) – Optional whitelist of variables to load before calling fn. None loads everything available.

Returns:

The list of partition paths that were written, in the order they were processed.

Raises:

ReadOnlyError – If the view was opened with read_only=True.

Return type:

list[str]
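
A sketch of an overlay computation, assuming a view already created over the collection above and an ssh_anomaly variable declared in its overlay schema:

import numpy

def ssh_anomaly(ds):
    ssh = ds.get_variable("ssh").to_numpy()
    # one value per row along the partitioning dimension
    return {"ssh_anomaly": ssh.mean(axis=1) - ssh.mean()}

view.update(ssh_anomaly, filters="year == 2024")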

async update_async(fn, *, filters=None, variables=None)[source]#

Async variant of update().

Return type:

list[str]

property variables: tuple[str, ...]#

Return the names of the view’s overlay variables.

property view_schema: DatasetSchema#

Return the view’s overlay schema.

class zcollection.view.ViewReference(uri)[source]#

Pointer to a view’s underlying base collection.

Parameters:

uri (str)

classmethod from_json(payload)[source]#

Build a ViewReference from its JSON payload.

Parameters:

payload (dict[str, Any])

Return type:

ViewReference

to_json()[source]#

Return the reference as a JSON-serialisable dictionary.

Return type:

dict[str, Any]

Indexing#

An Indexer is a Parquet-backed lookup table: it maps user-defined key columns to (partition, start, stop) row ranges so callers can slice a collection without scanning every partition.

class zcollection.indexing.parquet.Indexer(table)[source]#

Lookup table over a Collection’s rows.

Parameters:

table (Table)

classmethod build(collection, *, builder, filters=None, variables=None)[source]#

Build an index by walking the collection’s partitions.

Parameters:
  • collection (Collection) – The Collection to index.

  • builder (Callable[[Dataset], ndarray | dict[str, ndarray]]) – Per-partition row generator. It receives a Dataset slice and must return either a structured numpy array (with named fields, including _start and _stop) or a dict {column_name: 1-D numpy.ndarray} of equal length. The user-defined columns become the queryable key columns; _start / _stop carry contiguous row ranges within the partition.

  • filters (str | None) – Partition-key predicate forwarded to Collection.partitions() to skip whole partitions.

  • variables (Iterable[str] | None) – Optional whitelist of variables to load before calling builder — keep this tight on cloud stores.

Returns:

An Indexer whose Parquet table is the concatenation of every non-empty per-partition output, with a _partition column added by the builder pipeline.

Raises:
  • ValueError – If builder’s output for any partition omits a _start or _stop column.

  • TypeError – If builder returns something other than a structured array or a dict of arrays.

Return type:

Indexer

property key_columns: tuple[str, ...]#

Return the user-defined key column names (excluding reserved ones).

lookup(**predicates)[source]#

Return {partition: [(start, stop), ...]} for matching rows.

Filters are AND-ed: a row matches when every named predicate matches. A scalar predicate is equality; a list / tuple / set / ndarray predicate is set membership (IN).

Parameters:

**predicates (Any) – One key_column=value per filter. Allowed column names are exactly key_columns.

Returns:

A mapping from partition path to the list of contiguous (start, stop) row ranges that satisfy the predicates. Partitions with no matching row are absent from the mapping.

Raises:

KeyError – If a predicate names an unknown column.

Return type:

dict[str, list[tuple[int, int]]]
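
A combined sketch of build() and lookup(), assuming a 1-D cycle_number variable along the partition axis:

import numpy
from zcollection.indexing.parquet import Indexer

def by_cycle(ds):
    cycles = ds.get_variable("cycle_number").to_numpy()
    edges = numpy.flatnonzero(numpy.diff(cycles)) + 1  # run boundaries
    starts = numpy.concatenate(([0], edges))
    stops = numpy.concatenate((edges, [cycles.size]))
    return {"cycle": cycles[starts], "_start": starts, "_stop": stops}

index = Indexer.build(col, builder=by_cycle, variables=["cycle_number"])
index.lookup(cycle=42)  # {"year=2024/...": [(start, stop)], ...}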

classmethod read(path)[source]#

Load an indexer from a Parquet file at path.

Parameters:

path (str)

Return type:

Indexer

property table: Table#

Return the underlying PyArrow table.

write(path)[source]#

Write the index to path as a Parquet file.

Parameters:

path (str)

Return type:

None

Datasets, groups, and variables (in memory)#

The objects you receive from Collection.query() and pass back into Collection.insert(). Dataset is a root Group: it owns variables and child groups directly. Group itself is the hierarchical container reused for every nested group.

Both Dataset and Variable print as multi-line, xarray-like blocks that include a synthetic byte size for the dataset, each child group, and each variable, so you can gauge memory/disk footprint at a glance:

<zcollection.data.dataset.Dataset '/'> Size: 1.29 GB
  Dimensions: (time: 100000, x_ac: 240)
Data variables:
    time (time)              int64           781.25 kB  numpy.ndarray<...>
    ssh  (time, x_ac)        float32         91.55 MB   numpy.ndarray<...>
Groups:
    data_01           1.20 GB  (0 variables, 1 subgroup)

Path-based access (ds.get_variable("/data_01/ku/power"), ds["/data_01/ku/power"]) and short-name search (find_variable(), find_group()) make navigating the hierarchy straightforward. find_dimension() walks up the tree so child groups inherit dimensions declared on ancestors. nbytes recurses over the whole tree.

The xarray bridge (to_xarray() / from_xarray()) operates on the root group only — xarray has no native group concept, so nested groups are dropped on conversion.
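
For example, round-tripping through xarray (ds is the hierarchical Dataset from the schema example):

xr_root = ds.to_xarray()  # root group only
xr_ku = ds.to_xarray(group="/data_01/ku")  # one nested group, flattened
flat = zc.Dataset.from_xarray(xr_root)  # back to a flat zcollection Dataset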

class zcollection.Dataset(schema, variables=(), groups=(), attrs=None)[source]#

A schema plus the in-memory data for one or more partitions.

A Dataset is the root Group of a zcollection tree (name == "/", parent is None). Compared to a plain Group, the schema attribute is narrowed to DatasetSchema (statically; the runtime slot is the same) and the constructor has a path-aware variable router.

Parameters:
  • schema (DatasetSchema) – The dataset schema (root GroupSchema).

  • variables (Mapping[str, Variable] | Iterable[Variable]) –

    Variables for the dataset. Two forms are accepted:

    • a mapping name -> Variable: keys without / are placed at the root; keys containing / are routed into the matching nested group (intermediate groups are auto-built from schema);

    • an iterable of Variable instances — each is placed at the root using its own schema.name.

  • groups (Mapping[str, Group] | Iterable[Group]) – Optional pre-built child groups for the root. They are merged with the auto-built nested groups, with groups winning on name conflict for groups it already provides.

  • attrs (Mapping[str, Any] | None) – Optional dataset-level attributes; defaults to schema.attrs.

classmethod from_xarray(ds)[source]#

Build a Dataset from an xarray Dataset (flat).

Schema fields are inferred from the input:

  • dimensions come from ds.sizes;

  • dataset attributes from ds.attrs;

  • each variable contributes its dtype, dimensions, attrs (minus _FillValue, which becomes the variable’s fill_value); coordinates and data variables are merged into a single flat namespace.

xarray has no native group concept, so the resulting Dataset is always flat (root group only). Round-trip a hierarchical Dataset by calling to_xarray() per group.

Parameters:

ds (xarray.Dataset) – The xarray Dataset to convert.

Returns:

A new Dataset with an inferred schema and the same variable data (no copy — the underlying arrays are shared with ds).

Return type:

Dataset

select(names)[source]#

Return a new dataset restricted to the named variables.

Both the variables and the schema are subsetted: the returned dataset’s schema is built via DatasetSchema.select(), which prunes any group that ends up empty after the selection.

names may be short names (resolved against the root group) or absolute paths (/grp/var).

Parameters:

names (Iterable[str])

Return type:

Dataset

to_xarray(group=None)[source]#

Convert one group’s variables to an xarray Dataset.

xarray has no native group concept, so a single xarray.Dataset carries exactly one flat namespace of variables. Use group to pick which of this dataset’s groups is materialised; the rest of the tree is dropped silently.

Parameters:

group (str | None) – Absolute or relative path of the group to convert. None (default) and "/" both target the root group (backward-compatible behaviour). For a nested group, pass its path (e.g. "/data_01" or "data_01/ku"); get_group semantics apply.

Returns:

An xarray.Dataset whose variables are the named group’s variables and whose attrs are that group’s attributes.

Raises:

KeyError – If group does not resolve to a known group in this dataset’s tree.

Return type:

xarray.Dataset

class zcollection.Group(schema, variables=(), groups=(), attrs=None, *, name='/', parent=None)[source]#

A named container of variables, attributes, dimensions, and child groups.

The root group has name == "/" and parent is None. Child groups keep a back-reference to their parent so absolute paths (long_name()) and dimension inheritance (find_dimension()) can be computed cheaply.

Parameters:
  • schema (GroupSchema) – Schema describing this group’s variables, dimensions, and attributes.

  • variables (Mapping[str, Variable] | Iterable[Variable]) – Variables for this group, as a mapping or iterable.

  • groups (Mapping[str, Group] | Iterable[Group]) – Child groups for this group, as a mapping or iterable.

  • attrs (Mapping[str, Any] | None) – Group-level attributes; defaults to schema.attrs.

  • name (str) – Short name of this group; "/" for the root.

  • parent (Group | None) – Parent group reference; None for the root.

add_group(group)[source]#

Attach group as a child of this group.

group.parent is rewritten to self. Any prior parent the passed group held is silently dropped.

Parameters:

group (Group) – The group to attach.

Returns:

The same group, now attached.

Raises:

ValueError – If a child with the same name already exists.

Return type:

Group

add_variable(variable)[source]#

Register variable on this group.

Re-runs the per-dim size check across all of this group’s variables; a length disagreement on a shared dim raises.

Raises:

ValueError – If a variable with the same name already exists, or if its shape disagrees with sizes already established by other variables on this group.

Parameters:

variable (Variable)

Return type:

None

all_variables()[source]#

Return every variable in the tree keyed by absolute path.

Variables at the root keep their short name; nested variables are keyed by "<group>/<name>" (no leading slash) for direct passing back to Dataset.

Return type:

dict[str, Variable]

property attrs: dict[str, Any]#

Return this group’s attributes mapping (live, not a copy).

The returned dict is the group’s internal storage: mutating it modifies the group. Use update_attributes() for an explicit setter, or copy the result yourself (dict(group.attrs)) if you need a detached snapshot.

property dimensions: dict[str, int]#

Return the dimension-name → size mapping declared on this group.

Sizes are derived from this group’s own variables; dimensions declared on ancestors but not used by any local variable are not listed. Use find_dimension() to resolve a dim name across the inheritance chain.

find_dimension(name)[source]#

Search for a declared dimension by name, walking up the tree.

Returns the first matching Dimension from this group’s schema or any ancestor’s schema, or None if not found.

Parameters:

name (str)

Return type:

Dimension | None

find_group(name)[source]#

Search for a child group by short name in this subtree.

Returns the first match in depth-first order, or None if no group with that short name exists. See get_group() for the path-based variant.

Parameters:

name (str)

Return type:

Group | None

find_variable(name)[source]#

Search for a variable by short name in this group and its descendants.

Returns the first match in depth-first order — checking this group first, then the leftmost descendant, and so on. Returns None if no variable with that short name exists anywhere in the tree. Use get_variable() if you have an absolute path and want a hard failure on miss.

Parameters:

name (str)

Return type:

Variable | None

get_group(path)[source]#

Return the group at path (absolute or relative).

Raises:

KeyError – If the path does not resolve to a known group.

Parameters:

path (str)

Return type:

Group

get_variable(path)[source]#

Return the variable at path (absolute or relative).

Raises:

KeyError – If the path does not resolve to a known variable.

Parameters:

path (str)

Return type:

Variable

property groups: Mapping[str, Group]#

Return this group’s direct child groups (no recursion).

property is_lazy: bool#

Return whether any variable in the tree wraps lazy data.

Walks descendants too — true iff any Variable anywhere in the tree (including in nested groups) reports Variable.is_lazy.

is_root()[source]#

Return True if this is the root group (no parent).

Return type:

bool

iter_groups()[source]#

Yield every descendant group, depth-first, excluding self.

Return type:

Iterator[Group]

long_name()[source]#

Return the absolute path of this group (e.g. "/data_01/ku").

The root group resolves to "/". Any group whose parent is None is treated as a root regardless of its name.

Return type:

str

property nbytes: int#

Return the uncompressed byte size of the tree rooted here.

Sums Variable.nbytes over every variable in this group and every descendant group. Reports the in-memory footprint; ignores any compression or sharding the variables might carry on disk.

update_attributes(**attributes)[source]#

Merge attributes into this group’s attrs (additive).

Existing keys are overwritten with the new values; keys not in attributes are left untouched. The mutation is done in place on the live attrs mapping.

Parameters:

attributes (Any)

Return type:

None

property variables: Mapping[str, Variable]#

Return this group’s own variables (no recursion).

Use all_variables() for the recursive view that includes every descendant variable keyed by absolute path.

walk()[source]#

Yield (absolute_path, group) for self and every descendant.

self is yielded first, followed by every descendant in depth-first order (matching iter_groups()).

Return type:

Iterator[tuple[str, Group]]

class zcollection.Variable(schema, data)[source]#

A named array bound to a VariableSchema.

data may be a numpy.ndarray (eager), any object exposing a compute() method (dask-style lazy arrays), an arbitrary array-like (anything numpy.asarray() accepts), or None (declared but not yet populated). is_lazy is true iff the array isn’t a plain numpy.ndarray.

On construction the data is validated against the schema’s number of dimensions; dtype mismatches are not enforced (upcasts are accepted silently).

Parameters:
  • schema (VariableSchema) – Variable schema describing dtype, dims and metadata.

  • data (Any) – Underlying array, or None for a placeholder variable.

Raises:

ValueError – If data exposes an ndim attribute that disagrees with schema.ndim.

property attrs: dict[str, Any]#

Return a fresh copy of the schema attributes.

The returned dict is detached: mutating it does not affect the underlying schema.

property data: Any#

Return the underlying array as-is (no materialisation).

property dimensions: tuple[str, ...]#

Return the dimension names.

property dtype: dtype#

Return the numpy dtype.

property fill_value: Any#

Return the schema fill value.

property is_lazy: bool#

Return whether the underlying data isn’t a plain numpy.ndarray.

True for dask arrays, Zarr AsyncArray proxies, and anything else that isn’t a concrete in-memory numpy buffer; false when the data is already materialised. A variable whose data is None also reports true (the placeholder is treated as not-yet-eager).

property name: str#

Return the variable name.

property nbytes: int#

Return the uncompressed byte size of the underlying data.

Computed as prod(shape) * dtype.itemsize — the same convention as numpy.ndarray.nbytes. Ignores any compression or sharding the variable might carry on disk. Returns 0 for placeholder variables (data is None).

property ndim: int#

Return the number of dimensions.

schema#

The variable schema describing dtype, dims and metadata.

property shape: tuple[int, ...]#

Return the shape of the underlying data.

Returns () when data is None or has no shape attribute (the empty tuple makes the variable look like a 0-D scalar to size-aware consumers).

to_numpy()[source]#

Materialise the data as a numpy array.

Dispatches in three cases:

  • already a numpy.ndarray → returned as-is (no copy).

  • has a compute() method (dask-style lazy arrays) → call it and return the materialised result.

  • otherwise → numpy.asarray() on the data.

Calling this on a Variable whose data is None produces numpy.array(None, dtype=object), which is rarely useful; guard against that case at the caller.

Return type:

ndarray
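
A hedged sketch of the dispatch rules (var_schema stands in for a two-dimensional VariableSchema obtained from a built schema; it is not constructed here):

import numpy
import zcollection as zc

eager = zc.Variable(var_schema, numpy.zeros((10, 240), dtype="float32"))
eager.is_lazy       # False: already a plain numpy.ndarray
eager.to_numpy()    # returned as-is, no copy

placeholder = zc.Variable(var_schema, None)
placeholder.shape   # () -- declared but not yet populated
placeholder.nbytes  # 0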

Codecs#

Codecs describe how a variable’s chunks travel from numpy memory to bytes on disk. Most users only need named profiles:

import zcollection as zc

stack = zc.codecs.profile("cloud-balanced")
names = zc.codecs.profile_names()  # ('local-fast', 'cloud-balanced', ...)
zcollection.codecs.profile(name=None, *, filters=None, compressor=None)[source]#

Build a CodecStack from a named profile, with overrides.

Parameters:
  • name (str | None) – Profile name. None uses DEFAULT_PROFILE.

  • filters (Iterable[dict[str, Any]] | None) – Optional array-to-array codec descriptors that override the profile’s empty filter list.

  • compressor (dict[str, Any] | None) – Optional bytes-to-bytes codec descriptor that replaces the profile’s entire compressor pipeline with this single codec.

Returns:

A CodecStack materialised from the profile, with the overrides applied.

Raises:

KeyError – If name is not a registered profile.

Return type:

CodecStack

zcollection.codecs.profile_names()[source]#

Return the names of all registered codec profiles.

Return type:

tuple[str, …]

zcollection.codecs.auto_codecs(dtype, profile_name=None)[source]#

Pick a CodecStack for a variable using the named profile.

The result is currently dtype-agnostic — dtype is accepted on the public surface so future profiles can specialise (e.g. byte-grain filters for booleans, transposes for high-rank arrays) without an API break. For now it is ignored.

Parameters:
  • dtype (dtype) – The variable dtype. Reserved for forward compatibility; does not affect the returned stack today.

  • profile_name (str | None) – Profile name. None uses DEFAULT_PROFILE.

Returns:

A CodecStack materialised from the profile.

Raises:

KeyError – If profile_name is not a registered profile.

Return type:

CodecStack
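
For example (the profile name is the one used earlier on this page; the dtype is only a forward-compatibility hint today):

import numpy
import zcollection as zc

stack = zc.codecs.auto_codecs(numpy.dtype("float32"), "cloud-balanced")
# Currently equivalent to zc.codecs.profile("cloud-balanced"): the dtype is
# accepted but ignored, so future profiles can specialise without an API break.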

class zcollection.codecs.CodecStack(array_to_array=(), array_to_bytes=None, bytes_to_bytes=(), sharded=False, shard_target_bytes=None)[source]#

The persisted codec pipeline for one variable.

Each codec is held as a JSON-clean {"name": ..., "configuration": ...} dict so the schema round-trips through _zcollection.json without depending on Zarr v3 codec object identity. Materialisation into concrete zarr.codecs instances happens lazily in resolve_codec() at write time.

Parameters:
array_to_array: tuple[dict[str, Any], ...]#

Array-to-array codecs (Zarr v3 filters), applied in order to the chunk’s numpy array before serialisation.

array_to_bytes: dict[str, Any] | None#

The single array-to-bytes codec (Zarr v3 serializer) that turns array elements into a byte string for one chunk. The Zarr v3 spec requires exactly one codec in this slot.

bytes_to_bytes: tuple[dict[str, Any], ...]#

Bytes-to-bytes codecs (Zarr v3 compressors), applied in order to the per-chunk byte string. With sharding on, they compress each inner chunk inside a shard; with sharding off, they compress each chunk directly.

classmethod from_json(payload)[source]#

Build a codec stack from its JSON representation.

Parameters:

payload (dict[str, Any])

Return type:

CodecStack

shard_target_bytes: int | None#

Target byte budget for each shard when sharded is True; None otherwise. The actual shard shape is picked by zcollection.codecs.sharding.shard_decision(), which honours this as a hint, not a hard cap.

sharded: bool#

When True, the variable’s chunks are bundled into shards via zarr.codecs.ShardingCodec. The codecs in bytes_to_bytes then compress each inner chunk inside a shard. When False, each chunk is compressed directly and shards are not used.

to_json()[source]#

Return the codec stack as a JSON-serialisable dictionary.

Return type:

dict[str, Any]
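
A hedged sketch of building a stack by hand and round-tripping it through JSON (the codec descriptors are illustrative Zarr v3 names, not the defaults of any profile):

from zcollection.codecs import CodecStack

stack = CodecStack(
    array_to_bytes={"name": "bytes", "configuration": {"endian": "little"}},
    bytes_to_bytes=({"name": "zstd", "configuration": {"level": 3}},),
    sharded=True,
    shard_target_bytes=64 * 2**20,  # ~64 MiB hint for shard_decision()
)

payload = stack.to_json()                 # JSON-clean dict, safe for _zcollection.json
restored = CodecStack.from_json(payload)  # reconstructs an equivalent stack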

zcollection.codecs.DEFAULT_PROFILE#

Name of the profile used when no codecs= is given on a variable.

Stores#

Stores are the I/O backend. The open_store() factory selects an implementation from a URL (this is what create_collection() and open_collection() call internally), but you can also build one explicitly and pass it as the path argument.

zcollection.open_store(path, *, read_only=False, storage_options=None)[source]#

Open a Store given a URL or filesystem path.

Schemes:

  • file:// (default for bare paths) → LocalStore

  • memory:// → MemoryStore

  • s3://, gs://, az://, http(s):// → ObjectStore (obstore-backed, the only cloud path)

  • icechunk:// → IcechunkStore (transactional)

Parameters:
Return type:

Store
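
For example (bucket names and options are placeholders; only the schemes listed above are assumed):

import zcollection as zc

mem = zc.open_store("memory://")                      # MemoryStore
local = zc.open_store("/data/swot", read_only=True)   # bare path -> LocalStore
remote = zc.open_store(
    "s3://my-bucket/collections/ssh",                 # placeholder bucket/prefix
    storage_options={"region": "eu-west-1"},          # illustrative backend options
)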

class zcollection.Store[source]#

Capability surface for a backend that can hold a Zarr v3 hierarchy.

Implementations wrap a concrete zarr.storage Store and expose:

abstractmethod delete_prefix(prefix)[source]#

Recursively delete everything under prefix.

Parameters:

prefix (str)

Return type:

None

abstractmethod exists(key)[source]#

Return whether key is present in the store.

Parameters:

key (str)

Return type:

bool

abstractmethod list_dir(prefix)[source]#

Yield direct children (groups + arrays) under prefix.

Parameters:

prefix (str)

Return type:

Iterator[str]

abstractmethod list_prefix(prefix)[source]#

Yield child keys (one path segment) under prefix.

Parameters:

prefix (str)

Return type:

Iterator[str]

abstractmethod read_bytes(key)[source]#

Return the raw bytes stored at key or None if absent.

Parameters:

key (str)

Return type:

bytes | None

abstract property root_uri: str#

Human-readable URI for diagnostics.

session()[source]#

Yield a StoreSession for the duration of a write block.

Return type:

Iterator[StoreSession]

abstractmethod write_bytes(key, data)[source]#

Write data at key (overwriting any existing value).

Parameters:
Return type:

None

abstractmethod zarr_store()[source]#

Return the underlying zarr.abc.store.Store instance.

Return type:

Any

class zcollection.LocalStore(path, *, read_only=False)[source]#

File-system backed store rooted at path.

Parameters:

class zcollection.MemoryStore[source]#

All keys held in a process-local dict; chiefly for tests.

The optional zcollection.store.IcechunkStore is loaded lazily when icechunk:// is opened or when you import it explicitly. It unlocks transactional, multi-writer inserts.
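
A brief sketch of building a store explicitly instead of going through a URL (the paths are placeholders):

import zcollection as zc

store = zc.MemoryStore()                        # process-local, handy for tests
local = zc.LocalStore("/data/swot", read_only=True)
# Either instance can be passed wherever a path/URL is accepted, e.g.
# zc.open_collection(store) once a collection has been written into it.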

Errors#

All exceptions raised by zcollection inherit from ZCollectionError.

exception zcollection.ZCollectionError[source]#

Base class for all zcollection errors.

exception zcollection.SchemaError[source]#

Raised when a schema is invalid or inconsistent with data.

exception zcollection.StoreError[source]#

Raised by the store layer for I/O or transactional failures.

exception zcollection.CollectionExistsError[source]#

Raised when create_collection targets a path that already exists.

exception zcollection.CollectionNotFoundError[source]#

Raised when open_collection cannot locate a collection.

exception zcollection.ReadOnlyError[source]#

Raised when a write op is attempted on a read-only collection.
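
A minimal error-handling sketch built only on the documented hierarchy (the path is a placeholder):

import zcollection as zc

try:
    collection = zc.open_collection("/data/missing")
except zc.CollectionNotFoundError:
    collection = None                  # nothing at that path yet
except zc.ZCollectionError as exc:     # catches every other zcollection error
    raise RuntimeError(f"zcollection failure: {exc}") from exc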

Async API#

zcollection.aio mirrors the sync facade one-to-one. Every function returns a coroutine; every method has an _async counterpart on Collection and view.View. Use this module when you are already inside an event loop.

Public async facade for zcollection v3.

Mirrors zcollection.api but every entry point is a coroutine. This module is the recommended surface when running inside an event loop (e.g. FastAPI handlers, Jupyter cells with top-level await, or other async workloads).
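
For example, inside an event loop (the path is a placeholder; the call mirrors the sync open_collection()):

import asyncio
import zcollection.aio

async def main() -> None:
    collection = await zcollection.aio.open_collection("/data/swot", mode="r")
    ...  # work with the returned Collection

asyncio.run(main())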

async zcollection.aio.create_collection(path, *, schema, axis, partitioning, catalog_enabled=False, overwrite=False)[source]#

Async create. Returns the same Collection as the sync API.

Parameters:
Return type:

Collection

async zcollection.aio.open_collection(path, *, mode='r')[source]#

Async open. Returns a Collection; read_only flag follows mode.

Parameters:
Return type:

Collection