API Reference#
This page documents the public surface of zcollection. Everything listed
here is importable from the top-level zcollection package (with a
few sub-namespaces for grouped APIs: zcollection.codecs,
zcollection.merge, zcollection.partitioning,
zcollection.view, zcollection.aio).
Building a schema#
A DatasetSchema is the immutable description of a
dataset: dimensions, variables, attributes, child groups, and a format
version. The recommended way to build one is the fluent
Schema() factory:
import numpy
import zcollection as zc
schema = (
    zc.Schema()
    .with_dimension("time", chunks=4096)
    .with_dimension("x_ac", size=240, chunks=240)
    .with_variable("time", dtype="datetime64[ns]", dimensions=("time",))
    .with_variable(
        "ssh",
        dtype="float32",
        dimensions=("time", "x_ac"),
        fill_value=numpy.float32("nan"),
        codecs=zc.codecs.profile("cloud-balanced"),
    )
    .build()
)
Hierarchical schemas#
Variables can be placed inside nested groups by passing group= to
with_dimension(),
with_variable(), and
with_attribute(). Use
with_group() to attach group-level
attributes ahead of time. Intermediate groups along the path are
created on demand. Variables in a child group may reference dimensions
declared on any ancestor (dimension inheritance):
schema = (
    zc.Schema()
    .with_dimension("time", chunks=4096)
    .with_variable("time", dtype="int64", dimensions=("time",))
    .with_group("/data_01/ku", attrs={"band": "Ku"})
    .with_dimension("range", size=240, chunks=240, group="/data_01/ku")
    .with_variable(
        "power",
        dtype="float32",
        dimensions=("time", "range"),  # ``time`` inherited from root
        group="/data_01/ku",
    )
    .build()
)
The resulting DatasetSchema exposes the hierarchy
through groups and
all_variables() (keys are absolute paths).
with_partition_axis() recurses into
nested groups, marking variables that don’t span the partitioning axis
as immutable.
- zcollection.Schema()[source]#
Shorthand for SchemaBuilder.
- Return type:
- class zcollection.SchemaBuilder[source]#
Mutable builder that produces an immutable DatasetSchema.
- build()[source]#
Build and return the immutable DatasetSchema.
- Return type:
- with_attribute(name, value, *, group=None)[source]#
Register an attribute on the root or a nested group.
- Parameters:
- Returns:
This builder, to allow chaining.
- Return type:
- with_dimension(name, *, size=None, chunks=None, shards=None, group=None)[source]#
Register a dimension on the root or a nested group.
- Parameters:
name (str) – Dimension name.
size (int | None) – Fixed size, or None if unknown (e.g. partitioning axis).
chunks (int | None) – Chunk size along this dimension; None to use the full extent.
shards (int | None) – Shard size along this dimension; None for no sharding.
group (str | None) – Optional path of the group this dimension belongs to. Defaults to the root group.
- Returns:
This builder, to allow chaining.
- Return type:
- with_group(path, *, attrs=None)[source]#
Declare a nested group at path, optionally with attributes.
Intermediate groups along the path are created if missing. Calling this is only required to attach attributes to a group ahead of time; otherwise with_variable() and with_dimension() will create groups lazily.
- Parameters:
- Returns:
This builder, to allow chaining.
- Return type:
- with_variable(name, *, dtype, dimensions, fill_value=None, codecs=None, attrs=None, role=VariableRole.USER, group=None)[source]#
Register a variable on the root or a nested group.
- Parameters:
name (str) – Variable name.
dtype (Any) – NumPy dtype or anything numpy.dtype() accepts.
dimensions (Iterable[str]) – Names of the dimensions this variable spans. Each dimension must be declared on the same group or an ancestor.
fill_value (Any | None) – Optional fill value.
codecs (CodecStack | None) – Optional explicit codec stack; auto-detected when None.
role (VariableRole) – Provenance role of the variable.
group (str | None) – Optional path of the group this variable belongs to. Defaults to the root group.
- Returns:
This builder, to allow chaining.
- Return type:
- class zcollection.DatasetSchema(name='/', dimensions=<factory>, variables=<factory>, groups=<factory>, attrs=<factory>, format_version=1)[source]#
Immutable description of a collection’s dataset.
Extends GroupSchema with a format_version field and the JSON envelope used by the on-disk _zcollection.json config. The root name is always "/".
- Parameters:
- all_variables_by_role(*, immutable=None)[source]#
Return all variables across the tree (keyed by absolute path).
- Parameters:
immutable (bool | None)
- Return type:
- property dim_chunks: dict[str, int | None]#
Return a mapping of dimension name to declared chunk size.
Walks the entire tree so chunk hints declared on ancestor groups are visible to descendants.
- property dim_sizes: dict[str, int | None]#
Return a mapping of dimension name to declared size (root only).
- classmethod from_json(payload)[source]#
Build a schema from a JSON-compatible dict, upgrading older formats.
- Parameters:
- Return type:
- select(names)[source]#
Return a new schema restricted to the named variables.
names may be short names (resolved against the root group) or absolute paths (/grp/sub/var). Empty groups in the tree are pruned from the result.
- Raises:
SchemaError – If any of names is not a known variable.
- Parameters:
- Return type:
- variables_by_role(*, immutable=None)[source]#
Return root-group variables, optionally filtered by immutable.
- Parameters:
immutable (bool | None) – If given, keep only variables whose immutable attribute matches this value.
- Returns:
A tuple of the matching variables in declaration order.
- Return type:
tuple[VariableSchema, …]
- with_partition_axis(axis)[source]#
Bind the partition axis and tag each variable accordingly.
Mark each variable immutable iff it does not span axis. The immutable tag means “constant across all partitions”: the variable is written once at the collection root (_immutable/) and merged into the dataset returned by every partition open.
A Collection only knows two kinds of variables:
Partitioned — variables that span axis. Their rows are split across partitions.
Immutable — variables whose dimensions are all declared with a fixed size. They are the same in every partition.
Anything else (an unbounded dimension other than axis) is forbidden: the collection has no rule to slice such a variable, no rule to merge it across granules, and no rule to deduplicate it. If a dataset carries two independent unbounded series (e.g. a 1 Hz and a 20 Hz time series), they belong in two different collections, not one.
- Parameters:
axis (str) – Name of the partition axis (a root dimension).
- Returns:
A new DatasetSchema with immutable flags set on every variable.
- Raises:
SchemaError – If axis is not a root dimension, or if any variable references an unbounded dimension other than axis.
- Return type:
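A minimal sketch of binding the partition axis on the schema built in the examples above (the "time" dimension name comes from those examples):
partitioned_schema = schema.with_partition_axis("time")
# Variables spanning "time" stay partitioned; fixed-size variables that
# do not span "time" are tagged immutable on the returned schema.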
- class zcollection.GroupSchema(name='/', dimensions=<factory>, variables=<factory>, groups=<factory>, attrs=<factory>)[source]#
Immutable description of a (possibly nested) group of variables.
- Parameters:
name (str) – Short name of the group ("/" for the root).
dimensions (Mapping[str, Dimension]) – Mapping of dimension name to dimension metadata declared on this group. Variables declared on this group or any descendant may reference dimensions declared on this group or any ancestor.
variables (Mapping[str, VariableSchema]) – Variables declared at this group level.
groups (Mapping[str, GroupSchema]) – Child groups, keyed by short name.
- classmethod from_json(payload)[source]#
Build a group schema from a JSON-compatible dict.
- Parameters:
- Return type:
- get_group(path)[source]#
Return the group at path (absolute or relative).
- groups: Mapping[str, GroupSchema]#
Mapping of child group name to child GroupSchema.
- variables: Mapping[str, VariableSchema]#
Mapping of variable name to variable metadata.
- with_group(path, group)[source]#
Return a copy with group inserted at path (relative).
path is the parent path; the inserted group keeps its own name. Intermediate groups are created if missing.
- Parameters:
path (str)
group (GroupSchema)
- Return type:
- class zcollection.Dimension(name, size=None, chunks=None, shards=None)[source]#
A named axis of a dataset.
size is None when unknown (typical for the partitioning axis). chunks is None when unconstrained (use full extent on write). shards is None when unsharded; otherwise the shard size in elements along this dimension.
- class zcollection.VariableSchema(name, dtype, dimensions, fill_value=None, codecs=<factory>, attrs=<factory>, role=VariableRole.USER, immutable=False)[source]#
All information needed to create a Zarr v3 array for one variable.
- Parameters:
- codecs: CodecStack#
Codec stack for encoding/decoding the variable’s data.
- classmethod from_json(payload)[source]#
Build a variable from a JSON-compatible dict.
- Parameters:
- Return type:
- role: VariableRole#
Role of the variable in the schema.
Creating and opening collections#
A Collection is the persistent, partitioned
container. The two entry points dispatch on URL scheme
(file://, memory://, s3://, icechunk://):
- zcollection.create_collection(path, *, schema, axis, partitioning, catalog_enabled=False, overwrite=False)[source]#
Create a new collection at path and return its handle.
Convenience wrapper around Collection.create(). If path is a string it is dispatched through open_store() using its URL scheme (file://, memory://, s3://, icechunk://).
- Parameters:
schema (DatasetSchema) – The dataset schema. Variables not depending on axis become immutable.
axis (str) – Name of the partition axis (a dimension of schema).
partitioning (Partitioning) – A Partitioning instance (e.g. Date, Sequence, GroupedSequence).
catalog_enabled (bool) – If True, maintain a sharded _catalog/ of partition paths to make cold opens and listings O(1).
overwrite (bool) – If True, replace any existing root at path.
- Returns:
A writable Collection ready to insert.
- Raises:
CollectionExistsError – If a collection already exists at path and overwrite=False.
- Return type:
- zcollection.open_collection(path, *, mode='r')[source]#
Open an existing collection in r (read-only) or rw mode.
- Parameters:
- Returns:
A Collection bound to the existing root. In "r" mode all mutating methods raise ReadOnlyError.
- Raises:
ValueError – If mode is not one of "r" or "rw".
CollectionNotFoundError – If no collection exists at path.
- Return type:
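A short end-to-end sketch that reuses the schema built at the top of this page. The "day" resolution token passed to Date is an assumption about the accepted values; adjust it to your installed version:
import zcollection as zc
from zcollection.partitioning import Date

col = zc.create_collection(
    "memory://demo",
    schema=schema,
    axis="time",
    partitioning=Date(("time",), resolution="day"),
)

# Later, reopen the same root read-only.
ro = zc.open_collection("memory://demo", mode="r")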
Working with a Collection#
- class zcollection.Collection(store, *, schema, axis, partitioning, catalog_enabled=False, read_only=False)[source]#
A partitioned Zarr v3 collection on a Store.
A Collection materialises a DatasetSchema over a set of partitions on disk. Each partition is a self-contained Zarr v3 group whose key in the partitioning dimension is encoded into its path (e.g. year=2024/month=03). Variables that don’t span the partition axis are flagged immutable, written once under _immutable/, and merged back into every read.
Typical lifecycle:
zcollection.create_collection() (or Collection.create()) writes the root config and returns a writable handle.
insert() (and any merge strategy) populates partitions.
zcollection.open_collection() (or Collection.open()) reopens an existing root, optionally read-only.
query(), map(), update(), and drop_partitions() operate on subsets selected by partition filters; query(), map(), and update() each have an _async sibling with identical semantics.
Direct use of __init__() is reserved for the library; user code should always go through create()/open() or the factory functions in zcollection.api.
- Parameters:
store (Store)
schema (DatasetSchema)
axis (str)
partitioning (Partitioning)
catalog_enabled (bool)
read_only (bool)
- drop_partitions(*, filters=None)[source]#
Delete matching partitions.
Wrapped in a store session so transactional backends commit the whole drop atomically.
Unlike insert(), query(), map() and update(), this method has no _async sibling: deletes are sequential by design (the store-session commit is one transaction).
- Parameters:
filters (str | None) – A partition-key predicate; see partitions(). If None, every partition is dropped — pass an explicit filter when you don’t mean that.
- Returns:
The list of partition paths that were removed, in the order they were processed.
- Raises:
ReadOnlyError – If the collection was opened with read_only=True.
ExpressionError – If filters is malformed; see partitions().
- Return type:
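For instance, dropping a single year while leaving everything else untouched (the year= key name assumes a Date partitioning; remember that filters=None drops every partition):
removed = col.drop_partitions(filters="year == 2023")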
- insert(dataset, *, overwrite=True, merge=None)[source]#
Insert dataset into the collection.
The dataset is split by the collection’s partitioning. Each slice is written under the matching partition path. Variables flagged immutable in the schema (those that don’t span the partition axis) are written once under _immutable/ and are merged into the dataset returned by every subsequent query() / map() / update().
On a transactional store (e.g. IcechunkStore) the entire insert is wrapped in a single commit — a crash mid-insert leaves no partial state.
- Parameters:
dataset (Dataset) – The data to insert. Variables can be backed by numpy, Dask, or Zarr AsyncArray.
overwrite (bool) – When a partition already exists, whether to replace its chunks (True, default) or fail. When merge is set, overwrite=True is effectively required: the merged dataset must replace the existing partition, so passing overwrite=False with a non-None merge will raise once the writer hits the existing group.
merge (MergeCallable | str | None) – Strategy for combining the inserted data with an existing partition. Either a built-in alias ("replace" / "concat" / "time_series" / "upsert"), a MergeCallable, or None (default — the inserted slice is written as-is and replaces the existing partition’s chunks). For tolerance-aware nearest-neighbour matching, build a callable with zcollection.collection.merge.upsert_within() — it isn’t registered as a string alias because it needs the tolerance argument.
- Returns:
The list of partition paths that were written, in the order they were produced.
- Raises:
ReadOnlyError – If the collection was opened with read_only=True.
KeyError – If merge is a string that doesn’t match a built-in strategy.
PartitionError – If dataset is missing variables required by the partitioning.
- Return type:
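A typical pair of inserts, reusing the col handle opened above; ds and ds_reprocessed stand in for datasets whose construction is elided:
# Plain insert: ds is split by the partitioning and written slice by slice.
written = col.insert(ds)

# Re-insert overlapping data with the time-aware built-in strategy.
written = col.insert(ds_reprocessed, merge="time_series")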
- async insert_async(dataset, *, overwrite=True, merge=None)[source]#
Async variant of insert().
- Parameters:
dataset (Dataset)
overwrite (bool)
merge (MergeCallable | str | None)
- Return type:
- map(fn, *, filters=None, variables=None)[source]#
Apply fn to each matching partition and collect the results.
fn receives the partition Dataset with the immutable group merged in. Read-only — the partition is not written back. Use update() to persist transformed data.
- Parameters:
- Returns:
A mapping {partition_path: fn(dataset)}.
- Return type:
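For example, a read-only per-partition reduction over the "ssh" variable from the schema example, assuming item access on the partition dataset returns the Variable as in the path-based access shown further down:
means = col.map(
    lambda ds: float(ds["ssh"].to_numpy().mean()),
    filters="year == 2024",
)
# means == {"year=2024/month=01": 0.12, "year=2024/month=02": 0.09, ...}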
- property partitioning: Partitioning#
Return the partitioning strategy.
- partitions(*, filters=None)[source]#
Yield relative partition paths in sorted order, optionally filtered.
- Parameters:
filters (str | None) – An optional partition-key expression. The expression language is the small typed subset described in compile_filter(): comparisons, and / or / not, in / not in, integer/string literals, and partition-key names. Examples: "year == 2024 and month >= 3", "cycle in (1, 2)". Comparable types are whatever Partitioning.decode produces (integers and strings in the built-in partitionings). None yields every partition.
- Yields:
Relative partition paths (e.g. "year=2024/month=03"), sorted lexicographically.
- Raises:
ExpressionError – If filters has invalid syntax or uses disallowed AST nodes (raised at compile time), or if it references a partition-key name that doesn’t exist (raised the first time the predicate is evaluated against a partition).
- Return type:
- query(*, filters=None, variables=None)[source]#
Read matching partitions and return the concatenated dataset.
Partitions are loaded concurrently up to zcollection.config["partition.concurrency"]. The variables flagged immutable in the schema (read once from _immutable/) are merged into the result; on name conflict the partition’s data wins.
- Parameters:
filters (str | None) – A partition-key predicate; see partitions(). None reads every partition.
variables (Iterable[str] | None) – Iterable of variable names to load. None (default) loads every variable. Loading a subset is the primary way to keep cold S3 reads cheap.
- Returns:
The concatenated Dataset along the partitioning dimension, or None if no partition matched filters.
- Raises:
ExpressionError – If filters is malformed; see partitions().
- Return type:
Dataset | None
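A short usage sketch: load a single variable for one month and handle the no-match case explicitly:
ds = col.query(filters="year == 2024 and month == 3", variables=["ssh"])
if ds is None:
    print("no partition matched")
else:
    print(ds.nbytes, "bytes loaded")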
- repair_catalog()[source]#
Rebuild the catalog by walking the store; return the new path list.
Use after a crash, or after manual edits to the partition tree.
- Returns:
The freshly walked list of partition paths, in sorted order, after the catalog has been rewritten.
- Raises:
RuntimeError – If the collection was not opened with catalog_enabled=True (no catalog to repair).
ReadOnlyError – If the collection was opened read-only.
- Return type:
- property schema: DatasetSchema#
Return the bound dataset schema.
- update(fn, *, filters=None, variables=None)[source]#
Read each matching partition, transform it, and write it back.
fn is called per partition with the dataset returned by query() (immutable variables merged in). It must return a Dataset carrying every variable the partition should keep — the call rewrites the partition wholesale with whatever fn returns.
Warning
update is not a partial-write API. Each partition group is recreated with overwrite=True before the new dataset is written, so any variable absent from fn’s return value is dropped from disk. This holds regardless of variables: if you load only a subset and return only a subset, the unloaded variables are also lost. To update one variable without disturbing the others, return a Dataset that still carries the rest (e.g. start from the input ds and replace just the target variable). fn should also preserve the partition’s length along the partitioning dimension; update does not refresh the catalog from per-partition geometry.
- Parameters:
fn (Callable[[Dataset], Dataset]) – Function Dataset -> Dataset applied to each matching partition.
filters (str | None) – Partition-key predicate; see partitions().
variables (Iterable[str] | None) – Optional whitelist of variables to load before calling fn. Reduces I/O at read time, but does not protect unloaded variables from being dropped on write — see the warning above.
- Returns:
The list of partition paths that were written, in the order they were processed.
- Raises:
ReadOnlyError – If the collection was opened with read_only=True.
ExpressionError – If filters is malformed; see partitions().
- Return type:
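Given the warning above, a safe single-variable update loads everything and returns the full input dataset. The sketch below also assumes the partition data is eager numpy, so the no-copy array returned by to_numpy() can be corrected in place; for lazy data you would rebuild the variable instead:
import numpy

def debias(ds):
    # No variables= whitelist: ds carries every variable, so returning it
    # keeps them all on disk after the rewrite.
    ssh = ds["ssh"].to_numpy()     # no copy for eager numpy-backed data
    ssh -= numpy.nanmean(ssh)      # illustrative in-place correction
    return ds

written = col.update(debias, filters="year == 2024")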
Merge strategies#
Strategies for inserting into a partition that already has data on disk.
Pass any of these to Collection.insert() via the merge=
argument, either as the callable or by name ("replace", "concat",
"time_series", "upsert").
- zcollection.merge.replace(existing, inserted, *, axis, partitioning_dim)[source]#
Drop existing entirely; only inserted is written.
- zcollection.merge.concat(existing, inserted, *, axis, partitioning_dim)[source]#
Append inserted after existing along partitioning_dim.
- Parameters:
- Returns:
The concatenation existing || inserted. No deduplication or sorting is performed.
- Return type:
- zcollection.merge.time_series(existing, inserted, *, axis, partitioning_dim)[source]#
Time-aware merge: drop the existing window covered by inserted.
Rows in existing whose axis value falls inside [inserted_min, inserted_max] are dropped wholesale, the remaining rows are concatenated with inserted, and the result is sorted by axis. The input axes do not need to be monotonic — the function sorts on the way out.
Empty-input short-circuits: if inserted has zero rows along axis the function returns existing unchanged; if existing is empty it returns inserted unchanged.
- Parameters:
existing (Dataset) – Dataset already on disk.
inserted (Dataset) – New dataset.
axis (str) – Name of the variable used for the time-window comparison. Must be a 1-D variable present on both sides; comparable types include numeric and datetime64.
partitioning_dim (str) – Dimension along which to slice and concat.
- Returns:
The merged, axis-sorted dataset.
- Raises:
ValueError – If axis is not a variable on both sides.
- Return type:
- zcollection.merge.upsert(existing, inserted, *, axis, partitioning_dim, tolerance=None)[source]#
Row-wise replace-or-add by axis proximity.
For each row in existing: keep it iff its axis value has no match in inserted[axis]. The kept rows are concatenated with inserted and the result is sorted by axis. Use this when a new acquisition may partially overlap (replace) and partially extend (add) an existing partition without wiping the gaps in between.
Empty-input short-circuits: if inserted has zero rows along axis the function returns existing unchanged; if existing is empty it returns inserted unchanged.
tolerance controls what counts as a match:
None (default) — exact equality (numpy.isin).
a scalar (e.g. numpy.timedelta64(500, "ms") for datetime axes, or a float for numeric axes) — an existing row matches the nearest inserted row when |existing - nearest_inserted| <= tolerance. Useful when re-acquired timestamps are jittered by clock drift.
The string alias "upsert" covers only exact equality. For a tolerance-aware merge passed via Collection.insert(merge=…) — which accepts a string alias or a callable, but not a string plus an argument — wrap the tolerance with upsert_within() and pass the returned callable:
col.insert(
    ds,
    merge=zcollection.merge.upsert_within(numpy.timedelta64(500, "ms")),
)
- Parameters:
existing (Dataset) – Dataset already on disk.
inserted (Dataset) – New dataset.
axis (str) – Name of the matching variable. Must be a 1-D variable present on both sides; comparable types include numeric and datetime64.
partitioning_dim (str) – Dimension along which to slice rows.
tolerance (Any) – None for exact equality, or a scalar for nearest-neighbour matching. For datetime axes pass a numpy.timedelta64 (e.g. timedelta64(500, "ms")); for numeric axes pass a plain float.
- Returns:
The merged, axis-sorted dataset.
- Raises:
ValueError – If axis is not a variable on both sides.
- Return type:
- zcollection.merge.upsert_within(tolerance)[source]#
Return an upsert() strategy bound to tolerance.
The returned MergeCallable is suitable for passing to Collection.insert(merge=…):
import numpy
from zcollection.collection import merge

# 500 ms clock-drift tolerance on a datetime axis.
col.insert(ds, merge=merge.upsert_within(numpy.timedelta64(500, "ms")))
# Or on a numeric axis:
col.insert(ds, merge=merge.upsert_within(1e-6))
- Parameters:
tolerance (Any) – None for exact equality, or a scalar (numeric or numpy.timedelta64) controlling the nearest-neighbour match window in upsert().
- Returns:
A MergeCallable that calls upsert() with the given tolerance.
- Return type:
- class zcollection.merge.MergeCallable(*args, **kwargs)[source]#
Contract for a merge strategy.
Implementations are invoked by Collection.insert_async whenever the dataset slice destined for a partition collides with an existing on-disk partition. The callable receives the two datasets plus two metadata strings, and returns the dataset that should actually be written.
Partitioning#
Partitioning strategies decide how rows are bucketed onto disk. Pick one when you create the collection; it’s persisted and used for every subsequent open.
- class zcollection.partitioning.Date(variables, *, resolution, dimension=None)[source]#
Partition by truncating a 1-D datetime64 variable to resolution.
Component names match the layout (year=2024/month=03/day=01).
- decode(path)[source]#
Decode a relative storage path into a key.
- Parameters:
path (str) – A string representing the relative storage path.
- Returns:
A tuple of (component-name, value) pairs corresponding to the partition key.
- Raises:
PartitionError – If the path is not properly formatted or if any component is missing or has an invalid value.
- Return type:
- encode(key)[source]#
Encode a key as a relative storage path.
- Parameters:
key (tuple[tuple[str, int], ...]) – A tuple of (component-name, value) pairs corresponding to the partition key.
- Returns:
A string representing the relative storage path for the given key, formatted according to the component names and zero-padding widths defined for this partitioning.
- Return type:
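An illustrative encode/decode round-trip; the "day" resolution token is an assumption, and the encoded path follows the zero-padded layout shown above:
part = Date(("time",), resolution="day")
key = (("year", 2024), ("month", 3), ("day", 1))
path = part.encode(key)        # e.g. 'year=2024/month=03/day=01'
assert part.decode(path) == key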
- class zcollection.partitioning.Sequence(variables, *, dimension)[source]#
Partition by the unique value tuples of one or more variables.
The variables must all share the same single dimension, which becomes the partitioning axis.
- Parameters:
- split(dataset)[source]#
Yield (key, slice) for each unique tuple in dataset.
- Parameters:
dataset (Dataset) – The dataset to split; must contain the partition-key variables as 1-D arrays along the partitioning dimension.
- Yields:
Tuples of the form (key, slice), where key is a partition key tuple (e.g. (("x", 1), ("y", 2))) representing the unique value combination for the partition, and slice is a slice object that can be used to index into the dataset along the partitioning dimension.
- Return type:
- class zcollection.partitioning.GroupedSequence(variables, *, dimension, size, start=0)[source]#
Like Sequence, but groups the last variable into buckets of size.
The first len(variables) - 1 variables continue to act as exact keys; the last variable is mapped to (value - start) // size * size + start before grouping. size must be ≥ 2 (otherwise prefer Sequence).
- Parameters:
variables (tuple[str, ...]) – The variable(s) to partition by; must be at least one.
dimension (str) – The dimension to partition along; if None, inferred from the variable name.
size (int) – The bucket size for the last variable; must be ≥ 2.
start (int) – The bucket origin for the last variable; defaults to 0.
- decode(path)#
Decode a relative storage path into a key.
- encode(key)#
Encode a key as a relative storage path.
- classmethod from_json(payload)[source]#
Reconstruct a GroupedSequence from its JSON payload.
- split(dataset)[source]#
Yield (key, slice) for each contiguous bucket in dataset.
- Parameters:
dataset (Dataset) – The dataset to partition, which must contain all variables in this partitioning and have the partitioning dimension in each.
- Yields:
Tuples of (partition_key, slice) for each contiguous bucket in the dataset, where partition_key is a tuple of (component, value) pairs representing the partition key for that bucket, and slice is a slice object that can be used to index into the dataset along the partitioning dimension.
- Return type:
- class zcollection.partitioning.Partitioning(*args, **kwargs)[source]#
Maps dataset rows along a partitioning axis to partition keys.
Implementations operate on plain numpy — Dask is layered higher up.
- property axis: tuple[str, ...]#
Variables (along the partitioning dimension) used to derive the key.
- zcollection.partitioning.compile_filter(expr)[source]#
Compile a filter expression to a predicate over partition-key dicts.
expr=None or an empty string returns a tautology.
- Parameters:
expr (str | None) – The filter expression to compile. This should be a string containing a valid Python expression using only the allowed syntax.
- Returns:
A predicate function that takes a partition-key dict and returns a bool indicating whether the partition key satisfies the filter expression.
- Raises:
ExpressionError – If the expression contains syntax errors or uses disallowed syntax.
- Return type:
The compile_filter() helper turns a
filter string (the kind passed to Collection.partitions(filters=…))
into a typed predicate. You usually do not need to call it directly.
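If you do call it, the returned predicate operates on plain partition-key dicts:
from zcollection.partitioning import compile_filter

pred = compile_filter("year == 2024 and month >= 3")
pred({"year": 2024, "month": 5})   # True
pred({"year": 2023, "month": 12})  # False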
Views#
A view overlays extra variables on top of an existing read-only base collection without copying its data. Views live in a separate store and share the partitioning of the base.
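A hedged sketch of the lifecycle, reusing the col collection opened earlier; the overlay variable name, the store URLs, and the use of a throwaway schema to obtain VariableSchema instances are all illustrative:
import zcollection as zc
from zcollection.view import View

# One overlay variable spanning the base's partitioning dimension.
overlay = (
    zc.Schema()
    .with_dimension("time", chunks=4096)
    .with_variable("ssh_anomaly", dtype="float32", dimensions=("time",))
    .build()
)

view = View.create(
    zc.open_store("memory://overlay"),
    base=col,
    variables=overlay.variables_by_role(),   # tuple of VariableSchema
    reference="memory://demo",               # URI of the base collection
)

# Fill the overlay variable partition by partition.
view.update(lambda ds: {"ssh_anomaly": ds["ssh"].to_numpy().mean(axis=1)})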
- class zcollection.view.View(*, store, base, view_schema, reference, read_only=False)[source]#
Overlay of extra variables on top of a base Collection.
- Parameters:
store (Store)
base (Collection)
view_schema (DatasetSchema)
reference (ViewReference)
read_only (bool)
- property base: Collection#
Return the underlying base collection.
- classmethod create(store, *, base, variables, reference, overwrite=False)[source]#
Create a new view backed by store and overlaying base.
- Parameters:
store (Store) – Backing store for the view’s overlay variables. Must be different from base.store.
base (Collection) – The underlying read-only base collection.
variables (Iterable[VariableSchema]) – Schemas for the new variables the view adds. Each must share at least the partitioning dimension with the base collection.
reference (ViewReference | str) – Either a ViewReference or a string URI identifying the base collection.
overwrite (bool) – If True, replace any existing view at this location.
- Returns:
A writable View ready to update.
- Raises:
CollectionExistsError – If a view already exists at store.root_uri and overwrite=False.
ZCollectionError – If a view variable’s dimensions are inconsistent with the base schema.
- Return type:
- classmethod open(store, *, base, read_only=False)[source]#
Open an existing view from store.
- Parameters:
store (Store) – The store backing the view’s overlay variables.
base (Collection) – The base collection that this view extends. The caller is responsible for ensuring it matches reference.
read_only (bool) – If True, mutating methods (update) raise ReadOnlyError.
- Returns:
A View bound to the existing overlay.
- Raises:
CollectionNotFoundError – If no view config exists at store.root_uri.
- Return type:
- partitions(*, filters=None)[source]#
Yield partition paths from the base collection, optionally filtered.
- query(*, filters=None, variables=None)[source]#
Return the merged base+overlay dataset for matching partitions.
Variables are sourced from whichever side owns them. On a name collision the overlay (view) wins.
- property reference: ViewReference#
Return the reference to the base collection.
- update(fn, *, filters=None, variables=None)[source]#
Compute view-variable arrays for each base partition and write them.
fn runs once per matching base partition. It receives the merged base+view dataset and must return a mapping from view-variable name to a numpy array sized along the partitioning dimension. Returned keys must be a subset of view_schema’s variables; missing keys are ignored, unknown keys raise.
- Parameters:
fn (Callable[[Dataset], dict[str, Any]]) – Pure function Dataset -> {name: numpy.ndarray}.
filters (str | None) – Partition-key predicate (same syntax as Collection.partitions()).
variables (Iterable[str] | None) – Optional whitelist of variables to load before calling fn. None loads everything available.
- Returns:
The list of partition paths that were written, in the order they were processed.
- Raises:
ReadOnlyError – If the view was opened with read_only=True.
- Return type:
- property view_schema: DatasetSchema#
Return the view’s overlay schema.
Indexing#
An Indexer is a Parquet-backed
lookup table: it maps user-defined key columns to (partition, start,
stop) row ranges so callers can slice a collection without scanning
every partition.
- class zcollection.indexing.parquet.Indexer(table)[source]#
Lookup table over a Collection’s rows.
- Parameters:
table (Table)
- classmethod build(collection, *, builder, filters=None, variables=None)[source]#
Build an index by walking the collection’s partitions.
- Parameters:
collection (Collection) – The Collection to index.
builder (Callable[[Dataset], ndarray | dict[str, ndarray]]) – Per-partition row generator. It receives a Dataset slice and must return either a structured numpy array (with named fields, including _start and _stop) or a dict {column_name: 1-D numpy.ndarray} of equal length. The user-defined columns become the queryable key columns; _start / _stop carry contiguous row ranges within the partition.
filters (str | None) – Partition-key predicate forwarded to Collection.partitions() to skip whole partitions.
variables (Iterable[str] | None) – Optional whitelist of variables to load before calling builder — keep this tight on cloud stores.
- Returns:
An Indexer whose Parquet table is the concatenation of every non-empty per-partition output, with a _partition column added by the builder pipeline.
- Raises:
ValueError – If builder’s output for any partition omits a _start or _stop column.
TypeError – If builder returns something other than a structured array or a dict of arrays.
- Return type:
- property key_columns: tuple[str, ...]#
Return the user-defined key column names (excluding reserved ones).
- lookup(**predicates)[source]#
Return {partition: [(start, stop), ...]} for matching rows.
Filters are AND-ed: a row matches when every named predicate matches. A scalar predicate is equality; a list / tuple / set / ndarray predicate is set membership (IN).
**predicates (Any) – One
key_column=valueper filter. Allowed column names are exactlykey_columns.- Returns:
A mapping from partition path to the list of contiguous
(start, stop)row ranges that satisfy the predicates. Partitions with no matching row are absent from the mapping.- Raises:
KeyError – If a predicate names an unknown column.
- Return type:
- property table: Table#
Return the underlying PyArrow table.
Datasets, groups, and variables (in memory)#
The objects you receive from Collection.query() and pass back into
Collection.insert(). Dataset is a root
Group: it owns variables and child groups
directly. Group itself is the hierarchical
container reused for every nested group.
Both Dataset and Variable
print as multi-line, xarray-like blocks that include a synthetic byte
size for the dataset, each child group, and each variable, so you can
gauge memory/disk footprint at a glance:
<zcollection.data.dataset.Dataset '/'>  Size: 1.29 GB
Dimensions:  (time: 100000, x_ac: 240)
Data variables:
    time  (time)        int64    781.25 kB  numpy.ndarray<...>
    ssh   (time, x_ac)  float32  91.55 MB   numpy.ndarray<...>
Groups:
    data_01  1.20 GB  (0 variables, 1 subgroup)
Path-based access (ds.get_variable("/data_01/ku/power"),
ds["/data_01/ku/power"]) and short-name search
(find_variable(),
find_group()) make navigating the hierarchy
straightforward. find_dimension() walks up the
tree so child groups inherit dimensions declared on ancestors.
nbytes recurses over the whole tree.
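For a dataset carrying the hierarchical schema built earlier (for instance one returned by query()), that looks like:
power = ds.get_variable("/data_01/ku/power")  # absolute path, raises on miss
same = ds.find_variable("power")              # short-name search, None on miss
ku = ds.find_group("ku")                      # depth-first group search
print(ds.nbytes)                              # whole-tree footprint in bytes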
The xarray bridge (to_xarray() /
from_xarray()) operates on the root group
only — xarray has no native group concept, so nested groups are dropped
on conversion.
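A quick round-trip sketch; remember that only one flat group survives each direction:
# Materialise just the nested group as a flat xarray.Dataset ...
xr_ku = ds.to_xarray("/data_01/ku")

# ... and rebuild a (flat) zcollection Dataset from any xarray Dataset.
flat = zc.Dataset.from_xarray(xr_ku)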
- class zcollection.Dataset(schema, variables=(), groups=(), attrs=None)[source]#
A schema plus the in-memory data for one or more partitions.
A Dataset is the root Group of a zcollection tree (name == "/", parent is None). Compared to a plain Group, the schema attribute is narrowed to DatasetSchema (statically; the runtime slot is the same) and the constructor has a path-aware variable router.
- Parameters:
schema (DatasetSchema) – The dataset schema (root GroupSchema).
variables (Mapping[str, Variable] | Iterable[Variable]) – Variables for the dataset. Two forms are accepted:
a mapping name -> Variable: keys without / are placed at the root; keys containing / are routed into the matching nested group (intermediate groups are auto-built from schema);
an iterable of Variable instances — each is placed at the root using its own schema.name.
groups (Mapping[str, Group] | Iterable[Group]) – Optional pre-built child groups for the root. They are merged with the auto-built nested groups, with groups winning on name conflict for groups it already provides.
attrs (Mapping[str, Any] | None) – Optional dataset-level attributes; defaults to schema.attrs.
- classmethod from_xarray(ds)[source]#
Build a Dataset from an xarray Dataset (flat).
Schema fields are inferred from the input:
dimensions come from ds.sizes;
dataset attributes from ds.attrs;
each variable contributes its dtype, dimensions, attrs (minus _FillValue, which becomes the variable’s fill_value); coordinates and data variables are merged into a single flat namespace.
xarray has no native group concept, so the resulting Dataset is always flat (root group only). Round-trip a hierarchical Dataset by calling to_xarray() per group.
- Parameters:
ds (xarray.Dataset) – The xarray Dataset to convert.
- Returns:
A new Dataset with an inferred schema and the same variable data (no copy — the underlying arrays are shared with ds).
- Return type:
- select(names)[source]#
Return a new dataset restricted to the named variables.
Both the variables and the schema are subsetted: the returned dataset’s schema is built via DatasetSchema.select(), which prunes any group that ends up empty after the selection. names may be short names (resolved against the root group) or absolute paths (/grp/var).
- to_xarray(group=None)[source]#
Convert one group’s variables to an xarray Dataset.
xarray has no native group concept, so a single xarray.Dataset carries exactly one flat namespace of variables. Use group to pick which of this dataset’s groups is materialised; the rest of the tree is dropped silently.
- Parameters:
group (str | None) – Absolute or relative path of the group to convert. None (default) and "/" both target the root group (backward-compatible behaviour). For a nested group, pass its path (e.g. "/data_01" or "data_01/ku"); get_group semantics apply.
- Returns:
An xarray.Dataset whose variables are the named group’s variables and whose attrs are that group’s attributes.
- Raises:
KeyError – If group does not resolve to a known group in this dataset’s tree.
- Return type:
- class zcollection.Group(schema, variables=(), groups=(), attrs=None, *, name='/', parent=None)[source]#
A named container of variables, attributes, dimensions, and child groups.
The root group has name == "/" and parent is None. Child groups keep a back-reference to their parent so absolute paths (long_name()) and dimension inheritance (find_dimension()) can be computed cheaply.
- Parameters:
schema (GroupSchema) – Schema describing this group’s variables, dimensions, and attributes.
variables (Mapping[str, Variable] | Iterable[Variable]) – Variables for this group, as a mapping or iterable.
groups (Mapping[str, Group] | Iterable[Group]) – Child groups for this group, as a mapping or iterable.
attrs (Mapping[str, Any] | None) – Group-level attributes; defaults to schema.attrs.
name (str) – Short name of this group; "/" for the root.
parent (Group | None) – Parent group reference; None for the root.
- add_group(group)[source]#
Attach group as a child of this group.
group.parent is rewritten to self. Any prior parent the passed group held is silently dropped.
- Parameters:
group (Group) – The group to attach.
- Returns:
The same group, now attached.
- Raises:
ValueError – If a child with the same name already exists.
- Return type:
- add_variable(variable)[source]#
Register variable on this group.
Re-runs the per-dim size check across all of this group’s variables; a length disagreement on a shared dim raises.
- Raises:
ValueError – If a variable with the same name already exists, or if its shape disagrees with sizes already established by other variables on this group.
- Parameters:
variable (Variable)
- Return type:
None
- all_variables()[source]#
Return every variable in the tree keyed by absolute path.
Variables at the root keep their short name; nested variables are keyed by "<group>/<name>" (no leading slash) for direct passing back to Dataset.
- property attrs: dict[str, Any]#
Return this group’s attributes mapping (live, not a copy).
The returned dict is the group’s internal storage: mutating it modifies the group. Use update_attributes() for an explicit setter, or copy the result yourself (dict(group.attrs)) if you need a detached snapshot.
- property dimensions: dict[str, int]#
Return the dimension-name → size mapping declared on this group.
Sizes are derived from this group’s own variables; dimensions declared on ancestors but not used by any local variable are not listed. Use find_dimension() to resolve a dim name across the inheritance chain.
- find_dimension(name)[source]#
Search for a declared dimension by name, walking up the tree.
Returns the first matching Dimension from this group’s schema or any ancestor’s schema, or None if not found.
- find_group(name)[source]#
Search for a child group by short name in this subtree.
Returns the first match in depth-first order, or None if no group with that short name exists. See get_group() for the path-based variant.
- find_variable(name)[source]#
Search for a variable by short name in this group and its descendants.
Returns the first match in depth-first order — checking this group first, then the leftmost descendant, and so on. Returns None if no variable with that short name exists anywhere in the tree. Use get_variable() if you have an absolute path and want a hard failure on miss.
- property is_lazy: bool#
Return whether any variable in the tree wraps lazy data.
Walks descendants too — true iff any Variable anywhere in the tree (including in nested groups) reports Variable.is_lazy.
- long_name()[source]#
Return the absolute path of this group (e.g. "/data_01/ku").
The root group resolves to "/". Any group with parent is None is treated as a root regardless of its name.
- Return type:
- property nbytes: int#
Return the uncompressed byte size of the tree rooted here.
Sums Variable.nbytes over every variable in this group and every descendant group. Reports the in-memory footprint; ignores any compression or sharding the variables might carry on disk.
- update_attributes(**attributes)[source]#
Merge attributes into this group’s attrs (additive).
Existing keys are overwritten with the new values; keys not in attributes are left untouched. The mutation is done in place on the live attrs mapping.
- Parameters:
attributes (Any)
- Return type:
None
- property variables: Mapping[str, Variable]#
Return this group’s own variables (no recursion).
Use all_variables() for the recursive view that includes every descendant variable keyed by absolute path.
- class zcollection.Variable(schema, data)[source]#
A named array bound to a VariableSchema.
data may be a numpy.ndarray (eager), any object exposing a compute() method (dask-style lazy arrays), an arbitrary array-like (anything numpy.asarray() accepts), or None (declared but not yet populated). is_lazy is true iff the array isn’t a plain numpy.ndarray.
On construction the data is validated against the schema’s number of dimensions; dtype mismatches are not enforced (upcasts are accepted silently).
- Parameters:
schema (VariableSchema) – Variable schema describing dtype, dims and metadata.
data (Any) – Underlying array, or None for a placeholder variable.
- Raises:
ValueError – If data exposes an ndim attribute that disagrees with schema.ndim.
- property attrs: dict[str, Any]#
Return a fresh copy of the schema attributes.
The returned dict is detached: mutating it does not affect the underlying schema.
- property is_lazy: bool#
Return whether the underlying data isn’t a plain numpy.ndarray.
True for dask arrays, Zarr AsyncArray proxies, and anything else that isn’t a concrete in-memory numpy buffer; false when the data is already materialised. data is None also returns true (the placeholder is treated as not-yet-eager).
- property nbytes: int#
Return the uncompressed byte size of the underlying data.
Computed as prod(shape) * dtype.itemsize — the same convention as numpy.ndarray.nbytes. Ignores any compression or sharding the variable might carry on disk. Returns 0 for placeholder variables (data is None).
- schema#
The variable schema describing dtype, dims and metadata.
- property shape: tuple[int, ...]#
Return the shape of the underlying data.
Returns () when data is None or has no shape attribute (the empty tuple makes the variable look like a 0-D scalar to size-aware consumers).
- to_numpy()[source]#
Materialise the data as a numpy array.
Dispatches in three cases:
already a numpy.ndarray → returned as-is (no copy).
has a compute() method (dask-style lazy arrays) → call it and return the materialised result.
otherwise → numpy.asarray() on the data.
Calling this on a Variable with data is None produces numpy.array(None, dtype=object), which is rarely useful; guard against that case at the caller.
- Return type:
Codecs#
Codecs describe how a variable’s chunks travel from numpy memory to bytes on disk. Most users only need named profiles:
import zcollection as zc
stack = zc.codecs.profile("cloud-balanced")
names = zc.codecs.profile_names() # ('local-fast', 'cloud-balanced', ...)
- zcollection.codecs.profile(name=None, *, filters=None, compressor=None)[source]#
Build a CodecStack from a named profile, with overrides.
- Parameters:
name (str | None) – Profile name. None uses DEFAULT_PROFILE.
filters (Iterable[dict[str, Any]] | None) – Optional array-to-array codec descriptors that override the profile’s empty filter list.
compressor (dict[str, Any] | None) – Optional bytes-to-bytes codec descriptor that replaces the profile’s entire compressor pipeline with this single codec.
- Returns:
A CodecStack materialised from the profile, with the overrides applied.
- Raises:
KeyError – If name is not a registered profile.
- Return type:
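For example, swapping the compressor of a named profile; the zstd descriptor is an assumption about the Zarr v3 {"name": ..., "configuration": ...} dict it expects:
stack = zc.codecs.profile(
    "cloud-balanced",
    compressor={"name": "zstd", "configuration": {"level": 7}},
)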
- zcollection.codecs.auto_codecs(dtype, profile_name=None)[source]#
Pick a CodecStack for a variable using the named profile.
The result is currently dtype-agnostic — dtype is accepted on the public surface so future profiles can specialise (e.g. byte-grain filters for booleans, transposes for high-rank arrays) without an API break. For now it is ignored.
- Parameters:
dtype (dtype) – The variable dtype. Reserved for forward compatibility; does not affect the returned stack today.
profile_name (str | None) – Profile name. None uses DEFAULT_PROFILE.
- Returns:
A CodecStack materialised from the profile.
- Raises:
KeyError – If profile_name is not a registered profile.
- Return type:
- class zcollection.codecs.CodecStack(array_to_array=(), array_to_bytes=None, bytes_to_bytes=(), sharded=False, shard_target_bytes=None)[source]#
The persisted codec pipeline for one variable.
Each codec is held as a JSON-clean {"name": ..., "configuration": ...} dict so the schema round-trips through _zcollection.json without depending on Zarr v3 codec object identity. Materialisation into concrete zarr.codecs instances happens lazily in resolve_codec() at write time.
- array_to_array: tuple[dict[str, Any], ...]#
Array-to-array codecs (Zarr v3 filters), applied in order to the chunk’s numpy array before serialisation.
- array_to_bytes: dict[str, Any] | None#
The single array-to-bytes codec (Zarr v3 serializer) that turns array elements into a byte string for one chunk. The Zarr v3 spec requires exactly one codec in this slot.
- bytes_to_bytes: tuple[dict[str, Any], ...]#
Bytes-to-bytes codecs (Zarr v3 compressors), applied in order to the per-chunk byte string. With sharding on, they compress each inner chunk inside a shard; with sharding off, they compress each chunk directly.
- classmethod from_json(payload)[source]#
Build a codec stack from its JSON representation.
- Parameters:
- Return type:
- shard_target_bytes: int | None#
Target byte budget for each shard when sharded is True; None otherwise. The actual shard shape is picked by zcollection.codecs.sharding.shard_decision(), which honours this as a hint, not a hard cap.
- sharded: bool#
When True, the variable’s chunks are bundled into shards via zarr.codecs.ShardingCodec. The codecs in bytes_to_bytes then compress each inner chunk inside a shard. When False, each chunk is compressed directly and shards are not used.
- zcollection.codecs.DEFAULT_PROFILE#
Name of the profile used when no codecs= is given on a variable.
Stores#
Stores are the I/O backend. The factory selects an implementation from a
URL — that is what create_collection() and open_collection()
call internally — but you can also build one explicitly and pass it as
the path argument.
- zcollection.open_store(path, *, read_only=False, storage_options=None)[source]#
Open a Store given a URL or filesystem path.
Schemes:
file:// (default for bare paths) → LocalStore
memory:// → MemoryStore
s3://, gs://, az://, http(s):// → ObjectStore (obstore-backed, the only cloud path)
icechunk:// → IcechunkStore (transactional)
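Building the store explicitly is equivalent to passing the URL; the bucket path and the "region" key in storage_options are placeholders for whatever your object-store backend expects:
store = zc.open_store(
    "s3://my-bucket/collections/demo",
    storage_options={"region": "eu-west-1"},
)
col = zc.open_collection(store, mode="r")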
- class zcollection.Store[source]#
Capability surface for a backend that can hold a Zarr v3 hierarchy.
Implementations wrap a concrete zarr.storage Store and expose:
zarr_store() to hand the raw Zarr store to the io layer.
exists() / delete_prefix() / list_prefix() for the collection-level operations that don’t go through Zarr.
read_bytes() / write_bytes() for the small JSON config file.
- abstractmethod delete_prefix(prefix)[source]#
Recursively delete everything under prefix.
- Parameters:
prefix (str)
- Return type:
None
- session()[source]#
Yield a StoreSession for the duration of a write block.
- Return type:
Iterator[StoreSession]
- class zcollection.LocalStore(path, *, read_only=False)[source]#
File-system backed store rooted at path.
The optional zcollection.store.IcechunkStore is loaded lazily
when icechunk:// is opened or when you import it explicitly. It
unlocks transactional, multi-writer inserts.
Errors#
All exceptions raised by zcollection inherit from
ZCollectionError.
- exception zcollection.SchemaError[source]#
Raised when a schema is invalid or inconsistent with data.
- exception zcollection.StoreError[source]#
Raised by the store layer for I/O or transactional failures.
- exception zcollection.CollectionExistsError[source]#
Raised when create_collection targets a path that already exists.
Async API#
zcollection.aio mirrors the sync facade one-to-one. Every
function returns a coroutine, and the partition I/O methods on
Collection and view.View have _async counterparts
(drop_partitions, by design, does not). Use this module when you
are already inside an event loop.
Public async facade for zcollection v3.
Mirrors zcollection.api but every entry point is a coroutine. This
module is the recommended surface when running inside an event loop (e.g.
FastAPI handlers, Jupyter %await cells, or other async workloads).
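A minimal sketch inside an event loop; the URL is a placeholder, and query_async is assumed to take the same arguments as its sync sibling:
import zcollection.aio

async def monthly_mean(url: str) -> float:
    col = await zcollection.aio.open_collection(url, mode="r")
    ds = await col.query_async(
        filters="year == 2024 and month == 3", variables=["ssh"]
    )
    return float(ds["ssh"].to_numpy().mean()) if ds is not None else float("nan")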
- async zcollection.aio.create_collection(path, *, schema, axis, partitioning, catalog_enabled=False, overwrite=False)[source]#
Async create. Returns the same Collection as the sync API.
- Parameters:
schema (DatasetSchema)
axis (str)
partitioning (Partitioning)
catalog_enabled (bool)
overwrite (bool)
- Return type:
- async zcollection.aio.open_collection(path, *, mode='r')[source]#
Async open. Returns a Collection; the read_only flag follows mode.
- Parameters:
- Return type: