ZCollection
This project is a Python library for manipulating data split into a
collection of groups stored in Zarr v3 format.
A collection divides a dataset into partitions to make incremental
acquisitions or per-product updates cheap. Built-in partitionings are
by date, by sequence, and by grouped sequences.
A collection partitioned by date with a monthly resolution looks like this on disk:
collection/
├── zarr.json
├── _zcollection.json
├── _catalog/              # optional partition index
│   ├── zarr.json
│   └── c/0
├── _immutable/            # non-partitioned variables
│   └── zarr.json
└── year=2024/
    └── month=03/
        ├── zarr.json
        ├── time/
        │   ├── zarr.json
        │   └── c/0
        └── ssh/
            ├── zarr.json
            └── c/0/0
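The `year=2024/month=03` directory names in the layout can be derived from the timestamps of the data. The sketch below is plain Python and does not use the library; `monthly_partition_path` and `group_by_partition` are hypothetical helper names chosen for illustration, and the exact formatting rules are assumed from the tree shown here.

```python
from datetime import date


def monthly_partition_path(day: date) -> str:
    """Return the relative directory of the monthly partition holding ``day``."""
    # Zero-pad the month so that lexical order matches chronological order.
    return f"year={day.year}/month={day.month:02d}"


def group_by_partition(days: list[date]) -> dict[str, list[date]]:
    """Bucket dates by the partition directory they would land in."""
    buckets: dict[str, list[date]] = {}
    for day in days:
        buckets.setdefault(monthly_partition_path(day), []).append(day)
    return buckets


samples = [date(2024, 3, 1), date(2024, 3, 17), date(2024, 4, 2)]
partitions = group_by_partition(samples)
```

Because each month maps to its own directory, new data for April only touches `year=2024/month=04` and leaves the March partition untouched.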
Hierarchical datasets
A Dataset is a root Group:
it owns variables and attributes directly and may contain nested child
groups, mirroring the native Zarr v3 group hierarchy. Groups are useful
to organise variables that share a logical sub-domain — e.g. SWOT
/data_01/ku/... — while keeping a single, partitioned collection.
Each child group is a real Zarr group on disk; variables placed inside
nested groups round-trip transparently. See
Hierarchical Groups for a worked example.
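As a rough mental model of the hierarchy, a group can be pictured as a node that holds variables and child groups. The sketch below is plain Python, not the library's API: `GroupSketch`, its methods, and the variable name `ssh` under `data_01/ku` are all hypothetical and only illustrate how nested paths arise.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GroupSketch:
    """Hypothetical stand-in for a group: variables plus child groups."""

    name: str
    variables: dict[str, list[float]] = field(default_factory=dict)
    children: dict[str, GroupSketch] = field(default_factory=dict)

    def child(self, name: str) -> GroupSketch:
        """Return the child group called ``name``, creating it if needed."""
        return self.children.setdefault(name, GroupSketch(name))

    def walk(self, prefix: str = "") -> list[str]:
        """List the slash-separated path of every variable in the tree."""
        paths = [f"{prefix}/{var}" for var in self.variables]
        for child in self.children.values():
            paths.extend(child.walk(f"{prefix}/{child.name}"))
        return paths


root = GroupSketch("")
root.variables["time"] = [0.0, 1.0]  # variable owned by the root group
root.child("data_01").child("ku").variables["ssh"] = [0.1, 0.2]  # nested variable
variable_paths = root.walk()
```

Here `variable_paths` contains one root-level path and one path below `/data_01/ku`, which is the shape of tree that a nested Zarr group hierarchy stores on disk.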
Inserts can either overwrite existing partitions or merge with them
through pluggable strategies.
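The difference between overwriting and merging can be sketched with a strategy passed as a callable. This is an illustration in plain Python under stated assumptions: the row layout, the function names, and the time-span merge rule are hypothetical and are not taken from the library.

```python
from typing import Callable

Row = tuple[int, float]  # (time index, value); a hypothetical row layout
Strategy = Callable[[list[Row], list[Row]], list[Row]]


def overwrite(existing: list[Row], inserted: list[Row]) -> list[Row]:
    """Discard the existing partition and keep only the inserted rows."""
    return sorted(inserted)


def merge_time_series(existing: list[Row], inserted: list[Row]) -> list[Row]:
    """Keep existing rows outside the inserted time span, then add the new rows."""
    if not inserted:
        return sorted(existing)
    start = min(t for t, _ in inserted)
    stop = max(t for t, _ in inserted)
    kept = [row for row in existing if not start <= row[0] <= stop]
    return sorted(kept + inserted)


def insert(existing: list[Row], inserted: list[Row],
           strategy: Strategy = overwrite) -> list[Row]:
    """Apply the chosen strategy to produce the new partition content."""
    return strategy(existing, inserted)


on_disk = [(1, 10.0), (2, 20.0), (3, 30.0)]
update = [(3, 31.0), (4, 40.0)]
replaced = insert(on_disk, update)
merged = insert(on_disk, update, strategy=merge_time_series)
```

With the default strategy only the two new rows survive, whereas the merge keeps the rows at times 1 and 2 and replaces the overlapping row at time 3.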
Storage backends are selected by URL scheme:
- file:// (local filesystem)
- memory:// (in-process; tests, prototyping)
- icechunk:// (transactional Zarr v3 via Icechunk)
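Dispatching on the URL scheme can be sketched with the standard library alone. The function `select_backend` and the `BACKENDS` table are hypothetical names, and the example URLs (everything after the scheme) are assumptions; only the three schemes themselves come from the list above.

```python
from urllib.parse import urlsplit

# Scheme -> short description, mirroring the three documented schemes.
BACKENDS = {
    "file": "local filesystem",
    "memory": "in-process",
    "icechunk": "transactional Zarr v3 via Icechunk",
}


def select_backend(url: str) -> tuple[str, str]:
    """Return (scheme, location) for a supported storage URL."""
    parts = urlsplit(url)
    if parts.scheme not in BACKENDS:
        raise ValueError(f"unsupported storage scheme: {parts.scheme!r}")
    # The location is whatever follows "scheme://" (host part plus path).
    return parts.scheme, parts.netloc + parts.path


local = select_backend("file:///tmp/collection")
scratch = select_backend("memory://scratch")
```

An unknown scheme such as `s3://` raises `ValueError` in this sketch rather than silently falling back to a default backend.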
Dask is used to scale operations over partitions.
The implementation is async; the sync API is a thin wrapper, and a
zcollection.aio mirror is published for async callers.
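The "async core, thin sync wrapper" pattern can be sketched with `asyncio` alone. Both function names below are hypothetical and do not correspond to functions in `zcollection` or `zcollection.aio`; the sleep merely stands in for storage I/O.

```python
import asyncio


async def open_partition_async(name: str) -> dict[str, str]:
    """Hypothetical async core: pretend to read partition metadata."""
    await asyncio.sleep(0)  # stands in for non-blocking storage I/O
    return {"partition": name, "status": "open"}


def open_partition(name: str) -> dict[str, str]:
    """Thin sync wrapper: run the coroutine to completion and return its result."""
    return asyncio.run(open_partition_async(name))


result = open_partition("year=2024/month=03")
```

Note that `asyncio.run` cannot be called from a thread that already runs an event loop, which is one reason async callers would normally await the coroutine directly instead of going through a sync wrapper.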
Views layered on top of a read-only base collection let you add or recompute variables without touching the base.
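The layering idea can be illustrated with standard-library mappings: a read-only base plus a writable overlay that receives every new variable. This is only an analogy under stated assumptions, not how the library implements views; the variable names `ssh` and `ssh_cm` and the unit conversion are hypothetical.

```python
from collections import ChainMap
from types import MappingProxyType

# Read-only base: MappingProxyType rejects item assignment.
base = MappingProxyType({"ssh": [0.1, 0.2, 0.3]})

# The view keeps its own variables in a separate, writable layer.
overlay: dict[str, list[float]] = {}
view = ChainMap(overlay, base)

# Add a variable derived from the base; ChainMap writes go to the first layer.
view["ssh_cm"] = [round(value * 100.0, 6) for value in view["ssh"]]
```

Lookups through `view` see both variables, while the new one lives only in `overlay` and the base mapping is never modified.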