ZCollection

This project is a Python library for manipulating data split into a collection of groups stored in Zarr v3 format.

A collection divides a dataset into partitions to make incremental acquisitions or per-product updates cheap. Built-in partitionings are by date, by sequence, and by grouped sequences.

A collection partitioned by date with a monthly resolution looks like this on disk:

collection/
├── zarr.json
├── _zcollection.json
├── _catalog/                       # optional partition index
│   ├── zarr.json
│   └── c/0
├── _immutable/                     # non-partitioned variables
│   └── zarr.json
└── year=2024/
    └── month=03/
        ├── zarr.json
        ├── time/
        │   ├── zarr.json
        │   └── c/0
        └── ssh/
            ├── zarr.json
            └── c/0/0
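
The directory names in the layout above follow a key=value pattern. As a minimal sketch of how a timestamp could be mapped to such a path, the snippet below uses only the standard library. The function partition_path and its resolution argument are hypothetical names chosen for illustration; they are not taken from the library's API.

```python
from datetime import datetime


def partition_path(timestamp: datetime, resolution: str = "month") -> str:
    """Return a key=value relative path for the partition holding `timestamp`.

    Hypothetical helper for illustration only; not part of the library.
    """
    if resolution not in ("year", "month", "day"):
        raise ValueError(f"unsupported resolution: {resolution!r}")
    # The year is always present; finer components are appended on demand.
    parts = [f"year={timestamp.year:04d}"]
    if resolution in ("month", "day"):
        parts.append(f"month={timestamp.month:02d}")
    if resolution == "day":
        parts.append(f"day={timestamp.day:02d}")
    return "/".join(parts)


# A sample acquired in March 2024 lands in the partition shown in the tree.
example = partition_path(datetime(2024, 3, 15, 12, 30))
```

With a monthly resolution, every sample from the same calendar month resolves to the same directory, which is why an incremental acquisition only needs to touch the partitions it overlaps.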

Hierarchical datasets

A Dataset is a root Group: it owns variables and attributes directly and may contain nested child groups, mirroring the native Zarr v3 group hierarchy. Groups are useful to organise variables that share a logical sub-domain — e.g. SWOT /data_01/ku/... — while keeping a single, partitioned collection. Each child group is a real Zarr group on disk; variables placed inside nested groups round-trip transparently. See Hierarchical Groups for a worked example.

Inserts can either overwrite existing partitions or merge with them through pluggable strategies.
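
The difference between overwriting and merging can be sketched with plain Python containers. Everything below is an assumption made for illustration: a partition is modelled as a list of (time, value) rows, and the names overwrite, merge_by_time, and insert are hypothetical rather than the library's own strategy interface.

```python
from typing import Callable

Row = tuple[int, float]
# A strategy receives the existing rows and the inserted rows of one
# partition and returns the rows to store. Hypothetical signature.
Strategy = Callable[[list[Row], list[Row]], list[Row]]


def overwrite(existing: list[Row], inserted: list[Row]) -> list[Row]:
    """Discard the existing partition content and keep only the new rows."""
    return list(inserted)


def merge_by_time(existing: list[Row], inserted: list[Row]) -> list[Row]:
    """Union of both row sets keyed by time; inserted rows win on conflict."""
    merged = dict(existing)
    merged.update(inserted)
    return sorted(merged.items())


def insert(store: dict[str, list[Row]], key: str, rows: list[Row],
           strategy: Strategy = overwrite) -> None:
    """Write `rows` into partition `key` using the chosen strategy."""
    store[key] = strategy(store.get(key, []), rows)


store = {"year=2024/month=03": [(1, 1.0), (2, 2.0)]}
insert(store, "year=2024/month=03", [(2, 20.0), (3, 3.0)], merge_by_time)
```

Passing the strategy as a callable keeps the insert path unchanged when a new merge policy is needed.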

Storage backends are selected by URL scheme:

  • file:// — local filesystem

  • memory:// — in-process (tests, prototyping)

  • s3:// — object storage via obstore or fsspec

  • icechunk:// — transactional Zarr v3 via Icechunk
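
Selecting a backend from the URL scheme can be sketched with the standard library's urlsplit. The BACKENDS table and the describe_backend function are hypothetical and only mirror the list above; the library's actual dispatch mechanism may differ.

```python
from urllib.parse import urlsplit

# Scheme-to-backend table mirroring the list above (illustrative only).
BACKENDS = {
    "file": "local filesystem",
    "memory": "in-process store",
    "s3": "object storage",
    "icechunk": "transactional Zarr v3 store",
}


def describe_backend(url: str) -> tuple[str, str]:
    """Return (backend description, location) for a collection URL."""
    parts = urlsplit(url)
    if parts.scheme not in BACKENDS:
        raise ValueError(f"unsupported URL scheme: {parts.scheme!r}")
    # For s3://bucket/key the bucket is in netloc and the key is in path;
    # for file:///tmp/x the netloc is empty and the path is absolute.
    location = parts.netloc + parts.path
    return BACKENDS[parts.scheme], location


backend, location = describe_backend("s3://bucket/path/collection")
```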

Dask is used to scale operations over partitions. The implementation is async; the sync API is a thin wrapper, and a zcollection.aio mirror is published for async callers.
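
The general pattern of an async core with a thin sync wrapper can be sketched with asyncio alone. The functions open_partition_async, open_partition, and open_many_async are hypothetical stand-ins; they do not reproduce the library's functions or the contents of zcollection.aio.

```python
import asyncio


async def open_partition_async(key: str) -> dict:
    """Async core: a stand-in for non-blocking partition I/O."""
    await asyncio.sleep(0)  # yields to the event loop, as real I/O would
    return {"key": key, "rows": 3}


def open_partition(key: str) -> dict:
    """Sync wrapper: runs the async core to completion on a fresh loop.

    asyncio.run cannot be called from inside a running event loop, so
    async callers should await the coroutine directly instead.
    """
    return asyncio.run(open_partition_async(key))


async def open_many_async(keys: list[str]) -> list[dict]:
    """Async callers can overlap several partition reads with gather."""
    return await asyncio.gather(*(open_partition_async(k) for k in keys))


single = open_partition("year=2024/month=03")
```

Keeping the logic in the coroutine means the sync and async entry points cannot drift apart, since the wrapper adds nothing but the event loop.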

Views layered on top of a read-only base collection let you add or recompute variables without touching the base.
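
The layering idea can be sketched with a ChainMap placed over a read-only mapping. This is a conceptual model only: variables are plain lists, and the names base, overlay, view, and ssh_cm are hypothetical. It says nothing about how the library stores or resolves view variables.

```python
from collections import ChainMap
from types import MappingProxyType

# The base collection is exposed read-only; writes to it raise TypeError.
base = MappingProxyType({"time": [0, 1, 2], "ssh": [0.5, 0.25, 1.0]})

# Variables added through the view are stored in a separate overlay.
overlay: dict[str, list[float]] = {}

# Lookups search the overlay first and fall back to the base;
# assignments always go to the first mapping, i.e. the overlay.
view = ChainMap(overlay, base)

# Derive a new variable (metres to centimetres) without touching the base.
view["ssh_cm"] = [value * 100 for value in view["ssh"]]
```

Because assignments never reach the second mapping, the base stays untouched while readers of the view see both the original and the derived variables.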
