# Migrating to the v3 rewrite
zcollection releases use a date-based version scheme (YYYY.M.D).
The v1 line ends at 2024.2.0 and is built on Zarr v2; the
v3 rewrite starts at 2026.4.0 (alpha pre-releases tagged
2026.4.0aN) and is built on Zarr v3. The version number alone
doesn’t communicate the break — both look like routine date
releases — so the rewrite is referred to in the docs and changelogs
as the v3 line (after its on-disk Zarr format), and users on the
v1 line are expected to pin accordingly.
The v3 line is a clean break from v1: the on-disk format is now Zarr v3, the public API has been redesigned around an immutable schema, and the storage backend is chosen by URL scheme. There is no Zarr-v2 read path — v1 collections must be rewritten through the migration tool, not opened in place.
## Staying on v1
If you cannot upgrade yet:
- Pin `zcollection >= 2024.2.0, < 2026.4.0` in your requirements (or the
  equivalent `zcollection ~= 2024.2.0` if you want to stay locked to the
  `2024.2.x` patch line).
- The v1 source lives on the `support/v1` branch (Zarr/CPython convention:
  `support/<line>` for maintenance branches). Critical fixes are backported
  there as `2024.2.x` patch tags; no new features land.
- The v1 documentation is published on Read the Docs as the `support-v1`
  (or `2024.2.0`) version; pick it from the version selector in the docs
  sidebar. The default landing page (`stable`) tracks the v3 line.
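As a concrete starting point, a requirements file pinned to the v1 line looks like this; the upper bound keeps the first v3 pre-release out:

```text
# requirements.txt -- stay on the v1 line
zcollection >= 2024.2.0, < 2026.4.0
```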
## What changed at a glance
The bullets below contrast the v1 line (last release 2024.2.0)
with the v3 line (2026.4.0 and later).
- **Storage format:** Zarr v3 (sharded, async-friendly) instead of Zarr v2. The
  `.zarray`/`.zgroup`/`.zattrs`/`.zmetadata` files are gone; each partition is a
  single Zarr v3 group with a `zarr.json`.
- **Schema is explicit:** build a `Schema` once and pass it to
  `create_collection()`. No more inferring metadata from the first dataset.
- **Stores by URL scheme:** `file://`, `memory://`, `s3://`, and `icechunk://`
  are dispatched to the right backend by `open_store()`.
- **Async core, sync facade:** the implementation is async; the sync API in
  `zcollection` is a thin wrapper. A `zcollection.aio` mirror is published as
  stable.
- **Transactions:** `IcechunkStore` makes `insert`/`drop_partitions` atomic.
  Without Icechunk, multi-process writers are not supported.
- **Partition catalog:** the optional `_catalog/` group lists partitions in
  O(1) and replaces the per-partition `LIST` walk on cloud stores.
- **Hierarchical datasets:** a `Dataset` is now a root `Group` and can declare
  nested child groups (e.g. `/data_01/ku/...`). Each group becomes a real Zarr
  v3 subgroup on disk; nested groups round-trip through partition I/O. The
  schema gains `GroupSchema` and a `group=` keyword on the builder methods.
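The URL-scheme dispatch can be pictured with a small stand-alone sketch. The backend names and the `pick_backend` helper below are illustrative only, not the actual `open_store()` implementation:

```python
from urllib.parse import urlsplit

# Hypothetical backend table: the URL scheme selects the store class,
# the rest of the URL is the root location handed to that backend.
BACKENDS = {
    'file': 'LocalStore',
    'memory': 'MemoryStore',
    's3': 'S3Store',
    'icechunk': 'IcechunkStore',
}


def pick_backend(url: str) -> tuple[str, str]:
    """Split a store URL into (backend name, root path) -- sketch only."""
    parts = urlsplit(url)
    try:
        backend = BACKENDS[parts.scheme]
    except KeyError:
        raise ValueError(f'unsupported scheme: {parts.scheme!r}') from None
    return backend, parts.netloc + parts.path
```

The point of the design is that the same `create_collection()` call works against a local directory, an in-memory store, or an object store, with only the URL changing.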
## Entry points
| v1 (≤ 2024.2.0) | v3 (2026.4.0+) |
|---|---|
## Removed
The following v1 features have no equivalent in the v3 line:
- `synchronizer=`: replaced by per-partition `asyncio.Lock`, Dask DAG
  topology, and (for cross-partition atomicity) `IcechunkStore`.
- `delayed=` flag at the dataset level: each `Variable` is polymorphic on its
  data (numpy / dask / Zarr `AsyncArray`).
- `distributed=` and `filesystem=` keyword arguments: pass a Dask client
  directly when you want one; pass a URL or pre-opened `Store` for the
  backend.
- `partition_base_dir` positional: encode it in the URL.
- `update_deprecated_collection`: moved out of the package; use the
  `zcollection-migrate` tool to rewrite a v1 layout into the v3 on-disk
  format.
- `Dataset.load` / `Dataset.add_variable` / `Dataset.drop_variable` /
  `Dataset.metadata`: schemas are immutable; build a new `Schema` instead.
- The `-1` chunk sentinel: use `chunks=None` for unknown.
- The `filler: bool` flag on variables: use `VariableRole`.
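The chunk-sentinel change is mechanical when porting call sites; a tiny shim (hypothetical, not part of zcollection) shows the mapping:

```python
def translate_chunks(v1_chunks):
    """Map the removed v1 ``-1`` chunk sentinel to the v3 convention.

    Illustrative helper only: in the v3 line, "unknown" is spelled
    ``chunks=None`` rather than ``chunks=-1``; concrete sizes pass through.
    """
    return None if v1_chunks == -1 else v1_chunks
```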
## Renamed
- `zcollection.merging` → `zcollection.collection.merge`
- `merge_callable=` keyword → `merge=`
- `zcollection.dataset` → `zcollection.data`
- `zcollection.variable.array` / `delayed_array` → single
  `zcollection.Variable`
- `zcollection.fs_utils` → `zcollection.store`
- `zcollection.dask_utils` → `zcollection.dask`
- `zcollection.expression` → `zcollection.partitioning.expression`
- `zcollection.meta` → `zcollection.schema`
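The module renames above are mechanical enough to script. The helper below is an illustrative migration aid (not shipped with zcollection) that rewrites the renamed module paths in source text, longest key first so a longer path such as `zcollection.dataset` is never clobbered by a shorter one:

```python
import re

# Rename table for the v1 -> v3 module moves (keyword renames such as
# merge_callable= are context-dependent and left to manual review).
RENAMES = {
    'zcollection.merging': 'zcollection.collection.merge',
    'zcollection.dataset': 'zcollection.data',
    'zcollection.fs_utils': 'zcollection.store',
    'zcollection.dask_utils': 'zcollection.dask',
    'zcollection.expression': 'zcollection.partitioning.expression',
    'zcollection.meta': 'zcollection.schema',
}


def rewrite_imports(source: str) -> str:
    """Rewrite renamed v1 module paths to their v3 names in ``source``."""
    for old, new in sorted(RENAMES.items(), key=lambda kv: len(kv[0]),
                           reverse=True):
        # \b guards keep e.g. `zcollection.metadata` from matching
        # the `zcollection.meta` key.
        source = re.sub(rf'\b{re.escape(old)}\b', new, source)
    return source
```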
## Code translation
### Creating a collection
Before:
```python
import fsspec
import zcollection
from zcollection.partitioning import Date

fs = fsspec.filesystem('file')
col = zcollection.create_collection(
    'time', ds, Date(('time',), 'M'),
    '/data/altimetry', filesystem=fs,
)
```
After:
```python
import zcollection as zc
from zcollection.partitioning import Date

schema = (
    zc.Schema()
    .with_dimension('time', chunks=4096)
    .with_dimension('x_ac', size=240, chunks=240)
    .with_variable('time', dtype='int64', dimensions=('time',))
    .with_variable('ssh', dtype='float32',
                   dimensions=('time', 'x_ac'))
    .build()
)
col = zc.create_collection(
    'file:///data/altimetry',
    schema=schema, axis='time',
    partitioning=Date(('time',), resolution='M'),
)
col.insert(ds)
```
### Views
Before:
```python
from zcollection.view import ViewReference

view = zcollection.create_view(
    '/data/view', ViewReference('/data/altimetry'),
    ds, filesystem=fs,
)
```
After:
```python
import zcollection as zc
from zcollection.view import View, ViewReference

view_var = zc.VariableSchema(
    name='var2', dtype='float32',
    dimensions=('time', 'x_ac'),
)
view = View.create(
    zc.open_store('file:///data/view'),
    base=base,  # the opened collection the view derives from
    variables=[view_var],
    reference=ViewReference(uri='file:///data/altimetry'),
)
```
### Indexing
The v1 `Indexer` was an abstract class with `_create` / `_update` hooks. In
the v3 line, `Indexer` is a single concrete Parquet-backed class that takes a
builder callback:
```python
from zcollection.indexing import Indexer


def builder(ds):
    # Return a structured numpy array with at least
    # `_start`, `_stop`, and the key columns.
    ...


indexer = Indexer.build(collection, builder=builder)
indexer.write('/data/index.parquet')
ranges = indexer.lookup(pass_number=[1, 2])
```
### Atomic writes
To get crash-safe inserts, open the collection through `IcechunkStore`:
```python
col = zc.create_collection(
    'icechunk:///data/altimetry',
    schema=schema, axis='time',
    partitioning=Date(('time',), 'M'),
)
```
A failed insert is rolled back to the prior commit; no partial state
is ever visible after reopen.
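The rollback guarantee can be illustrated with a toy model (this is the semantics, not the Icechunk API): writes are staged against a copy of the last commit, and the commit pointer only moves when the whole insert succeeds.

```python
class ToyTransactionalStore:
    """Toy illustration of commit/rollback semantics; not the real store."""

    def __init__(self):
        self.committed = {}  # last durable commit: partition -> payload

    def insert(self, updates, *, fail=False):
        staged = dict(self.committed)  # stage against a copy of the commit
        staged.update(updates)
        if fail:  # simulate a crash mid-insert
            raise RuntimeError('insert failed')
        self.committed = staged  # atomic "commit pointer" swap


store = ToyTransactionalStore()
store.insert({'year=2024/month=03': 'ssh'})
try:
    store.insert({'year=2024/month=04': 'ssh'}, fail=True)
except RuntimeError:
    pass
# The failed insert left no partial state: only month=03 is visible.
```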
## Layout on disk
Old (Zarr v2):
```text
collection/
├── year=2024/month=03/
│   ├── time/{0.0,.zarray,.zattrs}
│   ├── ssh/{0.0,.zarray,.zattrs}
│   └── {.zattrs,.zgroup,.zmetadata}
└── .zcollection
```
New (Zarr v3):
```text
collection/
├── zarr.json
├── _zcollection.json
├── _catalog/{zarr.json,c/0}      # optional partition index
├── _immutable/{zarr.json,...}    # non-partitioned variables
└── year=2024/month=03/
    ├── zarr.json
    ├── time/{zarr.json,c/0}
    └── ssh/{zarr.json,c/0/0}
```
When the schema declares nested groups (see Hierarchical Groups), each group materialises as a real Zarr v3 subgroup inside every partition:
```text
collection/
└── year=2024/month=03/
    ├── zarr.json
    ├── time/{zarr.json,c/0}
    └── data_01/
        ├── zarr.json
        └── ku/
            ├── zarr.json
            └── power/{zarr.json,c/0/0}
```