
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/ex_indexing.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_ex_indexing.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_ex_indexing.py:


Indexing a Collection
=====================

A secondary index lets you find which partitions and row slices satisfy a
key-based query without scanning the whole collection. A v3 index is a
single Parquet table whose rows have the form
``(<key cols...>, _partition, _start, _stop)``; it is built by walking the
collection with
:py:meth:`Indexer.build<zcollection.indexing.Indexer.build>`.

Run with::

    python examples/ex_indexing.py

.. GENERATED FROM PYTHON SOURCE LINES 15-28

.. code-block:: Python


    from collections.abc import Iterator
    import itertools
    from pathlib import Path
    import shutil
    import tempfile

    import numpy

    import zcollection as zc
    from zcollection.indexing import Indexer









.. GENERATED FROM PYTHON SOURCE LINES 29-34

Build a half-orbit dataset
--------------------------

Each row carries ``cycle_number`` and ``pass_number``. A "half-orbit" is a
contiguous run of rows sharing the same (cycle, pass) pair.
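
The row layout is easier to see at a small scale. The following is a minimal
sketch using only NumPy; the sizes (2 cycles, 3 passes, 2 rows per pass) are
illustrative values chosen for readability, not the ones used in this example.

.. code-block:: Python

    import numpy

    # Illustrative sizes only (assumed for this sketch).
    n_cycles, n_passes, rows_per_pass = 2, 3, 2

    # Each cycle value is repeated for every row of every pass in that cycle.
    cycles = numpy.repeat(numpy.arange(n_cycles), n_passes * rows_per_pass)
    # Each pass value is repeated per row, and the pattern restarts per cycle.
    passes = numpy.tile(
        numpy.repeat(numpy.arange(n_passes), rows_per_pass), n_cycles
    )

    # Rows sharing one (cycle, pass) pair are adjacent: one half-orbit each.
    pairs = list(zip(cycles.tolist(), passes.tolist()))
    print(pairs[:4])  # [(0, 0), (0, 0), (0, 1), (0, 1)]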

.. GENERATED FROM PYTHON SOURCE LINES 34-80

.. code-block:: Python

    root = Path(tempfile.gettempdir()) / "zc-ex-indexing"
    if root.exists():
        shutil.rmtree(root)
    base_path = root / "collection"
    index_path = root / "index.parquet"

    n_cycles, n_passes, rows_per_pass = 5, 20, 10
    total = n_cycles * n_passes * rows_per_pass

    cycles = numpy.repeat(
        numpy.arange(n_cycles, dtype="uint16"), n_passes * rows_per_pass
    )
    passes = numpy.tile(
        numpy.repeat(numpy.arange(n_passes, dtype="uint16"), rows_per_pass),
        n_cycles,
    )
    times = numpy.arange(total, dtype="int64")

    schema = (
        zc.Schema()
        .with_dimension("time", chunks=rows_per_pass * n_passes)
        .with_variable("time", dtype="int64", dimensions=("time",))
        .with_variable("cycle_number", dtype="uint16", dimensions=("time",))
        .with_variable("pass_number", dtype="uint16", dimensions=("time",))
        .build()
    )
    ds = zc.Dataset(
        schema=schema,
        variables={
            "time": zc.Variable(schema.variables["time"], times),
            "cycle_number": zc.Variable(schema.variables["cycle_number"], cycles),
            "pass_number": zc.Variable(schema.variables["pass_number"], passes),
        },
    )

    # Partition the data by cycle so each cycle is one partition.
    collection = zc.create_collection(
        f"file://{base_path}",
        schema=schema,
        axis="time",
        partitioning=zc.partitioning.Sequence(("cycle_number",), dimension="time"),
    )
    collection.insert(ds)
    print(f"collection: {len(list(collection.partitions()))} partitions")






.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    collection: 5 partitions




.. GENERATED FROM PYTHON SOURCE LINES 81-88

Build the index
---------------

The builder takes one partition's :class:`~zcollection.Dataset` and
returns a structured NumPy array with the key columns plus integer
``_start`` / ``_stop`` columns delimiting each contiguous run; in the
builder used here, ``_stop`` is exclusive. The Indexer concatenates those
rows over every partition.
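
Run detection with a composite key can be checked by hand on a tiny input.
This is a minimal sketch in plain NumPy with made-up values. It shows why the
two keys are combined: the pass number stays at 4 across the cycle change, so
looking at the pass number alone would miss that boundary.

.. code-block:: Python

    import numpy

    # Made-up values: the pass number does not change when the cycle does.
    cycle = numpy.array([0, 0, 0, 0, 1, 1], dtype="uint16")
    pass_ = numpy.array([3, 3, 4, 4, 4, 4], dtype="uint16")

    # Pack both 16-bit keys into one int64 so a single diff sees every change.
    composite = (cycle.astype(numpy.int64) << 16) | pass_.astype(numpy.int64)

    # Positions where the value differs from its predecessor start a new run.
    change = numpy.where(numpy.diff(composite) != 0)[0] + 1
    edges = numpy.concatenate([[0], change, [composite.size]])

    # Consecutive edges form half-open (start, stop) ranges.
    runs = list(zip(edges[:-1].tolist(), edges[1:].tolist()))
    print(runs)  # [(0, 2), (2, 4), (4, 6)]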

.. GENERATED FROM PYTHON SOURCE LINES 88-124

.. code-block:: Python

    def split_runs(values: numpy.ndarray) -> Iterator[tuple[int, int]]:
        """Yield (start, stop) for each contiguous run of identical values."""
        if values.size == 0:
            return
        edges = numpy.concatenate(
            [[0], numpy.where(numpy.diff(values) != 0)[0] + 1, [values.size]]
        )
        yield from itertools.pairwise(edges.tolist())


    def half_orbit_rows(ds: zc.Dataset) -> numpy.ndarray:
        """Return one row per contiguous (cycle, pass) run, i.e. per half-orbit."""
        cycle = ds["cycle_number"].to_numpy()
        pass_ = ds["pass_number"].to_numpy()
        # Combine cycle+pass into one composite key to find run boundaries.
        # pass_number is uint16, so a 16-bit shift keeps the fields separate.
        composite = (cycle.astype(numpy.int64) << 16) | pass_.astype(numpy.int64)
        rows = [
            (start, stop, int(cycle[start]), int(pass_[start]))
            for start, stop in split_runs(composite)
        ]
        return numpy.array(
            rows,
            dtype=[
                ("_start", "int64"),
                ("_stop", "int64"),
                ("cycle_number", "uint16"),
                ("pass_number", "uint16"),
            ],
        )


    indexer = Indexer.build(collection, builder=half_orbit_rows)
    print(f"index rows: {len(indexer)}, columns: {indexer.key_columns}")
    indexer.write(str(index_path))






.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    index rows: 100, columns: ('cycle_number', 'pass_number')




.. GENERATED FROM PYTHON SOURCE LINES 125-131

Query the index
---------------

:py:meth:`Indexer.lookup<zcollection.indexing.Indexer.lookup>` takes key
columns as keyword arguments, each given as a scalar (equality) or a list
(set membership). It returns
``{partition_path: [(start, stop), ...]}``, ready to drive sliced reads.
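
To make the shape of the result concrete, the lookup can be emulated on a
small in-memory table. This is only a sketch of the idea, not the library's
implementation: ``lookup_sketch`` is a hypothetical helper, and the table
contents are made up to match the column layout described at the top of this
page.

.. code-block:: Python

    import numpy

    # Made-up index rows: key columns, then _partition, _start, _stop.
    index = numpy.array(
        [
            (0, 1, "cycle_number=0", 10, 20),
            (0, 2, "cycle_number=0", 20, 30),
            (0, 3, "cycle_number=0", 30, 40),
            (1, 1, "cycle_number=1", 10, 20),
        ],
        dtype=[
            ("cycle_number", "uint16"),
            ("pass_number", "uint16"),
            ("_partition", "U32"),
            ("_start", "int64"),
            ("_stop", "int64"),
        ],
    )


    def lookup_sketch(table, **predicates):
        """Hypothetical stand-in: filter rows, then group ranges by partition."""
        mask = numpy.ones(table.shape[0], dtype=bool)
        for column, wanted in predicates.items():
            # A scalar becomes a one-element set, so both forms share one path.
            mask &= numpy.isin(table[column], numpy.atleast_1d(wanted))
        result = {}
        for row in table[mask]:
            result.setdefault(str(row["_partition"]), []).append(
                (int(row["_start"]), int(row["_stop"]))
            )
        return result


    ranges = lookup_sketch(index, pass_number=[1, 2])
    single = lookup_sketch(index, cycle_number=1, pass_number=1)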

.. GENERATED FROM PYTHON SOURCE LINES 131-135

.. code-block:: Python

    ranges = indexer.lookup(pass_number=[1, 2])
    for path, slices in ranges.items():
        print(f" * {path}: {len(slices)} matching ranges")





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

     * cycle_number=0: 2 matching ranges
     * cycle_number=1: 2 matching ranges
     * cycle_number=2: 2 matching ranges
     * cycle_number=3: 2 matching ranges
     * cycle_number=4: 2 matching ranges




.. GENERATED FROM PYTHON SOURCE LINES 136-138

Round-trip the index from disk
------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 138-141

.. code-block:: Python

    reloaded = Indexer.read(str(index_path))
    assert len(reloaded) == len(indexer)
    print("indexer round-trip: OK")




.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    indexer round-trip: OK





.. _sphx_glr_download_auto_examples_ex_indexing.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: ex_indexing.ipynb <ex_indexing.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: ex_indexing.py <ex_indexing.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: ex_indexing.zip <ex_indexing.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
