damacy

High-speed streamed assembly of tensors from zarr sources to GPU.

damacy reads sharded NGFF zarr stores straight onto the GPU: per-shard chunk indexing, parallel host I/O, in-flight GPU-side decompression (zstd, blosc1-{lz4,zstd}), and a typed assemble kernel that lands each batch as a DLPack-ready device tensor.


Quick start

import damacy
import torch

cfg = damacy.Config(
    batch_size=2,
    host_buffer_bytes=1 << 30,
    device_buffer_bytes=1 << 30,
    dtype="bf16",
)
samples = [
    damacy.Sample(uri="/data/cells/cell-1.zarr", aabb=[(0, 64), (0, 256), (0, 256)]),
    damacy.Sample(uri="/data/cells/cell-2.zarr", aabb=[(0, 64), (0, 256), (0, 256)]),
]

with damacy.Pipeline(cfg) as d:
    d.push(samples)
    for batch in d.batches(len(samples) // cfg.batch_size):
        with batch as t:
            x = torch.from_dlpack(t)
            ...  # train step

By default the pipeline captures whatever CUDA context is current on the calling thread; PyTorch sets one up implicitly, and bare-Python users can call damacy._native.cuda_init_primary() once. For multi-GPU setups, see Distributed for the device binding model and a torchrun example.
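The quick start drains an integer number of batches computed by floor division, which drops any trailing partial batch when the sample count is not a multiple of `batch_size`. A small ceil-division helper (plain Python, not part of the damacy API) makes the intended cover count explicit; whether the pipeline itself emits a trailing partial batch is up to damacy, the helper only counts them:

```python
def num_batches(n_samples: int, batch_size: int) -> int:
    """Ceil division: number of batches needed to cover n_samples."""
    return -(-n_samples // batch_size)

# 10 samples at batch_size=8 need 2 batches (one full, one partial);
# 16 samples pack exactly into 2 full batches.
print(num_batches(10, 8))  # -> 2
print(num_batches(16, 8))  # -> 2
```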

Public surface

The published API lives entirely under the top-level damacy package. The native extension (damacy._native) is an implementation detail documented only via its .pyi stub.

  • API reference — Pipeline, Config, Sample, Batch, the exception hierarchy, and the Stats/Metric value types.
  • Distributed — device binding model and torchrun / DDP examples.

Performance dashboards

Continuous benchmark history (auto-published from bench.yml).

Design notes

  • A single in-flight wave is processed at a time; output batches are double-buffered so the consumer can overlap training compute with the next wave's decompression and assembly.
  • Resource caps (host pinned-buffer pool, device decompress-scratch pool, GPU memory) are fixed at Pipeline(...) construction; nothing grows after that.
  • The assemble kernel casts heterogeneous source dtypes (u8/u16/i16/u32/i32/f16/f32) to the configured destination dtype (f32 or bf16) on the way out.
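The single-wave / double-buffered shape of the first design note can be modeled with a toy producer and consumer in plain Python. This is a sketch of the pattern only, not damacy's internals: the bounded queue of size two stands in for the double-buffered output batches, so the producer can prepare the next wave while the consumer is still working on the previous one, but never runs more than two batches ahead:

```python
import threading
import queue

def producer(out_q: queue.Queue, n_waves: int) -> None:
    # One wave in flight at a time: prepare wave i while the consumer
    # is still busy with an earlier one.
    for i in range(n_waves):
        wave = f"batch-{i}"   # stands in for decompress + assemble
        out_q.put(wave)       # blocks once both output buffers are full
    out_q.put(None)           # end-of-stream sentinel

results = []
q = queue.Queue(maxsize=2)    # double buffer: at most two ready batches
t = threading.Thread(target=producer, args=(q, 4))
t.start()
while (item := q.get()) is not None:
    results.append(item)      # stands in for the training step
t.join()
print(results)  # -> ['batch-0', 'batch-1', 'batch-2', 'batch-3']
```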
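As an illustration of the cast the assemble kernel performs on the way out, here is a NumPy sketch of the semantics (not the CUDA implementation): chunks arriving with heterogeneous source dtypes all land in the single configured destination dtype:

```python
import numpy as np

# Source chunks with heterogeneous dtypes, as they might arrive from
# different zarr arrays (values chosen arbitrarily for illustration).
chunks = [
    np.array([0, 255], dtype=np.uint8),
    np.array([0, 65535], dtype=np.uint16),
    np.array([-1.5, 1.5], dtype=np.float16),
]

# The assemble step casts each source dtype to the configured destination
# dtype; astype models a plain value-preserving cast to f32.
dst = np.concatenate([c.astype(np.float32) for c in chunks])
print(dst.dtype)  # -> float32
```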