damacy

High-speed streamed assembly of tensors from zarr sources to GPU.

damacy reads sharded NGFF zarr stores straight onto the GPU: per-shard chunk indexing, parallel host I/O, in-flight GPU-side decompression (zstd, blosc1-zstd), and a typed assemble kernel that lands each batch as a DLPack-ready device tensor.


Quick start

import damacy
import torch

cfg = damacy.Config(
    samples_per_batch=2,
    sample_shape=(64, 256, 256),
    max_gpu_memory_bytes=1 << 30,
    dtype="bf16",
)
samples = [
    damacy.Sample(uri="/data/cells/cell-1.zarr", aabb=[(0, 64), (0, 256), (0, 256)]),
    damacy.Sample(uri="/data/cells/cell-2.zarr", aabb=[(0, 64), (0, 256), (0, 256)]),
]

with damacy.Pipeline(cfg) as d:
    d.push(samples)
    for _ in range(len(samples) // cfg.samples_per_batch):
        with d.pop() as batch:
            x = torch.from_dlpack(batch)
            ...  # train step

By default the pipeline captures whatever CUDA context is current on the calling thread; PyTorch sets one up implicitly, and bare-Python users can call damacy._native.cuda_init_primary() once. For multi-GPU setups, see Distributed for the device binding model and a torchrun example.

Concepts

You hand damacy a stream of Samples; it returns a stream of Batches, each one a device tensor of shape (samples_per_batch, *sample_shape).
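As a quick sanity check of the shape contract above, the following sketch computes the batch shape and the size of one output tensor for the Quick start configuration. It uses only the standard library. The helper names (batch_shape, batch_bytes) and the byte-width table are illustrative assumptions, not part of damacy, and the result covers only the output tensor itself, not everything max_gpu_memory_bytes may account for.

```python
import math

# Bytes per element for the two destination dtypes mentioned in these docs.
# This table is an assumption made for the example, not a damacy constant.
DTYPE_BYTES = {"f32": 4, "bf16": 2}


def batch_shape(samples_per_batch, sample_shape):
    """Shape of one batch: (samples_per_batch, *sample_shape)."""
    return (samples_per_batch, *sample_shape)


def batch_bytes(samples_per_batch, sample_shape, dtype):
    """Size in bytes of one dense batch tensor of the given dtype."""
    n_elements = math.prod(batch_shape(samples_per_batch, sample_shape))
    return n_elements * DTYPE_BYTES[dtype]


# Values taken from the Quick start config.
shape = batch_shape(2, (64, 256, 256))
nbytes = batch_bytes(2, (64, 256, 256), "bf16")
print(shape, nbytes)
```

For the Quick start values this gives a (2, 64, 256, 256) tensor of 16 MiB, well under the 1 GiB budget configured there.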

  • A Sample is one crop request: a zarr URI plus an aabb (axis-aligned bounding box) given as a list of (start, stop) tuples — one per spatial axis. Every aabb must produce the same per-sample shape, and that shape is Config.sample_shape.
  • A Pipeline is a streaming context. You push an iterable of samples (lazy generators are fine — and recommended for long runs) and call pop() to block for the next ready batch.
  • A Batch is a DLPack-ready handle to a GPU-resident tensor. Use it inside a with block so damacy can reclaim the slot when you're done.
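The aabb rule above (half-open (start, stop) pairs that all yield Config.sample_shape) and the advice to push a lazy generator can be sketched as follows. This is a minimal illustration, not damacy code: CropRequest is a hypothetical stand-in for damacy.Sample so the snippet runs without the library, and the volume shape (128, 512, 512) is an invented example value.

```python
from itertools import product
from typing import Iterator, List, NamedTuple, Sequence, Tuple


class CropRequest(NamedTuple):
    """Hypothetical stand-in for damacy.Sample: a URI plus an aabb."""
    uri: str
    aabb: List[Tuple[int, int]]


def aabb_shape(aabb: Sequence[Tuple[int, int]]) -> Tuple[int, ...]:
    """Per-axis extent of a half-open (start, stop) bounding box."""
    return tuple(stop - start for start, stop in aabb)


def tile_requests(
    uri: str, volume_shape: Sequence[int], sample_shape: Sequence[int]
) -> Iterator[CropRequest]:
    """Lazily yield non-overlapping crops of sample_shape.

    Partial tiles at the volume edges are skipped so that every aabb
    has exactly the same shape.
    """
    starts_per_axis = [
        range(0, extent - size + 1, size)
        for extent, size in zip(volume_shape, sample_shape)
    ]
    for starts in product(*starts_per_axis):
        yield CropRequest(uri, [(s, s + size) for s, size in zip(starts, sample_shape)])


sample_shape = (64, 256, 256)
requests = tile_requests("/data/cells/cell-1.zarr", (128, 512, 512), sample_shape)
first = next(requests)  # nothing is materialized until the generator is consumed
print(first.aabb, aabb_shape(first.aabb))
```

With the real library, the same generator would presumably yield damacy.Sample(uri=..., aabb=...) objects, constructed as in the Quick start, and be passed to push.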

samples_per_batch, sample_shape, and max_gpu_memory_bytes are required on Config; everything else has a sensible default. The assemble kernel casts heterogeneous source dtypes (u8/u16/i16/u32/i32/f16/f32) to the configured destination dtype (f32 or bf16) on the way out, so your zarrs do not need to match it.
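When choosing between f32 and bf16 as the destination dtype, it may help to see what bf16 can represent exactly. The sketch below emulates an f32-to-bf16 conversion in pure Python. It assumes round-to-nearest-even, which is the common convention but is not confirmed here as the behavior of damacy's assemble kernel. The function names are hypothetical, and NaN handling is omitted.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Convert a float to bf16 bits via f32, rounding to nearest even.

    Assumption: round-to-nearest-even; NaN inputs are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add 0x7FFF plus the lowest kept bit so that exact ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Expand bf16 bits back to a Python float (exact)."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def round_trip_bf16(x: float) -> float:
    """Value obtained after storing x as bf16."""
    return bf16_bits_to_float(f32_to_bf16_bits(x))


for v in (255, 256, 257, 259, 65535):
    print(v, "->", round_trip_bf16(float(v)))
```

bf16 carries 8 significant bits, so every u8 value survives the cast exactly, while larger integers are rounded: under the assumed rounding mode, 257 becomes 256 and 65535 becomes 65536. If u16 or wider sources need their full integer precision, f32 is the safer destination.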

Public surface

The published API lives entirely under the top-level damacy package. The native extension (damacy._native) is an implementation detail documented only via its .pyi stub.

  • API reference — Pipeline, Config, Sample, Batch, the exception hierarchy, and the Stats/Metric value types.
  • GPU memory budget — how to think about max_gpu_memory_bytes, what it covers, and how to pick a value.
  • Distributed — device binding model and torchrun / DDP examples.
  • Async prefetch — zero-copy with deferred release for training loops that prefetch the next batch on a background thread.
  • Troubleshooting — common errors (PoolStarved, BudgetExceeded, missing CUDA context) and what to check first.

Performance dashboards

Continuous benchmark history (auto-published from bench.yml):