damacy
High-speed streamed assembly of tensors from zarr sources to GPU.
damacy reads sharded NGFF zarr stores straight onto the GPU: per-shard chunk indexing, parallel host I/O, in-flight GPU-side decompression (zstd, blosc1-zstd), and a typed assemble kernel that lands each batch as a DLPack-ready device tensor.
Quick start
import damacy
import torch

cfg = damacy.Config(
    samples_per_batch=2,
    sample_shape=(64, 256, 256),
    max_gpu_memory_bytes=1 << 30,
    dtype="bf16",
)

samples = [
    damacy.Sample(uri="/data/cells/cell-1.zarr", aabb=[(0, 64), (0, 256), (0, 256)]),
    damacy.Sample(uri="/data/cells/cell-2.zarr", aabb=[(0, 64), (0, 256), (0, 256)]),
]

with damacy.Pipeline(cfg) as d:
    d.push(samples)
    for _ in range(len(samples) // cfg.samples_per_batch):
        with d.pop() as batch:
            x = torch.from_dlpack(batch)
            ...  # train step
By default the pipeline captures whatever CUDA context is current on
the calling thread; PyTorch sets one up implicitly, and bare-Python
users can call damacy._native.cuda_init_primary() once. For
multi-GPU setups, see Distributed for the device
binding model and a torchrun example.
Concepts
You hand damacy a stream of Samples; it returns a stream of
Batches, each one a device tensor of shape
(samples_per_batch, *sample_shape).
- A Sample is one crop request: a zarr URI plus an aabb (axis-aligned bounding box) given as a list of (start, stop) tuples, one per spatial axis. Every aabb must produce the same per-sample shape, and that shape is Config.sample_shape.
- A Pipeline is a streaming context. You push an iterable of samples (lazy generators are fine, and recommended for long runs) and call pop() to block for the next ready batch.
- A Batch is a DLPack-ready handle to a GPU-resident tensor. Use it inside a with block so damacy can reclaim the slot when you're done.
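The shape rule for aabb can be checked before anything is pushed. The sketch below is plain Python and does not import damacy; the helper names (aabb_shape, check_aabb, tile_aabbs) are hypothetical and are not part of the damacy API. It assumes stop is exclusive, which is what the quick start implies: (0, 64) pairs with a sample_shape axis of 64.

```python
import itertools
from typing import Iterator, List, Sequence, Tuple

AABB = List[Tuple[int, int]]


def aabb_shape(aabb: AABB) -> Tuple[int, ...]:
    """Per-axis extent of a box, assuming half-open (start, stop) ranges."""
    return tuple(stop - start for start, stop in aabb)


def check_aabb(aabb: AABB, sample_shape: Sequence[int]) -> None:
    """Raise ValueError if the box does not produce exactly sample_shape."""
    shape = aabb_shape(aabb)
    if shape != tuple(sample_shape):
        raise ValueError(f"aabb {aabb} has shape {shape}, expected {tuple(sample_shape)}")


def tile_aabbs(volume_shape: Sequence[int], sample_shape: Sequence[int]) -> Iterator[AABB]:
    """Lazily yield non-overlapping boxes that tile a volume.

    Hypothetical helper: partial tiles at the volume edge are skipped so that
    every yielded box has exactly sample_shape.
    """
    starts = [
        range(0, dim - size + 1, size)
        for dim, size in zip(volume_shape, sample_shape)
    ]
    for origin in itertools.product(*starts):
        yield [(o, o + s) for o, s in zip(origin, sample_shape)]


# A 64 x 512 x 512 volume cut into 64 x 256 x 256 crops gives four boxes.
boxes = list(tile_aabbs((64, 512, 512), (64, 256, 256)))
for box in boxes:
    check_aabb(box, (64, 256, 256))
```

Because tile_aabbs is a generator, a generator expression that wraps each box as damacy.Sample(uri=..., aabb=box) would stay lazy when handed to push.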
samples_per_batch, sample_shape, and max_gpu_memory_bytes are required
on Config; everything else has a sensible default. The assemble
kernel casts heterogeneous source dtypes
(u8/u16/i16/u32/i32/f16/f32) to the configured
destination dtype (f32 or bf16) on the way out, so your zarrs
do not need to match it.
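The destination dtype also sets the size of each output tensor. The arithmetic below is a rough sketch, not damacy code: batch_bytes is a hypothetical helper, and it counts only the assembled batch tensor. What max_gpu_memory_bytes covers beyond that is described on the GPU memory budget page, so treat the result as a lower bound on what one batch needs.

```python
import math
from typing import Sequence

# Bytes per element for the two destination dtypes named above.
ITEMSIZE = {"f32": 4, "bf16": 2}


def batch_bytes(samples_per_batch: int, sample_shape: Sequence[int], dtype: str) -> int:
    """Size in bytes of one (samples_per_batch, *sample_shape) output tensor."""
    return samples_per_batch * math.prod(sample_shape) * ITEMSIZE[dtype]


# Quick-start configuration: 2 samples of 64 x 256 x 256 in bf16.
size_bf16 = batch_bytes(2, (64, 256, 256), "bf16")  # 16 MiB
size_f32 = batch_bytes(2, (64, 256, 256), "f32")    # 32 MiB
print(size_bf16 // (1 << 20), size_f32 // (1 << 20))
```

With the quick-start settings, one bf16 batch tensor is 16 MiB, a small fraction of the 1 GiB budget.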
Public surface
The published API lives entirely under the top-level damacy package.
The native extension (damacy._native) is an implementation detail
documented only via its .pyi stub.
- API reference: Pipeline, Config, Sample, Batch, the exception hierarchy, and the Stats / Metric value types.
- GPU memory budget: how to think about max_gpu_memory_bytes, what it covers, and how to pick a value.
- Distributed: device binding model and torchrun / DDP examples.
- Async prefetch: zero-copy with deferred release for training loops that prefetch the next batch on a background thread.
- Troubleshooting: common errors (PoolStarved, BudgetExceeded, missing CUDA context) and what to check first.
Performance dashboards
Continuous benchmark history (auto-published from
bench.yml):
- Throughput — bigger is better
- Timings — smaller is better