# damacy

High-speed streamed assembly of tensors from zarr sources to the GPU.
damacy reads sharded NGFF zarr stores straight onto the GPU: per-shard chunk indexing, parallel host I/O, in-flight GPU-side decompression (zstd, blosc1-{lz4,zstd}), and a typed assemble kernel that lands each batch as a DLPack-ready device tensor.
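The first stage above, per-shard chunk indexing, can be sketched in pure Python. `chunks_for_aabb` is a hypothetical helper (not part of damacy's API), shown only to illustrate which chunk coordinates an axis-aligned request must touch:

```python
from itertools import product

def chunks_for_aabb(aabb, chunk_shape):
    """List chunk-grid coordinates intersecting an axis-aligned box.

    aabb: per-axis (start, stop) half-open voxel ranges.
    chunk_shape: per-axis chunk extent.
    """
    ranges = [
        range(start // c, -(-stop // c))  # floor-div start, ceil-div stop
        for (start, stop), c in zip(aabb, chunk_shape)
    ]
    return list(product(*ranges))

# A (0,64) x (0,256) x (0,256) box over 64^3 chunks touches 1*4*4 = 16 chunks.
coords = chunks_for_aabb([(0, 64), (0, 256), (0, 256)], (64, 64, 64))
```

The real indexer additionally has to map each chunk coordinate to its shard and byte range; this sketch covers only the grid arithmetic.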
## Quick start
```python
import damacy
import torch

cfg = damacy.Config(
    batch_size=8,
    host_buffer_bytes=1 << 30,
    device_buffer_bytes=1 << 30,
    dtype="bf16",
)

# Push at least cfg.batch_size samples per batch you expect back; the
# two samples below only illustrate the shape of a Sample.
samples = [
    damacy.Sample(uri="/data/cells/cell-1.zarr", aabb=[(0, 64), (0, 256), (0, 256)]),
    damacy.Sample(uri="/data/cells/cell-2.zarr", aabb=[(0, 64), (0, 256), (0, 256)]),
]

with damacy.Pipeline(cfg) as d:
    d.push(samples)
    for batch in d.batches(len(samples) // cfg.batch_size):
        with batch as t:
            x = torch.from_dlpack(t)
            ...  # train step
```
By default the pipeline captures whatever CUDA context is current on the calling thread; PyTorch sets one up implicitly, and bare-Python users can call `damacy._native.cuda_init_primary()` once. For multi-GPU setups, see Distributed for the device binding model and a torchrun example.
## Public surface
The published API lives entirely under the top-level `damacy` package. The native extension (`damacy._native`) is an implementation detail documented only via its `.pyi` stub.
- API reference — `Pipeline`, `Config`, `Sample`, `Batch`, the exception hierarchy, and the `Stats`/`Metric` value types.
- Distributed — device binding model and torchrun / DDP examples.
## Performance dashboards
Continuous benchmark history (auto-published from `bench.yml`):
- Throughput — bigger is better
- Timings — smaller is better
## Design notes
- A single in-flight wave is processed at a time; output batches are double-buffered so the consumer can overlap training compute with the next wave's decompression and assembly.
- Resource caps (host pinned-buffer pool, device decompress-scratch pool, GPU memory) are fixed at `Pipeline(...)` construction; nothing grows after that.
- The assemble kernel casts heterogeneous source dtypes (`u8`/`u16`/`i16`/`u32`/`i32`/`f16`/`f32`) to the configured destination dtype (`f32` or `bf16`) on the way out.