## Context

From #104 and the architecture research in #107, async weight streaming was identified as a performance optimization for block-level offloading (`--offload` / `MOLD_OFFLOAD=1`).
## Current behavior

The offloaded FLUX forward pass (`offload.rs`) transfers blocks synchronously:

```rust
for block in self.double_blocks.iter() {
    let gpu_block = block.to_device(&self.gpu_device)?; // CPU → GPU transfer
    (img, txt) = gpu_block.forward(&img, &txt, &vec_, &pe)?; // Compute
    self.gpu_device.synchronize()?; // Wait
    drop(gpu_block); // Free
}
```

Transfer and compute happen sequentially — the GPU sits idle during transfers.
## Proposed approach

Double-buffered async streaming using secondary CUDA streams:

- Create a secondary CUDA stream via `cudarc::driver::CudaStream`
- While block N computes on the primary stream, transfer block N+1 on the secondary stream
- Use `cuEventRecord`/`cuStreamWaitEvent` for synchronization between streams
- Each stream gets its own cast buffer (ComfyUI pattern: `NUM_STREAMS=2`, round-robin)
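The overlap pattern above can be sketched in plain Rust, using a background thread as a stand-in for the secondary CUDA stream (the `transfer` and `compute` functions here are hypothetical placeholders for `to_device()` and `forward()`, not the real model code):

```rust
use std::thread;

// Stand-in for the CPU → GPU copy issued on the secondary stream.
// In the real code this would be an async transfer of block weights.
fn transfer(block_id: usize) -> usize {
    block_id * 10 // pretend this yields a GPU-resident block
}

// Stand-in for forward() on the primary stream.
fn compute(gpu_block: usize, acc: usize) -> usize {
    acc + gpu_block
}

fn run_pipeline(num_blocks: usize) -> usize {
    let mut acc = 0;
    // Prefetch block 0 up front (first transfer is exposed).
    let mut pending = Some(thread::spawn(move || transfer(0)));
    for i in 0..num_blocks {
        // Wait for the in-flight transfer (cuStreamWaitEvent analogue).
        let gpu_block = pending.take().unwrap().join().unwrap();
        // Kick off the next transfer while this block computes
        // (the double-buffering step: at most one extra block resident).
        if i + 1 < num_blocks {
            pending = Some(thread::spawn(move || transfer(i + 1)));
        }
        acc = compute(gpu_block, acc);
        // gpu_block dropped here, freeing its buffer for reuse.
    }
    acc
}

fn main() {
    println!("{}", run_pipeline(4)); // prints 60 (0 + 10 + 20 + 30)
}
```

The key structural point is that the wait for block N+1's transfer happens only after block N's compute has been issued, so the copy and the compute overlap.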
## Expected benefit

- 30-50% reduction in per-denoising-step wall time when transfer time < compute time
- ~200MB additional VRAM from double-buffering (one extra block resident at a time)
- Only affects the `Offloaded` `FluxTransformer` variant
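As a back-of-envelope check of the 30-50% claim (all numbers below are hypothetical, not measured): with per-block transfer time `t`, compute time `c`, and `n` blocks, the sequential loop costs `n·(t + c)` while the pipelined loop costs `t + n·c`, since only the first transfer is exposed.

```rust
/// Wall-time model for one denoising step.
/// `t` = per-block transfer ms, `c` = per-block compute ms, `n` = block count.
fn step_times(t: f64, c: f64, n: f64) -> (f64, f64) {
    let sequential = n * (t + c); // every transfer is exposed
    let pipelined = t + n * c;    // only the first transfer is exposed
    (sequential, pipelined)
}

fn main() {
    // Hypothetical numbers: 4 ms transfer, 8 ms compute, 19 double blocks.
    let (seq, pipe) = step_times(4.0, 8.0, 19.0);
    println!("sequential {seq} ms, pipelined {pipe} ms"); // 228 vs 156 (~32% saving)
}
```

The saving approaches `t / (t + c)` as `n` grows, which is why the benefit collapses once transfer time exceeds compute time.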
## Implementation notes

- candle uses `cudarc` for its CUDA backend — stream APIs are available via `candle_core::cuda_backend::cudarc::driver`
- May require candle fork changes to support `to_device()` with an explicit stream parameter
- Alternative: use `cudarc` directly for `cuMemcpyAsync` on the secondary stream
- ComfyUI reference: `comfy/model_management.py` lines 1156-1261
## References

- Research doc: `docs/architecture/research-multi-model-cache-and-compute-boundaries.md` (Phase 7)
- ComfyUI async stream implementation with pinned memory