
Enhancement: async CUDA stream weight streaming for block offloading #109

@jamesbrink

Context

From #104 and the architecture research in #107, async weight streaming was identified as a performance optimization for block-level offloading (--offload / MOLD_OFFLOAD=1).

Current behavior

The offloaded FLUX forward pass (offload.rs) transfers blocks synchronously:

for block in self.double_blocks.iter() {
    let gpu_block = block.to_device(&self.gpu_device)?;  // CPU → GPU transfer
    (img, txt) = gpu_block.forward(&img, &txt, &vec_, &pe)?;  // Compute
    self.gpu_device.synchronize()?;  // Wait
    drop(gpu_block);  // Free
}

Transfer and compute happen sequentially, so the GPU sits idle during each transfer.

Proposed approach

Double-buffered async streaming using secondary CUDA streams:

  1. Create a secondary CUDA stream via cudarc::driver::CudaStream
  2. While block N computes on the primary stream, transfer block N+1 on the secondary stream
  3. Use cuEventRecord/cuStreamWaitEvent for synchronization between streams
  4. Each stream has its own cast buffer (ComfyUI pattern: NUM_STREAMS=2, round-robin)
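
To make the ordering in the steps above concrete, here is a minimal CPU-only analogy, not cudarc or candle code. It assumes a transfer thread standing in for the secondary stream and a rendezvous channel standing in for the event record/wait handoff. `HostBlock`, `DeviceBlock`, `upload`, `forward`, and `run_pipelined` are hypothetical names invented for this sketch.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for a block's weights in host memory.
struct HostBlock {
    id: usize,
    weights: Vec<f32>,
}

/// Hypothetical stand-in for the same block after the host-to-device copy.
struct DeviceBlock {
    id: usize,
    weights: Vec<f32>,
}

/// Simulated transfer. A real implementation would issue an async copy
/// on the secondary stream instead of cloning a Vec.
fn upload(block: &HostBlock) -> DeviceBlock {
    DeviceBlock { id: block.id, weights: block.weights.clone() }
}

/// Simulated compute: folds the weights into the running activation.
fn forward(block: &DeviceBlock, x: f32) -> f32 {
    block.weights.iter().fold(x, |acc, w| acc * w + 1.0)
}

/// Runs all blocks in order while a second thread prefetches the next one.
/// Returns the final activation and the order in which blocks were computed.
fn run_pipelined(blocks: Vec<HostBlock>, input: f32) -> (f32, Vec<usize>) {
    // Capacity 0 makes the channel a rendezvous point: the transfer thread
    // can hold exactly one uploaded block while the main thread computes
    // another, so at most two "device" blocks exist at once.
    let (tx, rx) = sync_channel::<DeviceBlock>(0);

    let transfer_thread = thread::spawn(move || {
        for block in &blocks {
            // Blocks here until the compute side is ready for the next block.
            if tx.send(upload(block)).is_err() {
                break; // The compute side hung up early.
            }
        }
    });

    let mut x = input;
    let mut order = Vec::new();
    for device_block in rx {
        x = forward(&device_block, x);
        order.push(device_block.id);
        // device_block is dropped here, freeing its buffer for reuse.
    }

    transfer_thread.join().expect("transfer thread panicked");
    (x, order)
}
```

Calling `run_pipelined` with four blocks returns them in submission order, which is the property the cross-stream events would have to guarantee on the GPU. The sketch says nothing about actual overlap on hardware; that depends on pinned host memory and on the copy really running on a separate stream.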

Expected benefit

  • 30-50% reduction in per-denoising-step wall time when transfer time < compute time
  • ~200MB additional VRAM from double-buffering (one extra block)
  • Only affects the Offloaded FluxTransformer variant
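
As a rough sanity check on the estimate above, the following sketch models the two-buffer schedule with idealized timings. It assumes every block has the same transfer and compute time, that copies and kernels overlap perfectly, and that synchronization costs nothing. The function names and the time units are illustrative, not measurements from this project.

```rust
/// Total time for the current synchronous loop: each block pays
/// transfer plus compute, back to back.
fn sequential_time(n: u64, transfer: u64, compute: u64) -> u64 {
    n * (transfer + compute)
}

/// Idealized two-buffer pipeline. Block i's copy may start once the copy
/// stream is free and the buffer last used by block i-2 has been released;
/// block i's compute may start once block i-1 is done and its own copy landed.
fn pipelined_time(n: usize, transfer: u64, compute: u64) -> u64 {
    let mut transfer_done = vec![0u64; n];
    let mut compute_done = vec![0u64; n];
    for i in 0..n {
        let copy_stream_free = if i >= 1 { transfer_done[i - 1] } else { 0 };
        let buffer_free = if i >= 2 { compute_done[i - 2] } else { 0 };
        transfer_done[i] = copy_stream_free.max(buffer_free) + transfer;

        let compute_stream_free = if i >= 1 { compute_done[i - 1] } else { 0 };
        compute_done[i] = compute_stream_free.max(transfer_done[i]) + compute;
    }
    compute_done.last().copied().unwrap_or(0)
}

/// Percentage of wall time saved by the pipeline relative to the
/// synchronous loop, under the same idealized assumptions.
fn savings_percent(n: usize, transfer: u64, compute: u64) -> f64 {
    let seq = sequential_time(n as u64, transfer, compute) as f64;
    let pipe = pipelined_time(n, transfer, compute) as f64;
    if seq == 0.0 { 0.0 } else { 100.0 * (1.0 - pipe / seq) }
}
```

In this model the pipelined total works out to transfer + compute + (n - 1) * max(transfer, compute). For many blocks the saving therefore approaches min(transfer, compute) / (transfer + compute), which peaks at 50% when the two times are equal and reaches 30% when transfer is a little under half of compute. Real savings would be lower once synchronization overhead and memory-bandwidth contention are included.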

Implementation notes

  • candle uses cudarc for its CUDA backend; the stream APIs are available via candle_core::cuda_backend::cudarc::driver
  • May need changes in a candle fork so that to_device() accepts an explicit stream parameter
  • Alternative: call cudarc directly and issue cuMemcpyAsync on the secondary stream
  • ComfyUI reference: comfy/model_management.py lines 1156-1261

References

  • Research doc: docs/architecture/research-multi-model-cache-and-compute-boundaries.md (Phase 7)
  • ComfyUI async stream implementation with pinned memory
