## Context

From #104 and the architecture research in #107, async weight streaming was identified as a performance optimization for block-level offloading (`--offload` / `MOLD_OFFLOAD=1`).
## Current behavior

The offloaded FLUX forward pass (`offload.rs`) transfers blocks synchronously:

```rust
for block in self.double_blocks.iter() {
    let gpu_block = block.to_device(&self.gpu_device)?; // CPU → GPU transfer
    (img, txt) = gpu_block.forward(&img, &txt, &vec_, &pe)?; // Compute
    self.gpu_device.synchronize()?; // Wait
    drop(gpu_block); // Free
}
```

Transfer and compute happen sequentially — the GPU sits idle during transfers.
## Proposed approach

Double-buffered async streaming using secondary CUDA streams:

- Create a secondary CUDA stream via `cudarc::driver::CudaStream`
- While block N computes on the primary stream, transfer block N+1 on the secondary stream
- Use `cuEventRecord`/`cuStreamWaitEvent` for synchronization between streams
- Each stream gets its own cast buffer (ComfyUI pattern: `NUM_STREAMS=2`, round-robin)
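The overlap pattern above can be sketched in plain Rust, using a background thread as a stand-in for the secondary CUDA stream (the `transfer` and `compute` functions here are hypothetical placeholders for `to_device()` and `forward()`, not the real model code):

```rust
use std::thread;

// Stand-in for the CPU → GPU copy issued on the secondary stream.
// In the real code this would be an async transfer of block weights.
fn transfer(block_id: usize) -> usize {
    block_id * 10 // pretend this yields a GPU-resident block
}

// Stand-in for forward() on the primary stream.
fn compute(gpu_block: usize, acc: usize) -> usize {
    acc + gpu_block
}

fn run_pipeline(num_blocks: usize) -> usize {
    let mut acc = 0;
    // Prefetch block 0 up front (first transfer is exposed).
    let mut pending = Some(thread::spawn(move || transfer(0)));
    for i in 0..num_blocks {
        // Wait for the in-flight transfer (cuStreamWaitEvent analogue).
        let gpu_block = pending.take().unwrap().join().unwrap();
        // Kick off the next transfer while this block computes
        // (the double-buffering step: at most one extra block resident).
        if i + 1 < num_blocks {
            pending = Some(thread::spawn(move || transfer(i + 1)));
        }
        acc = compute(gpu_block, acc);
        // gpu_block dropped here, freeing its buffer for reuse.
    }
    acc
}

fn main() {
    println!("{}", run_pipeline(4)); // prints 60 (0 + 10 + 20 + 30)
}
```

The key structural point is that the wait for block N+1's transfer happens only after block N's compute has been issued, so the copy and the compute overlap.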
## Expected benefit

- 30-50% reduction in per-denoising-step wall time when transfer time < compute time
- ~200MB additional VRAM from double-buffering (one extra block resident at a time)
- Only affects the `Offloaded` `FluxTransformer` variant
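As a back-of-envelope check of the 30-50% claim (all numbers below are hypothetical, not measured): with per-block transfer time `t`, compute time `c`, and `n` blocks, the sequential loop costs `n·(t + c)` while the pipelined loop costs `t + n·c`, since only the first transfer is exposed.

```rust
/// Wall-time model for one denoising step.
/// `t` = per-block transfer ms, `c` = per-block compute ms, `n` = block count.
fn step_times(t: f64, c: f64, n: f64) -> (f64, f64) {
    let sequential = n * (t + c); // every transfer is exposed
    let pipelined = t + n * c;    // only the first transfer is exposed
    (sequential, pipelined)
}

fn main() {
    // Hypothetical numbers: 4 ms transfer, 8 ms compute, 19 double blocks.
    let (seq, pipe) = step_times(4.0, 8.0, 19.0);
    println!("sequential {seq} ms, pipelined {pipe} ms"); // 228 vs 156 (~32% saving)
}
```

The saving approaches `t / (t + c)` as `n` grows, which is why the benefit collapses once transfer time exceeds compute time.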
## Implementation notes

- candle uses `cudarc` for its CUDA backend — stream APIs are available via `candle_core::cuda_backend::cudarc::driver`
- May require candle fork changes to support `to_device()` with an explicit stream parameter
- Alternative: use `cudarc` directly for `cuMemcpyAsync` on the secondary stream
- ComfyUI reference: `comfy/model_management.py` lines 1156-1261
## References

- Research doc: `docs/architecture/research-multi-model-cache-and-compute-boundaries.md` (Phase 7)
- ComfyUI async stream implementation with pinned memory