feat: CUDA IPC zero-copy GPU transport via external cuda-link dependency (TD↔SD) by forkni · Pull Request #15 · dotsimulate/StreamDiffusion

forkni · 2026-05-18T05:42:54Z

Summary

Full zero-copy GPU transport between TouchDesigner and StreamDiffusion over CUDA IPC, in all three directions (SD→TD output, TD→SD input, TD→SD ControlNet). The CUDA-IPC engine is now an external pip dependency — cuda-link — rather than a vendored mirror inside the repo.

Architecture change: vendored `_compat/` → pip `cuda-link`

Previously the library was vendored under src/streamdiffusion/_compat/ as two mirror trees (cuda_ipc/ Python runtime + td_exporter/ TD DAT source), kept in lockstep by hand (the "re-vendoring trap": 5 relative-import patches re-applied on every update) while setup.py never declared the dependency. This PR removes ~19.6k LOC of vendored mirrors and depends solely on the installed package. See docs/adr/0001-cuda-link-as-external-dependency.md.

| Aspect | Now |
| --- | --- |
| Dependency | `setup.py`: `cuda-link @ git+https://github.com/forkni/cuda-link@v1.8.1` (+ `cuda_ipc` extra) |
| Runtime import | `wrapper.py` imports `Exporter, FrameSpec, FrameOutcome, GpuFrame` from `cuda_link` |
| TD side | `CUDALinkBootstrap` library mode (`CUDALINK_LIB_PATH`): same installed package, no DAT mirror |
| Removed | entire `src/streamdiffusion/_compat/{cuda_ipc,td_exporter}/` (~19.6k LOC) |
| Renamed | `_compat/` → `_patches/` for non-cuda runtime patches (diffusers KVO, HF tracing) |
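A minimal sketch of how the `cuda_ipc` extra might be declared, assuming a conventional setuptools `extras_require` layout. Only the requirement string comes from the table above; `EXTRAS_REQUIRE` and `split_direct_reference` are illustrative names, and the real `setup.py` may be organized differently.

```python
from __future__ import annotations

# Requirement string quoted in the PR description (a PEP 508 direct reference).
CUDA_LINK_REQUIREMENT = "cuda-link @ git+https://github.com/forkni/cuda-link@v1.8.1"

# Assumed layout: the git-direct reference sits behind the `cuda_ipc` extra,
# so `pip install -e .[cuda_ipc]` is what pulls cuda-link in.
EXTRAS_REQUIRE = {"cuda_ipc": [CUDA_LINK_REQUIREMENT]}


def split_direct_reference(requirement: str) -> tuple[str, str]:
    """Split 'name @ url' at the first '@' into (name, url)."""
    name, _, url = requirement.partition("@")
    return name.strip(), url.strip()


name, url = split_direct_reference(CUDA_LINK_REQUIREMENT)
print(name, url)
```

Splitting at the first `@` matters because the URL itself contains a second `@` before the tag.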

IPC transport (all three directions)

- SD→TD output (Exporter): ring-buffer IPC with CUDA graph memcpy, activation barrier, WDDM HW scheduling support
- TD→SD input (Importer): zero-copy GPU read of TD's render output; CPU cudaEventQuery sync (no GPU-stream entanglement)
- TD→SD ControlNet (Importer): same zero-copy path for canny/depth control image; activated via use_cuda_ipc_controlnet YAML key

ControlNet TRT 901 fix (core bug resolved in this PR)

cudaErrorStreamCaptureInvalidated (901) fired on every cold-start when controlnet_scale > 0. Root cause: TRT's genericReformat::copyPackedRunKernel submits work to the legacy/NULL stream during execute_async_v3 inside the graph-capture window. Fix: use_cuda_graph=False for CN engines in wrapper.py — keeps TRT acceleration, skips the CUDA-graph wrapper, no capture window, no 901.
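The shape of the fix can be sketched as a per-engine option choice. This is a hypothetical illustration, not the code in `wrapper.py`: `EngineOptions` and `options_for` are invented names, and the assumption that non-ControlNet engines keep the CUDA-graph wrapper is inferred from the description rather than stated in it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineOptions:
    """Hypothetical per-engine options; not the wrapper's real structure."""
    name: str
    use_tensorrt: bool
    use_cuda_graph: bool


def options_for(name: str, is_controlnet: bool) -> EngineOptions:
    # TensorRT stays enabled for every engine. Only the CUDA-graph wrapper is
    # skipped for ControlNet engines, so there is no stream-capture window for
    # TRT's legacy-stream work to invalidate (error 901).
    return EngineOptions(
        name=name,
        use_tensorrt=True,
        use_cuda_graph=not is_controlnet,
    )


for engine in (options_for("unet", False), options_for("controlnet_canny", True)):
    print(engine)
```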

Also in this PR

- IPC health indicator: surfaces zero-copy degradation to the console, gated behind par.Debugmode UI parameter
- Profiling/perf: hot-path eager-op instrumentation, profile_ncu.py production-engine fix, UNet wave-limited roofline write-up
- Plans: clean-shutdown watchdog + copy-sdtd-code crash-fix docs (docs/plans/2026-05-24-*.md)

Test plan

- `pip install -e .[cuda_ipc]` resolves cuda-link v1.8.1; `import streamdiffusion` clean
- Cold-start .toe with `controlnet_scale: 0.577` and `use_cuda_ipc_controlnet: true`: no 901, CN active from frame 1
- Toggle CN via OSC (enable/disable, scale 0→0.5→0.8): no 901
- TD IPC Receiver: all 3 slots open, event=YES, stream_wait < 0.1 ms
- Output IPC: TD Receiver consuming SD output, copyCUDAMemory < 0.15 ms/frame
- Sustained 3+ min run: steady FPS ≥ 15, no `[E] IExecutionContext::enqueueV3` errors
- Debugmode off → no health spam; Debugmode on → degradation line appears

🤖 Generated with Claude Code

kvo_cache_in_* tensors have ONNX dim 0 = 2 (hard-static K/V pair), not a symbolic batch dim. The previous naïve _max_rows tile pumped sample to 2×_n_itr rows, causing modelopt's CalibrationDataProvider to compute n_itr=2×_n_itr (sample's symbolic dim 0 resolves to 1) and split kvo into chunks of shape (1,...) instead of (2,...) — ORT then rejected them with "Got 1 Expected 2". Fix: compute per-input target_rows = n_itr × resolved_dim0(name), mirroring modelopt's symbolic→1/static-kept substitution, so every input splits into exactly n_itr uniform chunks. Adds regression test in tests/quality/. Fixes SDXL-Turbo + use_cached_attn=True + cfg_type=self + use_controlnet TD config. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
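A minimal sketch of the per-input row computation described above, assuming ONNX shapes are given as tuples in which a symbolic dimension is a string and a static one is an int. The shapes and helper names are hypothetical; only the rule `target_rows = n_itr × resolved_dim0(name)` comes from the commit message.

```python
def resolved_dim0(shape):
    """Mirror the substitution described above: symbolic dim 0 -> 1, static kept."""
    dim0 = shape[0]
    return dim0 if isinstance(dim0, int) else 1


def target_rows(onnx_shapes, n_itr):
    """Rows each calibration input needs so it splits into n_itr uniform chunks."""
    return {name: n_itr * resolved_dim0(shape) for name, shape in onnx_shapes.items()}


# Hypothetical shapes: `sample` has a symbolic batch dim, the kvo cache a static 2.
shapes = {
    "sample": ("2B", 4, 64, 64),
    "kvo_cache_in_0": (2, 1, 10, 4096, 64),
}
n_itr = 4
rows = target_rows(shapes, n_itr)
chunk_rows = {name: r // n_itr for name, r in rows.items()}
print(rows, chunk_rows)
```

With these inputs every tensor divides into exactly `n_itr` chunks, and the kvo chunks keep their leading dimension of 2, which is what ORT expects.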

…eam-start YAML

Resolves cudaErrorStreamCaptureInvalidated (901) on first CN TRT inference when use_cuda_ipc_controlnet is active. Root cause and runtime fix live in the dotsimulate/StreamDiffusionTD repo (StreamDiffusionTD/td_manager.py: drop stream= arg from CUDAIPCImporter.get_frame, use CPU eager-sync via _wait_for_slot to avoid pending GPU work on the legacy stream).

This commit covers the tracked-side changes:
- cuda_ipc_exporter: capture mode Global->ThreadLocal (defensive hardening)
- cuda_graphs: docstring correction for multi-engine processes
- _plans: add 2026-05-17 emitter session + 2026-05-18 capture-fix session

YAML emitter (use_cuda_ipc_controlnet + cuda_ipc_control_shm_name keys) was applied 2026-05-17 to Scripts/StreamDiffusionTD__Text__StreamDiffusionExt__td.py (outside this repo, lives in dotsimulate/StreamDiffusionTD).

Verified: 19-28 FPS sustained with CN canny SDXL-Turbo 512x512, OSC enable/scale changes accepted, no 901, TD-side Receiver healthy.

…901)

ControlNet TRT engines fail cudaStreamEndCapture with 901 (cudaErrorStreamCaptureInvalidated) on cold start when controlnet_scale > 0 in td_config.yaml.

Root cause: TRT's internal genericReformat::copyPackedRunKernel submits work to the legacy/NULL stream during execute_async_v3 inside the graph-capture window on the engine's (polygraphy blocking) stream. wrapper.py:2208 hard-coded use_cuda_graph=True for every CN engine. Setting it to False keeps TRT acceleration for CN but skips the CUDA-graph wrapper, eliminating the capture-window conflict.

Cost: ~hundreds of us per CN forward on WDDM (no graph batch-submission); steady-state FPS 18-25 vs 19-28.

Also:
- utilities.py: defensive torch.cuda.current_stream().synchronize() before cudaStreamBeginCapture, gated on first capture per engine. Covers the broader polygraphy blocking-stream / legacy-stream race for future TRT engines.

Diagnosis trail: v0 (streamWaitEvent on legacy), v1 (wait_stream bridge), v2 (CPU cudaEventQuery - fixes warm-activation), Stage A (CUDALINK_USE_GRAPHS=0 - disproved), v3 (drain legacy pre-capture - disproved), v4 (this commit).

Verified: cold-start with CN scale=0.577 + use_cuda_ipc_controlnet=true, no 901, CN active from frame 1, steady FPS sustained.

…N smoothness

CUDALINK_WAIT_SPIN_US 200 -> 1000: absorbs CN-preprocessing variance on the importer side without falling to the blocking wait path. Eliminates micro-stutter visible during CN-enabled SDXL-Turbo runs on RTX 4090.

CUDALINK_BARRIER_STALE_NS 5s -> 0.2s: at ~16 FPS, the 5s upstream default would let ~80 stale frames through the activation barrier before rejection. 0.2s (~3 frames) is tight enough to catch a genuinely stale publish without false positives on healthy frames.

Applied to both _compat/cuda_ipc/ (library) and _compat/td_exporter/ (TD COMP mirror, auto-synced to .tox) halves in lockstep. CUDALINK_TD_USE_GRAPHS default of False preserved.
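The frame counts quoted for the two staleness windows follow directly from window length times frame rate. A small check, assuming the ~16 FPS figure from the message (`frames_within` is an illustrative helper, not part of cuda-link):

```python
def frames_within(window_s: float, fps: float) -> float:
    """Number of frames published during a staleness window of `window_s` seconds."""
    return window_s * fps


FPS = 16.0  # approximate rate quoted in the commit message

old_window = frames_within(5.0, FPS)  # upstream default of 5 s
new_window = frames_within(0.2, FPS)  # tightened value of 0.2 s

print(f"5 s window:   ~{old_window:.0f} frames")
print(f"0.2 s window: ~{new_window:.1f} frames")
```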

…uild_engine The direct mutation at lines 1147-1149 was immediately overwritten by the full GPUBuildProfile dataclass rebuild at 1153-1172 — dead code. Drop the first block; keep the dataclass rebuild as the single override path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- bump VENDORED_VERSION 1.4.1 -> 1.5.1 (upstream 2d44ef8)
- split cuda_ipc_exporter.py into exporter.py/importer.py/_cuda_adapters.py/_env.py/_profile.py; add activation_barrier.py; drop debug_utils.py
- migrate StreamDiffusionWrapper export path to Exporter.open(FrameSpec)/export(GpuFrame)->FrameOutcome/close() with env-driven ExportPolicy
- mirror td_exporter/ in lockstep (auto-syncs to .tox); retain CUDAIPCImporter as deprecation shim (removal v1.8)

- P1: set TF32/cudnn/matmul precision flags at StreamDiffusion init
- P2: gate per-frame GPU sync to every 16th frame (remove ~15/16 host stalls)
- P3: GPU-native Canny in _process_tensor_core (eliminates D2H+cv2+H2D round-trip)
- P4: GPU-side uint8 output conversion + pinned D2H in _send_output_frame
- P5: upload CPU-SHM input as uint8, normalize on-GPU, use pipeline fast path
- P6: fix IPC zero-copy: get_frame() instead of get_frame_numpy()
- P7: verified IPC output export already implemented

Grounding: CUDA Handbook Ch.5/6/11 + PMPP Ch.4/5/6/18
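The P2 gating can be sketched as a simple frame counter. This is an illustrative stand-in, not the pipeline's code: `SyncGate` is a hypothetical name, and the place where the real host-side sync would run is marked only by a comment.

```python
class SyncGate:
    """Allow a host-side GPU sync only on every `interval`-th frame."""

    def __init__(self, interval: int = 16) -> None:
        self.interval = interval
        self.frame_index = 0

    def should_sync(self) -> bool:
        due = self.frame_index % self.interval == 0
        self.frame_index += 1
        return due


gate = SyncGate(interval=16)
synced = 0
for _ in range(32):
    if gate.should_sync():
        # The real pipeline would issue its host-blocking sync here.
        synced += 1

print(f"{synced} syncs over 32 frames")
```

Over any run of 16 frames only one sync fires, which is where the "~15/16 host stalls removed" figure comes from.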

- Add ControlNet CUDA-IPC consumer to both td_manager copies: ipc_control_importer state vars, _try_construct_ipc_control_importer(), throttled lazy-reconnect (1s), get_frame() -> (1,3,H,W) float16 [0,1] -> wrapper.update_control_image(). CPU-SHM fallback also given lazy-reconnect. Fixes no-conditioning-effect regression caused by missing consumer + startup race on control_memory.
- P3 Canny hardening: replace per-frame amax normalization with constant /4.0 divisor for stable frame-to-frame thresholds; replace expand() stride-0 view with repeat() for contiguous CHW output tensor.
- Add post-implementation corrections to cuda-perf-plan.md (file paths, td_manager untracked status, dead use_cuda_ipc_controlnet flag now backed by real consumer).
- Add new plan doc: docs/plans/2026-05-24-controlnet-ipc-consumer.md
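The throttled lazy-reconnect in the first bullet can be sketched as follows. This is a generic pattern under stated assumptions, not the td_manager implementation: `ThrottledReconnect`, its `connect` callable, and the choice of `OSError` as the "not ready yet" signal are all hypothetical.

```python
import time


class ThrottledReconnect:
    """Retry a failing connect() at most once per `min_interval_s` seconds."""

    def __init__(self, connect, min_interval_s=1.0, clock=time.monotonic):
        self._connect = connect
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_attempt = None
        self.handle = None

    def get(self):
        """Return the connected handle, or None if not connected yet."""
        if self.handle is not None:
            return self.handle
        now = self._clock()
        if self._last_attempt is not None and now - self._last_attempt < self._min_interval_s:
            return None  # throttled: too soon since the last attempt
        self._last_attempt = now
        try:
            self.handle = self._connect()
        except OSError:
            # Assumed failure mode: the producer has not created its shared memory yet.
            self.handle = None
        return self.handle


# Usage with a fake clock and a producer that appears on the second attempt.
fake_now = [0.0]
attempts = []


def fake_connect():
    attempts.append(fake_now[0])
    if len(attempts) < 2:
        raise FileNotFoundError("producer not ready")
    return "importer"


reconnect = ThrottledReconnect(fake_connect, min_interval_s=1.0, clock=lambda: fake_now[0])
first = reconnect.get()    # attempt 1 fails
fake_now[0] = 0.5
second = reconnect.get()   # throttled, no attempt made
fake_now[0] = 1.0
third = reconnect.get()    # attempt 2 succeeds
print(first, second, third, len(attempts))
```

Calling `get()` once per frame keeps the hot path cheap: after a failure, frames inside the throttle window return immediately without touching the OS.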

…efault off)

The per-frame _send_processed_controlnet_frame().cpu().numpy() D2H copy stalled the CUDA stream every frame once controlnet_images[0] became non-None after the IPC consumer fix. Preview is display-only; diffusion is unaffected.

Changes (Scripts/ auto-sync to running .tox; runtime td_manager.py is untracked):
- _send_back_processed_controlnet: early-return when send_controlnet_preview is false
- _initialize_memory_interfaces: skip control_processed_memory allocation when disabled
- StreamDiffusionExt emitter: emit send_controlnet_preview flag (default false), overridable via Sendcontrolnetpreview TD par
- docs/plans/2026-05-24-controlnet-preview-throttle.md: plan + diagnosis

inference_time_ema feeds only the similar-filter sleep heuristic. When the filter is off the EMA has no reader, so gate start.record()/end.record()/ end.synchronize() behind if self.similar_image_filter — eliminates even the residual 1-in-16 host stall on the default (filter-off) production path.

Adds NVTX profiler.region() wraps around all identified eager-op candidates outside the TRT engines: glue.ipc_pack_rgba (wrapper.py), trt.input_staging (utilities.py), sched.step_batch + sched.rebuild (pipeline.py). Adds nsys 2026.2.1 to profile_nsys.py auto-detect list.

All candidates measured NO GO on RTX 4090 WDDM (33 FPS / ~30ms frame):
- glue.ipc_pack_rgba: P50 = 80 us (0.27% of frame)
- trt.input_staging: P50 = 40 us (0.13%)
- sched.step_batch: P50 = 40 us/call x2 (0.27%)
- sched.rebuild: P50 = 10 us (0.03%)

The TRT UNet (~28 ms P50, ~93% of frame) is the only optimization surface. Results documented in docs/profiling/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
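The frame-share percentages in the profiling message above can be reproduced from the quoted P50 values, assuming the rounded 30 ms frame time stated there (`share_percent` is an illustrative helper):

```python
FRAME_US = 30_000  # ~30 ms frame, as quoted for 33 FPS

# P50 timings in microseconds, taken from the commit message.
P50_US = {
    "glue.ipc_pack_rgba": 80,
    "trt.input_staging": 40,
    "sched.step_batch": 2 * 40,  # 40 us per call, two calls per frame
    "sched.rebuild": 10,
    "TRT UNet": 28_000,
}


def share_percent(duration_us: float, frame_us: float = FRAME_US) -> float:
    """Fraction of one frame spent in a region, as a percentage."""
    return 100.0 * duration_us / frame_us


for region, us in P50_US.items():
    print(f"{region:20s} {us:>6d} us  {share_percent(us):6.2f}%")
```

The four eager-op candidates together account for well under 1% of the frame, which is why the message treats the UNet as the only optimization surface.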

…mited roofline

- Add --config flag to profile_ncu.py so it targets profile_nsys.py with the cached SDXL-Turbo fp8 engine instead of single.py (kohaku-v2.1)
- Add Nsight Compute 2026.1.1 as first candidate in ncu path resolution
- Sanitize output filename (remove colon from config-based target label)
- Add docs/profiling/unet_ncu_roofline_2026-05-24.md: ncu roofline analysis of UNet fp8 GEMMs at 512x512; all 5 instances wave-limited (0.2-0.4 SM waves on 128-SM RTX 4090); only actionable lever is batch size (engine built max_batch=4)

Extends unet_ncu_roofline_2026-05-24.md with a Verdict section explaining:
- 99% temporal GPU utilization vs ~15% compute SOL are different axes; both are true simultaneously (always busy in time, mostly empty in space per kernel)
- batch_size = denoising_steps * frame_buffer_size; correct value for batch=4 is frame_buffer_size=2 (not 4, which would exceed max_batch)
- TD loop is 1-in-1-out; frame batching requires loop + img2img rework + adds input latency; not pursued for the live interactive stream
- Corrects the existing "without extra latency" claim and the wrong frame_buffer_size=4 value in "What to do next"
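The batch-size correction can be checked with the formula from the Verdict section. The value `denoising_steps = 2` is an inference (it is what makes `frame_buffer_size = 2` yield a batch of 4) and is not stated explicitly in the message; `MAX_BATCH = 4` is the engine limit quoted in the roofline commit.

```python
MAX_BATCH = 4          # engine built with max_batch=4
DENOISING_STEPS = 2    # assumed: implied by "batch=4 <-> frame_buffer_size=2"


def batch_size(denoising_steps: int, frame_buffer_size: int) -> int:
    """batch_size = denoising_steps * frame_buffer_size, per the Verdict section."""
    return denoising_steps * frame_buffer_size


def fits_engine(denoising_steps: int, frame_buffer_size: int, max_batch: int = MAX_BATCH) -> bool:
    return batch_size(denoising_steps, frame_buffer_size) <= max_batch


for fbs in (1, 2, 4):
    b = batch_size(DENOISING_STEPS, fbs)
    print(f"frame_buffer_size={fbs}: batch={b}, fits={fits_engine(DENOISING_STEPS, fbs)}")
```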

Add three-signal IPC health tracker to StreamDiffusionWrapper:
- _ipc_consecutive_failures / _ipc_barrier_skip_count / _ipc_graphs_degraded counters updated per-frame in postprocess_image (zero hot-path cost: counters only)
- get_ipc_health_status() public method for the 1 Hz polling loop; returns 'ok', 'ok/graph-fallback', 'FAILED(N)', 'barrier-skip(N)', 'not-init', 'disabled'
- graph-fallback detection via getattr(exporter, '_graphs_disabled', False), a defensive read that survives vendored-code re-sync without modification

TD-side counterparts (td_main.py OSCReporter.send_ipc_health, td_manager.py 1 Hz console emit + OSC /stream-info/ipc-health, oscin1_callbacks ipc_health row) are gitignored local dev copies; propagate via .tox re-export workflow.
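One way the three signals could map onto the listed status strings is sketched below. The status strings and the `getattr(..., '_graphs_disabled', False)` read come from the message; the function name, its parameters, and especially the precedence order between the states are assumptions, since the message does not specify them.

```python
def ipc_health_status(enabled, exporter, consecutive_failures, barrier_skip_count):
    """Hypothetical mapping from counters to the status strings listed above."""
    if not enabled:
        return "disabled"
    if exporter is None:
        return "not-init"
    if consecutive_failures > 0:
        return f"FAILED({consecutive_failures})"
    if barrier_skip_count > 0:
        return f"barrier-skip({barrier_skip_count})"
    # Defensive read: a missing attribute is treated as "graphs still active".
    if getattr(exporter, "_graphs_disabled", False):
        return "ok/graph-fallback"
    return "ok"


class StubExporter:
    """Stand-in for the real exporter, used only to exercise the sketch."""

    def __init__(self, graphs_disabled=False):
        self._graphs_disabled = graphs_disabled


print(ipc_health_status(True, StubExporter(), 0, 0))
print(ipc_health_status(True, StubExporter(graphs_disabled=True), 0, 0))
print(ipc_health_status(True, StubExporter(), 3, 0))
```

Because the function only reads counters, it is cheap enough for a 1 Hz polling loop and adds nothing to the per-frame path.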

Thread debug_mode through the wrapper construction chain so the per-frame IPC health counters are only active when par.Debugmode is on: - wrapper.py: add debug_mode param, gate counter updates behind self.debug_mode - config.py: map debug_mode in _extract_wrapper_params (picks up the override kwarg passed by td_manager via create_wrapper_from_config(..., debug_mode=...)) Signal flow: par.Debugmode → YAML debug_mode → manager.__init__(debug_mode) → create_wrapper_from_config(config, debug_mode=self.debug_mode) → wrapper.debug_mode In production (debug_mode=False): per-frame block skipped entirely; console/OSC emit also suppressed by the existing td_manager gate.

…link v1.8.1

- Create src/streamdiffusion/_patches/ package grouping diffusers monkey-patches (diffusers_kvo_patch, hf_tracing_patches); update import sites in __init__.py and unet_unified_export.py
- Delete _compat/__init__.py, _compat/diffusers_kvo_patch.py, _compat/cuda_ipc/ subtree (15 files), and top-level _hf_tracing_patches.py; _compat/td_exporter/ preserved pending Phase B live verification
- Declare cuda-link as an honest git-direct dependency in setup.py (cuda_ipc extra); deps regex confirmed to parse the @ git+ reference correctly
- Mark cuda_link_upgrade_v1.7.2.md as superseded; update _compat path references in cuda-perf-plan and controlnet-ipc-consumer docs to pip cuda_link package

Smoke-tested: import streamdiffusion applies kvo patch via _patches (_PATCHED=True); idempotent apply() confirmed; zero residual _compat imports in src/

forkni and others added 30 commits May 16, 2026 08:36

- chore: gitignore local cgw tooling; add PR audit doc (5c5d5a6)
- feat: add FP8 QDQ finite-scale gate and fused-MHA layer count (275f293)
- feat: add quality regression harness with FP16-TRT goldens and SSIM/LPIPS metrics (90091d5)
- feat: port varshith15 kvo_cache patch onto diffusers 0.38.0 via runtime monkey-patch (d1763bb)
- fix: replace emoji chars in SDXL ONNX size warning to avoid cp1252 encode error on Windows (62548fb)
- feat: seed quality-harness goldens, manifest, thresholds; fix FP8 CFG batch mismatch in calibration (67c74c2)
- chore: stage pre-existing formatter diffs (quote/whitespace normalisation) (4b4aaf7)
- fix: kvo_cache patch breaks ControlNet ONNX export — sentinel for backward-compat return arity (1a8065f)
- feat: add CUDA IPC output direction via cuda-link (SD-to-TD zero-copy GPU transport) (4c2a742)
- feat: add CUDA IPC input direction via cuda-link (TD->SD zero-copy GPU transport) (72dc7cc)
- docs: add CUDA IPC input direction plan with next-session log review handoff (52c4a68)
- fix: use relative imports in vendored _compat/cuda_ipc (CUDARuntimeTypes missing in SD venv) (eecb9f5)
- docs: add plans for CUDARuntimeTypes fix and zero-copy GPU input (02911e5)
- docs: add plan for ControlNet zero-copy GPU input (59f2caa)
- chore: update cuda-link _compat vendor mirrors to v1.7.2 (5fbc04a)

forkni added 6 commits June 1, 2026 21:27

- chore: commit Phase A file deletions missing from bcc35fb (_compat/, _plans/ superseded files) (b3dbacd)
- docs: add ADR-0001 — cuda-link as external pip dependency, not vendored (8167eb2)
- chore: remove redundant _compat/td_exporter glue layer (now fully external) (609d8f8)
- style: normalize profile_ncu.py arg-list formatting (da09e3f)
- docs: add clean-shutdown-watchdog and copy-sdtd-code-crash-fix plans (c201a5f)

forkni changed the title ~~feat: CUDA IPC zero-copy GPU transport (TD↔SD input, output, ControlNet)~~ feat: CUDA IPC zero-copy GPU transport via external cuda-link dependency (TD↔SD) Jun 2, 2026

forkni wants to merge 36 commits into SDTD_031_dev from feat/cuda-ipc-output
