Could FlashRT's execution unit move from "single-graph submit" to "interleaved subgraphs"? Looking for thoughts on landing paradigms #32

gugudeshubao · 2026-05-20T07:26:18Z

gugudeshubao
May 20, 2026
Collaborator

Motivation

Real embodied edge devices (Orin / Thor and the like) never run a single model — they run a group of heterogeneous models with different periods and different bottlenecks, all live at the same time:

Model	Main ops	Bottleneck	Typical cadence
ASR	streaming decoder	decode (small ops + small batch)	streaming / event-triggered
VLA (Pi0.5)	ViT + LLM + action head	ViT/LLM compute-bound; action memory-bound	5–10 Hz
BEV	CNN + neck + sparse attn	CNN compute; attn memory	10–30 Hz

If each model runs independently as "one forward = one graph submit," the timeline often looks like: model A is in a memory-bound phase with SMs largely idle, while model B's compute-bound phase is sitting in the queue waiting. The vLLM / SGLang style of continuous batching doesn't help here — it's LLM-only and intra-model.

The mental-model shift I've been considering

Lift FlashRT's execution trigger unit from "one forward of one model" up to interleaved submission of subgraphs from a group of periodic / triggered streams.

Periodic execution ≡ event-triggered execution (a trigger is just an irregular period); the scheduler can treat them uniformly
The subgraph (not the whole graph) becomes the minimal scheduling unit, and can be interleaved across models
Different bottlenecks overlap on the timeline: BEV.CNN's compute peak fills the memory-bound gap of VLA.action
Side effect: subgraphs across models with matching shape/dtype can be aggregated naturally (e.g. the small GEMMs that show up in multiple places) → effective batch is no longer pinned at 1 by the per-model bs=1 ceiling
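The aggregation side effect can be made concrete with a small sketch. It is only an illustration of the idea above, not FlashRT code: `PendingOp`, `group_by_signature`, and the example shapes are all hypothetical names and numbers. It buckets ready subgraph invocations by (op, shape, dtype) so that each bucket is a candidate for one batched launch. Since different models carry different weights, a real implementation would presumably need a grouped or batched GEMM rather than a single larger GEMM.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingOp:
    """One ready-to-run subgraph invocation (hypothetical descriptor)."""
    model: str    # owning model, e.g. "VLA"
    name: str     # subgraph name inside that model
    op: str       # operator kind, e.g. "gemm"
    shape: tuple  # per-invocation operand shape, here (M, K, N)
    dtype: str    # e.g. "int8"


def group_by_signature(pending):
    """Bucket pending ops by (op, shape, dtype); each bucket is a batch candidate."""
    buckets = defaultdict(list)
    for p in pending:
        buckets[(p.op, p.shape, p.dtype)].append(p)
    return dict(buckets)


# Made-up example: two models happen to need the same small GEMM.
pending = [
    PendingOp("VLA", "action.gemm", "gemm", (1, 1024, 1024), "int8"),
    PendingOp("ASR", "dec.gemm", "gemm", (1, 1024, 1024), "int8"),
    PendingOp("BEV", "neck.gemm", "gemm", (1, 256, 256), "fp16"),
]
groups = group_by_signature(pending)
for sig, ops in groups.items():
    members = [f"{o.model}.{o.name}" for o in ops]
    print(sig, "-> effective batch", len(ops), members)
```

In this toy input the VLA and ASR GEMMs share a signature, so the effective batch for that bucket is 2 even though each model runs at bs=1.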

A one-second sketch (Orin AGX, hypothetical triple):

t=0   ms ┃ VLA.ViT      ━━━━━━━━━━ (compute)
         ┃ BEV.CNN          ━━━━━━━━━━━━━
         ┃ ASR.dec ━━ (fills idle SMs)
t=50  ms ┃ VLA.LLM.prefill ━━━━━━━━
         ┃ VLA.action      ━ (memory-bound, gap left open)
t=100 ms ┃ VLA.LLM.decode ━━ ASR.dec ━━ BEV.CNN ━━  (cross-model batched GEMM)

At this point the direction feels clear to me, but I'm not confident on the landing paradigm — which is what I'd like to discuss.

A few candidate routes I've thought of, none of which I'm sure about

1. Subgraphs + multiple streams + static period table: pre-arrange time slots, launch by the table at runtime. Easy SLO reasoning, but triggered streams (ASR) don't fit cleanly.
2. Subgraphs + priority / deadline scheduler: dynamic preemption, good fit for triggered streams, but might fight with the kernel-selection logic in the current INT8 fast path.
3. CUDA Graph nesting / sub-graph capture: stitch the cross-model subgraphs into a single big graph and replay it periodically. Low replay overhead but inflexible, and I'm not sure how friendly capture is on Orin in practice.
4. Coroutine / actor model: each model = a coroutine that yields to a shared scheduler; the scheduler handles cross-model aggregation and multi-stream dispatch. Most expressive, also most invasive.
5. Thinner option: FlashRT only exposes a subgraph-level launch API (no scheduling); an orchestrator on top owns the scheduling complexity. Keeps FlashRT's current scope, pushes complexity outward.
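To make the first two routes easier to compare, here is a toy CPU-only simulation, assuming a single executor and invented costs. It expands a static period table into jobs, adds one triggered job, and picks the earliest deadline at each subgraph boundary (non-preemptive EDF). `Job`, `periodic_jobs`, `triggered_job`, and `run_edf` are hypothetical names; nothing here touches CUDA or FlashRT.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    deadline_ms: float                       # only field used for ordering
    release_ms: float = field(compare=False)
    name: str = field(compare=False)
    cost_ms: float = field(compare=False)


def periodic_jobs(name, period_ms, cost_ms, horizon_ms):
    """Expand one period-table entry; each job's deadline is the next release."""
    count = int(horizon_ms // period_ms)
    return [Job((k + 1) * period_ms, k * period_ms, name, cost_ms)
            for k in range(count)]


def triggered_job(name, at_ms, budget_ms, cost_ms):
    """A trigger is treated as a job released at an irregular time."""
    return Job(at_ms + budget_ms, at_ms, name, cost_ms)


def run_edf(jobs):
    """Non-preemptive EDF on one executor: a started job always runs to completion."""
    pending = sorted(jobs, key=lambda j: j.release_ms)
    ready, trace, now, i = [], [], 0.0, 0
    while i < len(pending) or ready:
        # Admit everything that has been released by now.
        while i < len(pending) and pending[i].release_ms <= now:
            heapq.heappush(ready, pending[i])
            i += 1
        if not ready:
            now = pending[i].release_ms  # idle until the next release
            continue
        job = heapq.heappop(ready)
        now += job.cost_ms
        trace.append((job.name, now, now <= job.deadline_ms))
    return trace


# Invented costs and periods, 200 ms horizon.
jobs = (periodic_jobs("VLA.ViT", 100, 30, 200)
        + periodic_jobs("BEV.CNN", 50, 15, 200)
        + [triggered_job("ASR.dec", at_ms=20, budget_ms=40, cost_ms=5)])
trace = run_edf(jobs)
for name, finish_ms, met in trace:
    print(f"{name:8s} finished at {finish_ms:6.1f} ms, deadline met: {met}")
```

With these numbers the ASR job waits behind the 30 ms VLA job and finishes at 50 ms, inside its 60 ms deadline. Shrinking `budget_ms` to 20 makes it miss, which is one way to see why the scheduling unit needs to be smaller than a whole forward.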

Mostly just thinking out loud here — curious whether this framing resonates with you, and happy to kick the idea around in either direction.

LiangSu8899 · 2026-05-20T19:30:49Z

LiangSu8899
May 20, 2026
Maintainer

This is a very strong direction, and I share a similar motivation here. I also do not think the final landing paradigm is fully clear yet.

For edge devices, consumer GPUs, and small-batch realtime inference in general, I think of this as scenario-level inference rather than single-model inference. A real deployment is almost never just one model. It is usually a group of heterogeneous models with different periods, deadlines, inputs, and bottlenecks running at the same time.

My current view is that the first required foundation is to get a very low-latency baseline for each individual model. That is the main focus of FlashRT right now: supporting more model types, kernels, precisions, and execution patterns, while keeping the dependency stack as thin as possible. The goal is to make the kernel/runtime layer reusable enough to support more systematic realtime serving later.

I agree that the next step should probably move from “one full model forward” to a more flexible subgraph-level execution model. However, I would be cautious about putting a full scheduler inside FlashRT too early, because the combination space grows very quickly:

different hardware: Orin, Thor, 4090/5090, L40S, etc.
different model groups
different execution cadences
event-triggered vs periodic workloads
possible cross-device execution
larger future models that may not fit on a single GPU (for example, pi0.7 uses a 14B BAGEL model for inference)

I think the first four directions you mentioned could all become valid future routes, and option 5 looks like the clearest next step to me: keep FlashRT thin first and expose the right execution primitives. This is also very relevant to a question I was asked yesterday about whether static CUDA Graphs may lose too much flexibility in real deployment. I think this is exactly where the raw launch vs graph replay tradeoff becomes important.

Because of that, my current preference is to start with a thinner and more adaptive design.

For a first version, I think FlashRT could focus on three things (in line with your option 5):

1. Split each model into named subgraphs / stages.
2. Let each subgraph support both raw launch and CUDA Graph replay.
3. Expose metadata to an external orchestrator, such as latency estimates, bottleneck type, shape/dtype signature, capture support, dependencies, workspace requirements, etc.
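As a rough illustration of what such per-subgraph metadata could look like from the orchestrator's side, here is a minimal sketch. Every name in it (`SubgraphInfo`, `Bottleneck`, `pick_launch_mode`, `dependency_order`) and every number is an assumption made for the example; FlashRT does not expose this interface today.

```python
from dataclasses import dataclass
from enum import Enum
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Bottleneck(Enum):
    COMPUTE = "compute"
    MEMORY = "memory"


@dataclass(frozen=True)
class SubgraphInfo:
    """Hypothetical metadata a runtime could publish per named subgraph."""
    name: str
    bottleneck: Bottleneck
    est_latency_ms: float
    signature: tuple       # (input shape, dtype)
    capturable: bool       # whether CUDA Graph capture is supported
    workspace_bytes: int
    deps: tuple = ()       # names of subgraphs that must finish first


def pick_launch_mode(info, shapes_static):
    """Prefer graph replay only when capture is supported and shapes are fixed."""
    if info.capturable and shapes_static:
        return "graph_replay"
    return "raw_launch"


def dependency_order(infos):
    """Return subgraph names ordered so that dependencies come first."""
    graph = {i.name: set(i.deps) for i in infos}
    return list(TopologicalSorter(graph).static_order())


# Invented values for a three-stage VLA pipeline.
vla = [
    SubgraphInfo("VLA.ViT", Bottleneck.COMPUTE, 12.0,
                 ((1, 3, 224, 224), "int8"), True, 64 << 20),
    SubgraphInfo("VLA.LLM", Bottleneck.COMPUTE, 30.0,
                 ((1, 512), "int8"), False, 256 << 20, deps=("VLA.ViT",)),
    SubgraphInfo("VLA.action", Bottleneck.MEMORY, 4.0,
                 ((1, 32), "fp16"), True, 8 << 20, deps=("VLA.LLM",)),
]
order = dependency_order(vla)
modes = {i.name: pick_launch_mode(i, shapes_static=True) for i in vla}
print(order)
print(modes)
```

Keeping the launch-mode decision in the orchestrator rather than in the descriptor is a design choice of this sketch: the runtime only reports whether capture is possible, and the caller decides whether replay is worth the loss of flexibility.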

This keeps FlashRT lightweight while still leaving room for future scheduling. An external orchestrator can first experiment with static period tables, event-triggered streams, deadline-aware scheduling, and cross-model batching outside the core runtime. Once some patterns become stable, they can be moved back into FlashRT.

Concretely, I was also considering opening a separate beta serving directory for this kind of experiment. It could include several experimental realtime inference scenarios, and the multi-model setup you described would be a very good candidate.

I think after we have a few small demos on Orin / Thor / 40xx-class hardware, the serving spec and scheduler design will become much clearer.

Thanks again for bringing up this idea. It is very inspiring, and it is also close to the core direction I want FlashRT to move toward. If you would like to discuss this in more detail, feel free to email me — I would be happy to set up a call or conference sometime.
