Could FlashRT's execution unit move from "single-graph submit" to "interleaved subgraphs"? Looking for thoughts on landing paradigms #32
Replies: 1 comment
This is a very strong direction, and I share a similar motivation here. I also do not think the final landing paradigm is fully clear yet.
For edge devices, consumer GPUs, and small-batch realtime inference in general, I think of this as scenario-level inference rather than single-model inference. A real deployment is almost never just one model. It is usually a group of heterogeneous models with different periods, deadlines, inputs, and bottlenecks running at the same time.
My current view is that the first required foundation is a very low-latency baseline for each individual model. That is the main focus of FlashRT right now: supporting more model types, kernels, precisions, and execution patterns, while keeping the dependency stack as thin as possible. The goal is to make the kernel/runtime layer reusable enough to support more systematic realtime serving later.
I agree that the next step should probably move from “one full model forward” to a more flexible subgraph-level execution model. However, I would be cautious about putting a full scheduler inside FlashRT too early, because the combination space grows very quickly:
I think the first four directions you mentioned could all become valid future routes, and option 5 looks like the clearest next step to me: keep FlashRT thin first and expose the right execution primitives.
This is also very relevant to a question I was asked yesterday about whether static CUDA Graphs may lose too much flexibility in real deployment. I think this is exactly where the tradeoff between raw launch and graph replay becomes important.
Because of that, my current preference is to start with a thinner and more adaptive design. For a first version, I think FlashRT could focus on three things (in line with your option 5):
This keeps FlashRT lightweight while still leaving room for future scheduling. An external orchestrator can first experiment with static period tables, event-triggered streams, deadline-aware scheduling, and cross-model batching outside the core runtime. Once some patterns become stable, they can be moved back into FlashRT.
Concretely, I was also considering opening a separate beta serving directory for this kind of experiment. It could include several experimental realtime inference scenarios, and the multi-model setup you described would be a very good candidate. I think that after we have a few small demos on Orin / Thor / 40xx-class hardware, the serving spec and scheduler design will become much clearer.
Thanks again for bringing up this idea. It is very inspiring, and it is also close to the core direction I want FlashRT to move toward. If you would like to discuss this in more detail, feel free to email me. I would be happy to set up a call sometime.
Motivation
Real embodied edge devices (Orin / Thor and the like) never run a single model — they run a group of heterogeneous models with different periods and different bottlenecks, all live at the same time:
If each model runs independently as "one forward = one graph submit," the timeline often looks like: model A is in a memory-bound phase with SMs largely idle, while model B's compute-bound phase is sitting in the queue waiting. The vLLM / SGLang style of continuous batching doesn't help here — it's LLM-only and intra-model.
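To make the idle-SM picture concrete, here is a toy back-of-the-envelope sketch. Everything in it is an assumption for illustration: the two models, their phase durations, and the idealization that one memory-bound phase and one compute-bound phase from different models can fully overlap (real kernels contend for shared resources, so actual gains would be smaller). It compares the fully serial timeline with a simple lower bound on what interleaving could reach.

```python
# Toy model: every phase is either memory-bound ("mem") or compute-bound ("compute").
# All durations are made-up numbers in milliseconds, not measurements.
model_a = [("mem", 6.0), ("compute", 2.0)]
model_b = [("compute", 5.0), ("mem", 1.0)]


def serial_ms(models):
    """One forward = one graph submit: every phase of every model runs back to back."""
    return sum(ms for phases in models for _, ms in phases)


def overlap_lower_bound_ms(models):
    """Lower bound on the makespan if at most one mem phase and one compute
    phase may run at the same time and each model's phases stay in order."""
    mem = sum(ms for phases in models for kind, ms in phases if kind == "mem")
    compute = sum(ms for phases in models for kind, ms in phases if kind == "compute")
    longest_chain = max(sum(ms for _, ms in phases) for phases in models)
    return max(mem, compute, longest_chain)


if __name__ == "__main__":
    models = [model_a, model_b]
    print("serial:", serial_ms(models), "ms")
    print("idealized interleaved bound:", overlap_lower_bound_ms(models), "ms")
```

The gap between the two numbers is the headroom that per-model submission leaves unused in this toy setup; it says nothing about how much of it a real runtime could recover.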
The mental-model shift I've been considering
Lift FlashRT's execution trigger unit from "one forward of one model" up to interleaved submission of subgraphs from a group of periodic / triggered streams.
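As a rough illustration of what "interleaved submission of subgraphs" could mean, here is a minimal simulation sketch. It is not FlashRT code: the `Stream` and `Job` types, the stream names, the periods, and the per-subgraph costs are all hypothetical, and the policy (earliest deadline first, non-preemptive at subgraph granularity, deadline equal to the next release) is just one possible choice.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    name: str            # hypothetical model name
    period_ms: float     # a new forward is released every period
    subgraph_ms: list    # assumed cost of each subgraph in one forward


@dataclass
class Job:
    stream: Stream
    release: float
    deadline: float
    next_sg: int = 0     # index of the next subgraph to submit


def simulate(streams, horizon_ms):
    """Return [(start_ms, stream_name, subgraph_index)] in submit order."""
    pending = [
        Job(s, r * s.period_ms, (r + 1) * s.period_ms)
        for s in streams
        for r in range(int(horizon_ms // s.period_ms))
    ]
    t, timeline = 0.0, []
    while pending:
        ready = [j for j in pending if j.release <= t]
        if not ready:
            # Nothing released yet: idle until the next release time.
            t = min(j.release for j in pending)
            continue
        job = min(ready, key=lambda j: j.deadline)  # earliest deadline first
        timeline.append((t, job.stream.name, job.next_sg))
        t += job.stream.subgraph_ms[job.next_sg]
        job.next_sg += 1
        if job.next_sg == len(job.stream.subgraph_ms):
            pending = [j for j in pending if j is not job]
    return timeline


if __name__ == "__main__":
    streams = [
        Stream("policy", 100.0, [15.0, 15.0, 15.0]),  # slow model, 3 subgraphs
        Stream("detector", 20.0, [5.0]),              # fast model, 1 subgraph
    ]
    for start, name, sg in simulate(streams, 100.0):
        print(f"{start:6.1f} ms  {name}[{sg}]")
```

Because the slow model is split into subgraphs, the fast stream can slip in between them instead of waiting for a whole forward to finish. With a single monolithic submit per forward, the same policy would have no such insertion points.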
bs=1 ceiling
A one-second sketch (Orin AGX, hypothetical triple):
At this point the direction feels clear to me, but I'm not confident on the landing paradigm — which is what I'd like to discuss.
A few candidate routes I've thought of, none of which I'm sure about
Mostly just thinking out loud here — curious whether this framing resonates with you, and happy to kick the idea around in either direction.