Summary
A qd.checkpoint(yield_on=flag) block containing a nested qd.graph_do_while triggers a non-graph fallback:

    [I] graph=True: a qd.checkpoint() block containing a nested qd.graph_do_while is not yet
    supported on the CUDA graph path; falling back to the non-graph launch.

The results are still correct, but the CUDA graph performance is lost (10 ms/step instead of <1 ms).
Use case
IPC Newton solver with checkpoint-based overflow handling. The natural structure is:

    @qd.kernel(graph=True)
    def step(self):
        while qd.graph_do_while(newton_cond):  # Newton outer loop
            # assembly writes triplets...
            with qd.checkpoint(yield_on=self.triplet_overflow):
                # sort + reduce
                self.sort_radix()
                ...
                # PCG inner loop (nested graph_do_while)
                while qd.graph_do_while(self.pcg_cond):
                    self.pcg_iteration()

The checkpoint needs to wrap everything after assembly (so overflow can yield before sort touches invalid memory). But PCG is a nested graph_do_while inside the checkpoint body.
cgq (the C++ reference implementation) supports this pattern: a CheckpointScope containing a PCG create_solve_graph (which is a conditional WHILE subgraph). See sim_engine_pipeline.cu lines 187-352, "Checkpoint 1b: Assemble + PCG".
Root cause
graph_manager.cpp lines 796-813:

    // Unsupported combined case: a `qd.checkpoint()` block whose body contains a nested
    // `qd.graph_do_while` (one cp_id spanning more than one loop level). build_level's per-level IF
    // grouping assumes a checkpoint's tasks are flat within a single level, so fall back to the
    // non-graph launch path (correct results, just no on-device gating) rather than build a wrong graph.

The check rejects any checkpoint where tasks have different graph_do_while_level_id values.
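To make the condition concrete, here is a minimal pure-Python model of such a check. It is only a sketch and not the code in graph_manager.cpp: the `Task` record, the function name `checkpoints_spanning_levels`, the task names, and the level numbering (0 for the kernel top level, 1 for the nested loop) are all assumptions made for illustration. Only the field names `cp_id` and `graph_do_while_level_id` come from the comment quoted above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for one offloaded task of a graph kernel."""
    name: str
    cp_id: Optional[int]          # checkpoint id, or None if outside any checkpoint
    graph_do_while_level_id: int  # loop level the task belongs to (assumed numbering)


def checkpoints_spanning_levels(tasks):
    """Return the sorted cp_ids whose tasks sit on more than one loop level."""
    levels = defaultdict(set)
    for task in tasks:
        if task.cp_id is not None:
            levels[task.cp_id].add(task.graph_do_while_level_id)
    return sorted(cp for cp, lv in levels.items() if len(lv) > 1)


# Task layout loosely modelled on the minimal reproducer in this issue:
# checkpoint 0 holds both top-level tasks and the tasks of the nested loop.
repro_tasks = [
    Task("init_data", None, 0),
    Task("add_one", 0, 0),
    Task("reset_cond", 0, 0),
    Task("scale", 0, 1),
    Task("count_iters", 0, 1),
]

# A non-empty result corresponds to the case that currently falls back.
needs_fallback = bool(checkpoints_spanning_levels(repro_tasks))
```

In this model, a checkpoint whose tasks all share one level passes, while the reproducer's checkpoint is flagged because its tasks cover two levels.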
Minimal reproducer

    import numpy as np
    import quadrants as qd

    qd.init(qd.cuda)

    @qd.data_oriented
    class Repro:
        def __init__(self):
            self.data = qd.ndarray(qd.f64, shape=(64,))
            self.cond = qd.ndarray(qd.i32, shape=())
            self.overflow = qd.ndarray(qd.i32, shape=())
            self.iter_count = qd.ndarray(qd.i32, shape=())

        @qd.kernel(graph=True)
        def run(self):
            for i in range(64):
                self.data[i] = qd.f64(i)
            with qd.checkpoint(yield_on=self.overflow):
                for i in range(64):
                    self.data[i] = self.data[i] + 1.0
                for _ in range(1):
                    self.cond[()] = 1
                    self.iter_count[()] = 0
                while qd.graph_do_while(self.cond):
                    for i in range(64):
                        self.data[i] = self.data[i] * 1.001
                    for _ in range(1):
                        self.iter_count[()] = self.iter_count[()] + 1
                        if self.iter_count[()] >= 3:
                            self.cond[()] = 0

    r = Repro()
    r.overflow.from_numpy(np.array(0, dtype=np.int32))
    r.run()  # triggers fallback warning

Expected behavior
The CUDA graph should be built with the nested graph_do_while as a conditional WHILE node inside the checkpoint's conditional IF body (matching cgq's architecture).
Current workaround
Move the qd.graph_do_while outside the qd.checkpoint block. This is functionally correct but means the checkpoint cannot protect the nested loop from running on stale/overflow data.
Environment
- quadrants branch hp/qipc-integration, commit b3ba47e6a
- CUDA SM 12.0 (Blackwell), Python 3.13.9, Windows 11