graph=True: checkpoint containing nested graph_do_while should not fall back to non-graph launch #750

Description

@alanray-tech

Summary

qd.checkpoint(yield_on=flag) containing a nested qd.graph_do_while triggers a non-graph fallback:

[I] graph=True: a qd.checkpoint() block containing a nested qd.graph_do_while is not yet
supported on the CUDA graph path; falling back to the non-graph launch.

The fallback produces correct results but loses the CUDA graph speedup (10 ms/step instead of <1 ms).

Use case

IPC Newton solver with checkpoint-based overflow handling. The natural structure is:

@qd.kernel(graph=True)
def step(self):
    while qd.graph_do_while(newton_cond):  # Newton outer loop
        # assembly writes triplets...

        with qd.checkpoint(yield_on=self.triplet_overflow):
            # sort + reduce
            self.sort_radix()
            ...
            # PCG inner loop (nested graph_do_while)
            while qd.graph_do_while(self.pcg_cond):
                self.pcg_iteration()

The checkpoint needs to wrap everything after assembly, so that an overflow can yield before the sort touches invalid memory. PCG, however, is a nested graph_do_while inside the checkpoint body.

cgq (the C++ reference implementation) supports this pattern: a CheckpointScope containing a PCG create_solve_graph (which is a conditional WHILE subgraph). See sim_engine_pipeline.cu, lines 187-352, "Checkpoint 1b: Assemble + PCG".

Root cause

graph_manager.cpp, lines 796-813:

// Unsupported combined case: a `qd.checkpoint()` block whose body contains a nested
// `qd.graph_do_while` (one cp_id spanning more than one loop level). build_level's per-level IF
// grouping assumes a checkpoint's tasks are flat within a single level, so fall back to the
// non-graph launch path (correct results, just no on-device gating) rather than build a wrong graph.

The check rejects any checkpoint where tasks have different graph_do_while_level_id values.
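
The effect of that check can be modeled in a few lines of Python. This is a simplified sketch, not the code in graph_manager.cpp: the `Task` record and the `needs_non_graph_fallback` helper are hypothetical names, and only the `cp_id` and `graph_do_while_level_id` fields come from the description above. A `cp_id` of `None` is assumed to mean "not inside any checkpoint", and level 0 is assumed to be the top level.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for one offloaded task of a graph=True kernel."""
    name: str
    cp_id: Optional[int]          # owning checkpoint, None if outside any (assumed)
    graph_do_while_level_id: int  # loop level of the task, 0 = top level (assumed)


def needs_non_graph_fallback(tasks):
    """Return True if any checkpoint's tasks span more than one loop level."""
    levels_by_cp = defaultdict(set)
    for task in tasks:
        if task.cp_id is not None:
            levels_by_cp[task.cp_id].add(task.graph_do_while_level_id)
    return any(len(levels) > 1 for levels in levels_by_cp.values())


# Checkpoint 0 holds a flat task plus a task inside a nested loop: rejected.
nested_tasks = [
    Task("init", cp_id=None, graph_do_while_level_id=0),
    Task("add_one", cp_id=0, graph_do_while_level_id=0),
    Task("loop_body", cp_id=0, graph_do_while_level_id=1),
]

# Same tasks with the loop moved out of the checkpoint: accepted.
flat_tasks = [
    Task("init", cp_id=None, graph_do_while_level_id=0),
    Task("add_one", cp_id=0, graph_do_while_level_id=0),
    Task("loop_body", cp_id=None, graph_do_while_level_id=1),
]

print(needs_non_graph_fallback(nested_tasks))
print(needs_non_graph_fallback(flat_tasks))
```

Under these assumptions, any checkpoint that mixes loop levels trips the fallback, regardless of how the nested loop is written.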

Minimal reproducer

import numpy as np
import quadrants as qd
qd.init(qd.cuda)

@qd.data_oriented
class Repro:
    def __init__(self):
        self.data = qd.ndarray(qd.f64, shape=(64,))
        self.cond = qd.ndarray(qd.i32, shape=())
        self.overflow = qd.ndarray(qd.i32, shape=())
        self.iter_count = qd.ndarray(qd.i32, shape=())

    @qd.kernel(graph=True)
    def run(self):
        for i in range(64):
            self.data[i] = qd.f64(i)

        with qd.checkpoint(yield_on=self.overflow):
            for i in range(64):
                self.data[i] = self.data[i] + 1.0

            for _ in range(1):
                self.cond[()] = 1
                self.iter_count[()] = 0
            while qd.graph_do_while(self.cond):
                for i in range(64):
                    self.data[i] = self.data[i] * 1.001
                for _ in range(1):
                    self.iter_count[()] = self.iter_count[()] + 1
                    if self.iter_count[()] >= 3:
                        self.cond[()] = 0

r = Repro()
r.overflow.from_numpy(np.array(0, dtype=np.int32))
r.run()  # triggers fallback warning

Expected behavior

The CUDA graph should be built with the nested graph_do_while as a conditional WHILE node inside the checkpoint's conditional IF body (matching cgq's architecture).
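
As an illustration of the requested nesting, the following sketch writes out the node tree one might expect for a kernel with a top-level task followed by a checkpoint that contains a nested loop. The `Node` class, the `render` helper, and all labels are hypothetical and exist only to show the shape; they are not quadrants or CUDA APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical graph node: the root, a kernel launch, or a conditional body."""
    kind: str   # "GRAPH", "KERNEL", "IF" or "WHILE"
    label: str
    children: list = field(default_factory=list)


def render(node, depth=0):
    """Return the tree as indented lines, one node per line."""
    lines = ["  " * depth + f"{node.kind} {node.label}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# The WHILE node sits inside the checkpoint's IF body, not beside it.
expected = Node("GRAPH", "run", [
    Node("KERNEL", "init data"),
    Node("IF", "checkpoint not yielded", [
        Node("KERNEL", "flat work inside checkpoint"),
        Node("KERNEL", "set loop condition"),
        Node("WHILE", "loop condition", [
            Node("KERNEL", "loop body"),
            Node("KERNEL", "update loop condition"),
        ]),
    ]),
])

print("\n".join(render(expected)))
```

The point of the sketch is only the parent/child relationship: the loop is a child of the IF body, so a yield on the checkpoint would skip the loop as well.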

Current workaround

Move the qd.graph_do_while outside the qd.checkpoint block. This is functionally correct, but the checkpoint can then no longer protect the nested loop from running on stale or overflowed data.

Environment

  • quadrants branch hp/qipc-integration, commit b3ba47e6a
  • CUDA SM 12.0 (Blackwell), Python 3.13.9, Windows 11
