[Graph] Add graph parallel by hughperkins · Pull Request #756 · Genesis-Embodied-AI/quadrants

hughperkins · 2026-06-23T13:58:06Z

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

…raph kernels Lets a @qd.kernel(graph=True) author mark independent sequences of work as concurrent branches so the captured CUDA graph runs them on parallel streams (recovers the PT||EE overlap qipc loses vs cgq). API: `with qd.graph_parallel():` opens a fork/join region whose members are `with qd.branch(name=...):` blocks (name optional, label only). Region body must contain only branch blocks; branches are independent (author-guaranteed race-free) and everything after the region waits for all of them (join). Implementation reuses the existing stream_parallel_group_id tag: qd.branch lowers via begin/end_stream_parallel, so the branch id flows through the existing offload/codegen path to OffloadedTask. The only runtime change is in the CUDA graph builder (build_level): a contiguous run of nonzero-group, non-checkpoint tasks is forked by group id from the region entry, each branch chained, and all tails joined into one cuGraphAddEmptyNode. Single-branch regions degenerate to a plain chain. Validation: graph=True required; branch only inside a region; region body must be branches only; no nesting; graph_do_while structure validator updated to accept regions. Other backends (CPU/AMDGPU/Vulkan/Metal) run branches serially (correct); CUDA graph path runs them concurrently. Design: perso_hugh/doc/qipc/d3_0_graph_parallel_impl.md

…onal branches - Recursive region-body validation: a graph_parallel region may contain branch blocks optionally wrapped in `if qd.static(...)` (qipc ENABLE_EE pattern); graph_do_while validator descends through those ifs too. - tests/python/test_graph_parallel.py: correctness vs serial (2/3/multi-loop branches), single-branch no-join, optional static-if branch, region inside graph_do_while, and compile-time error cases. Node-count assertions on CUDA. - docs: graph.md "Concurrent branches" section + backend table row; streams.md cross-link to graph_parallel for graph kernels.

…oup id A dynamic-bound for-loop (e.g. range(x.shape[0])) lowers to a bound-compute serial task followed by the range_for. The serial task carried group 0, which split a qd.branch()'s contiguous run and defeated the graph builder's fork/join (branches ran serially). Propagate the loop's stream_parallel_group_id onto the flushed pending-serial task in the offloader so the bound task and its range_for stay in the same branch. The existing single-level / single-region frontend restriction guarantees the pending serial never mixes branches. Also make the graph_parallel test node-count assertions relative to the offloaded-task count (each dynamic-bound loop is 2 tasks).

The graph_parallel/branch context managers use @contextmanager. On the old desk8 base misc.py already imported it; rebasing the feature onto main (where misc.py no longer imports contextmanager) lost the import, breaking module import (NameError at load -> stub generation failed).

…ackend section

…t and qd.branch -> qd.graph_parallel The fork/join region context manager is now qd.graph_parallel_context() and each concurrent branch member is qd.graph_parallel(name=...). Updates the public API (misc.py + __all__), the AST detection/build/validation in ast_transformer.py and function_def_transformer.py (region helpers gain a _context suffix; the conceptual "branch" naming is retained for members), tests, user docs (graph.md/streams.md, including the heading anchor), and CUDA graph_manager comments.

…rallel docs - Remove the optional name= parameter from qd.graph_parallel (branch members now take no arguments); simplify the AST detection/build path accordingly. - Drop the "graph-compatible analogue of qd.stream_parallel" lead-in and the "single-branch region lowers to a plain chain" note from the graph_parallel docs. - Refer to the "graph builder" rather than the "CUDA graph builder" in the graph_parallel feature prose (it is graph-abstraction level, CUDA-honoured today).

…rim tables) - Rename the doc's "branch" terminology to "section" (heading, prose, tables, code comments); update the streams.md cross-reference anchor to match. - Stop describing graph_parallel scheduling in terms of parallel streams (we do not expose streams) and drop the qd.stream_parallel analogy from the section. - Trim the comparison table (drop "(parallel streams)"/"(correct)") and the backend-behaviour table (drop the result column, "(graph path)", and the "honoured only by the CUDA graph builder today" note). - Restrictions: drop the redundant "Must be used inside @qd.kernel(graph=True)" bullet and refer to qd.graph_parallel_context instead of a vague "region". - Minor: "falls back to a host-side loop"; use British "behaviour".

github-actions · 2026-06-23T15:03:53Z

Total: 7 file(s) changed, +424 -0 code lines.

…and docs Rename the internal "branch" concept to "parallel section" everywhere it referred to a qd.graph_parallel() block: AST helpers (_is_branch_call/_build_branch_with/_in_branch and _is_branch_with), the CUDA graph_manager `branches` vector, misc.py docstrings, and all related comments. Update graph.md/streams.md to say "parallel section(s)" (heading + anchor included). No behaviour change.

Rename the remaining "branch" wording in test_graph_parallel.py: test function names (e.g. test_graph_parallel_two_branches -> _two_sections) and all docstrings/comments now say "parallel section(s)", matching the source/docs rename.

…test Rename test_graph_parallel_non_section_body_raises -> test_graph_parallel_context_non_graph_parallel_raises so the name says what it checks: a qd.graph_parallel_context() whose body is not a qd.graph_parallel() must raise.

…test Rename test_graph_parallel_section_outside_region_raises -> test_graph_parallel_outside_context_raises: it checks that qd.graph_parallel() used outside a qd.graph_parallel_context() raises.

…ntics

Flip the British spellings introduced for graph_parallel back to American to match the rest of the codebase: behaviour->behavior, honoured->honored, recognised->recognized, behavioural->behavioral across graph.md, streams.md, misc.py docstrings, and the tests.

…ion' Use the construct-named term "qd.graph_parallel section" throughout the docs, docstrings, comments, and tests (dropping the redundant prefix where qd.graph_parallel is already adjacent). Updates the graph.md heading + the streams.md cross-reference anchor accordingly.

…-loop Port the static-loop section generator from hp/graph-parallel-static-branches onto the current naming. A `for ... in qd.static(...)` loop inside a qd.graph_parallel_context unrolls at trace time into one qd.graph_parallel section per iteration (each gets a fresh stream_parallel_group_id), so sections can be forked from a compile-time sequence -- e.g. one per @qd.func member of a @qd.data_oriented list (qipc's per-contact-type assembly pattern). - ast_transformer: region-body validator recurses into `for ... in qd.static(...)` loops (staticness re-checked via get_decorator at every nesting level, so a runtime loop -- even nested under a static one -- is still rejected). - function_def_transformer: graph_do_while structure validator descends into For members so static-loop section bodies are validated like hand-written ones. - tests: static-loop sections (incl. nested, empty range, single section, over-funcs, mixed with if-static, inside graph_do_while) + runtime-loop and non-section-body rejection guards. - docs: "Generating qd.graph_parallel sections from a compile-time sequence".

github-actions · 2026-06-23T17:09:45Z

Total: 7 file(s) changed, +631 -0 code lines.

github-actions · 2026-06-23T17:22:15Z

Diff coverage: 87% · 553 lines, 73 missing

Address the three Cursor-agent CI gates on PR #756: - Feature factorization: move the graph_parallel detection/validation/lowering out of ast_transformer.py into ast_transformers/graph_parallel_transformer.py (GraphParallelTransformer), mirroring checkpoint_transformer.py. build_With now calls the new class directly; function_def_transformer comments updated to point at it. - Line wrapping: re-wrap the graph_parallel_context docstring to 120c and drop a dangling reference to an internal design doc. - Doc quality (graph.md): reword pre-existing graph_do_while content the gate flagged -- replace "lowering" jargon, drop the internal "genesis-world" / hypothetical qd.graph_range_for / MAX_ITER design walk-through phrasing.

github-actions · 2026-06-23T17:59:01Z

Total: 8 file(s) changed, +645 -0 code lines.

github-actions · 2026-06-23T18:11:07Z

Diff coverage: 87% · 560 lines, 73 missing

…face test_api.py asserts the exact set of public qd.* symbols; the renamed context managers were missing from the expected list, failing the test across every platform. Also reflow a misc.py docstring line to 120c.

github-actions · 2026-06-23T20:26:27Z

Total: 9 file(s) changed, +647 -0 code lines.

github-actions · 2026-06-23T21:16:49Z

Diff coverage: 89% · 560 lines, 62 missing

hughperkins · 2026-06-24T13:18:20Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0a7cb9866a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-24T13:21:54Z

+      while (run_end < end && tasks[run_end].graph_do_while_level_id == parent_id && tasks[run_end].checkpoint_id < 0 &&
+             tasks[run_end].stream_parallel_group_id != 0) {
+        run_end++;


Preserve joins between adjacent graph-parallel regions

When two with qd.graph_parallel_context(): regions are written back-to-back, the offloader emits all of their section tasks as one contiguous run with nonzero stream_parallel_group_ids (there is no marker for the region boundary). This loop consumes that entire run and forks every group from the same prev_node, so sections in the second region can run concurrently with sections in the first instead of waiting for the first region's join. Any valid second region that reads data produced by the first will race and produce nondeterministic results on CUDA.

Useful? React with 👍 / 👎.

github-actions · 2026-06-24T13:51:35Z

Total: 9 file(s) changed, +647 -0 code lines.

github-actions · 2026-06-24T14:08:47Z

Diff coverage: 87% · 560 lines, 73 missing

github-actions · 2026-06-24T14:40:45Z

Total: 9 file(s) changed, +647 -0 code lines.

…misc.py Moves the `graph_parallel_context()` / `graph_parallel()` no-op context-manager stubs into their own `quadrants/lang/graph_parallel.py`, mirroring `checkpoint.py`, to keep the 958-line catch-all `lang/misc.py` (imported by 33+ modules) from growing further. Re-exported via `misc.py` so the canonical `qd.graph_parallel_context` / `qd.graph_parallel` import paths are unchanged. Addresses the "Check feature factorization" CI finding on PR #756.

… regions Two qd.graph_parallel_context() regions written back-to-back (no serial work between them) were merged into one fork/join by the CUDA graph builder: the run-extension loop keyed only on stream_parallel_group_id != 0, with no region boundary marker, so the second region's sections forked from the same entry as the first's and could race the first region's writes. Thread a per-kernel graph_parallel_region_id from the AST builder (begin/end_graph_parallel_context) through ForLoopConfig -> FrontendForStmt -> RangeForStmt/StructForStmt -> OffloadedStmt -> OffloadedTask, and require equal region id when extending a fork/join run in build_level so adjacent regions each get their own join. AMD already serializes sections, so only the CUDA builder needed the guard. Adds a regression test.

…misc.py Moves the `graph_parallel_context()` / `graph_parallel()` no-op context-manager stubs into their own `quadrants/lang/graph_parallel.py`, mirroring `checkpoint.py`, to keep the 958-line catch-all `lang/misc.py` (imported by 33+ modules) from growing further. Re-exported via `misc.py` so the canonical `qd.graph_parallel_context` / `qd.graph_parallel` import paths are unchanged. Addresses the "Check feature factorization" CI finding on PR #756.

github-actions · 2026-06-24T15:25:07Z

Total: 21 file(s) changed, +731 -2 code lines.

github-actions · 2026-06-24T16:40:06Z

Diff coverage: 90% · 597 lines, 62 missing

Repacks the graph_parallel_context fork/join block comment in graph_manager.cpp (was wrapped at ~93-99c) and the OffloadedStmt region-id comment in statements.h so comment lines fill toward the project's 120c width instead of the ~80c default. Addresses the "Check line wrapping" CI on PR #756.

github-actions · 2026-06-25T13:07:00Z

Total: 21 file(s) changed, +731 -2 code lines.

github-actions · 2026-06-25T14:03:48Z

Diff coverage: 90% · 600 lines, 62 missing

hughperkins · 2026-06-26T13:04:33Z

(running Genesis unit tests and benchamrks)

github-actions · 2026-06-27T16:35:43Z

Total: 21 file(s) changed, +731 -2 code lines.

github-actions · 2026-06-27T17:34:05Z

Diff coverage: 90% · 597 lines, 62 missing

hughperkins added 8 commits June 23, 2026 06:24

docs(graph): drop stream_parallel lowering note from graph_parallel b…

a95d2b7

…ackend section

hughperkins added 11 commits June 23, 2026 08:08

test(graph): construct-based name for graph_parallel-outside-context …

6592249

…test Rename test_graph_parallel_section_outside_region_raises -> test_graph_parallel_outside_context_raises: it checks that qd.graph_parallel() used outside a qd.graph_parallel_context() raises.

docs/graph: drop CUDA conditional while nodes note from do-while sema…

a7988b3

…ntics

docs/graph: trim intro of graph_parallel_context section

509adfd

style: apply pre-commit (black, clang-format) formatting

478ec70

Merge branch 'main' into hp/graph-parallel-main

16dfba6

fix up doc ci flagged issues

9f35526

hughperkins marked this pull request as ready for review June 24, 2026 13:18

Merge branch 'main' into hp/graph-parallel-main

0a7cb98

chatgpt-codex-connector Bot reviewed Jun 24, 2026

View reviewed changes

address doc CI issues

4c80eb0

hughperkins added 2 commits June 24, 2026 07:50

hughperkins added need-genesis-unit-tests need-genesis-benchmarks labels Jun 26, 2026

Merge branch 'main' into hp/graph-parallel-main

b304f57

Uh oh!

Conversation

hughperkins commented Jun 23, 2026

Brief Summary

Walkthrough

Uh oh!

github-actions Bot commented Jun 23, 2026

Uh oh!

github-actions Bot commented Jun 23, 2026

Uh oh!

github-actions Bot commented Jun 23, 2026

Uh oh!

github-actions Bot commented Jun 23, 2026

Uh oh!

github-actions Bot commented Jun 23, 2026

Uh oh!

github-actions Bot commented Jun 23, 2026

Uh oh!

github-actions Bot commented Jun 23, 2026

Uh oh!

hughperkins commented Jun 24, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

github-actions Bot commented Jun 25, 2026

Uh oh!

hughperkins commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 27, 2026

Uh oh!

github-actions Bot commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant