Skip to content

[Graph] Add graph parallel#756

Open
hughperkins wants to merge 28 commits into
mainfrom
hp/graph-parallel-main
Open

[Graph] Add graph parallel#756
hughperkins wants to merge 28 commits into
mainfrom
hp/graph-parallel-main

Conversation

@hughperkins

Copy link
Copy Markdown
Collaborator

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

…raph kernels

Lets a @qd.kernel(graph=True) author mark independent sequences of work as
concurrent branches so the captured CUDA graph runs them on parallel streams
(recovers the PT||EE overlap qipc loses vs cgq).

API: `with qd.graph_parallel():` opens a fork/join region whose members are
`with qd.branch(name=...):` blocks (name optional, label only). Region body
must contain only branch blocks; branches are independent (author-guaranteed
race-free) and everything after the region waits for all of them (join).

Implementation reuses the existing stream_parallel_group_id tag: qd.branch
lowers via begin/end_stream_parallel, so the branch id flows through the
existing offload/codegen path to OffloadedTask. The only runtime change is in
the CUDA graph builder (build_level): a contiguous run of nonzero-group,
non-checkpoint tasks is forked by group id from the region entry, each branch
chained, and all tails joined into one cuGraphAddEmptyNode. Single-branch
regions degenerate to a plain chain.

Validation: graph=True required; branch only inside a region; region body must
be branches only; no nesting; graph_do_while structure validator updated to
accept regions. Other backends (CPU/AMDGPU/Vulkan/Metal) run branches serially
(correct); CUDA graph path runs them concurrently.

Design: perso_hugh/doc/qipc/d3_0_graph_parallel_impl.md
…onal branches

- Recursive region-body validation: a graph_parallel region may contain branch
  blocks optionally wrapped in `if qd.static(...)` (qipc ENABLE_EE pattern);
  graph_do_while validator descends through those ifs too.
- tests/python/test_graph_parallel.py: correctness vs serial (2/3/multi-loop
  branches), single-branch no-join, optional static-if branch, region inside
  graph_do_while, and compile-time error cases. Node-count assertions on CUDA.
- docs: graph.md "Concurrent branches" section + backend table row; streams.md
  cross-link to graph_parallel for graph kernels.
…oup id

A dynamic-bound for-loop (e.g. range(x.shape[0])) lowers to a bound-compute
serial task followed by the range_for. The serial task carried group 0, which
split a qd.branch()'s contiguous run and defeated the graph builder's fork/join
(branches ran serially). Propagate the loop's stream_parallel_group_id onto the
flushed pending-serial task in the offloader so the bound task and its range_for
stay in the same branch. The existing single-level / single-region frontend
restriction guarantees the pending serial never mixes branches.

Also make the graph_parallel test node-count assertions relative to the
offloaded-task count (each dynamic-bound loop is 2 tasks).
The graph_parallel/branch context managers use @contextmanager. On the
old desk8 base misc.py already imported it; rebasing the feature onto
main (where misc.py no longer imports contextmanager) lost the import,
breaking module import (NameError at load -> stub generation failed).
…t and qd.branch -> qd.graph_parallel

The fork/join region context manager is now qd.graph_parallel_context() and each
concurrent branch member is qd.graph_parallel(name=...). Updates the public API
(misc.py + __all__), the AST detection/build/validation in ast_transformer.py and
function_def_transformer.py (region helpers gain a _context suffix; the conceptual
"branch" naming is retained for members), tests, user docs (graph.md/streams.md,
including the heading anchor), and CUDA graph_manager comments.
…rallel docs

- Remove the optional name= parameter from qd.graph_parallel (branch members now
  take no arguments); simplify the AST detection/build path accordingly.
- Drop the "graph-compatible analogue of qd.stream_parallel" lead-in and the
  "single-branch region lowers to a plain chain" note from the graph_parallel docs.
- Refer to the "graph builder" rather than the "CUDA graph builder" in the
  graph_parallel feature prose (it is graph-abstraction level, CUDA-honoured today).
…rim tables)

- Rename the doc's "branch" terminology to "section" (heading, prose, tables,
  code comments); update the streams.md cross-reference anchor to match.
- Stop describing graph_parallel scheduling in terms of parallel streams (we do
  not expose streams) and drop the qd.stream_parallel analogy from the section.
- Trim the comparison table (drop "(parallel streams)"/"(correct)") and the
  backend-behaviour table (drop the result column, "(graph path)", and the
  "honoured only by the CUDA graph builder today" note).
- Restrictions: drop the redundant "Must be used inside @qd.kernel(graph=True)"
  bullet and refer to qd.graph_parallel_context instead of a vague "region".
- Minor: "falls back to a host-side loop"; use British "behaviour".
@github-actions

Copy link
Copy Markdown

…and docs

Rename the internal "branch" concept to "parallel section" everywhere it
referred to a qd.graph_parallel() block: AST helpers
(_is_branch_call/_build_branch_with/_in_branch and _is_branch_with),
the CUDA graph_manager `branches` vector, misc.py docstrings, and all
related comments. Update graph.md/streams.md to say "parallel section(s)"
(heading + anchor included). No behaviour change.
Rename the remaining "branch" wording in test_graph_parallel.py: test
function names (e.g. test_graph_parallel_two_branches ->
_two_sections) and all docstrings/comments now say "parallel
section(s)", matching the source/docs rename.
…test

Rename test_graph_parallel_non_section_body_raises ->
test_graph_parallel_context_non_graph_parallel_raises so the name says
what it checks: a qd.graph_parallel_context() whose body is not a
qd.graph_parallel() must raise.
…test

Rename test_graph_parallel_section_outside_region_raises ->
test_graph_parallel_outside_context_raises: it checks that
qd.graph_parallel() used outside a qd.graph_parallel_context() raises.
Flip the British spellings introduced for graph_parallel back to
American to match the rest of the codebase: behaviour->behavior,
honoured->honored, recognised->recognized, behavioural->behavioral
across graph.md, streams.md, misc.py docstrings, and the tests.
…ion'

Use the construct-named term "qd.graph_parallel section" throughout the
docs, docstrings, comments, and tests (dropping the redundant prefix
where qd.graph_parallel is already adjacent). Updates the graph.md
heading + the streams.md cross-reference anchor accordingly.
…-loop

Port the static-loop section generator from hp/graph-parallel-static-branches
onto the current naming. A `for ... in qd.static(...)` loop inside a
qd.graph_parallel_context unrolls at trace time into one qd.graph_parallel
section per iteration (each gets a fresh stream_parallel_group_id), so sections
can be forked from a compile-time sequence -- e.g. one per @qd.func member of a
@qd.data_oriented list (qipc's per-contact-type assembly pattern).

- ast_transformer: region-body validator recurses into `for ... in qd.static(...)`
  loops (staticness re-checked via get_decorator at every nesting level, so a
  runtime loop -- even nested under a static one -- is still rejected).
- function_def_transformer: graph_do_while structure validator descends into For
  members so static-loop section bodies are validated like hand-written ones.
- tests: static-loop sections (incl. nested, empty range, single section,
  over-funcs, mixed with if-static, inside graph_do_while) + runtime-loop and
  non-section-body rejection guards.
- docs: "Generating qd.graph_parallel sections from a compile-time sequence".
@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

Address the three Cursor-agent CI gates on PR #756:

- Feature factorization: move the graph_parallel detection/validation/lowering
  out of ast_transformer.py into ast_transformers/graph_parallel_transformer.py
  (GraphParallelTransformer), mirroring checkpoint_transformer.py. build_With now
  calls the new class directly; function_def_transformer comments updated to point
  at it.
- Line wrapping: re-wrap the graph_parallel_context docstring to 120c and drop a
  dangling reference to an internal design doc.
- Doc quality (graph.md): reword pre-existing graph_do_while content the gate
  flagged -- replace "lowering" jargon, drop the internal "genesis-world" /
  hypothetical qd.graph_range_for / MAX_ITER design walk-through phrasing.
@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

…face

test_api.py asserts the exact set of public qd.* symbols; the renamed
context managers were missing from the expected list, failing the test
across every platform. Also reflow a misc.py docstring line to 120c.
@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

@hughperkins hughperkins marked this pull request as ready for review June 24, 2026 13:18
@hughperkins

Copy link
Copy Markdown
Collaborator Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0a7cb9866a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +415 to +417
while (run_end < end && tasks[run_end].graph_do_while_level_id == parent_id && tasks[run_end].checkpoint_id < 0 &&
tasks[run_end].stream_parallel_group_id != 0) {
run_end++;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve joins between adjacent graph-parallel regions

When two with qd.graph_parallel_context(): regions are written back-to-back, the offloader emits all of their section tasks as one contiguous run with nonzero stream_parallel_group_ids (there is no marker for the region boundary). This loop consumes that entire run and forks every group from the same prev_node, so sections in the second region can run concurrently with sections in the first instead of waiting for the first region's join. Any valid second region that reads data produced by the first will race and produce nondeterministic results on CUDA.

Useful? React with 👍 / 👎.

@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

hughperkins added a commit that referenced this pull request Jun 24, 2026
…misc.py

Moves the `graph_parallel_context()` / `graph_parallel()` no-op context-manager stubs into their own
`quadrants/lang/graph_parallel.py`, mirroring `checkpoint.py`, to keep the 958-line catch-all `lang/misc.py`
(imported by 33+ modules) from growing further. Re-exported via `misc.py` so the canonical
`qd.graph_parallel_context` / `qd.graph_parallel` import paths are unchanged.

Addresses the "Check feature factorization" CI finding on PR #756.
… regions

Two qd.graph_parallel_context() regions written back-to-back (no serial work
between them) were merged into one fork/join by the CUDA graph builder: the
run-extension loop keyed only on stream_parallel_group_id != 0, with no region
boundary marker, so the second region's sections forked from the same entry as
the first's and could race the first region's writes.

Thread a per-kernel graph_parallel_region_id from the AST builder
(begin/end_graph_parallel_context) through ForLoopConfig -> FrontendForStmt ->
RangeForStmt/StructForStmt -> OffloadedStmt -> OffloadedTask, and require equal
region id when extending a fork/join run in build_level so adjacent regions
each get their own join. AMD already serializes sections, so only the CUDA
builder needed the guard. Adds a regression test.
…misc.py

Moves the `graph_parallel_context()` / `graph_parallel()` no-op context-manager stubs into their own
`quadrants/lang/graph_parallel.py`, mirroring `checkpoint.py`, to keep the 958-line catch-all `lang/misc.py`
(imported by 33+ modules) from growing further. Re-exported via `misc.py` so the canonical
`qd.graph_parallel_context` / `qd.graph_parallel` import paths are unchanged.

Addresses the "Check feature factorization" CI finding on PR #756.
@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

Repacks the graph_parallel_context fork/join block comment in graph_manager.cpp (was wrapped at ~93-99c) and
the OffloadedStmt region-id comment in statements.h so comment lines fill toward the project's 120c width
instead of the ~80c default. Addresses the "Check line wrapping" CI on PR #756.
@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

@hughperkins

Copy link
Copy Markdown
Collaborator Author

(running Genesis unit tests and benchamrks)

@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant