[Graph] Add graph parallel#756
Conversation
…raph kernels Lets a @qd.kernel(graph=True) author mark independent sequences of work as concurrent branches so the captured CUDA graph runs them on parallel streams (recovers the PT||EE overlap qipc loses vs cgq). API: `with qd.graph_parallel():` opens a fork/join region whose members are `with qd.branch(name=...):` blocks (name optional, label only). Region body must contain only branch blocks; branches are independent (author-guaranteed race-free) and everything after the region waits for all of them (join). Implementation reuses the existing stream_parallel_group_id tag: qd.branch lowers via begin/end_stream_parallel, so the branch id flows through the existing offload/codegen path to OffloadedTask. The only runtime change is in the CUDA graph builder (build_level): a contiguous run of nonzero-group, non-checkpoint tasks is forked by group id from the region entry, each branch chained, and all tails joined into one cuGraphAddEmptyNode. Single-branch regions degenerate to a plain chain. Validation: graph=True required; branch only inside a region; region body must be branches only; no nesting; graph_do_while structure validator updated to accept regions. Other backends (CPU/AMDGPU/Vulkan/Metal) run branches serially (correct); CUDA graph path runs them concurrently. Design: perso_hugh/doc/qipc/d3_0_graph_parallel_impl.md
…onal branches - Recursive region-body validation: a graph_parallel region may contain branch blocks optionally wrapped in `if qd.static(...)` (qipc ENABLE_EE pattern); graph_do_while validator descends through those ifs too. - tests/python/test_graph_parallel.py: correctness vs serial (2/3/multi-loop branches), single-branch no-join, optional static-if branch, region inside graph_do_while, and compile-time error cases. Node-count assertions on CUDA. - docs: graph.md "Concurrent branches" section + backend table row; streams.md cross-link to graph_parallel for graph kernels.
…oup id A dynamic-bound for-loop (e.g. range(x.shape[0])) lowers to a bound-compute serial task followed by the range_for. The serial task carried group 0, which split a qd.branch()'s contiguous run and defeated the graph builder's fork/join (branches ran serially). Propagate the loop's stream_parallel_group_id onto the flushed pending-serial task in the offloader so the bound task and its range_for stay in the same branch. The existing single-level / single-region frontend restriction guarantees the pending serial never mixes branches. Also make the graph_parallel test node-count assertions relative to the offloaded-task count (each dynamic-bound loop is 2 tasks).
The graph_parallel/branch context managers use @contextmanager. On the old desk8 base misc.py already imported it; rebasing the feature onto main (where misc.py no longer imports contextmanager) lost the import, breaking module import (NameError at load -> stub generation failed).
…t and qd.branch -> qd.graph_parallel The fork/join region context manager is now qd.graph_parallel_context() and each concurrent branch member is qd.graph_parallel(name=...). Updates the public API (misc.py + __all__), the AST detection/build/validation in ast_transformer.py and function_def_transformer.py (region helpers gain a _context suffix; the conceptual "branch" naming is retained for members), tests, user docs (graph.md/streams.md, including the heading anchor), and CUDA graph_manager comments.
…rallel docs - Remove the optional name= parameter from qd.graph_parallel (branch members now take no arguments); simplify the AST detection/build path accordingly. - Drop the "graph-compatible analogue of qd.stream_parallel" lead-in and the "single-branch region lowers to a plain chain" note from the graph_parallel docs. - Refer to the "graph builder" rather than the "CUDA graph builder" in the graph_parallel feature prose (it is graph-abstraction level, CUDA-honoured today).
…rim tables) - Rename the doc's "branch" terminology to "section" (heading, prose, tables, code comments); update the streams.md cross-reference anchor to match. - Stop describing graph_parallel scheduling in terms of parallel streams (we do not expose streams) and drop the qd.stream_parallel analogy from the section. - Trim the comparison table (drop "(parallel streams)"/"(correct)") and the backend-behaviour table (drop the result column, "(graph path)", and the "honoured only by the CUDA graph builder today" note). - Restrictions: drop the redundant "Must be used inside @qd.kernel(graph=True)" bullet and refer to qd.graph_parallel_context instead of a vague "region". - Minor: "falls back to a host-side loop"; use British "behaviour".
…and docs Rename the internal "branch" concept to "parallel section" everywhere it referred to a qd.graph_parallel() block: AST helpers (_is_branch_call/_build_branch_with/_in_branch and _is_branch_with), the CUDA graph_manager `branches` vector, misc.py docstrings, and all related comments. Update graph.md/streams.md to say "parallel section(s)" (heading + anchor included). No behaviour change.
Rename the remaining "branch" wording in test_graph_parallel.py: test function names (e.g. test_graph_parallel_two_branches -> _two_sections) and all docstrings/comments now say "parallel section(s)", matching the source/docs rename.
…test Rename test_graph_parallel_non_section_body_raises -> test_graph_parallel_context_non_graph_parallel_raises so the name says what it checks: a qd.graph_parallel_context() whose body is not a qd.graph_parallel() must raise.
…test Rename test_graph_parallel_section_outside_region_raises -> test_graph_parallel_outside_context_raises: it checks that qd.graph_parallel() used outside a qd.graph_parallel_context() raises.
Flip the British spellings introduced for graph_parallel back to American to match the rest of the codebase: behaviour->behavior, honoured->honored, recognised->recognized, behavioural->behavioral across graph.md, streams.md, misc.py docstrings, and the tests.
…ion' Use the construct-named term "qd.graph_parallel section" throughout the docs, docstrings, comments, and tests (dropping the redundant prefix where qd.graph_parallel is already adjacent). Updates the graph.md heading + the streams.md cross-reference anchor accordingly.
…-loop Port the static-loop section generator from hp/graph-parallel-static-branches onto the current naming. A `for ... in qd.static(...)` loop inside a qd.graph_parallel_context unrolls at trace time into one qd.graph_parallel section per iteration (each gets a fresh stream_parallel_group_id), so sections can be forked from a compile-time sequence -- e.g. one per @qd.func member of a @qd.data_oriented list (qipc's per-contact-type assembly pattern). - ast_transformer: region-body validator recurses into `for ... in qd.static(...)` loops (staticness re-checked via get_decorator at every nesting level, so a runtime loop -- even nested under a static one -- is still rejected). - function_def_transformer: graph_do_while structure validator descends into For members so static-loop section bodies are validated like hand-written ones. - tests: static-loop sections (incl. nested, empty range, single section, over-funcs, mixed with if-static, inside graph_do_while) + runtime-loop and non-section-body rejection guards. - docs: "Generating qd.graph_parallel sections from a compile-time sequence".
Address the three Cursor-agent CI gates on PR #756: - Feature factorization: move the graph_parallel detection/validation/lowering out of ast_transformer.py into ast_transformers/graph_parallel_transformer.py (GraphParallelTransformer), mirroring checkpoint_transformer.py. build_With now calls the new class directly; function_def_transformer comments updated to point at it. - Line wrapping: re-wrap the graph_parallel_context docstring to 120c and drop a dangling reference to an internal design doc. - Doc quality (graph.md): reword pre-existing graph_do_while content the gate flagged -- replace "lowering" jargon, drop the internal "genesis-world" / hypothetical qd.graph_range_for / MAX_ITER design walk-through phrasing.
…face test_api.py asserts the exact set of public qd.* symbols; the renamed context managers were missing from the expected list, failing the test across every platform. Also reflow a misc.py docstring line to 120c.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0a7cb9866a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| while (run_end < end && tasks[run_end].graph_do_while_level_id == parent_id && tasks[run_end].checkpoint_id < 0 && | ||
| tasks[run_end].stream_parallel_group_id != 0) { | ||
| run_end++; |
There was a problem hiding this comment.
Preserve joins between adjacent graph-parallel regions
When two with qd.graph_parallel_context(): regions are written back-to-back, the offloader emits all of their section tasks as one contiguous run with nonzero stream_parallel_group_ids (there is no marker for the region boundary). This loop consumes that entire run and forks every group from the same prev_node, so sections in the second region can run concurrently with sections in the first instead of waiting for the first region's join. Any valid second region that reads data produced by the first will race and produce nondeterministic results on CUDA.
Useful? React with 👍 / 👎.
…misc.py Moves the `graph_parallel_context()` / `graph_parallel()` no-op context-manager stubs into their own `quadrants/lang/graph_parallel.py`, mirroring `checkpoint.py`, to keep the 958-line catch-all `lang/misc.py` (imported by 33+ modules) from growing further. Re-exported via `misc.py` so the canonical `qd.graph_parallel_context` / `qd.graph_parallel` import paths are unchanged. Addresses the "Check feature factorization" CI finding on PR #756.
… regions Two qd.graph_parallel_context() regions written back-to-back (no serial work between them) were merged into one fork/join by the CUDA graph builder: the run-extension loop keyed only on stream_parallel_group_id != 0, with no region boundary marker, so the second region's sections forked from the same entry as the first's and could race the first region's writes. Thread a per-kernel graph_parallel_region_id from the AST builder (begin/end_graph_parallel_context) through ForLoopConfig -> FrontendForStmt -> RangeForStmt/StructForStmt -> OffloadedStmt -> OffloadedTask, and require equal region id when extending a fork/join run in build_level so adjacent regions each get their own join. AMD already serializes sections, so only the CUDA builder needed the guard. Adds a regression test.
…misc.py Moves the `graph_parallel_context()` / `graph_parallel()` no-op context-manager stubs into their own `quadrants/lang/graph_parallel.py`, mirroring `checkpoint.py`, to keep the 958-line catch-all `lang/misc.py` (imported by 33+ modules) from growing further. Re-exported via `misc.py` so the canonical `qd.graph_parallel_context` / `qd.graph_parallel` import paths are unchanged. Addresses the "Check feature factorization" CI finding on PR #756.
Repacks the graph_parallel_context fork/join block comment in graph_manager.cpp (was wrapped at ~93-99c) and the OffloadedStmt region-id comment in statements.h so comment lines fill toward the project's 120c width instead of the ~80c default. Addresses the "Check line wrapping" CI on PR #756.
|
(running Genesis unit tests and benchamrks) |
Issue: #
Brief Summary
copilot:summary
Walkthrough
copilot:walkthrough