[Docs] Clarify top-level for-loops inside graph_do_while #727
Open
hughperkins wants to merge 1 commit into
Documents that each top-level for-loop in a graph_do_while body is still its own offloaded launch, with a grid-wide barrier between consecutive loops. Multi-phase algorithms that need grid-wide sync between phases (e.g. a device-wide radix sort: histogram -> scan -> scatter) therefore work correctly when called directly in the loop body.

Clarifies that the "do not nest in runtime for/if/while" guidance is about ordinary nested control flow, which demotes a loop out of top-level position and collapses the offload. It is not about graph_do_while itself, which is the construct designed to host a sequence of top-level offloaded loops.

Verified empirically: a radix sort of N > BLOCK_DIM keys inside graph_do_while sorts correctly on every iteration.
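For reviewers who want to see why the barrier between phases matters, here is a minimal pure-Python sketch. It does not call graph_do_while or any other real kernel API: the "blocks", the `barriers` flag, and all function names are hypothetical stand-ins. With `barriers=True`, every block finishes a phase before any block starts the next, which models one offloaded launch per top-level loop. With `barriers=False`, each block runs all three phases back to back, which models loops that were collapsed into a single launch.

```python
BLOCK_DIM = 4                 # keys handled per simulated block (assumed value)
RADIX_BITS = 2                # bits consumed per pass (assumed value)
RADIX = 1 << RADIX_BITS


def histogram(block, keys, shift, hist):
    # Phase 1: each block counts the digits of its own slice of keys.
    start = block * BLOCK_DIM
    for key in keys[start:start + BLOCK_DIM]:
        hist[block][(key >> shift) & (RADIX - 1)] += 1


def scan(block, hist, offsets):
    # Phase 2: each block derives its write offsets from ALL blocks' counts,
    # so it is only correct once every block has finished phase 1.
    for digit in range(RADIX):
        smaller_digits = sum(h[d] for h in hist for d in range(digit))
        same_digit_earlier_blocks = sum(h[digit] for h in hist[:block])
        offsets[block][digit] = smaller_digits + same_digit_earlier_blocks


def scatter(block, keys, shift, offsets, out):
    # Phase 3: each block writes its keys to their positions, preserving order.
    start = block * BLOCK_DIM
    cursor = list(offsets[block])
    for key in keys[start:start + BLOCK_DIM]:
        digit = (key >> shift) & (RADIX - 1)
        out[cursor[digit]] = key
        cursor[digit] += 1


def radix_pass(keys, shift, barriers):
    num_blocks = (len(keys) + BLOCK_DIM - 1) // BLOCK_DIM
    hist = [[0] * RADIX for _ in range(num_blocks)]
    offsets = [[0] * RADIX for _ in range(num_blocks)]
    out = [None] * len(keys)
    phases = [
        lambda b: histogram(b, keys, shift, hist),
        lambda b: scan(b, hist, offsets),
        lambda b: scatter(b, keys, shift, offsets, out),
    ]
    if barriers:
        # Models separate launches: a phase completes on all blocks first.
        for phase in phases:
            for b in range(num_blocks):
                phase(b)
    else:
        # Models a collapsed launch: no sync between phases across blocks.
        for b in range(num_blocks):
            for phase in phases:
                phase(b)
    return out


def radix_sort(keys, key_bits):
    # Do-while style outer loop: run a pass, then test the condition.
    shift = 0
    while True:
        keys = radix_pass(keys, shift, barriers=True)
        shift += RADIX_BITS
        if shift >= key_bits:
            break
    return keys


keys = [13, 7, 2, 9, 7, 0, 15, 4, 11]            # N = 9 > BLOCK_DIM
sorted_keys = radix_sort(keys, key_bits=4)
barriered_first_pass = radix_pass(keys, 0, barriers=True)
fused_first_pass = radix_pass(keys, 0, barriers=False)
print(sorted_keys == sorted(keys))
print(fused_first_pass == barriered_first_pass)
```

In the fused schedule, block 0 computes its offsets while the other blocks' histograms are still zero, so later blocks overwrite its output slots and some slots are never written. That is the same failure mode the description attributes to loops demoted out of top-level position.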
hughperkins commented Jun 8, 2026
| cond[()] = ... | ||
| ``` | ||
| **What does break the grid-wide barrier:** nesting a `for`-loop inside *ordinary* runtime control flow — another `for`, an `if`, or a plain Python `while` — **demotes it from top-level position**, so it no longer becomes its own offloaded launch. Instead it runs as device code *within the enclosing launch*, and the grid-wide barrier between it and its siblings is lost (other blocks may not have produced their data yet). `graph_do_while` is **not** "ordinary runtime control flow" in this sense — it is precisely the construct designed to host a sequence of top-level offloaded loops, so loops directly in its body keep their barriers. Compile-time `qd.static(range(...))` loops are also fine: they unroll flat at compile time and keep their bodies at top-level position. |
Collaborator
Author
do we need this paragraph?
Issue: #
Brief Summary
copilot:summary
Walkthrough
copilot:walkthrough