From 3d3327dc7ce314d823cf5b37a4d9b328ba63b78f Mon Sep 17 00:00:00 2001
From: Hugh Perkins <hughperkins@gmail.com>
Date: Wed, 24 Jun 2026 10:59:28 -0400
Subject: [PATCH 1/9] address doc CI issues

---
 docs/source/user_guide/graph.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/user_guide/graph.md b/docs/source/user_guide/graph.md
index b1c239b080..b12fd4f6a8 100644
--- a/docs/source/user_guide/graph.md
+++ b/docs/source/user_guide/graph.md
@@ -4,7 +4,7 @@ Graphs reduce kernel launch overhead by capturing a sequence of GPU operations i
 
 ## Backend support
 
-`graph=True` and `graph_do_while` run on every backend. They are *hardware accelerated* on CUDA (via CUDA graphs) and AMDGPU (via HIP graphs); `graph_do_while` additionally requires CUDA SM 9.0+ / Hopper for its hardware-accelerated path. On other backends, `graph=True` is silently ignored and the kernel runs via the normal launch path, and `graph_do_while` falls back to a host-side do-while loop that copies the condition value GPU → host each iteration (causing a pipeline stall). `qd.checkpoint` gating runs entirely on the device on every GPU backend; only the CPU backend uses host-side gating.
+`graph=True` and `graph_do_while` run on every backend. They are *hardware accelerated* on CUDA (via CUDA graphs) and AMDGPU (via HIP graphs); `graph_do_while` additionally requires [CUDA SM 9.0+](https://developer.nvidia.com/cuda/gpus) for its hardware-accelerated path. On other backends, `graph=True` is silently ignored and the kernel runs via the normal launch path, and `graph_do_while` falls back to a host-side do-while loop. `qd.checkpoint` gating runs entirely on the device on every GPU backend.
 
 | Feature | `qd.cuda` SM 9.0+ | `qd.cuda` < SM 9.0 | `qd.amdgpu` | `qd.metal` | `qd.vulkan` | `qd.cpu` |
 | --- | --- | --- | --- | --- | --- | --- |
@@ -44,7 +44,7 @@ my_kernel(x, y)  # first call: builds and caches the graph
 my_kernel(x, y)  # subsequent calls: replays the cached graph
 ```
 
-This works the same way on CUDA and AMDGPU. The cache is keyed per (compiled-kernel-specialization, launch-id), so different template instantiations (different field bindings, etc.) get their own cached graph.
+This works the same way on CUDA and AMDGPU.
 
 ### Restrictions
 

From dde771bc1626fae328897b7b8c65c44ee0b361e0 Mon Sep 17 00:00:00 2001
From: Hugh Perkins <hughperkins@gmail.com>
Date: Wed, 24 Jun 2026 09:24:49 -0700
Subject: [PATCH 2/9] [CI] Fold reading-order (no-forward-reference) check into
 doc-quality

Add RULE 3 to the doc-quality agent prompt: within a single doc, information
must be ordered so a first-time reader can follow it top-to-bottom without
jumping ahead (no forward references). Includes carve-outs for roadmaps,
optional "see below" pointers, backward references, cross-doc references, and
conventional preamble ordering, and hands pure undefined-term cases to Rule 1.
Adds the [order] tag and reworks the violation cap so one rule cannot crowd
out the others.
---
 .github/workflows/check_doc_quality.yml | 34 ++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/.github/workflows/check_doc_quality.yml b/.github/workflows/check_doc_quality.yml
index 5e163d8b52..7bd82eae70 100644
--- a/.github/workflows/check_doc_quality.yml
+++ b/.github/workflows/check_doc_quality.yml
@@ -61,7 +61,7 @@ jobs:
           /tmp/changed_docs.txt contains a newline-separated list of Markdown (.md) doc files (paths
           relative to the repo root) that were added or modified in this PR. For EACH file in that
           list, read the FULL current contents of the file (not just the diff) and review it against
-          the two rules below.
+          the three rules below.
 
           Target audience: an *end user* of Quadrants. That is, someone who writes GPU programs in
           Python using the Quadrants library (`@qd.kernel`, `@qd.func`, ndarrays, fields, etc.).
@@ -118,16 +118,42 @@ jobs:
           VIOLATION: internal / non-end-user material (subject to the carve-outs above) appears
           outside of such a clearly-marked advanced / under-the-hood section.
 
+          RULE 3 — Reading order (no forward references).
+          Within a single doc, information should be ordered so a first-time reader can follow it
+          top-to-bottom in one pass, without jumping ahead. A FORWARD REFERENCE is when a passage
+          depends on a concept, parameter, behavior, code element, or result that is only introduced
+          LATER in the SAME file, such that the reader cannot understand the current passage without
+          first reading the later one.
+          The following are NOT order violations:
+          - A brief overview / roadmap near the top that previews upcoming sections ("this guide
+            covers A, then B") — signposting the structure, not a dependency: the reader does not
+            need to understand A or B yet to keep reading.
+          - Optional "for more detail, see <section> below" pointers, where the current passage is
+            fully understandable WITHOUT following them. The test is: MUST read ahead to understand
+            (violation) vs more depth merely available later (fine).
+          - Backward references (to something earlier in the same file) are always fine.
+          - References to OTHER docs/files; this rule is about ordering WITHIN a single file only.
+          - Conventional preamble ordering (e.g. Prerequisites -> Installation -> Usage).
+          A plain term that is simply never defined belongs to Rule 1, not here; Rule 3 covers
+          ordering dependencies (concepts, examples, parameters, results) where the explanation DOES
+          exist in the file but appears too late. Do not double-flag the same issue under both rules.
+          VIOLATION: a passage cannot be understood by a first-time reader at the point they reach
+          it, because what it relies on is only introduced later in the same file.
+
           Read each listed file in full before judging it. You may open any other file in the repo
           (e.g. linked docs) to verify whether a term is defined elsewhere. Do NOT modify any files.
           Be precise and conservative: only flag clear violations, not borderline cases. Judge each
-          file independently. Stop after finding 10 violations in total.
+          file independently against all three rules. Stop after reporting 10 violations in total,
+          but do not let one rule crowd out the others: if more than one rule has violations, make
+          sure each such rule is represented in your list before you stop.
 
           If there are NO violations, your final output must start with the word PASS.
           If there ARE violations, your final output must start with the word FAIL, followed by a list
           of violations (up to 10), one per line, in the format:
-          <filepath>: [term|scope]: <brief description>
-          where [term] is a Rule 1 violation and [scope] is a Rule 2 violation.
+          <filepath>: [term|scope|order]: <brief description>
+          where [term] is a Rule 1 violation, [scope] is a Rule 2 violation, and [order] is a Rule 3
+          violation (for [order], cite both line numbers — what is used at line N is only introduced
+          at line M).
           PROMPT
           )" --model claude-4.6-opus-high-thinking --mode ask --output-format text --trust)
 

From aafcc41c71b4703d0d0e9d8a1a267e1d887e4bd1 Mon Sep 17 00:00:00 2001
From: Hugh Perkins <hughperkins@gmail.com>
Date: Wed, 24 Jun 2026 09:25:17 -0700
Subject: [PATCH 3/9] [Doc] Document reading-order rule in doc-quality check
 description

The doc-quality check now enforces a third rule (reading order / no forward
references); describe it and its carve-outs in contributing.md.
---
 docs/source/user_guide/contributing.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/user_guide/contributing.md b/docs/source/user_guide/contributing.md
index bc7231fda0..f7826be26c 100644
--- a/docs/source/user_guide/contributing.md
+++ b/docs/source/user_guide/contributing.md
@@ -180,9 +180,9 @@ The agent reports up to 5 violations, each annotated with the host file's hotnes
 
 ### Doc quality check (`check_doc_quality.yml`)
 
-Uses an AI agent to review documentation changes for an end-user audience (someone writing Quadrants kernels in Python, not a compiler engineer). For each `docs/**/*.md` file added or modified in the PR, it reads the entire current file (not just the diff) and checks two things: (1) **undefined terms** — a term a typical user is unlikely to know (specialized or internal jargon, project-specific abbreviations) must be defined at its first use in that file, either inline or via a link to a doc that defines it; (2) **end-user relevance** — internal / implementation / contributor-only material must be confined to a clearly-marked section whose heading contains "Advanced", "Under the hood", "Internals", or "Implementation".
+Uses an AI agent to review documentation changes for an end-user audience (someone writing Quadrants kernels in Python, not a compiler engineer). For each `docs/**/*.md` file added or modified in the PR, it reads the entire current file (not just the diff) and checks three things: (1) **undefined terms** — a term a typical user is unlikely to know (specialized or internal jargon, project-specific abbreviations) must be defined at its first use in that file, either inline or via a link to a doc that defines it; (2) **end-user relevance** — internal / implementation / contributor-only material must be confined to a clearly-marked section whose heading contains "Advanced", "Under the hood", "Internals", or "Implementation"; (3) **reading order** — information must be ordered so a first-time reader can follow the file top-to-bottom in one pass, without forward references (a passage that can only be understood by reading something introduced later in the same file).
 
-The following do not count as violations: references to public APIs that the author links to their docs and/or labels as public; brief reader-directed pointers suggesting the reader could contribute upstream or file an issue; the core public API vocabulary a user already knows (e.g. `@qd.kernel`, `@qd.func`, `qd.Template`, fields, ndarrays). The agent reports up to 10 violations. This check is delayed by 30 minutes, to avoid running repeatedly if multiple commits pushed with a short delay between each.
+The following do not count as violations: references to public APIs that the author links to their docs and/or labels as public; brief reader-directed pointers suggesting the reader could contribute upstream or file an issue; the core public API vocabulary a user already knows (e.g. `@qd.kernel`, `@qd.func`, `qd.Template`, fields, ndarrays); and, for the reading-order check, an overview/roadmap near the top, optional "see below" pointers (where the current passage still reads fine on its own), and backward references. The agent reports up to 10 violations. This check is delayed by 30 minutes, to avoid running repeatedly if multiple commits are pushed with a short delay between each.
 
 ### PR change report (`pr_change_report.yml`)
 

From 899600ef3501bf4cbb9bc2090b1c67d597a8e2e6 Mon Sep 17 00:00:00 2001
From: Hugh Perkins <hughperkins@gmail.com>
Date: Wed, 24 Jun 2026 09:27:39 -0700
Subject: [PATCH 4/9] [CI] Simplify doc-quality RULE 3: drop reworked cap and
 double-flag note

Both added cognitive load to the agent for little gain. Revert to the original
"stop after 10 violations" cap and remove the Rule 1/Rule 3 delineation note.
---
 .github/workflows/check_doc_quality.yml | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/.github/workflows/check_doc_quality.yml b/.github/workflows/check_doc_quality.yml
index 7bd82eae70..4fbb9f8aed 100644
--- a/.github/workflows/check_doc_quality.yml
+++ b/.github/workflows/check_doc_quality.yml
@@ -134,18 +134,13 @@ jobs:
           - Backward references (to something earlier in the same file) are always fine.
           - References to OTHER docs/files; this rule is about ordering WITHIN a single file only.
           - Conventional preamble ordering (e.g. Prerequisites -> Installation -> Usage).
-          A plain term that is simply never defined belongs to Rule 1, not here; Rule 3 covers
-          ordering dependencies (concepts, examples, parameters, results) where the explanation DOES
-          exist in the file but appears too late. Do not double-flag the same issue under both rules.
           VIOLATION: a passage cannot be understood by a first-time reader at the point they reach
           it, because what it relies on is only introduced later in the same file.
 
           Read each listed file in full before judging it. You may open any other file in the repo
           (e.g. linked docs) to verify whether a term is defined elsewhere. Do NOT modify any files.
           Be precise and conservative: only flag clear violations, not borderline cases. Judge each
-          file independently against all three rules. Stop after reporting 10 violations in total,
-          but do not let one rule crowd out the others: if more than one rule has violations, make
-          sure each such rule is represented in your list before you stop.
+          file independently. Stop after finding 10 violations in total.
 
           If there are NO violations, your final output must start with the word PASS.
           If there ARE violations, your final output must start with the word FAIL, followed by a list

From 57f07bd4011f6fa509e5efb754be74bbfeb72b0d Mon Sep 17 00:00:00 2001
From: Hugh Perkins <hughperkins@gmail.com>
Date: Fri, 26 Jun 2026 09:01:45 -0400
Subject: [PATCH 5/9] fix doc ci failures

---
 docs/source/user_guide/graph.md | 38 ++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/docs/source/user_guide/graph.md b/docs/source/user_guide/graph.md
index b12fd4f6a8..2ad0cbd182 100644
--- a/docs/source/user_guide/graph.md
+++ b/docs/source/user_guide/graph.md
@@ -2,20 +2,6 @@
 
 Graphs reduce kernel launch overhead by capturing a sequence of GPU operations into a graph, then replaying it in a single launch.
 
-## Backend support
-
-`graph=True` and `graph_do_while` run on every backend. They are *hardware accelerated* on CUDA (via CUDA graphs) and AMDGPU (via HIP graphs); `graph_do_while` additionally requires [CUDA SM 9.0+](https://developer.nvidia.com/cuda/gpus) for its hardware-accelerated path. On other backends, `graph=True` is silently ignored and the kernel runs via the normal launch path, and `graph_do_while` falls back to a host-side do-while loop. `qd.checkpoint` gating runs entirely on the device on every GPU backend.
-
-| Feature | `qd.cuda` SM 9.0+ | `qd.cuda` < SM 9.0 | `qd.amdgpu` | `qd.metal` | `qd.vulkan` | `qd.cpu` |
-| --- | --- | --- | --- | --- | --- | --- |
-| `graph=True` | hardware accelerated | hardware accelerated | hardware accelerated | runs (no acceleration) | runs (no acceleration) | runs (no acceleration) |
-| `qd.graph_do_while` | hardware accelerated | host fallback | host fallback | host fallback | host fallback | host fallback |
-| `qd.checkpoint` | GPU-side | GPU-side | GPU-side | GPU-side | GPU-side | host-side |
-
-AMDGPU `graph_do_while` falls back to the host-side loop because HIP does not currently expose conditional / while graph nodes (as of ROCm 7.2).
-
-Nested and sibling `graph_do_while` loops (and mixing `graph_do_while` with top-level `for`-loops) are **experimental** for now — see [Nested loops and mixing with for-loops](#nested-loops-and-mixing-with-for-loops).
-
 ## Basic usage
 
 Add `graph=True` to a `@qd.kernel` decorator:
@@ -49,9 +35,9 @@ This works the same way on CUDA and AMDGPU.
 ### Restrictions
 
 - **No struct return values.** Kernels that return values (e.g. `-> qd.i32`) cannot use graphs. An error is raised if `graph=True` is set on such a kernel.
-- **Primal kernels only.** The `graph=True` flag is applied to the primal (forward) kernel only, not its adjoint. Autodiff kernels use the normal launch path.
+- **Primal kernels only.** The `graph=True` flag is applied to the primal (forward) kernel only, not its adjoint (backward). [Autodiff](autodiff.md) kernels use the normal launch path.
 - **Device-resident ndarrays.** Graph mode bakes device pointers into the cached graph, so all ndarray arguments must be on the GPU. Passing a host-resident ndarray raises an error.
-- **`qd_stream` is incompatible** with `graph=True`. Choose one or the other.
+- [streams](streams.md) are incompatible** with graph.
 
 ### Passing different arguments
 
@@ -179,7 +165,7 @@ Therefore on unsupported platforms, you might consider creating a second impleme
 
 ## Checkpoints with `qd.checkpoint` *(experimental)*
 
-> **Experimental.** `qd.checkpoint`, `qd.GraphStatus`, and `kernel.resume(from_checkpoint=...)` are experimental APIs. The shape of the public surface (the context-manager signature, the `@qd.kernel(checkpoints=True)` flag, the `GraphStatus` fields, the host-side resume loop, the error messages, and the cross-backend lowering details) may change in any future release without a deprecation cycle.
+> **Experimental.** `qd.checkpoint`, `qd.GraphStatus`, and `kernel.resume(from_checkpoint=...)` are experimental APIs and may be changed, removed, or replaced in the near future.
 
 `qd.checkpoint` lets a graph kernel break partway through, surface a reason to the host, let the host fix things up, and resume from the same location on the next launch. An example use-case is an algorithm implemented as a graph that may need to allocate additional memory partway through, where the operations in the graph are in-place, and therefore cannot be rerun without changing/corrupting the output, and therefore for which simply retrying the whole graph from the start is not an option.
 
@@ -268,7 +254,21 @@ while status.yielded:
           arr[i] = arr[i] + 1
   ```
 
-The restriction is by design: each top-level statement inside a checkpoint becomes its own GPU task / graph node, so silently wrapping bare statements would hide a sequence of N field writes ballooning into N kernel launches. Forcing the user to write the `for`-wrap themselves keeps the lowering visible and gives a single obvious place to fuse multiple writes into one task by sharing a single wrapper.
+The restriction is by design: each top-level statement inside a checkpoint becomes its own GPU task / graph node, so silently wrapping bare statements would hide a sequence of N field writes ballooning into N kernel launches.
+
+## Backend support
+
+`graph=True` and `graph_do_while` run on every backend. They are *hardware accelerated* on CUDA (via CUDA graphs) and AMDGPU (via HIP graphs); `graph_do_while` additionally requires [CUDA SM 9.0+](https://developer.nvidia.com/cuda/gpus) for its hardware-accelerated path. On other backends, `graph=True` is silently ignored and the kernel runs via the normal launch path, and `graph_do_while` falls back to a host-side do-while loop. `qd.checkpoint` gating runs entirely on the device on every GPU backend.
+
+| Feature | `qd.cuda` SM 9.0+ | `qd.cuda` < SM 9.0 | `qd.amdgpu` | `qd.metal` | `qd.vulkan` | `qd.cpu` |
+| --- | --- | --- | --- | --- | --- | --- |
+| `graph=True` | hardware accelerated | hardware accelerated | hardware accelerated | runs (no acceleration) | runs (no acceleration) | runs (no acceleration) |
+| `qd.graph_do_while` | hardware accelerated | host fallback | host fallback | host fallback | host fallback | host fallback |
+| `qd.checkpoint` | GPU-side | GPU-side | GPU-side | GPU-side | GPU-side | host-side |
+
+AMDGPU `graph_do_while` falls back to the host-side loop because HIP does not currently expose conditional / while graph nodes (as of ROCm 7.2).
+
+Nested and sibling `graph_do_while` loops (and mixing `graph_do_while` with top-level `for`-loops) are **experimental** for now — see [Nested loops and mixing with for-loops](#nested-loops-and-mixing-with-for-loops).
 
 ## Performance
 
@@ -397,7 +397,7 @@ If you do want fixed-size for loops to run optimally on unsupported hardware pla
 - on graph-do-while-supported hardware: handle adding the additional increment kernel
 - on graph-do-while-unsupported hardware: handle running the loop entirely on the host-side, to avoid adding a gpu pipeline stall
 
-In practice, for our own kernels, i.e. in genesis-world, they largely fall under the do while formulation, see the previous section. However, also have some that used to be do while, but have been migrated to an optimized fixed-size, see next section.
+In practice, for our own kernels, i.e. in [genesis-world](https://github.com/Genesis-Embodied-AI/genesis-world), they largely fall under the do while formulation, see the previous section. However, also have some that used to be do while, but have been migrated to an optimized fixed-size, see next section.
 
 ### A while loop, conditional on a device-side scalar tensor, that has been optimized into a fixed-size for loop
 

From cc8ea8a57cd45810bbb7faf9edb5e2f6727c3ef5 Mon Sep 17 00:00:00 2001
From: Hugh Perkins <hughperkins@gmail.com>
Date: Fri, 26 Jun 2026 14:13:39 -0400
Subject: [PATCH 6/9] doc tweaks for ci

---
 docs/source/user_guide/graph.md | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/docs/source/user_guide/graph.md b/docs/source/user_guide/graph.md
index 2ad0cbd182..84ee6d594f 100644
--- a/docs/source/user_guide/graph.md
+++ b/docs/source/user_guide/graph.md
@@ -75,7 +75,7 @@ solve(x, counter)
 
 The argument to `qd.graph_do_while()` must be the name of a scalar `qd.i32` ndarray parameter. The loop body repeats while this value is non-zero.
 
-- On CUDA SM 9.0+ (Hopper), this uses CUDA conditional while nodes — the entire iteration runs on the GPU with no host involvement.
+- On [CUDA SM 9.0+](https://developer.nvidia.com/cuda/gpus), this uses CUDA conditional while nodes — the entire iteration runs on the GPU with no host involvement.
 - On older CUDA GPUs, AMDGPU, and non-GPU backends, it falls back to a host-side do-while loop (see the [backend support table](#backend-support)).
 
 ### Patterns
@@ -151,7 +151,7 @@ Note that `qd.func`'s are inlined, so you can freely factorize these structures
 
 ### Caveats
 
-On platforms without native device-side conditional graph nodes — currently CUDA pre-SM 9.0 and **AMDGPU** (HIP has no conditional / while node API as of ROCm 7.2) — the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration:
+On platforms without native device-side conditional graph nodes — currently CUDA pre-SM 9.0 and **AMDGPU** ([HIP](https://rocm.docs.amd.com/projects/HIP/en/latest/what_is_hip.html) has no conditional / while node API as of [ROCm](https://www.amd.com/en/products/software/rocm.html) 7.2) — the value of the `graph_do_while` parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. For nested loops this host round-trip happens once per iteration of each loop level, and each loop-body task is replayed individually, so deeply nested loops on these backends pay correspondingly more host overhead (they remain correct, just slower than the CUDA SM 9.0+ native path). At the end of each loop iteration:
 - wait for GPU async queue to finish processing
 - copy condition value to hostside
 - evaluate condition value on hostside
@@ -393,11 +393,9 @@ k1(a, count)
 
 The recommendation is to use the graph do while here anyway, if you need it for any platform, in order to ensure the code is compact and maintainable.
 
-If you do want fixed-size for loops to run optimally on unsupported hardware platforms, we could add a specializd `qd.graph_range_for` function. This would:
-- on graph-do-while-supported hardware: handle adding the additional increment kernel
-- on graph-do-while-unsupported hardware: handle running the loop entirely on the host-side, to avoid adding a gpu pipeline stall
+If you do want fixed-size for loops to run optimally on unsupported hardware platforms, please raise an issue, and we can look into this.
 
-In practice, for our own kernels, i.e. in [genesis-world](https://github.com/Genesis-Embodied-AI/genesis-world), they largely fall under the do while formulation, see the previous section. However, also have some that used to be do while, but have been migrated to an optimized fixed-size, see next section.
+In practice, our own [genesis-world](https://github.com/Genesis-Embodied-AI/genesis-world) kernels largely fall under the do while formulation (see the previous section). However, we also have some that used to be do while but have been migrated to an optimized fixed-size for loop (see the next section).
 
 ### A while loop, conditional on a device-side scalar tensor, that has been optimized into a fixed-size for loop
 

From ee7754cf7146c43eff29ccc2a60cf3f8d4ea03ce Mon Sep 17 00:00:00 2001
From: Hugh Perkins <hughperkins@gmail.com>
Date: Fri, 26 Jun 2026 14:15:51 -0400
Subject: [PATCH 7/9] Apply suggestion from @graphite-app[bot]

Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>
---
 docs/source/user_guide/graph.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/source/user_guide/graph.md b/docs/source/user_guide/graph.md
index a51b74e215..2e7c2ee95d 100644
--- a/docs/source/user_guide/graph.md
+++ b/docs/source/user_guide/graph.md
@@ -37,7 +37,9 @@ This works the same way on CUDA and AMDGPU.
 - **No struct return values.** Kernels that return values (e.g. `-> qd.i32`) cannot use graphs. An error is raised if `graph=True` is set on such a kernel.
 - **Primal kernels only.** The `graph=True` flag is applied to the primal (forward) kernel only, not its adjoint (backward). [Autodiff](autodiff.md) kernels use the normal launch path.
 - **Device-resident ndarrays.** Graph mode bakes device pointers into the cached graph, so all ndarray arguments must be on the GPU. Passing a host-resident ndarray raises an error.
-- [streams](streams.md) are incompatible** with graph.
+
+- [streams](streams.md) are **incompatible** with graph.
+
 
 ### Passing different arguments
 

From 04f2f0e77eb12cd19bf07e5cdfcd84f8806a2066 Mon Sep 17 00:00:00 2001
From: Hugh Perkins <hughperkins@gmail.com>
Date: Fri, 26 Jun 2026 16:33:04 -0400
Subject: [PATCH 8/9] address doc CI

---
 docs/source/user_guide/graph.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/user_guide/graph.md b/docs/source/user_guide/graph.md
index 2e7c2ee95d..e2e6dffa9c 100644
--- a/docs/source/user_guide/graph.md
+++ b/docs/source/user_guide/graph.md
@@ -149,7 +149,7 @@ Note that `qd.func`'s are inlined, so you can freely factorize these structures
 
 ### Restrictions
 
-- The counter ndarray may be swapped between calls: the cached graph reads each counter through an indirection slot that is refreshed on every launch, so passing a different ndarray (or alternating between several) replays the cached graph without rebuilding it.
+- The counter ndarray may be swapped between calls: passing a different ndarray (or alternating between several) replays the cached graph without rebuilding it.
 
 ### Caveats
 
@@ -242,7 +242,7 @@ while status.yielded:
 
 ### Restrictions
 
-- Must be used inside `@qd.kernel(graph=True, checkpoints=True)`. Without the flag, `qd.checkpoint(...)` raises `QuadrantsSyntaxError` at compile time with a fix-it pointing at `checkpoints=True`.
+- Must be used inside `@qd.kernel(graph=True, checkpoints=True)`. Without the flag, `qd.checkpoint(...)` raises `QuadrantsSyntaxError` at compile time.
 - `cp_id` must be an int literal or an `IntEnum` value, and must be unique across the kernel.
 - `yield_on=` must be a kernel parameter that is a 0-d `qd.types.ndarray(qd.i32, ndim=0)`; expressions are not supported.
 - Checkpoints cannot be nested inside other checkpoints. Checkpoints inside a `qd.graph_do_while` body are fine.
@@ -465,4 +465,4 @@ In this case, our recommendation is:
 - use graph do while anyway, if you need it on any platform
     - this will ensure your code is compact and maintainable
 - if you need optimum 100% performance on unsupported platforms, then consider PRing onto quadrants an optimized graph implementation for your target platform
-    - for example it could somehow run MAX_ITER iterations anyway, similar to the earlier hand-rolled version, but via the graph abstraction, hence allowing the code to be compact, cross-platform, and also optimally fast
+    - advanced: for example it could somehow run MAX_ITER iterations anyway, similar to the earlier hand-rolled version, but via the graph abstraction, hence allowing the code to be compact, cross-platform, and also optimally fast

From a1d925b3f96e06869ca1c51ebfc26891a91464ef Mon Sep 17 00:00:00 2001
From: Hugh Perkins <hughperkins@gmail.com>
Date: Fri, 26 Jun 2026 16:54:49 -0400
Subject: [PATCH 9/9] address doc ci issues

---
 docs/source/user_guide/graph.md | 26 +++++++++++---------------
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/docs/source/user_guide/graph.md b/docs/source/user_guide/graph.md
index e2e6dffa9c..9b3ca47fb5 100644
--- a/docs/source/user_guide/graph.md
+++ b/docs/source/user_guide/graph.md
@@ -289,9 +289,9 @@ def k1(a: qd.type.NDArray, b: qd.type.NDArray, c: qd.type.NDArray):
     for i in range(c.shape[0]):
         fn_b(c, i)
 ```
-We have three top-level for loops, which we call 'offloaded tasks'. Each offloaded task is compiled into a separate GPU kernel. When we call `k1` from python, the c++ host-side code launches three gpu kernels.
+We have three top-level for loops, which we call 'offloaded tasks'. Each offloaded task is compiled into a separate GPU kernel. When we call `k1` from Python, three GPU kernels are launched from the host side.
 
-We can migrate it to graph by adding `graph=True`:
+We can migrate this `qd.kernel` to graph by adding `graph=True`:
 ```
 @qd.kernel(graph=True)
 def k1(a: qd.type.NDArray, b: qd.type.NDArray, c: qd.type.NDArray):
@@ -304,8 +304,8 @@ def k1(a: qd.type.NDArray, b: qd.type.NDArray, c: qd.type.NDArray):
 ```
 
 Results:
-- on hardware-accelerated platforms, we only launch a single graph from the host, rather than 3 kernels
-- on other platforms, there is no change: we still launch 3 gpu kernels: no change: not better, not worse
+- on hardware-accelerated platforms, we only launch a single graph from the host, rather than 3 separate kernels
+- on other platforms, there is no change: we still launch 3 separate kernels, so it is no better and no worse
 
 ### A while loop, conditional on a device-side scalar tensor
 
@@ -328,11 +328,10 @@ So, we have:
 - the kernel contains device code, which will run on the gpu
 - each iteration, we copy the value of cond, from the gpu to the host, and check the value
 - this causes a gpu pipeline stall:
-    - first we wait for the entire default stream gpu work to complete/drain
-    - then we wait for the value of cond to copy from the gpu to the host
-    - then we run the python code to check the value of cond
-    - if we continue the loop, we now have to run through the python and c++ machinery to prepare the gpu kernel launch
-    - then launch the gpu kernels inside k1
+    - first Quadrants waits for all gpu work on the default stream to complete/drain
+    - then Quadrants waits for the value of cond to copy from the gpu to the host
+    - then Quadrants runs the python code to check the value of cond
+    - if we continue the loop, Quadrants now launches the gpu kernels inside k1
     - together, these steps can cause a noticeable delay, reducing throughput speed
 
 After migrating to graph with graph do while we have:
@@ -351,8 +350,6 @@ Now:
 - on supported hardware, the cond evaluation takes place on the gpu
     - and we avoid the gpu pipeline stall
 - on unsupported hardware, we still incur the pipeline stall, as before
-    - note that there will be some small acceleration, because the condition evaluation and kernel launch will take place entirely from c++, bypassing python
-    - no worse, incrementally better
 
 ### A fixed-size for loop
 
@@ -371,8 +368,8 @@ for _ in range(num_its):
 In this case, we have `num_its` launches of the three gpu kernels in k1
 - there is nothing on the host side that waits for anything to finish on the gpu-side
 - there is kernel launch latency associated with:
-    - running k1 from host-side python
-    - launching the gpu kernels for each of fn_1, fn_2, fn_3 from host-side c++
+    - running k1 from the host side
+    - launching the gpu kernels for each of fn_1, fn_2, fn_3, also from the host side
 
 After migrating to graph we have something like:
 ```
@@ -464,5 +461,4 @@ The effect in reality is situation dependent:
 In this case, our recommendation is:
 - use graph do while anyway, if you need it on any platform
     - this will ensure your code is compact and maintainable
-- if you need optimum 100% performance on unsupported platforms, then consider PRing onto quadrants an optimized graph implementation for your target platform
-    - advanced: for example it could somehow run MAX_ITER iterations anyway, similar to the earlier hand-rolled version, but via the graph abstraction, hence allowing the code to be compact, cross-platform, and also optimally fast
+- if you need optimal performance on unsupported platforms, then consider contributing an optimized graph implementation for your target platform to Quadrants via a pull request, or raising an issue