Genesis-Embodied-AI · hughperkins · Jun 26, 2026 · Jun 23, 2026 · Jun 26, 2026
diff --git a/docs/source/user_guide/algorithms.md b/docs/source/user_guide/algorithms.md
@@ -102,7 +102,7 @@ scratch64 = qd.ndarray(qd.u64, shape=qd.algorithms.reduce_scratch_slots(N, D))
 
 ## Semantics
 
-The active ops below share a calling convention and several rules; these are stated once in **Common conventions**, and only the op-specific behaviour is repeated per op. The internal algorithm for each op is in [Under the hood](#under-the-hood).
+The active ops below share a calling convention and several rules; these are stated once in **Common conventions**, and only the op-specific behavior is repeated per op. The internal algorithm for each op is in [Under the hood](#under-the-hood).
 
 Each op section ends with a runnable toy example. They all assume this prelude:
 

diff --git a/docs/source/user_guide/atomics.md b/docs/source/user_guide/atomics.md
@@ -22,7 +22,7 @@ All atomic ops follow the same shape: `qd.atomic_op(x, y)` performs `x = op(x, y
 
 A few cross-cutting notes that the cells above abbreviate:
 
-- **`atomic_sub` is not a separate op in the IR.** `quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten` rewrites every `atomic_sub(x, y)` into `atomic_add(x, -y)` before codegen sees it, so per-backend support and per-dtype behaviour are exactly those of `atomic_add`.
+- **`atomic_sub` is not a separate op in the IR.** `quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten` rewrites every `atomic_sub(x, y)` into `atomic_add(x, -y)` before codegen sees it, so per-backend support and per-dtype behavior are exactly those of `atomic_add`.
 - **CAS-loop ops are noticeably slower than native atomics**, especially under contention — every contending thread retries the load + compare-exchange until it wins. Prefer pre-aggregating into a register or shared array and issuing a single atomic at the end of the block where possible.
 - **f16 floats always use a CAS loop** (no native f16 atomic on any backend except SPIR-V with the right capability bit).
 - **On CPU, "native" does not guarantee a single machine instruction.** On x86 and other architectures without hardware float atomics, the compiler backend lowers native float `atomic_add` (and integer `min` / `max`) to a CAS loop in machine code. Under high contention the performance is similar to the explicit "CAS" entries; the difference is that "native" ops benefit from hardware acceleration where available.
@@ -71,7 +71,7 @@ Bitwise atomics. Integer dtypes only — passing `f32` / `f64` raises a type err
 
 ### `qd.atomic_sub(x, y)` / `qd.atomic_mul(x, y)`
 
-Atomic subtract and atomic multiply. `atomic_sub` is rewritten to `atomic_add(x, -y)` at IR-construction time (`quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten`), so its per-backend behaviour is identical to `atomic_add`. `atomic_mul` always lowers to a CAS loop - no LLVM AtomicRMW or SPIR-V `OpAtomic*` op corresponds to multiply - and is intentionally not heavily optimised; prefer reducing to a different scheme on hot paths.
+Atomic subtract and atomic multiply. `atomic_sub` is rewritten to `atomic_add(x, -y)` at IR-construction time (`quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten`), so its per-backend behavior is identical to `atomic_add`. `atomic_mul` always lowers to a CAS loop - no LLVM AtomicRMW or SPIR-V `OpAtomic*` op corresponds to multiply - and is intentionally not heavily optimized; prefer reducing to a different scheme on hot paths.
 
 ### `qd.atomic_exchange(x, y)`
 
@@ -149,11 +149,11 @@ val = qd.volatile_load(target)
 | Backend          | Lowering                                                                                  |
 |------------------|-------------------------------------------------------------------------------------------|
 | CUDA             | LLVM `load volatile` → PTX `ld.volatile.global`.                                          |
-| AMDGPU           | LLVM `load volatile` → unhoistable `global_load_*` (the optimiser is inhibited from forwarding / merging). |
+| AMDGPU           | LLVM `load volatile` → unhoistable `global_load_*` (the optimizer is inhibited from forwarding / merging). |
 | Vulkan / Metal   | SPIR-V `OpLoad` with the `Volatile` `MemoryAccess` mask, propagated through SPIRV-Cross to a re-read on every use in the generated MSL / GLSL. |
-| CPU (x86_64)     | LLVM `load volatile` (the optimiser cannot hoist or merge it; the runtime cost is identical to an ordinary load on x86). |
+| CPU (x86_64)     | LLVM `load volatile` (the optimizer cannot hoist or merge it; the runtime cost is identical to an ordinary load on x86). |
 
-Quadrants additionally suppresses the optimisations that would otherwise let an aliased rewrite slip past codegen:
+Quadrants additionally suppresses the optimizations that would otherwise let an aliased rewrite slip past codegen:
 
 - `cache_loop_invariant_global_vars` does not hoist a volatile load out of an enclosing loop.
 - `simplify` does not replace a volatile load with the value of an earlier load of the same address.
@@ -199,7 +199,7 @@ The decoupled-look-back scan in [grid](grid.md) shows the full pattern.
 
 Every `qd.atomic_*` is emitted at **device-wide scope**: visible to all threads on the GPU executing the kernel, but not required to be coherent with the host CPU mid-kernel. The host only observes results once the kernel completes, at which point the launcher's stream-sync flushes everything regardless. Choosing device scope (rather than the strongest "system" scope) lets every backend lower the op to a single hardware atomic instruction instead of a software CAS retry loop, which matters for correctness as much as for speed: under heavy contention, a CAS loop on a non-converging op like `atomic_xor` can livelock.
 
-You don't normally need to think about scope as a user. It's listed here so the per-backend behaviour is explicit:
+You don't normally need to think about scope as a user. It's listed here so the per-backend behavior is explicit:
 
 | Backend                 | Scope spelling in the IR          |
 |-------------------------|-----------------------------------|

diff --git a/docs/source/user_guide/autodiff.md b/docs/source/user_guide/autodiff.md
@@ -298,7 +298,7 @@ The on-device sizer relies on two common hardware features (64-bit integer arith
 `qd.init()` exposes two escape hatches:
 
 - `ad_stack_size=N` (default `0`): forces every adstack to exactly `N` slots and bypasses the sizer. Leave at `0` in day-to-day use; positive `N` is for stress tests or working around a suspected sizer bug.
-- `ad_stack_sparse_threshold_bytes=B` (default `100 MiB`): cutoff below which the gate-passing-count sizing of [Memory footprint](#memory-footprint) is skipped in favour of the eager `dispatched_threads * stride` heap. The sparse path saves memory but pays a per-launch reducer dispatch; below `B` of conservative heap, that overhead outweighs the savings. Set to `0` to always use the sparse path; lower it if the default still skips kernels you want shrunk.
+- `ad_stack_sparse_threshold_bytes=B` (default `100 MiB`): cutoff below which the gate-passing-count sizing of [Memory footprint](#memory-footprint) is skipped in favor of the eager `dispatched_threads * stride` heap. The sparse path saves memory but pays a per-launch reducer dispatch; below `B` of conservative heap, that overhead outweighs the savings. Set to `0` to always use the sparse path; lower it if the default still skips kernels you want shrunk.
 
 #### Memory footprint
 
@@ -379,7 +379,7 @@ A reverse-mode kernel with two nested loops is in some cases limited to an outer
 
 ## Appendix A: types of dynamic loops supported by reverse-mode AD
 
-The compiler recognises the following bound shapes for adstack-aware loops:
+The compiler recognizes the following bound shapes for adstack-aware loops:
 
 | Bound shape | Example |
 | --- | --- |

diff --git a/docs/source/user_guide/block.md b/docs/source/user_guide/block.md
@@ -155,7 +155,7 @@ A generic `block.reduce(value, block_dim, op, dtype)` is also available for cust
 
 ### `block.reduce_all_{add,min,max}(value, block_dim, dtype)`
 
-The broadcast variants of the above. Identical semantics, but the result is published to a one-slot `SharedArray` and read back by every thread after a second `block.sync()`. Use this when downstream code on every thread needs the block-wide aggregate (e.g. normalising each thread's value by the block sum). Cost: one extra `block.sync()` plus one shared-memory hop vs. the lane-0-only variants. The corresponding generic form is `block.reduce_all(value, block_dim, op, dtype)`.
+The broadcast variants of the above. Identical semantics, but the result is published to a one-slot `SharedArray` and read back by every thread after a second `block.sync()`. Use this when downstream code on every thread needs the block-wide aggregate (e.g. normalizing each thread's value by the block sum). Cost: one extra `block.sync()` plus one shared-memory hop vs. the lane-0-only variants. The corresponding generic form is `block.reduce_all(value, block_dim, op, dtype)`.
 
 ### `block.inclusive_{add,min,max}(value, block_dim, dtype)`
 

diff --git a/docs/source/user_guide/compound_types.md b/docs/source/user_guide/compound_types.md
@@ -219,7 +219,7 @@ state.step()
 
 ### Under the hood
 
-Like `dataclasses.dataclass`, a `@qd.data_oriented` object is Python-only — the compiler flattens it into individual kernel parameters and the object itself has no kernel-side representation. Unlike `dataclasses.dataclass` it needs no member annotations: the compiler reads the live instance's attributes directly. Primitive members are baked into the kernel as constants, so each distinct primitive value compiles a new specialised kernel.
+Like `dataclasses.dataclass`, a `@qd.data_oriented` object is Python-only — the compiler flattens it into individual kernel parameters and the object itself has no kernel-side representation. Unlike `dataclasses.dataclass` it needs no member annotations: the compiler reads the live instance's attributes directly. Primitive members are baked into the kernel as constants, so each distinct primitive value compiles a new specialized kernel.
 
 ## qd.dataclass / qd.types.struct
 
@@ -292,7 +292,7 @@ Unlike the other two compound types, `@qd.dataclass` is a real kernel-side type
 
 ## Nesting compatibility
 
-This table summarises which member types are allowed inside which container type. "yes" means the member is walked correctly when the container is passed to a kernel; "no" means the member is ignored or the combination raises an error.
+This table summarizes which member types are allowed inside which container type. "yes" means the member is walked correctly when the container is passed to a kernel; "no" means the member is ignored or the combination raises an error.
 
 | Container ↓ &nbsp;&nbsp;&nbsp; / &nbsp;&nbsp;&nbsp; Member → | `qd.ndarray` | `qd.field` | primitive | `dataclasses.dataclass` | `@qd.data_oriented` | `@qd.dataclass` |
 |---|:---:|:---:|:---:|:---:|:---:|:---:|
@@ -316,7 +316,7 @@ Practical consequence:
 
 ### Reassigning ndarray members
 
-For `@qd.data_oriented` containers passed via `qd.Template`, reassigning an ndarray member between kernel launches is supported, including changes to `dtype`, `ndim`, or layout. A new specialised kernel is compiled and cached for the new shape; subsequent launches with the original shape continue to use the original cached kernel. (For `@dataclasses.dataclass` containers — passed via the dataclass-type annotation — the member binding follows the standard dataclass mutability rules: frozen dataclasses can't rebind, non-frozen ones can, and a rebind triggers a fresh kernel arg setup on the next launch.)
+For `@qd.data_oriented` containers passed via `qd.Template`, reassigning an ndarray member between kernel launches is supported, including changes to `dtype`, `ndim`, or layout. A new specialized kernel is compiled and cached for the new shape; subsequent launches with the original shape continue to use the original cached kernel. (For `@dataclasses.dataclass` containers — passed via the dataclass-type annotation — the member binding follows the standard dataclass mutability rules: frozen dataclasses can't rebind, non-frozen ones can, and a rebind triggers a fresh kernel arg setup on the next launch.)
 
 ### Restrictions
 

diff --git a/docs/source/user_guide/getting_started.md b/docs/source/user_guide/getting_started.md
@@ -108,7 +108,7 @@ qd.sync()
 end = time.time()
 ```
 
-In addition, whilst it looks like we aren't using the gpu before this, in fact we are: when we create the NDArray, the ndarray needs to be created in GPU memory, and again this happens asynchronously. So before calling start we also add qd.sync():
+In addition, while it looks like we aren't using the gpu before this, in fact we are: when we create the NDArray, the ndarray needs to be created in GPU memory, and again this happens asynchronously. So before calling start we also add qd.sync():
 
 ```python
 qd.sync()

diff --git a/docs/source/user_guide/graph.md b/docs/source/user_guide/graph.md
@@ -224,7 +224,7 @@ When the body of a checkpoint writes a non-zero value into `yield_on[()]`:
 
 The framework never writes into your `yield_on` buffer — you own it end-to-end. That means:
 
-- Before the **first** launch, initialise it to `0` (a freshly allocated `qd.ndarray` is not guaranteed to be zeroed).
+- Before the **first** launch, initialize it to `0` (a freshly allocated `qd.ndarray` is not guaranteed to be zeroed).
 - :warning: Before each **resume** launch, reset it to `0` (otherwise the body of the same checkpoint sees the stale non-zero value and yields again on the same condition, looping forever).
 
 ### Host-side yield / resume loop
@@ -237,7 +237,7 @@ Kernels annotated with `checkpoints=True` return a `qd.GraphStatus` from every l
 Resume by calling `kernel.resume(..., from_checkpoint=label)`. Everything before `label` in source order is skipped on the resume launch; everything from `label` onward runs normally. The canonical host loop:
 
 ```python
-overflow_flag[()] = 0  # initialise before the first launch
+overflow_flag[()] = 0  # initialize before the first launch
 status = step(arr, overflow_flag, newton_cond)
 while status.yielded:
     handle_overflow_for(status.checkpoint, ...)

diff --git a/docs/source/user_guide/grid.md b/docs/source/user_guide/grid.md
@@ -91,7 +91,7 @@ is therefore unsafe: LLVM's loop-invariant-code-motion will hoist the load out o
       pass
   ```
 
-  `qd.volatile_load` lowers to LLVM `load volatile` on CUDA / AMDGPU and to `OpLoad` with the SPIR-V `Volatile` `MemoryAccess` mask on Vulkan / Metal — the optimiser is forbidden from hoisting / merging the load on every backend, with no per-iteration cache-flush or atomic-RMW overhead. See [atomics](atomics.md) for the full primitive description, including the producer-side pairing requirements (atomic store, or plain store + fence on non-Metal backends).
+  `qd.volatile_load` lowers to LLVM `load volatile` on CUDA / AMDGPU and to `OpLoad` with the SPIR-V `Volatile` `MemoryAccess` mask on Vulkan / Metal — the optimizer is forbidden from hoisting / merging the load on every backend, with no per-iteration cache-flush or atomic-RMW overhead. See [atomics](atomics.md) for the full primitive description, including the producer-side pairing requirements (atomic store, or plain store + fence on non-Metal backends).
 
 - **Fence inside the loop body** (used in the example above; legacy approach):
 

diff --git a/docs/source/user_guide/init_options.md b/docs/source/user_guide/init_options.md
@@ -62,7 +62,7 @@ Forces every adstack in the program to exactly `N` slots and bypasses the launch
 
 ### `ad_stack_sparse_threshold_bytes`
 
-Cutoff (in bytes) below which the gate-passing-count sizing path described in [Memory footprint](./autodiff.md#memory-footprint) is skipped in favour of the eager `dispatched_threads * stride` heap. Default `100 MiB`. The sparse path saves memory on kernels of the shape `for i in range(...): if field[i] cmp literal: <adstack work>` but pays a per-launch reducer dispatch; below the threshold that overhead outweighs the savings. Set to `0` to always use the sparse path; lower it if the default still skips kernels you want shrunk. No effect when `ad_stack_experimental_enabled=False` or when the kernel has no such gate.
+Cutoff (in bytes) below which the gate-passing-count sizing path described in [Memory footprint](./autodiff.md#memory-footprint) is skipped in favor of the eager `dispatched_threads * stride` heap. Default `100 MiB`. The sparse path saves memory on kernels of the shape `for i in range(...): if field[i] cmp literal: <adstack work>` but pays a per-launch reducer dispatch; below the threshold that overhead outweighs the savings. Set to `0` to always use the sparse path; lower it if the default still skips kernels you want shrunk. No effect when `ad_stack_experimental_enabled=False` or when the kernel has no such gate.
 
 ## Apple Metal
 
@@ -74,7 +74,7 @@ An `MTLCommandQueue*` pointer (as an integer) to use instead of creating a new M
 
 Default `False`. Set to `True` when the `external_metal_command_queue` is PyTorch MPS's command queue. This tells Quadrants that both frameworks share the same Metal queue, so the explicit `qd.sync()` / `torch.mps.synchronize()` calls at `to_torch` / `from_torch` interop points can be skipped. When `False` (or when no external queue is set), the interop syncs are preserved.
 
-See [Shared Metal command queue](./metal_shared_queue.md) for the full setup guide, including how to extract the queue pointer from PyTorch and the synchronisation implications.
+See [Shared Metal command queue](./metal_shared_queue.md) for the full setup guide, including how to extract the queue pointer from PyTorch and the synchronization implications.
 
 ## Debugging