diff --git a/docs/source/user_guide/algorithms.md b/docs/source/user_guide/algorithms.md
index 272d1271f4..53c18ef07a 100644
--- a/docs/source/user_guide/algorithms.md
+++ b/docs/source/user_guide/algorithms.md
@@ -102,7 +102,7 @@ scratch64 = qd.ndarray(qd.u64, shape=qd.algorithms.reduce_scratch_slots(N, D))

 ## Semantics

-The active ops below share a calling convention and several rules; these are stated once in **Common conventions**, and only the op-specific behaviour is repeated per op. The internal algorithm for each op is in [Under the hood](#under-the-hood).
+The active ops below share a calling convention and several rules; these are stated once in **Common conventions**, and only the op-specific behavior is repeated per op. The internal algorithm for each op is in [Under the hood](#under-the-hood).

 Each op section ends with a runnable toy example. They all assume this prelude:

diff --git a/docs/source/user_guide/atomics.md b/docs/source/user_guide/atomics.md
index 6f2e739650..d30d4a1968 100644
--- a/docs/source/user_guide/atomics.md
+++ b/docs/source/user_guide/atomics.md
@@ -22,7 +22,7 @@ All atomic ops follow the same shape: `qd.atomic_op(x, y)` performs `x = op(x, y

 A few cross-cutting notes that the cells above abbreviate:

-- **`atomic_sub` is not a separate op in the IR.** `quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten` rewrites every `atomic_sub(x, y)` into `atomic_add(x, -y)` before codegen sees it, so per-backend support and per-dtype behaviour are exactly those of `atomic_add`.
+- **`atomic_sub` is not a separate op in the IR.** `quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten` rewrites every `atomic_sub(x, y)` into `atomic_add(x, -y)` before codegen sees it, so per-backend support and per-dtype behavior are exactly those of `atomic_add`.
 - **CAS-loop ops are noticeably slower than native atomics**, especially under contention — every contending thread retries the load + compare-exchange until it wins. Prefer pre-aggregating into a register or shared array and issuing a single atomic at the end of the block where possible.
 - **f16 floats always use a CAS loop** (no native f16 atomic on any backend except SPIR-V with the right capability bit).
 - **On CPU, "native" does not guarantee a single machine instruction.** On x86 and other architectures without hardware float atomics, the compiler backend lowers native float `atomic_add` (and integer `min` / `max`) to a CAS loop in machine code. Under high contention the performance is similar to the explicit "CAS" entries; the difference is that "native" ops benefit from hardware acceleration where available.
@@ -71,7 +71,7 @@ Bitwise atomics. Integer dtypes only — passing `f32` / `f64` raises a type err

 ### `qd.atomic_sub(x, y)` / `qd.atomic_mul(x, y)`

-Atomic subtract and atomic multiply. `atomic_sub` is rewritten to `atomic_add(x, -y)` at IR-construction time (`quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten`), so its per-backend behaviour is identical to `atomic_add`. `atomic_mul` always lowers to a CAS loop - no LLVM AtomicRMW or SPIR-V `OpAtomic*` op corresponds to multiply - and is intentionally not heavily optimised; prefer reducing to a different scheme on hot paths.
+Atomic subtract and atomic multiply. `atomic_sub` is rewritten to `atomic_add(x, -y)` at IR-construction time (`quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten`), so its per-backend behavior is identical to `atomic_add`. `atomic_mul` always lowers to a CAS loop - no LLVM AtomicRMW or SPIR-V `OpAtomic*` op corresponds to multiply - and is intentionally not heavily optimized; prefer reducing to a different scheme on hot paths.

 ### `qd.atomic_exchange(x, y)`

@@ -149,11 +149,11 @@ val = qd.volatile_load(target)
 | Backend | Lowering |
 |------------------|-------------------------------------------------------------------------------------------|
 | CUDA | LLVM `load volatile` → PTX `ld.volatile.global`. |
-| AMDGPU | LLVM `load volatile` → unhoistable `global_load_*` (the optimiser is inhibited from forwarding / merging). |
+| AMDGPU | LLVM `load volatile` → unhoistable `global_load_*` (the optimizer is inhibited from forwarding / merging). |
 | Vulkan / Metal | SPIR-V `OpLoad` with the `Volatile` `MemoryAccess` mask, propagated through SPIRV-Cross to a re-read on every use in the generated MSL / GLSL. |
-| CPU (x86_64) | LLVM `load volatile` (the optimiser cannot hoist or merge it; the runtime cost is identical to an ordinary load on x86). |
+| CPU (x86_64) | LLVM `load volatile` (the optimizer cannot hoist or merge it; the runtime cost is identical to an ordinary load on x86). |

-Quadrants additionally suppresses the optimisations that would otherwise let an aliased rewrite slip past codegen:
+Quadrants additionally suppresses the optimizations that would otherwise let an aliased rewrite slip past codegen:

 - `cache_loop_invariant_global_vars` does not hoist a volatile load out of an enclosing loop.
 - `simplify` does not replace a volatile load with the value of an earlier load of the same address.
@@ -199,7 +199,7 @@ The decoupled-look-back scan in [grid](grid.md) shows the full pattern.

 Every `qd.atomic_*` is emitted at **device-wide scope**: visible to all threads on the GPU executing the kernel, but not required to be coherent with the host CPU mid-kernel. The host only observes results once the kernel completes, at which point the launcher's stream-sync flushes everything regardless. Choosing device scope (rather than the strongest "system" scope) lets every backend lower the op to a single hardware atomic instruction instead of a software CAS retry loop, which matters for correctness as much as for speed: under heavy contention, a CAS loop on a non-converging op like `atomic_xor` can livelock.

-You don't normally need to think about scope as a user. It's listed here so the per-backend behaviour is explicit:
+You don't normally need to think about scope as a user. It's listed here so the per-backend behavior is explicit:

 | Backend | Scope spelling in the IR |
 |-------------------------|-----------------------------------|
diff --git a/docs/source/user_guide/autodiff.md b/docs/source/user_guide/autodiff.md
index dfbe1c5750..f3b33e81a5 100644
--- a/docs/source/user_guide/autodiff.md
+++ b/docs/source/user_guide/autodiff.md
@@ -298,7 +298,7 @@ The on-device sizer relies on two common hardware features (64-bit integer arith
 `qd.init()` exposes two escape hatches:

 - `ad_stack_size=N` (default `0`): forces every adstack to exactly `N` slots and bypasses the sizer. Leave at `0` in day-to-day use; positive `N` is for stress tests or working around a suspected sizer bug.
-- `ad_stack_sparse_threshold_bytes=B` (default `100 MiB`): cutoff below which the gate-passing-count sizing of [Memory footprint](#memory-footprint) is skipped in favour of the eager `dispatched_threads * stride` heap. The sparse path saves memory but pays a per-launch reducer dispatch; below `B` of conservative heap, that overhead outweighs the savings. Set to `0` to always use the sparse path; lower it if the default still skips kernels you want shrunk.
+- `ad_stack_sparse_threshold_bytes=B` (default `100 MiB`): cutoff below which the gate-passing-count sizing of [Memory footprint](#memory-footprint) is skipped in favor of the eager `dispatched_threads * stride` heap. The sparse path saves memory but pays a per-launch reducer dispatch; below `B` of conservative heap, that overhead outweighs the savings. Set to `0` to always use the sparse path; lower it if the default still skips kernels you want shrunk.

 #### Memory footprint

@@ -379,7 +379,7 @@ A reverse-mode kernel with two nested loops is in some cases limited to an outer

 ## Appendix A: types of dynamic loops supported by reverse-mode AD

-The compiler recognises the following bound shapes for adstack-aware loops:
+The compiler recognizes the following bound shapes for adstack-aware loops:

 | Bound shape | Example |
 | --- | --- |
diff --git a/docs/source/user_guide/block.md b/docs/source/user_guide/block.md
index c397b56a75..5c1eeb34a6 100644
--- a/docs/source/user_guide/block.md
+++ b/docs/source/user_guide/block.md
@@ -155,7 +155,7 @@ A generic `block.reduce(value, block_dim, op, dtype)` is also available for cust

 ### `block.reduce_all_{add,min,max}(value, block_dim, dtype)`

-The broadcast variants of the above. Identical semantics, but the result is published to a one-slot `SharedArray` and read back by every thread after a second `block.sync()`. Use this when downstream code on every thread needs the block-wide aggregate (e.g. normalising each thread's value by the block sum). Cost: one extra `block.sync()` plus one shared-memory hop vs. the lane-0-only variants. The corresponding generic form is `block.reduce_all(value, block_dim, op, dtype)`.
+The broadcast variants of the above. Identical semantics, but the result is published to a one-slot `SharedArray` and read back by every thread after a second `block.sync()`. Use this when downstream code on every thread needs the block-wide aggregate (e.g. normalizing each thread's value by the block sum). Cost: one extra `block.sync()` plus one shared-memory hop vs. the lane-0-only variants. The corresponding generic form is `block.reduce_all(value, block_dim, op, dtype)`.

 ### `block.inclusive_{add,min,max}(value, block_dim, dtype)`

diff --git a/docs/source/user_guide/compound_types.md b/docs/source/user_guide/compound_types.md
index e2d000bcbb..b8659de33c 100644
--- a/docs/source/user_guide/compound_types.md
+++ b/docs/source/user_guide/compound_types.md
@@ -219,7 +219,7 @@ state.step()

 ### Under the hood

-Like `dataclasses.dataclass`, a `@qd.data_oriented` object is Python-only — the compiler flattens it into individual kernel parameters and the object itself has no kernel-side representation. Unlike `dataclasses.dataclass` it needs no member annotations: the compiler reads the live instance's attributes directly. Primitive members are baked into the kernel as constants, so each distinct primitive value compiles a new specialised kernel.
+Like `dataclasses.dataclass`, a `@qd.data_oriented` object is Python-only — the compiler flattens it into individual kernel parameters and the object itself has no kernel-side representation. Unlike `dataclasses.dataclass` it needs no member annotations: the compiler reads the live instance's attributes directly. Primitive members are baked into the kernel as constants, so each distinct primitive value compiles a new specialized kernel.

 ## qd.dataclass / qd.types.struct

@@ -292,7 +292,7 @@ Unlike the other two compound types, `@qd.dataclass` is a real kernel-side type

 ## Nesting compatibility

-This table summarises which member types are allowed inside which container type. "yes" means the member is walked correctly when the container is passed to a kernel; "no" means the member is ignored or the combination raises an error.
+This table summarizes which member types are allowed inside which container type. "yes" means the member is walked correctly when the container is passed to a kernel; "no" means the member is ignored or the combination raises an error.

 | Container ↓     /     Member → | `qd.ndarray` | `qd.field` | primitive | `dataclasses.dataclass` | `@qd.data_oriented` | `@qd.dataclass` |
 |---|:---:|:---:|:---:|:---:|:---:|:---:|
@@ -316,7 +316,7 @@ Practical consequence:

 ### Reassigning ndarray members

-For `@qd.data_oriented` containers passed via `qd.Template`, reassigning an ndarray member between kernel launches is supported, including changes to `dtype`, `ndim`, or layout. A new specialised kernel is compiled and cached for the new shape; subsequent launches with the original shape continue to use the original cached kernel. (For `@dataclasses.dataclass` containers — passed via the dataclass-type annotation — the member binding follows the standard dataclass mutability rules: frozen dataclasses can't rebind, non-frozen ones can, and a rebind triggers a fresh kernel arg setup on the next launch.)
+For `@qd.data_oriented` containers passed via `qd.Template`, reassigning an ndarray member between kernel launches is supported, including changes to `dtype`, `ndim`, or layout. A new specialized kernel is compiled and cached for the new shape; subsequent launches with the original shape continue to use the original cached kernel. (For `@dataclasses.dataclass` containers — passed via the dataclass-type annotation — the member binding follows the standard dataclass mutability rules: frozen dataclasses can't rebind, non-frozen ones can, and a rebind triggers a fresh kernel arg setup on the next launch.)

 ### Restrictions

diff --git a/docs/source/user_guide/getting_started.md b/docs/source/user_guide/getting_started.md
index d4776ac82b..db3dffff3d 100644
--- a/docs/source/user_guide/getting_started.md
+++ b/docs/source/user_guide/getting_started.md
@@ -108,7 +108,7 @@ qd.sync()
 end = time.time()
 ```

-In addition, whilst it looks like we aren't using the gpu before this, in fact we are: when we create the NDArray, the ndarray needs to be created in GPU memory, and again this happens asynchronously. So before calling start we also add qd.sync():
+In addition, while it looks like we aren't using the gpu before this, in fact we are: when we create the NDArray, the ndarray needs to be created in GPU memory, and again this happens asynchronously. So before calling start we also add qd.sync():

 ```python
 qd.sync()
diff --git a/docs/source/user_guide/graph.md b/docs/source/user_guide/graph.md
index b1c239b080..b018a6194f 100644
--- a/docs/source/user_guide/graph.md
+++ b/docs/source/user_guide/graph.md
@@ -224,7 +224,7 @@ When the body of a checkpoint writes a non-zero value into `yield_on[()]`:

 The framework never writes into your `yield_on` buffer — you own it end-to-end. That means:

-- Before the **first** launch, initialise it to `0` (a freshly allocated `qd.ndarray` is not guaranteed to be zeroed).
+- Before the **first** launch, initialize it to `0` (a freshly allocated `qd.ndarray` is not guaranteed to be zeroed).
 - :warning: Before each **resume** launch, reset it to `0` (otherwise the body of the same checkpoint sees the stale non-zero value and yields again on the same condition, looping forever).

 ### Host-side yield / resume loop
@@ -237,7 +237,7 @@ Kernels annotated with `checkpoints=True` return a `qd.GraphStatus` from every l
 Resume by calling `kernel.resume(..., from_checkpoint=label)`. Everything before `label` in source order is skipped on the resume launch; everything from `label` onward runs normally. The canonical host loop:

 ```python
-overflow_flag[()] = 0 # initialise before the first launch
+overflow_flag[()] = 0 # initialize before the first launch
 status = step(arr, overflow_flag, newton_cond)
 while status.yielded:
     handle_overflow_for(status.checkpoint, ...)
diff --git a/docs/source/user_guide/grid.md b/docs/source/user_guide/grid.md
index c8b08f9c09..6690729ad6 100644
--- a/docs/source/user_guide/grid.md
+++ b/docs/source/user_guide/grid.md
@@ -91,7 +91,7 @@ is therefore unsafe: LLVM's loop-invariant-code-motion will hoist the load out o
       pass
   ```

-  `qd.volatile_load` lowers to LLVM `load volatile` on CUDA / AMDGPU and to `OpLoad` with the SPIR-V `Volatile` `MemoryAccess` mask on Vulkan / Metal — the optimiser is forbidden from hoisting / merging the load on every backend, with no per-iteration cache-flush or atomic-RMW overhead. See [atomics](atomics.md) for the full primitive description, including the producer-side pairing requirements (atomic store, or plain store + fence on non-Metal backends).
+  `qd.volatile_load` lowers to LLVM `load volatile` on CUDA / AMDGPU and to `OpLoad` with the SPIR-V `Volatile` `MemoryAccess` mask on Vulkan / Metal — the optimizer is forbidden from hoisting / merging the load on every backend, with no per-iteration cache-flush or atomic-RMW overhead. See [atomics](atomics.md) for the full primitive description, including the producer-side pairing requirements (atomic store, or plain store + fence on non-Metal backends).

 - **Fence inside the loop body** (used in the example above; legacy approach):

diff --git a/docs/source/user_guide/init_options.md b/docs/source/user_guide/init_options.md
index 325b41acd2..a74a7fbea0 100644
--- a/docs/source/user_guide/init_options.md
+++ b/docs/source/user_guide/init_options.md
@@ -62,7 +62,7 @@ Forces every adstack in the program to exactly `N` slots and bypasses the launch

 ### `ad_stack_sparse_threshold_bytes`

-Cutoff (in bytes) below which the gate-passing-count sizing path described in [Memory footprint](./autodiff.md#memory-footprint) is skipped in favour of the eager `dispatched_threads * stride` heap. Default `100 MiB`. The sparse path saves memory on kernels of the shape `for i in range(...): if field[i] cmp literal: ` but pays a per-launch reducer dispatch; below the threshold that overhead outweighs the savings. Set to `0` to always use the sparse path; lower it if the default still skips kernels you want shrunk. No effect when `ad_stack_experimental_enabled=False` or when the kernel has no such gate.
+Cutoff (in bytes) below which the gate-passing-count sizing path described in [Memory footprint](./autodiff.md#memory-footprint) is skipped in favor of the eager `dispatched_threads * stride` heap. Default `100 MiB`. The sparse path saves memory on kernels of the shape `for i in range(...): if field[i] cmp literal: ` but pays a per-launch reducer dispatch; below the threshold that overhead outweighs the savings. Set to `0` to always use the sparse path; lower it if the default still skips kernels you want shrunk. No effect when `ad_stack_experimental_enabled=False` or when the kernel has no such gate.

 ## Apple Metal

@@ -74,7 +74,7 @@ An `MTLCommandQueue*` pointer (as an integer) to use instead of creating a new M

 Default `False`. Set to `True` when the `external_metal_command_queue` is PyTorch MPS's command queue. This tells Quadrants that both frameworks share the same Metal queue, so the explicit `qd.sync()` / `torch.mps.synchronize()` calls at `to_torch` / `from_torch` interop points can be skipped. When `False` (or when no external queue is set), the interop syncs are preserved.

-See [Shared Metal command queue](./metal_shared_queue.md) for the full setup guide, including how to extract the queue pointer from PyTorch and the synchronisation implications.
+See [Shared Metal command queue](./metal_shared_queue.md) for the full setup guide, including how to extract the queue pointer from PyTorch and the synchronization implications.

 ## Debugging

diff --git a/docs/source/user_guide/interop.md b/docs/source/user_guide/interop.md
index 196eea93ea..42c32c4af8 100644
--- a/docs/source/user_guide/interop.md
+++ b/docs/source/user_guide/interop.md
@@ -103,13 +103,13 @@ On **NumPy >= 2.1**, `to_numpy(copy=False)` returns a **writable** array (via a

 ### Semantics of `copy`

-| Value | Behaviour |
+| Value | Behavior |
 |---|---|
 | `True` (default) | Independent copy via kernel. |
 | `None` | Zero-copy view via DLPack when available, otherwise falls back to a copy silently. |
 | `False` | Zero-copy view via DLPack, or `ValueError` if zero-copy is unsupported for this backend/dtype. |

-The default `copy=True` always returns a buffer that is safe to mutate without affecting the field/ndarray. Use `copy=None` when you want zero-copy as a best-effort optimisation without having to handle exceptions — it gives you a view when possible and a safe copy otherwise.
+The default `copy=True` always returns a buffer that is safe to mutate without affecting the field/ndarray. Use `copy=None` when you want zero-copy as a best-effort optimization without having to handle exceptions — it gives you a view when possible and a safe copy otherwise.

 ### Examples

@@ -151,7 +151,7 @@ v2 = f.to_torch(copy=False)
 assert v1.data_ptr() == v2.data_ptr() # same underlying memory
 ```

-### Apple Metal: synchronisation
+### Apple Metal: synchronization

 On Apple Metal, Quadrants and PyTorch MPS use separate Metal command queues. Every `to_torch()` / `to_numpy()` call runs `qd.sync()` internally to flush the Quadrants queue. Additionally, `copy=True` (the default) calls `torch.mps.synchronize()` after the kernel copy. This is necessary because, on Metal, Quadrants and Torch do not share the same compute streams. `copy=False` does **not** call `torch.mps.synchronize()`:

@@ -164,7 +164,7 @@ view = f.to_torch(copy=False) # qd.sync() only
 copy = f.to_torch(copy=True) # qd.sync() + torch.mps.synchronize()
 ```

-The reverse direction (PyTorch writes to a zero-copy view, then a Quadrants kernel reads from the same field) is **not** automatically synchronised. Because Quadrants and PyTorch MPS submit work to separate Metal command queues, a kernel launched immediately after a torch write may execute before the torch write has actually committed to memory:
+The reverse direction (PyTorch writes to a zero-copy view, then a Quadrants kernel reads from the same field) is **not** automatically synchronized. Because Quadrants and PyTorch MPS submit work to separate Metal command queues, a kernel launched immediately after a torch write may execute before the torch write has actually committed to memory:

 ```python
 qd.init(arch=qd.metal)
@@ -180,11 +180,11 @@ my_kernel(f) # now safe

 This is intentional: forcing a sync on every Quadrants kernel that touches a previously-zerocopied field would be very expensive in workloads that batch many torch ops and many kernels back-to-back. If you mutate fields from torch and then read them from a Quadrants kernel on Metal, call `torch.mps.synchronize()` once between the torch ops and the kernels.

-**Shared command queue.** The synchronisation overhead above can be eliminated entirely by passing PyTorch MPS's `MTLCommandQueue` to Quadrants at init time via `external_metal_command_queue`. Quadrants provides `quadrants.interop.get_mps_command_queue()` to extract the queue pointer at runtime. When both frameworks share the same queue, Metal guarantees command buffer ordering automatically. See [Shared Metal command queue](./metal_shared_queue.md) for the setup guide.
+**Shared command queue.** The synchronization overhead above can be eliminated entirely by passing PyTorch MPS's `MTLCommandQueue` to Quadrants at init time via `external_metal_command_queue`. Quadrants provides `quadrants.interop.get_mps_command_queue()` to extract the queue pointer at runtime. When both frameworks share the same queue, Metal guarantees command buffer ordering automatically. See [Shared Metal command queue](./metal_shared_queue.md) for the setup guide.

 ### Lifetime caveats

-A zero-copy view becomes invalid when the underlying Quadrants storage is freed. This happens on `qd.reset()` and `qd.init()`. Holding a `copy=False` tensor across either is undefined behaviour:
+A zero-copy view becomes invalid when the underlying Quadrants storage is freed. This happens on `qd.reset()` and `qd.init()`. Holding a `copy=False` tensor across either is undefined behavior:

 ```python
 view = f.to_torch(copy=False)
@@ -198,7 +198,7 @@ The default `copy=True` produces an independent copy that is unaffected. Only `c

 `StructField.to_torch()` and `StructField.to_numpy()` return a dictionary mapping each member name to a tensor / array; the `copy` argument is propagated to each member, so zero-copy availability is decided per member. The relevant axis is the SNode layout chosen at construction:

-- **AOS** (default `Struct.field(..., layout=Layout.AOS)`): all members share the struct cell, e.g. `Struct.field({"a": i32, "b": f32}, shape=(N,))` stores `[a0, b0, a1, b1, ...]` in memory, with stride `sizeof(cell)` between consecutive `a`'s. Quadrants' C++ DLPack export does not currently emit cell-stride-aware views for individual members (it computes contiguous strides at the member dtype size, which would interleave neighbouring members' bytes), so AOS members fall back to a kernel copy and `copy=False` raises on each AOS member.
+- **AOS** (default `Struct.field(..., layout=Layout.AOS)`): all members share the struct cell, e.g. `Struct.field({"a": i32, "b": f32}, shape=(N,))` stores `[a0, b0, a1, b1, ...]` in memory, with stride `sizeof(cell)` between consecutive `a`'s. Quadrants' C++ DLPack export does not currently emit cell-stride-aware views for individual members (it computes contiguous strides at the member dtype size, which would interleave neighboring members' bytes), so AOS members fall back to a kernel copy and `copy=False` raises on each AOS member.
 - **SOA** (`Struct.field(..., layout=Layout.SOA)`): each member sits in its own dense SNode subtree with contiguous storage, so members are zero-copyable individually under the usual backend / dtype rules. `copy=False` succeeds and returns aliasing views.

 ```python
diff --git a/docs/source/user_guide/linalg_per_thread.md b/docs/source/user_guide/linalg_per_thread.md
index c223426dc6..ff45ef8b07 100644
--- a/docs/source/user_guide/linalg_per_thread.md
+++ b/docs/source/user_guide/linalg_per_thread.md
@@ -97,8 +97,8 @@ Direct solve of `A @ x = b` via Gauss elimination with partial pivoting. Returns

 - Shapes 2×2 and 3×3.
 - The implementation asserts `A.n == A.m` and `A.m == b.n`.
-- Singular `A` is checked by a kernel `assert` (`"Matrix is singular in linear solve."`) inside the Gauss-elimination path. Kernel asserts only fire when the runtime is initialised with `qd.init(debug=True)` (see [debug](debug.md)) — under the default `debug=False` a singular input silently produces a divide-by-zero / NaN result with no diagnostic. If you need a signal in production, check singularity explicitly before calling `qd.solve` (e.g. `abs(A.determinant())` against a tolerance), or run development workloads with `debug=True` to catch the case.
-- Each call factorises `A` from scratch and back-substitutes for the given `b`.
+- Singular `A` is checked by a kernel `assert` (`"Matrix is singular in linear solve."`) inside the Gauss-elimination path. Kernel asserts only fire when the runtime is initialized with `qd.init(debug=True)` (see [debug](debug.md)) — under the default `debug=False` a singular input silently produces a divide-by-zero / NaN result with no diagnostic. If you need a signal in production, check singularity explicitly before calling `qd.solve` (e.g. `abs(A.determinant())` against a tolerance), or run development workloads with `debug=True` to catch the case.
+- Each call factorizes `A` from scratch and back-substitutes for the given `b`.

 ## Examples

@@ -173,7 +173,7 @@ def shape_match(A: qd.types.matrix(2, 2, qd.f32)) -> qd.types.matrix(2, 2, qd.f3
     return R
 ```

-The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises `‖R - A‖_F` — the building block of position-based dynamics shape-matching.
+The rotation factor `R` from `A = R @ S` is the rigid alignment that minimizes `‖R - A‖_F` — the building block of position-based dynamics shape-matching.

 ## Shapes, performance, portability

diff --git a/docs/source/user_guide/math.md b/docs/source/user_guide/math.md
index e04b44c16d..b19bcb14c9 100644
--- a/docs/source/user_guide/math.md
+++ b/docs/source/user_guide/math.md
@@ -17,7 +17,7 @@ Single-thread integer-register operations. They do not access memory and do not

 All four ops **return `i32` on every backend**, regardless of input width. This matches CUDA libdevice (`__nv_popc` / `__nv_clz` / `__nv_ffs` all return `int`) and the natural width of the AMDGPU SALU bit-count / leading-zero instructions (`s_bcnt1_i32_b64`, `s_flbit_i32_b64`); SPIR-V and x64 truncate down to `i32` so the same kernel source has the same return type everywhere. The result is always non-negative and fits in 7 bits (`popcnt` ≤ 64, `clz` ≤ 64, `ffs` ≤ 64, `fns` ≤ 31 plus the `0xFFFFFFFF` not-found sentinel for the `u32` case).

-\* On SPIR-V the 64-bit path (i64 / u64) for `clz` and `ffs` is synthesised from two ext-inst calls on the 32-bit halves plus an `OpSelect`, since `GLSL.std.450 FindUMsb` and `FindILsb` are both 32-bit-only. The runtime device must advertise the `Int64` SPIR-V capability (Vulkan: `shaderInt64`); this is the same precondition any other 64-bit op would impose. On unsupported integer widths (e.g. `i8`, `i16`, `u16`) `clz`, `popcnt` and `ffs` hit `QD_NOT_IMPLEMENTED` on every backend.
+\* On SPIR-V the 64-bit path (i64 / u64) for `clz` and `ffs` is synthesized from two ext-inst calls on the 32-bit halves plus an `OpSelect`, since `GLSL.std.450 FindUMsb` and `FindILsb` are both 32-bit-only. The runtime device must advertise the `Int64` SPIR-V capability (Vulkan: `shaderInt64`); this is the same precondition any other 64-bit op would impose. On unsupported integer widths (e.g. `i8`, `i16`, `u16`) `clz`, `popcnt` and `ffs` hit `QD_NOT_IMPLEMENTED` on every backend.

 ### `qd.math.popcnt(x)`

@@ -25,11 +25,11 @@ Counts set bits in `x` and returns an `i32`. On CUDA, lowers to `__nv_popc` for

 ### `qd.math.clz(x)`

-Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. The count is over the unsigned bit pattern, so `clz(-1) == 0` and `clz(0x7FFFFFFF) == 1`. Signed and unsigned inputs lower to the same intrinsic on every backend (LLVM IR is signless for integers; SPIR-V `FindUMsb` is unsigned by definition), so `qd.math.clz(qd.u32(x))` and `qd.math.clz(qd.bit_cast(x, qd.i32))` are equivalent. On CUDA, lowers to `__nv_clz` (32-bit) and `__nv_clzll` (64-bit). On AMDGPU, lowers to the portable `llvm.ctlz` intrinsic with `is_zero_undef = false` (matching `clz(0) = bitwidth`). On SPIR-V, the 32-bit case lowers to `GLSL.std.450 FindUMsb` followed by `31 - FindUMsb`. The 64-bit case is synthesised from a hi/lo decomposition: shift the operand right by 32 to get the high i32 half, truncate for the low half, run `FindUMsb` on each, and select `31 - FindUMsb(hi)` if the high half is non-zero or `63 - FindUMsb(lo)` otherwise. `FindUMsb` returns `-1` on a zero input, so `clz(0) == 64` falls out naturally.
+Counts leading zero bits in `x` and returns an `i32`. For a 32-bit input, `clz(0) = 32`; otherwise the result is in `[0, 31]`. The count is over the unsigned bit pattern, so `clz(-1) == 0` and `clz(0x7FFFFFFF) == 1`. Signed and unsigned inputs lower to the same intrinsic on every backend (LLVM IR is signless for integers; SPIR-V `FindUMsb` is unsigned by definition), so `qd.math.clz(qd.u32(x))` and `qd.math.clz(qd.bit_cast(x, qd.i32))` are equivalent. On CUDA, lowers to `__nv_clz` (32-bit) and `__nv_clzll` (64-bit). On AMDGPU, lowers to the portable `llvm.ctlz` intrinsic with `is_zero_undef = false` (matching `clz(0) = bitwidth`). On SPIR-V, the 32-bit case lowers to `GLSL.std.450 FindUMsb` followed by `31 - FindUMsb`. The 64-bit case is synthesized from a hi/lo decomposition: shift the operand right by 32 to get the high i32 half, truncate for the low half, run `FindUMsb` on each, and select `31 - FindUMsb(hi)` if the high half is non-zero or `63 - FindUMsb(lo)` otherwise. `FindUMsb` returns `-1` on a zero input, so `clz(0) == 64` falls out naturally.

 ### `qd.math.ffs(x)`

-Finds the lowest set bit in `x` and returns its **1-indexed** position as an `i32`, with `ffs(0) == 0` (matching the CUDA `__ffs` convention). Otherwise the result is in `[1, bitwidth(x)]`, e.g. `ffs(1) == 1`, `ffs(2) == 2`, `ffs(0x80000000) == 32`. The count is over the unsigned bit pattern, so `ffs(-1) == 1` regardless of input signedness. On CUDA, lowers to libdevice's `__nv_ffs` (32-bit) / `__nv_ffsll` (64-bit), which already encode the `ffs(0) == 0` contract. On CPU and AMDGPU, lowers to `llvm.cttz` with `is_zero_undef = false` plus an explicit `select` for the zero case (cttz returns bitwidth on zero, so `cttz + 1` would otherwise yield `bitwidth + 1`). On SPIR-V, the 32-bit case lowers to `GLSL.std.450 FindILsb` plus a `+1` and a zero-input select; the 64-bit case is synthesised from a hi/lo decomposition that consults the low half first (since "first" set bit means lowest-indexed) and falls back to the high half offset by 32 when the low half is zero.
+Finds the lowest set bit in `x` and returns its **1-indexed** position as an `i32`, with `ffs(0) == 0` (matching the CUDA `__ffs` convention). Otherwise the result is in `[1, bitwidth(x)]`, e.g. `ffs(1) == 1`, `ffs(2) == 2`, `ffs(0x80000000) == 32`. The count is over the unsigned bit pattern, so `ffs(-1) == 1` regardless of input signedness. On CUDA, lowers to libdevice's `__nv_ffs` (32-bit) / `__nv_ffsll` (64-bit), which already encode the `ffs(0) == 0` contract. On CPU and AMDGPU, lowers to `llvm.cttz` with `is_zero_undef = false` plus an explicit `select` for the zero case (cttz returns bitwidth on zero, so `cttz + 1` would otherwise yield `bitwidth + 1`). On SPIR-V, the 32-bit case lowers to `GLSL.std.450 FindILsb` plus a `+1` and a zero-input select; the 64-bit case is synthesized from a hi/lo decomposition that consults the low half first (since "first" set bit means lowest-indexed) and falls back to the high half offset by 32 when the low half is zero.

 ### `qd.math.fns(mask, base, offset)`

diff --git a/docs/source/user_guide/matrix_vector.md b/docs/source/user_guide/matrix_vector.md
index ac1fd6225b..43cf19d2ad 100644
--- a/docs/source/user_guide/matrix_vector.md
+++ b/docs/source/user_guide/matrix_vector.md
@@ -190,9 +190,9 @@ The generated struct in this example has 65 scalar members (`_a0..._a31`, `_b0..
## How the packed vs unpacked layout differs at the LLVM level -A plain `qd.types.vector(N, dtype)` field on a `@qd.dataclass` lowers to a single stack-allocated group of `N` packed scalars. LLVM's optimiser attempts to decompose that group into `N` per-slot register-resident values, but the decomposition is conservative: under high register pressure (e.g. two concurrent 32×32 tiles in a Cholesky + triangular solve) the optimiser bails out and the whole vector spills to local memory as a unit. +A plain `qd.types.vector(N, dtype)` field on a `@qd.dataclass` lowers to a single stack-allocated group of `N` packed scalars. LLVM's optimizer attempts to decompose that group into `N` per-slot register-resident values, but the decomposition is conservative: under high register pressure (e.g. two concurrent 32×32 tiles in a Cholesky + triangular solve) the optimizer bails out and the whole vector spills to local memory as a unit. -`qd.types.vector(N, dtype, unpacked=True)` expands to `N` independent scalar stack slots, one per element, so the optimiser can promote each slot to a register independently and the register allocator can spill only the slots it has to. That is exactly what the hand-rolled `r0..r{N-1}` form produces; the generated LLVM IR / PTX matches it byte-for-byte. +`qd.types.vector(N, dtype, unpacked=True)` expands to `N` independent scalar stack slots, one per element, so the optimizer can promote each slot to a register independently and the register allocator can spill only the slots it has to. That is exactly what the hand-rolled `r0..r{N-1}` form produces; the generated LLVM IR / PTX matches it byte-for-byte. ## How to check for spills @@ -243,13 +243,13 @@ ncu --set full --section MemoryWorkloadAnalysis ./your_program Look at the "Memory Workload Analysis -> Local Memory" section. This reports *actually executed* local-memory loads / stores, which catches issues `ptxas` doesn't (e.g. 
driver-stage JIT spills on a different GPU, hot-path-only spills that static analysis misses). -### Also useful: post-optimisation LLVM IR +### Also useful: post-optimization LLVM IR ```python qd.init(arch=qd.cuda, print_kernel_llvm_ir_optimized=True) ``` -Dumps `quadrants_kernel_cuda_llvm_ir_optimized_NNNN.ll`. In LLVM IR, every per-function stack allocation appears as an `alloca` instruction. The optimiser tries to promote each `alloca` into a register-resident value; any `alloca` that survives the optimiser into the post-optimisation dump is a stack slot it couldn't promote, and it will become PTX local memory. Grep for them: +Dumps `quadrants_kernel_cuda_llvm_ir_optimized_NNNN.ll`. In LLVM IR, every per-function stack allocation appears as an `alloca` instruction. The optimizer tries to promote each `alloca` into a register-resident value; any `alloca` that survives the optimizer into the post-optimization dump is a stack slot it couldn't promote, and it will become PTX local memory. Grep for them: ```bash grep -nE "alloca" quadrants_kernel_cuda_llvm_ir_optimized_0007.ll | head diff --git a/docs/source/user_guide/metal_shared_queue.md b/docs/source/user_guide/metal_shared_queue.md index e389501562..ff970c626e 100644 --- a/docs/source/user_guide/metal_shared_queue.md +++ b/docs/source/user_guide/metal_shared_queue.md @@ -1,6 +1,6 @@ # Shared Metal command queue (PyTorch MPS) -On Apple Silicon, Quadrants and PyTorch MPS both dispatch GPU work via Metal. By default each framework creates its own `MTLCommandQueue`, which means there is no GPU-level ordering between them. Every zero-copy interop point therefore requires explicit CPU-side synchronisation (`qd.sync()` and `torch.mps.synchronize()`) to guarantee data visibility. +On Apple Silicon, Quadrants and PyTorch MPS both dispatch GPU work via Metal. By default each framework creates its own `MTLCommandQueue`, which means there is no GPU-level ordering between them. 
Every zero-copy interop point therefore requires explicit CPU-side synchronization (`qd.sync()` and `torch.mps.synchronize()`) to guarantee data visibility. The `external_metal_command_queue` option lets you pass PyTorch's command queue to Quadrants so that both frameworks share a single queue. Metal processes command buffers in commit order within a queue, so GPU-side ordering is automatic and the per-interop sync overhead is eliminated. @@ -23,7 +23,7 @@ Two flags work together: - `external_metal_command_queue` — the raw `MTLCommandQueue*` pointer. Quadrants dispatches all GPU work on this queue instead of creating its own. - `external_metal_command_queue_is_torch_queue` — set to `True` when the queue comes from PyTorch MPS. This tells Quadrants that PyTorch shares the same queue, so the explicit interop syncs can be safely skipped. Defaults to `False`, which preserves the sync calls even when an external queue is provided (useful when the external queue belongs to a non-PyTorch framework). -Once initialised with both flags: +Once initialized with both flags: - `to_torch(copy=False)` no longer calls `qd.sync()` internally. - `to_torch(copy=True)` no longer calls `torch.mps.synchronize()` after the copy. @@ -41,17 +41,17 @@ from quadrants.interop import get_mps_command_queue queue_ptr = get_mps_command_queue() # returns int (raw pointer), or 0 on failure ``` -The function initialises PyTorch MPS if needed, then returns the `MTLCommandQueue*` as a Python `int`. It returns `0` if extraction fails (e.g. non-macOS platform, PyTorch not installed, MPS not available, or unsupported PyTorch build). The underlying C++ symbol (`_ZN2at3mps19getDefaultMPSStreamEv`) has been stable since PyTorch 1.13. +The function initializes PyTorch MPS if needed, then returns the `MTLCommandQueue*` as a Python `int`. It returns `0` if extraction fails (e.g. non-macOS platform, PyTorch not installed, MPS not available, or unsupported PyTorch build). 
The underlying C++ symbol (`_ZN2at3mps19getDefaultMPSStreamEv`) has been stable since PyTorch 1.13. ## Init ordering -`get_mps_command_queue()` handles PyTorch MPS initialisation internally, so you can call it before `qd.init()` without any manual setup: +`get_mps_command_queue()` handles PyTorch MPS initialization internally, so you can call it before `qd.init()` without any manual setup: ```python import quadrants as qd from quadrants.interop import get_mps_command_queue -queue_ptr = get_mps_command_queue() # initialises MPS if needed +queue_ptr = get_mps_command_queue() # initializes MPS if needed qd.init( arch=qd.metal, external_metal_command_queue=queue_ptr, diff --git a/docs/source/user_guide/optimization_passes.md b/docs/source/user_guide/optimization_passes.md index a88913a5f5..c0c90ba318 100644 --- a/docs/source/user_guide/optimization_passes.md +++ b/docs/source/user_guide/optimization_passes.md @@ -44,7 +44,7 @@ In the order they run each round: | Pass | What it does | |------|--------------| -| Extract constant | Lifts constant values out of larger expressions into standalone constant instructions, so the passes below can recognise and reuse them. | +| Extract constant | Lifts constant values out of larger expressions into standalone constant instructions, so the passes below can recognize and reuse them. | | Unreachable-code elimination | Removes branches that can never be taken (e.g. the body of an `if` whose condition is always false). | | Binary-op / algebraic simplification | Applies arithmetic identities: `x * 1 → x`, `x + 0 → x`, `x * 2 → x + x`, and similar peephole rewrites. | | Constant folding | Pre-computes expressions whose inputs are all known at compile time: `2 * 3 → 6`. 
| @@ -63,9 +63,9 @@ A **control-flow graph** is a map of your kernel's basic blocks together with th - **Store-to-load forwarding** - if a value is written to a location and then read again before anything overwrites it, the read is replaced with the value directly, skipping the round trip through memory. - **Dead-store elimination** - if a write is overwritten before anyone reads it, the write is removed. -Building and analysing the CFG is the most expensive optimization in the pipeline, which is why it runs at most once per simplify stage rather than every round. +Building and analyzing the CFG is the most expensive optimization in the pipeline, which is why it runs at most once per simplify stage rather than every round. -**One CFG per offloaded task.** The CFG optimization is built and run separately for each offloaded task, over that task's IR alone - never over the whole `qd.kernel` at once. This is both faster to analyse and safe: because each task is a separate device launch, a value held in a register in one task cannot survive into the next one, so there is never anything to forward across a task boundary anyway. Anything written to global memory is treated as potentially read by a later task, so no store another task might need is dropped. +**One CFG per offloaded task.** The CFG optimization is built and run separately for each offloaded task, over that task's IR alone - never over the whole `qd.kernel` at once. This is both faster to analyze and safe: because each task is a separate device launch, a value held in a register in one task cannot survive into the next one, so there is never anything to forward across a task boundary anyway. Anything written to global memory is treated as potentially read by a later task, so no store another task might need is dropped. 
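The store-to-load forwarding and dead-store elimination described in this hunk can be illustrated on a toy, straight-line task. This is a minimal sketch, not the Quadrants pass: the tuple-based instruction format, the `copy` instruction, and the `local_locs` parameter are all hypothetical, and every value name is assumed to be defined once. Locations outside `local_locs` stand in for global memory, so their final store is always kept, mirroring the rule that a later task might read it.

```python
# Toy model only (hypothetical instruction format, not Quadrants IR):
#   ("store", loc, src)  write the value named `src` to location `loc`
#   ("load", dst, loc)   read location `loc` into the value named `dst`


def forward_stores(task):
    """Replace a load with a copy of the value last stored to that location."""
    last_store = {}  # loc -> name of the value most recently stored there
    out = []
    for ins in task:
        if ins[0] == "store":
            last_store[ins[1]] = ins[2]
            out.append(ins)
        elif ins[0] == "load" and ins[2] in last_store:
            # The read no longer makes a round trip through memory.
            out.append(("copy", ins[1], last_store[ins[2]]))
        else:
            out.append(ins)
    return out


def eliminate_dead_stores(task, local_locs):
    """Drop stores that are overwritten (or, for locals, abandoned) before a load."""
    # Scanning backwards: a location is "dead" if nothing reads it before the
    # next store. Task-local locations start dead; global ones start live,
    # because another task may read them after this one finishes.
    dead = set(local_locs)
    out = []
    for ins in reversed(task):
        if ins[0] == "store":
            if ins[1] in dead:
                continue  # nothing reads this write, so remove it
            dead.add(ins[1])
        elif ins[0] == "load":
            dead.discard(ins[2])
        out.append(ins)
    out.reverse()
    return out


task = [
    ("store", "tmp", "a"),
    ("load", "b", "tmp"),
    ("store", "tmp", "c"),
    ("store", "out", "b"),
]
optimized = eliminate_dead_stores(forward_stores(task), local_locs={"tmp"})
```

In this toy run, forwarding turns the load into a copy, which in turn leaves both stores to `tmp` without a reader, so only the copy and the store to `out` survive. Running dead-store elimination alone would have to keep the first store to `tmp`, since the load still reads it.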
## Controlling the passes diff --git a/docs/source/user_guide/parallelization.md b/docs/source/user_guide/parallelization.md index 3e443f928c..9d6ca98dc8 100644 --- a/docs/source/user_guide/parallelization.md +++ b/docs/source/user_guide/parallelization.md @@ -76,7 +76,7 @@ for I in qd.grouped(qd.ndrange(M, N, axes=(1, 0))): ## Does GPU kernel launch latency matter? -Kernel launch can be done in parallel whilst the previously launched kernel is still running. This means that if the previously launched kernel takes longer to run than the launch time for the new kernel, then the kernel launch latency will be perfectly hidden. +Kernel launch can be done in parallel while the previously launched kernel is still running. This means that if the previously launched kernel takes longer to run than the launch time for the new kernel, then the kernel launch latency will be perfectly hidden. It's important to try to make sure that the work done by each kernel is sufficient to hide the kernel launch latency, otherwise the launch latency will be a bottleneck to maximum performance. diff --git a/docs/source/user_guide/sub_functions.md b/docs/source/user_guide/sub_functions.md index 0e01f7b12a..d89bb2cf68 100644 --- a/docs/source/user_guide/sub_functions.md +++ b/docs/source/user_guide/sub_functions.md @@ -39,7 +39,7 @@ def compute(a: qd.Template) -> None: ## Restricting a func to the top level (`requires_top_level=True`) -**Experimental.** `requires_top_level` is an experimental feature and its behaviour or API may change in a future release. +**Experimental.** `requires_top_level` is an experimental feature and its behavior or API may change in a future release. Some qd.func contain for-loops that are assumed and intended to be top-level for-loops, that each become separate offloaded tasks, and ultimately separate device kernels. 
If such qd.func's are placed inside other for-loops, the qd.func will no longer generate the structure of offloaded tasks and device kernels assumed, and might either run very slowly, or crash, or give incorrect results. diff --git a/docs/source/user_guide/subgroup.md b/docs/source/user_guide/subgroup.md index ba99df5a92..1c0abcc2f5 100644 --- a/docs/source/user_guide/subgroup.md +++ b/docs/source/user_guide/subgroup.md @@ -41,7 +41,7 @@ Renames relative to the previous `qd.simt.subgroup` API: - `subgroup.barrier()` → `subgroup.sync()` (matching `block.sync()`). - `subgroup.memory_barrier()` → `subgroup.mem_fence()` (matching the planned `block.mem_fence()` and `grid.mem_fence()`). - `subgroup.ballot_full_subgroup(predicate)` → `subgroup.ballot(predicate)`. -- Every `(value, log2_size)` reduction / scan / vote that previously took `log2_size` directly is now `_tiled(value, log2_size)`; the bare `(value)` form is the full-subgroup convenience wrapper. **This is a breaking change** - call sites that hard-coded `log2_size = 5` (the old "full warp" idiom) need to either drop the argument or add the `_tiled` suffix to keep the old behaviour on wave64. +- Every `(value, log2_size)` reduction / scan / vote that previously took `log2_size` directly is now `_tiled(value, log2_size)`; the bare `(value)` form is the full-subgroup convenience wrapper. **This is a breaking change** - call sites that hard-coded `log2_size = 5` (the old "full warp" idiom) need to either drop the argument or add the `_tiled` suffix to keep the old behavior on wave64. The `barrier()` / `memory_barrier()` / `ballot_full_subgroup()` names remain as deprecated aliases that emit a `DeprecationWarning` on first use and forward to the new ones; they will be removed in a future release. The rest of this page uses the new names. 
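The "deprecated alias that warns on first use and forwards to the new name" pattern mentioned above can be sketched in plain Python. This is an illustrative assumption about how such an alias could be written, not the Quadrants source: `deprecated_alias` is a hypothetical helper, and `sync` here is a host-side stand-in that returns a marker string so the forwarding is observable.

```python
import functools
import warnings


def deprecated_alias(new_func, old_name):
    """Return a wrapper that forwards to `new_func` and warns on the first call only."""
    warned = False

    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        nonlocal warned
        if not warned:
            warned = True
            warnings.warn(
                f"{old_name}() is deprecated; use {new_func.__name__}() instead",
                DeprecationWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
        return new_func(*args, **kwargs)

    return wrapper


def sync():
    """Stand-in for the new-name op (hypothetical body)."""
    return "sync"


# Old name kept alive as a thin forwarding alias.
barrier = deprecated_alias(sync, "barrier")
```

Tracking the "already warned" state inside the wrapper, rather than relying on Python's default warning filters, keeps the first-use behavior the same regardless of how the caller has configured `warnings`.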
@@ -82,7 +82,7 @@ CUDA shortcut: `all_true` / `any_true` lower to a single `__all_sync(0xFFFFFFFF, Every op above has a paired `_tiled` form that takes an extra `log2_size` template parameter and operates on independent `2**log2_size`-aligned tiles within the subgroup - see [Tiled variants](#tiled-variants). -The SPV-only no-arg reductions (`subgroup.reduce_mul` / `reduce_and` / `reduce_or` / `reduce_xor`, plus the original `reduce_add_tiled(value)` with no `log2_size`) have been removed in favour of the portable sized API. For reductions other than the ones listed above, build a sized helper on top of `shuffle_down` / `shuffle` following the same pattern as `reduce_add_tiled` / `reduce_all_add_tiled`. +The SPV-only no-arg reductions (`subgroup.reduce_mul` / `reduce_and` / `reduce_or` / `reduce_xor`, plus the original `reduce_add_tiled(value)` with no `log2_size`) have been removed in favor of the portable sized API. For reductions other than the ones listed above, build a sized helper on top of `shuffle_down` / `shuffle` following the same pattern as `reduce_add_tiled` / `reduce_all_add_tiled`. ### Sorting @@ -92,11 +92,11 @@ In-register key/value sort across the subgroup, one `(key, value)` pair per lane |----------------------------------------|------|--------|-------------------------|-----------------------------------------------------------------| | `subgroup.bitonic_sort_kv(key, value)` | yes | yes | yes | key & value: i32, u32, f32, f64, i64, u64 (independently typed) | -Returns `(key, value)` - assign with `key, value = subgroup.bitonic_sort_kv(key, value)`. Sorts ascending on the `(key, value)` lex tuple; ties on `key` break on ascending `value` (not a textbook-stable sort - equal-keyed lanes come back in ascending-`value` order, not in original-lane order). Tiled variant: `bitonic_sort_kv_tiled(key, value, log2_size)` runs the same sort independently on each `2**log2_size`-aligned tile - see [Tiled variants](#tiled-variants). 
See [`bitonic_sort_kv`](#bitonic_sort_kvkey-value) for the short-input pattern (sentinel padding), the textbook-stability caveat, and the float NaN behaviour. See [Bitonic key/value sort example](#bitonic-keyvalue-sort-example) for an example. +Returns `(key, value)` - assign with `key, value = subgroup.bitonic_sort_kv(key, value)`. Sorts ascending on the `(key, value)` lex tuple; ties on `key` break on ascending `value` (not a textbook-stable sort - equal-keyed lanes come back in ascending-`value` order, not in original-lane order). Tiled variant: `bitonic_sort_kv_tiled(key, value, log2_size)` runs the same sort independently on each `2**log2_size`-aligned tile - see [Tiled variants](#tiled-variants). See [`bitonic_sort_kv`](#bitonic_sort_kvkey-value) for the short-input pattern (sentinel padding), the textbook-stability caveat, and the float NaN behavior. See [Bitonic key/value sort example](#bitonic-keyvalue-sort-example) for an example. ## Semantics -All of these ops operate within a single subgroup: they do not move data through memory and do not synchronise across subgroups. +All of these ops operate within a single subgroup: they do not move data through memory and do not synchronize across subgroups. ### `shuffle(value, index)` @@ -125,7 +125,7 @@ Lane `i` returns the `value` held by lane `i ^ mask`. Convenient for butterfly p - Same dtype rules as `shuffle`; `mask` is a `u32`. - Implemented portably as a `@qd.func` over `shuffle`: every backend that lowers `shuffle` therefore lowers `shuffle_xor` with no additional codegen path. Inlines at compile time into a single `shuffle(value, u32(invocation_id()) ^ mask)`. -- The XOR partner must be inside the active subgroup; behaviour outside that range is implementation-defined (same caveat as `shuffle`). +- The XOR partner must be inside the active subgroup; behavior outside that range is implementation-defined (same caveat as `shuffle`). 
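The `shuffle_xor` rule quoted in this hunk (lane `i` returns the value held by lane `i ^ mask`) and the butterfly pattern it enables can be modeled on the host with a list standing in for the subgroup's lanes. This is a reference model under stated assumptions, not device code: both function names below are local to the sketch, and the model raises an error for an out-of-range partner rather than imitating implementation-defined behavior.

```python
def shuffle_xor(values, mask):
    """Host-side model: lane i receives the value held by lane i ^ mask."""
    n = len(values)
    if any((lane ^ mask) >= n for lane in range(n)):
        raise ValueError("XOR partner falls outside the subgroup")
    return [values[lane ^ mask] for lane in range(n)]


def butterfly_all_reduce_add(values):
    """XOR butterfly: after log2(n) exchange steps every lane holds the total."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("subgroup size must be a power of two")
    acc = list(values)
    mask = n // 2
    while mask >= 1:
        partner = shuffle_xor(acc, mask)  # every lane swaps with lane ^ mask
        acc = [a + b for a, b in zip(acc, partner)]
        mask //= 2
    return acc
```

Because each step pairs lanes symmetrically, every lane ends up with the full sum and no follow-up broadcast is needed, which is the property that distinguishes the butterfly form from a `shuffle_down` reduction that leaves the result on a single lane.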
### `broadcast(value, index)` @@ -221,7 +221,7 @@ Each base op is a one-line wrapper around its `_tiled` form: `reduce_add(v)` is Returns `1` on lane 0 of every subgroup and `0` on every other lane. Useful for "exactly one lane does X" patterns where you don't care which lane does it - e.g. emitting a single global write per subgroup. - Implemented portably as a `@qd.func` wrapper: `i32(invocation_id() == 0)`. Inlines at compile time into a single compare + zero-extend on every backend. -- This narrows the SPIR-V `OpGroupNonUniformElect` semantics, which would otherwise be free to pick any *active* lane. Under the documented uniform-CF + all-lanes-active contract for `qd.simt.subgroup` the distinction is invisible (lane 0 is always active and is a legal choice), and pinning the elected lane down keeps the behaviour identical across backends. +- This narrows the SPIR-V `OpGroupNonUniformElect` semantics, which would otherwise be free to pick any *active* lane. Under the documented uniform-CF + all-lanes-active contract for `qd.simt.subgroup` the distinction is invisible (lane 0 is always active and is a legal choice), and pinning the elected lane down keeps the behavior identical across backends. ### `sync()` @@ -231,7 +231,7 @@ Subgroup-scope thread-converging barrier - every lane in the subgroup must reach - **SPIR-V**: `OpControlBarrier(Subgroup, Subgroup, 0)`. - **CUDA**: `__syncwarp(0xFFFFFFFF)` (`nvvm.bar.warp.sync`). Reconverges lanes that may have diverged under independent thread scheduling on Volta+; under uniform CF on Pascal and earlier this is effectively a no-op but is still legal. - **AMDGPU**: `llvm.amdgcn.wave.barrier`. Acts as a compiler reordering barrier on GCN (where waves are lockstep) and as a real wave-scope hardware barrier on RDNA. -- Caller contract on every backend: call from uniform control flow with all lanes active. 
Calling from divergent control flow has implementation-defined behaviour (CUDA's `nvvm.bar.warp.sync` will deadlock if the mask does not match the active set; AMDGPU's `wave.barrier` is a no-op on most chips so divergent calls silently pass through). +- Caller contract on every backend: call from uniform control flow with all lanes active. Calling from divergent control flow has implementation-defined behavior (CUDA's `nvvm.bar.warp.sync` will deadlock if the mask does not match the active set; AMDGPU's `wave.barrier` is a no-op on most chips so divergent calls silently pass through). - The legacy name `subgroup.barrier()` is still available as a deprecated alias. It forwards to `sync()` and emits a `DeprecationWarning` on first use; prefer the new name in new code. ### `mem_fence()` @@ -242,7 +242,7 @@ Subgroup-scope memory fence - orders memory operations within the subgroup witho - **SPIR-V**: `OpMemoryBarrier(Subgroup, AcquireRelease | UniformMemory | WorkgroupMemory)`. - **CUDA**: `__threadfence_block()` (`nvvm.membar.cta`) - workgroup-scope, see the `**` footnote in the matrix above. - **AMDGPU**: LLVM `fence syncscope("workgroup") seq_cst` - workgroup-scope, same caveat. -- Caller contract on every backend: call from uniform control flow with all lanes active. Calling from divergent control flow has implementation-defined behaviour (same caveats as `sync()`). +- Caller contract on every backend: call from uniform control flow with all lanes active. Calling from divergent control flow has implementation-defined behavior (same caveats as `sync()`). - The legacy name `subgroup.memory_barrier()` is still available as a deprecated alias. It forwards to `mem_fence()` and emits a `DeprecationWarning` on first use; prefer the new name in new code. 
### `reduce_add(value)` @@ -279,7 +279,7 @@ Same min / max as `reduce_min` / `reduce_max`, but broadcast to **every lane** i Per-lane inclusive scan under `+` / `min` / `max` that resets at every non-zero `head_flag`, across the entire subgroup. Lane `i` returns the scan of `value[head_below..i + 1]`, where `head_below` is the largest lane index `<= i` whose `head_flag` is non-zero. If no such lane exists, lane 0 is treated as an implicit head, so the result is the inclusive scan from lane 0 to lane `i`. Tiled variants: `segmented_reduce_add_tiled(value, head_flag, log2_size)` (and `_min` / `_max`) - see [Tiled variants](#tiled-variants). - `value` is any type supporting the operator (`+` and `shuffle_up` for `_add`; `qd.min`/`qd.max` and `shuffle_up` for `_min`/`_max`). `head_flag` is any integer scalar; the lowering tests `head_flag != 0`, so non-binary truthy values (e.g. `7`, `42`) work. -- Implementation: one `subgroup.ballot(head_flag != 0)` to materialise a `u64` of head positions, then a Hillis-Steele inclusive scan bounded by `distance >= offset` where `distance = lane - segment_head`. A compile-time branch in `_segment_head_distance_tiled` picks between two paths: +- Implementation: one `subgroup.ballot(head_flag != 0)` to materialize a `u64` of head positions, then a Hillis-Steele inclusive scan bounded by `distance >= offset` where `distance = lane - segment_head`. A compile-time branch in `_segment_head_distance_tiled` picks between two paths: - **`log2_size <= 5`** - u32-bitmask path. Shifts the relevant 32-lane half of the ballot down to bits 0..31 and runs the bit-mask + `clz` arithmetic in half-local coordinates (`lane_in_half = lane - half_base`). Half-local `distance` equals absolute `lane - segment_head_abs` because both terms are offset by the same `half_base`, so the downstream `shuffle_up`'s `distance >= offset` guard still works in absolute terms. 
This is the only path on wave32 backends - it compiles to identical IR to the historical wave32-only implementation, so CUDA / Metal / Vulkan callers see no perf regression from the wave64 support. - **`log2_size == 6`** - u64-bitmask path. Works in absolute lane coordinates with the full `u64` ballot, an OR-injected virtual head at lane 0 to guarantee a non-zero `lower`, and a `clz(u64)` for the segment head. Costs one extra `u64` shift + `u64 clz` vs the u32 path; only reachable when `group_size() == 64` (i.e. AMDGPU), so the entire branch is dead-code-eliminated at every `log2_size <= 5` call site. - No identity element is involved at all - the per-lane `distance >= offset` guard ensures the scan never reaches across a segment boundary, so a partner from another segment is never combined with the local value (i.e. the implementation doesn't need a "what to combine with at the segment head" sentinel the way `exclusive_min` / `exclusive_max` do for lane 0). @@ -327,7 +327,7 @@ Float NaN handling is implementation-defined: comparisons with NaN return false #### Short-input pattern -When sorting fewer than `2**log2_size` real elements, load real data into the low `n` lanes, initialise the high lanes with a sentinel `key` that compares greater than every real key (`+inf` for floats, `INT_MAX` / `UINT_MAX` for ints) and any safe `value`, then ignore the high lanes in the result. +When sorting fewer than `2**log2_size` real elements, load real data into the low `n` lanes, initialize the high lanes with a sentinel `key` that compares greater than every real key (`+inf` for floats, `INT_MAX` / `UINT_MAX` for ints) and any safe `value`, then ignore the high lanes in the result. ### `ballot_first_n(predicate, n)` @@ -340,7 +340,7 @@ Returns a `u32` bitmask whose bit `i` is set iff `i < n` AND lane `i`'s `predica - **AMDGPU**: `llvm.amdgcn.ballot.i64` followed by `trunc to i32`. 
Packs lanes 0..31 into the result; on wave64 lanes 32..63's predicates are explicitly discarded by the truncate, matching the `n <= 32` contract. The `i64 + trunc` form is a workaround for an LLVM AMDGPU isel bug - `ballot.i32` is documented as well-defined on wave64 (PR [llvm/llvm-project#71556](https://github.com/llvm/llvm-project/pull/71556)) but in practice still fails `Cannot select` on gfx942 in LLVM 20 / 22 for non-constant predicates. The workaround costs nothing - both forms produce the same single `v_cmp_*_e64` plus a low-half store. - **SPIR-V**: `OpGroupNonUniformBallot` returns a `uvec4`; we extract component 0, which by spec contains the ballot bits for lanes 0..31. - For `n < 32` we mask the predicate by `lane < n` before issuing the ballot, so bits `[n, 32)` of the result are forced to zero regardless of those lanes' actual predicate values. At `n == 32` the masking is provably a no-op on every backend (lanes `>= 32` are either non-existent on wave32 or already not represented in the `u32` result on wave64), so the masking is elided at compile time and the call lowers to a single ballot intrinsic. -- Caller contract: uniform CF + all lanes active. Calling from divergent control flow has implementation-defined behaviour (CUDA's `__ballot_sync` will deadlock if the active mask doesn't match `0xFFFFFFFF`). +- Caller contract: uniform CF + all lanes active. Calling from divergent control flow has implementation-defined behavior (CUDA's `__ballot_sync` will deadlock if the active mask doesn't match `0xFFFFFFFF`). - Useful for stream compaction over the first 32 lanes, the wave32 path of `segmented_reduce_*` (which uses the u32-bitmask form internally for `log2_size <= 5`), and any pattern that wants `clz` / `popcount` / `ffs` over a per-lane predicate within a `u32`. For full-subgroup ballots on AMDGPU wave64 use `ballot` (returns a `u64`). 
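The `ballot_first_n` contract and the stream-compaction use mentioned above can be sketched as a host-side model, with Python lists standing in for per-lane values. This is an illustration under assumptions, not the device lowering: `compact` is a hypothetical helper, the lower bound `n >= 0` is assumed, and the "bits below this lane" mask is computed inline as `(1 << lane) - 1`.

```python
def ballot_first_n(predicates, n):
    """Host-side model: bit i is set iff i < n and lane i's predicate is non-zero."""
    if not 0 <= n <= 32:
        raise ValueError("n must be in [0, 32]")
    mask = 0
    # Lanes >= 32 can never appear in a 32-bit result, so they are ignored.
    for lane, pred in enumerate(predicates[:32]):
        if lane < n and pred != 0:
            mask |= 1 << lane
    return mask


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def compact(values, keep):
    """Stream compaction over at most 32 lanes using one ballot."""
    n = len(values)
    ballot = ballot_first_n(keep, n)
    out = [None] * popcount(ballot)
    for lane in range(n):
        if keep[lane] != 0:
            below = ballot & ((1 << lane) - 1)  # ballot bits of lanes < this lane
            out[popcount(below)] = values[lane]  # rank among kept lanes
    return out
```

Each kept lane derives its output slot from the same ballot word, so the surviving elements keep their original lane order without any lane-to-lane communication beyond the ballot itself.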
### `ballot(predicate)` @@ -373,7 +373,7 @@ Returns `i32(1)` on every lane iff every lane in the subgroup has the same `valu ### `lanemask_{lt,le,eq,gt,ge}(lane_id)` -Closed-form `u32` lane-mask constants parametrised by a lane id. Bit `i` of the result follows the relation in the suffix: +Closed-form `u32` lane-mask constants parametrized by a lane id. Bit `i` of the result follows the relation in the suffix: | Op | Bit `i` set iff | Closed form | |----------------|-----------------|----------------------------------------| @@ -385,7 +385,7 @@ Closed-form `u32` lane-mask constants parametrised by a lane id. Bit `i` of the - `lane_id` is any integer scalar. Pass `subgroup.invocation_id()` to get the classic CUDA built-in form (current lane's mask), or any other expression to query an arbitrary lane's mask. The op is pure arithmetic - no shuffle, no ballot - so per-lane-varying `lane_id` works the same as a uniform one. - Returns `u32`. Bit 0 corresponds to lane 0, bit 31 to lane 31. -- Caller contract: `lane_id` must be in `[0, 31]` (matching the `u32` return type, which represents 32 lanes). Passing `lane_id == 32` triggers an undefined-behaviour shift on most backends. +- Caller contract: `lane_id` must be in `[0, 31]` (matching the `u32` return type, which represents 32 lanes). Passing `lane_id == 32` triggers an undefined-behavior shift on most backends. - Implemented portably as a `@qd.func` over `<<`, `-`, `|`, `~`. Inlines at compile time into 1-3 ALU ops on every backend. - AMDGPU CDNA wave64 caveat: only the low 32 lanes are representable in this op (the return type is `u32`). If you need a mask covering all 64 wave64 lanes, use `subgroup.ballot` instead - it returns a `u64` and includes lanes 32..63. @@ -437,7 +437,7 @@ def identity(src: qd.types.ndarray(dtype=qd.f32, ndim=1), `dst[i]` equals `src[i]` on every lane. 
-### Swap neighbours (xor pattern via explicit lane) example +### Swap neighbors (xor pattern via explicit lane) example ```python @qd.kernel @@ -504,7 +504,7 @@ def sum32(src: qd.types.ndarray(dtype=qd.f32, ndim=1), ### Broadcast the sum to all lanes with `reduce_all_add_tiled` example -When every lane needs the reduction result - e.g. to normalise by the sum - use the butterfly variant. No follow-up broadcast needed: +When every lane needs the reduction result - e.g. to normalize by the sum - use the butterfly variant. No follow-up broadcast needed: ```python @qd.kernel diff --git a/docs/source/user_guide/tensor.md b/docs/source/user_guide/tensor.md index b86279f9a3..660dcfa782 100644 --- a/docs/source/user_guide/tensor.md +++ b/docs/source/user_guide/tensor.md @@ -166,7 +166,7 @@ auto = a.to_torch(copy=None) # zero-copy if possible, otherwise copy clone = a.to_torch(copy=True) # independent copy (default) ``` -| Value | Behaviour | +| Value | Behavior | |---|---| | `True` (default) | Independent copy via kernel. Safe to mutate freely. | | `None` | Zero-copy when available, otherwise falls back to a copy silently. | @@ -174,11 +174,11 @@ clone = a.to_torch(copy=True) # independent copy (default) `copy=False` and `copy=None` avoid both the buffer allocation and the copy kernel when zero-copy is available — the returned numpy array or torch tensor points directly at Quadrants' existing memory. For a large tensor this eliminates a potentially expensive memcpy and a device-side kernel launch. Writes through the view are immediately visible to subsequent Quadrants kernels (and vice versa), removing the need for `to_torch` → modify → `from_torch` round-trips. -The difference between `False` and `None`: `copy=False` raises `ValueError` when zero-copy is not supported (e.g. unsupported dtype or GPU-to-numpy), while `copy=None` silently falls back to a kernel copy in those cases. 
Use `copy=None` when you want zero-copy as a best-effort optimisation without having to handle exceptions. +The difference between `False` and `None`: `copy=False` raises `ValueError` when zero-copy is not supported (e.g. unsupported dtype or GPU-to-numpy), while `copy=None` silently falls back to a kernel copy in those cases. Use `copy=None` when you want zero-copy as a best-effort optimization without having to handle exceptions. -The tradeoff of zero-copy is lifetime coupling: the view is invalidated on `qd.reset()` or `qd.init()`, and on GPU you must be mindful of stream synchronisation when both frameworks write to the same buffer. +The tradeoff of zero-copy is lifetime coupling: the view is invalidated on `qd.reset()` or `qd.init()`, and on GPU you must be mindful of stream synchronization when both frameworks write to the same buffer. -This works identically on both backends. For the full support matrix (which backends/dtypes qualify, lifetime caveats, Metal synchronisation) see [`interop`](interop.md#zero-copy-interop-via-dlpack). +This works identically on both backends. For the full support matrix (which backends/dtypes qualify, lifetime caveats, Metal synchronization) see [`interop`](interop.md#zero-copy-interop-via-dlpack). Gradient buffers behave identically: `a.grad.to_numpy()` returns the canonical view of the gradient. diff --git a/docs/source/user_guide/unit_testing.md b/docs/source/user_guide/unit_testing.md index 08453a9912..9af39dabfc 100644 --- a/docs/source/user_guide/unit_testing.md +++ b/docs/source/user_guide/unit_testing.md @@ -10,7 +10,7 @@ The test suite is run via the project's launcher, **not** by invoking `pytest` d python tests/run_tests.py ``` -The launcher sets up the test-only env vars (kernel offline cache, watchdog, xdist worker count, etc.) and forwards any unrecognised flags to pytest. Calling `pytest` directly skips that setup and behaves differently. 
+The launcher sets up the test-only env vars (kernel offline cache, watchdog, xdist worker count, etc.) and forwards any unrecognized flags to pytest. Calling `pytest` directly skips that setup and behaves differently. Common one-liners: