Merged
2 changes: 1 addition & 1 deletion docs/source/user_guide/algorithms.md
Original file line number Diff line number Diff line change
@@ -102,7 +102,7 @@ scratch64 = qd.ndarray(qd.u64, shape=qd.algorithms.reduce_scratch_slots(N, D))

## Semantics

The active ops below share a calling convention and several rules; these are stated once in **Common conventions**, and only the op-specific behaviour is repeated per op. The internal algorithm for each op is in [Under the hood](#under-the-hood).
The active ops below share a calling convention and several rules; these are stated once in **Common conventions**, and only the op-specific behavior is repeated per op. The internal algorithm for each op is in [Under the hood](#under-the-hood).

Each op section ends with a runnable toy example. They all assume this prelude:

12 changes: 6 additions & 6 deletions docs/source/user_guide/atomics.md
@@ -22,7 +22,7 @@ All atomic ops follow the same shape: `qd.atomic_op(x, y)` performs `x = op(x, y

A few cross-cutting notes that the cells above abbreviate:

- **`atomic_sub` is not a separate op in the IR.** `quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten` rewrites every `atomic_sub(x, y)` into `atomic_add(x, -y)` before codegen sees it, so per-backend support and per-dtype behaviour are exactly those of `atomic_add`.
- **`atomic_sub` is not a separate op in the IR.** `quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten` rewrites every `atomic_sub(x, y)` into `atomic_add(x, -y)` before codegen sees it, so per-backend support and per-dtype behavior are exactly those of `atomic_add`.
- **CAS-loop ops are noticeably slower than native atomics**, especially under contention — every contending thread retries the load + compare-exchange until it wins. Prefer pre-aggregating into a register or shared array and issuing a single atomic at the end of the block where possible.
- **f16 floats always use a CAS loop** (no native f16 atomic on any backend except SPIR-V with the right capability bit).
- **On CPU, "native" does not guarantee a single machine instruction.** On x86 and other architectures without hardware float atomics, the compiler backend lowers native float `atomic_add` (and integer `min` / `max`) to a CAS loop in machine code. Under high contention the performance is similar to the explicit "CAS" entries; the difference is that "native" ops benefit from hardware acceleration where available.
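The CAS-loop and pre-aggregation notes above can be illustrated with a plain-Python model. This is only a sketch and does not use Quadrants: `AtomicCell`, `atomic_mul`, and `worker` are hypothetical names, and a lock stands in for the hardware's compare-exchange instruction.

```python
import threading


class AtomicCell:
    """Toy memory cell that exposes only load and compare-exchange."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        return self._value

    def compare_exchange(self, expected, desired):
        """Store `desired` only if the cell still holds `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = desired
                return True
            return False


def atomic_mul(cell, factor):
    """CAS retry loop: reload and retry until our compare-exchange wins."""
    while True:
        old = cell.load()
        if cell.compare_exchange(old, old * factor):
            return


def worker(cell, factors):
    # Pre-aggregate locally, then issue a single atomic at the end.
    local = 1
    for f in factors:
        local *= f
    atomic_mul(cell, local)


cell = AtomicCell(1)
threads = [threading.Thread(target=worker, args=(cell, [2, 3])) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
result = cell.load()  # 6 * 6 * 6 * 6
```

Each worker issues one contended update instead of two, so fewer retries are possible under contention; the same reasoning motivates accumulating into a register or shared array first.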
@@ -71,7 +71,7 @@ Bitwise atomics. Integer dtypes only — passing `f32` / `f64` raises a type err

### `qd.atomic_sub(x, y)` / `qd.atomic_mul(x, y)`

Atomic subtract and atomic multiply. `atomic_sub` is rewritten to `atomic_add(x, -y)` at IR-construction time (`quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten`), so its per-backend behaviour is identical to `atomic_add`. `atomic_mul` always lowers to a CAS loop - no LLVM AtomicRMW or SPIR-V `OpAtomic*` op corresponds to multiply - and is intentionally not heavily optimised; prefer reducing to a different scheme on hot paths.
Atomic subtract and atomic multiply. `atomic_sub` is rewritten to `atomic_add(x, -y)` at IR-construction time (`quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten`), so its per-backend behavior is identical to `atomic_add`. `atomic_mul` always lowers to a CAS loop - no LLVM AtomicRMW or SPIR-V `OpAtomic*` op corresponds to multiply - and is intentionally not heavily optimized; prefer reducing to a different scheme on hot paths.
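As a rough illustration of the rewrite described above, the toy pass below turns a subtract into an add of the negated operand, so the later stage only ever handles `atomic_add`. The function names and the tuple representation are hypothetical; this is not the code in `frontend_ir.cpp`.

```python
def flatten(op, dest, value):
    """Toy rewrite: atomic_sub(x, y) becomes atomic_add(x, -y)."""
    if op == "atomic_sub":
        return ("atomic_add", dest, -value)
    return (op, dest, value)


def apply_op(cell, op, value):
    """Toy 'codegen' stage that only knows about atomic_add."""
    if op != "atomic_add":
        raise ValueError(f"unsupported op: {op}")
    cell[0] += value


counter = [10]
op, dest, value = flatten("atomic_sub", counter, 3)
apply_op(dest, op, value)  # counter[0] is now 7
```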

### `qd.atomic_exchange(x, y)`

@@ -149,11 +149,11 @@ val = qd.volatile_load(target)
| Backend | Lowering |
|------------------|-------------------------------------------------------------------------------------------|
| CUDA | LLVM `load volatile` → PTX `ld.volatile.global`. |
| AMDGPU | LLVM `load volatile` → unhoistable `global_load_*` (the optimiser is inhibited from forwarding / merging). |
| AMDGPU | LLVM `load volatile` → unhoistable `global_load_*` (the optimizer is inhibited from forwarding / merging). |
| Vulkan / Metal | SPIR-V `OpLoad` with the `Volatile` `MemoryAccess` mask, propagated through SPIRV-Cross to a re-read on every use in the generated MSL / GLSL. |
| CPU (x86_64) | LLVM `load volatile` (the optimiser cannot hoist or merge it; the runtime cost is identical to an ordinary load on x86). |
| CPU (x86_64) | LLVM `load volatile` (the optimizer cannot hoist or merge it; the runtime cost is identical to an ordinary load on x86). |

Quadrants additionally suppresses the optimisations that would otherwise let an aliased rewrite slip past codegen:
Quadrants additionally suppresses the optimizations that would otherwise let an aliased rewrite slip past codegen:

- `cache_loop_invariant_global_vars` does not hoist a volatile load out of an enclosing loop.
- `simplify` does not replace a volatile load with the value of an earlier load of the same address.
@@ -199,7 +199,7 @@ The decoupled-look-back scan in [grid](grid.md) shows the full pattern.

Every `qd.atomic_*` is emitted at **device-wide scope**: visible to all threads on the GPU executing the kernel, but not required to be coherent with the host CPU mid-kernel. The host only observes results once the kernel completes, at which point the launcher's stream-sync flushes everything regardless. Choosing device scope (rather than the strongest "system" scope) lets every backend lower the op to a single hardware atomic instruction instead of a software CAS retry loop, which matters for correctness as much as for speed: under heavy contention, a CAS loop on a non-converging op like `atomic_xor` can livelock.

You don't normally need to think about scope as a user. It's listed here so the per-backend behaviour is explicit:
You don't normally need to think about scope as a user. It's listed here so the per-backend behavior is explicit:

| Backend | Scope spelling in the IR |
|-------------------------|-----------------------------------|
4 changes: 2 additions & 2 deletions docs/source/user_guide/autodiff.md
@@ -298,7 +298,7 @@ The on-device sizer relies on two common hardware features (64-bit integer arith
`qd.init()` exposes two escape hatches:

- `ad_stack_size=N` (default `0`): forces every adstack to exactly `N` slots and bypasses the sizer. Leave at `0` in day-to-day use; positive `N` is for stress tests or working around a suspected sizer bug.
- `ad_stack_sparse_threshold_bytes=B` (default `100 MiB`): cutoff below which the gate-passing-count sizing of [Memory footprint](#memory-footprint) is skipped in favour of the eager `dispatched_threads * stride` heap. The sparse path saves memory but pays a per-launch reducer dispatch; below `B` of conservative heap, that overhead outweighs the savings. Set to `0` to always use the sparse path; lower it if the default still skips kernels you want shrunk.
- `ad_stack_sparse_threshold_bytes=B` (default `100 MiB`): cutoff below which the gate-passing-count sizing of [Memory footprint](#memory-footprint) is skipped in favor of the eager `dispatched_threads * stride` heap. The sparse path saves memory but pays a per-launch reducer dispatch; below `B` of conservative heap, that overhead outweighs the savings. Set to `0` to always use the sparse path; lower it if the default still skips kernels you want shrunk.
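The threshold rule can be sketched as a single comparison. This is a hedged model, not the sizer's actual code: `uses_sparse_path` is a hypothetical helper, the stride is assumed to be in bytes per thread, and treating a heap exactly equal to the threshold as sparse is also an assumption.

```python
MIB = 1024 * 1024


def uses_sparse_path(dispatched_threads, stride_bytes, threshold_bytes=100 * MIB):
    """Return True when the gate-passing-count (sparse) sizing would be used.

    Below the threshold, the eager `dispatched_threads * stride` heap is kept.
    """
    conservative_heap = dispatched_threads * stride_bytes
    return conservative_heap >= threshold_bytes


small = uses_sparse_path(1024, 1024)       # 1 MiB conservative heap: eager
large = uses_sparse_path(1 << 20, 256)     # 256 MiB conservative heap: sparse
forced = uses_sparse_path(1, 1, threshold_bytes=0)  # threshold 0: always sparse
```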

#### Memory footprint

@@ -379,7 +379,7 @@ A reverse-mode kernel with two nested loops is in some cases limited to an outer

## Appendix A: types of dynamic loops supported by reverse-mode AD

The compiler recognises the following bound shapes for adstack-aware loops:
The compiler recognizes the following bound shapes for adstack-aware loops:

| Bound shape | Example |
| --- | --- |
2 changes: 1 addition & 1 deletion docs/source/user_guide/block.md
@@ -155,7 +155,7 @@ A generic `block.reduce(value, block_dim, op, dtype)` is also available for cust

### `block.reduce_all_{add,min,max}(value, block_dim, dtype)`

The broadcast variants of the above. Identical semantics, but the result is published to a one-slot `SharedArray` and read back by every thread after a second `block.sync()`. Use this when downstream code on every thread needs the block-wide aggregate (e.g. normalising each thread's value by the block sum). Cost: one extra `block.sync()` plus one shared-memory hop vs. the lane-0-only variants. The corresponding generic form is `block.reduce_all(value, block_dim, op, dtype)`.
The broadcast variants of the above. Identical semantics, but the result is published to a one-slot `SharedArray` and read back by every thread after a second `block.sync()`. Use this when downstream code on every thread needs the block-wide aggregate (e.g. normalizing each thread's value by the block sum). Cost: one extra `block.sync()` plus one shared-memory hop vs. the lane-0-only variants. The corresponding generic form is `block.reduce_all(value, block_dim, op, dtype)`.
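The broadcast step can be pictured with a plain-Python model of one block. It does not call the `block` API: `model_reduce_all_add` is a hypothetical helper, and a list entry stands in for each thread's value.

```python
def model_reduce_all_add(values):
    """Toy model: reduce to one lane, publish to a one-slot array, read back."""
    total = 0.0
    for v in values:          # stands in for the lane-0-only reduction
        total += v
    shared = [total]          # one-slot shared array written by lane 0
    # A second block-wide sync would go here on real hardware.
    return [shared[0] for _ in values]  # every thread reads the same slot


values = [1.0, 3.0, 4.0]
totals = model_reduce_all_add(values)
# Each "thread" normalizes its own value by the block sum.
normalized = [v / t for v, t in zip(values, totals)]
```

With the lane-0-only variants, only the first entry of `totals` would be meaningful, so the per-thread normalization would need the extra publish-and-read step shown here.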

### `block.inclusive_{add,min,max}(value, block_dim, dtype)`

6 changes: 3 additions & 3 deletions docs/source/user_guide/compound_types.md
@@ -219,7 +219,7 @@ state.step()

### Under the hood

Like `dataclasses.dataclass`, a `@qd.data_oriented` object is Python-only — the compiler flattens it into individual kernel parameters and the object itself has no kernel-side representation. Unlike `dataclasses.dataclass` it needs no member annotations: the compiler reads the live instance's attributes directly. Primitive members are baked into the kernel as constants, so each distinct primitive value compiles a new specialised kernel.
Like `dataclasses.dataclass`, a `@qd.data_oriented` object is Python-only — the compiler flattens it into individual kernel parameters and the object itself has no kernel-side representation. Unlike `dataclasses.dataclass` it needs no member annotations: the compiler reads the live instance's attributes directly. Primitive members are baked into the kernel as constants, so each distinct primitive value compiles a new specialized kernel.
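One way to picture the specialization cost is a cache keyed by the object's primitive members. The model below is an assumption-laden sketch, not the compiler's cache: `get_kernel`, `compiled`, and `Sim` are hypothetical, and the "kernel" is just a string.

```python
compiled = {}  # specialization cache: primitive values -> "compiled kernel"


def get_kernel(obj):
    """Return the kernel specialized for obj's primitive members."""
    key = tuple(sorted(
        (name, value)
        for name, value in vars(obj).items()
        if isinstance(value, (bool, int, float))
    ))
    if key not in compiled:
        # A new distinct primitive value means a new specialization.
        compiled[key] = f"kernel specialized for {dict(key)}"
    return compiled[key]


class Sim:
    def __init__(self, n, dt):
        self.n = n
        self.dt = dt


first = get_kernel(Sim(4, 0.1))
again = get_kernel(Sim(4, 0.1))   # same primitives: cache hit
other = get_kernel(Sim(8, 0.1))   # different n: second specialization
```

Under this model, a primitive member that changes every step would trigger a compile every step, which is the practical reason to keep frequently changing values out of primitive members.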

## qd.dataclass / qd.types.struct

@@ -292,7 +292,7 @@ Unlike the other two compound types, `@qd.dataclass` is a real kernel-side type

## Nesting compatibility

This table summarises which member types are allowed inside which container type. "yes" means the member is walked correctly when the container is passed to a kernel; "no" means the member is ignored or the combination raises an error.
This table summarizes which member types are allowed inside which container type. "yes" means the member is walked correctly when the container is passed to a kernel; "no" means the member is ignored or the combination raises an error.

| Container ↓     /     Member → | `qd.ndarray` | `qd.field` | primitive | `dataclasses.dataclass` | `@qd.data_oriented` | `@qd.dataclass` |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
@@ -316,7 +316,7 @@ Practical consequence:

### Reassigning ndarray members

For `@qd.data_oriented` containers passed via `qd.Template`, reassigning an ndarray member between kernel launches is supported, including changes to `dtype`, `ndim`, or layout. A new specialised kernel is compiled and cached for the new shape; subsequent launches with the original shape continue to use the original cached kernel. (For `@dataclasses.dataclass` containers — passed via the dataclass-type annotation — the member binding follows the standard dataclass mutability rules: frozen dataclasses can't rebind, non-frozen ones can, and a rebind triggers a fresh kernel arg setup on the next launch.)
For `@qd.data_oriented` containers passed via `qd.Template`, reassigning an ndarray member between kernel launches is supported, including changes to `dtype`, `ndim`, or layout. A new specialized kernel is compiled and cached for the new shape; subsequent launches with the original shape continue to use the original cached kernel. (For `@dataclasses.dataclass` containers — passed via the dataclass-type annotation — the member binding follows the standard dataclass mutability rules: frozen dataclasses can't rebind, non-frozen ones can, and a rebind triggers a fresh kernel arg setup on the next launch.)

### Restrictions

2 changes: 1 addition & 1 deletion docs/source/user_guide/getting_started.md
@@ -108,7 +108,7 @@ qd.sync()
end = time.time()
```

In addition, whilst it looks like we aren't using the gpu before this, in fact we are: when we create the NDArray, the ndarray needs to be created in GPU memory, and again this happens asynchronously. So before calling start we also add qd.sync():
In addition, while it looks like we aren't using the gpu before this, in fact we are: when we create the NDArray, the ndarray needs to be created in GPU memory, and again this happens asynchronously. So before calling start we also add qd.sync():

```python
qd.sync()
4 changes: 2 additions & 2 deletions docs/source/user_guide/graph.md
@@ -224,7 +224,7 @@ When the body of a checkpoint writes a non-zero value into `yield_on[()]`:

The framework never writes into your `yield_on` buffer — you own it end-to-end. That means:

- Before the **first** launch, initialise it to `0` (a freshly allocated `qd.ndarray` is not guaranteed to be zeroed).
- Before the **first** launch, initialize it to `0` (a freshly allocated `qd.ndarray` is not guaranteed to be zeroed).
- :warning: Before each **resume** launch, reset it to `0` (otherwise the body of the same checkpoint sees the stale non-zero value and yields again on the same condition, looping forever).
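The two ownership rules above can be checked with a small plain-Python model. Nothing here is the Quadrants API: `launch` and `run` are hypothetical stand-ins, the flag is a one-element list, and the "overflow" condition is invented for the example.

```python
def launch(state, flag):
    """Toy launch or resume: returns True if the checkpoint yielded."""
    if state["needed"] > state["capacity"]:
        flag[0] = 1            # the checkpoint body signals overflow
    return flag[0] != 0        # the framework reads the flag, never clears it


def run(reset_flag, max_launches=10):
    state = {"needed": 8, "capacity": 2}
    flag = [0]                 # initialize before the first launch
    launches = 1
    yielded = launch(state, flag)
    while yielded and launches < max_launches:
        state["capacity"] *= 2  # host-side handling of the overflow
        if reset_flag:
            flag[0] = 0         # reset before each resume launch
        yielded = launch(state, flag)
        launches += 1
    return launches, yielded


with_reset = run(reset_flag=True)      # terminates once capacity suffices
without_reset = run(reset_flag=False)  # stale flag: stops only at the cap
```

In the model, skipping the reset keeps the flag non-zero even after the host fixes the underlying condition, so only the artificial `max_launches` cap ends the loop.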

### Host-side yield / resume loop
@@ -237,7 +237,7 @@ Kernels annotated with `checkpoints=True` return a `qd.GraphStatus` from every l
Resume by calling `kernel.resume(..., from_checkpoint=label)`. Everything before `label` in source order is skipped on the resume launch; everything from `label` onward runs normally. The canonical host loop:

```python
overflow_flag[()] = 0 # initialise before the first launch
overflow_flag[()] = 0 # initialize before the first launch
status = step(arr, overflow_flag, newton_cond)
while status.yielded:
handle_overflow_for(status.checkpoint, ...)
2 changes: 1 addition & 1 deletion docs/source/user_guide/grid.md
@@ -91,7 +91,7 @@ is therefore unsafe: LLVM's loop-invariant-code-motion will hoist the load out o
pass
```

`qd.volatile_load` lowers to LLVM `load volatile` on CUDA / AMDGPU and to `OpLoad` with the SPIR-V `Volatile` `MemoryAccess` mask on Vulkan / Metal — the optimiser is forbidden from hoisting / merging the load on every backend, with no per-iteration cache-flush or atomic-RMW overhead. See [atomics](atomics.md) for the full primitive description, including the producer-side pairing requirements (atomic store, or plain store + fence on non-Metal backends).
`qd.volatile_load` lowers to LLVM `load volatile` on CUDA / AMDGPU and to `OpLoad` with the SPIR-V `Volatile` `MemoryAccess` mask on Vulkan / Metal — the optimizer is forbidden from hoisting / merging the load on every backend, with no per-iteration cache-flush or atomic-RMW overhead. See [atomics](atomics.md) for the full primitive description, including the producer-side pairing requirements (atomic store, or plain store + fence on non-Metal backends).

- **Fence inside the loop body** (used in the example above; legacy approach):

4 changes: 2 additions & 2 deletions docs/source/user_guide/init_options.md
@@ -62,7 +62,7 @@ Forces every adstack in the program to exactly `N` slots and bypasses the launch

### `ad_stack_sparse_threshold_bytes`

Cutoff (in bytes) below which the gate-passing-count sizing path described in [Memory footprint](./autodiff.md#memory-footprint) is skipped in favour of the eager `dispatched_threads * stride` heap. Default `100 MiB`. The sparse path saves memory on kernels of the shape `for i in range(...): if field[i] cmp literal: <adstack work>` but pays a per-launch reducer dispatch; below the threshold that overhead outweighs the savings. Set to `0` to always use the sparse path; lower it if the default still skips kernels you want shrunk. No effect when `ad_stack_experimental_enabled=False` or when the kernel has no such gate.
Cutoff (in bytes) below which the gate-passing-count sizing path described in [Memory footprint](./autodiff.md#memory-footprint) is skipped in favor of the eager `dispatched_threads * stride` heap. Default `100 MiB`. The sparse path saves memory on kernels of the shape `for i in range(...): if field[i] cmp literal: <adstack work>` but pays a per-launch reducer dispatch; below the threshold that overhead outweighs the savings. Set to `0` to always use the sparse path; lower it if the default still skips kernels you want shrunk. No effect when `ad_stack_experimental_enabled=False` or when the kernel has no such gate.

## Apple Metal

@@ -74,7 +74,7 @@ An `MTLCommandQueue*` pointer (as an integer) to use instead of creating a new M

Default `False`. Set to `True` when the `external_metal_command_queue` is PyTorch MPS's command queue. This tells Quadrants that both frameworks share the same Metal queue, so the explicit `qd.sync()` / `torch.mps.synchronize()` calls at `to_torch` / `from_torch` interop points can be skipped. When `False` (or when no external queue is set), the interop syncs are preserved.

See [Shared Metal command queue](./metal_shared_queue.md) for the full setup guide, including how to extract the queue pointer from PyTorch and the synchronisation implications.
See [Shared Metal command queue](./metal_shared_queue.md) for the full setup guide, including how to extract the queue pointer from PyTorch and the synchronization implications.

## Debugging
