diff --git a/AGENTS.md b/AGENTS.md
index 554d6e96..c69b9d87 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -63,25 +63,36 @@ cd my-arkor-app && pnpm dev                            # Studio at http://127.0.
 
 `arkor dev` generates a 32-byte base64url token per launch ([packages/arkor/src/cli/commands/dev.ts](packages/arkor/src/cli/commands/dev.ts)) and:
 
-1. Passes it to `buildStudioApp({ studioToken })`. The Hono server validates every `/api/*` request via `X-Arkor-Studio-Token` header (or `?studioToken=` query for `EventSource`, which can't set headers). Comparison uses `timingSafeEqual`.
-2. Persists it to `~/.arkor/studio-token` (mode 0600) so the SPA dev workflow (`pnpm --filter @arkor/studio-app dev`) can read it via the `arkor-studio-token` Vite plugin in [packages/studio-app/vite.config.ts](packages/studio-app/vite.config.ts), which injects `<meta name="arkor-studio-token">` into `index.html` on each request. Persistence failure must NOT block server start (read-only `$HOME` on Docker, etc.) — just warn.
+1. Passes it to `buildStudioApp({ studioToken })`. The Hono server validates every `/api/*` request via `X-Arkor-Studio-Token` header (or `?studioToken=` query for `EventSource`, which can't set headers). Comparison uses `timingSafeEqual`. The query-token allow-list lives in `eventStreamPathPattern` in [packages/arkor/src/studio/server.ts](packages/arkor/src/studio/server.ts), currently `/api/jobs/:id/events` and `/api/dev/events`. **Adding to that regex is CSRF-sensitive: each entry must be a GET stream-only route, never a mutation endpoint.**
+2. Persists it to `~/.arkor/studio-token` (mode 0600) so the SPA dev workflow (`pnpm --filter @arkor/studio-app dev`) can read it via the `arkor-studio-token` Vite plugin in [packages/studio-app/vite.config.ts](packages/studio-app/vite.config.ts), which injects `<meta name="arkor-studio-token">` into `index.html` on each request. Persistence failure must NOT block server start (read-only `$HOME` on Docker, etc.); just warn.
 3. Cleans up on `exit`/SIGINT/SIGTERM/SIGHUP via `unlinkSync`.
 
-`/api/*` middleware also enforces a host-header allow-list (`127.0.0.1`/`localhost`) for DNS-rebinding defence. **CORS is intentionally NOT configured** — the SPA is same-origin so reflecting `*` would let "simple" cross-origin POSTs reach handlers. The token check rejects those; cross-origin tabs cannot read the SPA's `<meta>`.
+`/api/*` middleware also enforces a host-header allow-list (`127.0.0.1`/`localhost`) for DNS-rebinding defence. **CORS is intentionally NOT configured**: the SPA is same-origin so reflecting `*` would let "simple" cross-origin POSTs reach handlers. The token check rejects those; cross-origin tabs cannot read the SPA's `<meta>`.
 
-The whole point: prevents another browser tab on the same machine from POSTing `/api/train` (which spawns `arkor train` and dynamically imports user TS — RCE-grade).
+The whole point is to prevent another browser tab on the same machine from POSTing `/api/train` (which spawns `arkor train` and dynamically imports user TS, an RCE-grade exposure).
 
-When touching the Studio server or SPA fetch layer, preserve: token via header for `fetch`, query param for `EventSource`, host-header guard, no CORS, timing-safe compare. The Vite plugin is dev-only (`apply: "serve"`) — running it during `vite build` would bake a stale per-launch token into the production `index.html` and shadow the runtime tag, causing every `/api/*` call to 403.
+When touching the Studio server or SPA fetch layer, preserve: token via header for `fetch`, query param for `EventSource`, host-header guard, no CORS, timing-safe compare. The Vite plugin is dev-only (`apply: "serve"`): running it during `vite build` would bake a stale per-launch token into the production `index.html` and shadow the runtime tag, causing every `/api/*` call to 403.
+
+### HMR + graceful early-stop + callback hot-swap
+
+`arkor dev` keeps a [Rolldown](https://rolldown.rs) watcher over `src/arkor/` ([packages/arkor/src/studio/hmr.ts](packages/arkor/src/studio/hmr.ts)) and pushes rebuild events over `/api/dev/events` (SSE). On each successful build the watcher dynamic-imports the artifact, pulls a `TrainerInspection` snapshot off the discovered trainer (via the cross-realm `Symbol.for("arkor.trainer.inspect")` brand attached in [packages/arkor/src/core/trainerInspection.ts](packages/arkor/src/core/trainerInspection.ts)), and computes a stable `configHash` from the cloud-side `JobConfig`. The SPA re-fetches `/api/manifest` on each event so the Run Training button stays in sync without a browser refresh.
+
+When a rebuild lands while a `/api/train`-spawned subprocess is in flight, the server makes a per-child decision in [packages/arkor/src/studio/trainRegistry.ts](packages/arkor/src/studio/trainRegistry.ts):
+
+- **`configHash` matches the spawn-time hash** → SIGUSR2. The child's `installCallbackReloadHandler` re-imports the artifact and rotates the trainer's callback cell via the internal `Symbol.for("arkor.trainer.replaceCallbacks")` brand exposed by [packages/arkor/src/core/trainerInspection.ts](packages/arkor/src/core/trainerInspection.ts). The cloud-side run is untouched. Use this whenever a code change is contained inside the `callbacks: { ... }` object. Don't add a `replaceCallbacks()` method to the public `Trainer` interface: keeping the mutator behind a `Symbol.for` brand is what stops the dev-only HMR primitive from leaking into the SDK's published surface.
+- **`configHash` differs (or is null because the new bundle didn't inspect)** → SIGTERM. `installShutdownHandlers` drives the trainer's internal early-stop entry point via the `Symbol.for("arkor.trainer.requestEarlyStop")` brand exposed by [packages/arkor/src/core/trainerInspection.ts](packages/arkor/src/core/trainerInspection.ts), which lets the next `checkpoint.saved` event finish (work preserved) before issuing `cancel()` and exiting cleanly. The SPA auto-restarts the run with the rebuilt artifact via the `restart: true` flag on the SSE event. A second SIGTERM bypasses the early-stop and exits 143 immediately, as an emergency escape hatch for a hung cancel.
+
+Don't replace the SIGTERM-and-let-the-child-handle-it pattern with a SIGKILL escalation in the server: that would orphan Cloud-side jobs (no `cancel()` POST goes out) and waste GPU budget. Don't widen the SIGUSR2 path to "always hot-swap, server-side": the `configHash` check is what guarantees a hot-swap can't silently leave a child running with a stale `JobConfig`. Don't surface `requestEarlyStop()` (or `replaceCallbacks()`) as a method on the public `Trainer` interface: both are dev-only HMR primitives, and keeping them behind `Symbol.for` brands is what stops them from leaking into the published SDK shape; user code that wants similar semantics should compose `abortSignal` + `cancel()` per the cookbook.
 
 ### Project entry-point discovery
 
 The CLI/Studio look at `src/arkor/index.ts` in user projects. Discovery in [packages/arkor/src/core/runner.ts](packages/arkor/src/core/runner.ts) accepts (in order): a named `arkor` export from `createArkor({...})`, a bare `trainer` export, a default export holding either an Arkor manifest or a Trainer, or a `default.trainer` nested shape. `createArkor` returns a frozen, opaque manifest tagged with `_kind: "arkor"`; treat it as a value to hand to tooling, not a programmable client.
 
-`arkor build` ([packages/arkor/src/cli/commands/build.ts](packages/arkor/src/cli/commands/build.ts)) bundles to `.arkor/build/index.mjs` with esbuild; bare specifiers (e.g. `arkor`, anything in `node_modules`) stay external so the artifact resolves the runtime SDK from the project's installed copy.
+`arkor build` ([packages/arkor/src/cli/commands/build.ts](packages/arkor/src/cli/commands/build.ts)) bundles to `.arkor/build/index.mjs` with [Rolldown](https://rolldown.rs); bare specifiers (e.g. `arkor`, anything in `node_modules`) stay external so the artifact resolves the runtime SDK from the project's installed copy. The `transform.target` is derived from `process.versions.node` at build time so the bundle targets the same Node binary that will execute it.
 
 ### E2E suite specifics
 
-Both [e2e/cli](e2e/cli) and [e2e/studio](e2e/studio) declare `arkor` (and, for `e2e/cli`, `create-arkor`) as `workspace:*` `devDependencies`, so Turbo's `^build` produces `dist/bin.mjs` exactly once before `#test`/`#test:coverage` runs — no `pretest` hooks, no concurrent rebuilds racing on `dist/`. Standalone runs (`pnpm --filter @arkor/e2e-* test`) need a prior `pnpm build`. Every supported Node (≥22.22.0) is in rolldown's compatible range (^20.19 || >=22.12), so the previous "rolldown-incompatible" CI bypass path was removed.
+Both [e2e/cli](e2e/cli) and [e2e/studio](e2e/studio) declare `arkor` (and, for `e2e/cli`, `create-arkor`) as `workspace:*` `devDependencies`, so Turbo's `^build` produces `dist/bin.mjs` exactly once before `#test`/`#test:coverage` runs (no `pretest` hooks, no concurrent rebuilds racing on `dist/`). Standalone runs (`pnpm --filter @arkor/e2e-* test`) need a prior `pnpm build`. Every supported Node (≥22.22.0) is in rolldown's compatible range (^20.19 || >=22.12), so the previous "rolldown-incompatible" CI bypass path was removed.
 
 Tests rely on `ARKOR_INTERNAL_SCAFFOLD_ARKOR_SPEC=file:.../packages/arkor` so the scaffolded fixtures install the workspace `arkor` instead of the npm-published one. Both this var and `SKIP_E2E_INSTALL` are declared in [turbo.json](turbo.json) so they pass through Turbo's hash.
 
@@ -96,7 +107,7 @@ When implementing anything (new feature, SDK/CLI/Studio behaviour change, schema
 1. **Docs in both languages.** This repo pairs English/Japanese docs: `README.md` ↔ `README.ja.md`, `CONTRIBUTING.md` ↔ `CONTRIBUTING.ja.md`, and `docs/` ↔ `docs/ja/`. If you edit the English side, update the Japanese side in the same PR. Don't leave Japanese docs to be retro-translated later.
 2. **Tests.** Add vitest cases under `packages/*/src/**/*.test.ts` for SDK/CLI/scaffold logic changes. For CLI flow changes, consider an `e2e/cli` scenario.
 
-Don't split these into "docs in a follow-up PR" or "tests later" — land them in the same PR. Skip only when the user explicitly says to.
+Don't split these into "docs in a follow-up PR" or "tests later"; land them in the same PR. Skip only when the user explicitly says to.
 
 ## Non-obvious gotchas
 
diff --git a/docs/concepts/studio.mdx b/docs/concepts/studio.mdx
index fb9835c5..90ae5bdd 100644
--- a/docs/concepts/studio.mdx
+++ b/docs/concepts/studio.mdx
@@ -14,7 +14,12 @@ Four jobs:
 3. **Try a finished model.** A Playground page lets you pick the base model or the final adapter from any completed job and chat with it. The Playground does not load intermediate checkpoints; for mid-run inference, use [`onCheckpoint`](/concepts/lifecycle) callbacks in your trainer.
 4. **Publish a model behind a `*.arkor.app` URL.** An Endpoints page creates a per-deployment subdomain that serves OpenAI-compatible chat completions for a chosen adapter or base model, plus the API keys that authenticate calls to it. The same actions are available programmatically via [`CloudApiClient`](/sdk/deployments) — Studio is the interactive surface; the SDK is the lower-level one.
 
-A note on the dev loop: Studio's `/api/manifest` endpoint rebuilds and re-imports your trainer on every request (with a cache-bust query, see `packages/arkor/src/studio/manifest.ts`), but the UI only fetches it when the Run training page mounts. So if you edit `src/arkor/` and stay on the same Run training page, the next click reuses the existing `.arkor/build/index.mjs` and runs your old code. Refresh the page (or run `arkor build` from the terminal) between edits and clicks to pick up the new code reliably.
+A note on the dev loop: Studio runs a [Rolldown](https://rolldown.rs) watcher over `src/arkor/` and pushes rebuild notifications to the SPA over a Server-Sent Events stream (`/api/dev/events`). Edit a file, save, and the Run training button updates with the new trainer name without a refresh. If a training run is in flight, Studio compares the new bundle's cloud-side `JobConfig` hash to the one captured when the run was spawned:
+
+- **Same hash (only callbacks changed).** The runner is signalled with SIGUSR2; it re-imports the rebuilt artifact and rotates the trainer's callback cell in place via an internal HMR brand. The cloud-side training run is untouched, no GPU time is wasted, and the SPA shows a brief "Callbacks hot-swapped" indicator.
+- **Different hash (model / dataset / hyperparameters changed).** The runner is signalled with SIGTERM; the trainer's internal early-stop entry point lets the next checkpoint upload finish before issuing `cancel()`, then the SPA re-spawns the run with the rebuilt artifact. The previous Cloud-side job reaches `cancelled` after the checkpoint is uploaded, so the partial work is preserved as an artifact.
+
+If you want this "stop after the next checkpoint" behaviour from your own code (rather than from the dev loop), build it on top of the public [`abortSignal` + `cancel()`](/sdk/trainer-control#abortsignal) pair. The [Early stopping recipe](/cookbook/early-stopping) walks through it.
 
 ## Where Studio runs
 
diff --git a/docs/ja/concepts/studio.mdx b/docs/ja/concepts/studio.mdx
index 12f176e2..224c6eed 100644
--- a/docs/ja/concepts/studio.mdx
+++ b/docs/ja/concepts/studio.mdx
@@ -14,7 +14,12 @@ Studio は `arkor dev` 実行時に立ち上がるローカル Web UI です。
 3. **完成モデルを試す。** Playground ページでベースモデルや任意の完了済みジョブの最終アダプターを選んでチャットできます。中間チェックポイントは Playground からはロードしません。学習中の推論には [`onCheckpoint`](/ja/concepts/lifecycle) コールバックをトレーナーで使ってください。
 4. **`*.arkor.app` URL でモデルを公開する。** Endpoints ページで OpenAI 互換 chat completions を提供する deployment 専用サブドメインを作成し、その API キーを発行・取り消しできます。同じ操作は [`CloudApiClient`](/ja/sdk/deployments) からプログラマティックにも可能で、Studio が対話的なインターフェイス、SDK が下位レイヤーという位置付けです。
 
-dev ループのメモ: Studio の `/api/manifest` エンドポイントはリクエストごとにトレーナーをリビルド・再 import しますが（キャッシュバストクエリ付き、`packages/arkor/src/studio/manifest.ts` を参照）、UI が fetch するのは Run training ページがマウントされたときだけです。`src/arkor/` を編集して同じ Run training ページに留まり続けると、次のクリックは既存の `.arkor/build/index.mjs` を再利用して古いコードで走ります。確実に新しいコードを取り込むには、編集とクリックの間にページをリロード（あるいはターミナルから `arkor build`）してください。
+dev ループのメモ: Studio は [Rolldown](https://rolldown.rs) のウォッチャを `src/arkor/` 上で常駐させ、再ビルド通知を Server-Sent Events ストリーム (`/api/dev/events`) で SPA に push します。ファイルを編集して保存すれば、Run training ボタンのトレーナー名表示はリロード無しで更新されます。学習が走っている最中であれば、Studio は再ビルドしたバンドルの Cloud 側 `JobConfig` ハッシュを、spawn 時に保存したハッシュと比較します。
+
+- **ハッシュ一致（コールバックのみ変更）。** ランナーへ SIGUSR2 を送ります。ランナーは再ビルドされた成果物を再 import し、内部 HMR ブランド経由でトレーナーのコールバック cell をその場で差し替えます。Cloud 側の学習はそのまま継続し、GPU 時間を無駄にせず、SPA には "Callbacks hot-swapped" と短く表示されます。
+- **ハッシュ不一致（モデル / データセット / ハイパーパラメータが変わった）。** ランナーへ SIGTERM を送ります。トレーナー内部の early-stop エントリが次のチェックポイントのアップロードを待ってから `cancel()` を発火し、SPA が再ビルドした成果物で再投入します。Cloud 側の以前のジョブはチェックポイントのアップロード完了後に `cancelled` 状態に遷移するので、ここまでの学習成果は artifact として保全されます。
+
+自前のコードから（dev ループではなく）この「次のチェックポイントで止める」挙動が欲しい場合は、公開 API の [`abortSignal` + `cancel()`](/ja/sdk/trainer-control#abortsignal) を組み合わせて書いてください。具体的な手順は [Early Stopping レシピ](/ja/cookbook/early-stopping) にあります。
 
 ## Studio が動く場所
 
diff --git a/docs/ja/studio/jobs.mdx b/docs/ja/studio/jobs.mdx
index 5a1fae60..56683075 100644
--- a/docs/ja/studio/jobs.mdx
+++ b/docs/ja/studio/jobs.mdx
@@ -62,8 +62,8 @@ Jobs ページ（`#/jobs`）はマウント時に 1 度、その後 5 秒ごと
 
 Loss チャートは `training.log` イベントから描画される SVG プロットです。Y 軸は最小値と最大値によるスケーリング、X 軸はステップ番号で、最大 2 系列を表示します:
 
-- **Training loss** — 実線のティール色。数値 `loss` を含むイベントごとに 1 頂点。
-- **Eval loss** — 破線のピンク色（点マーカー付き）。数値 `evalLoss` を含むイベント（通常は `evalSteps` 刻み）から描画。系列はイベントから直接構築するため、`evalLoss` のみを持ち `loss` を含まない eval-only フレームも線・凡例・統計に反映されます。Eval ポイントが 1 つも来ていない間は凡例にも表示されません。
+- **Training loss**: 実線のティール色。数値 `loss` を含むイベントごとに 1 頂点。
+- **Eval loss**: 破線のピンク色（点マーカー付き）。数値 `evalLoss` を含むイベント（通常は `evalSteps` 刻み）から描画。系列はイベントから直接構築するため、`evalLoss` のみを持ち `loss` を含まない eval-only フレームも線・凡例・統計に反映されます。Eval ポイントが 1 つも来ていない間は凡例にも表示されません。
 
 ホバーすると最寄りステップと、そのステップに含まれる `loss` / `evalLoss` のうち存在する値が表示されます（eval-only ステップでは `loss` 値は出ず、その逆も同様）。チャートは `loss` または `evalLoss` のいずれかが数値であるイベントが 1 件以上届くまで `Waiting for training.log events…`（`training.log` イベント待ち）プレースホルダーを表示します。両方とも null / 省略の `training.log` フレームはカウントされません。
 
@@ -71,9 +71,9 @@ Loss チャートは `training.log` イベントから描画される SVG プロ
 
 チャートヘッダーの **Advanced** トグルを ON にすると、系列ごとの統計パネルが現れます。各カードに表示される項目:
 
-- **Mean loss ± 95% CI** — Loss 値の標本平均と 95% 信頼区間の半幅（Student の t 分布。n > 31 では z = 1.96 にフォールバック）。
-- **Std dev**（標準偏差）と **Variance**（分散） — Bessel 補正済みの不偏推定量（`ddof=1`）。
-- **p90** と **p95** — numpy のデフォルトに合わせた線形補間パーセンタイル。
+- **Mean loss ± 95% CI**: Loss 値の標本平均と 95% 信頼区間の半幅（Student の t 分布。n > 31 では z = 1.96 にフォールバック）。
+- **Std dev**（標準偏差）と **Variance**（分散）: Bessel 補正済みの不偏推定量（`ddof=1`）。
+- **p90** と **p95**: numpy のデフォルトに合わせた線形補間パーセンタイル。
 
 Eval カードは数値 `evalLoss` を含む `training.log` イベントが届くまでは空のままです。
 
diff --git a/e2e/studio/src/specs/hmr.spec.ts b/e2e/studio/src/specs/hmr.spec.ts
new file mode 100644
index 00000000..94346b38
--- /dev/null
+++ b/e2e/studio/src/specs/hmr.spec.ts
@@ -0,0 +1,284 @@
+import { writeFileSync } from "node:fs";
+import { join } from "node:path";
+import { expect, test } from "../harness/fixture";
+
+/**
+ * Rewrite the seeded `src/arkor/index.ts` with a new trainer `name`
+ * (and arbitrary content tail to bump mtime + size beyond any
+ * sub-millisecond resolution noise on fast filesystems). We rewrite
+ * the WHOLE file (not append) so rolldown's incremental cache can't
+ * reuse the prior module record and skip the rebuild.
+ *
+ * Two key shape differences from `seedFixture.ts`'s `seedManifest`:
+ *
+ *  1. The trainer carries the `Symbol.for("arkor.trainer.inspect")`
+ *     brand so `findInspectableTrainer` (used by `studio/hmr.ts`'s
+ *     `inspectBundle`) can read its name + config: without the
+ *     brand, every SSE rebuild frame gets `trainerName: null` and
+ *     the SSE-level test below can't distinguish the post-edit
+ *     rebuild from the cached initial-build replay. The seed
+ *     fixture skips the brand because its existing tests only
+ *     exercise the `/api/manifest` path (which uses
+ *     `findTrainerInModule`, brand-less). Extending it would
+ *     couple every test to inspection internals it doesn't care
+ *     about.
+ *
+ *  2. The brand returns a real `JobConfig` shape (`model` +
+ *     `datasetSource` set), not the seed's empty placeholder, so
+ *     `hashJobConfig` produces a stable non-empty `configHash`.
+ *     `studio/server.ts`'s `dispatchRebuild` consults that hash to
+ *     route between SIGUSR2 hot-swap and SIGTERM restart; the
+ *     existing E2E only tests the boot path so it never needs a
+ *     real config there.
+ *
+ * `Symbol.for` keys round-trip across the dev process / built
+ * bundle realm boundary because they live in the global symbol
+ * registry, the same mechanism `core/trainerInspection.ts` documents
+ * for the runtime CLI / `.arkor/build/index.mjs` split.
+ */
+function rewriteManifest(projectDir: string, name: string): void {
+  const path = join(projectDir, "src", "arkor", "index.ts");
+  writeFileSync(
+    path,
+    [
+      'const TRAINER_INSPECT_KEY = Symbol.for("arkor.trainer.inspect");',
+      "const trainer = {",
+      `  name: ${JSON.stringify(name)},`,
+      "  start: async () => ({ id: 'e2e-job', url: '' }),",
+      "  wait: async () => ({ status: 'completed' as const }),",
+      "  cancel: async () => {},",
+      "};",
+      "Object.defineProperty(trainer, TRAINER_INSPECT_KEY, {",
+      "  value: () => ({",
+      "    name: trainer.name,",
+      "    config: {",
+      '      model: "studio-e2e-model",',
+      '      datasetSource: { type: "huggingface" as const, name: "studio-e2e-dataset" },',
+      "    },",
+      "    callbacks: {},",
+      "  }),",
+      "  enumerable: false,",
+      "});",
+      'export const arkor = { _kind: "arkor" as const, trainer };',
+      "export default arkor;",
+      `// rewritten-${name}-${Date.now()}`,
+      "",
+    ].join("\n"),
+  );
+}
+
+interface SseFrame {
+  event: string;
+  data: string;
+}
+
+/**
+ * Open `/api/dev/events`, parse incoming SSE frames, and resolve when
+ * `predicate` first returns true. Cleans up the underlying body
+ * reader on resolve / reject so the Hono server's connection bookkeeping
+ * doesn't leak between tests.
+ *
+ * `/api/dev/events` accepts the studio token via the `studioToken` query
+ * param (EventSource can't set headers); this `fetch()` uses the same path.
+ */
+async function awaitSseFrame(
+  studioUrl: string,
+  token: string,
+  predicate: (frame: SseFrame) => boolean,
+  timeoutMs: number,
+): Promise<SseFrame> {
+  const url = `${studioUrl}/api/dev/events?studioToken=${encodeURIComponent(token)}`;
+  const controller = new AbortController();
+  const timeout = setTimeout(() => controller.abort(), timeoutMs);
+  let res: Response;
+  try {
+    res = await fetch(url, { signal: controller.signal });
+  } catch (err) {
+    clearTimeout(timeout);
+    throw new Error(
+      `SSE connect failed for ${url}: ${(err as Error).message}`,
+    );
+  }
+  if (!res.ok || !res.body) {
+    clearTimeout(timeout);
+    throw new Error(
+      `SSE connect returned ${res.status} ${res.statusText}; body=${
+        res.body ? "present" : "missing"
+      }`,
+    );
+  }
+  const reader = res.body.getReader();
+  const decoder = new TextDecoder();
+  let buf = "";
+  try {
+    while (true) {
+      const { value, done } = await reader.read();
+      if (done) {
+        throw new Error("SSE stream ended before predicate matched");
+      }
+      buf += decoder.decode(value, { stream: true });
+      // Frames are terminated by a blank line (`\n\n`). Split, keep
+      // the trailing partial in `buf` for the next iteration.
+      const parts = buf.split("\n\n");
+      buf = parts.pop() ?? "";
+      for (const raw of parts) {
+        if (!raw) continue;
+        let event = "";
+        let data = "";
+        for (const line of raw.split("\n")) {
+          if (line.startsWith("event: ")) event = line.slice(7);
+          else if (line.startsWith("data: ")) data = line.slice(6);
+        }
+        const frame: SseFrame = { event, data };
+        if (predicate(frame)) return frame;
+      }
+    }
+  } finally {
+    clearTimeout(timeout);
+    // Cancel rather than just release: cancel propagates to the Hono
+    // ReadableStream's `cancel()` handler so the server unsubscribes
+    // this listener from the HMR coordinator promptly. Otherwise the
+    // listener lingers until the next dispose, which can produce
+    // cross-test bleed when running with `--repeat-each`.
+    await reader.cancel().catch(() => {});
+  }
+}
+
+test.describe("Studio HMR", () => {
+  test("/api/dev/events is registered with the hmr-enabled meta tag", async ({
+    page,
+    studio,
+  }) => {
+    // Boot-time wiring: `arkor dev` always wires up the HMR
+    // coordinator, so the served HTML must carry both the
+    // studio-token meta and the hmr-enabled meta. Without the
+    // hmr-enabled tag, `isHmrEnabled()` returns false in the SPA
+    // and the auto-restart / hot-swap paths silently no-op.
+    await page.goto(studio.url);
+    const hmrMeta = page.locator('meta[name="arkor-hmr-enabled"]');
+    await expect(hmrMeta).toHaveCount(1);
+    await expect(hmrMeta).toHaveAttribute("content", "true");
+
+    // Endpoint sanity-check: a GET without the studio token must 403
+    // (regression for the CSRF allow-list: `eventStreamPathPattern`
+    // permits the query-token form, but a raw GET stays gated).
+    const noToken = await fetch(`${studio.url}/api/dev/events`);
+    expect(noToken.status).toBe(403);
+  });
+
+  test("editing src/arkor/index.ts broadcasts a rebuild SSE frame with the new trainer name", async ({
+    studio,
+    fixturePaths,
+  }) => {
+    // Edit BEFORE subscribing, then let the predicate filter out
+    // pre-edit replays. The watcher may already have a cached
+    // initial-build `ready` (with the seed name) by the time we
+    // connect; subscribing first then editing would force a
+    // drain step. Going edit → subscribe is simpler: the
+    // predicate explicitly requires `trainerName === newName`,
+    // which only the post-edit BUNDLE_END can satisfy; any
+    // cached or in-flight frame for the seed name fails the
+    // predicate and `awaitSseFrame` keeps reading until the
+    // matching one arrives.
+    const newName = "studio-e2e-trainer-edited";
+    rewriteManifest(fixturePaths.projectDir, newName);
+
+    const frame = await awaitSseFrame(
+      studio.url,
+      studio.token,
+      (f) => {
+        if (f.event !== "rebuild" && f.event !== "ready") return false;
+        // Some replays have empty data; skip those.
+        if (!f.data) return false;
+        try {
+          const parsed = JSON.parse(f.data) as {
+            trainerName?: string | null;
+          };
+          return parsed.trainerName === newName;
+        } catch {
+          return false;
+        }
+      },
+      // Generous: rolldown's first cold build on a fresh project
+      // can take 1–2s on a slow CI runner; the post-edit rebuild is
+      // typically faster (incremental) but we don't want to flake on
+      // a noisy host.
+      20_000,
+    );
+
+    expect(["rebuild", "ready"]).toContain(frame.event);
+    const parsed = JSON.parse(frame.data) as {
+      outFile?: string;
+      trainerName?: string | null;
+      configHash?: string | null;
+    };
+    expect(parsed.trainerName).toBe(newName);
+    // The artefact path is also part of the contract: HMR consumers
+    // (including the runner subprocess on SIGUSR2) re-import the
+    // bundle by `outFile`. A regression that drops it would silently
+    // disable hot-swap.
+    expect(parsed.outFile).toMatch(/\.arkor[\\/]build[\\/]index\.mjs$/);
+  });
+
+  test("/api/manifest reflects the edited trainer name after a save", async ({
+    studio,
+    fixturePaths,
+  }) => {
+    // End-to-end through the Hono `/api/manifest` route, which
+    // dynamic-imports the freshly-built artefact via
+    // `summariseBuiltManifest`. The HMR rebuild must have completed
+    // *and* the cache-bust URL must reflect the new bytes for this
+    // assertion to pass, which exercises the whole rebuild → write
+    // artefact → re-import → return summary chain.
+    const newName = `studio-e2e-trainer-renamed-${Date.now()}`;
+    rewriteManifest(fixturePaths.projectDir, newName);
+
+    await expect
+      .poll(
+        async () => {
+          const res = await fetch(`${studio.url}/api/manifest`, {
+            headers: { "X-Arkor-Studio-Token": studio.token },
+          });
+          if (!res.ok) return null;
+          const body = (await res.json()) as {
+            trainer?: { name?: string } | null;
+          };
+          return body.trainer?.name ?? null;
+        },
+        {
+          // Same 20s budget as the SSE test for the same reason: the
+          // first rebuild after spawn can be slow on cold CI. Keep
+          // the poll interval modest so we don't hammer the dev
+          // loop's `runBuild` faster than it can settle.
+          timeout: 20_000,
+          intervals: [200, 400, 800, 1500],
+        },
+      )
+      .toBe(newName);
+  });
+
+  test("the SPA Run Training caption updates without a page reload after a save", async ({
+    page,
+    studio,
+    fixturePaths,
+  }) => {
+    // End-to-end browser proof: the SPA's RunTraining component
+    // subscribes to `/api/dev/events`, calls `fetchManifest()` on
+    // each rebuild, and re-renders the trainer caption. Reloading
+    // the page would mask any regression in that subscription path,
+    // so we explicitly DO NOT navigate again after the edit.
+    await page.goto(studio.url);
+    await expect(page.getByText(/studio-e2e-trainer/).first()).toBeVisible();
+
+    const newName = `studio-e2e-trainer-live-${Date.now()}`;
+    rewriteManifest(fixturePaths.projectDir, newName);
+
+    // The new name should appear without a navigation. Match by
+    // substring rather than exact text so the surrounding "Trainer
+    // <name> from src/arkor/index.ts" caption decoration doesn't
+    // need to be replicated here.
+    await expect(page.getByText(newName).first()).toBeVisible({
+      timeout: 20_000,
+    });
+  });
+});
diff --git a/packages/arkor/package.json b/packages/arkor/package.json
index 692d2f3c..11088c91 100644
--- a/packages/arkor/package.json
+++ b/packages/arkor/package.json
@@ -55,17 +55,17 @@
     "@clack/prompts": "^0.8.0",
     "@hono/node-server": "^1.14.0",
     "commander": "^13.0.0",
-    "esbuild": "^0.28.0",
     "hono": "^4.7.0",
     "open": "^10.0.0",
     "posthog-node": "^5.30.6",
+    "rolldown": "^1.0.0",
     "zod": "^4.3.6"
   },
   "devDependencies": {
     "@arkor/cli-internal": "workspace:*",
     "@types/node": "^24",
     "@vitest/coverage-v8": "^4.1.5",
-    "tsdown": "^0.21.9",
+    "tsdown": "^0.22.0",
     "typescript": "^5",
     "vitest": "^4.1.5"
   },
diff --git a/packages/arkor/src/cli/cleanupHooks.test.ts b/packages/arkor/src/cli/cleanupHooks.test.ts
new file mode 100644
index 00000000..864feb9f
--- /dev/null
+++ b/packages/arkor/src/cli/cleanupHooks.test.ts
@@ -0,0 +1,256 @@
+import { afterEach, describe, expect, it, vi } from "vitest";
+import {
+  __resetCleanupHooksForTests,
+  registerCleanupHook,
+} from "./cleanupHooks";
+
+// Each test that emits a signal also installs new listeners on
+// `process` for the lifetime of this worker. Auto-detach inside the
+// handlers covers the fire-then-cleanup case; `__resetCleanupHooksForTests`
+// covers tests whose registration never fires (still need their
+// listeners off the worker before the next test runs).
+
+let exitSpy: ReturnType<typeof vi.spyOn> | null = null;
+let stdoutSpy: ReturnType<typeof vi.spyOn> | null = null;
+
+afterEach(() => {
+  exitSpy?.mockRestore();
+  stdoutSpy?.mockRestore();
+  exitSpy = null;
+  stdoutSpy = null;
+  __resetCleanupHooksForTests();
+});
+
+function mockExit(): number[] {
+  const codes: number[] = [];
+  exitSpy = vi
+    .spyOn(process, "exit")
+    .mockImplementation(((code?: number) => {
+      codes.push(code ?? 0);
+      return undefined as never;
+    }) as typeof process.exit);
+  return codes;
+}
+
+function flushMicrotasks(): Promise<void> {
+  return new Promise((resolve) => setImmediate(resolve));
+}
+
+describe("registerCleanupHook", () => {
+  it("waits for an async sibling cleanup to settle before exitOnSignal fires", async () => {
+    // Regression: previously the signal handler called
+    // `process.exit(0)` immediately after kicking off cleanup, so a
+    // sibling registration's async dispose (`hmr.dispose()`) got cut
+    // off mid-promise. The fix coordinates via a module-level
+    // in-flight set so the exit-owning hook awaits every other
+    // registered cleanup before terminating.
+    const order: string[] = [];
+    let resolveSlowDispose!: () => void;
+    const slowDispose = new Promise<void>((resolve) => {
+      resolveSlowDispose = resolve;
+    });
+
+    registerCleanupHook({
+      cleanup: () =>
+        slowDispose.then(() => {
+          order.push("async-cleanup-finished");
+        }),
+    });
+    registerCleanupHook({
+      cleanup: () => {
+        order.push("sync-cleanup");
+      },
+      exitOnSignal: true,
+    });
+
+    const codes = mockExit();
+    process.emit("SIGINT", "SIGINT");
+
+    // Sync cleanup body has already fired; async one is still pending,
+    // and exit must NOT have been called yet.
+    expect(order).toEqual(["sync-cleanup"]);
+    expect(codes).toEqual([]);
+
+    // Resolve the slow dispose; one microtask later the coordinator
+    // fires process.exit(130), the SIGINT exit code asserted below.
+    resolveSlowDispose();
+    await flushMicrotasks();
+    await flushMicrotasks();
+
+    expect(order).toEqual(["sync-cleanup", "async-cleanup-finished"]);
+    // SIGINT exits 130 (POSIX 128 + signo for SIGINT=2) so parent
+    // shells / orchestrators can distinguish "user interrupted"
+    // from "ran to completion (0)"; see SIGNAL_EXIT_CODE in
+    // core/signalExit.ts.
+    expect(codes).toEqual([130]);
+  });
+
+  it("waits for sibling async cleanups even when the exit-owning hook is registered FIRST", async () => {
+    // Regression: even with the in-flight set in place, the
+    // exit-owning hook's signal handler used to take its
+    // `[...inFlightCleanups]` snapshot synchronously inside the
+    // listener body. Node's EventEmitter dispatches signal listeners
+    // in registration order, so when the exit-owning hook is wired
+    // up *first*, its handler takes the snapshot before any sibling
+    // hook (registered later) gets a chance to run its handler and
+    // add its own in-flight promise. Result: `Promise.allSettled`
+    // resolved on the snapshot of just-this-hook's promise → exit
+    // fired → siblings' async cleanup got cut off mid-flight.
+    //
+    // The order in the existing "waits for an async sibling
+    // cleanup" test happens to dodge this bug by registering the
+    // async hook first, so its handler runs first and seeds
+    // inFlightCleanups before the exit-owner takes its snapshot.
+    // This test inverts the order to actually exercise the
+    // queueMicrotask-deferred snapshot fix.
+    const order: string[] = [];
+    let resolveSlow!: () => void;
+    const slow = new Promise<void>((resolve) => {
+      resolveSlow = resolve;
+    });
+
+    // Register exit-owner FIRST.
+    registerCleanupHook({
+      cleanup: () => {
+        order.push("sync-cleanup");
+      },
+      exitOnSignal: true,
+    });
+    // Sibling async cleanup registered AFTER. With the old code,
+    // its promise wouldn't make it into the exit-owner's snapshot.
+    registerCleanupHook({
+      cleanup: () =>
+        slow.then(() => {
+          order.push("async-cleanup-finished");
+        }),
+    });
+
+    const codes = mockExit();
+    process.emit("SIGINT", "SIGINT");
+
+    // Sync ran inline; async pending; exit must NOT have fired.
+    expect(order).toEqual(["sync-cleanup"]);
+    expect(codes).toEqual([]);
+
+    resolveSlow();
+    await flushMicrotasks();
+    await flushMicrotasks();
+
+    expect(order).toEqual(["sync-cleanup", "async-cleanup-finished"]);
+    // SIGINT exits 130 (POSIX 128 + signo for SIGINT=2) so parent
+    // shells / orchestrators can distinguish "user interrupted"
+    // from "ran to completion (0)"; see SIGNAL_EXIT_CODE in
+    // core/signalExit.ts.
+    expect(codes).toEqual([130]);
+  });
+
+  it("exits with the POSIX 128+signo code for each terminating signal (130/143/129)", async () => {
+    // Regression: the exit-owning hook used to always
+    // `process.exit(0)`, regardless of which signal fired the
+    // shutdown. Parent shells / orchestrators / CI runners that
+    // gate on signal-style nonzero status would mis-classify a
+    // Ctrl-C (SIGINT) as a clean run: `arkor dev || cleanup`
+    // would skip the cleanup branch and leave whatever it owned
+    // unreaped. POSIX convention is 128 + signo (SIGINT=2 → 130,
+    // SIGTERM=15 → 143, SIGHUP=1 → 129); SIGNAL_EXIT_CODE in
+    // core/signalExit.ts pins the mapping.
+    const cases: Array<["SIGINT" | "SIGTERM" | "SIGHUP", number]> = [
+      ["SIGINT", 130],
+      ["SIGTERM", 143],
+      ["SIGHUP", 129],
+    ];
+    for (const [sig, expected] of cases) {
+      registerCleanupHook({ cleanup: () => {}, exitOnSignal: true });
+      const codes = mockExit();
+      process.emit(sig, sig);
+      // queueMicrotask + Promise.allSettled chain: two flushes
+      // mirror the existing tests.
+      await flushMicrotasks();
+      await flushMicrotasks();
+      expect(codes, `signal ${sig}`).toEqual([expected]);
+      // Reset before the next iteration registers its hook so that
+      // leftover listeners from this signal can't fire again and
+      // clobber the next signal's expected exit code.
+      __resetCleanupHooksForTests();
+      exitSpy?.mockRestore();
+      exitSpy = null;
+    }
+  });
+
+  it("auto-detaches its process listeners after firing so they don't accumulate", () => {
+    // Regression: previously each `registerCleanupHook` call left
+    // `process.on('exit', ...)` and per-signal listeners armed
+    // forever. A long-lived Node worker that re-arms hooks (vitest
+    // running many tests, or any future caller that re-registers on
+    // each iteration) tripped Node's
+    // `MaxListenersExceededWarning`. Fix: each handler synchronously
+    // detaches its registration after invoking `run()`.
+    const exitBefore = process.listeners("exit").length;
+    const sigintBefore = process.listeners("SIGINT").length;
+    const sigtermBefore = process.listeners("SIGTERM").length;
+    const sighupBefore = process.listeners("SIGHUP").length;
+
+    registerCleanupHook({
+      cleanup: () => {},
+      exitOnSignal: false,
+    });
+
+    expect(process.listeners("exit").length).toBe(exitBefore + 1);
+    expect(process.listeners("SIGINT").length).toBe(sigintBefore + 1);
+    expect(process.listeners("SIGTERM").length).toBe(sigtermBefore + 1);
+    expect(process.listeners("SIGHUP").length).toBe(sighupBefore + 1);
+
+    // Firing one signal must detach BOTH that registration's signal
+    // listener AND its sibling exit listener: the registration is
+    // done after first fire regardless of which channel triggered it.
+    process.emit("SIGINT", "SIGINT");
+
+    expect(process.listeners("exit").length).toBe(exitBefore);
+    expect(process.listeners("SIGINT").length).toBe(sigintBefore);
+    expect(process.listeners("SIGTERM").length).toBe(sigtermBefore);
+    expect(process.listeners("SIGHUP").length).toBe(sighupBefore);
+  });
+
+  it("__resetCleanupHooksForTests detaches every still-armed registration", () => {
+    // Test-only escape hatch for registrations whose handler never
+    // fires inside the test (no signal emitted); without it, those
+    // listeners would persist across the vitest worker's test queue.
+    const exitBefore = process.listeners("exit").length;
+    registerCleanupHook({ cleanup: () => {}, exitOnSignal: false });
+    registerCleanupHook({ cleanup: () => {}, exitOnSignal: true });
+    expect(process.listeners("exit").length).toBe(exitBefore + 2);
+
+    __resetCleanupHooksForTests();
+
+    expect(process.listeners("exit").length).toBe(exitBefore);
+  });
+
+  it("is idempotent against repeated signals (done latch + bounded exit)", async () => {
+    let invocations = 0;
+    registerCleanupHook({
+      cleanup: () => {
+        invocations += 1;
+      },
+      exitOnSignal: true,
+    });
+
+    const codes = mockExit();
+    process.emit("SIGINT", "SIGINT");
+    process.emit("SIGINT", "SIGINT");
+    process.emit("SIGINT", "SIGINT");
+    await flushMicrotasks();
+    await flushMicrotasks();
+
+    // Cleanup body runs once even if the signal fires multiple times
+    // (auto-detach removes the listener after first fire; the `done`
+    // latch is the secondary defence in case detach is racy).
+    expect(invocations).toBe(1);
+    // First SIGINT fires the handler → exit(130); follow-ups hit no
+    // listener after auto-detach, so codes has exactly one entry.
+    // SIGINT exits 130 (POSIX 128 + signo for SIGINT=2) so parent
+    // shells / orchestrators can distinguish "user interrupted"
+    // from "ran to completion (0)"; see SIGNAL_EXIT_CODE in
+    // core/signalExit.ts.
+    expect(codes).toEqual([130]);
+  });
+});
diff --git a/packages/arkor/src/cli/cleanupHooks.ts b/packages/arkor/src/cli/cleanupHooks.ts
new file mode 100644
index 00000000..24d0eb60
--- /dev/null
+++ b/packages/arkor/src/cli/cleanupHooks.ts
@@ -0,0 +1,196 @@
+import { SIGNAL_EXIT_CODE } from "../core/signalExit";
+
+// POSIX `128 + signo` exit codes live in `core/signalExit.ts` so the
+// runner's two-stage shutdown handler and this coordinator share a
+// single source of truth. Without the shared map, adding (say)
+// SIGQUIT to one side without the other would produce inconsistent
+// exit statuses for the same signal: the exact parent-shell-
+// classification regression the per-signal code was introduced to
+// prevent.
+
+const TERMINATING_SIGNALS = ["SIGINT", "SIGTERM", "SIGHUP"] as const;
+
+export interface CleanupHookOptions {
+  /**
+   * Idempotent cleanup body. Wrapped with a `done` guard so a noisy
+   * shutdown (signal arriving while `process.exit` is already running
+   * an `exit` listener) doesn't trigger a double-cleanup. May be sync
+   * or return a Promise; async cleanups are awaited (across **all
+   * registered hooks**) before `exitOnSignal` fires the final
+   * `process.exit`.
+   */
+  cleanup: () => void | Promise<void>;
+  /**
+   * Whether the signal-handler arm of this registration should call
+   * `process.exit` once every in-flight cleanup (this hook + any
+   * siblings registered in the same process) has settled. Use `true`
+   * for the outermost cleanup responsible for terminating the
+   * process; `false` for inner cleanups that should let a sibling
+   * own the exit. Default: `false`.
+   *
+   * The exit code is the POSIX `128 + signo` for the signal that
+   * triggered shutdown: 130 for SIGINT, 143 for SIGTERM, 129 for
+   * SIGHUP (see `SIGNAL_EXIT_CODE`). Parent shells / orchestrators /
+   * CI runners distinguish "user interrupted" (nonzero) from "ran
+   * to completion" (zero) on this: exiting 0 for a Ctrl-C'd
+   * `arkor dev` would let `arkor dev || cleanup_on_failure` skip
+   * its cleanup branch.
+   */
+  exitOnSignal?: boolean;
+}
+
+/**
+ * Module-scoped tracker of cleanup promises that haven't settled yet.
+ * The exit-owning hook waits on the union of (its own cleanup) +
+ * (every other in-flight cleanup) before calling `process.exit(...)`,
+ * so a fire-and-forget async cleanup in a sibling registration
+ * (`hmr.dispose()` is the canonical example) isn't cut off by an
+ * eager exit. (Exit code is signal-specific; see `SIGNAL_EXIT_CODE`.)
+ *
+ * Auto-prunes via the `.finally(() => inFlightCleanups.delete(...))`
+ * each `run()` attaches, so the set doesn't grow without bound across
+ * multiple `runDev()` invocations in the same process (tests).
+ */
+const inFlightCleanups = new Set<Promise<void>>();
+
+/**
+ * Detachers for every still-armed registration. The signal/exit
+ * handlers each call their own detacher synchronously after invoking
+ * `run()` so a long-lived worker that calls `registerCleanupHook`
+ * many times (vitest reusing the same Node worker across tests, or a
+ * future caller that re-arms hooks dynamically) doesn't pile up
+ * `process.on(...)` listeners and trip Node's
+ * `MaxListenersExceededWarning`. Test code can also call
+ * `__resetCleanupHooksForTests()` to detach every still-armed
+ * registration up-front for explicit isolation.
+ */
+const attachedHandlers = new Set<() => void>();
+
+/**
+ * Register a cleanup hook that fires on `process.exit` and on
+ * SIGINT / SIGTERM / SIGHUP. Used by `runDev` to dispose long-lived
+ * resources (the studio-token file, the HMR coordinator) without each
+ * call site re-implementing the same idempotent-guard + per-signal
+ * registration boilerplate.
+ *
+ * Per-registration signal listeners (rather than a singleton): each
+ * `runDev()` invocation gets its own listener wired to its own
+ * `done` latch. Listeners auto-detach as soon as their handler fires
+ * (the `done` latch makes any later invocation a no-op anyway), so
+ * a process that goes through many register → fire cycles doesn't
+ * accumulate stale listeners on `process`.
+ *
+ * `process.on("exit", ...)` listeners cannot be async: Node fires
+ * them right before the process terminates and discards any returned
+ * promise. We still register so sync cleanups (e.g. `unlinkSync`) run
+ * on a normal `process.exit(0)` path that never reached a signal
+ * handler. Async tails on this path are best-effort. The signal-
+ * handler path *does* await async tails before exiting.
+ */
+export function registerCleanupHook(options: CleanupHookOptions): void {
+  let done = false;
+  const run = (): Promise<void> => {
+    if (done) return Promise.resolve();
+    done = true;
+    let promise: Promise<void>;
+    try {
+      const result = options.cleanup();
+      // Wrap so callers can await uniformly even when cleanup was
+      // synchronous. Catch is attached so a rejected async cleanup
+      // doesn't leave an unhandled rejection on the floor.
+      promise = Promise.resolve(result).catch(() => {
+        // best-effort: shutdown is racing other cleanup paths
+      });
+    } catch {
+      promise = Promise.resolve();
+    }
+    inFlightCleanups.add(promise);
+    void promise.finally(() => inFlightCleanups.delete(promise));
+    return promise;
+  };
+
+  const exitHandler = () => {
+    void run();
+    detach();
+  };
+  const signalHandlers = new Map<(typeof TERMINATING_SIGNALS)[number], () => void>();
+  for (const sig of TERMINATING_SIGNALS) {
+    signalHandlers.set(sig, () => {
+      // Sync cleanup body fires inside this `run()` call before the
+      // returned promise resolves; that preserves "side effect is
+      // observable right after the handler returns" for sync
+      // cleanups like `unlinkSync` (and the existing tests that
+      // assert on it).
+      void run();
+      detach();
+      if (!options.exitOnSignal) return;
+      // Capture which signal triggered shutdown so the exit code
+      // below reflects "interrupted by SIG<X>" (POSIX 128 + signo)
+      // rather than "ran to completion" (0). Parent shells /
+      // orchestrators / CI runners distinguish these: a script
+      // that runs `arkor dev || cleanup_on_failure` would otherwise
+      // mis-classify a Ctrl-C as success and skip its cleanup.
+      const exitCode = SIGNAL_EXIT_CODE[sig];
+      // Snapshot `inFlightCleanups` AFTER every other signal listener
+      // for this signal has run. Node's EventEmitter dispatches
+      // listeners synchronously in registration order, so if the
+      // exit-owning hook happens to be registered *first*, taking the
+      // snapshot here in the listener body would miss promises that
+      // sibling hooks are about to add when their listeners run a
+      // few sync steps later. `queueMicrotask` defers past the end of
+      // the current sync turn (where `process.emit` finishes
+      // dispatching all listeners), so the snapshot includes every
+      // sibling's freshly-registered promise. Without this, an
+      // `arkor dev` whose `scheduleStudioTokenCleanup` (exitOnSignal:
+      // true) was registered before `scheduleHmrCleanup` (async
+      // dispose) would `process.exit(...)` mid-`hmr.dispose()` and
+      // leak the rolldown watcher.
+      //
+      // Settled promises pass through `Promise.allSettled` in a
+      // single microtask, so a process whose hooks are all
+      // synchronous still exits effectively immediately (one extra
+      // microtask round-trip).
+      queueMicrotask(() => {
+        void Promise.allSettled(inFlightCleanups).then(() =>
+          process.exit(exitCode),
+        );
+      });
+    });
+  }
+
+  let detached = false;
+  const detach = () => {
+    if (detached) return;
+    detached = true;
+    process.off("exit", exitHandler);
+    for (const sig of TERMINATING_SIGNALS) {
+      const handler = signalHandlers.get(sig);
+      if (handler) process.off(sig, handler);
+    }
+    attachedHandlers.delete(detach);
+  };
+  attachedHandlers.add(detach);
+
+  process.on("exit", exitHandler);
+  for (const sig of TERMINATING_SIGNALS) {
+    const handler = signalHandlers.get(sig);
+    if (handler) process.on(sig, handler);
+  }
+}
+
+/**
+ * Detach every still-armed registration. Test-only escape hatch: a
+ * vitest worker reuses the same Node process across many tests, and
+ * each `registerCleanupHook` call leaves listeners attached until
+ * something fires them. Call this from `afterEach` to keep the
+ * worker's `process` listener counts flat.
+ */
+export function __resetCleanupHooksForTests(): void {
+  // `detach()` mutates `attachedHandlers` by removing the current entry.
+  // `Set` iterators safely handle that case (a deleted current item is
+  // not re-visited and remaining items keep their order), so we can
+  // iterate directly without snapshotting via `[...attachedHandlers]`.
+  for (const detach of attachedHandlers) detach();
+  attachedHandlers.clear();
+  inFlightCleanups.clear();
+}
diff --git a/packages/arkor/src/cli/commands/build.ts b/packages/arkor/src/cli/commands/build.ts
index ebfc3675..c4609039 100644
--- a/packages/arkor/src/cli/commands/build.ts
+++ b/packages/arkor/src/cli/commands/build.ts
@@ -1,7 +1,12 @@
 import { existsSync } from "node:fs";
 import { mkdir } from "node:fs/promises";
-import { isAbsolute, relative, resolve } from "node:path";
-import { build as esbuild } from "esbuild";
+import { relative } from "node:path";
+import { rolldown } from "rolldown";
+import {
+  BUILD_DEFAULTS,
+  resolveBuildEntry,
+  rolldownInputOptions,
+} from "../../core/rolldownConfig";
 import { ui } from "../prompts";
 
 export interface BuildOptions {
@@ -22,42 +27,30 @@ export interface BuildResult {
   outFile: string;
 }
 
-const DEFAULT_ENTRY = "src/arkor/index.ts";
-const DEFAULT_OUT_DIR = ".arkor/build";
-
 /**
  * Bundle the user's `src/arkor/index.ts` into a single ESM artifact at
  * `.arkor/build/index.mjs`.
  *
- * Bare specifiers (`arkor`, anything from `node_modules`) are kept external so
- * the artifact resolves the runtime SDK from the project's installed copy.
- * Relative imports are bundled inline.
+ * Bare specifiers (`arkor`, anything from `node_modules`) are kept external
+ * so the artifact resolves the runtime SDK from the project's installed
+ * copy. Relative imports are bundled inline. The transform target is
+ * derived from the running Node binary (see `resolveNodeTarget`).
  */
 export async function runBuild(opts: BuildOptions = {}): Promise<BuildResult> {
-  const cwd = opts.cwd ?? process.cwd();
-  const entryRel = opts.entry ?? DEFAULT_ENTRY;
-  const entry = isAbsolute(entryRel) ? entryRel : resolve(cwd, entryRel);
+  const { cwd, entry, outDir, outFile } = resolveBuildEntry(opts);
   if (!existsSync(entry)) {
     throw new Error(
-      `Build entry not found: ${entry}. Create ${DEFAULT_ENTRY} or pass an explicit entry argument.`,
+      `Build entry not found: ${entry}. Create ${BUILD_DEFAULTS.entry} or pass an explicit entry argument.`,
     );
   }
-
-  const outDirRel = opts.outDir ?? DEFAULT_OUT_DIR;
-  const outDir = isAbsolute(outDirRel) ? outDirRel : resolve(cwd, outDirRel);
   await mkdir(outDir, { recursive: true });
-  const outFile = resolve(outDir, "index.mjs");
 
-  await esbuild({
-    entryPoints: [entry],
-    bundle: true,
-    platform: "node",
-    format: "esm",
-    target: "node22.22",
-    outfile: outFile,
-    packages: "external",
-    logLevel: "error",
-  });
+  const bundle = await rolldown(rolldownInputOptions({ cwd, entry }));
+  try {
+    await bundle.write({ file: outFile, format: "esm" });
+  } finally {
+    await bundle.close();
+  }
 
   if (!opts.quiet) {
     ui.log.success(
diff --git a/packages/arkor/src/cli/commands/dev.test.ts b/packages/arkor/src/cli/commands/dev.test.ts
index 1104489c..c2bc1653 100644
--- a/packages/arkor/src/cli/commands/dev.test.ts
+++ b/packages/arkor/src/cli/commands/dev.test.ts
@@ -4,6 +4,7 @@ import {
   mkdtempSync,
   readFileSync,
   rmSync,
+  writeFileSync,
 } from "node:fs";
 import { tmpdir } from "node:os";
 import { join } from "node:path";
@@ -31,8 +32,33 @@ import {
   writeCredentials,
   type AnonymousCredentials,
 } from "../../core/credentials";
+import { __resetCleanupHooksForTests } from "../cleanupHooks";
 import { ensureCredentialsForStudio, runDev } from "./dev";
 
+/**
+ * Yield one `setImmediate` tick: enough for the cleanupHooks
+ * coordinator's `Promise.allSettled(...).then(() => process.exit(code))`
+ * chain to drain when there are no async cleanups in flight (the
+ * common case in this file: signal handler → queueMicrotask →
+ * already-resolved `allSettled` → `.then` → `process.exit(code)`,
+ * which all collapses into the single macrotask boundary that
+ * `setImmediate` yields to).
+ *
+ * `setImmediate` is the right primitive (vs `Promise.resolve` /
+ * `queueMicrotask`) because we need the event loop to actually
+ * turn: the `process.exit` mock fires inside a `.then` callback
+ * scheduled from a previous microtask checkpoint, and a microtask-
+ * only flush would resume *before* that callback gets to run.
+ *
+ * Tests that drive a chain with extra microtask hops (e.g. async
+ * sibling cleanups whose promises also pass through
+ * `Promise.allSettled`) await this helper twice in a row; see
+ * the cleanupHooks tests.
+ */
+function flushMicrotasks(): Promise<void> {
+  return new Promise((resolve) => setImmediate(resolve));
+}
+
 let fakeHome: string;
 const ORIG_HOME = process.env.HOME;
 // `os.homedir()` reads USERPROFILE on Windows; HOME-only redirection leaves
@@ -83,7 +109,7 @@ describe("ensureCredentialsForStudio", () => {
   });
 
   // When OAuth is advertised by the deployment, `arkor dev` no longer
-  // hands off to `runLogin` — that would block the Studio launch on a
+  // hands off to `runLogin`; that would block the Studio launch on a
   // browser flow. Instead we bootstrap anon and show a hint pointing at
   // `arkor login`, leaving the upgrade in the user's hands.
   it("bootstraps anonymous credentials even when OAuth is configured", async () => {
@@ -158,7 +184,7 @@ describe("ensureCredentialsForStudio", () => {
     });
   });
 
-  // Regression for ENG-403 — when the cloud-api is unreachable, `arkor dev`
+  // Regression for ENG-403: when the cloud-api is unreachable, `arkor dev`
   // previously failed to start because the anonymous bootstrap's network
   // error wasn't caught.
   it("does not throw when the anonymous bootstrap fails after a successful config fetch", async () => {
@@ -240,7 +266,7 @@ describe("ensureCredentialsForStudio", () => {
   // must surface at startup instead of being silently warned.
   it("re-throws when ARKOR_CLOUD_API_URL is malformed (config error)", async () => {
     process.env.ARKOR_CLOUD_API_URL = "";
-    // No fetch mock — let real fetch raise the URL parse error so we
+    // No fetch mock: let real fetch raise the URL parse error so we
     // exercise the actual undici contract, not a synthetic TypeError.
     await expect(ensureCredentialsForStudio()).rejects.toThrow(TypeError);
     await expect(ensureCredentialsForStudio()).rejects.not.toThrow(
@@ -281,7 +307,7 @@ describe("ensureCredentialsForStudio", () => {
     );
   });
 
-  // Codex P1 review on PR #65 — OAuth-only deployments advertise Auth0 in
+  // Codex P1 review on PR #65: OAuth-only deployments advertise Auth0 in
   // /v1/auth/cli/config but reject /v1/auth/anonymous. The new "always try
   // anon first" flow used to leave first-run users on those deployments
   // with a bare "Failed to acquire anonymous token (4xx)" error and no way
@@ -320,7 +346,7 @@ describe("ensureCredentialsForStudio", () => {
     expect(await readCredentials()).toBeNull();
   });
 
-  // Codex P2 review on PR #65 — the OAuth-only wrap used to span the whole
+  // Codex P2 review on PR #65: the OAuth-only wrap used to span the whole
   // anon bootstrap, so fs errors from `writeCredentials` were also rewritten
   // as "deployment may require sign-in", hiding the actionable fs cause.
   //
@@ -330,8 +356,8 @@ describe("ensureCredentialsForStudio", () => {
   // `writeFile` would raise EACCES under the bootstrap) only works on
   // POSIX as a non-root user: root bypasses chmod (Codex on PR #65), and
   // on Windows POSIX permission bits don't durably block writes inside a
-  // directory at all — Node maps `chmod` to the legacy read-only
-  // attribute, which NTFS only enforces on files. Both edges silently
+  // directory at all (Node maps `chmod` to the legacy read-only
+  // attribute, which NTFS only enforces on files). Both edges silently
   // turned the test green for the wrong reason. Mocking lifts the
   // "produce an EACCES" half of the test out of the host filesystem
   // entirely so every CI matrix entry exercises the wrap-narrowing
@@ -395,7 +421,7 @@ describe("ensureCredentialsForStudio", () => {
         );
       }
       if (url.endsWith("/v1/auth/anonymous")) {
-        // Missing `personalOrg` — anonymousTokenResponseSchema rejects.
+        // Missing `personalOrg`: anonymousTokenResponseSchema rejects.
         return new Response(
           JSON.stringify({ token: "t", anonymousId: "a", kind: "cli" }),
           { status: 200 },
@@ -413,7 +439,7 @@ describe("ensureCredentialsForStudio", () => {
   it("forwards a non-Error throwable from requestAnonymousToken (String() coercion)", async () => {
     // Defensive coverage of the `err instanceof Error ? err.message : String(err)`
     // helper inside the warn branch isn't exercised here because the
-    // helper is in the dev.ts catch — but the symmetrical path inside
+    // helper is in the dev.ts catch, but the symmetrical path inside
     // the schema-error case rethrows with the original value preserved.
     globalThis.fetch = vi.fn(async (input) => {
       const url = String(input);
@@ -449,7 +475,7 @@ describe("ensureCredentialsForStudio", () => {
         );
       }
       if (url.endsWith("/v1/auth/anonymous")) {
-        // Missing `personalOrg` — anonymousTokenResponseSchema rejects.
+        // Missing `personalOrg`: anonymousTokenResponseSchema rejects.
         return new Response(
           JSON.stringify({ token: "t", anonymousId: "a", kind: "cli" }),
           { status: 200 },
@@ -545,15 +571,6 @@ describe("ensureCredentialsForStudio", () => {
 });
 
 describe("runDev", () => {
-  // Track exit/signal listeners we add via scheduleStudioTokenCleanup so
-  // we can remove them between tests; otherwise vitest's worker would
-  // accumulate listeners and Node's MaxListenersExceededWarning would
-  // fire by the third test.
-  const ORIG_EXIT_LISTENERS = process.listeners("exit").length;
-  const ORIG_SIGINT_LISTENERS = process.listeners("SIGINT").length;
-  const ORIG_SIGTERM_LISTENERS = process.listeners("SIGTERM").length;
-  const ORIG_SIGHUP_LISTENERS = process.listeners("SIGHUP").length;
-
   beforeEach(async () => {
     vi.mocked(serve).mockClear();
     vi.mocked(open).mockClear();
@@ -570,18 +587,11 @@ describe("runDev", () => {
   });
 
   afterEach(() => {
-    // Trim the exit/signal listeners runDev installed each iteration to
-    // keep vitest's worker tidy across tests.
-    const trim = (ev: string, keep: number) => {
-      const all = process.listeners(ev as never);
-      for (let i = keep; i < all.length; i++) {
-        process.removeListener(ev as never, all[i] as never);
-      }
-    };
-    trim("exit", ORIG_EXIT_LISTENERS);
-    trim("SIGINT", ORIG_SIGINT_LISTENERS);
-    trim("SIGTERM", ORIG_SIGTERM_LISTENERS);
-    trim("SIGHUP", ORIG_SIGHUP_LISTENERS);
+    // Each `runDev()` arms exit/signal hooks via `registerCleanupHook`.
+    // Tests whose handler never fires would leak listeners across the
+    // vitest worker's queue; this detaches every still-armed
+    // registration so Node's MaxListenersExceededWarning doesn't trip.
+    __resetCleanupHooksForTests();
   });
 
   it("persists the studio token and starts the server on the requested port", async () => {
@@ -654,7 +664,7 @@ describe("runDev", () => {
     // ~/.arkor read-only after writeCredentials (so readCredentials still
     // works) so the per-launch token write hits EACCES.
     if (typeof process.getuid === "function" && process.getuid() === 0) {
-      // Root bypasses chmod permission checks — skip on root containers.
+      // Root bypasses chmod permission checks; skip on root containers.
       return;
     }
     chmodSync(join(fakeHome, ".arkor"), 0o555);
@@ -697,8 +707,163 @@ describe("runDev", () => {
       const sigintListeners = process.listeners("SIGINT");
       const handler = sigintListeners[sigintListeners.length - 1] as () => void;
       handler();
+      // Sync side effect (token unlink) lands inside the synchronous
+      // portion of the handler.
       expect(existsSync(studioTokenPath())).toBe(false);
-      expect(exitSpy).toHaveBeenCalledWith(0);
+      // Exit fires after `Promise.allSettled(inFlightCleanups)` resolves,
+      // a few microtasks later. Flush to let the queued exit run.
+      await flushMicrotasks();
+      // SIGINT exits 130 (POSIX 128 + signo for SIGINT=2): see
+      // SIGNAL_EXIT_CODE in core/signalExit.ts. Parent shells need
+      // the nonzero code to distinguish interrupt from clean exit.
+      expect(exitSpy).toHaveBeenCalledWith(130);
+    } finally {
+      exitSpy.mockRestore();
+    }
+  });
+
+  it("keeps the SIGINT exit handler armed even when persisting the studio token fails", async () => {
+    // Regression: if `persistStudioToken` threw, the previous code
+    // skipped `scheduleStudioTokenCleanup`, and that was the *only*
+    // hook that called `process.exit(0)` on SIGINT. The leftover HMR
+    // hook overrides Node's default "exit on SIGINT" behaviour, so the
+    // dev server would idle in the foreground forever. The fix
+    // registers the token cleanup unconditionally; here we make
+    // persist throw and verify SIGINT still terminates.
+    if (typeof process.getuid === "function" && process.getuid() === 0) {
+      // Root bypasses chmod permission checks; skip on root containers.
+      return;
+    }
+    chmodSync(join(fakeHome, ".arkor"), 0o555);
+    const stdoutSpy = vi
+      .spyOn(process.stdout, "write")
+      .mockImplementation((() => true) as typeof process.stdout.write);
+    try {
+      await runDev({ port: 4206 });
+    } finally {
+      stdoutSpy.mockRestore();
+      chmodSync(join(fakeHome, ".arkor"), 0o755);
+    }
+
+    const exitSpy = vi
+      .spyOn(process, "exit")
+      .mockImplementation(((_code?: number) => {
+        return undefined as never;
+      }) as typeof process.exit);
+    try {
+      const sigintListeners = process.listeners("SIGINT");
+      const handler = sigintListeners[sigintListeners.length - 1] as () => void;
+      handler();
+      // Even though the token file was never written, the cleanup hook
+      // ran (best-effort `unlinkSync` swallows ENOENT) and the
+      // exit-on-signal arm fired (after async cleanup tails settle).
+      await flushMicrotasks();
+      // SIGINT exits 130 (POSIX 128 + signo for SIGINT=2): see
+      // SIGNAL_EXIT_CODE in core/signalExit.ts. Parent shells need
+      // the nonzero code to distinguish interrupt from clean exit.
+      expect(exitSpy).toHaveBeenCalledWith(130);
+    } finally {
+      exitSpy.mockRestore();
+    }
+  });
+
+  it("does NOT unlink a pre-existing token file when this process failed to persist its own token (concurrent arkor dev safety)", async () => {
+    // Regression: a failed-persist `arkor dev` used to unconditionally
+    // `unlinkSync(studioTokenPath())` on shutdown. If a concurrent
+    // `arkor dev` (different port, same user) had already persisted a
+    // valid token to the shared path, this run's cleanup would wipe
+    // it out from under them, breaking that session's Vite SPA dev
+    // workflow with mystery 403s on /api/*. The fix gates the unlink
+    // on `tokenPersisted` so a failed-persist run is a no-op at
+    // shutdown.
+    if (typeof process.getuid === "function" && process.getuid() === 0) {
+      // Root bypasses chmod permission checks; skip on root containers.
+      return;
+    }
+    // Pre-place a "concurrent" token (the other dev session's). Body
+    // content lets us assert byte-equality after cleanup, not just
+    // file existence, to rule out an unlink+recreate cycle.
+    const path = studioTokenPath();
+    writeFileSync(path, "concurrent-token-value", { mode: 0o600 });
+    // Make the FILE unwritable so persistStudioToken's `writeFile`
+    // throws EACCES, but leave the *directory* writable so unlinkSync
+    // (which requires dir-write, not file-write perms) would happily
+    // delete the file if the cleanup hook weren't gated.
+    chmodSync(path, 0o444);
+
+    const stdoutSpy = vi
+      .spyOn(process.stdout, "write")
+      .mockImplementation((() => true) as typeof process.stdout.write);
+    try {
+      await expect(runDev({ port: 4207 })).resolves.toBeUndefined();
+    } finally {
+      stdoutSpy.mockRestore();
+    }
+
+    const exitSpy = vi
+      .spyOn(process, "exit")
+      .mockImplementation(((_code?: number) => {
+        return undefined as never;
+      }) as typeof process.exit);
+    try {
+      // Restore owner-write perms (0o444 already allowed the `readFileSync` below).
+      chmodSync(path, 0o644);
+      const sigintListeners = process.listeners("SIGINT");
+      const handler = sigintListeners[sigintListeners.length - 1] as () => void;
+      handler();
+      await flushMicrotasks();
+      // The pre-existing token is still on disk AND unchanged: this
+      // failed-persist run did not wipe it.
+      expect(existsSync(path)).toBe(true);
+      expect(readFileSync(path, "utf8")).toBe("concurrent-token-value");
+    } finally {
+      exitSpy.mockRestore();
+    }
+  });
+
+  it("does NOT unlink the studio-token when a concurrent arkor dev has overwritten it after our successful persist (token-identity check)", async () => {
+    // Regression: even when this process SUCCESSFULLY persisted the
+    // token, the cleanup hook used to `unlinkSync` unconditionally on
+    // shutdown. If a second `arkor dev` launched in the same `$HOME`
+    // overwrote `~/.arkor/studio-token` with ITS token AFTER our
+    // persist, our cleanup would still wipe the file: the second
+    // session's Vite SPA dev workflow would then see mystery 403s on
+    // /api/* because the meta tag the SPA reads no longer matches
+    // any in-memory token. The fix re-reads the file at exit time
+    // and only unlinks when the bytes match what we wrote.
+    const stdoutSpy = vi
+      .spyOn(process.stdout, "write")
+      .mockImplementation((() => true) as typeof process.stdout.write);
+    try {
+      await runDev({ port: 4208 });
+    } finally {
+      stdoutSpy.mockRestore();
+    }
+    // Sanity: our persist landed.
+    const path = studioTokenPath();
+    expect(existsSync(path)).toBe(true);
+    const ourToken = readFileSync(path, "utf8").trim();
+    expect(ourToken).toMatch(/^[A-Za-z0-9_-]+$/);
+    // Simulate the concurrent overwrite: a second `arkor dev` wrote
+    // its own token to the same shared path while we were running.
+    const concurrentToken = "concurrent-dev-token-XYZ";
+    writeFileSync(path, concurrentToken, { mode: 0o600 });
+
+    const exitSpy = vi
+      .spyOn(process, "exit")
+      .mockImplementation(((_code?: number) => {
+        return undefined as never;
+      }) as typeof process.exit);
+    try {
+      const sigintListeners = process.listeners("SIGINT");
+      const handler = sigintListeners[sigintListeners.length - 1] as () => void;
+      handler();
+      await flushMicrotasks();
+      // Under the bug the file would be gone. With the fix the
+      // concurrent token is still in place AND unchanged so the
+      // sibling `arkor dev` keeps working.
+      expect(existsSync(path)).toBe(true);
+      expect(readFileSync(path, "utf8")).toBe(concurrentToken);
     } finally {
       exitSpy.mockRestore();
     }
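The 130 asserted in the SIGINT test follows the shell convention of reporting a signal-triggered exit as 128 plus the signal number. A minimal sketch of that mapping, assuming a hypothetical `signalExitCode` helper (the real `SIGNAL_EXIT_CODE` table lives in `cleanupHooks.ts`, which this diff does not show, so its actual shape may differ):

```typescript
import { constants } from "node:os";

type ExitSignal = "SIGINT" | "SIGTERM" | "SIGHUP";

// Hypothetical helper (an assumption for illustration, not the project's code):
// derive the conventional exit status from the platform's signal number.
function signalExitCode(signal: ExitSignal): number {
  return 128 + constants.signals[signal];
}

// Collect the codes for the three signals the dev command listens on.
const exitCodes: Record<ExitSignal, number> = {
  SIGINT: signalExitCode("SIGINT"),
  SIGTERM: signalExitCode("SIGTERM"),
  SIGHUP: signalExitCode("SIGHUP"),
};

console.log(exitCodes);
```

A parent shell can then tell an interrupted `arkor dev` from a clean exit by checking for a nonzero status.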
diff --git a/packages/arkor/src/cli/commands/dev.ts b/packages/arkor/src/cli/commands/dev.ts
index e2bf01cf..ac24eb4b 100644
--- a/packages/arkor/src/cli/commands/dev.ts
+++ b/packages/arkor/src/cli/commands/dev.ts
@@ -1,5 +1,5 @@
-import { randomBytes } from "node:crypto";
-import { unlinkSync } from "node:fs";
+import { randomBytes, timingSafeEqual } from "node:crypto";
+import { readFileSync, unlinkSync } from "node:fs";
 import { chmod, mkdir, writeFile } from "node:fs/promises";
 import { dirname } from "node:path";
 import { serve } from "@hono/node-server";
@@ -16,7 +16,9 @@ import {
   type AnonymousCredentials,
 } from "../../core/credentials";
 import { buildStudioApp } from "../../studio/server";
+import { createHmrCoordinator } from "../../studio/hmr";
 import { ANON_PERSISTENCE_NUDGE } from "../anonymous";
+import { registerCleanupHook } from "../cleanupHooks";
 import { ui } from "../prompts";
 
 export interface DevOptions {
@@ -116,7 +118,7 @@ export async function ensureCredentialsForStudio(): Promise<void> {
     // wrap fires only for genuine deployment rejection (401/403/404 et
     // al). 5xx is a transient cloud-api failure where retrying makes
     // sense, ZodErrors signal a malformed response (server bug), and fs
-    // failures are out of scope for the anon endpoint entirely — none of
+    // failures are out of scope for the anon endpoint entirely; none of
     // these should be mislabelled as a sign-in requirement.
     if (
       err instanceof AnonymousTokenRejectedError &&
@@ -124,7 +126,7 @@ export async function ensureCredentialsForStudio(): Promise<void> {
       err.status < 500 &&
       oauthAvailable
     ) {
-      // Surface only the status code at the top level — the inner
+      // Surface only the status code at the top level: the inner
       // `err.message` already starts with "Failed to acquire…" and
       // includes the response-body snippet, which would double-prefix the
       // wrap and risk leaking noisy HTML/JSON error pages. The full
@@ -170,24 +172,81 @@ async function persistStudioToken(token: string): Promise<string> {
   return path;
 }
 
-function scheduleStudioTokenCleanup(path: string): void {
-  let cleaned = false;
-  const cleanup = () => {
-    if (cleaned) return;
-    cleaned = true;
-    try {
-      unlinkSync(path);
-    } catch {
-      // best-effort
-    }
-  };
-  process.on("exit", cleanup);
-  for (const sig of ["SIGINT", "SIGTERM", "SIGHUP"] as const) {
-    process.on(sig, () => {
-      cleanup();
-      process.exit(0);
-    });
-  }
+/**
+ * Constant-time string comparison for the token-identity check below.
+ * The "is this my token?" gate is not strictly a security-sensitive
+ * comparison (both sides are owned by the user on the local FS), but
+ * the SDK already uses `timingSafeEqual` for every other studio-token
+ * comparison (`buildStudioApp`), and keeping the same primitive here
+ * costs nothing while making the policy "tokens are always compared
+ * constant-time" uniform across the codebase.
+ */
+function tokensEqual(a: string, b: string): boolean {
+  const aBuf = Buffer.from(a);
+  const bBuf = Buffer.from(b);
+  if (aBuf.length !== bBuf.length) return false;
+  return timingSafeEqual(aBuf, bBuf);
+}
+
+function scheduleStudioTokenCleanup(
+  path: string,
+  // Read at cleanup time so a `persistStudioToken` call that's still
+  // in flight when the user hits Ctrl-C (or one that resolved
+  // successfully *after* this scheduler ran) has its outcome
+  // respected. A plain boolean parameter would be captured at hook
+  // registration time, well before persist resolves.
+  shouldUnlink: () => boolean,
+  // Token THIS process wrote. Compared against the file's current
+  // contents at unlink time so we never delete a token a concurrent
+  // `arkor dev` overwrote in the shared path. See cleanup body for
+  // the full rationale.
+  expectedToken: string,
+): void {
+  registerCleanupHook({
+    cleanup: () => {
+      // Skip the unlink entirely if THIS process never persisted the
+      // file. Without this gate, a failed-persist `arkor dev` would
+      // happily `unlinkSync` on shutdown, and if a concurrent
+      // `arkor dev` process (different port, same user) had persisted
+      // a valid token to the same shared path, our cleanup would
+      // wipe it out from under them, breaking that session's Vite
+      // SPA dev workflow with mystery 403s on /api/*.
+      if (!shouldUnlink()) return;
+      // Token-identity check: even when this process DID persist a
+      // token, another `arkor dev` launched in the same `$HOME` may
+      // have overwritten the shared `~/.arkor/studio-token` path
+      // BEFORE our shutdown. Unlinking unconditionally would then
+      // delete THEIR valid token, breaking their Vite SPA dev
+      // workflow. Re-read at exit time and only unlink when the
+      // bytes still match what we wrote so the cleanup is a no-op
+      // for foreign tokens. Read failure (ENOENT, EACCES, etc.) means
+      // the file is gone or unreadable; either way we cannot confirm
+      // it is ours, so skip the unlink.
+      let current: string;
+      try {
+        current = readFileSync(path, "utf8").trim();
+      } catch {
+        return;
+      }
+      if (!tokensEqual(current, expectedToken)) return;
+      try {
+        unlinkSync(path);
+      } catch {
+        // best-effort
+      }
+    },
+    // Outermost cleanup: responsible for terminating the process after
+    // all earlier-registered hooks (e.g. HMR dispose) have run.
+    exitOnSignal: true,
+  });
+}
+
+function scheduleHmrCleanup(hmr: { dispose: () => Promise<void> }): void {
+  // Registered before the studio-token cleanup so it runs first on
+  // shutdown: Node fires signal handlers in registration order, and we
+  // want the watcher to release file handles before the outermost
+  // process.exit.
+  registerCleanupHook({ cleanup: () => hmr.dispose() });
 }
 
 export async function runDev(options: DevOptions = {}): Promise<void> {
@@ -199,16 +258,59 @@ export async function runDev(options: DevOptions = {}): Promise<void> {
   // hitting `arkor start` (and therefore RCE via dynamic import).
   const studioToken = randomBytes(32).toString("base64url");
 
+  // HMR coordinator: a long-lived rolldown watcher over the user's
+  // `src/arkor` graph. The coordinator itself is lazy (`subscribe()`
+  // is what starts the watcher, not `createHmrCoordinator`), but
+  // `buildStudioApp` registers its per-rebuild signal-dispatch
+  // subscriber unconditionally: that subscriber needs to run on
+  // every BUNDLE_END regardless of whether any SSE client is
+  // connected, so it can SIGUSR2/SIGTERM active `/api/train`
+  // children and keep `lastSuccessConfigHash` warm for spawn-time
+  // capture. Net effect: the watcher starts at server boot. An
+  // `arkor dev` launched in an unbuilt project doesn't fail immediately
+  // because `startWatcher` falls through to a poll loop that waits
+  // for the entry file to appear (see `hmr.ts:entryWaitTimer`).
+  //
+  // Registered before the studio-token cleanup so the latter remains
+  // the most-recently-attached signal listener (existing tests rely
+  // on this ordering to find the token-removal handler).
+  const hmr = createHmrCoordinator({ cwd: process.cwd() });
+  scheduleHmrCleanup(hmr);
+
+  // Register the studio-token cleanup *unconditionally* up-front. The hook
+  // is the only one that calls `process.exit` on SIGINT/SIGTERM/SIGHUP
+  // (the HMR hook above only disposes), and `registerCleanupHook` overrides
+  // Node's default "exit on signal" behaviour for any signal it listens
+  // on. If we were to gate registration behind a successful
+  // `persistStudioToken` and the persist threw, Ctrl-C would run the HMR
+  // dispose and then leave the server idle in the foreground: no exit
+  // ever fires.
+  //
+  // The cleanup body itself, however, gates `unlinkSync` on TWO checks:
+  //   - `tokenPersisted` (set only after `persistStudioToken` resolves)
+  //     so a failed-persist run never touches the shared file.
+  //   - token-identity match (re-read the file at exit time, compare
+  //     against the bytes WE wrote) so a successful-persist run that
+  //     was later overwritten by a concurrent `arkor dev` in the same
+  //     `$HOME` still leaves THAT instance's token in place. Without
+  //     this second check, the later instance would see mystery 403s
+  //     on /api/* because we'd have wiped its valid token.
+  // Net effect of the three protections: the hook is always registered
+  // (so signals still exit), and it only deletes a file we wrote AND still own.
+  const tokenPath = studioTokenPath();
+  let tokenPersisted = false;
+  scheduleStudioTokenCleanup(tokenPath, () => tokenPersisted, studioToken);
+
   // Persisting the token to disk is *only* needed for the Vite SPA dev
   // workflow. The bundled `:port` flow injects the meta tag at request time
   // via `buildStudioApp`, so a failure here (read-only $HOME on Docker /
   // locked-down CI / restrictive umask) must not block the server.
   try {
-    const tokenPath = await persistStudioToken(studioToken);
-    scheduleStudioTokenCleanup(tokenPath);
+    await persistStudioToken(studioToken);
+    tokenPersisted = true;
   } catch (err) {
     ui.log.warn(
-      `Could not write ${studioTokenPath()} (${
+      `Could not write ${tokenPath} (${
         err instanceof Error ? err.message : String(err)
       }). The Studio at http://localhost:${port} is unaffected, but the Vite SPA dev workflow will see 403s on /api/*.`,
     );
@@ -217,9 +319,9 @@ export async function runDev(options: DevOptions = {}): Promise<void> {
   // `autoAnonymous: true` (the default) lets the Hono server retry the
   // anonymous bootstrap on first `/api/credentials` hit if the up-front
   // attempt above failed (e.g. cloud-api was unreachable at launch).
-  const app = buildStudioApp({ studioToken });
+  const app = buildStudioApp({ studioToken, hmr });
   // Bind to 127.0.0.1 (not "localhost") so the listener can't end up on `::1`
-  // only — `@hono/node-server` passes hostname to `net.Server.listen`, which
+  // only; `@hono/node-server` passes hostname to `net.Server.listen`, which
   // calls `dns.lookup`. On hosts where `/etc/hosts` orders `::1 localhost`
   // before `127.0.0.1 localhost`, a "localhost" bind would refuse IPv4
   // connections, breaking the studio-app Vite proxy (hardcoded to
@@ -229,6 +331,13 @@ export async function runDev(options: DevOptions = {}): Promise<void> {
   const url = `http://localhost:${port}`;
   serve({ fetch: app.fetch, port, hostname: "127.0.0.1" });
   process.stdout.write(`Arkor Studio running on ${url}\n`);
+  // "ready (will watch …)" rather than "enabled (watching …)" because
+  // `createHmrCoordinator` is lazy: the rolldown watcher doesn't
+  // actually start until the first `subscribe()` call inside
+  // `buildStudioApp`, and on a fresh scaffold with no
+  // `src/arkor/index.ts` yet the watcher falls into the
+  // entry-wait poll loop rather than actively watching.
+  process.stdout.write(`HMR ready (will watch src/arkor)\n`);
   if (options.open) {
     try {
       await open(url);
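The two gates on the token unlink (a lazily-read "did we persist?" flag, then a byte-for-byte identity check) can be exercised in isolation. The sketch below is an illustration only: `unlinkIfStillOurs` and `sameToken` are hypothetical names, and the temp-directory path stands in for `~/.arkor/studio-token`.

```typescript
import { timingSafeEqual } from "node:crypto";
import { existsSync, mkdtempSync, readFileSync, unlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// timingSafeEqual throws on unequal lengths, so compare lengths first.
function sameToken(a: string, b: string): boolean {
  const aBuf = Buffer.from(a);
  const bBuf = Buffer.from(b);
  return aBuf.length === bBuf.length && timingSafeEqual(aBuf, bBuf);
}

// Hypothetical helper: returns true only when the file was actually removed.
function unlinkIfStillOurs(
  path: string,
  shouldUnlink: () => boolean, // read at cleanup time, not at registration time
  expectedToken: string,
): boolean {
  if (!shouldUnlink()) return false; // gate 1: we never persisted
  let current: string;
  try {
    current = readFileSync(path, "utf8").trim();
  } catch {
    return false; // missing or unreadable: cannot confirm ownership
  }
  if (!sameToken(current, expectedToken)) return false; // gate 2: foreign token
  try {
    unlinkSync(path);
    return true;
  } catch {
    return false; // best-effort
  }
}

// Demo in a throwaway directory (stand-in for the shared token path).
const demoPath = join(mkdtempSync(join(tmpdir(), "token-gate-sketch-")), "studio-token");
let persisted = false;

writeFileSync(demoPath, "our-token");
const skippedBeforePersist = unlinkIfStillOurs(demoPath, () => persisted, "our-token");

persisted = true;
writeFileSync(demoPath, "someone-elses-token");
const skippedForeignToken = unlinkIfStillOurs(demoPath, () => persisted, "our-token");

writeFileSync(demoPath, "our-token");
const removedOwnToken = unlinkIfStillOurs(demoPath, () => persisted, "our-token");
```

Passing a getter rather than a boolean is what lets the flag flip after the hook is registered and still be honoured at shutdown.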
diff --git a/packages/arkor/src/cli/commands/start.test.ts b/packages/arkor/src/cli/commands/start.test.ts
index 8209818b..a08d70f4 100644
--- a/packages/arkor/src/cli/commands/start.test.ts
+++ b/packages/arkor/src/cli/commands/start.test.ts
@@ -78,7 +78,7 @@ describe("runStart", () => {
   it("skips the build step when the artifact already exists and no entry override is given", async () => {
     // Branch coverage for `Boolean(opts.entry) || !existsSync(outFile)` —
     // the path where both halves are false. Pre-build the artifact, then
-    // confirm runStart imports it without triggering esbuild again.
+    // confirm runStart imports it without triggering rolldown again.
     mkdirSync(join(cwd, "src/arkor"), { recursive: true });
     writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
     // First call builds normally.
diff --git a/packages/arkor/src/core/configHash.test.ts b/packages/arkor/src/core/configHash.test.ts
new file mode 100644
index 00000000..ec681124
--- /dev/null
+++ b/packages/arkor/src/core/configHash.test.ts
@@ -0,0 +1,213 @@
+import { describe, it, expect } from "vitest";
+import { hashJobConfig } from "./configHash";
+import type { JobConfig } from "./types";
+
+describe("hashJobConfig", () => {
+  it("returns the same hash for key-order-equivalent configs", () => {
+    const a: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+      maxSteps: 10,
+      learningRate: 1e-4,
+    };
+    const b: JobConfig = {
+      learningRate: 1e-4,
+      maxSteps: 10,
+      datasetSource: { name: "x", type: "huggingface" },
+      model: "m",
+    } as JobConfig;
+    expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+  });
+
+  it("returns different hashes for materially different configs", () => {
+    const base: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+    };
+    expect(hashJobConfig(base)).not.toBe(
+      hashJobConfig({ ...base, model: "m2" }),
+    );
+    expect(hashJobConfig(base)).not.toBe(
+      hashJobConfig({
+        ...base,
+        datasetSource: { type: "huggingface", name: "y" },
+      }),
+    );
+  });
+
+  it("is order-stable for nested arrays (dataset format / split)", () => {
+    const a: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+      datasetFormat: ["a", "b", "c"],
+    };
+    const b: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+      datasetFormat: ["a", "b", "c"],
+    };
+    expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+  });
+
+  it("treats `undefined` object properties identically to omitted ones (JSON parity)", () => {
+    // Regression: the previous `stableStringify` delegated to
+    // `JSON.stringify(undefined)` which returns `undefined` (not a
+    // string), concatenated via template literal that became the
+    // substring `"undefined"` in the hash input. So `{ a: 1 }` and
+    // `{ a: 1, b: undefined }` produced different hashes even though
+    // they're indistinguishable on the wire (`JSON.stringify` drops
+    // `undefined` properties).
+    const omitted: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+    };
+    const explicitlyUndefined: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+      // `unknown`-typed forwarder fields can legitimately end up
+      // holding `undefined` if a caller spreads from a partial source.
+      warmupSteps: undefined,
+      datasetFormat: undefined,
+    };
+    expect(hashJobConfig(omitted)).toBe(hashJobConfig(explicitlyUndefined));
+  });
+
+  it("normalises `undefined` array slots to null (JSON parity)", () => {
+    // `JSON.stringify([undefined])` → `"[null]"`. The previous
+    // implementation produced the literal substring `"[undefined]"`
+    // instead, which is not even valid JSON.
+    const a: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+      datasetFormat: ["a", undefined, "c"] as unknown,
+    };
+    const b: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+      datasetFormat: ["a", null, "c"] as unknown,
+    };
+    expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+  });
+
+  it("honors `toJSON()` like JSON.stringify (Date, etc.)", () => {
+    // Regression: `JSON.stringify({ d: new Date(0) })` serialises
+    // `d` as `"1970-01-01T00:00:00.000Z"`, but a naive recursive
+    // walker would serialise the Date as `{}` (no enumerable own
+    // keys). A `JobConfig` whose `unknown`-typed forwarder field
+    // ever holds a Date (or any object with `toJSON`) would then
+    // produce a hash that disagrees with the wire-format payload,
+    // causing spurious "configHash changed" → SIGTERM restarts.
+    const date = new Date("2024-01-01T00:00:00.000Z");
+    const a: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+      warmupSteps: date as unknown,
+    };
+    const b: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+      warmupSteps: "2024-01-01T00:00:00.000Z" as unknown,
+    };
+    expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+  });
+
+  it("threads the property key through to user-defined `toJSON(key)` (JSON parity)", () => {
+    // Regression: `JSON.stringify` calls `value.toJSON(key)` with
+    // the hosting property name (or array index as string), so a
+    // `toJSON` that branches on the key produces different output
+    // depending on where the value lives in the tree. The previous
+    // `stableStringify` called `toJSON()` without the key argument,
+    // so the hash diverged from the wire-format payload for any
+    // user object whose serialiser depends on context.
+    //
+    // The fixture's `toJSON(key)` returns `"key=<key>"`. Compare
+    // against an explicit string field holding what JSON.stringify
+    // would produce; matching hashes prove the key reached toJSON.
+    const ctx = {
+      toJSON(key: string) {
+        return `key=${key}`;
+      },
+    };
+    const a: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+      warmupSteps: ctx as unknown,
+    };
+    const b: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+      warmupSteps: "key=warmupSteps" as unknown,
+    };
+    expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+  });
+
+  it("omits an object property whose `toJSON(key)` returns undefined (JSON parity)", () => {
+    // Regression: `JSON.stringify({ a: { toJSON: () => undefined } })`
+    // produces `"{}"`: `toJSON` returning `undefined` is the spec's
+    // "skip me" signal in object position. The previous
+    // `stableStringify` collapsed every non-representable value to
+    // the literal string `"null"` at recursion time, so the same
+    // input hashed as `{"a":null}` instead of `{}`. That divergence
+    // forced unnecessary SIGTERM restarts whenever a `JobConfig`
+    // field's serialiser opted out: `configHash` would diverge from
+    // the wire-format payload (which DOES omit the field).
+    const omitting = {
+      toJSON() {
+        return undefined;
+      },
+    };
+    const a: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+      warmupSteps: omitting as unknown,
+    };
+    const b: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+    };
+    expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+  });
+
+  it("substitutes `null` for an array element whose `toJSON(idx)` returns undefined (JSON parity)", () => {
+    // Sibling contract: in array position, `JSON.stringify` writes
+    // `null` for a `toJSON()→undefined` element (it can't drop the
+    // slot without shifting indices). The `stableStringify` boundary
+    // for arrays maps the omit sentinel to `"null"`.
+    const omitting = {
+      toJSON() {
+        return undefined;
+      },
+    };
+    const a: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+      datasetFormat: ["a", omitting, "c"] as unknown,
+    };
+    const b: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+      datasetFormat: ["a", null, "c"] as unknown,
+    };
+    expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+  });
+
+  it("ignores function / symbol properties (JSON parity)", () => {
+    // `JSON.stringify` drops these too. The hash should be insensitive
+    // to "transparent" callbacks accidentally landing in a forwarded
+    // config (the SDK separates `callbacks` out, but `unknown` fields
+    // could leak one).
+    const fn = () => 0;
+    const sym = Symbol("foo");
+    const a: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+    };
+    const b: JobConfig = {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+      warmupSteps: fn as unknown,
+      loggingSteps: sym as unknown,
+    };
+    expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+  });
+});
diff --git a/packages/arkor/src/core/configHash.ts b/packages/arkor/src/core/configHash.ts
new file mode 100644
index 00000000..2e407094
--- /dev/null
+++ b/packages/arkor/src/core/configHash.ts
@@ -0,0 +1,91 @@
+import { createHash } from "node:crypto";
+import type { JobConfig } from "./types";
+
+/**
+ * Deterministic JSON serialiser: keys sorted at every nesting level so
+ * `{a:1, b:2}` and `{b:2, a:1}` produce the same string. Necessary because
+ * `JSON.stringify` follows insertion order, which isn't stable across
+ * `buildJobConfig` revisions or user-side spread-merge tricks.
+ *
+ * Returns `string | undefined`. `undefined` is the "omit me from my
+ * containing object" sentinel: it propagates from any value
+ * `JSON.stringify` would silently drop in object position
+ * (`undefined`, functions, symbols, *and* objects whose `toJSON(key)`
+ * returns one of those). Callers sit at three boundaries:
+ *
+ *   - Top level: `hashJobConfig` collapses `undefined` to `"null"`
+ *     so the digest input is always a string.
+ *   - Array slots: the map below substitutes `"null"` (matches
+ *     `JSON.stringify([undefined]) === "[null]"`).
+ *   - Object slots: the loop filters the key out entirely (matches
+ *     `JSON.stringify({a: undefined}) === "{}"`).
+ *
+ * The previous implementation collapsed every non-representable value
+ * to the literal string `"null"` at recursion time, which leaked into
+ * object slots as `{"a":null}` instead of the JSON-correct `{}`,
+ * making `configHash` diverge from the wire-format payload for
+ * `JobConfig` fields whose `toJSON(key)` happened to return
+ * `undefined` (the spec-defined "skip me" signal). That divergence
+ * forces unnecessary SIGTERM restarts on every rebuild.
+ */
+function stableStringify(value: unknown, key: string = ""): string | undefined {
+  if (value === null) return "null";
+  // Non-representable values: omit (undefined return) so each caller's
+  // boundary handler chooses the right substitution per its position.
+  if (value === undefined || typeof value === "function" || typeof value === "symbol") {
+    return undefined;
+  }
+  if (typeof value !== "object") return JSON.stringify(value);
+  // `JSON.stringify` calls `value.toJSON(key)` first when present
+  // (passing `""` at the top level, the property name in object
+  // positions, the index-as-string in array positions), then
+  // serialises the return value. Canonical example: `Date` → ISO
+  // string. The `key` argument is threaded through recursion so
+  // user-side `toJSON(key)` implementations that branch on the
+  // hosting property/index see the same value JSON.stringify would.
+  // If `toJSON` returns `undefined`, that propagates as the omit
+  // sentinel: the spec-defined "skip me" path.
+  const maybeToJSON = (value as { toJSON?: unknown }).toJSON;
+  if (typeof maybeToJSON === "function") {
+    return stableStringify(
+      (maybeToJSON as (key: string) => unknown).call(value, key),
+      key,
+    );
+  }
+  if (Array.isArray(value)) {
+    // Array slots: non-representable → "null" (matches JSON spec).
+    // Index-as-string keys mirror `JSON.stringify`'s behaviour for
+    // array elements (per the ECMAScript spec, `SerializeJSONArray`
+    // calls `SerializeJSONProperty` with the index converted to a
+    // string).
+    const items = value.map((v, i) => stableStringify(v, String(i)) ?? "null");
+    return `[${items.join(",")}]`;
+  }
+  // Object slots: skip keys whose serialised value is `undefined`
+  // (matches `JSON.stringify({a: undefined}) === "{}"`). Property
+  // names are passed as the recursion key so a nested `toJSON(key)`
+  // sees the hosting field name.
+  const obj = value as Record<string, unknown>;
+  const parts: string[] = [];
+  for (const k of Object.keys(obj).sort()) {
+    const serialised = stableStringify(obj[k], k);
+    if (serialised === undefined) continue;
+    parts.push(`${JSON.stringify(k)}:${serialised}`);
+  }
+  return `{${parts.join(",")}}`;
+}
+
+/**
+ * Stable fingerprint of a `JobConfig`. Used by HMR to decide whether a
+ * rebuild changed only the in-process callbacks (configHash unchanged →
+ * hot-swap) or the cloud-side training config (configHash changed →
+ * full restart with `requestEarlyStop`).
+ */
+export function hashJobConfig(config: JobConfig): string {
+  // Top-level fallback to `"null"` so a pathological config that
+  // serialises to `undefined` (top-level `toJSON` returning
+  // undefined, etc.) still produces a deterministic digest input
+  // rather than making `hash.update(undefined)` throw.
+  const serialised = stableStringify(config) ?? "null";
+  return createHash("sha256").update(serialised).digest("hex").slice(0, 16);
+}
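The `JSON.stringify` behaviours that `stableStringify` is written to mirror can be checked directly against the built-in. This is a small standalone sketch using only the standard `JSON` API; the variable names are illustrative and nothing here calls the project's own serialiser.

```typescript
// Object position: undefined, functions and symbols are dropped entirely.
const droppedInObject = JSON.stringify({ a: 1, b: undefined, c: () => 0, d: Symbol("s") });

// Array position: the same values become null so indices do not shift.
const nullInArray = JSON.stringify(["a", undefined, () => 0]);

// toJSON receives the hosting property name, or the array index as a string.
const keyProbe = {
  toJSON(key: string): string {
    return `key=${key}`;
  },
};
const keyedOutput = JSON.stringify({ field: keyProbe, list: [keyProbe] });

// toJSON returning undefined: omitted in objects, null in arrays.
const skipping = { toJSON: () => undefined };
const skippedOutput = JSON.stringify({ a: skipping, b: [skipping] });

// Insertion order leaks into the output, which is why keys must be sorted
// before hashing.
const orderSensitive = JSON.stringify({ a: 1, b: 2 }) !== JSON.stringify({ b: 2, a: 1 });

console.log({ droppedInObject, nullInArray, keyedOutput, skippedOutput, orderSensitive });
```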
diff --git a/packages/arkor/src/core/moduleCacheBust.test.ts b/packages/arkor/src/core/moduleCacheBust.test.ts
new file mode 100644
index 00000000..40b8509a
--- /dev/null
+++ b/packages/arkor/src/core/moduleCacheBust.test.ts
@@ -0,0 +1,66 @@
+import { describe, it, expect, beforeEach, afterEach } from "vitest";
+import { mkdtempSync, rmSync, writeFileSync } from "node:fs";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+import { pathToFileURL } from "node:url";
+import {
+  moduleCacheBustKey,
+  moduleCacheBustUrl,
+} from "./moduleCacheBust";
+
+let dir: string;
+
+beforeEach(() => {
+  dir = mkdtempSync(join(tmpdir(), "arkor-cachebust-test-"));
+});
+
+afterEach(() => {
+  rmSync(dir, { recursive: true, force: true });
+});
+
+describe("moduleCacheBustKey", () => {
+  it("is stable across calls when the file hasn't changed", () => {
+    // Regression: Node's ESM loader never evicts module records, and
+    // a `Date.now()` cache-bust would produce a fresh URL on every
+    // call → unbounded leak across long `arkor dev` sessions
+    // (5 s `/api/manifest` polls + every save firing SIGUSR2).
+    // mtime+ctime+size keying must collapse repeat reads of unchanged
+    // bytes onto the same key so the loader serves from cache.
+    const file = join(dir, "stable.mjs");
+    writeFileSync(file, "export const v = 1;");
+    const k1 = moduleCacheBustKey(file);
+    const k2 = moduleCacheBustKey(file);
+    expect(k1).toBe(k2);
+    // mtimeMs-ctimeMs-size; mtimeMs/ctimeMs may carry sub-ms precision
+    // (no `toFixed(0)`) so digits include an optional fractional part.
+    expect(k1).toMatch(/^[\d.]+-[\d.]+-\d+$/);
+  });
+
+  it("changes when the file content changes (different size)", () => {
+    const file = join(dir, "growing.mjs");
+    writeFileSync(file, "v1");
+    const before = moduleCacheBustKey(file);
+    writeFileSync(file, "version-two");
+    const after = moduleCacheBustKey(file);
+    expect(after).not.toBe(before);
+  });
+
+  it("returns a stable fallback (\"0-0-0\") for missing files instead of throwing", () => {
+    // The eventual `await import(url)` will throw on a missing
+    // file; the helper itself should produce a value rather than
+    // bubbling the stat error and turning every consumer into a
+    // try/catch site. Three zeros (one each for mtimeMs, ctimeMs,
+    // size) to keep the shape uniform with the success branch.
+    expect(moduleCacheBustKey(join(dir, "does-not-exist.mjs"))).toBe("0-0-0");
+  });
+});
+
+describe("moduleCacheBustUrl", () => {
+  it("returns a fully-qualified file URL with the cache-bust query attached", () => {
+    const file = join(dir, "u.mjs");
+    writeFileSync(file, "export const x = 1;");
+    const url = moduleCacheBustUrl(file);
+    expect(url.startsWith(pathToFileURL(file).href + "?t=")).toBe(true);
+    expect(url).toMatch(/\?t=[\d.]+-[\d.]+-\d+$/);
+  });
+});
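The key shape these tests assert (`mtimeMs-ctimeMs-size`, with `"0-0-0"` for a missing file) can be sketched as follows. This is an assumption-based illustration rather than the module's actual body: `sketchCacheBustKey` and `sketchCacheBustUrl` are hypothetical names chosen to avoid implying they match `moduleCacheBust.ts` line for line.

```typescript
import { mkdtempSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { pathToFileURL } from "node:url";

// Hypothetical sketch: key on stat metadata so unchanged bytes reuse one URL.
function sketchCacheBustKey(filePath: string): string {
  try {
    const s = statSync(filePath);
    // No rounding: sub-millisecond precision keeps same-millisecond edits apart.
    return `${s.mtimeMs}-${s.ctimeMs}-${s.size}`;
  } catch {
    // Stable fallback; the later import() reports the missing file itself.
    return "0-0-0";
  }
}

function sketchCacheBustUrl(filePath: string): string {
  return `${pathToFileURL(filePath).href}?t=${sketchCacheBustKey(filePath)}`;
}

// Demo: repeated reads share a key; a size change produces a new one.
const sketchDir = mkdtempSync(join(tmpdir(), "cachebust-sketch-"));
const sketchFile = join(sketchDir, "entry.mjs");
writeFileSync(sketchFile, "export const v = 1;");
const firstKey = sketchCacheBustKey(sketchFile);
const secondKey = sketchCacheBustKey(sketchFile);
writeFileSync(sketchFile, "export const v = 22;");
const changedKey = sketchCacheBustKey(sketchFile);
const missingKey = sketchCacheBustKey(join(sketchDir, "does-not-exist.mjs"));
```

Because the URL only changes when the stat metadata changes, the ESM loader's cache grows per real edit rather than per call.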
diff --git a/packages/arkor/src/core/moduleCacheBust.ts b/packages/arkor/src/core/moduleCacheBust.ts
new file mode 100644
index 00000000..22f160a5
--- /dev/null
+++ b/packages/arkor/src/core/moduleCacheBust.ts
@@ -0,0 +1,51 @@
+import { statSync } from "node:fs";
+import { pathToFileURL } from "node:url";
+
+/**
+ * Build a content-derived cache-bust query for `await import(url + "?t=" + key)`.
+ *
+ * Why this matters: Node's ESM loader caches every dynamically-imported
+ * URL for the lifetime of the process and exposes no API to evict a
+ * record. A naive `?t=Date.now()` cache-bust produces a fresh URL on
+ * every call, so a long-running `arkor dev` session (where the SPA
+ * polls `/api/manifest` every few seconds and every save fires
+ * `BUNDLE_END` + SIGUSR2) accumulates one module record per call,
+ * unbounded.
+ *
+ * Keying on `mtimeMs + ctimeMs + size` collapses repeated reads of the
+ * same bytes onto the same URL, which Node's loader then serves from
+ * its existing cache record. The leak shrinks from "one entry per
+ * call" to "one entry per actual file change", which is the tightest
+ * bound we can offer without spawning a child process per import.
+ *
+ * `mtimeMs` is kept at full sub-millisecond precision (no rounding):
+ * a previous `toFixed(0)` collapsed two distinct edits that landed in
+ * the same millisecond and produced an identically-sized output onto
+ * the same key, which made Node's loader return the *stale* module
+ * for the second edit (HMR/manifest staleness on fast filesystems).
+ * `ctimeMs` is included as belt-and-braces against the (rare) case
+ * where mtime collides but ctime moves: `touch -m` and some build
+ * tools update one without the other.
+ *
+ * Falls back to a stable literal on stat failure so the eventual
+ * `import()` (which will throw on a missing file) gets to surface its
+ * own clean error rather than us inventing a noisy timestamp here.
+ */
+export function moduleCacheBustKey(filePath: string): string {
+  try {
+    const s = statSync(filePath);
+    return `${s.mtimeMs}-${s.ctimeMs}-${s.size}`;
+  } catch {
+    return "0-0-0";
+  }
+}
+
+/**
+ * Convenience: full file URL with the cache-bust key already
+ * appended. The template literal is small enough to inline,
+ * but doing it in one place keeps the URL shape uniform across the
+ * three callers (`hmr.ts`, `manifest.ts`, `runnerSignals.ts`).
+ */
+export function moduleCacheBustUrl(filePath: string): string {
+  return `${pathToFileURL(filePath).href}?t=${moduleCacheBustKey(filePath)}`;
+}
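Reviewer note: the keying behaviour described in the doc comment above can be sketched as follows. The two helpers are copied from the diff so the snippet stands alone; `demoCacheBust` and the temp-file names are hypothetical and exist only for illustration. The sketch assumes a writable temp directory.

```typescript
import { mkdtempSync, rmSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { pathToFileURL } from "node:url";

// Copied from moduleCacheBust.ts in this diff so the sketch is self-contained.
function moduleCacheBustKey(filePath: string): string {
  try {
    const s = statSync(filePath);
    return `${s.mtimeMs}-${s.ctimeMs}-${s.size}`;
  } catch {
    return "0-0-0";
  }
}

function moduleCacheBustUrl(filePath: string): string {
  return `${pathToFileURL(filePath).href}?t=${moduleCacheBustKey(filePath)}`;
}

// Hypothetical demo harness (not part of the PR).
function demoCacheBust() {
  const dir = mkdtempSync(join(tmpdir(), "cache-bust-demo-"));
  try {
    const file = join(dir, "entry.mjs");
    writeFileSync(file, "export const v = 1;");
    // Two reads of unchanged bytes yield the same URL, so a dynamic
    // import of either would hit the same loader cache record.
    const first = moduleCacheBustUrl(file);
    const second = moduleCacheBustUrl(file);
    // A rewrite with a different length changes at least the size part.
    writeFileSync(file, "export const v = 2; // longer");
    const afterEdit = moduleCacheBustUrl(file);
    // A missing file falls back to the stable literal.
    const missing = moduleCacheBustKey(join(dir, "missing.mjs"));
    return { first, second, afterEdit, missing };
  } finally {
    rmSync(dir, { recursive: true, force: true });
  }
}

const demo = demoCacheBust();
```

The point of the design is that the number of distinct URLs tracks the number of file changes, not the number of calls.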
diff --git a/packages/arkor/src/core/projectState.test.ts b/packages/arkor/src/core/projectState.test.ts
index 8d36515a..73d6b706 100644
--- a/packages/arkor/src/core/projectState.test.ts
+++ b/packages/arkor/src/core/projectState.test.ts
@@ -37,7 +37,7 @@ function fakeClient(
   // Construct a real CloudApiClient (so type-compatibility holds), then
   // monkey-patch only the methods exercised by ensureProjectState. The
   // other methods would throw on first use because no fetcher is wired,
-  // which is fine — projectState should never reach them.
+  // which is fine; projectState should never reach them.
   const client = new CloudApiClient({
     baseUrl: "http://mock",
     credentials: anonCreds,
@@ -84,7 +84,7 @@ describe("ensureProjectState", () => {
     expect(createProject).not.toHaveBeenCalled();
   });
 
-  it("throws for auth0 callers without state — they must write .arkor/state.json by hand", async () => {
+  it("throws for auth0 callers without state: they must write .arkor/state.json by hand", async () => {
     const client = fakeClient();
     await expect(
       ensureProjectState({ cwd, client, credentials: auth0Creds }),
@@ -116,7 +116,7 @@ describe("ensureProjectState", () => {
       expect(createProject).toHaveBeenCalledWith({
         orgSlug: "anon-abc",
         name: expect.stringMatching(/^my-app/),
-        // Sanitised slug — basename starts with "my-app-<random>", and we
+        // Sanitised slug: basename starts with "my-app-<random>", and we
         // expect the sanitiser to keep dashes.
         slug: expect.stringMatching(/^my-app/),
       });
diff --git a/packages/arkor/src/core/rolldownConfig.ts b/packages/arkor/src/core/rolldownConfig.ts
new file mode 100644
index 00000000..66e87c29
--- /dev/null
+++ b/packages/arkor/src/core/rolldownConfig.ts
@@ -0,0 +1,86 @@
+import { isAbsolute, resolve } from "node:path";
+import type { InputOptions } from "rolldown";
+
+const DEFAULT_ENTRY = "src/arkor/index.ts";
+const DEFAULT_OUT_DIR = ".arkor/build";
+
+export interface BuildEntryOptions {
+  /** Source entry path; defaults to `src/arkor/index.ts`. */
+  entry?: string;
+  /** Output directory; defaults to `.arkor/build`. */
+  outDir?: string;
+  /** Project root; defaults to `process.cwd()`. */
+  cwd?: string;
+}
+
+export interface ResolvedBuildEntry {
+  /** Project root (absolute). */
+  cwd: string;
+  /** Entry source file (absolute). */
+  entry: string;
+  /** Output directory (absolute). */
+  outDir: string;
+  /** Output bundle (absolute, always `<outDir>/index.mjs`). */
+  outFile: string;
+}
+
+/** Resolve `cwd` / `entry` / `outDir` to absolute paths with the standard defaults. */
+export function resolveBuildEntry(opts: BuildEntryOptions): ResolvedBuildEntry {
+  const cwd = opts.cwd ?? process.cwd();
+  const entryRel = opts.entry ?? DEFAULT_ENTRY;
+  const entry = isAbsolute(entryRel) ? entryRel : resolve(cwd, entryRel);
+  const outDirRel = opts.outDir ?? DEFAULT_OUT_DIR;
+  const outDir = isAbsolute(outDirRel) ? outDirRel : resolve(cwd, outDirRel);
+  const outFile = resolve(outDir, "index.mjs");
+  return { cwd, entry, outDir, outFile };
+}
+
+/**
+ * `node<major>.<minor>` derived from the running Node binary. Build host and
+ * run host are effectively the same process (Studio spawns `arkor start` with
+ * `process.execPath`), so the bundle can target precisely what will execute it.
+ */
+export function resolveNodeTarget(): string {
+  // Fallback aligns with the published `engines.node` floor; see
+  // [packages/arkor/package.json] / `AGENTS.md`'s "Node version" note.
+  const [major = "22", minor = "22"] = process.versions.node.split(".");
+  return `node${major}.${minor}`;
+}
+
+/**
+ * Build the shared rolldown options object used by both `runBuild` (one-shot)
+ * and the HMR coordinator (`watch()`). Centralising the configuration here
+ * keeps the two pipelines aligned: anything that affects the bundle shape
+ * (external resolution, transform target, platform) is set in one place so
+ * the artifact a watcher writes is byte-equivalent to a one-shot rebuild.
+ */
+export function rolldownInputOptions(
+  resolved: Pick<ResolvedBuildEntry, "cwd" | "entry">,
+): InputOptions {
+  return {
+    input: resolved.entry,
+    cwd: resolved.cwd,
+    platform: "node",
+    logLevel: "warn",
+    transform: { target: resolveNodeTarget() },
+    // Mirror esbuild's `packages: "external"`: any specifier that isn't a
+    // relative or absolute path stays external. `node:`-prefixed builtins
+    // are already handled by `platform: "node"`; the catch-all `return true`
+    // below is a safety net in case the builtin set drifts.
+    external: (id, _importer, isResolved) => {
+      if (isResolved) return false;
+      if (id.startsWith(".")) return false;
+      if (isAbsolute(id)) return false;
+      return true;
+    },
+  };
+}
+
+/**
+ * Re-exported defaults so consumers (like error messages) can name the same
+ * paths we resolve internally.
+ */
+export const BUILD_DEFAULTS = {
+  entry: DEFAULT_ENTRY,
+  outDir: DEFAULT_OUT_DIR,
+} as const;
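Reviewer note: a minimal sketch of how the `external` predicate above classifies specifiers. `isExternal` is a hypothetical standalone copy of the inline callback (the unused importer argument is dropped), and the sample specifiers, including `some-pkg`, are made up for illustration.

```typescript
import { isAbsolute } from "node:path";

// Hypothetical standalone copy of the `external` callback from
// rolldownInputOptions; `true` means "leave the import unbundled".
function isExternal(id: string, isResolved: boolean): boolean {
  if (isResolved) return false; // already resolved to a file: bundle it
  if (id.startsWith(".")) return false; // relative path: bundle it
  if (isAbsolute(id)) return false; // absolute path: bundle it
  return true; // bare specifier (package or builtin): keep external
}

// Illustrative specifiers only.
const samples = [
  "./trainer",
  "../shared/util",
  "/abs/entry.ts",
  "node:fs",
  "some-pkg",
  "@scope/pkg/sub",
];

const classified: Record<string, boolean> = {};
for (const id of samples) {
  classified[id] = isExternal(id, false);
}
```

Under this sketch only the bare specifiers (`node:fs`, `some-pkg`, `@scope/pkg/sub`) stay external; every path-like specifier is bundled.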
diff --git a/packages/arkor/src/core/runner.test.ts b/packages/arkor/src/core/runner.test.ts
index 89f29249..cdabfb15 100644
--- a/packages/arkor/src/core/runner.test.ts
+++ b/packages/arkor/src/core/runner.test.ts
@@ -1,4 +1,4 @@
-import { describe, it, expect, afterEach, beforeEach } from "vitest";
+import { describe, it, expect, afterEach, beforeEach, vi } from "vitest";
 import { mkdtempSync, rmSync, writeFileSync, mkdirSync } from "node:fs";
 import { tmpdir } from "node:os";
 import { join } from "node:path";
@@ -49,7 +49,7 @@ afterEach(() => {
   rmSync(cwd, { recursive: true, force: true });
 });
 
-describe("runTrainer — entry extraction", () => {
+describe("runTrainer: entry extraction", () => {
   it("throws when the entry file does not exist", async () => {
     await expect(runTrainer("missing.ts")).rejects.toThrow(
       /Training entry not found/,
@@ -124,7 +124,7 @@ describe("runTrainer — entry extraction", () => {
   });
 
   it("throws when default export is a primitive (typeof !== 'object' branch)", async () => {
-    // The second half of `mod.default && typeof mod.default === "object"` —
+    // The second half of `mod.default && typeof mod.default === "object"`:
     // a primitive default like `42` or `"foo"` must short-circuit out of
     // the nested-trainer probe.
     const entry = join(cwd, "primitive-default.mjs");
@@ -135,7 +135,7 @@ describe("runTrainer — entry extraction", () => {
   });
 
   it("accepts a default export wrapping a `trainer` field (legacy power-user shape)", async () => {
-    // Hits the `if (isTrainer(nested)) return nested` branch — the only
+    // Hits the `if (isTrainer(nested)) return nested` branch: the only
     // place line 38 is reachable.
     const entry = join(cwd, "default-with-trainer.mjs");
     writeFileSync(
@@ -154,7 +154,7 @@ describe("runTrainer — entry extraction", () => {
 
   it("falls back to DEFAULT_ENTRY (src/arkor/index.ts) when called with no argument", async () => {
     // Branch coverage for `file ?? DEFAULT_ENTRY`. Place the entry at
-    // `<cwd>/src/arkor/index.ts` and invoke runTrainer() — the default
+    // `<cwd>/src/arkor/index.ts` and invoke runTrainer(): the default
     // path is what `arkor start` and Studio's "Run training" button use.
     const arkorDir = join(cwd, "src", "arkor");
     mkdirSync(arkorDir, { recursive: true });
@@ -174,8 +174,8 @@ describe("runTrainer — entry extraction", () => {
       join(arkorDir, "index.ts"),
       `export * from "./index.mjs";\n`,
     );
-    // Pass undefined explicitly to exercise the `?? DEFAULT_ENTRY` branch
-    // — Node's built-in TypeScript stripping handles the .ts extension at
+    // Pass undefined explicitly to exercise the `?? DEFAULT_ENTRY` branch.
+    // Node's built-in TypeScript stripping handles the .ts extension at
     // runtime. (vitest also strips TS so this works under test too.)
     await expect(runTrainer()).resolves.toBeUndefined();
   });
@@ -207,3 +207,99 @@ describe("runTrainer — entry extraction", () => {
     expect(typeof t.wait).toBe("function");
   });
 });
+
+describe("runTrainer: shutdown signal handling", () => {
+  it("first SIGTERM calls trainer.requestEarlyStop and exits 0; second SIGTERM exits 143", async () => {
+    // Fake trainer whose `wait()` hangs until the test manually resolves it
+    // (via a global helper). This lets us hold the run in flight long
+    // enough to assert both signal-handling branches without racing the
+    // `finally` block that removes the listeners.
+    // The fake trainer wears the early-stop brand
+    // (`Symbol.for("arkor.trainer.requestEarlyStop")`) so the runner's
+    // SIGTERM handler invokes it the same way the SDK-provided trainer
+    // does. No public `requestEarlyStop` method exists any more.
+    const trainerSrc = `
+      const KEY = Symbol.for("arkor.trainer.requestEarlyStop");
+      let earlyStopCalls = 0;
+      let resolveWait;
+      const waitPromise = new Promise((r) => { resolveWait = r; });
+      globalThis.__test_signalProbe = {
+        get earlyStopCalls() { return earlyStopCalls; },
+        finishWait: () => resolveWait({
+          job: {
+            id: "j1", orgId: "o", projectId: "p", name: "n",
+            status: "completed",
+            config: { model: "m", datasetSource: { type: "huggingface", name: "x" } },
+            createdAt: "2026",
+          },
+          artifacts: [],
+        }),
+      };
+      const trainer = {
+        name: "n",
+        start: async () => ({ jobId: "j1" }),
+        wait: () => waitPromise,
+        cancel: async () => {},
+      };
+      Object.defineProperty(trainer, KEY, {
+        value: async () => { earlyStopCalls++; },
+        enumerable: false,
+      });
+      export { trainer };
+    `;
+    const entry = join(cwd, "src/arkor/index.mjs");
+    mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+    writeFileSync(entry, trainerSrc);
+
+    const exitCalls: number[] = [];
+    const exitSpy = vi
+      .spyOn(process, "exit")
+      .mockImplementation(((code?: number) => {
+        exitCalls.push(code ?? 0);
+        return undefined as never;
+      }) as typeof process.exit);
+    const stdoutSpy = vi
+      .spyOn(process.stdout, "write")
+      .mockImplementation((() => true) as typeof process.stdout.write);
+    try {
+      const runPromise = runTrainer("src/arkor/index.mjs");
+      // Wait for import + start() to settle so the handler is registered
+      // before we synthesise SIGTERM. Poll for the probe rather than
+      // relying on a fixed timer: under load (e.g. running alongside
+      // sibling test files in turbo) the dynamic import + top-level
+      // body can take longer than a hardcoded 25 ms window.
+      type Probe = { earlyStopCalls: number; finishWait: () => void };
+      let probe: Probe | undefined;
+      for (let i = 0; i < 40; i++) {
+        probe = (globalThis as unknown as { __test_signalProbe?: Probe })
+          .__test_signalProbe;
+        if (probe) break;
+        await new Promise((r) => setTimeout(r, 25));
+      }
+      if (!probe) throw new Error("Probe not installed by user bundle");
+
+      // 1st SIGTERM → requestEarlyStop is called, exit(0) scheduled in the
+      // promise's `.finally`.
+      process.emit("SIGTERM", "SIGTERM");
+      await new Promise((r) => setTimeout(r, 25));
+      expect(probe.earlyStopCalls).toBe(1);
+      expect(exitCalls).toContain(0);
+
+      // 2nd SIGTERM (still in-flight, listeners not yet removed) →
+      // exit(143) immediately, no second requestEarlyStop call.
+      process.emit("SIGTERM", "SIGTERM");
+      await new Promise((r) => setTimeout(r, 25));
+      expect(probe.earlyStopCalls).toBe(1);
+      expect(exitCalls).toContain(143);
+
+      // Release the hung wait() so runPromise can complete and the
+      // shutdown handlers detach via the finally block.
+      probe.finishWait();
+      await runPromise;
+    } finally {
+      exitSpy.mockRestore();
+      stdoutSpy.mockRestore();
+      delete (globalThis as Record<string, unknown>).__test_signalProbe;
+    }
+  });
+});
diff --git a/packages/arkor/src/core/runner.ts b/packages/arkor/src/core/runner.ts
index e674b70e..38db0537 100644
--- a/packages/arkor/src/core/runner.ts
+++ b/packages/arkor/src/core/runner.ts
@@ -2,10 +2,48 @@ import { existsSync } from "node:fs";
 import { resolve, isAbsolute } from "node:path";
 import { pathToFileURL } from "node:url";
 import { isArkor } from "./arkor";
+import {
+  installCallbackReloadHandler,
+  installShutdownHandlers,
+} from "./runnerSignals";
 import type { Trainer } from "./types";
 
 const DEFAULT_ENTRY = "src/arkor/index.ts";
 
+/**
+ * Per-spawn nonce that `/api/train` injects via env so the server can
+ * recognise the runner's `Started job <id>` line without it being
+ * forgeable from user code. Captured at module load (i.e. BEFORE
+ * `runTrainer` does its `await import(userEntry)`) and the env var
+ * is deleted right after so the dynamically-imported user module
+ * cannot read it via `process.env`. If a user callback then writes
+ * `Started job <token>` to stdout, the line won't carry the nonce
+ * prefix and the server's anchored regex will reject it: no
+ * spoofed cloud `cancel()` POST against an attacker-chosen job id.
+ *
+ * Null when the runner was launched directly (e.g. `arkor start` from
+ * a shell), in which case the runner falls back to the plain
+ * `Started job <id>` form for backwards compatibility. The server only
+ * uses the nonce-prefixed form because every server spawn sets the
+ * env var.
+ *
+ * **Import-order requirement.** The spoof-prevention guarantee relies
+ * on this module reading + deleting `ARKOR_JOB_ID_MARKER_NONCE`
+ * before any user-controlled module gets to touch `process.env`.
+ * That's safe today because the only consumer chain is
+ * `bin.ts → cli/main.ts → cli/commands/start.ts → core/runner.ts`,
+ * all static imports, so this module is fully evaluated before
+ * `runTrainer` performs its `await import(userEntry)`. If a future
+ * refactor introduces a dynamic-import / lazy-load of runner.ts (so
+ * a sibling module runs first and could snapshot `process.env`), the
+ * capture+delete should move into a tiny dedicated module that the
+ * bin imports first, or the env var should be wiped at the server
+ * spawn boundary too.
+ */
+const STARTED_JOB_NONCE: string | null =
+  process.env.ARKOR_JOB_ID_MARKER_NONCE ?? null;
+delete process.env.ARKOR_JOB_ID_MARKER_NONCE;
+
 function isTrainer(value: unknown): value is Trainer {
   if (!value || typeof value !== "object") return false;
   const t = value as Record<string, unknown>;
@@ -53,8 +91,20 @@ export async function runTrainer(file?: string): Promise<void> {
   const mod = (await import(pathToFileURL(abs).href)) as Record<string, unknown>;
   const trainer = extractTrainer(mod);
 
-  const { jobId } = await trainer.start();
-  process.stdout.write(`Started job ${jobId}\n`);
-  const result = await trainer.wait();
-  process.stdout.write(`Job ${result.job.id} finished with status=${result.job.status}\n`);
+  const removeShutdown = installShutdownHandlers(trainer);
+  const removeCallbackReload = installCallbackReloadHandler(trainer, abs);
+  try {
+    const { jobId } = await trainer.start();
+    const startedJobPrefix = STARTED_JOB_NONCE
+      ? `[arkor:${STARTED_JOB_NONCE}] `
+      : "";
+    process.stdout.write(`${startedJobPrefix}Started job ${jobId}\n`);
+    const result = await trainer.wait();
+    process.stdout.write(
+      `Job ${result.job.id} finished with status=${result.job.status}\n`,
+    );
+  } finally {
+    removeShutdown();
+    removeCallbackReload();
+  }
 }
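Reviewer note: the nonce-prefixed marker can be illustrated with the sketch below. `startedJobLine` mirrors the prefix logic in `runTrainer`; `parseStartedJob` and `escapeRegExp` are hypothetical, since the server's actual anchored regex is not shown in this part of the diff, so treat the pattern as an assumption about its shape.

```typescript
// Mirrors the prefix logic in runTrainer: the nonce is prepended only
// when the server injected one at spawn time.
function startedJobLine(jobId: string, nonce: string | null): string {
  const prefix = nonce ? `[arkor:${nonce}] ` : "";
  return `${prefix}Started job ${jobId}`;
}

// Escape regex metacharacters so the nonce is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical server-side matcher (assumed shape, not the PR's code):
// anchored at both ends, so a line lacking the exact nonce prefix is rejected.
function parseStartedJob(line: string, nonce: string): string | null {
  const re = new RegExp(
    `^\\[arkor:${escapeRegExp(nonce)}\\] Started job (\\S+)$`,
  );
  const m = re.exec(line);
  return m?.[1] ?? null;
}

const nonce = "n0nce-Example_123"; // illustrative value only
const genuine = parseStartedJob(startedJobLine("j1", nonce), nonce);
const spoofed = parseStartedJob("Started job attacker-job", nonce);
const wrongNonce = parseStartedJob(startedJobLine("j1", "other"), nonce);
```

Because user code never sees the nonce, a line it prints cannot carry the right prefix, which is what the doc comment on `STARTED_JOB_NONCE` relies on.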
diff --git a/packages/arkor/src/core/runnerSignals.test.ts b/packages/arkor/src/core/runnerSignals.test.ts
new file mode 100644
index 00000000..5e274943
--- /dev/null
+++ b/packages/arkor/src/core/runnerSignals.test.ts
@@ -0,0 +1,403 @@
+import { describe, it, expect, beforeEach, afterEach, vi } from "vitest";
+import { mkdirSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+import {
+  installCallbackReloadHandler,
+  installShutdownHandlers,
+} from "./runnerSignals";
+import type { Trainer, TrainerCallbacks } from "./types";
+import {
+  attachTrainerCallbackReplacer,
+  attachTrainerEarlyStopper,
+  attachTrainerInspection,
+} from "./trainerInspection";
+
+let cwd: string;
+
+beforeEach(() => {
+  cwd = mkdtempSync(join(tmpdir(), "arkor-signals-test-"));
+});
+
+afterEach(() => {
+  rmSync(cwd, { recursive: true, force: true });
+});
+
+function makeTrainer(): Trainer & {
+  __earlyStop: { calls: number };
+  __replace: {
+    lastCallbacks: Partial<TrainerCallbacks> | null;
+    calls: number;
+  };
+} {
+  const earlyStop = { calls: 0 };
+  const replace = {
+    lastCallbacks: null as Partial<TrainerCallbacks> | null,
+    calls: 0,
+  };
+  const trainer: Trainer = {
+    name: "n",
+    async start() {
+      return { jobId: "j" };
+    },
+    async wait() {
+      throw new Error("not used");
+    },
+    async cancel() {},
+  };
+  // Wire the internal callback-replacer + early-stop brands the same
+  // way `createTrainer` does. SIGUSR2 looks them up via
+  // `replaceTrainerCallbacks` and SIGTERM via `requestTrainerEarlyStop`
+  // (there are no public methods on `Trainer` for either any more).
+  attachTrainerCallbackReplacer(trainer, (cbs) => {
+    replace.lastCallbacks = cbs;
+    replace.calls += 1;
+  });
+  attachTrainerEarlyStopper(trainer, async () => {
+    earlyStop.calls += 1;
+  });
+  return Object.assign(trainer, {
+    __earlyStop: earlyStop,
+    __replace: replace,
+  });
+}
+
+describe("installShutdownHandlers", () => {
+  it("calls trainer.requestEarlyStop on the first SIGTERM and exits 0", async () => {
+    const trainer = makeTrainer();
+    const exitSpy = vi
+      .spyOn(process, "exit")
+      .mockImplementation((() => undefined as never) as typeof process.exit);
+    const stdoutSpy = vi
+      .spyOn(process.stdout, "write")
+      .mockImplementation((() => true) as typeof process.stdout.write);
+    const dispose = installShutdownHandlers(trainer);
+    try {
+      process.emit("SIGTERM", "SIGTERM");
+      await new Promise((r) => setTimeout(r, 10));
+      expect(trainer.__earlyStop.calls).toBe(1);
+      expect(exitSpy).toHaveBeenCalledWith(0);
+    } finally {
+      dispose();
+      exitSpy.mockRestore();
+      stdoutSpy.mockRestore();
+    }
+  });
+
+  it("second-signal exit code is per-signal POSIX 128+signo (130 for SIGINT, 129 for SIGHUP)", async () => {
+    // Regression: the second-signal emergency-exit path used to
+    // hardcode `process.exit(143)` regardless of which signal
+    // fired. SIGINT (Ctrl-C twice) and SIGHUP shutdowns then
+    // looked like SIGTERM exits to parent shells / orchestrators,
+    // breaking signal-aware logic (e.g. tmux pane behaviour, CI
+    // job classification, `&&` / `||` chains that distinguish
+    // user-cancel from clean exit). Mirrors `SIGNAL_EXIT_CODE` in
+    // `cli/cleanupHooks.ts`.
+    const cases: Array<["SIGINT" | "SIGTERM" | "SIGHUP", number]> = [
+      ["SIGINT", 130],
+      ["SIGTERM", 143],
+      ["SIGHUP", 129],
+    ];
+    for (const [sig, expectedExit] of cases) {
+      const trainer = makeTrainer();
+      const exitCodes: number[] = [];
+      const exitSpy = vi
+        .spyOn(process, "exit")
+        .mockImplementation(((code?: number) => {
+          exitCodes.push(code ?? 0);
+          return undefined as never;
+        }) as typeof process.exit);
+      const stdoutSpy = vi
+        .spyOn(process.stdout, "write")
+        .mockImplementation((() => true) as typeof process.stdout.write);
+      const dispose = installShutdownHandlers(trainer);
+      try {
+        process.emit(sig, sig);
+        await new Promise((r) => setTimeout(r, 10));
+        process.emit(sig, sig);
+        await new Promise((r) => setTimeout(r, 10));
+        // First signal exits 0 via the early-stop chain's
+        // `.finally(() => process.exit(0))`; second signal exits
+        // with the per-signal POSIX code.
+        expect(exitCodes, `signal ${sig}`).toContain(expectedExit);
+      } finally {
+        dispose();
+        exitSpy.mockRestore();
+        stdoutSpy.mockRestore();
+      }
+    }
+  });
+
+  it("first-signal exit code is per-signal POSIX 128+signo when the early-stop chain rejects", async () => {
+    // Regression: the first-signal `.finally(() => process.exit(0))`
+    // always exited 0 even when the early-stop chain rejected
+    // (cancel POST hit a cloud-api 5xx, network drop, etc.). Parent
+    // shells running `arkor start || cleanup_on_failure` would then
+    // classify the failed cancel as a clean run and skip cleanup
+    // despite the stderr diagnostic. Fix: non-zero POSIX 128+signo on
+    // rejection so the exit status carries the same signal-shape
+    // semantics as the second-signal emergency path.
+    const cases: Array<["SIGINT" | "SIGTERM" | "SIGHUP", number]> = [
+      ["SIGINT", 130],
+      ["SIGTERM", 143],
+      ["SIGHUP", 129],
+    ];
+    for (const [sig, expectedExit] of cases) {
+      // Build a trainer whose internal early-stop brand REJECTS, so
+      // the runner's `.catch(...).finally(...)` chain goes through
+      // the failure branch.
+      const trainer: Trainer = {
+        name: "n",
+        async start() {
+          return { jobId: "j" };
+        },
+        async wait() {
+          throw new Error("not used");
+        },
+        async cancel() {},
+      };
+      attachTrainerEarlyStopper(trainer, async () => {
+        throw new Error("cloud-api 503");
+      });
+      const exitCodes: number[] = [];
+      const exitSpy = vi
+        .spyOn(process, "exit")
+        .mockImplementation(((code?: number) => {
+          exitCodes.push(code ?? 0);
+          return undefined as never;
+        }) as typeof process.exit);
+      const stdoutSpy = vi
+        .spyOn(process.stdout, "write")
+        .mockImplementation((() => true) as typeof process.stdout.write);
+      const stderrSpy = vi
+        .spyOn(process.stderr, "write")
+        .mockImplementation((() => true) as typeof process.stderr.write);
+      const dispose = installShutdownHandlers(trainer);
+      try {
+        process.emit(sig, sig);
+        // Wait for the .catch / .finally microtasks to settle.
+        await new Promise((r) => setTimeout(r, 10));
+        // Under the bug this was just `[0]`. With the fix the
+        // first-signal exit code reflects the signal that fired.
+        expect(exitCodes, `signal ${sig}`).toEqual([expectedExit]);
+      } finally {
+        dispose();
+        exitSpy.mockRestore();
+        stdoutSpy.mockRestore();
+        stderrSpy.mockRestore();
+      }
+    }
+  });
+
+  it("second SIGTERM exits 143 without re-invoking requestEarlyStop", async () => {
+    const trainer = makeTrainer();
+    const exitCodes: number[] = [];
+    const exitSpy = vi
+      .spyOn(process, "exit")
+      .mockImplementation(((code?: number) => {
+        exitCodes.push(code ?? 0);
+        return undefined as never;
+      }) as typeof process.exit);
+    const stdoutSpy = vi
+      .spyOn(process.stdout, "write")
+      .mockImplementation((() => true) as typeof process.stdout.write);
+    const dispose = installShutdownHandlers(trainer);
+    try {
+      process.emit("SIGTERM", "SIGTERM");
+      await new Promise((r) => setTimeout(r, 10));
+      process.emit("SIGTERM", "SIGTERM");
+      await new Promise((r) => setTimeout(r, 10));
+      expect(trainer.__earlyStop.calls).toBe(1);
+      expect(exitCodes).toContain(0);
+      expect(exitCodes).toContain(143);
+    } finally {
+      dispose();
+      exitSpy.mockRestore();
+      stdoutSpy.mockRestore();
+    }
+  });
+});
+
+describe("installCallbackReloadHandler", () => {
+  function writeUserBundle(label: string): string {
+    const file = join(cwd, "entry.mjs");
+    // Inline a fake trainer that wears the inspection brand. The
+    // SIGUSR2 handler dynamic-imports this file and pulls the
+    // callbacks reference off via `getTrainerInspection`.
+    const src = `
+      const KEY = Symbol.for("arkor.trainer.inspect");
+      const callbacks = { onLog: (ctx) => globalThis.__arkor_callbackProbe?.(${JSON.stringify(label)}, ctx) };
+      const trainer = {
+        name: "t",
+        start: async () => ({ jobId: "j" }),
+        wait: async () => ({ job: {}, artifacts: [] }),
+        cancel: async () => {},
+      };
+      Object.defineProperty(trainer, KEY, {
+        value: () => ({ name: "t", config: { model: "m", datasetSource: { type: "huggingface", name: "x" } }, callbacks }),
+        enumerable: false,
+      });
+      export const arkor = Object.freeze({ _kind: "arkor", trainer });
+    `;
+    writeFileSync(file, src);
+    return file;
+  }
+
+  it("re-imports the bundle and forwards the new callbacks via replaceCallbacks", async () => {
+    const trainer = makeTrainer();
+    // Brand the trainer too so the import path-side has a reference shape.
+    attachTrainerInspection(trainer, () => ({
+      name: "n",
+      config: {
+        model: "m",
+        datasetSource: { type: "huggingface", name: "x" },
+      },
+      callbacks: {},
+    }));
+
+    const file = writeUserBundle("v1");
+    const stdoutSpy = vi
+      .spyOn(process.stdout, "write")
+      .mockImplementation((() => true) as typeof process.stdout.write);
+    const stderrSpy = vi
+      .spyOn(process.stderr, "write")
+      .mockImplementation((() => true) as typeof process.stderr.write);
+    const dispose = installCallbackReloadHandler(trainer, file);
+    mkdirSync(join(cwd, "src"), { recursive: true });
+    try {
+      // Rewrite the entry to "v2" callbacks before signalling.
+      writeUserBundle("v2");
+      process.emit("SIGUSR2", "SIGUSR2");
+      // Wait for the dynamic import + replaceCallbacks to settle.
+      for (let i = 0; i < 50 && trainer.__replace.lastCallbacks === null; i++) {
+        await new Promise((r) => setTimeout(r, 10));
+      }
+      expect(trainer.__replace.lastCallbacks).not.toBeNull();
+      expect(typeof trainer.__replace.lastCallbacks?.onLog).toBe("function");
+    } finally {
+      dispose();
+      stdoutSpy.mockRestore();
+      stderrSpy.mockRestore();
+    }
+  });
+
+  it("returns a no-op disposer when SIGUSR2 registration throws (Windows fallback)", () => {
+    // Regression: `process.on("SIGUSR2", ...)` can throw at
+    // registration time on platforms that don't support the signal
+    // (notably Windows). Previously this would surface as a hard
+    // crash at `arkor start` boot. The handler now wraps the
+    // registration in try/catch and degrades to a no-op disposer so
+    // the rest of the runner stays up: the server's
+    // `safeKill(child, "SIGUSR2")` already detects the same
+    // condition and falls back to SIGTERM-restart there.
+    const trainer = makeTrainer();
+    const file = join(cwd, "entry.mjs");
+    writeFileSync(file, "export const x = 1;\n");
+
+    const realOn = process.on.bind(process);
+    const onSpy = vi
+      .spyOn(process, "on")
+      .mockImplementation(((event: string, listener: (...args: unknown[]) => void) => {
+        if (event === "SIGUSR2") {
+          throw new Error("ENOSYS: function not implemented");
+        }
+        return realOn(event as never, listener as never);
+      }) as typeof process.on);
+
+    let dispose: (() => void) | undefined;
+    try {
+      // Must not throw despite the SIGUSR2 registration failure.
+      dispose = installCallbackReloadHandler(trainer, file);
+      expect(typeof dispose).toBe("function");
+      // No listener was attached, so the disposer is a no-op; calling
+      // it must not throw either (mirroring the success-path contract
+      // for tests that always invoke the disposer in `finally`).
+      expect(() => dispose?.()).not.toThrow();
+    } finally {
+      onSpy.mockRestore();
+    }
+  });
+
+  it("drops a stale reload's result when a newer SIGUSR2 starts before the import resolves", async () => {
+    // Regression: each SIGUSR2 starts a fire-and-forget
+    // `import()` + `replaceTrainerCallbacks`. Two same-`configHash`
+    // rebuilds firing back-to-back can race: the earlier import's
+    // bytes sometimes resolve *after* the newer one, and
+    // `replaceTrainerCallbacks` overwrites the freshly-loaded
+    // callbacks with the prior version. The fix version-gates each
+    // reload via a monotonic `loadSeq`; this test pins the contract
+    // by firing two signals back-to-back and asserting that
+    // `replaceTrainerCallbacks` was invoked exactly **once**,
+    // proving that the older IIFE dropped its result at the
+    // `seq !== loadSeq` check before reaching the replace call.
+    const trainer = makeTrainer();
+    attachTrainerInspection(trainer, () => ({
+      name: "n",
+      config: {
+        model: "m",
+        datasetSource: { type: "huggingface", name: "x" },
+      },
+      callbacks: {},
+    }));
+
+    const file = writeUserBundle("v1");
+    const stdoutSpy = vi
+      .spyOn(process.stdout, "write")
+      .mockImplementation((() => true) as typeof process.stdout.write);
+    const stderrSpy = vi
+      .spyOn(process.stderr, "write")
+      .mockImplementation((() => true) as typeof process.stderr.write);
+    const dispose = installCallbackReloadHandler(trainer, file);
+    try {
+      // First signal: bumps loadSeq to 1; its IIFE closes over seq=1.
+      process.emit("SIGUSR2", "SIGUSR2");
+      // Rewrite the bundle to v2 BEFORE letting either import
+      // resolve. mtime+ctime+size change → distinct cache-bust URL.
+      writeUserBundle("v2");
+      // Second signal: bumps loadSeq to 2; its IIFE closes over seq=2.
+      process.emit("SIGUSR2", "SIGUSR2");
+      // Generous fixed wait so both imports definitely settle;
+      // we can't poll on `lastCallbacks !== null` because the v1
+      // IIFE might land first and short-circuit our wait, hiding
+      // the count assertion below.
+      await new Promise((r) => setTimeout(r, 200));
+      // Without the seq guard, both IIFEs would call
+      // `replaceTrainerCallbacks` and `calls` would be 2. With the
+      // guard, the older IIFE's `seq !== loadSeq` short-circuit
+      // skips the replace call entirely.
+      expect(trainer.__replace.calls).toBe(1);
+    } finally {
+      dispose();
+      stdoutSpy.mockRestore();
+      stderrSpy.mockRestore();
+    }
+  });
+
+  it("logs a skip warning when the bundle has no inspectable trainer", async () => {
+    const trainer = makeTrainer();
+    const file = join(cwd, "no-trainer.mjs");
+    writeFileSync(file, "export const nothing = true;\n");
+    const stdoutSpy = vi
+      .spyOn(process.stdout, "write")
+      .mockImplementation((() => true) as typeof process.stdout.write);
+    const stderrChunks: string[] = [];
+    const stderrSpy = vi
+      .spyOn(process.stderr, "write")
+      .mockImplementation(((chunk: unknown) => {
+        stderrChunks.push(String(chunk));
+        return true;
+      }) as typeof process.stderr.write);
+    const dispose = installCallbackReloadHandler(trainer, file);
+    try {
+      process.emit("SIGUSR2", "SIGUSR2");
+      // Give the dynamic import a few ticks.
+      await new Promise((r) => setTimeout(r, 50));
+      expect(stderrChunks.join("")).toMatch(/no inspectable trainer/i);
+      expect(trainer.__replace.lastCallbacks).toBeNull();
+    } finally {
+      dispose();
+      stdoutSpy.mockRestore();
+      stderrSpy.mockRestore();
+    }
+  });
+});
diff --git a/packages/arkor/src/core/runnerSignals.ts b/packages/arkor/src/core/runnerSignals.ts
new file mode 100644
index 00000000..5c71a27d
--- /dev/null
+++ b/packages/arkor/src/core/runnerSignals.ts
@@ -0,0 +1,215 @@
+import { moduleCacheBustUrl } from "./moduleCacheBust";
+import { SIGNAL_EXIT_CODE } from "./signalExit";
+import {
+  findInspectableTrainer,
+  replaceTrainerCallbacks,
+  requestTrainerEarlyStop,
+} from "./trainerInspection";
+import type { Trainer, TrainerCallbacks } from "./types";
+
+const SHUTDOWN_SIGNALS = ["SIGTERM", "SIGINT", "SIGHUP"] as const;
+const CALLBACK_RELOAD_SIGNAL = "SIGUSR2" as const;
+
+/**
+ * Two-stage shutdown handling so HMR rebuilds (Studio sends SIGTERM)
+ * preserve the in-flight checkpoint work:
+ *
+ *   - 1st signal → `trainer.requestEarlyStop()`. The trainer keeps
+ *     running, lets the next `checkpoint.saved` event land, then issues
+ *     `cancel()`.
+ *   - 2nd signal → immediate `process.exit(POSIX 128+signo)`:
+ *     130 for SIGINT, 143 for SIGTERM, 129 for SIGHUP. Escape hatch
+ *     for an impatient operator or a hung early-stop. Per-signal
+ *     exit code so parent shells see the actual interruption type.
+ *
+ * The returned dispose function removes the handlers so a normal
+ * `wait()` completion doesn't leave stale listeners behind. This
+ * matters because `runTrainer` can be called multiple times in
+ * tests within a single Node process.
+ */
+export function installShutdownHandlers(trainer: Trainer): () => void {
+  let signalCount = 0;
+  const handler = (signal: (typeof SHUTDOWN_SIGNALS)[number]): void => {
+    signalCount += 1;
+    if (signalCount > 1) {
+      process.stdout.write(
+        `Received second ${signal}; exiting without waiting for checkpoint.\n`,
+      );
+      // POSIX 128 + signo so the parent shell sees the right exit
+      // status: 130 for SIGINT (Ctrl-C twice), 129 for SIGHUP,
+      // 143 for SIGTERM. Hardcoding 143 misclassifies SIGINT and
+      // SIGHUP shutdowns as SIGTERM-style exits and breaks
+      // signal-aware orchestration. Defaults to 143 for any future
+      // signal we forget to map.
+      const code = SIGNAL_EXIT_CODE[signal] ?? 143;
+      process.exit(code);
+      // Explicit return so test mocks of process.exit (which don't
+      // actually terminate the worker) don't fall through into the
+      // early-stop path.
+      return;
+    }
+    process.stdout.write(
+      `Received ${signal}; early-stopping at next checkpoint…\n`,
+    );
+    // Drive the trainer's internal early-stop entry point via the
+    // `Symbol.for("arkor.trainer.requestEarlyStop")` brand attached by
+    // `createTrainer`. `runTrainer` also accepts hand-rolled
+    // `{ start, wait, cancel }` trainers; for those the brand is
+    // absent and `requestTrainerEarlyStop` transparently falls back
+    // to `trainer.cancel()` (best-effort, matches the public contract).
+    //
+    // Track whether the early-stop chain rejected so the final
+    // `process.exit` carries a non-zero status. The previous version
+    // always exited 0, which made `arkor start || cleanup_on_failure`
+    // wrappers classify a cancel-POST rejection (cloud-api transient
+    // failure, network drop) as a clean run despite the stderr
+    // diagnostic. POSIX 128 + signo on failure mirrors the
+    // second-signal exit-code convention so parent shells see a
+    // signal-style nonzero status.
+    let earlyStopFailed = false;
+    requestTrainerEarlyStop(trainer)
+      .catch((err: unknown) => {
+        earlyStopFailed = true;
+        const msg = err instanceof Error ? err.message : String(err);
+        process.stderr.write(`requestEarlyStop failed: ${msg}\n`);
+      })
+      .finally(() => {
+        const code = earlyStopFailed
+          ? (SIGNAL_EXIT_CODE[signal] ?? 143)
+          : 0;
+        process.exit(code);
+      });
+  };
+  // Per-signal closure (vs a single shared listener registered on
+  // every signal): the closure captures `sig` at registration time
+  // so the handler doesn't depend on whatever Node passes as the
+  // event arg. Node's documented contract is to pass the signal
+  // name, but pinning the source via closure keeps the handler
+  // robust regardless and makes the registration → arg
+  // relationship explicit at the callsite. Stored in a Map so
+  // `process.off` can remove the exact closure (anonymous arrow
+  // would leak the listener since `process.off` matches by
+  // identity).
+  const signalHandlers = new Map<
+    (typeof SHUTDOWN_SIGNALS)[number],
+    () => void
+  >();
+  for (const sig of SHUTDOWN_SIGNALS) {
+    const fn = () => handler(sig);
+    signalHandlers.set(sig, fn);
+    process.on(sig, fn);
+  }
+  return () => {
+    for (const [sig, fn] of signalHandlers) process.off(sig, fn);
+  };
+}
+
+/**
+ * SIGUSR2 handler: re-import the freshly-rebuilt artefact and rotate
+ * the trainer's callback cell via the internal
+ * `Symbol.for("arkor.trainer.replaceCallbacks")` brand. The cloud-side
+ * training run is untouched; only the in-process callbacks change.
+ *
+ * Studio sends SIGUSR2 from the `/api/dev/events` HMR pipeline when
+ * (and only when) the rebuilt bundle's `JobConfig` hash matches the
+ * one captured at spawn time. A mismatch produces SIGTERM instead, which
+ * goes through `installShutdownHandlers` above.
+ */
+export function installCallbackReloadHandler(
+  trainer: Trainer,
+  entryPath: string,
+): () => void {
+  /**
+   * Monotonic counter for sequencing concurrent SIGUSR2 reloads.
+   * Bumped synchronously inside the signal handler *before* the
+   * dynamic-import await begins, so each in-flight reload knows its
+   * arrival order. When the import resolves, the IIFE compares its
+   * captured `seq` against `loadSeq` and silently drops the result
+   * if a newer signal already started a newer reload. Without this,
+   * two same-`configHash` rebuilds firing back-to-back can race on
+   * the import: the earlier import's bytes (now stale on disk)
+   * resolve *after* the newer one, and `replaceTrainerCallbacks`
+   * overwrites the freshly-loaded callbacks with the prior version,
+   * leaving the running job out of sync until the next rebuild.
+   * Mirrors the `buildSeq` guard in `studio/hmr.ts`'s
+   * `emitBuildSucceeded`.
+   */
+  let loadSeq = 0;
+  const handler = (): void => {
+    const seq = ++loadSeq;
+    // mtime+ctime+size cache-bust (vs `Date.now()`): Node's ESM
+    // loader never evicts module records, so a long `arkor start`
+    // session with frequent SIGUSR2 reloads would accumulate one
+    // record per signal forever. Keying on the actual artefact bytes
+    // (via `moduleCacheBustUrl`) collapses no-op signals onto the
+    // same URL; the leak is bounded to "one per real edit", which
+    // is fundamentally what HMR has to retain.
+    const url = moduleCacheBustUrl(entryPath);
+    void (async () => {
+      try {
+        const mod = (await import(url)) as Record<string, unknown>;
+        // A newer SIGUSR2 already started its own import while we
+        // were awaiting; drop our result so the latest edit wins.
+        if (seq !== loadSeq) return;
+        const callbacks = extractCallbacks(mod);
+        if (!callbacks) {
+          process.stderr.write(
+            "Callback reload skipped: rebuilt bundle has no inspectable trainer.\n",
+          );
+          return;
+        }
+        replaceTrainerCallbacks(trainer, callbacks);
+        process.stdout.write(
+          "Callbacks hot-reloaded; training run continues.\n",
+        );
+      } catch (err: unknown) {
+        const msg = err instanceof Error ? err.message : String(err);
+        process.stderr.write(`Callback reload failed: ${msg}\n`);
+      }
+    })();
+  };
+  // `process.on('SIGUSR2', ...)` can throw at registration time on
+  // platforms that don't support the signal (notably Windows: libuv's
+  // signal-wrap returns ENOSYS for SIGUSR2 on win32 and the error
+  // escapes to userland on some Node versions). The server-side
+  // `trainRegistry.safeKill(child, "SIGUSR2")` already detects this
+  // ("unsupported" → falls back to SIGTERM-restart), so an unarmed
+  // listener here is the documented contract on those platforms:
+  // quietly degrade to a no-op disposer rather than crashing
+  // `arkor start` at boot.
+  // Track registration success so the returned disposer never
+  // calls `process.off(...)` for a handler we never attached.
+  // Today this only fires for the early-return-no-op path where
+  // `process.on` threw at registration, but future Node versions
+  // could route `off` through the same libuv signal-wrap that
+  // throws for unsupported signals on Windows, and a symmetric
+  // throw inside the disposer would crash the `runTrainer` finally
+  // block instead of merely being a no-op.
+  let attached = false;
+  try {
+    process.on(CALLBACK_RELOAD_SIGNAL, handler);
+    attached = true;
+  } catch {
+    return () => {
+      // no-op: handler was never attached
+    };
+  }
+  return () => {
+    if (!attached) return;
+    process.off(CALLBACK_RELOAD_SIGNAL, handler);
+  };
+}
+
+/**
+ * Extract the user-supplied callbacks reference from a re-imported
+ * bundle. Delegates the entry-shape walk to `findInspectableTrainer`
+ * so SIGUSR2's view of "what counts as a trainer" stays identical to
+ * the HMR coordinator's `inspectBundle` and `runner.ts`'s
+ * `extractTrainer`. Returns `null` when no candidate carries the
+ * inspection brand.
+ */
+function extractCallbacks(
+  mod: Record<string, unknown>,
+): Partial<TrainerCallbacks> | null {
+  return findInspectableTrainer(mod)?.callbacks ?? null;
+}
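Note on the `loadSeq` guard in `installCallbackReloadHandler`: the pattern can be shown in isolation. The following is a minimal sketch, not code from this change. `latestWins` and `demoLatestWins` are hypothetical names, and the loader is a hand-resolved promise standing in for the dynamic `import()`, so the out-of-order settlement is explicit rather than timing-dependent.

```typescript
// Hypothetical helper (not part of the diff): wraps an async loader so that
// only the most recently started call is allowed to commit its result.
function latestWins<T>(
  load: () => Promise<T>,
  commit: (value: T) => void,
): () => Promise<void> {
  let loadSeq = 0; // bumped synchronously at the start of every call
  return async () => {
    const seq = ++loadSeq;
    const value = await load();
    // A newer call started while we were awaiting: drop the stale result.
    if (seq !== loadSeq) return;
    commit(value);
  };
}

// Demo with manually resolved promises so the settlement order is explicit.
async function demoLatestWins(): Promise<string[]> {
  const committed: string[] = [];
  const resolvers: Array<(v: string) => void> = [];
  const reload = latestWins<string>(
    () => new Promise<string>((resolve) => resolvers.push(resolve)),
    (v) => committed.push(v),
  );
  const first = reload(); // captures seq = 1
  const second = reload(); // captures seq = 2
  resolvers[1]?.("v2"); // the newer load settles first
  resolvers[0]?.("v1"); // the older load settles last and must be dropped
  await Promise.all([first, second]);
  return committed;
}
```

Because the counter is bumped before the first `await`, arrival order is fixed at call time regardless of which load settles first; only the call whose `seq` still equals `loadSeq` commits.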
diff --git a/packages/arkor/src/core/schemas.test.ts b/packages/arkor/src/core/schemas.test.ts
index 50ff2571..46a9cf18 100644
--- a/packages/arkor/src/core/schemas.test.ts
+++ b/packages/arkor/src/core/schemas.test.ts
@@ -64,7 +64,7 @@ describe("trainingJobSchema", () => {
   });
 
   it("normalises non-null startedAt/completedAt: strings pass through, Dates ISO-coerce", () => {
-    // Branch coverage for the `toIsoOrNull` transforms — the `null`
+    // Branch coverage for the `toIsoOrNull` transforms: the `null`
     // branch is exercised by every other test in this file (the
     // `valid` fixture has both fields null), but the truthy branch
     // only fires when the field carries an actual timestamp. Strings
diff --git a/packages/arkor/src/core/signalExit.ts b/packages/arkor/src/core/signalExit.ts
new file mode 100644
index 00000000..81e42eb7
--- /dev/null
+++ b/packages/arkor/src/core/signalExit.ts
@@ -0,0 +1,21 @@
+/**
+ * Shared POSIX `128 + signo` exit code mapping for the runner's
+ * two-stage shutdown handler (`core/runnerSignals.ts`) and the CLI's
+ * cleanup-hook coordinator (`cli/cleanupHooks.ts`). The two maps
+ * MUST agree: AGENTS.md describes them as a single contract, and a
+ * drift (e.g. someone adding SIGQUIT to one but not the other)
+ * would make the runner and the dev-server exit with inconsistent
+ * codes for the same signal, the exact parent-shell-classification
+ * regression the per-signal mapping was introduced to prevent.
+ *
+ * Lives in `core/` (not `cli/`) so both consumers can import it
+ * without `cli/` ↔ `core/` cycles: `cli/cleanupHooks.ts` imports
+ * from `core/`, but `core/` must not depend on `cli/`.
+ */
+export const SIGNAL_EXIT_CODE = {
+  SIGHUP: 129,
+  SIGINT: 130,
+  SIGTERM: 143,
+} as const;
+
+export type ShutdownSignal = keyof typeof SIGNAL_EXIT_CODE;
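Note on the `SIGNAL_EXIT_CODE` values: they follow the `128 + signo` convention described in the doc comment. As a sanity check, the same numbers can be derived from Node's `os.constants.signals`. This is a minimal sketch under the assumption of POSIX signal numbering (SIGHUP = 1, SIGINT = 2, SIGTERM = 15); `signalExitCode` is a hypothetical helper, not something this change adds, and the hard-coded table in the diff remains the source of truth.

```typescript
import { constants } from "node:os";

// Same three signals the shared table covers.
const SHUTDOWN_SIGNAL_NAMES = ["SIGHUP", "SIGINT", "SIGTERM"] as const;
type ShutdownSignalName = (typeof SHUTDOWN_SIGNAL_NAMES)[number];

// Hypothetical helper: compute the shell-visible exit status for a signal
// as 128 plus the platform's signal number.
function signalExitCode(signal: ShutdownSignalName): number {
  return 128 + constants.signals[signal];
}

// Build a table of derived codes for comparison against the hard-coded map.
const derivedExitCodes: Record<string, number> = {};
for (const sig of SHUTDOWN_SIGNAL_NAMES) {
  derivedExitCodes[sig] = signalExitCode(sig);
}
```

A unit test comparing `derivedExitCodes` with the exported constant would catch a typo in the table, while keeping the constant itself hard-coded avoids depending on platform-specific signal numbers at runtime.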
diff --git a/packages/arkor/src/core/trainer.test.ts b/packages/arkor/src/core/trainer.test.ts
index f9ef2f85..ff1d02ed 100644
--- a/packages/arkor/src/core/trainer.test.ts
+++ b/packages/arkor/src/core/trainer.test.ts
@@ -3,6 +3,10 @@ import { mkdtempSync, rmSync } from "node:fs";
 import { tmpdir } from "node:os";
 import { join } from "node:path";
 import { createTrainer } from "./trainer";
+import {
+  replaceTrainerCallbacks,
+  requestTrainerEarlyStop,
+} from "./trainerInspection";
 import { writeState } from "./state";
 import type { AnonymousCredentials } from "./credentials";
 
@@ -266,7 +270,7 @@ describe("createTrainer (credentials defaulting)", () => {
             model: "m",
             dataset: { type: "huggingface", name: "x" },
           },
-          // Note: NO `credentials` here — trainer must call ensureCredentials.
+          // Note: NO `credentials` here, so trainer must call ensureCredentials.
           {
             baseUrl: "http://mock",
             cwd: localCwd,
@@ -667,7 +671,7 @@ describe("createTrainer (SSE event stream)", () => {
   });
 });
 
-// Regression for ENG-406 — the previous reconnect loop had no upper bound
+// Regression for ENG-406: the previous reconnect loop had no upper bound
 // and no jitter, so a permanently-down cloud-api would keep retrying every
 // `reconnectDelayMs` forever (and on recovery several SDK clients would
 // reconnect at exactly the same instant).
@@ -795,7 +799,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
             step: 1,
             loss: 1,
           })}\n\n`,
-          // No terminal event — stream closes cleanly, outer loop reconnects.
+          // No terminal event: stream closes cleanly, outer loop reconnects.
         ],
       },
       { kind: "throw", error: new TypeError("fetch failed") },
@@ -833,8 +837,8 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
   // when `Math.random()` lands near 1.
   // Codex review on PR #13 (round 3) flagged that a 200-OK stream that
   // EOFs without emitting any frame would loop forever at the base delay
-  // — `maxReconnectAttempts` was bypassed because clean closes never
-  // touched the failure counter. Misconfigured proxies / load-balancers
+  // because `maxReconnectAttempts` was bypassed (clean closes never
+  // touched the failure counter). Misconfigured proxies / load-balancers
   // that accept the connection and immediately drop it would hang
   // `wait()` indefinitely.
   it("counts clean closes with no frames toward maxReconnectAttempts", async () => {
@@ -955,7 +959,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
     };
     // The trainer fires `POST /v1/jobs` synchronously inside the start()
     // path, so cancel() needs the job row to be assigned. We never open the
-    // event stream — cancel() should not depend on it.
+    // event stream; cancel() should not depend on it.
     const sse = [
       `id: 1\nevent: training.completed\ndata: ${JSON.stringify({
         type: "training.completed",
@@ -1012,7 +1016,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
     const original = globalThis.fetch;
     globalThis.fetch = fetcher;
     try {
-      // Start the run by awaiting wait() — the streamed completion event
+      // Start the run by awaiting wait(): the streamed completion event
       // closes the loop quickly so cancel() runs against a fully-resolved
       // startedJob/scope pair.
       await trainer.wait();
@@ -1086,7 +1090,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
   });
 
   it("skips malformed event payloads without aborting the stream", async () => {
-    // Branch coverage for the `try/catch` around JSON.parse — a single
+    // Branch coverage for the `try/catch` around JSON.parse: a single
     // malformed `data:` line shouldn't tear down the whole training run.
     // Send one garbage frame followed by a real terminal event.
     await writeState(
@@ -1153,7 +1157,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
   });
 
   it("recovers when the SSE body itself errors mid-stream", async () => {
-    // Branch coverage for the catch around the for-await iterator —
+    // Branch coverage for the catch around the for-await iterator:
     // covers the case where the stream's underlying body emits an error
     // (e.g. a network disconnect partway through). The reconnect loop
     // should treat it as a failure, count it toward the limit, then
@@ -1263,7 +1267,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
       { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
       cwd,
     );
-    // No fetch mock at all — if cancel() reached the API we'd see a real
+    // No fetch mock at all: if cancel() reached the API we'd see a real
     // network error. Safety net for callers that wire up cancel() to
     // SIGINT before kicking off the run.
     const trainer = createTrainer(
@@ -1413,3 +1417,1320 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
     }
   });
 });
+
+describe("createTrainer (early stop)", () => {
+  const minimalJobRow = {
+    id: "j-stop",
+    orgId: "o1",
+    projectId: "p1",
+    name: "run",
+    status: "queued",
+    config: {
+      model: "m",
+      datasetSource: { type: "huggingface", name: "x" },
+    },
+    createdAt: "2026-01-01T00:00:00Z",
+    startedAt: null,
+    completedAt: null,
+  };
+
+  it("calls cancel after the next checkpoint when early-stop is requested mid-run", async () => {
+    await writeState(
+      { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+      cwd,
+    );
+    // SSE stream: training.started → training.log → checkpoint.saved.
+    // The checkpoint event is the trigger for the early-stop branch in
+    // dispatch(); after that, the loop should treat the run as terminal
+    // (we assert this by letting wait() resolve without the stream
+    // ever sending training.completed).
+    const sse = [
+      `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+        type: "training.started",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:01Z",
+      })}\n\n`,
+      `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+        type: "training.log",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:02Z",
+        step: 1,
+        loss: 0.5,
+      })}\n\n`,
+      `id: 3\nevent: checkpoint.saved\ndata: ${JSON.stringify({
+        type: "checkpoint.saved",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:03Z",
+        step: 10,
+      })}\n\n`,
+    ];
+
+    let cancelCalls = 0;
+    const fetcher: typeof fetch = (async (
+      input: RequestInfo | URL,
+      init?: RequestInit,
+    ) => {
+      const url = typeof input === "string" ? input : input.toString();
+      const method = init?.method ?? "GET";
+      if (method === "POST" && url.includes("/v1/jobs?")) {
+        return new Response(JSON.stringify({ job: minimalJobRow }), {
+          status: 201,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+        return new Response(sseStream(sse), {
+          status: 200,
+          headers: { "content-type": "text/event-stream" },
+        });
+      }
+      if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+        cancelCalls += 1;
+        return new Response(JSON.stringify({ ok: true }), {
+          status: 200,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      throw new Error(`unexpected fetch: ${method} ${url}`);
+    }) as typeof fetch;
+
+    const trainer = createTrainer(
+      {
+        name: "run",
+        model: "m",
+        dataset: { type: "huggingface", name: "x" },
+        callbacks: {
+          // Arm the early-stop latch from inside the on-log callback so it
+          // fires before the checkpoint dispatch (mirrors the real CLI
+          // path where SIGTERM arrives mid-run). Fire-and-forget so the
+          // dispatch loop isn't blocked waiting for the latch's own
+          // checkpoint trigger to arrive.
+          onLog: () => {
+            void requestTrainerEarlyStop(trainer, { timeoutMs: 60_000 });
+          },
+        },
+      },
+      { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+    );
+    const original = globalThis.fetch;
+    globalThis.fetch = fetcher;
+    let result: Awaited<ReturnType<typeof trainer.wait>>;
+    try {
+      result = await trainer.wait();
+    } finally {
+      globalThis.fetch = original;
+    }
+    expect(cancelCalls).toBe(1);
+    // Regression: the early-stop checkpoint branch returns
+    // `{ terminal: true }` to break out of `wait()`'s loop without
+    // waiting for a cloud-side terminal event. The `TrainingResult`
+    // it resolves with must therefore reflect a terminal status
+    // locally; otherwise `wait()` violates its documented contract
+    // ("Resolve when the job reaches a terminal status") and a
+    // subsequent `requestEarlyStop` wouldn't see the
+    // `TERMINAL_STATUSES` short-circuit.
+    expect(result.job.status).toBe("cancelled");
+    expect(result.job.completedAt).toBe("2026-01-01T00:00:03Z");
+  });
+
+  it("early-stop checkpoint branch returns the checkpoint's artifacts in wait()'s result", async () => {
+    // Regression: the early-stop terminal return used
+    // `terminalResult?.artifacts ?? []`, but `wait()` always calls
+    // `dispatch(parsed, null)` so `terminalResult` was forever
+    // null → `wait()` resolved with `artifacts: []` even though
+    // the checkpoint event carries the very artefacts the
+    // early-stop existed to *preserve* (the whole point of the
+    // graceful-stop-at-next-checkpoint pattern is to keep that
+    // work). Now we return `event.artifacts` directly so the
+    // checkpoint's outputs make it into the resolved result.
+    await writeState(
+      { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+      cwd,
+    );
+    const checkpointArtifacts = [
+      { kind: "lora_adapter" as const, path: "/checkpoints/step-10/" },
+      { kind: "metric" as const, name: "loss", value: 0.42 },
+    ];
+    const sse = [
+      `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+        type: "training.started",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:01Z",
+      })}\n\n`,
+      `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+        type: "training.log",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:02Z",
+        step: 1,
+        loss: 0.5,
+      })}\n\n`,
+      `id: 3\nevent: checkpoint.saved\ndata: ${JSON.stringify({
+        type: "checkpoint.saved",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:03Z",
+        step: 10,
+        artifacts: checkpointArtifacts,
+      })}\n\n`,
+    ];
+    const fetcher: typeof fetch = (async (
+      input: RequestInfo | URL,
+      init?: RequestInit,
+    ) => {
+      const url = typeof input === "string" ? input : input.toString();
+      const method = init?.method ?? "GET";
+      if (method === "POST" && url.includes("/v1/jobs?")) {
+        return new Response(JSON.stringify({ job: minimalJobRow }), {
+          status: 201,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+        return new Response(sseStream(sse), {
+          status: 200,
+          headers: { "content-type": "text/event-stream" },
+        });
+      }
+      if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+        return new Response(JSON.stringify({ ok: true }), {
+          status: 200,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      throw new Error(`unexpected fetch: ${method} ${url}`);
+    }) as typeof fetch;
+
+    const trainer = createTrainer(
+      {
+        name: "run",
+        model: "m",
+        dataset: { type: "huggingface", name: "x" },
+        callbacks: {
+          onLog: () => {
+            void requestTrainerEarlyStop(trainer, { timeoutMs: 60_000 });
+          },
+        },
+      },
+      { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+    );
+    const original = globalThis.fetch;
+    globalThis.fetch = fetcher;
+    let result: Awaited<ReturnType<typeof trainer.wait>>;
+    try {
+      result = await trainer.wait();
+    } finally {
+      globalThis.fetch = original;
+    }
+    // The artefacts the checkpoint event carried must travel
+    // through to the wait() result; that's the whole point of
+    // graceful-stop-at-next-checkpoint preserving the in-flight
+    // work.
+    expect(result.artifacts).toEqual(checkpointArtifacts);
+    // Sibling assertion: status is still terminal (covered more
+    // thoroughly in the dedicated test above; this one just
+    // ensures we didn't accidentally regress the status while
+    // changing the artefacts return).
+    expect(result.job.status).toBe("cancelled");
+  });
+
+  it("early-stop branch still settles when the user's onCheckpoint callback throws (no SIGTERM hang)", async () => {
+    // Regression: the early-stop branch ran AFTER
+    // `await callbacks.onCheckpoint?.(ctx)`. A user-callback throw
+    // would propagate out of that await before the early-stop
+    // cancel + latch settlement could run, leaving
+    // `earlyStopDeferred` pending. The runner's
+    // `installShutdownHandlers` awaits that deferred → SIGTERM
+    // shutdown hangs until the (default 5-min) timeout fallback
+    // fires. The fix wraps `onCheckpoint` in try/catch, runs the
+    // early-stop branch unconditionally, then re-throws the
+    // captured callback error so wait()'s reconnect loop keeps
+    // its prior semantics.
+    await writeState(
+      { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+      cwd,
+    );
+    const sse = [
+      `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+        type: "training.started",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:01Z",
+      })}\n\n`,
+      `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+        type: "training.log",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:02Z",
+        step: 1,
+        loss: 0.5,
+      })}\n\n`,
+      `id: 3\nevent: checkpoint.saved\ndata: ${JSON.stringify({
+        type: "checkpoint.saved",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:03Z",
+        step: 10,
+      })}\n\n`,
+    ];
+    let cancelCalls = 0;
+    const fetcher: typeof fetch = (async (
+      input: RequestInfo | URL,
+      init?: RequestInit,
+    ) => {
+      const url = typeof input === "string" ? input : input.toString();
+      const method = init?.method ?? "GET";
+      if (method === "POST" && url.includes("/v1/jobs?")) {
+        return new Response(JSON.stringify({ job: minimalJobRow }), {
+          status: 201,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+        return new Response(sseStream(sse), {
+          status: 200,
+          headers: { "content-type": "text/event-stream" },
+        });
+      }
+      if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+        cancelCalls += 1;
+        return new Response(JSON.stringify({ ok: true }), {
+          status: 200,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      throw new Error(`unexpected fetch: ${method} ${url}`);
+    }) as typeof fetch;
+
+    let armedPromise: Promise<void> | null = null;
+    let armedResult: "resolved" | "rejected" | "pending" = "pending";
+    const trainer = createTrainer(
+      {
+        name: "run",
+        model: "m",
+        dataset: { type: "huggingface", name: "x" },
+        callbacks: {
+          onLog: () => {
+            if (armedPromise === null) {
+              armedPromise = requestTrainerEarlyStop(trainer, {
+                timeoutMs: 60_000,
+              });
+              armedPromise.then(
+                () => {
+                  armedResult = "resolved";
+                },
+                () => {
+                  armedResult = "rejected";
+                },
+              );
+            }
+          },
+          onCheckpoint: () => {
+            // User callback throws DURING the checkpoint that
+            // would normally trigger early-stop. Without the
+            // try/catch wrap this throw would skip the
+            // early-stop branch → latch pending → SIGTERM hang
+            // for up to 60s (our `timeoutMs`).
+            throw new Error("user onCheckpoint boom");
+          },
+        },
+      },
+      {
+        baseUrl: "http://mock",
+        credentials: creds,
+        cwd,
+        reconnectDelayMs: 1,
+        // Cap reconnects at 0 so the user-callback throw
+        // surfaces as a wait() rejection instead of
+        // looping forever (handleFailure would otherwise
+        // reconnect after the throw escapes dispatch).
+        maxReconnectAttempts: 0,
+      },
+    );
+    const original = globalThis.fetch;
+    globalThis.fetch = fetcher;
+    try {
+      // wait() rejects: handleFailure wraps the user callback
+      // throw because maxReconnectAttempts is 0.
+      await expect(trainer.wait()).rejects.toThrow();
+      // Critical: the latch SETTLED via the early-stop branch
+      // (resolve), not via the 60-second timeout. The cancel POST
+      // also fired (early-stop reached the cancel call before the
+      // throw was re-raised). Together: shutdown wouldn't hang.
+      await new Promise((r) => setImmediate(r));
+      expect(armedResult).toBe("resolved");
+      expect(cancelCalls).toBe(1);
+    } finally {
+      globalThis.fetch = original;
+    }
+  });
+
+  it("re-throws a falsy onCheckpoint throw (e.g. `throw null`) instead of silently suppressing it", async () => {
+    // Regression: `onCheckpointError !== null` was the discriminant for
+    // "did the user callback throw?". User code can legitimately
+    // `throw null`; the captured `onCheckpointError` was then
+    // indistinguishable from the no-error path, and the
+    // post-early-stop re-throw at the end of the checkpoint dispatch
+    // silently dropped the user's signal. With the fix, a separate
+    // `onCheckpointThrew` boolean discriminates so ANY throwable
+    // (including falsy ones such as `0` or `""`) propagates uniformly.
+    await writeState(
+      { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+      cwd,
+    );
+    const sse = [
+      `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+        type: "training.started",
+        jobId: "j-falsy",
+        timestamp: "2026-01-01T00:00:01Z",
+      })}\n\n`,
+      `id: 2\nevent: checkpoint.saved\ndata: ${JSON.stringify({
+        type: "checkpoint.saved",
+        jobId: "j-falsy",
+        timestamp: "2026-01-01T00:00:02Z",
+        step: 5,
+      })}\n\n`,
+    ];
+    const fetcher: typeof fetch = (async (
+      input: RequestInfo | URL,
+      init?: RequestInit,
+    ) => {
+      const url = typeof input === "string" ? input : input.toString();
+      const method = init?.method ?? "GET";
+      if (method === "POST" && url.includes("/v1/jobs?")) {
+        return new Response(JSON.stringify({ job: minimalJobRow }), {
+          status: 201,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      if (method === "GET" && url.includes("/v1/jobs/j-falsy/events/stream")) {
+        return new Response(sseStream(sse), {
+          status: 200,
+          headers: { "content-type": "text/event-stream" },
+        });
+      }
+      throw new Error(`unexpected fetch: ${method} ${url}`);
+    }) as typeof fetch;
+
+    const trainer = createTrainer(
+      {
+        name: "run",
+        model: "m",
+        dataset: { type: "huggingface", name: "x" },
+        callbacks: {
+          // `throw null`: a falsy throwable. Under the bug this was
+          // silently swallowed and wait() resolved as if no callback
+          // had thrown. With the fix `wait()` rejects (handleFailure
+          // wraps the throw because maxReconnectAttempts is 0).
+          onCheckpoint: () => {
+            throw null;
+          },
+        },
+      },
+      {
+        baseUrl: "http://mock",
+        credentials: creds,
+        cwd,
+        reconnectDelayMs: 1,
+        maxReconnectAttempts: 0,
+      },
+    );
+    const original = globalThis.fetch;
+    globalThis.fetch = fetcher;
+    try {
+      // Under the bug `wait()` resolved cleanly (the falsy throw was
+      // captured by `onCheckpointError = err` but the
+      // `if (onCheckpointError !== null) throw` guard saw `null` and
+      // skipped the re-throw). With the fix `wait()` rejects.
+      await expect(trainer.wait()).rejects.toBeDefined();
+    } finally {
+      globalThis.fetch = original;
+    }
+  });
+
+  it("early-stop checkpoint branch rejects the deferred when cancel() throws (visible to shutdown handler)", async () => {
+    // Regression: previously, an `await trainer.cancel()` that threw
+    // (network failure / cloud-api 5xx during the cancel POST) was
+    // *swallowed*, the deferred resolved cleanly, and the runner
+    // exited 0: the UI declared the run cancelled while the cloud
+    // job kept running, orphaning GPU spend with no visible error.
+    // The fix REJECTS the deferred so the runner's
+    // `installShutdownHandlers` `.catch()` writes the failure to
+    // stderr, surfacing the issue to the operator. The latch is
+    // still always settled (resolved or rejected), so shutdown
+    // doesn't hang waiting for a checkpoint that will never come.
+    await writeState(
+      { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+      cwd,
+    );
+    const sse = [
+      `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+        type: "training.started",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:01Z",
+      })}\n\n`,
+      `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+        type: "training.log",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:02Z",
+        step: 1,
+        loss: 0.5,
+      })}\n\n`,
+      `id: 3\nevent: checkpoint.saved\ndata: ${JSON.stringify({
+        type: "checkpoint.saved",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:03Z",
+        step: 10,
+      })}\n\n`,
+    ];
+    let cancelAttempts = 0;
+    const fetcher: typeof fetch = (async (
+      input: RequestInfo | URL,
+      init?: RequestInit,
+    ) => {
+      const url = typeof input === "string" ? input : input.toString();
+      const method = init?.method ?? "GET";
+      if (method === "POST" && url.includes("/v1/jobs?")) {
+        return new Response(JSON.stringify({ job: minimalJobRow }), {
+          status: 201,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+        return new Response(sseStream(sse), {
+          status: 200,
+          headers: { "content-type": "text/event-stream" },
+        });
+      }
+      if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+        cancelAttempts += 1;
+        // Simulate the cloud-api being unreachable mid-cancel.
+        throw new TypeError("fetch failed");
+      }
+      throw new Error(`unexpected fetch: ${method} ${url}`);
+    }) as typeof fetch;
+
+    // Capture the very-first armed early-stop promise so we can
+    // assert its settlement state below. `onLog` closes over
+    // `trainer` (it calls `requestTrainerEarlyStop(trainer, ...)`),
+    // which is safe because the callback only runs after
+    // `createTrainer` has returned and `trainer` is initialized.
+    let armedPromise: Promise<void> | null = null;
+    let armedResult: "resolved" | "rejected" | "pending" = "pending";
+    let armedError: unknown = null;
+    const trainer = createTrainer(
+      {
+        name: "run",
+        model: "m",
+        dataset: { type: "huggingface", name: "x" },
+        callbacks: {
+          onLog: () => {
+            // Arm exactly once and capture the returned promise.
+            // requestTrainerEarlyStop is idempotent across repeat
+            // calls, but we only need the FIRST armed deferred:
+            // the cancel-throw rejects exactly that promise.
+            if (armedPromise === null) {
+              armedPromise = requestTrainerEarlyStop(trainer, {
+                timeoutMs: 60_000,
+              });
+              armedPromise.then(
+                () => {
+                  armedResult = "resolved";
+                },
+                (err: unknown) => {
+                  armedResult = "rejected";
+                  armedError = err;
+                },
+              );
+            }
+          },
+        },
+      },
+      { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+    );
+    const original = globalThis.fetch;
+    globalThis.fetch = fetcher;
+    try {
+      await trainer.wait();
+      // Flush microtasks so the .then(resolve, reject) handler
+      // observes the settlement before we assert.
+      await new Promise((r) => setImmediate(r));
+    } finally {
+      globalThis.fetch = original;
+    }
+    // cancel() was attempted (and threw).
+    expect(cancelAttempts).toBe(1);
+    // The armed deferred REJECTED: the runner's `.catch()` would
+    // see this error and log it to stderr instead of silently
+    // exiting 0. Critically: it didn't hang on "pending"; the
+    // failure case still settles, just via reject not resolve.
+    expect(armedResult).toBe("rejected");
+    expect(armedError).toBeInstanceOf(TypeError);
+    expect((armedError as Error).message).toBe("fetch failed");
+  });
+
+  it("early-stop checkpoint branch labels run as `failed` even when cancel throws a falsy value (not `null` discriminant)", async () => {
+    // Regression: the cancel-failure branch used to be discriminated
+    // by `cancelError !== null`, but a rejection can legitimately
+    // carry `null` (e.g. `throw null`). In that case the captured
+    // `cancelError` would still read as `null` and the run would be
+    // silently labelled `"cancelled"` even though the cancel POST
+    // genuinely rejected, lying about cloud-side state that may
+    // still be running. Fix discriminates via a dedicated boolean
+    // flag and additionally wraps non-Error throws when rejecting
+    // the deferred so the SIGTERM handler's
+    // `.catch(err => err.message)` doesn't crash on a missing
+    // property.
+    await writeState(
+      { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+      cwd,
+    );
+    const sse = [
+      `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+        type: "training.started",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:01Z",
+      })}\n\n`,
+      `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+        type: "training.log",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:02Z",
+        step: 1,
+        loss: 0.5,
+      })}\n\n`,
+      `id: 3\nevent: checkpoint.saved\ndata: ${JSON.stringify({
+        type: "checkpoint.saved",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:03Z",
+        step: 10,
+      })}\n\n`,
+    ];
+    const fetcher: typeof fetch = (async (
+      input: RequestInfo | URL,
+      init?: RequestInit,
+    ) => {
+      const url = typeof input === "string" ? input : input.toString();
+      const method = init?.method ?? "GET";
+      if (method === "POST" && url.includes("/v1/jobs?")) {
+        return new Response(JSON.stringify({ job: minimalJobRow }), {
+          status: 201,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+        return new Response(sseStream(sse), {
+          status: 200,
+          headers: { "content-type": "text/event-stream" },
+        });
+      }
+      if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+        // Falsy non-Error throw: under the bug, the run would be
+        // labelled "cancelled" because `cancelError !== null` is
+        // false when the catch reassigned `cancelError = null`.
+        // eslint-disable-next-line no-throw-literal
+        throw null;
+      }
+      throw new Error(`unexpected fetch: ${method} ${url}`);
+    }) as typeof fetch;
+
+    let armedPromise: Promise<void> | null = null;
+    let armedResult: "resolved" | "rejected" | "pending" = "pending";
+    let armedError: unknown = null;
+    const trainer = createTrainer(
+      {
+        name: "run",
+        model: "m",
+        dataset: { type: "huggingface", name: "x" },
+        callbacks: {
+          onLog: () => {
+            if (armedPromise === null) {
+              armedPromise = requestTrainerEarlyStop(trainer, {
+                timeoutMs: 60_000,
+              });
+              armedPromise.then(
+                () => {
+                  armedResult = "resolved";
+                },
+                (err: unknown) => {
+                  armedResult = "rejected";
+                  armedError = err;
+                },
+              );
+            }
+          },
+        },
+      },
+      { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+    );
+    const original = globalThis.fetch;
+    globalThis.fetch = fetcher;
+    let result: Awaited<ReturnType<typeof trainer.wait>>;
+    try {
+      result = await trainer.wait();
+      await new Promise((r) => setImmediate(r));
+    } finally {
+      globalThis.fetch = original;
+    }
+    // The local job state reflects the cancel FAILURE, not a clean
+    // cancel. Under the bug this was `"cancelled"`.
+    expect(result.job.status).toBe("failed");
+    // The armed deferred rejected, and the rejection value is
+    // wrapped in a real Error so downstream `.catch(err => err.message)`
+    // chains don't crash on `null.message`.
+    expect(armedResult).toBe("rejected");
+    expect(armedError).toBeInstanceOf(Error);
+    expect((armedError as Error).message).toBe("null");
+  });
+
+  it("resolves the early-stop latch when the run hits a terminal event before the next checkpoint", async () => {
+    // Regression: previously `requestEarlyStop()`'s deferred was
+    // only resolved by (a) the checkpoint-triggered cancel branch
+    // or (b) the timeout fallback. If the run reached
+    // `training.completed` / `training.failed` *before* another
+    // checkpoint landed (a common case for short jobs or runs that
+    // had already saved their last checkpoint when SIGTERM arrived),
+    // the deferred stayed pending until the (default 5-min) timeout
+    // fired; the SIGTERM handler in `installShutdownHandlers`
+    // awaits that promise before exit, so shutdown was delayed up to
+    // `timeoutMs`. Both terminal branches now settle the latch
+    // explicitly so the signal path completes immediately when the
+    // job is already terminal.
+    await writeState(
+      { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+      cwd,
+    );
+    // started → log (arms early-stop) → completed; no checkpoint.saved
+    // in between, so the checkpoint-triggered resolution path is *not*
+    // exercised; only the new terminal-branch settlement is.
+    const sse = [
+      `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+        type: "training.started",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:01Z",
+      })}\n\n`,
+      `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+        type: "training.log",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:02Z",
+        step: 1,
+        loss: 0.5,
+      })}\n\n`,
+      `id: 3\nevent: training.completed\ndata: ${JSON.stringify({
+        type: "training.completed",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:03Z",
+        artifacts: [],
+      })}\n\n`,
+    ];
+
+    let cancelCalls = 0;
+    const fetcher: typeof fetch = (async (
+      input: RequestInfo | URL,
+      init?: RequestInit,
+    ) => {
+      const url = typeof input === "string" ? input : input.toString();
+      const method = init?.method ?? "GET";
+      if (method === "POST" && url.includes("/v1/jobs?")) {
+        return new Response(JSON.stringify({ job: minimalJobRow }), {
+          status: 201,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+        return new Response(sseStream(sse), {
+          status: 200,
+          headers: { "content-type": "text/event-stream" },
+        });
+      }
+      if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+        cancelCalls += 1;
+        return new Response(JSON.stringify({ ok: true }), {
+          status: 200,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      throw new Error(`unexpected fetch: ${method} ${url}`);
+    }) as typeof fetch;
+
+    let stopResolved = false;
+    const trainer = createTrainer(
+      {
+        name: "run",
+        model: "m",
+        dataset: { type: "huggingface", name: "x" },
+        callbacks: {
+          onLog: () => {
+            // Long timeout: if the fix regresses, this test would
+            // hang for ~60s before the timer fires. With the
+            // terminal-branch settlement, the deferred resolves the
+            // moment `training.completed` lands.
+            void requestTrainerEarlyStop(trainer, {
+              timeoutMs: 60_000,
+            }).then(() => {
+              stopResolved = true;
+            });
+          },
+        },
+      },
+      { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+    );
+
+    const original = globalThis.fetch;
+    globalThis.fetch = fetcher;
+    try {
+      const result = await trainer.wait();
+      // Flush microtasks so the .then() chain off `requestEarlyStop`
+      // observes the resolution before we assert.
+      await new Promise((r) => setImmediate(r));
+      expect(result.job.status).toBe("completed");
+      // No cancel POST was issued: the terminal branch just
+      // releases the latch; it doesn't cancel a run that already
+      // completed on its own.
+      expect(cancelCalls).toBe(0);
+      // The latch resolved via the terminal handler, not via the
+      // 60-second timeout. (If this regressed, `stopResolved`
+      // would still be false here and the assertion would fail.)
+      expect(stopResolved).toBe(true);
+    } finally {
+      globalThis.fetch = original;
+    }
+  });
+
+  it("settles the early-stop latch even when the user's onCompleted callback throws", async () => {
+    // Regression: previously `settleEarlyStopLatch()` was called
+    // *after* awaiting `callbacks.onCompleted` / `onFailed`. A
+    // thrown user callback propagated out of `dispatch()` before
+    // the settle ran, leaving `earlyStopDeferred` pending; the
+    // SIGTERM handler in `installShutdownHandlers` would block on
+    // that promise until the (default 5-min) timeout fired,
+    // delaying shutdown for a user-code bug. Wrapping in
+    // `try/finally` ensures the latch is released regardless,
+    // while preserving the throw's propagation through `wait()` so
+    // callers still see the original error.
+    await writeState(
+      { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+      cwd,
+    );
+    const sse = [
+      `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+        type: "training.started",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:01Z",
+      })}\n\n`,
+      `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+        type: "training.log",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:02Z",
+        step: 1,
+        loss: 0.5,
+      })}\n\n`,
+      `id: 3\nevent: training.completed\ndata: ${JSON.stringify({
+        type: "training.completed",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:03Z",
+        artifacts: [],
+      })}\n\n`,
+    ];
+
+    const fetcher: typeof fetch = (async (
+      input: RequestInfo | URL,
+      init?: RequestInit,
+    ) => {
+      const url = typeof input === "string" ? input : input.toString();
+      const method = init?.method ?? "GET";
+      if (method === "POST" && url.includes("/v1/jobs?")) {
+        return new Response(JSON.stringify({ job: minimalJobRow }), {
+          status: 201,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+        return new Response(sseStream(sse), {
+          status: 200,
+          headers: { "content-type": "text/event-stream" },
+        });
+      }
+      throw new Error(`unexpected fetch: ${method} ${url}`);
+    }) as typeof fetch;
+
+    let stopResolved = false;
+    let stopRejected = false;
+    const trainer = createTrainer(
+      {
+        name: "run",
+        model: "m",
+        dataset: { type: "huggingface", name: "x" },
+        callbacks: {
+          onLog: () => {
+            // Arm early-stop with a long timeout; if the latch
+            // isn't released by `finally`, this would hang for the
+            // full 60 seconds.
+            void requestTrainerEarlyStop(trainer, {
+              timeoutMs: 60_000,
+            }).then(
+              () => {
+                stopResolved = true;
+              },
+              () => {
+                stopRejected = true;
+              },
+            );
+          },
+          onCompleted: () => {
+            throw new Error("user callback boom");
+          },
+        },
+      },
+      {
+        baseUrl: "http://mock",
+        credentials: creds,
+        cwd,
+        reconnectDelayMs: 1,
+        // `wait()` catches dispatch throws and routes them through
+        // its reconnect loop; with the default unbounded retry the
+        // user-callback throw above would loop forever and the test
+        // would just time out. Cap retries at 0 so the first thrown
+        // dispatch surfaces as a `wait()` rejection; that lets us
+        // observe the *latch* settlement (the actual contract under
+        // test) cleanly.
+        maxReconnectAttempts: 0,
+      },
+    );
+
+    const original = globalThis.fetch;
+    globalThis.fetch = fetcher;
+    try {
+      // The user-callback throw is wrapped by `handleFailure` after
+      // `maxReconnectAttempts: 0` exhausts; the original error is
+      // preserved as `cause`. We just need wait() to settle so the
+      // test doesn't hang. The *body* of the assertion is the
+      // latch state below.
+      await expect(trainer.wait()).rejects.toThrow();
+      // The latch must have settled (via `finally`) BEFORE wait()
+      // rejected. Without the `try/finally` around `onCompleted`
+      // the latch would still be armed → `stopResolved` stays
+      // false → the test fails (rather than timing out, since
+      // `maxReconnectAttempts: 0` already unblocks wait()).
+      await new Promise((r) => setImmediate(r));
+      expect(stopResolved).toBe(true);
+      expect(stopRejected).toBe(false);
+    } finally {
+      globalThis.fetch = original;
+    }
+  });
+
+  it("falls back to immediate cancel when no checkpoint arrives within timeoutMs", async () => {
+    await writeState(
+      { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+      cwd,
+    );
+    // No checkpoint and no terminal event in the stream, only
+    // training.started. We hand-roll a stream that never ends so
+    // the timeout fallback is what actually triggers cancel.
+    let streamController: ReadableStreamDefaultController<Uint8Array> | null =
+      null;
+    const stallingStream = new ReadableStream<Uint8Array>({
+      start(controller) {
+        streamController = controller;
+        const enc = new TextEncoder();
+        controller.enqueue(
+          enc.encode(
+            `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+              type: "training.started",
+              jobId: "j-stop",
+              timestamp: "2026-01-01T00:00:01Z",
+            })}\n\n`,
+          ),
+        );
+      },
+    });
+
+    let cancelCalls = 0;
+    const fetcher: typeof fetch = (async (
+      input: RequestInfo | URL,
+      init?: RequestInit,
+    ) => {
+      const url = typeof input === "string" ? input : input.toString();
+      const method = init?.method ?? "GET";
+      if (method === "POST" && url.includes("/v1/jobs?")) {
+        return new Response(JSON.stringify({ job: minimalJobRow }), {
+          status: 201,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+        return new Response(stallingStream, {
+          status: 200,
+          headers: { "content-type": "text/event-stream" },
+        });
+      }
+      if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+        cancelCalls += 1;
+        // Closing the stream now mimics cloud-api's response to a cancel:
+        // the SSE channel ends and wait() exits its loop.
+        streamController?.close();
+        return new Response(JSON.stringify({ ok: true }), {
+          status: 200,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      throw new Error(`unexpected fetch: ${method} ${url}`);
+    }) as typeof fetch;
+
+    const trainer = createTrainer(
+      {
+        name: "run",
+        model: "m",
+        dataset: { type: "huggingface", name: "x" },
+      },
+      { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+    );
+
+    const original = globalThis.fetch;
+    globalThis.fetch = fetcher;
+    try {
+      await trainer.start();
+      // Tiny timeout so the test doesn't actually wait 5 minutes.
+      await requestTrainerEarlyStop(trainer, { timeoutMs: 5 });
+      expect(cancelCalls).toBe(1);
+      // Regression: the timeout fallback used to leave
+      // `earlyStopRequested = true` and `startedJob.status =
+      // "running"`. A subsequent `requestEarlyStop()` call would
+      // then re-arm a fresh timer and re-issue cancel even though
+      // the early-stop already fired. With the latch reset and
+      // local terminal-status update mirroring the
+      // checkpoint-triggered branch, the second call hits the
+      // TERMINAL_STATUSES short-circuit and is a true no-op.
+      await requestTrainerEarlyStop(trainer, { timeoutMs: 5 });
+      expect(cancelCalls).toBe(1);
+    } finally {
+      globalThis.fetch = original;
+    }
+  });
+
+  it("timeout fallback rejects the deferred when cancel() throws (visible to shutdown handler)", async () => {
+    // Companion to the checkpoint-branch reject test: when no
+    // checkpoint arrives within `timeoutMs`, the timeout fallback
+    // does its own `trainer.cancel()`. Old code swallowed cancel
+    // errors and ALWAYS resolved the deferred, the same
+    // false-success failure mode the checkpoint branch had: the
+    // local runner exits cleanly while the cloud job keeps
+    // consuming GPU budget. The fix mirrors the checkpoint reject
+    // path: capture the error and reject the deferred so the
+    // runner's `.catch()` writes it to stderr.
+    await writeState(
+      { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+      cwd,
+    );
+    let streamController: ReadableStreamDefaultController<Uint8Array> | null =
+      null;
+    const stallingStream = new ReadableStream<Uint8Array>({
+      start(controller) {
+        streamController = controller;
+        const enc = new TextEncoder();
+        controller.enqueue(
+          enc.encode(
+            `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+              type: "training.started",
+              jobId: "j-stop",
+              timestamp: "2026-01-01T00:00:01Z",
+            })}\n\n`,
+          ),
+        );
+      },
+    });
+
+    let cancelCalls = 0;
+    const fetcher: typeof fetch = (async (
+      input: RequestInfo | URL,
+      init?: RequestInit,
+    ) => {
+      const url = typeof input === "string" ? input : input.toString();
+      const method = init?.method ?? "GET";
+      if (method === "POST" && url.includes("/v1/jobs?")) {
+        return new Response(JSON.stringify({ job: minimalJobRow }), {
+          status: 201,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+        return new Response(stallingStream, {
+          status: 200,
+          headers: { "content-type": "text/event-stream" },
+        });
+      }
+      if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+        cancelCalls += 1;
+        // Close the stream so wait() exits its loop even though we
+        // throw on the cancel POST itself.
+        streamController?.close();
+        // Simulate cloud-api unreachable mid-cancel (transport).
+        throw new TypeError("fetch failed");
+      }
+      throw new Error(`unexpected fetch: ${method} ${url}`);
+    }) as typeof fetch;
+
+    const trainer = createTrainer(
+      {
+        name: "run",
+        model: "m",
+        dataset: { type: "huggingface", name: "x" },
+      },
+      { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+    );
+    const original = globalThis.fetch;
+    globalThis.fetch = fetcher;
+    try {
+      await trainer.start();
+      // Tiny timeout so the timeout fallback fires fast (no
+      // checkpoint will land; stream only carries
+      // training.started). The returned promise should REJECT
+      // because the cancel POST throws.
+      await expect(
+        requestTrainerEarlyStop(trainer, { timeoutMs: 5 }),
+      ).rejects.toThrow(/fetch failed/);
+      expect(cancelCalls).toBe(1);
+    } finally {
+      globalThis.fetch = original;
+    }
+  });
+
+  it("is a no-op before start() and resolves immediately", async () => {
+    const trainer = createTrainer(
+      {
+        name: "run",
+        model: "m",
+        dataset: { type: "huggingface", name: "x" },
+      },
+      { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+    );
+    // Should resolve without contacting cloud-api at all.
+    await requestTrainerEarlyStop(trainer, { timeoutMs: 1 });
+  });
+
+  it("waits out an in-flight start() so a SIGTERM during create-job can still cancel the new job", async () => {
+    // Codex P1 regression: `start()` sets `scope` *before* awaiting
+    // `client.createJob`, so there's a real window where the cloud
+    // job is being created but `startedJob` is still null. If a
+    // runner-side SIGTERM lands in that window, an immediate
+    // "no-op" early-stop would let `installShutdownHandlers` exit
+    // the process, leaving the just-created cloud job running
+    // with no cancel POST. The fix is to await the in-flight
+    // `start()` promise inside `requestEarlyStop()` so the cancel
+    // path sees a definite job id (or a definite start failure).
+    await writeState(
+      { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+      cwd,
+    );
+    let cancelCalls = 0;
+    let releaseCreateJob!: () => void;
+    const createJobReleased = new Promise<void>((resolve) => {
+      releaseCreateJob = resolve;
+    });
+    const fetcher: typeof fetch = (async (
+      input: RequestInfo | URL,
+      init?: RequestInit,
+    ) => {
+      const url = typeof input === "string" ? input : input.toString();
+      const method = init?.method ?? "GET";
+      if (method === "POST" && url.includes("/v1/jobs?")) {
+        // Hold createJob open so we can fire `requestEarlyStop`
+        // mid-flight. Once the test releases the gate, return a
+        // valid job: that establishes the post-create state
+        // requestEarlyStop should then act on (cancel POST).
+        await createJobReleased;
+        return new Response(JSON.stringify({ job: minimalJobRow }), {
+          status: 201,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+        cancelCalls += 1;
+        return new Response(JSON.stringify({ ok: true }), {
+          status: 200,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      throw new Error(`unexpected fetch: ${method} ${url}`);
+    }) as typeof fetch;
+
+    const trainer = createTrainer(
+      {
+        name: "run",
+        model: "m",
+        dataset: { type: "huggingface", name: "x" },
+      },
+      { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+    );
+
+    const original = globalThis.fetch;
+    globalThis.fetch = fetcher;
+    try {
+      // Fire start() but DON'T await; its createJob is gated.
+      const startPromise = trainer.start();
+      // Yield once so the start microtasks queue up to the
+      // `await client.createJob`.
+      await new Promise((r) => setImmediate(r));
+      // requestEarlyStop fires while start() is mid-flight. With
+      // the fix it awaits start() rather than no-op'ing immediately.
+      // Tiny `timeoutMs` so once `start()` resolves the latch's
+      // timeout-fallback fires the cancel POST quickly. There's no
+      // SSE stream in this test, so the checkpoint-driven path
+      // never arrives. We're testing the "stop awaited start()" leg
+      // of the contract, not the checkpoint plumbing.
+      const stopPromise = requestTrainerEarlyStop(trainer, {
+        timeoutMs: 50,
+      });
+      // Sanity: stop hasn't resolved yet; it's blocked on
+      // start() which is blocked on createJob.
+      let stopSettled = false;
+      void stopPromise.then(() => {
+        stopSettled = true;
+      });
+      await new Promise((r) => setImmediate(r));
+      expect(stopSettled).toBe(false);
+      // Release createJob → start() resolves → stop() proceeds.
+      releaseCreateJob();
+      await startPromise;
+      await stopPromise;
+      // The deciding behaviour: cancel POST was issued because the
+      // stop awaited start() and saw a real job id. Without the
+      // in-flight gate, stop would have returned immediately on
+      // the null `startedJob`, no cancel POST, cloud job orphaned.
+      expect(cancelCalls).toBe(1);
+    } finally {
+      globalThis.fetch = original;
+    }
+  });
+
+  it("replaceTrainerCallbacks (internal HMR brand) swaps the dispatched callbacks on the next event", async () => {
+    await writeState(
+      { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+      cwd,
+    );
+    const sse = [
+      `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+        type: "training.started",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:01Z",
+      })}\n\n`,
+      `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+        type: "training.log",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:02Z",
+        step: 1,
+        loss: 1,
+      })}\n\n`,
+      `id: 3\nevent: training.log\ndata: ${JSON.stringify({
+        type: "training.log",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:03Z",
+        step: 2,
+        loss: 0.5,
+      })}\n\n`,
+      `id: 4\nevent: training.completed\ndata: ${JSON.stringify({
+        type: "training.completed",
+        jobId: "j-stop",
+        timestamp: "2026-01-01T00:00:04Z",
+      })}\n\n`,
+    ];
+    const fetcher: typeof fetch = (async (
+      input: RequestInfo | URL,
+      init?: RequestInit,
+    ) => {
+      const url = typeof input === "string" ? input : input.toString();
+      const method = init?.method ?? "GET";
+      if (method === "POST" && url.includes("/v1/jobs?")) {
+        return new Response(JSON.stringify({ job: minimalJobRow }), {
+          status: 201,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+        return new Response(sseStream(sse), {
+          status: 200,
+          headers: { "content-type": "text/event-stream" },
+        });
+      }
+      throw new Error(`unexpected fetch: ${method} ${url}`);
+    }) as typeof fetch;
+
+    const calls: string[] = [];
+    const trainer = createTrainer(
+      {
+        name: "run",
+        model: "m",
+        dataset: { type: "huggingface", name: "x" },
+        callbacks: {
+          onLog: ({ step }) => {
+            calls.push(`v1:onLog(${step})`);
+            // After the first onLog call, swap to v2 callbacks via the
+            // internal `Symbol.for("arkor.trainer.replaceCallbacks")`
+            // brand (the same brand `arkor dev`'s SIGUSR2 handler
+            // uses). The next event must dispatch via the new object.
+            if (step === 1) {
+              replaceTrainerCallbacks(trainer, {
+                onLog: ({ step: s }) => void calls.push(`v2:onLog(${s})`),
+              });
+            }
+          },
+        },
+      },
+      { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+    );
+    const original = globalThis.fetch;
+    globalThis.fetch = fetcher;
+    try {
+      await trainer.wait();
+    } finally {
+      globalThis.fetch = original;
+    }
+    expect(calls).toEqual(["v1:onLog(1)", "v2:onLog(2)"]);
+  });
+
+  it("is idempotent: repeated calls share the same in-flight promise", async () => {
+    await writeState(
+      { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+      cwd,
+    );
+    let cancelCalls = 0;
+    const fetcher: typeof fetch = (async (
+      input: RequestInfo | URL,
+      init?: RequestInit,
+    ) => {
+      const url = typeof input === "string" ? input : input.toString();
+      const method = init?.method ?? "GET";
+      if (method === "POST" && url.includes("/v1/jobs?")) {
+        return new Response(JSON.stringify({ job: minimalJobRow }), {
+          status: 201,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+        cancelCalls += 1;
+        return new Response(JSON.stringify({ ok: true }), {
+          status: 200,
+          headers: { "content-type": "application/json" },
+        });
+      }
+      throw new Error(`unexpected fetch: ${method} ${url}`);
+    }) as typeof fetch;
+
+    const trainer = createTrainer(
+      {
+        name: "run",
+        model: "m",
+        dataset: { type: "huggingface", name: "x" },
+      },
+      { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+    );
+    const original = globalThis.fetch;
+    globalThis.fetch = fetcher;
+    try {
+      await trainer.start();
+      const a = requestTrainerEarlyStop(trainer, { timeoutMs: 5 });
+      const b = requestTrainerEarlyStop(trainer, { timeoutMs: 5 });
+      await Promise.all([a, b]);
+      // The fallback timer fires once, so cancel is called once even though
+      // the early-stop brand was invoked twice.
+      expect(cancelCalls).toBe(1);
+    } finally {
+      globalThis.fetch = original;
+    }
+  });
+});
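
The in-flight gate that the "mid-flight" test above exercises can be sketched in isolation. This is a minimal illustration under stated assumptions, not the SDK's implementation: `makeStarter`, `Job`, and `pending` are hypothetical names, and `createJob` stands in for the real cloud call.

```typescript
// Hypothetical sketch: share one in-flight promise across concurrent callers.
type Job = { id: string };

function makeStarter(createJob: () => Promise<Job>) {
  let started: Job | null = null;
  let inFlight: Promise<Job> | null = null;

  async function start(): Promise<Job> {
    if (started) return started;
    // A concurrent caller joins the pending attempt instead of
    // issuing a second createJob call.
    if (inFlight) return inFlight;
    const attempt = (async () => {
      const job = await createJob();
      started = job;
      return job;
    })();
    inFlight = attempt;
    try {
      return await attempt;
    } finally {
      // Cleared on success and on failure so a failed start can be retried.
      inFlight = null;
    }
  }

  // Lets a stop path detect the "create still pending" window.
  return { start, pending: () => inFlight !== null };
}
```

A stop path built on this sketch would await the pending attempt first and only then decide whether there is a job to cancel, which is the behaviour the test asserts through `cancelCalls`.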
diff --git a/packages/arkor/src/core/trainer.ts b/packages/arkor/src/core/trainer.ts
index 7c9f9662..c6f99870 100644
--- a/packages/arkor/src/core/trainer.ts
+++ b/packages/arkor/src/core/trainer.ts
@@ -6,19 +6,28 @@ import {
   type Credentials,
 } from "./credentials";
 import { ensureProjectState } from "./projectState";
+import {
+  attachTrainerCallbackReplacer,
+  attachTrainerEarlyStopper,
+  attachTrainerInspection,
+  type RequestEarlyStopOptions,
+} from "./trainerInspection";
 import type {
   CheckpointContext,
   InferArgs,
   JobConfig,
   Trainer,
+  TrainerCallbacks,
   TrainerInput,
   TrainingJob,
   TrainingLogContext,
   TrainingResult,
 } from "./types";
 
+const TERMINAL_STATUSES = new Set(["completed", "failed", "cancelled"]);
+
 /**
- * Internal runtime context. Not part of the public API surface — exposed only
+ * Internal runtime context. Not part of the public API surface; exposed only
  * for tests and advanced power-user scenarios that need to inject a mock
  * `fetch` or override the working directory.
  *
@@ -111,7 +120,7 @@ function buildJobConfig(input: TrainerInput): JobConfig {
 /**
  * Build a `Trainer` bound to the user's configuration.
  *
- * Public signature: `createTrainer(input)` — runtime options like
+ * Public signature: `createTrainer(input)`. Runtime options like
  * `baseUrl` / `credentials` / `cwd` come from the environment and `.arkor/`
  * state, never from user code. The optional second argument is reserved for
  * tests and advanced overrides.
@@ -144,6 +153,63 @@ export function createTrainer(
   let startedJob: TrainingJob | null = null;
   let scope: { orgSlug: string; projectSlug: string } | null = null;
   let clientPromise: Promise<CloudApiClient> | null = null;
+  // In-flight `start()` promise: non-null from the first `start()`
+  // call until that attempt settles (resolve or reject). Lets
+  // `requestEarlyStop()` detect the "scope set but startedJob still
+  // null" window (`scope` is needed by `client.createJob` so we set
+  // it before the await) and wait out the create-job POST so a
+  // SIGTERM landing in that window can still drive a clean cancel
+  // once the job id materialises. Without this gate the early-stop
+  // path would no-op, the runner would `process.exit(0)`, and the
+  // newly created cloud job would orphan with no cancel POST.
+  let startInFlight: Promise<TrainingJob> | null = null;
+
+  // Mutable callbacks slot. Each `dispatch()` invocation reads this
+  // fresh, so the rotation triggered by the
+  // `Symbol.for("arkor.trainer.replaceCallbacks")` brand
+  // (`replaceTrainerCallbacks` in `core/trainerInspection.ts`) takes
+  // effect on the next event. Events already mid-await keep their
+  // old reference until they resolve, which matches the "replace,
+  // don't interrupt" contract. Public `Trainer` deliberately doesn't
+  // expose this; it's a dev-only HMR primitive driven by the
+  // SIGUSR2 path in `core/runnerSignals.ts`.
+  let currentCallbacks: Partial<TrainerCallbacks> = input.callbacks ?? {};
+
+  // Early-stop state. `requestEarlyStop()` arms the latch; the next
+  // `checkpoint.saved` dispatch (or the timeout, whichever fires first)
+  // calls cancel() and resolves the deferred. Idempotent across repeat
+  // calls (they share the same deferred).
+  const DEFAULT_EARLY_STOP_TIMEOUT_MS = 5 * 60 * 1000;
+  let earlyStopDeferred: {
+    promise: Promise<void>;
+    resolve: () => void;
+    reject: (err: unknown) => void;
+    timer: NodeJS.Timeout | null;
+  } | null = null;
+  let earlyStopRequested = false;
+
+  /**
+   * Drop the early-stop latch (clear timer + resolve deferred + reset
+   * the request flag). Called from any path that means "wait()'s
+   * cancel-after-checkpoint promise is no longer waiting on anything"
+   * (the checkpoint-driven cancel branch, the terminal `completed`
+   * / `failed` branches, and the up-front guard in
+   * `requestEarlyStop()` when the job is already terminal). Without
+   * this called from terminal branches, a `requestEarlyStop()` armed
+   * mid-run that races a `training.completed` / `training.failed`
+   * before the next `checkpoint.saved` would leave the deferred
+   * pending until the (default 5-min) timeout fires; the SIGTERM
+   * handler in `installShutdownHandlers` would block on that promise
+   * and delay shutdown for up to `timeoutMs`.
+   */
+  function settleEarlyStopLatch(): void {
+    if (earlyStopDeferred) {
+      if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer);
+      earlyStopDeferred.resolve();
+      earlyStopDeferred = null;
+    }
+    earlyStopRequested = false;
+  }
 
   async function getClient(): Promise<CloudApiClient> {
     if (!clientPromise) {
@@ -168,7 +234,7 @@ export function createTrainer(
    * many SDK clients retry at once.
    *
    * The final value is clamped at `maxReconnectDelayMs` because jitter
-   * sits *outside* the exponential clamp — without the outer clamp, a
+   * sits *outside* the exponential clamp; without the outer clamp, a
    * long outage where `exp` already hit the cap could wait up to 1.25 ×
    * the documented cap when `Math.random()` lands near 1.
    */
@@ -204,7 +270,10 @@ export function createTrainer(
       throw new Error("Trainer is in an inconsistent state");
     }
     const client = await getClient();
-    const callbacks = input.callbacks ?? {};
+    // Read once per dispatch so a `replaceCallbacks` between events takes
+    // effect on the next dispatch, but doesn't change identity inside a
+    // single in-flight handler.
+    const callbacks = currentCallbacks;
 
     switch (event.type) {
       case "training.started": {
@@ -255,7 +324,139 @@ export function createTrainer(
           infer,
           artifacts: event.artifacts,
         };
-        await callbacks.onCheckpoint?.(ctx);
+        // Capture (don't propagate yet) any throw from the user's
+        // `onCheckpoint`. The early-stop branch below MUST run
+        // even on a callback throw; without this wrap a thrown
+        // `onCheckpoint` would skip the cancel + latch settlement,
+        // leaving the SIGTERM handler waiting on the deferred
+        // until the (default 5-min) timeout fires. Surface the
+        // original throw via re-throw at the end so `wait()`'s
+        // reconnect / failure path keeps its existing semantics.
+        // Discriminant for the user-callback-threw branch. Tracked as
+        // a separate boolean (not `onCheckpointError !== null`) because
+        // user code can legitimately `throw null`, and a truthiness
+        // check would also miss `throw 0` / `throw ""`. Either test
+        // would make such a throw indistinguishable from the no-error
+        // path, and the re-throw at the end would silently swallow it
+        // (callback's "I want to stop" signal gets dropped on the floor).
+        let onCheckpointError: unknown = null;
+        let onCheckpointThrew = false;
+        try {
+          await callbacks.onCheckpoint?.(ctx);
+        } catch (err) {
+          onCheckpointError = err;
+          onCheckpointThrew = true;
+        }
+        // Early-stop latch: a checkpoint just landed, so the in-flight work
+        // is durable. Cancel the cloud job and end `wait()` cleanly.
+        if (earlyStopRequested && earlyStopDeferred) {
+          // Capture the cancel error (if any) but DON'T swallow
+          // silently; propagate via the deferred's reject path so
+          // the runner's `installShutdownHandlers` `.catch()` writes
+          // the failure to stderr. The previous swallow let a
+          // transient cloud-api failure during early-stop appear
+          // as a clean cancel: the local runner exited 0, the UI
+          // declared the run cancelled, but the cloud job kept
+          // running (continued GPU spend). Keeping the error
+          // visible to the shutdown handler lets the operator see
+          // it and intervene.
+          //
+          // We still mark `startedJob.status` terminal locally
+          // either way: from the runner's perspective the run is
+          // over, and a subsequent `requestEarlyStop()` call must
+          // hit the `TERMINAL_STATUSES.has(...)` short-circuit
+          // (re-arming a fresh latch on a dead run would hang
+          // shutdown).
+          let cancelError: unknown = null;
+          // Discriminant for the cancel-failure branch. Tracked as a
+          // separate boolean (not `cancelError !== null`) because
+          // `cancel()` can reject with any value, including `null` /
+          // `0` / `""`; a null or truthiness check on `cancelError`
+          // could then mistake a failure for the no-error path and
+          // the run would be silently labelled `"cancelled"` even when
+          // the cancel POST genuinely rejected.
+          let cancelFailed = false;
+          try {
+            await trainer.cancel();
+          } catch (err) {
+            cancelError = err;
+            cancelFailed = true;
+          }
+          // Reflect the cancellation locally so `wait()`'s resolved
+          // `TrainingResult.job.status` is a terminal status (per the
+          // documented contract). Without this update the result would
+          // surface as `status: "running"`, and a subsequent
+          // `requestEarlyStop` would not see the
+          // `TERMINAL_STATUSES.has(...)` short-circuit it relies on.
+          //
+          // Status is `"failed"` when the cancel POST itself threw
+          // (cloud-api transient failure mid-cancel): labelling
+          // such runs `"cancelled"` would lie about the cloud-side
+          // state, which may still be running. `"failed"` is
+          // terminal too, so the latch / TERMINAL_STATUSES short-
+          // circuit still works, but `wait()`'s caller can
+          // distinguish "we cancelled cleanly" from "we tried but
+          // the cancel may not have landed". The original cancel
+          // error is also rejected through the deferred below for
+          // the SIGTERM handler's `.catch()`.
+          startedJob = {
+            ...startedJob,
+            status: cancelFailed ? "failed" : "cancelled",
+            ...(cancelFailed && {
+              error: `Early-stop cancel failed: ${
+                cancelError instanceof Error
+                  ? cancelError.message
+                  : String(cancelError)
+              }`,
+            }),
+            completedAt: event.timestamp,
+          };
+          if (cancelFailed) {
+            // Reject (not resolve) the latch. Mirrors the success
+            // path's bookkeeping (clear timer, null out shared
+            // slot, drop the request flag) so a follow-up
+            // `requestEarlyStop()` won't piggyback on the rejected
+            // promise.
+            if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer);
+            // Wrap a non-Error rejection so the deferred consumer
+            // always receives an Error instance. `throw null`
+            // would otherwise reject the deferred with `null`, and the
+            // SIGTERM handler's `.catch(err => ...err.message)` would
+            // crash reading `message` off it.
+            earlyStopDeferred.reject(
+              cancelError instanceof Error
+                ? cancelError
+                : new Error(String(cancelError)),
+            );
+            earlyStopDeferred = null;
+            earlyStopRequested = false;
+          } else {
+            settleEarlyStopLatch();
+          }
+          // Return the *checkpoint's* artifacts (the ones the user
+          // just saved): that's the work HMR went out of its way
+          // to preserve before issuing cancel(). The previous
+          // `terminalResult?.artifacts ?? []` always resolved to
+          // `[]` because `wait()` calls `dispatch(parsed, null)` so
+          // `terminalResult` is never populated. Effect: an
+          // HMR-driven early-stop resolved `wait()` with empty
+          // `artifacts` even though the checkpoint event carried
+          // the very artifacts the early-stop existed to keep.
+          // Surface the user's `onCheckpoint` throw (if any) so
+          // `wait()`'s reconnect / failure path keeps the same
+          // semantics it had before the wrap: the checkpoint
+          // workload is preserved, but the user still sees their
+          // callback error.
+          if (onCheckpointThrew) throw onCheckpointError;
+          return {
+            terminal: true,
+            artifacts: (event.artifacts ?? []) as unknown[],
+          };
+        }
+        // Same re-throw on the non-early-stop branch: keep
+        // `wait()`'s reconnect loop seeing the user's original
+        // callback error so reconnection counters work as before.
+        if (onCheckpointThrew) throw onCheckpointError;
         return { terminal: false, artifacts: terminalResult?.artifacts ?? [] };
       }
       case "training.completed": {
@@ -265,7 +466,19 @@ export function createTrainer(
           completedAt: event.timestamp,
         };
         const artifacts = (event.artifacts ?? []) as unknown[];
-        await callbacks.onCompleted?.({ job: startedJob, artifacts });
+        // `try/finally` so the latch settles even when the user's
+        // `onCompleted` callback throws: otherwise a thrown
+        // callback would leave `earlyStopDeferred` pending and the
+        // SIGTERM handler awaiting `requestEarlyStop()` would block
+        // until the timeout (default 5 min). The throw still
+        // propagates through `dispatch()` → `wait()` so callers see
+        // the original error; we just don't strand the shutdown
+        // path along with it.
+        try {
+          await callbacks.onCompleted?.({ job: startedJob, artifacts });
+        } finally {
+          settleEarlyStopLatch();
+        }
         return { terminal: true, artifacts };
       }
       case "training.failed": {
@@ -275,7 +488,14 @@ export function createTrainer(
           error: event.error,
           completedAt: event.timestamp,
         };
-        await callbacks.onFailed?.({ job: startedJob, error: event.error });
+        // Symmetric to the `completed` branch above: terminal
+        // status settles the latch even when the run failed *and*
+        // the user's `onFailed` callback itself throws.
+        try {
+          await callbacks.onFailed?.({ job: startedJob, error: event.error });
+        } finally {
+          settleEarlyStopLatch();
+        }
         return { terminal: true, artifacts: [] };
       }
     }
@@ -286,18 +506,46 @@ export function createTrainer(
 
     async start() {
       if (startedJob) return { jobId: startedJob.id };
-      const client = await getClient();
-      const state = await resolveProjectState(client);
-      scope = { orgSlug: state.orgSlug, projectSlug: state.projectSlug };
-
-      const { job } = await client.createJob({
-        orgSlug: state.orgSlug,
-        projectSlug: state.projectSlug,
-        name: input.name,
-        config,
-      });
-      startedJob = job;
-      return { jobId: job.id };
+      // Already-pending start: reuse the in-flight promise so a
+      // concurrent caller (notably `requestEarlyStop` awaiting it
+      // to close the SIGTERM-during-create-job race) doesn't issue
+      // a second `client.createJob` POST. Awaiting the shared promise
+      // hands every caller the same job once the POST resolves.
+      if (startInFlight) {
+        const job = await startInFlight;
+        return { jobId: job.id };
+      }
+      // Track the pending creation so `requestEarlyStop()` can
+      // detect the "started but not yet recorded" window and wait
+      // out the `client.createJob` POST. We set `scope` *before*
+      // the await (it's needed by the await itself), so a SIGTERM
+      // landing during the await would otherwise see
+      // `!startedJob && scope` and exit immediately, leaving the
+      // newly created cloud job uncancelled.
+      const startPromise = (async () => {
+        const client = await getClient();
+        const state = await resolveProjectState(client);
+        scope = { orgSlug: state.orgSlug, projectSlug: state.projectSlug };
+        const { job } = await client.createJob({
+          orgSlug: state.orgSlug,
+          projectSlug: state.projectSlug,
+          name: input.name,
+          config,
+        });
+        startedJob = job;
+        return job;
+      })();
+      startInFlight = startPromise;
+      try {
+        const job = await startPromise;
+        return { jobId: job.id };
+      } finally {
+        // Clear regardless of resolve/reject so a failed start can
+        // be retried (the caller decides), and a successful one
+        // doesn't pin a stale promise on the trainer for the rest
+        // of its lifetime.
+        startInFlight = null;
+      }
     },
 
     async wait(): Promise<TrainingResult> {
@@ -347,7 +595,7 @@ export function createTrainer(
         try {
           for await (const sse of iterateEvents(response)) {
             // Any frame from the server (including pings) means we're
-            // connected and making progress — reset the failure counter
+            // connected and making progress; reset the failure counter
             // so subsequent transient blips get the full retry budget.
             receivedAny = true;
             attempt = 0;
@@ -378,7 +626,7 @@ export function createTrainer(
         if (terminal) break;
 
         if (receivedAny) {
-          // Stream had real activity then closed cleanly. Not a failure —
+          // Stream had real activity then closed cleanly. Not a failure;
           // reconnect with Last-Event-ID at the base delay (no exponential
           // backoff, no counter increment).
           await delay(initialReconnectDelayMs, abortSignal);
@@ -404,5 +652,153 @@ export function createTrainer(
     },
   };
 
+  /**
+   * Internal "stop after next checkpoint" entry point. Hidden behind a
+   * `Symbol.for` brand so the runner subprocess's SIGTERM handler (in
+   * `runnerSignals.ts`) can drive a graceful early-stop without us
+   * exposing the operation on the public `Trainer` interface. User code
+   * that wants the same semantics should compose `abortSignal` +
+   * `cancel()` per `docs/cookbook/early-stopping.mdx`.
+   */
+  async function requestEarlyStop(
+    opts: RequestEarlyStopOptions = {},
+  ): Promise<void> {
+    // SIGTERM-during-create-job race: a runner-side SIGTERM can land
+    // between `start()`'s `scope = { … }` assignment and its
+    // `client.createJob(...)` resolution, with `startedJob` still
+    // null but a real cloud job about to exist. Treating that window
+    // as "nothing in flight" would `process.exit(0)` immediately
+    // after this returns, leaving the newly created cloud job
+    // running with no cancel POST. Awaiting `startInFlight` collapses
+    // the race onto a definite startedJob (success) or a definite
+    // start failure (rejection); either way the branches below
+    // can decide on real state. Swallow the rejection: if `start()`
+    // failed there's nothing to cancel anyway.
+    if (startInFlight) {
+      try {
+        await startInFlight;
+      } catch {
+        // intentionally ignored: failed start has no job to cancel
+      }
+    }
+    // Nothing in flight: clean up any prior latch and resolve.
+    if (!startedJob || !scope || TERMINAL_STATUSES.has(startedJob.status)) {
+      settleEarlyStopLatch();
+      return;
+    }
+    // Idempotent: a second call piggybacks on the first.
+    if (earlyStopDeferred) return earlyStopDeferred.promise;
+
+    earlyStopRequested = true;
+    let resolveFn!: () => void;
+    let rejectFn!: (err: unknown) => void;
+    const promise = new Promise<void>((resolve, reject) => {
+      resolveFn = resolve;
+      rejectFn = reject;
+    });
+    const timeoutMs = opts.timeoutMs ?? DEFAULT_EARLY_STOP_TIMEOUT_MS;
+    const timer = setTimeout(() => {
+      // Timed out waiting for a checkpoint; fall back to immediate cancel.
+      // Capture the active deferred reference: by the time the cancel POST
+      // resolves, the checkpoint branch may have nulled out the shared
+      // slot, but this fallback path still owns the deferred it created.
+      const active = earlyStopDeferred;
+      // Capture (don't swallow) any cancel error so we can surface it
+      // through the deferred's reject path. Mirrors the checkpoint
+      // branch: a swallow here lets the runner's
+      // `installShutdownHandlers` exit "successfully" while the cloud
+      // job lives on (orphaned GPU spend with zero diagnostic), the
+      // exact failure mode that a "stop-after-checkpoint" deadline
+      // exists to PREVENT from going silent.
+      let cancelError: unknown = null;
+      // See the checkpoint-branch comment: tracked separately from
+      // `cancelError` so a rejection with `null` / `0` doesn't
+      // silently downgrade the cancel-failure path to "clean
+      // cancel".
+      let cancelFailed = false;
+      trainer
+        .cancel()
+        .catch((err) => {
+          cancelError = err;
+          cancelFailed = true;
+        })
+        .finally(() => {
+          // Mirror the checkpoint-triggered early-stop branch: reset
+          // the latch and reflect the cancellation locally so a
+          // second `requestEarlyStop()` call is a no-op (instead of
+          // re-arming a fresh timer + re-issuing cancel) and so
+          // `wait()`'s eventual resolution exposes a terminal status.
+          // Without this, a long-lived trainer left in
+          // `earlyStopRequested = true` would re-cancel on every
+          // future checkpoint event for the rest of its lifetime.
+          earlyStopRequested = false;
+          if (startedJob && !TERMINAL_STATUSES.has(startedJob.status)) {
+            // Symmetric to the checkpoint branch: `"failed"` (not
+            // `"cancelled"`) on cancel-throw so we don't lie
+            // about cloud-side state that may still be running.
+            // Both branches feed the same TERMINAL_STATUSES
+            // short-circuit, so re-armed `requestEarlyStop()`
+            // calls still no-op correctly.
+            startedJob = {
+              ...startedJob,
+              status: cancelFailed ? "failed" : "cancelled",
+              ...(cancelFailed && {
+                error: `Early-stop cancel failed: ${
+                  cancelError instanceof Error
+                    ? cancelError.message
+                    : String(cancelError)
+                }`,
+              }),
+              completedAt: new Date().toISOString(),
+            };
+          }
+          if (active) {
+            // Resolve on success, REJECT on cancel failure so the
+            // SIGTERM handler's `.catch()` writes the error to
+            // stderr and the operator can see that the cloud job
+            // may still be live. The latch always settles either
+            // way; shutdown won't hang.
+            if (cancelFailed) {
+              active.reject(
+                cancelError instanceof Error
+                  ? cancelError
+                  : new Error(String(cancelError)),
+              );
+            } else {
+              active.resolve();
+            }
+          }
+          if (earlyStopDeferred === active) earlyStopDeferred = null;
+        });
+    }, timeoutMs);
+    // `unref()` keeps the early-stop timer from blocking process exit
+    // when the host runtime finishes for unrelated reasons.
+    timer.unref?.();
+    earlyStopDeferred = {
+      promise,
+      resolve: resolveFn,
+      reject: rejectFn,
+      timer,
+    };
+    return promise;
+  }
+
+  // Brand the trainer with the HMR control surface so the Studio server
+  // can (a) hash the cloud-side config to decide between hot-swap and
+  // restart, (b) atomically swap the callbacks cell from the runner
+  // subprocess on SIGUSR2, and (c) drive a graceful "stop after the
+  // next checkpoint" on SIGTERM. All three brands live behind
+  // `Symbol.for` keys so they don't appear on the public `Trainer`
+  // interface (see `trainerInspection.ts` for the rationale).
+  attachTrainerInspection(trainer, () => ({
+    name: input.name,
+    config,
+    callbacks: currentCallbacks,
+  }));
+  attachTrainerCallbackReplacer(trainer, (callbacks) => {
+    currentCallbacks = callbacks ?? {};
+  });
+  attachTrainerEarlyStopper(trainer, requestEarlyStop);
+
   return trainer;
 }
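
The early-stop latch in `requestEarlyStop()` above combines three ideas: a shared deferred (idempotent requests), a timeout fallback, and rejection with a real `Error` when the cancel fails. The following is a simplified sketch of that shape, assuming hypothetical names (`makeEarlyStopLatch`, `request`, `onCheckpoint`) and a caller-supplied `cancel`. Unlike the code in the diff, it clears the shared slot before calling `cancel`, and it does not track job status.

```typescript
// Hypothetical sketch of a "stop after next checkpoint, or after a timeout" latch.
type Latch = {
  promise: Promise<void>;
  resolve: () => void;
  reject: (err: Error) => void;
  timer: ReturnType<typeof setTimeout>;
};

function makeEarlyStopLatch(cancel: () => Promise<void>) {
  let deferred: Latch | null = null;

  const toError = (err: unknown): Error =>
    err instanceof Error ? err : new Error(String(err));

  // Run cancel once for the given latch and settle it either way.
  async function finish(active: Latch): Promise<void> {
    clearTimeout(active.timer);
    if (deferred === active) deferred = null;
    try {
      await cancel();
      active.resolve();
    } catch (err) {
      // Reject instead of swallowing so the caller can report the failure.
      active.reject(toError(err));
    }
  }

  function request(timeoutMs: number): Promise<void> {
    // Idempotent: a repeat call shares the pending promise.
    if (deferred) return deferred.promise;
    let resolve!: () => void;
    let reject!: (err: Error) => void;
    const promise = new Promise<void>((res, rej) => {
      resolve = res;
      reject = rej;
    });
    const latch: Latch = {
      promise,
      resolve,
      reject,
      // Fallback: no checkpoint arrived in time, cancel anyway.
      timer: setTimeout(() => void finish(latch), timeoutMs),
    };
    deferred = latch;
    return promise;
  }

  // Call when a checkpoint lands; a no-op unless a stop was requested.
  async function onCheckpoint(): Promise<void> {
    if (deferred) await finish(deferred);
  }

  return { request, onCheckpoint };
}
```

Returning the stored promise directly (rather than from an `async` function) is what lets two `request` calls observe the very same promise object.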
diff --git a/packages/arkor/src/core/trainerInspection.test.ts b/packages/arkor/src/core/trainerInspection.test.ts
new file mode 100644
index 00000000..cb3c11cc
--- /dev/null
+++ b/packages/arkor/src/core/trainerInspection.test.ts
@@ -0,0 +1,249 @@
+import { describe, expect, it, vi } from "vitest";
+import { createArkor } from "./arkor";
+import { createTrainer } from "./trainer";
+import {
+  findInspectableTrainer,
+  findTrainerInModule,
+  getTrainerInspection,
+  replaceTrainerCallbacks,
+  requestTrainerEarlyStop,
+} from "./trainerInspection";
+import type { Trainer } from "./types";
+
+function brandedTrainer(name: string) {
+  // Real `createTrainer` attaches the inspection brand. We only need
+  // a no-op trainer for these shape tests; `start`/`wait` etc. are
+  // never invoked.
+  return createTrainer({
+    name,
+    model: "m",
+    dataset: { type: "huggingface", name: "x" },
+  });
+}
+
+function unbrandedTrainer(name: string) {
+  // Hand-rolled trainer: passes the `start`/`wait`/`cancel` shape
+  // check `findTrainerInModule` requires but DOESN'T carry the SDK
+  // inspection brand. Mirrors a user who wraps or re-exports a
+  // trainer outside the SDK helpers.
+  return {
+    name,
+    start: async () => ({ jobId: "j" }),
+    wait: async () => ({ job: {}, artifacts: [] }),
+    cancel: async () => {},
+  };
+}
+
+describe("findTrainerInModule (trainer-shape walk)", () => {
+  it("finds shape #1: createArkor named export", () => {
+    const trainer = brandedTrainer("a");
+    const found = findTrainerInModule({ arkor: createArkor({ trainer }) });
+    expect(found).toBe(trainer);
+  });
+
+  it("finds shape #2: bare `trainer` named export", () => {
+    const trainer = brandedTrainer("b");
+    const found = findTrainerInModule({ trainer });
+    expect(found).toBe(trainer);
+  });
+
+  it("finds shape #3: default-export Arkor manifest", () => {
+    const trainer = brandedTrainer("c");
+    const found = findTrainerInModule({ default: createArkor({ trainer }) });
+    expect(found).toBe(trainer);
+  });
+
+  it("finds shape #4: default IS the Trainer", () => {
+    // Regression: `runner.ts`'s `extractTrainer` accepts
+    // `export default createTrainer(...)` directly (the trainer
+    // object itself becomes `mod.default`), but Studio's manifest /
+    // HMR walk previously skipped this shape. Result: a project that
+    // ran fine under `arkor start` showed as "no trainer" in Studio
+    // and HMR forced a SIGTERM-restart on every rebuild because
+    // `configHash` came back null.
+    const trainer = brandedTrainer("d");
+    const found = findTrainerInModule({ default: trainer });
+    expect(found).toBe(trainer);
+  });
+
+  it("finds shape #5: default.trainer nested", () => {
+    const trainer = brandedTrainer("e");
+    const found = findTrainerInModule({ default: { trainer } });
+    expect(found).toBe(trainer);
+  });
+
+  it("works for hand-rolled (unbranded) trainers in shapes #2, #4 and #5", () => {
+    const trainer = unbrandedTrainer("manual");
+    expect(findTrainerInModule({ trainer })?.name).toBe("manual");
+    expect(findTrainerInModule({ default: trainer })?.name).toBe("manual");
+    expect(findTrainerInModule({ default: { trainer } })?.name).toBe("manual");
+  });
+
+  it("returns null when no candidate looks like a trainer", () => {
+    expect(findTrainerInModule({})).toBeNull();
+    expect(findTrainerInModule({ arkor: {} })).toBeNull();
+    expect(findTrainerInModule({ trainer: { name: "no-methods" } })).toBeNull();
+    expect(findTrainerInModule({ default: 42 })).toBeNull();
+  });
+});
+
+describe("findInspectableTrainer (brand-required path)", () => {
+  it("returns the inspection snapshot for a branded trainer in shapes #1, #2, #3 and #5", () => {
+    // Regression: previously HMR's `inspectBundle` only checked
+    // `mod.arkor ?? mod.default`, missing shapes #2 and #4. As a
+    // result, projects bare-exporting `trainer` always produced
+    // `configHash: null` and HMR conservatively SIGTERM-restarted on
+    // every rebuild, never hot-swapping callbacks. The fix routes
+    // through `findInspectableTrainer` which walks every supported
+    // shape via `findTrainerInModule` and pulls inspection off the
+    // discovered trainer.
+    const trainerA = brandedTrainer("from-arkor");
+    const inspectionA = findInspectableTrainer({
+      arkor: createArkor({ trainer: trainerA }),
+    });
+    expect(inspectionA?.name).toBe("from-arkor");
+
+    const trainerB = brandedTrainer("bare-named");
+    const inspectionB = findInspectableTrainer({ trainer: trainerB });
+    expect(inspectionB?.name).toBe("bare-named");
+
+    const trainerC = brandedTrainer("default-arkor");
+    const inspectionC = findInspectableTrainer({
+      default: createArkor({ trainer: trainerC }),
+    });
+    expect(inspectionC?.name).toBe("default-arkor");
+
+    const trainerD = brandedTrainer("default-nested");
+    const inspectionD = findInspectableTrainer({
+      default: { trainer: trainerD },
+    });
+    expect(inspectionD?.name).toBe("default-nested");
+  });
+
+  it("returns null when only an unbranded trainer is present", () => {
+    // Hand-rolled trainers don't carry the SDK inspection brand, so
+    // HMR can't compute their `configHash`. The Studio still shows
+    // the trainer name (via `findTrainerInModule` in
+    // `summariseBuiltManifest`), but HMR routing falls back to the
+    // SIGTERM-restart-everything path, which is the documented
+    // safe behaviour when configs can't be diffed.
+    const trainer = unbrandedTrainer("plain");
+    expect(findInspectableTrainer({ trainer })).toBeNull();
+    expect(getTrainerInspection(trainer)).toBeNull();
+  });
+
+  it("does NOT walk past an unbranded first candidate to inspect a later branded one (runTrainer parity)", () => {
+    // Regression: a previous implementation looped every trainer-
+    // shaped candidate and returned the first one carrying the
+    // inspection brand. But `runTrainer`'s `extractTrainer` always
+    // executes the FIRST candidate (precedence: `mod.arkor` →
+    // `mod.trainer` → `mod.default`...), regardless of brand. A module
+    // that exported both an unbranded `trainer` (shape #2) AND a
+    // branded `default = createArkor(...)` (shape #3) would have its
+    // HMR `configHash` computed from the BRANDED trainer while the
+    // runner actually ran the unbranded one. The mismatch could route
+    // a rebuild to SIGUSR2 (hot-swap) even though the live trainer
+    // has no callback-replacer brand to receive the swap, leaving
+    // the running job stuck on stale callbacks.
+    //
+    // The fix anchors `findInspectableTrainer` to the same first-
+    // wins precedence as `runTrainer`: if the first candidate is
+    // unbranded, return `null` (forcing SIGTERM-restart, the safe
+    // fallback) instead of hashing a different instance.
+    const unbranded = unbrandedTrainer("unbranded-first");
+    const branded = brandedTrainer("branded-second");
+    const inspection = findInspectableTrainer({
+      trainer: unbranded,
+      default: createArkor({ trainer: branded }),
+    });
+    // Under the bug this was the branded inspection ("branded-second").
+    // With the fix we get null so HMR conservatively SIGTERM-restarts
+    // rather than hot-swapping callbacks into a trainer that can't
+    // receive them.
+    expect(inspection).toBeNull();
+    // And `findTrainerInModule` confirms the runner would pick the
+    // unbranded one (proving the precedence we're anchoring to).
+    expect(findTrainerInModule({
+      trainer: unbranded,
+      default: createArkor({ trainer: branded }),
+    })).toBe(unbranded);
+  });
+});
+
+describe("requestTrainerEarlyStop / replaceTrainerCallbacks brand-missing fallback", () => {
+  // Regression: previously these helpers asserted the brand was
+  // present and threw a synchronous TypeError on hand-rolled trainers.
+  // `runner.ts`'s `extractTrainer` accepts ANY `{start, wait, cancel}`
+  // shape (a documented public path for unbranded trainers),
+  // so the SIGTERM handler crashed instead of stopping the run.
+
+  it("requestTrainerEarlyStop falls back to trainer.cancel() for unbranded trainers", async () => {
+    const cancelCalls = vi.fn(async () => {});
+    const trainer = {
+      name: "manual",
+      start: async () => ({ jobId: "j" }),
+      wait: async () => ({ job: {}, artifacts: [] }),
+      cancel: cancelCalls,
+    } as unknown as Trainer;
+
+    // Must not throw, must resolve, must have called cancel().
+    await expect(requestTrainerEarlyStop(trainer)).resolves.toBeUndefined();
+    expect(cancelCalls).toHaveBeenCalledTimes(1);
+  });
+
+  it("requestTrainerEarlyStop swallows a thrown cancel() so the SIGTERM handler can still settle", async () => {
+    // The runner's SIGTERM handler chains
+    // `requestTrainerEarlyStop(...).catch(...).finally(() => process.exit(0))`.
+    // If the brand-missing fallback let cancel()'s rejection bubble,
+    // the `.finally` would still fire, but the cancel error would
+    // surface as an unhandled rejection from the test runner. The
+    // documented contract for cancel() is best-effort, so swallow.
+    const trainer = {
+      name: "manual",
+      start: async () => ({ jobId: "j" }),
+      wait: async () => ({ job: {}, artifacts: [] }),
+      cancel: vi.fn(async () => {
+        throw new Error("network down");
+      }),
+    } as unknown as Trainer;
+
+    await expect(requestTrainerEarlyStop(trainer)).resolves.toBeUndefined();
+  });
+
+  it("requestTrainerEarlyStop is async-shaped: synchronous throws inside the brand call become rejections", async () => {
+    // Defense-in-depth: even when the brand IS attached but somehow
+    // throws synchronously (e.g. a future implementation regression),
+    // the SIGTERM handler's `.catch` arm should still see it instead
+    // of the throw escaping past `.finally` and taking the runner
+    // down. The function is `async`, which wraps any synchronous
+    // throw inside its body into a rejected promise.
+    const trainer = brandedTrainer("from-arkor");
+    // Replace the brand with a function that throws synchronously.
+    const KEY = Symbol.for("arkor.trainer.requestEarlyStop");
+    Object.defineProperty(trainer, KEY, {
+      value: () => {
+        throw new Error("brand impl exploded");
+      },
+      configurable: true,
+    });
+    await expect(requestTrainerEarlyStop(trainer)).rejects.toThrow(
+      /brand impl exploded/,
+    );
+  });
+
+  it("replaceTrainerCallbacks is a no-op (not a throw) for unbranded trainers", () => {
+    // The HMR pipeline never routes SIGUSR2 to unbranded trainers in
+    // practice (their `configHash` is null, which forces the
+    // SIGTERM-restart path), but if a future caller did, it must not
+    // crash the runner.
+    const trainer = {
+      name: "manual",
+      start: async () => ({ jobId: "j" }),
+      wait: async () => ({ job: {}, artifacts: [] }),
+      cancel: async () => {},
+    } as unknown as Trainer;
+    expect(() =>
+      replaceTrainerCallbacks(trainer, { onLog: () => {} }),
+    ).not.toThrow();
+  });
+});
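
Note on the "async-shaped" test above: a minimal sketch of the difference between a bare promise-returning function and an `async` one when the inner call throws synchronously. The names `Stopper`, `stopBare`, `stopAsync`, and `classify` are hypothetical and exist only for this illustration; they are not part of the SDK.

```typescript
type Stopper = () => Promise<void>;

// Bare function: a synchronous throw inside `impl()` escapes to the
// caller before any promise exists, so `.catch()` never gets attached.
function stopBare(impl: Stopper): Promise<void> {
  return impl();
}

// `async` function: the same synchronous throw happens inside the async
// body, so it is converted into a rejected promise.
async function stopAsync(impl: Stopper): Promise<void> {
  await impl();
}

// An implementation that throws before returning a promise.
const exploding: Stopper = () => {
  throw new Error("brand impl exploded");
};

// Reports how a call failed: as a synchronous throw or as a rejection.
function classify(call: () => Promise<void>): Promise<string> {
  try {
    return call().then(
      () => "resolved",
      () => "rejected",
    );
  } catch {
    return Promise.resolve("threw synchronously");
  }
}
```

Here `classify(() => stopBare(exploding))` resolves to `"threw synchronously"`, whereas `classify(() => stopAsync(exploding))` resolves to `"rejected"`, which is the shape a `.catch().finally()` chain can handle.
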
diff --git a/packages/arkor/src/core/trainerInspection.ts b/packages/arkor/src/core/trainerInspection.ts
new file mode 100644
index 00000000..8bfe964d
--- /dev/null
+++ b/packages/arkor/src/core/trainerInspection.ts
@@ -0,0 +1,306 @@
+import { isArkor } from "./arkor";
+import type { Arkor, JobConfig, Trainer, TrainerCallbacks } from "./types";
+
+/**
+ * Snapshot of a trainer's identity and cloud-side config that the Studio
+ * server reads in order to (a) compute a stable hash for HMR's
+ * "callbacks-only vs full restart" decision and (b) extract the new
+ * callbacks reference when hot-swapping.
+ *
+ * **Internal API (not part of the user-facing SDK surface).** Both this
+ * snapshot and the companion `replaceTrainerCallbacks` mutator are
+ * exposed only via `Symbol.for(...)`-keyed properties on the trainer
+ * object so they don't appear on the public `Trainer` type. They exist
+ * to let `arkor dev`'s HMR pipeline hot-swap callbacks without
+ * restarting cloud-side training; user code shouldn't call them
+ * directly.
+ */
+export interface TrainerInspection {
+  /** Run name (mirror of `Trainer.name`, copied for forward compatibility). */
+  name: string;
+  /** The cloud-side `JobConfig` this trainer would submit on `start()`. */
+  config: JobConfig;
+  /** Whatever the user passed in `input.callbacks`. May be empty. */
+  callbacks: Partial<TrainerCallbacks>;
+}
+
+/**
+ * The CLI runtime (`dist/bin.mjs`) and the user's compiled bundle
+ * (`.arkor/build/index.mjs`, which keeps `arkor` external) end up loading
+ * two separate copies of this SDK as distinct ESM module records, so a
+ * module-local `WeakMap<Trainer, ...>` would split into two halves that
+ * can't see each other.
+ *
+ * `Symbol.for(key)` is the cross-realm equivalent: the same key string
+ * resolves to the same symbol in any module instance, so the trainer
+ * created in the user's bundle exposes its inspection through the same
+ * property the Studio process reads.
+ */
+const TRAINER_INSPECTION_KEY = Symbol.for("arkor.trainer.inspect");
+const TRAINER_REPLACE_CALLBACKS_KEY = Symbol.for(
+  "arkor.trainer.replaceCallbacks",
+);
+const TRAINER_REQUEST_EARLY_STOP_KEY = Symbol.for(
+  "arkor.trainer.requestEarlyStop",
+);
+
+export interface RequestEarlyStopOptions {
+  /** Default: 5 min. Falls back to immediate cancel if no checkpoint arrives. */
+  timeoutMs?: number;
+}
+
+/**
+ * Stamp the inspection snapshot onto a freshly-built `Trainer` instance.
+ * Called once from `createTrainer`. Stored as a thunk so callers can
+ * read a fresh copy each time (defensive: the trainer's callbacks cell
+ * is mutable across the lifetime of a hot-swap).
+ */
+export function attachTrainerInspection(
+  trainer: object,
+  read: () => TrainerInspection,
+): void {
+  Object.defineProperty(trainer, TRAINER_INSPECTION_KEY, {
+    value: read,
+    configurable: true,
+    enumerable: false,
+    writable: false,
+  });
+}
+
+/**
+ * Pull the snapshot off a Trainer-like value. Returns `null` for plain
+ * objects that don't carry the brand; used by the Studio server to
+ * gracefully ignore third-party wrappers or pre-SDK shapes.
+ */
+export function getTrainerInspection(
+  trainer: unknown,
+): TrainerInspection | null {
+  if (!trainer || typeof trainer !== "object") return null;
+  const fn = (trainer as Record<symbol, unknown>)[TRAINER_INSPECTION_KEY];
+  if (typeof fn !== "function") return null;
+  try {
+    const result = (fn as () => unknown).call(trainer);
+    if (
+      result &&
+      typeof result === "object" &&
+      "config" in result &&
+      "name" in result
+    ) {
+      return result as TrainerInspection;
+    }
+  } catch {
+    // Inspection is best-effort; a throwing inspection thunk shouldn't crash HMR.
+  }
+  return null;
+}
+
+/**
+ * Wire the trainer's mutable callbacks slot to a `Symbol.for`-keyed
+ * brand so the runner subprocess can hot-swap callbacks without us
+ * exposing the operation on the public `Trainer` interface. Called once
+ * from `createTrainer`.
+ */
+export function attachTrainerCallbackReplacer(
+  trainer: object,
+  replace: (callbacks: Partial<TrainerCallbacks>) => void,
+): void {
+  Object.defineProperty(trainer, TRAINER_REPLACE_CALLBACKS_KEY, {
+    value: replace,
+    configurable: true,
+    enumerable: false,
+    writable: false,
+  });
+}
+
+/**
+ * Replace the trainer's lifecycle callbacks atomically. The brand is
+ * attached by `createTrainer`, but `runTrainer`'s `extractTrainer`
+ * also accepts hand-rolled trainers (any `{ start, wait, cancel }`
+ * shape), and those don't carry the brand. The HMR pipeline never
+ * routes SIGUSR2 to such trainers in practice (they always produce
+ * `configHash: null` upstream, which forces the SIGTERM-restart
+ * path), so this helper is a no-op for them rather than throwing.
+ */
+export function replaceTrainerCallbacks(
+  trainer: Trainer,
+  callbacks: Partial<TrainerCallbacks>,
+): void {
+  const fn = (trainer as unknown as Record<symbol, unknown>)[
+    TRAINER_REPLACE_CALLBACKS_KEY
+  ] as ((cbs: Partial<TrainerCallbacks>) => void) | undefined;
+  if (typeof fn !== "function") return;
+  fn.call(trainer, callbacks);
+}
+
+/**
+ * Wire an early-stop entry point onto a `Trainer` so the SIGTERM handler
+ * in the runner subprocess can request a graceful "stop after the next
+ * checkpoint" without us exposing the operation on the public `Trainer`
+ * interface. User code that wants the same semantics should compose
+ * the cookbook's `abortSignal` + `cancel()` recipe instead (see
+ * `docs/cookbook/early-stopping.mdx`).
+ */
+export function attachTrainerEarlyStopper(
+  trainer: object,
+  requestStop: (opts?: RequestEarlyStopOptions) => Promise<void>,
+): void {
+  Object.defineProperty(trainer, TRAINER_REQUEST_EARLY_STOP_KEY, {
+    value: requestStop,
+    configurable: true,
+    enumerable: false,
+    writable: false,
+  });
+}
+
+/**
+ * Request that the trainer stop after the next saved checkpoint.
+ * Resolves once `cancel()` has been accepted by the cloud API, or
+ * after `timeoutMs` if no checkpoint arrived in time.
+ *
+ * `createTrainer` attaches the brand unconditionally, but
+ * `runTrainer`'s `extractTrainer` also accepts hand-rolled trainers
+ * (any `{ start, wait, cancel }` shape), which legitimately don't
+ * carry the brand. Falling back to the public `Trainer.cancel()` for
+ * those is the closest semantic match available without the SDK's
+ * checkpoint-aware machinery; it's also what the runner's SIGTERM
+ * handler needs to keep working (the previous "throw if brand
+ * missing" behaviour caused a synchronous TypeError before the
+ * handler's `.catch().finally()` chain attached, so SIGTERM crashed
+ * the runner instead of stopping the run).
+ */
+// async wrapper (rather than a bare function returning Promise) so
+// any *synchronous* throw inside the brand call (or its arguments)
+// becomes a rejected promise; the SIGTERM handler's `.catch()` then
+// catches it instead of the throw escaping past the `.finally()`
+// chain and taking the runner down.
+export async function requestTrainerEarlyStop(
+  trainer: Trainer,
+  opts?: RequestEarlyStopOptions,
+): Promise<void> {
+  const fn = (trainer as unknown as Record<symbol, unknown>)[
+    TRAINER_REQUEST_EARLY_STOP_KEY
+  ] as ((opts?: RequestEarlyStopOptions) => Promise<void>) | undefined;
+  if (typeof fn !== "function") {
+    // Best-effort fallback for unbranded trainers: trainer.cancel()
+    // is part of the public Trainer interface, so it's always safe
+    // to call. Catch/swallow because the documented contract for
+    // cancel() is "best-effort" and the SIGTERM handler needs the
+    // returned promise to settle either way.
+    try {
+      await trainer.cancel();
+    } catch {
+      // intentionally ignored; see comment above.
+    }
+    return;
+  }
+  await fn.call(trainer, opts);
+}
+
+/**
+ * Trainer-shaped value pulled from a re-imported bundle. We don't
+ * reuse the public `Trainer` type here because consumers of this
+ * helper want to read minimal fields (`name` for display) without
+ * type-narrowing on the full SDK interface. Many tests fabricate
+ * hand-rolled trainer literals that don't structurally match
+ * `Trainer` (no `requestEarlyStop` etc.) but are still legitimate
+ * user shapes the runner accepts.
+ */
+type TrainerLike = { name?: unknown; [key: string]: unknown };
+
+function isTrainerLike(value: unknown): value is TrainerLike {
+  if (!value || typeof value !== "object") return false;
+  const v = value as Record<string, unknown>;
+  return (
+    typeof v.start === "function" &&
+    typeof v.wait === "function" &&
+    typeof v.cancel === "function"
+  );
+}
+
+/**
+ * Walk the user module in `runner.ts`'s precedence order and return
+ * every *distinct* trainer-shaped value found. The walk is
+ * de-duplicated because the common `createArkor({ trainer })`
+ * default-export shape would otherwise surface the same trainer
+ * twice (case 3 pushes `mod.default.trainer`; case 4 pushes the
+ * manifest object itself, which `isTrainerLike` filters out;
+ * case 5 pushes `mod.default.trainer` a second time). Callers
+ * iterate in precedence order, so de-duplication preserves the
+ * "first match wins" contract.
+ *
+ * The five supported shapes (mirroring `runner.ts`'s `extractTrainer`):
+ *   1. `export const arkor = createArkor({ trainer })`
+ *   2. `export const trainer = createTrainer(...)`  (bare named export)
+ *   3. `export default createArkor({ trainer })`
+ *   4. `export default createTrainer(...)`           (default IS a Trainer)
+ *   5. `export default { trainer: createTrainer(...) }`
+ *
+ * Without shape #4 a project that default-exports a Trainer would run
+ * fine under `arkor start` but show as "no trainer" in Studio's
+ * manifest, with `configHash: null` forcing every HMR rebuild down the
+ * SIGTERM-restart path instead of the SIGUSR2 hot-swap path.
+ */
+function findTrainerCandidates(mod: Record<string, unknown>): TrainerLike[] {
+  const trainers: TrainerLike[] = [];
+  const seen = new Set<unknown>();
+  const push = (value: unknown): void => {
+    if (value === undefined || value === null) return;
+    if (seen.has(value)) return;
+    seen.add(value);
+    if (isTrainerLike(value)) trainers.push(value);
+  };
+  // 1: createArkor named export
+  if (isArkor(mod.arkor)) push((mod.arkor as Arkor).trainer);
+  // 2: bare `trainer` named export
+  push(mod.trainer);
+  // 3: default-export holding an Arkor manifest
+  if (isArkor(mod.default)) push((mod.default as Arkor).trainer);
+  // 4: default IS the Trainer itself. `isTrainerLike` filters out
+  // cases 3/5 (an Arkor manifest doesn't have `start`/`wait`/
+  // `cancel`, nor does a plain `{ trainer }` wrapper).
+  push(mod.default);
+  // 5: default.trainer nested
+  if (mod.default && typeof mod.default === "object") {
+    push((mod.default as Record<string, unknown>).trainer);
+  }
+  return trainers;
+}
+
+/**
+ * Return the first trainer-shaped value (anything with
+ * `start`/`wait`/`cancel`) in `runner.ts`'s precedence order. Doesn't
+ * require the SDK inspection brand: the Studio manifest UI displays
+ * the trainer's `name` for hand-rolled trainers too, even when HMR
+ * can't compute a `configHash` for them. "First match wins" matches
+ * `runner.ts`'s `extractTrainer`, so this is the trainer the runner
+ * will actually execute.
+ */
+export function findTrainerInModule(
+  mod: Record<string, unknown>,
+): TrainerLike | null {
+  return findTrainerCandidates(mod)[0] ?? null;
+}
+
+/**
+ * Inspection snapshot of the trainer `runTrainer` would execute
+ * (== the first candidate in `runner.ts`'s precedence order).
+ * Used by both `studio/hmr.ts` (computing the `configHash` for HMR
+ * routing) and `core/runnerSignals.ts` (extracting new callbacks for
+ * SIGUSR2 hot-swap).
+ *
+ * Returns `null` when the first candidate doesn't carry the
+ * inspection brand. We deliberately DO NOT walk past it to find a
+ * branded trainer further down the list: the runner ignores those,
+ * so hashing a deeper branded trainer would compute HMR decisions
+ * for a different instance than the one actually running, e.g.
+ * route to SIGUSR2/hot-swap when the live (unbranded) trainer
+ * cannot be callback-reloaded. A null here correctly forces SIGTERM-
+ * restart, which is the safe fallback when configs can't be diffed.
+ */
+export function findInspectableTrainer(
+  mod: Record<string, unknown>,
+): TrainerInspection | null {
+  const trainer = findTrainerCandidates(mod)[0];
+  if (!trainer) return null;
+  return getTrainerInspection(trainer);
+}
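
Note on the first-candidate rule in `findInspectableTrainer` above: a simplified sketch, assuming only three of the five shapes (the `createArkor` manifest shapes are left out) and a hypothetical brand key. `Candidate`, `BRAND`, `firstCandidate`, `inspectFirst`, and `makeTrainer` are illustrative names, not the SDK's API.

```typescript
type Candidate = Record<PropertyKey, unknown>;

// Hypothetical brand key, standing in for the SDK's inspection key.
const BRAND: symbol = Symbol.for("example.trainer.brand");

function looksLikeTrainer(value: unknown): value is Candidate {
  if (!value || typeof value !== "object") return false;
  const v = value as Candidate;
  return (
    typeof v.start === "function" &&
    typeof v.wait === "function" &&
    typeof v.cancel === "function"
  );
}

// Precedence: named `trainer`, then `default` itself, then `default.trainer`.
function firstCandidate(mod: Candidate): Candidate | null {
  const def = mod.default;
  const nested =
    def && typeof def === "object" ? (def as Candidate).trainer : undefined;
  for (const value of [mod.trainer, def, nested]) {
    if (looksLikeTrainer(value)) return value;
  }
  return null;
}

// Inspect ONLY the first candidate; never fall through to a later branded one.
function inspectFirst(mod: Candidate): string | null {
  const first = firstCandidate(mod);
  if (!first) return null;
  const read = first[BRAND];
  return typeof read === "function" ? String(read()) : null;
}

// Test helper: builds a trainer-shaped object, optionally carrying the brand.
function makeTrainer(name: string, branded: boolean): Candidate {
  const noop = (): void => undefined;
  const trainer: Candidate = { name, start: noop, wait: noop, cancel: noop };
  if (branded) {
    Object.defineProperty(trainer, BRAND, { value: () => name, enumerable: false });
  }
  return trainer;
}
```

With an unbranded `trainer` export and a branded `default`, `inspectFirst` returns `null` rather than the branded name, so the inspection result always describes the same object that `firstCandidate` (and hence the runner) would pick.
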
diff --git a/packages/arkor/src/studio/hmr.test.ts b/packages/arkor/src/studio/hmr.test.ts
new file mode 100644
index 00000000..b892c68c
--- /dev/null
+++ b/packages/arkor/src/studio/hmr.test.ts
@@ -0,0 +1,423 @@
+import { describe, it, expect, beforeEach, afterEach } from "vitest";
+import {
+  mkdirSync,
+  mkdtempSync,
+  rmSync,
+  statSync,
+  writeFileSync,
+} from "node:fs";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+import { createHmrCoordinator, type HmrEvent } from "./hmr";
+
+const FAKE_MANIFEST = `export const arkor = Object.freeze({
+  _kind: "arkor",
+  trainer: { name: "alpha" },
+});
+`;
+
+let cwd: string;
+
+beforeEach(() => {
+  cwd = mkdtempSync(join(tmpdir(), "arkor-hmr-test-"));
+});
+
+afterEach(() => {
+  rmSync(cwd, { recursive: true, force: true });
+});
+
+function nextEvent(
+  events: HmrEvent[],
+  predicate: (e: HmrEvent) => boolean,
+  timeoutMs = 10_000,
+): Promise<HmrEvent> {
+  return new Promise((resolve, reject) => {
+    const start = Date.now();
+    const tick = () => {
+      const found = events.find(predicate);
+      if (found) return resolve(found);
+      if (Date.now() - start > timeoutMs) {
+        return reject(
+          new Error(
+            `Timed out waiting for matching HMR event after ${timeoutMs}ms`,
+          ),
+        );
+      }
+      setTimeout(tick, 25);
+    };
+    tick();
+  });
+}
+
+/**
+ * Resolve once `events.length` has gone `quietWindowMs` without
+ * growing. Used to wait out spurious watcher events on noisier file
+ * systems (Windows polling / macOS FSEvents coalescing) before
+ * asserting the cached state.
+ */
+function waitForStableEvents(
+  events: HmrEvent[],
+  quietWindowMs: number,
+): Promise<void> {
+  return new Promise((resolve) => {
+    let lastLength = events.length;
+    let stableSince = Date.now();
+    const tick = () => {
+      if (events.length !== lastLength) {
+        lastLength = events.length;
+        stableSince = Date.now();
+      }
+      if (Date.now() - stableSince >= quietWindowMs) return resolve();
+      setTimeout(tick, 50);
+    };
+    tick();
+  });
+}
+
+describe("createHmrCoordinator", () => {
+  it("emits a `ready` event after the first successful build", async () => {
+    mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+    writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+    const events: HmrEvent[] = [];
+    const hmr = createHmrCoordinator({ cwd });
+    hmr.subscribe((e) => events.push(e));
+    try {
+      const ready = await nextEvent(events, (e) => e.type === "ready");
+      expect(ready.outFile).toMatch(/\.arkor[\\/]+build[\\/]+index\.mjs$/);
+      expect(typeof ready.hash).toBe("string");
+    } finally {
+      await hmr.dispose();
+    }
+  });
+
+  it("emits a `rebuild` event after a source edit", async () => {
+    mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+    writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+    const events: HmrEvent[] = [];
+    const hmr = createHmrCoordinator({ cwd });
+    hmr.subscribe((e) => events.push(e));
+    try {
+      const ready = await nextEvent(events, (e) => e.type === "ready");
+      // Touch the entry with new content so the watcher detects a change.
+      writeFileSync(
+        join(cwd, "src/arkor/index.ts"),
+        FAKE_MANIFEST.replace(`"alpha"`, `"beta"`),
+      );
+      const rebuild = await nextEvent(events, (e) => e.type === "rebuild");
+      expect(rebuild.outFile).toBe(ready.outFile);
+      expect(rebuild.hash).not.toBe(ready.hash);
+    } finally {
+      await hmr.dispose();
+    }
+  });
+
+  it("emits an `error` event when the entry is missing on subscribe", async () => {
+    const events: HmrEvent[] = [];
+    const hmr = createHmrCoordinator({ cwd });
+    hmr.subscribe((e) => events.push(e));
+    try {
+      const err = await nextEvent(events, (e) => e.type === "error", 1000);
+      expect(err.message).toMatch(/Build entry not found/);
+    } finally {
+      await hmr.dispose();
+    }
+  });
+
+  it("transitions from `error` to `ready` once the entry appears, without re-subscribing", async () => {
+    // Regression: previously `startWatcher` bailed out and never
+    // retried, so an SPA already connected to `/api/dev/events` against
+    // a fresh scaffold would be stuck on the initial `error` event
+    // forever: EventSource doesn't reconnect on application-level
+    // errors. The coordinator now polls for the entry file in the
+    // background and starts the watcher the moment it appears.
+    const events: HmrEvent[] = [];
+    const hmr = createHmrCoordinator({ cwd });
+    hmr.subscribe((e) => events.push(e));
+    try {
+      await nextEvent(events, (e) => e.type === "error", 1000);
+      // Same subscriber: no reconnect, no second `subscribe` call.
+      mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+      writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+      const ready = await nextEvent(
+        events,
+        (e) => e.type === "ready",
+        4000,
+      );
+      expect(ready.outFile).toMatch(/index\.mjs$/);
+    } finally {
+      await hmr.dispose();
+    }
+  });
+
+  it("replays the latest event to late subscribers", async () => {
+    mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+    writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+    const firstEvents: HmrEvent[] = [];
+    const hmr = createHmrCoordinator({ cwd });
+    hmr.subscribe((e) => firstEvents.push(e));
+    try {
+      await nextEvent(firstEvents, (e) => e.type === "ready");
+      // A new subscriber should receive the cached state synchronously
+      // before any new build is triggered.
+      //
+      // We assert "the late subscriber sees the same event the prior one
+      // saw last" rather than literally "ready" because rolldown@1.0.0-rc.17
+      // on macOS occasionally fires a spurious second BUNDLE_END (FSEvents
+      // coalescing inside the watcher): there, `firstEvents` already
+      // contains the spurious `rebuild` by the time we late-subscribe, and
+      // the contract under test (replay of the cached state) holds either
+      // way.
+      // TODO(rolldown 1.0): re-check after rolldown leaves RC. If the
+      // spurious BUNDLE_END is gone on macOS, tighten this back to
+      //   expect(lateEvents[0]?.type).toBe("ready");
+      const lateEvents: HmrEvent[] = [];
+      hmr.subscribe((e) => lateEvents.push(e));
+      expect(lateEvents.length).toBeGreaterThanOrEqual(1);
+      expect(lateEvents[0]).toEqual(firstEvents[firstEvents.length - 1]);
+    } finally {
+      await hmr.dispose();
+    }
+  });
+
+  it("subscribe()'s lastEvent replay swallows a throwing subscriber so initialization keeps working", async () => {
+    // Regression: `subscribe()` synchronously replays `lastEvent` to
+    // a fresh subscriber for the late-mount-cached-state contract.
+    // Previously the replay had no try/catch, so an exception from
+    // a subscriber during that one call (typical case: an SSE
+    // controller that closed mid-replay, where `controller.enqueue`
+    // on a closed stream throws) propagated out of `subscribe()` and
+    // broke whoever just registered. `broadcast()` already swallowed
+    // subscriber throws defensively; this test pins the symmetric
+    // contract on `subscribe()`.
+    mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+    writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+    const firstEvents: HmrEvent[] = [];
+    const hmr = createHmrCoordinator({ cwd });
+    hmr.subscribe((e) => firstEvents.push(e));
+    try {
+      await nextEvent(firstEvents, (e) => e.type === "ready");
+      // A subscriber whose body throws on the cached-state replay.
+      const throwingSubscriber = (): void => {
+        throw new Error("controller closed");
+      };
+      // Must not throw out of subscribe(); must still return a
+      // working unsubscribe.
+      let unsubscribe: () => void = () => undefined;
+      expect(() => {
+        unsubscribe = hmr.subscribe(throwingSubscriber);
+      }).not.toThrow();
+      expect(typeof unsubscribe).toBe("function");
+      // Confirm the coordinator is still healthy: a *new* subscriber
+      // (after the throwing one) still receives the cached replay.
+      const recoveryEvents: HmrEvent[] = [];
+      hmr.subscribe((e) => recoveryEvents.push(e));
+      expect(recoveryEvents.length).toBeGreaterThanOrEqual(1);
+      unsubscribe();
+    } finally {
+      await hmr.dispose();
+    }
+  });
+
+  it("stops broadcasting after dispose()", async () => {
+    mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+    writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+    const events: HmrEvent[] = [];
+    const hmr = createHmrCoordinator({ cwd });
+    hmr.subscribe((e) => events.push(e));
+    await nextEvent(events, (e) => e.type === "ready");
+    await hmr.dispose();
+    const countAfterDispose = events.length;
+
+    // Edit after dispose must not produce any further events.
+    writeFileSync(
+      join(cwd, "src/arkor/index.ts"),
+      FAKE_MANIFEST.replace(`"alpha"`, `"gamma"`),
+    );
+    await new Promise((r) => setTimeout(r, 250));
+    expect(events.length).toBe(countAfterDispose);
+  });
+
+  it("the cached lastEvent reflects the LATEST source under rapid back-to-back edits", async () => {
+    // Regression: the BUNDLE_END handler used to fire
+    // `emitBuildSucceeded` without awaiting, so two quick rebuilds
+    // could run `inspectBundle` concurrently and broadcast out of
+    // order, leaving `lastEvent` pointing at the older snapshot.
+    // We can't deterministically synthesise a race against rolldown's
+    // real watcher, but we *can* assert the user-visible invariant:
+    // after a sequence of edits, the cached state must match the
+    // bytes that are actually on disk. The new sequence-number guard
+    // inside `emitBuildSucceeded` drops stale inspection results so
+    // whichever BUNDLE_END landed last broadcasts last.
+    mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+    writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+    const events: HmrEvent[] = [];
+    const hmr = createHmrCoordinator({ cwd });
+    hmr.subscribe((e) => events.push(e));
+    try {
+      await nextEvent(events, (e) => e.type === "ready");
+      writeFileSync(
+        join(cwd, "src/arkor/index.ts"),
+        FAKE_MANIFEST.replace(`"alpha"`, `"beta"`),
+      );
+      await nextEvent(events, (e) => e.type === "rebuild", 4000);
+      writeFileSync(
+        join(cwd, "src/arkor/index.ts"),
+        FAKE_MANIFEST.replace(`"alpha"`, `"gamma"`),
+      );
+      // Wait for the watcher to settle; any rebuild that's going to
+      // fire (including spurious extras from FSEvents on macOS or
+      // chokidar polling on Windows) lands within this window. The
+      // assertion then compares the cached `lastEvent.hash` against
+      // the *actual* fingerprint of the on-disk artefact, not a
+      // captured "last expected" hash from earlier in the test:
+      // that earlier capture was brittle on Windows where rolldown
+      // routinely emits a 4th BUNDLE_END after the explicit edits
+      // settle, producing a slightly different output byte (a
+      // change in the bundled comment header is enough to bump
+      // mtime + ctime + size).
+      await waitForStableEvents(events, 750);
+      const stat = statSync(join(cwd, ".arkor/build/index.mjs"));
+      const expectedHash = `${stat.mtimeMs}-${stat.ctimeMs}-${stat.size}`;
+      expect(events[events.length - 1]?.hash).toBe(expectedHash);
+    } finally {
+      await hmr.dispose();
+    }
+  });
+
+  it("getCurrentConfigHash() returns the latest cached event's hash", async () => {
+    // Regression: `/api/train` previously called `readManifestSummary`
+    // and ran a redundant rebuild per spawn (racing the watcher).
+    // The new server flow reads the cached hash via
+    // `getCurrentConfigHash()`. We can't build a branded trainer here
+    // (the user-bundle entry shape would need a working `arkor`
+    // resolution at import time), but we can verify the getter
+    // returns `null` before the watcher has emitted any event and
+    // tracks the cached event's `configHash` field once one lands.
+    // The integration of "configHash actually populated for all
+    // entry shapes" is covered by the unit test against
+    // `findInspectableTrainer` in `trainerInspection.test.ts`.
+    mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+    writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+    const events: HmrEvent[] = [];
+    const hmr = createHmrCoordinator({ cwd });
+    // Before any subscriber attaches, no watcher is running and no
+    // event has been broadcast: getter must return null without
+    // throwing.
+    expect(hmr.getCurrentConfigHash()).toBeNull();
+    hmr.subscribe((e) => events.push(e));
+    try {
+      const ready = await nextEvent(events, (e) => e.type === "ready");
+      // FAKE_MANIFEST is hand-rolled (no SDK brand) so the cached
+      // hash is null, but the *getter* must still return whatever
+      // the cached event carries, not throw.
+      expect(hmr.getCurrentConfigHash()).toBe(ready.configHash ?? null);
+    } finally {
+      await hmr.dispose();
+    }
+  });
+
+  it("getCurrentArtifactHash() returns null when the artefact doesn't exist (vs a Date.now() fallback)", async () => {
+    // Regression: a previous implementation did
+    // `statSync(...) ; return fingerprint(...)`. Two stat calls
+    // means a race window where the file disappears between them:
+    // the existence check passes, then `fingerprint`'s catch
+    // branch substitutes `Date.now().toString(36)` (its
+    // freshness-forcing fallback for SSE dedup), and the getter
+    // returns a non-null, non-artefact-derived hash. That
+    // silently breaks `dispatchRebuild`'s pre-ready-spawn gate
+    // which relies on null === "no artefact, force restart".
+    // The fix uses `fingerprintOrNull`: single statSync, true
+    // null on failure.
+    //
+    // We assert the getter on a project that has NEVER built
+    // (no `.arkor/build/index.mjs` ever existed). The bug-fix
+    // version returns null; the broken version's leftover would
+    // have been Date.now()-derived non-null.
+    mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+    writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+    const hmr = createHmrCoordinator({ cwd });
+    try {
+      // No subscribe() yet: watcher hasn't started, so no
+      // BUNDLE_END has written the artefact. The on-disk
+      // `.arkor/build/index.mjs` doesn't exist.
+      expect(hmr.getCurrentArtifactHash()).toBeNull();
+    } finally {
+      await hmr.dispose();
+    }
+  });
+
+  it("getCurrentArtifactHash() returns a stable mtime/ctime/size hash once the artefact exists", async () => {
+    // Companion to the null-on-missing test: when the artefact
+    // *does* exist (watcher's first BUNDLE_END landed), the
+    // getter returns the same `mtimeMs-ctimeMs-size` shape the
+    // SSE event's `hash` field uses. The two are paired for SSE
+    // dedup purposes; the pre-ready-spawn registry gate switched
+    // to content-hash (`getCurrentArtifactContentHash`) to avoid
+    // identical-bytes/different-timestamps false positives, but
+    // the timestamp hash stays as the canonical SSE event id.
+    mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+    writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+    const events: HmrEvent[] = [];
+    const hmr = createHmrCoordinator({ cwd });
+    hmr.subscribe((e) => events.push(e));
+    try {
+      const ready = await nextEvent(events, (e) => e.type === "ready");
+      const artifactHash = hmr.getCurrentArtifactHash();
+      // Same shape as the SSE event's `hash` field: both feed
+      // through the same `mtimeMs-ctimeMs-size` formula.
+      expect(artifactHash).toBe(ready.hash ?? null);
+      expect(artifactHash).toMatch(/^[\d.]+-[\d.]+-\d+$/);
+    } finally {
+      await hmr.dispose();
+    }
+  });
+
+  it("getCurrentConfigHash() preserves the last-success hash across an ERROR event", async () => {
+    // Regression: previously `getCurrentConfigHash()` returned
+    // `lastEvent?.configHash ?? null`. After an ERROR landed,
+    // `lastEvent` was the error event (no `configHash`) so the
+    // getter went null even though `.arkor/build/index.mjs` still
+    // held the previous *successful* bundle bytes (ERROR doesn't
+    // overwrite the output). A child spawned via `/api/train` in
+    // that window would register `configHash: null`, and the next
+    // successful BUNDLE_END would diff against null → SIGTERM
+    // restart instead of SIGUSR2 hot-swap, defeating callback
+    // hot-swap for the rest of the session. The fix tracks the
+    // last *successful* hash separately from `lastEvent`.
+    mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+    writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+    const events: HmrEvent[] = [];
+    const hmr = createHmrCoordinator({ cwd });
+    hmr.subscribe((e) => events.push(e));
+    try {
+      const ready = await nextEvent(events, (e) => e.type === "ready");
+      const successHash = hmr.getCurrentConfigHash();
+      // Sanity: ready event's configHash matches the getter.
+      expect(successHash).toBe(ready.configHash ?? null);
+      // Inject a syntax error to force a watcher ERROR event.
+      writeFileSync(
+        join(cwd, "src/arkor/index.ts"),
+        "this is not { valid javascript = ;",
+      );
+      await nextEvent(events, (e) => e.type === "error", 4000);
+      // After the error, the cached `lastEvent` is the error frame
+      // but the on-disk artifact still holds the previous
+      // success. The getter must return that previous-success hash
+      // so any `/api/train` spawn during this window still gets a
+      // useful spawn-time hash for the *next* rebuild's routing.
+      expect(hmr.getCurrentConfigHash()).toBe(successHash);
+    } finally {
+      await hmr.dispose();
+    }
+  });
+});
diff --git a/packages/arkor/src/studio/hmr.ts b/packages/arkor/src/studio/hmr.ts
new file mode 100644
index 00000000..974ed771
--- /dev/null
+++ b/packages/arkor/src/studio/hmr.ts
@@ -0,0 +1,527 @@
+import { createHash } from "node:crypto";
+import { existsSync, readFileSync, statSync } from "node:fs";
+import { watch, type RolldownWatcher } from "rolldown";
+import { hashJobConfig } from "../core/configHash";
+import { moduleCacheBustUrl } from "../core/moduleCacheBust";
+import {
+  BUILD_DEFAULTS,
+  resolveBuildEntry,
+  rolldownInputOptions,
+  type BuildEntryOptions,
+} from "../core/rolldownConfig";
+import { findInspectableTrainer } from "../core/trainerInspection";
+
+export type HmrEventType = "ready" | "rebuild" | "error";
+
+export interface HmrEvent {
+  type: HmrEventType;
+  outFile?: string;
+  /**
+   * Short fingerprint of the bundle artefact (mtime + ctime + size,
+   * mirroring `core/moduleCacheBust.ts`'s key shape). Subscribers use
+   * this to dedupe replays of the same successful build.
+   */
+  hash?: string;
+  /**
+   * Content-derived hash (sha256, truncated) of the artefact bytes.
+   * Used by `dispatchRebuild`'s pre-ready-spawn equality gate where
+   * `hash` would over-trigger SIGTERM-restart: a watcher build that
+   * rewrites identical bytes still bumps mtime/ctime, so two
+   * timestamp fingerprints differ even though the loaded bytes are
+   * the same. Comparing this content-hash instead avoids that
+   * spurious cancel+restart cycle in the "user clicked Run before
+   * the watcher's first BUNDLE_END landed" case.
+   */
+  contentHash?: string | null;
+  /**
+   * Stable hash of the trainer's cloud-side `JobConfig`. When this is
+   * unchanged across a rebuild, only the in-process callbacks moved and
+   * the Studio server can hot-swap them without restarting the run.
+   * `null` when the bundle has no discoverable trainer (e.g. the user's
+   * source has a syntax error or the Arkor manifest is missing).
+   */
+  configHash?: string | null;
+  /** Run name pulled from the rebuilt manifest. */
+  trainerName?: string | null;
+  /** Human-readable error message; only present on `type === "error"`. */
+  message?: string;
+}
+
+export interface HmrCoordinator {
+  /**
+   * Receive the current cached state immediately, then every subsequent
+   * event. Returns an unsubscribe function.
+   */
+  subscribe(fn: (event: HmrEvent) => void): () => void;
+  /**
+   * Synchronous read of the most recent successful build's
+   * `configHash`. Used by `/api/train` to capture the hash that's
+   * about to be spawned so HMR routing on the *next* rebuild knows
+   * whether the new bundle changed cloud-side config. `null` until the
+   * first successful build lands, or when that build had no inspectable
+   * trainer; a later `error` event does not reset the cached hash.
+   */
+  getCurrentConfigHash(): string | null;
+  /**
+   * Synchronous fingerprint of the on-disk build artefact RIGHT NOW
+   * (fresh stat, not cached). Used by `/api/train`'s registry entry
+   * so HMR routing in the pre-ready-spawn case (`configHash === null`)
+   * can compare against the rebuild's `event.hash` to tell whether
+   * the child read the same bytes. Without this gate, an edit
+   * landing between spawn and the watcher's first BUNDLE_END would
+   * silently teach the registry to use the post-edit `configHash`
+   * as the child's baseline; later same-hash rebuilds would then
+   * hot-swap callbacks into a child whose cloud-side `JobConfig`
+   * was actually spawned against an older version, leaving the
+   * cloud run on a stale config. `null` when stat fails (artefact
+   * doesn't exist yet, fresh project never built).
+   */
+  getCurrentArtifactHash(): string | null;
+  /**
+   * Content-derived hash (sha256, truncated) of the on-disk
+   * artefact RIGHT NOW. Used by `/api/train` to capture a
+   * spawn-time content-hash for the registry's pre-ready-spawn
+   * equality gate; paired with the rebuild's `event.contentHash`,
+   * a mismatch unambiguously means the bytes changed (not just
+   * timestamps), so `dispatchRebuild` only SIGTERM-restarts when
+   * the child genuinely loaded different bytes than the new
+   * configHash describes. `null` on stat/read failure (artefact
+   * doesn't exist yet, fresh project never built).
+   */
+  getCurrentArtifactContentHash(): string | null;
+  /**
+   * Last broadcast event's `type`, or `null` if nothing has been
+   * broadcast yet. `/api/manifest`'s HMR fast path consults this to
+   * suppress its "serve last good artefact" behaviour while the
+   * watcher is in an `error` state; without that gate, the SPA's
+   * 5 s `/api/manifest` poll would keep getting a 200 stale
+   * manifest and silently overwrite the SSE-driven build-error UI,
+   * letting users run with stale code/config while the latest
+   * source is still failing to compile.
+   */
+  getLastEventType(): HmrEventType | null;
+  /**
+   * Close the rolldown watcher and drop all subscribers. **Does not
+   * (and cannot) evict the user-module records that `inspectBundle`
+   * loaded into Node's ESM cache** — Node's loader exposes no
+   * eviction API, so for `arkor dev` sessions that go through many
+   * rebuilds before exit, the cache retains one record per distinct
+   * artefact fingerprint for the rest of the process lifetime.
+   * The mtime/ctime/size cache-bust key (`moduleCacheBustUrl`)
+   * maps events that leave the file untouched onto the same record,
+   * bounding retention to "one entry per rebuild", which is the tightest
+   * we can offer here. Tests that loop `createHmrCoordinator` →
+   * rebuild → `dispose` therefore still accumulate process-wide
+   * ESM-cache entries.
+   */
+  dispose(): Promise<void>;
+}
+
+export type HmrOptions = BuildEntryOptions;
+
+/**
+ * Content-derived fingerprint of the artefact bytes (sha256, first 16
+ * hex chars). Used by `dispatchRebuild`'s pre-ready-spawn gate where
+ * timestamp-based comparison gives false positives: a watcher rebuild
+ * that produces the same bytes still bumps mtime/ctime, so a child
+ * spawned just before `ready` would be unnecessarily SIGTERM-restarted
+ * even though its loaded bytes match the new build's. Hashing a few
+ * MB of bundle on each call is cheap relative to the GPU cost of a
+ * spurious cancel+restart cycle.
+ *
+ * Returns `null` on stat/read failure so the caller can treat
+ * "no artefact" as "force restart" (the conservative default).
+ */
+function contentHashOrNull(outFile: string): string | null {
+  try {
+    const bytes = readFileSync(outFile);
+    return createHash("sha256").update(bytes).digest("hex").slice(0, 16);
+  } catch {
+    return null;
+  }
+}
+
+/**
+ * Single-stat fingerprint with a clean `null` on failure: used by
+ * `getCurrentArtifactHash()` whose contract is "return a fingerprint
+ * of the artefact's stat metadata, or `null` if no artefact". A
+ * separate exists-check + `fingerprint()` here would race: the file
+ * could disappear between the two stats and `fingerprint()`'s
+ * `Date.now()` fallback would return a non-null hash that doesn't
+ * describe any real bytes, silently violating the contract.
+ */
+function fingerprintOrNull(outFile: string): string | null {
+  try {
+    const s = statSync(outFile);
+    // Same shape as `fingerprint()`'s success branch; `ctimeMs` is
+    // the belt-and-braces guard for writes whose mtime is restored
+    // afterwards (e.g. via `touch -m -t`), since ctime still advances.
+    return `${s.mtimeMs}-${s.ctimeMs}-${s.size}`;
+  } catch {
+    return null;
+  }
+}
+
+function fingerprint(outFile: string): string {
+  // Delegate to `fingerprintOrNull` and substitute a freshness-
+  // forcing token on stat failure. The `Date.now()` fallback
+  // matters here (vs the "0-0-0" sentinel `moduleCacheBustKey`
+  // uses): SPA-side SSE dedup keys off this hash, so a stable
+  // literal during a racy stat would silently swallow genuinely-
+  // fresh broadcast events.
+  return fingerprintOrNull(outFile) ?? Date.now().toString(36);
+}
+
+type InspectionResult = {
+  configHash: string;
+  trainerName: string;
+} | null;
+
+/**
+ * Dynamic-import the freshly-built bundle and pull a `TrainerInspection`
+ * snapshot off the discovered trainer.
+ *
+ * Walks every entry shape `runner.ts` accepts (named `arkor`, named
+ * `trainer`, `default` Arkor manifest, `default.trainer`) via the
+ * shared `findInspectableTrainer` helper, keeping inspection in sync
+ * with execution. Without this, projects that only `export const
+ * trainer` (a documented shortcut) would always produce `configHash:
+ * null` and the Studio server would SIGTERM-restart the run on every
+ * rebuild.
+ *
+ * Cache-bust by file mtime+ctime+size (via `moduleCacheBustUrl`)
+ * rather than `Date.now()`:
+ *
+ *   - Node's ESM loader caches every dynamically-imported URL for the
+ *     lifetime of the process and never evicts. A `?t=Date.now()`
+ *     suffix produces a unique URL per call, so a long `arkor dev`
+ *     session would accumulate one module record per BUNDLE_END:
+ *     unbounded memory growth.
+ *   - The composite key (`mtimeMs-ctimeMs-size`) keys the cache to
+ *     "the actual bytes in this file", so spurious watcher events
+ *     that don't change content reuse the prior module record. The
+ *     leak shrinks from "one entry per keystroke" to "one entry per
+ *     actual rebuild", which for a realistic dev session (hundreds
+ *     of saves over hours) is bounded by the number of distinct file
+ *     states the user produces, and that's fundamentally what HMR
+ *     has to track to surface up-to-date trainer state. There's no
+ *     public Node API for evicting an ESM module record, so this is
+ *     the tightest bound we can offer without spawning a child
+ *     process per inspection.
+ *
+ * Best-effort: a missing/malformed manifest or a thrown user
+ * constructor returns `null` and the caller treats the rebuild as
+ * "config-unknown".
+ */
+async function inspectBundle(outFile: string): Promise<InspectionResult> {
+  try {
+    const mod = (await import(moduleCacheBustUrl(outFile))) as Record<
+      string,
+      unknown
+    >;
+    const inspection = findInspectableTrainer(mod);
+    if (!inspection) return null;
+    return {
+      configHash: hashJobConfig(inspection.config),
+      trainerName: inspection.name,
+    };
+  } catch {
+    return null;
+  }
+}
+
+/**
+ * Spin up a rolldown watcher over the user's `src/arkor` entry, broadcasting
+ * `ready` / `rebuild` / `error` to subscribers. Used by `arkor dev` to push
+ * `/api/dev/events` SSE notifications to the SPA.
+ *
+ * Lazy: the watcher only starts on the first `subscribe` call so a Studio
+ * launch in a project without `src/arkor/index.ts` doesn't immediately fail.
+ * The watcher kicks in once the user creates the file and the SPA opens
+ * an EventSource. The coordinator caches every broadcast (including
+ * `error`) and replays the latest one to new subscribers, so a
+ * late-mounting component still sees the current build state.
+ */
+export function createHmrCoordinator(opts: HmrOptions): HmrCoordinator {
+  const resolved = resolveBuildEntry(opts);
+
+  const subscribers = new Set<(event: HmrEvent) => void>();
+  let lastEvent: HmrEvent | null = null;
+  let watcher: RolldownWatcher | null = null;
+  let disposed = false;
+  /**
+   * When `startWatcher` runs against a project that doesn't have an
+   * entry file yet, a poll timer takes over and waits for the file to
+   * appear. Without this, an SPA that opened `/api/dev/events` against
+   * a fresh scaffold would hang on the initial `error` event forever
+   * (`startWatcher` is only re-entered on `subscribe()`, but EventSource
+   * doesn't reconnect on application-level errors).
+   */
+  let entryWaitTimer: ReturnType<typeof setInterval> | null = null;
+  /**
+   * Monotonically incrementing build sequence number. Bumped on every
+   * `BUNDLE_END` *before* the inspection awaits, so when an
+   * inspection eventually resolves it can check whether a newer
+   * build has started in the meantime and silently drop its stale
+   * result.
+   *
+   * This matters because `inspectBundle` does an asynchronous
+   * dynamic-import of the just-written artifact. Two rebuilds A → B
+   * landing within the import window can race, with A's inspection
+   * resolving *after* B's. The previous "fire-and-forget" code
+   * would then publish A on top of B and leave `lastEvent` pointing
+   * at the older `configHash`/`trainerName`. That in turn drove
+   * `/api/dev/events` to make hot-swap-vs-restart decisions against
+   * stale routing data and surfaced the wrong trainer name in the
+   * SPA.
+   */
+  let buildSeq = 0;
+  /**
+   * Whether a `ready` event has actually broadcast yet. Tracked
+   * separately from `firstBuild` because the inspection await means
+   * the first BUNDLE_END's broadcast can land *after* a second
+   * BUNDLE_END schedules its own. Pinning the type to
+   * "broadcast-time" rather than "schedule-time" guarantees the SPA
+   * still sees `ready` first even when the initial inspection loses
+   * the race.
+   */
+  let firstBroadcast = true;
+  /**
+   * Cached `configHash` of the last *successful* build, **independent
+   * of `lastEvent`**. `lastEvent` tracks every broadcast (including
+   * `error`) for the cached-replay-on-late-subscribe contract, but a
+   * transient build error must not blank out the spawn-time hash that
+   * `/api/train` reads via `getCurrentConfigHash()`. The on-disk
+   * `.arkor/build/index.mjs` doesn't change on ERROR, so a child
+   * spawned during an error state is running the *previous* successful
+   * bundle, and the next BUNDLE_END's hash should be compared
+   * against THAT. Without this separate cache, the whole rebuild gets
+   * routed through SIGTERM-restart and SIGUSR2 hot-swap stops working
+   * for the rest of the session whenever the user briefly broke their
+   * source.
+   */
+  let lastSuccessConfigHash: string | null = null;
+
+  function broadcast(event: HmrEvent): void {
+    lastEvent = event;
+    for (const fn of subscribers) {
+      try {
+        fn(event);
+      } catch {
+        // Subscribers are SSE controllers; a thrown error usually means
+        // the connection closed mid-flight. Swallow it so one bad subscriber
+        // can't poison the broadcast for the rest.
+      }
+    }
+  }
+
+  async function emitBuildSucceeded(): Promise<void> {
+    if (disposed) return;
+    const seq = ++buildSeq;
+    const inspection = await inspectBundle(resolved.outFile);
+    // Drop stale results: a newer rebuild already started (or
+    // finished) while our inspection was running. The newer
+    // inspection will own the broadcast for the latest state; this
+    // one publishing now would just clobber `lastEvent` with the
+    // older snapshot.
+    if (seq !== buildSeq || disposed) return;
+    const type: HmrEventType = firstBroadcast ? "ready" : "rebuild";
+    firstBroadcast = false;
+    const configHash = inspection?.configHash ?? null;
+    // BUNDLE_END always reflects what's now on disk: even when the
+    // bundle is unbranded (`configHash === null`), that's the
+    // current truth. Capture it so `/api/train` spawning during a
+    // *subsequent* transient error still has the right spawn-time
+    // hash to compare against the next successful rebuild.
+    lastSuccessConfigHash = configHash;
+    broadcast({
+      type,
+      outFile: resolved.outFile,
+      hash: fingerprint(resolved.outFile),
+      // Content hash powers the registry's pre-ready-spawn equality
+      // gate (timestamp-only would over-trigger SIGTERM-restart on
+      // identical-bytes rebuilds). Read once here so the broadcast
+      // and any spawn-time capture reference the same on-disk state.
+      contentHash: contentHashOrNull(resolved.outFile),
+      configHash,
+      trainerName: inspection?.trainerName ?? null,
+    });
+  }
+
+  function startWatcher(): void {
+    if (watcher || disposed) return;
+    if (!existsSync(resolved.entry)) {
+      broadcast({
+        type: "error",
+        message: `Build entry not found: ${resolved.entry}. Create ${BUILD_DEFAULTS.entry} or pass an explicit entry argument.`,
+      });
+      // Hand off to a low-frequency poll so an SPA already connected to
+      // `/api/dev/events` transitions from "error" to "ready" the moment
+      // the user creates the entry file (no manual reconnect required).
+      // The poll is `unref()`'d so it never blocks process exit, and
+      // `dispose()` clears it.
+      if (!entryWaitTimer) {
+        entryWaitTimer = setInterval(() => {
+          if (disposed || watcher) {
+            if (entryWaitTimer) clearInterval(entryWaitTimer);
+            entryWaitTimer = null;
+            return;
+          }
+          if (existsSync(resolved.entry)) {
+            if (entryWaitTimer) clearInterval(entryWaitTimer);
+            entryWaitTimer = null;
+            startWatcher();
+          }
+        }, 1000);
+        entryWaitTimer.unref?.();
+      }
+      return;
+    }
+    // The entry exists now: clear any leftover poll timer from a prior
+    // failed startWatcher invocation.
+    if (entryWaitTimer) {
+      clearInterval(entryWaitTimer);
+      entryWaitTimer = null;
+    }
+    watcher = watch({
+      ...rolldownInputOptions(resolved),
+      output: { file: resolved.outFile, format: "esm" },
+    });
+    watcher.on("event", (event) => {
+      if (event.code === "BUNDLE_END") {
+        // rolldown requires the per-build result to be closed to avoid leaks.
+        event.result.close().catch(() => {});
+        // The event type ("ready" vs "rebuild") is decided inside
+        // `emitBuildSucceeded` *after* the inspection await, based on
+        // whether any prior broadcast actually landed (see the
+        // `firstBroadcast` comment for why pinning the type at this
+        // schedule point would be wrong under inspection races).
+        void emitBuildSucceeded();
+      } else if (event.code === "ERROR") {
+        // Rolldown's ERROR events don't always carry a `result`:
+        // when the failure is in the parse/resolve phase there's
+        // no per-build output to close, so `event.result` is
+        // `undefined`. Calling `.close()` then would throw
+        // synchronously, escape this listener, and permanently
+        // wedge the watcher so the SPA stays on the prior `error`
+        // state forever even after the user fixes their code.
+        // Optional-chain so we still close any result that *is*
+        // present (avoiding the leak rolldown warns about) without
+        // blowing up the watcher when none is.
+        event.result?.close().catch(() => {});
+        // Bump the seq so a still-in-flight `emitBuildSucceeded`
+        // from a *prior* BUNDLE_END drops its broadcast when its
+        // inspection finally resolves. Without this, the older
+        // success would land on top of this error and clobber
+        // `lastEvent`/`configHash`, leaving the SPA showing a
+        // healthy rebuild while the actual latest build state is
+        // a compile error. The successful-rebuild path bumps the
+        // same counter inside `emitBuildSucceeded`.
+        buildSeq += 1;
+        broadcast({
+          type: "error",
+          message:
+            event.error instanceof Error
+              ? event.error.message
+              : String(event.error),
+        });
+      }
+    });
+  }
+
+  return {
+    subscribe(fn) {
+      subscribers.add(fn);
+      // Replay the last broadcast so a late-mounting subscriber (an
+      // `/api/dev/events` SSE client opening after the first BUNDLE_END,
+      // or `buildStudioApp`'s dispatch subscriber registering after
+      // entry-wait recovery) sees current state without waiting for
+      // the next rebuild.
+      //
+      // Wrapped in the same defensive try/catch as `broadcast` so a
+      // throw inside the subscriber (typically an SSE controller that
+      // closed mid-replay: `controller.enqueue` on a closed stream
+      // throws) doesn't propagate out of `subscribe()` and crash
+      // whoever just registered. One bad subscriber must not be able
+      // to break HMR initialisation for the rest of the process.
+      if (lastEvent) {
+        try {
+          fn(lastEvent);
+        } catch {
+          // Swallow: subscribers own their own teardown; we just
+          // shouldn't poison their `subscribe()` call site.
+        }
+      }
+      startWatcher();
+      return () => {
+        subscribers.delete(fn);
+      };
+    },
+    getCurrentConfigHash() {
+      // Returns the hash of the *last successful* build, NOT
+      // `lastEvent.configHash`. The two diverge after an ERROR:
+      // `lastEvent` becomes the error event (no `configHash`), but
+      // `.arkor/build/index.mjs` still holds the previous successful
+      // bundle bytes, and a child spawned in that window is running
+      // those bytes. Returning the cached success hash keeps
+      // `/api/train` registering accurate spawn-time hashes so the
+      // next successful BUNDLE_END can route hot-swap vs restart
+      // correctly. `null` only before the first successful build (or
+      // a build that wasn't inspectable).
+      return lastSuccessConfigHash;
+    },
+    getCurrentArtifactHash() {
+      // Fresh stat (not the cached `lastEvent.hash`). The cached
+      // hash describes the bytes the watcher last broadcast about,
+      // but the on-disk artefact may be newer (a BUNDLE_END is
+      // queued, file already written, inspection still pending) or
+      // older (next BUNDLE_END hasn't fired yet but the user just
+      // edited and saved). For the registry's pre-ready-spawn gate
+      // we want "what bytes will the child's `await import()` see
+      // RIGHT NOW".
+      //
+      // `fingerprintOrNull` does ONE statSync and returns null on
+      // failure, preserving the documented contract. A previous
+      // implementation here did `statSync(...)` first and then
+      // called `fingerprint()` (which has a `Date.now()` fallback
+      // baked in for SSE dedup uniqueness). That double-stat
+      // raced: if the file disappeared between the two calls we'd
+      // return a Date.now()-derived hash that doesn't describe any
+      // real bytes, silently violating the "null on stat failure"
+      // contract dispatchRebuild relies on for its SIGTERM-restart
+      // routing.
+      return fingerprintOrNull(resolved.outFile);
+    },
+    getCurrentArtifactContentHash() {
+      // Companion to `getCurrentArtifactHash` for the registry's
+      // pre-ready-spawn equality gate. Reads + sha256s the file
+      // at call time so the result describes the exact bytes the
+      // just-spawned child will see in its `await import()`.
+      // Same null-on-failure contract: caller treats null as
+      // "force restart" (the conservative default).
+      return contentHashOrNull(resolved.outFile);
+    },
+    getLastEventType() {
+      // `lastEvent` is the latest broadcast: `ready` / `rebuild` /
+      // `error`. Returning the type lets `/api/manifest`'s HMR
+      // fast path skip serving the stale built artefact when the
+      // watcher is currently in `error` (current source fails to
+      // compile), so the SPA's poll loop doesn't paper over the
+      // SSE-surfaced error.
+      return lastEvent?.type ?? null;
+    },
+    async dispose() {
+      disposed = true;
+      subscribers.clear();
+      if (entryWaitTimer) {
+        clearInterval(entryWaitTimer);
+        entryWaitTimer = null;
+      }
+      if (watcher) {
+        const w = watcher;
+        watcher = null;
+        await w.close().catch(() => {});
+      }
+    },
+  };
+}
diff --git a/packages/arkor/src/studio/manifest.ts b/packages/arkor/src/studio/manifest.ts
index 72452da8..677115e3 100644
--- a/packages/arkor/src/studio/manifest.ts
+++ b/packages/arkor/src/studio/manifest.ts
@@ -1,6 +1,11 @@
-import { pathToFileURL } from "node:url";
+import { existsSync } from "node:fs";
 import { runBuild } from "../cli/commands/build";
-import { isArkor } from "../core/arkor";
+import { hashJobConfig } from "../core/configHash";
+import { moduleCacheBustUrl } from "../core/moduleCacheBust";
+import {
+  findTrainerInModule,
+  getTrainerInspection,
+} from "../core/trainerInspection";
 
 /**
  * Wire-friendly snapshot of the user's `createArkor({...})` manifest. Mirrors
@@ -9,28 +14,120 @@ import { isArkor } from "../core/arkor";
  */
 export interface ManifestSummary {
   trainer: { name: string } | null;
+  /**
+   * Stable hash of the trainer's cloud-side `JobConfig`. Used by HMR to
+   * decide whether a rebuild only changed in-process callbacks (hash
+   * unchanged → hot-swap) or also touched cloud-side training config
+   * (hash changed → restart with `requestEarlyStop`). `null` when no
+   * inspectable trainer is present.
+   */
+  configHash: string | null;
   // future: deploy: { name: string } | null;
   // future: eval:   { name: string } | null;
 }
 
-const EMPTY: ManifestSummary = { trainer: null };
+const EMPTY: ManifestSummary = { trainer: null, configHash: null };
 
 /**
- * Build the user's `src/arkor/index.ts` and import the artifact to extract a
- * serialisable summary of its manifest. The Studio UI hits this on home-page
- * load to show *what* the project contains (just the trainer name today;
- * deploy / eval slots when those primitives land).
+ * Dynamic-import an already-built artefact and pull a serialisable
+ * summary off its trainer. Cache-bust the URL so Node's ESM loader
+ * returns the fresh module text rather than a stale evaluation.
  *
- * Each call rebuilds and re-imports so edits to the user's source surface
- * without restarting Studio. The import URL carries a cache-bust query so
- * Node's ESM cache doesn't return a stale module.
+ * Split out of `readManifestSummary` so callers that already triggered a
+ * build (the HMR coordinator hands the SPA an `outFile` after each
+ * `BUNDLE_END`) can inspect the artefact without paying for a redundant
+ * `runBuild()`.
  */
-export async function readManifestSummary(cwd: string): Promise<ManifestSummary> {
+export async function summariseBuiltManifest(
+  outFile: string,
+): Promise<ManifestSummary> {
+  // mtime+ctime+size cache-bust (vs `Date.now()`): the SPA polls
+  // `/api/manifest` every ~5 s, so a `Date.now()` suffix would
+  // accumulate one ESM module record per poll across a long
+  // `arkor dev` session: Node's loader has no eviction. Keying on
+  // the artefact's stat fingerprint (via `moduleCacheBustUrl`) collapses
+  // unchanged-poll reads onto the existing record.
+  const mod = (await import(moduleCacheBustUrl(outFile))) as Record<
+    string,
+    unknown
+  >;
+  // Walk every trainer export shape `runner.ts` accepts via the
+  // shared helper (named `arkor`, named `trainer`, default Arkor
+  // manifest, `default.trainer`) so manifest summary, HMR routing,
+  // and runtime execution all agree about which exports count as a
+  // trainer.
+  const trainer = findTrainerInModule(mod);
+  if (!trainer) return EMPTY;
+  // Trainer name renders in the UI even for hand-rolled trainers
+  // that bypass `createTrainer` and therefore don't carry the SDK
+  // inspection brand. The brand is required only for the
+  // `configHash` used by HMR routing; without it, HMR conservatively
+  // SIGTERM-restarts on every rebuild (correct fallback).
+  const name =
+    typeof trainer.name === "string" ? trainer.name : "(unnamed trainer)";
+  const inspection = getTrainerInspection(trainer);
+  return {
+    trainer: { name },
+    configHash: inspection ? hashJobConfig(inspection.config) : null,
+  };
+}
+
+export interface ReadManifestOptions {
+  /**
+   * HMR-aware fast path: when set and the file exists, skip the
+   * `runBuild()` call and inspect this artefact directly. The HMR
+   * coordinator already keeps `.arkor/build/index.mjs` continuously
+   * fresh via its rolldown watcher, so re-running `runBuild()` on
+   * every `/api/manifest` poll (every ~5 s + on every rebuild SSE
+   * event) is wasted CPU AND races the watcher writing to the
+   * same path. Pre-existence is checked with `existsSync` so the
+   * very first poll on a fresh scaffold (watcher's first
+   * BUNDLE_END hasn't completed yet) still bootstraps via
+   * `runBuild()`. Once the file appears, subsequent polls skip
+   * the rebuild.
+   *
+   * Pass `coordinator.outFile`-equivalent (e.g.
+   * `resolveBuildEntry({ cwd }).outFile`) here when the server has
+   * an active `HmrCoordinator`; leave undefined when HMR is off so
+   * the build path runs as before.
+   */
+  prebuiltOutFile?: string;
+}
+
+/**
+ * Build the user's `src/arkor/index.ts` and import the artifact to
+ * extract a serialisable summary of its manifest. The Studio UI hits
+ * this on home-page load to show *what* the project contains (just the
+ * trainer name today; deploy / eval slots when those primitives land).
+ *
+ * By default each call rebuilds and re-imports so edits to the user's
+ * source surface without restarting Studio. When `prebuiltOutFile` is
+ * supplied and the file exists (HMR-enabled servers), the `runBuild()`
+ * step is bypassed (see `ReadManifestOptions.prebuiltOutFile`).
+ */
+export async function readManifestSummary(
+  cwd: string,
+  opts: ReadManifestOptions = {},
+): Promise<ManifestSummary> {
+  if (opts.prebuiltOutFile && existsSync(opts.prebuiltOutFile)) {
+    // Race recovery: rolldown's watcher writes
+    // `.arkor/build/index.mjs` non-atomically. `existsSync` flips to
+    // `true` the instant the file is created, but a `/api/manifest`
+    // poll landing during the flush window would then try to
+    // `await import(...)` partial bytes and surface as a 500
+    // SyntaxError in the UI. The legacy path awaited its own
+    // `runBuild()` before importing, so this race didn't exist
+    // there. Fall through to a fresh `runBuild()` on import failure
+    // (which produces a coherent artifact under our control). The
+    // fallback is best-effort: if `runBuild()` itself also throws
+    // (real user syntax error), rethrowing IS the right surface for
+    // `/api/manifest` to render the error inline.
+    try {
+      return await summariseBuiltManifest(opts.prebuiltOutFile);
+    } catch {
+      // fall through to runBuild()
+    }
+  }
   const { outFile } = await runBuild({ cwd, quiet: true });
-  const url = `${pathToFileURL(outFile).href}?t=${Date.now()}`;
-  const mod = (await import(url)) as Record<string, unknown>;
-  const candidate = mod.arkor ?? mod.default;
-  if (!isArkor(candidate)) return EMPTY;
-  const trainer = candidate.trainer ? { name: candidate.trainer.name } : null;
-  return { trainer };
+  return summariseBuiltManifest(outFile);
 }
diff --git a/packages/arkor/src/studio/server.test.ts b/packages/arkor/src/studio/server.test.ts
index 214889d9..ed8b9298 100644
--- a/packages/arkor/src/studio/server.test.ts
+++ b/packages/arkor/src/studio/server.test.ts
@@ -11,6 +11,7 @@ import {
 import { tmpdir } from "node:os";
 import { join, resolve } from "node:path";
 import { buildStudioApp } from "./server";
+import type { HmrCoordinator, HmrEvent } from "./hmr";
 import { writeCredentials } from "../core/credentials";
 import { readState, writeState } from "../core/state";
 import {
@@ -82,14 +83,14 @@ describe("Studio server", () => {
         baseUrl: "http://mock",
         assetsDir,
         autoAnonymous: false,
-        // @ts-expect-error — intentionally omitted to assert the runtime guard
+        // @ts-expect-error: intentionally omitted to assert the runtime guard
         studioToken: undefined,
       }),
     ).toThrow(/studioToken/);
   });
 
   it("HTML-escapes special characters in the studio token before injecting", async () => {
-    // Branch coverage for `htmlAttrEscape` — a defensive guard against
+    // Branch coverage for `htmlAttrEscape`: a defensive guard against
     // a token that contains `<`, `>`, `&`, `"`, `'`. randomBytes/base64url
     // never produces these, but the helper must still escape them so a
     // future token strategy can't break index.html parsing or open a
@@ -111,7 +112,7 @@ describe("Studio server", () => {
     expect(html).toContain(
       '<meta name="arkor-studio-token" content="&lt;&gt;&amp;&quot;&#39;-1234567890ab">',
     );
-    // The raw exotic token must not leak into HTML — an attacker who
+    // The raw exotic token must not leak into HTML: an attacker who
     // could influence the token (hypothetical) shouldn't be able to
     // inject markup.
     expect(html).not.toMatch(/content="<>/);
@@ -133,6 +134,48 @@ describe("Studio server", () => {
     expect(html.indexOf("arkor-studio-token")).toBeLessThan(
       html.indexOf("</head>"),
     );
+    // HMR meta tag must NOT appear when no coordinator was supplied.
+    // The SPA reads this flag to decide whether to open
+    // `/api/dev/events`; a stray "true" here would make every prod
+    // session retry against the 404 indefinitely.
+    expect(html).not.toContain("arkor-hmr-enabled");
+  });
+
+  it("injects <meta name=\"arkor-hmr-enabled\"> when an HMR coordinator is supplied", async () => {
+    // Regression: the SPA can't tell dev-mode usage from prod-mode
+    // usage at runtime: `vite build` ships with
+    // `import.meta.env.DEV === false`, so a build-time DEV gate inside
+    // the SPA bundle would (wrongly) suppress HMR even in real
+    // `arkor dev` sessions. The server-side flag is `true` exactly
+    // when `arkor dev` wired in an HMR coordinator. Verify it lands
+    // in `<head>` next to the studio-token tag.
+    const fakeHmr = {
+      subscribe: () => () => undefined,
+      getCurrentConfigHash: () => null,
+      getCurrentArtifactHash: () => null,
+      getCurrentArtifactContentHash: () => null,
+      getLastEventType: () => null,
+      async dispose() {},
+    };
+    const app = buildStudioApp({
+      baseUrl: "http://mock",
+      assetsDir,
+      autoAnonymous: false,
+      studioToken: STUDIO_TOKEN,
+      cwd: trainCwd,
+      hmr: fakeHmr,
+    });
+    const res = await app.request("/", {
+      headers: { host: "127.0.0.1:4000" },
+    });
+    expect(res.status).toBe(200);
+    const html = await res.text();
+    expect(html).toContain(
+      `<meta name="arkor-hmr-enabled" content="true">`,
+    );
+    expect(html.indexOf("arkor-hmr-enabled")).toBeLessThan(
+      html.indexOf("</head>"),
+    );
   });
 
   it("serves non-html assets with the correct content-type", async () => {
@@ -356,7 +399,7 @@ describe("Studio server", () => {
     expect(res.status).toBe(403);
   });
 
-  // Regression for ENG-404 — `path.resolve` doesn't follow symlinks, so a
+  // Regression for ENG-404: `path.resolve` doesn't follow symlinks, so a
   // link inside the project directory pointing outside it would previously
   // pass the containment check and be handed to `arkor start` (which would
   // then dlopen the link's target).
@@ -433,7 +476,7 @@ describe("Studio server", () => {
     expect(body.error).toMatch(/does not exist/);
   });
 
-  // Regression for ENG-356 — `/api/train` previously resolved the bundled
+  // Regression for ENG-356: `/api/train` previously resolved the bundled
   // bin at `<pkg>/bin.mjs` (one level above `dist/`), which never existed.
   // The DI'd `binPath` lets us assert (a) a working bin streams its stdout
   // through the response, and (b) a missing bin surfaces ENOENT-grade errors
@@ -468,6 +511,12 @@ process.exit(0);
         body: JSON.stringify({}),
       });
       expect(res.status).toBe(200);
+      // Regression: the spawned subprocess's pid is exposed via the
+      // `X-Arkor-Train-Pid` response header so the SPA can scope HMR
+      // restart events to its own child (a multi-tab broadcast can
+      // contain mixed restart/hot-swap targets across siblings).
+      const pidHeader = res.headers.get("x-arkor-train-pid");
+      expect(pidHeader).toMatch(/^\d+$/);
       const text = await res.text();
       expect(text).toContain("[fake-bin]");
       // The bin receives `start` as the first non-flag arg.
@@ -548,6 +597,763 @@ process.exit(0);
       expect(text).toContain("exit=");
       expect(text).not.toContain("exit=0");
     });
+
+    it("captures the spawn-time configHash from the HMR coordinator (no extra rebuild)", async () => {
+      // Regression: `/api/train` previously called `readManifestSummary`
+      // which ran a full `runBuild()` per spawn: wasteful and racy
+      // against the HMR watcher writing the same `.arkor/build/index.mjs`.
+      // The new server reads the cached hash from
+      // `coordinator.getCurrentConfigHash()` instead. We assert the
+      // call happens (so a rebuild is *not* required) by exposing the
+      // spy count on the fake coordinator.
+      await writeCredentials(ANON_CREDS);
+      let getCurrentCalls = 0;
+      const fakeHmr = {
+        subscribe: () => () => undefined,
+        getCurrentConfigHash: () => {
+          getCurrentCalls += 1;
+          return "spawn-time-hash";
+        },
+        getCurrentArtifactHash: () => "spawn-artefact-hash",
+        getCurrentArtifactContentHash: () => "spawn-artefact-content-hash",
+        getLastEventType: () => null,
+        async dispose() {},
+      };
+      const fakeBin = join(trainCwd, "fake-bin.mjs");
+      writeFileSync(fakeBin, `process.exit(0);\n`);
+      const app = buildStudioApp({
+        baseUrl: "http://mock",
+        assetsDir,
+        autoAnonymous: false,
+        studioToken: STUDIO_TOKEN,
+        cwd: trainCwd,
+        binPath: fakeBin,
+        hmr: fakeHmr,
+      });
+      const res = await app.request("/api/train", {
+        method: "POST",
+        headers: {
+          host: "127.0.0.1:4000",
+          "x-arkor-studio-token": STUDIO_TOKEN,
+          "content-type": "application/json",
+        },
+        body: JSON.stringify({}),
+      });
+      expect(res.status).toBe(200);
+      // Drain the body so the close handler runs and the test
+      // doesn't leak the subprocess.
+      await res.text();
+      expect(getCurrentCalls).toBe(1);
+    });
+
+    it("/api/train job-id parser ignores stderr so a `Started job <token>` line on stderr can't hijack the cancel POST", async () => {
+      // Regression: the job-id detector used to consume both
+      // stdout AND stderr through a shared `onChunk` + shared
+      // line buffer. A user `console.error("Started job <token>")`
+      // on stderr would then poison the buffer first; the real
+      // stdout marker arrives later but our `getJobId(...) === null`
+      // gate has already short-circuited subsequent scans, so
+      // Stop-training POSTs cancel for the wrong (decoy) job and
+      // the real one keeps running: silent cloud orphan.
+      // Splitting into a stdout-only `onStdoutChunk` parser and a
+      // forward-only `onStderrChunk` makes stderr unable to
+      // populate `jobId` regardless of what the user logs there.
+      await writeCredentials(ANON_CREDS);
+      await writeState(
+        {
+          orgSlug: "stderr-test-org",
+          projectSlug: "stderr-test-project",
+          projectId: "p-stderr",
+        },
+        trainCwd,
+      );
+      // Bin emits a decoy `Started job <token>` to STDERR first
+      // (would poison the shared buffer), then the canonical real
+      // line to STDOUT, then hangs. With the split we expect the
+      // real id to win; with the bug the decoy would win.
+      const REAL_JOB_ID = "real-job-id";
+      const DECOY_JOB_ID = "decoy-from-stderr";
+      const fakeBin = join(trainCwd, "stderr-decoy-bin.mjs");
+      // The real runner prefixes its canonical line with the
+      // per-spawn nonce the server injected via
+      // ARKOR_JOB_ID_MARKER_NONCE; the decoy on stderr deliberately
+      // uses the nonce too (worst-case: a user who somehow learned
+      // the nonce still can't hijack the parser by writing to the
+      // wrong stream). With the parser correctly stdout-only the
+      // real line wins regardless.
+      writeFileSync(
+        fakeBin,
+        `const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? "";
+        process.stderr.write(\`[arkor:\${nonce}] Started job ${DECOY_JOB_ID}\\n\`);
+        // Slight delay so stderr lands first.
+        setTimeout(() => {
+          process.stdout.write(\`[arkor:\${nonce}] Started job ${REAL_JOB_ID}\\n\`);
+        }, 30);
+        process.on("SIGTERM", () => {});
+        setInterval(() => {}, 60_000);
+        `,
+      );
+      const cancelHits: Array<{ url: string }> = [];
+      const ORIG_FETCH = globalThis.fetch;
+      globalThis.fetch = (async (
+        input: Parameters<typeof fetch>[0],
+        init?: Parameters<typeof fetch>[1],
+      ) => {
+        const url = typeof input === "string" ? input : input.toString();
+        const method = init?.method ?? "GET";
+        if (method === "POST" && /\/v1\/jobs\/[^/]+\/cancel/.test(url)) {
+          cancelHits.push({ url });
+          return new Response(JSON.stringify({ ok: true }), {
+            status: 200,
+            headers: { "content-type": "application/json" },
+          });
+        }
+        return new Response("not found", { status: 404 });
+      }) as typeof fetch;
+
+      try {
+        const app = buildStudioApp({
+          baseUrl: "http://mock-cloud-api",
+          assetsDir,
+          autoAnonymous: false,
+          studioToken: STUDIO_TOKEN,
+          cwd: trainCwd,
+          binPath: fakeBin,
+        });
+        const trainRes = await app.request("/api/train", {
+          method: "POST",
+          headers: {
+            host: "127.0.0.1:4000",
+            "x-arkor-studio-token": STUDIO_TOKEN,
+            "content-type": "application/json",
+          },
+          body: JSON.stringify({}),
+        });
+        expect(trainRes.status).toBe(200);
+        // Drain until the REAL line is in the body. Both the
+        // decoy and the real line forward through to the SPA log
+        // stream, so both lines show up here regardless of which
+        // (if any) the parser captures.
+        const reader = trainRes.body!.getReader();
+        const decoder = new TextDecoder();
+        let buf = "";
+        while (!buf.includes(`Started job ${REAL_JOB_ID}`)) {
+          const { value, done } = await reader.read();
+          if (done) break;
+          buf += decoder.decode(value, { stream: true });
+        }
+        await reader.cancel();
+        await new Promise((r) => setTimeout(r, 200));
+
+        // The cancel POST must target the REAL id. With the bug
+        // the decoy would have been recorded first → cancelHits[0]
+        // would contain `decoy-from-stderr` instead.
+        expect(cancelHits).toHaveLength(1);
+        expect(cancelHits[0]?.url).toContain(`/v1/jobs/${REAL_JOB_ID}/cancel`);
+        expect(cancelHits[0]?.url).not.toContain(DECOY_JOB_ID);
+      } finally {
+        globalThis.fetch = ORIG_FETCH;
+      }
+    });
+
+    it("/api/train cancel POSTs cloud /v1/jobs/:id/cancel so the cloud job is released even though SIGKILL bypasses the runner's shutdown handlers", async () => {
+      // Regression: SIGKILL kills the runner without giving its
+      // `installShutdownHandlers` a chance to issue the cloud
+      // `cancel()` POST itself. Without a server-side equivalent
+      // the cloud job sits in "running" until TTL/reaper, so a
+      // user clicking "Stop training" silently keeps consuming
+      // GPU spend. The fix parses the runner's `Started job <id>`
+      // stdout line, records the id on the registry entry, and
+      // fires a fire-and-forget POST to cloud-api on cancel
+      // *before* SIGKILLing.
+      await writeCredentials(ANON_CREDS);
+      // The cancel POST reads scope from `.arkor/state.json` (not
+      // from the anon creds' orgSlug; that's a different code
+      // path). Pre-seed so the POST can address the cloud job.
+      await writeState(
+        {
+          orgSlug: "cancel-test-org",
+          projectSlug: "cancel-test-project",
+          projectId: "p-cancel",
+        },
+        trainCwd,
+      );
+      // Bin prints the canonical "Started job <id>" line then
+      // hangs (just like the real runner after `start()` resolves).
+      // The id is the same kind of identifier cloud-api would
+      // mint: an opaque string we'll verify shows up in the cancel
+      // POST URL below.
+      const FAKE_JOB_ID = "j-cancel-test";
+      const fakeBin = join(trainCwd, "started-job-bin.mjs");
+      // Prefix the marker with the per-spawn nonce the server
+      // injected via ARKOR_JOB_ID_MARKER_NONCE: that's the only
+      // shape the server's parser accepts, since user code can't
+      // know the nonce ahead of time (real runner deletes the env
+      // var before importing user modules).
+      writeFileSync(
+        fakeBin,
+        `const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? "";
+        process.stdout.write(\`[arkor:\${nonce}] Started job ${FAKE_JOB_ID}\\n\`);
+        process.on("SIGTERM", () => {});
+        setInterval(() => {}, 60_000);
+        `,
+      );
+      // Capture the cloud-api requests so we can verify the
+      // server's cancel POST landed with the right job id +
+      // scope. The default fetch in this suite would 404 our POST
+      // and leave `cancelHits` empty.
+      const cancelHits: Array<{ url: string; method: string }> = [];
+      const ORIG_FETCH = globalThis.fetch;
+      globalThis.fetch = (async (
+        input: Parameters<typeof fetch>[0],
+        init?: Parameters<typeof fetch>[1],
+      ) => {
+        const url = typeof input === "string" ? input : input.toString();
+        const method = init?.method ?? "GET";
+        if (
+          method === "POST" &&
+          url.includes(`/v1/jobs/${FAKE_JOB_ID}/cancel`)
+        ) {
+          cancelHits.push({ url, method });
+          return new Response(JSON.stringify({ ok: true }), {
+            status: 200,
+            headers: { "content-type": "application/json" },
+          });
+        }
+        // Catch-all default: anything else 404s, which would
+        // surface as a test-side failure if our cancel POST
+        // doesn't match the expected URL shape.
+        return new Response("not found", { status: 404 });
+      }) as typeof fetch;
+
+      try {
+        const app = buildStudioApp({
+          baseUrl: "http://mock-cloud-api",
+          assetsDir,
+          autoAnonymous: false,
+          studioToken: STUDIO_TOKEN,
+          cwd: trainCwd,
+          binPath: fakeBin,
+        });
+        const trainRes = await app.request("/api/train", {
+          method: "POST",
+          headers: {
+            host: "127.0.0.1:4000",
+            "x-arkor-studio-token": STUDIO_TOKEN,
+            "content-type": "application/json",
+          },
+          body: JSON.stringify({}),
+        });
+        expect(trainRes.status).toBe(200);
+        // Read enough of the body to ensure the runner's
+        // `Started job <id>` chunk has been processed by the
+        // server's stdout parser (without this, cancel could
+        // race ahead of the parser and find no jobId on the
+        // registry → no cancel POST → false test failure).
+        const reader = trainRes.body!.getReader();
+        const decoder = new TextDecoder();
+        let buf = "";
+        while (!buf.includes(`Started job ${FAKE_JOB_ID}`)) {
+          const { value, done } = await reader.read();
+          if (done) break;
+          buf += decoder.decode(value, { stream: true });
+        }
+        // Trigger cancel: should fire the cloud POST + SIGKILL.
+        await reader.cancel();
+        // Fire-and-forget: give the void IIFE a tick to actually
+        // dispatch the fetch + receive the 200 response.
+        await new Promise((r) => setTimeout(r, 200));
+
+        expect(cancelHits).toHaveLength(1);
+        expect(cancelHits[0]?.url).toContain(`/v1/jobs/${FAKE_JOB_ID}/cancel`);
+        // Scope is required by the cloud-api contract: comes from
+        // `.arkor/state.json` (seeded above), not the anon creds.
+        expect(cancelHits[0]?.url).toContain("orgSlug=cancel-test-org");
+        expect(cancelHits[0]?.url).toContain("projectSlug=cancel-test-project");
+      } finally {
+        globalThis.fetch = ORIG_FETCH;
+      }
+    });
+
+    it("/api/train cancel uses the spawn-time scope from the registry even when state.json was deleted mid-training", async () => {
+      // Regression: the cancel handler used to re-read
+      // `.arkor/state.json` at stop time to address the cloud cancel
+      // POST. If the user removed or made the file unreadable
+      // mid-training (rm -rf .arkor, accidental git clean -fdx, fs
+      // unmounted), the read returned null and the handler silently
+      // skipped the POST: the local SIGKILL still tore down the
+      // subprocess but the cloud job orphaned until TTL/reaper. The
+      // fix captures `{orgSlug, projectSlug}` on the registry entry
+      // at spawn time so the cancel POST is decoupled from
+      // mutable filesystem state.
+      await writeCredentials(ANON_CREDS);
+      await writeState(
+        {
+          orgSlug: "scope-pin-org",
+          projectSlug: "scope-pin-project",
+          projectId: "p-scope-pin",
+        },
+        trainCwd,
+      );
+      const FAKE_JOB_ID = "j-scope-pin";
+      const fakeBin = join(trainCwd, "scope-pin-bin.mjs");
+      writeFileSync(
+        fakeBin,
+        `const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? "";
+        process.stdout.write(\`[arkor:\${nonce}] Started job ${FAKE_JOB_ID}\\n\`);
+        process.on("SIGTERM", () => {});
+        setInterval(() => {}, 60_000);
+        `,
+      );
+      const cancelHits: Array<{ url: string }> = [];
+      const ORIG_FETCH = globalThis.fetch;
+      globalThis.fetch = (async (
+        input: Parameters<typeof fetch>[0],
+        init?: Parameters<typeof fetch>[1],
+      ) => {
+        const url = typeof input === "string" ? input : input.toString();
+        const method = init?.method ?? "GET";
+        if (method === "POST" && /\/v1\/jobs\/[^/]+\/cancel/.test(url)) {
+          cancelHits.push({ url });
+          return new Response(JSON.stringify({ ok: true }), {
+            status: 200,
+            headers: { "content-type": "application/json" },
+          });
+        }
+        return new Response("not found", { status: 404 });
+      }) as typeof fetch;
+
+      try {
+        const app = buildStudioApp({
+          baseUrl: "http://mock-cloud-api",
+          assetsDir,
+          autoAnonymous: false,
+          studioToken: STUDIO_TOKEN,
+          cwd: trainCwd,
+          binPath: fakeBin,
+        });
+        const trainRes = await app.request("/api/train", {
+          method: "POST",
+          headers: {
+            host: "127.0.0.1:4000",
+            "x-arkor-studio-token": STUDIO_TOKEN,
+            "content-type": "application/json",
+          },
+          body: JSON.stringify({}),
+        });
+        expect(trainRes.status).toBe(200);
+        const reader = trainRes.body!.getReader();
+        const decoder = new TextDecoder();
+        let buf = "";
+        while (!buf.includes(`Started job ${FAKE_JOB_ID}`)) {
+          const { value, done } = await reader.read();
+          if (done) break;
+          buf += decoder.decode(value, { stream: true });
+        }
+        // The hostile mid-training mutation: nuke the state file
+        // that the OLD code would have re-read at cancel time.
+        rmSync(join(trainCwd, ".arkor"), { recursive: true, force: true });
+        // Cancel: under the bug, the handler's state read returns
+        // null and the cancel POST is silently skipped. With the
+        // fix, the registry-pinned scope is used and the POST goes
+        // out anyway.
+        await reader.cancel();
+        await new Promise((r) => setTimeout(r, 200));
+
+        expect(cancelHits).toHaveLength(1);
+        expect(cancelHits[0]?.url).toContain(`/v1/jobs/${FAKE_JOB_ID}/cancel`);
+        expect(cancelHits[0]?.url).toContain("orgSlug=scope-pin-org");
+        expect(cancelHits[0]?.url).toContain("projectSlug=scope-pin-project");
+      } finally {
+        globalThis.fetch = ORIG_FETCH;
+      }
+    });
+
+    it("/api/train cancel falls back to reading .arkor/state.json when no scope was captured at spawn time (first-run anon)", async () => {
+      // Regression: capturing the cloud scope at spawn time covered
+      // the "user deleted state mid-training" hazard but broke the
+      // common first-run anonymous flow. On a fresh project,
+      // `.arkor/state.json` is created by `ensureProjectState` from
+      // INSIDE the child during `trainer.start()`, i.e. AFTER spawn.
+      // The spawn-time `readState(trainCwd)` therefore returns null,
+      // `pinnedScope` stays null, and the previous code silently
+      // skipped the cancel POST: local SIGKILL tore down the
+      // subprocess but the cloud job orphaned. The fix uses the
+      // pinned spawn-time scope WHEN PRESENT (delete-mid-training
+      // hazard) and falls back to reading at cancel time when it
+      // was null (first-run anon).
+      await writeCredentials(ANON_CREDS);
+      // Deliberately DO NOT seed state at spawn time. The bin writes
+      // it after spawn but BEFORE its `Started job <id>` line lands,
+      // simulating the order `ensureProjectState`/`trainer.start()`
+      // produce in a real anon first-run.
+      const FAKE_JOB_ID = "j-late-scope";
+      const stateDir = join(trainCwd, ".arkor");
+      const statePath = join(stateDir, "state.json");
+      const fakeBin = join(trainCwd, "late-scope-bin.mjs");
+      writeFileSync(
+        fakeBin,
+        `import { mkdirSync, writeFileSync } from "node:fs";
+        const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? "";
+        // Mirror runner order: state appears AFTER spawn, but BEFORE
+        // the Started job line so the cancel-time read sees it.
+        mkdirSync(${JSON.stringify(stateDir)}, { recursive: true });
+        writeFileSync(${JSON.stringify(statePath)}, JSON.stringify({
+          orgSlug: "late-scope-org",
+          projectSlug: "late-scope-project",
+          projectId: "p-late-scope",
+        }));
+        process.stdout.write(\`[arkor:\${nonce}] Started job ${FAKE_JOB_ID}\\n\`);
+        process.on("SIGTERM", () => {});
+        setInterval(() => {}, 60_000);
+        `,
+      );
+      const cancelHits: Array<{ url: string }> = [];
+      const ORIG_FETCH = globalThis.fetch;
+      globalThis.fetch = (async (
+        input: Parameters<typeof fetch>[0],
+        init?: Parameters<typeof fetch>[1],
+      ) => {
+        const url = typeof input === "string" ? input : input.toString();
+        const method = init?.method ?? "GET";
+        if (method === "POST" && /\/v1\/jobs\/[^/]+\/cancel/.test(url)) {
+          cancelHits.push({ url });
+          return new Response(JSON.stringify({ ok: true }), {
+            status: 200,
+            headers: { "content-type": "application/json" },
+          });
+        }
+        return new Response("not found", { status: 404 });
+      }) as typeof fetch;
+
+      try {
+        const app = buildStudioApp({
+          baseUrl: "http://mock-cloud-api",
+          assetsDir,
+          autoAnonymous: false,
+          studioToken: STUDIO_TOKEN,
+          cwd: trainCwd,
+          binPath: fakeBin,
+        });
+        const trainRes = await app.request("/api/train", {
+          method: "POST",
+          headers: {
+            host: "127.0.0.1:4000",
+            "x-arkor-studio-token": STUDIO_TOKEN,
+            "content-type": "application/json",
+          },
+          body: JSON.stringify({}),
+        });
+        expect(trainRes.status).toBe(200);
+        const reader = trainRes.body!.getReader();
+        const decoder = new TextDecoder();
+        let buf = "";
+        while (!buf.includes(`Started job ${FAKE_JOB_ID}`)) {
+          const { value, done } = await reader.read();
+          if (done) break;
+          buf += decoder.decode(value, { stream: true });
+        }
+        await reader.cancel();
+        await new Promise((r) => setTimeout(r, 250));
+
+        // Under the bug there were 0 cancel hits (pinned scope null
+        // → skip). With the fix the cancel-time read recovers the
+        // scope the child just wrote.
+        expect(cancelHits).toHaveLength(1);
+        expect(cancelHits[0]?.url).toContain(`/v1/jobs/${FAKE_JOB_ID}/cancel`);
+        expect(cancelHits[0]?.url).toContain("orgSlug=late-scope-org");
+        expect(cancelHits[0]?.url).toContain("projectSlug=late-scope-project");
+      } finally {
+        globalThis.fetch = ORIG_FETCH;
+      }
+    });
+
+    it("/api/train job-id parser ignores stdout lines that lack the per-spawn nonce prefix so user code can't forge a `Started job` marker", async () => {
+      // Regression: the parser used to match any `Started job <id>`
+      // line in stdout. User code (which runs inside the runner's
+      // `await import(userEntry)` chain and therefore shares the
+      // child's stdout) could write `console.log("Started job
+      // attacker-chosen-id")` before the runner's canonical line
+      // arrives, the parser would record the attacker's id, and
+      // Stop-training would POST `/v1/jobs/<attacker-id>/cancel`
+      // against a job the attacker picked. The fix injects a
+      // per-spawn 32-hex nonce via ARKOR_JOB_ID_MARKER_NONCE that
+      // the server's regex anchors on; runner.ts deletes the env
+      // var before dynamically importing the user module, so user
+      // code can't read the nonce via `process.env` either.
+      await writeCredentials(ANON_CREDS);
+      await writeState(
+        {
+          orgSlug: "nonce-org",
+          projectSlug: "nonce-project",
+          projectId: "p-nonce",
+        },
+        trainCwd,
+      );
+      const REAL_JOB_ID = "real-nonce-job";
+      const SPOOF_JOB_ID = "attacker-chosen-id";
+      const fakeBin = join(trainCwd, "spoof-bin.mjs");
+      // Bin first emits an UNPREFIXED spoof on stdout (mimicking
+      // hostile user code), THEN the real nonce-prefixed canonical
+      // line. With the fix the spoof is rejected; the real line
+      // wins and the cancel POST targets the real id.
+      writeFileSync(
+        fakeBin,
+        `const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? "";
+        process.stdout.write("Started job ${SPOOF_JOB_ID}\\n");
+        setTimeout(() => {
+          process.stdout.write(\`[arkor:\${nonce}] Started job ${REAL_JOB_ID}\\n\`);
+        }, 30);
+        process.on("SIGTERM", () => {});
+        setInterval(() => {}, 60_000);
+        `,
+      );
+      const cancelHits: Array<{ url: string }> = [];
+      const ORIG_FETCH = globalThis.fetch;
+      globalThis.fetch = (async (
+        input: Parameters<typeof fetch>[0],
+        init?: Parameters<typeof fetch>[1],
+      ) => {
+        const url = typeof input === "string" ? input : input.toString();
+        const method = init?.method ?? "GET";
+        if (method === "POST" && /\/v1\/jobs\/[^/]+\/cancel/.test(url)) {
+          cancelHits.push({ url });
+          return new Response(JSON.stringify({ ok: true }), {
+            status: 200,
+            headers: { "content-type": "application/json" },
+          });
+        }
+        return new Response("not found", { status: 404 });
+      }) as typeof fetch;
+
+      try {
+        const app = buildStudioApp({
+          baseUrl: "http://mock-cloud-api",
+          assetsDir,
+          autoAnonymous: false,
+          studioToken: STUDIO_TOKEN,
+          cwd: trainCwd,
+          binPath: fakeBin,
+        });
+        const trainRes = await app.request("/api/train", {
+          method: "POST",
+          headers: {
+            host: "127.0.0.1:4000",
+            "x-arkor-studio-token": STUDIO_TOKEN,
+            "content-type": "application/json",
+          },
+          body: JSON.stringify({}),
+        });
+        expect(trainRes.status).toBe(200);
+        const reader = trainRes.body!.getReader();
+        const decoder = new TextDecoder();
+        let buf = "";
+        // Wait for the REAL line (with nonce prefix) to be visible
+        // in the body. Both lines forward to the SPA log
+        // regardless of which (if any) the parser captures, so the
+        // body is a reliable readiness signal.
+        while (!buf.includes(`Started job ${REAL_JOB_ID}`)) {
+          const { value, done } = await reader.read();
+          if (done) break;
+          buf += decoder.decode(value, { stream: true });
+        }
+        await reader.cancel();
+        await new Promise((r) => setTimeout(r, 200));
+
+        // Cancel POST landed against the REAL id: the spoof was
+        // rejected by the anchored nonce-prefixed regex.
+        expect(cancelHits).toHaveLength(1);
+        expect(cancelHits[0]?.url).toContain(`/v1/jobs/${REAL_JOB_ID}/cancel`);
+        expect(cancelHits[0]?.url).not.toContain(SPOOF_JOB_ID);
+      } finally {
+        globalThis.fetch = ORIG_FETCH;
+      }
+    });
+
+    it("/api/train cancel sends SIGKILL so user-initiated stop bypasses the runner's graceful early-stop", async () => {
+      // Regression: a default `child.kill()` sends SIGTERM, which
+      // the runner's `installShutdownHandlers` now interprets as a
+      // graceful early-stop request (wait for the next checkpoint,
+      // up to ~5 min). For HMR-driven cancels that's correct, but
+      // for a Stop-training click the user wants the run STOPPED
+      // immediately. Leaving it running in the background for
+      // minutes consuming GPU spend silently is a regression
+      // introduced by this PR's graceful-shutdown work. We assert
+      // SIGKILL by giving the bin a SIGTERM no-op handler: SIGTERM
+      // would be swallowed and the bin would stay alive; SIGKILL
+      // is uncatchable and reaps the process unconditionally.
+      // Probe liveness with `process.kill(pid, 0)` (ESRCH ⇒ gone).
+      await writeCredentials(ANON_CREDS);
+      const hangingBin = join(trainCwd, "hanging-bin.mjs");
+      writeFileSync(
+        hangingBin,
+        // SIGTERM swallowed; setInterval keeps the event loop
+        // alive forever absent SIGKILL.
+        `process.on("SIGTERM", () => {});
+        setInterval(() => {}, 60_000);
+        `,
+      );
+      const app = buildStudioApp({
+        baseUrl: "http://mock",
+        assetsDir,
+        autoAnonymous: false,
+        studioToken: STUDIO_TOKEN,
+        cwd: trainCwd,
+        binPath: hangingBin,
+      });
+      const res = await app.request("/api/train", {
+        method: "POST",
+        headers: {
+          host: "127.0.0.1:4000",
+          "x-arkor-studio-token": STUDIO_TOKEN,
+          "content-type": "application/json",
+        },
+        body: JSON.stringify({}),
+      });
+      expect(res.status).toBe(200);
+      const pid = Number(res.headers.get("x-arkor-train-pid"));
+      expect(Number.isFinite(pid)).toBe(true);
+
+      // Trigger the cancel() handler.
+      await res.body!.cancel();
+
+      // Give the OS a moment to deliver SIGKILL and reap.
+      await new Promise((r) => setTimeout(r, 300));
+
+      // `process.kill(pid, 0)` is the standard "is this pid alive?"
+      // probe: sends signal 0 (no-op) but the syscall still
+      // surfaces ESRCH for non-existent pids. SIGKILL → reaped →
+      // ESRCH. SIGTERM (with the bin's no-op handler) → still
+      // alive → no throw → test fails.
+      let probeError: NodeJS.ErrnoException | null = null;
+      try {
+        process.kill(pid, 0);
+      } catch (e) {
+        probeError = e as NodeJS.ErrnoException;
+      }
+      expect(probeError).not.toBeNull();
+      expect(probeError?.code).toBe("ESRCH");
+    });
+
+    it("/api/train cancel handler doesn't crash when child.kill() throws", async () => {
+      // Regression: `ReadableStream.cancel()` called `child.kill()`
+      // without a try/catch. If the child had already exited (ESRCH
+      // race against the cancel), the throw bubbled up as an
+      // unhandled exception and crashed the request handler.
+      await writeCredentials(ANON_CREDS);
+      const fakeBin = join(trainCwd, "fake-bin.mjs");
+      // Bin exits immediately so the child is already dead by the
+      // time our cancel handler tries to signal it.
+      writeFileSync(fakeBin, `process.exit(0);\n`);
+      const app = buildStudioApp({
+        baseUrl: "http://mock",
+        assetsDir,
+        autoAnonymous: false,
+        studioToken: STUDIO_TOKEN,
+        cwd: trainCwd,
+        binPath: fakeBin,
+      });
+      const res = await app.request("/api/train", {
+        method: "POST",
+        headers: {
+          host: "127.0.0.1:4000",
+          "x-arkor-studio-token": STUDIO_TOKEN,
+          "content-type": "application/json",
+        },
+        body: JSON.stringify({}),
+      });
+      expect(res.status).toBe(200);
+      // Race: read enough of the body to see the close, then cancel.
+      // The cancel hook must not throw even when the underlying
+      // child is already gone.
+      const reader = res.body!.getReader();
+      // Wait for `exit=` so we know the child died first.
+      let buf = "";
+      const decoder = new TextDecoder();
+      while (!buf.includes("exit=")) {
+        const { value, done } = await reader.read();
+        if (done) break;
+        buf += decoder.decode(value, { stream: true });
+      }
+      await expect(reader.cancel()).resolves.toBeUndefined();
+    });
+
+    it("/api/train survives cancellation while the child is still streaming output", async () => {
+      // Regression: the previous implementation registered raw
+      // `controller.enqueue(...)` listeners on `child.stdout` /
+      // `child.stderr` and an unguarded `controller.close()` in
+      // `child.on("close")`. After the client cancelled the
+      // ReadableStream, those handlers kept firing, and calling
+      // `enqueue` / `close` on a closed controller throws "Invalid
+      // state". The throw escaped the request pipeline as an
+      // unhandled exception. The fix flips a `closed` flag in
+      // `cancelTeardown` and try/catches the post-cancel enqueue
+      // paths defensively. NOTE: cancel intentionally does NOT
+      // detach the `data` listeners; leaving them attached keeps
+      // the OS pipe draining while the child checkpoints / exits
+      // gracefully (otherwise a full pipe back-pressures and
+      // deadlocks the very graceful exit we're preserving).
+      // `onClose` / `onError` detach all listeners when the child
+      // finally exits. See `cancelTeardown` in `studio/server.ts`
+      // for the full backpressure rationale.
+      await writeCredentials(ANON_CREDS);
+      const fakeBin = join(trainCwd, "fake-bin.mjs");
+      // Bin spits a chunk every ~5 ms forever. We cancel while it's
+      // mid-stream so the child is *still alive* when the stream is
+      // cancelled: the previous bug only surfaced in this window.
+      writeFileSync(
+        fakeBin,
+        `setInterval(() => process.stdout.write("tick\\n"), 5);\nsetInterval(() => {}, 60_000);\n`,
+      );
+      const app = buildStudioApp({
+        baseUrl: "http://mock",
+        assetsDir,
+        autoAnonymous: false,
+        studioToken: STUDIO_TOKEN,
+        cwd: trainCwd,
+        binPath: fakeBin,
+      });
+      const res = await app.request("/api/train", {
+        method: "POST",
+        headers: {
+          host: "127.0.0.1:4000",
+          "x-arkor-studio-token": STUDIO_TOKEN,
+          "content-type": "application/json",
+        },
+        body: JSON.stringify({}),
+      });
+      expect(res.status).toBe(200);
+      const reader = res.body!.getReader();
+      // Read at least one chunk so the child is definitely streaming
+      // before we cancel: that's the race window the previous code
+      // crashed in.
+      const decoder = new TextDecoder();
+      let received = "";
+      while (!received.includes("tick")) {
+        const { value, done } = await reader.read();
+        if (done) break;
+        received += decoder.decode(value, { stream: true });
+      }
+      // Listen for unhandled rejections / uncaught exceptions during
+      // and shortly after the cancel: before the fix, the child's
+      // next `data` chunk would synchronously throw inside the
+      // enqueue callback.
+      const errors: unknown[] = [];
+      const onUnhandled = (err: unknown) => errors.push(err);
+      process.on("uncaughtException", onUnhandled);
+      process.on("unhandledRejection", onUnhandled);
+      try {
+        await reader.cancel();
+        // Give the child's interval a few iterations to attempt
+        // post-cancel writes. The handler must short-circuit on the
+        // `closed` flag and not crash the worker.
+        await new Promise((r) => setTimeout(r, 50));
+      } finally {
+        process.off("uncaughtException", onUnhandled);
+        process.off("unhandledRejection", onUnhandled);
+      }
+      expect(errors).toEqual([]);
+    });
   });
 
   describe("auto-anonymous bootstrap", () => {
@@ -557,7 +1363,7 @@ process.exit(0);
     });
 
     it("acquires + persists an anonymous token on the first /api/credentials hit when autoAnonymous=true", async () => {
-      // No credentials on disk — buildStudioApp's autoAnonymous default
+      // No credentials on disk: buildStudioApp's autoAnonymous default
       // (true) lets the server bootstrap on first hit so a fresh `arkor
       // dev` works even when the up-front bootstrap in dev.ts skipped due
       // to a transient network blip.
@@ -599,7 +1405,7 @@ process.exit(0);
       expect(body).toMatchObject({ token: "lazy-anon", mode: "anon" });
       expect(calls).toBe(1);
 
-      // Subsequent calls use the persisted credentials — no re-bootstrap.
+      // Subsequent calls use the persisted credentials (no re-bootstrap).
       const res2 = await app.request("/api/credentials", {
         headers: {
           host: "127.0.0.1:4000",
@@ -674,7 +1480,7 @@ process.exit(0);
       // The cloud-api-client wrapper around `onDeprecation` synchronously
       // checks `typeof result.then` on the callback's return value; a plain
       // `void` return throws and gets swallowed with a stderr log. The
-      // wrapper in `createRpc` returns null to short-circuit that check —
+      // wrapper in `createRpc` returns null to short-circuit that check;
       // assert that no such log fires here.
       const errorSpy = vi
         .spyOn(console, "error")
@@ -1173,6 +1979,179 @@ process.exit(0);
       const body = (await res.json()) as { trainer: unknown };
       expect(body.trainer).toBeNull();
     });
+
+    it("skips runBuild() when HMR is enabled and the watcher's artefact already exists", async () => {
+      // Regression: previously every `/api/manifest` poll triggered a
+      // fresh `runBuild()` even with HMR active, so the SPA's
+      // ~5 s polling + per-rebuild SSE refetch would re-bundle on
+      // every poll AND race the watcher writing to the same
+      // `.arkor/build/index.mjs`. The fast path inspects the
+      // pre-existing artefact directly when HMR's coordinator is
+      // wired in. We assert by pre-writing a hand-rolled artefact
+      // bundle and verifying `/api/manifest` returns its trainer
+      // *without* the source file existing: `runBuild()` would
+      // throw on the missing entry, so a 200 here proves we never
+      // called it.
+      await writeCredentials(ANON_CREDS);
+      // Write the artefact that the HMR watcher would have produced.
+      // Mirrors the seed fixture's shape: `_kind: "arkor"` + trainer
+      // with the four required methods.
+      mkdirSync(join(trainCwd, ".arkor/build"), { recursive: true });
+      writeFileSync(
+        join(trainCwd, ".arkor/build/index.mjs"),
+        `const trainer = {
+          name: "hmr-fast-path",
+          start: async () => ({ jobId: "j" }),
+          wait: async () => ({ job: {}, artifacts: [] }),
+          cancel: async () => {},
+        };
+        export const arkor = { _kind: "arkor", trainer };
+        export default arkor;
+        `,
+      );
+      // Notice: NO `src/arkor/index.ts`. `runBuild()` would fail with
+      // "Build entry not found"; the test fails if the fast path
+      // regresses and falls through to it.
+      const fakeHmr = {
+        subscribe: () => () => undefined,
+        getCurrentConfigHash: () => null,
+        getCurrentArtifactHash: () => null,
+        getCurrentArtifactContentHash: () => null,
+        getLastEventType: () => null,
+        async dispose() {},
+      };
+      const app = buildStudioApp({
+        baseUrl: "http://mock",
+        assetsDir,
+        autoAnonymous: false,
+        studioToken: STUDIO_TOKEN,
+        cwd: trainCwd,
+        hmr: fakeHmr,
+      });
+      const res = await app.request("/api/manifest", {
+        headers: {
+          host: "127.0.0.1:4000",
+          "x-arkor-studio-token": STUDIO_TOKEN,
+        },
+      });
+      expect(res.status).toBe(200);
+      const body = (await res.json()) as {
+        trainer: { name: string } | null;
+      };
+      expect(body.trainer).toEqual({ name: "hmr-fast-path" });
+    });
+
+    it("falls back to runBuild() when HMR is enabled but the watcher hasn't produced an artefact yet", async () => {
+      // Companion to the fast-path test: on a fresh scaffold the
+      // watcher's first BUNDLE_END may not have completed by the
+      // time the SPA's first /api/manifest poll lands. Without the
+      // existsSync gate we'd `await import(missing)` and 400
+      // forever (the watcher's later writes don't retroactively
+      // make this poll succeed); with the gate we bootstrap via
+      // `runBuild()` for that single call.
+      await writeCredentials(ANON_CREDS);
+      mkdirSync(join(trainCwd, "src/arkor"), { recursive: true });
+      writeFileSync(
+        join(trainCwd, "src/arkor/index.ts"),
+        `export const arkor = Object.freeze({
+          _kind: "arkor",
+          trainer: {
+            name: "fallback-build",
+            start: async () => ({ jobId: "j" }),
+            wait: async () => ({ job: {}, artifacts: [] }),
+            cancel: async () => {},
+          },
+        });`,
+      );
+      // No pre-existing `.arkor/build/index.mjs`: the artefact
+      // doesn't exist. `existsSync` is false → `runBuild()` runs.
+      const fakeHmr = {
+        subscribe: () => () => undefined,
+        getCurrentConfigHash: () => null,
+        getCurrentArtifactHash: () => null,
+        getCurrentArtifactContentHash: () => null,
+        getLastEventType: () => null,
+        async dispose() {},
+      };
+      const app = buildStudioApp({
+        baseUrl: "http://mock",
+        assetsDir,
+        autoAnonymous: false,
+        studioToken: STUDIO_TOKEN,
+        cwd: trainCwd,
+        hmr: fakeHmr,
+      });
+      const res = await app.request("/api/manifest", {
+        headers: {
+          host: "127.0.0.1:4000",
+          "x-arkor-studio-token": STUDIO_TOKEN,
+        },
+      });
+      expect(res.status).toBe(200);
+      const body = (await res.json()) as {
+        trainer: { name: string } | null;
+      };
+      expect(body.trainer).toEqual({ name: "fallback-build" });
+    });
+
+    it("returns 400 (not stale 200) while the HMR watcher is in error state", async () => {
+      // Regression: the HMR fast path served the last-built artefact
+      // even when the watcher's most recent event was `error`. The
+      // SPA's `/api/manifest` poll runs every ~5s, so a successful
+      // 200 with stale data would silently overwrite the SSE-driven
+      // build-error UI within 5s of the user breaking their source:
+      // they'd then unknowingly run stale code/config while the
+      // latest edit is still failing to compile. Gating the fast
+      // path on `getLastEventType() === "error"` keeps both
+      // channels (poll + SSE) consistent.
+      await writeCredentials(ANON_CREDS);
+      mkdirSync(join(trainCwd, ".arkor/build"), { recursive: true });
+      // Pre-write a previously-good artefact so the fast path
+      // *would* otherwise return 200 with it.
+      writeFileSync(
+        join(trainCwd, ".arkor/build/index.mjs"),
+        `const trainer = {
+          name: "stale-good-build",
+          start: async () => ({ jobId: "j" }),
+          wait: async () => ({ job: {}, artifacts: [] }),
+          cancel: async () => {},
+        };
+        export const arkor = { _kind: "arkor", trainer };
+        export default arkor;
+        `,
+      );
+      // Coordinator is currently in error state: the latest
+      // broadcast was a compile failure.
+      const fakeHmr = {
+        subscribe: () => () => undefined,
+        getCurrentConfigHash: () => null,
+        getCurrentArtifactHash: () => null,
+        getCurrentArtifactContentHash: () => null,
+        getLastEventType: () => "error" as const,
+        async dispose() {},
+      };
+      const app = buildStudioApp({
+        baseUrl: "http://mock",
+        assetsDir,
+        autoAnonymous: false,
+        studioToken: STUDIO_TOKEN,
+        cwd: trainCwd,
+        hmr: fakeHmr,
+      });
+      const res = await app.request("/api/manifest", {
+        headers: {
+          host: "127.0.0.1:4000",
+          "x-arkor-studio-token": STUDIO_TOKEN,
+        },
+      });
+      // 400: the SPA's existing 4xx-handling path renders the
+      // build-error hint instead of a fake-healthy manifest.
+      expect(res.status).toBe(400);
+      const body = (await res.json()) as { error?: string };
+      expect(body.error).toMatch(/Build failed/);
+      // Sanity: the stale artefact name is NOT leaked through.
+      expect(JSON.stringify(body)).not.toContain("stale-good-build");
+    });
   });
 
   describe("/api/inference/chat", () => {
@@ -1184,7 +2163,7 @@ process.exit(0);
 
     it("auto-bootstraps project state and proxies base-model inference", async () => {
       await writeCredentials(ANON_CREDS);
-      // No state.json — server should derive a slug from cwd, create the
+      // No state.json: server should derive a slug from cwd, create the
       // project on cloud-api, persist state, and forward the inference call.
 
       const calls: Array<{
@@ -1313,7 +2292,7 @@ process.exit(0);
       });
       expect(res.status).toBe(200);
 
-      // Only the inference call should have hit the network — no project
+      // Only the inference call should have hit the network: no project
       // create/list when state is already present.
       expect(calls.filter((c) => c.url.includes("/v1/projects"))).toHaveLength(0);
       const chat = calls.find((c) => c.url.includes("/v1/inference/chat"));
@@ -1374,7 +2353,7 @@ process.exit(0);
 
     it("propagates the cloud-api status when project bootstrap fails", async () => {
       await writeCredentials(ANON_CREDS);
-      // No state.json — bootstrap will hit cloud-api, which returns 503.
+      // No state.json: bootstrap will hit cloud-api, which returns 503.
       // We expect that 503 to be passed through, not collapsed to 400.
 
       globalThis.fetch = (async (
@@ -1435,13 +2414,436 @@ process.exit(0);
     });
   });
 
-  // -------------------------------------------------------------------------
-  // Deployments (`/api/deployments/*`) — minimal coverage of the router
-  // boundary. Cloud-side semantics already have heavy test coverage in
-  // `core/client.deployments.test.ts`; here we verify only that the Studio
-  // server forwards correctly, returns the empty wrapper when no project
-  // state exists, and surfaces upstream errors verbatim.
-  // -------------------------------------------------------------------------
+  describe("/api/dev/events (HMR)", () => {
+    function fakeHmr(initialConfigHash: string | null = null) {
+      // Mirror the real HmrCoordinator surface but stay synchronous so
+      // the test doesn't depend on rolldown.watch starting up. `emit`
+      // is a test hook for pushing events into the SSE stream from the
+      // test body; `currentConfigHash` is a settable mock for what
+      // `/api/train` reads via `getCurrentConfigHash` to capture the
+      // spawned-config snapshot.
+      const subs = new Set<(e: HmrEvent) => void>();
+      let currentConfigHash: string | null = initialConfigHash;
+      // Match the real coordinator's behaviour: a stable artefact
+      // fingerprint at spawn time. Tests that exercise the
+      // pre-ready-spawn path (configHash null, then a real hash)
+      // can override via `setArtifactHash`.
+      let currentArtifactHash: string | null = "fake-artefact-hash";
+      let currentArtifactContentHash: string | null =
+        "fake-artefact-content-hash";
+      let lastEventType: HmrEvent["type"] | null = null;
+      const coordinator: HmrCoordinator = {
+        subscribe(fn) {
+          subs.add(fn);
+          return () => {
+            subs.delete(fn);
+          };
+        },
+        getCurrentConfigHash() {
+          return currentConfigHash;
+        },
+        getCurrentArtifactHash() {
+          return currentArtifactHash;
+        },
+        getCurrentArtifactContentHash() {
+          return currentArtifactContentHash;
+        },
+        getLastEventType() {
+          return lastEventType;
+        },
+        async dispose() {
+          subs.clear();
+        },
+      };
+      return {
+        coordinator,
+        emit(event: HmrEvent) {
+          // Track the latest event type so `getLastEventType()`
+          // mirrors the real coordinator's `lastEvent?.type`;
+          // the `/api/manifest` HMR-error gate consults this.
+          lastEventType = event.type;
+          for (const fn of subs) fn(event);
+        },
+        setConfigHash(hash: string | null) {
+          currentConfigHash = hash;
+        },
+        setArtifactHash(hash: string | null) {
+          currentArtifactHash = hash;
+        },
+        setArtifactContentHash(hash: string | null) {
+          currentArtifactContentHash = hash;
+        },
+        setLastEventType(t: HmrEvent["type"] | null) {
+          lastEventType = t;
+        },
+        get subscriberCount() {
+          return subs.size;
+        },
+      };
+    }
+
+    it("is unregistered when no hmr coordinator is supplied", async () => {
+      const app = build();
+      const res = await app.request("/api/dev/events", {
+        headers: {
+          host: "127.0.0.1:4000",
+          "x-arkor-studio-token": STUDIO_TOKEN,
+        },
+      });
+      expect(res.status).toBe(404);
+    });
+
+    it("rejects /api/dev/events without a token", async () => {
+      const fake = fakeHmr();
+      const app = buildStudioApp({
+        baseUrl: "http://mock",
+        assetsDir,
+        autoAnonymous: false,
+        studioToken: STUDIO_TOKEN,
+        cwd: trainCwd,
+        hmr: fake.coordinator,
+      });
+      const res = await app.request("/api/dev/events", {
+        headers: { host: "127.0.0.1:4000" },
+      });
+      expect(res.status).toBe(403);
+    });
+
+    it("accepts the studio token via ?studioToken= for the dev event stream", async () => {
+      const fake = fakeHmr();
+      const app = buildStudioApp({
+        baseUrl: "http://mock",
+        assetsDir,
+        autoAnonymous: false,
+        studioToken: STUDIO_TOKEN,
+        cwd: trainCwd,
+        hmr: fake.coordinator,
+      });
+      // The server subscribes to the HMR coordinator exactly once at
+      // build time (so multiple SSE clients don't fan signal dispatch
+      // out to the same child N times). Per-client cleanup happens on
+      // the SSE listener set, not against the coordinator, so
+      // `fake.subscriberCount` stays at 1 across the connection
+      // lifecycle. We assert that here rather than expect the
+      // pre-refactor "0 after cancel" behaviour.
+      expect(fake.subscriberCount).toBe(1);
+      const res = await app.request(
+        `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`,
+        { headers: { host: "127.0.0.1:4000" } },
+      );
+      expect(res.status).toBe(200);
+      expect(res.headers.get("content-type")).toBe("text/event-stream");
+      const reader = res.body!.getReader();
+      await reader.cancel();
+      // Cancel doesn't unsubscribe the server-level listener; emitting
+      // an event after cancel must still be safe (the SSE listener that
+      // was registered for this connection is removed, so the
+      // controller-closed try/catch in `send` is never exercised).
+      expect(() =>
+        fake.emit({
+          type: "rebuild",
+          outFile: "/tmp/x",
+          hash: "h",
+          configHash: null,
+          trainerName: null,
+        }),
+      ).not.toThrow();
+    });
+
+    it("rejects /api/dev/events when host header is non-loopback", async () => {
+      const fake = fakeHmr();
+      const app = buildStudioApp({
+        baseUrl: "http://mock",
+        assetsDir,
+        autoAnonymous: false,
+        studioToken: STUDIO_TOKEN,
+        cwd: trainCwd,
+        hmr: fake.coordinator,
+      });
+      const res = await app.request(
+        `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`,
+        { headers: { host: "evil.example.com" } },
+      );
+      expect(res.status).toBe(403);
+    });
+
+    it("dispatches HMR signals exactly once per rebuild regardless of connected SSE client count", async () => {
+      // Regression: previously each `/api/dev/events` connection
+      // attached its own `hmr.subscribe(...)` callback, so a rebuild
+      // with N open Studio tabs fanned out into N × SIGUSR2 / N ×
+      // SIGTERM per child. The runner's shutdown handler interprets a
+      // *second* SIGTERM as the emergency `exit(143)` fast-path, which
+      // would defeat checkpoint preservation. The server now subscribes
+      // to the coordinator exactly once and broadcasts the augmented
+      // payload to every SSE client; we assert that subscriber count
+      // doesn't grow when extra connections are opened.
+      const fake = fakeHmr();
+      const app = buildStudioApp({
+        baseUrl: "http://mock",
+        assetsDir,
+        autoAnonymous: false,
+        studioToken: STUDIO_TOKEN,
+        cwd: trainCwd,
+        hmr: fake.coordinator,
+      });
+      expect(fake.subscriberCount).toBe(1);
+      const r1 = await app.request(
+        `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`,
+        { headers: { host: "127.0.0.1:4000" } },
+      );
+      const r2 = await app.request(
+        `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`,
+        { headers: { host: "127.0.0.1:4000" } },
+      );
+      // Pump the streams so their `start()` runs, registering the
+      // per-client SSE listeners on the server side.
+      const reader1 = r1.body!.getReader();
+      const reader2 = r2.body!.getReader();
+      // Even with two concurrent SSE clients the HMR coordinator still
+      // sees exactly the one server-level subscriber.
+      expect(fake.subscriberCount).toBe(1);
+      await reader1.cancel();
+      await reader2.cancel();
+      expect(fake.subscriberCount).toBe(1);
+    });
+
+    it("/api/train cancel still fires cloud cancel POST + SIGKILL even when HMR has already requested early-stop", async () => {
+      // Regression: the cancel handler used to short-circuit
+      // (`if (earlyStopInFlight) return;`) when HMR's
+      // `dispatchRebuild` had already SIGTERMed the child for a
+      // graceful checkpoint-wait early-stop. That gate was added
+      // to avoid a second SIGTERM piling on top of the first
+      // (which would have triggered the runner's `exit(143)`
+      // emergency path and broken cloud cancel POSTing). With
+      // SIGKILL replacing the user-stop SIGTERM, the
+      // double-signal worry no longer applies, and the gate
+      // turned a Stop click during HMR's graceful window into a
+      // total no-op, leaving the run alive until checkpoint /
+      // 5-min timeout. Manual stop now overrides HMR's graceful
+      // path: server POSTs cloud cancel + SIGKILLs the
+      // subprocess regardless of `isEarlyStopRequested`.
+      await writeCredentials(ANON_CREDS);
+      await writeState(
+        {
+          orgSlug: "manual-override-org",
+          projectSlug: "manual-override-project",
+          projectId: "p-manual",
+        },
+        trainCwd,
+      );
+      const FAKE_JOB_ID = "manual-stop-during-hmr";
+      const fakeBin = join(trainCwd, "manual-during-hmr-bin.mjs");
+      // SIGTERM no-op so HMR's graceful SIGTERM doesn't terminate
+      // the bin; we need it alive so the subsequent manual
+      // cancel actually has something to SIGKILL. Marker uses the
+      // server-injected nonce prefix so the parser accepts it.
+      writeFileSync(
+        fakeBin,
+        `const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? "";
+        process.stdout.write(\`[arkor:\${nonce}] Started job ${FAKE_JOB_ID}\\n\`);
+        process.on("SIGTERM", () => {});
+        setInterval(() => {}, 60_000);
+        `,
+      );
+      let cancelHits: Array<{ url: string }> = [];
+      const ORIG_FETCH = globalThis.fetch;
+      globalThis.fetch = (async (
+        input: Parameters<typeof fetch>[0],
+        init?: Parameters<typeof fetch>[1],
+      ) => {
+        const url = typeof input === "string" ? input : input.toString();
+        const method = init?.method ?? "GET";
+        if (method === "POST" && /\/v1\/jobs\/[^/]+\/cancel/.test(url)) {
+          cancelHits.push({ url });
+          return new Response(JSON.stringify({ ok: true }), {
+            status: 200,
+            headers: { "content-type": "application/json" },
+          });
+        }
+        return new Response("not found", { status: 404 });
+      }) as typeof fetch;
+
+      try {
+        const fake = fakeHmr("h1");
+        const app = buildStudioApp({
+          baseUrl: "http://mock-cloud-api",
+          assetsDir,
+          autoAnonymous: false,
+          studioToken: STUDIO_TOKEN,
+          cwd: trainCwd,
+          binPath: fakeBin,
+          hmr: fake.coordinator,
+        });
+        const trainRes = await app.request("/api/train", {
+          method: "POST",
+          headers: {
+            host: "127.0.0.1:4000",
+            "x-arkor-studio-token": STUDIO_TOKEN,
+            "content-type": "application/json",
+          },
+          body: JSON.stringify({}),
+        });
+        expect(trainRes.status).toBe(200);
+        const pid = Number(trainRes.headers.get("x-arkor-train-pid"));
+        // Drain until the parser has recorded the job id.
+        const reader = trainRes.body!.getReader();
+        const decoder = new TextDecoder();
+        let buf = "";
+        while (!buf.includes(`Started job ${FAKE_JOB_ID}`)) {
+          const { value, done } = await reader.read();
+          if (done) break;
+          buf += decoder.decode(value, { stream: true });
+        }
+        // Emit an HMR mismatch: server's dispatch SIGTERMs the
+        // bin and sets `earlyStopRequested = true` on the entry.
+        // The bin's SIGTERM no-op keeps it alive so the manual
+        // cancel below has a target.
+        fake.emit({
+          type: "ready",
+          outFile: "/tmp/x.mjs",
+          hash: "abc",
+          configHash: "h2", // mismatch with spawn-time "h1"
+          trainerName: "t",
+        });
+        // Let the dispatch run + signal land.
+        await new Promise((r) => setTimeout(r, 80));
+
+        // Manual cancel: old code would have early-returned; new
+        // code POSTs cloud cancel + SIGKILLs.
+        await reader.cancel();
+        await new Promise((r) => setTimeout(r, 250));
+
+        // Cloud cancel POST landed for the right job.
+        expect(cancelHits).toHaveLength(1);
+        expect(cancelHits[0]?.url).toContain(`/v1/jobs/${FAKE_JOB_ID}/cancel`);
+        // And the bin is dead: SIGKILL bypassed its SIGTERM
+        // no-op (which had been masking HMR's earlier SIGTERM).
+        let probeError: NodeJS.ErrnoException | null = null;
+        try {
+          process.kill(pid, 0);
+        } catch (e) {
+          probeError = e as NodeJS.ErrnoException;
+        }
+        expect(probeError?.code).toBe("ESRCH");
+      } finally {
+        globalThis.fetch = ORIG_FETCH;
+      }
+    });
+
+    it("dispatches HMR signals for `ready` events too (not only `rebuild`)", async () => {
+      // Regression: previously the dispatch fired only on
+      // `rebuild`, so a child started via `/api/train` *before*
+      // the watcher's first successful BUNDLE_END (the very first
+      // success is broadcast as `ready`, and the entry-wait recovery
+      // path also emits `ready`) would never get SIGUSR2/SIGTERM-
+      // routed when that build eventually landed, leaving it
+      // running a stale or empty artifact. Exercise the contract
+      // here by spawning a hanging child, then emitting `ready`
+      // with a different `configHash`; dispatch should pick up the
+      // mismatch and surface restart targets in the SSE frame.
+      await writeCredentials(ANON_CREDS);
+      const hangingBin = join(trainCwd, "hanging-bin.mjs");
+      // setInterval keeps the event loop alive without trapping
+      // SIGTERM, so the signal sent by dispatch terminates the child.
+      writeFileSync(hangingBin, "setInterval(() => {}, 1000);\n");
+
+      const fake = fakeHmr("h1");
+      const app = buildStudioApp({
+        baseUrl: "http://mock",
+        assetsDir,
+        autoAnonymous: false,
+        studioToken: STUDIO_TOKEN,
+        cwd: trainCwd,
+        binPath: hangingBin,
+        hmr: fake.coordinator,
+      });
+
+      const trainRes = await app.request("/api/train", {
+        method: "POST",
+        headers: {
+          host: "127.0.0.1:4000",
+          "x-arkor-studio-token": STUDIO_TOKEN,
+          "content-type": "application/json",
+        },
+        body: JSON.stringify({}),
+      });
+      expect(trainRes.status).toBe(200);
+      const pid = Number(trainRes.headers.get("x-arkor-train-pid"));
+
+      const sseRes = await app.request(
+        `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`,
+        { headers: { host: "127.0.0.1:4000" } },
+      );
+      const reader = sseRes.body!.getReader();
+      const decoder = new TextDecoder();
+
+      try {
+        // `configHash` = "h2" mismatches the spawn-time "h1" → SIGTERM
+        // path → `restartTargets` should be non-empty in the SSE frame.
+        fake.emit({
+          type: "ready",
+          outFile: "/tmp/x.mjs",
+          hash: "abc",
+          configHash: "h2",
+          trainerName: "t",
+        });
+
+        let received = "";
+        while (!received.includes("\n\n")) {
+          const { value, done } = await reader.read();
+          if (done) break;
+          received += decoder.decode(value, { stream: true });
+        }
+        expect(received).toContain("event: ready");
+        // The dispatch augmentation marker: would be absent if the
+        // `event.type !== "error"` filter regressed back to gating on
+        // `=== "rebuild"`, and `restart`/`restartTargets` would never
+        // appear on a `ready` frame.
+        expect(received).toContain('"restart":true');
+        expect(received).toContain(`"pid":${pid}`);
+      } finally {
+        await reader.cancel();
+        // Best-effort cleanup if dispatch's SIGTERM hasn't reaped
+        // the child yet (signal delivery is async in the kernel).
+        try {
+          process.kill(pid, "SIGKILL");
+        } catch {
+          // already gone
+        }
+      }
+    });
+
+    it("forwards rebuild events as SSE frames", async () => {
+      const fake = fakeHmr();
+      const app = buildStudioApp({
+        baseUrl: "http://mock",
+        assetsDir,
+        autoAnonymous: false,
+        studioToken: STUDIO_TOKEN,
+        cwd: trainCwd,
+        hmr: fake.coordinator,
+      });
+      const res = await app.request(
+        `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`,
+        { headers: { host: "127.0.0.1:4000" } },
+      );
+      const reader = res.body!.getReader();
+      const decoder = new TextDecoder();
+
+      fake.emit({ type: "ready", outFile: "/tmp/x", hash: "abc" });
+      // Read chunks until we have at least one full SSE frame.
+      let received = "";
+      while (!received.includes("\n\n")) {
+        const { value, done } = await reader.read();
+        if (done) break;
+        received += decoder.decode(value, { stream: true });
+      }
+      expect(received).toContain("event: ready");
+      expect(received).toContain('"outFile":"/tmp/x"');
+      await reader.cancel();
+    });
+  });
+
   describe("/api/deployments", () => {
     const ORIG_FETCH = globalThis.fetch;
 
diff --git a/packages/arkor/src/studio/server.ts b/packages/arkor/src/studio/server.ts
index 95deff08..e5db1d9f 100644
--- a/packages/arkor/src/studio/server.ts
+++ b/packages/arkor/src/studio/server.ts
@@ -1,6 +1,7 @@
-import { spawn } from "node:child_process";
+import { spawn, type ChildProcessByStdio } from "node:child_process";
+import type { Readable, Writable } from "node:stream";
 import { readFile, realpath } from "node:fs/promises";
-import { timingSafeEqual } from "node:crypto";
+import { randomBytes, timingSafeEqual } from "node:crypto";
 import { Hono } from "hono";
 import { createClient } from "@arkor/cloud-api-client";
 import { CloudApiClient, CloudApiError } from "../core/client";
@@ -22,7 +23,42 @@ import {
   createDeploymentRequestSchema,
 } from "../core/schemas";
 import { readState } from "../core/state";
+import { resolveBuildEntry } from "../core/rolldownConfig";
 import { readManifestSummary } from "./manifest";
+import type { HmrCoordinator, HmrEvent } from "./hmr";
+import { TrainRegistry, type RestartTarget } from "./trainRegistry";
+
+/** Identify the spawned subprocess to the SPA without exposing it as
+ *  a body frame (which would interleave with trainer stdout). The SPA
+ *  reads this off `Response.headers` and uses it to scope HMR
+ *  `restart` events to the run *this* tab actually started. */
+const TRAIN_PID_HEADER = "x-arkor-train-pid";
+/**
+ * Build the strict full-line match for the runner's `[arkor:<nonce>] Started job <id>` line.
+ *
+ * `core/runner.ts` prefixes that text with the per-spawn nonce we
+ * inject via `ARKOR_JOB_ID_MARKER_NONCE`; without the prefix, a
+ * user `console.log("Started job <attacker-id>")` from inside
+ * `trainer.start()` / `onCheckpoint` / etc. could land in stdout
+ * *before* the runner's real line and we'd record the wrong id, so
+ * Stop-training would then POST `/v1/jobs/:attacker-id/cancel`
+ * against a job the attacker chose. Anchoring on a 32-hex nonce
+ * known only to the server + runner (the env var is deleted by
+ * runner.ts BEFORE the user module is dynamically imported, so the
+ * user can't read it) closes that hole.
+ *
+ * Pattern is per-spawn because the nonce is per-spawn.
+ *
+ * Anchors `^…$` and `(\S+)` job-id capture mirror the runner's
+ * exact write shape (cloud-api job ids never contain whitespace),
+ * so a chatty bin that wraps the line in other content cannot
+ * collide either.
+ */
+function buildStartedJobPattern(nonce: string): RegExp {
+  // Nonce is a 32-char hex string from `randomBytes(16).toString("hex")`,
+  // i.e. only `[0-9a-f]` (safe to interpolate into the `RegExp` source).
+  return new RegExp(`^\\[arkor:${nonce}\\] Started job (\\S+)$`);
+}
 
 const DEPRECATION_HEADERS = ["Deprecation", "Sunset", "Warning"] as const;
 function copyDeprecationHeaders(from: Headers, to: Headers): void {
@@ -66,6 +102,15 @@ export interface StudioServerOptions {
    * here points at the bin itself). Override in tests.
    */
   binPath?: string;
+  /**
+   * Optional HMR coordinator. When provided, the server registers
+   * `/api/dev/events` as an SSE stream that pushes rebuild / error events to
+   * the SPA, and rebuilds also signal SIGTERM to active `/api/train`
+   * subprocesses so they early-stop at the next checkpoint and the SPA can
+   * restart them with the new bundle. Wired in by `arkor dev`; left
+   * undefined for any non-dev consumer of `buildStudioApp`.
+   */
+  hmr?: HmrCoordinator;
 }
 
 function tokensMatch(provided: string, expected: string): boolean {
@@ -89,11 +134,31 @@ function htmlAttrEscape(s: string): string {
   );
 }
 
-function injectStudioToken(html: string, token: string): string {
-  const meta = `<meta name="arkor-studio-token" content="${htmlAttrEscape(token)}">`;
+/**
+ * Inject the per-launch studio token (always) and an optional HMR
+ * feature flag into `<head>`. Both are read by the SPA via
+ * `<meta name="...">` lookups: the token gates `/api/*` requests and
+ * the HMR flag tells `RunTraining` whether to open
+ * `/api/dev/events` (which only exists when `arkor dev` wired in an
+ * HMR coordinator). Without the server-side flag the SPA can't tell
+ * dev-mode usage from prod-mode usage at runtime: `vite build`'s
+ * output ships with `import.meta.env.DEV === false`, so any DEV gate
+ * baked into the bundle would suppress HMR even in real `arkor dev`
+ * sessions.
+ */
+function injectStudioMeta(
+  html: string,
+  token: string,
+  hmrEnabled: boolean,
+): string {
+  const tokenTag = `<meta name="arkor-studio-token" content="${htmlAttrEscape(token)}">`;
+  const hmrTag = hmrEnabled
+    ? `<meta name="arkor-hmr-enabled" content="true">`
+    : "";
+  const tags = `${tokenTag}${hmrTag}`;
   const idx = html.indexOf("</head>");
-  if (idx === -1) return `${meta}${html}`;
-  return `${html.slice(0, idx)}${meta}${html.slice(idx)}`;
+  if (idx === -1) return `${tags}${html}`;
+  return `${html.slice(0, idx)}${tags}${html.slice(idx)}`;
 }
 
 export function buildStudioApp(options: StudioServerOptions) {
@@ -105,7 +170,7 @@ export function buildStudioApp(options: StudioServerOptions) {
   // `studio/server.ts` is bundled into `dist/bin.mjs` (it isn't reachable
   // from `src/index.ts`, so tsdown doesn't extract it as a shared chunk).
   // The bin therefore sits *next* to this code at runtime, not one
-  // directory up — `../bin.mjs` would resolve to the package root.
+  // directory up: `../bin.mjs` would resolve to the package root.
   const trainBinPath =
     options.binPath ?? fileURLToPath(new URL("./bin.mjs", import.meta.url));
 
@@ -118,7 +183,12 @@ export function buildStudioApp(options: StudioServerOptions) {
   const app = new Hono();
 
   const loopbackHostPattern = /^(127\.0\.0\.1|localhost)(:\d+)?$/;
-  const jobEventsPathPattern = /^\/api\/jobs\/[^/]+\/events$/;
+  // Routes where `?studioToken=` is accepted instead of the
+  // `X-Arkor-Studio-Token` header. Used only for `EventSource` streams,
+  // which cannot send custom headers. Adding to this list is CSRF-sensitive:
+  // it must always be a GET stream-only route, never a mutation endpoint.
+  const eventStreamPathPattern =
+    /^\/api\/jobs\/[^/]+\/events$|^\/api\/dev\/events$/;
 
   // Host-header guard for every route, including static HTML that carries the
   // per-launch Studio token. This is the DNS-rebinding boundary: a victim
@@ -138,14 +208,14 @@ export function buildStudioApp(options: StudioServerOptions) {
   //   1. Per-launch token. CORS is intentionally not configured: the SPA
   //      is same-origin so CORS adds no value, and reflecting `*` would let
   //      "simple" cross-origin POSTs (text/plain, urlencoded) skip preflight
-  //      and reach the handler. The token check rejects those — an attacker
+  //      and reach the handler. The token check rejects those: an attacker
   //      page can't read the SPA's <meta> from another origin.
   //   2. `?studioToken=` is accepted only on the job-event stream route
   //      because `EventSource` cannot send custom headers. Mutation routes
   //      require the header so a leaked token in a URL is not enough to POST.
   app.use("/api/*", async (c, next) => {
     const queryTokenAllowed =
-      c.req.method === "GET" && jobEventsPathPattern.test(c.req.path);
+      c.req.method === "GET" && eventStreamPathPattern.test(c.req.path);
     const provided =
       c.req.header("x-arkor-studio-token") ??
       (queryTokenAllowed ? c.req.query("studioToken") : undefined) ??
@@ -268,9 +338,39 @@ export function buildStudioApp(options: StudioServerOptions) {
     return new Response(body, { status: res.status, headers });
   });
 
+  // Pre-resolved outFile for the HMR fast path. The path is
+  // deterministic per cwd (defaults from `BUILD_DEFAULTS`), so we
+  // compute it once at app build time rather than on every request.
+  // Only used when HMR is enabled; `readManifestSummary` falls
+  // back to `runBuild()` when this is undefined or the file doesn't
+  // exist yet (fresh scaffold pre-watcher-bootstrap).
+  const hmrOutFile = options.hmr
+    ? resolveBuildEntry({ cwd: trainCwd }).outFile
+    : undefined;
   app.get("/api/manifest", async (c) => {
     try {
-      const manifest = await readManifestSummary(trainCwd);
+      // Surface watcher build errors directly. Without this gate the
+      // HMR fast path below would happily serve the LAST GOOD
+      // artefact even when the user's current source fails to
+      // compile: `RunTraining` polls `/api/manifest` every ~5 s, so
+      // the next poll after a compile error would 200 with stale
+      // data and silently overwrite the SSE-surfaced error UI.
+      // Users would then see a "healthy" trainer in the manifest
+      // and unknowingly run stale code/config while the latest
+      // edit is still broken. Rejecting with the SSE error message
+      // keeps the SPA's error state consistent across both
+      // channels (poll + SSE).
+      if (options.hmr?.getLastEventType() === "error") {
+        return c.json({ error: "Build failed; see HMR error frame" }, 400);
+      }
+      // HMR-aware fast path: when `arkor dev` wired in a coordinator,
+      // skip the per-request `runBuild()` and read the watcher's
+      // already-built artefact. Without this every SPA poll
+      // (~5 s + per-rebuild SSE refetch) would re-bundle and race
+      // the watcher writing to the same `.arkor/build/index.mjs`.
+      const manifest = await readManifestSummary(trainCwd, {
+        prebuiltOutFile: hmrOutFile,
+      });
       return c.json(manifest);
     } catch (err) {
       // The user's `src/arkor/index.ts` may not exist yet (fresh scaffold) or
@@ -335,11 +435,15 @@ export function buildStudioApp(options: StudioServerOptions) {
     return new Response(upstream.body, { status: upstream.status, headers });
   });
 
+  // Active `/api/train` subprocesses. The registry encapsulates the
+  // signal-dispatch policy (see `studio/trainRegistry.ts`).
+  const activeTrains = new TrainRegistry();
+
   app.post("/api/train", async (c) => {
     const body = (await c.req.json().catch(() => ({}))) as { file?: string };
     let trainFile: string | undefined;
     if (body.file) {
-      // Resolve symlinks before the containment check — `path.resolve` is purely
+      // Resolve symlinks before the containment check: `path.resolve` is purely
       // lexical, so a symlink under the project directory pointing at e.g.
       // `/etc/passwd` would otherwise pass `startsWith(baseAbs + sep)`. The
       // bin spawned below would then dlopen the link's target.
@@ -362,32 +466,621 @@ export function buildStudioApp(options: StudioServerOptions) {
       }
       trainFile = abs;
     }
+    // Snapshot the current `configHash` so HMR routing on the *next*
+    // rebuild can compare against this child's spawn-time config.
+    //
+    // When HMR is enabled, read it synchronously from the coordinator
+    // (which already maintains `lastEvent.configHash` for its watcher).
+    // Reading from the cache avoids triggering an extra `runBuild()`
+    // per train request: the previous implementation called
+    // `readManifestSummary(trainCwd)` here, which both wasted CPU and
+    // raced the watcher writing the same `.arkor/build/index.mjs`.
+    //
+    // When HMR is disabled the field is irrelevant (no rebuilds will
+    // happen) so we leave it null without paying for a build.
+    const configHash: string | null = options.hmr
+      ? options.hmr.getCurrentConfigHash()
+      : null;
+    // Spawn-time CONTENT-hash of the on-disk build artefact. Only
+    // the pre-ready-spawn case in `dispatchRebuild` consults it:
+    // when a rebuild lands while the child's `configHash` is still
+    // null, backfilling the new hash is only safe if the artefact
+    // bytes the child loaded (= the bytes on disk *now*, at spawn)
+    // are the same bytes the new hash describes. Without this
+    // gate, an edit landing between spawn and the watcher's first
+    // BUNDLE_END would silently align the registry with a config
+    // the child never actually loaded → cloud-side `JobConfig`
+    // drift on subsequent same-hash hot-swaps.
+    //
+    // Content (sha256) rather than mtime+ctime+size: the
+    // timestamp version had a false-positive failure mode where a
+    // watcher rebuild that produced identical bytes still bumped
+    // mtime/ctime, forcing a spurious cancel+restart cycle on a
+    // pre-ready spawn even though the child's loaded bytes
+    // actually matched the new build. Content-hash is precise.
+    const spawnArtifactContentHash: string | null = options.hmr
+      ? options.hmr.getCurrentArtifactContentHash()
+      : null;
+    // Capture the cloud-api scope NOW (at spawn time) so the cancel
+    // handler can POST `/v1/jobs/:id/cancel` without re-reading
+    // `.arkor/state.json` at stop time. If the user removed or made
+    // the state file unreadable mid-training, the stop-time read
+    // would return null and the cancel POST would silently skip:
+    // local SIGKILL still tears down the subprocess but the cloud
+    // run orphans. Pinning the scope on the registry entry when it
+    // exists decouples cancel correctness from mutable filesystem state.
+    //
+    // `spawnScope` may legitimately be `null` on a first-run anonymous
+    // project: `.arkor/state.json` is created by `ensureProjectState`
+    // INSIDE the child during `trainer.start()`, i.e. AFTER spawn but
+    // possibly before the user clicks Stop. The cancel handler treats
+    // a null registry scope as a signal to fall back to reading
+    // `.arkor/state.json` at cancel time (the file should exist by
+    // then because the runner emits its `Started job <id>` line AFTER
+    // `trainer.start()` resolved, which is the same point at which
+    // `ensureProjectState` has finished writing the state file). The
+    // delete-mid-training hazard the spawn-time capture exists to
+    // close only applies when the SPAWN read succeeded; once we have
+    // a non-null capture we never re-read.
+    const spawnState = await readState(trainCwd);
+    const spawnScope = spawnState
+      ? { orgSlug: spawnState.orgSlug, projectSlug: spawnState.projectSlug }
+      : null;
     const args = [trainBinPath, "start"];
     if (trainFile) args.push(trainFile);
-    const child = spawn(process.execPath, args, {
-      stdio: "pipe",
-      cwd: trainCwd,
+    // Per-spawn 16-byte nonce passed via env var so the runner can
+    // prefix its `Started job <id>` line with `[arkor:<nonce>] `. The
+    // server matches that nonce-prefixed shape (see
+    // `buildStartedJobPattern` for why). 32-hex chars of entropy
+    // guarantees a user-code spoof attempt can't guess the prefix in
+    // a single shot, and `core/runner.ts` deletes the env var BEFORE
+    // dynamically importing the user module so user code can't read
+    // it via `process.env` either.
+    const startedJobNonce = randomBytes(16).toString("hex");
+    const startedJobPattern = buildStartedJobPattern(startedJobNonce);
+    // `spawn()` is mostly async (filesystem failures surface as the
+    // child's `error` event), but Node can still throw synchronously
+    // for argument-shape problems (e.g. invalid stdio descriptor on
+    // unusual platforms). Catch both paths so an `/api/train` POST
+    // can never hang the SPA: sync throws return a clean 500, async
+    // 'error' events forward into the stream and close it (handled
+    // inside the ReadableStream `start()` below).
+    // `ChildProcessByStdio<Writable, Readable, Readable>` is the
+    // specific overload return for `stdio: "pipe"`; narrows
+    // `child.stdout` / `child.stderr` away from the nullable
+    // `Readable | null` of the general `ChildProcess` type.
+    // `ReturnType<typeof spawn>` would land on the union and force
+    // a `?.` everywhere downstream.
+    let child: ChildProcessByStdio<Writable, Readable, Readable>;
+    try {
+      child = spawn(process.execPath, args, {
+        stdio: "pipe",
+        cwd: trainCwd,
+        env: {
+          ...process.env,
+          ARKOR_JOB_ID_MARKER_NONCE: startedJobNonce,
+        },
+      });
+    } catch (err: unknown) {
+      const msg = err instanceof Error ? err.message : String(err);
+      return c.json(
+        { error: `Failed to spawn training subprocess: ${msg}` },
+        500,
+      );
+    }
+    activeTrains.register(child, {
+      trainFile,
+      configHash,
+      spawnArtifactContentHash,
+      scope: spawnScope,
     });
-    const stream = new ReadableStream({
+    // Hoisted out of the `ReadableStream` underlying-source so the
+    // `start` handler can hand its closure-bound teardown helper to
+    // the `cancel` handler. `cancel` runs in a separate invocation,
+    // not through `controller`, so the two need a parent-scope
+    // rendezvous variable.
+    let cancelTeardown: (() => void) | null = null;
+    // Mirror of the cloud `jobId` parsed out of the runner's
+    // stdout, accessible to both the `start` (parser writes) and
+    // `cancel` (post-unregister read) handlers. We can't just call
+    // `activeTrains.getJobId(pid)` from `cancel` because cancel
+    // unregisters the entry first, so subsequent reads of the
+    // registry would always be `null` even if the parser races a
+    // late line in afterwards. This closure variable keeps the id
+    // observable even after unregister, so the cancel POST poll
+    // below can pick up a jobId that lands a few ms after Stop.
+    let parsedJobId: string | null = null;
+    const stream = new ReadableStream<Uint8Array>({
       start(controller) {
+        // After `cancel()` runs, calling `controller.enqueue` /
+        // `controller.close` on the now-closed controller throws
+        // ("Invalid state: Controller is closed"). The child
+        // subprocess keeps emitting `data` and ultimately a `close`
+        // event for some time after the client disconnects, so each
+        // forwarder needs its own "are we still attached?" guard.
+        // Track via a flag plus an explicit listener-removal so the
+        // event loop also stops dispatching once we've torn down.
+        let closed = false;
+        // `child.stdout` is in default (binary) mode, so each `data`
+        // chunk is a Buffer, and `Buffer extends Uint8Array`, so we
+        // can pass it straight to `controller.enqueue` without a
+        // round-trip through `TextEncoder`. The previous code did
+        // `enc.encode(d)` which implicitly coerced the buffer via
+        // `String()`: extra allocation, and lossy for split multi-byte chars.
+        // Forward a chunk to the SPA stream. Shared between the
+        // stdout and stderr listeners; both paths surface as
+        // response body bytes for the SPA's log view.
+        const forward = (d: Buffer): void => {
+          if (closed) return;
+          try {
+            controller.enqueue(d);
+          } catch {
+            // Controller raced us into the closed state; flip the
+            // flag so subsequent chunks short-circuit.
+            closed = true;
+          }
+        };
+        // Carry-over buffer for line-oriented job-id extraction.
+        // Stream chunk boundaries are arbitrary: the runner's
+        // single-line `Started job <id>` write can land split
+        // across two `data` events, in which case a per-chunk
+        // regex would never match and the cancel POST chain
+        // would never fire (cloud-job orphan on Stop). We
+        // accumulate text until a newline, parse the complete
+        // line, and keep any trailing partial for the next
+        // chunk. Cleared the moment the id is recorded so a
+        // chatty bin doesn't pin memory after the marker has
+        // landed; capped at 4 KiB regardless to bound a
+        // misbehaving bin that never emits a newline before the
+        // marker (the canonical line is well under 100 bytes).
+        let stdoutLineBuf = "";
+        const STARTED_JOB_BUFFER_CAP = 4096;
+        // STDOUT-ONLY job-id parser. The runner writes the canonical
+        // `Started job <id>` line via `process.stdout.write` (never
+        // stderr), so a single shared buffer across both pipes
+        // would mis-match in two ways:
+        //   1. A user `console.error("Started job <token>")` would
+        //      poison the buffer first; the real stdout marker
+        //      arrives later but our `getJobId(...) === null` gate
+        //      has already short-circuited subsequent scans, so
+        //      Stop-training POSTs cancel for the wrong (or
+        //      non-existent) job.
+        //   2. Interleaved stderr bytes could land between
+        //      "Started job " and "<id>\n" in the shared buffer,
+        //      breaking the anchored line match → missed match →
+        //      cloud cancel skipped on Stop.
+        // Two dedicated handlers share `forward` for the byte
+        // pipeline but only the stdout one runs the parse.
+        const onStdoutChunk = (d: Buffer): void => {
+          // Intentionally NOT gated on `closed`: when the SPA cancels,
+          // `cancelTeardown()` flips `closed = true` so the controller
+          // path no-ops, but the cancel IIFE then POLLS `parsedJobId`
+          // for up to 500 ms to catch a `Started job <id>` line that
+          // landed just after the user clicked Stop. The parser has to
+          // keep running during that window for the poll to ever
+          // observe a value. (`forward()` has its own `closed` check
+          // for the controller-enqueue side, so the SSE-body path
+          // stays sealed.) Gate the parse on `parsedJobId === null`
+          // (not `activeTrains.getJobId(...) === null`): the latter
+          // returns null forever after `unregister`, which would make
+          // us re-enter and re-parse the buffer on every subsequent
+          // chunk during the poll window.
+          if (parsedJobId === null) {
+            stdoutLineBuf += d.toString("utf8");
+            let nl = stdoutLineBuf.indexOf("\n");
+            while (nl !== -1) {
+              // Strip a possible \r so CRLF-emitting bins (rare for
+              // Node `process.stdout.write` but defensive) match
+              // the same anchored pattern.
+              const line = stdoutLineBuf.slice(0, nl).replace(/\r$/, "");
+              stdoutLineBuf = stdoutLineBuf.slice(nl + 1);
+              const m = startedJobPattern.exec(line);
+              if (m && m[1]) {
+                activeTrains.recordJobId(child.pid, m[1]);
+                // Mirror to the parent-scope closure so the cancel
+                // handler can pick this up even AFTER it called
+                // `activeTrains.unregister(...)` (the registry
+                // read would return null post-unregister).
+                parsedJobId = m[1];
+                stdoutLineBuf = "";
+                break;
+              }
+              nl = stdoutLineBuf.indexOf("\n");
+            }
+            if (stdoutLineBuf.length > STARTED_JOB_BUFFER_CAP) {
+              stdoutLineBuf = stdoutLineBuf.slice(-STARTED_JOB_BUFFER_CAP);
+            }
+          }
+          forward(d);
+        };
+        const onStderrChunk = (d: Buffer): void => {
+          // Forward only; never scan for `Started job`. See
+          // `onStdoutChunk` comment for the cross-stream poisoning
+          // hazards this split prevents.
+          forward(d);
+        };
         const enc = new TextEncoder();
-        child.stdout.on("data", (d) => controller.enqueue(enc.encode(d)));
-        child.stderr.on("data", (d) => controller.enqueue(enc.encode(d)));
-        child.on("close", (code) => {
-          controller.enqueue(enc.encode(`\n---\nexit=${code}\n`));
-          controller.close();
-        });
+        // Detach every listener this stream wired onto `child`. Called
+        // from `onClose` / `onError` themselves, so once one fires the
+        // closure references (controller, TextEncoder) drop and the
+        // subprocess record can be GC'd promptly even if the other
+        // event also queues. `cancelTeardown` deliberately does NOT
+        // call this (see its comment). Removing only the `data` listeners
+        // (as the previous code did) left `close` / `error` attached
+        // to the dead ChildProcess, which kept their closures pinned
+        // until the process object itself was reaped: meaningful
+        // memory pressure for an `arkor dev` session that spawns many
+        // children over hours.
+        const detachListeners = (): void => {
+          child.stdout.off("data", onStdoutChunk);
+          child.stderr.off("data", onStderrChunk);
+          child.off("close", onClose);
+          child.off("error", onError);
+        };
+        const onClose = (code: number | null): void => {
+          activeTrains.unregister(child.pid);
+          detachListeners();
+          if (closed) return;
+          closed = true;
+          try {
+            controller.enqueue(enc.encode(`\n---\nexit=${code}\n`));
+            controller.close();
+          } catch {
+            // already cancelled; nothing more to do.
+          }
+        };
+        // `error` event fires when async spawn machinery surfaces a
+        // failure (ENOENT for the executable, EACCES, EAGAIN under
+        // resource exhaustion, etc.). Without this listener the
+        // ReadableStream would never close; the SPA would hang
+        // waiting for output that never arrives. Forward the error
+        // text into the stream body, close, and unregister the
+        // child. Node's contract is: if 'error' fires, 'close' may
+        // or may not follow; both paths are guarded by the `closed`
+        // flag and the `unregister` call is idempotent.
+        const onError = (err: Error): void => {
+          activeTrains.unregister(child.pid);
+          detachListeners();
+          if (closed) return;
+          closed = true;
+          try {
+            controller.enqueue(
+              enc.encode(`\n---\nerror=${err.message}\n`),
+            );
+            controller.close();
+          } catch {
+            // already cancelled; nothing more to do.
+          }
+        };
+        child.stdout.on("data", onStdoutChunk);
+        child.stderr.on("data", onStderrChunk);
+        child.on("close", onClose);
+        child.on("error", onError);
+        cancelTeardown = () => {
+          // Don't detach data listeners here: the child stays alive
+          // for some time after the SPA cancels, either because
+          // we're skipping `child.kill()` for an in-progress
+          // HMR early-stop, or because `child.kill()`'s SIGTERM
+          // triggers a graceful checkpoint+exit that takes
+          // seconds. During that window the child keeps writing
+          // logs to its stdout/stderr pipes; if our `data`
+          // listeners are gone, Node stops draining the OS pipe,
+          // the buffer fills, and the child's next `write()`
+          // blocks indefinitely, deadlocking the very graceful
+          // exit we're trying to preserve. The `closed` flag
+          // already makes `enqueue`/`close` a no-op so the
+          // controller-closed race stays safe; the eventual
+          // `onClose` / `onError` listeners detach everything
+          // (via `detachListeners()`) when the child finally
+          // exits. That timing (at-exit, not at-cancel) is the
+          // correct moment to break the closure refs for GC.
+          closed = true;
+        };
       },
       cancel() {
-        child.kill();
+        // The SPA-side cancel is always *user-initiated*: either an
+        // explicit Stop click or tab-close/navigation, which the
+        // user just as explicitly chose. HMR-driven SIGTERMs go
+        // straight from the server to the runner via
+        // `dispatchRebuild`; they DO NOT trigger this handler
+        // (the SPA waits for the train stream's `exit=` line and
+        // schedules auto-restart, never aborting). So manual stop
+        // takes precedence over any in-flight HMR graceful path:
+        // we POST cloud cancel + SIGKILL unconditionally.
+        //
+        // SIGKILL is uncatchable so the long-standing
+        // "second-SIGTERM-triggers-exit(143)-fast-path" worry
+        // (which used to gate this branch on
+        // `isEarlyStopRequested`) doesn't apply. The runner's
+        // graceful early-stop chain may have been trying to
+        // preserve a checkpoint, but the user just said no; keep
+        // the local subprocess teardown snappy and let the
+        // server-side cancel POST handle the cloud-side release.
+        //
+        // Capture the spawn-time scope BEFORE unregistering: once
+        // the entry is gone, the getters return null and the
+        // fire-and-forget POST below would no-op. (The cloud job
+        // id is read from the `parsedJobId` closure mirror instead.)
+        //
+        // `pid` is captured once here because the closure below
+        // runs after `unregister` and we want a stable handle.
+        const cancelPid = child.pid;
+        // Scope resolution order:
+        //   1. Registry entry's pinned scope (captured at spawn time).
+        //      Authoritative when non-null: a user who deleted or made
+        //      `.arkor/state.json` unreadable AFTER spawn shouldn't be
+        //      able to silently orphan their cloud job by losing the
+        //      cancel-time read.
+        //   2. Cancel-time re-read of `.arkor/state.json`, ONLY when
+        //      the spawn-time capture was null. This handles the
+        //      first-run anon case where `ensureProjectState` writes
+        //      the state file from inside the child during
+        //      `trainer.start()` (i.e. AFTER spawn). The read happens
+        //      inside the fire-and-forget IIFE below so the cancel
+        //      handler stays sync.
+        const pinnedScope = activeTrains.getScope(cancelPid);
+        activeTrains.unregister(cancelPid);
+        cancelTeardown?.();
+        // Fire-and-forget cloud-side cancel so the cloud job is
+        // released even though the SIGKILL below bypasses the
+        // runner's `installShutdownHandlers` (which would
+        // otherwise issue cancel itself via the graceful
+        // early-stop chain). The IIFE polls for the jobId
+        // *briefly* before giving up: there's a real race
+        // window where the user clicks Stop after the cloud
+        // job has been created but before the runner's
+        // `Started job <id>` line has been parsed (cloud
+        // createJob roundtrip is ~50-200ms; UI clicks can land
+        // sub-100ms into that window). Polling closes the most
+        // common case; beyond ~500 ms we accept the cloud-side
+        // orphan as a follow-up (the cloud reaper / TTL is the
+        // safety net, and the alternative of querying cloud-api
+        // for matching jobs at cancel time is brittle in
+        // multi-tab/multi-spawn scenarios).
+        void (async () => {
+          // Brief poll on `parsedJobId` (the closure mirror,
+          // see top-of-handler for why it can't be the
+          // registry's `getJobId`): the runner's
+          // `Started job <id>` line may not have been parsed by
+          // the time the user clicked Stop. Most runs hit it
+          // within ~50-200 ms of spawn (cloud createJob
+          // roundtrip), so polling for up to ~500 ms catches
+          // nearly all races. Beyond that we accept the
+          // cloud-side orphan as a documented follow-up: cloud
+          // reaper / TTL is the safety net, and the
+          // alternative (querying cloud-api for matching jobs
+          // at cancel time) is brittle for multi-tab /
+          // multi-spawn cases.
+          if (parsedJobId === null) {
+            const start = Date.now();
+            while (parsedJobId === null && Date.now() - start < 500) {
+              await new Promise((r) => setTimeout(r, 25));
+            }
+          }
+          if (parsedJobId === null) return;
+          // Resolve the cloud scope: prefer the spawn-time
+          // capture (immutable, snapshot at spawn) and fall back
+          // to reading `.arkor/state.json` only when there was
+          // none. The state file usually exists by now: the
+          // runner doesn't print `Started job <id>` until
+          // `trainer.start()` resolves, and `ensureProjectState`
+          // (which writes the file from inside the child for
+          // first-run anon projects) runs as part of that path.
+          let scopeForCancel = pinnedScope;
+          if (!scopeForCancel) {
+            try {
+              const late = await readState(trainCwd);
+              if (late) {
+                scopeForCancel = {
+                  orgSlug: late.orgSlug,
+                  projectSlug: late.projectSlug,
+                };
+              }
+            } catch {
+              // best-effort
+            }
+          }
+          if (!scopeForCancel) return;
+          try {
+            // `createRpc` now needs (baseUrl, token) explicitly; main's
+            // refactor moved off the closure-based getter so the per-
+            // request credentials read happens once here rather than
+            // twice via the SDK's lazy token callback.
+            const { baseUrl: rpcBaseUrl, token: rpcToken } =
+              await resolveCredentialsAndBaseUrl();
+            const rpc = createRpc(rpcBaseUrl, rpcToken);
+            await rpc.v1.jobs[":id"].cancel.$post({
+              param: { id: parsedJobId },
+              query: {
+                orgSlug: scopeForCancel.orgSlug,
+                projectSlug: scopeForCancel.projectSlug,
+              },
+            });
+          } catch {
+            // Best-effort: cloud-api transient failure or scope
+            // drift. Cloud reaper / TTL is the safety net.
+          }
+        })();
+        // SIGKILL (not the default SIGTERM) for user-initiated
+        // aborts. The runner's `installShutdownHandlers` now treats
+        // a single SIGTERM as the HMR-driven "graceful early-stop"
+        // signal: wait for the next checkpoint (up to ~5 min
+        // timeout) before exiting. Those semantics are right for the
+        // HMR path but wrong for a Stop-training click: the user
+        // wants the run STOPPED, not left running in the background
+        // for minutes consuming GPU/cloud spend while the UI has
+        // already settled to idle. SIGKILL is uncatchable so the
+        // child dies immediately, eliminating the
+        // unregister-before-graceful-exit window where a fast new
+        // run could overlap an old one untracked by HMR routing.
+        //
+        // The cloud-side job is released by the fire-and-forget
+        // POST above (the stdout parser mirrors the runner's
+        // `Started job <id>` into `parsedJobId`; the IIFE reads it). SIGKILL
+        // alone would have left the cloud job orphaned until
+        // TTL/reaper because the runner can't POST cancel itself
+        // when the kernel reaps it without warning. Together,
+        // server-side cancel POST + SIGKILL give snappy local
+        // teardown AND eventual cloud-side release.
+        //
+        // `ChildProcess.kill()` can throw (ESRCH if the process has
+        // already exited between this handler's invocation and the
+        // signal delivery). A throw here would surface as an unhandled
+        // exception in the request pipeline and crash the server
+        // handler. Swallow it; the `unregister` call above has
+        // already taken the entry out of the registry.
+        try {
+          child.kill("SIGKILL");
+        } catch {
+          // already gone; nothing to clean up.
+        }
       },
     });
-    return new Response(stream, {
-      status: 200,
-      headers: { "content-type": "text/plain; charset=utf-8" },
-    });
+    // Expose the spawned pid via a response header so the SPA can
+    // tell its own child apart from other tabs' children when
+    // `/api/dev/events` broadcasts `restartTargets` / `hotSwapTargets`.
+    // Without this, a passive tab whose run was hot-swapped could
+    // misread a sibling tab's restart event as its own.
+    //
+    // Header is OMITTED entirely (rather than sent as an empty
+    // string) when `child.pid` isn't a number; that case happens
+    // when the OS hasn't assigned a pid by the time `spawn()`
+    // returns and the child's async `error` event will fire shortly
+    // (per-Node-docs `subprocess.pid` is `undefined` for
+    // failed-spawn children). "Header absent" is the unambiguous
+    // signal the SPA can read; an empty string would force callers
+    // to special-case `""` vs missing for the same condition. The
+    // SPA's `raw ? Number.parseInt(raw, 10) : NaN` handler treats
+    // both cases identically, but absent-only is the cleaner wire
+    // contract.
+    const headers: Record<string, string> = {
+      "content-type": "text/plain; charset=utf-8",
+    };
+    if (typeof child.pid === "number") {
+      headers[TRAIN_PID_HEADER] = String(child.pid);
+    }
+    return new Response(stream, { status: 200, headers });
   });
 
+  // `/api/dev/events`: SSE stream of HMR rebuild / error notifications.
+  // Only active when `arkor dev` passed an HMR coordinator. The CSRF model
+  // accepts `?studioToken=` here (whitelisted in `eventStreamPathPattern`)
+  // because `EventSource` cannot send headers. When HMR is not configured
+  // the route still has an explicit 404 so the request doesn't fall through
+  // to the SPA index.html (which would mislead the SPA into thinking the
+  // EventSource connected successfully).
+  if (!options.hmr) {
+    app.get("/api/dev/events", (c) =>
+      c.json({ error: "HMR not enabled" }, 404),
+    );
+  }
+  if (options.hmr) {
+    const hmr = options.hmr;
+    /** Augmented event = raw HMR event + the per-child signal results we
+     *  computed for it. We compute these once per rebuild (not once per
+     *  connected SSE client) so opening multiple Studio tabs doesn't fan
+     *  out into N × SIGTERM / N × SIGUSR2 to each child. */
+    type AugmentedEvent = HmrEvent & {
+      restart?: boolean;
+      hotSwap?: boolean;
+      restartTargets?: RestartTarget[];
+      hotSwapTargets?: RestartTarget[];
+    };
+    const sseListeners = new Set<(event: AugmentedEvent) => void>();
+    let lastAugmented: AugmentedEvent | null = null;
+
+    // Single subscription against the HMR coordinator: this handler does
+    // signal dispatch + augmentation exactly once per rebuild, then fans
+    // the augmented payload out to every connected SSE client. Late-
+    // mounting clients receive `lastAugmented` instead of triggering a
+    // fresh signal pass against the same rebuild.
+    hmr.subscribe((event) => {
+      let augmented: AugmentedEvent = event;
+      // Route dispatch through every *successful* build event, not
+      // just `rebuild`. The coordinator emits the very first
+      // successful compile as `ready` (and the entry-wait recovery
+      // path also broadcasts `ready` when a fresh-scaffold project's
+      // entry file first appears). A child started via `/api/train`
+      // before the first `ready` (e.g. the SPA fired Run Training
+      // immediately after `arkor dev` booted, while the watcher's
+      // initial BUNDLE_END was still in flight) would otherwise
+      // never get SIGUSR2/SIGTERM-routed when that build lands,
+      // leaving it stuck on a stale or empty artifact until the
+      // next edit triggers a `rebuild`. Filtering by "not error"
+      // is forward-compatible with any new successful event types.
+      if (event.type !== "error" && activeTrains.size > 0) {
+        // Single per-child decision pass: hash match → SIGUSR2 (with
+        // a Windows fallback to SIGTERM since win32 doesn't deliver
+        // SIGUSR2), hash mismatch → SIGTERM. The registry returns
+        // both buckets so the SPA can react per-child rather than
+        // assuming one global outcome.
+        const nextHash = event.configHash ?? null;
+        // Content-hash for the pre-ready-spawn equality gate (the
+        // timestamp `event.hash` would over-trigger SIGTERM-restart
+        // on identical-bytes rebuilds). Both sides of the
+        // comparison (`entry.spawnArtifactContentHash` captured
+        // via `getCurrentArtifactContentHash()`, and this
+        // `event.contentHash`) are derived the same way, so a
+        // match means the child's loaded bytes ARE what the new
+        // configHash describes.
+        const nextArtifactContentHash = event.contentHash ?? null;
+        const { hotSwapTargets, restartTargets } = activeTrains.dispatchRebuild(
+          nextHash,
+          nextArtifactContentHash,
+        );
+        augmented = {
+          ...event,
+          hotSwap: hotSwapTargets.length > 0,
+          hotSwapTargets,
+          restart: restartTargets.length > 0,
+          restartTargets,
+        };
+      }
+      lastAugmented = augmented;
+      for (const fn of sseListeners) {
+        try {
+          fn(augmented);
+        } catch {
+          // listener controller closed mid-write; the cancel hook
+          // below takes care of removing it from the set.
+        }
+      }
+    });
+
+    app.get("/api/dev/events", () => {
+      const enc = new TextEncoder();
+      let listener: ((event: AugmentedEvent) => void) | null = null;
+      const stream = new ReadableStream({
+        start(controller) {
+          const send = (event: AugmentedEvent): void => {
+            const payload = JSON.stringify(event);
+            try {
+              controller.enqueue(
+                enc.encode(`event: ${event.type}\ndata: ${payload}\n\n`),
+              );
+            } catch {
+              // controller closed mid-write; cancel() removes us.
+            }
+          };
+          if (lastAugmented) send(lastAugmented);
+          listener = send;
+          sseListeners.add(send);
+        },
+        cancel() {
+          if (listener) sseListeners.delete(listener);
+          listener = null;
+        },
+      });
+      return new Response(stream, {
+        status: 200,
+        headers: {
+          "content-type": "text/event-stream",
+          "cache-control": "no-cache, no-transform",
+        },
+      });
+    });
+  }
+
   // Playground hits this so mid-training inference from Studio has the same
   // auth path as the rest of /api/*. State is auto-bootstrapped (anon only)
   // so the Playground's base-model mode works on a fresh anonymous launch
@@ -407,7 +1100,7 @@ export function buildStudioApp(options: StudioServerOptions) {
       state = await ensureProjectState({ cwd: trainCwd, client, credentials });
     } catch (err) {
       // Propagate cloud-api's status verbatim (e.g. 401 / 403 / 5xx) so the
-      // SPA / clients can react appropriately — collapsing everything to 400
+      // SPA / clients can react appropriately; collapsing everything to 400
       // would mis-report upstream outages and auth failures. Anything else
       // (local writeState failures, missing-credentials guard) is treated as
       // a server-side error.
@@ -897,7 +1590,11 @@ export function buildStudioApp(options: StudioServerOptions) {
       const file = await readFile(join(assetsDir, cleaned));
       const ext = cleaned.slice(cleaned.lastIndexOf(".") + 1);
       if (ext === "html") {
-        const html = injectStudioToken(file.toString("utf8"), studioToken);
+        const html = injectStudioMeta(
+          file.toString("utf8"),
+          studioToken,
+          Boolean(options.hmr),
+        );
         return new Response(html, {
           status: 200,
           headers: { "content-type": CONTENT_TYPES.html! },
diff --git a/packages/arkor/src/studio/trainRegistry.test.ts b/packages/arkor/src/studio/trainRegistry.test.ts
new file mode 100644
index 00000000..1278f7f0
--- /dev/null
+++ b/packages/arkor/src/studio/trainRegistry.test.ts
@@ -0,0 +1,411 @@
+import { describe, it, expect, vi } from "vitest";
+import type { ChildProcess } from "node:child_process";
+import { TrainRegistry } from "./trainRegistry";
+
+interface FakeChild {
+  pid: number;
+  kill: ReturnType<typeof vi.fn>;
+}
+
+function fakeChild(pid: number): FakeChild {
+  // Default: `kill(sig)` returns `true`, mirroring Node's contract for
+  // a successful signal delivery to a still-running process.
+  return { pid, kill: vi.fn(() => true) };
+}
+
+describe("TrainRegistry", () => {
+  it("ignores children without a pid (already-exited spawns)", () => {
+    const reg = new TrainRegistry();
+    reg.register({ pid: undefined } as unknown as ChildProcess, {
+      configHash: "h1",
+    });
+    expect(reg.size).toBe(0);
+  });
+
+  it("dispatchRebuild SIGUSR2s only matching configHashes", () => {
+    const reg = new TrainRegistry();
+    const a = fakeChild(101);
+    const b = fakeChild(102);
+    const c = fakeChild(103);
+    reg.register(a as unknown as ChildProcess, { configHash: "match" });
+    reg.register(b as unknown as ChildProcess, {
+      configHash: "different",
+      trainFile: "/tmp/b.ts",
+    });
+    reg.register(c as unknown as ChildProcess, { configHash: "match" });
+
+    const result = reg.dispatchRebuild("match");
+    expect(result.hotSwapTargets).toEqual([
+      { pid: 101, trainFile: undefined },
+      { pid: 103, trainFile: undefined },
+    ]);
+    expect(result.restartTargets).toEqual([
+      { pid: 102, trainFile: "/tmp/b.ts" },
+    ]);
+    expect(a.kill).toHaveBeenCalledWith("SIGUSR2");
+    expect(c.kill).toHaveBeenCalledWith("SIGUSR2");
+    expect(b.kill).toHaveBeenCalledWith("SIGTERM");
+  });
+
+  it("dispatchRebuild SIGTERMs everything when nextConfigHash is null", () => {
+    // null nextHash means "we couldn't inspect the new bundle": be
+    // conservative and SIGTERM every active child since we can't
+    // prove their configs are unaffected.
+    const reg = new TrainRegistry();
+    const a = fakeChild(201);
+    const b = fakeChild(202);
+    reg.register(a as unknown as ChildProcess, { configHash: "h" });
+    reg.register(b as unknown as ChildProcess, { configHash: null });
+
+    const result = reg.dispatchRebuild(null);
+    expect(result.hotSwapTargets).toEqual([]);
+    expect(result.restartTargets).toHaveLength(2);
+    expect(a.kill).toHaveBeenCalledWith("SIGTERM");
+    expect(b.kill).toHaveBeenCalledWith("SIGTERM");
+  });
+
+  it("dispatchRebuild backfills the hash and skips dispatch when the spawn-time artefact matches the new build", () => {
+    // Pre-ready spawn (configHash: null) is the "user clicked Run
+    // before the watcher's first BUNDLE_END" case. Whether it's
+    // safe to backfill the new hash as the child's baseline depends
+    // on whether the on-disk artefact has changed between spawn
+    // and now: if `spawnArtifactContentHash === nextArtifactContentHash`, the
+    // child read exactly the bytes the new hash describes →
+    // backfill + skip dispatch (no spurious cancel+restart cycle).
+    // Otherwise (see the next test) SIGTERM-restart so cloud
+    // and child stay aligned.
+    const reg = new TrainRegistry();
+    const c = fakeChild(401);
+    reg.register(c as unknown as ChildProcess, {
+      configHash: null,
+      trainFile: "/tmp/preready.ts",
+      spawnArtifactContentHash: "art-v1",
+    });
+    const result = reg.dispatchRebuild("first-real-hash", "art-v1");
+    // Neither bucket: no signal sent, nothing for the SPA to react to.
+    expect(result.hotSwapTargets).toEqual([]);
+    expect(result.restartTargets).toEqual([]);
+    expect(c.kill).not.toHaveBeenCalled();
+    // A subsequent dispatch with the SAME config hash must take the
+    // hot-swap path (proves the backfill landed; without it this
+    // would STILL be null vs "first-real-hash" → SIGTERM).
+    const second = reg.dispatchRebuild("first-real-hash", "art-v2");
+    expect(second.hotSwapTargets).toEqual([
+      { pid: 401, trainFile: "/tmp/preready.ts" },
+    ]);
+    expect(second.restartTargets).toEqual([]);
+    expect(c.kill).toHaveBeenCalledWith("SIGUSR2");
+    // And a different config hash on a later rebuild now correctly
+    // routes to SIGTERM-restart (backfilled hash is real).
+    c.kill.mockClear();
+    const third = reg.dispatchRebuild("second-hash", "art-v3");
+    expect(third.restartTargets).toEqual([
+      { pid: 401, trainFile: "/tmp/preready.ts" },
+    ]);
+    expect(c.kill).toHaveBeenCalledWith("SIGTERM");
+  });
+
+  it("dispatchRebuild SIGTERM-restarts a pre-ready spawn when the artefact has changed since spawn", () => {
+    // Codex P2 regression: an edit landing between spawn and the
+    // watcher's first BUNDLE_END means the bytes the child loaded
+    // differ from what the new `configHash` describes. Backfilling
+    // unconditionally would silently teach the registry to use the
+    // post-edit hash as the child's baseline; later same-hash
+    // rebuilds would then hot-swap callbacks into a child whose
+    // cloud-side `JobConfig` was actually spawned against an older
+    // version, leaving the cloud run on a stale config. The artefact
+    // fingerprint mismatch (`art-stale` vs `art-fresh`) is the
+    // signal that the child loaded older bytes; SIGTERM-restart
+    // forces a clean re-spawn against the freshly-built artefact.
+    const reg = new TrainRegistry();
+    const c = fakeChild(411);
+    reg.register(c as unknown as ChildProcess, {
+      configHash: null,
+      trainFile: "/tmp/preready-stale.ts",
+      spawnArtifactContentHash: "art-stale",
+    });
+    const result = reg.dispatchRebuild("real-hash", "art-fresh");
+    // SIGTERM-restart: the child's bytes are stale relative to the
+    // new build. Hot-swap would be unsafe (config drift); skip
+    // would leave the child running with no future correction
+    // path (the registry would treat "real-hash" as the baseline
+    // even though the child never loaded that build).
+    expect(result.hotSwapTargets).toEqual([]);
+    expect(result.restartTargets).toEqual([
+      { pid: 411, trainFile: "/tmp/preready-stale.ts" },
+    ]);
+    expect(c.kill).toHaveBeenCalledWith("SIGTERM");
+  });
+
+  it("dispatchRebuild SIGTERM-restarts a pre-ready spawn when no artefact existed at spawn time", () => {
+    // Companion to the "artefact has changed" test: a fresh project
+    // never built before spawn means `coordinator.getCurrentArtifactContentHash()`
+    // returned `null`. The child's `await import` likely failed; we
+    // can't prove its config matches anything. Conservative
+    // SIGTERM-restart so the SPA re-spawns once the new bundle is
+    // on disk.
+    const reg = new TrainRegistry();
+    const c = fakeChild(421);
+    reg.register(c as unknown as ChildProcess, {
+      configHash: null,
+      trainFile: "/tmp/preready-fresh.ts",
+      spawnArtifactContentHash: null, // no artefact when /api/train fired
+    });
+    const result = reg.dispatchRebuild("first-real-hash", "art-fresh");
+    expect(result.hotSwapTargets).toEqual([]);
+    expect(result.restartTargets).toEqual([
+      { pid: 421, trainFile: "/tmp/preready-fresh.ts" },
+    ]);
+    expect(c.kill).toHaveBeenCalledWith("SIGTERM");
+  });
+
+  it("isEarlyStopRequested reflects the dispatchRebuild SIGTERM flag", () => {
+    // Regression: `cancel()` on `/api/train`'s ReadableStream used to
+    // consult this flag (it now SIGKILLs unconditionally) to avoid a
+    // *second* SIGTERM to a child `dispatchRebuild` already SIGTERMed
+    // for early-stop. A double-SIGTERM hits `installShutdownHandlers`'
+    // emergency `exit(143)` fast-path, bypassing the checkpoint-
+    // preserving cancel flow and potentially leaving the cloud run alive.
+    const reg = new TrainRegistry();
+    const a = fakeChild(901);
+    reg.register(a as unknown as ChildProcess, {
+      configHash: "h1",
+      trainFile: "/tmp/a.ts",
+    });
+    expect(reg.isEarlyStopRequested(901)).toBe(false);
+    // Mismatched hash → SIGTERM → flag flips on.
+    reg.dispatchRebuild("h2");
+    expect(reg.isEarlyStopRequested(901)).toBe(true);
+    // Defensive cases: non-numeric / unknown / never-registered pid.
+    expect(reg.isEarlyStopRequested(undefined)).toBe(false);
+    expect(reg.isEarlyStopRequested(99999)).toBe(false);
+    // Once the child unregisters (close handler) the flag effectively
+    // resets: subsequent queries return false rather than retaining
+    // stale state.
+    reg.unregister(901);
+    expect(reg.isEarlyStopRequested(901)).toBe(false);
+  });
+
+  it("unregister removes the child from the policy decisions", () => {
+    const reg = new TrainRegistry();
+    const a = fakeChild(401);
+    reg.register(a as unknown as ChildProcess, { configHash: "h" });
+    reg.unregister(401);
+    expect(reg.size).toBe(0);
+    const result = reg.dispatchRebuild("h");
+    expect(result.hotSwapTargets).toEqual([]);
+    expect(result.restartTargets).toEqual([]);
+  });
+
+  it("survives kill() throwing (child exited mid-iteration)", () => {
+    const reg = new TrainRegistry();
+    const a = fakeChild(501);
+    a.kill.mockImplementation(() => {
+      throw new Error("ESRCH");
+    });
+    reg.register(a as unknown as ChildProcess, { configHash: "h" });
+    // Both the hot-swap branch (matching hash) and the restart branch
+    // (mismatched hash) must swallow the throw and continue with their
+    // bookkeeping so a single dead child can't break HMR for siblings.
+    expect(() => reg.dispatchRebuild("h")).not.toThrow();
+    expect(() => reg.dispatchRebuild("x")).not.toThrow();
+  });
+
+  it("dispatchRebuild omits dead-on-kill children from the restart targets", () => {
+    // Regression: previously the implementation always pushed onto
+    // `targets` even when `kill()` threw, so a child that had already
+    // exited would still be reported back to the SPA as a restart
+    // target: the SPA would then wait forever for the (already-
+    // delivered) `exit=...` line and never re-spawn.
+    const reg = new TrainRegistry();
+    const dead = fakeChild(601);
+    dead.kill.mockImplementation(() => {
+      const err = new Error("kill ESRCH") as Error & { code?: string };
+      err.code = "ESRCH";
+      throw err;
+    });
+    reg.register(dead as unknown as ChildProcess, {
+      configHash: "stale",
+      trainFile: "/tmp/d.ts",
+    });
+    const result = reg.dispatchRebuild("fresh");
+    expect(result.hotSwapTargets).toEqual([]);
+    expect(result.restartTargets).toEqual([]);
+  });
+
+  it("dispatchRebuild classifies ESRCH on the hash-match branch as 'gone' (no SIGTERM fallback)", () => {
+    // Regression: `safeKill` previously treated any thrown error as
+    // `"unsupported"`, which on the hash-match branch triggers a
+    // SIGTERM fallback (intended for Windows + SIGUSR2 unsupported).
+    // POSIX `kill(2)` raises `ESRCH` for an already-exited child:
+    // classifying that as "unsupported" caused a needless SIGTERM
+    // attempt against a dead PID. Now ESRCH routes through the
+    // "gone" branch (no fallback, no restart-target push) so the
+    // child is dropped silently for the close handler to reap.
+    const reg = new TrainRegistry();
+    const goneOnSigusr2 = fakeChild(801);
+    goneOnSigusr2.kill.mockImplementation(() => {
+      const err = new Error("kill ESRCH") as Error & { code?: string };
+      err.code = "ESRCH";
+      throw err;
+    });
+    reg.register(goneOnSigusr2 as unknown as ChildProcess, {
+      configHash: "match",
+      trainFile: "/tmp/g.ts",
+    });
+    const result = reg.dispatchRebuild("match");
+    // No hot-swap (SIGUSR2 failed), no restart (correctly classified
+    // as gone, NOT routed into the SIGTERM fallback path).
+    expect(result.hotSwapTargets).toEqual([]);
+    expect(result.restartTargets).toEqual([]);
+    // Single SIGUSR2 attempt: no SIGTERM fallback was issued.
+    expect(goneOnSigusr2.kill).toHaveBeenCalledTimes(1);
+    expect(goneOnSigusr2.kill).toHaveBeenCalledWith("SIGUSR2");
+  });
+
+  it("dispatchRebuild omits dead-on-kill children when kill returns false (no throw)", () => {
+    // Regression: `ChildProcess.kill()` returns `false` (without
+    // throwing) when the target process is already gone. The previous
+    // implementation treated any non-throw as success and reported the
+    // child as a restart target; the SPA would then wait forever for
+    // an exit line that already arrived.
+    const reg = new TrainRegistry();
+    const gone = fakeChild(701);
+    gone.kill.mockReturnValue(false);
+    reg.register(gone as unknown as ChildProcess, {
+      configHash: "stale",
+      trainFile: "/tmp/g.ts",
+    });
+    const result = reg.dispatchRebuild("fresh");
+    expect(result.restartTargets).toEqual([]);
+    // We still attempted the kill; only the bookkeeping is skipped.
+    expect(gone.kill).toHaveBeenCalledWith("SIGTERM");
+  });
+
+  it("dispatchRebuild sends SIGTERM at most once per child across rebuilds", () => {
+    // Regression: under rapid edits the dev loop can fire multiple
+    // rebuilds before the child reaches its next checkpoint. The
+    // runner's shutdown handler treats a *second* SIGTERM as the
+    // emergency `exit(143)` fast-path, which would defeat the whole
+    // point of preserving the in-flight checkpoint. The registry now
+    // tracks per-child early-stop state and skips children it has
+    // already signalled.
+    const reg = new TrainRegistry();
+    const a = fakeChild(801);
+    reg.register(a as unknown as ChildProcess, {
+      configHash: "h1",
+      trainFile: "/tmp/a.ts",
+    });
+
+    const first = reg.dispatchRebuild("h2");
+    expect(first.restartTargets).toEqual([
+      { pid: 801, trainFile: "/tmp/a.ts" },
+    ]);
+    expect(a.kill).toHaveBeenCalledTimes(1);
+
+    // Second mismatching rebuild before the child has exited: must NOT
+    // re-send SIGTERM and must NOT re-list the child as a restart
+    // target (the SPA already has a pending re-spawn for it).
+    const second = reg.dispatchRebuild("h3");
+    expect(second.restartTargets).toEqual([]);
+    expect(a.kill).toHaveBeenCalledTimes(1);
+
+    // After the child exits and is unregistered, a fresh spawn in its
+    // place starts from a clean slate.
+    reg.unregister(801);
+    const respawn = fakeChild(802);
+    reg.register(respawn as unknown as ChildProcess, {
+      configHash: "h3",
+      trainFile: "/tmp/a.ts",
+    });
+    const third = reg.dispatchRebuild("h4");
+    expect(third.restartTargets).toEqual([
+      { pid: 802, trainFile: "/tmp/a.ts" },
+    ]);
+    expect(respawn.kill).toHaveBeenCalledTimes(1);
+  });
+
+  it("dispatchRebuild on win32 routes hash-matches directly to SIGTERM-restart (skips SIGUSR2 attempt)", () => {
+    // Regression: Node's `child.kill("SIGUSR2")` on Windows is
+    // documented to **forcefully terminate** the process (treats
+    // any unknown POSIX signal as SIGKILL-equivalent) and STILL
+    // returns `true` like a successful delivery. `safeKill` would
+    // then report `"ok"` → entry lands in `hotSwapTargets` → SPA
+    // shows "hot-swap" and skips restart, but the child is already
+    // dead. The Codex P1 fix gates the SIGUSR2 attempt behind
+    // `process.platform !== "win32"` so win32 routes straight to
+    // SIGTERM-restart, surfacing a real restart target the SPA can
+    // act on.
+    const originalPlatform = Object.getOwnPropertyDescriptor(
+      process,
+      "platform",
+    );
+    Object.defineProperty(process, "platform", {
+      value: "win32",
+      configurable: true,
+    });
+    try {
+      const reg = new TrainRegistry();
+      const a = fakeChild(951);
+      a.kill.mockReturnValue(true); // win32 reports success even for SIGUSR2
+      reg.register(a as unknown as ChildProcess, {
+        configHash: "match",
+        trainFile: "/tmp/win.ts",
+      });
+      const result = reg.dispatchRebuild("match");
+      // Restart bucket only: hot-swap is unsafe on win32 even
+      // when kill() reported "ok".
+      expect(result.hotSwapTargets).toEqual([]);
+      expect(result.restartTargets).toEqual([
+        { pid: 951, trainFile: "/tmp/win.ts" },
+      ]);
+      // SIGUSR2 was NEVER attempted: the platform gate skipped it
+      // entirely and went straight to the SIGTERM fallback path.
+      // (Without the gate, SIGUSR2 would have fired first and been
+      // misclassified as a successful hot-swap.)
+      expect(a.kill).toHaveBeenCalledTimes(1);
+      expect(a.kill).toHaveBeenCalledWith("SIGTERM");
+    } finally {
+      if (originalPlatform) {
+        Object.defineProperty(process, "platform", originalPlatform);
+      }
+    }
+  });
+
+  it("dispatchRebuild degrades to SIGTERM-restart when SIGUSR2 is unsupported (Windows)", () => {
+    // Regression: Node's win32 build doesn't deliver SIGUSR2 (it
+    // throws "ENOSYS" inside `child.kill('SIGUSR2')`). The previous
+    // implementation silently swallowed that throw, so on Windows a
+    // hash-match rebuild produced neither hot-swap nor restart and
+    // callback edits never landed. Now we degrade to a SIGTERM-driven
+    // restart so the new code does take effect, at the cost of a
+    // brief gap rather than an in-place swap.
+    const reg = new TrainRegistry();
+    const a = fakeChild(901);
+    a.kill.mockImplementation((sig?: string) => {
+      if (sig === "SIGUSR2") {
+        const err = new Error(
+          "kill ENOSYS",
+        ) as Error & { code?: string };
+        err.code = "ENOSYS";
+        throw err;
+      }
+      return true; // SIGTERM works
+    });
+    reg.register(a as unknown as ChildProcess, {
+      configHash: "match",
+      trainFile: "/tmp/win.ts",
+    });
+    const result = reg.dispatchRebuild("match");
+    // Must not appear in hot-swap (signal failed) but must appear in
+    // restart (fallback succeeded) so the SPA re-spawns once the
+    // exit message arrives.
+    expect(result.hotSwapTargets).toEqual([]);
+    expect(result.restartTargets).toEqual([
+      { pid: 901, trainFile: "/tmp/win.ts" },
+    ]);
+    // Both signals were attempted in order: SIGUSR2 → fallback SIGTERM.
+    expect(a.kill).toHaveBeenNthCalledWith(1, "SIGUSR2");
+    expect(a.kill).toHaveBeenNthCalledWith(2, "SIGTERM");
+  });
+});
diff --git a/packages/arkor/src/studio/trainRegistry.ts b/packages/arkor/src/studio/trainRegistry.ts
new file mode 100644
index 00000000..9286e98b
--- /dev/null
+++ b/packages/arkor/src/studio/trainRegistry.ts
@@ -0,0 +1,416 @@
+import type { ChildProcess } from "node:child_process";
+
+/**
+ * Per-active-train state tracked alongside the spawned `arkor start`
+ * subprocess. The Studio server records this at spawn time so HMR
+ * rebuilds can decide, per child, between:
+ *
+ *   - **SIGUSR2** (callback hot-swap) when the new bundle's `configHash`
+ *     matches the one captured at spawn time: the cloud-side run is
+ *     unaffected, only in-process callbacks need to update.
+ *   - **SIGTERM** (graceful early-stop + restart) when the configs
+ *     diverge: the runner's internal early-stop entry point lets the
+ *     next checkpoint finish, the subprocess exits, and the SPA
+ *     re-spawns with the rebuilt artefact.
+ */
+export interface ActiveTrain {
+  child: ChildProcess;
+  trainFile?: string;
+  /** Cloud-side config hash captured at spawn time (may be null if the
+   *  manifest wasn't inspectable yet, e.g. spawn raced an in-flight
+   *  build). A null entry forces SIGTERM on the next rebuild because we
+   *  can't prove the configs match. */
+  configHash: string | null;
+  /**
+   * Content hash (sha256, truncated; see `studio/hmr.ts`'s
+   * `contentHashOrNull`) of the on-disk `.arkor/build/index.mjs`
+   * at spawn time. Used **only** to gate the pre-ready-spawn
+   * backfill: if a rebuild eventually fires while `configHash` is
+   * still null and this content hash equals the rebuild's
+   * `event.contentHash`, the child is provably reading the same
+   * bundle bytes the new hash describes: safe to backfill
+   * `configHash` and skip dispatch. A mismatch (or null here)
+   * means the on-disk artefact has changed between spawn and
+   * rebuild (user edited mid-spawn, fresh project never built, …)
+   * so the child is running stale bytes and we MUST SIGTERM-restart
+   * to keep cloud-side `JobConfig` aligned with what the child
+   * actually loaded.
+   *
+   * Content-hash (vs the timestamp `mtime+ctime+size` shape used
+   * by `event.hash` for SSE dedup) avoids a false-positive
+   * mismatch when a watcher rebuild produces identical bytes:
+   * timestamps still bump, but content is the same and we
+   * shouldn't force a spurious cancel+restart cycle. Null when
+   * HMR isn't enabled or read failed.
+   */
+  spawnArtifactContentHash: string | null;
+  /**
+   * `true` once we've already SIGTERM'd this child for an HMR-driven
+   * early-stop. Subsequent rebuilds (which can land before the child
+   * has reached its next checkpoint) must NOT re-send SIGTERM:
+   * the runner's shutdown handler treats a second SIGTERM as the
+   * emergency `process.exit(143)` escape hatch, which would defeat
+   * the whole point of preserving the in-flight checkpoint. Kept
+   * internal to the registry; consumers shouldn't manage it.
+   */
+  earlyStopRequested?: boolean;
+  /**
+   * Cloud-side job id, captured by parsing the runner's
+   * `Started job <id>` stdout line shortly after spawn. Populated
+   * via `recordJobId(pid, id)` on the first matching chunk; null
+   * before that, or for runs whose stdout never carried that line
+   * (early spawn failure, custom user bins, etc.). The
+   * `/api/train` cancel handler reads this to fire a fire-and-forget
+   * `POST /v1/jobs/:id/cancel` before SIGKILLing the subprocess.
+   * SIGKILL bypasses the runner's `installShutdownHandlers`, so
+   * without this server-side cancel the cloud-side job would live
+   * until the cloud reaper / TTL fires (continued GPU spend).
+   */
+  jobId: string | null;
+  /**
+   * Cloud-api scope (org + project slugs) captured at spawn time
+   * from `.arkor/state.json`. Pinned on the registry entry so the
+   * `/api/train` cancel handler can address the cloud cancel POST
+   * without re-reading the filesystem at stop time. Without this
+   * pin, a user who deleted `.arkor/state.json` or made it unreadable
+   * mid-training would have their manual stop silently skip the
+   * cancel POST (state read returns null, handler bails) and
+   * the cloud job would orphan. Null when `/api/train` ran without
+   * state (auto-anonymous bootstrap failed, etc.); cancel POST is
+   * skipped then too, but the SIGKILL still tears down the local
+   * subprocess.
+   */
+  scope: { orgSlug: string; projectSlug: string } | null;
+}
+
+export interface RestartTarget {
+  pid: number;
+  trainFile?: string;
+}
+
+export interface DispatchResult {
+  /** Children whose callbacks were rotated in place via SIGUSR2. */
+  hotSwapTargets: RestartTarget[];
+  /**
+   * Children that were SIGTERM'd for graceful early-stop and need to
+   * be re-spawned by the SPA after the train stream emits its
+   * `exit=...` line. Includes both config-mismatch cases and
+   * config-match cases that fell back here because the platform
+   * doesn't support SIGUSR2 (Windows).
+   */
+  restartTargets: RestartTarget[];
+}
+
+/**
+ * Outcome of a single `child.kill(signal)` call.
+ *
+ * - `"ok"`: signal was delivered.
+ * - `"gone"`: process was already exited. Surfaces both as `kill`
+ *   returning `false` (Node's mapped form) and as a thrown `ESRCH`
+ *   (a race where the child exits between the `entries` lookup and
+ *   the `kill` call: POSIX `kill(2)` raises `ESRCH` for
+ *   non-existent PIDs and Node propagates it on some versions).
+ * - `"unsupported"`: any *other* `kill` throw, i.e. the signal
+ *   couldn't be delivered for a reason that isn't "process is gone".
+ *   The motivating case is the platform not supporting this signal
+ *   kind (Windows + `SIGUSR2` → `ENOSYS`; bad signal name →
+ *   `EINVAL`), which `dispatchRebuild` falls back to SIGTERM-restart
+ *   for. The bucket is intentionally a catch-all rather than a
+ *   whitelist of error codes: rare cases like `EPERM` (lost the
+ *   right to signal a re-parented child) and platform-specific
+ *   surprises take the same conservative fallback (try the next
+ *   signal, otherwise drop the entry), which is what callers want
+ *   from "kill failed for some non-recoverable reason".
+ */
+type KillResult = "ok" | "gone" | "unsupported";
+
+function safeKill(child: ChildProcess, signal: NodeJS.Signals): KillResult {
+  try {
+    return child.kill(signal) ? "ok" : "gone";
+  } catch (err) {
+    // `ESRCH` ("no such process") means the child already exited:
+    // semantically identical to `kill` returning `false`. Mis-classifying
+    // it as `"unsupported"` would route a hash-match hot-swap candidate
+    // into the SIGTERM fallback, which then also no-ops (also gone) but
+    // costs a needless second `kill` attempt against a dead PID before
+    // the close handler unregisters the child. Every other throw
+    // collapses into `"unsupported"` per the type doc above.
+    const code = (err as NodeJS.ErrnoException | null)?.code;
+    if (code === "ESRCH") return "gone";
+    return "unsupported";
+  }
+}
+
+/**
+ * Encapsulates the set of `/api/train`-spawned subprocesses and the
+ * signal-dispatch decision rule for HMR rebuilds. Pulled out of
+ * `buildStudioApp` so the policy is testable in isolation and so future
+ * additions (e.g. a `cancel-all` admin endpoint) have a clear seam.
+ */
+export class TrainRegistry {
+  private readonly entries = new Map<number, ActiveTrain>();
+
+  register(
+    child: ChildProcess,
+    init: Omit<
+      ActiveTrain,
+      | "child"
+      | "earlyStopRequested"
+      | "spawnArtifactContentHash"
+      | "jobId"
+      | "scope"
+    > & {
+      // Optional in the signature so tests / future callers that
+      // don't track the on-disk artefact content hash (e.g. an
+      // HMR-disabled server, a hand-rolled fake) can omit it.
+      // Defaults to `null`, which forces the pre-ready-spawn
+      // branch to fall through to SIGTERM-restart on the next
+      // non-null rebuild (the safe choice when we genuinely
+      // don't know what bytes the child loaded). Real `/api/train`
+      // calls in HMR mode capture this from
+      // `coordinator.getCurrentArtifactContentHash()`.
+      spawnArtifactContentHash?: string | null;
+      // Optional too: tests don't need scope for HMR-routing
+      // assertions. Real `/api/train` calls in production pass a
+      // non-null scope captured from `.arkor/state.json` so the
+      // cancel POST can address the cloud job without re-reading
+      // the filesystem at stop time.
+      scope?: { orgSlug: string; projectSlug: string } | null;
+    },
+  ): void {
+    if (typeof child.pid !== "number") return;
+    this.entries.set(child.pid, {
+      child,
+      ...init,
+      spawnArtifactContentHash: init.spawnArtifactContentHash ?? null,
+      scope: init.scope ?? null,
+      earlyStopRequested: false,
+      // `jobId` starts null; populated later by `recordJobId(pid,
+      // id)` when the server's stdout parser sees the runner's
+      // `Started job <id>` line. Tests that don't exercise the
+      // cancel-POST path can leave it null.
+      jobId: null,
+    });
+  }
+
+  unregister(pid: number | undefined): void {
+    if (typeof pid === "number") this.entries.delete(pid);
+  }
+
+  /**
+   * Record the cloud-side job id for an active child. Called by the
+   * server's `/api/train` stdout parser the first time it spots
+   * `Started job <id>` in the runner's output. Idempotent: a
+   * second call with the same pid + id is a no-op (the runner
+   * only prints the line once anyway). Unknown pids are silently
+   * dropped (the child may have already exited and unregistered).
+   */
+  recordJobId(pid: number | undefined, jobId: string): void {
+    if (typeof pid !== "number") return;
+    const entry = this.entries.get(pid);
+    if (!entry) return;
+    entry.jobId = jobId;
+  }
+
+  /**
+   * Read the recorded cloud-side job id for a pid. `/api/train`'s
+   * cancel handler consults this to POST `/v1/jobs/:id/cancel`
+   * before SIGKILLing the local subprocess; without that POST,
+   * a user-initiated stop would leave the cloud job running
+   * until TTL (the SIGKILL bypasses the runner's `installShutdownHandlers`
+   * so the runner can't issue cancel itself). Returns null when
+   * the pid is unknown or the runner hasn't printed its
+   * `Started job` line yet (early spawn failure, race against
+   * a fast cancel, custom user bins).
+   */
+  getJobId(pid: number | undefined): string | null {
+    if (typeof pid !== "number") return null;
+    return this.entries.get(pid)?.jobId ?? null;
+  }
+
+  /**
+   * Read the spawn-time cloud-api scope for a pid. Paired with
+   * `getJobId` by `/api/train`'s cancel handler to build the cloud
+   * cancel POST URL without re-reading `.arkor/state.json` at stop
+   * time: if the file was deleted or made unreadable mid-training,
+   * the read would return null and the cancel POST would silently
+   * skip, orphaning the cloud run. Captured at spawn time, immutable
+   * for the entry's lifetime.
+   */
+  getScope(
+    pid: number | undefined,
+  ): { orgSlug: string; projectSlug: string } | null {
+    if (typeof pid !== "number") return null;
+    return this.entries.get(pid)?.scope ?? null;
+  }
+
+  /**
+   * Whether `dispatchRebuild` has already issued a graceful-restart
+   * SIGTERM to this child as part of an HMR cycle. Consulted by
+   * `/api/train`'s ReadableStream `cancel()` handler so a client-
+   * driven cancel (tab close, navigation, aborted fetch) doesn't
+   * pile a second SIGTERM on top of an in-progress early-stop:
+   * the runner's `installShutdownHandlers` interprets a second
+   * SIGTERM as the emergency `exit(143)` fast-path, which bypasses
+   * the checkpoint-preserving early-stop + `cancel()` flow and
+   * leaves the cloud-side run live while the local subprocess
+   * dies, defeating the main safety goal of the HMR restart logic.
+   */
+  isEarlyStopRequested(pid: number | undefined): boolean {
+    if (typeof pid !== "number") return false;
+    return this.entries.get(pid)?.earlyStopRequested ?? false;
+  }
+
+  get size(): number {
+    return this.entries.size;
+  }
+
+  /** Read-only snapshot, mostly for tests / observability. */
+  list(): ReadonlyArray<ActiveTrain> {
+    return [...this.entries.values()];
+  }
+
+  /**
+   * Single entry point for HMR rebuilds: per active child, decide
+   * between callback hot-swap (SIGUSR2) and graceful restart
+   * (SIGTERM), apply the signal, and report which children landed in
+   * each bucket so the SPA can update its UI / re-spawn restarted
+   * runs.
+   *
+   * Combines what was previously `notifyCallbackReload` +
+   * `requestEarlyStopOnMismatch` into one pass so the per-child
+   * decision is atomic: important because the hot-swap path can
+   * gracefully degrade into the restart path on platforms (Windows)
+   * where SIGUSR2 isn't supported, which is hard to express across
+   * two separate iterations of the registry.
+   *
+   * Re-signal protection: children already flagged
+   * `earlyStopRequested` are skipped entirely. The flag is cleared
+   * naturally when the child exits and is unregistered.
+   *
+   * Defensive corner cases:
+   *   - `kill()` returns `false` (process already exited) → drop
+   *     from the targets list, the registry's close handler will
+   *     unregister it.
+   *   - `kill("SIGUSR2")` throws on Windows → degrade to SIGTERM so
+   *     callback edits still take effect (via a full restart) rather
+   *     than silently being ignored.
+   */
+  dispatchRebuild(
+    nextConfigHash: string | null,
+    // Content hash (sha256-derived; see `studio/hmr.ts`) of the
+    // freshly-built artefact, paired with `entry.spawnArtifactContentHash`
+    // for the pre-ready-spawn equality gate. Defaults to `null` so
+    // tests / pre-existing callers that don't pass a hash get the
+    // conservative behaviour: a null entry hash falls through to
+    // SIGTERM-restart. Real dispatch from `/api/train`'s HMR
+    // subscriber threads `event.contentHash` here so the backfill
+    // optimisation activates only when the child's loaded bytes
+    // genuinely match.
+    nextArtifactContentHash: string | null = null,
+  ): DispatchResult {
+    const hotSwapTargets: RestartTarget[] = [];
+    const restartTargets: RestartTarget[] = [];
+
+    for (const [pid, entry] of this.entries) {
+      if (entry.earlyStopRequested) continue;
+      const target: RestartTarget = { pid, trainFile: entry.trainFile };
+      // Pre-ready spawn: this child was registered via `/api/train`
+      // *before* the HMR watcher's first successful build, so its
+      // recorded `configHash` is `null`. Whether the rebuild's new
+      // hash describes the same bytes the child actually loaded
+      // depends on whether the on-disk artefact has changed between
+      // spawn and now. Tie the decision to the artefact content
+      // hash:
+      //
+      //   - `entry.spawnArtifactContentHash === nextArtifactContentHash`
+      //     → child read the same bytes the new hash describes.
+      //     Safe to backfill `configHash`; future rebuilds compare
+      //     against the backfilled value like any other child. This
+      //     is the common case (user clicked Run before the SPA had
+      //     refreshed its manifest, but the on-disk artefact is the
+      //     same one the watcher just settled on).
+      //
+      //   - content hashes differ (or one side is null) → the bytes
+      //     the child loaded don't match the new hash. SIGTERM-restart
+      //     so the cloud-side `JobConfig` and the child's actual
+      //     config are guaranteed to align. Without this gate, an
+      //     edit landing between spawn and the first BUNDLE_END would
+      //     silently teach the registry to use the post-edit hash as
+      //     the child's baseline; later same-hash rebuilds would
+      //     then hot-swap callbacks into a child whose cloud-side
+      //     `JobConfig` was *actually* spawned against an older
+      //     version, leaving the cloud run on a stale config.
+      const isPreReadySpawn =
+        entry.configHash === null && nextConfigHash !== null;
+      if (isPreReadySpawn) {
+        const artefactsAgree =
+          entry.spawnArtifactContentHash !== null &&
+          nextArtifactContentHash !== null &&
+          entry.spawnArtifactContentHash === nextArtifactContentHash;
+        if (artefactsAgree) {
+          entry.configHash = nextConfigHash;
+          continue;
+        }
+        // fall through to the mismatch / SIGTERM-restart path below
+      }
+      const matches =
+        nextConfigHash !== null &&
+        entry.configHash !== null &&
+        entry.configHash === nextConfigHash;
+
+      if (matches) {
+        // On Windows, Node's `child.kill(signal)` for any unknown
+        // POSIX signal (including SIGUSR2) is documented to
+        // **forcefully terminate** the process (same effect as
+        // SIGKILL), and `kill()` returns `true` like a successful
+        // delivery. `safeKill` would then report `"ok"`, the entry
+        // would land in `hotSwapTargets`, and the SPA would never
+        // schedule a restart even though the child is *dead*. Skip
+        // the SIGUSR2 attempt on win32 entirely and route directly
+        // to the SIGTERM-restart path so the SPA learns about the
+        // pending restart and re-spawns when the exit line arrives.
+        // The user-visible outcome (callbacks reload after a brief
+        // restart) matches the design intent on platforms where
+        // the in-place hot-swap simply isn't available.
+        if (process.platform !== "win32") {
+          const r = safeKill(entry.child, "SIGUSR2");
+          if (r === "ok") {
+            hotSwapTargets.push(target);
+            continue;
+          }
+          if (r === "gone") {
+            // Child already exited; close handler will unregister.
+            continue;
+          }
+          // Cross-platform safety net: SIGUSR2 reported `"unsupported"`
+          // on a non-win32 platform (rare: `ENOSYS` from libuv signal
+          // wrap on exotic builds, future Node versions removing the
+          // signal, etc.). Same fallback as the win32 skip above:
+          // route to SIGTERM-restart so callback edits still take
+          // effect via a full restart instead of silently being
+          // ignored.
+        }
+        const fallback = safeKill(entry.child, "SIGTERM");
+        if (fallback === "ok") {
+          entry.earlyStopRequested = true;
+          restartTargets.push(target);
+        }
+        // "gone" / "unsupported" again → drop silently; the close
+        // handler (or operator-driven restart) will recover.
+        continue;
+      }
+
+      // Hash mismatch (or one side is null): graceful restart.
+      const r = safeKill(entry.child, "SIGTERM");
+      if (r === "ok") {
+        entry.earlyStopRequested = true;
+        restartTargets.push(target);
+      }
+      // "gone": child already exited, drop. "unsupported": can't
+      // happen for SIGTERM on supported platforms; drop defensively.
+    }
+
+    return { hotSwapTargets, restartTargets };
+  }
+}
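The per-child routing in `dispatchRebuild` above can be summarised as a pure decision function. What follows is a minimal sketch and not code from this change: `decide`, `ChildState`, `Rebuild`, and the `Decision` labels are hypothetical names, the platform is passed in as a string instead of being read from `process.platform`, and both the `earlyStopRequested` skip and the `kill()` outcomes (`"gone"`, `"unsupported"`) are deliberately left out.

```typescript
// Hypothetical restatement of the routing rule; only the hash comparison
// and the win32 gate from `dispatchRebuild` are modelled here.
type Decision = "backfill" | "hot-swap" | "restart";

interface ChildState {
  configHash: string | null;
  spawnArtifactContentHash: string | null;
}

interface Rebuild {
  configHash: string | null;
  contentHash: string | null;
}

function decide(child: ChildState, rebuild: Rebuild, platform: string): Decision {
  // Pre-ready spawn: the child has no recorded config hash yet.
  if (child.configHash === null && rebuild.configHash !== null) {
    const sameBytes =
      child.spawnArtifactContentHash !== null &&
      child.spawnArtifactContentHash === rebuild.contentHash;
    // Same bytes on disk: adopt the new hash and send no signal.
    // Otherwise the child may be running stale bytes: restart it.
    return sameBytes ? "backfill" : "restart";
  }
  const matches =
    child.configHash !== null && child.configHash === rebuild.configHash;
  // Matching config: try SIGUSR2, except on win32 where the attempt is skipped.
  if (matches && platform !== "win32") return "hot-swap";
  // Mismatch, null on either side, or win32: SIGTERM-driven restart.
  return "restart";
}

const running: ChildState = { configHash: "h1", spawnArtifactContentHash: "c1" };
console.log(decide(running, { configHash: "h1", contentHash: "c2" }, "linux")); // hot-swap
console.log(decide(running, { configHash: "h1", contentHash: "c2" }, "win32")); // restart
console.log(decide(running, { configHash: "h2", contentHash: "c2" }, "linux")); // restart
```

In the real method a `"hot-swap"` decision can still degrade to a restart when `safeKill` reports `"unsupported"`, and either signal is dropped without bookkeeping when the child is already gone.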
diff --git a/packages/cli-internal/src/templates.ts b/packages/cli-internal/src/templates.ts
index a250fd62..1d8e94fc 100644
--- a/packages/cli-internal/src/templates.ts
+++ b/packages/cli-internal/src/templates.ts
@@ -1,13 +1,13 @@
 /**
  * Starter templates written out by `create-arkor` / `arkor init`.
- * Single source of truth — both consumers bundle this module at build time.
+ * Single source of truth: both consumers bundle this module at build time.
  *
  * Layout written to disk:
  *
  *   src/arkor/index.ts    ← entry-point manifest (`createArkor({ trainer })`)
  *   src/arkor/trainer.ts  ← per-template trainer (`createTrainer({...})`)
  *
- * `index.ts` is identical across templates — only the trainer body differs.
+ * `index.ts` is identical across templates; only the trainer body differs.
  */
 export type TemplateId = "redaction" | "translate" | "triage";
 
@@ -179,7 +179,7 @@ ${ONLOG_BODY}
 });
 `;
 
-// Order is significant — `templateChoices()` preserves insertion order so the
+// Order is significant: `templateChoices()` preserves insertion order so the
 // CLI prompt lists demos first (sorted by estimated training time).
 //
 // Estimated training times assume A100 80GB on Runpod Serverless with the
@@ -204,7 +204,7 @@ export const TEMPLATES: Record<TemplateId, Template> = {
 };
 
 /**
- * Body of `src/arkor/index.ts` — identical across templates. The `createArkor`
+ * Body of `src/arkor/index.ts`: identical across templates. The `createArkor`
  * factory is what `arkor build` / Studio discovers; per-role primitives
  * (`trainer`, future `deploy`, `eval`) live in sibling files and get gathered
  * here.
@@ -215,7 +215,7 @@ import { trainer } from "./trainer";
 export const arkor = createArkor({ trainer });
 `;
 
-export const STARTER_CONFIG = `// Placeholder for future project-level config — the runtime does not read
+export const STARTER_CONFIG = `// Placeholder for future project-level config: the runtime does not read
 // fields from this file yet. Training settings (\`maxSteps\`, \`lora\`, etc.)
 // live on the Trainer in src/arkor/trainer.ts. Project routing
 // (orgSlug / projectSlug) is tracked automatically in .arkor/state.json.
@@ -241,7 +241,7 @@ An arkor training project scaffolded by \`create-arkor\`.
 
 The \`dev\` / \`build\` / \`start\` package scripts forward to the matching
 \`arkor\` subcommands, so the script form works across every package
-manager (\`npm\` does not run package binaries via \`npm <bin>\` — use
+manager (\`npm\` does not run package binaries via \`npm <bin>\`; use
 \`npm run <script>\` or \`npx arkor <subcommand>\`).
 
 \`\`\`
@@ -253,7 +253,7 @@ npm install && npm run dev
 
 \`arkor dev\` opens the local Studio GUI (most workflows live there).
 
-Optional — log in to your own org instead of using anonymous tokens:
+Optional: log in to your own org instead of using anonymous tokens:
 
 \`\`\`
 npx arkor login
@@ -268,18 +268,18 @@ npm run start    # runs the build artifact on the cloud
 
 ## Files
 
-- \`src/arkor/index.ts\` — entry-point manifest (\`createArkor({ trainer })\`).
+- \`src/arkor/index.ts\`: entry-point manifest (\`createArkor({ trainer })\`).
   This is what the CLI and Studio discover.
-- \`src/arkor/trainer.ts\` — your trainer (\`createTrainer({...})\`). Training
+- \`src/arkor/trainer.ts\`: your trainer (\`createTrainer({...})\`). Training
   settings (\`maxSteps\`, \`lora\`, etc.) live on the Trainer itself. Add
   sibling files for future primitives (\`deploy.ts\`, \`eval.ts\`) and
   register them in the \`createArkor\` call.
-- \`arkor.config.ts\` — placeholder for future project-level config. The
+- \`arkor.config.ts\`: placeholder for future project-level config. The
   runtime does not read fields from this file yet. Project routing lives
   in \`.arkor/state.json\`, managed by the CLI.${
     options.agentsMd
       ? `
-- \`AGENTS.md\` / \`CLAUDE.md\` — instructions for AI coding agents,
+- \`AGENTS.md\` / \`CLAUDE.md\`: instructions for AI coding agents,
   briefing them that arkor post-dates their training data. When the
   scaffolder creates \`CLAUDE.md\` itself it is a one-liner that imports
   \`AGENTS.md\` via the Claude Code \`@<path>\` directive; if you already
diff --git a/packages/create-arkor/package.json b/packages/create-arkor/package.json
index b09a9715..d9ae3d7c 100644
--- a/packages/create-arkor/package.json
+++ b/packages/create-arkor/package.json
@@ -43,7 +43,7 @@
     "@arkor/cli-internal": "workspace:*",
     "@types/node": "^24",
     "@vitest/coverage-v8": "^4.1.5",
-    "tsdown": "^0.21.9",
+    "tsdown": "^0.22.0",
     "typescript": "^5",
     "vitest": "^4.1.5"
   },
diff --git a/packages/studio-app/src/components/RunTraining.tsx b/packages/studio-app/src/components/RunTraining.tsx
index eee5094c..d2fea8f4 100644
--- a/packages/studio-app/src/components/RunTraining.tsx
+++ b/packages/studio-app/src/components/RunTraining.tsx
@@ -1,7 +1,10 @@
 import { useEffect, useRef, useState } from "react";
 import {
   fetchManifest,
+  isHmrEnabled,
+  openDevEvents,
   streamTraining,
+  type DevEvent,
   type ManifestResult,
 } from "../lib/api";
 import { Play, StopCircle } from "./icons";
@@ -22,14 +25,102 @@ export function RunTraining() {
   const [running, setRunning] = useState(false);
   const [log, setLog] = useState("");
   const [manifest, setManifest] = useState<ManifestResult | null>(null);
+  const [hmrStatus, setHmrStatus] = useState<
+    "idle" | "early-stopping" | "restarting" | "hot-swapped"
+  >("idle");
   const boxRef = useRef<HTMLPreElement>(null);
   // Tracked separately from the manifest poll so navigating away
   // from Overview mid-stream tears the training stream down without
   // touching the (always-running) manifest poll.
   const trainingAbortRef = useRef<AbortController | null>(null);
+  // HMR auto-restart bookkeeping:
+  //  - lastTrainFileRef: carries the same `file?` arg into the auto
+  //    re-spawn.
+  //  - restartPendingRef: latch the SSE listener trips ONLY when *this
+  //    tab's* current child landed in `restartTargets`. Without the
+  //    pid scope, a tab whose run was hot-swapped (other tab's child
+  //    in `restartTargets`) would still latch on the broadcast and
+  //    auto-spawn a duplicate job after its own run completes.
+  //  - runningRef: short-circuit for tabs not running anything.
+  //  - currentPidRef: the spawned child's pid for the run currently
+  //    in flight, set from the `X-Arkor-Train-Pid` response header.
+  //  - hotSwapTimerRef: id for the "hot-swapped" status auto-clear
+  //    timer so unmount-during-flash doesn't leak (or trigger a
+  //    setState-after-unmount warning).
+  const lastTrainFileRef = useRef<string | undefined>(undefined);
+  const restartPendingRef = useRef(false);
+  const runningRef = useRef(false);
+  const currentPidRef = useRef<number | null>(null);
+  // Browser `window.setTimeout` returns a numeric handle, not Node's
+  // `Timeout` object; explicit `number` so TS doesn't pick up the
+  // Node typing from the global `setTimeout`.
+  const hotSwapTimerRef = useRef<number | null>(null);
+  // SSE events that arrived during the startup window: after `run()`
+  // set `runningRef.current = true` but before `streamTraining`'s
+  // `onSpawn` populated `currentPidRef`. The per-pid filter would
+  // otherwise drop any HMR dispatch landing in this window because
+  // `myPid === null`, leaving the user on stale code: a config-
+  // changing rebuild fires immediately after the Run click → server
+  // SIGTERMs the just-started child → exit reaches us → no auto-
+  // restart latch. Buffer here, drain in `onSpawn` once we know our
+  // pid so the per-pid decision can run retroactively. Cleared in
+  // `run()`'s post-try cleanup (and on unmount) so a failed spawn
+  // doesn't leak entries into the next run.
+  const pendingPreSpawnEventsRef = useRef<DevEvent[]>([]);
+  // Grace window after the train stream closes during which the SSE
+  // handler can still latch a *late* restart event onto our just-
+  // exited child. The `/api/train` stream and `/api/dev/events` SSE
+  // are independent connections: under the race where the child
+  // exits before the matching `rebuild` event lands on the SSE
+  // channel (fast child exit, network jitter), `run()`'s finally
+  // would synchronously settle "no restart" and the user would be
+  // left on stale code despite the server-side SIGTERM. This timer
+  // defers the no-restart decision and keeps `currentPidRef` set so
+  // the SSE handler can still match per-pid; if the late event
+  // arrives within the window it sets `restartPendingRef` and the
+  // timer's callback fires the auto-restart from there. The window
+  // is short (250 ms), well under user perception for
+  // the no-restart outcome but long enough to absorb realistic
+  // cross-connection delivery skew.
+  const restartGraceTimerRef = useRef<number | null>(null);
+  // Tracks "is this React tree still mounted?". The HMR auto-restart
+  // path schedules `queueMicrotask(() => run(...))` after the prior
+  // run's `finally`. Without this gate, navigating away during the
+  // tiny window between scheduling and the microtask running would
+  // fire a fresh `/api/train` POST from an unmounted view, spawning
+  // an invisible cloud job the user can't see or stop.
+  const isMountedRef = useRef(true);
 
   useEffect(() => {
-    return () => trainingAbortRef.current?.abort();
+    // Re-arm the mounted flag every time the effect (re-)runs.
+    // React StrictMode (enabled in `main.tsx` for dev) intentionally
+    // runs setup → cleanup → setup once on mount to surface
+    // ordering bugs; without this re-arm the cleanup's
+    // `isMountedRef.current = false` would persist into the second
+    // setup, making the ref permanently false while the component
+    // is actually mounted. The HMR auto-restart paths guarded by
+    // `isMountedRef.current` would then silently no-op in every
+    // Vite dev session even though they work fine in `vite build`
+    // output (StrictMode's double-effect is a dev-only behaviour).
+    isMountedRef.current = true;
+    return () => {
+      isMountedRef.current = false;
+      // Defense in depth: clearing the latch here means even if a
+      // microtask snuck past the `isMountedRef` check (concurrent
+      // edits to React's effect ordering, future refactors), it
+      // still finds nothing pending.
+      restartPendingRef.current = false;
+      pendingPreSpawnEventsRef.current = [];
+      trainingAbortRef.current?.abort();
+      if (hotSwapTimerRef.current !== null) {
+        clearTimeout(hotSwapTimerRef.current);
+        hotSwapTimerRef.current = null;
+      }
+      if (restartGraceTimerRef.current !== null) {
+        clearTimeout(restartGraceTimerRef.current);
+        restartGraceTimerRef.current = null;
+      }
+    };
   }, []);
 
   useEffect(() => {
@@ -60,11 +151,164 @@ export function RunTraining() {
     };
   }, []);
 
+  // HMR: listen for rebuild notifications from `arkor dev` and refresh
+  // the manifest. When a rebuild also early-stopped *this tab's*
+  // training run, the server includes the spawned pid in
+  // `restartTargets`; defer the auto re-invocation until the current
+  // `streamTraining` resolves so we don't run two cloud jobs at once.
+  //
+  // Gated by `isHmrEnabled()` (server-injected `<meta>` flag) rather
+  // than `import.meta.env.DEV`: the SPA is shipped via `vite build`
+  // and served by `arkor dev` as static assets, so DEV is `false` in
+  // every real session. The server-side flag is `true` exactly when
+  // `arkor dev` wired in an HMR coordinator, i.e. when
+  // `/api/dev/events` actually exists. Without this flag the
+  // EventSource would either be dead in real dev sessions (DEV gate)
+  // or retry forever against a 404 (no gate).
+  useEffect(() => {
+    if (!isHmrEnabled()) return;
+    const es = openDevEvents();
+    // Typed as `Event` (not `MessageEvent`) because the same handler
+    // is registered for the `error` event, which EventSource fires
+    // as a plain `Event` on connection failures (server crashed,
+    // browser dropped the SSE); those carry no `.data`. Custom
+    // server-sent events (`event: ready` / `event: rebuild` / the
+    // SSE `event: error` frame the HMR server emits) all arrive as
+    // `MessageEvent` instances, so we narrow before reading
+    // `.data`. EventSource will auto-retry connection failures, so
+    // there's nothing to do for them other than not crash.
+    const onMessage = (raw: Event) => {
+      if (!(raw instanceof MessageEvent)) return;
+      let payload: DevEvent;
+      try {
+        payload = JSON.parse(raw.data) as DevEvent;
+      } catch {
+        return;
+      }
+      if (payload.type === "error") {
+        setManifest({ error: payload.message ?? "Build failed" });
+        setHmrStatus("idle");
+        // Cancel any pending HMR auto-restart latched from a
+        // previous successful rebuild. Without this, a sequence
+        // like (rebuild → restartPendingRef=true → user breaks
+        // the source → error event → child eventually exits) would
+        // hit `run()`'s finally branch, see the still-set latch,
+        // and auto-restart from the **previous** artefact even
+        // though the latest source state is broken: silent
+        // stale-code background churn until the user notices.
+        // Clearing here makes the user's broken-state edit the
+        // source of truth: no auto-restart fires until the next
+        // successful rebuild re-arms the latch.
+        restartPendingRef.current = false;
+        // Same hazard via the pre-spawn buffer: a `rebuild` event
+        // that landed before `onSpawn` populated the pid is parked
+        // in `pendingPreSpawnEventsRef`. If the user then breaks
+        // the source and the next event is `error`, `onSpawn`'s
+        // later drain would still find the stale restart target
+        // and latch `restartPendingRef = true` → auto-restart
+        // against the broken state. Drop the buffer alongside the
+        // latch so the error event is the new source of truth.
+        pendingPreSpawnEventsRef.current = [];
+        return;
+      }
+      // Always refresh the manifest on ready/rebuild.
+      void fetchManifest()
+        .then(setManifest)
+        .catch((err: unknown) => {
+          setManifest({
+            error: err instanceof Error ? err.message : String(err),
+          });
+        });
+      // Per-child decision based on the spawned pid: a single rebuild
+      // can produce mixed outcomes (one child hot-swapped, another
+      // restarted), and `payload.restart` / `payload.hotSwap` reflect
+      // *aggregate* truth across all active children. Filter to "did
+      // *my* child land in this bucket?" so a tab whose run was
+      // hot-swapped doesn't latch onto a sibling tab's restart.
+      const myPid = currentPidRef.current;
+      // Pre-spawn race: if we've started a run but `onSpawn` hasn't
+      // populated our pid yet, the dispatch result for our own child
+      // would be silently ignored. Stash the payload and let
+      // `onSpawn` re-run the per-pid decision once the pid arrives.
+      if (myPid === null && runningRef.current) {
+        pendingPreSpawnEventsRef.current.push(payload);
+        return;
+      }
+      // Don't gate `myRestart` on `runningRef.current`: the
+      // `/api/train` stream and `/api/dev/events` SSE channel are
+      // independent connections, so a fast child exit can race the
+      // SSE delivery and flip `runningRef` to false JUST BEFORE the
+      // matching `rebuild` event lands here. Per-pid filtering via
+      // `currentPidRef` is what scopes the latch to *this tab's*
+      // child; `run()`'s finally keeps `currentPidRef` set during a
+      // brief grace window after the train stream closes for
+      // exactly this reason. Without dropping the `runningRef`
+      // gate, post-exit restart events would silently no-op and
+      // leave the tab on stale code.
+      const myRestart =
+        myPid !== null &&
+        (payload.restartTargets?.some((t) => t.pid === myPid) ?? false);
+      const myHotSwap =
+        myPid !== null &&
+        (payload.hotSwapTargets?.some((t) => t.pid === myPid) ?? false);
+      if (myRestart) {
+        // Training run is early-stopping; the active stream will
+        // resolve once the next checkpoint lands and the subprocess
+        // exits cleanly. The post-try cleanup in `run()` picks up
+        // the pending flag and re-spawns with the same args.
+        restartPendingRef.current = true;
+        setHmrStatus("early-stopping");
+      } else if (myHotSwap) {
+        // Callbacks were swapped in place; the cloud-side run is
+        // unaffected. Flash a brief "hot-swapped" indicator so users
+        // know the new code is live. The previous timer (if any) is
+        // cleared so two close-together rebuilds don't race for the
+        // status reset.
+        setHmrStatus("hot-swapped");
+        if (hotSwapTimerRef.current !== null) {
+          clearTimeout(hotSwapTimerRef.current);
+        }
+        hotSwapTimerRef.current = window.setTimeout(() => {
+          setHmrStatus((s) => (s === "hot-swapped" ? "idle" : s));
+          hotSwapTimerRef.current = null;
+        }, 1500);
+      } else {
+        // Nothing pertaining to this tab's child. Don't blanket-
+        // reset `hmrStatus` here: another tab's `early-stopping` /
+        // `restarting` event is irrelevant to us, but so is a
+        // stale "hot-swapped" flash whose 1.5 s timer is still
+        // running. Reset ONLY the timer-driven "hot-swapped" label
+        // (using the same predicate as the timer body above)
+        // so it can't get stuck if a sibling rebuild lands
+        // mid-flash; the user-driven "early-stopping" /
+        // "restarting" spans clear themselves via `run()`'s
+        // `finally` / `onSpawn` transitions and shouldn't be
+        // preempted by an out-of-band sibling event.
+        if (!runningRef.current) {
+          setHmrStatus((s) => (s === "hot-swapped" ? "idle" : s));
+        }
+      }
+    };
+    es.addEventListener("ready", onMessage);
+    es.addEventListener("rebuild", onMessage);
+    es.addEventListener("error", onMessage);
+    return () => {
+      es.close();
+    };
+  }, []);
+
   useEffect(() => {
     if (boxRef.current) boxRef.current.scrollTop = boxRef.current.scrollHeight;
   }, [log]);
 
-  async function run() {
+  async function run(file?: string): Promise<void> {
+    runningRef.current = true;
+    lastTrainFileRef.current = file;
+    // Reset the pid before each spawn so a stale value from a prior
+    // run can't accidentally match a new HMR restart broadcast in the
+    // window between this assignment and `streamTraining` invoking
+    // `onSpawn`.
+    currentPidRef.current = null;
     setRunning(true);
     setLog("");
     const ac = new AbortController();
@@ -76,29 +320,170 @@ export function RunTraining() {
           if (ac.signal.aborted) return;
           setLog((prev) => appendCapped(prev, chunk));
         },
-        undefined,
+        file,
         ac.signal,
+        (pid) => {
+          currentPidRef.current = pid;
+          // Clear the "Restarting with updated code…" status as soon
+          // as the new run starts spawning. Without this the label
+          // stays pinned for the entire restarted run because
+          // `setHmrStatus("restarting")` is set in the *prior* run's
+          // `finally` and nothing else clears it. We only knock out
+          // "restarting" specifically; "early-stopping" / "hot-
+          // swapped" should land via their own state transitions.
+          setHmrStatus((s) => (s === "restarting" ? "idle" : s));
+          // Drain any HMR events that landed in the pre-spawn race
+          // window. Apply the same per-pid decision retroactively now
+          // that the pid is known. Restart wins over hot-swap (a
+          // stale child got SIGTERM'd → must re-spawn), so collapse
+          // the buffer's findings into a single decision rather than
+          // dispatching every buffered event verbatim.
+          const buffered = pendingPreSpawnEventsRef.current;
+          pendingPreSpawnEventsRef.current = [];
+          let restartHit = false;
+          let hotSwapHit = false;
+          for (const ev of buffered) {
+            if (ev.restartTargets?.some((t) => t.pid === pid)) {
+              restartHit = true;
+              break;
+            }
+            if (ev.hotSwapTargets?.some((t) => t.pid === pid)) {
+              hotSwapHit = true;
+            }
+          }
+          if (restartHit) {
+            restartPendingRef.current = true;
+            setHmrStatus("early-stopping");
+          } else if (hotSwapHit) {
+            setHmrStatus("hot-swapped");
+            if (hotSwapTimerRef.current !== null) {
+              clearTimeout(hotSwapTimerRef.current);
+            }
+            hotSwapTimerRef.current = window.setTimeout(() => {
+              setHmrStatus((s) => (s === "hot-swapped" ? "idle" : s));
+              hotSwapTimerRef.current = null;
+            }, 1500);
+          }
+        },
       );
     } catch (err) {
       // Aborts are expected when the user navigates away mid-stream;
-      // don't surface them as errors in the log.
-      if (ac.signal.aborted) return;
-      setLog((prev) =>
-        appendCapped(
-          prev,
-          `\n[error] ${err instanceof Error ? err.message : String(err)}\n`,
-        ),
-      );
-    } finally {
-      if (trainingAbortRef.current === ac) trainingAbortRef.current = null;
-      // Always release the running flag, including the user-initiated
-      // abort path. setState on an already-unmounted component is a
-      // no-op in React 18+, so the unmount-cleanup case handles itself.
-      setRunning(false);
+      // don't surface them as errors in the log. Non-abort errors
+      // get logged; in both cases the post-handler block below
+      // (formerly a `finally`) runs the same cleanup. Lint's
+      // `no-unsafe-finally` rule disallows `return` inside `finally`
+      // because such returns clobber a re-throw from the surrounding
+      // `catch`; we sidestep the rule by lifting the cleanup into
+      // a normal post-try/catch sequence, which has the same effect
+      // here because this `catch` block never re-throws.
+      if (!ac.signal.aborted) {
+        setLog((prev) =>
+          appendCapped(
+            prev,
+            `\n[error] ${err instanceof Error ? err.message : String(err)}\n`,
+          ),
+        );
+      }
     }
+    runningRef.current = false;
+    // DO NOT null `currentPidRef` here: the SSE handler needs to
+    // be able to match per-pid during the post-exit grace window
+    // below to catch a `rebuild` event that races behind the
+    // train stream's close on the separate connection. Captured
+    // here so the grace timer can detect "a new run started
+    // during the window" by comparing the current ref against
+    // `pidAtExit` and skipping its cleanup in that case.
+    const pidAtExit = currentPidRef.current;
+    // Drop any pre-spawn buffer entries that survived a failed
+    // run (spawn errored before `onSpawn` could drain). Without
+    // this they'd be carried into the next run and falsely match
+    // the new pid only by luck.
+    pendingPreSpawnEventsRef.current = [];
+    if (trainingAbortRef.current === ac) trainingAbortRef.current = null;
+    // Always release the running flag, including the user-initiated
+    // abort path. setState on an already-unmounted component is a
+    // no-op in React 18+, so the unmount-cleanup case handles itself.
+    setRunning(false);
+
+    if (ac.signal.aborted) {
+      // User Stop wins over any pending or in-flight HMR restart:
+      // clear everything synchronously and skip the grace window
+      // so the tab really settles instead of bouncing back up.
+      restartPendingRef.current = false;
+      currentPidRef.current = null;
+      setHmrStatus("idle");
+      if (restartGraceTimerRef.current !== null) {
+        clearTimeout(restartGraceTimerRef.current);
+        restartGraceTimerRef.current = null;
+      }
+      return;
+    }
+
+    if (restartPendingRef.current) {
+      // Fast path: SSE event already landed before exit. Queue the
+      // restart on the next microtask without waiting for the grace
+      // window so the common case has no perceptible delay.
+      restartPendingRef.current = false;
+      currentPidRef.current = null;
+      setHmrStatus("restarting");
+      const fileForRestart = lastTrainFileRef.current;
+      queueMicrotask(() => {
+        // Don't auto-spawn a fresh /api/train request from an
+        // unmounted view: the user navigated away in the small
+        // window between scheduling and running this microtask, so
+        // their intent was "stop interacting with this view", not
+        // "kick off another cloud job invisibly". The unmount
+        // cleanup also clears `restartPendingRef` defensively.
+        if (!isMountedRef.current) return;
+        void run(fileForRestart);
+      });
+      return;
+    }
+
+    // Slow path: SSE rebuild event might still be in flight on a
+    // separate connection. Defer the "no restart" decision so the
+    // SSE handler has time to land and flip `restartPendingRef`.
+    // `currentPidRef` stays set for the grace window so that
+    // late event can still match per-pid.
+    //
+    // Skip the grace window entirely when HMR isn't enabled:
+    // without an `/api/dev/events` subscription nothing can ever
+    // flip `restartPendingRef`, so the 250 ms timer + closure
+    // would do no useful work and delay `setHmrStatus("idle")`
+    // for no benefit. `isHmrEnabled()` reads the same
+    // server-injected meta the SSE effect above gates on, so this
+    // mirrors that condition exactly.
+    if (!isHmrEnabled()) {
+      currentPidRef.current = null;
+      setHmrStatus("idle");
+      return;
+    }
+    if (restartGraceTimerRef.current !== null) {
+      clearTimeout(restartGraceTimerRef.current);
+    }
+    restartGraceTimerRef.current = window.setTimeout(() => {
+      restartGraceTimerRef.current = null;
+      // A new run started during the window (overwrote the pid).
+      // Leave its lifecycle alone; its own finally will manage
+      // the cleanup eventually.
+      if (currentPidRef.current !== pidAtExit) return;
+      currentPidRef.current = null;
+      if (!isMountedRef.current) return;
+      if (restartPendingRef.current) {
+        restartPendingRef.current = false;
+        setHmrStatus("restarting");
+        const fileForRestart = lastTrainFileRef.current;
+        void run(fileForRestart);
+      } else {
+        setHmrStatus("idle");
+      }
+    }, 250);
   }
 
   function stop() {
+    // A user Stop click also cancels any pending HMR auto-restart so
+    // the run finally settles instead of bouncing back up.
+    restartPendingRef.current = false;
     trainingAbortRef.current?.abort();
   }
 
@@ -158,10 +543,10 @@ export function RunTraining() {
         </div>
         <Button
           // While `running`, the same button doubles as the abort
-          // affordance — clicking aborts the in-flight stream so the
+          // affordance: clicking aborts the in-flight stream so the
           // visible StopCircle icon actually does what the user
           // expects. When idle, it kicks off a new run.
-          onClick={running ? stop : run}
+          onClick={running ? stop : () => run(lastTrainFileRef.current)}
           disabled={!running && !hasTrainer}
           leadingIcon={running ? <StopCircle /> : <Play />}
         >
@@ -173,6 +558,15 @@ export function RunTraining() {
         </Button>
       </div>
 
+      {hmrStatus !== "idle" && (
+        <div className="text-xs text-zinc-500 dark:text-zinc-400">
+          {hmrStatus === "early-stopping" && "Stopping at next checkpoint…"}
+          {hmrStatus === "restarting" && "Restarting with updated code…"}
+          {hmrStatus === "hot-swapped" &&
+            "Callbacks hot-swapped: run continues."}
+        </div>
+      )}
+
       {(running || log) && (
         <pre
           ref={boxRef}
diff --git a/packages/studio-app/src/lib/api.test.ts b/packages/studio-app/src/lib/api.test.ts
index 7701ddd4..31b33142 100644
--- a/packages/studio-app/src/lib/api.test.ts
+++ b/packages/studio-app/src/lib/api.test.ts
@@ -12,6 +12,7 @@ import {
   fetchJobs,
   fetchManifest,
   fetchMe,
+  isHmrEnabled,
   revokeDeploymentKey,
   streamInferenceContent,
   streamTraining,
@@ -42,7 +43,7 @@ describe("extractInferenceDelta", () => {
     // Some providers/proxies serialize token chunks as `data: "Hel"`
     // (a JSON string, not an object). Previously these parsed as a
     // string, hit the `typeof parsed !== "object"` branch, and returned
-    // null — leaving the assistant bubble silently empty.
+    // null, leaving the assistant bubble silently empty.
     expect(extractInferenceDelta('"Hel"')).toBe("Hel");
     expect(extractInferenceDelta('""')).toBe("");
   });
@@ -67,7 +68,7 @@ describe("extractInferenceDelta", () => {
 });
 
 // Mount a SSE-shaped Response from a list of frames and let the SPA's
-// stream consumer assemble the assistant text. Regression for ENG-358 —
+// stream consumer assemble the assistant text. Regression for ENG-358:
 // the previous Playground code concatenated raw `data: …\n\n` frames
 // straight into the message bubble.
 describe("streamInferenceContent (regression for ENG-358)", () => {
@@ -148,7 +149,7 @@ describe("streamInferenceContent (regression for ENG-358)", () => {
     // closed cleanly without writing any frames) used to surface as
     // `Error("No response body")` in the Playground. After the SSE rewrite
     // the silent-exit branch in `iterateSseFrames` would have left the
-    // assistant bubble empty with no feedback — guard against that.
+    // assistant bubble empty with no feedback; guard against that.
     globalThis.fetch = vi.fn(
       async () => new Response(null, { status: 204 }),
     ) as typeof fetch;
@@ -290,7 +291,7 @@ describe("fetchManifest", () => {
 
   it("returns the structured error envelope on 400 (e.g. missing src/arkor/index.ts)", async () => {
     // The SPA renders this error inline as a hint instead of a generic
-    // failure — the helper must therefore distinguish 400 from other 4xx.
+    // failure; the helper must therefore distinguish 400 from other 4xx.
     globalThis.fetch = vi.fn(
       async () =>
         new Response(
@@ -370,6 +371,53 @@ describe("streamTraining", () => {
     expect(JSON.parse(captured)).toEqual({});
   });
 
+  it("throws with the response body when /api/train returns non-2xx", async () => {
+    // Regression: previously `streamTraining` ignored `res.ok` and
+    // proceeded to call `onSpawn` + read the body even on 403/500
+    // failures. The SPA would treat a failed spawn (auth rejection,
+    // server-side EACCES surfacing as 500, etc.) as a normal
+    // completion: `onChunk` got nothing, `onSpawn` was called with
+    // a `null` pid, and `run()` resolved cleanly. The user saw an
+    // idle UI with no log line and no clue what went wrong. Fail
+    // fast so the caller's catch path surfaces the server's reason.
+    globalThis.fetch = vi.fn(
+      async () =>
+        new Response("anonymous tokens disabled", {
+          status: 403,
+          statusText: "Forbidden",
+        }),
+    ) as typeof fetch;
+    const onChunkCalls: string[] = [];
+    let onSpawnCalls = 0;
+    await expect(
+      streamTraining(
+        (t) => onChunkCalls.push(t),
+        undefined,
+        undefined,
+        () => {
+          onSpawnCalls += 1;
+        },
+      ),
+    ).rejects.toThrow(/403.*anonymous tokens disabled/);
+    // The body must NOT have been streamed and `onSpawn` must NOT
+    // have been called with the bogus null pid; both would mislead
+    // the SPA into treating the failure as a successful run.
+    expect(onChunkCalls).toEqual([]);
+    expect(onSpawnCalls).toBe(0);
+  });
+
+  it("falls back to the bare status when the error response body is empty", async () => {
+    // Belt-and-braces for upstreams that return non-2xx with no
+    // body. The status code is enough to surface the failure to
+    // the user; we just don't want a `: undefined` suffix.
+    globalThis.fetch = vi.fn(
+      async () => new Response(null, { status: 500, statusText: "Server Error" }),
+    ) as typeof fetch;
+    await expect(streamTraining(() => undefined)).rejects.toThrow(
+      /^\/api\/train failed \(500 Server Error\)$/,
+    );
+  });
+
   it("returns silently when the response has no body (e.g. 204 from upstream)", async () => {
     globalThis.fetch = vi.fn(
       async () => new Response(null, { status: 204 }),
@@ -427,7 +475,7 @@ describe("streamTraining", () => {
       cancel() {
         cancelled = true;
       },
-      // intentionally no enqueue / no close — would block on read()
+      // intentionally no enqueue / no close; would block on read()
     });
     globalThis.fetch = vi.fn(
       async () =>
@@ -551,8 +599,80 @@ describe("streamInferenceContent abort", () => {
   });
 });
 
+describe("isHmrEnabled", () => {
+  // Regression: a previous version of `RunTraining` gated its
+  // EventSource subscription on `import.meta.env.DEV`, which is
+  // baked to `false` by `vite build` and therefore *always* false
+  // in a real `arkor dev` session (the SPA is shipped as static
+  // assets). The new server-side `<meta name="arkor-hmr-enabled">`
+  // tag is what tells the SPA whether HMR is actually wired in;
+  // these tests pin the contract.
+  //
+  // The package's vitest config doesn't load jsdom (the rest of the
+  // suite runs in Node), so we stub the minimal `document` API
+  // `isHmrEnabled` uses (`querySelector('meta[name=...]')`)
+  // directly on `globalThis`. The reader's contract is just "look
+  // up a meta tag and return its content === 'true'", which a tiny
+  // hand-rolled stub covers without dragging the whole DOM in.
+  function withMetaContent(value: string | null, fn: () => void) {
+    const fakeDocument = {
+      querySelector: (selector: string) => {
+        if (selector !== 'meta[name="arkor-hmr-enabled"]') return null;
+        if (value === null) return null;
+        return { getAttribute: () => value };
+      },
+    };
+    const had = "document" in globalThis;
+    const previous = (globalThis as { document?: unknown }).document;
+    (globalThis as { document?: unknown }).document = fakeDocument;
+    try {
+      fn();
+    } finally {
+      if (had) (globalThis as { document?: unknown }).document = previous;
+      else delete (globalThis as { document?: unknown }).document;
+    }
+  }
+
+  it("returns true when the server-injected meta says HMR is on", () => {
+    withMetaContent("true", () => {
+      expect(isHmrEnabled()).toBe(true);
+    });
+  });
+
+  it("returns false when the meta tag is missing entirely", () => {
+    // No injection → SPA must NOT open `/api/dev/events` (which
+    // would 404 and EventSource-retry forever in a non-HMR build).
+    withMetaContent(null, () => {
+      expect(isHmrEnabled()).toBe(false);
+    });
+  });
+
+  it("returns false for any meta content other than the literal `true`", () => {
+    // Defensive: don't fail open on a malformed/legacy server that
+    // injects an empty value or a placeholder.
+    withMetaContent("", () => expect(isHmrEnabled()).toBe(false));
+    withMetaContent("false", () => expect(isHmrEnabled()).toBe(false));
+    withMetaContent("yes", () => expect(isHmrEnabled()).toBe(false));
+  });
+
+  it("returns false when there is no document at all (Node SSR / module-load probe)", () => {
+    // The reader is called during component render; in any non-DOM
+    // host (test fixtures that import the module without jsdom, a
+    // hypothetical SSR pre-render) it must return false rather than
+    // throwing on `document` being undefined.
+    const had = "document" in globalThis;
+    const previous = (globalThis as { document?: unknown }).document;
+    delete (globalThis as { document?: unknown }).document;
+    try {
+      expect(isHmrEnabled()).toBe(false);
+    } finally {
+      if (had) (globalThis as { document?: unknown }).document = previous;
+    }
+  });
+});
+
 // ---------------------------------------------------------------------------
-// Deployment helpers — verify the wire shape (method/URL/body) the Studio
+// Deployment helpers: verify the wire shape (method/URL/body) the Studio
 // SPA sends to its own server so regressions in route paths or request
 // envelopes show up here rather than as silent SPA bugs.
 // ---------------------------------------------------------------------------
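Reviewer sketch (not part of the patch): the `isHmrEnabled` tests above stub `document` on `globalThis` and restore it in a `finally`. The same stub-and-restore pattern can be factored into one helper, shown here under stated assumptions. `withGlobal`, `readHmrFlag`, `fakeDocument`, and the two `*Like` types are hypothetical names that do not exist in the repository; `readHmrFlag` only mirrors the contract the tests pin (true for the literal `"true"`, false otherwise).

```typescript
// Hypothetical structural types: just enough of the DOM for the reader.
type MetaLike = { getAttribute(name: string): string | null };
type DocumentLike = { querySelector(selector: string): MetaLike | null };

// Hypothetical helper: install `value` as a global for the duration of
// `fn`, then restore the previous state even if `fn` throws.
function withGlobal<T>(name: string, value: unknown, fn: () => T): T {
  const g = globalThis as unknown as Record<string, unknown>;
  const had = name in g;
  const previous = g[name];
  g[name] = value;
  try {
    return fn();
  } finally {
    // Restore the old value, or remove the property if it was absent.
    if (had) g[name] = previous;
    else delete g[name];
  }
}

// Stand-in for the reader under test: true only for content === "true".
function readHmrFlag(): boolean {
  const doc = (globalThis as unknown as { document?: DocumentLike }).document;
  if (!doc) return false;
  const meta = doc.querySelector('meta[name="arkor-hmr-enabled"]');
  return meta?.getAttribute("content") === "true";
}

// Minimal fake: answers only the one selector the reader asks for.
function fakeDocument(content: string | null): DocumentLike {
  return {
    querySelector: (selector: string) => {
      if (selector !== 'meta[name="arkor-hmr-enabled"]') return null;
      if (content === null) return null;
      return { getAttribute: () => content };
    },
  };
}
```

A call such as `withGlobal("document", fakeDocument("true"), readHmrFlag)` exercises the reader without jsdom, and the global is back in its prior state afterwards.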
diff --git a/packages/studio-app/src/lib/api.ts b/packages/studio-app/src/lib/api.ts
index 05414279..43ace23f 100644
--- a/packages/studio-app/src/lib/api.ts
+++ b/packages/studio-app/src/lib/api.ts
@@ -32,6 +32,13 @@ export interface Job {
  */
 export interface ManifestSummary {
   trainer: { name: string } | null;
+  /**
+   * Stable hash of the trainer's cloud-side `JobConfig`. The server is
+   * always paired with this SPA in the same package, so the field is
+   * always present in the wire payload: `null` when no inspectable
+   * trainer is loaded, a hex string otherwise. Not optional.
+   */
+  configHash: string | null;
 }
 
 export interface ManifestError {
@@ -48,6 +55,78 @@ function readStudioToken(): string {
 
 const STUDIO_TOKEN = readStudioToken();
 
+/**
+ * Whether `arkor dev` wired in an HMR coordinator at server boot.
+ * The studio server emits `<meta name="arkor-hmr-enabled" content="true">`
+ * into `index.html` only when `options.hmr` is set, so we can tell
+ * dev-mode usage from prod-mode usage at runtime: `vite build`'s
+ * output ships with `import.meta.env.DEV === false`, so a build-time
+ * gate inside the SPA bundle would (wrongly) suppress HMR even in
+ * real `arkor dev` sessions. `RunTraining` consults this flag before
+ * opening `/api/dev/events`; without it, the EventSource would retry
+ * forever against the 404 the server returns for non-HMR builds.
+ *
+ * The Vite SPA dev workflow (`pnpm --filter @arkor/studio-app dev`)
+ * serves its own `index.html`, so the SPA's `vite.config.ts` plugin
+ * also injects this meta alongside the studio-token meta. That way
+ * a single meta-presence check covers both the production-built SPA
+ * (served by `arkor dev`) and the Vite-served dev SPA, instead of
+ * needing a separate `import.meta.env.DEV` fallback that would diverge
+ * between dev workflows.
+ */
+export function isHmrEnabled(): boolean {
+  if (typeof document === "undefined") return false;
+  const meta = document.querySelector('meta[name="arkor-hmr-enabled"]');
+  return meta?.getAttribute("content") === "true";
+}
+
+/**
+ * Cap for the error-response body read in `streamTraining` below.
+ * 8 KiB comfortably fits the JSON / text errors Studio actually
+ * returns (`/api/train failed: ...` payloads are short), but it
+ * keeps a misconfigured upstream (a reverse proxy that interposes
+ * a multi-MB HTML error page, a server-side stack trace, etc.)
+ * from stalling the UI on an unbounded `res.text()` after the user
+ * clicks Run. The truncated prefix is enough to display the
+ * failure cause inline in the SPA log pane.
+ */
+const ERROR_BODY_MAX_BYTES = 8 * 1024;
+
+/**
+ * Read `res.body` up to `maxBytes` and return as UTF-8 text. Cancels
+ * the underlying stream once the cap is reached so the network
+ * doesn't keep draining a hostile multi-MB error page in the
+ * background.
+ */
+async function readBodyCapped(res: Response, maxBytes: number): Promise<string> {
+  if (!res.body) return "";
+  const reader = res.body.getReader();
+  const decoder = new TextDecoder();
+  let total = 0;
+  let out = "";
+  try {
+    for (;;) {
+      const { value, done } = await reader.read();
+      if (done) break;
+      if (!value) continue;
+      const remaining = maxBytes - total;
+      if (remaining <= 0) break;
+      const slice =
+        value.byteLength > remaining ? value.subarray(0, remaining) : value;
+      out += decoder.decode(slice, { stream: true });
+      total += slice.byteLength;
+      if (total >= maxBytes) break;
+    }
+  } finally {
+    // Best-effort cancel: closes the response stream so we don't
+    // keep pulling bytes after the cap. A rejected `cancel()` is
+    // ignored because the caller throws its own wrapped error.
+    void reader.cancel().catch(() => {});
+  }
+  out += decoder.decode();
+  return out;
+}
+
 /**
  * `fetch` with the per-launch CSRF token attached. The token is read once at
  * module load from the `<meta>` tag the Studio server injects into
@@ -89,7 +168,7 @@ export async function fetchJobs(): Promise<{ jobs: Job[] }> {
 /**
  * Fetch a serialisable summary of the user's `createArkor({...})` manifest.
  * Returns `{ error }` (not a thrown exception) on 4xx so the SPA can render a
- * targeted hint — typically "no src/arkor/index.ts yet" right after scaffold.
+ * targeted hint, typically "no src/arkor/index.ts yet" right after scaffold.
  */
 export async function fetchManifest(): Promise<ManifestResult> {
   const res = await apiFetch("/api/manifest");
@@ -106,6 +185,35 @@ export function openJobEvents(jobId: string): EventSource {
   );
 }
 
+/**
+ * HMR rebuild notifications from `arkor dev`. Server pushes a `ready`
+ * event on first bundle, `rebuild` on each subsequent change, and `error`
+ * when the bundle fails to compile. `restart: true` indicates a training
+ * subprocess was signalled to early-stop and the SPA should re-spawn it
+ * after the current `streamTraining` resolves.
+ */
+export interface DevEvent {
+  type: "ready" | "rebuild" | "error";
+  outFile?: string;
+  hash?: string;
+  /** Cloud-side `JobConfig` hash; null when the bundle has no inspectable trainer. */
+  configHash?: string | null;
+  /** Run name pulled from the rebuilt manifest. */
+  trainerName?: string | null;
+  message?: string;
+  /** True when the rebuild changed cloud-side config and a child was SIGTERM'd. */
+  restart?: boolean;
+  restartTargets?: Array<{ pid: number; trainFile?: string }>;
+  /** True when the rebuild only changed callbacks and one or more children
+   *  were SIGUSR2'd to hot-swap their callback closures in place. */
+  hotSwap?: boolean;
+  hotSwapTargets?: Array<{ pid: number; trainFile?: string }>;
+}
+
+export function openDevEvents(): EventSource {
+  return new EventSource(withStudioToken("/api/dev/events"));
+}
+
 export interface ChatRequestBody {
   messages: Array<{
     role: "system" | "user" | "assistant";
@@ -123,7 +231,7 @@ export interface ChatRequestBody {
  * Stream assistant text deltas from `/api/inference/chat`.
  *
  * The Studio server proxies cloud-api's `/v1/inference/chat` SSE stream
- * verbatim, so the body is `event: …\ndata: {…}\n\n` frames — not plain
+ * verbatim, so the body is `event: …\ndata: {…}\n\n` frames, not plain
  * text. We parse the frames with `eventsource-parser` (the same parser
  * the SDK's `iterateEvents` uses) and pull the assistant text out of
  * each frame's `data` payload.
@@ -147,7 +255,7 @@ export async function* streamInferenceContent(
   }
   // `iterateSseFrames` mirrors cloud-api-client's `iterateEvents` and silently
   // exits when there's no body. That's fine for the SDK but in the Playground
-  // it would leave an empty assistant bubble with no error surfaced — make
+  // it would leave an empty assistant bubble with no error surfaced. Make
   // the missing-body case loud here instead.
   if (!res.body) {
     throw new Error("Inference response has no body");
@@ -211,7 +319,7 @@ export function extractInferenceDelta(data: string): string | null {
     return data;
   }
   // Some providers/proxies serialize token chunks as plain JSON strings
-  // (`data: "Hel"`) rather than objects — surface those directly so we
+  // (`data: "Hel"`) rather than objects; surface those directly so we
   // don't end up with a silently empty assistant bubble.
   if (typeof parsed === "string") return parsed;
   if (!parsed || typeof parsed !== "object") return null;
@@ -230,6 +338,14 @@ export async function streamTraining(
   onChunk: (text: string) => void,
   file?: string,
   signal?: AbortSignal,
+  /**
+   * Called once with the spawned subprocess's pid (or `null` if the
+   * server didn't include the `X-Arkor-Train-Pid` header). The SPA
+   * uses this to scope HMR `restart` events to the run *this* call
+   * actually started, so a passive tab whose own run was hot-swapped
+   * doesn't latch onto a sibling tab's restart broadcast.
+   */
+  onSpawn?: (pid: number | null) => void,
 ): Promise<void> {
   const res = await apiFetch("/api/train", {
     method: "POST",
@@ -237,6 +353,35 @@ export async function streamTraining(
     body: JSON.stringify(file ? { file } : {}),
     signal,
   });
+  // Fail fast on non-2xx so a failed spawn (auth 403, validation 400,
+  // server-side spawn EACCES surfacing as 500, etc.) doesn't slip
+  // through as a "successful" silent run. Without this, the SPA
+  // would call `onSpawn(null)` (the failure response carries no
+  // `X-Arkor-Train-Pid`), then hit `!res.body` or read an empty
+  // body and resolve as if the run completed cleanly, leaving the
+  // user looking at an idle UI and no log output. Read the body
+  // text for diagnostics so the caller's error log shows the
+  // server's reason instead of a bare status code.
+  if (!res.ok) {
+    let detail = "";
+    try {
+      detail = (await readBodyCapped(res, ERROR_BODY_MAX_BYTES)).trim();
+    } catch {
+      // Body unreadable (already consumed, network gone, etc.):
+      // surface the status alone rather than masking the failure
+      // entirely.
+    }
+    throw new Error(
+      detail
+        ? `/api/train failed (${res.status} ${res.statusText}): ${detail}`
+        : `/api/train failed (${res.status} ${res.statusText})`,
+    );
+  }
+  if (onSpawn) {
+    const raw = res.headers.get("x-arkor-train-pid");
+    const parsed = raw ? Number.parseInt(raw, 10) : NaN;
+    onSpawn(Number.isFinite(parsed) ? parsed : null);
+  }
   if (!res.body) return;
   const reader = res.body.getReader();
   const decoder = new TextDecoder();
@@ -246,7 +391,7 @@ export async function streamTraining(
   const onAbort = () => void reader.cancel().catch(() => {});
   // Cover the case where the signal was already aborted before we
   // got here (or aborted in the small window between `getReader()`
-  // and `addEventListener`) — `addEventListener("abort", ...)` won't
+  // and `addEventListener`): `addEventListener("abort", ...)` won't
   // fire after the fact, so the trainer process spawned upstream
   // would never be killed. Cancel synchronously instead.
   if (signal?.aborted) {
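Reviewer sketch (not part of the patch): the `onSpawn` doc comment says the SPA uses the pid to scope HMR `restart` events to the run it started. One way that scoping could look is sketched below. `parseTrainPid` repeats the header parsing from `streamTraining`; `isRestartForRun` and the trimmed `RestartEvent` type are hypothetical, and the matching rule (pid must appear in `restartTargets`) is an assumption, since the consumer in `RunTraining` is not shown in this diff.

```typescript
// Trimmed copy of the `DevEvent` fields this sketch needs.
interface RestartEvent {
  type: "ready" | "rebuild" | "error";
  restart?: boolean;
  restartTargets?: Array<{ pid: number; trainFile?: string }>;
}

// Same parsing as in `streamTraining`: a missing or non-numeric
// `X-Arkor-Train-Pid` header value becomes `null`.
function parseTrainPid(raw: string | null): number | null {
  const parsed = raw ? Number.parseInt(raw, 10) : NaN;
  return Number.isFinite(parsed) ? parsed : null;
}

// Hypothetical scoping rule: react to a restart only when this tab
// knows its own pid and the server lists that pid as a target.
function isRestartForRun(event: RestartEvent, ownPid: number | null): boolean {
  if (ownPid === null || event.restart !== true) return false;
  return (event.restartTargets ?? []).some((target) => target.pid === ownPid);
}
```

Under this rule a tab whose pid is absent from `restartTargets` ignores the broadcast, which matches the intent stated in the `onSpawn` comment about sibling tabs.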
diff --git a/packages/studio-app/vite.config.ts b/packages/studio-app/vite.config.ts
index 51e50dcf..b5d40c8e 100644
--- a/packages/studio-app/vite.config.ts
+++ b/packages/studio-app/vite.config.ts
@@ -25,10 +25,22 @@ function htmlAttrEscape(s: string): string {
 }
 
 /**
- * Inject the per-launch Studio CSRF token into served `index.html` so the
- * SPA's `apiFetch` can attach it. `arkor dev` writes the token to
- * `~/.arkor/studio-token` on launch; we re-read on every request so that
- * starting `arkor dev` after Vite is picked up on the next reload.
+ * Inject the per-launch Studio CSRF token (and the HMR-enabled flag)
+ * into served `index.html` so the SPA's `apiFetch` can attach the
+ * token, and `isHmrEnabled()` can light up the `/api/dev/events`
+ * subscription. `arkor dev` writes the token to `~/.arkor/studio-token`
+ * on launch; we re-read it on every request so that an `arkor dev`
+ * started after Vite is picked up on the next reload.
+ *
+ * Why also inject `arkor-hmr-enabled` here: the SPA reads the meta to
+ * decide whether to open the SSE channel, and `buildStudioApp` only
+ * emits it when HMR is wired in. Vite serves its own `index.html` (so
+ * the runtime backend never gets to inject anything), and the only
+ * realistic backend for Vite-served pages is `arkor dev` (Vite proxies
+ * `/api` to :4000), which always boots with HMR. Pairing the two
+ * meta tags keeps both the production SPA (served by `arkor dev`) and
+ * the Vite dev workflow (`pnpm --filter @arkor/studio-app dev`)
+ * behaving the same way: HMR active whenever the token is.
  *
  * `apply: "serve"` constrains this to the dev server. If it ran during
  * `vite build` it would bake the current per-launch token into `dist/
@@ -51,7 +63,9 @@ function arkorStudioToken(): Plugin {
         return html;
       }
       if (!token) return html;
-      const meta = `<meta name="arkor-studio-token" content="${htmlAttrEscape(token)}">`;
+      const tokenMeta = `<meta name="arkor-studio-token" content="${htmlAttrEscape(token)}">`;
+      const hmrMeta = `<meta name="arkor-hmr-enabled" content="true">`;
+      const meta = `${tokenMeta}${hmrMeta}`;
       const idx = html.indexOf("</head>");
       if (idx === -1) return `${meta}${html}`;
       return `${html.slice(0, idx)}${meta}${html.slice(idx)}`;
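Reviewer sketch (not part of the patch): the injection step in `arkorStudioToken` can be read as a pure string transform, which makes the three outcomes (no token, `</head>` present, `</head>` absent) easy to check in isolation. `injectStudioMeta` and `escapeAttr` are hypothetical names; the body of the real `htmlAttrEscape` is not visible in this diff, so the escaping below is an assumption.

```typescript
// Assumed escaping for an HTML attribute value; the repository's
// `htmlAttrEscape` may differ.
function escapeAttr(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical pure version of the plugin's HTML transform.
function injectStudioMeta(html: string, token: string): string {
  // No token on disk yet: leave the page untouched.
  if (!token) return html;
  const tokenMeta = `<meta name="arkor-studio-token" content="${escapeAttr(token)}">`;
  const hmrMeta = `<meta name="arkor-hmr-enabled" content="true">`;
  const meta = `${tokenMeta}${hmrMeta}`;
  const idx = html.indexOf("</head>");
  // No closing head tag: prepend so the tags still reach the browser.
  if (idx === -1) return `${meta}${html}`;
  return `${html.slice(0, idx)}${meta}${html.slice(idx)}`;
}
```

Both tags are emitted together or not at all, which is the pairing the plugin comment describes: HMR is advertised whenever a token is available.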
diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml
index cb36ec19..c2ba2afe 100644
--- a/pnpm-lock.yaml
+++ b/pnpm-lock.yaml
@@ -4,6 +4,9 @@ settings:
   autoInstallPeers: true
   excludeLinksFromLockfile: false
 
+overrides:
+  unrun: ^0.3.0
+
 importers:
 
   .:
@@ -71,9 +74,6 @@ importers:
       commander:
         specifier: ^13.0.0
         version: 13.1.0
-      esbuild:
-        specifier: ^0.28.0
-        version: 0.28.0
       hono:
         specifier: ^4.7.0
         version: 4.12.14
@@ -83,6 +83,9 @@ importers:
       posthog-node:
         specifier: ^5.30.6
         version: 5.30.6(rxjs@7.8.2)
+      rolldown:
+        specifier: ^1.0.0
+        version: 1.0.2
       zod:
         specifier: ^4.3.6
         version: 4.3.6
@@ -97,8 +100,8 @@ importers:
         specifier: ^4.1.5
         version: 4.1.5(vitest@4.1.5)
       tsdown:
-        specifier: ^0.21.9
-        version: 0.21.10(typescript@5.9.3)
+        specifier: ^0.22.0
+        version: 0.22.0(typescript@5.9.3)
       typescript:
         specifier: ^5
         version: 5.9.3
@@ -140,8 +143,8 @@ importers:
         specifier: ^4.1.5
         version: 4.1.5(vitest@4.1.5)
       tsdown:
-        specifier: ^0.21.9
-        version: 0.21.10(typescript@5.9.3)
+        specifier: ^0.22.0
+        version: 0.22.0(typescript@5.9.3)
       typescript:
         specifier: ^5
         version: 5.9.3
@@ -266,9 +269,9 @@ packages:
     resolution: {integrity: sha512-qsaF+9Qcm2Qv8SRIMMscAvG4O3lJ0F1GuMo5HR/Bp02LopNgnZBC/EkbevHFeGs4ls/oPz9v+Bsmzbkbe+0dUw==}
     engines: {node: '>=6.9.0'}
 
-  '@babel/generator@8.0.0-rc.3':
-    resolution: {integrity: sha512-em37/13/nR320G4jab/nIIHZgc2Wz2y/D39lxnTyxB4/D/omPQncl/lSdlnJY1OhQcRGugTSIF2l/69o31C9dA==}
-    engines: {node: ^20.19.0 || >=22.12.0}
+  '@babel/generator@8.0.0-rc.5':
+    resolution: {integrity: sha512-nFZPWz3FHIS7y6rMIVoa/WBwjdutfIaRJIBQjzn+t3RnecZoRNlGmGcyR2wb0T/IgSd50Kz/6dG8/LvMCRunjg==}
+    engines: {node: ^22.18.0 || >=24.11.0}
 
   '@babel/helper-compilation-targets@7.28.6':
     resolution: {integrity: sha512-JYtls3hqi15fcx5GaSNL7SCTJ2MNmjrkHXg4FSpOA/grxK8KwyZ5bubHsCq8FXCkua6xhuaaBit+3b7+VZRfcA==}
@@ -296,17 +299,17 @@ packages:
     resolution: {integrity: sha512-qMlSxKbpRlAridDExk92nSobyDdpPijUq2DW6oDnUqd0iOGxmQjyqhMIihI9+zv4LPyZdRje2cavWPbCbWm3eA==}
     engines: {node: '>=6.9.0'}
 
-  '@babel/helper-string-parser@8.0.0-rc.3':
-    resolution: {integrity: sha512-AmwWFx1m8G/a5cXkxLxTiWl+YEoWuoFLUCwqMlNuWO1tqAYITQAbCRPUkyBHv1VOFgfjVOqEj6L3u15J5ZCzTA==}
-    engines: {node: ^20.19.0 || >=22.12.0}
+  '@babel/helper-string-parser@8.0.0-rc.5':
+    resolution: {integrity: sha512-sN7R8rBvDurfaziNfDEIjIntlazmlkCDGO4SNl2RJ3wRCn+QxspLV7hzYAE8WWVd2joVuT8sUxeePdLp2idI1A==}
+    engines: {node: ^22.18.0 || >=24.11.0}
 
   '@babel/helper-validator-identifier@7.28.5':
     resolution: {integrity: sha512-qSs4ifwzKJSV39ucNjsvc6WVHs6b7S03sOh2OcHF9UHfVPqWWALUsNUVzhSBiItjRZoLHx7nIarVjqKVusUZ1Q==}
     engines: {node: '>=6.9.0'}
 
-  '@babel/helper-validator-identifier@8.0.0-rc.3':
-    resolution: {integrity: sha512-8AWCJ2VJJyDFlGBep5GpaaQ9AAaE/FjAcrqI7jyssYhtL7WGV0DOKpJsQqM037xDbpRLHXsY8TwU7zDma7coOw==}
-    engines: {node: ^20.19.0 || >=22.12.0}
+  '@babel/helper-validator-identifier@8.0.0-rc.5':
+    resolution: {integrity: sha512-ehJDxHvtbZ85RtX/L2fi0h9AGsBNqB5Euv1EB8RMAvGYvD+2X+QbpzzOpbklnNXO+WSZJNOaetw2BBj27xsWVg==}
+    engines: {node: ^22.18.0 || >=24.11.0}
 
   '@babel/helper-validator-option@7.27.1':
     resolution: {integrity: sha512-YvjJow9FxbhFFKDSuFnVCe2WxXk1zWc22fFePVNEaWJEu8IrZVlda6N0uHwzZrUM1il7NC9Mlp4MaJYbYd9JSg==}
@@ -321,11 +324,16 @@ packages:
     engines: {node: '>=6.0.0'}
     hasBin: true
 
-  '@babel/parser@8.0.0-rc.3':
-    resolution: {integrity: sha512-B20dvP3MfNc/XS5KKCHy/oyWl5IA6Cn9YjXRdDlCjNmUFrjvLXMNUfQq/QUy9fnG2gYkKKcrto2YaF9B32ToOQ==}
+  '@babel/parser@8.0.0-rc.4':
+    resolution: {integrity: sha512-0S/1yefMa15N4i2v3t8Fw9pgMHhf2gF6Lc1UEXI96Ls6FNAjqvHHZouZ2ZS/deqLhbMFtmfVeFac6iTsvFbLwA==}
     engines: {node: ^20.19.0 || >=22.12.0}
     hasBin: true
 
+  '@babel/parser@8.0.0-rc.5':
+    resolution: {integrity: sha512-/Mfg83rK3+jsRbl4Vbd0jqxc6M1A1/WNFtgrowRM1unEsD3XcNnrBdMM0JWakd0/RN9lseQKwPduW1TiEwKOlQ==}
+    engines: {node: ^22.18.0 || >=24.11.0}
+    hasBin: true
+
   '@babel/plugin-transform-react-jsx-self@7.27.1':
     resolution: {integrity: sha512-6UzkCs+ejGdZ5mFFC/OCUrv028ab2fp1znZmCZjAOBKiBK2jXD1O+BPSfX8X2qjJ75fZBMSnQn3Rq2mrBJK2mw==}
     engines: {node: '>=6.9.0'}
@@ -354,9 +362,9 @@ packages:
     resolution: {integrity: sha512-LwdZHpScM4Qz8Xw2iKSzS+cfglZzJGvofQICy7W7v4caru4EaAmyUuO6BGrbyQ2mYV11W0U8j5mBhd14dd3B0A==}
     engines: {node: '>=6.9.0'}
 
-  '@babel/types@8.0.0-rc.3':
-    resolution: {integrity: sha512-mOm5ZrYmphGfqVWoH5YYMTITb3cDXsFgmvFlvkvWDMsR9X8RFnt7a0Wb6yNIdoFsiMO9WjYLq+U/FMtqIYAF8Q==}
-    engines: {node: ^20.19.0 || >=22.12.0}
+  '@babel/types@8.0.0-rc.5':
+    resolution: {integrity: sha512-JeSVu/m8x/zpp4CLjYHVNXuhEyOkhPXuxM8YOXjh6L4LlvQNKuUNOTo5KdBuKAcTDHw8DquToTaEkhsBqPXOaA==}
+    engines: {node: ^22.18.0 || >=24.11.0}
 
   '@bcoe/v8-coverage@1.0.2':
     resolution: {integrity: sha512-6zABk/ECA/QYSCQ1NGiVwwbQerUCZ+TQbp64Q3AgmfNvurHH0j8TtXa1qbShXA6qqkpAj4V5W8pP6mLe1mcMqA==}
@@ -426,312 +434,156 @@ packages:
     cpu: [ppc64]
     os: [aix]
 
-  '@esbuild/aix-ppc64@0.28.0':
-    resolution: {integrity: sha512-lhRUCeuOyJQURhTxl4WkpFTjIsbDayJHih5kZC1giwE+MhIzAb7mEsQMqMf18rHLsrb5qI1tafG20mLxEWcWlA==}
-    engines: {node: '>=18'}
-    cpu: [ppc64]
-    os: [aix]
-
   '@esbuild/android-arm64@0.25.12':
     resolution: {integrity: sha512-6AAmLG7zwD1Z159jCKPvAxZd4y/VTO0VkprYy+3N2FtJ8+BQWFXU+OxARIwA46c5tdD9SsKGZ/1ocqBS/gAKHg==}
     engines: {node: '>=18'}
     cpu: [arm64]
     os: [android]
 
-  '@esbuild/android-arm64@0.28.0':
-    resolution: {integrity: sha512-+WzIXQOSaGs33tLEgYPYe/yQHf0WTU0X42Jca3y8NWMbUVhp7rUnw+vAsRC/QiDrdD31IszMrZy+qwPOPjd+rw==}
-    engines: {node: '>=18'}
-    cpu: [arm64]
-    os: [android]
-
   '@esbuild/android-arm@0.25.12':
     resolution: {integrity: sha512-VJ+sKvNA/GE7Ccacc9Cha7bpS8nyzVv0jdVgwNDaR4gDMC/2TTRc33Ip8qrNYUcpkOHUT5OZ0bUcNNVZQ9RLlg==}
     engines: {node: '>=18'}
     cpu: [arm]
     os: [android]
 
-  '@esbuild/android-arm@0.28.0':
-    resolution: {integrity: sha512-wqh0ByljabXLKHeWXYLqoJ5jKC4XBaw6Hk08OfMrCRd2nP2ZQ5eleDZC41XHyCNgktBGYMbqnrJKq/K/lzPMSQ==}
-    engines: {node: '>=18'}
-    cpu: [arm]
-    os: [android]
-
   '@esbuild/android-x64@0.25.12':
     resolution: {integrity: sha512-5jbb+2hhDHx5phYR2By8GTWEzn6I9UqR11Kwf22iKbNpYrsmRB18aX/9ivc5cabcUiAT/wM+YIZ6SG9QO6a8kg==}
     engines: {node: '>=18'}
     cpu: [x64]
     os: [android]
 
-  '@esbuild/android-x64@0.28.0':
-    resolution: {integrity: sha512-+VJggoaKhk2VNNqVL7f6S189UzShHC/mR9EE8rDdSkdpN0KflSwWY/gWjDrNxxisg8Fp1ZCD9jLMo4m0OUfeUA==}
-    engines: {node: '>=18'}
-    cpu: [x64]
-    os: [android]
-
   '@esbuild/darwin-arm64@0.25.12':
     resolution: {integrity: sha512-N3zl+lxHCifgIlcMUP5016ESkeQjLj/959RxxNYIthIg+CQHInujFuXeWbWMgnTo4cp5XVHqFPmpyu9J65C1Yg==}
     engines: {node: '>=18'}
     cpu: [arm64]
     os: [darwin]
 
-  '@esbuild/darwin-arm64@0.28.0':
-    resolution: {integrity: sha512-0T+A9WZm+bZ84nZBtk1ckYsOvyA3x7e2Acj1KdVfV4/2tdG4fzUp91YHx+GArWLtwqp77pBXVCPn2We7Letr0Q==}
-    engines: {node: '>=18'}
-    cpu: [arm64]
-    os: [darwin]
-
   '@esbuild/darwin-x64@0.25.12':
     resolution: {integrity: sha512-HQ9ka4Kx21qHXwtlTUVbKJOAnmG1ipXhdWTmNXiPzPfWKpXqASVcWdnf2bnL73wgjNrFXAa3yYvBSd9pzfEIpA==}
     engines: {node: '>=18'}
     cpu: [x64]
     os: [darwin]
 
-  '@esbuild/darwin-x64@0.28.0':
-    resolution: {integrity: sha512-fyzLm/DLDl/84OCfp2f/XQ4flmORsjU7VKt8HLjvIXChJoFFOIL6pLJPH4Yhd1n1gGFF9mPwtlN5Wf82DZs+LQ==}
-    engines: {node: '>=18'}
-    cpu: [x64]
-    os: [darwin]
-
   '@esbuild/freebsd-arm64@0.25.12':
     resolution: {integrity: sha512-gA0Bx759+7Jve03K1S0vkOu5Lg/85dou3EseOGUes8flVOGxbhDDh/iZaoek11Y8mtyKPGF3vP8XhnkDEAmzeg==}
     engines: {node: '>=18'}
     cpu: [arm64]
     os: [freebsd]
 
-  '@esbuild/freebsd-arm64@0.28.0':
-    resolution: {integrity: sha512-l9GeW5UZBT9k9brBYI+0WDffcRxgHQD8ShN2Ur4xWq/NFzUKm3k5lsH4PdaRgb2w7mI9u61nr2gI2mLI27Nh3Q==}
-    engines: {node: '>=18'}
-    cpu: [arm64]
-    os: [freebsd]
-
   '@esbuild/freebsd-x64@0.25.12':
     resolution: {integrity: sha512-TGbO26Yw2xsHzxtbVFGEXBFH0FRAP7gtcPE7P5yP7wGy7cXK2oO7RyOhL5NLiqTlBh47XhmIUXuGciXEqYFfBQ==}
     engines: {node: '>=18'}
     cpu: [x64]
     os: [freebsd]
 
-  '@esbuild/freebsd-x64@0.28.0':
-    resolution: {integrity: sha512-BXoQai/A0wPO6Es3yFJ7APCiKGc1tdAEOgeTNy3SsB491S3aHn4S4r3e976eUnPdU+NbdtmBuLncYir2tMU9Nw==}
-    engines: {node: '>=18'}
-    cpu: [x64]
-    os: [freebsd]
-
   '@esbuild/linux-arm64@0.25.12':
     resolution: {integrity: sha512-8bwX7a8FghIgrupcxb4aUmYDLp8pX06rGh5HqDT7bB+8Rdells6mHvrFHHW2JAOPZUbnjUpKTLg6ECyzvas2AQ==}
     engines: {node: '>=18'}
     cpu: [arm64]
     os: [linux]
 
-  '@esbuild/linux-arm64@0.28.0':
-    resolution: {integrity: sha512-RVyzfb3FWsGA55n6WY0MEIEPURL1FcbhFE6BffZEMEekfCzCIMtB5yyDcFnVbTnwk+CLAgTujmV/Lgvih56W+A==}
-    engines: {node: '>=18'}
-    cpu: [arm64]
-    os: [linux]
-
   '@esbuild/linux-arm@0.25.12':
     resolution: {integrity: sha512-lPDGyC1JPDou8kGcywY0YILzWlhhnRjdof3UlcoqYmS9El818LLfJJc3PXXgZHrHCAKs/Z2SeZtDJr5MrkxtOw==}
     engines: {node: '>=18'}
     cpu: [arm]
     os: [linux]
 
-  '@esbuild/linux-arm@0.28.0':
-    resolution: {integrity: sha512-CjaaREJagqJp7iTaNQjjidaNbCKYcd4IDkzbwwxtSvjI7NZm79qiHc8HqciMddQ6CKvJT6aBd8lO9kN/ZudLlw==}
-    engines: {node: '>=18'}
-    cpu: [arm]
-    os: [linux]
-
   '@esbuild/linux-ia32@0.25.12':
     resolution: {integrity: sha512-0y9KrdVnbMM2/vG8KfU0byhUN+EFCny9+8g202gYqSSVMonbsCfLjUO+rCci7pM0WBEtz+oK/PIwHkzxkyharA==}
     engines: {node: '>=18'}
     cpu: [ia32]
     os: [linux]
 
-  '@esbuild/linux-ia32@0.28.0':
-    resolution: {integrity: sha512-KBnSTt1kxl9x70q+ydterVdl+Cn0H18ngRMRCEQfrbqdUuntQQ0LoMZv47uB97NljZFzY6HcfqEZ2SAyIUTQBQ==}
-    engines: {node: '>=18'}
-    cpu: [ia32]
-    os: [linux]
-
   '@esbuild/linux-loong64@0.25.12':
     resolution: {integrity: sha512-h///Lr5a9rib/v1GGqXVGzjL4TMvVTv+s1DPoxQdz7l/AYv6LDSxdIwzxkrPW438oUXiDtwM10o9PmwS/6Z0Ng==}
     engines: {node: '>=18'}
     cpu: [loong64]
     os: [linux]
 
-  '@esbuild/linux-loong64@0.28.0':
-    resolution: {integrity: sha512-zpSlUce1mnxzgBADvxKXX5sl8aYQHo2ezvMNI8I0lbblJtp8V4odlm3Yzlj7gPyt3T8ReksE6bK+pT3WD+aJRg==}
-    engines: {node: '>=18'}
-    cpu: [loong64]
-    os: [linux]
-
   '@esbuild/linux-mips64el@0.25.12':
     resolution: {integrity: sha512-iyRrM1Pzy9GFMDLsXn1iHUm18nhKnNMWscjmp4+hpafcZjrr2WbT//d20xaGljXDBYHqRcl8HnxbX6uaA/eGVw==}
     engines: {node: '>=18'}
     cpu: [mips64el]
     os: [linux]
 
-  '@esbuild/linux-mips64el@0.28.0':
-    resolution: {integrity: sha512-2jIfP6mmjkdmeTlsX/9vmdmhBmKADrWqN7zcdtHIeNSCH1SqIoNI63cYsjQR8J+wGa4Y5izRcSHSm8K3QWmk3w==}
-    engines: {node: '>=18'}
-    cpu: [mips64el]
-    os: [linux]
-
   '@esbuild/linux-ppc64@0.25.12':
     resolution: {integrity: sha512-9meM/lRXxMi5PSUqEXRCtVjEZBGwB7P/D4yT8UG/mwIdze2aV4Vo6U5gD3+RsoHXKkHCfSxZKzmDssVlRj1QQA==}
     engines: {node: '>=18'}
     cpu: [ppc64]
     os: [linux]
 
-  '@esbuild/linux-ppc64@0.28.0':
-    resolution: {integrity: sha512-bc0FE9wWeC0WBm49IQMPSPILRocGTQt3j5KPCA8os6VprfuJ7KD+5PzESSrJ6GmPIPJK965ZJHTUlSA6GNYEhg==}
-    engines: {node: '>=18'}
-    cpu: [ppc64]
-    os: [linux]
-
   '@esbuild/linux-riscv64@0.25.12':
     resolution: {integrity: sha512-Zr7KR4hgKUpWAwb1f3o5ygT04MzqVrGEGXGLnj15YQDJErYu/BGg+wmFlIDOdJp0PmB0lLvxFIOXZgFRrdjR0w==}
     engines: {node: '>=18'}
     cpu: [riscv64]
     os: [linux]
 
-  '@esbuild/linux-riscv64@0.28.0':
-    resolution: {integrity: sha512-SQPZOwoTTT/HXFXQJG/vBX8sOFagGqvZyXcgLA3NhIqcBv1BJU1d46c0rGcrij2B56Z2rNiSLaZOYW5cUk7yLQ==}
-    engines: {node: '>=18'}
-    cpu: [riscv64]
-    os: [linux]
-
   '@esbuild/linux-s390x@0.25.12':
     resolution: {integrity: sha512-MsKncOcgTNvdtiISc/jZs/Zf8d0cl/t3gYWX8J9ubBnVOwlk65UIEEvgBORTiljloIWnBzLs4qhzPkJcitIzIg==}
     engines: {node: '>=18'}
     cpu: [s390x]
     os: [linux]
 
-  '@esbuild/linux-s390x@0.28.0':
-    resolution: {integrity: sha512-SCfR0HN8CEEjnYnySJTd2cw0k9OHB/YFzt5zgJEwa+wL/T/raGWYMBqwDNAC6dqFKmJYZoQBRfHjgwLHGSrn3Q==}
-    engines: {node: '>=18'}
-    cpu: [s390x]
-    os: [linux]
-
   '@esbuild/linux-x64@0.25.12':
     resolution: {integrity: sha512-uqZMTLr/zR/ed4jIGnwSLkaHmPjOjJvnm6TVVitAa08SLS9Z0VM8wIRx7gWbJB5/J54YuIMInDquWyYvQLZkgw==}
     engines: {node: '>=18'}
     cpu: [x64]
     os: [linux]
 
-  '@esbuild/linux-x64@0.28.0':
-    resolution: {integrity: sha512-us0dSb9iFxIi8srnpl931Nvs65it/Jd2a2K3qs7fz2WfGPHqzfzZTfec7oxZJRNPXPnNYZtanmRc4AL/JwVzHQ==}
-    engines: {node: '>=18'}
-    cpu: [x64]
-    os: [linux]
-
   '@esbuild/netbsd-arm64@0.25.12':
     resolution: {integrity: sha512-xXwcTq4GhRM7J9A8Gv5boanHhRa/Q9KLVmcyXHCTaM4wKfIpWkdXiMog/KsnxzJ0A1+nD+zoecuzqPmCRyBGjg==}
     engines: {node: '>=18'}
     cpu: [arm64]
     os: [netbsd]
 
-  '@esbuild/netbsd-arm64@0.28.0':
-    resolution: {integrity: sha512-CR/RYotgtCKwtftMwJlUU7xCVNg3lMYZ0RzTmAHSfLCXw3NtZtNpswLEj/Kkf6kEL3Gw+BpOekRX0BYCtklhUw==}
-    engines: {node: '>=18'}
-    cpu: [arm64]
-    os: [netbsd]
-
   '@esbuild/netbsd-x64@0.25.12':
     resolution: {integrity: sha512-Ld5pTlzPy3YwGec4OuHh1aCVCRvOXdH8DgRjfDy/oumVovmuSzWfnSJg+VtakB9Cm0gxNO9BzWkj6mtO1FMXkQ==}
     engines: {node: '>=18'}
     cpu: [x64]
     os: [netbsd]
 
-  '@esbuild/netbsd-x64@0.28.0':
-    resolution: {integrity: sha512-nU1yhmYutL+fQ71Kxnhg8uEOdC0pwEW9entHykTgEbna2pw2dkbFSMeqjjyHZoCmt8SBkOSvV+yNmm94aUrrqw==}
-    engines: {node: '>=18'}
-    cpu: [x64]
-    os: [netbsd]
-
   '@esbuild/openbsd-arm64@0.25.12':
     resolution: {integrity: sha512-fF96T6KsBo/pkQI950FARU9apGNTSlZGsv1jZBAlcLL1MLjLNIWPBkj5NlSz8aAzYKg+eNqknrUJ24QBybeR5A==}
     engines: {node: '>=18'}
     cpu: [arm64]
     os: [openbsd]
 
-  '@esbuild/openbsd-arm64@0.28.0':
-    resolution: {integrity: sha512-cXb5vApOsRsxsEl4mcZ1XY3D4DzcoMxR/nnc4IyqYs0rTI8ZKmW6kyyg+11Z8yvgMfAEldKzP7AdP64HnSC/6g==}
-    engines: {node: '>=18'}
-    cpu: [arm64]
-    os: [openbsd]
-
   '@esbuild/openbsd-x64@0.25.12':
     resolution: {integrity: sha512-MZyXUkZHjQxUvzK7rN8DJ3SRmrVrke8ZyRusHlP+kuwqTcfWLyqMOE3sScPPyeIXN/mDJIfGXvcMqCgYKekoQw==}
     engines: {node: '>=18'}
     cpu: [x64]
     os: [openbsd]
 
-  '@esbuild/openbsd-x64@0.28.0':
-    resolution: {integrity: sha512-8wZM2qqtv9UP3mzy7HiGYNH/zjTA355mpeuA+859TyR+e+Tc08IHYpLJuMsfpDJwoLo1ikIJI8jC3GFjnRClzA==}
-    engines: {node: '>=18'}
-    cpu: [x64]
-    os: [openbsd]
-
   '@esbuild/openharmony-arm64@0.25.12':
     resolution: {integrity: sha512-rm0YWsqUSRrjncSXGA7Zv78Nbnw4XL6/dzr20cyrQf7ZmRcsovpcRBdhD43Nuk3y7XIoW2OxMVvwuRvk9XdASg==}
     engines: {node: '>=18'}
     cpu: [arm64]
     os: [openharmony]
 
-  '@esbuild/openharmony-arm64@0.28.0':
-    resolution: {integrity: sha512-FLGfyizszcef5C3YtoyQDACyg95+dndv79i2EekILBofh5wpCa1KuBqOWKrEHZg3zrL3t5ouE5jgr94vA+Wb2w==}
-    engines: {node: '>=18'}
-    cpu: [arm64]
-    os: [openharmony]
-
   '@esbuild/sunos-x64@0.25.12':
     resolution: {integrity: sha512-3wGSCDyuTHQUzt0nV7bocDy72r2lI33QL3gkDNGkod22EsYl04sMf0qLb8luNKTOmgF/eDEDP5BFNwoBKH441w==}
     engines: {node: '>=18'}
     cpu: [x64]
     os: [sunos]
 
-  '@esbuild/sunos-x64@0.28.0':
-    resolution: {integrity: sha512-1ZgjUoEdHZZl/YlV76TSCz9Hqj9h9YmMGAgAPYd+q4SicWNX3G5GCyx9uhQWSLcbvPW8Ni7lj4gDa1T40akdlw==}
-    engines: {node: '>=18'}
-    cpu: [x64]
-    os: [sunos]
-
   '@esbuild/win32-arm64@0.25.12':
     resolution: {integrity: sha512-rMmLrur64A7+DKlnSuwqUdRKyd3UE7oPJZmnljqEptesKM8wx9J8gx5u0+9Pq0fQQW8vqeKebwNXdfOyP+8Bsg==}
     engines: {node: '>=18'}
     cpu: [arm64]
     os: [win32]
 
-  '@esbuild/win32-arm64@0.28.0':
-    resolution: {integrity: sha512-Q9StnDmQ/enxnpxCCLSg0oo4+34B9TdXpuyPeTedN/6+iXBJ4J+zwfQI28u/Jl40nOYAxGoNi7mFP40RUtkmUA==}
-    engines: {node: '>=18'}
-    cpu: [arm64]
-    os: [win32]
-
   '@esbuild/win32-ia32@0.25.12':
     resolution: {integrity: sha512-HkqnmmBoCbCwxUKKNPBixiWDGCpQGVsrQfJoVGYLPT41XWF8lHuE5N6WhVia2n4o5QK5M4tYr21827fNhi4byQ==}
     engines: {node: '>=18'}
     cpu: [ia32]
     os: [win32]
 
-  '@esbuild/win32-ia32@0.28.0':
-    resolution: {integrity: sha512-zF3ag/gfiCe6U2iczcRzSYJKH1DCI+ByzSENHlM2FcDbEeo5Zd2C86Aq0tKUYAJJ1obRP84ymxIAksZUcdztHA==}
-    engines: {node: '>=18'}
-    cpu: [ia32]
-    os: [win32]
-
   '@esbuild/win32-x64@0.25.12':
     resolution: {integrity: sha512-alJC0uCZpTFrSL0CCDjcgleBXPnCrEAhTBILpeAp7M/OFgoqtAetfBzX0xM00MUsVVPpVjlPuMbREqnZCXaTnA==}
     engines: {node: '>=18'}
     cpu: [x64]
     os: [win32]
 
-  '@esbuild/win32-x64@0.28.0':
-    resolution: {integrity: sha512-pEl1bO9mfAmIC+tW5btTmrKaujg3zGtUmWNdCw/xs70FBjwAL3o9OEKNHvNmnyylD6ubxUERiEhdsL0xBQ9efw==}
-    engines: {node: '>=18'}
-    cpu: [x64]
-    os: [win32]
-
   '@exodus/bytes@1.15.0':
     resolution: {integrity: sha512-UY0nlA+feH81UGSHv92sLEPLCeZFjXOuHhrIo0HQydScuQc8s0A7kL/UdgwgDq8g8ilksmuoF35YVTNphV2aBQ==}
     engines: {node: ^20.19.0 || ^22.12.0 || >=24.0.0}
@@ -1151,8 +1003,8 @@ packages:
   '@openapi-contrib/openapi-schema-to-json-schema@3.2.0':
     resolution: {integrity: sha512-Gj6C0JwCr8arj0sYuslWXUBSP/KnUlEGnPW4qxlXvAl543oaNQgMgIgkQUA6vs5BCCvwTEiL8m/wdWzfl4UvSw==}
 
-  '@oxc-project/types@0.127.0':
-    resolution: {integrity: sha512-aIYXQBo4lCbO4z0R3FHeucQHpF46l2LbMdxRvqvuRuW2OxdnSkcng5B8+K12spgLDj93rtN3+J2Vac/TIO+ciQ==}
+  '@oxc-project/types@0.132.0':
+    resolution: {integrity: sha512-FESMOxil5Se014ui/Eq8fT5uHJo6nIRwH0PfJrZJXs6Gek3ZVFOrpUv3YIZT20m+extU98Hg1Ym72U58rlsxUQ==}
 
   '@oxlint/binding-android-arm-eabi@1.66.0':
     resolution: {integrity: sha512-f7kq8N51T4phpzqfBpA2qaVTI/KrkCmNwaj3t/97I/WLTDI+UhlP5GL9eER+zVxBhtlx5rKXWByJU1/zDAvyaw==}
@@ -1520,97 +1372,97 @@ packages:
   '@radix-ui/rect@1.1.1':
     resolution: {integrity: sha512-HPwpGIzkl28mWyZqG52jiqDJ12waP11Pa1lGoiyUkIEuMLBP0oeK/C89esbXrxsky5we7dfd8U58nm0SgAWpVw==}
 
-  '@rolldown/binding-android-arm64@1.0.0-rc.17':
-    resolution: {integrity: sha512-s70pVGhw4zqGeFnXWvAzJDlvxhlRollagdCCKRgOsgUOH3N1l0LIxf83AtGzmb5SiVM4Hjl5HyarMRfdfj3DaQ==}
+  '@rolldown/binding-android-arm64@1.0.2':
+    resolution: {integrity: sha512-ZS4D1JPGn/MYQN/SYDWftIE/nVsM8j/AFOYEzAoOE2O3NktQOZru+/vYXGbR/qtdLdIfGCP0lcoJiYVzsEz+iQ==}
     engines: {node: ^20.19.0 || >=22.12.0}
     cpu: [arm64]
     os: [android]
 
-  '@rolldown/binding-darwin-arm64@1.0.0-rc.17':
-    resolution: {integrity: sha512-4ksWc9n0mhlZpZ9PMZgTGjeOPRu8MB1Z3Tz0Mo02eWfWCHMW1zN82Qz/pL/rC+yQa+8ZnutMF0JjJe7PjwasYw==}
+  '@rolldown/binding-darwin-arm64@1.0.2':
+    resolution: {integrity: sha512-vdFA9+C/rekyGce7WqHs/xoT0ioZEWaOFyZLIV1mEeNFaFDUQrPIo8Vs2GvJ6eetb3rzDUtUBgzto3ExpXJB3w==}
     engines: {node: ^20.19.0 || >=22.12.0}
     cpu: [arm64]
     os: [darwin]
 
-  '@rolldown/binding-darwin-x64@1.0.0-rc.17':
-    resolution: {integrity: sha512-SUSDOI6WwUVNcWxd02QEBjLdY1VPHvlEkw6T/8nYG322iYWCTxRb1vzk4E+mWWYehTp7ERibq54LSJGjmouOsw==}
+  '@rolldown/binding-darwin-x64@1.0.2':
+    resolution: {integrity: sha512-BewSOwTHazv77DTYiAZXSqqKZ4KP/KonFisDMVU7PImxoWfB2aepnPhd2E4SWz3zDzYgDNbs6jBmTdgNnF02GA==}
     engines: {node: ^20.19.0 || >=22.12.0}
     cpu: [x64]
     os: [darwin]
 
-  '@rolldown/binding-freebsd-x64@1.0.0-rc.17':
-    resolution: {integrity: sha512-hwnz3nw9dbJ05EDO/PvcjaaewqqDy7Y1rn1UO81l8iIK1GjenME75dl16ajbvSSMfv66WXSRCYKIqfgq2KCfxw==}
+  '@rolldown/binding-freebsd-x64@1.0.2':
+    resolution: {integrity: sha512-m41o7M0YWtUdqk61Tb+jnKb2rN++iRdIASlExkUoKfIAH30DOHCB8fVLzSUpbWHHU8esmEioY62PxzexE8MBuA==}
     engines: {node: ^20.19.0 || >=22.12.0}
     cpu: [x64]
     os: [freebsd]
 
-  '@rolldown/binding-linux-arm-gnueabihf@1.0.0-rc.17':
-    resolution: {integrity: sha512-IS+W7epTcwANmFSQFrS1SivEXHtl1JtuQA9wlxrZTcNi6mx+FDOYrakGevvvTwgj2JvWiK8B29/qD9BELZPyXQ==}
+  '@rolldown/binding-linux-arm-gnueabihf@1.0.2':
+    resolution: {integrity: sha512-jcojB9H7W/jS29pMKWAK1N+fU99vXodHDTatS3b3y/XSOCiHo0kkA74pL3jJmkoQtYpOCxDvaKs1fo2Ij/1X5w==}
     engines: {node: ^20.19.0 || >=22.12.0}
     cpu: [arm]
     os: [linux]
 
-  '@rolldown/binding-linux-arm64-gnu@1.0.0-rc.17':
-    resolution: {integrity: sha512-e6usGaHKW5BMNZOymS1UcEYGowQMWcgZ71Z17Sl/h2+ZziNJ1a9n3Zvcz6LdRyIW5572wBCTH/Z+bKuZouGk9Q==}
+  '@rolldown/binding-linux-arm64-gnu@1.0.2':
+    resolution: {integrity: sha512-1jn6qDU5iiOgFgygDzKUuKP0maTi0/f1+sBLgvij/76C77Nm3ts6ufz9Bjg5q5dduxiUIxtq86JIoBvo1xQ4Ig==}
     engines: {node: ^20.19.0 || >=22.12.0}
     cpu: [arm64]
     os: [linux]
     libc: [glibc]
 
-  '@rolldown/binding-linux-arm64-musl@1.0.0-rc.17':
-    resolution: {integrity: sha512-b/CgbwAJpmrRLp02RPfhbudf5tZnN9nsPWK82znefso832etkem8H7FSZwxrOI9djcdTP7U6YfNhbRnh7djErg==}
+  '@rolldown/binding-linux-arm64-musl@1.0.2':
+    resolution: {integrity: sha512-QVLO/czFMdoMFSqlX3bcswcJNm/23r+qoa/jgtmFc/qEp6/jXmIkDjF/XIo8dPfGaiwy1xfQn8o77L79GeXFgw==}
     engines: {node: ^20.19.0 || >=22.12.0}
     cpu: [arm64]
     os: [linux]
     libc: [musl]
 
-  '@rolldown/binding-linux-ppc64-gnu@1.0.0-rc.17':
-    resolution: {integrity: sha512-4EII1iNGRUN5WwGbF/kOh/EIkoDN9HsupgLQoXfY+D1oyJm7/F4t5PYU5n8SWZgG0FEwakyM8pGgwcBYruGTlA==}
+  '@rolldown/binding-linux-ppc64-gnu@1.0.2':
+    resolution: {integrity: sha512-hgO5Abm0w5UL6FEa2iFnZqo2KlK7TQ5QhV5x09hujBf7t5KzHQ1VmfPuTpqRy/rNlSxua3eWH374xxiVrP+lcA==}
     engines: {node: ^20.19.0 || >=22.12.0}
     cpu: [ppc64]
     os: [linux]
     libc: [glibc]
 
-  '@rolldown/binding-linux-s390x-gnu@1.0.0-rc.17':
-    resolution: {integrity: sha512-AH8oq3XqQo4IibpVXvPeLDI5pzkpYn0WiZAfT05kFzoJ6tQNzwRdDYQ45M8I/gslbodRZwW8uxLhbSBbkv96rA==}
+  '@rolldown/binding-linux-s390x-gnu@1.0.2':
+    resolution: {integrity: sha512-fy8rXxuYEu602abC8MUNaPjYLIFzReOaEIEMKMUa0rFEUxNpVXhs15KSSQ4qlqSaM7B6rcj9rDZgADh/IGDzLQ==}
     engines: {node: ^20.19.0 || >=22.12.0}
     cpu: [s390x]
     os: [linux]
     libc: [glibc]
 
-  '@rolldown/binding-linux-x64-gnu@1.0.0-rc.17':
-    resolution: {integrity: sha512-cLnjV3xfo7KslbU41Z7z8BH/E1y5mzUYzAqih1d1MDaIGZRCMqTijqLv76/P7fyHuvUcfGsIpqCdddbxLLK9rA==}
+  '@rolldown/binding-linux-x64-gnu@1.0.2':
+    resolution: {integrity: sha512-0+bOkiQ779+r1WpoHOWHqncvyySci0vKph+myNDYb+im6meJAzHQXay6oEgnkHuUGouM1LKTZwqKpBow6Kj7CQ==}
     engines: {node: ^20.19.0 || >=22.12.0}
     cpu: [x64]
     os: [linux]
     libc: [glibc]
 
-  '@rolldown/binding-linux-x64-musl@1.0.0-rc.17':
-    resolution: {integrity: sha512-0phclDw1spsL7dUB37sIARuis2tAgomCJXAHZlpt8PXZ4Ba0dRP1e+66lsRqrfhISeN9bEGNjQs+T/Fbd7oYGw==}
+  '@rolldown/binding-linux-x64-musl@1.0.2':
+    resolution: {integrity: sha512-mjSkrzZK5Qsl0a9d1JgILOiuZOSDTVdKENcSXBoqbzSrspLR/4/IRVDo5wd2GgZjNss/viBFJdeq+j7qH2nypw==}
     engines: {node: ^20.19.0 || >=22.12.0}
     cpu: [x64]
     os: [linux]
     libc: [musl]
 
-  '@rolldown/binding-openharmony-arm64@1.0.0-rc.17':
-    resolution: {integrity: sha512-0ag/hEgXOwgw4t8QyQvUCxvEg+V0KBcA6YuOx9g0r02MprutRF5dyljgm3EmR02O292UX7UeS6HzWHAl6KgyhA==}
+  '@rolldown/binding-openharmony-arm64@1.0.2':
+    resolution: {integrity: sha512-1v5vHasdfQAZoEHakBV72LIFAC9JjnymsiKxp+GEr/ma3+NJCPSaYK+qavInOovJkgwFrs7GccX2d6IgDA3Z5w==}
     engines: {node: ^20.19.0 || >=22.12.0}
     cpu: [arm64]
     os: [openharmony]
 
-  '@rolldown/binding-wasm32-wasi@1.0.0-rc.17':
-    resolution: {integrity: sha512-LEXei6vo0E5wTGwpkJ4KoT3OZJRnglwldt5ziLzOlc6qqb55z4tWNq2A+PFqCJuvWWdP53CVhG1Z9NtToDPJrA==}
+  '@rolldown/binding-wasm32-wasi@1.0.2':
+    resolution: {integrity: sha512-mb1VobWn6NheziTk5/WEaR6AKVbrwT5sOi6C7zk3gy/pD1qtJfU1j4PgTo2NJnOtbL9Dl3Aeei8w9jJ7qC2jZQ==}
     engines: {node: ^20.19.0 || >=22.12.0}
     cpu: [wasm32]
 
-  '@rolldown/binding-win32-arm64-msvc@1.0.0-rc.17':
-    resolution: {integrity: sha512-gUmyzBl3SPMa6hrqFUth9sVfcLBlYsbMzBx5PlexMroZStgzGqlZ26pYG89rBb45Mnia+oil6YAIFeEWGWhoZA==}
+  '@rolldown/binding-win32-arm64-msvc@1.0.2':
+    resolution: {integrity: sha512-SqKonF56vA/L2yHwHYcEp2P34URpOZ7d1fS635cTkpDnUtEGdUbhI6NzsPdqeSWvAAeGDrxjWjNmibDIdFf9/A==}
     engines: {node: ^20.19.0 || >=22.12.0}
     cpu: [arm64]
     os: [win32]
 
-  '@rolldown/binding-win32-x64-msvc@1.0.0-rc.17':
-    resolution: {integrity: sha512-3hkiolcUAvPB9FLb3UZdfjVVNWherN1f/skkGWJP/fgSQhYUZpSIRr0/I8ZK9TkF3F7kxvJAk0+IcKvPHk9qQg==}
+  '@rolldown/binding-win32-x64-msvc@1.0.2':
+    resolution: {integrity: sha512-v7qRI7gXLRINcOGXt+7YmAZ6iFuyZVMIoXAxhd8oP+DR9dLfL9GfNIx7PLMxmhZdvq8waUJBQiWN9EKNy+TRBQ==}
     engines: {node: ^20.19.0 || >=22.12.0}
     cpu: [x64]
     os: [win32]
@@ -1618,8 +1470,8 @@ packages:
   '@rolldown/pluginutils@1.0.0-beta.27':
     resolution: {integrity: sha512-+d0F4MKMCbeVUJwG96uQ4SgAznZNSq93I3V+9NHA4OpvqG8mRCpGdKmK8l/dl02h2CCDHwW2FqilnTyDcAnqjA==}
 
-  '@rolldown/pluginutils@1.0.0-rc.17':
-    resolution: {integrity: sha512-n8iosDOt6Ig1UhJ2AYqoIhHWh/isz0xpicHTzpKBeotdVsTEcxsSA/i3EVM7gQAj0rU27OLAxCjzlj15IWY7bg==}
+  '@rolldown/pluginutils@1.0.1':
+    resolution: {integrity: sha512-2j9bGt5Jh8hj+vPtgzPtl72j0yRxHAyumoo6TNfAjsLB04UtpSvPbPcDcBMxz7n+9CYB0c1GxQFxYRg2jimqGw==}
 
   '@rollup/rollup-android-arm-eabi@4.60.2':
     resolution: {integrity: sha512-dnlp69efPPg6Uaw2dVqzWRfAWRnYVb1XJ8CyyhIbZeaq4CA5/mLeZ1IEt9QqQxmbdvagjLIm2ZL8BxXv5lH4Yw==}
@@ -2902,9 +2754,9 @@ packages:
   dom-accessibility-api@0.6.3:
     resolution: {integrity: sha512-7ZgogeTnjuHbo+ct10G9Ffp0mif17idi0IyWNVA/wcwcm7NPOD/WEHVP3n7n3MhXqxoIYm8d6MuZohYWIZ4T3w==}
 
-  dts-resolver@2.1.3:
-    resolution: {integrity: sha512-bihc7jPC90VrosXNzK0LTE2cuLP6jr0Ro8jk+kMugHReJVLIpHz/xadeq3MhuwyO4TD4OA3L1Q8pBBFRc08Tsw==}
-    engines: {node: '>=20.19.0'}
+  dts-resolver@3.0.0:
+    resolution: {integrity: sha512-1T1f+z+4tl9XD+m+0HBgWoL/nm0bOIffyWaUuUSBlFg/86IWvfx+wjNaO/ybU0AJzG9/Mi5hBUgGV6zCmWEN7Q==}
+    engines: {node: ^22.18.0 || >=24.0.0}
     peerDependencies:
       oxc-resolver: '>=11.0.0'
     peerDependenciesMeta:
@@ -3020,11 +2872,6 @@ packages:
     engines: {node: '>=18'}
     hasBin: true
 
-  esbuild@0.28.0:
-    resolution: {integrity: sha512-sNR9MHpXSUV/XB4zmsFKN+QgVG82Cc7+/aaxJ8Adi8hyOac+EXptIp45QBPaVyX3N70664wRbTcLTOemCAnyqw==}
-    engines: {node: '>=18'}
-    hasBin: true
-
   escalade@3.2.0:
     resolution: {integrity: sha512-WUj2qlxaQtO4g6Pq5c29GTcWGDyd8itL8zTlipgECz3JesAiiOKotd8JU6otB3PACgG6xkJUyVhboMS+bje/jA==}
     engines: {node: '>=6'}
@@ -3290,8 +3137,9 @@ packages:
     resolution: {integrity: sha512-w9UMqWwJxHNOvoNzSJ2oPF5wvYcvP7jUvYzhp67yEhTi17ZDBBC1z9pTdGuzjD+EFIqLSYRweZjqfiPzQ06Ebg==}
     engines: {node: '>= 0.4'}
 
-  get-tsconfig@4.14.0:
-    resolution: {integrity: sha512-yTb+8DXzDREzgvYmh6s9vHsSVCHeC0G3PI5bEXNBHtmshPnO+S5O7qgLEOn0I5QvMy6kpZN8K1NKGyilLb93wA==}
+  get-tsconfig@5.0.0-beta.5:
+    resolution: {integrity: sha512-/6gFNr0N04nob252sTQxyFLi3eKFRqIg1I87YcqAMT1i6SQrSF6KujUEQrtrjMV0H/eejTCltLdDSTEMzHbnsQ==}
+    engines: {node: '>=20.20.0'}
 
   get-uri@6.0.5:
     resolution: {integrity: sha512-b1O07XYq8eRuVzBNgJLstU6FYc1tS6wnMtF1I1D9lE8LxZSOGZ7LhxN54yPP6mGw5f2CkXY2BQUL9Fx41qvcIg==}
@@ -3484,9 +3332,9 @@ packages:
     resolution: {integrity: sha512-TR3KfrTZTYLPB6jUjfx6MF9WcWrHL9su5TObK4ZkYgBdWKPOFoSoQIdEuTuR82pmtxH2spWG9h6etwfr1pLBqQ==}
     engines: {node: '>=6'}
 
-  import-without-cache@0.3.3:
-    resolution: {integrity: sha512-bDxwDdF04gm550DfZHgffvlX+9kUlcz32UD0AeBTmVPFiWkrexF2XVmiuFFbDhiFuP8fQkrkvI2KdSNPYWAXkQ==}
-    engines: {node: '>=20.19.0'}
+  import-without-cache@0.4.0:
+    resolution: {integrity: sha512-NkJQA7oZ4YHQhd2+H3BoRFKF3d/XNsiKpHZCQEMH9pDX27hQQLsTyOocyRgaIVtf8gHX3Nt3LPkR4e5EdtPAGQ==}
+    engines: {node: ^22.18.0 || >=24.0.0}
 
   indent-string@4.0.0:
     resolution: {integrity: sha512-EdDDZu4A2OyIK7Lr/2zG+w5jmbuk1DVBnEwREQvBzspBJkCEbRa8GxU1lghYcaGJCnRWibjDXlq779X1/y5xwg==}
@@ -4906,13 +4754,13 @@ packages:
     resolution: {integrity: sha512-g6QUff04oZpHs0eG5p83rFLhHeV00ug/Yf9nZM6fLeUrPguBTkTQOdpAWWspMh55TZfVQDPaN3NQJfbVRAxdIw==}
     engines: {iojs: '>=1.0.0', node: '>=0.10.0'}
 
-  rolldown-plugin-dts@0.23.2:
-    resolution: {integrity: sha512-PbSqLawLgZBGcOGT3yqWBGn4cX+wh2nt5FuBGdcMHyOhoukmjbhYAl8NT9sE4U38Cm9tqLOIQeOrvzeayM0DLQ==}
-    engines: {node: '>=20.19.0'}
+  rolldown-plugin-dts@0.25.1:
+    resolution: {integrity: sha512-zK82aC/8z1iVW+g0bCnlQZq04Y5bNeL/RcRwTYBwsnU6wH0N+6vpIFkN7JC0kYRS5qKA+pxQyfIPvXJ6Q5xSpQ==}
+    engines: {node: ^22.18.0 || >=24.0.0}
     peerDependencies:
       '@ts-macro/tsc': ^0.3.6
       '@typescript/native-preview': '>=7.0.0-dev.20260325.1'
-      rolldown: ^1.0.0-rc.12
+      rolldown: ^1.0.0
       typescript: ^5.0.0 || ^6.0.0
       vue-tsc: ~3.2.0
     peerDependenciesMeta:
@@ -4925,8 +4773,8 @@ packages:
       vue-tsc:
         optional: true
 
-  rolldown@1.0.0-rc.17:
-    resolution: {integrity: sha512-ZrT53oAKrtA4+YtBWPQbtPOxIbVDbxT0orcYERKd63VJTF13zPcgXTvD4843L8pcsI7M6MErt8QtON6lrB9tyA==}
+  rolldown@1.0.2:
+    resolution: {integrity: sha512-oZx5zVDtVB44AW3eaifgDml1gWRDZGvjcfdxonE4swNPG98PrrXjaO/KrnUjzlMnztCCRVlUueA1kCXhARGk6g==}
     engines: {node: ^20.19.0 || >=22.12.0}
     hasBin: true
 
@@ -5285,6 +5133,10 @@ packages:
     resolution: {integrity: sha512-VKS/ZaQhhkKFMANmAOhhXVoIfBXblQxGX1myCQ2faQrfmobMftXeJPcZGp0gS07ocvGJWDLZGyOZDadDBqYIJg==}
     engines: {node: '>=18'}
 
+  tinyexec@1.1.2:
+    resolution: {integrity: sha512-dAqSqE/RabpBKI8+h26GfLq6Vb3JVXs30XYQjdMjaj/c2tS8IYYMbIzP599KtRj7c57/wYApb3QjgRgXmrCukA==}
+    engines: {node: '>=18'}
+
   tinyglobby@0.2.16:
     resolution: {integrity: sha512-pn99VhoACYR8nFHhxqix+uvsbXineAasWm5ojXoN8xEwK5Kd3/TrhNn1wByuD52UxWRLy8pu+kRMniEi6Eq9Zg==}
     engines: {node: '>=12.0.0'}
@@ -5338,18 +5190,20 @@ packages:
   ts-interface-checker@0.1.13:
     resolution: {integrity: sha512-Y/arvbn+rrz3JCKl9C4kVNfTfSm2/mEp5FSz5EsZSANGPSlQrpRI5M4PKF+mJnE52jOO90PnPSc3Ur3bTQw0gA==}
 
-  tsdown@0.21.10:
-    resolution: {integrity: sha512-3wk73yBhZe/wX7REqSdivNQ84TDs1mJ+IlnzrrEREP70xlJ/AEIzqaI04l/TzMKVIdkTdC3CPaADn2Lk/0SkdA==}
-    engines: {node: '>=20.19.0'}
+  tsdown@0.22.0:
+    resolution: {integrity: sha512-FgW0hHb27nGQA/+F3d5+U9wKXkfilk9DVkc5+7x/ZqF03g+Hoz/eeApT32jqxATt9eRoR+1jxk7MUMON+O4CXw==}
+    engines: {node: ^22.18.0 || >=24.0.0}
     hasBin: true
     peerDependencies:
       '@arethetypeswrong/core': ^0.18.1
-      '@tsdown/css': 0.21.10
-      '@tsdown/exe': 0.21.10
+      '@tsdown/css': 0.22.0
+      '@tsdown/exe': 0.22.0
       '@vitejs/devtools': '*'
-      publint: ^0.3.0
+      publint: ^0.3.8
+      tsx: '*'
       typescript: ^5.0.0 || ^6.0.0
       unplugin-unused: ^0.5.0
+      unrun: ^0.3.0
     peerDependenciesMeta:
       '@arethetypeswrong/core':
         optional: true
@@ -5361,10 +5215,14 @@ packages:
         optional: true
       publint:
         optional: true
+      tsx:
+        optional: true
       typescript:
         optional: true
       unplugin-unused:
         optional: true
+      unrun:
+        optional: true
 
   tslib@1.14.1:
     resolution: {integrity: sha512-Xni35NKzjgMrwevysHTCArtLDpPvye8zV/0E4EyYn43P7/7qvQwPh9BGkHewbMulVntbigmcT7rdX3BNo9wRJg==}
@@ -5503,16 +5361,6 @@ packages:
     resolution: {integrity: sha512-pjy2bYhSsufwWlKwPc+l3cN7+wuJlK6uz0YdJEOlQDbl6jo/YlPi4mb8agUkVC8BF7V8NuzeyPNqRksA3hztKQ==}
     engines: {node: '>= 0.8'}
 
-  unrun@0.2.37:
-    resolution: {integrity: sha512-AA7vDuYsgeSYVzJMm16UKA+aXFKhy7nFqW9z5l7q44K4ppFWZAMqYS58ePRZbugMLPH0fwwMzD5A8nP0avxwZQ==}
-    engines: {node: '>=20.19.0'}
-    hasBin: true
-    peerDependencies:
-      synckit: ^0.11.11
-    peerDependenciesMeta:
-      synckit:
-        optional: true
-
   update-browserslist-db@1.2.3:
     resolution: {integrity: sha512-Js0m9cx+qOgDxo0eMiFGEueWztz+d4+M3rGlmKPT+T4IS/jP4ylw3Nwpu6cpTTP8R1MAC1kF4VbdLt3ARf209w==}
     hasBin: true
@@ -5951,10 +5799,10 @@ snapshots:
       '@jridgewell/trace-mapping': 0.3.31
       jsesc: 3.1.0
 
-  '@babel/generator@8.0.0-rc.3':
+  '@babel/generator@8.0.0-rc.5':
     dependencies:
-      '@babel/parser': 8.0.0-rc.3
-      '@babel/types': 8.0.0-rc.3
+      '@babel/parser': 8.0.0-rc.5
+      '@babel/types': 8.0.0-rc.5
       '@jridgewell/gen-mapping': 0.3.13
       '@jridgewell/trace-mapping': 0.3.31
       '@types/jsesc': 2.5.1
@@ -5990,11 +5838,11 @@ snapshots:
 
   '@babel/helper-string-parser@7.27.1': {}
 
-  '@babel/helper-string-parser@8.0.0-rc.3': {}
+  '@babel/helper-string-parser@8.0.0-rc.5': {}
 
   '@babel/helper-validator-identifier@7.28.5': {}
 
-  '@babel/helper-validator-identifier@8.0.0-rc.3': {}
+  '@babel/helper-validator-identifier@8.0.0-rc.5': {}
 
   '@babel/helper-validator-option@7.27.1': {}
 
@@ -6007,9 +5855,13 @@ snapshots:
     dependencies:
       '@babel/types': 7.29.0
 
-  '@babel/parser@8.0.0-rc.3':
+  '@babel/parser@8.0.0-rc.4':
+    dependencies:
+      '@babel/types': 8.0.0-rc.5
+
+  '@babel/parser@8.0.0-rc.5':
     dependencies:
-      '@babel/types': 8.0.0-rc.3
+      '@babel/types': 8.0.0-rc.5
 
   '@babel/plugin-transform-react-jsx-self@7.27.1(@babel/core@7.29.0)':
     dependencies:
@@ -6046,10 +5898,10 @@ snapshots:
       '@babel/helper-string-parser': 7.27.1
       '@babel/helper-validator-identifier': 7.28.5
 
-  '@babel/types@8.0.0-rc.3':
+  '@babel/types@8.0.0-rc.5':
     dependencies:
-      '@babel/helper-string-parser': 8.0.0-rc.3
-      '@babel/helper-validator-identifier': 8.0.0-rc.3
+      '@babel/helper-string-parser': 8.0.0-rc.5
+      '@babel/helper-validator-identifier': 8.0.0-rc.5
 
   '@bcoe/v8-coverage@1.0.2': {}
 
@@ -6113,159 +5965,81 @@ snapshots:
   '@esbuild/aix-ppc64@0.25.12':
     optional: true
 
-  '@esbuild/aix-ppc64@0.28.0':
-    optional: true
-
   '@esbuild/android-arm64@0.25.12':
     optional: true
 
-  '@esbuild/android-arm64@0.28.0':
-    optional: true
-
   '@esbuild/android-arm@0.25.12':
     optional: true
 
-  '@esbuild/android-arm@0.28.0':
-    optional: true
-
   '@esbuild/android-x64@0.25.12':
     optional: true
 
-  '@esbuild/android-x64@0.28.0':
-    optional: true
-
   '@esbuild/darwin-arm64@0.25.12':
     optional: true
 
-  '@esbuild/darwin-arm64@0.28.0':
-    optional: true
-
   '@esbuild/darwin-x64@0.25.12':
     optional: true
 
-  '@esbuild/darwin-x64@0.28.0':
-    optional: true
-
   '@esbuild/freebsd-arm64@0.25.12':
     optional: true
 
-  '@esbuild/freebsd-arm64@0.28.0':
-    optional: true
-
   '@esbuild/freebsd-x64@0.25.12':
     optional: true
 
-  '@esbuild/freebsd-x64@0.28.0':
-    optional: true
-
   '@esbuild/linux-arm64@0.25.12':
     optional: true
 
-  '@esbuild/linux-arm64@0.28.0':
-    optional: true
-
   '@esbuild/linux-arm@0.25.12':
     optional: true
 
-  '@esbuild/linux-arm@0.28.0':
-    optional: true
-
   '@esbuild/linux-ia32@0.25.12':
     optional: true
 
-  '@esbuild/linux-ia32@0.28.0':
-    optional: true
-
   '@esbuild/linux-loong64@0.25.12':
     optional: true
 
-  '@esbuild/linux-loong64@0.28.0':
-    optional: true
-
   '@esbuild/linux-mips64el@0.25.12':
     optional: true
 
-  '@esbuild/linux-mips64el@0.28.0':
-    optional: true
-
   '@esbuild/linux-ppc64@0.25.12':
     optional: true
 
-  '@esbuild/linux-ppc64@0.28.0':
-    optional: true
-
   '@esbuild/linux-riscv64@0.25.12':
     optional: true
 
-  '@esbuild/linux-riscv64@0.28.0':
-    optional: true
-
   '@esbuild/linux-s390x@0.25.12':
     optional: true
 
-  '@esbuild/linux-s390x@0.28.0':
-    optional: true
-
   '@esbuild/linux-x64@0.25.12':
     optional: true
 
-  '@esbuild/linux-x64@0.28.0':
-    optional: true
-
   '@esbuild/netbsd-arm64@0.25.12':
     optional: true
 
-  '@esbuild/netbsd-arm64@0.28.0':
-    optional: true
-
   '@esbuild/netbsd-x64@0.25.12':
     optional: true
 
-  '@esbuild/netbsd-x64@0.28.0':
-    optional: true
-
   '@esbuild/openbsd-arm64@0.25.12':
     optional: true
 
-  '@esbuild/openbsd-arm64@0.28.0':
-    optional: true
-
   '@esbuild/openbsd-x64@0.25.12':
     optional: true
 
-  '@esbuild/openbsd-x64@0.28.0':
-    optional: true
-
   '@esbuild/openharmony-arm64@0.25.12':
     optional: true
 
-  '@esbuild/openharmony-arm64@0.28.0':
-    optional: true
-
   '@esbuild/sunos-x64@0.25.12':
     optional: true
 
-  '@esbuild/sunos-x64@0.28.0':
-    optional: true
-
   '@esbuild/win32-arm64@0.25.12':
     optional: true
 
-  '@esbuild/win32-arm64@0.28.0':
-    optional: true
-
   '@esbuild/win32-ia32@0.25.12':
     optional: true
 
-  '@esbuild/win32-ia32@0.28.0':
-    optional: true
-
   '@esbuild/win32-x64@0.25.12':
     optional: true
 
-  '@esbuild/win32-x64@0.28.0':
-    optional: true
-
   '@exodus/bytes@1.15.0': {}
 
   '@floating-ui/core@1.7.5':
@@ -7034,7 +6808,7 @@ snapshots:
     dependencies:
       fast-deep-equal: 3.1.3
 
-  '@oxc-project/types@0.127.0': {}
+  '@oxc-project/types@0.132.0': {}
 
   '@oxlint/binding-android-arm-eabi@1.66.0':
     optional: true
@@ -7317,58 +7091,58 @@ snapshots:
 
   '@radix-ui/rect@1.1.1': {}
 
-  '@rolldown/binding-android-arm64@1.0.0-rc.17':
+  '@rolldown/binding-android-arm64@1.0.2':
     optional: true
 
-  '@rolldown/binding-darwin-arm64@1.0.0-rc.17':
+  '@rolldown/binding-darwin-arm64@1.0.2':
     optional: true
 
-  '@rolldown/binding-darwin-x64@1.0.0-rc.17':
+  '@rolldown/binding-darwin-x64@1.0.2':
     optional: true
 
-  '@rolldown/binding-freebsd-x64@1.0.0-rc.17':
+  '@rolldown/binding-freebsd-x64@1.0.2':
     optional: true
 
-  '@rolldown/binding-linux-arm-gnueabihf@1.0.0-rc.17':
+  '@rolldown/binding-linux-arm-gnueabihf@1.0.2':
     optional: true
 
-  '@rolldown/binding-linux-arm64-gnu@1.0.0-rc.17':
+  '@rolldown/binding-linux-arm64-gnu@1.0.2':
     optional: true
 
-  '@rolldown/binding-linux-arm64-musl@1.0.0-rc.17':
+  '@rolldown/binding-linux-arm64-musl@1.0.2':
     optional: true
 
-  '@rolldown/binding-linux-ppc64-gnu@1.0.0-rc.17':
+  '@rolldown/binding-linux-ppc64-gnu@1.0.2':
     optional: true
 
-  '@rolldown/binding-linux-s390x-gnu@1.0.0-rc.17':
+  '@rolldown/binding-linux-s390x-gnu@1.0.2':
     optional: true
 
-  '@rolldown/binding-linux-x64-gnu@1.0.0-rc.17':
+  '@rolldown/binding-linux-x64-gnu@1.0.2':
     optional: true
 
-  '@rolldown/binding-linux-x64-musl@1.0.0-rc.17':
+  '@rolldown/binding-linux-x64-musl@1.0.2':
     optional: true
 
-  '@rolldown/binding-openharmony-arm64@1.0.0-rc.17':
+  '@rolldown/binding-openharmony-arm64@1.0.2':
     optional: true
 
-  '@rolldown/binding-wasm32-wasi@1.0.0-rc.17':
+  '@rolldown/binding-wasm32-wasi@1.0.2':
     dependencies:
       '@emnapi/core': 1.10.0
       '@emnapi/runtime': 1.10.0
       '@napi-rs/wasm-runtime': 1.1.4(@emnapi/core@1.10.0)(@emnapi/runtime@1.10.0)
     optional: true
 
-  '@rolldown/binding-win32-arm64-msvc@1.0.0-rc.17':
+  '@rolldown/binding-win32-arm64-msvc@1.0.2':
     optional: true
 
-  '@rolldown/binding-win32-x64-msvc@1.0.0-rc.17':
+  '@rolldown/binding-win32-x64-msvc@1.0.2':
     optional: true
 
   '@rolldown/pluginutils@1.0.0-beta.27': {}
 
-  '@rolldown/pluginutils@1.0.0-rc.17': {}
+  '@rolldown/pluginutils@1.0.1': {}
 
   '@rollup/rollup-android-arm-eabi@4.60.2':
     optional: true
@@ -8097,7 +7871,7 @@ snapshots:
 
   ast-kit@3.0.0-beta.1:
     dependencies:
-      '@babel/parser': 8.0.0-rc.3
+      '@babel/parser': 8.0.0-rc.5
       estree-walker: 3.0.3
       pathe: 2.0.3
 
@@ -8621,7 +8395,7 @@ snapshots:
 
   dom-accessibility-api@0.6.3: {}
 
-  dts-resolver@2.1.3: {}
+  dts-resolver@3.0.0: {}
 
   dunder-proto@1.0.1:
     dependencies:
@@ -8821,35 +8595,6 @@ snapshots:
       '@esbuild/win32-ia32': 0.25.12
       '@esbuild/win32-x64': 0.25.12
 
-  esbuild@0.28.0:
-    optionalDependencies:
-      '@esbuild/aix-ppc64': 0.28.0
-      '@esbuild/android-arm': 0.28.0
-      '@esbuild/android-arm64': 0.28.0
-      '@esbuild/android-x64': 0.28.0
-      '@esbuild/darwin-arm64': 0.28.0
-      '@esbuild/darwin-x64': 0.28.0
-      '@esbuild/freebsd-arm64': 0.28.0
-      '@esbuild/freebsd-x64': 0.28.0
-      '@esbuild/linux-arm': 0.28.0
-      '@esbuild/linux-arm64': 0.28.0
-      '@esbuild/linux-ia32': 0.28.0
-      '@esbuild/linux-loong64': 0.28.0
-      '@esbuild/linux-mips64el': 0.28.0
-      '@esbuild/linux-ppc64': 0.28.0
-      '@esbuild/linux-riscv64': 0.28.0
-      '@esbuild/linux-s390x': 0.28.0
-      '@esbuild/linux-x64': 0.28.0
-      '@esbuild/netbsd-arm64': 0.28.0
-      '@esbuild/netbsd-x64': 0.28.0
-      '@esbuild/openbsd-arm64': 0.28.0
-      '@esbuild/openbsd-x64': 0.28.0
-      '@esbuild/openharmony-arm64': 0.28.0
-      '@esbuild/sunos-x64': 0.28.0
-      '@esbuild/win32-arm64': 0.28.0
-      '@esbuild/win32-ia32': 0.28.0
-      '@esbuild/win32-x64': 0.28.0
-
   escalade@3.2.0: {}
 
   escape-html@1.0.3: {}
@@ -9148,7 +8893,7 @@ snapshots:
       es-errors: 1.3.0
       get-intrinsic: 1.3.0
 
-  get-tsconfig@4.14.0:
+  get-tsconfig@5.0.0-beta.5:
     dependencies:
       resolve-pkg-maps: 1.0.0
 
@@ -9484,7 +9229,7 @@ snapshots:
       parent-module: 1.0.1
       resolve-from: 4.0.0
 
-  import-without-cache@0.3.3: {}
+  import-without-cache@0.4.0: {}
 
   indent-string@4.0.0: {}
 
@@ -11355,44 +11100,42 @@ snapshots:
 
   reusify@1.1.0: {}
 
-  rolldown-plugin-dts@0.23.2(rolldown@1.0.0-rc.17)(typescript@5.9.3):
+  rolldown-plugin-dts@0.25.1(rolldown@1.0.2)(typescript@5.9.3):
     dependencies:
-      '@babel/generator': 8.0.0-rc.3
-      '@babel/helper-validator-identifier': 8.0.0-rc.3
-      '@babel/parser': 8.0.0-rc.3
-      '@babel/types': 8.0.0-rc.3
+      '@babel/generator': 8.0.0-rc.5
+      '@babel/helper-validator-identifier': 8.0.0-rc.5
+      '@babel/parser': 8.0.0-rc.4
       ast-kit: 3.0.0-beta.1
       birpc: 4.0.0
-      dts-resolver: 2.1.3
-      get-tsconfig: 4.14.0
+      dts-resolver: 3.0.0
+      get-tsconfig: 5.0.0-beta.5
       obug: 2.1.1
-      picomatch: 4.0.4
-      rolldown: 1.0.0-rc.17
+      rolldown: 1.0.2
     optionalDependencies:
       typescript: 5.9.3
     transitivePeerDependencies:
       - oxc-resolver
 
-  rolldown@1.0.0-rc.17:
+  rolldown@1.0.2:
     dependencies:
-      '@oxc-project/types': 0.127.0
-      '@rolldown/pluginutils': 1.0.0-rc.17
+      '@oxc-project/types': 0.132.0
+      '@rolldown/pluginutils': 1.0.1
     optionalDependencies:
-      '@rolldown/binding-android-arm64': 1.0.0-rc.17
-      '@rolldown/binding-darwin-arm64': 1.0.0-rc.17
-      '@rolldown/binding-darwin-x64': 1.0.0-rc.17
-      '@rolldown/binding-freebsd-x64': 1.0.0-rc.17
-      '@rolldown/binding-linux-arm-gnueabihf': 1.0.0-rc.17
-      '@rolldown/binding-linux-arm64-gnu': 1.0.0-rc.17
-      '@rolldown/binding-linux-arm64-musl': 1.0.0-rc.17
-      '@rolldown/binding-linux-ppc64-gnu': 1.0.0-rc.17
-      '@rolldown/binding-linux-s390x-gnu': 1.0.0-rc.17
-      '@rolldown/binding-linux-x64-gnu': 1.0.0-rc.17
-      '@rolldown/binding-linux-x64-musl': 1.0.0-rc.17
-      '@rolldown/binding-openharmony-arm64': 1.0.0-rc.17
-      '@rolldown/binding-wasm32-wasi': 1.0.0-rc.17
-      '@rolldown/binding-win32-arm64-msvc': 1.0.0-rc.17
-      '@rolldown/binding-win32-x64-msvc': 1.0.0-rc.17
+      '@rolldown/binding-android-arm64': 1.0.2
+      '@rolldown/binding-darwin-arm64': 1.0.2
+      '@rolldown/binding-darwin-x64': 1.0.2
+      '@rolldown/binding-freebsd-x64': 1.0.2
+      '@rolldown/binding-linux-arm-gnueabihf': 1.0.2
+      '@rolldown/binding-linux-arm64-gnu': 1.0.2
+      '@rolldown/binding-linux-arm64-musl': 1.0.2
+      '@rolldown/binding-linux-ppc64-gnu': 1.0.2
+      '@rolldown/binding-linux-s390x-gnu': 1.0.2
+      '@rolldown/binding-linux-x64-gnu': 1.0.2
+      '@rolldown/binding-linux-x64-musl': 1.0.2
+      '@rolldown/binding-openharmony-arm64': 1.0.2
+      '@rolldown/binding-wasm32-wasi': 1.0.2
+      '@rolldown/binding-win32-arm64-msvc': 1.0.2
+      '@rolldown/binding-win32-x64-msvc': 1.0.2
 
   rollup@4.60.2:
     dependencies:
@@ -11958,6 +11701,8 @@ snapshots:
 
   tinyexec@1.1.1: {}
 
+  tinyexec@1.1.2: {}
+
   tinyglobby@0.2.16:
     dependencies:
       fdir: 6.5.0(picomatch@4.0.4)
@@ -11999,31 +11744,29 @@ snapshots:
 
   ts-interface-checker@0.1.13: {}
 
-  tsdown@0.21.10(typescript@5.9.3):
+  tsdown@0.22.0(typescript@5.9.3):
     dependencies:
       ansis: 4.2.0
       cac: 7.0.0
       defu: 6.1.7
       empathic: 2.0.0
       hookable: 6.1.1
-      import-without-cache: 0.3.3
+      import-without-cache: 0.4.0
       obug: 2.1.1
       picomatch: 4.0.4
-      rolldown: 1.0.0-rc.17
-      rolldown-plugin-dts: 0.23.2(rolldown@1.0.0-rc.17)(typescript@5.9.3)
+      rolldown: 1.0.2
+      rolldown-plugin-dts: 0.25.1(rolldown@1.0.2)(typescript@5.9.3)
       semver: 7.7.4
-      tinyexec: 1.1.1
+      tinyexec: 1.1.2
       tinyglobby: 0.2.16
       tree-kill: 1.2.2
       unconfig-core: 7.5.0
-      unrun: 0.2.37
     optionalDependencies:
       typescript: 5.9.3
     transitivePeerDependencies:
       - '@ts-macro/tsc'
       - '@typescript/native-preview'
       - oxc-resolver
-      - synckit
       - vue-tsc
 
   tslib@1.14.1: {}
@@ -12217,10 +11960,6 @@ snapshots:
 
   unpipe@1.0.0: {}
 
-  unrun@0.2.37:
-    dependencies:
-      rolldown: 1.0.0-rc.17
-
   update-browserslist-db@1.2.3(browserslist@4.28.2):
     dependencies:
       browserslist: 4.28.2
diff --git a/pnpm-workspace.yaml b/pnpm-workspace.yaml
index 273f7a36..df3bfa9b 100644
--- a/pnpm-workspace.yaml
+++ b/pnpm-workspace.yaml
@@ -45,3 +45,12 @@ allowBuilds:
   rolldown: true
   unrs-resolver: true
   esbuild: true
+# Bump `unrun` past 0.2.x so its transitive `rolldown` resolves to the
+# GA series (^1.0.0) instead of the legacy `1.0.0-rc.17` pin in
+# `unrun@0.2.37`. `unrun` lands in the tree only as an optional peer of
+# `tsdown`; we don't import it ourselves. Without this override the
+# lockfile would carry two rolldown copies (GA from our direct dep and
+# rc.17 from unrun) for no functional gain. Drop this override once
+# the `tsdown` release we depend on resolves `unrun` to ^0.3.0 on its own.
+overrides:
+  unrun: ^0.3.0