diff --git a/AGENTS.md b/AGENTS.md
index 554d6e96..c69b9d87 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -63,25 +63,36 @@ cd my-arkor-app && pnpm dev # Studio at http://127.0.
`arkor dev` generates a 32-byte base64url token per launch ([packages/arkor/src/cli/commands/dev.ts](packages/arkor/src/cli/commands/dev.ts)) and:
-1. Passes it to `buildStudioApp({ studioToken })`. The Hono server validates every `/api/*` request via `X-Arkor-Studio-Token` header (or `?studioToken=` query for `EventSource`, which can't set headers). Comparison uses `timingSafeEqual`.
-2. Persists it to `~/.arkor/studio-token` (mode 0600) so the SPA dev workflow (`pnpm --filter @arkor/studio-app dev`) can read it via the `arkor-studio-token` Vite plugin in [packages/studio-app/vite.config.ts](packages/studio-app/vite.config.ts), which injects `` into `index.html` on each request. Persistence failure must NOT block server start (read-only `$HOME` on Docker, etc.) — just warn.
+1. Passes it to `buildStudioApp({ studioToken })`. The Hono server validates every `/api/*` request via `X-Arkor-Studio-Token` header (or `?studioToken=` query for `EventSource`, which can't set headers). Comparison uses `timingSafeEqual`. The query-token allow-list lives in `eventStreamPathPattern` in [packages/arkor/src/studio/server.ts](packages/arkor/src/studio/server.ts), currently `/api/jobs/:id/events` and `/api/dev/events`. **Adding to that regex is CSRF-sensitive: each entry must be a GET stream-only route, never a mutation endpoint.**
+2. Persists it to `~/.arkor/studio-token` (mode 0600) so the SPA dev workflow (`pnpm --filter @arkor/studio-app dev`) can read it via the `arkor-studio-token` Vite plugin in [packages/studio-app/vite.config.ts](packages/studio-app/vite.config.ts), which injects the studio-token `<meta>` tag into `index.html` on each request. Persistence failure must NOT block server start (read-only `$HOME` on Docker, etc.); just warn.
3. Cleans up on `exit`/SIGINT/SIGTERM/SIGHUP via `unlinkSync`.
-`/api/*` middleware also enforces a host-header allow-list (`127.0.0.1`/`localhost`) for DNS-rebinding defence. **CORS is intentionally NOT configured** — the SPA is same-origin so reflecting `*` would let "simple" cross-origin POSTs reach handlers. The token check rejects those; cross-origin tabs cannot read the SPA's ``.
+`/api/*` middleware also enforces a host-header allow-list (`127.0.0.1`/`localhost`) for DNS-rebinding defence. **CORS is intentionally NOT configured**: the SPA is same-origin, so reflecting `*` would let "simple" cross-origin POSTs reach handlers. The token check rejects those; cross-origin tabs cannot read the SPA's studio-token `<meta>` tag.
-The whole point: prevents another browser tab on the same machine from POSTing `/api/train` (which spawns `arkor train` and dynamically imports user TS — RCE-grade).
+The whole point: prevents another browser tab on the same machine from POSTing `/api/train` (which spawns `arkor train` and dynamically imports user TS, an RCE-grade exposure).
-When touching the Studio server or SPA fetch layer, preserve: token via header for `fetch`, query param for `EventSource`, host-header guard, no CORS, timing-safe compare. The Vite plugin is dev-only (`apply: "serve"`) — running it during `vite build` would bake a stale per-launch token into the production `index.html` and shadow the runtime tag, causing every `/api/*` call to 403.
+When touching the Studio server or SPA fetch layer, preserve: token via header for `fetch`, query param for `EventSource`, host-header guard, no CORS, timing-safe compare. The Vite plugin is dev-only (`apply: "serve"`): running it during `vite build` would bake a stale per-launch token into the production `index.html` and shadow the runtime tag, causing every `/api/*` call to 403.
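
The token rules above (header for `fetch`, query param only on stream routes, timing-safe compare) can be illustrated with a minimal sketch. The helper names `tokensMatch` and `extractToken` and the `queryTokenPaths` regex are hypothetical stand-ins, not the actual code in `packages/arkor/src/studio/server.ts`; only `timingSafeEqual` from `node:crypto` is a real API.

```typescript
import { timingSafeEqual } from "node:crypto";

// Hypothetical helper: compares a presented token against the expected one
// without revealing, through timing, where the first mismatching byte is.
function tokensMatch(presented: string | undefined, expected: string): boolean {
  if (presented === undefined) return false;
  const a = Buffer.from(presented, "utf8");
  const b = Buffer.from(expected, "utf8");
  // timingSafeEqual throws when lengths differ, so reject those up front.
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}

// Hypothetical allow-list approximating the two stream routes named above.
const queryTokenPaths = /^\/api\/(?:jobs\/[^/]+\/events|dev\/events)$/;

// Picks the token source: the header always wins; the query form is only
// honoured for GET requests on stream-only routes.
function extractToken(
  method: string,
  path: string,
  header: string | undefined,
  query: string | undefined,
): string | undefined {
  if (header !== undefined) return header;
  if (method === "GET" && queryTokenPaths.test(path)) return query;
  return undefined;
}
```

Keeping the query-token path behind both a method check and a path allow-list is what prevents a cross-origin page from smuggling the token into a mutation endpoint via a URL.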
+
+### HMR + graceful early-stop + callback hot-swap
+
+`arkor dev` keeps a [Rolldown](https://rolldown.rs) watcher over `src/arkor/` ([packages/arkor/src/studio/hmr.ts](packages/arkor/src/studio/hmr.ts)) and pushes rebuild events over `/api/dev/events` (SSE). On each successful build the watcher dynamic-imports the artifact, pulls a `TrainerInspection` snapshot off the discovered trainer (via the cross-realm `Symbol.for("arkor.trainer.inspect")` brand attached in [packages/arkor/src/core/trainerInspection.ts](packages/arkor/src/core/trainerInspection.ts)), and computes a stable `configHash` from the cloud-side `JobConfig`. The SPA re-fetches `/api/manifest` on each event so the Run Training button stays in sync without a browser refresh.
+
+When a rebuild lands while a `/api/train`-spawned subprocess is in flight, the server makes a per-child decision in [packages/arkor/src/studio/trainRegistry.ts](packages/arkor/src/studio/trainRegistry.ts):
+
+- **`configHash` matches the spawn-time hash** → SIGUSR2. The child's `installCallbackReloadHandler` re-imports the artifact and rotates the trainer's callback cell via the internal `Symbol.for("arkor.trainer.replaceCallbacks")` brand exposed by [packages/arkor/src/core/trainerInspection.ts](packages/arkor/src/core/trainerInspection.ts). The cloud-side run is untouched. Use this whenever a code change is contained inside the `callbacks: { ... }` object. Don't add a `replaceCallbacks()` method to the public `Trainer` interface: keeping the mutator behind a `Symbol.for` brand is what stops the dev-only HMR primitive from leaking into the SDK's published surface.
+- **`configHash` differs (or is null because the new bundle didn't inspect)** → SIGTERM. `installShutdownHandlers` drives the trainer's internal early-stop entry point via the `Symbol.for("arkor.trainer.requestEarlyStop")` brand exposed by [packages/arkor/src/core/trainerInspection.ts](packages/arkor/src/core/trainerInspection.ts), which lets the next `checkpoint.saved` event finish (work preserved) before issuing `cancel()` and exiting cleanly. The SPA auto-restarts the run with the rebuilt artifact via the `restart: true` flag on the SSE event. A second SIGTERM bypasses the early-stop and exits 143 immediately, as an emergency escape hatch for a hung cancel.
+
+Don't replace the SIGTERM-and-let-the-child-handle-it pattern with a SIGKILL escalation in the server: that would orphan Cloud-side jobs (no `cancel()` POST goes out) and waste GPU budget. Don't widen the SIGUSR2 path to "always hot-swap, server-side": the `configHash` check is what guarantees a hot-swap can't silently leave a child running with a stale `JobConfig`. Don't surface `requestEarlyStop()` (or `replaceCallbacks()`) as a method on the public `Trainer` interface: both are dev-only HMR primitives, and keeping them behind `Symbol.for` brands is what stops them from leaking into the published SDK shape; user code that wants similar semantics should compose `abortSignal` + `cancel()` per the cookbook.
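
The per-child decision described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `trainRegistry.ts`: `canonicalise`, `hashConfig`, and `chooseSignal` are hypothetical names, and the use of SHA-256 over key-sorted JSON is an assumed way to obtain a stable hash.

```typescript
import { createHash } from "node:crypto";

type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Serialise with sorted object keys so that key order does not affect the hash.
function canonicalise(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalise).join(",")}]`;
  const entries = Object.entries(value)
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([key, inner]) => `${JSON.stringify(key)}:${canonicalise(inner)}`);
  return `{${entries.join(",")}}`;
}

// Hypothetical stand-in for the real config hashing helper.
function hashConfig(config: Json): string {
  return createHash("sha256").update(canonicalise(config)).digest("hex");
}

// Hot-swap only when the rebuilt bundle inspected successfully AND its hash
// equals the one captured at spawn time; otherwise ask for a graceful restart.
function chooseSignal(
  spawnHash: string | null,
  rebuiltHash: string | null,
): "SIGUSR2" | "SIGTERM" {
  return rebuiltHash !== null && rebuiltHash === spawnHash ? "SIGUSR2" : "SIGTERM";
}
```

Treating a `null` rebuilt hash as a mismatch keeps the conservative path as the default: when the server cannot prove the config is unchanged, it restarts rather than hot-swaps.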
### Project entry-point discovery
The CLI/Studio look at `src/arkor/index.ts` in user projects. Discovery in [packages/arkor/src/core/runner.ts](packages/arkor/src/core/runner.ts) accepts (in order): a named `arkor` export from `createArkor({...})`, a bare `trainer` export, a default export holding either an Arkor manifest or a Trainer, or a `default.trainer` nested shape. `createArkor` returns a frozen, opaque manifest tagged with `_kind: "arkor"`; treat it as a value to hand to tooling, not a programmable client.
-`arkor build` ([packages/arkor/src/cli/commands/build.ts](packages/arkor/src/cli/commands/build.ts)) bundles to `.arkor/build/index.mjs` with esbuild; bare specifiers (e.g. `arkor`, anything in `node_modules`) stay external so the artifact resolves the runtime SDK from the project's installed copy.
+`arkor build` ([packages/arkor/src/cli/commands/build.ts](packages/arkor/src/cli/commands/build.ts)) bundles to `.arkor/build/index.mjs` with [Rolldown](https://rolldown.rs); bare specifiers (e.g. `arkor`, anything in `node_modules`) stay external so the artifact resolves the runtime SDK from the project's installed copy. The `transform.target` is derived from `process.versions.node` at build time so the bundle targets the same Node binary that will execute it.
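
The discovery order listed above can be sketched roughly like this. It is not the implementation in `packages/arkor/src/core/runner.ts`: `discoverTrainer`, `isTrainerLike`, and `trainerFromManifest` are hypothetical names, and recognising a trainer by duck-typing on `start`/`wait`/`cancel` is an assumption made for illustration.

```typescript
interface TrainerLike {
  start: () => Promise<unknown>;
  wait: () => Promise<unknown>;
  cancel: () => Promise<unknown>;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Assumption: a trainer is recognised by the presence of three methods.
function isTrainerLike(value: unknown): value is TrainerLike {
  if (!isRecord(value)) return false;
  return ["start", "wait", "cancel"].every((k) => typeof value[k] === "function");
}

// Returns the trainer held by a manifest tagged `_kind: "arkor"`, if any.
function trainerFromManifest(value: unknown): TrainerLike | null {
  if (!isRecord(value) || value._kind !== "arkor") return null;
  const candidate = value.trainer;
  return isTrainerLike(candidate) ? candidate : null;
}

function discoverTrainer(mod: Record<string, unknown>): TrainerLike | null {
  // 1. Named `arkor` export (a manifest).
  const fromNamed = trainerFromManifest(mod.arkor);
  if (fromNamed) return fromNamed;
  // 2. Bare `trainer` export.
  if (isTrainerLike(mod.trainer)) return mod.trainer;
  // 3. Default export holding a manifest or a trainer.
  const def = mod.default;
  const fromDefault = trainerFromManifest(def);
  if (fromDefault) return fromDefault;
  if (isTrainerLike(def)) return def;
  // 4. Nested `default.trainer` shape.
  if (isRecord(def) && isTrainerLike(def.trainer)) return def.trainer;
  return null;
}
```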
### E2E suite specifics
-Both [e2e/cli](e2e/cli) and [e2e/studio](e2e/studio) declare `arkor` (and, for `e2e/cli`, `create-arkor`) as `workspace:*` `devDependencies`, so Turbo's `^build` produces `dist/bin.mjs` exactly once before `#test`/`#test:coverage` runs — no `pretest` hooks, no concurrent rebuilds racing on `dist/`. Standalone runs (`pnpm --filter @arkor/e2e-* test`) need a prior `pnpm build`. Every supported Node (≥22.22.0) is in rolldown's compatible range (^20.19 || >=22.12), so the previous "rolldown-incompatible" CI bypass path was removed.
+Both [e2e/cli](e2e/cli) and [e2e/studio](e2e/studio) declare `arkor` (and, for `e2e/cli`, `create-arkor`) as `workspace:*` `devDependencies`, so Turbo's `^build` produces `dist/bin.mjs` exactly once before `#test`/`#test:coverage` runs (no `pretest` hooks, no concurrent rebuilds racing on `dist/`). Standalone runs (`pnpm --filter @arkor/e2e-* test`) need a prior `pnpm build`. Every supported Node (≥22.22.0) is in rolldown's compatible range (^20.19 || >=22.12), so the previous "rolldown-incompatible" CI bypass path was removed.
Tests rely on `ARKOR_INTERNAL_SCAFFOLD_ARKOR_SPEC=file:.../packages/arkor` so the scaffolded fixtures install the workspace `arkor` instead of the npm-published one. Both this var and `SKIP_E2E_INSTALL` are declared in [turbo.json](turbo.json) so they pass through Turbo's hash.
@@ -96,7 +107,7 @@ When implementing anything (new feature, SDK/CLI/Studio behaviour change, schema
1. **Docs in both languages.** This repo pairs English/Japanese docs: `README.md` ↔ `README.ja.md`, `CONTRIBUTING.md` ↔ `CONTRIBUTING.ja.md`, and `docs/` ↔ `docs/ja/`. If you edit the English side, update the Japanese side in the same PR. Don't leave Japanese docs to be retro-translated later.
2. **Tests.** Add vitest cases under `packages/*/src/**/*.test.ts` for SDK/CLI/scaffold logic changes. For CLI flow changes, consider an `e2e/cli` scenario.
-Don't split these into "docs in a follow-up PR" or "tests later" — land them in the same PR. Skip only when the user explicitly says to.
+Don't split these into "docs in a follow-up PR" or "tests later"; land them in the same PR. Skip only when the user explicitly says to.
## Non-obvious gotchas
diff --git a/docs/concepts/studio.mdx b/docs/concepts/studio.mdx
index fb9835c5..90ae5bdd 100644
--- a/docs/concepts/studio.mdx
+++ b/docs/concepts/studio.mdx
@@ -14,7 +14,12 @@ Four jobs:
3. **Try a finished model.** A Playground page lets you pick the base model or the final adapter from any completed job and chat with it. The Playground does not load intermediate checkpoints; for mid-run inference, use [`onCheckpoint`](/concepts/lifecycle) callbacks in your trainer.
4. **Publish a model behind a `*.arkor.app` URL.** An Endpoints page creates a per-deployment subdomain that serves OpenAI-compatible chat completions for a chosen adapter or base model, plus the API keys that authenticate calls to it. The same actions are available programmatically via [`CloudApiClient`](/sdk/deployments) — Studio is the interactive surface; the SDK is the lower-level one.
-A note on the dev loop: Studio's `/api/manifest` endpoint rebuilds and re-imports your trainer on every request (with a cache-bust query, see `packages/arkor/src/studio/manifest.ts`), but the UI only fetches it when the Run training page mounts. So if you edit `src/arkor/` and stay on the same Run training page, the next click reuses the existing `.arkor/build/index.mjs` and runs your old code. Refresh the page (or run `arkor build` from the terminal) between edits and clicks to pick up the new code reliably.
+A note on the dev loop: Studio runs a [Rolldown](https://rolldown.rs) watcher over `src/arkor/` and pushes rebuild notifications to the SPA over a Server-Sent Events stream (`/api/dev/events`). Edit a file, save, and the Run training button updates with the new trainer name without a refresh. If a training run is in flight, the Studio compares the new bundle's cloud-side `JobConfig` hash to the one captured when the run was spawned:
+
+- **Same hash (only callbacks changed).** The runner is signalled with SIGUSR2; it re-imports the rebuilt artifact and rotates the trainer's callback cell in place via an internal HMR brand. The cloud-side training run is untouched, no GPU time is wasted, and the SPA shows a brief "Callbacks hot-swapped" indicator.
+- **Different hash (model / dataset / hyperparameters changed).** The runner is signalled with SIGTERM; the trainer's internal early-stop entry point lets the next checkpoint upload finish before issuing `cancel()`, then the SPA re-spawns the run with the rebuilt artifact. The previous Cloud-side job reaches `cancelled` after the checkpoint is uploaded, so the partial work is preserved as an artifact.
+
+If you want this "stop after the next checkpoint" behaviour from your own code (rather than from the dev loop), build it on top of the public [`abortSignal` + `cancel()`](/sdk/trainer-control#abortsignal) pair. The [Early stopping recipe](/cookbook/early-stopping) walks through it.
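
For readers curious about the rebuild notifications mentioned above, here is a minimal sketch of how a client might split a Server-Sent Events text stream into frames. `parseSseFrames` is a hypothetical helper and handles only single-line `event:` and `data:` fields; it is not the SPA's actual subscription code.

```typescript
interface SseFrame {
  event: string;
  data: string;
}

// Splits buffered SSE text into complete frames plus the trailing partial
// frame, which the caller should prepend to the next chunk it receives.
function parseSseFrames(buffer: string): { frames: SseFrame[]; rest: string } {
  // Frames are terminated by a blank line.
  const parts = buffer.split("\n\n");
  const rest = parts.pop() ?? "";
  const frames: SseFrame[] = [];
  for (const raw of parts) {
    if (!raw) continue;
    let event = "";
    let data = "";
    for (const line of raw.split("\n")) {
      if (line.startsWith("event: ")) event = line.slice(7);
      else if (line.startsWith("data: ")) data = line.slice(6);
    }
    frames.push({ event, data });
  }
  return { frames, rest };
}
```

In a browser, the built-in `EventSource` does this parsing for you; a hand-rolled parser like this is mainly useful when reading the stream through `fetch()`.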
## Where Studio runs
diff --git a/docs/ja/concepts/studio.mdx b/docs/ja/concepts/studio.mdx
index 12f176e2..224c6eed 100644
--- a/docs/ja/concepts/studio.mdx
+++ b/docs/ja/concepts/studio.mdx
@@ -14,7 +14,12 @@ Studio は `arkor dev` 実行時に立ち上がるローカル Web UI です。
3. **完成モデルを試す。** Playground ページでベースモデルや任意の完了済みジョブの最終アダプターを選んでチャットできます。中間チェックポイントは Playground からはロードしません。学習中の推論には [`onCheckpoint`](/ja/concepts/lifecycle) コールバックをトレーナーで使ってください。
4. **`*.arkor.app` URL でモデルを公開する。** Endpoints ページで OpenAI 互換 chat completions を提供する deployment 専用サブドメインを作成し、その API キーを発行・取り消しできます。同じ操作は [`CloudApiClient`](/ja/sdk/deployments) からプログラマティックにも可能で、Studio が対話的なインターフェイス、SDK が下位レイヤーという位置付けです。
-dev ループのメモ: Studio の `/api/manifest` エンドポイントはリクエストごとにトレーナーをリビルド・再 import しますが(キャッシュバストクエリ付き、`packages/arkor/src/studio/manifest.ts` を参照)、UI が fetch するのは Run training ページがマウントされたときだけです。`src/arkor/` を編集して同じ Run training ページに留まり続けると、次のクリックは既存の `.arkor/build/index.mjs` を再利用して古いコードで走ります。確実に新しいコードを取り込むには、編集とクリックの間にページをリロード(あるいはターミナルから `arkor build`)してください。
+dev ループのメモ: Studio は [Rolldown](https://rolldown.rs) のウォッチャを `src/arkor/` 上で常駐させ、再ビルド通知を Server-Sent Events ストリーム (`/api/dev/events`) で SPA に push します。ファイルを編集して保存すれば、Run training ボタンのトレーナー名表示はリロード無しで更新されます。学習が走っている最中であれば、Studio は再ビルドしたバンドルの Cloud 側 `JobConfig` ハッシュを、spawn 時に保存したハッシュと比較します。
+
+- **ハッシュ一致(コールバックのみ変更)。** ランナーへ SIGUSR2 を送ります。ランナーは再ビルドされた成果物を再 import し、内部 HMR ブランド経由でトレーナーのコールバック cell をその場で差し替えます。Cloud 側の学習はそのまま継続し、GPU 時間を無駄にせず、SPA には "Callbacks hot-swapped" と短く表示されます。
+- **ハッシュ不一致(モデル / データセット / ハイパーパラメータが変わった)。** ランナーへ SIGTERM を送ります。トレーナー内部の early-stop エントリが次のチェックポイントのアップロードを待ってから `cancel()` を発火し、SPA が再ビルドした成果物で再投入します。Cloud 側の以前のジョブはチェックポイントのアップロード完了後に `cancelled` 状態に遷移するので、ここまでの学習成果は artifact として保全されます。
+
+自前のコードから(dev ループではなく)この「次のチェックポイントで止める」挙動が欲しい場合は、公開 API の [`abortSignal` + `cancel()`](/ja/sdk/trainer-control#abortsignal) を組み合わせて書いてください。具体的な手順は [Early Stopping レシピ](/ja/cookbook/early-stopping) にあります。
## Studio が動く場所
diff --git a/docs/ja/studio/jobs.mdx b/docs/ja/studio/jobs.mdx
index 5a1fae60..56683075 100644
--- a/docs/ja/studio/jobs.mdx
+++ b/docs/ja/studio/jobs.mdx
@@ -62,8 +62,8 @@ Jobs ページ(`#/jobs`)はマウント時に 1 度、その後 5 秒ごと
Loss チャートは `training.log` イベントから描画される SVG プロットです。Y 軸は最小値と最大値によるスケーリング、X 軸はステップ番号で、最大 2 系列を表示します:
-- **Training loss** — 実線のティール色。数値 `loss` を含むイベントごとに 1 頂点。
-- **Eval loss** — 破線のピンク色(点マーカー付き)。数値 `evalLoss` を含むイベント(通常は `evalSteps` 刻み)から描画。系列はイベントから直接構築するため、`evalLoss` のみを持ち `loss` を含まない eval-only フレームも線・凡例・統計に反映されます。Eval ポイントが 1 つも来ていない間は凡例にも表示されません。
+- **Training loss**: 実線のティール色。数値 `loss` を含むイベントごとに 1 頂点。
+- **Eval loss**: 破線のピンク色(点マーカー付き)。数値 `evalLoss` を含むイベント(通常は `evalSteps` 刻み)から描画。系列はイベントから直接構築するため、`evalLoss` のみを持ち `loss` を含まない eval-only フレームも線・凡例・統計に反映されます。Eval ポイントが 1 つも来ていない間は凡例にも表示されません。
ホバーすると最寄りステップと、そのステップに含まれる `loss` / `evalLoss` のうち存在する値が表示されます(eval-only ステップでは `loss` 値は出ず、その逆も同様)。チャートは `loss` または `evalLoss` のいずれかが数値であるイベントが 1 件以上届くまで `Waiting for training.log events…`(`training.log` イベント待ち)プレースホルダーを表示します。両方とも null / 省略の `training.log` フレームはカウントされません。
@@ -71,9 +71,9 @@ Loss チャートは `training.log` イベントから描画される SVG プロ
チャートヘッダーの **Advanced** トグルを ON にすると、系列ごとの統計パネルが現れます。各カードに表示される項目:
-- **Mean loss ± 95% CI** — Loss 値の標本平均と 95% 信頼区間の半幅(Student の t 分布。n > 31 では z = 1.96 にフォールバック)。
-- **Std dev**(標準偏差)と **Variance**(分散) — Bessel 補正済みの不偏推定量(`ddof=1`)。
-- **p90** と **p95** — numpy のデフォルトに合わせた線形補間パーセンタイル。
+- **Mean loss ± 95% CI**: Loss 値の標本平均と 95% 信頼区間の半幅(Student の t 分布。n > 31 では z = 1.96 にフォールバック)。
+- **Std dev**(標準偏差)と **Variance**(分散): Bessel 補正済みの不偏推定量(`ddof=1`)。
+- **p90** と **p95**: numpy のデフォルトに合わせた線形補間パーセンタイル。
Eval カードは数値 `evalLoss` を含む `training.log` イベントが届くまでは空のままです。
diff --git a/e2e/studio/src/specs/hmr.spec.ts b/e2e/studio/src/specs/hmr.spec.ts
new file mode 100644
index 00000000..94346b38
--- /dev/null
+++ b/e2e/studio/src/specs/hmr.spec.ts
@@ -0,0 +1,284 @@
+import { writeFileSync } from "node:fs";
+import { join } from "node:path";
+import { expect, test } from "../harness/fixture";
+
+/**
+ * Rewrite the seeded `src/arkor/index.ts` with a new trainer `name`
+ * (and arbitrary content tail to bump mtime + size beyond any
+ * sub-millisecond resolution noise on fast filesystems). We rewrite
+ * the WHOLE file (not append) so rolldown's incremental cache can't
+ * reuse the prior module record and skip the rebuild.
+ *
+ * Two key shape differences from `seedFixture.ts`'s `seedManifest`:
+ *
+ * 1. The trainer carries the `Symbol.for("arkor.trainer.inspect")`
+ * brand so `findInspectableTrainer` (used by `studio/hmr.ts`'s
+ * `inspectBundle`) can read its name + config: without the
+ * brand, every SSE rebuild frame gets `trainerName: null` and
+ * the SSE-level test below can't distinguish the post-edit
+ * rebuild from the cached initial-build replay. The seed
+ * fixture skips the brand because its existing tests only
+ * exercise the `/api/manifest` path (which uses
+ * `findTrainerInModule`, brand-less). Extending it would
+ * couple every test to inspection internals it doesn't care
+ * about.
+ *
+ * 2. The brand returns a real `JobConfig` shape (`model` +
+ * `datasetSource` set), not the seed's empty placeholder, so
+ * `hashJobConfig` produces a stable non-empty `configHash`.
+ * `studio/server.ts`'s `dispatchRebuild` consults that hash to
+ * route between SIGUSR2 hot-swap and SIGTERM restart; the
+ * existing E2E only tests the boot path so it never needs a
+ * real config there.
+ *
+ * `Symbol.for` keys round-trip across the dev process / built
+ * bundle realm boundary because they live in the global symbol
+ * registry, the same mechanism `core/trainerInspection.ts` documents
+ * for the runtime CLI / `.arkor/build/index.mjs` split.
+ */
+function rewriteManifest(projectDir: string, name: string): void {
+ const path = join(projectDir, "src", "arkor", "index.ts");
+ writeFileSync(
+ path,
+ [
+ 'const TRAINER_INSPECT_KEY = Symbol.for("arkor.trainer.inspect");',
+ "const trainer = {",
+ ` name: ${JSON.stringify(name)},`,
+ " start: async () => ({ id: 'e2e-job', url: '' }),",
+ " wait: async () => ({ status: 'completed' as const }),",
+ " cancel: async () => {},",
+ "};",
+ "Object.defineProperty(trainer, TRAINER_INSPECT_KEY, {",
+ " value: () => ({",
+ " name: trainer.name,",
+ " config: {",
+ ' model: "studio-e2e-model",',
+ ' datasetSource: { type: "huggingface" as const, name: "studio-e2e-dataset" },',
+ " },",
+ " callbacks: {},",
+ " }),",
+ " enumerable: false,",
+ "});",
+ 'export const arkor = { _kind: "arkor" as const, trainer };',
+ "export default arkor;",
+ `// rewritten-${name}-${Date.now()}`,
+ "",
+ ].join("\n"),
+ );
+}
+
+interface SseFrame {
+ event: string;
+ data: string;
+}
+
+/**
+ * Open `/api/dev/events`, parse incoming SSE frames, and resolve when
+ * `predicate` first returns true. Cleans up the underlying body
+ * reader on resolve / reject so the Hono server's connection bookkeeping
+ * doesn't leak between tests.
+ *
+ * `arkor dev` requires the studio token via the query param (EventSource
+ * can't set headers); the same allow-list governs `fetch()` here.
+ */
+async function awaitSseFrame(
+ studioUrl: string,
+ token: string,
+ predicate: (frame: SseFrame) => boolean,
+ timeoutMs: number,
+): Promise<SseFrame> {
+ const url = `${studioUrl}/api/dev/events?studioToken=${encodeURIComponent(token)}`;
+ const controller = new AbortController();
+ const timeout = setTimeout(() => controller.abort(), timeoutMs);
+ let res: Response;
+ try {
+ res = await fetch(url, { signal: controller.signal });
+ } catch (err) {
+ clearTimeout(timeout);
+ throw new Error(
+ `SSE connect failed for ${url}: ${(err as Error).message}`,
+ );
+ }
+ if (!res.ok || !res.body) {
+ clearTimeout(timeout);
+ throw new Error(
+ `SSE connect returned ${res.status} ${res.statusText}; body=${
+ res.body ? "present" : "missing"
+ }`,
+ );
+ }
+ const reader = res.body.getReader();
+ const decoder = new TextDecoder();
+ let buf = "";
+ try {
+ while (true) {
+ const { value, done } = await reader.read();
+ if (done) {
+ throw new Error("SSE stream ended before predicate matched");
+ }
+ buf += decoder.decode(value, { stream: true });
+ // Frames are terminated by a blank line (`\n\n`). Split, keep
+ // the trailing partial in `buf` for the next iteration.
+ const parts = buf.split("\n\n");
+ buf = parts.pop() ?? "";
+ for (const raw of parts) {
+ if (!raw) continue;
+ let event = "";
+ let data = "";
+ for (const line of raw.split("\n")) {
+ if (line.startsWith("event: ")) event = line.slice(7);
+ else if (line.startsWith("data: ")) data = line.slice(6);
+ }
+ const frame: SseFrame = { event, data };
+ if (predicate(frame)) return frame;
+ }
+ }
+ } finally {
+ clearTimeout(timeout);
+ // Cancel rather than just release: cancel propagates to the Hono
+ // ReadableStream's `cancel()` handler so the server unsubscribes
+ // this listener from the HMR coordinator promptly. Otherwise the
+ // listener lingers until the next dispose, which can produce
+ // cross-test bleed when running with `--repeat-each`.
+ await reader.cancel().catch(() => {});
+ }
+}
+
+test.describe("Studio HMR", () => {
+ test("/api/dev/events is registered with the hmr-enabled meta tag", async ({
+ page,
+ studio,
+ }) => {
+ // Boot-time wiring: `arkor dev` always wires up the HMR
+ // coordinator, so the served HTML must carry both the
+ // studio-token meta and the hmr-enabled meta. Without the
+ // hmr-enabled tag, `isHmrEnabled()` returns false in the SPA
+ // and the auto-restart / hot-swap paths silently no-op.
+ await page.goto(studio.url);
+ const hmrMeta = page.locator('meta[name="arkor-hmr-enabled"]');
+ await expect(hmrMeta).toHaveCount(1);
+ await expect(hmrMeta).toHaveAttribute("content", "true");
+
+ // Endpoint sanity-check: a GET without the studio token must 403
+ // (regression for the CSRF allow-list: `eventStreamPathPattern`
+ // permits the query-token form, but a raw GET stays gated).
+ const noToken = await fetch(`${studio.url}/api/dev/events`);
+ expect(noToken.status).toBe(403);
+ });
+
+ test("editing src/arkor/index.ts broadcasts a rebuild SSE frame with the new trainer name", async ({
+ studio,
+ fixturePaths,
+ }) => {
+ // Edit BEFORE subscribing, then let the predicate filter out
+ // pre-edit replays. The watcher may already have a cached
+ // initial-build `ready` (with the seed name) by the time we
+ // connect; subscribing first then editing would force a
+ // drain step. Going edit → subscribe is simpler: the
+ // predicate explicitly requires `trainerName === newName`,
+ // which only the post-edit BUNDLE_END can satisfy; any
+ // cached or in-flight frame for the seed name fails the
+ // predicate and `awaitSseFrame` keeps reading until the
+ // matching one arrives.
+ const newName = "studio-e2e-trainer-edited";
+ rewriteManifest(fixturePaths.projectDir, newName);
+
+ const frame = await awaitSseFrame(
+ studio.url,
+ studio.token,
+ (f) => {
+ if (f.event !== "rebuild" && f.event !== "ready") return false;
+ // Some replays have empty data; skip those.
+ if (!f.data) return false;
+ try {
+ const parsed = JSON.parse(f.data) as {
+ trainerName?: string | null;
+ };
+ return parsed.trainerName === newName;
+ } catch {
+ return false;
+ }
+ },
+ // Generous: rolldown's first cold build on a fresh project
+ // can take 1–2s on a slow CI runner; the post-edit rebuild is
+ // typically faster (incremental) but we don't want to flake on
+ // a noisy host.
+ 20_000,
+ );
+
+ expect(frame.event === "rebuild" || frame.event === "ready").toBe(true);
+ const parsed = JSON.parse(frame.data) as {
+ outFile?: string;
+ trainerName?: string | null;
+ configHash?: string | null;
+ };
+ expect(parsed.trainerName).toBe(newName);
+ // The artefact path is also part of the contract: HMR consumers
+ // (including the runner subprocess on SIGUSR2) re-import the
+ // bundle by `outFile`. A regression that drops it would silently
+ // disable hot-swap.
+ expect(parsed.outFile).toMatch(/\.arkor[\\/]build[\\/]index\.mjs$/);
+ });
+
+ test("/api/manifest reflects the edited trainer name after a save", async ({
+ studio,
+ fixturePaths,
+ }) => {
+ // End-to-end through the Hono `/api/manifest` route, which
+ // dynamic-imports the freshly-built artefact via
+ // `summariseBuiltManifest`. The HMR rebuild must have completed
+ // *and* the cache-bust URL must reflect the new bytes for this
+ // assertion to pass, which exercises the full rebuild → write
+ // artefact → re-import → return summary chain.
+ const newName = `studio-e2e-trainer-renamed-${Date.now()}`;
+ rewriteManifest(fixturePaths.projectDir, newName);
+
+ await expect
+ .poll(
+ async () => {
+ const res = await fetch(`${studio.url}/api/manifest`, {
+ headers: { "X-Arkor-Studio-Token": studio.token },
+ });
+ if (!res.ok) return null;
+ const body = (await res.json()) as {
+ trainer?: { name?: string } | null;
+ };
+ return body.trainer?.name ?? null;
+ },
+ {
+ // Same 20s budget as the SSE test for the same reason: the
+ // first rebuild after spawn can be slow on cold CI. Keep
+ // the poll interval modest so we don't hammer the dev
+ // loop's `runBuild` faster than it can settle.
+ timeout: 20_000,
+ intervals: [200, 400, 800, 1500],
+ },
+ )
+ .toBe(newName);
+ });
+
+ test("the SPA Run Training caption updates without a page reload after a save", async ({
+ page,
+ studio,
+ fixturePaths,
+ }) => {
+ // End-to-end browser proof: the SPA's RunTraining component
+ // subscribes to `/api/dev/events`, calls `fetchManifest()` on
+ // each rebuild, and re-renders the trainer caption. Reloading
+ // the page would mask any regression in that subscription path,
+ // so we explicitly DO NOT navigate again after the edit.
+ await page.goto(studio.url);
+ await expect(page.getByText(/studio-e2e-trainer/).first()).toBeVisible();
+
+ const newName = `studio-e2e-trainer-live-${Date.now()}`;
+ rewriteManifest(fixturePaths.projectDir, newName);
+
+ // The new name should appear without a navigation. Match by
+ // substring rather than exact text so the surrounding "Trainer
+ // from src/arkor/index.ts" caption decoration doesn't
+ // need to be replicated here.
+ await expect(page.getByText(newName).first()).toBeVisible({
+ timeout: 20_000,
+ });
+ });
+});
diff --git a/packages/arkor/package.json b/packages/arkor/package.json
index 692d2f3c..11088c91 100644
--- a/packages/arkor/package.json
+++ b/packages/arkor/package.json
@@ -55,17 +55,17 @@
"@clack/prompts": "^0.8.0",
"@hono/node-server": "^1.14.0",
"commander": "^13.0.0",
- "esbuild": "^0.28.0",
"hono": "^4.7.0",
"open": "^10.0.0",
"posthog-node": "^5.30.6",
+ "rolldown": "^1.0.0",
"zod": "^4.3.6"
},
"devDependencies": {
"@arkor/cli-internal": "workspace:*",
"@types/node": "^24",
"@vitest/coverage-v8": "^4.1.5",
- "tsdown": "^0.21.9",
+ "tsdown": "^0.22.0",
"typescript": "^5",
"vitest": "^4.1.5"
},
diff --git a/packages/arkor/src/cli/cleanupHooks.test.ts b/packages/arkor/src/cli/cleanupHooks.test.ts
new file mode 100644
index 00000000..864feb9f
--- /dev/null
+++ b/packages/arkor/src/cli/cleanupHooks.test.ts
@@ -0,0 +1,256 @@
+import { afterEach, describe, expect, it, vi } from "vitest";
+import {
+ __resetCleanupHooksForTests,
+ registerCleanupHook,
+} from "./cleanupHooks";
+
+// Each test that emits a signal also installs new listeners on
+// `process` for the lifetime of this worker. Auto-detach inside the
+// handlers covers the fire-then-cleanup case; `__resetCleanupHooksForTests`
+// covers tests whose registration never fires (still need their
+// listeners off the worker before the next test runs).
+
+let exitSpy: ReturnType<typeof vi.spyOn> | null = null;
+let stdoutSpy: ReturnType<typeof vi.spyOn> | null = null;
+
+afterEach(() => {
+ exitSpy?.mockRestore();
+ stdoutSpy?.mockRestore();
+ exitSpy = null;
+ stdoutSpy = null;
+ __resetCleanupHooksForTests();
+});
+
+function mockExit(): number[] {
+ const codes: number[] = [];
+ exitSpy = vi
+ .spyOn(process, "exit")
+ .mockImplementation(((code?: number) => {
+ codes.push(code ?? 0);
+ return undefined as never;
+ }) as typeof process.exit);
+ return codes;
+}
+
+function flushMicrotasks(): Promise<void> {
+ return new Promise((resolve) => setImmediate(resolve));
+}
+
+describe("registerCleanupHook", () => {
+ it("waits for an async sibling cleanup to settle before exitOnSignal fires", async () => {
+ // Regression: previously the signal handler called
+ // `process.exit(0)` immediately after kicking off cleanup, so a
+ // sibling registration's async dispose (`hmr.dispose()`) got cut
+ // off mid-promise. The fix coordinates via a module-level
+ // in-flight set so the exit-owning hook awaits every other
+ // registered cleanup before terminating.
+ const order: string[] = [];
+ let resolveSlowDispose!: () => void;
+ const slowDispose = new Promise<void>((resolve) => {
+ resolveSlowDispose = resolve;
+ });
+
+ registerCleanupHook({
+ cleanup: () =>
+ slowDispose.then(() => {
+ order.push("async-cleanup-finished");
+ }),
+ });
+ registerCleanupHook({
+ cleanup: () => {
+ order.push("sync-cleanup");
+ },
+ exitOnSignal: true,
+ });
+
+ const codes = mockExit();
+ process.emit("SIGINT", "SIGINT");
+
+ // Sync cleanup body has already fired; async one is still pending,
+ // and exit must NOT have been called yet.
+ expect(order).toEqual(["sync-cleanup"]);
+ expect(codes).toEqual([]);
+
+ // Resolve the slow dispose; shortly afterwards the coordinator
+ // calls process.exit with the SIGINT exit code (130).
+ resolveSlowDispose();
+ await flushMicrotasks();
+ await flushMicrotasks();
+
+ expect(order).toEqual(["sync-cleanup", "async-cleanup-finished"]);
+ // SIGINT exits 130 (POSIX 128 + signo for SIGINT=2) so parent
+ // shells / orchestrators can distinguish "user interrupted"
+ // from "ran to completion (0)"; see SIGNAL_EXIT_CODE in
+ // cleanupHooks.ts.
+ expect(codes).toEqual([130]);
+ });
+
+ it("waits for sibling async cleanups even when the exit-owning hook is registered FIRST", async () => {
+ // Regression: even with the in-flight set in place, the
+ // exit-owning hook's signal handler used to take its
+ // `[...inFlightCleanups]` snapshot synchronously inside the
+ // listener body. Node's EventEmitter dispatches signal listeners
+ // in registration order, so when the exit-owning hook is wired
+ // up *first*, its handler takes the snapshot before any sibling
+ // hook (registered later) gets a chance to run its handler and
+ // add its own in-flight promise. Result: `Promise.allSettled`
+ // resolved on the snapshot of just-this-hook's promise → exit
+ // fired → siblings' async cleanup got cut off mid-flight.
+ //
+ // The order in the existing "waits for an async sibling
+ // cleanup" test happens to dodge this bug by registering the
+ // async hook first, so its handler runs first and seeds
+ // inFlightCleanups before the exit-owner takes its snapshot.
+ // This test inverts the order to actually exercise the
+ // queueMicrotask-deferred snapshot fix.
+ const order: string[] = [];
+ let resolveSlow!: () => void;
+ const slow = new Promise<void>((resolve) => {
+ resolveSlow = resolve;
+ });
+
+ // Register exit-owner FIRST.
+ registerCleanupHook({
+ cleanup: () => {
+ order.push("sync-cleanup");
+ },
+ exitOnSignal: true,
+ });
+ // Sibling async cleanup registered AFTER. With the old code,
+ // its promise wouldn't make it into the exit-owner's snapshot.
+ registerCleanupHook({
+ cleanup: () =>
+ slow.then(() => {
+ order.push("async-cleanup-finished");
+ }),
+ });
+
+ const codes = mockExit();
+ process.emit("SIGINT", "SIGINT");
+
+ // Sync ran inline; async pending; exit must NOT have fired.
+ expect(order).toEqual(["sync-cleanup"]);
+ expect(codes).toEqual([]);
+
+ resolveSlow();
+ await flushMicrotasks();
+ await flushMicrotasks();
+
+ expect(order).toEqual(["sync-cleanup", "async-cleanup-finished"]);
+ // SIGINT exits 130 (POSIX 128 + signo for SIGINT=2) so parent
+ // shells / orchestrators can distinguish "user interrupted"
+ // from "ran to completion (0)"; see SIGNAL_EXIT_CODE in
+ // cleanupHooks.ts.
+ expect(codes).toEqual([130]);
+ });
+
+ it("exits with the POSIX 128+signo code for each terminating signal (130/143/129)", async () => {
+ // Regression: the exit-owning hook used to always
+ // `process.exit(0)`, regardless of which signal fired the
+ // shutdown. Parent shells / orchestrators / CI runners that
+ // gate on signal-style nonzero status would mis-classify a
+ // Ctrl-C (SIGINT) as a clean run: `arkor dev || cleanup`
+ // would skip the cleanup branch and leave whatever it owned
+ // unreaped. POSIX convention is 128 + signo (SIGINT=2 → 130,
+ // SIGTERM=15 → 143, SIGHUP=1 → 129); SIGNAL_EXIT_CODE in
+ // cleanupHooks.ts pins the mapping.
+ const cases: Array<["SIGINT" | "SIGTERM" | "SIGHUP", number]> = [
+ ["SIGINT", 130],
+ ["SIGTERM", 143],
+ ["SIGHUP", 129],
+ ];
+ for (const [sig, expected] of cases) {
+ registerCleanupHook({ cleanup: () => {}, exitOnSignal: true });
+ const codes = mockExit();
+ process.emit(sig, sig);
+ // queueMicrotask + Promise.allSettled chain: two flushes
+ // mirror the existing tests.
+ await flushMicrotasks();
+ await flushMicrotasks();
+ expect(codes, `signal ${sig}`).toEqual([expected]);
+ // Reset before the next iteration registers its hook so
+ // leftover listeners from this signal can't fire again and
+ // clobber the next signal's expected exit code.
+ __resetCleanupHooksForTests();
+ exitSpy?.mockRestore();
+ exitSpy = null;
+ }
+ });
+
+ it("auto-detaches its process listeners after firing so they don't accumulate", () => {
+ // Regression: previously each `registerCleanupHook` call left
+ // `process.on('exit', ...)` and per-signal listeners armed
+ // forever. A long-lived Node worker that re-arms hooks (vitest
+ // running many tests, or any future caller that re-registers on
+ // each iteration) tripped Node's
+ // `MaxListenersExceededWarning`. Fix: each handler synchronously
+ // detaches its registration after invoking `run()`.
+ const exitBefore = process.listeners("exit").length;
+ const sigintBefore = process.listeners("SIGINT").length;
+ const sigtermBefore = process.listeners("SIGTERM").length;
+ const sighupBefore = process.listeners("SIGHUP").length;
+
+ registerCleanupHook({
+ cleanup: () => {},
+ exitOnSignal: false,
+ });
+
+ expect(process.listeners("exit").length).toBe(exitBefore + 1);
+ expect(process.listeners("SIGINT").length).toBe(sigintBefore + 1);
+ expect(process.listeners("SIGTERM").length).toBe(sigtermBefore + 1);
+ expect(process.listeners("SIGHUP").length).toBe(sighupBefore + 1);
+
+ // Firing one signal must detach BOTH that registration's signal
+ // listener AND its sibling exit listener: the registration is
+ // done after first fire regardless of which channel triggered it.
+ process.emit("SIGINT", "SIGINT");
+
+ expect(process.listeners("exit").length).toBe(exitBefore);
+ expect(process.listeners("SIGINT").length).toBe(sigintBefore);
+ expect(process.listeners("SIGTERM").length).toBe(sigtermBefore);
+ expect(process.listeners("SIGHUP").length).toBe(sighupBefore);
+ });
+
+ it("__resetCleanupHooksForTests detaches every still-armed registration", () => {
+ // Test-only escape hatch for registrations whose handler never
+ // fires inside the test (no signal emitted); without it, those
+ // listeners would persist across the vitest worker's test queue.
+ const exitBefore = process.listeners("exit").length;
+ registerCleanupHook({ cleanup: () => {}, exitOnSignal: false });
+ registerCleanupHook({ cleanup: () => {}, exitOnSignal: true });
+ expect(process.listeners("exit").length).toBe(exitBefore + 2);
+
+ __resetCleanupHooksForTests();
+
+ expect(process.listeners("exit").length).toBe(exitBefore);
+ });
+
+ it("is idempotent against repeated signals (done latch + bounded exit)", async () => {
+ let invocations = 0;
+ registerCleanupHook({
+ cleanup: () => {
+ invocations += 1;
+ },
+ exitOnSignal: true,
+ });
+
+ const codes = mockExit();
+ process.emit("SIGINT", "SIGINT");
+ process.emit("SIGINT", "SIGINT");
+ process.emit("SIGINT", "SIGINT");
+ await flushMicrotasks();
+ await flushMicrotasks();
+
+ // Cleanup body runs once even if the signal fires multiple times
+ // (auto-detach removes the listener after first fire; the `done`
+ // latch is the secondary defence in case detach is racy).
+ expect(invocations).toBe(1);
+ // First SIGINT fires the handler → exit(130); follow-ups hit no
+ // listener after auto-detach, so codes has exactly one entry.
+ // SIGINT exits 130 (POSIX 128 + signo for SIGINT=2) so parent
+ // shells / orchestrators can distinguish "user interrupted"
+ // from "ran to completion (0)"; see SIGNAL_EXIT_CODE in
+ // cleanupHooks.ts.
+ expect(codes).toEqual([130]);
+ });
+});
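The tests above pin the exit codes 130, 143, and 129. As a minimal sketch of where those numbers come from, the POSIX `128 + signo` convention can be derived from Node's `os.constants.signals`. The helper name `signalExitCode` and the `TerminatingSignal` type are hypothetical, introduced only for illustration; the actual mapping is `SIGNAL_EXIT_CODE` in `core/signalExit.ts`, whose contents are not shown in this diff and may well be a hard-coded table instead.

```typescript
import { constants } from "node:os";

// Hypothetical type and helper for illustration only; not the
// SIGNAL_EXIT_CODE map from core/signalExit.ts.
type TerminatingSignal = "SIGINT" | "SIGTERM" | "SIGHUP";

function signalExitCode(sig: TerminatingSignal): number {
  // POSIX shells report "terminated by signal N" as exit status 128 + N.
  return 128 + constants.signals[sig];
}

// Collect the codes for the three signals the cleanup coordinator handles.
const exitCodes = {
  SIGINT: signalExitCode("SIGINT"),
  SIGTERM: signalExitCode("SIGTERM"),
  SIGHUP: signalExitCode("SIGHUP"),
};
console.log(exitCodes);
```

Deriving the values at runtime keeps the sketch short; a literal table has the advantage of being greppable and independent of the host's signal numbering.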
diff --git a/packages/arkor/src/cli/cleanupHooks.ts b/packages/arkor/src/cli/cleanupHooks.ts
new file mode 100644
index 00000000..24d0eb60
--- /dev/null
+++ b/packages/arkor/src/cli/cleanupHooks.ts
@@ -0,0 +1,196 @@
+import { SIGNAL_EXIT_CODE } from "../core/signalExit";
+
+// POSIX `128 + signo` exit codes live in `core/signalExit.ts` so the
+// runner's two-stage shutdown handler and this coordinator share a
+// single source of truth. Without the shared map, adding (say)
+// SIGQUIT to one side without the other would produce inconsistent
+// exit statuses for the same signal: the exact parent-shell-
+// classification regression the per-signal code was introduced to
+// prevent.
+
+const TERMINATING_SIGNALS = ["SIGINT", "SIGTERM", "SIGHUP"] as const;
+
+export interface CleanupHookOptions {
+ /**
+ * Idempotent cleanup body. Wrapped with a `done` guard so a noisy
+ * shutdown (signal arriving while `process.exit` is already running
+ * an `exit` listener) doesn't trigger a double-cleanup. May be sync
+ * or return a Promise; async cleanups are awaited (across **all
+ * registered hooks**) before `exitOnSignal` fires the final
+ * `process.exit`.
+ */
+ cleanup: () => void | Promise<void>;
+ /**
+ * Whether the signal-handler arm of this registration should call
+ * `process.exit` once every in-flight cleanup (this hook + any
+ * siblings registered in the same process) has settled. Use `true`
+ * for the outermost cleanup responsible for terminating the
+ * process; `false` for inner cleanups that should let a sibling
+ * own the exit. Default: `false`.
+ *
+ * The exit code is the POSIX `128 + signo` for the signal that
+ * triggered shutdown: 130 for SIGINT, 143 for SIGTERM, 129 for
+ * SIGHUP (see `SIGNAL_EXIT_CODE`). Parent shells / orchestrators /
+ * CI runners distinguish "user interrupted" (nonzero) from "ran
+ * to completion" (zero) on this: exiting 0 for a Ctrl-C'd
+ * `arkor dev` would let `arkor dev || cleanup_on_failure` skip
+ * its cleanup branch.
+ */
+ exitOnSignal?: boolean;
+}
+
+/**
+ * Module-scoped tracker of cleanup promises that haven't settled yet.
+ * The exit-owning hook waits on the union of (its own cleanup) +
+ * (every other in-flight cleanup) before calling `process.exit(...)`,
+ * so a fire-and-forget async cleanup in a sibling registration
+ * (`hmr.dispose()` is the canonical example) isn't cut off by an
+ * eager exit. (Exit code is signal-specific; see `SIGNAL_EXIT_CODE`.)
+ *
+ * Auto-prunes via the `.finally(() => inFlightCleanups.delete(...))`
+ * each `run()` attaches, so the set doesn't grow without bound across
+ * multiple `runDev()` invocations in the same process (tests).
+ */
+const inFlightCleanups = new Set<Promise<void>>();
+
+/**
+ * Detachers for every still-armed registration. The signal/exit
+ * handlers each call their own detacher synchronously after invoking
+ * `run()` so a long-lived worker that calls `registerCleanupHook`
+ * many times (vitest reusing the same Node worker across tests, or a
+ * future caller that re-arms hooks dynamically) doesn't pile up
+ * `process.on(...)` listeners and trip Node's
+ * `MaxListenersExceededWarning`. Test code can also call
+ * `__resetCleanupHooksForTests()` to detach every still-armed
+ * registration up-front for explicit isolation.
+ */
+const attachedHandlers = new Set<() => void>();
+
+/**
+ * Register a cleanup hook that fires on `process.exit` and on
+ * SIGINT / SIGTERM / SIGHUP. Used by `runDev` to dispose long-lived
+ * resources (the studio-token file, the HMR coordinator) without each
+ * call site re-implementing the same idempotent-guard + per-signal
+ * registration boilerplate.
+ *
+ * Per-registration signal listeners (rather than a singleton): each
+ * `runDev()` invocation gets its own listener wired to its own
+ * `done` latch. Listeners auto-detach as soon as their handler fires
+ * (the `done` latch makes any later invocation a no-op anyway), so
+ * a process that goes through many register → fire cycles doesn't
+ * accumulate stale listeners on `process`.
+ *
+ * `process.on("exit", ...)` listeners cannot be async: Node fires
+ * them right before the process terminates and discards any returned
+ * promise. We still register so sync cleanups (e.g. `unlinkSync`) run
+ * on a normal `process.exit(0)` path that never reached a signal
+ * handler. Async tails on this path are best-effort. The signal-
+ * handler path *does* await async tails before exiting.
+ */
+export function registerCleanupHook(options: CleanupHookOptions): void {
+ let done = false;
+ const run = (): Promise<void> => {
+ if (done) return Promise.resolve();
+ done = true;
+ let promise: Promise<void>;
+ try {
+ const result = options.cleanup();
+ // Wrap so callers can await uniformly even when cleanup was
+ // synchronous. Catch is attached so a thrown async cleanup
+ // doesn't leave an unhandled rejection on the floor.
+ promise = Promise.resolve(result).catch(() => {
+ // best-effort: shutdown is racing other cleanup paths
+ });
+ } catch {
+ promise = Promise.resolve();
+ }
+ inFlightCleanups.add(promise);
+ void promise.finally(() => inFlightCleanups.delete(promise));
+ return promise;
+ };
+
+ const exitHandler = () => {
+ void run();
+ detach();
+ };
+ const signalHandlers = new Map<(typeof TERMINATING_SIGNALS)[number], () => void>();
+ for (const sig of TERMINATING_SIGNALS) {
+ signalHandlers.set(sig, () => {
+ // Sync cleanup body fires inside this `run()` call before the
+ // returned promise resolves; that preserves "side effect is
+ // observable right after the handler returns" for sync
+ // cleanups like `unlinkSync` (and the existing tests that
+ // assert on it).
+ run();
+ detach();
+ if (!options.exitOnSignal) return;
+ // Capture which signal triggered shutdown so the exit code
+ // below reflects "interrupted by SIG" (POSIX 128 + signo)
+ // rather than "ran to completion" (0). Parent shells /
+ // orchestrators / CI runners distinguish these: a script
+ // that runs `arkor dev || cleanup_on_failure` would otherwise
+ // mis-classify a Ctrl-C as success and skip its cleanup.
+ const exitCode = SIGNAL_EXIT_CODE[sig];
+ // Snapshot `inFlightCleanups` AFTER every other signal listener
+ // for this signal has run. Node's EventEmitter dispatches
+ // listeners synchronously in registration order, so if the
+ // exit-owning hook happens to be registered *first*, taking the
+ // snapshot here in the listener body would miss promises that
+ // sibling hooks are about to add when their listeners run a
+ // few sync steps later. `queueMicrotask` defers past the end of
+ // the current sync turn (where `process.emit` finishes
+ // dispatching all listeners), so the snapshot includes every
+ // sibling's freshly-registered promise. Without this, an
+ // `arkor dev` whose `scheduleStudioTokenCleanup` (exitOnSignal:
+ // true) was registered before `scheduleHmrCleanup` (async
+ // dispose) would `process.exit(...)` mid-`hmr.dispose()` and
+ // leak the rolldown watcher.
+ //
+ // Settled promises pass through `Promise.allSettled` in a
+ // single microtask, so a process whose hooks are all
+ // synchronous still exits effectively immediately (one extra
+ // microtask round-trip).
+ queueMicrotask(() => {
+ void Promise.allSettled(inFlightCleanups).then(() =>
+ process.exit(exitCode),
+ );
+ });
+ });
+ }
+
+ let detached = false;
+ const detach = () => {
+ if (detached) return;
+ detached = true;
+ process.off("exit", exitHandler);
+ for (const sig of TERMINATING_SIGNALS) {
+ const handler = signalHandlers.get(sig);
+ if (handler) process.off(sig, handler);
+ }
+ attachedHandlers.delete(detach);
+ };
+ attachedHandlers.add(detach);
+
+ process.on("exit", exitHandler);
+ for (const sig of TERMINATING_SIGNALS) {
+ const handler = signalHandlers.get(sig);
+ if (handler) process.on(sig, handler);
+ }
+}
+
+/**
+ * Detach every still-armed registration. Test-only escape hatch: a
+ * vitest worker reuses the same Node process across many tests, and
+ * each `registerCleanupHook` call leaves listeners attached until
+ * something fires them. Call this from `afterEach` to keep the
+ * worker's `process` listener counts flat.
+ */
+export function __resetCleanupHooksForTests(): void {
+ // `detach()` mutates `attachedHandlers` by removing the current entry.
+ // `Set` iterators safely handle that case (a deleted current item is
+ // not re-visited and remaining items keep their order), so we can
+ // iterate directly without snapshotting via `[...attachedHandlers]`.
+ for (const detach of attachedHandlers) detach();
+ attachedHandlers.clear();
+ inFlightCleanups.clear();
+}
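The `queueMicrotask`-deferred snapshot described in `registerCleanupHook` can be illustrated in isolation. The following is a simplified sketch, not the coordinator itself: it uses a plain `EventEmitter` with a made-up `"shutdown"` event in place of `process` signals, logs `"exit"` instead of calling `process.exit`, and the function name `demoDeferredSnapshot` is hypothetical. It assumes only that listeners are dispatched synchronously in registration order, as the comments in the diff state.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical demo: the "exit owner" listener is registered FIRST, so an
// inline snapshot of the in-flight set would miss the sibling's promise.
async function demoDeferredSnapshot(): Promise<string[]> {
  const inFlight = new Set<Promise<void>>();
  const emitter = new EventEmitter();
  const log: string[] = [];
  let finish!: () => void;
  const finished = new Promise<void>((resolve) => {
    finish = resolve;
  });

  // Exit owner: defers its snapshot past the end of the synchronous dispatch.
  emitter.on("shutdown", () => {
    log.push(`eager snapshot size: ${inFlight.size}`);
    queueMicrotask(() => {
      log.push(`deferred snapshot size: ${inFlight.size}`);
      void Promise.allSettled([...inFlight]).then(() => {
        log.push("exit"); // stand-in for process.exit(exitCode)
        finish();
      });
    });
  });

  // Sibling with an async cleanup, registered AFTER the exit owner.
  emitter.on("shutdown", () => {
    const cleanup = new Promise<void>((resolve) => setTimeout(resolve, 5)).then(
      () => {
        log.push("async cleanup finished");
      },
    );
    inFlight.add(cleanup);
  });

  emitter.emit("shutdown");
  await finished;
  return log;
}

demoDeferredSnapshot().then((log) => console.log(log.join("\n")));
```

In this sketch the eager read sees an empty set while the deferred read sees the sibling's promise, which is why `"exit"` is logged only after the async cleanup has finished.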
diff --git a/packages/arkor/src/cli/commands/build.ts b/packages/arkor/src/cli/commands/build.ts
index ebfc3675..c4609039 100644
--- a/packages/arkor/src/cli/commands/build.ts
+++ b/packages/arkor/src/cli/commands/build.ts
@@ -1,7 +1,12 @@
import { existsSync } from "node:fs";
import { mkdir } from "node:fs/promises";
-import { isAbsolute, relative, resolve } from "node:path";
-import { build as esbuild } from "esbuild";
+import { relative } from "node:path";
+import { rolldown } from "rolldown";
+import {
+ BUILD_DEFAULTS,
+ resolveBuildEntry,
+ rolldownInputOptions,
+} from "../../core/rolldownConfig";
import { ui } from "../prompts";
export interface BuildOptions {
@@ -22,42 +27,30 @@ export interface BuildResult {
outFile: string;
}
-const DEFAULT_ENTRY = "src/arkor/index.ts";
-const DEFAULT_OUT_DIR = ".arkor/build";
-
/**
* Bundle the user's `src/arkor/index.ts` into a single ESM artifact at
* `.arkor/build/index.mjs`.
*
- * Bare specifiers (`arkor`, anything from `node_modules`) are kept external so
- * the artifact resolves the runtime SDK from the project's installed copy.
- * Relative imports are bundled inline.
+ * Bare specifiers (`arkor`, anything from `node_modules`) are kept external
+ * so the artifact resolves the runtime SDK from the project's installed
+ * copy. Relative imports are bundled inline. The transform target is
+ * derived from the running Node binary (see `resolveNodeTarget`).
*/
export async function runBuild(opts: BuildOptions = {}): Promise<BuildResult> {
- const cwd = opts.cwd ?? process.cwd();
- const entryRel = opts.entry ?? DEFAULT_ENTRY;
- const entry = isAbsolute(entryRel) ? entryRel : resolve(cwd, entryRel);
+ const { cwd, entry, outDir, outFile } = resolveBuildEntry(opts);
if (!existsSync(entry)) {
throw new Error(
- `Build entry not found: ${entry}. Create ${DEFAULT_ENTRY} or pass an explicit entry argument.`,
+ `Build entry not found: ${entry}. Create ${BUILD_DEFAULTS.entry} or pass an explicit entry argument.`,
);
}
-
- const outDirRel = opts.outDir ?? DEFAULT_OUT_DIR;
- const outDir = isAbsolute(outDirRel) ? outDirRel : resolve(cwd, outDirRel);
await mkdir(outDir, { recursive: true });
- const outFile = resolve(outDir, "index.mjs");
- await esbuild({
- entryPoints: [entry],
- bundle: true,
- platform: "node",
- format: "esm",
- target: "node22.22",
- outfile: outFile,
- packages: "external",
- logLevel: "error",
- });
+ const bundle = await rolldown(rolldownInputOptions({ cwd, entry }));
+ try {
+ await bundle.write({ file: outFile, format: "esm" });
+ } finally {
+ await bundle.close();
+ }
if (!opts.quiet) {
ui.log.success(
diff --git a/packages/arkor/src/cli/commands/dev.test.ts b/packages/arkor/src/cli/commands/dev.test.ts
index 1104489c..c2bc1653 100644
--- a/packages/arkor/src/cli/commands/dev.test.ts
+++ b/packages/arkor/src/cli/commands/dev.test.ts
@@ -4,6 +4,7 @@ import {
mkdtempSync,
readFileSync,
rmSync,
+ writeFileSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
@@ -31,8 +32,33 @@ import {
writeCredentials,
type AnonymousCredentials,
} from "../../core/credentials";
+import { __resetCleanupHooksForTests } from "../cleanupHooks";
import { ensureCredentialsForStudio, runDev } from "./dev";
+/**
+ * Yield one `setImmediate` tick: enough for the cleanupHooks
+ * coordinator's `Promise.allSettled(...).then(() => process.exit(exitCode))`
+ * chain to drain when there are no async cleanups in flight (the
+ * common case in this file: signal handler → queueMicrotask →
+ * already-resolved `allSettled` → `.then` → `process.exit(exitCode)`,
+ * which all collapses into the single macrotask boundary that
+ * `setImmediate` yields to).
+ *
+ * `setImmediate` is the right primitive (vs `Promise.resolve` /
+ * `queueMicrotask`) because we need the event loop to actually
+ * turn: the `process.exit` mock fires inside a `.then` callback
+ * scheduled from a previous microtask checkpoint, and a microtask-
+ * only flush would resume *before* that callback gets to run.
+ *
+ * Tests that drive a chain with extra microtask hops (e.g. async
+ * sibling cleanups whose promises also pass through
+ * `Promise.allSettled`) await this helper twice in a row; see
+ * the cleanupHooks tests.
+ */
+function flushMicrotasks(): Promise<void> {
+ return new Promise((resolve) => setImmediate(resolve));
+}
+
let fakeHome: string;
const ORIG_HOME = process.env.HOME;
// `os.homedir()` reads USERPROFILE on Windows; HOME-only redirection leaves
@@ -83,7 +109,7 @@ describe("ensureCredentialsForStudio", () => {
});
// When OAuth is advertised by the deployment, `arkor dev` no longer
- // hands off to `runLogin` — that would block the Studio launch on a
+ // hands off to `runLogin`; that would block the Studio launch on a
// browser flow. Instead we bootstrap anon and show a hint pointing at
// `arkor login`, leaving the upgrade in the user's hands.
it("bootstraps anonymous credentials even when OAuth is configured", async () => {
@@ -158,7 +184,7 @@ describe("ensureCredentialsForStudio", () => {
});
});
- // Regression for ENG-403 — when the cloud-api is unreachable, `arkor dev`
+ // Regression for ENG-403: when the cloud-api is unreachable, `arkor dev`
// previously failed to start because the anonymous bootstrap's network
// error wasn't caught.
it("does not throw when the anonymous bootstrap fails after a successful config fetch", async () => {
@@ -240,7 +266,7 @@ describe("ensureCredentialsForStudio", () => {
// must surface at startup instead of being silently warned.
it("re-throws when ARKOR_CLOUD_API_URL is malformed (config error)", async () => {
process.env.ARKOR_CLOUD_API_URL = "";
- // No fetch mock — let real fetch raise the URL parse error so we
+ // No fetch mock: let real fetch raise the URL parse error so we
// exercise the actual undici contract, not a synthetic TypeError.
await expect(ensureCredentialsForStudio()).rejects.toThrow(TypeError);
await expect(ensureCredentialsForStudio()).rejects.not.toThrow(
@@ -281,7 +307,7 @@ describe("ensureCredentialsForStudio", () => {
);
});
- // Codex P1 review on PR #65 — OAuth-only deployments advertise Auth0 in
+ // Codex P1 review on PR #65: OAuth-only deployments advertise Auth0 in
// /v1/auth/cli/config but reject /v1/auth/anonymous. The new "always try
// anon first" flow used to leave first-run users on those deployments
// with a bare "Failed to acquire anonymous token (4xx)" error and no way
@@ -320,7 +346,7 @@ describe("ensureCredentialsForStudio", () => {
expect(await readCredentials()).toBeNull();
});
- // Codex P2 review on PR #65 — the OAuth-only wrap used to span the whole
+ // Codex P2 review on PR #65: the OAuth-only wrap used to span the whole
// anon bootstrap, so fs errors from `writeCredentials` were also rewritten
// as "deployment may require sign-in", hiding the actionable fs cause.
//
@@ -330,8 +356,8 @@ describe("ensureCredentialsForStudio", () => {
// `writeFile` would raise EACCES under the bootstrap) only works on
// POSIX as a non-root user: root bypasses chmod (Codex on PR #65), and
// on Windows POSIX permission bits don't durably block writes inside a
- // directory at all — Node maps `chmod` to the legacy read-only
- // attribute, which NTFS only enforces on files. Both edges silently
+ // directory at all (Node maps `chmod` to the legacy read-only
+ // attribute, which NTFS only enforces on files). Both edges silently
// turned the test green for the wrong reason. Mocking lifts the
// "produce an EACCES" half of the test out of the host filesystem
// entirely so every CI matrix entry exercises the wrap-narrowing
@@ -395,7 +421,7 @@ describe("ensureCredentialsForStudio", () => {
);
}
if (url.endsWith("/v1/auth/anonymous")) {
- // Missing `personalOrg` — anonymousTokenResponseSchema rejects.
+ // Missing `personalOrg`: anonymousTokenResponseSchema rejects.
return new Response(
JSON.stringify({ token: "t", anonymousId: "a", kind: "cli" }),
{ status: 200 },
@@ -413,7 +439,7 @@ describe("ensureCredentialsForStudio", () => {
it("forwards a non-Error throwable from requestAnonymousToken (String() coercion)", async () => {
// Defensive coverage of the `err instanceof Error ? err.message : String(err)`
// helper inside the warn branch isn't exercised here because the
- // helper is in the dev.ts catch — but the symmetrical path inside
+ // helper is in the dev.ts catch; but the symmetrical path inside
// the schema-error case rethrows with the original value preserved.
globalThis.fetch = vi.fn(async (input) => {
const url = String(input);
@@ -449,7 +475,7 @@ describe("ensureCredentialsForStudio", () => {
);
}
if (url.endsWith("/v1/auth/anonymous")) {
- // Missing `personalOrg` — anonymousTokenResponseSchema rejects.
+ // Missing `personalOrg`: anonymousTokenResponseSchema rejects.
return new Response(
JSON.stringify({ token: "t", anonymousId: "a", kind: "cli" }),
{ status: 200 },
@@ -545,15 +571,6 @@ describe("ensureCredentialsForStudio", () => {
});
describe("runDev", () => {
- // Track exit/signal listeners we add via scheduleStudioTokenCleanup so
- // we can remove them between tests; otherwise vitest's worker would
- // accumulate listeners and Node's MaxListenersExceededWarning would
- // fire by the third test.
- const ORIG_EXIT_LISTENERS = process.listeners("exit").length;
- const ORIG_SIGINT_LISTENERS = process.listeners("SIGINT").length;
- const ORIG_SIGTERM_LISTENERS = process.listeners("SIGTERM").length;
- const ORIG_SIGHUP_LISTENERS = process.listeners("SIGHUP").length;
-
beforeEach(async () => {
vi.mocked(serve).mockClear();
vi.mocked(open).mockClear();
@@ -570,18 +587,11 @@ describe("runDev", () => {
});
afterEach(() => {
- // Trim the exit/signal listeners runDev installed each iteration to
- // keep vitest's worker tidy across tests.
- const trim = (ev: string, keep: number) => {
- const all = process.listeners(ev as never);
- for (let i = keep; i < all.length; i++) {
- process.removeListener(ev as never, all[i] as never);
- }
- };
- trim("exit", ORIG_EXIT_LISTENERS);
- trim("SIGINT", ORIG_SIGINT_LISTENERS);
- trim("SIGTERM", ORIG_SIGTERM_LISTENERS);
- trim("SIGHUP", ORIG_SIGHUP_LISTENERS);
+ // Each `runDev()` arms exit/signal hooks via `registerCleanupHook`.
+ // Tests whose handler never fires would leak listeners across the
+ // vitest worker's queue; this detaches every still-armed
+ // registration so Node's MaxListenersExceededWarning doesn't trip.
+ __resetCleanupHooksForTests();
});
it("persists the studio token and starts the server on the requested port", async () => {
@@ -654,7 +664,7 @@ describe("runDev", () => {
// ~/.arkor read-only after writeCredentials (so readCredentials still
// works) so the per-launch token write hits EACCES.
if (typeof process.getuid === "function" && process.getuid() === 0) {
- // Root bypasses chmod permission checks — skip on root containers.
+ // Root bypasses chmod permission checks; skip on root containers.
return;
}
chmodSync(join(fakeHome, ".arkor"), 0o555);
@@ -697,8 +707,163 @@ describe("runDev", () => {
const sigintListeners = process.listeners("SIGINT");
const handler = sigintListeners[sigintListeners.length - 1] as () => void;
handler();
+ // Sync side effect (token unlink) lands inside the synchronous
+ // portion of the handler.
expect(existsSync(studioTokenPath())).toBe(false);
- expect(exitSpy).toHaveBeenCalledWith(0);
+ // Exit fires after `Promise.allSettled(inFlightCleanups)` resolves,
+ // a few microtasks later. Flush to let the queued exit run.
+ await flushMicrotasks();
+ // SIGINT exits 130 (POSIX 128 + signo for SIGINT=2): see
+ // SIGNAL_EXIT_CODE in cleanupHooks.ts. Parent shells need
+ // the nonzero code to distinguish interrupt from clean exit.
+ expect(exitSpy).toHaveBeenCalledWith(130);
+ } finally {
+ exitSpy.mockRestore();
+ }
+ });
+
+ it("keeps the SIGINT exit handler armed even when persisting the studio token fails", async () => {
+ // Regression: if `persistStudioToken` threw, the previous code
+ // skipped `scheduleStudioTokenCleanup`, and that was the *only*
+ // hook that called `process.exit(0)` on SIGINT. The leftover HMR
+ // hook overrides Node's default "exit on SIGINT" behaviour, so the
+ // dev server would idle in the foreground forever. The fix
+ // registers the token cleanup unconditionally; here we make
+ // persist throw and verify SIGINT still terminates.
+ if (typeof process.getuid === "function" && process.getuid() === 0) {
+ // Root bypasses chmod permission checks; skip on root containers.
+ return;
+ }
+ chmodSync(join(fakeHome, ".arkor"), 0o555);
+ const stdoutSpy = vi
+ .spyOn(process.stdout, "write")
+ .mockImplementation((() => true) as typeof process.stdout.write);
+ try {
+ await runDev({ port: 4206 });
+ } finally {
+ stdoutSpy.mockRestore();
+ chmodSync(join(fakeHome, ".arkor"), 0o755);
+ }
+
+ const exitSpy = vi
+ .spyOn(process, "exit")
+ .mockImplementation(((_code?: number) => {
+ return undefined as never;
+ }) as typeof process.exit);
+ try {
+ const sigintListeners = process.listeners("SIGINT");
+ const handler = sigintListeners[sigintListeners.length - 1] as () => void;
+ handler();
+ // Even though the token file was never written, the cleanup hook
+ // ran (best-effort `unlinkSync` swallows ENOENT) and the
+ // exit-on-signal arm fired (after async cleanup tails settle).
+ await flushMicrotasks();
+ // SIGINT exits 130 (POSIX 128 + signo for SIGINT=2): see
+ // SIGNAL_EXIT_CODE in cleanupHooks.ts. Parent shells need
+ // the nonzero code to distinguish interrupt from clean exit.
+ expect(exitSpy).toHaveBeenCalledWith(130);
+ } finally {
+ exitSpy.mockRestore();
+ }
+ });
+
+ it("does NOT unlink a pre-existing token file when this process failed to persist its own token (concurrent arkor dev safety)", async () => {
+ // Regression: a failed-persist `arkor dev` used to unconditionally
+ // `unlinkSync(studioTokenPath())` on shutdown. If a concurrent
+ // `arkor dev` (different port, same user) had already persisted a
+ // valid token to the shared path, this run's cleanup would wipe
+ // it out from under them, breaking that session's Vite SPA dev
+ // workflow with mystery 403s on /api/*. The fix gates the unlink
+ // on `tokenPersisted` so a failed-persist run is a no-op at
+ // shutdown.
+ if (typeof process.getuid === "function" && process.getuid() === 0) {
+ // Root bypasses chmod permission checks: skip on root containers.
+ return;
+ }
+ // Pre-place a "concurrent" token (the other dev session's). Body
+ // content lets us assert byte-equality after cleanup, not just
+ // file existence, to rule out an unlink+recreate cycle.
+ const path = studioTokenPath();
+ writeFileSync(path, "concurrent-token-value", { mode: 0o600 });
+ // Make the FILE unwritable so persistStudioToken's `writeFile`
+ // throws EACCES, but leave the *directory* writable so unlinkSync
+ // (which requires dir-write, not file-write perms) would happily
+ // delete the file if the cleanup hook weren't gated.
+ chmodSync(path, 0o444);
+
+ const stdoutSpy = vi
+ .spyOn(process.stdout, "write")
+ .mockImplementation((() => true) as typeof process.stdout.write);
+ try {
+ await expect(runDev({ port: 4207 })).resolves.toBeUndefined();
+ } finally {
+ stdoutSpy.mockRestore();
+ }
+
+ const exitSpy = vi
+ .spyOn(process, "exit")
+ .mockImplementation(((_code?: number) => {
+ return undefined as never;
+ }) as typeof process.exit);
+ try {
+ // Restore read perms so we can `readFileSync` to verify content.
+ chmodSync(path, 0o644);
+ const sigintListeners = process.listeners("SIGINT");
+ const handler = sigintListeners[sigintListeners.length - 1] as () => void;
+ handler();
+ await flushMicrotasks();
+ // The pre-existing token is still on disk AND unchanged: this
+ // failed-persist run did not wipe it.
+ expect(existsSync(path)).toBe(true);
+ expect(readFileSync(path, "utf8")).toBe("concurrent-token-value");
+ } finally {
+ exitSpy.mockRestore();
+ }
+ });
+
+ it("does NOT unlink the studio-token when a concurrent arkor dev has overwritten it after our successful persist (token-identity check)", async () => {
+ // Regression: even when this process SUCCESSFULLY persisted the
+ // token, the cleanup hook used to `unlinkSync` unconditionally on
+ // shutdown. If a second `arkor dev` launched in the same `$HOME`
+ // overwrote `~/.arkor/studio-token` with ITS token AFTER our
+ // persist, our cleanup would still wipe the file: the second
+ // session's Vite SPA dev workflow would then see mystery 403s on
+ // /api/* because the meta tag the SPA reads no longer matches
+ // any in-memory token. The fix re-reads the file at exit time
+ // and only unlinks when the bytes match what we wrote.
+ const stdoutSpy = vi
+ .spyOn(process.stdout, "write")
+ .mockImplementation((() => true) as typeof process.stdout.write);
+ try {
+ await runDev({ port: 4208 });
+ } finally {
+ stdoutSpy.mockRestore();
+ }
+ // Sanity: our persist landed.
+ const path = studioTokenPath();
+ expect(existsSync(path)).toBe(true);
+ const ourToken = readFileSync(path, "utf8").trim();
+ expect(ourToken).toMatch(/^[A-Za-z0-9_-]+$/);
+ // Simulate the concurrent overwrite: a second `arkor dev` wrote
+ // its own token to the same shared path while we were running.
+ const concurrentToken = "concurrent-dev-token-XYZ";
+ writeFileSync(path, concurrentToken, { mode: 0o600 });
+
+ const exitSpy = vi
+ .spyOn(process, "exit")
+ .mockImplementation(((_code?: number) => {
+ return undefined as never;
+ }) as typeof process.exit);
+ try {
+ const sigintListeners = process.listeners("SIGINT");
+ const handler = sigintListeners[sigintListeners.length - 1] as () => void;
+ handler();
+ await flushMicrotasks();
+ // Under the bug the file would be gone. With the fix the
+ // concurrent token is still in place AND unchanged so the
+ // sibling `arkor dev` keeps working.
+ expect(existsSync(path)).toBe(true);
+ expect(readFileSync(path, "utf8")).toBe(concurrentToken);
} finally {
exitSpy.mockRestore();
}
diff --git a/packages/arkor/src/cli/commands/dev.ts b/packages/arkor/src/cli/commands/dev.ts
index e2bf01cf..ac24eb4b 100644
--- a/packages/arkor/src/cli/commands/dev.ts
+++ b/packages/arkor/src/cli/commands/dev.ts
@@ -1,5 +1,5 @@
-import { randomBytes } from "node:crypto";
-import { unlinkSync } from "node:fs";
+import { randomBytes, timingSafeEqual } from "node:crypto";
+import { readFileSync, unlinkSync } from "node:fs";
import { chmod, mkdir, writeFile } from "node:fs/promises";
import { dirname } from "node:path";
import { serve } from "@hono/node-server";
@@ -16,7 +16,9 @@ import {
type AnonymousCredentials,
} from "../../core/credentials";
import { buildStudioApp } from "../../studio/server";
+import { createHmrCoordinator } from "../../studio/hmr";
import { ANON_PERSISTENCE_NUDGE } from "../anonymous";
+import { registerCleanupHook } from "../cleanupHooks";
import { ui } from "../prompts";
export interface DevOptions {
@@ -116,7 +118,7 @@ export async function ensureCredentialsForStudio(): Promise {
// wrap fires only for genuine deployment rejection (401/403/404 et
// al). 5xx is a transient cloud-api failure where retrying makes
// sense, ZodErrors signal a malformed response (server bug), and fs
- // failures are out of scope for the anon endpoint entirely — none of
+ // failures are out of scope for the anon endpoint entirely; none of
// these should be mislabelled as a sign-in requirement.
if (
err instanceof AnonymousTokenRejectedError &&
@@ -124,7 +126,7 @@ export async function ensureCredentialsForStudio(): Promise {
err.status < 500 &&
oauthAvailable
) {
- // Surface only the status code at the top level — the inner
+ // Surface only the status code at the top level: the inner
// `err.message` already starts with "Failed to acquire…" and
// includes the response-body snippet, which would double-prefix the
// wrap and risk leaking noisy HTML/JSON error pages. The full
@@ -170,24 +172,81 @@ async function persistStudioToken(token: string): Promise {
return path;
}
-function scheduleStudioTokenCleanup(path: string): void {
- let cleaned = false;
- const cleanup = () => {
- if (cleaned) return;
- cleaned = true;
- try {
- unlinkSync(path);
- } catch {
- // best-effort
- }
- };
- process.on("exit", cleanup);
- for (const sig of ["SIGINT", "SIGTERM", "SIGHUP"] as const) {
- process.on(sig, () => {
- cleanup();
- process.exit(0);
- });
- }
+/**
+ * Constant-time string comparison for the token-identity check below.
+ * The "is this my token?" gate is not strictly a security-sensitive
+ * comparison (both sides are owned by the user on the local FS), but
+ * the SDK already uses `timingSafeEqual` for every other studio-token
+ * comparison (`buildStudioApp`), and keeping the same primitive here
+ * costs nothing while making the policy "tokens are always compared
+ * constant-time" uniform across the codebase.
+ */
+function tokensEqual(a: string, b: string): boolean {
+ const aBuf = Buffer.from(a);
+ const bBuf = Buffer.from(b);
+ if (aBuf.length !== bBuf.length) return false;
+ return timingSafeEqual(aBuf, bBuf);
+}
+
+function scheduleStudioTokenCleanup(
+ path: string,
+ // Read at cleanup time so a `persistStudioToken` call that's still
+ // in flight when the user hits Ctrl-C (or one that resolved
+ // successfully *after* this scheduler ran) has its outcome
+ // respected. A plain boolean parameter would be captured at hook
+ // registration time, well before persist resolves.
+ shouldUnlink: () => boolean,
+ // Token THIS process wrote. Compared against the file's current
+ // contents at unlink time so we never delete a token a concurrent
+ // `arkor dev` overwrote in the shared path. See cleanup body for
+ // the full rationale.
+ expectedToken: string,
+): void {
+ registerCleanupHook({
+ cleanup: () => {
+ // Skip the unlink entirely if THIS process never persisted the
+ // file. Without this gate, a failed-persist `arkor dev` would
+ // happily `unlinkSync` on shutdown, and if a concurrent
+ // `arkor dev` process (different port, same user) had persisted
+ // a valid token to the same shared path, our cleanup would
+ // wipe it out from under them, breaking that session's Vite
+ // SPA dev workflow with mystery 403s on /api/*.
+ if (!shouldUnlink()) return;
+ // Token-identity check: even when this process DID persist a
+ // token, another `arkor dev` launched in the same `$HOME` may
+ // have overwritten the shared `~/.arkor/studio-token` path
+ // BEFORE our shutdown. Unlinking unconditionally would then
+ // delete THEIR valid token, breaking their Vite SPA dev
+ // workflow. Re-read at exit time and only unlink when the
+ // bytes still match what we wrote so the cleanup is a no-op
+ // for foreign tokens. Read failure (ENOENT etc.) means the
+ // file is already gone, which is fine; the unlink would have
+ // been a no-op anyway.
+ let current: string;
+ try {
+ current = readFileSync(path, "utf8").trim();
+ } catch {
+ return;
+ }
+ if (!tokensEqual(current, expectedToken)) return;
+ try {
+ unlinkSync(path);
+ } catch {
+ // best-effort
+ }
+ },
+ // Outermost cleanup: responsible for terminating the process after
+ // all earlier-registered hooks (e.g. HMR dispose) have run.
+ exitOnSignal: true,
+ });
+}
+
+function scheduleHmrCleanup(hmr: { dispose: () => Promise<void> }): void {
+ // Registered before the studio-token cleanup so it runs first on
+ // shutdown: Node fires signal handlers in registration order, and we
+ // want the watcher to release file handles before the outermost
+ // process.exit.
+ registerCleanupHook({ cleanup: () => hmr.dispose() });
}
export async function runDev(options: DevOptions = {}): Promise<void> {
@@ -199,16 +258,59 @@ export async function runDev(options: DevOptions = {}): Promise {
// hitting `arkor start` (and therefore RCE via dynamic import).
const studioToken = randomBytes(32).toString("base64url");
+ // HMR coordinator: a long-lived rolldown watcher over the user's
+ // `src/arkor` graph. The coordinator itself is lazy (`subscribe()`
+ // is what starts the watcher, not `createHmrCoordinator`), but
+ // `buildStudioApp` registers its per-rebuild signal-dispatch
+ // subscriber unconditionally: that subscriber needs to run on
+ // every BUNDLE_END regardless of whether any SSE client is
+ // connected, so it can SIGUSR2/SIGTERM active `/api/train`
+ // children and keep `lastSuccessConfigHash` warm for spawn-time
+ // capture. Net effect: the watcher starts at server boot. An
+ // `arkor dev` launched in an unbuilt project doesn't fail immediately
+ // because `startWatcher` falls through to a poll loop that waits
+ // for the entry file to appear (see `hmr.ts:entryWaitTimer`).
+ //
+ // Registered before the studio-token cleanup so the latter remains
+ // the most-recently-attached signal listener (existing tests rely
+ // on this ordering to find the token-removal handler).
+ const hmr = createHmrCoordinator({ cwd: process.cwd() });
+ scheduleHmrCleanup(hmr);
+
+ // Register the studio-token cleanup *unconditionally* up-front. The hook
+ // is the only one that calls `process.exit(0)` on SIGINT/SIGTERM/SIGHUP
+ // (the HMR hook above only disposes), and `registerCleanupHook` overrides
+ // Node's default "exit on signal" behaviour for any signal it listens
+ // on. If we were to gate registration behind a successful
+ // `persistStudioToken` and the persist threw, Ctrl-C would run the HMR
+ // dispose and then leave the server idle in the foreground: no exit
+ // ever fires.
+ //
+ // The cleanup body itself, however, gates `unlinkSync` on TWO checks:
+ // - `tokenPersisted` (set only after `persistStudioToken` resolves)
+ // so a failed-persist run never touches the shared file.
+ // - token-identity match (re-read the file at exit time, compare
+ // against the bytes WE wrote) so a successful-persist run that
+ // was later overwritten by a concurrent `arkor dev` in the same
+ // `$HOME` still leaves THAT instance's token in place. Without
+ // this second check, the later instance would see mystery 403s
+ // on /api/* because we'd have wiped its valid token.
+ // All three protections together: hook is always registered (so
+ // exits behave), and only deletes a file we wrote AND still own.
+ const tokenPath = studioTokenPath();
+ let tokenPersisted = false;
+ scheduleStudioTokenCleanup(tokenPath, () => tokenPersisted, studioToken);
+
// Persisting the token to disk is *only* needed for the Vite SPA dev
// workflow. The bundled `:port` flow injects the meta tag at request time
// via `buildStudioApp`, so a failure here (read-only $HOME on Docker /
// locked-down CI / restrictive umask) must not block the server.
try {
- const tokenPath = await persistStudioToken(studioToken);
- scheduleStudioTokenCleanup(tokenPath);
+ await persistStudioToken(studioToken);
+ tokenPersisted = true;
} catch (err) {
ui.log.warn(
- `Could not write ${studioTokenPath()} (${
+ `Could not write ${tokenPath} (${
err instanceof Error ? err.message : String(err)
}). The Studio at http://localhost:${port} is unaffected, but the Vite SPA dev workflow will see 403s on /api/*.`,
);
@@ -217,9 +319,9 @@ export async function runDev(options: DevOptions = {}): Promise {
// `autoAnonymous: true` (the default) lets the Hono server retry the
// anonymous bootstrap on first `/api/credentials` hit if the up-front
// attempt above failed (e.g. cloud-api was unreachable at launch).
- const app = buildStudioApp({ studioToken });
+ const app = buildStudioApp({ studioToken, hmr });
// Bind to 127.0.0.1 (not "localhost") so the listener can't end up on `::1`
- // only — `@hono/node-server` passes hostname to `net.Server.listen`, which
+ // only; `@hono/node-server` passes hostname to `net.Server.listen`, which
// calls `dns.lookup`. On hosts where `/etc/hosts` orders `::1 localhost`
// before `127.0.0.1 localhost`, a "localhost" bind would refuse IPv4
// connections, breaking the studio-app Vite proxy (hardcoded to
@@ -229,6 +331,13 @@ export async function runDev(options: DevOptions = {}): Promise {
const url = `http://localhost:${port}`;
serve({ fetch: app.fetch, port, hostname: "127.0.0.1" });
process.stdout.write(`Arkor Studio running on ${url}\n`);
+ // "ready (will watch …)" rather than "enabled (watching …)" because
+ // `createHmrCoordinator` is lazy: the rolldown watcher doesn't
+ // actually start until the first `subscribe()` call inside
+ // `buildStudioApp`, and on a fresh scaffold with no
+ // `src/arkor/index.ts` yet the watcher falls into the
+ // entry-wait poll loop rather than actively watching.
+ process.stdout.write(`HMR ready (will watch src/arkor)\n`);
if (options.open) {
try {
await open(url);
diff --git a/packages/arkor/src/cli/commands/start.test.ts b/packages/arkor/src/cli/commands/start.test.ts
index 8209818b..a08d70f4 100644
--- a/packages/arkor/src/cli/commands/start.test.ts
+++ b/packages/arkor/src/cli/commands/start.test.ts
@@ -78,7 +78,7 @@ describe("runStart", () => {
it("skips the build step when the artifact already exists and no entry override is given", async () => {
// Branch coverage for `Boolean(opts.entry) || !existsSync(outFile)` —
// the path where both halves are false. Pre-build the artifact, then
- // confirm runStart imports it without triggering esbuild again.
+ // confirm runStart imports it without triggering rolldown again.
mkdirSync(join(cwd, "src/arkor"), { recursive: true });
writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
// First call builds normally.
diff --git a/packages/arkor/src/core/configHash.test.ts b/packages/arkor/src/core/configHash.test.ts
new file mode 100644
index 00000000..ec681124
--- /dev/null
+++ b/packages/arkor/src/core/configHash.test.ts
@@ -0,0 +1,213 @@
+import { describe, it, expect } from "vitest";
+import { hashJobConfig } from "./configHash";
+import type { JobConfig } from "./types";
+
+describe("hashJobConfig", () => {
+ it("returns the same hash for key-order-equivalent configs", () => {
+ const a: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ maxSteps: 10,
+ learningRate: 1e-4,
+ };
+ const b: JobConfig = {
+ learningRate: 1e-4,
+ maxSteps: 10,
+ datasetSource: { name: "x", type: "huggingface" },
+ model: "m",
+ } as JobConfig;
+ expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+ });
+
+ it("returns different hashes for materially different configs", () => {
+ const base: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ };
+ expect(hashJobConfig(base)).not.toBe(
+ hashJobConfig({ ...base, model: "m2" }),
+ );
+ expect(hashJobConfig(base)).not.toBe(
+ hashJobConfig({
+ ...base,
+ datasetSource: { type: "huggingface", name: "y" },
+ }),
+ );
+ });
+
+ it("is order-stable for nested arrays (dataset format / split)", () => {
+ const a: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ datasetFormat: ["a", "b", "c"],
+ };
+ const b: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ datasetFormat: ["a", "b", "c"],
+ };
+ expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+ });
+
+ it("treats `undefined` object properties identically to omitted ones (JSON parity)", () => {
+ // Regression: the previous `stableStringify` delegated to
+ // `JSON.stringify(undefined)` which returns `undefined` (not a
+ // string), concatenated via template literal that became the
+ // substring `"undefined"` in the hash input. So `{ a: 1 }` and
+ // `{ a: 1, b: undefined }` produced different hashes even though
+ // they're indistinguishable on the wire (`JSON.stringify` drops
+ // `undefined` properties).
+ const omitted: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ };
+ const explicitlyUndefined: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ // `unknown`-typed forwarder fields can legitimately end up
+ // holding `undefined` if a caller spreads from a partial source.
+ warmupSteps: undefined,
+ datasetFormat: undefined,
+ };
+ expect(hashJobConfig(omitted)).toBe(hashJobConfig(explicitlyUndefined));
+ });
+
+ it("normalises `undefined` array slots to null (JSON parity)", () => {
+ // `JSON.stringify([undefined])` → `"[null]"`. The previous
+ // implementation produced the literal substring `"[undefined]"`
+ // instead, which is not even valid JSON.
+ const a: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ datasetFormat: ["a", undefined, "c"] as unknown,
+ };
+ const b: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ datasetFormat: ["a", null, "c"] as unknown,
+ };
+ expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+ });
+
+ it("honors `toJSON()` like JSON.stringify (Date, etc.)", () => {
+ // Regression: `JSON.stringify({ d: new Date(0) })` serialises
+ // `d` as `"1970-01-01T00:00:00.000Z"`, but a naive recursive
+ // walker would serialise the Date as `{}` (no enumerable own
+ // keys). A `JobConfig` whose `unknown`-typed forwarder field
+ // ever holds a Date (or any object with `toJSON`) would then
+ // produce a hash that disagrees with the wire-format payload,
+ // causing spurious "configHash changed" → SIGTERM restarts.
+ const date = new Date("2024-01-01T00:00:00.000Z");
+ const a: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ warmupSteps: date as unknown,
+ };
+ const b: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ warmupSteps: "2024-01-01T00:00:00.000Z" as unknown,
+ };
+ expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+ });
+
+ it("threads the property key through to user-defined `toJSON(key)` (JSON parity)", () => {
+ // Regression: `JSON.stringify` calls `value.toJSON(key)` with
+ // the hosting property name (or array index as string), so a
+ // `toJSON` that branches on the key produces different output
+ // depending on where the value lives in the tree. The previous
+ // `stableStringify` called `toJSON()` without the key argument,
+ // so the hash diverged from the wire-format payload for any
+ // user object whose serialiser depends on context.
+ //
+ // The fixture's `toJSON(key)` returns `"key=<key>"`. Compare
+ // against an explicit string field holding what JSON.stringify
+ // would produce; matching hashes prove the key reached toJSON.
+ const ctx = {
+ toJSON(key: string) {
+ return `key=${key}`;
+ },
+ };
+ const a: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ warmupSteps: ctx as unknown,
+ };
+ const b: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ warmupSteps: "key=warmupSteps" as unknown,
+ };
+ expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+ });
+
+ it("omits an object property whose `toJSON(key)` returns undefined (JSON parity)", () => {
+ // Regression: `JSON.stringify({ a: { toJSON: () => undefined } })`
+ // produces `"{}"`: `toJSON` returning `undefined` is the spec's
+ // "skip me" signal in object position. The previous
+ // `stableStringify` collapsed every non-representable value to
+ // the literal string `"null"` at recursion time, so the same
+ // input hashed as `{"a":null}` instead of `{}`. That divergence
+ // forced unnecessary SIGTERM restarts whenever a `JobConfig`
+ // field's serialiser opted out: `configHash` would diverge from
+ // the wire-format payload (which DOES omit the field).
+ const omitting = {
+ toJSON() {
+ return undefined;
+ },
+ };
+ const a: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ warmupSteps: omitting as unknown,
+ };
+ const b: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ };
+ expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+ });
+
+ it("substitutes `null` for an array element whose `toJSON(idx)` returns undefined (JSON parity)", () => {
+ // Sibling contract: in array position, `JSON.stringify` writes
+ // `null` for a `toJSON()→undefined` element (it can't drop the
+ // slot without shifting indices). The `stableStringify` boundary
+ // for arrays maps the omit sentinel to `"null"`.
+ const omitting = {
+ toJSON() {
+ return undefined;
+ },
+ };
+ const a: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ datasetFormat: ["a", omitting, "c"] as unknown,
+ };
+ const b: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ datasetFormat: ["a", null, "c"] as unknown,
+ };
+ expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+ });
+
+ it("ignores function / symbol properties (JSON parity)", () => {
+ // `JSON.stringify` drops these too. The hash should be insensitive
+ // to "transparent" callbacks accidentally landing in a forwarded
+ // config (the SDK separates `callbacks` out, but `unknown` fields
+ // could leak one).
+ const fn = () => 0;
+ const sym = Symbol("foo");
+ const a: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ };
+ const b: JobConfig = {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ warmupSteps: fn as unknown,
+ loggingSteps: sym as unknown,
+ };
+ expect(hashJobConfig(a)).toBe(hashJobConfig(b));
+ });
+});
diff --git a/packages/arkor/src/core/configHash.ts b/packages/arkor/src/core/configHash.ts
new file mode 100644
index 00000000..2e407094
--- /dev/null
+++ b/packages/arkor/src/core/configHash.ts
@@ -0,0 +1,91 @@
+import { createHash } from "node:crypto";
+import type { JobConfig } from "./types";
+
+/**
+ * Deterministic JSON serialiser: keys sorted at every nesting level so
+ * `{a:1, b:2}` and `{b:2, a:1}` produce the same string. Necessary because
+ * `JSON.stringify` follows insertion order, which isn't stable across
+ * `buildJobConfig` revisions or user-side spread-merge tricks.
+ *
+ * Returns `string | undefined`. `undefined` is the "omit me from my
+ * containing object" sentinel: it propagates from any value
+ * `JSON.stringify` would silently drop in object position
+ * (`undefined`, functions, symbols, *and* objects whose `toJSON(key)`
+ * returns one of those). Callers sit at three boundaries:
+ *
+ * - Top level: `hashJobConfig` collapses `undefined` to `"null"`
+ * so the digest input stays a valid hash string.
+ * - Array slots: the map below substitutes `"null"` (matches
+ * `JSON.stringify([undefined]) === "[null]"`).
+ * - Object slots: the loop filters the key out entirely (matches
+ * `JSON.stringify({a: undefined}) === "{}"`).
+ *
+ * The previous implementation collapsed every non-representable value to
+ * the literal string `"null"` at recursion time, which leaked into
+ * object slots as `{"a":null}` instead of the JSON-correct `{}`,
+ * making `configHash` diverge from the wire-format payload for
+ * `JobConfig` fields whose `toJSON(key)` happened to return
+ * `undefined` (the spec-defined "skip me" signal). That divergence
+ * forces unnecessary SIGTERM restarts on every rebuild.
+ */
+function stableStringify(value: unknown, key: string = ""): string | undefined {
+ if (value === null) return "null";
+ // Non-representable values: omit (undefined return) so each caller's
+ // boundary handler chooses the right substitution per its position.
+ if (value === undefined || typeof value === "function" || typeof value === "symbol") {
+ return undefined;
+ }
+ if (typeof value !== "object") return JSON.stringify(value);
+ // `JSON.stringify` calls `value.toJSON(key)` first when present
+ // (passing `""` at the top level, the property name in object
+ // positions, the index-as-string in array positions), then
+ // serialises the return value. Canonical example: `Date` → ISO
+ // string. The `key` argument is threaded through recursion so
+ // user-side `toJSON(key)` implementations that branch on the
+ // hosting property/index see the same value JSON.stringify would.
+ // If `toJSON` returns `undefined`, that propagates as the omit
+ // sentinel: the spec-defined "skip me" path.
+ const maybeToJSON = (value as { toJSON?: unknown }).toJSON;
+ if (typeof maybeToJSON === "function") {
+ return stableStringify(
+ (maybeToJSON as (key: string) => unknown).call(value, key),
+ key,
+ );
+ }
+ if (Array.isArray(value)) {
+ // Array slots: non-representable → "null" (matches JSON spec).
+ // Index-as-string keys mirror `JSON.stringify`'s behaviour for
+ // array elements (per the ECMAScript spec, `SerializeJSONArray`
+ // calls `SerializeJSONProperty` with the index converted to a
+ // string).
+ const items = value.map((v, i) => stableStringify(v, String(i)) ?? "null");
+ return `[${items.join(",")}]`;
+ }
+ // Object slots: skip keys whose serialised value is `undefined`
+ // (matches `JSON.stringify({a: undefined}) === "{}"`). Property
+ // names are passed as the recursion key so a nested `toJSON(key)`
+ // sees the hosting field name.
+ const obj = value as Record<string, unknown>;
+ const parts: string[] = [];
+ for (const k of Object.keys(obj).sort()) {
+ const serialised = stableStringify(obj[k], k);
+ if (serialised === undefined) continue;
+ parts.push(`${JSON.stringify(k)}:${serialised}`);
+ }
+ return `{${parts.join(",")}}`;
+}
+
+/**
+ * Stable fingerprint of a `JobConfig`. Used by HMR to decide whether a
+ * rebuild changed only the in-process callbacks (configHash unchanged →
+ * hot-swap) or the cloud-side training config (configHash changed →
+ * full restart with `requestEarlyStop`).
+ */
+export function hashJobConfig(config: JobConfig): string {
+ // Top-level fallback to `"null"` so a pathological config that
+ // serialises to `undefined` (top-level `toJSON` returning
+ // undefined, etc.) still produces a deterministic digest input
+ // rather than crashing `createHash.update(undefined)`.
+ const serialised = stableStringify(config) ?? "null";
+ return createHash("sha256").update(serialised).digest("hex").slice(0, 16);
+}
diff --git a/packages/arkor/src/core/moduleCacheBust.test.ts b/packages/arkor/src/core/moduleCacheBust.test.ts
new file mode 100644
index 00000000..40b8509a
--- /dev/null
+++ b/packages/arkor/src/core/moduleCacheBust.test.ts
@@ -0,0 +1,66 @@
+import { describe, it, expect, beforeEach, afterEach } from "vitest";
+import { mkdtempSync, rmSync, writeFileSync } from "node:fs";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+import { pathToFileURL } from "node:url";
+import {
+ moduleCacheBustKey,
+ moduleCacheBustUrl,
+} from "./moduleCacheBust";
+
+let dir: string;
+
+beforeEach(() => {
+ dir = mkdtempSync(join(tmpdir(), "arkor-cachebust-test-"));
+});
+
+afterEach(() => {
+ rmSync(dir, { recursive: true, force: true });
+});
+
+describe("moduleCacheBustKey", () => {
+ it("is stable across calls when the file hasn't changed", () => {
+ // Regression: Node's ESM loader never evicts module records, and
+ // a `Date.now()` cache-bust would produce a fresh URL on every
+ // call → unbounded leak across long `arkor dev` sessions
+ // (5 s `/api/manifest` polls + every save firing SIGUSR2).
+ // mtime+ctime+size keying must collapse repeat reads of unchanged
+ // bytes onto the same key so the loader serves from cache.
+ const file = join(dir, "stable.mjs");
+ writeFileSync(file, "export const v = 1;");
+ const k1 = moduleCacheBustKey(file);
+ const k2 = moduleCacheBustKey(file);
+ expect(k1).toBe(k2);
+ // mtimeMs-ctimeMs-size; mtimeMs/ctimeMs may carry sub-ms precision
+ // (no `toFixed(0)`) so digits include an optional fractional part.
+ expect(k1).toMatch(/^[\d.]+-[\d.]+-\d+$/);
+ });
+
+ it("changes when the file content changes (different size)", () => {
+ const file = join(dir, "growing.mjs");
+ writeFileSync(file, "v1");
+ const before = moduleCacheBustKey(file);
+ writeFileSync(file, "version-two");
+ const after = moduleCacheBustKey(file);
+ expect(after).not.toBe(before);
+ });
+
+ it("returns a stable fallback (\"0-0-0\") for missing files instead of throwing", () => {
+ // The eventual `await import(url)` will throw on a missing
+ // file; the helper itself should produce a value rather than
+ // bubbling the stat error and turning every consumer into a
+ // try/catch site. Three zeros (one each for mtimeMs, ctimeMs,
+ // size) to keep the shape uniform with the success branch.
+ expect(moduleCacheBustKey(join(dir, "does-not-exist.mjs"))).toBe("0-0-0");
+ });
+});
+
+describe("moduleCacheBustUrl", () => {
+ it("returns a fully-qualified file URL with the cache-bust query attached", () => {
+ const file = join(dir, "u.mjs");
+ writeFileSync(file, "export const x = 1;");
+ const url = moduleCacheBustUrl(file);
+ expect(url.startsWith(pathToFileURL(file).href + "?t=")).toBe(true);
+ expect(url).toMatch(/\?t=[\d.]+-[\d.]+-\d+$/);
+ });
+});
diff --git a/packages/arkor/src/core/moduleCacheBust.ts b/packages/arkor/src/core/moduleCacheBust.ts
new file mode 100644
index 00000000..22f160a5
--- /dev/null
+++ b/packages/arkor/src/core/moduleCacheBust.ts
@@ -0,0 +1,51 @@
+import { statSync } from "node:fs";
+import { pathToFileURL } from "node:url";
+
+/**
+ * Build a content-derived cache-bust query for `await import(url + "?t=" + key)`.
+ *
+ * Why this matters: Node's ESM loader caches every dynamically-imported
+ * URL for the lifetime of the process and exposes no API to evict a
+ * record. A naive `?t=Date.now()` cache-bust produces a fresh URL on
+ * every call, so a long-running `arkor dev` session (where the SPA
+ * polls `/api/manifest` every few seconds and every save fires
+ * `BUNDLE_END` + SIGUSR2) accumulates one module record per call,
+ * unbounded.
+ *
+ * Keying on `mtimeMs + ctimeMs + size` collapses repeated reads of the
+ * same bytes onto the same URL, which Node's loader then serves from
+ * its existing cache record. The leak shrinks from "one entry per
+ * call" to "one entry per actual file change", which is the tightest
+ * bound we can offer without spawning a child process per import.
+ *
+ * `mtimeMs` is kept at full sub-millisecond precision (no rounding):
+ * a previous `toFixed(0)` collapsed two distinct edits that landed in
+ * the same millisecond and produced an identically-sized output onto
+ * the same key, which made Node's loader return the *stale* module
+ * for the second edit (HMR/manifest staleness on fast filesystems).
+ * `ctimeMs` is included as belt-and-braces against the (rare) case
+ * where mtime collides but ctime moves: `touch -m` and some build
+ * tools update one without the other.
+ *
+ * Falls back to a stable literal on stat failure so the eventual
+ * `import()` (which will throw on a missing file) gets to surface its
+ * own clean error rather than us inventing a noisy timestamp here.
+ */
+export function moduleCacheBustKey(filePath: string): string {
+ try {
+ const s = statSync(filePath);
+ return `${s.mtimeMs}-${s.ctimeMs}-${s.size}`;
+ } catch {
+ return "0-0-0";
+ }
+}
+
+/**
+ * Convenience: full file URL with the cache-bust key already
+ * appended. The `as const`-style template is small enough to inline
+ * but doing it in one place keeps the URL shape uniform across the
+ * three callers (`hmr.ts`, `manifest.ts`, `runnerSignals.ts`).
+ */
+export function moduleCacheBustUrl(filePath: string): string {
+ return `${pathToFileURL(filePath).href}?t=${moduleCacheBustKey(filePath)}`;
+}
diff --git a/packages/arkor/src/core/projectState.test.ts b/packages/arkor/src/core/projectState.test.ts
index 8d36515a..73d6b706 100644
--- a/packages/arkor/src/core/projectState.test.ts
+++ b/packages/arkor/src/core/projectState.test.ts
@@ -37,7 +37,7 @@ function fakeClient(
// Construct a real CloudApiClient (so type-compatibility holds), then
// monkey-patch only the methods exercised by ensureProjectState. The
// other methods would throw on first use because no fetcher is wired,
- // which is fine — projectState should never reach them.
+ // which is fine; projectState should never reach them.
const client = new CloudApiClient({
baseUrl: "http://mock",
credentials: anonCreds,
@@ -84,7 +84,7 @@ describe("ensureProjectState", () => {
expect(createProject).not.toHaveBeenCalled();
});
- it("throws for auth0 callers without state — they must write .arkor/state.json by hand", async () => {
+ it("throws for auth0 callers without state: they must write .arkor/state.json by hand", async () => {
const client = fakeClient();
await expect(
ensureProjectState({ cwd, client, credentials: auth0Creds }),
@@ -116,7 +116,7 @@ describe("ensureProjectState", () => {
expect(createProject).toHaveBeenCalledWith({
orgSlug: "anon-abc",
name: expect.stringMatching(/^my-app/),
- // Sanitised slug — basename starts with "my-app-", and we
+ // Sanitised slug: basename starts with "my-app-", and we
// expect the sanitiser to keep dashes.
slug: expect.stringMatching(/^my-app/),
});
diff --git a/packages/arkor/src/core/rolldownConfig.ts b/packages/arkor/src/core/rolldownConfig.ts
new file mode 100644
index 00000000..66e87c29
--- /dev/null
+++ b/packages/arkor/src/core/rolldownConfig.ts
@@ -0,0 +1,86 @@
+import { isAbsolute, resolve } from "node:path";
+import type { InputOptions } from "rolldown";
+
+const DEFAULT_ENTRY = "src/arkor/index.ts";
+const DEFAULT_OUT_DIR = ".arkor/build";
+
+export interface BuildEntryOptions {
+ /** Source entry path; defaults to `src/arkor/index.ts`. */
+ entry?: string;
+ /** Output directory; defaults to `.arkor/build`. */
+ outDir?: string;
+ /** Project root; defaults to `process.cwd()`. */
+ cwd?: string;
+}
+
+export interface ResolvedBuildEntry {
+ /** Project root (absolute). */
+ cwd: string;
+ /** Entry source file (absolute). */
+ entry: string;
+ /** Output directory (absolute). */
+ outDir: string;
+ /** Output bundle (absolute, always `<outDir>/index.mjs`). */
+ outFile: string;
+}
+
+/** Resolve `cwd` / `entry` / `outDir` to absolute paths with the standard defaults. */
+export function resolveBuildEntry(opts: BuildEntryOptions): ResolvedBuildEntry {
+ const cwd = opts.cwd ?? process.cwd();
+ const entryRel = opts.entry ?? DEFAULT_ENTRY;
+ const entry = isAbsolute(entryRel) ? entryRel : resolve(cwd, entryRel);
+ const outDirRel = opts.outDir ?? DEFAULT_OUT_DIR;
+ const outDir = isAbsolute(outDirRel) ? outDirRel : resolve(cwd, outDirRel);
+ const outFile = resolve(outDir, "index.mjs");
+ return { cwd, entry, outDir, outFile };
+}
+
+/**
+ * `node<major>.<minor>` derived from the running Node binary. Build host and
+ * run host are effectively the same process (Studio spawns `arkor start` with
+ * `process.execPath`), so the bundle can target precisely what will execute it.
+ */
+export function resolveNodeTarget(): string {
+ // Fallback aligns with the published `engines.node` floor; see
+ // [packages/arkor/package.json] / `AGENTS.md`'s "Node version" note.
+ const [major = "22", minor = "22"] = process.versions.node.split(".");
+ return `node${major}.${minor}`;
+}
+
+/**
+ * Build the shared rolldown options object used by both `runBuild` (one-shot)
+ * and the HMR coordinator (`watch()`). Centralising the configuration here
+ * keeps the two pipelines aligned: anything that affects the bundle shape
+ * (external resolution, transform target, platform) is set in one place so
+ * the artifact a watcher writes is byte-equivalent to a one-shot rebuild.
+ */
+export function rolldownInputOptions(
+ resolved: Pick<ResolvedBuildEntry, "cwd" | "entry">,
+): InputOptions {
+ return {
+ input: resolved.entry,
+ cwd: resolved.cwd,
+ platform: "node",
+ logLevel: "warn",
+ transform: { target: resolveNodeTarget() },
+ // Mirror esbuild's `packages: "external"`: any specifier that isn't a
+ // relative or absolute path stays external. `node:`-prefixed builtins
+ // are already handled by `platform: "node"`; the explicit allow below
+ // is a safety net in case the builtin set drifts.
+ external: (id, _importer, isResolved) => {
+ if (isResolved) return false;
+ if (id.startsWith(".")) return false;
+ if (isAbsolute(id)) return false;
+ return true;
+ },
+ };
+}
+
+/**
+ * Re-exported defaults so consumers (like error messages) can name the same
+ * paths we resolve internally.
+ */
+export const BUILD_DEFAULTS = {
+ entry: DEFAULT_ENTRY,
+ outDir: DEFAULT_OUT_DIR,
+} as const;
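The `external` predicate above can be read as a small classifier over import specifiers. The following is a minimal sketch of that rule only, not the module itself. `isExternal` and `looksAbsolute` are hypothetical names, and `looksAbsolute` is a simplified stand-in for `isAbsolute` from `node:path` so the snippet has no imports.

```typescript
// Simplified stand-in for `isAbsolute` from `node:path` (assumption: POSIX
// roots, Windows drive letters, and UNC prefixes are enough for a sketch).
function looksAbsolute(id: string): boolean {
  return /^(\/|[A-Za-z]:[\\/]|\\\\)/.test(id);
}

// Same decision order as the predicate in `rolldownInputOptions`:
// already-resolved ids, relative paths, and absolute paths are bundled;
// every other specifier (bare packages, `node:` builtins) stays external.
function isExternal(id: string, isResolved: boolean): boolean {
  if (isResolved) return false;
  if (id.startsWith(".")) return false;
  if (looksAbsolute(id)) return false;
  return true;
}
```

Under this rule `isExternal("react", false)` and `isExternal("node:fs", false)` are both `true`, while `isExternal("./trainer", false)` is `false`.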
diff --git a/packages/arkor/src/core/runner.test.ts b/packages/arkor/src/core/runner.test.ts
index 89f29249..cdabfb15 100644
--- a/packages/arkor/src/core/runner.test.ts
+++ b/packages/arkor/src/core/runner.test.ts
@@ -1,4 +1,4 @@
-import { describe, it, expect, afterEach, beforeEach } from "vitest";
+import { describe, it, expect, afterEach, beforeEach, vi } from "vitest";
import { mkdtempSync, rmSync, writeFileSync, mkdirSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
@@ -49,7 +49,7 @@ afterEach(() => {
rmSync(cwd, { recursive: true, force: true });
});
-describe("runTrainer — entry extraction", () => {
+describe("runTrainer: entry extraction", () => {
it("throws when the entry file does not exist", async () => {
await expect(runTrainer("missing.ts")).rejects.toThrow(
/Training entry not found/,
@@ -124,7 +124,7 @@ describe("runTrainer — entry extraction", () => {
});
it("throws when default export is a primitive (typeof !== 'object' branch)", async () => {
- // The second half of `mod.default && typeof mod.default === "object"` —
+ // The second half of `mod.default && typeof mod.default === "object"`:
// a primitive default like `42` or `"foo"` must short-circuit out of
// the nested-trainer probe.
const entry = join(cwd, "primitive-default.mjs");
@@ -135,7 +135,7 @@ describe("runTrainer — entry extraction", () => {
});
it("accepts a default export wrapping a `trainer` field (legacy power-user shape)", async () => {
- // Hits the `if (isTrainer(nested)) return nested` branch — the only
+ // Hits the `if (isTrainer(nested)) return nested` branch: the only
// place line 38 is reachable.
const entry = join(cwd, "default-with-trainer.mjs");
writeFileSync(
@@ -154,7 +154,7 @@ describe("runTrainer — entry extraction", () => {
it("falls back to DEFAULT_ENTRY (src/arkor/index.ts) when called with no argument", async () => {
// Branch coverage for `file ?? DEFAULT_ENTRY`. Place the entry at
- // `<cwd>/src/arkor/index.ts` and invoke runTrainer() — the default
+ // `<cwd>/src/arkor/index.ts` and invoke runTrainer(): the default
// path is what `arkor start` and Studio's "Run training" button use.
const arkorDir = join(cwd, "src", "arkor");
mkdirSync(arkorDir, { recursive: true });
@@ -174,8 +174,8 @@ describe("runTrainer — entry extraction", () => {
join(arkorDir, "index.ts"),
`export * from "./index.mjs";\n`,
);
- // Pass undefined explicitly to exercise the `?? DEFAULT_ENTRY` branch
- // — Node's built-in TypeScript stripping handles the .ts extension at
+ // Pass undefined explicitly to exercise the `?? DEFAULT_ENTRY` branch.
+ // Node's built-in TypeScript stripping handles the .ts extension at
// runtime. (vitest also strips TS so this works under test too.)
await expect(runTrainer()).resolves.toBeUndefined();
});
@@ -207,3 +207,99 @@ describe("runTrainer — entry extraction", () => {
expect(typeof t.wait).toBe("function");
});
});
+
+describe("runTrainer: shutdown signal handling", () => {
+ it("first SIGTERM calls trainer.requestEarlyStop and exits 0; second SIGTERM exits 143", async () => {
+ // Fake trainer whose `wait()` hangs until the test manually resolves it
+ // (via a global helper). This lets us hold the run in flight long
+ // enough to assert both signal-handling branches without racing the
+ // `finally` block that removes the listeners.
+ // The fake trainer wears the early-stop brand
+ // (`Symbol.for("arkor.trainer.requestEarlyStop")`) so the runner's
+ // SIGTERM handler invokes it the same way the SDK-provided trainer
+ // does. No public `requestEarlyStop` method exists any more.
+ const trainerSrc = `
+ const KEY = Symbol.for("arkor.trainer.requestEarlyStop");
+ let earlyStopCalls = 0;
+ let resolveWait;
+ const waitPromise = new Promise((r) => { resolveWait = r; });
+ globalThis.__test_signalProbe = {
+ get earlyStopCalls() { return earlyStopCalls; },
+ finishWait: () => resolveWait({
+ job: {
+ id: "j1", orgId: "o", projectId: "p", name: "n",
+ status: "completed",
+ config: { model: "m", datasetSource: { type: "huggingface", name: "x" } },
+ createdAt: "2026",
+ },
+ artifacts: [],
+ }),
+ };
+ const trainer = {
+ name: "n",
+ start: async () => ({ jobId: "j1" }),
+ wait: () => waitPromise,
+ cancel: async () => {},
+ };
+ Object.defineProperty(trainer, KEY, {
+ value: async () => { earlyStopCalls++; },
+ enumerable: false,
+ });
+ export { trainer };
+ `;
+ const entry = join(cwd, "src/arkor/index.mjs");
+ mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+ writeFileSync(entry, trainerSrc);
+
+ const exitCalls: number[] = [];
+ const exitSpy = vi
+ .spyOn(process, "exit")
+ .mockImplementation(((code?: number) => {
+ exitCalls.push(code ?? 0);
+ return undefined as never;
+ }) as typeof process.exit);
+ const stdoutSpy = vi
+ .spyOn(process.stdout, "write")
+ .mockImplementation((() => true) as typeof process.stdout.write);
+ try {
+ const runPromise = runTrainer("src/arkor/index.mjs");
+ // Wait for import + start() to settle so the handler is registered
+ // before we synthesise SIGTERM. Poll for the probe rather than
+ // relying on a fixed timer: under load (e.g. running alongside
+ // sibling test files in turbo) the dynamic import + top-level
+ // body can take longer than a hardcoded 25 ms window.
+ type Probe = { earlyStopCalls: number; finishWait: () => void };
+ let probe: Probe | undefined;
+ for (let i = 0; i < 40; i++) {
+ probe = (globalThis as unknown as { __test_signalProbe?: Probe })
+ .__test_signalProbe;
+ if (probe) break;
+ await new Promise((r) => setTimeout(r, 25));
+ }
+ if (!probe) throw new Error("Probe not installed by user bundle");
+
+ // 1st SIGTERM → requestEarlyStop is called, exit(0) scheduled in the
+ // promise's `.finally`.
+ process.emit("SIGTERM", "SIGTERM");
+ await new Promise((r) => setTimeout(r, 25));
+ expect(probe.earlyStopCalls).toBe(1);
+ expect(exitCalls).toContain(0);
+
+ // 2nd SIGTERM (still in-flight, listeners not yet removed) →
+ // exit(143) immediately, no second requestEarlyStop call.
+ process.emit("SIGTERM", "SIGTERM");
+ await new Promise((r) => setTimeout(r, 25));
+ expect(probe.earlyStopCalls).toBe(1);
+ expect(exitCalls).toContain(143);
+
+ // Release the hung wait() so runPromise can complete and the
+ // shutdown handlers detach via the finally block.
+ probe.finishWait();
+ await runPromise;
+ } finally {
+ exitSpy.mockRestore();
+ stdoutSpy.mockRestore();
+ delete (globalThis as Record<string, unknown>).__test_signalProbe;
+ }
+ });
+});
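The `143` asserted for the second SIGTERM follows the shell convention of 128 plus the signal number. A worked sketch of that arithmetic follows. The signal numbers are the conventional Linux/macOS values and `exitCodeForSignal` is a hypothetical helper; the project's own lookup table is not shown in this hunk.

```typescript
// Conventional signal numbers on Linux and macOS (assumption for this sketch).
const SIGNAL_NUMBER = { SIGHUP: 1, SIGINT: 2, SIGTERM: 15 } as const;
type ShutdownSignal = keyof typeof SIGNAL_NUMBER;

// Exit status a shell reports for a process terminated by `signal`.
function exitCodeForSignal(signal: ShutdownSignal): number {
  return 128 + SIGNAL_NUMBER[signal];
}
```

So SIGHUP maps to 129, SIGINT to 130, and SIGTERM to 143.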
diff --git a/packages/arkor/src/core/runner.ts b/packages/arkor/src/core/runner.ts
index e674b70e..38db0537 100644
--- a/packages/arkor/src/core/runner.ts
+++ b/packages/arkor/src/core/runner.ts
@@ -2,10 +2,48 @@ import { existsSync } from "node:fs";
import { resolve, isAbsolute } from "node:path";
import { pathToFileURL } from "node:url";
import { isArkor } from "./arkor";
+import {
+ installCallbackReloadHandler,
+ installShutdownHandlers,
+} from "./runnerSignals";
import type { Trainer } from "./types";
const DEFAULT_ENTRY = "src/arkor/index.ts";
+/**
+ * Per-spawn nonce that `/api/train` injects via env so the server can
+ * recognise the runner's `Started job <jobId>` line without it being
+ * forgeable from user code. Captured at module load (i.e. BEFORE
+ * `runTrainer` does its `await import(userEntry)`) and the env var
+ * is deleted right after so the dynamically-imported user module
+ * cannot read it via `process.env`. If a user callback then writes
+ * `Started job <jobId>` to stdout, the line won't carry the nonce
+ * prefix and the server's anchored regex will reject it: no
+ * spoofed cloud `cancel()` POST against an attacker-chosen job id.
+ *
+ * Null when the runner was launched directly (e.g. `arkor start` from
+ * a shell), in which case the runner falls back to the plain
+ * `Started job <jobId>` form for backwards compatibility. The server only
+ * uses the nonce-prefixed form because every server spawn sets the
+ * env var.
+ *
+ * **Import-order requirement.** The spoof-prevention guarantee relies
+ * on this module reading + deleting `ARKOR_JOB_ID_MARKER_NONCE`
+ * before any user-controlled module gets to touch `process.env`.
+ * That's safe today because the only consumer chain is
+ * `bin.ts → cli/main.ts → cli/commands/start.ts → core/runner.ts`,
+ * all static imports, so this module is fully evaluated before
+ * `runTrainer` performs its `await import(userEntry)`. If a future
+ * refactor introduces a dynamic-import / lazy-load of runner.ts (so
+ * a sibling module runs first and could snapshot `process.env`), the
+ * capture+delete should move into a tiny dedicated module that the
+ * bin imports first, or the env var should be wiped at the server
+ * spawn boundary too.
+ */
+const STARTED_JOB_NONCE: string | null =
+ process.env.ARKOR_JOB_ID_MARKER_NONCE ?? null;
+delete process.env.ARKOR_JOB_ID_MARKER_NONCE;
+
function isTrainer(value: unknown): value is Trainer {
if (!value || typeof value !== "object") return false;
const t = value as Record<string, unknown>;
@@ -53,8 +91,20 @@ export async function runTrainer(file?: string): Promise<void> {
const mod = (await import(pathToFileURL(abs).href)) as Record<string, unknown>;
const trainer = extractTrainer(mod);
- const { jobId } = await trainer.start();
- process.stdout.write(`Started job ${jobId}\n`);
- const result = await trainer.wait();
- process.stdout.write(`Job ${result.job.id} finished with status=${result.job.status}\n`);
+ const removeShutdown = installShutdownHandlers(trainer);
+ const removeCallbackReload = installCallbackReloadHandler(trainer, abs);
+ try {
+ const { jobId } = await trainer.start();
+ const startedJobPrefix = STARTED_JOB_NONCE
+ ? `[arkor:${STARTED_JOB_NONCE}] `
+ : "";
+ process.stdout.write(`${startedJobPrefix}Started job ${jobId}\n`);
+ const result = await trainer.wait();
+ process.stdout.write(
+ `Job ${result.job.id} finished with status=${result.job.status}\n`,
+ );
+ } finally {
+ removeShutdown();
+ removeCallbackReload();
+ }
}
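To illustrate why the nonce prefix resists spoofing, here is a minimal sketch of both ends of the marker line. The server-side regex is not part of this diff, so `parseStartedLine` and `escapeRegExp` are hypothetical. Only the `[arkor:<nonce>] Started job <jobId>` shape is taken from `runTrainer` above.

```typescript
// Escape regex metacharacters so the nonce is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Runner side: same prefix rule as `runTrainer` (no prefix without a nonce).
function formatStartedLine(nonce: string | null, jobId: string): string {
  const prefix = nonce ? `[arkor:${nonce}] ` : "";
  return `${prefix}Started job ${jobId}`;
}

// Hypothetical server side: an anchored match that only accepts lines
// carrying the nonce this spawn was given.
function parseStartedLine(nonce: string, line: string): string | null {
  const pattern = new RegExp(
    `^\\[arkor:${escapeRegExp(nonce)}\\] Started job (\\S+)$`,
  );
  return pattern.exec(line)?.[1] ?? null;
}
```

A user callback that prints a bare `Started job <jobId>` line, or guesses a wrong nonce, yields `null` from the parser, so no job id is extracted from it.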
diff --git a/packages/arkor/src/core/runnerSignals.test.ts b/packages/arkor/src/core/runnerSignals.test.ts
new file mode 100644
index 00000000..5e274943
--- /dev/null
+++ b/packages/arkor/src/core/runnerSignals.test.ts
@@ -0,0 +1,403 @@
+import { describe, it, expect, beforeEach, afterEach, vi } from "vitest";
+import { mkdirSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+import {
+ installCallbackReloadHandler,
+ installShutdownHandlers,
+} from "./runnerSignals";
+import type { Trainer, TrainerCallbacks } from "./types";
+import {
+ attachTrainerCallbackReplacer,
+ attachTrainerEarlyStopper,
+ attachTrainerInspection,
+} from "./trainerInspection";
+
+let cwd: string;
+
+beforeEach(() => {
+ cwd = mkdtempSync(join(tmpdir(), "arkor-signals-test-"));
+});
+
+afterEach(() => {
+ rmSync(cwd, { recursive: true, force: true });
+});
+
+function makeTrainer(): Trainer & {
+ __earlyStop: { calls: number };
+ __replace: {
+ lastCallbacks: Partial<TrainerCallbacks> | null;
+ calls: number;
+ };
+} {
+ const earlyStop = { calls: 0 };
+ const replace = {
+ lastCallbacks: null as Partial<TrainerCallbacks> | null,
+ calls: 0,
+ };
+ const trainer: Trainer = {
+ name: "n",
+ async start() {
+ return { jobId: "j" };
+ },
+ async wait() {
+ throw new Error("not used");
+ },
+ async cancel() {},
+ };
+ // Wire the internal callback-replacer + early-stop brands the same
+ // way `createTrainer` does. SIGUSR2 looks them up via
+ // `replaceTrainerCallbacks` and SIGTERM via `requestTrainerEarlyStop`
+ // (there are no public methods on `Trainer` for either any more).
+ attachTrainerCallbackReplacer(trainer, (cbs) => {
+ replace.lastCallbacks = cbs;
+ replace.calls += 1;
+ });
+ attachTrainerEarlyStopper(trainer, async () => {
+ earlyStop.calls += 1;
+ });
+ return Object.assign(trainer, {
+ __earlyStop: earlyStop,
+ __replace: replace,
+ });
+}
+
+describe("installShutdownHandlers", () => {
+ it("calls trainer.requestEarlyStop on the first SIGTERM and exit(0)", async () => {
+ const trainer = makeTrainer();
+ const exitSpy = vi
+ .spyOn(process, "exit")
+ .mockImplementation((() => undefined as never) as typeof process.exit);
+ const stdoutSpy = vi
+ .spyOn(process.stdout, "write")
+ .mockImplementation((() => true) as typeof process.stdout.write);
+ const dispose = installShutdownHandlers(trainer);
+ try {
+ process.emit("SIGTERM", "SIGTERM");
+ await new Promise((r) => setTimeout(r, 10));
+ expect(trainer.__earlyStop.calls).toBe(1);
+ expect(exitSpy).toHaveBeenCalledWith(0);
+ } finally {
+ dispose();
+ exitSpy.mockRestore();
+ stdoutSpy.mockRestore();
+ }
+ });
+
+ it("second-signal exit code is per-signal POSIX 128+signo (130 for SIGINT, 129 for SIGHUP)", async () => {
+ // Regression: the second-signal emergency-exit path used to
+ // hardcode `process.exit(143)` regardless of which signal
+ // fired. SIGINT (Ctrl-C twice) and SIGHUP shutdowns then
+ // looked like SIGTERM exits to parent shells / orchestrators,
+ // breaking signal-aware logic (e.g. tmux pane behaviour, CI
+ // job classification, `&&` / `||` chains that distinguish
+ // user-cancel from clean exit). Mirrors `SIGNAL_EXIT_CODE` in
+ // `cli/cleanupHooks.ts`.
+ const cases: Array<["SIGINT" | "SIGTERM" | "SIGHUP", number]> = [
+ ["SIGINT", 130],
+ ["SIGTERM", 143],
+ ["SIGHUP", 129],
+ ];
+ for (const [sig, expectedExit] of cases) {
+ const trainer = makeTrainer();
+ const exitCodes: number[] = [];
+ const exitSpy = vi
+ .spyOn(process, "exit")
+ .mockImplementation(((code?: number) => {
+ exitCodes.push(code ?? 0);
+ return undefined as never;
+ }) as typeof process.exit);
+ const stdoutSpy = vi
+ .spyOn(process.stdout, "write")
+ .mockImplementation((() => true) as typeof process.stdout.write);
+ const dispose = installShutdownHandlers(trainer);
+ try {
+ process.emit(sig, sig);
+ await new Promise((r) => setTimeout(r, 10));
+ process.emit(sig, sig);
+ await new Promise((r) => setTimeout(r, 10));
+ // First signal exits 0 via the early-stop chain's
+ // `.finally(() => process.exit(0))`; second signal exits
+ // with the per-signal POSIX code.
+ expect(exitCodes, `signal ${sig}`).toContain(expectedExit);
+ } finally {
+ dispose();
+ exitSpy.mockRestore();
+ stdoutSpy.mockRestore();
+ }
+ }
+ });
+
+ it("first-signal exit code is per-signal POSIX 128+signo when the early-stop chain rejects", async () => {
+ // Regression: the first-signal `.finally(() => process.exit(0))`
+ // always exited 0 even when the early-stop chain rejected
+ // (cancel POST hit a cloud-api 5xx, network drop, etc.). Parent
+ // shells running `arkor start || cleanup_on_failure` would then
+ // classify the failed cancel as a clean run and skip cleanup
+ // despite the stderr diagnostic. Fix: non-zero POSIX 128+signo on
+ // rejection so the exit status carries the same signal-shape
+ // semantics as the second-signal emergency path.
+ const cases: Array<["SIGINT" | "SIGTERM" | "SIGHUP", number]> = [
+ ["SIGINT", 130],
+ ["SIGTERM", 143],
+ ["SIGHUP", 129],
+ ];
+ for (const [sig, expectedExit] of cases) {
+ // Build a trainer whose internal early-stop brand REJECTS, so
+ // the runner's `.catch(...).finally(...)` chain goes through
+ // the failure branch.
+ const trainer: Trainer = {
+ name: "n",
+ async start() {
+ return { jobId: "j" };
+ },
+ async wait() {
+ throw new Error("not used");
+ },
+ async cancel() {},
+ };
+ attachTrainerEarlyStopper(trainer, async () => {
+ throw new Error("cloud-api 503");
+ });
+ const exitCodes: number[] = [];
+ const exitSpy = vi
+ .spyOn(process, "exit")
+ .mockImplementation(((code?: number) => {
+ exitCodes.push(code ?? 0);
+ return undefined as never;
+ }) as typeof process.exit);
+ const stdoutSpy = vi
+ .spyOn(process.stdout, "write")
+ .mockImplementation((() => true) as typeof process.stdout.write);
+ const stderrSpy = vi
+ .spyOn(process.stderr, "write")
+ .mockImplementation((() => true) as typeof process.stderr.write);
+ const dispose = installShutdownHandlers(trainer);
+ try {
+ process.emit(sig, sig);
+ // Wait for the .catch / .finally microtasks to settle.
+ await new Promise((r) => setTimeout(r, 10));
+ // Under the bug this was just `[0]`. With the fix the
+ // first-signal exit code reflects the signal that fired.
+ expect(exitCodes, `signal ${sig}`).toEqual([expectedExit]);
+ } finally {
+ dispose();
+ exitSpy.mockRestore();
+ stdoutSpy.mockRestore();
+ stderrSpy.mockRestore();
+ }
+ }
+ });
+
+ it("second SIGTERM exits 143 without re-invoking requestEarlyStop", async () => {
+ const trainer = makeTrainer();
+ const exitCodes: number[] = [];
+ const exitSpy = vi
+ .spyOn(process, "exit")
+ .mockImplementation(((code?: number) => {
+ exitCodes.push(code ?? 0);
+ return undefined as never;
+ }) as typeof process.exit);
+ const stdoutSpy = vi
+ .spyOn(process.stdout, "write")
+ .mockImplementation((() => true) as typeof process.stdout.write);
+ const dispose = installShutdownHandlers(trainer);
+ try {
+ process.emit("SIGTERM", "SIGTERM");
+ await new Promise((r) => setTimeout(r, 10));
+ process.emit("SIGTERM", "SIGTERM");
+ await new Promise((r) => setTimeout(r, 10));
+ expect(trainer.__earlyStop.calls).toBe(1);
+ expect(exitCodes).toContain(0);
+ expect(exitCodes).toContain(143);
+ } finally {
+ dispose();
+ exitSpy.mockRestore();
+ stdoutSpy.mockRestore();
+ }
+ });
+});
+
+describe("installCallbackReloadHandler", () => {
+ function writeUserBundle(label: string): string {
+ const file = join(cwd, "entry.mjs");
+ // Inline a fake trainer that wears the inspection brand. The
+ // SIGUSR2 handler dynamic-imports this file and pulls the
+ // callbacks reference off via `getTrainerInspection`.
+ const src = `
+ const KEY = Symbol.for("arkor.trainer.inspect");
+ const callbacks = { onLog: (ctx) => globalThis.__arkor_callbackProbe?.(${JSON.stringify(label)}, ctx) };
+ const trainer = {
+ name: "t",
+ start: async () => ({ jobId: "j" }),
+ wait: async () => ({ job: {}, artifacts: [] }),
+ cancel: async () => {},
+ };
+ Object.defineProperty(trainer, KEY, {
+ value: () => ({ name: "t", config: { model: "m", datasetSource: { type: "huggingface", name: "x" } }, callbacks }),
+ enumerable: false,
+ });
+ export const arkor = Object.freeze({ _kind: "arkor", trainer });
+ `;
+ writeFileSync(file, src);
+ return file;
+ }
+
+ it("re-imports the bundle and forwards the new callbacks via replaceCallbacks", async () => {
+ const trainer = makeTrainer();
+ // Brand the trainer too so the import path-side has a reference shape.
+ attachTrainerInspection(trainer, () => ({
+ name: "n",
+ config: {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ },
+ callbacks: {},
+ }));
+
+ const file = writeUserBundle("v1");
+ const stdoutSpy = vi
+ .spyOn(process.stdout, "write")
+ .mockImplementation((() => true) as typeof process.stdout.write);
+ const stderrSpy = vi
+ .spyOn(process.stderr, "write")
+ .mockImplementation((() => true) as typeof process.stderr.write);
+ const dispose = installCallbackReloadHandler(trainer, file);
+ mkdirSync(join(cwd, "src"), { recursive: true });
+ try {
+ // Rewrite the entry to "v2" callbacks before signalling.
+ writeUserBundle("v2");
+ process.emit("SIGUSR2", "SIGUSR2");
+ // Wait for the dynamic import + replaceCallbacks to settle.
+ for (let i = 0; i < 50 && trainer.__replace.lastCallbacks === null; i++) {
+ await new Promise((r) => setTimeout(r, 10));
+ }
+ expect(trainer.__replace.lastCallbacks).not.toBeNull();
+ expect(typeof trainer.__replace.lastCallbacks?.onLog).toBe("function");
+ } finally {
+ dispose();
+ stdoutSpy.mockRestore();
+ stderrSpy.mockRestore();
+ }
+ });
+
+ it("returns a no-op disposer when SIGUSR2 registration throws (Windows fallback)", () => {
+ // Regression: `process.on("SIGUSR2", ...)` can throw at
+ // registration time on platforms that don't support the signal
+ // (notably Windows). Previously this would surface as a hard
+ // crash at `arkor start` boot. The handler now wraps the
+ // registration in try/catch and degrades to a no-op disposer so
+ // the rest of the runner stays up: the server's
+ // `safeKill(child, "SIGUSR2")` already detects the same
+ // condition and falls back to SIGTERM-restart there.
+ const trainer = makeTrainer();
+ const file = join(cwd, "entry.mjs");
+ writeFileSync(file, "export const x = 1;\n");
+
+ const realOn = process.on.bind(process);
+ const onSpy = vi
+ .spyOn(process, "on")
+ .mockImplementation(((event: string, listener: (...args: unknown[]) => void) => {
+ if (event === "SIGUSR2") {
+ throw new Error("ENOSYS: function not implemented");
+ }
+ return realOn(event as never, listener as never);
+ }) as typeof process.on);
+
+ let dispose: (() => void) | undefined;
+ try {
+ // Must not throw despite the SIGUSR2 registration failure.
+ dispose = installCallbackReloadHandler(trainer, file);
+ expect(typeof dispose).toBe("function");
+ // No listener was attached, so the disposer is a no-op; calling
+ // it must not throw either (mirroring the success-path contract
+ // for tests that always invoke the disposer in `finally`).
+ expect(() => dispose?.()).not.toThrow();
+ } finally {
+ onSpy.mockRestore();
+ }
+ });
+
+ it("drops a stale reload's result when a newer SIGUSR2 starts before the import resolves", async () => {
+ // Regression: each SIGUSR2 starts a fire-and-forget
+ // `import()` + `replaceTrainerCallbacks`. Two same-`configHash`
+ // rebuilds firing back-to-back can race: the earlier import's
+ // bytes sometimes resolve *after* the newer one, and
+ // `replaceTrainerCallbacks` overwrites the freshly-loaded
+ // callbacks with the prior version. The fix version-gates each
+ // reload via a monotonic `loadSeq`; this test pins the contract
+ // by firing two signals back-to-back and asserting that
+ // `replaceTrainerCallbacks` was invoked exactly **once**:
+ // proving the older IIFE dropped its result at the
+ // `seq !== loadSeq` check before reaching the replace call.
+ const trainer = makeTrainer();
+ attachTrainerInspection(trainer, () => ({
+ name: "n",
+ config: {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ },
+ callbacks: {},
+ }));
+
+ const file = writeUserBundle("v1");
+ const stdoutSpy = vi
+ .spyOn(process.stdout, "write")
+ .mockImplementation((() => true) as typeof process.stdout.write);
+ const stderrSpy = vi
+ .spyOn(process.stderr, "write")
+ .mockImplementation((() => true) as typeof process.stderr.write);
+ const dispose = installCallbackReloadHandler(trainer, file);
+ try {
+ // First signal: captures seq=1 inside the IIFE.
+ process.emit("SIGUSR2", "SIGUSR2");
+ // Rewrite the bundle to v2 BEFORE letting either import
+ // resolve. mtime+ctime+size change → distinct cache-bust URL.
+ writeUserBundle("v2");
+ // Second signal: captures seq=2, bumps loadSeq to 2.
+ process.emit("SIGUSR2", "SIGUSR2");
+ // Generous fixed wait so both imports definitely settle;
+ // we can't poll on `lastCallbacks !== null` because the v1
+ // IIFE might land first and short-circuit our wait, hiding
+ // the count assertion below.
+ await new Promise((r) => setTimeout(r, 200));
+ // Without the seq guard, both IIFEs would call
+ // `replaceTrainerCallbacks` and `calls` would be 2. With the
+ // guard, the older IIFE's `seq !== loadSeq` short-circuit
+ // skips the replace call entirely.
+ expect(trainer.__replace.calls).toBe(1);
+ } finally {
+ dispose();
+ stdoutSpy.mockRestore();
+ stderrSpy.mockRestore();
+ }
+ });
+
+ it("logs a skip warning when the bundle has no inspectable trainer", async () => {
+ const trainer = makeTrainer();
+ const file = join(cwd, "no-trainer.mjs");
+ writeFileSync(file, "export const nothing = true;\n");
+ const stdoutSpy = vi
+ .spyOn(process.stdout, "write")
+ .mockImplementation((() => true) as typeof process.stdout.write);
+ const stderrChunks: string[] = [];
+ const stderrSpy = vi
+ .spyOn(process.stderr, "write")
+ .mockImplementation(((chunk: unknown) => {
+ stderrChunks.push(String(chunk));
+ return true;
+ }) as typeof process.stderr.write);
+ const dispose = installCallbackReloadHandler(trainer, file);
+ try {
+ process.emit("SIGUSR2", "SIGUSR2");
+ // Give the dynamic import a few ticks.
+ await new Promise((r) => setTimeout(r, 50));
+ expect(stderrChunks.join("")).toMatch(/no inspectable trainer/i);
+ expect(trainer.__replace.lastCallbacks).toBeNull();
+ } finally {
+ dispose();
+ stdoutSpy.mockRestore();
+ stderrSpy.mockRestore();
+ }
+ });
+});
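The stale-reload test above pins a "latest request wins" guard. A minimal sketch of that pattern, separate from any signal handling, might look like this. `LatestWins` and `reload` are hypothetical names; the handler under test keeps a plain counter instead of a class.

```typescript
// Monotonic sequence guard: each reload claims a number before awaiting.
class LatestWins {
  private current = 0;

  /** Claim the next sequence number synchronously, before any await. */
  begin(): number {
    this.current += 1;
    return this.current;
  }

  /** True only if no newer begin() happened since `seq` was claimed. */
  isCurrent(seq: number): boolean {
    return seq === this.current;
  }
}

// Runs `load`, then applies the result only if this call is still the newest.
async function reload<T>(
  guard: LatestWins,
  load: () => Promise<T>,
  apply: (value: T) => void,
): Promise<boolean> {
  const seq = guard.begin();
  const value = await load();
  if (!guard.isCurrent(seq)) return false; // a newer reload superseded this one
  apply(value);
  return true;
}
```

Because `begin()` runs before the first `await`, two back-to-back calls always receive their numbers in arrival order, whichever load finishes first.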
diff --git a/packages/arkor/src/core/runnerSignals.ts b/packages/arkor/src/core/runnerSignals.ts
new file mode 100644
index 00000000..5c71a27d
--- /dev/null
+++ b/packages/arkor/src/core/runnerSignals.ts
@@ -0,0 +1,215 @@
+import { moduleCacheBustUrl } from "./moduleCacheBust";
+import { SIGNAL_EXIT_CODE } from "./signalExit";
+import {
+ findInspectableTrainer,
+ replaceTrainerCallbacks,
+ requestTrainerEarlyStop,
+} from "./trainerInspection";
+import type { Trainer, TrainerCallbacks } from "./types";
+
+const SHUTDOWN_SIGNALS = ["SIGTERM", "SIGINT", "SIGHUP"] as const;
+const CALLBACK_RELOAD_SIGNAL = "SIGUSR2" as const;
+
+/**
+ * Two-stage shutdown handling so HMR rebuilds (Studio sends SIGTERM)
+ * preserve the in-flight checkpoint work:
+ *
+ * - 1st signal → `trainer.requestEarlyStop()`. The trainer keeps
+ * running, lets the next `checkpoint.saved` event land, then issues
+ * `cancel()`.
+ * - 2nd signal → immediate `process.exit(POSIX 128+signo)`:
+ * 130 for SIGINT, 143 for SIGTERM, 129 for SIGHUP. Escape hatch
+ * for an impatient operator or a hung early-stop. Per-signal
+ * exit code so parent shells see the actual interruption type.
+ *
+ * The returned dispose function removes the handlers so a normal
+ * `wait()` completion doesn't leave stale listeners behind: important
+ * because `runTrainer` can be called multiple times in tests within a
+ * single Node process.
+ */
+export function installShutdownHandlers(trainer: Trainer): () => void {
+ let signalCount = 0;
+ const handler = (signal: (typeof SHUTDOWN_SIGNALS)[number]): void => {
+ signalCount += 1;
+ if (signalCount > 1) {
+ process.stdout.write(
+ `Received second ${signal}; exiting without waiting for checkpoint.\n`,
+ );
+ // POSIX 128 + signo so the parent shell sees the right exit
+ // status: 130 for SIGINT (Ctrl-C twice), 129 for SIGHUP,
+ // 143 for SIGTERM. Hardcoding 143 misclassifies SIGINT and
+ // SIGHUP shutdowns as SIGTERM-style exits and breaks
+ // signal-aware orchestration. Defaults to 143 for any future
+ // signal we forget to map.
+ const code = SIGNAL_EXIT_CODE[signal] ?? 143;
+ process.exit(code);
+ // Explicit return so test mocks of process.exit (which don't
+ // actually terminate the worker) don't fall through into the
+ // early-stop path.
+ return;
+ }
+ process.stdout.write(
+ `Received ${signal}; early-stopping at next checkpoint…\n`,
+ );
+ // Drive the trainer's internal early-stop entry point via the
+ // `Symbol.for("arkor.trainer.requestEarlyStop")` brand attached by
+ // `createTrainer`. `runTrainer` also accepts hand-rolled
+ // `{ start, wait, cancel }` trainers; for those the brand is
+ // absent and `requestTrainerEarlyStop` transparently falls back
+ // to `trainer.cancel()` (best-effort, matches the public contract).
+ //
+ // Track whether the early-stop chain rejected so the final
+ // `process.exit` carries a non-zero status. The previous version
+ // always exited 0, which made `arkor start || cleanup_on_failure`
+ // wrappers classify a cancel-POST rejection (cloud-api transient
+ // failure, network drop) as a clean run despite the stderr
+ // diagnostic. POSIX 128 + signo on failure mirrors the
+ // second-signal exit-code convention so parent shells see a
+ // signal-style nonzero status.
+ let earlyStopFailed = false;
+ requestTrainerEarlyStop(trainer)
+ .catch((err: unknown) => {
+ earlyStopFailed = true;
+ const msg = err instanceof Error ? err.message : String(err);
+ process.stderr.write(`requestEarlyStop failed: ${msg}\n`);
+ })
+ .finally(() => {
+ const code = earlyStopFailed
+ ? (SIGNAL_EXIT_CODE[signal] ?? 143)
+ : 0;
+ process.exit(code);
+ });
+ };
+ // Per-signal closure (vs a single shared listener registered on
+ // every signal): the closure captures `sig` at registration time
+ // so the handler doesn't depend on whatever Node passes as the
+ // event arg. Node's documented contract is to pass the signal
+ // name, but pinning the source via closure keeps the handler
+ // robust regardless and makes the registration → arg
+ // relationship explicit at the callsite. Stored in a Map so
+ // `process.off` can remove the exact closure (anonymous arrow
+ // would leak the listener since `process.off` matches by
+ // identity).
+ const signalHandlers = new Map<
+ (typeof SHUTDOWN_SIGNALS)[number],
+ () => void
+ >();
+ for (const sig of SHUTDOWN_SIGNALS) {
+ const fn = () => handler(sig);
+ signalHandlers.set(sig, fn);
+ process.on(sig, fn);
+ }
+ return () => {
+ for (const [sig, fn] of signalHandlers) process.off(sig, fn);
+ };
+}
+
+/**
+ * SIGUSR2 handler: re-import the freshly-rebuilt artefact and rotate
+ * the trainer's callback cell via the internal
+ * `Symbol.for("arkor.trainer.replaceCallbacks")` brand. The cloud-side
+ * training run is untouched; only the in-process callbacks change.
+ *
+ * Studio sends SIGUSR2 from the `/api/dev/events` HMR pipeline when
+ * (and only when) the rebuilt bundle's `JobConfig` hash matches the
+ * one captured at spawn time. A mismatch produces SIGTERM instead, which
+ * goes through `installShutdownHandlers` above.
+ */
+export function installCallbackReloadHandler(
+ trainer: Trainer,
+ entryPath: string,
+): () => void {
+ /**
+ * Monotonic counter for sequencing concurrent SIGUSR2 reloads.
+ * Bumped synchronously inside the signal handler *before* the
+ * dynamic-import await begins, so each in-flight reload knows its
+ * arrival order. When the import resolves, the IIFE compares its
+ * captured `seq` against `loadSeq` and silently drops the result
+ * if a newer signal already started a newer reload. Without this,
+ * two same-`configHash` rebuilds firing back-to-back can race on
+ * the import: the earlier import's bytes (now stale on disk)
+ * resolve *after* the newer one, and `replaceTrainerCallbacks`
+ * overwrites the freshly-loaded callbacks with the prior version,
+ * leaving the running job out of sync until the next rebuild.
+ * Mirrors the `buildSeq` guard in `studio/hmr.ts`'s
+ * `emitBuildSucceeded`.
+ */
+ let loadSeq = 0;
+ const handler = (): void => {
+ const seq = ++loadSeq;
+ // mtime+ctime+size cache-bust (vs `Date.now()`): Node's ESM
+ // loader never evicts module records, so a long `arkor start`
+ // session with frequent SIGUSR2 reloads would accumulate one
+ // record per signal forever. Keying on the actual artefact bytes
+ // (via `moduleCacheBustUrl`) collapses no-op signals onto the
+ // same URL; the leak is bounded to "one per real edit", which
+ // is fundamentally what HMR has to retain.
+ const url = moduleCacheBustUrl(entryPath);
+ void (async () => {
+ try {
+ const mod = (await import(url)) as Record<string, unknown>;
+ // A newer SIGUSR2 already started its own import while we
+ // were awaiting; drop our result so the latest edit wins.
+ if (seq !== loadSeq) return;
+ const callbacks = extractCallbacks(mod);
+ if (!callbacks) {
+ process.stderr.write(
+ "Callback reload skipped: rebuilt bundle has no inspectable trainer.\n",
+ );
+ return;
+ }
+ replaceTrainerCallbacks(trainer, callbacks);
+ process.stdout.write(
+ "Callbacks hot-reloaded; training run continues.\n",
+ );
+ } catch (err: unknown) {
+ const msg = err instanceof Error ? err.message : String(err);
+ process.stderr.write(`Callback reload failed: ${msg}\n`);
+ }
+ })();
+ };
+ // `process.on('SIGUSR2', ...)` can throw at registration time on
+ // platforms that don't support the signal (notably Windows: libuv's
+ // signal-wrap returns ENOSYS for SIGUSR2 on win32 and the error
+ // escapes to userland on some Node versions). The server-side
+ // `trainRegistry.safeKill(child, "SIGUSR2")` already detects this
+ // ("unsupported" → falls back to SIGTERM-restart), so an unarmed
+ // listener here is the documented contract on those platforms:
+ // quietly degrade to a no-op disposer rather than crashing
+ // `arkor start` at boot.
+ // Track registration success so the returned disposer never
+ // calls `process.off(...)` for a handler we never attached.
+ // Today this only fires for the early-return-no-op path where
+ // `process.on` threw at registration, but future Node versions
+ // could route `off` through the same libuv signal-wrap that
+ // throws for unsupported signals on Windows, and a symmetric
+ // throw inside the disposer would crash the `runTrainer` finally
+ // block instead of merely being a no-op.
+ let attached = false;
+ try {
+ process.on(CALLBACK_RELOAD_SIGNAL, handler);
+ attached = true;
+ } catch {
+ return () => {
+ // no-op: handler was never attached
+ };
+ }
+ return () => {
+ if (!attached) return;
+ process.off(CALLBACK_RELOAD_SIGNAL, handler);
+ };
+}
+
+/**
+ * Extract the user-supplied callbacks reference from a re-imported
+ * bundle. Delegates the entry-shape walk to `findInspectableTrainer`
+ * so SIGUSR2's view of "what counts as a trainer" stays identical to
+ * the HMR coordinator's `inspectBundle` and `runner.ts`'s
+ * `extractTrainer`. Returns `null` when no candidate carries the
+ * inspection brand.
+ */
+function extractCallbacks(
+ mod: Record<string, unknown>,
+): Partial | null {
+ return findInspectableTrainer(mod)?.callbacks ?? null;
+}
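The `loadSeq` guard in `installCallbackReloadHandler` above can be read as a general "latest request wins" pattern. The following is a minimal sketch of that idea in isolation, not the project's code: `createSequenceGuard` and `reloadLatest` are hypothetical names introduced only for illustration.

```typescript
// Hypothetical helper (not part of arkor): hands out tokens that report
// whether they still belong to the most recently started operation.
interface SequenceGuard {
  begin(): () => boolean;
}

function createSequenceGuard(): SequenceGuard {
  let latest = 0;
  return {
    begin(): () => boolean {
      // Bumped synchronously, so arrival order is fixed before any await.
      const mine = ++latest;
      return () => mine === latest;
    },
  };
}

// Hypothetical usage: apply an async result only if no newer call started
// while this one was awaiting. Returns false when the result was dropped.
async function reloadLatest<T>(
  guard: SequenceGuard,
  load: () => Promise<T>,
  apply: (value: T) => void,
): Promise<boolean> {
  const isCurrent = guard.begin();
  const value = await load();
  if (!isCurrent()) return false; // stale: a newer reload superseded us
  apply(value);
  return true;
}
```

Taking the token before the `await` is the important part: if the counter were read only after the load resolved, two overlapping loads could not tell which one started later.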
diff --git a/packages/arkor/src/core/schemas.test.ts b/packages/arkor/src/core/schemas.test.ts
index 50ff2571..46a9cf18 100644
--- a/packages/arkor/src/core/schemas.test.ts
+++ b/packages/arkor/src/core/schemas.test.ts
@@ -64,7 +64,7 @@ describe("trainingJobSchema", () => {
});
it("normalises non-null startedAt/completedAt: strings pass through, Dates ISO-coerce", () => {
- // Branch coverage for the `toIsoOrNull` transforms — the `null`
+ // Branch coverage for the `toIsoOrNull` transforms: the `null`
// branch is exercised by every other test in this file (the
// `valid` fixture has both fields null), but the truthy branch
// only fires when the field carries an actual timestamp. Strings
diff --git a/packages/arkor/src/core/signalExit.ts b/packages/arkor/src/core/signalExit.ts
new file mode 100644
index 00000000..81e42eb7
--- /dev/null
+++ b/packages/arkor/src/core/signalExit.ts
@@ -0,0 +1,21 @@
+/**
+ * Shared POSIX `128 + signo` exit code mapping for the runner's
+ * two-stage shutdown handler (`core/runnerSignals.ts`) and the CLI's
+ * cleanup-hook coordinator (`cli/cleanupHooks.ts`). The two maps
+ * MUST agree: AGENTS.md describes them as a single contract, and a
+ * drift (e.g. someone adding SIGQUIT to one but not the other)
+ * would make the runner and the dev-server exit with inconsistent
+ * codes for the same signal, the exact parent-shell-classification
+ * regression the per-signal mapping was introduced to prevent.
+ *
+ * Lives in `core/` (not `cli/`) so both consumers can import it
+ * without `cli/` ↔ `core/` cycles: `cli/cleanupHooks.ts` imports
+ * from `core/`, but `core/` must not depend on `cli/`.
+ */
+export const SIGNAL_EXIT_CODE = {
+ SIGHUP: 129,
+ SIGINT: 130,
+ SIGTERM: 143,
+} as const;
+
+export type ShutdownSignal = keyof typeof SIGNAL_EXIT_CODE;
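The literals in `SIGNAL_EXIT_CODE` follow the shell convention of `128 + signo`. As a rough sketch of where the numbers come from, assuming the standard POSIX signal numbers for these three signals, the mapping can be derived rather than hard-coded. `SIGNAL_NUMBER` and `exitCodeForSignal` are hypothetical names used only for this illustration.

```typescript
// Standard POSIX signal numbers for the three shutdown signals.
const SIGNAL_NUMBER = { SIGHUP: 1, SIGINT: 2, SIGTERM: 15 } as const;
type IllustrativeShutdownSignal = keyof typeof SIGNAL_NUMBER;

// Hypothetical helper: the exit code a shell reports for a process that
// terminated because of `signal` (128 plus the signal number).
function exitCodeForSignal(signal: IllustrativeShutdownSignal): number {
  return 128 + SIGNAL_NUMBER[signal];
}
```

Under these assumptions `exitCodeForSignal("SIGTERM")` evaluates to 143, which matches the fallback used when a signal is missing from the table.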
diff --git a/packages/arkor/src/core/trainer.test.ts b/packages/arkor/src/core/trainer.test.ts
index f9ef2f85..ff1d02ed 100644
--- a/packages/arkor/src/core/trainer.test.ts
+++ b/packages/arkor/src/core/trainer.test.ts
@@ -3,6 +3,10 @@ import { mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { createTrainer } from "./trainer";
+import {
+ replaceTrainerCallbacks,
+ requestTrainerEarlyStop,
+} from "./trainerInspection";
import { writeState } from "./state";
import type { AnonymousCredentials } from "./credentials";
@@ -266,7 +270,7 @@ describe("createTrainer (credentials defaulting)", () => {
model: "m",
dataset: { type: "huggingface", name: "x" },
},
- // Note: NO `credentials` here — trainer must call ensureCredentials.
+ // Note: NO `credentials` here, so trainer must call ensureCredentials.
{
baseUrl: "http://mock",
cwd: localCwd,
@@ -667,7 +671,7 @@ describe("createTrainer (SSE event stream)", () => {
});
});
-// Regression for ENG-406 — the previous reconnect loop had no upper bound
+// Regression for ENG-406: the previous reconnect loop had no upper bound
// and no jitter, so a permanently-down cloud-api would keep retrying every
// `reconnectDelayMs` forever (and on recovery several SDK clients would
// reconnect at exactly the same instant).
@@ -795,7 +799,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
step: 1,
loss: 1,
})}\n\n`,
- // No terminal event — stream closes cleanly, outer loop reconnects.
+ // No terminal event: stream closes cleanly, outer loop reconnects.
],
},
{ kind: "throw", error: new TypeError("fetch failed") },
@@ -833,8 +837,8 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
// when `Math.random()` lands near 1.
// Codex review on PR #13 (round 3) flagged that a 200-OK stream that
// EOFs without emitting any frame would loop forever at the base delay
- // — `maxReconnectAttempts` was bypassed because clean closes never
- // touched the failure counter. Misconfigured proxies / load-balancers
+ // because `maxReconnectAttempts` was bypassed (clean closes never
+ // touched the failure counter). Misconfigured proxies / load-balancers
// that accept the connection and immediately drop it would hang
// `wait()` indefinitely.
it("counts clean closes with no frames toward maxReconnectAttempts", async () => {
@@ -955,7 +959,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
};
// The trainer fires `POST /v1/jobs` synchronously inside the start()
// path, so cancel() needs the job row to be assigned. We never open the
- // event stream — cancel() should not depend on it.
+ // event stream; cancel() should not depend on it.
const sse = [
`id: 1\nevent: training.completed\ndata: ${JSON.stringify({
type: "training.completed",
@@ -1012,7 +1016,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
const original = globalThis.fetch;
globalThis.fetch = fetcher;
try {
- // Start the run by awaiting wait() — the streamed completion event
+ // Start the run by awaiting wait(): the streamed completion event
// closes the loop quickly so cancel() runs against a fully-resolved
// startedJob/scope pair.
await trainer.wait();
@@ -1086,7 +1090,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
});
it("skips malformed event payloads without aborting the stream", async () => {
- // Branch coverage for the `try/catch` around JSON.parse — a single
+ // Branch coverage for the `try/catch` around JSON.parse: a single
// malformed `data:` line shouldn't tear down the whole training run.
// Send one garbage frame followed by a real terminal event.
await writeState(
@@ -1153,7 +1157,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
});
it("recovers when the SSE body itself errors mid-stream", async () => {
- // Branch coverage for the catch around the for-await iterator —
+ // Branch coverage for the catch around the for-await iterator:
// covers the case where the stream's underlying body emits an error
// (e.g. a network disconnect partway through). The reconnect loop
// should treat it as a failure, count it toward the limit, then
@@ -1263,7 +1267,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
{ orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
cwd,
);
- // No fetch mock at all — if cancel() reached the API we'd see a real
+ // No fetch mock at all: if cancel() reached the API we'd see a real
// network error. Safety net for callers that wire up cancel() to
// SIGINT before kicking off the run.
const trainer = createTrainer(
@@ -1413,3 +1417,1320 @@ describe("createTrainer (reconnect backoff + max attempts)", () => {
}
});
});
+
+describe("createTrainer (early stop)", () => {
+ const minimalJobRow = {
+ id: "j-stop",
+ orgId: "o1",
+ projectId: "p1",
+ name: "run",
+ status: "queued",
+ config: {
+ model: "m",
+ datasetSource: { type: "huggingface", name: "x" },
+ },
+ createdAt: "2026-01-01T00:00:00Z",
+ startedAt: null,
+ completedAt: null,
+ };
+
+ it("calls cancel after the next checkpoint when early-stop is requested mid-run", async () => {
+ await writeState(
+ { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+ cwd,
+ );
+ // SSE stream: training.started → training.log → checkpoint.saved.
+ // The checkpoint event is the trigger for the early-stop branch in
+ // dispatch(); after that, the loop should treat the run as terminal
+ // (we assert this by ending the wait() promise without sending
+ // training.completed).
+ const sse = [
+ `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+ type: "training.started",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:01Z",
+ })}\n\n`,
+ `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+ type: "training.log",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:02Z",
+ step: 1,
+ loss: 0.5,
+ })}\n\n`,
+ `id: 3\nevent: checkpoint.saved\ndata: ${JSON.stringify({
+ type: "checkpoint.saved",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:03Z",
+ step: 10,
+ })}\n\n`,
+ ];
+
+ let cancelCalls = 0;
+ const fetcher: typeof fetch = (async (
+ input: RequestInfo | URL,
+ init?: RequestInit,
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && url.includes("/v1/jobs?")) {
+ return new Response(JSON.stringify({ job: minimalJobRow }), {
+ status: 201,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+ return new Response(sseStream(sse), {
+ status: 200,
+ headers: { "content-type": "text/event-stream" },
+ });
+ }
+ if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+ cancelCalls += 1;
+ return new Response(JSON.stringify({ ok: true }), {
+ status: 200,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ throw new Error(`unexpected fetch: ${method} ${url}`);
+ }) as typeof fetch;
+
+ const trainer = createTrainer(
+ {
+ name: "run",
+ model: "m",
+ dataset: { type: "huggingface", name: "x" },
+ callbacks: {
+ // Arm the early-stop latch from inside the on-log callback so it
+ // fires before the checkpoint dispatch (mirrors the real CLI
+ // path where SIGTERM arrives mid-run). Fire-and-forget so the
+ // dispatch loop isn't blocked waiting for the latch's own
+ // checkpoint trigger to arrive.
+ onLog: () => {
+ void requestTrainerEarlyStop(trainer, { timeoutMs: 60_000 });
+ },
+ },
+ },
+ { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+ );
+ const original = globalThis.fetch;
+ globalThis.fetch = fetcher;
+ let result: Awaited<ReturnType<typeof trainer.wait>>;
+ try {
+ result = await trainer.wait();
+ } finally {
+ globalThis.fetch = original;
+ }
+ expect(cancelCalls).toBe(1);
+ // Regression: the early-stop checkpoint branch returns
+ // `{ terminal: true }` to break out of `wait()`'s loop without
+ // waiting for a cloud-side terminal event. The `TrainingResult`
+ // it resolves with must therefore reflect a terminal status
+ // locally; otherwise `wait()` violates its documented contract
+ // ("Resolve when the job reaches a terminal status") and a
+ // subsequent `requestEarlyStop` wouldn't see the
+ // `TERMINAL_STATUSES` short-circuit.
+ expect(result.job.status).toBe("cancelled");
+ expect(result.job.completedAt).toBe("2026-01-01T00:00:03Z");
+ });
+
+ it("early-stop checkpoint branch returns the checkpoint's artifacts in wait()'s result", async () => {
+ // Regression: the early-stop terminal return used
+ // `terminalResult?.artifacts ?? []`, but `wait()` always calls
+ // `dispatch(parsed, null)` so `terminalResult` was forever
+ // null → `wait()` resolved with `artifacts: []` even though
+ // the checkpoint event carries the very artefacts the
+ // early-stop existed to *preserve* (the whole point of the
+ // graceful-stop-at-next-checkpoint pattern is to keep that
+ // work). Now we return `event.artifacts` directly so the
+ // checkpoint's outputs make it into the resolved result.
+ await writeState(
+ { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+ cwd,
+ );
+ const checkpointArtifacts = [
+ { kind: "lora_adapter" as const, path: "/checkpoints/step-10/" },
+ { kind: "metric" as const, name: "loss", value: 0.42 },
+ ];
+ const sse = [
+ `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+ type: "training.started",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:01Z",
+ })}\n\n`,
+ `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+ type: "training.log",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:02Z",
+ step: 1,
+ loss: 0.5,
+ })}\n\n`,
+ `id: 3\nevent: checkpoint.saved\ndata: ${JSON.stringify({
+ type: "checkpoint.saved",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:03Z",
+ step: 10,
+ artifacts: checkpointArtifacts,
+ })}\n\n`,
+ ];
+ const fetcher: typeof fetch = (async (
+ input: RequestInfo | URL,
+ init?: RequestInit,
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && url.includes("/v1/jobs?")) {
+ return new Response(JSON.stringify({ job: minimalJobRow }), {
+ status: 201,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+ return new Response(sseStream(sse), {
+ status: 200,
+ headers: { "content-type": "text/event-stream" },
+ });
+ }
+ if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+ return new Response(JSON.stringify({ ok: true }), {
+ status: 200,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ throw new Error(`unexpected fetch: ${method} ${url}`);
+ }) as typeof fetch;
+
+ const trainer = createTrainer(
+ {
+ name: "run",
+ model: "m",
+ dataset: { type: "huggingface", name: "x" },
+ callbacks: {
+ onLog: () => {
+ void requestTrainerEarlyStop(trainer, { timeoutMs: 60_000 });
+ },
+ },
+ },
+ { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+ );
+ const original = globalThis.fetch;
+ globalThis.fetch = fetcher;
+ let result: Awaited<ReturnType<typeof trainer.wait>>;
+ try {
+ result = await trainer.wait();
+ } finally {
+ globalThis.fetch = original;
+ }
+ // The artefacts the checkpoint event carried must travel
+ // through to the wait() result; that's the whole point of
+ // graceful-stop-at-next-checkpoint preserving the in-flight
+ // work.
+ expect(result.artifacts).toEqual(checkpointArtifacts);
+ // Sibling assertion: status is still terminal (covered more
+ // thoroughly in the dedicated test above; this one just
+ // ensures we didn't accidentally regress the status while
+ // changing the artefacts return).
+ expect(result.job.status).toBe("cancelled");
+ });
+
+ it("early-stop branch still settles when the user's onCheckpoint callback throws (no SIGTERM hang)", async () => {
+ // Regression: the early-stop branch ran AFTER
+ // `await callbacks.onCheckpoint?.(ctx)`. A user-callback throw
+ // would propagate out of that await before the early-stop
+ // cancel + latch settlement could run, leaving
+ // `earlyStopDeferred` pending. The runner's
+ // `installShutdownHandlers` awaits that deferred → SIGTERM
+ // shutdown hangs until the (default 5-min) timeout fallback
+ // fires. The fix wraps `onCheckpoint` in try/catch, runs the
+ // early-stop branch unconditionally, then re-throws the
+ // captured callback error so wait()'s reconnect loop keeps
+ // its prior semantics.
+ await writeState(
+ { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+ cwd,
+ );
+ const sse = [
+ `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+ type: "training.started",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:01Z",
+ })}\n\n`,
+ `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+ type: "training.log",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:02Z",
+ step: 1,
+ loss: 0.5,
+ })}\n\n`,
+ `id: 3\nevent: checkpoint.saved\ndata: ${JSON.stringify({
+ type: "checkpoint.saved",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:03Z",
+ step: 10,
+ })}\n\n`,
+ ];
+ let cancelCalls = 0;
+ const fetcher: typeof fetch = (async (
+ input: RequestInfo | URL,
+ init?: RequestInit,
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && url.includes("/v1/jobs?")) {
+ return new Response(JSON.stringify({ job: minimalJobRow }), {
+ status: 201,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+ return new Response(sseStream(sse), {
+ status: 200,
+ headers: { "content-type": "text/event-stream" },
+ });
+ }
+ if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+ cancelCalls += 1;
+ return new Response(JSON.stringify({ ok: true }), {
+ status: 200,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ throw new Error(`unexpected fetch: ${method} ${url}`);
+ }) as typeof fetch;
+
+ let armedPromise: Promise<void> | null = null;
+ let armedResult: "resolved" | "rejected" | "pending" = "pending";
+ const trainer = createTrainer(
+ {
+ name: "run",
+ model: "m",
+ dataset: { type: "huggingface", name: "x" },
+ callbacks: {
+ onLog: () => {
+ if (armedPromise === null) {
+ armedPromise = requestTrainerEarlyStop(trainer, {
+ timeoutMs: 60_000,
+ });
+ armedPromise.then(
+ () => {
+ armedResult = "resolved";
+ },
+ () => {
+ armedResult = "rejected";
+ },
+ );
+ }
+ },
+ onCheckpoint: () => {
+ // User callback throws DURING the checkpoint that
+ // would normally trigger early-stop. Without the
+ // try/catch wrap this throw would skip the
+ // early-stop branch → latch pending → SIGTERM hang
+ // for up to 60s (our `timeoutMs`).
+ throw new Error("user onCheckpoint boom");
+ },
+ },
+ },
+ {
+ baseUrl: "http://mock",
+ credentials: creds,
+ cwd,
+ reconnectDelayMs: 1,
+ // Cap reconnects at 0 so the user-callback throw
+ // surfaces as a wait() rejection instead of
+ // looping forever (handleFailure would otherwise
+ // reconnect after the throw escapes dispatch).
+ maxReconnectAttempts: 0,
+ },
+ );
+ const original = globalThis.fetch;
+ globalThis.fetch = fetcher;
+ try {
+ // wait() rejects: handleFailure wraps the user callback
+ // throw because maxReconnectAttempts is 0.
+ await expect(trainer.wait()).rejects.toThrow();
+ // Critical: the latch SETTLED via the early-stop branch
+ // (resolve), not via the 60-second timeout. The cancel POST
+ // also fired (early-stop reached the cancel call before the
+ // throw was re-raised). Together: shutdown wouldn't hang.
+ await new Promise((r) => setImmediate(r));
+ expect(armedResult).toBe("resolved");
+ expect(cancelCalls).toBe(1);
+ } finally {
+ globalThis.fetch = original;
+ }
+ });
+
+ it("re-throws a falsy onCheckpoint throw (e.g. `throw null`) instead of silently suppressing it", async () => {
+ // Regression: `onCheckpointError !== null` was the discriminant for
+ // "did the user callback throw?". User code can legitimately
+ // `throw null` / `throw 0` / `throw ""`; the truthiness of
+ // `onCheckpointError` was then indistinguishable from the no-error
+ // path, and the post-early-stop re-throw at the end of the
+ // checkpoint dispatch silently dropped the user's signal. With the
+ // fix, a separate `onCheckpointThrew` boolean discriminates so
+ // ANY throwable (including falsy ones) propagates uniformly.
+ await writeState(
+ { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+ cwd,
+ );
+ const sse = [
+ `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+ type: "training.started",
+ jobId: "j-falsy",
+ timestamp: "2026-01-01T00:00:01Z",
+ })}\n\n`,
+ `id: 2\nevent: checkpoint.saved\ndata: ${JSON.stringify({
+ type: "checkpoint.saved",
+ jobId: "j-falsy",
+ timestamp: "2026-01-01T00:00:02Z",
+ step: 5,
+ })}\n\n`,
+ ];
+ const fetcher: typeof fetch = (async (
+ input: RequestInfo | URL,
+ init?: RequestInit,
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && url.includes("/v1/jobs?")) {
+ return new Response(JSON.stringify({ job: minimalJobRow }), {
+ status: 201,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ if (method === "GET" && url.includes("/v1/jobs/j-falsy/events/stream")) {
+ return new Response(sseStream(sse), {
+ status: 200,
+ headers: { "content-type": "text/event-stream" },
+ });
+ }
+ throw new Error(`unexpected fetch: ${method} ${url}`);
+ }) as typeof fetch;
+
+ const trainer = createTrainer(
+ {
+ name: "run",
+ model: "m",
+ dataset: { type: "huggingface", name: "x" },
+ callbacks: {
+ // `throw null`: a falsy throwable. Under the bug this was
+ // silently swallowed and wait() resolved as if no callback
+ // had thrown. With the fix `wait()` rejects (handleFailure
+ // wraps the throw because maxReconnectAttempts is 0).
+ onCheckpoint: () => {
+ throw null;
+ },
+ },
+ },
+ {
+ baseUrl: "http://mock",
+ credentials: creds,
+ cwd,
+ reconnectDelayMs: 1,
+ maxReconnectAttempts: 0,
+ },
+ );
+ const original = globalThis.fetch;
+ globalThis.fetch = fetcher;
+ try {
+ // Under the bug `wait()` resolved cleanly (the falsy throw was
+ // captured by `onCheckpointError = err` but the
+ // `if (onCheckpointError !== null) throw` guard saw `null` and
+ // skipped the re-throw). With the fix `wait()` rejects.
+ await expect(trainer.wait()).rejects.toBeDefined();
+ } finally {
+ globalThis.fetch = original;
+ }
+ });
+
+ it("early-stop checkpoint branch rejects the deferred when cancel() throws (visible to shutdown handler)", async () => {
+ // Regression: previously, an `await trainer.cancel()` that threw
+ // (network failure / cloud-api 5xx during the cancel POST) was
+ // *swallowed*, the deferred resolved cleanly, and the runner
+ // exited 0: the UI declared the run cancelled while the cloud
+ // job kept running, orphaning GPU spend with no visible error.
+ // The fix REJECTS the deferred so the runner's
+ // `installShutdownHandlers` `.catch()` writes the failure to
+ // stderr, surfacing the issue to the operator. The latch is
+ // still always settled (resolved or rejected), so shutdown
+ // doesn't hang waiting for a checkpoint that will never come.
+ await writeState(
+ { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+ cwd,
+ );
+ const sse = [
+ `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+ type: "training.started",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:01Z",
+ })}\n\n`,
+ `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+ type: "training.log",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:02Z",
+ step: 1,
+ loss: 0.5,
+ })}\n\n`,
+ `id: 3\nevent: checkpoint.saved\ndata: ${JSON.stringify({
+ type: "checkpoint.saved",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:03Z",
+ step: 10,
+ })}\n\n`,
+ ];
+ let cancelAttempts = 0;
+ const fetcher: typeof fetch = (async (
+ input: RequestInfo | URL,
+ init?: RequestInit,
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && url.includes("/v1/jobs?")) {
+ return new Response(JSON.stringify({ job: minimalJobRow }), {
+ status: 201,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+ return new Response(sseStream(sse), {
+ status: 200,
+ headers: { "content-type": "text/event-stream" },
+ });
+ }
+ if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+ cancelAttempts += 1;
+ // Simulate the cloud-api being unreachable mid-cancel.
+ throw new TypeError("fetch failed");
+ }
+ throw new Error(`unexpected fetch: ${method} ${url}`);
+ }) as typeof fetch;
+
+ // Capture the very-first armed early-stop promise so we can
+ // assert its settlement state below. The `onLog` callback
+ // closes over `trainer` (it calls
+ // `requestTrainerEarlyStop(trainer, ...)`), so the promise slot
+ // is declared first as `let` and assigned inside the callback.
+ let armedPromise: Promise<void> | null = null;
+ let armedResult: "resolved" | "rejected" | "pending" = "pending";
+ let armedError: unknown = null;
+ const trainer = createTrainer(
+ {
+ name: "run",
+ model: "m",
+ dataset: { type: "huggingface", name: "x" },
+ callbacks: {
+ onLog: () => {
+ // Arm exactly once and capture the returned promise.
+ // requestTrainerEarlyStop is idempotent across repeat
+ // calls, but we only need the FIRST armed deferred:
+ // the cancel-throw rejects exactly that promise.
+ if (armedPromise === null) {
+ armedPromise = requestTrainerEarlyStop(trainer, {
+ timeoutMs: 60_000,
+ });
+ armedPromise.then(
+ () => {
+ armedResult = "resolved";
+ },
+ (err: unknown) => {
+ armedResult = "rejected";
+ armedError = err;
+ },
+ );
+ }
+ },
+ },
+ },
+ { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+ );
+ const original = globalThis.fetch;
+ globalThis.fetch = fetcher;
+ try {
+ await trainer.wait();
+ // Flush microtasks so the .then(resolve, reject) handler
+ // observes the settlement before we assert.
+ await new Promise((r) => setImmediate(r));
+ } finally {
+ globalThis.fetch = original;
+ }
+ // cancel() was attempted (and threw).
+ expect(cancelAttempts).toBe(1);
+ // The armed deferred REJECTED: the runner's `.catch()` would
+ // see this error and log it to stderr instead of silently
+ // exiting 0. Critically: it didn't hang on "pending"; the
+ // failure case still settles, just via reject not resolve.
+ expect(armedResult).toBe("rejected");
+ expect(armedError).toBeInstanceOf(TypeError);
+ expect((armedError as Error).message).toBe("fetch failed");
+ });
+
+ it("early-stop checkpoint branch labels run as `failed` even when cancel throws a falsy value (not `null` discriminant)", async () => {
+ // Regression: the cancel-failure branch used to be discriminated
+ // by `cancelError !== null`, but user-side code can legitimately
+ // `throw null` / `throw 0` / `throw ""`. In those cases the
+ // captured `cancelError` would still read as falsy / `null` and
+ // the run would be silently labelled `"cancelled"` even though
+ // the cancel POST genuinely rejected, lying about cloud-side
+ // state that may still be running. Fix discriminates via a
+ // dedicated boolean flag and additionally wraps non-Error
+ // throws when rejecting the deferred so the SIGTERM handler's
+ // `.catch(err => err.message)` doesn't crash on a missing
+ // property.
+ await writeState(
+ { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+ cwd,
+ );
+ const sse = [
+ `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+ type: "training.started",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:01Z",
+ })}\n\n`,
+ `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+ type: "training.log",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:02Z",
+ step: 1,
+ loss: 0.5,
+ })}\n\n`,
+ `id: 3\nevent: checkpoint.saved\ndata: ${JSON.stringify({
+ type: "checkpoint.saved",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:03Z",
+ step: 10,
+ })}\n\n`,
+ ];
+ const fetcher: typeof fetch = (async (
+ input: RequestInfo | URL,
+ init?: RequestInit,
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && url.includes("/v1/jobs?")) {
+ return new Response(JSON.stringify({ job: minimalJobRow }), {
+ status: 201,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+ return new Response(sseStream(sse), {
+ status: 200,
+ headers: { "content-type": "text/event-stream" },
+ });
+ }
+ if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+ // Falsy non-Error throw: under the bug, the run would be
+ // labelled "cancelled" because `cancelError !== null` is
+ // false when the catch reassigned `cancelError = null`.
+ // eslint-disable-next-line no-throw-literal
+ throw null;
+ }
+ throw new Error(`unexpected fetch: ${method} ${url}`);
+ }) as typeof fetch;
+
+ let armedPromise: Promise<void> | null = null;
+ let armedResult: "resolved" | "rejected" | "pending" = "pending";
+ let armedError: unknown = null;
+ const trainer = createTrainer(
+ {
+ name: "run",
+ model: "m",
+ dataset: { type: "huggingface", name: "x" },
+ callbacks: {
+ onLog: () => {
+ if (armedPromise === null) {
+ armedPromise = requestTrainerEarlyStop(trainer, {
+ timeoutMs: 60_000,
+ });
+ armedPromise.then(
+ () => {
+ armedResult = "resolved";
+ },
+ (err: unknown) => {
+ armedResult = "rejected";
+ armedError = err;
+ },
+ );
+ }
+ },
+ },
+ },
+ { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+ );
+ const original = globalThis.fetch;
+ globalThis.fetch = fetcher;
+ let result: Awaited<ReturnType<typeof trainer.wait>>;
+ try {
+ result = await trainer.wait();
+ await new Promise((r) => setImmediate(r));
+ } finally {
+ globalThis.fetch = original;
+ }
+ // The local job state reflects the cancel FAILURE, not a clean
+ // cancel. Under the bug this was `"cancelled"`.
+ expect(result.job.status).toBe("failed");
+ // The armed deferred rejected, and the rejection value is
+ // wrapped in a real Error so downstream `.catch(err => err.message)`
+ // chains don't crash on `null.message`.
+ expect(armedResult).toBe("rejected");
+ expect(armedError).toBeInstanceOf(Error);
+ expect((armedError as Error).message).toBe("null");
+ });
+
+ it("resolves the early-stop latch when the run hits a terminal event before the next checkpoint", async () => {
+ // Regression: previously `requestEarlyStop()`'s deferred was
+ // only resolved by (a) the checkpoint-triggered cancel branch
+ // or (b) the timeout fallback. If the run reached
+ // `training.completed` / `training.failed` *before* another
+ // checkpoint landed (a common case for short jobs or runs that
+ // had already saved their last checkpoint when SIGTERM arrived),
+ // the deferred stayed pending until the (default 5-min) timeout
+ // fired; the SIGTERM handler in `installShutdownHandlers`
+ // awaits that promise before exit, so shutdown was delayed up to
+ // `timeoutMs`. Both terminal branches now settle the latch
+ // explicitly so the signal path completes immediately when the
+ // job is already terminal.
+ await writeState(
+ { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+ cwd,
+ );
+ // started → log (arms early-stop) → completed; no checkpoint.saved
+ // in between, so the checkpoint-triggered resolution path is *not*
+ // exercised; only the new terminal-branch settlement is.
+ const sse = [
+ `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+ type: "training.started",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:01Z",
+ })}\n\n`,
+ `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+ type: "training.log",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:02Z",
+ step: 1,
+ loss: 0.5,
+ })}\n\n`,
+ `id: 3\nevent: training.completed\ndata: ${JSON.stringify({
+ type: "training.completed",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:03Z",
+ artifacts: [],
+ })}\n\n`,
+ ];
+
+ let cancelCalls = 0;
+ const fetcher: typeof fetch = (async (
+ input: RequestInfo | URL,
+ init?: RequestInit,
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && url.includes("/v1/jobs?")) {
+ return new Response(JSON.stringify({ job: minimalJobRow }), {
+ status: 201,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+ return new Response(sseStream(sse), {
+ status: 200,
+ headers: { "content-type": "text/event-stream" },
+ });
+ }
+ if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+ cancelCalls += 1;
+ return new Response(JSON.stringify({ ok: true }), {
+ status: 200,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ throw new Error(`unexpected fetch: ${method} ${url}`);
+ }) as typeof fetch;
+
+ let stopResolved = false;
+ const trainer = createTrainer(
+ {
+ name: "run",
+ model: "m",
+ dataset: { type: "huggingface", name: "x" },
+ callbacks: {
+ onLog: () => {
+ // Long timeout: if the fix regresses, this test would
+ // hang for ~60s before the timer fires. With the
+ // terminal-branch settlement, the deferred resolves the
+ // moment `training.completed` lands.
+ void requestTrainerEarlyStop(trainer, {
+ timeoutMs: 60_000,
+ }).then(() => {
+ stopResolved = true;
+ });
+ },
+ },
+ },
+ { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+ );
+
+ const original = globalThis.fetch;
+ globalThis.fetch = fetcher;
+ try {
+ const result = await trainer.wait();
+ // Flush microtasks so the .then() chain off `requestEarlyStop`
+ // observes the resolution before we assert.
+ await new Promise((r) => setImmediate(r));
+ expect(result.job.status).toBe("completed");
+ // No cancel POST was issued: the terminal branch just
+ // releases the latch; it doesn't cancel a run that already
+ // completed on its own.
+ expect(cancelCalls).toBe(0);
+ // The latch resolved via the terminal handler, not via the
+ // 60-second timeout. (The test would simply time out long
+ // before the timeout fired if this regressed.)
+ expect(stopResolved).toBe(true);
+ } finally {
+ globalThis.fetch = original;
+ }
+ });
+
+ it("settles the early-stop latch even when the user's onCompleted callback throws", async () => {
+ // Regression: previously `settleEarlyStopLatch()` was called
+ // *after* awaiting `callbacks.onCompleted` / `onFailed`. A
+ // thrown user callback propagated out of `dispatch()` before
+ // the settle ran, leaving `earlyStopDeferred` pending; the
+ // SIGTERM handler in `installShutdownHandlers` would block on
+ // that promise until the (default 5-min) timeout fired,
+ // delaying shutdown for a user-code bug. Wrapping in
+ // `try/finally` ensures the latch is released regardless,
+ // while preserving the throw's propagation through `wait()` so
+ // callers still see the original error.
+ await writeState(
+ { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+ cwd,
+ );
+ const sse = [
+ `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+ type: "training.started",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:01Z",
+ })}\n\n`,
+ `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+ type: "training.log",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:02Z",
+ step: 1,
+ loss: 0.5,
+ })}\n\n`,
+ `id: 3\nevent: training.completed\ndata: ${JSON.stringify({
+ type: "training.completed",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:03Z",
+ artifacts: [],
+ })}\n\n`,
+ ];
+
+ const fetcher: typeof fetch = (async (
+ input: RequestInfo | URL,
+ init?: RequestInit,
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && url.includes("/v1/jobs?")) {
+ return new Response(JSON.stringify({ job: minimalJobRow }), {
+ status: 201,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+ return new Response(sseStream(sse), {
+ status: 200,
+ headers: { "content-type": "text/event-stream" },
+ });
+ }
+ throw new Error(`unexpected fetch: ${method} ${url}`);
+ }) as typeof fetch;
+
+ let stopResolved = false;
+ let stopRejected = false;
+ const trainer = createTrainer(
+ {
+ name: "run",
+ model: "m",
+ dataset: { type: "huggingface", name: "x" },
+ callbacks: {
+ onLog: () => {
+ // Arm early-stop with a long timeout; if the latch
+ // isn't released by `finally`, this would hang for the
+ // full 60 seconds.
+ void requestTrainerEarlyStop(trainer, {
+ timeoutMs: 60_000,
+ }).then(
+ () => {
+ stopResolved = true;
+ },
+ () => {
+ stopRejected = true;
+ },
+ );
+ },
+ onCompleted: () => {
+ throw new Error("user callback boom");
+ },
+ },
+ },
+ {
+ baseUrl: "http://mock",
+ credentials: creds,
+ cwd,
+ reconnectDelayMs: 1,
+ // `wait()` catches dispatch throws and routes them through
+ // its reconnect loop; with the default unbounded retry the
+ // user-callback throw above would loop forever and the test
+ // would just time out. Cap retries at 0 so the first thrown
+ // dispatch surfaces as a `wait()` rejection; that lets us
+ // observe the *latch* settlement (the actual contract under
+ // test) cleanly.
+ maxReconnectAttempts: 0,
+ },
+ );
+
+ const original = globalThis.fetch;
+ globalThis.fetch = fetcher;
+ try {
+ // The user-callback throw is wrapped by `handleFailure` after
+ // `maxReconnectAttempts: 0` exhausts; the original error is
+ // preserved as `cause`. We just need wait() to settle so the
+ // test doesn't hang. The *body* of the assertion is the
+ // latch state below.
+ await expect(trainer.wait()).rejects.toThrow();
+ // The latch must have settled (via `finally`) BEFORE wait()
+ // rejected. Without the `try/finally` around `onCompleted`
+ // the latch would still be armed → `stopResolved` stays
+ // false → the test fails (rather than timing out, since
+ // `maxReconnectAttempts: 0` already unblocks wait()).
+ await new Promise((r) => setImmediate(r));
+ expect(stopResolved).toBe(true);
+ expect(stopRejected).toBe(false);
+ } finally {
+ globalThis.fetch = original;
+ }
+ });
+
+ it("falls back to immediate cancel when no checkpoint arrives within timeoutMs", async () => {
+ await writeState(
+ { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+ cwd,
+ );
+ // No checkpoint and no terminal event in the stream, only
+ // training.started. We hand-roll a stream that never ends so the
+ // timeout fallback is what actually triggers cancel.
+ let streamController: ReadableStreamDefaultController<Uint8Array> | null =
+ null;
+ const stallingStream = new ReadableStream<Uint8Array>({
+ start(controller) {
+ streamController = controller;
+ const enc = new TextEncoder();
+ controller.enqueue(
+ enc.encode(
+ `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+ type: "training.started",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:01Z",
+ })}\n\n`,
+ ),
+ );
+ },
+ });
+
+ let cancelCalls = 0;
+ const fetcher: typeof fetch = (async (
+ input: RequestInfo | URL,
+ init?: RequestInit,
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && url.includes("/v1/jobs?")) {
+ return new Response(JSON.stringify({ job: minimalJobRow }), {
+ status: 201,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+ return new Response(stallingStream, {
+ status: 200,
+ headers: { "content-type": "text/event-stream" },
+ });
+ }
+ if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+ cancelCalls += 1;
+ // Closing the stream now mimics cloud-api's response to a cancel:
+ // the SSE channel ends and wait() exits its loop.
+ streamController?.close();
+ return new Response(JSON.stringify({ ok: true }), {
+ status: 200,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ throw new Error(`unexpected fetch: ${method} ${url}`);
+ }) as typeof fetch;
+
+ const trainer = createTrainer(
+ {
+ name: "run",
+ model: "m",
+ dataset: { type: "huggingface", name: "x" },
+ },
+ { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+ );
+
+ const original = globalThis.fetch;
+ globalThis.fetch = fetcher;
+ try {
+ await trainer.start();
+ // Tiny timeout so the test doesn't actually wait 5 minutes.
+ await requestTrainerEarlyStop(trainer, { timeoutMs: 5 });
+ expect(cancelCalls).toBe(1);
+ // Regression: the timeout fallback used to leave
+ // `earlyStopRequested = true` and `startedJob.status =
+ // "running"`. A subsequent `requestEarlyStop()` call would
+ // then re-arm a fresh timer and re-issue cancel even though
+ // the early-stop already fired. With the latch reset and
+ // local terminal-status update mirroring the
+ // checkpoint-triggered branch, the second call hits the
+ // TERMINAL_STATUSES short-circuit and is a true no-op.
+ await requestTrainerEarlyStop(trainer, { timeoutMs: 5 });
+ expect(cancelCalls).toBe(1);
+ } finally {
+ globalThis.fetch = original;
+ }
+ });
+
+ it("timeout fallback rejects the deferred when cancel() throws (visible to shutdown handler)", async () => {
+ // Companion to the checkpoint-branch reject test: when no
+ // checkpoint arrives within `timeoutMs`, the timeout fallback
+ // does its own `trainer.cancel()`. Old code swallowed cancel
+ // errors and ALWAYS resolved the deferred, the same false-success
+ // failure mode the checkpoint branch had: the local runner
+ // exits cleanly while the cloud job keeps consuming GPU
+ // budget. The fix mirrors the checkpoint reject path: capture
+ // the error and reject the deferred so the runner's
+ // `.catch()` writes it to stderr.
+ await writeState(
+ { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+ cwd,
+ );
+ let streamController: ReadableStreamDefaultController<Uint8Array> | null =
+ null;
+ const stallingStream = new ReadableStream<Uint8Array>({
+ start(controller) {
+ streamController = controller;
+ const enc = new TextEncoder();
+ controller.enqueue(
+ enc.encode(
+ `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+ type: "training.started",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:01Z",
+ })}\n\n`,
+ ),
+ );
+ },
+ });
+
+ let cancelCalls = 0;
+ const fetcher: typeof fetch = (async (
+ input: RequestInfo | URL,
+ init?: RequestInit,
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && url.includes("/v1/jobs?")) {
+ return new Response(JSON.stringify({ job: minimalJobRow }), {
+ status: 201,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+ return new Response(stallingStream, {
+ status: 200,
+ headers: { "content-type": "text/event-stream" },
+ });
+ }
+ if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+ cancelCalls += 1;
+ // Close the stream so wait() exits its loop even though we
+ // throw on the cancel POST itself.
+ streamController?.close();
+ // Simulate cloud-api unreachable mid-cancel (transport).
+ throw new TypeError("fetch failed");
+ }
+ throw new Error(`unexpected fetch: ${method} ${url}`);
+ }) as typeof fetch;
+
+ const trainer = createTrainer(
+ {
+ name: "run",
+ model: "m",
+ dataset: { type: "huggingface", name: "x" },
+ },
+ { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+ );
+ const original = globalThis.fetch;
+ globalThis.fetch = fetcher;
+ try {
+ await trainer.start();
+ // Tiny timeout so the timeout fallback fires fast (no
+ // checkpoint will land; stream only carries
+ // training.started). The returned promise should REJECT
+ // because the cancel POST throws.
+ await expect(
+ requestTrainerEarlyStop(trainer, { timeoutMs: 5 }),
+ ).rejects.toThrow(/fetch failed/);
+ expect(cancelCalls).toBe(1);
+ } finally {
+ globalThis.fetch = original;
+ }
+ });
+
+ it("is a no-op before start() and resolves immediately", async () => {
+ const trainer = createTrainer(
+ {
+ name: "run",
+ model: "m",
+ dataset: { type: "huggingface", name: "x" },
+ },
+ { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+ );
+ // Should resolve without contacting cloud-api at all.
+ await requestTrainerEarlyStop(trainer, { timeoutMs: 1 });
+ });
+
+ it("waits out an in-flight start() so a SIGTERM during create-job can still cancel the new job", async () => {
+ // Codex P1 regression: `start()` sets `scope` *before* awaiting
+ // `client.createJob`, so there's a real window where the cloud
+ // job is being created but `startedJob` is still null. If a
+ // runner-side SIGTERM lands in that window, an immediate
+ // "no-op" early-stop would let `installShutdownHandlers` exit
+ // the process, leaving the just-created cloud job running
+ // with no cancel POST. The fix is to await the in-flight
+ // `start()` promise inside `requestEarlyStop()` so the cancel
+ // path sees a definite job id (or a definite start failure).
+ await writeState(
+ { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+ cwd,
+ );
+ let cancelCalls = 0;
+ let releaseCreateJob!: () => void;
+ const createJobReleased = new Promise<void>((resolve) => {
+ releaseCreateJob = resolve;
+ });
+ const fetcher: typeof fetch = (async (
+ input: RequestInfo | URL,
+ init?: RequestInit,
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && url.includes("/v1/jobs?")) {
+ // Hold createJob open so we can fire `requestEarlyStop`
+ // mid-flight. Once the test releases the gate, return a
+ // valid job: that establishes the post-create state
+ // requestEarlyStop should then act on (cancel POST).
+ await createJobReleased;
+ return new Response(JSON.stringify({ job: minimalJobRow }), {
+ status: 201,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+ cancelCalls += 1;
+ return new Response(JSON.stringify({ ok: true }), {
+ status: 200,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ throw new Error(`unexpected fetch: ${method} ${url}`);
+ }) as typeof fetch;
+
+ const trainer = createTrainer(
+ {
+ name: "run",
+ model: "m",
+ dataset: { type: "huggingface", name: "x" },
+ },
+ { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+ );
+
+ const original = globalThis.fetch;
+ globalThis.fetch = fetcher;
+ try {
+ // Fire start() but DON'T await; its createJob is gated.
+ const startPromise = trainer.start();
+ // Yield once so the start microtasks queue up to the
+ // `await client.createJob`.
+ await new Promise((r) => setImmediate(r));
+ // requestEarlyStop fires while start() is mid-flight. With
+ // the fix it awaits start() rather than no-op'ing immediately.
+ // Tiny `timeoutMs` so once `start()` resolves the latch's
+ // timeout-fallback fires the cancel POST quickly. There's no
+ // SSE stream in this test, so the checkpoint-driven path
+ // never arrives. We're testing the "stop awaited start()" leg
+ // of the contract, not the checkpoint plumbing.
+ const stopPromise = requestTrainerEarlyStop(trainer, {
+ timeoutMs: 50,
+ });
+ // Sanity: stop hasn't resolved yet; it's blocked on
+ // start() which is blocked on createJob.
+ let stopSettled = false;
+ void stopPromise.then(() => {
+ stopSettled = true;
+ });
+ await new Promise((r) => setImmediate(r));
+ expect(stopSettled).toBe(false);
+ // Release createJob → start() resolves → stop() proceeds.
+ releaseCreateJob();
+ await startPromise;
+ await stopPromise;
+ // The deciding behaviour: cancel POST was issued because the
+ // stop awaited start() and saw a real job id. Without the
+ // in-flight gate, stop would have returned immediately on
+ // the null `startedJob`, no cancel POST, cloud job orphaned.
+ expect(cancelCalls).toBe(1);
+ } finally {
+ globalThis.fetch = original;
+ }
+ });
+
+ it("replaceTrainerCallbacks (internal HMR brand) swaps the dispatched callbacks on the next event", async () => {
+ await writeState(
+ { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+ cwd,
+ );
+ const sse = [
+ `id: 1\nevent: training.started\ndata: ${JSON.stringify({
+ type: "training.started",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:01Z",
+ })}\n\n`,
+ `id: 2\nevent: training.log\ndata: ${JSON.stringify({
+ type: "training.log",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:02Z",
+ step: 1,
+ loss: 1,
+ })}\n\n`,
+ `id: 3\nevent: training.log\ndata: ${JSON.stringify({
+ type: "training.log",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:03Z",
+ step: 2,
+ loss: 0.5,
+ })}\n\n`,
+ `id: 4\nevent: training.completed\ndata: ${JSON.stringify({
+ type: "training.completed",
+ jobId: "j-stop",
+ timestamp: "2026-01-01T00:00:04Z",
+ })}\n\n`,
+ ];
+ const fetcher: typeof fetch = (async (
+ input: RequestInfo | URL,
+ init?: RequestInit,
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && url.includes("/v1/jobs?")) {
+ return new Response(JSON.stringify({ job: minimalJobRow }), {
+ status: 201,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) {
+ return new Response(sseStream(sse), {
+ status: 200,
+ headers: { "content-type": "text/event-stream" },
+ });
+ }
+ throw new Error(`unexpected fetch: ${method} ${url}`);
+ }) as typeof fetch;
+
+ const calls: string[] = [];
+ const trainer = createTrainer(
+ {
+ name: "run",
+ model: "m",
+ dataset: { type: "huggingface", name: "x" },
+ callbacks: {
+ onLog: ({ step }) => {
+ calls.push(`v1:onLog(${step})`);
+ // After the first onLog call, swap to v2 callbacks via the
+ // internal `Symbol.for("arkor.trainer.replaceCallbacks")`
+ // brand (the same brand `arkor dev`'s SIGUSR2 handler
+ // uses). The next event must dispatch via the new object.
+ if (step === 1) {
+ replaceTrainerCallbacks(trainer, {
+ onLog: ({ step: s }) => void calls.push(`v2:onLog(${s})`),
+ });
+ }
+ },
+ },
+ },
+ { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+ );
+ const original = globalThis.fetch;
+ globalThis.fetch = fetcher;
+ try {
+ await trainer.wait();
+ } finally {
+ globalThis.fetch = original;
+ }
+ expect(calls).toEqual(["v1:onLog(1)", "v2:onLog(2)"]);
+ });
+
+ it("is idempotent: repeated calls share the same in-flight promise", async () => {
+ await writeState(
+ { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" },
+ cwd,
+ );
+ let cancelCalls = 0;
+ const fetcher: typeof fetch = (async (
+ input: RequestInfo | URL,
+ init?: RequestInit,
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && url.includes("/v1/jobs?")) {
+ return new Response(JSON.stringify({ job: minimalJobRow }), {
+ status: 201,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) {
+ cancelCalls += 1;
+ return new Response(JSON.stringify({ ok: true }), {
+ status: 200,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ throw new Error(`unexpected fetch: ${method} ${url}`);
+ }) as typeof fetch;
+
+ const trainer = createTrainer(
+ {
+ name: "run",
+ model: "m",
+ dataset: { type: "huggingface", name: "x" },
+ },
+ { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 },
+ );
+ const original = globalThis.fetch;
+ globalThis.fetch = fetcher;
+ try {
+ await trainer.start();
+ const a = requestTrainerEarlyStop(trainer, { timeoutMs: 5 });
+ const b = requestTrainerEarlyStop(trainer, { timeoutMs: 5 });
+ await Promise.all([a, b]);
+ // The fallback timer fires once, so cancel is called once even though
+ // the early-stop brand was invoked twice.
+ expect(cancelCalls).toBe(1);
+ } finally {
+ globalThis.fetch = original;
+ }
+ });
+});
diff --git a/packages/arkor/src/core/trainer.ts b/packages/arkor/src/core/trainer.ts
index 7c9f9662..c6f99870 100644
--- a/packages/arkor/src/core/trainer.ts
+++ b/packages/arkor/src/core/trainer.ts
@@ -6,19 +6,28 @@ import {
type Credentials,
} from "./credentials";
import { ensureProjectState } from "./projectState";
+import {
+ attachTrainerCallbackReplacer,
+ attachTrainerEarlyStopper,
+ attachTrainerInspection,
+ type RequestEarlyStopOptions,
+} from "./trainerInspection";
import type {
CheckpointContext,
InferArgs,
JobConfig,
Trainer,
+ TrainerCallbacks,
TrainerInput,
TrainingJob,
TrainingLogContext,
TrainingResult,
} from "./types";
+const TERMINAL_STATUSES = new Set(["completed", "failed", "cancelled"]);
+
/**
- * Internal runtime context. Not part of the public API surface — exposed only
+ * Internal runtime context. Not part of the public API surface; exposed only
* for tests and advanced power-user scenarios that need to inject a mock
* `fetch` or override the working directory.
*
@@ -111,7 +120,7 @@ function buildJobConfig(input: TrainerInput): JobConfig {
/**
* Build a `Trainer` bound to the user's configuration.
*
- * Public signature: `createTrainer(input)` — runtime options like
+ * Public signature: `createTrainer(input)`. Runtime options like
* `baseUrl` / `credentials` / `cwd` come from the environment and `.arkor/`
* state, never from user code. The optional second argument is reserved for
* tests and advanced overrides.
@@ -144,6 +153,63 @@ export function createTrainer(
let startedJob: TrainingJob | null = null;
let scope: { orgSlug: string; projectSlug: string } | null = null;
let clientPromise: Promise | null = null;
+ // In-flight `start()` promise: non-null between the first
+ // `client.createJob` call and the `startedJob` assignment. Lets
+ // `requestEarlyStop()` detect the "scope set but startedJob still
+ // null" window (`scope` is needed by `client.createJob` so we set
+ // it before the await) and wait out the create-job POST so a
+ // SIGTERM landing in that window can still drive a clean cancel
+ // once the job id materialises. Without this gate the early-stop
+ // path would no-op, the runner would `process.exit(0)`, and the
+ // newly created cloud job would orphan with no cancel POST.
+ let startInFlight: Promise | null = null;
+
+ // Mutable callbacks slot. Each `dispatch()` invocation reads this
+ // fresh, so the rotation triggered by the
+ // `Symbol.for("arkor.trainer.replaceCallbacks")` brand
+ // (`replaceTrainerCallbacks` in `core/trainerInspection.ts`) takes
+ // effect on the next event. Events already mid-await keep their
+ // old reference until they resolve, which matches the "replace,
+ // don't interrupt" contract. Public `Trainer` deliberately doesn't
+ // expose this; it's a dev-only HMR primitive driven by the
+ // SIGUSR2 path in `core/runnerSignals.ts`.
+ let currentCallbacks: Partial<TrainerCallbacks> = input.callbacks ?? {};
+
+ // Early-stop state. `requestEarlyStop()` arms the latch; the next
+ // `checkpoint.saved` dispatch (or the timeout, whichever fires first)
+ // calls cancel() and resolves the deferred. Idempotent across repeat
+ // calls (they share the same deferred).
+ const DEFAULT_EARLY_STOP_TIMEOUT_MS = 5 * 60 * 1000;
+ let earlyStopDeferred: {
+ promise: Promise<void>;
+ resolve: () => void;
+ reject: (err: unknown) => void;
+ timer: NodeJS.Timeout | null;
+ } | null = null;
+ let earlyStopRequested = false;
+
+ /**
+ * Drop the early-stop latch (clear timer + resolve deferred + reset
+ * the request flag). Called from any path that means "wait()'s
+ * cancel-after-checkpoint promise is no longer waiting on anything"
+ * (the checkpoint-driven cancel branch, the terminal `completed`
+ * / `failed` branches, and the up-front guard in
+ * `requestEarlyStop()` when the job is already terminal). Without
+ * this called from terminal branches, a `requestEarlyStop()` armed
+ * mid-run that races a `training.completed` / `training.failed`
+ * before the next `checkpoint.saved` would leave the deferred
+ * pending until the (default 5-min) timeout fires; the SIGTERM
+ * handler in `installShutdownHandlers` would block on that promise
+ * and delay shutdown for up to `timeoutMs`.
+ */
+ function settleEarlyStopLatch(): void {
+ if (earlyStopDeferred) {
+ if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer);
+ earlyStopDeferred.resolve();
+ earlyStopDeferred = null;
+ }
+ earlyStopRequested = false;
+ }
async function getClient(): Promise {
if (!clientPromise) {
@@ -168,7 +234,7 @@ export function createTrainer(
* many SDK clients retry at once.
*
* The final value is clamped at `maxReconnectDelayMs` because jitter
- * sits *outside* the exponential clamp — without the outer clamp, a
+ * sits *outside* the exponential clamp; without the outer clamp, a
* long outage where `exp` already hit the cap could wait up to 1.25 ×
* the documented cap when `Math.random()` lands near 1.
*/
@@ -204,7 +270,10 @@ export function createTrainer(
throw new Error("Trainer is in an inconsistent state");
}
const client = await getClient();
- const callbacks = input.callbacks ?? {};
+ // Read once per dispatch so a `replaceCallbacks` between events takes
+ // effect on the next dispatch, but doesn't change identity inside a
+ // single in-flight handler.
+ const callbacks = currentCallbacks;
switch (event.type) {
case "training.started": {
@@ -255,7 +324,139 @@ export function createTrainer(
infer,
artifacts: event.artifacts,
};
- await callbacks.onCheckpoint?.(ctx);
+ // Capture (don't propagate yet) any throw from the user's
+ // `onCheckpoint`. The early-stop branch below MUST run
+ // even on a callback throw; without this wrap a thrown
+ // `onCheckpoint` would skip the cancel + latch settlement,
+ // leaving the SIGTERM handler waiting on the deferred
+ // until the (default 5-min) timeout fires. Surface the
+ // original throw via re-throw at the end so `wait()`'s
+ // reconnect / failure path keeps its existing semantics.
+ // Discriminant for the user-callback-threw branch. Tracked as
+ // a separate boolean (not `onCheckpointError !== null`) because
+ // user code can legitimately `throw null` / `throw 0` /
+ // `throw ""`; the truthiness of `onCheckpointError` would then
+ // be indistinguishable from the no-error path, and the re-throw
+ // at the end would silently swallow the user's falsy throw
+ // (callback's "I want to stop" signal gets dropped on the floor).
+ let onCheckpointError: unknown = null;
+ let onCheckpointThrew = false;
+ try {
+ await callbacks.onCheckpoint?.(ctx);
+ } catch (err) {
+ onCheckpointError = err;
+ onCheckpointThrew = true;
+ }
+ // Early-stop latch: a checkpoint just landed, so the in-flight work
+ // is durable. Cancel the cloud job and end `wait()` cleanly.
+ if (earlyStopRequested && earlyStopDeferred) {
+ // Capture the cancel error (if any) but DON'T swallow
+ // silently; propagate via the deferred's reject path so
+ // the runner's `installShutdownHandlers` `.catch()` writes
+ // the failure to stderr. The previous swallow let a
+ // transient cloud-api failure during early-stop appear
+ // as a clean cancel: the local runner exited 0, the UI
+ // declared the run cancelled, but the cloud job kept
+ // running (continued GPU spend). Keeping the error
+ // visible to the shutdown handler lets the operator see
+ // it and intervene.
+ //
+ // We still mark `startedJob.status` terminal locally
+ // either way: from the runner's perspective the run is
+ // over, and a subsequent `requestEarlyStop()` call must
+ // hit the `TERMINAL_STATUSES.has(...)` short-circuit
+ // (re-arming a fresh latch on a dead run would hang
+ // shutdown).
+ let cancelError: unknown = null;
+ // Discriminant for the cancel-failure branch. Tracked as a
+ // separate boolean (not `cancelError !== null`) because the
+ // rejection value is not guaranteed to be an Error: a
+ // `throw null` from the fetch layer would leave `cancelError`
+ // indistinguishable from the no-error path and the run
+ // would be silently labelled `"cancelled"` even when the
+ // cancel POST genuinely rejected.
+ let cancelFailed = false;
+ try {
+ await trainer.cancel();
+ } catch (err) {
+ cancelError = err;
+ cancelFailed = true;
+ }
+ // Reflect the cancellation locally so `wait()`'s resolved
+ // `TrainingResult.job.status` is a terminal status (per the
+ // documented contract). Without this update the result would
+ // surface as `status: "running"`, and a subsequent
+ // `requestEarlyStop` would not see the
+ // `TERMINAL_STATUSES.has(...)` short-circuit it relies on.
+ //
+ // Status is `"failed"` when the cancel POST itself threw
+ // (cloud-api transient failure mid-cancel): labelling
+ // such runs `"cancelled"` would lie about the cloud-side
+ // state, which may still be running. `"failed"` is
+ // terminal too, so the latch / TERMINAL_STATUSES short-
+ // circuit still works, but `wait()`'s caller can
+ // distinguish "we cancelled cleanly" from "we tried but
+ // the cancel may not have landed". The original cancel
+ // error is also rejected through the deferred below for
+ // the SIGTERM handler's `.catch()`.
+ startedJob = {
+ ...startedJob,
+ status: cancelFailed ? "failed" : "cancelled",
+ ...(cancelFailed && {
+ error: `Early-stop cancel failed: ${
+ cancelError instanceof Error
+ ? cancelError.message
+ : String(cancelError)
+ }`,
+ }),
+ completedAt: event.timestamp,
+ };
+ if (cancelFailed) {
+ // Reject (not resolve) the latch. Mirrors the success
+ // path's bookkeeping (clear timer, null out shared
+ // slot, drop the request flag) so a follow-up
+ // `requestEarlyStop()` won't piggyback on the rejected
+ // promise.
+ if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer);
+ // Wrap if user threw a non-Error so the deferred
+ // consumer always receives an Error instance. `throw 0`
+ // would otherwise reject the deferred with `0`, and the
+ // SIGTERM handler's `.catch(err => ...err.message)` would
+ // crash on the missing property.
+ earlyStopDeferred.reject(
+ cancelError instanceof Error
+ ? cancelError
+ : new Error(String(cancelError)),
+ );
+ earlyStopDeferred = null;
+ earlyStopRequested = false;
+ } else {
+ settleEarlyStopLatch();
+ }
+ // Return the *checkpoint's* artifacts (the ones the user
+ // just saved): that's the work HMR went out of its way
+ // to preserve before issuing cancel(). The previous
+ // `terminalResult?.artifacts ?? []` always resolved to
+ // `[]` because `wait()` calls `dispatch(parsed, null)` so
+ // `terminalResult` is never populated. Effect: an
+ // HMR-driven early-stop resolved `wait()` with empty
+ // `artifacts` even though the checkpoint event carried
+ // the very artifacts the early-stop existed to keep.
+ // Surface the user's `onCheckpoint` throw (if any) so
+ // `wait()`'s reconnect / failure path keeps the same
+ // semantics it had before the wrap: the checkpoint
+ // workload is preserved, but the user still sees their
+ // callback error.
+ if (onCheckpointThrew) throw onCheckpointError;
+ return {
+ terminal: true,
+ artifacts: (event.artifacts ?? []) as unknown[],
+ };
+ }
+ // Same re-throw on the non-early-stop branch: keep
+ // `wait()`'s reconnect loop seeing the user's original
+ // callback error so reconnection counters work as before.
+ if (onCheckpointThrew) throw onCheckpointError;
return { terminal: false, artifacts: terminalResult?.artifacts ?? [] };
}
case "training.completed": {
@@ -265,7 +466,19 @@ export function createTrainer(
completedAt: event.timestamp,
};
const artifacts = (event.artifacts ?? []) as unknown[];
- await callbacks.onCompleted?.({ job: startedJob, artifacts });
+ // `try/finally` so the latch settles even when the user's
+ // `onCompleted` callback throws: otherwise a thrown
+ // callback would leave `earlyStopDeferred` pending and the
+ // SIGTERM handler awaiting `requestEarlyStop()` would block
+ // until the timeout (default 5 min). The throw still
+ // propagates through `dispatch()` → `wait()` so callers see
+ // the original error; we just don't strand the shutdown
+ // path along with it.
+ try {
+ await callbacks.onCompleted?.({ job: startedJob, artifacts });
+ } finally {
+ settleEarlyStopLatch();
+ }
return { terminal: true, artifacts };
}
case "training.failed": {
@@ -275,7 +488,14 @@ export function createTrainer(
error: event.error,
completedAt: event.timestamp,
};
- await callbacks.onFailed?.({ job: startedJob, error: event.error });
+ // Symmetric to the `completed` branch above: terminal
+ // status settles the latch even when the run failed *and*
+ // the user's `onFailed` callback itself throws.
+ try {
+ await callbacks.onFailed?.({ job: startedJob, error: event.error });
+ } finally {
+ settleEarlyStopLatch();
+ }
return { terminal: true, artifacts: [] };
}
}
@@ -286,18 +506,46 @@ export function createTrainer(
async start() {
if (startedJob) return { jobId: startedJob.id };
- const client = await getClient();
- const state = await resolveProjectState(client);
- scope = { orgSlug: state.orgSlug, projectSlug: state.projectSlug };
-
- const { job } = await client.createJob({
- orgSlug: state.orgSlug,
- projectSlug: state.projectSlug,
- name: input.name,
- config,
- });
- startedJob = job;
- return { jobId: job.id };
+ // Already-pending start: reuse the in-flight promise so a
+ // concurrent caller (notably `requestEarlyStop` awaiting it
+ // to close the SIGTERM-during-create-job race) doesn't issue
+ // a second `client.createJob` POST. Every concurrent caller awaits
+ // the same pending promise and so observes the same job.
+ if (startInFlight) {
+ const job = await startInFlight;
+ return { jobId: job.id };
+ }
+ // Track the pending creation so `requestEarlyStop()` can
+ // detect the "started but not yet recorded" window and wait
+ // out the `client.createJob` POST. `scope` is assigned *before*
+ // that POST resolves, so without this tracking a SIGTERM
+ // landing during the await would see `!startedJob` (with
+ // `scope` already set) and exit immediately, leaving the
+ // newly created cloud job uncancelled.
+ const startPromise = (async () => {
+ const client = await getClient();
+ const state = await resolveProjectState(client);
+ scope = { orgSlug: state.orgSlug, projectSlug: state.projectSlug };
+ const { job } = await client.createJob({
+ orgSlug: state.orgSlug,
+ projectSlug: state.projectSlug,
+ name: input.name,
+ config,
+ });
+ startedJob = job;
+ return job;
+ })();
+ startInFlight = startPromise;
+ try {
+ const job = await startPromise;
+ return { jobId: job.id };
+ } finally {
+ // Clear regardless of resolve/reject so a failed start can
+ // be retried (the caller decides), and a successful one
+ // doesn't pin a stale promise on the trainer for the rest
+ // of its lifetime.
+ startInFlight = null;
+ }
},
async wait(): Promise<TrainingResult> {
@@ -347,7 +595,7 @@ export function createTrainer(
try {
for await (const sse of iterateEvents(response)) {
// Any frame from the server (including pings) means we're
- // connected and making progress — reset the failure counter
+ // connected and making progress; reset the failure counter
// so subsequent transient blips get the full retry budget.
receivedAny = true;
attempt = 0;
@@ -378,7 +626,7 @@ export function createTrainer(
if (terminal) break;
if (receivedAny) {
- // Stream had real activity then closed cleanly. Not a failure —
+ // Stream had real activity then closed cleanly. Not a failure;
// reconnect with Last-Event-ID at the base delay (no exponential
// backoff, no counter increment).
await delay(initialReconnectDelayMs, abortSignal);
@@ -404,5 +652,153 @@ export function createTrainer(
},
};
+ /**
+ * Internal "stop after next checkpoint" entry point. Hidden behind a
+ * `Symbol.for` brand so the runner subprocess's SIGTERM handler (in
+ * `runnerSignals.ts`) can drive a graceful early-stop without us
+ * exposing the operation on the public `Trainer` interface. User code
+ * that wants the same semantics should compose `abortSignal` +
+ * `cancel()` per `docs/cookbook/early-stopping.mdx`.
+ */
+ async function requestEarlyStop(
+ opts: RequestEarlyStopOptions = {},
+ ): Promise<void> {
+ // SIGTERM-during-create-job race: a runner-side SIGTERM can land
+ // between `start()`'s `scope = { … }` assignment and its
+ // `client.createJob(...)` resolution, with `startedJob` still
+ // null but a real cloud job about to exist. Treating that window
+ // as "nothing in flight" would `process.exit(0)` immediately
+ // after this returns, leaving the newly created cloud job
+ // running with no cancel POST. Awaiting `startInFlight` collapses
+ // the race onto a definite startedJob (success) or a definite
+ // start failure (rejection); either way the branches below
+ // can decide on real state. Swallow the rejection: if `start()`
+ // failed there's nothing to cancel anyway.
+ if (startInFlight) {
+ try {
+ await startInFlight;
+ } catch {
+ // intentionally ignored: failed start has no job to cancel
+ }
+ }
+ // Nothing in flight: clean up any prior latch and resolve.
+ if (!startedJob || !scope || TERMINAL_STATUSES.has(startedJob.status)) {
+ settleEarlyStopLatch();
+ return;
+ }
+ // Idempotent: a second call piggybacks on the first.
+ if (earlyStopDeferred) return earlyStopDeferred.promise;
+
+ earlyStopRequested = true;
+ let resolveFn!: () => void;
+ let rejectFn!: (err: unknown) => void;
+ const promise = new Promise<void>((resolve, reject) => {
+ resolveFn = resolve;
+ rejectFn = reject;
+ });
+ const timeoutMs = opts.timeoutMs ?? DEFAULT_EARLY_STOP_TIMEOUT_MS;
+ const timer = setTimeout(() => {
+ // Timed out waiting for a checkpoint; fall back to immediate cancel.
+ // Capture the active deferred reference: by the time the cancel POST
+ // resolves, the checkpoint branch may have nulled out the shared
+ // slot, but this fallback path still owns the deferred it created.
+ const active = earlyStopDeferred;
+ // Capture (don't swallow) any cancel error so we can surface it
+ // through the deferred's reject path. Mirrors the checkpoint
+ // branch: a swallow here lets the runner's
+ // `installShutdownHandlers` exit "successfully" while the cloud
+ // job lives on (orphaned GPU spend with zero diagnostic), the
+ // exact failure mode that a "stop-after-checkpoint" deadline
+ // exists to PREVENT from going silent.
+ let cancelError: unknown = null;
+ // See the checkpoint-branch comment: tracked separately from
+ // `cancelError` so a user `throw null` / `throw 0` doesn't
+ // silently downgrade the cancel-failure path to "clean
+ // cancel".
+ let cancelFailed = false;
+ trainer
+ .cancel()
+ .catch((err) => {
+ cancelError = err;
+ cancelFailed = true;
+ })
+ .finally(() => {
+ // Mirror the checkpoint-triggered early-stop branch: reset
+ // the latch and reflect the cancellation locally so a
+ // second `requestEarlyStop()` call is a no-op (instead of
+ // re-arming a fresh timer + re-issuing cancel) and so
+ // `wait()`'s eventual resolution exposes a terminal status.
+ // Without this, a long-lived trainer left in
+ // `earlyStopRequested = true` would re-cancel on every
+ // future checkpoint event for the rest of its lifetime.
+ earlyStopRequested = false;
+ if (startedJob && !TERMINAL_STATUSES.has(startedJob.status)) {
+ // Symmetric to the checkpoint branch: `"failed"` (not
+ // `"cancelled"`) on cancel-throw so we don't lie
+ // about cloud-side state that may still be running.
+ // Both branches feed the same TERMINAL_STATUSES
+ // short-circuit, so re-armed `requestEarlyStop()`
+ // calls still no-op correctly.
+ startedJob = {
+ ...startedJob,
+ status: cancelFailed ? "failed" : "cancelled",
+ ...(cancelFailed && {
+ error: `Early-stop cancel failed: ${
+ cancelError instanceof Error
+ ? cancelError.message
+ : String(cancelError)
+ }`,
+ }),
+ completedAt: new Date().toISOString(),
+ };
+ }
+ if (active) {
+ // Resolve on success, REJECT on cancel failure so the
+ // SIGTERM handler's `.catch()` writes the error to
+ // stderr and the operator can see that the cloud job
+ // may still be live. The latch always settles either
+ // way; shutdown won't hang.
+ if (cancelFailed) {
+ active.reject(
+ cancelError instanceof Error
+ ? cancelError
+ : new Error(String(cancelError)),
+ );
+ } else {
+ active.resolve();
+ }
+ }
+ if (earlyStopDeferred === active) earlyStopDeferred = null;
+ });
+ }, timeoutMs);
+ // `Timer.unref` keeps the early-stop timer from blocking process exit
+ // when the host runtime finishes for unrelated reasons.
+ timer.unref?.();
+ earlyStopDeferred = {
+ promise,
+ resolve: resolveFn,
+ reject: rejectFn,
+ timer,
+ };
+ return promise;
+ }
+
+ // Brand the trainer with the HMR control surface so the Studio server
+ // can (a) hash the cloud-side config to decide between hot-swap and
+ // restart, (b) atomically swap the callbacks cell from the runner
+ // subprocess on SIGUSR2, and (c) drive a graceful "stop after the
+ // next checkpoint" on SIGTERM. All three brands live behind
+ // `Symbol.for` keys so they don't appear on the public `Trainer`
+ // interface (see `trainerInspection.ts` for the rationale).
+ attachTrainerInspection(trainer, () => ({
+ name: input.name,
+ config,
+ callbacks: currentCallbacks,
+ }));
+ attachTrainerCallbackReplacer(trainer, (callbacks) => {
+ currentCallbacks = callbacks ?? {};
+ });
+ attachTrainerEarlyStopper(trainer, requestEarlyStop);
+
return trainer;
}
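
Reviewer sketch (not part of the patch): the `startInFlight` handling in `start()` above follows a single-flight pattern, where concurrent callers share one pending promise and the slot is cleared once it settles. The standalone snippet below illustrates only that pattern under simplified assumptions. `createSingleFlight` and `fakeCreateJob` are hypothetical names, not Arkor APIs, and unlike the real `start()` the sketch does not cache the finished job in a `startedJob` slot.

```typescript
// Hypothetical helper illustrating single-flight; not part of the Arkor SDK.
function createSingleFlight<T>(task: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;
  return async () => {
    // A concurrent caller reuses the pending promise instead of re-running the task.
    if (inFlight) return inFlight;
    const pending = task();
    inFlight = pending;
    try {
      return await pending;
    } finally {
      // Clear on resolve or reject so a failed attempt can be retried.
      inFlight = null;
    }
  };
}

let createJobCalls = 0;

// Stand-in for a network POST such as `client.createJob` (assumed shape).
async function fakeCreateJob(): Promise<{ id: string }> {
  createJobCalls += 1;
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  return { id: `job-${createJobCalls}` };
}

const start = createSingleFlight(fakeCreateJob);
```

Two overlapping `start()` calls in this sketch trigger a single `fakeCreateJob` invocation; a call made after the first one settles starts a fresh attempt.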
diff --git a/packages/arkor/src/core/trainerInspection.test.ts b/packages/arkor/src/core/trainerInspection.test.ts
new file mode 100644
index 00000000..cb3c11cc
--- /dev/null
+++ b/packages/arkor/src/core/trainerInspection.test.ts
@@ -0,0 +1,249 @@
+import { describe, expect, it, vi } from "vitest";
+import { createArkor } from "./arkor";
+import { createTrainer } from "./trainer";
+import {
+ findInspectableTrainer,
+ findTrainerInModule,
+ getTrainerInspection,
+ replaceTrainerCallbacks,
+ requestTrainerEarlyStop,
+} from "./trainerInspection";
+import type { Trainer } from "./types";
+
+function brandedTrainer(name: string) {
+ // Real `createTrainer` attaches the inspection brand. We only need
+ // a no-op trainer for these shape tests; `start`/`wait` etc. are
+ // never invoked.
+ return createTrainer({
+ name,
+ model: "m",
+ dataset: { type: "huggingface", name: "x" },
+ });
+}
+
+function unbrandedTrainer(name: string) {
+ // Hand-rolled trainer: passes the `start`/`wait`/`cancel` shape
+ // check `findTrainerInModule` requires but DOESN'T carry the SDK
+ // inspection brand. Mirrors a user who wraps or re-exports a
+ // trainer outside the SDK helpers.
+ return {
+ name,
+ start: async () => ({ jobId: "j" }),
+ wait: async () => ({ job: {}, artifacts: [] }),
+ cancel: async () => {},
+ };
+}
+
+describe("findTrainerInModule (trainer-shape walk)", () => {
+ it("finds shape #1: createArkor named export", () => {
+ const trainer = brandedTrainer("a");
+ const found = findTrainerInModule({ arkor: createArkor({ trainer }) });
+ expect(found).toBe(trainer);
+ });
+
+ it("finds shape #2: bare `trainer` named export", () => {
+ const trainer = brandedTrainer("b");
+ const found = findTrainerInModule({ trainer });
+ expect(found).toBe(trainer);
+ });
+
+ it("finds shape #3: default-export Arkor manifest", () => {
+ const trainer = brandedTrainer("c");
+ const found = findTrainerInModule({ default: createArkor({ trainer }) });
+ expect(found).toBe(trainer);
+ });
+
+ it("finds shape #4: default IS the Trainer", () => {
+ // Regression: `runner.ts`'s `extractTrainer` accepts
+ // `export default createTrainer(...)` directly (the trainer
+ // object itself becomes `mod.default`), but Studio's manifest /
+ // HMR walk previously skipped this shape. Result: a project that
+ // ran fine under `arkor start` showed as "no trainer" in Studio
+ // and HMR forced a SIGTERM-restart on every rebuild because
+ // `configHash` came back null.
+ const trainer = brandedTrainer("d");
+ const found = findTrainerInModule({ default: trainer });
+ expect(found).toBe(trainer);
+ });
+
+ it("finds shape #5: default.trainer nested", () => {
+ const trainer = brandedTrainer("e");
+ const found = findTrainerInModule({ default: { trainer } });
+ expect(found).toBe(trainer);
+ });
+
+ it("works for hand-rolled (unbranded) trainers in the bare, default, and nested-default shapes", () => {
+ const trainer = unbrandedTrainer("manual");
+ expect(findTrainerInModule({ trainer })?.name).toBe("manual");
+ expect(findTrainerInModule({ default: trainer })?.name).toBe("manual");
+ expect(findTrainerInModule({ default: { trainer } })?.name).toBe("manual");
+ });
+
+ it("returns null when no candidate looks like a trainer", () => {
+ expect(findTrainerInModule({})).toBeNull();
+ expect(findTrainerInModule({ arkor: {} })).toBeNull();
+ expect(findTrainerInModule({ trainer: { name: "no-methods" } })).toBeNull();
+ expect(findTrainerInModule({ default: 42 })).toBeNull();
+ });
+});
+
+describe("findInspectableTrainer (brand-required path)", () => {
+ it("returns the inspection snapshot for a branded trainer in any shape", () => {
+ // Regression: previously HMR's `inspectBundle` only checked
+ // `mod.arkor ?? mod.default`, missing shapes #2 and #4. As a
+ // result, projects bare-exporting `trainer` always produced
+ // `configHash: null` and HMR conservatively SIGTERM-restarted on
+ // every rebuild, never hot-swapping callbacks. The fix routes
+ // through `findInspectableTrainer` which walks every supported
+ // shape via `findTrainerInModule` and pulls inspection off the
+ // discovered trainer.
+ const trainerA = brandedTrainer("from-arkor");
+ const inspectionA = findInspectableTrainer({
+ arkor: createArkor({ trainer: trainerA }),
+ });
+ expect(inspectionA?.name).toBe("from-arkor");
+
+ const trainerB = brandedTrainer("bare-named");
+ const inspectionB = findInspectableTrainer({ trainer: trainerB });
+ expect(inspectionB?.name).toBe("bare-named");
+
+ const trainerC = brandedTrainer("default-arkor");
+ const inspectionC = findInspectableTrainer({
+ default: createArkor({ trainer: trainerC }),
+ });
+ expect(inspectionC?.name).toBe("default-arkor");
+
+ const trainerD = brandedTrainer("default-nested");
+ const inspectionD = findInspectableTrainer({
+ default: { trainer: trainerD },
+ });
+ expect(inspectionD?.name).toBe("default-nested");
+ });
+
+ it("returns null when only an unbranded trainer is present", () => {
+ // Hand-rolled trainers don't carry the SDK inspection brand, so
+ // HMR can't compute their `configHash`. The Studio still shows
+ // the trainer name (via `findTrainerInModule` in
+ // `summariseBuiltManifest`), but HMR routing falls back to the
+ // SIGTERM-restart-everything path, which is the documented
+ // safe behaviour when configs can't be diffed.
+ const trainer = unbrandedTrainer("plain");
+ expect(findInspectableTrainer({ trainer })).toBeNull();
+ expect(getTrainerInspection(trainer)).toBeNull();
+ });
+
+ it("does NOT walk past an unbranded first candidate to inspect a later branded one (runTrainer parity)", () => {
+ // Regression: a previous implementation looped every trainer-
+ // shaped candidate and returned the first one carrying the
+ // inspection brand. But `runTrainer`'s `extractTrainer` always
+ // executes the FIRST candidate (precedence: `mod.arkor` →
+ // `mod.trainer` → `mod.default`...), regardless of brand. A module
+ // that exported both an unbranded `trainer` (shape #2) AND a
+ // branded `default = createArkor(...)` (shape #3) would have its
+ // HMR `configHash` computed from the BRANDED trainer while the
+ // runner actually ran the unbranded one. The mismatch could route
+ // a rebuild to SIGUSR2 (hot-swap) even though the live trainer
+ // has no callback-replacer brand to receive the swap, leaving
+ // the running job stuck on stale callbacks.
+ //
+ // The fix anchors `findInspectableTrainer` to the same first-
+ // wins precedence as `runTrainer`: if the first candidate is
+ // unbranded, return `null` (forcing SIGTERM-restart, the safe
+ // fallback) instead of hashing a different instance.
+ const unbranded = unbrandedTrainer("unbranded-first");
+ const branded = brandedTrainer("branded-second");
+ const inspection = findInspectableTrainer({
+ trainer: unbranded,
+ default: createArkor({ trainer: branded }),
+ });
+ // Under the bug this was the branded inspection ("branded-second").
+ // With the fix we get null so HMR conservatively SIGTERM-restarts
+ // rather than hot-swapping callbacks into a trainer that can't
+ // receive them.
+ expect(inspection).toBeNull();
+ // And `findTrainerInModule` confirms the runner would pick the
+ // unbranded one (proving the precedence we're anchoring to).
+ expect(findTrainerInModule({
+ trainer: unbranded,
+ default: createArkor({ trainer: branded }),
+ })).toBe(unbranded);
+ });
+});
+
+describe("requestTrainerEarlyStop / replaceTrainerCallbacks brand-missing fallback", () => {
+ // Regression: previously these helpers asserted the brand was
+ // present and threw a synchronous TypeError on hand-rolled trainers.
+ // `runner.ts`'s `extractTrainer` accepts ANY `{start, wait, cancel}`
+ // shape (a documented public path for unbranded trainers),
+ // so the SIGTERM handler crashed instead of stopping the run.
+
+ it("requestTrainerEarlyStop falls back to trainer.cancel() for unbranded trainers", async () => {
+ const cancelCalls = vi.fn(async () => {});
+ const trainer = {
+ name: "manual",
+ start: async () => ({ jobId: "j" }),
+ wait: async () => ({ job: {}, artifacts: [] }),
+ cancel: cancelCalls,
+ } as unknown as Trainer;
+
+ // Must not throw, must resolve, must have called cancel().
+ await expect(requestTrainerEarlyStop(trainer)).resolves.toBeUndefined();
+ expect(cancelCalls).toHaveBeenCalledTimes(1);
+ });
+
+ it("requestTrainerEarlyStop swallows a thrown cancel() so the SIGTERM handler can still settle", async () => {
+ // The runner's SIGTERM handler chains
+ // `requestTrainerEarlyStop(...).catch(...).finally(() => process.exit(0))`.
+ // If the brand-missing fallback let cancel()'s rejection bubble,
+ // the `.finally` would still fire, but the cancel error would
+ // surface as an unhandled rejection from the test runner. The
+ // documented contract for cancel() is best-effort, so swallow.
+ const trainer = {
+ name: "manual",
+ start: async () => ({ jobId: "j" }),
+ wait: async () => ({ job: {}, artifacts: [] }),
+ cancel: vi.fn(async () => {
+ throw new Error("network down");
+ }),
+ } as unknown as Trainer;
+
+ await expect(requestTrainerEarlyStop(trainer)).resolves.toBeUndefined();
+ });
+
+ it("requestTrainerEarlyStop is async-shaped: synchronous throws inside the brand call become rejections", async () => {
+ // Defense-in-depth: even when the brand IS attached but somehow
+ // throws synchronously (e.g. a future implementation regression),
+ // the SIGTERM handler's `.catch` arm should still see it instead
+ // of the throw escaping past `.finally` and taking the runner
+ // down. The function is `async`, which wraps any synchronous
+ // throw inside its body into a rejected promise.
+ const trainer = brandedTrainer("from-arkor");
+ // Replace the brand with a function that throws synchronously.
+ const KEY = Symbol.for("arkor.trainer.requestEarlyStop");
+ Object.defineProperty(trainer, KEY, {
+ value: () => {
+ throw new Error("brand impl exploded");
+ },
+ configurable: true,
+ });
+ await expect(requestTrainerEarlyStop(trainer)).rejects.toThrow(
+ /brand impl exploded/,
+ );
+ });
+
+ it("replaceTrainerCallbacks is a no-op (not a throw) for unbranded trainers", () => {
+ // The HMR pipeline never routes SIGUSR2 to unbranded trainers in
+ // practice (their `configHash` is null, which forces the
+ // SIGTERM-restart path), but if a future caller did, it must not
+ // crash the runner.
+ const trainer = {
+ name: "manual",
+ start: async () => ({ jobId: "j" }),
+ wait: async () => ({ job: {}, artifacts: [] }),
+ cancel: async () => {},
+ } as unknown as Trainer;
+ expect(() =>
+ replaceTrainerCallbacks(trainer, { onLog: () => {} }),
+ ).not.toThrow();
+ });
+});
diff --git a/packages/arkor/src/core/trainerInspection.ts b/packages/arkor/src/core/trainerInspection.ts
new file mode 100644
index 00000000..8bfe964d
--- /dev/null
+++ b/packages/arkor/src/core/trainerInspection.ts
@@ -0,0 +1,306 @@
+import { isArkor } from "./arkor";
+import type { Arkor, JobConfig, Trainer, TrainerCallbacks } from "./types";
+
+/**
+ * Snapshot of a trainer's identity and cloud-side config that the Studio
+ * server reads in order to (a) compute a stable hash for HMR's
+ * "callbacks-only vs full restart" decision and (b) extract the new
+ * callbacks reference when hot-swapping.
+ *
+ * **Internal API (not part of the user-facing SDK surface).** Both this
+ * snapshot and the companion `replaceTrainerCallbacks` mutator are
+ * exposed only via `Symbol.for(...)`-keyed properties on the trainer
+ * object so they don't appear on the public `Trainer` type. They exist
+ * to let `arkor dev`'s HMR pipeline hot-swap callbacks without
+ * restarting cloud-side training; user code shouldn't call them
+ * directly.
+ */
+export interface TrainerInspection {
+ /** Run name (mirror of `Trainer.name`, copied for forward compatibility). */
+ name: string;
+ /** The cloud-side `JobConfig` this trainer would submit on `start()`. */
+ config: JobConfig;
+ /** Whatever the user passed in `input.callbacks`. May be empty. */
+ callbacks: Partial<TrainerCallbacks>;
+}
+
+/**
+ * The CLI runtime (`dist/bin.mjs`) and the user's compiled bundle
+ * (`.arkor/build/index.mjs`, which keeps `arkor` external) end up loading
+ * two separate copies of this SDK as distinct ESM module records, so a
+ * module-local `WeakMap` would split into two halves that
+ * can't see each other.
+ *
+ * `Symbol.for(key)` is the cross-realm equivalent: the same key string
+ * resolves to the same symbol in any module instance, so the trainer
+ * created in the user's bundle exposes its inspection through the same
+ * property the Studio process reads.
+ */
+const TRAINER_INSPECTION_KEY = Symbol.for("arkor.trainer.inspect");
+const TRAINER_REPLACE_CALLBACKS_KEY = Symbol.for(
+ "arkor.trainer.replaceCallbacks",
+);
+const TRAINER_REQUEST_EARLY_STOP_KEY = Symbol.for(
+ "arkor.trainer.requestEarlyStop",
+);
+
+export interface RequestEarlyStopOptions {
+ /** Default: 5 min. Falls back to immediate cancel if no checkpoint arrives. */
+ timeoutMs?: number;
+}
+
+/**
+ * Stamp the inspection snapshot onto a freshly-built `Trainer` instance.
+ * Called once from `createTrainer`. Stored as a thunk so callers can
+ * read a fresh copy each time (defensive: the trainer's callbacks cell
+ * is mutable across the lifetime of a hot-swap).
+ */
+export function attachTrainerInspection(
+ trainer: object,
+ read: () => TrainerInspection,
+): void {
+ Object.defineProperty(trainer, TRAINER_INSPECTION_KEY, {
+ value: read,
+ configurable: true,
+ enumerable: false,
+ writable: false,
+ });
+}
+
+/**
+ * Pull the snapshot off a Trainer-like value. Returns `null` for plain
+ * objects that don't carry the brand; used by the Studio server to
+ * gracefully ignore third-party wrappers or pre-SDK shapes.
+ */
+export function getTrainerInspection(
+ trainer: unknown,
+): TrainerInspection | null {
+ if (!trainer || typeof trainer !== "object") return null;
+ const fn = (trainer as Record<symbol, unknown>)[TRAINER_INSPECTION_KEY];
+ if (typeof fn !== "function") return null;
+ try {
+ const result = (fn as () => unknown).call(trainer);
+ if (
+ result &&
+ typeof result === "object" &&
+ "config" in result &&
+ "name" in result
+ ) {
+ return result as TrainerInspection;
+ }
+ } catch {
+ // Inspection is best-effort; a throwing inspection thunk shouldn't crash HMR.
+ }
+ return null;
+}
+
+/**
+ * Wire the trainer's mutable callbacks slot to a `Symbol.for`-keyed
+ * brand so the runner subprocess can hot-swap callbacks without us
+ * exposing the operation on the public `Trainer` interface. Called once
+ * from `createTrainer`.
+ */
+export function attachTrainerCallbackReplacer(
+ trainer: object,
+ replace: (callbacks: Partial<TrainerCallbacks>) => void,
+): void {
+ Object.defineProperty(trainer, TRAINER_REPLACE_CALLBACKS_KEY, {
+ value: replace,
+ configurable: true,
+ enumerable: false,
+ writable: false,
+ });
+}
+
+/**
+ * Replace the trainer's lifecycle callbacks atomically. The brand is
+ * attached by `createTrainer`, but `runTrainer`'s `extractTrainer`
+ * also accepts hand-rolled trainers (any `{ start, wait, cancel }`
+ * shape), and those don't carry the brand. The HMR pipeline never
+ * routes SIGUSR2 to such trainers in practice (they always produce
+ * `configHash: null` upstream, which forces the SIGTERM-restart
+ * path), so this helper is a no-op for them rather than throwing.
+ */
+export function replaceTrainerCallbacks(
+ trainer: Trainer,
+ callbacks: Partial<TrainerCallbacks>,
+ ): void {
+ const fn = (trainer as unknown as Record<symbol, unknown>)[
+ TRAINER_REPLACE_CALLBACKS_KEY
+ ] as ((cbs: Partial<TrainerCallbacks>) => void) | undefined;
+ if (typeof fn !== "function") return;
+ fn.call(trainer, callbacks);
+}
+
+/**
+ * Wire an early-stop entry point onto a `Trainer` so the SIGTERM handler
+ * in the runner subprocess can request a graceful "stop after the next
+ * checkpoint" without us exposing the operation on the public `Trainer`
+ * interface. User code that wants the same semantics should compose
+ * the cookbook's `abortSignal` + `cancel()` recipe instead (see
+ * `docs/cookbook/early-stopping.mdx`).
+ */
+export function attachTrainerEarlyStopper(
+ trainer: object,
+ requestStop: (opts?: RequestEarlyStopOptions) => Promise<void>,
+): void {
+ Object.defineProperty(trainer, TRAINER_REQUEST_EARLY_STOP_KEY, {
+ value: requestStop,
+ configurable: true,
+ enumerable: false,
+ writable: false,
+ });
+}
+
+/**
+ * Request that the trainer stop after the next saved checkpoint.
+ * Resolves once `cancel()` has been accepted by the cloud API, either
+ * after that checkpoint or as the `timeoutMs` fallback; rejects if it fails.
+ *
+ * `createTrainer` attaches the brand unconditionally, but
+ * `runTrainer`'s `extractTrainer` also accepts hand-rolled trainers
+ * (any `{ start, wait, cancel }` shape), which legitimately don't
+ * carry the brand. Falling back to the public `Trainer.cancel()` for
+ * those is the closest semantic match available without the SDK's
+ * checkpoint-aware machinery; it's also what the runner's SIGTERM
+ * handler needs to keep working (the previous "throw if brand
+ * missing" behaviour caused a synchronous TypeError before the
+ * handler's `.catch().finally()` chain attached, so SIGTERM crashed
+ * the runner instead of stopping the run).
+ */
+// async wrapper (rather than a bare function returning Promise) so
+// any *synchronous* throw inside the brand call (or its arguments)
+// becomes a rejected promise; the SIGTERM handler's `.catch()` then
+// catches it instead of the throw escaping past the `.finally()`
+// chain and taking the runner down.
+export async function requestTrainerEarlyStop(
+ trainer: Trainer,
+ opts?: RequestEarlyStopOptions,
+ ): Promise<void> {
+ const fn = (trainer as unknown as Record<symbol, unknown>)[
+ TRAINER_REQUEST_EARLY_STOP_KEY
+ ] as ((opts?: RequestEarlyStopOptions) => Promise<void>) | undefined;
+ if (typeof fn !== "function") {
+ // Best-effort fallback for unbranded trainers: trainer.cancel()
+ // is part of the public Trainer interface, so it's always safe
+ // to call. Catch/swallow because the documented contract for
+ // cancel() is "best-effort" and the SIGTERM handler needs the
+ // returned promise to settle either way.
+ try {
+ await trainer.cancel();
+ } catch {
+ // intentionally ignored; see comment above.
+ }
+ return;
+ }
+ await fn.call(trainer, opts);
+}
+
+/**
+ * Trainer-shaped value pulled from a re-imported bundle. We don't
+ * use the public `Trainer` type here because consumers of this
+ * helper want to read minimal fields (`name` for display) without
+ * type-narrowing on the full SDK interface. Many tests fabricate
+ * hand-rolled trainer literals that don't structurally match
+ * `Trainer` but are still legitimate user shapes the runner
+ * accepts.
+ */
+type TrainerLike = { name?: unknown; [key: string]: unknown };
+
+function isTrainerLike(value: unknown): value is TrainerLike {
+ if (!value || typeof value !== "object") return false;
+ const v = value as Record<string, unknown>;
+ return (
+ typeof v.start === "function" &&
+ typeof v.wait === "function" &&
+ typeof v.cancel === "function"
+ );
+}
+
+/**
+ * Walk the user module in `runner.ts`'s precedence order and return
+ * every *distinct* trainer-shaped value found. The walk is
+ * de-duplicated because the common `createArkor({ trainer })`
+ * default-export shape would otherwise surface the same trainer
+ * twice (case 3 pushes `mod.default.trainer`; case 4 pushes the
+ * manifest object itself, which is filtered out by
+ * `isTrainerLike`; case 5 pushes `mod.default.trainer` a second
+ * time). Callers iterate in precedence order, so this preserves
+ * the "first match wins" contract.
+ *
+ * The five supported shapes (mirroring `runner.ts`'s `extractTrainer`):
+ * 1. `export const arkor = createArkor({ trainer })`
+ * 2. `export const trainer = createTrainer(...)` (bare named export)
+ * 3. `export default createArkor({ trainer })`
+ * 4. `export default createTrainer(...)` (default IS a Trainer)
+ * 5. `export default { trainer: createTrainer(...) }`
+ *
+ * Without shape #4 a project that default-exports a Trainer would run
+ * fine under `arkor start` but show as "no trainer" in Studio's
+ * manifest, with `configHash: null` forcing every HMR rebuild down the
+ * SIGTERM-restart path instead of the SIGUSR2 hot-swap path.
+ */
+ function findTrainerCandidates(mod: Record<string, unknown>): TrainerLike[] {
+ const trainers: TrainerLike[] = [];
+ const seen = new Set();
+ const push = (value: unknown): void => {
+ if (value === undefined || value === null) return;
+ if (seen.has(value)) return;
+ seen.add(value);
+ if (isTrainerLike(value)) trainers.push(value);
+ };
+ // 1: createArkor named export
+ if (isArkor(mod.arkor)) push((mod.arkor as Arkor).trainer);
+ // 2: bare `trainer` named export
+ push(mod.trainer);
+ // 3: default-export holding an Arkor manifest
+ if (isArkor(mod.default)) push((mod.default as Arkor).trainer);
+ // 4: default IS the Trainer itself. `isTrainerLike` filters out
+ // cases 3/5 (an Arkor manifest doesn't have `start`/`wait`/
+ // `cancel`, nor does a plain `{ trainer }` wrapper).
+ push(mod.default);
+ // 5: default.trainer nested
+ if (mod.default && typeof mod.default === "object") {
+ push((mod.default as Record<string, unknown>).trainer);
+ }
+ return trainers;
+}
+
+/**
+ * Return the first trainer-shaped value (anything with
+ * `start`/`wait`/`cancel`) in `runner.ts`'s precedence order. Doesn't
+ * require the SDK inspection brand: the Studio manifest UI displays
+ * the trainer's `name` for hand-rolled trainers too, even when HMR
+ * can't compute a `configHash` for them. "First match wins" matches
+ * `runner.ts`'s `extractTrainer`, so this is the trainer the runner
+ * will actually execute.
+ */
+export function findTrainerInModule(
+ mod: Record<string, unknown>,
+): TrainerLike | null {
+ return findTrainerCandidates(mod)[0] ?? null;
+}
+
+/**
+ * Inspection snapshot of the trainer `runTrainer` would execute
+ * (== the first candidate in `runner.ts`'s precedence order).
+ * Used by both `studio/hmr.ts` (computing the `configHash` for HMR
+ * routing) and `core/runnerSignals.ts` (extracting new callbacks for
+ * SIGUSR2 hot-swap).
+ *
+ * Returns `null` when the first candidate doesn't carry the
+ * inspection brand. We deliberately DO NOT walk past it to find a
+ * branded trainer further down the list: the runner ignores those,
+ * so hashing a deeper branded trainer would compute HMR decisions
+ * for a different instance than the one actually running, e.g.
+ * route to SIGUSR2/hot-swap when the live (unbranded) trainer
+ * cannot be callback-reloaded. A null here correctly forces SIGTERM-
+ * restart, which is the safe fallback when configs can't be diffed.
+ */
+export function findInspectableTrainer(
+ mod: Record<string, unknown>,
+): TrainerInspection | null {
+ const trainer = findTrainerCandidates(mod)[0];
+ if (!trainer) return null;
+ return getTrainerInspection(trainer);
+}
diff --git a/packages/arkor/src/studio/hmr.test.ts b/packages/arkor/src/studio/hmr.test.ts
new file mode 100644
index 00000000..b892c68c
--- /dev/null
+++ b/packages/arkor/src/studio/hmr.test.ts
@@ -0,0 +1,423 @@
+import { describe, it, expect, beforeEach, afterEach } from "vitest";
+import {
+ mkdirSync,
+ mkdtempSync,
+ rmSync,
+ statSync,
+ writeFileSync,
+} from "node:fs";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+import { createHmrCoordinator, type HmrEvent } from "./hmr";
+
+const FAKE_MANIFEST = `export const arkor = Object.freeze({
+ _kind: "arkor",
+ trainer: { name: "alpha" },
+});
+`;
+
+let cwd: string;
+
+beforeEach(() => {
+ cwd = mkdtempSync(join(tmpdir(), "arkor-hmr-test-"));
+});
+
+afterEach(() => {
+ rmSync(cwd, { recursive: true, force: true });
+});
+
+function nextEvent(
+ events: HmrEvent[],
+ predicate: (e: HmrEvent) => boolean,
+ timeoutMs = 10_000,
+): Promise<HmrEvent> {
+ return new Promise((resolve, reject) => {
+ const start = Date.now();
+ const tick = () => {
+ const found = events.find(predicate);
+ if (found) return resolve(found);
+ if (Date.now() - start > timeoutMs) {
+ return reject(
+ new Error(
+ `Timed out waiting for matching HMR event after ${timeoutMs}ms`,
+ ),
+ );
+ }
+ setTimeout(tick, 25);
+ };
+ tick();
+ });
+}
+
+/**
+ * Resolve once `events.length` has gone `quietWindowMs` without
+ * growing. Used to wait out spurious watcher events on noisier file
+ * systems (Windows polling / macOS FSEvents coalescing) before
+ * asserting the cached state.
+ */
+function waitForStableEvents(
+ events: HmrEvent[],
+ quietWindowMs: number,
+): Promise<void> {
+ return new Promise((resolve) => {
+ let lastLength = events.length;
+ let stableSince = Date.now();
+ const tick = () => {
+ if (events.length !== lastLength) {
+ lastLength = events.length;
+ stableSince = Date.now();
+ }
+ if (Date.now() - stableSince >= quietWindowMs) return resolve();
+ setTimeout(tick, 50);
+ };
+ tick();
+ });
+}
+
+describe("createHmrCoordinator", () => {
+ it("emits a `ready` event after the first successful build", async () => {
+ mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+ writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+ const events: HmrEvent[] = [];
+ const hmr = createHmrCoordinator({ cwd });
+ hmr.subscribe((e) => events.push(e));
+ try {
+ const ready = await nextEvent(events, (e) => e.type === "ready");
+ expect(ready.outFile).toMatch(/\.arkor[\\/]+build[\\/]+index\.mjs$/);
+ expect(typeof ready.hash).toBe("string");
+ } finally {
+ await hmr.dispose();
+ }
+ });
+
+ it("emits a `rebuild` event after a source edit", async () => {
+ mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+ writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+ const events: HmrEvent[] = [];
+ const hmr = createHmrCoordinator({ cwd });
+ hmr.subscribe((e) => events.push(e));
+ try {
+ const ready = await nextEvent(events, (e) => e.type === "ready");
+ // Touch the entry with new content so the watcher detects a change.
+ writeFileSync(
+ join(cwd, "src/arkor/index.ts"),
+ FAKE_MANIFEST.replace(`"alpha"`, `"beta"`),
+ );
+ const rebuild = await nextEvent(events, (e) => e.type === "rebuild");
+ expect(rebuild.outFile).toBe(ready.outFile);
+ expect(rebuild.hash).not.toBe(ready.hash);
+ } finally {
+ await hmr.dispose();
+ }
+ });
+
+ it("emits an `error` event when the entry is missing on subscribe", async () => {
+ const events: HmrEvent[] = [];
+ const hmr = createHmrCoordinator({ cwd });
+ hmr.subscribe((e) => events.push(e));
+ try {
+ const err = await nextEvent(events, (e) => e.type === "error", 1000);
+ expect(err.message).toMatch(/Build entry not found/);
+ } finally {
+ await hmr.dispose();
+ }
+ });
+
+ it("transitions from `error` to `ready` once the entry appears, without re-subscribing", async () => {
+ // Regression: previously `startWatcher` bailed out and never
+ // retried, so an SPA already connected to `/api/dev/events` against
+ // a fresh scaffold would be stuck on the initial `error` event
+ // forever: EventSource doesn't reconnect on application-level
+ // errors. The coordinator now polls for the entry file in the
+ // background and starts the watcher the moment it appears.
+ const events: HmrEvent[] = [];
+ const hmr = createHmrCoordinator({ cwd });
+ hmr.subscribe((e) => events.push(e));
+ try {
+ await nextEvent(events, (e) => e.type === "error", 1000);
+ // Same subscriber: no reconnect, no second `subscribe` call.
+ mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+ writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+ const ready = await nextEvent(
+ events,
+ (e) => e.type === "ready",
+ 4000,
+ );
+ expect(ready.outFile).toMatch(/index\.mjs$/);
+ } finally {
+ await hmr.dispose();
+ }
+ });
+
+ it("replays the latest event to late subscribers", async () => {
+ mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+ writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+ const firstEvents: HmrEvent[] = [];
+ const hmr = createHmrCoordinator({ cwd });
+ hmr.subscribe((e) => firstEvents.push(e));
+ try {
+ await nextEvent(firstEvents, (e) => e.type === "ready");
+ // A new subscriber should receive the cached state synchronously
+ // before any new build is triggered.
+ //
+ // We assert "the late subscriber sees the same event the prior one
+ // saw last" rather than literally "ready" because rolldown@1.0.0-rc.17
+ // on macOS occasionally fires a spurious second BUNDLE_END (FSEvents
+ // coalescing inside the watcher): there, `firstEvents` already
+ // contains the spurious `rebuild` by the time we late-subscribe, and
+ // the contract under test (replay of the cached state) holds either
+ // way.
+ // TODO(rolldown 1.0): re-check after rolldown leaves RC. If the
+ // spurious BUNDLE_END is gone on macOS, tighten this back to
+ // expect(lateEvents[0]?.type).toBe("ready");
+ const lateEvents: HmrEvent[] = [];
+ hmr.subscribe((e) => lateEvents.push(e));
+ expect(lateEvents.length).toBeGreaterThanOrEqual(1);
+ expect(lateEvents[0]).toEqual(firstEvents[firstEvents.length - 1]);
+ } finally {
+ await hmr.dispose();
+ }
+ });
+
+ it("subscribe()'s lastEvent replay swallows a throwing subscriber so initialization keeps working", async () => {
+ // Regression: `subscribe()` synchronously replays `lastEvent` to
+ // a fresh subscriber for the late-mount-cached-state contract.
+ // Previously the replay had no try/catch, so a subscriber that
+ // threw during that one call (typical case: an SSE controller
+ // that closed mid-replay: `controller.enqueue` on a closed
+ // stream throws) propagated out of `subscribe()` and broke
+ // whoever just registered. `broadcast()` already swallowed
+ // subscriber throws defensively; this test pins the symmetric
+ // contract on `subscribe()`.
+ mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+ writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+ const firstEvents: HmrEvent[] = [];
+ const hmr = createHmrCoordinator({ cwd });
+ hmr.subscribe((e) => firstEvents.push(e));
+ try {
+ await nextEvent(firstEvents, (e) => e.type === "ready");
+ // A subscriber whose body throws on the cached-state replay.
+ const throwingSubscriber = (): void => {
+ throw new Error("controller closed");
+ };
+ // Must not throw out of subscribe(); must still return a
+ // working unsubscribe.
+ let unsubscribe: () => void = () => undefined;
+ expect(() => {
+ unsubscribe = hmr.subscribe(throwingSubscriber);
+ }).not.toThrow();
+ expect(typeof unsubscribe).toBe("function");
+ // Confirm the coordinator is still healthy: a *new* subscriber
+ // (after the throwing one) still receives the cached replay.
+ const recoveryEvents: HmrEvent[] = [];
+ hmr.subscribe((e) => recoveryEvents.push(e));
+ expect(recoveryEvents.length).toBeGreaterThanOrEqual(1);
+ unsubscribe();
+ } finally {
+ await hmr.dispose();
+ }
+ });
+
+ it("stops broadcasting after dispose()", async () => {
+ mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+ writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+ const events: HmrEvent[] = [];
+ const hmr = createHmrCoordinator({ cwd });
+ hmr.subscribe((e) => events.push(e));
+ await nextEvent(events, (e) => e.type === "ready");
+ await hmr.dispose();
+ const countAfterDispose = events.length;
+
+ // Edit after dispose must not produce any further events.
+ writeFileSync(
+ join(cwd, "src/arkor/index.ts"),
+ FAKE_MANIFEST.replace(`"alpha"`, `"gamma"`),
+ );
+ await new Promise((r) => setTimeout(r, 250));
+ expect(events.length).toBe(countAfterDispose);
+ });
+
+ it("the cached lastEvent reflects the LATEST source under rapid back-to-back edits", async () => {
+ // Regression: the BUNDLE_END handler used to fire
+ // `emitBuildSucceeded` without awaiting, so two quick rebuilds
+ // could run `inspectBundle` concurrently and broadcast out of
+ // order, leaving `lastEvent` pointing at the older snapshot.
+ // We can't deterministically synthesise a race against rolldown's
+ // real watcher, but we *can* assert the user-visible invariant:
+ // after a sequence of edits, the cached state must match the
+ // bytes that are actually on disk. The new sequence-number guard
+ // inside `emitBuildSucceeded` drops stale inspection results so
+ // whichever BUNDLE_END landed last broadcasts last.
+ mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+ writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+ const events: HmrEvent[] = [];
+ const hmr = createHmrCoordinator({ cwd });
+ hmr.subscribe((e) => events.push(e));
+ try {
+ await nextEvent(events, (e) => e.type === "ready");
+ writeFileSync(
+ join(cwd, "src/arkor/index.ts"),
+ FAKE_MANIFEST.replace(`"alpha"`, `"beta"`),
+ );
+ await nextEvent(events, (e) => e.type === "rebuild", 4000);
+ writeFileSync(
+ join(cwd, "src/arkor/index.ts"),
+ FAKE_MANIFEST.replace(`"alpha"`, `"gamma"`),
+ );
+ // Wait for the watcher to settle; any rebuild that's going to
+ // fire (including spurious extras from FSEvents on macOS or
+ // chokidar polling on Windows) lands within this window. The
+ // assertion then compares the cached `lastEvent.hash` against
+ // the *actual* fingerprint of the on-disk artefact, not a
+ // captured "last expected" hash from earlier in the test:
+ // that earlier capture was brittle on Windows where rolldown
+ // routinely emits a 4th BUNDLE_END after the explicit edits
+ // settle, producing a slightly different output byte (a
+ // change in the bundled comment header is enough to bump
+ // mtime + ctime + size).
+ await waitForStableEvents(events, 750);
+ const stat = statSync(join(cwd, ".arkor/build/index.mjs"));
+ const expectedHash = `${stat.mtimeMs}-${stat.ctimeMs}-${stat.size}`;
+ expect(events[events.length - 1]?.hash).toBe(expectedHash);
+ } finally {
+ await hmr.dispose();
+ }
+ });
+
+ it("getCurrentConfigHash() returns the latest cached event's hash", async () => {
+ // Regression: `/api/train` previously called `readManifestSummary`
+ // and ran a redundant rebuild per spawn (racing the watcher).
+ // The new server flow reads the cached hash via
+ // `getCurrentConfigHash()`. We can't trigger a real build here
+ // (the user-bundle entry shape would need a working `arkor`
+ // resolution at import time), but we can verify the getter
+ // returns `null` before the watcher has emitted any event and
+ // tracks the cached event's `configHash` field once one lands.
+ // The integration of "configHash actually populated for all
+ // entry shapes" is covered by the unit test against
+ // `findInspectableTrainer` in `trainerInspection.test.ts`.
+ mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+ writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+ const events: HmrEvent[] = [];
+ const hmr = createHmrCoordinator({ cwd });
+ // Before any subscriber attaches, no watcher is running and no
+ // event has been broadcast: getter must return null without
+ // throwing.
+ expect(hmr.getCurrentConfigHash()).toBeNull();
+ hmr.subscribe((e) => events.push(e));
+ try {
+ const ready = await nextEvent(events, (e) => e.type === "ready");
+ // FAKE_MANIFEST is hand-rolled (no SDK brand) so the cached
+ // hash is null, but the *getter* must still return whatever
+ // the cached event carries, not throw.
+ expect(hmr.getCurrentConfigHash()).toBe(ready.configHash ?? null);
+ } finally {
+ await hmr.dispose();
+ }
+ });
+
+ it("getCurrentArtifactHash() returns null when the artefact doesn't exist (vs a Date.now() fallback)", async () => {
+ // Regression: a previous implementation did
+ // `statSync(...) ; return fingerprint(...)`. Two stat calls
+ // means a race window where the file disappears between them:
+ // the existence check passes, then `fingerprint`'s catch
+ // branch substitutes `Date.now().toString(36)` (its
+ // freshness-forcing fallback for SSE dedup), and the getter
+ // returns a non-null, non-artefact-derived hash. That
+ // silently breaks `dispatchRebuild`'s pre-ready-spawn gate
+ // which relies on null === "no artefact, force restart".
+ // The fix uses `fingerprintOrNull`: single statSync, true
+ // null on failure.
+ //
+ // We assert the getter on a project that has NEVER built
+ // (no `.arkor/build/index.mjs` ever existed). The bug-fix
+ // version returns null; the broken version's leftover would
+ // have been Date.now()-derived non-null.
+ mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+ writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+ const hmr = createHmrCoordinator({ cwd });
+ try {
+ // No subscribe() yet: watcher hasn't started, so no
+ // BUNDLE_END has written the artefact. The on-disk
+ // `.arkor/build/index.mjs` doesn't exist.
+ expect(hmr.getCurrentArtifactHash()).toBeNull();
+ } finally {
+ await hmr.dispose();
+ }
+ });
+
+ it("getCurrentArtifactHash() returns a stable mtime/ctime/size hash once the artefact exists", async () => {
+ // Companion to the null-on-missing test: when the artefact
+ // *does* exist (watcher's first BUNDLE_END landed), the
+ // getter returns the same `mtimeMs-ctimeMs-size` shape the
+ // SSE event's `hash` field uses. The two are paired for SSE
+ // dedup purposes; the pre-ready-spawn registry gate switched
+ // to content-hash (`getCurrentArtifactContentHash`) to avoid
+ // identical-bytes/different-timestamps false positives, but
+ // the timestamp hash stays as the canonical SSE event id.
+ mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+ writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+ const events: HmrEvent[] = [];
+ const hmr = createHmrCoordinator({ cwd });
+ hmr.subscribe((e) => events.push(e));
+ try {
+ const ready = await nextEvent(events, (e) => e.type === "ready");
+ const artifactHash = hmr.getCurrentArtifactHash();
+ // Same shape as the SSE event's `hash` field: both feed
+ // through the same `mtimeMs-ctimeMs-size` formula.
+ expect(artifactHash).toBe(ready.hash ?? null);
+ expect(artifactHash).toMatch(/^[\d.]+-[\d.]+-\d+$/);
+ } finally {
+ await hmr.dispose();
+ }
+ });
+
+ it("getCurrentConfigHash() preserves the last-success hash across an ERROR event", async () => {
+ // Regression: previously `getCurrentConfigHash()` returned
+ // `lastEvent?.configHash ?? null`. After an ERROR landed,
+ // `lastEvent` was the error event (no `configHash`) so the
+ // getter went null even though `.arkor/build/index.mjs` still
+ // held the previous *successful* bundle bytes (ERROR doesn't
+ // overwrite the output). A child spawned via `/api/train` in
+ // that window would register `configHash: null`, and the next
+ // successful BUNDLE_END would diff against null → SIGTERM
+ // restart instead of SIGUSR2 hot-swap, defeating callback
+ // hot-swap for the rest of the session. The fix tracks the
+ // last *successful* hash separately from `lastEvent`.
+ mkdirSync(join(cwd, "src/arkor"), { recursive: true });
+ writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST);
+
+ const events: HmrEvent[] = [];
+ const hmr = createHmrCoordinator({ cwd });
+ hmr.subscribe((e) => events.push(e));
+ try {
+ const ready = await nextEvent(events, (e) => e.type === "ready");
+ const successHash = hmr.getCurrentConfigHash();
+ // Sanity: ready event's configHash matches the getter.
+ expect(successHash).toBe(ready.configHash ?? null);
+ // Inject a syntax error to force a watcher ERROR event.
+ writeFileSync(
+ join(cwd, "src/arkor/index.ts"),
+ "this is not { valid javascript = ;",
+ );
+ await nextEvent(events, (e) => e.type === "error", 4000);
+ // After the error, the cached `lastEvent` is the error frame
+ // but the on-disk artifact still holds the previous
+ // success. The getter must return that previous-success hash
+ // so any `/api/train` spawn during this window still gets a
+ // useful spawn-time hash for the *next* rebuild's routing.
+ expect(hmr.getCurrentConfigHash()).toBe(successHash);
+ } finally {
+ await hmr.dispose();
+ }
+ });
+});
diff --git a/packages/arkor/src/studio/hmr.ts b/packages/arkor/src/studio/hmr.ts
new file mode 100644
index 00000000..974ed771
--- /dev/null
+++ b/packages/arkor/src/studio/hmr.ts
@@ -0,0 +1,527 @@
+import { createHash } from "node:crypto";
+import { existsSync, readFileSync, statSync } from "node:fs";
+import { watch, type RolldownWatcher } from "rolldown";
+import { hashJobConfig } from "../core/configHash";
+import { moduleCacheBustUrl } from "../core/moduleCacheBust";
+import {
+ BUILD_DEFAULTS,
+ resolveBuildEntry,
+ rolldownInputOptions,
+ type BuildEntryOptions,
+} from "../core/rolldownConfig";
+import { findInspectableTrainer } from "../core/trainerInspection";
+
+export type HmrEventType = "ready" | "rebuild" | "error";
+
+export interface HmrEvent {
+ type: HmrEventType;
+ outFile?: string;
+ /**
+ * Short fingerprint of the bundle artefact (mtime + ctime + size,
+ * mirroring `core/moduleCacheBust.ts`'s key shape). Subscribers use
+ * this to dedupe replays of the same successful build.
+ */
+ hash?: string;
+ /**
+ * Content-derived hash (sha256, truncated) of the artefact bytes.
+ * Used by `dispatchRebuild`'s pre-ready-spawn equality gate where
+ * `hash` would over-trigger SIGTERM-restart: a watcher build that
+ * rewrites identical bytes still bumps mtime/ctime, so two
+ * timestamp fingerprints differ even though the loaded bytes are
+ * the same. Comparing this content-hash instead avoids that
+ * spurious cancel+restart cycle in the "user clicked Run before
+ * the watcher's first BUNDLE_END landed" case.
+ */
+ contentHash?: string | null;
+ /**
+ * Stable hash of the trainer's cloud-side `JobConfig`. When this is
+ * unchanged across a rebuild, only the in-process callbacks moved and
+ * the Studio server can hot-swap them without restarting the run.
+ * `null` when the bundle has no discoverable trainer (e.g. the user's
+ * source has a syntax error or the Arkor manifest is missing).
+ */
+ configHash?: string | null;
+ /** Run name pulled from the rebuilt manifest. */
+ trainerName?: string | null;
+ /** Human-readable error message; only present on `type === "error"`. */
+ message?: string;
+}
+
+export interface HmrCoordinator {
+ /**
+ * Receive the current cached state immediately, then every subsequent
+ * event. Returns an unsubscribe function.
+ */
+ subscribe(fn: (event: HmrEvent) => void): () => void;
+ /**
+ * Synchronous read of the most recent successful build's
+ * `configHash`. Used by `/api/train` to capture the hash that's
+ * about to be spawned so HMR routing on the *next* rebuild knows
+ * whether the new bundle changed cloud-side config. `null` until the
+ * first successful build (e.g. fresh scaffold) or when that build had
+ * no inspectable trainer; a later `error` event keeps the last value.
+ */
+ getCurrentConfigHash(): string | null;
+ /**
+ * Synchronous fingerprint of the on-disk build artefact RIGHT NOW
+ * (fresh stat, not cached). Used by `/api/train`'s registry entry
+ * so HMR routing in the pre-ready-spawn case (`configHash === null`)
+ * can compare against the rebuild's `event.hash` to tell whether
+ * the child read the same bytes. Without this gate, an edit
+ * landing between spawn and the watcher's first BUNDLE_END would
+ * silently teach the registry to use the post-edit `configHash`
+ * as the child's baseline; later same-hash rebuilds would then
+ * hot-swap callbacks into a child whose cloud-side `JobConfig`
+ * was actually spawned against an older version, leaving the
+ * cloud run on a stale config. `null` when stat fails (artefact
+ * doesn't exist yet, fresh project never built).
+ */
+ getCurrentArtifactHash(): string | null;
+ /**
+ * Content-derived hash (sha256, truncated) of the on-disk
+ * artefact RIGHT NOW. Used by `/api/train` to capture a
+ * spawn-time content-hash for the registry's pre-ready-spawn
+ * equality gate; paired with the rebuild's `event.contentHash`,
+ * a mismatch unambiguously means the bytes changed (not just
+ * timestamps), so `dispatchRebuild` only SIGTERM-restarts when
+ * the child genuinely loaded different bytes than the new
+ * configHash describes. `null` on stat/read failure (artefact
+ * doesn't exist yet, fresh project never built).
+ */
+ getCurrentArtifactContentHash(): string | null;
+ /**
+ * Last broadcast event's `type`, or `null` if nothing has been
+ * broadcast yet. `/api/manifest`'s HMR fast path consults this to
+ * suppress its "serve last good artefact" behaviour while the
+ * watcher is in an `error` state; without that gate, the SPA's
+ * 5 s `/api/manifest` poll would keep getting a 200 stale
+ * manifest and silently overwrite the SSE-driven build-error UI,
+ * letting users run with stale code/config while the latest
+ * source is still failing to compile.
+ */
+ getLastEventType(): HmrEventType | null;
+ /**
+ * Close the rolldown watcher and drop all subscribers. **Does not
+ * (and cannot) evict the user-module records that `inspectBundle`
+ * loaded into Node's ESM cache**: Node's loader exposes no
+ * eviction API, so for `arkor dev` sessions that go through many
+ * rebuilds before exit, the cache retains one record per distinct
+ * artefact fingerprint for the rest of the process lifetime.
+ * The mtime/ctime/size cache-bust key (`moduleCacheBustUrl`)
+ * collapses events that leave the artefact untouched onto the same
+ * record, bounding the retention to roughly "one entry per real
+ * rebuild", which is the tightest we can offer here. Tests that loop
+ * `createHmrCoordinator` → rebuild → `dispose` therefore still
+ * accumulate process-wide ESM-cache entries.
+ */
+ dispose(): Promise<void>;
+}
+
+export type HmrOptions = BuildEntryOptions;
+
+/**
+ * Content-derived fingerprint of the artefact bytes (sha256, first 16
+ * hex chars). Used by `dispatchRebuild`'s pre-ready-spawn gate where
+ * timestamp-based comparison gives false positives: a watcher rebuild
+ * that produces the same bytes still bumps mtime/ctime, so a child
+ * spawned just before `ready` would be unnecessarily SIGTERM-restarted
+ * even though its loaded bytes match the new build's. Hashing a few
+ * MB of bundle on each call is cheap relative to the GPU cost of a
+ * spurious cancel+restart cycle.
+ *
+ * Returns `null` on stat/read failure so the caller can treat
+ * "no artefact" as "force restart" (the conservative default).
+ */
+function contentHashOrNull(outFile: string): string | null {
+ try {
+ const bytes = readFileSync(outFile);
+ return createHash("sha256").update(bytes).digest("hex").slice(0, 16);
+ } catch {
+ return null;
+ }
+}
+
+/**
+ * Single-stat fingerprint with a clean `null` on failure: used by
+ * `getCurrentArtifactHash()` whose contract is "return a fingerprint
+ * derived from the on-disk artefact, or `null` if no artefact". A
+ * separate exists-check + `fingerprint()` here would race: the file
+ * could disappear between the two stats and `fingerprint()`'s
+ * `Date.now()` fallback would return a non-null hash that doesn't
+ * describe any real bytes, silently violating the contract.
+ */
+function fingerprintOrNull(outFile: string): string | null {
+ try {
+ const s = statSync(outFile);
+ // Same shape as `fingerprint()`'s success branch; `ctimeMs` is
+ // the belt-and-braces guard for `touch -m`-style edits where
+ // mtime stays put.
+ return `${s.mtimeMs}-${s.ctimeMs}-${s.size}`;
+ } catch {
+ return null;
+ }
+}
+
+function fingerprint(outFile: string): string {
+ // Delegate to `fingerprintOrNull` and substitute a freshness-
+ // forcing token on stat failure. The `Date.now()` fallback
+ // matters here (vs the "0-0-0" sentinel `moduleCacheBustKey`
+ // uses): SPA-side SSE dedup keys off this hash, so a stable
+ // literal during a racy stat would silently swallow genuinely-
+ // fresh broadcast events.
+ return fingerprintOrNull(outFile) ?? Date.now().toString(36);
+}
+
+type InspectionResult = {
+ configHash: string;
+ trainerName: string;
+} | null;
+
+/**
+ * Dynamic-import the freshly-built bundle and pull a `TrainerInspection`
+ * snapshot off the discovered trainer.
+ *
+ * Walks every entry shape `runner.ts` accepts (named `arkor`, named
+ * `trainer`, `default` Arkor manifest, `default` Trainer,
+ * `default.trainer`) via the shared `findInspectableTrainer` helper,
+ * keeping inspection in sync with execution. Without this, projects
+ * that only `export const trainer` (a documented shortcut) would
+ * always produce `configHash: null` and every rebuild would take the
+ * SIGTERM-restart path unnecessarily.
+ *
+ * Cache-bust by file mtime+ctime+size (via `moduleCacheBustUrl`)
+ * rather than `Date.now()`:
+ *
+ * - Node's ESM loader caches every dynamically-imported URL for the
+ * lifetime of the process and never evicts. A `?t=Date.now()`
+ * suffix produces a unique URL per call, so a long `arkor dev`
+ * session would accumulate one module record per BUNDLE_END:
+ * unbounded memory growth.
+ * - The composite key (`mtimeMs-ctimeMs-size`) keys the cache to
+ * "the actual bytes in this file", so spurious watcher events
+ * that don't change content reuse the prior module record. The
+ * leak shrinks from "one entry per keystroke" to "one entry per
+ * actual rebuild", which for a realistic dev session (hundreds
+ * of saves over hours) is bounded by the number of distinct file
+ * states the user produces, and that's fundamentally what HMR
+ * has to track to surface up-to-date trainer state. There's no
+ * public Node API for evicting an ESM module record, so this is
+ * the tightest bound we can offer without spawning a child
+ * process per inspection.
+ *
+ * Best-effort: a missing/malformed manifest or a thrown user
+ * constructor returns `null` and the caller treats the rebuild as
+ * "config-unknown".
+ */
+async function inspectBundle(outFile: string): Promise<InspectionResult> {
+ try {
+ const mod = (await import(moduleCacheBustUrl(outFile))) as Record<
+ string,
+ unknown
+ >;
+ const inspection = findInspectableTrainer(mod);
+ if (!inspection) return null;
+ return {
+ configHash: hashJobConfig(inspection.config),
+ trainerName: inspection.name,
+ };
+ } catch {
+ return null;
+ }
+}
+
+/**
+ * Spin up a rolldown watcher over the user's `src/arkor` entry, broadcasting
+ * `ready` / `rebuild` / `error` to subscribers. Used by `arkor dev` to push
+ * `/api/dev/events` SSE notifications to the SPA.
+ *
+ * Lazy: the watcher only starts on the first `subscribe` call so a Studio
+ * launch in a project without `src/arkor/index.ts` doesn't immediately fail.
+ * The watcher kicks in once the user creates the file and the SPA opens
+ * an EventSource. After every successful build the watcher caches the
+ * latest state and replays it to new subscribers so a late-mounting
+ * component still sees the trainer.
+ */
+export function createHmrCoordinator(opts: HmrOptions): HmrCoordinator {
+ const resolved = resolveBuildEntry(opts);
+
+ const subscribers = new Set<(event: HmrEvent) => void>();
+ let lastEvent: HmrEvent | null = null;
+ let watcher: RolldownWatcher | null = null;
+ let disposed = false;
+ /**
+ * When `startWatcher` runs against a project that doesn't have an
+ * entry file yet, a poll timer takes over and waits for the file to
+ * appear. Without this, an SPA that opened `/api/dev/events` against
+ * a fresh scaffold would hang on the initial `error` event forever
+ * (`startWatcher` is only re-entered on `subscribe()`, but EventSource
+ * doesn't reconnect on application-level errors).
+ */
+ let entryWaitTimer: ReturnType<typeof setTimeout> | null = null;
+ /**
+ * Monotonically incrementing build sequence number. Bumped on every
+ * `BUNDLE_END` *before* the inspection awaits, so when an
+ * inspection eventually resolves it can check whether a newer
+ * build has started in the meantime and silently drop its stale
+ * result.
+ *
+ * This matters because `inspectBundle` does an asynchronous
+ * dynamic-import of the just-written artifact. Two rebuilds A → B
+ * landing within the import window can race, with A's inspection
+ * resolving *after* B's. The previous "fire-and-forget" code
+ * would then publish A on top of B and leave `lastEvent` pointing
+ * at the older `configHash`/`trainerName`. That in turn drove
+ * `/api/dev/events` to make hot-swap-vs-restart decisions against
+ * stale routing data and surfaced the wrong trainer name in the
+ * SPA.
+ */
+ let buildSeq = 0;
+ /**
+ * Whether a `ready` event has actually broadcast yet. Tracked
+ * separately from `firstBuild` because the inspection await means
+ * the first BUNDLE_END's broadcast can land *after* a second
+ * BUNDLE_END schedules its own. Pinning the type to
+ * "broadcast-time" rather than "schedule-time" guarantees the SPA
+ * still sees `ready` first even when the initial inspection loses
+ * the race.
+ */
+ let firstBroadcast = true;
+ /**
+ * Cached `configHash` of the last *successful* build, **independent
+ * of `lastEvent`**. `lastEvent` tracks every broadcast (including
+ * `error`) for the cached-replay-on-late-subscribe contract, but a
+ * transient build error must not blank out the spawn-time hash that
+ * `/api/train` reads via `getCurrentConfigHash()`. The on-disk
+ * `.arkor/build/index.mjs` doesn't change on ERROR, so a child
+ * spawned during an error state is running the *previous* successful
+ * bundle, and the next BUNDLE_END's hash should be compared
+ * against THAT. Without this separate cache, the whole rebuild gets
+ * routed through SIGTERM-restart and SIGUSR2 hot-swap stops working
+ * for the rest of the session whenever the user briefly broke their
+ * source.
+ */
+ let lastSuccessConfigHash: string | null = null;
+
+ function broadcast(event: HmrEvent): void {
+ lastEvent = event;
+ for (const fn of subscribers) {
+ try {
+ fn(event);
+ } catch {
+ // Subscribers are SSE controllers; a thrown error usually means
+ // the connection closed mid-flight. Drop it so one bad subscriber
+ // can't poison the broadcast for the rest.
+ }
+ }
+ }
+
+ async function emitBuildSucceeded(): Promise<void> {
+ if (disposed) return;
+ const seq = ++buildSeq;
+ const inspection = await inspectBundle(resolved.outFile);
+ // Drop stale results: a newer rebuild already started (or
+ // finished) while our inspection was running. The newer
+ // inspection will own the broadcast for the latest state; this
+ // one publishing now would just clobber `lastEvent` with the
+ // older snapshot.
+ if (seq !== buildSeq || disposed) return;
+ const type: HmrEventType = firstBroadcast ? "ready" : "rebuild";
+ firstBroadcast = false;
+ const configHash = inspection?.configHash ?? null;
+ // BUNDLE_END always reflects what's now on disk: even when the
+ // bundle is unbranded (`configHash === null`), that's the
+ // current truth. Capture it so `/api/train` spawning during a
+ // *subsequent* transient error still has the right spawn-time
+ // hash to compare against the next successful rebuild.
+ lastSuccessConfigHash = configHash;
+ broadcast({
+ type,
+ outFile: resolved.outFile,
+ hash: fingerprint(resolved.outFile),
+ // Content hash powers the registry's pre-ready-spawn equality
+ // gate (timestamp-only would over-trigger SIGTERM-restart on
+ // identical-bytes rebuilds). Read once here so the broadcast
+ // and any spawn-time capture reference the same on-disk state.
+ contentHash: contentHashOrNull(resolved.outFile),
+ configHash,
+ trainerName: inspection?.trainerName ?? null,
+ });
+ }
+
+ function startWatcher(): void {
+ if (watcher || disposed) return;
+ if (!existsSync(resolved.entry)) {
+ broadcast({
+ type: "error",
+ message: `Build entry not found: ${resolved.entry}. Create ${BUILD_DEFAULTS.entry} or pass an explicit entry argument.`,
+ });
+ // Hand off to a low-frequency poll so an SPA already connected to
+ // `/api/dev/events` transitions from "error" to "ready" the moment
+ // the user creates the entry file (no manual reconnect required).
+ // The poll is `unref()`'d so it never blocks process exit, and
+ // `dispose()` clears it.
+ if (!entryWaitTimer) {
+ entryWaitTimer = setInterval(() => {
+ if (disposed || watcher) {
+ if (entryWaitTimer) clearInterval(entryWaitTimer);
+ entryWaitTimer = null;
+ return;
+ }
+ if (existsSync(resolved.entry)) {
+ if (entryWaitTimer) clearInterval(entryWaitTimer);
+ entryWaitTimer = null;
+ startWatcher();
+ }
+ }, 1000);
+ entryWaitTimer.unref?.();
+ }
+ return;
+ }
+ // The entry exists now: clear any leftover poll timer from a prior
+ // failed startWatcher invocation.
+ if (entryWaitTimer) {
+ clearInterval(entryWaitTimer);
+ entryWaitTimer = null;
+ }
+ watcher = watch({
+ ...rolldownInputOptions(resolved),
+ output: { file: resolved.outFile, format: "esm" },
+ });
+ watcher.on("event", (event) => {
+ if (event.code === "BUNDLE_END") {
+ // rolldown requires the per-build result to be closed to avoid leaks.
+ event.result.close().catch(() => {});
+ // The event type ("ready" vs "rebuild") is decided inside
+ // `emitBuildSucceeded` *after* the inspection await, based on
+ // whether any prior broadcast actually landed (see the
+ // `firstBroadcast` comment for why pinning the type at this
+ // schedule point would be wrong under inspection races).
+ void emitBuildSucceeded();
+ } else if (event.code === "ERROR") {
+ // Rolldown's ERROR events don't always carry a `result`:
+ // when the failure is in the parse/resolve phase there's
+ // no per-build output to close, so `event.result` is
+ // `undefined`. Calling `.close()` then would throw
+ // synchronously, escape this listener, and permanently
+ // wedge the watcher so the SPA stays on the prior `error`
+ // state forever even after the user fixes their code.
+ // Optional-chain so we still close any result that *is*
+ // present (avoiding the leak rolldown warns about) without
+ // blowing up the watcher when none is.
+ event.result?.close().catch(() => {});
+ // Bump the seq so a still-in-flight `emitBuildSucceeded`
+ // from a *prior* BUNDLE_END drops its broadcast when its
+ // inspection finally resolves. Without this, the older
+ // success would land on top of this error and clobber
+ // `lastEvent`/`configHash`, leaving the SPA showing a
+ // healthy rebuild while the actual latest build state is
+ // a compile error. The successful-rebuild path bumps the
+ // same counter inside `emitBuildSucceeded`.
+ buildSeq += 1;
+ broadcast({
+ type: "error",
+ message:
+ event.error instanceof Error
+ ? event.error.message
+ : String(event.error),
+ });
+ }
+ });
+ }
+
+ return {
+ subscribe(fn) {
+ subscribers.add(fn);
+ // Replay the last broadcast so a late-mounting subscriber (an
+ // `/api/dev/events` SSE client opening after the first BUNDLE_END,
+ // or `buildStudioApp`'s dispatch subscriber registering after
+ // entry-wait recovery) sees current state without waiting for
+ // the next rebuild.
+ //
+ // Wrapped in the same defensive try/catch as `broadcast` so a
+ // throw inside the subscriber (typically an SSE controller that
+ // closed mid-replay: `controller.enqueue` on a closed stream
+ // throws) doesn't propagate out of `subscribe()` and crash
+ // whoever just registered. One bad subscriber must not be able
+ // to break HMR initialisation for the rest of the process.
+ if (lastEvent) {
+ try {
+ fn(lastEvent);
+ } catch {
+ // Swallow: subscribers own their own teardown; we just
+ // shouldn't poison their `subscribe()` call site.
+ }
+ }
+ startWatcher();
+ return () => {
+ subscribers.delete(fn);
+ };
+ },
+ getCurrentConfigHash() {
+ // Returns the hash of the *last successful* build, NOT
+ // `lastEvent.configHash`. The two diverge after an ERROR:
+ // `lastEvent` becomes the error event (no `configHash`), but
+ // `.arkor/build/index.mjs` still holds the previous successful
+ // bundle bytes, and a child spawned in that window is running
+ // those bytes. Returning the cached success hash keeps
+ // `/api/train` registering accurate spawn-time hashes so the
+ // next successful BUNDLE_END can route hot-swap vs restart
+ // correctly. `null` only before the first successful build (or
+ // a build that wasn't inspectable).
+ return lastSuccessConfigHash;
+ },
+ getCurrentArtifactHash() {
+ // Fresh stat (not the cached `lastEvent.hash`). The cached
+ // hash describes the bytes the watcher last broadcast about,
+ // but the on-disk artefact may be newer (a BUNDLE_END is
+ // queued, file already written, inspection still pending) or
+ // older (next BUNDLE_END hasn't fired yet but the user just
+ // edited and saved). For the registry's pre-ready-spawn gate
+ // we want "what bytes will the child's `await import()` see
+ // RIGHT NOW".
+ //
+ // `fingerprintOrNull` does ONE statSync and returns null on
+ // failure, preserving the documented contract. A previous
+ // implementation here did `statSync(...)` first and then
+ // called `fingerprint()` (which has a `Date.now()` fallback
+ // baked in for SSE dedup uniqueness). That double-stat
+ // raced: if the file disappeared between the two calls we'd
+ // return a Date.now()-derived hash that doesn't describe any
+ // real bytes, silently violating the "null on stat failure"
+ // contract dispatchRebuild relies on for its SIGTERM-restart
+ // routing.
+ return fingerprintOrNull(resolved.outFile);
+ },
+ getCurrentArtifactContentHash() {
+ // Companion to `getCurrentArtifactHash` for the registry's
+ // pre-ready-spawn equality gate. Reads + sha256s the file
+ // at call time so the result describes the exact bytes the
+ // just-spawned child will see in its `await import()`.
+ // Same null-on-failure contract: caller treats null as
+ // "force restart" (the conservative default).
+ return contentHashOrNull(resolved.outFile);
+ },
+ getLastEventType() {
+ // `lastEvent` is the latest broadcast: `ready` / `rebuild` /
+ // `error`. Returning the type lets `/api/manifest`'s HMR
+ // fast path skip serving the stale built artefact when the
+ // watcher is currently in `error` (current source fails to
+ // compile), so the SPA's poll loop doesn't paper over the
+ // SSE-surfaced error.
+ return lastEvent?.type ?? null;
+ },
+ async dispose() {
+ disposed = true;
+ subscribers.clear();
+ if (entryWaitTimer) {
+ clearInterval(entryWaitTimer);
+ entryWaitTimer = null;
+ }
+ if (watcher) {
+ const w = watcher;
+ watcher = null;
+ await w.close().catch(() => {});
+ }
+ },
+ };
+}
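
Reviewer sketch (not part of the patch): the `buildSeq` guard above boils down to "bump a counter before awaiting, compare it after". The snippet below isolates that pattern so the race is easier to reason about. `createLatestOnlyPublisher`, `run`, `invalidate`, and `demoLatestOnly` are hypothetical names invented for this illustration; they are not arkor APIs, and the real coordinator also handles disposal and event typing.

```typescript
// Hypothetical helper: publishes an async result only when no newer
// run (or invalidation) started while the work was pending.
type Publish<T> = (value: T) => void;

function createLatestOnlyPublisher<T>(publish: Publish<T>) {
  let seq = 0;
  return {
    async run(work: () => Promise<T>): Promise<boolean> {
      // Claim a sequence number *before* awaiting, as `emitBuildSucceeded` does.
      const mine = ++seq;
      const value = await work();
      // A newer run or an invalidation bumped the counter: drop this result.
      if (mine !== seq) return false;
      publish(value);
      return true;
    },
    // Mirrors the ERROR branch: bump the counter so in-flight runs go stale.
    invalidate(): void {
      seq += 1;
    },
  };
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function demoLatestOnly(): Promise<string[]> {
  const published: string[] = [];
  const p = createLatestOnlyPublisher<string>((v) => published.push(v));
  // "A" starts first but resolves last; "B" starts second and resolves first.
  const a = p.run(async () => {
    await sleep(30);
    return "A";
  });
  const b = p.run(async () => {
    await sleep(5);
    return "B";
  });
  await Promise.all([a, b]);
  // "C" is invalidated while still in flight, like a success followed by ERROR.
  const c = p.run(async () => {
    await sleep(5);
    return "C";
  });
  p.invalidate();
  await c;
  return published; // ["B"]: the stale "A" and the invalidated "C" are dropped
}
```

Comparing against a shared counter, rather than cancelling the older work, keeps the sketch simple: stale work still runs to completion but its result is never published.
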
diff --git a/packages/arkor/src/studio/manifest.ts b/packages/arkor/src/studio/manifest.ts
index 72452da8..677115e3 100644
--- a/packages/arkor/src/studio/manifest.ts
+++ b/packages/arkor/src/studio/manifest.ts
@@ -1,6 +1,11 @@
-import { pathToFileURL } from "node:url";
+import { existsSync } from "node:fs";
import { runBuild } from "../cli/commands/build";
-import { isArkor } from "../core/arkor";
+import { hashJobConfig } from "../core/configHash";
+import { moduleCacheBustUrl } from "../core/moduleCacheBust";
+import {
+ findTrainerInModule,
+ getTrainerInspection,
+} from "../core/trainerInspection";
/**
* Wire-friendly snapshot of the user's `createArkor({...})` manifest. Mirrors
@@ -9,28 +14,120 @@ import { isArkor } from "../core/arkor";
*/
export interface ManifestSummary {
trainer: { name: string } | null;
+ /**
+ * Stable hash of the trainer's cloud-side `JobConfig`. Used by HMR to
+ * decide whether a rebuild only changed in-process callbacks (hash
+ * unchanged → hot-swap) or also touched cloud-side training config
+ * (hash changed → restart with `requestEarlyStop`). `null` when no
+ * inspectable trainer is present.
+ */
+ configHash: string | null;
// future: deploy: { name: string } | null;
// future: eval: { name: string } | null;
}
-const EMPTY: ManifestSummary = { trainer: null };
+const EMPTY: ManifestSummary = { trainer: null, configHash: null };
/**
- * Build the user's `src/arkor/index.ts` and import the artifact to extract a
- * serialisable summary of its manifest. The Studio UI hits this on home-page
- * load to show *what* the project contains (just the trainer name today;
- * deploy / eval slots when those primitives land).
+ * Dynamic-import an already-built artefact and pull a serialisable
+ * summary off its trainer. Cache-bust the URL so Node's ESM loader
+ * returns the fresh module text rather than a stale evaluation.
*
- * Each call rebuilds and re-imports so edits to the user's source surface
- * without restarting Studio. The import URL carries a cache-bust query so
- * Node's ESM cache doesn't return a stale module.
+ * Split out of `readManifestSummary` so callers that already triggered a
+ * build (the HMR coordinator hands the SPA an `outFile` after each
+ * `BUNDLE_END`) can inspect the artefact without paying for a redundant
+ * `runBuild()`.
*/
-export async function readManifestSummary(cwd: string): Promise<ManifestSummary> {
+export async function summariseBuiltManifest(
+ outFile: string,
+): Promise<ManifestSummary> {
+ // mtime+ctime+size cache-bust (vs `Date.now()`): the SPA polls
+ // `/api/manifest` every ~5 s, so a `Date.now()` suffix would
+ // accumulate one ESM module record per poll across a long
+ // `arkor dev` session: Node's loader has no eviction. Keying on
+ // the artefact bytes (via `moduleCacheBustUrl`) collapses
+ // unchanged-poll reads onto the existing record.
+ const mod = (await import(moduleCacheBustUrl(outFile))) as Record<
+ string,
+ unknown
+ >;
+ // Walk every trainer export shape `runner.ts` accepts via the
+ // shared helper (named `arkor`, named `trainer`, default Arkor
+ // manifest, `default.trainer`) so manifest summary, HMR routing,
+ // and runtime execution all agree about which exports count as a
+ // trainer.
+ const trainer = findTrainerInModule(mod);
+ if (!trainer) return EMPTY;
+ // Trainer name renders in the UI even for hand-rolled trainers
+ // that bypass `createTrainer` and therefore don't carry the SDK
+ // inspection brand. The brand is required only for the
+ // `configHash` used by HMR routing; without it, HMR conservatively
+ // SIGTERM-restarts on every rebuild (correct fallback).
+ const name =
+ typeof trainer.name === "string" ? trainer.name : "(unnamed trainer)";
+ const inspection = getTrainerInspection(trainer);
+ return {
+ trainer: { name },
+ configHash: inspection ? hashJobConfig(inspection.config) : null,
+ };
+}
+
+export interface ReadManifestOptions {
+ /**
+ * HMR-aware fast path: when set and the file exists, skip the
+ * `runBuild()` call and inspect this artefact directly. The HMR
+ * coordinator already keeps `.arkor/build/index.mjs` continuously
+ * fresh via its rolldown watcher, so re-running `runBuild()` on
+ * every `/api/manifest` poll (every ~5 s + on every rebuild SSE
+ * event) is wasted CPU AND races the watcher writing to the
+ * same path. Pre-existence is checked with `existsSync` so the
+ * very first poll on a fresh scaffold (watcher's first
+ * BUNDLE_END hasn't completed yet) still bootstraps via
+ * `runBuild()`. Once the file appears, subsequent polls skip
+ * the rebuild.
+ *
+ * Pass the equivalent of `coordinator.outFile` (e.g.
+ * `resolveBuildEntry({ cwd }).outFile`) here when the server has
+ * an active `HmrCoordinator`; leave undefined when HMR is off so
+ * the build path runs as before.
+ */
+ prebuiltOutFile?: string;
+}
+
+/**
+ * Build the user's `src/arkor/index.ts` and import the artifact to
+ * extract a serialisable summary of its manifest. The Studio UI hits
+ * this on home-page load to show *what* the project contains (just the
+ * trainer name today; deploy / eval slots when those primitives land).
+ *
+ * By default each call rebuilds and re-imports so edits to the user's
+ * source surface without restarting Studio. When `prebuiltOutFile` is
+ * supplied (HMR-enabled servers) and that file exists, the `runBuild()`
+ * step is skipped unless importing the artefact fails
+ * (see `ReadManifestOptions.prebuiltOutFile` for the rationale).
+ */
+export async function readManifestSummary(
+ cwd: string,
+ opts: ReadManifestOptions = {},
+): Promise<ManifestSummary> {
+ if (opts.prebuiltOutFile && existsSync(opts.prebuiltOutFile)) {
+ // Race recovery: rolldown's watcher writes
+ // `.arkor/build/index.mjs` non-atomically. `existsSync` flips to
+ // `true` the instant the file is created, but a `/api/manifest`
+ // poll landing during the flush window would then try to
+ // `await import(...)` partial bytes and surface as a 500
+ // SyntaxError in the UI. The legacy `runBuild()` path was
+ // synchronous and self-contained, so this race didn't exist
+ // there. Fall through to a fresh `runBuild()` on import failure
+ // (which produces a coherent artifact under our control). The
+ // fallback is best-effort: if `runBuild()` itself also throws
+ // (real user syntax error), rethrowing IS the right surface for
+ // `/api/manifest` to render the error inline.
+ try {
+ return await summariseBuiltManifest(opts.prebuiltOutFile);
+ } catch {
+ // fall through to runBuild()
+ }
+ }
const { outFile } = await runBuild({ cwd, quiet: true });
- const url = `${pathToFileURL(outFile).href}?t=${Date.now()}`;
- const mod = (await import(url)) as Record<string, unknown>;
- const candidate = mod.arkor ?? mod.default;
- if (!isArkor(candidate)) return EMPTY;
- const trainer = candidate.trainer ? { name: candidate.trainer.name } : null;
- return { trainer };
+ return summariseBuiltManifest(outFile);
}
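
Reviewer sketch (not part of the patch): the comment in `summariseBuiltManifest` argues for keying the import URL on the artefact's stat data rather than `Date.now()`. The implementation of `moduleCacheBustUrl` is not shown in this diff, so the function below is only an assumed approximation of the idea. `cacheBustUrlSketch`, `demoCacheBust`, and the `?v=` query name are hypothetical.

```typescript
import { mkdtempSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { pathToFileURL } from "node:url";

// Hypothetical stand-in for `moduleCacheBustUrl`: the query string only
// changes when the file's mtime, ctime, or size changes, so repeated
// imports of an unchanged artefact reuse one ESM module record.
function cacheBustUrlSketch(file: string): string {
  const { mtimeMs, ctimeMs, size } = statSync(file);
  return `${pathToFileURL(file).href}?v=${mtimeMs}-${ctimeMs}-${size}`;
}

function demoCacheBust(): { unchanged: boolean; changed: boolean } {
  const dir = mkdtempSync(join(tmpdir(), "cache-bust-"));
  const file = join(dir, "index.mjs");
  writeFileSync(file, "export const a = 1;\n");
  const first = cacheBustUrlSketch(file);
  const second = cacheBustUrlSketch(file); // no write in between
  // Rewrite with a different byte length so `size` is guaranteed to differ.
  writeFileSync(file, "export const a = 1; export const b = 2;\n");
  const third = cacheBustUrlSketch(file);
  return { unchanged: first === second, changed: first !== third };
}
```

With a `Date.now()` suffix the two unchanged reads would produce different URLs, and a poll loop would add one module record per poll.
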
diff --git a/packages/arkor/src/studio/server.test.ts b/packages/arkor/src/studio/server.test.ts
index 214889d9..ed8b9298 100644
--- a/packages/arkor/src/studio/server.test.ts
+++ b/packages/arkor/src/studio/server.test.ts
@@ -11,6 +11,7 @@ import {
import { tmpdir } from "node:os";
import { join, resolve } from "node:path";
import { buildStudioApp } from "./server";
+import type { HmrCoordinator, HmrEvent } from "./hmr";
import { writeCredentials } from "../core/credentials";
import { readState, writeState } from "../core/state";
import {
@@ -82,14 +83,14 @@ describe("Studio server", () => {
baseUrl: "http://mock",
assetsDir,
autoAnonymous: false,
- // @ts-expect-error — intentionally omitted to assert the runtime guard
+ // @ts-expect-error: intentionally omitted to assert the runtime guard
studioToken: undefined,
}),
).toThrow(/studioToken/);
});
it("HTML-escapes special characters in the studio token before injecting", async () => {
- // Branch coverage for `htmlAttrEscape` — a defensive guard against
+ // Branch coverage for `htmlAttrEscape`: a defensive guard against
// a token that contains `<`, `>`, `&`, `"`, `'`. randomBytes/base64url
// never produces these, but the helper must still escape them so a
// future token strategy can't break index.html parsing or open a
@@ -111,7 +112,7 @@ describe("Studio server", () => {
expect(html).toContain(
'',
);
- // The raw exotic token must not leak into HTML — an attacker who
+ // The raw exotic token must not leak into HTML: an attacker who
// could influence the token (hypothetical) shouldn't be able to
// inject markup.
expect(html).not.toMatch(/content="<>/);
@@ -133,6 +134,48 @@ describe("Studio server", () => {
expect(html.indexOf("arkor-studio-token")).toBeLessThan(
html.indexOf(""),
);
+ // HMR meta tag must NOT appear when no coordinator was supplied.
+ // The SPA reads this flag to decide whether to open
+ // `/api/dev/events`; a stray "true" here would make every prod
+ // session retry against the 404 indefinitely.
+ expect(html).not.toContain("arkor-hmr-enabled");
+ });
+
+ it("injects when an HMR coordinator is supplied", async () => {
+ // Regression: the SPA can't tell dev-mode usage from prod-mode
+ // usage at runtime: `vite build` ships with
+ // `import.meta.env.DEV === false`, so a build-time DEV gate inside
+ // the SPA bundle would (wrongly) suppress HMR even in real
+ // `arkor dev` sessions. The server-side flag is `true` exactly
+ // when `arkor dev` wired in an HMR coordinator. Verify it lands
+ // in `` next to the studio-token tag.
+ const fakeHmr = {
+ subscribe: () => () => undefined,
+ getCurrentConfigHash: () => null,
+ getCurrentArtifactHash: () => null,
+ getCurrentArtifactContentHash: () => null,
+ getLastEventType: () => null,
+ async dispose() {},
+ };
+ const app = buildStudioApp({
+ baseUrl: "http://mock",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ hmr: fakeHmr,
+ });
+ const res = await app.request("/", {
+ headers: { host: "127.0.0.1:4000" },
+ });
+ expect(res.status).toBe(200);
+ const html = await res.text();
+ expect(html).toContain(
+ ``,
+ );
+ expect(html.indexOf("arkor-hmr-enabled")).toBeLessThan(
+ html.indexOf(""),
+ );
});
it("serves non-html assets with the correct content-type", async () => {
@@ -356,7 +399,7 @@ describe("Studio server", () => {
expect(res.status).toBe(403);
});
- // Regression for ENG-404 — `path.resolve` doesn't follow symlinks, so a
+ // Regression for ENG-404: `path.resolve` doesn't follow symlinks, so a
// link inside the project directory pointing outside it would previously
// pass the containment check and be handed to `arkor start` (which would
// then dlopen the link's target).
@@ -433,7 +476,7 @@ describe("Studio server", () => {
expect(body.error).toMatch(/does not exist/);
});
- // Regression for ENG-356 — `/api/train` previously resolved the bundled
+ // Regression for ENG-356: `/api/train` previously resolved the bundled
// bin at `/bin.mjs` (one level above `dist/`), which never existed.
// The DI'd `binPath` lets us assert (a) a working bin streams its stdout
// through the response, and (b) a missing bin surfaces ENOENT-grade errors
@@ -468,6 +511,12 @@ process.exit(0);
body: JSON.stringify({}),
});
expect(res.status).toBe(200);
+ // Regression: the spawned subprocess's pid is exposed via the
+ // `X-Arkor-Train-Pid` response header so the SPA can scope HMR
+ // restart events to its own child (a multi-tab broadcast can
+ // contain mixed restart/hot-swap targets across siblings).
+ const pidHeader = res.headers.get("x-arkor-train-pid");
+ expect(pidHeader).toMatch(/^\d+$/);
const text = await res.text();
expect(text).toContain("[fake-bin]");
// The bin receives `start` as the first non-flag arg.
@@ -548,6 +597,763 @@ process.exit(0);
expect(text).toContain("exit=");
expect(text).not.toContain("exit=0");
});
+
+ it("captures the spawn-time configHash from the HMR coordinator (no extra rebuild)", async () => {
+ // Regression: `/api/train` previously called `readManifestSummary`
+ // which ran a full `runBuild()` per spawn: wasteful and racy
+ // against the HMR watcher writing the same `.arkor/build/index.mjs`.
+ // The new server reads the cached hash from
+ // `coordinator.getCurrentConfigHash()` instead. We assert the
+ // call happens (so a rebuild is *not* required) by exposing the
+ // spy count on the fake coordinator.
+ await writeCredentials(ANON_CREDS);
+ let getCurrentCalls = 0;
+ const fakeHmr = {
+ subscribe: () => () => undefined,
+ getCurrentConfigHash: () => {
+ getCurrentCalls += 1;
+ return "spawn-time-hash";
+ },
+ getCurrentArtifactHash: () => "spawn-artefact-hash",
+ getCurrentArtifactContentHash: () => "spawn-artefact-content-hash",
+ getLastEventType: () => null,
+ async dispose() {},
+ };
+ const fakeBin = join(trainCwd, "fake-bin.mjs");
+ writeFileSync(fakeBin, `process.exit(0);\n`);
+ const app = buildStudioApp({
+ baseUrl: "http://mock",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ binPath: fakeBin,
+ hmr: fakeHmr,
+ });
+ const res = await app.request("/api/train", {
+ method: "POST",
+ headers: {
+ host: "127.0.0.1:4000",
+ "x-arkor-studio-token": STUDIO_TOKEN,
+ "content-type": "application/json",
+ },
+ body: JSON.stringify({}),
+ });
+ expect(res.status).toBe(200);
+ // Drain the body so the close handler runs and the test
+ // doesn't leak the subprocess.
+ await res.text();
+ expect(getCurrentCalls).toBe(1);
+ });
+
+ it("/api/train job-id parser ignores stderr so a `Started job ` line on stderr can't hijack the cancel POST", async () => {
+ // Regression: the job-id detector used to consume both
+ // stdout AND stderr through a shared `onChunk` + shared
+ // line buffer. A user `console.error("Started job ")`
+ // on stderr would then poison the buffer first; the real
+ // stdout marker arrives later but our `getJobId(...) === null`
+ // gate has already short-circuited subsequent scans, so
+ // Stop-training POSTs cancel for the wrong (decoy) job and
+ // the real one keeps running: silent cloud orphan.
+ // Splitting into a stdout-only `onStdoutChunk` parser and a
+ // forward-only `onStderrChunk` makes stderr unable to
+ // populate `jobId` regardless of what the user logs there.
+ await writeCredentials(ANON_CREDS);
+ await writeState(
+ {
+ orgSlug: "stderr-test-org",
+ projectSlug: "stderr-test-project",
+ projectId: "p-stderr",
+ },
+ trainCwd,
+ );
+ // Bin emits a decoy `Started job ` to STDERR first
+ // (would poison the shared buffer), then the canonical real
+ // line to STDOUT, then hangs. With the split we expect the
+ // real id to win; with the bug the decoy would win.
+ const REAL_JOB_ID = "real-job-id";
+ const DECOY_JOB_ID = "decoy-from-stderr";
+ const fakeBin = join(trainCwd, "stderr-decoy-bin.mjs");
+ // The real runner prefixes its canonical line with the
+ // per-spawn nonce the server injected via
+ // ARKOR_JOB_ID_MARKER_NONCE; the decoy on stderr deliberately
+ // uses the nonce too (worst-case: a user who somehow learned
+ // the nonce still can't hijack the parser by writing to the
+ // wrong stream). With the parser correctly stdout-only the
+ // real line wins regardless.
+ writeFileSync(
+ fakeBin,
+ `const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? "";
+ process.stderr.write(\`[arkor:\${nonce}] Started job ${DECOY_JOB_ID}\\n\`);
+ // Slight delay so stderr lands first.
+ setTimeout(() => {
+ process.stdout.write(\`[arkor:\${nonce}] Started job ${REAL_JOB_ID}\\n\`);
+ }, 30);
+ process.on("SIGTERM", () => {});
+ setInterval(() => {}, 60_000);
+ `,
+ );
+ let cancelHits: Array<{ url: string }> = [];
+ const ORIG_FETCH = globalThis.fetch;
+ globalThis.fetch = (async (
+ input: Parameters<typeof fetch>[0],
+ init?: Parameters<typeof fetch>[1],
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && /\/v1\/jobs\/[^/]+\/cancel/.test(url)) {
+ cancelHits.push({ url });
+ return new Response(JSON.stringify({ ok: true }), {
+ status: 200,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ return new Response("not found", { status: 404 });
+ }) as typeof fetch;
+
+ try {
+ const app = buildStudioApp({
+ baseUrl: "http://mock-cloud-api",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ binPath: fakeBin,
+ });
+ const trainRes = await app.request("/api/train", {
+ method: "POST",
+ headers: {
+ host: "127.0.0.1:4000",
+ "x-arkor-studio-token": STUDIO_TOKEN,
+ "content-type": "application/json",
+ },
+ body: JSON.stringify({}),
+ });
+ expect(trainRes.status).toBe(200);
+ // Drain until the REAL line is in the body. Both the
+ // decoy and the real line forward through to the SPA log
+ // stream, so both bytes show up here regardless of which
+ // (if any) the parser captures.
+ const reader = trainRes.body!.getReader();
+ const decoder = new TextDecoder();
+ let buf = "";
+ while (!buf.includes(`Started job ${REAL_JOB_ID}`)) {
+ const { value, done } = await reader.read();
+ if (done) break;
+ buf += decoder.decode(value, { stream: true });
+ }
+ await reader.cancel();
+ await new Promise((r) => setTimeout(r, 200));
+
+ // The cancel POST must target the REAL id. With the bug
+ // the decoy would have been recorded first → cancelHits[0]
+ // would contain `decoy-from-stderr` instead.
+ expect(cancelHits).toHaveLength(1);
+ expect(cancelHits[0]?.url).toContain(`/v1/jobs/${REAL_JOB_ID}/cancel`);
+ expect(cancelHits[0]?.url).not.toContain(DECOY_JOB_ID);
+ } finally {
+ globalThis.fetch = ORIG_FETCH;
+ }
+ });
+
+ it("/api/train cancel POSTs cloud /v1/jobs/:id/cancel so the cloud job is released even though SIGKILL bypasses the runner's shutdown handlers", async () => {
+ // Regression: SIGKILL kills the runner without giving its
+ // `installShutdownHandlers` a chance to issue the cloud
+ // `cancel()` POST itself. Without a server-side equivalent
+ // the cloud job sits in "running" until TTL/reaper, so a
+ // user clicking "Stop training" silently keeps consuming
+ // GPU spend. The fix parses the runner's `Started job `
+ // stdout line, records the id on the registry entry, and
+ // fires a fire-and-forget POST to cloud-api on cancel
+ // *before* SIGKILLing.
+ await writeCredentials(ANON_CREDS);
+ // The cancel POST reads scope from `.arkor/state.json` (not
+ // from the anon creds' orgSlug; that's a different code
+ // path). Pre-seed so the POST can address the cloud job.
+ await writeState(
+ {
+ orgSlug: "cancel-test-org",
+ projectSlug: "cancel-test-project",
+ projectId: "p-cancel",
+ },
+ trainCwd,
+ );
+ // Bin prints the canonical "Started job " line then
+ // hangs (just like the real runner after `start()` resolves).
+ // The id is the same kind of identifier cloud-api would
+ // mint: an opaque string we'll verify shows up in the cancel
+ // POST URL below.
+ const FAKE_JOB_ID = "j-cancel-test";
+ const fakeBin = join(trainCwd, "started-job-bin.mjs");
+ // Prefix the marker with the per-spawn nonce the server
+ // injected via ARKOR_JOB_ID_MARKER_NONCE: that's the only
+ // shape the server's parser accepts, since user code can't
+ // know the nonce ahead of time (real runner deletes the env
+ // var before importing user modules).
+ writeFileSync(
+ fakeBin,
+ `const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? "";
+ process.stdout.write(\`[arkor:\${nonce}] Started job ${FAKE_JOB_ID}\\n\`);
+ process.on("SIGTERM", () => {});
+ setInterval(() => {}, 60_000);
+ `,
+ );
+ // Capture the cloud-api requests so we can verify the
+ // server's cancel POST landed with the right job id +
+ // scope. The default fetch in this suite would 404 our POST
+ // and leave `cancelHits` empty.
+ let cancelHits: Array<{ url: string; method: string }> = [];
+ const ORIG_FETCH = globalThis.fetch;
+ globalThis.fetch = (async (
+ input: Parameters<typeof fetch>[0],
+ init?: Parameters<typeof fetch>[1],
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (
+ method === "POST" &&
+ url.includes(`/v1/jobs/${FAKE_JOB_ID}/cancel`)
+ ) {
+ cancelHits.push({ url, method });
+ return new Response(JSON.stringify({ ok: true }), {
+ status: 200,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ // Catch-all default: anything else 404s, which would
+ // surface as a test-side failure if our cancel POST
+ // doesn't match the expected URL shape.
+ return new Response("not found", { status: 404 });
+ }) as typeof fetch;
+
+ try {
+ const app = buildStudioApp({
+ baseUrl: "http://mock-cloud-api",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ binPath: fakeBin,
+ });
+ const trainRes = await app.request("/api/train", {
+ method: "POST",
+ headers: {
+ host: "127.0.0.1:4000",
+ "x-arkor-studio-token": STUDIO_TOKEN,
+ "content-type": "application/json",
+ },
+ body: JSON.stringify({}),
+ });
+ expect(trainRes.status).toBe(200);
+ // Read enough of the body to ensure the runner's
+ // `Started job ` chunk has been processed by the
+ // server's stdout parser (without this, cancel could
+ // race ahead of the parser and find no jobId on the
+ // registry → no cancel POST → false test failure).
+ const reader = trainRes.body!.getReader();
+ const decoder = new TextDecoder();
+ let buf = "";
+ while (!buf.includes(`Started job ${FAKE_JOB_ID}`)) {
+ const { value, done } = await reader.read();
+ if (done) break;
+ buf += decoder.decode(value, { stream: true });
+ }
+ // Trigger cancel: should fire the cloud POST + SIGKILL.
+ await reader.cancel();
+ // Fire-and-forget: give the void IIFE a tick to actually
+ // dispatch the fetch + receive the 200 response.
+ await new Promise((r) => setTimeout(r, 200));
+
+ expect(cancelHits).toHaveLength(1);
+ expect(cancelHits[0]?.url).toContain(`/v1/jobs/${FAKE_JOB_ID}/cancel`);
+ // Scope is required by the cloud-api contract: comes from
+ // `.arkor/state.json` (seeded above), not the anon creds.
+ expect(cancelHits[0]?.url).toContain("orgSlug=cancel-test-org");
+ expect(cancelHits[0]?.url).toContain("projectSlug=cancel-test-project");
+ } finally {
+ globalThis.fetch = ORIG_FETCH;
+ }
+ });
+
+ it("/api/train cancel uses the spawn-time scope from the registry even when state.json was deleted mid-training", async () => {
+ // Regression: the cancel handler used to re-read
+ // `.arkor/state.json` at stop time to address the cloud cancel
+ // POST. If the user removed or made the file unreadable
+ // mid-training (rm -rf .arkor, accidental git clean -fdx, fs
+ // unmounted), the read returned null and the handler silently
+ // skipped the POST: the local SIGKILL still tore down the
+ // subprocess but the cloud job orphaned until TTL/reaper. The
+ // fix captures `{orgSlug, projectSlug}` on the registry entry
+ // at spawn time so the cancel POST is decoupled from
+ // mutable filesystem state.
+ await writeCredentials(ANON_CREDS);
+ await writeState(
+ {
+ orgSlug: "scope-pin-org",
+ projectSlug: "scope-pin-project",
+ projectId: "p-scope-pin",
+ },
+ trainCwd,
+ );
+ const FAKE_JOB_ID = "j-scope-pin";
+ const fakeBin = join(trainCwd, "scope-pin-bin.mjs");
+ writeFileSync(
+ fakeBin,
+ `const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? "";
+ process.stdout.write(\`[arkor:\${nonce}] Started job ${FAKE_JOB_ID}\\n\`);
+ process.on("SIGTERM", () => {});
+ setInterval(() => {}, 60_000);
+ `,
+ );
+ let cancelHits: Array<{ url: string }> = [];
+ const ORIG_FETCH = globalThis.fetch;
+ globalThis.fetch = (async (
+ input: Parameters<typeof fetch>[0],
+ init?: Parameters<typeof fetch>[1],
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && /\/v1\/jobs\/[^/]+\/cancel/.test(url)) {
+ cancelHits.push({ url });
+ return new Response(JSON.stringify({ ok: true }), {
+ status: 200,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ return new Response("not found", { status: 404 });
+ }) as typeof fetch;
+
+ try {
+ const app = buildStudioApp({
+ baseUrl: "http://mock-cloud-api",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ binPath: fakeBin,
+ });
+ const trainRes = await app.request("/api/train", {
+ method: "POST",
+ headers: {
+ host: "127.0.0.1:4000",
+ "x-arkor-studio-token": STUDIO_TOKEN,
+ "content-type": "application/json",
+ },
+ body: JSON.stringify({}),
+ });
+ expect(trainRes.status).toBe(200);
+ const reader = trainRes.body!.getReader();
+ const decoder = new TextDecoder();
+ let buf = "";
+ while (!buf.includes(`Started job ${FAKE_JOB_ID}`)) {
+ const { value, done } = await reader.read();
+ if (done) break;
+ buf += decoder.decode(value, { stream: true });
+ }
+ // The hostile mid-training mutation: nuke the state file
+ // that the OLD code would have re-read at cancel time.
+ rmSync(join(trainCwd, ".arkor"), { recursive: true, force: true });
+ // Cancel: under the bug, the handler's state read returns
+ // null and the cancel POST is silently skipped. With the
+ // fix, the registry-pinned scope is used and the POST goes
+ // out anyway.
+ await reader.cancel();
+ await new Promise((r) => setTimeout(r, 200));
+
+ expect(cancelHits).toHaveLength(1);
+ expect(cancelHits[0]?.url).toContain(`/v1/jobs/${FAKE_JOB_ID}/cancel`);
+ expect(cancelHits[0]?.url).toContain("orgSlug=scope-pin-org");
+ expect(cancelHits[0]?.url).toContain("projectSlug=scope-pin-project");
+ } finally {
+ globalThis.fetch = ORIG_FETCH;
+ }
+ });
+
+ it("/api/train cancel falls back to reading .arkor/state.json when no scope was captured at spawn time (first-run anon)", async () => {
+ // Regression: capturing the cloud scope at spawn time covered
+ // the "user deleted state mid-training" hazard but broke the
+ // common first-run anonymous flow. On a fresh project,
+ // `.arkor/state.json` is created by `ensureProjectState` from
+ // INSIDE the child during `trainer.start()`, i.e. AFTER spawn.
+ // The spawn-time `readState(trainCwd)` therefore returns null,
+ // `pinnedScope` stays null, and the previous code silently
+ // skipped the cancel POST: local SIGKILL tore down the
+ // subprocess but the cloud job orphaned. The fix uses the
+ // pinned spawn-time scope WHEN PRESENT (delete-mid-training
+ // hazard) and falls back to reading at cancel time when it
+ // was null (first-run anon).
+ await writeCredentials(ANON_CREDS);
+ // Deliberately DO NOT seed state at spawn time. The bin
+ // writes it after spawn but BEFORE its `Started job <id>`
+ // line lands, simulating the order
+ // `ensureProjectState`/`trainer.start()` produce in a real
+ // anon first-run.
+ const FAKE_JOB_ID = "j-late-scope";
+ const stateDir = join(trainCwd, ".arkor");
+ const statePath = join(stateDir, "state.json");
+ const fakeBin = join(trainCwd, "late-scope-bin.mjs");
+ writeFileSync(
+ fakeBin,
+ `import { mkdirSync, writeFileSync } from "node:fs";
+ const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? "";
+ // Mirror runner order: state appears AFTER spawn, but BEFORE
+ // the Started job line so the cancel-time read sees it.
+ mkdirSync(${JSON.stringify(stateDir)}, { recursive: true });
+ writeFileSync(${JSON.stringify(statePath)}, JSON.stringify({
+ orgSlug: "late-scope-org",
+ projectSlug: "late-scope-project",
+ projectId: "p-late-scope",
+ }));
+ process.stdout.write(\`[arkor:\${nonce}] Started job ${FAKE_JOB_ID}\\n\`);
+ process.on("SIGTERM", () => {});
+ setInterval(() => {}, 60_000);
+ `,
+ );
+ let cancelHits: Array<{ url: string }> = [];
+ const ORIG_FETCH = globalThis.fetch;
+ globalThis.fetch = (async (
+ input: Parameters<typeof fetch>[0],
+ init?: Parameters<typeof fetch>[1],
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && /\/v1\/jobs\/[^/]+\/cancel/.test(url)) {
+ cancelHits.push({ url });
+ return new Response(JSON.stringify({ ok: true }), {
+ status: 200,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ return new Response("not found", { status: 404 });
+ }) as typeof fetch;
+
+ try {
+ const app = buildStudioApp({
+ baseUrl: "http://mock-cloud-api",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ binPath: fakeBin,
+ });
+ const trainRes = await app.request("/api/train", {
+ method: "POST",
+ headers: {
+ host: "127.0.0.1:4000",
+ "x-arkor-studio-token": STUDIO_TOKEN,
+ "content-type": "application/json",
+ },
+ body: JSON.stringify({}),
+ });
+ expect(trainRes.status).toBe(200);
+ const reader = trainRes.body!.getReader();
+ const decoder = new TextDecoder();
+ let buf = "";
+ while (!buf.includes(`Started job ${FAKE_JOB_ID}`)) {
+ const { value, done } = await reader.read();
+ if (done) break;
+ buf += decoder.decode(value, { stream: true });
+ }
+ await reader.cancel();
+ await new Promise((r) => setTimeout(r, 250));
+
+ // Under the bug there were 0 cancel hits (pinned scope null
+ // → skip). With the fix the cancel-time read recovers the
+ // scope the child just wrote.
+ expect(cancelHits).toHaveLength(1);
+ expect(cancelHits[0]?.url).toContain(`/v1/jobs/${FAKE_JOB_ID}/cancel`);
+ expect(cancelHits[0]?.url).toContain("orgSlug=late-scope-org");
+ expect(cancelHits[0]?.url).toContain("projectSlug=late-scope-project");
+ } finally {
+ globalThis.fetch = ORIG_FETCH;
+ }
+ });
+
+ it("/api/train job-id parser ignores stdout lines that lack the per-spawn nonce prefix so user code can't forge a `Started job` marker", async () => {
+ // Regression: the parser used to match any `Started job <id>`
+ // line in stdout. User code (which runs inside the runner's
+ // `await import(userEntry)` chain and therefore shares the
+ // child's stdout) could write `console.log("Started job
+ // attacker-chosen-id")` before the runner's canonical line
+ // arrives, the parser would record the attacker's id, and
+ // Stop-training would POST `/v1/jobs/<id>/cancel`
+ // against a job the attacker picked. The fix injects a
+ // per-spawn 32-hex nonce via ARKOR_JOB_ID_MARKER_NONCE that
+ // the server's regex anchors on; runner.ts deletes the env
+ // var before dynamically importing the user module, so user
+ // code can't read the nonce via `process.env` either.
+ await writeCredentials(ANON_CREDS);
+ await writeState(
+ {
+ orgSlug: "nonce-org",
+ projectSlug: "nonce-project",
+ projectId: "p-nonce",
+ },
+ trainCwd,
+ );
+ const REAL_JOB_ID = "real-nonce-job";
+ const SPOOF_JOB_ID = "attacker-chosen-id";
+ const fakeBin = join(trainCwd, "spoof-bin.mjs");
+ // Bin first emits an UNPREFIXED spoof on stdout (mimicking
+ // hostile user code), THEN the real nonce-prefixed canonical
+ // line. With the fix the spoof is rejected; the real line
+ // wins and the cancel POST targets the real id.
+ writeFileSync(
+ fakeBin,
+ `const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? "";
+ process.stdout.write("Started job ${SPOOF_JOB_ID}\\n");
+ setTimeout(() => {
+ process.stdout.write(\`[arkor:\${nonce}] Started job ${REAL_JOB_ID}\\n\`);
+ }, 30);
+ process.on("SIGTERM", () => {});
+ setInterval(() => {}, 60_000);
+ `,
+ );
+ let cancelHits: Array<{ url: string }> = [];
+ const ORIG_FETCH = globalThis.fetch;
+ globalThis.fetch = (async (
+ input: Parameters<typeof fetch>[0],
+ init?: Parameters<typeof fetch>[1],
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && /\/v1\/jobs\/[^/]+\/cancel/.test(url)) {
+ cancelHits.push({ url });
+ return new Response(JSON.stringify({ ok: true }), {
+ status: 200,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ return new Response("not found", { status: 404 });
+ }) as typeof fetch;
+
+ try {
+ const app = buildStudioApp({
+ baseUrl: "http://mock-cloud-api",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ binPath: fakeBin,
+ });
+ const trainRes = await app.request("/api/train", {
+ method: "POST",
+ headers: {
+ host: "127.0.0.1:4000",
+ "x-arkor-studio-token": STUDIO_TOKEN,
+ "content-type": "application/json",
+ },
+ body: JSON.stringify({}),
+ });
+ expect(trainRes.status).toBe(200);
+ const reader = trainRes.body!.getReader();
+ const decoder = new TextDecoder();
+ let buf = "";
+ // Wait for the REAL line (with nonce prefix) to be visible
+ // in the body. Both lines forward to the SPA log
+ // regardless of which (if any) the parser captures, so the
+ // body is a reliable readiness signal.
+ while (!buf.includes(`Started job ${REAL_JOB_ID}`)) {
+ const { value, done } = await reader.read();
+ if (done) break;
+ buf += decoder.decode(value, { stream: true });
+ }
+ await reader.cancel();
+ await new Promise((r) => setTimeout(r, 200));
+
+ // Cancel POST landed against the REAL id: the spoof was
+ // rejected by the anchored nonce-prefixed regex.
+ expect(cancelHits).toHaveLength(1);
+ expect(cancelHits[0]?.url).toContain(`/v1/jobs/${REAL_JOB_ID}/cancel`);
+ expect(cancelHits[0]?.url).not.toContain(SPOOF_JOB_ID);
+ } finally {
+ globalThis.fetch = ORIG_FETCH;
+ }
+ });
+
+ it("/api/train cancel sends SIGKILL so user-initiated stop bypasses the runner's graceful early-stop", async () => {
+ // Regression: a default `child.kill()` sends SIGTERM, which
+ // the runner's `installShutdownHandlers` now interprets as a
+ // graceful early-stop request (wait for the next checkpoint,
+ // up to ~5 min). For HMR-driven cancels that's correct, but
+ // for a Stop-training click the user wants the run STOPPED
+ // immediately. Leaving it running in the background for
+ // minutes consuming GPU spend silently is a regression
+ // introduced by this PR's graceful-shutdown work. We assert
+ // SIGKILL by giving the bin a SIGTERM no-op handler: SIGTERM
+ // would be swallowed and the bin would stay alive; SIGKILL
+ // is uncatchable and reaps the process unconditionally.
+ // Probe liveness with `process.kill(pid, 0)` (ESRCH ⇒ gone).
+ await writeCredentials(ANON_CREDS);
+ const hangingBin = join(trainCwd, "hanging-bin.mjs");
+ writeFileSync(
+ hangingBin,
+ // SIGTERM swallowed; setInterval keeps the event loop
+ // alive forever absent SIGKILL.
+ `process.on("SIGTERM", () => {});
+ setInterval(() => {}, 60_000);
+ `,
+ );
+ const app = buildStudioApp({
+ baseUrl: "http://mock",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ binPath: hangingBin,
+ });
+ const res = await app.request("/api/train", {
+ method: "POST",
+ headers: {
+ host: "127.0.0.1:4000",
+ "x-arkor-studio-token": STUDIO_TOKEN,
+ "content-type": "application/json",
+ },
+ body: JSON.stringify({}),
+ });
+ expect(res.status).toBe(200);
+ const pid = Number(res.headers.get("x-arkor-train-pid"));
+ expect(Number.isFinite(pid)).toBe(true);
+
+ // Trigger the cancel() handler.
+ await res.body!.cancel();
+
+ // Give the OS a moment to deliver SIGKILL and reap.
+ await new Promise((r) => setTimeout(r, 300));
+
+ // `process.kill(pid, 0)` is the standard "is this pid alive?"
+ // probe: sends signal 0 (no-op) but the syscall still
+ // surfaces ESRCH for non-existent pids. SIGKILL → reaped →
+ // ESRCH. SIGTERM (with the bin's no-op handler) → still
+ // alive → no throw → test fails.
+ let probeError: NodeJS.ErrnoException | null = null;
+ try {
+ process.kill(pid, 0);
+ } catch (e) {
+ probeError = e as NodeJS.ErrnoException;
+ }
+ expect(probeError).not.toBeNull();
+ expect(probeError?.code).toBe("ESRCH");
+ });
+
+ it("/api/train cancel handler doesn't crash when child.kill() throws", async () => {
+ // Regression: `ReadableStream.cancel()` called `child.kill()`
+ // without a try/catch. If the child had already exited (ESRCH
+ // race against the cancel), the throw bubbled up as an
+ // unhandled exception and crashed the request handler.
+ await writeCredentials(ANON_CREDS);
+ const fakeBin = join(trainCwd, "fake-bin.mjs");
+ // Bin exits immediately so the child is already dead by the
+ // time our cancel handler tries to signal it.
+ writeFileSync(fakeBin, `process.exit(0);\n`);
+ const app = buildStudioApp({
+ baseUrl: "http://mock",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ binPath: fakeBin,
+ });
+ const res = await app.request("/api/train", {
+ method: "POST",
+ headers: {
+ host: "127.0.0.1:4000",
+ "x-arkor-studio-token": STUDIO_TOKEN,
+ "content-type": "application/json",
+ },
+ body: JSON.stringify({}),
+ });
+ expect(res.status).toBe(200);
+ // Race: read enough of the body to see the close, then cancel.
+ // The cancel hook must not throw even when the underlying
+ // child is already gone.
+ const reader = res.body!.getReader();
+ // Wait for `exit=` so we know the child died first.
+ let buf = "";
+ const decoder = new TextDecoder();
+ while (!buf.includes("exit=")) {
+ const { value, done } = await reader.read();
+ if (done) break;
+ buf += decoder.decode(value, { stream: true });
+ }
+ await expect(reader.cancel()).resolves.toBeUndefined();
+ });
+
+ it("/api/train survives cancellation while the child is still streaming output", async () => {
+ // Regression: the previous implementation registered raw
+ // `controller.enqueue(...)` listeners on `child.stdout` /
+ // `child.stderr` and an unguarded `controller.close()` in
+ // `child.on("close")`. After the client cancelled the
+ // ReadableStream, those handlers kept firing, and calling
+ // `enqueue` / `close` on a closed controller throws "Invalid
+ // state". The throw escaped the request pipeline as an
+ // unhandled exception. The fix flips a `closed` flag in
+ // `cancelTeardown` and try/catches the post-cancel enqueue
+ // paths defensively. NOTE: cancel intentionally does NOT
+ // detach the `data` listeners; leaving them attached keeps
+ // the OS pipe draining while the child checkpoints / exits
+ // gracefully (otherwise a full pipe back-pressures and
+ // deadlocks the very graceful exit we're preserving).
+ // `onClose` / `onError` detach all listeners when the child
+ // finally exits. See `cancelTeardown` in `studio/server.ts`
+ // for the full backpressure rationale.
+ await writeCredentials(ANON_CREDS);
+ const fakeBin = join(trainCwd, "fake-bin.mjs");
+ // Bin spits a chunk every ~5 ms forever. We cancel while it's
+ // mid-stream so the child is *still alive* (and its `data`
+ // listeners still attached) when the stream is cancelled: the
+ // previous bug only surfaced in this window.
+ writeFileSync(
+ fakeBin,
+ `setInterval(() => process.stdout.write("tick\\n"), 5);\nsetInterval(() => {}, 60_000);\n`,
+ );
+ const app = buildStudioApp({
+ baseUrl: "http://mock",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ binPath: fakeBin,
+ });
+ const res = await app.request("/api/train", {
+ method: "POST",
+ headers: {
+ host: "127.0.0.1:4000",
+ "x-arkor-studio-token": STUDIO_TOKEN,
+ "content-type": "application/json",
+ },
+ body: JSON.stringify({}),
+ });
+ expect(res.status).toBe(200);
+ const reader = res.body!.getReader();
+ // Read at least one chunk so the child is definitely streaming
+ // before we cancel: that's the race window the previous code
+ // crashed in.
+ const decoder = new TextDecoder();
+ let received = "";
+ while (!received.includes("tick")) {
+ const { value, done } = await reader.read();
+ if (done) break;
+ received += decoder.decode(value, { stream: true });
+ }
+ // Listen for unhandled rejections / uncaught exceptions during
+ // and shortly after the cancel: before the fix, the child's
+ // next `data` chunk would synchronously throw inside the
+ // enqueue callback.
+ const errors: unknown[] = [];
+ const onUnhandled = (err: unknown) => errors.push(err);
+ process.on("uncaughtException", onUnhandled);
+ process.on("unhandledRejection", onUnhandled);
+ try {
+ await reader.cancel();
+ // Give the child's interval a few iterations to attempt
+ // post-cancel writes. The handler must short-circuit on the
+ // `closed` flag and not crash the worker.
+ await new Promise((r) => setTimeout(r, 50));
+ } finally {
+ process.off("uncaughtException", onUnhandled);
+ process.off("unhandledRejection", onUnhandled);
+ }
+ expect(errors).toEqual([]);
+ });
});
describe("auto-anonymous bootstrap", () => {
@@ -557,7 +1363,7 @@ process.exit(0);
});
it("acquires + persists an anonymous token on the first /api/credentials hit when autoAnonymous=true", async () => {
- // No credentials on disk — buildStudioApp's autoAnonymous default
+ // No credentials on disk: buildStudioApp's autoAnonymous default
// (true) lets the server bootstrap on first hit so a fresh `arkor
// dev` works even when the up-front bootstrap in dev.ts skipped due
// to a transient network blip.
@@ -599,7 +1405,7 @@ process.exit(0);
expect(body).toMatchObject({ token: "lazy-anon", mode: "anon" });
expect(calls).toBe(1);
- // Subsequent calls use the persisted credentials — no re-bootstrap.
+ // Subsequent calls use the persisted credentials (no re-bootstrap).
const res2 = await app.request("/api/credentials", {
headers: {
host: "127.0.0.1:4000",
@@ -674,7 +1480,7 @@ process.exit(0);
// The cloud-api-client wrapper around `onDeprecation` synchronously
// checks `typeof result.then` on the callback's return value; a plain
// `void` return throws and gets swallowed with a stderr log. The
- // wrapper in `createRpc` returns null to short-circuit that check —
+ // wrapper in `createRpc` returns null to short-circuit that check;
// assert that no such log fires here.
const errorSpy = vi
.spyOn(console, "error")
@@ -1173,6 +1979,179 @@ process.exit(0);
const body = (await res.json()) as { trainer: unknown };
expect(body.trainer).toBeNull();
});
+
+ it("skips runBuild() when HMR is enabled and the watcher's artefact already exists", async () => {
+ // Regression: previously every `/api/manifest` poll triggered a
+ // fresh `runBuild()` even with HMR active, so the SPA's
+ // ~5 s polling + per-rebuild SSE refetch would re-bundle on
+ // every poll AND race the watcher writing to the same
+ // `.arkor/build/index.mjs`. The fast path inspects the
+ // pre-existing artefact directly when HMR's coordinator is
+ // wired in. We assert by pre-writing a hand-rolled artefact
+ // bundle and verifying `/api/manifest` returns its trainer
+ // *without* the source file existing: `runBuild()` would
+ // throw on the missing entry, so a 200 here proves we never
+ // called it.
+ await writeCredentials(ANON_CREDS);
+ // Write the artefact that the HMR watcher would have produced.
+ // Mirrors the seed fixture's shape: `_kind: "arkor"` + trainer
+ // with the four required methods.
+ mkdirSync(join(trainCwd, ".arkor/build"), { recursive: true });
+ writeFileSync(
+ join(trainCwd, ".arkor/build/index.mjs"),
+ `const trainer = {
+ name: "hmr-fast-path",
+ start: async () => ({ jobId: "j" }),
+ wait: async () => ({ job: {}, artifacts: [] }),
+ cancel: async () => {},
+ };
+ export const arkor = { _kind: "arkor", trainer };
+ export default arkor;
+ `,
+ );
+ // Notice: NO `src/arkor/index.ts`. `runBuild()` would fail with
+ // "Build entry not found"; the test fails if the fast path
+ // regresses and falls through to it.
+ const fakeHmr = {
+ subscribe: () => () => undefined,
+ getCurrentConfigHash: () => null,
+ getCurrentArtifactHash: () => null,
+ getCurrentArtifactContentHash: () => null,
+ getLastEventType: () => null,
+ async dispose() {},
+ };
+ const app = buildStudioApp({
+ baseUrl: "http://mock",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ hmr: fakeHmr,
+ });
+ const res = await app.request("/api/manifest", {
+ headers: {
+ host: "127.0.0.1:4000",
+ "x-arkor-studio-token": STUDIO_TOKEN,
+ },
+ });
+ expect(res.status).toBe(200);
+ const body = (await res.json()) as {
+ trainer: { name: string } | null;
+ };
+ expect(body.trainer).toEqual({ name: "hmr-fast-path" });
+ });
+
+ it("falls back to runBuild() when HMR is enabled but the watcher hasn't produced an artefact yet", async () => {
+ // Companion to the fast-path test: on a fresh scaffold the
+ // watcher's first BUNDLE_END may not have completed by the
+ // time the SPA's first /api/manifest poll lands. Without the
+ // existsSync gate we'd `await import(missing)` and 400
+ // forever (the watcher's later writes don't retroactively
+ // make this poll succeed); with the gate we bootstrap via
+ // `runBuild()` for that single call.
+ await writeCredentials(ANON_CREDS);
+ mkdirSync(join(trainCwd, "src/arkor"), { recursive: true });
+ writeFileSync(
+ join(trainCwd, "src/arkor/index.ts"),
+ `export const arkor = Object.freeze({
+ _kind: "arkor",
+ trainer: {
+ name: "fallback-build",
+ start: async () => ({ jobId: "j" }),
+ wait: async () => ({ job: {}, artifacts: [] }),
+ cancel: async () => {},
+ },
+ });`,
+ );
+ // No pre-existing `.arkor/build/index.mjs`: the artefact
+ // doesn't exist. `existsSync` is false → `runBuild()` runs.
+ const fakeHmr = {
+ subscribe: () => () => undefined,
+ getCurrentConfigHash: () => null,
+ getCurrentArtifactHash: () => null,
+ getCurrentArtifactContentHash: () => null,
+ getLastEventType: () => null,
+ async dispose() {},
+ };
+ const app = buildStudioApp({
+ baseUrl: "http://mock",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ hmr: fakeHmr,
+ });
+ const res = await app.request("/api/manifest", {
+ headers: {
+ host: "127.0.0.1:4000",
+ "x-arkor-studio-token": STUDIO_TOKEN,
+ },
+ });
+ expect(res.status).toBe(200);
+ const body = (await res.json()) as {
+ trainer: { name: string } | null;
+ };
+ expect(body.trainer).toEqual({ name: "fallback-build" });
+ });
+
+ it("returns 400 (not stale 200) while the HMR watcher is in error state", async () => {
+ // Regression: the HMR fast path served the last-built artefact
+ // even when the watcher's most recent event was `error`. The
+ // SPA's `/api/manifest` poll runs every ~5s, so a successful
+ // 200 with stale data would silently overwrite the SSE-driven
+ // build-error UI within 5s of the user breaking their source:
+ // they'd then unknowingly run stale code/config while the
+ // latest edit is still failing to compile. Gating the fast
+ // path on `getLastEventType() === "error"` keeps both
+ // channels (poll + SSE) consistent.
+ await writeCredentials(ANON_CREDS);
+ mkdirSync(join(trainCwd, ".arkor/build"), { recursive: true });
+ // Pre-write a previously-good artefact so the fast path
+ // *would* otherwise return 200 with it.
+ writeFileSync(
+ join(trainCwd, ".arkor/build/index.mjs"),
+ `const trainer = {
+ name: "stale-good-build",
+ start: async () => ({ jobId: "j" }),
+ wait: async () => ({ job: {}, artifacts: [] }),
+ cancel: async () => {},
+ };
+ export const arkor = { _kind: "arkor", trainer };
+ export default arkor;
+ `,
+ );
+ // Coordinator is currently in error state: the latest
+ // broadcast was a compile failure.
+ const fakeHmr = {
+ subscribe: () => () => undefined,
+ getCurrentConfigHash: () => null,
+ getCurrentArtifactHash: () => null,
+ getCurrentArtifactContentHash: () => null,
+ getLastEventType: () => "error" as const,
+ async dispose() {},
+ };
+ const app = buildStudioApp({
+ baseUrl: "http://mock",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ hmr: fakeHmr,
+ });
+ const res = await app.request("/api/manifest", {
+ headers: {
+ host: "127.0.0.1:4000",
+ "x-arkor-studio-token": STUDIO_TOKEN,
+ },
+ });
+ // 400: the SPA's existing 4xx-handling path renders the
+ // build-error hint instead of a fake-healthy manifest.
+ expect(res.status).toBe(400);
+ const body = (await res.json()) as { error?: string };
+ expect(body.error).toMatch(/Build failed/);
+ // Sanity: the stale artefact name is NOT leaked through.
+ expect(JSON.stringify(body)).not.toContain("stale-good-build");
+ });
});
describe("/api/inference/chat", () => {
@@ -1184,7 +2163,7 @@ process.exit(0);
it("auto-bootstraps project state and proxies base-model inference", async () => {
await writeCredentials(ANON_CREDS);
- // No state.json — server should derive a slug from cwd, create the
+ // No state.json: server should derive a slug from cwd, create the
// project on cloud-api, persist state, and forward the inference call.
const calls: Array<{
@@ -1313,7 +2292,7 @@ process.exit(0);
});
expect(res.status).toBe(200);
- // Only the inference call should have hit the network — no project
+ // Only the inference call should have hit the network: no project
// create/list when state is already present.
expect(calls.filter((c) => c.url.includes("/v1/projects"))).toHaveLength(0);
const chat = calls.find((c) => c.url.includes("/v1/inference/chat"));
@@ -1374,7 +2353,7 @@ process.exit(0);
it("propagates the cloud-api status when project bootstrap fails", async () => {
await writeCredentials(ANON_CREDS);
- // No state.json — bootstrap will hit cloud-api, which returns 503.
+ // No state.json: bootstrap will hit cloud-api, which returns 503.
// We expect that 503 to be passed through, not collapsed to 400.
globalThis.fetch = (async (
@@ -1435,13 +2414,436 @@ process.exit(0);
});
});
- // -------------------------------------------------------------------------
- // Deployments (`/api/deployments/*`) — minimal coverage of the router
- // boundary. Cloud-side semantics already have heavy test coverage in
- // `core/client.deployments.test.ts`; here we verify only that the Studio
- // server forwards correctly, returns the empty wrapper when no project
- // state exists, and surfaces upstream errors verbatim.
- // -------------------------------------------------------------------------
+ describe("/api/dev/events (HMR)", () => {
+ function fakeHmr(initialConfigHash: string | null = null) {
+ // Mirror the real HmrCoordinator surface but stay synchronous so
+ // the test doesn't depend on rolldown.watch starting up. `emit`
+ // is a test hook for pushing events into the SSE stream from the
+ // test body; `currentConfigHash` is a settable mock for what
+ // `/api/train` reads via `getCurrentConfigHash` to capture the
+ // spawned-config snapshot.
+ const subs = new Set<(e: HmrEvent) => void>();
+ let currentConfigHash: string | null = initialConfigHash;
+ // Match the real coordinator's behaviour: a stable artefact
+ // fingerprint at spawn time. Tests that exercise the
+ // pre-ready-spawn path (configHash null, then a real hash)
+ // can override via `setArtifactHash`.
+ let currentArtifactHash: string | null = "fake-artefact-hash";
+ let currentArtifactContentHash: string | null =
+ "fake-artefact-content-hash";
+ let lastEventType: HmrEvent["type"] | null = null;
+ const coordinator: HmrCoordinator = {
+ subscribe(fn) {
+ subs.add(fn);
+ return () => {
+ subs.delete(fn);
+ };
+ },
+ getCurrentConfigHash() {
+ return currentConfigHash;
+ },
+ getCurrentArtifactHash() {
+ return currentArtifactHash;
+ },
+ getCurrentArtifactContentHash() {
+ return currentArtifactContentHash;
+ },
+ getLastEventType() {
+ return lastEventType;
+ },
+ async dispose() {
+ subs.clear();
+ },
+ };
+ return {
+ coordinator,
+ emit(event: HmrEvent) {
+ // Track the latest event type so `getLastEventType()`
+ // mirrors the real coordinator's `lastEvent?.type`;
+ // the `/api/manifest` HMR-error gate consults this.
+ lastEventType = event.type;
+ for (const fn of subs) fn(event);
+ },
+ setConfigHash(hash: string | null) {
+ currentConfigHash = hash;
+ },
+ setArtifactHash(hash: string | null) {
+ currentArtifactHash = hash;
+ },
+ setArtifactContentHash(hash: string | null) {
+ currentArtifactContentHash = hash;
+ },
+ setLastEventType(t: HmrEvent["type"] | null) {
+ lastEventType = t;
+ },
+ get subscriberCount() {
+ return subs.size;
+ },
+ };
+ }
+
+ it("is unregistered when no hmr coordinator is supplied", async () => {
+ const app = build();
+ const res = await app.request("/api/dev/events", {
+ headers: {
+ host: "127.0.0.1:4000",
+ "x-arkor-studio-token": STUDIO_TOKEN,
+ },
+ });
+ expect(res.status).toBe(404);
+ });
+
+ it("rejects /api/dev/events without a token", async () => {
+ const fake = fakeHmr();
+ const app = buildStudioApp({
+ baseUrl: "http://mock",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ hmr: fake.coordinator,
+ });
+ const res = await app.request("/api/dev/events", {
+ headers: { host: "127.0.0.1:4000" },
+ });
+ expect(res.status).toBe(403);
+ });
+
+ it("accepts the studio token via ?studioToken= for the dev event stream", async () => {
+ const fake = fakeHmr();
+ const app = buildStudioApp({
+ baseUrl: "http://mock",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ hmr: fake.coordinator,
+ });
+ // The server subscribes to the HMR coordinator exactly once at
+ // build time (so multiple SSE clients don't fan signal dispatch
+ // out to the same child N times). Per-client cleanup happens on
+ // the SSE listener set, not against the coordinator, so
+ // `fake.subscriberCount` stays at 1 across the connection
+ // lifecycle. We assert that here rather than expect the
+ // pre-refactor "0 after cancel" behaviour.
+ expect(fake.subscriberCount).toBe(1);
+ const res = await app.request(
+ `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`,
+ { headers: { host: "127.0.0.1:4000" } },
+ );
+ expect(res.status).toBe(200);
+ expect(res.headers.get("content-type")).toBe("text/event-stream");
+ const reader = res.body!.getReader();
+ await reader.cancel();
+ // Cancel doesn't unsubscribe the server-level listener; emitting
+ // an event after cancel must still be safe (the SSE listener that
+ // was registered for this connection is removed, so the
+ // controller-closed try/catch in `send` is never exercised).
+ expect(() =>
+ fake.emit({
+ type: "rebuild",
+ outFile: "/tmp/x",
+ hash: "h",
+ configHash: null,
+ trainerName: null,
+ }),
+ ).not.toThrow();
+ });
+
+ it("rejects /api/dev/events when host header is non-loopback", async () => {
+ const fake = fakeHmr();
+ const app = buildStudioApp({
+ baseUrl: "http://mock",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ hmr: fake.coordinator,
+ });
+ const res = await app.request(
+ `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`,
+ { headers: { host: "evil.example.com" } },
+ );
+ expect(res.status).toBe(403);
+ });
+
+ it("dispatches HMR signals exactly once per rebuild regardless of connected SSE client count", async () => {
+ // Regression: previously each `/api/dev/events` connection
+ // attached its own `hmr.subscribe(...)` callback, so a rebuild
+ // with N open Studio tabs fanned out into N × SIGUSR2 / N ×
+ // SIGTERM per child. The runner's shutdown handler interprets a
+ // *second* SIGTERM as the emergency `exit(143)` fast-path, which
+ // would defeat checkpoint preservation. The server now subscribes
+ // to the coordinator exactly once and broadcasts the augmented
+ // payload to every SSE client; we assert that subscriber count
+ // doesn't grow when extra connections are opened.
+ const fake = fakeHmr();
+ const app = buildStudioApp({
+ baseUrl: "http://mock",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ hmr: fake.coordinator,
+ });
+ expect(fake.subscriberCount).toBe(1);
+ const r1 = await app.request(
+ `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`,
+ { headers: { host: "127.0.0.1:4000" } },
+ );
+ const r2 = await app.request(
+ `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`,
+ { headers: { host: "127.0.0.1:4000" } },
+ );
+ // Pump the streams so their `start()` runs, registering the
+ // per-client SSE listeners on the server side.
+ const reader1 = r1.body!.getReader();
+ const reader2 = r2.body!.getReader();
+ // Even with two concurrent SSE clients the HMR coordinator still
+ // sees exactly the one server-level subscriber.
+ expect(fake.subscriberCount).toBe(1);
+ await reader1.cancel();
+ await reader2.cancel();
+ expect(fake.subscriberCount).toBe(1);
+ });
+
+ it("/api/train cancel still fires cloud cancel POST + SIGKILL even when HMR has already requested early-stop", async () => {
+ // Regression: the cancel handler used to short-circuit
+ // (`if (earlyStopInFlight) return;`) when HMR's
+ // `dispatchRebuild` had already SIGTERMed the child for a
+ // graceful checkpoint-wait early-stop. That gate was added
+ // to avoid a second SIGTERM piling on top of the first
+ // (which would have triggered the runner's `exit(143)`
+ // emergency path and broken cloud cancel POSTing). With
+ // SIGKILL replacing the user-stop SIGTERM, the
+ // double-signal worry no longer applies, and the gate
+ // turned a Stop click during HMR's graceful window into a
+ // total no-op, leaving the run alive until checkpoint /
+ // 5-min timeout. Manual stop now overrides HMR's graceful
+ // path: server POSTs cloud cancel + SIGKILLs the
+ // subprocess regardless of `isEarlyStopRequested`.
+ await writeCredentials(ANON_CREDS);
+ await writeState(
+ {
+ orgSlug: "manual-override-org",
+ projectSlug: "manual-override-project",
+ projectId: "p-manual",
+ },
+ trainCwd,
+ );
+ const FAKE_JOB_ID = "manual-stop-during-hmr";
+ const fakeBin = join(trainCwd, "manual-during-hmr-bin.mjs");
+ // SIGTERM no-op so HMR's graceful SIGTERM doesn't terminate
+ // the bin; we need it alive so the subsequent manual
+ // cancel actually has something to SIGKILL. Marker uses the
+ // server-injected nonce prefix so the parser accepts it.
+ writeFileSync(
+ fakeBin,
+ `const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? "";
+ process.stdout.write(\`[arkor:\${nonce}] Started job ${FAKE_JOB_ID}\\n\`);
+ process.on("SIGTERM", () => {});
+ setInterval(() => {}, 60_000);
+ `,
+ );
+ const cancelHits: Array<{ url: string }> = [];
+ const ORIG_FETCH = globalThis.fetch;
+ globalThis.fetch = (async (
+ input: Parameters<typeof fetch>[0],
+ init?: Parameters<typeof fetch>[1],
+ ) => {
+ const url = typeof input === "string" ? input : input.toString();
+ const method = init?.method ?? "GET";
+ if (method === "POST" && /\/v1\/jobs\/[^/]+\/cancel/.test(url)) {
+ cancelHits.push({ url });
+ return new Response(JSON.stringify({ ok: true }), {
+ status: 200,
+ headers: { "content-type": "application/json" },
+ });
+ }
+ return new Response("not found", { status: 404 });
+ }) as typeof fetch;
+
+ try {
+ const fake = fakeHmr("h1");
+ const app = buildStudioApp({
+ baseUrl: "http://mock-cloud-api",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ binPath: fakeBin,
+ hmr: fake.coordinator,
+ });
+ const trainRes = await app.request("/api/train", {
+ method: "POST",
+ headers: {
+ host: "127.0.0.1:4000",
+ "x-arkor-studio-token": STUDIO_TOKEN,
+ "content-type": "application/json",
+ },
+ body: JSON.stringify({}),
+ });
+ expect(trainRes.status).toBe(200);
+ const pid = Number(trainRes.headers.get("x-arkor-train-pid"));
+ // Drain until the parser has recorded the job id.
+ const reader = trainRes.body!.getReader();
+ const decoder = new TextDecoder();
+ let buf = "";
+ while (!buf.includes(`Started job ${FAKE_JOB_ID}`)) {
+ const { value, done } = await reader.read();
+ if (done) break;
+ buf += decoder.decode(value, { stream: true });
+ }
+ // Emit an HMR mismatch: server's dispatch SIGTERMs the
+ // bin and sets `earlyStopRequested = true` on the entry.
+ // The bin's SIGTERM no-op keeps it alive so the manual
+ // cancel below has a target.
+ fake.emit({
+ type: "ready",
+ outFile: "/tmp/x.mjs",
+ hash: "abc",
+ configHash: "h2", // mismatch with spawn-time "h1"
+ trainerName: "t",
+ });
+ // Let the dispatch run + signal land.
+ await new Promise((r) => setTimeout(r, 80));
+
+ // Manual cancel: old code would have early-returned; new
+ // code POSTs cloud cancel + SIGKILLs.
+ await reader.cancel();
+ await new Promise((r) => setTimeout(r, 250));
+
+ // Cloud cancel POST landed for the right job.
+ expect(cancelHits).toHaveLength(1);
+ expect(cancelHits[0]?.url).toContain(`/v1/jobs/${FAKE_JOB_ID}/cancel`);
+ // And the bin is dead: SIGKILL bypassed its SIGTERM
+ // no-op (which had been masking HMR's earlier SIGTERM).
+ let probeError: NodeJS.ErrnoException | null = null;
+ try {
+ process.kill(pid, 0);
+ } catch (e) {
+ probeError = e as NodeJS.ErrnoException;
+ }
+ expect(probeError?.code).toBe("ESRCH");
+ } finally {
+ globalThis.fetch = ORIG_FETCH;
+ }
+ });
+
+ it("dispatches HMR signals for `ready` events too (not only `rebuild`)", async () => {
+ // Regression: previously the dispatch fired only on
+ // `rebuild`, so a child started via `/api/train` *before*
+ // the watcher's first successful BUNDLE_END (the very first
+ // success is broadcast as `ready`, and the entry-wait recovery
+ // path also emits `ready`) would never get SIGUSR2/SIGTERM-
+ // routed when that build eventually landed, leaving it
+ // running a stale or empty artifact. Exercise the contract
+ // here by spawning a hanging child, then emitting `ready`
+ // with a different `configHash`; dispatch should pick up the
+ // mismatch and surface restart targets in the SSE frame.
+ await writeCredentials(ANON_CREDS);
+ const hangingBin = join(trainCwd, "hanging-bin.mjs");
+ // setInterval keeps the event loop alive without trapping
+ // SIGTERM, so dispatch's kill returns the child to the OS.
+ writeFileSync(hangingBin, "setInterval(() => {}, 1000);\n");
+
+ const fake = fakeHmr("h1");
+ const app = buildStudioApp({
+ baseUrl: "http://mock",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ binPath: hangingBin,
+ hmr: fake.coordinator,
+ });
+
+ const trainRes = await app.request("/api/train", {
+ method: "POST",
+ headers: {
+ host: "127.0.0.1:4000",
+ "x-arkor-studio-token": STUDIO_TOKEN,
+ "content-type": "application/json",
+ },
+ body: JSON.stringify({}),
+ });
+ expect(trainRes.status).toBe(200);
+ const pid = Number(trainRes.headers.get("x-arkor-train-pid"));
+
+ const sseRes = await app.request(
+ `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`,
+ { headers: { host: "127.0.0.1:4000" } },
+ );
+ const reader = sseRes.body!.getReader();
+ const decoder = new TextDecoder();
+
+ try {
+ // `configHash` = "h2" mismatches the spawn-time "h1" → SIGTERM
+ // path → `restartTargets` should be non-empty in the SSE frame.
+ fake.emit({
+ type: "ready",
+ outFile: "/tmp/x.mjs",
+ hash: "abc",
+ configHash: "h2",
+ trainerName: "t",
+ });
+
+ let received = "";
+ while (!received.includes("\n\n")) {
+ const { value, done } = await reader.read();
+ if (done) break;
+ received += decoder.decode(value, { stream: true });
+ }
+ expect(received).toContain("event: ready");
+ // The dispatch augmentation marker: would be absent if the
+ // `event.type !== "error"` filter regressed back to gating on
+ // `=== "rebuild"`, and `restart`/`restartTargets` would never
+ // appear on a `ready` frame.
+ expect(received).toContain('"restart":true');
+ expect(received).toContain(`"pid":${pid}`);
+ } finally {
+ await reader.cancel();
+ // Best-effort cleanup if dispatch's SIGTERM hasn't reaped
+ // the child yet (signal delivery is async in the kernel).
+ try {
+ process.kill(pid, "SIGKILL");
+ } catch {
+ // already gone
+ }
+ }
+ });
+
+ it("forwards rebuild events as SSE frames", async () => {
+ const fake = fakeHmr();
+ const app = buildStudioApp({
+ baseUrl: "http://mock",
+ assetsDir,
+ autoAnonymous: false,
+ studioToken: STUDIO_TOKEN,
+ cwd: trainCwd,
+ hmr: fake.coordinator,
+ });
+ const res = await app.request(
+ `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`,
+ { headers: { host: "127.0.0.1:4000" } },
+ );
+ const reader = res.body!.getReader();
+ const decoder = new TextDecoder();
+
+ fake.emit({ type: "ready", outFile: "/tmp/x", hash: "abc" });
+ // Read chunks until we have at least one full SSE frame.
+ let received = "";
+ while (!received.includes("\n\n")) {
+ const { value, done } = await reader.read();
+ if (done) break;
+ received += decoder.decode(value, { stream: true });
+ }
+ expect(received).toContain("event: ready");
+ expect(received).toContain('"outFile":"/tmp/x"');
+ await reader.cancel();
+ });
+ });
+
describe("/api/deployments", () => {
const ORIG_FETCH = globalThis.fetch;
diff --git a/packages/arkor/src/studio/server.ts b/packages/arkor/src/studio/server.ts
index 95deff08..e5db1d9f 100644
--- a/packages/arkor/src/studio/server.ts
+++ b/packages/arkor/src/studio/server.ts
@@ -1,6 +1,7 @@
-import { spawn } from "node:child_process";
+import { spawn, type ChildProcessByStdio } from "node:child_process";
+import type { Readable, Writable } from "node:stream";
import { readFile, realpath } from "node:fs/promises";
-import { timingSafeEqual } from "node:crypto";
+import { randomBytes, timingSafeEqual } from "node:crypto";
import { Hono } from "hono";
import { createClient } from "@arkor/cloud-api-client";
import { CloudApiClient, CloudApiError } from "../core/client";
@@ -22,7 +23,42 @@ import {
createDeploymentRequestSchema,
} from "../core/schemas";
import { readState } from "../core/state";
+import { resolveBuildEntry } from "../core/rolldownConfig";
import { readManifestSummary } from "./manifest";
+import type { HmrCoordinator, HmrEvent } from "./hmr";
+import { TrainRegistry, type RestartTarget } from "./trainRegistry";
+
+/** Identify the spawned subprocess to the SPA without exposing it as
+ * a body frame (which would interleave with trainer stdout). The SPA
+ * reads this off `Response.headers` and uses it to scope HMR
+ * `restart` events to the run *this* tab actually started. */
+const TRAIN_PID_HEADER = "x-arkor-train-pid";
+/**
+ * Build the strict full-line match for the runner's `[arkor:<nonce>] Started job <id>` line.
+ *
+ * `core/runner.ts` prefixes that text with the per-spawn nonce we
+ * inject via `ARKOR_JOB_ID_MARKER_NONCE`; without the prefix, a
+ * user `console.log("Started job <id>")` from inside
+ * `trainer.start()` / `onCheckpoint` / etc. could land in stdout
+ * *before* the runner's real line and we'd record the wrong id, so
+ * Stop-training would then POST `/v1/jobs/:attacker-id/cancel`
+ * against a job the attacker chose. Anchoring on a 32-hex nonce
+ * known only to the server + runner (the env var is deleted by
+ * runner.ts BEFORE the user module is dynamically imported, so the
+ * user can't read it) closes that hole.
+ *
+ * Pattern is per-spawn because the nonce is per-spawn.
+ *
+ * Anchors `^…$` and `(\S+)` job-id capture mirror the runner's
+ * exact write shape (cloud-api job ids never contain whitespace),
+ * so a chatty bin that wraps the line in other content cannot
+ * collide either.
+ */
+function buildStartedJobPattern(nonce: string): RegExp {
+ // Nonce is a 32-char hex string from `randomBytes(16).toString("hex")`,
+ // i.e. only `[0-9a-f]` (safe to interpolate into the regex literal).
+ return new RegExp(`^\\[arkor:${nonce}\\] Started job (\\S+)$`);
+}
const DEPRECATION_HEADERS = ["Deprecation", "Sunset", "Warning"] as const;
function copyDeprecationHeaders(from: Headers, to: Headers): void {
@@ -66,6 +102,15 @@ export interface StudioServerOptions {
* here points at the bin itself). Override in tests.
*/
binPath?: string;
+ /**
+ * Optional HMR coordinator. When provided, the server registers
+ * `/api/dev/events` as an SSE stream that pushes rebuild / error events to
+ * the SPA, and rebuilds also signal SIGTERM to active `/api/train`
+ * subprocesses so they early-stop at the next checkpoint and the SPA can
+ * restart them with the new bundle. Wired in by `arkor dev`; left
+ * undefined for any non-dev consumer of `buildStudioApp`.
+ */
+ hmr?: HmrCoordinator;
}
function tokensMatch(provided: string, expected: string): boolean {
@@ -89,11 +134,31 @@ function htmlAttrEscape(s: string): string {
);
}
-function injectStudioToken(html: string, token: string): string {
- const meta = ``;
+/**
+ * Inject the per-launch studio token (always) and an optional HMR
+ * feature flag into `<head>`. Both are read by the SPA via
+ * `<meta>` lookups: the token gates `/api/*` requests and
+ * the HMR flag tells `RunTraining` whether to open
+ * `/api/dev/events` (which only exists when `arkor dev` wired in an
+ * HMR coordinator). Without the server-side flag the SPA can't tell
+ * dev-mode usage from prod-mode usage at runtime: `vite build`'s
+ * output ships with `import.meta.env.DEV === false`, so any DEV gate
+ * baked into the bundle would suppress HMR even in real `arkor dev`
+ * sessions.
+ */
+function injectStudioMeta(
+ html: string,
+ token: string,
+ hmrEnabled: boolean,
+): string {
+ const tokenTag = ``;
+ const hmrTag = hmrEnabled
+ ? ``
+ : "";
+ const tags = `${tokenTag}${hmrTag}`;
const idx = html.indexOf("</head>");
- if (idx === -1) return `${meta}${html}`;
- return `${html.slice(0, idx)}${meta}${html.slice(idx)}`;
+ if (idx === -1) return `${tags}${html}`;
+ return `${html.slice(0, idx)}${tags}${html.slice(idx)}`;
}
export function buildStudioApp(options: StudioServerOptions) {
@@ -105,7 +170,7 @@ export function buildStudioApp(options: StudioServerOptions) {
// `studio/server.ts` is bundled into `dist/bin.mjs` (it isn't reachable
// from `src/index.ts`, so tsdown doesn't extract it as a shared chunk).
// The bin therefore sits *next* to this code at runtime, not one
- // directory up — `../bin.mjs` would resolve to the package root.
+ // directory up: `../bin.mjs` would resolve to the package root.
const trainBinPath =
options.binPath ?? fileURLToPath(new URL("./bin.mjs", import.meta.url));
@@ -118,7 +183,12 @@ export function buildStudioApp(options: StudioServerOptions) {
const app = new Hono();
const loopbackHostPattern = /^(127\.0\.0\.1|localhost)(:\d+)?$/;
- const jobEventsPathPattern = /^\/api\/jobs\/[^/]+\/events$/;
+ // Routes where `?studioToken=` is accepted instead of the
+ // `X-Arkor-Studio-Token` header. Used only for `EventSource` streams,
+ // which cannot send custom headers. Adding to this list is CSRF-sensitive:
+ // it must always be a GET stream-only route, never a mutation endpoint.
+ const eventStreamPathPattern =
+ /^\/api\/jobs\/[^/]+\/events$|^\/api\/dev\/events$/;
// Host-header guard for every route, including static HTML that carries the
// per-launch Studio token. This is the DNS-rebinding boundary: a victim
@@ -138,14 +208,14 @@ export function buildStudioApp(options: StudioServerOptions) {
// 1. Per-launch token. CORS is intentionally not configured: the SPA
// is same-origin so CORS adds no value, and reflecting `*` would let
// "simple" cross-origin POSTs (text/plain, urlencoded) skip preflight
- // and reach the handler. The token check rejects those — an attacker
+ // and reach the handler. The token check rejects those: an attacker
// page can't read the SPA's from another origin.
// 2. `?studioToken=` is accepted only on the job-event stream route
// because `EventSource` cannot send custom headers. Mutation routes
// require the header so a leaked token in a URL is not enough to POST.
app.use("/api/*", async (c, next) => {
const queryTokenAllowed =
- c.req.method === "GET" && jobEventsPathPattern.test(c.req.path);
+ c.req.method === "GET" && eventStreamPathPattern.test(c.req.path);
const provided =
c.req.header("x-arkor-studio-token") ??
(queryTokenAllowed ? c.req.query("studioToken") : undefined) ??
@@ -268,9 +338,39 @@ export function buildStudioApp(options: StudioServerOptions) {
return new Response(body, { status: res.status, headers });
});
+ // Pre-resolved outFile for the HMR fast path. The path is
+ // deterministic per cwd (defaults from `BUILD_DEFAULTS`), so we
+ // compute it once at app build time rather than on every request.
+ // Only used when HMR is enabled; `readManifestSummary` falls
+ // back to `runBuild()` when this is undefined or the file doesn't
+ // exist yet (fresh scaffold pre-watcher-bootstrap).
+ const hmrOutFile = options.hmr
+ ? resolveBuildEntry({ cwd: trainCwd }).outFile
+ : undefined;
app.get("/api/manifest", async (c) => {
try {
- const manifest = await readManifestSummary(trainCwd);
+ // Surface watcher build errors directly. Without this gate the
+ // HMR fast path below would happily serve the LAST GOOD
+ // artefact even when the user's current source fails to
+ // compile: `RunTraining` polls `/api/manifest` every ~5 s, so
+ // the next poll after a compile error would 200 with stale
+ // data and silently overwrite the SSE-surfaced error UI.
+ // Users would then see a "healthy" trainer in the manifest
+ // and unknowingly run stale code/config while the latest
+ // edit is still broken. Rejecting with the SSE error message
+ // keeps the SPA's error state consistent across both
+ // channels (poll + SSE).
+ if (options.hmr?.getLastEventType() === "error") {
+ return c.json({ error: "Build failed; see HMR error frame" }, 400);
+ }
+ // HMR-aware fast path: when `arkor dev` wired in a coordinator,
+ // skip the per-request `runBuild()` and read the watcher's
+ // already-built artefact. Without this every SPA poll
+ // (~5 s + per-rebuild SSE refetch) would re-bundle and race
+ // the watcher writing to the same `.arkor/build/index.mjs`.
+ const manifest = await readManifestSummary(trainCwd, {
+ prebuiltOutFile: hmrOutFile,
+ });
return c.json(manifest);
} catch (err) {
// The user's `src/arkor/index.ts` may not exist yet (fresh scaffold) or
@@ -335,11 +435,15 @@ export function buildStudioApp(options: StudioServerOptions) {
return new Response(upstream.body, { status: upstream.status, headers });
});
+ // Active `/api/train` subprocesses. The registry encapsulates the
+ // signal-dispatch policy (see `studio/trainRegistry.ts`).
+ const activeTrains = new TrainRegistry();
+
app.post("/api/train", async (c) => {
const body = (await c.req.json().catch(() => ({}))) as { file?: string };
let trainFile: string | undefined;
if (body.file) {
- // Resolve symlinks before the containment check — `path.resolve` is purely
+ // Resolve symlinks before the containment check: `path.resolve` is purely
// lexical, so a symlink under the project directory pointing at e.g.
// `/etc/passwd` would otherwise pass `startsWith(baseAbs + sep)`. The
// bin spawned below would then dlopen the link's target.
@@ -362,32 +466,621 @@ export function buildStudioApp(options: StudioServerOptions) {
}
trainFile = abs;
}
+ // Snapshot the current `configHash` so HMR routing on the *next*
+ // rebuild can compare against this child's spawn-time config.
+ //
+ // When HMR is enabled, read it synchronously from the coordinator
+ // (which already maintains `lastEvent.configHash` for its watcher).
+ // Reading from the cache avoids triggering an extra `runBuild()`
+ // per train request: the previous implementation called
+ // `readManifestSummary(trainCwd)` here, which both wasted CPU and
+ // raced the watcher writing the same `.arkor/build/index.mjs`.
+ //
+ // When HMR is disabled the field is irrelevant (no rebuilds will
+ // happen) so we leave it null without paying for a build.
+ const configHash: string | null = options.hmr
+ ? options.hmr.getCurrentConfigHash()
+ : null;
+ // Spawn-time CONTENT-hash of the on-disk build artefact. Only
+ // the pre-ready-spawn case in `dispatchRebuild` consults it:
+ // when a rebuild lands while the child's `configHash` is still
+ // null, backfilling the new hash is only safe if the artefact
+ // bytes the child loaded (= the bytes on disk *now*, at spawn)
+ // are the same bytes the new hash describes. Without this
+ // gate, an edit landing between spawn and the watcher's first
+ // BUNDLE_END would silently align the registry with a config
+ // the child never actually loaded → cloud-side `JobConfig`
+ // drift on subsequent same-hash hot-swaps.
+ //
+ // Content (sha256) rather than mtime+ctime+size: the
+ // timestamp version had a false-positive failure mode where a
+ // watcher rebuild that produced identical bytes still bumped
+ // mtime/ctime, forcing a spurious cancel+restart cycle on a
+ // pre-ready spawn even though the child's loaded bytes
+ // actually matched the new build. Content-hash is precise.
+ const spawnArtifactContentHash: string | null = options.hmr
+ ? options.hmr.getCurrentArtifactContentHash()
+ : null;
+ // Capture the cloud-api scope NOW (at spawn time) so the cancel
+ // handler can POST `/v1/jobs/:id/cancel` without re-reading
+ // `.arkor/state.json` at stop time. If the user removed or made
+ // the state file unreadable mid-training, the stop-time read
+ // would return null and the cancel POST would silently skip:
+ // local SIGKILL still tears down the subprocess but the cloud
+ // run orphans. Pinning the scope on the registry entry when it
+ // exists decouples cancel correctness from mutable filesystem state.
+ //
+ // `spawnScope` may legitimately be `null` on a first-run anonymous
+ // project: `.arkor/state.json` is created by `ensureProjectState`
+ // INSIDE the child during `trainer.start()`, i.e. AFTER spawn but
+ // possibly before the user clicks Stop. The cancel handler treats
+ // a null registry scope as a signal to fall back to reading
+ // `.arkor/state.json` at cancel time (the file should exist by
+ // then because the runner emits its `Started job ` line AFTER
+ // `trainer.start()` resolved, which is the same point at which
+ // `ensureProjectState` has finished writing the state file). The
+ // delete-mid-training hazard the spawn-time capture exists to
+ // close only applies when the SPAWN read succeeded; once we have
+ // a non-null capture we never re-read.
+ const spawnState = await readState(trainCwd);
+ const spawnScope = spawnState
+ ? { orgSlug: spawnState.orgSlug, projectSlug: spawnState.projectSlug }
+ : null;
const args = [trainBinPath, "start"];
if (trainFile) args.push(trainFile);
- const child = spawn(process.execPath, args, {
- stdio: "pipe",
- cwd: trainCwd,
+ // Per-spawn 16-byte nonce passed via env var so the runner can
+ // prefix its `Started job <id>` line with `[arkor:<nonce>] `. The
+ // server matches that nonce-prefixed shape (see
+ // `buildStartedJobPattern` for why). 32 hex chars (128 bits) of
+ // entropy make guessing the prefix from user code infeasible,
+ // and `core/runner.ts` deletes the env var BEFORE
+ // dynamically importing the user module so user code can't read
+ // it via `process.env` either.
+ const startedJobNonce = randomBytes(16).toString("hex");
+ const startedJobPattern = buildStartedJobPattern(startedJobNonce);
+ // `spawn()` is mostly async (filesystem failures surface as the
+ // child's `error` event), but Node can still throw synchronously
+ // for argument-shape problems (e.g. invalid stdio descriptor on
+ // unusual platforms). Catch both paths so an `/api/train` POST
+ // can never hang the SPA: sync throws return a clean 500, async
+ // 'error' events forward into the stream and close it (handled
+ // inside the ReadableStream `start()` below).
+ // `ChildProcessByStdio<Writable, Readable, Readable>` is the
+ // specific overload return for `stdio: "pipe"`; narrows
+ // `child.stdout` / `child.stderr` away from the nullable
+ // `Readable | null` of the general `ChildProcess` type.
+ // `ReturnType<typeof spawn>` would land on the union and force
+ // a `?.` everywhere downstream.
+ let child: ChildProcessByStdio<Writable, Readable, Readable>;
+ try {
+ child = spawn(process.execPath, args, {
+ stdio: "pipe",
+ cwd: trainCwd,
+ env: {
+ ...process.env,
+ ARKOR_JOB_ID_MARKER_NONCE: startedJobNonce,
+ },
+ });
+ } catch (err: unknown) {
+ const msg = err instanceof Error ? err.message : String(err);
+ return c.json(
+ { error: `Failed to spawn training subprocess: ${msg}` },
+ 500,
+ );
+ }
+ activeTrains.register(child, {
+ trainFile,
+ configHash,
+ spawnArtifactContentHash,
+ scope: spawnScope,
});
- const stream = new ReadableStream({
+ // Hoisted out of the `ReadableStream` underlying-source so the
+ // `start` handler can hand its closure-bound teardown helper to
+ // the `cancel` handler. `cancel` runs in a separate invocation,
+ // not through `controller`, so the two need a parent-scope
+ // rendez-vous variable.
+ let cancelTeardown: (() => void) | null = null;
+ // Mirror of the cloud `jobId` parsed out of the runner's
+ // stdout, accessible to both the `start` (parser writes) and
+ // `cancel` (post-unregister read) handlers. We can't just call
+ // `activeTrains.getJobId(pid)` from `cancel` because cancel
+ // unregisters the entry first, so subsequent reads of the
+ // registry would always be `null` even if the parser races a
+ // late line in afterwards. This closure variable keeps the id
+ // observable even after unregister, so the cancel POST poll
+ // below can pick up a jobId that lands a few ms after Stop.
+ let parsedJobId: string | null = null;
+ const stream = new ReadableStream({
start(controller) {
+ // After `cancel()` runs, calling `controller.enqueue` /
+ // `controller.close` on the now-closed controller throws
+ // ("Invalid state: Controller is closed"). The child
+ // subprocess keeps emitting `data` and ultimately a `close`
+ // event for some time after the client disconnects, so each
+ // forwarder needs its own "are we still attached?" guard.
+ // Track via a flag plus an explicit listener-removal so the
+ // event loop also stops dispatching once we've torn down.
+ let closed = false;
+ // `child.stdout` is in default (binary) mode, so each `data`
+ // chunk is a Buffer, and `Buffer extends Uint8Array`, so we
+ // can pass it straight to `controller.enqueue` without a
+ // round-trip through `TextEncoder`. The previous code did
+ // `enc.encode(d)` which implicitly coerced the buffer via
+ // `String()`: same byte content, but allocates a new array.
+ // Forward a chunk to the SPA stream. Shared between the
+ // stdout and stderr listeners; both paths surface as
+ // request body bytes for the SPA's log view.
+ const forward = (d: Buffer): void => {
+ if (closed) return;
+ try {
+ controller.enqueue(d);
+ } catch {
+ // Controller raced us into the closed state; flip the
+ // flag so subsequent chunks short-circuit.
+ closed = true;
+ }
+ };
+ // Carry-over buffer for line-oriented job-id extraction.
+ // Stream chunk boundaries are arbitrary: the runner's
+ // single-line `Started job ` write can land split
+ // across two `data` events, in which case a per-chunk
+ // regex would never match and the cancel POST chain
+ // would never fire (cloud-job orphan on Stop). We
+ // accumulate text until a newline, parse the complete
+ // line, and keep any trailing partial for the next
+ // chunk. Cleared the moment the id is recorded so a
+ // chatty bin doesn't pin memory after the marker has
+ // landed; capped at 4 KiB regardless to bound a
+ // misbehaving bin that never emits a newline before the
+ // marker (the canonical line is well under 100 bytes).
+ let stdoutLineBuf = "";
+ const STARTED_JOB_BUFFER_CAP = 4096;
+ // STDOUT-ONLY job-id parser. The runner writes the canonical
+ // `Started job ` line via `process.stdout.write` (never
+ // stderr), so a single shared buffer across both pipes
+ // would mis-match in two ways:
+ // 1. A user `console.error("Started job ")` would
+ // poison the buffer first; the real stdout marker
+ // arrives later but our `getJobId(...) === null` gate
+ // has already short-circuited subsequent scans, so
+ // Stop-training POSTs cancel for the wrong (or
+ // non-existent) job.
+ // 2. Interleaved stderr bytes could land between
+ // "Started job " and "\n" in the shared buffer,
+ // breaking the anchored line match → missed match →
+ // cloud cancel skipped on Stop.
+ // Two dedicated handlers share `forward` for the byte
+ // pipeline but only the stdout one runs the parse.
+ const onStdoutChunk = (d: Buffer): void => {
+ // Intentionally NOT gated on `closed`: when the SPA cancels,
+ // `cancelTeardown()` flips `closed = true` so the controller
+ // path no-ops, but the cancel IIFE then POLLS `parsedJobId`
+ // for up to 500 ms to catch a `Started job ` line that
+ // landed just after the user clicked Stop. The parser has to
+ // keep running during that window for the poll to ever
+ // observe a value. (`forward()` has its own `closed` check
+ // for the controller-enqueue side, so the SSE-body path
+ // stays sealed.) Gate the parse on `parsedJobId === null`
+ // (not `activeTrains.getJobId(...) === null`): the latter
+ // returns null forever after `unregister`, which would make
+ // us re-enter and re-parse the buffer on every subsequent
+ // chunk during the poll window.
+ if (parsedJobId === null) {
+ stdoutLineBuf += d.toString("utf8");
+ let nl = stdoutLineBuf.indexOf("\n");
+ while (nl !== -1) {
+ // Strip a possible \r so CRLF-emitting bins (rare for
+ // Node `process.stdout.write` but defensive) match
+ // the same anchored pattern.
+ const line = stdoutLineBuf.slice(0, nl).replace(/\r$/, "");
+ stdoutLineBuf = stdoutLineBuf.slice(nl + 1);
+ const m = startedJobPattern.exec(line);
+ if (m && m[1]) {
+ activeTrains.recordJobId(child.pid, m[1]);
+ // Mirror to the parent-scope closure so the cancel
+ // handler can pick this up even AFTER it called
+ // `activeTrains.unregister(...)` (the registry
+ // read would return null post-unregister).
+ parsedJobId = m[1];
+ stdoutLineBuf = "";
+ break;
+ }
+ nl = stdoutLineBuf.indexOf("\n");
+ }
+ if (stdoutLineBuf.length > STARTED_JOB_BUFFER_CAP) {
+ stdoutLineBuf = stdoutLineBuf.slice(-STARTED_JOB_BUFFER_CAP);
+ }
+ }
+ forward(d);
+ };
+ const onStderrChunk = (d: Buffer): void => {
+ // Forward only; never scan for `Started job`. See
+ // `onStdoutChunk` comment for the cross-stream poisoning
+ // hazards this split prevents.
+ forward(d);
+ };
const enc = new TextEncoder();
- child.stdout.on("data", (d) => controller.enqueue(enc.encode(d)));
- child.stderr.on("data", (d) => controller.enqueue(enc.encode(d)));
- child.on("close", (code) => {
- controller.enqueue(enc.encode(`\n---\nexit=${code}\n`));
- controller.close();
- });
+ // Detach every listener this stream wired onto `child`. Called
+ // from `onClose` / `onError` themselves, so once one fires the
+ // closure references (controller, TextEncoder) drop and the
+ // subprocess record can be GC'd promptly even if the other
+ // event also queues. `cancelTeardown` deliberately does NOT call
+ // it (see the comment there). Removing only the `data` listeners
+ // (as the previous code did) left `close` / `error` attached
+ // to the dead ChildProcess, which kept their closures pinned
+ // until the process object itself was reaped: meaningful
+ // memory pressure for an `arkor dev` session that spawns many
+ // children over hours.
+ const detachListeners = (): void => {
+ child.stdout.off("data", onStdoutChunk);
+ child.stderr.off("data", onStderrChunk);
+ child.off("close", onClose);
+ child.off("error", onError);
+ };
+ const onClose = (code: number | null): void => {
+ activeTrains.unregister(child.pid);
+ detachListeners();
+ if (closed) return;
+ closed = true;
+ try {
+ controller.enqueue(enc.encode(`\n---\nexit=${code}\n`));
+ controller.close();
+ } catch {
+ // already cancelled; nothing more to do.
+ }
+ };
+ // `error` event fires when async spawn machinery surfaces a
+ // failure (ENOENT for the executable, EACCES, EAGAIN under
+ // resource exhaustion, etc.). Without this listener the
+ // ReadableStream would never close; the SPA would hang
+ // waiting for output that never arrives. Forward the error
+ // text into the stream body, close, and unregister the
+ // child. Node's contract is: if 'error' fires, 'close' may
+ // or may not follow; both paths are guarded by the `closed`
+ // flag and the `unregister` call is idempotent.
+ const onError = (err: Error): void => {
+ activeTrains.unregister(child.pid);
+ detachListeners();
+ if (closed) return;
+ closed = true;
+ try {
+ controller.enqueue(
+ enc.encode(`\n---\nerror=${err.message}\n`),
+ );
+ controller.close();
+ } catch {
+ // already cancelled; nothing more to do.
+ }
+ };
+ child.stdout.on("data", onStdoutChunk);
+ child.stderr.on("data", onStderrChunk);
+ child.on("close", onClose);
+ child.on("error", onError);
+ cancelTeardown = () => {
+ // Don't detach data listeners here: the child stays alive
+ // for some time after the SPA cancels, either because
+ // we're skipping `child.kill()` for an in-progress
+ // HMR early-stop, or because `child.kill()`'s SIGTERM
+ // triggers a graceful checkpoint+exit that takes
+ // seconds. During that window the child keeps writing
+ // logs to its stdout/stderr pipes; if our `data`
+ // listeners are gone, Node stops draining the OS pipe,
+ // the buffer fills, and the child's next `write()`
+ // blocks indefinitely, deadlocking the very graceful
+ // exit we're trying to preserve. The `closed` flag
+ // already makes `enqueue`/`close` a no-op so the
+ // controller-closed race stays safe; the eventual
+ // `onClose` / `onError` listeners detach everything
+ // (via `detachListeners()`) when the child finally
+ // exits. That timing (at-exit, not at-cancel) is the
+ // correct moment to break the closure refs for GC.
+ closed = true;
+ };
},
cancel() {
- child.kill();
+ // The SPA-side cancel is always *user-initiated*: either an
+ // explicit Stop click or tab-close/navigation, which the
+ // user just as explicitly chose. HMR-driven SIGTERMs go
+ // straight from the server to the runner via
+ // `dispatchRebuild`; they DO NOT trigger this handler
+ // (the SPA waits for the train stream's `exit=` line and
+ // schedules auto-restart, never aborting). So manual stop
+ // takes precedence over any in-flight HMR graceful path:
+ // we POST cloud cancel + SIGKILL unconditionally.
+ //
+ // SIGKILL is uncatchable so the long-standing
+ // "second-SIGTERM-triggers-exit(143)-fast-path" worry
+ // (which used to gate this branch on
+ // `isEarlyStopRequested`) doesn't apply. The runner's
+ // graceful early-stop chain may have been trying to
+ // preserve a checkpoint, but the user just said no; keep
+ // the local subprocess teardown snappy and let the
+ // server-side cancel POST handle the cloud-side release.
+ //
+ // Capture the spawn-time scope BEFORE unregistering: once
+ // the entry is gone, the registry getters return null and
+ // the fire-and-forget POST below would no-op. (The job id
+ // is read from the `parsedJobId` closure mirror instead.)
+ //
+ // `pid` is captured once here because the closure below
+ // runs after `unregister` and we want a stable handle.
+ const cancelPid = child.pid;
+ // Scope resolution order:
+ // 1. Registry entry's pinned scope (captured at spawn time).
+ // Authoritative when non-null: a user who deleted or made
+ // `.arkor/state.json` unreadable AFTER spawn shouldn't be
+ // able to silently orphan their cloud job by losing the
+ // cancel-time read.
+ // 2. Cancel-time re-read of `.arkor/state.json`, ONLY when
+ // the spawn-time capture was null. This handles the
+ // first-run anon case where `ensureProjectState` writes
+ // the state file from inside the child during
+ // `trainer.start()` (i.e. AFTER spawn). The read happens
+ // inside the fire-and-forget IIFE below so the cancel
+ // handler stays sync.
+ const pinnedScope = activeTrains.getScope(cancelPid);
+ activeTrains.unregister(cancelPid);
+ cancelTeardown?.();
+ // Fire-and-forget cloud-side cancel so the cloud job is
+ // released even though the SIGKILL below bypasses the
+ // runner's `installShutdownHandlers` (which would
+ // otherwise issue cancel itself via the graceful
+ // early-stop chain). The IIFE polls for the jobId
+ // *briefly* before giving up: there's a real race
+ // window where the user clicks Stop after the cloud
+ // job has been created but before the runner's
+ // `Started job ` line has been parsed (cloud
+ // createJob roundtrip is ~50-200ms; UI clicks can land
+ // sub-100ms into that window). Polling closes the most
+ // common case; beyond ~500 ms we accept the cloud-side
+ // orphan as a follow-up (the cloud reaper / TTL is the
+ // safety net, and the alternative of querying cloud-api
+ // for matching jobs at cancel time is brittle in
+ // multi-tab/multi-spawn scenarios).
+ void (async () => {
+ // Brief poll on `parsedJobId` (the closure mirror,
+ // see top-of-handler for why it can't be the
+ // registry's `getJobId`): the runner's
+ // `Started job ` line may not have been parsed by
+ // the time the user clicked Stop. Most runs hit it
+ // within ~50-200 ms of spawn (cloud createJob
+ // roundtrip), so polling for up to ~500 ms catches
+ // nearly all races. Beyond that we accept the
+ // cloud-side orphan as a documented follow-up: cloud
+ // reaper / TTL is the safety net, and the
+ // alternative (querying cloud-api for matching jobs
+ // at cancel time) is brittle for multi-tab /
+ // multi-spawn cases.
+ if (parsedJobId === null) {
+ const start = Date.now();
+ while (parsedJobId === null && Date.now() - start < 500) {
+ await new Promise((r) => setTimeout(r, 25));
+ }
+ }
+ if (parsedJobId === null) return;
+ // Resolve the cloud scope: prefer the spawn-time
+ // capture (immutable, snapshot at spawn) and fall back
+ // to reading `.arkor/state.json` only when there was
+ // none. The state file usually exists by now: the
+ // runner doesn't print `Started job ` until
+ // `trainer.start()` resolves, and `ensureProjectState`
+ // (which writes the file from inside the child for
+ // first-run anon projects) runs as part of that path.
+ let scopeForCancel = pinnedScope;
+ if (!scopeForCancel) {
+ try {
+ const late = await readState(trainCwd);
+ if (late) {
+ scopeForCancel = {
+ orgSlug: late.orgSlug,
+ projectSlug: late.projectSlug,
+ };
+ }
+ } catch {
+ // best-effort
+ }
+ }
+ if (!scopeForCancel) return;
+ try {
+ // `createRpc` now needs (baseUrl, token) explicitly; main's
+ // refactor moved off the closure-based getter so the per-
+ // request credentials read happens once here rather than
+ // twice via the SDK's lazy token callback.
+ const { baseUrl: rpcBaseUrl, token: rpcToken } =
+ await resolveCredentialsAndBaseUrl();
+ const rpc = createRpc(rpcBaseUrl, rpcToken);
+ await rpc.v1.jobs[":id"].cancel.$post({
+ param: { id: parsedJobId },
+ query: {
+ orgSlug: scopeForCancel.orgSlug,
+ projectSlug: scopeForCancel.projectSlug,
+ },
+ });
+ } catch {
+ // Best-effort: cloud-api transient failure or scope
+ // drift. Cloud reaper / TTL is the safety net.
+ }
+ })();
+ // SIGKILL (not the default SIGTERM) for user-initiated
+ // aborts. The runner's `installShutdownHandlers` now treats
+ // a single SIGTERM as the HMR-driven "graceful early-stop"
+ // signal: wait for the next checkpoint (up to ~5 min
+ // timeout) before exiting. That semantics is right for the
+ // HMR path but wrong for a Stop-training click: the user
+ // wants the run STOPPED, not left running in the background
+ // for minutes consuming GPU/cloud spend while the UI has
+ // already settled to idle. SIGKILL is uncatchable so the
+ // child dies immediately, eliminating the
+ // unregister-before-graceful-exit window where a fast new
+ // run could overlap an old one untracked by HMR routing.
+ //
+ // The cloud-side job is released by the fire-and-forget
+ // POST above (the runner's `Started job ` line is mirrored
+ // into `parsedJobId`; the IIFE reads it from there). SIGKILL
+ // alone would have left the cloud job orphaned until
+ // TTL/reaper because the runner can't POST cancel itself
+ // when the kernel reaps it without warning. Together,
+ // server-side cancel POST + SIGKILL give snappy local
+ // teardown AND eventual cloud-side release.
+ //
+ // `ChildProcess.kill()` can throw (ESRCH if the process has
+ // already exited between this handler's invocation and the
+ // signal delivery). A throw here would surface as an unhandled
+ // exception in the request pipeline and crash the server
+ // handler. Swallow it; the close handler above has already
+ // taken the entry out of the registry.
+ try {
+ child.kill("SIGKILL");
+ } catch {
+ // already gone; nothing to clean up.
+ }
},
});
- return new Response(stream, {
- status: 200,
- headers: { "content-type": "text/plain; charset=utf-8" },
- });
+ // Expose the spawned pid via a response header so the SPA can
+ // tell its own child apart from other tabs' children when
+ // `/api/dev/events` broadcasts `restartTargets` / `hotSwapTargets`.
+ // Without this, a passive tab whose run was hot-swapped could
+ // misread a sibling tab's restart event as its own.
+ //
+ // Header is OMITTED entirely (rather than sent as an empty
+ // string) when `child.pid` isn't a number; that case happens
+ // when the OS hasn't assigned a pid by the time `spawn()`
+ // returns and the child's async `error` event will fire shortly
+ // (per-Node-docs `subprocess.pid` is `undefined` for
+ // failed-spawn children). "Header absent" is the unambiguous
+ // signal the SPA can read; an empty string would force callers
+ // to special-case `""` vs missing for the same condition. The
+ // SPA's `raw ? Number.parseInt(raw, 10) : NaN` handler treats
+ // both cases identically, but absent-only is the cleaner wire
+ // contract.
+ const headers: Record<string, string> = {
+ "content-type": "text/plain; charset=utf-8",
+ };
+ if (typeof child.pid === "number") {
+ headers[TRAIN_PID_HEADER] = String(child.pid);
+ }
+ return new Response(stream, { status: 200, headers });
});
+ // `/api/dev/events`: SSE stream of HMR rebuild / error notifications.
+ // Only active when `arkor dev` passed an HMR coordinator. The CSRF model
+ // accepts `?studioToken=` here (whitelisted in `eventStreamPathPattern`)
+ // because `EventSource` cannot send headers. When HMR is not configured
+ // the route still has an explicit 404 so the request doesn't fall through
+ // to the SPA index.html (which would mislead the SPA into thinking the
+ // EventSource connected successfully).
+ if (!options.hmr) {
+ app.get("/api/dev/events", (c) =>
+ c.json({ error: "HMR not enabled" }, 404),
+ );
+ }
+ if (options.hmr) {
+ const hmr = options.hmr;
+ /** Augmented event = raw HMR event + the per-child signal results we
+ * computed for it. We compute these once per rebuild (not once per
+ * connected SSE client) so opening multiple Studio tabs doesn't fan
+ * out into N × SIGTERM / N × SIGUSR2 to each child. */
+ type AugmentedEvent = HmrEvent & {
+ restart?: boolean;
+ hotSwap?: boolean;
+ restartTargets?: RestartTarget[];
+ hotSwapTargets?: RestartTarget[];
+ };
+ const sseListeners = new Set<(event: AugmentedEvent) => void>();
+ let lastAugmented: AugmentedEvent | null = null;
+
+ // Single subscription against the HMR coordinator: this handler does
+ // signal dispatch + augmentation exactly once per rebuild, then fans
+ // the augmented payload out to every connected SSE client. Late-
+ // mounting clients receive `lastAugmented` instead of triggering a
+ // fresh signal pass against the same rebuild.
+ hmr.subscribe((event) => {
+ let augmented: AugmentedEvent = event;
+ // Route dispatch through every *successful* build event, not
+ // just `rebuild`. The coordinator emits the very first
+ // successful compile as `ready` (and the entry-wait recovery
+ // path also broadcasts `ready` when a fresh-scaffold project's
+ // entry file first appears). A child started via `/api/train`
+ // before the first `ready` (e.g. the SPA fired Run Training
+ // immediately after `arkor dev` booted, while the watcher's
+ // initial BUNDLE_END was still in flight) would otherwise
+ // never get SIGUSR2/SIGTERM-routed when that build lands,
+ // leaving it stuck on a stale or empty artifact until the
+ // next edit triggers a `rebuild`. Filtering by "not error"
+ // is forward-compatible with any new successful event types.
+ if (event.type !== "error" && activeTrains.size > 0) {
+ // Single per-child decision pass: hash match → SIGUSR2 (with
+ // a Windows fallback to SIGTERM since win32 doesn't deliver
+ // SIGUSR2), hash mismatch → SIGTERM. The registry returns
+ // both buckets so the SPA can react per-child rather than
+ // assuming one global outcome.
+ const nextHash = event.configHash ?? null;
+ // Content-hash for the pre-ready-spawn equality gate (the
+ // timestamp `event.hash` would over-trigger SIGTERM-restart
+ // on identical-bytes rebuilds). Both sides of the
+ // comparison (`entry.spawnArtifactContentHash` captured
+ // via `getCurrentArtifactContentHash()`, and this
+ // `event.contentHash`) are derived the same way, so a
+ // match means the child's loaded bytes ARE what the new
+ // configHash describes.
+ const nextArtifactContentHash = event.contentHash ?? null;
+ const { hotSwapTargets, restartTargets } = activeTrains.dispatchRebuild(
+ nextHash,
+ nextArtifactContentHash,
+ );
+ augmented = {
+ ...event,
+ hotSwap: hotSwapTargets.length > 0,
+ hotSwapTargets,
+ restart: restartTargets.length > 0,
+ restartTargets,
+ };
+ }
+ lastAugmented = augmented;
+ for (const fn of sseListeners) {
+ try {
+ fn(augmented);
+ } catch {
+ // listener controller closed mid-write; the cancel hook
+ // below takes care of removing it from the set.
+ }
+ }
+ });
+
+ app.get("/api/dev/events", () => {
+ const enc = new TextEncoder();
+ let listener: ((event: AugmentedEvent) => void) | null = null;
+ const stream = new ReadableStream({
+ start(controller) {
+ const send = (event: AugmentedEvent): void => {
+ const payload = JSON.stringify(event);
+ try {
+ controller.enqueue(
+ enc.encode(`event: ${event.type}\ndata: ${payload}\n\n`),
+ );
+ } catch {
+ // controller closed mid-write; cancel() removes us.
+ }
+ };
+ if (lastAugmented) send(lastAugmented);
+ listener = send;
+ sseListeners.add(send);
+ },
+ cancel() {
+ if (listener) sseListeners.delete(listener);
+ listener = null;
+ },
+ });
+ return new Response(stream, {
+ status: 200,
+ headers: {
+ "content-type": "text/event-stream",
+ "cache-control": "no-cache, no-transform",
+ },
+ });
+ });
+ }
+
// Playground hits this so mid-training inference from Studio has the same
// auth path as the rest of /api/*. State is auto-bootstrapped (anon only)
// so the Playground's base-model mode works on a fresh anonymous launch
@@ -407,7 +1100,7 @@ export function buildStudioApp(options: StudioServerOptions) {
state = await ensureProjectState({ cwd: trainCwd, client, credentials });
} catch (err) {
// Propagate cloud-api's status verbatim (e.g. 401 / 403 / 5xx) so the
- // SPA / clients can react appropriately — collapsing everything to 400
+ // SPA / clients can react appropriately; collapsing everything to 400
// would mis-report upstream outages and auth failures. Anything else
// (local writeState failures, missing-credentials guard) is treated as
// a server-side error.
@@ -897,7 +1590,11 @@ export function buildStudioApp(options: StudioServerOptions) {
const file = await readFile(join(assetsDir, cleaned));
const ext = cleaned.slice(cleaned.lastIndexOf(".") + 1);
if (ext === "html") {
- const html = injectStudioToken(file.toString("utf8"), studioToken);
+ const html = injectStudioMeta(
+ file.toString("utf8"),
+ studioToken,
+ Boolean(options.hmr),
+ );
return new Response(html, {
status: 200,
headers: { "content-type": CONTENT_TYPES.html! },
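The `/api/dev/events` handler in this file runs signal dispatch once per HMR event, fans the augmented payload out to every connected SSE client, and replays the latest payload to late-mounting clients. A minimal, self-contained sketch of that shape is below. `Broadcaster`, `publish`, `subscribe`, and `onSourceEvent` are hypothetical names that do not appear in the patch, and the counter merely stands in for `dispatchRebuild`.

```typescript
// Illustrative only: none of these names come from the patch.
type Listener<T> = (event: T) => void;

class Broadcaster<T> {
  private readonly listeners = new Set<Listener<T>>();
  private last: T | null = null;

  /** Fan one already-computed event out to every listener. */
  publish(event: T): void {
    this.last = event;
    for (const fn of this.listeners) {
      try {
        fn(event);
      } catch {
        // A throwing listener must not starve its siblings.
      }
    }
  }

  /** Replay the latest event (if any), then register; returns an unsubscribe function. */
  subscribe(fn: Listener<T>): () => void {
    if (this.last !== null) fn(this.last);
    this.listeners.add(fn);
    return () => {
      this.listeners.delete(fn);
    };
  }
}

// The expensive side effect runs once per source event, regardless of
// how many clients are attached.
let sideEffects = 0;
const bus = new Broadcaster<{ type: string; n: number }>();
const onSourceEvent = (type: string): void => {
  sideEffects += 1; // stands in for `dispatchRebuild`
  bus.publish({ type, n: sideEffects });
};

const seenA: number[] = [];
const seenB: number[] = [];
const stopA = bus.subscribe((e) => seenA.push(e.n));
onSourceEvent("rebuild");
bus.subscribe((e) => seenB.push(e.n)); // late subscriber receives a replay
onSourceEvent("rebuild");
stopA();
onSourceEvent("rebuild");
```

Keeping the side effect outside the per-client path is what stops N open tabs from turning one rebuild into N signals per child.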
diff --git a/packages/arkor/src/studio/trainRegistry.test.ts b/packages/arkor/src/studio/trainRegistry.test.ts
new file mode 100644
index 00000000..1278f7f0
--- /dev/null
+++ b/packages/arkor/src/studio/trainRegistry.test.ts
@@ -0,0 +1,411 @@
+import { describe, it, expect, vi } from "vitest";
+import type { ChildProcess } from "node:child_process";
+import { TrainRegistry } from "./trainRegistry";
+
+interface FakeChild {
+ pid: number;
+ kill: ReturnType<typeof vi.fn>;
+}
+
+function fakeChild(pid: number): FakeChild {
+ // Default: `kill(sig)` returns `true`, mirroring Node's contract for
+ // a successful signal delivery to a still-running process.
+ return { pid, kill: vi.fn(() => true) };
+}
+
+describe("TrainRegistry", () => {
+ it("ignores children without a pid (already-exited spawns)", () => {
+ const reg = new TrainRegistry();
+ reg.register({ pid: undefined } as unknown as ChildProcess, {
+ configHash: "h1",
+ });
+ expect(reg.size).toBe(0);
+ });
+
+ it("dispatchRebuild SIGUSR2s only matching configHashes", () => {
+ const reg = new TrainRegistry();
+ const a = fakeChild(101);
+ const b = fakeChild(102);
+ const c = fakeChild(103);
+ reg.register(a as unknown as ChildProcess, { configHash: "match" });
+ reg.register(b as unknown as ChildProcess, {
+ configHash: "different",
+ trainFile: "/tmp/b.ts",
+ });
+ reg.register(c as unknown as ChildProcess, { configHash: "match" });
+
+ const result = reg.dispatchRebuild("match");
+ expect(result.hotSwapTargets).toEqual([
+ { pid: 101, trainFile: undefined },
+ { pid: 103, trainFile: undefined },
+ ]);
+ expect(result.restartTargets).toEqual([
+ { pid: 102, trainFile: "/tmp/b.ts" },
+ ]);
+ expect(a.kill).toHaveBeenCalledWith("SIGUSR2");
+ expect(c.kill).toHaveBeenCalledWith("SIGUSR2");
+ expect(b.kill).toHaveBeenCalledWith("SIGTERM");
+ });
+
+ it("dispatchRebuild SIGTERMs everything when nextConfigHash is null", () => {
+ // null nextHash means "we couldn't inspect the new bundle": be
+ // conservative and SIGTERM every active child since we can't
+ // prove their configs are unaffected.
+ const reg = new TrainRegistry();
+ const a = fakeChild(201);
+ const b = fakeChild(202);
+ reg.register(a as unknown as ChildProcess, { configHash: "h" });
+ reg.register(b as unknown as ChildProcess, { configHash: null });
+
+ const result = reg.dispatchRebuild(null);
+ expect(result.hotSwapTargets).toEqual([]);
+ expect(result.restartTargets).toHaveLength(2);
+ expect(a.kill).toHaveBeenCalledWith("SIGTERM");
+ expect(b.kill).toHaveBeenCalledWith("SIGTERM");
+ });
+
+ it("dispatchRebuild backfills the hash and skips dispatch when the spawn-time artefact matches the new build", () => {
+ // Pre-ready spawn (configHash: null) is the "user clicked Run
+ // before the watcher's first BUNDLE_END" case. Whether it's
+ // safe to backfill the new hash as the child's baseline depends
+ // on whether the on-disk artefact has changed between spawn
+ // and now: if `spawnArtifactContentHash === nextArtifactContentHash`, the
+ // child read exactly the bytes the new hash describes →
+ // backfill + skip dispatch (no spurious cancel+restart cycle).
+ // Otherwise (see the next test) SIGTERM-restart so cloud
+ // and child stay aligned.
+ const reg = new TrainRegistry();
+ const c = fakeChild(401);
+ reg.register(c as unknown as ChildProcess, {
+ configHash: null,
+ trainFile: "/tmp/preready.ts",
+ spawnArtifactContentHash: "art-v1",
+ });
+ const result = reg.dispatchRebuild("first-real-hash", "art-v1");
+ // Neither bucket: no signal sent, nothing for the SPA to react to.
+ expect(result.hotSwapTargets).toEqual([]);
+ expect(result.restartTargets).toEqual([]);
+ expect(c.kill).not.toHaveBeenCalled();
+ // A subsequent dispatch with the SAME config hash must take the
+ // hot-swap path (proves the backfill landed; without it this
+ // would STILL be null vs "first-real-hash" → SIGTERM).
+ const second = reg.dispatchRebuild("first-real-hash", "art-v2");
+ expect(second.hotSwapTargets).toEqual([
+ { pid: 401, trainFile: "/tmp/preready.ts" },
+ ]);
+ expect(second.restartTargets).toEqual([]);
+ expect(c.kill).toHaveBeenCalledWith("SIGUSR2");
+ // And a different config hash on a later rebuild now correctly
+ // routes to SIGTERM-restart (backfilled hash is real).
+ c.kill.mockClear();
+ const third = reg.dispatchRebuild("second-hash", "art-v3");
+ expect(third.restartTargets).toEqual([
+ { pid: 401, trainFile: "/tmp/preready.ts" },
+ ]);
+ expect(c.kill).toHaveBeenCalledWith("SIGTERM");
+ });
+
+ it("dispatchRebuild SIGTERM-restarts a pre-ready spawn when the artefact has changed since spawn", () => {
+ // Codex P2 regression: an edit landing between spawn and the
+ // watcher's first BUNDLE_END means the bytes the child loaded
+ // differ from what the new `configHash` describes. Backfilling
+ // unconditionally would silently teach the registry to use the
+ // post-edit hash as the child's baseline; later same-hash
+ // rebuilds would then hot-swap callbacks into a child whose
+ // cloud-side `JobConfig` was actually spawned against an older
+ // version, leaving the cloud run on a stale config. The artefact
+ // fingerprint mismatch (`art-stale` vs `art-fresh`) is the
+ // signal that the child loaded older bytes; SIGTERM-restart
+ // forces a clean re-spawn against the freshly-built artefact.
+ const reg = new TrainRegistry();
+ const c = fakeChild(411);
+ reg.register(c as unknown as ChildProcess, {
+ configHash: null,
+ trainFile: "/tmp/preready-stale.ts",
+ spawnArtifactContentHash: "art-stale",
+ });
+ const result = reg.dispatchRebuild("real-hash", "art-fresh");
+ // SIGTERM-restart: the child's bytes are stale relative to the
+ // new build. Hot-swap would be unsafe (config drift); skip
+ // would leave the child running with no future correction
+ // path (the registry would treat "real-hash" as the baseline
+ // even though the child never loaded that build).
+ expect(result.hotSwapTargets).toEqual([]);
+ expect(result.restartTargets).toEqual([
+ { pid: 411, trainFile: "/tmp/preready-stale.ts" },
+ ]);
+ expect(c.kill).toHaveBeenCalledWith("SIGTERM");
+ });
+
+ it("dispatchRebuild SIGTERM-restarts a pre-ready spawn when no artefact existed at spawn time", () => {
+ // Companion to the "artefact has changed" test: a fresh project
+ // never built before spawn means `coordinator.getCurrentArtifactContentHash()`
+ // returned `null`. The child's `await import` likely failed; we
+ // can't prove its config matches anything. Conservative
+ // SIGTERM-restart so the SPA re-spawns once the new bundle is
+ // on disk.
+ const reg = new TrainRegistry();
+ const c = fakeChild(421);
+ reg.register(c as unknown as ChildProcess, {
+ configHash: null,
+ trainFile: "/tmp/preready-fresh.ts",
+ spawnArtifactContentHash: null, // no artefact when /api/train fired
+ });
+ const result = reg.dispatchRebuild("first-real-hash", "art-fresh");
+ expect(result.hotSwapTargets).toEqual([]);
+ expect(result.restartTargets).toEqual([
+ { pid: 421, trainFile: "/tmp/preready-fresh.ts" },
+ ]);
+ expect(c.kill).toHaveBeenCalledWith("SIGTERM");
+ });
+
+ it("isEarlyStopRequested reflects the dispatchRebuild SIGTERM flag", () => {
+ // Regression: `/api/train`'s ReadableStream `cancel()` used to
+ // consult this flag to avoid sending a *second* SIGTERM to a child
+ // that HMR's `dispatchRebuild` had already SIGTERMed for early-stop.
+ // A double SIGTERM hits `installShutdownHandlers`' emergency
+ // `exit(143)` fast-path, bypassing the checkpoint-preserving
+ // cancel flow and potentially leaving the cloud run alive.
+ const reg = new TrainRegistry();
+ const a = fakeChild(901);
+ reg.register(a as unknown as ChildProcess, {
+ configHash: "h1",
+ trainFile: "/tmp/a.ts",
+ });
+ expect(reg.isEarlyStopRequested(901)).toBe(false);
+ // Mismatched hash → SIGTERM → flag flips on.
+ reg.dispatchRebuild("h2");
+ expect(reg.isEarlyStopRequested(901)).toBe(true);
+ // Defensive cases: non-numeric / unknown / never-registered pid.
+ expect(reg.isEarlyStopRequested(undefined)).toBe(false);
+ expect(reg.isEarlyStopRequested(99999)).toBe(false);
+ // Once the child unregisters (close handler) the flag effectively
+ // resets: subsequent queries return false rather than retaining
+ // stale state.
+ reg.unregister(901);
+ expect(reg.isEarlyStopRequested(901)).toBe(false);
+ });
+
+ it("unregister removes the child from the policy decisions", () => {
+ const reg = new TrainRegistry();
+ const a = fakeChild(401);
+ reg.register(a as unknown as ChildProcess, { configHash: "h" });
+ reg.unregister(401);
+ expect(reg.size).toBe(0);
+ const result = reg.dispatchRebuild("h");
+ expect(result.hotSwapTargets).toEqual([]);
+ expect(result.restartTargets).toEqual([]);
+ });
+
+ it("survives kill() throwing (child exited mid-iteration)", () => {
+ const reg = new TrainRegistry();
+ const a = fakeChild(501);
+ a.kill.mockImplementation(() => {
+ throw new Error("ESRCH");
+ });
+ reg.register(a as unknown as ChildProcess, { configHash: "h" });
+ // Both the hot-swap branch (matching hash) and the restart branch
+ // (mismatched hash) must swallow the throw and continue with their
+ // bookkeeping so a single dead child can't break HMR for siblings.
+ expect(() => reg.dispatchRebuild("h")).not.toThrow();
+ expect(() => reg.dispatchRebuild("x")).not.toThrow();
+ });
+
+ it("dispatchRebuild omits dead-on-kill children from the restart targets", () => {
+ // Regression: previously the implementation always pushed onto
+ // `targets` even when `kill()` threw, so a child that had already
+ // exited would still be reported back to the SPA as a restart
+ // target: the SPA would then wait forever for the (already-
+ // delivered) `exit=...` line and never re-spawn.
+ const reg = new TrainRegistry();
+ const dead = fakeChild(601);
+ dead.kill.mockImplementation(() => {
+ const err = new Error("kill ESRCH") as Error & { code?: string };
+ err.code = "ESRCH";
+ throw err;
+ });
+ reg.register(dead as unknown as ChildProcess, {
+ configHash: "stale",
+ trainFile: "/tmp/d.ts",
+ });
+ const result = reg.dispatchRebuild("fresh");
+ expect(result.hotSwapTargets).toEqual([]);
+ expect(result.restartTargets).toEqual([]);
+ });
+
+ it("dispatchRebuild classifies ESRCH on the hash-match branch as 'gone' (no SIGTERM fallback)", () => {
+ // Regression: `safeKill` previously treated any thrown error as
+ // `"unsupported"`, which on the hash-match branch triggers a
+ // SIGTERM fallback (intended for Windows + SIGUSR2 unsupported).
+ // POSIX `kill(2)` raises `ESRCH` for an already-exited child:
+ // classifying that as "unsupported" caused a needless SIGTERM
+ // attempt against a dead PID. Now ESRCH routes through the
+ // "gone" branch (no fallback, no restart-target push) so the
+ // child is dropped silently for the close handler to reap.
+ const reg = new TrainRegistry();
+ const goneOnSigusr2 = fakeChild(801);
+ goneOnSigusr2.kill.mockImplementation(() => {
+ const err = new Error("kill ESRCH") as Error & { code?: string };
+ err.code = "ESRCH";
+ throw err;
+ });
+ reg.register(goneOnSigusr2 as unknown as ChildProcess, {
+ configHash: "match",
+ trainFile: "/tmp/g.ts",
+ });
+ const result = reg.dispatchRebuild("match");
+ // No hot-swap (SIGUSR2 failed), no restart (correctly classified
+ // as gone, NOT routed into the SIGTERM fallback path).
+ expect(result.hotSwapTargets).toEqual([]);
+ expect(result.restartTargets).toEqual([]);
+ // Single SIGUSR2 attempt: no SIGTERM fallback was issued.
+ expect(goneOnSigusr2.kill).toHaveBeenCalledTimes(1);
+ expect(goneOnSigusr2.kill).toHaveBeenCalledWith("SIGUSR2");
+ });
+
+ it("dispatchRebuild omits dead-on-kill children when kill returns false (no throw)", () => {
+ // Regression: `ChildProcess.kill()` returns `false` (without
+ // throwing) when the target process is already gone. The previous
+ // implementation treated any non-throw as success and reported the
+ // child as a restart target; the SPA would then wait forever for
+ // an exit line that already arrived.
+ const reg = new TrainRegistry();
+ const gone = fakeChild(701);
+ gone.kill.mockReturnValue(false);
+ reg.register(gone as unknown as ChildProcess, {
+ configHash: "stale",
+ trainFile: "/tmp/g.ts",
+ });
+ const result = reg.dispatchRebuild("fresh");
+ expect(result.restartTargets).toEqual([]);
+ // We still attempted the kill; only the bookkeeping is skipped.
+ expect(gone.kill).toHaveBeenCalledWith("SIGTERM");
+ });
+
+ it("dispatchRebuild sends SIGTERM at most once per child across rebuilds", () => {
+ // Regression: under rapid edits the dev loop can fire multiple
+ // rebuilds before the child reaches its next checkpoint. The
+ // runner's shutdown handler treats a *second* SIGTERM as the
+ // emergency `exit(143)` fast-path, which would defeat the whole
+ // point of preserving the in-flight checkpoint. The registry now
+ // tracks per-child early-stop state and skips children it has
+ // already signalled.
+ const reg = new TrainRegistry();
+ const a = fakeChild(801);
+ reg.register(a as unknown as ChildProcess, {
+ configHash: "h1",
+ trainFile: "/tmp/a.ts",
+ });
+
+ const first = reg.dispatchRebuild("h2");
+ expect(first.restartTargets).toEqual([
+ { pid: 801, trainFile: "/tmp/a.ts" },
+ ]);
+ expect(a.kill).toHaveBeenCalledTimes(1);
+
+ // Second mismatching rebuild before the child has exited: must NOT
+ // re-send SIGTERM and must NOT re-list the child as a restart
+ // target (the SPA already has a pending re-spawn for it).
+ const second = reg.dispatchRebuild("h3");
+ expect(second.restartTargets).toEqual([]);
+ expect(a.kill).toHaveBeenCalledTimes(1);
+
+ // After the child exits and is unregistered, a fresh spawn in its
+ // place starts from a clean slate.
+ reg.unregister(801);
+ const respawn = fakeChild(802);
+ reg.register(respawn as unknown as ChildProcess, {
+ configHash: "h3",
+ trainFile: "/tmp/a.ts",
+ });
+ const third = reg.dispatchRebuild("h4");
+ expect(third.restartTargets).toEqual([
+ { pid: 802, trainFile: "/tmp/a.ts" },
+ ]);
+ expect(respawn.kill).toHaveBeenCalledTimes(1);
+ });
+
+ it("dispatchRebuild on win32 routes hash-matches directly to SIGTERM-restart (skips SIGUSR2 attempt)", () => {
+ // Regression: Node's `child.kill("SIGUSR2")` on Windows is
+ // documented to **forcefully terminate** the process (treats
+ // any unknown POSIX signal as SIGKILL-equivalent) and STILL
+ // returns `true` like a successful delivery. `safeKill` would
+ // then report `"ok"` → entry lands in `hotSwapTargets` → SPA
+ // shows "hot-swap" and skips restart, but the child is already
+ // dead. The Codex P1 fix gates the SIGUSR2 attempt behind
+ // `process.platform !== "win32"` so win32 routes straight to
+ // SIGTERM-restart, surfacing a real restart target the SPA can
+ // act on.
+ const originalPlatform = Object.getOwnPropertyDescriptor(
+ process,
+ "platform",
+ );
+ Object.defineProperty(process, "platform", {
+ value: "win32",
+ configurable: true,
+ });
+ try {
+ const reg = new TrainRegistry();
+ const a = fakeChild(951);
+ a.kill.mockReturnValue(true); // win32 reports success even for SIGUSR2
+ reg.register(a as unknown as ChildProcess, {
+ configHash: "match",
+ trainFile: "/tmp/win.ts",
+ });
+ const result = reg.dispatchRebuild("match");
+ // Restart bucket only: hot-swap is unsafe on win32 even
+ // when kill() reported "ok".
+ expect(result.hotSwapTargets).toEqual([]);
+ expect(result.restartTargets).toEqual([
+ { pid: 951, trainFile: "/tmp/win.ts" },
+ ]);
+ // SIGUSR2 was NEVER attempted: the platform gate skipped it
+ // entirely and went straight to the SIGTERM fallback path.
+ // (Without the gate, SIGUSR2 would have fired first and been
+ // misclassified as a successful hot-swap.)
+ expect(a.kill).toHaveBeenCalledTimes(1);
+ expect(a.kill).toHaveBeenCalledWith("SIGTERM");
+ } finally {
+ if (originalPlatform) {
+ Object.defineProperty(process, "platform", originalPlatform);
+ }
+ }
+ });
+
+ it("dispatchRebuild degrades to SIGTERM-restart when SIGUSR2 is unsupported (Windows)", () => {
+ // Regression: Node's win32 build doesn't deliver SIGUSR2 (it
+ // throws "ENOSYS" inside `child.kill('SIGUSR2')`). The previous
+ // implementation silently swallowed that throw, so on Windows a
+ // hash-match rebuild produced neither hot-swap nor restart and
+ // callback edits never landed. Now we degrade to a SIGTERM-driven
+ // restart so the new code does take effect, at the cost of a
+ // brief gap rather than an in-place swap.
+ const reg = new TrainRegistry();
+ const a = fakeChild(901);
+ a.kill.mockImplementation((sig?: string) => {
+ if (sig === "SIGUSR2") {
+ const err = new Error(
+ "kill ENOSYS",
+ ) as Error & { code?: string };
+ err.code = "ENOSYS";
+ throw err;
+ }
+ return true; // SIGTERM works
+ });
+ reg.register(a as unknown as ChildProcess, {
+ configHash: "match",
+ trainFile: "/tmp/win.ts",
+ });
+ const result = reg.dispatchRebuild("match");
+ // Must not appear in hot-swap (signal failed) but must appear in
+ // restart (fallback succeeded) so the SPA re-spawns once the
+ // exit message arrives.
+ expect(result.hotSwapTargets).toEqual([]);
+ expect(result.restartTargets).toEqual([
+ { pid: 901, trainFile: "/tmp/win.ts" },
+ ]);
+ // Both signals were attempted in order: SIGUSR2 → fallback SIGTERM.
+ expect(a.kill).toHaveBeenNthCalledWith(1, "SIGUSR2");
+ expect(a.kill).toHaveBeenNthCalledWith(2, "SIGTERM");
+ });
+});
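Note on the win32 test above: the save / stub / restore dance around `process.platform` could be factored into a small helper. The following is only a sketch under assumptions. `withPlatform` is a hypothetical name that does not appear in this diff, and it only covers synchronous callbacks (an async callback would see the real platform restored before it settles).

```typescript
// Hypothetical test helper; `withPlatform` is not part of this diff.
// It mirrors the getOwnPropertyDescriptor / defineProperty / finally
// pattern used by the win32 test case.
function withPlatform<T>(platform: string, fn: () => T): T {
  // Capture the real descriptor so it can be restored verbatim.
  const original = Object.getOwnPropertyDescriptor(process, "platform");
  Object.defineProperty(process, "platform", {
    value: platform,
    configurable: true, // must stay configurable so the restore works
  });
  try {
    return fn();
  } finally {
    // Restore even when `fn` throws, so later tests see the real value.
    if (original) {
      Object.defineProperty(process, "platform", original);
    }
  }
}

// Usage: the callback observes the stubbed value.
const seenInside: string = withPlatform("win32", () => process.platform);
```

Keeping the restore in `finally` is the important part: a failed assertion inside the callback would otherwise leak the stubbed platform into unrelated tests.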
diff --git a/packages/arkor/src/studio/trainRegistry.ts b/packages/arkor/src/studio/trainRegistry.ts
new file mode 100644
index 00000000..9286e98b
--- /dev/null
+++ b/packages/arkor/src/studio/trainRegistry.ts
@@ -0,0 +1,416 @@
+import type { ChildProcess } from "node:child_process";
+
+/**
+ * Per-active-train state tracked alongside the spawned `arkor start`
+ * subprocess. The Studio server records this at spawn time so HMR
+ * rebuilds can decide, per child, between:
+ *
+ * - **SIGUSR2** (callback hot-swap) when the new bundle's `configHash`
+ * matches the one captured at spawn time: the cloud-side run is
+ * unaffected, only in-process callbacks need to update.
+ * - **SIGTERM** (graceful early-stop + restart) when the configs
+ * diverge: the runner's internal early-stop entry point lets the
+ * next checkpoint finish, the subprocess exits, and the SPA
+ * re-spawns with the rebuilt artefact.
+ */
+export interface ActiveTrain {
+ child: ChildProcess;
+ trainFile?: string;
+ /** Cloud-side config hash captured at spawn time (may be null if the
+ * manifest wasn't inspectable yet, e.g. spawn raced an in-flight
+ * build). A null entry forces SIGTERM on the next rebuild because we
+ * can't prove the configs match. */
+ configHash: string | null;
+ /**
+ * Content hash (sha256, truncated; see `studio/hmr.ts`'s
+ * `contentHashOrNull`) of the on-disk `.arkor/build/index.mjs`
+ * at spawn time. Used **only** to gate the pre-ready-spawn
+ * backfill: if a rebuild eventually fires while `configHash` is
+ * still null and this content hash equals the rebuild's
+ * `event.contentHash`, the child is provably reading the same
+ * bundle bytes the new hash describes: safe to backfill
+ * `configHash` and skip dispatch. A mismatch (or null here)
+ * means the on-disk artefact has changed between spawn and
+ * rebuild (user edited mid-spawn, fresh project never built, …)
+ * so the child is running stale bytes and we MUST SIGTERM-restart
+ * to keep cloud-side `JobConfig` aligned with what the child
+ * actually loaded.
+ *
+ * Content-hash (vs the timestamp `mtime+ctime+size` shape used
+ * by `event.hash` for SSE dedup) avoids a false-positive
+ * mismatch when a watcher rebuild produces identical bytes:
+ * timestamps still bump, but content is the same and we
+ * shouldn't force a spurious cancel+restart cycle. Null when
+ * HMR isn't enabled or read failed.
+ */
+ spawnArtifactContentHash: string | null;
+ /**
+ * `true` once we've already SIGTERM'd this child for an HMR-driven
+ * early-stop. Subsequent rebuilds (which can land before the child
+ * has reached its next checkpoint) must NOT re-send SIGTERM:
+ * the runner's shutdown handler treats a second SIGTERM as the
+ * emergency `process.exit(143)` escape hatch, which would defeat
+ * the whole point of preserving the in-flight checkpoint. Kept
+ * internal to the registry; consumers shouldn't manage it.
+ */
+ earlyStopRequested?: boolean;
+ /**
+ * Cloud-side job id, captured by parsing the runner's
+ * `Started job ` stdout line shortly after spawn. Populated
+ * via `recordJobId(pid, id)` on the first matching chunk; null
+ * before that, or for runs whose stdout never showed the line
+ * (early spawn failure, custom user bins, etc.). The
+ * `/api/train` cancel handler reads this to fire a fire-and-forget
+ * `POST /v1/jobs/:id/cancel` before SIGKILLing the subprocess.
+ * SIGKILL bypasses the runner's `installShutdownHandlers`, so
+ * without this server-side cancel the cloud-side job would live
+ * until the cloud reaper / TTL fires (continued GPU spend).
+ */
+ jobId: string | null;
+ /**
+ * Cloud-api scope (org + project slugs) captured at spawn time
+ * from `.arkor/state.json`. Pinned on the registry entry so the
+ * `/api/train` cancel handler can address the cloud cancel POST
+ * without re-reading the filesystem at stop time. Without this
+ * pin, a user who deleted `.arkor/state.json` or made it unreadable
+ * mid-training would have their manual stop silently skip the
+ * cancel POST (state read returns null, handler bails) and
+ * the cloud job would orphan. Null when `/api/train` ran without
+ * state (auto-anonymous bootstrap failed, etc.); cancel POST is
+ * skipped then too, but the SIGKILL still tears down the local
+ * subprocess.
+ */
+ scope: { orgSlug: string; projectSlug: string } | null;
+}
+
+export interface RestartTarget {
+ pid: number;
+ trainFile?: string;
+}
+
+export interface DispatchResult {
+ /** Children whose callbacks were rotated in place via SIGUSR2. */
+ hotSwapTargets: RestartTarget[];
+ /**
+ * Children that were SIGTERM'd for graceful early-stop and need to
+ * be re-spawned by the SPA after the train stream emits its
+ * `exit=...` line. Includes both config-mismatch cases and
+ * config-match cases that fell back here because the platform
+ * doesn't support SIGUSR2 (Windows).
+ */
+ restartTargets: RestartTarget[];
+}
+
+/**
+ * Outcome of a single `child.kill(signal)` call.
+ *
+ * - `"ok"`: signal was delivered.
+ * - `"gone"`: process was already exited. Surfaces both as `kill`
+ * returning `false` (Node's mapped form) and as a thrown `ESRCH`
+ * (a race where the child exits between the `entries` lookup and
+ * the `kill` call: POSIX `kill(2)` raises `ESRCH` for
+ * non-existent PIDs and Node propagates it on some versions).
+ * - `"unsupported"`: any *other* `kill` throw, i.e. the signal
+ * couldn't be delivered for a reason that isn't "process is gone".
+ * The motivating case is the platform not supporting this signal
+ * kind (Windows + `SIGUSR2` → `ENOSYS`; bad signal name →
+ * `EINVAL`), which `dispatchRebuild` falls back to SIGTERM-restart
+ * for. The bucket is intentionally a catch-all rather than a
+ * whitelist of error codes: rare cases like `EPERM` (lost the
+ * right to signal a re-parented child) and platform-specific
+ * surprises take the same conservative fallback (try the next
+ * signal, otherwise drop the entry), which is what callers want
+ * from "kill failed for some non-recoverable reason".
+ */
+type KillResult = "ok" | "gone" | "unsupported";
+
+function safeKill(child: ChildProcess, signal: NodeJS.Signals): KillResult {
+ try {
+ return child.kill(signal) ? "ok" : "gone";
+ } catch (err) {
+ // `ESRCH` ("no such process") means the child already exited:
+ // semantically identical to `kill returning false`. Mis-classifying
+ // it as `"unsupported"` would route a hash-match hot-swap candidate
+ // into the SIGTERM fallback, which then also no-ops (also gone) but
+ // costs a needless restart-bucket inclusion until the close handler
+ // unregisters the child. Every other throw collapses into
+ // `"unsupported"` per the type doc above.
+ const code = (err as NodeJS.ErrnoException | null)?.code;
+ if (code === "ESRCH") return "gone";
+ return "unsupported";
+ }
+}
+
+/**
+ * Encapsulates the set of `/api/train`-spawned subprocesses and the
+ * signal-dispatch decision rule for HMR rebuilds. Pulled out of
+ * `buildStudioApp` so the policy is testable in isolation and so future
+ * additions (e.g. a `cancel-all` admin endpoint) have a clear seam.
+ */
+export class TrainRegistry {
+ private readonly entries = new Map<number, ActiveTrain>();
+
+ register(
+ child: ChildProcess,
+ init: Omit<
+ ActiveTrain,
+ | "child"
+ | "earlyStopRequested"
+ | "spawnArtifactContentHash"
+ | "jobId"
+ | "scope"
+ > & {
+ // Optional in the signature so tests / future callers that
+ // don't track the on-disk artefact content hash (e.g. an
+ // HMR-disabled server, a hand-rolled fake) can omit it.
+ // Defaults to `null`, which forces the pre-ready-spawn
+ // branch to fall through to SIGTERM-restart on the next
+ // non-null rebuild (the safe choice when we genuinely
+ // don't know what bytes the child loaded). Real `/api/train`
+ // calls in HMR mode capture this from
+ // `coordinator.getCurrentArtifactContentHash()`.
+ spawnArtifactContentHash?: string | null;
+ // Optional too: tests don't need scope for HMR-routing
+ // assertions. Real `/api/train` calls in production pass a
+ // non-null scope captured from `.arkor/state.json` so the
+ // cancel POST can address the cloud job without re-reading
+ // the filesystem at stop time.
+ scope?: { orgSlug: string; projectSlug: string } | null;
+ },
+ ): void {
+ if (typeof child.pid !== "number") return;
+ this.entries.set(child.pid, {
+ child,
+ ...init,
+ spawnArtifactContentHash: init.spawnArtifactContentHash ?? null,
+ scope: init.scope ?? null,
+ earlyStopRequested: false,
+ // `jobId` starts null; populated later by `recordJobId(pid,
+ // id)` when the server's stdout parser sees the runner's
+ // `Started job ` line. Tests that don't exercise the
+ // cancel-POST path can leave it null.
+ jobId: null,
+ });
+ }
+
+ unregister(pid: number | undefined): void {
+ if (typeof pid === "number") this.entries.delete(pid);
+ }
+
+ /**
+ * Record the cloud-side job id for an active child. Called by the
+ * server's `/api/train` stdout parser the first time it spots
+ * `Started job ` in the runner's output. Idempotent: a
+ * second call with the same pid + id is a no-op (the runner
+ * only prints the line once anyway). Unknown pids are silently
+ * dropped (the child may have already exited and unregistered).
+ */
+ recordJobId(pid: number | undefined, jobId: string): void {
+ if (typeof pid !== "number") return;
+ const entry = this.entries.get(pid);
+ if (!entry) return;
+ entry.jobId = jobId;
+ }
+
+ /**
+ * Read the recorded cloud-side job id for a pid. `/api/train`'s
+ * cancel handler consults this to POST `/v1/jobs/:id/cancel`
+ * before SIGKILLing the local subprocess; without that POST,
+ * a user-initiated stop would leave the cloud job running
+ * until TTL (the SIGKILL bypasses the runner's `installShutdownHandlers`
+ * so the runner can't issue cancel itself). Returns null when
+ * the pid is unknown or the runner hasn't printed its
+ * `Started job` line yet (early spawn failure, race against
+ * a fast cancel, custom user bins).
+ */
+ getJobId(pid: number | undefined): string | null {
+ if (typeof pid !== "number") return null;
+ return this.entries.get(pid)?.jobId ?? null;
+ }
+
+ /**
+ * Read the spawn-time cloud-api scope for a pid. Paired with
+ * `getJobId` by `/api/train`'s cancel handler to build the cloud
+ * cancel POST URL without re-reading `.arkor/state.json` at stop
+ * time: if the file was deleted or made unreadable mid-training,
+ * the read would return null and the cancel POST would silently
+ * skip, orphaning the cloud run. Captured at spawn time, immutable
+ * for the entry's lifetime.
+ */
+ getScope(
+ pid: number | undefined,
+ ): { orgSlug: string; projectSlug: string } | null {
+ if (typeof pid !== "number") return null;
+ return this.entries.get(pid)?.scope ?? null;
+ }
+
+ /**
+ * Whether `dispatchRebuild` has already issued a graceful-restart
+ * SIGTERM to this child as part of an HMR cycle. Consulted by
+ * `/api/train`'s ReadableStream `cancel()` handler so a client-
+ * driven cancel (tab close, navigation, aborted fetch) doesn't
+ * pile a second SIGTERM on top of an in-progress early-stop:
+ * the runner's `installShutdownHandlers` interprets a second
+ * SIGTERM as the emergency `exit(143)` fast-path, which bypasses
+ * the checkpoint-preserving early-stop + `cancel()` flow and
+ * leaves the cloud-side run live while the local subprocess
+ * dies, defeating the main safety goal of the HMR restart logic.
+ */
+ isEarlyStopRequested(pid: number | undefined): boolean {
+ if (typeof pid !== "number") return false;
+ return this.entries.get(pid)?.earlyStopRequested ?? false;
+ }
+
+ get size(): number {
+ return this.entries.size;
+ }
+
+ /** Read-only snapshot, mostly for tests / observability. */
+ list(): ReadonlyArray<ActiveTrain> {
+ return [...this.entries.values()];
+ }
+
+ /**
+ * Single entry point for HMR rebuilds: per active child, decide
+ * between callback hot-swap (SIGUSR2) and graceful restart
+ * (SIGTERM), apply the signal, and report which children landed in
+ * each bucket so the SPA can update its UI / re-spawn restarted
+ * runs.
+ *
+ * Combines what was previously `notifyCallbackReload` +
+ * `requestEarlyStopOnMismatch` into one pass so the per-child
+ * decision is atomic: important because the hot-swap path can
+ * gracefully degrade into the restart path on platforms (Windows)
+ * where SIGUSR2 isn't supported, which is hard to express across
+ * two separate iterations of the registry.
+ *
+ * Re-signal protection: children already flagged
+ * `earlyStopRequested` are skipped entirely. The flag is cleared
+ * naturally when the child exits and is unregistered.
+ *
+ * Defensive corner cases:
+ * - `kill()` returns `false` (process already exited) → drop
+ * from the targets list, the registry's close handler will
+ * unregister it.
+ * - SIGUSR2 unavailable (skipped outright on win32, or `kill`
+ * throws elsewhere) → degrade to SIGTERM so callback edits still
+ * take effect (via a full restart) rather than being ignored.
+ */
+ dispatchRebuild(
+ nextConfigHash: string | null,
+ // Content hash (sha256-derived; see `studio/hmr.ts`) of the
+ // freshly-built artefact, paired with `entry.spawnArtifactContentHash`
+ // for the pre-ready-spawn equality gate. Defaults to `null` so
+ // tests / pre-existing callers that don't pass a hash get the
+ // conservative behaviour: a null entry hash falls through to
+ // SIGTERM-restart. Real dispatch from `/api/train`'s HMR
+ // subscriber threads `event.contentHash` here so the backfill
+ // optimisation activates only when the child's loaded bytes
+ // genuinely match.
+ nextArtifactContentHash: string | null = null,
+ ): DispatchResult {
+ const hotSwapTargets: RestartTarget[] = [];
+ const restartTargets: RestartTarget[] = [];
+
+ for (const [pid, entry] of this.entries) {
+ if (entry.earlyStopRequested) continue;
+ const target: RestartTarget = { pid, trainFile: entry.trainFile };
+ // Pre-ready spawn: this child was registered via `/api/train`
+ // *before* the HMR watcher's first successful build, so its
+ // recorded `configHash` is `null`. Whether the rebuild's new
+ // hash describes the same bytes the child actually loaded
+ // depends on whether the on-disk artefact has changed between
+ // spawn and now. Tie the decision to the artefact content
+ // hash:
+ //
+ // - `entry.spawnArtifactContentHash === nextArtifactContentHash`
+ // → child read the same bytes the new hash describes.
+ // Safe to backfill `configHash`; future rebuilds compare
+ // against the backfilled value like any other child. This
+ // is the common case (user clicked Run before the SPA had
+ // refreshed its manifest, but the on-disk artefact is the
+ // same one the watcher just settled on).
+ //
+ // - content hashes differ (or one side is null) → the bytes
+ // the child loaded don't match the new hash. SIGTERM-restart
+ // so the cloud-side `JobConfig` and the child's actual
+ // config are guaranteed to align. Without this gate, an
+ // edit landing between spawn and the first BUNDLE_END would
+ // silently teach the registry to use the post-edit hash as
+ // the child's baseline; later same-hash rebuilds would
+ // then hot-swap callbacks into a child whose cloud-side
+ // `JobConfig` was *actually* spawned against an older
+ // version, leaving the cloud run on a stale config.
+ const isPreReadySpawn =
+ entry.configHash === null && nextConfigHash !== null;
+ if (isPreReadySpawn) {
+ const artefactsAgree =
+ entry.spawnArtifactContentHash !== null &&
+ nextArtifactContentHash !== null &&
+ entry.spawnArtifactContentHash === nextArtifactContentHash;
+ if (artefactsAgree) {
+ entry.configHash = nextConfigHash;
+ continue;
+ }
+ // fall through to the mismatch / SIGTERM-restart path below
+ }
+ const matches =
+ nextConfigHash !== null &&
+ entry.configHash !== null &&
+ entry.configHash === nextConfigHash;
+
+ if (matches) {
+ // On Windows, Node's `child.kill(signal)` for any unknown
+ // POSIX signal (including SIGUSR2) is documented to
+ // **forcefully terminate** the process (same effect as
+ // SIGKILL), and `kill()` returns `true` like a successful
+ // delivery. `safeKill` would then report `"ok"`, the entry
+ // would land in `hotSwapTargets`, and the SPA would never
+ // schedule a restart even though the child is *dead*. Skip
+ // the SIGUSR2 attempt on win32 entirely and route directly
+ // to the SIGTERM-restart path so the SPA learns about the
+ // pending restart and re-spawns when the exit line arrives.
+ // The user-visible outcome (callbacks reload after a brief
+ // restart) matches the design intent on platforms where
+ // the in-place hot-swap simply isn't available.
+ if (process.platform !== "win32") {
+ const r = safeKill(entry.child, "SIGUSR2");
+ if (r === "ok") {
+ hotSwapTargets.push(target);
+ continue;
+ }
+ if (r === "gone") {
+ // Child already exited; close handler will unregister.
+ continue;
+ }
+ // Cross-platform safety net: SIGUSR2 reported `"unsupported"`
+ // on a non-win32 platform (rare: `ENOSYS` from libuv signal
+ // wrap on exotic builds, future Node versions removing the
+ // signal, etc.). Same fallback as the win32 skip above:
+ // route to SIGTERM-restart so callback edits still take
+ // effect via a full restart instead of silently being
+ // ignored.
+ }
+ const fallback = safeKill(entry.child, "SIGTERM");
+ if (fallback === "ok") {
+ entry.earlyStopRequested = true;
+ restartTargets.push(target);
+ }
+ // "gone" / "unsupported" again → drop silently; the close
+ // handler (or operator-driven restart) will recover.
+ continue;
+ }
+
+ // Hash mismatch (or one side is null): graceful restart.
+ const r = safeKill(entry.child, "SIGTERM");
+ if (r === "ok") {
+ entry.earlyStopRequested = true;
+ restartTargets.push(target);
+ }
+ // "gone": child already exited, drop. "unsupported": can't
+ // happen for SIGTERM on supported platforms; drop defensively.
+ }
+
+ return { hotSwapTargets, restartTargets };
+ }
+}
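Note on `dispatchRebuild` above: the per-child routing rule can be read as a pure function of the two hash pairs, the `earlyStopRequested` flag, and the platform. The sketch below is an illustration only, not code from this diff. `Action`, `DecideInput`, and `decide` are hypothetical names, and the sketch deliberately ignores the `safeKill` outcomes (`"gone"` / `"unsupported"`) that the real method also handles.

```typescript
// Hypothetical names throughout; none of these exist in trainRegistry.ts.
type Action = "skip" | "backfill" | "hot-swap" | "restart";

interface DecideInput {
  entryConfigHash: string | null; // hash captured at spawn time
  entryContentHash: string | null; // spawn-time artefact content hash
  earlyStopRequested: boolean; // already SIGTERM'd by an earlier rebuild
  nextConfigHash: string | null; // hash of the rebuilt bundle's config
  nextContentHash: string | null; // content hash of the rebuilt artefact
  platform: string; // e.g. the value of process.platform
}

function decide(i: DecideInput): Action {
  // Re-signal protection: never send a second SIGTERM.
  if (i.earlyStopRequested) return "skip";

  // Pre-ready spawn: backfill only when both content hashes are known
  // and equal; otherwise fall through to the mismatch path.
  if (i.entryConfigHash === null && i.nextConfigHash !== null) {
    const artefactsAgree =
      i.entryContentHash !== null &&
      i.nextContentHash !== null &&
      i.entryContentHash === i.nextContentHash;
    if (artefactsAgree) return "backfill";
  }

  const matches =
    i.nextConfigHash !== null &&
    i.entryConfigHash !== null &&
    i.entryConfigHash === i.nextConfigHash;

  // Hot-swap (SIGUSR2) only off win32; everything else restarts (SIGTERM).
  if (matches && i.platform !== "win32") return "hot-swap";
  return "restart";
}

// Usage: a hash match hot-swaps on linux but restarts on win32.
const base: DecideInput = {
  entryConfigHash: "match",
  entryContentHash: null,
  earlyStopRequested: false,
  nextConfigHash: "match",
  nextContentHash: null,
  platform: "linux",
};
const onLinux = decide(base);
const onWindows = decide({ ...base, platform: "win32" });
```

Expressing the rule this way makes the conservative default visible: any case that cannot prove a config match ends in `"restart"`.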
diff --git a/packages/cli-internal/src/templates.ts b/packages/cli-internal/src/templates.ts
index a250fd62..1d8e94fc 100644
--- a/packages/cli-internal/src/templates.ts
+++ b/packages/cli-internal/src/templates.ts
@@ -1,13 +1,13 @@
/**
* Starter templates written out by `create-arkor` / `arkor init`.
- * Single source of truth — both consumers bundle this module at build time.
+ * Single source of truth: both consumers bundle this module at build time.
*
* Layout written to disk:
*
* src/arkor/index.ts ← entry-point manifest (`createArkor({ trainer })`)
* src/arkor/trainer.ts ← per-template trainer (`createTrainer({...})`)
*
- * `index.ts` is identical across templates — only the trainer body differs.
+ * `index.ts` is identical across templates; only the trainer body differs.
*/
export type TemplateId = "redaction" | "translate" | "triage";
@@ -179,7 +179,7 @@ ${ONLOG_BODY}
});
`;
-// Order is significant — `templateChoices()` preserves insertion order so the
+// Order is significant: `templateChoices()` preserves insertion order so the
// CLI prompt lists demos first (sorted by estimated training time).
//
// Estimated training times assume A100 80GB on Runpod Serverless with the
@@ -204,7 +204,7 @@ export const TEMPLATES: Record = {
};
/**
- * Body of `src/arkor/index.ts` — identical across templates. The `createArkor`
+ * Body of `src/arkor/index.ts`: identical across templates. The `createArkor`
* factory is what `arkor build` / Studio discovers; per-role primitives
* (`trainer`, future `deploy`, `eval`) live in sibling files and get gathered
* here.
@@ -215,7 +215,7 @@ import { trainer } from "./trainer";
export const arkor = createArkor({ trainer });
`;
-export const STARTER_CONFIG = `// Placeholder for future project-level config — the runtime does not read
+export const STARTER_CONFIG = `// Placeholder for future project-level config: the runtime does not read
// fields from this file yet. Training settings (\`maxSteps\`, \`lora\`, etc.)
// live on the Trainer in src/arkor/trainer.ts. Project routing
// (orgSlug / projectSlug) is tracked automatically in .arkor/state.json.
@@ -241,7 +241,7 @@ An arkor training project scaffolded by \`create-arkor\`.
The \`dev\` / \`build\` / \`start\` package scripts forward to the matching
\`arkor\` subcommands, so the script form works across every package
-manager (\`npm\` does not run package binaries via \`npm \` — use
+manager (\`npm\` does not run package binaries via \`npm \`; use
\`npm run