diff --git a/AGENTS.md b/AGENTS.md index 554d6e96..c69b9d87 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -63,25 +63,36 @@ cd my-arkor-app && pnpm dev # Studio at http://127.0. `arkor dev` generates a 32-byte base64url token per launch ([packages/arkor/src/cli/commands/dev.ts](packages/arkor/src/cli/commands/dev.ts)) and: -1. Passes it to `buildStudioApp({ studioToken })`. The Hono server validates every `/api/*` request via `X-Arkor-Studio-Token` header (or `?studioToken=` query for `EventSource`, which can't set headers). Comparison uses `timingSafeEqual`. -2. Persists it to `~/.arkor/studio-token` (mode 0600) so the SPA dev workflow (`pnpm --filter @arkor/studio-app dev`) can read it via the `arkor-studio-token` Vite plugin in [packages/studio-app/vite.config.ts](packages/studio-app/vite.config.ts), which injects `` into `index.html` on each request. Persistence failure must NOT block server start (read-only `$HOME` on Docker, etc.) — just warn. +1. Passes it to `buildStudioApp({ studioToken })`. The Hono server validates every `/api/*` request via `X-Arkor-Studio-Token` header (or `?studioToken=` query for `EventSource`, which can't set headers). Comparison uses `timingSafeEqual`. The query-token allow-list lives in `eventStreamPathPattern` in [packages/arkor/src/studio/server.ts](packages/arkor/src/studio/server.ts), currently `/api/jobs/:id/events` and `/api/dev/events`. **Adding to that regex is CSRF-sensitive: each entry must be a GET stream-only route, never a mutation endpoint.** +2. Persists it to `~/.arkor/studio-token` (mode 0600) so the SPA dev workflow (`pnpm --filter @arkor/studio-app dev`) can read it via the `arkor-studio-token` Vite plugin in [packages/studio-app/vite.config.ts](packages/studio-app/vite.config.ts), which injects `` into `index.html` on each request. Persistence failure must NOT block server start (read-only `$HOME` on Docker, etc.); just warn. 3. Cleans up on `exit`/SIGINT/SIGTERM/SIGHUP via `unlinkSync`. 
-`/api/*` middleware also enforces a host-header allow-list (`127.0.0.1`/`localhost`) for DNS-rebinding defence. **CORS is intentionally NOT configured** — the SPA is same-origin so reflecting `*` would let "simple" cross-origin POSTs reach handlers. The token check rejects those; cross-origin tabs cannot read the SPA's ``. +`/api/*` middleware also enforces a host-header allow-list (`127.0.0.1`/`localhost`) for DNS-rebinding defence. **CORS is intentionally NOT configured**: the SPA is same-origin so reflecting `*` would let "simple" cross-origin POSTs reach handlers. The token check rejects those; cross-origin tabs cannot read the SPA's ``. -The whole point: prevents another browser tab on the same machine from POSTing `/api/train` (which spawns `arkor train` and dynamically imports user TS — RCE-grade). +The whole point: prevents another browser tab on the same machine from POSTing `/api/train` (which spawns `arkor train` and dynamically imports user TS, an RCE-grade exposure). -When touching the Studio server or SPA fetch layer, preserve: token via header for `fetch`, query param for `EventSource`, host-header guard, no CORS, timing-safe compare. The Vite plugin is dev-only (`apply: "serve"`) — running it during `vite build` would bake a stale per-launch token into the production `index.html` and shadow the runtime tag, causing every `/api/*` call to 403. +When touching the Studio server or SPA fetch layer, preserve: token via header for `fetch`, query param for `EventSource`, host-header guard, no CORS, timing-safe compare. The Vite plugin is dev-only (`apply: "serve"`): running it during `vite build` would bake a stale per-launch token into the production `index.html` and shadow the runtime tag, causing every `/api/*` call to 403. 
+ +### HMR + graceful early-stop + callback hot-swap + +`arkor dev` keeps a [Rolldown](https://rolldown.rs) watcher over `src/arkor/` ([packages/arkor/src/studio/hmr.ts](packages/arkor/src/studio/hmr.ts)) and pushes rebuild events over `/api/dev/events` (SSE). On each successful build the watcher dynamic-imports the artifact, pulls a `TrainerInspection` snapshot off the discovered trainer (via the cross-realm `Symbol.for("arkor.trainer.inspect")` brand attached in [packages/arkor/src/core/trainerInspection.ts](packages/arkor/src/core/trainerInspection.ts)), and computes a stable `configHash` from the cloud-side `JobConfig`. The SPA re-fetches `/api/manifest` on each event so the Run Training button stays in sync without a browser refresh. + +When a rebuild lands while a `/api/train`-spawned subprocess is in flight, the server makes a per-child decision in [packages/arkor/src/studio/trainRegistry.ts](packages/arkor/src/studio/trainRegistry.ts): + +- **`configHash` matches the spawn-time hash** → SIGUSR2. The child's `installCallbackReloadHandler` re-imports the artifact and rotates the trainer's callback cell via the internal `Symbol.for("arkor.trainer.replaceCallbacks")` brand exposed by [packages/arkor/src/core/trainerInspection.ts](packages/arkor/src/core/trainerInspection.ts). The cloud-side run is untouched. Use this whenever a code change is contained inside the `callbacks: { ... }` object. Don't add a `replaceCallbacks()` method to the public `Trainer` interface: keeping the mutator behind a `Symbol.for` brand is what stops the dev-only HMR primitive from leaking into the SDK's published surface. +- **`configHash` differs (or is null because the new bundle didn't inspect)** → SIGTERM. 
`installShutdownHandlers` drives the trainer's internal early-stop entry point via the `Symbol.for("arkor.trainer.requestEarlyStop")` brand exposed by [packages/arkor/src/core/trainerInspection.ts](packages/arkor/src/core/trainerInspection.ts), which lets the next `checkpoint.saved` event finish (work preserved) before issuing `cancel()` and exiting cleanly. The SPA auto-restarts the run with the rebuilt artifact via the `restart: true` flag on the SSE event. A second SIGTERM bypasses the early-stop and exits 143 immediately, as an emergency escape hatch for a hung cancel. + +Don't replace the SIGTERM-and-let-the-child-handle-it pattern with a SIGKILL escalation in the server: that would orphan Cloud-side jobs (no `cancel()` POST goes out) and waste GPU budget. Don't widen the SIGUSR2 path to "always hot-swap, server-side": the `configHash` check is what guarantees a hot-swap can't silently leave a child running with a stale `JobConfig`. Don't surface `requestEarlyStop()` (or `replaceCallbacks()`) as a method on the public `Trainer` interface: both are dev-only HMR primitives, and keeping them behind `Symbol.for` brands is what stops them from leaking into the published SDK shape; user code that wants similar semantics should compose `abortSignal` + `cancel()` per the cookbook. ### Project entry-point discovery The CLI/Studio look at `src/arkor/index.ts` in user projects. Discovery in [packages/arkor/src/core/runner.ts](packages/arkor/src/core/runner.ts) accepts (in order): a named `arkor` export from `createArkor({...})`, a bare `trainer` export, a default export holding either an Arkor manifest or a Trainer, or a `default.trainer` nested shape. `createArkor` returns a frozen, opaque manifest tagged with `_kind: "arkor"`; treat it as a value to hand to tooling, not a programmable client. -`arkor build` ([packages/arkor/src/cli/commands/build.ts](packages/arkor/src/cli/commands/build.ts)) bundles to `.arkor/build/index.mjs` with esbuild; bare specifiers (e.g. 
`arkor`, anything in `node_modules`) stay external so the artifact resolves the runtime SDK from the project's installed copy. +`arkor build` ([packages/arkor/src/cli/commands/build.ts](packages/arkor/src/cli/commands/build.ts)) bundles to `.arkor/build/index.mjs` with [Rolldown](https://rolldown.rs); bare specifiers (e.g. `arkor`, anything in `node_modules`) stay external so the artifact resolves the runtime SDK from the project's installed copy. The `transform.target` is derived from `process.versions.node` at build time so the bundle targets the same Node binary that will execute it. ### E2E suite specifics -Both [e2e/cli](e2e/cli) and [e2e/studio](e2e/studio) declare `arkor` (and, for `e2e/cli`, `create-arkor`) as `workspace:*` `devDependencies`, so Turbo's `^build` produces `dist/bin.mjs` exactly once before `#test`/`#test:coverage` runs — no `pretest` hooks, no concurrent rebuilds racing on `dist/`. Standalone runs (`pnpm --filter @arkor/e2e-* test`) need a prior `pnpm build`. Every supported Node (≥22.22.0) is in rolldown's compatible range (^20.19 || >=22.12), so the previous "rolldown-incompatible" CI bypass path was removed. +Both [e2e/cli](e2e/cli) and [e2e/studio](e2e/studio) declare `arkor` (and, for `e2e/cli`, `create-arkor`) as `workspace:*` `devDependencies`, so Turbo's `^build` produces `dist/bin.mjs` exactly once before `#test`/`#test:coverage` runs (no `pretest` hooks, no concurrent rebuilds racing on `dist/`). Standalone runs (`pnpm --filter @arkor/e2e-* test`) need a prior `pnpm build`. Every supported Node (≥22.22.0) is in rolldown's compatible range (^20.19 || >=22.12), so the previous "rolldown-incompatible" CI bypass path was removed. Tests rely on `ARKOR_INTERNAL_SCAFFOLD_ARKOR_SPEC=file:.../packages/arkor` so the scaffolded fixtures install the workspace `arkor` instead of the npm-published one. Both this var and `SKIP_E2E_INSTALL` are declared in [turbo.json](turbo.json) so they pass through Turbo's hash. 
@@ -96,7 +107,7 @@ When implementing anything (new feature, SDK/CLI/Studio behaviour change, schema 1. **Docs in both languages.** This repo pairs English/Japanese docs: `README.md` ↔ `README.ja.md`, `CONTRIBUTING.md` ↔ `CONTRIBUTING.ja.md`, and `docs/` ↔ `docs/ja/`. If you edit the English side, update the Japanese side in the same PR. Don't leave Japanese docs to be retro-translated later. 2. **Tests.** Add vitest cases under `packages/*/src/**/*.test.ts` for SDK/CLI/scaffold logic changes. For CLI flow changes, consider an `e2e/cli` scenario. -Don't split these into "docs in a follow-up PR" or "tests later" — land them in the same PR. Skip only when the user explicitly says to. +Don't split these into "docs in a follow-up PR" or "tests later"; land them in the same PR. Skip only when the user explicitly says to. ## Non-obvious gotchas diff --git a/docs/concepts/studio.mdx b/docs/concepts/studio.mdx index fb9835c5..90ae5bdd 100644 --- a/docs/concepts/studio.mdx +++ b/docs/concepts/studio.mdx @@ -14,7 +14,12 @@ Four jobs: 3. **Try a finished model.** A Playground page lets you pick the base model or the final adapter from any completed job and chat with it. The Playground does not load intermediate checkpoints; for mid-run inference, use [`onCheckpoint`](/concepts/lifecycle) callbacks in your trainer. 4. **Publish a model behind a `*.arkor.app` URL.** An Endpoints page creates a per-deployment subdomain that serves OpenAI-compatible chat completions for a chosen adapter or base model, plus the API keys that authenticate calls to it. The same actions are available programmatically via [`CloudApiClient`](/sdk/deployments) — Studio is the interactive surface; the SDK is the lower-level one. -A note on the dev loop: Studio's `/api/manifest` endpoint rebuilds and re-imports your trainer on every request (with a cache-bust query, see `packages/arkor/src/studio/manifest.ts`), but the UI only fetches it when the Run training page mounts. 
So if you edit `src/arkor/` and stay on the same Run training page, the next click reuses the existing `.arkor/build/index.mjs` and runs your old code. Refresh the page (or run `arkor build` from the terminal) between edits and clicks to pick up the new code reliably. +A note on the dev loop: Studio runs a [Rolldown](https://rolldown.rs) watcher over `src/arkor/` and pushes rebuild notifications to the SPA over a Server-Sent Events stream (`/api/dev/events`). Edit a file, save, and the Run training button updates with the new trainer name without a refresh. If a training run is in flight, the Studio compares the new bundle's cloud-side `JobConfig` hash to the one captured when the run was spawned: + +- **Same hash (only callbacks changed).** The runner is signalled with SIGUSR2; it re-imports the rebuilt artifact and rotates the trainer's callback cell in place via an internal HMR brand. The cloud-side training run is untouched, no GPU time is wasted, and the SPA shows a brief "Callbacks hot-swapped" indicator. +- **Different hash (model / dataset / hyperparameters changed).** The runner is signalled with SIGTERM; the trainer's internal early-stop entry point lets the next checkpoint upload finish before issuing `cancel()`, then the SPA re-spawns the run with the rebuilt artifact. The previous Cloud-side job reaches `cancelled` after the checkpoint is uploaded, so the partial work is preserved as an artifact. + +If you want this "stop after the next checkpoint" behaviour from your own code (rather than from the dev loop), build it on top of the public [`abortSignal` + `cancel()`](/sdk/trainer-control#abortsignal) pair. The [Early stopping recipe](/cookbook/early-stopping) walks through it. ## Where Studio runs diff --git a/docs/ja/concepts/studio.mdx b/docs/ja/concepts/studio.mdx index 12f176e2..224c6eed 100644 --- a/docs/ja/concepts/studio.mdx +++ b/docs/ja/concepts/studio.mdx @@ -14,7 +14,12 @@ Studio は `arkor dev` 実行時に立ち上がるローカル Web UI です。 3. 
**完成モデルを試す。** Playground ページでベースモデルや任意の完了済みジョブの最終アダプターを選んでチャットできます。中間チェックポイントは Playground からはロードしません。学習中の推論には [`onCheckpoint`](/ja/concepts/lifecycle) コールバックをトレーナーで使ってください。 4. **`*.arkor.app` URL でモデルを公開する。** Endpoints ページで OpenAI 互換 chat completions を提供する deployment 専用サブドメインを作成し、その API キーを発行・取り消しできます。同じ操作は [`CloudApiClient`](/ja/sdk/deployments) からプログラマティックにも可能で、Studio が対話的なインターフェイス、SDK が下位レイヤーという位置付けです。 -dev ループのメモ: Studio の `/api/manifest` エンドポイントはリクエストごとにトレーナーをリビルド・再 import しますが(キャッシュバストクエリ付き、`packages/arkor/src/studio/manifest.ts` を参照)、UI が fetch するのは Run training ページがマウントされたときだけです。`src/arkor/` を編集して同じ Run training ページに留まり続けると、次のクリックは既存の `.arkor/build/index.mjs` を再利用して古いコードで走ります。確実に新しいコードを取り込むには、編集とクリックの間にページをリロード(あるいはターミナルから `arkor build`)してください。 +dev ループのメモ: Studio は [Rolldown](https://rolldown.rs) のウォッチャを `src/arkor/` 上で常駐させ、再ビルド通知を Server-Sent Events ストリーム (`/api/dev/events`) で SPA に push します。ファイルを編集して保存すれば、Run training ボタンのトレーナー名表示はリロード無しで更新されます。学習が走っている最中であれば、Studio は再ビルドしたバンドルの Cloud 側 `JobConfig` ハッシュを、spawn 時に保存したハッシュと比較します。 + +- **ハッシュ一致(コールバックのみ変更)。** ランナーへ SIGUSR2 を送ります。ランナーは再ビルドされた成果物を再 import し、内部 HMR ブランド経由でトレーナーのコールバック cell をその場で差し替えます。Cloud 側の学習はそのまま継続し、GPU 時間を無駄にせず、SPA には "Callbacks hot-swapped" と短く表示されます。 +- **ハッシュ不一致(モデル / データセット / ハイパーパラメータが変わった)。** ランナーへ SIGTERM を送ります。トレーナー内部の early-stop エントリが次のチェックポイントのアップロードを待ってから `cancel()` を発火し、SPA が再ビルドした成果物で再投入します。Cloud 側の以前のジョブはチェックポイントのアップロード完了後に `cancelled` 状態に遷移するので、ここまでの学習成果は artifact として保全されます。 + +自前のコードから(dev ループではなく)この「次のチェックポイントで止める」挙動が欲しい場合は、公開 API の [`abortSignal` + `cancel()`](/ja/sdk/trainer-control#abortsignal) を組み合わせて書いてください。具体的な手順は [Early Stopping レシピ](/ja/cookbook/early-stopping) にあります。 ## Studio が動く場所 diff --git a/docs/ja/studio/jobs.mdx b/docs/ja/studio/jobs.mdx index 5a1fae60..56683075 100644 --- a/docs/ja/studio/jobs.mdx +++ b/docs/ja/studio/jobs.mdx @@ -62,8 +62,8 @@ Jobs ページ(`#/jobs`)はマウント時に 1 度、その後 5 秒ごと Loss チャートは `training.log` イベントから描画される SVG プロットです。Y 軸は最小値と最大値によるスケーリング、X 
軸はステップ番号で、最大 2 系列を表示します: -- **Training loss** — 実線のティール色。数値 `loss` を含むイベントごとに 1 頂点。 -- **Eval loss** — 破線のピンク色(点マーカー付き)。数値 `evalLoss` を含むイベント(通常は `evalSteps` 刻み)から描画。系列はイベントから直接構築するため、`evalLoss` のみを持ち `loss` を含まない eval-only フレームも線・凡例・統計に反映されます。Eval ポイントが 1 つも来ていない間は凡例にも表示されません。 +- **Training loss**: 実線のティール色。数値 `loss` を含むイベントごとに 1 頂点。 +- **Eval loss**: 破線のピンク色(点マーカー付き)。数値 `evalLoss` を含むイベント(通常は `evalSteps` 刻み)から描画。系列はイベントから直接構築するため、`evalLoss` のみを持ち `loss` を含まない eval-only フレームも線・凡例・統計に反映されます。Eval ポイントが 1 つも来ていない間は凡例にも表示されません。 ホバーすると最寄りステップと、そのステップに含まれる `loss` / `evalLoss` のうち存在する値が表示されます(eval-only ステップでは `loss` 値は出ず、その逆も同様)。チャートは `loss` または `evalLoss` のいずれかが数値であるイベントが 1 件以上届くまで `Waiting for training.log events…`(`training.log` イベント待ち)プレースホルダーを表示します。両方とも null / 省略の `training.log` フレームはカウントされません。 @@ -71,9 +71,9 @@ Loss チャートは `training.log` イベントから描画される SVG プロ チャートヘッダーの **Advanced** トグルを ON にすると、系列ごとの統計パネルが現れます。各カードに表示される項目: -- **Mean loss ± 95% CI** — Loss 値の標本平均と 95% 信頼区間の半幅(Student の t 分布。n > 31 では z = 1.96 にフォールバック)。 -- **Std dev**(標準偏差)と **Variance**(分散) — Bessel 補正済みの不偏推定量(`ddof=1`)。 -- **p90** と **p95** — numpy のデフォルトに合わせた線形補間パーセンタイル。 +- **Mean loss ± 95% CI**: Loss 値の標本平均と 95% 信頼区間の半幅(Student の t 分布。n > 31 では z = 1.96 にフォールバック)。 +- **Std dev**(標準偏差)と **Variance**(分散): Bessel 補正済みの不偏推定量(`ddof=1`)。 +- **p90** と **p95**: numpy のデフォルトに合わせた線形補間パーセンタイル。 Eval カードは数値 `evalLoss` を含む `training.log` イベントが届くまでは空のままです。 diff --git a/e2e/studio/src/specs/hmr.spec.ts b/e2e/studio/src/specs/hmr.spec.ts new file mode 100644 index 00000000..94346b38 --- /dev/null +++ b/e2e/studio/src/specs/hmr.spec.ts @@ -0,0 +1,284 @@ +import { writeFileSync } from "node:fs"; +import { join } from "node:path"; +import { expect, test } from "../harness/fixture"; + +/** + * Rewrite the seeded `src/arkor/index.ts` with a new trainer `name` + * (and arbitrary content tail to bump mtime + size beyond any + * sub-millisecond resolution noise on fast filesystems). 
We rewrite + * the WHOLE file (not append) so rolldown's incremental cache can't + * reuse the prior module record and skip the rebuild. + * + * Two key shape differences from `seedFixture.ts`'s `seedManifest`: + * + * 1. The trainer carries the `Symbol.for("arkor.trainer.inspect")` + * brand so `findInspectableTrainer` (used by `studio/hmr.ts`'s + * `inspectBundle`) can read its name + config: without the + * brand, every SSE rebuild frame gets `trainerName: null` and + * the SSE-level test below can't distinguish the post-edit + * rebuild from the cached initial-build replay. The seed + * fixture skips the brand because its existing tests only + * exercise the `/api/manifest` path (which uses + * `findTrainerInModule`, brand-less). Extending it would + * couple every test to inspection internals it doesn't care + * about. + * + * 2. The brand returns a real `JobConfig` shape (`model` + + * `datasetSource` set), not the seed's empty placeholder, so + * `hashJobConfig` produces a stable non-empty `configHash`. + * `studio/server.ts`'s `dispatchRebuild` consults that hash to + * route between SIGUSR2 hot-swap and SIGTERM restart; the + * existing E2E only tests the boot path so it never needs a + * real config there. + * + * `Symbol.for` keys round-trip across the dev process / built + * bundle realm boundary because they live in the global symbol + * registry, the same mechanism `core/trainerInspection.ts` documents + * for the runtime CLI / `.arkor/build/index.mjs` split. 
+ */
+function rewriteManifest(projectDir: string, name: string): void {
+  const path = join(projectDir, "src", "arkor", "index.ts");
+  writeFileSync(
+    path,
+    [
+      'const TRAINER_INSPECT_KEY = Symbol.for("arkor.trainer.inspect");',
+      "const trainer = {",
+      ` name: ${JSON.stringify(name)},`,
+      " start: async () => ({ id: 'e2e-job', url: '' }),",
+      " wait: async () => ({ status: 'completed' as const }),",
+      " cancel: async () => {},",
+      "};",
+      "Object.defineProperty(trainer, TRAINER_INSPECT_KEY, {",
+      " value: () => ({",
+      " name: trainer.name,",
+      " config: {",
+      ' model: "studio-e2e-model",',
+      ' datasetSource: { type: "huggingface" as const, name: "studio-e2e-dataset" },',
+      " },",
+      " callbacks: {},",
+      " }),",
+      " enumerable: false,",
+      "});",
+      'export const arkor = { _kind: "arkor" as const, trainer };',
+      "export default arkor;",
+      `// rewritten-${name}-${Date.now()}`,
+      "",
+    ].join("\n"),
+  );
+}
+
+interface SseFrame {
+  event: string;
+  data: string;
+}
+
+/**
+ * Open `/api/dev/events`, parse incoming SSE frames, and resolve when
+ * `predicate` first returns true. Cleans up the underlying body
+ * reader on resolve / reject so the Hono server's connection bookkeeping
+ * doesn't leak between tests.
+ *
+ * `arkor dev` requires the studio token via the query param (EventSource
+ * can't set headers); the same allow-list governs `fetch()` here.
+ */
+async function awaitSseFrame(
+  studioUrl: string,
+  token: string,
+  predicate: (frame: SseFrame) => boolean,
+  timeoutMs: number,
+): Promise<SseFrame> {
+  const url = `${studioUrl}/api/dev/events?studioToken=${encodeURIComponent(token)}`;
+  const controller = new AbortController();
+  const timeout = setTimeout(() => controller.abort(), timeoutMs);
+  let res: Response;
+  try {
+    res = await fetch(url, { signal: controller.signal });
+  } catch (err) {
+    clearTimeout(timeout);
+    throw new Error(
+      `SSE connect failed for ${url}: ${(err as Error).message}`,
+    );
+  }
+  if (!res.ok || !res.body) {
+    clearTimeout(timeout);
+    throw new Error(
+      `SSE connect returned ${res.status} ${res.statusText}; body=${
+        res.body ? "present" : "missing"
+      }`,
+    );
+  }
+  const reader = res.body.getReader();
+  const decoder = new TextDecoder();
+  let buf = "";
+  try {
+    while (true) {
+      const { value, done } = await reader.read();
+      if (done) {
+        throw new Error("SSE stream ended before predicate matched");
+      }
+      buf += decoder.decode(value, { stream: true });
+      // Frames are terminated by a blank line (`\n\n`). Split, keep
+      // the trailing partial in `buf` for the next iteration.
+      const parts = buf.split("\n\n");
+      buf = parts.pop() ?? "";
+      for (const raw of parts) {
+        if (!raw) continue;
+        let event = "";
+        let data = "";
+        for (const line of raw.split("\n")) {
+          if (line.startsWith("event: ")) event = line.slice(7);
+          else if (line.startsWith("data: ")) data = line.slice(6);
+        }
+        const frame: SseFrame = { event, data };
+        if (predicate(frame)) return frame;
+      }
+    }
+  } finally {
+    clearTimeout(timeout);
+    // Cancel rather than just release: cancel propagates to the Hono
+    // ReadableStream's `cancel()` handler so the server unsubscribes
+    // this listener from the HMR coordinator promptly. Otherwise the
+    // listener lingers until the next dispose, which can produce
+    // cross-test bleed when running with `--repeat-each`.
+    await reader.cancel().catch(() => {});
+  }
+}
+
+test.describe("Studio HMR", () => {
+  test("/api/dev/events is registered with the hmr-enabled meta tag", async ({
+    page,
+    studio,
+  }) => {
+    // Boot-time wiring: `arkor dev` always wires up the HMR
+    // coordinator, so the served HTML must carry both the
+    // studio-token meta and the hmr-enabled meta. Without the
+    // hmr-enabled tag, `isHmrEnabled()` returns false in the SPA
+    // and the auto-restart / hot-swap paths silently no-op.
+    await page.goto(studio.url);
+    const hmrMeta = page.locator('meta[name="arkor-hmr-enabled"]');
+    await expect(hmrMeta).toHaveCount(1);
+    await expect(hmrMeta).toHaveAttribute("content", "true");
+
+    // Endpoint sanity-check: a GET without the studio token must 403
+    // (regression for the CSRF allow-list: `eventStreamPathPattern`
+    // permits the query-token form, but a raw GET stays gated).
+    const noToken = await fetch(`${studio.url}/api/dev/events`);
+    expect(noToken.status).toBe(403);
+  });
+
+  test("editing src/arkor/index.ts broadcasts a rebuild SSE frame with the new trainer name", async ({
+    studio,
+    fixturePaths,
+  }) => {
+    // Edit BEFORE subscribing, then let the predicate filter out
+    // pre-edit replays. The watcher may already have a cached
+    // initial-build `ready` (with the seed name) by the time we
+    // connect; subscribing first then editing would force a
+    // drain step. Going edit → subscribe is simpler: the
+    // predicate explicitly requires `trainerName === newName`,
+    // which only the post-edit BUNDLE_END can satisfy; any
+    // cached or in-flight frame for the seed name fails the
+    // predicate and `awaitSseFrame` keeps reading until the
+    // matching one arrives.
+    const newName = "studio-e2e-trainer-edited";
+    rewriteManifest(fixturePaths.projectDir, newName);
+
+    const frame = await awaitSseFrame(
+      studio.url,
+      studio.token,
+      (f) => {
+        if (f.event !== "rebuild" && f.event !== "ready") return false;
+        // Some replays have empty data; skip those.
+        if (!f.data) return false;
+        try {
+          const parsed = JSON.parse(f.data) as {
+            trainerName?: string | null;
+          };
+          return parsed.trainerName === newName;
+        } catch {
+          return false;
+        }
+      },
+      // Generous: rolldown's first cold build on a fresh project
+      // can take 1–2s on a slow CI runner; the post-edit rebuild is
+      // typically faster (incremental) but we don't want to flake on
+      // a noisy host.
+      20_000,
+    );
+
+    expect(frame.event === "rebuild" || frame.event === "ready").toBe(true);
+    const parsed = JSON.parse(frame.data) as {
+      outFile?: string;
+      trainerName?: string | null;
+      configHash?: string | null;
+    };
+    expect(parsed.trainerName).toBe(newName);
+    // The artefact path is also part of the contract: HMR consumers
+    // (including the runner subprocess on SIGUSR2) re-import the
+    // bundle by `outFile`. A regression that drops it would silently
+    // disable hot-swap.
+    expect(parsed.outFile).toMatch(/\.arkor[\\/]build[\\/]index\.mjs$/);
+  });
+
+  test("/api/manifest reflects the edited trainer name after a save", async ({
+    studio,
+    fixturePaths,
+  }) => {
+    // End-to-end through the Hono `/api/manifest` route, which
+    // dynamic-imports the freshly-built artefact via
+    // `summariseBuiltManifest`. The HMR rebuild must have completed
+    // *and* the cache-bust URL must reflect the new bytes for this
+    // assertion to pass: exercises the rebuild → write artefact →
+    // re-import → return summary chain end-to-end.
+    const newName = `studio-e2e-trainer-renamed-${Date.now()}`;
+    rewriteManifest(fixturePaths.projectDir, newName);
+
+    await expect
+      .poll(
+        async () => {
+          const res = await fetch(`${studio.url}/api/manifest`, {
+            headers: { "X-Arkor-Studio-Token": studio.token },
+          });
+          if (!res.ok) return null;
+          const body = (await res.json()) as {
+            trainer?: { name?: string } | null;
+          };
+          return body.trainer?.name ?? null;
+        },
+        {
+          // Same 20s budget as the SSE test for the same reason: the
+          // first rebuild after spawn can be slow on cold CI. Keep
+          // the poll interval modest so we don't hammer the dev
+          // loop's `runBuild` faster than it can settle.
+          timeout: 20_000,
+          intervals: [200, 400, 800, 1500],
+        },
+      )
+      .toBe(newName);
+  });
+
+  test("the SPA Run Training caption updates without a page reload after a save", async ({
+    page,
+    studio,
+    fixturePaths,
+  }) => {
+    // End-to-end browser proof: the SPA's RunTraining component
+    // subscribes to `/api/dev/events`, calls `fetchManifest()` on
+    // each rebuild, and re-renders the trainer caption. Reloading
+    // the page would mask any regression in that subscription path,
+    // so we explicitly DO NOT navigate again after the edit.
+    await page.goto(studio.url);
+    await expect(page.getByText(/studio-e2e-trainer/).first()).toBeVisible();
+
+    const newName = `studio-e2e-trainer-live-${Date.now()}`;
+    rewriteManifest(fixturePaths.projectDir, newName);
+
+    // The new name should appear without a navigation. Match by
+    // substring rather than exact text so the surrounding "Trainer
+    // from src/arkor/index.ts" caption decoration doesn't
+    // need to be replicated here.
+    await expect(page.getByText(newName).first()).toBeVisible({
+      timeout: 20_000,
+    });
+  });
+});
diff --git a/packages/arkor/package.json b/packages/arkor/package.json
index 692d2f3c..11088c91 100644
--- a/packages/arkor/package.json
+++ b/packages/arkor/package.json
@@ -55,17 +55,17 @@
     "@clack/prompts": "^0.8.0",
     "@hono/node-server": "^1.14.0",
     "commander": "^13.0.0",
-    "esbuild": "^0.28.0",
     "hono": "^4.7.0",
     "open": "^10.0.0",
     "posthog-node": "^5.30.6",
+    "rolldown": "^1.0.0",
     "zod": "^4.3.6"
   },
   "devDependencies": {
     "@arkor/cli-internal": "workspace:*",
     "@types/node": "^24",
     "@vitest/coverage-v8": "^4.1.5",
-    "tsdown": "^0.21.9",
+    "tsdown": "^0.22.0",
     "typescript": "^5",
     "vitest": "^4.1.5"
   },
diff --git a/packages/arkor/src/cli/cleanupHooks.test.ts b/packages/arkor/src/cli/cleanupHooks.test.ts
new file mode 100644
index 00000000..864feb9f
--- /dev/null
+++ b/packages/arkor/src/cli/cleanupHooks.test.ts
@@ -0,0 +1,256 @@
+import { afterEach, describe, expect, it, vi } from "vitest";
+import {
+  __resetCleanupHooksForTests,
+  registerCleanupHook,
+} from "./cleanupHooks";
+
+// Each test that emits a signal also installs new listeners on
+// `process` for the lifetime of this worker. Auto-detach inside the
+// handlers covers the fire-then-cleanup case; `__resetCleanupHooksForTests`
+// covers tests whose registration never fires (still need their
+// listeners off the worker before the next test runs).
+
+let exitSpy: ReturnType<typeof vi.spyOn> | null = null;
+let stdoutSpy: ReturnType<typeof vi.spyOn> | null = null;
+
+afterEach(() => {
+  exitSpy?.mockRestore();
+  stdoutSpy?.mockRestore();
+  exitSpy = null;
+  stdoutSpy = null;
+  __resetCleanupHooksForTests();
+});
+
+function mockExit(): number[] {
+  const codes: number[] = [];
+  exitSpy = vi
+    .spyOn(process, "exit")
+    .mockImplementation(((code?: number) => {
+      codes.push(code ?? 0);
+      return undefined as never;
+    }) as typeof process.exit);
+  return codes;
+}
+
+function flushMicrotasks(): Promise<void> {
+  return new Promise((resolve) => setImmediate(resolve));
+}
+
+describe("registerCleanupHook", () => {
+  it("waits for an async sibling cleanup to settle before exitOnSignal fires", async () => {
+    // Regression: previously the signal handler called
+    // `process.exit(0)` immediately after kicking off cleanup, so a
+    // sibling registration's async dispose (`hmr.dispose()`) got cut
+    // off mid-promise. The fix coordinates via a module-level
+    // in-flight set so the exit-owning hook awaits every other
+    // registered cleanup before terminating.
+    const order: string[] = [];
+    let resolveSlowDispose!: () => void;
+    const slowDispose = new Promise<void>((resolve) => {
+      resolveSlowDispose = resolve;
+    });
+
+    registerCleanupHook({
+      cleanup: () =>
+        slowDispose.then(() => {
+          order.push("async-cleanup-finished");
+        }),
+    });
+    registerCleanupHook({
+      cleanup: () => {
+        order.push("sync-cleanup");
+      },
+      exitOnSignal: true,
+    });
+
+    const codes = mockExit();
+    process.emit("SIGINT", "SIGINT");
+
+    // Sync cleanup body has already fired; async one is still pending,
+    // and exit must NOT have been called yet.
+    expect(order).toEqual(["sync-cleanup"]);
+    expect(codes).toEqual([]);
+
+    // Resolve the slow dispose; one microtask later the coordinator
+    // fires process.exit(0).
+    resolveSlowDispose();
+    await flushMicrotasks();
+    await flushMicrotasks();
+
+    expect(order).toEqual(["sync-cleanup", "async-cleanup-finished"]);
+    // SIGINT exits 130 (POSIX 128 + signo for SIGINT=2) so parent
+    // shells / orchestrators can distinguish "user interrupted"
+    // from "ran to completion (0)"; see SIGNAL_EXIT_CODE in
+    // cleanupHooks.ts.
+    expect(codes).toEqual([130]);
+  });
+
+  it("waits for sibling async cleanups even when the exit-owning hook is registered FIRST", async () => {
+    // Regression: even with the in-flight set in place, the
+    // exit-owning hook's signal handler used to take its
+    // `[...inFlightCleanups]` snapshot synchronously inside the
+    // listener body. Node's EventEmitter dispatches signal listeners
+    // in registration order, so when the exit-owning hook is wired
+    // up *first*, its handler takes the snapshot before any sibling
+    // hook (registered later) gets a chance to run its handler and
+    // add its own in-flight promise. Result: `Promise.allSettled`
+    // resolved on the snapshot of just-this-hook's promise → exit
+    // fired → siblings' async cleanup got cut off mid-flight.
+    //
+    // The order in the existing "waits for an async sibling
+    // cleanup" test happens to dodge this bug by registering the
+    // async hook first, so its handler runs first and seeds
+    // inFlightCleanups before the exit-owner takes its snapshot.
+    // This test inverts the order to actually exercise the
+    // queueMicrotask-deferred snapshot fix.
+    const order: string[] = [];
+    let resolveSlow!: () => void;
+    const slow = new Promise<void>((resolve) => {
+      resolveSlow = resolve;
+    });
+
+    // Register exit-owner FIRST.
+    registerCleanupHook({
+      cleanup: () => {
+        order.push("sync-cleanup");
+      },
+      exitOnSignal: true,
+    });
+    // Sibling async cleanup registered AFTER. With the old code,
+    // its promise wouldn't make it into the exit-owner's snapshot.
+    registerCleanupHook({
+      cleanup: () =>
+        slow.then(() => {
+          order.push("async-cleanup-finished");
+        }),
+    });
+
+    const codes = mockExit();
+    process.emit("SIGINT", "SIGINT");
+
+    // Sync ran inline; async pending; exit must NOT have fired.
+ expect(order).toEqual(["sync-cleanup"]); + expect(codes).toEqual([]); + + resolveSlow(); + await flushMicrotasks(); + await flushMicrotasks(); + + expect(order).toEqual(["sync-cleanup", "async-cleanup-finished"]); + // SIGINT exits 130 (POSIX 128 + signo for SIGINT=2) so parent + // shells / orchestrators can distinguish "user interrupted" + // from "ran to completion (0)"; see SIGNAL_EXIT_CODE in + // cleanupHooks.ts. + expect(codes).toEqual([130]); + }); + + it("exits with the POSIX 128+signo code for each terminating signal (130/143/129)", async () => { + // Regression: the exit-owning hook used to always + // `process.exit(0)`, regardless of which signal fired the + // shutdown. Parent shells / orchestrators / CI runners that + // gate on signal-style nonzero status would mis-classify a + // Ctrl-C (SIGINT) as a clean run: `arkor dev || cleanup` + // would skip the cleanup branch and leave whatever it owned + // unreaped. POSIX convention is 128 + signo (SIGINT=2 → 130, + // SIGTERM=15 → 143, SIGHUP=1 → 129); SIGNAL_EXIT_CODE in + // cleanupHooks.ts pins the mapping. + const cases: Array<["SIGINT" | "SIGTERM" | "SIGHUP", number]> = [ + ["SIGINT", 130], + ["SIGTERM", 143], + ["SIGHUP", 129], + ]; + for (const [sig, expected] of cases) { + registerCleanupHook({ cleanup: () => {}, exitOnSignal: true }); + const codes = mockExit(); + process.emit(sig, sig); + // queueMicrotask + Promise.allSettled chain: two flushes + // mirror the existing tests. + await flushMicrotasks(); + await flushMicrotasks(); + expect(codes, `signal ${sig}`).toEqual([expected]); + // Reset for the next iteration's hook registration so the + // new SIGNAL_EXIT_CODE doesn't get clobbered by leftover + // listeners. 
+ __resetCleanupHooksForTests(); + exitSpy?.mockRestore(); + exitSpy = null; + } + }); + + it("auto-detaches its process listeners after firing so they don't accumulate", () => { + // Regression: previously each `registerCleanupHook` call left + // `process.on('exit', ...)` and per-signal listeners armed + // forever. A long-lived Node worker that re-arms hooks (vitest + // running many tests, or any future caller that re-registers on + // each iteration) tripped Node's + // `MaxListenersExceededWarning`. Fix: each handler synchronously + // detaches its registration after invoking `run()`. + const exitBefore = process.listeners("exit").length; + const sigintBefore = process.listeners("SIGINT").length; + const sigtermBefore = process.listeners("SIGTERM").length; + const sighupBefore = process.listeners("SIGHUP").length; + + registerCleanupHook({ + cleanup: () => {}, + exitOnSignal: false, + }); + + expect(process.listeners("exit").length).toBe(exitBefore + 1); + expect(process.listeners("SIGINT").length).toBe(sigintBefore + 1); + expect(process.listeners("SIGTERM").length).toBe(sigtermBefore + 1); + expect(process.listeners("SIGHUP").length).toBe(sighupBefore + 1); + + // Firing one signal must detach BOTH that registration's signal + // listener AND its sibling exit listener: the registration is + // done after first fire regardless of which channel triggered it. + process.emit("SIGINT", "SIGINT"); + + expect(process.listeners("exit").length).toBe(exitBefore); + expect(process.listeners("SIGINT").length).toBe(sigintBefore); + expect(process.listeners("SIGTERM").length).toBe(sigtermBefore); + expect(process.listeners("SIGHUP").length).toBe(sighupBefore); + }); + + it("__resetCleanupHooksForTests detaches every still-armed registration", () => { + // Test-only escape hatch for registrations whose handler never + // fires inside the test (no signal emitted); without it, those + // listeners would persist across the vitest worker's test queue. 
+ const exitBefore = process.listeners("exit").length; + registerCleanupHook({ cleanup: () => {}, exitOnSignal: false }); + registerCleanupHook({ cleanup: () => {}, exitOnSignal: true }); + expect(process.listeners("exit").length).toBe(exitBefore + 2); + + __resetCleanupHooksForTests(); + + expect(process.listeners("exit").length).toBe(exitBefore); + }); + + it("is idempotent against repeated signals (done latch + bounded exit)", async () => { + let invocations = 0; + registerCleanupHook({ + cleanup: () => { + invocations += 1; + }, + exitOnSignal: true, + }); + + const codes = mockExit(); + process.emit("SIGINT", "SIGINT"); + process.emit("SIGINT", "SIGINT"); + process.emit("SIGINT", "SIGINT"); + await flushMicrotasks(); + await flushMicrotasks(); + + // Cleanup body runs once even if the signal fires multiple times + // (auto-detach removes the listener after first fire; the `done` + // latch is the secondary defence in case detach is racy). + expect(invocations).toBe(1); + // First SIGINT fires the handler → exit(130); follow-ups hit no + // listener after auto-detach, so codes has exactly one entry. + // SIGINT exits 130 (POSIX 128 + signo for SIGINT=2) so parent + // shells / orchestrators can distinguish "user interrupted" + // from "ran to completion (0)"; see SIGNAL_EXIT_CODE in + // cleanupHooks.ts. + expect(codes).toEqual([130]); + }); +}); diff --git a/packages/arkor/src/cli/cleanupHooks.ts b/packages/arkor/src/cli/cleanupHooks.ts new file mode 100644 index 00000000..24d0eb60 --- /dev/null +++ b/packages/arkor/src/cli/cleanupHooks.ts @@ -0,0 +1,196 @@ +import { SIGNAL_EXIT_CODE } from "../core/signalExit"; + +// POSIX `128 + signo` exit codes live in `core/signalExit.ts` so the +// runner's two-stage shutdown handler and this coordinator share a +// single source of truth. 
Without the shared map, adding (say) +// SIGQUIT to one side without the other would produce inconsistent +// exit statuses for the same signal: the exact parent-shell- +// classification regression the per-signal code was introduced to +// prevent. + +const TERMINATING_SIGNALS = ["SIGINT", "SIGTERM", "SIGHUP"] as const; + +export interface CleanupHookOptions { + /** + * Idempotent cleanup body. Wrapped with a `done` guard so a noisy + * shutdown (signal arriving while `process.exit` is already running + * an `exit` listener) doesn't trigger a double-cleanup. May be sync + * or return a Promise; async cleanups are awaited (across **all + * registered hooks**) before `exitOnSignal` fires the final + * `process.exit`. + */ + cleanup: () => void | Promise<void>; + /** + * Whether the signal-handler arm of this registration should call + * `process.exit` once every in-flight cleanup (this hook + any + * siblings registered in the same process) has settled. Use `true` + * for the outermost cleanup responsible for terminating the + * process; `false` for inner cleanups that should let a sibling + * own the exit. Default: `false`. + * + * The exit code is the POSIX `128 + signo` for the signal that + * triggered shutdown: 130 for SIGINT, 143 for SIGTERM, 129 for + * SIGHUP (see `SIGNAL_EXIT_CODE`). Parent shells / orchestrators / + * CI runners distinguish "user interrupted" (nonzero) from "ran + * to completion" (zero) on this: exiting 0 for a Ctrl-C'd + * `arkor dev` would let `arkor dev || cleanup_on_failure` skip + * its cleanup branch. + */ + exitOnSignal?: boolean; +} + +/** + * Module-scoped tracker of cleanup promises that haven't settled yet. + * The exit-owning hook waits on the union of (its own cleanup) + + * (every other in-flight cleanup) before calling `process.exit(...)`, + * so a fire-and-forget async cleanup in a sibling registration + * (`hmr.dispose()` is the canonical example) isn't cut off by an + * eager exit. 
(Exit code is signal-specific; see `SIGNAL_EXIT_CODE`.) + * + * Auto-prunes via the `.finally(() => inFlightCleanups.delete(...))` + * each `run()` attaches, so the set doesn't grow without bound across + * multiple `runDev()` invocations in the same process (tests). + */ +const inFlightCleanups = new Set<Promise<void>>(); + +/** + * Detachers for every still-armed registration. The signal/exit + * handlers each call their own detacher synchronously after invoking + * `run()` so a long-lived worker that calls `registerCleanupHook` + * many times (vitest reusing the same Node worker across tests, or a + * future caller that re-arms hooks dynamically) doesn't pile up + * `process.on(...)` listeners and trip Node's + * `MaxListenersExceededWarning`. Test code can also call + * `__resetCleanupHooksForTests()` to detach every still-armed + * registration up-front for explicit isolation. + */ +const attachedHandlers = new Set<() => void>(); + +/** + * Register a cleanup hook that fires on `process.exit` and on + * SIGINT / SIGTERM / SIGHUP. Used by `runDev` to dispose long-lived + * resources (the studio-token file, the HMR coordinator) without each + * call site re-implementing the same idempotent-guard + per-signal + * registration boilerplate. + * + * Per-registration signal listeners (rather than a singleton): each + * `runDev()` invocation gets its own listener wired to its own + * `done` latch. Listeners auto-detach as soon as their handler fires + * (the `done` latch makes any later invocation a no-op anyway), so + * a process that goes through many register → fire cycles doesn't + * accumulate stale listeners on `process`. + * + * `process.on("exit", ...)` listeners cannot be async: Node fires + * them right before the process terminates and discards any returned + * promise. We still register so sync cleanups (e.g. `unlinkSync`) run + * on a normal `process.exit(0)` path that never reached a signal + * handler. Async tails on this path are best-effort. 
The signal- + * handler path *does* await async tails before exiting. + */ +export function registerCleanupHook(options: CleanupHookOptions): void { + let done = false; + const run = (): Promise<void> => { + if (done) return Promise.resolve(); + done = true; + let promise: Promise<void>; + try { + const result = options.cleanup(); + // Wrap so callers can await uniformly even when cleanup was + // synchronous. Catch is attached so a thrown async cleanup + // doesn't leave an unhandled rejection on the floor. + promise = Promise.resolve(result).catch(() => { + // best-effort: shutdown is racing other cleanup paths + }); + } catch { + promise = Promise.resolve(); + } + inFlightCleanups.add(promise); + void promise.finally(() => inFlightCleanups.delete(promise)); + return promise; + }; + + const exitHandler = () => { + void run(); + detach(); + }; + const signalHandlers = new Map<(typeof TERMINATING_SIGNALS)[number], () => void>(); + for (const sig of TERMINATING_SIGNALS) { + signalHandlers.set(sig, () => { + // Sync cleanup body fires inside this `run()` call before the + // returned promise resolves; that preserves "side effect is + // observable right after the handler returns" for sync + // cleanups like `unlinkSync` (and the existing tests that + // assert on it). + run(); + detach(); + if (!options.exitOnSignal) return; + // Capture which signal triggered shutdown so the exit code + // below reflects "interrupted by SIG" (POSIX 128 + signo) + // rather than "ran to completion" (0). Parent shells / + // orchestrators / CI runners distinguish these: a script + // that runs `arkor dev || cleanup_on_failure` would otherwise + // mis-classify a Ctrl-C as success and skip its cleanup. + const exitCode = SIGNAL_EXIT_CODE[sig]; + // Snapshot `inFlightCleanups` AFTER every other signal listener + // for this signal has run. 
Node's EventEmitter dispatches + // listeners synchronously in registration order, so if the + // exit-owning hook happens to be registered *first*, taking the + // snapshot here in the listener body would miss promises that + // sibling hooks are about to add when their listeners run a + // few sync steps later. `queueMicrotask` defers past the end of + // the current sync turn (where `process.emit` finishes + // dispatching all listeners), so the snapshot includes every + // sibling's freshly-registered promise. Without this, an + // `arkor dev` whose `scheduleStudioTokenCleanup` (exitOnSignal: + // true) was registered before `scheduleHmrCleanup` (async + // dispose) would `process.exit(...)` mid-`hmr.dispose()` and + // leak the rolldown watcher. + // + // Settled promises pass through `Promise.allSettled` in a + // single microtask, so a process whose hooks are all + // synchronous still exits effectively immediately (one extra + // microtask round-trip). + queueMicrotask(() => { + void Promise.allSettled(inFlightCleanups).then(() => + process.exit(exitCode), + ); + }); + }); + } + + let detached = false; + const detach = () => { + if (detached) return; + detached = true; + process.off("exit", exitHandler); + for (const sig of TERMINATING_SIGNALS) { + const handler = signalHandlers.get(sig); + if (handler) process.off(sig, handler); + } + attachedHandlers.delete(detach); + }; + attachedHandlers.add(detach); + + process.on("exit", exitHandler); + for (const sig of TERMINATING_SIGNALS) { + const handler = signalHandlers.get(sig); + if (handler) process.on(sig, handler); + } +} + +/** + * Detach every still-armed registration. Test-only escape hatch: a + * vitest worker reuses the same Node process across many tests, and + * each `registerCleanupHook` call leaves listeners attached until + * something fires them. Call this from `afterEach` to keep the + * worker's `process` listener counts flat. 
+ */ +export function __resetCleanupHooksForTests(): void { + // `detach()` mutates `attachedHandlers` by removing the current entry. + // `Set` iterators safely handle that case (a deleted current item is + // not re-visited and remaining items keep their order), so we can + // iterate directly without snapshotting via `[...attachedHandlers]`. + for (const detach of attachedHandlers) detach(); + attachedHandlers.clear(); + inFlightCleanups.clear(); +} diff --git a/packages/arkor/src/cli/commands/build.ts b/packages/arkor/src/cli/commands/build.ts index ebfc3675..c4609039 100644 --- a/packages/arkor/src/cli/commands/build.ts +++ b/packages/arkor/src/cli/commands/build.ts @@ -1,7 +1,12 @@ import { existsSync } from "node:fs"; import { mkdir } from "node:fs/promises"; -import { isAbsolute, relative, resolve } from "node:path"; -import { build as esbuild } from "esbuild"; +import { relative } from "node:path"; +import { rolldown } from "rolldown"; +import { + BUILD_DEFAULTS, + resolveBuildEntry, + rolldownInputOptions, +} from "../../core/rolldownConfig"; import { ui } from "../prompts"; export interface BuildOptions { @@ -22,42 +27,30 @@ export interface BuildResult { outFile: string; } -const DEFAULT_ENTRY = "src/arkor/index.ts"; -const DEFAULT_OUT_DIR = ".arkor/build"; - /** * Bundle the user's `src/arkor/index.ts` into a single ESM artifact at * `.arkor/build/index.mjs`. * - * Bare specifiers (`arkor`, anything from `node_modules`) are kept external so - * the artifact resolves the runtime SDK from the project's installed copy. - * Relative imports are bundled inline. + * Bare specifiers (`arkor`, anything from `node_modules`) are kept external + * so the artifact resolves the runtime SDK from the project's installed + * copy. Relative imports are bundled inline. The transform target is + * derived from the running Node binary (see `resolveNodeTarget`). */ export async function runBuild(opts: BuildOptions = {}): Promise<BuildResult> { - const cwd = opts.cwd ?? 
process.cwd(); - const entryRel = opts.entry ?? DEFAULT_ENTRY; - const entry = isAbsolute(entryRel) ? entryRel : resolve(cwd, entryRel); + const { cwd, entry, outDir, outFile } = resolveBuildEntry(opts); if (!existsSync(entry)) { throw new Error( - `Build entry not found: ${entry}. Create ${DEFAULT_ENTRY} or pass an explicit entry argument.`, + `Build entry not found: ${entry}. Create ${BUILD_DEFAULTS.entry} or pass an explicit entry argument.`, ); } - - const outDirRel = opts.outDir ?? DEFAULT_OUT_DIR; - const outDir = isAbsolute(outDirRel) ? outDirRel : resolve(cwd, outDirRel); await mkdir(outDir, { recursive: true }); - const outFile = resolve(outDir, "index.mjs"); - await esbuild({ - entryPoints: [entry], - bundle: true, - platform: "node", - format: "esm", - target: "node22.22", - outfile: outFile, - packages: "external", - logLevel: "error", - }); + const bundle = await rolldown(rolldownInputOptions({ cwd, entry })); + try { + await bundle.write({ file: outFile, format: "esm" }); + } finally { + await bundle.close(); + } if (!opts.quiet) { ui.log.success( diff --git a/packages/arkor/src/cli/commands/dev.test.ts b/packages/arkor/src/cli/commands/dev.test.ts index 1104489c..c2bc1653 100644 --- a/packages/arkor/src/cli/commands/dev.test.ts +++ b/packages/arkor/src/cli/commands/dev.test.ts @@ -4,6 +4,7 @@ import { mkdtempSync, readFileSync, rmSync, + writeFileSync, } from "node:fs"; import { tmpdir } from "node:os"; import { join } from "node:path"; @@ -31,8 +32,33 @@ import { writeCredentials, type AnonymousCredentials, } from "../../core/credentials"; +import { __resetCleanupHooksForTests } from "../cleanupHooks"; import { ensureCredentialsForStudio, runDev } from "./dev"; +/** + * Yield one `setImmediate` tick: enough for the cleanupHooks + * coordinator's `Promise.allSettled(...).then(() => process.exit(0))` + * chain to drain when there are no async cleanups in flight (the + * common case in this file: signal handler → queueMicrotask → + * already-resolved 
`allSettled` → `.then` → `process.exit(0)`, + * which all collapses into the single macrotask boundary that + * `setImmediate` yields to). + * + * `setImmediate` is the right primitive (vs `Promise.resolve` / + * `queueMicrotask`) because we need the event loop to actually + * turn: the `process.exit` mock fires inside a `.then` callback + * scheduled from a previous microtask checkpoint, and a microtask- + * only flush would resume *before* that callback gets to run. + * + * Tests that drive a chain with extra microtask hops (e.g. async + * sibling cleanups whose promises also pass through + * `Promise.allSettled`) await this helper twice in a row; see + * the cleanupHooks tests. + */ +function flushMicrotasks(): Promise<void> { + return new Promise((resolve) => setImmediate(resolve)); +} + let fakeHome: string; const ORIG_HOME = process.env.HOME; // `os.homedir()` reads USERPROFILE on Windows; HOME-only redirection leaves @@ -83,7 +109,7 @@ describe("ensureCredentialsForStudio", () => { }); // When OAuth is advertised by the deployment, `arkor dev` no longer - // hands off to `runLogin` — that would block the Studio launch on a + // hands off to `runLogin`; that would block the Studio launch on a // browser flow. Instead we bootstrap anon and show a hint pointing at // `arkor login`, leaving the upgrade in the user's hands. it("bootstraps anonymous credentials even when OAuth is configured", async () => { @@ -158,7 +184,7 @@ describe("ensureCredentialsForStudio", () => { }); }); - // Regression for ENG-403 — when the cloud-api is unreachable, `arkor dev` + // Regression for ENG-403: when the cloud-api is unreachable, `arkor dev` // previously failed to start because the anonymous bootstrap's network // error wasn't caught. it("does not throw when the anonymous bootstrap fails after a successful config fetch", async () => { @@ -240,7 +266,7 @@ describe("ensureCredentialsForStudio", () => { // must surface at startup instead of being silently warned. 
it("re-throws when ARKOR_CLOUD_API_URL is malformed (config error)", async () => { process.env.ARKOR_CLOUD_API_URL = ""; - // No fetch mock — let real fetch raise the URL parse error so we + // No fetch mock: let real fetch raise the URL parse error so we // exercise the actual undici contract, not a synthetic TypeError. await expect(ensureCredentialsForStudio()).rejects.toThrow(TypeError); await expect(ensureCredentialsForStudio()).rejects.not.toThrow( @@ -281,7 +307,7 @@ describe("ensureCredentialsForStudio", () => { ); }); - // Codex P1 review on PR #65 — OAuth-only deployments advertise Auth0 in + // Codex P1 review on PR #65: OAuth-only deployments advertise Auth0 in // /v1/auth/cli/config but reject /v1/auth/anonymous. The new "always try // anon first" flow used to leave first-run users on those deployments // with a bare "Failed to acquire anonymous token (4xx)" error and no way @@ -320,7 +346,7 @@ describe("ensureCredentialsForStudio", () => { expect(await readCredentials()).toBeNull(); }); - // Codex P2 review on PR #65 — the OAuth-only wrap used to span the whole + // Codex P2 review on PR #65: the OAuth-only wrap used to span the whole // anon bootstrap, so fs errors from `writeCredentials` were also rewritten // as "deployment may require sign-in", hiding the actionable fs cause. // @@ -330,8 +356,8 @@ describe("ensureCredentialsForStudio", () => { // `writeFile` would raise EACCES under the bootstrap) only works on // POSIX as a non-root user: root bypasses chmod (Codex on PR #65), and // on Windows POSIX permission bits don't durably block writes inside a - // directory at all — Node maps `chmod` to the legacy read-only - // attribute, which NTFS only enforces on files. Both edges silently + // directory at all (Node maps `chmod` to the legacy read-only + // attribute, which NTFS only enforces on files). Both edges silently // turned the test green for the wrong reason. 
Mocking lifts the // "produce an EACCES" half of the test out of the host filesystem // entirely so every CI matrix entry exercises the wrap-narrowing @@ -395,7 +421,7 @@ describe("ensureCredentialsForStudio", () => { ); } if (url.endsWith("/v1/auth/anonymous")) { - // Missing `personalOrg` — anonymousTokenResponseSchema rejects. + // Missing `personalOrg`: anonymousTokenResponseSchema rejects. return new Response( JSON.stringify({ token: "t", anonymousId: "a", kind: "cli" }), { status: 200 }, @@ -413,7 +439,7 @@ describe("ensureCredentialsForStudio", () => { it("forwards a non-Error throwable from requestAnonymousToken (String() coercion)", async () => { // Defensive coverage of the `err instanceof Error ? err.message : String(err)` // helper inside the warn branch isn't exercised here because the - // helper is in the dev.ts catch — but the symmetrical path inside + // helper is in the dev.ts catch; but the symmetrical path inside // the schema-error case rethrows with the original value preserved. globalThis.fetch = vi.fn(async (input) => { const url = String(input); @@ -449,7 +475,7 @@ describe("ensureCredentialsForStudio", () => { ); } if (url.endsWith("/v1/auth/anonymous")) { - // Missing `personalOrg` — anonymousTokenResponseSchema rejects. + // Missing `personalOrg`: anonymousTokenResponseSchema rejects. return new Response( JSON.stringify({ token: "t", anonymousId: "a", kind: "cli" }), { status: 200 }, @@ -545,15 +571,6 @@ describe("ensureCredentialsForStudio", () => { }); describe("runDev", () => { - // Track exit/signal listeners we add via scheduleStudioTokenCleanup so - // we can remove them between tests; otherwise vitest's worker would - // accumulate listeners and Node's MaxListenersExceededWarning would - // fire by the third test. 
- const ORIG_EXIT_LISTENERS = process.listeners("exit").length; - const ORIG_SIGINT_LISTENERS = process.listeners("SIGINT").length; - const ORIG_SIGTERM_LISTENERS = process.listeners("SIGTERM").length; - const ORIG_SIGHUP_LISTENERS = process.listeners("SIGHUP").length; - beforeEach(async () => { vi.mocked(serve).mockClear(); vi.mocked(open).mockClear(); @@ -570,18 +587,11 @@ describe("runDev", () => { }); afterEach(() => { - // Trim the exit/signal listeners runDev installed each iteration to - // keep vitest's worker tidy across tests. - const trim = (ev: string, keep: number) => { - const all = process.listeners(ev as never); - for (let i = keep; i < all.length; i++) { - process.removeListener(ev as never, all[i] as never); - } - }; - trim("exit", ORIG_EXIT_LISTENERS); - trim("SIGINT", ORIG_SIGINT_LISTENERS); - trim("SIGTERM", ORIG_SIGTERM_LISTENERS); - trim("SIGHUP", ORIG_SIGHUP_LISTENERS); + // Each `runDev()` arms exit/signal hooks via `registerCleanupHook`. + // Tests whose handler never fires would leak listeners across the + // vitest worker's queue; this detaches every still-armed + // registration so Node's MaxListenersExceededWarning doesn't trip. + __resetCleanupHooksForTests(); }); it("persists the studio token and starts the server on the requested port", async () => { @@ -654,7 +664,7 @@ describe("runDev", () => { // ~/.arkor read-only after writeCredentials (so readCredentials still // works) so the per-launch token write hits EACCES. if (typeof process.getuid === "function" && process.getuid() === 0) { - // Root bypasses chmod permission checks — skip on root containers. + // Root bypasses chmod permission checks; skip on root containers. 
return; } chmodSync(join(fakeHome, ".arkor"), 0o555); @@ -697,8 +707,163 @@ describe("runDev", () => { const sigintListeners = process.listeners("SIGINT"); const handler = sigintListeners[sigintListeners.length - 1] as () => void; handler(); + // Sync side effect (token unlink) lands inside the synchronous + // portion of the handler. expect(existsSync(studioTokenPath())).toBe(false); - expect(exitSpy).toHaveBeenCalledWith(0); + // Exit fires after `Promise.allSettled(asyncCleanups)` resolves; + // a few microticks later. Flush to let the queued exit run. + await flushMicrotasks(); + // SIGINT exits 130 (POSIX 128 + signo for SIGINT=2): see + // SIGNAL_EXIT_CODE in cleanupHooks.ts. Parent shells need + // the nonzero code to distinguish interrupt from clean exit. + expect(exitSpy).toHaveBeenCalledWith(130); + } finally { + exitSpy.mockRestore(); + } + }); + + it("keeps the SIGINT exit handler armed even when persisting the studio token fails", async () => { + // Regression: if `persistStudioToken` threw, the previous code + // skipped `scheduleStudioTokenCleanup`, and that was the *only* + // hook that called `process.exit(0)` on SIGINT. The leftover HMR + // hook overrides Node's default "exit on SIGINT" behaviour, so the + // dev server would idle in the foreground forever. The fix + // registers the token cleanup unconditionally; here we make + // persist throw and verify SIGINT still terminates. + if (typeof process.getuid === "function" && process.getuid() === 0) { + // Root bypasses chmod permission checks; skip on root containers. 
+ return; + } + chmodSync(join(fakeHome, ".arkor"), 0o555); + const stdoutSpy = vi + .spyOn(process.stdout, "write") + .mockImplementation((() => true) as typeof process.stdout.write); + try { + await runDev({ port: 4206 }); + } finally { + stdoutSpy.mockRestore(); + chmodSync(join(fakeHome, ".arkor"), 0o755); + } + + const exitSpy = vi + .spyOn(process, "exit") + .mockImplementation(((_code?: number) => { + return undefined as never; + }) as typeof process.exit); + try { + const sigintListeners = process.listeners("SIGINT"); + const handler = sigintListeners[sigintListeners.length - 1] as () => void; + handler(); + // Even though the token file was never written, the cleanup hook + // ran (best-effort `unlinkSync` swallows ENOENT) and the + // exit-on-signal arm fired (after async cleanup tails settle). + await flushMicrotasks(); + // SIGINT exits 130 (POSIX 128 + signo for SIGINT=2): see + // SIGNAL_EXIT_CODE in cleanupHooks.ts. Parent shells need + // the nonzero code to distinguish interrupt from clean exit. + expect(exitSpy).toHaveBeenCalledWith(130); + } finally { + exitSpy.mockRestore(); + } + }); + + it("does NOT unlink a pre-existing token file when this process failed to persist its own token (concurrent arkor dev safety)", async () => { + // Regression: a failed-persist `arkor dev` used to unconditionally + // `unlinkSync(studioTokenPath())` on shutdown. If a concurrent + // `arkor dev` (different port, same user) had already persisted a + // valid token to the shared path, this run's cleanup would wipe + // it out from under them, breaking that session's Vite SPA dev + // workflow with mystery 403s on /api/*. The fix gates the unlink + // on `tokenPersisted` so a failed-persist run is a no-op at + // shutdown. + if (typeof process.getuid === "function" && process.getuid() === 0) { + // Root bypasses chmod permission checks: skip on root containers. + return; + } + // Pre-place a "concurrent" token (the other dev session's). 
Body + // content lets us assert byte-equality after cleanup, not just + // file existence, to rule out an unlink+recreate cycle. + const path = studioTokenPath(); + writeFileSync(path, "concurrent-token-value", { mode: 0o600 }); + // Make the FILE unwritable so persistStudioToken's `writeFile` + // throws EACCES, but leave the *directory* writable so unlinkSync + // (which requires dir-write, not file-write perms) would happily + // delete the file if the cleanup hook weren't gated. + chmodSync(path, 0o444); + + const stdoutSpy = vi + .spyOn(process.stdout, "write") + .mockImplementation((() => true) as typeof process.stdout.write); + try { + await expect(runDev({ port: 4207 })).resolves.toBeUndefined(); + } finally { + stdoutSpy.mockRestore(); + } + + const exitSpy = vi + .spyOn(process, "exit") + .mockImplementation(((_code?: number) => { + return undefined as never; + }) as typeof process.exit); + try { + // Restore read perms so we can `readFileSync` to verify content. + chmodSync(path, 0o644); + const sigintListeners = process.listeners("SIGINT"); + const handler = sigintListeners[sigintListeners.length - 1] as () => void; + handler(); + await flushMicrotasks(); + // The pre-existing token is still on disk AND unchanged: this + // failed-persist run did not wipe it. + expect(existsSync(path)).toBe(true); + expect(readFileSync(path, "utf8")).toBe("concurrent-token-value"); + } finally { + exitSpy.mockRestore(); + } + }); + + it("does NOT unlink the studio-token when a concurrent arkor dev has overwritten it after our successful persist (token-identity check)", async () => { + // Regression: even when this process SUCCESSFULLY persisted the + // token, the cleanup hook used to `unlinkSync` unconditionally on + // shutdown. 
If a second `arkor dev` launched in the same `$HOME` + // overwrote `~/.arkor/studio-token` with ITS token AFTER our + // persist, our cleanup would still wipe the file: the second + // session's Vite SPA dev workflow would then see mystery 403s on + // /api/* because the meta tag the SPA reads no longer matches + // any in-memory token. The fix re-reads the file at exit time + // and only unlinks when the bytes match what we wrote. + const stdoutSpy = vi + .spyOn(process.stdout, "write") + .mockImplementation((() => true) as typeof process.stdout.write); + try { + await runDev({ port: 4208 }); + } finally { + stdoutSpy.mockRestore(); + } + // Sanity: our persist landed. + const path = studioTokenPath(); + expect(existsSync(path)).toBe(true); + const ourToken = readFileSync(path, "utf8").trim(); + expect(ourToken).toMatch(/^[A-Za-z0-9_-]+$/); + // Simulate the concurrent overwrite: a second `arkor dev` wrote + // its own token to the same shared path while we were running. + const concurrentToken = "concurrent-dev-token-XYZ"; + writeFileSync(path, concurrentToken, { mode: 0o600 }); + + const exitSpy = vi + .spyOn(process, "exit") + .mockImplementation(((_code?: number) => { + return undefined as never; + }) as typeof process.exit); + try { + const sigintListeners = process.listeners("SIGINT"); + const handler = sigintListeners[sigintListeners.length - 1] as () => void; + handler(); + await flushMicrotasks(); + // Under the bug the file would be gone. With the fix the + // concurrent token is still in place AND unchanged so the + // sibling `arkor dev` keeps working. 
+ expect(existsSync(path)).toBe(true); + expect(readFileSync(path, "utf8")).toBe(concurrentToken); } finally { exitSpy.mockRestore(); } diff --git a/packages/arkor/src/cli/commands/dev.ts b/packages/arkor/src/cli/commands/dev.ts index e2bf01cf..ac24eb4b 100644 --- a/packages/arkor/src/cli/commands/dev.ts +++ b/packages/arkor/src/cli/commands/dev.ts @@ -1,5 +1,5 @@ -import { randomBytes } from "node:crypto"; -import { unlinkSync } from "node:fs"; +import { randomBytes, timingSafeEqual } from "node:crypto"; +import { readFileSync, unlinkSync } from "node:fs"; import { chmod, mkdir, writeFile } from "node:fs/promises"; import { dirname } from "node:path"; import { serve } from "@hono/node-server"; @@ -16,7 +16,9 @@ import { type AnonymousCredentials, } from "../../core/credentials"; import { buildStudioApp } from "../../studio/server"; +import { createHmrCoordinator } from "../../studio/hmr"; import { ANON_PERSISTENCE_NUDGE } from "../anonymous"; +import { registerCleanupHook } from "../cleanupHooks"; import { ui } from "../prompts"; export interface DevOptions { @@ -116,7 +118,7 @@ export async function ensureCredentialsForStudio(): Promise { // wrap fires only for genuine deployment rejection (401/403/404 et // al). 5xx is a transient cloud-api failure where retrying makes // sense, ZodErrors signal a malformed response (server bug), and fs - // failures are out of scope for the anon endpoint entirely — none of + // failures are out of scope for the anon endpoint entirely; none of // these should be mislabelled as a sign-in requirement. 
if ( err instanceof AnonymousTokenRejectedError && @@ -124,7 +126,7 @@ export async function ensureCredentialsForStudio(): Promise { err.status < 500 && oauthAvailable ) { - // Surface only the status code at the top level — the inner + // Surface only the status code at the top level: the inner // `err.message` already starts with "Failed to acquire…" and // includes the response-body snippet, which would double-prefix the // wrap and risk leaking noisy HTML/JSON error pages. The full @@ -170,24 +172,81 @@ async function persistStudioToken(token: string): Promise { return path; } -function scheduleStudioTokenCleanup(path: string): void { - let cleaned = false; - const cleanup = () => { - if (cleaned) return; - cleaned = true; - try { - unlinkSync(path); - } catch { - // best-effort - } - }; - process.on("exit", cleanup); - for (const sig of ["SIGINT", "SIGTERM", "SIGHUP"] as const) { - process.on(sig, () => { - cleanup(); - process.exit(0); - }); - } +/** + * Constant-time string comparison for the token-identity check below. + * The "is this my token?" gate is not strictly a security-sensitive + * comparison (both sides are owned by the user on the local FS), but + * the SDK already uses `timingSafeEqual` for every other studio-token + * comparison (`buildStudioApp`), and keeping the same primitive here + * costs nothing while making the policy "tokens are always compared + * constant-time" uniform across the codebase. + */ +function tokensEqual(a: string, b: string): boolean { + const aBuf = Buffer.from(a); + const bBuf = Buffer.from(b); + if (aBuf.length !== bBuf.length) return false; + return timingSafeEqual(aBuf, bBuf); +} + +function scheduleStudioTokenCleanup( + path: string, + // Read at cleanup time so a `persistStudioToken` call that's still + // in flight when the user hits Ctrl-C (or one that resolved + // successfully *after* this scheduler ran) has its outcome + // respected. 
A plain boolean parameter would be captured at hook + // registration time, well before persist resolves. + shouldUnlink: () => boolean, + // Token THIS process wrote. Compared against the file's current + // contents at unlink time so we never delete a token a concurrent + // `arkor dev` overwrote in the shared path. See cleanup body for + // the full rationale. + expectedToken: string, +): void { + registerCleanupHook({ + cleanup: () => { + // Skip the unlink entirely if THIS process never persisted the + // file. Without this gate, a failed-persist `arkor dev` would + // happily `unlinkSync` on shutdown, and if a concurrent + // `arkor dev` process (different port, same user) had persisted + // a valid token to the same shared path, our cleanup would + // wipe it out from under them, breaking that session's Vite + // SPA dev workflow with mystery 403s on /api/*. + if (!shouldUnlink()) return; + // Token-identity check: even when this process DID persist a + // token, another `arkor dev` launched in the same `$HOME` may + // have overwritten the shared `~/.arkor/studio-token` path + // BEFORE our shutdown. Unlinking unconditionally would then + // delete THEIR valid token, breaking their Vite SPA dev + // workflow. Re-read at exit time and only unlink when the + // bytes still match what we wrote so the cleanup is a no-op + // for foreign tokens. Read failure (ENOENT etc.) means the + // file is already gone, which is fine; the unlink would have + // been a no-op anyway. + let current: string; + try { + current = readFileSync(path, "utf8").trim(); + } catch { + return; + } + if (!tokensEqual(current, expectedToken)) return; + try { + unlinkSync(path); + } catch { + // best-effort + } + }, + // Outermost cleanup: responsible for terminating the process after + // all earlier-registered hooks (e.g. HMR dispose) have run. 
+ exitOnSignal: true, + }); +} + +function scheduleHmrCleanup(hmr: { dispose: () => Promise }): void { + // Registered before the studio-token cleanup so it runs first on + // shutdown: Node fires signal handlers in registration order, and we + // want the watcher to release file handles before the outermost + // process.exit. + registerCleanupHook({ cleanup: () => hmr.dispose() }); } export async function runDev(options: DevOptions = {}): Promise { @@ -199,16 +258,59 @@ export async function runDev(options: DevOptions = {}): Promise { // hitting `arkor start` (and therefore RCE via dynamic import). const studioToken = randomBytes(32).toString("base64url"); + // HMR coordinator: a long-lived rolldown watcher over the user's + // `src/arkor` graph. The coordinator itself is lazy (`subscribe()` + // is what starts the watcher, not `createHmrCoordinator`), but + // `buildStudioApp` registers its per-rebuild signal-dispatch + // subscriber unconditionally: that subscriber needs to run on + // every BUNDLE_END regardless of whether any SSE client is + // connected, so it can SIGUSR2/SIGTERM active `/api/train` + // children and keep `lastSuccessConfigHash` warm for spawn-time + // capture. Net effect: the watcher starts at server boot. An + // `arkor dev` launched in an unbuilt project doesn't fail immediately + // because `startWatcher` falls through to a poll loop that waits + // for the entry file to appear (see `hmr.ts:entryWaitTimer`). + // + // Registered before the studio-token cleanup so the latter remains + // the most-recently-attached signal listener (existing tests rely + // on this ordering to find the token-removal handler). + const hmr = createHmrCoordinator({ cwd: process.cwd() }); + scheduleHmrCleanup(hmr); + + // Register the studio-token cleanup *unconditionally* up-front. 
The hook + // is the only one that calls `process.exit(0)` on SIGINT/SIGTERM/SIGHUP + // (the HMR hook above only disposes), and `registerCleanupHook` overrides + // Node's default "exit on signal" behaviour for any signal it listens + // on. If we were to gate registration behind a successful + // `persistStudioToken` and the persist threw, Ctrl-C would run the HMR + // dispose and then leave the server idle in the foreground: no exit + // ever fires. + // + // The cleanup body itself, however, gates `unlinkSync` on TWO checks: + // - `tokenPersisted` (set only after `persistStudioToken` resolves) + // so a failed-persist run never touches the shared file. + // - token-identity match (re-read the file at exit time, compare + // against the bytes WE wrote) so a successful-persist run that + // was later overwritten by a concurrent `arkor dev` in the same + // `$HOME` still leaves THAT instance's token in place. Without + // this second check, the later instance would see mystery 403s + // on /api/* because we'd have wiped its valid token. + // All three protections together: hook is always registered (so + // exits behave), and only deletes a file we wrote AND still own. + const tokenPath = studioTokenPath(); + let tokenPersisted = false; + scheduleStudioTokenCleanup(tokenPath, () => tokenPersisted, studioToken); + // Persisting the token to disk is *only* needed for the Vite SPA dev // workflow. The bundled `:port` flow injects the meta tag at request time // via `buildStudioApp`, so a failure here (read-only $HOME on Docker / // locked-down CI / restrictive umask) must not block the server. try { - const tokenPath = await persistStudioToken(studioToken); - scheduleStudioTokenCleanup(tokenPath); + await persistStudioToken(studioToken); + tokenPersisted = true; } catch (err) { ui.log.warn( - `Could not write ${studioTokenPath()} (${ + `Could not write ${tokenPath} (${ err instanceof Error ? err.message : String(err) }). 
The Studio at http://localhost:${port} is unaffected, but the Vite SPA dev workflow will see 403s on /api/*.`, ); @@ -217,9 +319,9 @@ export async function runDev(options: DevOptions = {}): Promise { // `autoAnonymous: true` (the default) lets the Hono server retry the // anonymous bootstrap on first `/api/credentials` hit if the up-front // attempt above failed (e.g. cloud-api was unreachable at launch). - const app = buildStudioApp({ studioToken }); + const app = buildStudioApp({ studioToken, hmr }); // Bind to 127.0.0.1 (not "localhost") so the listener can't end up on `::1` - // only — `@hono/node-server` passes hostname to `net.Server.listen`, which + // only; `@hono/node-server` passes hostname to `net.Server.listen`, which // calls `dns.lookup`. On hosts where `/etc/hosts` orders `::1 localhost` // before `127.0.0.1 localhost`, a "localhost" bind would refuse IPv4 // connections, breaking the studio-app Vite proxy (hardcoded to @@ -229,6 +331,13 @@ export async function runDev(options: DevOptions = {}): Promise { const url = `http://localhost:${port}`; serve({ fetch: app.fetch, port, hostname: "127.0.0.1" }); process.stdout.write(`Arkor Studio running on ${url}\n`); + // "ready (will watch …)" rather than "enabled (watching …)" because + // `createHmrCoordinator` is lazy: the rolldown watcher doesn't + // actually start until the first `subscribe()` call inside + // `buildStudioApp`, and on a fresh scaffold with no + // `src/arkor/index.ts` yet the watcher falls into the + // entry-wait poll loop rather than actively watching. 
+ process.stdout.write(`HMR ready (will watch src/arkor)\n`); if (options.open) { try { await open(url); diff --git a/packages/arkor/src/cli/commands/start.test.ts b/packages/arkor/src/cli/commands/start.test.ts index 8209818b..a08d70f4 100644 --- a/packages/arkor/src/cli/commands/start.test.ts +++ b/packages/arkor/src/cli/commands/start.test.ts @@ -78,7 +78,7 @@ describe("runStart", () => { it("skips the build step when the artifact already exists and no entry override is given", async () => { // Branch coverage for `Boolean(opts.entry) || !existsSync(outFile)` — // the path where both halves are false. Pre-build the artifact, then - // confirm runStart imports it without triggering esbuild again. + // confirm runStart imports it without triggering rolldown again. mkdirSync(join(cwd, "src/arkor"), { recursive: true }); writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST); // First call builds normally. diff --git a/packages/arkor/src/core/configHash.test.ts b/packages/arkor/src/core/configHash.test.ts new file mode 100644 index 00000000..ec681124 --- /dev/null +++ b/packages/arkor/src/core/configHash.test.ts @@ -0,0 +1,213 @@ +import { describe, it, expect } from "vitest"; +import { hashJobConfig } from "./configHash"; +import type { JobConfig } from "./types"; + +describe("hashJobConfig", () => { + it("returns the same hash for key-order-equivalent configs", () => { + const a: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + maxSteps: 10, + learningRate: 1e-4, + }; + const b: JobConfig = { + learningRate: 1e-4, + maxSteps: 10, + datasetSource: { name: "x", type: "huggingface" }, + model: "m", + } as JobConfig; + expect(hashJobConfig(a)).toBe(hashJobConfig(b)); + }); + + it("returns different hashes for materially different configs", () => { + const base: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + }; + expect(hashJobConfig(base)).not.toBe( + hashJobConfig({ ...base, model: "m2" }), 
+ ); + expect(hashJobConfig(base)).not.toBe( + hashJobConfig({ + ...base, + datasetSource: { type: "huggingface", name: "y" }, + }), + ); + }); + + it("is order-stable for nested arrays (dataset format / split)", () => { + const a: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + datasetFormat: ["a", "b", "c"], + }; + const b: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + datasetFormat: ["a", "b", "c"], + }; + expect(hashJobConfig(a)).toBe(hashJobConfig(b)); + }); + + it("treats `undefined` object properties identically to omitted ones (JSON parity)", () => { + // Regression: the previous `stableStringify` delegated to + // `JSON.stringify(undefined)` which returns `undefined` (not a + // string), concatenated via template literal that became the + // substring `"undefined"` in the hash input. So `{ a: 1 }` and + // `{ a: 1, b: undefined }` produced different hashes even though + // they're indistinguishable on the wire (`JSON.stringify` drops + // `undefined` properties). + const omitted: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + }; + const explicitlyUndefined: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + // `unknown`-typed forwarder fields can legitimately end up + // holding `undefined` if a caller spreads from a partial source. + warmupSteps: undefined, + datasetFormat: undefined, + }; + expect(hashJobConfig(omitted)).toBe(hashJobConfig(explicitlyUndefined)); + }); + + it("normalises `undefined` array slots to null (JSON parity)", () => { + // `JSON.stringify([undefined])` → `"[null]"`. The previous + // implementation produced the literal substring `"[undefined]"` + // instead, which is not even valid JSON. 
+ const a: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + datasetFormat: ["a", undefined, "c"] as unknown, + }; + const b: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + datasetFormat: ["a", null, "c"] as unknown, + }; + expect(hashJobConfig(a)).toBe(hashJobConfig(b)); + }); + + it("honors `toJSON()` like JSON.stringify (Date, etc.)", () => { + // Regression: `JSON.stringify({ d: new Date(0) })` serialises + // `d` as `"1970-01-01T00:00:00.000Z"`, but a naive recursive + // walker would serialise the Date as `{}` (no enumerable own + // keys). A `JobConfig` whose `unknown`-typed forwarder field + // ever holds a Date (or any object with `toJSON`) would then + // produce a hash that disagrees with the wire-format payload, + // causing spurious "configHash changed" → SIGTERM restarts. + const date = new Date("2024-01-01T00:00:00.000Z"); + const a: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + warmupSteps: date as unknown, + }; + const b: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + warmupSteps: "2024-01-01T00:00:00.000Z" as unknown, + }; + expect(hashJobConfig(a)).toBe(hashJobConfig(b)); + }); + + it("threads the property key through to user-defined `toJSON(key)` (JSON parity)", () => { + // Regression: `JSON.stringify` calls `value.toJSON(key)` with + // the hosting property name (or array index as string), so a + // `toJSON` that branches on the key produces different output + // depending on where the value lives in the tree. The previous + // `stableStringify` called `toJSON()` without the key argument, + // so the hash diverged from the wire-format payload for any + // user object whose serialiser depends on context. + // + // The fixture's `toJSON(key)` returns `"key=<key>"`. Compare + // against an explicit string field holding what JSON.stringify + // would produce; matching hashes prove the key reached toJSON.
+ const ctx = { + toJSON(key: string) { + return `key=${key}`; + }, + }; + const a: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + warmupSteps: ctx as unknown, + }; + const b: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + warmupSteps: "key=warmupSteps" as unknown, + }; + expect(hashJobConfig(a)).toBe(hashJobConfig(b)); + }); + + it("omits an object property whose `toJSON(key)` returns undefined (JSON parity)", () => { + // Regression: `JSON.stringify({ a: { toJSON: () => undefined } })` + // produces `"{}"`: `toJSON` returning `undefined` is the spec's + // "skip me" signal in object position. The previous + // `stableStringify` collapsed every non-representable value to + // the literal string `"null"` at recursion time, so the same + // input hashed as `{"a":null}` instead of `{}`. That divergence + // forced unnecessary SIGTERM restarts whenever a `JobConfig` + // field's serialiser opted out: `configHash` would diverge from + // the wire-format payload (which DOES omit the field). + const omitting = { + toJSON() { + return undefined; + }, + }; + const a: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + warmupSteps: omitting as unknown, + }; + const b: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + }; + expect(hashJobConfig(a)).toBe(hashJobConfig(b)); + }); + + it("substitutes `null` for an array element whose `toJSON(idx)` returns undefined (JSON parity)", () => { + // Sibling contract: in array position, `JSON.stringify` writes + // `null` for a `toJSON()→undefined` element (it can't drop the + // slot without shifting indices). The `stableStringify` boundary + // for arrays maps the omit sentinel to `"null"`. 
+ const omitting = { + toJSON() { + return undefined; + }, + }; + const a: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + datasetFormat: ["a", omitting, "c"] as unknown, + }; + const b: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + datasetFormat: ["a", null, "c"] as unknown, + }; + expect(hashJobConfig(a)).toBe(hashJobConfig(b)); + }); + + it("ignores function / symbol properties (JSON parity)", () => { + // `JSON.stringify` drops these too. The hash should be insensitive + // to "transparent" callbacks accidentally landing in a forwarded + // config (the SDK separates `callbacks` out, but `unknown` fields + // could leak one). + const fn = () => 0; + const sym = Symbol("foo"); + const a: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + }; + const b: JobConfig = { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + warmupSteps: fn as unknown, + loggingSteps: sym as unknown, + }; + expect(hashJobConfig(a)).toBe(hashJobConfig(b)); + }); +}); diff --git a/packages/arkor/src/core/configHash.ts b/packages/arkor/src/core/configHash.ts new file mode 100644 index 00000000..2e407094 --- /dev/null +++ b/packages/arkor/src/core/configHash.ts @@ -0,0 +1,91 @@ +import { createHash } from "node:crypto"; +import type { JobConfig } from "./types"; + +/** + * Deterministic JSON serialiser: keys sorted at every nesting level so + * `{a:1, b:2}` and `{b:2, a:1}` produce the same string. Necessary because + * `JSON.stringify` follows insertion order, which isn't stable across + * `buildJobConfig` revisions or user-side spread-merge tricks. + * + * Returns `string | undefined`. `undefined` is the "omit me from my + * containing object" sentinel: it propagates from any value + * `JSON.stringify` would silently drop in object position + * (`undefined`, functions, symbols, *and* objects whose `toJSON(key)` + * returns one of those). 
Callers sit at three boundaries: + * + * - Top level: `hashJobConfig` collapses `undefined` to `"null"` + * so the digest input stays a valid hash string. + * - Array slots: the map below substitutes `"null"` (matches + * `JSON.stringify([undefined]) === "[null]"`). + * - Object slots: the loop filters the key out entirely (matches + * `JSON.stringify({a: undefined}) === "{}"`). + * + * The previous implementation collapsed every non-representable to + * the literal string `"null"` at recursion time, which leaked into + * object slots as `{"a":null}` instead of the JSON-correct `{}`, + * making `configHash` diverge from the wire-format payload for + * `JobConfig` fields whose `toJSON(key)` happened to return + * `undefined` (the spec-defined "skip me" signal). That divergence + * forces unnecessary SIGTERM restarts on every rebuild. + */ +function stableStringify(value: unknown, key: string = ""): string | undefined { + if (value === null) return "null"; + // Non-representable values: omit (undefined return) so each caller's + // boundary handler chooses the right substitution per its position. + if (value === undefined || typeof value === "function" || typeof value === "symbol") { + return undefined; + } + if (typeof value !== "object") return JSON.stringify(value); + // `JSON.stringify` calls `value.toJSON(key)` first when present + // (passing `""` at the top level, the property name in object + // positions, the index-as-string in array positions), then + // serialises the return value. Canonical example: `Date` → ISO + // string. The `key` argument is threaded through recursion so + // user-side `toJSON(key)` implementations that branch on the + // hosting property/index see the same value JSON.stringify would. + // If `toJSON` returns `undefined`, that propagates as the omit + // sentinel: the spec-defined "skip me" path. 
+ const maybeToJSON = (value as { toJSON?: unknown }).toJSON; + if (typeof maybeToJSON === "function") { + return stableStringify( + (maybeToJSON as (key: string) => unknown).call(value, key), + key, + ); + } + if (Array.isArray(value)) { + // Array slots: non-representable → "null" (matches JSON spec). + // Index-as-string keys mirror `JSON.stringify`'s behaviour for + // array elements (per the ECMAScript spec, `SerializeJSONArray` + // calls `SerializeJSONProperty` with the index converted to a + // string). + const items = value.map((v, i) => stableStringify(v, String(i)) ?? "null"); + return `[${items.join(",")}]`; + } + // Object slots: skip keys whose serialised value is `undefined` + // (matches `JSON.stringify({a: undefined}) === "{}"`). Property + // names are passed as the recursion key so a nested `toJSON(key)` + // sees the hosting field name. + const obj = value as Record<string, unknown>; + const parts: string[] = []; + for (const k of Object.keys(obj).sort()) { + const serialised = stableStringify(obj[k], k); + if (serialised === undefined) continue; + parts.push(`${JSON.stringify(k)}:${serialised}`); + } + return `{${parts.join(",")}}`; +} + +/** + * Stable fingerprint of a `JobConfig`. Used by HMR to decide whether a + * rebuild changed only the in-process callbacks (configHash unchanged → + * hot-swap) or the cloud-side training config (configHash changed → + * full restart with `requestEarlyStop`). + */ +export function hashJobConfig(config: JobConfig): string { + // Top-level fallback to `"null"` so a pathological config that + // serialises to `undefined` (top-level `toJSON` returning + // undefined, etc.) still produces a deterministic digest input + // rather than crashing `createHash.update(undefined)`. + const serialised = stableStringify(config) ??
"null"; + return createHash("sha256").update(serialised).digest("hex").slice(0, 16); +} diff --git a/packages/arkor/src/core/moduleCacheBust.test.ts b/packages/arkor/src/core/moduleCacheBust.test.ts new file mode 100644 index 00000000..40b8509a --- /dev/null +++ b/packages/arkor/src/core/moduleCacheBust.test.ts @@ -0,0 +1,66 @@ +import { describe, it, expect, beforeEach, afterEach } from "vitest"; +import { mkdtempSync, rmSync, writeFileSync } from "node:fs"; +import { tmpdir } from "node:os"; +import { join } from "node:path"; +import { pathToFileURL } from "node:url"; +import { + moduleCacheBustKey, + moduleCacheBustUrl, +} from "./moduleCacheBust"; + +let dir: string; + +beforeEach(() => { + dir = mkdtempSync(join(tmpdir(), "arkor-cachebust-test-")); +}); + +afterEach(() => { + rmSync(dir, { recursive: true, force: true }); +}); + +describe("moduleCacheBustKey", () => { + it("is stable across calls when the file hasn't changed", () => { + // Regression: Node's ESM loader never evicts module records, and + // a `Date.now()` cache-bust would produce a fresh URL on every + // call → unbounded leak across long `arkor dev` sessions + // (5 s `/api/manifest` polls + every save firing SIGUSR2). + // mtime+ctime+size keying must collapse repeat reads of unchanged + // bytes onto the same key so the loader serves from cache. + const file = join(dir, "stable.mjs"); + writeFileSync(file, "export const v = 1;"); + const k1 = moduleCacheBustKey(file); + const k2 = moduleCacheBustKey(file); + expect(k1).toBe(k2); + // mtimeMs-ctimeMs-size; mtimeMs/ctimeMs may carry sub-ms precision + // (no `toFixed(0)`) so digits include an optional fractional part. 
+ expect(k1).toMatch(/^[\d.]+-[\d.]+-\d+$/); + }); + + it("changes when the file content changes (different size)", () => { + const file = join(dir, "growing.mjs"); + writeFileSync(file, "v1"); + const before = moduleCacheBustKey(file); + writeFileSync(file, "version-two"); + const after = moduleCacheBustKey(file); + expect(after).not.toBe(before); + }); + + it("returns a stable fallback (\"0-0-0\") for missing files instead of throwing", () => { + // The eventual `await import(url)` will throw on a missing + // file; the helper itself should produce a value rather than + // bubbling the stat error and turning every consumer into a + // try/catch site. Three zeros (one each for mtimeMs, ctimeMs, + // size) to keep the shape uniform with the success branch. + expect(moduleCacheBustKey(join(dir, "does-not-exist.mjs"))).toBe("0-0-0"); + }); +}); + +describe("moduleCacheBustUrl", () => { + it("returns a fully-qualified file URL with the cache-bust query attached", () => { + const file = join(dir, "u.mjs"); + writeFileSync(file, "export const x = 1;"); + const url = moduleCacheBustUrl(file); + expect(url.startsWith(pathToFileURL(file).href + "?t=")).toBe(true); + expect(url).toMatch(/\?t=[\d.]+-[\d.]+-\d+$/); + }); +}); diff --git a/packages/arkor/src/core/moduleCacheBust.ts b/packages/arkor/src/core/moduleCacheBust.ts new file mode 100644 index 00000000..22f160a5 --- /dev/null +++ b/packages/arkor/src/core/moduleCacheBust.ts @@ -0,0 +1,51 @@ +import { statSync } from "node:fs"; +import { pathToFileURL } from "node:url"; + +/** + * Build a content-derived cache-bust query for `await import(url + "?t=" + key)`. + * + * Why this matters: Node's ESM loader caches every dynamically-imported + * URL for the lifetime of the process and exposes no API to evict a + * record. 
A naive `?t=Date.now()` cache-bust produces a fresh URL on + * every call, so a long-running `arkor dev` session (where the SPA + * polls `/api/manifest` every few seconds and every save fires + * `BUNDLE_END` + SIGUSR2) accumulates one module record per call, + * unbounded. + * + * Keying on `mtimeMs + ctimeMs + size` collapses repeated reads of the + * same bytes onto the same URL, which Node's loader then serves from + * its existing cache record. The leak shrinks from "one entry per + * call" to "one entry per actual file change", which is the tightest + * bound we can offer without spawning a child process per import. + * + * `mtimeMs` is kept at full sub-millisecond precision (no rounding): + * a previous `toFixed(0)` collapsed two distinct edits that landed in + * the same millisecond and produced an identically-sized output onto + * the same key, which made Node's loader return the *stale* module + * for the second edit (HMR/manifest staleness on fast filesystems). + * `ctimeMs` is included as belt-and-braces against the (rare) case + * where mtime collides but ctime moves: `touch -m` and some build + * tools update one without the other. + * + * Falls back to a stable literal on stat failure so the eventual + * `import()` (which will throw on a missing file) gets to surface its + * own clean error rather than us inventing a noisy timestamp here. + */ +export function moduleCacheBustKey(filePath: string): string { + try { + const s = statSync(filePath); + return `${s.mtimeMs}-${s.ctimeMs}-${s.size}`; + } catch { + return "0-0-0"; + } +} + +/** + * Convenience: full file URL with the cache-bust key already + * appended. The `as const`-style template is small enough to inline + * but doing it in one place keeps the URL shape uniform across the + * three callers (`hmr.ts`, `manifest.ts`, `runnerSignals.ts`). 
+ */ +export function moduleCacheBustUrl(filePath: string): string { + return `${pathToFileURL(filePath).href}?t=${moduleCacheBustKey(filePath)}`; +} diff --git a/packages/arkor/src/core/projectState.test.ts b/packages/arkor/src/core/projectState.test.ts index 8d36515a..73d6b706 100644 --- a/packages/arkor/src/core/projectState.test.ts +++ b/packages/arkor/src/core/projectState.test.ts @@ -37,7 +37,7 @@ function fakeClient( // Construct a real CloudApiClient (so type-compatibility holds), then // monkey-patch only the methods exercised by ensureProjectState. The // other methods would throw on first use because no fetcher is wired, - // which is fine — projectState should never reach them. + // which is fine; projectState should never reach them. const client = new CloudApiClient({ baseUrl: "http://mock", credentials: anonCreds, @@ -84,7 +84,7 @@ describe("ensureProjectState", () => { expect(createProject).not.toHaveBeenCalled(); }); - it("throws for auth0 callers without state — they must write .arkor/state.json by hand", async () => { + it("throws for auth0 callers without state: they must write .arkor/state.json by hand", async () => { const client = fakeClient(); await expect( ensureProjectState({ cwd, client, credentials: auth0Creds }), @@ -116,7 +116,7 @@ describe("ensureProjectState", () => { expect(createProject).toHaveBeenCalledWith({ orgSlug: "anon-abc", name: expect.stringMatching(/^my-app/), - // Sanitised slug — basename starts with "my-app-", and we + // Sanitised slug: basename starts with "my-app-", and we // expect the sanitiser to keep dashes. 
slug: expect.stringMatching(/^my-app/), }); diff --git a/packages/arkor/src/core/rolldownConfig.ts b/packages/arkor/src/core/rolldownConfig.ts new file mode 100644 index 00000000..66e87c29 --- /dev/null +++ b/packages/arkor/src/core/rolldownConfig.ts @@ -0,0 +1,86 @@ +import { isAbsolute, resolve } from "node:path"; +import type { InputOptions } from "rolldown"; + +const DEFAULT_ENTRY = "src/arkor/index.ts"; +const DEFAULT_OUT_DIR = ".arkor/build"; + +export interface BuildEntryOptions { + /** Source entry path; defaults to `src/arkor/index.ts`. */ + entry?: string; + /** Output directory; defaults to `.arkor/build`. */ + outDir?: string; + /** Project root; defaults to `process.cwd()`. */ + cwd?: string; +} + +export interface ResolvedBuildEntry { + /** Project root (absolute). */ + cwd: string; + /** Entry source file (absolute). */ + entry: string; + /** Output directory (absolute). */ + outDir: string; + /** Output bundle (absolute, always `<outDir>/index.mjs`). */ + outFile: string; +} + +/** Resolve `cwd` / `entry` / `outDir` to absolute paths with the standard defaults. */ +export function resolveBuildEntry(opts: BuildEntryOptions): ResolvedBuildEntry { + const cwd = opts.cwd ?? process.cwd(); + const entryRel = opts.entry ?? DEFAULT_ENTRY; + const entry = isAbsolute(entryRel) ? entryRel : resolve(cwd, entryRel); + const outDirRel = opts.outDir ?? DEFAULT_OUT_DIR; + const outDir = isAbsolute(outDirRel) ? outDirRel : resolve(cwd, outDirRel); + const outFile = resolve(outDir, "index.mjs"); + return { cwd, entry, outDir, outFile }; +} + +/** + * `node<major>.<minor>` derived from the running Node binary. Build host and + * run host are effectively the same process (Studio spawns `arkor start` with + * `process.execPath`), so the bundle can target precisely what will execute it. + */ +export function resolveNodeTarget(): string { + // Fallback aligns with the published `engines.node` floor; see + // [packages/arkor/package.json] / `AGENTS.md`'s "Node version" note.
+ const [major = "22", minor = "22"] = process.versions.node.split("."); + return `node${major}.${minor}`; +} + +/** + * Build the shared rolldown options object used by both `runBuild` (one-shot) + * and the HMR coordinator (`watch()`). Centralising the configuration here + * keeps the two pipelines aligned: anything that affects the bundle shape + * (external resolution, transform target, platform) is set in one place so + * the artifact a watcher writes is byte-equivalent to a one-shot rebuild. + */ +export function rolldownInputOptions( + resolved: Pick<ResolvedBuildEntry, "entry" | "cwd">, +): InputOptions { + return { + input: resolved.entry, + cwd: resolved.cwd, + platform: "node", + logLevel: "warn", + transform: { target: resolveNodeTarget() }, + // Mirror esbuild's `packages: "external"`: any specifier that isn't a + // relative or absolute path stays external. `node:`-prefixed builtins + // are already handled by `platform: "node"`; the explicit allow below + // is a safety net in case the builtin set drifts. + external: (id, _importer, isResolved) => { + if (isResolved) return false; + if (id.startsWith(".")) return false; + if (isAbsolute(id)) return false; + return true; + }, + }; +} + +/** + * Re-exported defaults so consumers (like error messages) can name the same + * paths we resolve internally.
+ */ +export const BUILD_DEFAULTS = { + entry: DEFAULT_ENTRY, + outDir: DEFAULT_OUT_DIR, +} as const; diff --git a/packages/arkor/src/core/runner.test.ts b/packages/arkor/src/core/runner.test.ts index 89f29249..cdabfb15 100644 --- a/packages/arkor/src/core/runner.test.ts +++ b/packages/arkor/src/core/runner.test.ts @@ -1,4 +1,4 @@ -import { describe, it, expect, afterEach, beforeEach } from "vitest"; +import { describe, it, expect, afterEach, beforeEach, vi } from "vitest"; import { mkdtempSync, rmSync, writeFileSync, mkdirSync } from "node:fs"; import { tmpdir } from "node:os"; import { join } from "node:path"; @@ -49,7 +49,7 @@ afterEach(() => { rmSync(cwd, { recursive: true, force: true }); }); -describe("runTrainer — entry extraction", () => { +describe("runTrainer: entry extraction", () => { it("throws when the entry file does not exist", async () => { await expect(runTrainer("missing.ts")).rejects.toThrow( /Training entry not found/, @@ -124,7 +124,7 @@ describe("runTrainer — entry extraction", () => { }); it("throws when default export is a primitive (typeof !== 'object' branch)", async () => { - // The second half of `mod.default && typeof mod.default === "object"` — + // The second half of `mod.default && typeof mod.default === "object"`: // a primitive default like `42` or `"foo"` must short-circuit out of // the nested-trainer probe. const entry = join(cwd, "primitive-default.mjs"); @@ -135,7 +135,7 @@ describe("runTrainer — entry extraction", () => { }); it("accepts a default export wrapping a `trainer` field (legacy power-user shape)", async () => { - // Hits the `if (isTrainer(nested)) return nested` branch — the only + // Hits the `if (isTrainer(nested)) return nested` branch: the only // place line 38 is reachable. 
const entry = join(cwd, "default-with-trainer.mjs"); writeFileSync( @@ -154,7 +154,7 @@ describe("runTrainer — entry extraction", () => { it("falls back to DEFAULT_ENTRY (src/arkor/index.ts) when called with no argument", async () => { // Branch coverage for `file ?? DEFAULT_ENTRY`. Place the entry at - // `/src/arkor/index.ts` and invoke runTrainer() — the default + // `/src/arkor/index.ts` and invoke runTrainer(): the default // path is what `arkor start` and Studio's "Run training" button use. const arkorDir = join(cwd, "src", "arkor"); mkdirSync(arkorDir, { recursive: true }); @@ -174,8 +174,8 @@ describe("runTrainer — entry extraction", () => { join(arkorDir, "index.ts"), `export * from "./index.mjs";\n`, ); - // Pass undefined explicitly to exercise the `?? DEFAULT_ENTRY` branch - // — Node's built-in TypeScript stripping handles the .ts extension at + // Pass undefined explicitly to exercise the `?? DEFAULT_ENTRY` branch. + // Node's built-in TypeScript stripping handles the .ts extension at // runtime. (vitest also strips TS so this works under test too.) await expect(runTrainer()).resolves.toBeUndefined(); }); @@ -207,3 +207,99 @@ describe("runTrainer — entry extraction", () => { expect(typeof t.wait).toBe("function"); }); }); + +describe("runTrainer: shutdown signal handling", () => { + it("first SIGTERM calls trainer.requestEarlyStop and exits 0; second SIGTERM exits 143", async () => { + // Fake trainer whose `wait()` hangs until the test manually resolves it + // (via a global helper). This lets us hold the run in flight long + // enough to assert both signal-handling branches without racing the + // `finally` block that removes the listeners. + // The fake trainer wears the early-stop brand + // (`Symbol.for("arkor.trainer.requestEarlyStop")`) so the runner's + // SIGTERM handler invokes it the same way the SDK-provided trainer + // does. No public `requestEarlyStop` method exists any more. 
+ const trainerSrc = ` + const KEY = Symbol.for("arkor.trainer.requestEarlyStop"); + let earlyStopCalls = 0; + let resolveWait; + const waitPromise = new Promise((r) => { resolveWait = r; }); + globalThis.__test_signalProbe = { + get earlyStopCalls() { return earlyStopCalls; }, + finishWait: () => resolveWait({ + job: { + id: "j1", orgId: "o", projectId: "p", name: "n", + status: "completed", + config: { model: "m", datasetSource: { type: "huggingface", name: "x" } }, + createdAt: "2026", + }, + artifacts: [], + }), + }; + const trainer = { + name: "n", + start: async () => ({ jobId: "j1" }), + wait: () => waitPromise, + cancel: async () => {}, + }; + Object.defineProperty(trainer, KEY, { + value: async () => { earlyStopCalls++; }, + enumerable: false, + }); + export { trainer }; + `; + const entry = join(cwd, "src/arkor/index.mjs"); + mkdirSync(join(cwd, "src/arkor"), { recursive: true }); + writeFileSync(entry, trainerSrc); + + const exitCalls: number[] = []; + const exitSpy = vi + .spyOn(process, "exit") + .mockImplementation(((code?: number) => { + exitCalls.push(code ?? 0); + return undefined as never; + }) as typeof process.exit); + const stdoutSpy = vi + .spyOn(process.stdout, "write") + .mockImplementation((() => true) as typeof process.stdout.write); + try { + const runPromise = runTrainer("src/arkor/index.mjs"); + // Wait for import + start() to settle so the handler is registered + // before we synthesise SIGTERM. Poll for the probe rather than + // relying on a fixed timer: under load (e.g. running alongside + // sibling test files in turbo) the dynamic import + top-level + // body can take longer than a hardcoded 25 ms window. 
+ type Probe = { earlyStopCalls: number; finishWait: () => void }; + let probe: Probe | undefined; + for (let i = 0; i < 40; i++) { + probe = (globalThis as unknown as { __test_signalProbe?: Probe }) + .__test_signalProbe; + if (probe) break; + await new Promise((r) => setTimeout(r, 25)); + } + if (!probe) throw new Error("Probe not installed by user bundle"); + + // 1st SIGTERM → requestEarlyStop is called, exit(0) scheduled in the + // promise's `.finally`. + process.emit("SIGTERM", "SIGTERM"); + await new Promise((r) => setTimeout(r, 25)); + expect(probe.earlyStopCalls).toBe(1); + expect(exitCalls).toContain(0); + + // 2nd SIGTERM (still in-flight, listeners not yet removed) → + // exit(143) immediately, no second requestEarlyStop call. + process.emit("SIGTERM", "SIGTERM"); + await new Promise((r) => setTimeout(r, 25)); + expect(probe.earlyStopCalls).toBe(1); + expect(exitCalls).toContain(143); + + // Release the hung wait() so runPromise can complete and the + // shutdown handlers detach via the finally block. + probe.finishWait(); + await runPromise; + } finally { + exitSpy.mockRestore(); + stdoutSpy.mockRestore(); + delete (globalThis as Record).__test_signalProbe; + } + }); +}); diff --git a/packages/arkor/src/core/runner.ts b/packages/arkor/src/core/runner.ts index e674b70e..38db0537 100644 --- a/packages/arkor/src/core/runner.ts +++ b/packages/arkor/src/core/runner.ts @@ -2,10 +2,48 @@ import { existsSync } from "node:fs"; import { resolve, isAbsolute } from "node:path"; import { pathToFileURL } from "node:url"; import { isArkor } from "./arkor"; +import { + installCallbackReloadHandler, + installShutdownHandlers, +} from "./runnerSignals"; import type { Trainer } from "./types"; const DEFAULT_ENTRY = "src/arkor/index.ts"; +/** + * Per-spawn nonce that `/api/train` injects via env so the server can + * recognise the runner's `Started job ` line without it being + * forgeable from user code. Captured at module load (i.e. 
BEFORE + * `runTrainer` does its `await import(userEntry)`) and the env var + * is deleted right after so the dynamically-imported user module + * cannot read it via `process.env`. If a user callback then writes + * `Started job ` to stdout, the line won't carry the nonce + * prefix and the server's anchored regex will reject it: no + * spoofed cloud `cancel()` POST against an attacker-chosen job id. + * + * Null when the runner was launched directly (e.g. `arkor start` from + * a shell), in which case the runner falls back to the plain + * `Started job ` form for backwards compatibility. The server only + * uses the nonce-prefixed form because every server spawn sets the + * env var. + * + * **Import-order requirement.** The spoof-prevention guarantee relies + * on this module reading + deleting `ARKOR_JOB_ID_MARKER_NONCE` + * before any user-controlled module gets to touch `process.env`. + * That's safe today because the only consumer chain is + * `bin.ts → cli/main.ts → cli/commands/start.ts → core/runner.ts`, + * all static imports, so this module is fully evaluated before + * `runTrainer` performs its `await import(userEntry)`. If a future + * refactor introduces a dynamic-import / lazy-load of runner.ts (so + * a sibling module runs first and could snapshot `process.env`), the + * capture+delete should move into a tiny dedicated module that the + * bin imports first, or the env var should be wiped at the server + * spawn boundary too. + */ +const STARTED_JOB_NONCE: string | null = + process.env.ARKOR_JOB_ID_MARKER_NONCE ?? 
null; +delete process.env.ARKOR_JOB_ID_MARKER_NONCE; + function isTrainer(value: unknown): value is Trainer { if (!value || typeof value !== "object") return false; const t = value as Record<string, unknown>; @@ -53,8 +91,20 @@ export async function runTrainer(file?: string): Promise<void> { const mod = (await import(pathToFileURL(abs).href)) as Record<string, unknown>; const trainer = extractTrainer(mod); - const { jobId } = await trainer.start(); - process.stdout.write(`Started job ${jobId}\n`); - const result = await trainer.wait(); - process.stdout.write(`Job ${result.job.id} finished with status=${result.job.status}\n`); + const removeShutdown = installShutdownHandlers(trainer); + const removeCallbackReload = installCallbackReloadHandler(trainer, abs); + try { + const { jobId } = await trainer.start(); + const startedJobPrefix = STARTED_JOB_NONCE + ? `[arkor:${STARTED_JOB_NONCE}] ` + : ""; + process.stdout.write(`${startedJobPrefix}Started job ${jobId}\n`); + const result = await trainer.wait(); + process.stdout.write( + `Job ${result.job.id} finished with status=${result.job.status}\n`, + ); + } finally { + removeShutdown(); + removeCallbackReload(); + } } diff --git a/packages/arkor/src/core/runnerSignals.test.ts b/packages/arkor/src/core/runnerSignals.test.ts new file mode 100644 index 00000000..5e274943 --- /dev/null +++ b/packages/arkor/src/core/runnerSignals.test.ts @@ -0,0 +1,403 @@ +import { describe, it, expect, beforeEach, afterEach, vi } from "vitest"; +import { mkdirSync, mkdtempSync, rmSync, writeFileSync } from "node:fs"; +import { tmpdir } from "node:os"; +import { join } from "node:path"; +import { + installCallbackReloadHandler, + installShutdownHandlers, +} from "./runnerSignals"; +import type { Trainer, TrainerCallbacks } from "./types"; +import { + attachTrainerCallbackReplacer, + attachTrainerEarlyStopper, + attachTrainerInspection, +} from "./trainerInspection"; + +let cwd: string; + +beforeEach(() => { + cwd = mkdtempSync(join(tmpdir(), "arkor-signals-test-")); +}); +
+afterEach(() => { + rmSync(cwd, { recursive: true, force: true }); +}); + +function makeTrainer(): Trainer & { + __earlyStop: { calls: number }; + __replace: { + lastCallbacks: Partial<TrainerCallbacks> | null; + calls: number; + }; +} { + const earlyStop = { calls: 0 }; + const replace = { + lastCallbacks: null as Partial<TrainerCallbacks> | null, + calls: 0, + }; + const trainer: Trainer = { + name: "n", + async start() { + return { jobId: "j" }; + }, + async wait() { + throw new Error("not used"); + }, + async cancel() {}, + }; + // Wire the internal callback-replacer + early-stop brands the same + // way `createTrainer` does. SIGUSR2 looks them up via + // `replaceTrainerCallbacks` and SIGTERM via `requestTrainerEarlyStop` + // (there are no public methods on `Trainer` for either any more). + attachTrainerCallbackReplacer(trainer, (cbs) => { + replace.lastCallbacks = cbs; + replace.calls += 1; + }); + attachTrainerEarlyStopper(trainer, async () => { + earlyStop.calls += 1; + }); + return Object.assign(trainer, { + __earlyStop: earlyStop, + __replace: replace, + }); +} + +describe("installShutdownHandlers", () => { + it("calls trainer.requestEarlyStop on the first SIGTERM and exit(0)", async () => { + const trainer = makeTrainer(); + const exitSpy = vi + .spyOn(process, "exit") + .mockImplementation((() => undefined as never) as typeof process.exit); + const stdoutSpy = vi + .spyOn(process.stdout, "write") + .mockImplementation((() => true) as typeof process.stdout.write); + const dispose = installShutdownHandlers(trainer); + try { + process.emit("SIGTERM", "SIGTERM"); + await new Promise((r) => setTimeout(r, 10)); + expect(trainer.__earlyStop.calls).toBe(1); + expect(exitSpy).toHaveBeenCalledWith(0); + } finally { + dispose(); + exitSpy.mockRestore(); + stdoutSpy.mockRestore(); + } + }); + + it("second-signal exit code is per-signal POSIX 128+signo (130 for SIGINT, 129 for SIGHUP)", async () => { + // Regression: the second-signal emergency-exit path used to + // hardcode `process.exit(143)`
regardless of which signal + // fired. SIGINT (Ctrl-C twice) and SIGHUP shutdowns then + // looked like SIGTERM exits to parent shells / orchestrators, + // breaking signal-aware logic (e.g. tmux pane behaviour, CI + // job classification, `&&` / `||` chains that distinguish + // user-cancel from clean exit). Mirrors `SIGNAL_EXIT_CODE` in + // `cli/cleanupHooks.ts`. + const cases: Array<["SIGINT" | "SIGTERM" | "SIGHUP", number]> = [ + ["SIGINT", 130], + ["SIGTERM", 143], + ["SIGHUP", 129], + ]; + for (const [sig, expectedExit] of cases) { + const trainer = makeTrainer(); + const exitCodes: number[] = []; + const exitSpy = vi + .spyOn(process, "exit") + .mockImplementation(((code?: number) => { + exitCodes.push(code ?? 0); + return undefined as never; + }) as typeof process.exit); + const stdoutSpy = vi + .spyOn(process.stdout, "write") + .mockImplementation((() => true) as typeof process.stdout.write); + const dispose = installShutdownHandlers(trainer); + try { + process.emit(sig, sig); + await new Promise((r) => setTimeout(r, 10)); + process.emit(sig, sig); + await new Promise((r) => setTimeout(r, 10)); + // First signal exits 0 via the early-stop chain's + // `.finally(() => process.exit(0))`; second signal exits + // with the per-signal POSIX code. + expect(exitCodes, `signal ${sig}`).toContain(expectedExit); + } finally { + dispose(); + exitSpy.mockRestore(); + stdoutSpy.mockRestore(); + } + } + }); + + it("first-signal exit code is per-signal POSIX 128+signo when the early-stop chain rejects", async () => { + // Regression: the first-signal `.finally(() => process.exit(0))` + // always exited 0 even when the early-stop chain rejected + // (cancel POST hit a cloud-api 5xx, network drop, etc.). Parent + // shells running `arkor start || cleanup_on_failure` would then + // classify the failed cancel as a clean run and skip cleanup + // despite the stderr diagnostic. 
Fix: non-zero POSIX 128+signo on + // rejection so the exit status carries the same signal-shape + // semantics as the second-signal emergency path. + const cases: Array<["SIGINT" | "SIGTERM" | "SIGHUP", number]> = [ + ["SIGINT", 130], + ["SIGTERM", 143], + ["SIGHUP", 129], + ]; + for (const [sig, expectedExit] of cases) { + // Build a trainer whose internal early-stop brand REJECTS, so + // the runner's `.catch(...).finally(...)` chain goes through + // the failure branch. + const trainer: Trainer = { + name: "n", + async start() { + return { jobId: "j" }; + }, + async wait() { + throw new Error("not used"); + }, + async cancel() {}, + }; + attachTrainerEarlyStopper(trainer, async () => { + throw new Error("cloud-api 503"); + }); + const exitCodes: number[] = []; + const exitSpy = vi + .spyOn(process, "exit") + .mockImplementation(((code?: number) => { + exitCodes.push(code ?? 0); + return undefined as never; + }) as typeof process.exit); + const stdoutSpy = vi + .spyOn(process.stdout, "write") + .mockImplementation((() => true) as typeof process.stdout.write); + const stderrSpy = vi + .spyOn(process.stderr, "write") + .mockImplementation((() => true) as typeof process.stderr.write); + const dispose = installShutdownHandlers(trainer); + try { + process.emit(sig, sig); + // Wait for the .catch / .finally microtasks to settle. + await new Promise((r) => setTimeout(r, 10)); + // Under the bug this was just `[0]`. With the fix the + // first-signal exit code reflects the signal that fired. + expect(exitCodes, `signal ${sig}`).toEqual([expectedExit]); + } finally { + dispose(); + exitSpy.mockRestore(); + stdoutSpy.mockRestore(); + stderrSpy.mockRestore(); + } + } + }); + + it("second SIGTERM exits 143 without re-invoking requestEarlyStop", async () => { + const trainer = makeTrainer(); + const exitCodes: number[] = []; + const exitSpy = vi + .spyOn(process, "exit") + .mockImplementation(((code?: number) => { + exitCodes.push(code ?? 
0); + return undefined as never; + }) as typeof process.exit); + const stdoutSpy = vi + .spyOn(process.stdout, "write") + .mockImplementation((() => true) as typeof process.stdout.write); + const dispose = installShutdownHandlers(trainer); + try { + process.emit("SIGTERM", "SIGTERM"); + await new Promise((r) => setTimeout(r, 10)); + process.emit("SIGTERM", "SIGTERM"); + await new Promise((r) => setTimeout(r, 10)); + expect(trainer.__earlyStop.calls).toBe(1); + expect(exitCodes).toContain(0); + expect(exitCodes).toContain(143); + } finally { + dispose(); + exitSpy.mockRestore(); + stdoutSpy.mockRestore(); + } + }); +}); + +describe("installCallbackReloadHandler", () => { + function writeUserBundle(label: string): string { + const file = join(cwd, "entry.mjs"); + // Inline a fake trainer that wears the inspection brand. The + // SIGUSR2 handler dynamic-imports this file and pulls the + // callbacks reference off via `getTrainerInspection`. + const src = ` + const KEY = Symbol.for("arkor.trainer.inspect"); + const callbacks = { onLog: (ctx) => globalThis.__arkor_callbackProbe?.(${JSON.stringify(label)}, ctx) }; + const trainer = { + name: "t", + start: async () => ({ jobId: "j" }), + wait: async () => ({ job: {}, artifacts: [] }), + cancel: async () => {}, + }; + Object.defineProperty(trainer, KEY, { + value: () => ({ name: "t", config: { model: "m", datasetSource: { type: "huggingface", name: "x" } }, callbacks }), + enumerable: false, + }); + export const arkor = Object.freeze({ _kind: "arkor", trainer }); + `; + writeFileSync(file, src); + return file; + } + + it("re-imports the bundle and forwards the new callbacks via replaceCallbacks", async () => { + const trainer = makeTrainer(); + // Brand the trainer too so the import path-side has a reference shape. 
+ attachTrainerInspection(trainer, () => ({ + name: "n", + config: { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + }, + callbacks: {}, + })); + + const file = writeUserBundle("v1"); + const stdoutSpy = vi + .spyOn(process.stdout, "write") + .mockImplementation((() => true) as typeof process.stdout.write); + const stderrSpy = vi + .spyOn(process.stderr, "write") + .mockImplementation((() => true) as typeof process.stderr.write); + const dispose = installCallbackReloadHandler(trainer, file); + mkdirSync(join(cwd, "src"), { recursive: true }); + try { + // Rewrite the entry to "v2" callbacks before signalling. + writeUserBundle("v2"); + process.emit("SIGUSR2", "SIGUSR2"); + // Wait for the dynamic import + replaceCallbacks to settle. + for (let i = 0; i < 50 && trainer.__replace.lastCallbacks === null; i++) { + await new Promise((r) => setTimeout(r, 10)); + } + expect(trainer.__replace.lastCallbacks).not.toBeNull(); + expect(typeof trainer.__replace.lastCallbacks?.onLog).toBe("function"); + } finally { + dispose(); + stdoutSpy.mockRestore(); + stderrSpy.mockRestore(); + } + }); + + it("returns a no-op disposer when SIGUSR2 registration throws (Windows fallback)", () => { + // Regression: `process.on("SIGUSR2", ...)` can throw at + // registration time on platforms that don't support the signal + // (notably Windows). Previously this would surface as a hard + // crash at `arkor start` boot. The handler now wraps the + // registration in try/catch and degrades to a no-op disposer so + // the rest of the runner stays up: the server's + // `safeKill(child, "SIGUSR2")` already detects the same + // condition and falls back to SIGTERM-restart there. 
+ const trainer = makeTrainer(); + const file = join(cwd, "entry.mjs"); + writeFileSync(file, "export const x = 1;\n"); + + const realOn = process.on.bind(process); + const onSpy = vi + .spyOn(process, "on") + .mockImplementation(((event: string, listener: (...args: unknown[]) => void) => { + if (event === "SIGUSR2") { + throw new Error("ENOSYS: function not implemented"); + } + return realOn(event as never, listener as never); + }) as typeof process.on); + + let dispose: (() => void) | undefined; + try { + // Must not throw despite the SIGUSR2 registration failure. + dispose = installCallbackReloadHandler(trainer, file); + expect(typeof dispose).toBe("function"); + // No listener was attached, so the disposer is a no-op; calling + // it must not throw either (mirroring the success-path contract + // for tests that always invoke the disposer in `finally`). + expect(() => dispose?.()).not.toThrow(); + } finally { + onSpy.mockRestore(); + } + }); + + it("drops a stale reload's result when a newer SIGUSR2 starts before the import resolves", async () => { + // Regression: each SIGUSR2 starts a fire-and-forget + // `import()` + `replaceTrainerCallbacks`. Two same-`configHash` + // rebuilds firing back-to-back can race: the earlier import's + // bytes sometimes resolve *after* the newer one, and + // `replaceTrainerCallbacks` overwrites the freshly-loaded + // callbacks with the prior version. The fix version-gates each + // reload via a monotonic `loadSeq`; this test pins the contract + // by firing two signals back-to-back and asserting that + // `replaceTrainerCallbacks` was invoked exactly **once**: + // proving the older IIFE dropped its result at the + // `seq !== loadSeq` check before reaching the replace call. 
+ const trainer = makeTrainer(); + attachTrainerInspection(trainer, () => ({ + name: "n", + config: { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + }, + callbacks: {}, + })); + + const file = writeUserBundle("v1"); + const stdoutSpy = vi + .spyOn(process.stdout, "write") + .mockImplementation((() => true) as typeof process.stdout.write); + const stderrSpy = vi + .spyOn(process.stderr, "write") + .mockImplementation((() => true) as typeof process.stderr.write); + const dispose = installCallbackReloadHandler(trainer, file); + try { + // First signal: captures seq=1 inside the IIFE. + process.emit("SIGUSR2", "SIGUSR2"); + // Rewrite the bundle to v2 BEFORE letting either import + // resolve. mtime+ctime+size change → distinct cache-bust URL. + writeUserBundle("v2"); + // Second signal: captures seq=2, bumps loadSeq to 2. + process.emit("SIGUSR2", "SIGUSR2"); + // Generous fixed wait so both imports definitely settle; + // we can't poll on `lastCallbacks !== null` because the v1 + // IIFE might land first and short-circuit our wait, hiding + // the count assertion below. + await new Promise((r) => setTimeout(r, 200)); + // Without the seq guard, both IIFEs would call + // `replaceTrainerCallbacks` and `calls` would be 2. With the + // guard, the older IIFE's `seq !== loadSeq` short-circuit + // skips the replace call entirely. 
+ expect(trainer.__replace.calls).toBe(1); + } finally { + dispose(); + stdoutSpy.mockRestore(); + stderrSpy.mockRestore(); + } + }); + + it("logs a skip warning when the bundle has no inspectable trainer", async () => { + const trainer = makeTrainer(); + const file = join(cwd, "no-trainer.mjs"); + writeFileSync(file, "export const nothing = true;\n"); + const stdoutSpy = vi + .spyOn(process.stdout, "write") + .mockImplementation((() => true) as typeof process.stdout.write); + const stderrChunks: string[] = []; + const stderrSpy = vi + .spyOn(process.stderr, "write") + .mockImplementation(((chunk: unknown) => { + stderrChunks.push(String(chunk)); + return true; + }) as typeof process.stderr.write); + const dispose = installCallbackReloadHandler(trainer, file); + try { + process.emit("SIGUSR2", "SIGUSR2"); + // Give the dynamic import a few ticks. + await new Promise((r) => setTimeout(r, 50)); + expect(stderrChunks.join("")).toMatch(/no inspectable trainer/i); + expect(trainer.__replace.lastCallbacks).toBeNull(); + } finally { + dispose(); + stdoutSpy.mockRestore(); + stderrSpy.mockRestore(); + } + }); +}); diff --git a/packages/arkor/src/core/runnerSignals.ts b/packages/arkor/src/core/runnerSignals.ts new file mode 100644 index 00000000..5c71a27d --- /dev/null +++ b/packages/arkor/src/core/runnerSignals.ts @@ -0,0 +1,215 @@ +import { moduleCacheBustUrl } from "./moduleCacheBust"; +import { SIGNAL_EXIT_CODE } from "./signalExit"; +import { + findInspectableTrainer, + replaceTrainerCallbacks, + requestTrainerEarlyStop, +} from "./trainerInspection"; +import type { Trainer, TrainerCallbacks } from "./types"; + +const SHUTDOWN_SIGNALS = ["SIGTERM", "SIGINT", "SIGHUP"] as const; +const CALLBACK_RELOAD_SIGNAL = "SIGUSR2" as const; + +/** + * Two-stage shutdown handling so HMR rebuilds (Studio sends SIGTERM) + * preserve the in-flight checkpoint work: + * + * - 1st signal → `trainer.requestEarlyStop()`. 
The trainer keeps + * running, lets the next `checkpoint.saved` event land, then issues + * `cancel()`. + * - 2nd signal → immediate `process.exit(POSIX 128+signo)`: + * 130 for SIGINT, 143 for SIGTERM, 129 for SIGHUP. Escape hatch + * for an impatient operator or a hung early-stop. Per-signal + * exit code so parent shells see the actual interruption type. + * + * The returned dispose function removes the handlers so a normal + * `wait()` completion doesn't leave stale listeners behind: important + * because `runTrainer` can be called multiple times in tests within a + * single Node process. + */ +export function installShutdownHandlers(trainer: Trainer): () => void { + let signalCount = 0; + const handler = (signal: (typeof SHUTDOWN_SIGNALS)[number]): void => { + signalCount += 1; + if (signalCount > 1) { + process.stdout.write( + `Received second ${signal}; exiting without waiting for checkpoint.\n`, + ); + // POSIX 128 + signo so the parent shell sees the right exit + // status: 130 for SIGINT (Ctrl-C twice), 129 for SIGHUP, + // 143 for SIGTERM. Hardcoding 143 misclassifies SIGINT and + // SIGHUP shutdowns as SIGTERM-style exits and breaks + // signal-aware orchestration. Defaults to 143 for any future + // signal we forget to map. + const code = SIGNAL_EXIT_CODE[signal] ?? 143; + process.exit(code); + // Explicit return so test mocks of process.exit (which don't + // actually terminate the worker) don't fall through into the + // early-stop path. + return; + } + process.stdout.write( + `Received ${signal}; early-stopping at next checkpoint…\n`, + ); + // Drive the trainer's internal early-stop entry point via the + // `Symbol.for("arkor.trainer.requestEarlyStop")` brand attached by + // `createTrainer`. `runTrainer` also accepts hand-rolled + // `{ start, wait, cancel }` trainers; for those the brand is + // absent and `requestTrainerEarlyStop` transparently falls back + // to `trainer.cancel()` (best-effort, matches the public contract). 
+ // + // Track whether the early-stop chain rejected so the final + // `process.exit` carries a non-zero status. The previous version + // always exited 0, which made `arkor start || cleanup_on_failure` + // wrappers classify a cancel-POST rejection (cloud-api transient + // failure, network drop) as a clean run despite the stderr + // diagnostic. POSIX 128 + signo on failure mirrors the + // second-signal exit-code convention so parent shells see a + // signal-style nonzero status. + let earlyStopFailed = false; + requestTrainerEarlyStop(trainer) + .catch((err: unknown) => { + earlyStopFailed = true; + const msg = err instanceof Error ? err.message : String(err); + process.stderr.write(`requestEarlyStop failed: ${msg}\n`); + }) + .finally(() => { + const code = earlyStopFailed + ? (SIGNAL_EXIT_CODE[signal] ?? 143) + : 0; + process.exit(code); + }); + }; + // Per-signal closure (vs a single shared listener registered on + // every signal): the closure captures `sig` at registration time + // so the handler doesn't depend on whatever Node passes as the + // event arg. Node's documented contract is to pass the signal + // name, but pinning the source via closure keeps the handler + // robust regardless and makes the registration → arg + // relationship explicit at the callsite. Stored in a Map so + // `process.off` can remove the exact closure (anonymous arrow + // would leak the listener since `process.off` matches by + // identity). + const signalHandlers = new Map< + (typeof SHUTDOWN_SIGNALS)[number], + () => void + >(); + for (const sig of SHUTDOWN_SIGNALS) { + const fn = () => handler(sig); + signalHandlers.set(sig, fn); + process.on(sig, fn); + } + return () => { + for (const [sig, fn] of signalHandlers) process.off(sig, fn); + }; +} + +/** + * SIGUSR2 handler: re-import the freshly-rebuilt artefact and rotate + * the trainer's callback cell via the internal + * `Symbol.for("arkor.trainer.replaceCallbacks")` brand. 
The cloud-side + * training run is untouched; only the in-process callbacks change. + * + * Studio sends SIGUSR2 from the `/api/dev/events` HMR pipeline when + * (and only when) the rebuilt bundle's `JobConfig` hash matches the + * one captured at spawn time. A mismatch produces SIGTERM instead, which + * goes through `installShutdownHandlers` above. + */ +export function installCallbackReloadHandler( + trainer: Trainer, + entryPath: string, +): () => void { + /** + * Monotonic counter for sequencing concurrent SIGUSR2 reloads. + * Bumped synchronously inside the signal handler *before* the + * dynamic-import await begins, so each in-flight reload knows its + * arrival order. When the import resolves, the IIFE compares its + * captured `seq` against `loadSeq` and silently drops the result + * if a newer signal already started a newer reload. Without this, + * two same-`configHash` rebuilds firing back-to-back can race on + * the import: the earlier import's bytes (now stale on disk) + * resolve *after* the newer one, and `replaceTrainerCallbacks` + * overwrites the freshly-loaded callbacks with the prior version, + * leaving the running job out of sync until the next rebuild. + * Mirrors the `buildSeq` guard in `studio/hmr.ts`'s + * `emitBuildSucceeded`. + */ + let loadSeq = 0; + const handler = (): void => { + const seq = ++loadSeq; + // mtime+ctime+size cache-bust (vs `Date.now()`): Node's ESM + // loader never evicts module records, so a long `arkor start` + // session with frequent SIGUSR2 reloads would accumulate one + // record per signal forever. Keying on the actual artefact bytes + // (via `moduleCacheBustUrl`) collapses no-op signals onto the + // same URL; the leak is bounded to "one per real edit", which + // is fundamentally what HMR has to retain. 
+ const url = moduleCacheBustUrl(entryPath); + void (async () => { + try { + const mod = (await import(url)) as Record; + // A newer SIGUSR2 already started its own import while we + // were awaiting; drop our result so the latest edit wins. + if (seq !== loadSeq) return; + const callbacks = extractCallbacks(mod); + if (!callbacks) { + process.stderr.write( + "Callback reload skipped: rebuilt bundle has no inspectable trainer.\n", + ); + return; + } + replaceTrainerCallbacks(trainer, callbacks); + process.stdout.write( + "Callbacks hot-reloaded; training run continues.\n", + ); + } catch (err: unknown) { + const msg = err instanceof Error ? err.message : String(err); + process.stderr.write(`Callback reload failed: ${msg}\n`); + } + })(); + }; + // `process.on('SIGUSR2', ...)` can throw at registration time on + // platforms that don't support the signal (notably Windows: libuv's + // signal-wrap returns ENOSYS for SIGUSR2 on win32 and the error + // escapes to userland on some Node versions). The server-side + // `trainRegistry.safeKill(child, "SIGUSR2")` already detects this + // ("unsupported" → falls back to SIGTERM-restart), so an unarmed + // listener here is the documented contract on those platforms: + // quietly degrade to a no-op disposer rather than crashing + // `arkor start` at boot. + // Track registration success so the returned disposer never + // calls `process.off(...)` for a handler we never attached. + // Today this only fires for the early-return-no-op path where + // `process.on` threw at registration, but future Node versions + // could route `off` through the same libuv signal-wrap that + // throws for unsupported signals on Windows, and a symmetric + // throw inside the disposer would crash the `runTrainer` finally + // block instead of merely being a no-op. 
+ let attached = false; + try { + process.on(CALLBACK_RELOAD_SIGNAL, handler); + attached = true; + } catch { + return () => { + // no-op: handler was never attached + }; + } + return () => { + if (!attached) return; + process.off(CALLBACK_RELOAD_SIGNAL, handler); + }; +} + +/** + * Extract the user-supplied callbacks reference from a re-imported + * bundle. Delegates the entry-shape walk to `findInspectableTrainer` + * so SIGUSR2's view of "what counts as a trainer" stays identical to + * the HMR coordinator's `inspectBundle` and `runner.ts`'s + * `extractTrainer`. Returns `null` when no candidate carries the + * inspection brand. + */ +function extractCallbacks( + mod: Record, +): Partial | null { + return findInspectableTrainer(mod)?.callbacks ?? null; +} diff --git a/packages/arkor/src/core/schemas.test.ts b/packages/arkor/src/core/schemas.test.ts index 50ff2571..46a9cf18 100644 --- a/packages/arkor/src/core/schemas.test.ts +++ b/packages/arkor/src/core/schemas.test.ts @@ -64,7 +64,7 @@ describe("trainingJobSchema", () => { }); it("normalises non-null startedAt/completedAt: strings pass through, Dates ISO-coerce", () => { - // Branch coverage for the `toIsoOrNull` transforms — the `null` + // Branch coverage for the `toIsoOrNull` transforms: the `null` // branch is exercised by every other test in this file (the // `valid` fixture has both fields null), but the truthy branch // only fires when the field carries an actual timestamp. Strings diff --git a/packages/arkor/src/core/signalExit.ts b/packages/arkor/src/core/signalExit.ts new file mode 100644 index 00000000..81e42eb7 --- /dev/null +++ b/packages/arkor/src/core/signalExit.ts @@ -0,0 +1,21 @@ +/** + * Shared POSIX `128 + signo` exit code mapping for the runner's + * two-stage shutdown handler (`core/runnerSignals.ts`) and the CLI's + * cleanup-hook coordinator (`cli/cleanupHooks.ts`). The two map + * MUST agree: AGENTS.md describes them as a single contract, and a + * drift (e.g. 
someone adding SIGQUIT to one but not the other) + * would make the runner and the dev-server exit with inconsistent + * codes for the same signal, the exact parent-shell-classification + * regression the per-signal mapping was introduced to prevent. + * + * Lives in `core/` (not `cli/`) so both consumers can import it + * without `cli/` ↔ `core/` cycles: `cli/cleanupHooks.ts` imports + * from `core/`, but `core/` must not depend on `cli/`. + */ +export const SIGNAL_EXIT_CODE = { + SIGHUP: 129, + SIGINT: 130, + SIGTERM: 143, +} as const; + +export type ShutdownSignal = keyof typeof SIGNAL_EXIT_CODE; diff --git a/packages/arkor/src/core/trainer.test.ts b/packages/arkor/src/core/trainer.test.ts index f9ef2f85..ff1d02ed 100644 --- a/packages/arkor/src/core/trainer.test.ts +++ b/packages/arkor/src/core/trainer.test.ts @@ -3,6 +3,10 @@ import { mkdtempSync, rmSync } from "node:fs"; import { tmpdir } from "node:os"; import { join } from "node:path"; import { createTrainer } from "./trainer"; +import { + replaceTrainerCallbacks, + requestTrainerEarlyStop, +} from "./trainerInspection"; import { writeState } from "./state"; import type { AnonymousCredentials } from "./credentials"; @@ -266,7 +270,7 @@ describe("createTrainer (credentials defaulting)", () => { model: "m", dataset: { type: "huggingface", name: "x" }, }, - // Note: NO `credentials` here — trainer must call ensureCredentials. + // Note: NO `credentials` here, so trainer must call ensureCredentials. { baseUrl: "http://mock", cwd: localCwd, @@ -667,7 +671,7 @@ describe("createTrainer (SSE event stream)", () => { }); }); -// Regression for ENG-406 — the previous reconnect loop had no upper bound +// Regression for ENG-406: the previous reconnect loop had no upper bound // and no jitter, so a permanently-down cloud-api would keep retrying every // `reconnectDelayMs` forever (and on recovery several SDK clients would // reconnect at exactly the same instant). 
@@ -795,7 +799,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => { step: 1, loss: 1, })}\n\n`, - // No terminal event — stream closes cleanly, outer loop reconnects. + // No terminal event: stream closes cleanly, outer loop reconnects. ], }, { kind: "throw", error: new TypeError("fetch failed") }, @@ -833,8 +837,8 @@ describe("createTrainer (reconnect backoff + max attempts)", () => { // when `Math.random()` lands near 1. // Codex review on PR #13 (round 3) flagged that a 200-OK stream that // EOFs without emitting any frame would loop forever at the base delay - // — `maxReconnectAttempts` was bypassed because clean closes never - // touched the failure counter. Misconfigured proxies / load-balancers + // because `maxReconnectAttempts` was bypassed (clean closes never + // touched the failure counter). Misconfigured proxies / load-balancers // that accept the connection and immediately drop it would hang // `wait()` indefinitely. it("counts clean closes with no frames toward maxReconnectAttempts", async () => { @@ -955,7 +959,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => { }; // The trainer fires `POST /v1/jobs` synchronously inside the start() // path, so cancel() needs the job row to be assigned. We never open the - // event stream — cancel() should not depend on it. + // event stream; cancel() should not depend on it. const sse = [ `id: 1\nevent: training.completed\ndata: ${JSON.stringify({ type: "training.completed", @@ -1012,7 +1016,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => { const original = globalThis.fetch; globalThis.fetch = fetcher; try { - // Start the run by awaiting wait() — the streamed completion event + // Start the run by awaiting wait(): the streamed completion event // closes the loop quickly so cancel() runs against a fully-resolved // startedJob/scope pair. 
await trainer.wait(); @@ -1086,7 +1090,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => { }); it("skips malformed event payloads without aborting the stream", async () => { - // Branch coverage for the `try/catch` around JSON.parse — a single + // Branch coverage for the `try/catch` around JSON.parse: a single // malformed `data:` line shouldn't tear down the whole training run. // Send one garbage frame followed by a real terminal event. await writeState( @@ -1153,7 +1157,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => { }); it("recovers when the SSE body itself errors mid-stream", async () => { - // Branch coverage for the catch around the for-await iterator — + // Branch coverage for the catch around the for-await iterator: // covers the case where the stream's underlying body emits an error // (e.g. a network disconnect partway through). The reconnect loop // should treat it as a failure, count it toward the limit, then @@ -1263,7 +1267,7 @@ describe("createTrainer (reconnect backoff + max attempts)", () => { { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" }, cwd, ); - // No fetch mock at all — if cancel() reached the API we'd see a real + // No fetch mock at all: if cancel() reached the API we'd see a real // network error. Safety net for callers that wire up cancel() to // SIGINT before kicking off the run. 
const trainer = createTrainer( @@ -1413,3 +1417,1320 @@ describe("createTrainer (reconnect backoff + max attempts)", () => { } }); }); + +describe("createTrainer (early stop)", () => { + const minimalJobRow = { + id: "j-stop", + orgId: "o1", + projectId: "p1", + name: "run", + status: "queued", + config: { + model: "m", + datasetSource: { type: "huggingface", name: "x" }, + }, + createdAt: "2026-01-01T00:00:00Z", + startedAt: null, + completedAt: null, + }; + + it("calls cancel after the next checkpoint when early-stop is requested mid-run", async () => { + await writeState( + { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" }, + cwd, + ); + // SSE stream: training.started → training.log → checkpoint.saved. + // The checkpoint event is the trigger for the early-stop branch in + // dispatch(); after that, the loop should treat the run as terminal + // (asserted by the stream ending without training.completed while + // wait() still resolves). + const sse = [ + `id: 1\nevent: training.started\ndata: ${JSON.stringify({ + type: "training.started", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:01Z", + })}\n\n`, + `id: 2\nevent: training.log\ndata: ${JSON.stringify({ + type: "training.log", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:02Z", + step: 1, + loss: 0.5, + })}\n\n`, + `id: 3\nevent: checkpoint.saved\ndata: ${JSON.stringify({ + type: "checkpoint.saved", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:03Z", + step: 10, + })}\n\n`, + ]; + + let cancelCalls = 0; + const fetcher: typeof fetch = (async ( + input: RequestInfo | URL, + init?: RequestInit, + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ??
"GET"; + if (method === "POST" && url.includes("/v1/jobs?")) { + return new Response(JSON.stringify({ job: minimalJobRow }), { + status: 201, + headers: { "content-type": "application/json" }, + }); + } + if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) { + return new Response(sseStream(sse), { + status: 200, + headers: { "content-type": "text/event-stream" }, + }); + } + if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) { + cancelCalls += 1; + return new Response(JSON.stringify({ ok: true }), { + status: 200, + headers: { "content-type": "application/json" }, + }); + } + throw new Error(`unexpected fetch: ${method} ${url}`); + }) as typeof fetch; + + const trainer = createTrainer( + { + name: "run", + model: "m", + dataset: { type: "huggingface", name: "x" }, + callbacks: { + // Arm the early-stop latch from inside the on-log callback so it + // fires before the checkpoint dispatch (mirrors the real CLI + // path where SIGTERM arrives mid-run). Fire-and-forget so the + // dispatch loop isn't blocked waiting for the latch's own + // checkpoint trigger to arrive. + onLog: () => { + void requestTrainerEarlyStop(trainer, { timeoutMs: 60_000 }); + }, + }, + }, + { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 }, + ); + const original = globalThis.fetch; + globalThis.fetch = fetcher; + let result: Awaited>; + try { + result = await trainer.wait(); + } finally { + globalThis.fetch = original; + } + expect(cancelCalls).toBe(1); + // Regression: the early-stop checkpoint branch returns + // `{ terminal: true }` to break out of `wait()`'s loop without + // waiting for a cloud-side terminal event. The `TrainingResult` + // it resolves with must therefore reflect a terminal status + // locally; otherwise `wait()` violates its documented contract + // ("Resolve when the job reaches a terminal status") and a + // subsequent `requestEarlyStop` wouldn't see the + // `TERMINAL_STATUSES` short-circuit. 
+ expect(result.job.status).toBe("cancelled"); + expect(result.job.completedAt).toBe("2026-01-01T00:00:03Z"); + }); + + it("early-stop checkpoint branch returns the checkpoint's artifacts in wait()'s result", async () => { + // Regression: the early-stop terminal return used + // `terminalResult?.artifacts ?? []`, but `wait()` always calls + // `dispatch(parsed, null)` so `terminalResult` was forever + // null → `wait()` resolved with `artifacts: []` even though + // the checkpoint event carries the very artefacts the + // early-stop existed to *preserve* (the whole point of the + // graceful-stop-at-next-checkpoint pattern is to keep that + // work). Now we return `event.artifacts` directly so the + // checkpoint's outputs make it into the resolved result. + await writeState( + { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" }, + cwd, + ); + const checkpointArtifacts = [ + { kind: "lora_adapter" as const, path: "/checkpoints/step-10/" }, + { kind: "metric" as const, name: "loss", value: 0.42 }, + ]; + const sse = [ + `id: 1\nevent: training.started\ndata: ${JSON.stringify({ + type: "training.started", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:01Z", + })}\n\n`, + `id: 2\nevent: training.log\ndata: ${JSON.stringify({ + type: "training.log", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:02Z", + step: 1, + loss: 0.5, + })}\n\n`, + `id: 3\nevent: checkpoint.saved\ndata: ${JSON.stringify({ + type: "checkpoint.saved", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:03Z", + step: 10, + artifacts: checkpointArtifacts, + })}\n\n`, + ]; + const fetcher: typeof fetch = (async ( + input: RequestInfo | URL, + init?: RequestInit, + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? 
"GET"; + if (method === "POST" && url.includes("/v1/jobs?")) { + return new Response(JSON.stringify({ job: minimalJobRow }), { + status: 201, + headers: { "content-type": "application/json" }, + }); + } + if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) { + return new Response(sseStream(sse), { + status: 200, + headers: { "content-type": "text/event-stream" }, + }); + } + if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) { + return new Response(JSON.stringify({ ok: true }), { + status: 200, + headers: { "content-type": "application/json" }, + }); + } + throw new Error(`unexpected fetch: ${method} ${url}`); + }) as typeof fetch; + + const trainer = createTrainer( + { + name: "run", + model: "m", + dataset: { type: "huggingface", name: "x" }, + callbacks: { + onLog: () => { + void requestTrainerEarlyStop(trainer, { timeoutMs: 60_000 }); + }, + }, + }, + { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 }, + ); + const original = globalThis.fetch; + globalThis.fetch = fetcher; + let result: Awaited>; + try { + result = await trainer.wait(); + } finally { + globalThis.fetch = original; + } + // The artefacts the checkpoint event carried must travel + // through to the wait() result; that's the whole point of + // graceful-stop-at-next-checkpoint preserving the in-flight + // work. + expect(result.artifacts).toEqual(checkpointArtifacts); + // Sibling assertion: status is still terminal (covered more + // thoroughly in the dedicated test above; this one just + // ensures we didn't accidentally regress the status while + // changing the artefacts return). + expect(result.job.status).toBe("cancelled"); + }); + + it("early-stop branch still settles when the user's onCheckpoint callback throws (no SIGTERM hang)", async () => { + // Regression: the early-stop branch ran AFTER + // `await callbacks.onCheckpoint?.(ctx)`. 
A user-callback throw + // would propagate out of that await before the early-stop + // cancel + latch settlement could run, leaving + // `earlyStopDeferred` pending. The runner's + // `installShutdownHandlers` awaits that deferred → SIGTERM + // shutdown hangs until the (default 5-min) timeout fallback + // fires. The fix wraps `onCheckpoint` in try/catch, runs the + // early-stop branch unconditionally, then re-throws the + // captured callback error so wait()'s reconnect loop keeps + // its prior semantics. + await writeState( + { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" }, + cwd, + ); + const sse = [ + `id: 1\nevent: training.started\ndata: ${JSON.stringify({ + type: "training.started", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:01Z", + })}\n\n`, + `id: 2\nevent: training.log\ndata: ${JSON.stringify({ + type: "training.log", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:02Z", + step: 1, + loss: 0.5, + })}\n\n`, + `id: 3\nevent: checkpoint.saved\ndata: ${JSON.stringify({ + type: "checkpoint.saved", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:03Z", + step: 10, + })}\n\n`, + ]; + let cancelCalls = 0; + const fetcher: typeof fetch = (async ( + input: RequestInfo | URL, + init?: RequestInit, + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? 
"GET"; + if (method === "POST" && url.includes("/v1/jobs?")) { + return new Response(JSON.stringify({ job: minimalJobRow }), { + status: 201, + headers: { "content-type": "application/json" }, + }); + } + if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) { + return new Response(sseStream(sse), { + status: 200, + headers: { "content-type": "text/event-stream" }, + }); + } + if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) { + cancelCalls += 1; + return new Response(JSON.stringify({ ok: true }), { + status: 200, + headers: { "content-type": "application/json" }, + }); + } + throw new Error(`unexpected fetch: ${method} ${url}`); + }) as typeof fetch; + + let armedPromise: Promise | null = null; + let armedResult: "resolved" | "rejected" | "pending" = "pending"; + const trainer = createTrainer( + { + name: "run", + model: "m", + dataset: { type: "huggingface", name: "x" }, + callbacks: { + onLog: () => { + if (armedPromise === null) { + armedPromise = requestTrainerEarlyStop(trainer, { + timeoutMs: 60_000, + }); + armedPromise.then( + () => { + armedResult = "resolved"; + }, + () => { + armedResult = "rejected"; + }, + ); + } + }, + onCheckpoint: () => { + // User callback throws DURING the checkpoint that + // would normally trigger early-stop. Without the + // try/catch wrap this throw would skip the + // early-stop branch → latch pending → SIGTERM hang + // for up to 60s (our `timeoutMs`). + throw new Error("user onCheckpoint boom"); + }, + }, + }, + { + baseUrl: "http://mock", + credentials: creds, + cwd, + reconnectDelayMs: 1, + // Cap reconnects at 0 so the user-callback throw + // surfaces as a wait() rejection instead of + // looping forever (handleFailure would otherwise + // reconnect after the throw escapes dispatch). 
+ maxReconnectAttempts: 0, + }, + ); + const original = globalThis.fetch; + globalThis.fetch = fetcher; + try { + // wait() rejects: handleFailure wraps the user callback + // throw because maxReconnectAttempts is 0. + await expect(trainer.wait()).rejects.toThrow(); + // Critical: the latch SETTLED via the early-stop branch + // (resolve), not via the 60-second timeout. The cancel POST + // also fired (early-stop reached the cancel call before the + // throw was re-raised). Together: shutdown wouldn't hang. + await new Promise((r) => setImmediate(r)); + expect(armedResult).toBe("resolved"); + expect(cancelCalls).toBe(1); + } finally { + globalThis.fetch = original; + } + }); + + it("re-throws a falsy onCheckpoint throw (e.g. `throw null`) instead of silently suppressing it", async () => { + // Regression: `onCheckpointError !== null` was the discriminant for + // "did the user callback throw?". User code can legitimately + // `throw null` / `throw 0` / `throw ""`; the truthiness of + // `onCheckpointError` was then indistinguishable from the no-error + // path, and the post-early-stop re-throw at the end of the + // checkpoint dispatch silently dropped the user's signal. With the + // fix, a separate `onCheckpointThrew` boolean discriminates so + // ANY throwable (including falsy ones) propagates uniformly. + await writeState( + { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" }, + cwd, + ); + const sse = [ + `id: 1\nevent: training.started\ndata: ${JSON.stringify({ + type: "training.started", + jobId: "j-falsy", + timestamp: "2026-01-01T00:00:01Z", + })}\n\n`, + `id: 2\nevent: checkpoint.saved\ndata: ${JSON.stringify({ + type: "checkpoint.saved", + jobId: "j-falsy", + timestamp: "2026-01-01T00:00:02Z", + step: 5, + })}\n\n`, + ]; + const fetcher: typeof fetch = (async ( + input: RequestInfo | URL, + init?: RequestInit, + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? 
"GET"; + if (method === "POST" && url.includes("/v1/jobs?")) { + return new Response(JSON.stringify({ job: minimalJobRow }), { + status: 201, + headers: { "content-type": "application/json" }, + }); + } + if (method === "GET" && url.includes("/v1/jobs/j-falsy/events/stream")) { + return new Response(sseStream(sse), { + status: 200, + headers: { "content-type": "text/event-stream" }, + }); + } + throw new Error(`unexpected fetch: ${method} ${url}`); + }) as typeof fetch; + + const trainer = createTrainer( + { + name: "run", + model: "m", + dataset: { type: "huggingface", name: "x" }, + callbacks: { + // `throw null`: a falsy throwable. Under the bug this was + // silently swallowed and wait() resolved as if no callback + // had thrown. With the fix `wait()` rejects (handleFailure + // wraps the throw because maxReconnectAttempts is 0). + onCheckpoint: () => { + throw null; + }, + }, + }, + { + baseUrl: "http://mock", + credentials: creds, + cwd, + reconnectDelayMs: 1, + maxReconnectAttempts: 0, + }, + ); + const original = globalThis.fetch; + globalThis.fetch = fetcher; + try { + // Under the bug `wait()` resolved cleanly (the falsy throw was + // captured by `onCheckpointError = err` but the + // `if (onCheckpointError !== null) throw` guard saw `null` and + // skipped the re-throw). With the fix `wait()` rejects. + await expect(trainer.wait()).rejects.toBeDefined(); + } finally { + globalThis.fetch = original; + } + }); + + it("early-stop checkpoint branch rejects the deferred when cancel() throws (visible to shutdown handler)", async () => { + // Regression: previously, an `await trainer.cancel()` that threw + // (network failure / cloud-api 5xx during the cancel POST) was + // *swallowed*, the deferred resolved cleanly, and the runner + // exited 0: the UI declared the run cancelled while the cloud + // job kept running, orphaning GPU spend with no visible error. 
+ // The fix REJECTS the deferred so the runner's + // `installShutdownHandlers` `.catch()` writes the failure to + // stderr, surfacing the issue to the operator. The latch is + // still always settled (resolved or rejected), so shutdown + // doesn't hang waiting for a checkpoint that will never come. + await writeState( + { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" }, + cwd, + ); + const sse = [ + `id: 1\nevent: training.started\ndata: ${JSON.stringify({ + type: "training.started", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:01Z", + })}\n\n`, + `id: 2\nevent: training.log\ndata: ${JSON.stringify({ + type: "training.log", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:02Z", + step: 1, + loss: 0.5, + })}\n\n`, + `id: 3\nevent: checkpoint.saved\ndata: ${JSON.stringify({ + type: "checkpoint.saved", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:03Z", + step: 10, + })}\n\n`, + ]; + let cancelAttempts = 0; + const fetcher: typeof fetch = (async ( + input: RequestInfo | URL, + init?: RequestInit, + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? "GET"; + if (method === "POST" && url.includes("/v1/jobs?")) { + return new Response(JSON.stringify({ job: minimalJobRow }), { + status: 201, + headers: { "content-type": "application/json" }, + }); + } + if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) { + return new Response(sseStream(sse), { + status: 200, + headers: { "content-type": "text/event-stream" }, + }); + } + if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) { + cancelAttempts += 1; + // Simulate the cloud-api being unreachable mid-cancel. + throw new TypeError("fetch failed"); + } + throw new Error(`unexpected fetch: ${method} ${url}`); + }) as typeof fetch; + + // Capture the very-first armed early-stop promise so we can + // assert its settlement state below. 
The trainer is mutually + // recursive with the callback (`onLog` calls + // `requestTrainerEarlyStop(trainer, ...)`), so we declare + // `armedPromise` first as `let` and assign it from `onLog`. + let armedPromise: Promise | null = null; + let armedResult: "resolved" | "rejected" | "pending" = "pending"; + let armedError: unknown = null; + const trainer = createTrainer( + { + name: "run", + model: "m", + dataset: { type: "huggingface", name: "x" }, + callbacks: { + onLog: () => { + // Arm exactly once and capture the returned promise. + // requestTrainerEarlyStop is idempotent across repeat + // calls, but we only need the FIRST armed deferred: + // the cancel-throw rejects exactly that promise. + if (armedPromise === null) { + armedPromise = requestTrainerEarlyStop(trainer, { + timeoutMs: 60_000, + }); + armedPromise.then( + () => { + armedResult = "resolved"; + }, + (err: unknown) => { + armedResult = "rejected"; + armedError = err; + }, + ); + } + }, + }, + }, + { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 }, + ); + const original = globalThis.fetch; + globalThis.fetch = fetcher; + try { + await trainer.wait(); + // Flush microtasks so the .then(resolve, reject) handler + // observes the settlement before we assert. + await new Promise((r) => setImmediate(r)); + } finally { + globalThis.fetch = original; + } + // cancel() was attempted (and threw). + expect(cancelAttempts).toBe(1); + // The armed deferred REJECTED: the runner's `.catch()` would + // see this error and log it to stderr instead of silently + // exiting 0. Critically: it didn't hang on "pending"; the + // failure case still settles, just via reject not resolve.
+ expect(armedResult).toBe("rejected"); + expect(armedError).toBeInstanceOf(TypeError); + expect((armedError as Error).message).toBe("fetch failed"); + }); + + it("early-stop checkpoint branch labels run as `failed` even when cancel throws a falsy value (not `null` discriminant)", async () => { + // Regression: the cancel-failure branch used to be discriminated + // by `cancelError !== null`, but user-side code can legitimately + // `throw null` / `throw 0` / `throw ""`. In those cases the + // captured `cancelError` would still read as falsy / `null` and + // the run would be silently labelled `"cancelled"` even though + // the cancel POST genuinely rejected, lying about cloud-side + // state that may still be running. Fix discriminates via a + // dedicated boolean flag and additionally wraps non-Error + // throws when rejecting the deferred so the SIGTERM handler's + // `.catch(err => err.message)` doesn't crash on a missing + // property. + await writeState( + { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" }, + cwd, + ); + const sse = [ + `id: 1\nevent: training.started\ndata: ${JSON.stringify({ + type: "training.started", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:01Z", + })}\n\n`, + `id: 2\nevent: training.log\ndata: ${JSON.stringify({ + type: "training.log", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:02Z", + step: 1, + loss: 0.5, + })}\n\n`, + `id: 3\nevent: checkpoint.saved\ndata: ${JSON.stringify({ + type: "checkpoint.saved", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:03Z", + step: 10, + })}\n\n`, + ]; + const fetcher: typeof fetch = (async ( + input: RequestInfo | URL, + init?: RequestInit, + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? 
"GET"; + if (method === "POST" && url.includes("/v1/jobs?")) { + return new Response(JSON.stringify({ job: minimalJobRow }), { + status: 201, + headers: { "content-type": "application/json" }, + }); + } + if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) { + return new Response(sseStream(sse), { + status: 200, + headers: { "content-type": "text/event-stream" }, + }); + } + if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) { + // Falsy non-Error throw: under the bug, the run would be + // labelled "cancelled" because `cancelError !== null` is + // false when the catch reassigned `cancelError = null`. + // eslint-disable-next-line no-throw-literal + throw null; + } + throw new Error(`unexpected fetch: ${method} ${url}`); + }) as typeof fetch; + + let armedPromise: Promise | null = null; + let armedResult: "resolved" | "rejected" | "pending" = "pending"; + let armedError: unknown = null; + const trainer = createTrainer( + { + name: "run", + model: "m", + dataset: { type: "huggingface", name: "x" }, + callbacks: { + onLog: () => { + if (armedPromise === null) { + armedPromise = requestTrainerEarlyStop(trainer, { + timeoutMs: 60_000, + }); + armedPromise.then( + () => { + armedResult = "resolved"; + }, + (err: unknown) => { + armedResult = "rejected"; + armedError = err; + }, + ); + } + }, + }, + }, + { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 }, + ); + const original = globalThis.fetch; + globalThis.fetch = fetcher; + let result: Awaited>; + try { + result = await trainer.wait(); + await new Promise((r) => setImmediate(r)); + } finally { + globalThis.fetch = original; + } + // The local job state reflects the cancel FAILURE, not a clean + // cancel. Under the bug this was `"cancelled"`. + expect(result.job.status).toBe("failed"); + // The armed deferred rejected, and the rejection value is + // wrapped in a real Error so downstream `.catch(err => err.message)` + // chains don't crash on `null.message`. 
+ expect(armedResult).toBe("rejected"); + expect(armedError).toBeInstanceOf(Error); + expect((armedError as Error).message).toBe("null"); + }); + + it("resolves the early-stop latch when the run hits a terminal event before the next checkpoint", async () => { + // Regression: previously `requestEarlyStop()`'s deferred was + // only resolved by (a) the checkpoint-triggered cancel branch + // or (b) the timeout fallback. If the run reached + // `training.completed` / `training.failed` *before* another + // checkpoint landed (a common case for short jobs or runs that + // had already saved their last checkpoint when SIGTERM arrived), + // the deferred stayed pending until the (default 5-min) timeout + // fired; the SIGTERM handler in `installShutdownHandlers` + // awaits that promise before exit, so shutdown was delayed up to + // `timeoutMs`. Both terminal branches now settle the latch + // explicitly so the signal path completes immediately when the + // job is already terminal. + await writeState( + { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" }, + cwd, + ); + // started → log (arms early-stop) → completed; no checkpoint.saved + // in between, so the checkpoint-triggered resolution path is *not* + // exercised; only the new terminal-branch settlement is. + const sse = [ + `id: 1\nevent: training.started\ndata: ${JSON.stringify({ + type: "training.started", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:01Z", + })}\n\n`, + `id: 2\nevent: training.log\ndata: ${JSON.stringify({ + type: "training.log", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:02Z", + step: 1, + loss: 0.5, + })}\n\n`, + `id: 3\nevent: training.completed\ndata: ${JSON.stringify({ + type: "training.completed", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:03Z", + artifacts: [], + })}\n\n`, + ]; + + let cancelCalls = 0; + const fetcher: typeof fetch = (async ( + input: RequestInfo | URL, + init?: RequestInit, + ) => { + const url = typeof input === "string" ? 
input : input.toString(); + const method = init?.method ?? "GET"; + if (method === "POST" && url.includes("/v1/jobs?")) { + return new Response(JSON.stringify({ job: minimalJobRow }), { + status: 201, + headers: { "content-type": "application/json" }, + }); + } + if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) { + return new Response(sseStream(sse), { + status: 200, + headers: { "content-type": "text/event-stream" }, + }); + } + if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) { + cancelCalls += 1; + return new Response(JSON.stringify({ ok: true }), { + status: 200, + headers: { "content-type": "application/json" }, + }); + } + throw new Error(`unexpected fetch: ${method} ${url}`); + }) as typeof fetch; + + let stopResolved = false; + const trainer = createTrainer( + { + name: "run", + model: "m", + dataset: { type: "huggingface", name: "x" }, + callbacks: { + onLog: () => { + // Long timeout: if the fix regresses, this test would + // hang for ~60s before the timer fires. With the + // terminal-branch settlement, the deferred resolves the + // moment `training.completed` lands. + void requestTrainerEarlyStop(trainer, { + timeoutMs: 60_000, + }).then(() => { + stopResolved = true; + }); + }, + }, + }, + { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 }, + ); + + const original = globalThis.fetch; + globalThis.fetch = fetcher; + try { + const result = await trainer.wait(); + // Flush microtasks so the .then() chain off `requestEarlyStop` + // observes the resolution before we assert. + await new Promise((r) => setImmediate(r)); + expect(result.job.status).toBe("completed"); + // No cancel POST was issued: the terminal branch just + // releases the latch; it doesn't cancel a run that already + // completed on its own. + expect(cancelCalls).toBe(0); + // The latch resolved via the terminal handler, not via the + // 60-second timeout. 
(The test would simply time out long + // before the timeout fired if this regressed.) + expect(stopResolved).toBe(true); + } finally { + globalThis.fetch = original; + } + }); + + it("settles the early-stop latch even when the user's onCompleted callback throws", async () => { + // Regression: previously `settleEarlyStopLatch()` was called + // *after* awaiting `callbacks.onCompleted` / `onFailed`. A + // thrown user callback propagated out of `dispatch()` before + // the settle ran, leaving `earlyStopDeferred` pending; the + // SIGTERM handler in `installShutdownHandlers` would block on + // that promise until the (default 5-min) timeout fired, + // delaying shutdown for a user-code bug. Wrapping in + // `try/finally` ensures the latch is released regardless, + // while preserving the throw's propagation through `wait()` so + // callers still see the original error. + await writeState( + { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" }, + cwd, + ); + const sse = [ + `id: 1\nevent: training.started\ndata: ${JSON.stringify({ + type: "training.started", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:01Z", + })}\n\n`, + `id: 2\nevent: training.log\ndata: ${JSON.stringify({ + type: "training.log", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:02Z", + step: 1, + loss: 0.5, + })}\n\n`, + `id: 3\nevent: training.completed\ndata: ${JSON.stringify({ + type: "training.completed", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:03Z", + artifacts: [], + })}\n\n`, + ]; + + const fetcher: typeof fetch = (async ( + input: RequestInfo | URL, + init?: RequestInit, + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? 
"GET"; + if (method === "POST" && url.includes("/v1/jobs?")) { + return new Response(JSON.stringify({ job: minimalJobRow }), { + status: 201, + headers: { "content-type": "application/json" }, + }); + } + if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) { + return new Response(sseStream(sse), { + status: 200, + headers: { "content-type": "text/event-stream" }, + }); + } + throw new Error(`unexpected fetch: ${method} ${url}`); + }) as typeof fetch; + + let stopResolved = false; + let stopRejected = false; + const trainer = createTrainer( + { + name: "run", + model: "m", + dataset: { type: "huggingface", name: "x" }, + callbacks: { + onLog: () => { + // Arm early-stop with a long timeout; if the latch + // isn't released by `finally`, this would hang for the + // full 60 seconds. + void requestTrainerEarlyStop(trainer, { + timeoutMs: 60_000, + }).then( + () => { + stopResolved = true; + }, + () => { + stopRejected = true; + }, + ); + }, + onCompleted: () => { + throw new Error("user callback boom"); + }, + }, + }, + { + baseUrl: "http://mock", + credentials: creds, + cwd, + reconnectDelayMs: 1, + // `wait()` catches dispatch throws and routes them through + // its reconnect loop; with the default unbounded retry the + // user-callback throw above would loop forever and the test + // would just time out. Cap retries at 0 so the first thrown + // dispatch surfaces as a `wait()` rejection; that lets us + // observe the *latch* settlement (the actual contract under + // test) cleanly. + maxReconnectAttempts: 0, + }, + ); + + const original = globalThis.fetch; + globalThis.fetch = fetcher; + try { + // The user-callback throw is wrapped by `handleFailure` after + // `maxReconnectAttempts: 0` exhausts; the original error is + // preserved as `cause`. We just need wait() to settle so the + // test doesn't hang. The *body* of the assertion is the + // latch state below. 
+ await expect(trainer.wait()).rejects.toThrow(); + // The latch must have settled (via `finally`) BEFORE wait() + // rejected. Without the `try/finally` around `onCompleted` + // the latch would still be armed → `stopResolved` stays + // false → the test fails (rather than timing out, since + // `maxReconnectAttempts: 0` already unblocks wait()). + await new Promise((r) => setImmediate(r)); + expect(stopResolved).toBe(true); + expect(stopRejected).toBe(false); + } finally { + globalThis.fetch = original; + } + }); + + it("falls back to immediate cancel when no checkpoint arrives within timeoutMs", async () => { + await writeState( + { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" }, + cwd, + ); + // The stream carries only training.started: no checkpoint, and no + // training.completed to finish the run. We hand-roll a stream that never + // ends so the timeout fallback is what actually triggers cancel. + let streamController: ReadableStreamDefaultController | null = + null; + const stallingStream = new ReadableStream({ + start(controller) { + streamController = controller; + const enc = new TextEncoder(); + controller.enqueue( + enc.encode( + `id: 1\nevent: training.started\ndata: ${JSON.stringify({ + type: "training.started", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:01Z", + })}\n\n`, + ), + ); + }, + }); + + let cancelCalls = 0; + const fetcher: typeof fetch = (async ( + input: RequestInfo | URL, + init?: RequestInit, + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? 
"GET"; + if (method === "POST" && url.includes("/v1/jobs?")) { + return new Response(JSON.stringify({ job: minimalJobRow }), { + status: 201, + headers: { "content-type": "application/json" }, + }); + } + if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) { + return new Response(stallingStream, { + status: 200, + headers: { "content-type": "text/event-stream" }, + }); + } + if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) { + cancelCalls += 1; + // Closing the stream now mimics cloud-api's response to a cancel: + // the SSE channel ends and wait() exits its loop. + streamController?.close(); + return new Response(JSON.stringify({ ok: true }), { + status: 200, + headers: { "content-type": "application/json" }, + }); + } + throw new Error(`unexpected fetch: ${method} ${url}`); + }) as typeof fetch; + + const trainer = createTrainer( + { + name: "run", + model: "m", + dataset: { type: "huggingface", name: "x" }, + }, + { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 }, + ); + + const original = globalThis.fetch; + globalThis.fetch = fetcher; + try { + await trainer.start(); + // Tiny timeout so the test doesn't actually wait 5 minutes. + await requestTrainerEarlyStop(trainer, { timeoutMs: 5 }); + expect(cancelCalls).toBe(1); + // Regression: the timeout fallback used to leave + // `earlyStopRequested = true` and `startedJob.status = + // "running"`. A subsequent `requestEarlyStop()` call would + // then re-arm a fresh timer and re-issue cancel even though + // the early-stop already fired. With the latch reset and + // local terminal-status update mirroring the + // checkpoint-triggered branch, the second call hits the + // TERMINAL_STATUSES short-circuit and is a true no-op. 
+ await requestTrainerEarlyStop(trainer, { timeoutMs: 5 }); + expect(cancelCalls).toBe(1); + } finally { + globalThis.fetch = original; + } + }); + + it("timeout fallback rejects the deferred when cancel() throws (visible to shutdown handler)", async () => { + // Companion to the checkpoint-branch reject test: when no + // checkpoint arrives within `timeoutMs`, the timeout fallback + // does its own `trainer.cancel()`. Old code swallowed cancel + // errors and ALWAYS resolved the deferred: same false-success + // failure mode as the checkpoint branch had: local runner + // exits cleanly while the cloud job keeps consuming GPU + // budget. The fix mirrors the checkpoint reject path: capture + // the error and reject the deferred so the runner's + // `.catch()` writes it to stderr. + await writeState( + { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" }, + cwd, + ); + let streamController: ReadableStreamDefaultController | null = + null; + const stallingStream = new ReadableStream({ + start(controller) { + streamController = controller; + const enc = new TextEncoder(); + controller.enqueue( + enc.encode( + `id: 1\nevent: training.started\ndata: ${JSON.stringify({ + type: "training.started", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:01Z", + })}\n\n`, + ), + ); + }, + }); + + let cancelCalls = 0; + const fetcher: typeof fetch = (async ( + input: RequestInfo | URL, + init?: RequestInit, + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? 
"GET"; + if (method === "POST" && url.includes("/v1/jobs?")) { + return new Response(JSON.stringify({ job: minimalJobRow }), { + status: 201, + headers: { "content-type": "application/json" }, + }); + } + if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) { + return new Response(stallingStream, { + status: 200, + headers: { "content-type": "text/event-stream" }, + }); + } + if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) { + cancelCalls += 1; + // Close the stream so wait() exits its loop even though we + // throw on the cancel POST itself. + streamController?.close(); + // Simulate cloud-api unreachable mid-cancel (transport). + throw new TypeError("fetch failed"); + } + throw new Error(`unexpected fetch: ${method} ${url}`); + }) as typeof fetch; + + const trainer = createTrainer( + { + name: "run", + model: "m", + dataset: { type: "huggingface", name: "x" }, + }, + { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 }, + ); + const original = globalThis.fetch; + globalThis.fetch = fetcher; + try { + await trainer.start(); + // Tiny timeout so the timeout fallback fires fast (no + // checkpoint will land; stream only carries + // training.started). The returned promise should REJECT + // because the cancel POST throws. + await expect( + requestTrainerEarlyStop(trainer, { timeoutMs: 5 }), + ).rejects.toThrow(/fetch failed/); + expect(cancelCalls).toBe(1); + } finally { + globalThis.fetch = original; + } + }); + + it("is a no-op before start() and resolves immediately", async () => { + const trainer = createTrainer( + { + name: "run", + model: "m", + dataset: { type: "huggingface", name: "x" }, + }, + { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 }, + ); + // Should resolve without contacting cloud-api at all. 
+ await requestTrainerEarlyStop(trainer, { timeoutMs: 1 }); + }); + + it("waits out an in-flight start() so a SIGTERM during create-job can still cancel the new job", async () => { + // Codex P1 regression: `start()` sets `scope` *before* awaiting + // `client.createJob`, so there's a real window where the cloud + // job is being created but `startedJob` is still null. If a + // runner-side SIGTERM lands in that window, an immediate + // "no-op" early-stop would let `installShutdownHandlers` exit + // the process, leaving the just-created cloud job running + // with no cancel POST. The fix is to await the in-flight + // `start()` promise inside `requestEarlyStop()` so the cancel + // path sees a definite job id (or a definite start failure). + await writeState( + { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" }, + cwd, + ); + let cancelCalls = 0; + let releaseCreateJob!: () => void; + const createJobReleased = new Promise((resolve) => { + releaseCreateJob = resolve; + }); + const fetcher: typeof fetch = (async ( + input: RequestInfo | URL, + init?: RequestInit, + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? "GET"; + if (method === "POST" && url.includes("/v1/jobs?")) { + // Hold createJob open so we can fire `requestEarlyStop` + // mid-flight. Once the test releases the gate, return a + // valid job: that establishes the post-create state + // requestEarlyStop should then act on (cancel POST). 
+ await createJobReleased; + return new Response(JSON.stringify({ job: minimalJobRow }), { + status: 201, + headers: { "content-type": "application/json" }, + }); + } + if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) { + cancelCalls += 1; + return new Response(JSON.stringify({ ok: true }), { + status: 200, + headers: { "content-type": "application/json" }, + }); + } + throw new Error(`unexpected fetch: ${method} ${url}`); + }) as typeof fetch; + + const trainer = createTrainer( + { + name: "run", + model: "m", + dataset: { type: "huggingface", name: "x" }, + }, + { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 }, + ); + + const original = globalThis.fetch; + globalThis.fetch = fetcher; + try { + // Fire start() but DON'T await; its createJob is gated. + const startPromise = trainer.start(); + // Yield once so the start microtasks queue up to the + // `await client.createJob`. + await new Promise((r) => setImmediate(r)); + // requestEarlyStop fires while start() is mid-flight. With + // the fix it awaits start() rather than no-op'ing immediately. + // Tiny `timeoutMs` so once `start()` resolves the latch's + // timeout-fallback fires the cancel POST quickly. There's no + // SSE stream in this test, so the checkpoint-driven path + // never arrives. We're testing the "stop awaited start()" leg + // of the contract, not the checkpoint plumbing. + const stopPromise = requestTrainerEarlyStop(trainer, { + timeoutMs: 50, + }); + // Sanity: stop hasn't resolved yet; it's blocked on + // start() which is blocked on createJob. + let stopSettled = false; + void stopPromise.then(() => { + stopSettled = true; + }); + await new Promise((r) => setImmediate(r)); + expect(stopSettled).toBe(false); + // Release createJob → start() resolves → stop() proceeds. + releaseCreateJob(); + await startPromise; + await stopPromise; + // The deciding behaviour: cancel POST was issued because the + // stop awaited start() and saw a real job id. 
Without the + // in-flight gate, stop would have returned immediately on + // the null `startedJob`, no cancel POST, cloud job orphaned. + expect(cancelCalls).toBe(1); + } finally { + globalThis.fetch = original; + } + }); + + it("replaceTrainerCallbacks (internal HMR brand) swaps the dispatched callbacks on the next event", async () => { + await writeState( + { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" }, + cwd, + ); + const sse = [ + `id: 1\nevent: training.started\ndata: ${JSON.stringify({ + type: "training.started", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:01Z", + })}\n\n`, + `id: 2\nevent: training.log\ndata: ${JSON.stringify({ + type: "training.log", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:02Z", + step: 1, + loss: 1, + })}\n\n`, + `id: 3\nevent: training.log\ndata: ${JSON.stringify({ + type: "training.log", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:03Z", + step: 2, + loss: 0.5, + })}\n\n`, + `id: 4\nevent: training.completed\ndata: ${JSON.stringify({ + type: "training.completed", + jobId: "j-stop", + timestamp: "2026-01-01T00:00:04Z", + })}\n\n`, + ]; + const fetcher: typeof fetch = (async ( + input: RequestInfo | URL, + init?: RequestInit, + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? 
"GET"; + if (method === "POST" && url.includes("/v1/jobs?")) { + return new Response(JSON.stringify({ job: minimalJobRow }), { + status: 201, + headers: { "content-type": "application/json" }, + }); + } + if (method === "GET" && url.includes("/v1/jobs/j-stop/events/stream")) { + return new Response(sseStream(sse), { + status: 200, + headers: { "content-type": "text/event-stream" }, + }); + } + throw new Error(`unexpected fetch: ${method} ${url}`); + }) as typeof fetch; + + const calls: string[] = []; + const trainer = createTrainer( + { + name: "run", + model: "m", + dataset: { type: "huggingface", name: "x" }, + callbacks: { + onLog: ({ step }) => { + calls.push(`v1:onLog(${step})`); + // After the first onLog call, swap to v2 callbacks via the + // internal `Symbol.for("arkor.trainer.replaceCallbacks")` + // brand (the same brand `arkor dev`'s SIGUSR2 handler + // uses). The next event must dispatch via the new object. + if (step === 1) { + replaceTrainerCallbacks(trainer, { + onLog: ({ step: s }) => void calls.push(`v2:onLog(${s})`), + }); + } + }, + }, + }, + { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 }, + ); + const original = globalThis.fetch; + globalThis.fetch = fetcher; + try { + await trainer.wait(); + } finally { + globalThis.fetch = original; + } + expect(calls).toEqual(["v1:onLog(1)", "v2:onLog(2)"]); + }); + + it("is idempotent: repeated calls share the same in-flight promise", async () => { + await writeState( + { orgSlug: "anon-org", projectSlug: "proj", projectId: "p1" }, + cwd, + ); + let cancelCalls = 0; + const fetcher: typeof fetch = (async ( + input: RequestInfo | URL, + init?: RequestInit, + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? 
"GET"; + if (method === "POST" && url.includes("/v1/jobs?")) { + return new Response(JSON.stringify({ job: minimalJobRow }), { + status: 201, + headers: { "content-type": "application/json" }, + }); + } + if (method === "POST" && url.includes("/v1/jobs/j-stop/cancel")) { + cancelCalls += 1; + return new Response(JSON.stringify({ ok: true }), { + status: 200, + headers: { "content-type": "application/json" }, + }); + } + throw new Error(`unexpected fetch: ${method} ${url}`); + }) as typeof fetch; + + const trainer = createTrainer( + { + name: "run", + model: "m", + dataset: { type: "huggingface", name: "x" }, + }, + { baseUrl: "http://mock", credentials: creds, cwd, reconnectDelayMs: 1 }, + ); + const original = globalThis.fetch; + globalThis.fetch = fetcher; + try { + await trainer.start(); + const a = requestTrainerEarlyStop(trainer, { timeoutMs: 5 }); + const b = requestTrainerEarlyStop(trainer, { timeoutMs: 5 }); + await Promise.all([a, b]); + // The fallback timer fires once, so cancel is called once even though + // the early-stop brand was invoked twice. + expect(cancelCalls).toBe(1); + } finally { + globalThis.fetch = original; + } + }); +}); diff --git a/packages/arkor/src/core/trainer.ts b/packages/arkor/src/core/trainer.ts index 7c9f9662..c6f99870 100644 --- a/packages/arkor/src/core/trainer.ts +++ b/packages/arkor/src/core/trainer.ts @@ -6,19 +6,28 @@ import { type Credentials, } from "./credentials"; import { ensureProjectState } from "./projectState"; +import { + attachTrainerCallbackReplacer, + attachTrainerEarlyStopper, + attachTrainerInspection, + type RequestEarlyStopOptions, +} from "./trainerInspection"; import type { CheckpointContext, InferArgs, JobConfig, Trainer, + TrainerCallbacks, TrainerInput, TrainingJob, TrainingLogContext, TrainingResult, } from "./types"; +const TERMINAL_STATUSES = new Set(["completed", "failed", "cancelled"]); + /** - * Internal runtime context. 
Not part of the public API surface — exposed only + * Internal runtime context. Not part of the public API surface; exposed only * for tests and advanced power-user scenarios that need to inject a mock * `fetch` or override the working directory. * @@ -111,7 +120,7 @@ function buildJobConfig(input: TrainerInput): JobConfig { /** * Build a `Trainer` bound to the user's configuration. * - * Public signature: `createTrainer(input)` — runtime options like + * Public signature: `createTrainer(input)`. Runtime options like * `baseUrl` / `credentials` / `cwd` come from the environment and `.arkor/` * state, never from user code. The optional second argument is reserved for * tests and advanced overrides. @@ -144,6 +153,63 @@ export function createTrainer( let startedJob: TrainingJob | null = null; let scope: { orgSlug: string; projectSlug: string } | null = null; let clientPromise: Promise | null = null; + // In-flight `start()` promise: non-null between the first + // `client.createJob` call and the `startedJob` assignment. Lets + // `requestEarlyStop()` detect the "scope set but startedJob still + // null" window (`scope` is needed by `client.createJob` so we set + // it before the await) and wait out the create-job POST so a + // SIGTERM landing in that window can still drive a clean cancel + // once the job id materialises. Without this gate the early-stop + // path would no-op, the runner would `process.exit(0)`, and the + // newly created cloud job would orphan with no cancel POST. + let startInFlight: Promise | null = null; + + // Mutable callbacks slot. Each `dispatch()` invocation reads this + // fresh, so the rotation triggered by the + // `Symbol.for("arkor.trainer.replaceCallbacks")` brand + // (`replaceTrainerCallbacks` in `core/trainerInspection.ts`) takes + // effect on the next event. Events already mid-await keep their + // old reference until they resolve, which matches the "replace, + // don't interrupt" contract. 
Public `Trainer` deliberately doesn't + // expose this; it's a dev-only HMR primitive driven by the + // SIGUSR2 path in `core/runnerSignals.ts`. + let currentCallbacks: Partial = input.callbacks ?? {}; + + // Early-stop state. `requestEarlyStop()` arms the latch; the next + // `checkpoint.saved` dispatch (or the timeout, whichever fires first) + // calls cancel() and resolves the deferred. Idempotent across repeat + // calls (they share the same deferred). + const DEFAULT_EARLY_STOP_TIMEOUT_MS = 5 * 60 * 1000; + let earlyStopDeferred: { + promise: Promise; + resolve: () => void; + reject: (err: unknown) => void; + timer: NodeJS.Timeout | null; + } | null = null; + let earlyStopRequested = false; + + /** + * Drop the early-stop latch (clear timer + resolve deferred + reset + * the request flag). Called from any path that means "wait()'s + * cancel-after-checkpoint promise is no longer waiting on anything" + * (the checkpoint-driven cancel branch, the terminal `completed` + * / `failed` branches, and the up-front guard in + * `requestEarlyStop()` when the job is already terminal). Without + * this called from terminal branches, a `requestEarlyStop()` armed + * mid-run that races a `training.completed` / `training.failed` + * before the next `checkpoint.saved` would leave the deferred + * pending until the (default 5-min) timeout fires; the SIGTERM + * handler in `installShutdownHandlers` would block on that promise + * and delay shutdown for up to `timeoutMs`. + */ + function settleEarlyStopLatch(): void { + if (earlyStopDeferred) { + if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer); + earlyStopDeferred.resolve(); + earlyStopDeferred = null; + } + earlyStopRequested = false; + } async function getClient(): Promise { if (!clientPromise) { @@ -168,7 +234,7 @@ export function createTrainer( * many SDK clients retry at once. 
* * The final value is clamped at `maxReconnectDelayMs` because jitter - * sits *outside* the exponential clamp — without the outer clamp, a + * sits *outside* the exponential clamp; without the outer clamp, a * long outage where `exp` already hit the cap could wait up to 1.25 × * the documented cap when `Math.random()` lands near 1. */ @@ -204,7 +270,10 @@ export function createTrainer( throw new Error("Trainer is in an inconsistent state"); } const client = await getClient(); - const callbacks = input.callbacks ?? {}; + // Read once per dispatch so a `replaceCallbacks` between events takes + // effect on the next dispatch, but doesn't change identity inside a + // single in-flight handler. + const callbacks = currentCallbacks; switch (event.type) { case "training.started": { @@ -255,7 +324,139 @@ export function createTrainer( infer, artifacts: event.artifacts, }; - await callbacks.onCheckpoint?.(ctx); + // Capture (don't propagate yet) any throw from the user's + // `onCheckpoint`. The early-stop branch below MUST run + // even on a callback throw; without this wrap a thrown + // `onCheckpoint` would skip the cancel + latch settlement, + // leaving the SIGTERM handler waiting on the deferred + // until the (default 5-min) timeout fires. Surface the + // original throw via re-throw at the end so `wait()`'s + // reconnect / failure path keeps its existing semantics. + // Discriminant for the user-callback-threw branch. Tracked as + // a separate boolean (not `onCheckpointError !== null`) because + // user code can legitimately `throw null` / `throw 0` / + // `throw ""`; the truthiness of `onCheckpointError` would then + // be indistinguishable from the no-error path, and the re-throw + // at the end would silently swallow the user's falsy throw + // (callback's "I want to stop" signal gets dropped on the floor). 
+ let onCheckpointError: unknown = null; + let onCheckpointThrew = false; + try { + await callbacks.onCheckpoint?.(ctx); + } catch (err) { + onCheckpointError = err; + onCheckpointThrew = true; + } + // Early-stop latch: a checkpoint just landed, so the in-flight work + // is durable. Cancel the cloud job and end `wait()` cleanly. + if (earlyStopRequested && earlyStopDeferred) { + // Capture the cancel error (if any) but DON'T swallow + // silently; propagate via the deferred's reject path so + // the runner's `installShutdownHandlers` `.catch()` writes + // the failure to stderr. The previous swallow let a + // transient cloud-api failure during early-stop appear + // as a clean cancel: the local runner exited 0, the UI + // declared the run cancelled, but the cloud job kept + // running (continued GPU spend). Keeping the error + // visible to the shutdown handler lets the operator see + // it and intervene. + // + // We still mark `startedJob.status` terminal locally + // either way: from the runner's perspective the run is + // over, and a subsequent `requestEarlyStop()` call must + // hit the `TERMINAL_STATUSES.has(...)` short-circuit + // (re-arming a fresh latch on a dead run would hang + // shutdown). + let cancelError: unknown = null; + // Discriminant for the cancel-failure branch. Tracked as a + // separate boolean (not `cancelError !== null`) because + // user code can legitimately `throw null` / `throw 0` / + // `throw ""`; the truthiness of `cancelError` would then + // be indistinguishable from the no-error path and the run + // would be silently labelled `"cancelled"` even when the + // cancel POST genuinely rejected. + let cancelFailed = false; + try { + await trainer.cancel(); + } catch (err) { + cancelError = err; + cancelFailed = true; + } + // Reflect the cancellation locally so `wait()`'s resolved + // `TrainingResult.job.status` is a terminal status (per the + // documented contract). 
Without this update the result would + // surface as `status: "running"`, and a subsequent + // `requestEarlyStop` would not see the + // `TERMINAL_STATUSES.has(...)` short-circuit it relies on. + // + // Status is `"failed"` when the cancel POST itself threw + // (cloud-api transient failure mid-cancel): labelling + // such runs `"cancelled"` would lie about the cloud-side + // state, which may still be running. `"failed"` is + // terminal too, so the latch / TERMINAL_STATUSES short- + // circuit still works, but `wait()`'s caller can + // distinguish "we cancelled cleanly" from "we tried but + // the cancel may not have landed". The original cancel + // error is also rejected through the deferred below for + // the SIGTERM handler's `.catch()`. + startedJob = { + ...startedJob, + status: cancelFailed ? "failed" : "cancelled", + ...(cancelFailed && { + error: `Early-stop cancel failed: ${ + cancelError instanceof Error + ? cancelError.message + : String(cancelError) + }`, + }), + completedAt: event.timestamp, + }; + if (cancelFailed) { + // Reject (not resolve) the latch. Mirrors the success + // path's bookkeeping (clear timer, null out shared + // slot, drop the request flag) so a follow-up + // `requestEarlyStop()` won't piggyback on the rejected + // promise. + if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer); + // Wrap if user threw a non-Error so the deferred + // consumer always receives an Error instance. `throw 0` + // would otherwise reject the deferred with `0`, and the + // SIGTERM handler's `.catch(err => ...err.message)` would + // crash on the missing property. + earlyStopDeferred.reject( + cancelError instanceof Error + ? cancelError + : new Error(String(cancelError)), + ); + earlyStopDeferred = null; + earlyStopRequested = false; + } else { + settleEarlyStopLatch(); + } + // Return the *checkpoint's* artifacts (the ones the user + // just saved): that's the work HMR went out of its way + // to preserve before issuing cancel(). 
The previous + // `terminalResult?.artifacts ?? []` always resolved to + // `[]` because `wait()` calls `dispatch(parsed, null)` so + // `terminalResult` is never populated. Effect: an + // HMR-driven early-stop resolved `wait()` with empty + // `artifacts` even though the checkpoint event carried + // the very artifacts the early-stop existed to keep. + // Surface the user's `onCheckpoint` throw (if any) so + // `wait()`'s reconnect / failure path keeps the same + // semantics it had before the wrap: the checkpoint + // workload is preserved, but the user still sees their + // callback error. + if (onCheckpointThrew) throw onCheckpointError; + return { + terminal: true, + artifacts: (event.artifacts ?? []) as unknown[], + }; + } + // Same re-throw on the non-early-stop branch: keep + // `wait()`'s reconnect loop seeing the user's original + // callback error so reconnection counters work as before. + if (onCheckpointThrew) throw onCheckpointError; return { terminal: false, artifacts: terminalResult?.artifacts ?? [] }; } case "training.completed": { @@ -265,7 +466,19 @@ export function createTrainer( completedAt: event.timestamp, }; const artifacts = (event.artifacts ?? []) as unknown[]; - await callbacks.onCompleted?.({ job: startedJob, artifacts }); + // `try/finally` so the latch settles even when the user's + // `onCompleted` callback throws: otherwise a thrown + // callback would leave `earlyStopDeferred` pending and the + // SIGTERM handler awaiting `requestEarlyStop()` would block + // until the timeout (default 5 min). The throw still + // propagates through `dispatch()` → `wait()` so callers see + // the original error; we just don't strand the shutdown + // path along with it. 
+ try { + await callbacks.onCompleted?.({ job: startedJob, artifacts }); + } finally { + settleEarlyStopLatch(); + } return { terminal: true, artifacts }; } case "training.failed": { @@ -275,7 +488,14 @@ export function createTrainer( error: event.error, completedAt: event.timestamp, }; - await callbacks.onFailed?.({ job: startedJob, error: event.error }); + // Symmetric to the `completed` branch above: terminal + // status settles the latch even when the run failed *and* + // the user's `onFailed` callback itself throws. + try { + await callbacks.onFailed?.({ job: startedJob, error: event.error }); + } finally { + settleEarlyStopLatch(); + } return { terminal: true, artifacts: [] }; } } @@ -286,18 +506,46 @@ export function createTrainer( async start() { if (startedJob) return { jobId: startedJob.id }; - const client = await getClient(); - const state = await resolveProjectState(client); - scope = { orgSlug: state.orgSlug, projectSlug: state.projectSlug }; - - const { job } = await client.createJob({ - orgSlug: state.orgSlug, - projectSlug: state.projectSlug, - name: input.name, - config, - }); - startedJob = job; - return { jobId: job.id }; + // Already-pending start: reuse the in-flight promise so a + // concurrent caller (notably `requestEarlyStop` awaiting it + // to close the SIGTERM-during-create-job race) doesn't issue + // a second `client.createJob` POST. Awaiting the shared promise + // hands every caller the same job once it resolves. + if (startInFlight) { + const job = await startInFlight; + return { jobId: job.id }; + } + // Track the pending creation so `requestEarlyStop()` can + // detect the "started but not yet recorded" window and wait + // out the `client.createJob` POST. We set `scope` *before* + // the await (it's needed by the await itself), so a SIGTERM + // landing during the await would otherwise see + // `!startedJob && scope` and exit immediately, leaving the + // newly created cloud job uncancelled. 
+ const startPromise = (async () => { + const client = await getClient(); + const state = await resolveProjectState(client); + scope = { orgSlug: state.orgSlug, projectSlug: state.projectSlug }; + const { job } = await client.createJob({ + orgSlug: state.orgSlug, + projectSlug: state.projectSlug, + name: input.name, + config, + }); + startedJob = job; + return job; + })(); + startInFlight = startPromise; + try { + const job = await startPromise; + return { jobId: job.id }; + } finally { + // Clear regardless of resolve/reject so a failed start can + // be retried (the caller decides), and a successful one + // doesn't pin a stale promise on the trainer for the rest + // of its lifetime. + startInFlight = null; + } }, async wait(): Promise { @@ -347,7 +595,7 @@ export function createTrainer( try { for await (const sse of iterateEvents(response)) { // Any frame from the server (including pings) means we're - // connected and making progress — reset the failure counter + // connected and making progress; reset the failure counter // so subsequent transient blips get the full retry budget. receivedAny = true; attempt = 0; @@ -378,7 +626,7 @@ export function createTrainer( if (terminal) break; if (receivedAny) { - // Stream had real activity then closed cleanly. Not a failure — + // Stream had real activity then closed cleanly. Not a failure; // reconnect with Last-Event-ID at the base delay (no exponential // backoff, no counter increment). await delay(initialReconnectDelayMs, abortSignal); @@ -404,5 +652,153 @@ export function createTrainer( }, }; + /** + * Internal "stop after next checkpoint" entry point. Hidden behind a + * `Symbol.for` brand so the runner subprocess's SIGTERM handler (in + * `runnerSignals.ts`) can drive a graceful early-stop without us + * exposing the operation on the public `Trainer` interface. User code + * that wants the same semantics should compose `abortSignal` + + * `cancel()` per `docs/cookbook/early-stopping.mdx`. 
+ */ + async function requestEarlyStop( + opts: RequestEarlyStopOptions = {}, + ): Promise<void> { + // SIGTERM-during-create-job race: a runner-side SIGTERM can land + // between `start()`'s `scope = { … }` assignment and its + // `client.createJob(...)` resolution, with `startedJob` still + // null but a real cloud job about to exist. Treating that window + // as "nothing in flight" would `process.exit(0)` immediately + // after this returns, leaving the newly created cloud job + // running with no cancel POST. Awaiting `startInFlight` collapses + // the race onto a definite startedJob (success) or a definite + // start failure (rejection); either way the branches below + // can decide on real state. Swallow the rejection: if `start()` + // failed there's nothing to cancel anyway. + if (startInFlight) { + try { + await startInFlight; + } catch { + // intentionally ignored: failed start has no job to cancel + } + } + // Nothing in flight: clean up any prior latch and resolve. + if (!startedJob || !scope || TERMINAL_STATUSES.has(startedJob.status)) { + settleEarlyStopLatch(); + return; + } + // Idempotent: a second call piggybacks on the first. + if (earlyStopDeferred) return earlyStopDeferred.promise; + + earlyStopRequested = true; + let resolveFn!: () => void; + let rejectFn!: (err: unknown) => void; + const promise = new Promise<void>((resolve, reject) => { + resolveFn = resolve; + rejectFn = reject; + }); + const timeoutMs = opts.timeoutMs ?? DEFAULT_EARLY_STOP_TIMEOUT_MS; + const timer = setTimeout(() => { + // Timed out waiting for a checkpoint; fall back to immediate cancel. + // Capture the active deferred reference: by the time the cancel POST + // resolves, the checkpoint branch may have nulled out the shared + // slot, but this fallback path still owns the deferred it created. + const active = earlyStopDeferred; + // Capture (don't swallow) any cancel error so we can surface it + // through the deferred's reject path. 
Mirrors the checkpoint + // branch: a swallow here lets the runner's + // `installShutdownHandlers` exit "successfully" while the cloud + // job lives on (orphaned GPU spend with zero diagnostic), the + // exact failure mode that a "stop-after-checkpoint" deadline + // exists to PREVENT from going silent. + let cancelError: unknown = null; + // See the checkpoint-branch comment: tracked separately from + // `cancelError` so a user `throw null` / `throw 0` doesn't + // silently downgrade the cancel-failure path to "clean + // cancel". + let cancelFailed = false; + trainer + .cancel() + .catch((err) => { + cancelError = err; + cancelFailed = true; + }) + .finally(() => { + // Mirror the checkpoint-triggered early-stop branch: reset + // the latch and reflect the cancellation locally so a + // second `requestEarlyStop()` call is a no-op (instead of + // re-arming a fresh timer + re-issuing cancel) and so + // `wait()`'s eventual resolution exposes a terminal status. + // Without this, a long-lived trainer left in + // `earlyStopRequested = true` would re-cancel on every + // future checkpoint event for the rest of its lifetime. + earlyStopRequested = false; + if (startedJob && !TERMINAL_STATUSES.has(startedJob.status)) { + // Symmetric to the checkpoint branch: `"failed"` (not + // `"cancelled"`) on cancel-throw so we don't lie + // about cloud-side state that may still be running. + // Both branches feed the same TERMINAL_STATUSES + // short-circuit, so re-armed `requestEarlyStop()` + // calls still no-op correctly. + startedJob = { + ...startedJob, + status: cancelFailed ? "failed" : "cancelled", + ...(cancelFailed && { + error: `Early-stop cancel failed: ${ + cancelError instanceof Error + ? 
cancelError.message + : String(cancelError) + }`, + }), + completedAt: new Date().toISOString(), + }; + } + if (active) { + // Resolve on success, REJECT on cancel failure so the + // SIGTERM handler's `.catch()` writes the error to + // stderr and the operator can see that the cloud job + // may still be live. The latch always settles either + // way; shutdown won't hang. + if (cancelFailed) { + active.reject( + cancelError instanceof Error + ? cancelError + : new Error(String(cancelError)), + ); + } else { + active.resolve(); + } + } + if (earlyStopDeferred === active) earlyStopDeferred = null; + }); + }, timeoutMs); + // `Timer.unref` keeps the early-stop timer from blocking process exit + // when the host runtime finishes for unrelated reasons. + timer.unref?.(); + earlyStopDeferred = { + promise, + resolve: resolveFn, + reject: rejectFn, + timer, + }; + return promise; + } + + // Brand the trainer with the HMR control surface so the Studio server + // can (a) hash the cloud-side config to decide between hot-swap and + // restart, (b) atomically swap the callbacks cell from the runner + // subprocess on SIGUSR2, and (c) drive a graceful "stop after the + // next checkpoint" on SIGTERM. All three brands live behind + // `Symbol.for` keys so they don't appear on the public `Trainer` + // interface (see `trainerInspection.ts` for the rationale). + attachTrainerInspection(trainer, () => ({ + name: input.name, + config, + callbacks: currentCallbacks, + })); + attachTrainerCallbackReplacer(trainer, (callbacks) => { + currentCallbacks = callbacks ?? 
{}; + }); + attachTrainerEarlyStopper(trainer, requestEarlyStop); + return trainer; } diff --git a/packages/arkor/src/core/trainerInspection.test.ts b/packages/arkor/src/core/trainerInspection.test.ts new file mode 100644 index 00000000..cb3c11cc --- /dev/null +++ b/packages/arkor/src/core/trainerInspection.test.ts @@ -0,0 +1,249 @@ +import { describe, expect, it, vi } from "vitest"; +import { createArkor } from "./arkor"; +import { createTrainer } from "./trainer"; +import { + findInspectableTrainer, + findTrainerInModule, + getTrainerInspection, + replaceTrainerCallbacks, + requestTrainerEarlyStop, +} from "./trainerInspection"; +import type { Trainer } from "./types"; + +function brandedTrainer(name: string) { + // Real `createTrainer` attaches the inspection brand. We only need + // a no-op trainer for these shape tests; `start`/`wait` etc. are + // never invoked. + return createTrainer({ + name, + model: "m", + dataset: { type: "huggingface", name: "x" }, + }); +} + +function unbrandedTrainer(name: string) { + // Hand-rolled trainer: passes the `start`/`wait`/`cancel` shape + // check `findTrainerInModule` requires but DOESN'T carry the SDK + // inspection brand. Mirrors a user who wraps or re-exports a + // trainer outside the SDK helpers. 
+ return { + name, + start: async () => ({ jobId: "j" }), + wait: async () => ({ job: {}, artifacts: [] }), + cancel: async () => {}, + }; +} + +describe("findTrainerInModule (trainer-shape walk)", () => { + it("finds shape #1: createArkor named export", () => { + const trainer = brandedTrainer("a"); + const found = findTrainerInModule({ arkor: createArkor({ trainer }) }); + expect(found).toBe(trainer); + }); + + it("finds shape #2: bare `trainer` named export", () => { + const trainer = brandedTrainer("b"); + const found = findTrainerInModule({ trainer }); + expect(found).toBe(trainer); + }); + + it("finds shape #3: default-export Arkor manifest", () => { + const trainer = brandedTrainer("c"); + const found = findTrainerInModule({ default: createArkor({ trainer }) }); + expect(found).toBe(trainer); + }); + + it("finds shape #4: default IS the Trainer", () => { + // Regression: `runner.ts`'s `extractTrainer` accepts + // `export default createTrainer(...)` directly (the trainer + // object itself becomes `mod.default`), but Studio's manifest / + // HMR walk previously skipped this shape. Result: a project that + // ran fine under `arkor start` showed as "no trainer" in Studio + // and HMR forced a SIGTERM-restart on every rebuild because + // `configHash` came back null. 
+ const trainer = brandedTrainer("d"); + const found = findTrainerInModule({ default: trainer }); + expect(found).toBe(trainer); + }); + + it("finds shape #5: default.trainer nested", () => { + const trainer = brandedTrainer("e"); + const found = findTrainerInModule({ default: { trainer } }); + expect(found).toBe(trainer); + }); + + it("works for hand-rolled (unbranded) trainers in any of the five shapes", () => { + const trainer = unbrandedTrainer("manual"); + expect(findTrainerInModule({ trainer })?.name).toBe("manual"); + expect(findTrainerInModule({ default: trainer })?.name).toBe("manual"); + expect(findTrainerInModule({ default: { trainer } })?.name).toBe("manual"); + }); + + it("returns null when no candidate looks like a trainer", () => { + expect(findTrainerInModule({})).toBeNull(); + expect(findTrainerInModule({ arkor: {} })).toBeNull(); + expect(findTrainerInModule({ trainer: { name: "no-methods" } })).toBeNull(); + expect(findTrainerInModule({ default: 42 })).toBeNull(); + }); +}); + +describe("findInspectableTrainer (brand-required path)", () => { + it("returns the inspection snapshot for a branded trainer in any shape", () => { + // Regression: previously HMR's `inspectBundle` only checked + // `mod.arkor ?? mod.default`, missing shapes #2 and #4. As a + // result, projects bare-exporting `trainer` always produced + // `configHash: null` and HMR conservatively SIGTERM-restarted on + // every rebuild, never hot-swapping callbacks. The fix routes + // through `findInspectableTrainer` which walks every supported + // shape via `findTrainerInModule` and pulls inspection off the + // discovered trainer. 
+ const trainerA = brandedTrainer("from-arkor"); + const inspectionA = findInspectableTrainer({ + arkor: createArkor({ trainer: trainerA }), + }); + expect(inspectionA?.name).toBe("from-arkor"); + + const trainerB = brandedTrainer("bare-named"); + const inspectionB = findInspectableTrainer({ trainer: trainerB }); + expect(inspectionB?.name).toBe("bare-named"); + + const trainerC = brandedTrainer("default-arkor"); + const inspectionC = findInspectableTrainer({ + default: createArkor({ trainer: trainerC }), + }); + expect(inspectionC?.name).toBe("default-arkor"); + + const trainerD = brandedTrainer("default-nested"); + const inspectionD = findInspectableTrainer({ + default: { trainer: trainerD }, + }); + expect(inspectionD?.name).toBe("default-nested"); + }); + + it("returns null when only an unbranded trainer is present", () => { + // Hand-rolled trainers don't carry the SDK inspection brand, so + // HMR can't compute their `configHash`. The Studio still shows + // the trainer name (via `findTrainerInModule` in + // `summariseBuiltManifest`), but HMR routing falls back to the + // SIGTERM-restart-everything path, which is the documented + // safe behaviour when configs can't be diffed. + const trainer = unbrandedTrainer("plain"); + expect(findInspectableTrainer({ trainer })).toBeNull(); + expect(getTrainerInspection(trainer)).toBeNull(); + }); + + it("does NOT walk past an unbranded first candidate to inspect a later branded one (runTrainer parity)", () => { + // Regression: a previous implementation looped every trainer- + // shaped candidate and returned the first one carrying the + // inspection brand. But `runTrainer`'s `extractTrainer` always + // executes the FIRST candidate (precedence: `mod.arkor` → + // `mod.trainer` → `mod.default`...), regardless of brand. 
A module + // that exported both an unbranded `trainer` (shape #2) AND a + // branded `default = createArkor(...)` (shape #3) would have its + // HMR `configHash` computed from the BRANDED trainer while the + // runner actually ran the unbranded one. The mismatch could route + // a rebuild to SIGUSR2 (hot-swap) even though the live trainer + // has no callback-replacer brand to receive the swap, leaving + // the running job stuck on stale callbacks. + // + // The fix anchors `findInspectableTrainer` to the same first- + // wins precedence as `runTrainer`: if the first candidate is + // unbranded, return `null` (forcing SIGTERM-restart, the safe + // fallback) instead of hashing a different instance. + const unbranded = unbrandedTrainer("unbranded-first"); + const branded = brandedTrainer("branded-second"); + const inspection = findInspectableTrainer({ + trainer: unbranded, + default: createArkor({ trainer: branded }), + }); + // Under the bug this was the branded inspection ("branded-second"). + // With the fix we get null so HMR conservatively SIGTERM-restarts + // rather than hot-swapping callbacks into a trainer that can't + // receive them. + expect(inspection).toBeNull(); + // And `findTrainerInModule` confirms the runner would pick the + // unbranded one (proving the precedence we're anchoring to). + expect(findTrainerInModule({ + trainer: unbranded, + default: createArkor({ trainer: branded }), + })).toBe(unbranded); + }); +}); + +describe("requestTrainerEarlyStop / replaceTrainerCallbacks brand-missing fallback", () => { + // Regression: previously these helpers asserted the brand was + // present and threw a synchronous TypeError on hand-rolled trainers. + // `runner.ts`'s `extractTrainer` accepts ANY `{start, wait, cancel}` + // shape (a documented public path for unbranded trainers), + // so the SIGTERM handler crashed instead of stopping the run. 
+ + it("requestTrainerEarlyStop falls back to trainer.cancel() for unbranded trainers", async () => { + const cancelCalls = vi.fn(async () => {}); + const trainer = { + name: "manual", + start: async () => ({ jobId: "j" }), + wait: async () => ({ job: {}, artifacts: [] }), + cancel: cancelCalls, + } as unknown as Trainer; + + // Must not throw, must resolve, must have called cancel(). + await expect(requestTrainerEarlyStop(trainer)).resolves.toBeUndefined(); + expect(cancelCalls).toHaveBeenCalledTimes(1); + }); + + it("requestTrainerEarlyStop swallows a thrown cancel() so the SIGTERM handler can still settle", async () => { + // The runner's SIGTERM handler chains + // `requestTrainerEarlyStop(...).catch(...).finally(() => process.exit(0))`. + // If the brand-missing fallback let cancel()'s rejection bubble, + // the `.finally` would still fire, but the cancel error would + // surface as an unhandled rejection from the test runner. The + // documented contract for cancel() is best-effort, so swallow. + const trainer = { + name: "manual", + start: async () => ({ jobId: "j" }), + wait: async () => ({ job: {}, artifacts: [] }), + cancel: vi.fn(async () => { + throw new Error("network down"); + }), + } as unknown as Trainer; + + await expect(requestTrainerEarlyStop(trainer)).resolves.toBeUndefined(); + }); + + it("requestTrainerEarlyStop is async-shaped: synchronous throws inside the brand call become rejections", async () => { + // Defense-in-depth: even when the brand IS attached but somehow + // throws synchronously (e.g. a future implementation regression), + // the SIGTERM handler's `.catch` arm should still see it instead + // of the throw escaping past `.finally` and taking the runner + // down. The function is `async`, which wraps any synchronous + // throw inside its body into a rejected promise. + const trainer = brandedTrainer("from-arkor"); + // Replace the brand with a function that throws synchronously. 
+ const KEY = Symbol.for("arkor.trainer.requestEarlyStop"); + Object.defineProperty(trainer, KEY, { + value: () => { + throw new Error("brand impl exploded"); + }, + configurable: true, + }); + await expect(requestTrainerEarlyStop(trainer)).rejects.toThrow( + /brand impl exploded/, + ); + }); + + it("replaceTrainerCallbacks is a no-op (not a throw) for unbranded trainers", () => { + // The HMR pipeline never routes SIGUSR2 to unbranded trainers in + // practice (their `configHash` is null, which forces the + // SIGTERM-restart path), but if a future caller did, it must not + // crash the runner. + const trainer = { + name: "manual", + start: async () => ({ jobId: "j" }), + wait: async () => ({ job: {}, artifacts: [] }), + cancel: async () => {}, + } as unknown as Trainer; + expect(() => + replaceTrainerCallbacks(trainer, { onLog: () => {} }), + ).not.toThrow(); + }); +}); diff --git a/packages/arkor/src/core/trainerInspection.ts b/packages/arkor/src/core/trainerInspection.ts new file mode 100644 index 00000000..8bfe964d --- /dev/null +++ b/packages/arkor/src/core/trainerInspection.ts @@ -0,0 +1,306 @@ +import { isArkor } from "./arkor"; +import type { Arkor, JobConfig, Trainer, TrainerCallbacks } from "./types"; + +/** + * Snapshot of a trainer's identity and cloud-side config that the Studio + * server reads in order to (a) compute a stable hash for HMR's + * "callbacks-only vs full restart" decision and (b) extract the new + * callbacks reference when hot-swapping. + * + * **Internal API (not part of the user-facing SDK surface).** Both this + * snapshot and the companion `replaceTrainerCallbacks` mutator are + * exposed only via `Symbol.for(...)`-keyed properties on the trainer + * object so they don't appear on the public `Trainer` type. They exist + * to let `arkor dev`'s HMR pipeline hot-swap callbacks without + * restarting cloud-side training; user code shouldn't call them + * directly. 
+ */ +export interface TrainerInspection { + /** Run name (mirror of `Trainer.name`, copied for forward compatibility). */ + name: string; + /** The cloud-side `JobConfig` this trainer would submit on `start()`. */ + config: JobConfig; + /** Whatever the user passed in `input.callbacks`. May be empty. */ + callbacks: Partial<TrainerCallbacks>; +} + +/** + * The CLI runtime (`dist/bin.mjs`) and the user's compiled bundle + * (`.arkor/build/index.mjs`, which keeps `arkor` external) end up loading + * two separate copies of this SDK as distinct ESM module records, so a + * module-local `WeakMap` would split into two halves that + * can't see each other. + * + * `Symbol.for(key)` is the cross-realm equivalent: the same key string + * resolves to the same symbol in any module instance, so the trainer + * created in the user's bundle exposes its inspection through the same + * property the Studio process reads. + */ +const TRAINER_INSPECTION_KEY = Symbol.for("arkor.trainer.inspect"); +const TRAINER_REPLACE_CALLBACKS_KEY = Symbol.for( + "arkor.trainer.replaceCallbacks", +); +const TRAINER_REQUEST_EARLY_STOP_KEY = Symbol.for( + "arkor.trainer.requestEarlyStop", +); + +export interface RequestEarlyStopOptions { + /** Default: 5 min. Falls back to immediate cancel if no checkpoint arrives. */ + timeoutMs?: number; +} + +/** + * Stamp the inspection snapshot onto a freshly-built `Trainer` instance. + * Called once from `createTrainer`. Stored as a thunk so callers can + * read a fresh copy each time (defensive: the trainer's callbacks cell + * is mutable across the lifetime of a hot-swap). + */ +export function attachTrainerInspection( + trainer: object, + read: () => TrainerInspection, +): void { + Object.defineProperty(trainer, TRAINER_INSPECTION_KEY, { + value: read, + configurable: true, + enumerable: false, + writable: false, + }); +} + +/** + * Pull the snapshot off a Trainer-like value. 
Returns `null` for plain + * objects that don't carry the brand; used by the Studio server to + * gracefully ignore third-party wrappers or pre-SDK shapes. + */ +export function getTrainerInspection( + trainer: unknown, +): TrainerInspection | null { + if (!trainer || typeof trainer !== "object") return null; + const fn = (trainer as Record)[TRAINER_INSPECTION_KEY]; + if (typeof fn !== "function") return null; + try { + const result = (fn as () => unknown).call(trainer); + if ( + result && + typeof result === "object" && + "config" in result && + "name" in result + ) { + return result as TrainerInspection; + } + } catch { + // Inspection is best-effort; a thrown user callback shouldn't crash HMR. + } + return null; +} + +/** + * Wire the trainer's mutable callbacks slot to a `Symbol.for`-keyed + * brand so the runner subprocess can hot-swap callbacks without us + * exposing the operation on the public `Trainer` interface. Called once + * from `createTrainer`. + */ +export function attachTrainerCallbackReplacer( + trainer: object, + replace: (callbacks: Partial) => void, +): void { + Object.defineProperty(trainer, TRAINER_REPLACE_CALLBACKS_KEY, { + value: replace, + configurable: true, + enumerable: false, + writable: false, + }); +} + +/** + * Replace the trainer's lifecycle callbacks atomically. The brand is + * attached by `createTrainer`, but `runTrainer`'s `extractTrainer` + * also accepts hand-rolled trainers (any `{ start, wait, cancel }` + * shape), and those don't carry the brand. The HMR pipeline never + * routes SIGUSR2 to such trainers in practice (they always produce + * `configHash: null` upstream, which forces the SIGTERM-restart + * path), so this helper is a no-op for them rather than throwing. 
+ */ +export function replaceTrainerCallbacks( + trainer: Trainer, + callbacks: Partial, +): void { + const fn = (trainer as unknown as Record)[ + TRAINER_REPLACE_CALLBACKS_KEY + ] as ((cbs: Partial) => void) | undefined; + if (typeof fn !== "function") return; + fn.call(trainer, callbacks); +} + +/** + * Wire an early-stop entry point onto a `Trainer` so the SIGTERM handler + * in the runner subprocess can request a graceful "stop after the next + * checkpoint" without us exposing the operation on the public `Trainer` + * interface. User code that wants the same semantics should compose + * the cookbook's `abortSignal` + `cancel()` recipe instead (see + * `docs/cookbook/early-stopping.mdx`). + */ +export function attachTrainerEarlyStopper( + trainer: object, + requestStop: (opts?: RequestEarlyStopOptions) => Promise, +): void { + Object.defineProperty(trainer, TRAINER_REQUEST_EARLY_STOP_KEY, { + value: requestStop, + configurable: true, + enumerable: false, + writable: false, + }); +} + +/** + * Request that the trainer stop after the next saved checkpoint. + * Resolves once `cancel()` has been accepted by the cloud API, or + * after `timeoutMs` if no checkpoint arrived in time. + * + * `createTrainer` attaches the brand unconditionally, but + * `runTrainer`'s `extractTrainer` also accepts hand-rolled trainers + * (any `{ start, wait, cancel }` shape), which legitimately don't + * carry the brand. Falling back to the public `Trainer.cancel()` for + * those is the closest semantic match available without the SDK's + * checkpoint-aware machinery; it's also what the runner's SIGTERM + * handler needs to keep working (the previous "throw if brand + * missing" behaviour caused a synchronous TypeError before the + * handler's `.catch().finally()` chain attached, so SIGTERM crashed + * the runner instead of stopping the run). 
+ */ +// async wrapper (rather than a bare function returning Promise) so +// any *synchronous* throw inside the brand call (or its arguments) +// becomes a rejected promise; the SIGTERM handler's `.catch()` then +// catches it instead of the throw escaping past the `.finally()` +// chain and taking the runner down. +export async function requestTrainerEarlyStop( + trainer: Trainer, + opts?: RequestEarlyStopOptions, +): Promise { + const fn = (trainer as unknown as Record)[ + TRAINER_REQUEST_EARLY_STOP_KEY + ] as ((opts?: RequestEarlyStopOptions) => Promise) | undefined; + if (typeof fn !== "function") { + // Best-effort fallback for unbranded trainers: trainer.cancel() + // is part of the public Trainer interface, so it's always safe + // to call. Catch/swallow because the documented contract for + // cancel() is "best-effort" and the SIGTERM handler needs the + // returned promise to settle either way. + try { + await trainer.cancel(); + } catch { + // intentionally ignored; see comment above. + } + return; + } + await fn.call(trainer, opts); +} + +/** + * Trainer-shaped value pulled from a re-imported bundle. We don't + * import the public `Trainer` type here because consumers of this + * helper want to read minimal fields (`name` for display) without + * type-narrowing on the full SDK interface. Many tests fabricate + * hand-rolled trainer literals that don't structurally match + * `Trainer` (no `requestEarlyStop` etc.) but are still legitimate + * user shapes the runner accepts. + */ +type TrainerLike = { name?: unknown; [key: string]: unknown }; + +function isTrainerLike(value: unknown): value is TrainerLike { + if (!value || typeof value !== "object") return false; + const v = value as Record; + return ( + typeof v.start === "function" && + typeof v.wait === "function" && + typeof v.cancel === "function" + ); +} + +/** + * Walk the user module in `runner.ts`'s precedence order and return + * every *distinct* trainer-shaped value found. 
The walk is + * de-duplicated because the common `createArkor({ trainer })` + * default-export shape would otherwise surface the same trainer up + * to three times (case 3 pushes `mod.default.trainer`; case 4 + * pushes the manifest object itself which is filtered out by + * `isTrainerLike`; case 5 pushes `mod.default.trainer` a second + * time). Callers iterate in precedence order, so this preserves + * the "first match wins" contract. + * + * The five supported shapes (mirroring `runner.ts`'s `extractTrainer`): + * 1. `export const arkor = createArkor({ trainer })` + * 2. `export const trainer = createTrainer(...)` (bare named export) + * 3. `export default createArkor({ trainer })` + * 4. `export default createTrainer(...)` (default IS a Trainer) + * 5. `export default { trainer: createTrainer(...) }` + * + * Without shape #4 a project that default-exports a Trainer would run + * fine under `arkor start` but show as "no trainer" in Studio's + * manifest, with `configHash: null` forcing every HMR rebuild down the + * SIGTERM-restart path instead of the SIGUSR2 hot-swap path. + */ +function findTrainerCandidates(mod: Record): TrainerLike[] { + const trainers: TrainerLike[] = []; + const seen = new Set(); + const push = (value: unknown): void => { + if (value === undefined || value === null) return; + if (seen.has(value)) return; + seen.add(value); + if (isTrainerLike(value)) trainers.push(value); + }; + // 1: createArkor named export + if (isArkor(mod.arkor)) push((mod.arkor as Arkor).trainer); + // 2: bare `trainer` named export + push(mod.trainer); + // 3: default-export holding an Arkor manifest + if (isArkor(mod.default)) push((mod.default as Arkor).trainer); + // 4: default IS the Trainer itself. `isTrainerLike` filters out + // cases 3/5 (an Arkor manifest doesn't have `start`/`wait`/ + // `cancel`, nor does a plain `{ trainer }` wrapper). 
+ push(mod.default); + // 5: default.trainer nested + if (mod.default && typeof mod.default === "object") { + push((mod.default as Record).trainer); + } + return trainers; +} + +/** + * Return the first trainer-shaped value (anything with + * `start`/`wait`/`cancel`) in `runner.ts`'s precedence order. Doesn't + * require the SDK inspection brand: the Studio manifest UI displays + * the trainer's `name` for hand-rolled trainers too, even when HMR + * can't compute a `configHash` for them. "First match wins" matches + * `runner.ts`'s `extractTrainer`, so this is the trainer the runner + * will actually execute. + */ +export function findTrainerInModule( + mod: Record, +): TrainerLike | null { + return findTrainerCandidates(mod)[0] ?? null; +} + +/** + * Inspection snapshot of the trainer `runTrainer` would execute + * (== the first candidate in `runner.ts`'s precedence order). + * Used by both `studio/hmr.ts` (computing the `configHash` for HMR + * routing) and `core/runnerSignals.ts` (extracting new callbacks for + * SIGUSR2 hot-swap). + * + * Returns `null` when the first candidate doesn't carry the + * inspection brand. We deliberately DO NOT walk past it to find a + * branded trainer further down the list: the runner ignores those, + * so hashing a deeper branded trainer would compute HMR decisions + * for a different instance than the one actually running, e.g. + * route to SIGUSR2/hot-swap when the live (unbranded) trainer + * cannot be callback-reloaded. A null here correctly forces SIGTERM- + * restart, which is the safe fallback when configs can't be diffed. 
+ */ +export function findInspectableTrainer( + mod: Record, +): TrainerInspection | null { + const trainer = findTrainerCandidates(mod)[0]; + if (!trainer) return null; + return getTrainerInspection(trainer); +} diff --git a/packages/arkor/src/studio/hmr.test.ts b/packages/arkor/src/studio/hmr.test.ts new file mode 100644 index 00000000..b892c68c --- /dev/null +++ b/packages/arkor/src/studio/hmr.test.ts @@ -0,0 +1,423 @@ +import { describe, it, expect, beforeEach, afterEach } from "vitest"; +import { + mkdirSync, + mkdtempSync, + rmSync, + statSync, + writeFileSync, +} from "node:fs"; +import { tmpdir } from "node:os"; +import { join } from "node:path"; +import { createHmrCoordinator, type HmrEvent } from "./hmr"; + +const FAKE_MANIFEST = `export const arkor = Object.freeze({ + _kind: "arkor", + trainer: { name: "alpha" }, +}); +`; + +let cwd: string; + +beforeEach(() => { + cwd = mkdtempSync(join(tmpdir(), "arkor-hmr-test-")); +}); + +afterEach(() => { + rmSync(cwd, { recursive: true, force: true }); +}); + +function nextEvent( + events: HmrEvent[], + predicate: (e: HmrEvent) => boolean, + timeoutMs = 10_000, +): Promise { + return new Promise((resolve, reject) => { + const start = Date.now(); + const tick = () => { + const found = events.find(predicate); + if (found) return resolve(found); + if (Date.now() - start > timeoutMs) { + return reject( + new Error( + `Timed out waiting for matching HMR event after ${timeoutMs}ms`, + ), + ); + } + setTimeout(tick, 25); + }; + tick(); + }); +} + +/** + * Resolve once `events.length` has gone `quietWindowMs` without + * growing. Used to wait out spurious watcher events on noisier file + * systems (Windows polling / macOS FSEvents coalescing) before + * asserting the cached state. 
+ */ +function waitForStableEvents( + events: HmrEvent[], + quietWindowMs: number, +): Promise { + return new Promise((resolve) => { + let lastLength = events.length; + let stableSince = Date.now(); + const tick = () => { + if (events.length !== lastLength) { + lastLength = events.length; + stableSince = Date.now(); + } + if (Date.now() - stableSince >= quietWindowMs) return resolve(); + setTimeout(tick, 50); + }; + tick(); + }); +} + +describe("createHmrCoordinator", () => { + it("emits a `ready` event after the first successful build", async () => { + mkdirSync(join(cwd, "src/arkor"), { recursive: true }); + writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST); + + const events: HmrEvent[] = []; + const hmr = createHmrCoordinator({ cwd }); + hmr.subscribe((e) => events.push(e)); + try { + const ready = await nextEvent(events, (e) => e.type === "ready"); + expect(ready.outFile).toMatch(/\.arkor[\\/]+build[\\/]+index\.mjs$/); + expect(typeof ready.hash).toBe("string"); + } finally { + await hmr.dispose(); + } + }); + + it("emits a `rebuild` event after a source edit", async () => { + mkdirSync(join(cwd, "src/arkor"), { recursive: true }); + writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST); + + const events: HmrEvent[] = []; + const hmr = createHmrCoordinator({ cwd }); + hmr.subscribe((e) => events.push(e)); + try { + const ready = await nextEvent(events, (e) => e.type === "ready"); + // Touch the entry with new content so the watcher detects a change. 
+ writeFileSync( + join(cwd, "src/arkor/index.ts"), + FAKE_MANIFEST.replace(`"alpha"`, `"beta"`), + ); + const rebuild = await nextEvent(events, (e) => e.type === "rebuild"); + expect(rebuild.outFile).toBe(ready.outFile); + expect(rebuild.hash).not.toBe(ready.hash); + } finally { + await hmr.dispose(); + } + }); + + it("emits an `error` event when the entry is missing on subscribe", async () => { + const events: HmrEvent[] = []; + const hmr = createHmrCoordinator({ cwd }); + hmr.subscribe((e) => events.push(e)); + try { + const err = await nextEvent(events, (e) => e.type === "error", 1000); + expect(err.message).toMatch(/Build entry not found/); + } finally { + await hmr.dispose(); + } + }); + + it("transitions from `error` to `ready` once the entry appears, without re-subscribing", async () => { + // Regression: previously `startWatcher` bailed out and never + // retried, so an SPA already connected to `/api/dev/events` against + // a fresh scaffold would be stuck on the initial `error` event + // forever: EventSource doesn't reconnect on application-level + // errors. The coordinator now polls for the entry file in the + // background and starts the watcher the moment it appears. + const events: HmrEvent[] = []; + const hmr = createHmrCoordinator({ cwd }); + hmr.subscribe((e) => events.push(e)); + try { + await nextEvent(events, (e) => e.type === "error", 1000); + // Same subscriber: no reconnect, no second `subscribe` call. 
+ mkdirSync(join(cwd, "src/arkor"), { recursive: true }); + writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST); + const ready = await nextEvent( + events, + (e) => e.type === "ready", + 4000, + ); + expect(ready.outFile).toMatch(/index\.mjs$/); + } finally { + await hmr.dispose(); + } + }); + + it("replays the latest event to late subscribers", async () => { + mkdirSync(join(cwd, "src/arkor"), { recursive: true }); + writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST); + + const firstEvents: HmrEvent[] = []; + const hmr = createHmrCoordinator({ cwd }); + hmr.subscribe((e) => firstEvents.push(e)); + try { + await nextEvent(firstEvents, (e) => e.type === "ready"); + // A new subscriber should receive the cached state synchronously + // before any new build is triggered. + // + // We assert "the late subscriber sees the same event the prior one + // saw last" rather than literally "ready" because rolldown@1.0.0-rc.17 + // on macOS occasionally fires a spurious second BUNDLE_END (FSEvents + // coalescing inside the watcher): there, `firstEvents` already + // contains the spurious `rebuild` by the time we late-subscribe, and + // the contract under test (replay of the cached state) holds either + // way. + // TODO(rolldown 1.0): re-check after rolldown leaves RC. If the + // spurious BUNDLE_END is gone on macOS, tighten this back to + // expect(lateEvents[0]?.type).toBe("ready"); + const lateEvents: HmrEvent[] = []; + hmr.subscribe((e) => lateEvents.push(e)); + expect(lateEvents.length).toBeGreaterThanOrEqual(1); + expect(lateEvents[0]).toEqual(firstEvents[firstEvents.length - 1]); + } finally { + await hmr.dispose(); + } + }); + + it("subscribe()'s lastEvent replay swallows a throwing subscriber so initialization keeps working", async () => { + // Regression: `subscribe()` synchronously replays `lastEvent` to + // a fresh subscriber for the late-mount-cached-state contract. 
+ // Previously the replay had no try/catch, so a subscriber that + // threw during that one call (typical case: an SSE controller + // that closed mid-replay: `controller.enqueue` on a closed + // stream throws) propagated out of `subscribe()` and broke + // whoever just registered. `broadcast()` already swallowed + // subscriber throws defensively; this test pins the symmetric + // contract on `subscribe()`. + mkdirSync(join(cwd, "src/arkor"), { recursive: true }); + writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST); + + const firstEvents: HmrEvent[] = []; + const hmr = createHmrCoordinator({ cwd }); + hmr.subscribe((e) => firstEvents.push(e)); + try { + await nextEvent(firstEvents, (e) => e.type === "ready"); + // A subscriber whose body throws on the cached-state replay. + const throwingSubscriber = (): void => { + throw new Error("controller closed"); + }; + // Must not throw out of subscribe(); must still return a + // working unsubscribe. + let unsubscribe: () => void = () => undefined; + expect(() => { + unsubscribe = hmr.subscribe(throwingSubscriber); + }).not.toThrow(); + expect(typeof unsubscribe).toBe("function"); + // Confirm the coordinator is still healthy: a *new* subscriber + // (after the throwing one) still receives the cached replay. + const recoveryEvents: HmrEvent[] = []; + hmr.subscribe((e) => recoveryEvents.push(e)); + expect(recoveryEvents.length).toBeGreaterThanOrEqual(1); + unsubscribe(); + } finally { + await hmr.dispose(); + } + }); + + it("stops broadcasting after dispose()", async () => { + mkdirSync(join(cwd, "src/arkor"), { recursive: true }); + writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST); + + const events: HmrEvent[] = []; + const hmr = createHmrCoordinator({ cwd }); + hmr.subscribe((e) => events.push(e)); + await nextEvent(events, (e) => e.type === "ready"); + await hmr.dispose(); + const countAfterDispose = events.length; + + // Edit after dispose must not produce any further events. 
+ writeFileSync( + join(cwd, "src/arkor/index.ts"), + FAKE_MANIFEST.replace(`"alpha"`, `"gamma"`), + ); + await new Promise((r) => setTimeout(r, 250)); + expect(events.length).toBe(countAfterDispose); + }); + + it("the cached lastEvent reflects the LATEST source under rapid back-to-back edits", async () => { + // Regression: the BUNDLE_END handler used to fire + // `emitBuildSucceeded` without awaiting, so two quick rebuilds + // could run `inspectBundle` concurrently and broadcast out of + // order, leaving `lastEvent` pointing at the older snapshot. + // We can't deterministically synthesise a race against rolldown's + // real watcher, but we *can* assert the user-visible invariant: + // after a sequence of edits, the cached state must match the + // bytes that are actually on disk. The new sequence-number guard + // inside `emitBuildSucceeded` drops stale inspection results so + // whichever BUNDLE_END landed last broadcasts last. + mkdirSync(join(cwd, "src/arkor"), { recursive: true }); + writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST); + + const events: HmrEvent[] = []; + const hmr = createHmrCoordinator({ cwd }); + hmr.subscribe((e) => events.push(e)); + try { + await nextEvent(events, (e) => e.type === "ready"); + writeFileSync( + join(cwd, "src/arkor/index.ts"), + FAKE_MANIFEST.replace(`"alpha"`, `"beta"`), + ); + await nextEvent(events, (e) => e.type === "rebuild", 4000); + writeFileSync( + join(cwd, "src/arkor/index.ts"), + FAKE_MANIFEST.replace(`"alpha"`, `"gamma"`), + ); + // Wait for the watcher to settle; any rebuild that's going to + // fire (including spurious extras from FSEvents on macOS or + // chokidar polling on Windows) lands within this window. 
The + // assertion then compares the cached `lastEvent.hash` against + // the *actual* fingerprint of the on-disk artefact, not a + // captured "last expected" hash from earlier in the test: + // that earlier capture was brittle on Windows where rolldown + // routinely emits a 4th BUNDLE_END after the explicit edits + // settle, producing a slightly different output byte (a + // change in the bundled comment header is enough to bump + // mtime + ctime + size). + await waitForStableEvents(events, 750); + const stat = statSync(join(cwd, ".arkor/build/index.mjs")); + const expectedHash = `${stat.mtimeMs}-${stat.ctimeMs}-${stat.size}`; + expect(events[events.length - 1]?.hash).toBe(expectedHash); + } finally { + await hmr.dispose(); + } + }); + + it("getCurrentConfigHash() returns the latest cached event's hash", async () => { + // Regression: `/api/train` previously called `readManifestSummary` + // and ran a redundant rebuild per spawn (racing the watcher). + // The new server flow reads the cached hash via + // `getCurrentConfigHash()`. We can't trigger a real build here + // (the user-bundle entry shape would need a working `arkor` + // resolution at import time), but we can verify the getter + // returns `null` before the watcher has emitted any event and + // tracks the cached event's `configHash` field once one lands. + // The integration of "configHash actually populated for all + // entry shapes" is covered by the unit test against + // `findInspectableTrainer` in `trainerInspection.test.ts`. + mkdirSync(join(cwd, "src/arkor"), { recursive: true }); + writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST); + + const events: HmrEvent[] = []; + const hmr = createHmrCoordinator({ cwd }); + // Before any subscriber attaches, no watcher is running and no + // event has been broadcast: getter must return null without + // throwing. 
+ expect(hmr.getCurrentConfigHash()).toBeNull(); + hmr.subscribe((e) => events.push(e)); + try { + const ready = await nextEvent(events, (e) => e.type === "ready"); + // FAKE_MANIFEST is hand-rolled (no SDK brand) so the cached + // hash is null, but the *getter* must still return whatever + // the cached event carries, not throw. + expect(hmr.getCurrentConfigHash()).toBe(ready.configHash ?? null); + } finally { + await hmr.dispose(); + } + }); + + it("getCurrentArtifactHash() returns null when the artefact doesn't exist (vs a Date.now() fallback)", async () => { + // Regression: a previous implementation did + // `statSync(...) ; return fingerprint(...)`. Two stat calls + // means a race window where the file disappears between them: + // the existence check passes, then `fingerprint`'s catch + // branch substitutes `Date.now().toString(36)` (its + // freshness-forcing fallback for SSE dedup), and the getter + // returns a non-null, non-artefact-derived hash. That + // silently breaks `dispatchRebuild`'s pre-ready-spawn gate + // which relies on null === "no artefact, force restart". + // The fix uses `fingerprintOrNull`: single statSync, true + // null on failure. + // + // We assert the getter on a project that has NEVER built + // (no `.arkor/build/index.mjs` ever existed). The bug-fix + // version returns null; the broken version's leftover would + // have been Date.now()-derived non-null. + mkdirSync(join(cwd, "src/arkor"), { recursive: true }); + writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST); + + const hmr = createHmrCoordinator({ cwd }); + try { + // No subscribe() yet: watcher hasn't started, so no + // BUNDLE_END has written the artefact. The on-disk + // `.arkor/build/index.mjs` doesn't exist. 
+ expect(hmr.getCurrentArtifactHash()).toBeNull(); + } finally { + await hmr.dispose(); + } + }); + + it("getCurrentArtifactHash() returns a stable mtime/ctime/size hash once the artefact exists", async () => { + // Companion to the null-on-missing test: when the artefact + // *does* exist (watcher's first BUNDLE_END landed), the + // getter returns the same `mtimeMs-ctimeMs-size` shape the + // SSE event's `hash` field uses. The two are paired for SSE + // dedup purposes; the pre-ready-spawn registry gate switched + // to content-hash (`getCurrentArtifactContentHash`) to avoid + // identical-bytes/different-timestamps false positives, but + // the timestamp hash stays as the canonical SSE event id. + mkdirSync(join(cwd, "src/arkor"), { recursive: true }); + writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST); + + const events: HmrEvent[] = []; + const hmr = createHmrCoordinator({ cwd }); + hmr.subscribe((e) => events.push(e)); + try { + const ready = await nextEvent(events, (e) => e.type === "ready"); + const artifactHash = hmr.getCurrentArtifactHash(); + // Same shape as the SSE event's `hash` field: both feed + // through the same `mtimeMs-ctimeMs-size` formula. + expect(artifactHash).toBe(ready.hash ?? null); + expect(artifactHash).toMatch(/^[\d.]+-[\d.]+-\d+$/); + } finally { + await hmr.dispose(); + } + }); + + it("getCurrentConfigHash() preserves the last-success hash across an ERROR event", async () => { + // Regression: previously `getCurrentConfigHash()` returned + // `lastEvent?.configHash ?? null`. After an ERROR landed, + // `lastEvent` was the error event (no `configHash`) so the + // getter went null even though `.arkor/build/index.mjs` still + // held the previous *successful* bundle bytes (ERROR doesn't + // overwrite the output). 
A child spawned via `/api/train` in + // that window would register `configHash: null`, and the next + // successful BUNDLE_END would diff against null → SIGTERM + // restart instead of SIGUSR2 hot-swap, defeating callback + // hot-swap for the rest of the session. The fix tracks the + // last *successful* hash separately from `lastEvent`. + mkdirSync(join(cwd, "src/arkor"), { recursive: true }); + writeFileSync(join(cwd, "src/arkor/index.ts"), FAKE_MANIFEST); + + const events: HmrEvent[] = []; + const hmr = createHmrCoordinator({ cwd }); + hmr.subscribe((e) => events.push(e)); + try { + const ready = await nextEvent(events, (e) => e.type === "ready"); + const successHash = hmr.getCurrentConfigHash(); + // Sanity: ready event's configHash matches the getter. + expect(successHash).toBe(ready.configHash ?? null); + // Inject a syntax error to force a watcher ERROR event. + writeFileSync( + join(cwd, "src/arkor/index.ts"), + "this is not { valid javascript = ;", + ); + await nextEvent(events, (e) => e.type === "error", 4000); + // After the error, the cached `lastEvent` is the error frame + // but the on-disk artifact still holds the previous + // success. The getter must return that previous-success hash + // so any `/api/train` spawn during this window still gets a + // useful spawn-time hash for the *next* rebuild's routing. 
+ expect(hmr.getCurrentConfigHash()).toBe(successHash); + } finally { + await hmr.dispose(); + } + }); +}); diff --git a/packages/arkor/src/studio/hmr.ts b/packages/arkor/src/studio/hmr.ts new file mode 100644 index 00000000..974ed771 --- /dev/null +++ b/packages/arkor/src/studio/hmr.ts @@ -0,0 +1,527 @@ +import { createHash } from "node:crypto"; +import { existsSync, readFileSync, statSync } from "node:fs"; +import { watch, type RolldownWatcher } from "rolldown"; +import { hashJobConfig } from "../core/configHash"; +import { moduleCacheBustUrl } from "../core/moduleCacheBust"; +import { + BUILD_DEFAULTS, + resolveBuildEntry, + rolldownInputOptions, + type BuildEntryOptions, +} from "../core/rolldownConfig"; +import { findInspectableTrainer } from "../core/trainerInspection"; + +export type HmrEventType = "ready" | "rebuild" | "error"; + +export interface HmrEvent { + type: HmrEventType; + outFile?: string; + /** + * Short fingerprint of the bundle artefact (mtime + ctime + size, + * mirroring `core/moduleCacheBust.ts`'s key shape). Subscribers use + * this to dedupe replays of the same successful build. + */ + hash?: string; + /** + * Content-derived hash (sha256, truncated) of the artefact bytes. + * Used by `dispatchRebuild`'s pre-ready-spawn equality gate where + * `hash` would over-trigger SIGTERM-restart: a watcher build that + * rewrites identical bytes still bumps mtime/ctime, so two + * timestamp fingerprints differ even though the loaded bytes are + * the same. Comparing this content-hash instead avoids that + * spurious cancel+restart cycle in the "user clicked Run before + * the watcher's first BUNDLE_END landed" case. + */ + contentHash?: string | null; + /** + * Stable hash of the trainer's cloud-side `JobConfig`. When this is + * unchanged across a rebuild, only the in-process callbacks moved and + * the Studio server can hot-swap them without restarting the run. + * `null` when the bundle has no discoverable trainer (e.g. 
the user's + * source has a syntax error or the Arkor manifest is missing). + */ + configHash?: string | null; + /** Run name pulled from the rebuilt manifest. */ + trainerName?: string | null; + /** Human-readable error message; only present on `type === "error"`. */ + message?: string; +} + +export interface HmrCoordinator { + /** + * Receive the current cached state immediately, then every subsequent + * event. Returns an unsubscribe function. + */ + subscribe(fn: (event: HmrEvent) => void): () => void; + /** + * Synchronous read of the most recent successful build's + * `configHash`. Used by `/api/train` to capture the hash that's + * about to be spawned so HMR routing on the *next* rebuild knows + * whether the new bundle changed cloud-side config. `null` when the + * watcher hasn't completed a successful build yet (e.g. fresh + * scaffold) or that build had no inspectable trainer. A later + * `error` event does not reset it. + */ + getCurrentConfigHash(): string | null; + /** + * Synchronous fingerprint of the on-disk build artefact RIGHT NOW + * (fresh stat, not cached). Used by `/api/train`'s registry entry + * so HMR routing in the pre-ready-spawn case (`configHash === null`) + * can compare against the rebuild's `event.hash` to tell whether + * the child read the same bytes. Without this gate, an edit + * landing between spawn and the watcher's first BUNDLE_END would + * silently teach the registry to use the post-edit `configHash` + * as the child's baseline; later same-hash rebuilds would then + * hot-swap callbacks into a child whose cloud-side `JobConfig` + * was actually spawned against an older version, leaving the + * cloud run on a stale config. `null` when stat fails (artefact + * doesn't exist yet, fresh project never built). + */ + getCurrentArtifactHash(): string | null; + /** + * Content-derived hash (sha256, truncated) of the on-disk + * artefact RIGHT NOW.
Used by `/api/train` to capture a + * spawn-time content-hash for the registry's pre-ready-spawn + * equality gate; paired with the rebuild's `event.contentHash`, + * a mismatch unambiguously means the bytes changed (not just + * timestamps), so `dispatchRebuild` only SIGTERM-restarts when + * the child genuinely loaded different bytes than the new + * configHash describes. `null` on stat/read failure (artefact + * doesn't exist yet, fresh project never built). + */ + getCurrentArtifactContentHash(): string | null; + /** + * Last broadcast event's `type`, or `null` if nothing has been + * broadcast yet. `/api/manifest`'s HMR fast path consults this to + * suppress its "serve last good artefact" behaviour while the + * watcher is in an `error` state; without that gate, the SPA's + * 5 s `/api/manifest` poll would keep getting a 200 stale + * manifest and silently overwrite the SSE-driven build-error UI, + * letting users run with stale code/config while the latest + * source is still failing to compile. + */ + getLastEventType(): HmrEventType | null; + /** + * Close the rolldown watcher and drop all subscribers. **Does not + * (and cannot) evict the user-module records that `inspectBundle` + * loaded into Node's ESM cache**: Node's loader exposes no + * eviction API, so for `arkor dev` sessions that go through many + * rebuilds before exit, the cache retains one record per distinct + * artefact content hash for the rest of the process lifetime. + * The mtime/ctime/size cache-bust key (`moduleCacheBustUrl`) + * collapses identical-byte rebuilds onto the same record, bounding + * the retention to "one entry per real edit", which is the tightest + * we can offer here. Tests that loop `createHmrCoordinator` → + * rebuild → `dispose` therefore still accumulate process-wide + * ESM-cache entries. + */ + dispose(): Promise<void>; +} + +export type HmrOptions = BuildEntryOptions; + +/** + * Content-derived fingerprint of the artefact bytes (sha256, first 16 + * hex chars).
Used by `dispatchRebuild`'s pre-ready-spawn gate where + * timestamp-based comparison gives false positives: a watcher rebuild + * that produces the same bytes still bumps mtime/ctime, so a child + * spawned just before `ready` would be unnecessarily SIGTERM-restarted + * even though its loaded bytes match the new build's. Hashing a few + * MB of bundle on each call is cheap relative to the GPU cost of a + * spurious cancel+restart cycle. + * + * Returns `null` on stat/read failure so the caller can treat + * "no artefact" as "force restart" (the conservative default). + */ +function contentHashOrNull(outFile: string): string | null { + try { + const bytes = readFileSync(outFile); + return createHash("sha256").update(bytes).digest("hex").slice(0, 16); + } catch { + return null; + } +} + +/** + * Single-stat fingerprint with a clean `null` on failure: used by + * `getCurrentArtifactHash()` whose contract is "return a fingerprint + * derived from the artefact bytes, or `null` if no artefact". A + * separate exists-check + `fingerprint()` here would race: the file + * could disappear between the two stats and `fingerprint()`'s + * `Date.now()` fallback would return a non-null hash that doesn't + * describe any real bytes, silently violating the contract. + */ +function fingerprintOrNull(outFile: string): string | null { + try { + const s = statSync(outFile); + // Same shape as `fingerprint()`'s success branch; `ctimeMs` is + // the belt-and-braces guard for `touch -m`-style edits where + // mtime stays put. + return `${s.mtimeMs}-${s.ctimeMs}-${s.size}`; + } catch { + return null; + } +} + +function fingerprint(outFile: string): string { + // Delegate to `fingerprintOrNull` and substitute a freshness- + // forcing token on stat failure. 
The `Date.now()` fallback + // matters here (vs the "0-0-0" sentinel `moduleCacheBustKey` + // uses): SPA-side SSE dedup keys off this hash, so a stable + // literal during a racy stat would silently swallow genuinely- + // fresh broadcast events. + return fingerprintOrNull(outFile) ?? Date.now().toString(36); +} + +type InspectionResult = { + configHash: string; + trainerName: string; +} | null; + +/** + * Dynamic-import the freshly-built bundle and pull a `TrainerInspection` + * snapshot off the discovered trainer. + * + * Walks every entry shape `runner.ts` accepts (named `arkor`, named + * `trainer`, `default` Arkor manifest, `default.trainer`) via the + * shared `findInspectableTrainer` helper, keeping inspection in sync + * with execution. Without this, projects that only `export const + * trainer` (a documented shortcut) would always produce `configHash: + * null` and the SPA would unnecessarily SIGTERM-restart on every + * rebuild. + * + * Cache-bust by file mtime+ctime+size (via `moduleCacheBustUrl`) + * rather than `Date.now()`: + * + * - Node's ESM loader caches every dynamically-imported URL for the + * lifetime of the process and never evicts. A `?t=Date.now()` + * suffix produces a unique URL per call, so a long `arkor dev` + * session would accumulate one module record per BUNDLE_END: + * unbounded memory growth. + * - The composite key (`mtimeMs-ctimeMs-size`) keys the cache to + * "the actual bytes in this file", so spurious watcher events + * that don't change content reuse the prior module record. The + * leak shrinks from "one entry per keystroke" to "one entry per + * actual rebuild", which for a realistic dev session (hundreds + * of saves over hours) is bounded by the number of distinct file + * states the user produces, and that's fundamentally what HMR + * has to track to surface up-to-date trainer state. 
There's no + * public Node API for evicting an ESM module record, so this is + * the tightest bound we can offer without spawning a child + * process per inspection. + * + * Best-effort: a missing/malformed manifest or a thrown user + * constructor returns `null` and the caller treats the rebuild as + * "config-unknown". + */ +async function inspectBundle(outFile: string): Promise<InspectionResult> { + try { + const mod = (await import(moduleCacheBustUrl(outFile))) as Record< + string, + unknown + >; + const inspection = findInspectableTrainer(mod); + if (!inspection) return null; + return { + configHash: hashJobConfig(inspection.config), + trainerName: inspection.name, + }; + } catch { + return null; + } +} + +/** + * Spin up a rolldown watcher over the user's `src/arkor` entry, broadcasting + * `ready` / `rebuild` / `error` to subscribers. Used by `arkor dev` to push + * `/api/dev/events` SSE notifications to the SPA. + * + * Lazy: the watcher only starts on the first `subscribe` call so a Studio + * launch in a project without `src/arkor/index.ts` doesn't immediately fail. + * The watcher kicks in once the user creates the file and the SPA opens + * an EventSource. After every successful build the watcher caches the + * latest state and replays it to new subscribers so a late-mounting + * component still sees the trainer. + */ +export function createHmrCoordinator(opts: HmrOptions): HmrCoordinator { + const resolved = resolveBuildEntry(opts); + + const subscribers = new Set<(event: HmrEvent) => void>(); + let lastEvent: HmrEvent | null = null; + let watcher: RolldownWatcher | null = null; + let disposed = false; + /** + * When `startWatcher` runs against a project that doesn't have an + * entry file yet, a poll timer takes over and waits for the file to + * appear.
Without this, an SPA that opened `/api/dev/events` against + * a fresh scaffold would hang on the initial `error` event forever + * (`startWatcher` is only re-entered on `subscribe()`, but EventSource + * doesn't reconnect on application-level errors). + */ + let entryWaitTimer: ReturnType<typeof setInterval> | null = null; + /** + * Monotonically incrementing build sequence number. Bumped on every + * `BUNDLE_END` *before* the inspection awaits, so when an + * inspection eventually resolves it can check whether a newer + * build has started in the meantime and silently drop its stale + * result. + * + * This matters because `inspectBundle` does an asynchronous + * dynamic-import of the just-written artifact. Two rebuilds A → B + * landing within the import window can race, with A's inspection + * resolving *after* B's. The previous "fire-and-forget" code + * would then publish A on top of B and leave `lastEvent` pointing + * at the older `configHash`/`trainerName`. That in turn drove + * `/api/dev/events` to make hot-swap-vs-restart decisions against + * stale routing data and surfaced the wrong trainer name in the + * SPA. + */ + let buildSeq = 0; + /** + * `true` until the first `ready` event has actually been broadcast. + * Tracked separately from `firstBuild` because the inspection await + * means the first BUNDLE_END's broadcast can land *after* a second + * BUNDLE_END schedules its own. Pinning the type to + * "broadcast-time" rather than "schedule-time" guarantees the SPA + * still sees `ready` first even when the initial inspection loses + * the race. + */ + let firstBroadcast = true; + /** + * Cached `configHash` of the last *successful* build, **independent + * of `lastEvent`**. `lastEvent` tracks every broadcast (including + * `error`) for the cached-replay-on-late-subscribe contract, but a + * transient build error must not blank out the spawn-time hash that + * `/api/train` reads via `getCurrentConfigHash()`.
The on-disk + * `.arkor/build/index.mjs` doesn't change on ERROR, so a child + * spawned during an error state is running the *previous* successful + * bundle, and the next BUNDLE_END's hash should be compared + * against THAT. Without this separate cache, the whole rebuild gets + * routed through SIGTERM-restart and SIGUSR2 hot-swap stops working + * for the rest of the session whenever the user briefly broke their + * source. + */ + let lastSuccessConfigHash: string | null = null; + + function broadcast(event: HmrEvent): void { + lastEvent = event; + for (const fn of subscribers) { + try { + fn(event); + } catch { + // Subscribers are SSE controllers; a thrown error usually means + // the connection closed mid-flight. Drop it so one bad subscriber + // can't poison the broadcast for the rest. + } + } + } + + async function emitBuildSucceeded(): Promise<void> { + if (disposed) return; + const seq = ++buildSeq; + const inspection = await inspectBundle(resolved.outFile); + // Drop stale results: a newer rebuild already started (or + // finished) while our inspection was running. The newer + // inspection will own the broadcast for the latest state; this + // one publishing now would just clobber `lastEvent` with the + // older snapshot. + if (seq !== buildSeq || disposed) return; + const type: HmrEventType = firstBroadcast ? "ready" : "rebuild"; + firstBroadcast = false; + const configHash = inspection?.configHash ?? null; + // BUNDLE_END always reflects what's now on disk: even when the + // bundle is unbranded (`configHash === null`), that's the + // current truth. Capture it so `/api/train` spawning during a + // *subsequent* transient error still has the right spawn-time + // hash to compare against the next successful rebuild.
+ lastSuccessConfigHash = configHash; + broadcast({ + type, + outFile: resolved.outFile, + hash: fingerprint(resolved.outFile), + // Content hash powers the registry's pre-ready-spawn equality + // gate (timestamp-only would over-trigger SIGTERM-restart on + // identical-bytes rebuilds). Read once here so the broadcast + // and any spawn-time capture reference the same on-disk state. + contentHash: contentHashOrNull(resolved.outFile), + configHash, + trainerName: inspection?.trainerName ?? null, + }); + } + + function startWatcher(): void { + if (watcher || disposed) return; + if (!existsSync(resolved.entry)) { + broadcast({ + type: "error", + message: `Build entry not found: ${resolved.entry}. Create ${BUILD_DEFAULTS.entry} or pass an explicit entry argument.`, + }); + // Hand off to a low-frequency poll so an SPA already connected to + // `/api/dev/events` transitions from "error" to "ready" the moment + // the user creates the entry file (no manual reconnect required). + // The poll is `unref()`'d so it never blocks process exit, and + // `dispose()` clears it. + if (!entryWaitTimer) { + entryWaitTimer = setInterval(() => { + if (disposed || watcher) { + if (entryWaitTimer) clearInterval(entryWaitTimer); + entryWaitTimer = null; + return; + } + if (existsSync(resolved.entry)) { + if (entryWaitTimer) clearInterval(entryWaitTimer); + entryWaitTimer = null; + startWatcher(); + } + }, 1000); + entryWaitTimer.unref?.(); + } + return; + } + // The entry exists now: clear any leftover poll timer from a prior + // failed startWatcher invocation. + if (entryWaitTimer) { + clearInterval(entryWaitTimer); + entryWaitTimer = null; + } + watcher = watch({ + ...rolldownInputOptions(resolved), + output: { file: resolved.outFile, format: "esm" }, + }); + watcher.on("event", (event) => { + if (event.code === "BUNDLE_END") { + // rolldown requires the per-build result to be closed to avoid leaks. 
+ event.result.close().catch(() => {}); + // The event type ("ready" vs "rebuild") is decided inside + // `emitBuildSucceeded` *after* the inspection await, based on + // whether any prior broadcast actually landed (see the + // `firstBroadcast` comment for why pinning the type at this + // schedule point would be wrong under inspection races). + void emitBuildSucceeded(); + } else if (event.code === "ERROR") { + // Rolldown's ERROR events don't always carry a `result`: + // when the failure is in the parse/resolve phase there's + // no per-build output to close, so `event.result` is + // `undefined`. Calling `.close()` then would throw + // synchronously, escape this listener, and permanently + // wedge the watcher so the SPA stays on the prior `error` + // state forever even after the user fixes their code. + // Optional-chain so we still close any result that *is* + // present (avoiding the leak rolldown warns about) without + // blowing up the watcher when none is. + event.result?.close().catch(() => {}); + // Bump the seq so a still-in-flight `emitBuildSucceeded` + // from a *prior* BUNDLE_END drops its broadcast when its + // inspection finally resolves. Without this, the older + // success would land on top of this error and clobber + // `lastEvent`/`configHash`, leaving the SPA showing a + // healthy rebuild while the actual latest build state is + // a compile error. The successful-rebuild path bumps the + // same counter inside `emitBuildSucceeded`. + buildSeq += 1; + broadcast({ + type: "error", + message: + event.error instanceof Error + ? event.error.message + : String(event.error), + }); + } + }); + } + + return { + subscribe(fn) { + subscribers.add(fn); + // Replay the last broadcast so a late-mounting subscriber (an + // `/api/dev/events` SSE client opening after the first BUNDLE_END, + // or `buildStudioApp`'s dispatch subscriber registering after + // entry-wait recovery) sees current state without waiting for + // the next rebuild. 
+ // + // Wrapped in the same defensive try/catch as `broadcast` so a + // throw inside the subscriber (typically an SSE controller that + // closed mid-replay: `controller.enqueue` on a closed stream + // throws) doesn't propagate out of `subscribe()` and crash + // whoever just registered. One bad subscriber must not be able + // to break HMR initialisation for the rest of the process. + if (lastEvent) { + try { + fn(lastEvent); + } catch { + // Swallow: subscribers own their own teardown; we just + // shouldn't poison their `subscribe()` call site. + } + } + startWatcher(); + return () => { + subscribers.delete(fn); + }; + }, + getCurrentConfigHash() { + // Returns the hash of the *last successful* build, NOT + // `lastEvent.configHash`. The two diverge after an ERROR: + // `lastEvent` becomes the error event (no `configHash`), but + // `.arkor/build/index.mjs` still holds the previous successful + // bundle bytes, and a child spawned in that window is running + // those bytes. Returning the cached success hash keeps + // `/api/train` registering accurate spawn-time hashes so the + // next successful BUNDLE_END can route hot-swap vs restart + // correctly. `null` only before the first successful build (or + // a build that wasn't inspectable). + return lastSuccessConfigHash; + }, + getCurrentArtifactHash() { + // Fresh stat (not the cached `lastEvent.hash`). The cached + // hash describes the bytes the watcher last broadcast about, + // but the on-disk artefact may be newer (a BUNDLE_END is + // queued, file already written, inspection still pending) or + // older (next BUNDLE_END hasn't fired yet but the user just + // edited and saved). For the registry's pre-ready-spawn gate + // we want "what bytes will the child's `await import()` see + // RIGHT NOW". + // + // `fingerprintOrNull` does ONE statSync and returns null on + // failure, preserving the documented contract. 
A previous + // implementation here did `statSync(...)` first and then + // called `fingerprint()` (which has a `Date.now()` fallback + // baked in for SSE dedup uniqueness). That double-stat + // raced: if the file disappeared between the two calls we'd + // return a Date.now()-derived hash that doesn't describe any + // real bytes, silently violating the "null on stat failure" + // contract dispatchRebuild relies on for its SIGTERM-restart + // routing. + return fingerprintOrNull(resolved.outFile); + }, + getCurrentArtifactContentHash() { + // Companion to `getCurrentArtifactHash` for the registry's + // pre-ready-spawn equality gate. Reads + sha256s the file + // at call time so the result describes the exact bytes the + // just-spawned child will see in its `await import()`. + // Same null-on-failure contract: caller treats null as + // "force restart" (the conservative default). + return contentHashOrNull(resolved.outFile); + }, + getLastEventType() { + // `lastEvent` is the latest broadcast: `ready` / `rebuild` / + // `error`. Returning the type lets `/api/manifest`'s HMR + // fast path skip serving the stale built artefact when the + // watcher is currently in `error` (current source fails to + // compile), so the SPA's poll loop doesn't paper over the + // SSE-surfaced error. + return lastEvent?.type ?? 
null; + }, + async dispose() { + disposed = true; + subscribers.clear(); + if (entryWaitTimer) { + clearInterval(entryWaitTimer); + entryWaitTimer = null; + } + if (watcher) { + const w = watcher; + watcher = null; + await w.close().catch(() => {}); + } + }, + }; +} diff --git a/packages/arkor/src/studio/manifest.ts b/packages/arkor/src/studio/manifest.ts index 72452da8..677115e3 100644 --- a/packages/arkor/src/studio/manifest.ts +++ b/packages/arkor/src/studio/manifest.ts @@ -1,6 +1,11 @@ -import { pathToFileURL } from "node:url"; +import { existsSync } from "node:fs"; import { runBuild } from "../cli/commands/build"; -import { isArkor } from "../core/arkor"; +import { hashJobConfig } from "../core/configHash"; +import { moduleCacheBustUrl } from "../core/moduleCacheBust"; +import { + findTrainerInModule, + getTrainerInspection, +} from "../core/trainerInspection"; /** * Wire-friendly snapshot of the user's `createArkor({...})` manifest. Mirrors @@ -9,28 +14,120 @@ import { isArkor } from "../core/arkor"; */ export interface ManifestSummary { trainer: { name: string } | null; + /** + * Stable hash of the trainer's cloud-side `JobConfig`. Used by HMR to + * decide whether a rebuild only changed in-process callbacks (hash + * unchanged → hot-swap) or also touched cloud-side training config + * (hash changed → restart with `requestEarlyStop`). `null` when no + * inspectable trainer is present. + */ + configHash: string | null; // future: deploy: { name: string } | null; // future: eval: { name: string } | null; } -const EMPTY: ManifestSummary = { trainer: null }; +const EMPTY: ManifestSummary = { trainer: null, configHash: null }; /** - * Build the user's `src/arkor/index.ts` and import the artifact to extract a - * serialisable summary of its manifest. The Studio UI hits this on home-page - * load to show *what* the project contains (just the trainer name today; - * deploy / eval slots when those primitives land). 
+ * Dynamic-import an already-built artefact and pull a serialisable + * summary off its trainer. Cache-bust the URL so Node's ESM loader + * returns the fresh module text rather than a stale evaluation. * - * Each call rebuilds and re-imports so edits to the user's source surface - * without restarting Studio. The import URL carries a cache-bust query so - * Node's ESM cache doesn't return a stale module. + * Split out of `readManifestSummary` so callers that already triggered a + * build (the HMR coordinator hands the SPA an `outFile` after each + * `BUNDLE_END`) can inspect the artefact without paying for a redundant + * `runBuild()`. */ -export async function readManifestSummary(cwd: string): Promise<ManifestSummary> { +export async function summariseBuiltManifest( + outFile: string, +): Promise<ManifestSummary> { + // mtime+ctime+size cache-bust (vs `Date.now()`): the SPA polls + // `/api/manifest` every ~5 s, so a `Date.now()` suffix would + // accumulate one ESM module record per poll across a long + // `arkor dev` session: Node's loader has no eviction. Keying on + // the artefact bytes (via `moduleCacheBustUrl`) collapses + // unchanged-poll reads onto the existing record. + const mod = (await import(moduleCacheBustUrl(outFile))) as Record< + string, + unknown + >; + // Walk every trainer export shape `runner.ts` accepts via the + // shared helper (named `arkor`, named `trainer`, default Arkor + // manifest, `default.trainer`) so manifest summary, HMR routing, + // and runtime execution all agree about which exports count as a + // trainer. + const trainer = findTrainerInModule(mod); + if (!trainer) return EMPTY; + // Trainer name renders in the UI even for hand-rolled trainers + // that bypass `createTrainer` and therefore don't carry the SDK + // inspection brand. The brand is required only for the + // `configHash` used by HMR routing; without it, HMR conservatively + // SIGTERM-restarts on every rebuild (correct fallback). + const name = + typeof trainer.name === "string" ?
trainer.name : "(unnamed trainer)"; + const inspection = getTrainerInspection(trainer); + return { + trainer: { name }, + configHash: inspection ? hashJobConfig(inspection.config) : null, + }; +} + +export interface ReadManifestOptions { + /** + * HMR-aware fast path: when set and the file exists, skip the + * `runBuild()` call and inspect this artefact directly. The HMR + * coordinator already keeps `.arkor/build/index.mjs` continuously + * fresh via its rolldown watcher, so re-running `runBuild()` on + * every `/api/manifest` poll (every ~5 s + on every rebuild SSE + * event) is wasted CPU AND races the watcher writing to the + * same path. Pre-existence is checked with `existsSync` so the + * very first poll on a fresh scaffold (watcher's first + * BUNDLE_END hasn't completed yet) still bootstraps via + * `runBuild()`. Once the file appears, subsequent polls skip + * the rebuild. + * + * Pass `coordinator.outFile`-equivalent (e.g. + * `resolveBuildEntry({ cwd }).outFile`) here when the server has + * an active `HmrCoordinator`; leave undefined when HMR is off so + * the build path runs as before. + */ + prebuiltOutFile?: string; +} + +/** + * Build the user's `src/arkor/index.ts` and import the artifact to + * extract a serialisable summary of its manifest. The Studio UI hits + * this on home-page load to show *what* the project contains (just the + * trainer name today; deploy / eval slots when those primitives land). + * + * Each call rebuilds and re-imports so edits to the user's source + * surface without restarting Studio. When `prebuiltOutFile` is + * supplied (HMR-enabled servers), the `runBuild()` step is bypassed + * (see `ReadManifestOptions.prebuiltOutFile` for the rationale). + */ +export async function readManifestSummary( + cwd: string, + opts: ReadManifestOptions = {}, +): Promise<ManifestSummary> { + if (opts.prebuiltOutFile && existsSync(opts.prebuiltOutFile)) { + // Race recovery: rolldown's watcher writes + // `.arkor/build/index.mjs` non-atomically.
`existsSync` flips to + // `true` the instant the file is created, but a `/api/manifest` + // poll landing during the flush window would then try to + // `await import(...)` partial bytes and surface as a 500 + // SyntaxError in the UI. The legacy `runBuild()` path was + // synchronous and self-contained, so this race didn't exist + // there. Fall through to a fresh `runBuild()` on import failure + // (which produces a coherent artifact under our control). The + // fallback is best-effort: if `runBuild()` itself also throws + // (real user syntax error), rethrowing IS the right surface for + // `/api/manifest` to render the error inline. + try { + return await summariseBuiltManifest(opts.prebuiltOutFile); + } catch { + // fall through to runBuild() + } + } const { outFile } = await runBuild({ cwd, quiet: true }); - const url = `${pathToFileURL(outFile).href}?t=${Date.now()}`; - const mod = (await import(url)) as Record<string, unknown>; - const candidate = mod.arkor ?? mod.default; - if (!isArkor(candidate)) return EMPTY; - const trainer = candidate.trainer ?
{ name: candidate.trainer.name } : null; - return { trainer }; + return summariseBuiltManifest(outFile); } diff --git a/packages/arkor/src/studio/server.test.ts b/packages/arkor/src/studio/server.test.ts index 214889d9..ed8b9298 100644 --- a/packages/arkor/src/studio/server.test.ts +++ b/packages/arkor/src/studio/server.test.ts @@ -11,6 +11,7 @@ import { import { tmpdir } from "node:os"; import { join, resolve } from "node:path"; import { buildStudioApp } from "./server"; +import type { HmrCoordinator, HmrEvent } from "./hmr"; import { writeCredentials } from "../core/credentials"; import { readState, writeState } from "../core/state"; import { @@ -82,14 +83,14 @@ describe("Studio server", () => { baseUrl: "http://mock", assetsDir, autoAnonymous: false, - // @ts-expect-error — intentionally omitted to assert the runtime guard + // @ts-expect-error: intentionally omitted to assert the runtime guard studioToken: undefined, }), ).toThrow(/studioToken/); }); it("HTML-escapes special characters in the studio token before injecting", async () => { - // Branch coverage for `htmlAttrEscape` — a defensive guard against + // Branch coverage for `htmlAttrEscape`: a defensive guard against // a token that contains `<`, `>`, `&`, `"`, `'`. randomBytes/base64url // never produces these, but the helper must still escape them so a // future token strategy can't break index.html parsing or open a @@ -111,7 +112,7 @@ describe("Studio server", () => { expect(html).toContain( '', ); - // The raw exotic token must not leak into HTML — an attacker who + // The raw exotic token must not leak into HTML: an attacker who // could influence the token (hypothetical) shouldn't be able to // inject markup. expect(html).not.toMatch(/content="<>/); @@ -133,6 +134,48 @@ describe("Studio server", () => { expect(html.indexOf("arkor-studio-token")).toBeLessThan( html.indexOf(""), ); + // HMR meta tag must NOT appear when no coordinator was supplied. 
+ // The SPA reads this flag to decide whether to open + // `/api/dev/events`; a stray "true" here would make every prod + // session retry against the 404 indefinitely. + expect(html).not.toContain("arkor-hmr-enabled"); + }); + + it("injects when an HMR coordinator is supplied", async () => { + // Regression: the SPA can't tell dev-mode usage from prod-mode + // usage at runtime: `vite build` ships with + // `import.meta.env.DEV === false`, so a build-time DEV gate inside + // the SPA bundle would (wrongly) suppress HMR even in real + // `arkor dev` sessions. The server-side flag is `true` exactly + // when `arkor dev` wired in an HMR coordinator. Verify it lands + // in `` next to the studio-token tag. + const fakeHmr = { + subscribe: () => () => undefined, + getCurrentConfigHash: () => null, + getCurrentArtifactHash: () => null, + getCurrentArtifactContentHash: () => null, + getLastEventType: () => null, + async dispose() {}, + }; + const app = buildStudioApp({ + baseUrl: "http://mock", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + hmr: fakeHmr, + }); + const res = await app.request("/", { + headers: { host: "127.0.0.1:4000" }, + }); + expect(res.status).toBe(200); + const html = await res.text(); + expect(html).toContain( + ``, + ); + expect(html.indexOf("arkor-hmr-enabled")).toBeLessThan( + html.indexOf(""), + ); }); it("serves non-html assets with the correct content-type", async () => { @@ -356,7 +399,7 @@ describe("Studio server", () => { expect(res.status).toBe(403); }); - // Regression for ENG-404 — `path.resolve` doesn't follow symlinks, so a + // Regression for ENG-404: `path.resolve` doesn't follow symlinks, so a // link inside the project directory pointing outside it would previously // pass the containment check and be handed to `arkor start` (which would // then dlopen the link's target). 
@@ -433,7 +476,7 @@ describe("Studio server", () => { expect(body.error).toMatch(/does not exist/); }); - // Regression for ENG-356 — `/api/train` previously resolved the bundled + // Regression for ENG-356: `/api/train` previously resolved the bundled // bin at `/bin.mjs` (one level above `dist/`), which never existed. // The DI'd `binPath` lets us assert (a) a working bin streams its stdout // through the response, and (b) a missing bin surfaces ENOENT-grade errors @@ -468,6 +511,12 @@ process.exit(0); body: JSON.stringify({}), }); expect(res.status).toBe(200); + // Regression: the spawned subprocess's pid is exposed via the + // `X-Arkor-Train-Pid` response header so the SPA can scope HMR + // restart events to its own child (a multi-tab broadcast can + // contain mixed restart/hot-swap targets across siblings). + const pidHeader = res.headers.get("x-arkor-train-pid"); + expect(pidHeader).toMatch(/^\d+$/); const text = await res.text(); expect(text).toContain("[fake-bin]"); // The bin receives `start` as the first non-flag arg. @@ -548,6 +597,763 @@ process.exit(0); expect(text).toContain("exit="); expect(text).not.toContain("exit=0"); }); + + it("captures the spawn-time configHash from the HMR coordinator (no extra rebuild)", async () => { + // Regression: `/api/train` previously called `readManifestSummary` + // which ran a full `runBuild()` per spawn: wasteful and racy + // against the HMR watcher writing the same `.arkor/build/index.mjs`. + // The new server reads the cached hash from + // `coordinator.getCurrentConfigHash()` instead. We assert the + // call happens (so a rebuild is *not* required) by exposing the + // spy count on the fake coordinator. 
+ await writeCredentials(ANON_CREDS); + let getCurrentCalls = 0; + const fakeHmr = { + subscribe: () => () => undefined, + getCurrentConfigHash: () => { + getCurrentCalls += 1; + return "spawn-time-hash"; + }, + getCurrentArtifactHash: () => "spawn-artefact-hash", + getCurrentArtifactContentHash: () => "spawn-artefact-content-hash", + getLastEventType: () => null, + async dispose() {}, + }; + const fakeBin = join(trainCwd, "fake-bin.mjs"); + writeFileSync(fakeBin, `process.exit(0);\n`); + const app = buildStudioApp({ + baseUrl: "http://mock", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + binPath: fakeBin, + hmr: fakeHmr, + }); + const res = await app.request("/api/train", { + method: "POST", + headers: { + host: "127.0.0.1:4000", + "x-arkor-studio-token": STUDIO_TOKEN, + "content-type": "application/json", + }, + body: JSON.stringify({}), + }); + expect(res.status).toBe(200); + // Drain the body so the close handler runs and the test + // doesn't leak the subprocess. + await res.text(); + expect(getCurrentCalls).toBe(1); + }); + + it("/api/train job-id parser ignores stderr so a `Started job ` line on stderr can't hijack the cancel POST", async () => { + // Regression: the job-id detector used to consume both + // stdout AND stderr through a shared `onChunk` + shared + // line buffer. A user `console.error("Started job ")` + // on stderr would then poison the buffer first; the real + // stdout marker arrives later but our `getJobId(...) === null` + // gate has already short-circuited subsequent scans, so + // Stop-training POSTs cancel for the wrong (decoy) job and + // the real one keeps running: silent cloud orphan. + // Splitting into a stdout-only `onStdoutChunk` parser and a + // forward-only `onStderrChunk` makes stderr unable to + // populate `jobId` regardless of what the user logs there. 
+ await writeCredentials(ANON_CREDS); + await writeState( + { + orgSlug: "stderr-test-org", + projectSlug: "stderr-test-project", + projectId: "p-stderr", + }, + trainCwd, + ); + // Bin emits a decoy `Started job ` to STDERR first + // (would poison the shared buffer), then the canonical real + // line to STDOUT, then hangs. With the split we expect the + // real id to win; with the bug the decoy would win. + const REAL_JOB_ID = "real-job-id"; + const DECOY_JOB_ID = "decoy-from-stderr"; + const fakeBin = join(trainCwd, "stderr-decoy-bin.mjs"); + // The real runner prefixes its canonical line with the + // per-spawn nonce the server injected via + // ARKOR_JOB_ID_MARKER_NONCE; the decoy on stderr deliberately + // uses the nonce too (worst-case: a user who somehow learned + // the nonce still can't hijack the parser by writing to the + // wrong stream). With the parser correctly stdout-only the + // real line wins regardless. + writeFileSync( + fakeBin, + `const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? ""; + process.stderr.write(\`[arkor:\${nonce}] Started job ${DECOY_JOB_ID}\\n\`); + // Slight delay so stderr lands first. + setTimeout(() => { + process.stdout.write(\`[arkor:\${nonce}] Started job ${REAL_JOB_ID}\\n\`); + }, 30); + process.on("SIGTERM", () => {}); + setInterval(() => {}, 60_000); + `, + ); + let cancelHits: Array<{ url: string }> = []; + const ORIG_FETCH = globalThis.fetch; + globalThis.fetch = (async ( + input: Parameters[0], + init?: Parameters[1], + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? 
"GET"; + if (method === "POST" && /\/v1\/jobs\/[^/]+\/cancel/.test(url)) { + cancelHits.push({ url }); + return new Response(JSON.stringify({ ok: true }), { + status: 200, + headers: { "content-type": "application/json" }, + }); + } + return new Response("not found", { status: 404 }); + }) as typeof fetch; + + try { + const app = buildStudioApp({ + baseUrl: "http://mock-cloud-api", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + binPath: fakeBin, + }); + const trainRes = await app.request("/api/train", { + method: "POST", + headers: { + host: "127.0.0.1:4000", + "x-arkor-studio-token": STUDIO_TOKEN, + "content-type": "application/json", + }, + body: JSON.stringify({}), + }); + expect(trainRes.status).toBe(200); + // Drain until the REAL line is in the body. Both the + // decoy and the real line forward through to the SPA log + // stream, so both bytes show up here regardless of which + // (if any) the parser captures. + const reader = trainRes.body!.getReader(); + const decoder = new TextDecoder(); + let buf = ""; + while (!buf.includes(`Started job ${REAL_JOB_ID}`)) { + const { value, done } = await reader.read(); + if (done) break; + buf += decoder.decode(value, { stream: true }); + } + await reader.cancel(); + await new Promise((r) => setTimeout(r, 200)); + + // The cancel POST must target the REAL id. With the bug + // the decoy would have been recorded first → cancelHits[0] + // would contain `decoy-from-stderr` instead. 
+ expect(cancelHits).toHaveLength(1); + expect(cancelHits[0]?.url).toContain(`/v1/jobs/${REAL_JOB_ID}/cancel`); + expect(cancelHits[0]?.url).not.toContain(DECOY_JOB_ID); + } finally { + globalThis.fetch = ORIG_FETCH; + } + }); + + it("/api/train cancel POSTs cloud /v1/jobs/:id/cancel so the cloud job is released even though SIGKILL bypasses the runner's shutdown handlers", async () => { + // Regression: SIGKILL kills the runner without giving its + // `installShutdownHandlers` a chance to issue the cloud + // `cancel()` POST itself. Without a server-side equivalent + // the cloud job sits in "running" until TTL/reaper, so a + // user clicking "Stop training" silently keeps consuming + // GPU spend. The fix parses the runner's `Started job ` + // stdout line, records the id on the registry entry, and + // fires a fire-and-forget POST to cloud-api on cancel + // *before* SIGKILLing. + await writeCredentials(ANON_CREDS); + // The cancel POST reads scope from `.arkor/state.json` (not + // from the anon creds' orgSlug; that's a different code + // path). Pre-seed so the POST can address the cloud job. + await writeState( + { + orgSlug: "cancel-test-org", + projectSlug: "cancel-test-project", + projectId: "p-cancel", + }, + trainCwd, + ); + // Bin prints the canonical "Started job " line then + // hangs (just like the real runner after `start()` resolves). + // The id is the same kind of identifier cloud-api would + // mint: an opaque string we'll verify shows up in the cancel + // POST URL below. + const FAKE_JOB_ID = "j-cancel-test"; + const fakeBin = join(trainCwd, "started-job-bin.mjs"); + // Prefix the marker with the per-spawn nonce the server + // injected via ARKOR_JOB_ID_MARKER_NONCE: that's the only + // shape the server's parser accepts, since user code can't + // know the nonce ahead of time (real runner deletes the env + // var before importing user modules). + writeFileSync( + fakeBin, + `const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? 
""; + process.stdout.write(\`[arkor:\${nonce}] Started job ${FAKE_JOB_ID}\\n\`); + process.on("SIGTERM", () => {}); + setInterval(() => {}, 60_000); + `, + ); + // Capture the cloud-api requests so we can verify the + // server's cancel POST landed with the right job id + + // scope. The default fetch in this suite would 404 our POST + // and leave it as `cancelCalls === 0`. + let cancelHits: Array<{ url: string; method: string }> = []; + const ORIG_FETCH = globalThis.fetch; + globalThis.fetch = (async ( + input: Parameters[0], + init?: Parameters[1], + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? "GET"; + if ( + method === "POST" && + url.includes(`/v1/jobs/${FAKE_JOB_ID}/cancel`) + ) { + cancelHits.push({ url, method }); + return new Response(JSON.stringify({ ok: true }), { + status: 200, + headers: { "content-type": "application/json" }, + }); + } + // Pass-through default: anything else 404s, which would + // surface as a test-side failure if our cancel POST + // doesn't match the expected URL shape. + return new Response("not found", { status: 404 }); + }) as typeof fetch; + + try { + const app = buildStudioApp({ + baseUrl: "http://mock-cloud-api", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + binPath: fakeBin, + }); + const trainRes = await app.request("/api/train", { + method: "POST", + headers: { + host: "127.0.0.1:4000", + "x-arkor-studio-token": STUDIO_TOKEN, + "content-type": "application/json", + }, + body: JSON.stringify({}), + }); + expect(trainRes.status).toBe(200); + // Read enough of the body to ensure the runner's + // `Started job ` chunk has been processed by the + // server's stdout parser (without this, cancel could + // race ahead of the parser and find no jobId on the + // registry → no cancel POST → false test failure). 
+ const reader = trainRes.body!.getReader(); + const decoder = new TextDecoder(); + let buf = ""; + while (!buf.includes(`Started job ${FAKE_JOB_ID}`)) { + const { value, done } = await reader.read(); + if (done) break; + buf += decoder.decode(value, { stream: true }); + } + // Trigger cancel: should fire the cloud POST + SIGKILL. + await reader.cancel(); + // Fire-and-forget: give the void IIFE a tick to actually + // dispatch the fetch + receive the 200 response. + await new Promise((r) => setTimeout(r, 200)); + + expect(cancelHits).toHaveLength(1); + expect(cancelHits[0]?.url).toContain(`/v1/jobs/${FAKE_JOB_ID}/cancel`); + // Scope is required by the cloud-api contract: comes from + // `.arkor/state.json` (seeded above), not the anon creds. + expect(cancelHits[0]?.url).toContain("orgSlug=cancel-test-org"); + expect(cancelHits[0]?.url).toContain("projectSlug=cancel-test-project"); + } finally { + globalThis.fetch = ORIG_FETCH; + } + }); + + it("/api/train cancel uses the spawn-time scope from the registry even when state.json was deleted mid-training", async () => { + // Regression: the cancel handler used to re-read + // `.arkor/state.json` at stop time to address the cloud cancel + // POST. If the user removed or made the file unreadable + // mid-training (rm -rf .arkor, accidental git clean -fdx, fs + // unmounted), the read returned null and the handler silently + // skipped the POST: the local SIGKILL still tore down the + // subprocess but the cloud job orphaned until TTL/reaper. The + // fix captures `{orgSlug, projectSlug}` on the registry entry + // at spawn time so the cancel POST is decoupled from + // mutable filesystem state. 
+ await writeCredentials(ANON_CREDS); + await writeState( + { + orgSlug: "scope-pin-org", + projectSlug: "scope-pin-project", + projectId: "p-scope-pin", + }, + trainCwd, + ); + const FAKE_JOB_ID = "j-scope-pin"; + const fakeBin = join(trainCwd, "scope-pin-bin.mjs"); + writeFileSync( + fakeBin, + `const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? ""; + process.stdout.write(\`[arkor:\${nonce}] Started job ${FAKE_JOB_ID}\\n\`); + process.on("SIGTERM", () => {}); + setInterval(() => {}, 60_000); + `, + ); + let cancelHits: Array<{ url: string }> = []; + const ORIG_FETCH = globalThis.fetch; + globalThis.fetch = (async ( + input: Parameters[0], + init?: Parameters[1], + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? "GET"; + if (method === "POST" && /\/v1\/jobs\/[^/]+\/cancel/.test(url)) { + cancelHits.push({ url }); + return new Response(JSON.stringify({ ok: true }), { + status: 200, + headers: { "content-type": "application/json" }, + }); + } + return new Response("not found", { status: 404 }); + }) as typeof fetch; + + try { + const app = buildStudioApp({ + baseUrl: "http://mock-cloud-api", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + binPath: fakeBin, + }); + const trainRes = await app.request("/api/train", { + method: "POST", + headers: { + host: "127.0.0.1:4000", + "x-arkor-studio-token": STUDIO_TOKEN, + "content-type": "application/json", + }, + body: JSON.stringify({}), + }); + expect(trainRes.status).toBe(200); + const reader = trainRes.body!.getReader(); + const decoder = new TextDecoder(); + let buf = ""; + while (!buf.includes(`Started job ${FAKE_JOB_ID}`)) { + const { value, done } = await reader.read(); + if (done) break; + buf += decoder.decode(value, { stream: true }); + } + // The hostile mid-training mutation: nuke the state file + // that the OLD code would have re-read at cancel time. 
+ rmSync(join(trainCwd, ".arkor"), { recursive: true, force: true }); + // Cancel: under the bug, the handler's state read returns + // null and the cancel POST is silently skipped. With the + // fix, the registry-pinned scope is used and the POST goes + // out anyway. + await reader.cancel(); + await new Promise((r) => setTimeout(r, 200)); + + expect(cancelHits).toHaveLength(1); + expect(cancelHits[0]?.url).toContain(`/v1/jobs/${FAKE_JOB_ID}/cancel`); + expect(cancelHits[0]?.url).toContain("orgSlug=scope-pin-org"); + expect(cancelHits[0]?.url).toContain("projectSlug=scope-pin-project"); + } finally { + globalThis.fetch = ORIG_FETCH; + } + }); + + it("/api/train cancel falls back to reading .arkor/state.json when no scope was captured at spawn time (first-run anon)", async () => { + // Regression: capturing the cloud scope at spawn time covered + // the "user deleted state mid-training" hazard but broke the + // common first-run anonymous flow. On a fresh project, + // `.arkor/state.json` is created by `ensureProjectState` from + // INSIDE the child during `trainer.start()`, i.e. AFTER spawn. + // The spawn-time `readState(trainCwd)` therefore returns null, + // `pinnedScope` stays null, and the previous code silently + // skipped the cancel POST: the local SIGKILL tore down the + // subprocess but the cloud job was orphaned. The fix uses the + // pinned spawn-time scope WHEN PRESENT (delete-mid-training + // hazard) and falls back to reading at cancel time when it + // was null (first-run anon). + await writeCredentials(ANON_CREDS); + // Deliberately DO NOT seed state at spawn time. The bin will + // write it AFTER spawn, just before its `Started job ` line + // lands, simulating the order `ensureProjectState`/`trainer.start()` + // produce in a real anon first-run.
+ const FAKE_JOB_ID = "j-late-scope"; + const stateDir = join(trainCwd, ".arkor"); + const statePath = join(stateDir, "state.json"); + const fakeBin = join(trainCwd, "late-scope-bin.mjs"); + writeFileSync( + fakeBin, + `import { mkdirSync, writeFileSync } from "node:fs"; + const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? ""; + // Mirror runner order: state appears AFTER spawn, but BEFORE + // the Started job line so the cancel-time read sees it. + mkdirSync(${JSON.stringify(stateDir)}, { recursive: true }); + writeFileSync(${JSON.stringify(statePath)}, JSON.stringify({ + orgSlug: "late-scope-org", + projectSlug: "late-scope-project", + projectId: "p-late-scope", + })); + process.stdout.write(\`[arkor:\${nonce}] Started job ${FAKE_JOB_ID}\\n\`); + process.on("SIGTERM", () => {}); + setInterval(() => {}, 60_000); + `, + ); + let cancelHits: Array<{ url: string }> = []; + const ORIG_FETCH = globalThis.fetch; + globalThis.fetch = (async ( + input: Parameters[0], + init?: Parameters[1], + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? 
"GET"; + if (method === "POST" && /\/v1\/jobs\/[^/]+\/cancel/.test(url)) { + cancelHits.push({ url }); + return new Response(JSON.stringify({ ok: true }), { + status: 200, + headers: { "content-type": "application/json" }, + }); + } + return new Response("not found", { status: 404 }); + }) as typeof fetch; + + try { + const app = buildStudioApp({ + baseUrl: "http://mock-cloud-api", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + binPath: fakeBin, + }); + const trainRes = await app.request("/api/train", { + method: "POST", + headers: { + host: "127.0.0.1:4000", + "x-arkor-studio-token": STUDIO_TOKEN, + "content-type": "application/json", + }, + body: JSON.stringify({}), + }); + expect(trainRes.status).toBe(200); + const reader = trainRes.body!.getReader(); + const decoder = new TextDecoder(); + let buf = ""; + while (!buf.includes(`Started job ${FAKE_JOB_ID}`)) { + const { value, done } = await reader.read(); + if (done) break; + buf += decoder.decode(value, { stream: true }); + } + await reader.cancel(); + await new Promise((r) => setTimeout(r, 250)); + + // Under the bug there were 0 cancel hits (pinned scope null + // → skip). With the fix the cancel-time read recovers the + // scope the child just wrote. + expect(cancelHits).toHaveLength(1); + expect(cancelHits[0]?.url).toContain(`/v1/jobs/${FAKE_JOB_ID}/cancel`); + expect(cancelHits[0]?.url).toContain("orgSlug=late-scope-org"); + expect(cancelHits[0]?.url).toContain("projectSlug=late-scope-project"); + } finally { + globalThis.fetch = ORIG_FETCH; + } + }); + + it("/api/train job-id parser ignores stdout lines that lack the per-spawn nonce prefix so user code can't forge a `Started job` marker", async () => { + // Regression: the parser used to match any `Started job ` + // line in stdout. 
User code (which runs inside the runner's + // `await import(userEntry)` chain and therefore shares the + // child's stdout) could write `console.log("Started job + // attacker-chosen-id")` before the runner's canonical line + // arrives, the parser would record the attacker's id, and + // Stop-training would POST `/v1/jobs//cancel` + // against a job the attacker picked. The fix injects a + // per-spawn 32-hex nonce via ARKOR_JOB_ID_MARKER_NONCE that + // the server's regex anchors on; runner.ts deletes the env + // var before dynamically importing the user module, so user + // code can't read the nonce via `process.env` either. + await writeCredentials(ANON_CREDS); + await writeState( + { + orgSlug: "nonce-org", + projectSlug: "nonce-project", + projectId: "p-nonce", + }, + trainCwd, + ); + const REAL_JOB_ID = "real-nonce-job"; + const SPOOF_JOB_ID = "attacker-chosen-id"; + const fakeBin = join(trainCwd, "spoof-bin.mjs"); + // Bin first emits an UNPREFIXED spoof on stdout (mimicking + // hostile user code), THEN the real nonce-prefixed canonical + // line. With the fix the spoof is rejected; the real line + // wins and the cancel POST targets the real id. + writeFileSync( + fakeBin, + `const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? ""; + process.stdout.write("Started job ${SPOOF_JOB_ID}\\n"); + setTimeout(() => { + process.stdout.write(\`[arkor:\${nonce}] Started job ${REAL_JOB_ID}\\n\`); + }, 30); + process.on("SIGTERM", () => {}); + setInterval(() => {}, 60_000); + `, + ); + let cancelHits: Array<{ url: string }> = []; + const ORIG_FETCH = globalThis.fetch; + globalThis.fetch = (async ( + input: Parameters<typeof fetch>[0], + init?: Parameters<typeof fetch>[1], + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? 
"GET"; + if (method === "POST" && /\/v1\/jobs\/[^/]+\/cancel/.test(url)) { + cancelHits.push({ url }); + return new Response(JSON.stringify({ ok: true }), { + status: 200, + headers: { "content-type": "application/json" }, + }); + } + return new Response("not found", { status: 404 }); + }) as typeof fetch; + + try { + const app = buildStudioApp({ + baseUrl: "http://mock-cloud-api", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + binPath: fakeBin, + }); + const trainRes = await app.request("/api/train", { + method: "POST", + headers: { + host: "127.0.0.1:4000", + "x-arkor-studio-token": STUDIO_TOKEN, + "content-type": "application/json", + }, + body: JSON.stringify({}), + }); + expect(trainRes.status).toBe(200); + const reader = trainRes.body!.getReader(); + const decoder = new TextDecoder(); + let buf = ""; + // Wait for the REAL line (with nonce prefix) to be visible + // in the body. Both lines forward to the SPA log + // regardless of which (if any) the parser captures, so the + // body is a reliable readiness signal. + while (!buf.includes(`Started job ${REAL_JOB_ID}`)) { + const { value, done } = await reader.read(); + if (done) break; + buf += decoder.decode(value, { stream: true }); + } + await reader.cancel(); + await new Promise((r) => setTimeout(r, 200)); + + // Cancel POST landed against the REAL id: the spoof was + // rejected by the anchored nonce-prefixed regex. + expect(cancelHits).toHaveLength(1); + expect(cancelHits[0]?.url).toContain(`/v1/jobs/${REAL_JOB_ID}/cancel`); + expect(cancelHits[0]?.url).not.toContain(SPOOF_JOB_ID); + } finally { + globalThis.fetch = ORIG_FETCH; + } + }); + + it("/api/train cancel sends SIGKILL so user-initiated stop bypasses the runner's graceful early-stop", async () => { + // Regression: a default `child.kill()` sends SIGTERM, which + // the runner's `installShutdownHandlers` now interprets as a + // graceful early-stop request (wait for the next checkpoint, + // up to ~5 min). 
For HMR-driven cancels that's correct, but + // for a Stop-training click the user wants the run STOPPED + // immediately. Leaving it running in the background for + // minutes consuming GPU spend silently is a regression + // introduced by this PR's graceful-shutdown work. We assert + // SIGKILL by giving the bin a SIGTERM no-op handler: SIGTERM + // would be swallowed and the bin would stay alive; SIGKILL + // is uncatchable and reaps the process unconditionally. + // Probe liveness with `process.kill(pid, 0)` (ESRCH ⇒ gone). + await writeCredentials(ANON_CREDS); + const hangingBin = join(trainCwd, "hanging-bin.mjs"); + writeFileSync( + hangingBin, + // SIGTERM swallowed; setInterval keeps the event loop + // alive forever absent SIGKILL. + `process.on("SIGTERM", () => {}); + setInterval(() => {}, 60_000); + `, + ); + const app = buildStudioApp({ + baseUrl: "http://mock", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + binPath: hangingBin, + }); + const res = await app.request("/api/train", { + method: "POST", + headers: { + host: "127.0.0.1:4000", + "x-arkor-studio-token": STUDIO_TOKEN, + "content-type": "application/json", + }, + body: JSON.stringify({}), + }); + expect(res.status).toBe(200); + const pid = Number(res.headers.get("x-arkor-train-pid")); + expect(Number.isFinite(pid)).toBe(true); + + // Trigger the cancel() handler. + await res.body!.cancel(); + + // Give the OS a moment to deliver SIGKILL and reap. + await new Promise((r) => setTimeout(r, 300)); + + // `process.kill(pid, 0)` is the standard "is this pid alive?" + // probe: sends signal 0 (no-op) but the syscall still + // surfaces ESRCH for non-existent pids. SIGKILL → reaped → + // ESRCH. SIGTERM (with the bin's no-op handler) → still + // alive → no throw → test fails. 
+ let probeError: NodeJS.ErrnoException | null = null; + try { + process.kill(pid, 0); + } catch (e) { + probeError = e as NodeJS.ErrnoException; + } + expect(probeError).not.toBeNull(); + expect(probeError?.code).toBe("ESRCH"); + }); + + it("/api/train cancel handler doesn't crash when child.kill() throws", async () => { + // Regression: `ReadableStream.cancel()` called `child.kill()` + // without a try/catch. If the child had already exited (ESRCH + // race against the cancel), the throw bubbled up as an + // unhandled exception and crashed the request handler. + await writeCredentials(ANON_CREDS); + const fakeBin = join(trainCwd, "fake-bin.mjs"); + // Bin exits immediately so the child is already dead by the + // time our cancel handler tries to signal it. + writeFileSync(fakeBin, `process.exit(0);\n`); + const app = buildStudioApp({ + baseUrl: "http://mock", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + binPath: fakeBin, + }); + const res = await app.request("/api/train", { + method: "POST", + headers: { + host: "127.0.0.1:4000", + "x-arkor-studio-token": STUDIO_TOKEN, + "content-type": "application/json", + }, + body: JSON.stringify({}), + }); + expect(res.status).toBe(200); + // Race: read enough of the body to see the close, then cancel. + // The cancel hook must not throw even when the underlying + // child is already gone. + const reader = res.body!.getReader(); + // Wait for `exit=` so we know the child died first. 
+ let buf = ""; + const decoder = new TextDecoder(); + while (!buf.includes("exit=")) { + const { value, done } = await reader.read(); + if (done) break; + buf += decoder.decode(value, { stream: true }); + } + await expect(reader.cancel()).resolves.toBeUndefined(); + }); + + it("/api/train survives cancellation while the child is still streaming output", async () => { + // Regression: the previous implementation registered raw + // `controller.enqueue(...)` listeners on `child.stdout` / + // `child.stderr` and an unguarded `controller.close()` in + // `child.on("close")`. After the client cancelled the + // ReadableStream, those handlers kept firing, and calling + // `enqueue` / `close` on a closed controller throws "Invalid + // state". The throw escaped the request pipeline as an + // unhandled exception. The fix flips a `closed` flag in + // `cancelTeardown` and try/catches the post-cancel enqueue + // paths defensively. NOTE: cancel intentionally does NOT + // detach the `data` listeners; leaving them attached keeps + // the OS pipe draining while the child checkpoints / exits + // gracefully (otherwise a full pipe back-pressures and + // deadlocks the very graceful exit we're preserving). + // `onClose` / `onError` detach all listeners when the child + // finally exits. See `cancelTeardown` in `studio/server.ts` + // for the full backpressure rationale. + await writeCredentials(ANON_CREDS); + const fakeBin = join(trainCwd, "fake-bin.mjs"); + // Bin spits a chunk every ~5 ms forever. We cancel while it's + // mid-stream so the child is *still alive* when listeners are + // removed: the previous bug only surfaced in this window. 
+ writeFileSync( + fakeBin, + `setInterval(() => process.stdout.write("tick\\n"), 5);\nsetInterval(() => {}, 60_000);\n`, + ); + const app = buildStudioApp({ + baseUrl: "http://mock", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + binPath: fakeBin, + }); + const res = await app.request("/api/train", { + method: "POST", + headers: { + host: "127.0.0.1:4000", + "x-arkor-studio-token": STUDIO_TOKEN, + "content-type": "application/json", + }, + body: JSON.stringify({}), + }); + expect(res.status).toBe(200); + const reader = res.body!.getReader(); + // Read at least one chunk so the child is definitely streaming + // before we cancel: that's the race window the previous code + // crashed in. + const decoder = new TextDecoder(); + let received = ""; + while (!received.includes("tick")) { + const { value, done } = await reader.read(); + if (done) break; + received += decoder.decode(value, { stream: true }); + } + // Listen for unhandled rejections / uncaught exceptions during + // and shortly after the cancel: before the fix, the child's + // next `data` chunk would synchronously throw inside the + // enqueue callback. + const errors: unknown[] = []; + const onUnhandled = (err: unknown) => errors.push(err); + process.on("uncaughtException", onUnhandled); + process.on("unhandledRejection", onUnhandled); + try { + await reader.cancel(); + // Give the child's interval a few iterations to attempt + // post-cancel writes. The handler must short-circuit on the + // `closed` flag and not crash the worker. 
+ await new Promise((r) => setTimeout(r, 50)); + } finally { + process.off("uncaughtException", onUnhandled); + process.off("unhandledRejection", onUnhandled); + } + expect(errors).toEqual([]); + }); }); describe("auto-anonymous bootstrap", () => { @@ -557,7 +1363,7 @@ process.exit(0); }); it("acquires + persists an anonymous token on the first /api/credentials hit when autoAnonymous=true", async () => { - // No credentials on disk — buildStudioApp's autoAnonymous default + // No credentials on disk: buildStudioApp's autoAnonymous default // (true) lets the server bootstrap on first hit so a fresh `arkor // dev` works even when the up-front bootstrap in dev.ts skipped due // to a transient network blip. @@ -599,7 +1405,7 @@ process.exit(0); expect(body).toMatchObject({ token: "lazy-anon", mode: "anon" }); expect(calls).toBe(1); - // Subsequent calls use the persisted credentials — no re-bootstrap. + // Subsequent calls use the persisted credentials (no re-bootstrap). const res2 = await app.request("/api/credentials", { headers: { host: "127.0.0.1:4000", @@ -674,7 +1480,7 @@ process.exit(0); // The cloud-api-client wrapper around `onDeprecation` synchronously // checks `typeof result.then` on the callback's return value; a plain // `void` return throws and gets swallowed with a stderr log. The - // wrapper in `createRpc` returns null to short-circuit that check — + // wrapper in `createRpc` returns null to short-circuit that check; // assert that no such log fires here. 
const errorSpy = vi .spyOn(console, "error") @@ -1173,6 +1979,179 @@ process.exit(0); const body = (await res.json()) as { trainer: unknown }; expect(body.trainer).toBeNull(); }); + + it("skips runBuild() when HMR is enabled and the watcher's artefact already exists", async () => { + // Regression: previously every `/api/manifest` poll triggered a + // fresh `runBuild()` even with HMR active, so the SPA's + // ~5 s polling + per-rebuild SSE refetch would re-bundle on + // every poll AND race the watcher writing to the same + // `.arkor/build/index.mjs`. The fast path inspects the + // pre-existing artefact directly when HMR's coordinator is + // wired in. We assert by pre-writing a hand-rolled artefact + // bundle and verifying `/api/manifest` returns its trainer + // *without* the source file existing: `runBuild()` would + // throw on the missing entry, so a 200 here proves we never + // called it. + await writeCredentials(ANON_CREDS); + // Write the artefact that the HMR watcher would have produced. + // Mirrors the seed fixture's shape: `_kind: "arkor"` + trainer + // with the four required methods. + mkdirSync(join(trainCwd, ".arkor/build"), { recursive: true }); + writeFileSync( + join(trainCwd, ".arkor/build/index.mjs"), + `const trainer = { + name: "hmr-fast-path", + start: async () => ({ jobId: "j" }), + wait: async () => ({ job: {}, artifacts: [] }), + cancel: async () => {}, + }; + export const arkor = { _kind: "arkor", trainer }; + export default arkor; + `, + ); + // Notice: NO `src/arkor/index.ts`. `runBuild()` would fail with + // "Build entry not found"; the test fails if the fast path + // regresses and falls through to it. 
+ const fakeHmr = { + subscribe: () => () => undefined, + getCurrentConfigHash: () => null, + getCurrentArtifactHash: () => null, + getCurrentArtifactContentHash: () => null, + getLastEventType: () => null, + async dispose() {}, + }; + const app = buildStudioApp({ + baseUrl: "http://mock", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + hmr: fakeHmr, + }); + const res = await app.request("/api/manifest", { + headers: { + host: "127.0.0.1:4000", + "x-arkor-studio-token": STUDIO_TOKEN, + }, + }); + expect(res.status).toBe(200); + const body = (await res.json()) as { + trainer: { name: string } | null; + }; + expect(body.trainer).toEqual({ name: "hmr-fast-path" }); + }); + + it("falls back to runBuild() when HMR is enabled but the watcher hasn't produced an artefact yet", async () => { + // Companion to the fast-path test: on a fresh scaffold the + // watcher's first BUNDLE_END may not have completed by the + // time the SPA's first /api/manifest poll lands. Without the + // existsSync gate we'd `await import(missing)` and 400 + // forever (the watcher's later writes don't retroactively + // make this poll succeed); with the gate we bootstrap via + // `runBuild()` for that single call. + await writeCredentials(ANON_CREDS); + mkdirSync(join(trainCwd, "src/arkor"), { recursive: true }); + writeFileSync( + join(trainCwd, "src/arkor/index.ts"), + `export const arkor = Object.freeze({ + _kind: "arkor", + trainer: { + name: "fallback-build", + start: async () => ({ jobId: "j" }), + wait: async () => ({ job: {}, artifacts: [] }), + cancel: async () => {}, + }, + });`, + ); + // No pre-existing `.arkor/build/index.mjs`: the artefact + // doesn't exist. `existsSync` is false → `runBuild()` runs. 
+ const fakeHmr = { + subscribe: () => () => undefined, + getCurrentConfigHash: () => null, + getCurrentArtifactHash: () => null, + getCurrentArtifactContentHash: () => null, + getLastEventType: () => null, + async dispose() {}, + }; + const app = buildStudioApp({ + baseUrl: "http://mock", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + hmr: fakeHmr, + }); + const res = await app.request("/api/manifest", { + headers: { + host: "127.0.0.1:4000", + "x-arkor-studio-token": STUDIO_TOKEN, + }, + }); + expect(res.status).toBe(200); + const body = (await res.json()) as { + trainer: { name: string } | null; + }; + expect(body.trainer).toEqual({ name: "fallback-build" }); + }); + + it("returns 400 (not stale 200) while the HMR watcher is in error state", async () => { + // Regression: the HMR fast path served the last-built artefact + // even when the watcher's most recent event was `error`. The + // SPA's `/api/manifest` poll runs every ~5s, so a successful + // 200 with stale data would silently overwrite the SSE-driven + // build-error UI within 5s of the user breaking their source: + // they'd then unknowingly run stale code/config while the + // latest edit is still failing to compile. Gating the fast + // path on `getLastEventType() === "error"` keeps both + // channels (poll + SSE) consistent. + await writeCredentials(ANON_CREDS); + mkdirSync(join(trainCwd, ".arkor/build"), { recursive: true }); + // Pre-write a previously-good artefact so the fast path + // *would* otherwise return 200 with it. + writeFileSync( + join(trainCwd, ".arkor/build/index.mjs"), + `const trainer = { + name: "stale-good-build", + start: async () => ({ jobId: "j" }), + wait: async () => ({ job: {}, artifacts: [] }), + cancel: async () => {}, + }; + export const arkor = { _kind: "arkor", trainer }; + export default arkor; + `, + ); + // Coordinator is currently in error state: the latest + // broadcast was a compile failure. 
+ const fakeHmr = { + subscribe: () => () => undefined, + getCurrentConfigHash: () => null, + getCurrentArtifactHash: () => null, + getCurrentArtifactContentHash: () => null, + getLastEventType: () => "error" as const, + async dispose() {}, + }; + const app = buildStudioApp({ + baseUrl: "http://mock", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + hmr: fakeHmr, + }); + const res = await app.request("/api/manifest", { + headers: { + host: "127.0.0.1:4000", + "x-arkor-studio-token": STUDIO_TOKEN, + }, + }); + // 400: the SPA's existing 4xx-handling path renders the + // build-error hint instead of a fake-healthy manifest. + expect(res.status).toBe(400); + const body = (await res.json()) as { error?: string }; + expect(body.error).toMatch(/Build failed/); + // Sanity: the stale artefact name is NOT leaked through. + expect(JSON.stringify(body)).not.toContain("stale-good-build"); + }); }); describe("/api/inference/chat", () => { @@ -1184,7 +2163,7 @@ process.exit(0); it("auto-bootstraps project state and proxies base-model inference", async () => { await writeCredentials(ANON_CREDS); - // No state.json — server should derive a slug from cwd, create the + // No state.json: server should derive a slug from cwd, create the // project on cloud-api, persist state, and forward the inference call. const calls: Array<{ @@ -1313,7 +2292,7 @@ process.exit(0); }); expect(res.status).toBe(200); - // Only the inference call should have hit the network — no project + // Only the inference call should have hit the network: no project // create/list when state is already present. expect(calls.filter((c) => c.url.includes("/v1/projects"))).toHaveLength(0); const chat = calls.find((c) => c.url.includes("/v1/inference/chat")); @@ -1374,7 +2353,7 @@ process.exit(0); it("propagates the cloud-api status when project bootstrap fails", async () => { await writeCredentials(ANON_CREDS); - // No state.json — bootstrap will hit cloud-api, which returns 503. 
+ // No state.json: bootstrap will hit cloud-api, which returns 503. // We expect that 503 to be passed through, not collapsed to 400. globalThis.fetch = (async ( @@ -1435,13 +2414,436 @@ process.exit(0); }); }); - // ------------------------------------------------------------------------- - // Deployments (`/api/deployments/*`) — minimal coverage of the router - // boundary. Cloud-side semantics already have heavy test coverage in - // `core/client.deployments.test.ts`; here we verify only that the Studio - // server forwards correctly, returns the empty wrapper when no project - // state exists, and surfaces upstream errors verbatim. - // ------------------------------------------------------------------------- + describe("/api/dev/events (HMR)", () => { + function fakeHmr(initialConfigHash: string | null = null) { + // Mirror the real HmrCoordinator surface but stay synchronous so + // the test doesn't depend on rolldown.watch starting up. `emit` + // is a test hook for pushing events into the SSE stream from the + // test body; `currentConfigHash` is a settable mock for what + // `/api/train` reads via `getCurrentConfigHash` to capture the + // spawned-config snapshot. + const subs = new Set<(e: HmrEvent) => void>(); + let currentConfigHash: string | null = initialConfigHash; + // Match the real coordinator's behaviour: a stable artefact + // fingerprint at spawn time. Tests that exercise the + // pre-ready-spawn path (configHash null, then a real hash) + // can override via `setArtifactHash`. 
+ let currentArtifactHash: string | null = "fake-artefact-hash"; + let currentArtifactContentHash: string | null = + "fake-artefact-content-hash"; + let lastEventType: HmrEvent["type"] | null = null; + const coordinator: HmrCoordinator = { + subscribe(fn) { + subs.add(fn); + return () => { + subs.delete(fn); + }; + }, + getCurrentConfigHash() { + return currentConfigHash; + }, + getCurrentArtifactHash() { + return currentArtifactHash; + }, + getCurrentArtifactContentHash() { + return currentArtifactContentHash; + }, + getLastEventType() { + return lastEventType; + }, + async dispose() { + subs.clear(); + }, + }; + return { + coordinator, + emit(event: HmrEvent) { + // Track the latest event type so `getLastEventType()` + // mirrors the real coordinator's `lastEvent?.type`; + // the `/api/manifest` HMR-error gate consults this. + lastEventType = event.type; + for (const fn of subs) fn(event); + }, + setConfigHash(hash: string | null) { + currentConfigHash = hash; + }, + setArtifactHash(hash: string | null) { + currentArtifactHash = hash; + }, + setArtifactContentHash(hash: string | null) { + currentArtifactContentHash = hash; + }, + setLastEventType(t: HmrEvent["type"] | null) { + lastEventType = t; + }, + get subscriberCount() { + return subs.size; + }, + }; + } + + it("is unregistered when no hmr coordinator is supplied", async () => { + const app = build(); + const res = await app.request("/api/dev/events", { + headers: { + host: "127.0.0.1:4000", + "x-arkor-studio-token": STUDIO_TOKEN, + }, + }); + expect(res.status).toBe(404); + }); + + it("rejects /api/dev/events without a token", async () => { + const fake = fakeHmr(); + const app = buildStudioApp({ + baseUrl: "http://mock", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + hmr: fake.coordinator, + }); + const res = await app.request("/api/dev/events", { + headers: { host: "127.0.0.1:4000" }, + }); + expect(res.status).toBe(403); + }); + + it("accepts the studio token via 
?studioToken= for the dev event stream", async () => { + const fake = fakeHmr(); + const app = buildStudioApp({ + baseUrl: "http://mock", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + hmr: fake.coordinator, + }); + // The server subscribes to the HMR coordinator exactly once at + // build time (so multiple SSE clients don't fan signal dispatch + // out to the same child N times). Per-client cleanup happens on + // the SSE listener set, not against the coordinator, so + // `fake.subscriberCount` stays at 1 across the connection + // lifecycle. We assert that here rather than expect the + // pre-refactor "0 after cancel" behaviour. + expect(fake.subscriberCount).toBe(1); + const res = await app.request( + `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`, + { headers: { host: "127.0.0.1:4000" } }, + ); + expect(res.status).toBe(200); + expect(res.headers.get("content-type")).toBe("text/event-stream"); + const reader = res.body!.getReader(); + await reader.cancel(); + // Cancel doesn't unsubscribe the server-level listener; emitting + // an event after cancel must still be safe (the SSE listener that + // was registered for this connection is removed, so the + // controller-closed try/catch in `send` is never exercised). 
+ expect(() => + fake.emit({ + type: "rebuild", + outFile: "/tmp/x", + hash: "h", + configHash: null, + trainerName: null, + }), + ).not.toThrow(); + }); + + it("rejects /api/dev/events when host header is non-loopback", async () => { + const fake = fakeHmr(); + const app = buildStudioApp({ + baseUrl: "http://mock", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + hmr: fake.coordinator, + }); + const res = await app.request( + `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`, + { headers: { host: "evil.example.com" } }, + ); + expect(res.status).toBe(403); + }); + + it("dispatches HMR signals exactly once per rebuild regardless of connected SSE client count", async () => { + // Regression: previously each `/api/dev/events` connection + // attached its own `hmr.subscribe(...)` callback, so a rebuild + // with N open Studio tabs fanned out into N × SIGUSR2 / N × + // SIGTERM per child. The runner's shutdown handler interprets a + // *second* SIGTERM as the emergency `exit(143)` fast-path, which + // would defeat checkpoint preservation. The server now subscribes + // to the coordinator exactly once and broadcasts the augmented + // payload to every SSE client; we assert that subscriber count + // doesn't grow when extra connections are opened. + const fake = fakeHmr(); + const app = buildStudioApp({ + baseUrl: "http://mock", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + hmr: fake.coordinator, + }); + expect(fake.subscriberCount).toBe(1); + const r1 = await app.request( + `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`, + { headers: { host: "127.0.0.1:4000" } }, + ); + const r2 = await app.request( + `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`, + { headers: { host: "127.0.0.1:4000" } }, + ); + // Pump the streams so their `start()` runs, registering the + // per-client SSE listeners on the server side. 
+ const reader1 = r1.body!.getReader(); + const reader2 = r2.body!.getReader(); + // Even with two concurrent SSE clients the HMR coordinator still + // sees exactly the one server-level subscriber. + expect(fake.subscriberCount).toBe(1); + await reader1.cancel(); + await reader2.cancel(); + expect(fake.subscriberCount).toBe(1); + }); + + it("/api/train cancel still fires cloud cancel POST + SIGKILL even when HMR has already requested early-stop", async () => { + // Regression: the cancel handler used to short-circuit + // (`if (earlyStopInFlight) return;`) when HMR's + // `dispatchRebuild` had already SIGTERMed the child for a + // graceful checkpoint-wait early-stop. That gate was added + // to avoid a second SIGTERM piling on top of the first + // (which would have triggered the runner's `exit(143)` + // emergency path and broken cloud cancel POSTing). With + // SIGKILL replacing the user-stop SIGTERM, the + // double-signal worry no longer applies, and the gate + // turned a Stop click during HMR's graceful window into a + // total no-op, leaving the run alive until checkpoint / + // 5-min timeout. Manual stop now overrides HMR's graceful + // path: server POSTs cloud cancel + SIGKILLs the + // subprocess regardless of `isEarlyStopRequested`. + await writeCredentials(ANON_CREDS); + await writeState( + { + orgSlug: "manual-override-org", + projectSlug: "manual-override-project", + projectId: "p-manual", + }, + trainCwd, + ); + const FAKE_JOB_ID = "manual-stop-during-hmr"; + const fakeBin = join(trainCwd, "manual-during-hmr-bin.mjs"); + // SIGTERM no-op so HMR's graceful SIGTERM doesn't terminate + // the bin; we need it alive so the subsequent manual + // cancel actually has something to SIGKILL. Marker uses the + // server-injected nonce prefix so the parser accepts it. + writeFileSync( + fakeBin, + `const nonce = process.env.ARKOR_JOB_ID_MARKER_NONCE ?? 
""; + process.stdout.write(\`[arkor:\${nonce}] Started job ${FAKE_JOB_ID}\\n\`); + process.on("SIGTERM", () => {}); + setInterval(() => {}, 60_000); + `, + ); + let cancelHits: Array<{ url: string }> = []; + const ORIG_FETCH = globalThis.fetch; + globalThis.fetch = (async ( + input: Parameters<typeof fetch>[0], + init?: Parameters<typeof fetch>[1], + ) => { + const url = typeof input === "string" ? input : input.toString(); + const method = init?.method ?? "GET"; + if (method === "POST" && /\/v1\/jobs\/[^/]+\/cancel/.test(url)) { + cancelHits.push({ url }); + return new Response(JSON.stringify({ ok: true }), { + status: 200, + headers: { "content-type": "application/json" }, + }); + } + return new Response("not found", { status: 404 }); + }) as typeof fetch; + + try { + const fake = fakeHmr("h1"); + const app = buildStudioApp({ + baseUrl: "http://mock-cloud-api", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + binPath: fakeBin, + hmr: fake.coordinator, + }); + const trainRes = await app.request("/api/train", { + method: "POST", + headers: { + host: "127.0.0.1:4000", + "x-arkor-studio-token": STUDIO_TOKEN, + "content-type": "application/json", + }, + body: JSON.stringify({}), + }); + expect(trainRes.status).toBe(200); + const pid = Number(trainRes.headers.get("x-arkor-train-pid")); + // Drain until the parser has recorded the job id. + const reader = trainRes.body!.getReader(); + const decoder = new TextDecoder(); + let buf = ""; + while (!buf.includes(`Started job ${FAKE_JOB_ID}`)) { + const { value, done } = await reader.read(); + if (done) break; + buf += decoder.decode(value, { stream: true }); + } + // Emit an HMR mismatch: server's dispatch SIGTERMs the + // bin and sets `earlyStopRequested = true` on the entry. + // The bin's SIGTERM no-op keeps it alive so the manual + // cancel below has a target. 
+ fake.emit({ + type: "ready", + outFile: "/tmp/x.mjs", + hash: "abc", + configHash: "h2", // mismatch with spawn-time "h1" + trainerName: "t", + }); + // Let the dispatch run + signal land. + await new Promise((r) => setTimeout(r, 80)); + + // Manual cancel: old code would have early-returned; new + // code POSTs cloud cancel + SIGKILLs. + await reader.cancel(); + await new Promise((r) => setTimeout(r, 250)); + + // Cloud cancel POST landed for the right job. + expect(cancelHits).toHaveLength(1); + expect(cancelHits[0]?.url).toContain(`/v1/jobs/${FAKE_JOB_ID}/cancel`); + // And the bin is dead: SIGKILL bypassed its SIGTERM + // no-op (which had been masking HMR's earlier SIGTERM). + let probeError: NodeJS.ErrnoException | null = null; + try { + process.kill(pid, 0); + } catch (e) { + probeError = e as NodeJS.ErrnoException; + } + expect(probeError?.code).toBe("ESRCH"); + } finally { + globalThis.fetch = ORIG_FETCH; + } + }); + + it("dispatches HMR signals for `ready` events too (not only `rebuild`)", async () => { + // Regression: previously the dispatch fired only on + // `rebuild`, so a child started via `/api/train` *before* + // the watcher's first successful BUNDLE_END (the very first + // success is broadcast as `ready`, and the entry-wait recovery + // path also emits `ready`) would never get SIGUSR2/SIGTERM- + // routed when that build eventually landed, leaving it + // running a stale or empty artifact. Exercise the contract + // here by spawning a hanging child, then emitting `ready` + // with a different `configHash`; dispatch should pick up the + // mismatch and surface restart targets in the SSE frame. + await writeCredentials(ANON_CREDS); + const hangingBin = join(trainCwd, "hanging-bin.mjs"); + // setInterval keeps the event loop alive without trapping + // SIGTERM, so dispatch's kill returns the child to the OS. 
+ writeFileSync(hangingBin, "setInterval(() => {}, 1000);\n"); + + const fake = fakeHmr("h1"); + const app = buildStudioApp({ + baseUrl: "http://mock", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + binPath: hangingBin, + hmr: fake.coordinator, + }); + + const trainRes = await app.request("/api/train", { + method: "POST", + headers: { + host: "127.0.0.1:4000", + "x-arkor-studio-token": STUDIO_TOKEN, + "content-type": "application/json", + }, + body: JSON.stringify({}), + }); + expect(trainRes.status).toBe(200); + const pid = Number(trainRes.headers.get("x-arkor-train-pid")); + + const sseRes = await app.request( + `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`, + { headers: { host: "127.0.0.1:4000" } }, + ); + const reader = sseRes.body!.getReader(); + const decoder = new TextDecoder(); + + try { + // `configHash` = "h2" mismatches the spawn-time "h1" → SIGTERM + // path → `restartTargets` should be non-empty in the SSE frame. + fake.emit({ + type: "ready", + outFile: "/tmp/x.mjs", + hash: "abc", + configHash: "h2", + trainerName: "t", + }); + + let received = ""; + while (!received.includes("\n\n")) { + const { value, done } = await reader.read(); + if (done) break; + received += decoder.decode(value, { stream: true }); + } + expect(received).toContain("event: ready"); + // The dispatch augmentation marker: would be absent if the + // `event.type !== "error"` filter regressed back to gating on + // `=== "rebuild"`, and `restart`/`restartTargets` would never + // appear on a `ready` frame. + expect(received).toContain('"restart":true'); + expect(received).toContain(`"pid":${pid}`); + } finally { + await reader.cancel(); + // Best-effort cleanup if dispatch's SIGTERM hasn't reaped + // the child yet (signal delivery is async in the kernel). 
+ try { + process.kill(pid, "SIGKILL"); + } catch { + // already gone + } + } + }); + + it("forwards rebuild events as SSE frames", async () => { + const fake = fakeHmr(); + const app = buildStudioApp({ + baseUrl: "http://mock", + assetsDir, + autoAnonymous: false, + studioToken: STUDIO_TOKEN, + cwd: trainCwd, + hmr: fake.coordinator, + }); + const res = await app.request( + `/api/dev/events?studioToken=${encodeURIComponent(STUDIO_TOKEN)}`, + { headers: { host: "127.0.0.1:4000" } }, + ); + const reader = res.body!.getReader(); + const decoder = new TextDecoder(); + + fake.emit({ type: "ready", outFile: "/tmp/x", hash: "abc" }); + // Read chunks until we have at least one full SSE frame. + let received = ""; + while (!received.includes("\n\n")) { + const { value, done } = await reader.read(); + if (done) break; + received += decoder.decode(value, { stream: true }); + } + expect(received).toContain("event: ready"); + expect(received).toContain('"outFile":"/tmp/x"'); + await reader.cancel(); + }); + }); + describe("/api/deployments", () => { const ORIG_FETCH = globalThis.fetch; diff --git a/packages/arkor/src/studio/server.ts b/packages/arkor/src/studio/server.ts index 95deff08..e5db1d9f 100644 --- a/packages/arkor/src/studio/server.ts +++ b/packages/arkor/src/studio/server.ts @@ -1,6 +1,7 @@ -import { spawn } from "node:child_process"; +import { spawn, type ChildProcessByStdio } from "node:child_process"; +import type { Readable, Writable } from "node:stream"; import { readFile, realpath } from "node:fs/promises"; -import { timingSafeEqual } from "node:crypto"; +import { randomBytes, timingSafeEqual } from "node:crypto"; import { Hono } from "hono"; import { createClient } from "@arkor/cloud-api-client"; import { CloudApiClient, CloudApiError } from "../core/client"; @@ -22,7 +23,42 @@ import { createDeploymentRequestSchema, } from "../core/schemas"; import { readState } from "../core/state"; +import { resolveBuildEntry } from "../core/rolldownConfig"; import { 
readManifestSummary } from "./manifest"; +import type { HmrCoordinator, HmrEvent } from "./hmr"; +import { TrainRegistry, type RestartTarget } from "./trainRegistry"; + +/** Identify the spawned subprocess to the SPA without exposing it as + * a body frame (which would interleave with trainer stdout). The SPA + * reads this off `Response.headers` and uses it to scope HMR + * `restart` events to the run *this* tab actually started. */ +const TRAIN_PID_HEADER = "x-arkor-train-pid"; +/** + * Build the strict full-line match for the runner's `[arkor:] Started job ` line. + * + * `core/runner.ts` prefixes that text with the per-spawn nonce we + * inject via `ARKOR_JOB_ID_MARKER_NONCE`; without the prefix, a + * user `console.log("Started job ")` from inside + * `trainer.start()` / `onCheckpoint` / etc. could land in stdout + * *before* the runner's real line and we'd record the wrong id, so + * Stop-training would then POST `/v1/jobs/:attacker-id/cancel` + * against a job the attacker chose. Anchoring on a 32-hex nonce + * known only to the server + runner (the env var is deleted by + * runner.ts BEFORE the user module is dynamically imported, so the + * user can't read it) closes that hole. + * + * Pattern is per-spawn because the nonce is per-spawn. + * + * Anchors `^…$` and `(\S+)` job-id capture mirror the runner's + * exact write shape (cloud-api job ids never contain whitespace), + * so a chatty bin that wraps the line in other content cannot + * collide either. + */ +function buildStartedJobPattern(nonce: string): RegExp { + // Nonce is a 32-char hex string from `randomBytes(16).toString("hex")`, + // i.e. only `[0-9a-f]` (safe to interpolate into the regex literal). 
+ return new RegExp(`^\\[arkor:${nonce}\\] Started job (\\S+)$`); +} const DEPRECATION_HEADERS = ["Deprecation", "Sunset", "Warning"] as const; function copyDeprecationHeaders(from: Headers, to: Headers): void { @@ -66,6 +102,15 @@ export interface StudioServerOptions { * here points at the bin itself). Override in tests. */ binPath?: string; + /** + * Optional HMR coordinator. When provided, the server registers + * `/api/dev/events` as an SSE stream that pushes rebuild / error events to + * the SPA, and successful builds also signal active `/api/train` children: + * SIGUSR2 (hot-swap) on a `configHash` match, else SIGTERM so they early-stop + * at the next checkpoint and the SPA can restart them with the new bundle. + * Wired in by `arkor dev`; left undefined for non-dev `buildStudioApp` consumers. + */ + hmr?: HmrCoordinator; } function tokensMatch(provided: string, expected: string): boolean { @@ -89,11 +134,31 @@ function htmlAttrEscape(s: string): string { ); } -function injectStudioToken(html: string, token: string): string { - const meta = ``; +/** + * Inject the per-launch studio token (always) and an optional HMR + * feature flag into ``. Both are read by the SPA via + * `` lookups: the token gates `/api/*` requests and + * the HMR flag tells `RunTraining` whether to open + * `/api/dev/events` (which only exists when `arkor dev` wired in an + * HMR coordinator). Without the server-side flag the SPA can't tell + * dev-mode usage from prod-mode usage at runtime: `vite build`'s + * output ships with `import.meta.env.DEV === false`, so any DEV gate + * baked into the bundle would suppress HMR even in real `arkor dev` + * sessions. + */ +function injectStudioMeta( + html: string, + token: string, + hmrEnabled: boolean, +): string { + const tokenTag = ``; + const hmrTag = hmrEnabled + ? 
`` + : ""; + const tags = `${tokenTag}${hmrTag}`; const idx = html.indexOf(""); - if (idx === -1) return `${meta}${html}`; - return `${html.slice(0, idx)}${meta}${html.slice(idx)}`; + if (idx === -1) return `${tags}${html}`; + return `${html.slice(0, idx)}${tags}${html.slice(idx)}`; } export function buildStudioApp(options: StudioServerOptions) { @@ -105,7 +170,7 @@ export function buildStudioApp(options: StudioServerOptions) { // `studio/server.ts` is bundled into `dist/bin.mjs` (it isn't reachable // from `src/index.ts`, so tsdown doesn't extract it as a shared chunk). // The bin therefore sits *next* to this code at runtime, not one - // directory up — `../bin.mjs` would resolve to the package root. + // directory up: `../bin.mjs` would resolve to the package root. const trainBinPath = options.binPath ?? fileURLToPath(new URL("./bin.mjs", import.meta.url)); @@ -118,7 +183,12 @@ export function buildStudioApp(options: StudioServerOptions) { const app = new Hono(); const loopbackHostPattern = /^(127\.0\.0\.1|localhost)(:\d+)?$/; - const jobEventsPathPattern = /^\/api\/jobs\/[^/]+\/events$/; + // Routes where `?studioToken=` is accepted instead of the + // `X-Arkor-Studio-Token` header. Used only for `EventSource` streams, + // which cannot send custom headers. Adding to this list is CSRF-sensitive: + // it must always be a GET stream-only route, never a mutation endpoint. + const eventStreamPathPattern = + /^\/api\/jobs\/[^/]+\/events$|^\/api\/dev\/events$/; // Host-header guard for every route, including static HTML that carries the // per-launch Studio token. This is the DNS-rebinding boundary: a victim @@ -138,14 +208,14 @@ export function buildStudioApp(options: StudioServerOptions) { // 1. Per-launch token. CORS is intentionally not configured: the SPA // is same-origin so CORS adds no value, and reflecting `*` would let // "simple" cross-origin POSTs (text/plain, urlencoded) skip preflight - // and reach the handler. 
The token check rejects those — an attacker + // and reach the handler. The token check rejects those: an attacker // page can't read the SPA's from another origin. // 2. `?studioToken=` is accepted only on the job-event stream route // because `EventSource` cannot send custom headers. Mutation routes // require the header so a leaked token in a URL is not enough to POST. app.use("/api/*", async (c, next) => { const queryTokenAllowed = - c.req.method === "GET" && jobEventsPathPattern.test(c.req.path); + c.req.method === "GET" && eventStreamPathPattern.test(c.req.path); const provided = c.req.header("x-arkor-studio-token") ?? (queryTokenAllowed ? c.req.query("studioToken") : undefined) ?? @@ -268,9 +338,39 @@ export function buildStudioApp(options: StudioServerOptions) { return new Response(body, { status: res.status, headers }); }); + // Pre-resolved outFile for the HMR fast path. The path is + // deterministic per cwd (defaults from `BUILD_DEFAULTS`), so we + // compute it once at app build time rather than on every request. + // Only used when HMR is enabled; `readManifestSummary` falls + // back to `runBuild()` when this is undefined or the file doesn't + // exist yet (fresh scaffold pre-watcher-bootstrap). + const hmrOutFile = options.hmr + ? resolveBuildEntry({ cwd: trainCwd }).outFile + : undefined; app.get("/api/manifest", async (c) => { try { - const manifest = await readManifestSummary(trainCwd); + // Surface watcher build errors directly. Without this gate the + // HMR fast path below would happily serve the LAST GOOD + // artefact even when the user's current source fails to + // compile: `RunTraining` polls `/api/manifest` every ~5 s, so + // the next poll after a compile error would 200 with stale + // data and silently overwrite the SSE-surfaced error UI. + // Users would then see a "healthy" trainer in the manifest + // and unknowingly run stale code/config while the latest + // edit is still broken. 
Rejecting with the SSE error message + // keeps the SPA's error state consistent across both + // channels (poll + SSE). + if (options.hmr?.getLastEventType() === "error") { + return c.json({ error: "Build failed; see HMR error frame" }, 400); + } + // HMR-aware fast path: when `arkor dev` wired in a coordinator, + // skip the per-request `runBuild()` and read the watcher's + // already-built artefact. Without this every SPA poll + // (~5 s + per-rebuild SSE refetch) would re-bundle and race + // the watcher writing to the same `.arkor/build/index.mjs`. + const manifest = await readManifestSummary(trainCwd, { + prebuiltOutFile: hmrOutFile, + }); return c.json(manifest); } catch (err) { // The user's `src/arkor/index.ts` may not exist yet (fresh scaffold) or @@ -335,11 +435,15 @@ export function buildStudioApp(options: StudioServerOptions) { return new Response(upstream.body, { status: upstream.status, headers }); }); + // Active `/api/train` subprocesses. The registry encapsulates the + // signal-dispatch policy (see `studio/trainRegistry.ts`). + const activeTrains = new TrainRegistry(); + app.post("/api/train", async (c) => { const body = (await c.req.json().catch(() => ({}))) as { file?: string }; let trainFile: string | undefined; if (body.file) { - // Resolve symlinks before the containment check — `path.resolve` is purely + // Resolve symlinks before the containment check: `path.resolve` is purely // lexical, so a symlink under the project directory pointing at e.g. // `/etc/passwd` would otherwise pass `startsWith(baseAbs + sep)`. The // bin spawned below would then dlopen the link's target. @@ -362,32 +466,621 @@ export function buildStudioApp(options: StudioServerOptions) { } trainFile = abs; } + // Snapshot the current `configHash` so HMR routing on the *next* + // rebuild can compare against this child's spawn-time config. 
+ // + // When HMR is enabled, read it synchronously from the coordinator + // (which already maintains `lastEvent.configHash` for its watcher). + // Reading from the cache avoids triggering an extra `runBuild()` + // per train request: the previous implementation called + // `readManifestSummary(trainCwd)` here, which both wasted CPU and + // raced the watcher writing the same `.arkor/build/index.mjs`. + // + // When HMR is disabled the field is irrelevant (no rebuilds will + // happen) so we leave it null without paying for a build. + const configHash: string | null = options.hmr + ? options.hmr.getCurrentConfigHash() + : null; + // Spawn-time CONTENT-hash of the on-disk build artefact. Only + // the pre-ready-spawn case in `dispatchRebuild` consults it: + // when a rebuild lands while the child's `configHash` is still + // null, backfilling the new hash is only safe if the artefact + // bytes the child loaded (= the bytes on disk *now*, at spawn) + // are the same bytes the new hash describes. Without this + // gate, an edit landing between spawn and the watcher's first + // BUNDLE_END would silently align the registry with a config + // the child never actually loaded → cloud-side `JobConfig` + // drift on subsequent same-hash hot-swaps. + // + // Content (sha256) rather than mtime+ctime+size: the + // timestamp version had a false-positive failure mode where a + // watcher rebuild that produced identical bytes still bumped + // mtime/ctime, forcing a spurious cancel+restart cycle on a + // pre-ready spawn even though the child's loaded bytes + // actually matched the new build. Content-hash is precise. + const spawnArtifactContentHash: string | null = options.hmr + ? options.hmr.getCurrentArtifactContentHash() + : null; + // Capture the cloud-api scope NOW (at spawn time) so the cancel + // handler can POST `/v1/jobs/:id/cancel` without re-reading + // `.arkor/state.json` at stop time. 
If the user removed or made + // the state file unreadable mid-training, the stop-time read + // would return null and the cancel POST would silently skip: + // local SIGKILL still tears down the subprocess but the cloud + // run orphans. Pinning the scope on the registry entry when it + // exists decouples cancel correctness from mutable filesystem state. + // + // `spawnScope` may legitimately be `null` on a first-run anonymous + // project: `.arkor/state.json` is created by `ensureProjectState` + // INSIDE the child during `trainer.start()`, i.e. AFTER spawn but + // possibly before the user clicks Stop. The cancel handler treats + // a null registry scope as a signal to fall back to reading + // `.arkor/state.json` at cancel time (the file should exist by + // then because the runner emits its `Started job ` line AFTER + // `trainer.start()` resolved, which is the same point at which + // `ensureProjectState` has finished writing the state file). The + // delete-mid-training hazard the spawn-time capture exists to + // close only applies when the SPAWN read succeeded; once we have + // a non-null capture we never re-read. + const spawnState = await readState(trainCwd); + const spawnScope = spawnState + ? { orgSlug: spawnState.orgSlug, projectSlug: spawnState.projectSlug } + : null; const args = [trainBinPath, "start"]; if (trainFile) args.push(trainFile); - const child = spawn(process.execPath, args, { - stdio: "pipe", - cwd: trainCwd, + // Per-spawn 16-byte nonce passed via env var so the runner can + // prefix its `Started job ` line with `[arkor:] `. The + // server matches that nonce-prefixed shape (see + // `buildStartedJobPattern` for why). 32-hex chars of entropy + // guarantees a user-code spoof attempt can't guess the prefix in + // a single shot, and `core/runner.ts` deletes the env var BEFORE + // dynamically importing the user module so user code can't read + // it via `process.env` either. 
+ const startedJobNonce = randomBytes(16).toString("hex"); + const startedJobPattern = buildStartedJobPattern(startedJobNonce); + // `spawn()` is mostly async (filesystem failures surface as the + // child's `error` event), but Node can still throw synchronously + // for argument-shape problems (e.g. invalid stdio descriptor on + // unusual platforms). Catch both paths so an `/api/train` POST + // can never hang the SPA: sync throws return a clean 500, async + // 'error' events forward into the stream and close it (handled + // inside the ReadableStream `start()` below). + // `ChildProcessByStdio` is the + // specific overload return for `stdio: "pipe"`; narrows + // `child.stdout` / `child.stderr` away from the nullable + // `Readable | null` of the general `ChildProcess` type. + // `ReturnType` would land on the union and force + // a `?.` everywhere downstream. + let child: ChildProcessByStdio; + try { + child = spawn(process.execPath, args, { + stdio: "pipe", + cwd: trainCwd, + env: { + ...process.env, + ARKOR_JOB_ID_MARKER_NONCE: startedJobNonce, + }, + }); + } catch (err: unknown) { + const msg = err instanceof Error ? err.message : String(err); + return c.json( + { error: `Failed to spawn training subprocess: ${msg}` }, + 500, + ); + } + activeTrains.register(child, { + trainFile, + configHash, + spawnArtifactContentHash, + scope: spawnScope, }); - const stream = new ReadableStream({ + // Hoisted out of the `ReadableStream` underlying-source so the + // `start` handler can hand its closure-bound teardown helper to + // the `cancel` handler. `cancel` runs in a separate invocation, + // not through `controller`, so the two need a parent-scope + // rendez-vous variable. + let cancelTeardown: (() => void) | null = null; + // Mirror of the cloud `jobId` parsed out of the runner's + // stdout, accessible to both the `start` (parser writes) and + // `cancel` (post-unregister read) handlers. 
We can't just call + // `activeTrains.getJobId(pid)` from `cancel` because cancel + // unregisters the entry first, so subsequent reads of the + // registry would always be `null` even if the parser races a + // late line in afterwards. This closure variable keeps the id + // observable even after unregister, so the cancel POST poll + // below can pick up a jobId that lands a few ms after Stop. + let parsedJobId: string | null = null; + const stream = new ReadableStream({ start(controller) { + // After `cancel()` runs, calling `controller.enqueue` / + // `controller.close` on the now-closed controller throws + // ("Invalid state: Controller is closed"). The child + // subprocess keeps emitting `data` and ultimately a `close` + // event for some time after the client disconnects, so each + // forwarder needs its own "are we still attached?" guard. + // Track via a flag plus an explicit listener-removal so the + // event loop also stops dispatching once we've torn down. + let closed = false; + // `child.stdout` is in default (binary) mode, so each `data` + // chunk is a Buffer, and `Buffer extends Uint8Array`, so we + // can pass it straight to `controller.enqueue` without a + // round-trip through `TextEncoder`. The previous code did + // `enc.encode(d)` which implicitly coerced the buffer via + // `String()`: same byte content, but allocates a new array. + // Forward a chunk to the SPA stream. Shared between the + // stdout and stderr listeners; both paths surface as + // request body bytes for the SPA's log view. + const forward = (d: Buffer): void => { + if (closed) return; + try { + controller.enqueue(d); + } catch { + // Controller raced us into the closed state; flip the + // flag so subsequent chunks short-circuit. + closed = true; + } + }; + // Carry-over buffer for line-oriented job-id extraction. 
+ // Stream chunk boundaries are arbitrary: the runner's + // single-line `Started job ` write can land split + // across two `data` events, in which case a per-chunk + // regex would never match and the cancel POST chain + // would never fire (cloud-job orphan on Stop). We + // accumulate text until a newline, parse the complete + // line, and keep any trailing partial for the next + // chunk. Cleared the moment the id is recorded so a + // chatty bin doesn't pin memory after the marker has + // landed; capped at 4 KiB regardless to bound a + // misbehaving bin that never emits a newline before the + // marker (the canonical line is well under 100 bytes). + let stdoutLineBuf = ""; + const STARTED_JOB_BUFFER_CAP = 4096; + // STDOUT-ONLY job-id parser. The runner writes the canonical + // `Started job ` line via `process.stdout.write` (never + // stderr), so a single shared buffer across both pipes + // would mis-match in two ways: + // 1. A user `console.error("Started job ")` would + // poison the buffer first; the real stdout marker + // arrives later but our `getJobId(...) === null` gate + // has already short-circuited subsequent scans, so + // Stop-training POSTs cancel for the wrong (or + // non-existent) job. + // 2. Interleaved stderr bytes could land between + // "Started job " and "\n" in the shared buffer, + // breaking the anchored line match → missed match → + // cloud cancel skipped on Stop. + // Two dedicated handlers share `forward` for the byte + // pipeline but only the stdout one runs the parse. + const onStdoutChunk = (d: Buffer): void => { + // Intentionally NOT gated on `closed`: when the SPA cancels, + // `cancelTeardown()` flips `closed = true` so the controller + // path no-ops, but the cancel IIFE then POLLS `parsedJobId` + // for up to 500 ms to catch a `Started job ` line that + // landed just after the user clicked Stop. The parser has to + // keep running during that window for the poll to ever + // observe a value. 
(`forward()` has its own `closed` check + // for the controller-enqueue side, so the response-body path + // stays sealed.) Gate the parse on `parsedJobId === null` + // (not `activeTrains.getJobId(...) === null`): the latter + // returns null forever after `unregister`, which would make + // us re-enter and re-parse the buffer on every subsequent + // chunk during the poll window. + if (parsedJobId === null) { + stdoutLineBuf += d.toString("utf8"); + let nl = stdoutLineBuf.indexOf("\n"); + while (nl !== -1) { + // Strip a possible \r so CRLF-emitting bins (rare for + // Node `process.stdout.write` but defensive) match + // the same anchored pattern. + const line = stdoutLineBuf.slice(0, nl).replace(/\r$/, ""); + stdoutLineBuf = stdoutLineBuf.slice(nl + 1); + const m = startedJobPattern.exec(line); + if (m && m[1]) { + activeTrains.recordJobId(child.pid, m[1]); + // Mirror to the parent-scope closure so the cancel + // handler can pick this up even AFTER it called + // `activeTrains.unregister(...)` (the registry + // read would return null post-unregister). + parsedJobId = m[1]; + stdoutLineBuf = ""; + break; + } + nl = stdoutLineBuf.indexOf("\n"); + } + if (stdoutLineBuf.length > STARTED_JOB_BUFFER_CAP) { + stdoutLineBuf = stdoutLineBuf.slice(-STARTED_JOB_BUFFER_CAP); + } + } + forward(d); + }; + const onStderrChunk = (d: Buffer): void => { + // Forward only; never scan for `Started job`. See + // `onStdoutChunk` comment for the cross-stream poisoning + // hazards this split prevents. + forward(d); + }; const enc = new TextEncoder(); - child.stdout.on("data", (d) => controller.enqueue(enc.encode(d))); - child.stderr.on("data", (d) => controller.enqueue(enc.encode(d))); - child.on("close", (code) => { - controller.enqueue(enc.encode(`\n---\nexit=${code}\n`)); - controller.close(); - }); + // Detach every listener this stream wired onto `child`. 
Called + // from `onClose` / `onError` themselves (so once one fires the + // closure references (controller, TextEncoder) drop and the + // subprocess record can be GC'd promptly even if the other + // event also queues), and from `cancelTeardown` for the + // client-side cancel path. Removing only the `data` listeners + // (as the previous code did) left `close` / `error` attached + // to the dead ChildProcess, which kept their closures pinned + // until the process object itself was reaped: meaningful + // memory pressure for an `arkor dev` session that spawns many + // children over hours. + const detachListeners = (): void => { + child.stdout.off("data", onStdoutChunk); + child.stderr.off("data", onStderrChunk); + child.off("close", onClose); + child.off("error", onError); + }; + const onClose = (code: number | null): void => { + activeTrains.unregister(child.pid); + detachListeners(); + if (closed) return; + closed = true; + try { + controller.enqueue(enc.encode(`\n---\nexit=${code}\n`)); + controller.close(); + } catch { + // already cancelled; nothing more to do. + } + }; + // `error` event fires when async spawn machinery surfaces a + // failure (ENOENT for the executable, EACCES, EAGAIN under + // resource exhaustion, etc.). Without this listener the + // ReadableStream would never close; the SPA would hang + // waiting for output that never arrives. Forward the error + // text into the stream body, close, and unregister the + // child. Node's contract is: if 'error' fires, 'close' may + // or may not follow; both paths are guarded by the `closed` + // flag and the `unregister` call is idempotent. + const onError = (err: Error): void => { + activeTrains.unregister(child.pid); + detachListeners(); + if (closed) return; + closed = true; + try { + controller.enqueue( + enc.encode(`\n---\nerror=${err.message}\n`), + ); + controller.close(); + } catch { + // already cancelled; nothing more to do. 
+ } + }; + child.stdout.on("data", onStdoutChunk); + child.stderr.on("data", onStderrChunk); + child.on("close", onClose); + child.on("error", onError); + cancelTeardown = () => { + // Don't detach data listeners here: the child stays alive + // for some time after the SPA cancels, either because + // we're skipping `child.kill()` for an in-progress + // HMR early-stop, or because `child.kill()`'s SIGTERM + // triggers a graceful checkpoint+exit that takes + // seconds. During that window the child keeps writing + // logs to its stdout/stderr pipes; if our `data` + // listeners are gone, Node stops draining the OS pipe, + // the buffer fills, and the child's next `write()` + // blocks indefinitely, deadlocking the very graceful + // exit we're trying to preserve. The `closed` flag + // already makes `enqueue`/`close` a no-op so the + // controller-closed race stays safe; the eventual + // `onClose` / `onError` listeners detach everything + // (via `detachListeners()`) when the child finally + // exits. That timing (at-exit, not at-cancel) is the + // correct moment to break the closure refs for GC. + closed = true; + }; }, cancel() { - child.kill(); + // The SPA-side cancel is always *user-initiated*: either an + // explicit Stop click or tab-close/navigation, which the + // user just as explicitly chose. HMR-driven SIGTERMs go + // straight from the server to the runner via + // `dispatchRebuild`; they DO NOT trigger this handler + // (the SPA waits for the train stream's `exit=` line and + // schedules auto-restart, never aborting). So manual stop + // takes precedence over any in-flight HMR graceful path: + // we POST cloud cancel + SIGKILL unconditionally. + // + // SIGKILL is uncatchable so the long-standing + // "second-SIGTERM-triggers-exit(143)-fast-path" worry + // (which used to gate this branch on + // `isEarlyStopRequested`) doesn't apply. 
The runner's + // graceful early-stop chain may have been trying to + // preserve a checkpoint, but the user just said no; keep + // the local subprocess teardown snappy and let the + // server-side cancel POST handle the cloud-side release. + // + // Capture the cloud job id + spawn-time scope BEFORE + // unregistering: once the entry is gone, the getters + // return null and the fire-and-forget POST below would + // no-op. + // + // `pid` is captured once here because the closure below + // runs after `unregister` and we want a stable handle. + const cancelPid = child.pid; + // Scope resolution order: + // 1. Registry entry's pinned scope (captured at spawn time). + // Authoritative when non-null: a user who deleted or made + // `.arkor/state.json` unreadable AFTER spawn shouldn't be + // able to silently orphan their cloud job by losing the + // cancel-time read. + // 2. Cancel-time re-read of `.arkor/state.json`, ONLY when + // the spawn-time capture was null. This handles the + // first-run anon case where `ensureProjectState` writes + // the state file from inside the child during + // `trainer.start()` (i.e. AFTER spawn). The read happens + // inside the fire-and-forget IIFE below so the cancel + // handler stays sync. + const pinnedScope = activeTrains.getScope(cancelPid); + activeTrains.unregister(cancelPid); + cancelTeardown?.(); + // Fire-and-forget cloud-side cancel so the cloud job is + // released even though the SIGKILL below bypasses the + // runner's `installShutdownHandlers` (which would + // otherwise issue cancel itself via the graceful + // early-stop chain). The IIFE polls for the jobId + // *briefly* before giving up: there's a real race + // window where the user clicks Stop after the cloud + // job has been created but before the runner's + // `Started job ` line has been parsed (cloud + // createJob roundtrip is ~50-200ms; UI clicks can land + // sub-100ms into that window). 
Polling closes the most + // common case; beyond ~500 ms we accept the cloud-side + // orphan as a follow-up (the cloud reaper / TTL is the + // safety net, and the alternative of querying cloud-api + // for matching jobs at cancel time is brittle in + // multi-tab/multi-spawn scenarios). + void (async () => { + // Brief poll on `parsedJobId` (the closure mirror, + // see top-of-handler for why it can't be the + // registry's `getJobId`): the runner's + // `Started job ` line may not have been parsed by + // the time the user clicked Stop. Most runs hit it + // within ~50-200 ms of spawn (cloud createJob + // roundtrip), so polling for up to ~500 ms catches + // nearly all races. Beyond that we accept the + // cloud-side orphan as a documented follow-up: cloud + // reaper / TTL is the safety net, and the + // alternative (querying cloud-api for matching jobs + // at cancel time) is brittle for multi-tab / + // multi-spawn cases. + if (parsedJobId === null) { + const start = Date.now(); + while (parsedJobId === null && Date.now() - start < 500) { + await new Promise((r) => setTimeout(r, 25)); + } + } + if (parsedJobId === null) return; + // Resolve the cloud scope: prefer the spawn-time + // capture (immutable, snapshot at spawn) and fall back + // to reading `.arkor/state.json` only when there was + // none. The state file usually exists by now: the + // runner doesn't print `Started job ` until + // `trainer.start()` resolves, and `ensureProjectState` + // (which writes the file from inside the child for + // first-run anon projects) runs as part of that path. 
+ let scopeForCancel = pinnedScope; + if (!scopeForCancel) { + try { + const late = await readState(trainCwd); + if (late) { + scopeForCancel = { + orgSlug: late.orgSlug, + projectSlug: late.projectSlug, + }; + } + } catch { + // best-effort + } + } + if (!scopeForCancel) return; + try { + // `createRpc` now needs (baseUrl, token) explicitly; main's + // refactor moved off the closure-based getter so the per- + // request credentials read happens once here rather than + // twice via the SDK's lazy token callback. + const { baseUrl: rpcBaseUrl, token: rpcToken } = + await resolveCredentialsAndBaseUrl(); + const rpc = createRpc(rpcBaseUrl, rpcToken); + await rpc.v1.jobs[":id"].cancel.$post({ + param: { id: parsedJobId }, + query: { + orgSlug: scopeForCancel.orgSlug, + projectSlug: scopeForCancel.projectSlug, + }, + }); + } catch { + // Best-effort: cloud-api transient failure or scope + // drift. Cloud reaper / TTL is the safety net. + } + })(); + // SIGKILL (not the default SIGTERM) for user-initiated + // aborts. The runner's `installShutdownHandlers` now treats + // a single SIGTERM as the HMR-driven "graceful early-stop" + // signal: wait for the next checkpoint (up to ~5 min + // timeout) before exiting. That semantics is right for the + // HMR path but wrong for a Stop-training click: the user + // wants the run STOPPED, not left running in the background + // for minutes consuming GPU/cloud spend while the UI has + // already settled to idle. SIGKILL is uncatchable so the + // child dies immediately, eliminating the + // unregister-before-graceful-exit window where a fast new + // run could overlap an old one untracked by HMR routing. + // + // The cloud-side job is released by the fire-and-forget + // POST above (we recorded the runner's `Started job ` + // line on the registry; the IIFE looks it up here). 
SIGKILL + // alone would have left the cloud job orphaned until + // TTL/reaper because the runner can't POST cancel itself + // when the kernel reaps it without warning. Together, + // server-side cancel POST + SIGKILL give snappy local + // teardown AND eventual cloud-side release. + // + // `ChildProcess.kill()` normally returns `false` rather than + // throwing when the process has already exited, but it can still + // throw in rarer cases (e.g. a signal the platform does not + // support). A throw here would surface as an unhandled exception + // in the request pipeline. Swallow it defensively; this handler + // has already unregistered the entry above. + try { + child.kill("SIGKILL"); + } catch { + // already gone; nothing to clean up. + } }, }); - return new Response(stream, { - status: 200, - headers: { "content-type": "text/plain; charset=utf-8" }, - }); + // Expose the spawned pid via a response header so the SPA can + // tell its own child apart from other tabs' children when + // `/api/dev/events` broadcasts `restartTargets` / `hotSwapTargets`. + // Without this, a passive tab whose run was hot-swapped could + // misread a sibling tab's restart event as its own. + // + // Header is OMITTED entirely (rather than sent as an empty + // string) when `child.pid` isn't a number; that case happens + // when the OS hasn't assigned a pid by the time `spawn()` + // returns and the child's async `error` event will fire shortly + // (per-Node-docs `subprocess.pid` is `undefined` for + // failed-spawn children). "Header absent" is the unambiguous + // signal the SPA can read; an empty string would force callers + // to special-case `""` vs missing for the same condition. The + // SPA's `raw ? Number.parseInt(raw, 10) : NaN` handler treats + // both cases identically, but absent-only is the cleaner wire + // contract. 
+ const headers: Record = { + "content-type": "text/plain; charset=utf-8", + }; + if (typeof child.pid === "number") { + headers[TRAIN_PID_HEADER] = String(child.pid); + } + return new Response(stream, { status: 200, headers }); }); + // `/api/dev/events`: SSE stream of HMR rebuild / error notifications. + // Only active when `arkor dev` passed an HMR coordinator. The CSRF model + // accepts `?studioToken=` here (whitelisted in `eventStreamPathPattern`) + // because `EventSource` cannot send headers. When HMR is not configured + // the route still has an explicit 404 so the request doesn't fall through + // to the SPA index.html (which would mislead the SPA into thinking the + // EventSource connected successfully). + if (!options.hmr) { + app.get("/api/dev/events", (c) => + c.json({ error: "HMR not enabled" }, 404), + ); + } + if (options.hmr) { + const hmr = options.hmr; + /** Augmented event = raw HMR event + the per-child signal results we + * computed for it. We compute these once per rebuild (not once per + * connected SSE client) so opening multiple Studio tabs doesn't fan + * out into N × SIGTERM / N × SIGUSR2 to each child. */ + type AugmentedEvent = HmrEvent & { + restart?: boolean; + hotSwap?: boolean; + restartTargets?: RestartTarget[]; + hotSwapTargets?: RestartTarget[]; + }; + const sseListeners = new Set<(event: AugmentedEvent) => void>(); + let lastAugmented: AugmentedEvent | null = null; + + // Single subscription against the HMR coordinator: this handler does + // signal dispatch + augmentation exactly once per rebuild, then fans + // the augmented payload out to every connected SSE client. Late- + // mounting clients receive `lastAugmented` instead of triggering a + // fresh signal pass against the same rebuild. + hmr.subscribe((event) => { + let augmented: AugmentedEvent = event; + // Route dispatch through every *successful* build event, not + // just `rebuild`. 
The coordinator emits the very first + // successful compile as `ready` (and the entry-wait recovery + // path also broadcasts `ready` when a fresh-scaffold project's + // entry file first appears). A child started via `/api/train` + // before the first `ready` (e.g. the SPA fired Run Training + // immediately after `arkor dev` booted, while the watcher's + // initial BUNDLE_END was still in flight) would otherwise + // never get SIGUSR2/SIGTERM-routed when that build lands, + // leaving it stuck on a stale or empty artifact until the + // next edit triggers a `rebuild`. Filtering by "not error" + // is forward-compatible with any new successful event types. + if (event.type !== "error" && activeTrains.size > 0) { + // Single per-child decision pass: hash match → SIGUSR2 (with + // a Windows fallback to SIGTERM since win32 doesn't deliver + // SIGUSR2), hash mismatch → SIGTERM. The registry returns + // both buckets so the SPA can react per-child rather than + // assuming one global outcome. + const nextHash = event.configHash ?? null; + // Content-hash for the pre-ready-spawn equality gate (the + // timestamp `event.hash` would over-trigger SIGTERM-restart + // on identical-bytes rebuilds). Both sides of the + // comparison (`entry.spawnArtifactContentHash` captured + // via `getCurrentArtifactContentHash()`, and this + // `event.contentHash`) are derived the same way, so a + // match means the child's loaded bytes ARE what the new + // configHash describes. + const nextArtifactContentHash = event.contentHash ?? 
null; + const { hotSwapTargets, restartTargets } = activeTrains.dispatchRebuild( + nextHash, + nextArtifactContentHash, + ); + augmented = { + ...event, + hotSwap: hotSwapTargets.length > 0, + hotSwapTargets, + restart: restartTargets.length > 0, + restartTargets, + }; + } + lastAugmented = augmented; + for (const fn of sseListeners) { + try { + fn(augmented); + } catch { + // listener controller closed mid-write; the cancel hook + // below takes care of removing it from the set. + } + } + }); + + app.get("/api/dev/events", () => { + const enc = new TextEncoder(); + let listener: ((event: AugmentedEvent) => void) | null = null; + const stream = new ReadableStream({ + start(controller) { + const send = (event: AugmentedEvent): void => { + const payload = JSON.stringify(event); + try { + controller.enqueue( + enc.encode(`event: ${event.type}\ndata: ${payload}\n\n`), + ); + } catch { + // controller closed mid-write; cancel() removes us. + } + }; + if (lastAugmented) send(lastAugmented); + listener = send; + sseListeners.add(send); + }, + cancel() { + if (listener) sseListeners.delete(listener); + listener = null; + }, + }); + return new Response(stream, { + status: 200, + headers: { + "content-type": "text/event-stream", + "cache-control": "no-cache, no-transform", + }, + }); + }); + } + // Playground hits this so mid-training inference from Studio has the same // auth path as the rest of /api/*. State is auto-bootstrapped (anon only) // so the Playground's base-model mode works on a fresh anonymous launch @@ -407,7 +1100,7 @@ export function buildStudioApp(options: StudioServerOptions) { state = await ensureProjectState({ cwd: trainCwd, client, credentials }); } catch (err) { // Propagate cloud-api's status verbatim (e.g. 401 / 403 / 5xx) so the - // SPA / clients can react appropriately — collapsing everything to 400 + // SPA / clients can react appropriately; collapsing everything to 400 // would mis-report upstream outages and auth failures. 
Anything else // (local writeState failures, missing-credentials guard) is treated as // a server-side error. @@ -897,7 +1590,11 @@ export function buildStudioApp(options: StudioServerOptions) { const file = await readFile(join(assetsDir, cleaned)); const ext = cleaned.slice(cleaned.lastIndexOf(".") + 1); if (ext === "html") { - const html = injectStudioToken(file.toString("utf8"), studioToken); + const html = injectStudioMeta( + file.toString("utf8"), + studioToken, + Boolean(options.hmr), + ); return new Response(html, { status: 200, headers: { "content-type": CONTENT_TYPES.html! }, diff --git a/packages/arkor/src/studio/trainRegistry.test.ts b/packages/arkor/src/studio/trainRegistry.test.ts new file mode 100644 index 00000000..1278f7f0 --- /dev/null +++ b/packages/arkor/src/studio/trainRegistry.test.ts @@ -0,0 +1,411 @@ +import { describe, it, expect, vi } from "vitest"; +import type { ChildProcess } from "node:child_process"; +import { TrainRegistry } from "./trainRegistry"; + +interface FakeChild { + pid: number; + kill: ReturnType; +} + +function fakeChild(pid: number): FakeChild { + // Default: `kill(sig)` returns `true`, mirroring Node's contract for + // a successful signal delivery to a still-running process. 
+ return { pid, kill: vi.fn(() => true) }; +} + +describe("TrainRegistry", () => { + it("ignores children without a pid (already-exited spawns)", () => { + const reg = new TrainRegistry(); + reg.register({ pid: undefined } as unknown as ChildProcess, { + configHash: "h1", + }); + expect(reg.size).toBe(0); + }); + + it("dispatchRebuild SIGUSR2s only matching configHashes", () => { + const reg = new TrainRegistry(); + const a = fakeChild(101); + const b = fakeChild(102); + const c = fakeChild(103); + reg.register(a as unknown as ChildProcess, { configHash: "match" }); + reg.register(b as unknown as ChildProcess, { + configHash: "different", + trainFile: "/tmp/b.ts", + }); + reg.register(c as unknown as ChildProcess, { configHash: "match" }); + + const result = reg.dispatchRebuild("match"); + expect(result.hotSwapTargets).toEqual([ + { pid: 101, trainFile: undefined }, + { pid: 103, trainFile: undefined }, + ]); + expect(result.restartTargets).toEqual([ + { pid: 102, trainFile: "/tmp/b.ts" }, + ]); + expect(a.kill).toHaveBeenCalledWith("SIGUSR2"); + expect(c.kill).toHaveBeenCalledWith("SIGUSR2"); + expect(b.kill).toHaveBeenCalledWith("SIGTERM"); + }); + + it("dispatchRebuild SIGTERMs everything when nextConfigHash is null", () => { + // null nextHash means "we couldn't inspect the new bundle": be + // conservative and SIGTERM every active child since we can't + // prove their configs are unaffected. 
+ const reg = new TrainRegistry(); + const a = fakeChild(201); + const b = fakeChild(202); + reg.register(a as unknown as ChildProcess, { configHash: "h" }); + reg.register(b as unknown as ChildProcess, { configHash: null }); + + const result = reg.dispatchRebuild(null); + expect(result.hotSwapTargets).toEqual([]); + expect(result.restartTargets).toHaveLength(2); + expect(a.kill).toHaveBeenCalledWith("SIGTERM"); + expect(b.kill).toHaveBeenCalledWith("SIGTERM"); + }); + + it("dispatchRebuild backfills the hash and skips dispatch when the spawn-time artefact matches the new build", () => { + // Pre-ready spawn (configHash: null) is the "user clicked Run + // before the watcher's first BUNDLE_END" case. Whether it's + // safe to backfill the new hash as the child's baseline depends + // on whether the on-disk artefact has changed between spawn + // and now: if `spawnArtifactContentHash === nextArtifactContentHash`, the + // child read exactly the bytes the new hash describes → + // backfill + skip dispatch (no spurious cancel+restart cycle). + // Otherwise (see the next test) SIGTERM-restart so cloud + // and child stay aligned. + const reg = new TrainRegistry(); + const c = fakeChild(401); + reg.register(c as unknown as ChildProcess, { + configHash: null, + trainFile: "/tmp/preready.ts", + spawnArtifactContentHash: "art-v1", + }); + const result = reg.dispatchRebuild("first-real-hash", "art-v1"); + // Neither bucket: no signal sent, nothing for the SPA to react to. + expect(result.hotSwapTargets).toEqual([]); + expect(result.restartTargets).toEqual([]); + expect(c.kill).not.toHaveBeenCalled(); + // A subsequent dispatch with the SAME config hash must take the + // hot-swap path (proves the backfill landed; without it this + // would STILL be null vs "first-real-hash" → SIGTERM). 
+ const second = reg.dispatchRebuild("first-real-hash", "art-v2"); + expect(second.hotSwapTargets).toEqual([ + { pid: 401, trainFile: "/tmp/preready.ts" }, + ]); + expect(second.restartTargets).toEqual([]); + expect(c.kill).toHaveBeenCalledWith("SIGUSR2"); + // And a different config hash on a later rebuild now correctly + // routes to SIGTERM-restart (backfilled hash is real). + c.kill.mockClear(); + const third = reg.dispatchRebuild("second-hash", "art-v3"); + expect(third.restartTargets).toEqual([ + { pid: 401, trainFile: "/tmp/preready.ts" }, + ]); + expect(c.kill).toHaveBeenCalledWith("SIGTERM"); + }); + + it("dispatchRebuild SIGTERM-restarts a pre-ready spawn when the artefact has changed since spawn", () => { + // Codex P2 regression: an edit landing between spawn and the + // watcher's first BUNDLE_END means the bytes the child loaded + // differ from what the new `configHash` describes. Backfilling + // unconditionally would silently teach the registry to use the + // post-edit hash as the child's baseline; later same-hash + // rebuilds would then hot-swap callbacks into a child whose + // cloud-side `JobConfig` was actually spawned against an older + // version, leaving the cloud run on a stale config. The artefact + // fingerprint mismatch (`art-stale` vs `art-fresh`) is the + // signal that the child loaded older bytes; SIGTERM-restart + // forces a clean re-spawn against the freshly-built artefact. + const reg = new TrainRegistry(); + const c = fakeChild(411); + reg.register(c as unknown as ChildProcess, { + configHash: null, + trainFile: "/tmp/preready-stale.ts", + spawnArtifactContentHash: "art-stale", + }); + const result = reg.dispatchRebuild("real-hash", "art-fresh"); + // SIGTERM-restart: the child's bytes are stale relative to the + // new build. 
Hot-swap would be unsafe (config drift); skip + // would leave the child running with no future correction + // path (the registry would treat "real-hash" as the baseline + // even though the child never loaded that build). + expect(result.hotSwapTargets).toEqual([]); + expect(result.restartTargets).toEqual([ + { pid: 411, trainFile: "/tmp/preready-stale.ts" }, + ]); + expect(c.kill).toHaveBeenCalledWith("SIGTERM"); + }); + + it("dispatchRebuild SIGTERM-restarts a pre-ready spawn when no artefact existed at spawn time", () => { + // Companion to the "artefact has changed" test: a fresh project + // never built before spawn means `coordinator.getCurrentArtifactContentHash()` + // returned `null`. The child's `await import` likely failed; we + // can't prove its config matches anything. Conservative + // SIGTERM-restart so the SPA re-spawns once the new bundle is + // on disk. + const reg = new TrainRegistry(); + const c = fakeChild(421); + reg.register(c as unknown as ChildProcess, { + configHash: null, + trainFile: "/tmp/preready-fresh.ts", + spawnArtifactContentHash: null, // no artefact when /api/train fired + }); + const result = reg.dispatchRebuild("first-real-hash", "art-fresh"); + expect(result.hotSwapTargets).toEqual([]); + expect(result.restartTargets).toEqual([ + { pid: 421, trainFile: "/tmp/preready-fresh.ts" }, + ]); + expect(c.kill).toHaveBeenCalledWith("SIGTERM"); + }); + + it("isEarlyStopRequested reflects the dispatchRebuild SIGTERM flag", () => { + // Regression: `/api/train`'s ReadableStream `cancel()` consults + // this flag to avoid sending a *second* SIGTERM to a child that + // HMR's `dispatchRebuild` already SIGTERMed for early-stop. A + // double-SIGTERM hits `installShutdownHandlers`' emergency + // `exit(143)` fast-path, bypassing the checkpoint-preserving + // cancel flow and potentially leaving the cloud run alive.
+ const reg = new TrainRegistry(); + const a = fakeChild(901); + reg.register(a as unknown as ChildProcess, { + configHash: "h1", + trainFile: "/tmp/a.ts", + }); + expect(reg.isEarlyStopRequested(901)).toBe(false); + // Mismatched hash → SIGTERM → flag flips on. + reg.dispatchRebuild("h2"); + expect(reg.isEarlyStopRequested(901)).toBe(true); + // Defensive cases: non-numeric / unknown / never-registered pid. + expect(reg.isEarlyStopRequested(undefined)).toBe(false); + expect(reg.isEarlyStopRequested(99999)).toBe(false); + // Once the child unregisters (close handler) the flag effectively + // resets: subsequent queries return false rather than retaining + // stale state. + reg.unregister(901); + expect(reg.isEarlyStopRequested(901)).toBe(false); + }); + + it("unregister removes the child from the policy decisions", () => { + const reg = new TrainRegistry(); + const a = fakeChild(401); + reg.register(a as unknown as ChildProcess, { configHash: "h" }); + reg.unregister(401); + expect(reg.size).toBe(0); + const result = reg.dispatchRebuild("h"); + expect(result.hotSwapTargets).toEqual([]); + expect(result.restartTargets).toEqual([]); + }); + + it("survives kill() throwing (child exited mid-iteration)", () => { + const reg = new TrainRegistry(); + const a = fakeChild(501); + a.kill.mockImplementation(() => { + throw new Error("ESRCH"); + }); + reg.register(a as unknown as ChildProcess, { configHash: "h" }); + // Both the hot-swap branch (matching hash) and the restart branch + // (mismatched hash) must swallow the throw and continue with their + // bookkeeping so a single dead child can't break HMR for siblings. 
+ expect(() => reg.dispatchRebuild("h")).not.toThrow(); + expect(() => reg.dispatchRebuild("x")).not.toThrow(); + }); + + it("dispatchRebuild omits dead-on-kill children from the restart targets", () => { + // Regression: previously the implementation always pushed onto + // `targets` even when `kill()` threw, so a child that had already + // exited would still be reported back to the SPA as a restart + // target: the SPA would then wait forever for the (already- + // delivered) `exit=...` line and never re-spawn. + const reg = new TrainRegistry(); + const dead = fakeChild(601); + dead.kill.mockImplementation(() => { + const err = new Error("kill ESRCH") as Error & { code?: string }; + err.code = "ESRCH"; + throw err; + }); + reg.register(dead as unknown as ChildProcess, { + configHash: "stale", + trainFile: "/tmp/d.ts", + }); + const result = reg.dispatchRebuild("fresh"); + expect(result.hotSwapTargets).toEqual([]); + expect(result.restartTargets).toEqual([]); + }); + + it("dispatchRebuild classifies ESRCH on the hash-match branch as 'gone' (no SIGTERM fallback)", () => { + // Regression: `safeKill` previously treated any thrown error as + // `"unsupported"`, which on the hash-match branch triggers a + // SIGTERM fallback (intended for Windows + SIGUSR2 unsupported). + // POSIX `kill(2)` raises `ESRCH` for an already-exited child: + // classifying that as "unsupported" caused a needless SIGTERM + // attempt against a dead PID. Now ESRCH routes through the + // "gone" branch (no fallback, no restart-target push) so the + // child is dropped silently for the close handler to reap. 
+ const reg = new TrainRegistry(); + const goneOnSigusr2 = fakeChild(801); + goneOnSigusr2.kill.mockImplementation(() => { + const err = new Error("kill ESRCH") as Error & { code?: string }; + err.code = "ESRCH"; + throw err; + }); + reg.register(goneOnSigusr2 as unknown as ChildProcess, { + configHash: "match", + trainFile: "/tmp/g.ts", + }); + const result = reg.dispatchRebuild("match"); + // No hot-swap (SIGUSR2 failed), no restart (correctly classified + // as gone, NOT routed into the SIGTERM fallback path). + expect(result.hotSwapTargets).toEqual([]); + expect(result.restartTargets).toEqual([]); + // Single SIGUSR2 attempt: no SIGTERM fallback was issued. + expect(goneOnSigusr2.kill).toHaveBeenCalledTimes(1); + expect(goneOnSigusr2.kill).toHaveBeenCalledWith("SIGUSR2"); + }); + + it("dispatchRebuild omits dead-on-kill children when kill returns false (no throw)", () => { + // Regression: `ChildProcess.kill()` returns `false` (without + // throwing) when the target process is already gone. The previous + // implementation treated any non-throw as success and reported the + // child as a restart target; the SPA would then wait forever for + // an exit line that already arrived. + const reg = new TrainRegistry(); + const gone = fakeChild(701); + gone.kill.mockReturnValue(false); + reg.register(gone as unknown as ChildProcess, { + configHash: "stale", + trainFile: "/tmp/g.ts", + }); + const result = reg.dispatchRebuild("fresh"); + expect(result.restartTargets).toEqual([]); + // We still attempted the kill; only the bookkeeping is skipped. + expect(gone.kill).toHaveBeenCalledWith("SIGTERM"); + }); + + it("dispatchRebuild sends SIGTERM at most once per child across rebuilds", () => { + // Regression: under rapid edits the dev loop can fire multiple + // rebuilds before the child reaches its next checkpoint. 
The + // runner's shutdown handler treats a *second* SIGTERM as the + // emergency `exit(143)` fast-path, which would defeat the whole + // point of preserving the in-flight checkpoint. The registry now + // tracks per-child early-stop state and skips children it has + // already signalled. + const reg = new TrainRegistry(); + const a = fakeChild(801); + reg.register(a as unknown as ChildProcess, { + configHash: "h1", + trainFile: "/tmp/a.ts", + }); + + const first = reg.dispatchRebuild("h2"); + expect(first.restartTargets).toEqual([ + { pid: 801, trainFile: "/tmp/a.ts" }, + ]); + expect(a.kill).toHaveBeenCalledTimes(1); + + // Second mismatching rebuild before the child has exited: must NOT + // re-send SIGTERM and must NOT re-list the child as a restart + // target (the SPA already has a pending re-spawn for it). + const second = reg.dispatchRebuild("h3"); + expect(second.restartTargets).toEqual([]); + expect(a.kill).toHaveBeenCalledTimes(1); + + // After the child exits and is unregistered, a fresh spawn in its + // place starts from a clean slate. + reg.unregister(801); + const respawn = fakeChild(802); + reg.register(respawn as unknown as ChildProcess, { + configHash: "h3", + trainFile: "/tmp/a.ts", + }); + const third = reg.dispatchRebuild("h4"); + expect(third.restartTargets).toEqual([ + { pid: 802, trainFile: "/tmp/a.ts" }, + ]); + expect(respawn.kill).toHaveBeenCalledTimes(1); + }); + + it("dispatchRebuild on win32 routes hash-matches directly to SIGTERM-restart (skips SIGUSR2 attempt)", () => { + // Regression: Node's `child.kill("SIGUSR2")` on Windows is + // documented to **forcefully terminate** the process (treats + // any unknown POSIX signal as SIGKILL-equivalent) and STILL + // returns `true` like a successful delivery. `safeKill` would + // then report `"ok"` → entry lands in `hotSwapTargets` → SPA + // shows "hot-swap" and skips restart, but the child is already + // dead. 
The Codex P1 fix gates the SIGUSR2 attempt behind + // `process.platform !== "win32"` so win32 routes straight to + // SIGTERM-restart, surfacing a real restart target the SPA can + // act on. + const originalPlatform = Object.getOwnPropertyDescriptor( + process, + "platform", + ); + Object.defineProperty(process, "platform", { + value: "win32", + configurable: true, + }); + try { + const reg = new TrainRegistry(); + const a = fakeChild(951); + a.kill.mockReturnValue(true); // win32 reports success even for SIGUSR2 + reg.register(a as unknown as ChildProcess, { + configHash: "match", + trainFile: "/tmp/win.ts", + }); + const result = reg.dispatchRebuild("match"); + // Restart bucket only: hot-swap is unsafe on win32 even + // when kill() reported "ok". + expect(result.hotSwapTargets).toEqual([]); + expect(result.restartTargets).toEqual([ + { pid: 951, trainFile: "/tmp/win.ts" }, + ]); + // SIGUSR2 was NEVER attempted: the platform gate skipped it + // entirely and went straight to the SIGTERM fallback path. + // (Without the gate, SIGUSR2 would have fired first and been + // misclassified as a successful hot-swap.) + expect(a.kill).toHaveBeenCalledTimes(1); + expect(a.kill).toHaveBeenCalledWith("SIGTERM"); + } finally { + if (originalPlatform) { + Object.defineProperty(process, "platform", originalPlatform); + } + } + }); + + it("dispatchRebuild degrades to SIGTERM-restart when SIGUSR2 is unsupported (Windows)", () => { + // Regression: Node's win32 build doesn't deliver SIGUSR2 (it + // throws "ENOSYS" inside `child.kill('SIGUSR2')`). The previous + // implementation silently swallowed that throw, so on Windows a + // hash-match rebuild produced neither hot-swap nor restart and + // callback edits never landed. Now we degrade to a SIGTERM-driven + // restart so the new code does take effect, at the cost of a + // brief gap rather than an in-place swap. 
+ const reg = new TrainRegistry(); + const a = fakeChild(901); + a.kill.mockImplementation((sig?: string) => { + if (sig === "SIGUSR2") { + const err = new Error( + "kill ENOSYS", + ) as Error & { code?: string }; + err.code = "ENOSYS"; + throw err; + } + return true; // SIGTERM works + }); + reg.register(a as unknown as ChildProcess, { + configHash: "match", + trainFile: "/tmp/win.ts", + }); + const result = reg.dispatchRebuild("match"); + // Must not appear in hot-swap (signal failed) but must appear in + // restart (fallback succeeded) so the SPA re-spawns once the + // exit message arrives. + expect(result.hotSwapTargets).toEqual([]); + expect(result.restartTargets).toEqual([ + { pid: 901, trainFile: "/tmp/win.ts" }, + ]); + // Both signals were attempted in order: SIGUSR2 → fallback SIGTERM. + expect(a.kill).toHaveBeenNthCalledWith(1, "SIGUSR2"); + expect(a.kill).toHaveBeenNthCalledWith(2, "SIGTERM"); + }); +}); diff --git a/packages/arkor/src/studio/trainRegistry.ts b/packages/arkor/src/studio/trainRegistry.ts new file mode 100644 index 00000000..9286e98b --- /dev/null +++ b/packages/arkor/src/studio/trainRegistry.ts @@ -0,0 +1,416 @@ +import type { ChildProcess } from "node:child_process"; + +/** + * Per-active-train state tracked alongside the spawned `arkor start` + * subprocess. The Studio server records this at spawn time so HMR + * rebuilds can decide, per child, between: + * + * - **SIGUSR2** (callback hot-swap) when the new bundle's `configHash` + * matches the one captured at spawn time: the cloud-side run is + * unaffected, only in-process callbacks need to update. + * - **SIGTERM** (graceful early-stop + restart) when the configs + * diverge: the runner's internal early-stop entry point lets the + * next checkpoint finish, the subprocess exits, and the SPA + * re-spawns with the rebuilt artefact. 
+ */ +export interface ActiveTrain { + child: ChildProcess; + trainFile?: string; + /** Cloud-side config hash captured at spawn time (may be null if the + * manifest wasn't inspectable yet, e.g. spawn raced an in-flight + * build). A null entry forces SIGTERM on the next rebuild because we + * can't prove the configs match. */ + configHash: string | null; + /** + * Content hash (sha256, truncated; see `studio/hmr.ts`'s + * `contentHashOrNull`) of the on-disk `.arkor/build/index.mjs` + * at spawn time. Used **only** to gate the pre-ready-spawn + * backfill: if a rebuild eventually fires while `configHash` is + * still null and this content hash equals the rebuild's + * `event.contentHash`, the child is provably reading the same + * bundle bytes the new hash describes: safe to backfill + * `configHash` and skip dispatch. A mismatch (or null here) + * means the on-disk artefact has changed between spawn and + * rebuild (user edited mid-spawn, fresh project never built, …) + * so the child is running stale bytes and we MUST SIGTERM-restart + * to keep cloud-side `JobConfig` aligned with what the child + * actually loaded. + * + * Content-hash (vs the timestamp `mtime+ctime+size` shape used + * by `event.hash` for SSE dedup) avoids a false-positive + * mismatch when a watcher rebuild produces identical bytes: + * timestamps still bump, but content is the same and we + * shouldn't force a spurious cancel+restart cycle. Null when + * HMR isn't enabled or read failed. + */ + spawnArtifactContentHash: string | null; + /** + * `true` once we've already SIGTERM'd this child for an HMR-driven + * early-stop. Subsequent rebuilds (which can land before the child + * has reached its next checkpoint) must NOT re-send SIGTERM: + * the runner's shutdown handler treats a second SIGTERM as the + * emergency `process.exit(143)` escape hatch, which would defeat + * the whole point of preserving the in-flight checkpoint. Kept + * internal to the registry; consumers shouldn't manage it. 
+ */ + earlyStopRequested?: boolean; + /** + * Cloud-side job id, captured by parsing the runner's + * `Started job ` stdout line shortly after spawn. Populated + * via `recordJobId(pid, id)` on the first matching chunk; null + * before that or for runs whose stdout we never saw the line on + * (early spawn failure, custom user bins, etc.). The + * `/api/train` cancel handler reads this to fire a fire-and-forget + * `POST /v1/jobs/:id/cancel` before SIGKILLing the subprocess. + * SIGKILL bypasses the runner's `installShutdownHandlers`, so + * without this server-side cancel the cloud-side job would live + * until the cloud reaper / TTL fires (continued GPU spend). + */ + jobId: string | null; + /** + * Cloud-api scope (org + project slugs) captured at spawn time + * from `.arkor/state.json`. Pinned on the registry entry so the + * `/api/train` cancel handler can address the cloud cancel POST + * without re-reading the filesystem at stop time. Without this + * pin, a user who deleted or made unreadable `.arkor/state.json` + * mid-training would have their manual stop silently skip the + * cancel POST (state read returns null, handler bails) and + * the cloud job would orphan. Null when `/api/train` ran without + * state (auto-anonymous bootstrap failed, etc.); cancel POST is + * skipped then too, but the SIGKILL still tears down the local + * subprocess. + */ + scope: { orgSlug: string; projectSlug: string } | null; +} + +export interface RestartTarget { + pid: number; + trainFile?: string; +} + +export interface DispatchResult { + /** Children whose callbacks were rotated in place via SIGUSR2. */ + hotSwapTargets: RestartTarget[]; + /** + * Children that were SIGTERM'd for graceful early-stop and need to + * be re-spawned by the SPA after the train stream emits its + * `exit=...` line. Includes both config-mismatch matches and + * config-match cases that fell back here because the platform + * doesn't support SIGUSR2 (Windows). 
+ */ + restartTargets: RestartTarget[]; +} + +/** + * Outcome of a single `child.kill(signal)` call. + * + * - `"ok"`: signal was delivered. + * - `"gone"`: process was already exited. Surfaces both as `kill` + * returning `false` (Node's mapped form) and as a thrown `ESRCH` + * (a race where the child exits between the `entries` lookup and + * the `kill` call: POSIX `kill(2)` raises `ESRCH` for + * non-existent PIDs and Node propagates it on some versions). + * - `"unsupported"`: any *other* `kill` throw, i.e. the signal + * couldn't be delivered for a reason that isn't "process is gone". + * The motivating case is the platform not supporting this signal + * kind (Windows + `SIGUSR2` → `ENOSYS`; bad signal name → + * `EINVAL`), which `dispatchRebuild` falls back to SIGTERM-restart + * for. The bucket is intentionally a catch-all rather than a + * whitelist of error codes: rare cases like `EPERM` (lost the + * right to signal a re-parented child) and platform-specific + * surprises take the same conservative fallback (try the next + * signal, otherwise drop the entry), which is what callers want + * from "kill failed for some non-recoverable reason". + */ +type KillResult = "ok" | "gone" | "unsupported"; + +function safeKill(child: ChildProcess, signal: NodeJS.Signals): KillResult { + try { + return child.kill(signal) ? "ok" : "gone"; + } catch (err) { + // `ESRCH` ("no such process") means the child already exited: + // semantically identical to `kill returning false`. Mis-classifying + // it as `"unsupported"` would route a hash-match hot-swap candidate + // into the SIGTERM fallback, which then also no-ops (also gone) but + // costs a needless restart-bucket inclusion until the close handler + // unregisters the child. Every other throw collapses into + // `"unsupported"` per the type doc above. 
+ const code = (err as NodeJS.ErrnoException | null)?.code; + if (code === "ESRCH") return "gone"; + return "unsupported"; + } +} + +/** + * Encapsulates the set of `/api/train`-spawned subprocesses and the + * signal-dispatch decision rule for HMR rebuilds. Pulled out of + * `buildStudioApp` so the policy is testable in isolation and so future + * additions (e.g. a `cancel-all` admin endpoint) have a clear seam. + */ +export class TrainRegistry { + private readonly entries = new Map(); + + register( + child: ChildProcess, + init: Omit< + ActiveTrain, + | "child" + | "earlyStopRequested" + | "spawnArtifactContentHash" + | "jobId" + | "scope" + > & { + // Optional in the signature so tests / future callers that + // don't track the on-disk artefact content hash (e.g. an + // HMR-disabled server, a hand-rolled fake) can omit it. + // Defaults to `null`, which forces the pre-ready-spawn + // branch to fall through to SIGTERM-restart on the next + // non-null rebuild (the safe choice when we genuinely + // don't know what bytes the child loaded). Real `/api/train` + // calls in HMR mode capture this from + // `coordinator.getCurrentArtifactContentHash()`. + spawnArtifactContentHash?: string | null; + // Optional too: tests don't need scope for HMR-routing + // assertions. Real `/api/train` calls in production pass a + // non-null scope captured from `.arkor/state.json` so the + // cancel POST can address the cloud job without re-reading + // the filesystem at stop time. + scope?: { orgSlug: string; projectSlug: string } | null; + }, + ): void { + if (typeof child.pid !== "number") return; + this.entries.set(child.pid, { + child, + ...init, + spawnArtifactContentHash: init.spawnArtifactContentHash ?? null, + scope: init.scope ?? null, + earlyStopRequested: false, + // `jobId` starts null; populated later by `recordJobId(pid, + // id)` when the server's stdout parser sees the runner's + // `Started job ` line. 
Tests that don't exercise the + // cancel-POST path can leave it null. + jobId: null, + }); + } + + unregister(pid: number | undefined): void { + if (typeof pid === "number") this.entries.delete(pid); + } + + /** + * Record the cloud-side job id for an active child. Called by the + * server's `/api/train` stdout parser the first time it spots + * `Started job ` in the runner's output. Idempotent: a + * second call with the same pid + id is a no-op (the runner + * only prints the line once anyway). Unknown pids are silently + * dropped (the child may have already exited and unregistered). + */ + recordJobId(pid: number | undefined, jobId: string): void { + if (typeof pid !== "number") return; + const entry = this.entries.get(pid); + if (!entry) return; + entry.jobId = jobId; + } + + /** + * Read the recorded cloud-side job id for a pid. `/api/train`'s + * cancel handler consults this to POST `/v1/jobs/:id/cancel` + * before SIGKILLing the local subprocess; without that POST, + * a user-initiated stop would leave the cloud job running + * until TTL (the SIGKILL bypasses the runner's `installShutdownHandlers` + * so the runner can't issue cancel itself). Returns null when + * the pid is unknown or the runner hasn't printed its + * `Started job` line yet (early spawn failure, race against + * a fast cancel, custom user bins). + */ + getJobId(pid: number | undefined): string | null { + if (typeof pid !== "number") return null; + return this.entries.get(pid)?.jobId ?? null; + } + + /** + * Read the spawn-time cloud-api scope for a pid. Paired with + * `getJobId` by `/api/train`'s cancel handler to build the cloud + * cancel POST URL without re-reading `.arkor/state.json` at stop + * time: if the file was deleted or made unreadable mid-training, + * the read would return null and the cancel POST would silently + * skip, orphaning the cloud run. Captured at spawn time, immutable + * for the entry's lifetime. 
+ */ + getScope( + pid: number | undefined, + ): { orgSlug: string; projectSlug: string } | null { + if (typeof pid !== "number") return null; + return this.entries.get(pid)?.scope ?? null; + } + + /** + * Whether `dispatchRebuild` has already issued a graceful-restart + * SIGTERM to this child as part of an HMR cycle. Consulted by + * `/api/train`'s ReadableStream `cancel()` handler so a client- + * driven cancel (tab close, navigation, aborted fetch) doesn't + * pile a second SIGTERM on top of an in-progress early-stop: + * the runner's `installShutdownHandlers` interprets a second + * SIGTERM as the emergency `exit(143)` fast-path, which bypasses + * the checkpoint-preserving early-stop + `cancel()` flow and + * leaves the cloud-side run live while the local subprocess + * dies. That defeats the main safety goal of the HMR restart logic. + */ + isEarlyStopRequested(pid: number | undefined): boolean { + if (typeof pid !== "number") return false; + return this.entries.get(pid)?.earlyStopRequested ?? false; + } + + get size(): number { + return this.entries.size; + } + + /** Read-only snapshot, mostly for tests / observability. */ + list(): ReadonlyArray<ActiveTrain> { + return [...this.entries.values()]; + } + + /** + * Single entry point for HMR rebuilds: per active child, decide + * between callback hot-swap (SIGUSR2) and graceful restart + * (SIGTERM), apply the signal, and report which children landed in + * each bucket so the SPA can update its UI / re-spawn restarted + * runs. + * + * Combines what was previously `notifyCallbackReload` + + * `requestEarlyStopOnMismatch` into one pass so the per-child + * decision is atomic: important because the hot-swap path can + * gracefully degrade into the restart path on platforms (Windows) + * where SIGUSR2 isn't supported, which is hard to express across + * two separate iterations of the registry. + * + * Re-signal protection: children already flagged + * `earlyStopRequested` are skipped entirely. 
The flag is cleared + * naturally when the child exits and is unregistered. + * + * Defensive corner cases: + * - `kill()` returns `false` (process already exited) → drop + * from the targets list, the registry's close handler will + * unregister it. + * - `kill("SIGUSR2")` throws on Windows → degrade to SIGTERM so + * callback edits still take effect (via a full restart) rather + * than silently being ignored. + */ + dispatchRebuild( + nextConfigHash: string | null, + // Content hash (sha256-derived; see `studio/hmr.ts`) of the + // freshly-built artefact, paired with `entry.spawnArtifactContentHash` + // for the pre-ready-spawn equality gate. Defaults to `null` so + // tests / pre-existing callers that don't pass a hash get the + // conservative behaviour: a null entry hash falls through to + // SIGTERM-restart. Real dispatch from `/api/train`'s HMR + // subscriber threads `event.contentHash` here so the backfill + // optimisation activates only when the child's loaded bytes + // genuinely match. + nextArtifactContentHash: string | null = null, + ): DispatchResult { + const hotSwapTargets: RestartTarget[] = []; + const restartTargets: RestartTarget[] = []; + + for (const [pid, entry] of this.entries) { + if (entry.earlyStopRequested) continue; + const target: RestartTarget = { pid, trainFile: entry.trainFile }; + // Pre-ready spawn: this child was registered via `/api/train` + // *before* the HMR watcher's first successful build, so its + // recorded `configHash` is `null`. Whether the rebuild's new + // hash describes the same bytes the child actually loaded + // depends on whether the on-disk artefact has changed between + // spawn and now. Tie the decision to the artefact content + // hash: + // + // - `entry.spawnArtifactContentHash === nextArtifactContentHash` + // → child read the same bytes the new hash describes. + // Safe to backfill `configHash`; future rebuilds compare + // against the backfilled value like any other child. 
This + // is the common case (user clicked Run before the SPA had + // refreshed its manifest, but the on-disk artefact is the + // same one the watcher just settled on). + // + // - content hashes differ (or one side is null) → the bytes + // the child loaded don't match the new hash. SIGTERM-restart + // so the cloud-side `JobConfig` and the child's actual + // config are guaranteed to align. Without this gate, an + // edit landing between spawn and the first BUNDLE_END would + // silently teach the registry to use the post-edit hash as + // the child's baseline; later same-hash rebuilds would + // then hot-swap callbacks into a child whose cloud-side + // `JobConfig` was *actually* spawned against an older + // version, leaving the cloud run on a stale config. + const isPreReadySpawn = + entry.configHash === null && nextConfigHash !== null; + if (isPreReadySpawn) { + const artefactsAgree = + entry.spawnArtifactContentHash !== null && + nextArtifactContentHash !== null && + entry.spawnArtifactContentHash === nextArtifactContentHash; + if (artefactsAgree) { + entry.configHash = nextConfigHash; + continue; + } + // fall through to the mismatch / SIGTERM-restart path below + } + const matches = + nextConfigHash !== null && + entry.configHash !== null && + entry.configHash === nextConfigHash; + + if (matches) { + // On Windows, Node's `child.kill(signal)` for any unknown + // POSIX signal (including SIGUSR2) is documented to + // **forcefully terminate** the process (same effect as + // SIGKILL), and `kill()` returns `true` like a successful + // delivery. `safeKill` would then report `"ok"`, the entry + // would land in `hotSwapTargets`, and the SPA would never + // schedule a restart even though the child is *dead*. Skip + // the SIGUSR2 attempt on win32 entirely and route directly + // to the SIGTERM-restart path so the SPA learns about the + // pending restart and re-spawns when the exit line arrives. 
+ // The user-visible outcome (callbacks reload after a brief + // restart) matches the design intent on platforms where + // the in-place hot-swap simply isn't available. + if (process.platform !== "win32") { + const r = safeKill(entry.child, "SIGUSR2"); + if (r === "ok") { + hotSwapTargets.push(target); + continue; + } + if (r === "gone") { + // Child already exited; close handler will unregister. + continue; + } + // Cross-platform safety net: SIGUSR2 reported `"unsupported"` + // on a non-win32 platform (rare: `ENOSYS` from libuv signal + // wrap on exotic builds, future Node versions removing the + // signal, etc.). Same fallback as the win32 skip above: + // route to SIGTERM-restart so callback edits still take + // effect via a full restart instead of silently being + // ignored. + } + const fallback = safeKill(entry.child, "SIGTERM"); + if (fallback === "ok") { + entry.earlyStopRequested = true; + restartTargets.push(target); + } + // "gone" / "unsupported" again → drop silently; the close + // handler (or operator-driven restart) will recover. + continue; + } + + // Hash mismatch (or one side is null): graceful restart. + const r = safeKill(entry.child, "SIGTERM"); + if (r === "ok") { + entry.earlyStopRequested = true; + restartTargets.push(target); + } + // "gone": child already exited, drop. "unsupported": can't + // happen for SIGTERM on supported platforms; drop defensively. + } + + return { hotSwapTargets, restartTargets }; + } +} diff --git a/packages/cli-internal/src/templates.ts b/packages/cli-internal/src/templates.ts index a250fd62..1d8e94fc 100644 --- a/packages/cli-internal/src/templates.ts +++ b/packages/cli-internal/src/templates.ts @@ -1,13 +1,13 @@ /** * Starter templates written out by `create-arkor` / `arkor init`. - * Single source of truth — both consumers bundle this module at build time. + * Single source of truth: both consumers bundle this module at build time. 
* * Layout written to disk: * * src/arkor/index.ts ← entry-point manifest (`createArkor({ trainer })`) * src/arkor/trainer.ts ← per-template trainer (`createTrainer({...})`) * - * `index.ts` is identical across templates — only the trainer body differs. + * `index.ts` is identical across templates; only the trainer body differs. */ export type TemplateId = "redaction" | "translate" | "triage"; @@ -179,7 +179,7 @@ ${ONLOG_BODY} }); `; -// Order is significant — `templateChoices()` preserves insertion order so the +// Order is significant: `templateChoices()` preserves insertion order so the // CLI prompt lists demos first (sorted by estimated training time). // // Estimated training times assume A100 80GB on Runpod Serverless with the @@ -204,7 +204,7 @@ export const TEMPLATES: Record = { }; /** - * Body of `src/arkor/index.ts` — identical across templates. The `createArkor` + * Body of `src/arkor/index.ts`: identical across templates. The `createArkor` * factory is what `arkor build` / Studio discovers; per-role primitives * (`trainer`, future `deploy`, `eval`) live in sibling files and get gathered * here. @@ -215,7 +215,7 @@ import { trainer } from "./trainer"; export const arkor = createArkor({ trainer }); `; -export const STARTER_CONFIG = `// Placeholder for future project-level config — the runtime does not read +export const STARTER_CONFIG = `// Placeholder for future project-level config: the runtime does not read // fields from this file yet. Training settings (\`maxSteps\`, \`lora\`, etc.) // live on the Trainer in src/arkor/trainer.ts. Project routing // (orgSlug / projectSlug) is tracked automatically in .arkor/state.json. @@ -241,7 +241,7 @@ An arkor training project scaffolded by \`create-arkor\`. 
The \`dev\` / \`build\` / \`start\` package scripts forward to the matching \`arkor\` subcommands, so the script form works across every package -manager (\`npm\` does not run package binaries via \`npm \` — use +manager (\`npm\` does not run package binaries via \`npm \`; use \`npm run