arkorlab · k-taro56 · May 2, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -63,25 +63,36 @@ cd my-arkor-app && pnpm dev                            # Studio at http://127.0.
 
 `arkor dev` generates a 32-byte base64url token per launch ([packages/arkor/src/cli/commands/dev.ts](packages/arkor/src/cli/commands/dev.ts)) and:
 
-1. Passes it to `buildStudioApp({ studioToken })`. The Hono server validates every `/api/*` request via `X-Arkor-Studio-Token` header (or `?studioToken=` query for `EventSource`, which can't set headers). Comparison uses `timingSafeEqual`.
-2. Persists it to `~/.arkor/studio-token` (mode 0600) so the SPA dev workflow (`pnpm --filter @arkor/studio-app dev`) can read it via the `arkor-studio-token` Vite plugin in [packages/studio-app/vite.config.ts](packages/studio-app/vite.config.ts), which injects `<meta name="arkor-studio-token">` into `index.html` on each request. Persistence failure must NOT block server start (read-only `$HOME` on Docker, etc.) — just warn.
+1. Passes it to `buildStudioApp({ studioToken })`. The Hono server validates every `/api/*` request via `X-Arkor-Studio-Token` header (or `?studioToken=` query for `EventSource`, which can't set headers). Comparison uses `timingSafeEqual`. The query-token allow-list lives in `eventStreamPathPattern` in [packages/arkor/src/studio/server.ts](packages/arkor/src/studio/server.ts), currently `/api/jobs/:id/events` and `/api/dev/events`. **Adding to that regex is CSRF-sensitive: each entry must be a GET stream-only route, never a mutation endpoint.**
+2. Persists it to `~/.arkor/studio-token` (mode 0600) so the SPA dev workflow (`pnpm --filter @arkor/studio-app dev`) can read it via the `arkor-studio-token` Vite plugin in [packages/studio-app/vite.config.ts](packages/studio-app/vite.config.ts), which injects `<meta name="arkor-studio-token">` into `index.html` on each request. Persistence failure must NOT block server start (read-only `$HOME` on Docker, etc.); just warn.
 3. Cleans up on `exit`/SIGINT/SIGTERM/SIGHUP via `unlinkSync`.
 
-`/api/*` middleware also enforces a host-header allow-list (`127.0.0.1`/`localhost`) for DNS-rebinding defence. **CORS is intentionally NOT configured** — the SPA is same-origin so reflecting `*` would let "simple" cross-origin POSTs reach handlers. The token check rejects those; cross-origin tabs cannot read the SPA's `<meta>`.
+`/api/*` middleware also enforces a host-header allow-list (`127.0.0.1`/`localhost`) for DNS-rebinding defence. **CORS is intentionally NOT configured**: the SPA is same-origin so reflecting `*` would let "simple" cross-origin POSTs reach handlers. The token check rejects those; cross-origin tabs cannot read the SPA's `<meta>`.
 
-The whole point: prevents another browser tab on the same machine from POSTing `/api/train` (which spawns `arkor train` and dynamically imports user TS — RCE-grade).
+The whole point is to prevent another browser tab on the same machine from POSTing `/api/train` (which spawns `arkor train` and dynamically imports user TS, an RCE-grade exposure).
 
-When touching the Studio server or SPA fetch layer, preserve: token via header for `fetch`, query param for `EventSource`, host-header guard, no CORS, timing-safe compare. The Vite plugin is dev-only (`apply: "serve"`) — running it during `vite build` would bake a stale per-launch token into the production `index.html` and shadow the runtime tag, causing every `/api/*` call to 403.
+When touching the Studio server or SPA fetch layer, preserve: token via header for `fetch`, query param for `EventSource`, host-header guard, no CORS, timing-safe compare. The Vite plugin is dev-only (`apply: "serve"`): running it during `vite build` would bake a stale per-launch token into the production `index.html` and shadow the runtime tag, causing every `/api/*` call to 403.
+
+### HMR + graceful early-stop + callback hot-swap
+
+`arkor dev` keeps a [Rolldown](https://rolldown.rs) watcher over `src/arkor/` ([packages/arkor/src/studio/hmr.ts](packages/arkor/src/studio/hmr.ts)) and pushes rebuild events over `/api/dev/events` (SSE). On each successful build the watcher dynamic-imports the artifact, pulls a `TrainerInspection` snapshot off the discovered trainer (via the cross-realm `Symbol.for("arkor.trainer.inspect")` brand attached in [packages/arkor/src/core/trainerInspection.ts](packages/arkor/src/core/trainerInspection.ts)), and computes a stable `configHash` from the cloud-side `JobConfig`. The SPA re-fetches `/api/manifest` on each event so the Run Training button stays in sync without a browser refresh.
+
+When a rebuild lands while a `/api/train`-spawned subprocess is in flight, the server makes a per-child decision in [packages/arkor/src/studio/trainRegistry.ts](packages/arkor/src/studio/trainRegistry.ts):
+
+- **`configHash` matches the spawn-time hash** → SIGUSR2. The child's `installCallbackReloadHandler` re-imports the artifact and rotates the trainer's callback cell via the internal `Symbol.for("arkor.trainer.replaceCallbacks")` brand exposed by [packages/arkor/src/core/trainerInspection.ts](packages/arkor/src/core/trainerInspection.ts). The cloud-side run is untouched. Use this whenever a code change is contained inside the `callbacks: { ... }` object. Don't add a `replaceCallbacks()` method to the public `Trainer` interface: keeping the mutator behind a `Symbol.for` brand is what stops the dev-only HMR primitive from leaking into the SDK's published surface.
+- **`configHash` differs (or is null because the new bundle didn't inspect)** → SIGTERM. `installShutdownHandlers` drives the trainer's internal early-stop entry point via the `Symbol.for("arkor.trainer.requestEarlyStop")` brand exposed by [packages/arkor/src/core/trainerInspection.ts](packages/arkor/src/core/trainerInspection.ts), which lets the next `checkpoint.saved` event finish (work preserved) before issuing `cancel()` and exiting cleanly. The SPA auto-restarts the run with the rebuilt artifact via the `restart: true` flag on the SSE event. A second SIGTERM bypasses the early-stop and exits 143 immediately, as an emergency escape hatch for a hung cancel.
+
+Don't replace the SIGTERM-and-let-the-child-handle-it pattern with a SIGKILL escalation in the server: that would orphan Cloud-side jobs (no `cancel()` POST goes out) and waste GPU budget. Don't widen the SIGUSR2 path to "always hot-swap, server-side": the `configHash` check is what guarantees a hot-swap can't silently leave a child running with a stale `JobConfig`. Don't surface `requestEarlyStop()` (or `replaceCallbacks()`) as a method on the public `Trainer` interface: both are dev-only HMR primitives, and keeping them behind `Symbol.for` brands is what stops them from leaking into the published SDK shape; user code that wants similar semantics should compose `abortSignal` + `cancel()` per the cookbook.
 
 ### Project entry-point discovery
 
 The CLI/Studio look at `src/arkor/index.ts` in user projects. Discovery in [packages/arkor/src/core/runner.ts](packages/arkor/src/core/runner.ts) accepts (in order): a named `arkor` export from `createArkor({...})`, a bare `trainer` export, a default export holding either an Arkor manifest or a Trainer, or a `default.trainer` nested shape. `createArkor` returns a frozen, opaque manifest tagged with `_kind: "arkor"`; treat it as a value to hand to tooling, not a programmable client.
 
-`arkor build` ([packages/arkor/src/cli/commands/build.ts](packages/arkor/src/cli/commands/build.ts)) bundles to `.arkor/build/index.mjs` with esbuild; bare specifiers (e.g. `arkor`, anything in `node_modules`) stay external so the artifact resolves the runtime SDK from the project's installed copy.
+`arkor build` ([packages/arkor/src/cli/commands/build.ts](packages/arkor/src/cli/commands/build.ts)) bundles to `.arkor/build/index.mjs` with [Rolldown](https://rolldown.rs); bare specifiers (e.g. `arkor`, anything in `node_modules`) stay external so the artifact resolves the runtime SDK from the project's installed copy. The `transform.target` is derived from `process.versions.node` at build time so the bundle targets the same Node binary that will execute it.
 
 ### E2E suite specifics
 
-Both [e2e/cli](e2e/cli) and [e2e/studio](e2e/studio) declare `arkor` (and, for `e2e/cli`, `create-arkor`) as `workspace:*` `devDependencies`, so Turbo's `^build` produces `dist/bin.mjs` exactly once before `#test`/`#test:coverage` runs — no `pretest` hooks, no concurrent rebuilds racing on `dist/`. Standalone runs (`pnpm --filter @arkor/e2e-* test`) need a prior `pnpm build`. Every supported Node (≥22.22.0) is in rolldown's compatible range (^20.19 || >=22.12), so the previous "rolldown-incompatible" CI bypass path was removed.
+Both [e2e/cli](e2e/cli) and [e2e/studio](e2e/studio) declare `arkor` (and, for `e2e/cli`, `create-arkor`) as `workspace:*` `devDependencies`, so Turbo's `^build` produces `dist/bin.mjs` exactly once before `#test`/`#test:coverage` runs (no `pretest` hooks, no concurrent rebuilds racing on `dist/`). Standalone runs (`pnpm --filter @arkor/e2e-* test`) need a prior `pnpm build`. Every supported Node (≥22.22.0) is in rolldown's compatible range (^20.19 || >=22.12), so the previous "rolldown-incompatible" CI bypass path was removed.
 
 Tests rely on `ARKOR_INTERNAL_SCAFFOLD_ARKOR_SPEC=file:.../packages/arkor` so the scaffolded fixtures install the workspace `arkor` instead of the npm-published one. Both this var and `SKIP_E2E_INSTALL` are declared in [turbo.json](turbo.json) so they pass through Turbo's hash.
 
@@ -96,7 +107,7 @@ When implementing anything (new feature, SDK/CLI/Studio behaviour change, schema
 1. **Docs in both languages.** This repo pairs English/Japanese docs: `README.md` ↔ `README.ja.md`, `CONTRIBUTING.md` ↔ `CONTRIBUTING.ja.md`, and `docs/` ↔ `docs/ja/`. If you edit the English side, update the Japanese side in the same PR. Don't leave Japanese docs to be retro-translated later.
 2. **Tests.** Add vitest cases under `packages/*/src/**/*.test.ts` for SDK/CLI/scaffold logic changes. For CLI flow changes, consider an `e2e/cli` scenario.
 
-Don't split these into "docs in a follow-up PR" or "tests later" — land them in the same PR. Skip only when the user explicitly says to.
+Don't split these into "docs in a follow-up PR" or "tests later"; land them in the same PR. Skip only when the user explicitly says to.
 
 ## Non-obvious gotchas
 

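The token rules described in the AGENTS.md hunks above (header for `fetch`, query parameter only on the stream routes, timing-safe compare) can be sketched roughly as follows. This is a minimal illustration, not the code in `packages/arkor/src/studio/server.ts`: the `RequestLike` shape, the `tokensMatch` and `isAuthorized` helpers, and the exact regex are assumptions made for the example. Only `timingSafeEqual` from `node:crypto`, the header/query split, and the two stream routes come from the text.

```ts
import { timingSafeEqual } from "node:crypto";

// Hypothetical stand-in for the `eventStreamPathPattern` allow-list:
// only the two GET stream routes named in AGENTS.md may use `?studioToken=`.
const eventStreamPathPattern = /^\/api\/(?:jobs\/[^/]+\/events|dev\/events)$/;

// Hypothetical, framework-free view of an incoming request.
interface RequestLike {
  method: string;
  path: string;
  headerToken?: string; // value of X-Arkor-Studio-Token, if present
  queryToken?: string; // value of ?studioToken=, if present
}

// Constant-time comparison. `timingSafeEqual` throws when the buffers
// differ in length, so the lengths are checked first.
function tokensMatch(expected: string, presented: string): boolean {
  const a = Buffer.from(expected);
  const b = Buffer.from(presented);
  return a.length === b.length && timingSafeEqual(a, b);
}

function isAuthorized(req: RequestLike, expected: string): boolean {
  // The header always wins: it is the only channel mutation routes accept.
  if (req.headerToken !== undefined) {
    return tokensMatch(expected, req.headerToken);
  }
  // The query token is honoured only for GET requests on stream-only routes,
  // because EventSource cannot set headers.
  const queryAllowed =
    req.method === "GET" && eventStreamPathPattern.test(req.path);
  if (queryAllowed && req.queryToken !== undefined) {
    return tokensMatch(expected, req.queryToken);
  }
  return false;
}
```

Under these assumptions, a `POST /api/train` carrying the token only in the query string is rejected, which is the CSRF property the allow-list note is protecting.
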
diff --git a/docs/concepts/studio.mdx b/docs/concepts/studio.mdx
@@ -14,7 +14,12 @@ Four jobs:
 3. **Try a finished model.** A Playground page lets you pick the base model or the final adapter from any completed job and chat with it. The Playground does not load intermediate checkpoints; for mid-run inference, use [`onCheckpoint`](/concepts/lifecycle) callbacks in your trainer.
 4. **Publish a model behind a `*.arkor.app` URL.** An Endpoints page creates a per-deployment subdomain that serves OpenAI-compatible chat completions for a chosen adapter or base model, plus the API keys that authenticate calls to it. The same actions are available programmatically via [`CloudApiClient`](/sdk/deployments) — Studio is the interactive surface; the SDK is the lower-level one.
 
-A note on the dev loop: Studio's `/api/manifest` endpoint rebuilds and re-imports your trainer on every request (with a cache-bust query, see `packages/arkor/src/studio/manifest.ts`), but the UI only fetches it when the Run training page mounts. So if you edit `src/arkor/` and stay on the same Run training page, the next click reuses the existing `.arkor/build/index.mjs` and runs your old code. Refresh the page (or run `arkor build` from the terminal) between edits and clicks to pick up the new code reliably.
+A note on the dev loop: Studio runs a [Rolldown](https://rolldown.rs) watcher over `src/arkor/` and pushes rebuild notifications to the SPA over a Server-Sent Events stream (`/api/dev/events`). Edit a file, save, and the Run training button updates with the new trainer name without a refresh. If a training run is in flight, Studio compares the new bundle's cloud-side `JobConfig` hash to the one captured when the run was spawned:
+
+- **Same hash (only callbacks changed).** The runner is signalled with SIGUSR2; it re-imports the rebuilt artifact and rotates the trainer's callback cell in place via an internal HMR brand. The cloud-side training run is untouched, no GPU time is wasted, and the SPA shows a brief "Callbacks hot-swapped" indicator.
+- **Different hash (model / dataset / hyperparameters changed).** The runner is signalled with SIGTERM; the trainer's internal early-stop entry point lets the next checkpoint upload finish before issuing `cancel()`, then the SPA re-spawns the run with the rebuilt artifact. The previous Cloud-side job reaches `cancelled` after the checkpoint is uploaded, so the partial work is preserved as an artifact.
+
+If you want this "stop after the next checkpoint" behaviour from your own code (rather than from the dev loop), build it on top of the public [`abortSignal` + `cancel()`](/sdk/trainer-control#abortsignal) pair. The [Early stopping recipe](/cookbook/early-stopping) walks through it.
 
 ## Where Studio runs
 

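The hash comparison described in the `docs/concepts/studio.mdx` hunk above can be sketched like this. It is an illustration under stated assumptions rather than the logic in `packages/arkor/src/studio/trainRegistry.ts`: the diff only says the `configHash` is "stable", so hashing key-sorted JSON with SHA-256 is an assumption, and `canonicalize`, `stableConfigHash`, and `signalForRebuild` are hypothetical names.

```ts
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialise with object keys sorted so that key order never changes the hash.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const entries = Object.keys(value)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${canonicalize(value[key] as Json)}`);
  return `{${entries.join(",")}}`;
}

// Assumed hashing scheme: SHA-256 over the canonical JSON form.
function stableConfigHash(config: Json): string {
  return createHash("sha256").update(canonicalize(config)).digest("hex");
}

type ReloadSignal = "SIGUSR2" | "SIGTERM";

// Same hash: hot-swap callbacks (SIGUSR2). Different or missing hash
// (the new bundle did not inspect): graceful early-stop (SIGTERM).
function signalForRebuild(
  spawnHash: string,
  rebuiltHash: string | null,
): ReloadSignal {
  return rebuiltHash !== null && rebuiltHash === spawnHash
    ? "SIGUSR2"
    : "SIGTERM";
}
```

Defaulting to SIGTERM whenever the hashes cannot be proven equal mirrors the stated guarantee that a hot-swap never leaves a child running with a stale `JobConfig`.
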
diff --git a/docs/ja/concepts/studio.mdx b/docs/ja/concepts/studio.mdx
@@ -14,7 +14,12 @@ Studio は `arkor dev` 実行時に立ち上がるローカル Web UI です。
 3. **完成モデルを試す。** Playground ページでベースモデルや任意の完了済みジョブの最終アダプターを選んでチャットできます。中間チェックポイントは Playground からはロードしません。学習中の推論には [`onCheckpoint`](/ja/concepts/lifecycle) コールバックをトレーナーで使ってください。
 4. **`*.arkor.app` URL でモデルを公開する。** Endpoints ページで OpenAI 互換 chat completions を提供する deployment 専用サブドメインを作成し、その API キーを発行・取り消しできます。同じ操作は [`CloudApiClient`](/ja/sdk/deployments) からプログラマティックにも可能で、Studio が対話的なインターフェイス、SDK が下位レイヤーという位置付けです。
 
-dev ループのメモ: Studio の `/api/manifest` エンドポイントはリクエストごとにトレーナーをリビルド・再 import しますが（キャッシュバストクエリ付き、`packages/arkor/src/studio/manifest.ts` を参照）、UI が fetch するのは Run training ページがマウントされたときだけです。`src/arkor/` を編集して同じ Run training ページに留まり続けると、次のクリックは既存の `.arkor/build/index.mjs` を再利用して古いコードで走ります。確実に新しいコードを取り込むには、編集とクリックの間にページをリロード（あるいはターミナルから `arkor build`）してください。
+dev ループのメモ: Studio は [Rolldown](https://rolldown.rs) のウォッチャを `src/arkor/` 上で常駐させ、再ビルド通知を Server-Sent Events ストリーム (`/api/dev/events`) で SPA に push します。ファイルを編集して保存すれば、Run training ボタンのトレーナー名表示はリロード無しで更新されます。学習が走っている最中であれば、Studio は再ビルドしたバンドルの Cloud 側 `JobConfig` ハッシュを、spawn 時に保存したハッシュと比較します。
+
+- **ハッシュ一致（コールバックのみ変更）。** ランナーへ SIGUSR2 を送ります。ランナーは再ビルドされた成果物を再 import し、内部 HMR ブランド経由でトレーナーのコールバック cell をその場で差し替えます。Cloud 側の学習はそのまま継続し、GPU 時間を無駄にせず、SPA には "Callbacks hot-swapped" と短く表示されます。
+- **ハッシュ不一致（モデル / データセット / ハイパーパラメータが変わった）。** ランナーへ SIGTERM を送ります。トレーナー内部の early-stop エントリが次のチェックポイントのアップロードを待ってから `cancel()` を発火し、SPA が再ビルドした成果物で再投入します。Cloud 側の以前のジョブはチェックポイントのアップロード完了後に `cancelled` 状態に遷移するので、ここまでの学習成果は artifact として保全されます。
+
+自前のコードから（dev ループではなく）この「次のチェックポイントで止める」挙動が欲しい場合は、公開 API の [`abortSignal` + `cancel()`](/ja/sdk/trainer-control#abortsignal) を組み合わせて書いてください。具体的な手順は [Early Stopping レシピ](/ja/cookbook/early-stopping) にあります。
 
 ## Studio が動く場所
 

diff --git a/docs/ja/studio/jobs.mdx b/docs/ja/studio/jobs.mdx
@@ -62,18 +62,18 @@ Jobs ページ（`#/jobs`）はマウント時に 1 度、その後 5 秒ごと
 
 Loss チャートは `training.log` イベントから描画される SVG プロットです。Y 軸は最小値と最大値によるスケーリング、X 軸はステップ番号で、最大 2 系列を表示します:
 
-- **Training loss** — 実線のティール色。数値 `loss` を含むイベントごとに 1 頂点。
-- **Eval loss** — 破線のピンク色（点マーカー付き）。数値 `evalLoss` を含むイベント（通常は `evalSteps` 刻み）から描画。系列はイベントから直接構築するため、`evalLoss` のみを持ち `loss` を含まない eval-only フレームも線・凡例・統計に反映されます。Eval ポイントが 1 つも来ていない間は凡例にも表示されません。
+- **Training loss**: 実線のティール色。数値 `loss` を含むイベントごとに 1 頂点。
+- **Eval loss**: 破線のピンク色（点マーカー付き）。数値 `evalLoss` を含むイベント（通常は `evalSteps` 刻み）から描画。系列はイベントから直接構築するため、`evalLoss` のみを持ち `loss` を含まない eval-only フレームも線・凡例・統計に反映されます。Eval ポイントが 1 つも来ていない間は凡例にも表示されません。
 
 ホバーすると最寄りステップと、そのステップに含まれる `loss` / `evalLoss` のうち存在する値が表示されます（eval-only ステップでは `loss` 値は出ず、その逆も同様）。チャートは `loss` または `evalLoss` のいずれかが数値であるイベントが 1 件以上届くまで `Waiting for training.log events…`（`training.log` イベント待ち）プレースホルダーを表示します。両方とも null / 省略の `training.log` フレームはカウントされません。
 
 ### 上級モード（Advanced metrics）
 
 チャートヘッダーの **Advanced** トグルを ON にすると、系列ごとの統計パネルが現れます。各カードに表示される項目:
 
-- **Mean loss ± 95% CI** — Loss 値の標本平均と 95% 信頼区間の半幅（Student の t 分布。n > 31 では z = 1.96 にフォールバック）。
-- **Std dev**（標準偏差）と **Variance**（分散） — Bessel 補正済みの不偏推定量（`ddof=1`）。
-- **p90** と **p95** — numpy のデフォルトに合わせた線形補間パーセンタイル。
+- **Mean loss ± 95% CI**: Loss 値の標本平均と 95% 信頼区間の半幅（Student の t 分布。n > 31 では z = 1.96 にフォールバック）。
+- **Std dev**（標準偏差）と **Variance**（分散）: Bessel 補正済みの不偏推定量（`ddof=1`）。
+- **p90** と **p95**: numpy のデフォルトに合わせた線形補間パーセンタイル。
 
 Eval カードは数値 `evalLoss` を含む `training.log` イベントが届くまでは空のままです。