Enhance Trainer API with HMR, early-stop, and cleanup improvements by k-taro56 · Pull Request #101 · arkorlab/arkor

k-taro56 · 2026-05-02T17:02:31Z

This pull request migrates the Arkor build and development workflow from esbuild to Rolldown, introduces a robust hot module replacement (HMR) system for the Studio dev server, and refines the cleanup and shutdown logic for long-lived resources. Documentation has been updated to explain the new HMR behavior in both English and Japanese, and dependency management has been adjusted accordingly. These changes streamline the developer experience, enable live code updates during training runs, and improve resource handling during shutdown.

Build system migration and HMR integration:

Migrated the build system from esbuild to Rolldown in arkor build, ensuring bundles target the current Node version and keeping external dependencies unbundled. (packages/arkor/package.json, packages/arkor/src/cli/commands/build.ts, packages/arkor/src/core/rolldownConfig) [1] [2] [3]
Added a persistent Rolldown watcher for src/arkor/ in dev mode, enabling HMR for trainer code. The watcher notifies the SPA via SSE (/api/dev/events), and the dev server can hot-swap callbacks or gracefully early-stop training runs based on config changes. (packages/arkor/src/cli/commands/dev.ts, packages/arkor/src/studio/hmr.ts, packages/arkor/src/studio/trainRegistry.ts) [1] [2] [3] [4] [5] [6]

Cleanup and shutdown improvements:

Introduced a reusable registerCleanupHook utility to handle resource cleanup on process exit and signals, ensuring proper teardown order (e.g., HMR watcher before token file removal). (packages/arkor/src/cli/cleanupHooks.ts, packages/arkor/src/cli/commands/dev.ts) [1] [2]

Documentation updates:

Updated English and Japanese docs to describe the new HMR workflow, explaining how live code changes are handled during training and how the system differentiates between callback-only and config changes. (docs/concepts/studio.mdx, docs/ja/concepts/studio.mdx, AGENTS.md) [1] [2] [3]
Clarified the use of Rolldown in the build process and updated package policy documentation accordingly. (AGENTS.md)

Dependency and test adjustments:

Removed esbuild from dependencies and added Rolldown. Updated test comments to reference Rolldown instead of esbuild. (packages/arkor/package.json, packages/arkor/src/cli/commands/start.test.ts) [1] [2]

Security and internal API notes:

Documented the security-sensitive allow-list for SSE token query parameters and emphasized that HMR/early-stop internals remain behind Symbol.for brands, not exposed on the public SDK surface. (AGENTS.md) [1] [2]

- Integrated Rolldown for hot module replacement (HMR) in `arkor dev`, allowing real-time updates to the training interface without page refresh. - Implemented `requestEarlyStop` and `replaceCallbacks` methods in the Trainer API to facilitate graceful stopping of training jobs and dynamic callback updates during execution. - Updated documentation to reflect new features and usage patterns for improved developer guidance. - Adjusted cleanup logic for better resource management during development sessions.

…l shutdown - Introduced `replaceCallbacks` method in the Trainer API to allow dynamic updates of lifecycle callbacks during training runs. - Enhanced signal handling for graceful early stopping, ensuring in-flight checkpoints are preserved during HMR rebuilds. - Added `registerCleanupHook` for streamlined resource management on process exit, improving cleanup logic across development commands. - Updated documentation to reflect new features and usage patterns for better developer guidance.

…internal callback swapping - Removed the public `replaceCallbacks` method from the Trainer interface to prevent exposure of the hot-swapping functionality. - Introduced an internal mechanism for callback swapping using a `Symbol.for`-keyed brand, allowing for dynamic updates during training runs without affecting the public API. - Updated signal handling to ensure seamless integration with the new callback swapping logic, enhancing the hot module replacement (HMR) experience. - Revised documentation to reflect changes in the Trainer API and clarify the internal callback management process.

…lement internal early-stop handling - Removed the public `requestEarlyStop` method from the Trainer interface to prevent exposure of the early-stop functionality. - Introduced an internal mechanism for early stopping using a `Symbol.for`-keyed brand, allowing for graceful stopping after the next checkpoint without affecting the public API. - Updated signal handling to ensure seamless integration with the new early-stop logic, enhancing the hot module replacement (HMR) experience. - Revised documentation to reflect changes in the Trainer API and clarify the internal early-stop management process.

…llback replacement - Updated the Trainer API to remove public exposure of `requestEarlyStop` and `replaceCallbacks` methods, enhancing encapsulation. - Implemented internal mechanisms for early stopping and callback swapping using `Symbol.for`-keyed brands, ensuring seamless integration during training runs. - Revised signal handling to improve the hot module replacement (HMR) experience and maintain clean resource management. - Updated documentation to reflect these changes and clarify the internal management processes for developers.

- Replaced esbuild with Rolldown for building the project, ensuring external dependencies are resolved correctly. - Implemented a cleanup hook system to manage resource disposal on process termination signals. - Enhanced the development experience with hot module replacement (HMR) capabilities, allowing for seamless updates during training runs. - Updated documentation to reflect changes in the development loop and HMR behavior. - Added tests for new signal handling and configuration hashing functionality.

mintlify · 2026-05-02T17:02:34Z

Preview deployment for your docs. Learn more about Mintlify Previews.

Project	Status	Preview	Updated (UTC)
arkor-92aeef0e	🟢 Ready	View Preview	May 2, 2026, 5:03 PM

💡 Tip: Enable Workflows to automatically generate PRs for you.

codecov · 2026-05-02T17:03:34Z

Codecov Report

❌ Patch coverage is 76.96203% with 182 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
packages/studio-app/src/components/RunTraining.tsx	30.88%	81 Missing and 13 partials ⚠️
packages/arkor/src/studio/server.ts	85.21%	11 Missing and 10 partials ⚠️
packages/arkor/src/studio/hmr.ts	77.01%	10 Missing and 10 partials ⚠️
packages/arkor/src/core/trainer.ts	87.00%	2 Missing and 11 partials ⚠️
packages/arkor/src/studio/trainRegistry.ts	87.03%	3 Missing and 4 partials ⚠️
packages/studio-app/src/lib/api.ts	79.41%	1 Missing and 6 partials ⚠️
packages/arkor/src/core/runnerSignals.ts	88.23%	2 Missing and 4 partials ⚠️
packages/arkor/src/cli/cleanupHooks.ts	93.18%	1 Missing and 2 partials ⚠️
packages/arkor/src/core/rolldownConfig.ts	82.35%	0 Missing and 3 partials ⚠️
packages/arkor/src/core/trainerInspection.ts	93.47%	1 Missing and 2 partials ⚠️
... and 4 more

📢 Thoughts on this report? Let us know!

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8abc594369

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-02T17:07:52Z

+            const hotSwapTargets = activeTrains.notifyCallbackReload(nextHash);
+            const restartTargets =
+              activeTrains.requestEarlyStopOnMismatch(nextHash);


Dispatch train signals only once per rebuild

notifyCallbackReload/requestEarlyStopOnMismatch run inside each /api/dev/events subscriber callback, so one rebuild sends signals once per connected SSE client (and again on reconnect when createHmrCoordinator.subscribe replays lastEvent). In the common case of two open Studio tabs, the same child can receive two SIGTERMs for one config-changing rebuild; the second signal triggers the runner’s forced-exit path (exit(143)) instead of checkpoint-preserving early stop. Move signal dispatch out of per-subscriber delivery (or dedupe per rebuild hash) so each child is signaled at most once per rebuild event.

Useful? React with 👍 / 👎.

Copilot

Pull request overview

This PR upgrades the local Studio dev experience by migrating the trainer bundling pipeline from esbuild to Rolldown, adding an HMR/SSE channel to propagate rebuild events to the Studio SPA, and introducing signal-based “hot-swap callbacks vs early-stop + restart” behavior for in-flight training subprocesses.

Changes:

Replace esbuild-based trainer bundling with Rolldown (one-shot build + watch/HMR).
Add /api/dev/events SSE stream + SPA client wiring to refresh manifest and coordinate restarts/hot-swaps.
Introduce internal trainer inspection/callback replacement/early-stop hooks (via Symbol.for brands), plus cleanup utilities and updated docs/tests.

Reviewed changes

Copilot reviewed 27 out of 28 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
pnpm-lock.yaml	Removes esbuild and adds rolldown to the lockfile graph.
packages/arkor/package.json	Swaps dependency from `esbuild` to `rolldown`.
packages/arkor/src/core/rolldownConfig.ts	Centralizes Rolldown input options + entry/outDir resolution and Node target derivation.
packages/arkor/src/cli/commands/build.ts	Implements `arkor build` via Rolldown bundling instead of esbuild.
packages/arkor/src/studio/hmr.ts	Adds lazy Rolldown watcher coordinator that emits ready/rebuild/error events.
packages/arkor/src/studio/hmr.test.ts	Tests HMR coordinator behavior (ready/rebuild/error/replay/dispose).
packages/arkor/src/studio/manifest.ts	Adds `configHash` to manifest summary and splits “summarise built manifest” vs “build + summarise”.
packages/arkor/src/studio/trainRegistry.ts	Adds per-child registry + policy for SIGUSR2 hot-swap vs SIGTERM early-stop.
packages/arkor/src/studio/trainRegistry.test.ts	Tests TrainRegistry signaling decisions and error tolerance.
packages/arkor/src/studio/server.ts	Adds `/api/dev/events` SSE route and integrates TrainRegistry into rebuild handling.
packages/arkor/src/studio/server.test.ts	Adds tests for `/api/dev/events` token rules, loopback guard, and SSE framing.
packages/arkor/src/core/configHash.ts	Adds stable hashing of `JobConfig` for “callbacks-only vs restart” decisions.
packages/arkor/src/core/configHash.test.ts	Tests determinism and sensitivity of `hashJobConfig`.
packages/arkor/src/core/trainerInspection.ts	Introduces internal inspection + callback replacement + early-stop brands via `Symbol.for`.
packages/arkor/src/core/trainer.ts	Implements callback hot-swap support + early-stop latch; attaches internal brands.
packages/arkor/src/core/trainer.test.ts	Adds coverage for early-stop and callback hot-swap behavior via internal brands.
packages/arkor/src/core/runnerSignals.ts	Adds signal handlers: SIGTERM graceful early-stop + SIGUSR2 callback reload.
packages/arkor/src/core/runnerSignals.test.ts	Tests SIGTERM/SIGUSR2 handler behaviors with branded trainers.
packages/arkor/src/core/runner.ts	Installs/removes new signal handlers around `trainer.start()`/`trainer.wait()`.
packages/arkor/src/core/runner.test.ts	Adds a test for SIGTERM early-stop behavior in the runner.
packages/arkor/src/cli/cleanupHooks.ts	Adds reusable cleanup-hook registration for exit/signals.
packages/arkor/src/cli/commands/dev.ts	Wires HMR coordinator into Studio server and refactors shutdown cleanup via cleanup hooks.
packages/arkor/src/cli/commands/start.test.ts	Updates test comment to refer to Rolldown instead of esbuild.
packages/studio-app/src/lib/api.ts	Adds typed `DevEvent` + `openDevEvents()` for SSE notifications.
packages/studio-app/src/components/RunTraining.tsx	SPA listens to dev SSE events, refreshes manifest, and coordinates restart/hot-swap UI state.
docs/concepts/studio.mdx	Documents HMR workflow, including hash-based hot-swap vs restart behavior (EN).
docs/ja/concepts/studio.mdx	Documents HMR workflow (JA).
AGENTS.md	Updates repository policy/docs for new SSE allow-list, HMR internals, and Rolldown build pipeline.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  const [log, setLog] = useState("");
  const [manifest, setManifest] = useState<ManifestResult | null>(null);
+  const [hmrStatus, setHmrStatus] = useState<
+    "idle" | "rebuilding" | "early-stopping" | "restarting" | "hot-swapped"


 import { ensureProjectState } from "../core/projectState";
 import { readState } from "../core/state";
-import { readManifestSummary } from "./manifest";
+import { readManifestSummary, summariseBuiltManifest } from "./manifest";


+function scheduleHmrCleanup(hmr: { dispose: () => Promise<void> }): void {
+  // Registered before the studio-token cleanup so it runs first on
+  // shutdown — Node fires signal handlers in registration order, and we
+  // want the watcher to release file handles before the outermost
+  // process.exit.
+  registerCleanupHook({ cleanup: () => hmr.dispose() });
 }


+  requestEarlyStopOnMismatch(
+    nextConfigHash: string | null,
+  ): RestartTarget[] {
+    const targets: RestartTarget[] = [];
+    for (const [pid, entry] of this.entries) {
+      if (
+        nextConfigHash === null ||
+        entry.configHash === null ||
+        entry.configHash !== nextConfigHash
+      ) {
+        try {
+          entry.child.kill("SIGTERM");
+        } catch {
+          // child already exited; close handler will clean up.
+        }
+        targets.push({ pid, trainFile: entry.trainFile });
+      }


        await callbacks.onCheckpoint?.(ctx);
+        // Early-stop latch: a checkpoint just landed, so the in-flight work
+        // is durable. Cancel the cloud job and end `wait()` cleanly.
+        if (earlyStopRequested && earlyStopDeferred) {
+          await trainer.cancel();
+          if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer);
+          earlyStopDeferred.resolve();
+          earlyStopDeferred = null;
+          return { terminal: true, artifacts: terminalResult?.artifacts ?? [] };
+        }


+  function startWatcher(): void {
+    if (watcher || disposed) return;
+    if (!existsSync(resolved.entry)) {
+      broadcast({
+        type: "error",
+        message: `Build entry not found: ${resolved.entry}. Create ${BUILD_DEFAULTS.entry} or pass an explicit entry argument.`,
+      });
+      return;
+    }


…ous BUNDLE_END on macOS - Modified the test to assert that the late subscriber receives the same event as the prior subscriber, addressing issues with spurious BUNDLE_END events in rolldown@1.0.0-rc.17 on macOS. - Added comments to clarify the reasoning behind the change and noted a future task to revisit the assertion after rolldown stabilizes.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 31f510869c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-03T10:04:21Z

+        restartPendingRef.current = true;
+        setHmrStatus(runningRef.current ? "early-stopping" : "idle");


Ignore restart hints when no local run is active

restartPendingRef is set for every payload.restart event even when this tab is not running a training stream. Since /api/dev/events broadcasts restart events to all connected tabs, a passive tab can cache restartPendingRef=true from another tab’s run; the next time the user runs training in this tab, finally will auto-trigger an extra run, creating an unintended duplicate job and extra cloud/GPU spend. Only mark restart pending when runningRef.current is true (or otherwise correlate the restart target to the local run).

Useful? React with 👍 / 👎.

Copilot

Pull request overview

Copilot reviewed 27 out of 28 changed files in this pull request and generated 4 comments.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        } catch {
+          // child already exited; close handler will clean up.
+        }
+        targets.push({ pid, trainFile: entry.trainFile });


 import { ensureProjectState } from "../core/projectState";
 import { readState } from "../core/state";
-import { readManifestSummary } from "./manifest";
+import { readManifestSummary, summariseBuiltManifest } from "./manifest";


  const [log, setLog] = useState("");
  const [manifest, setManifest] = useState<ManifestResult | null>(null);
+  const [hmrStatus, setHmrStatus] = useState<
+    "idle" | "rebuilding" | "early-stopping" | "restarting" | "hot-swapped"


+        // Early-stop latch: a checkpoint just landed, so the in-flight work
+        // is durable. Cancel the cloud job and end `wait()` cleanly.
+        if (earlyStopRequested && earlyStopDeferred) {
+          await trainer.cancel();
+          if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer);
+          earlyStopDeferred.resolve();
+          earlyStopDeferred = null;
+          return { terminal: true, artifacts: terminalResult?.artifacts ?? [] };
+        }
        return { terminal: false, artifacts: terminalResult?.artifacts ?? [] };


…ce failures - Added a test to verify that the SIGINT exit handler is armed even if the `persistStudioToken` function fails, preventing the dev server from idling indefinitely. - Updated the `runDev` function to register the studio-token cleanup hook unconditionally, ensuring it executes on process termination signals regardless of token persistence success. - Enhanced cleanup logic to maintain proper signal handling during development sessions, improving overall stability and user experience.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 55b4a2e8f6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-03T14:06:27Z

+      if (payload.restart) {
+        // `/api/dev/events` is a broadcast — every open Studio tab gets
+        // this event. Only flip the auto-restart latch when *this* tab
+        // is actually running a stream right now; otherwise a passive
+        // tab would silently auto-spawn an extra job the next time the
+        // user clicks Run training, doubling cloud spend.
+        if (runningRef.current) {
+          // Training run is early-stopping; the active stream will


Scope restart latch to runs actually signaled for restart

The handler sets restartPendingRef for any payload.restart while this tab is running, but rebuild events are broadcast across all tabs and can contain mixed outcomes (restartTargets and hotSwapTargets) for different subprocesses. In a multi-run setup, a tab whose run was only hot-swapped can still latch restartPendingRef=true, then auto-spawn an extra training job when its current run eventually finishes, increasing unintended cloud/GPU usage.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-03T14:06:27Z

+          entry.child.kill("SIGTERM");
+          entry.earlyStopRequested = true;


Check signal delivery before marking restart targets

ChildProcess.kill() returns a boolean and does not throw when a process is already gone, but this path treats any non-throw as success. If kill("SIGTERM") returns false during an exit race, the entry is still marked earlyStopRequested and added to targets, so the UI may wait for/restart a run that was never actually signaled.

Useful? React with 👍 / 👎.

Copilot

Pull request overview

Copilot reviewed 28 out of 29 changed files in this pull request and generated 4 comments.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+  useEffect(() => {
+    const es = openDevEvents();
+    const onMessage = (raw: MessageEvent) => {


+        setHmrStatus("hot-swapped");
+        window.setTimeout(() => {
+          setHmrStatus((s) => (s === "hot-swapped" ? "idle" : s));
+        }, 1500);
+      } else {


      },
      cancel() {
+        activeTrains.unregister(child.pid);
        child.kill();


+  // Synchronous wrapper so signal handlers preserve "cleanup landed
+  // before this function returns" — important for sync cleanups (e.g.
+  // `unlinkSync`) and for tests that assert the side effect right after
+  // invoking the handler. Async cleanups are fire-and-forget with a
+  // catch so a hung dispose doesn't block exit.
+  const run = (): void => {
+    if (done) return;
+    done = true;
+    try {
+      const result = options.cleanup();
+      if (result && typeof (result as Promise<void>).catch === "function") {
+        (result as Promise<void>).catch(() => {
+          // best-effort: shutdown is racing other cleanup paths
+        });
+      }
+    } catch {
+      // best-effort
+    }
+  };
+
+  process.on("exit", run);
+
+  for (const sig of TERMINATING_SIGNALS) {
+    process.on(sig, () => {
+      run();
+      if (options.exitOnSignal) process.exit(0);


This commit modifies the exit codes returned by the cleanup hooks to align with POSIX standards, ensuring that the process exits with the correct code based on the signal received (e.g., SIGINT results in exit code 130). This change allows parent shells and orchestrators to accurately distinguish between user interruptions and normal completions. Additionally, new tests are added to verify the correct exit codes for various signals, enhancing the robustness of the cleanup mechanism.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8ed9eb304a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-09T13:05:54Z

+      trainer
+        .cancel()
+        .catch(() => {})
+        .finally(() => {


Reject timeout early-stop when cancel POST fails

When the early-stop timeout path runs, trainer.cancel() errors are swallowed (.catch(() => {})) and the deferred is always resolved, so the SIGTERM handler treats shutdown as successful even if the cloud cancel request failed. This can orphan a still-running cloud job (continued GPU spend) in the exact case where no checkpoint arrived before timeoutMs and the cancel call transiently fails, because the local runner exits while the remote run continues.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-09T13:05:54Z

+      trainer
+        .cancel()
+        .catch(() => {})
+        .finally(() => {


Reject timed-out early-stop when cancel request fails

In the timeout fallback, trainer.cancel() errors are swallowed and the deferred is always resolved, so SIGTERM shutdown is treated as successful even when the cloud cancel call fails. This can leave the remote training job running while the local runner exits and the Studio proceeds as if the stop succeeded, which is exactly the high-cost failure path this timeout branch is supposed to recover.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-09T13:05:54Z

+      .catch((err: unknown) => {
+        const msg = err instanceof Error ? err.message : String(err);
+        process.stderr.write(`requestEarlyStop failed: ${msg}\n`);
+      })
+      .finally(() => process.exit(0));


Avoid exiting success after early-stop cancellation failure

The SIGTERM handler unconditionally exits with code 0 in finally, even when requestTrainerEarlyStop() rejects. In HMR restart flows, a cancel failure can therefore still produce a “clean” child exit, causing the SPA to restart and potentially overlap a new run with a cloud job that never actually canceled.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-09T13:05:54Z

+      trainer
+        .cancel()
+        .catch(() => {})
+        .finally(() => {


Reject timed-out early-stop when cancel request fails

The timeout fallback swallows trainer.cancel() failures and always resolves the early-stop promise, so SIGTERM shutdown is reported as successful even when the cloud cancel request failed. In the no-checkpoint path this can orphan a live remote training job while the local runner exits and the Studio proceeds as if cancellation completed.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-09T13:05:54Z

+      .catch((err: unknown) => {
+        const msg = err instanceof Error ? err.message : String(err);
+        process.stderr.write(`requestEarlyStop failed: ${msg}\n`);
+      })
+      .finally(() => process.exit(0));


Return non-success exit when early-stop cancellation fails

installShutdownHandlers always exits with code 0 in finally, even after requestTrainerEarlyStop() rejects. If early-stop/cancel fails, the child still appears to have exited cleanly, which can trigger restart flows and supervisory scripts to treat the run as safely stopped even though the cloud job may still be running.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-09T13:05:54Z

+  if (value === undefined || typeof value === "function" || typeof value === "symbol") {
+    return undefined;
+  }
+  if (typeof value !== "object") return JSON.stringify(value);


Normalize boxed primitives in config hash serialization

stableStringify treats boxed primitives (for example new Number(1)) as plain objects, producing {} instead of JSON’s scalar representation (1). Because JobConfig includes several unknown fields, this can generate a configHash that diverges from the actual JSON payload semantics and spuriously route HMR decisions (restart vs hot-swap) for otherwise equivalent configs.

Useful? React with 👍 / 👎.

Copilot

Pull request overview

Copilot reviewed 35 out of 36 changed files in this pull request and generated 3 comments.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

+      const active = earlyStopDeferred;
+      trainer
+        .cancel()
+        .catch(() => {})
+        .finally(() => {


        await callbacks.onCheckpoint?.(ctx);
+        // Early-stop latch: a checkpoint just landed, so the in-flight work
+        // is durable. Cancel the cloud job and end `wait()` cleanly.
+        if (earlyStopRequested && earlyStopDeferred) {
+          // Capture the cancel error (if any) but DON'T swallow


+   * Whether the signal-handler arm of this registration should call
+   * `process.exit(0)` once every in-flight cleanup (this hook + any
+   * siblings registered in the same process) has settled. Use `true`
+   * for the outermost cleanup responsible for terminating the
+   * process; `false` for inner cleanups that should let a sibling
+   * own the exit. Default: `false`.


…nt SIGTERM hang This commit updates the early-stop mechanism in the trainer to ensure that the early-stop branch executes even if the user's `onCheckpoint` callback throws an error. By capturing the error and re-throwing it after the early-stop logic, the implementation prevents the SIGTERM handler from hanging due to unresolved promises. Additionally, the cancellation process is modified to reject the deferred promise on cancel errors, improving visibility into cancellation failures. Tests are added to verify this behavior and prevent future regressions.

Copilot

Pull request overview

Copilot reviewed 35 out of 36 changed files in this pull request and generated no new comments.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 48a926324c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-09T16:11:58Z

+      const manifest = await readManifestSummary(trainCwd, {
+        prebuiltOutFile: hmrOutFile,
+      });


Preserve HMR build-error state in /api/manifest

When HMR is enabled, this always serves readManifestSummary(..., { prebuiltOutFile }) from the last built artifact if it exists, even after the watcher has emitted an error event for current source edits. In that state the SPA’s periodic manifest poll (RunTraining polls every 5s) receives a 200 stale manifest and overwrites the build-error UI, so users can be shown an apparently healthy trainer while the latest code is still failing to compile. This can lead to launching training from stale code/config without an active error indication.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-09T16:11:58Z

+      const manifest = await readManifestSummary(trainCwd, {
+        prebuiltOutFile: hmrOutFile,
+      });


Keep manifest in error state after HMR compile failures

This always serves the last built artifact when HMR is enabled, even if the watcher’s latest event is a compile error. After a source edit breaks the build, RunTraining briefly shows the error from /api/dev/events, but the next 5s manifest poll gets a 200 stale manifest and clears the error UI, so users can unknowingly run stale code/config while current sources are still broken.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-09T16:11:59Z

+      if (payload.type === "error") {
+        setManifest({ error: payload.message ?? "Build failed" });
+        setHmrStatus("idle");
+        return;


Clear pending auto-restart when a newer HMR error lands

When a rebuild has already queued restartPendingRef for this run, a subsequent HMR error event returns early without clearing that latch. If the training process then exits, finally still auto-restarts the run from the previous artifact even though the latest edit failed to compile, causing stale-code restarts and unexpected background training churn until the user notices.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-09T16:11:59Z

+      const manifest = await readManifestSummary(trainCwd, {
+        prebuiltOutFile: hmrOutFile,
+      });


Keep manifest in error state after HMR compile failures

When HMR is enabled, /api/manifest always serves the existing prebuilt artifact if it exists, even if the watcher’s latest state is a compile error. In that case RunTraining briefly shows the SSE error but the next manifest poll (every 5s) receives stale 200 data and clears the error UI, so users can unknowingly run outdated code/config while current sources are still broken.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-09T16:11:59Z

+      if (payload.type === "error") {
+        setManifest({ error: payload.message ?? "Build failed" });
+        setHmrStatus("idle");
+        return;


Clear queued auto-restart when a newer HMR error arrives

If a rebuild already latched restartPendingRef for this run, a later HMR error event returns early without clearing that pending restart. When the current process exits, finally still auto-starts a new run from the previous artifact even though the latest edit failed to compile, leading to stale-code restarts and unexpected background training churn.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-09T16:11:59Z

+          entry.spawnArtifactHash !== null &&
+          nextArtifactHash !== null &&
+          entry.spawnArtifactHash === nextArtifactHash;


Compare pre-ready artifacts by content, not file timestamps

This pre-ready equality gate depends on spawnArtifactHash/nextArtifactHash, which are derived from mtimeMs-ctimeMs-size in hmr.ts rather than file content. A watcher build that rewrites the same bytes still changes timestamps, so a run started before first ready can be forced into unnecessary SIGTERM restart despite identical artifact/config, creating avoidable cancel+restart churn in normal startup timing windows.

Useful? React with 👍 / 👎.

Copilot

Pull request overview

Copilot reviewed 35 out of 36 changed files in this pull request and generated no new comments.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

Copilot

Pull request overview

Copilot reviewed 35 out of 36 changed files in this pull request and generated 3 comments.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

+          startedJob = {
+            ...startedJob,
+            status: "cancelled",
+            completedAt: event.timestamp,
+          };


+          earlyStopRequested = false;
+          if (startedJob && !TERMINAL_STATUSES.has(startedJob.status)) {
+            startedJob = {
+              ...startedJob,
+              status: "cancelled",
+              completedAt: new Date().toISOString(),
+            };
+          }


+    if (signalCount > 1) {
+      process.stdout.write(
+        `Received second ${signal}; exiting without waiting for checkpoint.\n`,
+      );
+      process.exit(143);


…e HMR error state management

Copilot

Pull request overview

Copilot reviewed 35 out of 36 changed files in this pull request and generated 5 comments.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

+export function installShutdownHandlers(trainer: Trainer): () => void {
+  let signalCount = 0;
+  const handler = (signal: NodeJS.Signals): void => {
+    signalCount += 1;
+    if (signalCount > 1) {
+      process.stdout.write(
+        `Received second ${signal}; exiting without waiting for checkpoint.\n`,


+    const dispose = installShutdownHandlers(trainer);
+    try {
+      process.emit("SIGTERM", "SIGTERM");
+      await new Promise((r) => setTimeout(r, 10));
+      expect(trainer.__earlyStop.calls).toBe(1);


+  dispatchRebuild(
+    nextConfigHash: string | null,
+    // Defaults to `null` so tests / pre-existing callers that don't
+    // pass the artefact hash get the conservative behaviour: the
+    // pre-ready-spawn branch's `artefactsAgree` check is `false`,
+    // so a null entry hash falls through to SIGTERM-restart. Real
+    // dispatch from `/api/train`'s HMR subscriber threads
+    // `event.hash` here so the backfill optimisation activates.
+    nextArtifactHash: string | null = null,


+      process.emit("SIGTERM", "SIGTERM");
+      await new Promise((r) => setTimeout(r, 25));
+      expect(probe.earlyStopCalls).toBe(1);
+      expect(exitCalls).toContain(0);
+
+      // 2nd SIGTERM (still in-flight, listeners not yet removed) →
+      // exit(143) immediately, no second requestEarlyStop call.
+      process.emit("SIGTERM", "SIGTERM");


+  /**
+   * Fingerprint (mtime+ctime+size, see `core/moduleCacheBust.ts`) of
+   * the on-disk `.arkor/build/index.mjs` at spawn time. Used **only**
+   * to gate the pre-ready-spawn backfill: if a rebuild eventually
+   * fires while `configHash` is still null and this fingerprint
+   * matches the rebuild's artefact, the child is provably reading
+   * the same bundle bytes the new hash describes — safe to backfill
+   * `configHash` and skip dispatch. A mismatch (or null here) means
+   * the on-disk artefact has changed between spawn and rebuild
+   * (user edited mid-spawn, fresh project never built, …) so the
+   * child is running stale bytes and we MUST SIGTERM-restart to
+   * keep cloud-side `JobConfig` aligned with what the child
+   * actually loaded. Null when HMR isn't enabled or stat failed.
+   */
+  spawnArtifactHash: string | null;


… and consistency in HMR handling

Copilot

Pull request overview

Copilot reviewed 35 out of 36 changed files in this pull request and generated no new comments.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f65561b89a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-11T16:00:23Z

+        if (jobIdForCancel) {
+          void (async () => {


Cancel cloud run even if job id marker is missing

This branch skips the cloud jobs/:id/cancel call whenever jobIdForCancel is still null, then immediately sends SIGKILL. That creates a real orphaning window: if the user clicks Stop training after the cloud job has been created but before the runner’s Started job <id> line is parsed, the local child is killed and no remote cancel is attempted, so GPU work can continue invisibly until server-side TTL/reaper cleanup. The new SIGKILL-based stop path makes this race materially more likely than the previous graceful cancel flow.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-11T16:00:23Z

+        if (jobIdForCancel) {
+          void (async () => {


Cancel cloud run even if job id marker is missing

This branch skips the cloud jobs/:id/cancel call whenever jobIdForCancel is still null, then immediately sends SIGKILL. If the user clicks Stop training after the cloud job is created but before the runner’s Started job <id> line is parsed, the local child is killed and no remote cancel is attempted, so the cloud job can continue running (and spending GPU) until server-side reaper/TTL cleanup.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-11T16:00:23Z

+              const m = STARTED_JOB_PATTERN.exec(line);
+              if (m && m[1]) {
+                activeTrains.recordJobId(child.pid, m[1]);


Use an unforgeable marker for started-job id capture

The parser accepts the first stdout line matching Started job <token> from the entire child process output, but user code can emit that exact line (for example via module-scope console.log) before the runner prints its canonical marker. In that case the registry records a bogus id, later manual stop POSTs cancel for the wrong job (or no job), and the actual cloud run may be left alive. Anchoring helps substring noise, but it still does not distinguish runner control output from user logs.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-11T16:00:24Z

+        restartPendingRef.current = false;
+        return;


Clear pre-spawn restart buffer when HMR error arrives

payload.type === "error" clears restartPendingRef, but it leaves pendingPreSpawnEventsRef intact. A restart event buffered during the pre-spawn window (runningRef=true, currentPidRef=null) can therefore survive a newer compile error and be applied in onSpawn, re-arming auto-restart from stale code. Fresh evidence vs earlier restart-latch reports: this path bypasses restartPendingRef entirely because the restart decision is deferred in pendingPreSpawnEventsRef until pid assignment.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-11T16:00:24Z

+              const m = STARTED_JOB_PATTERN.exec(line);
+              if (m && m[1]) {
+                activeTrains.recordJobId(child.pid, m[1]);


Use an unforgeable marker for started-job id capture

The parser accepts the first stdout line matching Started job <token> from the entire child process output, but user code can emit that exact line (for example via module-scope console.log) before the runner prints its canonical marker. In that case the registry records a bogus id, later manual stop POSTs cancel for the wrong job (or no job), and the actual cloud run may be left alive.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-11T16:00:24Z

+        restartPendingRef.current = false;
+        return;


Clear pre-spawn restart buffer when HMR error arrives

payload.type === "error" clears restartPendingRef, but it leaves pendingPreSpawnEventsRef untouched. A restart event buffered during the pre-spawn window can survive a newer compile error and then be applied in onSpawn, re-arming auto-restart from stale code. Fresh evidence versus earlier restart-latch reports: this path bypasses restartPendingRef because the restart decision is deferred in pendingPreSpawnEventsRef until pid assignment.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-11T16:00:24Z

+  const tokenPath = studioTokenPath();
+  scheduleStudioTokenCleanup(tokenPath);


Avoid deleting shared studio token when persist failed

The cleanup hook is now registered before token persistence and runs unconditionally, so a process that failed to write its own token can still unlink ~/.arkor/studio-token on exit. That can delete a token written by another active arkor dev session, causing the other session’s Vite-backed Studio pages to start getting 403s on /api/* despite that server still running normally.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-11T16:00:24Z

+              const state = await readState(trainCwd);
+              if (!state) return; // no scope, can't address the job
+              const rpc = createRpc();
+              await rpc.v1.jobs[":id"].cancel.$post({


Do not require state file to cancel known running job

Manual stop now relies on re-reading .arkor state to build the cancel query, and silently skips the cancel POST when readState(trainCwd) returns null. If project state is removed or unreadable while training is in flight, the code still SIGKILLs the local child but never sends cloud cancel, which can orphan the remote run and continue spend. The runner already had in-memory scope for this job, so cancel should not depend on mutable filesystem state at stop time.

Useful? React with 👍 / 👎.

…dev session

Copilot

Pull request overview

Copilot reviewed 35 out of 36 changed files in this pull request and generated 8 comments.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

+export function findInspectableTrainer(
+  mod: Record<string, unknown>,
+): TrainerInspection | null {
+  const trainer = findTrainerInModule(mod);
+  return trainer ? getTrainerInspection(trainer) : null;
+}


+          startedJob = {
+            ...startedJob,
+            status: cancelError !== null ? "failed" : "cancelled",
+            ...(cancelError !== null && {
+              error: `Early-stop cancel failed: ${
+                cancelError instanceof Error
+                  ? cancelError.message
+                  : String(cancelError)
+              }`,
+            }),
+            completedAt: event.timestamp,
+          };


+/**
+ * Per-spawn nonce that `/api/train` injects via env so the server can
+ * recognise the runner's `Started job <id>` line without it being
+ * forgeable from user code. Captured at module load (i.e. BEFORE
+ * `runTrainer` does its `await import(userEntry)`) and the env var
+ * is deleted right after so the dynamically-imported user module
+ * cannot read it via `process.env`. If a user callback then writes
+ * `Started job <token>` to stdout, the line won't carry the nonce
+ * prefix and the server's anchored regex will reject it: no
+ * spoofed cloud `cancel()` POST against an attacker-chosen job id.
+ *
+ * Null when the runner was launched directly (e.g. `arkor start` from
+ * a shell), in which case the runner falls back to the plain
+ * `Started job <id>` form for backwards compatibility. The server only
+ * uses the nonce-prefixed form because every server spawn sets the
+ * env var.
+ */
+const STARTED_JOB_NONCE: string | null =
+  process.env.ARKOR_JOB_ID_MARKER_NONCE ?? null;
+delete process.env.ARKOR_JOB_ID_MARKER_NONCE;


  const url = `http://localhost:${port}`;
  serve({ fetch: app.fetch, port, hostname: "127.0.0.1" });
  process.stdout.write(`Arkor Studio running on ${url}\n`);
+  process.stdout.write(`HMR enabled (watching src/arkor)\n`);


@@ -199,16 +220,54 @@ export async function runDev(options: DevOptions = {}): Promise<void> {
  // hitting `arkor start` (and therefore RCE via dynamic import).
  const studioToken = randomBytes(32).toString("base64url");

+  // HMR coordinator: a long-lived rolldown watcher over the user's
+  // `src/arkor` graph. The coordinator itself is lazy (`subscribe()`
+  // is what starts the watcher, not `createHmrCoordinator`), but
+  // `buildStudioApp` registers its per-rebuild signal-dispatch
+  // subscriber unconditionally — that subscriber needs to run on
+  // every BUNDLE_END regardless of whether any SSE client is
+  // connected, so it can SIGUSR2/SIGTERM active `/api/train`
+  // children and keep `lastSuccessConfigHash` warm for spawn-time
+  // capture. Net effect: the watcher starts at server boot. An
+  // `arkor dev` launched in an unbuilt project doesn't fail immediately
+  // because `startWatcher` falls through to a poll loop that waits
+  // for the entry file to appear (see `hmr.ts:entryWaitTimer`).
+  //
+  // Registered before the studio-token cleanup so the latter remains
+  // the most-recently-attached signal listener (existing tests rely
+  // on this ordering to find the token-removal handler).
+  const hmr = createHmrCoordinator({ cwd: process.cwd() });
+  scheduleHmrCleanup(hmr);


+  // `arkor start` at boot.
+  try {
+    process.on(CALLBACK_RELOAD_SIGNAL, handler);
+  } catch {
+    return () => {
+      // no-op: handler was never attached
+    };
+  }
+  return () => {


+    async dispose() {
+      disposed = true;
+      subscribers.clear();
+      if (entryWaitTimer) {
+        clearInterval(entryWaitTimer);
+        entryWaitTimer = null;
+      }
+      if (watcher) {
+        const w = watcher;
+        watcher = null;
+        await w.close().catch(() => {});
+      }
+    },
+  };


+      } else {
+        // Nothing pertaining to this tab's child — leave any in-
+        // progress status spans alone but make sure stale "early-
+        // stopping" / "restarting" labels from a prior run don't
+        // linger past the next quiet rebuild.
+        if (!runningRef.current) setHmrStatus("idle");
+      }


- Updated comments in trainRegistry.test.ts to enhance readability and understanding of the test cases. - Clarified comments in trainRegistry.ts regarding the behavior of signals and process management. - Enhanced documentation in templates.ts to ensure consistency in phrasing and clarity. - Improved comments in RunTraining.tsx to better explain the logic and flow of the component. - Refined comments in api.test.ts and api.ts to provide clearer context and reasoning for the code. - Adjusted comments in baseModels.test.ts to clarify the case-sensitivity of model IDs.

Copilot

Pull request overview

Copilot reviewed 40 out of 41 changed files in this pull request and generated 8 comments.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

 The CLI/Studio look at `src/arkor/index.ts` in user projects. Discovery in [packages/arkor/src/core/runner.ts](packages/arkor/src/core/runner.ts) accepts (in order): a named `arkor` export from `createArkor({...})`, a bare `trainer` export, a default export holding either an Arkor manifest or a Trainer, or a `default.trainer` nested shape. `createArkor` returns a frozen, opaque manifest tagged with `_kind: "arkor"`; treat it as a value to hand to tooling, not a programmable client.

-`arkor build` ([packages/arkor/src/cli/commands/build.ts](packages/arkor/src/cli/commands/build.ts)) bundles to `.arkor/build/index.mjs` with esbuild; bare specifiers (e.g. `arkor`, anything in `node_modules`) stay external so the artifact resolves the runtime SDK from the project's installed copy.
+`arkor build` ([packages/arkor/src/cli/commands/build.ts](packages/arkor/src/cli/commands/build.ts)) bundles to `.arkor/build/index.mjs` with [Rolldown](https://rolldown.rs); bare specifiers (e.g. `arkor`, anything in `node_modules`) stay external so the artifact resolves the runtime SDK from the project's installed copy. The `transform.target` is derived from `process.versions.node` at build time so the bundle targets the same Node binary that will execute it.


+  if (opts.prebuiltOutFile && existsSync(opts.prebuiltOutFile)) {
+    return summariseBuiltManifest(opts.prebuiltOutFile);
+  }


+  const candidates: unknown[] = [];
+  // 1: createArkor named export
+  if (isArkor(mod.arkor) && (mod.arkor as Arkor).trainer) {
+    candidates.push((mod.arkor as Arkor).trainer);
+  }
+  // 2: bare `trainer` named export
+  if (mod.trainer) candidates.push(mod.trainer);
+  // 3: default-export holding an Arkor manifest
+  if (isArkor(mod.default) && (mod.default as Arkor).trainer) {
+    candidates.push((mod.default as Arkor).trainer);
+  }
+  // 4: default IS the Trainer itself. The `isTrainerLike` filter
+  // below sorts this out from cases 3/5 (an Arkor manifest doesn't
+  // have `start`/`wait`/`cancel`, nor does a plain `{ trainer }`
+  // wrapper), so pushing `mod.default` unconditionally is safe.
+  if (mod.default !== undefined) candidates.push(mod.default);
+  // 5: default.trainer nested
+  if (
+    mod.default &&
+    typeof mod.default === "object" &&
+    "trainer" in (mod.default as Record<string, unknown>)
+  ) {
+    candidates.push((mod.default as Record<string, unknown>).trainer);
+  }
+  for (const c of candidates) {
+    if (isTrainerLike(c)) return c;
+  }


+const SECOND_SIGNAL_EXIT_CODE: Record<
+  (typeof SHUTDOWN_SIGNALS)[number],
+  number
+> = {
+  SIGHUP: 129,
+  SIGINT: 130,
+  SIGTERM: 143,
+};


+  if (!res.ok) {
+    let detail = "";
+    try {
+      detail = (await res.text()).trim();
+    } catch {
+      // Body unreadable (already consumed, network gone, etc.):
+      // surface the status alone rather than masking the failure
+      // entirely.
+    }
+    throw new Error(
+      detail
+        ? `/api/train failed (${res.status} ${res.statusText}): ${detail}`
+        : `/api/train failed (${res.status} ${res.statusText})`,
+    );


+export function openDevEvents(): EventSource {
+  return new EventSource(withStudioToken("/api/dev/events"));
+}


+    requestTrainerEarlyStop(trainer)
+      .catch((err: unknown) => {
+        const msg = err instanceof Error ? err.message : String(err);
+        process.stderr.write(`requestEarlyStop failed: ${msg}\n`);
+      })
+      .finally(() => process.exit(0));


+    const { jobId } = await trainer.start();
+    const startedJobPrefix = STARTED_JOB_NONCE
+      ? `[arkor:${STARTED_JOB_NONCE}] `
+      : "";
+    process.stdout.write(`${startedJobPrefix}Started job ${jobId}\n`);
+    const result = await trainer.wait();
+    process.stdout.write(
+      `Job ${result.job.id} finished with status=${result.job.status}\n`,
+    );


…ror state management

Copilot

Pull request overview

Copilot reviewed 41 out of 42 changed files in this pull request and generated 2 comments.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

+        let onCheckpointError: unknown = null;
+        try {
+          await callbacks.onCheckpoint?.(ctx);
+        } catch (err) {
+          onCheckpointError = err;
+        }
+        // Early-stop latch: a checkpoint just landed, so the in-flight work


+ * Walk a freshly-imported user bundle in the same precedence order
+ * as `runner.ts`'s `extractTrainer` and return the first
+ * trainer-shaped value (anything that has `start`/`wait`/`cancel`
+ * functions). Doesn't require the SDK inspection brand: the
+ * manifest UI displays the trainer's `name` for hand-rolled trainers
+ * too, even when HMR can't compute a `configHash` for them.
+ *
+ * The five supported shapes (mirroring `runner.ts`'s `extractTrainer`):
+ *   1. `export const arkor = createArkor({ trainer })`
+ *   2. `export const trainer = createTrainer(...)`  (bare named export)
+ *   3. `export default createArkor({ trainer })`
+ *   4. `export default createTrainer(...)`           (default IS a Trainer)
+ *   5. `export default { trainer: createTrainer(...) }`
+ *
+ * Without shape #4 a project that default-exports a Trainer would run
+ * fine under `arkor start` but show as "no trainer" in Studio's
+ * manifest, with `configHash: null` forcing every HMR rebuild down the
+ * SIGTERM-restart path instead of the SIGUSR2 hot-swap path.
+ */
+/**


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d0d035af7d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-21T11:06:37Z

+    const spawnState = await readState(trainCwd);
+    const spawnScope = spawnState
+      ? { orgSlug: spawnState.orgSlug, projectSlug: spawnState.projectSlug }
+      : null;


Capture cancel scope after start-created state becomes available

The cancel path now depends on scope captured before spawning, but readState(trainCwd) is often null on the first run of an anon project because ensureProjectState() creates .arkor/state.json inside the child during trainer.start(). In that common case scopeForCancel stays null, manual Stop sends SIGKILL, and no cloud jobs/:id/cancel request is attempted, which can orphan the remote job and continue GPU spend.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-21T11:06:37Z

+        activeTrains.unregister(cancelPid);
+        cancelTeardown?.();
+        // Fire-and-forget cloud-side cancel so the cloud job is


Keep parsing stdout while waiting for late job-id marker

The new 500ms poll for parsedJobId in cancel() cannot observe a late Started job line because cancelTeardown() is invoked first and marks the stream closed, and onStdoutChunk immediately returns when closed is true. Fresh evidence versus earlier missing-job-id reports: this ordering disables the parser during the exact polling window, so a fast Stop before marker parse still skips cloud cancel and can leave the remote run alive.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-21T11:06:38Z

+  for (const trainer of findTrainerCandidates(mod)) {
+    const inspection = getTrainerInspection(trainer);
+    if (inspection) return inspection;
+  }


Hash the same trainer instance that runTrainer executes

findInspectableTrainer currently skips the first exported trainer when it is unbranded and instead hashes a later branded export, but runTrainer still executes the first trainer by precedence. In modules that export both shapes (e.g. a hand-rolled named trainer plus a default createArkor(...)), HMR decisions are computed from a different trainer than the one actually running, which can route to SIGUSR2/hot-swap even though the live trainer cannot be callback-reloaded.

Useful? React with 👍 / 👎.

…ehavior for first-run anonymous projects

Copilot

Pull request overview

Copilot reviewed 41 out of 42 changed files in this pull request and generated 2 comments.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

+      // SPA dev workflow with mystery 403s on /api/*.
+      if (!shouldUnlink()) return;
+      try {
+        unlinkSync(path);
+      } catch {


+      // late event can still match per-pid.
+      if (restartGraceTimerRef.current !== null) {
+        clearTimeout(restartGraceTimerRef.current);
+      }
+      restartGraceTimerRef.current = window.setTimeout(() => {


…ncurrent dev tokens

Copilot

Pull request overview

Copilot reviewed 41 out of 42 changed files in this pull request and generated 1 comment.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

  const path = studioTokenPath();
  await mkdir(dirname(path), { recursive: true });
  await writeFile(path, token, { mode: 0o600 });
  await chmod(path, 0o600);
  return path;


socket-security · 2026-05-22T15:52:34Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	react@19.2.5

View full report

Copilot

Pull request overview

Copilot reviewed 40 out of 41 changed files in this pull request and generated no new comments.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

Copilot

Pull request overview

Copilot reviewed 40 out of 41 changed files in this pull request and generated 1 comment.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

+function stableStringify(value: unknown, key: string = ""): string | undefined {
+  if (value === null) return "null";
+  // Non-representable values: omit (undefined return) so each caller's
+  // boundary handler chooses the right substitution per its position.
+  if (value === undefined || typeof value === "function" || typeof value === "symbol") {
+    return undefined;
+  }
+  if (typeof value !== "object") return JSON.stringify(value);
+  // `JSON.stringify` calls `value.toJSON(key)` first when present
+  // (passing `""` at the top level, the property name in object
+  // positions, the index-as-string in array positions), then
+  // serialises the return value. Canonical example: `Date` → ISO
+  // string. The `key` argument is threaded through recursion so
+  // user-side `toJSON(key)` implementations that branch on the
+  // hosting property/index see the same value JSON.stringify would.
+  // If `toJSON` returns `undefined`, that propagates as the omit
+  // sentinel: the spec-defined "skip me" path.
+  const maybeToJSON = (value as { toJSON?: unknown }).toJSON;
+  if (typeof maybeToJSON === "function") {
+    return stableStringify(
+      (maybeToJSON as (key: string) => unknown).call(value, key),
+      key,
+    );
+  }
+  if (Array.isArray(value)) {
+    // Array slots: non-representable → "null" (matches JSON spec).


Copilot

Pull request overview

Copilot reviewed 40 out of 41 changed files in this pull request and generated no new comments.

Files not reviewed (1)

pnpm-lock.yaml: Language not supported

k-taro56 added 7 commits May 2, 2026 21:27

Merge branch 'main' into eng-615

45c49a8

Copilot AI review requested due to automatic review settings May 2, 2026 17:02

Copilot started reviewing on behalf of k-taro56 May 2, 2026 17:02 View session

mintlify Bot deployed to staging - docs May 2, 2026 17:02 View deployment

github-code-quality Bot found potential problems May 2, 2026

View reviewed changes

Comment thread packages/arkor/src/studio/server.ts Fixed

chatgpt-codex-connector Bot reviewed May 2, 2026

View reviewed changes

Copilot AI reviewed May 2, 2026

View reviewed changes

k-taro56 self-assigned this May 3, 2026

chatgpt-codex-connector Bot reviewed May 3, 2026

View reviewed changes

k-taro56 requested a review from Copilot May 3, 2026 10:32

Copilot started reviewing on behalf of k-taro56 May 3, 2026 10:32 View session

Copilot AI reviewed May 3, 2026

View reviewed changes

k-taro56 requested a review from Copilot May 3, 2026 14:02

Copilot started reviewing on behalf of k-taro56 May 3, 2026 14:02 View session

chatgpt-codex-connector Bot reviewed May 3, 2026

View reviewed changes

Copilot AI reviewed May 3, 2026

View reviewed changes

Merge branch 'main' into eng-615

5016dfc

mintlify Bot deployed to staging - docs May 4, 2026 12:03 View deployment

k-taro56 requested a review from Copilot May 4, 2026 12:06

Copilot started reviewing on behalf of k-taro56 May 4, 2026 12:06 View session

chatgpt-codex-connector Bot reviewed May 9, 2026

View reviewed changes

Copilot AI reviewed May 9, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 9, 2026

View reviewed changes

Copilot AI reviewed May 9, 2026

View reviewed changes

fix: implement POSIX exit codes for second signal handling and improv…

b8db73a

…e HMR error state management

Copilot AI reviewed May 11, 2026

View reviewed changes

fix: rename spawnArtifactHash to spawnArtifactContentHash for clarity…

f65561b

… and consistency in HMR handling

Copilot AI reviewed May 11, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 11, 2026

View reviewed changes

fix: prevent concurrent token file deletion on failed persist during …

19e3b84

…dev session

Copilot AI reviewed May 18, 2026

View reviewed changes

fix: unify POSIX exit code handling across runner and CLI, improve er…

d0d035a

…ror state management

Copilot AI reviewed May 21, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

fix: improve error handling for checkpoint throws and refine cancel b…

35c01cb

…ehavior for first-run anonymous projects

Copilot AI reviewed May 21, 2026

View reviewed changes

fix: enhance token cleanup logic to prevent accidental deletion of co…

378096a

…ncurrent dev tokens

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread packages/arkor/src/cli/commands/dev.ts

Comment on lines 168 to 172

const path = studioTokenPath();

await mkdir(dirname(path), { recursive: true });

await writeFile(path, token, { mode: 0o600 });

await chmod(path, 0o600);

return path;

k-taro56 added 2 commits May 22, 2026 22:02

Merge branch 'main' into eng-615

4e689df

fix: improve error handling and cleanup logic in RunTraining component

8de5cbb

Copilot AI reviewed May 22, 2026

View reviewed changes

		restartPendingRef.current = true;
		setHmrStatus(runningRef.current ? "early-stopping" : "idle");

		entry.child.kill("SIGTERM");
		entry.earlyStopRequested = true;

		const tokenPath = studioTokenPath();
		scheduleStudioTokenCleanup(tokenPath);

Conversation

k-taro56 commented May 2, 2026

Uh oh!

mintlify Bot commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 3, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

mintlify Bot commented May 2, 2026 •

edited

Loading

codecov Bot commented May 2, 2026 •

edited

Loading