Conversation
- Integrated Rolldown for hot module replacement (HMR) in `arkor dev`, allowing real-time updates to the training interface without page refresh. - Implemented `requestEarlyStop` and `replaceCallbacks` methods in the Trainer API to facilitate graceful stopping of training jobs and dynamic callback updates during execution. - Updated documentation to reflect new features and usage patterns for improved developer guidance. - Adjusted cleanup logic for better resource management during development sessions.
…l shutdown - Introduced `replaceCallbacks` method in the Trainer API to allow dynamic updates of lifecycle callbacks during training runs. - Enhanced signal handling for graceful early stopping, ensuring in-flight checkpoints are preserved during HMR rebuilds. - Added `registerCleanupHook` for streamlined resource management on process exit, improving cleanup logic across development commands. - Updated documentation to reflect new features and usage patterns for better developer guidance.
…internal callback swapping - Removed the public `replaceCallbacks` method from the Trainer interface to prevent exposure of the hot-swapping functionality. - Introduced an internal mechanism for callback swapping using a `Symbol.for`-keyed brand, allowing for dynamic updates during training runs without affecting the public API. - Updated signal handling to ensure seamless integration with the new callback swapping logic, enhancing the hot module replacement (HMR) experience. - Revised documentation to reflect changes in the Trainer API and clarify the internal callback management process.
…lement internal early-stop handling - Removed the public `requestEarlyStop` method from the Trainer interface to prevent exposure of the early-stop functionality. - Introduced an internal mechanism for early stopping using a `Symbol.for`-keyed brand, allowing for graceful stopping after the next checkpoint without affecting the public API. - Updated signal handling to ensure seamless integration with the new early-stop logic, enhancing the hot module replacement (HMR) experience. - Revised documentation to reflect changes in the Trainer API and clarify the internal early-stop management process.
…llback replacement - Updated the Trainer API to remove public exposure of `requestEarlyStop` and `replaceCallbacks` methods, enhancing encapsulation. - Implemented internal mechanisms for early stopping and callback swapping using `Symbol.for`-keyed brands, ensuring seamless integration during training runs. - Revised signal handling to improve the hot module replacement (HMR) experience and maintain clean resource management. - Updated documentation to reflect these changes and clarify the internal management processes for developers.
- Replaced esbuild with Rolldown for building the project, ensuring external dependencies are resolved correctly. - Implemented a cleanup hook system to manage resource disposal on process termination signals. - Enhanced the development experience with hot module replacement (HMR) capabilities, allowing for seamless updates during training runs. - Updated documentation to reflect changes in the development loop and HMR behavior. - Added tests for new signal handling and configuration hashing functionality.
|
Preview deployment for your docs. Learn more about Mintlify Previews.
💡 Tip: Enable Workflows to automatically generate PRs for you. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8abc594369
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const hotSwapTargets = activeTrains.notifyCallbackReload(nextHash); | ||
| const restartTargets = | ||
| activeTrains.requestEarlyStopOnMismatch(nextHash); |
There was a problem hiding this comment.
Dispatch train signals only once per rebuild
notifyCallbackReload/requestEarlyStopOnMismatch run inside each /api/dev/events subscriber callback, so one rebuild sends signals once per connected SSE client (and again on reconnect when createHmrCoordinator.subscribe replays lastEvent). In the common case of two open Studio tabs, the same child can receive two SIGTERMs for one config-changing rebuild; the second signal triggers the runner’s forced-exit path (exit(143)) instead of checkpoint-preserving early stop. Move signal dispatch out of per-subscriber delivery (or dedupe per rebuild hash) so each child is signaled at most once per rebuild event.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
This PR upgrades the local Studio dev experience by migrating the trainer bundling pipeline from esbuild to Rolldown, adding an HMR/SSE channel to propagate rebuild events to the Studio SPA, and introducing signal-based “hot-swap callbacks vs early-stop + restart” behavior for in-flight training subprocesses.
Changes:
- Replace esbuild-based trainer bundling with Rolldown (one-shot build + watch/HMR).
- Add
/api/dev/eventsSSE stream + SPA client wiring to refresh manifest and coordinate restarts/hot-swaps. - Introduce internal trainer inspection/callback replacement/early-stop hooks (via
Symbol.forbrands), plus cleanup utilities and updated docs/tests.
Reviewed changes
Copilot reviewed 27 out of 28 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pnpm-lock.yaml | Removes esbuild and adds rolldown to the lockfile graph. |
| packages/arkor/package.json | Swaps dependency from esbuild to rolldown. |
| packages/arkor/src/core/rolldownConfig.ts | Centralizes Rolldown input options + entry/outDir resolution and Node target derivation. |
| packages/arkor/src/cli/commands/build.ts | Implements arkor build via Rolldown bundling instead of esbuild. |
| packages/arkor/src/studio/hmr.ts | Adds lazy Rolldown watcher coordinator that emits ready/rebuild/error events. |
| packages/arkor/src/studio/hmr.test.ts | Tests HMR coordinator behavior (ready/rebuild/error/replay/dispose). |
| packages/arkor/src/studio/manifest.ts | Adds configHash to manifest summary and splits “summarise built manifest” vs “build + summarise”. |
| packages/arkor/src/studio/trainRegistry.ts | Adds per-child registry + policy for SIGUSR2 hot-swap vs SIGTERM early-stop. |
| packages/arkor/src/studio/trainRegistry.test.ts | Tests TrainRegistry signaling decisions and error tolerance. |
| packages/arkor/src/studio/server.ts | Adds /api/dev/events SSE route and integrates TrainRegistry into rebuild handling. |
| packages/arkor/src/studio/server.test.ts | Adds tests for /api/dev/events token rules, loopback guard, and SSE framing. |
| packages/arkor/src/core/configHash.ts | Adds stable hashing of JobConfig for “callbacks-only vs restart” decisions. |
| packages/arkor/src/core/configHash.test.ts | Tests determinism and sensitivity of hashJobConfig. |
| packages/arkor/src/core/trainerInspection.ts | Introduces internal inspection + callback replacement + early-stop brands via Symbol.for. |
| packages/arkor/src/core/trainer.ts | Implements callback hot-swap support + early-stop latch; attaches internal brands. |
| packages/arkor/src/core/trainer.test.ts | Adds coverage for early-stop and callback hot-swap behavior via internal brands. |
| packages/arkor/src/core/runnerSignals.ts | Adds signal handlers: SIGTERM graceful early-stop + SIGUSR2 callback reload. |
| packages/arkor/src/core/runnerSignals.test.ts | Tests SIGTERM/SIGUSR2 handler behaviors with branded trainers. |
| packages/arkor/src/core/runner.ts | Installs/removes new signal handlers around trainer.start()/trainer.wait(). |
| packages/arkor/src/core/runner.test.ts | Adds a test for SIGTERM early-stop behavior in the runner. |
| packages/arkor/src/cli/cleanupHooks.ts | Adds reusable cleanup-hook registration for exit/signals. |
| packages/arkor/src/cli/commands/dev.ts | Wires HMR coordinator into Studio server and refactors shutdown cleanup via cleanup hooks. |
| packages/arkor/src/cli/commands/start.test.ts | Updates test comment to refer to Rolldown instead of esbuild. |
| packages/studio-app/src/lib/api.ts | Adds typed DevEvent + openDevEvents() for SSE notifications. |
| packages/studio-app/src/components/RunTraining.tsx | SPA listens to dev SSE events, refreshes manifest, and coordinates restart/hot-swap UI state. |
| docs/concepts/studio.mdx | Documents HMR workflow, including hash-based hot-swap vs restart behavior (EN). |
| docs/ja/concepts/studio.mdx | Documents HMR workflow (JA). |
| AGENTS.md | Updates repository policy/docs for new SSE allow-list, HMR internals, and Rolldown build pipeline. |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const [log, setLog] = useState(""); | ||
| const [manifest, setManifest] = useState<ManifestResult | null>(null); | ||
| const [hmrStatus, setHmrStatus] = useState< | ||
| "idle" | "rebuilding" | "early-stopping" | "restarting" | "hot-swapped" |
| import { ensureProjectState } from "../core/projectState"; | ||
| import { readState } from "../core/state"; | ||
| import { readManifestSummary } from "./manifest"; | ||
| import { readManifestSummary, summariseBuiltManifest } from "./manifest"; |
| function scheduleHmrCleanup(hmr: { dispose: () => Promise<void> }): void { | ||
| // Registered before the studio-token cleanup so it runs first on | ||
| // shutdown — Node fires signal handlers in registration order, and we | ||
| // want the watcher to release file handles before the outermost | ||
| // process.exit. | ||
| registerCleanupHook({ cleanup: () => hmr.dispose() }); | ||
| } |
| requestEarlyStopOnMismatch( | ||
| nextConfigHash: string | null, | ||
| ): RestartTarget[] { | ||
| const targets: RestartTarget[] = []; | ||
| for (const [pid, entry] of this.entries) { | ||
| if ( | ||
| nextConfigHash === null || | ||
| entry.configHash === null || | ||
| entry.configHash !== nextConfigHash | ||
| ) { | ||
| try { | ||
| entry.child.kill("SIGTERM"); | ||
| } catch { | ||
| // child already exited; close handler will clean up. | ||
| } | ||
| targets.push({ pid, trainFile: entry.trainFile }); | ||
| } |
| await callbacks.onCheckpoint?.(ctx); | ||
| // Early-stop latch: a checkpoint just landed, so the in-flight work | ||
| // is durable. Cancel the cloud job and end `wait()` cleanly. | ||
| if (earlyStopRequested && earlyStopDeferred) { | ||
| await trainer.cancel(); | ||
| if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer); | ||
| earlyStopDeferred.resolve(); | ||
| earlyStopDeferred = null; | ||
| return { terminal: true, artifacts: terminalResult?.artifacts ?? [] }; | ||
| } |
| function startWatcher(): void { | ||
| if (watcher || disposed) return; | ||
| if (!existsSync(resolved.entry)) { | ||
| broadcast({ | ||
| type: "error", | ||
| message: `Build entry not found: ${resolved.entry}. Create ${BUILD_DEFAULTS.entry} or pass an explicit entry argument.`, | ||
| }); | ||
| return; | ||
| } |
…ous BUNDLE_END on macOS - Modified the test to assert that the late subscriber receives the same event as the prior subscriber, addressing issues with spurious BUNDLE_END events in rolldown@1.0.0-rc.17 on macOS. - Added comments to clarify the reasoning behind the change and noted a future task to revisit the assertion after rolldown stabilizes.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 31f510869c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| restartPendingRef.current = true; | ||
| setHmrStatus(runningRef.current ? "early-stopping" : "idle"); |
There was a problem hiding this comment.
Ignore restart hints when no local run is active
restartPendingRef is set for every payload.restart event even when this tab is not running a training stream. Since /api/dev/events broadcasts restart events to all connected tabs, a passive tab can cache restartPendingRef=true from another tab’s run; the next time the user runs training in this tab, finally will auto-trigger an extra run, creating an unintended duplicate job and extra cloud/GPU spend. Only mark restart pending when runningRef.current is true (or otherwise correlate the restart target to the local run).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 27 out of 28 changed files in this pull request and generated 4 comments.
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } catch { | ||
| // child already exited; close handler will clean up. | ||
| } | ||
| targets.push({ pid, trainFile: entry.trainFile }); |
| import { ensureProjectState } from "../core/projectState"; | ||
| import { readState } from "../core/state"; | ||
| import { readManifestSummary } from "./manifest"; | ||
| import { readManifestSummary, summariseBuiltManifest } from "./manifest"; |
| const [log, setLog] = useState(""); | ||
| const [manifest, setManifest] = useState<ManifestResult | null>(null); | ||
| const [hmrStatus, setHmrStatus] = useState< | ||
| "idle" | "rebuilding" | "early-stopping" | "restarting" | "hot-swapped" |
| // Early-stop latch: a checkpoint just landed, so the in-flight work | ||
| // is durable. Cancel the cloud job and end `wait()` cleanly. | ||
| if (earlyStopRequested && earlyStopDeferred) { | ||
| await trainer.cancel(); | ||
| if (earlyStopDeferred.timer) clearTimeout(earlyStopDeferred.timer); | ||
| earlyStopDeferred.resolve(); | ||
| earlyStopDeferred = null; | ||
| return { terminal: true, artifacts: terminalResult?.artifacts ?? [] }; | ||
| } | ||
| return { terminal: false, artifacts: terminalResult?.artifacts ?? [] }; |
…ce failures - Added a test to verify that the SIGINT exit handler is armed even if the `persistStudioToken` function fails, preventing the dev server from idling indefinitely. - Updated the `runDev` function to register the studio-token cleanup hook unconditionally, ensuring it executes on process termination signals regardless of token persistence success. - Enhanced cleanup logic to maintain proper signal handling during development sessions, improving overall stability and user experience.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 55b4a2e8f6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (payload.restart) { | ||
| // `/api/dev/events` is a broadcast — every open Studio tab gets | ||
| // this event. Only flip the auto-restart latch when *this* tab | ||
| // is actually running a stream right now; otherwise a passive | ||
| // tab would silently auto-spawn an extra job the next time the | ||
| // user clicks Run training, doubling cloud spend. | ||
| if (runningRef.current) { | ||
| // Training run is early-stopping; the active stream will |
There was a problem hiding this comment.
Scope restart latch to runs actually signaled for restart
The handler sets restartPendingRef for any payload.restart while this tab is running, but rebuild events are broadcast across all tabs and can contain mixed outcomes (restartTargets and hotSwapTargets) for different subprocesses. In a multi-run setup, a tab whose run was only hot-swapped can still latch restartPendingRef=true, then auto-spawn an extra training job when its current run eventually finishes, increasing unintended cloud/GPU usage.
Useful? React with 👍 / 👎.
| entry.child.kill("SIGTERM"); | ||
| entry.earlyStopRequested = true; |
There was a problem hiding this comment.
Check signal delivery before marking restart targets
ChildProcess.kill() returns a boolean and does not throw when a process is already gone, but this path treats any non-throw as success. If kill("SIGTERM") returns false during an exit race, the entry is still marked earlyStopRequested and added to targets, so the UI may wait for/restart a run that was never actually signaled.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 28 out of 29 changed files in this pull request and generated 4 comments.
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| useEffect(() => { | ||
| const es = openDevEvents(); | ||
| const onMessage = (raw: MessageEvent) => { |
| setHmrStatus("hot-swapped"); | ||
| window.setTimeout(() => { | ||
| setHmrStatus((s) => (s === "hot-swapped" ? "idle" : s)); | ||
| }, 1500); | ||
| } else { |
| }, | ||
| cancel() { | ||
| activeTrains.unregister(child.pid); | ||
| child.kill(); |
| // Synchronous wrapper so signal handlers preserve "cleanup landed | ||
| // before this function returns" — important for sync cleanups (e.g. | ||
| // `unlinkSync`) and for tests that assert the side effect right after | ||
| // invoking the handler. Async cleanups are fire-and-forget with a | ||
| // catch so a hung dispose doesn't block exit. | ||
| const run = (): void => { | ||
| if (done) return; | ||
| done = true; | ||
| try { | ||
| const result = options.cleanup(); | ||
| if (result && typeof (result as Promise<void>).catch === "function") { | ||
| (result as Promise<void>).catch(() => { | ||
| // best-effort: shutdown is racing other cleanup paths | ||
| }); | ||
| } | ||
| } catch { | ||
| // best-effort | ||
| } | ||
| }; | ||
|
|
||
| process.on("exit", run); | ||
|
|
||
| for (const sig of TERMINATING_SIGNALS) { | ||
| process.on(sig, () => { | ||
| run(); | ||
| if (options.exitOnSignal) process.exit(0); |
This commit modifies the exit codes returned by the cleanup hooks to align with POSIX standards, ensuring that the process exits with the correct code based on the signal received (e.g., SIGINT results in exit code 130). This change allows parent shells and orchestrators to accurately distinguish between user interruptions and normal completions. Additionally, new tests are added to verify the correct exit codes for various signals, enhancing the robustness of the cleanup mechanism.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8ed9eb304a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| trainer | ||
| .cancel() | ||
| .catch(() => {}) | ||
| .finally(() => { |
There was a problem hiding this comment.
Reject timeout early-stop when cancel POST fails
When the early-stop timeout path runs, trainer.cancel() errors are swallowed (.catch(() => {})) and the deferred is always resolved, so the SIGTERM handler treats shutdown as successful even if the cloud cancel request failed. This can orphan a still-running cloud job (continued GPU spend) in the exact case where no checkpoint arrived before timeoutMs and the cancel call transiently fails, because the local runner exits while the remote run continues.
Useful? React with 👍 / 👎.
| trainer | ||
| .cancel() | ||
| .catch(() => {}) | ||
| .finally(() => { |
There was a problem hiding this comment.
Reject timed-out early-stop when cancel request fails
In the timeout fallback, trainer.cancel() errors are swallowed and the deferred is always resolved, so SIGTERM shutdown is treated as successful even when the cloud cancel call fails. This can leave the remote training job running while the local runner exits and the Studio proceeds as if the stop succeeded, which is exactly the high-cost failure path this timeout branch is supposed to recover.
Useful? React with 👍 / 👎.
| .catch((err: unknown) => { | ||
| const msg = err instanceof Error ? err.message : String(err); | ||
| process.stderr.write(`requestEarlyStop failed: ${msg}\n`); | ||
| }) | ||
| .finally(() => process.exit(0)); |
There was a problem hiding this comment.
Avoid exiting success after early-stop cancellation failure
The SIGTERM handler unconditionally exits with code 0 in finally, even when requestTrainerEarlyStop() rejects. In HMR restart flows, a cancel failure can therefore still produce a “clean” child exit, causing the SPA to restart and potentially overlap a new run with a cloud job that never actually canceled.
Useful? React with 👍 / 👎.
| trainer | ||
| .cancel() | ||
| .catch(() => {}) | ||
| .finally(() => { |
There was a problem hiding this comment.
Reject timed-out early-stop when cancel request fails
The timeout fallback swallows trainer.cancel() failures and always resolves the early-stop promise, so SIGTERM shutdown is reported as successful even when the cloud cancel request failed. In the no-checkpoint path this can orphan a live remote training job while the local runner exits and the Studio proceeds as if cancellation completed.
Useful? React with 👍 / 👎.
| .catch((err: unknown) => { | ||
| const msg = err instanceof Error ? err.message : String(err); | ||
| process.stderr.write(`requestEarlyStop failed: ${msg}\n`); | ||
| }) | ||
| .finally(() => process.exit(0)); |
There was a problem hiding this comment.
Return non-success exit when early-stop cancellation fails
installShutdownHandlers always exits with code 0 in finally, even after requestTrainerEarlyStop() rejects. If early-stop/cancel fails, the child still appears to have exited cleanly, which can trigger restart flows and supervisory scripts to treat the run as safely stopped even though the cloud job may still be running.
Useful? React with 👍 / 👎.
| if (value === undefined || typeof value === "function" || typeof value === "symbol") { | ||
| return undefined; | ||
| } | ||
| if (typeof value !== "object") return JSON.stringify(value); |
There was a problem hiding this comment.
Normalize boxed primitives in config hash serialization
stableStringify treats boxed primitives (for example new Number(1)) as plain objects, producing {} instead of JSON’s scalar representation (1). Because JobConfig includes several unknown fields, this can generate a configHash that diverges from the actual JSON payload semantics and spuriously route HMR decisions (restart vs hot-swap) for otherwise equivalent configs.
Useful? React with 👍 / 👎.
| const active = earlyStopDeferred; | ||
| trainer | ||
| .cancel() | ||
| .catch(() => {}) | ||
| .finally(() => { |
| await callbacks.onCheckpoint?.(ctx); | ||
| // Early-stop latch: a checkpoint just landed, so the in-flight work | ||
| // is durable. Cancel the cloud job and end `wait()` cleanly. | ||
| if (earlyStopRequested && earlyStopDeferred) { | ||
| // Capture the cancel error (if any) but DON'T swallow |
| * Whether the signal-handler arm of this registration should call | ||
| * `process.exit(0)` once every in-flight cleanup (this hook + any | ||
| * siblings registered in the same process) has settled. Use `true` | ||
| * for the outermost cleanup responsible for terminating the | ||
| * process; `false` for inner cleanups that should let a sibling | ||
| * own the exit. Default: `false`. |
…nt SIGTERM hang This commit updates the early-stop mechanism in the trainer to ensure that the early-stop branch executes even if the user's `onCheckpoint` callback throws an error. By capturing the error and re-throwing it after the early-stop logic, the implementation prevents the SIGTERM handler from hanging due to unresolved promises. Additionally, the cancellation process is modified to reject the deferred promise on cancel errors, improving visibility into cancellation failures. Tests are added to verify this behavior and prevent future regressions.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 48a926324c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const manifest = await readManifestSummary(trainCwd, { | ||
| prebuiltOutFile: hmrOutFile, | ||
| }); |
There was a problem hiding this comment.
Preserve HMR build-error state in
/api/manifest
When HMR is enabled, this always serves readManifestSummary(..., { prebuiltOutFile }) from the last built artifact if it exists, even after the watcher has emitted an error event for current source edits. In that state the SPA’s periodic manifest poll (RunTraining polls every 5s) receives a 200 stale manifest and overwrites the build-error UI, so users can be shown an apparently healthy trainer while the latest code is still failing to compile. This can lead to launching training from stale code/config without an active error indication.
Useful? React with 👍 / 👎.
| const manifest = await readManifestSummary(trainCwd, { | ||
| prebuiltOutFile: hmrOutFile, | ||
| }); |
There was a problem hiding this comment.
Keep manifest in error state after HMR compile failures
This always serves the last built artifact when HMR is enabled, even if the watcher’s latest event is a compile error. After a source edit breaks the build, RunTraining briefly shows the error from /api/dev/events, but the next 5s manifest poll gets a 200 stale manifest and clears the error UI, so users can unknowingly run stale code/config while current sources are still broken.
Useful? React with 👍 / 👎.
| if (payload.type === "error") { | ||
| setManifest({ error: payload.message ?? "Build failed" }); | ||
| setHmrStatus("idle"); | ||
| return; |
There was a problem hiding this comment.
Clear pending auto-restart when a newer HMR error lands
When a rebuild has already queued restartPendingRef for this run, a subsequent HMR error event returns early without clearing that latch. If the training process then exits, finally still auto-restarts the run from the previous artifact even though the latest edit failed to compile, causing stale-code restarts and unexpected background training churn until the user notices.
Useful? React with 👍 / 👎.
| const manifest = await readManifestSummary(trainCwd, { | ||
| prebuiltOutFile: hmrOutFile, | ||
| }); |
There was a problem hiding this comment.
Keep manifest in error state after HMR compile failures
When HMR is enabled, /api/manifest always serves the existing prebuilt artifact if it exists, even if the watcher’s latest state is a compile error. In that case RunTraining briefly shows the SSE error but the next manifest poll (every 5s) receives stale 200 data and clears the error UI, so users can unknowingly run outdated code/config while current sources are still broken.
Useful? React with 👍 / 👎.
| if (payload.type === "error") { | ||
| setManifest({ error: payload.message ?? "Build failed" }); | ||
| setHmrStatus("idle"); | ||
| return; |
There was a problem hiding this comment.
Clear queued auto-restart when a newer HMR error arrives
If a rebuild already latched restartPendingRef for this run, a later HMR error event returns early without clearing that pending restart. When the current process exits, finally still auto-starts a new run from the previous artifact even though the latest edit failed to compile, leading to stale-code restarts and unexpected background training churn.
Useful? React with 👍 / 👎.
| entry.spawnArtifactHash !== null && | ||
| nextArtifactHash !== null && | ||
| entry.spawnArtifactHash === nextArtifactHash; |
There was a problem hiding this comment.
Compare pre-ready artifacts by content, not file timestamps
This pre-ready equality gate depends on spawnArtifactHash/nextArtifactHash, which are derived from mtimeMs-ctimeMs-size in hmr.ts rather than file content. A watcher build that rewrites the same bytes still changes timestamps, so a run started before first ready can be forced into unnecessary SIGTERM restart despite identical artifact/config, creating avoidable cancel+restart churn in normal startup timing windows.
Useful? React with 👍 / 👎.
| startedJob = { | ||
| ...startedJob, | ||
| status: "cancelled", | ||
| completedAt: event.timestamp, | ||
| }; |
| earlyStopRequested = false; | ||
| if (startedJob && !TERMINAL_STATUSES.has(startedJob.status)) { | ||
| startedJob = { | ||
| ...startedJob, | ||
| status: "cancelled", | ||
| completedAt: new Date().toISOString(), | ||
| }; | ||
| } |
| if (signalCount > 1) { | ||
| process.stdout.write( | ||
| `Received second ${signal}; exiting without waiting for checkpoint.\n`, | ||
| ); | ||
| process.exit(143); |
…e HMR error state management
| export function installShutdownHandlers(trainer: Trainer): () => void { | ||
| let signalCount = 0; | ||
| const handler = (signal: NodeJS.Signals): void => { | ||
| signalCount += 1; | ||
| if (signalCount > 1) { | ||
| process.stdout.write( | ||
| `Received second ${signal}; exiting without waiting for checkpoint.\n`, |
| const dispose = installShutdownHandlers(trainer); | ||
| try { | ||
| process.emit("SIGTERM", "SIGTERM"); | ||
| await new Promise((r) => setTimeout(r, 10)); | ||
| expect(trainer.__earlyStop.calls).toBe(1); |
| dispatchRebuild( | ||
| nextConfigHash: string | null, | ||
| // Defaults to `null` so tests / pre-existing callers that don't | ||
| // pass the artefact hash get the conservative behaviour: the | ||
| // pre-ready-spawn branch's `artefactsAgree` check is `false`, | ||
| // so a null entry hash falls through to SIGTERM-restart. Real | ||
| // dispatch from `/api/train`'s HMR subscriber threads | ||
| // `event.hash` here so the backfill optimisation activates. | ||
| nextArtifactHash: string | null = null, |
| process.emit("SIGTERM", "SIGTERM"); | ||
| await new Promise((r) => setTimeout(r, 25)); | ||
| expect(probe.earlyStopCalls).toBe(1); | ||
| expect(exitCalls).toContain(0); | ||
|
|
||
| // 2nd SIGTERM (still in-flight, listeners not yet removed) → | ||
| // exit(143) immediately, no second requestEarlyStop call. | ||
| process.emit("SIGTERM", "SIGTERM"); |
| /** | ||
| * Fingerprint (mtime+ctime+size, see `core/moduleCacheBust.ts`) of | ||
| * the on-disk `.arkor/build/index.mjs` at spawn time. Used **only** | ||
| * to gate the pre-ready-spawn backfill: if a rebuild eventually | ||
| * fires while `configHash` is still null and this fingerprint | ||
| * matches the rebuild's artefact, the child is provably reading | ||
| * the same bundle bytes the new hash describes — safe to backfill | ||
| * `configHash` and skip dispatch. A mismatch (or null here) means | ||
| * the on-disk artefact has changed between spawn and rebuild | ||
| * (user edited mid-spawn, fresh project never built, …) so the | ||
| * child is running stale bytes and we MUST SIGTERM-restart to | ||
| * keep cloud-side `JobConfig` aligned with what the child | ||
| * actually loaded. Null when HMR isn't enabled or stat failed. | ||
| */ | ||
| spawnArtifactHash: string | null; |
… and consistency in HMR handling
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f65561b89a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (jobIdForCancel) { | ||
| void (async () => { |
There was a problem hiding this comment.
Cancel cloud run even if job id marker is missing
This branch skips the cloud jobs/:id/cancel call whenever jobIdForCancel is still null, then immediately sends SIGKILL. That creates a real orphaning window: if the user clicks Stop training after the cloud job has been created but before the runner’s Started job <id> line is parsed, the local child is killed and no remote cancel is attempted, so GPU work can continue invisibly until server-side TTL/reaper cleanup. The new SIGKILL-based stop path makes this race materially more likely than the previous graceful cancel flow.
Useful? React with 👍 / 👎.
| if (jobIdForCancel) { | ||
| void (async () => { |
There was a problem hiding this comment.
Cancel cloud run even if job id marker is missing
This branch skips the cloud jobs/:id/cancel call whenever jobIdForCancel is still null, then immediately sends SIGKILL. If the user clicks Stop training after the cloud job is created but before the runner’s Started job <id> line is parsed, the local child is killed and no remote cancel is attempted, so the cloud job can continue running (and spending GPU) until server-side reaper/TTL cleanup.
Useful? React with 👍 / 👎.
| const m = STARTED_JOB_PATTERN.exec(line); | ||
| if (m && m[1]) { | ||
| activeTrains.recordJobId(child.pid, m[1]); |
There was a problem hiding this comment.
Use an unforgeable marker for started-job id capture
The parser accepts the first stdout line matching Started job <token> from the entire child process output, but user code can emit that exact line (for example via module-scope console.log) before the runner prints its canonical marker. In that case the registry records a bogus id, later manual stop POSTs cancel for the wrong job (or no job), and the actual cloud run may be left alive. Anchoring helps substring noise, but it still does not distinguish runner control output from user logs.
Useful? React with 👍 / 👎.
| restartPendingRef.current = false; | ||
| return; |
There was a problem hiding this comment.
Clear pre-spawn restart buffer when HMR error arrives
payload.type === "error" clears restartPendingRef, but it leaves pendingPreSpawnEventsRef intact. A restart event buffered during the pre-spawn window (runningRef=true, currentPidRef=null) can therefore survive a newer compile error and be applied in onSpawn, re-arming auto-restart from stale code. Fresh evidence vs earlier restart-latch reports: this path bypasses restartPendingRef entirely because the restart decision is deferred in pendingPreSpawnEventsRef until pid assignment.
Useful? React with 👍 / 👎.
| const m = STARTED_JOB_PATTERN.exec(line); | ||
| if (m && m[1]) { | ||
| activeTrains.recordJobId(child.pid, m[1]); |
There was a problem hiding this comment.
Use an unforgeable marker for started-job id capture
The parser accepts the first stdout line matching Started job <token> from the entire child process output, but user code can emit that exact line (for example via module-scope console.log) before the runner prints its canonical marker. In that case the registry records a bogus id, later manual stop POSTs cancel for the wrong job (or no job), and the actual cloud run may be left alive.
Useful? React with 👍 / 👎.
| restartPendingRef.current = false; | ||
| return; |
There was a problem hiding this comment.
Clear pre-spawn restart buffer when HMR error arrives
payload.type === "error" clears restartPendingRef, but it leaves pendingPreSpawnEventsRef untouched. A restart event buffered during the pre-spawn window can survive a newer compile error and then be applied in onSpawn, re-arming auto-restart from stale code. Fresh evidence versus earlier restart-latch reports: this path bypasses restartPendingRef because the restart decision is deferred in pendingPreSpawnEventsRef until pid assignment.
Useful? React with 👍 / 👎.
| const tokenPath = studioTokenPath(); | ||
| scheduleStudioTokenCleanup(tokenPath); |
There was a problem hiding this comment.
Avoid deleting shared studio token when persist failed
The cleanup hook is now registered before token persistence and runs unconditionally, so a process that failed to write its own token can still unlink ~/.arkor/studio-token on exit. That can delete a token written by another active arkor dev session, causing the other session’s Vite-backed Studio pages to start getting 403s on /api/* despite that server still running normally.
Useful? React with 👍 / 👎.
| const state = await readState(trainCwd); | ||
| if (!state) return; // no scope, can't address the job | ||
| const rpc = createRpc(); | ||
| await rpc.v1.jobs[":id"].cancel.$post({ |
There was a problem hiding this comment.
Do not require state file to cancel known running job
Manual stop now relies on re-reading .arkor state to build the cancel query, and silently skips the cancel POST when readState(trainCwd) returns null. If project state is removed or unreadable while training is in flight, the code still SIGKILLs the local child but never sends cloud cancel, which can orphan the remote run and continue spend. The runner already had in-memory scope for this job, so cancel should not depend on mutable filesystem state at stop time.
Useful? React with 👍 / 👎.
| export function findInspectableTrainer( | ||
| mod: Record<string, unknown>, | ||
| ): TrainerInspection | null { | ||
| const trainer = findTrainerInModule(mod); | ||
| return trainer ? getTrainerInspection(trainer) : null; | ||
| } |
| startedJob = { | ||
| ...startedJob, | ||
| status: cancelError !== null ? "failed" : "cancelled", | ||
| ...(cancelError !== null && { | ||
| error: `Early-stop cancel failed: ${ | ||
| cancelError instanceof Error | ||
| ? cancelError.message | ||
| : String(cancelError) | ||
| }`, | ||
| }), | ||
| completedAt: event.timestamp, | ||
| }; |
| /** | ||
| * Per-spawn nonce that `/api/train` injects via env so the server can | ||
| * recognise the runner's `Started job <id>` line without it being | ||
| * forgeable from user code. Captured at module load (i.e. BEFORE | ||
| * `runTrainer` does its `await import(userEntry)`) and the env var | ||
| * is deleted right after so the dynamically-imported user module | ||
| * cannot read it via `process.env`. If a user callback then writes | ||
| * `Started job <token>` to stdout, the line won't carry the nonce | ||
| * prefix and the server's anchored regex will reject it: no | ||
| * spoofed cloud `cancel()` POST against an attacker-chosen job id. | ||
| * | ||
| * Null when the runner was launched directly (e.g. `arkor start` from | ||
| * a shell), in which case the runner falls back to the plain | ||
| * `Started job <id>` form for backwards compatibility. The server only | ||
| * uses the nonce-prefixed form because every server spawn sets the | ||
| * env var. | ||
| */ | ||
| const STARTED_JOB_NONCE: string | null = | ||
| process.env.ARKOR_JOB_ID_MARKER_NONCE ?? null; | ||
| delete process.env.ARKOR_JOB_ID_MARKER_NONCE; |
| const url = `http://localhost:${port}`; | ||
| serve({ fetch: app.fetch, port, hostname: "127.0.0.1" }); | ||
| process.stdout.write(`Arkor Studio running on ${url}\n`); | ||
| process.stdout.write(`HMR enabled (watching src/arkor)\n`); |
| @@ -199,16 +220,54 @@ export async function runDev(options: DevOptions = {}): Promise<void> { | |||
| // hitting `arkor start` (and therefore RCE via dynamic import). | |||
| const studioToken = randomBytes(32).toString("base64url"); | |||
|
|
|||
| // HMR coordinator: a long-lived rolldown watcher over the user's | |||
| // `src/arkor` graph. The coordinator itself is lazy (`subscribe()` | |||
| // is what starts the watcher, not `createHmrCoordinator`), but | |||
| // `buildStudioApp` registers its per-rebuild signal-dispatch | |||
| // subscriber unconditionally — that subscriber needs to run on | |||
| // every BUNDLE_END regardless of whether any SSE client is | |||
| // connected, so it can SIGUSR2/SIGTERM active `/api/train` | |||
| // children and keep `lastSuccessConfigHash` warm for spawn-time | |||
| // capture. Net effect: the watcher starts at server boot. An | |||
| // `arkor dev` launched in an unbuilt project doesn't fail immediately | |||
| // because `startWatcher` falls through to a poll loop that waits | |||
| // for the entry file to appear (see `hmr.ts:entryWaitTimer`). | |||
| // | |||
| // Registered before the studio-token cleanup so the latter remains | |||
| // the most-recently-attached signal listener (existing tests rely | |||
| // on this ordering to find the token-removal handler). | |||
| const hmr = createHmrCoordinator({ cwd: process.cwd() }); | |||
| scheduleHmrCleanup(hmr); | |||
| // `arkor start` at boot. | ||
| try { | ||
| process.on(CALLBACK_RELOAD_SIGNAL, handler); | ||
| } catch { | ||
| return () => { | ||
| // no-op: handler was never attached | ||
| }; | ||
| } | ||
| return () => { |
| async dispose() { | ||
| disposed = true; | ||
| subscribers.clear(); | ||
| if (entryWaitTimer) { | ||
| clearInterval(entryWaitTimer); | ||
| entryWaitTimer = null; | ||
| } | ||
| if (watcher) { | ||
| const w = watcher; | ||
| watcher = null; | ||
| await w.close().catch(() => {}); | ||
| } | ||
| }, | ||
| }; |
| } else { | ||
| // Nothing pertaining to this tab's child — leave any in- | ||
| // progress status spans alone but make sure stale "early- | ||
| // stopping" / "restarting" labels from a prior run don't | ||
| // linger past the next quiet rebuild. | ||
| if (!runningRef.current) setHmrStatus("idle"); | ||
| } |
- Updated comments in trainRegistry.test.ts to enhance readability and understanding of the test cases. - Clarified comments in trainRegistry.ts regarding the behavior of signals and process management. - Enhanced documentation in templates.ts to ensure consistency in phrasing and clarity. - Improved comments in RunTraining.tsx to better explain the logic and flow of the component. - Refined comments in api.test.ts and api.ts to provide clearer context and reasoning for the code. - Adjusted comments in baseModels.test.ts to clarify the case-sensitivity of model IDs.
| The CLI/Studio look at `src/arkor/index.ts` in user projects. Discovery in [packages/arkor/src/core/runner.ts](packages/arkor/src/core/runner.ts) accepts (in order): a named `arkor` export from `createArkor({...})`, a bare `trainer` export, a default export holding either an Arkor manifest or a Trainer, or a `default.trainer` nested shape. `createArkor` returns a frozen, opaque manifest tagged with `_kind: "arkor"`; treat it as a value to hand to tooling, not a programmable client. | ||
|
|
||
| `arkor build` ([packages/arkor/src/cli/commands/build.ts](packages/arkor/src/cli/commands/build.ts)) bundles to `.arkor/build/index.mjs` with esbuild; bare specifiers (e.g. `arkor`, anything in `node_modules`) stay external so the artifact resolves the runtime SDK from the project's installed copy. | ||
| `arkor build` ([packages/arkor/src/cli/commands/build.ts](packages/arkor/src/cli/commands/build.ts)) bundles to `.arkor/build/index.mjs` with [Rolldown](https://rolldown.rs); bare specifiers (e.g. `arkor`, anything in `node_modules`) stay external so the artifact resolves the runtime SDK from the project's installed copy. The `transform.target` is derived from `process.versions.node` at build time so the bundle targets the same Node binary that will execute it. |
| if (opts.prebuiltOutFile && existsSync(opts.prebuiltOutFile)) { | ||
| return summariseBuiltManifest(opts.prebuiltOutFile); | ||
| } |
| const candidates: unknown[] = []; | ||
| // 1: createArkor named export | ||
| if (isArkor(mod.arkor) && (mod.arkor as Arkor).trainer) { | ||
| candidates.push((mod.arkor as Arkor).trainer); | ||
| } | ||
| // 2: bare `trainer` named export | ||
| if (mod.trainer) candidates.push(mod.trainer); | ||
| // 3: default-export holding an Arkor manifest | ||
| if (isArkor(mod.default) && (mod.default as Arkor).trainer) { | ||
| candidates.push((mod.default as Arkor).trainer); | ||
| } | ||
| // 4: default IS the Trainer itself. The `isTrainerLike` filter | ||
| // below sorts this out from cases 3/5 (an Arkor manifest doesn't | ||
| // have `start`/`wait`/`cancel`, nor does a plain `{ trainer }` | ||
| // wrapper), so pushing `mod.default` unconditionally is safe. | ||
| if (mod.default !== undefined) candidates.push(mod.default); | ||
| // 5: default.trainer nested | ||
| if ( | ||
| mod.default && | ||
| typeof mod.default === "object" && | ||
| "trainer" in (mod.default as Record<string, unknown>) | ||
| ) { | ||
| candidates.push((mod.default as Record<string, unknown>).trainer); | ||
| } | ||
| for (const c of candidates) { | ||
| if (isTrainerLike(c)) return c; | ||
| } |
| const SECOND_SIGNAL_EXIT_CODE: Record< | ||
| (typeof SHUTDOWN_SIGNALS)[number], | ||
| number | ||
| > = { | ||
| SIGHUP: 129, | ||
| SIGINT: 130, | ||
| SIGTERM: 143, | ||
| }; |
| if (!res.ok) { | ||
| let detail = ""; | ||
| try { | ||
| detail = (await res.text()).trim(); | ||
| } catch { | ||
| // Body unreadable (already consumed, network gone, etc.): | ||
| // surface the status alone rather than masking the failure | ||
| // entirely. | ||
| } | ||
| throw new Error( | ||
| detail | ||
| ? `/api/train failed (${res.status} ${res.statusText}): ${detail}` | ||
| : `/api/train failed (${res.status} ${res.statusText})`, | ||
| ); |
| export function openDevEvents(): EventSource { | ||
| return new EventSource(withStudioToken("/api/dev/events")); | ||
| } |
| requestTrainerEarlyStop(trainer) | ||
| .catch((err: unknown) => { | ||
| const msg = err instanceof Error ? err.message : String(err); | ||
| process.stderr.write(`requestEarlyStop failed: ${msg}\n`); | ||
| }) | ||
| .finally(() => process.exit(0)); |
| const { jobId } = await trainer.start(); | ||
| const startedJobPrefix = STARTED_JOB_NONCE | ||
| ? `[arkor:${STARTED_JOB_NONCE}] ` | ||
| : ""; | ||
| process.stdout.write(`${startedJobPrefix}Started job ${jobId}\n`); | ||
| const result = await trainer.wait(); | ||
| process.stdout.write( | ||
| `Job ${result.job.id} finished with status=${result.job.status}\n`, | ||
| ); |
…ror state management
| let onCheckpointError: unknown = null; | ||
| try { | ||
| await callbacks.onCheckpoint?.(ctx); | ||
| } catch (err) { | ||
| onCheckpointError = err; | ||
| } | ||
| // Early-stop latch: a checkpoint just landed, so the in-flight work |
| * Walk a freshly-imported user bundle in the same precedence order | ||
| * as `runner.ts`'s `extractTrainer` and return the first | ||
| * trainer-shaped value (anything that has `start`/`wait`/`cancel` | ||
| * functions). Doesn't require the SDK inspection brand: the | ||
| * manifest UI displays the trainer's `name` for hand-rolled trainers | ||
| * too, even when HMR can't compute a `configHash` for them. | ||
| * | ||
| * The five supported shapes (mirroring `runner.ts`'s `extractTrainer`): | ||
| * 1. `export const arkor = createArkor({ trainer })` | ||
| * 2. `export const trainer = createTrainer(...)` (bare named export) | ||
| * 3. `export default createArkor({ trainer })` | ||
| * 4. `export default createTrainer(...)` (default IS a Trainer) | ||
| * 5. `export default { trainer: createTrainer(...) }` | ||
| * | ||
| * Without shape #4 a project that default-exports a Trainer would run | ||
| * fine under `arkor start` but show as "no trainer" in Studio's | ||
| * manifest, with `configHash: null` forcing every HMR rebuild down the | ||
| * SIGTERM-restart path instead of the SIGUSR2 hot-swap path. | ||
| */ | ||
| /** |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d0d035af7d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const spawnState = await readState(trainCwd); | ||
| const spawnScope = spawnState | ||
| ? { orgSlug: spawnState.orgSlug, projectSlug: spawnState.projectSlug } | ||
| : null; |
There was a problem hiding this comment.
Capture cancel scope after start-created state becomes available
The cancel path now depends on scope captured before spawning, but readState(trainCwd) is often null on the first run of an anon project because ensureProjectState() creates .arkor/state.json inside the child during trainer.start(). In that common case scopeForCancel stays null, manual Stop sends SIGKILL, and no cloud jobs/:id/cancel request is attempted, which can orphan the remote job and continue GPU spend.
Useful? React with 👍 / 👎.
| activeTrains.unregister(cancelPid); | ||
| cancelTeardown?.(); | ||
| // Fire-and-forget cloud-side cancel so the cloud job is |
There was a problem hiding this comment.
Keep parsing stdout while waiting for late job-id marker
The new 500ms poll for parsedJobId in cancel() cannot observe a late Started job line because cancelTeardown() is invoked first and marks the stream closed, and onStdoutChunk immediately returns when closed is true. Fresh evidence versus earlier missing-job-id reports: this ordering disables the parser during the exact polling window, so a fast Stop before marker parse still skips cloud cancel and can leave the remote run alive.
Useful? React with 👍 / 👎.
| for (const trainer of findTrainerCandidates(mod)) { | ||
| const inspection = getTrainerInspection(trainer); | ||
| if (inspection) return inspection; | ||
| } |
There was a problem hiding this comment.
Hash the same trainer instance that
runTrainer executes
findInspectableTrainer currently skips the first exported trainer when it is unbranded and instead hashes a later branded export, but runTrainer still executes the first trainer by precedence. In modules that export both shapes (e.g. a hand-rolled named trainer plus a default createArkor(...)), HMR decisions are computed from a different trainer than the one actually running, which can route to SIGUSR2/hot-swap even though the live trainer cannot be callback-reloaded.
Useful? React with 👍 / 👎.
…ehavior for first-run anonymous projects
| // SPA dev workflow with mystery 403s on /api/*. | ||
| if (!shouldUnlink()) return; | ||
| try { | ||
| unlinkSync(path); | ||
| } catch { |
| // late event can still match per-pid. | ||
| if (restartGraceTimerRef.current !== null) { | ||
| clearTimeout(restartGraceTimerRef.current); | ||
| } | ||
| restartGraceTimerRef.current = window.setTimeout(() => { |
…ncurrent dev tokens
| const path = studioTokenPath(); | ||
| await mkdir(dirname(path), { recursive: true }); | ||
| await writeFile(path, token, { mode: 0o600 }); | ||
| await chmod(path, 0o600); | ||
| return path; |
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
| function stableStringify(value: unknown, key: string = ""): string | undefined { | ||
| if (value === null) return "null"; | ||
| // Non-representable values: omit (undefined return) so each caller's | ||
| // boundary handler chooses the right substitution per its position. | ||
| if (value === undefined || typeof value === "function" || typeof value === "symbol") { | ||
| return undefined; | ||
| } | ||
| if (typeof value !== "object") return JSON.stringify(value); | ||
| // `JSON.stringify` calls `value.toJSON(key)` first when present | ||
| // (passing `""` at the top level, the property name in object | ||
| // positions, the index-as-string in array positions), then | ||
| // serialises the return value. Canonical example: `Date` → ISO | ||
| // string. The `key` argument is threaded through recursion so | ||
| // user-side `toJSON(key)` implementations that branch on the | ||
| // hosting property/index see the same value JSON.stringify would. | ||
| // If `toJSON` returns `undefined`, that propagates as the omit | ||
| // sentinel: the spec-defined "skip me" path. | ||
| const maybeToJSON = (value as { toJSON?: unknown }).toJSON; | ||
| if (typeof maybeToJSON === "function") { | ||
| return stableStringify( | ||
| (maybeToJSON as (key: string) => unknown).call(value, key), | ||
| key, | ||
| ); | ||
| } | ||
| if (Array.isArray(value)) { | ||
| // Array slots: non-representable → "null" (matches JSON spec). |
This pull request migrates the Arkor build and development workflow from esbuild to Rolldown, introduces a robust hot module replacement (HMR) system for the Studio dev server, and refines the cleanup and shutdown logic for long-lived resources. Documentation has been updated to explain the new HMR behavior in both English and Japanese, and dependency management has been adjusted accordingly. These changes streamline the developer experience, enable live code updates during training runs, and improve resource handling during shutdown.
Build system migration and HMR integration:
arkor build, ensuring bundles target the current Node version and keeping external dependencies unbundled. (packages/arkor/package.json,packages/arkor/src/cli/commands/build.ts,packages/arkor/src/core/rolldownConfig) [1] [2] [3]src/arkor/in dev mode, enabling HMR for trainer code. The watcher notifies the SPA via SSE (/api/dev/events), and the dev server can hot-swap callbacks or gracefully early-stop training runs based on config changes. (packages/arkor/src/cli/commands/dev.ts,packages/arkor/src/studio/hmr.ts,packages/arkor/src/studio/trainRegistry.ts) [1] [2] [3] [4] [5] [6]Cleanup and shutdown improvements:
registerCleanupHookutility to handle resource cleanup on process exit and signals, ensuring proper teardown order (e.g., HMR watcher before token file removal). (packages/arkor/src/cli/cleanupHooks.ts,packages/arkor/src/cli/commands/dev.ts) [1] [2]Documentation updates:
docs/concepts/studio.mdx,docs/ja/concepts/studio.mdx,AGENTS.md) [1] [2] [3]AGENTS.md)Dependency and test adjustments:
packages/arkor/package.json,packages/arkor/src/cli/commands/start.test.ts) [1] [2]Security and internal API notes:
Symbol.forbrands, not exposed on the public SDK surface. (AGENTS.md) [1] [2]