Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# ADR 026: Server — Execute Mode

**SPEC:** [server.pipe](../specs/server.pipe.md)
**SPEC:** [headless-execute](../specs/headless-execute.md)
**Status:** Accepted
**Date:** 2026-06-11

Expand Down
212 changes: 212 additions & 0 deletions docs/adrs/027.server.execute-resilience.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,212 @@
# ADR 027: Server — Execute Resilience & Structured Error Protocol

**Status**: Accepted
**Date**: 2026-06-20
**Spec**: [Server — Headless Execute](../specs/headless-execute.md)

---

## 1. Decision Record

<!-- User/human-facing: PM, stakeholder, reviewer -->

### Context

The `POST /s/:id/execute` endpoint introduced in ADR 026 spawns a child process and streams its output as NDJSON. Two gaps make the current implementation unsafe for production agent workflows:

**Gap 1 — Server crash on spawn failure.** If the requested command is not found (`ENOENT`) or cannot be executed (`EACCES`), Node.js emits an `error` event on the child process. The handler in ADR 026 has no `child.on('error', …)` listener, so this becomes an unhandled EventEmitter error — which Node.js converts to an uncaught exception and terminates the process. One bad tool call from any agent kills the server for every other user.

**Gap 2 — Untyped errors in the NDJSON stream.** When the command exits non-zero or spawn itself fails, the caller currently receives only raw stderr text and a numeric exit code. An agent cannot distinguish "the command ran and produced a meaningful error" from "the command never started because it doesn't exist". Both look like `exit: 1` with text in stderr. Without a typed signal, the agent must parse free-form error messages — fragile and brittle across OS and shell versions.

Additionally, `handleRequest` is an `async` function called inside `http.createServer` with no `.catch()`. A rejected promise from any route handler silently becomes an unhandled rejection, which in Node.js ≥ 15 terminates the process by default.

### Decision

Two decisions are made together because they form a single coherent contract: the server stays alive, and the agent always gets a machine-readable error signal.

**Decision 1 — Process-level resilience.** Register `uncaughtException` and `unhandledRejection` handlers in `src/server/index.ts` that log the error and do not exit. Wrap the `handleRequest` call in `.catch()` to ensure no promise rejection escapes. These are last-resort guards — not a substitute for proper per-handler error handling.

**Decision 2 — Spawn-level resilience in the execute handler.** In `POST /s/:id/execute`: move `res.writeHead(200, …)` to after the spawn attempt so that synchronous spawn failures can return HTTP 500 before headers are committed. Add `child.on('error', …)` to handle asynchronous spawn errors (ENOENT, EACCES). Use a `done` flag to prevent `res.end()` from being called by both the `error` and `close` handlers.

**Decision 3 — Structured error on the exit NDJSON line.** When spawn fails (not when the command simply exits non-zero), the final `{ exit }` line carries two additional fields: `error` (human-readable message) and `code` (the Node.js error code string, e.g. `ENOENT`). A non-zero exit from the command itself produces only `{ exit: N }` — no `error` or `code` fields. This one-line distinction gives agents a stable predicate to branch on without parsing stderr text.

### Considered Options

#### Option A — Process-level catch-all only

Register `uncaughtException` / `unhandledRejection` at process level and rely on stderr text for error diagnosis. No changes to the NDJSON stream shape.

**Pros:** Minimal diff. Server stays alive.
**Cons:** Agents still cannot distinguish spawn failure from command failure without text parsing. The catch-all is a backstop, not a contract.

#### Option B — Per-handler guard + structured exit line (chosen)

Fix the execute handler directly (`child.on('error')`, `done` flag, `writeHead` moved), and extend the exit NDJSON line with `error` and `code` fields on spawn failure. Add process-level guards as a second safety net.

**Pros:** Agent gets a stable, typed signal. Server never crashes. No text parsing required. Backwards-compatible — the `{ exit }` field is unchanged; new fields are additive.
**Cons:** Callers must handle the new fields explicitly to benefit from them. Adds one branch in the execute handler.

#### Option C — Separate `{ type: "error" }` NDJSON line

Emit a distinct NDJSON line `{ type: "error", code: "ENOENT", message: "…" }` before the `{ exit: 1 }` line.

**Pros:** Separates the error signal from the exit signal. Clearer for stream parsers that process line-by-line.
**Cons:** Parsers must now handle a third event type. Existing consumers that only watch for `exit` would ignore the error line entirely, giving no benefit unless they are updated. Option B adds signal to the existing terminal event, which every consumer already reads.

### Rationale

Option B is chosen because it adds machine-readable error context to the one line every consumer already waits for — the exit line. Agents do not need code changes to remain correct (the `exit` field is unchanged); they opt in to the richer signal by also checking `code`. The process-level guards are a necessary complement, not a substitute: a catch-all that swallows errors without telling the caller is still a failure mode for the agent.

### Consequences

**Positive**
- webtty never terminates due to a bad tool call from an agent.
- Agents can branch on `code === 'ENOENT'` to surface actionable guidance ("install the CLI") instead of forwarding raw stderr text.
- Backwards-compatible: callers that ignore `error`/`code` on the exit line continue to work unchanged.

**Negative**
- Agents must be updated to consume `code` from the exit line to get the benefit. The current agent (cliIntegrationAgentChatService) reads only `exit` and `data` — it gains resilience but not structured branching until updated.

**Risks**
- Node.js error codes (ENOENT, EACCES, etc.) are OS-level and stable, but not exhaustive. An unknown code should be treated as a generic spawn failure. Agents should not switch exhaustively on `code`.

---

## 2. Functional Specification

<!-- Human-facing: stakeholders & engineers. Per-change functional detail. -->

### Scenarios

**Scenario 1 — Command not found (ENOENT)**

1. Agent POSTs `{ cmd: "tc", args: ["search", "*"] }` to `/s/main/execute`.
2. `tc` is not on `PATH`.
3. Server spawns the process; Node.js emits `error` on the child with code `ENOENT`.
4. Handler writes `{ stream: "stderr", data: "spawn ENOENT: …\n" }` to the response.
5. Handler writes `{ exit: 1, error: "spawn ENOENT: …", code: "ENOENT" }` and ends.
6. webtty continues serving other requests. Server does not exit.
7. Agent reads `exit: 1, code: "ENOENT"` and surfaces: *"tc is not installed — run: bunx @awc/cli@next …"*

**Scenario 2 — Command exits non-zero**

1. Agent POSTs `{ cmd: "tc", args: ["search", "bad-query"] }`.
2. `tc` runs, returns exit code 2.
3. Handler writes stderr chunks as they arrive, then `{ exit: 2 }`.
4. No `error` or `code` fields on the exit line — this is a command-level failure, not a spawn failure.
5. Agent reads `exit: 2`, inspects stderr, and decides whether to retry or surface the error.

**Scenario 3 — Unhandled exception in a route handler**

1. A route handler throws an unexpected exception.
2. The `.catch()` wrapper in `index.ts` catches the rejected promise.
3. Server writes `500 Internal Server Error` and continues.
4. `uncaughtException` guard (last resort) logs and does not call `process.exit`.

### Non-Goals

- This ADR does not define agent-side error handling strategy — that is up to each consumer.
- This ADR does not add structured error fields to the `publish` endpoint or WebSocket channel.
- This ADR does not address authentication or rate-limiting on the execute endpoint.

### Data Contracts

**Execute NDJSON stream — extended exit line**

```
{ "stream": "stdout", "data": "…" } // zero or more stdout chunks
{ "stream": "stderr", "data": "…" } // zero or more stderr chunks
{ "exit": 0 } // command exited successfully
{ "exit": 1 } // command exited with error (no spawn issue)
{ "exit": 1, "error": "…", "code": "ENOENT" } // spawn itself failed
```

`error` and `code` are present **only** when spawn fails (the child process never started). A non-zero exit from a running command produces `{ exit: N }` with no additional fields.

`code` is the Node.js `SystemError.code` string (e.g. `ENOENT`, `EACCES`, `EPERM`). Callers should treat unrecognised codes as generic spawn failures.

---

## 3. Implementation Specification

<!-- Agent-facing: engineers & Generator agents -->

### Implementation Detail

**`src/server/index.ts` — process-level guards**

Add before `httpServer.listen`:

```ts
process.on('uncaughtException', (err) => {
console.error('[webtty] uncaughtException:', err);
});
process.on('unhandledRejection', (reason) => {
console.error('[webtty] unhandledRejection:', reason);
});
```

Wrap the `handleRequest` call in the server callback:

```ts
const httpServer = http.createServer((req, res) => {
handleRequest(req, res, distPath, wasmPath, clientDistPath, shutdown).catch((err) => {
console.error('[webtty] unhandled route error:', err);
if (!res.headersSent) {
res.writeHead(500);
res.end('Internal Server Error');
}
});
});
```

**`src/server/routes.ts` — execute handler hardening**

Replace the spawn block (currently lines 371–397) with:

```ts
let child: ReturnType<typeof spawn>;
try {
child = spawn(cmd, args as string[], { env: process.env });
} catch (err) {
res.writeHead(500);
res.end(`spawn error: ${String(err)}`);
return;
}
res.writeHead(200, { 'Content-Type': 'application/x-ndjson', 'Cache-Control': 'no-cache' });
let done = false;
child.on('error', (err) => {
if (done) return;
done = true;
const msg = err.message;
const code = (err as NodeJS.ErrnoException).code;
res.write(JSON.stringify({ stream: 'stderr', data: `${msg}\n` }) + '\n');
res.write(JSON.stringify({ exit: 1, error: msg, ...(code ? { code } : {}) }) + '\n');
res.end();
});
if (typeof stdin === 'string') child.stdin?.write(stdin);
child.stdin?.end();
child.stdout?.on('data', (chunk: Buffer) => {
res.write(JSON.stringify({ stream: 'stdout', data: chunk.toString() }) + '\n');
});
child.stderr?.on('data', (chunk: Buffer) => {
res.write(JSON.stringify({ stream: 'stderr', data: chunk.toString() }) + '\n');
});
child.on('close', (code: number | null) => {
if (done) return;
done = true;
res.write(JSON.stringify({ exit: code ?? 1 }) + '\n');
res.end();
});
req.on('close', () => { if (!done) child.kill(); });
```

Key invariants:
- `writeHead(200)` only fires after `spawn` succeeds — synchronous spawn errors return `500` with no NDJSON.
- `done` flag ensures exactly one `res.end()` call regardless of whether `error` or `close` fires first (both can fire on ENOENT).
- `error` and `code` appear on the exit line only when the child's `error` event fires — never on a normal non-zero exit.

### Related Decisions

- [ADR 026 — Server Execute Mode](026.server.headless-execute.md) — the endpoint this ADR hardens
- [ADR 025 — Session Channel](025.server.channel.md) — publish/events channel on the same session
Loading
Loading