Skip to content

feat(observability): add structured daemon.connection.* events with attempt + downtime fields to ws-client #218

@chrisleekr

Description

@chrisleekr

Finding

The daemon's outbound WebSocket lifecycle in src/daemon/ws-client.ts is the last
significant subsystem in this codebase whose telemetry has not yet been migrated to
the event: "<subsystem>.<verb>" pino convention that the rest of the codebase has
converged on (see e.g. dispatcher.offer.sent, daemon.heartbeat.pong_missed,
pipeline.stage, ship.scoped.fix_thread.daemon.completed, scheduler.action.*).
Today the daemon emits four free-text lines per connection cycle, none of which
carry a structured event field or the operational counters an operator needs to
reason about reconnect health:

  • src/daemon/ws-client.ts:76 (onopen) logs "Connected to orchestrator" with only orchestratorUrl. There is no time_to_connect_ms (the connect-call→onopen delta), no attempt counter, no downtime_ms measured from the previous onclose, so an operator cannot tell a clean first-connect from a 30-second reconnect.
  • src/daemon/ws-client.ts:104 (onclose) logs "Disconnected from orchestrator" with only code + reason. There is no connected_duration_ms measured from the prior onopen, so an operator cannot distinguish a daemon that flapped after 200 ms from one that lived for an hour and then died at the orchestrator-side idleTimeout: 120 boundary configured in src/orchestrator/ws-server.ts:207.
  • src/daemon/ws-client.ts:111-121 (onerror) logs "WebSocket error" with no event, leaving the operator to grep on freeform text alongside every other level: "error" line.
  • src/daemon/ws-client.ts:159 (scheduleReconnect) logs "Reconnecting to orchestrator" with only backoffMs. There is no attempt counter, so the standard "this daemon keeps reconnect-looping" alert query has to count log lines instead of reading one numeric field.

Additionally, src/daemon/main.ts:408 and src/daemon/main.ts:411 duplicate the
ws-client's connect/disconnect log with even less context (the onConnected
callback fires plain "Connected to orchestrator"), so every state transition
produces two unstructured lines back-to-back.

Mirror the structured-event treatment that landed for the orchestrator-side view
of the same connection (daemon.heartbeat.pong_missed, daemon.heartbeat.timeout,
daemon.heartbeat.ttl_refresh_failed pinned by the .strict() Zod discriminated
union at src/orchestrator/log-fields.ts:36-40). Add a sibling
src/daemon/log-fields.ts with a DAEMON_CONNECTION_LOG_EVENTS constant and
DaemonConnectionLogSchema discriminated union over five events:
daemon.connection.connect_attempt, daemon.connection.connected,
daemon.connection.disconnected, daemon.connection.reconnect_scheduled,
daemon.connection.error. Pin attempt, time_to_connect_ms,
connected_duration_ms, downtime_ms, backoff_ms, previous_backoff_ms,
code, reason, readyState per branch with the same .strict() shape and
co-located round-trip test as src/core/log-fields.ts. Have the ws-client own
two new timestamps (connectStartedAt, lastOpenedAt) so the deltas are
computable without churning the existing nextBackoff() decorrelated-jitter
math at lines 28-31. Remove the duplicate logger.info(...) calls in
src/daemon/main.ts:408 and src/daemon/main.ts:411 since the structured
lines emitted by ws-client now cover those transitions.

Diagram

stateDiagram-v2
    [*] --> Connecting: connect()
    Connecting --> Connected: ws.onopen<br/>daemon.connection.connected<br/>attempt + time_to_connect_ms
    Connecting --> Reconnecting: WebSocket throw / ws.onerror<br/>daemon.connection.error<br/>readyState + message
    Connected --> Disconnected: ws.onclose<br/>daemon.connection.disconnected<br/>code + reason + connected_duration_ms
    Disconnected --> Reconnecting: scheduleReconnect()<br/>daemon.connection.reconnect_scheduled<br/>attempt + backoff_ms
    Reconnecting --> Connecting: setTimeout fires<br/>daemon.connection.connect_attempt<br/>attempt + previous_backoff_ms
    Reconnecting --> [*]: close()
    Connected --> [*]: close()

    classDef happy fill:#196f3d;color:#ffffff;
    classDef sad fill:#922b21;color:#ffffff;
    classDef neutral fill:#1f618d;color:#ffffff;

    class Connected happy
    class Connecting neutral
    class Disconnected sad
    class Reconnecting sad
Loading

Rationale

The orchestrator already publishes pong-miss / TTL-refresh-failure events from
its side of every daemon's connection, so an operator can see "this daemon
went silent at 12:04Z". The daemon's side of that same connection is a blind
spot: the closest signal is daemon.heartbeat.timeout arriving on the
orchestrator some seconds after the actual TCP drop, with no way to correlate
it with the daemon's own view of onclose (code: 1006 vs 1000, the
reason string Bun surfaces, whether the daemon already entered backoff). For
the fleet's HA story this is load-bearing: src/orchestrator/liveness-reaper.ts
flips workflow_runs rows owned by a "no longer alive" daemon (reapOnce
at src/orchestrator/liveness-reaper.ts:88-185) based purely on the
orchestrator's pong view, and the operational question "did the daemon
actually die or did it just lose the WS?" has no log-side answer today.

The cost is small: ~80 lines of schema + ~30 lines of emitter changes inside
DaemonWsClient, gated by an existing test pattern (the .strict() Zod
round-trip co-located test that src/core/log-fields.ts and
src/orchestrator/log-fields.ts already use, see e.g. the test referenced at
src/scheduler/config-schema.test.ts). The schema is the drift-prevention
contract, not a runtime validator, so the hot path stays one log.info({...})
call per state transition with no measurable overhead. Decorrelated-jitter
backoff health, when paired with attempt counters, also lets an operator
trivially write the alert "P95 attempt across the fleet > 3 for 5 min",
which today requires log-line counting and is brittle across pino-pretty vs
JSON output.

This is the same shape of win the closed
feat(observability): structured pipeline.stage events with delta_ms and
feat(observability): structured dispatcher.offer.{sent,accepted,rejected,timed_out}
findings delivered for the pipeline and dispatcher subsystems; the
daemon-side WS lifecycle is the next analogous gap.

References

Internal:

  • src/daemon/ws-client.ts:75-122 — four unstructured logger.info / logger.error calls on the connection lifecycle.
  • src/daemon/ws-client.ts:153-170scheduleReconnect emits backoffMs but no attempt, and nextBackoff() at lines 28-31 is the decorrelated-jitter source we'd want a counter alongside.
  • src/daemon/main.ts:407-413onConnected / onDisconnected callbacks duplicate the ws-client logs with even less context; remove once structured lines land.
  • src/orchestrator/log-fields.ts:36-40DAEMON_HEARTBEAT_LOG_EVENTS template to mirror (orchestrator-side view of the same connection).
  • src/orchestrator/log-fields.ts:122-143DaemonHeartbeatLogSchema discriminated-union shape to mirror for the daemon-side schema.
  • src/core/log-fields.ts:18-37pipeline.stage precedent for camelCase correlation fields + snake_case numeric metrics (delta_ms).
  • src/orchestrator/ws-server.ts:207idleTimeout: 120 boundary that the new connected_duration_ms measurement makes diagnosable.
  • src/orchestrator/liveness-reaper.ts:88-185reapOnce consumer that benefits when both sides of the connection are observable.
  • CLAUDE.md "Key Concepts" — the event: "<subsystem>.<verb>" + .strict() schema pattern this finding extends.

External:

Suggested Next Steps

  1. Create src/daemon/log-fields.ts with DAEMON_CONNECTION_LOG_EVENTS and a DaemonConnectionLogSchema discriminated union over the five events, mirroring src/orchestrator/log-fields.ts:122-143. Add a co-located *.test.ts that round-trips a sample line through each branch under .strict().
  2. In DaemonWsClient (src/daemon/ws-client.ts), add three private timestamps (connectStartedAt, lastOpenedAt, lastClosedAt) and one attempt counter (reset on a successful onopen, incremented in scheduleReconnect). Replace the four logger.info/logger.error calls at lines 70, 76, 104, 113, 159 with structured emissions matching the schema.
  3. Delete the duplicate logger.info("Connected to orchestrator") / logger.info("Disconnected from orchestrator") callbacks at src/daemon/main.ts:408 / src/daemon/main.ts:411 since the structured lines emitted by the ws-client cover the same transitions.
  4. Update docs/operate/observability.md to add the five new daemon.connection.* events to the field-name index, per the CLAUDE.md doc-sync rule for "New Pino log field or metric".
  5. (Optional follow-up) add a Grafana / Loki alert template "daemon reconnect storm" filtering on event: "daemon.connection.reconnect_scheduled" AND attempt >= 5 to docs/operate/runbooks/daemon-fleet.md.

Areas Evaluated

  • src/daemon/ws-client.ts — the four unstructured lifecycle log lines and the nextBackoff decorrelated-jitter helper.
  • src/daemon/main.ts — the duplicate onConnected / onDisconnected callbacks and the surrounding shutdown + idle-loop observability.
  • src/daemon/health-reporter.ts — heartbeat pong build, confirmed no daemon-side connection telemetry lives here.
  • src/orchestrator/log-fields.ts — the .strict() Zod schema convention to mirror.
  • src/orchestrator/ws-server.ts — orchestrator's WebSocket server lifecycle, confirmed it logs WebSocket connection opened/closed with only remoteAddr/code/reason and no event field either (potential sibling finding, deliberately scoped out of this issue to keep one-finding-per-run).
  • src/orchestrator/instance-liveness.ts, src/orchestrator/liveness-reaper.ts, src/orchestrator/queue-worker.ts, src/orchestrator/concurrency.ts, src/orchestrator/valkey.ts, src/db/migrate.ts — surveyed for similar gaps; daemon-side WS lifecycle is the clearest single hop.
  • src/core/log-fields.ts, src/utils/octokit-observability.ts, src/utils/log-redaction.ts, src/logger.ts — confirmed the canonical event: "..." + .strict() schema pattern this finding extends.
  • Existing research issues (open and closed) — checked all 20+ area: observability entries; no overlap with pipeline.stage, dispatcher.offer.*, daemon.heartbeat.*, fleet-state gauge, queue_wait_ms, rate-limit hooks, stdio MCP logger, token-usage, pino.final, scheduler.scan.*, circuit.*, or retry.* findings.

Generated by the scheduled research action on 2026-06-06

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions