You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The daemon's outbound WebSocket lifecycle in src/daemon/ws-client.ts is the last
significant subsystem in this codebase whose telemetry has not yet been migrated to
the event: "<subsystem>.<verb>" pino convention that the rest of the codebase has
converged on (see e.g. dispatcher.offer.sent, daemon.heartbeat.pong_missed, pipeline.stage, ship.scoped.fix_thread.daemon.completed, scheduler.action.*).
Today the daemon emits four free-text lines per connection cycle, none of which
carry a structured event field or the operational counters an operator needs to
reason about reconnect health:
src/daemon/ws-client.ts:76 (onopen) logs "Connected to orchestrator" with only orchestratorUrl. There is no time_to_connect_ms (the connect-call→onopen delta), no attempt counter, no downtime_ms measured from the previous onclose, so an operator cannot tell a clean first-connect from a 30-second reconnect.
src/daemon/ws-client.ts:104 (onclose) logs "Disconnected from orchestrator" with only code + reason. There is no connected_duration_ms measured from the prior onopen, so an operator cannot distinguish a daemon that flapped after 200 ms from one that lived for an hour and then died at the orchestrator-side idleTimeout: 120 boundary configured in src/orchestrator/ws-server.ts:207.
src/daemon/ws-client.ts:111-121 (onerror) logs "WebSocket error" with no event, leaving the operator to grep on freeform text alongside every other level: "error" line.
src/daemon/ws-client.ts:159 (scheduleReconnect) logs "Reconnecting to orchestrator" with only backoffMs. There is no attempt counter, so the standard "this daemon keeps reconnect-looping" alert query has to count log lines instead of reading one numeric field.
Additionally, src/daemon/main.ts:408 and src/daemon/main.ts:411 duplicate the
ws-client's connect/disconnect log with even less context (the onConnected
callback fires plain "Connected to orchestrator"), so every state transition
produces two unstructured lines back-to-back.
Mirror the structured-event treatment that landed for the orchestrator-side view
of the same connection (daemon.heartbeat.pong_missed, daemon.heartbeat.timeout, daemon.heartbeat.ttl_refresh_failed pinned by the .strict() Zod discriminated
union at src/orchestrator/log-fields.ts:36-40). Add a sibling src/daemon/log-fields.ts with a DAEMON_CONNECTION_LOG_EVENTS constant and DaemonConnectionLogSchema discriminated union over five events: daemon.connection.connect_attempt, daemon.connection.connected, daemon.connection.disconnected, daemon.connection.reconnect_scheduled, daemon.connection.error. Pin attempt, time_to_connect_ms, connected_duration_ms, downtime_ms, backoff_ms, previous_backoff_ms, code, reason, readyState per branch with the same .strict() shape and
co-located round-trip test as src/core/log-fields.ts. Have the ws-client own
two new timestamps (connectStartedAt, lastOpenedAt) so the deltas are
computable without churning the existing nextBackoff() decorrelated-jitter
math at lines 28-31. Remove the duplicate logger.info(...) calls in src/daemon/main.ts:408 and src/daemon/main.ts:411 since the structured
lines emitted by ws-client now cover those transitions.
Diagram
stateDiagram-v2
[*] --> Connecting: connect()
Connecting --> Connected: ws.onopen<br/>daemon.connection.connected<br/>attempt + time_to_connect_ms
Connecting --> Reconnecting: WebSocket throw / ws.onerror<br/>daemon.connection.error<br/>readyState + message
Connected --> Disconnected: ws.onclose<br/>daemon.connection.disconnected<br/>code + reason + connected_duration_ms
Disconnected --> Reconnecting: scheduleReconnect()<br/>daemon.connection.reconnect_scheduled<br/>attempt + backoff_ms
Reconnecting --> Connecting: setTimeout fires<br/>daemon.connection.connect_attempt<br/>attempt + previous_backoff_ms
Reconnecting --> [*]: close()
Connected --> [*]: close()
classDef happy fill:#196f3d;color:#ffffff;
classDef sad fill:#922b21;color:#ffffff;
classDef neutral fill:#1f618d;color:#ffffff;
class Connected happy
class Connecting neutral
class Disconnected sad
class Reconnecting sad
Loading
Rationale
The orchestrator already publishes pong-miss / TTL-refresh-failure events from
its side of every daemon's connection, so an operator can see "this daemon
went silent at 12:04Z". The daemon's side of that same connection is a blind
spot: the closest signal is daemon.heartbeat.timeout arriving on the
orchestrator some seconds after the actual TCP drop, with no way to correlate
it with the daemon's own view of onclose (code: 1006 vs 1000, the reason string Bun surfaces, whether the daemon already entered backoff). For
the fleet's HA story this is load-bearing: src/orchestrator/liveness-reaper.ts
flips workflow_runs rows owned by a "no longer alive" daemon (reapOnce
at src/orchestrator/liveness-reaper.ts:88-185) based purely on the
orchestrator's pong view, and the operational question "did the daemon
actually die or did it just lose the WS?" has no log-side answer today.
The cost is small: ~80 lines of schema + ~30 lines of emitter changes inside DaemonWsClient, gated by an existing test pattern (the .strict() Zod
round-trip co-located test that src/core/log-fields.ts and src/orchestrator/log-fields.ts already use, see e.g. the test referenced at src/scheduler/config-schema.test.ts). The schema is the drift-prevention
contract, not a runtime validator, so the hot path stays one log.info({...})
call per state transition with no measurable overhead. Decorrelated-jitter
backoff health, when paired with attempt counters, also lets an operator
trivially write the alert "P95 attempt across the fleet > 3 for 5 min",
which today requires log-line counting and is brittle across pino-pretty vs
JSON output.
This is the same shape of win the closed feat(observability): structured pipeline.stage events with delta_ms and feat(observability): structured dispatcher.offer.{sent,accepted,rejected,timed_out}
findings delivered for the pipeline and dispatcher subsystems; the
daemon-side WS lifecycle is the next analogous gap.
References
Internal:
src/daemon/ws-client.ts:75-122 — four unstructured logger.info / logger.error calls on the connection lifecycle.
src/daemon/ws-client.ts:153-170 — scheduleReconnect emits backoffMs but no attempt, and nextBackoff() at lines 28-31 is the decorrelated-jitter source we'd want a counter alongside.
src/daemon/main.ts:407-413 — onConnected / onDisconnected callbacks duplicate the ws-client logs with even less context; remove once structured lines land.
src/orchestrator/log-fields.ts:36-40 — DAEMON_HEARTBEAT_LOG_EVENTS template to mirror (orchestrator-side view of the same connection).
src/orchestrator/log-fields.ts:122-143 — DaemonHeartbeatLogSchema discriminated-union shape to mirror for the daemon-side schema.
Create src/daemon/log-fields.ts with DAEMON_CONNECTION_LOG_EVENTS and a DaemonConnectionLogSchema discriminated union over the five events, mirroring src/orchestrator/log-fields.ts:122-143. Add a co-located *.test.ts that round-trips a sample line through each branch under .strict().
In DaemonWsClient (src/daemon/ws-client.ts), add three private timestamps (connectStartedAt, lastOpenedAt, lastClosedAt) and one attempt counter (reset on a successful onopen, incremented in scheduleReconnect). Replace the four logger.info/logger.error calls at lines 70, 76, 104, 113, 159 with structured emissions matching the schema.
Delete the duplicate logger.info("Connected to orchestrator") / logger.info("Disconnected from orchestrator") callbacks at src/daemon/main.ts:408 / src/daemon/main.ts:411 since the structured lines emitted by the ws-client cover the same transitions.
Update docs/operate/observability.md to add the five new daemon.connection.* events to the field-name index, per the CLAUDE.md doc-sync rule for "New Pino log field or metric".
(Optional follow-up) add a Grafana / Loki alert template "daemon reconnect storm" filtering on event: "daemon.connection.reconnect_scheduled" AND attempt >= 5 to docs/operate/runbooks/daemon-fleet.md.
Areas Evaluated
src/daemon/ws-client.ts — the four unstructured lifecycle log lines and the nextBackoff decorrelated-jitter helper.
src/daemon/main.ts — the duplicate onConnected / onDisconnected callbacks and the surrounding shutdown + idle-loop observability.
src/orchestrator/log-fields.ts — the .strict() Zod schema convention to mirror.
src/orchestrator/ws-server.ts — orchestrator's WebSocket server lifecycle, confirmed it logs WebSocket connection opened/closed with only remoteAddr/code/reason and no event field either (potential sibling finding, deliberately scoped out of this issue to keep one-finding-per-run).
src/orchestrator/instance-liveness.ts, src/orchestrator/liveness-reaper.ts, src/orchestrator/queue-worker.ts, src/orchestrator/concurrency.ts, src/orchestrator/valkey.ts, src/db/migrate.ts — surveyed for similar gaps; daemon-side WS lifecycle is the clearest single hop.
src/core/log-fields.ts, src/utils/octokit-observability.ts, src/utils/log-redaction.ts, src/logger.ts — confirmed the canonical event: "..." + .strict() schema pattern this finding extends.
Existing research issues (open and closed) — checked all 20+ area: observability entries; no overlap with pipeline.stage, dispatcher.offer.*, daemon.heartbeat.*, fleet-state gauge, queue_wait_ms, rate-limit hooks, stdio MCP logger, token-usage, pino.final, scheduler.scan.*, circuit.*, or retry.* findings.
Generated by the scheduled research action on 2026-06-06
Finding
The daemon's outbound WebSocket lifecycle in
src/daemon/ws-client.tsis the lastsignificant subsystem in this codebase whose telemetry has not yet been migrated to
the
event: "<subsystem>.<verb>"pino convention that the rest of the codebase hasconverged on (see e.g.
dispatcher.offer.sent,daemon.heartbeat.pong_missed,pipeline.stage,ship.scoped.fix_thread.daemon.completed,scheduler.action.*).Today the daemon emits four free-text lines per connection cycle, none of which
carry a structured
eventfield or the operational counters an operator needs toreason about reconnect health:
src/daemon/ws-client.ts:76(onopen) logs"Connected to orchestrator"with onlyorchestratorUrl. There is notime_to_connect_ms(the connect-call→onopen delta), noattemptcounter, nodowntime_msmeasured from the previousonclose, so an operator cannot tell a clean first-connect from a 30-second reconnect.src/daemon/ws-client.ts:104(onclose) logs"Disconnected from orchestrator"with onlycode+reason. There is noconnected_duration_msmeasured from the prioronopen, so an operator cannot distinguish a daemon that flapped after 200 ms from one that lived for an hour and then died at the orchestrator-sideidleTimeout: 120boundary configured insrc/orchestrator/ws-server.ts:207.src/daemon/ws-client.ts:111-121(onerror) logs"WebSocket error"with noevent, leaving the operator to grep on freeform text alongside every otherlevel: "error"line.src/daemon/ws-client.ts:159(scheduleReconnect) logs"Reconnecting to orchestrator"with onlybackoffMs. There is noattemptcounter, so the standard "this daemon keeps reconnect-looping" alert query has to count log lines instead of reading one numeric field.Additionally,
src/daemon/main.ts:408andsrc/daemon/main.ts:411duplicate thews-client's connect/disconnect log with even less context (the
onConnectedcallback fires plain
"Connected to orchestrator"), so every state transitionproduces two unstructured lines back-to-back.
Mirror the structured-event treatment that landed for the orchestrator-side view
of the same connection (
daemon.heartbeat.pong_missed,daemon.heartbeat.timeout,daemon.heartbeat.ttl_refresh_failedpinned by the.strict()Zod discriminatedunion at
src/orchestrator/log-fields.ts:36-40). Add a siblingsrc/daemon/log-fields.tswith aDAEMON_CONNECTION_LOG_EVENTSconstant andDaemonConnectionLogSchemadiscriminated union over five events:daemon.connection.connect_attempt,daemon.connection.connected,daemon.connection.disconnected,daemon.connection.reconnect_scheduled,daemon.connection.error. Pinattempt,time_to_connect_ms,connected_duration_ms,downtime_ms,backoff_ms,previous_backoff_ms,code,reason,readyStateper branch with the same.strict()shape andco-located round-trip test as
src/core/log-fields.ts. Have the ws-client owntwo new timestamps (
connectStartedAt,lastOpenedAt) so the deltas arecomputable without churning the existing
nextBackoff()decorrelated-jittermath at lines 28-31. Remove the duplicate
logger.info(...)calls insrc/daemon/main.ts:408andsrc/daemon/main.ts:411since the structuredlines emitted by ws-client now cover those transitions.
Diagram
stateDiagram-v2 [*] --> Connecting: connect() Connecting --> Connected: ws.onopen<br/>daemon.connection.connected<br/>attempt + time_to_connect_ms Connecting --> Reconnecting: WebSocket throw / ws.onerror<br/>daemon.connection.error<br/>readyState + message Connected --> Disconnected: ws.onclose<br/>daemon.connection.disconnected<br/>code + reason + connected_duration_ms Disconnected --> Reconnecting: scheduleReconnect()<br/>daemon.connection.reconnect_scheduled<br/>attempt + backoff_ms Reconnecting --> Connecting: setTimeout fires<br/>daemon.connection.connect_attempt<br/>attempt + previous_backoff_ms Reconnecting --> [*]: close() Connected --> [*]: close() classDef happy fill:#196f3d;color:#ffffff; classDef sad fill:#922b21;color:#ffffff; classDef neutral fill:#1f618d;color:#ffffff; class Connected happy class Connecting neutral class Disconnected sad class Reconnecting sadRationale
The orchestrator already publishes pong-miss / TTL-refresh-failure events from
its side of every daemon's connection, so an operator can see "this daemon
went silent at 12:04Z". The daemon's side of that same connection is a blind
spot: the closest signal is
daemon.heartbeat.timeoutarriving on theorchestrator some seconds after the actual TCP drop, with no way to correlate
it with the daemon's own view of
onclose(code: 1006vs1000, thereasonstring Bun surfaces, whether the daemon already entered backoff). Forthe fleet's HA story this is load-bearing:
src/orchestrator/liveness-reaper.tsflips
workflow_runsrows owned by a "no longer alive" daemon (reapOnceat
src/orchestrator/liveness-reaper.ts:88-185) based purely on theorchestrator's pong view, and the operational question "did the daemon
actually die or did it just lose the WS?" has no log-side answer today.
The cost is small: ~80 lines of schema + ~30 lines of emitter changes inside
DaemonWsClient, gated by an existing test pattern (the.strict()Zodround-trip co-located test that
src/core/log-fields.tsandsrc/orchestrator/log-fields.tsalready use, see e.g. the test referenced atsrc/scheduler/config-schema.test.ts). The schema is the drift-preventioncontract, not a runtime validator, so the hot path stays one
log.info({...})call per state transition with no measurable overhead. Decorrelated-jitter
backoff health, when paired with attempt counters, also lets an operator
trivially write the alert "P95
attemptacross the fleet > 3 for 5 min",which today requires log-line counting and is brittle across pino-pretty vs
JSON output.
This is the same shape of win the closed
feat(observability): structured pipeline.stage events with delta_msandfeat(observability): structured dispatcher.offer.{sent,accepted,rejected,timed_out}findings delivered for the pipeline and dispatcher subsystems; the
daemon-side WS lifecycle is the next analogous gap.
References
Internal:
src/daemon/ws-client.ts:75-122— four unstructuredlogger.info/logger.errorcalls on the connection lifecycle.src/daemon/ws-client.ts:153-170—scheduleReconnectemitsbackoffMsbut noattempt, andnextBackoff()at lines 28-31 is the decorrelated-jitter source we'd want a counter alongside.src/daemon/main.ts:407-413—onConnected/onDisconnectedcallbacks duplicate the ws-client logs with even less context; remove once structured lines land.src/orchestrator/log-fields.ts:36-40—DAEMON_HEARTBEAT_LOG_EVENTStemplate to mirror (orchestrator-side view of the same connection).src/orchestrator/log-fields.ts:122-143—DaemonHeartbeatLogSchemadiscriminated-union shape to mirror for the daemon-side schema.src/core/log-fields.ts:18-37—pipeline.stageprecedent for camelCase correlation fields + snake_case numeric metrics (delta_ms).src/orchestrator/ws-server.ts:207—idleTimeout: 120boundary that the newconnected_duration_msmeasurement makes diagnosable.src/orchestrator/liveness-reaper.ts:88-185—reapOnceconsumer that benefits when both sides of the connection are observable.CLAUDE.md"Key Concepts" — theevent: "<subsystem>.<verb>"+.strict()schema pattern this finding extends.External:
ws-client.ts:25-27; the attempt counter is the missing observable companion to the math.ws.url, attempt, and lifecycle phase as first-class attributes, exactly the shape proposed here.eventdiscriminator + camelCase correlation fields" idiom this repo has been migrating toward.Suggested Next Steps
src/daemon/log-fields.tswithDAEMON_CONNECTION_LOG_EVENTSand aDaemonConnectionLogSchemadiscriminated union over the five events, mirroringsrc/orchestrator/log-fields.ts:122-143. Add a co-located*.test.tsthat round-trips a sample line through each branch under.strict().DaemonWsClient(src/daemon/ws-client.ts), add three private timestamps (connectStartedAt,lastOpenedAt,lastClosedAt) and oneattemptcounter (reset on a successfulonopen, incremented inscheduleReconnect). Replace the fourlogger.info/logger.errorcalls at lines 70, 76, 104, 113, 159 with structured emissions matching the schema.logger.info("Connected to orchestrator")/logger.info("Disconnected from orchestrator")callbacks atsrc/daemon/main.ts:408/src/daemon/main.ts:411since the structured lines emitted by the ws-client cover the same transitions.docs/operate/observability.mdto add the five newdaemon.connection.*events to the field-name index, per the CLAUDE.md doc-sync rule for "New Pino log field or metric".event: "daemon.connection.reconnect_scheduled" AND attempt >= 5todocs/operate/runbooks/daemon-fleet.md.Areas Evaluated
src/daemon/ws-client.ts— the four unstructured lifecycle log lines and thenextBackoffdecorrelated-jitter helper.src/daemon/main.ts— the duplicateonConnected/onDisconnectedcallbacks and the surrounding shutdown + idle-loop observability.src/daemon/health-reporter.ts— heartbeat pong build, confirmed no daemon-side connection telemetry lives here.src/orchestrator/log-fields.ts— the.strict()Zod schema convention to mirror.src/orchestrator/ws-server.ts— orchestrator's WebSocket server lifecycle, confirmed it logsWebSocket connection opened/closedwith onlyremoteAddr/code/reasonand noeventfield either (potential sibling finding, deliberately scoped out of this issue to keep one-finding-per-run).src/orchestrator/instance-liveness.ts,src/orchestrator/liveness-reaper.ts,src/orchestrator/queue-worker.ts,src/orchestrator/concurrency.ts,src/orchestrator/valkey.ts,src/db/migrate.ts— surveyed for similar gaps; daemon-side WS lifecycle is the clearest single hop.src/core/log-fields.ts,src/utils/octokit-observability.ts,src/utils/log-redaction.ts,src/logger.ts— confirmed the canonicalevent: "..."+.strict()schema pattern this finding extends.area: observabilityentries; no overlap withpipeline.stage,dispatcher.offer.*,daemon.heartbeat.*, fleet-state gauge, queue_wait_ms, rate-limit hooks, stdio MCP logger, token-usage,pino.final,scheduler.scan.*,circuit.*, orretry.*findings.Generated by the scheduled research action on 2026-06-06