Skip to content

feat(observability): structured idempotency.* events for claimDelivery outcomes to make dedup-hit and fail-open rates queryable #232

@chrisleekr

Description

@chrisleekr

Finding

The webhook idempotency claim function claimDelivery at src/webhook/idempotency.ts:36 has three terminal outcomes but emits a structured event field on only one of them, breaking parity with every other recently-added structured event family in this repo (retry.* issue #225, dispatcher.offer.* issue #161, github.api.* src/utils/octokit-observability.ts:29, pipeline.stage issue #166).

Concretely, walking the four code paths:

  1. src/webhook/idempotency.ts:46-49 — Valkey unconfigured or unhealthy → log.warn({ deliveryId }, "claimDelivery: Valkey unavailable, proceeding (fail-open)"). No event tag, no reason discriminator.
  2. src/webhook/idempotency.ts:61SET … NX EX returned "OK" (claim won, the happy path). No log line at all. An operator cannot count successful claims, so the dedup-hit-rate denominator is invisible.
  3. src/webhook/idempotency.ts:62-66SET … NX EX returned null (duplicate redelivery). Emits event: "dedup-skip" — the one structured tag, but the verb breaks the dotted-namespace convention every other observability issue has standardised on (retry.attempt_failed, pipeline.completed, dispatcher.offer.sent).
  4. src/webhook/idempotency.ts:67-72 — Bun RedisClient.send threw → log.warn(..., "claimDelivery: Valkey error, proceeding (fail-open)"). No event tag, no reason discriminator, and indistinguishable from outcome 1 in a structured query without parsing the free-text message.

This is exactly the same gap that #225 closed for retryWithBackoff and #207 closed for the dispatcher: every terminal outcome of a load-bearing chokepoint needs an enumerable event so operator queries (event:idempotency.duplicate_skipped) can compute rates without resorting to message-field regex.

The fix is the established template (mirroring src/utils/retry-log-fields.ts): add src/webhook/idempotency-log-fields.ts with a z.strictObject IdempotencyLogFieldsSchema discriminated on event over the four-tag family idempotency.{claimed, duplicate_skipped, failed_open} (with reason: "unavailable" | "error" discriminator on failed_open), wire it into claimDelivery's four emit sites, retire the legacy dedup-skip literal, and document the new fields under the "Idempotency log fields" heading in docs/operate/observability.md (the existing table at lines 22-47 is the canonical place).

Diagram

flowchart TD
    WH["Webhook delivery<br/>X-GitHub-Delivery"] --> CD["claimDelivery<br/>idempotency.ts line 36"]
    CD --> VC{"Valkey configured<br/>AND healthy?"}
    VC -->|no| FOU["fail-open path<br/>line 47<br/>current: warn msg only<br/>proposed: idempotency.failed_open<br/>reason=unavailable"]
    VC -->|yes| SNX["SET key NX EX 259200<br/>lines 54-60"]
    SNX -->|reply OK| WON["claim won<br/>line 61<br/>current: NO LOG<br/>proposed: idempotency.claimed"]
    SNX -->|reply null| DUP["duplicate redelivery<br/>lines 62-65<br/>current: event=dedup-skip<br/>proposed: idempotency.duplicate_skipped"]
    SNX -->|throws| FOE["fail-open path<br/>lines 67-71<br/>current: warn msg only<br/>proposed: idempotency.failed_open<br/>reason=error"]
    WON --> HOK["handler proceeds<br/>side effects run"]
    FOU --> HOK
    FOE --> HOK
    DUP --> HSK["handler returns<br/>NO side effects"]

    classDef cTrigger fill:#1f4e79,color:#ffffff
    classDef cCall fill:#196f3d,color:#ffffff
    classDef cDecision fill:#7d6608,color:#ffffff
    classDef cOk fill:#196f3d,color:#ffffff
    classDef cDup fill:#1f4e79,color:#ffffff
    classDef cFailopen fill:#943126,color:#ffffff
    classDef cHandler fill:#4a235a,color:#ffffff

    class WH cTrigger
    class CD cCall
    class SNX cCall
    class VC cDecision
    class WON cOk
    class DUP cDup
    class FOU cFailopen
    class FOE cFailopen
    class HOK cHandler
    class HSK cHandler
Loading

Rationale

claimDelivery is one of two load-bearing dedup defences (the other being the idx_workflow_runs_inflight partial-unique index, per the "Idempotency" section of CLAUDE.md). Without structured outcome tags, the operator queries that would matter on an incident are impossible:

  • "Dedup hit rate" = count(event:idempotency.duplicate_skipped) / count(event:idempotency.{claimed,duplicate_skipped}). With outcome 2 emitting nothing, the denominator can only be reconstructed by counting downstream pipeline starts — fragile and lossy. A sudden spike in this rate would indicate GitHub redeliveries (manual operator replay, automatic retry, or webhook misconfiguration causing the same delivery to land multiple times) and is exactly the kind of operational signal observability is for.
  • "Fail-open rate" = count(event:idempotency.failed_open) / total claims. With the unavailable and error paths sharing only a free-text message, an alert like "fail-open rate > 1% sustained for 10m" requires message:~"fail-open" regex matching, which is brittle and breaks the field-vs-message-content separation the retry.* schema (src/utils/retry-log-fields.ts:48-95) deliberately enforces.
  • reason discriminator matters because unavailable (Valkey unset/disconnected → infrastructure config) and error (Bun RedisClient.send threw mid-flight → transient network or server-side error) have completely different runbook responses but are currently the same warn line.

The patch surface is tiny: one new file (the Zod schema, ~40 lines mirroring retry-log-fields.ts), four edits in idempotency.ts (two new emit sites, two reshapes), and a co-located test asserting field parity (the existing pattern in src/utils/retry-log-fields.test.ts). No new npm dependencies, no runtime cost on the hot path (the schema is a drift-prevention contract, not a runtime validator — same policy as RetryLogFieldsSchema).

Volume policy follows precedent: idempotency.claimed at debug (one per webhook would be too loud at info for a busy installation, mirroring github.api.request at line 117 of src/utils/octokit-observability.ts); idempotency.duplicate_skipped at info (the operational signal of interest); idempotency.failed_open at warn (degraded mode that should page if sustained).

References

Internal:

External:

Suggested Next Steps

  1. Add src/webhook/idempotency-log-fields.ts mirroring src/utils/retry-log-fields.ts: an IDEMPOTENCY_LOG_EVENTS = { claimed, duplicate_skipped, failed_open } as const map and a z.discriminatedUnion("event", [...]) schema whose failed_open branch carries reason: z.enum(["unavailable", "error"]) and the shared branch shape includes deliveryId: z.string().min(1) plus a ttl_seconds: z.number().int().positive() field on claimed so the TTL is greppable per emit.
  2. Update src/webhook/idempotency.ts to (a) emit idempotency.claimed at debug on the line-61 success branch, (b) replace the line-47 warn with idempotency.failed_open at warn with reason: "unavailable", (c) replace the line-63 event: "dedup-skip" with event: "idempotency.duplicate_skipped" at info, and (d) replace the line-68 warn with idempotency.failed_open at warn with reason: "error". Keep the human-readable log message text alongside for log-grep ergonomics; the event field is the structured-query handle.
  3. Add src/webhook/idempotency-log-fields.test.ts asserting each emit site's pino-event shape passes IdempotencyLogFieldsSchema.parse (the colocated-test pattern used by src/utils/retry-log-fields.test.ts and gated by bun run check:test-globs, see CLAUDE.md "Code Conventions").
  4. Document the new fields in docs/operate/observability.md: add idempotency.{claimed, duplicate_skipped, failed_open} rows to the field table (lines 22-47) and a short "Idempotency log fields" subsection after "Retry log fields", noting the volume policy (debug / info / warn) and the reason discriminator semantics. The check:docs-citations script enforces the cited line numbers stay accurate.

Areas Evaluated

  • src/webhook/idempotency.ts — the four code paths of claimDelivery and their current log shape; the file's contract docstring at lines 5-15.
  • All five callers of claimDelivery: src/webhook/events/issue-comment.ts:92, src/webhook/events/issues.ts:100, src/webhook/events/pull-request.ts:184, src/webhook/events/review-comment.ts:117 (gated paths) and the doc-only mention in src/webhook/events/review.ts:14 (un-gated by design).
  • src/orchestrator/valkey.ts:28-58getValkeyClient / isValkeyHealthy semantics that drive the line-46 healthy-or-not branch in claimDelivery.
  • src/utils/retry-log-fields.ts and src/utils/octokit-observability.ts as schema templates; src/orchestrator/log-fields.ts as a third dispatcher-side template for an event-discriminated family.
  • src/logger.ts — the root logger config and the createChildLogger API the handlers already build; no logger-side change is needed for this proposal because the events would flow through the existing child logger.
  • Closed and open research-labelled observability issues (gh issue list --label research --state all) to confirm no overlap; the two idempotency-labelled prior issues (fix(idempotency): production webhook handlers bypass router dedup and feat(idempotency): close duplicate-processing window) are feature work on the correctness of the layer, not on its observability surface.
  • docs/operate/observability.md:22-47 log-fields table to identify the canonical insertion point for the new field rows.

Generated by the scheduled research action on 2026-06-14

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions