Skip to content

feat(replay): single-use claim on consent + callback state#22

Merged
babs merged 1 commit intomasterfrom
feat/single-use-sealed-blobs
Apr 27, 2026
Merged

feat(replay): single-use claim on consent + callback state#22
babs merged 1 commit intomasterfrom
feat/single-use-sealed-blobs

Conversation

@babs
Copy link
Copy Markdown
Owner

@babs babs commented Apr 27, 2026

Summary

Closes T1.2 + T2.1 from misc/next-steps.md. Wires the existing replay store into two sealed-blob redemption sites that previously relied on TTL alone:

  • Consent token (POST /consent) — single-use claim on a per-render jti. Each GET /authorize render mints a fresh jti, so back-button = re-consent (a new claim slot) rather than a stuck-state error.
  • Callback state (GET /callback) — single-use claim on a per-session sid before the upstream OIDC exchange. A replayed /callback URL never fans out to the IdP token endpoint and never produces audit-log noise.

Eliminates the only "deliberate trade-off" footnote in specs.md (the consent-replay paragraph at the old line 337) and ships the audit-noise / IdP-fan-out fix the runbook called out for /callback.

Behavior when the replay store is wired (REDIS_URL set)

Site Replay detected Store error
POST /consent 400 invalid_request + error_code=consent_replay + mcp_auth_replay_detected_total{kind="consent"}++ 503 server_error + error_code=replay_store_unavailable + mcp_auth_access_denied_total{reason="replay_store_unavailable"}++
GET /callback 400 invalid_request + error_code=callback_state_replay + mcp_auth_replay_detected_total{kind="callback_state"}++ (same shape)

Reuses the existing mcp_auth_access_denied_total{reason="replay_store_unavailable"} counter from /token so one alert rule covers every claim site. All four replay_store_error log lines now carry an op field (claim_authz_code / claim_refresh_family / claim_consent / claim_callback_state) for site-level correlation.

Backwards compatibility

  • nil replay store → stateless fallback (configured opt-out, matches /token).
  • Empty JTI / SessionID on tokens sealed by older binaries → stateless fallback. Lets a rolling deploy avoid 503-ing active users for the duration of consentTTL (5 min) / sessionTTL (10 min).

Deliberate non-claim

The /callback IdP-error redirect path (RFC 6749 §4.1.2.1) does NOT claim — the redirect is idempotent (no IdP fan-out, no token issuance) and claiming would burn the state on a transient IdP error and force re-auth on every legit retry. Documented inline in handlers/callback.go.

Test plan

  • CI green (test, lint, keycloak-e2e, fuzz-smoke, manifest-prod, govulncheck, build).
  • 10 new handler tests in handlers/single_use_replay_test.go:
    • TestConsent_SingleUse_ReplayDetected — second POST → 400 consent_replay, replay metric +1, approved counter does NOT double-count.
    • TestConsent_SingleUse_ClaimsBeforeDeny — Deny is also one-shot.
    • TestConsent_SingleUse_NilStoreFallbacknil store keeps the prior stateless behavior.
    • TestConsent_SingleUse_LegacyEmptyJTI — token with empty JTI succeeds (in-flight rollout).
    • TestConsent_SingleUse_StoreErrorFailClosed — 503 + replay_store_unavailable metric.
    • TestCallback_SingleUse_ReplayDetected — second /callback → 400 callback_state_replay, replay metric +1.
    • TestCallback_SingleUse_LegacyEmptySID — empty sid does NOT hit the replay branch.
    • TestCallback_SingleUse_StoreErrorFailClosed — 503 + metric, verifyFunc not called.
    • TestAuthorize_RenderConsent_PopulatesJTI — GET /authorize (consent fork) seals a non-empty jti.
    • TestAuthorize_SilentRedirect_PopulatesSessionID — silent path seals a non-empty sid.
  • No existing test broke (full suite + -race runs locally).

Docs

  • specs.md/consent paragraph reworked from "Why no replay-store integration" → "Single-use replay defense"; /callback step list grew step 3 ("Single-use claim on sid"); subsequent steps renumbered.
  • README.mdmcp_auth_replay_detected_total{kind} documentation extended with consent and callback_state.

Closes the only documented security trade-off in specs.md
(consent_token replay) and removes IdP-fan-out + audit noise on
retried /callback. When a replay store is wired (REDIS_URL):

  - GET /authorize seals a fresh JTI on every consent render and
    a fresh SessionID on every silent-redirect session. Per-render
    JTI keeps back-button safe (a re-render gets a new claim slot)
    while still single-use within one render.

  - POST /consent ClaimOnce on JTI before either approve or deny —
    a captured consent_token cannot be replayed for either decision.

  - GET /callback ClaimOnce on session.SessionID before the upstream
    OIDC exchange — a replayed /callback URL never fans out to the
    IdP token endpoint and never produces audit-log noise.

  - Replay detected → 400 invalid_request with error_code
    consent_replay / callback_state_replay, increments
    mcp_auth_replay_detected_total{kind=consent|callback_state}.

  - Store unreachable → 503 with error_code replay_store_unavailable,
    increments mcp_auth_access_denied_total{reason="replay_store_unavailable"}
    (reuses the existing /token counter so one alert rule covers
    every claim site).

  - nil store → stateless fallback (configured opt-out, mirrors /token).

  - Empty JTI / SessionID on tokens sealed by older binaries falls
    through to stateless behavior so an in-flight rollout doesn't
    503 active users for the duration of consentTTL / sessionTTL.

  - The /callback IdP-error redirect path deliberately does NOT
    claim — the redirect is idempotent (no IdP fan-out, no token
    issuance) and claiming would force re-auth on every legit
    transient-error retry.

  - All four replay_store_error log sites now carry an `op` field
    (claim_authz_code / claim_refresh_family / claim_consent /
    claim_callback_state) for site-level log correlation.

specs.md and README updated. 10 new handler tests cover replay
detection, deny-replayed, nil-store fallback, legacy empty
JTI/SID, store-error fail-closed, and JTI/SID population by
/authorize for both forks.
@babs babs merged commit ad18290 into master Apr 27, 2026
7 checks passed
@babs babs deleted the feat/single-use-sealed-blobs branch April 27, 2026 21:40
babs added a commit that referenced this pull request Apr 27, 2026
Closes T2.3 from misc/next-steps.md. Adds a configurable grace
window during which a refresh-rotation claim collision is treated
as a benign concurrent submit (parallel-tab refresh, slow-network
double-submit) and surfaces as 429 + error_code
refresh_concurrent_submit + Retry-After: 2 instead of revoking
the family.

Outside the window the strict pre-T2.3 behavior applies — every
collision still revokes the lineage (RFC 6749 §10.4 / OAuth 2.1
§6.1). REFRESH_RACE_GRACE_SEC tunes the window: default 2s,
clamped to [0, 10] at startup. The 10s ceiling is a security cap
(wider windows are statistically attacker-shaped); 0 disables
the grace and keeps the strict pre-T2.3 contract for operators
who prefer it.

Implementation:

  - replay.Store.ClaimOrCheckFamily gains a graceWindow parameter
    and a third return bool `racing`. Atomicity preserved: the
    Lua script on Redis and the single-mutex section on
    MemoryStore both classify {revoked, racing, claimed, fresh}
    in one linearizable step, so the new branch cannot deadlock
    or double-decide under concurrent load.

  - The Lua script now stores epoch-ms timestamps (passed by Go
    so the proxy is authoritative about its clock) as claim
    values; legacy "1" placeholder values from pre-T2.3 binaries
    fall through to the strict revoke path during a rolling
    deploy. Documented in the script docstring + threat-model
    row 5.

  - MemoryStore gets a memoryEntry struct (expiresAt + setAt) so
    the racing branch can compare against the original claim
    time without polluting Mark semantics.

  - 429 on /token is a deliberate deviation from RFC 6749 §5.2
    (which defines 400/401 for token-endpoint errors). Most OAuth
    client libraries treat 429+Retry-After as "back off and
    retry", which is exactly the desired behavior. Documented in
    specs.md.

  - Tests: 2 handler-level (racing happy path + zero disables
    grace), 2 per store (racing within grace + past-grace
    revokes), 6 config (default + zero + custom + non-int +
    negative + above-ceiling).

  - All four replay_store_error log sites already carry the `op`
    field from PR #22; this change preserves it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant