frame: drain buffered bytes before propagating QUIC connection error #339

Open
jeremyjpj0916 wants to merge 1 commit into hyperium:master from jeremyjpj0916:fix/recv-frame-drain-on-quic-close

Conversation

@jeremyjpj0916

frame: drain buffered bytes before propagating QUIC connection error

Fixes #338.

Problem

FrameStream::try_recv propagates a connection-level error from
BufRecvStream::poll_read via ? before the frame decoder gets to run
against bytes that may already be sitting in BufRecvStream::buf from
a previous wake. When a QUIC stack delivers stream data and a
CONNECTION_CLOSE (or other connection error) in the same recv batch —
common on Linux with io_uring's batched recvmsg and coalesced UDP
datagrams — buffered HEADERS / DATA bytes are discarded and clients
see StreamError::ConnectionError instead of the response that was
already on the wire.

The QUIC layer is innocent here: quinn-proto's per-stream assembler is
not flushed by close_common, and Chunks::next() returns buffered
chunks before checking connection state. The discard is purely at
h3's frame layer.

H3_NO_ERROR is RFC 9114 §8.1's "graceful shutdown without an error"
code. Today, a backend that uses it correctly (per-connection request
limits, drain/rolling deploy, stateless backends) produces 502s at the
client.

Fix

Hoist the QUIC error out of try_recv's ? path. When poll_read
returns an error, cache it for the current iteration; let
decoder.decode(self.stream.buf_mut()) run once against whatever is
already buffered. If the decoder produces a frame, surface it normally.
If decoder.decode returns None, surface the cached error.

Same shape applied to FrameStream::poll_data so DATA frame body
bytes that were buffered before the error are not stranded.
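
Below is a minimal, self-contained sketch of the reshaped loop. The types (`FakeStream`, `ConnError`, the one-byte-length decoder) are illustrative stand-ins rather than h3's real `FrameStream` / `BufRecvStream` / decoder API; only the control flow mirrors the patch.

```rust
// Stand-in types: NOT h3's API. Only the error-caching control flow below
// reflects the actual change to FrameStream::try_recv.

#[derive(Debug, PartialEq)]
enum Frame {
    Headers(Vec<u8>),
}

#[derive(Debug, PartialEq, Clone)]
struct ConnError(u64); // e.g. H3_NO_ERROR = 0x100

struct FakeStream {
    buf: Vec<u8>,                     // bytes buffered from a previous wake
    pending_error: Option<ConnError>, // error delivered in the same recv batch
}

impl FakeStream {
    /// Models BufRecvStream::poll_read: a connection error is sticky, so
    /// every read after the close observes it.
    fn poll_read(&mut self) -> Result<(), ConnError> {
        match &self.pending_error {
            Some(e) => Err(e.clone()),
            None => Ok(()),
        }
    }
}

/// Models the frame decoder: consumes one length-prefixed frame if complete.
fn decode(buf: &mut Vec<u8>) -> Option<Frame> {
    let len = *buf.first()? as usize;
    if buf.len() < 1 + len {
        return None; // partial frame: nothing to yield yet
    }
    let payload = buf[1..1 + len].to_vec();
    buf.drain(..1 + len);
    Some(Frame::Headers(payload))
}

/// The reshaped try_recv: cache the read error, give the decoder one pass
/// over the buffered bytes, and only surface the error when nothing decodes.
fn try_recv(stream: &mut FakeStream) -> Result<Option<Frame>, ConnError> {
    // Before the patch this was effectively `stream.poll_read()?`, which
    // discarded whatever was still sitting in `buf`.
    let cached_err = stream.poll_read().err();
    if let Some(frame) = decode(&mut stream.buf) {
        return Ok(Some(frame)); // buffered frame wins
    }
    match cached_err {
        Some(e) => Err(e), // nothing decodable: surface the connection error
        None => Ok(None),  // no frame yet, no error: keep waiting
    }
}

fn main() {
    let mut s = FakeStream {
        buf: vec![3, b'a', b'b', b'c'],        // one complete buffered frame
        pending_error: Some(ConnError(0x100)), // CONNECTION_CLOSE(H3_NO_ERROR)
    };
    // First poll yields the buffered frame, second poll surfaces the error.
    assert_eq!(try_recv(&mut s), Ok(Some(Frame::Headers(b"abc".to_vec()))));
    assert_eq!(try_recv(&mut s), Err(ConnError(0x100)));
}
```

The `main` above is the same two-poll shape the new unit tests assert: buffered bytes shrink on every iteration that produces a frame, so the loop is bounded by the buffer size.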

Termination

The cached error is consumed in the same iteration in which it is
observed. A partial-frame buffer that the decoder cannot resolve exits
the loop through the None arm of the match with the cached error — no
opportunity for an infinite loop.

Forward progress

When the decoder DOES produce a frame on the cached-error iteration,
the next poll re-issues poll_read, which returns the same error.
The buffered bytes have shrunk by the consumed frame's length, so
iterations are bounded by the buffer size.

Tests

Two new tests in frame.rs::tests:

  • poll_next_drains_buffered_headers_before_quic_close — synthesizes a
    HEADERS frame followed by an ApplicationClose { error_code: 0x100 }
    via FakeRecv::chunk_then_error. First poll yields the frame, second
    poll surfaces the error.
  • poll_data_drains_buffered_body_before_quic_close — same shape on
    the body path. poll_next yields the DATA frame header, poll_data
    drains the buffered 4-byte body even though the next poll would
    return the connection error.

FakeRecv is extended with a chunk_then_error helper that queues a
synthetic StreamErrorIncoming::ConnectionErrorIncoming for the next
poll after a chunk drains. This is the smallest model of the recv-batch
race that a unit test can construct without a live QUIC backend.

The existing tests in frame.rs still pass (the change is a no-op for
the non-error path).

Compatibility

  • Public API: unchanged. poll_next and poll_data keep the same
    signatures and the same return-shape contract — successful frames
    continue to win, errors continue to surface, the only difference is
    the ordering when both are pending.
  • Internal API: no callers outside frame.rs are affected.
  • Behavioral compat: a backend that intentionally aborts a stream with
    a non-NO_ERROR code mid-frame will now have its buffered partial
    data delivered before the error fires. This is arguably more correct
    (the bytes WERE received before the abort) but worth flagging in case
    any downstream depends on the prior "discard on error" semantic.

Downstream context

This is the root-cause fix for the recv_response analog of [#NNN
(predecessor PR for recv_data graceful close)]. The recv_data path was
worked around at the application layer in commit dbc9e09 via a
drain_h3_response_body helper that re-implemented buffer-aware
recovery one level up. With this PR, that helper can be simplified —
the underlying decoder will already drain on its own — but I've kept
the helper unmodified to keep the change here minimal. Happy to follow
up with simplification once this lands if maintainers prefer.

The recv_response path is the one this PR specifically fixes: at that
boundary the application has no buffered bytes of its own to fall back
on, so the only fix is at h3's frame layer.

Refs

Fixes hyperium#338.

When a QUIC stack delivers stream data and a CONNECTION_CLOSE frame in
the same recv batch (common on Linux + io_uring with coalesced UDP
datagrams), the BufRecvStream::poll_read backing FrameStream returns
the connection error before the application has had a chance to decode
HEADERS / DATA frames whose bytes are already sitting in the local
buffer. The current `?` propagation in FrameStream::try_recv discards
those buffered bytes — clients see StreamError::ConnectionError instead
of the response they would otherwise have parsed.

This patch hoists the QUIC error out of `try_recv`'s `?` path. When
poll_read returns an error, it is cached for the current iteration; the
decoder is given one chance to consume `self.stream.buf_mut()` first.
If the decoder produces a frame, the application sees it. Only when no
frame can be decoded from the buffered bytes does the error surface.
The fix terminates: the cached error is consumed on the iteration in
which it was observed, so a partial-frame buffer cannot trigger an
infinite loop.

The same change is applied to FrameStream::poll_data to recover body
bytes that were already buffered before the close.

Two regression tests are added covering both paths:
  - poll_next_drains_buffered_headers_before_quic_close
  - poll_data_drains_buffered_body_before_quic_close

The FakeRecv test helper is extended with `chunk_then_error` so the
synthetic `StreamErrorIncoming::ConnectionErrorIncoming` can be queued
after a chunk — modelling the coalesced-datagram race directly.

jeremyjpj0916 added a commit to ferrum-edge/ferrum-edge that referenced this pull request Apr 27, 2026

The 502 at H3 `recv_response` is the symptom of an upstream h3-crate bug:
`FrameStream::try_recv` propagates a connection-level QUIC error via `?`
before the frame decoder gets to consume bytes already buffered in
`BufRecvStream::buf` from a prior wake. When io_uring batches a
`STREAM(FIN)` plus a `CONNECTION_CLOSE(H3_NO_ERROR)` into a single recv
(common for fast backends doing graceful shutdown), the buffered HEADERS
frame is silently discarded — the gateway sees `StreamError::ConnectionError`
and surfaces 502 even though the response was on the wire.

Vendor h3 0.0.8 with the patch from hyperium/h3#339 applied so each
poll iteration of `FrameStream::poll_next` / `poll_data` gives the decoder
one chance to drain buffered bytes before the cached QUIC error is surfaced.
The `[patch.crates-io]` block in the root `Cargo.toml` wires the vendored
copy into the build; the patch's regression tests live as inline unit tests
inside `vendor/h3-0.0.8-ferrum-patched/src/frame.rs::tests` and run via
`cargo test --manifest-path vendor/h3-0.0.8-ferrum-patched/Cargo.toml --lib frame`.

The gateway-side suppression in PR #506 stays — it is correct behaviour
independently of whether the underlying h3 library can recover the
response. The upstream patch eliminates the symptom (the 502 itself);
PR #506 ensures the gateway behaves correctly in the failure modes that
remain (e.g. when the close arrives before any HEADERS bytes are buffered,
which the upstream patch cannot recover either).

Lifecycle and retirement plan:
docs/upstream-h3-patches/001-recv-frame-drain-on-quic-close/README.md

jeremyjpj0916 added a commit to ferrum-edge/ferrum-edge that referenced this pull request Apr 27, 2026
… (#508)

* Suppress mark_h3_unsupported for H3_NO_ERROR graceful close at recv_response

PR #505 recovered graceful closes at the recv_data boundary: a complete
body followed by `CONNECTION_CLOSE(H3_NO_ERROR)` / GOAWAY now produces
a successful response instead of a 502. The complementary boundary —
recv_response — is unrecoverable: with no headers parsed, h3 cannot
synthesize a response, so the in-flight request 502s. But treating
that 502 as a transport-level capability failure is wrong. RFC 9114
§8.1 defines `H3_NO_ERROR` as the peer's spec-legal teardown signal,
and on Linux + io_uring the FIN+`CONNECTION_CLOSE` race fires
identically at recv_response when quinn surfaces the connection close
before h3 parses the buffered HEADERS frame in the same UDP datagram.
Before this fix, every backend that races teardown with response
delivery got H3 disabled for 24 h on the next request.

Two layers of plumbing (sketched after this list):
  - `H3PoolError::graceful_close` constructor + `is_graceful_close()`
    accessor preserve the typed signal end-to-end. The four
    recv_response `.map_err` sites in `Http3ConnectionPool` (do_request,
    do_request_streaming, do_request_streaming_body,
    do_request_streaming_incoming_body) call the existing
    `is_h3_graceful_close` helper (added in PR #505) and route the
    error through the new constructor when the peer signaled
    `H3_NO_ERROR` / `RemoteClosing`.
  - `ErrorClass::GracefulRemoteClose` is a new variant intentionally
    excluded from `is_h3_transport_error_class`. The classifier
    wrappers — `classify_h3_error` (server.rs) and `classify_h3_pool_error`
    (proxy/mod.rs) — short-circuit on `is_graceful_close()` and surface
    the new class. Every existing `mark_h3_unsupported` call site
    (server.rs streaming-body / streaming-response / buffered, plus
    proxy/mod.rs `current_dispatch_h3 && is_h3_transport_failure`
    and the inner H3 dispatch branch) automatically suppresses the
    downgrade via the existing helpers — no per-site guard needed.
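
A minimal sketch of that routing, using the names from this commit message (`H3PoolError`, `ErrorClass::GracefulRemoteClose`, `classify_h3_pool_error`). These are simplified stand-ins for ferrum-edge's internals, not the real signatures:

```rust
// Stand-in types: the real H3PoolError / ErrorClass / classifier live in
// ferrum-edge and carry far more state; only the graceful-close routing
// described in the list above is modeled here.

#[derive(Debug, PartialEq)]
enum ErrorClass {
    GracefulRemoteClose, // spec-legal H3_NO_ERROR teardown, not a capability failure
    ProtocolError,       // stand-in for a real transport/protocol fault
}

struct H3PoolError {
    is_graceful: bool,
    // underlying h3 error elided
}

impl H3PoolError {
    /// Constructor used at the recv_response .map_err sites when the peer
    /// signalled H3_NO_ERROR / RemoteClosing.
    fn graceful_close() -> Self {
        Self { is_graceful: true }
    }

    fn is_graceful_close(&self) -> bool {
        self.is_graceful
    }
}

/// Classes that should trip mark_h3_unsupported; GracefulRemoteClose is
/// intentionally excluded.
fn is_h3_transport_error_class(class: &ErrorClass) -> bool {
    !matches!(class, ErrorClass::GracefulRemoteClose)
}

/// Classifier wrapper: short-circuit on graceful close before the existing
/// classification logic runs.
fn classify_h3_pool_error(err: &H3PoolError) -> ErrorClass {
    if err.is_graceful_close() {
        return ErrorClass::GracefulRemoteClose;
    }
    ErrorClass::ProtocolError // stand-in for the existing classification
}

fn main() {
    let class = classify_h3_pool_error(&H3PoolError::graceful_close());
    assert_eq!(class, ErrorClass::GracefulRemoteClose);
    assert!(!is_h3_transport_error_class(&class)); // no 24 h H3 downgrade
}
```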

The recv_data Err path (close mid-body, body NOT yet complete) stays
on the transport-failure path: a truncated response IS a real
backend protocol fault and must downgrade. PR #505's
`drain_h3_response_body` already differentiates: complete-body close
recovers silently; incomplete-body close falls through.

Defensive: `pre_copy_disconnect_cause` in tcp_proxy gets an arm for the
new variant (TCP relays never produce it, but the exhaustive match must
stay complete).

Tests: 5 inline tests covering the H3PoolError constructor, the
classifier short-circuit, and post-wire fallthrough; existing
`error_class_log_kind_is_stable_for_every_variant` and the
serialization roundtrip test extended with the new variant.
3633 unit + 289 integration tests pass; clippy + fmt clean.

* Address review: preserve post-wire semantics for graceful H3 closes

P1 fix from review of PR #506: the H3 frontend buffered-response retry
loop derived `connection_error` from `err_class.is_some()`, which
over-reported every post-wire ErrorClass — including the new
GracefulRemoteClose — as a connection-level failure. That has three
consequences:

  1. `should_retry` treats connection errors as `retry_on_connect_failure`
     and bypasses `retryable_methods`, so a POST hitting graceful-close
     classification could be replayed even though the backend may have
     already processed it.
  2. Circuit-breaker `record_failure` would tally the close as a
     transport fault.
  3. `record_backend_outcome` would skip latency recording (treating it
     as not-on-the-wire).

Fix: introduce `h3_class_implies_connection_error` in `http3/server.rs`
that derives the bool via `retry::request_reached_wire(class)` — the
single boundary that already drives `BackendResponse::connection_error`
in every other dispatcher. Apply it at all four call sites (retry
decision, CB record_failure, retry warn log, record_backend_outcome
in both buffered and streaming paths). Pre-existing post-wire classes
(ConnectionReset, ProtocolError, ReadWriteTimeout) get the same
correction as a beneficial side effect — they were silently bypassing
`retryable_methods` before.
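
A hedged sketch of that derivation. The `ErrorClass` variants and `request_reached_wire` below are stand-ins for ferrum-edge's real `retry.rs` types, which are more granular; only the rule "connection_error means the request never reached the wire" is modeled:

```rust
// Stand-in types: not ferrum-edge's real ErrorClass / retry::request_reached_wire.

#[derive(Debug)]
enum ErrorClass {
    ConnectFailure,      // pre-wire: the request never reached the backend
    ConnectionReset,     // post-wire
    ProtocolError,       // post-wire
    ReadWriteTimeout,    // post-wire
    GracefulRemoteClose, // post-wire: H3_NO_ERROR teardown, not a fault
}

/// Did the request reach the backend before the error occurred?
fn request_reached_wire(class: &ErrorClass) -> bool {
    !matches!(class, ErrorClass::ConnectFailure)
}

/// connection_error should be reported only for pre-wire failures, so that
/// retry_on_connect_failure cannot bypass retryable_methods and replay a
/// request the backend may already have processed.
fn h3_class_implies_connection_error(class: Option<&ErrorClass>) -> bool {
    match class {
        Some(c) => !request_reached_wire(c),
        None => false,
    }
}

fn main() {
    assert!(h3_class_implies_connection_error(Some(&ErrorClass::ConnectFailure)));
    assert!(!h3_class_implies_connection_error(Some(&ErrorClass::GracefulRemoteClose)));
    assert!(!h3_class_implies_connection_error(Some(&ErrorClass::ConnectionReset)));
    assert!(!h3_class_implies_connection_error(Some(&ErrorClass::ProtocolError)));
    assert!(!h3_class_implies_connection_error(Some(&ErrorClass::ReadWriteTimeout)));
    assert!(!h3_class_implies_connection_error(None));
}
```

The asserts mirror the coverage of the regression tests described further down: post-wire classes, including the new GracefulRemoteClose, must not report connection_error.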

Also from review (low-priority follow-ups):

  - Refactor the four duplicated 9-line `.map_err` closures at the
    `recv_response` sites in `Http3ConnectionPool` into a single
    `recv_response_err` helper so future edits cannot drift the
    graceful-close detection between sites.
  - Add `GracefulRemoteClose` to the canonical
    `is_h3_transport_error_class_excludes_application_errors`
    regression test in `proxy/mod.rs`. A future change that
    accidentally adds the variant to `is_h3_transport_error_class`
    will now fail this test.
  - Document the retry-on-status operator trade-off in `docs/http3.md`:
    a graceful-close 502 reports `connection_error=false` and is NOT
    retried by default — operators who want to retry it must add
    `502` to `retryable_status_codes` (and keep the method in
    `retryable_methods`).

New regression tests: 4 inline tests for `h3_class_implies_connection_error`
covering None, GracefulRemoteClose, all pre-wire classes, and all
post-wire classes — the latter two cross-check against
`request_reached_wire` so a future shift between pre-wire and
post-wire in `retry.rs` cannot silently change the helper's meaning.

3633 unit + 289 integration tests pass; clippy + fmt clean.

* Document upstream h3 fix that produces the recv_response 502

PR #506's gateway-side suppression treats the SYMPTOM of a graceful
H3 close at recv_response (don't penalize backend capability, respect
retryable_methods, opt-in retry via retryable_status_codes). The 502
itself is a symptom of an h3-crate bug: FrameStream::try_recv
propagates a QUIC connection error before letting the decoder drain
bytes already buffered in BufRecvStream::buf.

Document the upstream fix as deliverable artifacts under
docs/upstream-h3-patches/001-recv-frame-drain-on-quic-close/:
  - h3-frame-rs.patch — unified diff against h3 0.0.8 (drains
    BufRecvStream::buf once before propagating the error in both
    poll_next and poll_data; extends FakeRecv with chunk_then_error
    to model the recv-batch race; adds two regression tests)
  - issue.md — bug report draft for hyperium/h3
  - pr-description.md — PR description draft
  - README.md — lifecycle: how to file upstream, optional vendoring,
    and the retirement plan when upstream merges

This commit does NOT apply the patch to our build. Vendoring external
crate source and pushing to fork branches were both denied by the
permission system in this session, even though the user explicitly
requested it. The artifacts are self-contained and the lifecycle
README has the hand-off steps for completing the upstream submission
and (optionally) vendoring locally with their own credentials.

Cross-references added to docs/http3.md's "Graceful close handling
at recv_response" section and CLAUDE.md's mark_h3_unsupported bullet
point so future readers find the upstream tracking from the runtime
behavior docs.

The gateway-side fix in PR #506 stands on its own merits regardless
of whether the upstream patch is ever applied — H3_NO_ERROR is not
a transport-level capability failure, period. The upstream fix is a
strict improvement (eliminates the 502), not a correctness
prerequisite.

* Vendor h3 with frame-drain-on-quic-close patch (tracks hyperium/h3#339)

The 502 at H3 `recv_response` is the symptom of an upstream h3-crate bug:
`FrameStream::try_recv` propagates a connection-level QUIC error via `?`
before the frame decoder gets to consume bytes already buffered in
`BufRecvStream::buf` from a prior wake. When io_uring batches a
`STREAM(FIN)` plus a `CONNECTION_CLOSE(H3_NO_ERROR)` into a single recv
(common for fast backends doing graceful shutdown), the buffered HEADERS
frame is silently discarded — the gateway sees `StreamError::ConnectionError`
and surfaces 502 even though the response was on the wire.

Vendor h3 0.0.8 with the patch from hyperium/h3#339 applied so each
poll iteration of `FrameStream::poll_next` / `poll_data` gives the decoder
one chance to drain buffered bytes before the cached QUIC error is surfaced.
The `[patch.crates-io]` block in the root `Cargo.toml` wires the vendored
copy into the build; the patch's regression tests live as inline unit tests
inside `vendor/h3-0.0.8-ferrum-patched/src/frame.rs::tests` and run via
`cargo test --manifest-path vendor/h3-0.0.8-ferrum-patched/Cargo.toml --lib frame`.

The gateway-side suppression in PR #506 stays — it is correct behaviour
independently of whether the underlying h3 library can recover the
response. The upstream patch eliminates the symptom (the 502 itself);
PR #506 ensures the gateway behaves correctly in the failure modes that
remain (e.g. when the close arrives before any HEADERS bytes are buffered,
which the upstream patch cannot recover either).

Lifecycle and retirement plan:
docs/upstream-h3-patches/001-recv-frame-drain-on-quic-close/README.md

* Address P1 review: streaming-path record_backend_outcome must use is_some()

Codex review on PR #506 caught that my second-pass fix
(h3_class_implies_connection_error at all four record_backend_outcome
sites) over-corrected the streaming-response site. There,
H3StreamResult.error_class carries terminal_error_class — populated by
proxy_to_backend_h3_streaming ONLY when the body aborts mid-stream
AFTER 2xx headers were already flushed to the client. Mid-body aborts
are real backend failures regardless of pre/post-wire classification,
but the helper would let them silently record as clean responses
(latency sample taken, CB sees a 200, passive-health unaffected).

Surgical revert: at the streaming-path site (server.rs:1891), use
`error_class.is_some()` directly. The buffered-path retry loop and its
final record_backend_outcome continue to use the helper because there
error_class only carries DISPATCH failures (status=502) where
pre/post-wire classification IS the right question.

Documented the scope distinction prominently:
  - On the helper itself: explicit "Scope" section warning the helper
    is only valid where error_class represents a dispatch failure
  - At the streaming site: full comment block explaining why this
    site differs from the buffered path
  - On the test module's intro: scope note + cross-reference to the
    new regression test

New inline test `streaming_path_predicate_treats_mid_body_abort_as_failure`
pins down the contract: ProtocolError / ReadWriteTimeout /
ConnectionReset / ConnectionClosed / ResponseBodyTooLarge — all the
classes proxy_to_backend_h3_streaming sets when body aborts — must
drive connection_error=true at the streaming-path site, even though
the helper says false (correct for ITS scope). A future refactor that
consolidates the two sites onto the helper will fail this test.

3633 unit + 289 integration tests pass; clippy + fmt clean.
Successfully merging this pull request may close these issues.

FrameStream::poll_next / poll_data discards already-buffered bytes when QUIC connection error arrives in same recv batch
