HTTP/3: cross-worker connection migration (SO_REUSEPORT CID steering) #72

@EdmondDantes

Goal

Make QUIC connection migration / NAT-rebind survive setWorkers > 1.

Single-worker migration already works (server commit 27f073c, RFC 9000 §9: feed ngtcp2 the real datagram source, sync conn->peer, drain on ps.path). But with a worker pool, each worker runs its own http3_listener_t with its own per-worker conn_map, and the kernel's SO_REUSEPORT group hashes by 4-tuple. When a client migrates (new source port → new 4-tuple), the rebound datagram can hash to a different worker that has no entry for that connection → it emits a stateless reset → the connection breaks.

This is the last open piece of the HTTP/3 migration story (roadmap #59, now closed).

Chosen approach — userspace forward (h2o / quicly style, not eBPF)

  • nginx steers with eBPF (bpf_sk_select_reuseport); h2o encodes a thread id in the CID and forwards misrouted packets between threads over an fd array. We take the h2o style: portable, no root/eBPF.
  • Encode the owning worker index into the SCID (the connection id the server mints): high nibble of byte 0, scid[0] = (worker_index << 4) | (rand & 0x0f). HTTP3_SCID_LEN == 8, so the nibble is free. Caps at 16 workers (matches quicly's thread_id:24-style field; widen the field if we ever need >16 H3 workers).
  • A short-header (1-RTT) datagram with no local conn whose DCID names a different worker → forward the raw datagram to the owner; the owner injects it into its own http3_connection_dispatch from its own reactor thread (preserves the single-thread-per-ngtcp2_conn invariant). Forwards are rare (only migrated conns).
  • Long-header / Initial packets (vc.version != 0) are never forwarded — those are brand-new handshakes the receiving worker legitimately owns; their DCID is client-chosen and carries no worker nibble.
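The encode/decode rule and the forward decision described above can be sketched as follows. This is a minimal illustration, not server code: `http3_dcid_decode_worker_index` is the helper name proposed in this issue, while `h3_scid_stamp_worker`, `h3_should_forward`, and `H3_MAX_STEER_WORKERS` are hypothetical names, and the `version` / `have_local_conn` parameters stand in for `vc.version` and the conn_map lookup result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HTTP3_SCID_LEN 8        /* value stated in this issue */
#define H3_MAX_STEER_WORKERS 16 /* hypothetical name: a 4-bit field names 16 owners */

/* Stamp the owning worker into the high nibble of byte 0 and keep the
 * low nibble random. Call after the SCID was filled with random bytes. */
static void h3_scid_stamp_worker(uint8_t *scid, size_t len, unsigned worker_index)
{
    assert(len > 0 && worker_index < H3_MAX_STEER_WORKERS);
    scid[0] = (uint8_t)((worker_index << 4) | (scid[0] & 0x0f));
}

/* Owner index encoded in a DCID, or -1 when the DCID is empty. */
static int http3_dcid_decode_worker_index(const uint8_t *dcid, size_t len)
{
    return len == 0 ? -1 : (int)(dcid[0] >> 4);
}

/* Forward only short-header datagrams (version == 0) that have no local
 * connection and whose DCID names a different, valid worker. Long-header
 * packets carry a client-chosen DCID and are never forwarded. */
static int h3_should_forward(uint32_t version, int have_local_conn,
                             const uint8_t *dcid, size_t dcid_len,
                             int my_worker, int worker_count)
{
    if (version != 0 || have_local_conn) {
        return 0;
    }
    int owner = http3_dcid_decode_worker_index(dcid, dcid_len);
    return owner >= 0 && owner != my_worker && owner < worker_count;
}
```

Because the nibble is attacker-controllable on forged packets, the `owner < worker_count` bound is what keeps the decoded index from addressing a non-existent worker.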

⛔ Blocker (the reason this is paused)

The original plan assumed the owner could drain a ThreadChannel non-blocking from inside the H3 poll callback. Verified in source — this is impossible as written:

  • zend_async_channel_t ABI (php-async Zend/zend_async_API.h:1925-1931) has only send / receive / close, and both send and receive can block:
    • thread_channel_send (thread_channel.c:95-115): pushes immediately when there is room, but on a full buffer registers a trigger and calls ZEND_ASYNC_SUSPEND() (needs a coroutine).
    • thread_channel_receive (thread_channel.c:142-185): pops immediately when data is present, but on empty suspends.
  • There is no try_send / receive_nowait / peek / count in the channel ABI (checked shipped php-release/include and live php-src/Zend).
  • http3_listener_poll_cb is a reactor event callback, not a coroutine → it cannot suspend. send() from the poll callback on a full buffer would crash, and that is flood-triggerable (forged short-header packets with a spoofed worker nibble).
  • The channel's own event base is not notified on send (only receiver_triggers of an already-parked receiver fire) → hanging a callback on channel.event won't wake on an incoming packet.
  • The only non-blocking access that exists today is direct circular_buffer_pop/push under ch->mutex (as php-async thread_pool.c:808-820 thread_pool_drain_tasks does). The buffer + mutex fields are public in thread_channel.h, but circular_buffer_* (internal/) and ASYNC_MUTEX_* (zend_common.h) are not shipped in php-release/include → unusable from the server today.

Resume options (pick one before coding)

  • (A) Extend ThreadChannel + drain coroutine. Add non-blocking async_thread_channel_try_send() / try_receive() to php-async thread_channel.{h,c} (lock + circular_buffer_*, drop-on-full for send, immediate-or-false for receive). Owner wakes via its own zend_async_trigger_event_t (uv_async_send, thread-safe, per-thread loop) and/or a parked drain coroutine. Keeps the documented "ThreadChannel + Z_PTR" direction and reuses the IS_PTR passthrough.
  • (B) Self-rolled per-worker MPSC ring + trigger event in the server, no ThreadChannel. Forward-hook pushes (drop-on-full) and fires the trigger; owner drains in a plain event callback (no coroutine). More idiomatic to this codebase's "no coroutine in internal machinery" rule, but reimplements a small thread-safe queue and leaves the IS_PTR passthrough unused.
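Both options need the same primitive: a bounded queue whose push drops on full and whose pop returns immediately on empty, so neither side ever suspends. A minimal sketch of that primitive, assuming a plain pthread mutex in place of ASYNC_MUTEX_* and a fixed array in place of circular_buffer_*; `fwd_ring_t`, its functions, and `FWD_RING_CAP` are hypothetical names, and the owner wake-up (trigger event) is deliberately left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define FWD_RING_CAP 256 /* hypothetical capacity */

typedef struct {
    pthread_mutex_t lock;
    void *slots[FWD_RING_CAP];
    size_t head;    /* index of the next item to pop */
    size_t count;   /* number of queued items */
    size_t dropped; /* pushes rejected because the ring was full */
} fwd_ring_t;

static void fwd_ring_init(fwd_ring_t *r)
{
    int rc = pthread_mutex_init(&r->lock, NULL);
    assert(rc == 0);
    (void)rc;
    r->head = r->count = r->dropped = 0;
}

/* Producer side (any worker): never blocks on a full ring. On false the
 * caller still owns the item and must free it. */
static bool fwd_ring_try_push(fwd_ring_t *r, void *item)
{
    bool ok = false;
    pthread_mutex_lock(&r->lock);
    if (r->count < FWD_RING_CAP) {
        r->slots[(r->head + r->count) % FWD_RING_CAP] = item;
        r->count++;
        ok = true;
    } else {
        r->dropped++;
    }
    pthread_mutex_unlock(&r->lock);
    return ok;
}

/* Consumer side (owning worker): returns false at once when empty. */
static bool fwd_ring_try_pop(fwd_ring_t *r, void **out)
{
    bool ok = false;
    pthread_mutex_lock(&r->lock);
    if (r->count > 0) {
        *out = r->slots[r->head];
        r->head = (r->head + 1) % FWD_RING_CAP;
        r->count--;
        ok = true;
    }
    pthread_mutex_unlock(&r->lock);
    return ok;
}
```

Drop-on-full is what makes the forged-nibble flood harmless here: an attacker can fill the ring, but the sender only increments a counter instead of suspending or crashing.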

Verified edit map (line numbers current as of branch 59-h3-finish-all)

  1. Worker index plumbing. pool_worker_ctx_t (server src/http_server_class.c:1620) holds only server_transit → add int worker_index. Fill ctxs[i].worker_index = i in the fill loop (1949). Stamp it onto the worker clone between LOAD and start() in pool_worker_handler (1639): Z_HTTP_SERVER_P(&server_zv)->worker_index = wctx->worker_index; (the clone is rebuilt fresh per worker — a raw field on http_server_object would be dropped).
  2. SCID encode: TWO sites. Initial SCID at src/http3/http3_connection.c:301-307 (after http3_fill_random, before the ngtcp2 copy/conn_map registration). And get_new_connection_id_cb at src/http3/http3_callbacks.c:89: stamp cid->data[0] before http3_packet_compute_sr_token (line ~103) so the SR token covers the final bytes. Missing the second site silently breaks steering after the client migrates to a server-issued CID. Decode helper http3_dcid_decode_worker_index(dcid, len) → dcid[0] >> 4 (or -1 if len is 0), add to http3_internal.h next to HTTP3_SCID_LEN.
  3. Shared channel array. It must live in http_server_shared_config_t (src/http_server_config.c:50-115, config->frozen) — the single pemalloc'd, atomically-refcounted snapshot whose pointer all workers share (TRANSFER config.c:2894 / LOAD config.c:2918). Not on http_server_object and not in http_server_view_t (those are per-worker copies and would each get a separate pointer). Populate from http_server_start_pool after freeze; free in the shared-config destructor (config.c:~2758). Access from H3 via http3_listener_server_obj(l) → config → frozen → worker_channels.
  4. Forward-hook. http3_connection_dispatch (src/http3/http3_connection.c:548): after ngtcp2_pkt_decode_version_cid (:563) and the conn_map lookup (:581), inside the conn == NULL && vc.version == 0 branch (:586/:592), before the stateless reset (:596): decode owner from vc.dcid[0] >> 4; if owner != my_worker && owner < worker_count, forward and return. peer is const struct sockaddr * backed by sockaddr_storage — the forward struct must copy bytes+len+ecn+sockaddr (the vc.dcid/data pointers point into the recvmmsg buffer, valid only synchronously). Forward payload is pemalloc, freed by the receiver after inject.
  5. Inject / drain. Owner pulls from its channel and calls http3_connection_dispatch(my_listener, …) + http3_listener_flush_dirty from its own reactor thread, then pefree. Drain entry near http3_listener_poll_cb end (src/http3/http3_listener.c:541, after flush_dirty) and/or a dedicated wake — depends on option A vs B above. The http3_listener_t struct is defined in http3_listener.c:66-176 (opaque in the header) — new fields go there.
  6. Test. workers=2 + h3client H3CLIENT_REQUEST_COUNT=2 H3CLIENT_MIGRATE_AFTER=1: both responses 200, owner worker quic_conn_accepted == 1, quic_path_migrations >= 1. Model on tests/phpt/server/h3/032-h3-connection-migration.phpt. Note: per-worker getHttp3Stats() visibility from the parent thread is uncertain with a pool — the primary signal is both requests succeeding across the rebind. New counters quic_forwarded_out / quic_forwarded_in should be added to http3_packet_stats_t (src/http3/http3_packet.h).
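Step 4 requires the forwarded payload to own a deep copy of everything the owner needs, since the receive buffer and the peer pointer are only valid synchronously. A minimal sketch of such a struct, assuming `malloc`/`free` as stand-ins for pemalloc/pefree; `h3_fwd_packet_t` and `h3_fwd_packet_clone` are hypothetical names, not existing server symbols.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

/* Self-contained copy of one datagram, safe to hand to another thread. */
typedef struct {
    struct sockaddr_storage peer; /* copied address, not a borrowed pointer */
    socklen_t peer_len;
    uint8_t ecn;
    size_t len;
    uint8_t data[];               /* datagram bytes follow the header */
} h3_fwd_packet_t;

/* Copies bytes + len + ecn + sockaddr into one allocation. Returns NULL
 * on allocation failure or an oversized address; the receiver frees the
 * result after injecting it. */
static h3_fwd_packet_t *h3_fwd_packet_clone(const uint8_t *data, size_t len,
                                            const struct sockaddr *peer,
                                            socklen_t peer_len, uint8_t ecn)
{
    assert(data != NULL && peer != NULL);
    if (peer_len > sizeof(struct sockaddr_storage)) {
        return NULL;
    }
    h3_fwd_packet_t *p = malloc(sizeof(*p) + len);
    if (p == NULL) {
        return NULL;
    }
    memset(&p->peer, 0, sizeof(p->peer));
    memcpy(&p->peer, peer, peer_len);
    p->peer_len = peer_len;
    p->ecn = ecn;
    p->len = len;
    memcpy(p->data, data, len);
    return p;
}
```

A single allocation per forwarded datagram keeps the ownership rule simple: whoever holds the pointer frees it, whether that is the owner after inject or the sender after a rejected push.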

Status

  • Paused. No server cross-worker code written (the map above came from a read-only analysis pass — nothing to revert).
  • php-async IS_PTR passthrough in thread_transfer_zval_inner (thread.c) is committed under true-async/php-async and referenced from this issue — it lets a ThreadChannel carry an opaque Z_PTR to a persistent packet struct with no copy/refcount. Benign and generally useful regardless of which resume option we pick.
  • The 6 commits on branch 59-h3-finish-all (single-worker migration + tests + HEAD-body fix + rejected-stream-leak fix + docs) are independent finished work, not part of this issue.
  • Environment caveat: WSL, loopback only, no sch_netem on the default kernel (a netem-enabled bzImage is staged but needs a wsl --shutdown to take effect) — lossy-path testing is limited.
