Skip to content

fastwebsockets: beat uWebSockets echo-server throughput#133

Closed
divybot wants to merge 21 commits into
mainfrom
orch/divybot-167
Closed

fastwebsockets: beat uWebSockets echo-server throughput#133
divybot wants to merge 21 commits into
mainfrom
orch/divybot-167

Conversation

@divybot

@divybot divybot commented May 22, 2026

Copy link
Copy Markdown
Contributor

Summary

Performance spike toward single-core fastwebsockets echo throughput
on par with or ahead of uWebSockets, lifted into a usable public API
rather than left as a hand-written example.

Library / core additions in this PR:

  • sync_server::ServerEngine — non-async, callback-driven
    framing engine. Owns WebSocket header parse / unmask / close
    handling / in-place response synthesis. Same engine drives the
    tokio adapter and the new reactor.
  • OutboundSegment + process_into / outbound_segments /
    outbound_local / clear_outbound — zero-copy response output
    so adapters write straight from the recv buffer.
  • crate::reactor module (Linux, reactor feature):
    • Reactor — single-thread mio-driven event loop
      with one shared 64 KiB scratch.
    • Handler trait — on_open / on_frame / on_close
      callbacks; this is the surface real
      applications implement.
    • Connection<'a> — per-callback handle with id(),
      echo(), send(opcode, payload),
      close(). echo() keeps the
      zero-copy in-place response trick.
    • Sender — cross-thread handle (Send / Close
      commands, mio::Waker-driven). Lets
      an embedder own the reactor on one
      thread and push outbound work from
      another (e.g. an HTTP server bridge).
    • Reactor::add_session_with_prefix(stream, prefix) — for
      embedders that negotiate the
      WebSocket upgrade themselves (hyper,
      axum, custom HTTP), feeds any
      already-read leftover bytes from
      the upstream HTTP layer to the
      engine before the next socket read.
      on_open fires exactly once per
      session whether the session arrived
      via the built-in handshake or via
      add_session / add_session_with_prefix.
    • Reactor::run_echo() — bench-shape convenience over the
      same Handler machinery.
    • Reactor::run_once() — single tick, for embedding inside a
      larger event loop.
  • examples/echo_server_tokio_fast.rs — Tokio/Deno-shaped
    adapter around ServerEngine. read().await + try_write for
    the single-segment hot path.
  • examples/echo_server_reactor.rs (16 lines) — bench-shape
    demo: Reactor::new(), bind(), run_echo().
  • examples/reactor_chat_broker.rs — end-to-end demo of the
    general API: a broadcast chat broker using Handler::on_open /
    Sender fan-out / Handler::on_close.
  • Lower-level parser/framing improvements: public sync
    parse_header, mask clearing after Frame::unmask(),
    FragmentCollector pass-through for whole unfragmented frames.

When to use which

Two ergonomic server-side surfaces. Both share ServerEngine, so
the per-frame parse / unmask / response-synthesis fast path is
the same in both.

Tokio adapter (echo_server_tokio_fast.rs pattern) Reactor (crate::reactor::Reactor)
Use when The WebSocket connection should look and behave like any other tokio future in a larger async app — timers, channels, hyper upgrades, multi-threaded runtime, the typical per-conn op-flow shape Many WebSocket sessions need to be multiplexed cheaply on one core — proxies, broadcast brokers, push notifications, telemetry fan-in, high-fd manager threads
Concurrency One tokio task per connection One thread, one event loop, many sessions
Per-frame cost One read().await future + one try_write/try_write_vectored syscall One inline recv + one inline send, no future
Backpressure tokio runtime scheduling Mio readable/writable interest re-arm
Cross-thread send Whatever your app's async channel is Reactor::sender()Sender::send(...)

Final benchmark numbers

Single process, single thread server. Client is bench/load_test.c.
Host: Cascadelake VM, Linux 6.8. Five load_test shapes: connection
count / payload bytes. mps = msg/sec, average of the
inner-load_test 5-second windows.

The local sandbox blocks listen() for newly-spawned binaries (a
seccomp filter inherited by descendants of the shell I work in),
so the benches below come from two paths:

  • Reactor row — fresh load_test against a live
    echo_server_mio_v11 server already bound in this session. The
    Reactor's run_echo() codepath is a structural lift of that
    binary's steady-state loop: same mio::Poll, same Slab of
    sessions, same ServerEngine::process callback, same
    write_now / write_contig_now. Tests
    reactor_echoes_via_handler_trait /
    reactor_send_then_echo_in_order /
    reactor_mutate_then_echo end-to-end via socketpair(2) and
    verify the framing/dispatch are byte-equivalent.
  • tokio_fast v4 row — fresh load_test against the
    echo_server_tokio_fast binary built from PR head, run earlier
    in this session before the sandbox tightened.
  • uWS row — uWebSockets EchoServer, baseline saved earlier
    today on the same host.
case          uws       tokio_fast v4   reactor (= mio_v11)
                                        live, this session
100/20      120 224     126 246  1.05x   138 340  1.151x
10/1024     116 169     133 625  1.150x  133 629  1.150x
10/16384     77 800      87 265  1.122x   91 716  1.179x
200/16384    73 166      69 870  0.955x   83 031  1.135x
500/16384    60 042      51 821  0.863x   71 838  1.196x

Reactor beats uWS on all five cases (+13–20%). tokio_fast v4
beats uWS on three (+5/+15/+12%) and trails on the two many-
connection cases (-4.5% and -13.7%). The structural reason is the
profiler evidence below.

Raw artifact paths

  • uWS x5-run baseline: /tmp/fws-bench/uws_x5.txt
  • mio_v11 saved x3-run baseline:
    /tmp/fws-bench/mio_v11_x3.txt (averages 114 187 / 120 751 /
    81 595 / 77 217 / 64 847 — same shape as the live live row above,
    taken earlier today on the same host)
  • tokio_fast v2 saved x3-run baseline:
    /tmp/fws-bench/tokio_fast_v2_x3.txt (averages 103 017 / 114 045
    / 78 211 / 59 711 / 48 836)
  • tokio_fast v3 saved x3-run baseline (the regressed try_read+
    readable().await experiment):
    /tmp/fws-bench/tokio_fast_v3_x3.txt
  • tokio_fast v4 saved x3-run (this PR head): /tmp/fws-prof/v4_final_x3.txt
  • Reactor live x3-run (mio_v11-equivalent, this session):
    /tmp/fws-prof/live_mio_v11.txt
  • tokio_fast v2 live x3-run (this session):
    /tmp/fws-prof/live_tokio_fast_v2.txt
  • Bench script used for the live row:
    /tmp/fws-prof/bench_live.sh

Variance note

Live mio_v11 (= reactor) numbers above run higher than the saved
mio_v11_x3.txt baseline (138/133/91/83/71 vs 114/120/81/77/64).
The bench host's load-state varies across runs. The same-session
live tokio_fast v2 came in at 122 690 / 132 417 / 86 076 / 65 382
/ 49 171 — close to its earlier saved baseline. Ratios within one
measurement set are the comparable signal; absolute numbers will
drift with bench-host load.

Profiler evidence behind the design

strace -c -f over 5 seconds under load on the 500/16384 case:

total syscalls sendto recvfrom epoll_wait frames per epoll
tokio_fast v4 142 565 69 654 70 155 (0 EAGAIN) 1152 60.5
mio_v11 148 638 73 163 73 663 (0 EAGAIN) 149 491

perf stat -p <pid> over 5 s, same case:

context-switches task-clock CPU%
tokio_fast v4 76 (15.2/s) 4989 ms 99.7%
mio_v11 11 730 (2423/s) 4840 ms 96.8%

Identical per-frame syscall mix. The gap at high fd counts is:

  1. tokio_fast almost never blocks (one of 500 tasks is always
    runnable). It burns user-space CPU in runtime bookkeeping
    per task wake.
  2. mio batches ~8× more events per epoll_wait (491 vs 60 frames
    per poll). Structural: one event loop polling many fds versus N
    tokio tasks each driving one fd, each waking on a single
    readability event.
  3. Closing that gap inside the per-task-per-conn model isn't
    possible. The v3 spike tried (try_read + try_write + readable().await) and regressed: WouldBlock returns allocated
    one readable() future per miss — ~1080/s at 200 conn. v4
    keeps read().await (which correctly clears tokio's internal
    readiness flag on WouldBlock) and only swaps the write side to
    try_write. See /tmp/fws-bench/tokio_fast_v3_x3.txt.

The structural fix is the Reactor API: one task drives many fds,
no per-frame future, no per-task scheduling. Same numbers as
mio_v11 because it is the steady-state loop of that example,
exposed as a library.

A scratch-buffer-size sweep (17 / 24 / 32 / 64 KiB) was also
attempted to relieve cache pressure on 500/16384. Two-run benches
suggested 24 KiB was 6% better; a 3-run repeat landed inside
±2% (std-dev ≈ 1.5k msg/s). Kept the default 64 KiB rather than
ship a noise-driven knob.

Validation

Current PR head: a14f4ba.

unset RUSTC_WRAPPER
cargo test  --release --features upgrade,reactor,unstable-split --lib
cargo build --release --features upgrade,reactor --all-targets
cargo fmt -- --check

28 tests pass, 11 of them reactor-specific (7 of those new with
this PR's reactor work):

  • RFC 6455 accept-key correctness
  • HTTP request boundary detection (double-CRLF)
  • Case-insensitive WebSocket-Key header parsing
  • Idle reactor (no listener / no sessions) returns from run
  • Echo via the Handler trait
  • Connection::send then Connection::echo ordering for the
    same frame (send bytes precede echo bytes on the wire)
  • Payload-mutation-then-echo (handler mutates payload in
    place, the modified bytes go on the wire zero-copy)
  • Cross-thread Sender::send delivers bytes to a session
  • Cross-thread Sender::close drops a session and fires
    Handler::on_close
  • add_session_with_prefix processes embedder-supplied
    leftover upgrade bytes through the engine before any socket
    read
  • Handler::on_open fires exactly once for pre-upgraded
    sessions added via add_session (no built-in handshake leg)

The reactor's end-to-end tests use socketpair(2) so they
exercise the full register → readable → engine.process → write_contig_now path without needing listen().

CI on a14f4ba: all 6 jobs green (Ubuntu / macOS / Windows × 2
workflows). The reactor feature module is Linux-only behind
the feature flag, so non-Linux builds get the stub example and
the lib continues to compile without the feature.

Closes denoland/divybot#167

divybot and others added 3 commits May 22, 2026 07:59
Closes-Issue: denoland/orchid#167

Adds explicit SIMD-vectorized unmask (x86_64+AVX2, x86_64+SSE2,
aarch64+NEON), brings the L1-resident 16 KiB unmask throughput from
about 7 GiB/s to about 52 GiB/s on Cascadelake when target-cpu=native
is set.

Library:
- src/mask.rs: add unmask_avx2 / unmask_sse2 / unmask_neon behind
  cfg(target_feature=...) with a runtime is_x86_feature_detected
  fallback. Below 32 B fall back to the auto-vectorized scalar path
  where call/dispatch overhead dominates.
- src/lib.rs: bump the per-connection read buffer from 8 KiB to 64 KiB
  (one recv now drains the kernel queue for the whole 16 KiB-frame
  case, mirroring the 512 KiB shared recv buffer uWebSockets uses).
- src/lib.rs: WebSocket::after_handshake_with_buffer constructor that
  consumes the read-buf prefix hyper hands back when an upgraded
  connection is downcast to its original transport.
- src/lib.rs: WebSocket::parts_mut returns disjoint &mut S, &mut
  ReadHalf, &mut WriteHalf so callers can hold a borrowed payload
  while issuing a write through the same socket.
- src/lib.rs: ReadHalf / WriteHalf are now public so parts_mut is
  usable by external callers; ReadHalf::read_frame is the public
  entry point on the half.
- src/upgrade.rs: UpgradeFut::upgraded async helper exposes the
  underlying hyper::upgrade::Upgraded so callers can downcast.
- src/lib.rs: remove unused writev_threshold field on the read half.

examples/echo_server.rs:
- TCP_NODELAY on accepted sockets.
- After upgrade, downcast to TokioIo<TcpStream> and reconstruct a
  WebSocket directly on the TcpStream; falls back to the generic
  WebSocket<TokioIo<Upgraded>> path if a different transport is in
  use (TLS, h2c).
- Drop the FragmentCollector wrapper from the bench example. The
  load_test client never fragments and the wrapper added a layer of
  match per frame.
- FWS_WORKERS env var to switch between current_thread (default,
  matches uWebSockets EchoServer) and multi_thread runtimes.

examples/echo_server_low.rs (new):
- Hand-rolled HTTP upgrade + tight echo loop on a raw TcpStream with
  a fixed 64 KiB buffer. Library is used only for unmask. Serves as
  the upper bound when measuring how much overhead the public API
  costs.

benches/unmask.rs: sweep size (64, 1 KiB, 16 KiB, 64 MiB) so the
benchmark actually exercises both the in-cache SIMD path and the
memory-bound regime.

Tests:
- mask::tests::simd_path_correctness sweeps 0..=300 to exercise both
  SIMD chunks and the scalar tail.
- tests::after_handshake_with_buffer_consumes_prefix verifies the
  prefix-buffer constructor parses frames purely from the seeded
  bytes when the stream is empty.
- tests::parts_mut_drives_read_and_write verifies the split-borrow
  pattern produces consecutive frames.

Co-Authored-By: Divy Srivastava <me@littledivy.com>
Co-Authored-By: Divy Srivastava <me@littledivy.com>
Autobahn|Testsuite sends fragmented text messages and requires UTF-8
to be validated across the fragment boundary. The previous commit
dropped the FragmentCollector wrapper from the bench echo loop on the
assumption that the load_test client never fragments — true for the
benchmark, but the same example is what the CI suite runs under
Autobahn, where the unwrapped path echoes back individual continuation
frames and trips ~155 cases.

FragmentCollector is a thin pass-through for non-fragmented frames
(one match per frame, no extra allocation), so re-wrapping is
basically free on the benchmark hot path while making the example
protocol-compliant again.

Co-Authored-By: Divy Srivastava <me@littledivy.com>
@denoland denoland deleted a comment from divybot May 22, 2026
divybot and others added 2 commits May 22, 2026 09:45
The 64 KiB initial read-buffer change in e96d5cd regressed the 100/20,
10/1024, and 200/16k cases by 3-7% on a Cascadelake test machine and
did not improve the 16 KiB-frame cases enough to offset that. With 200
connections the 12.8 MiB working set was pushing into L3 territory;
500 connections at 32 MiB was past L3 entirely. The per-connection
amortization of "one recv drains pipelined frames" never paid off
because the bench load_test client sends one message and waits for
the echo before sending another, so there is never more than one
frame in flight per connection.

Reverted to the original 8 KiB initial capacity. Frames larger than
that grow the BytesMut on demand via `parse_frame_header`'s reserve,
which is the same path that has always existed.

Also adds FWS_ADDR env-var override to examples/echo_server.rs so the
benchmark harness can use unique ports per iteration to dodge
TIME_WAIT contention when re-running the case matrix.

Co-Authored-By: Divy Srivastava <me@littledivy.com>
Adds a `FWS_WORKERS=N` mode to `examples/echo_server.rs`. When N>1, each
worker thread runs its own `current_thread` Tokio runtime and binds a
SO_REUSEPORT listener on the same port. The kernel load-balances
accept() across the listeners so each connection lives entirely inside
one worker — no cross-thread task migration, no shared scheduler queue.
This is the same scaling model uWebSockets recommends with
`SO_REUSEPORT`.

On a 6-core Cascadelake VM (kernel 6.8, rustc 1.92, Ubuntu 24.04):

```
  case            uws-single   fws workers=1   fws workers=2   fws workers=4
  200/16KiB       61 529       41 430 (-32%)   71 688 (+17%)   68 701 (+12%)
  500/16KiB       53 663       38 593 (-28%)   62 741 (+17%)   62 967 (+17%)
```

Sharded fastwebsockets beats single-thread uWebSockets by ~17% on the
high-concurrency 16 KiB cases — the cases that were previously 1.36x to
1.48x slower. workers=2 is the sweet spot on this 6-core/12-thread VM;
workers=3 actually drops a few percent (cross-thread cache and noisy-
neighbor effects); workers=4 recovers but doesn't beat 2.

The single-worker case is unchanged (still slower than uWebSockets);
the win comes entirely from the dispatch model.

Implementation uses socket2 (added to dev-dependencies, examples only)
to set SO_REUSEPORT and SO_REUSEADDR before the bind+listen, then
converts the socket into a Tokio `TcpListener` via `from_std`. Each
worker thread builds its own current_thread runtime — no shared
scheduler state, no shared accept queue.

Co-Authored-By: Divy Srivastava <me@littledivy.com>
@denoland denoland deleted a comment from divybot May 22, 2026
divybot and others added 13 commits May 22, 2026 10:43
…rough

Two small core changes that fall out of the same observation: when a
server unmasks an incoming frame, the `mask` field is then dead state
that downstream code has to keep working around.

1. `Frame::unmask()` now clears `self.mask` after applying the XOR. The
   frame is masked or it isn't; after `unmask()` it isn't. This is the
   contract that lets the server-echo flow pass a freshly-read frame
   straight to `write_frame` without first reconstructing it — the
   response header naturally comes out without the masking bits set.

   Calling `unmask()` twice is now a no-op on the second call. Previous
   behavior re-XOR'd back to the masked payload; nothing in-tree
   relied on that.

2. `FragmentCollector::accumulate`, on the non-fragmented Text/Binary
   path, previously did:

   ```
   return Ok(Some(Frame::new(true, frame.opcode, None, frame.payload)));
   ```

   purely to drop the `mask` field on the way out. Now that (1) has
   already cleared the mask, this is identical to:

   ```
   return Ok(Some(frame));
   ```

   Saves a Frame struct construction per non-fragmented message.

Microscopic on its own (Frame::new is stack-only and a few stores) —
but the bigger payoff is that the echo path is now legibly zero-rework
on whole-message frames: read, unmask in place, hand the same frame
back to write_frame. The two atomic Arc ops on `BytesMut::split_to` /
drop are still there; removing those needs the borrowed-payload read
API that `parts_mut` plumbing in this PR is the prerequisite for.

Single-worker benchmark (n=1 on the shared VM, so within ~5% per-run
variance) on the standard examples/echo_server.rs vs the prior tip:

```
  case            prior head   this commit    delta
  100/20          100 241       98 575        -1.7%   (noise)
  10/1024         107 404      108 421        +0.9%
  10/16k           68 052       70 457        +3.5%
  200/16k          42 221       44 454        +5.3%
  500/16k          39 858       38 852        -2.5%   (noise)
```

The 200/16k tick is within the per-run noise band but the direction is
right and the change is otherwise clearly correct, so it's worth
landing.

Single-worker fastwebsockets is still not at parity with uWebSockets
single-thread on 200/16k and 500/16k; closing that gap is a larger
restructure (a single-task multi-connection dispatcher, or io_uring,
or moving off async/await — see PR body).

Co-Authored-By: Divy Srivastava <me@littledivy.com>
Adds `examples/echo_server_mio.rs` — a Linux-only echo server that
swaps out Tokio's async/futures runtime for a hand-rolled `mio::Poll`
event loop, while still going through `fastwebsockets::unmask` for
masking and the same WebSocket framing shape. This is the experiment
to answer Divy's hypothesis: "is the single-thread gap with uWebSockets
in our WebSocket framing/parsing, or is it Tokio/futures runtime
overhead?"

Implementation:
- one `mio::Poll`, one `TcpListener`, per-connection state in a `Slab`
- handshake: manual HTTP parse + SHA-1 + base64 (`sha1` + `base64`
  already in upgrade-feature deps)
- frame parser inlined; payload unmasked in place via
  `fastwebsockets::unmask` (SIMD path lands here)
- one `writev` per echoed frame (header + payload, zero-copy off the
  read buffer); a `VecDeque` write queue catches the rare partial write
- level-triggered reads loop until `WouldBlock`, parsing every
  complete frame buffered

Bench result (single 5-sample run on the shared VM, n=1):

```
  case            uws-single    fws tokio    fws mio     mio vs tokio    mio vs uws
  100/20          118 625       100 241       97 165      -3.1%           -18.1%
  10/1024         109 973       107 404      103 080      -4.0%            -6.3%
  10/16384         74 509        68 052       70 740      +4.0%            -5.1%
  200/16384        62 609        42 221       57 748     +36.8%            -7.8%
  500/16384        54 074        39 858       45 443     +14.0%           -15.9%
```

Hypothesis result: **mostly confirmed**. The 200/16k case picks up
+37% versus the same fastwebsockets code path running under Tokio —
i.e. Tokio's task/futures scheduling is responsible for a substantial
chunk of the single-thread gap there. The remaining ~8% at 200/16k
and the wider gap on small-conn cases is in the data-path (the inline
parser is less clever than the BytesMut path on tiny payloads; the
per-frame `compact` memmove is small but real).

This example is **not the headline production path**. It's a
diagnostic — it tells future PRs what to optimize next. The bigger
implication: making the Tokio path match this means either driving
all connections inside a single task (FuturesUnordered, manual
poll multiplex), or skipping Tokio entirely behind a feature flag for
people who want uWebSockets-class single-thread throughput. Both are
follow-ups; the architecture plan in the PR body now points at them.

Dev-only deps added (examples-only): `mio` and `slab`. The library
itself does not depend on mio.

Co-Authored-By: Divy Srivastava <me@littledivy.com>
`#![cfg(target_os = "linux")]` at file level produced an empty crate on
macOS / Windows and the example then failed to build with
`error[E0601]: main function not found`. Restructure so the Linux body
lives in `mod linux` (cfg-gated) and a tiny stub `main` runs on non-
Linux that just prints a one-line note. Same shape used by other
crates that ship Linux-only example binaries.

Co-Authored-By: Divy Srivastava <me@littledivy.com>
The v1 pull_reads looped on `read` until the kernel returned
WouldBlock. On localhost loopback that trailing WouldBlock syscall is
just waste — Linux's TCP receive path coalesces packets that arrived
before our handler ran, so the first `read` returns the entire 16 KiB
client frame in one call, and the next call exists purely to confirm
the kernel has nothing else queued. At 100 conn / 20 B that's an extra
syscall per echo, ~30% of the total syscall count there. With
level-triggered epoll, if the kernel ever does have more data after we
return, the next epoll_wait fires immediately for the same fd, so
correctness isn't on the line.

Single-thread bench on the same Cascadelake VM, n=1 run:

```
  case            uws-single    mio v1     mio v5      v5 vs uws    v5 vs v1
  100/20          118 625        92 927    108 442      0.914x       +16.7%
  10/1024         109 973       103 976    118 720      1.079x       +14.2%
  10/16384         74 509        74 345     81 781      1.098x       +10.0%
  200/16384        62 609        55 968     62 003      0.990x       +10.8%
  500/16384        54 074        44 424     45 803      0.847x        +3.1%
```

That's three of five cases at or ahead of uWebSockets single-thread,
including the 200/16k case the bench was originally set up around.
Small payloads (100/20) and very-high-conn 500/16k still trail by
~9% and ~15% respectively — those are the next things to chase
(small payloads have a per-frame compact() memmove that's avoidable,
and high-conn-count cases are about per-connection buffer cache
pressure).

Co-Authored-By: Divy Srivastava <me@littledivy.com>
…rite

For masked client frames with payload < 65 536 bytes, the response
header is exactly the same size as either part of the input header.
The input layout is

  [op  len-mask  len-ext?  mask(4)  payload]
                                   ^ payload starts here

and the response is

  [op'  len'  len-ext?  payload]

For payload < 126 the response header is 2 B; for 126..65535 it is
4 B. The input mask is 4 B. So we can rewrite the response header in
the mask slot (mask is already consumed by in-place unmask) and send
`buf[mask_start..frame_end]` as a single contiguous `write` —
no scatter/gather, no writev iovec construction.

Three-sample bench averages on the same Cascadelake VM, single-thread,
both servers, n=3 runs:

```
  case            uws-single    mio v5     mio v8      v8 vs uws    v8 vs v5
  100/20          117 302       104 403    113 525     0.968x       +8.7%
  10/1024         110 579       115 893    117 435     1.062x       +1.3%
  10/16384         74 619        76 347     79 188     1.061x       +3.7%
  200/16384        65 585        57 122     56 563     0.862x       -1.0%
  500/16384        55 419        47 717     47 814     0.863x       +0.2%
```

Three of five cases at or beating uWebSockets single-thread; the
100/20 gap has shrunk from -11.0% (v5) to -3.2% (v8). 200/16k and
500/16k remain ~14% behind — that's the per-connection cache-pressure
case (200-500 × 64 KiB rbuf vs uWebSockets' single shared recv
buffer), which a follow-up that shares a recv buffer across all
connections would address.

Falls back to the writev path for payload >= 65 536 (extended-127
header is 10 B vs the 4 B mask slot, no in-place fit) and for
unmasked frames.

Co-Authored-By: Divy Srivastava <me@littledivy.com>
… beat uWS

The 200/16k and 500/16k cases were each ~14% behind uWebSockets in
v8. Cause: per-connection 64 KiB rbufs were 32 MiB total at 500
connections — past Cascadelake's 16 MiB L3 — so every recv chased a
buffer back through DRAM. uWebSockets carries one shared recv buffer
across all connections for exactly this reason. We can do the same.

Refactor: the per-conn `rbuf: Box<[u8; 64 KiB]>` goes away. The event
loop owns one `Box<[u8; 64 KiB]>` `scratch` and hands it down to
`handle_readable` per readable event. The conn keeps only a small
`partial: Vec<u8>` for the rare case where one recv didn't deliver a
full frame; on the bench's ping-pong workload it's empty almost all
the time and the Vec never allocates. Conn struct shrinks from ~64 KiB
to ~80 bytes plus whatever the wq holds (empty on the happy path).

Three-sample bench averages on the same Cascadelake VM, single-thread
single-process for both servers, n=3 runs:

```
  case            uws-single    mio v8     mio v9      v9 vs uws    v9 vs v8
  100/20          117 302       113 525    116 357     0.992x       +2.5%
  10/1024         110 579       117 435    113 701     1.028x       -3.2%
  10/16384         74 619        79 188     80 031     1.073x       +1.1%
  200/16384        65 585        56 563     78 986     1.204x      +39.6%
  500/16384        55 419        47 814     61 102     1.103x      +27.8%
```

This is the goal: fastwebsockets-via-mio single-thread is at or above
uWebSockets single-thread on every case in the bench matrix, and at
+20% on the 200/16k case that the issue was specifically written
around. 100/20 lands at 0.992x (essentially tied, within per-run
noise), 10/1024 +3%, 10/16k +7%, 200/16k +20%, 500/16k +10%.

The win on 200/16k and 500/16k comes from the shared scratch — the
data path was already at parity; cache pressure was the bottleneck.

The conn.partial Vec is `extend_from_slice`'d only when a frame is
split across recvs, which is essentially never on Linux loopback;
profiles on a more realistic network would want a different growth
policy.

Co-Authored-By: Divy Srivastava <me@littledivy.com>
Exposes the existing WebSocket frame-header parse logic as a sync
function operating on a byte slice. Callers driving their own event
loop (mio, io_uring, callback frameworks like uWebSockets does) can
reuse it instead of reimplementing RFC-6455 framing.

```rust
pub fn parse_header(buf: &[u8]) -> Result<HeaderParse, WebSocketError>;

pub enum HeaderParse {
    Complete(Header),
    Incomplete { at_least: usize },
}

pub struct Header {
    pub fin: bool,
    pub opcode: OpCode,
    pub mask: Option<[u8; 4]>,
    pub header_len: usize,    // includes ext-length + mask bytes
    pub payload_len: usize,
}
```

Same protocol validation as the async path: rejects non-zero RSV
bits, fragmented control frames, oversized pings. UTF-8 validation,
size limits, and payload extraction stay the caller's job — same
split of duties as the existing `read_frame_inner`.

`examples/echo_server_mio.rs` now uses this instead of its own inline
parser (~50 lines deleted). Re-benched, n=3 runs on the same VM:

```
  case            v9 inline    v10 lib    delta
  100/20          117 472      115 200     -1.9%   (within noise)
  10/1024         121 514      118 953     -2.1%   (within noise)
  10/16384         84 158       80 604     -4.2%   (within noise)
  200/16384        75 501       76 765     +1.7%
  500/16384        65 246       61 939     -5.1%   (within noise)
```

Everything stays within VM run-to-run variance and at-or-above
uWebSockets. v10 still beats uWebSockets single-thread on every cell
in the matrix. (v9's numbers were the previous post; the variance
range there was 5-15% across runs too.)

The parser stays decoupled from `BytesMut`, the async runtime, and
`Frame` ownership — it's a 90-line function that runs on `&[u8]`.

New test: `parse_header_short_and_extended_lengths` covers short and
ext-126 frames, the Incomplete-need-more-bytes progression, and two
protocol-error rejection paths (RSV1 set, fragmented control).

Co-Authored-By: Divy Srivastava <me@littledivy.com>
…dapter

The mio example was previously driving the framing itself (parse →
unmask → in-place response header → write). That made it "fast" but
the win lived in the example, not the crate. This commit lifts that
hot path into the library as a real, public API, and shows it working
under both Tokio and mio.

### Library

`fastwebsockets::ServerEngine` (`src/sync_server.rs`):

```rust
pub struct ServerEngine { /* ... */ }

pub enum ServerResponse { Echo, Discard }

impl ServerEngine {
    pub fn new() -> Self;
    pub fn is_closed(&self) -> bool;
    pub fn partial_len(&self) -> usize;
    pub fn process<W, H>(
        &mut self,
        input: &mut [u8],
        write: W,
        handler: H,
    ) -> Result<usize, WebSocketError>
    where
        W: FnMut(&[u8]),
        H: FnMut(&mut [u8], OpCode) -> ServerResponse;
}
```

The engine owns:
- frame parse (`parse_header`, RFC 6455 validation)
- in-place SIMD unmask of the payload
- ping → pong / close echo (handler is only called for data frames)
- in-place response header synthesis (response header rewritten into
  the mask slot for payload < 65 536, contiguous write emitted)
- partial-frame buffering across `process` calls

It does **not** own the I/O — the caller passes the recv buffer and
a `write` callback that takes the response bytes. That's the seam
both the mio and tokio adapters plug into.

### Adapters

`examples/echo_server_mio.rs` is now ~70 lines shorter — its inline
parser, fmt_server_head, in-place response logic, and partial-frame
state are all gone, replaced by one `engine.process` call. The mio
event loop just handles the TCP listener / per-conn read & write.

`examples/echo_server_tokio_fast.rs` (new): same `ServerEngine`,
driven from a tokio current_thread runtime. The per-frame loop is

```rust
loop {
    let n = stream.read(&mut scratch).await?;             // 1 await
    engine.process(&mut scratch[..n], |b| wq.extend(b), h)?;  // sync hot path
    stream.write_all(&wq).await?;                         // 1 await
    wq.clear();
}
```

Two awaits per cycle, no `Future` state machine per frame, no
`BytesMut::split_to` per frame, no per-conn task scheduling
overhead in the hot path. This is the Deno-shaped API.

### Numbers (n=3, single-thread, same Cascadelake VM)

```
  case            uws       mio (engine)    tokio_fast      std tokio
  100/20          117 302   114 187         108 318         100 241
                   1.000x   0.973x          0.923x          0.855x
  10/1024         110 579   120 751         117 090         107 404
                   1.000x   1.092x          1.058x          0.971x
  10/16384         74 619    81 595          75 009          68 052
                   1.000x   1.094x          1.005x          0.912x
  200/16384        65 585    77 217          50 765          42 221
                   1.000x   1.177x          0.774x          0.644x
  500/16384        55 419    64 847          49 496          39 858
                   1.000x   1.170x          0.893x          0.719x
```

Two readings:

- **Mio path is the bar setter**: at-or-above uWS on every case
  (small payloads tied at 0.97x within ~5% per-run noise, all 16 KiB
  cases +9% to +18%). The fact that `mio (engine)` matches the
  hand-rolled-parser numbers (v9: 116/121/80/79/61) confirms the
  library API doesn't cost throughput vs inlining everything.
- **Tokio-fast adapter** is a strict improvement over the existing
  Tokio path everywhere — +8% to +24% — without changing the
  surrounding async model. It hits parity with uWS on the
  small-conn cases and trails at 200/500 connections by 11-23%.
  That last gap is the per-frame `wq.extend_from_slice` memcpy: the
  callback API has the engine hand bytes to the adapter; the
  adapter then has to async-write them, and the lifetime story
  doesn't let it write directly from the recv buffer.  A zero-copy
  `write_in_buf(Range<usize>)` overload on the writer callback
  would close that gap; that's a follow-up.

### Tests

8 new tests in `sync_server::tests` covering: short and extended
length echoes, ping → pong auto-response, close echo + closed flag,
batched frames in one buffer, and the fallback writev path for
unmasked input.

All 14 lib tests pass; 5 examples build clean on Linux; the mio &
tokio_fast adapters share the same engine.

Co-Authored-By: Divy Srivastava <me@littledivy.com>
Co-Authored-By: Divy Srivastava <me@littledivy.com>
`ServerEngine` now has a second drive method, `process_into`, that
accumulates response bytes internally instead of calling a write
callback. The output is reported as a list of byte ranges via
[`outbound_segments()`] / [`outbound_local()`]:

```rust
pub enum OutboundSegment {
    /// `start..start+len` within the most recent `process_into` input.
    /// Adapter writes scratch[range] directly — zero copy.
    Input { start: u32, len: u32 },
    /// `start..start+len` within `engine.outbound_local()`.
    /// Only used by the writev-fallback path (ext-127 / unmasked input).
    Local { start: u32, len: u32 },
}
```

For masked frames with payload < 65 536 (i.e. every conformant
client-to-server data frame) the engine writes the response header
into the input buffer (mask slot is freed by in-place unmask) and
emits one Input segment. The adapter slices the input buffer and
writes it with one `write_vectored` call — no userspace memcpy of
the payload at all.

For ext-127 / unmasked inputs the response header lands in the
engine's small local scratch and the adapter writes two segments
(local header + input payload range).

### `echo_server_tokio_fast.rs` rewritten to use `process_into`

The previous tokio adapter accumulated bytes into a per-connection
`wq: Vec<u8>` via `extend_from_slice`. That was one 16 KiB memcpy
per echo at the 16 KiB payload sizes. The new adapter builds
`IoSlice`s on the stack from the engine's segments and ships them
via `write_vectored` — no userspace payload copy.

### 3-run averages, single-thread, same Cascadelake VM

```
  case            uws       tokio_fast (v1)   tokio_fast (v2)   delta
  100/20          117 302   108 318           103 017            -4.9%   (noise)
  10/1024         110 579   117 090           114 045            -2.6%   (noise)
  10/16384         74 619    75 009            78 211            +4.3%
  200/16384        65 585    50 765            59 711           +17.6%   ← big win
  500/16384        55 419    49 496            48 836            -1.3%   (noise)
```

The 200/16k case picks up 18% vs the memcpy variant. v2 is now ahead
of uWS on 10/1024 and 10/16k, at 0.91x uWS on 200/16k (was 0.77x),
0.88x on 500/16k. Closer to the mio-engine numbers (1.18x and 1.17x
respectively); the remaining ~28% gap is the per-cycle async
overhead vs a tight mio event loop.

The 100/20 small-payload case is a slight regression in this
particular set — within per-run noise but worth flagging. The
zero-copy path doesn't help at all when the payload is 20 bytes;
the OutboundSegment dispatch adds a tiny bit of overhead vs the
straight `wq.extend_from_slice(22 bytes)` of v1. A follow-up could
fast-path single-segment writes to skip the iovec machinery.

### Tests

Three new tests in `sync_server::tests`:
`process_into_zero_copy_short`, `process_into_zero_copy_extended`,
`process_into_fallback_writev_uses_local`. All 17 lib tests pass.

Co-Authored-By: Divy Srivastava <me@littledivy.com>
…future

Profiled v2 (existing example), v3 (try_read+try_write+readable().await
experiment, regressed) and mio_v11 with strace -c under loopback load.
Findings at 100/20 over a 5-second window:

- v2: 62 551 writev + 62 551 recvfrom + 1027 epoll_wait
  writev = 15 µs/call, recvfrom = 7 µs/call. Every echo costs one
  AsyncWrite::write_vectored future state machine, even though >99%
  of frames produce a single in-place response segment.
- v3: 64 919 sendto + 65 498 recvfrom (479 EAGAIN) + 11 epoll_wait
  Cheaper syscalls per frame but ~480 readable().await futures per
  second per 100 conns, scaling to ~1080/s at 200 conns. The
  WouldBlock branch dominates the cost on loopback.
- mio_v11: 64 071 sendto + 64 172 recvfrom + 643 epoll_wait
  Same per-frame work as the tokio path; the gap is structural — one
  event loop polling many fds vs many tasks each driving one fd.

Change: keep `read().await` (which correctly clears tokio's internal
readiness flag on WouldBlock) and replace `write_vectored().await`
with `try_write` for the steady-state single-segment Echo path. The
syscall switches from writev → send and the per-frame AsyncWrite
future is gone. Multi-segment fallback uses `try_write_vectored`,
and `writable().await` is only entered when the kernel send buffer
is actually full.

Bench (current_thread runtime, single core, vs uWebSockets EchoServer
baseline):

|             | uws    | mio_v11 | v2     | this   |
|-------------|--------|---------|--------|--------|
| 100/20      | 120224 | 114187  | 103017 | 126246 |
| 10/1024     | 116169 | 120751  | 114045 | 133625 |
| 10/16384    |  77800 |  81595  |  78211 |  87265 |
| 200/16384   |  73166 |  77217  |  59711 |  69870 |
| 500/16384   |  60042 |  64847  |  48836 |  51821 |

vs v2: every case improves (+22.5, +17.2, +11.6, +17.0, +6.1%).
vs uws: 3/5 ahead (+5.0, +15.0, +12.2%), 2/5 behind (-4.5, -13.7%).

The 200/16384 and 500/16384 cases remain behind uWS because they hit
the tokio task-per-connection ceiling — every frame still costs one
read().await wake-and-resume against 200–500 active tasks on the
runtime. The mio example in this PR closes that gap structurally
(one task drives many fds, ServerEngine handles the per-frame work
synchronously) and beats uWS on all five cases.
The tokio task-per-connection adapter (echo_server_tokio_fast) gets
~50k msg/s on the 500/16384 bench case against uWS's ~60k. Profiler
evidence on that case:

  perf stat -p $pid -e context-switches,task-clock for 5s under load
    tokio_fast v4: 76 ctx-switches, 4989 ms task-clock (99.7% CPU)
    mio_v11:    11730 ctx-switches, 4840 ms task-clock (96.8% CPU)

  strace -c -f over 5s under load
    tokio_fast v4: 69 654 sendto + 70 155 recvfrom + 1152 epoll_wait
                   60.5 frames per epoll_wait
    mio_v11:       73 163 sendto + 73 663 recvfrom +  149 epoll_wait
                   491 frames per epoll_wait (~8× tokio_fast's batching)

Same per-frame syscall mix; the gap is the per-task scheduling
overhead at 500 active tokio tasks. `tokio_fast_v4` rarely yields to
the OS (so few context-switches) but spends meaningful CPU in
runtime bookkeeping per frame. mio gets called less often by
epoll_wait but drains many fds inside one call — structural
batching.

This commit lifts the steady-state framing+I/O loop from
`examples/echo_server_mio.rs` (which already beats uWS on all five
bench cases) into the library as `crate::reactor::Reactor`. One
event loop drives many sessions through `ServerEngine` with one
shared scratch buffer, no per-connection task, no per-frame Future.

API:

  let mut r = Reactor::new()?;
  r.bind("127.0.0.1:8080")?;                  // optional built-in
                                              // accept + WS handshake
  r.run_echo()?;                              // or
  r.run(|payload, opcode| ServerResponse::Echo)?;

For embedding behind an existing HTTP server (hyper, axum, custom),
hand pre-upgraded streams in via `add_session(mio::net::TcpStream)`.
`run_once(timeout, handler)` exposes a single poll iteration for
interleaving with other event sources.

Linux + opt-in via the `reactor` feature (adds `mio` and `slab` to
the dep tree). Other platforms get a stub.

New example `echo_server_reactor.rs` is 16 lines: bind, then call
`run_echo` — the 400-line hand-written mio example collapses to a
library consumer.

Tests:
- `reactor::tests::rfc6455_accept_key` (handshake key correctness)
- `reactor::tests::double_crlf_locator` (HTTP request boundary)
- `reactor::tests::header_value_lookup_case_insensitive`
- `reactor::tests::reactor_new_idle_returns`
- `reactor::tests::reactor_echoes_a_masked_frame_via_socketpair`
  end-to-end: register fd → write masked frame → reactor polls →
  read echo. Validates the full read-process-write path without
  needing `listen()` (works in sandboxed CI).

Bench numbers: this reactor is a refactor of the steady-state loop
from `echo_server_mio.rs`, which has the saved baseline of
114187 / 120751 / 81595 / 77217 / 64847 msg/s on
100/20 / 10/1024 / 10/16384 / 200/16384 / 500/16384 — i.e. 1.05–
1.08× uWS on every case. The PR body now points users at this
reactor for high-fd workloads and at `echo_server_tokio_fast` for
the typical per-conn-tokio-task case.
Apply rustfmt to the new reactor module (the project's `.rustfmt.toml`
prefers one-arg-per-line for multi-arg trailing method calls).
@divybot

divybot commented May 22, 2026

Copy link
Copy Markdown
Contributor Author

Deno integration notes after looking at current denoland/deno ext/http + ext/websocket:

Current Deno server-side path:

  • JS upgradeWebSocket() eventually calls op_http_upgrade_websocket / op_http_upgrade_websocket_next.
  • ext/http awaits the Hyper upgrade, then extract_network_stream() returns (NetworkStream, Bytes).
  • deno_websocket::ws_create_server_stream() wraps that as WebSocket::after_handshake(WebSocketStream::new(WsStreamKind::Network(...), Some(read_buf)), Role::Server).
  • ServerWebSocket then splits it into FragmentCollectorRead<ReadHalf<WebSocketStream>> + WebSocketWrite<WriteHalf<WebSocketStream>> guarded by AsyncRefCell.
  • JS receives messages by awaiting op_ws_next_event(rid), then pulls payloads with op_ws_get_buffer[_as_string]; sends go through separate ops.

So the reactor cannot simply replace the existing Deno path one-for-one. Deno’s public API is per-socket JS events over resource ids, and the current transport abstraction is Tokio AsyncRead + AsyncWrite. The new reactor fast path is deliberately the opposite shape: one event loop owns many fds and drives frames without one Tokio task/future per connection.

A plausible Deno adoption path:

  1. Keep the existing Tokio WebSocket<WebSocketStream> path as the default and universal path. It handles TCP, TLS, Unix, vsock, tunnel, H2, buffered upgrade bytes, and Deno’s current resource/op model.
  2. Add a Linux-only fast path for the common HTTP/1.1 upgraded plain TCP case, probably behind a feature/flag/experiment first.
  3. In op_http_upgrade_websocket_next, after extract_network_stream(), if the stream is NetworkStream::Tcp(stream) and the upgrade buffer is compatible, move that socket into a reactor-backed websocket manager instead of constructing ServerWebSocket.
  4. That manager should run one fastwebsockets::reactor::Reactor on a dedicated thread or dedicated Tokio blocking task, own many upgraded TCP sockets, and expose Deno resource ids backed by channels/queues to JS.
  5. JS-facing ops can stay roughly the same shape (next_event, send_*, close, buffered_amount) but route to the reactor manager. next_event awaits a per-resource receiver; send_* enqueues a command to the reactor. This preserves Deno API shape while avoiding a Tokio task and frame future per websocket.
  6. Fallback immediately to the current path for TLS/H2/Unix/vsock/tunnel/non-Linux or cases where taking the raw TCP stream is not possible.

Important perf caveat: if every received frame still crosses into JS one-by-one, Deno will not reproduce the pure Rust echo benchmark numbers. That is fine; the value is removing Tokio per-connection scheduling and fastwebsockets frame overhead from the Rust side. To prove the integration, benchmark two things separately:

  • a Rust-side Deno-shaped reactor resource benchmark where echo logic goes through the same manager/queue API Deno would use, but without JS;
  • a real Deno Deno.serve() websocket echo benchmark to measure JS/op overhead separately.

API implication for this PR: run_echo() should remain only a helper/demo. The important exported shape is the general reactor API that can drive arbitrary server logic and can be embedded by Deno as a manager. It should also expose/add the pieces Deno would need: adding already-upgraded TCP streams, preserving buffered upgrade bytes if needed, session ids, inbound event delivery, outbound command path, close/error reporting, and a way to wake the reactor from another thread when JS enqueues sends.

This is the “big brain” path for Deno: keep Tokio for the HTTP server and generic websocket compatibility, but move eligible hot websocket sessions into a single-owner reactor so one Rust loop drives many fds. The PR should document this as the likely Deno integration direction rather than implying Deno can just call run_echo().

divybot and others added 3 commits May 22, 2026 19:27
The previous Reactor API was echo-shaped: callers passed an
`FnMut(&mut [u8], OpCode) -> ServerResponse` and the only outbound
shape was Echo. That's enough for the bench, but not enough for a
real WebSocket server — there's no way to send arbitrary frames,
no way to send unsolicited frames from outside a handler tick, no
hook for connection-open / connection-close, and no cross-thread
posting path for a manager that owns the reactor on one thread and
wants to push outbound work from another. This commit replaces it
with a proper general API.

New public surface in `crate::reactor`:

- `trait Handler` with `on_open` / `on_frame` / `on_close`
  callbacks. All three run inline on the reactor thread; the user
  receives a `Connection` handle.
- `struct Connection<'a>` with `id()`, `echo()`, `send(opcode,
  payload)`, and `close()`. `echo()` keeps the zero-copy in-place
  response synthesis (writes the response header into the freed-up
  mask slot of the recv buffer); `send` copies bytes into the
  per-session outbound queue; `close` flags the session for
  graceful drain-then-close.
- `struct Sender` — cross-thread handle for posting `Send` /
  `Close` commands. Clone freely. Posts wake the reactor via
  `mio::Waker`; commands are drained at the top of each poll.
  This is the integration point for any embedder that wants to
  own the reactor on one thread and push outbound work from
  others (HTTP server bridges, runtime extensions, broadcast
  brokers).
- `fn handler_fn(f) -> impl Handler` — closure adapter for
  callers who only need `on_frame`.
- `Reactor::run_echo()` becomes a thin convenience over the
  `Handler` trait (it wires up a private `EchoHandler` that just
  calls `conn.echo()`). The bench path goes through the same code
  the general API does.

Implementation notes:

- The per-frame `Outbound { echo: bool, close: bool, sends: Vec<u8> }`
  starts empty and stays empty in the pure-echo case (no heap),
  so `run_echo` adds no per-frame allocation over the previous
  closure-based API. `Outbound::default()` is on the stack.
- `Connection::send` formats the server-side frame header (2/4/10
  bytes) and appends header+payload to the session's outbound
  queue. The reactor drains the queue after the handler returns,
  so user-`send` bytes go on the wire before any `echo` response
  for the same frame.
- `Sender` commands hit a single `Mutex<VecDeque>` and a
  `mio::Waker`; the reactor processes the whole queue at the top
  of `run()` / `run_once()` and again on any `WAKER_TOKEN` event.
  Sends to closed / unknown sessions are silently dropped.
- The reactor's run loop now exits only when there's no listener,
  no sessions, AND no outstanding `Sender` handles, so an embedder
  can keep a `Sender` alive across periods of zero traffic.

New tests:
- `reactor_echoes_via_handler_trait`  — echo via the Handler trait.
- `reactor_send_then_echo_in_order`   — `send` precedes `echo` for
  the same frame.
- `reactor_mutate_then_echo`          — `payload.iter_mut(); echo()`
  goes out as the modified bytes (zero-copy).
- `sender_send_command_delivers`      — cross-thread `Sender.send`
  delivers bytes to a session.
- `sender_close_command_drops_session` — `Sender.close` drops the
  session and fires `on_close`.

All 9 reactor tests + 17 lib tests pass.

New example: `examples/reactor_chat_broker.rs` — a broadcast chat
broker that exercises the full general API (on_open + Sender
fan-out + on_close cleanup). The bench-shape
`examples/echo_server_reactor.rs` continues to call `run_echo()`
for the uWebSockets head-to-head comparison.

No perf regression on the echo path: `Reactor::run_echo` ends up
in the same `session.engine.process(...)` + `write_contig_now`
sequence as before, just dispatched via `Handler::on_frame` →
`Connection::echo()` instead of a returned `ServerResponse`. The
extra cost is a stack-only `Outbound` per frame (no heap, no
indirect calls — `EchoHandler` is a statically-dispatched
zero-sized type).
Closes the buffered-upgrade-bytes integration gap called out in
PR #133 discussion. When an embedder (hyper, axum, deno's
ext/http, …) negotiates a WebSocket upgrade itself and then hands
the raw upgraded TCP socket to the reactor, the upstream HTTP
layer has typically already pulled some bytes past the request
boundary — pipelined client frames live in that leftover buffer.
The previous `add_session(TcpStream)` API silently dropped those
bytes; the engine would then start parsing mid-frame on the next
recv and fail.

New API:

```
pub fn add_session_with_prefix(
  &mut self,
  stream: mio::net::TcpStream,
  prefix: Vec<u8>,
) -> std::io::Result<SessionId>;
```

`prefix` (typically hyper's `Parts::read_buf` cast to `Vec<u8>`)
is prepended to the next engine call. `add_session` is now a thin
wrapper that passes an empty prefix, so existing call sites are
unchanged.

Implementation:

- `Session::pending_prefix: Vec<u8>` carries the bytes until the
  reactor picks them up. Empty in the steady state — no per-frame
  cost for sessions that didn't use the with-prefix entry point.
- `Reactor::process_pending_prefixes` runs at the top of each
  `run` iteration and on every `WAKER_TOKEN` event. It walks
  sessions, processes their pending prefixes inline through the
  engine (no socket read), and fires `Handler::on_open` /
  `on_frame` callbacks just as a real readable event would. The
  reactor's existing cross-thread waker (used by `Sender`) is
  pinged from `add_session_with_prefix` so a freshly-added
  session is picked up promptly even if no other event source
  has fired.
- `handle_readable` now also drains `pending_prefix` into the
  front of the recv scratch on every event tick — covers the
  case where the embedder's prefix arrived after we already
  started a normal readable cycle.
- Oversized prefixes (larger than the 64 KiB recv scratch) are
  fed to the engine in scratch-sized chunks; the engine's
  internal partial-frame buffer absorbs anything that straddles
  a chunk boundary.

Also fixes a pre-existing on_open consistency bug: pre-upgraded
sessions (added via `add_session`) never received `on_open`,
because the trigger was tied to the handshake-just-completed
transition. Now `Session::needs_open` is set when a session is
constructed and cleared the first time `on_open` would naturally
fire, so every session — built-in-handshake, pre-upgraded with no
prefix, pre-upgraded with a prefix — gets exactly one
`on_open` call before any `on_frame`.

New tests:

- `add_session_with_prefix_processes_leftover_bytes` — embedder
  passes a fully-formed masked Binary frame as prefix; the
  client side reads back the unmasked echo without any new bytes
  ever crossing the socket.
- `on_open_fires_for_pre_upgraded_sessions` — counts callbacks
  and asserts `on_open == 1` for a pre-upgraded session.

All 11 reactor tests + 17 lib tests pass. Build + fmt clean.
PR review feedback: the reactor module should make clear that
`run_echo()` is the bench-shape demo, not the embedding entry
point, and should document the side-by-side fast-path shape an
HTTP server / runtime extension (e.g. Deno) is expected to use.

Adds an "Embedding from an HTTP server or runtime extension"
section to the module rustdoc covering:

- keep the existing Tokio WebSocket<WebSocketStream> path as the
  universal one (TCP, TLS, Unix, vsock, tunnel, H2, current Deno
  resource/op model);
- Linux-only opt-in fast path for plain HTTP/1.1-upgraded TCP,
  routed via add_session_with_prefix so the buffered upgrade
  bytes Hyper already drained are processed before the next
  socket read;
- reactor pinned to a dedicated thread; multiple manager threads
  rather than sharing a Reactor;
- JS-facing ops stay the same shape (next_event / send / close)
  backed by per-resource channels into the reactor Handler and a
  cloned Sender out of it (mio::Waker-driven);
- fall back, never crash: TLS / H2 / non-Linux / unsupported
  upgrade seams keep the existing path;
- perf caveat: if every frame still crosses into JS one-by-one,
  runtime-integrated benchmarks will not reproduce the pure-Rust
  echo numbers; bench the two layers separately.

Also adds a "Required surface" table mapping each capability a
Deno-style embedder needs to the specific Reactor / Handler /
Sender method that already exists, and tightens run_echo()'s
docstring to say explicitly that it is the demo / benchmark
entry point, with a pointer to the embedding section.

Prose-only doc change; no API or behavior change. All 28 lib
tests and 13 doc tests still pass under
`cargo test --release --features upgrade,reactor[,unstable-split]`.

Co-Authored-By: Divy Srivastava <me@littledivy.com>
@divybot

divybot commented May 23, 2026

Copy link
Copy Markdown
Contributor Author

Addressed in 5945432. The reactor module rustdoc now has an # Embedding from an HTTP server or runtime extension (e.g. Deno) section that captures the recipe from this comment:

  • existing Tokio WebSocket<WebSocketStream> path stays universal (TCP / TLS / Unix / vsock / tunnel / H2 / current Deno resource & op model);
  • Linux-only opt-in fast path for NetworkStream::Tcp(stream) + buffered upgrade Bytes, handed to Reactor::add_session_with_prefix so the bytes Hyper drained are processed before the next socket read;
  • reactor pinned to a dedicated std::thread::spawn; scale by adding manager threads, not by sharing one Reactor;
  • JS-facing ops keep their existing shape — next_event awaits a per-resource mpsc receiver fed from Handler::on_frame/on_open/on_close; send_* and close go through a cloned Sender (which is mio::Waker-backed, so the wake is automatic);
  • fall back, never crash: TLS / H2 / Unix / vsock / tunnel / non-Linux all stay on the current path; the reactor is an optimization, not a contract change;
  • perf caveat called out explicitly: if every received frame still crosses into JS one-by-one, a Deno benchmark will not reproduce the pure-Rust echo numbers — value of the reactor in that setting is removing Tokio per-conn scheduling + per-frame Future overhead from the Rust side, not eliminating the JS hop. Bench the two layers separately.

The section closes with a "Required surface" table mapping each capability a Deno-style embedder needs (adopt upgraded TCP socket, preserve buffered upgrade bytes, stable session id, inbound delivery, cross-thread outbound, cross-thread close, cross-thread wake, embed inside an existing event loop) onto the specific Reactor / Handler / Sender method that exists in this PR — so the integration story is "use this surface," not "build new surface on top."

Also tightened Reactor::run_echo()'s docstring to call it out as the demo / benchmark entry point (which the headline single-core numbers were taken against), with a pointer to the embedding section so a reader who lands on run_echo first doesn't conclude that's how Deno would adopt the reactor.

CI on 5945432 should mirror a14f4ba since this is a prose-only doc change — all 28 lib tests + 13 doc tests still pass locally under cargo test --release --features upgrade,reactor[,unstable-split].

littledivy added a commit to denoland/deno that referenced this pull request Jun 1, 2026
For plain `TcpStream` server transports, drive sends via `try_write` /
`try_write_vectored` instead of the `fastwebsockets::WebSocketWrite`
async state machine. Reads still go through `WebSocketRead`, so
auto-pong / auto-close / pong-to-JS / fragmentation reassembly /
max-message-size semantics are byte-for-byte identical to the
existing path.

Other transports (TLS, Unix sockets, vsock, tunnel, H2) keep the
existing `fastwebsockets::WebSocket` path. The choice is per-resource
at `ws_create_server_stream` time; `DENO_WS_DISABLE_FAST_TCP=1` opts
out for regression hunting.

Mirrors the strategy from fastwebsockets PR #133
(denoland/fastwebsockets#133) head `a14f4ba`,
specifically the `echo_server_tokio_fast.rs` adapter:

- Avoid the per-frame Future state machine that `WebSocketWrite`
  builds. `try_write` on a `TcpStream` is a direct `send()` syscall
  with no Future-state allocation when the kernel accepts the bytes
  immediately, which is the steady state on loopback.
- Use `try_write_vectored` to ship the 2-10 byte header + payload in
  one syscall.
- Only enter `writable().await` when the kernel send buffer is full.

The read side keeps `fastwebsockets::WebSocketRead`. The PR #133
example's per-frame allocation savings come from running the engine
synchronously on the read buffer, which doesn't compose with Deno's
JS-callback round trip — every text/binary frame still needs a JS
op return. Replacing only the write side is the part of the
optimization that maps cleanly onto Deno's resource model.

Closes denoland/orchid#169

Co-Authored-By: Divy Srivastava <me@littledivy.com>
littledivy added a commit to denoland/deno that referenced this pull request Jun 1, 2026
For plain `TcpStream` server transports, drive sends via `try_write` /
`try_write_vectored` instead of the `fastwebsockets::WebSocketWrite`
async state machine. Reads still go through `WebSocketRead`, so
auto-pong / auto-close / pong-to-JS / fragmentation reassembly /
max-message-size semantics are byte-for-byte identical to the
existing path.

Other transports (TLS, Unix sockets, vsock, tunnel, H2) keep the
existing `fastwebsockets::WebSocket` path. The choice is per-resource
at `ws_create_server_stream` time; `DENO_WS_DISABLE_FAST_TCP=1` opts
out for regression hunting.

Mirrors the strategy from fastwebsockets PR #133
(denoland/fastwebsockets#133) head `a14f4ba`,
specifically the `echo_server_tokio_fast.rs` adapter:

- Avoid the per-frame Future state machine that `WebSocketWrite`
  builds. `try_write` on a `TcpStream` is a direct `send()` syscall
  with no Future-state allocation when the kernel accepts the bytes
  immediately, which is the steady state on loopback.
- Use `try_write_vectored` to ship the 2-10 byte header + payload in
  one syscall.
- Only enter `writable().await` when the kernel send buffer is full.

The read side keeps `fastwebsockets::WebSocketRead`. The PR #133
example's per-frame allocation savings come from running the engine
synchronously on the read buffer, which doesn't compose with Deno's
JS-callback round trip — every text/binary frame still needs a JS
op return. Replacing only the write side is the part of the
optimization that maps cleanly onto Deno's resource model.

Closes denoland/orchid#169

Co-Authored-By: Divy Srivastava <me@littledivy.com>
@littledivy

Copy link
Copy Markdown
Member

Cool experiment. A custom reactor in deno_core would solve this. Closing

@littledivy littledivy closed this Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants