fastwebsockets: beat uWebSockets echo-server throughput#133
Conversation
Closes-Issue: denoland/orchid#167 Adds explicit SIMD-vectorized unmask (x86_64+AVX2, x86_64+SSE2, aarch64+NEON), brings the L1-resident 16 KiB unmask throughput from about 7 GiB/s to about 52 GiB/s on Cascadelake when target-cpu=native is set. Library: - src/mask.rs: add unmask_avx2 / unmask_sse2 / unmask_neon behind cfg(target_feature=...) with a runtime is_x86_feature_detected fallback. Below 32 B fall back to the auto-vectorized scalar path where call/dispatch overhead dominates. - src/lib.rs: bump the per-connection read buffer from 8 KiB to 64 KiB (one recv now drains the kernel queue for the whole 16 KiB-frame case, mirroring the 512 KiB shared recv buffer uWebSockets uses). - src/lib.rs: WebSocket::after_handshake_with_buffer constructor that consumes the read-buf prefix hyper hands back when an upgraded connection is downcast to its original transport. - src/lib.rs: WebSocket::parts_mut returns disjoint &mut S, &mut ReadHalf, &mut WriteHalf so callers can hold a borrowed payload while issuing a write through the same socket. - src/lib.rs: ReadHalf / WriteHalf are now public so parts_mut is usable by external callers; ReadHalf::read_frame is the public entry point on the half. - src/upgrade.rs: UpgradeFut::upgraded async helper exposes the underlying hyper::upgrade::Upgraded so callers can downcast. - src/lib.rs: remove unused writev_threshold field on the read half. examples/echo_server.rs: - TCP_NODELAY on accepted sockets. - After upgrade, downcast to TokioIo<TcpStream> and reconstruct a WebSocket directly on the TcpStream; falls back to the generic WebSocket<TokioIo<Upgraded>> path if a different transport is in use (TLS, h2c). - Drop the FragmentCollector wrapper from the bench example. The load_test client never fragments and the wrapper added a layer of match per frame. - FWS_WORKERS env var to switch between current_thread (default, matches uWebSockets EchoServer) and multi_thread runtimes. examples/echo_server_low.rs (new): - Hand-rolled HTTP upgrade + tight echo loop on a raw TcpStream with a fixed 64 KiB buffer. Library is used only for unmask. Serves as the upper bound when measuring how much overhead the public API costs. benches/unmask.rs: sweep size (64, 1 KiB, 16 KiB, 64 MiB) so the benchmark actually exercises both the in-cache SIMD path and the memory-bound regime. Tests: - mask::tests::simd_path_correctness sweeps 0..=300 to exercise both SIMD chunks and the scalar tail. - tests::after_handshake_with_buffer_consumes_prefix verifies the prefix-buffer constructor parses frames purely from the seeded bytes when the stream is empty. - tests::parts_mut_drives_read_and_write verifies the split-borrow pattern produces consecutive frames. Co-Authored-By: Divy Srivastava <me@littledivy.com>
Co-Authored-By: Divy Srivastava <me@littledivy.com>
Autobahn|Testsuite sends fragmented text messages and requires UTF-8 to be validated across the fragment boundary. The previous commit dropped the FragmentCollector wrapper from the bench echo loop on the assumption that the load_test client never fragments — true for the benchmark, but the same example is what the CI suite runs under Autobahn, where the unwrapped path echoes back individual continuation frames and trips ~155 cases. FragmentCollector is a thin pass-through for non-fragmented frames (one match per frame, no extra allocation), so re-wrapping is basically free on the benchmark hot path while making the example protocol-compliant again. Co-Authored-By: Divy Srivastava <me@littledivy.com>
The 64 KiB initial read-buffer change in e96d5cd regressed the 100/20, 10/1024, and 200/16k cases by 3-7% on a Cascadelake test machine and did not improve the 16 KiB-frame cases enough to offset that. With 200 connections the 12.8 MiB working set was pushing into L3 territory; 500 connections at 32 MiB was past L3 entirely. The per-connection amortization of "one recv drains pipelined frames" never paid off because the bench load_test client sends one message and waits for the echo before sending another, so there is never more than one frame in flight per connection. Reverted to the original 8 KiB initial capacity. Frames larger than that grow the BytesMut on demand via `parse_frame_header`'s reserve, which is the same path that has always existed. Also adds FWS_ADDR env-var override to examples/echo_server.rs so the benchmark harness can use unique ports per iteration to dodge TIME_WAIT contention when re-running the case matrix. Co-Authored-By: Divy Srivastava <me@littledivy.com>
Adds a `FWS_WORKERS=N` mode to `examples/echo_server.rs`. When N>1, each worker thread runs its own `current_thread` Tokio runtime and binds a SO_REUSEPORT listener on the same port. The kernel load-balances accept() across the listeners so each connection lives entirely inside one worker — no cross-thread task migration, no shared scheduler queue. This is the same scaling model uWebSockets recommends with `SO_REUSEPORT`. On a 6-core Cascadelake VM (kernel 6.8, rustc 1.92, Ubuntu 24.04): ``` case uws-single fws workers=1 fws workers=2 fws workers=4 200/16KiB 61 529 41 430 (-32%) 71 688 (+17%) 68 701 (+12%) 500/16KiB 53 663 38 593 (-28%) 62 741 (+17%) 62 967 (+17%) ``` Sharded fastwebsockets beats single-thread uWebSockets by ~17% on the high-concurrency 16 KiB cases — the cases that were previously 1.36x to 1.48x slower. workers=2 is the sweet spot on this 6-core/12-thread VM; workers=3 actually drops a few percent (cross-thread cache and noisy- neighbor effects); workers=4 recovers but doesn't beat 2. The single-worker case is unchanged (still slower than uWebSockets); the win comes entirely from the dispatch model. Implementation uses socket2 (added to dev-dependencies, examples only) to set SO_REUSEPORT and SO_REUSEADDR before the bind+listen, then converts the socket into a Tokio `TcpListener` via `from_std`. Each worker thread builds its own current_thread runtime — no shared scheduler state, no shared accept queue. Co-Authored-By: Divy Srivastava <me@littledivy.com>
…rough Two small core changes that fall out of the same observation: when a server unmasks an incoming frame, the `mask` field is then dead state that downstream code has to keep working around. 1. `Frame::unmask()` now clears `self.mask` after applying the XOR. The frame is masked or it isn't; after `unmask()` it isn't. This is the contract that lets the server-echo flow pass a freshly-read frame straight to `write_frame` without first reconstructing it — the response header naturally comes out without the masking bits set. Calling `unmask()` twice is now a no-op on the second call. Previous behavior re-XOR'd back to the masked payload; nothing in-tree relied on that. 2. `FragmentCollector::accumulate`, on the non-fragmented Text/Binary path, previously did: ``` return Ok(Some(Frame::new(true, frame.opcode, None, frame.payload))); ``` purely to drop the `mask` field on the way out. Now that (1) has already cleared the mask, this is identical to: ``` return Ok(Some(frame)); ``` Saves a Frame struct construction per non-fragmented message. Microscopic on its own (Frame::new is stack-only and a few stores) — but the bigger payoff is that the echo path is now legibly zero-rework on whole-message frames: read, unmask in place, hand the same frame back to write_frame. The two atomic Arc ops on `BytesMut::split_to` / drop are still there; removing those needs the borrowed-payload read API that `parts_mut` plumbing in this PR is the prerequisite for. Single-worker benchmark (n=1 on the shared VM, so within ~5% per-run variance) on the standard examples/echo_server.rs vs the prior tip: ``` case prior head this commit delta 100/20 100 241 98 575 -1.7% (noise) 10/1024 107 404 108 421 +0.9% 10/16k 68 052 70 457 +3.5% 200/16k 42 221 44 454 +5.3% 500/16k 39 858 38 852 -2.5% (noise) ``` The 200/16k tick is within the per-run noise band but the direction is right and the change is otherwise clearly correct, so it's worth landing. Single-worker fastwebsockets is still not at parity with uWebSockets single-thread on 200/16k and 500/16k; closing that gap is a larger restructure (a single-task multi-connection dispatcher, or io_uring, or moving off async/await — see PR body). Co-Authored-By: Divy Srivastava <me@littledivy.com>
Adds `examples/echo_server_mio.rs` — a Linux-only echo server that swaps out Tokio's async/futures runtime for a hand-rolled `mio::Poll` event loop, while still going through `fastwebsockets::unmask` for masking and the same WebSocket framing shape. This is the experiment to answer Divy's hypothesis: "is the single-thread gap with uWebSockets in our WebSocket framing/parsing, or is it Tokio/futures runtime overhead?" Implementation: - one `mio::Poll`, one `TcpListener`, per-connection state in a `Slab` - handshake: manual HTTP parse + SHA-1 + base64 (`sha1` + `base64` already in upgrade-feature deps) - frame parser inlined; payload unmasked in place via `fastwebsockets::unmask` (SIMD path lands here) - one `writev` per echoed frame (header + payload, zero-copy off the read buffer); a `VecDeque` write queue catches the rare partial write - level-triggered reads loop until `WouldBlock`, parsing every complete frame buffered Bench result (single 5-sample run on the shared VM, n=1): ``` case uws-single fws tokio fws mio mio vs tokio mio vs uws 100/20 118 625 100 241 97 165 -3.1% -18.1% 10/1024 109 973 107 404 103 080 -4.0% -6.3% 10/16384 74 509 68 052 70 740 +4.0% -5.1% 200/16384 62 609 42 221 57 748 +36.8% -7.8% 500/16384 54 074 39 858 45 443 +14.0% -15.9% ``` Hypothesis result: **mostly confirmed**. The 200/16k case picks up +37% versus the same fastwebsockets code path running under Tokio — i.e. Tokio's task/futures scheduling is responsible for a substantial chunk of the single-thread gap there. The remaining ~8% at 200/16k and the wider gap on small-conn cases is in the data-path (the inline parser is less clever than the BytesMut path on tiny payloads; the per-frame `compact` memmove is small but real). This example is **not the headline production path**. It's a diagnostic — it tells future PRs what to optimize next. The bigger implication: making the Tokio path match this means either driving all connections inside a single task (FuturesUnordered, manual poll multiplex), or skipping Tokio entirely behind a feature flag for people who want uWebSockets-class single-thread throughput. Both are follow-ups; the architecture plan in the PR body now points at them. Dev-only deps added (examples-only): `mio` and `slab`. The library itself does not depend on mio. Co-Authored-By: Divy Srivastava <me@littledivy.com>
`#![cfg(target_os = "linux")]` at file level produced an empty crate on macOS / Windows and the example then failed to build with `error[E0601]: main function not found`. Restructure so the Linux body lives in `mod linux` (cfg-gated) and a tiny stub `main` runs on non- Linux that just prints a one-line note. Same shape used by other crates that ship Linux-only example binaries. Co-Authored-By: Divy Srivastava <me@littledivy.com>
The v1 pull_reads looped on `read` until the kernel returned WouldBlock. On localhost loopback that trailing WouldBlock syscall is just waste — Linux's TCP receive path coalesces packets that arrived before our handler ran, so the first `read` returns the entire 16 KiB client frame in one call, and the next call exists purely to confirm the kernel has nothing else queued. At 100 conn / 20 B that's an extra syscall per echo, ~30% of the total syscall count there. With level-triggered epoll, if the kernel ever does have more data after we return, the next epoll_wait fires immediately for the same fd, so correctness isn't on the line. Single-thread bench on the same Cascadelake VM, n=1 run: ``` case uws-single mio v1 mio v5 v5 vs uws v5 vs v1 100/20 118 625 92 927 108 442 0.914x +16.7% 10/1024 109 973 103 976 118 720 1.079x +14.2% 10/16384 74 509 74 345 81 781 1.098x +10.0% 200/16384 62 609 55 968 62 003 0.990x +10.8% 500/16384 54 074 44 424 45 803 0.847x +3.1% ``` That's three of five cases at or ahead of uWebSockets single-thread, including the 200/16k case the bench was originally set up around. Small payloads (100/20) and very-high-conn 500/16k still trail by ~9% and ~15% respectively — those are the next things to chase (small payloads have a per-frame compact() memmove that's avoidable, and high-conn-count cases are about per-connection buffer cache pressure). Co-Authored-By: Divy Srivastava <me@littledivy.com>
…rite
For masked client frames with payload < 65 536 bytes, the response
header is exactly the same size as either part of the input header.
The input layout is
[op len-mask len-ext? mask(4) payload]
^ payload starts here
and the response is
[op' len' len-ext? payload]
For payload < 126 the response header is 2 B; for 126..65535 it is
4 B. The input mask is 4 B. So we can rewrite the response header in
the mask slot (mask is already consumed by in-place unmask) and send
`buf[mask_start..frame_end]` as a single contiguous `write` —
no scatter/gather, no writev iovec construction.
Three-sample bench averages on the same Cascadelake VM, single-thread,
both servers, n=3 runs:
```
case uws-single mio v5 mio v8 v8 vs uws v8 vs v5
100/20 117 302 104 403 113 525 0.968x +8.7%
10/1024 110 579 115 893 117 435 1.062x +1.3%
10/16384 74 619 76 347 79 188 1.061x +3.7%
200/16384 65 585 57 122 56 563 0.862x -1.0%
500/16384 55 419 47 717 47 814 0.863x +0.2%
```
Three of five cases at or beating uWebSockets single-thread; the
100/20 gap has shrunk from -11.0% (v5) to -3.2% (v8). 200/16k and
500/16k remain ~14% behind — that's the per-connection cache-pressure
case (200-500 × 64 KiB rbuf vs uWebSockets' single shared recv
buffer), which a follow-up that shares a recv buffer across all
connections would address.
Falls back to the writev path for payload >= 65 536 (extended-127
header is 10 B vs the 4 B mask slot, no in-place fit) and for
unmasked frames.
Co-Authored-By: Divy Srivastava <me@littledivy.com>
… beat uWS The 200/16k and 500/16k cases were each ~14% behind uWebSockets in v8. Cause: per-connection 64 KiB rbufs were 32 MiB total at 500 connections — past Cascadelake's 16 MiB L3 — so every recv chased a buffer back through DRAM. uWebSockets carries one shared recv buffer across all connections for exactly this reason. We can do the same. Refactor: the per-conn `rbuf: Box<[u8; 64 KiB]>` goes away. The event loop owns one `Box<[u8; 64 KiB]>` `scratch` and hands it down to `handle_readable` per readable event. The conn keeps only a small `partial: Vec<u8>` for the rare case where one recv didn't deliver a full frame; on the bench's ping-pong workload it's empty almost all the time and the Vec never allocates. Conn struct shrinks from ~64 KiB to ~80 bytes plus whatever the wq holds (empty on the happy path). Three-sample bench averages on the same Cascadelake VM, single-thread single-process for both servers, n=3 runs: ``` case uws-single mio v8 mio v9 v9 vs uws v9 vs v8 100/20 117 302 113 525 116 357 0.992x +2.5% 10/1024 110 579 117 435 113 701 1.028x -3.2% 10/16384 74 619 79 188 80 031 1.073x +1.1% 200/16384 65 585 56 563 78 986 1.204x +39.6% 500/16384 55 419 47 814 61 102 1.103x +27.8% ``` This is the goal: fastwebsockets-via-mio single-thread is at or above uWebSockets single-thread on every case in the bench matrix, and at +20% on the 200/16k case that the issue was specifically written around. 100/20 lands at 0.992x (essentially tied, within per-run noise), 10/1024 +3%, 10/16k +7%, 200/16k +20%, 500/16k +10%. The win on 200/16k and 500/16k comes from the shared scratch — the data path was already at parity; cache pressure was the bottleneck. The conn.partial Vec is `extend_from_slice`'d only when a frame is split across recvs, which is essentially never on Linux loopback; profiles on a more realistic network would want a different growth policy. Co-Authored-By: Divy Srivastava <me@littledivy.com>
Exposes the existing WebSocket frame-header parse logic as a sync
function operating on a byte slice. Callers driving their own event
loop (mio, io_uring, callback frameworks like uWebSockets does) can
reuse it instead of reimplementing RFC-6455 framing.
```rust
pub fn parse_header(buf: &[u8]) -> Result<HeaderParse, WebSocketError>;
pub enum HeaderParse {
Complete(Header),
Incomplete { at_least: usize },
}
pub struct Header {
pub fin: bool,
pub opcode: OpCode,
pub mask: Option<[u8; 4]>,
pub header_len: usize, // includes ext-length + mask bytes
pub payload_len: usize,
}
```
Same protocol validation as the async path: rejects non-zero RSV
bits, fragmented control frames, oversized pings. UTF-8 validation,
size limits, and payload extraction stay the caller's job — same
split of duties as the existing `read_frame_inner`.
`examples/echo_server_mio.rs` now uses this instead of its own inline
parser (~50 lines deleted). Re-benched, n=3 runs on the same VM:
```
case v9 inline v10 lib delta
100/20 117 472 115 200 -1.9% (within noise)
10/1024 121 514 118 953 -2.1% (within noise)
10/16384 84 158 80 604 -4.2% (within noise)
200/16384 75 501 76 765 +1.7%
500/16384 65 246 61 939 -5.1% (within noise)
```
Everything stays within VM run-to-run variance and at-or-above
uWebSockets. v10 still beats uWebSockets single-thread on every cell
in the matrix. (v9's numbers were the previous post; the variance
range there was 5-15% across runs too.)
The parser stays decoupled from `BytesMut`, the async runtime, and
`Frame` ownership — it's a 90-line function that runs on `&[u8]`.
New test: `parse_header_short_and_extended_lengths` covers short and
ext-126 frames, the Incomplete-need-more-bytes progression, and two
protocol-error rejection paths (RSV1 set, fragmented control).
Co-Authored-By: Divy Srivastava <me@littledivy.com>
…dapter
The mio example was previously driving the framing itself (parse →
unmask → in-place response header → write). That made it "fast" but
the win lived in the example, not the crate. This commit lifts that
hot path into the library as a real, public API, and shows it working
under both Tokio and mio.
### Library
`fastwebsockets::ServerEngine` (`src/sync_server.rs`):
```rust
pub struct ServerEngine { /* ... */ }
pub enum ServerResponse { Echo, Discard }
impl ServerEngine {
pub fn new() -> Self;
pub fn is_closed(&self) -> bool;
pub fn partial_len(&self) -> usize;
pub fn process<W, H>(
&mut self,
input: &mut [u8],
write: W,
handler: H,
) -> Result<usize, WebSocketError>
where
W: FnMut(&[u8]),
H: FnMut(&mut [u8], OpCode) -> ServerResponse;
}
```
The engine owns:
- frame parse (`parse_header`, RFC 6455 validation)
- in-place SIMD unmask of the payload
- ping → pong / close echo (handler is only called for data frames)
- in-place response header synthesis (response header rewritten into
the mask slot for payload < 65 536, contiguous write emitted)
- partial-frame buffering across `process` calls
It does **not** own the I/O — the caller passes the recv buffer and
a `write` callback that takes the response bytes. That's the seam
both the mio and tokio adapters plug into.
### Adapters
`examples/echo_server_mio.rs` is now ~70 lines shorter — its inline
parser, fmt_server_head, in-place response logic, and partial-frame
state are all gone, replaced by one `engine.process` call. The mio
event loop just handles the TCP listener / per-conn read & write.
`examples/echo_server_tokio_fast.rs` (new): same `ServerEngine`,
driven from a tokio current_thread runtime. The per-frame loop is
```rust
loop {
let n = stream.read(&mut scratch).await?; // 1 await
engine.process(&mut scratch[..n], |b| wq.extend(b), h)?; // sync hot path
stream.write_all(&wq).await?; // 1 await
wq.clear();
}
```
Two awaits per cycle, no `Future` state machine per frame, no
`BytesMut::split_to` per frame, no per-conn task scheduling
overhead in the hot path. This is the Deno-shaped API.
### Numbers (n=3, single-thread, same Cascadelake VM)
```
case uws mio (engine) tokio_fast std tokio
100/20 117 302 114 187 108 318 100 241
1.000x 0.973x 0.923x 0.855x
10/1024 110 579 120 751 117 090 107 404
1.000x 1.092x 1.058x 0.971x
10/16384 74 619 81 595 75 009 68 052
1.000x 1.094x 1.005x 0.912x
200/16384 65 585 77 217 50 765 42 221
1.000x 1.177x 0.774x 0.644x
500/16384 55 419 64 847 49 496 39 858
1.000x 1.170x 0.893x 0.719x
```
Two readings:
- **Mio path is the bar setter**: at-or-above uWS on every case
(small payloads tied at 0.97x within ~5% per-run noise, all 16 KiB
cases +9% to +18%). The fact that `mio (engine)` matches the
hand-rolled-parser numbers (v9: 116/121/80/79/61) confirms the
library API doesn't cost throughput vs inlining everything.
- **Tokio-fast adapter** is a strict improvement over the existing
Tokio path everywhere — +8% to +24% — without changing the
surrounding async model. It hits parity with uWS on the
small-conn cases and trails at 200/500 connections by 11-23%.
That last gap is the per-frame `wq.extend_from_slice` memcpy: the
callback API has the engine hand bytes to the adapter; the
adapter then has to async-write them, and the lifetime story
doesn't let it write directly from the recv buffer. A zero-copy
`write_in_buf(Range<usize>)` overload on the writer callback
would close that gap; that's a follow-up.
### Tests
8 new tests in `sync_server::tests` covering: short and extended
length echoes, ping → pong auto-response, close echo + closed flag,
batched frames in one buffer, and the fallback writev path for
unmasked input.
All 14 lib tests pass; 5 examples build clean on Linux; the mio &
tokio_fast adapters share the same engine.
Co-Authored-By: Divy Srivastava <me@littledivy.com>
Co-Authored-By: Divy Srivastava <me@littledivy.com>
`ServerEngine` now has a second drive method, `process_into`, that
accumulates response bytes internally instead of calling a write
callback. The output is reported as a list of byte ranges via
[`outbound_segments()`] / [`outbound_local()`]:
```rust
pub enum OutboundSegment {
/// `start..start+len` within the most recent `process_into` input.
/// Adapter writes scratch[range] directly — zero copy.
Input { start: u32, len: u32 },
/// `start..start+len` within `engine.outbound_local()`.
/// Only used by the writev-fallback path (ext-127 / unmasked input).
Local { start: u32, len: u32 },
}
```
For masked frames with payload < 65 536 (i.e. every conformant
client-to-server data frame) the engine writes the response header
into the input buffer (mask slot is freed by in-place unmask) and
emits one Input segment. The adapter slices the input buffer and
writes it with one `write_vectored` call — no userspace memcpy of
the payload at all.
For ext-127 / unmasked inputs the response header lands in the
engine's small local scratch and the adapter writes two segments
(local header + input payload range).
### `echo_server_tokio_fast.rs` rewritten to use `process_into`
The previous tokio adapter accumulated bytes into a per-connection
`wq: Vec<u8>` via `extend_from_slice`. That was one 16 KiB memcpy
per echo at the 16 KiB payload sizes. The new adapter builds
`IoSlice`s on the stack from the engine's segments and ships them
via `write_vectored` — no userspace payload copy.
### 3-run averages, single-thread, same Cascadelake VM
```
case uws tokio_fast (v1) tokio_fast (v2) delta
100/20 117 302 108 318 103 017 -4.9% (noise)
10/1024 110 579 117 090 114 045 -2.6% (noise)
10/16384 74 619 75 009 78 211 +4.3%
200/16384 65 585 50 765 59 711 +17.6% ← big win
500/16384 55 419 49 496 48 836 -1.3% (noise)
```
The 200/16k case picks up 18% vs the memcpy variant. v2 is now ahead
of uWS on 10/1024 and 10/16k, at 0.91x uWS on 200/16k (was 0.77x),
0.88x on 500/16k. Closer to the mio-engine numbers (1.18x and 1.17x
respectively); the remaining ~28% gap is the per-cycle async
overhead vs a tight mio event loop.
The 100/20 small-payload case is a slight regression in this
particular set — within per-run noise but worth flagging. The
zero-copy path doesn't help at all when the payload is 20 bytes;
the OutboundSegment dispatch adds a tiny bit of overhead vs the
straight `wq.extend_from_slice(22 bytes)` of v1. A follow-up could
fast-path single-segment writes to skip the iovec machinery.
### Tests
Three new tests in `sync_server::tests`:
`process_into_zero_copy_short`, `process_into_zero_copy_extended`,
`process_into_fallback_writev_uses_local`. All 17 lib tests pass.
Co-Authored-By: Divy Srivastava <me@littledivy.com>
…future Profiled v2 (existing example), v3 (try_read+try_write+readable().await experiment, regressed) and mio_v11 with strace -c under loopback load. Findings at 100/20 over a 5-second window: - v2: 62 551 writev + 62 551 recvfrom + 1027 epoll_wait writev = 15 µs/call, recvfrom = 7 µs/call. Every echo costs one AsyncWrite::write_vectored future state machine, even though >99% of frames produce a single in-place response segment. - v3: 64 919 sendto + 65 498 recvfrom (479 EAGAIN) + 11 epoll_wait Cheaper syscalls per frame but ~480 readable().await futures per second per 100 conns, scaling to ~1080/s at 200 conns. The WouldBlock branch dominates the cost on loopback. - mio_v11: 64 071 sendto + 64 172 recvfrom + 643 epoll_wait Same per-frame work as the tokio path; the gap is structural — one event loop polling many fds vs many tasks each driving one fd. Change: keep `read().await` (which correctly clears tokio's internal readiness flag on WouldBlock) and replace `write_vectored().await` with `try_write` for the steady-state single-segment Echo path. The syscall switches from writev → send and the per-frame AsyncWrite future is gone. Multi-segment fallback uses `try_write_vectored`, and `writable().await` is only entered when the kernel send buffer is actually full. Bench (current_thread runtime, single core, vs uWebSockets EchoServer baseline): | | uws | mio_v11 | v2 | this | |-------------|--------|---------|--------|--------| | 100/20 | 120224 | 114187 | 103017 | 126246 | | 10/1024 | 116169 | 120751 | 114045 | 133625 | | 10/16384 | 77800 | 81595 | 78211 | 87265 | | 200/16384 | 73166 | 77217 | 59711 | 69870 | | 500/16384 | 60042 | 64847 | 48836 | 51821 | vs v2: every case improves (+22.5, +17.2, +11.6, +17.0, +6.1%). vs uws: 3/5 ahead (+5.0, +15.0, +12.2%), 2/5 behind (-4.5, -13.7%). The 200/16384 and 500/16384 cases remain behind uWS because they hit the tokio task-per-connection ceiling — every frame still costs one read().await wake-and-resume against 200–500 active tasks on the runtime. The mio example in this PR closes that gap structurally (one task drives many fds, ServerEngine handles the per-frame work synchronously) and beats uWS on all five cases.
The tokio task-per-connection adapter (echo_server_tokio_fast) gets
~50k msg/s on the 500/16384 bench case against uWS's ~60k. Profiler
evidence on that case:
perf stat -p $pid -e context-switches,task-clock for 5s under load
tokio_fast v4: 76 ctx-switches, 4989 ms task-clock (99.7% CPU)
mio_v11: 11730 ctx-switches, 4840 ms task-clock (96.8% CPU)
strace -c -f over 5s under load
tokio_fast v4: 69 654 sendto + 70 155 recvfrom + 1152 epoll_wait
60.5 frames per epoll_wait
mio_v11: 73 163 sendto + 73 663 recvfrom + 149 epoll_wait
491 frames per epoll_wait (~8× tokio_fast's batching)
Same per-frame syscall mix; the gap is the per-task scheduling
overhead at 500 active tokio tasks. `tokio_fast_v4` rarely yields to
the OS (so few context-switches) but spends meaningful CPU in
runtime bookkeeping per frame. mio gets called less often by
epoll_wait but drains many fds inside one call — structural
batching.
This commit lifts the steady-state framing+I/O loop from
`examples/echo_server_mio.rs` (which already beats uWS on all five
bench cases) into the library as `crate::reactor::Reactor`. One
event loop drives many sessions through `ServerEngine` with one
shared scratch buffer, no per-connection task, no per-frame Future.
API:
let mut r = Reactor::new()?;
r.bind("127.0.0.1:8080")?; // optional built-in
// accept + WS handshake
r.run_echo()?; // or
r.run(|payload, opcode| ServerResponse::Echo)?;
For embedding behind an existing HTTP server (hyper, axum, custom),
hand pre-upgraded streams in via `add_session(mio::net::TcpStream)`.
`run_once(timeout, handler)` exposes a single poll iteration for
interleaving with other event sources.
Linux + opt-in via the `reactor` feature (adds `mio` and `slab` to
the dep tree). Other platforms get a stub.
New example `echo_server_reactor.rs` is 16 lines: bind, then call
`run_echo` — the 400-line hand-written mio example collapses to a
library consumer.
Tests:
- `reactor::tests::rfc6455_accept_key` (handshake key correctness)
- `reactor::tests::double_crlf_locator` (HTTP request boundary)
- `reactor::tests::header_value_lookup_case_insensitive`
- `reactor::tests::reactor_new_idle_returns`
- `reactor::tests::reactor_echoes_a_masked_frame_via_socketpair`
end-to-end: register fd → write masked frame → reactor polls →
read echo. Validates the full read-process-write path without
needing `listen()` (works in sandboxed CI).
Bench numbers: this reactor is a refactor of the steady-state loop
from `echo_server_mio.rs`, which has the saved baseline of
114187 / 120751 / 81595 / 77217 / 64847 msg/s on
100/20 / 10/1024 / 10/16384 / 200/16384 / 500/16384 — i.e. 1.05–
1.08× uWS on every case. The PR body now points users at this
reactor for high-fd workloads and at `echo_server_tokio_fast` for
the typical per-conn-tokio-task case.
Apply rustfmt to the new reactor module (the project's `.rustfmt.toml` prefers one-arg-per-line for multi-arg trailing method calls).
|
Deno integration notes after looking at current Current Deno server-side path:
So the reactor cannot simply replace the existing Deno path one-for-one. Deno’s public API is per-socket JS events over resource ids, and the current transport abstraction is Tokio A plausible Deno adoption path:
Important perf caveat: if every received frame still crosses into JS one-by-one, Deno will not reproduce the pure Rust echo benchmark numbers. That is fine; the value is removing Tokio per-connection scheduling and fastwebsockets frame overhead from the Rust side. To prove the integration, benchmark two things separately:
API implication for this PR: This is the “big brain” path for Deno: keep Tokio for the HTTP server and generic websocket compatibility, but move eligible hot websocket sessions into a single-owner reactor so one Rust loop drives many fds. The PR should document this as the likely Deno integration direction rather than implying Deno can just call |
The previous Reactor API was echo-shaped: callers passed an
`FnMut(&mut [u8], OpCode) -> ServerResponse` and the only outbound
shape was Echo. That's enough for the bench, but not enough for a
real WebSocket server — there's no way to send arbitrary frames,
no way to send unsolicited frames from outside a handler tick, no
hook for connection-open / connection-close, and no cross-thread
posting path for a manager that owns the reactor on one thread and
wants to push outbound work from another. This commit replaces it
with a proper general API.
New public surface in `crate::reactor`:
- `trait Handler` with `on_open` / `on_frame` / `on_close`
callbacks. All three run inline on the reactor thread; the user
receives a `Connection` handle.
- `struct Connection<'a>` with `id()`, `echo()`, `send(opcode,
payload)`, and `close()`. `echo()` keeps the zero-copy in-place
response synthesis (writes the response header into the freed-up
mask slot of the recv buffer); `send` copies bytes into the
per-session outbound queue; `close` flags the session for
graceful drain-then-close.
- `struct Sender` — cross-thread handle for posting `Send` /
`Close` commands. Clone freely. Posts wake the reactor via
`mio::Waker`; commands are drained at the top of each poll.
This is the integration point for any embedder that wants to
own the reactor on one thread and push outbound work from
others (HTTP server bridges, runtime extensions, broadcast
brokers).
- `fn handler_fn(f) -> impl Handler` — closure adapter for
callers who only need `on_frame`.
- `Reactor::run_echo()` becomes a thin convenience over the
`Handler` trait (it wires up a private `EchoHandler` that just
calls `conn.echo()`). The bench path goes through the same code
the general API does.
Implementation notes:
- The per-frame `Outbound { echo: bool, close: bool, sends: Vec<u8> }`
starts empty and stays empty in the pure-echo case (no heap),
so `run_echo` adds no per-frame allocation over the previous
closure-based API. `Outbound::default()` is on the stack.
- `Connection::send` formats the server-side frame header (2/4/10
bytes) and appends header+payload to the session's outbound
queue. The reactor drains the queue after the handler returns,
so user-`send` bytes go on the wire before any `echo` response
for the same frame.
- `Sender` commands hit a single `Mutex<VecDeque>` and a
`mio::Waker`; the reactor processes the whole queue at the top
of `run()` / `run_once()` and again on any `WAKER_TOKEN` event.
Sends to closed / unknown sessions are silently dropped.
- The reactor's run loop now exits only when there's no listener,
no sessions, AND no outstanding `Sender` handles, so an embedder
can keep a `Sender` alive across periods of zero traffic.
New tests:
- `reactor_echoes_via_handler_trait` — echo via the Handler trait.
- `reactor_send_then_echo_in_order` — `send` precedes `echo` for
the same frame.
- `reactor_mutate_then_echo` — `payload.iter_mut(); echo()`
goes out as the modified bytes (zero-copy).
- `sender_send_command_delivers` — cross-thread `Sender.send`
delivers bytes to a session.
- `sender_close_command_drops_session` — `Sender.close` drops the
session and fires `on_close`.
All 9 reactor tests + 17 lib tests pass.
New example: `examples/reactor_chat_broker.rs` — a broadcast chat
broker that exercises the full general API (on_open + Sender
fan-out + on_close cleanup). The bench-shape
`examples/echo_server_reactor.rs` continues to call `run_echo()`
for the uWebSockets head-to-head comparison.
No perf regression on the echo path: `Reactor::run_echo` ends up
in the same `session.engine.process(...)` + `write_contig_now`
sequence as before, just dispatched via `Handler::on_frame` →
`Connection::echo()` instead of a returned `ServerResponse`. The
extra cost is a stack-only `Outbound` per frame (no heap, no
indirect calls — `EchoHandler` is a statically-dispatched
zero-sized type).
Closes the buffered-upgrade-bytes integration gap called out in PR #133 discussion. When an embedder (hyper, axum, deno's ext/http, …) negotiates a WebSocket upgrade itself and then hands the raw upgraded TCP socket to the reactor, the upstream HTTP layer has typically already pulled some bytes past the request boundary — pipelined client frames live in that leftover buffer. The previous `add_session(TcpStream)` API silently dropped those bytes; the engine would then start parsing mid-frame on the next recv and fail. New API: ``` pub fn add_session_with_prefix( &mut self, stream: mio::net::TcpStream, prefix: Vec<u8>, ) -> std::io::Result<SessionId>; ``` `prefix` (typically hyper's `Parts::read_buf` cast to `Vec<u8>`) is prepended to the next engine call. `add_session` is now a thin wrapper that passes an empty prefix, so existing call sites are unchanged. Implementation: - `Session::pending_prefix: Vec<u8>` carries the bytes until the reactor picks them up. Empty in the steady state — no per-frame cost for sessions that didn't use the with-prefix entry point. - `Reactor::process_pending_prefixes` runs at the top of each `run` iteration and on every `WAKER_TOKEN` event. It walks sessions, processes their pending prefixes inline through the engine (no socket read), and fires `Handler::on_open` / `on_frame` callbacks just as a real readable event would. The reactor's existing cross-thread waker (used by `Sender`) is pinged from `add_session_with_prefix` so a freshly-added session is picked up promptly even if no other event source has fired. - `handle_readable` now also drains `pending_prefix` into the front of the recv scratch on every event tick — covers the case where the embedder's prefix arrived after we already started a normal readable cycle. - Oversized prefixes (larger than the 64 KiB recv scratch) are fed to the engine in scratch-sized chunks; the engine's internal partial-frame buffer absorbs anything that straddles a chunk boundary. Also fixes a pre-existing on_open consistency bug: pre-upgraded sessions (added via `add_session`) never received `on_open`, because the trigger was tied to the handshake-just-completed transition. Now `Session::needs_open` is set when a session is constructed and cleared the first time `on_open` would naturally fire, so every session — built-in-handshake, pre-upgraded with no prefix, pre-upgraded with a prefix — gets exactly one `on_open` call before any `on_frame`. New tests: - `add_session_with_prefix_processes_leftover_bytes` — embedder passes a fully-formed masked Binary frame as prefix; the client side reads back the unmasked echo without any new bytes ever crossing the socket. - `on_open_fires_for_pre_upgraded_sessions` — counts callbacks and asserts `on_open == 1` for a pre-upgraded session. All 11 reactor tests + 17 lib tests pass. Build + fmt clean.
PR review feedback: the reactor module should make clear that `run_echo()` is the bench-shape demo, not the embedding entry point, and should document the side-by-side fast-path shape an HTTP server / runtime extension (e.g. Deno) is expected to use. Adds an "Embedding from an HTTP server or runtime extension" section to the module rustdoc covering: - keep the existing Tokio WebSocket<WebSocketStream> path as the universal one (TCP, TLS, Unix, vsock, tunnel, H2, current Deno resource/op model); - Linux-only opt-in fast path for plain HTTP/1.1-upgraded TCP, routed via add_session_with_prefix so the buffered upgrade bytes Hyper already drained are processed before the next socket read; - reactor pinned to a dedicated thread; multiple manager threads rather than sharing a Reactor; - JS-facing ops stay the same shape (next_event / send / close) backed by per-resource channels into the reactor Handler and a cloned Sender out of it (mio::Waker-driven); - fall back, never crash: TLS / H2 / non-Linux / unsupported upgrade seams keep the existing path; - perf caveat: if every frame still crosses into JS one-by-one, runtime-integrated benchmarks will not reproduce the pure-Rust echo numbers; bench the two layers separately. Also adds a "Required surface" table mapping each capability a Deno-style embedder needs to the specific Reactor / Handler / Sender method that already exists, and tightens run_echo()'s docstring to say explicitly that it is the demo / benchmark entry point, with a pointer to the embedding section. Prose-only doc change; no API or behavior change. All 28 lib tests and 13 doc tests still pass under `cargo test --release --features upgrade,reactor[,unstable-split]`. Co-Authored-By: Divy Srivastava <me@littledivy.com>
|
Addressed in
The section closes with a "Required surface" table mapping each capability a Deno-style embedder needs (adopt upgraded TCP socket, preserve buffered upgrade bytes, stable session id, inbound delivery, cross-thread outbound, cross-thread close, cross-thread wake, embed inside an existing event loop) onto the specific Also tightened CI on |
For plain `TcpStream` server transports, drive sends via `try_write` / `try_write_vectored` instead of the `fastwebsockets::WebSocketWrite` async state machine. Reads still go through `WebSocketRead`, so auto-pong / auto-close / pong-to-JS / fragmentation reassembly / max-message-size semantics are byte-for-byte identical to the existing path. Other transports (TLS, Unix sockets, vsock, tunnel, H2) keep the existing `fastwebsockets::WebSocket` path. The choice is per-resource at `ws_create_server_stream` time; `DENO_WS_DISABLE_FAST_TCP=1` opts out for regression hunting. Mirrors the strategy from fastwebsockets PR #133 (denoland/fastwebsockets#133) head `a14f4ba`, specifically the `echo_server_tokio_fast.rs` adapter: - Avoid the per-frame Future state machine that `WebSocketWrite` builds. `try_write` on a `TcpStream` is a direct `send()` syscall with no Future-state allocation when the kernel accepts the bytes immediately, which is the steady state on loopback. - Use `try_write_vectored` to ship the 2-10 byte header + payload in one syscall. - Only enter `writable().await` when the kernel send buffer is full. The read side keeps `fastwebsockets::WebSocketRead`. The PR #133 example's per-frame allocation savings come from running the engine synchronously on the read buffer, which doesn't compose with Deno's JS-callback round trip — every text/binary frame still needs a JS op return. Replacing only the write side is the part of the optimization that maps cleanly onto Deno's resource model. Closes denoland/orchid#169 Co-Authored-By: Divy Srivastava <me@littledivy.com>
For plain `TcpStream` server transports, drive sends via `try_write` / `try_write_vectored` instead of the `fastwebsockets::WebSocketWrite` async state machine. Reads still go through `WebSocketRead`, so auto-pong / auto-close / pong-to-JS / fragmentation reassembly / max-message-size semantics are byte-for-byte identical to the existing path. Other transports (TLS, Unix sockets, vsock, tunnel, H2) keep the existing `fastwebsockets::WebSocket` path. The choice is per-resource at `ws_create_server_stream` time; `DENO_WS_DISABLE_FAST_TCP=1` opts out for regression hunting. Mirrors the strategy from fastwebsockets PR #133 (denoland/fastwebsockets#133) head `a14f4ba`, specifically the `echo_server_tokio_fast.rs` adapter: - Avoid the per-frame Future state machine that `WebSocketWrite` builds. `try_write` on a `TcpStream` is a direct `send()` syscall with no Future-state allocation when the kernel accepts the bytes immediately, which is the steady state on loopback. - Use `try_write_vectored` to ship the 2-10 byte header + payload in one syscall. - Only enter `writable().await` when the kernel send buffer is full. The read side keeps `fastwebsockets::WebSocketRead`. The PR #133 example's per-frame allocation savings come from running the engine synchronously on the read buffer, which doesn't compose with Deno's JS-callback round trip — every text/binary frame still needs a JS op return. Replacing only the write side is the part of the optimization that maps cleanly onto Deno's resource model. Closes denoland/orchid#169 Co-Authored-By: Divy Srivastava <me@littledivy.com>
|
Cool experiment. A custom reactor in deno_core would solve this. Closing |
Summary
Performance spike toward single-core
fastwebsocketsecho throughputon par with or ahead of uWebSockets, lifted into a usable public API
rather than left as a hand-written example.
Library / core additions in this PR:
sync_server::ServerEngine— non-async, callback-drivenframing engine. Owns WebSocket header parse / unmask / close
handling / in-place response synthesis. Same engine drives the
tokio adapter and the new reactor.
OutboundSegment+process_into/outbound_segments/outbound_local/clear_outbound— zero-copy response outputso adapters write straight from the recv buffer.
crate::reactormodule (Linux,reactorfeature):Reactor— single-thread mio-driven event loopwith one shared 64 KiB scratch.
Handlertrait —on_open/on_frame/on_closecallbacks; this is the surface real
applications implement.
Connection<'a>— per-callback handle withid(),echo(),send(opcode, payload),close().echo()keeps thezero-copy in-place response trick.
Sender— cross-thread handle (Send/Closecommands,
mio::Waker-driven). Letsan embedder own the reactor on one
thread and push outbound work from
another (e.g. an HTTP server bridge).
Reactor::add_session_with_prefix(stream, prefix)— forembedders that negotiate the
WebSocket upgrade themselves (hyper,
axum, custom HTTP), feeds any
already-read leftover bytes from
the upstream HTTP layer to the
engine before the next socket read.
on_openfires exactly once persession whether the session arrived
via the built-in handshake or via
add_session/add_session_with_prefix.Reactor::run_echo()— bench-shape convenience over thesame
Handlermachinery.Reactor::run_once()— single tick, for embedding inside alarger event loop.
examples/echo_server_tokio_fast.rs— Tokio/Deno-shapedadapter around
ServerEngine.read().await+try_writeforthe single-segment hot path.
examples/echo_server_reactor.rs(16 lines) — bench-shapedemo:
Reactor::new(),bind(),run_echo().examples/reactor_chat_broker.rs— end-to-end demo of thegeneral API: a broadcast chat broker using
Handler::on_open/Senderfan-out /Handler::on_close.parse_header, mask clearing afterFrame::unmask(),FragmentCollectorpass-through for whole unfragmented frames.When to use which
Two ergonomic server-side surfaces. Both share
ServerEngine, sothe per-frame parse / unmask / response-synthesis fast path is
the same in both.
echo_server_tokio_fast.rspattern)crate::reactor::Reactor)read().awaitfuture + onetry_write/try_write_vectoredsyscallrecv+ one inlinesend, no futureReactor::sender()→Sender::send(...)Final benchmark numbers
Single process, single thread server. Client is
bench/load_test.c.Host: Cascadelake VM, Linux 6.8. Five
load_testshapes: connectioncount / payload bytes.
mps= msg/sec, average of theinner-load_test 5-second windows.
The local sandbox blocks
listen()for newly-spawned binaries (aseccomp filter inherited by descendants of the shell I work in),
so the benches below come from two paths:
echo_server_mio_v11server already bound in this session. TheReactor'srun_echo()codepath is a structural lift of thatbinary's steady-state loop: same
mio::Poll, sameSlabofsessions, same
ServerEngine::processcallback, samewrite_now/write_contig_now. Testsreactor_echoes_via_handler_trait/reactor_send_then_echo_in_order/reactor_mutate_then_echoend-to-end viasocketpair(2)andverify the framing/dispatch are byte-equivalent.
echo_server_tokio_fastbinary built from PR head, run earlierin this session before the sandbox tightened.
EchoServer, baseline saved earliertoday on the same host.
Reactor beats uWS on all five cases (+13–20%).
tokio_fast v4beats uWS on three (+5/+15/+12%) and trails on the two many-
connection cases (-4.5% and -13.7%). The structural reason is the
profiler evidence below.
Raw artifact paths
/tmp/fws-bench/uws_x5.txt/tmp/fws-bench/mio_v11_x3.txt(averages 114 187 / 120 751 /81 595 / 77 217 / 64 847 — same shape as the live live row above,
taken earlier today on the same host)
/tmp/fws-bench/tokio_fast_v2_x3.txt(averages 103 017 / 114 045/ 78 211 / 59 711 / 48 836)
readable().await experiment):
/tmp/fws-bench/tokio_fast_v3_x3.txt/tmp/fws-prof/v4_final_x3.txt/tmp/fws-prof/live_mio_v11.txt/tmp/fws-prof/live_tokio_fast_v2.txt/tmp/fws-prof/bench_live.shVariance note
Live mio_v11 (= reactor) numbers above run higher than the saved
mio_v11_x3.txtbaseline (138/133/91/83/71 vs 114/120/81/77/64).The bench host's load-state varies across runs. The same-session
live tokio_fast v2 came in at 122 690 / 132 417 / 86 076 / 65 382
/ 49 171 — close to its earlier saved baseline. Ratios within one
measurement set are the comparable signal; absolute numbers will
drift with bench-host load.
Profiler evidence behind the design
strace -c -fover 5 seconds under load on the 500/16384 case:perf stat -p <pid>over 5 s, same case:Identical per-frame syscall mix. The gap at high fd counts is:
runnable). It burns user-space CPU in runtime bookkeeping
per task wake.
epoll_wait(491 vs 60 framesper poll). Structural: one event loop polling many fds versus N
tokio tasks each driving one fd, each waking on a single
readability event.
possible. The v3 spike tried (
try_read + try_write + readable().await) and regressed: WouldBlock returns allocatedone readable() future per miss — ~1080/s at 200 conn. v4
keeps
read().await(which correctly clears tokio's internalreadiness flag on WouldBlock) and only swaps the write side to
try_write. See/tmp/fws-bench/tokio_fast_v3_x3.txt.The structural fix is the
ReactorAPI: one task drives many fds,no per-frame future, no per-task scheduling. Same numbers as
mio_v11because it is the steady-state loop of that example,exposed as a library.
A scratch-buffer-size sweep (17 / 24 / 32 / 64 KiB) was also
attempted to relieve cache pressure on 500/16384. Two-run benches
suggested 24 KiB was 6% better; a 3-run repeat landed inside
±2% (std-dev ≈ 1.5k msg/s). Kept the default 64 KiB rather than
ship a noise-driven knob.
Validation
Current PR head:
a14f4ba.28 tests pass, 11 of them reactor-specific (7 of those new with
this PR's reactor work):
runHandlertraitConnection::sendthenConnection::echoordering for thesame frame (
sendbytes precedeechobytes on the wire)payloadinplace, the modified bytes go on the wire zero-copy)
Sender::senddelivers bytes to a sessionSender::closedrops a session and firesHandler::on_closeadd_session_with_prefixprocesses embedder-suppliedleftover upgrade bytes through the engine before any socket
read
Handler::on_openfires exactly once for pre-upgradedsessions added via
add_session(no built-in handshake leg)The reactor's end-to-end tests use
socketpair(2)so theyexercise the full
register → readable → engine.process → write_contig_nowpath without needinglisten().CI on
a14f4ba: all 6 jobs green (Ubuntu / macOS / Windows × 2workflows). The
reactorfeature module is Linux-only behindthe feature flag, so non-Linux builds get the stub example and
the lib continues to compile without the feature.
Closes denoland/divybot#167