Skip to content

fix(sandbox): relay WebSocket frames after HTTP 101 Switching Protocols#718

Open
johntmyers wants to merge 1 commit intomainfrom
fix/websocket-101-upgrade-relay-v2
Open

fix(sandbox): relay WebSocket frames after HTTP 101 Switching Protocols#718
johntmyers wants to merge 1 commit intomainfrom
fix/websocket-101-upgrade-relay-v2

Conversation

@johntmyers
Copy link
Copy Markdown
Collaborator

Summary

  • Detect HTTP 101 Switching Protocols in the L7 relay and switch to raw bidirectional TCP instead of re-entering the HTTP parsing loop
  • Validate client sent Upgrade + Connection: Upgrade headers before accepting 101 (rejects unsolicited upgrades from non-compliant upstream servers)
  • Replace bool return type with RelayOutcome enum across the L7Provider trait and all relay functions for cleaner upgrade signaling

Related Issue

Changes

  • crates/openshell-sandbox/src/l7/provider.rs: Add RelayOutcome enum (Reusable/Consumed/Upgraded { overflow }). Update L7Provider::relay trait return type from Result<bool> to Result<RelayOutcome>.
  • crates/openshell-sandbox/src/l7/rest.rs: Detect 101 in relay_response() before the generic 1xx/bodiless handler. Return RelayOutcome directly (no intermediate tuple). Add client_requested_upgrade() validation in relay_http_request_with_resolver(). Update relay_response_to_client wrapper to return RelayOutcome.
  • crates/openshell-sandbox/src/l7/relay.rs: Add shared handle_upgrade() helper. Update relay_rest() and relay_passthrough_with_credentials() to match on RelayOutcome. Add l7_decision = "allow_upgrade" audit log annotation.
  • crates/openshell-sandbox/tests/websocket_upgrade.rs: Integration test spinning up a real WebSocket echo server, upgrading through L7Provider::relay, and exchanging a text frame bidirectionally.
  • crates/openshell-sandbox/Cargo.toml: Add tokio-tungstenite and futures dev-dependencies for integration tests.

Improvements over #683

Area #683 This PR
Return type (bool, u16, Vec<u8>) tuple RelayOutcome enum directly from relay_response()
Trait consistency L7Provider::relay still returned bool Trait updated to Result<RelayOutcome>
Unsolicited 101 No validation Validates client sent Upgrade + Connection: Upgrade headers
Code duplication Duplicate upgrade handling in two relay paths Shared handle_upgrade() helper
Audit logging No upgrade-specific annotation l7_decision = "allow_upgrade" for observability
Dead code relay_response_to_client silently discarded overflow Updated to return RelayOutcome
Integration test Unit test only (relay_response in isolation) Full WebSocket echo test through L7Provider::relay

Security Notes

  • Post-upgrade L7 bypass is inherent and documented: After copy_bidirectional, no further L7 inspection occurs. The initial HTTP request is policy-checked before the upstream ever sees it.
  • Unsolicited 101 rejection: If the upstream sends 101 but the client never requested an upgrade, the connection is killed (Consumed). Prevents a non-compliant upstream from triggering a policy bypass.
  • Overflow buffer is bounded: ~17 KiB max (header read buffer + one read call), immediately forwarded, then dropped.

Testing

  • mise run pre-commit passes (format fixed)
  • All 405 existing unit tests pass
  • 8 new unit tests: 101 overflow capture, 101 no overflow, unsolicited 101 rejection, accepted 101 with upgrade headers, 4 client_requested_upgrade header parsing tests
  • 2 new integration tests: WebSocket upgrade through L7 relay with echo exchange, normal HTTP regression test
  • cargo clippy clean (no new warnings)

Checklist

  • Follows Conventional Commits
  • Architecture docs updated (not applicable — no new components)
  • Tests added for new behavior

Detect 101 Switching Protocols in relay_response() and switch to raw
bidirectional TCP relay instead of re-entering the HTTP parsing loop.

Previously, is_bodiless_response() treated 101 as a generic 1xx
informational response, forwarding only the headers and returning to
the HTTP parsing loop. After a 101, subsequent bytes are upgraded
protocol frames (e.g. WebSocket), not HTTP — causing the relay to
block or silently drop all post-upgrade traffic.

Changes:
- Add RelayOutcome enum (Reusable/Consumed/Upgraded) replacing bool
  return type across L7Provider::relay trait and all relay functions
- Detect 101 before generic 1xx handler in relay_response(), capture
  overflow bytes, return RelayOutcome::Upgraded
- Validate client sent Upgrade + Connection: Upgrade headers before
  accepting 101 (rejects unsolicited upgrades from non-compliant
  upstream servers)
- Extract shared handle_upgrade() helper used by both relay_rest()
  and relay_passthrough_with_credentials()
- Add l7_decision=allow_upgrade audit log annotation for upgrades
- Add unit tests for 101 overflow capture, unsolicited 101 rejection,
  and client_requested_upgrade header validation
- Add integration test: WebSocket echo through L7Provider::relay

Fixes: #652
@davidpeden3
Copy link
Copy Markdown

davidpeden3 commented Apr 1, 2026

tested pr #718 on docker desktop + wsl2 (windows 11, amd64, openshell v0.0.16). clean before/after results confirm the fix works.

setup:

  • icingdeath: windows 11, docker desktop 29.3.1, wsl2 kernel 6.6.87.2
  • openclaw sandbox connecting to gateway at MY_GATEWAY_IP:18789
  • test: socat proxy tunnel inside sandbox netns → raw python websocket upgrade + frame exchange
  • policy: openclaw_gateway allowing socat → MY_GATEWAY_IP:18789

before (stock binary, v0.0.16):
binary md5: 6ed90711e33ec487abf39e32dea43a60

[1/4] connecting to socat tunnel at 127.0.0.1:18790...
[2/4] sending websocket upgrade request...
    upgrade response: http/1.1 101 switching protocols
[3/4] sending websocket text frame ('hello')...
[4/4] waiting for websocket echo response...
[fail] timeout — websocket frames not relayed after 101 (issue #652 confirmed)

after (pr #718 fix/websocket-101-upgrade-relay-v2):
binary md5: 99be46a66d97a8012fed6e36d16a1693

[1/4] connecting to socat tunnel at 127.0.0.1:18790...
[2/4] sending websocket upgrade request...
    upgrade response: http/1.1 101 switching protocols
[3/4] sending websocket text frame ('hello')...
[4/4] waiting for websocket echo response...
    received: opcode=8 length=2 payload='...'
[pass] websocket frame received

proxy logs (after):

debug openshell_sandbox::proxy: evaluate_opa_tcp total: 3ms host=MY_GATEWAY_IP port=18789
debug openshell_sandbox::proxy: handle_tcp_connection dns_resolve_and_tcp_connect: 4ms host=MY_GATEWAY_IP
debug openshell_sandbox::l7::rest: relay_response framing
debug openshell_sandbox::l7::rest: 101 switching protocols — signaling protocol upgrade

the 101 is correctly detected in relay_response(), relayoutcome::upgraded is returned, and handle_upgrade() switches to copy_bidirectional for raw frame relay. the gateway responded with a close frame, confirming end-to-end bidirectional websocket communication through the proxy.

note on pr #720 interaction:

we also merged feat/ocsf-log-integration (#720) into the same build to test the new ocsf structured logging alongside the websocket fix. there's a minor merge conflict: handle_upgrade() in relay.rs uses info!() (from #718), but #720 removes info from the tracing imports in that file in favor of ocsf events. we resolved it by converting the handle_upgrade() log to a networkactivitybuilder ocsf event at severityid::informational (preserving the intended log level from #718).

depending on merge order:

either way it's a one-liner, just flagging it so it doesn't surprise anyone.

@mjamiv
Copy link
Copy Markdown

mjamiv commented Apr 1, 2026

Tested this against a live Slack Socket Mode deployment running OpenClaw inside an OpenShell sandbox. Different setup than @davidpeden3's socat echo test — this exercises the full application path including bidirectional file transfers.

Environment

  • Ubuntu 22.04, Hetzner, x86_64
  • OpenShell 0.0.16 cluster container (unchanged)
  • Sandbox binary built from this branch, swapped in via overlay remount
  • OpenClaw 2026.3.31 with Slack Socket Mode (@slack/socket-mode + ws)
  • All traffic through CONNECT proxy at 10.200.0.1:3128

How we tested

We run two sandboxes under the same gateway — agent (production, stock 0.0.16 binary) and agent2 (test, this PR). Side-by-side, no risk to production.

  1. Built openshell-sandbox from this branch
  2. Swapped binary in agent2 pod only (remount overlay rw, cp, restart pod)
  3. Confirmed Slack connects with our existing tls: skip workaround (baseline)
  4. Removed tls: skip from the slack_websocket policy — on 0.0.16 this kills the connection
  5. Restarted gateway to force fresh WebSocket through the L7 path

Results

Everything works:

  • Slack Socket Mode connects without tls: skip — L7 proxy handles 101 upgrade correctly
  • Inbound/outbound messages normal
  • Inbound file processing (image sent in Slack → downloaded through proxy → vision model) — works
  • Outbound file delivery (image generated → uploaded to Slack via *.slack-files.com) — works
  • No frame corruption, socket hang ups, or connection drops during testing
  • Production sandbox on 0.0.16 completely unaffected throughout

The working policy (for anyone else testing)

slack_websocket:
  name: slack_websocket
  endpoints:
    - host: wss-primary.slack.com
      port: 443
      # tls: skip not needed with this fix
      enforcement: enforce
      access: full
    - host: wss-backup.slack.com
      port: 443
      enforcement: enforce
      access: full
  binaries:
    - { path: /usr/bin/node }
    - { path: /usr/bin/curl }

Bottom line

This fixes #652 for us. On 0.0.16 the same config fails because is_bodiless_response() treats 101 as generic 1xx and re-enters the HTTP parser. With this PR, the transition to raw bidirectional TCP works cleanly. We can drop our tls: skip split-policy workaround once this ships.

Nice work @johntmyers — appreciate the quick turnaround. We didn't test the #720 interaction — only learned about it from @davidpeden3's comment. Our build was this branch only, no other PRs merged in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e Requires end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Egress proxy fails to relay WebSocket frames after successful HTTP CONNECT + 101 Switching Protocols

3 participants