netlink: virtualize NETLINK_ROUTE as loopback-only view #15
Merged
congwang-mk merged 1 commit into main on Apr 16, 2026
Replaces the unconditional AF_NETLINK seccomp block with a userspace virtualization that lets sandboxed processes open NETLINK_ROUTE sockets and see a synthetic one-interface (`lo`) view. Other netlink protocols (AUDIT, GENERIC, etc.) remain blocked via EAFNOSUPPORT in the handler.

Signed-off-by: Cong Wang <cwang@multikernel.io>
Fixes #14 — OpenClaw (and any other Node/Go/Rust/glibc-linked app that
iterates interfaces to validate a loopback bind target) could not start
inside a sandlock sandbox because `AF_NETLINK` sockets were blocked
outright. glibc's `getifaddrs`/`if_nameindex`/`__check_pf` therefore
returned an empty set, which made OpenClaw fall back to `0.0.0.0` and
then refuse it as non-loopback (`gateway bind=loopback resolved to
non-loopback host 0.0.0.0; refusing fallback to a network bind`).
Summary
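- Replace the unconditional `AF_NETLINK` seccomp block with a userspace
  virtualization that synthesizes a loopback-only netlink view.
- Other netlink protocols (AUDIT, GENERIC, etc.) stay blocked: they
  return `EAFNOSUPPORT` from the `socket()` handler.
- Intercepted syscalls include `getsockname`, `recvmsg`/`recvfrom`, plus
  `close` for cookie-set cleanup. Send traffic flows through a real
  `AF_UNIX SOCK_SEQPACKET` socketpair and the kernel handles it natively.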
Design
```
socket(AF_NETLINK, *, NETLINK_ROUTE)
→ handler creates socketpair(AF_UNIX, SOCK_SEQPACKET)
→ spawns per-sandbox tokio responder task on the supervisor's runtime
(same pattern as http_acl.rs, AsyncFd-based)
→ InjectFdSendTracked: ADDFD_SEND returns the child-side fd number,
and an on_success callback records (tgid, fd) atomically
→ responder reads request datagrams, synthesizes RTM_NEWLINK for `lo`,
RTM_NEWADDR for 127.0.0.1 and ::1, NLMSG_DONE, concatenates into
one datagram, sends back via the real socketpair
```
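The synthesis step can be sketched in dependency-free Rust. This is an illustrative sketch, not the contents of `synth.rs`: the helper names (`push_attr`, `push_msg`, `lo_link_body`, `synth_getlink_reply`) are hypothetical, only the `RTM_NEWLINK` + `NLMSG_DONE` part of a dump is shown (the `RTM_NEWADDR` messages would follow the same framing), and the constants are the values from the Linux UAPI headers.

```rust
// Values from <linux/netlink.h>, <linux/rtnetlink.h>, <linux/if.h>, <linux/if_arp.h>.
const NLMSG_DONE: u16 = 3;
const RTM_NEWLINK: u16 = 16;
const NLM_F_MULTI: u16 = 0x2;
const IFLA_IFNAME: u16 = 3;
const ARPHRD_LOOPBACK: u16 = 772;
const IFF_UP: u32 = 0x1;
const IFF_LOOPBACK: u32 = 0x8;
const IFF_RUNNING: u32 = 0x40;

/// Netlink aligns messages and attributes to 4 bytes.
fn align4(n: usize) -> usize {
    (n + 3) & !3
}

/// Appends one `rtattr` (u16 len, u16 type, payload) plus padding.
fn push_attr(buf: &mut Vec<u8>, kind: u16, payload: &[u8]) {
    let len = 4 + payload.len(); // rta_len excludes trailing padding
    buf.extend_from_slice(&(len as u16).to_ne_bytes());
    buf.extend_from_slice(&kind.to_ne_bytes());
    buf.extend_from_slice(payload);
    buf.resize(buf.len() + (align4(len) - len), 0);
}

/// Appends one message: 16-byte `nlmsghdr` followed by `body`.
fn push_msg(out: &mut Vec<u8>, kind: u16, flags: u16, seq: u32, pid: u32, body: &[u8]) {
    let len = 16 + body.len();
    out.extend_from_slice(&(len as u32).to_ne_bytes()); // nlmsg_len
    out.extend_from_slice(&kind.to_ne_bytes()); // nlmsg_type
    out.extend_from_slice(&flags.to_ne_bytes()); // nlmsg_flags
    out.extend_from_slice(&seq.to_ne_bytes()); // nlmsg_seq (echoes the request)
    out.extend_from_slice(&pid.to_ne_bytes()); // nlmsg_pid (requester's port id)
    out.extend_from_slice(body);
    out.resize(out.len() + (align4(len) - len), 0);
}

/// `ifinfomsg` for interface index 1 plus an `IFLA_IFNAME` attribute of "lo".
fn lo_link_body() -> Vec<u8> {
    let mut b = Vec::new();
    b.push(0); // ifi_family = AF_UNSPEC
    b.push(0); // padding
    b.extend_from_slice(&ARPHRD_LOOPBACK.to_ne_bytes()); // ifi_type
    b.extend_from_slice(&1i32.to_ne_bytes()); // ifi_index
    b.extend_from_slice(&(IFF_UP | IFF_LOOPBACK | IFF_RUNNING).to_ne_bytes()); // ifi_flags
    b.extend_from_slice(&0u32.to_ne_bytes()); // ifi_change
    push_attr(&mut b, IFLA_IFNAME, b"lo\0");
    b
}

/// Concatenates the link message and the terminator into one datagram.
fn synth_getlink_reply(seq: u32, pid: u32) -> Vec<u8> {
    let mut out = Vec::new();
    push_msg(&mut out, RTM_NEWLINK, NLM_F_MULTI, seq, pid, &lo_link_body());
    push_msg(&mut out, NLMSG_DONE, NLM_F_MULTI, seq, pid, &0i32.to_ne_bytes());
    out
}
```

Calling `synth_getlink_reply(seq, tgid as u32)` with the request's sequence number yields a single buffer that could be written to the supervisor side of the socketpair in one `send`.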
The cookie set is keyed by `(tgid, fd)`: fds are process-scoped, so a
cookie created by one thread must be visible to its siblings. The `close`
intercept removes the entry so reused fd slots don't collide. `getsockname`
writes `nl_pid = tgid` for the same reason (stable across threads).
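A minimal sketch of that bookkeeping, assuming a mutex-guarded `HashSet`. The `CookieSet` type and its method names are hypothetical and are not taken from the sandlock source:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Hypothetical tracker for fds that refer to a virtualized netlink socket.
/// Keys are (tgid, fd): the thread-group id, not the thread id, because
/// every thread in a process shares one fd table.
#[derive(Default)]
struct CookieSet {
    entries: Mutex<HashSet<(i32, i32)>>,
}

impl CookieSet {
    /// Called once the fd has been injected into the child.
    fn register(&self, tgid: i32, fd: i32) {
        self.entries.lock().unwrap().insert((tgid, fd));
    }

    /// Lets handlers such as `recvmsg` or `getsockname` ask whether the fd
    /// is one of the virtualized sockets.
    fn is_virtual(&self, tgid: i32, fd: i32) -> bool {
        self.entries.lock().unwrap().contains(&(tgid, fd))
    }

    /// Called from the `close` intercept; returns true if an entry was removed,
    /// so a later fd that reuses the slot is not mistaken for a netlink cookie.
    fn on_close(&self, tgid: i32, fd: i32) -> bool {
        self.entries.lock().unwrap().remove(&(tgid, fd))
    }
}
```

Because the key uses the tgid, a lookup from any sibling thread of the process that opened the socket resolves to the same entry.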
Subtle things
TOCTOU via `(tgid, fd)` tracking: the entry is registered in the
`on_success` callback of `InjectFdSendTracked`, which runs after the
kernel's `ADDFD_SEND` ioctl returns but before the child's syscall
unblocks. There's no window where another thread can race the check.
msg_name zeroing: glibc's netlink reader (`ifaddrs.c::__netlink_request`)
rejects replies where `source_addr.nl_pid != 0` with a silent
`continue`. On unix-socketpair recvmsg the kernel only writes
`sun_family` (2 bytes) into msg_name, leaving bytes 2..end as
uninitialized stack. Without pre-zeroing those bytes, runs hung ~50%
of the time in `__skb_wait_for_more_packets`. The recvmsg handler now
zeros the 12-byte `sockaddr_nl` region via `process_vm_writev` before
letting the kernel run.
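The effect of the pre-zeroing can be modelled without any syscalls. The sketch below is a simulation under the assumptions stated above (the kernel writes only the 2-byte family into `msg_name`); the function names are hypothetical and `prezero` stands in for the `process_vm_writev` call:

```rust
/// sizeof(struct sockaddr_nl): u16 family, u16 pad, u32 nl_pid, u32 nl_groups.
const SOCKADDR_NL_LEN: usize = 12;

/// Reads `nl_pid` (bytes 4..8) the way a netlink client inspects the source address.
fn nl_pid(name: &[u8; SOCKADDR_NL_LEN]) -> u32 {
    u32::from_ne_bytes([name[4], name[5], name[6], name[7]])
}

/// Models recvmsg on the unix socketpair: only the family field is written.
fn unix_recvmsg_writes_family(name: &mut [u8; SOCKADDR_NL_LEN]) {
    const AF_UNIX: u16 = 1;
    name[..2].copy_from_slice(&AF_UNIX.to_ne_bytes());
}

/// Stand-in for the supervisor zeroing the child's buffer before the syscall runs.
fn prezero(name: &mut [u8; SOCKADDR_NL_LEN]) {
    name.fill(0);
}
```

With a buffer full of stale stack bytes, `nl_pid` is nonzero after the simulated receive and the reply would be skipped; after `prezero` it reads as 0 and the reply is accepted.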
Memory access goes through the existing
`seccomp::notif::{read_child_mem, write_child_mem}` helpers which do
`SECCOMP_IOCTL_NOTIF_ID_VALID` checks before and after every
`process_vm_readv`/`process_vm_writev` — the canonical seccomp-notify
TOCTOU mitigation.
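The check-access-check pattern looks roughly like the sketch below. It is not the body of `read_child_mem`: `guarded` and `MemError` are hypothetical names, and the closures stand in for the `SECCOMP_IOCTL_NOTIF_ID_VALID` ioctl and the `process_vm_readv`/`process_vm_writev` call so the control flow can be shown without FFI.

```rust
#[derive(Debug, PartialEq)]
enum MemError {
    /// The notification id is no longer valid: the target thread has gone
    /// away or its syscall was interrupted, so the pid may refer to
    /// something else.
    StaleNotification,
    /// The memory access itself failed.
    Io,
}

/// Runs `access` only while the notification is valid, and discards the
/// result if the notification became invalid during the access.
fn guarded<T>(
    mut id_valid: impl FnMut() -> bool,
    access: impl FnOnce() -> Option<T>,
) -> Result<T, MemError> {
    if !id_valid() {
        return Err(MemError::StaleNotification);
    }
    let value = access().ok_or(MemError::Io)?;
    // Re-check: data read from a process that exited mid-access cannot be trusted.
    if !id_valid() {
        return Err(MemError::StaleNotification);
    }
    Ok(value)
}
```

The second check is what turns a racy read into one whose result is only used if the notification was still live afterwards.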
Non-route netlink stays blocked at the `socket()` handler with
`EAFNOSUPPORT`. There is no path to a real non-route netlink fd, even
via direct `syscall(SYS_socket, ...)`.
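The resulting policy at `socket()` can be summarized as a small decision function. This is a sketch with hypothetical names (`SocketDecision`, `decide_socket`); the constants are the Linux values on x86-64 and other generic-errno architectures, and passing non-netlink domains through unchanged is an assumption made here for illustration, since other sandbox rules may still apply to them.

```rust
const AF_NETLINK: i32 = 16;
const NETLINK_ROUTE: i32 = 0;
const EAFNOSUPPORT: i32 = 97;

#[derive(Debug, PartialEq)]
enum SocketDecision {
    /// Not a netlink socket: this handler has no opinion.
    Continue,
    /// NETLINK_ROUTE: hand back one end of a socketpair instead.
    Virtualize,
    /// Any other netlink protocol: fail the syscall with this errno.
    Deny(i32),
}

/// The raw protocol argument is inspected, so it makes no difference whether
/// the call came from libc's `socket()` or a direct `syscall(SYS_socket, ...)`.
fn decide_socket(domain: i32, protocol: i32) -> SocketDecision {
    match (domain, protocol) {
        (AF_NETLINK, NETLINK_ROUTE) => SocketDecision::Virtualize,
        (AF_NETLINK, _) => SocketDecision::Deny(EAFNOSUPPORT),
        _ => SocketDecision::Continue,
    }
}
```

For example, `decide_socket(16, 9)` (NETLINK_AUDIT) yields `Deny(97)`, while `decide_socket(16, 0)` yields `Virtualize`.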
Test plan
6 unit tests for wire format + synthesis (`proto.rs`, `synth.rs`)
Integration tests (`tests/integration/test_netlink_virt.rs`):
`__check_pf` path, which dumps RTM_GETADDR to pick address families
still works through the netlink `bind` handler's fall-through
178/178 full integration suite passing
50/50 reliability loop on `if_nameindex_returns_only_lo`
Manual CLI smoke tests from inside the sandbox:
```
$ sandlock run -- python3 -c 'import socket; print(socket.if_nameindex())'
[(1, 'lo')]
$ sandlock run -- python3 -c "import socket
$ sandlock run -- node -e 'console.log(JSON.stringify(require("os").networkInterfaces(), null, 2))'
{ "lo": [ { "address": "127.0.0.1", ..., "internal": true, ... },
{ "address": "::1", ..., "internal": true, ... } ] }
$ sandlock run -- python3 -c "import socket
```
Out of scope / follow-ups
IPs). OpenClaw only needs `lo`, so this is YAGNI; add a
`policy.synthetic_interfaces: Vec<...>` field if an app requires it.
If a process dups a netlink cookie to another fd slot, the dup isn't in
the cookie set and operations on it fall through. No real-world app
exercises this (glibc doesn't dup netlink sockets), but
`pidfd_getfd`-based dup propagation would close the gap if we ever need it.
🤖 Generated with Claude Code