netlink: virtualize NETLINK_ROUTE as loopback-only view #15

Merged: congwang-mk merged 1 commit into main from feat/netlink-virt on Apr 16, 2026
@congwang-mk

Fixes #14 — OpenClaw (and any other Node/Go/Rust/glibc-linked app that
iterates interfaces to validate a loopback bind target) could not start
inside a sandlock sandbox because `AF_NETLINK` sockets were blocked
outright. glibc's `getifaddrs`/`if_nameindex`/`__check_pf` therefore
returned an empty set, which made OpenClaw fall back to `0.0.0.0` and
then refuse it as non-loopback (`gateway bind=loopback resolved to
non-loopback host 0.0.0.0; refusing fallback to a network bind`).

Summary

  • Replace the unconditional `AF_NETLINK` seccomp-BPF block with userspace
    virtualization that synthesizes a loopback-only netlink view.
  • Other netlink protocols (`NETLINK_AUDIT`, `NETLINK_GENERIC`, ...) still
    return `EAFNOSUPPORT` from the `socket()` handler.
  • The supervisor intercepts only `socket`, `bind`, `getsockname`, and
    `recvmsg`/`recvfrom`, plus `close` for cookie-set cleanup. Send
    traffic flows through a real `AF_UNIX SOCK_SEQPACKET` socketpair and
    the kernel handles it natively.

Design

```
socket(AF_NETLINK, *, NETLINK_ROUTE)
  → handler creates socketpair(AF_UNIX, SOCK_SEQPACKET)
  → spawns per-sandbox tokio responder task on the supervisor's runtime
    (same pattern as http_acl.rs, AsyncFd-based)
  → InjectFdSendTracked: ADDFD_SEND returns the child-side fd number,
    and an on_success callback records (tgid, fd) atomically
  → responder reads request datagrams, synthesizes RTM_NEWLINK for `lo`,
    RTM_NEWADDR for 127.0.0.1 and ::1, NLMSG_DONE, concatenates into
    one datagram, sends back via the real socketpair
```
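To make the responder's synthesis step concrete, here is a rough sketch of building an `RTM_NEWLINK` message for `lo` followed by `NLMSG_DONE` in one buffer. It is not the code in `proto.rs`/`synth.rs`: the helper names are hypothetical, only `IFLA_IFNAME` is emitted (a real reply would likely carry more attributes), and the `RTM_NEWADDR` messages are omitted for brevity. The constants and struct layouts follow the Linux rtnetlink headers.

```rust
const NLMSG_DONE: u16 = 3;
const RTM_NEWLINK: u16 = 16;
const NLM_F_MULTI: u16 = 2;
const IFLA_IFNAME: u16 = 3;
const ARPHRD_LOOPBACK: u16 = 772;
const IFF_UP: u32 = 0x1;
const IFF_LOOPBACK: u32 = 0x8;
const IFF_RUNNING: u32 = 0x40;
const NLMSG_HDRLEN: usize = 16;

/// Round up to netlink's 4-byte alignment.
fn align4(n: usize) -> usize {
    (n + 3) & !3
}

/// Append one netlink message: 16-byte nlmsghdr, payload, then padding.
fn push_msg(out: &mut Vec<u8>, ty: u16, flags: u16, seq: u32, pid: u32, payload: &[u8]) {
    let len = (NLMSG_HDRLEN + payload.len()) as u32;
    out.extend_from_slice(&len.to_ne_bytes()); // nlmsg_len
    out.extend_from_slice(&ty.to_ne_bytes()); // nlmsg_type
    out.extend_from_slice(&flags.to_ne_bytes()); // nlmsg_flags
    out.extend_from_slice(&seq.to_ne_bytes()); // nlmsg_seq
    out.extend_from_slice(&pid.to_ne_bytes()); // nlmsg_pid
    out.extend_from_slice(payload);
    out.resize(align4(out.len()), 0);
}

/// ifinfomsg for `lo` plus a single IFLA_IFNAME attribute.
fn lo_link_payload() -> Vec<u8> {
    let mut p = Vec::new();
    p.push(0u8); // ifi_family = AF_UNSPEC
    p.push(0u8); // padding
    p.extend_from_slice(&ARPHRD_LOOPBACK.to_ne_bytes()); // ifi_type
    p.extend_from_slice(&1i32.to_ne_bytes()); // ifi_index
    p.extend_from_slice(&(IFF_UP | IFF_LOOPBACK | IFF_RUNNING).to_ne_bytes()); // ifi_flags
    p.extend_from_slice(&0u32.to_ne_bytes()); // ifi_change
    let name = b"lo\0";
    let rta_len = (4 + name.len()) as u16; // rtattr header + value
    p.extend_from_slice(&rta_len.to_ne_bytes());
    p.extend_from_slice(&IFLA_IFNAME.to_ne_bytes());
    p.extend_from_slice(name);
    p.resize(align4(p.len()), 0);
    p
}

/// One datagram answering an RTM_GETLINK dump: the `lo` link, then NLMSG_DONE.
fn synth_link_dump(seq: u32, pid: u32) -> Vec<u8> {
    let mut out = Vec::new();
    push_msg(&mut out, RTM_NEWLINK, NLM_F_MULTI, seq, pid, &lo_link_payload());
    push_msg(&mut out, NLMSG_DONE, NLM_F_MULTI, seq, pid, &0i32.to_ne_bytes());
    out
}
```

Echoing the request's sequence number in `seq` matters in practice, since netlink clients typically match replies to requests by it.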

The cookie set is keyed by `(tgid, fd)`: fds are process-scoped, so a
cookie created by one thread must be visible to its siblings. The `close`
intercept removes the entry so reused fd slots don't collide.
`getsockname` writes `nl_pid = tgid` for the same reason (stable across
threads).
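The `(tgid, fd)` bookkeeping could look roughly like the sketch below. `CookieSet` and `sockaddr_nl_for` are hypothetical names chosen for illustration, and the `Mutex<HashSet<..>>` representation is an assumption; the actual data structure in sandlock may differ.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Hypothetical cookie set: which (tgid, fd) pairs are virtualized netlink sockets.
#[derive(Default)]
struct CookieSet {
    inner: Mutex<HashSet<(u32, i32)>>,
}

impl CookieSet {
    /// Called once the fd has been injected into the child.
    fn register(&self, tgid: u32, fd: i32) {
        self.inner.lock().unwrap().insert((tgid, fd));
    }

    /// Any thread of the same process (same tgid) sees the same answer.
    fn is_virtual(&self, tgid: u32, fd: i32) -> bool {
        self.inner.lock().unwrap().contains(&(tgid, fd))
    }

    /// Called from the close intercept so a reused fd number starts clean.
    /// Returns true if the fd was a tracked netlink cookie.
    fn on_close(&self, tgid: u32, fd: i32) -> bool {
        self.inner.lock().unwrap().remove(&(tgid, fd))
    }
}

/// 12-byte sockaddr_nl for a getsockname reply, with nl_pid = tgid.
fn sockaddr_nl_for(tgid: u32) -> [u8; 12] {
    let mut sa = [0u8; 12];
    sa[0..2].copy_from_slice(&16u16.to_ne_bytes()); // nl_family = AF_NETLINK
    sa[4..8].copy_from_slice(&tgid.to_ne_bytes()); // nl_pid
    sa // nl_pad and nl_groups stay zero
}
```

Keying on the thread-group id rather than the calling thread's id is what keeps lookups consistent when one thread opens the socket and another reads from it.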

Subtle things

  1. TOCTOU via `(tgid, fd)` tracking: the entry is registered in the
    `on_success` callback of `InjectFdSendTracked`, which runs after the
    kernel's `ADDFD_SEND` ioctl returns but before the child's syscall
    unblocks. There's no window where another thread can race the check.

  2. msg_name zeroing: glibc's netlink reader (`ifaddrs.c::__netlink_request`)
    rejects replies where `source_addr.nl_pid != 0` with a silent
    `continue`. On unix-socketpair recvmsg the kernel only writes
    `sun_family` (2 bytes) into msg_name, leaving bytes 2..end as
    uninitialized stack. Without pre-zeroing those bytes, glibc could
    read a garbage non-zero `nl_pid`, discard the reply, and block in the
    next recvmsg with nothing left to read; runs hung ~50% of the time in
    `__skb_wait_for_more_packets`. The recvmsg handler now zeros the
    12-byte `sockaddr_nl` region via `process_vm_writev` before letting
    the kernel run.

  3. Memory access goes through the existing
    `seccomp::notif::{read_child_mem, write_child_mem}` helpers which do
    `SECCOMP_IOCTL_NOTIF_ID_VALID` checks before and after every
    `process_vm_readv`/`process_vm_writev` — the canonical seccomp-notify
    TOCTOU mitigation.

  4. Non-route netlink stays blocked at the `socket()` handler with
    `EAFNOSUPPORT`. There is no path to a real non-route netlink fd, even
    via direct `syscall(SYS_socket, ...)`.
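The msg_name issue in item 2 is easy to model in isolation. The following is a pure simulation under the assumptions stated in that item (the kernel writes only the 2-byte family field; glibc skips replies with a non-zero source `nl_pid`). It performs no syscalls, and all function names are hypothetical.

```rust
const SOCKADDR_NL_LEN: usize = 12;

/// Models the kernel side as described above: only the family field
/// (bytes 0..2) of msg_name is written on a unix-socketpair recvmsg.
fn kernel_writes_family_only(msg_name: &mut [u8; SOCKADDR_NL_LEN]) {
    msg_name[0..2].copy_from_slice(&1u16.to_ne_bytes()); // AF_UNIX = 1
}

/// Models the glibc-side check: nl_pid lives at bytes 4..8 of sockaddr_nl,
/// and a non-zero value causes the reply to be skipped.
fn glibc_accepts(msg_name: &[u8; SOCKADDR_NL_LEN]) -> bool {
    let nl_pid = u32::from_ne_bytes([msg_name[4], msg_name[5], msg_name[6], msg_name[7]]);
    nl_pid == 0
}

/// Simulate one recvmsg. `stale` is whatever the caller's stack held before
/// the call; `pre_zero` mirrors the handler zeroing the region first.
fn recv_reply(stale: [u8; SOCKADDR_NL_LEN], pre_zero: bool) -> bool {
    let mut msg_name = stale;
    if pre_zero {
        msg_name = [0u8; SOCKADDR_NL_LEN];
    }
    kernel_writes_family_only(&mut msg_name);
    glibc_accepts(&msg_name)
}
```

With garbage in the buffer, `recv_reply(garbage, false)` rejects the reply while `recv_reply(garbage, true)` accepts it; a buffer that happens to be zero already is accepted either way, which is consistent with the hang being intermittent rather than deterministic.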

Test plan

  • 6 unit tests for wire format + synthesis (`proto.rs`, `synth.rs`)

  • Integration tests (`tests/integration/test_netlink_virt.rs`):

    • `if_nameindex_returns_only_lo` — exercises RTM_GETLINK dump
    • `getaddrinfo_ai_addrconfig_returns_v4_and_v6` — exercises glibc
      `__check_pf` path, which dumps RTM_GETADDR to pick address families
    • `loopback_bind_succeeds` — regression check that non-netlink bind
      still works through the netlink `bind` handler's fall-through
    • `non_route_netlink_still_blocked` — NETLINK_AUDIT gets EAFNOSUPPORT
  • 178/178 full integration suite passing

  • 50/50 reliability loop on `if_nameindex_returns_only_lo`

  • Manual CLI smoke tests from inside the sandbox:

    ```
    $ sandlock run -- python3 -c 'import socket; print(socket.if_nameindex())'
    [(1, 'lo')]

    $ sandlock run -- python3 -c "import socket

    print(socket.getaddrinfo('localhost', 443, flags=socket.AI_ADDRCONFIG))"
    [(<AddressFamily.AF_INET6: 10>, ..., ('::1', 443, 0, 0)),
    (<AddressFamily.AF_INET: 2>, ..., ('127.0.0.1', 443))]

    $ sandlock run -- node -e 'console.log(JSON.stringify(require("os").networkInterfaces(), null, 2))'
    { "lo": [ { "address": "127.0.0.1", ..., "internal": true, ... },
    { "address": "::1", ..., "internal": true, ... } ] }

    $ sandlock run -- python3 -c "import socket

    try: socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, 9).close()
    except OSError as e: print(f'BLOCKED errno={e.errno}')"
    BLOCKED errno=97
    ```

Out of scope / follow-ups

  • Multi-interface synthesis (e.g. a fake `eth0` with policy-declared
    IPs). OpenClaw only needs `lo`, so this is YAGNI; add a
    `policy.synthetic_interfaces: Vec<...>` field if an app requires it.
  • `dup`/`dup2`/`dup3`/`fcntl(F_DUPFD)` tracking. If a child dups a
    netlink cookie to another fd slot, the dup isn't in the cookie set and
    operations on it fall through. No real-world app exercises this (glibc
    doesn't dup netlink sockets), but `pidfd_getfd`-based dup propagation
    would close it if we ever need it.

🤖 Generated with Claude Code

Replaces the unconditional AF_NETLINK seccomp block with a userspace
virtualization that lets sandboxed processes open NETLINK_ROUTE sockets
and see a synthetic one-interface (`lo`) view. Other netlink protocols
(AUDIT, GENERIC, etc.) remain blocked via EAFNOSUPPORT in the handler.

Signed-off-by: Cong Wang <cwang@multikernel.io>
congwang-mk merged commit e5f5be9 into main on Apr 16, 2026
4 checks passed
Development

Successfully merging this pull request may close these issues.

Openclaw Supported Configuration