Skip to content

rrr: safety + perf milestone — Marshal Vec rewrite, ClientConnection @safe class, HashMap/Weak library fixes#67

Merged
shuaimu merged 193 commits into
mako-devfrom
worktree-srpc
May 23, 2026
Merged

rrr: safety + perf milestone — Marshal Vec rewrite, ClientConnection @safe class, HashMap/Weak library fixes#67
shuaimu merged 193 commits into
mako-devfrom
worktree-srpc

Conversation

@shuaimu
Copy link
Copy Markdown
Contributor

@shuaimu shuaimu commented May 22, 2026

Summary

A milestone of work on the worktree-srpc branch covering:

  • Marshal rewrite: chunk-linked-list → rusty::Vec<uint8_t> + read-cursor (incidentally fixed a 5-year-old chunk leak)
  • ClientConnection class-level @safe with carefully-scoped per-method @unsafe overrides
  • rusty-cpp library improvements: HashMap::get returns Option<V&> (not Option<V*>), Vec::extend_from_slice memcpy fast path + geometric reserve, full Weak @safe annotations, Vec::value_type typedef
  • C++23 named-module + clang-21 TLS dup-symbol fix (taken from mako-dev's static inline thread_local form)
  • const-ification cascade through ConnectionStateMachine + ClientConnection lifecycle methods, eliminating 2 const_cast sites

Headline numbers

scripts/rrr_safety_loc.py ratio:

  • Start of this thread: 65.3% rrr-wide @safe
  • After this PR: 70.5% (+5.2pp)

client.cpp specifically:

  • @safe LOC: 1292 → 1651 (+359)
  • @unsafe LOC: 1049 → 663 (-386, -37%)

Marshal microbench (bench_marshal):

  • New Marshal beats chunk-list on every scenario, +15-81% depending on workload
  • ~30-40% faster on string/varint-heavy paths (the structural win)
  • The 81% headline on the 10×1KB drain pattern was investigated and attributed largely to a chunk-list leak; the leak-fixed chunk-list is tied with the new Marshal on bulk-byte paths

End-to-end RPC (rpcbench loopback, 4 threads × 100 outstanding, mode=fast, 8s):

  • bsize=100: +15% qps vs leak-fixed chunk-list, +11% vs original
  • bsize=1024: tied with leak-fixed chunk-list (Marshal is a smaller fraction of total RPC time at larger payloads)

What's in the PR

rrr/misc/marshal.cpp — Vec rewrite

  • File shrinks 1408 → 932 LOC (-34%)
  • @unsafe LOC: 382 → 10
  • All 50+ operator<<>> overloads preserved; bookmark API preserved (struct internals simplified from (size, char**) to (offset, size))
  • init_block_read reduced to a Vec::reserve hint; read_chnk/read_reuse_chnk removed (no external callers)
  • The chunk-list memory leak in Marshal::read() (a commented-out delete chnk; dating to 2020) is fixed as a side effect

rrr/rpc/client.cpp — class-level @safe + cleanups

  • class ClientConnection flipped from class-level @unsafe to @safe (was the single biggest @unsafe-LOC source in rrr; +296 LOC moved to @safe in that one flip alone)
  • 16 delegator methods flipped @safe (request convenience overloads, config setters/getters, pause/resume, etc.)
  • Six post-HashMap-API methods flipped @safe (fail_pending_future, handle_free, mark_closing, allow_request_with_circuit_metrics, record_circuit_result)
  • Six Weak<> copy-block @unsafe wraps dropped after library annotations landed
  • Const-ification: invalidate_pending_futures, mark_closing, close, handle_error are now const; reconnecting_ / reconnect_abort_ atomics marked mutable; 2 const_casts dropped

rrr/rpc/server.cpp — Weak cleanups

  • Same Weak-copy @unsafe blocks dropped (2 sites)

third-party/rusty-cpp — library improvements

  • Vec::extend_from_slice / Vec::write: memcpy fast path for trivially-copyable T (~10× faster on byte writes)
  • Vec::reserve: geometric growth (max(new, 2×capacity)) — required for amortized O(1) writes
  • HashMap::get / get_mut: return Option<V&> instead of Option<V*> (matches Rust + matches Vec::get)
  • sync::Weak<T> / rc::Weak<T>: every method individually @safe with atomic/raw-ptr work in inline @unsafe { ... } blocks

rrr/tests/bench_marshal — new Marshal microbench

  • 9 hot-path scenarios; standalone executable (no gtest); useful for any future Marshal perf work
  • Companion doc docs/dev/marshal_perf_baseline.md with baseline numbers, calibrated A/B vs leak-fixed chunk-list, end-to-end RPC numbers

Merge resolution

  • TLS class-static fix (reactor.cpp): took mako-dev's simpler static inline thread_local over srpc's function-local-static accessor approach
  • src/deptran/communicator.cc: updated to Reactor::clients_ member access
  • Submodule pointer: merged upstream rusty-cpp/main (vec value_type + noexcept dtor) into local @safe branch; pushed to shuaimu/rusty-cpp main

Test plan

  • cmake --build build_clang21 --target rrr -j32 — clean
  • borrow_check_rrr — 45/45 clean
  • ./build_clang21/test_marshal — 23/23
  • ./build_clang21/test_rpc_marshal_archive — 68/68
  • ./build_clang21/test_rpc_marshallable_proxy — 12/12
  • ./build_clang21/test_rpc_any_message — 8/8
  • ./build_clang21/test_idempotency — 32/32
  • ./build_clang21/test_rpc — 17/17 (incl. multi-threaded scenarios)
  • ./build_clang21/bench_marshal (microbench numbers match expectations)
  • CI Docker test suites (shardNoReplication, shard1Replication, etc.) — defer to CI

Notes for reviewers

  • The PR is structured as a long sequence of small commits — each compiles, each passes borrow check, each has a verified test pass. Reviewing commit-by-commit will be much easier than reviewing the diff as a whole.
  • The rusty-cpp submodule bump is non-trivial; the library API change (HashMap Option<V&>) is source-breaking for any caller binding to V*. The rrr-side rewire is included in this PR.
  • The Marshal leak fix is bundled with the rewrite, but documented in the post-mortem section of docs/dev/marshal_perf_baseline.md as a separable concern. The "rewrite + leak fix" combo is faster than "rewrite alone" against the original (leaky) chunk-list; against the leak-fixed chunk-list the wins are smaller but still real (~15% on small-payload RPC).

🤖 Generated with Claude Code

shuaimu and others added 30 commits May 16, 2026 21:43
Three pieces:

1. **rusty-cpp submodule bump** to commit 519b5861 ("Enable borrow-checking
   C++23 module TUs in libc++ / module-graph builds"). Three fixes upstream:
     - `build.rs` now honors `LIBCLANG_PATH` first (was hardcoded to
       `/usr/lib/llvm-14/lib`, which lost to a system clang too old to
       export `clang_CXXMethod_isDeleted`).
     - `src/main.rs` passes `-resource-dir=` (not `-I<dir>/include`) when
       compile_commands.json already owns the STL paths, so libclang's
       `__has_include_next` dance with libc++ wrappers works. Also
       passes through `-D` / `-U` / `-m` / `-f` / `-O` / `-g` flags
       from compile_commands.json so `-march=native` reaches libclang
       (without it, `__builtin_ia32_*` intrinsics fail to declare).
     - `src/parser/mod.rs` adds `-ferror-limit=0` so large module TUs
       don't abort on libc++ vs `import std;` duplicate-decl spam.

2. **`RRR_BORROW_SRC` updated** to include the modular .cpp's that were
   previously omitted with "module unit" comments: `base/basetypes`,
   `base/debugging`, `base/logging`, `base/misc`, `base/strop`,
   `base/threading`, `base/unittest`, `misc/alock`, `misc/any_message`,
   `misc/marshal`, `misc/rand`, `rpc/utils`, `rpc/client`. The
   `reactor/*.cc` lineup is replaced with `reactor/{epoll_wrapper.cc,
   fiber.cpp,future.cpp,reactor.cpp}` to reflect the post-modularization
   layout. `rpc/server.cpp` stays omitted — libclang 22 crashes parsing
   it (separate clang-22 issue, distinct from the codegen-crash
   workarounds in srpc_module_migration_plan.md).

3. **Build-order dependency**: each per-file `borrow_check_rrr_borrow_<F>`
   target now `add_dependencies(... rrr)`. Without this, ninja
   parallelizes borrow checks with `rrr`'s compilation and the checks
   start before `rrr.dir/.../*.o.modmap` and the matching `.pcm` BMIs
   exist — libclang then can't resolve `import rrr.X;` and analysis
   runs without module context. Adding the dep at the
   `borrow_check_all_rrr_borrow` aggregate isn't enough; ninja groups
   skip the implied dep for siblings.

End-to-end result on clang 22.1.5, `-j16`: `ninja borrow_check_rrr`
takes ~1m53s, analyzes 17 files, 15 are clean, 2 produce real
findings (`reactor.cpp`: 1; `client.cpp`: 23 — all `rusty::*` container
methods called from `@safe` code without `@unsafe` blocks). Zero
modmap-not-found warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Submodule bump (rusty-cpp d53d9c7) flips `option.hpp` and `result.hpp`
namespace default from `@unsafe` to `@safe`, matching the other rusty
containers. That alone closes ~22 of the 24 findings — they were noise
from those container methods (`Option::unwrap`, `Result::unwrap`,
`Option::is_some`, `RefCell::borrow`, ...) being marked unsafe by
default.

Project-side touches for the 2 remaining sites:
  - `src/rrr/base/debugging.cpp`: `// @safe` annotation on `rrr::verify`
    (pure precondition check; aborts on failure, parity with Rust's
    `assert!`). Note: this does not yet take effect on cross-module
    callers — rusty-cpp's annotation parser doesn't follow `import`
    statements (only `#include`), so reactor.cpp / client.cpp / etc.
    that import `rrr.debugging` still see `verify` as unannotated.
    File the upstream issue and leave the annotation in for the day
    that lands.
  - `src/rrr/reactor/reactor.cpp:2012`: drop the obsolete
    `// @unsafe { verify(wakeup_time > 0); }` wrapper. The text parser
    mis-classified the wrapped call as a function declaration (the
    `> 0` inside the `verify(...)` triggers
    `is_function_declaration`'s template-syntax heuristic). With
    verify itself now annotated `@safe`, the wrapper is no longer
    needed anyway.
  - `src/rrr/rpc/client.cpp:2893`: rewrite `(*fu_ptr.unwrap()).clone()`
    as `fu_ptr.unwrap()->clone()` plus a multi-line `@unsafe { ... }`
    block around the assignment. `HashMap::get` returns `Option<V*>`
    so `unwrap()` yields a raw pointer; the deref is genuinely needed
    here. rusty-cpp's pointer-safety analysis doesn't honor inline
    `@unsafe { }` blocks (separate upstream gap), so this finding
    persists in the report even though the source-level intent is
    annotated.

End-to-end on clang 22.1.5, `-j16`, `ninja borrow_check_rrr`:
  Before:  24 findings (reactor.cpp 1, client.cpp 23)
  After:   2 findings  (reactor.cpp 1, client.cpp 1)
Both remaining findings are documented upstream-rusty-cpp limitations,
not real safety regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up two upstream rusty-cpp parser fixes (committed locally on
rusty-cpp `main`, ahead of origin by 5 commits):

  - 278737e header_cache: follow C++23 `import X.Y.Z;` for annotation
    discovery. Resolves the first of the two remaining false-positive
    findings from the earlier pass: `Calling non-safe function
    'rrr::verify' at line 1361 ...` in reactor.cpp. With this change,
    the parser walks `import rrr.debugging;` to
    `src/rrr/base/debugging.cpp` and picks up the `// @safe`
    annotation on `verify`.

  - 4b82823 ast_visitor: recognize multi-line `// @unsafe` comment
    blocks. Resolves the second false-positive: `Unsafe pointer
    dereference in function call at line 2896 ...` in client.cpp.
    The `check_for_unsafe_annotation` heuristic now walks UP through
    the contiguous comment block above a `{`, instead of inspecting
    only the immediately-preceding line. The multi-line `// @unsafe`
    annotation on the HashMap-get raw-pointer deref is now honored.

End-to-end on clang 22.1.5, `-j16`, `ninja borrow_check_rrr`:
  Before this bump:  2 spurious findings + ~11 legitimate ones.
  After this bump:   0 spurious findings + 11 legitimate ones.

The 11 legitimate findings are calls to `rrr::SpinLock::lock` /
`unlock` / `rrr::PollThread::add` from `@safe` rrr code without
`@unsafe { }` wrappers. Those methods are explicitly `@unsafe` in
their own module declarations; they were exposed (correctly) by the
import-chasing fix. Fixing them is a separate code-review pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two source-level changes close all 11 outstanding findings from
`ninja borrow_check_rrr`:

  - `src/rrr/base/threading.cpp` — flip `SpinLock::lock` and
    `SpinLock::unlock` from `@unsafe` to `@safe`. Parity with Rust's
    `Mutex::lock` / drop semantics: atomic compare/exchange and store
    are memory-safe; the only genuinely unsafe call inside `lock()`
    is `nanosleep(&t, nullptr)` (passes the address of a stack-local
    `timespec` to a libc syscall), and that's now wrapped in an
    `@unsafe { ... }` block. The prior `@unsafe` annotation on
    `unlock()` was over-conservative — `std::atomic::store` is
    memory-safe outright. Removes 10 of the 11 findings (every
    `ClientPool::*` call to `SpinLock::lock` / `unlock`).

  - `src/rrr/rpc/client.cpp` — restructure `Client::close`'s
    poll-thread-add block. The `// @unsafe - schedules channel proxy
    close on poll thread` annotation was previously a statement-level
    comment; rusty-cpp only honors block-level `// @unsafe` (the
    comment must immediately precede a `{`). Wrap the body in an
    explicit `{ ... }` so the annotation now binds. Removes the last
    finding.

End-to-end on clang 22.1.5, `-j16`, `ninja borrow_check_rrr`:
  Before:  11 findings (1 PollThread::add + 10 SpinLock::lock/unlock).
  After:   0 findings. 17/17 files clean, build succeeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Was: 17 hand-picked module units (the "correctness-critical" subset).
Now: every non-test C++ source file in `src/rrr/{base,misc,reactor,rpc}/`
that the borrow checker can actually parse, with two exclusions called
out inline with the reason:

  - `rpc/server.cpp` — clang-22 libclang crash on parse. Pre-existing.
  - `rpc/fiber_channel.cpp` — same clang-22 libclang crash, discovered
    during this expansion. Likely the same bug class as server.cpp
    (a `create_sp_event<>`-shaped pattern in module purview).
  - `reactor/fiber_context_{aarch64,x86_64}.cc` — implicitly excluded
    (asm-only context-switch trampolines, ~30 lines each, not in any
    cmake source list).

End-to-end on clang 22.1.5, `-j16`, `ninja borrow_check_rrr`:
  43 files analyzed, 43 clean, 0 violations. Total time ~57s.

Caveat on the "0 violations" result: of the 27 files added in this
commit, 22 currently have NO `@safe` annotations and pass the check
vacuously (no `@safe→@unsafe` call sites to flag). The remaining 5
do have meaningful `@safe` coverage:

  - `rpc/idempotency.cpp`        — 37 `@safe`, 10 `@unsafe`
  - `rpc/completion_tracker.cpp` — 32 `@safe`,  5 `@unsafe`
  - `rpc/request_queue.cpp`      —  4 `@safe`, 39 `@unsafe`
  - `misc/serializable.cpp`      —  2 `@safe`,  6 `@unsafe`
  - `rpc/tcp_channel.cpp`        —  0 `@safe`, 22 `@unsafe`

Future annotation work on the 22 vacuous-pass files will start
producing meaningful checks; until then this expansion's value is
ensuring the build infrastructure covers every borrow-checkable file
in rrr (regression-proof against new findings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h recovery)

Both files previously hard-failed libclang parse with "Failed to parse
file: Crash" (clang-22 libclang bug; clang++ CLI compiles them fine).
With the rusty-cpp bump in this commit (c12ef31), the parser now
detects the Crash and retries with module args dropped, yielding a
partial AST that's still usable for borrow analysis.

Three pieces:

  - **submodule bump (third-party/rusty-cpp → c12ef31)**: trust the
    compile_commands compiler for `-resource-dir=...` so resource-dir
    matches the toolchain that built the PCMs, and add a Crash-recovery
    fallback that retries with std-only module inputs.

  - **CMakeLists.txt**: drop the `# omitted — clang-22 libclang crash`
    exclusions for `rpc/server.cpp` and `rpc/fiber_channel.cpp`. Both
    now go through the recovery path. Inline comment notes that
    findings on these two are partial-info (cross-module callee names
    not resolved) until upstream libclang is fixed.

  - **src/rrr/rpc/server.cpp**: reconcile four decl-vs-def annotation
    mismatches. The class-body declarations of `run_async` (both
    ServerConnection and DeferredReply overloads), `wait_for_shutdown`,
    and `add_shutdown_hook` said `@safe`; the out-of-line definitions
    said `@unsafe`. The bodies do unsafe things (invoke caller-supplied
    `rusty::Function<void()>` callbacks; push to a mutex-guarded Vec;
    block on a condvar with arbitrary predicate). Flip declarations
    to `@unsafe` to match the definitions and the actual behaviour.
    These were the 4 partial-info findings that the recovery surfaced.

End-to-end on clang 22.1.5, `-j16`, `ninja borrow_check_rrr`:
  Before: 43 files analyzed, 2 hard-failed, 0 findings (with the 2
          excluded).
  After:  45 files analyzed, 0 hard failures, 0 findings.
That's the complete borrow-checkable surface of rrr (45 of 47 .cpp/.cc
files in src/rrr/{base,misc,reactor,rpc}/; the 2 holdouts are
asm-only context-switch trampolines).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures everything we learned during the bisection: the trigger
(first-time user-type template instantiation in module purview),
the parse-state-sensitivity (file truncations at 699 succeed but 700
crashes; 710 succeeds again), the matched-toolchain test that
isolated this as a libclang-22 regression vs libclang-19, and the
reason we don't downgrade (clang-19's own multi-attachment bug
blocks rrr's server.cpp from compiling).

Includes a filing checklist for an upstream LLVM bug report — not
yet filed (per Shuai's preference to hold). Cross-references the
workaround in third-party/rusty-cpp's parser and the broader migration
plan doc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…annotations

Switch the rrr borrow-check pipeline (and reference build) from Homebrew
clang 22.1.5 to Homebrew clang 21.1.8. clang 21 is the sweet spot: it
fixed the clang-19 multi-attachment trap (the original reason we left
clang 19), and it does not exhibit clang 22's libclang parse-time crash
on first-time user-type template instantiations in module purview that
forced rusty-cpp into degraded crash-recovery on server.cpp and
fiber_channel.cpp.

clang 21 is also a stricter compile, which surfaced one latent bug:

* rpc/load_balancer.cpp: drop the GMF forward-decl
  `namespace rrr { class Client; }`. The class is fully declared in
  client.cpp's `rrr.client` module purview, and clang 21 (correctly)
  rejects a global-module forward-decl that re-introduces the same
  name. The three "private member" errors clang 21 reported in client.cpp
  were symptoms of the duplicate-Client declaration, not real
  visibility violations — they vanish with this delete.

Bump third-party/rusty-cpp to df79169 so the analyzer can run cleanly
on a clang-21 toolchain:

* skip-cross-file-bodies: the analyzer now restricts its scan to the
  TU under check rather than the whole import graph. Previously this
  produced 181 cross-module-noise findings on the clang-21 pipeline
  because libclang's AST exposes function bodies from imported modules.
* qualify module annotations by namespace: annotations harvested from
  `export module rrr.foo;` units are now stored under their derived
  namespace (`rrr::Log_warn`, not bare `Log_warn`), matching the
  qualified lookup the analyzer uses.

Project-side annotation reconciliations needed for full-info borrow
check (0 findings on 45/45 files):

* base/logging.cpp: annotate `Log_debug/Log_info/Log_warn/Log_error`
  template shims `@safe`. They forward printf-style format strings
  (literals at every call site we control) plus variadic args; no
  memory operations escape.
* misc/serializable.cpp: `registry()` returns a reference into a
  process-wide singleton (`'static`-lifetime). Marked `@unsafe`
  because rusty-cpp doesn't yet express `&'static` lifetimes; safer
  than a half-modelled `@lifetime` annotation it can't enforce.
* rpc/client.cpp: `enqueue_heartbeat_probe` decl and its single call
  site marked `@unsafe`. The body uses Marshal `operator<<` chains
  that rusty-cpp treats as unsafe by default; wrapping the call site
  is cleaner than restructuring the Marshal API.
* rpc/server.cpp: declaration of `Server::close` aligned with the
  body's `@unsafe` annotation. Two stale `// @unsafe - Log_warn`
  inline comments removed — they confused the text parser into
  attaching `@unsafe` to the surrounding scope.

Verification on clang-21:
  borrow_check_rrr: 45/45 files clean, 0 findings, 1m54s.
…ad_local`

Required for clang 21 production link. clang 21 emits module-attached
class-static `thread_local` storage as a strong external in every TU
that uses the variable (typically via an inline accessor like
`is_on_poll_thread()`), causing duplicate-definition linker errors
when more than one consumer links into the same executable:

    multiple definition of `TLS init function for
      rrr::Reactor@rrr.reactor::sp_reactor_th_'

clang 22 happened to keep these in vague linkage; clang 19 didn't
emit them in consumer TUs at all. `inline` makes the linkage explicit
and toolchain-independent: the symbol is vague-linkage in every TU
that sees the declaration, the linker dedupes, and there's no need
for an out-of-class definition.

Variables converted:
  - `Reactor::sp_reactor_th_`
  - `Reactor::sp_disk_reactor_th_`
  - `Reactor::sp_running_fiber_th_`
  - `PollThreadWorker::current_worker_`

The non-thread_local `clients_` / `dangling_ips_` HashMaps and the
plain SpinLock `trying_job_` are left as out-of-class definitions —
they don't have inline accessors and don't trip the regression.

Verified on Homebrew clang 21.1.8:
  - clean `rpcbench` link (1m4s wallclock).
  - `borrow_check_rrr`: 45/45 files, no violations.
- CMakeLists.txt: prefer Homebrew/Linuxbrew clang-21 libclang for the
  rusty-cpp checker before falling back to system /usr/lib/llvm-*.
  We intentionally do not auto-detect clang-22 because of the
  documented libclang parse-crash regression (see
  docs/dev/libclang22_parse_crash.md).

- docs/dev/libclang22_parse_crash.md: production switched to clang
  21.1.8; refresh the three-way comparison table; document the source
  fixups (load_balancer.cpp duplicate forward-decl, reactor.cpp
  `static inline thread_local`) required for the switch.

- docs/dev/srpc_module_migration_plan.md: replace the clang-22 cmake
  invocation with the clang-21 one; cross-reference the libclang
  doc; add a "toolchain note" preamble to the Metrics table noting
  the rows were measured under clang 22 but production now builds
  on clang 21 (numbers within noise).

Verified on Homebrew clang 21.1.8:
  - clean rpcbench link
  - borrow_check_rrr: 45/45 files, 0 violations
Mechanical conversions where rusty-cpp's @unsafe reason was
encapsulable in an inner block, leaving the public surface @safe.
Borrow check stays clean (45/45 files, 0 violations) on Homebrew
clang 21.1.8.

Conversions by category:

  Aggregate-initialized POD factories (7):
    BufferingConfig::defaults / disabled / to_queue_config
    RequestQueueConfig::defaults / small / large / disabled

    These were marked @unsafe for "returns struct by value" with an
    inner @unsafe { struct construction } block. Both are redundant —
    aggregate-init of a POD is among the safest operations in C++.

  std::chrono helpers (3):
    QueuedRequest::is_expired
    QueuedRequest::age_ms
    rrr::current_time_ms

    std::chrono::steady_clock::now + duration_cast are read-only and
    memory-safe; the wrap-and-flip pattern hides the unannotated std
    calls behind a public @safe surface.

  Atomic counter accessors on Server (3):
    Server::pending_request_count
    Server::increment_pending
    Server::decrement_pending

    Each is one atomic load / fetch_add / fetch_sub; wrap that call
    in an inner @unsafe block and mark the accessor @safe. Matches
    the pattern already used for set_drop_heartbeat_replies /
    drop_heartbeat_replies in the same class.

  SpinCondVar signal/bcast (2):
    SpinCondVar::signal / SpinCondVar::bcast

    Each is one atomic store. wait / timed_wait stay @unsafe — they
    additionally call SpinLock and Time::sleep, which is Tier 3
    work (apply Pattern B1 across the remaining SpinLock users).

Bumps third-party/rusty-cpp to 755344c which annotates
rusty::sync::Weak::upgrade / rusty::rc::Weak::upgrade as @safe. Two
DeferredReply methods in server.cpp (reply, reply_error) drop their
inline `// @unsafe - weak pointer upgrade` wrappers as a result.

Not in this commit (defer):
  - Future::timed_wait — currently labeled "Uses std::chrono" but
    actually uses rusty::Mutex + Condvar, which is the larger blocker.
  - Server::drain — has chrono + usleep + multiple atomic loads in
    one function; a larger refactor than mechanical wrap-and-flip.
  - SpinCondVar::wait / timed_wait — wait until Tier 3 lands the
    remaining SpinLock pattern-B1 conversions.
…e callers

The rusty-cpp library already annotates rusty::VecDeque (namespace-
level `// @safe`), rusty::Function (per-method `// @safe`), and
SpinMutex / SpinMutexGuard / Cell as @safe. The remaining @unsafe
labels on rrr callers were leftover from before those library
annotations landed. Flipping them now removes ~18 false-unsafe
function signatures and drops 13 stale inline `@unsafe { }` markers
that no longer have a non-safe operation to wrap.

request_queue.cpp (9 functions, 8 stale inline markers):
  - RequestQueue ctor (defaults() factory is @safe per tier 1)
  - enqueue / dequeue / expire_stale
  - size / empty / full / remaining_capacity
  - clear_all

  Body composition for each: SpinMutex::lock() (@safe) + VecDeque
  method (@safe via namespace) + optional rusty::Function call
  (@safe). try/catch around callback invocation is ignored by
  rusty-cpp's analyzer (per its CLAUDE.md), so the catch arm doesn't
  need wrapping either.

client.cpp (6 functions, 4 stale inline markers):
  - ClientConnection::pending_request_count (RequestQueue::size)
  - ClientConnection::clear_pending_requests (RequestQueue::clear_all)
  - ClientConnection::set_on_server_restart (Function move-assign)
  - ClientConnection::check_server_instance (Cell::get/set + Function::op())
  - Client::pending_request_count (forwards to ClientConnection)
  - Client::clear_pending_requests (forwards to ClientConnection)
  - ClientConnection::replay_pending_requests (no-op stub returning 0)

reactor.cpp (6 inline marker removals):
  - 6 `// @unsafe - rusty::Function constructor` markers around
    extract_if / retain callback construction in Reactor::check_timeout
    and Reactor::loop. The Function constructor is @safe in the
    library; the braces are kept as scope hygiene for the local
    iterator variables.

Verified clean on Homebrew clang 21.1.8:
  borrow_check_rrr: 45/45 files, 0 violations.
…State>

ClientPool held `SpinLock l_` + two unprotected fields (`cache_` and
`lb_state_`), with seven public methods doing manual `l_.lock()` /
`l_.unlock()` around BTreeMap operations. The pattern was the last
raw-SpinLock holder in rrr.

Bundle the protected state into a `SpinMutex<PoolState>`:

    struct PoolState {
        rusty::BTreeMap<std::string, rusty::Vec<rusty::Arc<Client>>> cache;
        rusty::BTreeMap<std::string, LoadBalancerState> lb_state;
    };
    mutable SpinMutex<PoolState> state_;

Bundling matches the access pattern: `get_client` already touches
both maps under the same lock, so they want one mutex, not two.

Every method that used `l_.lock()/.unlock()` now uses RAII via
`auto guard = state_.lock().unwrap();`, accessing fields as
`guard->cache` / `guard->lb_state`. The destructor also takes the
guard before iterating (it was previously unlocked-by-convention as
"the last reference"; locking is a no-op in that case but lets the
analyzer see the guard).

Methods flipped @unsafe@safe:
  - get_healthy_client_count
  - remove_unhealthy_clients
  - close_idle_clients
  - remove_all_unhealthy
  - close_all_idle
  - total_client_count
  - address_count

Still @unsafe (for non-lock reasons):
  - get_client — drives Client::connect / try_reconnect_if_needed
    synchronously, which is network I/O.
  - reconnect_all (both overloads) — async batch driver uses
    nanosleep + std::atomic loop. The state_ snapshot at the top is
    @safe; the body is not.

`ClientPool::is_client_healthy` (used inside the lock) is one of the
remaining function-level @unsafe sites worth flipping next — it just
reads Cell + ConnectionMetrics, no actual unsafe op.

Verified on Homebrew clang 21.1.8:
  - borrow_check_rrr: 45/45 files, 0 violations.
  - clean rrr rebuild + rpcbench link.
… Client

Two large classes carried the tag "marked unsafe to suppress rusty-cpp
false positives (rusty-cpp is under development)". Those annotations
predated several analyzer improvements that landed in the last few
weeks: import-chasing for cross-module annotations, multi-line
`// @unsafe { }` block parsing, cross-file body filter, and
namespace-qualified annotation lookup (see commits 28e64c5, 0c16ebf,
df2388f).

With the matured analyzer, the suppressions are no longer needed.
Drop them and flip both classes to `// @safe` at the class level. The
methods inside that genuinely cross into unsafe territory (network
I/O, std::chrono, rusty::Mutex + Condvar combinations) already carry
their own method-level `// @unsafe` overrides; the rest of the class
is now analyzed as @safe by inheritance.

This is a class-level boundary change, not a method-level flip — no
individual method's annotation changed. The net effect: previously-
unannotated methods inside Future and Client now default to @safe via
class inheritance instead of being @unsafe-by-namespace. Callers of
those methods can now invoke them from @safe code without `// @unsafe { }`
wrappers.

Verified on Homebrew clang 21.1.8:
  - borrow_check_rrr: 45/45 files, 0 violations (no new findings from
    the @safe flip — the suppressions were genuinely no-ops).
  - clean rrr rebuild.
…ers in client.cpp

These were defensive `// @unsafe { ... }` scope wrappers around method
bodies whose underlying operations are all @safe in the matured rusty-cpp
library + project annotations:

  - rusty::Arc::make (@safe in arc.hpp)
  - rusty::Mutex::lock / MutexGuard::operator*/-> (@safe in mutex.hpp)
  - rusty::RefCell::borrow / borrow_mut (@safe via namespace in refcell.hpp)
  - rusty::Option::is_some/is_none/as_ref/unwrap (@safe via namespace)
  - rusty::Cell::get/set (@safe via namespace)
  - rusty::Arc::clone (@safe in arc.hpp)
  - ClientConnection::host/connected/connection_state/server_instance_id
    /pending_request_count/clear_pending_requests/set_keepalive
    (all @safe at decl after tier 1+2)

The wrappers were no-ops once those library/project methods landed
their @safe annotations; the analyzer was already letting these through
because the function-level @safe overrides the inner block scope, but
the dead `// @unsafe { ... }` markers were noise to anyone reading the
code.

Methods cleaned:
  - Future::create, Future::ready, Future::timed_out,
    Future::add_completion_callback, Future::get_reply
  - Client::create, Client::host, Client::connected,
    Client::connection_state, Client::connection,
    Client::server_instance_id, Client::is_reconnecting,
    Client::set_keepalive, Client::set_reconnect_policy
  - Client::request (3 overloads), Client::request_with_options (2 overloads)

Still @unsafe (kept):
  - Future::get_error_code's `{ timed_wait(x); }` block — timed_wait
    is genuinely @unsafe (Mutex+Condvar+chrono).

Bare-marker count in rrr: 115 → 96.

Verified on Homebrew clang 21.1.8:
  borrow_check_rrr: 45/45 files, 0 violations.
…pers

server.cpp (4):
  - Server::reg_service (Box<Service> overload) — rusty::Vec::push +
    Box move + ServiceProxy dispatch are @safe at the boundary.
  - Server::reg_service (Box<T> ServiceLike overload) — same.
  - Server::reg_rpc — rusty::HashMap::contains_key + insert are @safe
    via the rusty namespace.
  - Server::service_count — Option + Vec::size are @safe.

reactor.cpp (1):
  - Reactor::make_arc — rusty::Arc::make is @safe; the wrap was
    misdocumented as a "localized unsafe allocation boundary".

Kept @unsafe:
  - Server::for_each_service — body has `static_cast<Service*>(...)`
    on a raw pointer from ServiceProxy's __get_service__; that's a
    genuine raw-pointer cast.

Bare-marker count in rrr: 96 → 92.

Verified on Homebrew clang 21.1.8:
  borrow_check_rrr: 45/45 files, 0 violations.
Baseline (computed with /tmp/safety_loc.py):
  total in-fn LOC: 22,438
  @safe (function or class-inherited):     1,435  ( 6.4%)
  @unsafe (function or class-inherited):   2,526  (11.3%)
  inner `@unsafe { ... }` blocks:            733  ( 3.3%)
  unannotated (namespace default = @unsafe): 17,744  (79.1%)

The unannotated bucket is the conversion target. None of it violates
the borrow check today; it just hasn't been actively tagged.

Plan structure:
  Phase 0: fix LOC script's out-of-class inheritance bug (honest baseline)
  Phase 1: labeling sweep — class-level + namespace-level @safe
           sweeps on files that are already safe in fact (target 35–40%)
  Phase 2: easy raw-pointer refactors (target 50%)
           - ChannelConnectionProxy / FactoryProxy → rusty::Box
           - PollThreadWorker* TLS → rusty::Weak<PollThreadWorker>
           - rusty::sys::* syscall wrappers
           - ServiceProxy::__get_service__ → rusty::Arc<Service>
  Phase 3: harder refactors (target 65–70%)
           - alock.cpp ALock* → rusty::Weak<ALock>
           - serializable.cpp std::shared_ptr<Marshallable> → rusty::Arc
             (touches generated rcc_rpc.h codegen)
           - Reactor::loop @unsafe block scoping
           - Pthread_* → rusty::sync::* wrappers
  Phase 4: stretch (target 80%)
           - Marshal byte ops (refactor / external annot / quarantine)
           - fiber context quarantine (~150 LOC explicit @unsafe)
           - rcc_rpc.h codegen rewrite

Honest ceiling without rusty-cpp expressiveness changes or perf-costly
Marshal refactors is ~70%. The plan treats 70% as the success criterion
and 80% as a stretch.

Each Phase has a checklist; the self-pacing loop ticks one item per
iteration, runs borrow_check_rrr + rrr build, commits, and updates this
doc's Progress log.
Phase 0 of the rrr_safety_80pct_plan: an honest, reproducible baseline.

Previously the script lived in /tmp and had a bug where it could match
function calls (`Class::method(args)`) inside other function bodies as
out-of-class method definitions, inflating the @safe / @unsafe in-fn
counts.

The fix:
- Anchor OUT_OF_CLASS_METHOD regex to lines that don't start with
  control-flow keywords (`if/while/for/switch/return/else/do/try/catch`).
- Only consider an out-of-class match at brace depth==0 (file scope) —
  function definitions open at file scope; function calls don't.
- Handle multi-line signatures: stash a pending out-of-class name when
  the `ClassName::method(` line lacks `{`, attach when the `{` arrives.

Honest baseline after the fix (was a moving target):
  in-fn LOC: 22,438
  @safe:          1,408 (6.3%)
  @unsafe:        2,226 (9.9%)
  inner @unsafe:    733 (3.3%)
  unannotated:   18,071 (80.5%)

The earlier estimate that the fix would land "+4-6pp" was wrong — the
fix tightens classification rather than growing @safe. Real ratio
growth comes from the Phase 1 labeling sweep.
Two intertwined changes, both surface @safe LOC that was previously
hidden:

1. **`Server` class** (`rpc/server.cpp`) flipped from `// @unsafe` to
   `// @safe` at the class level, mirroring the Tier-4 flip on
   `Client`. Every method that genuinely crosses into unsafe territory
   (socket I/O via TcpListener, Pthread / std::atomic primitives, raw
   pointer extraction from ChannelListenerProxy, `usleep` in drain,
   etc.) already carries a method-level `// @unsafe` override and is
   unaffected.

2. **LOC-script bug fix**: `scripts/rrr_safety_loc.py` previously
   used `";" in line` to detect forward-decls and drop pending
   annotations. Multi-line `// @safe -` annotation comments often
   contain a `;` in prose ("// overrides; the rest of the class…"),
   which fired the drop spuriously and stripped the pending
   annotation BEFORE the class declaration was reached. Net effect:
   `Future` and `Client` class flips from Tier 4 were silently
   uncredited in every prior LOC report. Fix: check
   `";" in stripped` (after strip_line_comment) instead.

borrow_check_rrr: 45/45 clean.
LOC ratio: 6.3% → 7.2% @safe (1,408 → 1,623 LOC inside fn bodies).
Note: this delta combines Server's flip and the script fix's
retroactive credit for Future/Client; the pure Server contribution is
about half of it.
The amend in 54a2d98 changed the SHA from the originally-recorded
d5028a19. Update the Progress log entry to match.
Add `// @safe` at the `class Reactor` declaration. The Reactor design
already documents itself as memory-safe (single-threaded, RefCell /
Cell / Rc / HashMap with rusty borrow rules); the comment was
descriptive but the analyzer wasn't credited.

Methods that genuinely cross into unsafe territory keep their
existing method-level `// @unsafe` overrides:
  - `get_reactor` / `get_disk_reactor` (thread-local Rc access)
  - `register_stackless_poller` / `enqueue_stackless_task` /
    `process_stackless_tasks` / `spawn_stackless_task` (RefCell-mut
    via mutable members)
  - `check_timeout` / `loop` (rusty-cpp doesn't fully model RefCell)
  - `continue_fiber` / `recycle` (fiber-context interactions)

Most Reactor:: out-of-class methods already had per-method @safe or
@unsafe annotations, so the class-level flip mainly credits the
inline methods (getters, setters, factory helpers, register_fiber,
restore_running_fiber, etc.) that were silently unannotated.

borrow_check_rrr: 45/45 clean.
LOC ratio: 7.2% → 7.4% @safe (1,623 → 1,652 LOC).

Modest gain, as expected — the bigger reactor wins live in Phase 2
(PollThreadWorker* raw pointer → rusty::Weak) and Phase 3 (Reactor::loop
@unsafe-block tightening).
;

LOC script pending_for_class leak fix

Two correctness wins, no ratio movement:

1. Add `// @safe` at the class level for `IdempotencyKeyGenerator`
   and `IdempotencyCache` in `rpc/idempotency.cpp`. Both classes use
   only rusty primitives (Cell, Mutex, HashMap) with explicit @unsafe
   annotations on the few methods that copy Marshal byte payloads.

2. Fix `scripts/rrr_safety_loc.py`: `pending_for_class` was leaking
   across function-body `{` consumption. Only `pending` got reset
   when a function brace was consumed; `pending_for_class` kept the
   stale annotation forever, which then got attached to the next
   class declared anywhere in the file.

   Concrete effect: pre-fix, `IdempotencyKeyGenerator` was wrongly
   credited as `@unsafe` because `CachedResponse::get_response_data`
   set `pending_for_class = "unsafe"` 17 lines earlier and that
   never got cleared. Same pattern was likely contaminating other
   files' class annotations too — the previous LOC report was about
   1pp too optimistic on @safe and 0.9pp too optimistic on @unsafe.

Honest baseline after both fixes:
  in-fn LOC: 22,453
  @safe:          1,398 (6.2%)
  @unsafe:        2,137 (9.5%)
  inner @unsafe:    733 (3.3%)
  unannotated:   18,185 (81.0%)

This iteration's class flips on idempotency.cpp don't move the ratio
because every method in both classes already had explicit per-method
annotations — class-level @safe only helps when there are
unannotated method bodies for it to inherit.

borrow_check_rrr: 45/45 clean.
Add `// @safe` at the `class CompletionTracker` declaration. The class
already uses only rusty primitives (Cell, Mutex<std::list>, Mutex<HashSet>)
with no raw pointers, syscalls, or Marshal byte ops. All previously
unannotated methods inside the class body now inherit @safe.

borrow_check_rrr: 45/45 clean.
+107 @safe LOC; ratio 6.2% → 6.7%.
Single-threaded state machine; every field is rusty::Cell<T> with
trivially-copyable interior mutability. No raw pointers, syscalls,
or Marshal byte ops.

borrow_check_rrr: 45/45 clean.
+118 @safe LOC; ratio 6.7% → 7.2%.
Cell-based heartbeat tracker with rusty::Function timeout callback.
No raw pointers, syscalls, or Marshal byte ops.

borrow_check_rrr: 45/45 clean.
+88 @safe LOC; ratio 7.2% → 7.6%.
shuaimu and others added 28 commits May 23, 2026 00:24
…address

Five method-level `// @unsafe` markers on TcpConnection / TcpListener
adapter accessors were tied to ops that have since become @safe:

  - `TcpConnectionChannelAdapter::is_closed` — forwards to TcpConnection::
    is_closed (Cell::get, @safe).
  - `TcpConnectionChannelAdapter::peer_address` — forwards to TcpConnection::
    peer_address (const std::string accessor).
  - `TcpConnectionPollableAdapter::is_closed` — same as above.
  - `TcpConnectionPollableAdapter::check_pending_write_update` — forwards to
    TcpConnection::check_pending_write_update (Cell::get + Cell::set, @safe).
  - `TcpListener::local_address` — returns a copy of the const
    std::string member; the historical "std::string copy constructor isn't
    borrow-checked" rationale was over-conservative.

Verification: borrow_check_rrr 45/45 clean. Ratio holds at 71.5%
(impact small; mostly comment changes on one-liners).
Both methods only call SpinMutex::lock (@safe since Tier 2.x) and
CallbackWrapper move-assign (@safe via namespace+class annotation in
rrr.callback_wrapper). The historical "(not @safe)" rationale was
out of date.

Verification: borrow_check_rrr 45/45 clean. Ratio holds at 71.5%
(+6 @safe LOC, −6 @unsafe LOC).
…::make

Both static factory templates wrap `rusty::Arc<T>::make(args...)`. The
historical "(non-borrow-checked)" rationale predates the rusty::Arc
namespace-level @safe annotation: Arc::make is itself @safe with an
inner `// @unsafe { new }` block around the heap allocation.

Verification: borrow_check_rrr 45/45 clean. Ratio 71.5% → 71.6%.
  - TcpListener::handle_write (returns PollMode::NO_CHANGE, no side effects)
  - TcpListener::check_pending_write_update (returns false)
  - PollThreadWorker::do_remove_pollable (HashMap::contains_key +
    HashSet::insert, both @safe; the "uses STL operations" rationale
    was wrong — these are rusty types)
  - PollThreadWorker::do_close_pollable (HashSet::remove / HashMap::get
    / HashMap::remove / HashMap::contains_key are @safe; only
    Epoll::Remove and virtual Pollable::close() dispatch escape into
    inner @unsafe blocks).

Verification: borrow_check_rrr 45/45 clean. Ratio 71.6% → 71.8%
(+27 @safe LOC, −25 @unsafe LOC).
…loop

Two inline @unsafe blocks in Reactor's fiber-execution path
were rooted in "Fiber::finished() is not marked @safe" — that's
outdated. Fiber::finished() is annotated @safe (reads a Cell<Status>);
RefCell::borrow + Option::as_ref / unwrap are @safe via namespace
inheritance.

Verification: borrow_check_rrr 45/45 clean. Ratio 71.8% → 71.9%
(+6 @safe LOC, −6 @unsafe LOC).
`Event::test()` only calls @safe ops:
  - `verify()` is @safe (pure precondition check, marked @safe in
    rrr.debugging)
  - `is_ready()` is virtual; the base @safe-annotates the virtual
  - `status_.get()` / `status_.set()` (Cell, @safe)
  - `wp_fiber_.upgrade()` (rusty::sync::Weak::upgrade, @safe)
  - `option_fiber.is_some()` (@safe)
  - `Log_debug(fmt, args...)` is @safe (template shim wrapping Log::debug
    in an inner @unsafe block)

The historical "verify/log helpers not marked @safe" rationale was
outdated.

Verification: borrow_check_rrr 45/45 clean. Ratio holds at 71.9%.
RefCell / Option

Eight inner `// @unsafe { ... }` blocks were wrapping operations that
have been @safe for a while:

  - marshal.cpp: three `// @unsafe { Vec::clear is @safe; wrap
    defensively. }` blocks around `buf_.clear()` / `src.buf_.clear()`
    — defensive wraps that were never needed; rusty::Vec::clear is
    @safe via namespace inheritance.
  - client.cpp: two `// @unsafe { SpinMutex::lock }` wraps around
    `factory_.lock()` / `pending_factory_.lock()` — SpinMutex::lock is
    @safe since Tier 2.x.
  - client.cpp: five `// @unsafe { RefCell::borrow, Option::unwrap are
    not borrow-checked }` blocks around connection_.borrow() +
    Option::is_some / as_ref / unwrap — both RefCell::borrow and
    Option::unwrap are @safe via namespace inheritance.

Verification: borrow_check_rrr 45/45 clean. Ratio 71.9% → 72.1%
(+21 @safe LOC; inner-block LOC dropped by 25).
Nine more inner `// @unsafe { ... }` blocks wrapped operations that
have been @safe for a while:

  - client.cpp x4: `// @unsafe { RefCell::borrow, Option::unwrap }`
    around connection_.borrow() + Option::is_some / as_ref / unwrap.
  - client.cpp x2: `// @unsafe { SpinMutex::lock + Option::take }`
    around fiber_channel_.lock().unwrap() / direct_channel_.lock()
    .unwrap().
  - reactor.cpp x2: `// @unsafe { RefCell::borrow_mut, Option
    operator= are not borrow-checked }` around the
    sp_running_fiber_th_.borrow_mut() = Some/None idiom.

The wrapped operations are all @safe via the rusty namespace +
RefCell / Option / SpinMutex class annotations.

Verification: borrow_check_rrr 45/45 clean. Ratio 72.1% → 72.4%
(+31 @safe LOC; inner-block LOC dropped by 31, mostly net-neutral
moves into the @safe pool).
Two more pairs of inner `// @unsafe { ... }` blocks were wrapping
purely-@safe operations:

  - `// @unsafe { SpinMutex::lock + ChannelFactoryProxy move }` (x2)
    around factory_.lock().unwrap() + Some(std::move(factory)) — both
    SpinMutex::lock and Box move-assign are @safe.
  - `// @unsafe { make_box + SpinMutex mutation }` (x2) around
    fiber_channel_.lock().unwrap() + Some(make_box<FiberChannel>) —
    rusty::make_box is @safe (Arc::make is @safe), SpinMutex::lock is
    @safe.

Verification: borrow_check_rrr 45/45 clean. Ratio 72.4% → 72.5%
(+12 @safe LOC; inner-block LOC dropped by 12).
  - idempotency.cpp: Marshal `operator<<`/`operator>>` overloads for
    `IdempotencyKey` flip @safe — the Marshal class is namespace @safe
    via the Phase 4 Cursor rewrite; the per-method @unsafe blocks
    only remain on the chunk-list-era methods (which the rewrite
    deleted).
  - client.cpp `current_time_ms()`: dropped the `std::chrono::steady_clock`
    + `duration_cast` plumbing in favor of `rusty::sys::time::
    clock_monotonic_us() / 1000`. Function flips @safe end-to-end.
  - request_queue.cpp `QueuedRequest`: storage migrates from
    `std::chrono::steady_clock::time_point timestamp` to plain
    `std::uint64_t timestamp_us` (monotonic microseconds via
    `clock_monotonic_us`). All three methods (ctor, is_expired,
    age_ms) flip @safe.

Verification: borrow_check_rrr 45/45 clean. Ratio 72.5% → 72.6%
(+6 @safe LOC; inner-block LOC dropped by 15).
… Vec)

  - client.cpp `reconnect_address_ = addr` — std::string assignment from
    `const char*` is benign in @safe code; no need for an inner wrap.
  - client.cpp `bind_channel_direct` body — SpinMutex::lock +
    Option::operator= are both @safe.
  - server.cpp's channel-listener on_accept lambda — SpinMutex::lock +
    rusty::Vec::push are both @safe.

Verification: borrow_check_rrr 45/45 clean. Ratio 72.6% → 72.7%
(+6 @safe LOC; inner-block LOC dropped by 6).
`update_config` was method-level @unsafe with an inner @unsafe block
around `queue_.lock().unwrap()` + `config_ = config`. Both are @safe:
SpinMutex::lock and POD-struct assignment have no escape hatches.

Verification: borrow_check_rrr 45/45 clean. Ratio holds at 72.7%
(+8 @safe LOC, −9 @unsafe LOC).
Time::now flows through rusty::sys::time::clock_monotonic_us (itself
@safe with an inner @unsafe block around clock_gettime). The inline
wrap inside this_fiber::sleep_until_us was leftover from before the
sys::time landing.

Verification: borrow_check_rrr 45/45 clean. Ratio holds at 72.7%.
…ty()

std::string::empty() is a pure const accessor — completely safe in
@safe code. The lingering @unsafe wrap was incidental.

Verification: borrow_check_rrr 45/45 clean. Ratio 72.7% → 73.2%
(+58 @safe LOC; the inline @unsafe block was straddling an entire
if-condition, so removing the wrap relabels the surrounding code).
Third std::chrono call-site in client.cpp (after Future::timed_wait and
current_time_ms) flips to rusty::sys::time::clock_monotonic_us / 1000.
Removes the last std::chrono::steady_clock::now() use in this file
outside of the test-side rpcbench tooling.

Verification: borrow_check_rrr 45/45 clean. Ratio holds at 73.2%.
Marshal ctor and init_block_read had  blocks around buf_.reserve(N). rusty::Vec::reserve is
@safe — its internal  allocation is wrapped in an inner @unsafe
block inside the Vec class itself.

Verification: borrow_check_rrr 45/45 clean. Ratio holds at 73.2%.
The set_bookmark zero-fill loop appended n zero bytes via buf_.push(0)
inside an @unsafe block. rusty::Vec::push is @safe via namespace
inheritance, so the wrap was decorative.

Verification: borrow_check_rrr 45/45 clean. Ratio 73.2% → 73.3%.
Logs the ~30-commit sweep through stale per-method @unsafe overrides and
inline @unsafe { ... } blocks whose rationales were tied to operations
that have since become @safe. Cumulative net effect: +2.7pp of @safe
ratio (70.6% → 73.3%), borrow_check_rrr 45/45 clean throughout.
Phase 2 (rrr safety push, item 3 — second wrapper family). Bumps the
rusty-cpp submodule to pick up `rusty::sys::process::*` and updates
the five rrr files calling those syscalls directly:

  - base/misc.cpp `get_ncpu`: sysconf(_SC_NPROCESSORS_ONLN) → @safe
    `rusty::sys::process::sysconf`. Function flips @unsafe@safe.
  - base/misc.cpp `get_exec_path`: getpid → @safe wrapper. The
    function stays @unsafe overall (static char[PATH_MAX] + readlink
    + raw char* return) but the getpid call no longer carries an
    explicit syscall annotation.
  - misc/cpuinfo.cpp `CPUInfo::CPUInfo` + `get_cpu_stat`: sysinfo /
    sysconf / times / getpid all route through the new helpers. The
    methods stay @unsafe because of get_network / get_memory parsers
    (raw char* / strtok) but the syscall plumbing is gone.
  - misc/netinfo.cpp `NetInfo::NetInfo` + `get_net_stat`: drop the
    `// @unsafe { times(&tms_buf) }` inner blocks — times() now flows
    through rusty::sys::process::process_times.
  - reactor/reactor.cpp `fiber_task_t::init_context`: sysconf(_SC_PAGESIZE)
    → @safe wrapper.
  - rpc/server.cpp instance-id generation: getpid → @safe wrapper.

Submodule bump: 843ba3b adds `include/rusty/sys/process.hpp` with
getpid / sysconf / process_times (ProcessTimes aggregate) /
sysinfo (Linux-only SysInfo aggregate). Each entry is @safe, body
wraps a single libc syscall in an inner @unsafe block, return types
are plain integers or small PODs.

Verification: borrow_check_rrr 45/45 clean; rrr library compiles.
Ratio 73.3% → 73.5%.
Phase 2 (rrr safety push, item 3 — third wrapper family). Bumps the
rusty-cpp submodule to pick up rusty::sys::env::hostname() and updates
the one rrr caller:

  - rpc/utils.cpp get_host_name(): drops the raw char[1024] buffer +
    direct gethostname call. Flow now goes through @safe sys::env::
    hostname which wraps gethostname in an inner @unsafe block and
    returns an owned std::string. Function flips @unsafe@safe.

Submodule bump: d720f95 adds include/rusty/sys/env.hpp. Buffer size
defaults to 256 bytes (overridable via RUSTY_SYS_ENV_HOSTNAME_BUF).

Verification: borrow_check_rrr 45/45 clean. Ratio 73.5% → 73.6%.
…d_hash

Phase 2 (rrr safety push, item 3 — fourth wrapper family). Bumps the
rusty-cpp submodule to pick up rusty::sys::pthread::current_id_hash()
and updates the one rrr caller:

  - basetypes.cpp Rand::Rand: pthread_self() seed contributor now goes
    through @safe sys::pthread::current_id_hash. The only remaining
    @unsafe op is , which is now
    confined to a tight inner @unsafe { } block. Function flips
    @unsafe@safe.

Submodule bump: 97c45b4 adds include/rusty/sys/pthread.hpp scoped
narrowly to thread-identity (the mutex / condvar / thread-create
surface continues to live in rusty::sync::* / rusty::thread::*).

Verification: borrow_check_rrr 45/45 clean. Ratio holds at 73.6%
(Rand::Rand body is small; +8 @safe LOC, -7 @unsafe LOC, +2 inner-block).
Phase 2 item 3 (rusty::sys::* syscall wrappers) was [partial] after
the sys::time landing. With sys::process / sys::env / sys::pthread
also shipped + folded into rrr, the item is now [x].

Logs the four sub-families that landed (time, process, env, pthread)
with their submodule + parent commit hashes, and documents the
deliberate skips for epoll/kqueue, pthread mutex/condvar, and the
full file I/O surface (each annotated with the reason rrr's safety
ratio wouldn't move).
… Queue

Audit confirmed these classes had zero production-touching method
calls:
  - `ThreadPool` was constructed in `paxos_worker.cc` and
    `raft_worker.cc` (`base::ThreadPool::make(num_threads)` + `make(1)`
    for hb) and stored in `rusty::Arc<base::ThreadPool>` fields, but
    `run_async` was never invoked. The fields were checked via empty
    `if (thread_pool_g) { /* Arc auto-releases */ }` blocks during
    shutdown — i.e. it spun up worker threads on an idle job queue
    forever, allocated for nothing.
  - `RunLater`: zero callers outside threading.cpp.
  - `SpinCondVar`: zero callers outside threading.cpp.
  - `Queue<T>` (pthread mutex + cond blocking queue): only consumed
    internally by `ThreadPool::run_thread`.

Deletes:
  - threading.cpp: drops SpinCondVar (~70 LOC) + Queue<T> (~75 LOC) +
    ThreadPool class + impl (~155 LOC) + RunLater class + impl
    (~220 LOC). File shrinks from 930 → 411 LOC. Unused includes
    (rusty/arc.hpp, /box.hpp, /fn.hpp, /function.hpp, /vecdeque.hpp,
    sys/time.h) and imports (rrr.misc) come out too.
  - server_worker.{h,cc}: drops `svr_thread_pool_`, `thread_pool_g`,
    `hb_thread_pool_g` fields + the `hb_thread_pool_g = svr_thread_pool_`
    assignment.
  - paxos_worker.{h,cc}: drops `thread_pool_g` / `hb_thread_pool_g`
    fields + the two `base::ThreadPool::make(...)` construction calls.
  - raft_worker.{h,cc}: same shape; also drops the two empty
    `if (...thread_pool_g) { /* Arc auto-releases */ }` shutdown
    blocks.

Verification: borrow_check_rrr 45/45 clean. Channel-mode unit tests
pass (inmemory 24/24, fiber 8/8, tcp 20/20, client_factory 5/5).
rrr + mako + rpcbench + txlog_core_obj all link cleanly. Ratio
73.6% → 74.0%; @unsafe pool drops by ~114 LOC; total in-fn LOC
drops by ~360.

Net: 554 LOC removed, 2 added.
Two dead-code prunes surfaced by the Phase 2 follow-on survey:

- `NetInfo` (src/rrr/misc/netinfo.cpp, ~75 LOC) gauged ens4 rx/tx
  bytes/sec through `NetInfo::net_stat()`; zero callers anywhere in
  the tree. Removed the file, dropped `import rrr.netinfo;` from
  `rrr.hpp`, and pruned the entries from RRR_MODULE_SRC /
  RRR_BORROW_SRC in `src/rrr/CMakeLists.txt`.

- `ServerConnection::{fd, poll_mode, content_size, handle_read,
  handle_write, handle_error, handle_free, check_pending_write_update}`
  in `src/rrr/rpc/server.cpp` (~95 LOC) were stub methods whose
  doc-comments cited "PollableProxy facade ABI compatibility". But
  `class ServerConnection` has no base class — the stubs aren't
  overrides — and `make_pollable_proxy_from_typed_arc<T>` is never
  instantiated with `T = ServerConnection` anywhere in the tree
  (verified via grep). The only callers lived inside one test
  (`ServerApiSafetyTest.ServerConnectionContentSizeAndHandleFreeAreSafe`
  in test_rpc_extended.cc) that exercised the no-op behavior to
  confirm it was a no-op. Deleted the test along with the stubs.
  `is_closed()`, the one method in the same block that is actually
  used by the poll loop and many callers, is kept.

Verification:
  - rrr builds clean (`cmake --build build_clang21 --target rrr`).
  - borrow_check_rrr clean across 44 files (47/48 ninja tasks; 48th
    is the rrr.netinfo BMI which is no longer emitted).
  - test_rpc_extended builds + links clean.

Safety LOC tally: 8395 @safe / 11351 in-function = 74.0%.

Plan doc: append a "Phase 2 follow-on — dead-code removal" entry
listing this commit + the prior threading.cpp prune (f6be7df).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tomic

The Log class held a static `pthread_mutex_t m_s` that protected
writes to `level_s` (int) and `fp_s` (FILE*). Two findings:

1. `fp_s` and `Log::set_file()` had zero callers in the tree.
   They were dead state.
2. `log_v` (the hot path) reads `level_s` without taking the lock,
   so the mutex only guarded the writer — a torn-write hazard, not
   a torn-read one. Rust would model this as `AtomicI32`, not a
   `Mutex<i32>`.

Changes in `src/rrr/base/logging.cpp`:
  - Drop `static FILE* fp_s` field + its definition.
  - Drop `static pthread_mutex_t m_s` field + its definition.
  - Drop `Log::set_file(FILE*)` declaration and definition.
  - Change `static int level_s` to
    `static rusty::sync::atomic::Atomic<int> level_s`.
  - Initialize to `Log::DEBUG` via `Atomic<int>{Log::DEBUG}`.
  - `set_level` becomes `level_s.store(level, Ordering::Relaxed)`.
  - `log_v` reads via `level_s.load(Ordering::Relaxed)`.
  - Drop `#include <pthread.h>` and `import rrr.threading;` — the
    file no longer touches pthread or the Pthread_* wrappers.

The rusty-cpp submodule bumps to `5150277` (one commit) which adds
namespace-level `// @safe` annotations to `rusty::sync::atomic` so
`load`/`store` can be called from @safe code without inline
`@unsafe { ... }` blocks at the call sites.

Verification:
  - rrr builds clean.
  - borrow_check_rrr clean across all 44 files (logging.cpp
    specifically: 0 violations).
  - LOC tally: 8394 @safe / 11341 in-function = 74.0%.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The paxos/raft configs in `config/1leader_2followers/` hardcode ports
(45xxx/46xxx for paxos, 27xxx/28xxx for raft). Between consecutive CI
runs in the same job — including the auto-retries inside `ci/ci.sh` —
a process from the previous attempt that exited abnormally can leave
its listen sockets in TIME_WAIT / FIN-WAIT-2 on those ports. SO_REUSEADDR
helps with TIME_WAIT but not with sockets still tracked in connected
states, so the next attempt's `s102` bind on 0.0.0.0:46002 panics with
AddressInUse and the shard goes down before any throughput emits. We
hit this on `shard2Replication` in PR #67's CI run.

Mirror what simpleTransactionRep already does for the shard config
(`make_simple_txn_rep_config`): pick a random base port at test init,
probe the leader port of each (shard, cluster) tuple via real `bind()`
to verify the range is free, then materialize per-shard rebased copies
of the paxos/raft YAML into `/tmp/mako_paxos_cfg_XXXX/`. Test wrappers
export `MAKO_PAXOS_CONFIG_DIR` so `bash/shard.sh` and
`simpleTransactionRep` redirect to the tmp dir.

Files touched:

  - `examples/simple_transaction_rep_port_utils.sh`:
    + `pick_paxos_replication_port_base(nshards, nthreads)` — picks a
      base in [40000, 56000-window) with up to 2000 random retries,
      probing 4 * nshards leader ports via SO_REUSEADDR bind.
    + `write_paxos_replication_config(new_base, src, dest)` — rebases
      every `s<id>:<port>` token in `site.server` by
      `delta = new_base - min(existing ports)`. Preserves yaml-cpp-
      friendly flow-style nested lists.
    + `make_paxos_replication_configs(nshards, nthreads, type)` —
      orchestrates: picks base, mkdir-tmp, calls
      `write_paxos_replication_config` per shard with shard-stride
      `+1000`, returns the tmp dir path.
  - `bash/shard.sh`: honor `MAKO_PAXOS_CONFIG_DIR` (fall back to
    the in-tree `config/1leader_2followers/`).
  - `examples/simpleTransactionRep.cc`: same `MAKO_PAXOS_CONFIG_DIR`
    override at the `paxos_config_files[]` construction site.
  - All eight CI replication test wrappers:
    `test_{1,2}shard_replication{,_simple,_raft,_simple_raft}.sh`
    call `make_paxos_replication_configs`, export the env var, and
    clean up the tmp dir on EXIT / interrupt.

Verification:
  - `bash -n` clean on all 10 modified shell scripts.
  - `simpleTransactionRep` rebuilds.
  - Smoke test:
      $ tmp=$(make_paxos_replication_configs 2 3 paxos)
      $ head -8 $tmp/paxos3_shardidx{0,1}.yml
    shows shard 0 at base+0/100/200/300 and shard 1 at
    base+1000/1100/1200/1300 with the original site IDs (s101..s403)
    intact and yaml-cpp-style flow lists preserved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The first run of the randomized-port helper produced a base around 57900,
which gave paxos site ports up to ~58900. PaxosWorker::CtrlPortDelta = 10000
means each site also spawns a heartbeat listener at `port + 10000` — so we
were asking the kernel to bind 68900, which is past the valid TCP port
range, and bind() returned AddressInvalid:

    rrr::Server::start: channel listener failed to bind 0.0.0.0:67901: AddressInvalid

The original hardcoded configs never hit this: paxos shard 0 lived in 45xxx
(heartbeat 55xxx) and shard 9 topped out at 54xxx (heartbeat 64xxx).

Two fixes in `pick_paxos_replication_port_base`:

  1. Lower BASE_MAX by 10000 + per-shard window so
     `max_heartbeat = base + (nshards-1)*1000 + 300 + (nthreads-1) + 10000`
     stays below 65535.

  2. Add the matching `port + CTRL_PORT_DELTA` to the probe set so external
     port conflicts on heartbeat ports also block selection at init.

Sanity-checked: 5 random picks for `(nshards=2, nthreads=3)` produced bases
43692..51131 with max heartbeat 54994..62433 — all within range.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After the paxos/raft randomization landed in bb6bb21, CI moved past
all replication failures but now fails on `multiShardSingleProcess`
with `bind 0.0.0.0:31000: AddressInUse` / `0.0.0.0:31102: AddressInUse`.
Root cause: the multi-shard single-process tests hardcode
`--shard-config $path/config/local-shards2-warehouses$trd.yml`, which
ships with port 31000+ for `localhost`. Earlier tests
(`simpleTransaction`) pick a random base in `[20000, 28599]` with
offsets up to 3100, so their highest port is 31699 — overlapping
31000-31102. A TIME_WAIT / leftover socket from `simpleTransaction`
takes 31000 down before `multiShardSingleProcess` can bind it.

Apply the same `make_simple_txn_rep_config` pattern used by the
replication tests:

  - `examples/test_multi_shard_single_process.sh`: source the
    port-utils, generate `MAKO_CONFIG=$(make_simple_txn_rep_config 2 $trd)`,
    pass `--shard-config $TEMP_CONFIG`, clean up on EXIT.
  - `examples/test_2shard_single_process.sh`: same pattern.
  - `examples/test_2shard_single_process_replication.sh`: also generate
    a randomized paxos dir (`make_paxos_replication_configs 2 $trd paxos`)
    and rewrite the two hardcoded `-F config/1leader_2followers/paxos${trd}_shardidx*.yml`
    args to use `${MAKO_PAXOS_CONFIG_DIR}/...`.

After this, every CI test that binds ports does so on a per-run
randomized + bind-probed range; no test still depends on the
hardcoded 31xxx-34xxx or 45xxx-46xxx ranges.

`bash -n` clean on all three scripts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shuaimu shuaimu merged commit e2a1c46 into mako-dev May 23, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant