
feat: populate BootstrapManager cache from client peer interactions#48

Merged
jacderida merged 3 commits into rc-2026.4.2 from
ant-core-populate-bootstrapmanager-cache-from-real-peer
Apr 22, 2026
Conversation

@Nic-dorman
Contributor

Summary

  • Client upload and download paths now feed add_discovered_peer + update_peer_metrics so saorsa-core's BootstrapManager cache fills with real, quality-scored peers during normal activity
  • Cold-start clients can reload quality peers from disk instead of always restarting from the 7 bundled bootstrap peers
  • Failures never insert, so unreachable peers don't consume cache quota

Test plan

  • New e2e_bootstrap_cache tests (upload + download) prove cache grows after peer interactions
  • e2e_chunk regression (9/9) — directly exercises modified chunk_put_with_proof and chunk_get_from_peer
  • e2e_data (6/6), e2e_file (4/4), e2e_huge_file::64mb_round_trip in isolation — higher-level coverage
  • 135 lib unit tests + daemon_integration (5/5)
  • cargo clippy --workspace --all-targets --all-features -- -D warnings clean
  • cargo fmt --check clean
  • Manual: confirm bootstrap_cache.json appears on a live multi-subnet testnet after ≥5 min activity (post-merge)

🤖 Generated with Claude Code

The quote, chunk PUT, and chunk GET paths now feed a shared helper
that upserts peers on success via add_discovered_peer and records
latency/success on both outcomes via update_peer_metrics. Cold-start
clients can now load real, quality-scored peers from the persisted
bootstrap cache instead of always restarting from the 7 bundled
peers. Failures never insert, so unreachable peers don't consume
cache quota.
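
The shape described above can be sketched with stand-in types. `BootstrapManager` here is a stub, not the saorsa-core type, and the signatures are illustrative only; the real helper and cache live in saorsa-core and ant-core:

```rust
use std::collections::HashMap;
use std::time::Duration;

// Stub standing in for saorsa-core's BootstrapManager; field layout and
// method signatures here are assumptions for illustration.
#[derive(Default)]
struct BootstrapManager {
    // addr -> (successes, failures, last observed RTT in ms)
    peers: HashMap<String, (u32, u32, Option<u64>)>,
}

impl BootstrapManager {
    fn add_discovered_peer(&mut self, addr: &str) {
        self.peers.entry(addr.to_string()).or_insert((0, 0, None));
    }
    fn update_peer_metrics(&mut self, addr: &str, success: bool, rtt_ms: Option<u64>) {
        if let Some(entry) = self.peers.get_mut(addr) {
            if success { entry.0 += 1 } else { entry.1 += 1 }
            entry.2 = rtt_ms.or(entry.2);
        }
    }
    fn cached_peer_count(&self) -> usize {
        self.peers.len()
    }
}

/// Shared hook: upsert on success, record metrics on both outcomes,
/// never insert on failure (unreachable peers don't consume quota).
fn record_peer_outcome(mgr: &mut BootstrapManager, addr: &str, success: bool, rtt: Option<Duration>) {
    let rtt_ms = rtt.map(|d| d.as_millis() as u64);
    if success {
        mgr.add_discovered_peer(addr);
    }
    mgr.update_peer_metrics(addr, success, rtt_ms);
}

fn main() {
    let mut mgr = BootstrapManager::default();
    record_peer_outcome(&mut mgr, "203.0.113.1:9000", true, Some(Duration::from_millis(42)));
    record_peer_outcome(&mut mgr, "198.51.100.7:9000", false, None); // failure: no insert
    assert_eq!(mgr.cached_peer_count(), 1);
}
```

Each of the three call sites (quote, chunk PUT, chunk GET) funnels through this one entry point, so the insert/metrics policy is enforced in a single place.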

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@grumbach
Contributor

Solid pull — the plumbing shape is right, the helper is scoped correctly, and the failure-path comment captures the intent. A couple of correctness issues though, and the test story doesn't actually prove the headline claim yet.

What I'd want before merging

  1. The "failures never insert" claim doesn't match the code. The helper decides reachable with !matches!(&result, Err(Error::Network(_) | Error::Timeout(_))), which means Err::Protocol, Err::InvalidData, Err::Serialization, Err::Payment, and even a remote PutResponse::Error all count as success and fire add_discovered_peer. A peer that sends a malformed quote, a payment rejection, or a storage error becomes a cache insert with a full-PUT RTT written to its quality score. That's exactly the liveness-without-quality bar the PR claims to avoid. Either flip the test to an explicit whitelist (Ok(_) | Ok(Some(_)) etc.) or pass an explicit success: bool from each call site where you actually know the semantic outcome. The three call sites each have a clean Ok branch — plumb that through instead of inferring from the error variant.

  2. Cold-start-from-disk is the reason this PR exists, and no test covers it. Both e2e_bootstrap_cache tests assert cached_peer_count grows inside a single process. Neither tears the node down, constructs a fresh P2PNode from the same cache dir, and asserts the reloaded cache contains the peers from the previous run. The PR body lists "confirm bootstrap_cache.json appears on a live multi-subnet testnet" as a manual, unchecked, post-merge task — for the feature's primary value-prop, that needs to be an automated cross-restart assertion with a shared cache_dir. The rate-limiter argument for relaxing the assertion to > doesn't apply to a load test. Please add one.

  3. RTT signal is noisy to the point of being misleading. start = Instant::now() in chunk_put_with_proof measures end-to-end: ~4 MB payload upload + store confirmation. On a residential uplink that's 10–60 s; a local testnet run gets sub-ms. Writing that into the bootstrap cache's latency field means quality rankings will be dominated by the uploading client's pipe rather than anything intrinsic to the peer. The quote path is fine (small request, small response). Either restrict the metrics update on the PUT path to success only (leave rtt_ms = None), or measure only the send-to-first-byte interval of the response.
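Point (1) is easy to reproduce with a local enum; the variant names below mirror the ones discussed above but the enum itself is illustrative, not the real ant-core error type:

```rust
#[allow(dead_code)]
#[derive(Debug)]
enum Error {
    Network,
    Timeout,
    Protocol,
    Payment,
}

// The rejected approach: infer "reachable, cache-worthy" from which
// error variant occurred, rather than from the semantic outcome.
fn reachable_by_inference(result: &Result<(), Error>) -> bool {
    !matches!(result, Err(Error::Network) | Err(Error::Timeout))
}

fn main() {
    let protocol_err: Result<(), Error> = Err(Error::Protocol);
    // Variant inference wrongly treats a protocol error as a successful
    // interaction and would fire add_discovered_peer...
    assert!(reachable_by_inference(&protocol_err));
    // ...whereas an explicit success flag plumbed from the Ok branch
    // (result.is_ok()) correctly reports failure.
    assert!(protocol_err.is_err());
}
```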

P2 — polish

  • Primary-address selection is first-wins. In record_peer_outcome: addrs.iter().find(|a| a.socket_addr().is_some()). If a peer advertises multiple addresses, the order in the DHT response determines which one's metrics get updated and which one is keyed in the cache. A peer ordering [10.0.0.5, 203.0.113.1] will have its metrics tracked under the RFC1918 address while the real one goes uncredited. Either iterate them all (update_peer_metrics per resolvable addr) or prefer the globally routable one.
  • Cache directory is shared across saorsa apps. saorsa-core's default cache_dir is ~/Library/Caches/saorsa/bootstrap/ — the same dir saorsa-node and communitas write to. Two unrelated saorsa processes on the same host will cross-pollinate their bootstrap caches. Not introduced by this PR, but this PR is what makes it user-visible. Worth a namespaced cache_dir override in ant-core's CoreNodeConfig builder.
  • Test comment is partly wrong. The justification for after > before says "the rate limiter permits only the first 3 to join the cache in a single run" — but the loopback case is actually handled by BootstrapIpLimiter::with_loopback_and_k, not the temporal rate limiter. The observable ceiling is the same, but the cause isn't. If you want this comment to age well, reference the loopback bypass behavior instead of max_joins_per_24_per_hour.
  • Download test's assertion is too weak to be useful. after_download >= before_download is true for a no-op. If the /24 rate-limiter has already saturated, we're not actually exercising anything new on the download path — consider using a second testnet node (different simulated /24 if possible, or add_peer_trusted) for the download test.
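For the first polish item, a std-only sketch of the suggested preference; `select_primary` and `is_globally_routable` are hypothetical names, and the classification is an approximation since std's `IpAddr::is_global` is unstable:

```rust
use std::net::{IpAddr, SocketAddr};

// Approximate "globally routable" with stable std predicates; the v6
// unique-local and link-local checks are done by prefix since the
// corresponding std methods are unstable.
fn is_globally_routable(ip: &IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            !(v4.is_private() || v4.is_loopback() || v4.is_link_local() || v4.is_documentation())
        }
        IpAddr::V6(v6) => {
            let s = v6.segments();
            let unique_local = (s[0] & 0xfe00) == 0xfc00; // fc00::/7
            let link_local = (s[0] & 0xffc0) == 0xfe80; // fe80::/10
            !(v6.is_loopback() || unique_local || link_local)
        }
    }
}

// Prefer a globally routable address; fall back to first-available.
fn select_primary(addrs: &[SocketAddr]) -> Option<SocketAddr> {
    addrs
        .iter()
        .copied()
        .find(|a| is_globally_routable(&a.ip()))
        .or_else(|| addrs.first().copied())
}

fn main() {
    let addrs: Vec<SocketAddr> = vec![
        "10.0.0.5:9000".parse().unwrap(),
        "8.8.8.8:9000".parse().unwrap(),
    ];
    // First-wins would key the cache under the RFC1918 address;
    // the preference picks the public one instead.
    assert_eq!(select_primary(&addrs), Some("8.8.8.8:9000".parse().unwrap()));
    // With only non-routable candidates, fall back to first-available.
    let local: Vec<SocketAddr> = vec!["127.0.0.1:9000".parse().unwrap()];
    assert_eq!(select_primary(&local), Some("127.0.0.1:9000".parse().unwrap()));
}
```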

Nits

  • u64::try_from(...).unwrap_or(u64::MAX) for an RTT in ms will only saturate at ~584 million years. Just use as u64 here — the saturation branch is unreachable and the cast intent is clearer. (Or, given the clippy rules in this project, a .unwrap_or(0) is probably closer to what you want on the off-chance the measurement is bogus.)
  • Helper doc says "upserts the peer (subject to saorsa-core Sybil checks)" — worth also noting the helper silently swallows both errors, because that's the behavior someone debugging "why didn't my peer get cached?" will want to know.
  • record_peer_outcome lives in client/mod.rs but is tightly coupled to chunk RTT semantics. Either move it next to where it's used (there are only three call sites, all in chunk.rs / quote.rs) or put it in its own file.

(1) and (2) are blockers for me. (3) is a blocker if we ever want to use the RTTs for anything downstream; if the current consumer just needs success/failure, feel free to mark it P2 and file a follow-up.

Blockers:
- Plumb explicit `success: bool` from each call site's `Ok` branch instead
  of inferring from error variants. Previously `Err::Protocol`, `::Payment`,
  `::Serialization`, `::InvalidData`, and remote PutResponse::Error all
  fired `add_discovered_peer` + wrote an RTT — the opposite of the stated
  "failures never insert" invariant.
- Drop RTT on the PUT path. `chunk_put_with_proof` wraps a ~4 MB payload
  upload, so the wall-clock was dominated by the uploader's uplink rather
  than the peer's responsiveness. Quote-path and GET-path RTTs still
  feed quality scoring.
- Add a cold-start-from-disk test that populates a BootstrapManager
  (via `add_peer_trusted` to skip the /24 rate limiter), saves, drops,
  reopens against the same cache_dir, and asserts peers reload. Closes
  the loop on the cold-start value prop.

Polish:
- Prefer globally-routable socket addresses when picking a peer's cache
  key. Previously a peer advertising `[10.0.0.5, 203.0.113.1]` would be
  keyed under the RFC1918 address and metrics recorded over the public
  address would land on a stale entry.
- Fold the weak "after_download >= before_download" assertion into the
  main test (now performs PUT + GET + asserts cache grew). The /24 rate
  limiter saturates during the PUT, so a standalone download assertion
  carries no information.

Nits:
- Use `as u64` for elapsed millis (saturates only after ~584M years).
- Move `record_peer_outcome` out of `client/mod.rs` into its own
  `client/peer_cache.rs` module.
- Doc now notes that both upstream calls silently discard errors.

Test comment on the 3-peer ceiling is rewritten to describe both Sybil
mechanisms accurately: `BootstrapIpLimiter` (IP diversity, exempted for
loopback) and `JoinRateLimiter` (temporal, NOT exempted) — the latter
is what caps /24 inserts at 3/hour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Nic-dorman
Contributor Author

Addressed in 6315fb9.

Blockers

  1. Explicit success: bool from each call site — done. Dropped the !matches!(...) error-variant inference and now plumb result.is_ok() straight from the Ok branch at each of the three call sites. Err::Protocol, Err::InvalidData, Err::Serialization, Err::Payment, remote PutResponse::Error — all now correctly count as failure. The "failures never insert" invariant in the PR body is now the invariant in the code.

  2. RTT dropped on the PUT path — done. chunk_put_with_proof now passes rtt_ms = None; only the quote path and GET path record RTT. A comment on the site explains why (uploader-uplink-dominated measurement). Avoided the send-to-first-byte alternative to keep the change small; happy to revisit if downstream consumers ever want the stricter signal.

  3. Cold-start-from-disk test — done. test_bootstrap_cache_roundtrip_through_disk populates a BootstrapManager via add_peer_trusted (bypasses the /24 rate limiter which local testnets can't escape), saves, drops, reopens against the same cache_dir, and asserts all 15 peers reload. That's the upstream half of the lifecycle. The hook half is still covered by the existing in-process activity test. Both live in ant-core/tests/e2e_bootstrap_cache.rs.
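The lifecycle that roundtrip test exercises (populate, save, drop, reopen against the same cache_dir, assert reload) can be sketched with a toy line-per-peer file; the real test goes through BootstrapManager and a JSON cache, so `PeerCache` below is purely illustrative:

```rust
use std::collections::BTreeSet;
use std::fs;
use std::path::{Path, PathBuf};

// Toy cache: one peer address per line. Stand-in for the real
// bootstrap_cache.json managed by saorsa-core.
struct PeerCache {
    dir: PathBuf,
    peers: BTreeSet<String>,
}

impl PeerCache {
    fn open(dir: &Path) -> Self {
        let peers = fs::read_to_string(dir.join("bootstrap_cache.txt"))
            .map(|s| s.lines().map(str::to_string).collect())
            .unwrap_or_default(); // missing file: start empty (cold start)
        Self { dir: dir.to_path_buf(), peers }
    }
    fn add_peer_trusted(&mut self, addr: &str) {
        self.peers.insert(addr.to_string());
    }
    fn save(&self) {
        fs::create_dir_all(&self.dir).unwrap();
        let body = self.peers.iter().cloned().collect::<Vec<_>>().join("\n");
        fs::write(self.dir.join("bootstrap_cache.txt"), body).unwrap();
    }
}

fn main() {
    let dir = std::env::temp_dir().join("bootstrap_cache_roundtrip_demo");
    let _ = fs::remove_dir_all(&dir);

    let mut cache = PeerCache::open(&dir);
    for i in 0..15 {
        cache.add_peer_trusted(&format!("192.0.2.{i}:9000"));
    }
    cache.save();
    drop(cache); // simulate process teardown

    // Fresh "process", same cache_dir: peers must reload from disk.
    let reloaded = PeerCache::open(&dir);
    assert_eq!(reloaded.peers.len(), 15);
}
```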

P2

  • Globally-routable address preference — done. select_primary_multiaddr prefers a socket-addressable MultiAddr with a non-private / non-loopback / non-link-local IP over the first-available. Unit tests cover both v4 and v6 classification.
  • Weak download assertion — done. Folded into the main test: the test now performs PUT + GET as a round-trip and asserts (a) content matches and (b) cache grew. The orphan after_download >= before_download weak assertion is gone.
  • Shared cache_dir across saorsa apps — deferred. Not introduced by this PR; belongs in an ant-core CoreNodeConfig override. Happy to file a follow-up.

Nits

  • u64::try_from(...).unwrap_or(u64::MAX) → as u64 — done, all three call sites.
  • Doc mentions error-swallowing — done. The helper's doc comment now says "Both upstream calls silently discard errors — peer-cache bookkeeping must never abort a user operation. Enable the saorsa_core::bootstrap tracing target to see rejection reasons."
  • Relocate helper out of client/mod.rs — done. Now lives in its own client/peer_cache.rs, which keeps chunk-RTT semantics co-located and removes clutter from the root client module.

One correction worth flagging

The test comment's attribution to the /24 rate limiter is correct — not the IP-diversity limiter. Verified by reading both:

  • BootstrapIpLimiter::can_accept (saorsa-core/src/security.rs:220-222) does short-circuit for loopback when allow_loopback = true. But that's the IP-diversity check.
  • JoinRateLimiter::check_join_allowed (saorsa-core/src/rate_limit.rs:254+) has no is_loopback branch — it applies max_joins_per_24_per_hour: 3 unconditionally, including for loopback.

So the observed ceiling of 3 peers in the cache is the temporal rate limiter biting on the single 127.0.0.0/24 bucket, not IP diversity. The test comment now explains both mechanisms explicitly so the attribution ages well.
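A minimal model of that temporal mechanism, assuming the semantics described above (joins bucketed by /24 prefix, capped per hour, no loopback exemption); the struct and method names mirror the saorsa-core identifiers quoted above but the implementation here is a sketch, not the real code:

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;
use std::time::{Duration, Instant};

// Illustrative temporal limiter: caps joins per /24 bucket per hour.
// Note there is deliberately no is_loopback() branch, matching the
// behaviour verified above.
struct JoinRateLimiter {
    max_joins_per_24_per_hour: usize,
    joins: HashMap<[u8; 3], Vec<Instant>>, // /24 prefix -> join timestamps
}

impl JoinRateLimiter {
    fn check_join_allowed(&mut self, ip: Ipv4Addr) -> bool {
        let o = ip.octets();
        let entries = self.joins.entry([o[0], o[1], o[2]]).or_default();
        let now = Instant::now();
        // Keep only joins within the last hour.
        entries.retain(|t| now.duration_since(*t) < Duration::from_secs(3600));
        if entries.len() >= self.max_joins_per_24_per_hour {
            return false;
        }
        entries.push(now);
        true
    }
}

fn main() {
    let mut rl = JoinRateLimiter {
        max_joins_per_24_per_hour: 3,
        joins: HashMap::new(),
    };
    // All loopback peers share the single 127.0.0.0/24 bucket, so only
    // the first three joins succeed: the observed 3-peer ceiling.
    let allowed = (1u8..=10)
        .filter(|i| rl.check_join_allowed(Ipv4Addr::new(127, 0, 0, *i)))
        .count();
    assert_eq!(allowed, 3);
}
```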

`203.0.113.0/24` (TEST-NET-3) and `2001:db8::/32` are documentation
prefixes that `is_documentation()` / my v6 approximation correctly
reject. The previous test asserted the opposite on those ranges, which
passed locally on Windows but failed on Linux/macOS CI. Use
`8.8.8.8` and `2606:4700:4700::1111` instead, and add an assertion that
the TEST-NET-3 range IS rejected (that's the behaviour we actually want).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jacderida jacderida changed the base branch from main to rc-2026.4.2 April 22, 2026 22:30
@jacderida jacderida merged commit c9a9447 into rc-2026.4.2 Apr 22, 2026
12 checks passed