Skip to content

Add agent-level latency benchmark under fault conditions#83

Merged
tomtom215 merged 3 commits into
mainfrom
claude/apply-session-feedback-PK7Uj
Apr 15, 2026
Merged

Add agent-level latency benchmark under fault conditions#83
tomtom215 merged 3 commits into
mainfrom
claude/apply-session-feedback-PK7Uj

Conversation

@tomtom215
Copy link
Copy Markdown
Owner

Summary

This PR adds the first agent-level latency benchmark to the suite, measuring end-to-end latency through a 5-hop in-process coordinator chain as links become progressively unreliable. This addresses feedback that the existing 267 benchmarks measure only SDK-layer overhead (request encode, wire round-trip, task store contention) rather than the multi-agent coordination patterns that harness reviewers actually care about.

Key Changes

New Benchmark Infrastructure

  • benches/benches/coordinator_chain_under_fault.rs — Main benchmark file with two groups:

    • coordinator_chain_5hop/latency_injection: Varies per-link latency (0–20 ms) with zero errors to isolate the chain's latency-compounding factor
    • coordinator_chain_5hop/error_injection: Varies per-link error rate (0–5%) with 3 retries per hop to measure steady-state latency under transient faults
  • benches/src/coordinator.rs — Executor implementations:

    • ChainHopExecutor: Forwards incoming messages to the next hop via a pre-built A2aClient, emits Working → Completed status, and retries on transient errors
    • ChainLeafExecutor: Minimal leaf executor that emits Working → Completed
  • benches/src/fault_transport.rsFaultInjectingTransport wrapper:

    • Wraps any Transport implementation to inject synthetic latency and errors before delegation
    • Uses deterministic xorshift64 PRNG seeded per instance so criterion gets stable statistical estimates
    • Injects ClientError::Timeout (retryable) to exercise SDK retry paths faithfully

Documentation

  • docs/adr/0008-agent-executor-trait-shape.md — ADR explaining why AgentExecutor uses manual Pin<Box<dyn Future>> instead of async fn:

    • Object safety is load-bearing: keeps RequestHandler and all downstream types non-generic
    • async fn in traits is not object-safe on stable Rust
    • Compensates with boxed_future() helper and agent_executor! macro for ergonomics
  • Book updates — Added "Agent-Level Latency Under Fault" section to benchmark reference with honest caveats about in-process vs. real network faults

Example Updates

  • examples/rig-agent/ — Reframed as an integration template rather than a working agent:
    • Clarifies that rig is intentionally not a dependency so the example builds without LLM SDK or API keys
    • run_rig_completion() returns a mock echo response by default
    • Includes snippet for users to add rig-core and replace the stub with real LLM calls

Security Hardening

  • crates/a2a-server/src/push/sender.rs — DNS-rebinding defence improvements:
    • validate_webhook_url_with_dns() now returns Option<SocketAddr> (the validated IP to pin to)
    • New rewrite_uri_with_pinned_addr() function rewrites the outgoing URI to use the literal validated IP
    • New host_header_from_url() extracts the original hostname for the Host: header
    • Closes the TOCTOU window between validation and the HTTP client's own resolver by pinning the connection to the exact validated IP

Notable Implementation Details

  • Deterministic fault injection: The FaultInjectingTransport uses an atomic counter fed through xorshift64 so the same benchmark variant gets the same sequence of fault decisions across runs, enabling stable criterion statistics
  • Honest caveats: The benchmark documentation explicitly states this is in-process fault injection (synthetic Timeout before the wrapped transport is called), not real packet loss, so readers don't over-interpret the numbers
  • Retry composition: Coordinators retry 3 times per hop; the bench harness retries 8 times at entry level so published error rates have effectively-zero unrecov

https://claude.ai/code/session_01MYSB9bmv8QVSyrfRzPsnMN

claude added 3 commits April 15, 2026 17:11
…ADR, SSRF IP pinning

Docs
- dogfooding.md / testing.md: fix 65→68 bugs and 12→13 passes drift against
  dogfooding-bugs.md
- testing.md / cicd.md / ADR 0006 / README: soften "mandatory quality gate"
  language for cargo-mutants to match the current on-demand workflow
  (workflow_dispatch only; nightly schedule and PR-gate triggers are commented
  out). The workflow still fails on surviving mutants when it runs.
- docs/adr/0008-agent-executor-trait-shape.md: new ADR explaining why
  AgentExecutor returns Pin<Box<dyn Future + Send>> instead of async fn,
  including the four alternatives considered (AFIT, async-trait,
  trait_variant, hand-rolled erasure wrapper) and a revisit trigger.
  Summarized in book/src/reference/adrs.md.

Examples
- examples/rig-agent: make it unambiguous that this is an *integration
  template*, not a working rig agent. Top-of-README callout, banner in main.rs
  stdout, expanded module docstring, and an updated package description +
  examples/README.md row. The run_rig_completion() body is still the
  zero-dependency mock; instructions for adding rig-core and pasting in a
  provider snippet are preserved and clarified.

Security — SSRF DNS-rebinding hardening (crates/a2a-server/src/push/sender.rs)
- validate_webhook_url_with_dns now returns the specific SocketAddr it
  validated (or None for IP literals) instead of ().
- HttpPushSender::send rewrites the outgoing URI to connect directly to the
  literal validated IP, with the original hostname preserved via an explicit
  Host header. This closes the TOCTOU window between validation and hyper's
  own DNS resolver that a rebinding attacker could otherwise exploit — the
  HTTP client sees an IP literal and never re-enters DNS resolution.
- New helpers: rewrite_uri_with_pinned_addr (handles IPv4/IPv6 bracketing and
  path/query preservation) and host_header_from_url.
- 6 new unit tests covering URI rewriting and Host header extraction.
- dogfooding-bugs.md Bug H6 entry updated with the hardening note.

Verification
- cargo fmt --all -- --check (clean)
- cargo clippy --workspace --all-targets -- -D warnings (clean)
- cargo test -p a2a-protocol-server --lib: 545 passed, 0 failed
  (push::sender module: 43 tests, including 6 new ones)
- cargo check -p rig-a2a-agent (clean)

https://claude.ai/code/session_01MYSB9bmv8QVSyrfRzPsnMN
Addresses the single sharpest critique in the session feedback: "all 267
benchmarks are transport-level... wrong shape of benchmarks entirely".
One benchmark does not retroactively make the other 13 suites agent-level,
but this closes the most obvious gap — it is the first benchmark on the
page that does not measure SDK-layer overhead.

New infrastructure (benches/src/)
- fault_transport.rs: FaultInjectingTransport<T> wraps any Transport with
  per-call latency injection and deterministic xorshift-seeded error-rate
  injection. Implements the Transport trait so it is a drop-in replacement
  usable via ClientBuilder::with_custom_transport. Deterministic so that
  criterion's statistical estimates are stable across runs.
- coordinator.rs: ChainHopExecutor forwards to a next-hop A2aClient with
  configurable local retry budget; ChainLeafExecutor is the minimal
  Working → Completed terminal.

New benchmark (benches/benches/coordinator_chain_under_fault.rs)
- Builds a 5-hop in-process coordinator chain (entry → 4 coordinators →
  leaf). Every link, including the entry link, routes through its own
  FaultInjectingTransport instance so per-hop faults compound end-to-end.
- Group 1 `coordinator_chain_5hop/latency_injection`: varies per-hop
  latency {0, 1000, 5000, 20000} µs, zero error rate. Verified end-to-end
  locally via `--quick`: 2.15 ms baseline → 17.56 ms → 35.81 ms →
  113.07 ms. Linear scaling confirms each hop actually blocks on its
  downstream completion (not just task creation).
- Group 2 `coordinator_chain_5hop/error_injection`: varies per-hop error
  rate {0, 1, 2, 5} %, zero added latency, 3 retries per hop plus 8 outer
  retries at the bench harness. Measures successful-path latency
  including retry cost. Verified locally: 1.88 ms → 2.03 ms → 2.06 ms →
  2.11 ms (gentle slope — hop-local retries absorb most transient
  faults at low rates).

Honest caveats — documented both in the bench module rustdoc and in the
new "Agent-Level Latency Under Fault" section of the generator script
that emits book/src/reference/benchmarks.md:
- In-process synthetic ClientError::Timeout, not real packet loss. Does
  not exercise TCP congestion control, DNS, or head-of-line blocking.
- Sequential delegation only. Critic loops, parallel fan-out with
  deadline propagation, and plan-and-execute with replanning are
  explicitly out of scope for this benchmark.
- One benchmark does not retroactively make the other 13 suites
  agent-level. Additive, not a substitute for a real agent-capability
  suite.

Documentation
- benches/scripts/generate_book_page.sh: new "Agent-Level Latency Under
  Fault" section with topology diagram, caveats, and two emit_table calls
  for the new criterion groups. "What we benchmark / do NOT benchmark"
  footer updated to acknowledge the one agent-level bench while reinforcing
  that real network faults, real multi-agent topologies, and agent-
  capability evaluation remain out of scope.
- book/src/deployment/cicd.md: updated 13 suites / 267 benchmarks to 14 /
  275 with a link to the agent-level caveats section.
- README.md: updated the `cargo bench` comment with the new counts and a
  pointer to the caveats page.

Verification
- cargo fmt --all -- --check (clean)
- cargo clippy --workspace --all-targets -- -D warnings (clean)
- cargo test --workspace (all green, no regressions)
- cargo bench -p a2a-benchmarks --bench coordinator_chain_under_fault --
  --quick (all 8 variants produce clean numbers; latency-injection group
  scales linearly with per-hop latency as expected; error-injection
  group shows gentle retry-cost slope with zero unrecoverable failures)

Not committed
- book/src/reference/benchmarks.md itself is auto-generated by
  benches/scripts/generate_book_page.sh and is left for CI to
  regenerate on main with full numbers from the full bench run.
  Committing the --quick output would blank the other sections.

https://claude.ai/code/session_01MYSB9bmv8QVSyrfRzPsnMN
The `Documentation` CI check failed on PR #83 because the ChainHopExecutor
rustdoc linked to `ChainHopExecutor::max_retries`, which is a private
field. With `-D rustdoc::private-intra-doc-links` implied by
`-D warnings`, that is a hard error. Repoint the link at the public
setter `with_max_retries` instead — same concept, public API.

Verified locally with:
  RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps

https://claude.ai/code/session_01MYSB9bmv8QVSyrfRzPsnMN
@tomtom215 tomtom215 merged commit 7e05a96 into main Apr 15, 2026
40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants