Add agent-level latency benchmark under fault conditions#83
Merged
Conversation
…ADR, SSRF IP pinning Docs - dogfooding.md / testing.md: fix 65→68 bugs and 12→13 passes drift against dogfooding-bugs.md - testing.md / cicd.md / ADR 0006 / README: soften "mandatory quality gate" language for cargo-mutants to match the current on-demand workflow (workflow_dispatch only; nightly schedule and PR-gate triggers are commented out). The workflow still fails on surviving mutants when it runs. - docs/adr/0008-agent-executor-trait-shape.md: new ADR explaining why AgentExecutor returns Pin<Box<dyn Future + Send>> instead of async fn, including the four alternatives considered (AFIT, async-trait, trait_variant, hand-rolled erasure wrapper) and a revisit trigger. Summarized in book/src/reference/adrs.md. Examples - examples/rig-agent: make it unambiguous that this is an *integration template*, not a working rig agent. Top-of-README callout, banner in main.rs stdout, expanded module docstring, and an updated package description + examples/README.md row. The run_rig_completion() body is still the zero-dependency mock; instructions for adding rig-core and pasting in a provider snippet are preserved and clarified. Security — SSRF DNS-rebinding hardening (crates/a2a-server/src/push/sender.rs) - validate_webhook_url_with_dns now returns the specific SocketAddr it validated (or None for IP literals) instead of (). - HttpPushSender::send rewrites the outgoing URI to connect directly to the literal validated IP, with the original hostname preserved via an explicit Host header. This closes the TOCTOU window between validation and hyper's own DNS resolver that a rebinding attacker could otherwise exploit — the HTTP client sees an IP literal and never re-enters DNS resolution. - New helpers: rewrite_uri_with_pinned_addr (handles IPv4/IPv6 bracketing and path/query preservation) and host_header_from_url. - 6 new unit tests covering URI rewriting and Host header extraction. - dogfooding-bugs.md Bug H6 entry updated with the hardening note. Verification - cargo fmt --all -- --check (clean) - cargo clippy --workspace --all-targets -- -D warnings (clean) - cargo test -p a2a-protocol-server --lib: 545 passed, 0 failed (push::sender module: 43 tests, including 6 new ones) - cargo check -p rig-a2a-agent (clean) https://claude.ai/code/session_01MYSB9bmv8QVSyrfRzPsnMN
Addresses the single sharpest critique in the session feedback: "all 267
benchmarks are transport-level... wrong shape of benchmarks entirely".
One benchmark does not retroactively make the other 13 suites agent-level,
but this closes the most obvious gap — it is the first benchmark on the
page that does not measure SDK-layer overhead.
New infrastructure (benches/src/)
- fault_transport.rs: FaultInjectingTransport<T> wraps any Transport with
per-call latency injection and deterministic xorshift-seeded error-rate
injection. Implements the Transport trait so it is a drop-in replacement
usable via ClientBuilder::with_custom_transport. Deterministic so that
criterion's statistical estimates are stable across runs.
- coordinator.rs: ChainHopExecutor forwards to a next-hop A2aClient with
configurable local retry budget; ChainLeafExecutor is the minimal
Working → Completed terminal.
New benchmark (benches/benches/coordinator_chain_under_fault.rs)
- Builds a 5-hop in-process coordinator chain (entry → 4 coordinators →
leaf). Every link, including the entry link, routes through its own
FaultInjectingTransport instance so per-hop faults compound end-to-end.
- Group 1 `coordinator_chain_5hop/latency_injection`: varies per-hop
latency {0, 1000, 5000, 20000} µs, zero error rate. Verified end-to-end
locally via `--quick`: 2.15 ms baseline → 17.56 ms → 35.81 ms →
113.07 ms. Linear scaling confirms each hop actually blocks on its
downstream completion (not just task creation).
- Group 2 `coordinator_chain_5hop/error_injection`: varies per-hop error
rate {0, 1, 2, 5} %, zero added latency, 3 retries per hop plus 8 outer
retries at the bench harness. Measures successful-path latency
including retry cost. Verified locally: 1.88 ms → 2.03 ms → 2.06 ms →
2.11 ms (gentle slope — hop-local retries absorb most transient
faults at low rates).
Honest caveats — documented both in the bench module rustdoc and in the
new "Agent-Level Latency Under Fault" section of the generator script
that emits book/src/reference/benchmarks.md:
- In-process synthetic ClientError::Timeout, not real packet loss. Does
not exercise TCP congestion control, DNS, or head-of-line blocking.
- Sequential delegation only. Critic loops, parallel fan-out with
deadline propagation, and plan-and-execute with replanning are
explicitly out of scope for this benchmark.
- One benchmark does not retroactively make the other 13 suites
agent-level. Additive, not a substitute for a real agent-capability
suite.
Documentation
- benches/scripts/generate_book_page.sh: new "Agent-Level Latency Under
Fault" section with topology diagram, caveats, and two emit_table calls
for the new criterion groups. "What we benchmark / do NOT benchmark"
footer updated to acknowledge the one agent-level bench while reinforcing
that real network faults, real multi-agent topologies, and agent-
capability evaluation remain out of scope.
- book/src/deployment/cicd.md: updated 13 suites / 267 benchmarks to 14 /
275 with a link to the agent-level caveats section.
- README.md: updated the `cargo bench` comment with the new counts and a
pointer to the caveats page.
Verification
- cargo fmt --all -- --check (clean)
- cargo clippy --workspace --all-targets -- -D warnings (clean)
- cargo test --workspace (all green, no regressions)
- cargo bench -p a2a-benchmarks --bench coordinator_chain_under_fault --
--quick (all 8 variants produce clean numbers; latency-injection group
scales linearly with per-hop latency as expected; error-injection
group shows gentle retry-cost slope with zero unrecoverable failures)
Not committed
- book/src/reference/benchmarks.md itself is auto-generated by
benches/scripts/generate_book_page.sh and is left for CI to
regenerate on main with full numbers from the full bench run.
Committing the --quick output would blank the other sections.
https://claude.ai/code/session_01MYSB9bmv8QVSyrfRzPsnMN
The `Documentation` CI check failed on PR #83 because the ChainHopExecutor rustdoc linked to `ChainHopExecutor::max_retries`, which is a private field. With `-D rustdoc::private-intra-doc-links` implied by `-D warnings`, that is a hard error. Repoint the link at the public setter `with_max_retries` instead — same concept, public API. Verified locally with: RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps https://claude.ai/code/session_01MYSB9bmv8QVSyrfRzPsnMN
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds the first agent-level latency benchmark to the suite, measuring end-to-end latency through a 5-hop in-process coordinator chain as links become progressively unreliable. This addresses feedback that the existing 267 benchmarks measure only SDK-layer overhead (request encode, wire round-trip, task store contention) rather than the multi-agent coordination patterns that harness reviewers actually care about.
Key Changes
New Benchmark Infrastructure
benches/benches/coordinator_chain_under_fault.rs— Main benchmark file with two groups:coordinator_chain_5hop/latency_injection: Varies per-link latency (0–20 ms) with zero errors to isolate the chain's latency-compounding factorcoordinator_chain_5hop/error_injection: Varies per-link error rate (0–5%) with 3 retries per hop to measure steady-state latency under transient faultsbenches/src/coordinator.rs— Executor implementations:ChainHopExecutor: Forwards incoming messages to the next hop via a pre-builtA2aClient, emits Working → Completed status, and retries on transient errorsChainLeafExecutor: Minimal leaf executor that emits Working → Completedbenches/src/fault_transport.rs—FaultInjectingTransportwrapper:Transportimplementation to inject synthetic latency and errors before delegationClientError::Timeout(retryable) to exercise SDK retry paths faithfullyDocumentation
docs/adr/0008-agent-executor-trait-shape.md— ADR explaining whyAgentExecutoruses manualPin<Box<dyn Future>>instead ofasync fn:RequestHandlerand all downstream types non-genericasync fnin traits is not object-safe on stable Rustboxed_future()helper andagent_executor!macro for ergonomicsBook updates — Added "Agent-Level Latency Under Fault" section to benchmark reference with honest caveats about in-process vs. real network faults
Example Updates
examples/rig-agent/— Reframed as an integration template rather than a working agent:rigis intentionally not a dependency so the example builds without LLM SDK or API keysrun_rig_completion()returns a mock echo response by defaultrig-coreand replace the stub with real LLM callsSecurity Hardening
crates/a2a-server/src/push/sender.rs— DNS-rebinding defence improvements:validate_webhook_url_with_dns()now returnsOption<SocketAddr>(the validated IP to pin to)rewrite_uri_with_pinned_addr()function rewrites the outgoing URI to use the literal validated IPhost_header_from_url()extracts the original hostname for theHost:headerNotable Implementation Details
FaultInjectingTransportuses an atomic counter fed through xorshift64 so the same benchmark variant gets the same sequence of fault decisions across runs, enabling stable criterion statisticsTimeoutbefore the wrapped transport is called), not real packet loss, so readers don't over-interpret the numbershttps://claude.ai/code/session_01MYSB9bmv8QVSyrfRzPsnMN