Add agent-level latency benchmark under fault conditions by tomtom215 · Pull Request #83 · tomtom215/a2a-rust

tomtom215 · 2026-04-15T17:37:46Z

Summary

This PR adds the first agent-level latency benchmark to the suite, measuring end-to-end latency through a 5-hop in-process coordinator chain as links become progressively unreliable. This addresses feedback that the existing 267 benchmarks measure only SDK-layer overhead (request encode, wire round-trip, task store contention) rather than the multi-agent coordination patterns that harness reviewers actually care about.

Key Changes

New Benchmark Infrastructure

benches/benches/coordinator_chain_under_fault.rs — Main benchmark file with two groups:
- coordinator_chain_5hop/latency_injection: Varies per-link latency (0–20 ms) with zero errors to isolate the chain's latency-compounding factor
- coordinator_chain_5hop/error_injection: Varies per-link error rate (0–5%) with 3 retries per hop to measure steady-state latency under transient faults
benches/src/coordinator.rs — Executor implementations:
- ChainHopExecutor: Forwards incoming messages to the next hop via a pre-built A2aClient, emits Working → Completed status, and retries on transient errors
- ChainLeafExecutor: Minimal leaf executor that emits Working → Completed
benches/src/fault_transport.rs — FaultInjectingTransport wrapper:
- Wraps any Transport implementation to inject synthetic latency and errors before delegation
- Uses deterministic xorshift64 PRNG seeded per instance so criterion gets stable statistical estimates
- Injects ClientError::Timeout (retryable) to exercise SDK retry paths faithfully
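
To illustrate the deterministic fault decision described above, here is a minimal sketch. It assumes a per-instance seed mixed with an atomic call counter and scrambled with xorshift64. `FaultDecider`, `error_ppm`, and the mixing constant are hypothetical names and values chosen for this example; they are not taken from fault_transport.rs.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// One xorshift64 step using the common 13/7/17 shift triple.
/// The input must be non-zero, because xorshift maps zero to zero.
fn xorshift64(mut x: u64) -> u64 {
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    x
}

/// Hypothetical stand-in for the decision logic inside a fault-injecting
/// transport wrapper. Not the PR's actual type.
pub struct FaultDecider {
    seed: u64,
    calls: AtomicU64,
    /// Error rate in parts per million, so 1% is 10_000.
    error_ppm: u64,
}

impl FaultDecider {
    pub fn new(seed: u64, error_ppm: u64) -> Self {
        Self { seed, calls: AtomicU64::new(0), error_ppm }
    }

    /// Returns true when the current call should fail with a synthetic error.
    /// The n-th call on two deciders with the same seed gives the same answer.
    pub fn should_fail(&self) -> bool {
        let n = self.calls.fetch_add(1, Ordering::Relaxed);
        // Mix the call index into the seed (assumed constant), then force the
        // low bit so the xorshift state is never zero.
        let mixed = self.seed.wrapping_add(n.wrapping_mul(0x9E37_79B9_7F4A_7C15)) | 1;
        let x = xorshift64(xorshift64(mixed));
        (x % 1_000_000) < self.error_ppm
    }
}
```

Because the decision depends only on the seed and the call index, two runs of the same benchmark variant see the same fault sequence, which is the property the PR relies on for stable criterion estimates.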

Documentation

docs/adr/0008-agent-executor-trait-shape.md — ADR explaining why AgentExecutor uses manual Pin<Box<dyn Future>> instead of async fn:
- Object safety is load-bearing: keeps RequestHandler and all downstream types non-generic
- async fn in traits is not object-safe on stable Rust
- Compensates with boxed_future() helper and agent_executor! macro for ergonomics
Book updates — Added "Agent-Level Latency Under Fault" section to benchmark reference with honest caveats about in-process vs. real network faults
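
The object-safety argument from the ADR can be sketched as follows. This is a simplified illustration, not the SDK's actual `AgentExecutor` signature: the `Executor` trait, `Echo`, `Handler`, and the tiny `block_on` driver are all hypothetical names introduced for this example.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Boxed, sendable future: the return shape the ADR describes.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Illustrative trait. Returning a boxed future (instead of declaring an
/// `async fn`) keeps the trait usable as `dyn Executor`.
trait Executor: Send + Sync {
    fn execute<'a>(&'a self, input: &'a str) -> BoxFuture<'a, String>;
}

struct Echo;

impl Executor for Echo {
    fn execute<'a>(&'a self, input: &'a str) -> BoxFuture<'a, String> {
        Box::pin(async move { format!("echo: {input}") })
    }
}

/// Non-generic handler: it stores a trait object, so no type parameter
/// leaks into the types built on top of it.
struct Handler {
    executor: Arc<dyn Executor>,
}

/// Waker that does nothing, so the sketch needs no async runtime.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal driver that polls in a loop. Only suitable for futures that
/// complete without waiting on an external wakeup, as in this example.
fn block_on<T>(mut fut: BoxFuture<'_, T>) -> T {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

With this shape, `Handler { executor: Arc::new(Echo) }` compiles without making `Handler` generic over the executor type, which is the trade-off the ADR accepts in exchange for one heap allocation per call.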

Example Updates

examples/rig-agent/ — Reframed as an integration template rather than a working agent:
- Clarifies that rig is intentionally not a dependency so the example builds without LLM SDK or API keys
- run_rig_completion() returns a mock echo response by default
- Includes snippet for users to add rig-core and replace the stub with real LLM calls

Security Hardening

crates/a2a-server/src/push/sender.rs — DNS-rebinding defence improvements:
- validate_webhook_url_with_dns() now returns Option<SocketAddr> (the validated IP to pin to)
- New rewrite_uri_with_pinned_addr() function rewrites the outgoing URI to use the literal validated IP
- New host_header_from_url() extracts the original hostname for the Host: header
- Closes the TOCTOU window between validation and the HTTP client's own resolver by pinning the connection to the exact validated IP
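
A rough sketch of the pinning idea, using only the standard library and plain string handling. The function names `rewrite_with_pinned_addr`, `host_header`, and `split_url` are hypothetical simplifications for this example; the real helpers in sender.rs may parse URIs differently and handle more edge cases.

```rust
use std::net::SocketAddr;

/// Splits `scheme://authority<rest>` into its three parts.
/// Hypothetical helper; a real implementation would use a URI parser.
fn split_url(url: &str) -> Option<(&str, &str, &str)> {
    let (scheme, after) = url.split_once("://")?;
    let end = after
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(after.len());
    let (authority, rest) = after.split_at(end);
    if authority.is_empty() {
        return None;
    }
    Some((scheme, authority, rest))
}

/// Replaces the authority with the validated socket address, keeping the
/// scheme, path, and query untouched.
pub fn rewrite_with_pinned_addr(url: &str, pinned: SocketAddr) -> Option<String> {
    let (scheme, _authority, rest) = split_url(url)?;
    // `SocketAddr`'s Display impl brackets IPv6 literals, e.g. `[2001:db8::1]:443`.
    Some(format!("{scheme}://{pinned}{rest}"))
}

/// Returns the original authority (without userinfo) for the Host header.
pub fn host_header(url: &str) -> Option<String> {
    let (_, authority, _) = split_url(url)?;
    let host = authority.rsplit_once('@').map_or(authority, |(_, h)| h);
    Some(host.to_string())
}
```

The request is then sent to the rewritten URI with the `Host:` header set from the original URL, so the HTTP client connects to the address that was validated rather than resolving the hostname a second time.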

Notable Implementation Details

Deterministic fault injection: The FaultInjectingTransport uses an atomic counter fed through xorshift64 so the same benchmark variant gets the same sequence of fault decisions across runs, enabling stable criterion statistics
Honest caveats: The benchmark documentation explicitly states this is in-process fault injection (synthetic Timeout before the wrapped transport is called), not real packet loss, so readers don't over-interpret the numbers
Retry composition: Coordinators retry 3 times per hop; the bench harness retries 8 times at entry level so published error rates have effectively-zero unrecov

https://claude.ai/code/session_01MYSB9bmv8QVSyrfRzPsnMN

…ADR, SSRF IP pinning

Docs
- dogfooding.md / testing.md: fix 65→68 bugs and 12→13 passes drift against dogfooding-bugs.md
- testing.md / cicd.md / ADR 0006 / README: soften "mandatory quality gate" language for cargo-mutants to match the current on-demand workflow (workflow_dispatch only; nightly schedule and PR-gate triggers are commented out). The workflow still fails on surviving mutants when it runs.
- docs/adr/0008-agent-executor-trait-shape.md: new ADR explaining why AgentExecutor returns Pin<Box<dyn Future + Send>> instead of async fn, including the four alternatives considered (AFIT, async-trait, trait_variant, hand-rolled erasure wrapper) and a revisit trigger. Summarized in book/src/reference/adrs.md.

Examples
- examples/rig-agent: make it unambiguous that this is an *integration template*, not a working rig agent. Top-of-README callout, banner in main.rs stdout, expanded module docstring, and an updated package description + examples/README.md row. The run_rig_completion() body is still the zero-dependency mock; instructions for adding rig-core and pasting in a provider snippet are preserved and clarified.

Security: SSRF DNS-rebinding hardening (crates/a2a-server/src/push/sender.rs)
- validate_webhook_url_with_dns now returns the specific SocketAddr it validated (or None for IP literals) instead of ().
- HttpPushSender::send rewrites the outgoing URI to connect directly to the literal validated IP, with the original hostname preserved via an explicit Host header. This closes the TOCTOU window between validation and hyper's own DNS resolver that a rebinding attacker could otherwise exploit: the HTTP client sees an IP literal and never re-enters DNS resolution.
- New helpers: rewrite_uri_with_pinned_addr (handles IPv4/IPv6 bracketing and path/query preservation) and host_header_from_url.
- 6 new unit tests covering URI rewriting and Host header extraction.
- dogfooding-bugs.md Bug H6 entry updated with the hardening note.

Verification
- cargo fmt --all -- --check (clean)
- cargo clippy --workspace --all-targets -- -D warnings (clean)
- cargo test -p a2a-protocol-server --lib: 545 passed, 0 failed (push::sender module: 43 tests, including 6 new ones)
- cargo check -p rig-a2a-agent (clean)

https://claude.ai/code/session_01MYSB9bmv8QVSyrfRzPsnMN

Addresses the single sharpest critique in the session feedback: "all 267 benchmarks are transport-level... wrong shape of benchmarks entirely". One benchmark does not retroactively make the other 13 suites agent-level, but this closes the most obvious gap: it is the first benchmark on the page that does not measure SDK-layer overhead.

New infrastructure (benches/src/)
- fault_transport.rs: FaultInjectingTransport<T> wraps any Transport with per-call latency injection and deterministic xorshift-seeded error-rate injection. Implements the Transport trait so it is a drop-in replacement usable via ClientBuilder::with_custom_transport. Deterministic so that criterion's statistical estimates are stable across runs.
- coordinator.rs: ChainHopExecutor forwards to a next-hop A2aClient with configurable local retry budget; ChainLeafExecutor is the minimal Working → Completed terminal.

New benchmark (benches/benches/coordinator_chain_under_fault.rs)
- Builds a 5-hop in-process coordinator chain (entry → 4 coordinators → leaf). Every link, including the entry link, routes through its own FaultInjectingTransport instance so per-hop faults compound end-to-end.
- Group 1 `coordinator_chain_5hop/latency_injection`: varies per-hop latency {0, 1000, 5000, 20000} µs, zero error rate. Verified end-to-end locally via `--quick`: 2.15 ms baseline → 17.56 ms → 35.81 ms → 113.07 ms. Linear scaling confirms each hop actually blocks on its downstream completion (not just task creation).
- Group 2 `coordinator_chain_5hop/error_injection`: varies per-hop error rate {0, 1, 2, 5} %, zero added latency, 3 retries per hop plus 8 outer retries at the bench harness. Measures successful-path latency including retry cost. Verified locally: 1.88 ms → 2.03 ms → 2.06 ms → 2.11 ms (gentle slope: hop-local retries absorb most transient faults at low rates).
Honest caveats (documented both in the bench module rustdoc and in the new "Agent-Level Latency Under Fault" section of the generator script that emits book/src/reference/benchmarks.md)
- In-process synthetic ClientError::Timeout, not real packet loss. Does not exercise TCP congestion control, DNS, or head-of-line blocking.
- Sequential delegation only. Critic loops, parallel fan-out with deadline propagation, and plan-and-execute with replanning are explicitly out of scope for this benchmark.
- One benchmark does not retroactively make the other 13 suites agent-level. Additive, not a substitute for a real agent-capability suite.

Documentation
- benches/scripts/generate_book_page.sh: new "Agent-Level Latency Under Fault" section with topology diagram, caveats, and two emit_table calls for the new criterion groups. "What we benchmark / do NOT benchmark" footer updated to acknowledge the one agent-level bench while reinforcing that real network faults, real multi-agent topologies, and agent-capability evaluation remain out of scope.
- book/src/deployment/cicd.md: updated 13 suites / 267 benchmarks to 14 / 275 with a link to the agent-level caveats section.
- README.md: updated the `cargo bench` comment with the new counts and a pointer to the caveats page.

Verification
- cargo fmt --all -- --check (clean)
- cargo clippy --workspace --all-targets -- -D warnings (clean)
- cargo test --workspace (all green, no regressions)
- cargo bench -p a2a-benchmarks --bench coordinator_chain_under_fault -- --quick (all 8 variants produce clean numbers; latency-injection group scales linearly with per-hop latency as expected; error-injection group shows gentle retry-cost slope with zero unrecoverable failures)

Not committed
- book/src/reference/benchmarks.md itself is auto-generated by benches/scripts/generate_book_page.sh and is left for CI to regenerate on main with full numbers from the full bench run. Committing the --quick output would blank the other sections.
https://claude.ai/code/session_01MYSB9bmv8QVSyrfRzPsnMN
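
The retry composition in the error-injection group (3 retries per hop, 5 links) can be put into rough numbers. The sketch below is a back-of-the-envelope model, not something measured by the benchmark. It assumes that "3 retries" means up to 4 attempts per link and that faults are independent across attempts and links; both function names are hypothetical.

```rust
/// Probability that one link exhausts its local retry budget, assuming each
/// attempt fails independently with probability `p` and the link makes
/// `retries + 1` attempts in total.
fn link_exhaustion_prob(p: f64, retries: u32) -> f64 {
    p.powi(retries as i32 + 1)
}

/// Probability that at least one of `links` independent links exhausts its
/// budget, so that the failure reaches the outer (harness-level) retry loop.
fn chain_failure_prob(p: f64, retries: u32, links: u32) -> f64 {
    1.0 - (1.0 - link_exhaustion_prob(p, retries)).powi(links as i32)
}
```

Under these assumptions, at a 5% per-link error rate a single link exhausts its budget with probability 0.05^4 = 6.25e-6, and a 5-link chain surfaces a failure to the outer loop with probability of about 3.1e-5. That is in line with the commit's observation that hop-local retries absorb most transient faults at low rates.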

The `Documentation` CI check failed on PR #83 because the ChainHopExecutor rustdoc linked to `ChainHopExecutor::max_retries`, which is a private field. With `-D rustdoc::private-intra-doc-links` implied by `-D warnings`, that is a hard error. Repoint the link at the public setter `with_max_retries` instead (same concept, public API).

Verified locally with:

    RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps

https://claude.ai/code/session_01MYSB9bmv8QVSyrfRzPsnMN
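
For readers unfamiliar with this lint, the shape of the fix looks roughly like the sketch below. The struct body, the field type, the default of 3, and the `retry_budget` getter are assumptions made for illustration; only the names `ChainHopExecutor`, `max_retries`, and `with_max_retries` come from the commit message.

```rust
/// Forwards each message to the next hop, retrying transient failures.
///
/// The retry budget is configured with
/// [`with_max_retries`](Self::with_max_retries). Linking to the private
/// `max_retries` field from this public doc comment would instead trigger
/// the `rustdoc::private_intra_doc_links` lint.
pub struct ChainHopExecutor {
    // Private field: public docs must not link to it directly.
    max_retries: u32,
}

impl ChainHopExecutor {
    /// Creates an executor with an assumed default budget of 3 retries.
    pub fn new() -> Self {
        Self { max_retries: 3 }
    }

    /// Public builder-style setter; this is the valid link target.
    pub fn with_max_retries(mut self, max_retries: u32) -> Self {
        self.max_retries = max_retries;
        self
    }

    /// Hypothetical getter, added only so the example can be checked.
    pub fn retry_budget(&self) -> u32 {
        self.max_retries
    }
}
```

The lint is warn-by-default, so it only becomes a build failure when warnings are denied, as in the `RUSTDOCFLAGS="-D warnings"` invocation quoted above.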

claude added 3 commits April 15, 2026 17:11

tomtom215 merged commit 7e05a96 into main Apr 15, 2026
40 checks passed

tomtom215 merged 3 commits into main from claude/apply-session-feedback-PK7Uj
