
Add advanced_scenarios benchmark suite with tenant resolver and artifact accumulation tests#74

Merged
tomtom215 merged 5 commits into main from claude/analyze-benchmark-results-Ok0Qs on Apr 1, 2026


Conversation

@tomtom215 (Owner)

Summary

This PR introduces a comprehensive advanced_scenarios benchmark suite that exercises previously unbenchmarked SDK capabilities, along with critical bug fixes to the MultiEventExecutor and performance optimizations to the in-memory task store.

Key Changes

New Benchmark Suite: advanced_scenarios

  • Tenant resolver overhead — Measures per-request extraction cost for HeaderTenantResolver, BearerTokenTenantResolver (with and without mapper), and PathSegmentTenantResolver
  • Agent card hot-reload — Benchmarks HotReloadAgentCardHandler read performance, atomic swap cost, and complex card updates (100 skills)
  • Agent card discovery endpoint — Measures /.well-known/agent.json HTTP fetch latency
  • Subscribe fan-out — Tests concurrent subscriber behavior (1–10 subscribers) during reconnection bursts
  • Streaming artifact accumulation — Isolates the task.clone() cost as artifacts accumulate (0–500 depth), demonstrating the 90µs/event bottleneck at 501+ events
  • Pagination full walk — Multi-page cursor traversal of 100–1K tasks with unfiltered and context-filtered variants
  • Extended agent card round-trip — Tests extended card endpoint latency

Bug Fixes

  • MultiEventExecutor invalid state transitions — Fixed emission of Working → Working status events in a loop, which violated the A2A spec state machine. Now correctly emits Working once, then N artifact events, then Completed
  • InMemoryTaskStore::insert() optimization — Added fast path for context-unchanged updates: skips all index operations (BTreeSet inserts, string clones) when updating an existing task with the same context_id, reducing update cost from ~2.5µs to ~700ns
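The fast-path idea behind the `InMemoryTaskStore::insert()` fix can be sketched as follows. This is a minimal model, not the SDK's actual code: the real store's fields, index types, and `Task` shape are not shown in this PR, so everything below is an assumption used for illustration.

```rust
use std::collections::{BTreeMap, HashMap};

// Hypothetical simplified task type, loosely mirroring the PR description.
#[derive(Clone, PartialEq, Debug)]
struct Task {
    id: String,
    context_id: String,
    payload: String,
}

#[derive(Default)]
struct InMemoryTaskStore {
    // Primary storage: task id -> task.
    tasks: HashMap<String, Task>,
    // Secondary index: context_id -> task ids (stand-in for the real indexes).
    by_context: BTreeMap<String, Vec<String>>,
}

impl InMemoryTaskStore {
    fn insert(&mut self, task: Task) {
        if let Some(existing) = self.tasks.get(&task.id) {
            if existing.context_id == task.context_id {
                // Fast path: context unchanged, so index entries are already
                // correct -- only replace the primary map entry, skipping all
                // index maintenance and the string clones it would require.
                self.tasks.insert(task.id.clone(), task);
                return;
            }
            // Context changed: drop the stale index entry before re-indexing.
            if let Some(ids) = self.by_context.get_mut(&existing.context_id) {
                ids.retain(|id| id != &task.id);
            }
        }
        // Slow path: new task or changed context -- update the index too.
        self.by_context
            .entry(task.context_id.clone())
            .or_default()
            .push(task.id.clone());
        self.tasks.insert(task.id.clone(), task);
    }
}
```

The key correctness point matches the PR's note: the shortcut fires only when `context_id` is unchanged; a context change still pays for full index maintenance.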

Infrastructure Improvements

  • Added NoopPushSender for benchmarks requiring push notification support without actual HTTP delivery
  • Added start_jsonrpc_server_with_push() helper for benchmark servers with push capabilities
  • Increased measurement time to 8–10 seconds for multi-threaded benchmarks to reduce CI scheduler variance
  • Updated backpressure.rs event counts to match corrected MultiEventExecutor behavior (3 → 502 events instead of 3 → 1001)
  • Updated benchmark documentation and CI workflow to include the new suite

Notable Implementation Details

  • The artifact accumulation benchmark uses pre-populated tasks with 0, 10, 50, 100, and 500 artifacts to demonstrate how task.clone() cost scales with accumulated state
  • The pagination walk benchmark measures both unfiltered and context-filtered traversals to capture index lookup overhead
  • All multi-threaded benchmarks now use explicit measurement_time() configuration to ensure stable results across CI environments
  • The InMemoryTaskStore optimization maintains correctness by only skipping index operations when context_id is unchanged; context changes still trigger full index updates
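The corrected MultiEventExecutor emission order (Working once, then N artifacts, then Completed — the N+2 formula referenced below in the commit messages) can be sketched as a pure function. The event enum here is illustrative, not the SDK's actual event types:

```rust
// Illustrative event model for the corrected state machine.
#[derive(Debug, PartialEq)]
enum Event {
    Working,
    Artifact(usize),
    Completed,
}

// Corrected behavior: exactly one Working status, N artifact events,
// then Completed -- N + 2 events total. The buggy version emitted
// Working before every artifact (2N + 1 events), producing invalid
// Working -> Working transitions under the A2A spec state machine.
fn emit_events(artifact_count: usize) -> Vec<Event> {
    let mut events = Vec::with_capacity(artifact_count + 2);
    events.push(Event::Working);
    for i in 0..artifact_count {
        events.push(Event::Artifact(i));
    }
    events.push(Event::Completed);
    events
}
```

With N = 500 this yields 502 events, matching the corrected backpressure.rs counts (502 instead of the invalid 1001).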

https://claude.ai/code/session_015mMFNY6W6zUUZmVcpMDGrR

claude added 5 commits on April 1, 2026
- Fix BLOCKING production_scenarios panic: MultiEventExecutor was emitting
  invalid Working → Working state transitions. Restructured to emit a single
  Working status, then N artifact events, then Completed — conforming to the
  A2A spec state machine where Working can only transition to terminal or
  interrupted states.

- Fix push_config benchmark: added NoopPushSender and server helper with
  push notification support enabled, resolving PushNotificationNotSupported
  errors during benchmark runs.

- Add measurement_time to 23+ benchmark groups across 8 files to eliminate
  criterion warmup warnings on CI runners. Groups classified by per-iteration
  cost: transport round-trips (8s), multi-turn/concurrent (10s), slow/wall-clock
  bound (15s). Sub-µs store/serde groups left at default 5s.

- Optimize InMemoryTaskStore::insert() for update path: when updating an
  existing task with unchanged context_id (the common case), skip all BTreeSet
  and context index operations — only update the primary HashMap entry. This
  eliminates the [1.5µs, 4.2µs] variance from occasional BTreeSet node splits
  and reduces update cost from ~2.5µs to ~700ns.

- Document get() 100K cache anomaly: the ~42% faster lookup at 100K vs 1K/10K
  is a CPU cache warming artifact from the large populate_store() setup, not a
  genuine HashMap performance difference. The 1K/10K number (~450ns) is the
  representative O(1) lookup time.

- Update MultiEventExecutor event count comments in backpressure.rs to reflect
  the corrected N+2 formula (Working + N artifacts + Completed) instead of the
  invalid 2N+1 (N Working + N artifacts + Completed).

https://claude.ai/code/session_015mMFNY6W6zUUZmVcpMDGrR
New benchmark file exercising 7 previously-unbenchmarked SDK capabilities:

1. **Tenant resolver overhead** (5 benchmarks): HeaderTenantResolver,
   BearerTokenTenantResolver (with and without mapper), PathSegmentTenantResolver,
   and missing-header fast-rejection path. Uses CallContext with realistic
   HTTP headers to measure per-request extraction cost.

2. **Agent card hot-reload** (3 benchmarks): Steady-state read of current card
   via RwLock, atomic swap-and-read cycle, and complex card swap (100 skills)
   measuring the cost of production-scale card replacement.

3. **Agent card discovery endpoint** (1 benchmark): /.well-known/agent.json
   HTTP round-trip latency via raw hyper client.

4. **Subscribe fan-out** (3 benchmarks): 1/5/10 concurrent subscribers
   reconnecting to the same task simultaneously — simulates mobile/web
   client reconnection bursts at scale.

5. **Streaming artifact accumulation** (10 benchmarks): Isolates the per-event
   task.clone() cost at 0/10/50/100/500 accumulated artifacts — the root cause
   of the 90µs/event bottleneck at 501+ events. Also measures full store_save
   at each depth to capture clone + index + HashMap insert.

6. **Pagination full walk** (4 benchmarks): Complete cursor-based traversal of
   100 and 1K task stores at page_size=25/50, both unfiltered and with
   context_id filter. Measures cumulative multi-page latency.

7. **Extended agent card round-trip** (1 benchmark): Full client→server→client
   round-trip for get_extended_agent_card with capability flag enabled.

All 24 benchmark test cases pass smoke testing.
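The pagination full-walk pattern in item 6 can be modeled as a cursor loop over a task list. This is a simplified offset-based sketch; the SDK's actual cursor type and list API are not shown in this PR, so the function below is an assumption for illustration only:

```rust
// Simplified cursor-based full walk: fetch every page until the data is
// exhausted, returning (pages_fetched, items_seen). The benchmark's
// cumulative multi-page latency corresponds to the total cost of this loop.
fn walk_pages(task_ids: &[u32], page_size: usize) -> (usize, usize) {
    let mut cursor = 0usize; // offset acting as a stand-in for an opaque cursor
    let mut pages = 0;
    let mut items = 0;
    loop {
        let end = (cursor + page_size).min(task_ids.len());
        let page = &task_ids[cursor..end];
        if page.is_empty() {
            break; // empty page means the cursor is exhausted
        }
        pages += 1;
        items += page.len();
        cursor = end;
    }
    (pages, items)
}
```

For the benchmarked configurations this gives 4 pages for 100 tasks at page_size=25 and 20 pages for 1K tasks at page_size=50, so per-walk latency scales with page count, not just store size.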

https://claude.ai/code/session_015mMFNY6W6zUUZmVcpMDGrR
…nchmarks

Updates all documentation and CI infrastructure to include the new
advanced_scenarios benchmark suite (13th benchmark module):

CI & Scripts:
- .github/workflows/benchmarks.yml: add advanced_scenarios step
- benches/scripts/run_benchmarks.sh: add to BENCHMARKS array
- benches/scripts/generate_book_page.sh: add Advanced Scenarios section
  with emit_table "advanced_" call

Documentation:
- benches/README.md: add table row, architecture entry, update CI count
  (12→13), update executor listing (add NoopPushSender), add enterprise/
  production/advanced dimensions to "Why it matters" table, correct
  backpressure event counts (1001→502)
- CONTRIBUTING.md: add enterprise_scenarios, production_scenarios, and
  advanced_scenarios to benchmark table
- CHANGELOG.md: document advanced_scenarios addition, MultiEventExecutor
  fix, save() variance optimization, measurement_time fixes, NoopPushSender,
  and push-enabled server helper
- book/src/reference/changelog.md: add advanced_scenarios, save() insert
  optimization, MultiEventExecutor fix, measurement_time fixes, update
  backpressure event counts

https://claude.ai/code/session_015mMFNY6W6zUUZmVcpMDGrR
- cargo fmt: fix import ordering and closure formatting in
  advanced_scenarios.rs and backpressure.rs
- clippy: add backticks to doc comment references (context_id,
  BTreeSet, HashMap) in InMemoryTaskStore::insert()
- cargo doc: fix unresolved doc link to NoopPushSender in server.rs
  by using fully qualified crate::executor::NoopPushSender path

https://claude.ai/code/session_015mMFNY6W6zUUZmVcpMDGrR
- Use `next_back()` instead of `last()` on DoubleEndedIterator
  (split('.') returns a reversible iterator)
- Use `usize::div_ceil()` instead of manual `(n + d - 1) / d`
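Both clippy suggestions fit in a few lines. The exact call sites in the PR are not visible here, so these are generic illustrations of the two idioms:

```rust
// `split('.')` returns a DoubleEndedIterator, so clippy prefers
// `next_back()` (steps once from the end) over `last()` (walks the
// whole iterator from the front).
fn last_segment(name: &str) -> Option<&str> {
    name.split('.').next_back()
}

// `usize::div_ceil()` (stable since Rust 1.73) replaces the manual
// `(n + d - 1) / d` ceiling-division idiom.
fn page_count(total: usize, page_size: usize) -> usize {
    total.div_ceil(page_size)
}
```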

https://claude.ai/code/session_015mMFNY6W6zUUZmVcpMDGrR
@tomtom215 tomtom215 merged commit b82cd0d into main Apr 1, 2026
40 checks passed
@tomtom215 tomtom215 deleted the claude/analyze-benchmark-results-Ok0Qs branch April 6, 2026 14:45
