feat: add Anthropic Messages API endpoint #63
Open
acere wants to merge 14 commits into awslabs:main from
Conversation
Refactor the `Endpoint` base class to provide a structured invoke lifecycle via `__init_subclass__` wrapping:

- `prepare_payload(payload, **kwargs)` → merge kwargs, inject provider fields
- `invoke(payload)` → API call + `parse_response()` (abstract)
- `parse_response(raw_response, start_t)` → extract text/tokens/metadata (abstract)

The base class wrapper automatically provides:

- Error handling: exceptions → error `InvocationResponse` with partial data
- Timing: `time_to_last_token` back-fill for non-streaming endpoints
- Metadata: `input_payload`, `input_prompt`, `id` always populated
- `_parse_payload` for input prompt extraction (token-counting fallback)

Additional improvements:

- Add a `num_tokens_input_cached` field for prompt caching (Bedrock + OpenAI)
- Extract the AWS RequestId as the response ID for Bedrock and SageMaker
- Extract RetryAttempts for SageMaker (Bedrock already had this)
- Preserve partial data on streaming errors instead of discarding it
- Define `BEDROCK_STREAM_ERROR_TYPES` as a shared constant
- Skip unknown stream events gracefully (forward-compatible)
- Remove redundant try/except from all `_parse_response` methods
- Remove uuid4/error-handling boilerplate from all endpoint subclasses
- Update docs: metrics table, key concepts, custom endpoint guide

Closes awslabs#60
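The lifecycle above can be sketched as a minimal base class. This is an illustrative assumption of how the `__init_subclass__` wrapping might work, not the library's actual implementation: responses are plain dicts here rather than real `InvocationResponse` objects.

```python
import time
from abc import ABC, abstractmethod


class Endpoint(ABC):
    """Minimal sketch of a structured invoke lifecycle (dict responses)."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        inner = cls.invoke
        if getattr(inner, "_llmeter_wrapped", False):
            return  # invoke inherited from an already-wrapped parent class

        def wrapped(self, payload, **kw):
            start_t = time.perf_counter()
            try:
                response = inner(self, payload, **kw)
            except Exception as exc:
                # Exceptions become an error response; partial data preserved.
                return {"error": repr(exc), "input_payload": payload}
            # Back-fill time_to_last_token for non-streaming endpoints.
            response.setdefault("time_to_last_token", time.perf_counter() - start_t)
            response["input_payload"] = payload
            return response

        wrapped._llmeter_wrapped = True
        cls.invoke = wrapped

    def prepare_payload(self, payload, **kwargs):
        # Merge caller kwargs into the payload before sending.
        return {**payload, **kwargs}

    @abstractmethod
    def invoke(self, payload): ...

    @abstractmethod
    def parse_response(self, raw_response, start_t): ...
```

Every concrete subclass then gets error handling and timing back-fill for free, without repeating try/except in each `invoke` body.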
- Add `num_tokens_input_cached` to `Result.stats` aggregation metrics and `total_cached_input_tokens` to run-level stats
- Add an integration test for ConverseStream prompt caching with a unique-per-run prefix to avoid stale cache hits
- Add 6 unit tests verifying that mid-stream errors (TimeoutError, ConnectionError) are caught by the invoke wrapper for BedrockConverseStream, BedrockInvokeStream, and OpenAICompletionStreamEndpoint
- Add a demo notebook comparing TTFT with/without prompt caching, using a CacheBuster callback to guarantee cache misses
- Sort imports across the codebase (`ruff --select I`)
- Update the metrics documentation with the new stats fields
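The CacheBuster idea is simple: make every prompt unique so a provider-side prompt cache can never hit. A minimal sketch, assuming a callback with a `before_invoke` hook (the hook name and payload shape are assumptions, not the library's real callback API):

```python
import uuid


class CacheBuster:
    """Prepend a per-request unique prefix so prompt caches always miss."""

    def before_invoke(self, payload: dict) -> dict:
        # A fresh UUID per request guarantees a distinct prompt prefix,
        # forcing a cache miss even for otherwise identical prompts.
        payload = dict(payload)  # don't mutate the caller's dict
        payload["prompt"] = f"[{uuid.uuid4()}] {payload.get('prompt', '')}"
        return payload
```

The inverse experiment (measuring cache *hits*) instead uses a shared, stable prefix across requests, which is why the commit adds a unique-per-run prefix: stale hits from a previous run would otherwise contaminate the measurement.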
- Add a `low_memory` parameter to `Runner`/`run()` that writes responses to disk without keeping them in memory, for large-scale test runs.
- Introduce a `RunningStats` class that accumulates metrics incrementally (counts, sums, sorted values for percentile computation).
- Replace the `_builtin_stats` cached_property on `Result` with `_preloaded_stats`, populated by `RunningStats` during the run or from stats.json on load.
- Add a `snapshot()` method on `RunningStats` for live progress-bar display of p50/p90 TTFT, p50/p90 TTLT, median tokens/s, total tokens, and failure count, configurable via the `progress_bar_stats` parameter.
- Add a `_compute_stats()` classmethod on `Result` as a fallback for manually constructed `Result` objects and post-`load_responses()` recomputation.
- Update tests for the new stats flow.
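The incremental-accumulation idea can be sketched as follows. This is a simplified assumption of the design (one tracked latency metric, dict snapshots), not the real `RunningStats` class: counts and sums are O(1) per response, and percentile queries stay cheap by keeping observed values sorted on insertion.

```python
import bisect


class RunningStats:
    """Sketch: accumulate metrics incrementally instead of storing responses."""

    def __init__(self):
        self.count = 0
        self.failures = 0
        self.total_output_tokens = 0
        self._ttft_sorted = []  # kept sorted for percentile lookups

    def update(self, ttft, tokens_out, error=False):
        self.count += 1
        if error:
            self.failures += 1
            return
        self.total_output_tokens += tokens_out
        if ttft is not None:
            bisect.insort(self._ttft_sorted, ttft)  # O(n) insert, sorted order

    def percentile(self, p):
        if not self._ttft_sorted:
            return None  # placeholder before the first response arrives
        idx = min(int(p / 100 * len(self._ttft_sorted)), len(self._ttft_sorted) - 1)
        return self._ttft_sorted[idx]

    def snapshot(self):
        # Live view for a progress bar: cheap to call on every update.
        return {
            "ttft-p50": self.percentile(50),
            "ttft-p90": self.percentile(90),
            "total_output_tokens": self.total_output_tokens,
            "failures": self.failures,
        }
```

Because nothing here retains the responses themselves, the same accumulator supports the `low_memory` mode, where each response is written to disk and then dropped.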
Add a `run_duration` parameter for time-bound test runs:

- New `run_duration` on `Runner`/`run()` and `LoadTest`: clients send requests continuously for a fixed duration instead of a fixed count.
- Dedicated `_invoke_for_duration` / `_invoke_duration_c` methods (separate from the count-bound `_invoke_n` / `_invoke_n_c`).
- Time-based progress bar via a `_tick_time_bar` async task.
- Mutual-exclusivity validation between `n_requests` and `run_duration`.

Add `LiveStatsDisplay` for readable live metrics:

- New `llmeter/live_display.py`: an HTML table in Jupyter (grouped columns for Throughput, TTFT, TTLT, Tokens, Errors), ANSI multi-line in terminals. Updates in place and shows placeholders before the first response.
- Replaces the single-line tqdm postfix with a separate stats row.

Improve throughput metric accuracy:

- `RunningStats.record_send()` tracks send-side timestamps.
- RPM and `output_tps` use the send window (first-to-last request sent) instead of response-side elapsed time, preventing taper-off as clients finish.
- `output_tps` (aggregate tokens/s) added to the default snapshot stats.

Fix StopIteration silently terminating invocation loops:

- Both `_invoke_n_no_wait` and `_invoke_for_duration` now use `while`/`next()` instead of `for ... in cycle()`, to prevent a StopIteration from a streaming endpoint from killing the loop.

Add LoadTest support for the new features:

- `run_duration`, `low_memory`, and `progress_bar_stats` are forwarded to each run.

Add an example notebook and documentation:

- `examples/Time-bound runs with Bedrock OpenAI API.ipynb`: end-to-end demo using a bedrock-mantle endpoint with LoadTest, custom stats, low-memory mode, and comparison charts (RPM, TPS, TTFT, TTLT).
- `docs/user_guide/run_experiments.md`: new sections for time-bound runs, live progress-bar stats, and low-memory mode.

Add tests (51 new, 504 total):

- `test_running_stats.py`: record_send, update, to_stats, snapshot (placeholders, rpm, output_tps, send window, aggregations).
- `test_live_display.py`: `_classify`, `_group_stats`, `_in_notebook`, `LiveStatsDisplay` (disabled, terminal, overwrite, prefix).
- `test_experiments.py`: `LoadTest` with `run_duration`/`low_memory`/`progress_bar_stats` field storage and runner forwarding.
- `test_runner.py`: time-bound validation, `_invoke_for_duration`, full run with duration, output path, multiple clients.
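The StopIteration fix deserves a note: iteration machinery treats `StopIteration` as the end-of-sequence signal, so a stray `StopIteration` escaping from an endpoint inside a loop body can be conflated with exhaustion of the payload iterator and end the run silently. Pulling payloads explicitly with `next()` confines that signal to the one call where it is legitimate. A synchronous sketch (the real methods are async; names and shapes here are illustrative):

```python
from itertools import cycle


def run_requests(payloads, invoke, n):
    """Send n requests, cycling through payloads, without letting a stray
    StopIteration from invoke() masquerade as payload exhaustion."""
    results = []
    it = cycle(payloads)
    sent = 0
    while sent < n:
        payload = next(it)  # the only place StopIteration may legitimately occur
        try:
            results.append(invoke(payload))
        except StopIteration:
            # An endpoint's stray StopIteration becomes an error record,
            # not a silent early exit from the invocation loop.
            results.append({"error": "StopIteration from endpoint"})
        sent += 1
    return results
```

With a `for payload in cycle(payloads):` loop there is no natural place to distinguish the two sources of `StopIteration`; the explicit `while`/`next()` structure makes the distinction mechanical.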
Consolidate live display config (review comment 1):

- Merge `_GROUP_PATTERNS` + `_GROUP_ORDER` into a single `DEFAULT_GROUPS` tuple
- Make `groups` a constructor parameter on `LiveStatsDisplay`

Move display aliases from RunningStats to LiveStatsDisplay (comment 4):

- Remove `RunningStats.snapshot()` and `DEFAULT_SNAPSHOT_STATS`
- Add `rpm`/`output_tps` as regular keys in `RunningStats.to_stats()`
- Add `LiveStatsDisplay.format_stats()` owning alias mapping + formatting
- New `DEFAULT_DISPLAY_STATS` in live_display.py maps display labels to canonical stat keys (e.g. "time_to_first_token-p50")
- Runner passes raw `to_stats()` output; the display handles the rest

Cache fallback stats computation (comment 2):

- The `Result.stats` property caches `_compute_stats` back to `_preloaded_stats`

Preserve contributed stats on load (comment 3):

- `Result.load(load_responses=True)` merges extra keys from stats.json so callback-contributed stats survive save/load round-trips

Make Result fields optional (comment 5):

- `total_requests`, `clients`, and `n_requests` are now optional to match `_RunConfig`

Accept timedelta for run_duration (comment 6):

- `run_duration` accepts `int | float | timedelta`; normalized in `__post_init__`

Remove _n_requests indirection (comment 7):

- Eliminated the private `_n_requests`; `n_requests` is set directly to the resolved value

Consolidate invoke methods (comment 8):

- Merged 6 methods into 3: `_invoke_n_no_wait` (n + duration), `_invoke_client` (replaces `_invoke_n`/`_invoke_duration`), `_invoke_clients` (replaces `_invoke_n_c`/`_invoke_duration_c`)

Tests:

- Add `TestContributedStatsRoundTrip` (8 tests) for save/load round-trips
- Add `TestSendWindowStats` for rpm/output_tps in `to_stats()`
- Add `TestFormatStat` for display formatting
- Update all tests for the renamed methods and new APIs
Add a `request_time` (UTC datetime) field to `InvocationResponse` that records the wall-clock time at which each request was sent. The invoke wrapper sets it automatically, on both success and error paths, so no endpoint subclass changes are needed. This enables time-series analysis of latency data and is required by the send-window throughput metrics in `RunningStats`.
Use the `request_time` stamps captured on `InvocationResponse` to drive rate statistics instead of the old `record_send()` / perf_counter approach.

RunningStats changes:

- Remove `record_send()`; send timestamps are now derived from `request_time` on each response via `update()`
- `to_stats()` takes `end_time` (datetime) instead of `total_requests` / `total_test_time`; RPM uses the request send window, output rates use [first_request, end_time]
- `_send_window()` computes elapsed seconds from datetime objects

Result changes:

- Add `first_request_time` / `last_request_time` fields, populated from `RunningStats` at the end of the run
- Datetime serialization updated for the new fields

Runner changes:

- Remove the `record_send()` call from the invoke loop
- Pass `end_time` to `to_stats()` and the first/last request times to `Result`
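The two windows described above can be made concrete with a small helper. This is a sketch of the arithmetic as the commit describes it (function name and signature are assumptions): RPM is computed over the span from first to last request *sent*, while aggregate output tokens/s runs from the first send to the supplied `end_time`, so neither metric tapers off as the last few clients finish.

```python
from datetime import datetime, timedelta, timezone


def send_window_stats(request_times, total_output_tokens, end_time):
    """Compute send-window rates from per-response request_time stamps."""
    first, last = min(request_times), max(request_times)
    send_window = (last - first).total_seconds()   # first-to-last request sent
    run_window = (end_time - first).total_seconds()  # first send to end of run
    return {
        "requests_per_minute": 60 * len(request_times) / send_window
        if send_window else None,
        "output_tps": total_output_tokens / run_window if run_window else None,
    }
```

A response-side window (first response to last response) would shrink the denominator as clients drain, inflating late-run throughput; anchoring on send times avoids that bias.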
- Fix `DEFAULT_DISPLAY_STATS` keys: "rpm" → "requests_per_minute", "num_tokens_input-sum" → "total_input_tokens", "num_tokens_output-sum" → "total_output_tokens"
- Fix `_format_stat`: match "_per_minute" instead of "rpm"; remove the whole-number float-to-int coercion (100.0 stays "100.0")
- Update the run_experiments.md doc example
- Update test_experiments.py and test_live_display.py accordingly
Verify that `request_time` is always set on `InvocationResponse`, on both success and error paths, across all Bedrock integration tests:

- test_bedrock_converse: non-streaming, streaming, with-image (3 tests)
- test_bedrock_invoke: non-streaming, with-image, streaming (3 tests)
- test_bedrock_error_handling: invalid model, invalid payload, error structure (3 tests)
…oke decorator

Replace the implicit `__init_subclass__` magic that auto-wrapped every subclass `invoke` method with an explicit `@llmeter_invoke` decorator. The decorator provides the same functionality (prepare_payload, timing, error handling, metadata back-fill) but is now visible at the definition site, making the contract explicit and allowing subclasses to opt out if they need raw control over `invoke`.

- Add the `llmeter_invoke` decorator to llmeter/endpoints/base.py
- Remove `Endpoint.__init_subclass__` entirely
- Apply `@llmeter_invoke` to all 12 concrete invoke methods
- Export `llmeter_invoke` from `llmeter.endpoints`
- Update docs (key_concepts.md, connect_endpoints.md)
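The decorator-based contract can be sketched as follows. This is an illustrative assumption of the shape (dict responses, simplified error payloads), not the library's actual `llmeter_invoke`; the point is that the wrapping is now opt-in and visible where `invoke` is defined.

```python
import functools
import time


def llmeter_invoke(fn):
    """Explicit invoke wrapper: timing back-fill and error capture,
    applied at the definition site instead of via __init_subclass__."""

    @functools.wraps(fn)
    def wrapper(self, payload, **kwargs):
        start_t = time.perf_counter()
        try:
            response = fn(self, payload, **kwargs)
        except Exception as exc:
            # Errors become a response record instead of propagating.
            return {"error": repr(exc), "input_payload": payload}
        # Back-fill total latency for non-streaming endpoints.
        response.setdefault("time_to_last_token", time.perf_counter() - start_t)
        return response

    wrapper._llmeter_invoke = True  # marker so tooling can detect the wrap
    return wrapper


class MyEndpoint:
    @llmeter_invoke
    def invoke(self, payload):
        return {"text": payload["prompt"].upper()}
```

A subclass that needs raw control simply defines `invoke` without the decorator; nothing in the class machinery wraps it behind its back.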
Force-pushed from 7466ae6 to f7e099d
Our dummy ConcreteEndpoint class was not using the new decorator, so it didn't populate request_time properly.
Remove `self._start_t` and `self._last_payload` from the `llmeter_invoke` decorator. Per-call state no longer leaks onto the endpoint instance:

- Each invoke body captures its own local `start_t` via `time.perf_counter()` and passes it to `parse_response` directly
- The decorator uses a local `start_t` for the `time_to_last_token` back-fill
- `input_payload` on the response gets the mutated dict (what was actually sent to the API) for reproducibility
- `_parse_payload` receives a deepcopy snapshot taken before the API call, so prompt extraction is not affected by client-side mutations

Add 18 tests for the decorator covering payload mutation, invocation isolation, timing, error handling, metadata back-fill, and the decorator marker.
Add support for the Anthropic Messages API as a new endpoint type, enabling benchmarking through the direct Anthropic API, Amazon Bedrock (AnthropicBedrock), and Amazon Bedrock Mantle (AnthropicBedrockMantle).

New classes:

- `AnthropicMessages`: non-streaming endpoint
- `AnthropicMessagesStream`: streaming endpoint with TTFT/TTLT

Also includes:

- `anthropic` and `anthropic-bedrock` optional dependencies
- Unit tests (37 tests) and integration tests (3 tests)
- API reference docs and mkdocs nav entry
- A sample notebook comparing Converse vs Messages API TTFT on Bedrock

Closes awslabs#62
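The three-provider selection might look something like the sketch below. The registry-injection parameter is an assumption added here so the dispatch can be exercised without the `anthropic` SDK installed; the default mapping to `anthropic.Anthropic` and `anthropic.AnthropicBedrock` follows the SDK's public client classes, and the `"bedrock-mantle"` variant described in the PR is omitted since its client class depends on SDK version.

```python
def resolve_provider(provider, registry=None):
    """Return the client class for a provider string.

    registry: optional mapping injected for testing; by default the
    classes are loaded lazily from the anthropic SDK (an optional extra).
    """
    if registry is None:
        import anthropic  # optional dependency

        registry = {
            "anthropic": anthropic.Anthropic,        # direct Anthropic API
            "bedrock": anthropic.AnthropicBedrock,   # AWS-auth Bedrock client
        }
    try:
        return registry[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
```

Lazy import keeps the SDK an optional extra: importing the endpoints module never fails just because `anthropic` isn't installed, only constructing an endpoint does.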
Force-pushed from 25b6fbf to b333585
Summary
Add support for the Anthropic Messages API as a new LLMeter endpoint type.
Closes #62
What's included
New endpoint classes (`llmeter/endpoints/anthropic_messages.py`):

- `AnthropicMessages`: non-streaming
- `AnthropicMessagesStream`: streaming with TTFT/TTLT measurement

Supports three providers via a single `provider` argument:

- `"anthropic"`: direct Anthropic API
- `"bedrock"`: `AnthropicBedrock` (InvokeModel-based)
- `"bedrock-mantle"`: `AnthropicBedrockMantle` (Messages API via Mantle)

Dependencies (`pyproject.toml`):

- `anthropic` optional extra (base SDK)
- `anthropic-bedrock` optional extra (`anthropic[bedrock]` for AWS auth)
- Added to the `all` and `test` groups

Tests:
- …(us-east-1, Opus 4.7)

Docs:
Example notebook:

- Uses a `NoCacheCallback` to defeat prompt caching

Testing