feat: add Anthropic Messages API endpoint #63
Open
acere wants to merge 14 commits into awslabs:main from
Conversation
Refactor the `Endpoint` base class to provide a structured invoke lifecycle via `__init_subclass__` wrapping:

- `prepare_payload(payload, **kwargs)` → merge kwargs, inject provider fields
- `invoke(payload)` → API call + `parse_response()` (abstract)
- `parse_response(raw_response, start_t)` → extract text/tokens/metadata (abstract)

The base class wrapper automatically provides:

- Error handling: exceptions → error `InvocationResponse` with partial data
- Timing: `time_to_last_token` back-fill for non-streaming endpoints
- Metadata: `input_payload`, `input_prompt`, `id` always populated
- `_parse_payload` for input prompt extraction (token-counting fallback)

Additional improvements:

- Add a `num_tokens_input_cached` field for prompt caching (Bedrock + OpenAI)
- Extract the AWS RequestId as the response ID for Bedrock and SageMaker
- Extract RetryAttempts for SageMaker (Bedrock already had this)
- Preserve partial data on streaming errors instead of discarding it
- Define `BEDROCK_STREAM_ERROR_TYPES` as a shared constant
- Skip unknown stream events gracefully (forward-compatible)
- Remove redundant try/except from all `_parse_response` methods
- Remove uuid4/error-handling boilerplate from all endpoint subclasses
- Update docs: metrics table, key concepts, custom endpoint guide

Closes awslabs#60
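The lifecycle above can be sketched as a minimal base class. This is an illustrative assumption of how the `__init_subclass__` wrapping might work, not the library's actual implementation: responses are plain dicts here rather than real `InvocationResponse` objects.

```python
import time
from abc import ABC, abstractmethod


class Endpoint(ABC):
    """Minimal sketch of a structured invoke lifecycle (dict responses)."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        inner = cls.invoke
        if getattr(inner, "_llmeter_wrapped", False):
            return  # invoke inherited from an already-wrapped parent class

        def wrapped(self, payload, **kw):
            start_t = time.perf_counter()
            try:
                response = inner(self, payload, **kw)
            except Exception as exc:
                # Exceptions become an error response; partial data preserved.
                return {"error": repr(exc), "input_payload": payload}
            # Back-fill time_to_last_token for non-streaming endpoints.
            response.setdefault("time_to_last_token", time.perf_counter() - start_t)
            response["input_payload"] = payload
            return response

        wrapped._llmeter_wrapped = True
        cls.invoke = wrapped

    def prepare_payload(self, payload, **kwargs):
        # Merge caller kwargs into the payload before sending.
        return {**payload, **kwargs}

    @abstractmethod
    def invoke(self, payload): ...

    @abstractmethod
    def parse_response(self, raw_response, start_t): ...
```

Every concrete subclass then gets error handling and timing back-fill for free, without repeating try/except in each `invoke` body.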
- Add `num_tokens_input_cached` to `Result.stats` aggregation metrics and `total_cached_input_tokens` to run-level stats
- Add an integration test for ConverseStream prompt caching with a unique-per-run prefix to avoid stale cache hits
- Add 6 unit tests verifying that mid-stream errors (TimeoutError, ConnectionError) are caught by the invoke wrapper for BedrockConverseStream, BedrockInvokeStream, and OpenAICompletionStreamEndpoint
- Add a demo notebook comparing TTFT with/without prompt caching, using a CacheBuster callback to guarantee cache misses
- Sort imports across the codebase (`ruff --select I`)
- Update the metrics documentation with the new stats fields
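The CacheBuster idea is simple: make every prompt unique so a provider-side prompt cache can never hit. A minimal sketch, assuming a callback with a `before_invoke` hook (the hook name and payload shape are assumptions, not the library's real callback API):

```python
import uuid


class CacheBuster:
    """Prepend a per-request unique prefix so prompt caches always miss."""

    def before_invoke(self, payload: dict) -> dict:
        # A fresh UUID per request guarantees a distinct prompt prefix,
        # forcing a cache miss even for otherwise identical prompts.
        payload = dict(payload)  # don't mutate the caller's dict
        payload["prompt"] = f"[{uuid.uuid4()}] {payload.get('prompt', '')}"
        return payload
```

The inverse experiment (measuring cache *hits*) instead uses a shared, stable prefix across requests, which is why the commit adds a unique-per-run prefix: stale hits from a previous run would otherwise contaminate the measurement.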
- Add a `low_memory` parameter to `Runner`/`run()` that writes responses to disk without keeping them in memory, for large-scale test runs.
- Introduce a `RunningStats` class that accumulates metrics incrementally (counts, sums, sorted values for percentile computation).
- Replace the `_builtin_stats` cached_property on `Result` with `_preloaded_stats`, populated by `RunningStats` during the run or from stats.json on load.
- Add a `snapshot()` method on `RunningStats` for live progress-bar display of p50/p90 TTFT, p50/p90 TTLT, median tokens/s, total tokens, and failure count, configurable via the `progress_bar_stats` parameter.
- Add a `_compute_stats()` classmethod on `Result` as a fallback for manually constructed `Result` objects and post-`load_responses()` recomputation.
- Update tests for the new stats flow.
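The incremental-accumulation idea can be sketched as follows. This is a simplified assumption of the design (one tracked latency metric, dict snapshots), not the real `RunningStats` class: counts and sums are O(1) per response, and percentile queries stay cheap by keeping observed values sorted on insertion.

```python
import bisect


class RunningStats:
    """Sketch: accumulate metrics incrementally instead of storing responses."""

    def __init__(self):
        self.count = 0
        self.failures = 0
        self.total_output_tokens = 0
        self._ttft_sorted = []  # kept sorted for percentile lookups

    def update(self, ttft, tokens_out, error=False):
        self.count += 1
        if error:
            self.failures += 1
            return
        self.total_output_tokens += tokens_out
        if ttft is not None:
            bisect.insort(self._ttft_sorted, ttft)  # O(n) insert, sorted order

    def percentile(self, p):
        if not self._ttft_sorted:
            return None  # placeholder before the first response arrives
        idx = min(int(p / 100 * len(self._ttft_sorted)), len(self._ttft_sorted) - 1)
        return self._ttft_sorted[idx]

    def snapshot(self):
        # Live view for a progress bar: cheap to call on every update.
        return {
            "ttft-p50": self.percentile(50),
            "ttft-p90": self.percentile(90),
            "total_output_tokens": self.total_output_tokens,
            "failures": self.failures,
        }
```

Because nothing here retains the responses themselves, the same accumulator supports the `low_memory` mode, where each response is written to disk and then dropped.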
Add a `run_duration` parameter for time-bound test runs:

- New `run_duration` on `Runner`/`run()` and `LoadTest`: clients send requests continuously for a fixed duration instead of a fixed count.
- Dedicated `_invoke_for_duration` / `_invoke_duration_c` methods (separate from the count-bound `_invoke_n` / `_invoke_n_c`).
- Time-based progress bar via a `_tick_time_bar` async task.
- Mutual-exclusivity validation between `n_requests` and `run_duration`.

Add `LiveStatsDisplay` for readable live metrics:

- New `llmeter/live_display.py`: an HTML table in Jupyter (grouped columns for Throughput, TTFT, TTLT, Tokens, Errors), ANSI multi-line in terminals. Updates in place and shows placeholders before the first response.
- Replaces the single-line tqdm postfix with a separate stats row.

Improve throughput metric accuracy:

- `RunningStats.record_send()` tracks send-side timestamps.
- RPM and `output_tps` use the send window (first-to-last request sent) instead of response-side elapsed time, preventing taper-off as clients finish.
- `output_tps` (aggregate tokens/s) added to the default snapshot stats.

Fix StopIteration silently terminating invocation loops:

- Both `_invoke_n_no_wait` and `_invoke_for_duration` now use `while`/`next()` instead of `for ... in cycle()`, to prevent a StopIteration from a streaming endpoint from killing the loop.

Add LoadTest support for the new features:

- `run_duration`, `low_memory`, and `progress_bar_stats` are forwarded to each run.

Add an example notebook and documentation:

- `examples/Time-bound runs with Bedrock OpenAI API.ipynb`: end-to-end demo using a bedrock-mantle endpoint with LoadTest, custom stats, low-memory mode, and comparison charts (RPM, TPS, TTFT, TTLT).
- `docs/user_guide/run_experiments.md`: new sections for time-bound runs, live progress-bar stats, and low-memory mode.

Add tests (51 new, 504 total):

- `test_running_stats.py`: record_send, update, to_stats, snapshot (placeholders, rpm, output_tps, send window, aggregations).
- `test_live_display.py`: `_classify`, `_group_stats`, `_in_notebook`, `LiveStatsDisplay` (disabled, terminal, overwrite, prefix).
- `test_experiments.py`: `LoadTest` with `run_duration`/`low_memory`/`progress_bar_stats` field storage and runner forwarding.
- `test_runner.py`: time-bound validation, `_invoke_for_duration`, full run with duration, output path, multiple clients.
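The StopIteration fix deserves a note: iteration machinery treats `StopIteration` as the end-of-sequence signal, so a stray `StopIteration` escaping from an endpoint inside a loop body can be conflated with exhaustion of the payload iterator and end the run silently. Pulling payloads explicitly with `next()` confines that signal to the one call where it is legitimate. A synchronous sketch (the real methods are async; names and shapes here are illustrative):

```python
from itertools import cycle


def run_requests(payloads, invoke, n):
    """Send n requests, cycling through payloads, without letting a stray
    StopIteration from invoke() masquerade as payload exhaustion."""
    results = []
    it = cycle(payloads)
    sent = 0
    while sent < n:
        payload = next(it)  # the only place StopIteration may legitimately occur
        try:
            results.append(invoke(payload))
        except StopIteration:
            # An endpoint's stray StopIteration becomes an error record,
            # not a silent early exit from the invocation loop.
            results.append({"error": "StopIteration from endpoint"})
        sent += 1
    return results
```

With a `for payload in cycle(payloads):` loop there is no natural place to distinguish the two sources of `StopIteration`; the explicit `while`/`next()` structure makes the distinction mechanical.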
Consolidate live display config (review comment 1):

- Merge `_GROUP_PATTERNS` + `_GROUP_ORDER` into a single `DEFAULT_GROUPS` tuple
- Make `groups` a constructor parameter on `LiveStatsDisplay`

Move display aliases from RunningStats to LiveStatsDisplay (comment 4):

- Remove `RunningStats.snapshot()` and `DEFAULT_SNAPSHOT_STATS`
- Add `rpm`/`output_tps` as regular keys in `RunningStats.to_stats()`
- Add `LiveStatsDisplay.format_stats()` owning alias mapping + formatting
- New `DEFAULT_DISPLAY_STATS` in live_display.py maps display labels to canonical stat keys (e.g. "time_to_first_token-p50")
- Runner passes raw `to_stats()` output; the display handles the rest

Cache fallback stats computation (comment 2):

- The `Result.stats` property caches `_compute_stats` back to `_preloaded_stats`

Preserve contributed stats on load (comment 3):

- `Result.load(load_responses=True)` merges extra keys from stats.json so callback-contributed stats survive save/load round-trips

Make Result fields optional (comment 5):

- `total_requests`, `clients`, and `n_requests` are now optional to match `_RunConfig`

Accept timedelta for run_duration (comment 6):

- `run_duration` accepts `int | float | timedelta`; normalized in `__post_init__`

Remove _n_requests indirection (comment 7):

- Eliminated the private `_n_requests`; `n_requests` is set directly to the resolved value

Consolidate invoke methods (comment 8):

- Merged 6 methods into 3: `_invoke_n_no_wait` (n + duration), `_invoke_client` (replaces `_invoke_n`/`_invoke_duration`), `_invoke_clients` (replaces `_invoke_n_c`/`_invoke_duration_c`)

Tests:

- Add `TestContributedStatsRoundTrip` (8 tests) for save/load round-trips
- Add `TestSendWindowStats` for rpm/output_tps in `to_stats()`
- Add `TestFormatStat` for display formatting
- Update all tests for the renamed methods and new APIs
Add a `request_time` (UTC datetime) field to `InvocationResponse` that records the wall-clock time at which each request was sent. The invoke wrapper sets it automatically, on both success and error paths, so no endpoint subclass changes are needed. This enables time-series analysis of latency data and is required by the send-window throughput metrics in `RunningStats`.
Use the `request_time` stamps captured on `InvocationResponse` to drive rate statistics instead of the old `record_send()` / perf_counter approach.

RunningStats changes:

- Remove `record_send()`; send timestamps are now derived from `request_time` on each response via `update()`
- `to_stats()` takes `end_time` (datetime) instead of `total_requests` / `total_test_time`; RPM uses the request send window, output rates use [first_request, end_time]
- `_send_window()` computes elapsed seconds from datetime objects

Result changes:

- Add `first_request_time` / `last_request_time` fields, populated from `RunningStats` at the end of the run
- Datetime serialization updated for the new fields

Runner changes:

- Remove the `record_send()` call from the invoke loop
- Pass `end_time` to `to_stats()` and the first/last request times to `Result`
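The two windows described above can be made concrete with a small helper. This is a sketch of the arithmetic as the commit describes it (function name and signature are assumptions): RPM is computed over the span from first to last request *sent*, while aggregate output tokens/s runs from the first send to the supplied `end_time`, so neither metric tapers off as the last few clients finish.

```python
from datetime import datetime, timedelta, timezone


def send_window_stats(request_times, total_output_tokens, end_time):
    """Compute send-window rates from per-response request_time stamps."""
    first, last = min(request_times), max(request_times)
    send_window = (last - first).total_seconds()   # first-to-last request sent
    run_window = (end_time - first).total_seconds()  # first send to end of run
    return {
        "requests_per_minute": 60 * len(request_times) / send_window
        if send_window else None,
        "output_tps": total_output_tokens / run_window if run_window else None,
    }
```

A response-side window (first response to last response) would shrink the denominator as clients drain, inflating late-run throughput; anchoring on send times avoids that bias.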
- Fix `DEFAULT_DISPLAY_STATS` keys: "rpm" → "requests_per_minute", "num_tokens_input-sum" → "total_input_tokens", "num_tokens_output-sum" → "total_output_tokens"
- Fix `_format_stat`: match "_per_minute" instead of "rpm"; remove the whole-number float-to-int coercion (100.0 stays "100.0")
- Update the run_experiments.md doc example
- Update test_experiments.py and test_live_display.py accordingly
Verify that `request_time` is always set on `InvocationResponse`, on both success and error paths, across all Bedrock integration tests:

- test_bedrock_converse: non-streaming, streaming, with-image (3 tests)
- test_bedrock_invoke: non-streaming, with-image, streaming (3 tests)
- test_bedrock_error_handling: invalid model, invalid payload, error structure (3 tests)
…oke decorator

Replace the implicit `__init_subclass__` magic that auto-wrapped every subclass `invoke` method with an explicit `@llmeter_invoke` decorator. The decorator provides the same functionality (prepare_payload, timing, error handling, metadata back-fill) but is now visible at the definition site, making the contract explicit and allowing subclasses to opt out if they need raw control over `invoke`.

- Add the `llmeter_invoke` decorator to llmeter/endpoints/base.py
- Remove `Endpoint.__init_subclass__` entirely
- Apply `@llmeter_invoke` to all 12 concrete invoke methods
- Export `llmeter_invoke` from `llmeter.endpoints`
- Update docs (key_concepts.md, connect_endpoints.md)
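The decorator-based contract can be sketched as follows. This is an illustrative assumption of the shape (dict responses, simplified error payloads), not the library's actual `llmeter_invoke`; the point is that the wrapping is now opt-in and visible where `invoke` is defined.

```python
import functools
import time


def llmeter_invoke(fn):
    """Explicit invoke wrapper: timing back-fill and error capture,
    applied at the definition site instead of via __init_subclass__."""

    @functools.wraps(fn)
    def wrapper(self, payload, **kwargs):
        start_t = time.perf_counter()
        try:
            response = fn(self, payload, **kwargs)
        except Exception as exc:
            # Errors become a response record instead of propagating.
            return {"error": repr(exc), "input_payload": payload}
        # Back-fill total latency for non-streaming endpoints.
        response.setdefault("time_to_last_token", time.perf_counter() - start_t)
        return response

    wrapper._llmeter_invoke = True  # marker so tooling can detect the wrap
    return wrapper


class MyEndpoint:
    @llmeter_invoke
    def invoke(self, payload):
        return {"text": payload["prompt"].upper()}
```

A subclass that needs raw control simply defines `invoke` without the decorator; nothing in the class machinery wraps it behind its back.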
Force-pushed from 7466ae6 to f7e099d
Our dummy ConcreteEndpoint class was not using the new decorator, so it didn't populate request_time properly.
Remove `self._start_t` and `self._last_payload` from the `llmeter_invoke` decorator. Per-call state no longer leaks onto the endpoint instance:

- Each invoke body captures its own local `start_t` via `time.perf_counter()` and passes it to `parse_response` directly
- The decorator uses a local `start_t` for the `time_to_last_token` back-fill
- `input_payload` on the response gets the mutated dict (what was actually sent to the API) for reproducibility
- `_parse_payload` receives a deepcopy snapshot taken before the API call, so prompt extraction is not affected by client-side mutations

Add 18 tests for the decorator covering payload mutation, invocation isolation, timing, error handling, metadata back-fill, and the decorator marker.
Add support for the Anthropic Messages API as a new endpoint type, enabling benchmarking through the direct Anthropic API, Amazon Bedrock (AnthropicBedrock), and Amazon Bedrock Mantle (AnthropicBedrockMantle).

New classes:

- `AnthropicMessages`: non-streaming endpoint
- `AnthropicMessagesStream`: streaming endpoint with TTFT/TTLT

Also includes:

- `anthropic` and `anthropic-bedrock` optional dependencies
- Unit tests (37 tests) and integration tests (3 tests)
- API reference docs and mkdocs nav entry
- A sample notebook comparing Converse vs Messages API TTFT on Bedrock

Closes awslabs#62
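The three-provider selection might look something like the sketch below. The registry-injection parameter is an assumption added here so the dispatch can be exercised without the `anthropic` SDK installed; the default mapping to `anthropic.Anthropic` and `anthropic.AnthropicBedrock` follows the SDK's public client classes, and the `"bedrock-mantle"` variant described in the PR is omitted since its client class depends on SDK version.

```python
def resolve_provider(provider, registry=None):
    """Return the client class for a provider string.

    registry: optional mapping injected for testing; by default the
    classes are loaded lazily from the anthropic SDK (an optional extra).
    """
    if registry is None:
        import anthropic  # optional dependency

        registry = {
            "anthropic": anthropic.Anthropic,        # direct Anthropic API
            "bedrock": anthropic.AnthropicBedrock,   # AWS-auth Bedrock client
        }
    try:
        return registry[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
```

Lazy import keeps the SDK an optional extra: importing the endpoints module never fails just because `anthropic` isn't installed, only constructing an endpoint does.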
Force-pushed from 25b6fbf to b333585
Summary
Add support for the Anthropic Messages API as a new LLMeter endpoint type.
Closes #62
What's included
New endpoint classes (`llmeter/endpoints/anthropic_messages.py`):

- `AnthropicMessages`: non-streaming
- `AnthropicMessagesStream`: streaming with TTFT/TTLT measurement

Supports three providers via a single `provider` argument:

- `"anthropic"`: direct Anthropic API
- `"bedrock"`: `AnthropicBedrock` (InvokeModel-based)
- `"bedrock-mantle"`: `AnthropicBedrockMantle` (Messages API via Mantle)

Dependencies (`pyproject.toml`):

- `anthropic` optional extra (base SDK)
- `anthropic-bedrock` optional extra (`anthropic[bedrock]` for AWS auth)
- Added to the `all` and `test` groups

Tests:
- …(us-east-1, Opus 4.7)

Docs:
Example notebook:

- Uses a `NoCacheCallback` to defeat prompt caching

Testing