
feat: add Anthropic Messages API endpoint #63

Open
acere wants to merge 14 commits into awslabs:main from acere:feature/anthropic-messages-endpoint

Conversation


@acere (Collaborator) commented on Apr 17, 2026

Summary

Add support for the Anthropic Messages API as a new LLMeter endpoint type.

Closes #62

What's included

New endpoint classes (llmeter/endpoints/anthropic_messages.py):

  • AnthropicMessages — non-streaming
  • AnthropicMessagesStream — streaming with TTFT/TTLT measurement

Supports three providers via a single provider argument:

  • "anthropic" — direct Anthropic API
  • "bedrock" — AnthropicBedrock (InvokeModel-based)
  • "bedrock-mantle" — AnthropicBedrockMantle (Messages API via Mantle)
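The single-argument provider selection described above could be sketched as a small dispatch table. The table and the helper name below are illustrative assumptions, not the actual llmeter implementation:

```python
# Sketch: map a provider string to the SDK client class named in the PR.
# Illustrative only; the real endpoint classes handle construction themselves.
_PROVIDER_CLIENTS = {
    "anthropic": "Anthropic",                    # direct Anthropic API
    "bedrock": "AnthropicBedrock",               # InvokeModel-based
    "bedrock-mantle": "AnthropicBedrockMantle",  # Messages API via Mantle
}

def resolve_client_name(provider: str) -> str:
    """Return the SDK client class name for a provider string."""
    try:
        return _PROVIDER_CLIENTS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
```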

Dependencies (pyproject.toml):

  • anthropic optional extra (base SDK)
  • anthropic-bedrock optional extra (anthropic[bedrock] for AWS auth)
  • Added to all and test groups

Tests:

  • 37 unit tests covering client construction, payload creation/parsing, response parsing, error handling, all three providers
  • 3 integration tests against Bedrock Mantle (us-east-1, Opus 4.7)

Docs:

  • API reference page + mkdocs nav entry

Example notebook:

  • Compares TTFT between Converse API and Anthropic Messages API on Bedrock
  • Uses a NoCacheCallback to defeat prompt caching
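A cache-defeating callback in the spirit of the notebook's NoCacheCallback can be sketched by prefixing each prompt with a unique token so the provider's prompt cache never hits. The function below is a standalone illustration; the actual callback interface and payload shape are assumptions:

```python
import uuid

def bust_cache(payload: dict, prompt_key: str = "prompt") -> dict:
    """Return a copy of payload with a unique prefix on the prompt."""
    busted = dict(payload)
    busted[prompt_key] = f"[{uuid.uuid4().hex}] {busted[prompt_key]}"
    return busted
```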

Testing

# Unit tests
uv run pytest tests/unit/endpoints/test_anthropic_messages.py -v

# Integration tests (requires AWS credentials and Bedrock access)
uv run pytest tests/integ/test_anthropic_messages_bedrock.py -m integ -v

acere added 10 commits April 16, 2026 22:44
Refactor the Endpoint base class to provide a structured invoke lifecycle
via __init_subclass__ wrapping:

- prepare_payload(payload, **kwargs) → merge kwargs, inject provider fields
- invoke(payload) → API call + parse_response() (abstract)
- parse_response(raw_response, start_t) → extract text/tokens/metadata (abstract)

The base class wrapper automatically provides:
- Error handling: exceptions → error InvocationResponse with partial data
- Timing: time_to_last_token back-fill for non-streaming endpoints
- Metadata: input_payload, input_prompt, id always populated
- _parse_payload for input prompt extraction (token counting fallback)
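The lifecycle above can be sketched as a single driver function. Step names follow the commit message; the wrapper details below are a minimal illustration, not the actual base-class code:

```python
import time
import uuid

def lifecycle_invoke(endpoint, payload, **kwargs):
    """Run one invocation through prepare_payload -> invoke, with the
    error handling, timing back-fill, and metadata the wrapper provides."""
    payload = endpoint.prepare_payload(payload, **kwargs)
    start_t = time.perf_counter()
    try:
        response = endpoint.invoke(payload)
    except Exception as exc:
        response = {"error": str(exc)}  # error response with partial data
    # time_to_last_token back-fill for non-streaming endpoints
    response.setdefault("time_to_last_token", time.perf_counter() - start_t)
    # metadata always populated
    response.setdefault("id", str(uuid.uuid4()))
    response["input_payload"] = payload
    return response
```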

Additional improvements:
- Add num_tokens_input_cached field for prompt caching (Bedrock + OpenAI)
- Extract AWS RequestId as response ID for Bedrock and SageMaker
- Extract RetryAttempts for SageMaker (Bedrock already had this)
- Preserve partial data on streaming errors instead of discarding
- Define BEDROCK_STREAM_ERROR_TYPES as shared constant
- Skip unknown stream events gracefully (forward-compatible)
- Remove redundant try/except from all _parse_response methods
- Remove uuid4/error handling boilerplate from all endpoint subclasses
- Update docs: metrics table, key concepts, custom endpoint guide

Closes awslabs#60

- Add num_tokens_input_cached to Result.stats aggregation metrics
  and total_cached_input_tokens to run-level stats
- Add integration test for ConverseStream prompt caching with
  unique-per-run prefix to avoid stale cache hits
- Add 6 unit tests verifying mid-stream errors (TimeoutError,
  ConnectionError) are caught by the invoke wrapper for
  BedrockConverseStream, BedrockInvokeStream, and
  OpenAICompletionStreamEndpoint
- Add demo notebook comparing TTFT with/without prompt caching,
  using a CacheBuster callback to guarantee cache misses
- Sort imports across codebase (ruff --select I)
- Update metrics documentation with new stats fields
- Add `low_memory` parameter to Runner/run() that writes responses to
  disk without keeping them in memory, for large-scale test runs.
- Introduce `RunningStats` class that accumulates metrics incrementally
  (counts, sums, sorted values for percentile computation).
- Replace `_builtin_stats` cached_property on Result with `_preloaded_stats`
  populated by RunningStats during the run or from stats.json on load.
- Add `snapshot()` method on RunningStats for live progress-bar display
  of p50/p90 TTFT, p50/p90 TTLT, median tokens/s, total tokens, and
  failure count — configurable via `progress_bar_stats` parameter.
- Add `_compute_stats()` classmethod on Result as fallback for manually
  constructed Result objects and post-load_responses() recomputation.
- Update tests for the new stats flow.
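The incremental accumulation idea behind RunningStats (counts, sums, sorted values for percentiles) can be illustrated in miniature. This class is a sketch, not llmeter's RunningStats:

```python
import bisect

class IncrementalStats:
    """Accumulate count, sum, and a sorted value list for percentile lookup."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0
        self._sorted: list[float] = []

    def update(self, value: float) -> None:
        self.count += 1
        self.total += value
        bisect.insort(self._sorted, value)  # keep values sorted as they arrive

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile over the values seen so far."""
        if not self._sorted:
            raise ValueError("no values recorded")
        idx = min(int(p / 100 * self.count), self.count - 1)
        return self._sorted[idx]
```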
Add run_duration parameter for time-bound test runs:
- New run_duration on Runner/run() and LoadTest: clients send requests
  continuously for a fixed duration instead of a fixed count.
- Dedicated _invoke_for_duration / _invoke_duration_c methods (separate
  from count-bound _invoke_n / _invoke_n_c).
- Time-based progress bar via _tick_time_bar async task.
- Mutual exclusivity validation between n_requests and run_duration.
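The mutual-exclusivity check described in the last bullet amounts to requiring exactly one of the two arguments. A minimal sketch (illustrative, not the Runner's actual validation code):

```python
def validate_run_args(n_requests=None, run_duration=None) -> None:
    """Require exactly one of n_requests or run_duration to be set."""
    if (n_requests is None) == (run_duration is None):
        raise ValueError("provide exactly one of n_requests or run_duration")
```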

Add LiveStatsDisplay for readable live metrics:
- New llmeter/live_display.py: HTML table in Jupyter (grouped columns
  for Throughput, TTFT, TTLT, Tokens, Errors), ANSI multi-line in
  terminals. Updates in-place, shows placeholders before first response.
- Replaces single-line tqdm postfix with a separate stats row.

Improve throughput metric accuracy:
- RunningStats.record_send() tracks send-side timestamps.
- RPM and output_tps use send window (first-to-last request sent)
  instead of response-side elapsed time, preventing taper-off as
  clients finish.
- output_tps (aggregate tokens/s) added to default snapshot stats.

Fix StopIteration silently terminating invocation loops:
- Both _invoke_n_no_wait and _invoke_for_duration now use while/next()
  instead of for-in-cycle() to prevent StopIteration from streaming
  endpoints from killing the loop.
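The while/next() pattern referred to above drives the payload cycle with an explicit next() call, so the loop controls exactly where a StopIteration ends iteration; one raised inside endpoint code surfaces as a visible error instead of being mistaken for exhaustion. A sketch of the pattern (names are illustrative, not the Runner's actual methods):

```python
from itertools import cycle

def invoke_n(invoke, payloads, n):
    """Invoke n times, cycling payloads via an explicit next() call."""
    payload_iter = cycle(payloads)
    results = []
    while len(results) < n:
        payload = next(payload_iter)  # the only expected StopIteration source
        results.append(invoke(payload))
    return results
```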

Add LoadTest support for new features:
- run_duration, low_memory, progress_bar_stats forwarded to each run.

Add example notebook and documentation:
- examples/Time-bound runs with Bedrock OpenAI API.ipynb: end-to-end
  demo using bedrock-mantle endpoint with LoadTest, custom stats,
  low-memory mode, and comparison charts (RPM, TPS, TTFT, TTLT).
- docs/user_guide/run_experiments.md: new sections for time-bound runs,
  live progress-bar stats, and low-memory mode.

Add tests (51 new, 504 total):
- test_running_stats.py: record_send, update, to_stats, snapshot
  (placeholders, rpm, output_tps, send window, aggregations).
- test_live_display.py: _classify, _group_stats, _in_notebook,
  LiveStatsDisplay (disabled, terminal, overwrite, prefix).
- test_experiments.py: LoadTest with run_duration/low_memory/
  progress_bar_stats field storage and runner forwarding.
- test_runner.py: time-bound validation, _invoke_for_duration,
  full run with duration, output path, multiple clients.

Consolidate live display config (review comment 1):
- Merge _GROUP_PATTERNS + _GROUP_ORDER into single DEFAULT_GROUPS tuple
- Make groups a constructor parameter on LiveStatsDisplay

Move display aliases from RunningStats to LiveStatsDisplay (comment 4):
- Remove RunningStats.snapshot() and DEFAULT_SNAPSHOT_STATS
- Add rpm/output_tps as regular keys in RunningStats.to_stats()
- Add LiveStatsDisplay.format_stats() owning alias mapping + formatting
- New DEFAULT_DISPLAY_STATS in live_display.py maps display labels to
  canonical stat keys (e.g. "time_to_first_token-p50")
- Runner passes raw to_stats() output; display handles the rest

Cache fallback stats computation (comment 2):
- Result.stats property caches _compute_stats back to _preloaded_stats

Preserve contributed stats on load (comment 3):
- Result.load(load_responses=True) merges extra keys from stats.json
  so callback-contributed stats survive save/load round-trips

Make Result fields optional (comment 5):
- total_requests, clients, n_requests now optional to match _RunConfig

Accept timedelta for run_duration (comment 6):
- run_duration accepts int | float | timedelta; normalized in __post_init__
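The normalization in __post_init__ could look like the following sketch, which converts any accepted type to seconds (illustrative; the stored representation in llmeter may differ):

```python
from datetime import timedelta

def normalize_run_duration(value) -> float:
    """Normalize an int | float | timedelta run_duration to seconds."""
    if isinstance(value, timedelta):
        return value.total_seconds()
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    raise TypeError("run_duration must be int, float, or timedelta")
```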

Remove _n_requests indirection (comment 7):
- Eliminated private _n_requests; n_requests set directly to resolved value

Consolidate invoke methods (comment 8):
- Merged 6 methods into 3: _invoke_n_no_wait (n + duration),
  _invoke_client (replaces _invoke_n/_invoke_duration),
  _invoke_clients (replaces _invoke_n_c/_invoke_duration_c)

Tests:
- Add TestContributedStatsRoundTrip (8 tests) for save/load round-trips
- Add TestSendWindowStats for rpm/output_tps in to_stats()
- Add TestFormatStat for display formatting
- Update all tests for renamed methods and new APIs

Add a `request_time` (datetime UTC) field to InvocationResponse that
records the wall-clock time when each request was sent. The invoke
wrapper sets it automatically — both on success and error paths — so
no endpoint subclass changes are needed.

This enables time-series analysis of latency data and is required by
the send-window throughput metrics in RunningStats.

Use the request_time stamps captured on InvocationResponse to drive
rate statistics instead of the old record_send() / perf_counter approach.

RunningStats changes:
- Remove record_send() — send timestamps are now derived from
  request_time on each response via update()
- to_stats() takes end_time (datetime) instead of total_requests /
  total_test_time — RPM uses the request send window, output rates
  use [first_request, end_time]
- _send_window() computes elapsed seconds from datetime objects
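A send-window rate computed from datetime stamps can be sketched as below. The exact RPM definition here ((n - 1) intervals over the first-to-last window) is an assumption for illustration, not necessarily llmeter's formula:

```python
from datetime import datetime, timezone

def requests_per_minute(request_times) -> float:
    """RPM over the window from first to last request sent."""
    if len(request_times) < 2:
        return 0.0
    window_s = (max(request_times) - min(request_times)).total_seconds()
    if window_s <= 0:
        return 0.0
    # (n - 1) send intervals span the first-to-last window
    return (len(request_times) - 1) / window_s * 60.0
```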

Result changes:
- Add first_request_time / last_request_time fields, populated from
  RunningStats at end of run
- Datetime serialization updated for the new fields

Runner changes:
- Remove record_send() call from invoke loop
- Pass end_time to to_stats() and first/last request times to Result
- Fix DEFAULT_DISPLAY_STATS keys: "rpm" → "requests_per_minute",
  "num_tokens_input-sum" → "total_input_tokens",
  "num_tokens_output-sum" → "total_output_tokens"
- Fix _format_stat: match "_per_minute" instead of "rpm", remove
  whole-number-float-to-int coercion (100.0 stays "100.0")
- Update run_experiments.md doc example
- Update test_experiments.py and test_live_display.py accordingly

Verify that request_time is always set on InvocationResponse — both
on success and error paths — across all Bedrock integration tests:

- test_bedrock_converse: non-streaming, streaming, with-image (3 tests)
- test_bedrock_invoke: non-streaming, with-image, streaming (3 tests)
- test_bedrock_error_handling: invalid model, invalid payload,
  error structure (3 tests)

…oke decorator

Replace the implicit __init_subclass__ magic that auto-wrapped every
subclass invoke method with an explicit @llmeter_invoke decorator.

The decorator provides the same functionality (prepare_payload, timing,
error handling, metadata back-fill) but is now visible at the definition
site, making the contract explicit and allowing subclasses to opt out
if they need raw control over invoke.

- Add llmeter_invoke decorator to llmeter/endpoints/base.py
- Remove Endpoint.__init_subclass__ entirely
- Apply @llmeter_invoke to all 12 concrete invoke methods
- Export llmeter_invoke from llmeter.endpoints
- Update docs (key_concepts.md, connect_endpoints.md)
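The explicit-decorator pattern described above can be sketched as follows. This is a minimal stand-in for llmeter_invoke (the real decorator also handles prepare_payload and metadata back-fill), and the marker attribute name is illustrative:

```python
import functools
import time

def invoke_wrapper(fn):
    """Wrap an invoke method with timing and error handling, visibly."""
    @functools.wraps(fn)
    def wrapped(self, payload, **kwargs):
        start_t = time.perf_counter()
        try:
            response = fn(self, payload, **kwargs)
        except Exception as exc:
            response = {"error": str(exc)}
        response.setdefault("time_to_last_token", time.perf_counter() - start_t)
        return response
    wrapped.__wrapped_for_llmeter__ = True  # decorator marker (illustrative)
    return wrapped

class EchoEndpoint:
    @invoke_wrapper  # the contract is now visible at the definition site
    def invoke(self, payload):
        return {"text": payload.upper()}
```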
@acere force-pushed the feature/anthropic-messages-endpoint branch from 7466ae6 to f7e099d on April 17, 2026 at 05:55
athewsey and others added 4 commits April 17, 2026 14:48
Our dummy ConcreteEndpoint class was not using the new decorator, so it
didn't populate request_time properly.

Remove self._start_t and self._last_payload from the llmeter_invoke
decorator. Per-call state no longer leaks onto the endpoint instance:

- Each invoke body captures its own local start_t via time.perf_counter()
  and passes it to parse_response directly
- The decorator uses a local start_t for time_to_last_token back-fill
- input_payload on the response gets the mutated dict (what was actually
  sent to the API) for reproducibility
- _parse_payload receives a deepcopy snapshot taken before the API call,
  so prompt extraction is not affected by client-side mutations

Add 18 tests for the decorator covering payload mutation, invocation
isolation, timing, error handling, metadata back-fill, and the decorator
marker.
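The snapshot-before-call pattern described above can be shown in isolation: take a deepcopy before the API client gets a chance to mutate the payload, and use the copy for prompt extraction. Function names are illustrative:

```python
import copy

def invoke_with_snapshot(api_call, payload):
    """Call the API with the live payload; keep a pre-call deepcopy."""
    snapshot = copy.deepcopy(payload)  # pre-call state for prompt extraction
    response = api_call(payload)       # client may mutate payload in place
    return response, snapshot
```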
Add support for the Anthropic Messages API as a new endpoint type,
enabling benchmarking through the direct Anthropic API, Amazon Bedrock
(AnthropicBedrock), and Amazon Bedrock Mantle (AnthropicBedrockMantle).

New classes:
- AnthropicMessages: non-streaming endpoint
- AnthropicMessagesStream: streaming endpoint with TTFT/TTLT

Also includes:
- anthropic and anthropic-bedrock optional dependencies
- Unit tests (37 tests) and integration tests (3 tests)
- API reference docs and mkdocs nav entry
- Sample notebook comparing Converse vs Messages API TTFT on Bedrock

Closes awslabs#62
@acere force-pushed the feature/anthropic-messages-endpoint branch from 25b6fbf to b333585 on April 17, 2026 at 11:55