feat: time-bound runs, live stats display, and send-window metrics #58

Open · acere wants to merge 13 commits into awslabs:main from acere:feature/time-bound-runs

Conversation

@acere (Collaborator) commented Apr 2, 2026

Closes #57, closes #60

What

Adds time-bound test runs, a live stats display, send-window-based throughput metrics, standardized endpoint invoke lifecycle, prompt caching support, and fixes a StopIteration bug in invocation loops.

This PR combines the original time-bound runs feature with the endpoint lifecycle refactor (previously PR #61).

Endpoint lifecycle refactor

The Endpoint base class now provides a structured invoke lifecycle via __init_subclass__ wrapping:

| Method | Required | Purpose |
| --- | --- | --- |
| `invoke(payload)` | Yes | API call + `parse_response()` |
| `parse_response(raw_response, start_t)` | Yes | Extract text, tokens, metadata |
| `prepare_payload(payload, **kwargs)` | No | Merge kwargs, inject `model_id`, etc. |

The wrapper automatically handles:

  • Error handling — exceptions → error InvocationResponse with payload attached
  • Timing — time_to_last_token back-filled for non-streaming endpoints
  • Metadata — input_payload, input_prompt, id, request_time always populated
  • _parse_payload — extracts human-readable prompt for observability and token counting fallback

Before/after (e.g. OpenAIResponseEndpoint.invoke)

Before (27 lines with 5 duplicate except handlers):

def invoke(self, payload, **kwargs):
    payload = {**kwargs, **payload}
    payload["model"] = self.model_id
    start_t = time.perf_counter()
    try:
        client_response = self._client.responses.create(**payload)
    except APIConnectionError as e:
        ...  # 5 identical except blocks
    response = self._parse_response(client_response, start_t)
    response.input_payload = payload
    response.input_prompt = self._parse_payload(payload)
    return response

After (3 lines):

def invoke(self, payload):
    client_response = self._client.responses.create(**payload)
    return self.parse_response(client_response, self._start_t)

New InvocationResponse fields

  • request_time (datetime UTC) — wall-clock time when the request was sent
  • num_tokens_input_cached — input tokens served from prompt cache (Bedrock + OpenAI)

Time-bound runs & live stats

  • run_duration parameter for continuous-duration runs (mutually exclusive with n_requests)
  • LiveStatsDisplay for real-time progress in Jupyter and terminals
  • RunningStats for incremental stat accumulation
  • low_memory mode that discards individual responses after stats extraction
  • RPM and throughput computed from request timestamps (send window), not response timestamps

Stats computation changes

  • RunningStats.to_stats() takes end_time (datetime) instead of total_requests/total_test_time
  • RPM uses request send window (first_request_time to last_request_time)
  • Output rates use [first_request, end_time] window
  • Result gains first_request_time / last_request_time fields
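The two windows can be illustrated with a small, self-contained sketch (assumed function shape; the real RunningStats API differs):

```python
from datetime import datetime, timedelta, timezone


def send_window_stats(request_times, total_output_tokens, end_time):
    """RPM over the send window; output rate over [first_request, end_time]."""
    first, last = min(request_times), max(request_times)
    # RPM: how fast requests were *sent*, immune to taper-off as clients finish
    send_window_s = (last - first).total_seconds()
    rpm = 60 * len(request_times) / send_window_s if send_window_s > 0 else None
    # Output rate: tokens produced over the whole [first_request, end_time] window
    run_s = (end_time - first).total_seconds()
    output_tps = total_output_tokens / run_s if run_s > 0 else None
    return {"requests_per_minute": rpm, "output_tps": output_tps,
            "first_request_time": first, "last_request_time": last}
```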

Additional improvements

  • Extract AWS RequestId as response ID for all Bedrock and SageMaker endpoints
  • Extract RetryAttempts for SageMaker (Bedrock already had this)
  • Preserve partial data on streaming errors instead of discarding
  • BEDROCK_STREAM_ERROR_TYPES as shared constant
  • Skip unknown stream events gracefully (forward-compatible)
  • Prompt caching demo notebook with CacheBuster callback
  • Updated docs: metrics table, key concepts, custom endpoint guide, run experiments

Tests

  • 751 unit tests pass
  • 6 new mid-stream error tests (TimeoutError, ConnectionError across 3 streaming endpoints)
  • Integration tests with request_time and AWS RequestId assertions
  • Prompt caching integration test with unique-per-run prefix

Comment thread llmeter/live_display.py Outdated
Comment on lines +20 to +29
_GROUP_PATTERNS: list[tuple[str, str]] = [
("rpm", "Throughput"),
("tps", "Throughput"),
("ttft", "TTFT"),
("ttlt", "TTLT"),
("token", "Tokens"),
("fail", "Errors"),
]

_GROUP_ORDER = ["Throughput", "TTFT", "TTLT", "Tokens", "Errors", "Other"]
Collaborator

Maybe these could be condensed to a single config variable like below?

_GROUP_PATTERNS = (
    ("Throughput", ("rpm", "tps")),
    ("TTFT", ("ttft",)),
    ("TTLT", ("ttlt",)),
    ("Tokens", ("token",)),
    ("Errors", ("fail",)),
    ("Other", ("",)),
)

If it's an immutable type like this, it could also nicely become the default value of a groups argument in the LiveStatsDisplay constructor, instead of a module-level constant?
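As a sketch of how the suggested single config could drive both grouping and ordering (hypothetical helper, not code from the PR):

```python
DEFAULT_GROUPS = (
    ("Throughput", ("rpm", "tps")),
    ("TTFT", ("ttft",)),
    ("TTLT", ("ttlt",)),
    ("Tokens", ("token",)),
    ("Errors", ("fail",)),
    ("Other", ("",)),  # empty pattern matches anything -> catch-all
)


def classify(stat_name, groups=DEFAULT_GROUPS):
    """Map a stat name to its display group; tuple order doubles as group order."""
    name = stat_name.lower()
    for group, patterns in groups:
        if any(p in name for p in patterns):
            return group
    return "Other"
```

Because the tuple is ordered, the same object replaces both _GROUP_PATTERNS and _GROUP_ORDER, and being immutable it is safe as a constructor default.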

Comment thread llmeter/results.py Outdated
stats = self._builtin_stats.copy()
else:
# Fallback: compute from responses (e.g. Result constructed manually)
stats = self._compute_stats(self)
Collaborator

Should this be cached back to _preloaded_stats so it's not recomputed on subsequent accesses?

Collaborator Author

Updated

Comment thread llmeter/results.py
result._preloaded_stats = None
else:
# Compute stats from the loaded responses
result._preloaded_stats = cls._compute_stats(result)
Collaborator

What happens to callback _contributed_stats when a result is saved to file and loaded again? It looks like, even if the contributed stats get saved to stats.json, they might be overridden/deleted here?

Collaborator Author

Yes, I missed it. Fixed, and also added a dedicated set of tests.

Comment thread llmeter/utils.py Outdated
Comment on lines +121 to +132
DEFAULT_SNAPSHOT_STATS: dict[str, tuple[str, ...] | str] = {
"rpm": "rpm",
"output_tps": "output_tps",
"p50_ttft": ("time_to_first_token", "p50"),
"p90_ttft": ("time_to_first_token", "p90"),
"p50_ttlt": ("time_to_last_token", "p50"),
"p90_ttlt": ("time_to_last_token", "p90"),
"p50_tps": ("time_per_output_token", "p50", "inv"),
"input_tokens": ("num_tokens_input", "sum"),
"output_tokens": ("num_tokens_output", "sum"),
"fail": "failed",
}
Collaborator

Not a big fan of defining name aliases at this level - shouldn't that be more of a display-level property?

It also feels weird that this class is separate from Result stats... I'd suggest instead revisiting the way Result itself computes stats and adding the capability for some stats to be calculated on a running basis during the Run. After all, callbacks can already choose to _update_contributed_stats at any time?

Then, the LiveStatsDisplay could just be configured which stats to pull (e.g. time_to_first_token-p50) with alias names / groups / whatever other display-level properties.

Collaborator Author

Restructured the relationship between Result and LiveStatsDisplay; it should now be more consistent.

Comment thread llmeter/results.py Outdated
Collaborator

Should this be optional now if n_requests is optional in _RunConfig?

Collaborator Author

Updated.

Comment thread llmeter/runner.py Outdated
tokenizer: Tokenizer | Any | None = None
clients: int = 1
n_requests: int | None = None
run_duration: int | float | None = None
Collaborator

Perhaps this should either be a timedelta type, or have a name that explicitly indicates its units?

Collaborator Author

Added timedelta type as option, and clarified in docstrings that any numerical type represents duration in seconds.
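The normalization described here might look roughly like this (a sketch under assumed names; the real code runs in _RunConfig's __post_init__):

```python
from datetime import timedelta


def normalize_run_duration(run_duration):
    """Normalize run_duration to seconds; any numerical type is treated as seconds."""
    if run_duration is None:
        return None
    if isinstance(run_duration, timedelta):
        return run_duration.total_seconds()
    return float(run_duration)
```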

Comment thread llmeter/runner.py Outdated
self._time_bound = self.run_duration is not None
if self._time_bound:
# For time-bound runs, _n_requests is unknown upfront
self._n_requests = 0
Collaborator

Do we need both n_requests and _n_requests? And the inconsistency of the public property being nullable while the private one's getting set to 0?

Collaborator Author

Combined into a single variable

Comment thread llmeter/runner.py Outdated
Comment on lines +563 to +567
async def _invoke_duration_c(
self,
payload: list[dict],
clients: int = 1,
) -> tuple[float, float, float]:
Collaborator

A bit concerned by the amount of duplication introduced by defining parallel _invoke_duration_c, _invoke_duration, _invoke_for_duration methods, rather than sharing anything with the corresponding _invoke_n... methods. Since these are all private, couldn't we consolidate some to a single method that tracks both the number and duration and terminates when either condition is met?
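One way to consolidate, as the comment suggests, is a single loop that checks both limits and terminates on whichever is hit first (an illustrative synchronous sketch with hypothetical names; the real methods are async and per-client):

```python
import itertools
import time


def invoke_loop(endpoint, payloads, n_requests=None, run_duration=None):
    """Send payloads until either n_requests is reached or run_duration (seconds) elapses."""
    if n_requests is None and run_duration is None:
        raise ValueError("either n_requests or run_duration is required")
    payload_iter = iter(itertools.cycle(payloads))
    deadline = None if run_duration is None else time.monotonic() + run_duration
    responses = []
    while True:
        if n_requests is not None and len(responses) >= n_requests:
            break  # count limit reached
        if deadline is not None and time.monotonic() >= deadline:
            break  # time limit reached
        responses.append(endpoint(next(payload_iter)))
    return responses
```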

Collaborator Author

Consolidated from 6 to 3 methods.

acere added a commit to acere/llmeter that referenced this pull request Apr 8, 2026
athewsey pushed a commit to acere/llmeter that referenced this pull request Apr 16, 2026
athewsey force-pushed the feature/time-bound-runs branch from 46d1454 to a1ab6b5 on April 16, 2026 at 06:09
acere added 9 commits April 16, 2026 22:44
Refactor the Endpoint base class to provide a structured invoke lifecycle
via __init_subclass__ wrapping:

- prepare_payload(payload, **kwargs) → merge kwargs, inject provider fields
- invoke(payload) → API call + parse_response() (abstract)
- parse_response(raw_response, start_t) → extract text/tokens/metadata (abstract)

The base class wrapper automatically provides:
- Error handling: exceptions → error InvocationResponse with partial data
- Timing: time_to_last_token back-fill for non-streaming endpoints
- Metadata: input_payload, input_prompt, id always populated
- _parse_payload for input prompt extraction (token counting fallback)

Additional improvements:
- Add num_tokens_input_cached field for prompt caching (Bedrock + OpenAI)
- Extract AWS RequestId as response ID for Bedrock and SageMaker
- Extract RetryAttempts for SageMaker (Bedrock already had this)
- Preserve partial data on streaming errors instead of discarding
- Define BEDROCK_STREAM_ERROR_TYPES as shared constant
- Skip unknown stream events gracefully (forward-compatible)
- Remove redundant try/except from all _parse_response methods
- Remove uuid4/error handling boilerplate from all endpoint subclasses
- Update docs: metrics table, key concepts, custom endpoint guide

Closes awslabs#60
- Add num_tokens_input_cached to Result.stats aggregation metrics
  and total_cached_input_tokens to run-level stats
- Add integration test for ConverseStream prompt caching with
  unique-per-run prefix to avoid stale cache hits
- Add 6 unit tests verifying mid-stream errors (TimeoutError,
  ConnectionError) are caught by the invoke wrapper for
  BedrockConverseStream, BedrockInvokeStream, and
  OpenAICompletionStreamEndpoint
- Add demo notebook comparing TTFT with/without prompt caching,
  using a CacheBuster callback to guarantee cache misses
- Sort imports across codebase (ruff --select I)
- Update metrics documentation with new stats fields
- Add `low_memory` parameter to Runner/run() that writes responses to
  disk without keeping them in memory, for large-scale test runs.
- Introduce `RunningStats` class that accumulates metrics incrementally
  (counts, sums, sorted values for percentile computation).
- Replace `_builtin_stats` cached_property on Result with `_preloaded_stats`
  populated by RunningStats during the run or from stats.json on load.
- Add `snapshot()` method on RunningStats for live progress-bar display
  of p50/p90 TTFT, p50/p90 TTLT, median tokens/s, total tokens, and
  failure count — configurable via `progress_bar_stats` parameter.
- Add `_compute_stats()` classmethod on Result as fallback for manually
  constructed Result objects and post-load_responses() recomputation.
- Update tests for the new stats flow.
Add run_duration parameter for time-bound test runs:
- New run_duration on Runner/run() and LoadTest: clients send requests
  continuously for a fixed duration instead of a fixed count.
- Dedicated _invoke_for_duration / _invoke_duration_c methods (separate
  from count-bound _invoke_n / _invoke_n_c).
- Time-based progress bar via _tick_time_bar async task.
- Mutual exclusivity validation between n_requests and run_duration.

Add LiveStatsDisplay for readable live metrics:
- New llmeter/live_display.py: HTML table in Jupyter (grouped columns
  for Throughput, TTFT, TTLT, Tokens, Errors), ANSI multi-line in
  terminals. Updates in-place, shows placeholders before first response.
- Replaces single-line tqdm postfix with a separate stats row.

Improve throughput metric accuracy:
- RunningStats.record_send() tracks send-side timestamps.
- RPM and output_tps use send window (first-to-last request sent)
  instead of response-side elapsed time, preventing taper-off as
  clients finish.
- output_tps (aggregate tokens/s) added to default snapshot stats.

Fix StopIteration silently terminating invocation loops:
- Both _invoke_n_no_wait and _invoke_for_duration now use while/next()
  instead of for-in-cycle() to prevent StopIteration from streaming
  endpoints from killing the loop.

Add LoadTest support for new features:
- run_duration, low_memory, progress_bar_stats forwarded to each run.

Add example notebook and documentation:
- examples/Time-bound runs with Bedrock OpenAI API.ipynb: end-to-end
  demo using bedrock-mantle endpoint with LoadTest, custom stats,
  low-memory mode, and comparison charts (RPM, TPS, TTFT, TTLT).
- docs/user_guide/run_experiments.md: new sections for time-bound runs,
  live progress-bar stats, and low-memory mode.

Add tests (51 new, 504 total):
- test_running_stats.py: record_send, update, to_stats, snapshot
  (placeholders, rpm, output_tps, send window, aggregations).
- test_live_display.py: _classify, _group_stats, _in_notebook,
  LiveStatsDisplay (disabled, terminal, overwrite, prefix).
- test_experiments.py: LoadTest with run_duration/low_memory/
  progress_bar_stats field storage and runner forwarding.
- test_runner.py: time-bound validation, _invoke_for_duration,
  full run with duration, output path, multiple clients.
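The StopIteration fix above can be illustrated with a minimal stand-in (hypothetical names; the real loop is async). Driving the cycle with explicit next() means a StopIteration raised by the endpoint itself cannot be conflated with iterator exhaustion, and handling it per-iteration keeps the loop alive:

```python
import itertools


def run_n(endpoint, payloads, n):
    """Invoke n times, cycling payloads; endpoint errors don't kill the loop."""
    payload_iter = iter(itertools.cycle(payloads))
    results = []
    while len(results) < n:
        payload = next(payload_iter)  # explicit next(): only *this* call advances
        try:
            results.append(("ok", endpoint(payload)))
        except StopIteration as exc:  # e.g. raised by a streaming endpoint
            results.append(("error", repr(exc)))
    return results
```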
Consolidate live display config (review comment 1):
- Merge _GROUP_PATTERNS + _GROUP_ORDER into single DEFAULT_GROUPS tuple
- Make groups a constructor parameter on LiveStatsDisplay

Move display aliases from RunningStats to LiveStatsDisplay (comment 4):
- Remove RunningStats.snapshot() and DEFAULT_SNAPSHOT_STATS
- Add rpm/output_tps as regular keys in RunningStats.to_stats()
- Add LiveStatsDisplay.format_stats() owning alias mapping + formatting
- New DEFAULT_DISPLAY_STATS in live_display.py maps display labels to
  canonical stat keys (e.g. "time_to_first_token-p50")
- Runner passes raw to_stats() output; display handles the rest

Cache fallback stats computation (comment 2):
- Result.stats property caches _compute_stats back to _preloaded_stats

Preserve contributed stats on load (comment 3):
- Result.load(load_responses=True) merges extra keys from stats.json
  so callback-contributed stats survive save/load round-trips

Make Result fields optional (comment 5):
- total_requests, clients, n_requests now optional to match _RunConfig

Accept timedelta for run_duration (comment 6):
- run_duration accepts int | float | timedelta; normalized in __post_init__

Remove _n_requests indirection (comment 7):
- Eliminated private _n_requests; n_requests set directly to resolved value

Consolidate invoke methods (comment 8):
- Merged 6 methods into 3: _invoke_n_no_wait (n + duration),
  _invoke_client (replaces _invoke_n/_invoke_duration),
  _invoke_clients (replaces _invoke_n_c/_invoke_duration_c)

Tests:
- Add TestContributedStatsRoundTrip (8 tests) for save/load round-trips
- Add TestSendWindowStats for rpm/output_tps in to_stats()
- Add TestFormatStat for display formatting
- Update all tests for renamed methods and new APIs
Add a `request_time` (datetime UTC) field to InvocationResponse that
records the wall-clock time when each request was sent. The invoke
wrapper sets it automatically — both on success and error paths — so
no endpoint subclass changes are needed.

This enables time-series analysis of latency data and is required by
the send-window throughput metrics in RunningStats.
Use the request_time stamps captured on InvocationResponse to drive
rate statistics instead of the old record_send() / perf_counter approach.

RunningStats changes:
- Remove record_send() — send timestamps are now derived from
  request_time on each response via update()
- to_stats() takes end_time (datetime) instead of total_requests /
  total_test_time — RPM uses the request send window, output rates
  use [first_request, end_time]
- _send_window() computes elapsed seconds from datetime objects

Result changes:
- Add first_request_time / last_request_time fields, populated from
  RunningStats at end of run
- Datetime serialization updated for the new fields

Runner changes:
- Remove record_send() call from invoke loop
- Pass end_time to to_stats() and first/last request times to Result
- Fix DEFAULT_DISPLAY_STATS keys: "rpm" → "requests_per_minute",
  "num_tokens_input-sum" → "total_input_tokens",
  "num_tokens_output-sum" → "total_output_tokens"
- Fix _format_stat: match "_per_minute" instead of "rpm", remove
  whole-number-float-to-int coercion (100.0 stays "100.0")
- Update run_experiments.md doc example
- Update test_experiments.py and test_live_display.py accordingly
Verify that request_time is always set on InvocationResponse — both
on success and error paths — across all Bedrock integration tests:

- test_bedrock_converse: non-streaming, streaming, with-image (3 tests)
- test_bedrock_invoke: non-streaming, with-image, streaming (3 tests)
- test_bedrock_error_handling: invalid model, invalid payload,
  error structure (3 tests)
acere and others added 4 commits April 17, 2026 13:53
…oke decorator

Replace the implicit __init_subclass__ magic that auto-wrapped every
subclass invoke method with an explicit @llmeter_invoke decorator.

The decorator provides the same functionality (prepare_payload, timing,
error handling, metadata back-fill) but is now visible at the definition
site, making the contract explicit and allowing subclasses to opt out
if they need raw control over invoke.

- Add llmeter_invoke decorator to llmeter/endpoints/base.py
- Remove Endpoint.__init_subclass__ entirely
- Apply @llmeter_invoke to all 12 concrete invoke methods
- Export llmeter_invoke from llmeter.endpoints
- Update docs (key_concepts.md, connect_endpoints.md)
Our dummy ConcreteEndpoint class was not using the new decorator, so it didn't populate request_time properly.
Remove self._start_t and self._last_payload from the llmeter_invoke
decorator. Per-call state no longer leaks onto the endpoint instance:

- Each invoke body captures its own local start_t via time.perf_counter()
  and passes it to parse_response directly
- The decorator uses a local start_t for time_to_last_token back-fill
- input_payload on the response gets the mutated dict (what was actually
  sent to the API) for reproducibility
- _parse_payload receives a deepcopy snapshot taken before the API call,
  so prompt extraction is not affected by client-side mutations

Add 18 tests for the decorator covering payload mutation, invocation
isolation, timing, error handling, metadata back-fill, and the decorator
marker.
…path

The llmeter_invoke wrapper now calls parse_response(raw, start_t)
automatically after invoke returns. Subclass invoke methods just
return the raw API response — no start_t, no parse_response call,
no type guards.

- invoke returns raw API response (Any), wrapper calls parse_response
- Remove isinstance checks from LiteLLM/LiteLLMStreaming invoke
- Remove start_t = time.perf_counter() from all invoke methods
- Update test fixtures to match new invoke contract
- Sort imports and format with ruff
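Taken together, the final decorator-based contract might look roughly like this (a dict-based sketch under assumed shapes; only the llmeter_invoke name and the method split come from the commits above):

```python
import functools
import time


def llmeter_invoke(fn):
    """Wrap invoke(): prepare the payload, time the call, convert exceptions
    into an error response, and parse the raw API result."""
    @functools.wraps(fn)
    def wrapper(self, payload, **kwargs):
        payload = self.prepare_payload(payload, **kwargs)
        start_t = time.perf_counter()
        try:
            raw = fn(self, payload)  # subclass returns the raw API response
        except Exception as exc:
            return {"error": repr(exc), "input_payload": payload}
        response = self.parse_response(raw, start_t)
        response["input_payload"] = payload
        return response

    wrapper._is_llmeter_invoke = True  # marker: visible at the definition site
    return wrapper


class DummyEndpoint:
    def prepare_payload(self, payload, **kwargs):
        return {**kwargs, **payload}

    @llmeter_invoke
    def invoke(self, payload):
        return {"text": payload["prompt"].upper()}  # stand-in raw API response

    def parse_response(self, raw, start_t):
        return {"response_text": raw["text"],
                "time_to_last_token": time.perf_counter() - start_t}
```

The subclass's invoke body shrinks to just the API call, while the explicit decorator (unlike the earlier __init_subclass__ wrapping) makes the contract visible and lets a subclass opt out.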