feat: time-bound runs, live stats display, and send-window metrics #58
acere wants to merge 13 commits into awslabs:main from
Conversation
```python
_GROUP_PATTERNS: list[tuple[str, str]] = [
    ("rpm", "Throughput"),
    ("tps", "Throughput"),
    ("ttft", "TTFT"),
    ("ttlt", "TTLT"),
    ("token", "Tokens"),
    ("fail", "Errors"),
]

_GROUP_ORDER = ["Throughput", "TTFT", "TTLT", "Tokens", "Errors", "Other"]
```
Maybe these could be condensed to a single config variable like below?
```python
_GROUP_PATTERNS = (
    ("Throughput", ("rpm", "tps")),
    ("TTFT", ("ttft",)),
    ("TTLT", ("ttlt",)),
    ("Tokens", ("token",)),
    ("Errors", ("fail",)),
    ("Other", ("",)),
)
```

If it's an immutable type like this, it could also nicely become the default value of a `groups` argument in the `LiveStatsDisplay` constructor, instead of a module-level constant?
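A minimal sketch of what the reviewer's suggestion could look like, assuming a `LiveStatsDisplay` constructor that takes `groups`; the class name follows the discussion, but the method name `classify` and the exact signature are illustrative:

```python
# Sketch of the suggested single config tuple used as a constructor
# default. Names beyond _GROUP_PATTERNS/LiveStatsDisplay are assumptions.
DEFAULT_GROUPS: tuple[tuple[str, tuple[str, ...]], ...] = (
    ("Throughput", ("rpm", "tps")),
    ("TTFT", ("ttft",)),
    ("TTLT", ("ttlt",)),
    ("Tokens", ("token",)),
    ("Errors", ("fail",)),
    ("Other", ("",)),
)


class LiveStatsDisplay:
    def __init__(self, groups=DEFAULT_GROUPS):
        # Immutable default: safe to share across instances.
        self.groups = groups

    def classify(self, stat_name: str) -> str:
        # First group whose pattern is a substring of the stat name wins;
        # the empty pattern under "Other" matches everything as a fallback.
        for group, patterns in self.groups:
            if any(p in stat_name for p in patterns):
                return group
        return "Other"
```

Because the tuple is immutable, using it as a default argument avoids the mutable-default pitfall while keeping the config next to the class that consumes it.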
```python
    stats = self._builtin_stats.copy()
else:
    # Fallback: compute from responses (e.g. Result constructed manually)
    stats = self._compute_stats(self)
```
Should this be cached back to _preloaded_stats so it's not recomputed on subsequent accesses?
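The caching the comment asks for could look roughly like this; the attribute and method names follow the diff, but the surrounding class is a simplified stand-in:

```python
# Sketch of caching the fallback computation back onto the instance.
# Field names follow the diff; the real Result class has more state.
class Result:
    def __init__(self, responses, preloaded_stats=None):
        self.responses = responses
        self._preloaded_stats = preloaded_stats

    @property
    def stats(self) -> dict:
        if self._preloaded_stats is None:
            # Fallback path: compute once and cache back, so repeated
            # accesses don't redo the pass over all responses.
            self._preloaded_stats = self._compute_stats()
        return self._preloaded_stats

    def _compute_stats(self) -> dict:
        return {"total_requests": len(self.responses)}
```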
```python
    result._preloaded_stats = None
else:
    # Compute stats from the loaded responses
    result._preloaded_stats = cls._compute_stats(result)
```
What happens to callback `_contributed_stats` when a result is saved to file and loaded again? It looks like, even if the contributed stats get saved to stats.json, they might be overridden or deleted here?
Yes, I missed it. Fixed, and also added a dedicated set of tests.
```python
DEFAULT_SNAPSHOT_STATS: dict[str, tuple[str, ...] | str] = {
    "rpm": "rpm",
    "output_tps": "output_tps",
    "p50_ttft": ("time_to_first_token", "p50"),
    "p90_ttft": ("time_to_first_token", "p90"),
    "p50_ttlt": ("time_to_last_token", "p50"),
    "p90_ttlt": ("time_to_last_token", "p90"),
    "p50_tps": ("time_per_output_token", "p50", "inv"),
    "input_tokens": ("num_tokens_input", "sum"),
    "output_tokens": ("num_tokens_output", "sum"),
    "fail": "failed",
}
```
Not a big fan of defining name aliases at this level - shouldn't that be more of a display-level property?
It also feels weird that this class is separate from Result stats... I'd suggest instead revisiting the way Result itself computes stats and adding the capability for some to be calculated on a running basis during the Run. After all, callbacks can already choose to _update_contributed_stats at any time?
Then, the LiveStatsDisplay could just be configured with which stats to pull (e.g. time_to_first_token-p50), plus alias names / groups / whatever other display-level properties.
Restructured the relationship between Result and LiveStatsDisplay; it should be more consistent now.
Should this be optional now if n_requests is optional in _RunConfig?
```python
tokenizer: Tokenizer | Any | None = None
clients: int = 1
n_requests: int | None = None
run_duration: int | float | None = None
```
Perhaps this should either be a timedelta type, or have a name that explicitly indicates its units?
Added timedelta as an accepted type, and clarified in the docstrings that a numerical value represents the duration in seconds.
```python
self._time_bound = self.run_duration is not None
if self._time_bound:
    # For time-bound runs, _n_requests is unknown upfront
    self._n_requests = 0
```
Do we need both n_requests and _n_requests? And what about the inconsistency of the public property being nullable while the private one gets set to 0?
Combined into a single variable
```python
async def _invoke_duration_c(
    self,
    payload: list[dict],
    clients: int = 1,
) -> tuple[float, float, float]:
```
A bit concerned by the amount of duplication introduced by defining parallel _invoke_duration_c, _invoke_duration, _invoke_for_duration methods, rather than sharing anything with the corresponding _invoke_n... methods. Since these are all private, couldn't we consolidate some to a single method that tracks both the number and duration and terminates when either condition is met?
Consolidated from 6 to 3 methods.
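The consolidation the reviewer asked for could follow a pattern like the one below: a single loop that stops when either the request count or the deadline is reached, whichever comes first. This is a synchronous illustration with a hypothetical name, not the library's async implementation:

```python
# Sketch: one loop serving both count-bound and time-bound runs.
# invoke_until is a hypothetical stand-in for the consolidated method.
import time


def invoke_until(payloads, invoke, n_requests=None, run_duration=None):
    deadline = (time.monotonic() + run_duration) if run_duration else None
    out = []
    for payload in payloads:
        # Terminate when either condition is met.
        if n_requests is not None and len(out) >= n_requests:
            break
        if deadline is not None and time.monotonic() >= deadline:
            break
        out.append(invoke(payload))
    return out
```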
Consolidate live display config (review comment 1):
- Merge _GROUP_PATTERNS + _GROUP_ORDER into a single DEFAULT_GROUPS tuple
- Make groups a constructor parameter on LiveStatsDisplay

Move display aliases from RunningStats to LiveStatsDisplay (comment 4):
- Remove RunningStats.snapshot() and DEFAULT_SNAPSHOT_STATS
- Add rpm/output_tps as regular keys in RunningStats.to_stats()
- Add LiveStatsDisplay.format_stats() owning alias mapping + formatting
- New DEFAULT_DISPLAY_STATS in live_display.py maps display labels to canonical stat keys (e.g. "time_to_first_token-p50")
- Runner passes raw to_stats() output; the display handles the rest

Cache fallback stats computation (comment 2):
- Result.stats property caches _compute_stats back to _preloaded_stats

Preserve contributed stats on load (comment 3):
- Result.load(load_responses=True) merges extra keys from stats.json so callback-contributed stats survive save/load round-trips

Make Result fields optional (comment 5):
- total_requests, clients, n_requests now optional to match _RunConfig

Accept timedelta for run_duration (comment 6):
- run_duration accepts int | float | timedelta; normalized in __post_init__

Remove _n_requests indirection (comment 7):
- Eliminated private _n_requests; n_requests set directly to the resolved value

Consolidate invoke methods (comment 8):
- Merged 6 methods into 3: _invoke_n_no_wait (n + duration), _invoke_client (replaces _invoke_n/_invoke_duration), _invoke_clients (replaces _invoke_n_c/_invoke_duration_c)

Tests:
- Add TestContributedStatsRoundTrip (8 tests) for save/load round-trips
- Add TestSendWindowStats for rpm/output_tps in to_stats()
- Add TestFormatStat for display formatting
- Update all tests for renamed methods and new APIs
Force-pushed 46d1454 to a1ab6b5
Refactor the Endpoint base class to provide a structured invoke lifecycle via __init_subclass__ wrapping:
- prepare_payload(payload, **kwargs) → merge kwargs, inject provider fields
- invoke(payload) → API call + parse_response() (abstract)
- parse_response(raw_response, start_t) → extract text/tokens/metadata (abstract)

The base class wrapper automatically provides:
- Error handling: exceptions → error InvocationResponse with partial data
- Timing: time_to_last_token back-fill for non-streaming endpoints
- Metadata: input_payload, input_prompt, id always populated
- _parse_payload for input prompt extraction (token counting fallback)

Additional improvements:
- Add num_tokens_input_cached field for prompt caching (Bedrock + OpenAI)
- Extract AWS RequestId as response ID for Bedrock and SageMaker
- Extract RetryAttempts for SageMaker (Bedrock already had this)
- Preserve partial data on streaming errors instead of discarding
- Define BEDROCK_STREAM_ERROR_TYPES as a shared constant
- Skip unknown stream events gracefully (forward-compatible)
- Remove redundant try/except from all _parse_response methods
- Remove uuid4/error-handling boilerplate from all endpoint subclasses
- Update docs: metrics table, key concepts, custom endpoint guide

Closes awslabs#60
- Add num_tokens_input_cached to Result.stats aggregation metrics and total_cached_input_tokens to run-level stats
- Add integration test for ConverseStream prompt caching with a unique-per-run prefix to avoid stale cache hits
- Add 6 unit tests verifying mid-stream errors (TimeoutError, ConnectionError) are caught by the invoke wrapper for BedrockConverseStream, BedrockInvokeStream, and OpenAICompletionStreamEndpoint
- Add demo notebook comparing TTFT with/without prompt caching, using a CacheBuster callback to guarantee cache misses
- Sort imports across the codebase (ruff --select I)
- Update metrics documentation with new stats fields
- Add a `low_memory` parameter to Runner/run() that writes responses to disk without keeping them in memory, for large-scale test runs.
- Introduce a `RunningStats` class that accumulates metrics incrementally (counts, sums, sorted values for percentile computation).
- Replace the `_builtin_stats` cached_property on Result with `_preloaded_stats`, populated by RunningStats during the run or from stats.json on load.
- Add a `snapshot()` method on RunningStats for live progress-bar display of p50/p90 TTFT, p50/p90 TTLT, median tokens/s, total tokens, and failure count, configurable via the `progress_bar_stats` parameter.
- Add a `_compute_stats()` classmethod on Result as a fallback for manually constructed Result objects and post-load_responses() recomputation.
- Update tests for the new stats flow.
Add run_duration parameter for time-bound test runs:
- New run_duration on Runner/run() and LoadTest: clients send requests continuously for a fixed duration instead of a fixed count.
- Dedicated _invoke_for_duration / _invoke_duration_c methods (separate from the count-bound _invoke_n / _invoke_n_c).
- Time-based progress bar via a _tick_time_bar async task.
- Mutual exclusivity validation between n_requests and run_duration.

Add LiveStatsDisplay for readable live metrics:
- New llmeter/live_display.py: HTML table in Jupyter (grouped columns for Throughput, TTFT, TTLT, Tokens, Errors), ANSI multi-line in terminals. Updates in-place, shows placeholders before the first response.
- Replaces the single-line tqdm postfix with a separate stats row.

Improve throughput metric accuracy:
- RunningStats.record_send() tracks send-side timestamps.
- RPM and output_tps use the send window (first-to-last request sent) instead of response-side elapsed time, preventing taper-off as clients finish.
- output_tps (aggregate tokens/s) added to the default snapshot stats.

Fix StopIteration silently terminating invocation loops:
- Both _invoke_n_no_wait and _invoke_for_duration now use while/next() instead of for-in-cycle() to prevent StopIteration from streaming endpoints from killing the loop.

Add LoadTest support for the new features:
- run_duration, low_memory, progress_bar_stats forwarded to each run.

Add example notebook and documentation:
- examples/Time-bound runs with Bedrock OpenAI API.ipynb: end-to-end demo using the bedrock-mantle endpoint with LoadTest, custom stats, low-memory mode, and comparison charts (RPM, TPS, TTFT, TTLT).
- docs/user_guide/run_experiments.md: new sections for time-bound runs, live progress-bar stats, and low-memory mode.

Add tests (51 new, 504 total):
- test_running_stats.py: record_send, update, to_stats, snapshot (placeholders, rpm, output_tps, send window, aggregations).
- test_live_display.py: _classify, _group_stats, _in_notebook, LiveStatsDisplay (disabled, terminal, overwrite, prefix).
- test_experiments.py: LoadTest with run_duration/low_memory/progress_bar_stats field storage and runner forwarding.
- test_runner.py: time-bound validation, _invoke_for_duration, full run with duration, output path, multiple clients.
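The while/next() pattern described in the StopIteration fix could look roughly like this (hypothetical function name, synchronous for brevity; the real loops are async):

```python
# Sketch: drive the payload iterator explicitly with while/next so the
# only StopIteration this loop ever interprets is its own, rather than
# relying on for-in-cycle(), where a StopIteration leaking from a
# streaming endpoint can interfere with the loop machinery.
def invoke_n(payloads, invoke, n):
    it = iter(payloads)
    out = []
    while len(out) < n:
        try:
            payload = next(it)
        except StopIteration:
            it = iter(payloads)  # restart the sequence: explicit cycling
            payload = next(it)
        out.append(invoke(payload))
    return out
```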
Add a `request_time` (datetime UTC) field to InvocationResponse that records the wall-clock time when each request was sent. The invoke wrapper sets it automatically — both on success and error paths — so no endpoint subclass changes are needed. This enables time-series analysis of latency data and is required by the send-window throughput metrics in RunningStats.
Use the request_time stamps captured on InvocationResponse to drive rate statistics instead of the old record_send() / perf_counter approach.

RunningStats changes:
- Remove record_send() — send timestamps are now derived from request_time on each response via update()
- to_stats() takes end_time (datetime) instead of total_requests / total_test_time — RPM uses the request send window, output rates use [first_request, end_time]
- _send_window() computes elapsed seconds from datetime objects

Result changes:
- Add first_request_time / last_request_time fields, populated from RunningStats at end of run
- Datetime serialization updated for the new fields

Runner changes:
- Remove the record_send() call from the invoke loop
- Pass end_time to to_stats() and first/last request times to Result
- Fix DEFAULT_DISPLAY_STATS keys: "rpm" → "requests_per_minute", "num_tokens_input-sum" → "total_input_tokens", "num_tokens_output-sum" → "total_output_tokens"
- Fix _format_stat: match "_per_minute" instead of "rpm"; remove the whole-number float-to-int coercion (100.0 stays "100.0")
- Update the run_experiments.md doc example
- Update test_experiments.py and test_live_display.py accordingly
Verify that request_time is always set on InvocationResponse — both on success and error paths — across all Bedrock integration tests:
- test_bedrock_converse: non-streaming, streaming, with-image (3 tests)
- test_bedrock_invoke: non-streaming, with-image, streaming (3 tests)
- test_bedrock_error_handling: invalid model, invalid payload, error structure (3 tests)
Force-pushed 0bb60ed to 89f35b4
…oke decorator

Replace the implicit __init_subclass__ magic that auto-wrapped every subclass invoke method with an explicit @llmeter_invoke decorator. The decorator provides the same functionality (prepare_payload, timing, error handling, metadata back-fill) but is now visible at the definition site, making the contract explicit and allowing subclasses to opt out if they need raw control over invoke.

- Add the llmeter_invoke decorator to llmeter/endpoints/base.py
- Remove Endpoint.__init_subclass__ entirely
- Apply @llmeter_invoke to all 12 concrete invoke methods
- Export llmeter_invoke from llmeter.endpoints
- Update docs (key_concepts.md, connect_endpoints.md)
Our dummy ConcreteEndpoint class was not using the new decorator, so it didn't populate request_time properly.
Remove self._start_t and self._last_payload from the llmeter_invoke decorator. Per-call state no longer leaks onto the endpoint instance:
- Each invoke body captures its own local start_t via time.perf_counter() and passes it to parse_response directly
- The decorator uses a local start_t for the time_to_last_token back-fill
- input_payload on the response gets the mutated dict (what was actually sent to the API) for reproducibility
- _parse_payload receives a deepcopy snapshot taken before the API call, so prompt extraction is not affected by client-side mutations

Add 18 tests for the decorator covering payload mutation, invocation isolation, timing, error handling, metadata back-fill, and the decorator marker.
…path

The llmeter_invoke wrapper now calls parse_response(raw, start_t) automatically after invoke returns. Subclass invoke methods just return the raw API response — no start_t, no parse_response call, no type guards.
- invoke returns the raw API response (Any); the wrapper calls parse_response
- Remove isinstance checks from LiteLLM/LiteLLMStreaming invoke
- Remove start_t = time.perf_counter() from all invoke methods
- Update test fixtures to match the new invoke contract
- Sort imports and format with ruff
Closes #57, closes #60
What
Adds time-bound test runs, a live stats display, send-window-based throughput metrics, a standardized endpoint invoke lifecycle, prompt caching support, and fixes a `StopIteration` bug in invocation loops.

This PR combines the original time-bound runs feature with the endpoint lifecycle refactor (previously PR #61).
Endpoint lifecycle refactor
The `Endpoint` base class now provides a structured invoke lifecycle via `__init_subclass__` wrapping:

- `invoke(payload)` → API call + `parse_response()`
- `parse_response(raw_response, start_t)`
- `prepare_payload(payload, **kwargs)`

The wrapper automatically handles:

- Errors: an error `InvocationResponse` with the payload attached
- `time_to_last_token` back-filled for non-streaming endpoints
- `input_payload`, `input_prompt`, `id`, `request_time` always populated
- `_parse_payload` — extracts a human-readable prompt for observability and token counting fallback

Before/after (e.g. `OpenAIResponseEndpoint.invoke`)

Before (27 lines with 5 duplicate except handlers):
After (3 lines):
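As a rough illustration of the wrapped contract (a hypothetical minimal version, not the library's actual code), a decorator like `llmeter_invoke` might reduce each invoke body to just the raw API call while handling timing, parsing, and errors itself:

```python
# Illustration only: a simplified llmeter_invoke-style wrapper and a fake
# endpoint. Real InvocationResponse objects are richer than the plain
# dicts used here.
import time
from functools import wraps


def llmeter_invoke(fn):
    @wraps(fn)
    def wrapper(self, payload):
        start_t = time.perf_counter()
        try:
            raw = fn(self, payload)          # subclass body: raw API call only
            response = self.parse_response(raw, start_t)
        except Exception as exc:
            response = {"error": repr(exc)}  # error path still yields a response
        # Back-fill timing and metadata on both paths.
        response.setdefault("time_to_last_token", time.perf_counter() - start_t)
        response["input_payload"] = payload
        return response
    return wrapper


class FakeEndpoint:
    @llmeter_invoke
    def invoke(self, payload):
        return {"text": "ok"}  # stand-in for the provider API response

    def parse_response(self, raw, start_t):
        return {"response_text": raw["text"]}
```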
New InvocationResponse fields
- `request_time` (datetime UTC) — wall-clock time when the request was sent
- `num_tokens_input_cached` — input tokens served from the prompt cache (Bedrock + OpenAI)

Time-bound runs & live stats
- `run_duration` parameter for continuous-duration runs (mutually exclusive with `n_requests`)
- `LiveStatsDisplay` for real-time progress in Jupyter and terminals
- `RunningStats` for incremental stat accumulation
- `low_memory` mode that discards individual responses after stats extraction

Stats computation changes
- `RunningStats.to_stats()` takes `end_time` (datetime) instead of `total_requests` / `total_test_time`
- RPM uses the request send window (`first_request_time` to `last_request_time`)
- Output rates use the `[first_request, end_time]` window
- `Result` gains `first_request_time` / `last_request_time` fields

Additional improvements

- AWS `RequestId` as response ID for all Bedrock and SageMaker endpoints
- `RetryAttempts` for SageMaker (Bedrock already had this)
- `BEDROCK_STREAM_ERROR_TYPES` as a shared constant
- `CacheBuster` callback

Tests

- `request_time` and AWS RequestId assertions