feat: time-bound runs, live stats display, and send-window metrics #58

Open · acere wants to merge 13 commits into awslabs:main from acere:feature/time-bound-runs

Conversation

@acere (Collaborator) commented Apr 2, 2026

Closes #57, closes #60

What

Adds time-bound test runs, a live stats display, send-window-based throughput metrics, standardized endpoint invoke lifecycle, prompt caching support, and fixes a StopIteration bug in invocation loops.

This PR combines the original time-bound runs feature with the endpoint lifecycle refactor (previously PR #61).

Endpoint lifecycle refactor

The Endpoint base class now provides a structured invoke lifecycle via __init_subclass__ wrapping:

| Method | Required | Purpose |
| --- | --- | --- |
| `invoke(payload)` | Yes | API call + `parse_response()` |
| `parse_response(raw_response, start_t)` | Yes | Extract text, tokens, metadata |
| `prepare_payload(payload, **kwargs)` | No | Merge kwargs, inject `model_id`, etc. |

The wrapper automatically handles:

  • Error handling — exceptions → error InvocationResponse with payload attached
  • Timing — time_to_last_token back-filled for non-streaming endpoints
  • Metadata — input_payload, input_prompt, id, request_time always populated
  • _parse_payload — extracts human-readable prompt for observability and token counting fallback

Before/after (e.g. OpenAIResponseEndpoint.invoke)

Before (27 lines with 5 duplicate except handlers):

def invoke(self, payload, **kwargs):
    payload = {**kwargs, **payload}
    payload["model"] = self.model_id
    start_t = time.perf_counter()
    try:
        client_response = self._client.responses.create(**payload)
    except APIConnectionError as e:
        ...  # 5 identical except blocks
    response = self._parse_response(client_response, start_t)
    response.input_payload = payload
    response.input_prompt = self._parse_payload(payload)
    return response

After (3 lines):

def invoke(self, payload):
    client_response = self._client.responses.create(**payload)
    return self.parse_response(client_response, self._start_t)

New InvocationResponse fields

  • request_time (datetime UTC) — wall-clock time when the request was sent
  • num_tokens_input_cached — input tokens served from prompt cache (Bedrock + OpenAI)

Time-bound runs & live stats

  • run_duration parameter for continuous-duration runs (mutually exclusive with n_requests)
  • LiveStatsDisplay for real-time progress in Jupyter and terminals
  • RunningStats for incremental stat accumulation
  • low_memory mode that discards individual responses after stats extraction
  • RPM and throughput computed from request timestamps (send window), not response timestamps

Stats computation changes

  • RunningStats.to_stats() takes end_time (datetime) instead of total_requests/total_test_time
  • RPM uses request send window (first_request_time to last_request_time)
  • Output rates use [first_request, end_time] window
  • Result gains first_request_time / last_request_time fields
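The two windows can be illustrated with a small, self-contained sketch (assumed function shape; the real RunningStats API differs):

```python
from datetime import datetime, timedelta, timezone


def send_window_stats(request_times, total_output_tokens, end_time):
    """RPM over the send window; output rate over [first_request, end_time]."""
    first, last = min(request_times), max(request_times)
    # RPM: how fast requests were *sent*, immune to taper-off as clients finish
    send_window_s = (last - first).total_seconds()
    rpm = 60 * len(request_times) / send_window_s if send_window_s > 0 else None
    # Output rate: tokens produced over the whole [first_request, end_time] window
    run_s = (end_time - first).total_seconds()
    output_tps = total_output_tokens / run_s if run_s > 0 else None
    return {"requests_per_minute": rpm, "output_tps": output_tps,
            "first_request_time": first, "last_request_time": last}
```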

Additional improvements

  • Extract AWS RequestId as response ID for all Bedrock and SageMaker endpoints
  • Extract RetryAttempts for SageMaker (Bedrock already had this)
  • Preserve partial data on streaming errors instead of discarding
  • BEDROCK_STREAM_ERROR_TYPES as shared constant
  • Skip unknown stream events gracefully (forward-compatible)
  • Prompt caching demo notebook with CacheBuster callback
  • Updated docs: metrics table, key concepts, custom endpoint guide, run experiments

Tests

  • 751 unit tests pass
  • 6 new mid-stream error tests (TimeoutError, ConnectionError across 3 streaming endpoints)
  • Integration tests with request_time and AWS RequestId assertions
  • Prompt caching integration test with unique-per-run prefix

Comment thread llmeter/live_display.py Outdated
Comment on lines +20 to +29
_GROUP_PATTERNS: list[tuple[str, str]] = [
("rpm", "Throughput"),
("tps", "Throughput"),
("ttft", "TTFT"),
("ttlt", "TTLT"),
("token", "Tokens"),
("fail", "Errors"),
]

_GROUP_ORDER = ["Throughput", "TTFT", "TTLT", "Tokens", "Errors", "Other"]
Collaborator

Maybe these could be condensed to a single config variable like below?

_GROUP_PATTERNS = (
    ("Throughput", ("rpm", "tps")),
    ("TTFT", ("ttft",)),
    ("TTLT", ("ttlt",)),
    ("Tokens", ("token",)),
    ("Errors", ("fail",)),
    ("Other", ("",)),
)

If it's an immutable type like this, it could also nicely become the default value of a groups argument in the LiveStatsDisplay constructor, instead of a module-level constant?
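As a sketch of how the suggested single config could drive both grouping and ordering (hypothetical helper, not code from the PR):

```python
DEFAULT_GROUPS = (
    ("Throughput", ("rpm", "tps")),
    ("TTFT", ("ttft",)),
    ("TTLT", ("ttlt",)),
    ("Tokens", ("token",)),
    ("Errors", ("fail",)),
    ("Other", ("",)),  # empty pattern matches anything -> catch-all
)


def classify(stat_name, groups=DEFAULT_GROUPS):
    """Map a stat name to its display group; tuple order doubles as group order."""
    name = stat_name.lower()
    for group, patterns in groups:
        if any(p in name for p in patterns):
            return group
    return "Other"
```

Because the tuple is ordered, the same object replaces both _GROUP_PATTERNS and _GROUP_ORDER, and being immutable it is safe as a constructor default.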

Comment thread llmeter/results.py Outdated
stats = self._builtin_stats.copy()
else:
# Fallback: compute from responses (e.g. Result constructed manually)
stats = self._compute_stats(self)
Collaborator

Should this be cached back to _preloaded_stats so it's not recomputed on subsequent accesses?

Collaborator Author

Updated

Comment thread llmeter/results.py
result._preloaded_stats = None
else:
# Compute stats from the loaded responses
result._preloaded_stats = cls._compute_stats(result)
Collaborator

What happens to callback _contributed_stats when a result is saved to file and loaded again? It looks like, even if the contributed stats get saved to stats.json, they might be overridden/deleted here?

Collaborator Author

Yes, I missed it. Fixed, and also added a dedicated set of tests.

Comment thread llmeter/utils.py Outdated
Comment on lines +121 to +132
DEFAULT_SNAPSHOT_STATS: dict[str, tuple[str, ...] | str] = {
"rpm": "rpm",
"output_tps": "output_tps",
"p50_ttft": ("time_to_first_token", "p50"),
"p90_ttft": ("time_to_first_token", "p90"),
"p50_ttlt": ("time_to_last_token", "p50"),
"p90_ttlt": ("time_to_last_token", "p90"),
"p50_tps": ("time_per_output_token", "p50", "inv"),
"input_tokens": ("num_tokens_input", "sum"),
"output_tokens": ("num_tokens_output", "sum"),
"fail": "failed",
}
Collaborator

Not a big fan of defining name aliases at this level - shouldn't that be more of a display-level property?

It also feels weird that this class is separate from Result stats... I'd suggest instead revisiting the way Result itself computes stats and adding the capability for some stats to be calculated on a running basis during the Run. After all, callbacks can already choose to _update_contributed_stats at any time?

Then, the LiveStatsDisplay could just be configured which stats to pull (e.g. time_to_first_token-p50) with alias names / groups / whatever other display-level properties.

Collaborator Author

Restructured the relationship between Result and LiveStatsDisplay; it should now be more consistent.

Comment thread llmeter/results.py Outdated
Collaborator

Should this be optional now if n_requests is optional in _RunConfig?

Collaborator Author

Updated.

Comment thread llmeter/runner.py Outdated
tokenizer: Tokenizer | Any | None = None
clients: int = 1
n_requests: int | None = None
run_duration: int | float | None = None
Collaborator

Perhaps this should either be a timedelta type, or have a name that explicitly indicates its units?

Collaborator Author

Added timedelta type as option, and clarified in docstrings that any numerical type represents duration in seconds.
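The normalization described here might look roughly like this (a sketch under assumed names; the real code runs in _RunConfig's __post_init__):

```python
from datetime import timedelta


def normalize_run_duration(run_duration):
    """Normalize run_duration to seconds; any numerical type is treated as seconds."""
    if run_duration is None:
        return None
    if isinstance(run_duration, timedelta):
        return run_duration.total_seconds()
    return float(run_duration)
```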

Comment thread llmeter/runner.py Outdated
self._time_bound = self.run_duration is not None
if self._time_bound:
# For time-bound runs, _n_requests is unknown upfront
self._n_requests = 0
Collaborator

Do we need both n_requests and _n_requests? And the inconsistency of the public property being nullable while the private one's getting set to 0?

Collaborator Author

Combined into a single variable

Comment thread llmeter/runner.py Outdated
Comment on lines +563 to +567
async def _invoke_duration_c(
self,
payload: list[dict],
clients: int = 1,
) -> tuple[float, float, float]:
Collaborator

A bit concerned by the amount of duplication introduced by defining parallel _invoke_duration_c, _invoke_duration, _invoke_for_duration methods, rather than sharing anything with the corresponding _invoke_n... methods. Since these are all private, couldn't we consolidate some to a single method that tracks both the number and duration and terminates when either condition is met?
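One way to consolidate, as the comment suggests, is a single loop that checks both limits and terminates on whichever is hit first (an illustrative synchronous sketch with hypothetical names; the real methods are async and per-client):

```python
import itertools
import time


def invoke_loop(endpoint, payloads, n_requests=None, run_duration=None):
    """Send payloads until either n_requests is reached or run_duration (seconds) elapses."""
    if n_requests is None and run_duration is None:
        raise ValueError("either n_requests or run_duration is required")
    payload_iter = iter(itertools.cycle(payloads))
    deadline = None if run_duration is None else time.monotonic() + run_duration
    responses = []
    while True:
        if n_requests is not None and len(responses) >= n_requests:
            break  # count limit reached
        if deadline is not None and time.monotonic() >= deadline:
            break  # time limit reached
        responses.append(endpoint(next(payload_iter)))
    return responses
```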

Collaborator Author

Consolidated from 6 to 3 methods.

acere added a commit to acere/llmeter that referenced this pull request Apr 8, 2026
athewsey pushed a commit to acere/llmeter that referenced this pull request Apr 16, 2026
athewsey force-pushed the feature/time-bound-runs branch from 46d1454 to a1ab6b5 on April 16, 2026 at 06:09
acere added 9 commits April 16, 2026 22:44
Refactor the Endpoint base class to provide a structured invoke lifecycle
via __init_subclass__ wrapping:

- prepare_payload(payload, **kwargs) → merge kwargs, inject provider fields
- invoke(payload) → API call + parse_response() (abstract)
- parse_response(raw_response, start_t) → extract text/tokens/metadata (abstract)

The base class wrapper automatically provides:
- Error handling: exceptions → error InvocationResponse with partial data
- Timing: time_to_last_token back-fill for non-streaming endpoints
- Metadata: input_payload, input_prompt, id always populated
- _parse_payload for input prompt extraction (token counting fallback)

Additional improvements:
- Add num_tokens_input_cached field for prompt caching (Bedrock + OpenAI)
- Extract AWS RequestId as response ID for Bedrock and SageMaker
- Extract RetryAttempts for SageMaker (Bedrock already had this)
- Preserve partial data on streaming errors instead of discarding
- Define BEDROCK_STREAM_ERROR_TYPES as shared constant
- Skip unknown stream events gracefully (forward-compatible)
- Remove redundant try/except from all _parse_response methods
- Remove uuid4/error handling boilerplate from all endpoint subclasses
- Update docs: metrics table, key concepts, custom endpoint guide

Closes awslabs#60
- Add num_tokens_input_cached to Result.stats aggregation metrics
  and total_cached_input_tokens to run-level stats
- Add integration test for ConverseStream prompt caching with
  unique-per-run prefix to avoid stale cache hits
- Add 6 unit tests verifying mid-stream errors (TimeoutError,
  ConnectionError) are caught by the invoke wrapper for
  BedrockConverseStream, BedrockInvokeStream, and
  OpenAICompletionStreamEndpoint
- Add demo notebook comparing TTFT with/without prompt caching,
  using a CacheBuster callback to guarantee cache misses
- Sort imports across codebase (ruff --select I)
- Update metrics documentation with new stats fields
- Add `low_memory` parameter to Runner/run() that writes responses to
  disk without keeping them in memory, for large-scale test runs.
- Introduce `RunningStats` class that accumulates metrics incrementally
  (counts, sums, sorted values for percentile computation).
- Replace `_builtin_stats` cached_property on Result with `_preloaded_stats`
  populated by RunningStats during the run or from stats.json on load.
- Add `snapshot()` method on RunningStats for live progress-bar display
  of p50/p90 TTFT, p50/p90 TTLT, median tokens/s, total tokens, and
  failure count — configurable via `progress_bar_stats` parameter.
- Add `_compute_stats()` classmethod on Result as fallback for manually
  constructed Result objects and post-load_responses() recomputation.
- Update tests for the new stats flow.
Add run_duration parameter for time-bound test runs:
- New run_duration on Runner/run() and LoadTest: clients send requests
  continuously for a fixed duration instead of a fixed count.
- Dedicated _invoke_for_duration / _invoke_duration_c methods (separate
  from count-bound _invoke_n / _invoke_n_c).
- Time-based progress bar via _tick_time_bar async task.
- Mutual exclusivity validation between n_requests and run_duration.

Add LiveStatsDisplay for readable live metrics:
- New llmeter/live_display.py: HTML table in Jupyter (grouped columns
  for Throughput, TTFT, TTLT, Tokens, Errors), ANSI multi-line in
  terminals. Updates in-place, shows placeholders before first response.
- Replaces single-line tqdm postfix with a separate stats row.

Improve throughput metric accuracy:
- RunningStats.record_send() tracks send-side timestamps.
- RPM and output_tps use send window (first-to-last request sent)
  instead of response-side elapsed time, preventing taper-off as
  clients finish.
- output_tps (aggregate tokens/s) added to default snapshot stats.

Fix StopIteration silently terminating invocation loops:
- Both _invoke_n_no_wait and _invoke_for_duration now use while/next()
  instead of for-in-cycle() to prevent StopIteration from streaming
  endpoints from killing the loop.

Add LoadTest support for new features:
- run_duration, low_memory, progress_bar_stats forwarded to each run.

Add example notebook and documentation:
- examples/Time-bound runs with Bedrock OpenAI API.ipynb: end-to-end
  demo using bedrock-mantle endpoint with LoadTest, custom stats,
  low-memory mode, and comparison charts (RPM, TPS, TTFT, TTLT).
- docs/user_guide/run_experiments.md: new sections for time-bound runs,
  live progress-bar stats, and low-memory mode.

Add tests (51 new, 504 total):
- test_running_stats.py: record_send, update, to_stats, snapshot
  (placeholders, rpm, output_tps, send window, aggregations).
- test_live_display.py: _classify, _group_stats, _in_notebook,
  LiveStatsDisplay (disabled, terminal, overwrite, prefix).
- test_experiments.py: LoadTest with run_duration/low_memory/
  progress_bar_stats field storage and runner forwarding.
- test_runner.py: time-bound validation, _invoke_for_duration,
  full run with duration, output path, multiple clients.
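The StopIteration fix above can be illustrated with a minimal stand-in (hypothetical names; the real loop is async). Driving the cycle with explicit next() means a StopIteration raised by the endpoint itself cannot be conflated with iterator exhaustion, and handling it per-iteration keeps the loop alive:

```python
import itertools


def run_n(endpoint, payloads, n):
    """Invoke n times, cycling payloads; endpoint errors don't kill the loop."""
    payload_iter = iter(itertools.cycle(payloads))
    results = []
    while len(results) < n:
        payload = next(payload_iter)  # explicit next(): only *this* call advances
        try:
            results.append(("ok", endpoint(payload)))
        except StopIteration as exc:  # e.g. raised by a streaming endpoint
            results.append(("error", repr(exc)))
    return results
```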
Consolidate live display config (review comment 1):
- Merge _GROUP_PATTERNS + _GROUP_ORDER into single DEFAULT_GROUPS tuple
- Make groups a constructor parameter on LiveStatsDisplay

Move display aliases from RunningStats to LiveStatsDisplay (comment 4):
- Remove RunningStats.snapshot() and DEFAULT_SNAPSHOT_STATS
- Add rpm/output_tps as regular keys in RunningStats.to_stats()
- Add LiveStatsDisplay.format_stats() owning alias mapping + formatting
- New DEFAULT_DISPLAY_STATS in live_display.py maps display labels to
  canonical stat keys (e.g. "time_to_first_token-p50")
- Runner passes raw to_stats() output; display handles the rest

Cache fallback stats computation (comment 2):
- Result.stats property caches _compute_stats back to _preloaded_stats

Preserve contributed stats on load (comment 3):
- Result.load(load_responses=True) merges extra keys from stats.json
  so callback-contributed stats survive save/load round-trips

Make Result fields optional (comment 5):
- total_requests, clients, n_requests now optional to match _RunConfig

Accept timedelta for run_duration (comment 6):
- run_duration accepts int | float | timedelta; normalized in __post_init__

Remove _n_requests indirection (comment 7):
- Eliminated private _n_requests; n_requests set directly to resolved value

Consolidate invoke methods (comment 8):
- Merged 6 methods into 3: _invoke_n_no_wait (n + duration),
  _invoke_client (replaces _invoke_n/_invoke_duration),
  _invoke_clients (replaces _invoke_n_c/_invoke_duration_c)

Tests:
- Add TestContributedStatsRoundTrip (8 tests) for save/load round-trips
- Add TestSendWindowStats for rpm/output_tps in to_stats()
- Add TestFormatStat for display formatting
- Update all tests for renamed methods and new APIs
Add a `request_time` (datetime UTC) field to InvocationResponse that
records the wall-clock time when each request was sent. The invoke
wrapper sets it automatically — both on success and error paths — so
no endpoint subclass changes are needed.

This enables time-series analysis of latency data and is required by
the send-window throughput metrics in RunningStats.
Use the request_time stamps captured on InvocationResponse to drive
rate statistics instead of the old record_send() / perf_counter approach.

RunningStats changes:
- Remove record_send() — send timestamps are now derived from
  request_time on each response via update()
- to_stats() takes end_time (datetime) instead of total_requests /
  total_test_time — RPM uses the request send window, output rates
  use [first_request, end_time]
- _send_window() computes elapsed seconds from datetime objects

Result changes:
- Add first_request_time / last_request_time fields, populated from
  RunningStats at end of run
- Datetime serialization updated for the new fields

Runner changes:
- Remove record_send() call from invoke loop
- Pass end_time to to_stats() and first/last request times to Result
- Fix DEFAULT_DISPLAY_STATS keys: "rpm" → "requests_per_minute",
  "num_tokens_input-sum" → "total_input_tokens",
  "num_tokens_output-sum" → "total_output_tokens"
- Fix _format_stat: match "_per_minute" instead of "rpm", remove
  whole-number-float-to-int coercion (100.0 stays "100.0")
- Update run_experiments.md doc example
- Update test_experiments.py and test_live_display.py accordingly
Verify that request_time is always set on InvocationResponse — both
on success and error paths — across all Bedrock integration tests:

- test_bedrock_converse: non-streaming, streaming, with-image (3 tests)
- test_bedrock_invoke: non-streaming, with-image, streaming (3 tests)
- test_bedrock_error_handling: invalid model, invalid payload,
  error structure (3 tests)
acere and others added 4 commits April 17, 2026 13:53
…oke decorator

Replace the implicit __init_subclass__ magic that auto-wrapped every
subclass invoke method with an explicit @llmeter_invoke decorator.

The decorator provides the same functionality (prepare_payload, timing,
error handling, metadata back-fill) but is now visible at the definition
site, making the contract explicit and allowing subclasses to opt out
if they need raw control over invoke.

- Add llmeter_invoke decorator to llmeter/endpoints/base.py
- Remove Endpoint.__init_subclass__ entirely
- Apply @llmeter_invoke to all 12 concrete invoke methods
- Export llmeter_invoke from llmeter.endpoints
- Update docs (key_concepts.md, connect_endpoints.md)
Our dummy ConcreteEndpoint class was not using the new decorator, so it didn't populate request_time properly.
Remove self._start_t and self._last_payload from the llmeter_invoke
decorator. Per-call state no longer leaks onto the endpoint instance:

- Each invoke body captures its own local start_t via time.perf_counter()
  and passes it to parse_response directly
- The decorator uses a local start_t for time_to_last_token back-fill
- input_payload on the response gets the mutated dict (what was actually
  sent to the API) for reproducibility
- _parse_payload receives a deepcopy snapshot taken before the API call,
  so prompt extraction is not affected by client-side mutations

Add 18 tests for the decorator covering payload mutation, invocation
isolation, timing, error handling, metadata back-fill, and the decorator
marker.
…path

The llmeter_invoke wrapper now calls parse_response(raw, start_t)
automatically after invoke returns. Subclass invoke methods just
return the raw API response — no start_t, no parse_response call,
no type guards.

- invoke returns raw API response (Any), wrapper calls parse_response
- Remove isinstance checks from LiteLLM/LiteLLMStreaming invoke
- Remove start_t = time.perf_counter() from all invoke methods
- Update test fixtures to match new invoke contract
- Sort imports and format with ruff
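Taken together, the final decorator-based contract might look roughly like this (a dict-based sketch under assumed shapes; only the llmeter_invoke name and the method split come from the commits above):

```python
import functools
import time


def llmeter_invoke(fn):
    """Wrap invoke(): prepare the payload, time the call, convert exceptions
    into an error response, and parse the raw API result."""
    @functools.wraps(fn)
    def wrapper(self, payload, **kwargs):
        payload = self.prepare_payload(payload, **kwargs)
        start_t = time.perf_counter()
        try:
            raw = fn(self, payload)  # subclass returns the raw API response
        except Exception as exc:
            return {"error": repr(exc), "input_payload": payload}
        response = self.parse_response(raw, start_t)
        response["input_payload"] = payload
        return response

    wrapper._is_llmeter_invoke = True  # marker: visible at the definition site
    return wrapper


class DummyEndpoint:
    def prepare_payload(self, payload, **kwargs):
        return {**kwargs, **payload}

    @llmeter_invoke
    def invoke(self, payload):
        return {"text": payload["prompt"].upper()}  # stand-in raw API response

    def parse_response(self, raw, start_t):
        return {"response_text": raw["text"],
                "time_to_last_token": time.perf_counter() - start_t}
```

The subclass's invoke body shrinks to just the API call, while the explicit decorator (unlike the earlier __init_subclass__ wrapping) makes the contract visible and lets a subclass opt out.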