Make load_test.py use strict token-id prompts (PER-75)#119

Draft
ishaan-shivhare wants to merge 2 commits into main from cursor/load-test-strict-prompts-bdbf

@ishaan-shivhare ishaan-shivhare commented Jun 5, 2026

Contributor

Summary

Makes load_test.py unconditionally strict for limericks/code datasets: prompts are built as exact token-id sequences with a deterministic shared prefix, reusing build_pair_ids / split_chat_template from prefill_load_test.py.

Deployment experiments were configured for a 75% cache hit rate (--prompt-cache-max-len=37500 on 50k-token prompts) but measured ~58%, because TranslationDataset estimated token counts from text chunks instead of enforcing a shared prefix of identical token IDs.
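The 75% target follows directly from the two numbers above. A minimal sketch of that arithmetic, assuming the hit rate is simply the shared-prefix length divided by the prompt length (the function name `expected_cache_hit` is illustrative and not part of load_test.py):

```python
def expected_cache_hit(prompt_cache_max_len: int, prompt_tokens: int) -> float:
    """Fraction of prompt tokens that fall inside the shared prefix.

    Hypothetical helper for illustration; ignores warmup and per-worker effects.
    """
    if prompt_tokens <= 0:
        raise ValueError("prompt_tokens must be positive")
    # The shared prefix can never be longer than the prompt itself.
    shared = min(prompt_cache_max_len, prompt_tokens)
    return shared / prompt_tokens


if __name__ == "__main__":
    # Values from the deployment experiment described above.
    print(expected_cache_hit(37_500, 50_000))  # 0.75
```

This only holds when the first 37500 token IDs really are identical across requests, which is what the strict prompts are meant to guarantee.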

Changes

  • TranslationDataset: uses prefill_load_test helpers — exact len(prompt) == prompt_tokens, first cached_tokens IDs identical across requests
  • Request path: token-id prompts go to /v1/completions (chat template applied client-side); URL routing done inline in _do_generate_text so other providers' get_url() signatures are unchanged
  • OpenAIProvider: handles list[int] prompts in format_payload
  • Forced generation from dataset: decodes token ids to text for the forced_generation field
  • Incompatibility: --prompt-images-with-resolutions + limericks/code raises at startup
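The two invariants in the first bullet (exact length, identical first cached_tokens IDs) can be sketched as follows. This is not the `build_pair_ids` implementation from prefill_load_test.py; `build_strict_prompt`, its parameters, and the random suffix are assumptions made only to illustrate the property:

```python
import random
from typing import List, Sequence


def build_strict_prompt(
    prefix_ids: Sequence[int],
    prompt_tokens: int,
    cached_tokens: int,
    vocab_size: int,
    rng: random.Random,
) -> List[int]:
    """Return exactly `prompt_tokens` IDs whose first `cached_tokens` IDs are shared.

    Hypothetical helper: the real code draws tokens from the dataset, while
    this sketch fills the per-request suffix with random IDs.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    if len(prefix_ids) < cached_tokens:
        raise ValueError("prefix_ids is shorter than cached_tokens")
    shared = list(prefix_ids[:cached_tokens])  # identical for every request
    suffix = [rng.randrange(vocab_size) for _ in range(prompt_tokens - cached_tokens)]
    return shared + suffix


if __name__ == "__main__":
    prefix = list(range(100, 200))  # stand-in for a deterministic shared prefix
    a = build_strict_prompt(prefix, 40, 30, 50_000, random.Random(1))
    b = build_strict_prompt(prefix, 40, 30, 50_000, random.Random(2))
    print(len(a), len(b), a[:30] == b[:30])
```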

Bugbot fix

Previous version passed prompt to get_url(), which broke TogetherProvider, TgiProvider, and Triton providers (they override get_url(self) with no extra parameter). Fixed by selecting /v1/completions inline when the prompt is a token-id list.
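A rough sketch of the inline routing idea, assuming the token-id check mirrors the `isinstance` test visible in the review diff. The names `is_token_id_prompt` and `select_url` are illustrative; the actual logic lives inside `_do_generate_text`, and `provider_url` stands in for whatever the provider's unchanged `get_url()` returns:

```python
from typing import List, Union

Prompt = Union[str, List[int], List[str]]


def is_token_id_prompt(prompt: Prompt) -> bool:
    """True for a non-empty list whose first element is an int."""
    return isinstance(prompt, list) and bool(prompt) and isinstance(prompt[0], int)


def select_url(base_url: str, provider_url: str, prompt: Prompt) -> str:
    """Pick the request URL without changing any provider's get_url() signature.

    Hypothetical helper: `provider_url` is the value the provider already
    returns from get_url(); it is only overridden for token-id prompts.
    """
    if is_token_id_prompt(prompt):
        return base_url.rstrip("/") + "/v1/completions"
    return provider_url


if __name__ == "__main__":
    base = "http://localhost:8000"
    chat = base + "/v1/chat/completions"
    print(select_url(base, chat, [1, 2, 3]))
    print(select_url(base, chat, "hello"))
```

Keeping the decision at the call site means providers that define `get_url(self)` with no extra parameter never see the prompt.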

What this fixes vs follow-up

| Fixed | Follow-up (not in this PR) |
| --- | --- |
| Exact prompt lengths | KV cache warmup (cached_tokens + 1) |
| prompt_cache_max_len = shareable prefix semantics | Per-worker cache priming under LB |
| DSV4-safe via split_chat_template / encode_messages | |

Testing

cd llm_bench && python3 -m unittest test_load_test_strict.py -v

All 5 tests pass.

Suggested validation on cluster: re-run the DSV4 deployment experiment (--prompt-cache-max-len=37500, 50k-token prompts) and confirm the measured cache hit rate is ≈ 75% (modulo warmup).

TranslationDataset now builds exact token-id sequences with a deterministic
shared prefix (matching prefill_load_test/gen_load_test semantics) instead of
approximate text chunking. Token-id prompts are sent via /v1/completions
even when --chat is set (chat template applied client-side).

Fixes PER-75: prompt_cache_max_len now corresponds to the first N shareable
token IDs across requests, not an estimated text prefix budget.

Adds unit tests for exact lengths, shared-prefix caching, and provider wiring.

Co-authored-by: Ishaan Shivhare <ishaan-shivhare@users.noreply.github.com>
Comment thread llm_bench/load_test.py Outdated
- Import build_pair_ids/split_chat_template from prefill_load_test instead
  of duplicating ~90 lines of helpers
- Fix Bugbot: route token-id prompts to /v1/completions inline in
  _do_generate_text instead of passing prompt to get_url(), which broke
  TogetherProvider/TgiProvider/Triton providers
- Slim TranslationDataset API (dataset name instead of path/prompt)

Co-authored-by: Ishaan Shivhare <ishaan-shivhare@users.noreply.github.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit d4a7977. Configure here.

Comment thread llm_bench/load_test.py
if self.parsed_options.reasoning_effort is not None:
data["reasoning_effort"] = self.parsed_options.reasoning_effort
if isinstance(prompt, str):
if isinstance(prompt, list) and prompt and isinstance(prompt[0], int):

Rerank breaks on token prompts

High Severity

With the default limericks/code datasets, prompts are now token-id lists, but the rerank branch still treats any list as the document list. Each token id becomes its own document instead of paragraph-split text, so /v1/rerank requests no longer match the documented workflow.
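One possible shape for a guard, sketched under assumptions: the function name `rerank_documents` is hypothetical, and splitting on blank lines is only a stand-in for whatever paragraph splitting the rerank path actually performs. The point is that a token-id list is rejected before it can be mistaken for a document list:

```python
from typing import List, Union


def rerank_documents(prompt: Union[str, List[str], List[int]]) -> List[str]:
    """Return the document list for a rerank request.

    Hypothetical helper: rejects token-id prompts instead of treating each
    token id as its own document.
    """
    if isinstance(prompt, list) and prompt and isinstance(prompt[0], int):
        raise ValueError("rerank needs text documents, got a token-id prompt")
    if isinstance(prompt, str):
        # Assumed splitting rule: paragraphs separated by blank lines.
        return [p.strip() for p in prompt.split("\n\n") if p.strip()]
    return [str(doc) for doc in prompt]


if __name__ == "__main__":
    print(rerank_documents("first paragraph\n\nsecond paragraph"))
```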

Comment thread llm_bench/load_test.py
)
elif options.dataset in ("limericks", "code"):
if options.dataset == "limericks":
if options.prompt is None:

Custom prompt flag ignored

Medium Severity

The --prompt CLI option is still documented, but limericks/code dataset setup no longer reads it after switching to build_pair_ids. Custom task instructions are dropped with no warning, so runs that relied on --prompt no longer match prior behavior.
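A minimal sketch of surfacing the dropped flag at startup rather than ignoring it silently. The helper name `warn_if_prompt_ignored` is hypothetical; only `options.dataset` and `options.prompt` are taken from the diff context above, and whether to warn or raise is a design choice left to the author:

```python
import argparse
import warnings


def warn_if_prompt_ignored(options: argparse.Namespace) -> bool:
    """Warn when --prompt is set but the strict dataset path will not use it.

    Hypothetical helper; returns True if a warning was issued.
    """
    if options.dataset in ("limericks", "code") and options.prompt is not None:
        warnings.warn(
            "--prompt is ignored for the limericks/code datasets because "
            "prompts are built from token ids",
            stacklevel=2,
        )
        return True
    return False


if __name__ == "__main__":
    opts = argparse.Namespace(dataset="limericks", prompt="Translate this")
    print(warn_if_prompt_ignored(opts))
```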
