Make load_test.py use strict token-id prompts (PER-75) #119
ishaan-shivhare wants to merge 2 commits into
TranslationDataset now builds exact token-id sequences with a deterministic shared prefix (matching prefill_load_test/gen_load_test semantics) instead of approximate text chunking. Token-id prompts are sent via /v1/completions even when --chat is set (chat template applied client-side).

Fixes PER-75: prompt_cache_max_len now corresponds to the first N shareable token IDs across requests, not an estimated text prefix budget.

Adds unit tests for exact lengths, shared-prefix caching, and provider wiring.

Co-authored-by: Ishaan Shivhare <ishaan-shivhare@users.noreply.github.com>
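To illustrate the exact-length, shared-prefix semantics described in this commit, here is a minimal sketch. `build_prompt_ids` is a hypothetical stand-in, not the real `build_pair_ids` from prefill_load_test, and it draws made-up integer ids from a seeded RNG rather than using a tokenizer.

```python
import random


def build_prompt_ids(prompt_tokens, cached_tokens, request_seed,
                     vocab_size=32000, prefix_seed=0):
    """Hypothetical helper: return exactly `prompt_tokens` ids whose first
    `cached_tokens` ids are the same for every request."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    # Shared prefix: a fixed seed yields the same ids on every call.
    prefix_rng = random.Random(prefix_seed)
    prefix = [prefix_rng.randrange(vocab_size) for _ in range(cached_tokens)]
    # Per-request suffix: seeded by the request, so it varies across requests.
    suffix_rng = random.Random(f"suffix-{request_seed}")
    suffix = [suffix_rng.randrange(vocab_size)
              for _ in range(prompt_tokens - cached_tokens)]
    return prefix + suffix


a = build_prompt_ids(prompt_tokens=16, cached_tokens=12, request_seed=1)
b = build_prompt_ids(prompt_tokens=16, cached_tokens=12, request_seed=2)
```

Here `len(a) == len(b) == 16` and `a[:12] == b[:12]`, which is the property the new unit tests are described as checking.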
- Import build_pair_ids/split_chat_template from prefill_load_test instead of duplicating ~90 lines of helpers
- Fix Bugbot: route token-id prompts to /v1/completions inline in _do_generate_text instead of passing prompt to get_url(), which broke TogetherProvider/TgiProvider/Triton providers
- Slim TranslationDataset API (dataset name instead of path/prompt)

Co-authored-by: Ishaan Shivhare <ishaan-shivhare@users.noreply.github.com>
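The routing fix in this commit can be sketched roughly as follows. The class names, the `/generate` path, and the `select_url` helper are all hypothetical. The only points taken from the commit message are that `get_url()` keeps its zero-argument signature and that token-id prompts are sent to /v1/completions.

```python
def is_token_id_prompt(prompt):
    """True for a non-empty list made up only of ints."""
    return (isinstance(prompt, list) and bool(prompt)
            and all(isinstance(t, int) for t in prompt))


class BaseProvider:
    # Hypothetical provider; get_url takes no prompt argument.
    def get_url(self):
        return "/v1/chat/completions"


class OverridingProvider(BaseProvider):
    # Stands in for providers that override get_url(self) with no extra
    # parameter; passing a prompt to it would raise TypeError.
    def get_url(self):
        return "/generate"


def select_url(provider, prompt):
    """Choose the endpoint at the call site instead of inside get_url()."""
    if is_token_id_prompt(prompt):
        return "/v1/completions"
    return provider.get_url()


url_for_ids = select_url(OverridingProvider(), [101, 7592, 102])
url_for_text = select_url(OverridingProvider(), "hello")
```

Because the decision is made by the caller, subclasses that override `get_url(self)` keep working unchanged.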
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit d4a7977.
if self.parsed_options.reasoning_effort is not None:
    data["reasoning_effort"] = self.parsed_options.reasoning_effort
if isinstance(prompt, str):
if isinstance(prompt, list) and prompt and isinstance(prompt[0], int):
Rerank breaks on token prompts
High Severity
With the default limericks/code datasets, prompts are now token-id lists, but the rerank branch still treats any list as the document list. Each token id becomes its own document instead of paragraph-split text, so /v1/rerank requests no longer match the documented workflow.
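One possible guard for the rerank path is sketched below. `rerank_documents` is a hypothetical helper, and splitting on blank lines is an assumption about what "paragraph-split text" means here. The sketch accepts text or a list of strings and rejects token-id lists instead of turning each id into a document.

```python
def rerank_documents(prompt):
    """Hypothetical helper: return the documents for a rerank request,
    or raise if the prompt is not text (for example a token-id list)."""
    if isinstance(prompt, str):
        # Assumption: paragraphs are separated by blank lines.
        return [p.strip() for p in prompt.split("\n\n") if p.strip()]
    if (isinstance(prompt, list) and prompt
            and all(isinstance(p, str) for p in prompt)):
        return prompt
    raise TypeError("rerank needs text documents, not token ids")


docs = rerank_documents("first paragraph\n\nsecond paragraph")
```

Failing loudly at this point would surface the mismatch at request-build time rather than as unexpected /v1/rerank traffic.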
)
elif options.dataset in ("limericks", "code"):
if options.dataset == "limericks":
if options.prompt is None:
Custom prompt flag ignored
Medium Severity
The --prompt CLI option is still documented, but limericks/code dataset setup no longer reads it after switching to build_pair_ids. Custom task instructions are dropped with no warning, so runs that relied on --prompt no longer match prior behavior.
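A lightweight way to avoid silently dropping the flag is sketched below. `prompt_option_warning` is a hypothetical helper, and the option names `dataset` and `prompt` are assumed to match the attributes visible in the diff context above. It only produces a message; whether to print it or raise at startup is left to the caller.

```python
import argparse
from typing import Optional


def prompt_option_warning(options: argparse.Namespace) -> Optional[str]:
    """Hypothetical check: return a message when --prompt would be ignored."""
    if options.dataset in ("limericks", "code") and options.prompt is not None:
        return ("--prompt is ignored for the %s dataset because prompts are "
                "built from token ids" % options.dataset)
    return None


opts = argparse.Namespace(dataset="limericks", prompt="Translate to French")
message = prompt_option_warning(opts)
```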


Summary
Makes load_test.py unconditionally strict for limericks/code datasets: prompts are built as exact token-id sequences with a deterministic shared prefix, reusing build_pair_ids/split_chat_template from prefill_load_test.py.

Deployment experiments were configured for 75% cache hit (--prompt-cache-max-len=37500 on 50k prompts) but measured ~58% because TranslationDataset estimated token counts from text chunks instead of enforcing shareable prefix token IDs.

Changes
- TranslationDataset: uses prefill_load_test helpers, giving exact len(prompt) == prompt_tokens, with the first cached_tokens IDs identical across requests
- Token-id prompts are sent via /v1/completions (chat template applied client-side); URL routing done inline in _do_generate_text so other providers' get_url() signatures are unchanged
- OpenAIProvider: handles list[int] prompts in format_payload
- forced_generation field
- --prompt-images-with-resolutions + limericks/code raises at startup

Bugbot fix
Previous version passed prompt to get_url(), which broke TogetherProvider, TgiProvider, and Triton providers (they override get_url(self) with no extra parameter). Fixed by selecting /v1/completions inline when the prompt is a token-id list.

What this fixes vs follow-up
cached_tokens + 1)
prompt_cache_max_len = shareable prefix semantics
split_chat_template/encode_messages

Testing
All 5 tests pass.

Suggested validation on cluster: re-run the DSV4 deployment experiment (--prompt-cache-max-len=37500, 50k prompt) and confirm measured cache hit ≈ 75% (modulo warmup).
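As a quick offline sanity check before the cluster run, the expected hit ratio can be compared with the prefix that generated prompts actually share. This is a minimal sketch with hypothetical helper names, assuming the hit ratio is simply shared prefix length divided by prompt length; it ignores warmup and any server-side cache behavior.

```python
def expected_cache_hit(cached_tokens, prompt_tokens):
    """Fraction of each prompt covered by the shared prefix."""
    return cached_tokens / prompt_tokens


def shared_prefix_len(prompts):
    """Length of the longest prefix common to all token-id prompts."""
    if not prompts:
        return 0
    first = prompts[0]
    shortest = min(len(p) for p in prompts)
    for i in range(shortest):
        if any(p[i] != first[i] for p in prompts):
            return i
    return shortest


# Configured target from the experiment: 37500 cached ids on 50000-id prompts.
target = expected_cache_hit(37500, 50000)  # 0.75

# Toy prompts standing in for generated token-id sequences.
toy = [[1, 2, 3, 4], [1, 2, 3, 9], [1, 2, 3, 7]]
toy_ratio = expected_cache_hit(shared_prefix_len(toy), len(toy[0]))  # 0.75
```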