Make load_test.py use strict token-id prompts (PER-75) by ishaan-shivhare · Pull Request #119 · fw-ai/benchmark

ishaan-shivhare · 2026-06-05T20:28:27Z

Summary

Makes load_test.py unconditionally strict for limericks/code datasets: prompts are built as exact token-id sequences with a deterministic shared prefix, reusing build_pair_ids / split_chat_template from prefill_load_test.py.

Deployment experiments were configured for 75% cache hit (--prompt-cache-max-len=37500 on 50k prompts) but measured ~58% because TranslationDataset estimated token counts from text chunks instead of enforcing shareable prefix token IDs.
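
For reference, the 75% target is the shared-prefix length divided by the prompt length. A minimal sketch of that arithmetic follows, assuming "50k" means 50,000 tokens and ignoring warmup; `expected_cache_hit` is a hypothetical helper, not a function from this repository.

```python
def expected_cache_hit(prompt_tokens: int, shared_prefix_tokens: int) -> float:
    """Ideal cache hit fraction when every request shares the same token prefix.

    Hypothetical helper for illustration only; it ignores warmup requests
    and any per-worker cache effects.
    """
    if prompt_tokens <= 0:
        raise ValueError("prompt_tokens must be positive")
    if not 0 <= shared_prefix_tokens <= prompt_tokens:
        raise ValueError("shared prefix must fit inside the prompt")
    return shared_prefix_tokens / prompt_tokens


# Values from the deployment experiment described above.
target = expected_cache_hit(prompt_tokens=50_000, shared_prefix_tokens=37_500)
print(f"{target:.0%}")  # prints 75%
```

The measured ~58% falls short of this bound, which is consistent with the text-chunk estimate producing a shorter shareable token prefix than configured.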

Changes

- TranslationDataset: uses prefill_load_test helpers, so len(prompt) == prompt_tokens exactly and the first cached_tokens IDs are identical across requests
- Request path: token-id prompts go to /v1/completions (chat template applied client-side); URL routing is done inline in _do_generate_text so other providers' get_url() signatures are unchanged
- OpenAIProvider: handles list[int] prompts in format_payload
- Forced generation from dataset: decodes token ids to text for the forced_generation field
- Incompatibility: --prompt-images-with-resolutions + limericks/code raises at startup
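
The two invariants in the first item (exact length, identical leading IDs) can be illustrated with plain integer lists. This is only a sketch under stated assumptions: `build_strict_prompt` is a hypothetical stand-in and does not reproduce the real `build_pair_ids` logic, which also handles the chat template.

```python
import random


def build_strict_prompt(prefix_ids, pool_ids, prompt_tokens, cached_tokens, rng):
    """Return exactly `prompt_tokens` IDs whose first `cached_tokens` IDs are shared.

    Hypothetical illustration: the shared part is a fixed slice of `prefix_ids`,
    and the remainder is sampled per request from `pool_ids`.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    if len(prefix_ids) < cached_tokens:
        raise ValueError("prefix_ids is shorter than cached_tokens")
    shared = list(prefix_ids[:cached_tokens])  # deterministic across requests
    suffix = [rng.choice(pool_ids) for _ in range(prompt_tokens - cached_tokens)]
    return shared + suffix


rng = random.Random(0)
prefix = list(range(100, 200))   # stand-in for tokenized shared text
pool = list(range(1000, 2000))   # stand-in for per-request token IDs
first = build_strict_prompt(prefix, pool, 20, 15, rng)
second = build_strict_prompt(prefix, pool, 20, 15, rng)
assert len(first) == len(second) == 20
assert first[:15] == second[:15]
```

Because the suffix here is sampled at random, two prompts could by chance agree for a few IDs past the shared prefix; the sketch guarantees a minimum shared length, not an exact one.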

Bugbot fix

Previous version passed prompt to get_url(), which broke TogetherProvider, TgiProvider, and Triton providers (they override get_url(self) with no extra parameter). Fixed by selecting /v1/completions inline when the prompt is a token-id list.
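
A minimal sketch of the failure and the inline fix, under assumptions: the class names, `resolve_url`, and the non-completions endpoint strings below are hypothetical placeholders, and the token-id check mirrors the `isinstance` test shown in the review diff.

```python
class BaseProvider:
    def get_url(self):
        return "/v1/chat/completions"  # placeholder path for illustration


class LegacyProvider(BaseProvider):
    # Stands in for a provider that overrides get_url(self) with no extra parameter.
    def get_url(self):
        return "/generate"  # placeholder path for illustration


def resolve_url(provider, prompt):
    """Pick the endpoint without changing any provider's get_url() signature."""
    if isinstance(prompt, list) and prompt and isinstance(prompt[0], int):
        return "/v1/completions"  # token-id prompts bypass get_url()
    return provider.get_url()


provider = LegacyProvider()

# The earlier approach: passing the prompt into an override that takes no argument.
try:
    provider.get_url([1, 2, 3])
except TypeError as exc:
    print("broken call:", exc)

print(resolve_url(provider, [1, 2, 3]))
print(resolve_url(provider, "plain text prompt"))
```

Keeping the decision at the call site means overrides written against the zero-argument signature keep working unmodified.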

What this fixes vs follow-up

| Fixed | Follow-up (not in this PR) |
| --- | --- |
| Exact prompt lengths | KV cache warmup (`cached_tokens + 1`) |
| `prompt_cache_max_len` = shareable prefix semantics | Per-worker cache priming under LB |
| DSV4-safe via `split_chat_template` / `encode_messages` | |

Testing

cd llm_bench && python3 -m unittest test_load_test_strict.py -v

All 5 tests pass.

Suggested validation on cluster: re-run the DSV4 deployment experiment (--prompt-cache-max-len=37500, 50k prompt) and confirm measured cache hit ≈ 75% (modulo warmup).

Slack Thread

TranslationDataset now builds exact token-id sequences with a deterministic shared prefix (matching prefill_load_test/gen_load_test semantics) instead of approximate text chunking. Token-id prompts are sent via /v1/completions even when --chat is set (chat template applied client-side).

Fixes PER-75: prompt_cache_max_len now corresponds to the first N shareable token IDs across requests, not an estimated text prefix budget.

Adds unit tests for exact lengths, shared-prefix caching, and provider wiring.

Co-authored-by: Ishaan Shivhare <ishaan-shivhare@users.noreply.github.com>

- Import build_pair_ids/split_chat_template from prefill_load_test instead of duplicating ~90 lines of helpers
- Fix Bugbot: route token-id prompts to /v1/completions inline in _do_generate_text instead of passing prompt to get_url(), which broke TogetherProvider/TgiProvider/Triton providers
- Slim TranslationDataset API (dataset name instead of path/prompt)

Co-authored-by: Ishaan Shivhare <ishaan-shivhare@users.noreply.github.com>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit d4a7977.

cursor · 2026-06-05T21:11:55Z

        if self.parsed_options.reasoning_effort is not None:
            data["reasoning_effort"] = self.parsed_options.reasoning_effort
-        if isinstance(prompt, str):
+        if isinstance(prompt, list) and prompt and isinstance(prompt[0], int):


Rerank breaks on token prompts

High Severity

With the default limericks/code datasets, prompts are now token-id lists, but the rerank branch still treats any list as the document list. Each token id becomes its own document instead of paragraph-split text, so /v1/rerank requests no longer match the documented workflow.
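
One way to guard against this, sketched under assumptions: `is_token_ids` and `rerank_documents` are hypothetical names, and splitting on blank lines is an assumed reading of "paragraph-split text". The rerank path rejects token-id lists explicitly instead of treating each ID as a document.

```python
def is_token_ids(prompt) -> bool:
    """True for a non-empty list made up entirely of integers."""
    return (
        isinstance(prompt, list)
        and bool(prompt)
        and all(isinstance(item, int) for item in prompt)
    )


def rerank_documents(prompt):
    """Return the document list for a rerank request (hypothetical helper)."""
    if is_token_ids(prompt):
        # Fail loudly rather than sending one document per token ID.
        raise TypeError("rerank needs text documents, got a token-id prompt")
    if isinstance(prompt, str):
        # Assumed paragraph split: blank-line separated chunks.
        return [part.strip() for part in prompt.split("\n\n") if part.strip()]
    return [str(doc) for doc in prompt]


print(rerank_documents("first paragraph\n\nsecond paragraph"))
```

Whether rerank should raise, or decode the IDs back to text, is a design choice this sketch does not settle.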

cursor · 2026-06-05T21:11:55Z

            )
        elif options.dataset in ("limericks", "code"):
-            if options.dataset == "limericks":
-                if options.prompt is None:


Custom prompt flag ignored

Medium Severity

The --prompt CLI option is still documented, but limericks/code dataset setup no longer reads it after switching to build_pair_ids. Custom task instructions are dropped with no warning, so runs that relied on --prompt no longer match prior behavior.
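
A possible mitigation is to warn at startup when the flag would be ignored. The sketch below uses plain `argparse` and a hypothetical `check_prompt_flag` helper; the real load_test.py option parsing and the `limericks` default shown here are assumptions and may differ.

```python
import argparse
import warnings


def build_parser() -> argparse.ArgumentParser:
    # Reduced, hypothetical parser containing only the two options discussed here.
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", default="limericks")  # assumed default
    parser.add_argument("--prompt", default=None)
    return parser


def check_prompt_flag(options) -> bool:
    """Warn and return True when --prompt is set but the dataset will not read it."""
    if options.dataset in ("limericks", "code") and options.prompt is not None:
        warnings.warn(
            "--prompt is ignored for the limericks/code datasets",
            stacklevel=2,
        )
        return True
    return False


options = build_parser().parse_args(["--dataset", "code", "--prompt", "Translate this"])
check_prompt_flag(options)
```

Raising an error instead of warning would match how the --prompt-images-with-resolutions incompatibility is handled at startup.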

cursor Bot reviewed Jun 5, 2026

Comment thread llm_bench/load_test.py Outdated

ishaan-shivhare mentioned this pull request Jun 6, 2026

Strict prompt construction in load_test.py (minimal) #120

Draft

ishaan-shivhare wants to merge 2 commits into main from cursor/load-test-strict-prompts-bdbf
