Fix vLLM crash when max_gen_toks exceeds model context window #153
Open
Ali-Elganzory wants to merge 3 commits into mlfoundations:main from
Conversation
Author
Hi @neginraoof, could you please review this PR when you have a chance?
neginraoof reviewed Jan 4, 2026
eval/task.py
Outdated
elif isinstance(model, lm_eval_models.vllm_causallms.VLLM):
    instance.args[1]["max_gen_toks"] = max_new_tokens
    # Get prompt from instance.args[0] (the templated string)
    prompt = instance.args[0]
Collaborator
Thanks, can you wrap lines 57 to 64 in a try/except?
Also, maybe check if prompt_length is extremely long (> max_model_len).
Author
Good point, I'll wrap it in a try/except.
If prompt_length > max_model_len, we could log a warning.
capped_max_new_tokens will be set to 1, and the prompt will be truncated to fit into the context window. We are fine with that logic, right?
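For concreteness, a minimal sketch of what this behavior could look like. The helper name, GENERATION_SAFETY_BUFFER, the logger, and the fallback path are assumptions; only the max_model_len lookup, the 16-token buffer, and the max_gen_toks assignment come from the PR's diff fragments.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical constant name; the diff itself uses a bare 16 at this point.
GENERATION_SAFETY_BUFFER = 16


def cap_vllm_max_gen_toks(instance, model, max_new_tokens, prompt_length):
    """Cap max_gen_toks so prompt + generation fit inside vLLM's context window."""
    try:
        # max_model_len lookup as in the next diff fragment.
        max_model_len = model.model.llm_engine.model_config.max_model_len
        if prompt_length > max_model_len:
            # Warn as discussed: vLLM will truncate the prompt, and only a
            # minimal generation budget remains.
            logger.warning(
                "Prompt length %d exceeds max_model_len %d; prompt will be truncated.",
                prompt_length,
                max_model_len,
            )
        max_allowed = max_model_len - prompt_length - GENERATION_SAFETY_BUFFER
        capped = max(1, min(max_new_tokens, max_allowed))
    except Exception:
        # Assumed fallback: keep the requested value if the engine config is unreachable.
        logger.exception("Could not read max_model_len; leaving max_gen_toks unchanged.")
        capped = max_new_tokens
    instance.args[1]["max_gen_toks"] = capped
```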
eval/task.py
Outdated
max_model_len = model.model.llm_engine.model_config.max_model_len

# Calculate max allowed generation tokens (16 token safety buffer)
max_allowed = max_model_len - prompt_length - 16
Collaborator
Can you create a named constant instead of using 16 here?
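A small illustration of the suggestion, with a hypothetical constant name and made-up example values:

```python
# Hypothetical constant name; the 16-token margin comes from the diff above.
GENERATION_SAFETY_BUFFER = 16

# Illustrative values: a 4096-token window with a 4000-token prompt leaves
# an 80-token generation budget.
max_model_len = 4096
prompt_length = 4000
max_allowed = max_model_len - prompt_length - GENERATION_SAFETY_BUFFER  # 80
```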
Collaborator
Thanks for creating the PR!
…eeding context window

- HuggingFace's transformers does NOT fail when `context_length < prompt_length + max_new_tokens`; the behavior is rather undefined or degraded.
- Cap the max number of generated tokens for both HF and vLLM.
Summary
Fixes #152 by dynamically capping max_gen_toks to fit within the model's context window when using the vLLM backend.

Problem Solved
vLLM crashes with `ValueError: please provide at least one prompt` when max_gen_toks exceeds max_model_len - prompt_length.

Changes
File Modified: eval/task.py

Key Improvements
- Dynamically caps max_gen_toks instead of crashing

Code Changes
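A rough, self-contained sketch of the capping arithmetic described above; the function name, keyword arguments, and example numbers are illustrative assumptions, not taken from the PR:

```python
def capped_gen_toks(max_new_tokens: int, prompt_length: int,
                    max_model_len: int, buffer: int = 16) -> int:
    """Mirror the capping arithmetic: never exceed the window, never request < 1 token."""
    return max(1, min(max_new_tokens, max_model_len - prompt_length - buffer))

# Requested generation is larger than the remaining window: cap it.
assert capped_gen_toks(max_new_tokens=2048, prompt_length=3000, max_model_len=4096) == 1080
# Prompt alone overflows the window: fall back to a single token (vLLM truncates the prompt).
assert capped_gen_toks(max_new_tokens=2048, prompt_length=5000, max_model_len=4096) == 1
```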
Testing
Impact