Bug Description
When running forgecode with a vLLM model provider backend (using Qwen3.5-122B-A10B), things generally work well. However, the model's context limit is not respected reliably. There are two places where this is a significant issue:
- Long-running sessions that have large tool calls, and don't trigger compaction before the session overruns the limit
- Large processing jobs that have significant returns from tool calls.
The error manifests like so:
● [15:53:01] ERROR: POST http://<vllm_provider>:9000/v1/chat/completions
Caused by:
0: 400 Bad Request Reason: {"error":{"message":"This model's maximum context length is 200000 tokens. However, you requested 16000 output tokens and your prompt contains at least 184001 input tokens, for a total of at least 200001 tokens. Please reduce the length of the input prompt or the number of requested output tokens. (parameter=input_tokens, value=184001)","type":"BadRequestError","param":"input_tokens","code":400}}
1: Invalid Status Code: 400
vllm reports for this particular case:
(APIServer pid=308939) vllm.exceptions.VLLMValidationError: This model's maximum context length is 200000 tokens. However, you requested 16000 output tokens and your prompt contains at least 184001 input tokens, for a total of at least 200001 tokens. Please reduce the length of the input prompt or the number of requested output tokens. (parameter=input_tokens, value=184001)
I have looked through the config files for a way to fix this. The best fix would be to set providers.models, but the current version of forgecode accepts neither the inline format nor the separated format that would let me specify the maximum number of input tokens (context_length) and so prevent it from overrunning the limit.
Two key issues:
- It is impossible to set a custom model context length via the .forge.toml file, as seen in the commented-out section of my config below.
- ForgeCode does not account for vLLM's reported model context length covering both input and output, i.e. <vllm context_length> = <forge_context_length> + <forge_max_tokens>. It assumes the context length is independent of max_tokens, the output token parameter.
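The arithmetic vLLM enforces can be sketched as follows (a minimal illustration using the numbers from the error above; the function name is mine, not forge's or vLLM's):

```python
def fits_context(prompt_tokens: int, max_tokens: int, context_length: int) -> bool:
    """vLLM rejects a request when prompt + requested output exceeds the model window."""
    return prompt_tokens + max_tokens <= context_length

# Numbers from the error above: 184001 input + 16000 output > 200000 window.
print(fits_context(184001, 16000, 200000))  # False -> vLLM returns 400 Bad Request
# The effective input budget is context_length - max_tokens:
print(200000 - 16000)  # 184000
```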
Steps to Reproduce
- Connect to a vLLM provider running v0.19, any model size.
- Ask forgecode to read a large log file (>= 1 MB or so).
- Allow it to read the file in chunks until the context window overruns.
- Receive the error above.
Expected Behavior
- Ask to read a log file
- Reads it in chunks, summarizes for itself, and then drops the context to continue reading the next block
OR
- Reads the context, then compacts the session and continues before hitting the maximum input token size (context_length - max_tokens).
AND
- Fails on the sender side if too much context is about to be pushed, according to the formula above.
- Attempts recovery by appending a notification after the tool call response message to the LLM (without the result) that the read was too long to fit into context, and to try again.
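A sender-side guard along these lines could fail fast before dispatching the request instead of letting vLLM return a 400 (a hypothetical sketch; the function and error message are illustrative, not forge's actual API):

```python
def check_budget(prompt_tokens: int, max_tokens: int, context_length: int) -> None:
    """Raise before sending if the prompt exceeds the usable input budget."""
    budget = context_length - max_tokens
    if prompt_tokens > budget:
        raise ValueError(
            f"prompt of {prompt_tokens} tokens exceeds input budget {budget} "
            f"({context_length} - {max_tokens}); compact the session or drop the tool result"
        )
```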
Actual Behavior
System overruns the maximum input context length from vLLM and enters an unrecoverable state unless switched to a model with a larger context window. vLLM cannot process the over-sized input. Even changing the max output tokens parameter will repeat this behavior on the next turn.
Forge Version
forge 2.11.1
Operating System & Version
No response
AI Provider
Other
Model
Sehyo/Qwen3.5-122B-A10B-NVFP4
Installation Method
npm install -g forgecode
Configuration
"$schema" = "https://forgecode.dev/schema.json"
max_search_lines = 1000
max_search_result_bytes = 10240
max_fetch_chars = 50000
max_stdout_prefix_lines = 100
max_stdout_suffix_lines = 100
max_stdout_line_chars = 500
max_line_chars = 2000
max_read_lines = 2000
max_file_read_batch_size = 50
max_file_size_bytes = 104857600
max_image_size_bytes = 262144
tool_timeout_secs = 300
auto_open_dump = false
max_conversations = 100
max_sem_search_results = 100
sem_search_top_k = 10
services_url = "https://api.forgecode.dev/"
max_extensions = 15
max_parallel_file_reads = 64
model_cache_ttl_secs = 604800
max_commit_count = 20
top_p = 0.8
top_k = 30
max_tokens = 16000
max_tool_failure_per_turn = 3
max_requests_per_turn = 100
restricted = false
tool_supported = true
currency_symbol = ""
currency_conversion_rate = 0.0
verify_todos = true
[retry]
initial_backoff_ms = 200
min_delay_ms = 1000
backoff_factor = 2
max_attempts = 8
status_codes = [
429,
500,
502,
503,
504,
408,
522,
524,
520,
529,
]
suppress_errors = false
[http]
connect_timeout_secs = 30
read_timeout_secs = 900
pool_idle_timeout_secs = 90
pool_max_idle_per_host = 5
max_redirects = 10
hickory = false
tls_backend = "default"
adaptive_window = true
keep_alive_interval_secs = 60
keep_alive_timeout_secs = 10
keep_alive_while_idle = true
accept_invalid_certs = false
[session]
provider_id = "vllm"
model_id = "Sehyo/Qwen3.5-122B-A10B-NVFP4"
[updates]
frequency = "daily"
auto_update = true
[compact]
retention_window = 6
eviction_window = 0.2
max_tokens = 2000
token_threshold = 120000
message_threshold = 200
on_turn_end = true
[reasoning]
effort = "high"
enabled = true
#[[providers]]
#id = "vllm"
#url = "http://<vllm_providers>:9000/v1/chat/completions"
#api_key_vars = "VLLM_API_KEY"
#response_type = "OpenAI"
#auth_methods = ["api_key"]
#models = [
# { id = "Sehyo/Qwen3.5-122B-A10B-NVFP4",
# name = "Qwen3.5-122B",
# description = "Qwen3.5 122B A10B", context_length = 131072, tools_supported = true, supports_parallel_tool_calls = true, supports_reasoning = true, input_modalities = ["text", "image"] }
#]