According to the OpenAI Chat Completions API reference, max_tokens is “the maximum number of tokens that can be generated in the chat completion” (source). Kimi Code currently derives this provider request field from max_context_size instead of the configured max_output_size, which can cause OpenAI-compatible providers to reject requests with 400 errors when the context window exceeds the provider's output token limit.
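A minimal sketch of the failure mode, assuming a hypothetical alias shape in which only the field names `max_context_size` and `max_output_size` come from this PR. The numbers are illustrative, and `providerStatus` is a stand-in for a provider that validates `max_tokens` against its own output limit, not any real provider's code:

```typescript
// Hypothetical alias shape: only max_context_size and max_output_size
// come from the PR; the values below are illustrative.
interface ModelAlias {
  model: string;
  max_context_size: number; // context window (prompt + completion)
  max_output_size?: number; // most tokens the provider will generate
}

const alias: ModelAlias = {
  model: "example-model",
  max_context_size: 131072,
  max_output_size: 8192,
};

// Pre-fix behaviour as described: the max_tokens request field is derived
// from the context window instead of the configured output limit.
function maxTokensBeforeFix(a: ModelAlias): number {
  return a.max_context_size;
}

// Stand-in for a provider that rejects max_tokens above its output limit
// with 400 Bad Request.
function providerStatus(maxTokens: number, providerOutputLimit: number): number {
  return maxTokens > providerOutputLimit ? 400 : 200;
}

console.log(providerStatus(maxTokensBeforeFix(alias), 8192)); // 400
```

Whenever the context window is larger than the provider's output limit, the pre-fix request asks for more completion tokens than the provider allows.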
What changed
This change preserves each model alias's configured output limit and uses it as the completion token cap for runtime requests. The model context size still describes the context window, while max_output_size now controls the output budget that is ultimately sent through provider-specific max_tokens handling.
This prevents Kimi Code from sending max_context_size to OpenAI-compatible providers as the request output limit when max_output_size is available.
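The change can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code. `resolveMaxTokens`, `buildRequest` and the `requested` parameter (standing in for a budget propagated from the runtime, as in #267) are assumed names. Two behaviours are also assumptions: leaving `max_tokens` unset when no output limit is configured, and defaulting to the full output limit when no budget is requested.

```typescript
// Hypothetical alias shape; only max_context_size and max_output_size
// come from the PR, everything else is assumed for illustration.
interface ModelAlias {
  model: string;
  max_context_size: number;
  max_output_size?: number;
}

// Resolve the completion budget for one request. The configured output
// limit caps any requested budget; the context size is never used here.
function resolveMaxTokens(alias: ModelAlias, requested?: number): number | undefined {
  const cap = alias.max_output_size;
  if (cap === undefined) return requested; // assumed: no output limit configured
  if (requested === undefined) return cap; // assumed: default to full output budget
  return Math.min(requested, cap);
}

// Subset of an OpenAI Chat Completions request body.
interface ChatCompletionRequest {
  model: string;
  messages: { role: string; content: string }[];
  max_tokens?: number;
}

function buildRequest(
  alias: ModelAlias,
  messages: { role: string; content: string }[],
  requested?: number,
): ChatCompletionRequest {
  const body: ChatCompletionRequest = { model: alias.model, messages };
  const maxTokens = resolveMaxTokens(alias, requested);
  if (maxTokens !== undefined) body.max_tokens = maxTokens; // omit rather than guess
  return body;
}

const alias: ModelAlias = { model: "example-model", max_context_size: 131072, max_output_size: 8192 };
console.log(buildRequest(alias, [{ role: "user", content: "hi" }]).max_tokens); // 8192
console.log(resolveMaxTokens(alias, 4096)); // 4096
```

Capping with `Math.min` keeps a smaller runtime budget intact while making sure the value that reaches `max_tokens` never exceeds the provider's output limit.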
@kermanx could you take a look at this PR when you have time?
This fixes the max_tokens cap issue reported in #306. It seems closely related to the completion-budget propagation work in #267: for OpenAI-compatible providers, the runtime request can end up using max_context_size as the output token cap, while the configured max_output_size is the actual provider output limit.
The PR is focused on using max_output_size as the completion budget cap, and includes a changeset. Please let me know if you prefer a different fix direction.
Related Issue
Resolve #306
Checklist
- `gen-changesets` skill, or this PR needs no changeset.
- `gen-docs` skill, or this PR needs no doc update.