feat(engine): compute_log_probs API for RL sequence scoring (RL-plan M2) #321
Open: HJSang wants to merge 1 commit into main from hejian/rl_api
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,68 @@ | ||
| # Computing Log-Probabilities (RL Scoring) | ||
|
|
||
| `Engine.compute_log_probs` scores `prompt + completion` token sequences under the | ||
| engine's current weights and returns one log-probability per completion token. It | ||
| is the core scoring primitive for online-RL trainers (PPO, GRPO, and any | ||
| KL-penalised objective) — for example to form importance-sampling ratios against | ||
| the policy that generated the rollouts. | ||
|
|
||
| ## Usage | ||
|
|
||
| ```python | ||
| from tokenspeed.runtime.entrypoints.engine import Engine | ||
|
|
||
| # Scoring runs a pure-extend (prefill-only) forward. On backends that cannot | ||
| # serve a mixed prefill+decode batch eagerly (e.g. the default `mha` backend), | ||
| # launch the engine for scoring with a backend + scheduler config that keeps the | ||
| # request on a pure-extend path: | ||
| engine = Engine( | ||
| model="<model-path>", | ||
| attention_backend="flashinfer", | ||
| enforce_eager=True, | ||
| disable_overlap_schedule=True, | ||
| ) | ||
|
|
||
| out = engine.compute_log_probs( | ||
| sequences=[ | ||
| {"prompt_token_ids": [1, 2, 3, 4], "completion_token_ids": [5, 6, 7]}, | ||
| {"prompt_token_ids": [10, 11], "completion_token_ids": [12]}, | ||
| ], | ||
| temperature=1.0, | ||
| ) | ||
|
|
||
| # out["log_probs"][i][j] == log P(completion_token_ids[i][j] | context) | ||
| # out["tokens"][i] == completion_token_ids[i] | ||
| out["log_probs"] # e.g. [[-0.12, -0.47, -0.31], [-2.03]] | ||
| out["tokens"] # [[5, 6, 7], [12]] | ||
| ``` | ||
|
|
||
| `log_probs[i][j]` is the log-probability of the realised completion token `j` in | ||
| sequence `i`, conditioned on everything before it (prompt + earlier completion | ||
| tokens). Only completion positions are scored; the prompt is context. | ||
|
|
||
| ## How it works | ||
|
|
||
| `compute_log_probs` reuses the normal generation path: internally each sequence is sent through a | ||
| forward-only `generate` call (`max_new_tokens=0`, `return_logprob=True`, | ||
| `logprob_start_len=len(prompt) - 1`), and the per-token input logprobs are read back | ||
| from `meta_info["input_token_logprobs"]`. The window starts one position before the | ||
| completion because the logits at position `t` score the token at `t + 1`; the one | ||
| trailing sampled-position entry the engine returns is dropped. Logits are gathered | ||
| across tensor-parallel ranks before `log_softmax`, exactly as on the sampling path. | ||
| No engine pause is required; scoring requests can be interleaved with normal generation. | ||
|
|
||
| Long sequences are handled across chunked prefill: when a `prompt + completion` | ||
| is split into multiple prefill chunks, the input-logprob window is collected from | ||
| every chunk it overlaps (not just the first), so the full set of completion | ||
| logprobs is returned regardless of `chunked_prefill_size`. | ||
|
|
||
| ## Limits (current) | ||
|
|
||
| - **Temperature:** `temperature=1.0` only (raw `log_softmax`). Other values raise | ||
| `NotImplementedError`. Sampling-temperature scaling (for off-policy importance | ||
| sampling) is a planned follow-up. | ||
| - **Speculative decoding:** unavailable — `compute_log_probs` raises if the engine | ||
| was launched with a speculative algorithm (the generation path disables logprobs | ||
| in that mode). | ||
| - **Prompt/completion:** both must be non-empty (the first completion token needs | ||
| prior context to be scored). | ||
| - **Surface:** exposed as the `Engine` Python method. A native HTTP / SMG endpoint | ||
| is deferred until there is a consumer for it. |
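As a rough illustration of how a trainer might consume the returned `log_probs`, the sketch below forms per-token importance-sampling ratios. It assumes the rollout-time log-probabilities were stored separately in the same ragged layout; `importance_ratios` and `sequence_log_prob` are hypothetical helper names, not part of this PR.

```python
import math


def importance_ratios(new_log_probs, old_log_probs):
    """Per-token ratios pi_new(token) / pi_old(token) for ragged log-prob lists."""
    if len(new_log_probs) != len(old_log_probs):
        raise ValueError("sequence count mismatch")
    ratios = []
    for new_seq, old_seq in zip(new_log_probs, old_log_probs):
        if len(new_seq) != len(old_seq):
            raise ValueError("completion length mismatch")
        # exp(log p_new - log p_old) == p_new / p_old
        ratios.append([math.exp(n - o) for n, o in zip(new_seq, old_seq)])
    return ratios


def sequence_log_prob(log_probs):
    """Sum token log-probs to get one log-probability per completion."""
    return [sum(seq) for seq in log_probs]


# Illustrative numbers only: `new_lp` plays the role of out["log_probs"],
# `old_lp` the values recorded when the rollouts were generated.
new_lp = [[-0.12, -0.47, -0.31], [-2.03]]
old_lp = [[-0.12, -0.50, -0.31], [-2.03]]
ratios = importance_ratios(new_lp, old_lp)
```

A ratio of 1.0 means the scoring policy and the rollout policy assign the same probability to that token; PPO- or GRPO-style objectives would typically clip these ratios before use.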
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,142 @@ | ||
| # Copyright (c) 2026 LightSeek Foundation | ||
| # | ||
| # Permission is hereby granted, free of charge, to any person obtaining a copy | ||
| # of this software and associated documentation files (the "Software"), to deal | ||
| # in the Software without restriction, including without limitation the rights | ||
| # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
| # copies of the Software, and to permit persons to whom the Software is | ||
| # furnished to do so, subject to the following conditions: | ||
| # | ||
| # The above copyright notice and this permission notice shall be included in | ||
| # all copies or substantial portions of the Software. | ||
| # | ||
| # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
| # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
| # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
| # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
| # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
| # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
| # SOFTWARE. | ||
|
|
||
| """Pure, GPU-free helpers for the compute_log_probs API (RL-plan Milestone 2). | ||
|
|
||
| The engine scores ``prompt + completion`` sequences by reusing the normal | ||
| generation path: a forward-only ``generate`` call with ``return_logprob=True`` | ||
| and ``logprob_start_len=len(prompt) - 1`` makes ``meta_info['input_token_logprobs']`` | ||
| carry the per-completion-token logprobs followed by one trailing entry. These | ||
| helpers build that call and parse its result; ``Engine.compute_log_probs`` wires | ||
| them to ``self.generate``. | ||
| """ | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| from typing import Any, Callable | ||
|
|
||
| DEFAULT_TEMPERATURE = 1.0 | ||
| # Set to 1 if the GPU spike shows max_new_tokens=0 is unsupported; the single | ||
| # generated token lands in output_token_logprobs, never input_token_logprobs. | ||
| SCORE_MAX_NEW_TOKENS = 0 | ||
|
|
||
|
|
||
| class InvalidSequenceError(ValueError): | ||
| """Raised when a sequence cannot be scored (empty prompt or completion).""" | ||
|
|
||
|
|
||
| def validate_sequence( | ||
| prompt_token_ids: list[int], completion_token_ids: list[int] | ||
| ) -> None: | ||
| if not prompt_token_ids: | ||
| raise InvalidSequenceError( | ||
| "prompt_token_ids must be non-empty: the first completion token needs " | ||
| "prior context to be scored." | ||
| ) | ||
| if not completion_token_ids: | ||
| raise InvalidSequenceError( | ||
| "completion_token_ids must be non-empty: nothing to score." | ||
| ) | ||
|
|
||
|
|
||
| def build_score_kwargs( | ||
| prompt_token_ids: list[int], | ||
| completion_token_ids: list[int], | ||
| temperature: float = DEFAULT_TEMPERATURE, | ||
| ) -> dict[str, Any]: | ||
| """Build the kwargs for an internal forward-only ``Engine.generate`` call.""" | ||
| validate_sequence(prompt_token_ids, completion_token_ids) | ||
| # Note: compute_log_probs_core separately gates on temperature != 1.0 for v1; | ||
| # the two checks serve different audiences (standalone helper vs. v1 core path), | ||
| # so the divergence is intentional, not accidental. | ||
| if temperature <= 0: | ||
| raise ValueError(f"temperature must be > 0, got {temperature}") | ||
| return { | ||
| "input_ids": list(prompt_token_ids) + list(completion_token_ids), | ||
| "sampling_params": { | ||
| "max_new_tokens": SCORE_MAX_NEW_TOKENS, | ||
| "temperature": temperature, | ||
| }, | ||
| "return_logprob": True, | ||
| # The logprob of completion token c_j is read from the logits at the | ||
| # *preceding* position, so scoring starts one token before the | ||
| # completion: logprob_start_len = len(prompt) - 1. The engine returns | ||
| # one entry per position from there to the end — the M completion | ||
| # logprobs followed by one trailing sampled-position entry (target token | ||
| # -1) that extract_completion_logprobs drops. (Verified on B200.) | ||
| "logprob_start_len": len(prompt_token_ids) - 1, | ||
| } | ||
|
|
||
|
|
||
| def extract_completion_logprobs( | ||
| meta_info: dict[str, Any], num_completion: int | ||
| ) -> tuple[list[float], list[int]]: | ||
| """Split ``meta_info['input_token_logprobs']`` into (log_probs, tokens). | ||
|
|
||
| Each entry is a ``(logprob, token_id, text_or_None)`` tuple. The engine | ||
| returns the M completion logprobs (aligned to ``logprob_start_len = | ||
| len(prompt) - 1``) followed by one trailing sampled-position entry, so we | ||
| keep the first ``num_completion``. Fewer than that means the logprob window | ||
| was wrong (or input logprobs were not produced), so we fail loudly rather | ||
| than return a silently-misaligned array. | ||
| """ | ||
| entries = meta_info.get("input_token_logprobs") | ||
| if not entries or len(entries) < num_completion: | ||
| got = 0 if entries is None else len(entries) | ||
| raise ValueError( | ||
| f"expected at least {num_completion} completion logprobs, got {got}; " | ||
| "check logprob_start_len alignment / input-logprob support." | ||
| ) | ||
| entries = entries[:num_completion] | ||
| log_probs = [float(e[0]) for e in entries] | ||
| tokens = [int(e[1]) for e in entries] | ||
| return log_probs, tokens | ||
|
|
||
|
|
||
| def compute_log_probs_core( | ||
| sequences: list[dict[str, list[int]]], | ||
| generate_fn: Callable[..., dict[str, Any]], | ||
| temperature: float = DEFAULT_TEMPERATURE, | ||
| ) -> dict[str, list[list[Any]]]: | ||
| """Score each sequence by calling ``generate_fn`` and parsing the result. | ||
|
|
||
| ``generate_fn`` must have the signature of ``Engine.generate`` and return a | ||
| single result dict (non-streaming) carrying ``meta_info``. v1 supports only | ||
| ``temperature == 1.0`` (raw log_softmax), matching the engine's default | ||
| ``temp_scaled_logprobs=False`` path; other values raise ``NotImplementedError``. | ||
| """ | ||
| if temperature != DEFAULT_TEMPERATURE: | ||
| raise NotImplementedError( | ||
| "compute_log_probs v1 supports temperature=1.0 (raw log_softmax) only; " | ||
| f"got {temperature}. Sampling-temperature scaling is a follow-up." | ||
| ) | ||
|
|
||
| log_probs_out: list[list[float]] = [] | ||
| tokens_out: list[list[int]] = [] | ||
| for seq in sequences: | ||
| prompt_ids = seq["prompt_token_ids"] | ||
| completion_ids = seq["completion_token_ids"] | ||
| kwargs = build_score_kwargs(prompt_ids, completion_ids, temperature) | ||
| result = generate_fn(**kwargs) | ||
| log_probs, tokens = extract_completion_logprobs( | ||
| result["meta_info"], len(completion_ids) | ||
| ) | ||
| log_probs_out.append(log_probs) | ||
| tokens_out.append(tokens) | ||
| return {"log_probs": log_probs_out, "tokens": tokens_out} |
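The `logprob_start_len = len(prompt) - 1` offset used by `build_score_kwargs` can be checked on a toy example. The sketch below is not the engine's implementation: `score_completion` and `position_logits` are hypothetical names, and the logits are made-up uniform values. It only shows why the scoring window starts one position before the completion.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def score_completion(position_logits, input_ids, num_prompt):
    """Score completion tokens, where position_logits[t] predicts input_ids[t + 1]."""
    # The logits at the last prompt position score the first completion token.
    start = num_prompt - 1
    scores = []
    for t in range(start, len(input_ids) - 1):
        target = input_ids[t + 1]
        scores.append(log_softmax(position_logits[t])[target])
    return scores


# Toy setup: vocabulary of 3 tokens, 2 prompt tokens + 2 completion tokens,
# uniform logits at every position.
input_ids = [0, 1, 2, 1]
position_logits = [[0.0, 0.0, 0.0] for _ in input_ids]
scores = score_completion(position_logits, input_ids, num_prompt=2)
```

With two completion tokens the loop reads positions 1 and 2, so exactly two scores come back. The logits at the final position predict a token that is not part of the input, which corresponds to the trailing entry that `extract_completion_logprobs` drops.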
When a scoring request is chunked, non-final prefill forwards can already carry `req_input_lp_val`, but this new accumulation runs only after the existing `prefill_finished` guard. Those chunk results are therefore discarded instead of being stored on `RequestState`, so long prompt+completion sequences return only the final chunk's logprobs (or none), and `compute_log_probs` either raises or scores incompletely. Accumulate the input logprobs before skipping chunked-prefill output.
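One possible shape for the suggested fix is sketched below. It is a simplified stand-in, not the PR's code: the `RequestState` dataclass, its `input_token_logprobs` field, and `handle_prefill_output` are all assumed names chosen for illustration. Only the ordering matters: accumulate first, then apply the `prefill_finished` guard.

```python
from dataclasses import dataclass, field


@dataclass
class RequestState:
    """Hypothetical per-request state; the real class has more fields."""
    input_token_logprobs: list = field(default_factory=list)


def handle_prefill_output(state, chunk_logprobs, prefill_finished):
    """Store one chunk's input logprobs; emit the full list on the last chunk."""
    # Accumulate before the guard so non-final chunks are not discarded.
    if chunk_logprobs:
        state.input_token_logprobs.extend(chunk_logprobs)
    if not prefill_finished:
        return None  # chunked prefill still in progress, nothing to emit yet
    return list(state.input_token_logprobs)


state = RequestState()
first = handle_prefill_output(state, [(-0.1, 5)], prefill_finished=False)
final = handle_prefill_output(state, [(-0.2, 6)], prefill_finished=True)
```

Here the first call returns nothing but still records its chunk, so the final call returns the entries from both chunks in order.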