Context
PR #269 ships thinking-budget v2 (Level 2 BudgetHook). Hook is AR-decode only. When (n_gen − committed) falls within hard_limit_remaining + q_len of the close trigger, spec-decode is torn down and AR finishes the tail.
Reference: dflash/src/qwen35/qwen35_backend.cpp:1151-1182 — spec-decode tail-off, step_graph_destroy(draft_sg) then do_ar_decode().
Declared as follow-up in dflash/src/common/model_backend.h:68:
"Current implementation: AR-decode only. When budget_hook is set, backends MAY route generation through their AR path (skipping spec decode) — the perf trade-off is acceptable since this only kicks in for thinking-enabled requests. Spec-decode integration is a follow-up."
Scope
Last ~hard_limit_remaining + q_len tokens (≈96 with defaults hard_limit=64, q_len=32) lose dflash speedup on every thinking-enabled request. Bulk decode unaffected.
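For a sense of scale, the tail window can be written as a small calculation. This is only an illustrative sketch: `TailOffConfig`, `ar_tail_tokens`, and `in_tail_window` are hypothetical names that do not come from dflash, and the inclusive comparison is an assumption about how "falls within" is evaluated in the backend.

```cpp
#include <cstdint>

// Hypothetical config holder; field names mirror the terms used in this issue.
struct TailOffConfig {
    int32_t hard_limit_remaining;  // default 64 per the issue text
    int32_t q_len;                 // draft length, default 32 per the issue text
};

// Number of trailing tokens that are decoded through the AR path.
constexpr int32_t ar_tail_tokens(const TailOffConfig& c) {
    return c.hard_limit_remaining + c.q_len;
}

// True when the distance to the close trigger is inside the tail window,
// i.e. when spec-decode would be torn down (assumed inclusive bound).
constexpr bool in_tail_window(int32_t tokens_until_trigger, const TailOffConfig& c) {
    return tokens_until_trigger <= ar_tail_tokens(c);
}
```

With the defaults quoted above, `ar_tail_tokens(TailOffConfig{64, 32})` evaluates to 96, matching the ≈96 figure.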
Goal
Allow BudgetHook to fire mid-spec-batch without falling back to AR.
Technical sketch
Author's note (qwen35_backend.cpp:1141):
"AR handles the close-token override cleanly; spec-decode's verify-and-accept loop can't safely inject a token mid-batch without a KV-state rewrite."
Two paths:
- KV rollback + inject: when the close-trigger position lands inside a verified batch, roll back KV state for positions ≥ trigger, inject close_token_ids, replay forward.
- Trigger-aware draft: cap draft q_len dynamically so the verify batch boundary aligns to the trigger; never overshoots, no rollback needed.
Path 2 likely cheaper.
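The trigger-aware draft path could look roughly like the sketch below. It is not dflash code: `capped_q_len` and `simulate_until_trigger` are hypothetical helpers, positions are treated as plain token counts, and the sketch assumes each verify step commits at most the number of drafted tokens (how a bonus token from the target model is accounted for is left open).

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: cap the draft length so a verify batch never extends
// past the close-trigger position.
//   n_committed : tokens already committed to the sequence
//   trigger_pos : position at which close_token_ids must be injected
//   max_q_len   : configured draft length (q_len in this issue)
// Returns the number of tokens to draft this step. A return value of 0 means
// the trigger has been reached and the caller should inject the close tokens.
inline int32_t capped_q_len(int64_t n_committed, int64_t trigger_pos, int32_t max_q_len) {
    const int64_t room = trigger_pos - n_committed;
    if (room <= 0) return 0;
    return static_cast<int32_t>(std::min<int64_t>(room, max_q_len));
}

// Toy driver: repeatedly draft with the capped length and commit a fixed
// number of accepted tokens per step (at least one, at most the draft size).
// Returns the committed count at the moment the cap reaches zero.
inline int64_t simulate_until_trigger(int64_t n_committed, int64_t trigger_pos,
                                      int32_t max_q_len, int32_t accepted_per_step) {
    const int32_t accepted = std::max<int32_t>(1, accepted_per_step);
    while (true) {
        const int32_t q = capped_q_len(n_committed, trigger_pos, max_q_len);
        if (q == 0) return n_committed;      // boundary aligned with the trigger
        n_committed += std::min(q, accepted);  // never commit past the cap
    }
}
```

Because the cap shrinks as the trigger approaches, the last batch before the close boundary ends exactly on it, which is why no KV rollback would be needed in this variant. The cost is a few shorter draft batches near the boundary.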
Acceptance
- BudgetHook fires correctly with req.thinking_opt_in = true while spec-decode stays active across the close boundary
- [budget-hook] spec-decode tail-off log path no longer triggers in nominal cases
- Bench on share/model_cards/qwen3.6-27b.json with --reasoning-effort medium: decode tok/s within 5% of non-thinking baseline (currently degrades on the tail)
- test_server_unit passes, plus new regression test covering close-injection inside a spec-decode batch
Refs
- dflash/src/qwen35/qwen35_backend.cpp:1151-1182 (the bypass)
- dflash/src/common/model_backend.h:60-80 (BudgetHook design + follow-up note)