dflash: integrate BudgetHook with spec-decode (PR #269 follow-up) #279

@davide221

Context

PR #269 ships thinking-budget v2 (Level 2 BudgetHook). The hook currently works on the AR-decode path only: when (n_gen − committed) falls within hard_limit_remaining + q_len of the close trigger, spec-decode is torn down and AR decode finishes the tail.

Reference: dflash/src/qwen35/qwen35_backend.cpp:1151-1182 — spec-decode tail-off, step_graph_destroy(draft_sg) then do_ar_decode().
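The tail-off condition above can be sketched as a small predicate. This is an illustrative reading, not the code at the referenced lines: `TailOffParams`, `tail_window`, and `should_tail_off` are hypothetical names, (n_gen − committed) is assumed to be the number of tokens still to be committed, and the inclusive `<=` boundary is an assumption.

```cpp
#include <cassert>

// Hypothetical parameter bundle; field names mirror the issue text,
// not the actual members in qwen35_backend.cpp.
struct TailOffParams {
    int hard_limit_remaining;  // tokens left before the hard limit forces the close
    int q_len;                 // spec-decode draft/verify batch length
};

// Width of the window in which spec-decode is torn down.
constexpr int tail_window(const TailOffParams& p) {
    return p.hard_limit_remaining + p.q_len;
}

// True once the remaining distance (n_gen - committed) has entered the
// tail window. The inclusive boundary is an assumption of this sketch.
inline bool should_tail_off(int n_gen, int committed, const TailOffParams& p) {
    assert(committed >= 0 && n_gen >= committed);
    return (n_gen - committed) <= tail_window(p);
}
```

With the defaults quoted in this issue (hard_limit=64, q_len=32), `tail_window` evaluates to 96, which is the size of the tail that currently runs on AR decode.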

Declared as follow-up in dflash/src/common/model_backend.h:68:

"Current implementation: AR-decode only. When budget_hook is set, backends MAY route generation through their AR path (skipping spec decode) — the perf trade-off is acceptable since this only kicks in for thinking-enabled requests. Spec-decode integration is a follow-up."

Scope

The last ~(hard_limit_remaining + q_len) tokens (≈96 with the defaults hard_limit=64, q_len=32) lose the dflash speedup on every thinking-enabled request. Bulk decode is unaffected.

Goal

Allow BudgetHook to fire mid-spec-batch without falling back to AR.

Technical sketch

Author's note (qwen35_backend.cpp:1141):

"AR handles the close-token override cleanly; spec-decode's verify-and-accept loop can't safely inject a token mid-batch without a KV-state rewrite."

Two paths:

  1. KV rollback + inject: when the close-trigger position lands inside a verified batch, roll back KV state for positions ≥ trigger, inject close_token_ids, and replay forward.
  2. Trigger-aware draft: cap the draft q_len dynamically so the verify-batch boundary aligns with the trigger; the batch never overshoots, so no rollback is needed.
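A toy model of path 1 (rollback + inject) may help pin down the intended semantics for the regression test. It assumes the committed state can be represented as one token id per position; the real KV cache and the forward replay are not modeled, and `TokenSeq` and `rollback_and_inject` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for per-position state: one token id per committed position.
using TokenSeq = std::vector<int>;

// If the trigger lies inside (or exactly at the end of) the committed
// sequence, drop every position >= trigger_pos and append the close tokens.
// Returns false and leaves seq untouched when the trigger has not been
// reached yet. A real backend would also have to replay the forward pass
// over the injected tokens; that step is out of scope here.
inline bool rollback_and_inject(TokenSeq& seq,
                                std::size_t trigger_pos,
                                const std::vector<int>& close_token_ids) {
    if (trigger_pos > seq.size()) {
        return false;  // trigger is beyond the verified batch
    }
    seq.resize(trigger_pos);  // roll back positions >= trigger_pos
    seq.insert(seq.end(), close_token_ids.begin(), close_token_ids.end());
    return true;
}
```

The cost this sketch hides is the replay: every rolled-back position has already been computed once and is discarded.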

Path 2 is likely cheaper, since it avoids the KV rollback and replay.
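A minimal sketch of the trigger-aware cap in path 2, assuming the backend knows the absolute position of the close trigger before drafting. `capped_q_len` and its parameters are hypothetical and do not correspond to an existing dflash function.

```cpp
#include <algorithm>
#include <cassert>

// Largest draft length such that the verify batch ends at or before the
// close trigger.
//   committed   : tokens committed so far
//   trigger_pos : absolute position where the close tokens must be injected
//   max_q_len   : configured draft length (32 with the defaults in this issue)
// A result of 0 means the trigger has been reached and the close tokens
// should be injected before any further drafting.
inline int capped_q_len(int committed, int trigger_pos, int max_q_len) {
    assert(committed >= 0 && max_q_len > 0);
    const int room = trigger_pos - committed;  // positions left before the trigger
    return std::max(0, std::min(room, max_q_len));
}
```

Because the last batch before the trigger is shortened rather than discarded, no verified position ever lies past the trigger, which is why this path needs no rollback.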

Acceptance

  • BudgetHook fires correctly with req.thinking_opt_in = true while spec-decode stays active across the close boundary
  • [budget-hook] spec-decode tail-off log path no longer triggers in nominal cases
  • Bench on share/model_cards/qwen3.6-27b.json with --reasoning-effort medium: decode tok/s within 5% of non-thinking baseline (currently degrades on the tail)
  • test_server_unit passes, plus new regression test covering close-injection inside a spec-decode batch
