dflash: integrate BudgetHook with spec-decode (PR #269 follow-up) #279

## Context

PR #269 ships thinking-budget v2 (Level 2 BudgetHook). The hook is **AR-decode only**: when `(n_gen − committed)` falls within `hard_limit_remaining + q_len` of the close trigger, spec-decode is torn down and AR finishes the tail.

Reference: `dflash/src/qwen35/qwen35_backend.cpp:1151-1182` — spec-decode tail-off, `step_graph_destroy(draft_sg)` then `do_ar_decode()`.

Declared as follow-up in `dflash/src/common/model_backend.h:68`:
> "Current implementation: AR-decode only. When budget_hook is set, backends MAY route generation through their AR path (skipping spec decode) — the perf trade-off is acceptable since this only kicks in for thinking-enabled requests. Spec-decode integration is a follow-up."

## Scope

The last ~`hard_limit_remaining + q_len` tokens (≈96 with defaults `hard_limit=64, q_len=32`) lose the dflash speedup on every thinking-enabled request. Bulk decode is unaffected.
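
The ≈96 figure is just the sum of the two defaults. A minimal sketch of that arithmetic, assuming `hard_limit_remaining` is at its maximum (`hard_limit`) when the tail begins; `ar_tail_window` is a hypothetical name for illustration, not a function in the dflash source:

```cpp
#include <cstdint>

// Hypothetical helper: upper bound on the number of trailing tokens that
// are decoded on the AR path instead of spec-decode. Parameter names
// mirror the issue text; they are not taken from the dflash code.
constexpr std::int64_t ar_tail_window(std::int64_t hard_limit_remaining,
                                      std::int64_t q_len) {
    return hard_limit_remaining + q_len;
}

// With the defaults quoted above (hard_limit=64, q_len=32).
static_assert(ar_tail_window(64, 32) == 96, "defaults give a 96-token tail");
```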

## Goal

Allow BudgetHook to fire mid-spec-batch without falling back to AR.

## Technical sketch

Author's note (`qwen35_backend.cpp:1141`):
> "AR handles the close-token override cleanly; spec-decode's verify-and-accept loop can't safely inject a token mid-batch without a KV-state rewrite."

Two paths:
1. **KV rollback + inject**: when the close-trigger position lands inside a verified batch, rollback KV state for positions ≥ trigger, inject `close_token_ids`, replay forward.
2. **Trigger-aware draft**: cap draft `q_len` dynamically so the verify batch boundary aligns to the trigger; never overshoots, no rollback needed.
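
A rough sketch of the bookkeeping behind Path 1, under stated assumptions: `SeqState` and `commit_with_rollback` are hypothetical stand-ins (not dflash types), `trigger_pos` is taken to be the position where the first close token must land, and the actual KV truncation and forward replay are left to the backend.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for per-sequence state; not the dflash types.
struct SeqState {
    std::vector<int> tokens;  // committed token ids
    std::size_t kv_len = 0;   // number of positions with valid KV entries
};

// Commit a verified batch. If the batch runs past trigger_pos, discard
// positions >= trigger_pos and append close_token_ids in their place.
// Precondition (assumed): s.tokens.size() <= trigger_pos on entry.
// Returns how many injected tokens still need a forward pass to rebuild KV.
std::size_t commit_with_rollback(SeqState& s,
                                 const std::vector<int>& accepted,
                                 std::size_t trigger_pos,
                                 const std::vector<int>& close_token_ids) {
    s.tokens.insert(s.tokens.end(), accepted.begin(), accepted.end());
    s.kv_len = s.tokens.size();
    if (s.tokens.size() <= trigger_pos) {
        return 0;  // trigger not inside this batch; nothing to roll back
    }
    // Roll back: positions >= trigger_pos are no longer valid.
    s.tokens.resize(trigger_pos);
    s.kv_len = trigger_pos;
    // Inject the close sequence; its KV entries do not exist yet.
    s.tokens.insert(s.tokens.end(),
                    close_token_ids.begin(), close_token_ids.end());
    return close_token_ids.size();
}
```

After such a call, `tokens.size() - kv_len` equals the returned replay count, which is the work Path 1 adds per request.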

Path 2 is likely cheaper: it avoids the KV rollback and forward replay that Path 1 requires.
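
For Path 2, the cap could be as simple as clamping the draft length to the distance from the trigger. A minimal sketch, assuming `pos` is the next position to generate and `trigger_pos` is where the first close token must land; `capped_draft_len` is a hypothetical helper, not an existing dflash function:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: how many draft tokens the next verify batch may
// contain without crossing the close trigger.
//   pos         - next position to be generated
//   trigger_pos - position where the first close token must be placed
//   q_len       - configured draft length
// A return value of 0 means the trigger has been reached and the hook
// should inject close_token_ids instead of drafting.
std::size_t capped_draft_len(std::size_t pos,
                             std::size_t trigger_pos,
                             std::size_t q_len) {
    if (pos >= trigger_pos) {
        return 0;
    }
    return std::min(q_len, trigger_pos - pos);
}
```

If the verify step also emits a bonus token from the target model, as many speculative-decoding schemes do, the cap would presumably need to leave one position of headroom for it.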

## Acceptance

- BudgetHook fires correctly with `req.thinking_opt_in = true` while spec-decode stays active across the close boundary
- `[budget-hook] spec-decode tail-off` log path no longer triggers in nominal cases
- Bench on `share/model_cards/qwen3.6-27b.json` with `--reasoning-effort medium`: decode tok/s within 5% of non-thinking baseline (currently degrades on the tail)
- `test_server_unit` passes, plus new regression test covering close-injection inside a spec-decode batch
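
The 5% bench criterion can be expressed as a one-line check. This sketch assumes "within 5%" means no more than 5% slower than baseline (a faster thinking run passes); `within_tolerance` is a hypothetical name and is not part of any existing bench harness:

```cpp
// Hypothetical check for the bench acceptance criterion.
// Passes when thinking-enabled decode throughput is no more than `tol`
// (fractional, default 5%) below the non-thinking baseline.
bool within_tolerance(double thinking_tps,
                      double baseline_tps,
                      double tol = 0.05) {
    return baseline_tps > 0.0 &&
           thinking_tps >= baseline_tps * (1.0 - tol);
}
```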

## Refs

- PR #269 (where the hook lives)
- `dflash/src/qwen35/qwen35_backend.cpp:1151-1182` (the bypass)
- `dflash/src/common/model_backend.h:60-80` (BudgetHook design + follow-up note)
