Problem
When model = "auto" is configured, every user message triggers a serial Flash Router API call (deepseek-v4-flash, max_tokens: 96) before the actual model request can be dispatched. This adds an unavoidable network round-trip (typically 300 ms–2 s, up to 4 s on timeout) to every turn, even for trivially routable messages like "continue", "yes", or "list files".
The router call is strictly serial — the real API request cannot start until the router responds. The effect is a perceptible delay before streaming begins on every message, making the TUI feel sluggish compared to a fixed-model setup.
The current implementation is equivalent to: to decide whether to drive a Ferrari or a Corolla, you first take an Uber to go ask for directions, then come back to get your car.
Who is affected: every user who enables model = "auto" (whether via the /model auto command or the Model Picker).
Proposed solution
Replace or supplement the Flash Router with a local, zero-network-cost heuristic that runs first. The router API call should be reserved only for genuinely ambiguous cases the local heuristic cannot confidently resolve.
Concrete proposal:
- Heuristic-first dispatch: Run auto_model_heuristic() synchronously before any network call. If the heuristic returns a strong signal (e.g. message is very short → Flash, contains complex keywords → Pro), use it immediately and skip the Flash Router entirely.
- Router only for grey zone: Only invoke the Flash Router when the heuristic lands on the default branch (100–500 chars, no decisive keywords). Those are the cases where a lightweight LLM classifier can add value.
- Optional, parallel speculative dispatch: Send the router request and a request to the model the heuristic prefers at the same time. If the router agrees, stream the speculative response; if it disagrees, abort and retry with the correct model. (More complex, higher ceiling.)
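The first two bullets can be sketched roughly as follows. This is a minimal illustration, not the existing code: `Model`, `Verdict`, `classify_locally`, `select_model`, the keyword list, and the "over 500 chars → Pro" branch are all hypothetical names and assumptions; only the 100–500 char grey zone and the "very short → Flash, complex keywords → Pro" signals come from the description above. The real router call is async, so the synchronous closure here merely stands in for it.

```rust
/// Models that auto mode chooses between (illustrative enum, not the real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Model {
    Flash,
    Pro,
}

/// Result of the local heuristic: a confident pick, or "ask the router".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Confident(Model),
    Ambiguous,
}

// Assumed thresholds: the issue describes 100-500 chars as the grey zone.
const SHORT_MESSAGE_CHARS: usize = 100;
const LONG_MESSAGE_CHARS: usize = 500;
// Hypothetical keyword list; the real heuristic's keywords may differ.
const COMPLEX_KEYWORDS: &[&str] = &["refactor", "architecture", "debug", "design"];

/// Pure local classification: no network access, runs before any request.
pub fn classify_locally(message: &str) -> Verdict {
    let lowered = message.to_lowercase();
    let chars = message.chars().count();

    if COMPLEX_KEYWORDS.iter().any(|kw| lowered.contains(kw)) {
        return Verdict::Confident(Model::Pro);
    }
    if chars < SHORT_MESSAGE_CHARS {
        return Verdict::Confident(Model::Flash);
    }
    if chars > LONG_MESSAGE_CHARS {
        // Assumption: very long messages go to Pro.
        return Verdict::Confident(Model::Pro);
    }
    Verdict::Ambiguous
}

/// Heuristic-first dispatch: the router closure is invoked only for the grey zone.
pub fn select_model<F>(message: &str, call_router: F) -> Model
where
    F: FnOnce(&str) -> Model,
{
    match classify_locally(message) {
        Verdict::Confident(model) => model,
        Verdict::Ambiguous => call_router(message),
    }
}
```

With this shape, a message such as "continue" resolves to Flash without touching the network, and only a mid-length message with no decisive keywords pays the router round-trip.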
Use case
I use model = "auto" to balance cost and capability — Flash for quick lookups, Pro for real work. But the upfront delay on every message is noticeable enough that I frequently switch back to a fixed model just to avoid the hesitation. The feature loses its value if the cost is a perceptible pause on even the simplest exchanges.
Alternatives considered
- Pure heuristic with no router: The current keyword + length heuristic in auto_model_heuristic() plus auto_reasoning::select() already covers a wide range. Removing the router entirely would eliminate the latency but would lose the semantic understanding the Flash Router can bring for nuanced requests.
- Cached routing: Reuse the previous turn's routing decision within the same session. This works for long-running conversations but breaks on topic shifts, and the first message still pays the latency cost.
- Parallel routing: Dispatch both the router and the most-likely request at once. Technically feasible but wastes API quota when the router disagrees.
The heuristic-first approach is the simplest change with the largest latency win.
Impact
Every turn in auto mode is affected. For power users who leave auto mode on for an entire session, that is dozens to hundreds of turns per day, each paying 300 ms to 2 s of dead time before any visible response.
Additional context
Relevant source:
- Router dispatch: crates/tui/src/commands/config.rs:922-958 (builds and calls the Flash Router API)
- Heuristic fallback: crates/tui/src/commands/config.rs:720-735, auto_model_heuristic() (keyword + length, pure local, ~0 ms)
- Integration point: crates/tui/src/tui/ui.rs:4397-4412 (where the routing result is consumed; the UI blocks on resolve_auto_model_selection().await before it can send the real Op::SendMessage)