Batch Skippy decode across concurrent requests #801
Draft
i386 wants to merge 3 commits into
Summary
Skippy can now batch decode work across concurrent requests instead of forcing each active lane through an independent single-request `llama_decode` call. This is the decode-side counterpart to the lane/runtime work already on `main`: when multiple requests are decoding at the same time, stage 0 and downstream split stages can execute already-queued one-token decode work as one native batch.
This PR originally used a fixed 200 µs rendezvous window to encourage batch formation. The local benchmark showed that was not a good enough tradeoff, so this PR now takes the safer easy win: no intentional decode wait. The batchers drain whatever work has already accumulated; if only one request is ready, it runs immediately.
The updated local benchmark is still not a convincing performance/throughput win. It improves the concurrency-2 result slightly versus `main`, but concurrency 4 remains slower on aggregate completion tok/s. This should stay draft unless a larger multi-stage benchmark proves the target topology benefits.
Merge recommendation
Do not merge this PR as-is as a general performance/throughput improvement.
The implementation proves that cross-request decode batching can form real native decode batches, and the no-wait policy removes the earlier fixed rendezvous latency tax. However, the current local benchmark still does not show a convincing throughput win: concurrency 2 is only +0.9%, and concurrency 4 is -5.8% on completion tok/s on the local two-stage Qwen3-8B setup.
Recommended next step: keep this PR draft and use it as the experiment branch for a larger multi-stage run. Merge criteria should be a clear win on the target shape, for example a four-stage larger-model benchmark showing improved aggregate tok/s without a material TPOT/TTFT regression. If the larger run is also neutral, we should close this PR or rework batching into a fully adaptive production policy.
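The merge criteria above could be written down as an explicit gate so the larger run has a pass/fail answer. This is only a sketch: `BenchSummary`, `pct_change`, `meets_merge_criteria`, and both threshold parameters are hypothetical names introduced for illustration, and the PR does not define what counts as a "clear win" or a "material" regression.

```rust
/// Hypothetical summary of one benchmark side (baseline or branch).
/// Field names are illustrative, not taken from skippy-bench.
#[derive(Clone, Copy, Debug)]
struct BenchSummary {
    aggregate_tok_s: f64,
    tpot_ms: f64,
    ttft_ms: f64,
}

/// Relative change of `new` versus `base`, in percent.
fn pct_change(base: f64, new: f64) -> f64 {
    (new - base) / base * 100.0
}

/// True when aggregate throughput improves by at least `min_gain_pct`
/// and neither latency metric regresses by more than
/// `max_latency_regress_pct`. Both thresholds are placeholders that a
/// reviewer would have to choose.
fn meets_merge_criteria(
    base: BenchSummary,
    branch: BenchSummary,
    min_gain_pct: f64,
    max_latency_regress_pct: f64,
) -> bool {
    pct_change(base.aggregate_tok_s, branch.aggregate_tok_s) >= min_gain_pct
        && pct_change(base.tpot_ms, branch.tpot_ms) <= max_latency_regress_pct
        && pct_change(base.ttft_ms, branch.ttft_ms) <= max_latency_regress_pct
}
```

Under any positive `min_gain_pct` above 0.9, the current local concurrency-2 result would not pass such a gate, which matches the recommendation to keep the PR in draft.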
What changed
- `skippy_decode_step_frame_batch_sampled` ABI for batched one-token activation-frame decode.
- Skippy ABI patch version bumped to `0.1.27`.
Architecture
Before this branch, split decode still had a per-request hot path even when the runtime had multiple lanes:
sequenceDiagram
    participant R1 as Request 1
    participant R2 as Request 2
    participant S0 as Stage 0
    participant S1 as Downstream Stage
    participant N as Native llama.cpp
    R1->>S0: decode token
    S0->>N: llama_decode(batch=1)
    S0->>S1: activation frame
    S1->>N: llama_decode(batch=1)
    S1-->>S0: predicted token
    R2->>S0: decode token
    S0->>N: llama_decode(batch=1)
    S0->>S1: activation frame
    S1->>N: llama_decode(batch=1)
    S1-->>S0: predicted token
After this branch, each stage keeps the existing request/session protocol but coalesces decode work that is already queued at the stage runtime boundary:
sequenceDiagram
    participant R1 as Request 1
    participant R2 as Request 2
    participant B0 as Stage 0 decode batcher
    participant S0 as Stage 0 native runtime
    participant B1 as Downstream decode batcher
    participant S1 as Downstream native runtime
    R1->>B0: decode token
    R2->>B0: decode token
    B0->>S0: llama_decode(batch=2)
    B0-->>R1: activation frame
    B0-->>R2: activation frame
    R1->>B1: activation frame
    R2->>B1: activation frame
    B1->>S1: llama_decode(batch=2)
    B1-->>R1: predicted token
    B1-->>R2: predicted token
If only one request is queued when the batcher wakes, the batcher does not sleep to wait for another request. That makes the policy safer for low-concurrency and fast-local topologies, but it also means fewer batches form unless requests naturally align.
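The no-wait policy can be illustrated with a small drain loop over a standard-library channel. This is a minimal sketch, not the code in this PR: `DecodeJob`, `drain_ready`, and `max_batch` are hypothetical names, and the real batchers may well use a different queue type.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Hypothetical stand-in for one queued one-token decode request.
#[derive(Debug, PartialEq)]
struct DecodeJob {
    request_id: u32,
    token: i32,
}

/// Block until at least one job exists, then take only what is already
/// queued, up to `max_batch` jobs (the first job is always taken).
/// There is no sleep or timed wait, so a lone request runs immediately.
/// Returns `None` once every sender is gone and the queue is empty.
fn drain_ready(rx: &Receiver<DecodeJob>, max_batch: usize) -> Option<Vec<DecodeJob>> {
    // Blocking receive: the batcher only wakes when there is real work.
    let first = rx.recv().ok()?;
    let mut batch = vec![first];
    while batch.len() < max_batch {
        match rx.try_recv() {
            Ok(job) => batch.push(job),
            // Nothing else is queued right now: run with what we have.
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

With this shape, batch size depends entirely on how many requests happen to be queued at wake-up, which is why fewer batches form when requests do not naturally align.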
Benchmark
Local benchmark date: 2026-06-05.
Setup:
- `main` vs `skippy-cross-request-decode-batching` at commit `bb3af8fa6`.
- `unsloth/Qwen3-8B-GGUF`, two artifact-slice stages, layers `0..18` and `18..36`.
- `activation_wire_dtype=f16`, `activation_width=4096`, `lane_count=4`, `openai_generation_concurrency=4`.
- `crates/skippy-bench/corpora/kv_mixed_prompts.jsonl`, first 8 prompts.
- `max_tokens=24`, streaming chat completions, `temperature=0`, `seed=123`, `enable_thinking=false`.
- `--telemetry-level off`; a separate debug run used stderr telemetry only to prove batch formation.
- `main` baseline is unchanged from the prior PR benchmark; the branch side was rerun after removing the fixed rendezvous wait.
Updated no-wait perf result:
Telemetry proof from the no-wait debug run on this branch:
Interpretation:
Protocol
The external OpenAI API is unchanged. The existing binary stage wire messages remain single-request messages; this PR batches at the stage runtime boundary across multiple active connections rather than requiring a batched wire envelope.
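One way to picture batching at the runtime boundary while the wire stays single-request: each connection handler submits one job with its own reply channel, and the batcher fans results back per request. The sketch below is an assumption-laden illustration. `Job`, `fake_batched_decode`, `run_batcher`, and `decode_one` are hypothetical names, and the fake decode stands in for the native batched call.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical single-request job: one token plus a private reply channel.
struct Job {
    token: i32,
    reply: Sender<i32>,
}

/// Stand-in for the native batched decode; it just adds 1 to each token.
fn fake_batched_decode(tokens: &[i32]) -> Vec<i32> {
    tokens.iter().map(|t| t + 1).collect()
}

/// Runtime-boundary batcher: callers still send one request each, and
/// batching happens only inside this loop.
fn run_batcher(rx: Receiver<Job>) {
    while let Ok(first) = rx.recv() {
        let mut jobs = vec![first];
        // Take whatever is already queued; never wait for more.
        while let Ok(job) = rx.try_recv() {
            jobs.push(job);
        }
        let tokens: Vec<i32> = jobs.iter().map(|j| j.token).collect();
        let outputs = fake_batched_decode(&tokens);
        // Route each output back to the request that produced its input.
        for (job, out) in jobs.into_iter().zip(outputs) {
            let _ = job.reply.send(out);
        }
    }
}

/// What one connection handler would do: submit one request and block
/// until its own reply arrives.
fn decode_one(batcher: &Sender<Job>, token: i32) -> i32 {
    let (reply, result) = channel();
    batcher.send(Job { token, reply }).expect("batcher stopped");
    result.recv().expect("batcher dropped the reply")
}
```

Because every caller sees a plain request/response pair, nothing about the message format has to change for batching to happen.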
The native Skippy ABI is extended with `skippy_decode_step_frame_batch_sampled`, so the Skippy ABI patch version moves from `0.1.26` to `0.1.27`. Older native runtimes without this symbol are not compatible with this Rust runtime.
Activation sideband families currently fall back to the existing single-frame decode path when the native batch ABI returns `Unsupported`.
Validation
- `just build`
- `just with-lld cargo build -p skippy-server -p skippy-bench`
- `cargo fmt --all -- --check`
- `cargo test -p skippy-server --lib`
- `cargo test -p skippy-runtime --lib`
- `cargo test -p skippy-ffi --lib`
- `cargo clippy -p skippy-ffi --all-targets -- -D warnings`
- `cargo clippy -p skippy-runtime --all-targets -- -D warnings`
- `cargo clippy -p skippy-server --all-targets -- -D warnings`
- `git diff --check`
- `skippy-bench chat-corpus` comparison above