Batch Skippy decode across concurrent requests#801

Draft
i386 wants to merge 3 commits into main from skippy-cross-request-decode-batching

i386 commented Jun 5, 2026

Summary

Skippy can now batch decode work across concurrent requests instead of forcing each active lane through an independent single-request llama_decode call. This is the decode-side counterpart to the lane/runtime work already on main: when multiple requests are decoding at the same time, stage 0 and downstream split stages can execute already-queued one-token decode work as one native batch.

This PR originally used a fixed 200 us rendezvous window to encourage batch formation. The local benchmark showed that was not a good enough tradeoff, so this PR now takes the safer easy win: no intentional decode wait. The batchers drain whatever work has already accumulated; if only one request is ready, it runs immediately.

The updated local benchmark is still not a convincing performance/throughput win. It improves the concurrency-2 result slightly versus main, but concurrency 4 remains slower on aggregate completion tok/s. This should stay draft unless a larger multi-stage benchmark proves the target topology benefits.

Merge recommendation

Do not merge this PR as-is as a general performance/throughput improvement.

The implementation proves that cross-request decode batching can form real native decode batches, and the no-wait policy removes the earlier fixed rendezvous latency tax. However, the current local benchmark still does not show a convincing throughput win: concurrency 2 is only +0.9%, and concurrency 4 is -5.8% on completion tok/s on the local two-stage Qwen3-8B setup.

Recommended next step: keep this PR draft and use it as the experiment branch for a larger multi-stage run. Merge criteria should be a clear win on the target shape, for example a four-stage larger-model benchmark showing improved aggregate tok/s without a material TPOT/TTFT regression. If the larger run is also neutral, we should close this PR or rework batching into a fully adaptive production policy.

What changed

  • Added local/full-model cross-request token decode batching for stage sessions.
  • Added a native skippy_decode_step_frame_batch_sampled ABI for batched one-token activation-frame decode.
  • Bumped the Skippy native ABI mirror to 0.1.27.
  • Added Rust runtime and server wrappers for batched activation-frame decode.
  • Added a shared binary-stage decode-frame batcher so downstream stages can batch across independent request connections.
  • Routed embedded stage 0 split decode through the same frame batcher before activation forwarding.
  • Removed the fixed 200 us decode rendezvous wait. Batching is now opportunistic only: already-queued work batches, single ready work runs immediately.
  • Added decode batch telemetry: batch size and batch wait time are emitted on decode spans.

Architecture

Before this branch, split decode still had a per-request hot path even when the runtime had multiple lanes:

sequenceDiagram
    participant R1 as Request 1
    participant R2 as Request 2
    participant S0 as Stage 0
    participant S1 as Downstream Stage
    participant N as Native llama.cpp

    R1->>S0: decode token
    S0->>N: llama_decode(batch=1)
    S0->>S1: activation frame
    S1->>N: llama_decode(batch=1)
    S1-->>S0: predicted token

    R2->>S0: decode token
    S0->>N: llama_decode(batch=1)
    S0->>S1: activation frame
    S1->>N: llama_decode(batch=1)
    S1-->>S0: predicted token

After this branch, each stage keeps the existing request/session protocol but coalesces decode work that is already queued at the stage runtime boundary:

sequenceDiagram
    participant R1 as Request 1
    participant R2 as Request 2
    participant B0 as Stage 0 decode batcher
    participant S0 as Stage 0 native runtime
    participant B1 as Downstream decode batcher
    participant S1 as Downstream native runtime

    R1->>B0: decode token
    R2->>B0: decode token
    B0->>S0: llama_decode(batch=2)
    B0-->>R1: activation frame
    B0-->>R2: activation frame

    R1->>B1: activation frame
    R2->>B1: activation frame
    B1->>S1: llama_decode(batch=2)
    B1-->>R1: predicted token
    B1-->>R2: predicted token

If only one request is queued when the batcher wakes, the batcher does not sleep to wait for another request. That makes the policy safer for low-concurrency and fast-local topologies, but it also means fewer batches form unless requests naturally align.
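The no-wait drain described above can be sketched roughly as follows. This is a minimal illustration built on `std::sync::mpsc`, not the code in this branch; `drain_ready` and `max_batch` are hypothetical names, and the real batcher may use a different queue type.

```rust
use std::sync::mpsc::{Receiver, RecvError};

/// Blocks until at least one item is queued, then takes whatever else is
/// already waiting, up to `max_batch` items in total. It never sleeps to
/// wait for stragglers. (`drain_ready` and `max_batch` are hypothetical
/// names used only for this sketch.)
pub fn drain_ready<T>(rx: &Receiver<T>, max_batch: usize) -> Result<Vec<T>, RecvError> {
    // The only blocking call: an idle batcher parks here until work arrives.
    let first = rx.recv()?;
    let mut batch = vec![first];

    // Non-blocking drain: `try_recv` returns an error as soon as the queue
    // is empty (or disconnected), which ends the batch immediately.
    while batch.len() < max_batch {
        match rx.try_recv() {
            Ok(item) => batch.push(item),
            Err(_) => break,
        }
    }
    Ok(batch)
}
```

With this shape, a lone request is returned as a batch of one right away, and a batch larger than one only forms when several requests happened to be queued before the batcher woke up.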

Benchmark

Local benchmark date: 2026-06-05.

Setup:

  • Branches compared: main vs skippy-cross-request-decode-batching at commit bb3af8fa6.
  • Model: cached unsloth/Qwen3-8B-GGUF, two artifact-slice stages, layers 0..18 and 18..36.
  • Runtime: local loopback split, activation_wire_dtype=f16, activation_width=4096, lane_count=4, openai_generation_concurrency=4.
  • Corpus: crates/skippy-bench/corpora/kv_mixed_prompts.jsonl, first 8 prompts.
  • Request shape: max_tokens=24, streaming chat completions, temperature=0, seed=123, enable_thinking=false.
  • Perf runs used --telemetry-level off; a separate debug run used stderr telemetry only to prove batch formation.
  • main baseline is unchanged from the prior PR benchmark; the branch side was rerun after removing the fixed rendezvous wait.

Updated no-wait perf result:

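| concurrency | main completion tok/s | PR completion tok/s | tok/s delta | main TTFT p95 (ms) | PR TTFT p95 (ms) | TTFT p95 delta |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | 46.42 | 46.82 | +0.9% | 242.0 | 227.3 | -6.1% |
| 4 | 54.56 | 51.40 | -5.8% | 435.5 | 471.9 | +8.3% |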
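For readers who want to recompute the delta columns, the arithmetic is a plain relative change against main. The helper names below are hypothetical and exist only for this sketch.

```rust
/// Relative change of the PR value versus the main value, in percent.
pub fn percent_delta(main: f64, pr: f64) -> f64 {
    (pr - main) / main * 100.0
}

/// Formats the relative change with an explicit sign and one decimal place.
pub fn format_delta(main: f64, pr: f64) -> String {
    format!("{:+.1}%", percent_delta(main, pr))
}
```

For example, `format_delta(46.42, 46.82)` yields `+0.9%` and `format_delta(54.56, 51.40)` yields `-5.8%`, matching the throughput column. Because the table inputs are themselves rounded, a delta recomputed this way can differ from the reported one in the last digit.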

Telemetry proof from the no-wait debug run on this branch:

| observed decode batch size | event count |
| --- | --- |
| 1 | 533 |
| 2 | 24 |
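The histogram above can be reduced to two numbers: the share of decode events that were real batches, and the mean batch size. The sketch below assumes each event is one native decode call covering `batch_size` one-token decodes; the PR does not spell that out, and `BatchStats` and `summarize` are hypothetical names.

```rust
/// Summary of a decode batch-size histogram.
pub struct BatchStats {
    /// Total number of decode events.
    pub events: u64,
    /// Total one-token decodes, assuming one event covers `batch_size` tokens.
    pub tokens: u64,
    /// Fraction of events whose batch size was greater than one.
    pub batched_fraction: f64,
    /// Mean batch size per event.
    pub mean_batch_size: f64,
}

/// Summarizes `(batch_size, event_count)` pairs.
pub fn summarize(hist: &[(u64, u64)]) -> BatchStats {
    let events: u64 = hist.iter().map(|&(_, n)| n).sum();
    let tokens: u64 = hist.iter().map(|&(size, n)| size * n).sum();
    let batched: u64 = hist
        .iter()
        .filter(|&&(size, _)| size > 1)
        .map(|&(_, n)| n)
        .sum();

    // Guard against an empty histogram so the ratios are never NaN.
    let (batched_fraction, mean_batch_size) = if events == 0 {
        (0.0, 0.0)
    } else {
        (batched as f64 / events as f64, tokens as f64 / events as f64)
    };
    BatchStats { events, tokens, batched_fraction, mean_batch_size }
}
```

Applied to `[(1, 533), (2, 24)]`, this gives 557 events, a batched fraction of 24/557 (about 4.3%), and a mean batch size of 581/557 (about 1.04), which is consistent with the roughly neutral throughput result.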

Interpretation:

  • The no-wait policy removes the deliberate 200 us latency tax.
  • Opportunistic batches still form, but rarely: in the debug run only 24 of 557 decode batch events (about 4.3%) had batch size 2, fewer than with the fixed wait policy.
  • On this local two-stage 8B setup, the change is roughly neutral at concurrency 2 and still loses aggregate throughput at concurrency 4.
  • This benchmark is probably not representative of the target four-stage / larger-model path: it runs both stages on one machine over loopback, with no LAN/WAN delay and a much smaller model than the target large split deployment.

Protocol

The external OpenAI API is unchanged. The existing binary stage wire messages remain single-request messages; this PR batches at the stage runtime boundary across multiple active connections rather than requiring a batched wire envelope.
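One way to picture batching at the runtime boundary while keeping single-request wire messages is for every connection to submit its work together with its own reply channel. The sketch below is a simplification under stated assumptions: `DecodeJob` and `dispatch_batch` are hypothetical names, plain `u32` values stand in for activation frames and predicted tokens, and the closure stands in for the native batched decode.

```rust
use std::sync::mpsc::Sender;

/// One queued decode step from a single request connection.
/// (Hypothetical type; `u32` stands in for the real frame and token types.)
pub struct DecodeJob {
    pub input: u32,
    /// Channel back to the connection that submitted this job.
    pub reply: Sender<u32>,
}

/// Runs one batched decode over `jobs` and routes each result back to the
/// connection that submitted it. Returns the batch size.
pub fn dispatch_batch<F>(jobs: Vec<DecodeJob>, decode_batch: F) -> usize
where
    F: FnOnce(&[u32]) -> Vec<u32>,
{
    let inputs: Vec<u32> = jobs.iter().map(|job| job.input).collect();
    // Stand-in for the native batched call; outputs are assumed to come
    // back in the same order as the inputs.
    let outputs = decode_batch(&inputs);

    let batch_size = jobs.len();
    for (job, output) in jobs.into_iter().zip(outputs) {
        // A send error only means that connection went away; the other
        // requests in the batch are unaffected.
        let _ = job.reply.send(output);
    }
    batch_size
}
```

Each connection still sees exactly one request and one response, so the wire protocol does not need a batched envelope.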

The native Skippy ABI is extended with skippy_decode_step_frame_batch_sampled, so the Skippy ABI patch version moves from 0.1.26 to 0.1.27. Older native runtimes without this symbol are not compatible with this Rust runtime.
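A version gate for the new symbol might look like the sketch below. It assumes ABI versions compare as ordinary major.minor.patch triples and that any native runtime at or above 0.1.27 exports the batch symbol; neither assumption is confirmed by this PR, and the function and constant names are hypothetical.

```rust
/// Lowest ABI version assumed to export the batched decode symbol.
pub const MIN_BATCH_DECODE_ABI: (u32, u32, u32) = (0, 1, 27);

/// Parses "major.minor.patch" into a comparable tuple.
/// Returns `None` for malformed input or extra components.
pub fn parse_abi_version(s: &str) -> Option<(u32, u32, u32)> {
    let mut parts = s.trim().split('.');
    let major = parts.next()?.parse::<u32>().ok()?;
    let minor = parts.next()?.parse::<u32>().ok()?;
    let patch = parts.next()?.parse::<u32>().ok()?;
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// True if the reported native version is at least the minimum.
/// Tuples compare lexicographically, which matches version ordering here.
pub fn supports_batch_decode(native_version: &str) -> bool {
    parse_abi_version(native_version).map_or(false, |v| v >= MIN_BATCH_DECODE_ABI)
}
```

Under these assumptions, `supports_batch_decode("0.1.26")` is false and `supports_batch_decode("0.1.27")` is true.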

Activation sideband families currently fall back to the existing single-frame decode path when the native batch ABI returns Unsupported.
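The fallback behaviour can be sketched as a try-batch-then-degrade wrapper. The trait, error enum, and function below are hypothetical stand-ins for the real runtime types, with `Vec<f32>` and `u32` used as placeholder frame and token types.

```rust
/// Hypothetical error type; only `Unsupported` triggers the fallback.
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    Unsupported,
    Failed(String),
}

/// Hypothetical decoder interface with a batched and a single-frame path.
pub trait FrameDecoder {
    fn decode_batch(&mut self, frames: &[Vec<f32>]) -> Result<Vec<u32>, DecodeError>;
    fn decode_single(&mut self, frame: &[f32]) -> Result<u32, DecodeError>;
}

/// Tries the batched path first. If the native side reports `Unsupported`,
/// decodes each frame through the single-frame path instead. Any other
/// result, success or failure, is returned unchanged.
pub fn decode_with_fallback<D: FrameDecoder>(
    decoder: &mut D,
    frames: &[Vec<f32>],
) -> Result<Vec<u32>, DecodeError> {
    match decoder.decode_batch(frames) {
        Err(DecodeError::Unsupported) => frames
            .iter()
            .map(|frame| decoder.decode_single(frame))
            .collect(),
        other => other,
    }
}
```

Collecting into a `Result` stops at the first single-frame error, so a failure in the fallback path surfaces to the caller rather than being silently dropped.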

Validation

  • just build
  • just with-lld cargo build -p skippy-server -p skippy-bench
  • cargo fmt --all -- --check
  • cargo test -p skippy-server --lib
  • cargo test -p skippy-runtime --lib
  • cargo test -p skippy-ffi --lib
  • cargo clippy -p skippy-ffi --all-targets -- -D warnings
  • cargo clippy -p skippy-runtime --all-targets -- -D warnings
  • cargo clippy -p skippy-server --all-targets -- -D warnings
  • git diff --check
  • Local skippy-bench chat-corpus comparison above

@i386 i386 marked this pull request as ready for review June 5, 2026 08:20
@i386 i386 marked this pull request as draft June 5, 2026 08:20
@i386 i386 marked this pull request as ready for review June 5, 2026 23:04
@i386 i386 marked this pull request as draft June 5, 2026 23:06