Batch Skippy decode across concurrent requests by i386 · Pull Request #801 · Mesh-LLM/mesh-llm

i386 · 2026-06-05T08:03:39Z

Summary

Skippy can now batch decode work across concurrent requests instead of forcing each active lane through an independent single-request llama_decode call. This is the decode-side counterpart to the lane/runtime work already on main: when multiple requests are decoding at the same time, stage 0 and downstream split stages can execute already-queued one-token decode work as one native batch.

This PR originally used a fixed 200 us rendezvous window to encourage batch formation. The local benchmark showed that was not a good enough tradeoff, so this PR now takes the safer easy win: no intentional decode wait. The batchers drain whatever work has already accumulated; if only one request is ready, it runs immediately.

The updated local benchmark is still not a convincing performance/throughput win. It improves the concurrency-2 result slightly versus main (+0.9% completion tok/s), but concurrency 4 remains slower on aggregate completion tok/s (-5.8%). This PR should stay in draft unless a larger multi-stage benchmark proves the target topology benefits.

Merge recommendation

Do not merge this PR in its current form as a general performance/throughput improvement.

The implementation proves that cross-request decode batching can form real native decode batches, and the no-wait policy removes the earlier fixed rendezvous latency tax. However, the current local benchmark still does not show a convincing throughput win: concurrency 2 is only +0.9%, and concurrency 4 is -5.8% on completion tok/s on the local two-stage Qwen3-8B setup.

Recommended next step: keep this PR in draft and use it as the experiment branch for a larger multi-stage run. Merge criteria should be a clear win on the target shape, for example a four-stage larger-model benchmark showing improved aggregate tok/s without a material TPOT/TTFT regression. If the larger run is also neutral, we should close this PR or rework batching into a fully adaptive production policy.

What changed

Added local/full-model cross-request token decode batching for stage sessions.
Added a native skippy_decode_step_frame_batch_sampled ABI for batched one-token activation-frame decode.
Bumped the Skippy native ABI mirror to 0.1.27.
Added Rust runtime and server wrappers for batched activation-frame decode.
Added a shared binary-stage decode-frame batcher so downstream stages can batch across independent request connections.
Routed embedded stage 0 split decode through the same frame batcher before activation forwarding.
Removed the fixed 200 us decode rendezvous wait. Batching is now opportunistic only: work that is already queued is batched, and a single ready request runs immediately.
Added decode batch telemetry: batch size and batch wait time are emitted on decode spans.

Architecture

Before this branch, split decode still had a per-request hot path even when the runtime had multiple lanes:

sequenceDiagram
    participant R1 as Request 1
    participant R2 as Request 2
    participant S0 as Stage 0
    participant S1 as Downstream Stage
    participant N as Native llama.cpp

    R1->>S0: decode token
    S0->>N: llama_decode(batch=1)
    S0->>S1: activation frame
    S1->>N: llama_decode(batch=1)
    S1-->>S0: predicted token

    R2->>S0: decode token
    S0->>N: llama_decode(batch=1)
    S0->>S1: activation frame
    S1->>N: llama_decode(batch=1)
    S1-->>S0: predicted token

After this branch, each stage keeps the existing request/session protocol but coalesces decode work that is already queued at the stage runtime boundary:

sequenceDiagram
    participant R1 as Request 1
    participant R2 as Request 2
    participant B0 as Stage 0 decode batcher
    participant S0 as Stage 0 native runtime
    participant B1 as Downstream decode batcher
    participant S1 as Downstream native runtime

    R1->>B0: decode token
    R2->>B0: decode token
    B0->>S0: llama_decode(batch=2)
    B0-->>R1: activation frame
    B0-->>R2: activation frame

    R1->>B1: activation frame
    R2->>B1: activation frame
    B1->>S1: llama_decode(batch=2)
    B1-->>R1: predicted token
    B1-->>R2: predicted token

If only one request is queued when the batcher wakes, the batcher does not sleep to wait for another request. That makes the policy safer for low-concurrency and fast-local topologies, but it also means fewer batches form unless requests naturally align.
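The no-wait policy can be sketched as below. This is a minimal illustration rather than the code in this branch: it assumes the batcher reads from a `std::sync::mpsc` queue, and `drain_ready` and `max_batch` are hypothetical names. The only blocking call is the wait for the first item; everything after that is a non-blocking drain.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Blocks until at least one item is ready, then drains whatever else is
/// already queued (up to `max_batch` items) without sleeping for more.
/// Returns `None` once every sender is gone and the queue is empty.
/// NOTE: hypothetical helper, not the batcher implemented in this PR.
pub fn drain_ready<T>(rx: &Receiver<T>, max_batch: usize) -> Option<Vec<T>> {
    // Waiting for the first piece of work is the only blocking step.
    let first = rx.recv().ok()?;
    let mut batch = vec![first];

    while batch.len() < max_batch {
        match rx.try_recv() {
            Ok(item) => batch.push(item),
            // Nothing else is queued right now (or all senders are gone):
            // run what we have instead of waiting for a fuller batch.
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

With three items queued and `max_batch = 2`, the first call returns two items and the second returns the remaining one immediately, which mirrors the "single ready work runs immediately" behavior.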

Benchmark

Local benchmark date: 2026-06-05.

Setup:

Branches compared: main vs skippy-cross-request-decode-batching at commit bb3af8fa6.
Model: cached unsloth/Qwen3-8B-GGUF, two artifact-slice stages, layers 0..18 and 18..36.
Runtime: local loopback split, activation_wire_dtype=f16, activation_width=4096, lane_count=4, openai_generation_concurrency=4.
Corpus: crates/skippy-bench/corpora/kv_mixed_prompts.jsonl, first 8 prompts.
Request shape: max_tokens=24, streaming chat completions, temperature=0, seed=123, enable_thinking=false.
Perf runs used --telemetry-level off; a separate debug run used stderr telemetry only to prove batch formation.
main baseline is unchanged from the prior PR benchmark; the branch side was rerun after removing the fixed rendezvous wait.

Updated no-wait perf result:

concurrency	main completion tok/s	PR completion tok/s	completion tok/s delta	main TTFT p95 ms	PR TTFT p95 ms	TTFT p95 delta
2	46.42	46.82	+0.9%	242.0	227.3	-6.1%
4	54.56	51.40	-5.8%	435.5	471.9	+8.3%
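For reference, the delta columns are relative changes of the PR value versus main. The helper names below are illustrative only. Recomputing from the rounded table values reproduces +0.9%, -5.8%, and -6.1%; the concurrency-4 TTFT delta comes out at about +8.4% from the rounded inputs, so the table's +8.3% was presumably derived from unrounded measurements.

```rust
/// Relative change of `candidate` versus `baseline`, in percent.
pub fn percent_delta(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}

/// Rounds to one decimal place, the precision used in the table.
pub fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}
```

For example, `round1(percent_delta(54.56, 51.40))` gives the concurrency-4 completion tok/s delta.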

Telemetry proof from the no-wait debug run on this branch:

observed decode batch size	event count
1	533
2	24
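To put the histogram in proportion, here is a small sketch that summarizes it. It assumes each event is one native decode call and each batch entry is one token, as with the one-token decode work described above; `BatchStats` and `summarize` are hypothetical names. For this run it yields 557 decode calls covering 581 tokens, so about 4.3% of calls were real batches and the mean batch size is roughly 1.04.

```rust
/// Totals derived from a decode batch-size histogram.
pub struct BatchStats {
    pub decode_calls: u64,
    pub tokens: u64,
    pub batched_calls: u64,
}

/// `histogram` holds (batch size, event count) pairs.
pub fn summarize(histogram: &[(u64, u64)]) -> BatchStats {
    let mut stats = BatchStats { decode_calls: 0, tokens: 0, batched_calls: 0 };
    for &(size, count) in histogram {
        stats.decode_calls += count;
        stats.tokens += size * count;
        // Only batches larger than one count as cross-request batching.
        if size > 1 {
            stats.batched_calls += count;
        }
    }
    stats
}
```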

Interpretation:

The no-wait policy removes the deliberate 200 us latency tax.
Opportunistic batches still form, but fewer than with the fixed wait policy.
On this local two-stage 8B setup, the change is roughly neutral at concurrency 2 and still loses aggregate throughput at concurrency 4.
This benchmark is probably not representative of the target four-stage / larger-model path: it runs both stages on one machine over loopback, with no LAN/WAN delay and a much smaller model than the target large split deployment.

Protocol

The external OpenAI API is unchanged. The existing binary stage wire messages remain single-request messages; this PR batches at the stage runtime boundary across multiple active connections rather than requiring a batched wire envelope.
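One way to batch at the runtime boundary while keeping single-request wire messages is to let each queued item carry its own reply channel, so results fan back out to the connection that submitted them. The sketch below is an assumption about the shape, not the code in this branch: `QueuedDecode`, `run_batch`, and the `decode_batch` closure (standing in for the native batched call) are all hypothetical.

```rust
use std::sync::mpsc::Sender;

/// One queued one-token decode from a single connection (hypothetical shape).
pub struct QueuedDecode {
    pub request_id: u64,
    pub token: i32,
    /// Each connection owns the matching receiver for its own results.
    pub reply: Sender<(u64, i32)>,
}

/// Runs one batched step and routes each output back to its request.
/// `decode_batch` must return one output per input, in input order.
pub fn run_batch<F>(batch: Vec<QueuedDecode>, decode_batch: F)
where
    F: FnOnce(&[i32]) -> Vec<i32>,
{
    let inputs: Vec<i32> = batch.iter().map(|q| q.token).collect();
    let outputs = decode_batch(&inputs);
    assert_eq!(outputs.len(), batch.len(), "one output per queued request");

    for (queued, out) in batch.into_iter().zip(outputs) {
        // A closed receiver means the client went away; drop its result.
        let _ = queued.reply.send((queued.request_id, out));
    }
}
```

Because the replies are per-request, nothing on the wire needs to know that two requests shared a native decode call.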

The native Skippy ABI is extended with skippy_decode_step_frame_batch_sampled, so the Skippy ABI patch version moves from 0.1.26 to 0.1.27. Older native runtimes without this symbol are not compatible with this Rust runtime.
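A version gate for this bump could look like the sketch below. The compatibility rule is an assumption made for illustration (same major and minor, patch at least 27); the PR only states that runtimes without the new symbol are incompatible. `AbiVersion`, `parse_abi_version`, and `native_is_compatible` are hypothetical names, not the project's API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct AbiVersion {
    pub major: u32,
    pub minor: u32,
    pub patch: u32,
}

/// Minimum native ABI that exposes the batched frame decode symbol.
pub const REQUIRED_ABI: AbiVersion = AbiVersion { major: 0, minor: 1, patch: 27 };

/// Parses a strict "major.minor.patch" string; anything else is rejected.
pub fn parse_abi_version(s: &str) -> Option<AbiVersion> {
    let mut parts = s.trim().split('.').map(|p| p.parse::<u32>().ok());
    let version = AbiVersion {
        major: parts.next()??,
        minor: parts.next()??,
        patch: parts.next()??,
    };
    // Reject trailing components such as "0.1.27.1".
    if parts.next().is_some() {
        return None;
    }
    Some(version)
}

/// ASSUMED rule: same major.minor, and patch at or above the required one.
pub fn native_is_compatible(native: AbiVersion) -> bool {
    native.major == REQUIRED_ABI.major
        && native.minor == REQUIRED_ABI.minor
        && native.patch >= REQUIRED_ABI.patch
}
```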

Activation sideband families currently fall back to the existing single-frame decode path when the native batch ABI returns Unsupported.
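The fallback can be pictured as follows. This is a sketch under assumptions: `DecodeError`, `Frame`, and `decode_with_fallback` are hypothetical names, and the two closures stand in for the batched and single-frame native calls. Only an `Unsupported` result triggers the per-frame path; any other error is returned unchanged.

```rust
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    Unsupported,
    Failed(String),
}

/// Stand-in for one activation frame.
pub type Frame = Vec<f32>;

/// Tries the batched path first; if it reports `Unsupported`, decodes each
/// frame through the single-frame path instead, preserving input order.
pub fn decode_with_fallback<B, S>(
    frames: &[Frame],
    batch: B,
    mut single: S,
) -> Result<Vec<i32>, DecodeError>
where
    B: FnOnce(&[Frame]) -> Result<Vec<i32>, DecodeError>,
    S: FnMut(&[f32]) -> Result<i32, DecodeError>,
{
    match batch(frames) {
        Err(DecodeError::Unsupported) => {
            // Collecting into Result stops at the first per-frame error.
            frames.iter().map(|f| single(f.as_slice())).collect()
        }
        other => other,
    }
}
```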

Validation

just build
just with-lld cargo build -p skippy-server -p skippy-bench
cargo fmt --all -- --check
cargo test -p skippy-server --lib
cargo test -p skippy-runtime --lib
cargo test -p skippy-ffi --lib
cargo clippy -p skippy-ffi --all-targets -- -D warnings
cargo clippy -p skippy-runtime --all-targets -- -D warnings
cargo clippy -p skippy-server --all-targets -- -D warnings
git diff --check
Local skippy-bench chat-corpus comparison above

i386 added 2 commits June 5, 2026 17:40

Add Skippy cross-request token decode batching

2c85ee4

Batch Skippy split decode frames across requests

5980d56

i386 marked this pull request as ready for review June 5, 2026 08:20

i386 marked this pull request as draft June 5, 2026 08:20

Avoid fixed decode batch rendezvous waits

bb3af8f

i386 marked this pull request as ready for review June 5, 2026 23:04

i386 marked this pull request as draft June 5, 2026 23:06

ndizazzo assigned i386 Jun 6, 2026

i386 wants to merge 3 commits into main from skippy-cross-request-decode-batching
