Let clients mark stable prompt anchors for Skippy cache #793
Draft
i386 wants to merge 4 commits into
Collaborator, Author:
@michaelneale I think this one will be useful once we can classify traffic, e.g. long agent tool calls vs. chat.
Collaborator:
Yeah, I think this is a very good idea.
Long agentic prompts can now tell Skippy which token prefix is intended to stay stable across turns. The request can carry prompt_cache_anchor_tokens alongside prompt_cache_key, and Skippy records that exact prefix as a resident cache anchor while leaving the existing exact/full/generic prefix restore path in charge.
This is aimed at tool-heavy and long-context sessions where the beginning of the prompt is durable, but the tail keeps changing with tool results, scratchpad state, or retrieved context.
Why
The previous prefix-cache work made repeated prompts much cheaper when the runtime could discover and replay an exact cached prefix. That helps, but it still leaves the client unable to say: "this many tokens are the stable system/tool/session prefix; please keep this boundary hot."
For agent workloads, that boundary is often known at request construction time. Making it explicit gives Skippy a durable cache target that is independent of the changing suffix.
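As a rough illustration of how a client might know that boundary when it builds the request, here is a minimal sketch. Everything in it is hypothetical: `AgentPrompt`, `count_tokens`, and the whitespace "tokenizer" are stand-ins, not part of Skippy or of this PR. A real client would count tokens with the model's tokenizer, and token counts of separately tokenized parts do not always add up to the count of the concatenated prompt.

```rust
/// Hypothetical stand-in for a real tokenizer: one token per
/// whitespace-separated word. A real client would use the model's tokenizer.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Hypothetical client-side view of a prompt: a durable prefix followed by
/// a tail that changes every turn.
struct AgentPrompt<'a> {
    /// System prompt, tool definitions, session preamble.
    stable_parts: Vec<&'a str>,
    /// Tool results, scratchpad state, retrieved context.
    volatile_parts: Vec<&'a str>,
}

impl<'a> AgentPrompt<'a> {
    /// Token count of the stable prefix: the candidate value for
    /// prompt_cache_anchor_tokens.
    fn anchor_tokens(&self) -> usize {
        self.stable_parts.iter().map(|p| count_tokens(p)).sum()
    }

    /// Token count of the whole prompt under the same toy tokenizer.
    fn total_tokens(&self) -> usize {
        let tail: usize = self.volatile_parts.iter().map(|p| count_tokens(p)).sum();
        self.anchor_tokens() + tail
    }
}
```

Under these assumptions the anchor value stays the same from turn to turn as long as only `volatile_parts` changes, which is the property the hint relies on.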
Before
sequenceDiagram
    participant Client
    participant Stage0
    participant Stage1
    participant Stage2
    Client->>Stage0: prompt_cache_key + full prompt
    Stage0->>Stage1: prefill chunks
    Stage1->>Stage2: prefill chunks
    Stage2-->>Stage0: decode replies
    Note over Stage0,Stage2: Cache records discovered exact/grid prefixes
    Client->>Stage0: same prefix + changed suffix
    Stage0->>Stage0: find best discovered prefix
    Stage0->>Stage1: restore prefix, then prefill suffix

The runtime could reuse what it discovered, but the client had no protocol surface for naming the stable prefix that should survive changing prompt tails.
After
sequenceDiagram
    participant Client
    participant Stage0
    participant Stage1
    participant Stage2
    Client->>Stage0: prompt_cache_key + prompt_cache_anchor_tokens=N
    Stage0->>Stage0: validate N is a strict token prefix
    Stage0->>Stage1: prefill/decode as normal
    Stage0->>Stage0: record explicit N-token anchor
    Stage0->>Stage1: decode sideband carries N
    Stage1->>Stage1: record the same anchor prefix
    Stage1->>Stage2: decode sideband carries N
    Stage2->>Stage2: record the same anchor prefix
    Client->>Stage0: same key + changed suffix + N
    Stage0->>Stage0: prefer existing exact/full/generic restore
    Stage0->>Stage1: use anchor restore only if no prefix restored

The anchor is conservative by design:
prompt_cache_key
Performance
This PR adds the control-plane and cache-policy hook for stable prompt anchors. It does not claim a universal wall-clock speedup from the local benchmark shape, because the existing multi-token replay cache already captured nearly the same prefix in that run.
Local sanity benchmark:
max_tokens=4
The important result here is correctness and safety: the anchor hint records cleanly and does not force a worse restore path over an already-good generic prefix restore. The expected win is in workloads where the useful stable prefix is not already captured by the grid, or where cache retention should favor a long-lived agent prefix over volatile prompt tails.
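The "does not force a worse restore path" behavior can be pictured as a simple preference order. The sketch below is an assumption about the shape of that policy, drawn from the description in this PR rather than from the actual Skippy code; `Restore` and `choose_restore` are hypothetical names.

```rust
/// Hypothetical outcome of the restore decision for one request.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Restore {
    /// A prefix found by the existing exact/full/generic path (length in tokens).
    Discovered(usize),
    /// The client-declared anchor (length in tokens), used only as a fallback.
    Anchor(usize),
    /// Nothing usable is cached: prefill the whole prompt.
    FullPrefill,
}

/// `discovered` is the prefix length the existing restore path found, if any.
/// `anchor` is the length of a recorded anchor for this key, if one is resident.
fn choose_restore(discovered: Option<usize>, anchor: Option<usize>) -> Restore {
    match (discovered, anchor) {
        // The existing restore path always wins when it finds something.
        (Some(n), _) => Restore::Discovered(n),
        // The anchor is consulted only when no prefix was restored.
        (None, Some(n)) => Restore::Anchor(n),
        (None, None) => Restore::FullPrefill,
    }
}
```

Note that this sketch lets a discovered prefix win even when it is shorter than the anchor, which matches the "only if no prefix restored" wording; whether the real policy compares lengths is not stated here.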
Protocol
Adds an optional OpenAI-compatible request extension:
prompt_cache_anchor_tokens
It is additive on the HTTP request surface. Existing clients do not need to send it. New servers only honor it when paired with prompt_cache_key.
Inside Skippy, downstream stages learn the requested anchor through the existing decode-record sideband shape, so this does not add a mesh protobuf field or a new public stream type. Mixed-version staged chains should not rely on this hint until all Skippy stages are updated, because older stages will not record or restore the anchor.
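To make the honoring rule concrete, here is a minimal sketch of how a server might decide whether to record an anchor. The struct and function names are hypothetical, and two points are assumptions: "strict token prefix" is read as 0 < N < prompt length, and an unusable hint is silently ignored (the real server may reject it instead).

```rust
/// Hypothetical subset of the request fields relevant to the anchor hint.
struct CacheHints {
    prompt_cache_key: Option<String>,
    prompt_cache_anchor_tokens: Option<usize>,
}

/// Returns the anchor length to record, or None when the hint is not honored.
fn effective_anchor(hints: &CacheHints, prompt_tokens: usize) -> Option<usize> {
    // The hint is honored only when paired with prompt_cache_key.
    hints.prompt_cache_key.as_ref()?;
    let n = hints.prompt_cache_anchor_tokens?;
    // Assumption: a strict prefix is non-empty and shorter than the prompt.
    if n > 0 && n < prompt_tokens {
        Some(n)
    } else {
        None
    }
}
```

Because both fields are optional, a client that sends neither behaves exactly as before, which is what makes the extension additive.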
Validation
cargo fmt --all -- --check
git diff --check
just with-lld cargo test -p skippy-server --lib
just with-lld cargo clippy -p skippy-server --all-targets -- -D warnings
just with-lld cargo clippy -p openai-frontend --all-targets -- -D warnings
just with-lld cargo clippy -p mesh-llm-host-runtime --all-targets -- -D warnings
just with-lld cargo clippy -p mesh-llm --all-targets -- -D warnings
just build
Benchmark/sanity run output was captured under:
/tmp/prompt-anchor-cache-bench-20260604-203642