Skip to content

Let clients mark stable prompt anchors for Skippy cache#793

Draft
i386 wants to merge 4 commits into
mainfrom
codex/skippy-prompt-anchor-cache
Draft

Let clients mark stable prompt anchors for Skippy cache#793
i386 wants to merge 4 commits into
mainfrom
codex/skippy-prompt-anchor-cache

Conversation

@i386

@i386 i386 commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

Long agentic prompts can now tell Skippy which token prefix is intended to stay stable across turns. The request can carry prompt_cache_anchor_tokens alongside prompt_cache_key, and Skippy records that exact prefix as a resident cache anchor while leaving the existing exact/full/generic prefix restore path in charge.

This is aimed at tool-heavy and long-context sessions where the beginning of the prompt is durable, but the tail keeps changing with tool results, scratchpad state, or retrieved context.

Why

The previous prefix-cache work made repeated prompts much cheaper when the runtime could discover and replay an exact cached prefix. That helps, but it still leaves the client unable to say: "this many tokens are the stable system/tool/session prefix; please keep this boundary hot."

For agent workloads, that boundary is often known at request construction time. Making it explicit gives Skippy a durable cache target that is independent of the changing suffix.

Before

sequenceDiagram
    participant Client
    participant Stage0
    participant Stage1
    participant Stage2

    Client->>Stage0: prompt_cache_key + full prompt
    Stage0->>Stage1: prefill chunks
    Stage1->>Stage2: prefill chunks
    Stage2-->>Stage0: decode replies
    Note over Stage0,Stage2: Cache records discovered exact/grid prefixes

    Client->>Stage0: same prefix + changed suffix
    Stage0->>Stage0: find best discovered prefix
    Stage0->>Stage1: restore prefix, then prefill suffix
Loading

The runtime could reuse what it discovered, but the client had no protocol surface for naming the stable prefix that should survive changing prompt tails.

After

sequenceDiagram
    participant Client
    participant Stage0
    participant Stage1
    participant Stage2

    Client->>Stage0: prompt_cache_key + prompt_cache_anchor_tokens=N
    Stage0->>Stage0: validate N is a strict token prefix
    Stage0->>Stage1: prefill/decode as normal
    Stage0->>Stage0: record explicit N-token anchor
    Stage0->>Stage1: decode sideband carries N
    Stage1->>Stage1: record the same anchor prefix
    Stage1->>Stage2: decode sideband carries N
    Stage2->>Stage2: record the same anchor prefix

    Client->>Stage0: same key + changed suffix + N
    Stage0->>Stage0: prefer existing exact/full/generic restore
    Stage0->>Stage1: use anchor restore only if no prefix restored
Loading

The anchor is conservative by design:

  • ignored without prompt_cache_key
  • ignored when zero or not a strict prefix of the prompt
  • exact-token only; no lossy approximation
  • existing exact/full/generic prefix restore is preferred
  • anchor restore is attempted only when no prefix was restored by the existing cache path

Performance

This PR adds the control-plane and cache-policy hook for stable prompt anchors. It does not claim a universal wall-clock speedup from the local benchmark shape, because the existing multi-token replay cache already captured nearly the same prefix in that run.

Local sanity benchmark:

  • model: Qwen3 0.6B Q4_K_M
  • topology: 3 local Skippy stages
  • context: 8192
  • prompt: synthetic 35-line agent transcript with a shared stable prefix and changing suffix
  • output: max_tokens=4
  • cache grid: 256-token shared-prefix stride
  • requested anchor: 1600 tokens
Scenario Cold TTFT Warm TTFT median Warm elapsed median Observed cache behavior
Existing prefix cache 26.72s 1.53s 1.62s 1536-token generic/grid restore
Anchor hint requested 28.50s 1.63s 1.76s anchor recorded; existing 1536-token restore still preferred

The important result here is correctness and safety: the anchor hint records cleanly and does not force a worse restore path over an already-good generic prefix restore. The expected win is in workloads where the useful stable prefix is not already captured by the grid, or where cache retention should favor a long-lived agent prefix over volatile prompt tails.

Protocol

Adds an optional OpenAI-compatible request extension:

  • prompt_cache_anchor_tokens

It is additive on the HTTP request surface. Existing clients do not need to send it. New servers only honor it when paired with prompt_cache_key.

Inside Skippy, downstream stages learn the requested anchor through the existing decode-record sideband shape, so this does not add a mesh protobuf field or a new public stream type. Mixed-version staged chains should not rely on this hint until all Skippy stages are updated, because older stages will not record or restore the anchor.

Validation

  • cargo fmt --all -- --check
  • git diff --check
  • just with-lld cargo test -p skippy-server --lib
  • just with-lld cargo clippy -p skippy-server --all-targets -- -D warnings
  • just with-lld cargo clippy -p openai-frontend --all-targets -- -D warnings
  • just with-lld cargo clippy -p mesh-llm-host-runtime --all-targets -- -D warnings
  • just with-lld cargo clippy -p mesh-llm --all-targets -- -D warnings
  • just build

Benchmark/sanity run output was captured under:

  • /tmp/prompt-anchor-cache-bench-20260604-203642

@i386 i386 marked this pull request as draft June 4, 2026 10:51
@i386

i386 commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

@michaelneale I think this one will be useful when we can classify traffic - e.g. long agent tool calls vs chat

@michaelneale

Copy link
Copy Markdown
Collaborator

yeah I think a very good idea.

Base automatically changed from codex/skippy-multi-token-replay-cache to main June 5, 2026 05:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants