seglen eviction for mamba radix cache #3

Open
XinyiQiao wants to merge 57 commits into main from seglen-eviction


@XinyiQiao XinyiQiao commented Mar 25, 2026


Motivation

This PR adds a new radix cache eviction policy "seglen" (segment length) for hybrid models using MambaRadixCache.

Our approach is inspired by Marconi prefix caching for hybrid LLMs. Seglen is a heuristic approximation of Marconi's FLOPs-efficiency score: it preserves the core recomputation-cost intuition while avoiding the implementation complexity of marginal-FLOPs calculations that are specific to each model architecture.

seglen ranks eviction candidates by their replay length to the nearest reusable Mamba ancestor, combined with recency. Compared with pure LRU, this is intended to make eviction decisions more aware of recomputation cost for hybrid models. The policy can be enabled at launch:

sglang serve \
  --model-path Qwen/Qwen3.5-9B \
  --mamba-scheduler-strategy extra_buffer \
  --radix-eviction-policy seglen \
  --marconi-eff-weight 0.85
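
The ranking described above can be illustrated with a minimal sketch. This is not the code in this PR: the `Candidate` class, the min-max normalization, and the linear blend are assumptions made for illustration, as is the direction of the score (a longer replay length is assumed to make a node more worth keeping, since evicting it would cost more recomputation). The `eff_weight` parameter stands in for the efficiency/recency weight mentioned under Modifications.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical eviction candidate (not the PR's actual node type)."""
    name: str
    replay_len: int     # tokens to replay from the nearest reusable Mamba ancestor
    last_access: float  # last access time; larger means more recent


def _normalize(values: List[float]) -> List[float]:
    """Min-max normalize to [0, 1]; all-equal inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_for_eviction(cands: List[Candidate], eff_weight: float) -> List[Candidate]:
    """Order candidates from first-to-evict to last-to-evict.

    Assumed score: eff_weight * replay_len + (1 - eff_weight) * recency,
    both normalized. The lowest score is evicted first.
    """
    if not cands:
        return []
    replay = _normalize([float(c.replay_len) for c in cands])
    recency = _normalize([c.last_access for c in cands])
    scored = [
        (eff_weight * r + (1.0 - eff_weight) * t, c)
        for r, t, c in zip(replay, recency, cands)
    ]
    scored.sort(key=lambda pair: pair[0])  # stable sort keeps input order on ties
    return [c for _, c in scored]


cands = [
    Candidate("recent_short", replay_len=10, last_access=100.0),
    Candidate("old_long", replay_len=1000, last_access=1.0),
    Candidate("old_short", replay_len=10, last_access=1.0),
]
order = [c.name for c in rank_for_eviction(cands, eff_weight=0.85)]
```

In this sketch, setting `eff_weight` to 0 reduces the ranking to pure LRU, while a value near 1 ranks almost entirely by replay length.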

Modifications

  • Add seglen as a supported radix eviction policy for hybrid SSM models.
  • Implement seglen full-KV eviction and Mamba-state eviction in MambaRadixCache.
  • Add seglen_eff_weight to control the balance between replay-length efficiency and recency.
  • Update match_prefix behavior so seglen refreshes only the matched last node instead of all matched ancestors.
  • Add validation so --radix-eviction-policy=seglen is only allowed for hybrid SSM models.
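
The match_prefix change in the list above can be sketched as follows. The `Node` class and `refresh_on_match` function are hypothetical names used only for illustration, and the sketch assumes that the non-seglen policies refresh every node on the matched path; the actual MambaRadixCache code may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical radix tree node with a parent link and an access timestamp."""
    name: str
    parent: Optional["Node"] = None
    last_access: float = 0.0


def refresh_on_match(last_node: Node, now: float, policy: str) -> None:
    """Update access times after a prefix match that ended at last_node.

    With policy == "seglen", only the last matched node is refreshed.
    Otherwise (assumed here), every node on the path to the root is refreshed.
    """
    node: Optional[Node] = last_node
    while node is not None:
        node.last_access = now
        if policy == "seglen":
            break  # leave ancestors' timestamps untouched
        node = node.parent


# Build a three-node chain: root -> a -> b
root = Node("root")
a = Node("a", parent=root)
b = Node("b", parent=a)

refresh_on_match(b, now=5.0, policy="seglen")
seglen_times = (root.last_access, a.last_access, b.last_access)

refresh_on_match(b, now=9.0, policy="lru")
lru_times = (root.last_access, a.last_access, b.last_access)
```

Under seglen, ancestors keep their older timestamps after a match, so their recency reflects only matches that ended at them.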

Benchmarking and Profiling

Benchmark results on H100 show that seglen delivers substantial TTFT improvements on prefix-heavy workloads, while still providing a modest TTFT improvement on the ShareGPT regression dataset with low prefix-hit rate.

  • -29.5% TTFT on prefix-heavy datasets
  • -26.1% TTFT on SWE-bench datasets
  • -3.4% TTFT on ShareGPT as a regression check (~1% prefix hit)

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions Bot added the "documentation" label Mar 25, 2026
@XinyiQiao XinyiQiao changed the title from "Seglen eviction" to "seglen eviction for mamba radix cache" Mar 29, 2026
@XinyiQiao XinyiQiao added the "enhancement" label and removed the "documentation" label Mar 29, 2026
@XinyiQiao XinyiQiao marked this pull request as ready for review March 29, 2026 22:11
