Skip to content

feat(scheduler): per-adapter KV prefix-cache namespace + max_loras batch cap#268

Open
qywu wants to merge 19 commits into
lightseekorg:mainfrom
qywu:feat/lora-scheduler-kv-namespace
Open

feat(scheduler): per-adapter KV prefix-cache namespace + max_loras batch cap#268
qywu wants to merge 19 commits into
lightseekorg:mainfrom
qywu:feat/lora-scheduler-kv-namespace

Conversation

@qywu
Copy link
Copy Markdown
Collaborator

@qywu qywu commented May 26, 2026

Summary

Two scheduler-level changes that make the tokenspeed scheduler LoRA-aware. Extracted from the broader LoRA serving PR (#83) as a self-contained C++ contribution.


1 — Per-adapter KV prefix-cache namespace isolation

Problem: Without isolation, tokens cached under adapter A could be incorrectly reused by adapter B or the base model, producing silent wrong outputs.

Solution: Each LoRA adapter gets its own virtual root node in the radix tree, scoping all prefix cache lookups and insertions per adapter.

How it works:

  • KVPrefixCache::Match and Insert now accept a lora_id parameter (kLoraNone = 0 = base model).
  • For base model requests, the real tree root is used as before (no behaviour change).
  • For adapter requests, resolveStartNode(lora_id) returns (or lazily creates) a per-adapter virtual root keyed by a sentinel page [-lora_id, 0, …, 0]. Negative first tokens are outside all real vocabulary ranges, so there is no collision with actual sequences.
  • The virtual root holds an empty DeviceResource so PruneEmptyByNode never removes it while an adapter is loaded.
  • namespace_depth_offset = 1 is applied to match results so callers see real page counts, not counts including the sentinel page.
  • KVPrefixCache::EvictLoraNamespace(lora_id) DFS-collects all descendants of the virtual root, evicts their device and host pages via EvictSubtree, and removes the root from the tree. This is called from Scheduler::EvictLoraNamespace on adapter unload, immediately freeing KV memory instead of waiting for LRU pressure. Locked (in-flight) pages are skipped and freed when those requests finish.
  • lora_id is propagated through HybridPrefixCache::Match overloads and InsertHybridCache so the hybrid L2/L3 path is also namespace-aware.

Data model changes:

  • RequestSpec::lora_id (int32, default 0) — set when submitting a request.
  • Request::lora_id_ (private) / Request::LoraId() (public accessor) — carried through the FSM.
  • KVPrefixCache::lora_virtual_roots_unordered_map<int32, TreeNode*> of per-adapter roots.

2 — max_loras adapter cap in batch building

Problem: If a batch contains more unique LoRA adapters than GPU slots, prepare_loras() either crashes with a RuntimeError or silently runs an evicted adapter as the base model (wrong output, no error).

Solution: The scheduler enforces a hard cap on unique adapter IDs per forward pass.

How it works:

  • SchedulerConfig gains max_loras (int32, default 0 = LoRA disabled / no cap enforced).
  • newForwardOperation() maintains a batch_lora_ids unordered set. Before scheduling each request, if max_loras > 0 and the request's lora_id would be a new entry that pushes batch_lora_ids.size() past max_loras, the request is skipped (deferred to the next scheduling round) rather than added to the batch.
  • Base-model requests (lora_id == kLoraNone) are never deferred — the cap only applies to adapter traffic.
  • This guarantees prepare_loras() sees at most max_loras unique adapter IDs per step.

Files changed

File Change
csrc/resource/kv_prefix_cache/kv_prefix_cache.{h,cpp} Core namespace: Match/Insert with lora_id, getOrCreateLoraRoot, resolveStartNode, EvictLoraNamespace
csrc/resource/kv_prefix_cache/eviction.h EvictSubtree template method for bulk page eviction
csrc/resource/hybrid_prefix_cache/hybrid_prefix_cache.{h,cpp} Forward lora_id through Match overloads
csrc/fsm/forward_events.{h,cpp} Add lora_id param to InsertHybridCache; propagate into KV insert path
csrc/scheduler/request.{h,cpp}, request_spec.h Add lora_id field to spec and request; expose LoraId() accessor
csrc/scheduler/scheduler.{h,cpp} Add EvictLoraNamespace(lora_id) delegation
csrc/scheduler/operations/forward.cpp batch_lora_ids set + max_loras deferral logic; pass LoraId() to all Match calls
csrc/scheduler/types.h Add max_loras to SchedulerConfig
bindings/python_module.cpp Expose max_loras (on SchedulerConfig) and lora_id (on RequestSpec) to Python
python/…/scheduler_utils.py make_spec(lora_id=0) and make_config(max_loras=0)
python/…/event_loop.py Wire max_loras from server_args into make_config

Test plan

  • Scheduler C++ unit tests build and pass
  • Scheduler Python unit tests pass
  • Prefix-cache namespace: verify adapter A's cached pages are not returned on a cache lookup for adapter B with the same token sequence
  • max_loras=1 with two concurrent adapters: second adapter's requests are deferred, not batched (no wrong output, no crash)
  • EvictLoraNamespace: after unloading an adapter, its KV pages are freed and its virtual root is removed from the tree

qywu added 11 commits May 26, 2026 20:27
…tch cap

### 1 — Per-adapter KV prefix-cache namespace isolation

Each LoRA adapter gets its own virtual root in the KV radix tree so
prefix reuse is isolated per adapter. Tokens cached under adapter A are
never served to adapter B or the base model.

- RequestSpec / Request gain a lora_id field (0 = kLoraNone = base model)
- FSM forward events carry lora_id_ through prefill/decode scheduling
- KVPrefixCache adds lora_virtual_roots_: unordered_map<int32_t, TreeNode*>
- getOrCreateLoraRoot inserts a sentinel child keyed by [-lora_id, 0, ...]
  (negative first token, never collides with real vocabulary)
- Match/Insert accept lora_id; resolveStartNode scopes the radix walk;
  depth adjusted by namespace_depth_offset=1 to hide the sentinel page
- HybridPrefixCache::Match forwards lora_id; InsertHybridCache propagates it
- Scheduler::EvictLoraNamespace: DFS-collects and frees all KV pages under
  an adapter's virtual root when the adapter is unloaded

### 2 — max_loras adapter cap in batch building

SchedulerConfig gains max_loras (0 = disabled). newForwardOperation()
tracks batch_lora_ids and defers requests that would exceed the cap,
preventing silent wrong-output and RuntimeError crashes.

Python wiring: make_config() / event_loop.py pass server_args.max_loras;
python_module.cpp exposes max_loras on SchedulerConfig.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…refix_cache_adjunct_spec

These were accidentally removed when max_loras was added to make_config.
event_loop.py still calls pool_to_prefix_cache_adjunct_spec() at line 311
and passes prefix_cache_adjunct= to make_config() — both would crash at
runtime with the symbols missing.

Changes:
- scheduler_utils.py: restore PrefixCacheAdjunctSpec + PagedCacheGroupFamily
  imports via guarded try/except (pre-installed CI binary may not have them);
  restore prefix_cache_adjunct param in make_config(); restore family field in
  pool_to_paged_cache_groups(); restore pool_to_prefix_cache_adjunct_spec()
  with a clear RuntimeError when binary predates the feature
- event_loop.py: restore pool_to_prefix_cache_adjunct_spec import
- tokenspeed_scheduler/__init__.py: restore PagedCacheGroupFamily and
  PrefixCacheAdjunctSpec exports via same guarded try/except pattern

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
… prefix_cache_adjunct

Accidentally removed during merge conflict resolution. Restore by taking
upstream's python_module.cpp and re-applying only our LoRA additions:
  - max_loras field on SchedulerConfig
  - lora_id field on RequestSpec

Everything else (PagedCacheGroupFamily enum, PrefixCacheAdjunctSpec class,
prefix_cache_adjunct on SchedulerConfig, family param on PagedCacheGroupConfig,
OwnedPagesCount/BorrowedPagesCount on PagedCacheGroupTable, unordered_map
include) is now correctly preserved from upstream.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
… upstream

Upstream changed lru_leaves_ from std::pair<timestamp, TreeNode*> to
std::tuple<timestamp, SeqId, TreeNode*> to ensure deterministic LRU
ordering across TP ranks (pointer values diverge per-process due to ASLR,
causing different ranks to evict different nodes on timestamp ties and
eventually deadlocking NCCL collectives).

Update EvictSubtree's erase call from {it->second, node} to
{it->second, node->SeqId(), node} to match the new tuple key.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Upstream encapsulated lora_id as private lora_id_ with a LoraId()
accessor. Replace request->lora_id() with request->LoraId() in the
max_loras batch-cap loop in newForwardOperation().

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
No changes needed — PagedCacheGroupFamily and PrefixCacheAdjunctSpec
are properly exported by the upstream version as-is.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…er-serving

- Clean imports (no try/except guards for PagedCacheGroupFamily/PrefixCacheAdjunctSpec)
- Restore upstream ordering in make_config
- use server_args.enable_mixed_batch directly
- Use list() for required_groups

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
- Restore direct pool.paged_cache_group_specs access (no getattr default)
- Restore upstream ordering: compute family first, then build kwargs with family=family
- Restore [str(gid) for gid in required_group_ids] in pool_to_prefix_cache_adjunct_spec

Only truly necessary changes vs upstream remain:
  make_spec: lora_id param; make_config: max_loras param.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
@qywu qywu marked this pull request as ready for review May 27, 2026 00:49
@qywu qywu requested a review from a team as a code owner May 27, 2026 00:49
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 084c1ee4cf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +50 to +51
LoadLoraReqInput,
LoadLoraReqOutput,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Badge Remove undefined LoRA control types from scheduler imports

SchedulerControlClient now imports LoadLoraReqInput/Output and UnloadLoraReqInput/Output, but those symbols are not defined in python/tokenspeed/runtime/engine/io_struct.py (repo-wide search only finds these names in this client). Importing AsyncLLM (which imports this module) will raise ImportError during startup, so the control plane cannot initialize.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 6b7cab2 — reverted to upstream. These LoRA control imports belong in #83 alongside their io_struct.py definitions.

enable_mixed_prefill_decode=server_args.enable_mixed_batch,
enable_mixed_prefill_decode=enable_mixed_prefill_decode,
prefix_cache_adjunct=prefix_cache_adjunct,
max_loras=server_args.max_loras if server_args.enable_lora else 0,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Badge Add missing ServerArgs fields before reading LoRA config

This line unconditionally reads server_args.enable_lora/server_args.max_loras, but ServerArgs in runtime/utils/server_args.py does not declare either field. In the default startup path, EventLoop construction will fail with AttributeError before scheduler initialization.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d184706 — reverted event_loop.py to upstream and pass max_loras=0. The enable_lora/max_loras ServerArgs fields live in #83 (full LoRA serving).

# Proactively evict the KV cache namespace for this adapter so pages
# are freed immediately rather than waiting for LRU eviction pressure.
if lora_id is not None:
self.scheduler.evict_lora_namespace(lora_id)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Expose LoRA namespace eviction on the Python Scheduler binding

unload_lora_adapter() calls self.scheduler.evict_lora_namespace(...), but the nanobind Scheduler class does not define an evict_lora_namespace method (binding currently exposes methods like submit_requests, next_execution_plan, advance, and stats accessors only). The first LoRA unload will raise AttributeError, leaving adapter unload cleanup incomplete.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in d184706unload_lora_adapter was reverted from this branch. The Python binding for evict_lora_namespace lives in #83 alongside its only caller.

Comment on lines +324 to +326
kv_prefix_cache_.Insert<ResourceType::Device>(full_paged_tokens, prefix_pages, std::move(alloc_pages));

MatchResult match_result = kv_prefix_cache_.Match(full_paged_tokens, MatchIntent::StateRecovery);
MatchResult match_result = kv_prefix_cache_.Match(full_paged_tokens, request->LoraId(), MatchIntent::StateRecovery);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep LoRA namespace consistent when retracting cached pages

During retract, pages are inserted with the default namespace (Insert(..., std::move(alloc_pages))) but immediately matched in request->LoraId() namespace. For LoRA requests this writes KV pages into the base subtree and then looks them up in adapter subtree, so recovery misses its own cached prefix under retraction pressure.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7b0a3cascheduleRetract now passes request->LoraId() to Insert so the cached pages land in the adapter subtree where the subsequent StateRecovery Match looks for them.

Comment on lines 109 to +110
SchedulePrefillEvent(std::int32_t tokens_this_round, std::int32_t reserve_num_tokens_in_next_schedule_event,
HybridPrefixCache* hybrid_prefix_cache = nullptr)
HybridPrefixCache* hybrid_prefix_cache = nullptr, std::int32_t lora_id = kLoraNone)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Require explicit lora_id in follow-up forward events

Adding a default lora_id = kLoraNone here makes old call sites compile even when they forget to propagate request LoRA identity. In this commit, schedulePrefill/scheduleDecode still construct these events without request->LoraId(), so subsequent cache inserts run in base namespace and can cross-contaminate adapter/base KV reuse.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both InsertHybridCache call sites in forward_events.cpp (SchedulePrefillEvent, ScheduleDecodeEvent) pass lora_id_ explicitly today. The default kLoraNone is a safety net for non-LoRA callers; happy to drop it in a follow-up to enforce explicit propagation if you prefer.

LoadLoraReqInput/UnloadLoraReqInput are part of the full LoRA serving
PR (lightseekorg#83) and don't exist in io_struct.py on this branch, causing an
import error that breaks every test importing AsyncLLM.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6b7cab2049

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// entire subtree (including any mamba slots attached to those nodes).
token_vec_t sentinel(tree_.PageSize(), 0);
sentinel[0] = -lora_id;
tree_.Root()->RemoveChild(sentinel);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid deleting locked LoRA cache nodes

When an adapter is unloaded while any request in that namespace is still in flight, EvictSubtree skips the locked descendants but this line still removes the virtual root, which destroys the entire subtree and the TreeNodes that active DeviceNodeRef/HostNodeRefs still point at. In that scenario the subsequent request state transitions will dereference dangling tree nodes (and the skipped pages are freed by destruction rather than when the request finishes), so namespace removal needs to wait until locked descendants are gone or abort/drain those requests first.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7b0a3caEvictLoraNamespace now scans descendants after EvictSubtree and defers RemoveChild if any node still has OnDevice()/OnHost() (meaning it was skipped as locked). Locked pages are freed when their requests finish via RAII; a later EvictLoraNamespace call (or natural LRU pressure) completes the cleanup.

Comment on lines +932 to +933
if spec.lora_id != 0:
self._request_lora_ids[spec.request_id] = spec.lora_id
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Propagate request LoRA ids before scheduling

This admission path depends on spec.lora_id being non-zero, but the serving path still builds specs with make_spec(rid=recv_req.rid, tokens=recv_req.input_ids) and TokenizedGenerateReqInput does not carry an adapter id, so every API request reaches this branch with lora_id == 0. As a result real adapter requests are submitted to the scheduler as base-model requests: the per-adapter namespace and max_loras cap are bypassed, and _request_lora_ids is never populated for the model executor.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in d184706 — this admission path was reverted from this branch. Per-request lora_id wiring (admission, TokenizedGenerateReqInput, prefetch, pack policy) all lives in #83.

if (is_new && static_cast<std::int32_t>(batch_lora_ids.size()) >= config_.max_loras) {
continue; // defer to next step
}
batch_lora_ids.insert(request->LoraId());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Count only scheduled adapters against max_loras

With max_loras enabled, this inserts a request's adapter into batch_lora_ids before any schedule* call succeeds. If that request cannot be scheduled in this round due to request-specific constraints such as paged-cache/mamba admission or capacity, its adapter still consumes the cap and can cause later requests for other adapters to be deferred even though the batch contains no operation for the first adapter; with max_loras=1 this can leave a schedulable request idle or trigger the empty-plan retraction path unnecessarily.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7b0a3ca — the cap check stays up front, but batch_lora_ids.insert() now only fires when ops.size() actually grew. Failed schedules (capacity/mamba admission) no longer consume a slot.

LoRA serving code (enable_lora checks, _lora_manager, load/unload
adapter methods, lora_id request tracking) references ServerArgs fields
that don't exist on this branch, crashing the scheduler at startup.
Only the max_loras=0 passthrough to make_config() is needed here.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d18470682f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// Attach an empty DeviceResource so OnDevice() returns true.
// This prevents PruneEmptyByNode from removing the virtual root even when
// all adapter sequences have been evicted.
raw->AttachResource<ResourceType::Device>(std::make_unique<NodeResource<ResourceType::Device>>(OwnedPages{}));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Add host resource to LoRA namespace roots

LoRA virtual roots only get an empty device resource, but HostNodeRef locks every ancestor with GetResource<Host>(c) and several changed paths create host locks for LoRA matches/writebacks. For example, a LoRA request that is retracted or later writes back to L2 can have match.host.last_node at or below this virtual root, and locking it dereferences the missing host resource, crashing instead of recovering/writing back the request.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7b0a3ca — the virtual root now also gets an empty HostResource. IsEmpty() stays true on both resources, so the root never enters lru_leaves_ but HostNodeRef can safely lock it when match.host.last_node lands on it.

paged_cache_groups=paged_cache_groups,
enable_mixed_prefill_decode=server_args.enable_mixed_batch,
prefix_cache_adjunct=prefix_cache_adjunct,
max_loras=0,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Wire max_loras into the runtime config

The Python serving path always passes max_loras=0, while the scheduler only enforces the new adapter cap when config_.max_loras > 0. In normal EventLoop startup this leaves the max-loras batch cap disabled for every server, so batches can still contain more unique adapters than the model executor's LoRA slots unless the scheduler is instantiated manually outside this path.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional for this PR — it's the scheduler-only slice. max_loras wiring from ServerArgs lives in #83 alongside the --enable-lora/--max-loras CLI flags. 0 here makes the new cap a no-op until #83 lands, by design.

Four issues flagged by codex review on PR lightseekorg#268:

1. scheduleRetract Insert used base namespace while Match used the
   request's adapter namespace, so retracted LoRA requests wrote pages
   into the wrong subtree and never recovered their prefix on resume.
   Pass request->LoraId() through to Insert.

2. LoRA virtual roots had only a Device resource attached, so any host
   path that locked match.host.last_node when it pointed at the root
   would dereference a null Host resource. Attach an empty Host resource
   too; IsEmpty() stays true on both so the root never enters
   lru_leaves_.

3. EvictLoraNamespace unconditionally removed the virtual root after
   EvictSubtree, but EvictSubtree skips locked descendants. The
   unique_ptr cascade then destroyed those still-locked TreeNodes while
   live NodeRefs pointed at them — a use-after-free. Detect remaining
   locked descendants and defer root removal until a later call.

4. The max_loras cap counted a request's adapter before scheduling.
   Failed schedules (capacity, mamba admission, etc.) still consumed a
   slot, starving other adapters. Check the cap up front but insert
   into batch_lora_ids only after ops.size() grew.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b0a3ca9e0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const std::int32_t page_size = tree_.PageSize();
token_vec_t sentinel(page_size, 0);
sentinel[0] = -lora_id;
auto node = std::make_unique<TreeNode>(sentinel, std::chrono::steady_clock::now());
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include the LoRA namespace in published block hashes

When KV cache events or L3 storage are enabled, a LoRA request with the same token pages as the base model (or another adapter) will publish the same block_hashes: this virtual root is only a radix-tree sentinel and has no block hash, so ParentBlockHash() returns nullopt for the first real adapter node and published_device_blocks_ deduplicates it as if it were the base-model block. That lets external KV storage/prefix-cache event consumers reuse or suppress blocks across different adapter weights even though the in-memory cache is namespaced.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

@qywu qywu May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d3c2bd7getOrCreateLoraRoot now seeds the virtual root's block_hashes_ with HashKvBlock(sentinel=[-lora_id, 0, ...], nullopt). The sentinel tokens are per-adapter, so descendants' ParentBlockHash() returns a per-adapter hash, and the HashKvBlock(child, parent) chain diverges across namespaces. published_device_blocks_ and any external KV-event/L3 consumer no longer dedups adapter blocks against the base model or other adapters.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 85165df042

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +52 to 53
auto match = kv_prefix_cache_.Match(token_ids, lora_id, intent);
augmentMatch(match);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve LoRA namespace depth before hybrid augmentation

When a LoRA request goes through the hybrid cache with Mamba enabled, this namespaced MatchResult carries namespace_depth_offset = 1; augmentMatch() can then replace match.device.last_node/host.last_node with the real radix root when no Mamba node exists, but it leaves that offset in place. The next scheduler code calls DepthInPage() on the result, so a no-hit LoRA prefix becomes 0 - 1 == -1, which inflates unscheduled and can produce a negative prefill window start for first chunks. Keep the last node at the virtual root or clear/adjust the namespace offset when truncating the match to the real root.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 156b86eaugmentMatch now clears namespace_depth_offset on the else branches that reassign last_node = root (both the legacy single-tier path and the Mamba-L2 device/host branches). Mamba-found paths still keep the offset since last_node lands on a descendant of the virtual root where DepthInPage(page_size) already includes the sentinel page. DepthInPage() returns 0 for no-hit LoRA matches now, not -1.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f8e95a7485

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

using InvalidTransitionHandler<SchedulePrefillEvent>::operator();
SchedulePrefillEvent(std::int32_t tokens_this_round, std::int32_t reserve_num_tokens_in_next_schedule_event,
HybridPrefixCache* hybrid_prefix_cache = nullptr)
HybridPrefixCache* hybrid_prefix_cache = nullptr, std::int32_t lora_id = kLoraNone)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Propagate LoRA ids into continuation events

Fresh evidence in the current tree is that Scheduler::schedulePrefill and Scheduler::scheduleDecode still construct these events with only the hybrid-cache argument (operations/forward.cpp:195-222), so this default leaves lora_id_ as kLoraNone. For multi-chunk LoRA requests when the hybrid cache is active, the later InsertHybridCache(..., lora_id_) calls therefore insert prefill/decode continuation pages into the base namespace, allowing base-model or other-adapter requests with the same tokens to reuse KV pages produced under the adapter weights.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 156b86eScheduler::schedulePrefill and Scheduler::scheduleDecode now construct SchedulePrefillEvent/ScheduleDecodeEvent with request->LoraId(). The default kLoraNone on the event constructors is preserved as a safety net for non-LoRA call sites, but all real call sites are now explicit. Multi-chunk LoRA continuations insert into the correct adapter namespace.

// Evict device and host pages. OwnedPages RAII returns them to the allocator.
// EvictSubtree skips nodes whose resource is currently locked by an in-flight
// request; their OnDevice()/OnHost() flag stays true.
device_.EvictSubtree(descendants);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Publish removals when evicting adapter namespaces

When KV cache events are enabled, unloading an adapter evicts all of its device pages through this direct EvictSubtree call, but this path never calls recordDeviceBlockRemoved as the normal eviction/release paths do. External KV-event consumers will not receive BlockRemoved for the adapter's blocks, and published_device_blocks_ keeps those hashes, so reloading the same adapter/prefix can also suppress the next BlockStored event.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 156b86eEvictLoraNamespace now snapshots device-evictable descendants before EvictSubtree and calls recordDeviceBlockRemoved for each one. External KV-event consumers receive the BlockRemoved notifications and published_device_blocks_ releases the hashes, so reloading the same adapter prefix isn't suppressed.

qywu added 2 commits May 27, 2026 06:31
Without this, a LoRA request with the same token pages as the base
model (or another adapter) publishes identical block_hashes. The
virtual root's block_hashes_ is empty by default, so BlockHash()
returns nullopt for descendants' ParentBlockHash(), and the chain
HashKvBlock(child_tokens, parent=nullopt) is independent of lora_id.
External KV-event consumers (and the internal published_device_blocks_
dedup set) then suppress or reuse blocks across different adapter
weights.

Fix: at virtual-root creation, seed block_hashes_ with
HashKvBlock(sentinel=[-lora_id, 0, ...], nullopt). Sentinel tokens are
already per-adapter, so the seed hash is per-adapter; descendants
inherit the namespace boundary via the normal ParentBlockHash chain.

Addresses the final P1 review comment on lightseekorg#268.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…g#268 review

1. hybrid_prefix_cache::augmentMatch — when a LoRA match has no Mamba
   node and gets truncated to the real radix root, also clear
   namespace_depth_offset. Without this DepthInPage() returns
   root_depth(0) - offset(1) = -1, inflating 'unscheduled' and pushing
   the prefill window start negative.

2. forward.cpp::schedulePrefill/scheduleDecode — construct
   SchedulePrefillEvent/ScheduleDecodeEvent with request->LoraId() so
   the continuation InsertHybridCache(..., lora_id_) calls insert
   prefill/decode pages into the adapter namespace. The previous
   default of kLoraNone allowed adapter KV pages to be inserted under
   the base namespace and reused across adapters/base.

3. EvictLoraNamespace — emit recordDeviceBlockRemoved for each
   device-evictable descendant before EvictSubtree detaches it.
   Otherwise kv_event_sink_ consumers miss the BlockRemoved
   notifications and published_device_blocks_ keeps the stale hashes,
   suppressing the next BlockStored for the same blocks on reload.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 156b86e936

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +75 to +76
const std::vector<std::string>& page_hashs = {}, TreeNode* start_node = nullptr,
std::int32_t lora_id = kLoraNone);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Require lora_id for host cache inserts

Because this overload defaults lora_id to kLoraNone, the existing host-prefetch completion path still compiles without passing the request namespace (Scheduler::handleEvent(cache::PrefetchDone) inserts at tokenspeed-scheduler/csrc/scheduler/outside_event_handler.cpp:75). For a LoRA request with L3/host prefetch hits, those pages are written into the base namespace; the later LoRA match misses its prefetched pages, and base/other-adapter requests with the same tokens can load adapter KV from host. Please make the namespace explicit or update that prefetch insert to pass req->LoraId().

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f8098148cc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

MatchResult HybridPrefixCache::Match(const token_vec_t& token_ids, MatchIntent intent) {
auto match = kv_prefix_cache_.Match(token_ids, intent);
MatchResult HybridPrefixCache::Match(const token_vec_t& token_ids, std::int32_t lora_id, MatchIntent intent) {
auto match = kv_prefix_cache_.Match(token_ids, lora_id, intent);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Clear LoRA offsets before paged-cache root caps

When lora_id is non-zero and the paged-cache adjunct is enabled, this namespaced match carries namespace_depth_offset = 1 into augmentMatchPagedCache(). On a cold adapter namespace (or any path without a complete paged-cache snapshot), that helper caps last_node back to the real radix root but does not clear the offset, so subsequent DepthInPage() calls return 0 - 1 and the scheduler computes an inflated/negative prefill window for the LoRA request. Please make the paged-cache augmentation namespace-aware or clear the offset when it leaves the virtual root.

Useful? React with 👍 / 👎.

@qywu qywu requested a review from wangbo981016 June 1, 2026 04:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant