feat(scheduler): per-adapter KV prefix-cache namespace + max_loras batch cap#268
feat(scheduler): per-adapter KV prefix-cache namespace + max_loras batch cap#268qywu wants to merge 19 commits into
Conversation
…tch cap ### 1 — Per-adapter KV prefix-cache namespace isolation Each LoRA adapter gets its own virtual root in the KV radix tree so prefix reuse is isolated per adapter. Tokens cached under adapter A are never served to adapter B or the base model. - RequestSpec / Request gain a lora_id field (0 = kLoraNone = base model) - FSM forward events carry lora_id_ through prefill/decode scheduling - KVPrefixCache adds lora_virtual_roots_: unordered_map<int32_t, TreeNode*> - getOrCreateLoraRoot inserts a sentinel child keyed by [-lora_id, 0, ...] (negative first token, never collides with real vocabulary) - Match/Insert accept lora_id; resolveStartNode scopes the radix walk; depth adjusted by namespace_depth_offset=1 to hide the sentinel page - HybridPrefixCache::Match forwards lora_id; InsertHybridCache propagates it - Scheduler::EvictLoraNamespace: DFS-collects and frees all KV pages under an adapter's virtual root when the adapter is unloaded ### 2 — max_loras adapter cap in batch building SchedulerConfig gains max_loras (0 = disabled). newForwardOperation() tracks batch_lora_ids and defers requests that would exceed the cap, preventing silent wrong-output and RuntimeError crashes. Python wiring: make_config() / event_loop.py pass server_args.max_loras; python_module.cpp exposes max_loras on SchedulerConfig. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…refix_cache_adjunct_spec These were accidentally removed when max_loras was added to make_config. event_loop.py still calls pool_to_prefix_cache_adjunct_spec() at line 311 and passes prefix_cache_adjunct= to make_config() — both would crash at runtime with the symbols missing. Changes: - scheduler_utils.py: restore PrefixCacheAdjunctSpec + PagedCacheGroupFamily imports via guarded try/except (pre-installed CI binary may not have them); restore prefix_cache_adjunct param in make_config(); restore family field in pool_to_paged_cache_groups(); restore pool_to_prefix_cache_adjunct_spec() with a clear RuntimeError when binary predates the feature - event_loop.py: restore pool_to_prefix_cache_adjunct_spec import - tokenspeed_scheduler/__init__.py: restore PagedCacheGroupFamily and PrefixCacheAdjunctSpec exports via same guarded try/except pattern Signed-off-by: Qingyang Wu <willqywu@gmail.com>
… prefix_cache_adjunct Accidentally removed during merge conflict resolution. Restore by taking upstream's python_module.cpp and re-applying only our LoRA additions: - max_loras field on SchedulerConfig - lora_id field on RequestSpec Everything else (PagedCacheGroupFamily enum, PrefixCacheAdjunctSpec class, prefix_cache_adjunct on SchedulerConfig, family param on PagedCacheGroupConfig, OwnedPagesCount/BorrowedPagesCount on PagedCacheGroupTable, unordered_map include) is now correctly preserved from upstream. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
… upstream
Upstream changed lru_leaves_ from std::pair<timestamp, TreeNode*> to
std::tuple<timestamp, SeqId, TreeNode*> to ensure deterministic LRU
ordering across TP ranks (pointer values diverge per-process due to ASLR,
causing different ranks to evict different nodes on timestamp ties and
eventually deadlocking NCCL collectives).
Update EvictSubtree's erase call from {it->second, node} to
{it->second, node->SeqId(), node} to match the new tuple key.
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Upstream encapsulated lora_id as private lora_id_ with a LoraId() accessor. Replace request->lora_id() with request->LoraId() in the max_loras batch-cap loop in newForwardOperation(). Signed-off-by: Qingyang Wu <willqywu@gmail.com>
No changes needed — PagedCacheGroupFamily and PrefixCacheAdjunctSpec are properly exported by the upstream version as-is. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…er-serving - Clean imports (no try/except guards for PagedCacheGroupFamily/PrefixCacheAdjunctSpec) - Restore upstream ordering in make_config - use server_args.enable_mixed_batch directly - Use list() for required_groups Signed-off-by: Qingyang Wu <willqywu@gmail.com>
- Restore direct pool.paged_cache_group_specs access (no getattr default) - Restore upstream ordering: compute family first, then build kwargs with family=family - Restore [str(gid) for gid in required_group_ids] in pool_to_prefix_cache_adjunct_spec Only truly necessary changes vs upstream remain: make_spec: lora_id param; make_config: max_loras param. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
Signed-off-by: Qingyang Wu <willqywu@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 084c1ee4cf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| LoadLoraReqInput, | ||
| LoadLoraReqOutput, |
There was a problem hiding this comment.
Remove undefined LoRA control types from scheduler imports
SchedulerControlClient now imports LoadLoraReqInput/Output and UnloadLoraReqInput/Output, but those symbols are not defined in python/tokenspeed/runtime/engine/io_struct.py (repo-wide search only finds these names in this client). Importing AsyncLLM (which imports this module) will raise ImportError during startup, so the control plane cannot initialize.
Useful? React with 👍 / 👎.
| enable_mixed_prefill_decode=server_args.enable_mixed_batch, | ||
| enable_mixed_prefill_decode=enable_mixed_prefill_decode, | ||
| prefix_cache_adjunct=prefix_cache_adjunct, | ||
| max_loras=server_args.max_loras if server_args.enable_lora else 0, |
There was a problem hiding this comment.
Add missing ServerArgs fields before reading LoRA config
This line unconditionally reads server_args.enable_lora/server_args.max_loras, but ServerArgs in runtime/utils/server_args.py does not declare either field. In the default startup path, EventLoop construction will fail with AttributeError before scheduler initialization.
Useful? React with 👍 / 👎.
| # Proactively evict the KV cache namespace for this adapter so pages | ||
| # are freed immediately rather than waiting for LRU eviction pressure. | ||
| if lora_id is not None: | ||
| self.scheduler.evict_lora_namespace(lora_id) |
There was a problem hiding this comment.
Expose LoRA namespace eviction on the Python Scheduler binding
unload_lora_adapter() calls self.scheduler.evict_lora_namespace(...), but the nanobind Scheduler class does not define an evict_lora_namespace method (binding currently exposes methods like submit_requests, next_execution_plan, advance, and stats accessors only). The first LoRA unload will raise AttributeError, leaving adapter unload cleanup incomplete.
Useful? React with 👍 / 👎.
| kv_prefix_cache_.Insert<ResourceType::Device>(full_paged_tokens, prefix_pages, std::move(alloc_pages)); | ||
|
|
||
| MatchResult match_result = kv_prefix_cache_.Match(full_paged_tokens, MatchIntent::StateRecovery); | ||
| MatchResult match_result = kv_prefix_cache_.Match(full_paged_tokens, request->LoraId(), MatchIntent::StateRecovery); |
There was a problem hiding this comment.
Keep LoRA namespace consistent when retracting cached pages
During retract, pages are inserted with the default namespace (Insert(..., std::move(alloc_pages))) but immediately matched in request->LoraId() namespace. For LoRA requests this writes KV pages into the base subtree and then looks them up in adapter subtree, so recovery misses its own cached prefix under retraction pressure.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 7b0a3ca — scheduleRetract now passes request->LoraId() to Insert so the cached pages land in the adapter subtree where the subsequent StateRecovery Match looks for them.
| SchedulePrefillEvent(std::int32_t tokens_this_round, std::int32_t reserve_num_tokens_in_next_schedule_event, | ||
| HybridPrefixCache* hybrid_prefix_cache = nullptr) | ||
| HybridPrefixCache* hybrid_prefix_cache = nullptr, std::int32_t lora_id = kLoraNone) |
There was a problem hiding this comment.
Require explicit lora_id in follow-up forward events
Adding a default lora_id = kLoraNone here makes old call sites compile even when they forget to propagate request LoRA identity. In this commit, schedulePrefill/scheduleDecode still construct these events without request->LoraId(), so subsequent cache inserts run in base namespace and can cross-contaminate adapter/base KV reuse.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Both InsertHybridCache call sites in forward_events.cpp (SchedulePrefillEvent, ScheduleDecodeEvent) pass lora_id_ explicitly today. The default kLoraNone is a safety net for non-LoRA callers; happy to drop it in a follow-up to enforce explicit propagation if you prefer.
LoadLoraReqInput/UnloadLoraReqInput are part of the full LoRA serving PR (lightseekorg#83) and don't exist in io_struct.py on this branch, causing an import error that breaks every test importing AsyncLLM. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6b7cab2049
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // entire subtree (including any mamba slots attached to those nodes). | ||
| token_vec_t sentinel(tree_.PageSize(), 0); | ||
| sentinel[0] = -lora_id; | ||
| tree_.Root()->RemoveChild(sentinel); |
There was a problem hiding this comment.
Avoid deleting locked LoRA cache nodes
When an adapter is unloaded while any request in that namespace is still in flight, EvictSubtree skips the locked descendants but this line still removes the virtual root, which destroys the entire subtree and the TreeNodes that active DeviceNodeRef/HostNodeRefs still point at. In that scenario the subsequent request state transitions will dereference dangling tree nodes (and the skipped pages are freed by destruction rather than when the request finishes), so namespace removal needs to wait until locked descendants are gone or abort/drain those requests first.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 7b0a3ca — EvictLoraNamespace now scans descendants after EvictSubtree and defers RemoveChild if any node still has OnDevice()/OnHost() (meaning it was skipped as locked). Locked pages are freed when their requests finish via RAII; a later EvictLoraNamespace call (or natural LRU pressure) completes the cleanup.
| if spec.lora_id != 0: | ||
| self._request_lora_ids[spec.request_id] = spec.lora_id |
There was a problem hiding this comment.
Propagate request LoRA ids before scheduling
This admission path depends on spec.lora_id being non-zero, but the serving path still builds specs with make_spec(rid=recv_req.rid, tokens=recv_req.input_ids) and TokenizedGenerateReqInput does not carry an adapter id, so every API request reaches this branch with lora_id == 0. As a result real adapter requests are submitted to the scheduler as base-model requests: the per-adapter namespace and max_loras cap are bypassed, and _request_lora_ids is never populated for the model executor.
Useful? React with 👍 / 👎.
| if (is_new && static_cast<std::int32_t>(batch_lora_ids.size()) >= config_.max_loras) { | ||
| continue; // defer to next step | ||
| } | ||
| batch_lora_ids.insert(request->LoraId()); |
There was a problem hiding this comment.
Count only scheduled adapters against max_loras
With max_loras enabled, this inserts a request's adapter into batch_lora_ids before any schedule* call succeeds. If that request cannot be scheduled in this round due to request-specific constraints such as paged-cache/mamba admission or capacity, its adapter still consumes the cap and can cause later requests for other adapters to be deferred even though the batch contains no operation for the first adapter; with max_loras=1 this can leave a schedulable request idle or trigger the empty-plan retraction path unnecessarily.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 7b0a3ca — the cap check stays up front, but batch_lora_ids.insert() now only fires when ops.size() actually grew. Failed schedules (capacity/mamba admission) no longer consume a slot.
LoRA serving code (enable_lora checks, _lora_manager, load/unload adapter methods, lora_id request tracking) references ServerArgs fields that don't exist on this branch, crashing the scheduler at startup. Only the max_loras=0 passthrough to make_config() is needed here. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d18470682f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Attach an empty DeviceResource so OnDevice() returns true. | ||
| // This prevents PruneEmptyByNode from removing the virtual root even when | ||
| // all adapter sequences have been evicted. | ||
| raw->AttachResource<ResourceType::Device>(std::make_unique<NodeResource<ResourceType::Device>>(OwnedPages{})); |
There was a problem hiding this comment.
Add host resource to LoRA namespace roots
LoRA virtual roots only get an empty device resource, but HostNodeRef locks every ancestor with GetResource<Host>(c) and several changed paths create host locks for LoRA matches/writebacks. For example, a LoRA request that is retracted or later writes back to L2 can have match.host.last_node at or below this virtual root, and locking it dereferences the missing host resource, crashing instead of recovering/writing back the request.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 7b0a3ca — the virtual root now also gets an empty HostResource. IsEmpty() stays true on both resources, so the root never enters lru_leaves_ but HostNodeRef can safely lock it when match.host.last_node lands on it.
| paged_cache_groups=paged_cache_groups, | ||
| enable_mixed_prefill_decode=server_args.enable_mixed_batch, | ||
| prefix_cache_adjunct=prefix_cache_adjunct, | ||
| max_loras=0, |
There was a problem hiding this comment.
Wire max_loras into the runtime config
The Python serving path always passes max_loras=0, while the scheduler only enforces the new adapter cap when config_.max_loras > 0. In normal EventLoop startup this leaves the max-loras batch cap disabled for every server, so batches can still contain more unique adapters than the model executor's LoRA slots unless the scheduler is instantiated manually outside this path.
Useful? React with 👍 / 👎.
Four issues flagged by codex review on PR lightseekorg#268: 1. scheduleRetract Insert used base namespace while Match used the request's adapter namespace, so retracted LoRA requests wrote pages into the wrong subtree and never recovered their prefix on resume. Pass request->LoraId() through to Insert. 2. LoRA virtual roots had only a Device resource attached, so any host path that locked match.host.last_node when it pointed at the root would dereference a null Host resource. Attach an empty Host resource too; IsEmpty() stays true on both so the root never enters lru_leaves_. 3. EvictLoraNamespace unconditionally removed the virtual root after EvictSubtree, but EvictSubtree skips locked descendants. The unique_ptr cascade then destroyed those still-locked TreeNodes while live NodeRefs pointed at them — a use-after-free. Detect remaining locked descendants and defer root removal until a later call. 4. The max_loras cap counted a request's adapter before scheduling. Failed schedules (capacity, mamba admission, etc.) still consumed a slot, starving other adapters. Check the cap up front but insert into batch_lora_ids only after ops.size() grew. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b0a3ca9e0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const std::int32_t page_size = tree_.PageSize(); | ||
| token_vec_t sentinel(page_size, 0); | ||
| sentinel[0] = -lora_id; | ||
| auto node = std::make_unique<TreeNode>(sentinel, std::chrono::steady_clock::now()); |
There was a problem hiding this comment.
Include the LoRA namespace in published block hashes
When KV cache events or L3 storage are enabled, a LoRA request with the same token pages as the base model (or another adapter) will publish the same block_hashes: this virtual root is only a radix-tree sentinel and has no block hash, so ParentBlockHash() returns nullopt for the first real adapter node and published_device_blocks_ deduplicates it as if it were the base-model block. That lets external KV storage/prefix-cache event consumers reuse or suppress blocks across different adapter weights even though the in-memory cache is namespaced.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in d3c2bd7 — getOrCreateLoraRoot now seeds the virtual root's block_hashes_ with HashKvBlock(sentinel=[-lora_id, 0, ...], nullopt). The sentinel tokens are per-adapter, so descendants' ParentBlockHash() returns a per-adapter hash, and the HashKvBlock(child, parent) chain diverges across namespaces. published_device_blocks_ and any external KV-event/L3 consumer no longer dedups adapter blocks against the base model or other adapters.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 85165df042
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| auto match = kv_prefix_cache_.Match(token_ids, lora_id, intent); | ||
| augmentMatch(match); |
There was a problem hiding this comment.
Preserve LoRA namespace depth before hybrid augmentation
When a LoRA request goes through the hybrid cache with Mamba enabled, this namespaced MatchResult carries namespace_depth_offset = 1; augmentMatch() can then replace match.device.last_node/host.last_node with the real radix root when no Mamba node exists, but it leaves that offset in place. The next scheduler code calls DepthInPage() on the result, so a no-hit LoRA prefix becomes 0 - 1 == -1, which inflates unscheduled and can produce a negative prefill window start for first chunks. Keep the last node at the virtual root or clear/adjust the namespace offset when truncating the match to the real root.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 156b86e — augmentMatch now clears namespace_depth_offset on the else branches that reassign last_node = root (both the legacy single-tier path and the Mamba-L2 device/host branches). Mamba-found paths still keep the offset since last_node lands on a descendant of the virtual root where DepthInPage(page_size) already includes the sentinel page. DepthInPage() returns 0 for no-hit LoRA matches now, not -1.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f8e95a7485
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| using InvalidTransitionHandler<SchedulePrefillEvent>::operator(); | ||
| SchedulePrefillEvent(std::int32_t tokens_this_round, std::int32_t reserve_num_tokens_in_next_schedule_event, | ||
| HybridPrefixCache* hybrid_prefix_cache = nullptr) | ||
| HybridPrefixCache* hybrid_prefix_cache = nullptr, std::int32_t lora_id = kLoraNone) |
There was a problem hiding this comment.
Propagate LoRA ids into continuation events
Fresh evidence in the current tree is that Scheduler::schedulePrefill and Scheduler::scheduleDecode still construct these events with only the hybrid-cache argument (operations/forward.cpp:195-222), so this default leaves lora_id_ as kLoraNone. For multi-chunk LoRA requests when the hybrid cache is active, the later InsertHybridCache(..., lora_id_) calls therefore insert prefill/decode continuation pages into the base namespace, allowing base-model or other-adapter requests with the same tokens to reuse KV pages produced under the adapter weights.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 156b86e — Scheduler::schedulePrefill and Scheduler::scheduleDecode now construct SchedulePrefillEvent/ScheduleDecodeEvent with request->LoraId(). The default kLoraNone on the event constructors is preserved as a safety net for non-LoRA call sites, but all real call sites are now explicit. Multi-chunk LoRA continuations insert into the correct adapter namespace.
| // Evict device and host pages. OwnedPages RAII returns them to the allocator. | ||
| // EvictSubtree skips nodes whose resource is currently locked by an in-flight | ||
| // request; their OnDevice()/OnHost() flag stays true. | ||
| device_.EvictSubtree(descendants); |
There was a problem hiding this comment.
Publish removals when evicting adapter namespaces
When KV cache events are enabled, unloading an adapter evicts all of its device pages through this direct EvictSubtree call, but this path never calls recordDeviceBlockRemoved as the normal eviction/release paths do. External KV-event consumers will not receive BlockRemoved for the adapter's blocks, and published_device_blocks_ keeps those hashes, so reloading the same adapter/prefix can also suppress the next BlockStored event.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 156b86e — EvictLoraNamespace now snapshots device-evictable descendants before EvictSubtree and calls recordDeviceBlockRemoved for each one. External KV-event consumers receive the BlockRemoved notifications and published_device_blocks_ releases the hashes, so reloading the same adapter prefix isn't suppressed.
Without this, a LoRA request with the same token pages as the base model (or another adapter) publishes identical block_hashes. The virtual root's block_hashes_ is empty by default, so BlockHash() returns nullopt for descendants' ParentBlockHash(), and the chain HashKvBlock(child_tokens, parent=nullopt) is independent of lora_id. External KV-event consumers (and the internal published_device_blocks_ dedup set) then suppress or reuse blocks across different adapter weights. Fix: at virtual-root creation, seed block_hashes_ with HashKvBlock(sentinel=[-lora_id, 0, ...], nullopt). Sentinel tokens are already per-adapter, so the seed hash is per-adapter; descendants inherit the namespace boundary via the normal ParentBlockHash chain. Addresses the final P1 review comment on lightseekorg#268. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
…g#268 review 1. hybrid_prefix_cache::augmentMatch — when a LoRA match has no Mamba node and gets truncated to the real radix root, also clear namespace_depth_offset. Without this DepthInPage() returns root_depth(0) - offset(1) = -1, inflating 'unscheduled' and pushing the prefill window start negative. 2. forward.cpp::schedulePrefill/scheduleDecode — construct SchedulePrefillEvent/ScheduleDecodeEvent with request->LoraId() so the continuation InsertHybridCache(..., lora_id_) calls insert prefill/decode pages into the adapter namespace. The previous default of kLoraNone allowed adapter KV pages to be inserted under the base namespace and reused across adapters/base. 3. EvictLoraNamespace — emit recordDeviceBlockRemoved for each device-evictable descendant before EvictSubtree detaches it. Otherwise kv_event_sink_ consumers miss the BlockRemoved notifications and published_device_blocks_ keeps the stale hashes, suppressing the next BlockStored for the same blocks on reload. Signed-off-by: Qingyang Wu <willqywu@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 156b86e936
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const std::vector<std::string>& page_hashs = {}, TreeNode* start_node = nullptr, | ||
| std::int32_t lora_id = kLoraNone); |
There was a problem hiding this comment.
Require lora_id for host cache inserts
Because this overload defaults lora_id to kLoraNone, the existing host-prefetch completion path still compiles without passing the request namespace (Scheduler::handleEvent(cache::PrefetchDone) inserts at tokenspeed-scheduler/csrc/scheduler/outside_event_handler.cpp:75). For a LoRA request with L3/host prefetch hits, those pages are written into the base namespace; the later LoRA match misses its prefetched pages, and base/other-adapter requests with the same tokens can load adapter KV from host. Please make the namespace explicit or update that prefetch insert to pass req->LoraId().
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f8098148cc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| MatchResult HybridPrefixCache::Match(const token_vec_t& token_ids, MatchIntent intent) { | ||
| auto match = kv_prefix_cache_.Match(token_ids, intent); | ||
| MatchResult HybridPrefixCache::Match(const token_vec_t& token_ids, std::int32_t lora_id, MatchIntent intent) { | ||
| auto match = kv_prefix_cache_.Match(token_ids, lora_id, intent); |
There was a problem hiding this comment.
Clear LoRA offsets before paged-cache root caps
When lora_id is non-zero and the paged-cache adjunct is enabled, this namespaced match carries namespace_depth_offset = 1 into augmentMatchPagedCache(). On a cold adapter namespace (or any path without a complete paged-cache snapshot), that helper caps last_node back to the real radix root but does not clear the offset, so subsequent DepthInPage() calls return 0 - 1 and the scheduler computes an inflated/negative prefill window for the LoRA request. Please make the paged-cache augmentation namespace-aware or clear the offset when it leaves the virtual root.
Useful? React with 👍 / 👎.
Summary
Two scheduler-level changes that make the tokenspeed scheduler LoRA-aware. Extracted from the broader LoRA serving PR (#83) as a self-contained C++ contribution.
1 — Per-adapter KV prefix-cache namespace isolation
Problem: Without isolation, tokens cached under adapter A could be incorrectly reused by adapter B or the base model, producing silent wrong outputs.
Solution: Each LoRA adapter gets its own virtual root node in the radix tree, scoping all prefix cache lookups and insertions per adapter.
How it works:
KVPrefixCache::MatchandInsertnow accept alora_idparameter (kLoraNone = 0= base model).resolveStartNode(lora_id)returns (or lazily creates) a per-adapter virtual root keyed by a sentinel page[-lora_id, 0, …, 0]. Negative first tokens are outside all real vocabulary ranges, so there is no collision with actual sequences.DeviceResourcesoPruneEmptyByNodenever removes it while an adapter is loaded.namespace_depth_offset = 1is applied to match results so callers see real page counts, not counts including the sentinel page.KVPrefixCache::EvictLoraNamespace(lora_id)DFS-collects all descendants of the virtual root, evicts their device and host pages viaEvictSubtree, and removes the root from the tree. This is called fromScheduler::EvictLoraNamespaceon adapter unload, immediately freeing KV memory instead of waiting for LRU pressure. Locked (in-flight) pages are skipped and freed when those requests finish.lora_idis propagated throughHybridPrefixCache::Matchoverloads andInsertHybridCacheso the hybrid L2/L3 path is also namespace-aware.Data model changes:
RequestSpec::lora_id(int32, default 0) — set when submitting a request.Request::lora_id_(private) /Request::LoraId()(public accessor) — carried through the FSM.KVPrefixCache::lora_virtual_roots_—unordered_map<int32, TreeNode*>of per-adapter roots.2 —
max_lorasadapter cap in batch buildingProblem: If a batch contains more unique LoRA adapters than GPU slots,
prepare_loras()either crashes with aRuntimeErroror silently runs an evicted adapter as the base model (wrong output, no error).Solution: The scheduler enforces a hard cap on unique adapter IDs per forward pass.
How it works:
SchedulerConfiggainsmax_loras(int32, default0= LoRA disabled / no cap enforced).newForwardOperation()maintains abatch_lora_idsunordered set. Before scheduling each request, ifmax_loras > 0and the request'slora_idwould be a new entry that pushesbatch_lora_ids.size()pastmax_loras, the request is skipped (deferred to the next scheduling round) rather than added to the batch.lora_id == kLoraNone) are never deferred — the cap only applies to adapter traffic.prepare_loras()sees at mostmax_lorasunique adapter IDs per step.Files changed
csrc/resource/kv_prefix_cache/kv_prefix_cache.{h,cpp}Match/Insertwithlora_id,getOrCreateLoraRoot,resolveStartNode,EvictLoraNamespacecsrc/resource/kv_prefix_cache/eviction.hEvictSubtreetemplate method for bulk page evictioncsrc/resource/hybrid_prefix_cache/hybrid_prefix_cache.{h,cpp}lora_idthroughMatchoverloadscsrc/fsm/forward_events.{h,cpp}lora_idparam toInsertHybridCache; propagate into KV insert pathcsrc/scheduler/request.{h,cpp},request_spec.hlora_idfield to spec and request; exposeLoraId()accessorcsrc/scheduler/scheduler.{h,cpp}EvictLoraNamespace(lora_id)delegationcsrc/scheduler/operations/forward.cppbatch_lora_idsset +max_lorasdeferral logic; passLoraId()to allMatchcallscsrc/scheduler/types.hmax_lorastoSchedulerConfigbindings/python_module.cppmax_loras(onSchedulerConfig) andlora_id(onRequestSpec) to Pythonpython/…/scheduler_utils.pymake_spec(lora_id=0)andmake_config(max_loras=0)python/…/event_loop.pymax_lorasfromserver_argsintomake_configTest plan
max_loras=1with two concurrent adapters: second adapter's requests are deferred, not batched (no wrong output, no crash)EvictLoraNamespace: after unloading an adapter, its KV pages are freed and its virtual root is removed from the tree