feat(scheduler): per-adapter KV prefix-cache namespace + max_loras batch cap by qywu · Pull Request #268 · lightseekorg/tokenspeed

qywu · 2026-05-26T20:27:42Z

Summary

Two scheduler-level changes that make the tokenspeed scheduler LoRA-aware. Extracted from the broader LoRA serving PR (#83) as a self-contained C++ contribution.

1 — Per-adapter KV prefix-cache namespace isolation

Problem: Without isolation, tokens cached under adapter A could be incorrectly reused by adapter B or the base model, producing silent wrong outputs.

Solution: Each LoRA adapter gets its own virtual root node in the radix tree, scoping all prefix cache lookups and insertions per adapter.

How it works:

KVPrefixCache::Match and Insert now accept a lora_id parameter (kLoraNone = 0 = base model).
For base model requests, the real tree root is used as before (no behaviour change).
For adapter requests, resolveStartNode(lora_id) returns (or lazily creates) a per-adapter virtual root keyed by a sentinel page [-lora_id, 0, …, 0]. Negative first tokens are outside all real vocabulary ranges, so there is no collision with actual sequences.
The virtual root holds an empty DeviceResource so PruneEmptyByNode never removes it while an adapter is loaded.
namespace_depth_offset = 1 is applied to match results so callers see real page counts, not counts including the sentinel page.
KVPrefixCache::EvictLoraNamespace(lora_id) DFS-collects all descendants of the virtual root, evicts their device and host pages via EvictSubtree, and removes the root from the tree. This is called from Scheduler::EvictLoraNamespace on adapter unload, immediately freeing KV memory instead of waiting for LRU pressure. Locked (in-flight) pages are skipped and freed when those requests finish.
lora_id is propagated through HybridPrefixCache::Match overloads and InsertHybridCache so the hybrid L2/L3 path is also namespace-aware.

Data model changes:

RequestSpec::lora_id (int32, default 0) — set when submitting a request.
Request::lora_id_ (private) / Request::LoraId() (public accessor) — carried through the FSM.
KVPrefixCache::lora_virtual_roots_ — unordered_map<int32, TreeNode*> of per-adapter roots.

2 — `max_loras` adapter cap in batch building

Problem: If a batch contains more unique LoRA adapters than GPU slots, prepare_loras() either crashes with a RuntimeError or silently runs an evicted adapter as the base model (wrong output, no error).

Solution: The scheduler enforces a hard cap on unique adapter IDs per forward pass.

How it works:

SchedulerConfig gains max_loras (int32, default 0 = LoRA disabled / no cap enforced).
newForwardOperation() maintains a batch_lora_ids unordered set. Before scheduling each request, if max_loras > 0 and the request's lora_id would be a new entry that pushes batch_lora_ids.size() past max_loras, the request is skipped (deferred to the next scheduling round) rather than added to the batch.
Base-model requests (lora_id == kLoraNone) are never deferred — the cap only applies to adapter traffic.
This guarantees prepare_loras() sees at most max_loras unique adapter IDs per step.

Files changed

File	Change
`csrc/resource/kv_prefix_cache/kv_prefix_cache.{h,cpp}`	Core namespace: `Match`/`Insert` with `lora_id`, `getOrCreateLoraRoot`, `resolveStartNode`, `EvictLoraNamespace`
`csrc/resource/kv_prefix_cache/eviction.h`	`EvictSubtree` template method for bulk page eviction
`csrc/resource/hybrid_prefix_cache/hybrid_prefix_cache.{h,cpp}`	Forward `lora_id` through `Match` overloads
`csrc/fsm/forward_events.{h,cpp}`	Add `lora_id` param to `InsertHybridCache`; propagate into KV insert path
`csrc/scheduler/request.{h,cpp}`, `request_spec.h`	Add `lora_id` field to spec and request; expose `LoraId()` accessor
`csrc/scheduler/scheduler.{h,cpp}`	Add `EvictLoraNamespace(lora_id)` delegation
`csrc/scheduler/operations/forward.cpp`	`batch_lora_ids` set + `max_loras` deferral logic; pass `LoraId()` to all `Match` calls
`csrc/scheduler/types.h`	Add `max_loras` to `SchedulerConfig`
`bindings/python_module.cpp`	Expose `max_loras` (on `SchedulerConfig`) and `lora_id` (on `RequestSpec`) to Python
`python/…/scheduler_utils.py`	`make_spec(lora_id=0)` and `make_config(max_loras=0)`
`python/…/event_loop.py`	Wire `max_loras` from `server_args` into `make_config`

Test plan

Scheduler C++ unit tests build and pass
Scheduler Python unit tests pass
Prefix-cache namespace: verify adapter A's cached pages are not returned on a cache lookup for adapter B with the same token sequence
max_loras=1 with two concurrent adapters: second adapter's requests are deferred, not batched (no wrong output, no crash)
EvictLoraNamespace: after unloading an adapter, its KV pages are freed and its virtual root is removed from the tree

…tch cap ### 1 — Per-adapter KV prefix-cache namespace isolation Each LoRA adapter gets its own virtual root in the KV radix tree so prefix reuse is isolated per adapter. Tokens cached under adapter A are never served to adapter B or the base model. - RequestSpec / Request gain a lora_id field (0 = kLoraNone = base model) - FSM forward events carry lora_id_ through prefill/decode scheduling - KVPrefixCache adds lora_virtual_roots_: unordered_map<int32_t, TreeNode*> - getOrCreateLoraRoot inserts a sentinel child keyed by [-lora_id, 0, ...] (negative first token, never collides with real vocabulary) - Match/Insert accept lora_id; resolveStartNode scopes the radix walk; depth adjusted by namespace_depth_offset=1 to hide the sentinel page - HybridPrefixCache::Match forwards lora_id; InsertHybridCache propagates it - Scheduler::EvictLoraNamespace: DFS-collects and frees all KV pages under an adapter's virtual root when the adapter is unloaded ### 2 — max_loras adapter cap in batch building SchedulerConfig gains max_loras (0 = disabled). newForwardOperation() tracks batch_lora_ids and defers requests that would exceed the cap, preventing silent wrong-output and RuntimeError crashes. Python wiring: make_config() / event_loop.py pass server_args.max_loras; python_module.cpp exposes max_loras on SchedulerConfig. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…refix_cache_adjunct_spec These were accidentally removed when max_loras was added to make_config. event_loop.py still calls pool_to_prefix_cache_adjunct_spec() at line 311 and passes prefix_cache_adjunct= to make_config() — both would crash at runtime with the symbols missing. Changes: - scheduler_utils.py: restore PrefixCacheAdjunctSpec + PagedCacheGroupFamily imports via guarded try/except (pre-installed CI binary may not have them); restore prefix_cache_adjunct param in make_config(); restore family field in pool_to_paged_cache_groups(); restore pool_to_prefix_cache_adjunct_spec() with a clear RuntimeError when binary predates the feature - event_loop.py: restore pool_to_prefix_cache_adjunct_spec import - tokenspeed_scheduler/__init__.py: restore PagedCacheGroupFamily and PrefixCacheAdjunctSpec exports via same guarded try/except pattern Signed-off-by: Qingyang Wu <willqywu@gmail.com>

… prefix_cache_adjunct Accidentally removed during merge conflict resolution. Restore by taking upstream's python_module.cpp and re-applying only our LoRA additions: - max_loras field on SchedulerConfig - lora_id field on RequestSpec Everything else (PagedCacheGroupFamily enum, PrefixCacheAdjunctSpec class, prefix_cache_adjunct on SchedulerConfig, family param on PagedCacheGroupConfig, OwnedPagesCount/BorrowedPagesCount on PagedCacheGroupTable, unordered_map include) is now correctly preserved from upstream. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

… upstream Upstream changed lru_leaves_ from std::pair<timestamp, TreeNode*> to std::tuple<timestamp, SeqId, TreeNode*> to ensure deterministic LRU ordering across TP ranks (pointer values diverge per-process due to ASLR, causing different ranks to evict different nodes on timestamp ties and eventually deadlocking NCCL collectives). Update EvictSubtree's erase call from {it->second, node} to {it->second, node->SeqId(), node} to match the new tuple key. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Upstream encapsulated lora_id as private lora_id_ with a LoraId() accessor. Replace request->lora_id() with request->LoraId() in the max_loras batch-cap loop in newForwardOperation(). Signed-off-by: Qingyang Wu <willqywu@gmail.com>

No changes needed — PagedCacheGroupFamily and PrefixCacheAdjunctSpec are properly exported by the upstream version as-is. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…er-serving - Clean imports (no try/except guards for PagedCacheGroupFamily/PrefixCacheAdjunctSpec) - Restore upstream ordering in make_config - use server_args.enable_mixed_batch directly - Use list() for required_groups Signed-off-by: Qingyang Wu <willqywu@gmail.com>

- Restore direct pool.paged_cache_group_specs access (no getattr default) - Restore upstream ordering: compute family first, then build kwargs with family=family - Restore [str(gid) for gid in required_group_ids] in pool_to_prefix_cache_adjunct_spec Only truly necessary changes vs upstream remain: make_spec: lora_id param; make_config: max_loras param. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 084c1ee4cf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-27T00:56:04Z

+    LoadLoraReqInput,
+    LoadLoraReqOutput,


Remove undefined LoRA control types from scheduler imports

SchedulerControlClient now imports LoadLoraReqInput/Output and UnloadLoraReqInput/Output, but those symbols are not defined in python/tokenspeed/runtime/engine/io_struct.py (repo-wide search only finds these names in this client). Importing AsyncLLM (which imports this module) will raise ImportError during startup, so the control plane cannot initialize.

Useful? React with 👍 / 👎.

Fixed in 6b7cab2 — reverted to upstream. These LoRA control imports belong in #83 alongside their io_struct.py definitions.

chatgpt-codex-connector · 2026-05-27T00:56:04Z

-            enable_mixed_prefill_decode=server_args.enable_mixed_batch,
+            enable_mixed_prefill_decode=enable_mixed_prefill_decode,
            prefix_cache_adjunct=prefix_cache_adjunct,
+            max_loras=server_args.max_loras if server_args.enable_lora else 0,


Add missing ServerArgs fields before reading LoRA config

This line unconditionally reads server_args.enable_lora/server_args.max_loras, but ServerArgs in runtime/utils/server_args.py does not declare either field. In the default startup path, EventLoop construction will fail with AttributeError before scheduler initialization.

Useful? React with 👍 / 👎.

Fixed in d184706 — reverted event_loop.py to upstream and pass max_loras=0. The enable_lora/max_loras ServerArgs fields live in #83 (full LoRA serving).

chatgpt-codex-connector · 2026-05-27T00:56:04Z

+        # Proactively evict the KV cache namespace for this adapter so pages
+        # are freed immediately rather than waiting for LRU eviction pressure.
+        if lora_id is not None:
+            self.scheduler.evict_lora_namespace(lora_id)


Expose LoRA namespace eviction on the Python Scheduler binding

unload_lora_adapter() calls self.scheduler.evict_lora_namespace(...), but the nanobind Scheduler class does not define an evict_lora_namespace method (binding currently exposes methods like submit_requests, next_execution_plan, advance, and stats accessors only). The first LoRA unload will raise AttributeError, leaving adapter unload cleanup incomplete.

Useful? React with 👍 / 👎.

Addressed in d184706 — unload_lora_adapter was reverted from this branch. The Python binding for evict_lora_namespace lives in #83 alongside its only caller.

chatgpt-codex-connector · 2026-05-27T00:56:04Z

    kv_prefix_cache_.Insert<ResourceType::Device>(full_paged_tokens, prefix_pages, std::move(alloc_pages));

-    MatchResult match_result = kv_prefix_cache_.Match(full_paged_tokens, MatchIntent::StateRecovery);
+    MatchResult match_result = kv_prefix_cache_.Match(full_paged_tokens, request->LoraId(), MatchIntent::StateRecovery);


Keep LoRA namespace consistent when retracting cached pages

During retract, pages are inserted with the default namespace (Insert(..., std::move(alloc_pages))) but immediately matched in request->LoraId() namespace. For LoRA requests this writes KV pages into the base subtree and then looks them up in adapter subtree, so recovery misses its own cached prefix under retraction pressure.

Useful? React with 👍 / 👎.

Fixed in 7b0a3ca — scheduleRetract now passes request->LoraId() to Insert so the cached pages land in the adapter subtree where the subsequent StateRecovery Match looks for them.

chatgpt-codex-connector · 2026-05-27T00:56:05Z

    SchedulePrefillEvent(std::int32_t tokens_this_round, std::int32_t reserve_num_tokens_in_next_schedule_event,
-                         HybridPrefixCache* hybrid_prefix_cache = nullptr)
+                         HybridPrefixCache* hybrid_prefix_cache = nullptr, std::int32_t lora_id = kLoraNone)


Require explicit lora_id in follow-up forward events

Adding a default lora_id = kLoraNone here makes old call sites compile even when they forget to propagate request LoRA identity. In this commit, schedulePrefill/scheduleDecode still construct these events without request->LoraId(), so subsequent cache inserts run in base namespace and can cross-contaminate adapter/base KV reuse.

Useful? React with 👍 / 👎.

Both InsertHybridCache call sites in forward_events.cpp (SchedulePrefillEvent, ScheduleDecodeEvent) pass lora_id_ explicitly today. The default kLoraNone is a safety net for non-LoRA callers; happy to drop it in a follow-up to enforce explicit propagation if you prefer.

LoadLoraReqInput/UnloadLoraReqInput are part of the full LoRA serving PR (lightseekorg#83) and don't exist in io_struct.py on this branch, causing an import error that breaks every test importing AsyncLLM. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6b7cab2049

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-27T02:26:05Z

+    // entire subtree (including any mamba slots attached to those nodes).
+    token_vec_t sentinel(tree_.PageSize(), 0);
+    sentinel[0] = -lora_id;
+    tree_.Root()->RemoveChild(sentinel);


Avoid deleting locked LoRA cache nodes

When an adapter is unloaded while any request in that namespace is still in flight, EvictSubtree skips the locked descendants but this line still removes the virtual root, which destroys the entire subtree and the TreeNodes that active DeviceNodeRef/HostNodeRefs still point at. In that scenario the subsequent request state transitions will dereference dangling tree nodes (and the skipped pages are freed by destruction rather than when the request finishes), so namespace removal needs to wait until locked descendants are gone or abort/drain those requests first.

Useful? React with 👍 / 👎.

Fixed in 7b0a3ca — EvictLoraNamespace now scans descendants after EvictSubtree and defers RemoveChild if any node still has OnDevice()/OnHost() (meaning it was skipped as locked). Locked pages are freed when their requests finish via RAII; a later EvictLoraNamespace call (or natural LRU pressure) completes the cleanup.

chatgpt-codex-connector · 2026-05-27T02:26:05Z

+            if spec.lora_id != 0:
+                self._request_lora_ids[spec.request_id] = spec.lora_id


Propagate request LoRA ids before scheduling

This admission path depends on spec.lora_id being non-zero, but the serving path still builds specs with make_spec(rid=recv_req.rid, tokens=recv_req.input_ids) and TokenizedGenerateReqInput does not carry an adapter id, so every API request reaches this branch with lora_id == 0. As a result real adapter requests are submitted to the scheduler as base-model requests: the per-adapter namespace and max_loras cap are bypassed, and _request_lora_ids is never populated for the model executor.

Useful? React with 👍 / 👎.

Addressed in d184706 — this admission path was reverted from this branch. Per-request lora_id wiring (admission, TokenizedGenerateReqInput, prefetch, pack policy) all lives in #83.

chatgpt-codex-connector · 2026-05-27T02:26:05Z

+            if (is_new && static_cast<std::int32_t>(batch_lora_ids.size()) >= config_.max_loras) {
+                continue;  // defer to next step
+            }
+            batch_lora_ids.insert(request->LoraId());


Count only scheduled adapters against max_loras

With max_loras enabled, this inserts a request's adapter into batch_lora_ids before any schedule* call succeeds. If that request cannot be scheduled in this round due to request-specific constraints such as paged-cache/mamba admission or capacity, its adapter still consumes the cap and can cause later requests for other adapters to be deferred even though the batch contains no operation for the first adapter; with max_loras=1 this can leave a schedulable request idle or trigger the empty-plan retraction path unnecessarily.

Useful? React with 👍 / 👎.

Fixed in 7b0a3ca — the cap check stays up front, but batch_lora_ids.insert() now only fires when ops.size() actually grew. Failed schedules (capacity/mamba admission) no longer consume a slot.

LoRA serving code (enable_lora checks, _lora_manager, load/unload adapter methods, lora_id request tracking) references ServerArgs fields that don't exist on this branch, crashing the scheduler at startup. Only the max_loras=0 passthrough to make_config() is needed here. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d18470682f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-27T03:40:46Z

+    // Attach an empty DeviceResource so OnDevice() returns true.
+    // This prevents PruneEmptyByNode from removing the virtual root even when
+    // all adapter sequences have been evicted.
+    raw->AttachResource<ResourceType::Device>(std::make_unique<NodeResource<ResourceType::Device>>(OwnedPages{}));


Add host resource to LoRA namespace roots

LoRA virtual roots only get an empty device resource, but HostNodeRef locks every ancestor with GetResource<Host>(c) and several changed paths create host locks for LoRA matches/writebacks. For example, a LoRA request that is retracted or later writes back to L2 can have match.host.last_node at or below this virtual root, and locking it dereferences the missing host resource, crashing instead of recovering/writing back the request.

Useful? React with 👍 / 👎.

Fixed in 7b0a3ca — the virtual root now also gets an empty HostResource. IsEmpty() stays true on both resources, so the root never enters lru_leaves_ but HostNodeRef can safely lock it when match.host.last_node lands on it.

chatgpt-codex-connector · 2026-05-27T03:40:46Z

            paged_cache_groups=paged_cache_groups,
            enable_mixed_prefill_decode=server_args.enable_mixed_batch,
            prefix_cache_adjunct=prefix_cache_adjunct,
+            max_loras=0,


Wire max_loras into the runtime config

The Python serving path always passes max_loras=0, while the scheduler only enforces the new adapter cap when config_.max_loras > 0. In normal EventLoop startup this leaves the max-loras batch cap disabled for every server, so batches can still contain more unique adapters than the model executor's LoRA slots unless the scheduler is instantiated manually outside this path.

Useful? React with 👍 / 👎.

Intentional for this PR — it's the scheduler-only slice. max_loras wiring from ServerArgs lives in #83 alongside the --enable-lora/--max-loras CLI flags. 0 here makes the new cap a no-op until #83 lands, by design.

Four issues flagged by codex review on PR lightseekorg#268: 1. scheduleRetract Insert used base namespace while Match used the request's adapter namespace, so retracted LoRA requests wrote pages into the wrong subtree and never recovered their prefix on resume. Pass request->LoraId() through to Insert. 2. LoRA virtual roots had only a Device resource attached, so any host path that locked match.host.last_node when it pointed at the root would dereference a null Host resource. Attach an empty Host resource too; IsEmpty() stays true on both so the root never enters lru_leaves_. 3. EvictLoraNamespace unconditionally removed the virtual root after EvictSubtree, but EvictSubtree skips locked descendants. The unique_ptr cascade then destroyed those still-locked TreeNodes while live NodeRefs pointed at them — a use-after-free. Detect remaining locked descendants and defer root removal until a later call. 4. The max_loras cap counted a request's adapter before scheduling. Failed schedules (capacity, mamba admission, etc.) still consumed a slot, starving other adapters. Check the cap up front but insert into batch_lora_ids only after ops.size() grew. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b0a3ca9e0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-27T03:54:05Z

+    const std::int32_t page_size = tree_.PageSize();
+    token_vec_t sentinel(page_size, 0);
+    sentinel[0] = -lora_id;
+    auto node = std::make_unique<TreeNode>(sentinel, std::chrono::steady_clock::now());


Include the LoRA namespace in published block hashes

When KV cache events or L3 storage are enabled, a LoRA request with the same token pages as the base model (or another adapter) will publish the same block_hashes: this virtual root is only a radix-tree sentinel and has no block hash, so ParentBlockHash() returns nullopt for the first real adapter node and published_device_blocks_ deduplicates it as if it were the base-model block. That lets external KV storage/prefix-cache event consumers reuse or suppress blocks across different adapter weights even though the in-memory cache is namespaced.

Useful? React with 👍 / 👎.

Fixed in d3c2bd7 — getOrCreateLoraRoot now seeds the virtual root's block_hashes_ with HashKvBlock(sentinel=[-lora_id, 0, ...], nullopt). The sentinel tokens are per-adapter, so descendants' ParentBlockHash() returns a per-adapter hash, and the HashKvBlock(child, parent) chain diverges across namespaces. published_device_blocks_ and any external KV-event/L3 consumer no longer dedups adapter blocks against the base model or other adapters.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 85165df042

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-27T04:31:16Z

+    auto match = kv_prefix_cache_.Match(token_ids, lora_id, intent);
    augmentMatch(match);


Preserve LoRA namespace depth before hybrid augmentation

When a LoRA request goes through the hybrid cache with Mamba enabled, this namespaced MatchResult carries namespace_depth_offset = 1; augmentMatch() can then replace match.device.last_node/host.last_node with the real radix root when no Mamba node exists, but it leaves that offset in place. The next scheduler code calls DepthInPage() on the result, so a no-hit LoRA prefix becomes 0 - 1 == -1, which inflates unscheduled and can produce a negative prefill window start for first chunks. Keep the last node at the virtual root or clear/adjust the namespace offset when truncating the match to the real root.

Useful? React with 👍 / 👎.

Fixed in 156b86e — augmentMatch now clears namespace_depth_offset on the else branches that reassign last_node = root (both the legacy single-tier path and the Mamba-L2 device/host branches). Mamba-found paths still keep the offset since last_node lands on a descendant of the virtual root where DepthInPage(page_size) already includes the sentinel page. DepthInPage() returns 0 for no-hit LoRA matches now, not -1.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f8e95a7485

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-27T06:31:10Z

    using InvalidTransitionHandler<SchedulePrefillEvent>::operator();
    SchedulePrefillEvent(std::int32_t tokens_this_round, std::int32_t reserve_num_tokens_in_next_schedule_event,
-                         HybridPrefixCache* hybrid_prefix_cache = nullptr)
+                         HybridPrefixCache* hybrid_prefix_cache = nullptr, std::int32_t lora_id = kLoraNone)


Propagate LoRA ids into continuation events

Fresh evidence in the current tree is that Scheduler::schedulePrefill and Scheduler::scheduleDecode still construct these events with only the hybrid-cache argument (operations/forward.cpp:195-222), so this default leaves lora_id_ as kLoraNone. For multi-chunk LoRA requests when the hybrid cache is active, the later InsertHybridCache(..., lora_id_) calls therefore insert prefill/decode continuation pages into the base namespace, allowing base-model or other-adapter requests with the same tokens to reuse KV pages produced under the adapter weights.

Useful? React with 👍 / 👎.

Fixed in 156b86e — Scheduler::schedulePrefill and Scheduler::scheduleDecode now construct SchedulePrefillEvent/ScheduleDecodeEvent with request->LoraId(). The default kLoraNone on the event constructors is preserved as a safety net for non-LoRA call sites, but all real call sites are now explicit. Multi-chunk LoRA continuations insert into the correct adapter namespace.

chatgpt-codex-connector · 2026-05-27T06:31:11Z

+    // Evict device and host pages. OwnedPages RAII returns them to the allocator.
+    // EvictSubtree skips nodes whose resource is currently locked by an in-flight
+    // request; their OnDevice()/OnHost() flag stays true.
+    device_.EvictSubtree(descendants);


Publish removals when evicting adapter namespaces

When KV cache events are enabled, unloading an adapter evicts all of its device pages through this direct EvictSubtree call, but this path never calls recordDeviceBlockRemoved as the normal eviction/release paths do. External KV-event consumers will not receive BlockRemoved for the adapter's blocks, and published_device_blocks_ keeps those hashes, so reloading the same adapter/prefix can also suppress the next BlockStored event.

Useful? React with 👍 / 👎.

Fixed in 156b86e — EvictLoraNamespace now snapshots device-evictable descendants before EvictSubtree and calls recordDeviceBlockRemoved for each one. External KV-event consumers receive the BlockRemoved notifications and published_device_blocks_ releases the hashes, so reloading the same adapter prefix isn't suppressed.

Without this, a LoRA request with the same token pages as the base model (or another adapter) publishes identical block_hashes. The virtual root's block_hashes_ is empty by default, so BlockHash() returns nullopt for descendants' ParentBlockHash(), and the chain HashKvBlock(child_tokens, parent=nullopt) is independent of lora_id. External KV-event consumers (and the internal published_device_blocks_ dedup set) then suppress or reuse blocks across different adapter weights. Fix: at virtual-root creation, seed block_hashes_ with HashKvBlock(sentinel=[-lora_id, 0, ...], nullopt). Sentinel tokens are already per-adapter, so the seed hash is per-adapter; descendants inherit the namespace boundary via the normal ParentBlockHash chain. Addresses the final P1 review comment on lightseekorg#268. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

…g#268 review 1. hybrid_prefix_cache::augmentMatch — when a LoRA match has no Mamba node and gets truncated to the real radix root, also clear namespace_depth_offset. Without this DepthInPage() returns root_depth(0) - offset(1) = -1, inflating 'unscheduled' and pushing the prefill window start negative. 2. forward.cpp::schedulePrefill/scheduleDecode — construct SchedulePrefillEvent/ScheduleDecodeEvent with request->LoraId() so the continuation InsertHybridCache(..., lora_id_) calls insert prefill/decode pages into the adapter namespace. The previous default of kLoraNone allowed adapter KV pages to be inserted under the base namespace and reused across adapters/base. 3. EvictLoraNamespace — emit recordDeviceBlockRemoved for each device-evictable descendant before EvictSubtree detaches it. Otherwise kv_event_sink_ consumers miss the BlockRemoved notifications and published_device_blocks_ keeps the stale hashes, suppressing the next BlockStored for the same blocks on reload. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 156b86e936

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-27T06:42:42Z

+                        const std::vector<std::string>& page_hashs = {}, TreeNode* start_node = nullptr,
+                        std::int32_t lora_id = kLoraNone);


Require lora_id for host cache inserts

Because this overload defaults lora_id to kLoraNone, the existing host-prefetch completion path still compiles without passing the request namespace (Scheduler::handleEvent(cache::PrefetchDone) inserts at tokenspeed-scheduler/csrc/scheduler/outside_event_handler.cpp:75). For a LoRA request with L3/host prefetch hits, those pages are written into the base namespace; the later LoRA match misses its prefetched pages, and base/other-adapter requests with the same tokens can load adapter KV from host. Please make the namespace explicit or update that prefetch insert to pass req->LoraId().

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f8098148cc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-27T16:50:46Z

-MatchResult HybridPrefixCache::Match(const token_vec_t& token_ids, MatchIntent intent) {
-    auto match = kv_prefix_cache_.Match(token_ids, intent);
+MatchResult HybridPrefixCache::Match(const token_vec_t& token_ids, std::int32_t lora_id, MatchIntent intent) {
+    auto match = kv_prefix_cache_.Match(token_ids, lora_id, intent);


Clear LoRA offsets before paged-cache root caps

When lora_id is non-zero and the paged-cache adjunct is enabled, this namespaced match carries namespace_depth_offset = 1 into augmentMatchPagedCache(). On a cold adapter namespace (or any path without a complete paged-cache snapshot), that helper caps last_node back to the real radix root but does not clear the offset, so subsequent DepthInPage() calls return 0 - 1 and the scheduler computes an inflated/negative prefill window for the LoRA request. Please make the paged-cache augmentation namespace-aware or clear the offset when it leaves the virtual root.

Useful? React with 👍 / 👎.

qywu added 11 commits May 26, 2026 20:27

Merge branch 'main' into feat/lora-scheduler-kv-namespace

e24735d

Merge branch 'main' into feat/lora-scheduler-kv-namespace

c4d08b6

fix(scheduler): use LoraId() accessor in forward.cpp

dc1e73c

Upstream encapsulated lora_id as private lora_id_ with a LoraId() accessor. Replace request->lora_id() with request->LoraId() in the max_loras batch-cap loop in newForwardOperation(). Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chore: restore tokenspeed_scheduler/__init__.py to upstream version

98edfc9

No changes needed — PagedCacheGroupFamily and PrefixCacheAdjunctSpec are properly exported by the upstream version as-is. Signed-off-by: Qingyang Wu <willqywu@gmail.com>

chore: restore informational comment in scheduler/types.h

084c1ee

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

qywu marked this pull request as ready for review May 27, 2026 00:49

qywu requested a review from a team as a code owner May 27, 2026 00:49

chatgpt-codex-connector Bot reviewed May 27, 2026

View reviewed changes

Merge branch 'main' into feat/lora-scheduler-kv-namespace

85165df

chatgpt-codex-connector Bot reviewed May 27, 2026

View reviewed changes

Merge branch 'main' into feat/lora-scheduler-kv-namespace

f8e95a7

chatgpt-codex-connector Bot reviewed May 27, 2026

View reviewed changes

qywu added 2 commits May 27, 2026 06:31

chatgpt-codex-connector Bot reviewed May 27, 2026

View reviewed changes

Merge branch 'main' into feat/lora-scheduler-kv-namespace

f809814

chatgpt-codex-connector Bot reviewed May 27, 2026

View reviewed changes

qywu mentioned this pull request Jun 1, 2026

RL Development Plan: Training-Inference Consistency, Weights Sync, API Support qywu/tokenspeed#7

Closed

17 tasks

qywu requested a review from wangbo981016 June 1, 2026 04:44

		if spec.lora_id != 0:
		self._request_lora_ids[spec.request_id] = spec.lora_id

		auto match = kv_prefix_cache_.Match(token_ids, lora_id, intent);
		augmentMatch(match);

		const std::vector<std::string>& page_hashs = {}, TreeNode* start_node = nullptr,
		std::int32_t lora_id = kLoraNone);

Conversation

qywu commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1 — Per-adapter KV prefix-cache namespace isolation

2 — max_loras adapter cap in batch building

Files changed

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

qywu May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

qywu commented May 26, 2026 •

edited

Loading

2 — `max_loras` adapter cap in batch building

qywu May 27, 2026 •

edited

Loading