SMC-SD support #4
Open
shadowpa0327 wants to merge 19 commits into
Self-contained SMC (Sequential Monte Carlo) speculative decoding, fully
under python/sglang/srt/smc/ with no modifications to core SGLang files.
Tree:
smc/
__init__.py
common/
__init__.py
debug.py # diagnostic record appender
utils.py # particle cloning, KV release, resampling utils
verify.py # SMCVerifyInput + assign_smc_cache_locs triton kernel
engine.py # SMCEngine offline inference entrypoint
mem_cache/
__init__.py
allocator.py # SMCRefCountedTokenAllocator + copy_block_table
v2/
__init__.py
info.py # SMCDecodeContext + SMCDraftInputV2
req_state.py # ScheduleBatchSMC (slot-based persistent state)
scheduler.py # SMCSchedulerV2 (Scheduler subclass) + SMCCoordinatorV2
stacked_state.py # StackedGroupState (contiguous per-group GPU state)
worker.py # SMCWorkerV2 (standalone BaseSpecWorker subclass)
kernels/
__init__.py
fused_collect.py # fused ESS → systematic resample → dead/excess
fused_resample_kv.py # fused KV block-table copy + refcount update
The directory is wired to the rest of SGLang via the core hooks landed in
the follow-up commit. On its own, this commit is a no-op addition (no
existing file touched, no import sites changed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minimal set of SGLang core-file edits needed to wire the standalone smc/
directory into the engine. All changes here exist because v2 actively
exercises these paths — nothing is dead code.
Per-file:
- managers/schedule_batch.py (+27)
* Req gains two fields: smc_group_id, smc_particle_idx.
* estimate_kv_cache_usage: SMC-specific sizing
(per_particle * len(requests), accounting for N particles per parent).
* release_req: particles route to smc.common.utils._release_internal_req
(which uses dec_ref_and_free); non-particles use release_kv_cache.
This is the KV-safety split — without it, finished particles would
raw-free slots still referenced by siblings.
* prepare_for_decode gate widened to "is_spec_v2 or is_smc()" — SMC
runs non-overlap so is_spec_v2 is false but it still manages its
own KV allocation via spec_info.prepare_for_decode.
- managers/scheduler.py (+10 net)
* Small generic bug fix in event_loop_overlap: processed_last_batch
flag prevents pop_and_process from running twice in some paths.
(Not SMC-specific, but surfaced during SMC dev; noted in commit for
reviewer sanity.)
* Two cosmetic comment rewrites.
- managers/scheduler_runtime_checker_mixin.py (+82)
* _smc_held_token_count / _smc_held_req_count routed through
slot_state.held_token_count / held_req_count on ScheduleBatchSMC.
* smc_held addend in both the memory-leak check and the req-pool
leak check — v2 particles hold KV+req rows outside the running
batch, so without this they look like leaks.
* Tracked-req / OCCUPIED_REQ_ROWS debug block that fires on
memory-leak detection; useful for diagnosing KV accounting bugs
regardless of algorithm.
- managers/utils.py (+3)
* GenerationBatchResult.logprob_diff field (consumed by
smc/v2/scheduler.py to feed the resample coordinator) + its
copy_to_cpu guard.
- model_executor/cuda_graph_runner.py (+23)
* Score-worker CUDA graph branches gated by
"is_smc() and not is_draft_worker" — mirrors eagle/standalone.
* Constructs SMCVerifyInput for the captured graph's spec_info slot.
- model_executor/model_runner.py (+24)
* Same score-worker SMC branches at the model-runner level.
* New disable_graph_runner per-ForwardBatch flag: v2 uses it to
suppress graph replay on edge-case shapes (small bs, etc).
- model_executor/model_runner_kv_cache_mixin.py (+21)
* max_num_reqs *= 2 * smc_n_particles + 1 (parent + 2*N particles
per group, so the req-to-token pool is sized correctly).
* Dispatch in _init_pools: when is_smc() and page_size==1, construct
SMCRefCountedTokenAllocator from smc/mem_cache/allocator.py
instead of TokenToKVPoolAllocator. Fail-fast assertion confirms
the dispatched allocator has slot_ref_count so the fused KV
resample kernel can reach it directly.
- layers/attention/triton_backend.py (+~450)
* Hooks for "linear target verify": when a spec_info implements
use_linear_target_verify(), use EXTEND-style causal attention
over gamma+1 drafted tokens instead of a custom mask. SMCVerifyInput
is the only user today.
* Wires generate_smc_draft_decode_kv_indices as the kv-indices
kernel for the draft worker's multi-step decode.
- speculative/spec_info.py (+18)
* SpeculativeAlgorithm.SMC enum + is_smc(); supports_spec_v2()
already covers SMC's classification.
* SpecInputType.SMC_DRAFT and SMC_VERIFY; is_draft_input /
is_verify_input include the SMC cases.
* SpecInput.use_linear_target_verify base method returning False;
SMCVerifyInput overrides to True (see smc/common/verify.py).
- speculative/spec_utils.py (+62)
* generate_smc_draft_decode_kv_indices: triton kernel that builds
the per-step KV index table for the draft worker. Consumed by
triton_backend's init_forward_metadata for SMC.
- server_args.py (+144)
* Seven SMC flags (smc_n_particles, smc_gamma, smc_draft_temperature,
smc_target_temperature, smc_resample_threshold, smc_resample_method,
smc_fast_resample), their CLI args, and the SMC validation block:
enforces page_size==1, triton/fa3 attention backend, no
disaggregation, no DP attention; infers speculative_num_steps /
speculative_num_draft_tokens from smc_gamma; force-disables
overlap scheduling.
Validated end-to-end: --mode smc_engine at batch=1 produces bit-identical
accuracy to the pre-transplant branch (13/20, 4335 tokens).
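The KV-safety split described for release_req above can be illustrated with a toy refcounted allocator. This is a minimal sketch, not the SMCRefCountedTokenAllocator itself: the class RefCountedSlots and its fields are hypothetical, and only the method name dec_ref_and_free is borrowed from the commit text. It shows why a finished particle must decrement references rather than raw-free slots its siblings still use.

```python
from typing import Dict, List


class RefCountedSlots:
    """Toy refcounted slot allocator (hypothetical stand-in, not SGLang's class)."""

    def __init__(self) -> None:
        self.ref_count: Dict[int, int] = {}
        self.free_slots: List[int] = []

    def inc_ref(self, slots: List[int]) -> None:
        # One more owner for each slot.
        for s in slots:
            self.ref_count[s] = self.ref_count.get(s, 0) + 1

    def dec_ref_and_free(self, slots: List[int]) -> List[int]:
        """Drop one reference per slot; free only the slots that reach zero."""
        freed = []
        for s in slots:
            self.ref_count[s] -= 1
            if self.ref_count[s] == 0:
                del self.ref_count[s]
                self.free_slots.append(s)
                freed.append(s)
        return freed


alloc = RefCountedSlots()
shared_prefix = [0, 1, 2]
alloc.inc_ref(shared_prefix)  # particle A references the shared prefix
alloc.inc_ref(shared_prefix)  # particle B references the same slots
alloc.inc_ref([3])            # slot 3 belongs to particle A only

# Particle A finishes: only its private slot is actually freed.
freed_a = alloc.dec_ref_and_free(shared_prefix + [3])
```

A raw free at this point would have returned slots 0-2 to the pool while particle B still reads them, which is the failure mode the split guards against.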
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Accompanies the standalone smc/ directory:
- scripts/smc/ : benchmark harnesses, including accuracy_test_gsm8k.py
used for CI-style validation during development
- test/test_smc_*.py : unit tests for info / scheduler / resampler /
speculative decoding integration
Kept in their own commit so a reviewer can focus on production code
(commits 1-2) separately from validation tooling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The tooling commit was a wholesale transfer; some files are obsolete
after v1 retirement or are development scaffolding that no longer serves
a production purpose.
Removed:
- test/test_smc_info.py — imports smc.v1.{manager,info,resampler}
- test/test_smc_resampler.py — imports smc.v1.resampler.SMCResampler
- test/test_smc_dedicated_scheduler.py — imports smc.dedicated_scheduler
(the module was consolidated into smc/v2/scheduler.py)
- test/test_smc_speculative_decoding.py — exercises --speculative-algorithm
SMC via the regular server path, which goes through the SMC branch of
SpeculativeAlgorithm.create_worker that was removed with v1.
- scripts/smc/slot_state_simulation.py — pure-Python simulation of
ScheduleBatchSMC using hand-rolled mocks; served its purpose during
slot-based design review, no longer useful.
README.md refreshed to match the retained scripts, with the v1
"engine-level SMC" recipes replaced by the v2 `--mode smc_engine`
invocations.
Net: 6 files, +35 / -4360.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After v1 retirement, engine-level SMC (--mode smc) is no longer
functional: SpeculativeAlgorithm.create_worker has no SMC branch, so
sgl.Engine(speculative_algorithm="SMC") aborts at startup. The
native/Python-level path was only ever a debugger reference and is
strictly redundant now that smc_engine is the single production SMC
entrypoint.
Removed:
- NativeSMCConfig / NativeSMCDecoder class (~255 lines)
- _sum_logprobs / _normalize_weights / _effective_sample_size /
_resample helpers (native-only)
- run_native_eval driver
- run_engine_eval's SMC branches (engine-kwarg setup, --speculative-algorithm
SMC wiring, stage-timing fetch). Function simplified and renamed to
run_baseline_eval.
- --mode choices: {baseline, smc, smc_engine, native} -> {baseline, smc_engine}
- argparse group "native mode memory" (--draft-mem, --target-mem)
Script goes from 833 -> 424 lines (-409, -49%). Verified smc_engine
@Batch=1 still bit-identical: 13/20 (65.0%), 4335 tokens, 38.2s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The flag plumbed scheduler-side stage-timing buckets out of an SMC run,
but the buckets themselves were removed in the v1-retirement scrub
(see commit b96e98000 "smc: scrub dead code left behind by v1
retirement"). The RPC names the engine still tried to fan out
(reset_smc_stage_timing_summary / dump_smc_stage_timing_summary) had no
implementation in SMCSchedulerV2 or anywhere else, so the flag would
silently error if anyone ever turned it on.
Removed in scripts/smc/accuracy_test_gsm8k.py:
- --report-stage-timing argparse flag
- run_smc_engine_eval stage_timing setup, tempfile dump/load, and
reduced the return tuple from 4 -> 3 elements
- run_baseline_eval's trailing None placeholder
- print_stage_timing_summary helper (~50 lines)
- main()'s stage_timing handling
- now-unused imports: tempfile, json, os, dataclass, Dict, List, Tuple
Removed in python/sglang/srt/smc/engine.py:
- SMCEngine.reset_stage_timing_summary
- SMCEngine.dump_stage_timing_summary
Script: 424 -> 343 lines (-81, -19%). Verified smc_engine @Batch=1 still
bit-identical: 13/20 (65.0%), 4335 tokens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two focused suites covering the v2 surface:
test/registered/unit/test_smc_v2_scheduler.py (CPU, 6 tests)
TestSMCSchedulerAdmission:
- admission gating against slot_state.available_slot_count()
- FIFO drain when capacity is available
- oversized-group abort path
TestSMCResampleSlowPath:
- skewed weights → all dead slots take the survivor's KV / req state /
finished_reason / finished_len; refcounts inc/dec correctly;
group weights zeroed; finished_mask flips True everywhere
- equal weights → empty plan, no req-state or refcount mutation
TestSMCFinalizeGroup:
- tied log_weights → finalize picks the particle with greater visible
output (token_count clipped to finished_len), copies its
output_ids / finished_reason / finished_len into parent_req
test/registered/kernels/test_smc_v2_kernels.py (CUDA, 8 tests)
TestFusedCollectKernel (batched_collect_fused):
- equal weights → resample_mask all False, no jobs, weights untouched
- dominant-weight row → 3 dst (dead) + 3 src (= survivor's slot),
dst unique, dst ∩ src = ∅, weights zeroed on that row only
- 2-row mix (uniform + skewed) → only the skewed row contributes
jobs; uniform row's weights stay untouched
- dst/src partition conservation: |dst| == |src| under arbitrary
skew (each dead replaced by exactly one excess survivor)
TestFusedResampleKVKernel (batched_resample_kv):
- single (dst, src) pair: block table copied; src refcounts ↑;
dst's old refcounts ↓; to_free reports the freed slots
- multiple pairs run in parallel — each dst inherits its own src
- shared old KV (refcount > 1) is NOT in to_free; only refcount=0
slots get freed
- empty input: kernel skipped, returns empty int64 tensor
Both files use CustomTestCase + register_cpu_ci/register_cuda_ci per
sglang/test/ci conventions.
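The behaviour the fused-collect tests pin down (ESS check, systematic resample, dead/excess pairing) can be sketched on the CPU in plain Python. This is an illustrative sketch only, not the Triton kernel: the function names and the explicit offset argument u are assumptions, and the post-resample weight zeroing is not modelled. The trigger condition ess < threshold * N is the one quoted in the resample-threshold commit.

```python
import math
from typing import List, Tuple


def _normalized(log_weights: List[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    return [x / total for x in w]


def effective_sample_size(log_weights: List[float]) -> float:
    # ESS = 1 / sum(p_i^2) for normalized weights p_i.
    return 1.0 / sum(p * p for p in _normalized(log_weights))


def systematic_counts(log_weights: List[float], u: float) -> List[int]:
    """Offspring count per particle for a single offset u in [0, 1)."""
    n = len(log_weights)
    counts = [0] * n
    cum, j = 0.0, 0
    for i, p in enumerate(_normalized(log_weights)):
        cum += p
        while j < n and (j + u) / n < cum:
            counts[i] += 1
            j += 1
    counts[-1] += n - j  # guard against floating-point round-off
    return counts


def plan_resample(
    log_weights: List[float], threshold: float, u: float
) -> List[Tuple[int, int]]:
    """Return (dst, src) pairs: each dead particle copies one excess survivor."""
    n = len(log_weights)
    if not effective_sample_size(log_weights) < threshold * n:
        return []  # ESS is high enough: empty plan, nothing mutated
    counts = systematic_counts(log_weights, u)
    dead = [i for i, c in enumerate(counts) if c == 0]
    excess = [i for i, c in enumerate(counts) for _ in range(c - 1)]
    return list(zip(dead, excess))


uniform_plan = plan_resample([0.0, 0.0, 0.0, 0.0], threshold=0.5, u=0.5)
skewed_plan = plan_resample([0.0, -50.0, -50.0, -50.0], threshold=0.5, u=0.5)
```

Because the offspring counts sum to N, the number of dead particles always equals the number of excess copies, which is the dst/src conservation property the kernel test checks.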
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removed 4 v1 SGLang STANDALONE scripts (no SGLANG_ENABLE_SPEC_V2; they
exercise the legacy non-overlap STANDALONE worker that's being phased
out) and the 2 SSD scripts (entirely unrelated backend, not part of SMC):
- sglang_sd_1b_8b_fa3.sh / sglang_sd_1b_8b_triton.sh
- sglang_sd_1b_70b_fa3.sh / sglang_sd_1b_70b_triton.sh
- ssd_1b_8b.sh / ssd_1b_70b.sh
run_all.sh: dropped the v1/ssd entries; renamed the `sglang-v2` filter
tag to `sglang` (no v1 left to disambiguate from). Help text updated.
BENCHMARK_CONFIGS.md: tables and group inventory rewritten to match the
remaining 8-script set (4 sglang STANDALONE v2 + 4 SMC).
Net: 8 files, +21 / -416.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ngine path
Previously each smc_1b_*.sh called:
python -O -m sglang.bench_offline_throughput \\
--speculative-algorithm SMC ...
That route went through SpeculativeAlgorithm.create_worker's SMC branch,
which was deleted in the v1 retirement (commit 0b5b5618f). The scripts
would have aborted at server startup.
New wrapper:
scripts/smc/tps_benchmark_scripts/bench_smc_engine_throughput.py
Constructs SMCEngine directly, generates random-token inputs, runs a
single warmup pass to amortise CUDA-graph capture, then times a
full-batch generate(). Emits "Output token throughput: <tps> tok/s" so
the existing shell-script grep contract still works. Supports the same
knobs the old harness took (model/draft/n/gamma/temps, attention
backend, mem fraction, max-running-requests, cuda-graph-max-bs,
random-input/output-len, num-prompts, --tp), plus optional
--smc-fast-resample and --smc-resample-threshold.
The 4 SMC throughput shell scripts (smc_1b_{8b,70b}_{fa3,triton}.sh)
were updated to invoke the wrapper. Sweep parameters and (gamma, n)
pairs are unchanged. Bumped the awk extractor from $NF to $(NF-1)
because the wrapper's TPS line ends with "tok/s" rather than just the
number.
Smoke-tested smc_1b_8b_triton equivalent on a 4-prompt × 64-token mini
run: 310.55 tok/s, no crashes.
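The $NF to $(NF-1) change follows from whitespace field splitting: once the line ends in "tok/s", the last field is the unit and the number is the second-to-last. A small Python equivalent of that extraction, assuming the line format quoted above (the helper name extract_tps is hypothetical):

```python
def extract_tps(line: str) -> float:
    """Pull the throughput number out of the wrapper's TPS line."""
    fields = line.split()
    # fields[-1] is "tok/s" (awk's $NF); fields[-2] is the number ($(NF-1)).
    return float(fields[-2])


line = "Output token throughput: 310.55 tok/s"
tps = extract_tps(line)
```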
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A grep across the entire tree showed the chain:
speculative/spec_utils.py
└─ generate_smc_draft_decode_kv_indices (Triton kernel, ~60 lines)
└─ stashed onto self in triton_backend.TritonMultiStepDraftBackend
└─ used ONLY inside
init_smc_forward_metadata_replay_cuda_graph
└─ NO CALLERS anywhere
The replay helper was a v1 draft-side CUDA-graph capture path. After v1
retirement and the SMC-v2 worker switching to the standard
init_forward_metadata_replay_cuda_graph interface (or skipping graph
replay altogether), the whole chain became unreachable.
Removed:
- speculative/spec_utils.py: generate_smc_draft_decode_kv_indices kernel
- layers/attention/triton_backend.py: import, the
self.generate_smc_draft_decode_kv_indices = ... assignment, and the
init_smc_forward_metadata_replay_cuda_graph method body
(Update to tasks/smc_v1_retirement_scan.md follows in a separate change
since that doc lives outside the tracked tree.)
Validated:
- All v2 unit + kernel tests pass (14/14, 0.88s)
- smc_engine @Batch=1 still bit-identical: 13/20 (65.0%), 4335 tokens
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit of the SMC v2 footprint in SGLang core (commits 780ffa5, 9a5c4e4)
flagged 11 hunks that violated "touch only what you must": dead code
that nothing reads, drive-by improvements to upstream comments / TODOs,
and behavior changes unrelated to SMC. Reverting all of them brings the
non-smc/ core diff down to only the hunks v2 actually exercises.
HIGH (dead code):
- model_executor/model_runner.py: drop the "disable_graph_runner
per-batch override" guard in can_run_graph. The attribute is never set
anywhere in srt/, scripts/, or test/.
- speculative/spec_info.py: drop SpecInputType.SMC_SCORE. No SpecInput
subclass ever constructs it (only SMC_DRAFT and SMC_VERIFY are used).
Also remove it from the is_verify_input set.
- server_args.py: revert the dummy-model fast-path expansion that ran
_handle_missing_default_values / _handle_page_size /
_handle_speculative_decoding for "SMC"/"NGRAM". Original "Skip for dummy
models" comment + bare return restored. No SMC test in our current setup
uses model_path="dummy".
- server_args.py: revert "SMC" entry in auto_choose_speculative_params.
That helper is only called from the EAGLE/EAGLE3/STANDALONE arm of
_handle_speculative_decoding, never reached by SMC.
MEDIUM (behavior changes unrelated to SMC):
- managers/scheduler.py: revert the processed_last_batch flag and nested
if/elif rewrite in event_loop_overlap. The original logic was already
correct; my rewrite also subtly changed the idle-check condition (from
"fires when last_batch is None and batch is None" to "fires when
last_batch is True and ... and batch is None").
- managers/scheduler_runtime_checker_mixin.py: drop the ~54-line
TRACKED_REQS + OCCUPIED_REQ_ROWS debug block from check_memory. Pure
debugging convenience, gated by `if memory_leak`; SMC's KV accounting is
already covered by smc_held / smc_req_count.
LOW (cosmetic / drive-by):
- managers/scheduler.py: restore upstream "speculative decoding v1"
comment + lsyin TODO + the commented-out alternative implementation
block in the spec-v2 store-to-map section.
- managers/schedule_batch.py: restore upstream "TODO(spec-v2): all spec
v2 should go through this path" comment after widening the gate to
include SMC.
- model_executor/{model_runner,cuda_graph_runner}.py: move the
`from sglang.srt.speculative.eagle_info import EagleVerifyInput` import
back to the top of its if-block instead of the else clause.
Net: 7 files, +21 / -81.
Validated:
- 14/14 v2 unit + kernel tests pass
- smc_engine @Batch=1 still bit-identical: 13/20 (65.0%), 4335 tokens
- smc_engine @Batch=8: 27/40 (67.5%), 8184 tokens, 0 invalid (inside
the established 23-30 noise band)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… touches)
A second-pass audit revealed that schedule_batch.py is now ZERO-diff vs
smc_v0_starting_commit — every SMC change in that file was unreachable
in v2's actual flow. The mistaken justification was "v2 needs these"
but in fact:
v2 uses ScheduleBatch ONLY for parent prefill (init_new + prepare_for_extend).
v2's decode path runs entirely through ScheduleBatchSMC (a separate
class, not a subclass), driven by SMCSchedulerV2._event_loop_v2 which
bypasses the standard Scheduler.update_running_batch.
That means none of the following ever fire in v2:
- ScheduleBatch.new_tokens_required_next_decode (called only from
check_decode_mem -> retract_decode and update_running_batch, both
of which are in the standard event loop v2 doesn't use)
- ScheduleBatch.release_req (called only from
retract_decode and PrefillAdder.preempt_to_schedule, neither of
which see SMC particles — particles only live in
ScheduleBatchSMC.slot_to_req)
- ScheduleBatch.prepare_for_decode (called only from
update_running_batch, same reason)
- Req.smc_group_id / Req.smc_particle_idx (smc_group_id was
write-only; smc_particle_idx is set dynamically by
clone_req_for_smc_particle on cloned particle reqs and read only
inside SMC code, which never sees a non-particle Req)
Removed:
- python/sglang/srt/managers/schedule_batch.py: 4 hunks (the SMC branches
in new_tokens_required_next_decode, release_req, prepare_for_decode,
plus the two Req field declarations). File goes from +27 to 0 diff
against smc_v0_starting_commit.
- python/sglang/srt/smc/v2/scheduler.py: drop the dead
`particle_req.smc_group_id = parent_req.rid` write inside
SequenceGroup.materialize_particles (no consumer).
Added an explicit failure for the unsupported retract path in v2 instead:
- python/sglang/srt/smc/v2/scheduler.py: SMCSchedulerV2._add_request_to_queue
now raises NotImplementedError when invoked with is_retracted=True
(was silently `del is_retracted`). Documents the limitation:
SMC v2 has no group-aware retract / re-admit protocol.
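The shape of that change, failing loudly instead of discarding the argument, can be sketched as follows. The class below is a hypothetical stand-in, not SMCSchedulerV2; only the method name and the is_retracted parameter come from the commit text, and the queue attribute and error message are assumptions.

```python
class SchedulerSketch:
    """Hypothetical stand-in illustrating the explicit-failure pattern."""

    def __init__(self) -> None:
        self.waiting_queue = []

    def _add_request_to_queue(self, req, is_retracted: bool = False) -> None:
        if is_retracted:
            # Previously the flag was discarded (`del is_retracted`), which
            # hid the fact that retraction is unsupported in this loop.
            raise NotImplementedError(
                "SMC v2 has no group-aware retract / re-admit protocol"
            )
        self.waiting_queue.append(req)


sched = SchedulerSketch()
sched._add_request_to_queue("req-0")
```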
Net effect on the non-smc/ core diff vs smc_v0_starting_commit:
9 files, +532 / -146 -> 8 files, +511 / -143 (and schedule_batch.py
drops out entirely).
Validated:
- All 14 v2 unit + kernel tests pass
- smc_engine @Batch=1 still bit-identical: 13/20 (65.0%), 4335 tokens
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Revert scheduler_runtime_checker_mixin.py to upstream and override the
three affected leak checks inside SMCSchedulerV2 instead. Keeps core
free of SMC concepts (no slot_state, no _smc_held_* helpers) while
preserving the slot-aware conservation formulas that v2 needs.
- _check_radix_cache_memory: folds slot_state.held_token_count() into
the leak formula for ScheduleBatchSMC-resident KV.
- self_check_during_busy: folds slot-held tokens into the busy-mode
total.
- _check_req_pool: folds slot_state.held_req_count() into the req-pool
check.
Accuracy gate: 13/20, 4335 tokens (bit-identical). 14/14 unit+kernel
tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The busy check is only dispatched from event_loop_normal and
event_loop_overlap in core Scheduler; specialized loops (PP, disagg,
multiplex, and v2's own _event_loop_v2) omit it. The v2 override was
therefore unreachable; drop it and leave a comment documenting the
intentional omission so a future edit to _event_loop_v2 can re-add it
with correct slot-aware semantics.
Idle-path overrides (_check_radix_cache_memory, _check_req_pool)
retained: those are what v2's loop actually exercises. Refcount state is
already reflected via available_size, since a shared page stays out of
free_pages until its last refcount drops.
Accuracy gate: 13/20, 4335 tokens (bit-identical). 14/14 unit+kernel
tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove the 4-line SMC-specific touch from model_runner_kv_cache_mixin.py
(_init_pools no longer multiplies max_num_reqs by 2N+1 when speculative
algorithm is SMC). Push the expansion into SMCEngine.__init__ where it
can be logged visibly, and have SMCSchedulerV2 derive its slot capacity
from the resolved value.
Behavioural change (intentional): max_running_requests in v2 now means
"max concurrent user groups" (matches upstream convention) instead of
the prior quirky "max particle slots". Pool size scaling drops from 2N+1
to N+1 (stale v1 math; v2 shares one Req per particle for both draft and
target).
Changes:
- model_runner_kv_cache_mixin.py: delete `if SMC: max_num_reqs *= 2N+1`.
- smc/engine.py: SMCEngine intercepts user-supplied
max_running_requests, expands to G * (N+1), and logs the math.
- smc/v2/scheduler.py: derive
max_user_groups = max_running_requests // (N+1); size
slot_state.max_slots = max_user_groups * N; simplify
_admit_prefill_groups (drop unreachable oversized-abort check; remove
redundant None handling, since Scheduler.max_running_requests is always
set).
- test/registered/unit/test_smc_v2_scheduler.py: drop oversized-abort
test (check is gone); remove max_running_requests from admission test
mocks.
- scripts/smc/accuracy_test_gsm8k.py, scripts/smc/smc_profile_engine.py:
reduce default values to preserve current sizing under new semantics
(128/8 -> 16, since each user group now sizes 8x more particle slots).
Accuracy gate: 13/20, 4335 tokens (bit-identical). 13/13 unit+kernel
tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
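The sizing arithmetic in this commit can be checked with a short sketch. The function names are hypothetical; the formulas G * (N+1), max_running_requests // (N+1), and max_user_groups * N are the ones stated above, and N = 8 is an assumption read off the "128/8 -> 16" default change.

```python
from typing import Tuple


def expand_max_running_requests(user_groups: int, n_particles: int) -> int:
    # One parent Req plus N particle Reqs per user group.
    return user_groups * (n_particles + 1)


def derive_slot_capacity(
    max_running_requests: int, n_particles: int
) -> Tuple[int, int]:
    # Invert the expansion, then size the particle slot table.
    max_user_groups = max_running_requests // (n_particles + 1)
    max_slots = max_user_groups * n_particles
    return max_user_groups, max_slots


# Assumed example: 16 user groups, N = 8 particles per group.
expanded = expand_max_running_requests(16, 8)
groups, slots = derive_slot_capacity(expanded, 8)
```

Under these assumptions a user-facing value of 16 still yields 128 particle slots, which matches the stated goal of preserving the previous sizing after the semantics change.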
Core no longer references SMC. Allocator selection moves into a subclass
chain that v2 owns end-to-end:
SMCSchedulerV2.init_tp_model_worker
→ constructs SMCTpModelWorker
SMCTpModelWorker._init_model_runner
→ constructs SMCModelRunner
SMCModelRunner._init_pools
→ super()._init_pools(), then swaps allocator to
SMCRefCountedTokenAllocator
The swap inside SMCModelRunner._init_pools fires immediately after the
standard allocator is constructed, before init_attention_backend or any
other consumer can cache a stale reference. Draft worker keeps the
standard TpModelWorker; it is passed the (already-SMC) allocator via
target_worker.get_memory_pool() in SMCWorkerV2.
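The ordering argument (swap right after construction, before any consumer caches a reference) can be illustrated with stand-in classes. Everything below is hypothetical scaffolding: the class names, the constructor order in the base runner, and the size field are assumptions, and only the _init_pools / init_attention_backend method names and the slot_ref_count attribute come from the commit text.

```python
class StandardAllocator:
    def __init__(self, size: int) -> None:
        self.size = size


class RefCountedAllocator(StandardAllocator):
    def __init__(self, size: int) -> None:
        super().__init__(size)
        self.slot_ref_count = [0] * size  # per-slot reference counts


class BaseRunnerSketch:
    """Stand-in for a core model runner; knows nothing about SMC."""

    def __init__(self) -> None:
        self._init_pools()
        self.init_attention_backend()

    def _init_pools(self) -> None:
        self.token_to_kv_pool_allocator = StandardAllocator(size=8)

    def init_attention_backend(self) -> None:
        # A consumer that caches whatever allocator exists at this point.
        self.backend_allocator = self.token_to_kv_pool_allocator


class SMCRunnerSketch(BaseRunnerSketch):
    """Subclass override: build the standard pools, then swap the allocator."""

    def _init_pools(self) -> None:
        super()._init_pools()
        standard = self.token_to_kv_pool_allocator
        self.token_to_kv_pool_allocator = RefCountedAllocator(standard.size)


runner = SMCRunnerSketch()
```

Because the override returns before init_attention_backend runs, the cached reference in the sketch is already the refcounted allocator, and the base class needs no SMC-specific branch.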
Reverts the 22-line SMC branch + assertion in
model_runner_kv_cache_mixin.py:_init_pools, restoring upstream's
single-line TokenToKVPoolAllocator construction.
git diff smc_v0_starting_commit -- python/sglang/srt/model_executor/
is now empty (zero core touch).
Accuracy gate: 13/20, 4335 tokens (bit-identical). 13/13 unit+kernel tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the subclass-chain pattern from the allocator refactor. Core no
longer imports SMCVerifyInput; SMC supplies it via method overrides on
SMCModelRunner / SMCCudaGraphRunner.
Core changes:
- cuda_graph_runner.py: revert the `elif is_smc(): import SMCVerifyInput`
branch in CudaGraphRunner.get_spec_info. Enum-check predicates for
capture_forward_mode / num_tokens_per_bs remain (generic, no import).
- model_runner.py: extract the nested get_spec_info() inside _dummy_run
into a class method _build_dummy_run_spec_info(buffers,
num_tokens_per_bs). Core's method handles Eagle / Standalone / NGRAM;
SMC is gone. Extract the graph-runner class selection from the inline
dict in init_device_graphs into a _get_graph_runner_class() method. Both
are pure refactors (no behavior change for non-SMC) that create
extension points.
SMC additions:
- smc/model_executor/smc_cuda_graph_runner.py: SMCCudaGraphRunner
overrides get_spec_info to return SMCVerifyInput during graph capture.
- smc/model_executor/smc_model_runner.py: adds overrides for
_build_dummy_run_spec_info (SMCVerifyInput during autotune dummy run)
and _get_graph_runner_class (returns SMCCudaGraphRunner on cuda).
Grep confirms:
`grep -rn "from sglang.srt.smc" python/sglang/srt/model_executor python/sglang/srt/managers python/sglang/srt/mem_cache`
returns empty.
Accuracy gate: 13/20, 4335 tokens (bit-identical). 13/13 unit+kernel
tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The fused collect kernel already treats threshold=0 as "never resample"
(ess < 0 * N is always false), so the previous (0, 1] validation
artificially excluded a working value that the smcsd README documents as
(0 = disable).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
smc: allow --smc-resample-threshold=0 to disable resampling
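Why threshold=0 is safe follows directly from the trigger condition: ESS is always at least 1, so ess < 0 * N can never hold. A minimal sketch of the widened range check, assuming the accepted range becomes the closed interval [0, 1] (the function names and error message are hypothetical, not the server_args.py code):

```python
def validate_resample_threshold(threshold: float) -> float:
    # Widened from (0, 1] to [0, 1]: 0 now means "never resample".
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(
            f"smc_resample_threshold must be in [0, 1], got {threshold}"
        )
    return threshold


def should_resample(ess: float, threshold: float, n_particles: int) -> bool:
    # Trigger condition quoted in the commit message: ess < threshold * N.
    return ess < threshold * n_particles


# Even a fully collapsed group (ESS = 1) never resamples at threshold 0.
never = should_resample(ess=1.0, threshold=validate_resample_threshold(0.0), n_particles=8)
```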