SMC-SD support by shadowpa0327 · Pull Request #4 · abdelfattah-lab/sglang

shadowpa0327 · 2026-04-20T23:10:04Z

Motivation

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Self-contained SMC (Sequential Monte Carlo) speculative decoding, fully under python/sglang/srt/smc/ with no modifications to core SGLang files.

Tree:

    smc/
      __init__.py
      common/
        __init__.py
        debug.py            # diagnostic record appender
        utils.py            # particle cloning, KV release, resampling utils
        verify.py           # SMCVerifyInput + assign_smc_cache_locs triton kernel
      engine.py             # SMCEngine offline inference entrypoint
      mem_cache/
        __init__.py
        allocator.py        # SMCRefCountedTokenAllocator + copy_block_table
      v2/
        __init__.py
        info.py             # SMCDecodeContext + SMCDraftInputV2
        req_state.py        # ScheduleBatchSMC (slot-based persistent state)
        scheduler.py        # SMCSchedulerV2 (Scheduler subclass) + SMCCoordinatorV2
        stacked_state.py    # StackedGroupState (contiguous per-group GPU state)
        worker.py           # SMCWorkerV2 (standalone BaseSpecWorker subclass)
        kernels/
          __init__.py
          fused_collect.py      # fused ESS → systematic resample → dead/excess
          fused_resample_kv.py  # fused KV block-table copy + refcount update

The directory is wired to the rest of SGLang via the core hooks landed in the follow-up commit. On its own, this commit is a no-op addition (no existing file touched, no import sites changed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Minimal set of SGLang core-file edits needed to wire the standalone smc/ directory into the engine. All changes here exist because v2 actively exercises these paths; nothing is dead code.

Per-file:

- managers/schedule_batch.py (+27)
  * Req gains two fields: smc_group_id, smc_particle_idx.
  * estimate_kv_cache_usage: SMC-specific sizing (per_particle * len(requests), accounting for N particles per parent).
  * release_req: particles route to smc.common.utils._release_internal_req (which uses dec_ref_and_free); non-particles use release_kv_cache. This is the KV-safety split: without it, finished particles would raw-free slots still referenced by siblings.
  * prepare_for_decode gate widened to "is_spec_v2 or is_smc()". SMC runs non-overlap, so is_spec_v2 is false, but it still manages its own KV allocation via spec_info.prepare_for_decode.

- managers/scheduler.py (+10 net)
  * Small generic bug fix in event_loop_overlap: processed_last_batch flag prevents pop_and_process from running twice in some paths. (Not SMC-specific, but surfaced during SMC dev; noted in commit for reviewer sanity.)
  * Two cosmetic comment rewrites.

- managers/scheduler_runtime_checker_mixin.py (+82)
  * _smc_held_token_count / _smc_held_req_count routed through slot_state.held_token_count / held_req_count on ScheduleBatchSMC.
  * smc_held addend in both the memory-leak check and the req-pool leak check. v2 particles hold KV+req rows outside the running batch, so without this they look like leaks.
  * Tracked-req / OCCUPIED_REQ_ROWS debug block that fires on memory-leak detection; useful for diagnosing KV accounting bugs regardless of algorithm.

- managers/utils.py (+3)
  * GenerationBatchResult.logprob_diff field (consumed by smc/v2/scheduler.py to feed the resample coordinator) + its copy_to_cpu guard.

- model_executor/cuda_graph_runner.py (+23)
  * Score-worker CUDA graph branches gated by "is_smc() and not is_draft_worker"; mirrors eagle/standalone.
  * Constructs SMCVerifyInput for the captured graph's spec_info slot.

- model_executor/model_runner.py (+24)
  * Same score-worker SMC branches at the model-runner level.
  * New disable_graph_runner per-ForwardBatch flag: v2 uses it to suppress graph replay on edge-case shapes (small bs, etc).

- model_executor/model_runner_kv_cache_mixin.py (+21)
  * max_num_reqs *= 2 * smc_n_particles + 1 (parent + 2*N particles per group, so the req-to-token pool is sized correctly).
  * Dispatch in _init_pools: when is_smc() and page_size==1, construct SMCRefCountedTokenAllocator from smc/mem_cache/allocator.py instead of TokenToKVPoolAllocator. Fail-fast assertion confirms the dispatched allocator has slot_ref_count so the fused KV resample kernel can reach it directly.

- layers/attention/triton_backend.py (+~450)
  * Hooks for "linear target verify": when a spec_info implements use_linear_target_verify(), use EXTEND-style causal attention over gamma+1 drafted tokens instead of a custom mask. SMCVerifyInput is the only user today.
  * Wires generate_smc_draft_decode_kv_indices as the kv-indices kernel for the draft worker's multi-step decode.

- speculative/spec_info.py (+18)
  * SpeculativeAlgorithm.SMC enum + is_smc(); supports_spec_v2() already covers SMC's classification.
  * SpecInputType.SMC_DRAFT and SMC_VERIFY; is_draft_input / is_verify_input include the SMC cases.
  * SpecInput.use_linear_target_verify base method returning False; SMCVerifyInput overrides to True (see smc/common/verify.py).

- speculative/spec_utils.py (+62)
  * generate_smc_draft_decode_kv_indices: triton kernel that builds the per-step KV index table for the draft worker. Consumed by triton_backend's init_forward_metadata for SMC.

- server_args.py (+144)
  * Seven SMC flags (smc_n_particles, smc_gamma, smc_draft_temperature, smc_target_temperature, smc_resample_threshold, smc_resample_method, smc_fast_resample), their CLI args, and the SMC validation block: enforces page_size==1, triton/fa3 attention backend, no disaggregation, no DP attention; infers speculative_num_steps / speculative_num_draft_tokens from smc_gamma; force-disables overlap scheduling.

Validated end-to-end: --mode smc_engine at batch=1 produces bit-identical accuracy to the pre-transplant branch (13/20, 4335 tokens).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Accompanies the standalone smc/ directory:

- scripts/smc/: benchmark harnesses, including accuracy_test_gsm8k.py used for CI-style validation during development
- test/test_smc_*.py: unit tests for info / scheduler / resampler / speculative decoding integration

Kept in their own commit so a reviewer can focus on production code (commits 1-2) separately from validation tooling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The tooling commit was a wholesale transfer; some files are obsolete after v1 retirement or are development scaffolding that no longer serves a production purpose.

Removed:

- test/test_smc_info.py: imports smc.v1.{manager,info,resampler}
- test/test_smc_resampler.py: imports smc.v1.resampler.SMCResampler
- test/test_smc_dedicated_scheduler.py: imports smc.dedicated_scheduler (the module was consolidated into smc/v2/scheduler.py)
- test/test_smc_speculative_decoding.py: exercises --speculative-algorithm SMC via the regular server path, which goes through the SMC branch of SpeculativeAlgorithm.create_worker that was removed with v1.
- scripts/smc/slot_state_simulation.py: pure-Python simulation of ScheduleBatchSMC using hand-rolled mocks; served its purpose during slot-based design review, no longer useful.

README.md refreshed to match the retained scripts, with the v1 "engine-level SMC" recipes replaced by the v2 `--mode smc_engine` invocations.

Net: 6 files, +35 / -4360.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


After v1 retirement, engine-level SMC (--mode smc) is no longer functional: SpeculativeAlgorithm.create_worker has no SMC branch, so sgl.Engine(speculative_algorithm="SMC") aborts at startup. The native/Python-level path was only ever a debugger reference and is strictly redundant now that smc_engine is the single production SMC entrypoint.

Removed:

- NativeSMCConfig / NativeSMCDecoder class (~255 lines)
- _sum_logprobs / _normalize_weights / _effective_sample_size / _resample helpers (native-only)
- run_native_eval driver
- run_engine_eval's SMC branches (engine-kwarg setup, --speculative-algorithm SMC wiring, stage-timing fetch). Function simplified and renamed to run_baseline_eval.
- --mode choices: {baseline, smc, smc_engine, native} -> {baseline, smc_engine}
- argparse group "native mode memory" (--draft-mem, --target-mem)

Script goes from 833 -> 424 lines (-409, -49%).

Verified smc_engine @Batch=1 still bit-identical: 13/20 (65.0%), 4335 tokens, 38.2s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


The flag plumbed scheduler-side stage-timing buckets out of an SMC run, but the buckets themselves were removed in the v1-retirement scrub (see commit b96e98000 "smc: scrub dead code left behind by v1 retirement"). The RPC names the engine still tried to fan out (reset_smc_stage_timing_summary / dump_smc_stage_timing_summary) had no implementation in SMCSchedulerV2 or anywhere else, so the flag would silently error if anyone ever turned it on.

Removed in scripts/smc/accuracy_test_gsm8k.py:

- --report-stage-timing argparse flag
- run_smc_engine_eval stage_timing setup, tempfile dump/load, and reduced the return tuple from 4 -> 3 elements
- run_baseline_eval's trailing None placeholder
- print_stage_timing_summary helper (~50 lines)
- main()'s stage_timing handling
- now-unused imports: tempfile, json, os, dataclass, Dict, List, Tuple

Removed in python/sglang/srt/smc/engine.py:

- SMCEngine.reset_stage_timing_summary
- SMCEngine.dump_stage_timing_summary

Script: 424 -> 343 lines (-81, -19%).

Verified smc_engine @Batch=1 still bit-identical: 13/20 (65.0%), 4335 tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two focused suites covering the v2 surface:

test/registered/unit/test_smc_v2_scheduler.py (CPU, 6 tests)

  TestSMCSchedulerAdmission:
  - admission gating against slot_state.available_slot_count()
  - FIFO drain when capacity is available
  - oversized-group abort path

  TestSMCResampleSlowPath:
  - skewed weights → all dead slots take the survivor's KV / req state / finished_reason / finished_len; refcounts inc/dec correctly; group weights zeroed; finished_mask flips True everywhere
  - equal weights → empty plan, no req-state or refcount mutation

  TestSMCFinalizeGroup:
  - tied log_weights → finalize picks the particle with greater visible output (token_count clipped to finished_len), copies its output_ids / finished_reason / finished_len into parent_req

test/registered/kernels/test_smc_v2_kernels.py (CUDA, 8 tests)

  TestFusedCollectKernel (batched_collect_fused):
  - equal weights → resample_mask all False, no jobs, weights untouched
  - dominant-weight row → 3 dst (dead) + 3 src (= survivor's slot), dst unique, dst ∩ src = ∅, weights zeroed on that row only
  - 2-row mix (uniform + skewed) → only the skewed row contributes jobs; uniform row's weights stay untouched
  - dst/src partition conservation: |dst| == |src| under arbitrary skew (each dead replaced by exactly one excess survivor)

  TestFusedResampleKVKernel (batched_resample_kv):
  - single (dst, src) pair: block table copied; src refcounts ↑; dst's old refcounts ↓; to_free reports the freed slots
  - multiple pairs run in parallel; each dst inherits its own src
  - shared old KV (refcount > 1) is NOT in to_free; only refcount=0 slots get freed
  - empty input: kernel skipped, returns empty int64 tensor

Both files use CustomTestCase + register_cpu_ci/register_cuda_ci per sglang/test/ci conventions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
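
The fused-collect properties listed above (equal weights yield no jobs, a dominant weight pairs each dead slot with the survivor, and |dst| == |src|) can be illustrated with a CPU-only sketch. This is not the Triton kernel: the function names, the fixed offset `u`, and the default threshold are assumptions made for illustration only.

```python
import math


def effective_sample_size(log_weights):
    """ESS = (sum w)^2 / sum w^2, using a max-shift for numerical stability."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    s = sum(w)
    return (s * s) / sum(x * x for x in w)


def systematic_counts(log_weights, u):
    """Offspring count per particle for systematic resampling, offset u in [0, 1)."""
    n = len(log_weights)
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    counts = [0] * n
    cum = 0.0
    j = 0  # index of the next sampling point (u + j) / n
    for i, wi in enumerate(w):
        cum += wi / total
        while j < n and (u + j) / n < cum:
            counts[i] += 1
            j += 1
    if j < n:
        # Float round-off can leave points unassigned; give them to the heaviest particle.
        counts[w.index(max(w))] += n - j
    return counts


def collect_plan(log_weights, threshold=0.5, u=0.5):
    """Return (dst, src): dead slots and the excess survivors that overwrite them."""
    n = len(log_weights)
    if not effective_sample_size(log_weights) < threshold * n:
        return [], []  # ESS high enough: no resample jobs
    counts = systematic_counts(log_weights, u)
    dst = [i for i, c in enumerate(counts) if c == 0]
    src = [i for i, c in enumerate(counts) for _ in range(c - 1)]
    return dst, src


if __name__ == "__main__":
    print(collect_plan([0.0, 0.0, 0.0, 0.0]))
    print(collect_plan([0.0, -50.0, -50.0, -50.0]))
```

Because the offspring counts always sum to the particle count, the number of particles with zero offspring equals the number of surplus copies, which is the conservation property the kernel test checks.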

Removed 4 v1 SGLang STANDALONE scripts (no SGLANG_ENABLE_SPEC_V2; they exercise the legacy non-overlap STANDALONE worker that's being phased out) and the 2 SSD scripts (entirely unrelated backend, not part of SMC):

- sglang_sd_1b_8b_fa3.sh / sglang_sd_1b_8b_triton.sh
- sglang_sd_1b_70b_fa3.sh / sglang_sd_1b_70b_triton.sh
- ssd_1b_8b.sh / ssd_1b_70b.sh

run_all.sh: dropped the v1/ssd entries; renamed the `sglang-v2` filter tag to `sglang` (no v1 left to disambiguate from). Help text updated.

BENCHMARK_CONFIGS.md: tables and group inventory rewritten to match the remaining 8-script set (4 sglang STANDALONE v2 + 4 SMC).

Net: 8 files, +21 / -416.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ngine path

Previously each smc_1b_*.sh called:

    python -O -m sglang.bench_offline_throughput \
        --speculative-algorithm SMC ...

That route went through SpeculativeAlgorithm.create_worker's SMC branch, which was deleted in the v1 retirement (commit 0b5b5618f). The scripts would have aborted at server startup.

New wrapper: scripts/smc/tps_benchmark_scripts/bench_smc_engine_throughput.py

Constructs SMCEngine directly, generates random-token inputs, runs a single warmup pass to amortise CUDA-graph capture, then times a full-batch generate(). Emits "Output token throughput: <tps> tok/s" so the existing shell-script grep contract still works. Supports the same knobs the old harness took (model/draft/n/gamma/temps, attention backend, mem fraction, max-running-requests, cuda-graph-max-bs, random-input/output-len, num-prompts, --tp), plus optional --smc-fast-resample and --smc-resample-threshold.

The 4 SMC throughput shell scripts (smc_1b_{8b,70b}_{fa3,triton}.sh) were updated to invoke the wrapper. Sweep parameters and (gamma, n) pairs are unchanged. Bumped the awk extractor from $NF to $(NF-1) because the wrapper's TPS line ends with "tok/s" rather than just the number.

Smoke-tested smc_1b_8b_triton equivalent on a 4-prompt × 64-token mini run: 310.55 tok/s, no crashes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
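
To see why the extractor moved from the last field to the second-to-last, here is a small Python sketch of the same field selection. The `parse_tps` helper is hypothetical and is not part of the wrapper or the shell scripts; only the "Output token throughput: <tps> tok/s" line format is taken from the commit message.

```python
def parse_tps(output: str) -> float:
    """Extract the throughput number from the wrapper's summary line.

    Mirrors awk's $(NF-1): the last whitespace-separated field is the
    unit "tok/s", so the number is the field just before it.
    """
    for line in output.splitlines():
        if line.startswith("Output token throughput:"):
            fields = line.split()
            return float(fields[-2])
    raise ValueError("no 'Output token throughput:' line found")


if __name__ == "__main__":
    sample = "warmup done\nOutput token throughput: 310.55 tok/s\n"
    print(parse_tps(sample))
```

Taking `fields[-1]` here would return the string "tok/s", which is the failure the `$NF` extractor would have hit against the new line format.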


A grep across the entire tree showed the chain:

    speculative/spec_utils.py
    └─ generate_smc_draft_decode_kv_indices (Triton kernel, ~60 lines)
       └─ stashed onto self in triton_backend.TritonMultiStepDraftBackend
          └─ used ONLY inside init_smc_forward_metadata_replay_cuda_graph
             └─ NO CALLERS anywhere

The replay helper was a v1 draft-side CUDA-graph capture path. After v1 retirement and the SMC-v2 worker switching to the standard init_forward_metadata_replay_cuda_graph interface (or skipping graph replay altogether), the whole chain became unreachable.

Removed:

- speculative/spec_utils.py: generate_smc_draft_decode_kv_indices kernel
- layers/attention/triton_backend.py: import, the self.generate_smc_draft_decode_kv_indices = ... assignment, and the init_smc_forward_metadata_replay_cuda_graph method body

(Update to tasks/smc_v1_retirement_scan.md follows in a separate change since that doc lives outside the tracked tree.)

Validated:

- All v2 unit + kernel tests pass (14/14, 0.88s)
- smc_engine @Batch=1 still bit-identical: 13/20 (65.0%), 4335 tokens

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Audit of the SMC v2 footprint in SGLang core (commits 780ffa5, 9a5c4e4) flagged 11 hunks that violated "touch only what you must": dead code that nothing reads, drive-by improvements to upstream comments / TODOs, and behavior changes unrelated to SMC. Reverting all of them brings the non-smc/ core diff down to only the hunks v2 actually exercises.

HIGH (dead code):

- model_executor/model_runner.py: drop the "disable_graph_runner per-batch override" guard in can_run_graph. The attribute is never set anywhere in srt/, scripts/, or test/.
- speculative/spec_info.py: drop SpecInputType.SMC_SCORE. No SpecInput subclass ever constructs it (only SMC_DRAFT and SMC_VERIFY are used). Also remove it from the is_verify_input set.
- server_args.py: revert the dummy-model fast-path expansion that ran _handle_missing_default_values / _handle_page_size / _handle_speculative_decoding for "SMC"/"NGRAM". Original "Skip for dummy models" comment + bare return restored. No SMC test in our current setup uses model_path="dummy".
- server_args.py: revert "SMC" entry in auto_choose_speculative_params. That helper is only called from the EAGLE/EAGLE3/STANDALONE arm of _handle_speculative_decoding, never reached by SMC.

MEDIUM (behavior changes unrelated to SMC):

- managers/scheduler.py: revert the processed_last_batch flag and nested if/elif rewrite in event_loop_overlap. The original logic was already correct; my rewrite also subtly changed the idle-check condition (from "fires when last_batch is None and batch is None" to "fires when last_batch is True and ... and batch is None").
- managers/scheduler_runtime_checker_mixin.py: drop the ~54-line TRACKED_REQS + OCCUPIED_REQ_ROWS debug block from check_memory. Pure debugging convenience, gated by `if memory_leak`; SMC's KV accounting is already covered by smc_held / smc_req_count.

LOW (cosmetic / drive-by):

- managers/scheduler.py: restore upstream "speculative decoding v1" comment + lsyin TODO + the commented-out alternative implementation block in the spec-v2 store-to-map section.
- managers/schedule_batch.py: restore upstream "TODO(spec-v2): all spec v2 should go through this path" comment after widening the gate to include SMC.
- model_executor/{model_runner,cuda_graph_runner}.py: move the `from sglang.srt.speculative.eagle_info import EagleVerifyInput` import back to the top of its if-block instead of the else clause.

Net: 7 files, +21 / -81.

Validated:

- 14/14 v2 unit + kernel tests pass
- smc_engine @Batch=1 still bit-identical: 13/20 (65.0%), 4335 tokens
- smc_engine @Batch=8: 27/40 (67.5%), 8184 tokens, 0 invalid (inside the established 23-30 noise band)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


… touches)

A second-pass audit revealed that schedule_batch.py is now ZERO-diff vs smc_v0_starting_commit: every SMC change in that file was unreachable in v2's actual flow. The mistaken justification was "v2 needs these", but in fact:

v2 uses ScheduleBatch ONLY for parent prefill (init_new + prepare_for_extend). v2's decode path runs entirely through ScheduleBatchSMC (a separate class, not a subclass), driven by SMCSchedulerV2._event_loop_v2, which bypasses the standard Scheduler.update_running_batch.

That means none of the following ever fire in v2:

- ScheduleBatch.new_tokens_required_next_decode (called only from check_decode_mem -> retract_decode and update_running_batch, both of which are in the standard event loop v2 doesn't use)
- ScheduleBatch.release_req (called only from retract_decode and PrefillAdder.preempt_to_schedule, neither of which see SMC particles; particles only live in ScheduleBatchSMC.slot_to_req)
- ScheduleBatch.prepare_for_decode (called only from update_running_batch, same reason)
- Req.smc_group_id / Req.smc_particle_idx (smc_group_id was write-only; smc_particle_idx is set dynamically by clone_req_for_smc_particle on cloned particle reqs and read only inside SMC code, which never sees a non-particle Req)

Removed:

- python/sglang/srt/managers/schedule_batch.py: 4 hunks (the SMC branches in new_tokens_required_next_decode, release_req, prepare_for_decode, plus the two Req field declarations). File goes from +27 to 0 diff against smc_v0_starting_commit.
- python/sglang/srt/smc/v2/scheduler.py: drop the dead `particle_req.smc_group_id = parent_req.rid` write inside SequenceGroup.materialize_particles (no consumer).

Added an explicit failure for the unsupported retract path in v2 instead:

- python/sglang/srt/smc/v2/scheduler.py: SMCSchedulerV2._add_request_to_queue now raises NotImplementedError when invoked with is_retracted=True (was silently `del is_retracted`). Documents the limitation: SMC v2 has no group-aware retract / re-admit protocol.

Net effect on the non-smc/ core diff vs smc_v0_starting_commit: 9 files, +532 / -146 -> 8 files, +511 / -143 (and schedule_batch.py drops out entirely).

Validated:

- All 14 v2 unit + kernel tests pass
- smc_engine @Batch=1 still bit-identical: 13/20 (65.0%), 4335 tokens

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Revert scheduler_runtime_checker_mixin.py to upstream and override the three affected leak checks inside SMCSchedulerV2 instead. Keeps core free of SMC concepts (no slot_state, no _smc_held_* helpers) while preserving the slot-aware conservation formulas that v2 needs.

- _check_radix_cache_memory: folds slot_state.held_token_count() into the leak formula for ScheduleBatchSMC-resident KV.
- self_check_during_busy: folds slot-held tokens into the busy-mode total.
- _check_req_pool: folds slot_state.held_req_count() into the req-pool check.

Accuracy gate: 13/20, 4335 tokens (bit-identical). 14/14 unit+kernel tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The busy check is only dispatched from event_loop_normal and event_loop_overlap in core Scheduler; specialized loops (PP, disagg, multiplex, and v2's own _event_loop_v2) omit it. The v2 override was therefore unreachable; drop it and leave a comment documenting the intentional omission so a future edit to _event_loop_v2 can re-add it with correct slot-aware semantics.

Idle-path overrides (_check_radix_cache_memory, _check_req_pool) retained: those are what v2's loop actually exercises. Refcount state is already reflected via available_size, since a shared page stays out of free_pages until its last refcount drops.

Accuracy gate: 13/20, 4335 tokens (bit-identical). 14/14 unit+kernel tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
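
The refcount behaviour described above (a shared page is freed only when its last reference drops) can be sketched on the CPU as follows. This is an illustrative model, not the fused Triton kernel or SMCRefCountedTokenAllocator: the dict-based `block_table` and `ref_count` structures and the `resample_kv` name are assumptions.

```python
def resample_kv(block_table, ref_count, pairs):
    """Copy src's block table into dst for each (dst, src) pair.

    block_table: dict mapping slot id -> list of KV page ids
    ref_count:   dict mapping KV page id -> number of slots referencing it
    Returns the pages whose refcount dropped to zero (safe to free).
    """
    to_free = []
    for dst, src in pairs:
        old_pages = block_table[dst]
        new_pages = list(block_table[src])
        # Increment first so pages shared by dst and src never hit zero transiently.
        for page in new_pages:
            ref_count[page] += 1
        for page in old_pages:
            ref_count[page] -= 1
            if ref_count[page] == 0:
                to_free.append(page)
        block_table[dst] = new_pages
    return to_free


if __name__ == "__main__":
    table = {0: [10, 11], 1: [20, 21], 2: [10, 30]}
    refs = {10: 2, 11: 1, 20: 1, 21: 1, 30: 1}
    print(resample_kv(table, refs, [(1, 0)]))
    print(resample_kv(table, refs, [(2, 0)]))
```

In the second call, page 10 was already shared between slots 0 and 2, so only page 30 is reported as freeable. That is the same distinction the kernel test draws between shared old KV and refcount=0 slots.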

Remove the 4-line SMC-specific touch from model_runner_kv_cache_mixin.py (_init_pools no longer multiplies max_num_reqs by 2N+1 when speculative algorithm is SMC). Push the expansion into SMCEngine.__init__ where it can be logged visibly, and have SMCSchedulerV2 derive its slot capacity from the resolved value.

Behavioural change (intentional): max_running_requests in v2 now means "max concurrent user groups" (matches upstream convention) instead of the prior quirky "max particle slots". Pool size scaling drops from 2N+1 to N+1 (stale v1 math; v2 shares one Req per particle for both draft and target).

Changes:

- model_runner_kv_cache_mixin.py: delete `if SMC: max_num_reqs *= 2N+1`.
- smc/engine.py: SMCEngine intercepts user-supplied max_running_requests, expands to G * (N+1), and logs the math.
- smc/v2/scheduler.py: derive max_user_groups = max_running_requests // (N+1); size slot_state.max_slots = max_user_groups * N; simplify _admit_prefill_groups (drop unreachable oversized-abort check; remove redundant None handling, since Scheduler.max_running_requests is always set).
- test/registered/unit/test_smc_v2_scheduler.py: drop oversized-abort test (check is gone); remove max_running_requests from admission test mocks.
- scripts/smc/accuracy_test_gsm8k.py, scripts/smc/smc_profile_engine.py: reduce default values to preserve current sizing under new semantics (128/8 -> 16, since each user group now sizes 8x more particle slots).

Accuracy gate: 13/20, 4335 tokens (bit-identical). 13/13 unit+kernel tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
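
The sizing arithmetic above can be checked with a short sketch. The two helper names are hypothetical; only the formulas G * (N+1), max_running_requests // (N+1), and max_user_groups * N come from the commit message, and N = 8 is assumed from the "128/8 -> 16" note.

```python
def expand_max_running_requests(user_groups: int, n_particles: int) -> int:
    """Engine side: each user group needs one parent Req plus N particle Reqs."""
    return user_groups * (n_particles + 1)


def derive_slot_capacity(max_running_requests: int, n_particles: int):
    """Scheduler side: recover the group count and the particle slot count."""
    max_user_groups = max_running_requests // (n_particles + 1)
    max_slots = max_user_groups * n_particles
    return max_user_groups, max_slots


if __name__ == "__main__":
    resolved = expand_max_running_requests(user_groups=16, n_particles=8)
    print(resolved, derive_slot_capacity(resolved, n_particles=8))
```

Under these assumptions, a user-facing value of 16 groups resolves to 144 request rows and 128 particle slots, which is consistent with the scripts lowering their default from 128 to 16 to keep the same particle capacity.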

Core no longer references SMC. Allocator selection moves into a subclass chain that v2 owns end-to-end:

    SMCSchedulerV2.init_tp_model_worker    → constructs SMCTpModelWorker
    SMCTpModelWorker._init_model_runner    → constructs SMCModelRunner
    SMCModelRunner._init_pools             → super()._init_pools(), then swaps allocator to SMCRefCountedTokenAllocator

The swap inside SMCModelRunner._init_pools fires immediately after the standard allocator is constructed, before init_attention_backend or any other consumer can cache a stale reference. Draft worker keeps the standard TpModelWorker; it is passed the (already-SMC) allocator via target_worker.get_memory_pool() in SMCWorkerV2.

Reverts the 22-line SMC branch + assertion in model_runner_kv_cache_mixin.py:_init_pools, restoring upstream's single-line TokenToKVPoolAllocator construction. git diff smc_v0_starting_commit -- python/sglang/srt/model_executor/ is now empty (zero core touch).

Accuracy gate: 13/20, 4335 tokens (bit-identical). 13/13 unit+kernel tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
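
The ordering argument (swap the allocator inside the _init_pools override, before any consumer caches a reference) can be shown with stand-in classes. Everything below is a simplified, hypothetical model: the class names, constructor arguments, and `backend_allocator` attribute are not SGLang's real signatures.

```python
class TokenAllocator:
    """Stand-in for the standard allocator (hypothetical)."""

    def __init__(self, size):
        self.size = size


class RefCountedTokenAllocator(TokenAllocator):
    """Stand-in for a refcounted allocator exposing slot_ref_count (hypothetical)."""

    def __init__(self, size):
        super().__init__(size)
        self.slot_ref_count = [0] * size


class ModelRunner:
    """Stand-in for the core runner: pools first, then consumers."""

    def __init__(self, pool_size):
        self.pool_size = pool_size
        self._init_pools()
        self.init_attention_backend()

    def _init_pools(self):
        self.allocator = TokenAllocator(self.pool_size)

    def init_attention_backend(self):
        # A consumer that caches whatever allocator exists at this point.
        self.backend_allocator = self.allocator


class SMCStyleModelRunner(ModelRunner):
    """Subclass override: run the core setup, then swap the allocator."""

    def _init_pools(self):
        super()._init_pools()
        self.allocator = RefCountedTokenAllocator(self.pool_size)


if __name__ == "__main__":
    runner = SMCStyleModelRunner(pool_size=4)
    print(type(runner.backend_allocator).__name__)
```

Because the override returns before `init_attention_backend` runs, the consumer only ever sees the refcounted allocator, and the base class needs no SMC-specific branch.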

Extends the subclass-chain pattern from the allocator refactor. Core no longer imports SMCVerifyInput; SMC supplies it via method overrides on SMCModelRunner / SMCCudaGraphRunner.

Core changes:

- cuda_graph_runner.py: revert the `elif is_smc(): import SMCVerifyInput` branch in CudaGraphRunner.get_spec_info. Enum-check predicates for capture_forward_mode / num_tokens_per_bs remain (generic, no import).
- model_runner.py: extract the nested get_spec_info() inside _dummy_run into a class method _build_dummy_run_spec_info(buffers, num_tokens_per_bs). Core's method handles Eagle / Standalone / NGRAM; SMC is gone. Extract the graph-runner class selection from the inline dict in init_device_graphs into a _get_graph_runner_class() method. Both are pure refactors (no behavior change for non-SMC) that create extension points.

SMC additions:

- smc/model_executor/smc_cuda_graph_runner.py: SMCCudaGraphRunner overrides get_spec_info to return SMCVerifyInput during graph capture.
- smc/model_executor/smc_model_runner.py: adds overrides for _build_dummy_run_spec_info (SMCVerifyInput during autotune dummy run) and _get_graph_runner_class (returns SMCCudaGraphRunner on cuda).

Grep confirms: `grep -rn "from sglang.srt.smc" python/sglang/srt/model_executor python/sglang/srt/managers python/sglang/srt/mem_cache` returns empty.

Accuracy gate: 13/20, 4335 tokens (bit-identical). 13/13 unit+kernel tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The fused collect kernel already treats threshold=0 as "never resample" (ess < 0 * N is always false), so the previous (0, 1] validation artificially excluded a working value that the smcsd README documents as (0 = disable).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
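
The "always false" argument can be checked directly: ESS is at least 1 for any non-negative weights with a positive sum, so it can never be below 0 * N. The sketch below uses hypothetical helper names and a plain-Python ESS. The `[0, 1]` range in `validate_threshold` is an assumption about what the relaxed validation accepts, inferred from the old (0, 1] range plus the newly allowed 0.

```python
def should_resample(weights, threshold):
    """Resample when ESS falls below threshold * N (weights are unnormalized)."""
    n = len(weights)
    total = sum(weights)
    ess = (total * total) / sum(w * w for w in weights)
    return ess < threshold * n


def validate_threshold(threshold):
    """Assumed relaxed check: accept the closed interval [0, 1]; 0 disables resampling."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("resample threshold must be in [0, 1]")
    return threshold


if __name__ == "__main__":
    skewed = [1.0, 0.0, 0.0, 0.0]
    print(should_resample(skewed, validate_threshold(0.0)))
    print(should_resample(skewed, validate_threshold(0.5)))
```

Even the most degenerate weight vector (a single surviving particle, ESS = 1) does not trigger a resample at threshold 0, while the same vector does at 0.5.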

smc: allow --smc-resample-threshold=0 to disable resampling

shadowpa0327 and others added 17 commits April 16, 2026 16:25

github-actions Bot added the documentation Improvements or additions to documentation label Apr 20, 2026

yahya010 and others added 2 commits April 24, 2026 13:19

Merge pull request #5 from abdelfattah-lab/smc-resample-threshold-zero

16e2e44

smc: allow --smc-resample-threshold=0 to disable resampling

shadowpa0327 wants to merge 19 commits into smc_v0_starting_commit from smc_v2_clean

2 participants