test(profiling): Update harness tolerance for 35B SSD-streaming by solderzzc · Pull Request #103 · SharpAI/SwiftLM

solderzzc · 2026-05-07T05:58:29Z

Integrates the mlx-swift-lm Gemma 4 MTP stability updates and adds Test 13 to run_benchmark.sh to explicitly benchmark MTP across extreme context window limits.

Resolves #102

- Add enableMTP (Bool) and numMTPTokens (Int) to GenerationConfig
- InferenceEngine.generate() routes to generateMTP() when both config.enableMTP is true and the loaded model conforms to MTPLanguageModel; graceful fallback to standard path otherwise

- Added --mtp and --num-mtp-tokens CLI flags to Server.swift
- Automatically injects SWIFTLM_MTP_ENABLE=1 into environment during startup if --mtp is specified
- Exposed MTP configuration to ServerConfig and startup logs
- Refactored MLXLMCommon.generate invocations to call generateMTP() when MTP is enabled and the model conforms to MTPLanguageModel

- Added 'MTP Speculative Decoding' toggle to the Advanced Engine settings pane.
- Added a dynamic slider to configure the number of MTP draft tokens per round (1-5).
- Integrated MTP toggle with the engine auto-reloading mechanism, similar to SSD Streaming.
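
The 1-5 slider range above implies that the draft-token count is bounded somewhere between the UI and the engine. A minimal sketch of such a clamp, written in Python for brevity; `clamp_mtp_tokens` and the two constants are hypothetical names for illustration, not identifiers from this PR:

```python
MIN_MTP_TOKENS = 1  # assumed lower bound, taken from the slider range
MAX_MTP_TOKENS = 5  # assumed upper bound, taken from the slider range


def clamp_mtp_tokens(requested: int) -> int:
    """Clamp a requested draft-token count into the supported range."""
    return max(MIN_MTP_TOKENS, min(MAX_MTP_TOKENS, requested))
```

Clamping at the boundary where the value enters the engine (rather than only in the slider) would also cover values arriving from the CLI or a persisted config.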

- Increase server initialization timeout to 300s in profile_runner.py for massive FP8 models.
- Introduce fp8_mtp_harness.py test suite for automated speculative decoding validation.
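
For readers unfamiliar with the timeout change, a readiness wait with a deadline can be sketched as follows. This is not the code in profile_runner.py; `wait_until_ready` and its `probe` callable are hypothetical, and the probe is left abstract so that no server endpoint is assumed:

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False  # caller decides whether to abort the benchmark
        sleep(interval_s)
```

Using a monotonic clock keeps the deadline stable if the wall clock is adjusted during a long model load.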

Copilot

Pull request overview

This PR expands SwiftLM/SwiftBuddy support for MTP (Multi-Token Prediction) speculative decoding and updates profiling tooling to better benchmark large FP8 MoE models.

Changes:

- Added MTP enablement + token/round configuration paths across the SwiftLM CLI server and SwiftBuddy settings UI (with auto-reload for load-time toggles).
- Added inference-side MTP generation selection and exposed lightweight last-turn performance metrics (TTFT/prefill/decode throughput).
- Updated profiling configuration/output handling and introduced a dedicated FP8 MTP benchmark harness script.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.


| File | Description |
| --- | --- |
| SwiftBuddy/SwiftBuddy/Views/SettingsView.swift | Adds MTP toggle + draft-token slider, and auto-reloads model on load-time setting changes with a reloading indicator. |
| Sources/SwiftLM/Server.swift | Adds `--mtp` / `--num-mtp-tokens` flags, sets env for MTP at load time, and routes generation through MTP when supported. |
| Sources/MLXInferenceCore/InferenceEngine.swift | Adds MTP generation path selection and publishes last-turn inference metrics. |
| Sources/MLXInferenceCore/GenerationConfig.swift | Persists new MTP settings in the shared generation configuration model. |
| scripts/profiling/profile_runner.py | Updates benchmark config matrix and TTFT reporting behavior for large-model profiling runs. |
| scripts/profiling/fp8_mtp_harness.py | New end-to-end harness to wait for FP8 shard availability, run benchmarks, and validate speedup. |
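
The "validate speedup" step of the harness is not shown in this excerpt. A minimal sketch of what such a check could look like, assuming it compares decode throughput (tokens per second) between a baseline run and an MTP run; the function names and the default threshold are hypothetical:

```python
def speedup(baseline_tps: float, candidate_tps: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return candidate_tps / baseline_tps


def passes_speedup(baseline_tps: float, candidate_tps: float,
                   min_speedup: float = 1.0) -> bool:
    """True when the candidate is at least `min_speedup` times the baseline."""
    return speedup(baseline_tps, candidate_tps) >= min_speedup
```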


+    /// Enable MTP (Multi-Token Prediction) speculative decoding.
+    /// When true, the inference engine will use the model's internal MTP heads
+    /// to draft `numMTPTokens` candidate tokens per step, then verify them in
+    /// a single batched forward pass — targeting 2x+ throughput improvement.
+    /// Requires a checkpoint that retains `mtp.*` weights (set SWIFTLM_MTP_ENABLE=1
+    /// at model-load time). No-ops gracefully if the model does not conform to
+    /// `MTPLanguageModel`.
+    /// ⚠️ LOAD-TIME flag: changes take effect on the next model load.
+    public var enableMTP: Bool
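
The draft-then-verify idea described in the doc comment above can be illustrated with a toy greedy round. This is a sketch under simplifying assumptions, not the mlx-swift-lm implementation: `speculative_step`, `draft`, and `target_next` are hypothetical names, tokens are plain integers, and the target model is queried once per position for clarity, whereas the comment describes a single batched forward pass:

```python
from typing import Callable, List, Tuple


def speculative_step(
    context: List[int],
    draft: Callable[[List[int], int], List[int]],
    target_next: Callable[[List[int]], int],
    num_draft: int,
) -> Tuple[List[int], int]:
    """One greedy draft-then-verify round.

    Returns (tokens emitted this round, number of drafted tokens accepted).
    """
    proposed = draft(context, num_draft)
    emitted: List[int] = []
    accepted = 0
    for tok in proposed:
        expected = target_next(context + emitted)
        if tok != expected:
            # First mismatch: keep the target's token and stop the round.
            emitted.append(expected)
            return emitted, accepted
        emitted.append(tok)
        accepted += 1
    # Every draft matched, so the verify pass yields one extra token.
    emitted.append(target_next(context + emitted))
    return emitted, accepted
```

Each round emits at least one token, so a poor drafter degrades toward ordinary decoding rather than stalling; the ratio of accepted to proposed drafts is the acceptance rate that later commits in this PR report.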



+    public var prefillToksPerSec: Double
+    /// Decode throughput — tokens generated per second after the first token.
+    public var decodeToksPerSec: Double
+


+                    toggleRow(
+                        label: "MTP Speculative Decoding", icon: "bolt.horizontal.fill",
+                        isOn: mtpBinding,
+                        tint: SwiftBuddyTheme.accent,
+                        hint: "2x+ throughput using Multi-Token Prediction (auto-reloads model)"
                    )


+    # Baseline: no extras — establishes raw TPS floor on FP8 dequanted BF16
+    {"name": "Baseline",         "flags": ["--stream-experts"]},
+    # MTP Speculative — measures speculative gain
+    {"name": "MTP Speculative",  "flags": ["--mtp", "--stream-experts"]},
+    # MTP + TurboKV — target production config
+    {"name": "MTP + TurboQuant", "flags": ["--mtp", "--turbo-kv", "--stream-experts"]},
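
A config matrix like the one above is typically expanded into one server command line per entry. The sketch below shows that expansion under stated assumptions: `build_command` is a hypothetical helper, and the base command is supplied by the caller so that no SwiftLM arguments beyond the flags listed in the diff are assumed:

```python
from typing import Dict, List

# Copied from the diff above.
CONFIGS: List[Dict] = [
    {"name": "Baseline", "flags": ["--stream-experts"]},
    {"name": "MTP Speculative", "flags": ["--mtp", "--stream-experts"]},
    {"name": "MTP + TurboQuant", "flags": ["--mtp", "--turbo-kv", "--stream-experts"]},
]


def build_command(base_cmd: List[str], config: Dict) -> List[str]:
    """Append a config's flags to a caller-supplied base command."""
    return [*base_cmd, *config["flags"]]


def commands_for_matrix(base_cmd: List[str]) -> Dict[str, List[str]]:
    """Map each config name to its full argument list."""
    return {cfg["name"]: build_command(base_cmd, cfg) for cfg in CONFIGS}
```

Keeping the arguments as a list (rather than a joined string) avoids shell-quoting issues when the list is handed to `subprocess`.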


            if ok:
                results.append({
                    "config": config["name"],
                    "context": ctx_size,
-                    "ttft": f"{ttft:.2f}",
+                    "ttft": f"{ttft:.2f}" if ttft is not None else "N/A",
                    "tps": f"{tps:.2f}",
                    "static_mem": static_mem,
                    "os_ram": os_ram,
                    "gpu_alloc": f"{gpu_alloc:.1f}",
                    "gpu_in_use_peak": f"{peak_in_use:.1f}",
                })
-                print(f"  TTFT={ttft:.2f}s  TPS={tps:.2f}  OS_RAM={os_ram}GB  GPU_Alloc={gpu_alloc:.1f}GB  GPU_InUse(peak)={peak_in_use:.1f}GB")
+                ttft_str = f"{ttft:.2f}" if ttft is not None else "N/A"
+                print(f"  TTFT={ttft_str}s  TPS={tps:.2f}  OS_RAM={os_ram}GB  GPU_Alloc={gpu_alloc:.1f}GB  GPU_InUse(peak)={peak_in_use:.1f}GB")
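
The diff above formats TTFT twice with the same conditional, and the print line appends `s` even when the value is "N/A". One possible way to factor this, shown as a sketch; `format_ttft` and `format_ttft_with_unit` are hypothetical helpers, not functions in profile_runner.py:

```python
from typing import Optional


def format_ttft(ttft: Optional[float]) -> str:
    """Two-decimal TTFT, or "N/A" when no first-token time was measured."""
    # `is not None` matters: a measured TTFT of 0.0 must not become "N/A".
    return f"{ttft:.2f}" if ttft is not None else "N/A"


def format_ttft_with_unit(ttft: Optional[float]) -> str:
    """Like format_ttft, but adds the seconds suffix only to numeric values."""
    return f"{ttft:.2f}s" if ttft is not None else "N/A"
```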


+def find_snapshot_dir():
+    """Return the first (and only) snapshot hash directory."""
+    try:
+        snaps = os.listdir(HF_CACHE_PATH)
+        if snaps:
+            return os.path.join(HF_CACHE_PATH, snaps[0])
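
`os.listdir` returns entries in arbitrary order and includes plain files, so taking `snaps[0]` is only safe while exactly one snapshot directory exists. A more defensive variant might look like the sketch below; `pick_snapshot_dir` is a hypothetical name, and choosing the lexicographically first directory is an assumption made purely for determinism:

```python
import os
from typing import Optional


def pick_snapshot_dir(snapshots_root: str) -> Optional[str]:
    """Return a deterministic snapshot directory, or None if there is none."""
    try:
        entries = os.listdir(snapshots_root)
    except FileNotFoundError:
        return None
    # Ignore stray files such as .DS_Store; keep directories only.
    dirs = sorted(
        e for e in entries if os.path.isdir(os.path.join(snapshots_root, e))
    )
    if not dirs:
        return None
    return os.path.join(snapshots_root, dirs[0])
```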


- Bump mlx-swift-lm submodule to 6c7a0ae (feat/mtp-speculative-decoding) containing native FP8 MoE inference support for Qwen3.6-35B-A3B
- Update profile_runner.py: restore CONFIGS to stream-experts variants, fix CustomFunction return type annotation on kernel closure

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 12 comments.

+    private var mtpBinding: Binding<Bool> {
+        Binding(
+            get: { viewModel.config.enableMTP },
+            set: { newValue in
+                viewModel.config.enableMTP = newValue
+                viewModel.config.save()
+                if currentModelId != nil {
+                    reloadCurrentModel()
+                }
+            }
+        )



                    let stream: AsyncStream<Generation> = try await container.perform { ctx in
-                        try MLXLMCommon.generate(
-                            input: lmInput,
-                            cache: cache,
-                            parameters: params,
-                            context: ctx
-                        )
+                        // MTP speculative decoding path: use MTPTokenIterator when
+                        //   1. The config requests MTP (enableMTP=true)
+                        //   2. The loaded model conforms to MTPLanguageModel
+                        if config.enableMTP, ctx.model is (any MTPLanguageModel) {
+                            return try MLXLMCommon.generateMTP(
+                                input: lmInput,
+                                cache: cache,
+                                parameters: params,
+                                context: ctx,
+                                numMTPTokens: config.numMTPTokens
+                            )
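
The routing rule in this hunk (use the MTP path only when the config asks for it and the model supports it, otherwise fall back) can be mirrored in Python with a runtime-checkable protocol. All class and function names below are hypothetical stand-ins for the Swift types:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class MTPCapable(Protocol):
    """Stand-in for the Swift MTPLanguageModel conformance check."""

    def generate_mtp(self, prompt: str, num_mtp_tokens: int) -> str: ...


class PlainModel:
    def generate(self, prompt: str) -> str:
        return f"std:{prompt}"


class MTPModel(PlainModel):
    def generate_mtp(self, prompt: str, num_mtp_tokens: int) -> str:
        return f"mtp[{num_mtp_tokens}]:{prompt}"


def route(model: PlainModel, prompt: str, enable_mtp: bool,
          num_mtp_tokens: int) -> str:
    """Take the MTP path only if requested AND supported; else fall back."""
    if enable_mtp and isinstance(model, MTPCapable):
        return model.generate_mtp(prompt, num_mtp_tokens)
    return model.generate(prompt)
```

Checking capability at call time, instead of failing at configuration time, is what makes the flag a graceful no-op for models without MTP heads.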


+        // Resolve model directory for profiling (checks HuggingFace cache)
+        let modelDirectory = resolveModelDirectory(modelId: modelId)
+        var mainModelProfile: ModelProfile? = nil
+        if let dir = modelDirectory {
+            mainModelProfile = ModelProfiler.profile(modelDirectory: dir, modelId: modelId)
+
+            // Fix #72 follow-up: If the user passed --stream-experts but the model
+            // is not an MoE, disable the flag early to prevent incorrect memory limits
+            // and erroneous auto-capping of draft tokens.
+            if self.streamExperts, let profile = mainModelProfile, !profile.isMoE {
+                print("[SwiftLM] ⚠️  Model does not support SSD expert streaming (\(profile.modelType) is not MoE). Ignoring --stream-experts flag.")
+                self.streamExperts = false
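
The guard in this hunk reduces to a small pure function, which makes the three cases (flag off, MoE model, non-MoE model) easy to test. A Python sketch with hypothetical names; treating an unresolved profile as "leave the flag unchanged" mirrors the optional binding in the Swift code:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelProfile:
    model_type: str
    is_moe: bool


def effective_stream_experts(requested: bool,
                             profile: Optional[ModelProfile]) -> bool:
    """Drop the stream-experts request when the model is known not to be MoE."""
    if requested and profile is not None and not profile.is_moe:
        return False
    return requested
```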


+                print(f"  > Bypassing abort because Qwen3.6-35B HF repo has duplicated tensor formats.")
+                # continue


+    // 2. HuggingFace cache
+    let slug = "models--" + id.replacingOccurrences(of: "/", with: "--")
+    let base = URL(fileURLWithPath: NSHomeDirectory())
+        .appendingPathComponent(".cache/huggingface/hub/\(slug)/snapshots")
+    if let snap = (try? FileManager.default.contentsOfDirectory(at: base,
+        includingPropertiesForKeys: nil))?.first {
+        return snap
+    }
+    // 3. Return the id as-is (mlx-swift-lm will resolve via HubClient)
+    return URL(fileURLWithPath: id)
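
The slug construction in this hunk follows the Hugging Face hub cache layout (`models--<org>--<name>/snapshots`). The same path logic in Python, as a sketch with a hypothetical function name; like the Swift code, it assumes the default cache location and does not consult environment overrides such as `HF_HOME`:

```python
import posixpath


def hf_cache_snapshots_dir(model_id: str, home: str) -> str:
    """Build the default hub-cache snapshots path for a model id."""
    slug = "models--" + model_id.replace("/", "--")
    return posixpath.join(home, ".cache", "huggingface", "hub", slug, "snapshots")
```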


+//   swift run -c release Gemma4MTPBench --main-model mlx-community/gemma-4-e2b-it-4bit \
+//                                        --asst-model mlx-community/gemma-4-E2B-it-assistant-bf16
+//
+// Safety limits baked in: maxKVSize=512, maxTokens=50, numDraft=2


+
+        guard let asstModel = asstCtx.model as? Gemma4AssistantModel else {
+            print("\n❌ Assistant model is not Gemma4AssistantModel — got \(type(of: asstCtx.model))")
+            Foundation.exit(1)


+        """, terminator: "")
+
+            // Correctness check
+            let baseText = mainCtx.tokenizer.decode(tokenIds: [])  // placeholder


+
+

- ci.yml: Fix ssd-draft-memory-guard [1/3] check. Qwen3.5-2B is not MoE, so stream-experts is disabled before auto-cap fires. Check for draft model reference in server log instead (the real intent is RAM stays bounded).
- Server.swift: Gate ModelProfiler filesystem walk behind streamExperts flag (Copilot: avoids startup I/O overhead on every non-streaming launch).
- profile_runner.py: Fix visualization crash when ttft is 'N/A'; safely handle None ttft in min() and val_str formatting (Copilot: float() crash).
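
The visualization fix mentioned above concerns rows whose `ttft` field is missing or "N/A". The actual patch is not shown in this excerpt; a minimal sketch of one way to make the `min()` call safe, with `min_ttft` as a hypothetical helper operating on result rows shaped like those built in profile_runner.py:

```python
from typing import Dict, List, Optional


def min_ttft(rows: List[Dict]) -> Optional[float]:
    """Smallest numeric TTFT across rows, skipping None and "N/A" entries."""
    values = [
        float(r["ttft"]) for r in rows if r.get("ttft") not in (None, "N/A")
    ]
    return min(values) if values else None
```

Returning `None` when no row has a numeric value lets the caller decide how to render the cell instead of crashing on `min()` of an empty sequence.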

github-actions Bot added 4 commits May 5, 2026 10:23

test(profiling): Update harness tolerance for 35B SSD-streaming

16f9dd7

- Increase server initialization timeout to 300s in profile_runner.py for massive FP8 models.
- Introduce fp8_mtp_harness.py test suite for automated speculative decoding validation.

Copilot AI review requested due to automatic review settings May 7, 2026 06:04

solderzzc added the enhancement New feature or request label May 7, 2026

Copilot started reviewing on behalf of solderzzc May 7, 2026 06:05

Copilot AI reviewed May 7, 2026

github-actions Bot added 2 commits May 7, 2026 23:36

fix(ssd-stream): prevent auto-capping and aggressive memory limits for non-MoE models

23a1ea6

solderzzc mentioned this pull request May 8, 2026

Streams experts not working with draft model? #72

Closed

github-actions Bot added 6 commits May 8, 2026 09:52

chore: update MTP profiler config for Qwen3.6-27B-FP8 and bump mlx-swift-lm

72829c9

docs: add MTP speculative decoding limitations and 27B proof

a273dba

feat: integrate Gemma4 MTP benchmark suite with varying KV context windows

13f2577

feat: display MTP draft acceptance rates in Gemma4MTPBench results

61ab81b

test: align Test 13 MTP benchmark flow with Test 1

9f7a87e

chore: trim default model menu to exclusively contain Gemma 4 variants

12dc118

solderzzc added the bug Something isn't working label May 12, 2026

solderzzc requested a review from Copilot May 12, 2026 19:38

Copilot started reviewing on behalf of solderzzc May 12, 2026 19:39

chore: bump submodules to fix build

099bc91

Copilot AI reviewed May 12, 2026

github-actions Bot added 3 commits May 12, 2026 13:10

Address code quality feedback: fix MTP clamping, profiler, and acceptance rate integration

c7006af

Bump mlx-swift-lm submodule to latest PR commit

b19182e

solderzzc merged commit 79f0ef3 into main May 13, 2026
11 checks passed

solderzzc deleted the feat/mtp-harness-updates branch May 13, 2026 05:35

solderzzc merged 16 commits into main from feat/mtp-harness-updates
