test(profiling): Update harness tolerance for 35B SSD-streaming #103

Merged
solderzzc merged 16 commits into main from feat/mtp-harness-updates
May 13, 2026
Conversation

@solderzzc
Member

@solderzzc solderzzc commented May 7, 2026

Integrates the mlx-swift-lm Gemma 4 MTP stability updates and adds Test 13 to run_benchmark.sh to explicitly benchmark MTP across extreme context window limits.

Resolves #102

github-actions Bot added 4 commits May 5, 2026 10:23
- Add enableMTP (Bool) and numMTPTokens (Int) to GenerationConfig
- InferenceEngine.generate() routes to generateMTP() when both
  config.enableMTP is true and the loaded model conforms to
  MTPLanguageModel; graceful fallback to standard path otherwise
- Added --mtp and --num-mtp-tokens CLI flags to Server.swift
- Automatically injects SWIFTLM_MTP_ENABLE=1 into environment during startup if --mtp is specified
- Exposed MTP configuration to ServerConfig and startup logs
- Refactored MLXLMCommon.generate invocations to call generateMTP() when MTP is enabled and the model conforms to MTPLanguageModel
- Added 'MTP Speculative Decoding' toggle to the Advanced Engine settings pane.
- Added a dynamic slider to configure the number of MTP draft tokens per round (1-5).
- Integrated MTP toggle with the engine auto-reloading mechanism, similar to SSD Streaming.
- Increase server initialization timeout to 300s in profile_runner.py for massive FP8 models.
- Introduce fp8_mtp_harness.py test suite for automated speculative decoding validation.
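
The 300 s initialization timeout in the list above amounts to a bounded readiness poll. The following is a minimal sketch of that idea, not the code in `profile_runner.py`: the `probe` callable and the function name `wait_until_ready` are hypothetical, since the PR does not show how the runner checks that the server is up.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` has elapsed.

    `probe` is a hypothetical caller-supplied check (for example, an HTTP
    request against the server); it is an assumption of this sketch.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # The server is not accepting connections yet; keep polling.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Injecting `clock` and `sleep` keeps the loop testable without waiting for real time to pass, which matters when the timeout is measured in minutes.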
Copilot AI review requested due to automatic review settings May 7, 2026 06:04
@solderzzc solderzzc added the enhancement New feature or request label May 7, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR expands SwiftLM/SwiftBuddy support for MTP (Multi-Token Prediction) speculative decoding and updates profiling tooling to better benchmark large FP8 MoE models.

Changes:

  • Added MTP enablement + token/round configuration paths across the SwiftLM CLI server and SwiftBuddy settings UI (with auto-reload for load-time toggles).
  • Added inference-side MTP generation selection and exposed lightweight last-turn performance metrics (TTFT/prefill/decode throughput).
  • Updated profiling configuration/output handling and introduced a dedicated FP8 MTP benchmark harness script.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
SwiftBuddy/SwiftBuddy/Views/SettingsView.swift Adds MTP toggle + draft-token slider, and auto-reloads model on load-time setting changes with a reloading indicator.
Sources/SwiftLM/Server.swift Adds --mtp / --num-mtp-tokens flags, sets env for MTP at load time, and routes generation through MTP when supported.
Sources/MLXInferenceCore/InferenceEngine.swift Adds MTP generation path selection and publishes last-turn inference metrics.
Sources/MLXInferenceCore/GenerationConfig.swift Persists new MTP settings in the shared generation configuration model.
scripts/profiling/profile_runner.py Updates benchmark config matrix and TTFT reporting behavior for large-model profiling runs.
scripts/profiling/fp8_mtp_harness.py New end-to-end harness to wait for FP8 shard availability, run benchmarks, and validate speedup.


Comment on lines +53 to +61
/// Enable MTP (Multi-Token Prediction) speculative decoding.
/// When true, the inference engine will use the model's internal MTP heads
/// to draft `numMTPTokens` candidate tokens per step, then verify them in
/// a single batched forward pass — targeting 2x+ throughput improvement.
/// Requires a checkpoint that retains `mtp.*` weights (set SWIFTLM_MTP_ENABLE=1
/// at model-load time). No-ops gracefully if the model does not conform to
/// `MTPLanguageModel`.
/// ⚠️ LOAD-TIME flag: changes take effect on the next model load.
public var enableMTP: Bool
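
The doc comment above describes a draft-then-verify loop. As a language-neutral illustration, here is a toy greedy version in Python. It is a sketch of the generic technique, not of `generateMTP()`, whose internals the PR does not show; `draft`, `target`, and `speculative_step` are hypothetical names, and the per-token verification loop stands in for what a real engine would do in one batched forward pass.

```python
from typing import Callable, List, Sequence

# A "model" here is just a function from a token context to the next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(
    context: List[int],
    draft: NextToken,
    target: NextToken,
    num_draft: int,
) -> List[int]:
    """Return the tokens accepted in one greedy draft-then-verify round."""
    # 1. Draft: propose `num_draft` tokens with the cheap predictor.
    proposed: List[int] = []
    for _ in range(num_draft):
        proposed.append(draft(context + proposed))

    # 2. Verify: compare each proposal with the target's own greedy choice.
    #    (A real engine scores all positions in a single batched pass.)
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop this round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft matched: the verification pass yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted
```

Each round emits between 1 and `num_draft + 1` tokens, and the output is identical to plain greedy decoding with `target`; the gain comes only from how many drafts are accepted per verification pass.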
Comment on lines +56 to +60
/// a single batched forward pass — targeting 2x+ throughput improvement.
/// Requires a checkpoint that retains `mtp.*` weights (set SWIFTLM_MTP_ENABLE=1
/// at model-load time). No-ops gracefully if the model does not conform to
/// `MTPLanguageModel`.
/// ⚠️ LOAD-TIME flag: changes take effect on the next model load.
public var prefillToksPerSec: Double
/// Decode throughput — tokens generated per second after the first token.
public var decodeToksPerSec: Double

Comment on lines +351 to 356
toggleRow(
label: "MTP Speculative Decoding", icon: "bolt.horizontal.fill",
isOn: mtpBinding,
tint: SwiftBuddyTheme.accent,
hint: "2x+ throughput using Multi-Token Prediction (auto-reloads model)"
)
Comment thread scripts/profiling/profile_runner.py Outdated
Comment on lines +14 to +19
# Baseline: no extras — establishes raw TPS floor on FP8 dequanted BF16
{"name": "Baseline", "flags": ["--stream-experts"]},
# MTP Speculative — measures speculative gain
{"name": "MTP Speculative", "flags": ["--mtp", "--stream-experts"]},
# MTP + TurboKV — target production config
{"name": "MTP + TurboQuant", "flags": ["--mtp", "--turbo-kv", "--stream-experts"]},
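
The config matrix above exists to measure the gain of each variant over the baseline. A minimal sketch of that comparison, assuming the harness has already collected one tokens-per-second figure per config name (the `speedups` helper and its input shape are hypothetical):

```python
from typing import Dict


def speedups(tps_by_config: Dict[str, float], baseline: str = "Baseline") -> Dict[str, float]:
    """Return each config's TPS as a multiple of the baseline config's TPS."""
    base = tps_by_config[baseline]
    if base <= 0:
        raise ValueError("baseline TPS must be positive")
    return {
        name: tps / base
        for name, tps in tps_by_config.items()
        if name != baseline
    }
```

A value above 1.0 means the config decoded faster than the baseline at the same context size.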
Comment on lines 366 to +378
  if ok:
      results.append({
          "config": config["name"],
          "context": ctx_size,
-         "ttft": f"{ttft:.2f}",
+         "ttft": f"{ttft:.2f}" if ttft is not None else "N/A",
          "tps": f"{tps:.2f}",
          "static_mem": static_mem,
          "os_ram": os_ram,
          "gpu_alloc": f"{gpu_alloc:.1f}",
          "gpu_in_use_peak": f"{peak_in_use:.1f}",
      })
-     print(f" TTFT={ttft:.2f}s TPS={tps:.2f} OS_RAM={os_ram}GB GPU_Alloc={gpu_alloc:.1f}GB GPU_InUse(peak)={peak_in_use:.1f}GB")
+     ttft_str = f"{ttft:.2f}" if ttft is not None else "N/A"
+     print(f" TTFT={ttft_str}s TPS={tps:.2f} OS_RAM={os_ram}GB GPU_Alloc={gpu_alloc:.1f}GB GPU_InUse(peak)={peak_in_use:.1f}GB")
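
The change above repeats the same `None` check in two places. One way to centralize it is a small formatting helper; this is a sketch with hypothetical names (`fmt_metric`, `fmt_seconds`), not code from the PR. Note that the log line in the diff would print `N/As` when TTFT is missing, so the second helper attaches the unit only when a value exists.

```python
from typing import Optional


def fmt_metric(value: Optional[float], digits: int = 2, missing: str = "N/A") -> str:
    """Format a possibly-missing metric with a fixed number of decimals."""
    return missing if value is None else f"{value:.{digits}f}"


def fmt_seconds(value: Optional[float]) -> str:
    """Format a duration, appending the unit only when a value is present."""
    return "N/A" if value is None else f"{value:.2f}s"
```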
Comment on lines +43 to +48
def find_snapshot_dir():
"""Return the first (and only) snapshot hash directory."""
try:
snaps = os.listdir(HF_CACHE_PATH)
if snaps:
return os.path.join(HF_CACHE_PATH, snaps[0])
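
`os.listdir` returns entries in arbitrary order, so `snaps[0]` is only well defined when exactly one snapshot exists. A hedged alternative, assuming the most recently modified snapshot directory is the one wanted (that policy and the name `newest_snapshot_dir` are assumptions of this sketch, not something the PR specifies):

```python
import os
from typing import Optional


def newest_snapshot_dir(snapshots_root: str) -> Optional[str]:
    """Return the most recently modified snapshot directory, or None."""
    try:
        entries = [os.path.join(snapshots_root, name) for name in os.listdir(snapshots_root)]
    except OSError:
        # Root is missing or unreadable: report "no snapshot" instead of raising.
        return None
    dirs = [path for path in entries if os.path.isdir(path)]
    if not dirs:
        return None
    # Tie-break on the path so the result is deterministic for equal mtimes.
    return max(dirs, key=lambda path: (os.path.getmtime(path), path))
```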
github-actions Bot added 2 commits May 7, 2026 23:36
- Bump mlx-swift-lm submodule to 6c7a0ae (feat/mtp-speculative-decoding)
  containing native FP8 MoE inference support for Qwen3.6-35B-A3B
- Update profile_runner.py: restore CONFIGS to stream-experts variants,
  fix CustomFunction return type annotation on kernel closure
@solderzzc solderzzc added the bug Something isn't working label May 12, 2026
@solderzzc solderzzc requested a review from Copilot May 12, 2026 19:38
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 12 comments.

Comment on lines +64 to +74
private var mtpBinding: Binding<Bool> {
Binding(
get: { viewModel.config.enableMTP },
set: { newValue in
viewModel.config.enableMTP = newValue
viewModel.config.save()
if currentModelId != nil {
reloadCurrentModel()
}
}
)
Comment on lines +67 to +72
set: { newValue in
viewModel.config.enableMTP = newValue
viewModel.config.save()
if currentModelId != nil {
reloadCurrentModel()
}
Comment on lines 649 to +660
let stream: AsyncStream<Generation> = try await container.perform { ctx in
try MLXLMCommon.generate(
input: lmInput,
cache: cache,
parameters: params,
context: ctx
)
// MTP speculative decoding path: use MTPTokenIterator when
// 1. The config requests MTP (enableMTP=true)
// 2. The loaded model conforms to MTPLanguageModel
if config.enableMTP, ctx.model is (any MTPLanguageModel) {
return try MLXLMCommon.generateMTP(
input: lmInput,
cache: cache,
parameters: params,
context: ctx,
numMTPTokens: config.numMTPTokens
)
Comment thread Sources/SwiftLM/Server.swift Outdated
Comment on lines +335 to +346
// Resolve model directory for profiling (checks HuggingFace cache)
let modelDirectory = resolveModelDirectory(modelId: modelId)
var mainModelProfile: ModelProfile? = nil
if let dir = modelDirectory {
mainModelProfile = ModelProfiler.profile(modelDirectory: dir, modelId: modelId)

// Fix #72 follow-up: If the user passed --stream-experts but the model
// is not an MoE, disable the flag early to prevent incorrect memory limits
// and erroneous auto-capping of draft tokens.
if self.streamExperts, let profile = mainModelProfile, !profile.isMoE {
print("[SwiftLM] ⚠️ Model does not support SSD expert streaming (\(profile.modelType) is not MoE). Ignoring --stream-experts flag.")
self.streamExperts = false
Comment on lines +315 to +316
print(f" > Bypassing abort because Qwen3.6-35B HF repo has duplicated tensor formats.")
# continue
Comment on lines +66 to +75
// 2. HuggingFace cache
let slug = "models--" + id.replacingOccurrences(of: "/", with: "--")
let base = URL(fileURLWithPath: NSHomeDirectory())
.appendingPathComponent(".cache/huggingface/hub/\(slug)/snapshots")
if let snap = (try? FileManager.default.contentsOfDirectory(at: base,
includingPropertiesForKeys: nil))?.first {
return snap
}
// 3. Return the id as-is (mlx-swift-lm will resolve via HubClient)
return URL(fileURLWithPath: id)
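
For reference, the cache-path mapping used in the Swift snippet above (a `models--` prefix, with `/` replaced by `--`) can be mirrored on the Python side of the tooling. This is a sketch under the same assumption as the Swift code, namely the default `~/.cache/huggingface/hub` location; it does not account for environment overrides such as `HF_HOME`. The name `hf_snapshots_dir` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def hf_snapshots_dir(model_id: str, home: Optional[Path] = None) -> Path:
    """Map a Hub model id to its default local snapshots directory."""
    slug = "models--" + model_id.replace("/", "--")
    base = home if home is not None else Path.home()
    return base / ".cache" / "huggingface" / "hub" / slug / "snapshots"
```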
// swift run -c release Gemma4MTPBench --main-model mlx-community/gemma-4-e2b-it-4bit \
// --asst-model mlx-community/gemma-4-E2B-it-assistant-bf16
//
// Safety limits baked in: maxKVSize=512, maxTokens=50, numDraft=2

guard let asstModel = asstCtx.model as? Gemma4AssistantModel else {
print("\n❌ Assistant model is not Gemma4AssistantModel — got \(type(of: asstCtx.model))")
Foundation.exit(1)
Comment thread Sources/Gemma4MTPBench/main.swift Outdated
""", terminator: "")

// Correctness check
let baseText = mainCtx.tokenizer.decode(tokenIds: []) // placeholder
Comment thread run_benchmark.sh
Comment on lines +224 to +225


github-actions Bot added 3 commits May 12, 2026 13:10
- ci.yml: Fix ssd-draft-memory-guard [1/3] check — Qwen3.5-2B is not MoE
  so stream-experts is disabled before auto-cap fires. Check for draft model
  reference in server log instead (the real intent is RAM stays bounded).
- Server.swift: Gate ModelProfiler filesystem walk behind streamExperts flag
  (Copilot: avoids startup I/O overhead on every non-streaming launch).
- profile_runner.py: Fix visualization crash when ttft is 'N/A' — safely
  handle None ttft in min() and val_str formatting (Copilot: float() crash).
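
The visualization fix described above comes down to two guards: never pass `'N/A'` to `float()`, and never pass `None` to `min()`. A minimal sketch of both, with hypothetical helper names (the actual patch to `profile_runner.py` is not shown in this conversation):

```python
from typing import Iterable, Optional


def parse_metric(text: object) -> Optional[float]:
    """Convert a stored metric such as '1.23' to float; 'N/A' or None become None."""
    try:
        return float(text)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return None


def min_present(values: Iterable[Optional[float]]) -> Optional[float]:
    """Return the smallest non-None value, or None if there is none."""
    present = [v for v in values if v is not None]
    return min(present) if present else None
```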
@solderzzc solderzzc merged commit 79f0ef3 into main May 13, 2026
11 checks passed
@solderzzc solderzzc deleted the feat/mtp-harness-updates branch May 13, 2026 05:35

Labels

bug Something isn't working enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Feature Request: Integrate MTP Speculative Decoding (MTPLX-style) for 2x+ Speedup

2 participants