You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
But two perf paths are not yet wired into qwen35moe:
DFlash spec-decode with MoE target. PR Split MoE weights between CPU & CUDA, support qwen35moe models #262 includes commit 7965190 feat: hybrid MoE spec-decode with DFlash draft model, but tested against dflash-draft-3.6-q8_0.gguf (dense 27B draft) the daemon hangs at chat CACHE — draft/target arch mismatch suspected. Needs a draft model matched to qwen35moe (or a verifier path that tolerates the mismatch).
PFlash prefill compression on the MoE forward path. PR Split MoE weights between CPU & CUDA, support qwen35moe models #262 does not touch pflash. Hybrid prefill compute is already partly CPU-bound on cold experts; PFlash compression could cut prompt-side cost further but needs validation it composes with the hybrid storage layer.
Goals
DFlash spec-decode end-to-end on qwen35moe (no hang, ≥1.5× decode on a long-form prompt)
PFlash composition with hybrid expert placement (correctness + perf vs no-PFlash baseline)
Sweep budget × spec-decode on/off × pflash on/off; identify the dominant configuration for lucebox shipping
Acceptance
test_dflash passes with arch=qwen35moe + --draft <matched_draft> (no chat CACHE hang)
bench_he or luce-bench --area ds4-eval ≥ baseline on the matched config
Document the recommended (budget, draft, pflash) combo for lucebox appliance shipping
Context
PR #262 (howard0su) lands hybrid CPU/CUDA expert placement for
qwen35moearch. End-to-end works; greedy decode correct; perf beats author's table on RTX 3090 (2.5-3× across the budget sweep). Validated on lucebox2 withQwen3.6-35B-A3B-UD-Q4_K_M.gguf.Bench (essay prompt, 400 gen tokens, t=0, RTX 3090 + Strix Halo CPU):
But two perf paths are not yet wired into qwen35moe:
7965190 feat: hybrid MoE spec-decode with DFlash draft model, but tested againstdflash-draft-3.6-q8_0.gguf(dense 27B draft) the daemon hangs atchat CACHE— draft/target arch mismatch suspected. Needs a draft model matched to qwen35moe (or a verifier path that tolerates the mismatch).Goals
qwen35moe(no hang, ≥1.5× decode on a long-form prompt)Acceptance
test_dflashpasses witharch=qwen35moe + --draft <matched_draft>(nochat CACHEhang)bench_heorluce-bench --area ds4-eval≥ baseline on the matched config(budget, draft, pflash)combo for lucebox appliance shippingRefs
project_megaqwen3_27b_dflash,project_dflash_pflash_inproc,project_lucebox_product