Expose ngram speculative decoding #268

Draft
leehack wants to merge 2 commits into main from issue-190-speculative-ngram

leehack commented Jul 5, 2026

Owner

Summary

  • Add SpeculativeDecodingStrategy.ngramSimple and SpeculativeDecodingConfig.ngramSimple for llama.cpp.
  • Route llama.cpp generation through ngram-simple when the native bundle exports llama_dart_ngram_* symbols.
  • Fail loudly for unsupported combinations, including LiteRT-LM ngram requests and llama.cpp MTP configs that pass ngramSize.
  • Document ngram-simple behavior, add test matrix coverage, and add ngram-only benchmark mode.
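The general idea behind n-gram drafting can be sketched as follows. This is a minimal illustration of prompt-lookup style drafting, not the llama.cpp or llamadart implementation: the function name `ngram_draft` and its parameters are hypothetical, and the real matching rules may differ. It assumes the drafter looks for the most recent earlier occurrence of the last `ngram_size` tokens and proposes the tokens that followed it.

```python
from typing import List, Sequence


def ngram_draft(history: Sequence[int], ngram_size: int, draft_max: int) -> List[int]:
    """Propose up to draft_max tokens by reusing what followed an earlier
    occurrence of the last ngram_size tokens (hypothetical helper)."""
    if ngram_size <= 0 or draft_max <= 0 or len(history) <= ngram_size:
        return []
    key = tuple(history[-ngram_size:])
    # Walk backwards so the most recent earlier match wins; the final
    # ngram_size tokens themselves are excluded from the search.
    for start in range(len(history) - ngram_size - 1, -1, -1):
        if tuple(history[start:start + ngram_size]) == key:
            follow = start + ngram_size
            return list(history[follow:follow + draft_max])
    return []  # no match: nothing to draft, fall back to normal decoding


if __name__ == "__main__":
    tokens = [1, 2, 3, 4, 1, 2]
    print(ngram_draft(tokens, ngram_size=2, draft_max=2))
    print(ngram_draft(tokens, ngram_size=4, draft_max=2))
```

A longer n-gram needs a longer exact repeat in the context before it can draft anything, so on short or non-repetitive prompts it may produce no drafts at all.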

Validation

  • dart format --output=none --set-exit-if-changed on touched Dart files
  • dart analyze on touched Dart sources and tests
  • dart test test/unit/core/models/inference/generation_params_test.dart test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/tooling/test_matrix_test.dart test/unit/backends/llama_cpp/llama_cpp_service_test.dart test/integration/backends/llama_cpp/native_symbol_integration_test.dart
  • python3 tools/build.py apple --target macos-arm64 in leehack/llamadart-native worktree
  • python3 tools/validate_exports.py --format nm --tool nm bin/macos/arm64/libllamadart.dylib in leehack/llamadart-native worktree
  • Real-model consumer smoke with Qwen2.5 0.5B GGUF using rebuilt local native dylib: ngram-simple drafted tokens, invalid MTP ngramSize was rejected, maxTokens and small-batch cap boundaries passed
  • Ngram-only benchmark smoke: baseline plus ngram_simple only, no MTP cases

Blocking before ready

  • Depends on llamadart-native#23 ("Add ngram speculative wrapper exports") landing and on a published native artifact that exports llama_dart_ngram_*.
  • After that artifact exists, regenerate/update bindings from the released native headers, update hook/build.dart pin and checksums, and rerun the real-model smoke through the pinned bundle.

Related

github-actions Bot commented Jul 5, 2026

Contributor

Chat app preview deployed for dbd013b.

leehack commented Jul 5, 2026

Owner Author

Expanded local speculative benchmark pass on macOS arm64 / CPU, using the rebuilt local native dylib from leehack/llamadart-native#23.

Models/configs tested:

| Model | Strategy/config | Median wall tok/s | Acceptance | Result |
| --- | --- | --- | --- | --- |
| qwen2.5-0.5b-instruct-q4_k_m.gguf | baseline, repeated raw prompt | 172.65 | n/a | baseline |
| qwen2.5-0.5b-instruct-q4_k_m.gguf | ngramSimple n=1 draft=1 | 180.41 | 100% | +4.5% |
| qwen2.5-0.5b-instruct-q4_k_m.gguf | ngramSimple n=1 draft=2 | 193.79 | 100% | +12.2% |
| qwen2.5-0.5b-instruct-q4_k_m.gguf | ngramSimple n=1 draft=4 | 193.32 | 100% | +12.0% on repeated prompt |
| qwen2.5-0.5b-instruct-q4_k_m.gguf | ngramSimple n=4 draft=1/2/4 | ~172 | no drafts | no benefit |
| Qwen3.5-0.8B-Q4_K_M.gguf | baseline, repeated raw prompt | 133.82 | n/a | baseline |
| Qwen3.5-0.8B-Q4_K_M.gguf | ngramSimple n=1 draft=2/4 | ~160 | 100% | about +20% |
| Llama-3.2-1B-Instruct-Q4_K_M.gguf | ngramSimple n=1 draft=2/4 | ~142 | 100% | about +25%, but baseline run was noisy |
| stories15M.gguf | ngramSimple n=1 draft=1/2/4 | 1462 / 1340 / 1329 | 23% / 12% / 6% | draft=1 neutral, larger drafts slower |
| qwen2.5-0.5b-instruct-q4_k_m.gguf | ngramSimple n=1, natural prompt | 132 / 136 / 116 | 50% / 57% / 36% | slower than 154 tok/s baseline |
| gemma-4-E2B-it-Q4_K_S.gguf + mtp-gemma-4-E2B-it.gguf | MTP draft=1/2 | 29.7 / 37.4 | 43% / 34% | slower than 51 tok/s baseline on CPU |
| gemma-4-E2B-it-Q4_K_S.gguf | MTP without draft model | n/a | n/a | rejected: target model does not contain MTP layers |
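For readers checking the Result column, the relative change appears to be the candidate throughput divided by the baseline throughput, minus one. The sketch below assumes that definition; `speedup_percent` is a hypothetical helper, not part of the benchmark tooling in this PR.

```python
def speedup_percent(baseline_tps: float, candidate_tps: float) -> float:
    """Relative throughput change versus baseline, in percent."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (candidate_tps / baseline_tps - 1.0) * 100.0


if __name__ == "__main__":
    # qwen2.5-0.5b rows on the repeated prompt (values taken from the table).
    baseline = 172.65
    for draft, tps in ((1, 180.41), (2, 193.79), (4, 193.32)):
        print(f"draft={draft}: {speedup_percent(baseline, tps):+.1f}%")
```

A negative result means the speculative configuration was slower than baseline, as in the natural-prompt and MTP rows.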

Important blocker found during the expanded pass:

  • Deterministic parity is not proven for ngramSimple draft=4. On qwen2.5-0.5b with the natural prompt, baseline, draft=1, and draft=2 matched exactly, but draft=4 diverged at character 227 even with deterministic sampling settings (the exact hashes and settings are listed in the correction comment below).
  • This PR should stay draft until we either fix that parity issue, deliberately constrain ngramSimple to proven-safe caps, or document output divergence as an accepted semantic tradeoff.

Upstream references checked:

leehack commented Jul 5, 2026

Owner Author

Correction to the benchmark note above: zsh stripped the inline backtick snippets before posting.

Exact parity blocker values:

  • baseline / ngram draft=1 / ngram draft=2 output hash: 2f80026c
  • ngram draft=4 output hash: ec0928db
  • first differing character: 227
  • settings: temp 0.0, topK 1, seed 7, maxTokens 48, qwen2.5-0.5b-instruct-q4_k_m.gguf, CPU, local rebuilt native dylib from llamadart-native#23
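A parity check of this kind can be sketched as a short digest plus the index of the first differing character. The hash function used for the values above is not stated in this thread, so CRC32 here is purely an assumption chosen because it also yields 8 hex digits; `short_hash` and `first_difference` are hypothetical names.

```python
import zlib
from typing import Optional


def short_hash(text: str) -> str:
    """8-hex-digit digest of the output text (CRC32 is an assumption)."""
    return format(zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF, "08x")


def first_difference(a: str, b: str) -> Optional[int]:
    """Index of the first differing character, or None if the strings match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One output is a strict prefix of the other.
        return min(len(a), len(b))
    return None


if __name__ == "__main__":
    baseline = "the quick brown fox"
    speculative = "the quick brown cat"
    print(short_hash(baseline), short_hash(speculative))
    print(first_difference(baseline, speculative))
```

Comparing hashes is enough to detect divergence; the differing index is what makes the mismatch debuggable.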

The conclusion is unchanged: Dart PR #268 should stay draft until the ngram draft=4 deterministic parity issue is fixed, intentionally constrained, or documented as an accepted semantic tradeoff.

leehack commented Jul 5, 2026

Owner Author

Updated the issue-190 ngram fix on commit dbd013bfe.

What changed:

  • ngram-simple now captures a llama.cpp sequence checkpoint before verifying a speculative batch.
  • If any draft tail is rejected, it restores the checkpoint, removes target memory after the original position, and replays only the selected token plus accepted drafts before continuing.
  • Replay/rollback failures now fail loudly instead of silently continuing.
  • draftTokenMax > 2 with repeat penalties is rejected with LlamaUnsupportedException; use GenerationParams.penalty: 1.0 for deeper ngram-simple drafts or keep draftTokenMax <= 2.
  • README, website docs, Dartdoc, and the local speculative benchmark matrix now document/use that penalty: 1.0 requirement.
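The checkpoint, verify, and replay flow described above can be sketched with a toy sequence. This is not the PR's Dart or native code: `ToySequence`, `verify_step`, and the `next_token` callback are hypothetical stand-ins, and the sketch assumes greedy verification where a draft is accepted only if it equals the token the target would have produced.

```python
from typing import Callable, List, Sequence


class ToySequence:
    """Stand-in for target-model sequence memory (hypothetical)."""

    def __init__(self, tokens: Sequence[int]) -> None:
        self.tokens: List[int] = list(tokens)

    def checkpoint(self) -> int:
        return len(self.tokens)

    def restore(self, mark: int) -> None:
        if mark > len(self.tokens):
            raise RuntimeError("checkpoint is past the end of the sequence")
        del self.tokens[mark:]  # drop everything after the original position


def verify_step(seq: ToySequence, selected: int, drafts: Sequence[int],
                next_token: Callable[[List[int]], int]) -> int:
    """Verify one speculative batch; return the number of accepted drafts."""
    mark = seq.checkpoint()                # capture before verifying
    seq.tokens.extend([selected, *drafts])  # speculative batch enters memory
    context = seq.tokens[:mark] + [selected]
    accepted = 0
    for draft in drafts:
        if next_token(context) != draft:
            break                           # first mismatch rejects the tail
        context.append(draft)
        accepted += 1
    if accepted < len(drafts):
        seq.restore(mark)                   # roll back to the checkpoint
        seq.tokens.extend([selected, *drafts[:accepted]])  # replay kept tokens
        if seq.tokens != context:
            raise RuntimeError("replay did not reproduce the accepted prefix")
    return accepted


if __name__ == "__main__":
    seq = ToySequence([1, 2])
    count = verify_step(seq, 3, [4, 5, 9, 10], lambda ctx: ctx[-1] + 1)
    print(count, seq.tokens)
```

Raising on a failed restore or replay mirrors the "fail loudly" behavior in the list above, rather than continuing with inconsistent state.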

Local validation:

  • dart format --output=none --set-exit-if-changed . PASS (example chat app still reports existing missing flutter_lints package-resolution warnings, 0 files changed).
  • dart analyze lib test tool/testing PASS exit 0; two existing unrelated WebGPU info lints remain.
  • dart test test/unit/tooling/test_matrix_test.dart test/unit/core/models/inference/generation_params_test.dart --reporter compact PASS.
  • dart test -p vm -j 1 --exclude-tags local-only --reporter compact PASS: 1202 passed, 67 skipped.
  • git diff --check PASS.
  • Docs link validation previously passed for the website doc edits: ./tool/docs/validate_links.sh PASS.

Real-model runtime validation with the local llamadart-native wrapper worktree override:

  • Llama 3.2 1B Q4_K_M, penalty=1.0: baseline, ngram draft1, draft2, draft4 all matched token hash 663985d7; draft4 accepted 8/60 draft tokens.
  • Qwen2.5 0.5B Q4_K_M, penalty=1.0: baseline, ngram draft1, draft2, draft4 all matched token hash 7580a2c9; draft4 accepted 36/44 draft tokens.
  • Default repeat penalty with draftTokenMax > 2: throws before generation with the intended unsupported-path message.

Native/upstream boundary:

  • Native PR llamadart-native#23 ("Add ngram speculative wrapper exports") remains wrapper-symbol-only and clean; no vendored llama.cpp edit is needed here.
  • The repeat-penalty/deep-ngram limitation appears tied to upstream sampler rollback: upstream llama_sampler_penalties stores prev and token_count, penalty application uses token_count, but the upstream clone path copies prev only. That is worth referencing upstream separately rather than patching vendored llama.cpp in this repo.
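To illustrate the class of problem described in that last point, here is a toy sampler whose penalty decision reads a count table while one clone path copies only the token history. This is not the upstream llama.cpp code; `ToyPenalties` and its methods are hypothetical and exist only to show why a partial clone could change penalty behavior after a rollback.

```python
import copy
from collections import Counter
from typing import List


class ToyPenalties:
    """Toy repeat-penalty state: a token history plus per-token counts."""

    def __init__(self) -> None:
        self.prev: List[int] = []
        self.token_count: Counter = Counter()

    def accept(self, token: int) -> None:
        self.prev.append(token)
        self.token_count[token] += 1

    def is_penalized(self, token: int) -> bool:
        # The decision reads token_count, not prev.
        return self.token_count[token] > 0

    def clone_prev_only(self) -> "ToyPenalties":
        other = ToyPenalties()
        other.prev = list(self.prev)  # token_count is left empty
        return other

    def clone_full(self) -> "ToyPenalties":
        return copy.deepcopy(self)


if __name__ == "__main__":
    sampler = ToyPenalties()
    sampler.accept(7)
    print(sampler.clone_prev_only().is_penalized(7))
    print(sampler.clone_full().is_penalized(7))
```

In this toy, the partial clone forgets that token 7 was seen, so a sampler restored from it would score repeats differently from the original. That is consistent with requiring penalty: 1.0 for deeper drafts, since a disabled penalty makes the missing counts irrelevant.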

Current state: PR CI is running on dbd013bfe; keeping this PR draft until the CI run completes and the native wrapper artifact dependency is resolved.

leehack commented Jul 5, 2026

Owner Author

Follow-up: CI is now green on dbd013bfe.

All PR checks passed, including Analyze & Lint, Docs Build Check, Test Linux VM with Coverage, Test Web (Chrome), Test Native on macOS and Windows, Native Prompt Reuse Parity, both companion package jobs, the chat app PR preview, and the LiteRT-LM smoke jobs.

The PR is still draft only because the runtime feature depends on landing/publishing the native wrapper symbols from leehack/llamadart-native#23.
