Expose ngram speculative decoding by leehack · Pull Request #268 · leehack/llamadart

leehack · 2026-07-05T00:36:07Z

Summary

Add SpeculativeDecodingStrategy.ngramSimple and SpeculativeDecodingConfig.ngramSimple for llama.cpp.
Route llama.cpp generation through ngram-simple when the native bundle exports llama_dart_ngram_* symbols.
Fail loudly for unsupported combinations, including LiteRT-LM ngram requests and llama.cpp MTP configs that pass ngramSize.
Document ngram-simple behavior, add test matrix coverage, and add ngram-only benchmark mode.
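
For readers new to the strategy, here is a minimal sketch of how ngram-simple drafting is understood to work: find the most recent earlier occurrence of the trailing n-gram in the token history and propose the tokens that followed it. The function name and signature are hypothetical and written in Python for illustration; this is not the llama.cpp or llamadart implementation.

```python
from typing import List, Sequence


def ngram_simple_draft(history: Sequence[int], ngram_size: int, draft_max: int) -> List[int]:
    """Propose up to draft_max tokens by reusing what followed the most
    recent earlier occurrence of the trailing n-gram (illustrative only)."""
    if ngram_size < 1 or draft_max < 1 or len(history) <= ngram_size:
        return []
    key = tuple(history[-ngram_size:])
    # Scan backwards over earlier positions, skipping the trailing n-gram itself.
    for start in range(len(history) - ngram_size - 1, -1, -1):
        if tuple(history[start:start + ngram_size]) == key:
            follow = start + ngram_size
            return list(history[follow:follow + draft_max])
    return []  # no earlier match, so nothing is drafted


# A repeated pattern yields drafts for a short n-gram but not for a long one.
tokens = [5, 6, 7, 8, 5, 6]
short_key_draft = ngram_simple_draft(tokens, ngram_size=2, draft_max=2)
long_key_draft = ngram_simple_draft(tokens, ngram_size=4, draft_max=2)
```

A longer n-gram needs a longer exact repeat in the history before any draft is produced, so larger ngramSize values draft less often.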

Validation

dart format --output=none --set-exit-if-changed on touched Dart files
dart analyze on touched Dart sources and tests
dart test test/unit/core/models/inference/generation_params_test.dart test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/tooling/test_matrix_test.dart test/unit/backends/llama_cpp/llama_cpp_service_test.dart test/integration/backends/llama_cpp/native_symbol_integration_test.dart
python3 tools/build.py apple --target macos-arm64 in leehack/llamadart-native worktree
python3 tools/validate_exports.py --format nm --tool nm bin/macos/arm64/libllamadart.dylib in leehack/llamadart-native worktree
Real-model consumer smoke with Qwen2.5 0.5B GGUF using rebuilt local native dylib: ngram-simple drafted tokens, invalid MTP ngramSize was rejected, maxTokens and small-batch cap boundaries passed
Ngram-only benchmark smoke: baseline plus ngram_simple only, no MTP cases
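
The export validation step amounts to checking that the built library defines symbols with the llama_dart_ngram_ prefix. A rough sketch of that check over `nm`-style text output follows; it is not the real tools/validate_exports.py, and the symbol names in the sample are hypothetical placeholders.

```python
from typing import List

REQUIRED_PREFIX = "llama_dart_ngram_"


def exported_ngram_symbols(nm_output: str, prefix: str = REQUIRED_PREFIX) -> List[str]:
    """Return defined external text symbols whose name starts with prefix."""
    found = []
    for line in nm_output.splitlines():
        parts = line.split()
        # Defined symbols print as "<address> <type> <name>"; undefined ones have no address.
        if len(parts) != 3:
            continue
        _, kind, name = parts
        # Mach-O prepends an underscore to C symbol names.
        if name.startswith("_"):
            name = name[1:]
        if kind == "T" and name.startswith(prefix):
            found.append(name)
    return sorted(found)


# Hypothetical sample output; the real symbol names are not listed in this PR.
SAMPLE_NM_OUTPUT = """\
0000000000004a10 T _llama_dart_ngram_example_init
0000000000004b20 T _llama_dart_ngram_example_free
                 U _malloc
0000000000004c30 t _internal_helper
"""

symbols = exported_ngram_symbols(SAMPLE_NM_OUTPUT)
```

An empty result would mean the bundle cannot serve ngram-simple requests.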

Blocking before ready

Depends on leehack/llamadart-native#23 (Add ngram speculative wrapper exports) landing and on a published native artifact that exports llama_dart_ngram_*.
After that artifact exists, regenerate/update bindings from the released native headers, update hook/build.dart pin and checksums, and rerun the real-model smoke through the pinned bundle.

Partially addresses #190 (Expose llama.cpp speculative decoding strategies beyond MTP) by adding the first non-MTP llama.cpp strategy.

github-actions · 2026-07-05T00:37:50Z

Chat app preview deployed for dbd013b.

App: https://leehack-llamadart-chat-pr-268.static.hf.space
Space: https://huggingface.co/spaces/leehack/llamadart-chat-pr-268
Repo: leehack/llamadart-chat-pr-268

leehack · 2026-07-05T00:51:31Z

Expanded local speculative benchmark pass on macOS arm64 / CPU, using the rebuilt local native dylib from leehack/llamadart-native#23.

Models/configs tested:

Model	Strategy/config	Median wall tok/s	Acceptance	Result
qwen2.5-0.5b-instruct-q4_k_m.gguf	baseline, repeated raw prompt	172.65	n/a	baseline
qwen2.5-0.5b-instruct-q4_k_m.gguf	ngramSimple n=1 draft=1	180.41	100%	+4.5%
qwen2.5-0.5b-instruct-q4_k_m.gguf	ngramSimple n=1 draft=2	193.79	100%	+12.2%
qwen2.5-0.5b-instruct-q4_k_m.gguf	ngramSimple n=1 draft=4	193.32	100%	+12.0% on repeated prompt
qwen2.5-0.5b-instruct-q4_k_m.gguf	ngramSimple n=4 draft=1/2/4	~172 tok/s	no drafts	no benefit
Qwen3.5-0.8B-Q4_K_M.gguf	baseline, repeated raw prompt	133.82	n/a	baseline
Qwen3.5-0.8B-Q4_K_M.gguf	ngramSimple n=1 draft=2/4	~160 tok/s	100%	about +20%
Llama-3.2-1B-Instruct-Q4_K_M.gguf	ngramSimple n=1 draft=2/4	~142 tok/s	100%	about +25%, but baseline run was noisy
stories15M.gguf	ngramSimple n=1 draft=1/2/4	1462 / 1340 / 1329 tok/s	23% / 12% / 6%	draft=1 neutral, larger drafts slower
qwen2.5-0.5b-instruct-q4_k_m.gguf	ngramSimple n=1, natural prompt	132 / 136 / 116 tok/s	50% / 57% / 36%	slower than 154 tok/s baseline
gemma-4-E2B-it-Q4_K_S.gguf + mtp-gemma-4-E2B-it.gguf	MTP draft=1/2	29.7 / 37.4 tok/s	43% / 34%	slower than 51 tok/s baseline on CPU
gemma-4-E2B-it-Q4_K_S.gguf	MTP without draft model	n/a	n/a	rejected: target model does not contain MTP layers
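
The percentages in the Result column follow from the median throughput figures. A small sketch of that arithmetic, using a hypothetical helper name and the medians from the table:

```python
def relative_speedup_pct(baseline_tps: float, candidate_tps: float) -> float:
    """Percentage change in tokens per second relative to the baseline."""
    return (candidate_tps / baseline_tps - 1.0) * 100.0


# qwen2.5-0.5b, repeated raw prompt, baseline median 172.65 tok/s.
draft1 = round(relative_speedup_pct(172.65, 180.41), 1)
draft2 = round(relative_speedup_pct(172.65, 193.79), 1)
draft4 = round(relative_speedup_pct(172.65, 193.32), 1)

# gemma-4 MTP draft=1 against the roughly 51 tok/s CPU baseline.
mtp_draft1 = round(relative_speedup_pct(51.0, 29.7), 1)
```

A negative value, as in the MTP case, means speculation cost more than it saved on that run.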

Important blocker found during the expanded pass:

Deterministic parity is not proven for ngramSimple draft=4. On qwen2.5-0.5b with the natural prompt, the baseline, draft=1, and draft=2 outputs matched exactly (hash 2f80026c), but draft=4 diverged (hash ec0928db) at character 227 even with temp 0.0 and topK 1.
This PR should stay draft until we either fix that parity issue, deliberately constrain ngramSimple to proven-safe caps, or document output divergence as an accepted semantic tradeoff.

Upstream references checked:

llama.cpp speculative docs: https://github.com/ggml-org/llama.cpp/blob/2d973636e292ee6f75fadcf08d29cb33511f509f/docs/speculative.md
strategy/default params: https://github.com/ggml-org/llama.cpp/blob/2d973636e292ee6f75fadcf08d29cb33511f509f/common/common.h#L167-L179 and https://github.com/ggml-org/llama.cpp/blob/2d973636e292ee6f75fadcf08d29cb33511f509f/common/common.h#L323-L380
ngram-simple implementation: https://github.com/ggml-org/llama.cpp/blob/2d973636e292ee6f75fadcf08d29cb33511f509f/common/ngram-map.cpp#L49-L112 and https://github.com/ggml-org/llama.cpp/blob/2d973636e292ee6f75fadcf08d29cb33511f509f/common/speculative.cpp#L1648-L1695
sampler verification helper: https://github.com/ggml-org/llama.cpp/blob/2d973636e292ee6f75fadcf08d29cb33511f509f/common/sampling.cpp#L624-L656
server speculative verification flow: https://github.com/ggml-org/llama.cpp/blob/2d973636e292ee6f75fadcf08d29cb33511f509f/tools/server/server-context.cpp#L3823-L3905

leehack · 2026-07-05T00:51:43Z

Correction to the benchmark note above: zsh stripped the inline backtick snippets before posting.

Exact parity blocker values:

baseline / ngram draft=1 / ngram draft=2 output hash: 2f80026c
ngram draft=4 output hash: ec0928db
first differing character: 227
settings: temp 0.0, topK 1, seed 7, maxTokens 48, qwen2.5-0.5b-instruct-q4_k_m.gguf, CPU, local rebuilt native dylib from llamadart-native#23

The conclusion is unchanged: Dart PR #268 should stay draft until the ngram draft=4 deterministic parity issue is fixed, intentionally constrained, or documented as an accepted semantic tradeoff.
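
A parity check of this kind can be sketched as comparing a short hash of each output and locating the first differing character. The hash function used for the values above is not stated in this thread; the sketch assumes CRC32 purely for illustration, and both helper names are hypothetical.

```python
import zlib
from typing import Optional


def short_hash(text: str) -> str:
    """8-hex-digit fingerprint of the output (CRC32 is an assumption here)."""
    return format(zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF, "08x")


def first_divergence(a: str, b: str) -> Optional[int]:
    """Index of the first differing character, or None if the strings are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One output is a strict prefix of the other.
        return min(len(a), len(b))
    return None


baseline_out = "the quick brown fox"
speculative_out = "the quick brown cat"
same = short_hash(baseline_out) == short_hash(speculative_out)
where = first_divergence(baseline_out, speculative_out)
```

With greedy settings such as temp 0.0 and topK 1, any mismatch points at the speculative path rather than at sampling randomness.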

leehack · 2026-07-05T01:55:29Z

Updated the issue-190 ngram fix on commit dbd013bfe.

What changed:

ngram-simple now captures a llama.cpp sequence checkpoint before verifying a speculative batch.
If any draft tail is rejected, it restores the checkpoint, removes target memory after the original position, and replays only the selected token plus accepted drafts before continuing.
Replay/rollback failures now fail loudly instead of silently continuing.
draftTokenMax > 2 with repeat penalties is rejected with LlamaUnsupportedException; use GenerationParams.penalty: 1.0 for deeper ngram-simple drafts or keep draftTokenMax <= 2.
README, website docs, Dartdoc, and the local speculative benchmark matrix now document/use that penalty: 1.0 requirement.
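
The checkpoint, rollback, and replay flow described above can be modelled in a few lines. This is a toy sketch under stated assumptions: target memory is a plain list of tokens, the target is a greedy next-token function, and every name is hypothetical. It is not the wrapper code from this PR.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def verify_with_rollback(memory: List[int], selected: int, drafts: Sequence[int],
                         target_next: NextToken) -> Tuple[List[int], int]:
    """Verify drafts against a greedy target; roll back and replay on rejection.

    Returns the updated memory and the number of accepted draft tokens.
    """
    checkpoint = list(memory)               # capture state before the speculative batch
    origin = len(memory)
    memory = memory + [selected, *drafts]   # the speculative batch enters target memory

    accepted = 0
    for i, draft in enumerate(drafts):
        # Target prediction given everything up to, but excluding, this draft.
        if target_next(memory[:origin + 1 + i]) != draft:
            break
        accepted += 1

    if accepted < len(drafts):
        memory = list(checkpoint)           # restore the checkpoint
        del memory[origin:]                 # no-op in this toy; mirrors trimming target memory
        memory += [selected, *drafts[:accepted]]  # replay only selected + accepted drafts
        if len(memory) != origin + 1 + accepted:
            # Fail loudly instead of continuing with inconsistent state.
            raise RuntimeError("replay left target memory at an unexpected position")
    return memory, accepted


def counting_target(context: Sequence[int]) -> int:
    """Toy greedy target: always predicts the previous token plus one, modulo 10."""
    return (context[-1] + 1) % 10


memory_after, accepted_count = verify_with_rollback([1, 2], 3, [4, 5, 9], counting_target)
```

The point of the replay is that rejected draft tokens never remain in target memory, so later steps see exactly the context a non-speculative run would have seen.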

Local validation:

dart format --output=none --set-exit-if-changed . PASS (example chat app still reports existing missing flutter_lints package-resolution warnings, 0 files changed).
dart analyze lib test tool/testing PASS exit 0; two existing unrelated WebGPU info lints remain.
dart test test/unit/tooling/test_matrix_test.dart test/unit/core/models/inference/generation_params_test.dart --reporter compact PASS.
dart test -p vm -j 1 --exclude-tags local-only --reporter compact PASS: 1202 passed, 67 skipped.
git diff --check PASS.
Docs link validation previously passed for the website doc edits: ./tool/docs/validate_links.sh PASS.

Real-model runtime validation with the local llamadart-native wrapper worktree override:

Llama 3.2 1B Q4_K_M, penalty=1.0: baseline, ngram draft1, draft2, draft4 all matched token hash 663985d7; draft4 accepted 8/60 draft tokens.
Qwen2.5 0.5B Q4_K_M, penalty=1.0: baseline, ngram draft1, draft2, draft4 all matched token hash 7580a2c9; draft4 accepted 36/44 draft tokens.
Default repeat penalty with draftTokenMax > 2: throws before generation with the intended unsupported-path message.
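
The guard exercised by the last item can be pictured as a pre-generation check. The sketch below is a Python analogue with hypothetical names; the real code raises LlamaUnsupportedException from Dart, and its exact condition and message may differ.

```python
class UnsupportedConfigError(ValueError):
    """Stand-in for LlamaUnsupportedException in this illustration."""


def check_ngram_simple_config(draft_token_max: int, repeat_penalty: float) -> None:
    """Reject deep ngram-simple drafts when a repeat penalty is active."""
    if draft_token_max > 2 and repeat_penalty != 1.0:
        raise UnsupportedConfigError(
            "ngram-simple with draftTokenMax > 2 requires penalty 1.0; "
            "disable the repeat penalty or keep draftTokenMax <= 2"
        )


check_ngram_simple_config(draft_token_max=2, repeat_penalty=1.1)  # allowed
check_ngram_simple_config(draft_token_max=4, repeat_penalty=1.0)  # allowed

try:
    check_ngram_simple_config(draft_token_max=4, repeat_penalty=1.1)
    rejected = False
except UnsupportedConfigError:
    rejected = True
```

Checking before generation starts keeps the failure loud and avoids producing output that could silently diverge from the baseline.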

Native/upstream boundary:

Native PR leehack/llamadart-native#23 (Add ngram speculative wrapper exports) remains wrapper-symbol-only and clean; no vendored llama.cpp edit is needed here.
The repeat-penalty/deep-ngram limitation appears tied to upstream sampler rollback: upstream llama_sampler_penalties stores prev and token_count, penalty application uses token_count, but the upstream clone path copies prev only. That is worth referencing upstream separately rather than patching vendored llama.cpp in this repo.
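
To make the suspected failure mode concrete, here is a toy model of a penalties state that tracks both a history window and per-token counts, with one clone that copies only the window. It models the behaviour described above under that assumption and is not upstream llama.cpp code; all names are hypothetical.

```python
from collections import Counter, deque


class ToyPenalties:
    """Toy repeat-penalty state: a bounded history plus per-token counts."""

    def __init__(self, last_n: int) -> None:
        self.prev = deque(maxlen=last_n)
        self.token_count = Counter()

    def accept(self, token: int) -> None:
        # Forget the oldest token's count when the window is full.
        if len(self.prev) == self.prev.maxlen:
            evicted = self.prev[0]
            self.token_count[evicted] -= 1
            if self.token_count[evicted] == 0:
                del self.token_count[evicted]
        self.prev.append(token)
        self.token_count[token] += 1

    def is_penalized(self, token: int) -> bool:
        # Penalty decisions consult the counts, not the history window.
        return self.token_count[token] > 0

    def clone_prev_only(self) -> "ToyPenalties":
        other = ToyPenalties(self.prev.maxlen)
        other.prev = deque(self.prev, maxlen=self.prev.maxlen)
        return other  # counts are left empty, as in the described clone path

    def clone_full(self) -> "ToyPenalties":
        other = self.clone_prev_only()
        other.token_count = Counter(self.token_count)
        return other


state = ToyPenalties(last_n=4)
for tok in (7, 7, 3):
    state.accept(tok)

partial = state.clone_prev_only()
full = state.clone_full()
```

In this model the partial clone carries the same history but no longer penalizes token 7, which is the kind of mismatch that would let a rolled-back sampler diverge from the baseline.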

Current state: PR CI is running on dbd013bfe; keeping this PR draft until the CI run completes and the native wrapper artifact dependency is resolved.

leehack · 2026-07-05T02:03:38Z

Follow-up: CI is now green on dbd013bfe.

All PR checks passed, including Analyze & Lint, Docs Build Check, Test Linux VM with Coverage, Test Web (Chrome), Test Native on macOS and Windows, Native Prompt Reuse Parity, both companion package jobs, the chat app PR preview, and the LiteRT-LM smoke jobs.

The PR is still draft only because the runtime feature depends on landing/publishing the native wrapper symbols from leehack/llamadart-native#23.

Expose ngram speculative decoding

69d6e0d

Stabilize ngram speculative rollback

dbd013b

leehack wants to merge 2 commits into main from issue-190-speculative-ngram
