docs(bonsai): post-mortem + ANE decode-state lessons#163

Merged: john-rocky merged 1 commit into main from docs/bonsai-postmortem on Apr 30, 2026

@john-rocky (Owner)

Summary

The Qwen3 architecture support shipped in #162 came out of an investigation into porting prism-ml/Ternary-Bonsai-1.7B to ANE. Bonsai itself didn't ship: its per-(row, block) ternary scales can't be represented faithfully by ANEC. Per-block LUT palettization fails with error -14, and any stock-API approximation collapses the scales into a rank-1 outer product. This PR captures the post-mortem and the reusable lessons from the failed port.
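To make the rank-1 collapse concrete, here is a minimal sketch in plain Python. It is not code from this investigation: the 2×2 scale grids and the helpers `rank1_fit` and `max_rel_error` are hypothetical, and the fit is ordinary alternating least squares, not anything ANEC or the stock conversion API does internally. It only illustrates that a grid of independent per-(row, block) scales generally cannot be written as `row_scale[i] * block_scale[j]`.

```python
# Toy illustration only: these grids are made-up values, not Bonsai's scales.

def rank1_fit(scales, iters=50):
    """Least-squares rank-1 fit u[i] * v[j] to a scale grid (alternating updates)."""
    rows, blocks = len(scales), len(scales[0])
    u = [1.0] * rows
    v = [1.0] * blocks
    for _ in range(iters):
        # Fix v, solve for the best per-row factor u.
        vv = sum(x * x for x in v)
        u = [sum(scales[i][j] * v[j] for j in range(blocks)) / vv
             for i in range(rows)]
        # Fix u, solve for the best per-block factor v.
        uu = sum(x * x for x in u)
        v = [sum(scales[i][j] * u[i] for i in range(rows)) / uu
             for j in range(blocks)]
    return u, v


def max_rel_error(scales, u, v):
    """Largest relative error of the outer product u[i] * v[j] against the grid."""
    return max(
        abs(scales[i][j] - u[i] * v[j]) / scales[i][j]
        for i in range(len(scales))
        for j in range(len(scales[0]))
    )


if __name__ == "__main__":
    # Already an outer product: (1, 3) x (1, 2). A rank-1 form loses nothing.
    separable = [[1.0, 2.0], [3.0, 6.0]]
    # Rows disagree about which block is large: no outer product reproduces it.
    independent = [[1.0, 4.0], [4.0, 1.0]]
    for name, grid in (("separable", separable), ("independent", independent)):
        u, v = rank1_fit(grid)
        print(name, "max relative error:", max_rel_error(grid, u, v))
```

A grid that already factorizes survives the approximation. One whose rows disagree about which block is large does not, which is the failure mode described above.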

What lands

  • docs/TERNARY_BONSAI.md — full post-mortem: what was tried, why each path failed, what the right path is (mlx-lm with prism-ml/Ternary-Bonsai-1.7B-mlx-2bit).
  • docs/DECODE_STATE_LAYOUTS.md: ANE decode-state catalog. The headline finding is that per-step decode cost on ANE is O(state_length) and is not bound by weight bandwidth: halving ctx 2048→1024 gave a 2.56× speedup, while halving weights INT8→INT4 at the same ctx gave only +12%. Includes the mask-based rotating buffer pattern, palettization traps, and a ternary-on-ANE checklist.
  • docs/GEMMA4_ROTATING_BUFFER_PORT.md — design note for applying the mask-based rotating buffer to Gemma 4's full-attention layers.
  • docs/NEXT_MODELS.md — shortlist: Qwen3-1.7B, Gemma 3 4B QAT, Llama-3.2, SmolLM3.
  • docs/ADDING_MODELS.md — new §4.5 KV state-layout checklist.
  • docs/ANE_OPTIMIZATION_SURVEY.md — cross-ref to the ctx > weight-bandwidth finding.
  • conversion/experiments/bonsai/ — 8 research scripts (oracle, ternary surgery, SWA comparisons, decode-chunks builder) kept as breadcrumbs.
  • conversion/config.py — NOTE comment in MODEL_REGISTRY explaining why Bonsai is intentionally absent.
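For readers unfamiliar with the term, one plausible reading of a mask-based rotating buffer is sketched below in plain Python. This is an assumption, not the implementation documented in docs/DECODE_STATE_LAYOUTS.md: the function names `one_hot`, `rotating_write`, and `valid_mask` are hypothetical, and each slot holds a single float where a real KV state would hold a tensor. The idea shown is that a fixed-length state is updated by blending with a one-hot mask at `step % window` instead of an indexed store, so the state shape never changes between decode steps.

```python
# Hypothetical sketch: a fixed-size state updated by mask blending.

def one_hot(slot, window):
    """Mask with 1.0 at `slot` and 0.0 elsewhere."""
    return [1.0 if i == slot else 0.0 for i in range(window)]


def rotating_write(state, new_value, step):
    """Return a new state with `new_value` blended into slot step % window."""
    window = len(state)
    mask = one_hot(step % window, window)
    # Elementwise blend: keep old entries where mask is 0, overwrite where it is 1.
    return [s * (1.0 - m) + new_value * m for s, m in zip(state, mask)]


def valid_mask(step, window):
    """1.0 for slots holding a real token after writes 0..step, else 0.0."""
    filled = min(step + 1, window)
    return [1.0 if i < filled else 0.0 for i in range(window)]


if __name__ == "__main__":
    window = 4
    state = [0.0] * window
    for step, value in enumerate([10.0, 11.0, 12.0, 13.0, 14.0, 15.0]):
        state = rotating_write(state, value, step)
        print(step, state, valid_mask(step, window))
```

Once more than `window` tokens have been written, the oldest entries are overwritten in place and every slot is valid, so the per-step cost stays tied to the fixed window rather than to the full history.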

Extracted from feat/qwen3-bonsai-investigation (commit 56ee545). Companion to #162 (Qwen3 architecture).

Test plan

  • Doc links resolve (no broken cross-refs between the four new docs)
  • python conversion/convert.py --list still works (config.py NOTE is a Python comment, no syntax change)
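The link check in the test plan could be automated with something like the sketch below. It is not part of this PR: `broken_links` is a hypothetical helper, and it assumes the docs use inline Markdown links of the form `[text](relative/path.md)` and that only relative file targets need checking (external URLs and pure `#anchor` links are skipped, and anchors inside files are not validated).

```python
import re
from pathlib import Path

# Inline Markdown links: [text](target). Reference-style links are not handled.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_links(doc_dir):
    """Return (file name, target) pairs for relative links to missing files."""
    broken = []
    for md in sorted(Path(doc_dir).glob("*.md")):
        text = md.read_text(encoding="utf-8")
        for target in LINK_RE.findall(text):
            # Skip external URLs, mail links, and same-page anchors.
            if "://" in target or target.startswith(("#", "mailto:")):
                continue
            path = target.split("#", 1)[0]  # drop any trailing #anchor
            if path and not (md.parent / path).exists():
                broken.append((md.name, target))
    return broken


if __name__ == "__main__":
    for name, target in broken_links("docs"):
        print(f"{name}: broken link -> {target}")
```

Run from the repository root, it prints nothing when every relative link in `docs/*.md` points at an existing file.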

Adds the post-mortem for the ternary Bonsai investigation and the
ANE decode-path lessons it produced. The Qwen3 architecture support
that came out of the same investigation lands separately.

Why Bonsai didn't ship:
- prism-ml/Ternary-Bonsai-1.7B's compression depends on per-(row,
  block) independent scales (g=64). ANEC rejects that LUT granularity
  with error -14, and the stock-API per-block approximation factorizes
  scales into a rank-1 outer product, defeating the model's design.
- For Apple Silicon, the GPU path (mlx-lm with the official
  Ternary-Bonsai-1.7B-mlx-2bit) is the only honest option.

What lands as reusable infrastructure:
- docs/TERNARY_BONSAI.md: full post-mortem (what was tried, why each
  failed, what the right path is)
- docs/DECODE_STATE_LAYOUTS.md: ANE decode-state catalog — mask-based
  rotating buffer pattern, ctx > weight bandwidth result, palettization
  traps, ternary-on-ANE checklist
- docs/GEMMA4_ROTATING_BUFFER_PORT.md: design note for porting our
  mask-based rotating buffer to Gemma 4's full-attention layers
- docs/NEXT_MODELS.md: shortlist (Qwen3-1.7B, Gemma 3 4B QAT,
  Llama-3.2, SmolLM3) for the next port
- docs/ADDING_MODELS.md: §4.5 KV state-layout checklist
- docs/ANE_OPTIMIZATION_SURVEY.md: cross-reference to the ctx>weights finding
- conversion/experiments/bonsai/: research scripts (oracle, ternary
  surgery, SWA comparisons, decode-chunks builder) retained as
  breadcrumbs in case anyone retraces the path
- conversion/config.py: NOTE comment in MODEL_REGISTRY explaining why
  Bonsai is intentionally absent and pointing readers to the doc

Extracted from feat/qwen3-bonsai-investigation (commit 56ee545).
@john-rocky john-rocky merged commit 48a245c into main Apr 30, 2026
1 check passed