feat(gemma4-e4b): multimodal CoreML bundle (text+image+video+audio on iPhone) #165
Merged
added 7 commits
May 3, 2026 11:00
Phase A1-A3 of the E4B optimization stack. Brings the stage2-e4b 4-chunk
foundation (Phase 1 stateful + Phase 2a cross-turn KV) onto current main
and adds 3-chunk merged + multifunction prefill_bN support for E4B —
the lever that gave E2B its 33.4 tok/s iPhone 17 Pro decode.
Converter side
- SWAStatefulMergedChunk23{,Prefill,Single,PrefillSingle} accept
own_range / shared_range; defaults remain E2B (own=L8-14, shared=
L15-24) for back-compat. E4B passes (12,24)/(24,33) derived from
compute_chunk_boundaries(config) — kv13/kv14 names are kept as
legacy aliases for the (sliding,full) producer slots.
- build_gemma4_e2b_stateful_3chunks.py: drops the "E2B only"
hardcoded help; --model gemma4-e4b now produces a 3-chunk merged
bundle (chunk_1 L0-11 / chunk_2 L12-32 merged / chunk_3 L33-41 +
lm_head). Chunk-2 layout printed dynamically.
- sanity_stateful_chunks.py: from stage2-e4b — adds --model preset
so /tmp/gemma4-{e2b,e4b}-stateful chunks share one verifier.
Bundle side
- scripts/assemble_gemma4_stateful_e4b.sh: from stage2-e4b — pulls
chunk_*.mlmodelc + legacy E4B sidecars into the bundle layout
Gemma4StatefulEngine expects (subdir gemma4_e2b_stateful_chunks/
is intentionally shared across E2B/E4B; engine reads hidden /
layers / HKV from model_config.json).
Runtime side (Swift)
- ModelDownloader.swift: gemma4e4bStateful + gemma4e4bStatefulLinear
ModelInfo entries (slots 6/7 under LLM_SHOW_EXPERIMENTAL=1).
downloadURL is intentionally blank — A6 will fill in the new
mlboydaisuke/gemma-4-E4B-stateful-coreml repo URL once iPhone 17
Pro A/B clears. Existing mlboydaisuke/gemma-4-E4B-coreml legacy
repo is untouched, preserving the dual-repo pattern E2B uses.
- LLMRunner.swift: stateful detection comment now lists all four
folders that share the gemma4_e2b_stateful_chunks/ layout.
Build artefacts (A4) and iPhone validation (A5) follow.
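
The `own_range` / `shared_range` values quoted under "Converter side" can be checked with a little arithmetic. The sketch below is illustrative only: `MergedLayout` and `describe` are hypothetical names, not the converter's API, and the ranges are treated as half-open `(start, end)` pairs, which is what makes `(12,24)/(24,33)` line up with `L12-23` / `L24-32`.

```python
from typing import NamedTuple, Tuple


class MergedLayout(NamedTuple):
    """Hypothetical container for the two layer ranges of the merged middle chunk."""
    own: Tuple[int, int]     # half-open [start, end), passed as own_range
    shared: Tuple[int, int]  # half-open [start, end), passed as shared_range


# Values quoted in the commit message.
E2B = MergedLayout(own=(8, 15), shared=(15, 25))
E4B = MergedLayout(own=(12, 24), shared=(24, 33))


def describe(layout: MergedLayout) -> str:
    """Render the ranges in the inclusive L<first>-<last> form used in the messages."""
    own_n = layout.own[1] - layout.own[0]
    shared_n = layout.shared[1] - layout.shared[0]
    return (
        f"own=L{layout.own[0]}-{layout.own[1] - 1} ({own_n}) "
        f"shared=L{layout.shared[0]}-{layout.shared[1] - 1} ({shared_n}) "
        f"total={own_n + shared_n}"
    )


print(describe(E2B))
print(describe(E4B))
```

The totals (17 layers for E2B, 21 for E4B) match the merged-chunk sizes mentioned elsewhere in this PR.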
Stage 8 builder (PR #149) already used `compute_chunk_boundaries` for chunk_1 / chunk_3 windows but called `convert_chunk2_merged_prefill` without `own_range` / `shared_range`, so on E4B the merged middle chunk silently used E2B's L8-14 / L15-24 layer ranges instead of L12-23 / L24-32. After A3 made the converter parametric, plumb the ranges through and refresh the docstring + the stale "we don't ship E4B stateful yet" comment.
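
A guard of the following shape would turn this kind of silent default into a loud failure. It is a minimal sketch under stated assumptions: `check_merged_ranges` is a hypothetical helper (not part of the builder), ranges are half-open, and the chunk_1 / chunk_3 boundaries are the E4B values quoted earlier (chunk_1 ends before L12, chunk_3 starts at L33).

```python
from typing import Tuple

Range = Tuple[int, int]  # half-open [start, end)


def check_merged_ranges(chunk1_end: int, own: Range, shared: Range, chunk3_start: int) -> int:
    """Return the merged layer count, or raise if the ranges leave a gap or overlap.

    Hypothetical helper: verifies that chunk_1, own, shared and chunk_3
    cover the layer stack contiguously.
    """
    if own[0] != chunk1_end:
        raise ValueError(f"own_range starts at L{own[0]}, chunk_1 ends before L{chunk1_end}")
    if own[1] != shared[0]:
        raise ValueError(f"own_range ends at L{own[1]}, shared_range starts at L{shared[0]}")
    if shared[1] != chunk3_start:
        raise ValueError(f"shared_range ends at L{shared[1]}, chunk_3 starts at L{chunk3_start}")
    return shared[1] - own[0]


# E4B boundaries with E4B ranges: passes and reports 21 merged layers.
print(check_merged_ranges(12, (12, 24), (24, 33), 33))

# E4B boundaries with the E2B defaults: rejected instead of silently accepted.
try:
    check_merged_ranges(12, (8, 15), (15, 25), 33)
except ValueError as err:
    print("rejected:", err)
```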
…ateful
Captures the design intel + chosen architecture (Option A: separate Gemma4StatefulMultimodalEngine class) so the next session can pick up without re-deriving. Records:
- Phase A scope (already shipped on this branch as 4665ab2 + 2655c17)
- Phase B engine class layout (storage, public API, helper port list)
- State bridge code path (probe-2-verified nested withMultiArray closures + memcpy)
- Generate flow for image+text prompts (T=288 prefill → bridge → decode)
- Bundle layout for new HF repos gemma-4-{E2B,E4B}-stateful-multimodal-coreml
- Open questions (picker naming, default-swap timing, cross-turn KV with re-encoded image features)
- Build commands for the Mac compile run
PyPI wheel ships .so files referencing @rpath/lib*.dylib that aren't included; on macOS 26 (Darwin 25 / Tahoe) this silently produces an empty pybind11 module so every conversion script crashes at "BlobWriter not loaded". Captures the fresh-venv + source-build steps that get a working /tmp/ct_build_venv to unblock builds until upstream ships fixed wheels.
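
Because the failure mode is an import that succeeds but yields an empty module, a quick probe before running a conversion script can save time. The sketch below is generic and hedged: the commit message does not name the package, so the `coremltools.libmilstoragepython` / `_BlobStorageWriter` pair mentioned in the usage note is an assumption to verify against the installed version.

```python
import importlib


def native_symbol_available(module_name: str, symbol: str) -> bool:
    """Return True only if the module imports AND actually exposes the symbol.

    An "empty" extension module imports without error but lacks its
    symbols, so a successful import alone is not enough.
    """
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Covers ImportError as well as loader errors from missing dylibs.
        return False
    return hasattr(module, symbol)
```

Assuming the names above are right for the installed package, `native_symbol_available("coremltools.libmilstoragepython", "_BlobStorageWriter")` returning `False` inside the venv would indicate the broken-wheel condition described here.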
… iPhone)
Working configuration for iPhone 17 Pro at 15.7 tok/s decode + correct
output across all four input modalities. Validated 2026-05-03 on a clean
sandbox push of the assembled bundle.
Topology:
  decode  = Topology II (chunk1 legacy + chunk2_3way + chunk3_3way merged
            21-layer middle + final lm_head). Auto-detected by
            ChunkedEngine via chunk2_3way/chunk3_3way presence.
  prefill = legacy chunks 1/2/3/4 prefill_b8 multifunction. Vision-aware
            bidirectional mask within image span via the engine's
            existing fillBatchMasksVisionAware (works at T=8 batches).
  vision  = vision.ane.mlmodelc (E4B, output [1, 256, 2560]).
  audio   = audio.mlmodelc (E4B, output [1, 50, 1024]) + Swift two-stage
            projection 1024 -> 1536 -> 2560.
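
The vision-aware mask used by the prefill path can be pictured as a causal mask with one exception: positions inside the image span may also attend to each other. The sketch below shows that idea only. `vision_aware_mask` is a hypothetical function, not the engine's `fillBatchMasksVisionAware`, and it assumes an additive mask (0 for visible, negative infinity for hidden) whose columns stop at the end of the current batch, so a query can only see later image tokens that are already in the batch.

```python
from typing import List, Tuple

NEG_INF = float("-inf")


def vision_aware_mask(q_start: int, batch: int, image_span: Tuple[int, int]) -> List[List[float]]:
    """Build additive mask rows for queries q_start .. q_start + batch - 1.

    Columns cover key positions 0 .. q_start + batch - 1.
    image_span is half-open [lo, hi) in absolute token positions.
    """
    lo, hi = image_span
    width = q_start + batch
    rows = []
    for q in range(q_start, q_start + batch):
        row = []
        for k in range(width):
            causal = k <= q
            same_image = lo <= q < hi and lo <= k < hi
            row.append(0.0 if (causal or same_image) else NEG_INF)
        rows.append(row)
    return rows


# Four tokens, image occupying positions 1 and 2.
for row in vision_aware_mask(0, 4, (1, 3)):
    print(row)
```

In the printed example, the row for position 1 can see position 2 (same image) but not position 3 (later text), while the text rows stay strictly causal.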
Changes:
- Sources/CoreMLLLM/AudioProcessor.swift: ProjectionWeights now derives
inDim/outDim/finalDim from weight tensor sizes (was hard-coded for
E2B's square 1536x1536 embed_proj). E4B's embed_proj is non-square
(2560, 1536); the embed_proj sgemm now uses finalDim for the output
dimension. Direct cause of the audio gibberish on E4B.
- conversion/models/gemma4_swa_merged.py: MergedChunk23 (the
non-stateful merged chunk2+chunk3 used by Topology II) now accepts
own_range / shared_range; defaults stay at E2B (L8-14 / L15-24).
Mirrors the stateful generalisation from 4665ab2.
- conversion/build_gemma4_3way.py: thread compute_chunk_boundaries(cfg)
through to MergedChunk23 so `--model gemma4-e4b` produces a 21-layer
chunk2_3way (L12-23 own + L24-32 shared) instead of the E2B-hardcoded
17-layer span.
- scripts/assemble_gemma4_e4b_multimodal.sh: reproducible bundle
assembly script (compiles mlpackage->mlmodelc, copies sidecars +
legacy chunks + E4B encoders).
- docs/E4B_MULTIMODAL_BUILD.md: build + sideload guide, including the
rejected paths (prefill_chunk* multifunction, stateful) and the
iPhone clean-sandbox requirement (devicectl never deletes orphans).
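
The `ProjectionWeights` change in the first bullet boils down to reading the dimensions off the weights instead of assuming a square second stage. The sketch below illustrates that in plain Python. It is not the Swift implementation: `derive_dims`, `linear` and `project` are hypothetical names, the encoder width is assumed to be known up front, weights are assumed to be flat row-major `(out, in)` buffers, and anything else the real projection applies is omitted.

```python
from typing import List, Sequence, Tuple


def derive_dims(in_dim: int, proj_count: int, embed_count: int) -> Tuple[int, int, int]:
    """Derive (inDim, outDim, finalDim) from flat weight element counts."""
    if proj_count % in_dim:
        raise ValueError("first-stage weight size is not a multiple of in_dim")
    out_dim = proj_count // in_dim
    if embed_count % out_dim:
        raise ValueError("embed_proj weight size is not a multiple of out_dim")
    return in_dim, out_dim, embed_count // out_dim


def linear(x: Sequence[float], w: Sequence[float], out_dim: int, in_dim: int) -> List[float]:
    """y = W x for a flat row-major weight of shape (out_dim, in_dim)."""
    return [sum(w[o * in_dim + i] * x[i] for i in range(in_dim)) for o in range(out_dim)]


def project(x: Sequence[float], w1: Sequence[float], w2: Sequence[float]) -> List[float]:
    """Two-stage projection; the second stage uses final_dim, not out_dim, for its output."""
    in_dim, out_dim, final_dim = derive_dims(len(x), len(w1), len(w2))
    hidden = linear(x, w1, out_dim, in_dim)
    return linear(hidden, w2, final_dim, out_dim)


# E4B shapes from the commit message: 1024 -> 1536 -> 2560.
print(derive_dims(1024, 1024 * 1536, 1536 * 2560))
```

With a square second stage `finalDim` equals `outDim`, which is why hard-coding the output size went unnoticed until a non-square `embed_proj` appeared.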
Out of scope (in this commit):
- Stateful Stage 8 engine — separate commit, Mac-only / iPhone-blocked.
- prefill_chunk{1..4}.mlmodelc multifunction path — built and tested
but produces broken output on iPhone with E4B (Mac OK); not shipped.
- vision_video.mlmodelc — engine falls back to 2x2 pool of vision
encoder; quality validated.
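
For the video fallback, a 2x2 pool over the vision encoder output can be sketched as follows. Several details here are assumptions rather than facts from this PR: that the 256 vision tokens form a 16x16 row-major grid, and that the pool is an average (the message only says "2x2 pool"). `pool_2x2` is a hypothetical name.

```python
from typing import List


def pool_2x2(tokens: List[List[float]], grid: int = 16) -> List[List[float]]:
    """Average each 2x2 block of a grid x grid token map (row-major).

    Input: grid*grid tokens of equal width. Output: (grid/2)^2 tokens.
    """
    if len(tokens) != grid * grid or grid % 2:
        raise ValueError("expected an even grid with grid*grid tokens")
    dim = len(tokens[0])
    pooled = []
    for r in range(0, grid, 2):
        for c in range(0, grid, 2):
            block = [tokens[(r + dr) * grid + (c + dc)] for dr in (0, 1) for dc in (0, 1)]
            pooled.append([sum(t[d] for t in block) / 4.0 for d in range(dim)])
    return pooled


# Toy input: 256 one-dimensional tokens whose value is their index.
demo = pool_2x2([[float(i)] for i in range(256)])
print(len(demo), demo[0], demo[1])
```

Under these assumptions each frame contributes 64 tokens instead of 256.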
…c dev / iPhone blocked
Stage 8 follow-up to the stateful Linear shipment. Adds a parallel
engine that drives Gemma 4 stateful (3-chunk merged + Linear) with
T=288 single-function prefill chunks + the Stage 6 vision/audio
splice. The engine class works end-to-end on Mac (text decode
16.5 tok/s; assembled bundle drives image + audio splice through the
T=288 batched prefill with bidirectional within-image mask).
iPhone status: BLOCKED. Multiple converter paths attempted, all hit
the same iPhone ANE 18 MIL->EIR translation failure on chunk_2 (the
merged 21-layer middle chunk):
- 3-chunk merged stateful with kv13/kv14 alias output: std::bad_cast
- .clone() patch on the alias output assignment: same error
- 4-chunk decode split (chunk_2_own + chunk_2_shared): same error,
confirming the alias-slice-over-MLState pattern is the root cause
rather than graph size.
The non-stateful 3-way merged chunk2_3way (same 21 layers, but K/V
flow as plain tensor inputs/outputs — no MLState alias) compiles and
runs on iPhone ANE 18 at 15.7 tok/s, confirming the diagnosis.
Code keeps the stateful path for Mac development and future revisits
(stateful + multifunction T=288 might unlock once iPhone ANE picks up
multifunction T>1 + dual MLState; not on iOS 18).
Files:
- Sources/CoreMLLLM/Gemma4StatefulMultimodalEngine.swift (NEW)
~880-line dimension-agnostic stateful engine. 3-chunk merged
decode + 4-state MLState (decode/prefill x s1/s2) + bridgeKVState
via withMultiArray nested closures + ported Stage 6 multimodal
helpers (vision/video/audio splice + vision-aware bidir mask +
cross-turn LCP-resume). Padding-replicate scheme keeps
auto-emitted token at row T-1 valid even when validCount < T.
- Sources/CoreMLLLM/ModelDownloader.swift
gemma4e2bStatefulMultimodal + gemma4e4bStatefulMultimodal
ModelInfo entries (sideload-only, exposed under
LLM_SHOW_EXPERIMENTAL=1).
- Examples/CoreMLLLMChat/CoreMLLLMChat/LLMRunner.swift
Detection (prefill_T288/ subdir presence) routes to the new
engine; load + generate + image/audio caching mirror the existing
gemma4Stateful pattern.
- Sources/gemma4mm-smoke/main.swift (NEW)
Mac CLI smoke test for the stateful multimodal engine.
- Package.swift: gemma4mm-smoke executable target.
- scripts/assemble_gemma4_stateful_multimodal.sh (NEW)
Reproducible bundle assembly (decode 3-chunk + prefill_T288/
subdir + multimodal encoders).
- conversion/build_gemma4_stateful_singlefunc_prefill.py
Adds --four-chunk variant (used during the chunk_2 split probe).
- conversion/models/gemma4_swa_stateful_chunks.py
.clone() on the kv13/kv14 producer alias output (decode + prefill
T=N variants). Materialises the slice over MLState into a fresh
tensor; ineffective vs the iPhone ANE bug but not regressive.
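
The cross-turn LCP-resume helper listed for the new engine rests on a simple idea: compare the tokens already reflected in the KV cache with the new prompt and only prefill what differs. The sketch below shows that idea in isolation. `lcp_length` and `plan_resume` are hypothetical names, and the rule of always re-running at least one token (so the next step has fresh logits) is an assumption, not something the commit states.

```python
from typing import List, Sequence, Tuple


def lcp_length(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def plan_resume(cached: Sequence[int], prompt: Sequence[int]) -> Tuple[int, List[int]]:
    """Return (positions of KV to keep, tokens that still need prefill)."""
    keep = lcp_length(cached, prompt)
    # Assumption: if the whole prompt is already cached, re-run its last token.
    if keep == len(prompt) and keep > 0:
        keep -= 1
    return keep, list(prompt[keep:])


print(plan_resume([1, 2, 3, 4], [1, 2, 3, 9, 9]))
```

Re-encoded image features complicate this comparison, which is one of the open questions recorded in the design note above.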
…hared scheme
Wires the HF-uploaded multimodal bundle into the in-app picker flow so users can download `mlboydaisuke/gemma-4-E4B-multimodal-coreml` with one tap (no sideload required).
ModelDownloader.swift:
- New `gemma4e4bMultimodal` ModelInfo entry (id `gemma4-e4b-multimodal`, size 7.6 GB, downloadURL points at the new HF repo). Shared `folderName: "gemma4-e4b"` with the legacy text-only entry mirrors the gemma4e2b3way / gemma4e2b pattern: chunks 1-4 are byte-identical in both repos, so users who switch between entries reuse the on-disk legacy chunks and only fetch the new files.
- `gemma4e4b` (text-only) renamed to "Gemma 4 E4B (text-only)" to disambiguate from the new multimodal entry in the picker.
- New `buildE4BMultimodalFileList()` enumerates 58 files matching the HF repo tree (decode chunks 1-4 + chunk2_3way + chunk3_3way + vision.ane.mlmodelc + audio.mlmodelc + audio sidecars + text sidecars). Splits files into legacyChunk (no metadata.json) vs newerMlc (with metadata.json) helpers; the legacy chunks were built before the metadata.json convention.
- Defaults list inserts `gemma4e4bMultimodal` ahead of `gemma4e4b` so the picker presents multimodal as the primary E4B option.
CoreMLLLMChat.xcscheme:
- Add `LLM_VISION_FORCE_ANE=1` to the shared scheme. Safe to default: it only affects models whose bundle ships a `vision.ane.mlmodelc` (the new E4B multimodal entry); other models silently fall through to their existing GPU `vision.mlmodelc`.
- Add `LLM_SHOW_EXPERIMENTAL=1`. Required to expose the experimental picker entries (already documented in `ModelDownloader.swift`'s `defaults`).
- Drop `LLM_PROFILE_EVERY_STEP=1` from the shared scheme; debug-only, belongs in a developer's local copy.
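
The shared-folder reuse and the legacy/newer split described for `ModelDownloader.swift` can be sketched in a few lines. Everything named here is illustrative: the helper names are Python stand-ins for the Swift helpers, and the inner file names in `CORE_FILES` are an assumption about what a compiled model directory holds, not the repo's actual file list. The only difference taken from the commit is that newer bundles also carry `metadata.json`.

```python
from typing import Iterable, List, Set

# Assumed inner files of a compiled model directory (illustrative only).
CORE_FILES = ("coremldata.bin", "model.mil", "weights/weight.bin")


def legacy_chunk(name: str) -> List[str]:
    """Files for a model built before the metadata.json convention."""
    return [f"{name}.mlmodelc/{f}" for f in CORE_FILES]


def newer_mlc(name: str) -> List[str]:
    """Files for a newer model, which also ships metadata.json."""
    return legacy_chunk(name) + [f"{name}.mlmodelc/metadata.json"]


def files_to_fetch(wanted: Iterable[str], on_disk: Set[str]) -> List[str]:
    """Only download what the shared folder does not already hold."""
    return [f for f in wanted if f not in on_disk]


legacy = [f for i in range(1, 5) for f in legacy_chunk(f"chunk{i}")]
wanted = legacy + newer_mlc("chunk2_3way") + newer_mlc("chunk3_3way")

# A user who already downloaded the text-only entry has the legacy chunks.
print(files_to_fetch(wanted, set(legacy)))
```

Because both entries resolve to the same folder, switching from the text-only entry to the multimodal one fetches only the files missing from disk.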
83a1a1a to 5f5d71a
Summary
Ships Gemma 4 E4B multimodal Core ML bundle validated on iPhone 17 Pro at 15.7 tok/s text decode + correct outputs across text / image / video / audio. HF artefact: `mlboydaisuke/gemma-4-E4B-multimodal-coreml` (uploading separately, ~7.6 GB).

Uses Topology II 3-chunk merged decode (already in `ChunkedEngine`) plus the legacy 4-chunk `prefill_b8` multifunction for batched prefill. Vision is ANE-targeted (`vision.ane.mlmodelc`, output `[1, 256, 2560]`); audio uses the Conformer encoder + a Swift two-stage projection (1024 → 1536 → 2560).

The conversion / runtime gap fixed in this PR was `AudioProcessor.swift`'s `embed_proj` matmul, which assumed E2B's square `(1536, 1536)` shape. E4B's `embed_proj` projects 1536 → 2560 (LM hidden); `ProjectionWeights` now derives `inDim` / `outDim` / `finalDim` from the loaded weight tensor sizes.

Reproduction guide: `docs/E4B_MULTIMODAL_BUILD.md`.
Assembly script: `scripts/assemble_gemma4_e4b_multimodal.sh`.

What's in this PR
- `feat(gemma4-e4b)` (7a10047): the actual ship
  - `Sources/CoreMLLLM/AudioProcessor.swift`: `ProjectionWeights` non-square `embed_proj` support → fixes E4B audio gibberish.
  - `conversion/models/gemma4_swa_merged.py`: `MergedChunk23` accepts `own_range` / `shared_range` (mirrors stateful generalisation from 4665ab2).
  - `conversion/build_gemma4_3way.py`: thread `compute_chunk_boundaries(cfg)` so `--model gemma4-e4b` produces the correct 21-layer `chunk2_3way`.
  - `scripts/assemble_gemma4_e4b_multimodal.sh` + `docs/E4B_MULTIMODAL_BUILD.md`.
- `research(gemma4-stateful-mm)` (9c391f8): Stage 8 stateful + multimodal engine, kept for Mac development. iPhone ANE 18 fails to compile the merged stateful `chunk_2` with `std::bad_cast` (alias slice over `MLState` is the root cause; size and `.clone()` patches don't help). Documented in `docs/E4B_MULTIMODAL_BUILD.md` rejected-paths section.
  - `Sources/CoreMLLLM/Gemma4StatefulMultimodalEngine.swift` (~880 lines).
  - `Sources/gemma4mm-smoke/main.swift` (Mac CLI smoke).
  - `Sources/CoreMLLLM/ModelDownloader.swift`: `gemma4e{2,4}bStatefulMultimodal` ModelInfo entries (sideload-only, behind `LLM_SHOW_EXPERIMENTAL=1`).
  - `Examples/CoreMLLLMChat/CoreMLLLMChat/LLMRunner.swift`: detection (`prefill_T288/` subdir) + load + generate + image/audio caching.
  - `scripts/assemble_gemma4_stateful_multimodal.sh` (reproducible bundle assembly for the stateful path).
  - `conversion/build_gemma4_stateful_singlefunc_prefill.py` (`--four-chunk` variant) and `conversion/models/gemma4_swa_stateful_chunks.py` (`.clone()` on alias outputs).

Pre-existing commits (already on branch): 2655c17, 4665ab2, 1ccbfcd, 340bf68.

What was tried and rejected (documented)
- `prefill_chunk{1..4}.mlmodelc` separate-file multifunction (T=64/128/256/512): works on Mac at 16.5 tok/s, but produces degenerate outputs on iPhone for E4B (likely int4 quantization noise + larger graph). Existing `gemma4e2b3way` ships this layout for E2B and works on iPhone; E4B-specific failure. Not shipped.
- … `research(gemma4-stateful-mm)` commit. Mac OK, iPhone ANE 18 blocks at MIL→EIR.
- … `chunk_2` into `chunk_2_own` (12 layers) + `chunk_2_shared` (9 layers) hits the same `std::bad_cast`, confirming the alias-slice-over-`MLState` pattern is the root cause rather than graph size.

Test plan
- … `coreml-llm-smoke`).
- … `embed_proj` non-square fix).
- … `gemma4mm-smoke`).
- … `mlboydaisuke/gemma-4-E4B-multimodal-coreml` (in progress, ~7.6 GB).

Notes
- … `LLM_PROFILE_EVERY_STEP=1` / `LLM_SHOW_EXPERIMENTAL=1` / `LLM_VISION_FORCE_ANE=1` (untracked in this PR; left as the developer's local choice). `LLM_VISION_FORCE_ANE=1` is required for the ANE vision encoder.
- … `devicectl` doesn't remove orphan files; a leftover `prefill_chunk1.mlmodelc` from a previous push silently overrides the engine's choice. Documented in `docs/E4B_MULTIMODAL_BUILD.md`.
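
Since leftover files from an earlier push can change which path the engine picks, it can help to diff a bundle directory against the list of files that are supposed to be there. The sketch below does this for a local directory. `find_orphans` is a hypothetical helper, and how the directory is obtained from the device (for example by copying the app container to the Mac first) is outside its scope.

```python
from pathlib import Path
from typing import Iterable, List


def find_orphans(bundle_dir: str, expected: Iterable[str]) -> List[str]:
    """Return files under bundle_dir that are not in the expected manifest.

    Paths are compared relative to bundle_dir, using forward slashes.
    """
    root = Path(bundle_dir)
    present = {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file()
    }
    return sorted(present - set(expected))
```

Any path it reports under something like `prefill_chunk1.mlmodelc/` is a candidate for the silent override described in the note above.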