feat(gemma4-e4b): multimodal CoreML bundle (text+image+video+audio on iPhone) by john-rocky · Pull Request #165 · john-rocky/CoreML-LLM

john-rocky · 2026-05-03T01:44:52Z

Summary

Ships a Gemma 4 E4B multimodal Core ML bundle, validated on iPhone 17 Pro at 15.7 tok/s text decode with correct outputs across text / image / video / audio. HF artefact: mlboydaisuke/gemma-4-E4B-multimodal-coreml (uploading separately, ~7.6 GB).

Uses Topology II 3-chunk merged decode (already in ChunkedEngine) plus the legacy 4-chunk prefill_b8 multifunction for batched prefill. Vision is ANE-targeted (vision.ane.mlmodelc, output [1, 256, 2560]); audio uses the Conformer encoder + a Swift two-stage projection (1024 → 1536 → 2560).

The conversion / runtime gap fixed in this PR was AudioProcessor.swift's embed_proj matmul, which assumed E2B's square (1536, 1536) shape. E4B's embed_proj projects 1536 → 2560 (LM hidden); ProjectionWeights now derives inDim / outDim / finalDim from the loaded weight tensor sizes.
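
A minimal sketch of the shape-derivation idea, written in Python rather than the Swift of AudioProcessor.swift. It assumes each weight is stored as (out_features, in_features); `ProjectionDims`, `derive_dims`, `project`, and the `first_proj_shape` parameter are hypothetical names for illustration, not identifiers from the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectionDims:
    in_dim: int     # width of the audio encoder output
    out_dim: int    # width after the first projection
    final_dim: int  # width after embed_proj (the LM hidden size)


def derive_dims(first_proj_shape, embed_proj_shape):
    """Derive projection widths from weight shapes instead of hard-coding them.

    Assumption: both shapes are (out_features, in_features).
    """
    out_dim, in_dim = first_proj_shape
    final_dim, embed_in = embed_proj_shape
    if embed_in != out_dim:
        raise ValueError(
            f"embed_proj expects {embed_in} inputs but stage one emits {out_dim}"
        )
    return ProjectionDims(in_dim, out_dim, final_dim)


def project(vec, weight):
    """Apply y = W x for a weight given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


# E4B widths from the PR text: 1024 -> 1536 -> 2560.
e4b = derive_dims((1536, 1024), (2560, 1536))

# A square embed_proj (the E2B case) gives final_dim == out_dim.
# The first-stage shape used here is assumed for illustration only.
square = derive_dims((1536, 1024), (1536, 1536))
```

With a non-square embed_proj, the output buffer of the second matmul has to be sized by `final_dim`; sizing it by `out_dim` only happens to work when the weight is square.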

Reproduction guide: docs/E4B_MULTIMODAL_BUILD.md.
Assembly script: scripts/assemble_gemma4_e4b_multimodal.sh.

What's in this PR

feat(gemma4-e4b) (7a10047) — the actual ship
- Sources/CoreMLLLM/AudioProcessor.swift: ProjectionWeights non-square embed_proj support → fixes E4B audio gibberish.
- conversion/models/gemma4_swa_merged.py: MergedChunk23 accepts own_range/shared_range (mirrors stateful generalisation from 4665ab2).
- conversion/build_gemma4_3way.py: thread compute_chunk_boundaries(cfg) so --model gemma4-e4b produces the correct 21-layer chunk2_3way.
- scripts/assemble_gemma4_e4b_multimodal.sh + docs/E4B_MULTIMODAL_BUILD.md.
research(gemma4-stateful-mm) (9c391f8) — Stage 8 stateful + multimodal engine, kept for Mac development. iPhone ANE 18 fails to compile the merged stateful chunk_2 with std::bad_cast (alias slice over MLState is the root cause; size and .clone() patches don't help). Documented in docs/E4B_MULTIMODAL_BUILD.md rejected-paths section.
- New file: Sources/CoreMLLLM/Gemma4StatefulMultimodalEngine.swift (~880 lines).
- New file: Sources/gemma4mm-smoke/main.swift (Mac CLI smoke).
- Sources/CoreMLLLM/ModelDownloader.swift: gemma4e{2,4}bStatefulMultimodal ModelInfo entries (sideload-only, behind LLM_SHOW_EXPERIMENTAL=1).
- Examples/CoreMLLLMChat/CoreMLLLMChat/LLMRunner.swift: detection (prefill_T288/ subdir) + load + generate + image/audio caching.
- scripts/assemble_gemma4_stateful_multimodal.sh (reproducible bundle assembly for the stateful path).
- Builder generalisations in conversion/build_gemma4_stateful_singlefunc_prefill.py (--four-chunk variant) and conversion/models/gemma4_swa_stateful_chunks.py (.clone() on alias outputs).

Pre-existing commits (already on branch): 2655c17, 4665ab2, 1ccbfcd, 340bf68.

What was tried and rejected (documented)

- prefill_chunk{1..4}.mlmodelc separate-file multifunction (T=64/128/256/512): works on Mac at 16.5 tok/s, but produces degenerate outputs on iPhone for E4B (likely int4 quantization noise + larger graph). The existing gemma4e2b3way ships this layout for E2B and works on iPhone, so the failure is E4B-specific. Not shipped.
- Stateful E4B multimodal: see the research(gemma4-stateful-mm) commit. Mac OK, iPhone ANE 18 blocks at MIL→EIR.
- 4-chunk decode split for stateful: splitting chunk_2 into chunk_2_own (12 layers) + chunk_2_shared (9 layers) hits the same std::bad_cast, confirming the alias-slice-over-MLState pattern is the root cause rather than graph size.

Test plan

- Mac smoke: text decode 16.5 tok/s, baseline-quality output (coreml-llm-smoke).
- iPhone 17 Pro text-only: 15.7 tok/s, baseline-quality output (matches Mac semantics).
- iPhone 17 Pro image+text: coherent description (no gibberish).
- iPhone 17 Pro video+text: coherent description.
- iPhone 17 Pro audio+text: correct response (after embed_proj non-square fix).
- Stateful Mac smoke: text decode 16.x tok/s on stateful 3-chunk + T=288 prefill (gemma4mm-smoke).
- HF upload to mlboydaisuke/gemma-4-E4B-multimodal-coreml (in progress, ~7.6 GB).

Notes

- The Xcode shared scheme adds LLM_PROFILE_EVERY_STEP=1 / LLM_SHOW_EXPERIMENTAL=1 / LLM_VISION_FORCE_ANE=1 (untracked in this PR; left as the developer's local choice). LLM_VISION_FORCE_ANE=1 is required for the ANE vision encoder.
- iPhone bundle pushes need a clean sandbox (delete + reinstall the app) when bundle layouts change. devicectl doesn't remove orphan files; a leftover prefill_chunk1.mlmodelc from a previous push silently overrides the engine's choice. Documented in docs/E4B_MULTIMODAL_BUILD.md.
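
As an illustration of the orphan-file hazard, here is a small pre-push check one could run against a local copy of the bundle directory. `find_orphans` is a hypothetical helper, and the expected file names other than prefill_chunk1.mlmodelc are assumed for the demo; this is not part of the repo's tooling.

```python
import tempfile
from pathlib import Path


def find_orphans(bundle_dir, expected_top_level):
    """List top-level entries in bundle_dir that the new layout does not expect."""
    expected = set(expected_top_level)
    return sorted(
        entry.name
        for entry in Path(bundle_dir).iterdir()
        if entry.name not in expected
    )


# Demo: simulate a bundle directory that still holds a file from an older layout.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("chunk1.mlmodelc", "chunk2_3way.mlmodelc", "prefill_chunk1.mlmodelc"):
        (Path(tmp) / name).mkdir()
    orphans = find_orphans(tmp, ["chunk1.mlmodelc", "chunk2_3way.mlmodelc"])

print(orphans)  # ['prefill_chunk1.mlmodelc']
```

A check like this only flags leftovers; on device the fix described above (delete + reinstall the app) is still what clears them.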

Phase A1-A3 of the E4B optimization stack. Brings the stage2-e4b 4-chunk foundation (Phase 1 stateful + Phase 2a cross-turn KV) onto current main and adds 3-chunk merged + multifunction prefill_bN support for E4B, the lever that gave E2B its 33.4 tok/s iPhone 17 Pro decode.
Converter side
- SWAStatefulMergedChunk23{,Prefill,Single,PrefillSingle} accept own_range / shared_range; defaults remain E2B (own=L8-14, shared=L15-24) for back-compat. E4B passes (12,24)/(24,33) derived from compute_chunk_boundaries(config); kv13/kv14 names are kept as legacy aliases for the (sliding,full) producer slots.
- build_gemma4_e2b_stateful_3chunks.py: drops the "E2B only" hardcoded help; --model gemma4-e4b now produces a 3-chunk merged bundle (chunk_1 L0-11 / chunk_2 L12-32 merged / chunk_3 L33-41 + lm_head). Chunk-2 layout printed dynamically.
- sanity_stateful_chunks.py: from stage2-e4b; adds --model preset so /tmp/gemma4-{e2b,e4b}-stateful chunks share one verifier.
Bundle side
- scripts/assemble_gemma4_stateful_e4b.sh: from stage2-e4b; pulls chunk_*.mlmodelc + legacy E4B sidecars into the bundle layout Gemma4StatefulEngine expects (subdir gemma4_e2b_stateful_chunks/ is intentionally shared across E2B/E4B; engine reads hidden / layers / HKV from model_config.json).
Runtime side (Swift)
- ModelDownloader.swift: gemma4e4bStateful + gemma4e4bStatefulLinear ModelInfo entries (slots 6/7 under LLM_SHOW_EXPERIMENTAL=1). downloadURL is intentionally blank; A6 will fill in the new mlboydaisuke/gemma-4-E4B-stateful-coreml repo URL once iPhone 17 Pro A/B clears. The existing mlboydaisuke/gemma-4-E4B-coreml legacy repo is untouched, preserving the dual-repo pattern E2B uses.
- LLMRunner.swift: stateful detection comment now lists all four folders that share the gemma4_e2b_stateful_chunks/ layout.
Build artefacts (A4) and iPhone validation (A5) follow.

Stage 8 builder (PR #149) already used `compute_chunk_boundaries` for chunk_1 / chunk_3 windows but called `convert_chunk2_merged_prefill` without `own_range` / `shared_range`, so on E4B the merged middle chunk silently used E2B's L8-14 / L15-24 layer ranges instead of L12-23 / L24-32. After A3 made the converter parametric, plumb the ranges through and refresh the docstring + the stale "we don't ship E4B stateful yet" comment.
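
The layer arithmetic behind these ranges can be checked with a short sketch. It treats ranges as half-open (start, end) pairs, as in the (12,24)/(24,33) form used for E4B in this PR; `merged_chunk_layers` is a hypothetical helper for illustration and is not the repo's `compute_chunk_boundaries`.

```python
def merged_chunk_layers(own_range, shared_range):
    """Return the layer indices covered by a merged middle chunk.

    Both ranges are half-open (start, end) and must be contiguous.
    """
    own_start, own_end = own_range
    shared_start, shared_end = shared_range
    if own_end != shared_start:
        raise ValueError("own and shared ranges must be contiguous")
    if not own_start < own_end < shared_end:
        raise ValueError("ranges must be non-empty and increasing")
    return list(range(own_start, shared_end))


E2B_RANGES = ((8, 15), (15, 25))   # L8-14 own, L15-24 shared
E4B_RANGES = ((12, 24), (24, 33))  # L12-23 own, L24-32 shared

e2b_layers = merged_chunk_layers(*E2B_RANGES)
e4b_layers = merged_chunk_layers(*E4B_RANGES)
print(len(e2b_layers), len(e4b_layers))  # 17 21
```

Falling back to the E2B defaults on an E4B model would therefore build a 17-layer middle chunk over L8-24 instead of the intended 21-layer chunk over L12-32, without raising any error.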

…ateful
Captures the design intel + chosen architecture (Option A: separate Gemma4StatefulMultimodalEngine class) so the next session can pick up without re-deriving. Records:
- Phase A scope (already shipped on this branch as 4665ab2 + 2655c17)
- Phase B engine class layout (storage, public API, helper port list)
- State bridge code path (probe-2-verified nested withMultiArray closures + memcpy)
- Generate flow for image+text prompts (T=288 prefill → bridge → decode)
- Bundle layout for new HF repos gemma-4-{E2B,E4B}-stateful-multimodal-coreml
- Open questions (picker naming, default-swap timing, cross-turn KV with re-encoded image features)
- Build commands for the Mac compile run

PyPI wheel ships .so files referencing @rpath/lib*.dylib that aren't included; on macOS 26 (Darwin 25 / Tahoe) this silently produces an empty pybind11 module so every conversion script crashes at "BlobWriter not loaded". Captures the fresh-venv + source-build steps that get a working /tmp/ct_build_venv to unblock builds until upstream ships fixed wheels.

… iPhone)
Working configuration for iPhone 17 Pro at 15.7 tok/s decode + correct output across all four input modalities. Validated 2026-05-03 on a clean sandbox push of the assembled bundle.
Topology:
- decode = Topology II (chunk1 legacy + chunk2_3way + chunk3_3way merged 21-layer middle + final lm_head). Auto-detected by ChunkedEngine via chunk2_3way/chunk3_3way presence.
- prefill = legacy chunks 1/2/3/4 prefill_b8 multifunction. Vision-aware bidirectional mask within image span via the engine's existing fillBatchMasksVisionAware (works at T=8 batches).
- vision = vision.ane.mlmodelc (E4B, output [1, 256, 2560]).
- audio = audio.mlmodelc (E4B, output [1, 50, 1024]) + Swift two-stage projection 1024 -> 1536 -> 2560.
Changes:
- Sources/CoreMLLLM/AudioProcessor.swift: ProjectionWeights now derives inDim/outDim/finalDim from weight tensor sizes (was hard-coded for E2B's square 1536x1536 embed_proj). E4B's embed_proj is non-square (2560, 1536); the embed_proj sgemm now uses finalDim for the output dimension. Direct cause of the audio gibberish on E4B.
- conversion/models/gemma4_swa_merged.py: MergedChunk23 (the non-stateful merged chunk2+chunk3 used by Topology II) now accepts own_range / shared_range; defaults stay at E2B (L8-14 / L15-24). Mirrors the stateful generalisation from 4665ab2.
- conversion/build_gemma4_3way.py: thread compute_chunk_boundaries(cfg) through to MergedChunk23 so `--model gemma4-e4b` produces a 21-layer chunk2_3way (L12-23 own + L24-32 shared) instead of the E2B-hardcoded 17-layer span.
- scripts/assemble_gemma4_e4b_multimodal.sh: reproducible bundle assembly script (compiles mlpackage->mlmodelc, copies sidecars + legacy chunks + E4B encoders).
- docs/E4B_MULTIMODAL_BUILD.md: build + sideload guide, including the rejected paths (prefill_chunk* multifunction, stateful) and the iPhone clean-sandbox requirement (devicectl never deletes orphans).
Out of scope (in this commit):
- Stateful Stage 8 engine: separate commit, Mac-only / iPhone-blocked.
- prefill_chunk{1..4}.mlmodelc multifunction path: built and tested but produces broken output on iPhone with E4B (Mac OK); not shipped.
- vision_video.mlmodelc: engine falls back to 2x2 pool of vision encoder; quality validated.
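
To make the "vision-aware bidirectional mask within image span" concrete, here is a rough Python sketch of the masking rule for one T=8 batch. It is not a port of fillBatchMasksVisionAware: `vision_aware_mask` is a hypothetical function, spans are assumed to be half-open (start, end) positions within the batch, and sliding-window limits and attention to earlier batches are ignored.

```python
def vision_aware_mask(seq_len, image_spans):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Text positions are causal (k <= q). Positions inside the same image
    span attend to each other in both directions.
    """
    # Map each position to the index of the image span containing it, if any.
    span_of = [None] * seq_len
    for idx, (start, end) in enumerate(image_spans):
        for pos in range(start, end):
            span_of[pos] = idx

    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            same_image = span_of[q] is not None and span_of[q] == span_of[k]
            row.append(k <= q or same_image)
        mask.append(row)
    return mask


# One T=8 batch with image tokens at positions 2, 3 and 4.
mask = vision_aware_mask(8, [(2, 5)])
```

In this example position 2 can see position 4 (same image), while position 1 cannot see position 2 and position 4 cannot see position 5, so text tokens stay strictly causal.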

…c dev / iPhone blocked
Stage 8 follow-up to the stateful Linear shipment. Adds a parallel engine that drives Gemma 4 stateful (3-chunk merged + Linear) with T=288 single-function prefill chunks + the Stage 6 vision/audio splice.
The engine class works end-to-end on Mac (text decode 16.5 tok/s; assembled bundle drives image + audio splice through the T=288 batched prefill with bidirectional within-image mask).
iPhone status: BLOCKED. Multiple converter paths attempted, all hit the same iPhone ANE 18 MIL->EIR translation failure on chunk_2 (the merged 21-layer middle chunk):
- 3-chunk merged stateful with kv13/kv14 alias output: std::bad_cast
- .clone() patch on the alias output assignment: same error
- 4-chunk decode split (chunk_2_own + chunk_2_shared): same error, confirming the alias-slice-over-MLState pattern is the root cause rather than graph size.
The non-stateful 3-way merged chunk2_3way (same 21 layers, but K/V flow as plain tensor inputs/outputs, with no MLState alias) compiles and runs on iPhone ANE 18 at 15.7 tok/s, confirming the diagnosis. Code keeps the stateful path for Mac development and future revisits (stateful + multifunction T=288 might unlock once iPhone ANE picks up multifunction T>1 + dual MLState; not on iOS 18).
Files:
- Sources/CoreMLLLM/Gemma4StatefulMultimodalEngine.swift (NEW): ~880-line dimension-agnostic stateful engine. 3-chunk merged decode + 4-state MLState (decode/prefill x s1/s2) + bridgeKVState via withMultiArray nested closures + ported Stage 6 multimodal helpers (vision/video/audio splice + vision-aware bidir mask + cross-turn LCP-resume). Padding-replicate scheme keeps auto-emitted token at row T-1 valid even when validCount < T.
- Sources/CoreMLLLM/ModelDownloader.swift: gemma4e2bStatefulMultimodal + gemma4e4bStatefulMultimodal ModelInfo entries (sideload-only, exposed under LLM_SHOW_EXPERIMENTAL=1).
- Examples/CoreMLLLMChat/CoreMLLLMChat/LLMRunner.swift: detection (prefill_T288/ subdir presence) routes to the new engine; load + generate + image/audio caching mirror the existing gemma4Stateful pattern.
- Sources/gemma4mm-smoke/main.swift (NEW): Mac CLI smoke test for the stateful multimodal engine.
- Package.swift: gemma4mm-smoke executable target.
- scripts/assemble_gemma4_stateful_multimodal.sh (NEW): reproducible bundle assembly (decode 3-chunk + prefill_T288/ subdir + multimodal encoders).
- conversion/build_gemma4_stateful_singlefunc_prefill.py: adds --four-chunk variant (used during the chunk_2 split probe).
- conversion/models/gemma4_swa_stateful_chunks.py: .clone() on the kv13/kv14 producer alias output (decode + prefill T=N variants). Materialises the slice over MLState into a fresh tensor; ineffective vs the iPhone ANE bug but not regressive.
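
The "cross-turn LCP-resume" helper mentioned above is not shown in this PR; the general idea of resuming from the longest common prefix (LCP) of two token sequences can be sketched as follows. `lcp_len` and `plan_resume` are hypothetical names, and the rule of always re-running at least one token is an assumption made here so that the model produces fresh logits.

```python
def lcp_len(prev_tokens, new_tokens):
    """Length of the longest common prefix of two token-id sequences."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


def plan_resume(prev_tokens, new_tokens):
    """Return (reuse_count, tokens_to_prefill) for the next turn.

    reuse_count positions of cached KV state are kept; the remaining
    tokens of the new prompt must be prefilled again.
    """
    if not new_tokens:
        return 0, []
    keep = lcp_len(prev_tokens, new_tokens)
    # Assumption: never reuse the whole prompt, so the last token is re-run.
    keep = min(keep, len(new_tokens) - 1)
    return keep, list(new_tokens[keep:])


print(plan_resume([1, 2, 3, 4], [1, 2, 3, 4, 5, 6]))  # (4, [5, 6])
```

For multimodal turns, a real implementation also has to decide whether re-encoded image features count as "the same tokens", which the design notes in this PR list as an open question.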

…hared scheme
Wires the HF-uploaded multimodal bundle into the in-app picker flow so users can download `mlboydaisuke/gemma-4-E4B-multimodal-coreml` with one tap (no sideload required).
ModelDownloader.swift:
- New `gemma4e4bMultimodal` ModelInfo entry (id `gemma4-e4b-multimodal`, size 7.6 GB, downloadURL points at the new HF repo). Shared `folderName: "gemma4-e4b"` with the legacy text-only entry mirrors the gemma4e2b3way / gemma4e2b pattern: chunks 1-4 are byte-identical in both repos, so users who switch between entries reuse the on-disk legacy chunks and only fetch the new files.
- `gemma4e4b` (text-only) renamed to "Gemma 4 E4B (text-only)" to disambiguate from the new multimodal entry in the picker.
- New `buildE4BMultimodalFileList()` enumerates 58 files matching the HF repo tree (decode chunks 1-4 + chunk2_3way + chunk3_3way + vision.ane.mlmodelc + audio.mlmodelc + audio sidecars + text sidecars). Splits files into legacyChunk (no metadata.json) vs newerMlc (with metadata.json) helpers; the legacy chunks were built before the metadata.json convention.
- Defaults list inserts `gemma4e4bMultimodal` ahead of `gemma4e4b` so the picker presents multimodal as the primary E4B option.
CoreMLLLMChat.xcscheme:
- Add `LLM_VISION_FORCE_ANE=1` to the shared scheme. Safe to default: it only affects models whose bundle ships a `vision.ane.mlmodelc` (the new E4B multimodal entry); other models silently fall through to their existing GPU `vision.mlmodelc`.
- Add `LLM_SHOW_EXPERIMENTAL=1`. Required to expose the experimental picker entries (already documented in `ModelDownloader.swift`'s `defaults`).
- Drop `LLM_PROFILE_EVERY_STEP=1` from the shared scheme; debug-only, belongs in a developer's local copy.

john-rocky added 7 commits May 3, 2026 11:00

john-rocky force-pushed the feat/e4b-optimize-multimodal branch from 83a1a1a to 5f5d71a Compare May 3, 2026 02:01

john-rocky merged commit c69a4e9 into main May 3, 2026
1 check passed
