feat(gemma4-e4b): multimodal CoreML bundle (text+image+video+audio on iPhone)#165

Merged
john-rocky merged 7 commits into main from feat/e4b-optimize-multimodal on May 3, 2026

@john-rocky (Owner)

Summary

Ships a Gemma 4 E4B multimodal Core ML bundle, validated on iPhone 17 Pro at 15.7 tok/s text decode with correct outputs across text / image / video / audio. HF artefact: mlboydaisuke/gemma-4-E4B-multimodal-coreml (uploading separately, ~7.6 GB).

Uses Topology II 3-chunk merged decode (already in ChunkedEngine) plus the legacy 4-chunk prefill_b8 multifunction for batched prefill. Vision is ANE-targeted (vision.ane.mlmodelc, output [1, 256, 2560]); audio uses the Conformer encoder + a Swift two-stage projection (1024 → 1536 → 2560).

The conversion / runtime gap fixed in this PR was AudioProcessor.swift's embed_proj matmul, which assumed E2B's square (1536, 1536) shape. E4B's embed_proj projects 1536 → 2560 (LM hidden); ProjectionWeights now derives inDim / outDim / finalDim from the loaded weight tensor sizes.
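
A minimal Python sketch of the shape-derivation idea (the shipped code is Swift). `derive_projection_dims` and `project` are hypothetical names, and the row-major (out, in) layout for the first projection weight is an assumption, extrapolated from the (2560, 1536) embed_proj shape quoted in the commit message:

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]  # row-major, shape (out, in)


def derive_projection_dims(
    w1_shape: Tuple[int, int], w2_shape: Tuple[int, int]
) -> Tuple[int, int, int]:
    """Return (in_dim, out_dim, final_dim) from two (out, in) weight shapes.

    w1 is the first-stage projection, w2 stands in for embed_proj.
    Nothing is hard-coded, so a square w2 (E2B) and a non-square w2 (E4B)
    both work.
    """
    out_dim, in_dim = w1_shape
    final_dim, w2_in = w2_shape
    if w2_in != out_dim:
        raise ValueError(
            f"stage mismatch: w1 emits {out_dim} features, w2 expects {w2_in}"
        )
    return in_dim, out_dim, final_dim


def project(x: Sequence[float], weight: Matrix) -> List[float]:
    """Compute y = W @ x; the output length is the number of rows in W."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def two_stage(x: Sequence[float], w1: Matrix, w2: Matrix) -> List[float]:
    """Apply both projections; output length follows w2's row count."""
    return project(project(x, w1), w2)
```

With the E4B shapes from this PR, `derive_projection_dims((1536, 1024), (2560, 1536))` gives `(1024, 1536, 2560)`. The square E2B case, `(1536, 1536)` for the second weight, simply yields a final dimension of 1536.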

Reproduction guide: docs/E4B_MULTIMODAL_BUILD.md.
Assembly script: scripts/assemble_gemma4_e4b_multimodal.sh.

What's in this PR

  • feat(gemma4-e4b) (7a10047) — the actual ship

    • Sources/CoreMLLLM/AudioProcessor.swift: ProjectionWeights non-square embed_proj support → fixes E4B audio gibberish.
    • conversion/models/gemma4_swa_merged.py: MergedChunk23 accepts own_range/shared_range (mirrors stateful generalisation from 4665ab2).
    • conversion/build_gemma4_3way.py: thread compute_chunk_boundaries(cfg) so --model gemma4-e4b produces the correct 21-layer chunk2_3way.
    • scripts/assemble_gemma4_e4b_multimodal.sh + docs/E4B_MULTIMODAL_BUILD.md.
  • research(gemma4-stateful-mm) (9c391f8) — Stage 8 stateful + multimodal engine, kept for Mac development. iPhone ANE 18 fails to compile the merged stateful chunk_2 with std::bad_cast (alias slice over MLState is the root cause; size and .clone() patches don't help). Documented in docs/E4B_MULTIMODAL_BUILD.md rejected-paths section.

    • New file: Sources/CoreMLLLM/Gemma4StatefulMultimodalEngine.swift (~880 lines).
    • New file: Sources/gemma4mm-smoke/main.swift (Mac CLI smoke).
    • Sources/CoreMLLLM/ModelDownloader.swift: gemma4e{2,4}bStatefulMultimodal ModelInfo entries (sideload-only, behind LLM_SHOW_EXPERIMENTAL=1).
    • Examples/CoreMLLLMChat/CoreMLLLMChat/LLMRunner.swift: detection (prefill_T288/ subdir) + load + generate + image/audio caching.
    • scripts/assemble_gemma4_stateful_multimodal.sh (reproducible bundle assembly for the stateful path).
    • Builder generalisations in conversion/build_gemma4_stateful_singlefunc_prefill.py (--four-chunk variant) and conversion/models/gemma4_swa_stateful_chunks.py (.clone() on alias outputs).

Pre-existing commits (already on branch): 2655c17, 4665ab2, 1ccbfcd, 340bf68.

What was tried and rejected (documented)

  1. prefill_chunk{1..4}.mlmodelc separate-file multifunction (T=64/128/256/512) — works on Mac at 16.5 tok/s, but produces degenerate outputs on iPhone for E4B (likely int4 quantization noise + larger graph). Existing gemma4e2b3way ships this layout for E2B and works on iPhone; E4B-specific failure. Not shipped.
  2. Stateful E4B multimodal — see research(gemma4-stateful-mm) commit. Mac OK, iPhone ANE 18 blocks at MIL→EIR.
  3. 4-chunk decode split for stateful — splitting chunk_2 into chunk_2_own (12 layers) + chunk_2_shared (9 layers) hits the same std::bad_cast, confirming the alias-slice-over-MLState pattern is the root cause rather than graph size.

Test plan

  • Mac smoke: text decode 16.5 tok/s, baseline-quality output (coreml-llm-smoke).
  • iPhone 17 Pro text-only: 15.7 tok/s, baseline-quality output (matches Mac semantics).
  • iPhone 17 Pro image+text: coherent description (no gibberish).
  • iPhone 17 Pro video+text: coherent description.
  • iPhone 17 Pro audio+text: correct response (after embed_proj non-square fix).
  • Stateful Mac smoke: text decode 16.x tok/s on stateful 3-chunk + T=288 prefill (gemma4mm-smoke).
  • HF upload to mlboydaisuke/gemma-4-E4B-multimodal-coreml (in progress, ~7.6 GB).

Notes

  • The Xcode shared scheme adds LLM_PROFILE_EVERY_STEP=1 / LLM_SHOW_EXPERIMENTAL=1 / LLM_VISION_FORCE_ANE=1 (untracked in this PR — left as the developer's local choice). LLM_VISION_FORCE_ANE=1 is required for the ANE vision encoder.
  • iPhone bundle pushes need a clean sandbox (delete + reinstall the app) when bundle layouts change. devicectl doesn't remove orphan files; a leftover prefill_chunk1.mlmodelc from a previous push silently overrides the engine's choice. Documented in docs/E4B_MULTIMODAL_BUILD.md.
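
One way to catch orphan files before they cause trouble is to compare a directory listing against the expected bundle manifest. The sketch below is illustrative only: `find_orphans` and `find_orphans_in_dir` are hypothetical helpers, not part of this PR, and it assumes a copy of the bundle directory can be listed (for example the staging directory on the Mac):

```python
import os
from typing import Iterable, List


def find_orphans(present: Iterable[str], expected: Iterable[str]) -> List[str]:
    """Return entries that are present but not in the expected manifest."""
    expected_set = set(expected)
    return sorted(name for name in present if name not in expected_set)


def find_orphans_in_dir(bundle_dir: str, expected: Iterable[str]) -> List[str]:
    """Same check against the top-level entries of a directory on disk."""
    return find_orphans(os.listdir(bundle_dir), expected)
```

For instance, if the manifest lists `vision.ane.mlmodelc` and `audio.mlmodelc` but the directory also holds a stale `prefill_chunk1.mlmodelc`, the helper returns `["prefill_chunk1.mlmodelc"]`, which signals that a delete + reinstall is needed.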

john-rocky added 7 commits May 3, 2026 11:00
Phase A1-A3 of the E4B optimization stack. Brings the stage2-e4b 4-chunk
foundation (Phase 1 stateful + Phase 2a cross-turn KV) onto current main
and adds 3-chunk merged + multifunction prefill_bN support for E4B —
the lever that gave E2B its 33.4 tok/s iPhone 17 Pro decode.

Converter side
  - SWAStatefulMergedChunk23{,Prefill,Single,PrefillSingle} accept
    own_range / shared_range; defaults remain E2B (own=L8-14, shared=
    L15-24) for back-compat. E4B passes (12,24)/(24,33) derived from
    compute_chunk_boundaries(config) — kv13/kv14 names are kept as
    legacy aliases for the (sliding,full) producer slots.
  - build_gemma4_e2b_stateful_3chunks.py: drops the "E2B only"
    hardcoded help; --model gemma4-e4b now produces a 3-chunk merged
    bundle (chunk_1 L0-11 / chunk_2 L12-32 merged / chunk_3 L33-41 +
    lm_head). Chunk-2 layout printed dynamically.
  - sanity_stateful_chunks.py: from stage2-e4b — adds --model preset
    so /tmp/gemma4-{e2b,e4b}-stateful chunks share one verifier.

Bundle side
  - scripts/assemble_gemma4_stateful_e4b.sh: from stage2-e4b — pulls
    chunk_*.mlmodelc + legacy E4B sidecars into the bundle layout
    Gemma4StatefulEngine expects (subdir gemma4_e2b_stateful_chunks/
    is intentionally shared across E2B/E4B; engine reads hidden /
    layers / HKV from model_config.json).

Runtime side (Swift)
  - ModelDownloader.swift: gemma4e4bStateful + gemma4e4bStatefulLinear
    ModelInfo entries (slots 6/7 under LLM_SHOW_EXPERIMENTAL=1).
    downloadURL is intentionally blank — A6 will fill in the new
    mlboydaisuke/gemma-4-E4B-stateful-coreml repo URL once iPhone 17
    Pro A/B clears. Existing mlboydaisuke/gemma-4-E4B-coreml legacy
    repo is untouched, preserving the dual-repo pattern E2B uses.
  - LLMRunner.swift: stateful detection comment now lists all four
    folders that share the gemma4_e2b_stateful_chunks/ layout.

Build artefacts (A4) and iPhone validation (A5) follow.

Stage 8 builder (PR #149) already used `compute_chunk_boundaries` for
chunk_1 / chunk_3 windows but called `convert_chunk2_merged_prefill`
without `own_range` / `shared_range`, so on E4B the merged middle
chunk silently used E2B's L8-14 / L15-24 layer ranges instead of
L12-23 / L24-32. After A3 made the converter parametric, plumb the
ranges through and refresh the docstring + the stale "we don't ship
E4B stateful yet" comment.
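
A small Python sketch of why the omission was silent. It assumes half-open (start, end) ranges, as in the (12,24)/(24,33) values quoted for E4B; `describe_merged_chunk` is a hypothetical helper, not the converter's API:

```python
from typing import Dict, Tuple, Union

LayerRange = Tuple[int, int]  # half-open: [start, end)

# E2B defaults from the commit message: own = L8-14, shared = L15-24.
E2B_OWN: LayerRange = (8, 15)
E2B_SHARED: LayerRange = (15, 25)


def describe_merged_chunk(
    own_range: LayerRange = E2B_OWN,
    shared_range: LayerRange = E2B_SHARED,
) -> Dict[str, Union[str, int]]:
    """Summarise the merged middle chunk for a pair of layer ranges."""
    if own_range[1] != shared_range[0]:
        raise ValueError("own and shared ranges must be contiguous")

    def label(r: LayerRange) -> str:
        # Convert a half-open range to the inclusive "Lx-y" notation.
        return f"L{r[0]}-{r[1] - 1}"

    return {
        "own": label(own_range),
        "shared": label(shared_range),
        "layers": shared_range[1] - own_range[0],
    }
```

Calling `describe_merged_chunk()` with no arguments reports the 17-layer E2B span, which is what the E4B build received before the ranges were plumbed through. `describe_merged_chunk((12, 24), (24, 33))` reports L12-23 / L24-32 and 21 layers.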
…ateful

Captures the design intel + chosen architecture (Option A: separate
Gemma4StatefulMultimodalEngine class) so the next session can pick up
without re-deriving. Records:

  - Phase A scope (already shipped on this branch as 4665ab2 + 2655c17)
  - Phase B engine class layout (storage, public API, helper port list)
  - State bridge code path (probe-2-verified nested withMultiArray
    closures + memcpy)
  - Generate flow for image+text prompts (T=288 prefill → bridge → decode)
  - Bundle layout for new HF repos gemma-4-{E2B,E4B}-stateful-multimodal-coreml
  - Open questions (picker naming, default-swap timing, cross-turn KV
    with re-encoded image features)
  - Build commands for the Mac compile run

PyPI wheel ships .so files referencing @rpath/lib*.dylib that aren't
included; on macOS 26 (Darwin 25 / Tahoe) this silently produces an
empty pybind11 module so every conversion script crashes at
"BlobWriter not loaded". Captures the fresh-venv + source-build steps
that get a working /tmp/ct_build_venv to unblock builds until upstream
ships fixed wheels.
… iPhone)

Working configuration for iPhone 17 Pro at 15.7 tok/s decode + correct
output across all four input modalities. Validated 2026-05-03 on a clean
sandbox push of the assembled bundle.

Topology:
  decode  = Topology II (chunk1 legacy + chunk2_3way + chunk3_3way merged
            21-layer middle + final lm_head). Auto-detected by
            ChunkedEngine via chunk2_3way/chunk3_3way presence.
  prefill = legacy chunks 1/2/3/4 prefill_b8 multifunction. Vision-aware
            bidirectional mask within image span via the engine's
            existing fillBatchMasksVisionAware (works at T=8 batches).
  vision  = vision.ane.mlmodelc (E4B, output [1, 256, 2560]).
  audio   = audio.mlmodelc (E4B, output [1, 50, 1024]) + Swift two-stage
            projection 1024 -> 1536 -> 2560.
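
For readers unfamiliar with the vision-aware mask mentioned under prefill, here is a simplified Python sketch of the general technique: causal attention everywhere, plus bidirectional attention inside each image span. It builds a full square mask rather than the engine's T=8 batch rows, and `vision_aware_mask` is a hypothetical name, not the signature of fillBatchMasksVisionAware:

```python
from typing import Iterable, List, Tuple


def vision_aware_mask(
    seq_len: int, image_spans: Iterable[Tuple[int, int]]
) -> List[List[bool]]:
    """Return mask[q][k] == True where query q may attend to key k.

    image_spans holds half-open (start, end) token ranges occupied by
    image embeddings.
    """
    # Start from a standard causal mask: each token sees itself and the past.
    mask = [[k <= q for k in range(seq_len)] for q in range(seq_len)]
    # Inside an image span, let every image token see every other one,
    # including later positions in the same span.
    for start, end in image_spans:
        for q in range(start, end):
            for k in range(start, end):
                mask[q][k] = True
    return mask
```

With `seq_len=6` and one span `(2, 5)`, token 2 can attend to token 4 (same image), but not to token 5, which lies after the span and stays causal.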

Changes:
- Sources/CoreMLLLM/AudioProcessor.swift: ProjectionWeights now derives
  inDim/outDim/finalDim from weight tensor sizes (was hard-coded for
  E2B's square 1536x1536 embed_proj). E4B's embed_proj is non-square
  (2560, 1536); the embed_proj sgemm now uses finalDim for the output
  dimension. Direct cause of the audio gibberish on E4B.
- conversion/models/gemma4_swa_merged.py: MergedChunk23 (the
  non-stateful merged chunk2+chunk3 used by Topology II) now accepts
  own_range / shared_range; defaults stay at E2B (L8-14 / L15-24).
  Mirrors the stateful generalisation from 4665ab2.
- conversion/build_gemma4_3way.py: thread compute_chunk_boundaries(cfg)
  through to MergedChunk23 so `--model gemma4-e4b` produces a 21-layer
  chunk2_3way (L12-23 own + L24-32 shared) instead of the E2B-hardcoded
  17-layer span.
- scripts/assemble_gemma4_e4b_multimodal.sh: reproducible bundle
  assembly script (compiles mlpackage->mlmodelc, copies sidecars +
  legacy chunks + E4B encoders).
- docs/E4B_MULTIMODAL_BUILD.md: build + sideload guide, including the
  rejected paths (prefill_chunk* multifunction, stateful) and the
  iPhone clean-sandbox requirement (devicectl never deletes orphans).

Out of scope (in this commit):
- Stateful Stage 8 engine — separate commit, Mac-only / iPhone-blocked.
- prefill_chunk{1..4}.mlmodelc multifunction path — built and tested
  but produces broken output on iPhone with E4B (Mac OK); not shipped.
- vision_video.mlmodelc — engine falls back to 2x2 pool of vision
  encoder; quality validated.
…c dev / iPhone blocked

Stage 8 follow-up to the stateful Linear shipment. Adds a parallel
engine that drives Gemma 4 stateful (3-chunk merged + Linear) with
T=288 single-function prefill chunks + the Stage 6 vision/audio
splice. The engine class works end-to-end on Mac (text decode
16.5 tok/s; assembled bundle drives image + audio splice through the
T=288 batched prefill with bidirectional within-image mask).

iPhone status: BLOCKED. Multiple converter paths attempted, all hit
the same iPhone ANE 18 MIL->EIR translation failure on chunk_2 (the
merged 21-layer middle chunk):
  - 3-chunk merged stateful with kv13/kv14 alias output: std::bad_cast
  - .clone() patch on the alias output assignment: same error
  - 4-chunk decode split (chunk_2_own + chunk_2_shared): same error,
    confirming the alias-slice-over-MLState pattern is the root cause
    rather than graph size.
The non-stateful 3-way merged chunk2_3way (same 21 layers, but K/V
flow as plain tensor inputs/outputs — no MLState alias) compiles and
runs on iPhone ANE 18 at 15.7 tok/s, confirming the diagnosis.

Code keeps the stateful path for Mac development and future revisits
(stateful + multifunction T=288 might unlock once iPhone ANE picks up
multifunction T>1 + dual MLState; not on iOS 18).

Files:
- Sources/CoreMLLLM/Gemma4StatefulMultimodalEngine.swift (NEW)
    ~880-line dimension-agnostic stateful engine. 3-chunk merged
    decode + 4-state MLState (decode/prefill x s1/s2) + bridgeKVState
    via withMultiArray nested closures + ported Stage 6 multimodal
    helpers (vision/video/audio splice + vision-aware bidir mask +
    cross-turn LCP-resume). Padding-replicate scheme keeps
    auto-emitted token at row T-1 valid even when validCount < T.
- Sources/CoreMLLLM/ModelDownloader.swift
    gemma4e2bStatefulMultimodal + gemma4e4bStatefulMultimodal
    ModelInfo entries (sideload-only, exposed under
    LLM_SHOW_EXPERIMENTAL=1).
- Examples/CoreMLLLMChat/CoreMLLLMChat/LLMRunner.swift
    Detection (prefill_T288/ subdir presence) routes to the new
    engine; load + generate + image/audio caching mirror the existing
    gemma4Stateful pattern.
- Sources/gemma4mm-smoke/main.swift (NEW)
    Mac CLI smoke test for the stateful multimodal engine.
- Package.swift: gemma4mm-smoke executable target.
- scripts/assemble_gemma4_stateful_multimodal.sh (NEW)
    Reproducible bundle assembly (decode 3-chunk + prefill_T288/
    subdir + multimodal encoders).
- conversion/build_gemma4_stateful_singlefunc_prefill.py
    Adds --four-chunk variant (used during the chunk_2 split probe).
- conversion/models/gemma4_swa_stateful_chunks.py
    .clone() on the kv13/kv14 producer alias output (decode + prefill
    T=N variants). Materialises the slice over MLState into a fresh
    tensor; ineffective vs the iPhone ANE bug but not regressive.
…hared scheme

Wires the HF-uploaded multimodal bundle into the in-app picker flow so
users can download `mlboydaisuke/gemma-4-E4B-multimodal-coreml` with one
tap (no sideload required).

ModelDownloader.swift:
- New `gemma4e4bMultimodal` ModelInfo entry (id `gemma4-e4b-multimodal`,
  size 7.6 GB, downloadURL points at the new HF repo). Shared
  `folderName: "gemma4-e4b"` with the legacy text-only entry mirrors
  the gemma4e2b3way / gemma4e2b pattern: chunks 1-4 are byte-identical
  in both repos, so users who switch between entries reuse the
  on-disk legacy chunks and only fetch the new files.
- `gemma4e4b` (text-only) renamed to "Gemma 4 E4B (text-only)" to
  disambiguate from the new multimodal entry in the picker.
- New `buildE4BMultimodalFileList()` enumerates 58 files matching the
  HF repo tree (decode chunks 1-4 + chunk2_3way + chunk3_3way +
  vision.ane.mlmodelc + audio.mlmodelc + audio sidecars + text
  sidecars). Splits files into legacyChunk (no metadata.json) vs
  newerMlc (with metadata.json) helpers; the legacy chunks were built
  before the metadata.json convention.
- Defaults list inserts `gemma4e4bMultimodal` ahead of `gemma4e4b` so
  the picker presents multimodal as the primary E4B option.
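
The legacy/newer split and the shared-folder reuse can be sketched in Python as follows. This is an illustration under assumptions: the per-model file set in `CORE_FILES` is assumed rather than taken from the repo, the function names only mirror the Swift helpers, and the sketch does not attempt to reproduce the actual 58-file list:

```python
from typing import Iterable, List

# Assumed contents of a compiled .mlmodelc directory (illustrative only).
CORE_FILES = ["coremldata.bin", "model.mil", "weights/weight.bin"]


def legacy_chunk(name: str) -> List[str]:
    """Files for a chunk built before the metadata.json convention."""
    return [f"{name}.mlmodelc/{f}" for f in CORE_FILES]


def newer_mlc(name: str) -> List[str]:
    """Files for a newer model directory, which also ships metadata.json."""
    return legacy_chunk(name) + [f"{name}.mlmodelc/metadata.json"]


def files_to_fetch(file_list: Iterable[str], on_disk: Iterable[str]) -> List[str]:
    """Skip files already present in the shared folder; keep list order."""
    have = set(on_disk)
    return [f for f in file_list if f not in have]
```

Because both picker entries share one folder, a user who already downloaded the text-only entry would only fetch what `files_to_fetch` returns: the new merged chunks and encoders, not the legacy chunks again.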

CoreMLLLMChat.xcscheme:
- Add `LLM_VISION_FORCE_ANE=1` to the shared scheme. Safe to default —
  only affects models whose bundle ships a `vision.ane.mlmodelc` (the
  new E4B multimodal entry); other models silently fall through to
  their existing GPU `vision.mlmodelc`.
- Add `LLM_SHOW_EXPERIMENTAL=1`. Required to expose the experimental
  picker entries (already documented in `ModelDownloader.swift`'s
  `defaults`).
- Drop `LLM_PROFILE_EVERY_STEP=1` from the shared scheme; debug-only,
  belongs in a developer's local copy.
@john-rocky john-rocky force-pushed the feat/e4b-optimize-multimodal branch from 83a1a1a to 5f5d71a Compare May 3, 2026 02:01
@john-rocky john-rocky merged commit c69a4e9 into main May 3, 2026
1 check passed