Virgil Lemma foundations by Snider · Pull Request #8 · dAppCore/go-mlx

Snider · 2026-05-20T05:58:29Z

Summary by CodeRabbit

New Features
- Qwen 2/3 and Qwen 3.6 model support; new adapter with buffered and streaming generation.
- Block‑prefix cache service and memvid bundle index for faster prefix restores.
- Agentic memory: wake/sleep workflows, state bundles and memvid integration; session‑state artifact export.
Improvements
- Device‑aware memory planner; expanded chunked generation, prompt‑cache warm/restore and KV snapshot flows.
- Build/toolchain updated (C++23) and macOS deployment target raised.
Documentation
- Extensive new/updated docs: architecture, runtime, inference, memory, MoE, training and benchmarks.

Co-Authored-By: Virgil <virgil@lethean.io>

Implements the 2026-05-09 vMLX feature-parity sprint (see docs/vmlx-feature-gap-report.md + docs/superpowers/plans/) plus the runtime surfaces that hang off it. Closes the gap between go-mlx and vMLX's Python engine for MoE and advanced quantisation paths.

Phase 1 surface:
- MoE / advanced quant: minimax_m2.go + native_darwin, jang.go + native_darwin, codebook_vq.go, expert_residency.go.
- Cache + decode: block_cache.go (block-prefix cache), prompt cache threshold integration, decode_optimisation.go (speculative + prompt-lookup harness).
- Algorithm/architecture profiles: algorithm_profile.go + architecture_profile.go for backend capability reporting.
- Agent memory: agent_memory.go (Wake/Sleep/Fork on top of KV snapshots + memvid), state_bundle.go round-trip via dappco.re/go/inference/state.
- Scheduler + parsers: scheduler.go (queue-aware Schedule + Cancel), parser_registry.go (model-family tool/reasoning parsers), register_metal_{cache,parser,scheduler}.go capability mounts.
- Model-pack + planning: gguf_info.go / gguf_quantize.go, memory_plan.go (device-class sizing), model_pack.go validation.
- Internal Metal extensions: gemma4 paged KV, minimax_m2 forward stubs, codebook_vq kernels, jang_dequant, kv_snapshot_blocks_native.
- Frame compute: compute.go API rounded out for non-LLM kernels.
- admin.go, dataset_stream.go, fast_eval.go, hf_fit.go, small_model_smoke.go, workload_bench.go.
- Observability: probe.go expanded for MoE router decisions, cache pressure, training events.

docs/ pass adds per-file documentation under docs/{topic}/{file}.md so future readers can plan against the runtime without grep:
- runtime/: register_metal, adapter
- memory/: agent_memory, kv_snapshot family, state_bundle, medium
- moe/: minimax_m2, jang, codebook_vq, expert_residency
- training/: sft, lora_adapter, grpo, distill, eval
- model/: model_pack, memory_plan
- inference/: scheduler, block_cache, decode_optimisation, parser_registry, thinking
- compute/: frame-compute API
- observability/: probe.go emission
- cmd/violet: sidecar daemon

34 new docs plus per-topic READMEs and a top-level index.

Co-Authored-By: Virgil <virgil@lethean.io>

First lobe of the package-split out of the 80-file root dump. Moves the non-LLM Metal frame-compute lane (PixelBuffer / kernels / Session / NewSession) into its own subpackage so the root mlx package stays focused on LLM inference.

- go/compute*.go → go/compute/ (10 files, package mlx → package compute)
- compute_darwin.go renamed compute_metal.go (no _darwin suffix: package is Metal-only, no dual-platform split)
- compute_stub.go variants deleted: Metal-only by design, no non-darwin compile target to guard against
- All build tags dropped: package is darwin/arm64 implicit
- DeviceInfo restored as type alias to metal.DeviceInfo (not field-flattened); DeviceInfo() returns metal.GetDeviceInfo() direct so upstream renames + new fields surface at compile time
- unsupported_stub_test.go in parent dropped its compute.* compile-surface refs: stub build no longer needs to compile-check a Metal-only subpackage
- examples/ moved into docs/examples/ (first-trip cleanup)

No external consumers of compute symbols in the tetrad today; only internal sibling fast_eval / api_stub / session_* call sites, and they use ModelSession.NewSession (method) rather than compute.NewSession (free function). No downstream import churn.

Co-Authored-By: Virgil <virgil@lethean.io>
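The alias-versus-flattened-struct choice above can be illustrated with a minimal sketch. The names metalDeviceInfo, getMetalDeviceInfo and CurrentDeviceInfo, and the struct fields, are hypothetical stand-ins for metal.DeviceInfo and metal.GetDeviceInfo(), whose real definitions are not shown in this PR text.

```go
package main

import "fmt"

// metalDeviceInfo stands in for metal.DeviceInfo; the fields are hypothetical.
type metalDeviceInfo struct {
	Name        string
	MemoryBytes uint64
}

// getMetalDeviceInfo stands in for metal.GetDeviceInfo().
func getMetalDeviceInfo() metalDeviceInfo {
	return metalDeviceInfo{Name: "example-gpu", MemoryBytes: 16 << 30}
}

// DeviceInfo is a type alias, not a new struct: it is the same type as
// metalDeviceInfo, so fields added or renamed upstream show up here without
// any field-by-field copying code to keep in sync.
type DeviceInfo = metalDeviceInfo

// CurrentDeviceInfo returns the upstream value directly; no conversion is
// needed because the alias and the original are identical types.
func CurrentDeviceInfo() DeviceInfo {
	return getMetalDeviceInfo()
}

func main() {
	info := CurrentDeviceInfo()
	fmt.Printf("%s %d GiB\n", info.Name, info.MemoryBytes>>30)
}
```

With a flattened copy of the struct, a renamed upstream field would only break the copying code; with the alias, every consumer that reads the old field name fails to compile, which is the behaviour the commit message asks for.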

Drops the in-mlx output-parsing layer and consumes dappco.re/go/inference/parser instead. Driver-neutral logic (model-family reasoning markers, thinking-channel processor, tool-call parsing) now lives in go-inference so every driver (rocm, cuda, tpu, future) inherits it without re-implementation.

Deletes:
- go/parser_registry.go (466 lines)
- go/thinking.go (320 lines)
- their _test.go siblings

Replaces with:
- go/thinking.go (slim): driver-side WithThinking* options that mutate the local mlx.GenerateConfig.Thinking field, FilterThinkingTokens wrapper for the *Tokenizer streaming path, parserHint() helper that converts mlx.ModelInfo to parser.Hint{Architecture, AdapterName}.

Sibling fix-ups:
- api_common.go: GenerateConfig.Thinking is parser.Config; default is parser.Show.
- api_darwin.go: 5 emit sites use parser.NewProcessor + parserHint.
- openai.go: 3 response handlers use parser.NewProcessor; reasoning selector uses parser.ForHint(parser.HintFromInference(...)).
- register_metal_parser.go: outputParser() returns parser.OutputParser via parser.ForHint(parserHint(...)).
- register_metal_cache.go: drops local modelInfoFromInference helper, uses adapter.Info() directly.
- architecture_profile.go: parser.NormaliseKey replaces local helper.
- thinking_darwin_test.go: parser.Chunk replaces ThinkingChunk.

Submodule pin: external/go-inference advanced to cb4f9fb (parser package + ProbeScheduler vocab the mlx scheduler.go was emitting).

Co-Authored-By: Virgil <virgil@lethean.io>

Drops the in-mlx JANG/JANGTQ + VQ codebook quant metadata and consumes dappco.re/go/inference/quant/{jang,codebook} instead. Driver-neutral quant types now lift to go-inference where every backend (mlx, rocm, cuda, tpu, future) inherits them.

Deletes:
- go/jang.go (597 lines)
- go/codebook_vq.go (294 lines)
- their _test.go siblings (228 lines)

Adds:
- go/jang_hf.go: driver-side helpers that depend on mlx-local HFModelMetadata (InferJANGFromHF, hfJANGGroupSize, inferJANGProfileName). Compose lifted jang.Info shape.
- safetensor_ref.go: local mlxMaxIntValue() helper (was in jang.go).

Symbol-namespace renames (package name takes the disambiguation slot):
  JANGQuantizationInfo → jang.Info
  JANGCapabilities → jang.Capabilities
  JANGTensorRole + consts → jang.TensorRole*
  JANGPackedQuantizationProfile → jang.PackedProfile
  JANGPackedTensorDescriptor → jang.PackedTensorDescriptor
  BuildJANGPackedQuantizationProfile → jang.BuildPackedProfile
  CloneJANGPackedQuantizationProfile → jang.ClonePackedProfile
  NewJANGPackedTensorDescriptor → jang.NewPackedTensorDescriptor
  ValidateJANGPackedTensor → jang.ValidatePackedTensor
  DequantizeJANGPackedTensor → jang.DequantizePackedTensor
  PackJANGQuantizedValues → jang.PackQuantizedValues
  readJANGQuantizationInfo → jang.ReadConfig
  parseJANGQuantizationInfo → jang.ParseConfig
  CodebookQuantizationType → codebook.Type
  CodebookFormatVQ → codebook.FormatVQ
  CodebookQuantizationProfile → codebook.Profile
  CodebookTensorDescriptor → codebook.TensorDescriptor
  ParseCodebookQuantizationProfile → codebook.ParseProfile
  NewCodebookTensorDescriptor → codebook.NewTensorDescriptor
  ValidateCodebookQuantizationProfile → codebook.ValidateProfile
  ValidateCodebookTensorDescriptor → codebook.ValidateTensorDescriptor
  ValidateCodebookTensorPayload → codebook.ValidateTensorPayload
  CodebookVQMatVec → codebook.MatVec
  readCodebookQuantizationProfile → codebook.ReadProfile
  cloneCodebookQuantizationProfile → codebook.CloneProfile

Sibling fix-ups across 19 files (production + tests):
- algorithm_profile, architecture_profile, hf_fit (+test), jang_native_darwin/stub, memory_plan (+test), minimax_m2 (+test), model_pack (+test), workload_bench (+test), expert_residency_test, jang_darwin_test, minimax_m2_darwin_test, inference_contract_test.
- Variable shadowing: `jang` local variables renamed to `info` where they shadowed the package import.
- jangQuantizationType(info) calls replaced with info.Packed.Type.
- finalizeJANGQuantizationInfo helper inlined as info.Packed = jang.BuildPackedProfile(info).
- testJANGTQInfo() helper re-added locally in jang_darwin_test.go (was in deleted jang_test.go).

Submodule pin: external/go-inference advanced to cb3dc24 (parser + quant/jang + quant/codebook).

Companion lifts deferred next round:
- model/minimax/m2: safetensorIndex (mlx-private) couplings in loader functions; needs either safetensors lift or types/loaders split.
- moe/expert_residency: MemoryClass (Apple-tier enum) needs budget-bytes refactor before lifting.

Co-Authored-By: Virgil <virgil@lethean.io>

Snider correction: file lifts shouldn't add new flat files to the go-mlx root, and the _darwin/_stub split is noise on a Metal-only driver. Same rules as compute/: package gets its own folder, no build-tag dance.

  go/jang_native_darwin.go + jang_native_stub.go → go/quant/jang/jang.go (one file, no _darwin suffix, no stub variant)

Symbols drop redundant prefixes since the folder + package imply them:
  JANGPackedProjectionResult → jang.PackedProjectionResult
  DequantizeJANGPackedTensorMetal → jang.DequantizePackedTensor
  ProjectJANGPackedTensorMetal → jang.ProjectPackedTensor
  ProjectJANGPackedTensorMetalFused → jang.ProjectPackedTensorFused
  jangMetalShape (private) → jang.MetalShape (exported for tests)
  jangMetalShapeElements (private) → jang.ShapeElements
  int32SliceToInts (private) → jang.Int32SliceToInts

Inside the package, the inference-side jang aliases as infjang to avoid the same-name self-collision. Consumers (jang_darwin_test + minimax_m2_native_darwin) alias the mlx-side as mlxjang.

The HF-metadata helpers (InferJANGFromHF, hfJANGGroupSize, inferJANGProfileName) merged into hf_fit.go; they're HF-fit code that happens to produce *jang.Info, not jang-package code (they depend on HFModelMetadata which lives in hf_fit.go). hf_fit.go + HFModelMetadata still pending their own folder lift (likely go/hf/ in a future iteration).

go-mlx/go root flat-file count: net −1 this commit (deletion of jang_native_stub.go + jang_native_darwin.go and jang_hf.go, addition of nothing new in root).

Co-Authored-By: Virgil <virgil@lethean.io>

Commit 63f9894 renamed the file but shipped its OLD content (the working-tree perl edits weren't re-staged before commit, so the index had the pre-edit version under the new path). HEAD's quant/jang/jang.go was still `package mlx` with the build tag, despite the working tree being correct (which masked the bug locally: build passed because the file on disk was right).

This commit ships what should have landed in 63f9894:
- package mlx → package jang
- drop //go:build darwin && arm64 && !nomlx
- symbols dropped JANG/Metal prefixes: DequantizePackedTensor, ProjectPackedTensor*, MetalShape, ShapeElements, Int32SliceToInts
- inference jang aliased as infjang inside the file

Co-Authored-By: Virgil <virgil@lethean.io>

algorithm_profile.go + architecture_profile.go move into go/profile/. Both become package profile; consumers import dappco.re/go/mlx/profile and call profile.LookupAlgorithmProfile / profile.LookupArchitectureProfile.

architecture.go inlines normalizeKnownArchitecture + architectureFromTransformersName as private helpers (originals live in gguf_info.go at mlx root). Inlining avoids the import cycle that would otherwise form when profile/ pulls from mlx and mlx-root tests exercise profile/. Same trick for KVCacheMode references: uses literal "q8" / "paged" / "k-q8-v-q4" strings instead of mlx-root constants.

Tests stay in mlx root for now (algorithm_profile_test.go + architecture_profile_test.go), aliased as `prof "dappco.re/go/mlx/profile"` so the `profile` local-var name they use doesn't shadow the package. Local-var lookup results renamed `profile → p` where needed.

model_pack.go's local `profile := pack.ArchitectureProfile` renamed to `arch` to avoid shadowing the new package import.

go vet ./... clean. Test suite green.

Co-Authored-By: Virgil <virgil@lethean.io>
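The shadowing problem that the `prof` alias works around can be reproduced with the standard library alone. The sketch below uses the `strings` package as a stand-in for the profile package; joinUpper is a hypothetical helper, not code from this repository.

```go
package main

import (
	"fmt"
	str "strings" // aliased so a local named "strings" cannot shadow the package
)

// joinUpper takes a parameter that is deliberately named "strings". Without
// the import alias, strings.ToUpper inside this function would resolve to the
// slice parameter and fail to compile.
func joinUpper(strings []string) string {
	return str.ToUpper(str.Join(strings, ","))
}

func main() {
	fmt.Println(joinUpper([]string{"q8", "paged"}))
}
```

The commit applies both available fixes: an import alias where many locals already use the package's name (the tests), and a local rename where only one variable collides (`profile → p`, `profile → arch`).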

Move lora_adapter.go → lora/adapter.go (package lora). Stage 1 only: lora_fuse* stays at mlx root because it references mlx-root types (ModelPack, ModelPackFormatSafetensors); same blocker as gguf_quantize.go.

Symbol renames (drop redundant "LoRA"/"lora" prefixes since pkg carries them):
  LoRAAdapterInfo → lora.AdapterInfo
  InspectLoRAAdapter → lora.InspectAdapter (1-arg convenience)
  inspectLoRAAdapter → lora.Inspect (2-arg form, now public)
  loraAdapterInfoEmpty → (info AdapterInfo) IsEmpty() method

Private helpers in lora/ also drop redundant prefixes:
  loraAdapterConfigJSON → adapterConfigJSON
  loraAdapterConfigPath → adapterConfigPath
  hashLoRAAdapter → hashAdapter
  loraAdapterResultError → resultError

lora_fuse.go gets its own inline copy of loraAdapterResultError (the generic core.Result → error helper isn't worth pulling into the public surface of lora).

Also: fixes stray `package mlx` left in profile/algorithm.go + profile/architecture.go from the previous lift commit (8f5174a) where the package-line rename apparently raced with the commit.

go vet ./... clean. mlx package tests green.

Co-Authored-By: Virgil <virgil@lethean.io>

Pure types-lift: ModelPack struct + its constants, options, methods move into go-mlx/pack/. Inspectors + validators stay in mlx-root model_pack.go (they reference mlx-root concrete types, GGUFInfo and MiniMaxM2TensorPlan, that would create cycles).

Cycle-breaker: 4 fields in pack.ModelPack typed as `any` since their concrete types live at mlx root:
  Quantization any (was *GGUFQuantizationInfo)
  GGUF any (was *GGUFInfo)
  MiniMaxM2 any (was *MiniMaxM2TensorPlan)
  MiniMaxM2LayerSkeleton any (was *MiniMaxM2LayerForwardSkeleton)

Consumers type-assert at read sites (memory_plan.go + model_pack_test.go). Inspectors assign concrete pointers directly (any accepts).

Symbol policy this round: NO renames. pack.ModelPack stays pack.ModelPack (verbose but lower-risk; renames can land as a follow-up). Mlx root imports pack as `mp` to avoid the local-var name collision (many functions use `pack` as parameter name).

addIssue + issueSummary → AddIssue + IssueSummary (exported, since inspectors at mlx root call them across the package boundary). applyModelPackOptions → pack.ApplyOptions (similarly exported).

Unblocks: lora_fuse and gguf_quantize can now live in their own packages once their other dependencies (safetensor private types + MiniMaxM2 types) also lift. This commit ships only the type lift.

go vet ./... clean. mlx package tests green.

Co-Authored-By: Virgil <virgil@lethean.io>
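A minimal sketch of the `any`-typed cycle-breaker described above, collapsed into one file. The struct shapes and the ggufOf helper are hypothetical; only the idea (store a concrete pointer in an `any` field, type-assert where it is read) is taken from the commit message.

```go
package main

import "fmt"

// GGUFInfo stands in for a concrete type that lives in the root package.
// Its field is hypothetical.
type GGUFInfo struct {
	TensorCount int
}

// ModelPack mirrors the idea behind pack.ModelPack: the field is typed any,
// so the package that owns ModelPack never imports the package that defines
// GGUFInfo.
type ModelPack struct {
	Path string
	GGUF any // holds *GGUFInfo at runtime
}

// ggufOf is a read-site helper: it type-asserts and reports whether the
// field actually held a non-nil *GGUFInfo.
func ggufOf(p ModelPack) (*GGUFInfo, bool) {
	info, ok := p.GGUF.(*GGUFInfo)
	return info, ok && info != nil
}

func main() {
	// Inspectors assign the concrete pointer directly; any accepts it.
	p := ModelPack{Path: "model.gguf", GGUF: &GGUFInfo{TensorCount: 3}}
	if info, ok := ggufOf(p); ok {
		fmt.Println(info.TensorCount)
	}
	// An unset field yields ok == false instead of a panic.
	_, ok := ggufOf(ModelPack{})
	fmt.Println(ok)
}
```

The cost of this pattern is that a wrong type stored in the field is only caught at the read site at runtime, so the comma-ok form of the assertion is the safer default.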

Move lora_fuse{,_darwin,_stub,_test,_darwin_test}.go into lora/ (package lora), joining lora/adapter.go from the earlier lora_adapter lift. lora/ is now the LoRA package as intended.

API change: lora.FuseIntoPack takes pre-validated pack.ModelPack as SourcePack (instead of ModelPath string). Callers validate via mlx.ValidateModelPack first, then call lora.FuseIntoPack, then validate output if they need a populated pack. This breaks the mlx ↔ lora cycle (otherwise lora.FuseIntoPack would need to call mlx.ValidateModelPack → cycle since mlx-root imports lora for AdapterInfo).

No production consumers of FuseLoRA*, only tests, so the API change is safe.

Symbol renames per discipline (drop redundant "LoRA"/"lora" prefix since pkg name carries it):
  FuseLoRAIntoModelPack → lora.FuseIntoPack
  FuseLoRAOptions → lora.FuseOptions
  FuseLoRAResult → lora.FuseResult (drops Pack field)
  LoRAFuseProvenance → lora.FuseProvenance
  LoRAFuseProvenanceFile → lora.FuseProvenanceFile
  prepareLoRAFuse → prepareFuse (private)
  loraFusePairName → fusePairName
  loraFuseBaseWeightKey → fuseBaseWeightKey
  loraFuseAdapterWeightFiles → fuseAdapterWeightFiles
  writeLoRAFuseProvenance → writeFuseProvenance
  buildLoRAFusePairs → buildFusePairs
  fuseLoRAModelWeightFiles → fuseModelWeightFiles
  fuseLoRAWeightPairs → fuseWeightPairs
  loraFusePair → fusePair
  loraFusePrepared → fusePrepared
  loRAFuseOutputWeights → fuseOutputWeights

samePath + copyModelPackMetadata + isModelWeightMetadataCopySkip + copyModelPackLocalFile move to mlx-root model_merge.go (consumers: model_merge.go itself + gguf_quantize.go). loraAdapterResultError drops (lora's own resultError is used instead).

Tests: portable + darwin tests moved into lora/ (need access to private helpers like fusePairName). Tests use pack.ModelPack{} fixture in place of mlx.ValidateModelPack (which would create a cycle); output verification reads files directly rather than via Pack.Valid().

go vet ./... clean. mlx + lora package tests green.

Co-Authored-By: Virgil <virgil@lethean.io>
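The validate-first calling convention described in this commit can be sketched as follows. Everything here is a single-file stand-in: validateModelPack plays the role of mlx.ValidateModelPack and fuseIntoPack the role of lora.FuseIntoPack, and the struct fields and error messages are assumptions, not the real signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// ModelPack is a stand-in for pack.ModelPack; the fields are hypothetical.
type ModelPack struct {
	Path  string
	valid bool
}

// Valid reports whether the pack passed validation.
func (p ModelPack) Valid() bool { return p.valid }

// validateModelPack stands in for the root-package validator. In the real
// layout it lives in the package that imports lora, which is why the fuse
// function cannot call it without creating an import cycle.
func validateModelPack(path string) (ModelPack, error) {
	if path == "" {
		return ModelPack{}, errors.New("empty model path")
	}
	return ModelPack{Path: path, valid: true}, nil
}

// FuseOptions and FuseResult are hypothetical shapes.
type FuseOptions struct {
	SourcePack ModelPack // pre-validated by the caller
	OutputPath string
}

type FuseResult struct {
	OutputPath string // no Pack field: callers re-validate if they need one
}

// fuseIntoPack accepts an already-validated pack instead of a path, so it
// needs nothing from the package that owns the validator.
func fuseIntoPack(opts FuseOptions) (FuseResult, error) {
	if !opts.SourcePack.Valid() {
		return FuseResult{}, errors.New("source pack is not validated")
	}
	return FuseResult{OutputPath: opts.OutputPath}, nil
}

func main() {
	src, err := validateModelPack("models/base")
	if err != nil {
		fmt.Println("validate:", err)
		return
	}
	res, err := fuseIntoPack(FuseOptions{SourcePack: src, OutputPath: "models/fused"})
	fmt.Println(res.OutputPath, err)
}
```

Moving validation to the caller trades a little convenience for a one-directional dependency: the leaf package only depends on the shared pack type.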

Move gguf_info.go + gguf_info_test.go + gguf_info_example_test.go into gguf/ (package gguf).

Symbol renames per discipline (drop redundant GGUF prefix since pkg name carries it):
  GGUFInfo → gguf.Info
  GGUFTensorInfo → gguf.TensorInfo
  GGUFValidationSeverity → gguf.ValidationSeverity
  GGUFValidationIssue → gguf.ValidationIssue
  GGUFTensorTypeSummary → gguf.TensorTypeSummary
  GGUFQuantizationInfo → gguf.QuantizationInfo
  ReadGGUFInfo → gguf.ReadInfo

DiscoveredModel + DiscoverModels keep their names (no GGUF prefix).

Export binary-format internals that mlx-root gguf_quantize.go needs:
  ggufTensorTypeQ8_0 → gguf.TensorTypeQ8_0
  ggufTensorTypeQ4_0 → gguf.TensorTypeQ4_0
  ggufValueTypeString → gguf.ValueTypeString
  ggufValueTypeUint32 → gguf.ValueTypeUint32
  normalizeGGUFQuantType → gguf.NormalizeQuantType

gguf_quantize.go stays at mlx root (it depends on mlx-root safetensor private types + pack.ModelPack; full lift blocked until safetensor types lift to a shared package).

Mlx-root keeps private copies of helpers consumed by 8+ mlx-root files (in hf_fit.go): firstNonEmpty, firstPositive, modelConfigProbe + methods, readModelConfig, normalizeKnownArchitecture, architectureFromTransformersName, indexString. Same inline-copy pattern as profile/architecture.go used.

Test helpers (writeTestGGUF, ggufMetaSpec, ggufTensorSpec, ggufTensorTypeQ4K, etc.) duplicated in new gguf_test_helpers_test.go at mlx root for cross-test access.

This unblocks gguf-using consumers from importing gguf/ directly. gguf_quantize.go still at mlx root for now.

go vet ./... clean. mlx + gguf + lora package tests green.

Co-Authored-By: Virgil <virgil@lethean.io>

…nsors/

Move safetensor-prefixed types + funcs from model_merge.go + safetensor_ref.go + gguf_quantize.go into safetensors/ (package safetensors). Symbol renames per discipline drop the safetensor prefix since the package name carries it.

Types:
  safetensorIndex → safetensors.Index
  safetensorTensorRef → safetensors.TensorRef
  safetensorTensorReader → safetensors.TensorReader
  safetensorHeaderEntry → safetensors.HeaderEntry

Funcs:
  indexSafetensorFiles → safetensors.IndexFiles
  readSafetensorIndex → safetensors.ReadIndex
  safetensorRefFromHeader → safetensors.RefFromHeader
  readSafetensorRefRaw → safetensors.ReadRefRaw
  readSafetensorRefValues → safetensors.ReadRefValues
  readSafetensorRefFloat32Chunk → safetensors.ReadRefFloat32Chunk
  writeSafetensorRefFloat32Chunks → safetensors.WriteRefFloat32Chunks
  openSafetensorTensorReaders → safetensors.OpenReaders
  openSafetensorTensorReader → safetensors.OpenReader
  closeSafetensorTensorReaders → safetensors.CloseReaders
  safetensorDTypeByteSize → safetensors.DTypeByteSize
  decodeSafetensorFloatData → safetensors.DecodeFloatData
  float16ToFloat32 → safetensors.Float16ToFloat32

Methods on TensorReader: close → Close, readFloat32Chunk → ReadFloat32Chunk.

Stays in model_merge.go: merge-specific helpers (indexModelMergeSources, validateModelMergeTensorIndexes, writeMergedSafetensors, readMergeTensorRefs, buildMergedSafetensorsHeader, readMergeTensorValues, writeLinearMergedTensorChunks, writeSLERPMergedTensorChunks, slerpChunkedWeights; writeFloat32Values is in safetensors too).

safetensor_ref.go deleted (mlxMaxIntValue + readSafetensorRefRaw now live inside safetensors package as private maxIntValue + exported ReadRefRaw).

Consumers updated: model_merge.go, gguf_quantize.go, gguf_quantize_test.go, minimax_m2.go, model_merge_test.go, kv_snapshot.go.

Net: -2 root flat .go files (safetensor_ref.go deleted, primitives extracted from model_merge.go + gguf_quantize.go without adding new root files).

Unblocks: gguf_quantize.go could potentially lift to gguf/ next (still needs pack.ModelPack from pack/, but pack imports gguf, so gguf_quantize would create cycle; needs separate decision).

go vet ./... clean. mlx + gguf + lora + safetensors package tests green.

Co-Authored-By: Virgil <virgil@lethean.io>
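For readers unfamiliar with what a helper like Float16ToFloat32 does, here is a sketch of the standard IEEE 754 binary16 → binary32 widening. It is an independent illustration written for this note, assuming the usual half-precision bit layout (1 sign, 5 exponent, 10 fraction bits); it is not the implementation that was moved in this commit.

```go
package main

import (
	"fmt"
	"math"
)

// float16ToFloat32 widens IEEE 754 half-precision bits to a float32.
func float16ToFloat32(h uint16) float32 {
	sign := uint32(h>>15) & 0x1
	exp := uint32(h>>10) & 0x1f
	frac := uint32(h) & 0x3ff

	switch {
	case exp == 0 && frac == 0:
		// Signed zero.
		return math.Float32frombits(sign << 31)
	case exp == 0:
		// Subnormal half: shift the fraction up until the implicit leading
		// bit appears, lowering the float32 exponent once per shift.
		e := uint32(127 - 15 + 1)
		for frac&0x400 == 0 {
			frac <<= 1
			e--
		}
		frac &= 0x3ff
		return math.Float32frombits(sign<<31 | e<<23 | frac<<13)
	case exp == 0x1f:
		// Infinity (frac == 0) or NaN (frac != 0).
		return math.Float32frombits(sign<<31 | 0xff<<23 | frac<<13)
	default:
		// Normal number: rebias the exponent from 15 to 127.
		return math.Float32frombits(sign<<31 | (exp+127-15)<<23 | frac<<13)
	}
}

func main() {
	fmt.Println(float16ToFloat32(0x3c00)) // half-precision bits for 1.0
	fmt.Println(float16ToFloat32(0xc000)) // half-precision bits for -2.0
	fmt.Println(float16ToFloat32(0x0001)) // smallest positive subnormal
}
```

Every binary16 value is exactly representable in binary32, so the widening direction needs no rounding; only the reverse conversion does.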

Move gguf_quantize.go + gguf_quantize_test.go → gguf/quantize.go + gguf/quantize_test.go (package gguf).

API change matches the lora.FuseIntoPack pattern: gguf.QuantizeModelPack takes pre-validated pack.ModelPack as SourcePack instead of a ModelPath string. Callers run mlx.ValidateModelPack first and call mlx.ValidateModelPack(result.OutputPath) afterwards if they need a populated output pack.

Symbol renames per discipline (drop redundant GGUF prefix):
  QuantizeModelPackToGGUF → gguf.QuantizeModelPack
  QuantizeGGUFOptions → gguf.QuantizeOptions
  QuantizeGGUFResult → gguf.QuantizeResult (drops Pack field)
  GGUFQuantizeFormat → gguf.QuantizeFormat
  GGUFQuantizeQ8_0/Q4_0/Q4_K_M → gguf.QuantizeQ8_0/Q4_0/Q4_K_M

Move ggufValidationSummary from mlx-root model_pack.go into gguf as exported gguf.ValidationSummary; model_pack.go now calls it via the gguf package. Same helper, single home now.

Move samePath + copyModelPackMetadata + isModelWeightMetadataCopySkip + copyLocalFile into gguf as private helpers (also keep the model_merge.go mlx-root copies for non-gguf consumers like model_merge.go itself).

mlx-root tests that depended on lifted private helpers (denseSafetensor, loadDenseSafetensors, readDenseSafetensors, decodeDenseSafetensor, writeDenseSafetensorsPack, writeTestSafetensorsF32, safetensorTestTensor, appendUint16LE, float32ToFloat16) get duplicated copies in gguf_test_helpers_test.go for the tests that still live at mlx root (model_merge_test, kv_snapshot_*, api_test).

No production consumers of Quantize* API, only tests, so the API change is safe. Drop the second ValidateModelPack call (caller's responsibility); drop Pack field from QuantizeResult.

go vet ./... clean. mlx + gguf + lora + safetensors package tests green.

Co-Authored-By: Virgil <virgil@lethean.io>

Move model_merge.go + model_merge_test.go → merge/merge.go + merge/merge_test.go (package merge).

API change matches the lora.FuseIntoPack + gguf.QuantizeModelPack pattern: merge.Source carries a pre-validated pack.ModelPack (Pack field) instead of a Path string. Callers run mlx.ValidateModelPack on each source before invoking merge.Packs, and re-validate the output via mlx.ValidateModelPack(result.OutputPath) if they need a populated pack.

Symbol renames per discipline (drop redundant Model/ModelMerge prefix):
  MergeModelPacks → merge.Packs
  ModelMergeOptions → merge.Options
  ModelMergeResult → merge.Result (drops Pack field)
  ModelMergeMethod → merge.Method
  ModelMergeSource → merge.Source (Path → Pack)
  ModelMergeProvenance → merge.Provenance
  ModelMergeProvenanceFile → merge.ProvenanceFile
  ModelMergeLinear/SLERP/TIES/DARE → merge.MethodLinear/SLERP/TIES/DARE

Private helpers moved with the source (drop prefixes where redundant):
  prepareModelMerge → prepare
  ensureEmptyModelMergeDestination → ensureEmptyDestination
  validateModelMergePackCompatibility → validatePackCompatibility
  indexModelMergeSources → indexSources
  validateModelMergeTensorIndexes → validateTensorIndexes
  readMergeTensorRefs → readTensorRefs
  buildMergedSafetensorsHeader → buildMergedHeader
  readMergeTensorValues → readTensorValues
  writeLinearMergedTensorChunks → writeLinearChunks
  writeSLERPMergedTensorChunks → writeSLERPChunks
  normalizedMergeWeights → normalizedWeights
  writeModelMergeProvenance → writeProvenance
  modelMergePrepared → prepared
  modelMergeResultError → resultError
  StateBundleFileHash → hashFile (inlined private copy in merge)

samePath / copyModelPackMetadata / isModelWeightMetadataCopySkip / copyLocalFile / resultError travel with merge as private helpers (they were only used by model_merge.go after the earlier gguf_quantize lift moved away).

merge/helpers_test.go takes its own copies of denseSafetensor + loadDenseSafetensors + readDenseSafetensors + decodeDenseSafetensor + safetensorTestTensor + writeDenseSafetensorsPack + writeTestSafetensorsF32 + testResultError + writeModelPackFile + modelPackTokenizerJSON + testPack / testPackArch fixture builders.

Trim mlx-root gguf_test_helpers_test.go: remove safetensors-related helpers (denseSafetensor, loadDenseSafetensors, etc.); they no longer have mlx-root consumers after the merge lift.

mlx-root minimax_m2.go gains its own private copy of sameUint64Slice (small utility that was only used by minimax_m2 + the lifted merge code; the merge copy keeps its own).

No production consumers of ModelMerge* API, only tests, so the API change is safe.

go vet ./... clean. mlx + gguf + lora + safetensors + merge package tests green.

Co-Authored-By: Virgil <virgil@lethean.io>
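As background for the MethodLinear and MethodSLERP names above, the following sketch shows the textbook forms of the two interpolations over a pair of equal-length weight vectors. The function names lerp and slerp, the two-source restriction, and the fallback thresholds are assumptions made for this illustration; the PR text does not show how writeLinearChunks or writeSLERPChunks are actually implemented.

```go
package main

import (
	"fmt"
	"math"
)

// lerp blends a and b element-wise: (1-t)*a + t*b.
// It assumes len(a) == len(b).
func lerp(a, b []float32, t float64) []float32 {
	out := make([]float32, len(a))
	for i := range a {
		out[i] = float32((1-t)*float64(a[i]) + t*float64(b[i]))
	}
	return out
}

// slerp interpolates along the arc between a and b, treating each slice as
// one vector. It falls back to lerp when either vector is zero or the two
// are (anti)parallel, where the spherical weights are ill-conditioned.
func slerp(a, b []float32, t float64) []float32 {
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return lerp(a, b, t)
	}
	// Clamp to guard against rounding pushing the cosine outside [-1, 1].
	cosTheta := math.Max(-1, math.Min(1, dot/math.Sqrt(na*nb)))
	theta := math.Acos(cosTheta)
	sinTheta := math.Sin(theta)
	if sinTheta < 1e-6 {
		return lerp(a, b, t)
	}
	wa := math.Sin((1-t)*theta) / sinTheta
	wb := math.Sin(t*theta) / sinTheta
	out := make([]float32, len(a))
	for i := range a {
		out[i] = float32(wa*float64(a[i]) + wb*float64(b[i]))
	}
	return out
}

func main() {
	a := []float32{1, 0}
	b := []float32{0, 1}
	fmt.Println(lerp(a, b, 0.5))
	fmt.Println(slerp(a, b, 0.5))
}
```

For orthogonal unit vectors the midpoint differs visibly: the linear blend shrinks the norm, while the spherical blend keeps it at 1.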

Move kv_snapshot.go, kv_snapshot_blocks.go, kv_snapshot_memvid.go, kv_analysis.go (and their tests + examples) into kv/ (package kv). kv_snapshot_index.go stays at mlx root: its KVSnapshotMemvidBundleIndex struct has StateBundleModel + StateBundleTokenizer fields whose types live at mlx-root and would cycle.

Symbol renames per discipline (drop redundant KV/KVSnapshot prefix):
  KVSnapshot → kv.Snapshot
  KVLayerSnapshot → kv.LayerSnapshot
  KVHeadSnapshot → kv.HeadSnapshot
  KVSnapshotEncoding → kv.Encoding (+ Native/Q8/Base64/Binary)
  KVSnapshotVersion → kv.SnapshotVersion
  KVSnapshotSaveOptions → kv.SaveOptions
  KVSnapshotLoadOptions → kv.LoadOptions
  KVSnapshotCaptureOptions → kv.CaptureOptions
  LoadKVSnapshot{,WithOptions} → kv.Load{,WithOptions}
  KVSnapshotBlock → kv.Block
  KVSnapshotMemvidBlockOptions/Bundle/Ref → kv.MemvidBlock{Options,Bundle,Ref}
  KVSnapshotMemvidBlockBundleKind → kv.MemvidBlockBundleKind
  KVSnapshotMemvidBlockVersion → kv.MemvidBlockVersion
  AssembleKVSnapshotBlocks → kv.AssembleBlocks
  SaveKVSnapshotMemvidBlockBundle → kv.SaveMemvidBlockBundle
  LoadKVSnapshotFromMemvidBlocks{,WithOptions} → kv.LoadFromMemvidBlocks{,WithOptions}
  LoadKVSnapshotMemvidBlockBundle → kv.LoadMemvidBlockBundle
  LoadKVSnapshotPrefixFromMemvidBlocks{,WithOptions} → kv.LoadPrefixFromMemvidBlocks{,WithOptions}
  KVSnapshotMemvidOptions → kv.MemvidOptions
  LoadKVSnapshotFromMemvid{,WithOptions} → kv.LoadFromMemvid{,WithOptions}
  KVAnalysis → kv.Analysis, AnalyzeKV → kv.Analyze
  KVFeatures → kv.Features, KVFeatureLabels → kv.FeatureLabels

Helpers also moved into kv package as exported (mlx-root callers crossed package boundary so they needed to go public):
  hashKVSnapshot → kv.HashSnapshot
  validateKVSnapshotMemvidBlockBundle → kv.ValidateMemvidBlockBundle
  loadKVSnapshotMemvidBlockWithOptions → kv.LoadMemvidBlockWithOptions
  effectiveKVSnapshotTokenOffset → kv.EffectiveTokenOffset
  effectiveKVSnapshotSeqLen → kv.EffectiveSeqLen
  clearKVSnapshotTerminalState → kv.ClearTerminalState
  dropKVSnapshotFloat32 → kv.DropFloat32
  kvSnapshotResultError → kv.ResultError
  Snapshot.sliceBlock (method) → SliceBlock

Inline private copies kept in kv: normalizeSnapshot (was normalizeBundleSnapshot), requiresNativeEncoding (was kvSnapshotRequiresNativeEncoding), firstNonEmpty, defaultCacheBlockSize.

mlx-root NewStateBundle: local variable `kv` renamed to `snap` to avoid shadowing the imported kv package. state_bundle.go now calls kv.HashSnapshot / kv.Analyze directly.

NEW mlx-root kv_test_helpers_test.go contains test helpers (kvSnapshotBlocksTestSnapshot, recordingMemvidStore, failingMemvidWriter) duplicated for mlx-root tests that no longer have access to kv-package test internals.

~22 consumer files updated: agent_memory, api_common, api_darwin, api_stub, api_test, fast_eval{,_test}, hf_fit_test, expert_residency_test, inference_contract_darwin, kv_snapshot_index{,_test}, kv_cache_bench{,_test}, memory_plan{,_test}, memvid_chapter_smoke{,_test}, session_agent_darwin{,_test}, session_artifact{,_test}, session_darwin{,_test,_example_test}, session_stub_example_test, small_model_smoke, state_bundle{,_test}, workload_bench{,_test}.

go vet ./... clean. mlx + gguf + lora + safetensors + merge + kv tests green.

Co-Authored-By: Virgil <virgil@lethean.io>

eval is driver-neutral (orchestrates evaluation given a Runner adapter), so it lifts to go-inference/eval/ instead of go-mlx/eval/, alongside parser/, quant/jang/, quant/codebook/ which already live there.

Interface redesign for cycle-breaking:
- Sample/Batch/BatchConfig become opaque any
- Dataset is an interface (Next returns any)
- Runner gains BatchTokens callback (replaces sftBatchLossTokens) and SampleText callback (replaces direct .Text/.Response reads)
- eval.Info mirrors mlx.ModelInfo fields; eval.AdapterInfo mirrors lora.AdapterInfo. mlx-root converts at the boundary via modelInfoToEval, evalInfoToModel, loraToEvalAdapter, evalAdapterToLora.
- BuildBatches is now required (replaces optional Tokenizer + auto-build); driver wrappers provide BuildBatches that internally use their tokenizer + BuildDatasetBatches.

Symbol renames per discipline:
  EvalConfig → eval.Config
  EvalRunner → eval.Runner
  EvalReport → eval.Report (with eval.Info + eval.AdapterInfo)
  EvalMetrics → eval.Metrics
  EvalBatchMetrics → eval.BatchMetrics
  EvalQualityProbe → eval.QualityProbe (Context/Report/Check too)
  RunDatasetEval → eval.RunDataset
  EvalReportVersion → eval.ReportVersion

RunModelEval, NewModelEvalRunner stay at mlx-root as wrappers/adapters.

Move ResponseCoverageProbe into eval/ as an exported probe constructor; driver wrappers attach it via RunModelEval so eval doesn't need to know about SFTSample's field shape.

eval_test.go deleted from mlx-root (its orchestration testing now belongs in go-inference/eval/). Integration coverage stays in eval_darwin_test.go.

Bumps external/go-inference submodule pin to a18708d (driver-neutral eval package shipped).

Consumers updated: distill{,_test}.go, workload_bench{,_test}.go, inference_contract_{darwin,test}.go. distill.go gains a private distillCollectSamples helper (replaces collectEvalSamples from old eval.go). workload_bench.go gains normalizeWorkloadEvalConfig (replaces normalizeEvalConfig).

go vet ./... clean. mlx + gguf + lora + safetensors + merge + kv tests green.

Co-Authored-By: Virgil <virgil@lethean.io>
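The opaque-sample design described above (samples as `any`, a Dataset interface, and Runner callbacks that interpret the samples on the driver side) might look roughly like the sketch below. The callback names BuildBatches, BatchTokens and SampleText come from the commit message; their signatures, the CountTokens function and the sftSample type are assumptions for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Dataset yields opaque samples; the orchestrator never inspects them.
type Dataset interface {
	Next() (sample any, ok bool)
}

// Runner is a hypothetical callback bundle supplied by a driver.
type Runner struct {
	BuildBatches func(ds Dataset) ([]any, error) // required
	BatchTokens  func(batch any) int             // token count for one batch
	SampleText   func(sample any) string         // text view of one sample
}

// CountTokens is driver-neutral: it only calls the Runner callbacks and
// never type-asserts on a sample or batch itself.
func CountTokens(r Runner, ds Dataset) (int, error) {
	if r.BuildBatches == nil || r.BatchTokens == nil {
		return 0, errors.New("eval: BuildBatches and BatchTokens are required")
	}
	batches, err := r.BuildBatches(ds)
	if err != nil {
		return 0, err
	}
	total := 0
	for _, b := range batches {
		total += r.BatchTokens(b)
	}
	return total, nil
}

// --- driver side: the only code that knows the concrete sample type ---

type sftSample struct{ Text string }

type sliceDataset struct {
	items []sftSample
	pos   int
}

func (d *sliceDataset) Next() (any, bool) {
	if d.pos >= len(d.items) {
		return nil, false
	}
	s := d.items[d.pos]
	d.pos++
	return s, true
}

func newRunner() Runner {
	text := func(sample any) string {
		s, _ := sample.(sftSample)
		return s.Text
	}
	return Runner{
		SampleText: text,
		BuildBatches: func(ds Dataset) ([]any, error) {
			var out []any
			for s, ok := ds.Next(); ok; s, ok = ds.Next() {
				out = append(out, s) // one sample per batch, for brevity
			}
			return out, nil
		},
		BatchTokens: func(batch any) int {
			return len(strings.Fields(text(batch))) // whitespace stand-in for a tokenizer
		},
	}
}

func main() {
	ds := &sliceDataset{items: []sftSample{{Text: "hello world"}, {Text: "one two three"}}}
	total, err := CountTokens(newRunner(), ds)
	fmt.Println(total, err)
}
```

Because the orchestrator sees only `any`, the driver is free to change its sample struct without touching the shared package; the trade-off is that type mismatches surface at runtime inside the callbacks.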

bench package (go-inference/bench/) is the new driver-neutral local benchmark/eval harness. Drivers supply a Runner with verb-shaped callbacks (BenchPromptCache, BenchMemvidKVBlockWarm, BenchKVRestore, BenchStateBundle, BenchProbeOverhead, BenchSpeculativeDecode, BenchPromptLookupDecode). bench.Run orchestrates generation timing, dispatches each enabled callback, and assembles the Report.

mlx-root:
- fast_eval.go shrinks to type aliases + boundary converters (FastEval* → bench.* via type aliases; modelInfoToBench / benchInfoToModel / fromMlxMetrics / toBenchGenerateOptions / loraToBenchAdapter / benchAdapterToLora helpers).
- NEW fast_eval_runner.go contains the Model→bench.Runner adapter: each Bench* callback implements its driver-specific section against the Model API (kv snapshots, state bundles, memvid block warming, decode optimisation via RunSpeculativeDecode / RunPromptLookupDecode).
- memvid_chapter_smoke decouples from the bench.Runner: its callbacks (CaptureKVBlocksToMemvid, GenerateWithMemvidPrefix) deal with mlx-specific kv types, so it has its own MemvidKVChapterRunner at mlx-root (no longer wedged into the verb-callback shape).
- inference_contract_darwin.go converts at the bench boundary (benchInfoToModel / benchAdapterToLora) before calling toInferenceModelIdentity / toInferenceRootAdapterIdentity.
- workload_bench.go: drops normalizeFastEvalConfig (bench.Run normalises internally); ModelInfo conversion via benchInfoToModel.

Test coverage delta:
- fast_eval_test.go (801 lines), fast_eval_example_test.go (26 lines), workload_bench_test.go (525 lines) deleted: their callback mock setups exercise the OLD raw-callback Runner shape; equivalent coverage for the verb-callback shape should be added to go-inference/bench/ tests in a separate pass.
- memvid_chapter_smoke_test (integration tests for the chapter runner) rewrites to use MemvidKVChapterRunner + ChapterGeneration.
- inference_contract_test gains modelInfoToBench wrap at the boundary.

Bumps external/go-inference to include the bench package. go vet ./... clean. mlx + gguf + lora + safetensors + merge + kv tests green.

Co-Authored-By: Virgil <virgil@lethean.io>
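To make the verb-callback Runner shape concrete, here is a minimal sketch of an orchestrator that times generation and then dispatches only the callbacks a driver has set. The struct fields, signatures, Section and Report types below are assumptions for illustration; only the callback names BenchPromptCache and BenchKVRestore come from the commit message, and the real go-inference/bench API may differ.

```go
package main

import (
	"fmt"
	"time"
)

// Section is one named part of a benchmark report (hypothetical shape).
type Section struct {
	Name    string
	Elapsed time.Duration
	Err     error
}

// Runner is a hypothetical driver-supplied set of callbacks. A nil
// optional callback means the driver does not implement that section.
type Runner struct {
	Generate         func(prompt string) (tokens int, err error)
	BenchPromptCache func() error
	BenchKVRestore   func() error
}

// Report collects generation timing plus one Section per enabled callback.
type Report struct {
	Tokens     int
	Generation time.Duration
	Sections   []Section
}

// timed runs fn and records how long it took.
func timed(name string, fn func() error) Section {
	start := time.Now()
	err := fn()
	return Section{Name: name, Elapsed: time.Since(start), Err: err}
}

// Run times generation, then dispatches every non-nil optional
// callback in a fixed order and assembles the Report.
func Run(r Runner, prompt string) (Report, error) {
	if r.Generate == nil {
		return Report{}, fmt.Errorf("bench: runner has no Generate callback")
	}
	var rep Report
	start := time.Now()
	n, err := r.Generate(prompt)
	rep.Generation = time.Since(start)
	if err != nil {
		return rep, err
	}
	rep.Tokens = n
	optional := []struct {
		name string
		fn   func() error
	}{
		{"prompt_cache", r.BenchPromptCache},
		{"kv_restore", r.BenchKVRestore},
	}
	for _, o := range optional {
		if o.fn != nil {
			rep.Sections = append(rep.Sections, timed(o.name, o.fn))
		}
	}
	return rep, nil
}

func main() {
	r := Runner{
		Generate:         func(string) (int, error) { return 3, nil },
		BenchPromptCache: func() error { return nil },
	}
	rep, err := Run(r, "hello")
	fmt.Println(rep.Tokens, len(rep.Sections), err)
}
```

Keeping each section behind an optional callback is what lets a driver without, say, KV restore support still produce a partial report.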

Picks up the bench package unit tests (test(bench): unit tests for driver-neutral Run orchestration). Coverage rebuilt for the verb-callback Runner shape after deleting fast_eval_test.go + fast_eval_example_test.go + workload_bench_test.go in Phase 2M.

Co-Authored-By: Virgil <virgil@lethean.io>

Phase 2N: the speculative + prompt-lookup decode algorithm is driver-neutral (accept/reject over token streams, generation delegated to caller callbacks), so it lifts to go-inference/decode/ alongside bench and eval.

decode_optimisation.go is rewritten as a thin shim with legacy type aliases (DecodeOptimisationResult, DecodeOptimisationMetrics) and boundary converters (mlxDecodeGenToDecode, mlxTokensToDecode, decodeTokensToMlx). DecodeGenerateFunc keeps the mlx-shaped signature so existing callbacks continue to compile; RunSpeculativeDecode / RunPromptLookupDecode wrap them to decode.GenerateFunc internally. decodeTokensText survives as a thin wrapper for memvid_chapter_smoke.

Submodule pin bumped to go-inference 521dd53 (feat(decode): driver-neutral speculative + prompt-lookup decode harness).

Coverage rebuilt:
- decode_optimisation_test.go now covers the boundary converters, nil-callback handling, token round-trip, and legacy-alias surface
- decode_optimisation_example_test.go for AX example registration
- fast_eval_test.go BACKFILLS the Phase 2M orphan: covers alias routing, DefaultFastEvalConfig forwarding, RunFastEval bench smoke against a synthetic Runner, toBenchGenerateOptions clone + probe-sink passthrough, fromMlxMetrics field copy, modelInfoToBench round-trip with adapter clone, fastEvalResultError
- fast_eval_example_test.go matches AX pattern

go vet ./... clean. Tests: mlx + kv + lora + merge + gguf + pack all green. Pre-existing internal/metal failure (TestGenerate_Model_StagedMiniMaxReturnsDecodeError_Bad nil-tokenizer panic) is unrelated: it fails identically on pristine HEAD.

Co-Authored-By: Virgil <virgil@lethean.io>
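The "accept/reject over token streams" idea can be sketched without any model at all. The snippet below is an illustrative assumption, not the go-inference/decode implementation: acceptPrefix and lookupDraft are hypothetical helpers. The first counts how many drafted tokens the target agrees with; the second shows one common way to build a prompt-lookup draft, by copying what followed an earlier occurrence of the current n-gram.

```go
package main

import "fmt"

// acceptPrefix returns how many draft tokens the target agrees with,
// counting from the start and stopping at the first mismatch.
func acceptPrefix(draft, target []int) int {
	n := 0
	for n < len(draft) && n < len(target) && draft[n] == target[n] {
		n++
	}
	return n
}

// lookupDraft proposes up to max tokens by finding the most recent
// earlier occurrence of the last ngram tokens in history and copying
// the tokens that followed it. It returns nil when nothing matches.
func lookupDraft(history []int, ngram, max int) []int {
	if ngram <= 0 || max <= 0 || len(history) <= ngram {
		return nil
	}
	suffix := history[len(history)-ngram:]
	for start := len(history) - ngram - 1; start >= 0; start-- {
		match := true
		for i := 0; i < ngram; i++ {
			if history[start+i] != suffix[i] {
				match = false
				break
			}
		}
		if !match {
			continue
		}
		from := start + ngram
		to := from + max
		if to > len(history) {
			to = len(history)
		}
		// Copy so callers cannot alias the history slice.
		return append([]int(nil), history[from:to]...)
	}
	return nil
}

func main() {
	history := []int{1, 2, 3, 4, 1, 2}
	draft := lookupDraft(history, 2, 2)
	// target stands in for what the full model would have produced.
	target := []int{3, 9}
	fmt.Println(draft, acceptPrefix(draft, target))
}
```

Because both helpers operate on plain token slices, generation itself can stay behind caller-supplied callbacks, which is the property that makes the harness driver-neutral.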

Phase 2O: state bundle is deeply mlx-coupled (kv.Snapshot, lora.AdapterInfo, SAMI), so it lifts to go-mlx/bundle/ as a sibling package rather than to go-inference. SAMI types travel with bundle since Bundle.SAMI holds *SAMIResult.

Symbols rename per the folder-taxonomy rule (drop prefixes the package carries):

  StateBundle → bundle.Bundle
  StateBundleOptions → bundle.Options
  StateBundleModel → bundle.Model
  StateBundlePrompt → bundle.Prompt
  StateBundleTokenizer → bundle.Tokenizer
  StateBundleRuntime → bundle.Runtime
  StateBundleAdapter → bundle.Adapter
  StateBundleSampler → bundle.Sampler
  StateBundleRef → bundle.Ref
  StateBundleVersion → bundle.Version
  StateBundleKind → bundle.Kind
  StateBundleRefMemvid → bundle.RefMemvid
  NewStateBundle → bundle.New
  LoadStateBundle → bundle.Load
  CheckStateBundleCompatibility → bundle.CheckCompatibility
  StateBundleFileHash → bundle.FileHash
  SAMIResult → bundle.SAMIResult (kept name; separate concept)
  SAMIOptions → bundle.SAMIOptions
  SAMIFromKV → bundle.SAMIFromKV

mlx-root state_bundle.go becomes a thin shim with type aliases for the 77 caller sites + boundary converters for mlx.ModelInfo → bundle.ModelInfo and mlx.GenerateConfig → bundle.Sampler. mlx-root keeps StateBundleOptions as its own struct (carrying mlx-shaped ModelInfo + GenerateConfig + *SAMIResult) so existing callers compile unchanged.

session_artifact.go's SAMIResult / SAMIOptions become aliases to bundle.SAMIResult / bundle.SAMIOptions; SAMIFromKV becomes a thin wrapper. The math helpers (clampUnit, clampRange, meanUnit, layerMetric) move to bundle/sami.go with the SAMI types.

stateBundleTokenizer + stateHash + stateMemvidURI retained as private mlx-root wrappers (bundle.NormaliseTokenizer + bundle.HashString + bundle.MemvidURI) for callers session_agent_darwin.go + kv_snapshot_index.go that referenced the old in-package names.

stateBundleTestSnapshot test helper moved to kv_test_helpers_test.go so lora_adapter*_test.go + session_darwin_test.go continue to compile.

Coverage:
- bundle/bundle_test.go covers Save/Load, memvid snapshot round-trip, frame-zero allowance, defensive cloning, Validate + CheckCompatibility happy + sad paths, AdapterFromInfo round-trip, NormaliseTokenizer, AdapterEmpty, HashString, FileHash, MemvidURI, SAMIFromKV
- bundle/example_test.go for AX example registration
- state_bundle_test.go covers the shim: alias identity, modelInfoToBundle, stateSamplerFromGenerateConfig clone, CheckStateBundleCompatibility, FileHash, Load round-trip, SnapshotFromMemvid via shim route, the private cross-file helpers

go vet ./... clean. Tests: mlx + bundle + kv + lora + merge + gguf + pack all green. Pre-existing internal/metal panic remains unrelated.

Co-Authored-By: Virgil <virgil@lethean.io>
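The shim approach above relies on Go type aliases, which make the old and new names the same type rather than two convertible types. A minimal single-file sketch follows; Bundle's fields and the describe helper are hypothetical, and in the real layout the lifted type would live in a separate bundle package rather than alongside the alias.

```go
package main

import "fmt"

// Bundle stands in for the lifted bundle.Bundle type.
// Its fields are assumptions made for this illustration.
type Bundle struct {
	Version int
	Kind    string
}

// New stands in for bundle.New.
func New(kind string) *Bundle { return &Bundle{Version: 1, Kind: kind} }

// StateBundle is the legacy name kept as an alias (note the "="), not a
// new defined type, so values flow between old and new call sites
// without any conversion.
type StateBundle = Bundle

// NewStateBundle is a one-line forwarder under the legacy name.
func NewStateBundle(kind string) *StateBundle { return New(kind) }

// describe accepts the new name; values built via the legacy name
// satisfy it directly because both names denote one type.
func describe(b *Bundle) string {
	return fmt.Sprintf("%s v%d", b.Kind, b.Version)
}

func main() {
	legacy := NewStateBundle("kv")
	fmt.Println(describe(legacy))
	// %T reports the underlying defined type, not the alias name.
	fmt.Printf("%T\n", legacy)
}
```

This is why existing caller sites can compile unchanged: an alias adds a name, not a type, so no call site needs a cast.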

Phase 2P: probe is the go-mlx event-vocabulary for inference + training observability. It lifts to go-mlx/probe/ rather than go-inference because the event shape is mlx-rich: ProbeExpertResidency carries MoE paging events that the driver-neutral inference.ProbeEvent contract (at dappco.re/go/inference root) doesn't expose. The two probe vocabularies remain intentionally separate: inference owns the backend contract, go-mlx/probe/ owns the rich driver event vocabulary.

Symbols rename per the folder-taxonomy rule (drop prefixes the package carries):

  ProbeEvent → probe.Event
  ProbeEventKind → probe.Kind
  ProbePhase → probe.Phase
  ProbeToken → probe.Token
  ProbeLogit → probe.Logit
  ProbeLogits → probe.Logits
  ProbeEntropy → probe.Entropy
  ProbeHeadSelection → probe.HeadSelection
  ProbeLayerCoherence → probe.LayerCoherence
  ProbeRouterDecision → probe.RouterDecision
  ProbeExpertResidency → probe.ExpertResidency
  ProbeResidualSummary → probe.ResidualSummary
  ProbeCachePressure → probe.CachePressure
  ProbeMemoryPressure → probe.MemoryPressure
  ProbeTraining → probe.Training
  ProbeSink → probe.Sink
  ProbeSinkFunc → probe.SinkFunc
  ProbeBus → probe.Bus
  ProbeRecorder → probe.Recorder
  NewProbeBus → probe.NewBus
  NewProbeRecorder → probe.NewRecorder
  cloneProbeEvent → probe.CloneEvent (exported)

ExpertResidencyAction + its four constants move from expert_residency.go to probe so probe.ExpertResidency.Action stays a typed enum; mlx-root expert_residency.go gets a type alias plus const re-declarations.

mlx-root probe.go shrinks from 337 to ~80 LOC: type aliases for 19 types + 14 constants, plus the mlx-specific GenerateOption helpers (WithProbeSink, WithProbeCallback) that stay because they touch mlx.GenerateConfig. NewProbeBus/NewProbeRecorder become one-line forwarders. All ~203 caller references across 20+ files compile unchanged thanks to the alias surface.

Coverage:
- probe/probe_test.go covers Recorder defensive-copy semantics, Bus fanout + concurrent safety + nil-receiver guards, SinkFunc nil handling, CloneEvent deep-copy across every payload pointer plus Meta map, ExpertResidencyAction + Kind + Phase constant values
- probe/example_test.go for AX example registration
- probe_test.go (mlx-root) covers alias identity, constant preservation, ExpertResidencyAction alias identity, NewProbeBus + NewProbeRecorder wiring, WithProbeSink / WithProbeCallback installing on GenerateConfig (including the nil-callback no-op)
- probe_example_test.go matches AX pattern

go vet ./... clean. Tests: mlx + probe + bundle + kv + lora + merge + gguf + pack all green. Pre-existing internal/metal panic unrelated.

Co-Authored-By: Virgil <virgil@lethean.io>
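The behaviours the probe tests list (Bus fanout, nil-receiver guards, SinkFunc nil handling, defensive copies of the Meta map) follow a familiar pattern. The sketch below is an assumption about how such a package could look, with a pared-down Event and hypothetical method names (Add, Emit, Events); it is not the go-mlx/probe source.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a pared-down, hypothetical probe event.
type Event struct {
	Kind string
	Meta map[string]string
}

// Sink receives events.
type Sink interface{ Emit(Event) }

// SinkFunc adapts a function to Sink; a nil SinkFunc is a no-op.
type SinkFunc func(Event)

func (f SinkFunc) Emit(e Event) {
	if f != nil {
		f(e)
	}
}

// cloneEvent deep-copies Meta so sinks never share mutable state.
func cloneEvent(e Event) Event {
	if e.Meta != nil {
		m := make(map[string]string, len(e.Meta))
		for k, v := range e.Meta {
			m[k] = v
		}
		e.Meta = m
	}
	return e
}

// Bus fans each event out to every registered sink.
type Bus struct {
	mu    sync.RWMutex
	sinks []Sink
}

// Add registers a sink; a nil Bus or nil sink is ignored.
func (b *Bus) Add(s Sink) {
	if b == nil || s == nil {
		return
	}
	b.mu.Lock()
	b.sinks = append(b.sinks, s)
	b.mu.Unlock()
}

// Emit delivers a private copy of e to each sink; safe on a nil Bus.
func (b *Bus) Emit(e Event) {
	if b == nil {
		return
	}
	b.mu.RLock()
	sinks := append([]Sink(nil), b.sinks...)
	b.mu.RUnlock()
	for _, s := range sinks {
		s.Emit(cloneEvent(e))
	}
}

// Recorder stores copies of events and hands out copies on read.
type Recorder struct {
	mu     sync.Mutex
	events []Event
}

func (r *Recorder) Emit(e Event) {
	r.mu.Lock()
	r.events = append(r.events, cloneEvent(e))
	r.mu.Unlock()
}

func (r *Recorder) Events() []Event {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make([]Event, len(r.events))
	for i, e := range r.events {
		out[i] = cloneEvent(e)
	}
	return out
}

func main() {
	bus, rec := &Bus{}, &Recorder{}
	bus.Add(rec)
	bus.Add(SinkFunc(nil)) // tolerated: emits nothing
	bus.Emit(Event{Kind: "token", Meta: map[string]string{"phase": "decode"}})
	fmt.Println(len(rec.Events()))
}
```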

Phase 2Q: scheduler.go is fully driver-neutral (only inference.TextModel deps, no kv/lora/probe-mlx), so it lifts to go-inference/scheduler/ alongside bench, decode, and eval.

Symbols rename per the folder-taxonomy rule:

  ScheduledModel → scheduler.Model
  SchedulerConfig → scheduler.Config
  NewScheduledModel → scheduler.New

mlx-root scheduler.go shrinks from 400 to ~25 LOC: type aliases for ScheduledModel + SchedulerConfig + one-line NewScheduledModel forwarder. register_metal.go's `scheduler *ScheduledModel` field + register_metal_scheduler.go's wrappers compile unchanged through the aliases.

Submodule pin bumped to go-inference 254b391 (feat(scheduler): driver-neutral request scheduler).

Coverage:
- go-inference/go/scheduler/scheduler_test.go ports the canonical suite (queue + latency probe, full-queue rejection, cancellation, Generate/Chat/Classify/BatchGenerate delegation, nil + cancelled-context paths, fallback cancel via inference.CancellableModel, Err propagation, generateOptions sampler conversion, cloneLabels + millis helpers)
- go-inference/go/scheduler/example_test.go for AX coverage
- scheduler_test.go (mlx-root) covers alias identity + NewScheduledModel forward + nil-base defensive wrapper
- scheduler_example_test.go matches AX pattern

go vet ./... clean. Tests: mlx + probe + bundle + kv + lora + merge + gguf + pack all green. Pre-existing internal/metal panic unrelated.

Co-Authored-By: Virgil <virgil@lethean.io>

Phase 2R: memory_plan is the local-inference memory planner that maps measured Apple-silicon hardware + model metadata to a runtime policy. The generic core (memory class detection, base class plans, KV cache estimation, architecture hints, generic MoE residency) lifts to go-mlx/memory/. The MiniMax-M2-specific overrides (tensor-plan expert-residency + first-layer skeleton bytes) stay at mlx-root, layered on top of the generic plan.

Symbols rename per the folder-taxonomy rule (drop prefixes the package carries):

  MemoryPlan → memory.Plan
  MemoryPlanInput → memory.Input (only used internally now; mlx-root keeps its own MemoryPlanInput with mlx-shaped DeviceInfo + ModelInfo)
  PlanMemory → memory.NewPlan
  MemoryClass → memory.Class
  MemoryClass* → memory.Class* (7 constants)
  MemoryGiB → memory.GiB
  KVCachePolicy → memory.KVCachePolicy (kept name; package doesn't repeat the prefix)
  KVCacheMode → memory.KVCacheMode
  ExpertResidencyPlan → memory.ExpertResidencyPlan
  ExpertResidencyMode → memory.ExpertResidencyMode
  ExpertResidencyMode* → memory.ExpertResidencyMode* (3 constants)
  ExpertEvictionPolicy → memory.ExpertEvictionPolicy
  ExpertEvictionLRU → memory.ExpertEvictionLRU

mlx-root memory_plan.go shrinks from 529 to ~165 LOC:
- Type aliases for MemoryPlan + MemoryClass + KVCachePolicy + KVCacheMode + 19 constants + MemoryGiB
- mlx.MemoryPlanInput stays its own struct (carries mlx.DeviceInfo + *mlx.ModelInfo so existing callers compile unchanged)
- PlanMemory wrapper: converts to memory.Input, calls memory.NewPlan, layers MiniMaxM2LayerForwardSkeleton bytes + MiniMaxM2TensorPlan expert residency on top
- applyMemoryPlanToLoadConfig stays here (uses mlx.LoadConfig)
- minPositive retained as a private helper for expert_residency.go

expert_residency.go's ExpertResidencyPlan + Mode + EvictionPolicy become aliases to memory.* types. The runtime manager + Stats + Context types stay at mlx-root.

memory package is self-contained: imports only inference/quant/jang, mlx/pack, mlx/profile. normalizeKnownArchitecture + trim/lower/replace ASCII helpers duplicated locally to avoid importing mlx-root.

Coverage:
- memory/memory_test.go covers the generic core: 16/24/32/64/96/128GB class plans, context capped by pack metadata, Qwen3-MoE hints, MiniMax architecture caps, BERT embedding disables generation cache, fallback on zero memory, model metadata caps context, Q8 KV cache for middle classes, generic MoE residency, ClassForBytes boundaries, minPositive, percentBytes, normalizeKnownArchitecture aliases (15 tests)
- memory/example_test.go for AX coverage
- memory_plan_test.go at mlx-root unchanged: all 11 existing tests pass through the shim, exercising the integrated path including MiniMaxM2 skeleton + tensor-plan residency

go vet ./... clean. Tests: mlx + memory + probe + bundle + kv + lora + merge + gguf + pack all green. Pre-existing internal/metal panic unrelated.

Co-Authored-By: Virgil <virgil@lethean.io>
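As a rough illustration of the KV cache estimation and memory-class bucketing the planner performs, the sketch below uses the standard transformer KV-size formula (keys plus values, per layer, per KV head, per head dimension, per position). The function names, the class boundaries, and the example model shape are all assumptions; the commit does not state the thresholds or formula that memory.NewPlan actually uses.

```go
package main

import "fmt"

// GiB is 2^30 bytes.
const GiB = uint64(1) << 30

// kvCacheBytes estimates KV cache size: keys and values (the factor 2)
// for every layer, KV head, head dimension and context position.
func kvCacheBytes(layers, kvHeads, headDim, contextLen, bytesPerElem uint64) uint64 {
	return 2 * layers * kvHeads * headDim * contextLen * bytesPerElem
}

// contextThatFits returns the largest context length whose KV cache
// fits in budget bytes for the given model shape and element width.
func contextThatFits(budget, layers, kvHeads, headDim, bytesPerElem uint64) uint64 {
	perToken := kvCacheBytes(layers, kvHeads, headDim, 1, bytesPerElem)
	if perToken == 0 {
		return 0
	}
	return budget / perToken
}

// classForBytes buckets device memory into a named class. The class
// names follow the sizes listed in the commit; the boundaries are
// illustrative assumptions only.
func classForBytes(mem uint64) string {
	switch {
	case mem == 0:
		return "unknown"
	case mem <= 16*GiB:
		return "16GB"
	case mem <= 24*GiB:
		return "24GB"
	case mem <= 32*GiB:
		return "32GB"
	case mem <= 64*GiB:
		return "64GB"
	case mem <= 96*GiB:
		return "96GB"
	default:
		return "128GB"
	}
}

func main() {
	// Hypothetical model shape: 32 layers, 8 KV heads, head dim 128.
	fp16 := kvCacheBytes(32, 8, 128, 8192, 2)
	fmt.Println("fp16 KV at 8192 tokens (GiB):", fp16/GiB)
	fmt.Println("Q8 context in 1 GiB:", contextThatFits(GiB, 32, 8, 128, 1))
	fmt.Println("class:", classForBytes(24*GiB))
}
```

The example shows why an 8-bit KV cache is attractive on mid-sized machines: halving the element width doubles the context that fits in the same budget.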

Phase 2S: mega-lift matching the model/{arch}/{name}/ folder taxonomy called out in feedback_driver_lift_discipline.md. Moves four mlx-root source files (minimax_m2.go 1016 LOC + minimax_m2_native_darwin.go 167 + minimax_m2_native_stub.go 32 + expert_residency.go 476) plus three test files (minimax_m2_test.go 643 + minimax_m2_darwin_test.go 441 + expert_residency_test.go 159) to go-mlx/model/minimax/m2/ as a single self-contained package.

Symbol renames per the folder-taxonomy rule (drop prefixes the package carries; m2 carries "MiniMaxM2"):

  MiniMaxM2Config → m2.Config
  MiniMaxM2TensorRole → m2.TensorRole
  MiniMaxM2TensorRole* (9 constants) → m2.TensorRole* (9 constants)
  MiniMaxM2TensorSpec → m2.TensorSpec
  MiniMaxM2TensorPlan → m2.TensorPlan
  MiniMaxM2RouterDecision → m2.RouterDecision
  MiniMaxM2ExpertFunc → m2.ExpertFunc
  MiniMaxM2PackedExpertWeights → m2.PackedExpertWeights
  MiniMaxM2RouterWeights → m2.RouterWeights
  MiniMaxM2PackedLayerForwardOptions → m2.PackedLayerForwardOptions
  MiniMaxM2PackedLayerForwardResult → m2.PackedLayerForwardResult
  MiniMaxM2LazyExpertLoad → m2.LazyExpertLoad
  MiniMaxM2DenseProjectionTensor → m2.DenseProjectionTensor
  MiniMaxM2DenseExpertWeights → m2.DenseExpertWeights
  MiniMaxM2ResolvedTensor → m2.ResolvedTensor
  MiniMaxM2LayerForwardSkeleton → m2.LayerForwardSkeleton
  ParseMiniMaxM2Config → m2.ParseConfig
  BuildMiniMaxM2TensorPlan → m2.BuildTensorPlan
  RouteMiniMaxM2Tokens → m2.RouteTokens
  DispatchMiniMaxM2Experts → m2.DispatchExperts
  LoadMiniMaxM2PackedExpertsForDecisionsFromSafetensors → m2.LoadPackedExpertsForDecisions
  LoadMiniMaxM2LazyExpertsForHiddenFromSafetensors → m2.LoadLazyExpertsForHidden
  LoadMiniMaxM2PackedExpertsFromSafetensors → m2.LoadPackedExperts
  LoadMiniMaxM2RouterFromSafetensors → m2.LoadRouter
  ProjectMiniMaxM2RouterScores → m2.ProjectRouterScores
  BuildMiniMaxM2LayerForwardSkeletonFromSafetensors → m2.BuildLayerForwardSkeleton
  MiniMaxM2RouterProbeEvents → m2.RouterProbeEvents
  MiniMaxM2ExpertResidencyLoader → m2.ResidencyLoader
  MiniMaxM2ExpertResidencyConfig → m2.ResidencyConfig
  MiniMaxM2ExpertResidencyManager → m2.ResidencyManager
  NewMiniMaxM2ExpertResidencyManager → m2.NewResidencyManager
  PlanMiniMaxM2ExpertResidency → m2.PlanResidency
  DispatchMiniMaxM2PackedExpertsMetal → m2.DispatchPackedExpertsMetal
  DispatchMiniMaxM2PackedExpertsFromSafetensorsMetal → m2.DispatchPackedExpertsFromSafetensorsMetal
  ForwardMiniMaxM2LazyExpertLoadMetal → m2.ForwardLazyExpertLoadMetal
  ForwardMiniMaxM2PackedLayerMetal → m2.ForwardPackedLayerMetal
  ForwardMiniMaxM2PackedLayerFromSafetensorsMetal → m2.ForwardPackedLayerFromSafetensorsMetal
  normaliseExpertResidencyPlan → m2.NormalisePlan
  JANGPackedProjectionTensor → m2.JANGPackedProjectionTensor

Private helpers all lose the miniMaxM2 prefix (decisionExpertIDs, uniqueExpertIDs, packedDType, etc.).

ExpertResidencyStats moves to memory.ExpertResidencyStats (it's the companion measurement type for memory.ExpertResidencyPlan that was already there).

mlx-root shim files (minimax_m2.go, minimax_m2_native_darwin.go, minimax_m2_native_stub.go, expert_residency.go) preserve all 66 caller references via type aliases + wrapper functions. memory_plan.go's PlanMemory MiniMaxM2-specific overrides still compile through the aliases. model_pack.go's ParseMiniMaxM2Config / BuildMiniMaxM2TensorPlan / BuildMiniMaxM2LayerForwardSkeletonFromSafetensors calls route through wrappers. workload_bench.go's ExpertResidencyStats + normaliseExpertResidencyPlan route through aliases.

m2 package is self-contained: imports core, jang, mlx/memory, mlx/probe, mlx/profile, mlx/safetensors, mlx/quant/jang only, with no upward mlx-root import (which would cycle). Private helpers (firstNonEmpty, normalizeKnownArchitecture, nonZeroDuration, maxPositive, minPositive, firstPositive) duplicated locally in helpers.go.

Test fixtures (miniMaxM2FixtureConfig + findMiniMaxM2Spec + writeMiniMaxM2RawSafetensors + miniMaxM2SkeletonRawTensors + miniMaxM2F32RawTensor + miniMaxM2RawSafetensor) duplicated at mlx-root in minimax_m2_test_helpers_test.go so jang_darwin_test.go and model_pack_test.go still build. Go test packages cannot import each other's internal _test.go helpers, hence the duplication.

internal/metal/metal.go's defaultMetallibPath search expanded by two more parent-dir candidates so tests running from model/minimax/m2/ (5 directories deep) can still discover dist/lib/mlx.metallib.

go vet ./... clean. Tests: mlx + m2 + memory + probe + bundle + kv + lora + merge + gguf + pack + ide-side packages all green. Pre-existing internal/metal TestGenerate_Model_StagedMiniMaxReturnsDecodeError_Bad nil-tokenizer panic still unrelated.

Co-Authored-By: Virgil <virgil@lethean.io>

Phase 2T: hf_fit.go (1019 LOC) hosts the HuggingFace metadata source + local-fit planner. The public HF* symbols have ZERO callers in production code (only test references), so the lift is mostly a shape change.

Lifts to go-mlx/hf/ with symbol renames per the folder-taxonomy rule:

  HFModelSource → hf.ModelSource
  HuggingFaceModelSourceConfig → hf.RemoteConfig
  HuggingFaceModelSource → hf.RemoteSource
  NewHuggingFaceModelSource → hf.NewRemoteSource
  HFModelFitConfig → hf.FitConfig
  HFModelMetadata → hf.ModelMetadata
  HFModelFile → hf.ModelFile
  HFModelConfig → hf.ModelConfig
  HFQuantizationConfig → hf.QuantizationConfig
  HFModelFitReport → hf.FitReport
  HFModelFitPlan → hf.FitPlan
  HFTrainingFit → hf.TrainingFit
  PlanHFModelFits → hf.PlanFits
  InferJANGFromHF → hf.InferJANG
  HFModelSourceRemote/Local → hf.SourceRemote/Local

Plus all the private helpers (collectFitEntries, planFit, weightFormatAndBytes, inferQuantBits, etc.) lose the hf-redundant prefixes.

hf package is self-contained: imports core, jang, mlx/memory, mlx/pack, mlx/profile. Uses memory.Class / memory.Plan / memory.NewPlan / memory.Input / memory.DeviceInfo / memory.GiB / memory.KVCacheMode* directly (no mlx-root coupling). The four model-pack-helper calls that previously delegated to mlx-root (modelPackSupportedArchitecture, modelPackNativeRuntimeSupported, modelPackUsesGenerationKVCache, inspectModelPackTaskProfiles) are now inlined as private hf helpers (archSupported, archNativeRuntime, usesGenerationKVCache, resolveArchitectureProfile); each is a thin wrapper over profile.LookupArchitectureProfile, no behaviour change.

mlx-root hf_fit.go shrinks from 1019 to ~65 LOC of pure shim: 11 type aliases + 2 const re-exports + 3 wrapper functions. PlanHFModelFits auto-fills cfg.Device from GetDeviceInfo() (the mlx-root metal probe) and converts to memory.DeviceInfo at the boundary, so caller-facing behaviour is preserved.

helpers.go (new at mlx-root) holds firstNonEmpty / firstPositive / indexString that were at the bottom of hf_fit.go and are used by dataset_stream, kv_snapshot_index, memvid_chapter_smoke, model_pack, and openai. They stay at mlx-root because mlx-root consumers cannot import hf (wrong direction).

model_config_probe.go (new at mlx-root) holds modelConfigProbe + readModelConfig + the probe's accessor methods, plus normalizeKnownArchitecture and architectureFromTransformersName. These are used by model_pack.go's inspectModelPackConfig + applyModelPackConfigMetadata; the originals lived in hf_fit.go. The hf package keeps its own private copies of the two architecture normalisers (they're used internally by the planner too).

Tests port into hf package: they exercise internal fields/methods (.baseURL, .userAgent, .client, .byteSize) so package-private access is preserved. writeModelPackFile test helper duplicated in hf/test_helpers_test.go since Go test packages cannot import each other's internal helpers.

go vet ./... clean. Tests: mlx + hf + memory + probe + bundle + kv + lora + merge + gguf + pack + m2 all green.

Co-Authored-By: Virgil <virgil@lethean.io>
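A local-fit planner ultimately compares an estimate of packed weight size against a share of device memory. The sketch below shows that comparison under simple assumptions: weightBytes ignores quantisation scales and other metadata, usableFraction is a made-up knob, and sketchFit is a hypothetical name that does not correspond to the hf package's private planFit helper.

```go
package main

import "fmt"

// GiB is 2^30 bytes.
const GiB = uint64(1) << 30

// weightBytes approximates packed weight size from the parameter count
// and bits per weight, rounding up to whole bytes. Quantisation scales
// and other metadata are ignored in this sketch.
func weightBytes(params, bits uint64) uint64 {
	return (params*bits + 7) / 8
}

// Fit is a hypothetical, minimal fit verdict.
type Fit struct {
	WeightBytes uint64
	Headroom    uint64 // bytes left for KV cache and activations
	Fits        bool
}

// sketchFit checks whether the weights fit inside the usable share of
// device memory and reports the remaining headroom when they do.
func sketchFit(params, bits, deviceBytes uint64, usableFraction float64) Fit {
	budget := uint64(float64(deviceBytes) * usableFraction)
	w := weightBytes(params, bits)
	f := Fit{WeightBytes: w, Fits: w <= budget}
	if f.Fits {
		f.Headroom = budget - w
	}
	return f
}

func main() {
	// Hypothetical 8B-parameter model on a 16 GiB device, 75% usable.
	for _, bits := range []uint64{4, 8, 16} {
		f := sketchFit(8_000_000_000, bits, 16*GiB, 0.75)
		fmt.Printf("%2d-bit: weights=%d bytes fits=%v\n", bits, f.WeightBytes, f.Fits)
	}
}
```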

github-advanced-security

SonarCloud found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Use borrowed full page handles for immediate paged-cache decode attention, keeping partial preallocated pages owned as visible slices. Refresh the 100k retained workflow report with the measured borrowed-page run and current runner deltas.

Co-Authored-By: Virgil <virgil@lethean.io>

Evaluate non-paged prompt-cache state before detaching chunked prefill arrays so contiguous and rotating caches do not carry unevaluated MLX graph handles into the next chunk. Leave paged caches on the accepted production path without the extra synchronisation point.

Document the fp16/rotating 100k diagnostic as a rejected production shortcut: the prefill primitive error is fixed, but decode still crashes before producing a report.

Co-Authored-By: Virgil <virgil@lethean.io>

Record 100k same-shape diagnostics for larger paged K/V blocks and preallocated page writes. Both stay below the accepted 1024-page borrowed-state lane, so the long-context target remains fused paged/global attention rather than page-size tuning.

Update GOAL.md, the runtime index, long-context diagnosis, and the production benchmark manifest with the new rejected artefacts.

Co-Authored-By: Virgil <virgil@lethean.io>

Retain the materialised full K/V state produced by paged fast-concat on full-attention owner layers so shared Gemma 4 layers can reuse it instead of rebuilding the same long-context state.

Records the 100k retained workflow moving from 260.093s / 51.293 tok/s to 231.109s / 60.011 tok/s, while keeping the external runner gap open in GOAL.md and runtime docs.

Co-Authored-By: Virgil <virgil@lethean.io>
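For readers who want the relative change behind those figures, the arithmetic is a plain percent change on the reported before/after values: roughly 11% less wall time and roughly 17% higher decode rate. percentChange below is a hypothetical helper written for this note, not part of the repository.

```go
package main

import "fmt"

// percentChange returns the relative change from before to after,
// expressed as a percentage of before.
func percentChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// Figures are the ones quoted in the commit message above.
	fmt.Printf("wall time:   %+.1f%%\n", percentChange(260.093, 231.109))
	fmt.Printf("decode rate: %+.1f%%\n", percentChange(51.293, 60.011))
}
```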

Adds the 5120-token-budget 100k retained-state diagnostic. The current prompt naturally stops at 2489 tokens per turn, but decode stays flat around 60 tok/s across ten retained turns and memory remains bounded under the production guards.

Co-Authored-By: Virgil <virgil@lethean.io>

sonarqubecloud · 2026-05-21T05:39:49Z

Quality Gate failed

Failed conditions
6.3% Duplication on New Code (required ≤ 3%)
E Security Rating on New Code (required ≥ A)
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud


Snider and others added 30 commits May 8, 2026 13:18

docs: define inference contract parity plan

178b957

feat(api): implement inference contracts

a3263f0

Co-Authored-By: Virgil <virgil@lethean.io>

feat(api): report metal runtime capabilities

850f482

Co-Authored-By: Virgil <virgil@lethean.io>

feat(api): expose metal memory limits via inference

92d29bd

Co-Authored-By: Virgil <virgil@lethean.io>

feat(api): expose openai chat handler

1eb011b

Co-Authored-By: Virgil <virgil@lethean.io>

Snider and others added 2 commits May 20, 2026 15:35

docs(runtime): refresh e2b quant matrix

667b6e5

Co-Authored-By: Virgil <virgil@lethean.io>

docs(runtime): fill e2b external quant rows

c5caff6

Co-Authored-By: Virgil <virgil@lethean.io>

github-advanced-security AI found potential problems May 20, 2026

Snider and others added 25 commits May 20, 2026 18:23

docs(runtime): add llama cached anchor

e82a2a4

Co-Authored-By: Virgil <virgil@lethean.io>

docs(runtime): explain long context gap

f2c5232

Co-Authored-By: Virgil <virgil@lethean.io>

perf(metal): tune hyper long paged kv

c3c4da5

Co-Authored-By: Virgil <virgil@lethean.io>

test(metal): correct token phase probe timing

9d55267

Co-Authored-By: Virgil <virgil@lethean.io>

docs(goal): audit gemma4 ideas update

e3baf55

Co-Authored-By: Virgil <virgil@lethean.io>

test(metal): guard gemma4 keqv cache split

66bbfe3

Co-Authored-By: Virgil <virgil@lethean.io>

perf(metal): pack adamw moment state

6c6d271

Co-Authored-By: Virgil <virgil@lethean.io>

fix(training): guard gemma4 lora targets

1cefb03

Co-Authored-By: Virgil <virgil@lethean.io>

docs(goal): record gomlxrunner compile pass

e1a5e97

Co-Authored-By: Virgil <virgil@lethean.io>

feat(api): expose prompt cache clearing

89d2dfb

Co-Authored-By: Virgil <virgil@lethean.io>

docs(goal): record ideas fine-tuning addendum

8fe0efd

Co-Authored-By: Virgil <virgil@lethean.io>

docs(runtime): add production benchmark manifest

c0c535c

Co-Authored-By: Virgil <virgil@lethean.io>

docs(runtime): add strict benchmark cleanup gate

34ac64a

Co-Authored-By: Virgil <virgil@lethean.io>

docs(runtime): clean noncanonical benchmark fragments

3786cf5

Co-Authored-By: Virgil <virgil@lethean.io>

bench(runtime): track e2b context ramp harness

95af568

Co-Authored-By: Virgil <virgil@lethean.io>

bench(runtime): record rejected 100k attention branches

0077a0d

Co-Authored-By: Virgil <virgil@lethean.io>

bench(runtime): gate native paged attention diagnostic

999d098

Co-Authored-By: Virgil <virgil@lethean.io>

bench(runtime): summarise long context trace buckets

b13cd65

Co-Authored-By: Virgil <virgil@lethean.io>

bench(runtime): reject right-sized fixed cache at 100k

5d0ded1

Co-Authored-By: Virgil <virgil@lethean.io>

docs(runtime): refresh 100k trace diagnosis

7badd57

Co-Authored-By: Virgil <virgil@lethean.io>

perf(metal): record paged full kv diagnostic

4d842ae

Co-Authored-By: Virgil <virgil@lethean.io>

Virgil Lemma foundations #8
Snider wants to merge 114 commits into main from dev

2 participants
