Co-Authored-By: Virgil <virgil@lethean.io>
Implements the 2026-05-09 vMLX feature-parity sprint (see
docs/vmlx-feature-gap-report.md + docs/superpowers/plans/) plus the
runtime surfaces that hang off it. Closes the gap between go-mlx and
vMLX's Python engine for MoE and advanced quantisation paths.
Phase 1 surface:
- MoE / advanced quant: minimax_m2.go + native_darwin, jang.go +
native_darwin, codebook_vq.go, expert_residency.go.
- Cache + decode: block_cache.go (block-prefix cache), prompt cache
threshold integration, decode_optimisation.go (speculative + prompt-
lookup harness).
- Algorithm/architecture profiles: algorithm_profile.go +
architecture_profile.go for backend capability reporting.
- Agent memory: agent_memory.go (Wake/Sleep/Fork on top of KV snapshots
+ memvid), state_bundle.go round-trip via dappco.re/go/inference/state.
- Scheduler + parsers: scheduler.go (queue-aware Schedule + Cancel),
parser_registry.go (model-family tool/reasoning parsers),
register_metal_{cache,parser,scheduler}.go capability mounts.
- Model-pack + planning: gguf_info.go / gguf_quantize.go, memory_plan.go
(device-class sizing), model_pack.go validation.
- Internal Metal extensions: gemma4 paged KV, minimax_m2 forward stubs,
codebook_vq kernels, jang_dequant, kv_snapshot_blocks_native.
- Frame compute: compute.go API rounded out for non-LLM kernels.
- admin.go, dataset_stream.go, fast_eval.go, hf_fit.go,
small_model_smoke.go, workload_bench.go.
- Observability: probe.go expanded for MoE router decisions, cache
pressure, training events.
docs/ pass adds per-file documentation under docs/{topic}/{file}.md so
future readers can plan against the runtime without grep:
- runtime/ — register_metal, adapter
- memory/ — agent_memory, kv_snapshot family, state_bundle, medium
- moe/ — minimax_m2, jang, codebook_vq, expert_residency
- training/ — sft, lora_adapter, grpo, distill, eval
- model/ — model_pack, memory_plan
- inference/ — scheduler, block_cache, decode_optimisation,
parser_registry, thinking
- compute/ — frame-compute API
- observability/ — probe.go emission
- cmd/violet — sidecar daemon
34 new docs plus per-topic READMEs and a top-level index.
Co-Authored-By: Virgil <virgil@lethean.io>
First lobe of the package-split out of the 80-file root dump. Moves the
non-LLM Metal frame-compute lane (PixelBuffer / kernels / Session /
NewSession) into its own subpackage so the root mlx package stays focused
on LLM inference.
- go/compute*.go → go/compute/ (10 files, package mlx → package compute)
- compute_darwin.go renamed compute_metal.go (no _darwin suffix: package
is Metal-only, no dual-platform split)
- compute_stub.go variants deleted: Metal-only by design, no non-darwin
compile target to guard against
- All build tags dropped: package is darwin/arm64 implicit
- DeviceInfo restored as type alias to metal.DeviceInfo (not
field-flattened); DeviceInfo() returns metal.GetDeviceInfo() direct so
upstream renames + new fields surface at compile time
- unsupported_stub_test.go in parent dropped its compute.* compile-surface
refs: stub build no longer needs to compile-check a Metal-only subpackage
- examples/ moved into docs/examples/ (first-trip cleanup)
No external consumers of compute symbols in the tetrad today; only
internal sibling fast_eval / api_stub / session_* call sites and they use
ModelSession.NewSession (method) rather than compute.NewSession (free
function). No downstream import churn.
Co-Authored-By: Virgil <virgil@lethean.io>
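The alias-versus-flattened distinction in the DeviceInfo bullet can be sketched as follows. This is a minimal illustration, not the compute package itself: `upstreamDeviceInfo`, `getDeviceInfo`, `currentDeviceInfo`, and the two fields are hypothetical stand-ins for metal.DeviceInfo and metal.GetDeviceInfo.

```go
package main

import "fmt"

// upstreamDeviceInfo is a hypothetical stand-in for metal.DeviceInfo.
type upstreamDeviceInfo struct {
	Name        string
	MemoryBytes uint64
}

// getDeviceInfo is a hypothetical stand-in for metal.GetDeviceInfo.
func getDeviceInfo() upstreamDeviceInfo {
	return upstreamDeviceInfo{Name: "example-gpu", MemoryBytes: 16 << 30}
}

// DeviceInfo is a type alias, not a new struct: it denotes the same type as
// upstreamDeviceInfo, so renamed or added upstream fields show up here with
// no copying code to keep in sync.
type DeviceInfo = upstreamDeviceInfo

// currentDeviceInfo returns the upstream value directly; no field-by-field
// conversion is needed because both names denote one type.
func currentDeviceInfo() DeviceInfo {
	return getDeviceInfo()
}

func main() {
	info := currentDeviceInfo()
	fmt.Println(info.Name, info.MemoryBytes)
}
```

With a field-flattened copy, a rename upstream would only break the conversion function; with the alias, every consumer that touches the renamed field fails to compile, which is the behaviour the bullet asks for.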
Drops the in-mlx output-parsing layer and consumes
dappco.re/go/inference/parser instead. Driver-neutral logic — model-
family reasoning markers, thinking-channel processor, tool-call
parsing — now lives in go-inference so every driver (rocm, cuda, tpu,
future) inherits it without re-implementation.
Deletes:
- go/parser_registry.go (466 lines)
- go/thinking.go (320 lines)
- their _test.go siblings
Replaces with:
- go/thinking.go (slim) — driver-side WithThinking* options that
mutate the local mlx.GenerateConfig.Thinking field, FilterThinkingTokens
wrapper for the *Tokenizer streaming path, parserHint() helper that
converts mlx.ModelInfo to parser.Hint{Architecture, AdapterName}.
Sibling fix-ups:
- api_common.go: GenerateConfig.Thinking is parser.Config; default is
parser.Show.
- api_darwin.go: 5 emit sites use parser.NewProcessor + parserHint.
- openai.go: 3 response handlers use parser.NewProcessor; reasoning
selector uses parser.ForHint(parser.HintFromInference(...)).
- register_metal_parser.go: outputParser() returns parser.OutputParser
via parser.ForHint(parserHint(...)).
- register_metal_cache.go: drops local modelInfoFromInference helper,
uses adapter.Info() directly.
- architecture_profile.go: parser.NormaliseKey replaces local helper.
- thinking_darwin_test.go: parser.Chunk replaces ThinkingChunk.
Submodule pin: external/go-inference advanced to cb4f9fb (parser
package + ProbeScheduler vocab the mlx scheduler.go was emitting).
Co-Authored-By: Virgil <virgil@lethean.io>
Drops the in-mlx JANG/JANGTQ + VQ codebook quant metadata and consumes
dappco.re/go/inference/quant/{jang,codebook} instead. Driver-neutral
quant types now lift to go-inference where every backend
(mlx, rocm, cuda, tpu, future) inherits them.
Deletes:
- go/jang.go (597 lines)
- go/codebook_vq.go (294 lines)
- their _test.go siblings (228 lines)
Adds:
- go/jang_hf.go — driver-side helpers that depend on mlx-local
HFModelMetadata (InferJANGFromHF, hfJANGGroupSize,
inferJANGProfileName). Compose lifted jang.Info shape.
- safetensor_ref.go: local mlxMaxIntValue() helper (was in jang.go).
Symbol-namespace renames (package name takes the disambiguation slot):
JANGQuantizationInfo → jang.Info
JANGCapabilities → jang.Capabilities
JANGTensorRole + consts → jang.TensorRole*
JANGPackedQuantizationProfile → jang.PackedProfile
JANGPackedTensorDescriptor → jang.PackedTensorDescriptor
BuildJANGPackedQuantizationProfile → jang.BuildPackedProfile
CloneJANGPackedQuantizationProfile → jang.ClonePackedProfile
NewJANGPackedTensorDescriptor → jang.NewPackedTensorDescriptor
ValidateJANGPackedTensor → jang.ValidatePackedTensor
DequantizeJANGPackedTensor → jang.DequantizePackedTensor
PackJANGQuantizedValues → jang.PackQuantizedValues
readJANGQuantizationInfo → jang.ReadConfig
parseJANGQuantizationInfo → jang.ParseConfig
CodebookQuantizationType → codebook.Type
CodebookFormatVQ → codebook.FormatVQ
CodebookQuantizationProfile → codebook.Profile
CodebookTensorDescriptor → codebook.TensorDescriptor
ParseCodebookQuantizationProfile → codebook.ParseProfile
NewCodebookTensorDescriptor → codebook.NewTensorDescriptor
ValidateCodebookQuantizationProfile → codebook.ValidateProfile
ValidateCodebookTensorDescriptor → codebook.ValidateTensorDescriptor
ValidateCodebookTensorPayload → codebook.ValidateTensorPayload
CodebookVQMatVec → codebook.MatVec
readCodebookQuantizationProfile → codebook.ReadProfile
cloneCodebookQuantizationProfile → codebook.CloneProfile
Sibling fix-ups across 19 files (production + tests):
- algorithm_profile, architecture_profile, hf_fit (+test),
jang_native_darwin/stub, memory_plan (+test), minimax_m2 (+test),
model_pack (+test), workload_bench (+test), expert_residency_test,
jang_darwin_test, minimax_m2_darwin_test, inference_contract_test.
- Variable shadowing: `jang` local variables renamed to `info`
where they shadowed the package import.
- jangQuantizationType(info) calls replaced with info.Packed.Type.
- finalizeJANGQuantizationInfo helper inlined as
info.Packed = jang.BuildPackedProfile(info).
- testJANGTQInfo() helper re-added locally in jang_darwin_test.go
(was in deleted jang_test.go).
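The variable-shadowing fix-up above follows a general Go rule, sketched here with the standard library `path` package standing in for the lifted jang package. `baseOf` is a hypothetical helper used only for illustration.

```go
package main

import (
	"fmt"
	"path"
)

// baseOf names its parameter p. Had the parameter been called `path`, it
// would shadow the imported package inside this function, and path.Base
// would fail to compile because a string value has no Base method.
func baseOf(p string) string {
	return path.Base(p)
}

func main() {
	fmt.Println(baseOf("models/jang/config.json"))
}
```

Renaming the local (here `p`, in the commit `info`) is usually cheaper than aliasing the import at every call site.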
Submodule pin: external/go-inference advanced to cb3dc24 (parser +
quant/jang + quant/codebook).
Companion lifts deferred next round:
- model/minimax/m2 — safetensorIndex (mlx-private) couplings in
loader functions; needs either safetensors lift or types/loaders
split.
- moe/expert_residency — MemoryClass (Apple-tier enum) needs
budget-bytes refactor before lifting.
Co-Authored-By: Virgil <virgil@lethean.io>
Snider correction: file lifts shouldn't add new flat files to the go-mlx
root, and the _darwin/_stub split is noise on a Metal-only driver. Same
rules as compute/: package gets its own folder, no build-tag dance.
go/jang_native_darwin.go + jang_native_stub.go → go/quant/jang/jang.go
(one file, no _darwin suffix, no stub variant)
Symbols drop redundant prefixes since the folder + package imply them:
JANGPackedProjectionResult → jang.PackedProjectionResult
DequantizeJANGPackedTensorMetal → jang.DequantizePackedTensor
ProjectJANGPackedTensorMetal → jang.ProjectPackedTensor
ProjectJANGPackedTensorMetalFused → jang.ProjectPackedTensorFused
jangMetalShape (private) → jang.MetalShape (exported for tests)
jangMetalShapeElements (private) → jang.ShapeElements
int32SliceToInts (private) → jang.Int32SliceToInts
Inside the package, the inference-side jang aliases as infjang to avoid
the same-name self-collision. Consumers (jang_darwin_test +
minimax_m2_native_darwin) alias the mlx-side as mlxjang.
The HF-metadata helpers (InferJANGFromHF, hfJANGGroupSize,
inferJANGProfileName) merged into hf_fit.go: they're HF-fit code that
happens to produce *jang.Info, not jang-package code (they depend on
HFModelMetadata which lives in hf_fit.go). hf_fit.go + HFModelMetadata
still pending their own folder lift (likely go/hf/ in a future iteration).
go-mlx/go root flat-file count: net −1 this commit (deletion of
jang_native_stub.go + jang_native_darwin.go and jang_hf.go, addition of
nothing new in root).
Co-Authored-By: Virgil <virgil@lethean.io>
Commit 63f9894 renamed the file but shipped its OLD content (the
working-tree perl edits weren't re-staged before commit, so the index had
the pre-edit version under the new path). HEAD's quant/jang/jang.go was
still `package mlx` with the build tag, despite the working tree being
correct (which masked the bug locally: build passed because the file on
disk was right).
This commit ships what should have landed in 63f9894:
- package mlx → package jang
- drop //go:build darwin && arm64 && !nomlx
- symbols dropped JANG/Metal prefixes: DequantizePackedTensor,
ProjectPackedTensor*, MetalShape, ShapeElements, Int32SliceToInts
- inference jang aliased as infjang inside the file
Co-Authored-By: Virgil <virgil@lethean.io>
algorithm_profile.go + architecture_profile.go move into go/profile/.
Both become package profile; consumers import dappco.re/go/mlx/profile
and call profile.LookupAlgorithmProfile /
profile.LookupArchitectureProfile.
architecture.go inlines normalizeKnownArchitecture +
architectureFromTransformersName as private helpers (originals live in
gguf_info.go at mlx root). Inlining avoids the import cycle that would
otherwise form when profile/ pulls from mlx and mlx-root tests exercise
profile/. Same trick for KVCacheMode references: uses literal "q8" /
"paged" / "k-q8-v-q4" strings instead of mlx-root constants.
Tests stay in mlx root for now (algorithm_profile_test.go +
architecture_profile_test.go), aliased as
`prof "dappco.re/go/mlx/profile"` so the `profile` local-var name they
use doesn't shadow the package. Local-var lookup results renamed
`profile → p` where needed.
model_pack.go's local `profile := pack.ArchitectureProfile` renamed to
`arch` to avoid shadowing the new package import.
go vet ./... clean. Test suite green.
Co-Authored-By: Virgil <virgil@lethean.io>
Move lora_adapter.go → lora/adapter.go (package lora). Stage 1 only:
lora_fuse* stays at mlx root because it references mlx-root types
(ModelPack, ModelPackFormatSafetensors), the same blocker as
gguf_quantize.go.
Symbol renames (drop redundant "LoRA"/"lora" prefixes since pkg carries
them):
LoRAAdapterInfo → lora.AdapterInfo
InspectLoRAAdapter → lora.InspectAdapter (1-arg convenience)
inspectLoRAAdapter → lora.Inspect (2-arg form, now public)
loraAdapterInfoEmpty → (info AdapterInfo) IsEmpty() method
Private helpers in lora/ also drop redundant prefixes:
loraAdapterConfigJSON → adapterConfigJSON
loraAdapterConfigPath → adapterConfigPath
hashLoRAAdapter → hashAdapter
loraAdapterResultError → resultError
lora_fuse.go gets its own inline copy of loraAdapterResultError (the
generic core.Result → error helper isn't worth pulling into the public
surface of lora).
Also: fixes stray `package mlx` left in profile/algorithm.go +
profile/architecture.go from the previous lift commit (8f5174a) where the
package-line rename apparently raced with the commit.
go vet ./... clean. mlx package tests green.
Co-Authored-By: Virgil <virgil@lethean.io>
Pure types-lift: ModelPack struct + its constants, options, methods move
into go-mlx/pack/. Inspectors + validators stay in mlx-root model_pack.go
(they reference mlx-root concrete types, GGUFInfo and MiniMaxM2TensorPlan,
that would create cycles).
Cycle-breaker: 4 fields in pack.ModelPack typed as `any` since their
concrete types live at mlx root:
Quantization any (was *GGUFQuantizationInfo)
GGUF any (was *GGUFInfo)
MiniMaxM2 any (was *MiniMaxM2TensorPlan)
MiniMaxM2LayerSkeleton any (was *MiniMaxM2LayerForwardSkeleton)
Consumers type-assert at read sites (memory_plan.go +
model_pack_test.go). Inspectors assign concrete pointers directly (any
accepts).
Symbol policy this round: NO renames. pack.ModelPack stays pack.ModelPack
(verbose but lower-risk; renames can land as a follow-up). Mlx root
imports pack as `mp` to avoid the local-var name collision (many functions
use `pack` as parameter name).
addIssue + issueSummary → AddIssue + IssueSummary (exported, since
inspectors at mlx root call them across the package boundary).
applyModelPackOptions → pack.ApplyOptions (similarly exported).
Unblocks: lora_fuse and gguf_quantize can now live in their own packages
once their other dependencies (safetensor private types + MiniMaxM2
types) also lift. This commit ships only the type lift.
go vet ./... clean. mlx package tests green.
Co-Authored-By: Virgil <virgil@lethean.io>
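The `any` cycle-breaker and the type assertion at read sites can be sketched like this. The struct shapes are trimmed, hypothetical stand-ins (only the GGUF field is shown, and `TensorCount` and `ggufTensorCount` are invented for the example); the real pack.ModelPack carries more fields.

```go
package main

import "fmt"

// GGUFInfo stands in for a concrete type that lives in the root package.
// TensorCount is a hypothetical field used only for this example.
type GGUFInfo struct {
	TensorCount int
}

// ModelPack stands in for pack.ModelPack. The field is typed any so the
// pack package never has to import the package that defines *GGUFInfo.
type ModelPack struct {
	GGUF any // expected to hold *GGUFInfo
}

// ggufTensorCount is a read site: it recovers the concrete pointer with a
// checked type assertion and also guards against a typed nil pointer.
func ggufTensorCount(p ModelPack) (int, bool) {
	info, ok := p.GGUF.(*GGUFInfo)
	if !ok || info == nil {
		return 0, false
	}
	return info.TensorCount, true
}

func main() {
	// Inspectors assign the concrete pointer directly; any accepts it.
	p := ModelPack{GGUF: &GGUFInfo{TensorCount: 3}}
	n, ok := ggufTensorCount(p)
	fmt.Println(n, ok)
}
```

The cost of this design is that a wrong type in the field is only detected at run time, so the checked two-value assertion is worth keeping at every read site.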
Move lora_fuse{,_darwin,_stub,_test,_darwin_test}.go into lora/
(package lora) — joins lora/adapter.go from the earlier lora_adapter
lift. lora/ is now the LoRA package as intended.
API change: lora.FuseIntoPack takes pre-validated pack.ModelPack as
SourcePack (instead of ModelPath string). Callers validate via
mlx.ValidateModelPack first, then call lora.FuseIntoPack, then validate
output if they need a populated pack. This breaks the mlx ↔ lora cycle
(otherwise lora.FuseIntoPack would need to call mlx.ValidateModelPack →
cycle since mlx-root imports lora for AdapterInfo).
No production consumers of FuseLoRA* — only tests — so the API change
is safe.
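The validate-then-call shape of the API change might look roughly like the sketch below. All names and fields are hypothetical stand-ins in a single file (`validateModelPack` for mlx.ValidateModelPack, `fuseIntoPack` for lora.FuseIntoPack, and a two-field ModelPack); the real signatures are not shown in this log.

```go
package main

import (
	"errors"
	"fmt"
)

// ModelPack is a trimmed, hypothetical stand-in for pack.ModelPack.
type ModelPack struct {
	Path  string
	valid bool
}

// Valid reports whether the pack passed validation.
func (p ModelPack) Valid() bool { return p.valid }

// validateModelPack stands in for the root-package validator. In the real
// layout it lives above the fuse code in the import graph.
func validateModelPack(path string) ModelPack {
	return ModelPack{Path: path, valid: path != ""}
}

// FuseOptions carries an already-validated pack instead of a path string.
type FuseOptions struct {
	SourcePack ModelPack
	OutputPath string
}

// fuseIntoPack never calls the validator itself, which is what lets the
// lower-level package avoid importing the root package.
func fuseIntoPack(opts FuseOptions) (string, error) {
	if !opts.SourcePack.Valid() {
		return "", errors.New("source pack must be validated by the caller")
	}
	return opts.OutputPath, nil
}

func main() {
	src := validateModelPack("/models/base")
	out, err := fuseIntoPack(FuseOptions{SourcePack: src, OutputPath: "/models/fused"})
	fmt.Println(out, err)
}
```

A caller that needs a populated output pack would run the validator again on the returned output path, as the commit message describes.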
Symbol renames per discipline (drop redundant "LoRA"/"lora" prefix
since pkg name carries it):
FuseLoRAIntoModelPack → lora.FuseIntoPack
FuseLoRAOptions → lora.FuseOptions
FuseLoRAResult → lora.FuseResult (drops Pack field)
LoRAFuseProvenance → lora.FuseProvenance
LoRAFuseProvenanceFile → lora.FuseProvenanceFile
prepareLoRAFuse → prepareFuse (private)
loraFusePairName → fusePairName
loraFuseBaseWeightKey → fuseBaseWeightKey
loraFuseAdapterWeightFiles → fuseAdapterWeightFiles
writeLoRAFuseProvenance → writeFuseProvenance
buildLoRAFusePairs → buildFusePairs
fuseLoRAModelWeightFiles → fuseModelWeightFiles
fuseLoRAWeightPairs → fuseWeightPairs
loraFusePair → fusePair
loraFusePrepared → fusePrepared
loRAFuseOutputWeights → fuseOutputWeights
samePath + copyModelPackMetadata + isModelWeightMetadataCopySkip +
copyModelPackLocalFile move to mlx-root model_merge.go (consumers:
model_merge.go itself + gguf_quantize.go). loraAdapterResultError
drops (lora's own resultError is used instead).
Tests: portable + darwin tests moved into lora/ (need access to
private helpers like fusePairName). Tests use pack.ModelPack{} fixture
in place of mlx.ValidateModelPack (which would create a cycle); output
verification reads files directly rather than via Pack.Valid().
go vet ./... clean. mlx + lora package tests green.
Co-Authored-By: Virgil <virgil@lethean.io>
Move gguf_info.go + gguf_info_test.go + gguf_info_example_test.go into
gguf/ (package gguf).
Symbol renames per discipline (drop redundant GGUF prefix since pkg name
carries it):
GGUFInfo → gguf.Info
GGUFTensorInfo → gguf.TensorInfo
GGUFValidationSeverity → gguf.ValidationSeverity
GGUFValidationIssue → gguf.ValidationIssue
GGUFTensorTypeSummary → gguf.TensorTypeSummary
GGUFQuantizationInfo → gguf.QuantizationInfo
ReadGGUFInfo → gguf.ReadInfo
DiscoveredModel + DiscoverModels keep their names (no GGUF prefix).
Export binary-format internals that mlx-root gguf_quantize.go needs:
ggufTensorTypeQ8_0 → gguf.TensorTypeQ8_0
ggufTensorTypeQ4_0 → gguf.TensorTypeQ4_0
ggufValueTypeString → gguf.ValueTypeString
ggufValueTypeUint32 → gguf.ValueTypeUint32
normalizeGGUFQuantType → gguf.NormalizeQuantType
gguf_quantize.go stays at mlx root (it depends on mlx-root safetensor
private types + pack.ModelPack; full lift blocked until safetensor types
lift to a shared package).
Mlx-root keeps private copies of helpers consumed by 8+ mlx-root files
(in hf_fit.go): firstNonEmpty, firstPositive, modelConfigProbe + methods,
readModelConfig, normalizeKnownArchitecture,
architectureFromTransformersName, indexString. Same inline-copy pattern
as profile/architecture.go used.
Test helpers (writeTestGGUF, ggufMetaSpec, ggufTensorSpec,
ggufTensorTypeQ4K, etc.) duplicated in new gguf_test_helpers_test.go at
mlx root for cross-test access.
This unblocks gguf-using consumers from importing gguf/ directly.
gguf_quantize.go still at mlx root for now.
go vet ./... clean. mlx + gguf + lora package tests green.
Co-Authored-By: Virgil <virgil@lethean.io>
…nsors/
Move safetensor-prefixed types + funcs from model_merge.go +
safetensor_ref.go + gguf_quantize.go into safetensors/ (package
safetensors). Symbol renames per discipline drop the safetensor prefix
since the package name carries it:
Types:
safetensorIndex → safetensors.Index
safetensorTensorRef → safetensors.TensorRef
safetensorTensorReader → safetensors.TensorReader
safetensorHeaderEntry → safetensors.HeaderEntry
Funcs:
indexSafetensorFiles → safetensors.IndexFiles
readSafetensorIndex → safetensors.ReadIndex
safetensorRefFromHeader → safetensors.RefFromHeader
readSafetensorRefRaw → safetensors.ReadRefRaw
readSafetensorRefValues → safetensors.ReadRefValues
readSafetensorRefFloat32Chunk → safetensors.ReadRefFloat32Chunk
writeSafetensorRefFloat32Chunks → safetensors.WriteRefFloat32Chunks
openSafetensorTensorReaders → safetensors.OpenReaders
openSafetensorTensorReader → safetensors.OpenReader
closeSafetensorTensorReaders → safetensors.CloseReaders
safetensorDTypeByteSize → safetensors.DTypeByteSize
decodeSafetensorFloatData → safetensors.DecodeFloatData
float16ToFloat32 → safetensors.Float16ToFloat32
Methods on TensorReader: close → Close, readFloat32Chunk →
ReadFloat32Chunk.
Stays in model_merge.go: merge-specific helpers (indexModelMergeSources,
validateModelMergeTensorIndexes, writeMergedSafetensors,
readMergeTensorRefs, buildMergedSafetensorsHeader, readMergeTensorValues,
writeLinearMergedTensorChunks, writeSLERPMergedTensorChunks,
slerpChunkedWeights; writeFloat32Values is in safetensors too).
safetensor_ref.go deleted (mlxMaxIntValue + readSafetensorRefRaw now live
inside safetensors package as private maxIntValue + exported ReadRefRaw).
Consumers updated: model_merge.go, gguf_quantize.go,
gguf_quantize_test.go, minimax_m2.go, model_merge_test.go, kv_snapshot.go.
Net: -2 root flat .go files (safetensor_ref.go deleted, primitives
extracted from model_merge.go + gguf_quantize.go without adding new root
files).
Unblocks: gguf_quantize.go could potentially lift to gguf/ next (still
needs pack.ModelPack from pack/, but pack imports gguf, so gguf_quantize
would create cycle; needs separate decision).
go vet ./... clean. mlx + gguf + lora + safetensors package tests green.
Co-Authored-By: Virgil <virgil@lethean.io>
Move gguf_quantize.go + gguf_quantize_test.go → gguf/quantize.go +
gguf/quantize_test.go (package gguf).
API change matches the lora.FuseIntoPack pattern: gguf.QuantizeModelPack
takes pre-validated pack.ModelPack as SourcePack instead of a ModelPath
string. Callers run mlx.ValidateModelPack first and call
mlx.ValidateModelPack(result.OutputPath) afterwards if they need a
populated output pack.
Symbol renames per discipline (drop redundant GGUF prefix):
QuantizeModelPackToGGUF → gguf.QuantizeModelPack
QuantizeGGUFOptions → gguf.QuantizeOptions
QuantizeGGUFResult → gguf.QuantizeResult (drops Pack field)
GGUFQuantizeFormat → gguf.QuantizeFormat
GGUFQuantizeQ8_0/Q4_0/Q4_K_M → gguf.QuantizeQ8_0/Q4_0/Q4_K_M
Move ggufValidationSummary from mlx-root model_pack.go into gguf as
exported gguf.ValidationSummary; model_pack.go now calls it via the gguf
package. Same helper, single home now.
Move samePath + copyModelPackMetadata + isModelWeightMetadataCopySkip +
copyLocalFile into gguf as private helpers (also keep the model_merge.go
mlx-root copies for non-gguf consumers like model_merge.go itself).
mlx-root tests that depended on lifted private helpers (denseSafetensor,
loadDenseSafetensors, readDenseSafetensors, decodeDenseSafetensor,
writeDenseSafetensorsPack, writeTestSafetensorsF32, safetensorTestTensor,
appendUint16LE, float32ToFloat16) get duplicated copies in
gguf_test_helpers_test.go for the tests that still live at mlx root
(model_merge_test, kv_snapshot_*, api_test).
No production consumers of Quantize* API (only tests), so the API change
is safe. Drop the second ValidateModelPack call (caller's
responsibility); drop Pack field from QuantizeResult.
go vet ./... clean. mlx + gguf + lora + safetensors package tests green.
Co-Authored-By: Virgil <virgil@lethean.io>
Move model_merge.go + model_merge_test.go → merge/merge.go +
merge/merge_test.go (package merge).
API change matches the lora.FuseIntoPack + gguf.QuantizeModelPack
pattern: merge.Source carries a pre-validated pack.ModelPack (Pack field)
instead of a Path string. Callers run mlx.ValidateModelPack on each
source before invoking merge.Packs, and re-validate the output via
mlx.ValidateModelPack(result.OutputPath) if they need a populated pack.
Symbol renames per discipline (drop redundant Model/ModelMerge prefix):
MergeModelPacks → merge.Packs
ModelMergeOptions → merge.Options
ModelMergeResult → merge.Result (drops Pack field)
ModelMergeMethod → merge.Method
ModelMergeSource → merge.Source (Path → Pack)
ModelMergeProvenance → merge.Provenance
ModelMergeProvenanceFile → merge.ProvenanceFile
ModelMergeLinear/SLERP/TIES/DARE → merge.MethodLinear/SLERP/TIES/DARE
Private helpers moved with the source (drop prefixes where redundant):
prepareModelMerge → prepare
ensureEmptyModelMergeDestination → ensureEmptyDestination
validateModelMergePackCompatibility → validatePackCompatibility
indexModelMergeSources → indexSources
validateModelMergeTensorIndexes → validateTensorIndexes
readMergeTensorRefs → readTensorRefs
buildMergedSafetensorsHeader → buildMergedHeader
readMergeTensorValues → readTensorValues
writeLinearMergedTensorChunks → writeLinearChunks
writeSLERPMergedTensorChunks → writeSLERPChunks
normalizedMergeWeights → normalizedWeights
writeModelMergeProvenance → writeProvenance
modelMergePrepared → prepared
modelMergeResultError → resultError
StateBundleFileHash → hashFile (inlined private copy in merge)
samePath / copyModelPackMetadata / isModelWeightMetadataCopySkip /
copyLocalFile / resultError travel with merge as private helpers (they
were only used by model_merge.go after the earlier gguf_quantize lift
moved away).
merge/helpers_test.go takes its own copies of denseSafetensor +
loadDenseSafetensors + readDenseSafetensors + decodeDenseSafetensor +
safetensorTestTensor + writeDenseSafetensorsPack +
writeTestSafetensorsF32 + testResultError + writeModelPackFile +
modelPackTokenizerJSON + testPack / testPackArch fixture builders.
Trim mlx-root gguf_test_helpers_test.go: remove safetensors-related
helpers (denseSafetensor, loadDenseSafetensors, etc.); they no longer
have mlx-root consumers after the merge lift.
mlx-root minimax_m2.go gains its own private copy of sameUint64Slice
(small utility that was only used by minimax_m2 + the lifted merge code;
the merge copy keeps its own).
No production consumers of ModelMerge* API (only tests), so the API
change is safe.
go vet ./... clean. mlx + gguf + lora + safetensors + merge package tests
green.
Co-Authored-By: Virgil <virgil@lethean.io>
Move kv_snapshot.go, kv_snapshot_blocks.go, kv_snapshot_memvid.go,
kv_analysis.go (and their tests + examples) into kv/ (package kv).
kv_snapshot_index.go stays at mlx root — its
KVSnapshotMemvidBundleIndex struct has StateBundleModel +
StateBundleTokenizer fields whose types live at mlx-root and would cycle.
Symbol renames per discipline (drop redundant KV/KVSnapshot prefix):
KVSnapshot → kv.Snapshot
KVLayerSnapshot → kv.LayerSnapshot
KVHeadSnapshot → kv.HeadSnapshot
KVSnapshotEncoding → kv.Encoding (+ Native/Q8/Base64/Binary)
KVSnapshotVersion → kv.SnapshotVersion
KVSnapshotSaveOptions → kv.SaveOptions
KVSnapshotLoadOptions → kv.LoadOptions
KVSnapshotCaptureOptions → kv.CaptureOptions
LoadKVSnapshot{,WithOptions} → kv.Load{,WithOptions}
KVSnapshotBlock → kv.Block
KVSnapshotMemvidBlockOptions/Bundle/Ref → kv.MemvidBlock{Options,Bundle,Ref}
KVSnapshotMemvidBlockBundleKind → kv.MemvidBlockBundleKind
KVSnapshotMemvidBlockVersion → kv.MemvidBlockVersion
AssembleKVSnapshotBlocks → kv.AssembleBlocks
SaveKVSnapshotMemvidBlockBundle → kv.SaveMemvidBlockBundle
LoadKVSnapshotFromMemvidBlocks{,WithOptions} → kv.LoadFromMemvidBlocks{,WithOptions}
LoadKVSnapshotMemvidBlockBundle → kv.LoadMemvidBlockBundle
LoadKVSnapshotPrefixFromMemvidBlocks{,WithOptions} → kv.LoadPrefixFromMemvidBlocks{,WithOptions}
KVSnapshotMemvidOptions → kv.MemvidOptions
LoadKVSnapshotFromMemvid{,WithOptions} → kv.LoadFromMemvid{,WithOptions}
KVAnalysis → kv.Analysis, AnalyzeKV → kv.Analyze
KVFeatures → kv.Features, KVFeatureLabels → kv.FeatureLabels
Helpers also moved into kv package as exported (mlx-root callers
crossed package boundary so they needed to go public):
hashKVSnapshot → kv.HashSnapshot
validateKVSnapshotMemvidBlockBundle → kv.ValidateMemvidBlockBundle
loadKVSnapshotMemvidBlockWithOptions → kv.LoadMemvidBlockWithOptions
effectiveKVSnapshotTokenOffset → kv.EffectiveTokenOffset
effectiveKVSnapshotSeqLen → kv.EffectiveSeqLen
clearKVSnapshotTerminalState → kv.ClearTerminalState
dropKVSnapshotFloat32 → kv.DropFloat32
kvSnapshotResultError → kv.ResultError
Snapshot.sliceBlock (method) → SliceBlock
Inline private copies kept in kv: normalizeSnapshot (was
normalizeBundleSnapshot), requiresNativeEncoding (was
kvSnapshotRequiresNativeEncoding), firstNonEmpty,
defaultCacheBlockSize.
mlx-root NewStateBundle: local variable `kv` renamed to `snap` to
avoid shadowing the imported kv package. state_bundle.go now calls
kv.HashSnapshot / kv.Analyze directly.
NEW mlx-root kv_test_helpers_test.go contains test helpers
(kvSnapshotBlocksTestSnapshot, recordingMemvidStore, failingMemvidWriter)
duplicated for mlx-root tests that no longer have access to kv-package
test internals.
~22 consumer files updated: agent_memory, api_common, api_darwin,
api_stub, api_test, fast_eval{,_test}, hf_fit_test, expert_residency_test,
inference_contract_darwin, kv_snapshot_index{,_test}, kv_cache_bench{,_test},
memory_plan{,_test}, memvid_chapter_smoke{,_test}, session_agent_darwin{,_test},
session_artifact{,_test}, session_darwin{,_test,_example_test},
session_stub_example_test, small_model_smoke, state_bundle{,_test},
workload_bench{,_test}.
go vet ./... clean. mlx + gguf + lora + safetensors + merge + kv tests green.
Co-Authored-By: Virgil <virgil@lethean.io>
eval is driver-neutral (orchestrates evaluation given a Runner adapter),
so it lifts to go-inference/eval/ instead of go-mlx/eval/ — alongside
parser/, quant/jang/, quant/codebook/ which already live there.
Interface redesign for cycle-breaking:
- Sample/Batch/BatchConfig become opaque any
- Dataset is an interface (Next returns any)
- Runner gains BatchTokens callback (replaces sftBatchLossTokens) and
SampleText callback (replaces direct .Text/.Response reads)
- eval.Info mirrors mlx.ModelInfo fields; eval.AdapterInfo mirrors
lora.AdapterInfo. mlx-root converts at the boundary via modelInfoToEval,
evalInfoToModel, loraToEvalAdapter, evalAdapterToLora.
- BuildBatches is now required (replaces optional Tokenizer + auto-build);
driver wrappers provide BuildBatches that internally use their tokenizer
+ BuildDatasetBatches.
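The opaque-sample redesign above can be illustrated with a small sketch. The exact signatures are assumptions: `Next() (any, bool)`, a `Runner` with a single `SampleText` field, and the driver-side `sftSample`, `sliceDataset`, `responseText`, and `collectTexts` names are all hypothetical and chosen only to show the shape.

```go
package main

import "fmt"

// Dataset yields opaque samples; the orchestrator never inspects them.
type Dataset interface {
	Next() (sample any, ok bool)
}

// Runner holds driver-supplied callbacks. Only SampleText is sketched here.
type Runner struct {
	SampleText func(sample any) string
}

// sftSample is a driver-side type the orchestrator knows nothing about.
type sftSample struct {
	Prompt   string
	Response string
}

// sliceDataset is a driver-side Dataset backed by a slice.
type sliceDataset struct {
	samples []sftSample
	next    int
}

func (d *sliceDataset) Next() (any, bool) {
	if d.next >= len(d.samples) {
		return nil, false
	}
	s := d.samples[d.next]
	d.next++
	return s, true
}

// responseText is the driver's SampleText callback: the only place that
// knows the concrete sample type.
func responseText(sample any) string {
	s, ok := sample.(sftSample)
	if !ok {
		return ""
	}
	return s.Response
}

// collectTexts plays the orchestrator: it reads text through the callback
// instead of touching sample fields directly.
func collectTexts(ds Dataset, r Runner) []string {
	var out []string
	for {
		sample, ok := ds.Next()
		if !ok {
			return out
		}
		out = append(out, r.SampleText(sample))
	}
}

func main() {
	ds := &sliceDataset{samples: []sftSample{{"q1", "a1"}, {"q2", "a2"}}}
	fmt.Println(collectTexts(ds, Runner{SampleText: responseText}))
}
```

Because the orchestrator depends only on the interface and the callback, it can live in a driver-neutral module without importing any driver's sample types.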
Symbol renames per discipline:
EvalConfig → eval.Config
EvalRunner → eval.Runner
EvalReport → eval.Report (with eval.Info + eval.AdapterInfo)
EvalMetrics → eval.Metrics
EvalBatchMetrics → eval.BatchMetrics
EvalQualityProbe → eval.QualityProbe (Context/Report/Check too)
RunDatasetEval → eval.RunDataset
EvalReportVersion → eval.ReportVersion
RunModelEval, NewModelEvalRunner stay at mlx-root as wrappers/adapters.
Move ResponseCoverageProbe into eval/ as an exported probe constructor —
driver wrappers attach it via RunModelEval so eval doesn't need to know
about SFTSample's field shape.
eval_test.go deleted from mlx-root (its orchestration testing now belongs
in go-inference/eval/). Integration coverage stays in eval_darwin_test.go.
Bumps external/go-inference submodule pin to a18708d (driver-neutral eval
package shipped).
Consumers updated: distill{,_test}.go, workload_bench{,_test}.go,
inference_contract_{darwin,test}.go. distill.go gains a private
distillCollectSamples helper (replaces collectEvalSamples from old eval.go).
workload_bench.go gains normalizeWorkloadEvalConfig (replaces
normalizeEvalConfig).
go vet ./... clean. mlx + gguf + lora + safetensors + merge + kv tests green.
Co-Authored-By: Virgil <virgil@lethean.io>
bench package (go-inference/bench/) is the new driver-neutral local
benchmark/eval harness. Drivers supply a Runner with verb-shaped
callbacks (BenchPromptCache, BenchMemvidKVBlockWarm, BenchKVRestore,
BenchStateBundle, BenchProbeOverhead, BenchSpeculativeDecode,
BenchPromptLookupDecode). bench.Run orchestrates generation timing +
dispatches each enabled callback + assembles the Report.
mlx-root: fast_eval.go shrinks to type aliases + boundary converters
(FastEval* → bench.* via type aliases; modelInfoToBench /
benchInfoToModel / fromMlxMetrics / toBenchGenerateOptions /
loraToBenchAdapter / benchAdapterToLora helpers).
NEW fast_eval_runner.go contains the Model→bench.Runner adapter: each
Bench* callback implements its driver-specific section against the Model
API (kv snapshots, state bundles, memvid block warming, decode
optimisation via RunSpeculativeDecode / RunPromptLookupDecode).
memvid_chapter_smoke decouples from the bench.Runner: its callbacks
(CaptureKVBlocksToMemvid, GenerateWithMemvidPrefix) deal with
mlx-specific kv types, so it has its own MemvidKVChapterRunner at
mlx-root (no longer wedged into the verb-callback shape).
inference_contract_darwin.go converts at the bench boundary
(benchInfoToModel / benchAdapterToLora) before calling
toInferenceModelIdentity / toInferenceRootAdapterIdentity.
workload_bench.go: drops normalizeFastEvalConfig (bench.Run normalises
internally); ModelInfo conversion via benchInfoToModel.
Test coverage delta: fast_eval_test.go (801 lines),
fast_eval_example_test.go (26 lines), workload_bench_test.go (525 lines)
deleted. Their callback mock setups exercise the OLD raw-callback Runner
shape; equivalent coverage for the verb-callback shape should be added to
go-inference/bench/ tests in a separate pass. memvid_chapter_smoke_test
(integration tests for the chapter runner) rewrites to use
MemvidKVChapterRunner + ChapterGeneration. inference_contract_test gains
modelInfoToBench wrap at the boundary.
Bumps external/go-inference to include the bench package.
go vet ./... clean. mlx + gguf + lora + safetensors + merge + kv tests
green.
Co-Authored-By: Virgil <virgil@lethean.io>
Picks up the bench package unit tests (test(bench): unit tests for
driver-neutral Run orchestration). Coverage rebuilt for the verb-callback
Runner shape after deleting fast_eval_test.go +
fast_eval_example_test.go + workload_bench_test.go in Phase 2M.
Co-Authored-By: Virgil <virgil@lethean.io>
Phase 2N — the speculative + prompt-lookup decode algorithm is driver-
neutral (accept/reject over token streams, generation delegated to
caller callbacks), so it lifts to go-inference/decode/ alongside bench
and eval.
decode_optimisation.go is rewritten as a thin shim with legacy type
aliases (DecodeOptimisationResult, DecodeOptimisationMetrics) and
boundary converters (mlxDecodeGenToDecode, mlxTokensToDecode,
decodeTokensToMlx). DecodeGenerateFunc keeps the mlx-shaped signature
so existing callbacks continue to compile; RunSpeculativeDecode/
RunPromptLookupDecode wrap them to decode.GenerateFunc internally.
decodeTokensText survives as a thin wrapper for memvid_chapter_smoke.
Submodule pin bumped to go-inference 521dd53 (feat(decode):
driver-neutral speculative + prompt-lookup decode harness).
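As an illustration of why accept/reject over token streams needs nothing driver-specific, here is a minimal sketch of one greedy speculative step. The GenerateFunc signature and speculativeStep are hypothetical; the real decode.GenerateFunc in go-inference/decode/ may differ, and this sketch asks the target to regenerate tokens instead of verifying them in a single batched pass.

```go
package main

import "fmt"

// GenerateFunc is a hypothetical caller-supplied callback: given the
// tokens so far, return up to maxTokens continuation tokens.
type GenerateFunc func(prefix []int, maxTokens int) []int

// speculativeStep asks the draft for k tokens, asks the target for one
// more than the draft produced, and keeps the agreed prefix plus the
// target's own token at the first disagreement. The accepted tokens are
// therefore always a prefix of what the target alone would have produced.
func speculativeStep(prefix []int, draft, target GenerateFunc, k int) (accepted []int, matched int) {
	proposal := draft(prefix, k)
	verified := target(prefix, len(proposal)+1)
	for matched < len(proposal) && matched < len(verified) && proposal[matched] == verified[matched] {
		matched++
	}
	end := matched + 1
	if end > len(verified) {
		end = len(verified)
	}
	// Copy so the caller cannot alias the target's buffer.
	return append([]int(nil), verified[:end]...), matched
}

func main() {
	// Toy models: the target emits 2*position; the draft is wrong at position 2.
	target := func(prefix []int, n int) []int {
		out := make([]int, 0, n)
		for i := 0; i < n; i++ {
			out = append(out, (len(prefix)+i)*2)
		}
		return out
	}
	draft := func(prefix []int, n int) []int {
		out := target(prefix, n)
		for i := range out {
			if len(prefix)+i == 2 {
				out[i] = 99
			}
		}
		return out
	}
	accepted, matched := speculativeStep(nil, draft, target, 4)
	fmt.Println(accepted, matched)
}
```

Because both models are reached only through callbacks, the same loop works for any backend that can turn a token prefix into a continuation.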
Coverage rebuilt:
- decode_optimisation_test.go now covers the boundary converters,
nil-callback handling, token round-trip, and legacy-alias surface
- decode_optimisation_example_test.go for AX example registration
- fast_eval_test.go BACKFILLS the Phase 2M orphan: covers alias
routing, DefaultFastEvalConfig forwarding, RunFastEval bench
smoke against a synthetic Runner, toBenchGenerateOptions clone +
probe-sink passthrough, fromMlxMetrics field copy,
modelInfoToBench round-trip with adapter clone, fastEvalResultError
- fast_eval_example_test.go matches AX pattern
go vet ./... clean. Tests: mlx + kv + lora + merge + gguf + pack all
green. Pre-existing internal/metal failure
(TestGenerate_Model_StagedMiniMaxReturnsDecodeError_Bad nil-tokenizer
panic) is unrelated; it fails identically on pristine HEAD.
Co-Authored-By: Virgil <virgil@lethean.io>
Phase 2O — state bundle is deeply mlx-coupled (kv.Snapshot,
lora.AdapterInfo, SAMI), so it lifts to go-mlx/bundle/ as a sibling
package rather than to go-inference. SAMI types travel with bundle
since Bundle.SAMI holds *SAMIResult.
Symbols rename per the folder-taxonomy rule (drop prefixes the package
carries):
StateBundle → bundle.Bundle
StateBundleOptions → bundle.Options
StateBundleModel → bundle.Model
StateBundlePrompt → bundle.Prompt
StateBundleTokenizer → bundle.Tokenizer
StateBundleRuntime → bundle.Runtime
StateBundleAdapter → bundle.Adapter
StateBundleSampler → bundle.Sampler
StateBundleRef → bundle.Ref
StateBundleVersion → bundle.Version
StateBundleKind → bundle.Kind
StateBundleRefMemvid → bundle.RefMemvid
NewStateBundle → bundle.New
LoadStateBundle → bundle.Load
CheckStateBundleCompatibility → bundle.CheckCompatibility
StateBundleFileHash → bundle.FileHash
SAMIResult → bundle.SAMIResult (kept name — separate concept)
SAMIOptions → bundle.SAMIOptions
SAMIFromKV → bundle.SAMIFromKV
mlx-root state_bundle.go becomes a thin shim with type aliases for the
77 caller sites + boundary converters for mlx.ModelInfo →
bundle.ModelInfo and mlx.GenerateConfig → bundle.Sampler. mlx-root keeps
StateBundleOptions as its own struct (carrying mlx-shaped ModelInfo +
GenerateConfig + *SAMIResult) so existing callers compile unchanged.
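The shim pattern relies on Go type aliases, which make the legacy name and the lifted name denote the same type. A single-file sketch of the idea, with Bundle and New standing in for bundle.Bundle and bundle.New (the fields and the describe helper are hypothetical):

```go
package main

import "fmt"

// Bundle stands in for the lifted type (bundle.Bundle in the text).
// Its fields are hypothetical.
type Bundle struct {
	Version int
	Model   string
}

// StateBundle is the legacy name kept as an alias. Because this is an
// alias (note the "="), both names denote one and the same type.
type StateBundle = Bundle

// New stands in for the lifted constructor.
func New(model string) *Bundle { return &Bundle{Version: 1, Model: model} }

// NewStateBundle is a one-line forwarder under the legacy name.
func NewStateBundle(model string) *StateBundle { return New(model) }

// describe accepts the new name; legacy values pass without conversion.
func describe(b *Bundle) string { return fmt.Sprintf("%s@v%d", b.Model, b.Version) }

func main() {
	legacy := NewStateBundle("gemma")
	fmt.Println(describe(legacy)) // prints "gemma@v1"
}
```

A defined type (`type StateBundle Bundle`) would instead create a distinct type and force conversions at every caller site, which is what the alias surface avoids.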
session_artifact.go's SAMIResult / SAMIOptions become aliases to
bundle.SAMIResult / bundle.SAMIOptions; SAMIFromKV becomes a thin
wrapper. The math helpers (clampUnit, clampRange, meanUnit, layerMetric)
move to bundle/sami.go with the SAMI types.
stateBundleTokenizer + stateHash + stateMemvidURI retained as
private mlx-root wrappers (bundle.NormaliseTokenizer + bundle.HashString
+ bundle.MemvidURI) for callers session_agent_darwin.go +
kv_snapshot_index.go that referenced the old in-package names.
stateBundleTestSnapshot test helper moved to kv_test_helpers_test.go
so lora_adapter*_test.go + session_darwin_test.go continue to compile.
Coverage:
- bundle/bundle_test.go covers Save/Load, memvid snapshot round-trip,
frame-zero allowance, defensive cloning, Validate + CheckCompatibility
happy + sad paths, AdapterFromInfo round-trip, NormaliseTokenizer,
AdapterEmpty, HashString, FileHash, MemvidURI, SAMIFromKV
- bundle/example_test.go for AX example registration
- state_bundle_test.go covers the shim: alias identity,
modelInfoToBundle, stateSamplerFromGenerateConfig clone,
CheckStateBundleCompatibility, FileHash, Load round-trip,
SnapshotFromMemvid via shim route, the private cross-file helpers
go vet ./... clean. Tests: mlx + bundle + kv + lora + merge + gguf +
pack all green. Pre-existing internal/metal panic remains unrelated.
Co-Authored-By: Virgil <virgil@lethean.io>
Phase 2P — probe is the go-mlx event-vocabulary for inference + training
observability. It lifts to go-mlx/probe/ rather than go-inference
because the event shape is mlx-rich: ProbeExpertResidency carries MoE
paging events that the driver-neutral inference.ProbeEvent contract
(at dappco.re/go/inference root) doesn't expose. The two probe
vocabularies remain intentionally separate — inference owns the
backend contract, go-mlx/probe/ owns the rich driver event vocabulary.
Symbols rename per the folder-taxonomy rule (drop prefixes the package
carries):
ProbeEvent → probe.Event
ProbeEventKind → probe.Kind
ProbePhase → probe.Phase
ProbeToken → probe.Token
ProbeLogit → probe.Logit
ProbeLogits → probe.Logits
ProbeEntropy → probe.Entropy
ProbeHeadSelection → probe.HeadSelection
ProbeLayerCoherence → probe.LayerCoherence
ProbeRouterDecision → probe.RouterDecision
ProbeExpertResidency → probe.ExpertResidency
ProbeResidualSummary → probe.ResidualSummary
ProbeCachePressure → probe.CachePressure
ProbeMemoryPressure → probe.MemoryPressure
ProbeTraining → probe.Training
ProbeSink → probe.Sink
ProbeSinkFunc → probe.SinkFunc
ProbeBus → probe.Bus
ProbeRecorder → probe.Recorder
NewProbeBus → probe.NewBus
NewProbeRecorder → probe.NewRecorder
cloneProbeEvent → probe.CloneEvent (exported)
ExpertResidencyAction + its four constants move from
expert_residency.go to probe so probe.ExpertResidency.Action stays a
typed enum; mlx-root expert_residency.go gets a type alias plus const
re-declarations.
mlx-root probe.go shrinks from 337 to ~80 LOC: type aliases for 19
types + 14 constants, plus the mlx-specific GenerateOption helpers
(WithProbeSink, WithProbeCallback) that stay because they touch
mlx.GenerateConfig. NewProbeBus/NewProbeRecorder become one-line
forwarders.
All ~203 caller references across 20+ files compile unchanged thanks
to the alias surface.
Coverage:
- probe/probe_test.go covers Recorder defensive-copy semantics, Bus
fanout + concurrent safety + nil-receiver guards, SinkFunc nil
handling, CloneEvent deep-copy across every payload pointer plus
Meta map, ExpertResidencyAction + Kind + Phase constant values
- probe/example_test.go for AX example registration
- probe_test.go (mlx-root) covers alias identity, constant
preservation, ExpertResidencyAction alias identity, NewProbeBus +
NewProbeRecorder wiring, WithProbeSink / WithProbeCallback installing
on GenerateConfig (including the nil-callback no-op)
- probe_example_test.go matches AX pattern
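The semantics named in the coverage list (Bus fanout, nil-receiver guards, SinkFunc nil handling, deep-copied events) can be sketched as follows. The Event fields, the EmitProbe and Add method names, and cloneEvent are assumptions made for illustration and are not taken from go-mlx/probe/.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a cut-down, hypothetical probe event.
type Event struct {
	Kind string
	Meta map[string]string
}

// Sink receives probe events. The method name is an assumption.
type Sink interface{ EmitProbe(Event) }

// SinkFunc adapts a function to Sink; a nil SinkFunc is a no-op.
type SinkFunc func(Event)

func (f SinkFunc) EmitProbe(e Event) {
	if f != nil {
		f(e)
	}
}

// cloneEvent deep-copies the Meta map so sinks cannot mutate each
// other's view of the event or the emitter's original.
func cloneEvent(e Event) Event {
	out := e
	if e.Meta != nil {
		out.Meta = make(map[string]string, len(e.Meta))
		for k, v := range e.Meta {
			out.Meta[k] = v
		}
	}
	return out
}

// Bus fans one event out to every registered sink.
type Bus struct {
	mu    sync.RWMutex
	sinks []Sink
}

// Add registers a sink; calls on a nil *Bus are ignored.
func (b *Bus) Add(s Sink) {
	if b == nil || s == nil {
		return
	}
	b.mu.Lock()
	b.sinks = append(b.sinks, s)
	b.mu.Unlock()
}

// EmitProbe snapshots the sink list under the read lock, then delivers
// an independent copy of the event to each sink outside the lock.
func (b *Bus) EmitProbe(e Event) {
	if b == nil {
		return
	}
	b.mu.RLock()
	sinks := append([]Sink(nil), b.sinks...)
	b.mu.RUnlock()
	for _, s := range sinks {
		s.EmitProbe(cloneEvent(e))
	}
}

func main() {
	bus := &Bus{}
	bus.Add(SinkFunc(func(e Event) { fmt.Println("sink saw", e.Kind, e.Meta) }))
	bus.EmitProbe(Event{Kind: "router_decision", Meta: map[string]string{"layer": "3"}})
}
```

Delivering outside the lock means a slow sink cannot block registration, at the cost of a sink added mid-emit missing the in-flight event.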
go vet ./... clean. Tests: mlx + probe + bundle + kv + lora + merge +
gguf + pack all green. Pre-existing internal/metal panic unrelated.
Co-Authored-By: Virgil <virgil@lethean.io>
Phase 2Q — scheduler.go is fully driver-neutral (only inference.TextModel
deps, no kv/lora/probe-mlx), so it lifts to go-inference/scheduler/
alongside bench, decode, and eval.
Symbols rename per the folder-taxonomy rule:
ScheduledModel → scheduler.Model
SchedulerConfig → scheduler.Config
NewScheduledModel → scheduler.New
mlx-root scheduler.go shrinks from 400 to ~25 LOC: type aliases for
ScheduledModel + SchedulerConfig + one-line NewScheduledModel forwarder.
register_metal.go's `scheduler *ScheduledModel` field +
register_metal_scheduler.go's wrappers compile unchanged through the
aliases.
Submodule pin bumped to go-inference 254b391
(feat(scheduler): driver-neutral request scheduler).
Coverage:
- go-inference/go/scheduler/scheduler_test.go ports the canonical
suite (queue + latency probe, full-queue rejection, cancellation,
Generate/Chat/Classify/BatchGenerate delegation, nil + cancelled-
context paths, fallback cancel via inference.CancellableModel, Err
propagation, generateOptions sampler conversion, cloneLabels +
millis helpers)
- go-inference/go/scheduler/example_test.go for AX coverage
- scheduler_test.go (mlx-root) covers alias identity +
NewScheduledModel forward + nil-base defensive wrapper
- scheduler_example_test.go matches AX pattern
go vet ./... clean. Tests: mlx + probe + bundle + kv + lora + merge +
gguf + pack all green. Pre-existing internal/metal panic unrelated.
Co-Authored-By: Virgil <virgil@lethean.io>
Phase 2R — memory_plan is the local-inference memory planner that maps
measured Apple-silicon hardware + model metadata to a runtime policy.
The generic core (memory class detection, base class plans, KV cache
estimation, architecture hints, generic MoE residency) lifts to
go-mlx/memory/. The MiniMax-M2-specific overrides (tensor-plan
expert-residency + first-layer skeleton bytes) stay at mlx-root,
layered on top of the generic plan.
Symbols rename per the folder-taxonomy rule (drop prefixes the package
carries):
MemoryPlan → memory.Plan
MemoryPlanInput → memory.Input (only used internally now; mlx-root keeps its own MemoryPlanInput with mlx-shaped DeviceInfo + ModelInfo)
PlanMemory → memory.NewPlan
MemoryClass → memory.Class
MemoryClass* → memory.Class* (7 constants)
MemoryGiB → memory.GiB
KVCachePolicy → memory.KVCachePolicy (kept name; package doesn't repeat the prefix)
KVCacheMode → memory.KVCacheMode
ExpertResidencyPlan → memory.ExpertResidencyPlan
ExpertResidencyMode → memory.ExpertResidencyMode
ExpertResidencyMode* → memory.ExpertResidencyMode* (3 constants)
ExpertEvictionPolicy → memory.ExpertEvictionPolicy
ExpertEvictionLRU → memory.ExpertEvictionLRU
mlx-root memory_plan.go shrinks from 529 to ~165 LOC:
- Type aliases for MemoryPlan + MemoryClass + KVCachePolicy +
KVCacheMode + 19 constants + MemoryGiB
- mlx.MemoryPlanInput stays its own struct (carries mlx.DeviceInfo +
*mlx.ModelInfo so existing callers compile unchanged)
- PlanMemory wrapper: converts to memory.Input, calls memory.NewPlan,
layers MiniMaxM2LayerForwardSkeleton bytes + MiniMaxM2TensorPlan
expert residency on top
- applyMemoryPlanToLoadConfig stays here (uses mlx.LoadConfig)
- minPositive retained as a private helper for expert_residency.go
expert_residency.go's ExpertResidencyPlan + Mode + EvictionPolicy
become aliases to memory.* types. The runtime manager + Stats + Context
types stay at mlx-root.
memory package is self-contained: imports only inference/quant/jang,
mlx/pack, mlx/profile. normalizeKnownArchitecture + trim/lower/replace
ASCII helpers duplicated locally to avoid importing mlx-root.
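To make the memory-class idea concrete, here is a small sketch of mapping measured bytes to a class by walking an ordered ladder. The class names, the cutoffs, and the choice to treat zero as "unknown" are assumptions for illustration; the actual boundaries used by memory.ClassForBytes are not stated in this commit message.

```go
package main

import "fmt"

// GiB is one gibibyte in bytes.
const GiB uint64 = 1 << 30

// Class is a hypothetical device memory class.
type Class string

const (
	ClassUnknown Class = "unknown"
	Class16      Class = "16gb"
	Class24      Class = "24gb"
	Class32      Class = "32gb"
	Class64      Class = "64gb"
	Class96      Class = "96gb"
	Class128     Class = "128gb"
)

// ladder lists the upper bound of each class in ascending order.
// The cutoffs are illustrative, not the planner's real boundaries.
var ladder = []struct {
	limit uint64
	class Class
}{
	{16 * GiB, Class16},
	{24 * GiB, Class24},
	{32 * GiB, Class32},
	{64 * GiB, Class64},
	{96 * GiB, Class96},
	{128 * GiB, Class128},
}

// classForBytes returns the first class whose limit covers the measured
// memory. Zero means the probe failed, so the caller should fall back.
func classForBytes(b uint64) Class {
	if b == 0 {
		return ClassUnknown
	}
	for _, step := range ladder {
		if b <= step.limit {
			return step.class
		}
	}
	return Class128 // anything larger is planned as the top class
}

func main() {
	for _, gib := range []uint64{0, 16, 36, 192} {
		fmt.Println(gib, "GiB ->", classForBytes(gib*GiB))
	}
}
```

A model-specific layer, such as the MiniMax-M2 overrides kept at mlx-root, would then adjust the plan selected for the class instead of changing the classification itself.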
Coverage:
- memory/memory_test.go covers the generic core: 16/24/32/64/96/128GB
class plans, context capped by pack metadata, Qwen3-MoE hints,
MiniMax architecture caps, BERT embedding disables generation
cache, fallback on zero memory, model metadata caps context,
Q8 KV cache for middle classes, generic MoE residency,
ClassForBytes boundaries, minPositive, percentBytes,
normalizeKnownArchitecture aliases (15 tests)
- memory/example_test.go for AX coverage
- memory_plan_test.go at mlx-root unchanged — all 11 existing tests
pass through the shim, exercising the integrated path including
MiniMaxM2 skeleton + tensor-plan residency
go vet ./... clean. Tests: mlx + memory + probe + bundle + kv + lora +
merge + gguf + pack all green. Pre-existing internal/metal panic
unrelated.
Co-Authored-By: Virgil <virgil@lethean.io>
Phase 2S — mega-lift matching the model/{arch}/{name}/ folder taxonomy
called out in feedback_driver_lift_discipline.md. Moves four mlx-root
source files (minimax_m2.go 1016 LOC + minimax_m2_native_darwin.go 167
+ minimax_m2_native_stub.go 32 + expert_residency.go 476) plus three
test files (minimax_m2_test.go 643 + minimax_m2_darwin_test.go 441 +
expert_residency_test.go 159) to go-mlx/model/minimax/m2/ as a single
self-contained package.
Symbol renames per the folder-taxonomy rule (drop prefixes the package
carries — m2 carries "MiniMaxM2"):
MiniMaxM2Config → m2.Config
MiniMaxM2TensorRole → m2.TensorRole
MiniMaxM2TensorRole* (9 constants) → m2.TensorRole* (9 constants)
MiniMaxM2TensorSpec → m2.TensorSpec
MiniMaxM2TensorPlan → m2.TensorPlan
MiniMaxM2RouterDecision → m2.RouterDecision
MiniMaxM2ExpertFunc → m2.ExpertFunc
MiniMaxM2PackedExpertWeights → m2.PackedExpertWeights
MiniMaxM2RouterWeights → m2.RouterWeights
MiniMaxM2PackedLayerForwardOptions → m2.PackedLayerForwardOptions
MiniMaxM2PackedLayerForwardResult → m2.PackedLayerForwardResult
MiniMaxM2LazyExpertLoad → m2.LazyExpertLoad
MiniMaxM2DenseProjectionTensor → m2.DenseProjectionTensor
MiniMaxM2DenseExpertWeights → m2.DenseExpertWeights
MiniMaxM2ResolvedTensor → m2.ResolvedTensor
MiniMaxM2LayerForwardSkeleton → m2.LayerForwardSkeleton
ParseMiniMaxM2Config → m2.ParseConfig
BuildMiniMaxM2TensorPlan → m2.BuildTensorPlan
RouteMiniMaxM2Tokens → m2.RouteTokens
DispatchMiniMaxM2Experts → m2.DispatchExperts
LoadMiniMaxM2PackedExpertsForDecisionsFromSafetensors → m2.LoadPackedExpertsForDecisions
LoadMiniMaxM2LazyExpertsForHiddenFromSafetensors → m2.LoadLazyExpertsForHidden
LoadMiniMaxM2PackedExpertsFromSafetensors → m2.LoadPackedExperts
LoadMiniMaxM2RouterFromSafetensors → m2.LoadRouter
ProjectMiniMaxM2RouterScores → m2.ProjectRouterScores
BuildMiniMaxM2LayerForwardSkeletonFromSafetensors → m2.BuildLayerForwardSkeleton
MiniMaxM2RouterProbeEvents → m2.RouterProbeEvents
MiniMaxM2ExpertResidencyLoader → m2.ResidencyLoader
MiniMaxM2ExpertResidencyConfig → m2.ResidencyConfig
MiniMaxM2ExpertResidencyManager → m2.ResidencyManager
NewMiniMaxM2ExpertResidencyManager → m2.NewResidencyManager
PlanMiniMaxM2ExpertResidency → m2.PlanResidency
DispatchMiniMaxM2PackedExpertsMetal → m2.DispatchPackedExpertsMetal
DispatchMiniMaxM2PackedExpertsFromSafetensorsMetal → m2.DispatchPackedExpertsFromSafetensorsMetal
ForwardMiniMaxM2LazyExpertLoadMetal → m2.ForwardLazyExpertLoadMetal
ForwardMiniMaxM2PackedLayerMetal → m2.ForwardPackedLayerMetal
ForwardMiniMaxM2PackedLayerFromSafetensorsMetal → m2.ForwardPackedLayerFromSafetensorsMetal
normaliseExpertResidencyPlan → m2.NormalisePlan
JANGPackedProjectionTensor → m2.JANGPackedProjectionTensor
Private helpers all lose the miniMaxM2 prefix (decisionExpertIDs,
uniqueExpertIDs, packedDType, etc.).
ExpertResidencyStats moves to memory.ExpertResidencyStats (it's the
companion measurement type for memory.ExpertResidencyPlan that was
already there).
mlx-root shim files (minimax_m2.go, minimax_m2_native_darwin.go,
minimax_m2_native_stub.go, expert_residency.go) preserve all 66 caller
references via type aliases + wrapper functions. memory_plan.go's
PlanMemory MiniMaxM2-specific overrides still compile through the
aliases. model_pack.go's ParseMiniMaxM2Config /
BuildMiniMaxM2TensorPlan / BuildMiniMaxM2LayerForwardSkeletonFromSafetensors
calls route through wrappers. workload_bench.go's ExpertResidencyStats
+ normaliseExpertResidencyPlan route through aliases.
m2 package is self-contained: imports core, jang, mlx/memory, mlx/probe,
mlx/profile, mlx/safetensors, mlx/quant/jang only — no upward mlx-root
import (which would cycle). Private helpers (firstNonEmpty,
normalizeKnownArchitecture, nonZeroDuration, maxPositive, minPositive,
firstPositive) duplicated locally in helpers.go.
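Helpers of this kind are small enough that copying them is cheaper than introducing an import cycle. A plausible sketch of four of them follows; the names come from the list above, but the signatures and exact semantics are assumptions.

```go
package main

import "fmt"

// firstNonEmpty returns the first argument that is not the empty string.
func firstNonEmpty(values ...string) string {
	for _, v := range values {
		if v != "" {
			return v
		}
	}
	return ""
}

// firstPositive returns the first argument greater than zero, or 0.
func firstPositive(values ...int) int {
	for _, v := range values {
		if v > 0 {
			return v
		}
	}
	return 0
}

// minPositive returns the smallest argument greater than zero, or 0
// when no argument is positive.
func minPositive(values ...int) int {
	best := 0
	for _, v := range values {
		if v > 0 && (best == 0 || v < best) {
			best = v
		}
	}
	return best
}

// maxPositive returns the largest argument greater than zero, or 0
// when no argument is positive.
func maxPositive(values ...int) int {
	best := 0
	for _, v := range values {
		if v > best {
			best = v
		}
	}
	return best
}

func main() {
	fmt.Println(firstNonEmpty("", "minimax_m2"))
	fmt.Println(firstPositive(0, -1, 7), minPositive(0, 5, 3), maxPositive(2, 9, 4))
}
```

Treating zero as "unset" is a common convention for config merging: a limit of 0 from one source does not override a real limit from another.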
Test fixtures (miniMaxM2FixtureConfig + findMiniMaxM2Spec +
writeMiniMaxM2RawSafetensors + miniMaxM2SkeletonRawTensors +
miniMaxM2F32RawTensor + miniMaxM2RawSafetensor) duplicated at mlx-root
in minimax_m2_test_helpers_test.go so jang_darwin_test.go and
model_pack_test.go still build. Go test packages cannot import each
other's internal _test.go helpers, hence the duplication.
internal/metal/metal.go's defaultMetallibPath search expanded by two
more parent-dir candidates so tests running from
model/minimax/m2/ (5 directories deep) can still discover
dist/lib/mlx.metallib.
go vet ./... clean. Tests: mlx + m2 + memory + probe + bundle + kv +
lora + merge + gguf + pack + ide-side packages all green. Pre-existing
internal/metal TestGenerate_Model_StagedMiniMaxReturnsDecodeError_Bad
nil-tokenizer panic still unrelated.
Co-Authored-By: Virgil <virgil@lethean.io>
Phase 2T: hf_fit.go (1019 LOC) hosts the HuggingFace metadata source + local-fit planner. The public HF* symbols have ZERO callers in production code (only test references), so the lift is mostly a shape change.
Lifts to go-mlx/hf/ with symbol renames per the folder-taxonomy rule:
HFModelSource → hf.ModelSource
HuggingFaceModelSourceConfig → hf.RemoteConfig
HuggingFaceModelSource → hf.RemoteSource
NewHuggingFaceModelSource → hf.NewRemoteSource
HFModelFitConfig → hf.FitConfig
HFModelMetadata → hf.ModelMetadata
HFModelFile → hf.ModelFile
HFModelConfig → hf.ModelConfig
HFQuantizationConfig → hf.QuantizationConfig
HFModelFitReport → hf.FitReport
HFModelFitPlan → hf.FitPlan
HFTrainingFit → hf.TrainingFit
PlanHFModelFits → hf.PlanFits
InferJANGFromHF → hf.InferJANG
HFModelSourceRemote/Local → hf.SourceRemote/Local
Plus all the private helpers (collectFitEntries, planFit, weightFormatAndBytes, inferQuantBits, etc.) lose the hf-redundant prefixes.
hf package is self-contained: imports core, jang, mlx/memory, mlx/pack, mlx/profile. Uses memory.Class / memory.Plan / memory.NewPlan / memory.Input / memory.DeviceInfo / memory.GiB / memory.KVCacheMode* directly (no mlx-root coupling). The four model-pack-helper calls that previously delegated to mlx-root (modelPackSupportedArchitecture, modelPackNativeRuntimeSupported, modelPackUsesGenerationKVCache, inspectModelPackTaskProfiles) are now inlined as private hf helpers (archSupported, archNativeRuntime, usesGenerationKVCache, resolveArchitectureProfile); each is a thin wrapper over profile.LookupArchitectureProfile, no behaviour change.
mlx-root hf_fit.go shrinks from 1019 to ~65 LOC of pure shim: 11 type aliases + 2 const re-exports + 3 wrapper functions. PlanHFModelFits auto-fills cfg.Device from GetDeviceInfo() (the mlx-root metal probe) and converts to memory.DeviceInfo at the boundary, so caller-facing behaviour is preserved.
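The auto-fill-then-convert behaviour of the PlanHFModelFits wrapper can be sketched in a few lines. Every type and field here is a hypothetical stand-in: DeviceInfo for the mlx-root device description, PlannerDevice for memory.DeviceInfo, and the probe argument for GetDeviceInfo().

```go
package main

import "fmt"

// DeviceInfo stands in for the mlx-root device description.
// The fields are hypothetical.
type DeviceInfo struct {
	Architecture string
	MemorySize   uint64
}

// PlannerDevice stands in for the planner-side device type.
type PlannerDevice struct {
	Architecture string
	TotalBytes   uint64
}

// FitConfig is a cut-down config; a nil Device means "probe the machine".
type FitConfig struct {
	Device *DeviceInfo
}

// toPlannerDevice is the boundary converter between the two shapes.
func toPlannerDevice(d DeviceInfo) PlannerDevice {
	return PlannerDevice{Architecture: d.Architecture, TotalBytes: d.MemorySize}
}

// resolveDevice prefers the caller-supplied device and only calls the
// probe when none was given, then converts at the package boundary.
func resolveDevice(cfg FitConfig, probe func() DeviceInfo) PlannerDevice {
	if cfg.Device != nil {
		return toPlannerDevice(*cfg.Device)
	}
	return toPlannerDevice(probe())
}

func main() {
	probe := func() DeviceInfo { return DeviceInfo{Architecture: "probed", MemorySize: 32 << 30} }
	fmt.Println(resolveDevice(FitConfig{}, probe))
	fmt.Println(resolveDevice(FitConfig{Device: &DeviceInfo{Architecture: "explicit", MemorySize: 64 << 30}}, probe))
}
```

Passing the probe as a function keeps the planner package free of any hardware dependency, which matches the stated goal of no mlx-root coupling.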
helpers.go (new at mlx-root) holds firstNonEmpty / firstPositive / indexString that were at the bottom of hf_fit.go and are used by dataset_stream, kv_snapshot_index, memvid_chapter_smoke, model_pack, and openai. They stay at mlx-root because mlx-root consumers cannot import hf (wrong direction).
model_config_probe.go (new at mlx-root) holds modelConfigProbe + readModelConfig + the probe's accessor methods, plus normalizeKnownArchitecture and architectureFromTransformersName. These are used by model_pack.go's inspectModelPackConfig + applyModelPackConfigMetadata; the originals lived in hf_fit.go. The hf package keeps its own private copies of the two architecture normalisers (they're used internally by the planner too).
Tests port into hf package: they exercise internal fields/methods (.baseURL, .userAgent, .client, .byteSize) so package-private access is preserved. writeModelPackFile test helper duplicated in hf/test_helpers_test.go since Go test packages cannot import each other's internal helpers.
go vet ./... clean. Tests: mlx + hf + memory + probe + bundle + kv + lora + merge + gguf + pack + m2 all green.
Co-Authored-By: Virgil <virgil@lethean.io>
SonarCloud found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
Use borrowed full page handles for immediate paged-cache decode attention, keeping partial preallocated pages owned as visible slices. Refresh the 100k retained workflow report with the measured borrowed-page run and current runner deltas. Co-Authored-By: Virgil <virgil@lethean.io>
Evaluate non-paged prompt-cache state before detaching chunked prefill arrays so contiguous and rotating caches do not carry unevaluated MLX graph handles into the next chunk. Leave paged caches on the accepted production path without the extra synchronisation point. Document the fp16/rotating 100k diagnostic as a rejected production shortcut: the prefill primitive error is fixed, but decode still crashes before producing a report. Co-Authored-By: Virgil <virgil@lethean.io>
Record 100k same-shape diagnostics for larger paged K/V blocks and preallocated page writes. Both stay below the accepted 1024-page borrowed-state lane, so the long-context target remains fused paged/global attention rather than page-size tuning. Update GOAL.md, the runtime index, long-context diagnosis, and the production benchmark manifest with the new rejected artefacts. Co-Authored-By: Virgil <virgil@lethean.io>
Retain the materialised full K/V state produced by paged fast-concat on full-attention owner layers so shared Gemma 4 layers can reuse it instead of rebuilding the same long-context state. Records the 100k retained workflow moving from 260.093s / 51.293 tok/s to 231.109s / 60.011 tok/s, while keeping the external runner gap open in GOAL.md and runtime docs. Co-Authored-By: Virgil <virgil@lethean.io>
Adds the 5120-token-budget 100k retained-state diagnostic. The current prompt naturally stops at 2489 tokens per turn, but decode stays flat around 60 tok/s across ten retained turns and memory remains bounded under the production guards. Co-Authored-By: Virgil <virgil@lethean.io>




Summary by CodeRabbit
New Features
Improvements
Documentation