Virgil Lemma foundations#8

Open
Snider wants to merge 114 commits into main from dev


@Snider Snider commented May 20, 2026

Summary by CodeRabbit

  • New Features

    • Qwen 2/3 and Qwen 3.6 model support; new adapter with buffered and streaming generation.
    • Block‑prefix cache service and memvid bundle index for faster prefix restores.
    • Agentic memory: wake/sleep workflows, state bundles and memvid integration; session‑state artifact export.
  • Improvements

    • Device‑aware memory planner; expanded chunked generation, prompt‑cache warm/restore and KV snapshot flows.
    • Build/toolchain updated (C++23) and macOS deployment target raised.
  • Documentation

    • Extensive new/updated docs: architecture, runtime, inference, memory, MoE, training and benchmarks.


Snider and others added 30 commits May 8, 2026 13:18
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Implements the 2026-05-09 vMLX feature-parity sprint (see
docs/vmlx-feature-gap-report.md + docs/superpowers/plans/) plus the
runtime surfaces that hang off it. Closes the gap between go-mlx and
vMLX's Python engine for MoE and advanced quantisation paths.

Phase 1 surface:
- MoE / advanced quant: minimax_m2.go + native_darwin, jang.go +
  native_darwin, codebook_vq.go, expert_residency.go.
- Cache + decode: block_cache.go (block-prefix cache), prompt cache
  threshold integration, decode_optimisation.go (speculative + prompt-
  lookup harness).
- Algorithm/architecture profiles: algorithm_profile.go +
  architecture_profile.go for backend capability reporting.
- Agent memory: agent_memory.go (Wake/Sleep/Fork on top of KV snapshots
  + memvid), state_bundle.go round-trip via dappco.re/go/inference/state.
- Scheduler + parsers: scheduler.go (queue-aware Schedule + Cancel),
  parser_registry.go (model-family tool/reasoning parsers),
  register_metal_{cache,parser,scheduler}.go capability mounts.
- Model-pack + planning: gguf_info.go / gguf_quantize.go, memory_plan.go
  (device-class sizing), model_pack.go validation.
- Internal Metal extensions: gemma4 paged KV, minimax_m2 forward stubs,
  codebook_vq kernels, jang_dequant, kv_snapshot_blocks_native.
- Frame compute: compute.go API rounded out for non-LLM kernels.
- admin.go, dataset_stream.go, fast_eval.go, hf_fit.go,
  small_model_smoke.go, workload_bench.go.
- Observability: probe.go expanded for MoE router decisions, cache
  pressure, training events.
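
For orientation, prompt-lookup decoding in general proposes draft tokens by matching the tail of the context against earlier text in the same context. The sketch below shows only that generic idea. proposeDraft and its parameters are hypothetical names for illustration and are not taken from decode_optimisation.go.

```go
package main

import "fmt"

// proposeDraft looks for the most recent earlier occurrence of the last
// `ngram` tokens and returns up to `maxDraft` tokens that followed it.
// It returns nil when there is no earlier match.
func proposeDraft(tokens []int, ngram, maxDraft int) []int {
	if ngram <= 0 || maxDraft <= 0 || len(tokens) <= ngram {
		return nil
	}
	suffix := tokens[len(tokens)-ngram:]
	// Scan backwards so the latest earlier match wins.
	for start := len(tokens) - ngram - 1; start >= 0; start-- {
		match := true
		for i := 0; i < ngram; i++ {
			if tokens[start+i] != suffix[i] {
				match = false
				break
			}
		}
		if !match {
			continue
		}
		from := start + ngram
		to := from + maxDraft
		if to > len(tokens) {
			to = len(tokens)
		}
		// Copy so callers cannot alias the context slice.
		return append([]int(nil), tokens[from:to]...)
	}
	return nil
}

func main() {
	context := []int{1, 2, 3, 4, 1, 2}
	fmt.Println(proposeDraft(context, 2, 2))
}
```

A real harness would still verify the drafted tokens against the model before accepting them; that step is out of scope here.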

docs/ pass adds per-file documentation under docs/{topic}/{file}.md so
future readers can plan against the runtime without grep:
- runtime/ — register_metal, adapter
- memory/ — agent_memory, kv_snapshot family, state_bundle, medium
- moe/ — minimax_m2, jang, codebook_vq, expert_residency
- training/ — sft, lora_adapter, grpo, distill, eval
- model/ — model_pack, memory_plan
- inference/ — scheduler, block_cache, decode_optimisation,
  parser_registry, thinking
- compute/ — frame-compute API
- observability/ — probe.go emission
- cmd/violet — sidecar daemon
34 new docs plus per-topic READMEs and a top-level index.

Co-Authored-By: Virgil <virgil@lethean.io>
First lobe of the package-split out of the 80-file root dump. Moves the
non-LLM Metal frame-compute lane (PixelBuffer / kernels / Session /
NewSession) into its own subpackage so the root mlx package stays
focused on LLM inference.

- go/compute*.go → go/compute/ (10 files, package mlx → package compute)
- compute_darwin.go renamed compute_metal.go (no _darwin suffix —
  package is Metal-only, no dual-platform split)
- compute_stub.go variants deleted — Metal-only by design, no
  non-darwin compile target to guard against
- All build tags dropped — package is darwin/arm64 implicit
- DeviceInfo restored as type alias to metal.DeviceInfo (not field-
  flattened); DeviceInfo() returns metal.GetDeviceInfo() direct so
  upstream renames + new fields surface at compile time
- unsupported_stub_test.go in parent dropped its compute.* compile-
  surface refs — stub build no longer needs to compile-check a
  Metal-only subpackage
- examples/ moved into docs/examples/ (first-trip cleanup)
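
The DeviceInfo bullet above turns on the difference between a Go type alias and a field-flattened copy. A minimal sketch of the alias form follows, assuming a stand-in metalDeviceInfo struct with made-up fields; the real metal.DeviceInfo layout and the accessor name are not shown in this PR text, so currentDeviceInfo is hypothetical too.

```go
package main

import "fmt"

// metalDeviceInfo stands in for metal.DeviceInfo (hypothetical fields).
type metalDeviceInfo struct {
	Name        string
	MemoryBytes uint64
}

// getDeviceInfo stands in for metal.GetDeviceInfo.
func getDeviceInfo() metalDeviceInfo {
	return metalDeviceInfo{Name: "example-gpu", MemoryBytes: 16 << 30}
}

// DeviceInfo is an alias, not a new struct: it is the same type as
// metalDeviceInfo, so renamed or added upstream fields are visible here
// without a hand-maintained copy step.
type DeviceInfo = metalDeviceInfo

// currentDeviceInfo returns the upstream value directly.
func currentDeviceInfo() DeviceInfo {
	return getDeviceInfo()
}

func main() {
	info := currentDeviceInfo()
	// No conversion is needed because both names denote one type.
	var upstream metalDeviceInfo = info
	fmt.Println(upstream.Name, info.MemoryBytes)
}
```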

No external consumers of compute symbols in the tetrad today; only
internal sibling fast_eval / api_stub / session_* call sites and they
use ModelSession.NewSession (method) rather than compute.NewSession
(free function). No downstream import churn.

Co-Authored-By: Virgil <virgil@lethean.io>
Drops the in-mlx output-parsing layer and consumes
dappco.re/go/inference/parser instead. Driver-neutral logic — model-
family reasoning markers, thinking-channel processor, tool-call
parsing — now lives in go-inference so every driver (rocm, cuda, tpu,
future) inherits it without re-implementation.

Deletes:
- go/parser_registry.go (466 lines)
- go/thinking.go         (320 lines)
- their _test.go siblings

Replaces with:
- go/thinking.go (slim) — driver-side WithThinking* options that
  mutate the local mlx.GenerateConfig.Thinking field, FilterThinkingTokens
  wrapper for the *Tokenizer streaming path, parserHint() helper that
  converts mlx.ModelInfo to parser.Hint{Architecture, AdapterName}.

Sibling fix-ups:
- api_common.go: GenerateConfig.Thinking is parser.Config; default is
  parser.Show.
- api_darwin.go: 5 emit sites use parser.NewProcessor + parserHint.
- openai.go: 3 response handlers use parser.NewProcessor; reasoning
  selector uses parser.ForHint(parser.HintFromInference(...)).
- register_metal_parser.go: outputParser() returns parser.OutputParser
  via parser.ForHint(parserHint(...)).
- register_metal_cache.go: drops local modelInfoFromInference helper,
  uses adapter.Info() directly.
- architecture_profile.go: parser.NormaliseKey replaces local helper.
- thinking_darwin_test.go: parser.Chunk replaces ThinkingChunk.

Submodule pin: external/go-inference advanced to cb4f9fb (parser
package + ProbeScheduler vocab the mlx scheduler.go was emitting).

Co-Authored-By: Virgil <virgil@lethean.io>
Drops the in-mlx JANG/JANGTQ + VQ codebook quant metadata and consumes
dappco.re/go/inference/quant/{jang,codebook} instead. Driver-neutral
quant types now lift to go-inference where every backend
(mlx, rocm, cuda, tpu, future) inherits them.

Deletes:
- go/jang.go         (597 lines)
- go/codebook_vq.go  (294 lines)
- their _test.go siblings (228 lines)

Adds:
- go/jang_hf.go — driver-side helpers that depend on mlx-local
  HFModelMetadata (InferJANGFromHF, hfJANGGroupSize,
  inferJANGProfileName). Compose lifted jang.Info shape.
- safetensor_ref.go: local mlxMaxIntValue() helper (was in jang.go).

Symbol-namespace renames (package name takes the disambiguation slot):

  JANGQuantizationInfo               → jang.Info
  JANGCapabilities                   → jang.Capabilities
  JANGTensorRole + consts            → jang.TensorRole*
  JANGPackedQuantizationProfile      → jang.PackedProfile
  JANGPackedTensorDescriptor         → jang.PackedTensorDescriptor
  BuildJANGPackedQuantizationProfile → jang.BuildPackedProfile
  CloneJANGPackedQuantizationProfile → jang.ClonePackedProfile
  NewJANGPackedTensorDescriptor      → jang.NewPackedTensorDescriptor
  ValidateJANGPackedTensor           → jang.ValidatePackedTensor
  DequantizeJANGPackedTensor         → jang.DequantizePackedTensor
  PackJANGQuantizedValues            → jang.PackQuantizedValues
  readJANGQuantizationInfo           → jang.ReadConfig
  parseJANGQuantizationInfo          → jang.ParseConfig

  CodebookQuantizationType           → codebook.Type
  CodebookFormatVQ                   → codebook.FormatVQ
  CodebookQuantizationProfile        → codebook.Profile
  CodebookTensorDescriptor           → codebook.TensorDescriptor
  ParseCodebookQuantizationProfile   → codebook.ParseProfile
  NewCodebookTensorDescriptor        → codebook.NewTensorDescriptor
  ValidateCodebookQuantizationProfile → codebook.ValidateProfile
  ValidateCodebookTensorDescriptor   → codebook.ValidateTensorDescriptor
  ValidateCodebookTensorPayload      → codebook.ValidateTensorPayload
  CodebookVQMatVec                   → codebook.MatVec
  readCodebookQuantizationProfile    → codebook.ReadProfile
  cloneCodebookQuantizationProfile   → codebook.CloneProfile

Sibling fix-ups across 19 files (production + tests):
- algorithm_profile, architecture_profile, hf_fit (+test),
  jang_native_darwin/stub, memory_plan (+test), minimax_m2 (+test),
  model_pack (+test), workload_bench (+test), expert_residency_test,
  jang_darwin_test, minimax_m2_darwin_test, inference_contract_test.
- Variable shadowing: `jang` local variables renamed to `info`
  where they shadowed the package import.
- jangQuantizationType(info) calls replaced with info.Packed.Type.
- finalizeJANGQuantizationInfo helper inlined as
  info.Packed = jang.BuildPackedProfile(info).
- testJANGTQInfo() helper re-added locally in jang_darwin_test.go
  (was in deleted jang_test.go).

Submodule pin: external/go-inference advanced to cb3dc24 (parser +
quant/jang + quant/codebook).

Companion lifts deferred next round:
- model/minimax/m2 — safetensorIndex (mlx-private) couplings in
  loader functions; needs either safetensors lift or types/loaders
  split.
- moe/expert_residency — MemoryClass (Apple-tier enum) needs
  budget-bytes refactor before lifting.

Co-Authored-By: Virgil <virgil@lethean.io>
Snider correction: file lifts shouldn't add new flat files to the
go-mlx root, and the _darwin/_stub split is noise on a Metal-only
driver. Same rules as compute/: package gets its own folder, no
build-tag dance.

  go/jang_native_darwin.go + jang_native_stub.go → go/quant/jang/jang.go
  (one file, no _darwin suffix, no stub variant)

Symbols drop redundant prefixes since the folder + package imply them:
  JANGPackedProjectionResult       → jang.PackedProjectionResult
  DequantizeJANGPackedTensorMetal  → jang.DequantizePackedTensor
  ProjectJANGPackedTensorMetal     → jang.ProjectPackedTensor
  ProjectJANGPackedTensorMetalFused → jang.ProjectPackedTensorFused
  jangMetalShape (private)         → jang.MetalShape (exported for tests)
  jangMetalShapeElements (private) → jang.ShapeElements
  int32SliceToInts (private)       → jang.Int32SliceToInts

Inside the package, the inference-side jang aliases as infjang to
avoid the same-name self-collision. Consumers (jang_darwin_test +
minimax_m2_native_darwin) alias the mlx-side as mlxjang.

The HF-metadata helpers (InferJANGFromHF, hfJANGGroupSize,
inferJANGProfileName) merged into hf_fit.go — they're HF-fit code
that happens to produce *jang.Info, not jang-package code (they
depend on HFModelMetadata which lives in hf_fit.go). hf_fit.go +
HFModelMetadata still pending their own folder lift (likely
go/hf/ in a future iteration).

go-mlx/go root flat-file count: net −1 this commit (deletion of
jang_native_stub.go + jang_native_darwin.go and jang_hf.go,
addition of nothing new in root).

Co-Authored-By: Virgil <virgil@lethean.io>
Commit 63f9894 renamed the file but shipped its OLD content (the
working-tree perl edits weren't re-staged before commit, so the
index had the pre-edit version under the new path). HEAD's
quant/jang/jang.go was still `package mlx` with the build tag,
despite the working tree being correct (which masked the bug
locally — build passed because the file on disk was right).

This commit ships what should have landed in 63f9894:
- package mlx → package jang
- drop //go:build darwin && arm64 && !nomlx
- symbols dropped JANG/Metal prefixes: DequantizePackedTensor,
  ProjectPackedTensor*, MetalShape, ShapeElements, Int32SliceToInts
- inference jang aliased as infjang inside the file

Co-Authored-By: Virgil <virgil@lethean.io>
algorithm_profile.go + architecture_profile.go move into go/profile/.
Both become package profile; consumers import dappco.re/go/mlx/profile
and call profile.LookupAlgorithmProfile / profile.LookupArchitectureProfile.

architecture.go inlines normalizeKnownArchitecture +
architectureFromTransformersName as private helpers (originals live in
gguf_info.go at mlx root). Inlining avoids the import cycle that would
otherwise form when profile/ pulls from mlx and mlx-root tests
exercise profile/. Same trick for KVCacheMode references — uses
literal "q8" / "paged" / "k-q8-v-q4" strings instead of mlx-root
constants.

Tests stay in mlx root for now (algorithm_profile_test.go +
architecture_profile_test.go), aliased as
`prof "dappco.re/go/mlx/profile"` so the `profile` local-var name
they use doesn't shadow the package. Local-var lookup results
renamed `profile → p` where needed.

model_pack.go's local `profile := pack.ArchitectureProfile` renamed
to `arch` to avoid shadowing the new package import.
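
The shadowing fixes described above follow a standard Go pattern: alias the import so a local identifier with the package's name cannot hide it. A small sketch using the standard library's path package in place of dappco.re/go/mlx/profile (the alias pth and the function baseOf are illustrative only):

```go
package main

import (
	"fmt"
	pth "path" // aliased so a local named `path` cannot shadow the package
)

// baseOf keeps `path` as a natural parameter name; the package stays
// reachable through its alias.
func baseOf(path string) string {
	return pth.Base(path)
}

func main() {
	fmt.Println(baseOf("docs/model/model_pack.md"))
}
```

The alternative used elsewhere in this PR, renaming the local variable (for example profile to p or arch), avoids the alias at the cost of touching each call site.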

go vet ./... clean. Test suite green.

Co-Authored-By: Virgil <virgil@lethean.io>
Move lora_adapter.go → lora/adapter.go (package lora). Stage 1 only:
lora_fuse* stays at mlx root because it references mlx-root types
(ModelPack, ModelPackFormatSafetensors) — same blocker as gguf_quantize.go.

Symbol renames (drop redundant "LoRA"/"lora" prefixes since pkg carries them):
  LoRAAdapterInfo      → lora.AdapterInfo
  InspectLoRAAdapter   → lora.InspectAdapter (1-arg convenience)
  inspectLoRAAdapter   → lora.Inspect (2-arg form, now public)
  loraAdapterInfoEmpty → (info AdapterInfo) IsEmpty() method
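
The last rename turns a free helper into a method. A sketch of that shape, assuming AdapterInfo is a comparable struct; the fields shown are hypothetical and are not the real lora.AdapterInfo layout:

```go
package main

import "fmt"

// AdapterInfo is a stand-in with hypothetical fields.
type AdapterInfo struct {
	Path string
	Rank int
	Hash string
}

// IsEmpty reports whether info is the zero value. This only compiles
// while every field is comparable; slices or maps would need explicit checks.
func (info AdapterInfo) IsEmpty() bool {
	return info == (AdapterInfo{})
}

func main() {
	fmt.Println(AdapterInfo{}.IsEmpty())
	fmt.Println(AdapterInfo{Path: "adapters/demo", Rank: 8}.IsEmpty())
}
```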

Private helpers in lora/ also drop redundant prefixes:
  loraAdapterConfigJSON  → adapterConfigJSON
  loraAdapterConfigPath  → adapterConfigPath
  hashLoRAAdapter        → hashAdapter
  loraAdapterResultError → resultError

lora_fuse.go gets its own inline copy of loraAdapterResultError (the
generic core.Result → error helper isn't worth pulling into the
public surface of lora).

Also: fixes stray `package mlx` left in profile/algorithm.go +
profile/architecture.go from the previous lift commit (8f5174a) where
the package-line rename apparently raced with the commit.

go vet ./... clean. mlx package tests green.

Co-Authored-By: Virgil <virgil@lethean.io>
Pure types-lift: ModelPack struct + its constants, options, methods move
into go-mlx/pack/. Inspectors + validators stay in mlx-root model_pack.go
(they reference mlx-root concrete types — GGUFInfo, MiniMaxM2TensorPlan
— that would create cycles).

Cycle-breaker: 4 fields in pack.ModelPack typed as `any` since their
concrete types live at mlx root:
  Quantization any (was *GGUFQuantizationInfo)
  GGUF any (was *GGUFInfo)
  MiniMaxM2 any (was *MiniMaxM2TensorPlan)
  MiniMaxM2LayerSkeleton any (was *MiniMaxM2LayerForwardSkeleton)

Consumers type-assert at read sites (memory_plan.go + model_pack_test.go).
Inspectors assign concrete pointers directly (any accepts).
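
The cycle-breaker above can be sketched as follows. The types are stand-ins defined in a single file for illustration: ggufInfo plays the role of a root-package type, modelPack the role of pack.ModelPack, and tensorCount a hypothetical read site.

```go
package main

import "fmt"

// ggufInfo stands in for a concrete type that lives in the root package.
type ggufInfo struct{ TensorCount int }

// modelPack mimics the lifted struct: the field is `any`, so the pack
// package never has to import the package that defines ggufInfo.
type modelPack struct {
	Path string
	GGUF any // holds *ggufInfo when populated
}

// tensorCount is a read site. It type-asserts and tolerates an unset
// field or an unexpected dynamic type instead of panicking.
func tensorCount(p modelPack) (int, bool) {
	info, ok := p.GGUF.(*ggufInfo)
	if !ok || info == nil {
		return 0, false
	}
	return info.TensorCount, true
}

func main() {
	// Inspectors can assign the concrete pointer directly.
	p := modelPack{Path: "model.gguf", GGUF: &ggufInfo{TensorCount: 291}}
	n, ok := tensorCount(p)
	fmt.Println(n, ok)

	_, ok = tensorCount(modelPack{})
	fmt.Println(ok)
}
```

The trade-off is that a wrong type in one of these fields is only caught at run time, which is why the two-value form of the assertion is used at the read site.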

Symbol policy this round: NO renames. pack.ModelPack stays pack.ModelPack
(verbose but lower-risk; renames can land as a follow-up). Mlx root imports
pack as `mp` to avoid the local-var name collision (many functions use
`pack` as parameter name).

addIssue + issueSummary → AddIssue + IssueSummary (exported, since
inspectors at mlx root call them across the package boundary).
applyModelPackOptions → pack.ApplyOptions (similarly exported).

Unblocks: lora_fuse and gguf_quantize can now live in their own packages
once their other dependencies (safetensor private types + MiniMaxM2 types)
also lift. This commit ships only the type lift.

go vet ./... clean. mlx package tests green.

Co-Authored-By: Virgil <virgil@lethean.io>
Move lora_fuse{,_darwin,_stub,_test,_darwin_test}.go into lora/
(package lora) — joins lora/adapter.go from the earlier lora_adapter
lift. lora/ is now the LoRA package as intended.

API change: lora.FuseIntoPack takes pre-validated pack.ModelPack as
SourcePack (instead of ModelPath string). Callers validate via
mlx.ValidateModelPack first, then call lora.FuseIntoPack, then validate
output if they need a populated pack. This breaks the mlx ↔ lora cycle
(otherwise lora.FuseIntoPack would need to call mlx.ValidateModelPack →
cycle since mlx-root imports lora for AdapterInfo).
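
The validate, fuse, validate-again flow described above might look like the sketch below. Everything here is a stand-in: validateModelPack, fuseIntoPack, the validated flag and the option fields only approximate mlx.ValidateModelPack and lora.FuseIntoPack, whose real signatures are not shown in this PR text.

```go
package main

import (
	"errors"
	"fmt"
)

// modelPack stands in for pack.ModelPack; the validated flag is hypothetical.
type modelPack struct {
	Path      string
	validated bool
}

// validateModelPack stands in for mlx.ValidateModelPack (assumed signature).
func validateModelPack(path string) (modelPack, error) {
	if path == "" {
		return modelPack{}, errors.New("empty model path")
	}
	return modelPack{Path: path, validated: true}, nil
}

// fuseOptions carries an already-validated SourcePack instead of a path.
type fuseOptions struct {
	SourcePack  modelPack
	AdapterPath string
	OutputPath  string
}

type fuseResult struct{ OutputPath string }

// fuseIntoPack never calls the validator itself, so its package would
// not need to import the root package.
func fuseIntoPack(opts fuseOptions) (fuseResult, error) {
	if !opts.SourcePack.validated {
		return fuseResult{}, errors.New("source pack must be validated by the caller")
	}
	// The actual weight fusing is elided in this sketch.
	return fuseResult{OutputPath: opts.OutputPath}, nil
}

func main() {
	src, err := validateModelPack("models/base")
	if err != nil {
		fmt.Println("validate:", err)
		return
	}
	res, err := fuseIntoPack(fuseOptions{SourcePack: src, AdapterPath: "adapters/demo", OutputPath: "models/fused"})
	if err != nil {
		fmt.Println("fuse:", err)
		return
	}
	// Re-validate only if a populated output pack is needed.
	out, err := validateModelPack(res.OutputPath)
	fmt.Println(out.Path, err)
}
```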

No production consumers of FuseLoRA* — only tests — so the API change
is safe.

Symbol renames per discipline (drop redundant "LoRA"/"lora" prefix
since pkg name carries it):
  FuseLoRAIntoModelPack    → lora.FuseIntoPack
  FuseLoRAOptions          → lora.FuseOptions
  FuseLoRAResult           → lora.FuseResult (drops Pack field)
  LoRAFuseProvenance       → lora.FuseProvenance
  LoRAFuseProvenanceFile   → lora.FuseProvenanceFile
  prepareLoRAFuse          → prepareFuse (private)
  loraFusePairName         → fusePairName
  loraFuseBaseWeightKey    → fuseBaseWeightKey
  loraFuseAdapterWeightFiles → fuseAdapterWeightFiles
  writeLoRAFuseProvenance  → writeFuseProvenance
  buildLoRAFusePairs       → buildFusePairs
  fuseLoRAModelWeightFiles → fuseModelWeightFiles
  fuseLoRAWeightPairs      → fuseWeightPairs
  loraFusePair             → fusePair
  loraFusePrepared         → fusePrepared
  loRAFuseOutputWeights    → fuseOutputWeights

samePath + copyModelPackMetadata + isModelWeightMetadataCopySkip +
copyModelPackLocalFile move to mlx-root model_merge.go (consumers:
model_merge.go itself + gguf_quantize.go). loraAdapterResultError
drops (lora's own resultError is used instead).

Tests: portable + darwin tests moved into lora/ (need access to
private helpers like fusePairName). Tests use pack.ModelPack{} fixture
in place of mlx.ValidateModelPack (which would create a cycle); output
verification reads files directly rather than via Pack.Valid().

go vet ./... clean. mlx + lora package tests green.

Co-Authored-By: Virgil <virgil@lethean.io>
Move gguf_info.go + gguf_info_test.go + gguf_info_example_test.go
into gguf/ (package gguf). Symbol renames per discipline (drop redundant
GGUF prefix since pkg name carries it):
  GGUFInfo                → gguf.Info
  GGUFTensorInfo          → gguf.TensorInfo
  GGUFValidationSeverity  → gguf.ValidationSeverity
  GGUFValidationIssue     → gguf.ValidationIssue
  GGUFTensorTypeSummary   → gguf.TensorTypeSummary
  GGUFQuantizationInfo    → gguf.QuantizationInfo
  ReadGGUFInfo            → gguf.ReadInfo
  DiscoveredModel + DiscoverModels keep their names (no GGUF prefix).

Export binary-format internals that mlx-root gguf_quantize.go needs:
  ggufTensorTypeQ8_0     → gguf.TensorTypeQ8_0
  ggufTensorTypeQ4_0     → gguf.TensorTypeQ4_0
  ggufValueTypeString    → gguf.ValueTypeString
  ggufValueTypeUint32    → gguf.ValueTypeUint32
  normalizeGGUFQuantType → gguf.NormalizeQuantType

gguf_quantize.go stays at mlx root (it depends on mlx-root safetensor
private types + pack.ModelPack — full lift blocked until safetensor
types lift to a shared package).

Mlx-root keeps private copies of helpers consumed by 8+ mlx-root files
(in hf_fit.go): firstNonEmpty, firstPositive, modelConfigProbe +
methods, readModelConfig, normalizeKnownArchitecture,
architectureFromTransformersName, indexString. Same inline-copy pattern
as profile/architecture.go used. Test helpers (writeTestGGUF,
ggufMetaSpec, ggufTensorSpec, ggufTensorTypeQ4K, etc.) duplicated in
new gguf_test_helpers_test.go at mlx root for cross-test access.

This unblocks gguf-using consumers from importing gguf/ directly.
gguf_quantize.go still at mlx root for now.

go vet ./... clean. mlx + gguf + lora package tests green.

Co-Authored-By: Virgil <virgil@lethean.io>
…nsors/

Move safetensor-prefixed types + funcs from model_merge.go +
safetensor_ref.go + gguf_quantize.go into safetensors/ (package
safetensors). Symbol renames per discipline drop the safetensor prefix
since the package name carries it:

Types:
  safetensorIndex         → safetensors.Index
  safetensorTensorRef     → safetensors.TensorRef
  safetensorTensorReader  → safetensors.TensorReader
  safetensorHeaderEntry   → safetensors.HeaderEntry

Funcs:
  indexSafetensorFiles            → safetensors.IndexFiles
  readSafetensorIndex             → safetensors.ReadIndex
  safetensorRefFromHeader         → safetensors.RefFromHeader
  readSafetensorRefRaw            → safetensors.ReadRefRaw
  readSafetensorRefValues         → safetensors.ReadRefValues
  readSafetensorRefFloat32Chunk   → safetensors.ReadRefFloat32Chunk
  writeSafetensorRefFloat32Chunks → safetensors.WriteRefFloat32Chunks
  openSafetensorTensorReaders     → safetensors.OpenReaders
  openSafetensorTensorReader      → safetensors.OpenReader
  closeSafetensorTensorReaders    → safetensors.CloseReaders
  safetensorDTypeByteSize         → safetensors.DTypeByteSize
  decodeSafetensorFloatData       → safetensors.DecodeFloatData
  float16ToFloat32                → safetensors.Float16ToFloat32

Methods on TensorReader: close → Close, readFloat32Chunk → ReadFloat32Chunk.
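
For readers who have not met the half-precision helper in the list above, the sketch below shows one common way to widen an IEEE 754 binary16 bit pattern to float32. It is a generic illustration written for this note, not the body of safetensors.Float16ToFloat32.

```go
package main

import (
	"fmt"
	"math"
)

// float16ToFloat32 converts an IEEE 754 binary16 bit pattern to float32.
func float16ToFloat32(h uint16) float32 {
	sign := uint32(h>>15) << 31
	exp := uint32(h>>10) & 0x1f
	frac := uint32(h) & 0x3ff

	switch {
	case exp == 0 && frac == 0: // signed zero
		return math.Float32frombits(sign)
	case exp == 0: // subnormal: shift until the implicit bit appears
		e := uint32(127 - 15 + 1)
		for frac&0x400 == 0 {
			frac <<= 1
			e--
		}
		frac &= 0x3ff
		return math.Float32frombits(sign | e<<23 | frac<<13)
	case exp == 0x1f: // infinity or NaN keeps its payload bits
		return math.Float32frombits(sign | 0xff<<23 | frac<<13)
	default: // normal number: rebias the exponent from 15 to 127
		return math.Float32frombits(sign | (exp+127-15)<<23 | frac<<13)
	}
}

func main() {
	for _, h := range []uint16{0x3C00, 0xC000, 0x0001, 0x7C00} {
		fmt.Printf("%#04x -> %g\n", h, float16ToFloat32(h))
	}
}
```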

Stays in model_merge.go: merge-specific helpers (indexModelMergeSources,
validateModelMergeTensorIndexes, writeMergedSafetensors,
readMergeTensorRefs, buildMergedSafetensorsHeader, readMergeTensorValues,
writeLinearMergedTensorChunks, writeSLERPMergedTensorChunks,
slerpChunkedWeights). writeFloat32Values stays as well, with a copy in
safetensors too.

safetensor_ref.go deleted (mlxMaxIntValue + readSafetensorRefRaw now
live inside safetensors package as private maxIntValue + exported
ReadRefRaw).

Consumers updated: model_merge.go, gguf_quantize.go, gguf_quantize_test.go,
minimax_m2.go, model_merge_test.go, kv_snapshot.go.

Net: -2 root flat .go files (safetensor_ref.go deleted, primitives
extracted from model_merge.go + gguf_quantize.go without adding new
root files). Unblocks: gguf_quantize.go could potentially lift to gguf/
next (still needs pack.ModelPack from pack/, but pack imports gguf, so
gguf_quantize would create cycle — needs separate decision).

go vet ./... clean. mlx + gguf + lora + safetensors package tests green.

Co-Authored-By: Virgil <virgil@lethean.io>
Move gguf_quantize.go + gguf_quantize_test.go → gguf/quantize.go +
gguf/quantize_test.go (package gguf). API change matches the lora.FuseIntoPack
pattern: gguf.QuantizeModelPack takes pre-validated pack.ModelPack as
SourcePack instead of a ModelPath string. Callers run mlx.ValidateModelPack
first and call mlx.ValidateModelPack(result.OutputPath) afterwards if they
need a populated output pack.

Symbol renames per discipline (drop redundant GGUF prefix):
  QuantizeModelPackToGGUF → gguf.QuantizeModelPack
  QuantizeGGUFOptions     → gguf.QuantizeOptions
  QuantizeGGUFResult      → gguf.QuantizeResult (drops Pack field)
  GGUFQuantizeFormat      → gguf.QuantizeFormat
  GGUFQuantizeQ8_0/Q4_0/Q4_K_M → gguf.QuantizeQ8_0/Q4_0/Q4_K_M

Move ggufValidationSummary from mlx-root model_pack.go into gguf as
exported gguf.ValidationSummary — model_pack.go now calls it via the
gguf package. Same helper, single home now.

Move samePath + copyModelPackMetadata + isModelWeightMetadataCopySkip
+ copyLocalFile into gguf as private helpers (also keep the model_merge.go
mlx-root copies for non-gguf consumers like model_merge.go itself).

mlx-root tests that depended on lifted private helpers
(denseSafetensor, loadDenseSafetensors, readDenseSafetensors,
decodeDenseSafetensor, writeDenseSafetensorsPack, writeTestSafetensorsF32,
safetensorTestTensor, appendUint16LE, float32ToFloat16) get duplicated
copies in gguf_test_helpers_test.go for the tests that still live at
mlx root (model_merge_test, kv_snapshot_*, api_test).

No production consumers of Quantize* API — only tests — so the API
change is safe. Drop the second ValidateModelPack call (caller's
responsibility); drop Pack field from QuantizeResult.

go vet ./... clean. mlx + gguf + lora + safetensors package tests green.

Co-Authored-By: Virgil <virgil@lethean.io>
Move model_merge.go + model_merge_test.go → merge/merge.go + merge/merge_test.go
(package merge). API change matches the lora.FuseIntoPack + gguf.QuantizeModelPack
pattern: merge.Source carries a pre-validated pack.ModelPack (Pack field)
instead of a Path string. Callers run mlx.ValidateModelPack on each source
before invoking merge.Packs, and re-validate the output via
mlx.ValidateModelPack(result.OutputPath) if they need a populated pack.

Symbol renames per discipline (drop redundant Model/ModelMerge prefix):
  MergeModelPacks            → merge.Packs
  ModelMergeOptions          → merge.Options
  ModelMergeResult           → merge.Result (drops Pack field)
  ModelMergeMethod           → merge.Method
  ModelMergeSource           → merge.Source (Path → Pack)
  ModelMergeProvenance       → merge.Provenance
  ModelMergeProvenanceFile   → merge.ProvenanceFile
  ModelMergeLinear/SLERP/TIES/DARE → merge.MethodLinear/SLERP/TIES/DARE
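
As background for the Linear and SLERP method names above, here is a generic sketch of the two interpolation formulas on plain float64 slices. It is textbook maths written for this note; the function names are hypothetical, and the package's chunked tensor writers are not reproduced.

```go
package main

import (
	"fmt"
	"math"
)

// lerp blends a and b with weights (1-t) and t. Slices must have equal length.
func lerp(a, b []float64, t float64) []float64 {
	out := make([]float64, len(a))
	for i := range a {
		out[i] = (1-t)*a[i] + t*b[i]
	}
	return out
}

// slerp interpolates along the arc between a and b. It falls back to
// lerp when either vector is zero or the two are nearly parallel.
func slerp(a, b []float64, t float64) []float64 {
	if len(a) != len(b) {
		return nil
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return lerp(a, b, t)
	}
	cosine := math.Max(-1, math.Min(1, dot/math.Sqrt(na*nb)))
	omega := math.Acos(cosine)
	sine := math.Sin(omega)
	if sine < 1e-6 {
		return lerp(a, b, t)
	}
	wa := math.Sin((1-t)*omega) / sine
	wb := math.Sin(t*omega) / sine
	out := make([]float64, len(a))
	for i := range a {
		out[i] = wa*a[i] + wb*b[i]
	}
	return out
}

func main() {
	a, b := []float64{1, 0}, []float64{0, 1}
	fmt.Println(lerp(a, b, 0.5))
	fmt.Println(slerp(a, b, 0.5))
}
```

For orthogonal unit vectors the linear midpoint has length below 1 while the spherical midpoint stays on the unit circle, which is the usual motivation for offering both methods.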

Private helpers moved with the source (drop prefixes where redundant):
  prepareModelMerge          → prepare
  ensureEmptyModelMergeDestination → ensureEmptyDestination
  validateModelMergePackCompatibility → validatePackCompatibility
  indexModelMergeSources     → indexSources
  validateModelMergeTensorIndexes → validateTensorIndexes
  readMergeTensorRefs        → readTensorRefs
  buildMergedSafetensorsHeader → buildMergedHeader
  readMergeTensorValues      → readTensorValues
  writeLinearMergedTensorChunks → writeLinearChunks
  writeSLERPMergedTensorChunks  → writeSLERPChunks
  normalizedMergeWeights     → normalizedWeights
  writeModelMergeProvenance  → writeProvenance
  modelMergePrepared         → prepared
  modelMergeResultError      → resultError
  StateBundleFileHash        → hashFile (inlined private copy in merge)
  samePath / copyModelPackMetadata / isModelWeightMetadataCopySkip
  / copyLocalFile / resultError travel with merge as private helpers
  (model_merge.go was their only remaining mlx-root consumer once the
  earlier lift moved gguf_quantize away).

merge/helpers_test.go takes its own copies of denseSafetensor +
loadDenseSafetensors + readDenseSafetensors + decodeDenseSafetensor +
safetensorTestTensor + writeDenseSafetensorsPack + writeTestSafetensorsF32
+ testResultError + writeModelPackFile + modelPackTokenizerJSON +
testPack / testPackArch fixture builders.

Trim mlx-root gguf_test_helpers_test.go: remove safetensors-related
helpers (denseSafetensor, loadDenseSafetensors, etc.) — they no longer
have mlx-root consumers after the merge lift.

mlx-root minimax_m2.go gains its own private copy of sameUint64Slice
(small utility that was only used by minimax_m2 + the lifted merge
code; the merge copy keeps its own).

No production consumers of ModelMerge* API — only tests, so the API
change is safe.

go vet ./... clean. mlx + gguf + lora + safetensors + merge package
tests green.

Co-Authored-By: Virgil <virgil@lethean.io>
Move kv_snapshot.go, kv_snapshot_blocks.go, kv_snapshot_memvid.go,
kv_analysis.go (and their tests + examples) into kv/ (package kv).
kv_snapshot_index.go stays at mlx root — its
KVSnapshotMemvidBundleIndex struct has StateBundleModel +
StateBundleTokenizer fields whose types live at mlx-root and would cycle.

Symbol renames per discipline (drop redundant KV/KVSnapshot prefix):
  KVSnapshot                → kv.Snapshot
  KVLayerSnapshot           → kv.LayerSnapshot
  KVHeadSnapshot            → kv.HeadSnapshot
  KVSnapshotEncoding        → kv.Encoding (+ Native/Q8/Base64/Binary)
  KVSnapshotVersion         → kv.SnapshotVersion
  KVSnapshotSaveOptions     → kv.SaveOptions
  KVSnapshotLoadOptions     → kv.LoadOptions
  KVSnapshotCaptureOptions  → kv.CaptureOptions
  LoadKVSnapshot{,WithOptions} → kv.Load{,WithOptions}
  KVSnapshotBlock           → kv.Block
  KVSnapshotMemvidBlockOptions/Bundle/Ref → kv.MemvidBlock{Options,Bundle,Ref}
  KVSnapshotMemvidBlockBundleKind → kv.MemvidBlockBundleKind
  KVSnapshotMemvidBlockVersion    → kv.MemvidBlockVersion
  AssembleKVSnapshotBlocks → kv.AssembleBlocks
  SaveKVSnapshotMemvidBlockBundle → kv.SaveMemvidBlockBundle
  LoadKVSnapshotFromMemvidBlocks{,WithOptions} → kv.LoadFromMemvidBlocks{,WithOptions}
  LoadKVSnapshotMemvidBlockBundle → kv.LoadMemvidBlockBundle
  LoadKVSnapshotPrefixFromMemvidBlocks{,WithOptions} → kv.LoadPrefixFromMemvidBlocks{,WithOptions}
  KVSnapshotMemvidOptions   → kv.MemvidOptions
  LoadKVSnapshotFromMemvid{,WithOptions} → kv.LoadFromMemvid{,WithOptions}
  KVAnalysis → kv.Analysis, AnalyzeKV → kv.Analyze
  KVFeatures → kv.Features, KVFeatureLabels → kv.FeatureLabels

Helpers moved into the kv package and exported, because mlx-root
callers now reach them across the package boundary:
  hashKVSnapshot → kv.HashSnapshot
  validateKVSnapshotMemvidBlockBundle → kv.ValidateMemvidBlockBundle
  loadKVSnapshotMemvidBlockWithOptions → kv.LoadMemvidBlockWithOptions
  effectiveKVSnapshotTokenOffset → kv.EffectiveTokenOffset
  effectiveKVSnapshotSeqLen → kv.EffectiveSeqLen
  clearKVSnapshotTerminalState → kv.ClearTerminalState
  dropKVSnapshotFloat32 → kv.DropFloat32
  kvSnapshotResultError → kv.ResultError
  Snapshot.sliceBlock (method) → SliceBlock

Inline private copies kept in kv: normalizeSnapshot (was
normalizeBundleSnapshot), requiresNativeEncoding (was
kvSnapshotRequiresNativeEncoding), firstNonEmpty,
defaultCacheBlockSize.

mlx-root NewStateBundle: local variable `kv` renamed to `snap` to
avoid shadowing the imported kv package. state_bundle.go now calls
kv.HashSnapshot / kv.Analyze directly.

NEW mlx-root kv_test_helpers_test.go contains test helpers
(kvSnapshotBlocksTestSnapshot, recordingMemvidStore, failingMemvidWriter)
duplicated for mlx-root tests that no longer have access to kv-package
test internals.

~22 consumer files updated: agent_memory, api_common, api_darwin,
api_stub, api_test, fast_eval{,_test}, hf_fit_test, expert_residency_test,
inference_contract_darwin, kv_snapshot_index{,_test}, kv_cache_bench{,_test},
memory_plan{,_test}, memvid_chapter_smoke{,_test}, session_agent_darwin{,_test},
session_artifact{,_test}, session_darwin{,_test,_example_test},
session_stub_example_test, small_model_smoke, state_bundle{,_test},
workload_bench{,_test}.

go vet ./... clean. mlx + gguf + lora + safetensors + merge + kv tests green.

Co-Authored-By: Virgil <virgil@lethean.io>
eval is driver-neutral (orchestrates evaluation given a Runner adapter),
so it lifts to go-inference/eval/ instead of go-mlx/eval/ — alongside
parser/, quant/jang/, quant/codebook/ which already live there.

Interface redesign for cycle-breaking:
- Sample/Batch/BatchConfig become opaque any
- Dataset is an interface (Next returns any)
- Runner gains BatchTokens callback (replaces sftBatchLossTokens) and
  SampleText callback (replaces direct .Text/.Response reads)
- eval.Info mirrors mlx.ModelInfo fields; eval.AdapterInfo mirrors
  lora.AdapterInfo. mlx-root converts at the boundary via modelInfoToEval,
  evalInfoToModel, loraToEvalAdapter, evalAdapterToLora.
- BuildBatches is now required (replaces optional Tokenizer + auto-build);
  driver wrappers provide BuildBatches that internally use their tokenizer
  + BuildDatasetBatches.
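
The redesign above can be pictured with the sketch below: the orchestrator sees only opaque values and asks driver-supplied callbacks to interpret them. The signatures are guesses for illustration (the real eval.Runner and eval.Dataset are not shown in this PR text), SampleText is left out for brevity, and countTokens and newWhitespaceRunner are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Dataset yields opaque samples; the orchestrator never inspects them.
type Dataset interface {
	Next() (sample any, ok bool)
}

// Runner holds driver-supplied callbacks (assumed shapes).
type Runner struct {
	BuildBatches func(ds Dataset) ([]any, error) // required
	BatchTokens  func(batch any) int             // counts tokens in an opaque batch
}

// countTokens is a driver-neutral orchestration step.
func countTokens(r Runner, ds Dataset) (int, error) {
	if r.BuildBatches == nil || r.BatchTokens == nil {
		return 0, errors.New("eval sketch: BuildBatches and BatchTokens are required")
	}
	batches, err := r.BuildBatches(ds)
	if err != nil {
		return 0, err
	}
	total := 0
	for _, b := range batches {
		total += r.BatchTokens(b)
	}
	return total, nil
}

// Driver side: concrete sample type plus callbacks that know its shape.
type textSample struct{ Text string }

type sliceDataset struct {
	samples []textSample
	pos     int
}

func (d *sliceDataset) Next() (any, bool) {
	if d.pos >= len(d.samples) {
		return nil, false
	}
	s := d.samples[d.pos]
	d.pos++
	return s, true
}

// newWhitespaceRunner uses whitespace splitting as a toy tokenizer.
func newWhitespaceRunner() Runner {
	return Runner{
		BuildBatches: func(ds Dataset) ([]any, error) {
			var batches []any
			for {
				s, ok := ds.Next()
				if !ok {
					return batches, nil
				}
				ts, ok := s.(textSample)
				if !ok {
					return nil, fmt.Errorf("unexpected sample type %T", s)
				}
				batches = append(batches, strings.Fields(ts.Text))
			}
		},
		BatchTokens: func(batch any) int {
			words, _ := batch.([]string)
			return len(words)
		},
	}
}

func main() {
	ds := &sliceDataset{samples: []textSample{{"a b c"}, {"d e"}}}
	n, err := countTokens(newWhitespaceRunner(), ds)
	fmt.Println(n, err)
}
```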

Symbol renames per discipline:
  EvalConfig → eval.Config
  EvalRunner → eval.Runner
  EvalReport → eval.Report (with eval.Info + eval.AdapterInfo)
  EvalMetrics → eval.Metrics
  EvalBatchMetrics → eval.BatchMetrics
  EvalQualityProbe → eval.QualityProbe (Context/Report/Check too)
  RunDatasetEval → eval.RunDataset
  EvalReportVersion → eval.ReportVersion
  RunModelEval, NewModelEvalRunner stay at mlx-root as wrappers/adapters.

Move ResponseCoverageProbe into eval/ as an exported probe constructor —
driver wrappers attach it via RunModelEval so eval doesn't need to know
about SFTSample's field shape.

eval_test.go deleted from mlx-root (its orchestration testing now belongs
in go-inference/eval/). Integration coverage stays in eval_darwin_test.go.

Bumps external/go-inference submodule pin to a18708d (driver-neutral eval
package shipped).

Consumers updated: distill{,_test}.go, workload_bench{,_test}.go,
inference_contract_{darwin,test}.go. distill.go gains a private
distillCollectSamples helper (replaces collectEvalSamples from old eval.go).
workload_bench.go gains normalizeWorkloadEvalConfig (replaces
normalizeEvalConfig).

go vet ./... clean. mlx + gguf + lora + safetensors + merge + kv tests green.

Co-Authored-By: Virgil <virgil@lethean.io>
bench package (go-inference/bench/) is the new driver-neutral local
benchmark/eval harness. Drivers supply a Runner with verb-shaped
callbacks (BenchPromptCache, BenchMemvidKVBlockWarm, BenchKVRestore,
BenchStateBundle, BenchProbeOverhead, BenchSpeculativeDecode,
BenchPromptLookupDecode). bench.Run orchestrates generation timing +
dispatches each enabled callback + assembles the Report.
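
As a rough sketch of the dispatch idea only (the real bench.Runner, bench.Run and Report shapes are not shown in this message, so every signature and field below is hypothetical), a runner with optional verb callbacks and a config that enables sections might look like this:

```go
package main

import (
	"fmt"
	"time"
)

// Section records the outcome of one benchmark section.
type Section struct {
	Name     string
	Duration time.Duration
	Err      error
}

// Report is the assembled result of a run.
type Report struct{ Sections []Section }

// Runner holds driver-supplied callbacks. A nil callback means the driver
// does not implement that section. Only three of the verbs named in the
// commit are modelled here, with made-up signatures.
type Runner struct {
	BenchPromptCache func() error
	BenchKVRestore   func() error
	BenchStateBundle func() error
}

// Config says which sections the caller wants (hypothetical fields).
type Config struct {
	PromptCache bool
	KVRestore   bool
	StateBundle bool
}

// Run calls every section that is both enabled and implemented, timing each.
func Run(r Runner, cfg Config) Report {
	steps := []struct {
		name    string
		enabled bool
		fn      func() error
	}{
		{"prompt_cache", cfg.PromptCache, r.BenchPromptCache},
		{"kv_restore", cfg.KVRestore, r.BenchKVRestore},
		{"state_bundle", cfg.StateBundle, r.BenchStateBundle},
	}
	var rep Report
	for _, s := range steps {
		if !s.enabled || s.fn == nil {
			continue // disabled by config, or unsupported by this driver
		}
		start := time.Now()
		err := s.fn()
		rep.Sections = append(rep.Sections, Section{Name: s.name, Duration: time.Since(start), Err: err})
	}
	return rep
}

func main() {
	r := Runner{
		BenchPromptCache: func() error { return nil },
		BenchStateBundle: func() error { return nil },
	}
	rep := Run(r, Config{PromptCache: true, KVRestore: true})
	for _, s := range rep.Sections {
		fmt.Println(s.Name, s.Err)
	}
}
```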

mlx-root: fast_eval.go shrinks to type aliases + boundary converters
(FastEval* → bench.* via type aliases; modelInfoToBench /
benchInfoToModel / fromMlxMetrics / toBenchGenerateOptions /
loraToBenchAdapter / benchAdapterToLora helpers).

NEW fast_eval_runner.go contains the Model→bench.Runner adapter — each
Bench* callback implements its driver-specific section against the
Model API (kv snapshots, state bundles, memvid block warming, decode
optimisation via RunSpeculativeDecode / RunPromptLookupDecode).

memvid_chapter_smoke decouples from the bench.Runner — its callbacks
(CaptureKVBlocksToMemvid, GenerateWithMemvidPrefix) deal with
mlx-specific kv types, so it has its own MemvidKVChapterRunner at
mlx-root (no longer wedged into the verb-callback shape).

inference_contract_darwin.go converts at the bench boundary
(benchInfoToModel / benchAdapterToLora) before calling
toInferenceModelIdentity / toInferenceRootAdapterIdentity.

workload_bench.go: drops normalizeFastEvalConfig (bench.Run normalises
internally); ModelInfo conversion via benchInfoToModel.

Test coverage delta: fast_eval_test.go (801 lines), fast_eval_example_test.go
(26 lines), workload_bench_test.go (525 lines) deleted — their callback
mock setups exercise the OLD raw-callback Runner shape; equivalent
coverage for the verb-callback shape should be added to
go-inference/bench/ tests in a separate pass. memvid_chapter_smoke_test
(integration tests for the chapter runner) is rewritten to use
MemvidKVChapterRunner + ChapterGeneration. inference_contract_test gains
a modelInfoToBench wrap at the boundary.

Bumps external/go-inference to include the bench package.

go vet ./... clean. mlx + gguf + lora + safetensors + merge + kv tests
green.

Co-Authored-By: Virgil <virgil@lethean.io>
Picks up the bench package unit tests (test(bench): unit tests for
driver-neutral Run orchestration). Coverage rebuilt for the verb-callback
Runner shape after deleting fast_eval_test.go + fast_eval_example_test.go
+ workload_bench_test.go in Phase 2M.

Co-Authored-By: Virgil <virgil@lethean.io>
Phase 2N — the speculative + prompt-lookup decode algorithm is driver-
neutral (accept/reject over token streams, generation delegated to
caller callbacks), so it lifts to go-inference/decode/ alongside bench
and eval.

decode_optimisation.go is rewritten as a thin shim with legacy type
aliases (DecodeOptimisationResult, DecodeOptimisationMetrics) and
boundary converters (mlxDecodeGenToDecode, mlxTokensToDecode,
decodeTokensToMlx). DecodeGenerateFunc keeps the mlx-shaped signature
so existing callbacks continue to compile; RunSpeculativeDecode/
RunPromptLookupDecode wrap them to decode.GenerateFunc internally.
decodeTokensText survives as a thin wrapper for memvid_chapter_smoke.
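
To illustrate what "accept/reject over token streams" can mean, here is a minimal sketch of prompt-lookup drafting with greedy exact-match acceptance. It is not the go-inference/decode implementation: `LookupDraft`, `AcceptedPrefix`, the n-gram search strategy, and the exact-match acceptance rule are all assumptions chosen for the example.

```go
package main

import "fmt"

// equalTokens reports whether two token slices are identical.
func equalTokens(a, b []int32) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// LookupDraft proposes up to max draft tokens by finding the most recent
// earlier occurrence of the trailing n-gram and copying what followed it.
func LookupDraft(tokens []int32, ngram, max int) []int32 {
	if ngram <= 0 || max <= 0 || len(tokens) <= ngram {
		return nil
	}
	tail := tokens[len(tokens)-ngram:]
	for start := len(tokens) - ngram - 1; start >= 0; start-- {
		if equalTokens(tokens[start:start+ngram], tail) {
			from := start + ngram
			to := from + max
			if to > len(tokens) {
				to = len(tokens)
			}
			return append([]int32(nil), tokens[from:to]...)
		}
	}
	return nil
}

// AcceptedPrefix counts how many leading draft tokens agree with the tokens
// the verifying model produced; everything after the first mismatch is rejected.
func AcceptedPrefix(draft, verified []int32) int {
	n := 0
	for n < len(draft) && n < len(verified) && draft[n] == verified[n] {
		n++
	}
	return n
}

func main() {
	history := []int32{1, 2, 3, 4, 1, 2}
	draft := LookupDraft(history, 2, 3)
	verified := []int32{3, 4, 9} // made-up verifier output for the demo
	fmt.Println(draft, AcceptedPrefix(draft, verified))
}
```

Neither function generates tokens itself, which is consistent with the commit's point that generation is delegated to caller callbacks.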

Submodule pin bumped to go-inference 521dd53 (feat(decode):
driver-neutral speculative + prompt-lookup decode harness).

Coverage rebuilt:

  - decode_optimisation_test.go now covers the boundary converters,
    nil-callback handling, token round-trip, and legacy-alias surface
  - decode_optimisation_example_test.go for AX example registration
  - fast_eval_test.go BACKFILLS the Phase 2M orphan: covers alias
    routing, DefaultFastEvalConfig forwarding, RunFastEval bench
    smoke against a synthetic Runner, toBenchGenerateOptions clone +
    probe-sink passthrough, fromMlxMetrics field copy,
    modelInfoToBench round-trip with adapter clone, fastEvalResultError
  - fast_eval_example_test.go matches AX pattern

go vet ./... clean. Tests: mlx + kv + lora + merge + gguf + pack all
green. Pre-existing internal/metal failure
(TestGenerate_Model_StagedMiniMaxReturnsDecodeError_Bad nil-tokenizer
panic) is unrelated: it fails identically on pristine HEAD.

Co-Authored-By: Virgil <virgil@lethean.io>
Phase 2O — state bundle is deeply mlx-coupled (kv.Snapshot,
lora.AdapterInfo, SAMI), so it lifts to go-mlx/bundle/ as a sibling
package rather than to go-inference. SAMI types travel with bundle
since Bundle.SAMI holds *SAMIResult.

Symbols rename per the folder-taxonomy rule (drop prefixes the package
carries):

  StateBundle                   → bundle.Bundle
  StateBundleOptions            → bundle.Options
  StateBundleModel              → bundle.Model
  StateBundlePrompt             → bundle.Prompt
  StateBundleTokenizer          → bundle.Tokenizer
  StateBundleRuntime            → bundle.Runtime
  StateBundleAdapter            → bundle.Adapter
  StateBundleSampler            → bundle.Sampler
  StateBundleRef                → bundle.Ref
  StateBundleVersion            → bundle.Version
  StateBundleKind               → bundle.Kind
  StateBundleRefMemvid          → bundle.RefMemvid
  NewStateBundle                → bundle.New
  LoadStateBundle               → bundle.Load
  CheckStateBundleCompatibility → bundle.CheckCompatibility
  StateBundleFileHash           → bundle.FileHash
  SAMIResult                    → bundle.SAMIResult (kept name — separate concept)
  SAMIOptions                   → bundle.SAMIOptions
  SAMIFromKV                    → bundle.SAMIFromKV

mlx-root state_bundle.go becomes a thin shim with type aliases for the
77 caller sites + boundary converters for mlx.ModelInfo →
bundle.ModelInfo and mlx.GenerateConfig → bundle.Sampler. mlx-root keeps
StateBundleOptions as its own struct (carrying mlx-shaped ModelInfo +
GenerateConfig + *SAMIResult) so existing callers compile unchanged.

session_artifact.go's SAMIResult / SAMIOptions become aliases to
bundle.SAMIResult / bundle.SAMIOptions; SAMIFromKV becomes a thin
wrapper. The math helpers (clampUnit, clampRange, meanUnit, layerMetric)
move to bundle/sami.go with the SAMI types.

stateBundleTokenizer + stateHash + stateMemvidURI retained as
private mlx-root wrappers (bundle.NormaliseTokenizer + bundle.HashString
+ bundle.MemvidURI) for callers session_agent_darwin.go +
kv_snapshot_index.go that referenced the old in-package names.

stateBundleTestSnapshot test helper moved to kv_test_helpers_test.go
so lora_adapter*_test.go + session_darwin_test.go continue to compile.

Coverage:
  - bundle/bundle_test.go covers Save/Load, memvid snapshot round-trip,
    frame-zero allowance, defensive cloning, Validate + CheckCompatibility
    happy + sad paths, AdapterFromInfo round-trip, NormaliseTokenizer,
    AdapterEmpty, HashString, FileHash, MemvidURI, SAMIFromKV
  - bundle/example_test.go for AX example registration
  - state_bundle_test.go covers the shim: alias identity,
    modelInfoToBundle, stateSamplerFromGenerateConfig clone,
    CheckStateBundleCompatibility, FileHash, Load round-trip,
    SnapshotFromMemvid via shim route, the private cross-file helpers

go vet ./... clean. Tests: mlx + bundle + kv + lora + merge + gguf +
pack all green. Pre-existing internal/metal panic remains unrelated.

Co-Authored-By: Virgil <virgil@lethean.io>
Phase 2P — probe is the go-mlx event-vocabulary for inference + training
observability. It lifts to go-mlx/probe/ rather than go-inference
because the event shape is mlx-rich: ProbeExpertResidency carries MoE
paging events that the driver-neutral inference.ProbeEvent contract
(at dappco.re/go/inference root) doesn't expose. The two probe
vocabularies remain intentionally separate — inference owns the
backend contract, go-mlx/probe/ owns the rich driver event vocabulary.

Symbols rename per the folder-taxonomy rule (drop prefixes the package
carries):

  ProbeEvent           → probe.Event
  ProbeEventKind       → probe.Kind
  ProbePhase           → probe.Phase
  ProbeToken           → probe.Token
  ProbeLogit           → probe.Logit
  ProbeLogits          → probe.Logits
  ProbeEntropy         → probe.Entropy
  ProbeHeadSelection   → probe.HeadSelection
  ProbeLayerCoherence  → probe.LayerCoherence
  ProbeRouterDecision  → probe.RouterDecision
  ProbeExpertResidency → probe.ExpertResidency
  ProbeResidualSummary → probe.ResidualSummary
  ProbeCachePressure   → probe.CachePressure
  ProbeMemoryPressure  → probe.MemoryPressure
  ProbeTraining        → probe.Training
  ProbeSink            → probe.Sink
  ProbeSinkFunc        → probe.SinkFunc
  ProbeBus             → probe.Bus
  ProbeRecorder        → probe.Recorder
  NewProbeBus          → probe.NewBus
  NewProbeRecorder     → probe.NewRecorder
  cloneProbeEvent      → probe.CloneEvent (exported)

ExpertResidencyAction + its four constants move from
expert_residency.go to probe so probe.ExpertResidency.Action stays a
typed enum; mlx-root expert_residency.go gets a type alias plus const
re-declarations.

mlx-root probe.go shrinks from 337 to ~80 LOC: type aliases for 19
types + 14 constants, plus the mlx-specific GenerateOption helpers
(WithProbeSink, WithProbeCallback) that stay because they touch
mlx.GenerateConfig. NewProbeBus/NewProbeRecorder become one-line
forwarders.

All ~203 caller references across 20+ files compile unchanged thanks
to the alias surface.
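
The alias-shim pattern used throughout these lifts relies on Go type aliases (`type A = B`), under which the old and new names denote the same type. A single-file sketch, with both "packages" flattened into one file for runnability; the field names and the `ProbeEventToken` constant are hypothetical, not copied from the repository:

```go
package main

import "fmt"

// --- stands in for the lifted package (probe/) ---

type Kind string

const KindToken Kind = "token"

type Event struct {
	Kind Kind
	Meta map[string]string
}

type Recorder struct{ events []Event }

func NewRecorder() *Recorder        { return &Recorder{} }
func (r *Recorder) Emit(e Event)    { r.events = append(r.events, e) }
func (r *Recorder) Len() int        { return len(r.events) }

// --- stands in for the root-package shim ---

// Aliases: the legacy names are the same types, not new ones.
type (
	ProbeEvent     = Event
	ProbeEventKind = Kind
	ProbeRecorder  = Recorder
)

// Constant re-declaration keeps the legacy constant name available.
const ProbeEventToken = KindToken

// One-line forwarder for the legacy constructor.
func NewProbeRecorder() *ProbeRecorder { return NewRecorder() }

// legacyCaller uses only the old names and compiles unchanged.
func legacyCaller() int {
	rec := NewProbeRecorder()
	rec.Emit(ProbeEvent{Kind: ProbeEventToken})
	return rec.Len()
}

func main() {
	var e Event = ProbeEvent{Kind: ProbeEventToken} // no conversion needed
	fmt.Println(e.Kind, legacyCaller())
}
```

Because an alias introduces no new type, values flow between old and new names without conversions, and methods defined on the lifted type remain available under the legacy name.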

Coverage:
  - probe/probe_test.go covers Recorder defensive-copy semantics, Bus
    fanout + concurrent safety + nil-receiver guards, SinkFunc nil
    handling, CloneEvent deep-copy across every payload pointer plus
    Meta map, ExpertResidencyAction + Kind + Phase constant values
  - probe/example_test.go for AX example registration
  - probe_test.go (mlx-root) covers alias identity, constant
    preservation, ExpertResidencyAction alias identity, NewProbeBus +
    NewProbeRecorder wiring, WithProbeSink / WithProbeCallback installing
    on GenerateConfig (including the nil-callback no-op)
  - probe_example_test.go matches AX pattern
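
The properties listed in the coverage notes (defensive copies, fanout, nil-receiver guards, deep copy of the Meta map) can be illustrated with a small sketch. The types below are simplified stand-ins, assuming a string `Kind` and a `map[string]string` Meta; the real probe.Event carries more payload pointers than this.

```go
package main

import (
	"fmt"
	"sync"
)

type Event struct {
	Kind string
	Meta map[string]string
}

// CloneEvent deep-copies the Meta map so callers cannot mutate stored events.
func CloneEvent(e Event) Event {
	out := e
	if e.Meta != nil {
		out.Meta = make(map[string]string, len(e.Meta))
		for k, v := range e.Meta {
			out.Meta[k] = v
		}
	}
	return out
}

type Sink interface{ Emit(Event) }

// Recorder stores clones on the way in and hands out clones on the way out.
type Recorder struct {
	mu     sync.Mutex
	events []Event
}

func (r *Recorder) Emit(e Event) {
	if r == nil {
		return // nil-receiver guard
	}
	r.mu.Lock()
	r.events = append(r.events, CloneEvent(e))
	r.mu.Unlock()
}

func (r *Recorder) Events() []Event {
	if r == nil {
		return nil
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make([]Event, len(r.events))
	for i, e := range r.events {
		out[i] = CloneEvent(e)
	}
	return out
}

// Bus fans each event out to every registered sink.
type Bus struct {
	mu    sync.RWMutex
	sinks []Sink
}

func (b *Bus) Add(s Sink) {
	if b == nil || s == nil {
		return
	}
	b.mu.Lock()
	b.sinks = append(b.sinks, s)
	b.mu.Unlock()
}

func (b *Bus) Emit(e Event) {
	if b == nil {
		return
	}
	b.mu.RLock()
	sinks := append([]Sink(nil), b.sinks...) // snapshot so sinks run unlocked
	b.mu.RUnlock()
	for _, s := range sinks {
		s.Emit(e)
	}
}

func main() {
	rec := &Recorder{}
	bus := &Bus{}
	bus.Add(rec)
	ev := Event{Kind: "token", Meta: map[string]string{"k": "v"}}
	bus.Emit(ev)
	ev.Meta["k"] = "mutated" // does not affect the recorded copy
	fmt.Println(rec.Events()[0].Meta["k"])
}
```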

go vet ./... clean. Tests: mlx + probe + bundle + kv + lora + merge +
gguf + pack all green. Pre-existing internal/metal panic unrelated.

Co-Authored-By: Virgil <virgil@lethean.io>
Phase 2Q — scheduler.go is fully driver-neutral (only inference.TextModel
deps, no kv/lora/probe-mlx), so it lifts to go-inference/scheduler/
alongside bench, decode, and eval.

Symbols rename per the folder-taxonomy rule:

  ScheduledModel    → scheduler.Model
  SchedulerConfig   → scheduler.Config
  NewScheduledModel → scheduler.New

mlx-root scheduler.go shrinks from 400 to ~25 LOC: type aliases for
ScheduledModel + SchedulerConfig + one-line NewScheduledModel forwarder.
register_metal.go's `scheduler *ScheduledModel` field +
register_metal_scheduler.go's wrappers compile unchanged through the
aliases.

Submodule pin bumped to go-inference 254b391
(feat(scheduler): driver-neutral request scheduler).

Coverage:
  - go-inference/go/scheduler/scheduler_test.go ports the canonical
    suite (queue + latency probe, full-queue rejection, cancellation,
    Generate/Chat/Classify/BatchGenerate delegation, nil + cancelled-
    context paths, fallback cancel via inference.CancellableModel, Err
    propagation, generateOptions sampler conversion, cloneLabels +
    millis helpers)
  - go-inference/go/scheduler/example_test.go for AX coverage
  - scheduler_test.go (mlx-root) covers alias identity +
    NewScheduledModel forward + nil-base defensive wrapper
  - scheduler_example_test.go matches AX pattern
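
Two of the behaviours named in the coverage list, full-queue rejection and cancelled-context handling, can be sketched with a buffered channel used as a slot counter. This is an assumption-laden illustration, not scheduler.Model: `Scheduler`, `Do`, `ErrQueueFull` and `outcomes` are hypothetical names, and the sketch does admission control only, with no waiting queue or latency probe.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

var ErrQueueFull = errors.New("scheduler: queue full")

// Scheduler admits at most cap(slots) requests at once.
type Scheduler struct{ slots chan struct{} }

func New(capacity int) *Scheduler {
	if capacity < 1 {
		capacity = 1
	}
	return &Scheduler{slots: make(chan struct{}, capacity)}
}

// Do runs fn if a slot is free. It fails fast on a cancelled context and
// rejects immediately, rather than blocking, when every slot is taken.
func (s *Scheduler) Do(ctx context.Context, fn func(context.Context) (string, error)) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err
	}
	select {
	case s.slots <- struct{}{}:
		defer func() { <-s.slots }() // release the slot when fn returns
	default:
		return "", ErrQueueFull
	}
	return fn(ctx)
}

// outcomes runs three scenarios and reports each result as a string.
func outcomes() []string {
	echo := func(context.Context) (string, error) { return "ok", nil }
	describe := func(out string, err error) string {
		if err != nil {
			return err.Error()
		}
		return out
	}
	s := New(1)
	var results []string
	// 1. Free slot: the request runs.
	results = append(results, describe(s.Do(context.Background(), echo)))
	// 2. Nested request while the only slot is held: rejected.
	results = append(results, describe(s.Do(context.Background(),
		func(ctx context.Context) (string, error) { return s.Do(ctx, echo) })))
	// 3. Already-cancelled context: never admitted.
	cancelled, cancel := context.WithCancel(context.Background())
	cancel()
	results = append(results, describe(s.Do(cancelled, echo)))
	return results
}

func main() {
	for _, r := range outcomes() {
		fmt.Println(r)
	}
}
```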

go vet ./... clean. Tests: mlx + probe + bundle + kv + lora + merge +
gguf + pack all green. Pre-existing internal/metal panic unrelated.

Co-Authored-By: Virgil <virgil@lethean.io>
Phase 2R — memory_plan is the local-inference memory planner that maps
measured Apple-silicon hardware + model metadata to a runtime policy.
The generic core (memory class detection, base class plans, KV cache
estimation, architecture hints, generic MoE residency) lifts to
go-mlx/memory/. The MiniMax-M2-specific overrides (tensor-plan
expert-residency + first-layer skeleton bytes) stay at mlx-root,
layered on top of the generic plan.
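
As a rough illustration of memory-class detection, the sketch below maps a measured byte count to the largest class whose nominal size it reaches. The thresholds, the fallback for zero, and the constant names are assumptions; the commit only lists the 16/24/32/64/96/128GB classes and mentions seven class constants, so the seventh (`ClassUnknown`) is a guess.

```go
package main

import "fmt"

type Class string

const (
	ClassUnknown Class = "unknown"
	Class16GB    Class = "16gb"
	Class24GB    Class = "24gb"
	Class32GB    Class = "32gb"
	Class64GB    Class = "64gb"
	Class96GB    Class = "96gb"
	Class128GB   Class = "128gb"
)

const GiB uint64 = 1 << 30

// ClassForBytes picks the largest class whose nominal size fits in b.
// Zero maps to ClassUnknown so a caller can apply a fallback plan;
// anything below 16 GiB is treated as the smallest class (assumed policy).
func ClassForBytes(b uint64) Class {
	if b == 0 {
		return ClassUnknown
	}
	tiers := []struct {
		min   uint64
		class Class
	}{
		{128 * GiB, Class128GB},
		{96 * GiB, Class96GB},
		{64 * GiB, Class64GB},
		{32 * GiB, Class32GB},
		{24 * GiB, Class24GB},
	}
	for _, t := range tiers {
		if b >= t.min {
			return t.class
		}
	}
	return Class16GB
}

func main() {
	for _, b := range []uint64{0, 16 * GiB, 24*GiB - 1, 36 * GiB, 192 * GiB} {
		fmt.Println(b, ClassForBytes(b))
	}
}
```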

Symbols rename per the folder-taxonomy rule (drop prefixes the package
carries):

  MemoryPlan             → memory.Plan
  MemoryPlanInput        → memory.Input (only used internally now —
                            mlx-root keeps its own MemoryPlanInput with
                            mlx-shaped DeviceInfo + ModelInfo)
  PlanMemory             → memory.NewPlan
  MemoryClass            → memory.Class
  MemoryClass*           → memory.Class*  (7 constants)
  MemoryGiB              → memory.GiB
  KVCachePolicy          → memory.KVCachePolicy (kept name; package
                            doesn't repeat the prefix)
  KVCacheMode            → memory.KVCacheMode
  ExpertResidencyPlan    → memory.ExpertResidencyPlan
  ExpertResidencyMode    → memory.ExpertResidencyMode
  ExpertResidencyMode*   → memory.ExpertResidencyMode*  (3 constants)
  ExpertEvictionPolicy   → memory.ExpertEvictionPolicy
  ExpertEvictionLRU      → memory.ExpertEvictionLRU

mlx-root memory_plan.go shrinks from 529 to ~165 LOC:
  - Type aliases for MemoryPlan + MemoryClass + KVCachePolicy +
    KVCacheMode + 19 constants + MemoryGiB
  - mlx.MemoryPlanInput stays its own struct (carries mlx.DeviceInfo +
    *mlx.ModelInfo so existing callers compile unchanged)
  - PlanMemory wrapper: converts to memory.Input, calls memory.NewPlan,
    layers MiniMaxM2LayerForwardSkeleton bytes + MiniMaxM2TensorPlan
    expert residency on top
  - applyMemoryPlanToLoadConfig stays here (uses mlx.LoadConfig)
  - minPositive retained as a private helper for expert_residency.go

expert_residency.go's ExpertResidencyPlan + Mode + EvictionPolicy
become aliases to memory.* types. The runtime manager + Stats + Context
types stay at mlx-root.

memory package is self-contained: imports only inference/quant/jang,
mlx/pack, mlx/profile. normalizeKnownArchitecture + trim/lower/replace
ASCII helpers duplicated locally to avoid importing mlx-root.

Coverage:
  - memory/memory_test.go covers the generic core: 16/24/32/64/96/128GB
    class plans, context capped by pack metadata, Qwen3-MoE hints,
    MiniMax architecture caps, BERT embedding disables generation
    cache, fallback on zero memory, model metadata caps context,
    Q8 KV cache for middle classes, generic MoE residency,
    ClassForBytes boundaries, minPositive, percentBytes,
    normalizeKnownArchitecture aliases (15 tests)
  - memory/example_test.go for AX coverage
  - memory_plan_test.go at mlx-root unchanged — all 11 existing tests
    pass through the shim, exercising the integrated path including
    MiniMaxM2 skeleton + tensor-plan residency

go vet ./... clean. Tests: mlx + memory + probe + bundle + kv + lora +
merge + gguf + pack all green. Pre-existing internal/metal panic
unrelated.

Co-Authored-By: Virgil <virgil@lethean.io>
Phase 2S — mega-lift matching the model/{arch}/{name}/ folder taxonomy
called out in feedback_driver_lift_discipline.md. Moves four mlx-root
source files (minimax_m2.go 1016 LOC + minimax_m2_native_darwin.go 167
+ minimax_m2_native_stub.go 32 + expert_residency.go 476) plus three
test files (minimax_m2_test.go 643 + minimax_m2_darwin_test.go 441 +
expert_residency_test.go 159) to go-mlx/model/minimax/m2/ as a single
self-contained package.

Symbol renames per the folder-taxonomy rule (drop prefixes the package
carries — m2 carries "MiniMaxM2"):

  MiniMaxM2Config                       → m2.Config
  MiniMaxM2TensorRole                   → m2.TensorRole
  MiniMaxM2TensorRole* (9 constants)    → m2.TensorRole* (9 constants)
  MiniMaxM2TensorSpec                   → m2.TensorSpec
  MiniMaxM2TensorPlan                   → m2.TensorPlan
  MiniMaxM2RouterDecision               → m2.RouterDecision
  MiniMaxM2ExpertFunc                   → m2.ExpertFunc
  MiniMaxM2PackedExpertWeights          → m2.PackedExpertWeights
  MiniMaxM2RouterWeights                → m2.RouterWeights
  MiniMaxM2PackedLayerForwardOptions    → m2.PackedLayerForwardOptions
  MiniMaxM2PackedLayerForwardResult     → m2.PackedLayerForwardResult
  MiniMaxM2LazyExpertLoad               → m2.LazyExpertLoad
  MiniMaxM2DenseProjectionTensor        → m2.DenseProjectionTensor
  MiniMaxM2DenseExpertWeights           → m2.DenseExpertWeights
  MiniMaxM2ResolvedTensor               → m2.ResolvedTensor
  MiniMaxM2LayerForwardSkeleton         → m2.LayerForwardSkeleton
  ParseMiniMaxM2Config                  → m2.ParseConfig
  BuildMiniMaxM2TensorPlan              → m2.BuildTensorPlan
  RouteMiniMaxM2Tokens                  → m2.RouteTokens
  DispatchMiniMaxM2Experts              → m2.DispatchExperts
  LoadMiniMaxM2PackedExpertsForDecisionsFromSafetensors
                                        → m2.LoadPackedExpertsForDecisions
  LoadMiniMaxM2LazyExpertsForHiddenFromSafetensors
                                        → m2.LoadLazyExpertsForHidden
  LoadMiniMaxM2PackedExpertsFromSafetensors → m2.LoadPackedExperts
  LoadMiniMaxM2RouterFromSafetensors    → m2.LoadRouter
  ProjectMiniMaxM2RouterScores          → m2.ProjectRouterScores
  BuildMiniMaxM2LayerForwardSkeletonFromSafetensors
                                        → m2.BuildLayerForwardSkeleton
  MiniMaxM2RouterProbeEvents            → m2.RouterProbeEvents
  MiniMaxM2ExpertResidencyLoader        → m2.ResidencyLoader
  MiniMaxM2ExpertResidencyConfig        → m2.ResidencyConfig
  MiniMaxM2ExpertResidencyManager       → m2.ResidencyManager
  NewMiniMaxM2ExpertResidencyManager    → m2.NewResidencyManager
  PlanMiniMaxM2ExpertResidency          → m2.PlanResidency
  DispatchMiniMaxM2PackedExpertsMetal   → m2.DispatchPackedExpertsMetal
  DispatchMiniMaxM2PackedExpertsFromSafetensorsMetal
                                        → m2.DispatchPackedExpertsFromSafetensorsMetal
  ForwardMiniMaxM2LazyExpertLoadMetal   → m2.ForwardLazyExpertLoadMetal
  ForwardMiniMaxM2PackedLayerMetal      → m2.ForwardPackedLayerMetal
  ForwardMiniMaxM2PackedLayerFromSafetensorsMetal
                                        → m2.ForwardPackedLayerFromSafetensorsMetal
  normaliseExpertResidencyPlan          → m2.NormalisePlan
  JANGPackedProjectionTensor            → m2.JANGPackedProjectionTensor

Private helpers all lose the miniMaxM2 prefix (decisionExpertIDs,
uniqueExpertIDs, packedDType, etc.).

ExpertResidencyStats moves to memory.ExpertResidencyStats (it's the
companion measurement type for memory.ExpertResidencyPlan that was
already there).

mlx-root shim files (minimax_m2.go, minimax_m2_native_darwin.go,
minimax_m2_native_stub.go, expert_residency.go) preserve all 66 caller
references via type aliases + wrapper functions. memory_plan.go's
PlanMemory MiniMaxM2-specific overrides still compile through the
aliases. model_pack.go's ParseMiniMaxM2Config /
BuildMiniMaxM2TensorPlan / BuildMiniMaxM2LayerForwardSkeletonFromSafetensors
calls route through wrappers. workload_bench.go's ExpertResidencyStats
+ normaliseExpertResidencyPlan route through aliases.

m2 package is self-contained: imports core, jang, mlx/memory, mlx/probe,
mlx/profile, mlx/safetensors, mlx/quant/jang only — no upward mlx-root
import (which would cycle). Private helpers (firstNonEmpty,
normalizeKnownArchitecture, nonZeroDuration, maxPositive, minPositive,
firstPositive) duplicated locally in helpers.go.

Test fixtures (miniMaxM2FixtureConfig + findMiniMaxM2Spec +
writeMiniMaxM2RawSafetensors + miniMaxM2SkeletonRawTensors +
miniMaxM2F32RawTensor + miniMaxM2RawSafetensor) duplicated at mlx-root
in minimax_m2_test_helpers_test.go so jang_darwin_test.go and
model_pack_test.go still build. Go test packages cannot import each
other's internal _test.go helpers, hence the duplication.

internal/metal/metal.go's defaultMetallibPath search expanded by two
more parent-dir candidates so tests running from
model/minimax/m2/ (5 directories deep) can still discover
dist/lib/mlx.metallib.
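
The parent-directory search can be pictured as a bounded walk toward the filesystem root. The sketch below is a loop-based variant written for illustration; `findUp`, its `maxParents` bound, and the `demo` fixture are hypothetical, and the real defaultMetallibPath may simply enumerate a fixed candidate list.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// findUp looks for rel under start and then under each parent directory,
// climbing at most maxParents levels. It returns the first regular file found.
func findUp(start, rel string, maxParents int) (string, bool) {
	dir := start
	for i := 0; i <= maxParents; i++ {
		candidate := filepath.Join(dir, rel)
		if info, err := os.Stat(candidate); err == nil && !info.IsDir() {
			return candidate, true
		}
		parent := filepath.Dir(dir)
		if parent == dir {
			break // reached the filesystem root
		}
		dir = parent
	}
	return "", false
}

// demo builds a throwaway tree with a stub metallib three levels above a
// nested test directory, then searches with a sufficient and an
// insufficient parent budget.
func demo() (found, foundShallow bool, err error) {
	root, err := os.MkdirTemp("", "findup")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(root)

	rel := filepath.Join("dist", "lib", "mlx.metallib")
	nested := filepath.Join(root, "model", "minimax", "m2")
	if err := os.MkdirAll(nested, 0o755); err != nil {
		return false, false, err
	}
	if err := os.MkdirAll(filepath.Join(root, "dist", "lib"), 0o755); err != nil {
		return false, false, err
	}
	if err := os.WriteFile(filepath.Join(root, rel), []byte("stub"), 0o644); err != nil {
		return false, false, err
	}

	_, found = findUp(nested, rel, 3)        // climbs m2 -> minimax -> model -> root
	_, foundShallow = findUp(nested, rel, 2) // stops at model, one level short
	return found, foundShallow, nil
}

func main() {
	fmt.Println(demo())
}
```

The second search shows why moving tests into a deeper package directory requires extending the search depth.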

go vet ./... clean. Tests: mlx + m2 + memory + probe + bundle + kv +
lora + merge + gguf + pack + ide-side packages all green. Pre-existing
internal/metal TestGenerate_Model_StagedMiniMaxReturnsDecodeError_Bad
nil-tokenizer panic still unrelated.

Co-Authored-By: Virgil <virgil@lethean.io>
Phase 2T — hf_fit.go (1019 LOC) hosts the HuggingFace metadata source
+ local-fit planner. The public HF* symbols have ZERO callers in
production code (only test references), so the lift is mostly a shape
change. Lifts to go-mlx/hf/ with symbol renames per the folder-taxonomy
rule:

  HFModelSource                → hf.ModelSource
  HuggingFaceModelSourceConfig → hf.RemoteConfig
  HuggingFaceModelSource       → hf.RemoteSource
  NewHuggingFaceModelSource    → hf.NewRemoteSource
  HFModelFitConfig             → hf.FitConfig
  HFModelMetadata              → hf.ModelMetadata
  HFModelFile                  → hf.ModelFile
  HFModelConfig                → hf.ModelConfig
  HFQuantizationConfig         → hf.QuantizationConfig
  HFModelFitReport             → hf.FitReport
  HFModelFitPlan               → hf.FitPlan
  HFTrainingFit                → hf.TrainingFit
  PlanHFModelFits              → hf.PlanFits
  InferJANGFromHF              → hf.InferJANG
  HFModelSourceRemote/Local    → hf.SourceRemote/Local

Plus all the private helpers (collectFitEntries, planFit,
weightFormatAndBytes, inferQuantBits, etc.) lose the hf-redundant
prefixes.

hf package is self-contained: imports core, jang, mlx/memory, mlx/pack,
mlx/profile. Uses memory.Class / memory.Plan / memory.NewPlan /
memory.Input / memory.DeviceInfo / memory.GiB / memory.KVCacheMode*
directly (no mlx-root coupling). The four model-pack-helper calls
that previously delegated to mlx-root (modelPackSupportedArchitecture,
modelPackNativeRuntimeSupported, modelPackUsesGenerationKVCache,
inspectModelPackTaskProfiles) are now inlined as private hf helpers
(archSupported, archNativeRuntime, usesGenerationKVCache,
resolveArchitectureProfile) — each is a thin wrapper over
profile.LookupArchitectureProfile, no behaviour change.

mlx-root hf_fit.go shrinks from 1019 to ~65 LOC of pure shim: 11 type
aliases + 2 const re-exports + 3 wrapper functions. PlanHFModelFits
auto-fills cfg.Device from GetDeviceInfo() (the mlx-root metal probe)
and converts to memory.DeviceInfo at the boundary — caller-facing
behaviour preserved.

helpers.go (new at mlx-root) holds firstNonEmpty / firstPositive /
indexString that were at the bottom of hf_fit.go and are used by
dataset_stream, kv_snapshot_index, memvid_chapter_smoke, model_pack,
and openai. They stay at mlx-root because mlx-root consumers cannot
import hf (wrong direction).

model_config_probe.go (new at mlx-root) holds modelConfigProbe +
readModelConfig + the probe's accessor methods, plus
normalizeKnownArchitecture and architectureFromTransformersName. These
are used by model_pack.go's inspectModelPackConfig +
applyModelPackConfigMetadata; the originals lived in hf_fit.go. The hf
package keeps its own private copies of the two architecture
normalisers (they're used internally by the planner too).

Tests port into hf package — they exercise internal fields/methods
(.baseURL, .userAgent, .client, .byteSize) so package-private access
is preserved. writeModelPackFile test helper duplicated in
hf/test_helpers_test.go since Go test packages cannot import each
other's internal helpers.

go vet ./... clean. Tests: mlx + hf + memory + probe + bundle + kv +
lora + merge + gguf + pack + m2 all green.

Co-Authored-By: Virgil <virgil@lethean.io>
Snider and others added 2 commits May 20, 2026 15:35
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>

@github-advanced-security github-advanced-security AI left a comment

SonarCloud found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

Snider and others added 25 commits May 20, 2026 18:23
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Use borrowed full page handles for immediate paged-cache decode attention, keeping partial preallocated pages owned as visible slices. Refresh the 100k retained workflow report with the measured borrowed-page run and current runner deltas.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Evaluate non-paged prompt-cache state before detaching chunked prefill arrays so contiguous and rotating caches do not carry unevaluated MLX graph handles into the next chunk. Leave paged caches on the accepted production path without the extra synchronisation point.

Document the fp16/rotating 100k diagnostic as a rejected production shortcut: the prefill primitive error is fixed, but decode still crashes before producing a report.

Co-Authored-By: Virgil <virgil@lethean.io>
Record 100k same-shape diagnostics for larger paged K/V blocks and preallocated page writes. Both stay below the accepted 1024-page borrowed-state lane, so the long-context target remains fused paged/global attention rather than page-size tuning.

Update GOAL.md, the runtime index, long-context diagnosis, and the production benchmark manifest with the new rejected artefacts.

Co-Authored-By: Virgil <virgil@lethean.io>
Retain the materialised full K/V state produced by paged fast-concat on full-attention owner layers so shared Gemma 4 layers can reuse it instead of rebuilding the same long-context state.

Records the 100k retained workflow moving from 260.093s / 51.293 tok/s to 231.109s / 60.011 tok/s, while keeping the external runner gap open in GOAL.md and runtime docs.

Co-Authored-By: Virgil <virgil@lethean.io>
Adds the 5120-token-budget 100k retained-state diagnostic. The current prompt naturally stops at 2489 tokens per turn, but decode stays flat around 60 tok/s across ten retained turns and memory remains bounded under the production guards.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
@sonarqubecloud

Quality Gate failed

Failed conditions:
- 6.3% Duplication on New Code (required ≤ 3%)
- E Security Rating on New Code (required ≥ A)
- C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Co-Authored-By: Virgil <virgil@lethean.io>

2 participants