124 commits
178b957
docs: define inference contract parity plan
Snider May 8, 2026
a3263f0
feat(api): implement inference contracts
Snider May 8, 2026
850f482
feat(api): report metal runtime capabilities
Snider May 8, 2026
92d29bd
feat(api): expose metal memory limits via inference
Snider May 8, 2026
1eb011b
feat(api): expose openai chat handler
Snider May 8, 2026
e6c3774
feat(mlx): vMLX parity Phase 1 + per-file docs
Snider May 11, 2026
bbdcd40
refactor(mlx): split compute → dappco.re/go/mlx/compute subpackage
Snider May 11, 2026
a04104d
refactor(mlx): lift parser/thinking → go-inference/parser/
Snider May 11, 2026
b80bd51
refactor(mlx): consume go-inference/quant/jang + codebook subpackages
Snider May 11, 2026
63f9894
refactor(mlx): driver-side jang into quant/jang/ folder
Snider May 11, 2026
8723e14
fix(mlx): finish quant/jang move — proper package + name renames
Snider May 11, 2026
8f5174a
refactor(mlx): lift profile to dappco.re/go/mlx/profile/
Snider May 11, 2026
efd0aad
refactor(mlx): lift lora_adapter to dappco.re/go/mlx/lora/
Snider May 11, 2026
0688d05
refactor(mlx): lift ModelPack types to dappco.re/go/mlx/pack/
Snider May 11, 2026
d44545b
refactor(mlx): lift lora_fuse to dappco.re/go/mlx/lora/
Snider May 11, 2026
844e27a
refactor(mlx): lift gguf_info to dappco.re/go/mlx/gguf/
Snider May 11, 2026
0799447
refactor(mlx): lift safetensors primitives to dappco.re/go/mlx/safete…
Snider May 11, 2026
090c2bf
refactor(mlx): lift gguf_quantize to dappco.re/go/mlx/gguf/
Snider May 11, 2026
6a4b0b0
refactor(mlx): lift model_merge to dappco.re/go/mlx/merge/
Snider May 11, 2026
4f072e3
refactor(mlx): lift kv_snapshot to dappco.re/go/mlx/kv/
Snider May 11, 2026
ae1588b
refactor(mlx): lift eval to go-inference/eval/ via interface redesign
Snider May 11, 2026
db52490
refactor(mlx): lift fast_eval to go-inference/bench/ via verb-callbacks
Snider May 11, 2026
d8cd5eb
chore(submodule): bump go-inference to 264eea8 (bench package tests)
Snider May 11, 2026
6031421
refactor(decode): lift decode_optimisation to go-inference/decode/
Snider May 11, 2026
06972b2
refactor(bundle): lift state_bundle to go-mlx/bundle/
Snider May 11, 2026
c86f516
refactor(probe): lift probe to go-mlx/probe/
Snider May 11, 2026
7613546
refactor(scheduler): lift scheduler to go-inference/scheduler/
Snider May 11, 2026
859662b
refactor(memory): lift memory_plan to go-mlx/memory/
Snider May 11, 2026
bd24ca2
refactor(m2): lift MiniMax M2 + expert_residency to model/minimax/m2/
Snider May 11, 2026
721b050
refactor(hf): lift hf_fit to go-mlx/hf/
Snider May 11, 2026
e0233de
refactor(agent): lift agent_memory + kv_snapshot_index to go-mlx/agent/
Snider May 11, 2026
22e1ee9
refactor(chat): lift chat template formatters to go-mlx/chat/
Snider May 11, 2026
ab4c8e1
refactor: remove hf_fit + decode_optimisation root shim files
Snider May 11, 2026
492da8a
refactor: remove scheduler.go root shim
Snider May 11, 2026
f84e52b
refactor: remove agent_memory.go root shim
Snider May 11, 2026
e26d050
refactor: remove probe.go root shim
Snider May 11, 2026
076de8f
refactor: remove expert_residency.go root shim
Snider May 11, 2026
d421a90
refactor: remove state_bundle.go root shim
Snider May 11, 2026
0ca072a
refactor: remove minimax_m2 root shim trio
Snider May 11, 2026
c5ea2f0
fix: import dappco.re/go/mlx/agent in session_agent_stub.go
Snider May 11, 2026
c697aef
fix: route unsupported_stub_test through gguf package
Snider May 11, 2026
b046f11
refactor: remove memory_plan.go alias surface (public API)
Snider May 11, 2026
345c88c
refactor: remove fast_eval.go alias surface (public API)
Snider May 11, 2026
c6e8d8c
refactor: remove session_artifact.go SAMI alias surface
Snider May 11, 2026
0128e6c
refactor: remove Message + ChatTemplateConfig aliases
Snider May 11, 2026
316b2c6
refactor: lift dataset_stream.go to dappco.re/go/mlx/dataset
Snider May 11, 2026
16ccc60
refactor: lift InferenceAdapter to dappco.re/go/mlx/adapter
Snider May 11, 2026
3d46b6d
refactor: delete non-darwin stub files
Snider May 11, 2026
5f0ae98
refactor: lift kv_cache_bench + model_pack into kv/ + model/
Snider May 11, 2026
7c79cb5
refactor: lift openai.go + admin.go into dappco.re/go/mlx/openai
Snider May 11, 2026
eebf217
refactor: lift block_cache.go to dappco.re/go/mlx/blockcache
Snider May 11, 2026
c95ae46
refactor: lift session_artifact + memvid_chapter_smoke to subpackages
Snider May 11, 2026
369ec71
refactor(mlx): untangle api_*.go cluster + strip _darwin tautology
Snider May 13, 2026
b82ddc0
refactor(mlx): strip _darwin tautology — 20 files merged or renamed
Snider May 13, 2026
1491c09
refactor(mlx): move small_model_smoke files to tests/smoke
Snider May 13, 2026
f005bca
refactor(mlx): relocate orphan profile tests to profile/
Snider May 13, 2026
4e5bd35
refactor(mlx): merge orphan _test_helpers files into their consumers
Snider May 13, 2026
79ee567
fix(test): restore writeModelPackFile after smoke move regression
Snider May 13, 2026
8948c10
refactor(mlx): drop the //go:build darwin && arm64 && !nomlx tag
Snider May 13, 2026
98ff340
chore(ci): wire sonar-project.properties for core_go-mlx
Snider May 13, 2026
4b7b40d
refactor(mlx): split api_test.go into per-source-file test homes
Snider May 13, 2026
94a6812
chore(external): add go-ai + go-ml submodules (temp — Codex sandbox)
Snider May 16, 2026
b0bfd46
feat(mlx): add agentic memory runner path
Snider May 20, 2026
0411e03
docs(runtime): record agentic runner evidence
Snider May 20, 2026
782067d
fix(metal): wire native bridge build sources
Snider May 20, 2026
e61ecc9
chore(external): advance go-inference dev
Snider May 20, 2026
89f613e
docs(goal): expose production gates
Snider May 20, 2026
c19bc07
perf(metal): stream pinned Gemma 4 KV restore
Snider May 20, 2026
ed09bfb
perf(metal): align Gemma 4 layer defaults and PLE residency
Snider May 20, 2026
0d225c8
fix(metal): use explicit CPU streams for model loads
Snider May 20, 2026
ccb78c6
test(metal): lock Gemma 4 E2B layer metadata
Snider May 20, 2026
f00ef92
test(metal): pin Gemma 4 architecture invariants
Snider May 20, 2026
82c1648
bench(metal): accept E2B 100k real workload
Snider May 20, 2026
e1f304d
bench(metal): add E2B llama 100k anchor
Snider May 20, 2026
dc4a23f
bench(metal): add E2B mlx runner anchors
Snider May 20, 2026
9b1f8c6
bench(metal): gate cache-only prefill diagnostic
Snider May 20, 2026
c910316
perf(metal): keep suppressed chat decode on greedy path
Snider May 20, 2026
8639490
perf(metal): restore paged kv from vector views
Snider May 20, 2026
b406a98
perf(kv): stream native layer slabs
Snider May 20, 2026
479af8b
perf(metal): bound gemma4 prefill masks
Snider May 20, 2026
583ef58
bench(cli): write chapter reports to file
Snider May 20, 2026
c59c4fe
docs(goal): accept e2b continuation lane
Snider May 20, 2026
e4124a0
docs(runtime): index production benchmark artefacts
Snider May 20, 2026
ffc9826
docs(runtime): mark quant matrix artifact gap
Snider May 20, 2026
9c0451e
fix(metal): materialise host logit reads
Snider May 20, 2026
ba169d8
bench(cli): write driver reports to file
Snider May 20, 2026
667b6e5
docs(runtime): refresh e2b quant matrix
Snider May 20, 2026
c5caff6
docs(runtime): fill e2b external quant rows
Snider May 20, 2026
e82a2a4
docs(runtime): add llama cached anchor
Snider May 20, 2026
f2c5232
docs(runtime): explain long context gap
Snider May 20, 2026
c3c4da5
perf(metal): tune hyper long paged kv
Snider May 20, 2026
9d55267
test(metal): correct token phase probe timing
Snider May 20, 2026
adc506d
perf(metal): borrow paged kv state for decode
Snider May 20, 2026
e3baf55
docs(goal): audit gemma4 ideas update
Snider May 20, 2026
66bbfe3
test(metal): guard gemma4 keqv cache split
Snider May 20, 2026
6c6d271
perf(metal): pack adamw moment state
Snider May 20, 2026
1cefb03
fix(training): guard gemma4 lora targets
Snider May 20, 2026
e1a5e97
docs(goal): record gomlxrunner compile pass
Snider May 20, 2026
89d2dfb
feat(api): expose prompt cache clearing
Snider May 20, 2026
8fe0efd
docs(goal): record ideas fine-tuning addendum
Snider May 20, 2026
c0c535c
docs(runtime): add production benchmark manifest
Snider May 20, 2026
34ac64a
docs(runtime): add strict benchmark cleanup gate
Snider May 20, 2026
3786cf5
docs(runtime): clean noncanonical benchmark fragments
Snider May 20, 2026
95af568
bench(runtime): track e2b context ramp harness
Snider May 20, 2026
0077a0d
bench(runtime): record rejected 100k attention branches
Snider May 20, 2026
999d098
bench(runtime): gate native paged attention diagnostic
Snider May 20, 2026
b13cd65
bench(runtime): summarise long context trace buckets
Snider May 20, 2026
5d0ded1
bench(runtime): reject right-sized fixed cache at 100k
Snider May 20, 2026
f9fc029
fix(metal): materialise cache state before detach
Snider May 20, 2026
ea799cb
docs(runtime): bound paged cache geometry probes
Snider May 20, 2026
a148951
perf(metal): reuse shared paged full kv
Snider May 20, 2026
f5b6795
docs(runtime): record 100k sustained long turn
Snider May 20, 2026
7badd57
docs(runtime): refresh 100k trace diagnosis
Snider May 21, 2026
4d842ae
perf(metal): record paged full kv diagnostic
Snider May 21, 2026
2c1a18b
docs(runtime): record fp16 long-context cliff
Snider May 21, 2026
5365538
perf(metal): skip single-head paged kv repeat
Snider May 21, 2026
64ff8c5
docs(runtime): clarify 128ki context default
Snider May 21, 2026
4f1dff3
perf(metal): borrow fixed kv state in native paths
Snider May 21, 2026
45ff644
perf(metal): gate streamy paged restore
Snider May 21, 2026
63a4845
docs(runtime): record paged restore threshold probes
Snider May 21, 2026
75fead9
perf(metal): gate typed kv cache storage
Snider May 21, 2026
2d75ccd
perf(metal): preserve typed kv restore
Snider May 21, 2026
0ef9898
docs(runtime): refresh fp16 token trace
Snider May 21, 2026
7f904a3
feat(cli): add retained state ramp profile
Snider May 21, 2026
1 change: 1 addition & 0 deletions .gitignore
@@ -1,5 +1,6 @@
# Build artifacts
build/
bin/
*.dylib
*.so
*.a
8 changes: 8 additions & 0 deletions .gitmodules
@@ -22,3 +22,11 @@
path = external/go-io
url = https://github.com/dappcore/go-io.git
branch = dev
[submodule "external/go-ai"]
path = external/go-ai
url = https://github.com/dappcore/go-ai.git
branch = dev
[submodule "external/go-ml"]
path = external/go-ml
url = https://github.com/dappcore/go-ml.git
branch = dev
1 change: 1 addition & 0 deletions CLAUDE.md
@@ -44,6 +44,7 @@ After Mantis #1241, all Go code lives under `go/`:
```
go/ Go module root (dappco.re/go/mlx)
*.go Public root API: model, tokenizer, compute, training, eval, distill, GRPO, hf-fit, merge, gguf-quantize, kv-snapshot, lora-fuse
cmd/mlx/ CLI tool (built with `-o core-mlx`; consumers rename: lthn-mlx)
cmd/violet/ Unix-socket sidecar daemon
internal/metal/ All CGO code (mlx-c bindings)
mlxlm/ CGO-free Python subprocess backend
6 changes: 5 additions & 1 deletion CMakeLists.txt
@@ -3,6 +3,9 @@ cmake_minimum_required(VERSION 3.24)
project(mlx)

set(CMAKE_OSX_DEPLOYMENT_TARGET "26.0" CACHE STRING "Minimum macOS version")
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS ON)

if(CMAKE_INSTALL_PREFIX_INITIALIZED_TO_DEFAULT)
set(CMAKE_INSTALL_PREFIX "${CMAKE_CURRENT_SOURCE_DIR}/dist" CACHE PATH "" FORCE)
@@ -17,7 +20,8 @@ set(CMAKE_INSTALL_RPATH "@loader_path")

include(FetchContent)

set(MLX_C_GIT_TAG "v0.4.1" CACHE STRING "")
set(MLX_C_GIT_TAG "v0.6.0" CACHE STRING "")
set(FETCHCONTENT_SOURCE_DIR_MLX "${CMAKE_CURRENT_SOURCE_DIR}/lib/mlx" CACHE PATH "Local patched MLX source")

FetchContent_Declare(
mlx-c
1,534 changes: 1,534 additions & 0 deletions GOAL.md

Large diffs are not rendered by default.

4 changes: 3 additions & 1 deletion cpp/CMakeLists.txt
@@ -1,7 +1,9 @@
cmake_minimum_required(VERSION 3.24)
project(go-mlx-cpp LANGUAGES C CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS ON)

# Fetch mlx-c v0.4.1 — same version as the Go side
include(FetchContent)
146 changes: 146 additions & 0 deletions docs/README.md
@@ -0,0 +1,146 @@
<!-- SPDX-Licence-Identifier: EUPL-1.2 -->

# go-mlx — documentation index

**Module**: `dappco.re/go/mlx`
**Role**: Native Apple Metal GPU inference + research-grade training pipeline. Implements the go-inference `Backend` + `TextModel` + `Session/Forker` contracts for darwin/arm64.

## Tetrad position

```
                   ┌──────────────────────────────┐
                   │     dappco.re/go (core)      │
                   └──────────────┬───────────────┘
                   ┌──────────────┴────────────────┐
                   │    go-inference (contract)    │
                   └──┬─────────────┬──────────────┘
                      │             │ register via init()
             ┌────────┴───┐  ┌──────┴────────┐
you are here → go-mlx     │  │ go-rocm /     │
             │ darwin     │  │ go-cuda       │
             │ arm64      │  │ (planned)     │
             └─────┬──────┘  └───────────────┘
                   │ consumed by
             ┌─────┴──────────┬────────────────┐
             │ go-ml          │ go-ai          │
             │ scoring/agent  │ router/demos   │
             └────────────────┴────────────────┘
```

## What this package owns

Nine areas, each with its own doc subtree:

| Area | Owns | Doc |
|------|------|-----|
| `runtime/` | Backend registration + adapter + Metal allocator | [runtime/README.md](runtime/README.md) |
| `memory/` | KV snapshots + bundles + memvid + Wake/Sleep/Fork | [memory/README.md](memory/README.md) |
| `moe/` | MiniMax M2 + JANG/JANGTQ + codebook VQ + expert residency | [moe/README.md](moe/README.md) |
| `training/` | SFT + GRPO + distillation + LoRA + eval + merge | [training/README.md](training/README.md) |
| `model/` | Model-pack validation + memory planning + GGUF | [model/README.md](model/README.md) |
| `inference/` | Scheduler + block cache + decode opt + parsers + thinking | [inference/README.md](inference/README.md) |
| `compute/` | Non-LLM Metal compute (pixel buffers, kernels, frame pipelines) | [compute/compute.md](compute/compute.md) |
| `observability/` | Probe emission (token / entropy / heads / router / cache / memory / training) | [observability/probe.md](observability/probe.md) |
| `cmd/` | Sidecar daemons | [cmd/violet.md](cmd/violet.md) |

## Mental model

```
          ┌─────────────────────────────────┐
          │   caller: inference.LoadModel   │
          └──────────────┬──────────────────┘
      ┌──────────────────┴───────────────────┐
      │ go-inference Default()               │
      │ picks "metal" → metalbackend         │
      └──────────────────┬───────────────────┘
            runtime/ (register_metal.go)
      ┌──────────────────────────────────────┐
      │ memory_plan → load weights via       │
      │ medium → metal.LoadAndInit → produce │
      │ &metaladapter wrapping metal.Model   │
      └──────────────────┬───────────────────┘
┌────────────┬───────────┴────────┬──────────────┐
▼            ▼                    ▼              ▼
inference/   memory/              training/      observability/
(scheduler   (Wake/Sleep          (SFT/LoRA/     (probe events)
 cache        bundles              GRPO/distill/
 decode-opt   memvid)              eval)
 parsers
 thinking)

moe/ adds MoE-specific paths into each area.
compute/ runs alongside on the same Metal device.
```

## Status snapshot (2026-05-11)

**Production**: dense models (Gemma 3/4 dense, Qwen 2/3, Llama 3) — load, inference, scheduler, block cache, KV snapshots, agent memory wake/sleep/fork, SFT, LoRA, distillation, GRPO, eval, model pack validation, GGUF read+write, memory planning, frame compute. Qwen 3.6 model packs are recognised and planned through the `mlx_lm` fallback while native hybrid linear-attention kernels are pending.

**Phase 1 in flight** (vMLX parity sprint, started 2026-05-09): MiniMax M2/2.7 MoE forward, JANGTQ_K weight load, codebook VQ kernels, expert residency native path, disk-backed block cache.

**Planned**: speculative decoding (paired with Gemma 4 `-assistant`), prompt-lookup decoding, embeddings + rerank surfaces, OpenAI Responses handler, vision/audio (out-of-scope for core runner near-term).

## Repository layout

```
go-mlx/
├── go/ Go module root (dappco.re/go/mlx)
│ ├── *.go ← root package (80+ files, this is where docs land)
│ ├── internal/metal/ ← CGO bindings to mlx-c (44 files, internal)
│ ├── mlxlm/ ← CGO-free Python subprocess fallback
│ ├── cmd/violet/ ← Unix-socket sidecar daemon
│ ├── cmd/mlx/ ← CLI tool (built with `-o core-mlx`; consumers rename: lthn-mlx, etc.)
│ ├── pkg/daemon/ ← daemon implementation
│ ├── pkg/memvid/ ← QR-video knowledge-pack codec
│ └── tests/ ← integration tests
├── cpp/ C++ companion (CLion-side)
├── docs/ ← YOU ARE HERE
├── examples/ per-feature usage walkthroughs
├── external/ vendored core libraries
├── lib/mlx/ upstream MLX submodule (v0.31.1)
└── patches/ local patches to lib/mlx
```

## Where to start

- **Caller (loading a model)** → [`runtime/register_metal.md`](runtime/register_metal.md) + [`runtime/adapter.md`](runtime/adapter.md)
- **Local setup / autotune UI** → [`runtime/local_autotune.md`](runtime/local_autotune.md)
- **Agent memory / book state** → [`memory/agent_memory.md`](memory/agent_memory.md)
- **LTHN project context seed** → [`memory/agentic_project_seed.md`](memory/agentic_project_seed.md)
- **Training Vi or a custom model** → [`training/README.md`](training/README.md) → [`training/sft.md`](training/sft.md) → [`training/distill.md`](training/distill.md)
- **Understanding the vMLX parity work** → [`moe/README.md`](moe/README.md) + `docs/vmlx-feature-gap-report.md`
- **Serving many requests** → [`inference/scheduler.md`](inference/scheduler.md)
- **Frame compute (emulator UIs)** → [`compute/compute.md`](compute/compute.md)
- **Sidecar deployment** → [`cmd/violet.md`](cmd/violet.md)

## Legacy docs

The flat docs in this folder (`architecture.md`, `compute.md`, `distillation.md`, `grpo.md`, `models.md`, `training.md`, `eval.md`, `model-operations.md`, `model-state-roadmap.md`, `build.md`, `development.md`, `history.md`, `index.md`, `vmlx-feature-gap-report.md`, `superpowers/plans/2026-05-09-vmlx-feature-parity.md`) pre-date this per-file pass and may rot. Keep `vmlx-feature-gap-report.md` and the parity plan (they're active references). Fold the rest into the per-package READMEs over time.

## Measured

| Operation | Bundle / model | Latency |
|-----------|----------------|---------|
| Wake — chapter (warm) | ~500MB | 998ms |
| Wake — full book (warm) | ~10.5GB | 2.15s |
| Wake — full book (cold runner) | ~10.5GB | 55.2s |
| Sleep — incremental, parent-reuse | 200-token delta | <1s |
| Gemma 4 E2B inference (M3 Ultra) | dense | ~80 tok/s decode |
| Gemma 4 26B inference (M3 Ultra) | dense | ~25 tok/s decode |

## Standards

- UK English in code, comments, docs (colour, organisation, licence, serialise)
- SPDX header on every new file: `// SPDX-Licence-Identifier: EUPL-1.2`
- Conventional commits: `type(scope): description` — scopes per package + `metal`, `api`, `mlxlm`, `repo`, `deps`
- Test triplets: `_Good` / `_Bad` / `_Ugly` + `*_example_test.go` runnable examples
- Error wrapping via `core.E(scope, msg, cause)`
- Co-Author: `Co-Authored-By: Virgil <virgil@lethean.io>`
- Native files: `//go:build darwin && arm64` (or `&& !nomlx`); stubs return false on `MetalAvailable()`
- CGO confined to `go/internal/metal/`
13 changes: 8 additions & 5 deletions docs/architecture.md
@@ -41,23 +41,26 @@ internal/metal/ <-- All CGO code
+-- metal.go Init, error handler, Eval, Materialize
|
v
mlx-c v0.4.1 <-- C API (fetched by CMake)
mlx-c v0.6.0 <-- C API (fetched by CMake)
|
v
Apple MLX / Metal / Accelerate <-- GPU compute
Apple MLX v0.31.1 / Metal / Accelerate <-- local patched lib/mlx
```

## CGO Binding

### Build Chain

mlx-c is fetched and built by CMake via `go generate ./...`. The `CMakeLists.txt` at the module root pulls mlx-c v0.4.1 from GitHub:
mlx-c is fetched and built by CMake via `go generate ./...`. The
`CMakeLists.txt` at the module root pulls mlx-c v0.6.0 from GitHub and points
mlx-c's nested MLX dependency at the local patched `lib/mlx` submodule:

```cmake
set(FETCHCONTENT_SOURCE_DIR_MLX "${CMAKE_CURRENT_SOURCE_DIR}/lib/mlx" CACHE PATH "Local patched MLX source")
FetchContent_Declare(
mlx-c
GIT_REPOSITORY "https://github.com/ml-explore/mlx-c.git"
GIT_TAG "v0.4.1"
GIT_TAG "v0.6.0"
)
```

@@ -255,7 +258,7 @@ session, err := mlx.NewSession()

Options from `inference.LoadConfig` understood by the Metal backend:

- `ContextLen` -- replaces unbounded `KVCache` with `RotatingKVCache(contextLen)` for all layers; default 131072
- `ContextLen` -- replaces unbounded `KVCache` with `RotatingKVCache(contextLen)` for all layers; default `131072` (`128Ki` tokens)
- `ParallelSlots` -- caps concurrent native inference calls for one loaded model before KV/cache allocation; default 1
- `AdapterPath` -- loads a trained LoRA adapter from disk at model load time
- `GPULayers` -- logged as a warning if set to 0 (Metal always uses full GPU offload)
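
As a rough illustration of what a bounded, rotating cache implies for `ContextLen`, the sketch below keeps only the most recent entries in a fixed-size ring. It is a minimal sketch under stated assumptions: the `ring` type and its methods are hypothetical, it stores plain integers rather than key/value tensors, and it does not reproduce the actual `RotatingKVCache` implementation. The capacity must be greater than zero.

```go
package main

import "fmt"

// ring is a hypothetical stand-in for a rotating cache: it retains the most
// recent len(buf) entries and overwrites the oldest once it is full.
type ring struct {
	buf  []int
	next int  // index the next write lands on
	full bool // true once the buffer has wrapped at least once
}

// newRing allocates a ring with the given capacity (must be > 0).
func newRing(capacity int) *ring { return &ring{buf: make([]int, capacity)} }

// push stores one entry, evicting the oldest entry when the ring is full.
func (r *ring) push(tok int) {
	r.buf[r.next] = tok
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// window returns the retained entries, oldest first.
func (r *ring) window() []int {
	if !r.full {
		return append([]int(nil), r.buf[:r.next]...)
	}
	out := append([]int(nil), r.buf[r.next:]...)
	return append(out, r.buf[:r.next]...)
}

func main() {
	// The documented default: 128Ki tokens.
	const defaultContextLen = 128 * 1024
	fmt.Println(defaultContextLen) // 131072

	r := newRing(4)
	for t := 1; t <= 6; t++ {
		r.push(t)
	}
	fmt.Println(r.window()) // [3 4 5 6]
}
```

With a capacity of 4 and six pushes, entries 1 and 2 are evicted; memory stays bounded by the capacity regardless of how long generation runs.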
10 changes: 6 additions & 4 deletions docs/build.md
@@ -47,7 +47,8 @@ The submodule initialisation is required because `internal/metal/` contains
forwarding translation units that include sources from `lib/mlx`, `lib/mlx-c`,
and `lib/generated`.

CMake fetches mlx-c v0.4.1 from GitHub and builds it with:
CMake fetches mlx-c v0.6.0 from GitHub and builds it against the local
patched `lib/mlx` submodule with:

- `MLX_BUILD_SAFETENSORS=ON` -- required for model loading
- `MLX_BUILD_GGUF=ON` -- enables GGUF load/save support
@@ -133,7 +134,8 @@ set(BUILD_SHARED_LIBS ON CACHE BOOL "" FORCE)
set(CMAKE_INSTALL_RPATH "@loader_path")

include(FetchContent)
set(MLX_C_GIT_TAG "v0.4.1" CACHE STRING "")
set(MLX_C_GIT_TAG "v0.6.0" CACHE STRING "")
set(FETCHCONTENT_SOURCE_DIR_MLX "${CMAKE_CURRENT_SOURCE_DIR}/lib/mlx" CACHE PATH "Local patched MLX source")
FetchContent_Declare(
mlx-c
GIT_REPOSITORY "https://github.com/ml-explore/mlx-c.git"
@@ -230,8 +232,8 @@ CGO call overhead floors at approximately 170 us per operation (Metal command bu
```
go-mlx
+-- forge.lthn.ai/core/go-inference (shared interfaces, zero dependencies)
+-- mlx-c v0.4.1 (CMake, fetched at go generate time)
+-- Apple MLX (Metal GPU compute)
+-- mlx-c v0.6.0 (CMake, fetched at go generate time)
+-- Apple MLX v0.31.1 (local patched lib/mlx submodule)
+-- Foundation, Metal, Accelerate frameworks
```

112 changes: 112 additions & 0 deletions docs/cmd/violet.md
@@ -0,0 +1,112 @@
<!-- SPDX-Licence-Identifier: EUPL-1.2 -->

# cmd/violet — local-native inference sidecar

**Package**: `dappco.re/go/mlx/cmd/violet`
**Files**: `cmd/violet/main.go` (entry) + `pkg/daemon/` (server)

## What this is

The **Violet sidecar daemon** is a long-running process that exposes inference and agent memory over a Unix socket. It lets local processes (CoreAgent, IDE, ml lab) call into a hot, model-loaded mlx runtime without each spawning its own.

Violet is what Cladius posts to instead of burning Anthropic tokens for routine inference. It's the local substrate that survives Codex's uncertain status (per `project_codex_status_uncertain.md`) and the budget pressure (per `project_go_mlx_research_grade.md`).

## Why a daemon

Three reasons one shared process beats N short-lived processes:

1. **Model load cost.** Loading Gemma 4 26B takes 30-60s on first touch. The daemon pays it once.
2. **KV cache locality.** Sessions retain their KV across requests; a fresh process can't.
3. **Memory budget.** Two LLM processes don't fit on a 96GB Ultra; one daemon serving many clients does.

## Transport

Unix domain socket — fast, secure-by-default (filesystem permissions), no TCP overhead.

```bash
violet --socket /var/run/violet/violet.sock --config /etc/violet.toml
```

Each request is one line of JSON written to the socket. Responses use the same line-delimited framing; streaming responses span multiple lines, SSE-style.
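
The framing above can be sketched as a small Go client: write one JSON line, read one JSON line back. This is a minimal sketch, not the daemon's actual wire schema. The `Request` and `Response` field names are assumptions made for illustration, and `startEcho` is a stand-in server so the example runs without Violet installed.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"net"
	"os"
	"path/filepath"
)

// Request and Response are hypothetical envelope shapes; the real field
// names are not specified in this document.
type Request struct {
	Op     string `json:"op"`
	Prompt string `json:"prompt,omitempty"`
}

type Response struct {
	Text string `json:"text"`
}

// call sends one JSON request line and reads back one JSON response line.
func call(socketPath string, req Request) (Response, error) {
	var resp Response
	conn, err := net.Dial("unix", socketPath)
	if err != nil {
		return resp, err
	}
	defer conn.Close()

	// Encode appends a trailing newline, which terminates the request line.
	if err := json.NewEncoder(conn).Encode(req); err != nil {
		return resp, err
	}
	line, err := bufio.NewReader(conn).ReadBytes('\n')
	if err != nil {
		return resp, err
	}
	return resp, json.Unmarshal(line, &resp)
}

// startEcho runs a stand-in server (not the Violet daemon) on a temporary
// socket. It answers every request line with "<op>: <prompt>".
func startEcho() (socketPath string, stop func(), err error) {
	dir, err := os.MkdirTemp("", "violet-demo")
	if err != nil {
		return "", nil, err
	}
	socketPath = filepath.Join(dir, "demo.sock")
	ln, err := net.Listen("unix", socketPath)
	if err != nil {
		os.RemoveAll(dir)
		return "", nil, err
	}
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func(c net.Conn) {
				defer c.Close()
				enc := json.NewEncoder(c)
				sc := bufio.NewScanner(c)
				for sc.Scan() { // one request per line
					var req Request
					if json.Unmarshal(sc.Bytes(), &req) != nil {
						return
					}
					_ = enc.Encode(Response{Text: req.Op + ": " + req.Prompt})
				}
			}(conn)
		}
	}()
	return socketPath, func() { ln.Close(); os.RemoveAll(dir) }, nil
}

func main() {
	sock, stop, err := startEcho()
	if err != nil {
		fmt.Println("listen:", err)
		return
	}
	defer stop()

	resp, err := call(sock, Request{Op: "Generate", Prompt: "hello"})
	if err != nil {
		fmt.Println("call:", err)
		return
	}
	fmt.Println(resp.Text)
}
```

A streaming client would keep reading lines from the same connection until the server signals completion; how Violet marks the end of a stream is not described here, so the sketch reads a single line only.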

## Surface

Per-request operations (a subset; more will land as the parity sprint completes):

- `Generate` / `Chat` — text generation
- `Classify` / `BatchGenerate`
- `WakeState` / `SleepState` / `ForkState` — agent memory
- `CacheStats` / `WarmCache` / `ClearCache` — prompt cache
- `CapabilityReport` — what this daemon supports right now
- `LoadModel` / `UnloadModel` — admin (default off, opt-in via config)

## Config

```toml
# /etc/violet.toml

[runtime]
socket = "/var/run/violet/violet.sock"
default_model = "gemma-4-e2b"

[models.gemma-4-e2b]
path = "/Volumes/Data/models/gemma-4-e2b/"
context_length = 32768

[models.qwen-3-coding]
path = "/Volumes/Data/models/qwen-3-coding-30b/"
context_length = 16384

[memory]
bundles_dir = "/var/lib/violet/bundles"
codec = "memvid" # or "file"

[scheduler]
max_concurrent = 4
max_queue = 32

[probe]
log_dir = "/var/log/violet/probes"
```

The daemon pre-loads `default_model` at startup. Other models load lazily on first reference.
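
The pre-load and lazy-load behaviour can be sketched with a small registry that loads each model at most once. This is an illustrative sketch only: `Registry`, `Model`, and the loader callback are hypothetical names, not types from `pkg/daemon`.

```go
package main

import (
	"fmt"
	"sync"
)

// Model is a placeholder for a loaded model handle.
type Model struct{ Name string }

// Registry loads each model at most once, on first reference.
// The load callback is a hypothetical stand-in for the real model loader.
type Registry struct {
	mu     sync.Mutex
	load   func(name string) (*Model, error)
	loaded map[string]*Model
}

// NewRegistry builds a registry and eagerly loads the given models,
// mirroring the default_model pre-load at startup.
func NewRegistry(load func(string) (*Model, error), preload ...string) (*Registry, error) {
	r := &Registry{load: load, loaded: map[string]*Model{}}
	for _, name := range preload {
		if _, err := r.Get(name); err != nil {
			return nil, err
		}
	}
	return r, nil
}

// Get returns the resident model, loading it first if needed.
// Holding the lock during the load serialises loads, which keeps the
// sketch simple at the cost of blocking other lookups meanwhile.
func (r *Registry) Get(name string) (*Model, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if m, ok := r.loaded[name]; ok {
		return m, nil
	}
	m, err := r.load(name)
	if err != nil {
		return nil, err
	}
	r.loaded[name] = m
	return m, nil
}

func main() {
	loads := 0
	loader := func(name string) (*Model, error) {
		loads++
		return &Model{Name: name}, nil
	}

	reg, err := NewRegistry(loader, "gemma-4-e2b") // pre-loaded at startup
	if err != nil {
		fmt.Println("preload:", err)
		return
	}
	reg.Get("gemma-4-e2b")   // already resident, no load
	reg.Get("qwen-3-coding") // loaded lazily on first reference
	reg.Get("qwen-3-coding") // resident now, no load
	fmt.Println("loads:", loads) // loads: 2
}
```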

## Lifecycle

```
violet starts
read config + open socket
pre-load default model
warm prompt cache from on-disk seeds (if configured)
serve requests until SIGINT/SIGTERM
flush in-flight bundles to durable storage
unload models cleanly
close socket
```

## Used by

- **Cladius's local-inference skills** — `mattermost`, `wiki`, code summarise — call violet for batch text processing instead of round-tripping Anthropic
- **CoreAgent / core/ide** — chat-with-local-model surface
- **Vi training pipeline** — distillation teacher endpoint
- **LARQL vindex inspection** — pre/post-SFT model inference for diff

## Status

Production. Used in the daily Cladius workflow: the `wiki`, `mattermost`, and code-summarise skills route through it.

## Related

- `pkg/daemon/` — server implementation (planned dedicated doc)
- `../memory/agent_memory.md` — Wake/Sleep exposed over the socket
- `../inference/scheduler.md` — the scheduler that admits violet requests
- `../runtime/register_metal.md` — Violet boots the metal backend
- `project_local_inference_topology.md` — measured topology
- `project_go_mlx_research_grade.md` — the substrate this is part of