A local, KV-cache-aware LLM inference server for Apple Silicon built on the MLX framework. Exposes an OpenAI-compatible API through a LiteLLM proxy. The core value proposition is maximum KV cache hit rate to minimize prefill latency — on a 27B model, cold 10k-token prefill costs ~42 s while a warm hit is effectively free.
- Dual-pipeline cache architecture: canonical pipeline drives the cache key, original pipeline drives the model. They never recombine after
_canonicalize_messages(), so cache stability never compromises model input. - Two cache layers: a global block-hash cache (token-level) and a message-aware stable-prefix cache (structure-level). Secondary lookup recovers hits after mid-history tool-result injections.
- Apple Silicon native: MLX +
mlx-lmfor text models,mlx-vlmfor vision-language models (auto-detected fromconfig.json). - OpenAI-compatible streaming: real token-by-token SSE with proper
reasoning_contentandcontentdeltas so clients (PI, OpenClaw, Claude Code, LiteLLM) render thinking blocks and answers as they arrive. - Multi-agent safe: canonical-form write guards on the session turn store prevent sub-agent branches from clobbering the orchestrator's turn record; per-session lineage tracking supports branch-return cache reuse.
- Request watchdog: wall-clock deadline per request releases
model_lockon stalls, evicts partial KV state, and terminates the response withfinish_reason="length". - Correctness observability: counters (
vlm_retreat,hybrid_trim_miss,exact_key_rejected_by_model_lcp) surface cache-correctness regressions as they happen.
# One-command bootstrap (creates .venv, installs deps, execs server)
python3.12 install_and_run.py
# Manual setup
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt && python start-llm.py
# Smoke test
curl -s http://127.0.0.1:4000/v1/modelsinstall_and_run.py has a preflight that refuses to run if the venv has been contaminated by a non-3.12 Python (e.g. a stray python3.14 -m pip against the existing venv). Python 3.12 is strictly required.
Request path: Client -> LiteLLM Proxy (port 4000) -> MLX Server (port 8080) -> MLX Model.
- The model must never be blinded. All semantic content (tool results,
<think>blocks, file contents, images) reaches the model 100% intact. A cache hit that hides content is worse than a miss. - Separate cache key from model input. Canonical tokens drive cache lookup; original tokens drive the model. The two are computed at
_canonicalize_messages()and never recombined. - Structural normalisation, not string substitution. Volatile fields are normalised at message-struct level before
apply_chat_template. Post-render scrubs are atomic, line-scoped only — nore.DOTALLspanning section boundaries. - RoPE correctness. KV state reuse is position-aware. Physical KV offsets (
_kv_cache_offset(cache)) slice the originalmodel_tokensfor the remaining prefill; canonical token counts are never used for model-side slicing. - No gut-feel patches. Every change to caching or normalisation cites a specific log-confirmed failure mode. See
.context/LESSONS.mdbefore touching cache logic.
Incoming Request
|
v
[1] _heal_messages()
| Restores stripped <think> blocks from HEALING_STORE (SHA-256 keyed)
v
[2] _canonicalize_messages(messages)
|-- Returns: (original, canonical) — deep copies; original is never modified
|-- Canonical mutations: message_id -> __STABLE_MSG_ID__,
| Inbound Context block -> __STABLE_INBOUND_CONTEXT_SECTION__,
| sub-agent "Stats: runtime ..." -> __STABLE_RUNTIME__
| |
v v
[3] original_messages [4] canonical_messages
(Model Input Pipeline) (Cache Key Pipeline)
| |
v v
apply_chat_template() apply_chat_template()
| |
v v
model_tokens cache_prompt_raw
| |
| [5] _scrub_cache_key()
| Atomic, line-scoped scrubs:
| - "Current time is ..." timestamps
| - cch=<hex>; telemetry headers
| - -anthropic-billing-header: ...
| - <system-reminder>...</system-reminder>
| |
| [5a] Image-identity markers (G1)
| For VLM requests: inject a 2-token
| SHA-256-derived marker after each
| <|image_pad|> run so identical
| message structure with different
| images yields different cache keys
| |
v v
Prefill & Generate <--- KV Match --- prompt_tokens (canonical cache key)
(from model_tokens) Lookup via LRUPromptCache +
SessionIndex + SESSION_TURN_STORE
| Layer | Function | Scope | What it normalises |
|---|---|---|---|
| Message-struct | _canonicalize_messages() |
Before rendering | message_id, Inbound Context block, sub-agent Stats: runtime |
| Inbound Context | _canonicalize_inbound_context_block() |
Inside _canonicalize_messages |
Produces identical canonical output whether OpenClaw includes or omits the block |
| Post-render | _scrub_cache_key() |
After apply_chat_template |
Timestamps, cch= headers, billing headers, <system-reminder> tags |
| Image identity | _inject_image_markers() |
Post-tokenisation, canonical only | 2-token SHA-256 marker per image — forces cache-key divergence on image swap |
Two independent layers work together to maximise hits.
Token-based, no assumptions about session identity or message structure.
- Chain hashes: prompt tokens are split into 16-token blocks; each block's hash is SHA-256-chained with the prior block's hash.
- Longest-candidate selection: among entries sharing the same early block hashes, always pick the deepest matching entry. Without this, a short entry shadows a long one and collapses hit rate.
- Model-space LCP verification: candidate discovery is canonical-token-based, but final acceptance requires
entry.model_tokensto be a prefix of the request'smodel_tokens. This guards against canonical-prefix collisions across different VLM image expansions. - Hybrid-cache awareness: Qwen3.6 and other hybrid VLMs expose
ArraysCache + KVCachelayers thatcan_trim_prompt_cachereports as non-trimmable. Candidates that would require trim are rejected rather than silently reused at the wrong KV depth. - Cost-aware eviction: score =
age / (sqrt(length) * log(frequency)). Protects long conversations and frequently-used prefixes; evicts short, old branches. - Prefix culling: when a longer entry is inserted, shorter strict prefixes are removed (longest one kept as a safety net). Requires model-token-space prefix match, not just canonical.
- Metal memory management:
mx.metal.clear_cache()runs on eviction to return Metal buffers to the OS pool and prevent monotonic GPU memory growth.
Secondary layer that recovers hits after mid-history mutations (tool-result injections, early whitespace drift, message_id churn).
- Canonical-form storage and diff: the turn record stores
_canonicalize_messages()output, and_message_diff()compares canonical-form leading prefixes. This lets OpenClaw drift (per-turnmessage_id, Stats runtime, Inbound Context variants) collapse to equality because it IS equal in cache-key space. - Exact token boundaries:
_compute_msg_token_boundaries()converts the stable message count to an exact token count via cumulativeapply_chat_template(messages[:i+1])rendering. Falls back to even distribution only for VLMs without a usable text template. - Secondary lookup: if the global layer returns fewer cached tokens than the stable-prefix layer estimates, a secondary
fetch_nearest_cachescoped to the stable prefix forces a match up to the change point. - Strict-append write guard: only records when the new canonical list strictly extends the stored one. Prevents sub-agent or parallel-branch requests from clobbering the orchestrator's turn record.
On every cache hit, _kv_cache_offset(cache) reads the physical KV depth from the first attention layer that exposes .offset (hybrid VLM caches mix ArraysCache + KVCache — layer 0 may lack it). That offset slices the original model_tokens for the remaining prefill. RoPE stays coherent even when canonical and original pipelines differ by thousands of tokens (e.g., a 200-token JSON Inbound Context block collapses to a single __STABLE_INBOUND_CONTEXT_SECTION__ sentinel in the canonical pipeline).
SessionIndex: per-session cache-key history with parent/branch lineage. Anchor prefixes at 2048-token strides support branch-return reuse.SessionContext: extracted from request body (session_id,conversation_id,thread_id,chat_id, or nestedmetadata/extra_body). Falls back to a SHA-1 hash of the prompt's first 128 tokens when no explicit ID is provided.
The server uses a character-level state machine (_ReasoningStreamSplitter) to classify every generated chunk into one of three streams — reasoning (to delta.reasoning_content), content (to delta.content), or tool-call buffer (held for end-of-stream extraction). The classifier:
- Handles
<think>...</think>blocks emitted by Qwen3 and the orphan-close variant emitted by GLM ("reasoning\nanswer" with no opening tag). - Holds a small lookback so tags split across two generated chunks are detected correctly — no premature emission of partial sentinels.
- Matches
_strip_thinking_from_contentbyte-for-byte on the concatenatedcontentstream for every non-pathological case (verified by import-time self-test plus a 150-trial random-chunking soak). This means the stream a client receives is identical to the non-streamingmessage.content, keeping the healing-hash consistent acrossstream=true/stream=falseretries. - Switches to a buffered tool-call mode on the first
<tool_call>sentinel; end-of-stream extraction then emitsdelta.tool_callsplus any trailing non-tool text.
PI, LiteLLM's OpenAI-compatible wrapper, and other clients that understand reasoning_content render thinking live. Clients that don't understand the field see identical content and ignore the rest — no back-compat break.
The non-streaming path still derives message.content from the legacy _strip_thinking_from_content + _extract_openai_tool_calls pipeline (byte-identical to prior versions). When thinking was requested, message.reasoning_content is populated additively via the same splitter.
HEALING_STORE is a SHA-256-keyed OrderedDict (cap 2000). When a client echoes a stripped assistant message back in the next turn, _heal_messages looks up the hash of (content, canonicalised tool_calls) and swaps the stripped message for the full original (with <think> blocks intact). Tool-call arguments are canonicalised to parsed-dict form before hashing so clients that deserialise arguments differently from the server still hit the store.
The server extracts structured OpenAI tool_calls from model-generated XML:
- Legacy:
<tool_call>name <arg_key>k</arg_key><arg_value>v</arg_value></tool_call> - Qwen:
<tool_call><function=name><parameter=key>value</parameter></function></tool_call>
Both patterns are recognised independent of model_family; parser order is family-prioritised.
Any of these request fields map to enable_thinking (bool) internally:
enable_thinking: true/falsethinking: off|on|low|medium|high|xhigh|...(OpenClaw / Anthropic style)reasoning_effort: off|low|medium|high|xhigh(OpenAI / LiteLLM style)reasoning: {enabled|effort|level|type: ...}(Responses-style)- Nested
metadata.*/extra_body.*variants of the above
Default is controlled by DEFAULT_THINKING (default true).
VLMs are auto-detected from config.json via a whitelist of mlx-vlm model_type values (Qwen3.6, Qwen2.5-VL, GLM-4V, Llava, Gemma3, etc.). When a VLM is loaded:
- Messages with
image_urlorinput_imagecontent parts are processed viamlx_vlm.utils.prepare_inputs. Data URLs are base64-decoded to PIL; bare URLs pass through as strings. - The canonical cache key is derived using the text-only tokeniser (CPU, no GPU lock needed) so we don't re-acquire
model_lockfor cache lookup. - Image identity (G1): each image payload yields a 2-token SHA-256-derived marker injected after its
<|image_pad|>run in the canonical token stream. Two requests with the same message structure but different images now produce different cache keys. Without this,<|image_pad|>is a single token id that encodes identically for any image, and the block-hash index happily serves image A's KV for a request carrying image B. - Partial-image safety net: if a cache cut lands mid-image-pad-run (diagnosed via
_vlm_cache_covers_partial_image), the server evicts the offending entry, retreats to fresh prefill, and bumps thevlm_retreatcounter. With G1 active, this should never fire on clean traffic — any bump is a signal. - Metal sync (
torch.mps.synchronize()+mx.eval()) runs before generation to prevent Metal encoder collisions between PyTorch/torchvision and MLX.
Every request arms a threading.Timer(GENERATION_WATCHDOG_SECONDS) after acquiring model_lock. If it fires, a cooperative abort flag is set; the generator yield loop checks it between tokens and exits cleanly. On abort: partial KV state is not inserted into the cache, the session turn record is not advanced, the healing store is not updated, and the response terminates with finish_reason="length". The timer is cancelled unconditionally in the finally block so model_lock always releases. Default 600 s — set GENERATION_WATCHDOG_SECONDS=0 to disable.
_update_session_turn_store canonicalises the incoming message list and compares its leading prefix against the stored canonical form. Only strict appends in canonical space advance the record; any semantic divergence (new system prompt, mid-history edit, sub-agent branch) refuses the write to preserve the best-known linear record.
Three counters, atomically bumped and surfaced per-request in logs and at startup:
| Counter | Meaning | Expected value |
|---|---|---|
vlm_retreat |
Mid-image cache cut forced a fresh prefill | 0 on clean traffic (G1 effectiveness) |
hybrid_trim_miss |
Hybrid VLM cache rejected because can_trim_prompt_cache is False |
0 or low; justifies G4 (pre-generation checkpoint) |
exact_key_rejected_by_model_lcp |
Exact canonical key hit rejected because model_tokens diverged |
0 — any bump is a canonicalisation-over-normalisation bug |
model_lock— serialises GPU access (prefill + decode).prompt_cache_lock— protectsPROMPT_CACHE,SESSION_TURN_STORE,SESSION_INDEX.HEALING_STORE_LOCK— protects the healing OrderedDict.console_lock— serialises terminal output._CACHE_METRICS_LOCK— protects the correctness counter dict.
The server runs single-threaded (HTTPServer) because MLX streams are per-thread and mlx-vlm's generate_step issues bare mx.async_eval outside a with mx.stream(...) context. Every response sets Connection: close and self.close_connection = True so the accept loop doesn't park on a keep-alive socket from LiteLLM's connection pool. When upstream mlx-vlm moves to mx.new_thread_local_stream, this can revert to ThreadingHTTPServer.
_assert_cache_key_safety(opt-in viaCACHE_NORM_SAFETY_CHECK=true): asserts the cache key is not dramatically shorter than the original prompt (>10% delta = over-match). Falls back to original prompt as cache key on violation.- Metal crash fix: PyTorch MPS disabled at import time (
torch.backends.mps.is_built = lambda: False) to prevent Metal command buffer collisions with MLX. - Preserve-thinking probe: at startup,
_probe_preserve_thinking_support()tests whether the active tokenizer/processor accepts thepreserve_thinkingkwarg, caching the result soapply_chat_templatecalls stay consistent across cache-key and model-input paths.
| Variable | Default | Description |
|---|---|---|
MODEL_PATH |
lmstudio-community/GLM-4.7-Flash-MLX-4bit |
HuggingFace repo or local path |
MODEL_FAMILY |
auto-detected from path | qwen3, glm4, or generic |
PROXY_MODEL_ID |
openai/{MODEL_PATH} |
Model name exposed through LiteLLM |
MLX_HOST |
127.0.0.1 |
MLX server bind address |
MLX_PORT |
8080 |
MLX server port |
PROXY_PORT |
4000 |
LiteLLM proxy port |
PROXY_STARTUP_WAIT_SECONDS |
2.0 |
Seconds to wait for LiteLLM startup |
| Variable | Default | Description |
|---|---|---|
MAX_KV_SIZE |
196608 |
Maximum KV cache size in tokens |
KV_GROUP_SIZE |
64 |
KV cache group size (uniform scheme) |
KV_BITS |
OFF |
KV quantisation bits (4, 8, or OFF) |
KV_QUANT_SCHEME |
uniform |
uniform or turboquant |
QUANTIZED_KV_START |
5000 |
Token position where KV quantisation kicks in |
| Variable | Default | Description |
|---|---|---|
PROMPT_CACHE_MAX_ENTRIES_GLOBAL |
24 |
Max entries in the global block-hash cache |
PROMPT_CACHE_MAX_ENTRIES_PER_SESSION |
2 |
Max cache keys tracked per session |
PROMPT_CACHE_TTL_SECONDS |
1800 |
Time-to-live for cache entries (0 = infinite) |
PROMPT_CACHE_SESSION_MAX_IDLE_SECONDS |
1800 |
Session idle timeout for turn store (0 = infinite) |
PROMPT_CACHE_BLOCK_SIZE |
16 |
Token block size for hash chains |
CACHE_USE_BLOCK_INDEX |
true |
Enable block-hash index for O(blocks) lookup |
CACHE_CANONICALIZE_TOOL_CONTEXT |
true |
Enable dual-pipeline structured normalisation |
CACHE_VLM_IMAGE_IDENTITY |
true |
Inject per-image markers into canonical cache key (G1) |
CACHE_SESSION_PARTITIONING |
true |
Session partitioning flag (read but not applied — lookup is always global) |
CACHE_NORM_SAFETY_CHECK |
false |
Enable the 10%-delta cache-key safety assertion |
NORMALIZE_WRITE_TOOL_CONTENT_FOR_PROMPT |
false |
Replace write tool's content arg with a digest placeholder at render time |
| Variable | Default | Description |
|---|---|---|
DEFAULT_TEMPERATURE |
0.1 |
Sampling temperature |
DEFAULT_TOP_P |
0.9 |
Top-p (nucleus) sampling |
DEFAULT_TOP_K |
40 |
Top-k sampling |
DEFAULT_MIN_P |
0.05 |
Minimum probability threshold |
DEFAULT_REPETITION_PENALTY |
1.08 |
Repetition penalty |
DEFAULT_REPETITION_CONTEXT_SIZE |
256 |
Context window for repetition penalty |
DEFAULT_MAX_TOKENS |
2048 |
Default max output tokens |
DEFAULT_THINKING |
true |
Enable reasoning output by default |
PRESERVE_THINKING |
true |
Forward preserve_thinking=True to apply_chat_template when the tokeniser supports it |
GENERATION_WATCHDOG_SECONDS |
600 |
Wall-clock deadline per request (0 = disabled) |
| Variable | Default | Description |
|---|---|---|
ENABLE_REQUEST_LOGGING |
true |
Write per-request session logs to logs/ |
LOG_ROOT |
logs |
Log output directory |
VLM_CACHE_DEBUG |
false |
Log first 64 prompt token IDs and cache-diagnostic lines for VLM debugging |
| File | Purpose |
|---|---|
start-llm.py |
Entire server (~4900 LOC): API handler, dual pipeline, prompt cache, session layers, streaming splitter, watchdog |
install_and_run.py |
Bootstrap: verifies Python 3.12, detects venv contamination, installs deps, execs server |
.env / .env.example |
Runtime config |
requirements.txt |
Dependencies: mlx, mlx-lm, mlx-vlm, litellm[proxy], python-dotenv |
scripts/pi_like_probe.py |
Primary cache-hit / divergence validation harness — PI-shaped agentic probe |
scripts/probe_session.py |
4-scenario cache validation (normal, drift, semantic, insert) |
scripts/diff_turns.py |
Diff consecutive turns from cache session logs |
scripts/inspect_mlx_cache.py |
Inspect MLX KV cache state |
scripts/test_openclaw_integration.py |
Automated OpenClaw integration test harness |
.context/GOALS.md |
Strategic direction, active G-items (G1–G9) |
.context/TASKS.md |
Active portions and open items |
.context/LESSONS.md |
Hard-won knowledge — read before touching cache logic |
.claude/rules/server-core.md |
Auto-loaded rules when editing start-llm.py |
CLAUDE.md |
Project conventions and change protocol |
Any MLX-compatible text LM and any mlx-vlm 0.4.x-supported VLM. Tested and actively tuned for:
- Qwen3 / Qwen3.6 (text and hybrid VLM) — primary model_family
- GLM-4 / GLM-4.7-Flash — primary model_family with orphan-close
</think>handling - Generic LMs via mlx-lm's
load()path
Hybrid VLMs (Qwen3.6's Qwen3_5MoeForConditionalGeneration, mixing ArraysCache and KVCache layers) are handled with explicit offset scanning — layer 0 may lack .offset, so _kv_cache_offset walks layers to find the first attention cache.
No formal unit-test suite; verification is empirical via real agentic traffic.
python scripts/pi_like_probe.py— PI-shaped agentic probe, primary harness.python scripts/probe_session.py— 4-scenario cache validation.scripts/diff_turns.py— diff session prompts turn-by-turn to pinpoint cache-key divergence.logs/<date>/cache-session-*.log— per-session cache trace; first place to look when a fix may have regressed hit rate.
Healthy numbers: text T3 ≥ 97% hit, image T2 ≥ 94% hit, zero bumps on vlm_retreat and exact_key_rejected_by_model_lcp, hybrid_trim_miss small and only on cross-session cold transitions.
- Cache misses: inspect
logs/forstable_prefix_msg_count. If 0, an early message is drifting — checkcache_key_delta_charsto verify canonicalisation fired. Ifexact_key_rejected_by_model_lcpis bumping, the canonical pipeline is over-normalising. - Server hangs: watchdog should catch after
GENERATION_WATCHDOG_SECONDS. Ifmodel_lockis stuck, lower the timeout or investigate the Metal state viaActivity Monitor. - OOM: lower
PROMPT_CACHE_MAX_ENTRIES_GLOBALor enableKV_BITS=4. At 20k tokens each cache entry is ~2–3 GB of GPU memory. - Reasoning not rendering in client: verify client reads
delta.reasoning_content(OpenAI Qwen/DeepSeek convention). Clients that only readdelta.contentsee the answer without the thinking block — this is expected. - VLM Metal crash: the server disables PyTorch MPS at startup. If crashes persist, install
mlx-vlmwithout the torch extra:pip uninstall torch torchvision; pip install mlx-vlm. - Stale LiteLLM: auto-detected and killed on the proxy port at startup.
- Venv contamination (e.g.
python3.14binaries inside.venv/):install_and_run.pyrefuses to proceed and points at the artefacts. Delete the venv and re-run, or scrub the bin entries and restorepip -> pip3.12symlinks.
- Python 3.12 strict — enforced by
install_and_run.pypreflight. - macOS with Apple Silicon (M1/M2/M3/M4).
- Dependencies:
mlx,mlx-lm,mlx-vlm(optional — only loaded for VLM models),litellm[proxy],python-dotenv.