macOS (Apple Silicon): launcher never starts Ollama, no model installed, runaway context — with fixes #58

@siluri

Thanks for this project! While getting it running on macOS (Apple Silicon, M1 Max, Ollama 0.24) I hit a few Mac-specific issues. Reporting them together with the fixes that worked for me — happy to open a PR for the launcher + Modelfile changes if useful.

1. start-mac.command never starts the engine on current Ollama (blocker)

The launcher serves via the GUI binary:

"$MAC_OLLAMA_DIR/Ollama.app/Contents/MacOS/Ollama" serve

On current Ollama releases this prints "serve command not supported, use ollama" and exits, so no server runs and AnythingLLM has nothing to connect to. The CLI binary that actually serves is Ollama.app/Contents/Resources/ollama.

Fix (prefer the CLI binary, keep old layouts as fallback):

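# Prefer the CLI binary inside the app bundle; it is the one that accepts "serve".
if [ -f "$MAC_OLLAMA_DIR/Ollama.app/Contents/Resources/ollama" ]; then
    "$MAC_OLLAMA_DIR/Ollama.app/Contents/Resources/ollama" serve > /dev/null 2>&1 &
# Fallback: standalone CLI binary from an older layout.
elif [ -f "$MAC_OLLAMA_DIR/ollama" ]; then
    "$MAC_OLLAMA_DIR/ollama" serve > /dev/null 2>&1 &
# Last resort: the GUI binary, which rejects "serve" on current releases.
elif [ -f "$MAC_OLLAMA_DIR/Ollama.app/Contents/MacOS/Ollama" ]; then
    "$MAC_OLLAMA_DIR/Ollama.app/Contents/MacOS/Ollama" serve > /dev/null 2>&1 &
fi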
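
Because the launcher sends the server's output to /dev/null, a failed start is silent. A minimal readiness check could look like the sketch below. It assumes Ollama's default address (127.0.0.1:11434) and its /api/version endpoint; wait_for_ollama is a hypothetical helper name, not something in the existing launcher.

```shell
# Hypothetical helper: poll the Ollama HTTP API until it answers or we give up.
# Arguments (all optional): <base URL> <number of tries> <delay in seconds>
wait_for_ollama() {
    base_url="${1:-http://127.0.0.1:11434}"
    tries="${2:-30}"
    delay="${3:-1}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -s: quiet, -f: treat HTTP errors as failure, -o: discard the body
        if curl -sf -o /dev/null "$base_url/api/version"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

Called right after the serve line, something like wait_for_ollama || echo "Ollama did not start" would surface the failure before AnythingLLM is launched.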

2. No model gets installed on macOS

The interactive model menu only exists for Windows (install.bat / install-core.ps1) and Linux (linux/install-core.sh). On macOS, start-mac.command downloads Ollama + AnythingLLM but never pulls a model, and models/installed-models.txt is never created, so it defaults to a non-existent nemomix-local. Result: AnythingLLM has no usable model.

Suggestion: add a macOS model step mirroring Linux — either ollama pull from the registry, or download a GGUF + ollama create <name> -f Modelfile, then write models/installed-models.txt (local_name|nice_name|label).
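
The bookkeeping half of that step could be sketched as follows. record_model is a hypothetical helper, and whether the file should be appended to or overwritten is an assumption on my part; only the local_name|nice_name|label format comes from the existing installers.

```shell
# Hypothetical helper: add one entry to <models dir>/installed-models.txt
# in the local_name|nice_name|label format.
# ASSUMPTION: entries are appended, one model per line.
record_model() {
    models_dir="$1"
    local_name="$2"
    nice_name="$3"
    label="$4"
    mkdir -p "$models_dir"
    printf '%s|%s|%s\n' "$local_name" "$nice_name" "$label" \
        >> "$models_dir/installed-models.txt"
}
```

In the launcher this would run only after a successful ollama pull <model> or ollama create <name> -f Modelfile, for example ollama create nemomix-local -f Modelfile && record_model models nemomix-local "NemoMix" "default" (the nice name and label here are placeholders).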

3. Runaway context → CPU offload → ~1.4 tok/s (performance)

Importing a GGUF with a bare FROM ./model.gguf makes Ollama adopt the model's declared context (NemoMix advertises ~1,024,000). Ollama then allocates a huge KV cache (observed: 256K context, ~40 GB KV cache), spills layers to CPU (21%/79% CPU/GPU), and crawls at ~1.4 tok/s on an M1 Max.
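
The ~40 GB figure is consistent with a back-of-the-envelope estimate: an f16 KV cache needs roughly 2 (keys and values) × layers × KV heads × head dimension × 2 bytes per token. The sketch below assumes a Mistral-Nemo-like shape (40 layers, 8 KV heads, head dimension 128). Those numbers are not from this report, so treat them as assumptions.

```shell
# Rough f16 KV-cache size in MiB for a given context length.
# ASSUMPTION: Mistral-Nemo-like shape (40 layers, 8 KV heads, head dim 128).
kv_cache_mib() {
    ctx="$1"
    layers=40
    kv_heads=8
    head_dim=128
    bytes_per_elem=2   # f16
    # Leading 2 accounts for storing both keys and values.
    echo $(( 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx / 1024 / 1024 ))
}

kv_cache_mib 262144   # 256K context: prints 40960 (about 40 GiB)
kv_cache_mib 8192     # 8K context:   prints 1280  (about 1.25 GiB)
```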

Fix: pin a sane context in the Modelfile, e.g. PARAMETER num_ctx 8192. Same machine afterwards: 100% GPU, ~31 tok/s (≈22× faster).
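
A minimal sketch of generating such a Modelfile from the launcher or installer. write_pinned_modelfile is a hypothetical helper, and the GGUF path and context size are passed in rather than assumed.

```shell
# Hypothetical helper: write a minimal Modelfile that pins the context window.
# Arguments: <path to GGUF> <num_ctx> <output Modelfile path>
write_pinned_modelfile() {
    gguf_path="$1"
    num_ctx="$2"
    out="$3"
    cat > "$out" <<EOF
FROM $gguf_path
PARAMETER num_ctx $num_ctx
EOF
}
```

Usage would be write_pinned_modelfile ./model.gguf 8192 ./Modelfile followed by ollama create <name> -f ./Modelfile. Once the model is loaded, ollama ps shows the CPU/GPU split in its PROCESSOR column, which is a quick way to confirm nothing is being offloaded.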

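Optional speed/memory env in the launcher: Flash Attention plus q8_0 KV-cache quantization (the quantized cache only takes effect with Flash Attention enabled) roughly halves KV memory compared with the default f16 cache, so the context can be doubled at about the same memory cost: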

export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
export OLLAMA_KEEP_ALIVE=30m   # keep the model warm, avoid cold-reload lag

(Measured: ctx 8192 → 16384, KV cache ~2.7 GB → ~1.36 GB, still 100% GPU, ~31 tok/s.)

4. GGUF imports need a chat template (quality)

A bare FROM import leaves the Ollama template at the default {{ .Prompt }}, so chat output is incoherent (the model just continues raw text). For NemoMix (Mistral-Nemo base) a Mistral [INST] … [/INST] TEMPLATE plus </s> / [INST] / [/INST] stop params fixes it.
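
For illustration, a Modelfile along those lines could be written like this. The template text is a common Mistral-style form and an assumption, not a verbatim copy of the one I ended up with; write_chat_modelfile is a hypothetical helper.

```shell
# Hypothetical helper: write a Modelfile with a Mistral-style chat template
# and stop sequences. The quoted heredoc ('EOF') keeps {{ ... }} literal.
# ASSUMPTION: simple single-turn [INST] template; adjust for the actual model.
write_chat_modelfile() {
    out="$1"
    cat > "$out" <<'EOF'
FROM ./model.gguf
TEMPLATE """[INST] {{ if .System }}{{ .System }} {{ end }}{{ .Prompt }} [/INST]"""
PARAMETER stop "</s>"
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
EOF
}
```

After ollama create <name> -f Modelfile, ollama show <name> --modelfile prints the stored Modelfile, which makes it easy to check that the template was picked up instead of the default.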


Environment: macOS, Apple Silicon (M1 Max, 64 GB), Ollama 0.24 (bundled ollama-darwin.zip).
