ailiance-agent can run a local stack that routes LLM requests intelligently:
isaac → Jina router (:5050) → LiteLLM proxy (:4000) → endpoints
# 1. Install (creates Python venvs in ~/.isaac/)
isaac stack install
# 2. Start
isaac stack start
# 3. Configure isaac to use the local stack
# Edit your isaac settings (or pass via VS Code config):
# apiProvider: "litellm"
# liteLlmBaseUrl: "http://127.0.0.1:5050"
# liteLlmApiKey: "sk-isaac-local-master-key"
# liteLlmModelId: "auto" # let the router pick
# 4. Verify
isaac stack status
# 5. Stop when done
isaac stack stop- Multiplexing across providers (Anthropic, OpenAI, Ollama, ailiance workers)
- Native fallback, retry, cost tracking, response cache
- Config:
~/.isaac/litellm/config.yaml - RAM: ~300 MB
- Edit the config to add/remove models. Required env vars:
ANTHROPIC_API_KEY,OPENAI_API_KEY, etc.
- Embeds incoming queries via
jinaai/jina-embeddings-v2-small-en(~80 MB model, ~150 MB RAM) - Classifies intent into:
code,chat,search,agent - Picks the preferred model per category (configurable)
- Forwards to LiteLLM with the chosen model
- Routes config:
~/.isaac/jina-router/routes.json - RAM: ~150 MB
Total local RAM: ~450 MB.
Edit ~/.isaac/jina-router/routes.json:
{
"code": {
"examples": ["refactor this", "fix the bug"],
"preferred_model": "claude-sonnet-4-5"
},
"embedded": {
"examples": ["esp32 code", "kicad schematic"],
"preferred_model": "qwen-coder-32b"
}
}The router computes the centroid embedding per category and picks the closest one to the user query.
If you don't want semantic routing:
isaac proxy start # only the LiteLLM proxy
# liteLlmBaseUrl: "http://127.0.0.1:4000"Edit your isaac settings.json (workspace or global):
{
"enabledMcpServers": ["claude-mem", "context7"],
"mcpToolDenylist": ["mcp__some_plugin__dangerous_tool"]
}enabledMcpServers: only these MCP servers from plugins will be loaded. Omit or setnullto load all (default).mcpToolDenylist: qualified tool names (mcp__plugin_server__tool) to exclude even if the server is enabled.mcpToolAllowlist: if set, only these tools are exposed — overridesmcpToolDenylist.
All three settings are optional. Without them, all plugin MCP servers and all their tools are available.
Once you've started the stack with isaac stack start, enable auto-detect in your settings:
{
"useLocalStack": true
}Now whenever the LiteLLM provider is used, isaac will:
- Check if the local stack is running
- If yes → route via the Jina router (port 5050) automatically
- If no → fall back to your
liteLlmBaseUrlsetting
You don't need to update liteLlmBaseUrl to switch between local stack mode and remote LiteLLM mode.
To disable, set useLocalStack: false (or omit) and isaac uses liteLlmBaseUrl as before.
Detection results are cached for 30 seconds — no per-request port scan overhead.
Au lieu de passer par le stack Python (LiteLLM proxy + Jina router), isaac embed un mini-routeur in-process qui :
- Cache les réponses LLM (LRU 100 entries, TTL 1h)
- Ping les workers en arrière-plan toutes les 30s (skip ceux DOWN)
- Classifie les prompts par heuristic (code/fr/reason/general) et choisit le meilleur worker dispo
Activation :
{
"useLocalRouter": true
}Pas besoin de isaac stack install/start — le LocalRouter est embarqué.
Pour configurer tes propres workers :
{
"useLocalRouter": true,
"localRouterWorkers": [
{
"id": "my-mlx",
"url": "http://my-mac.tailscale.ts.net:8080/v1",
"modelId": "qwen-coder-32b",
"capabilities": ["code", "general"],
"priority": 10
}
]
}Limitations PR2 : le LocalRouter ne prend en charge que les messages texte (pas d'images, pas de tool_calls). Les messages avec blocs non-texte et le streaming passent automatiquement par le proxy HTTP classique. PR3 ajoutera le streaming via LocalRouter si demandé.
isaac stack installfails on Python: install viabrew install uv(recommended) orbrew install python@3.11- Slow first start of router: it downloads the embeddings model from Hugging Face (~80 MB). Subsequent starts are instant.
- Provider returns 401: set
liteLlmApiKeyin isaac to match the master key in~/.isaac/litellm/config.yaml - Logs:
~/.isaac/litellm.logand~/.isaac/jina-router.log
For most users, isaac ships with an in-process LocalRouter that
provides similar features (multi-worker dispatch, cache, health monitoring)
without requiring Python sub-processes. See docs/local-router.md
for details.
| Use the local stack when... | Use LocalRouter when... |
|---|---|
| You want LiteLLM's 100+ provider list | You target Ailiance workers (Gemma/Apertus/EuroLLM) |
| You need cost tracking, retries, complex fallback | You need lowest latency overhead |
| You prefer external services to monitor independently | You want zero install (no Python) |
Primary tool-capable worker for agentic requests. Launched manually via
llama-server (no systemd unit). Tunnel: autossh electron-server:8002 →
kxkm-ai:18888. Exposed by the ailiance gateway on :9300 when tools[]
is present.
cd /home/kxkm
./llama.cpp/build/bin/llama-server \
-m models/Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf \
-ngl 99 \
--override-tensor "ffn_(up|gate|down)_exps\.weight=CPU" \
--ctx-size 196608 \
-fa on -b 512 -ub 256 \
--cache-type-k q8_0 --cache-type-v q8_0 \
-np 1 --reasoning-format none \
--host 0.0.0.0 --port 18888 \
--api-key <key> --alias qwen-32b-awq \
--jinja --metrics| Ctx | KV cache | VRAM used | Free margin (24 GB RTX 4090) |
|---|---|---|---|
| 32k | ~3-4 GB | ~7 GB | 17 GB |
| 65k | ~7 GB | ~9 GB | 15 GB |
| 128k | ~14 GB | ~14 GB | 10 GB |
| 192k (current) | ~17 GB | ~8 GB | 16 GB |
| 256k | OOM | — | — |
With MoE A3B + --override-tensor FFN→CPU, only attention layers remain in
VRAM. KV cache is also compressed to q8, so VRAM stays low (~8 GB) even at
192k. Throughput: ~31 tok/s output, ~92 tok/s prompt.
useAutoCondense: true (default) triggers a conversation history summary
when approaching the context limit, so practical usage fits well within 192k.