Local LLM Stack: LiteLLM proxy + Jina semantic router

ailiance-agent can run a local stack that routes LLM requests intelligently:

isaac  →  Jina router (:5050)  →  LiteLLM proxy (:4000)  →  endpoints

Quick start

# 1. Install (creates Python venvs in ~/.isaac/)
isaac stack install

# 2. Start
isaac stack start

# 3. Configure isaac to use the local stack
# Edit your isaac settings (or pass via VS Code config):
#   apiProvider: "litellm"
#   liteLlmBaseUrl: "http://127.0.0.1:5050"
#   liteLlmApiKey: "sk-isaac-local-master-key"
#   liteLlmModelId: "auto"   # let the router pick

# 4. Verify
isaac stack status

# 5. Stop when done
isaac stack stop

Architecture

LiteLLM proxy (port 4000)

Multiplexing across providers (Anthropic, OpenAI, Ollama, ailiance workers)
Native fallback, retry, cost tracking, response cache
Config: ~/.isaac/litellm/config.yaml
RAM: ~300 MB
Edit the config to add/remove models. Required env vars: ANTHROPIC_API_KEY, OPENAI_API_KEY, etc.

Jina semantic router (port 5050)

Embeds incoming queries via jinaai/jina-embeddings-v2-small-en (~80 MB model, ~150 MB RAM)
Classifies intent into: code, chat, search, agent
Picks the preferred model per category (configurable)
Forwards to LiteLLM with the chosen model
Routes config: ~/.isaac/jina-router/routes.json
RAM: ~150 MB

Total local RAM: ~450 MB.

Custom routes

Edit ~/.isaac/jina-router/routes.json:

{
  "code": {
    "examples": ["refactor this", "fix the bug"],
    "preferred_model": "claude-sonnet-4-5"
  },
  "embedded": {
    "examples": ["esp32 code", "kicad schematic"],
    "preferred_model": "qwen-coder-32b"
  }
}

The router computes the centroid embedding per category and picks the closest one to the user query.

Bypass the router (use proxy directly)

If you don't want semantic routing:

isaac proxy start                    # only the LiteLLM proxy
# liteLlmBaseUrl: "http://127.0.0.1:4000"

Filtering MCP servers and tools

Edit your isaac settings.json (workspace or global):

{
  "enabledMcpServers": ["claude-mem", "context7"],
  "mcpToolDenylist": ["mcp__some_plugin__dangerous_tool"]
}

enabledMcpServers: only these MCP servers from plugins will be loaded. Omit or set null to load all (default).
mcpToolDenylist: qualified tool names (mcp__plugin_server__tool) to exclude even if the server is enabled.
mcpToolAllowlist: if set, only these tools are exposed — overrides mcpToolDenylist.

All three settings are optional. Without them, all plugin MCP servers and all their tools are available.

Auto-detect (zero-config)

Once you've started the stack with isaac stack start, enable auto-detect in your settings:

{
  "useLocalStack": true
}

Now whenever the LiteLLM provider is used, isaac will:

Check if the local stack is running
If yes → route via the Jina router (port 5050) automatically
If no → fall back to your liteLlmBaseUrl setting

You don't need to update liteLlmBaseUrl to switch between local stack mode and remote LiteLLM mode.

To disable, set useLocalStack: false (or omit) and isaac uses liteLlmBaseUrl as before.

Detection results are cached for 30 seconds — no per-request port scan overhead.

Mode "speed" : LocalRouter natif

Au lieu de passer par le stack Python (LiteLLM proxy + Jina router), isaac embed un mini-routeur in-process qui :

Cache les réponses LLM (LRU 100 entries, TTL 1h)
Ping les workers en arrière-plan toutes les 30s (skip ceux DOWN)
Classifie les prompts par heuristic (code/fr/reason/general) et choisit le meilleur worker dispo

Activation :

{
  "useLocalRouter": true
}

Pas besoin de isaac stack install/start — le LocalRouter est embarqué.

Pour configurer tes propres workers :

{
  "useLocalRouter": true,
  "localRouterWorkers": [
    {
      "id": "my-mlx",
      "url": "http://my-mac.tailscale.ts.net:8080/v1",
      "modelId": "qwen-coder-32b",
      "capabilities": ["code", "general"],
      "priority": 10
    }
  ]
}

Limitations PR2 : le LocalRouter ne prend en charge que les messages texte (pas d'images, pas de tool_calls). Les messages avec blocs non-texte et le streaming passent automatiquement par le proxy HTTP classique. PR3 ajoutera le streaming via LocalRouter si demandé.

Troubleshooting

isaac stack install fails on Python: install via brew install uv (recommended) or brew install python@3.11
Slow first start of router: it downloads the embeddings model from Hugging Face (~80 MB). Subsequent starts are instant.
Provider returns 401: set liteLlmApiKey in isaac to match the master key in ~/.isaac/litellm/config.yaml
Logs: ~/.isaac/litellm.log and ~/.isaac/jina-router.log

Alternative: in-process LocalRouter

For most users, isaac ships with an in-process LocalRouter that provides similar features (multi-worker dispatch, cache, health monitoring) without requiring Python sub-processes. See docs/local-router.md for details.

Use the local stack when...	Use LocalRouter when...
You want LiteLLM's 100+ provider list	You target Ailiance workers (Gemma/Apertus/EuroLLM)
You need cost tracking, retries, complex fallback	You need lowest latency overhead
You prefer external services to monitor independently	You want zero install (no Python)

Qwen3-Next 80B-A3B (kxkm-ai, port 18888 → gateway :8002)

Primary tool-capable worker for agentic requests. Launched manually via llama-server (no systemd unit). Tunnel: autossh electron-server:8002 → kxkm-ai:18888. Exposed by the ailiance gateway on :9300 when tools[] is present.

cd /home/kxkm
./llama.cpp/build/bin/llama-server \
  -m models/Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  --override-tensor "ffn_(up|gate|down)_exps\.weight=CPU" \
  --ctx-size 196608 \
  -fa on -b 512 -ub 256 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -np 1 --reasoning-format none \
  --host 0.0.0.0 --port 18888 \
  --api-key <key> --alias qwen-32b-awq \
  --jinja --metrics

Ctx	KV cache	VRAM used	Free margin (24 GB RTX 4090)
32k	~3-4 GB	~7 GB	17 GB
65k	~7 GB	~9 GB	15 GB
128k	~14 GB	~14 GB	10 GB
192k (current)	~17 GB	~8 GB	16 GB
256k	OOM	—	—

With MoE A3B + --override-tensor FFN→CPU, only attention layers remain in VRAM. KV cache is also compressed to q8, so VRAM stays low (~8 GB) even at 192k. Throughput: ~31 tok/s output, ~92 tok/s prompt.

useAutoCondense: true (default) triggers a conversation history summary when approaching the context limit, so practical usage fits well within 192k.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Local LLM Stack: LiteLLM proxy + Jina semantic router

Quick start

Architecture

LiteLLM proxy (port 4000)

Jina semantic router (port 5050)

Custom routes

Bypass the router (use proxy directly)

Filtering MCP servers and tools

Auto-detect (zero-config)

Mode "speed" : LocalRouter natif

Troubleshooting

Alternative: in-process LocalRouter

Qwen3-Next 80B-A3B (kxkm-ai, port 18888 → gateway :8002)

Uh oh!

FilesExpand file tree

local-stack.md

Latest commit

History

local-stack.md

File metadata and controls

Local LLM Stack: LiteLLM proxy + Jina semantic router

Quick start

Architecture

LiteLLM proxy (port 4000)

Jina semantic router (port 5050)

Custom routes

Bypass the router (use proxy directly)

Filtering MCP servers and tools

Auto-detect (zero-config)

Mode "speed" : LocalRouter natif

Troubleshooting

Alternative: in-process LocalRouter

Qwen3-Next 80B-A3B (kxkm-ai, port 18888 → gateway :8002)