Version: 0.5.1
LLM Router is a service that can be deployed on‑premises or in the cloud. It adds a layer between any application and the LLM provider. In real time it controls traffic, distributes load among providers of a specific LLM, and enables analysis of outgoing requests from a security perspective (masking, anonymization, prohibited content). It is an open‑source solution (Apache 2.0) that can be launched instantly by running a ready‑made image in your own infrastructure.
The LLM‑Router project is split across five dedicated repositories:
| Repository | Description |
|---|---|
| llm-router (this repo) | Core gateway — unified REST proxy, Python SDK, and configuration management |
| llm-router-api | REST proxy that routes requests to any supported LLM backend (OpenAI‑compatible, Ollama, vLLM, LM Studio, Anthropic), with built‑in load‑balancing, health checks, streaming responses and optional Prometheus metrics |
| llm-router-lib | Python SDK that wraps the API with typed request/response models, automatic retries, token handling and a rich exception hierarchy |
| llm-router-web | Ready‑to‑use Flask UIs — a Config Manager for model/user settings and an Anonymizer UI that masks sensitive data |
| llm-router-plugins | Pluggable anonymizers (maskers), guardrails, semantic routing and RAG plugins |
| llm-router-services | HTTP services that power the plugin ecosystem (NASK‑PIB/Sojka guardrails, PII masker) |
| llm-router-utils | CLI tools, batch translation, GenAI classification and ready‑made deployment configs (Speakleash models) |
| Feature | Description |
|---|---|
| Unified REST interface | One endpoint schema works for OpenAI‑compatible, Ollama, vLLM, LM Studio and Anthropic. |
| Provider‑agnostic streaming | The stream flag (default true) controls whether the proxy forwards chunked responses as they arrive or returns a single aggregated payload. Streaming responses include proper Cache‑Control, Pragma, Expires and Vary headers. |
| Built‑in prompt library | Language‑aware system prompts stored under resources/prompts can be referenced automatically. |
| Dynamic model configuration | JSON file (models-config.json) defines providers, model name, default options and per‑model overrides. |
| Request validation | Pydantic models guarantee correct payloads; errors are returned with clear messages. |
| Structured logging | Configurable log level, filename, and optional JSON formatting. |
| Health & metadata endpoints | /ping (simple 200 OK) and /tags (available model tags/metadata). |
| Embeddings support | Dedicated endpoints for generating text embeddings across all supported providers. |
| Simple deployment | One‑liner run script, Docker image, or Helm chart for Kubernetes. |
| Extensible conversation formats | Basic chat, conversation with system prompt, and extended conversation with richer options (temperature, top‑k, custom system prompt). |
| Multi‑provider model support | Each model can be backed by multiple providers (VLLM, Ollama, OpenAI, Anthropic) defined in models-config.json. |
| Load‑balanced default strategy | LoadBalancedStrategy distributes requests evenly across providers using in‑memory usage counters. |
| Dynamic model handling | ModelHandler loads model definitions at runtime and resolves the appropriate provider per request. |
| Pluggable endpoint architecture | Automatic discovery and registration of all concrete EndpointI implementations via EndpointAutoLoader. |
| Prometheus metrics integration | Optional /metrics endpoint for latency, error counts, and provider usage statistics. |
| Docker & Kubernetes ready | Dockerfile (non‑root user) and Helm charts for containerised deployment. |
LLM Router uses a registry-based pipeline pattern. Each plugin implements a tiny, well‑defined apply method and
can be composed in an ordered list to form a pipeline. Pipelines are instantiated by the MaskerPipeline,
GuardrailPipeline and UtilsPipeline classes and are driven automatically by the endpoint logic in endpoint_i.py.
Request → MaskerPipeline → GuardrailPipeline → UtilsPipeline → Model Provider
| Plugin ID | Type | Description |
|---|---|---|
fast_masker |
Local | Regex‑based PII masker with 30+ rule types (emails, IPs, URLs, phone numbers, PESEL, NIP, KRS, REGON, monetary amounts, dates, credit cards, JWTs, passports and more). |
pii_masker |
HTTP (remote) | ML‑based PII masker using a token‑classification model with an in‑memory cache to avoid redundant model calls for identical text inputs. |
| Plugin ID | Type | Description |
|---|---|---|
nask_guard |
HTTP (remote) | Safety check using the HerBERT‑PL‑Guard model (NASK‑PIB). |
sojka_guard |
HTTP (remote) | Safety check using the Bielik‑Guard‑0.1B model from SpeakLeash. |
| Plugin ID | Type | Description |
|---|---|---|
langchain_rag |
Local | Retrieves relevant document chunks from a FAISS vector store and injects them into the payload for Retrieval‑Augmented Generation. |
simple_semantic_routing |
Local | Two‑stage heuristic model selection: intent classification + complexity analysis. Activated when payload["model"] == "auto". |
Pipelines are configured via environment variables:
# Comma-separated list of masker plugins to apply
export LLM_ROUTER_MASKING_STRATEGY_PIPELINE="fast_masker,pii_masker"
# Enable guardrails
export LLM_ROUTER_FORCE_GUARDRAIL_REQUEST=1
# Enable masking entirely
export LLM_ROUTER_FORCE_MASKING=1
# Record masking operations in audit log
export LLM_ROUTER_MASKING_WITH_AUDIT=1| Component | Description |
|---|---|
| KeepAliveMonitor | Periodically pings model endpoints to keep them warm (prevents cold‑start latency). |
| ProviderMonitor | Tracks per‑provider availability using Redis as a shared state store. |
| ServicesMonitor | Periodically health‑checks the llm-router-services endpoints (guardrails, maskers). |
python3 -m venv .venv
source .venv/bin/activate
# Only the core library (llm-router-lib).
pip install .
# Core library + API wrapper (llm-router-api).
pip install .[api]
# Core library + API wrapper + Prometheus metrics.
pip install .[api,metrics]Note: When Prometheus metrics are enabled,
LLM_ROUTER_USE_PROMETHEUS=1must be set and Redis is required ( used for provider availability state).
Then start the application with the environment variable set:
export LLM_ROUTER_USE_PROMETHEUS=1When LLM_ROUTER_USE_PROMETHEUS is enabled, the router automatically registers a /metrics endpoint (under the API
prefix, e.g. /api/metrics). This endpoint exposes Prometheus‑compatible metrics such as request counts, latencies, and
any custom counters defined by the application.
./run-rest-api.sh
# or
LLM_ROUTER_MINIMUM=1 python3 -m llm_router_api.rest_apiIntegration examples for popular LLM libraries (LlamaIndex, LangChain, OpenAI, LiteLLM, Haystack) are in the
examples/ directory. See examples README for details.
The router can record request‑level events (guard‑rail checks, payload masking, custom logs) in a tamper‑evident,
encrypted form. All audit entries are written by the auditor module and stored under logs/auditor/ as
GPG‑encrypted files.
For a complete guide — including key generation, encryption workflow, and decryption utilities — see:
➡️ Auditing subsystem documentation
Utility scripts:
scripts/gen_and_export_gpg.sh— generate and export GPG keysscripts/decrypt_auditor_logs.sh— decrypt encrypted audit logs
Run the container with the default configuration:
docker run -p 5555:8080 quay.io/radlab/llm-router:rc1For more advanced usage you can use a custom launch script:
#!/bin/bash
PWD=$(pwd)
docker run \
-p 5555:8080 \
-e LLM_ROUTER_TIMEOUT=500 \
-e LLM_ROUTER_IN_DEBUG=1 \
-e LLM_ROUTER_MINIMUM=1 \
-e LLM_ROUTER_EP_PREFIX="/api" \
-e LLM_ROUTER_SERVER_TYPE=gunicorn \
-e LLM_ROUTER_SERVER_PORT=8080 \
-e LLM_ROUTER_SERVER_WORKERS_COUNT=4 \
-e LLM_ROUTER_DEFAULT_EP_LANGUAGE="pl" \
-e LLM_ROUTER_LOG_FILENAME="llm-proxy-rest.log" \
-e LLM_ROUTER_EXTERNAL_TIMEOUT=300 \
-e LLM_ROUTER_BALANCE_STRATEGY=balanced \
-e LLM_ROUTER_REDIS_HOST="192.168.100.67" \
-e LLM_ROUTER_REDIS_PORT=6379 \
-e LLM_ROUTER_MODELS_CONFIG=/srv/cfg.json \
-e LLM_ROUTER_PROMPTS_DIR="/srv/prompts" \
-v "${PWD}/resources/configs/models-config.json":/srv/cfg.json \
-v "${PWD}/resources/prompts":/srv/prompts \
quay.io/radlab/llm-router:rc1Helm charts for Kubernetes deployment are available in the helm_charts/ directory.
A full list of environment variables is available at: API README
| Variable | Default | Description |
|---|---|---|
LLM_ROUTER_PROMPTS_DIR |
resources/prompts |
Directory containing predefined system prompts. |
LLM_ROUTER_MODELS_CONFIG |
resources/configs/models-config.json |
Path to the models configuration JSON file. |
LLM_ROUTER_DEFAULT_EP_LANGUAGE |
pl |
Default language for endpoint prompts. |
LLM_ROUTER_TIMEOUT |
0 |
Timeout (seconds) for llm-router API calls. |
LLM_ROUTER_EXTERNAL_TIMEOUT |
300 |
Timeout (seconds) for external model API calls. |
LLM_ROUTER_LOG_FILENAME |
llm-router.log |
Name of the log file. |
LLM_ROUTER_LOG_LEVEL |
INFO |
Logging level (e.g., INFO, DEBUG). |
LLM_ROUTER_EP_PREFIX |
/api |
Prefix for all API endpoints. |
LLM_ROUTER_MINIMUM |
False |
Run service in proxy‑only mode. |
LLM_ROUTER_IN_DEBUG |
False |
Run server in debug mode. |
LLM_ROUTER_BALANCE_STRATEGY |
balanced |
Load‑balancing strategy: balanced, weighted, dynamic_weighted, first_available, first_available_optim. |
LLM_ROUTER_SERVER_TYPE |
flask |
Server implementation: flask, gunicorn, waitress. |
LLM_ROUTER_SERVER_PORT |
8080 |
Port on which the server listens. |
LLM_ROUTER_SERVER_HOST |
localhost |
Host address for the server. |
LLM_ROUTER_SERVER_WORKERS_COUNT |
2 |
Number of workers. |
LLM_ROUTER_SERVER_THREADS_COUNT |
8 |
Number of worker threads. |
LLM_ROUTER_SERVER_WORKER_CLASS |
None |
Worker class for servers that support it. |
LLM_ROUTER_USE_PROMETHEUS |
False |
Enable Prometheus metrics (/metrics endpoint). |
| Variable | Default | Description |
|---|---|---|
LLM_ROUTER_FORCE_MASKING |
False |
Enable force‑masking of every endpoint's payload. |
LLM_ROUTER_MASKING_STRATEGY_PIPELINE |
["fast_masker"] |
Ordered list of masker plugins (e.g. fast_masker,pii_masker). |
LLM_ROUTER_MASKING_WITH_AUDIT |
False |
Record each masking operation in the audit log. |
LLM_ROUTER_FORCE_GUARDRAIL_REQUEST |
False |
Force guardrail evaluation on every request. |
LLM_ROUTER_MASKER_PII_HOST |
— | Host URL for the PII masker service. |
LLM_ROUTER_GUARDRAIL_SOJKA_GUARD_HOST |
— | Host URL for the Sojka guardrail service. |
| Variable | Default | Description |
|---|---|---|
LLM_ROUTER_REDIS_HOST |
(empty) | Redis host for load‑balancing across multi‑provider models. |
LLM_ROUTER_REDIS_PORT |
6379 |
Redis port. |
LLM_ROUTER_REDIS_PASSWORD |
(not set) | Redis password. |
LLM_ROUTER_REDIS_DB |
0 |
Redis database number. |
Redis is now mandatory. The router raises
RuntimeErrorat startup if Redis is unavailable.
The current list of available strategies, the interface description, and an example extension can be found at: Load‑Balancing Strategies
Strategies: balanced, weighted, dynamic_weighted, first_available, first_available_optim.
The list of endpoints — categorized into built‑in, provider‑dependent, and utility endpoints — and a description of the streaming mechanisms can be found at: Endpoints Overview
| Endpoint | Method | Description |
|---|---|---|
/ping |
GET | Health‑check |
/tags |
GET | List Ollama model tags |
/models |
GET | List OpenAI‑compatible models |
/api/chat/completions |
POST | OpenAI‑style chat completion |
/api/v1/chat/completions |
POST | vLLM‑like chat completion |
/v1/messages |
POST | Anthropic‑compatible messages endpoint (Claude) |
/v1/responses |
POST | OpenAI‑like responses endpoint |
/api/embeddings |
POST | Standard embeddings |
/api/conversation_with_model |
POST | Built‑in standard chat |
/api/extended_conversation_with_model |
POST | Built‑in chat with extended fields |
/api/generative_answer |
POST | Answer a question using provided context |
/api/translate |
POST | Translate texts |
/api/generate_questions |
POST | Generate questions from texts |
/api/simplify_text |
POST | Simplify input texts |
Full web UI for managing LLM Router model configurations:
- Multi‑user with authentication and role‑based access (admin/user)
- Projects — group configurations by project
- Model configuration — create, edit, import/export JSON configs; manage providers across families (Google, OpenAI, Qwen)
- Version control — snapshot history with restore capability
- Active model selection — choose which models to activate per config
- Drag‑and‑drop provider reordering (HTMX)
- Light/dark themes (Alpine.js)
- 26+ API endpoints under
/configs
Run: ./run-configs-manager.sh
Web UI for text anonymization and interactive chat:
- 3 anonymization algorithms:
fast(regex),pii_masking(ML model),fast+pii(hybrid) - Interactive chat with streaming SSE responses and session persistence
- Dynamic model selection from the router
- i18n — Polish and English translations (122 keys)
- Privacy warnings when anonymization is disabled
- Privacy policy & terms pages
Run: ./run-anonymizer.sh
The llm-router-utils repository provides CLI tools and ready‑made deployment configs:
| Tool | Description |
|---|---|
translate-texts |
Batch translate texts in JSON/JSONL datasets via LLM Router |
genai-classifier |
Classify dataset texts using LLM prompts with multi‑threading and XLSX export |
The resources/llm-router-speakleash/ directory contains ready‑made configs for deploying Speakleash models:
speakleash-models.json— configuresBielik-11B-v2.3-Instructacross 8 vLLM providers on 3 hostsrun-bielik-*.sh— vLLM launch scripts for each GPU (cuda:0, cuda:1, cuda:2)run-rest-api-gunicorn.sh— full LLM Router server with masking, guardrails, Redis balancing and Prometheus metricsrun-sojka-guardrail.sh— guardrail service with Bielik‑Guard model
| Config File / Variable | Meaning |
|---|---|
resources/configs/models-config.json |
JSON map of provider → model → default options (e.g., keep_alive, options.num_ctx). |
LLM_ROUTER_PROMPTS_DIR |
Directory containing prompt templates (*.prompt). Sub‑folders are language‑specific (en/, pl/). |
LLM_ROUTER_DEFAULT_EP_LANGUAGE |
Language code used when a prompt does not explicitly specify one. |
LLM_ROUTER_TIMEOUT |
Upper bound for any request to an upstream LLM (seconds). |
LLM_ROUTER_LOG_FILENAME / LLM_ROUTER_LOG_LEVEL |
Logging destinations and verbosity. |
LLM_ROUTER_IN_DEBUG |
When set, enables DEBUG‑level logs and more verbose error payloads. |
- Python 3.10+ (project is tested on 3.10.6)
- All dependencies are listed in
requirements.txt. Install them inside the virtualenv. - To add a new provider, create a class in
llm_router_api/core/api_typesthat implements theBaseProviderinterface and register it inllm_router_api/register/__init__.py.
See the CHANGELOG for a complete history of changes.
See the LICENSE file.