Backend-agnostic, OpenAI-compatible proxy for dynamic model loading and unloading.
DynLLM sits between your OpenAI-compatible client (OpenWebUI, LangChain, curl, …) and your local inference backends (llama.cpp, OpenVINO Model Server, and/or Hugging Face transformers). It automatically loads models on demand, tracks VRAM usage, evicts models when memory is tight, and unloads idle models after a configurable timeout.
- OpenAI-compatible API – `/v1/chat/completions`, `/v1/completions`, `/v1/audio/transcriptions`, `/v1/audio/translations`, `/v1/audio/speech`, `/v1/images/generations`, `/v1/embeddings`, `/v1/rerank`, `/v1/models`
- Dynamic loading – models are started on first request and stopped when idle
- VRAM budgeting – LIFO eviction keeps total GPU memory within a configured limit
- Multi-backend – supports llama.cpp (GGUF), OpenVINO Model Server (IR), and `transformers serve`
- Per-model idle timeout – override the global timeout per model, or set `inf`/`-1` to never auto-unload
- Startup preloading – specify models to load when DynLLM starts
- Safe mid-generation – active inference requests are never interrupted by eviction
- Persistent state – SQLite database survives restarts; stale states are healed on startup
- systemd integration – ships with a ready-made unit file and installer script
- Python 3.11+ and uv
- At least one backend installed:
  - llama.cpp – build `llama-server` from ggerganov/llama.cpp
  - OpenVINO Model Server – see the OVMS installation guide
  - Hugging Face transformers – install `transformers[serving]`; for Intel GPUs install torch XPU wheels first
```bash
# 1. Clone and install
git clone https://github.com/youruser/DynLLM
cd DynLLM
uv sync

# 2. Create a config
cp config.example.yaml config.yaml
# Edit config.yaml – set total_vram_mb, add your models

# 3. Run
uv run dynllm
# or with an explicit config path:
uv run dynllm --config /path/to/config.yaml
```

The proxy starts on http://0.0.0.0:8000 by default.
Copy config.example.yaml to config.yaml and adjust as needed.
| Key | Default | Description |
|---|---|---|
| `server.host` | `0.0.0.0` | Bind address |
| `server.port` | `8000` | Listen port |
| `total_vram_mb` | `8192` | VRAM budget in MB; eviction fires when exceeded |
| `idle_timeout_seconds` | `300` | Global idle auto-unload timeout (seconds) |
| `enabled_backends` | `[llamacpp, openvino]` | Active backends |
| `models_dir` | (none) | Optional base dir for relative model paths |
| `db_path` | `dynllm_state.db` | SQLite state database path |
| `log_level` | `info` | `debug` / `info` / `warning` / `error` |
| `preload_models` | `[]` | Model names to load on startup |
| Key | Default | Description |
|---|---|---|
| `llamacpp_binary` | `llama-server` | Path or name of the llama-server binary |
| `ovms_binary` | `ovms` | Path or name of the OVMS binary |
| `transformers_binary` | `transformers` | Path or name of the Hugging Face transformers CLI |
| `port_range_start` | `9100` | Start of port range for backend subprocesses |
| `port_range_end` | `9200` | End of port range for backend subprocesses |
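As a rough illustration of how a port could be picked from `port_range_start` to `port_range_end` for a new backend subprocess, the sketch below scans the range and skips ports that are already assigned or cannot be bound. The function names, the `reserved` set, and the inclusive treatment of `port_range_end` are assumptions made for this example, not DynLLM's actual implementation.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if we can bind the port right now (best-effort check)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def pick_backend_port(start: int, end: int, reserved: set, is_free=port_is_free) -> int:
    """Return the first usable port in [start, end].

    `reserved` holds ports already handed to running backends (hypothetical
    bookkeeping). Treating `end` as inclusive is an assumption.
    """
    for port in range(start, end + 1):
        if port not in reserved and is_free(port):
            return port
    raise RuntimeError(f"no free port in range {start}-{end}")
```

A bind check like this is inherently racy (another process can grab the port before the backend starts), so a real implementation would likely also handle a backend that fails to bind and retry with the next port.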
| Field | Required | Description |
|---|---|---|
| `name` | yes | Unique model ID; used as the `model` field in API requests |
| `path` | yes | Path to the `.gguf` file, OpenVINO IR directory, or local Hugging Face model directory |
| `backend` | yes | `llamacpp`, `openvino`, or `transformers` |
| `model_type` | no | `llm`, `transcription`, `speech`, `image_generation`, `embedding`, `rerank`, `classification`, `detection`, `segmentation`, `ocr`. Default: `llm` |
| `vram_mb` | yes | Estimated VRAM in MB when loaded (used for eviction math) |
| `target_device` | no | OpenVINO target device (`CPU`, `GPU`, `NPU`). Default: `CPU` |
| `n_gpu_layers` | no | llama.cpp only – GPU layers (`-1` = all). Default: `-1` |
| `context_size` | no | llama.cpp only – context window size. Default: `4096` |
| `ovms_shape` | no | OpenVINO only – shape hint (e.g. `"auto"`) |
| `device` | no | transformers only – execution device (`auto`, `cpu`, `cuda`, `xpu`) |
| `dtype` | no | transformers only – load dtype (`auto`, `float16`, `bfloat16`, `float32`) |
| `quantization` | no | transformers only – `none`, `bnb-4bit`, or `bnb-8bit` for bitsandbytes quantization |
| `trust_remote_code` | no | transformers only – allow custom repo code when required |
| `compile_model` | no | transformers only – enable `torch.compile` through `transformers serve` |
| `continuous_batching` | no | transformers only – enable continuous batching for supported LLMs |
| `attn_implementation` | no | transformers only – `auto`, `eager`, `sdpa`, `flash_attention_2`, `flash_attention_3`, `flex_attention` |
| `model_timeout` | no | transformers only – backend-side idle timeout in seconds |
| `revision` | no | transformers only – HF revision rendered as `model@revision` |
| `unload_time` | no | Per-model idle timeout (seconds). Overrides `idle_timeout_seconds`. Use `-1` or `inf` to never auto-unload |
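To make the required fields and defaults in the table concrete, here is a minimal sketch of how a model entry (already parsed from YAML into a dict) could be checked and filled in. The function `normalize_model_entry` and the constants are hypothetical names for this example; DynLLM's real validation may differ.

```python
REQUIRED_FIELDS = ("name", "path", "backend", "vram_mb")
BACKENDS = {"llamacpp", "openvino", "transformers"}
MODEL_TYPES = {
    "llm", "transcription", "speech", "image_generation", "embedding",
    "rerank", "classification", "detection", "segmentation", "ocr",
}
# Backend-specific defaults taken from the field table above.
BACKEND_DEFAULTS = {
    "llamacpp": {"n_gpu_layers": -1, "context_size": 4096},
    "openvino": {"target_device": "CPU"},
    "transformers": {},
}


def normalize_model_entry(entry: dict) -> dict:
    """Validate one `models:` entry and return a copy with defaults applied."""
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        raise ValueError(f"model entry is missing required fields: {missing}")
    if entry["backend"] not in BACKENDS:
        raise ValueError(f"unknown backend: {entry['backend']!r}")
    # Explicit values in the entry win over the defaults.
    model = {"model_type": "llm", **BACKEND_DEFAULTS[entry["backend"]], **entry}
    if model["model_type"] not in MODEL_TYPES:
        raise ValueError(f"unknown model_type: {model['model_type']!r}")
    if model["vram_mb"] < 0:
        raise ValueError("vram_mb must be zero or positive")
    return model
```

For example, an entry with only `name`, `path`, `backend: llamacpp`, and `vram_mb` comes back with `model_type: llm`, `n_gpu_layers: -1`, and `context_size: 4096` filled in.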
```yaml
server:
  host: "0.0.0.0"
  port: 8000

total_vram_mb: 7500
idle_timeout_seconds: 300

enabled_backends:
  - llamacpp
  - openvino
  - transformers

backend:
  llamacpp_binary: "llama-server"
  ovms_binary: "ovms"
  transformers_binary: "transformers"
  port_range_start: 9100
  port_range_end: 9200

# Load this model immediately at startup
preload_models:
  - "llama3-8b-q4"

models:
  - name: "llama3-8b-q4"
    path: "/mnt/models/gguf/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf"
    backend: llamacpp
    model_type: llm
    vram_mb: 5500
    n_gpu_layers: -1
    context_size: 4096

  - name: "phi3-mini-ov"
    path: "/mnt/models/openvino/phi-3-mini-4k-instruct-ov"
    backend: openvino
    model_type: llm
    target_device: GPU
    vram_mb: 4096
    ovms_shape: "auto"
    unload_time: -1  # keep this model loaded permanently

  - name: "whisper-large-v3-ov"
    path: "/mnt/models/openvino/whisper-large-v3-ov"
    backend: openvino
    model_type: transcription
    target_device: CPU
    vram_mb: 0

  - name: "speecht5-ov"
    path: "/mnt/models/openvino/speecht5-ov"
    backend: openvino
    model_type: speech
    target_device: CPU
    vram_mb: 0

  - name: "qwen25-3b-hf"
    path: "/mnt/models/huggingface/Qwen2.5-3B-Instruct"
    backend: transformers
    model_type: llm
    device: xpu
    dtype: bfloat16
    quantization: bnb-4bit
    attn_implementation: sdpa
    model_timeout: 300
    vram_mb: 6500
```

DynLLM exposes a standard OpenAI-compatible REST API. Point any OpenAI client at `http://<host>:<port>` and use the model `name` values from your config.
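As a minimal sketch of what "pointing a client at the proxy" looks like without any third-party SDK, the following uses only the Python standard library to build and send a chat completion request. The helper names `build_chat_request` and `chat` are made up for this example, and the response parsing assumes the usual OpenAI chat completion shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions on the proxy."""
    body = json.dumps({
        "model": model,  # must match a `name` from config.yaml
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the assistant's reply text.

    The generous timeout is deliberate: the first request for a model may
    have to wait while the backend is started.
    """
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With a proxy running locally, `chat("http://localhost:8000", "llama3-8b-q4", "Hello!")` would return the generated text.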
`/v1/models`: Returns the list of configured models in OpenAI format.

```json
{
  "object": "list",
  "data": [
    { "id": "llama3-8b-q4", "object": "model", "created": 1700000000, "owned_by": "dynllm" }
  ]
}
```

`/v1/chat/completions`: OpenAI-compatible chat completions. Supports streaming (`"stream": true`).
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b-q4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

`/v1/completions`: OpenAI-compatible text completions. Supports streaming.

`/v1/audio/transcriptions`: OpenAI-compatible speech-to-text endpoint. DynLLM accepts the standard multipart request and proxies it to OVMS or `transformers serve`, depending on the configured backend.
```bash
curl http://localhost:8000/v1/audio/transcriptions \
  -F "model=whisper-large-v3-ov" \
  -F "file=@speech.wav"
```

`/v1/audio/translations`: OpenAI-compatible speech translation endpoint. This uses the same OpenVINO transcription models and maps to OVMS `/v3/audio/translations`.
```bash
curl http://localhost:8000/v1/audio/translations \
  -F "model=whisper-large-v3-ov" \
  -F "file=@speech-es.wav"
```

`/v1/audio/speech`: OpenAI-compatible text-to-speech endpoint. OVMS currently returns WAV audio.
```bash
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"speecht5-ov","input":"Hello from DynLLM"}' \
  -o speech.wav
```

`/v1/images/generations`: OpenAI-compatible image generation endpoint. Supported via OVMS with `model_type: image_generation`.
```bash
curl http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model":"sd-xl-ov","prompt":"a cat wearing a hat","n":1,"size":"512x512"}'
```

`/v1/embeddings`: OpenAI-compatible embeddings endpoint. Supported by llama.cpp (GGUF embedding models) and OVMS.
```bash
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"bge-m3-gguf","input":"Hello world"}'
```

`/v1/rerank`: Cohere-compatible reranking endpoint. Supported by llama.cpp (GGUF reranker models) and OVMS.
```bash
curl http://localhost:8000/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{"model":"bge-reranker-v2-gguf","query":"what is AI?","documents":["AI is...","ML is..."]}'
```

`/v2/models/*`: KServe v2 protocol passthrough for OpenVINO models. Supports classification, detection, segmentation, OCR, and other non-LLM models through OVMS.
```bash
# Model metadata
curl http://localhost:8000/v2/models/resnet50-vm

# Readiness check
curl http://localhost:8000/v2/models/resnet50-vm/ready

# Inference
curl -X POST http://localhost:8000/v2/models/resnet50-vm/infer \
  -H "Content-Type: application/json" \
  -d '{"inputs":[...]}'
```

Returns detailed internal state for every known model (status, port, PID, VRAM, timestamps).

`/admin/models/unload`: Manually unload a model from VRAM.
```bash
curl -X POST http://localhost:8000/admin/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3-8b-q4"}'
```

When a request arrives for any configured endpoint (`/v1/chat/completions`, `/v1/completions`, `/v1/audio/*`, `/v1/images/generations`, `/v1/embeddings`, `/v1/rerank`, `/v2/models/*`):
- DynLLM looks up the model in the config by name.
- If not loaded: checks whether enough VRAM is free.
- If not enough VRAM: evicts models in LIFO order (most recently loaded first), skipping any model currently serving a request.
- Starts the backend subprocess and waits for it to be ready.
- Proxies the request to the backend and streams the response back.
A background scheduler checks loaded models every 30 seconds. Any model idle longer
than its effective timeout (the per-model `unload_time` if set, otherwise the global
`idle_timeout_seconds`) is stopped and its VRAM is freed. Models with active requests are never evicted.
Set unload_time: -1 (or inf) on a model to keep it permanently in VRAM unless
VRAM pressure forces eviction or you manually unload it via /admin/models/unload.
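The timeout rules above can be sketched as two small functions. This is an illustrative reading of the documented behaviour, not DynLLM's code: the function names are hypothetical, and treating every negative `unload_time` (not just `-1`) as "never" is an assumption.

```python
import math


def effective_timeout(unload_time, global_timeout: float) -> float:
    """Resolve the idle timeout for one model, in seconds.

    `unload_time` is the per-model value (None when unset). It may arrive
    as a number or as the string "inf", which float() accepts.
    """
    if unload_time is None:
        return float(global_timeout)
    value = float(unload_time)
    if value < 0 or math.isinf(value):
        return math.inf  # never auto-unload
    return value


def is_idle_expired(last_used: float, now: float, unload_time,
                    global_timeout: float, active_requests: int = 0) -> bool:
    """Return True if the scheduler should stop this model."""
    if active_requests > 0:
        return False  # models serving requests are never stopped
    return (now - last_used) > effective_timeout(unload_time, global_timeout)
```

A scheduler tick would call `is_idle_expired` for every loaded model, passing the current time and the timestamp of the model's last request.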
List model names under preload_models: to have them loaded before the proxy
starts accepting traffic. This reduces first-request latency.
At startup DynLLM logs the available torch execution backends detected in the runtime, for example ['cpu', 'xpu'] or ['cpu', 'cuda']. This helps verify that the installed torch build matches the hardware backend you expect to use.
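A check of this kind could look roughly like the sketch below. It is an assumption about how such detection might be written, not DynLLM's actual code: `detect_torch_backends` is a hypothetical name, and the `xpu` lookup is guarded because older torch builds do not ship a `torch.xpu` module.

```python
def detect_torch_backends() -> list:
    """Return the torch execution backends usable in this runtime."""
    backends = ["cpu"]
    try:
        import torch
    except ImportError:
        # torch is only needed for the transformers backend.
        return backends
    if torch.cuda.is_available():
        backends.append("cuda")
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        backends.append("xpu")
    return backends


if __name__ == "__main__":
    print(detect_torch_backends())
```

Running the same check by hand inside the project environment (`uv run python`) is a quick way to confirm which torch wheels actually got installed.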
When a new model needs to be loaded and there is insufficient free VRAM, DynLLM evicts the most recently loaded model first (Last-In, First-Out). This heuristic favours keeping the models you have been using longest resident in memory.
Models with in-flight requests are never evicted. If not enough VRAM can be freed because every remaining candidate is busy, the load fails with HTTP 503.
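The eviction rule can be illustrated with a small planning function. This is a sketch under stated assumptions rather than the project's implementation: the `LoadedModel` dataclass, `plan_eviction`, and `EvictionImpossible` are hypothetical names, and skipping models with `vram_mb: 0` (evicting them frees nothing) is a choice made for this example.

```python
from dataclasses import dataclass


@dataclass
class LoadedModel:
    name: str
    vram_mb: int
    loaded_at: float          # timestamp of when the model was loaded
    active_requests: int = 0  # in-flight requests right now


class EvictionImpossible(Exception):
    """Not enough VRAM can be freed; a proxy could map this to HTTP 503."""


def plan_eviction(loaded: list, needed_mb: int, total_vram_mb: int) -> list:
    """Return the names of models to evict, most recently loaded first."""
    free_mb = total_vram_mb - sum(m.vram_mb for m in loaded)
    victims = []
    # LIFO: walk from the newest load to the oldest.
    for model in sorted(loaded, key=lambda m: m.loaded_at, reverse=True):
        if free_mb >= needed_mb:
            break
        if model.active_requests > 0 or model.vram_mb == 0:
            continue  # busy models are protected; zero-VRAM models free nothing
        victims.append(model.name)
        free_mb += model.vram_mb
    if free_mb < needed_mb:
        raise EvictionImpossible(f"need {needed_mb} MB, can free at most {free_mb} MB")
    return victims
```

Planning first and evicting afterwards means nothing is stopped when the load could not succeed anyway.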
All model state (status, PID, port, VRAM, timestamps) is stored in a SQLite database.
On startup, any models stuck in a loading or unloading state (e.g. due to a crash)
are automatically reset to unloaded.
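A minimal sketch of this healing step with the standard `sqlite3` module is shown below. The table name `models`, its columns, and the status strings are assumptions for illustration; the real schema in `dynllm_state.db` may look different.

```python
import sqlite3

# Hypothetical schema, reduced to the columns this example needs.
SCHEMA = """
CREATE TABLE IF NOT EXISTS models (
    name   TEXT PRIMARY KEY,
    status TEXT NOT NULL,
    pid    INTEGER,
    port   INTEGER
)
"""


def heal_stale_states(conn: sqlite3.Connection) -> int:
    """Reset models stuck in a transitional state; return how many were fixed."""
    cursor = conn.execute(
        "UPDATE models SET status = 'unloaded' "
        "WHERE status IN ('loading', 'unloading')"
    )
    conn.commit()
    return cursor.rowcount
```

Calling this once at startup, before the proxy accepts traffic, ensures that a crash in the middle of a load or unload does not leave a model permanently unavailable.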
```bash
# Run as root
sudo bash systemd/install.sh
```

The installer:
- Creates a `dynllm` system user
- Copies the project to `/opt/dynllm`
- Runs `uv sync --frozen`
- Creates a default config at `/opt/dynllm/config.yaml`
- Installs and enables the systemd unit `dynllm.service`
```bash
sudo systemctl status dynllm
sudo journalctl -u dynllm -f
```

GPU access is granted via `/dev/dri` (Intel/AMD). For NVIDIA, uncomment the relevant
lines in `systemd/dynllm.service`.
llama.cpp (`backend: llamacpp`):
- Serves GGUF models only.
- One `llama-server` process per loaded model.
- Readiness is detected via `GET /health`.
- Relevant config fields: `n_gpu_layers`, `context_size`.
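Waiting for a freshly started backend to become ready can be sketched as a polling loop around an HTTP probe such as `GET /health`. The helper names, the default timeout, and the polling interval below are assumptions for illustration; DynLLM's actual values are not documented here.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx answers all mean "not yet".
        return False


def wait_until_ready(probe, timeout_s: float = 60.0, interval_s: float = 0.5,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call `probe()` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

For a llama.cpp backend on port 9100 the call would be `wait_until_ready(lambda: http_ok("http://127.0.0.1:9100/health"))`. Passing the probe as a callable keeps the loop reusable for backends with a different readiness check.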
OpenVINO Model Server (`backend: openvino`):
- Serves OpenVINO IR model directories only (not GGUF).
- One `ovms` process per loaded model.
- `model_type: llm`, `model_type: embedding`, and `model_type: rerank` use the standard single-model OVMS config flow with the OpenAI-compatible `/v3/` endpoints (`/v3/chat/completions`, `/v3/embeddings`, `/v3/rerank`).
- `model_type: transcription` and `model_type: speech` use OVMS audio task mode (`--task speech2text` / `--task text2speech`). Audio endpoints require OVMS 2025.4+.
- `model_type: image_generation` uses OVMS image generation task mode (`--task image_generation`) and exposes `/v3/images/generations` (OpenAI-compatible). Supports Stable Diffusion, SDXL, and FLUX.1 models in OpenVINO IR format.
- `model_type: classification`, `detection`, `segmentation`, and `ocr` are served through the KServe v2 API (`/v2/models/{name}/infer`). These use the standard OVMS config flow and are accessible via any KServe-compatible client.
- gRPC is disabled (`--port 0`); only the REST API is used.
- Readiness is detected by model type:
  - LLM/embedding/rerank/CV models: KServe Model Readiness endpoint (`GET /v2/models/{name}/ready`).
  - Audio/image generation models: OVMS task mode does not register KServe models, so readiness is detected by probing the relevant `/v3/` endpoint with an OPTIONS request.
Hugging Face transformers (`backend: transformers`):
- Serves local Hugging Face model directories through `transformers serve`.
- One `transformers serve` process per loaded model.
- DynLLM keeps the public model alias from `config.yaml` and rewrites backend requests to the local model path expected by `transformers serve`.
- Supports `model_type: llm` and `model_type: transcription`.
- Relevant config fields: `device`, `dtype`, `quantization`, `trust_remote_code`, `compile_model`, `continuous_batching`, `attn_implementation`, `model_timeout`, `revision`.
- For Intel GPUs, install torch from the XPU wheel index before installing `transformers[serving]`:

```bash
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/xpu
uv pip install "transformers[serving]"
```

- For CUDA systems, install the CUDA-specific torch wheels first, then `transformers[serving]`.
- `quantization: bnb-4bit` and `quantization: bnb-8bit` map to bitsandbytes loading in `transformers serve`.
- In practice, bitsandbytes is most proven on CUDA. On Intel XPU it may work, but treat it as deployment-specific and validate the exact torch + bitsandbytes stack on your target machine before relying on it in production.
- Quantization is enabled only for `model_type: llm` in DynLLM.
- DynLLM still applies the same VRAM accounting, LIFO eviction, and idle unload rules used for llama.cpp and OVMS.
See LICENSE.