DynLLM

Backend-agnostic, OpenAI-compatible proxy for dynamic model loading and unloading.

DynLLM sits between your OpenAI-compatible client (OpenWebUI, LangChain, curl, …) and your local inference backends (llama.cpp, OpenVINO Model Server, and/or Hugging Face transformers). It automatically loads models on demand, tracks VRAM usage, evicts models when memory is tight, and unloads idle models after a configurable timeout.

Features

OpenAI-compatible API – /v1/chat/completions, /v1/completions, /v1/audio/transcriptions, /v1/audio/translations, /v1/audio/speech, /v1/images/generations, /v1/embeddings, /v1/rerank, /v1/models
Dynamic loading – models are started on first request and stopped when idle
VRAM budgeting – LIFO eviction keeps total GPU memory within a configured limit
Multi-backend – supports llama.cpp (GGUF), OpenVINO Model Server (IR), and transformers serve
Per-model idle timeout – override the global timeout per model, or set inf/-1 to never auto-unload
Startup preloading – specify models to load when DynLLM starts
Safe mid-generation – active inference requests are never interrupted by eviction
Persistent state – SQLite database survives restarts; stale states are healed on startup
systemd integration – ships with a ready-made unit file and installer script

Requirements

Python 3.11+ and uv
At least one backend installed:
- llama.cpp – build llama-server from ggerganov/llama.cpp
- OpenVINO Model Server – see the OVMS installation guide
- Hugging Face transformers – install transformers[serving]; for Intel GPUs install torch XPU wheels first

Quick Start

# 1. Clone and install
git clone https://github.com/youruser/DynLLM
cd DynLLM
uv sync

# 2. Create a config
cp config.example.yaml config.yaml
# Edit config.yaml – set total_vram_mb, add your models

# 3. Run
uv run dynllm
# or with an explicit config path:
uv run dynllm --config /path/to/config.yaml

By default the proxy listens on 0.0.0.0:8000 (all interfaces), so it is reachable at http://localhost:8000 from the same machine.

Configuration

Copy config.example.yaml to config.yaml and adjust as needed.

Top-level settings

Key	Default	Description
`server.host`	`0.0.0.0`	Bind address
`server.port`	`8000`	Listen port
`total_vram_mb`	`8192`	VRAM budget in MB; eviction is triggered when loading a model would exceed it
`idle_timeout_seconds`	`300`	Global idle auto-unload timeout (seconds)
`enabled_backends`	`[llamacpp, openvino]`	Active backends
`models_dir`	—	Optional base dir for relative model paths
`db_path`	`dynllm_state.db`	SQLite state database path
`log_level`	`info`	`debug` / `info` / `warning` / `error`
`preload_models`	`[]`	Model names to load on startup

Backend settings (`backend:`)

Key	Default	Description
`llamacpp_binary`	`llama-server`	Path or name of the llama-server binary
`ovms_binary`	`ovms`	Path or name of the OVMS binary
`transformers_binary`	`transformers`	Path or name of the Hugging Face transformers CLI
`port_range_start`	`9100`	Start of port range for backend subprocesses
`port_range_end`	`9200`	End of port range for backend subprocesses

Model declaration fields

Field	Required	Description
`name`	yes	Unique model ID; used as the `model` field in API requests
`path`	yes	Path to the `.gguf` file, OpenVINO IR directory, or local Hugging Face model directory
`backend`	yes	`llamacpp`, `openvino`, or `transformers`
`model_type`	no	`llm`, `transcription`, `speech`, `image_generation`, `embedding`, `rerank`, `classification`, `detection`, `segmentation`, `ocr`. Default: `llm`
`vram_mb`	yes	Estimated VRAM in MB when loaded (used for eviction math)
`target_device`	no	OpenVINO target device (`CPU`, `GPU`, `NPU`). Default: `CPU`
`n_gpu_layers`	no	llama.cpp only – GPU layers (`-1` = all). Default: `-1`
`context_size`	no	llama.cpp only – context window size. Default: `4096`
`ovms_shape`	no	OpenVINO only – shape hint (e.g. `"auto"`)
`device`	no	transformers only – execution device (`auto`, `cpu`, `cuda`, `xpu`)
`dtype`	no	transformers only – load dtype (`auto`, `float16`, `bfloat16`, `float32`)
`quantization`	no	transformers only – `none`, `bnb-4bit`, or `bnb-8bit` for bitsandbytes quantization
`trust_remote_code`	no	transformers only – allow custom repo code when required
`compile_model`	no	transformers only – enable `torch.compile` through `transformers serve`
`continuous_batching`	no	transformers only – enable continuous batching for supported LLMs
`attn_implementation`	no	transformers only – `auto`, `eager`, `sdpa`, `flash_attention_2`, `flash_attention_3`, `flex_attention`
`model_timeout`	no	transformers only – backend-side idle timeout in seconds
`revision`	no	transformers only – HF revision rendered as `model@revision`
`unload_time`	no	Per-model idle timeout (seconds). Overrides `idle_timeout_seconds`. Use `-1` or `inf` to never auto-unload

Example config

server:
  host: "0.0.0.0"
  port: 8000

total_vram_mb: 7500
idle_timeout_seconds: 300

enabled_backends:
  - llamacpp
  - openvino
  - transformers

backend:
  llamacpp_binary: "llama-server"
  ovms_binary: "ovms"
  transformers_binary: "transformers"
  port_range_start: 9100
  port_range_end: 9200

# Load this model immediately at startup
preload_models:
  - "llama3-8b-q4"

models:
  - name: "llama3-8b-q4"
    path: "/mnt/models/gguf/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf"
    backend: llamacpp
    model_type: llm
    vram_mb: 5500
    n_gpu_layers: -1
    context_size: 4096

  - name: "phi3-mini-ov"
    path: "/mnt/models/openvino/phi-3-mini-4k-instruct-ov"
    backend: openvino
    model_type: llm
    target_device: GPU
    vram_mb: 4096
    ovms_shape: "auto"
    unload_time: -1   # keep this model loaded permanently

  - name: "whisper-large-v3-ov"
    path: "/mnt/models/openvino/whisper-large-v3-ov"
    backend: openvino
    model_type: transcription
    target_device: CPU
    vram_mb: 0

  - name: "speecht5-ov"
    path: "/mnt/models/openvino/speecht5-ov"
    backend: openvino
    model_type: speech
    target_device: CPU
    vram_mb: 0

  - name: "qwen25-3b-hf"
    path: "/mnt/models/huggingface/Qwen2.5-3B-Instruct"
    backend: transformers
    model_type: llm
    device: xpu
    dtype: bfloat16
    quantization: bnb-4bit
    attn_implementation: sdpa
    model_timeout: 300
    vram_mb: 6500
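
Because eviction math relies entirely on the declared vram_mb values, it can help to sanity-check a config before starting the proxy. The following is a minimal sketch, not DynLLM's own validation: check_config is a hypothetical helper that takes the config already parsed into a dict (for example by a YAML loader) and reports models whose backend is not enabled, models larger than the whole budget, and preload lists that cannot fit.

```python
from typing import Any


def check_config(cfg: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in a parsed config dict.

    Hypothetical helper for illustration; DynLLM may validate differently.
    """
    problems: list[str] = []
    models = {m["name"]: m for m in cfg.get("models", [])}
    enabled = set(cfg.get("enabled_backends", []))
    total = cfg.get("total_vram_mb", 0)

    for name, model in models.items():
        if model["backend"] not in enabled:
            problems.append(f"{name}: backend {model['backend']!r} is not enabled")
        if model["vram_mb"] > total:
            problems.append(f"{name}: vram_mb exceeds total_vram_mb")

    preload = cfg.get("preload_models", [])
    for name in preload:
        if name not in models:
            problems.append(f"preload_models: unknown model {name!r}")

    # Preloaded models are all resident at once, so their sum must fit.
    preload_vram = sum(models[n]["vram_mb"] for n in preload if n in models)
    if preload_vram > total:
        problems.append("preload_models: combined vram_mb exceeds total_vram_mb")
    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that the vram_mb estimates themselves are accurate.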

API Reference

DynLLM exposes a standard OpenAI-compatible REST API. Point any OpenAI client at http://<host>:<port> and pass a name value from your config as the model field of each request.

`GET /v1/models`

Returns the list of configured models in OpenAI format.

{
  "object": "list",
  "data": [
    { "id": "llama3-8b-q4", "object": "model", "created": 1700000000, "owned_by": "dynllm" }
  ]
}

`POST /v1/chat/completions`

OpenAI-compatible chat completions. Supports streaming ("stream": true).

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b-q4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
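
The same request can be sent from Python using only the standard library. This is a minimal sketch: build_chat_request and chat are hypothetical helper names, the response is assumed to follow the usual OpenAI chat completion shape, and the generous default timeout is there because the first request for a model may have to wait for it to load.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions (non-streaming)."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes the standard OpenAI response layout.
    return body["choices"][0]["message"]["content"]
```

With the proxy running locally, chat("http://localhost:8000", "llama3-8b-q4", "Hello!") returns the reply as a string.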

`POST /v1/completions`

OpenAI-compatible text completions. Supports streaming.

`POST /v1/audio/transcriptions`

OpenAI-compatible speech-to-text endpoint. DynLLM accepts the standard multipart request and proxies it to OVMS or transformers serve, depending on the configured backend.

curl http://localhost:8000/v1/audio/transcriptions \
  -F "model=whisper-large-v3-ov" \
  -F "file=@speech.wav"

`POST /v1/audio/translations`

OpenAI-compatible speech translation endpoint. This uses the same OpenVINO transcription models and maps to OVMS /v3/audio/translations.

curl http://localhost:8000/v1/audio/translations \
  -F "model=whisper-large-v3-ov" \
  -F "file=@speech-es.wav"

`POST /v1/audio/speech`

OpenAI-compatible text-to-speech endpoint. OVMS currently returns WAV audio.

curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"speecht5-ov","input":"Hello from DynLLM"}' \
  -o speech.wav

`POST /v1/images/generations`

OpenAI-compatible image generation endpoint. Supported via OVMS with model_type: image_generation.

curl http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model":"sd-xl-ov","prompt":"a cat wearing a hat","n":1,"size":"512x512"}'

`POST /v1/embeddings`

OpenAI-compatible embeddings endpoint. Supported by llama.cpp (GGUF embedding models) and OVMS.

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"bge-m3-gguf","input":"Hello world"}'

`POST /v1/rerank`

Cohere-compatible reranking endpoint. Supported by llama.cpp (GGUF reranker models) and OVMS.

curl http://localhost:8000/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{"model":"bge-reranker-v2-gguf","query":"what is AI?","documents":["AI is...","ML is..."]}'

KServe API (`/v2/models/{name}/...`)

KServe v2 protocol passthrough for OpenVINO models. Supports classification, detection, segmentation, OCR, and other non-LLM models through OVMS.

# Model metadata
curl http://localhost:8000/v2/models/resnet50-vm

# Readiness check
curl http://localhost:8000/v2/models/resnet50-vm/ready

# Inference
curl -X POST http://localhost:8000/v2/models/resnet50-vm/infer \
  -H "Content-Type: application/json" \
  -d '{"inputs":[...]}'

`GET /admin/models`

Returns detailed internal state for every known model (status, port, PID, VRAM, timestamps).

`POST /admin/models/unload`

Manually unload a model from VRAM.

curl -X POST http://localhost:8000/admin/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3-8b-q4"}'

How It Works

Load on demand

When a request arrives for any configured endpoint (/v1/chat/completions, /v1/completions, /v1/audio/*, /v1/images/generations, /v1/embeddings, /v1/rerank, /v2/models/*):

1. DynLLM looks up the model in the config by name.
2. If the model is not loaded, DynLLM checks whether enough VRAM is free.
3. If there is not enough VRAM, it evicts models in LIFO order (most recently loaded first), skipping any model currently serving a request.
4. It starts the backend subprocess and waits for it to be ready.
5. It proxies the request to the backend and streams the response back.

Idle auto-unload

A background scheduler checks loaded models every 30 seconds. Any model idle longer than its effective timeout is stopped and its VRAM is freed. The effective timeout is the per-model unload_time when set, otherwise the global idle_timeout_seconds. Models with active requests are never unloaded.

Per-model `unload_time`

Set unload_time: -1 (or inf) on a model to exempt it from idle auto-unload. The model then stays in VRAM until VRAM pressure forces its eviction or you unload it manually via /admin/models/unload.
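
The timeout rules can be summarised in a few lines of Python. This is a sketch of the behaviour described here, not DynLLM's actual code; effective_timeout and should_unload are hypothetical names, and a missing unload_time is represented as None.

```python
import math
from typing import Optional, Union


def effective_timeout(
    unload_time: Optional[Union[int, float, str]], global_timeout: float
) -> float:
    """Return the idle timeout in seconds; math.inf means never auto-unload."""
    if unload_time is None:
        # No per-model override: fall back to idle_timeout_seconds.
        return float(global_timeout)
    value = float(unload_time)  # float("inf") also accepts the string "inf"
    if value == -1 or value == math.inf:
        return math.inf
    return value


def should_unload(idle_seconds: float, active_requests: int, timeout: float) -> bool:
    """A model is unloaded only when it is idle past its timeout and not busy."""
    return active_requests == 0 and idle_seconds > timeout
```

Mapping both -1 and inf to math.inf keeps the comparison in should_unload uniform: no finite idle time ever exceeds it.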

Startup preloading

List model names under preload_models: to have them loaded before the proxy starts accepting traffic. This reduces first-request latency.

Startup backend logging

At startup DynLLM logs the available torch execution backends detected in the runtime, for example ['cpu', 'xpu'] or ['cpu', 'cuda']. This helps verify that the installed torch build matches the hardware backend you expect to use.

VRAM eviction (LIFO)

When a new model needs to be loaded and there is insufficient free VRAM, DynLLM evicts the most recently loaded model first (Last-In, First-Out). This heuristic keeps the models that were loaded earliest, and have therefore been resident longest, in memory.

Models with in-flight requests are never evicted. If every remaining eviction candidate is busy and not enough VRAM can be freed, the load fails with HTTP 503.
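
A minimal sketch of the eviction decision, under the assumption that each loaded model records its VRAM estimate, load timestamp, and in-flight request count. The names Loaded, plan_eviction, and EvictionImpossible are illustrative and not taken from the DynLLM source.

```python
from dataclasses import dataclass


@dataclass
class Loaded:
    name: str
    vram_mb: int
    loaded_at: float          # when the model finished loading
    active_requests: int = 0  # in-flight requests


class EvictionImpossible(Exception):
    """Not enough VRAM can be freed; a proxy would answer HTTP 503."""


def plan_eviction(loaded: list[Loaded], needed_mb: int, total_mb: int) -> list[str]:
    """Return model names to evict, most recently loaded first."""
    free = total_mb - sum(m.vram_mb for m in loaded)
    victims: list[str] = []
    for m in sorted(loaded, key=lambda m: m.loaded_at, reverse=True):
        if free >= needed_mb:
            break
        if m.active_requests > 0:
            continue  # never interrupt a model that is serving a request
        victims.append(m.name)
        free += m.vram_mb
    if free < needed_mb:
        raise EvictionImpossible(f"need {needed_mb} MB, only {free} MB available")
    return victims
```

With a 7500 MB budget, a 5500 MB model loaded first and a 1500 MB model loaded second, a request for 1800 MB evicts only the second model; if that model is busy, the first one is evicted instead.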

State persistence

All model state (status, PID, port, VRAM, timestamps) is stored in a SQLite database. On startup, any models stuck in a loading or unloading state (e.g. due to a crash) are automatically reset to unloaded.
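
The startup healing step amounts to a single UPDATE. The sketch below assumes a hypothetical table named models with status, pid, and port columns; DynLLM's real schema may differ, and clearing pid and port is an assumption about what a reset should do.

```python
import sqlite3


def heal_stale_states(conn: sqlite3.Connection) -> int:
    """Reset models left in a transitional state; return how many were healed.

    Table and column names are hypothetical.
    """
    cur = conn.execute(
        "UPDATE models SET status = 'unloaded', pid = NULL, port = NULL "
        "WHERE status IN ('loading', 'unloading')"
    )
    conn.commit()
    return cur.rowcount
```

Running this once before the proxy accepts traffic means a crash in the middle of a load or unload cannot leave a model permanently stuck.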

systemd Deployment

# Run as root
sudo bash systemd/install.sh

The installer:

Creates a dynllm system user
Copies the project to /opt/dynllm
Runs uv sync --frozen
Creates a default config at /opt/dynllm/config.yaml
Installs and enables the systemd unit dynllm.service

sudo systemctl status dynllm
sudo journalctl -u dynllm -f

GPU access is granted via /dev/dri (Intel/AMD). For NVIDIA, uncomment the relevant lines in systemd/dynllm.service.

Backend Notes

llama.cpp

Serves GGUF models only.
One llama-server process per loaded model.
Readiness is detected via GET /health.
Relevant config fields: n_gpu_layers, context_size.
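
Readiness detection via GET /health can be pictured as a polling loop. The following is a sketch rather than DynLLM's implementation: http_ok and wait_until_ready are hypothetical helpers, the timeout and interval defaults are arbitrary, and any response other than HTTP 200 is treated as "not ready yet".

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True only if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and non-2xx responses all land here.
        return False


def wait_until_ready(
    probe: Callable[[], bool], timeout: float = 120.0, interval: float = 0.5
) -> bool:
    """Call probe repeatedly until it succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For a llama-server subprocess started on port 9100, the call would look like wait_until_ready(lambda: http_ok("http://127.0.0.1:9100/health")). Passing the probe as a callable lets the same loop serve other backends with different readiness checks.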

OpenVINO Model Server (OVMS)

Serves OpenVINO IR model directories only (not GGUF).
One ovms process per loaded model.
model_type: llm, model_type: embedding, and model_type: rerank use the standard single-model OVMS config flow with the OpenAI-compatible /v3/ endpoints (/v3/chat/completions, /v3/embeddings, /v3/rerank).
model_type: transcription and model_type: speech use OVMS audio task mode (--task speech2text / --task text2speech). Audio endpoints require OVMS 2025.4+.
model_type: image_generation uses OVMS image generation task mode (--task image_generation) and exposes /v3/images/generations (OpenAI-compatible). Supports Stable Diffusion, SDXL, and FLUX.1 models in OpenVINO IR format.
model_type: classification, detection, segmentation, and ocr are served through the KServe v2 API (/v2/models/{name}/infer). These use the standard OVMS config flow and are accessible via any KServe-compatible client.
gRPC is disabled (--port 0); only the REST API is used.
Readiness is detected by model type:
- LLM/embedding/rerank/CV models: KServe Model Readiness endpoint (GET /v2/models/{name}/ready).
- Audio/image generation models: OVMS task mode does not register KServe models, so readiness is detected by probing the relevant /v3/ endpoint with an OPTIONS request.
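
The per-type readiness rule above can be expressed as a small lookup. This is an illustrative sketch only: readiness_probe is a hypothetical function, and the /v3/audio/transcriptions and /v3/audio/speech paths are assumptions about which endpoint counts as "relevant" for each audio type.

```python
# Model types that OVMS registers as KServe models.
KSERVE_TYPES = {
    "llm", "embedding", "rerank",
    "classification", "detection", "segmentation", "ocr",
}

# Task-mode types and the endpoint probed for each (audio paths are assumed).
TASK_ENDPOINTS = {
    "transcription": "/v3/audio/transcriptions",
    "speech": "/v3/audio/speech",
    "image_generation": "/v3/images/generations",
}


def readiness_probe(model_type: str, name: str) -> tuple[str, str]:
    """Return the (HTTP method, path) used to check whether a model is ready."""
    if model_type in KSERVE_TYPES:
        return ("GET", f"/v2/models/{name}/ready")
    if model_type in TASK_ENDPOINTS:
        return ("OPTIONS", TASK_ENDPOINTS[model_type])
    raise ValueError(f"unknown model_type: {model_type!r}")
```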

Hugging Face transformers

Serves local Hugging Face model directories through transformers serve.
One transformers serve process per loaded model.
DynLLM keeps the public model alias from config.yaml and rewrites backend requests to the local model path expected by transformers serve.
Supports model_type: llm and model_type: transcription.
Relevant config fields: device, dtype, quantization, trust_remote_code, compile_model, continuous_batching, attn_implementation, model_timeout, revision.
For Intel GPUs, install torch from the XPU wheel index before installing transformers[serving]:

uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/xpu
uv pip install "transformers[serving]"

For CUDA systems, install the CUDA-specific torch wheels first, then transformers[serving].
quantization: bnb-4bit and quantization: bnb-8bit map to bitsandbytes loading in transformers serve.
In practice, bitsandbytes is most proven on CUDA. On Intel XPU it may work, but treat it as deployment-specific and validate the exact torch + bitsandbytes stack on your target machine before relying on it in production.
Quantization is enabled only for model_type: llm in DynLLM.
DynLLM still applies the same VRAM accounting, LIFO eviction, and idle unload rules used for llama.cpp and OVMS.
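
The alias rewriting mentioned above can be sketched as a small transformation of the JSON request body. Both function names are hypothetical, and the sketch assumes the backend identifier is the local path, optionally followed by @revision as described for the revision field.

```python
import json
from typing import Optional


def backend_model_id(path: str, revision: Optional[str] = None) -> str:
    """Render the identifier sent to the backend: path or path@revision."""
    return f"{path}@{revision}" if revision else path


def rewrite_model_field(body: bytes, alias: str, backend_id: str) -> bytes:
    """Replace the public alias in a JSON request body with the backend id."""
    payload = json.loads(body)
    if payload.get("model") == alias:
        payload["model"] = backend_id
    return json.dumps(payload).encode("utf-8")
```

Clients keep using the short name from config.yaml (for example qwen25-3b-hf), while the backend only ever sees the path it was started with.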

License

See LICENSE.
