federicoaugelli/DynLLM

DynLLM

Agnostic OpenAI-compatible proxy for dynamic model loading and unloading.

DynLLM sits between your OpenAI-compatible client (OpenWebUI, LangChain, curl, …) and your local inference backends (llama.cpp, OpenVINO Model Server, and/or Hugging Face transformers). It automatically loads models on demand, tracks VRAM usage, evicts models when memory is tight, and unloads idle models after a configurable timeout.


Features

  • OpenAI-compatible API – /v1/chat/completions, /v1/completions, /v1/audio/transcriptions, /v1/audio/translations, /v1/audio/speech, /v1/images/generations, /v1/embeddings, /v1/rerank, /v1/models
  • Dynamic loading – models are started on first request and stopped when idle
  • VRAM budgeting – LIFO eviction keeps total GPU memory within a configured limit
  • Multi-backend – supports llama.cpp (GGUF), OpenVINO Model Server (IR), and transformers serve
  • Per-model idle timeout – override the global timeout per model, or set inf/-1 to never auto-unload
  • Startup preloading – specify models to load when DynLLM starts
  • Safe mid-generation – active inference requests are never interrupted by eviction
  • Persistent state – SQLite database survives restarts; stale states are healed on startup
  • systemd integration – ships with a ready-made unit file and installer script

Requirements

  • Python 3.11+ and uv
  • At least one backend installed:
    • llama.cpp – build llama-server from ggerganov/llama.cpp
    • OpenVINO Model Server – see the OVMS installation guide
    • Hugging Face transformers – install transformers[serving]; for Intel GPUs install torch XPU wheels first

Quick Start

# 1. Clone and install
git clone https://github.com/federicoaugelli/DynLLM
cd DynLLM
uv sync

# 2. Create a config
cp config.example.yaml config.yaml
# Edit config.yaml – set total_vram_mb, add your models

# 3. Run
uv run dynllm
# or with an explicit config path:
uv run dynllm --config /path/to/config.yaml

The proxy starts on http://0.0.0.0:8000 by default.


Configuration

Copy config.example.yaml to config.yaml and adjust as needed.

Top-level settings

| Key | Default | Description |
| --- | --- | --- |
| server.host | 0.0.0.0 | Bind address |
| server.port | 8000 | Listen port |
| total_vram_mb | 8192 | VRAM budget in MB; eviction fires when exceeded |
| idle_timeout_seconds | 300 | Global idle auto-unload timeout (seconds) |
| enabled_backends | [llamacpp, openvino] | Active backends |
| models_dir | | Optional base dir for relative model paths |
| db_path | dynllm_state.db | SQLite state database path |
| log_level | info | debug / info / warning / error |
| preload_models | [] | Model names to load on startup |

Backend settings (backend:)

| Key | Default | Description |
| --- | --- | --- |
| llamacpp_binary | llama-server | Path or name of the llama-server binary |
| ovms_binary | ovms | Path or name of the OVMS binary |
| transformers_binary | transformers | Path or name of the Hugging Face transformers CLI |
| port_range_start | 9100 | Start of port range for backend subprocesses |
| port_range_end | 9200 | End of port range for backend subprocesses |
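
Each loaded model runs as its own backend subprocess, so every model needs a free port from this range. The following is a minimal sketch of how such an allocation could work, not DynLLM's actual code. The names `pick_port` and `in_use` are hypothetical, and treating both bounds as inclusive is an assumption the source does not confirm.

```python
def pick_port(start: int, end: int, in_use: set[int]) -> int:
    """Return the lowest port in [start, end] that is not already assigned.

    Assumption: both bounds are inclusive; DynLLM's actual rule may differ.
    """
    for port in range(start, end + 1):
        if port not in in_use:
            return port
    raise RuntimeError(f"no free port in {start}-{end}")


# Two backends already hold 9100 and 9101, so the next model gets 9102.
assigned = {9100, 9101}
next_port = pick_port(9100, 9200, assigned)
```

Under this assumption the default range allows at most 101 backend processes at once, so the range only needs widening on hosts that keep many small models loaded at the same time.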

Model declaration fields

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Unique model ID; used as the model field in API requests |
| path | yes | Path to the .gguf file, OpenVINO IR directory, or local Hugging Face model directory |
| backend | yes | llamacpp, openvino, or transformers |
| model_type | no | llm, transcription, speech, image_generation, embedding, rerank, classification, detection, segmentation, ocr. Default: llm |
| vram_mb | yes | Estimated VRAM in MB when loaded (used for eviction math) |
| target_device | no | OpenVINO target device (CPU, GPU, NPU). Default: CPU |
| n_gpu_layers | no | llama.cpp only – GPU layers (-1 = all). Default: -1 |
| context_size | no | llama.cpp only – context window size. Default: 4096 |
| ovms_shape | no | OpenVINO only – shape hint (e.g. "auto") |
| device | no | transformers only – execution device (auto, cpu, cuda, xpu) |
| dtype | no | transformers only – load dtype (auto, float16, bfloat16, float32) |
| quantization | no | transformers only – none, bnb-4bit, or bnb-8bit for bitsandbytes quantization |
| trust_remote_code | no | transformers only – allow custom repo code when required |
| compile_model | no | transformers only – enable torch.compile through transformers serve |
| continuous_batching | no | transformers only – enable continuous batching for supported LLMs |
| attn_implementation | no | transformers only – auto, eager, sdpa, flash_attention_2, flash_attention_3, flex_attention |
| model_timeout | no | transformers only – backend-side idle timeout in seconds |
| revision | no | transformers only – HF revision rendered as model@revision |
| unload_time | no | Per-model idle timeout (seconds). Overrides idle_timeout_seconds. Use -1 or inf to never auto-unload |

Example config

server:
  host: "0.0.0.0"
  port: 8000

total_vram_mb: 7500
idle_timeout_seconds: 300

enabled_backends:
  - llamacpp
  - openvino
  - transformers

backend:
  llamacpp_binary: "llama-server"
  ovms_binary: "ovms"
  transformers_binary: "transformers"
  port_range_start: 9100
  port_range_end: 9200

# Load this model immediately at startup
preload_models:
  - "llama3-8b-q4"

models:
  - name: "llama3-8b-q4"
    path: "/mnt/models/gguf/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf"
    backend: llamacpp
    model_type: llm
    vram_mb: 5500
    n_gpu_layers: -1
    context_size: 4096

  - name: "phi3-mini-ov"
    path: "/mnt/models/openvino/phi-3-mini-4k-instruct-ov"
    backend: openvino
    model_type: llm
    target_device: GPU
    vram_mb: 4096
    ovms_shape: "auto"
    unload_time: -1   # keep this model loaded permanently

  - name: "whisper-large-v3-ov"
    path: "/mnt/models/openvino/whisper-large-v3-ov"
    backend: openvino
    model_type: transcription
    target_device: CPU
    vram_mb: 0

  - name: "speecht5-ov"
    path: "/mnt/models/openvino/speecht5-ov"
    backend: openvino
    model_type: speech
    target_device: CPU
    vram_mb: 0

  - name: "qwen25-3b-hf"
    path: "/mnt/models/huggingface/Qwen2.5-3B-Instruct"
    backend: transformers
    model_type: llm
    device: xpu
    dtype: bfloat16
    quantization: bnb-4bit
    attn_implementation: sdpa
    model_timeout: 300
    vram_mb: 6500
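
Before starting the proxy, it can help to sanity-check a config against the tables above. The sketch below works on an already-parsed config (a plain dict, for example the result of loading the YAML) and reports missing required fields, unknown or disabled backends, oversized models, and preload entries that name no declared model. `check_config` is a hypothetical helper; whether DynLLM performs these checks itself is not stated in this README.

```python
REQUIRED_FIELDS = ("name", "path", "backend", "vram_mb")
KNOWN_BACKENDS = {"llamacpp", "openvino", "transformers"}


def check_config(config: dict) -> list[str]:
    """Return human-readable problems found in an already-parsed config."""
    problems = []
    # Defaults below are taken from the top-level settings table.
    enabled = set(config.get("enabled_backends", ["llamacpp", "openvino"]))
    budget = config.get("total_vram_mb", 8192)
    seen = set()
    for index, model in enumerate(config.get("models", [])):
        label = model.get("name", f"models[{index}]")
        for field in REQUIRED_FIELDS:
            if field not in model:
                problems.append(f"{label}: missing required field '{field}'")
        backend = model.get("backend")
        if backend is not None and backend not in KNOWN_BACKENDS:
            problems.append(f"{label}: unknown backend '{backend}'")
        elif backend is not None and backend not in enabled:
            problems.append(f"{label}: backend '{backend}' is not in enabled_backends")
        if model.get("vram_mb", 0) > budget:
            problems.append(f"{label}: vram_mb exceeds total_vram_mb ({budget})")
        if label in seen:
            problems.append(f"{label}: duplicate model name")
        seen.add(label)
    for name in config.get("preload_models", []):
        if name not in seen:
            problems.append(f"preload_models: '{name}' is not a declared model")
    return problems


example = {
    "total_vram_mb": 7500,
    "enabled_backends": ["llamacpp", "openvino"],
    "preload_models": ["llama3-8b-q4", "missing-model"],
    "models": [
        {"name": "llama3-8b-q4",
         "path": "/mnt/models/gguf/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
         "backend": "llamacpp", "vram_mb": 5500},
        {"name": "qwen25-3b-hf",
         "path": "/mnt/models/huggingface/Qwen2.5-3B-Instruct",
         "backend": "transformers"},
    ],
}
for problem in check_config(example):
    print(problem)
```

In this example the second model is flagged twice (no vram_mb, and a backend that is not enabled), and the preload list is flagged for naming a model that is not declared.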

API Reference

DynLLM exposes a standard OpenAI-compatible REST API. Point any OpenAI client at http://<host>:<port> and use the model name values from your config.

GET /v1/models

Returns the list of configured models in OpenAI format.

{
  "object": "list",
  "data": [
    { "id": "llama3-8b-q4", "object": "model", "created": 1700000000, "owned_by": "dynllm" }
  ]
}

POST /v1/chat/completions

OpenAI-compatible chat completions. Supports streaming ("stream": true).

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b-q4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
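
With "stream": true, OpenAI-compatible servers send the reply as server-sent events: one `data: {json}` line per chunk, ending with `data: [DONE]`. The sketch below builds a request body and reassembles the text from such lines. It assumes the backend behind DynLLM emits the standard OpenAI chunk shape (`choices[].delta.content`). The helper names and the sample lines are hand-written for illustration, not captured DynLLM output.

```python
import json


def build_chat_payload(model: str, prompt: str, stream: bool = False) -> bytes:
    """Encode a minimal chat completion request body as JSON bytes."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return json.dumps(body).encode("utf-8")


def iter_stream_text(lines):
    """Yield content fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample of what a streamed reply could look like.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo!"}}]}',
    "data: [DONE]",
]
payload = build_chat_payload("llama3-8b-q4", "Hello!", stream=True)
reply = "".join(iter_stream_text(sample))
```

To use this against a running proxy, POST `payload` to /v1/chat/completions with a Content-Type of application/json and feed the decoded response lines to `iter_stream_text`.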

POST /v1/completions

OpenAI-compatible text completions. Supports streaming.

POST /v1/audio/transcriptions

OpenAI-compatible speech-to-text endpoint. DynLLM accepts the standard multipart request and proxies it to OVMS or transformers serve, depending on the configured backend.

curl http://localhost:8000/v1/audio/transcriptions \
  -F "model=whisper-large-v3-ov" \
  -F "file=@speech.wav"

POST /v1/audio/translations

OpenAI-compatible speech translation endpoint. This uses the same OpenVINO transcription models and maps to OVMS /v3/audio/translations.

curl http://localhost:8000/v1/audio/translations \
  -F "model=whisper-large-v3-ov" \
  -F "file=@speech-es.wav"

POST /v1/audio/speech

OpenAI-compatible text-to-speech endpoint. OVMS currently returns WAV audio.

curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"speecht5-ov","input":"Hello from DynLLM"}' \
  -o speech.wav

POST /v1/images/generations

OpenAI-compatible image generation endpoint. Supported via OVMS with model_type: image_generation.

curl http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model":"sd-xl-ov","prompt":"a cat wearing a hat","n":1,"size":"512x512"}'

POST /v1/embeddings

OpenAI-compatible embeddings endpoint. Supported by llama.cpp (GGUF embedding models) and OVMS.

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"bge-m3-gguf","input":"Hello world"}'

POST /v1/rerank

Cohere-compatible reranking endpoint. Supported by llama.cpp (GGUF reranker models) and OVMS.

curl http://localhost:8000/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{"model":"bge-reranker-v2-gguf","query":"what is AI?","documents":["AI is...","ML is..."]}'

KServe API (/v2/models/{name}/...)

KServe v2 protocol passthrough for OpenVINO models. Supports classification, detection, segmentation, OCR, and other non-LLM models through OVMS.

# Model metadata
curl http://localhost:8000/v2/models/resnet50-vm

# Readiness check
curl http://localhost:8000/v2/models/resnet50-vm/ready

# Inference
curl -X POST http://localhost:8000/v2/models/resnet50-vm/infer \
  -H "Content-Type: application/json" \
  -d '{"inputs":[...]}'

GET /admin/models

Returns detailed internal state for every known model (status, port, PID, VRAM, timestamps).

POST /admin/models/unload

Manually unload a model from VRAM.

curl -X POST http://localhost:8000/admin/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3-8b-q4"}'

How It Works

Load on demand

When a request arrives for any configured endpoint (/v1/chat/completions, /v1/completions, /v1/audio/*, /v1/images/generations, /v1/embeddings, /v1/rerank, /v2/models/*):

  1. DynLLM looks up the model in the config by name.
  2. If not loaded: checks whether enough VRAM is free.
  3. If not enough VRAM: evicts models in LIFO order (most recently loaded first), skipping any model currently serving a request.
  4. Starts the backend subprocess and waits for it to be ready.
  5. Proxies the request to the backend and streams the response back.

Idle auto-unload

A background scheduler checks loaded models every 30 seconds. Any model that has been idle longer than its effective timeout is stopped and its VRAM is freed. The effective timeout is the per-model unload_time when set; otherwise the global idle_timeout_seconds applies. Models with active requests are never unloaded.

Per-model unload_time

Set unload_time: -1 (or inf) on a model to keep it permanently in VRAM unless VRAM pressure forces eviction or you manually unload it via /admin/models/unload.
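
The timeout rules above can be sketched as two small functions: one resolves the effective timeout for a model, the other decides whether the scheduler should unload it. This is an illustrative sketch under the rules stated in this README. `effective_timeout` and `should_unload` are hypothetical names, and how DynLLM treats negative values other than -1 is not documented, so they are passed through unchanged here.

```python
import math


def effective_timeout(unload_time, global_timeout: float) -> float:
    """Resolve a model's idle timeout in seconds; math.inf means never."""
    if unload_time is None:
        return float(global_timeout)  # no per-model override
    value = float(unload_time)        # float("inf") also accepts the string "inf"
    if value == -1 or math.isinf(value):
        return math.inf
    return value


def should_unload(idle_seconds: float, active_requests: int, timeout: float) -> bool:
    """A model is unloaded only when it is idle past its timeout and not busy."""
    return active_requests == 0 and idle_seconds > timeout


# A model with unload_time: -1 stays loaded no matter how long it idles.
pinned = effective_timeout(-1, 300)
keep_pinned = not should_unload(10_000, 0, pinned)
```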

Startup preloading

List model names under preload_models: to have them loaded before the proxy starts accepting traffic. This reduces first-request latency.

Startup backend logging

At startup DynLLM logs the available torch execution backends detected in the runtime, for example ['cpu', 'xpu'] or ['cpu', 'cuda']. This helps verify that the installed torch build matches the hardware backend you expect to use.

VRAM eviction (LIFO)

When a new model needs to be loaded and there is insufficient free VRAM, DynLLM evicts the most recently loaded model first (Last-In, First-Out). This heuristic keeps the models that have been loaded longest resident in memory.

Models with in-flight requests are never evicted. If every remaining eviction candidate is busy and the required VRAM cannot be freed, the load fails with HTTP 503.
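
The eviction rule can be sketched as a planning function that walks loaded models from most to least recently loaded, skips busy ones, and stops as soon as enough VRAM is free. This is a simplified illustration rather than DynLLM's implementation. The `Loaded` record, `plan_eviction`, and `NoEvictableModels` are hypothetical names, and the exception stands in for the HTTP 503 case.

```python
from dataclasses import dataclass


@dataclass
class Loaded:
    name: str
    vram_mb: int
    loaded_at: float      # timestamp of when the model was loaded
    active_requests: int  # in-flight requests; busy models are skipped


class NoEvictableModels(Exception):
    """Not enough VRAM can be freed (the proxy answers HTTP 503 here)."""


def plan_eviction(loaded, needed_mb, total_vram_mb):
    """Return names to evict, most recently loaded first, to fit needed_mb."""
    free = total_vram_mb - sum(m.vram_mb for m in loaded)
    victims = []
    for model in sorted(loaded, key=lambda m: m.loaded_at, reverse=True):
        if free >= needed_mb:
            break  # enough room already
        if model.active_requests > 0:
            continue  # never interrupt a model that is serving a request
        victims.append(model.name)
        free += model.vram_mb
    if free < needed_mb:
        raise NoEvictableModels(f"need {needed_mb} MB, can free only {free} MB")
    return victims


# 7500 MB budget with 7400 MB in use; a 4096 MB model is requested.
state = [
    Loaded("a", 5500, loaded_at=1.0, active_requests=0),
    Loaded("b", 1500, loaded_at=2.0, active_requests=1),  # busy, skipped
    Loaded("c", 400, loaded_at=3.0, active_requests=0),
]
victims = plan_eviction(state, needed_mb=4096, total_vram_mb=7500)
```

Here "c" is evicted first because it was loaded last, "b" is skipped because it is busy, and "a" is evicted to reach the required free space.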

State persistence

All model state (status, PID, port, VRAM, timestamps) is stored in a SQLite database. On startup, any models stuck in a loading or unloading state (e.g. due to a crash) are automatically reset to unloaded.
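
The startup healing step amounts to a single UPDATE over the state table. The sketch below shows the idea against an in-memory SQLite database. The table name, column names, and the `loaded` status value are assumptions made for illustration, and clearing the PID and port goes beyond what this README states; DynLLM's real schema may differ.

```python
import sqlite3


def heal_stale_states(conn: sqlite3.Connection) -> int:
    """Reset models left mid-transition to 'unloaded'; return rows changed."""
    cur = conn.execute(
        "UPDATE models SET status = 'unloaded', pid = NULL, port = NULL "
        "WHERE status IN ('loading', 'unloading')"
    )
    conn.commit()
    return cur.rowcount


# Hypothetical schema and rows simulating the state left behind by a crash.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE models (name TEXT PRIMARY KEY, status TEXT, pid INTEGER, port INTEGER)"
)
conn.executemany(
    "INSERT INTO models VALUES (?, ?, ?, ?)",
    [
        ("llama3-8b-q4", "loading", 4242, 9100),
        ("phi3-mini-ov", "loaded", 4243, 9101),
        ("whisper-large-v3-ov", "unloading", 4244, 9102),
    ],
)
healed = heal_stale_states(conn)
statuses = dict(conn.execute("SELECT name, status FROM models"))
```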


systemd Deployment

# Run as root
sudo bash systemd/install.sh

The installer:

  1. Creates a dynllm system user
  2. Copies the project to /opt/dynllm
  3. Runs uv sync --frozen
  4. Creates a default config at /opt/dynllm/config.yaml
  5. Installs and enables the systemd unit dynllm.service

After installation, check the service status and follow its logs:

sudo systemctl status dynllm
sudo journalctl -u dynllm -f

GPU access is granted via /dev/dri (Intel/AMD). For NVIDIA, uncomment the relevant lines in systemd/dynllm.service.


Backend Notes

llama.cpp

  • Serves GGUF models only.
  • One llama-server process per loaded model.
  • Readiness is detected via GET /health.
  • Relevant config fields: n_gpu_layers, context_size.

OpenVINO Model Server (OVMS)

  • Serves OpenVINO IR model directories only (not GGUF).
  • One ovms process per loaded model.
  • model_type: llm, model_type: embedding, and model_type: rerank use the standard single-model OVMS config flow with the OpenAI-compatible /v3/ endpoints (/v3/chat/completions, /v3/embeddings, /v3/rerank).
  • model_type: transcription and model_type: speech use OVMS audio task mode (--task speech2text / --task text2speech). Audio endpoints require OVMS 2025.4+.
  • model_type: image_generation uses OVMS image generation task mode (--task image_generation) and exposes /v3/images/generations (OpenAI-compatible). Supports Stable Diffusion, SDXL, and FLUX.1 models in OpenVINO IR format.
  • model_type: classification, detection, segmentation, and ocr are served through the KServe v2 API (/v2/models/{name}/infer). These use the standard OVMS config flow and are accessible via any KServe-compatible client.
  • gRPC is disabled (--port 0); only the REST API is used.
  • Readiness is detected by model type:
    • LLM/embedding/rerank/CV models: KServe Model Readiness endpoint (GET /v2/models/{name}/ready).
    • Audio/image generation models: OVMS task mode does not register KServe models, so readiness is detected by probing the relevant /v3/ endpoint with an OPTIONS request.
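
The readiness rules for llama.cpp and OVMS can be summarised as a lookup from backend and model type to an HTTP method and path. The sketch below is illustrative only. `readiness_probe` is a hypothetical helper, and the /v3/audio/transcriptions and /v3/audio/speech probe paths are assumptions, since this README names only the rule and not the exact endpoint probed for each audio task. The transformers backend is left out because its readiness check is not documented here.

```python
KSERVE_TYPES = {
    "llm", "embedding", "rerank",
    "classification", "detection", "segmentation", "ocr",
}
TASK_ENDPOINTS = {
    "transcription": "/v3/audio/transcriptions",  # assumed probe path
    "speech": "/v3/audio/speech",                 # assumed probe path
    "image_generation": "/v3/images/generations",
}


def readiness_probe(backend: str, model_type: str, name: str) -> tuple[str, str]:
    """Return the (HTTP method, path) used to check that a backend is ready."""
    if backend == "llamacpp":
        return ("GET", "/health")
    if backend == "openvino":
        if model_type in KSERVE_TYPES:
            return ("GET", f"/v2/models/{name}/ready")
        if model_type in TASK_ENDPOINTS:
            # Task mode registers no KServe model, so probe the endpoint itself.
            return ("OPTIONS", TASK_ENDPOINTS[model_type])
    raise ValueError(f"no documented readiness probe for {backend}/{model_type}")


probe = readiness_probe("openvino", "llm", "phi3-mini-ov")
```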

Hugging Face transformers

  • Serves local Hugging Face model directories through transformers serve.
  • One transformers serve process per loaded model.
  • DynLLM keeps the public model alias from config.yaml and rewrites backend requests to the local model path expected by transformers serve.
  • Supports model_type: llm and model_type: transcription.
  • Relevant config fields: device, dtype, quantization, trust_remote_code, compile_model, continuous_batching, attn_implementation, model_timeout, revision.
  • For Intel GPUs, install torch from the XPU wheel index before installing transformers[serving]:
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/xpu
uv pip install "transformers[serving]"
  • For CUDA systems, install the CUDA-specific torch wheels first, then transformers[serving].
  • quantization: bnb-4bit and quantization: bnb-8bit map to bitsandbytes loading in transformers serve.
  • In practice, bitsandbytes is most proven on CUDA. On Intel XPU it may work, but treat it as deployment-specific and validate the exact torch + bitsandbytes stack on your target machine before relying on it in production.
  • Quantization is enabled only for model_type: llm in DynLLM.
  • DynLLM still applies the same VRAM accounting, LIFO eviction, and idle unload rules used for llama.cpp and OVMS.

License

See LICENSE.
