A professional AI model management platform for llama.cpp models and versions, designed for modern AI workflows with comprehensive GPU support (NVIDIA CUDA, AMD Vulkan/ROCm, Metal, OpenBLAS).
- Search & Download: Search HuggingFace for GGUF models with comprehensive metadata and size information for each quantization
- Multi-Quantization Support: Download and manage multiple quantizations of the same model
- Model Library: Manage downloaded models with start/stop/delete functionality
- Smart Configuration: Auto-generate optimal llama.cpp parameters based on GPU capabilities
- VRAM Estimation: Real-time VRAM usage estimation with warnings for memory constraints
- Metadata Extraction: Rich model information including parameters, architecture, license, tags, and more
- Safetensors Runner: Configure and run safetensors checkpoints via LMDeploy TurboMind with an OpenAI-compatible endpoint on port 2001
- Release Installation: Download and install pre-built binaries from GitHub releases
- Source Building: Build from source with optional patches from GitHub PRs
- Custom Build Configuration: Customize GPU backends (CUDA, Vulkan, Metal, OpenBLAS), build type, and compiler flags
- Update Checking: Check for updates to both releases and source code
- Version Management: Install, update, and delete multiple llama.cpp versions
- Build Validation: Automatic validation of built binaries to ensure they work correctly
- Multi-GPU Support: Automatic detection and configuration for NVIDIA, AMD, and other GPUs
- NVIDIA CUDA: Full support for CUDA compute capabilities, flash attention, and multi-GPU
- AMD GPU Support: Vulkan and ROCm support for AMD GPUs
- Apple Metal: Support for Apple Silicon GPUs
- OpenBLAS: CPU acceleration with optimized BLAS routines
- VRAM Monitoring: Real-time GPU memory usage and temperature monitoring
- NVLink Detection: Automatic detection of NVLink connections and topology analysis
- Concurrent Execution: Run multiple models simultaneously via llama-swap proxy
- OpenAI-Compatible API: Standard API format for easy integration
- Port 2000: All models served through a single unified endpoint
- Automatic Lifecycle Management: Seamless starting/stopping of models
- Modern UI: Vue.js 3 with PrimeVue components
- Real-time Updates: WebSocket-based progress tracking and system monitoring
- Responsive Design: Works on desktop and mobile devices
- System Status: CPU, memory, disk, and GPU monitoring
- LMDeploy Installer: Dedicated UI to install/remove LMDeploy at runtime with live logs
- Dark Mode: Built-in theme support
- Clone the repository:
git clone <repository-url>
cd llama-cpp-studio
- Start the application:
# CPU-only mode
docker-compose -f docker-compose.cpu.yml up -d
# GPU mode (NVIDIA CUDA)
docker-compose -f docker-compose.cuda.yml up -d
# Vulkan/AMD GPU mode
docker-compose -f docker-compose.vulkan.yml up -d
# ROCm mode
docker-compose -f docker-compose.rocm.yml up -d
- Access the web interface at http://localhost:8080
Prebuilt images are pushed to GitHub Container Registry whenever the publish-docker workflow runs.
ghcr.io/<org-or-user>/llama-cpp-studio:latest – standard image based on ubuntu:22.04 with GPU tooling installed at runtime
Pull the image from GHCR:
docker pull ghcr.io/<org-or-user>/llama-cpp-studio:latest
- Build the image:
docker build -t llama-cpp-studio .
- Run the container:
# With GPU support
docker run -d \
--name llama-cpp-studio \
--gpus all \
-p 8080:8080 \
-v ./data:/app/data \
llama-cpp-studio
# CPU-only
docker run -d \
--name llama-cpp-studio \
-p 8080:8080 \
-v ./data:/app/data \
llama-cpp-studio
- CUDA_VISIBLE_DEVICES: GPU device selection (default: all, set to "" for CPU-only)
- PORT: Web server port (default: 8080)
- HUGGINGFACE_API_KEY: HuggingFace API token for model search and download (optional)
- LMDEPLOY_BIN: Override path to the lmdeploy CLI (default: lmdeploy on PATH)
- LMDEPLOY_PORT: Override the LMDeploy OpenAI port (default: 2001)
/app/data: Persistent storage for models, configurations, and database
To enable model search and download functionality, you need to set your HuggingFace API key. You can do this in several ways:
Uncomment and set the token in your docker-compose.yml:
environment:
- CUDA_VISIBLE_DEVICES=all
- HUGGINGFACE_API_KEY=your_huggingface_token_here
Create a .env file in your project root:
HUGGINGFACE_API_KEY=your_huggingface_token_here
Then uncomment the env_file section in docker-compose.yml:
env_file:
- .env
Set the environment variable before running Docker Compose:
export HUGGINGFACE_API_KEY=your_huggingface_token_here
docker-compose up -d
- Go to HuggingFace Settings
- Create a new token with "Read" permissions
- Copy the token and use it in one of the methods above
Note: When the API key is set via environment variable, it cannot be modified through the web UI for security reasons.
- NVIDIA: NVIDIA GPU with CUDA support, NVIDIA Container Toolkit installed
- AMD: AMD GPU with Vulkan/ROCm drivers
- Apple: Apple Silicon with Metal support
- CPU: OpenBLAS for CPU acceleration (included in Docker image)
- Minimum 8GB VRAM recommended for most models
Safetensors execution relies on LMDeploy, but the base image intentionally omits it to keep Docker builds lightweight (critical for GitHub Actions). Use the LMDeploy page in the UI to install or remove LMDeploy inside the running container—installs happen via pip at runtime and logs are streamed live. The installer creates a dedicated virtual environment under /app/data/lmdeploy/venv, so the package lives on the writable volume and can be removed by deleting that folder. If you are running outside the container, you can still pip install lmdeploy manually or point LMDEPLOY_BIN to a custom binary. The runtime uses lmdeploy serve turbomind to expose an OpenAI-compatible server on port 2001.
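The same install flow that the LMDeploy page drives is also exposed over HTTP (see the API endpoints later in this README). Below is a minimal sketch using Python's requests package, under a few assumptions: the studio listens on localhost:8080, POST /api/lmdeploy/install takes no request body, and the status payload exposes something like a current_operation field (the exact field names are guesses, not documented here).

```python
# Minimal sketch: drive the LMDeploy installer from a script instead of the UI.
import time
import requests

BASE = "http://localhost:8080"

print("installer status:", requests.get(f"{BASE}/api/lmdeploy/status").json())

# Trigger an installation, then poll the status endpoint until the operation settles.
requests.post(f"{BASE}/api/lmdeploy/install")
for _ in range(120):  # poll for up to ~10 minutes
    status = requests.get(f"{BASE}/api/lmdeploy/status").json()
    print(status)
    if not status.get("current_operation"):  # field name is an assumption
        break
    time.sleep(5)

# The installer log can also be tailed over HTTP.
print(requests.get(f"{BASE}/api/lmdeploy/logs").text)
```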
- Use the search bar to find GGUF models on HuggingFace
- Filter by tags, parameters, or model name
- View comprehensive metadata including downloads, likes, tags, and file sizes
- Click download on any quantization to start downloading
- Multiple quantizations of the same model are automatically grouped
- Progress tracking with real-time updates via WebSocket
- Set llama.cpp parameters or use Smart Auto for optimal settings
- View VRAM estimation before starting
- Configure context size, batch sizes, temperature, and more
- Start/stop models with one click
- Multiple models can run simultaneously
- View running instances and resource usage
- View available releases and source updates
- See commit history and release notes
- Download pre-built binaries from GitHub
- Automatic verification and installation
- Compile from source with custom configuration
- Select GPU backends (CUDA, Vulkan, Metal, OpenBLAS)
- Configure build type (Release, Debug, RelWithDebInfo)
- Add custom CMake flags and compiler options
- Apply patches from GitHub PRs
- Automatic validation of built binaries
- Delete old versions to free up space
- View installation details and build configuration
- Overview: CPU, memory, disk, and GPU usage
- GPU Details: Individual GPU information and utilization
- Running Instances: Active model instances with resource usage
- WebSocket: Real-time updates for all metrics
llama-cpp-studio uses llama-swap to serve multiple models simultaneously on port 2000.
Simply start any model from the Model Library. All models run on port 2000 simultaneously.
curl http://localhost:2000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3-2-1b-instruct-iq2-xs",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Model names are shown in System Status after starting a model.
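Because the proxy speaks the standard OpenAI API, any OpenAI-compatible client can talk to it. Here is a minimal sketch with the openai Python package (an assumption; it is not bundled with the studio, install it with pip install openai), reusing the model name from the curl example above.

```python
# Minimal sketch: the same request as the curl example, through the official
# openai client pointed at the llama-swap proxy on port 2000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3-2-1b-instruct-iq2-xs",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```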
- Multiple models run concurrently
- No loading time - instant switching between models
- Standard OpenAI API format
- Automatic lifecycle management
- Single unified endpoint
- Check available models: http://localhost:2000/v1/models
- Check proxy health: http://localhost:2000/health
- View logs: docker logs llama-cpp-studio
- Run exactly one safetensors checkpoint at a time via LMDeploy
- Configure tensor/pipeline parallelism, context length, temperature, and other runtime flags from the Model Library
- Serves an OpenAI-compatible endpoint at http://localhost:2001/v1/chat/completions (see the example below)
- Install LMDeploy on demand from the LMDeploy page (or manually via pip) before starting safetensors runtimes
- Start/stop directly from the Safetensors panel; status is reported in System Status and the LMDeploy status chip
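For completeness, the same kind of request against the LMDeploy endpoint, as a minimal Python sketch. The model name is a placeholder; use the name reported in System Status once the safetensors runtime is up.

```python
# Minimal sketch: query the LMDeploy runtime on port 2001.
import requests

resp = requests.post(
    "http://localhost:2001/v1/chat/completions",
    json={
        "model": "your-safetensors-model",  # placeholder model name
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```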
Enable specific GPU backends during source builds:
- CUDA: NVIDIA GPU acceleration with cuBLAS
- Vulkan: AMD/Intel GPU acceleration with Vulkan compute
- Metal: Apple Silicon GPU acceleration
- OpenBLAS: CPU optimization with OpenBLAS routines
Customize your build with:
- Build Type: Release (optimal), Debug (development), RelWithDebInfo
- Custom CMake Flags: Additional CMake configuration
- Compiler Flags: CFLAGS and CXXFLAGS for optimization
- Git Patches: Apply patches from GitHub PRs
{
"commit_sha": "master",
"patches": [
"https://github.com/ggerganov/llama.cpp/pull/1234.patch"
],
"build_config": {
"build_type": "Release",
"enable_cuda": true,
"enable_vulkan": false,
"enable_metal": false,
"enable_openblas": true,
"custom_cmake_args": "-DGGML_CUDA_CUBLAS=ON",
"cflags": "-O3 -march=native",
"cxxflags": "-O3 -march=native"
}
}
The Smart Auto feature automatically generates optimal llama.cpp parameters based on:
- GPU Capabilities: VRAM, compute capability, multi-GPU support
- NVLink Topology: Automatic detection and optimization for NVLink clusters
- Model Architecture: Detected from model name (Llama, Mistral, etc.)
- Available Resources: CPU cores, memory, disk space
- Performance Optimization: Flash attention, tensor parallelism, batch sizing
The system automatically detects NVLink topology and applies an appropriate strategy (a simplified sketch follows the list below):
- Unified NVLink: All GPUs connected via NVLink - uses aggressive tensor splitting and higher parallelism
- Clustered NVLink: Multiple NVLink clusters - optimizes for the largest cluster
- Partial NVLink: Some GPUs connected via NVLink - uses hybrid approach
- PCIe Only: No NVLink detected - uses conservative PCIe-based configuration
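For illustration only, here is a simplified sketch of how those four topology classes might map to a tensor-split decision. This is an assumption about the general shape of such logic, not the studio's actual implementation.

```python
# Illustrative sketch (not the project's code): map NVLink topology classes
# to a tensor-split strategy. `clusters` lists groups of GPU indices that are
# fully NVLink-connected; `vram` maps GPU index -> free VRAM in GiB.
def pick_split_strategy(clusters: list[list[int]], vram: dict[int, float]) -> dict:
    all_gpus = sorted(vram)
    if len(all_gpus) <= 1:
        return {"strategy": "single-gpu", "gpus": all_gpus}
    if len(clusters) == 1 and len(clusters[0]) == len(all_gpus):
        # Unified NVLink: split tensors across every GPU, weighted by free VRAM.
        gpus, strategy = all_gpus, "unified-nvlink"
    elif clusters:
        # Clustered / partial NVLink: restrict the split to the largest cluster.
        gpus, strategy = max(clusters, key=len), "largest-nvlink-cluster"
    else:
        # PCIe only: keep a conservative split across all GPUs.
        gpus, strategy = all_gpus, "pcie-conservative"
    total = sum(vram[g] for g in gpus)
    tensor_split = [round(vram[g] / total, 3) for g in gpus]
    return {"strategy": strategy, "gpus": gpus, "tensor_split": tensor_split}

# Two GPUs bridged by NVLink plus one extra PCIe card -> split over the cluster.
print(pick_split_strategy([[0, 1]], {0: 24.0, 1: 24.0, 2: 12.0}))
```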
- Context size, batch sizes, GPU layers
- Temperature, top-k, top-p, repeat penalty
- CPU threads, parallel sequences
- RoPE scaling, YaRN factors
- Multi-GPU tensor splitting
- Custom arguments via YAML config
- GET /api/models - List all models
- POST /api/models/search - Search HuggingFace
- POST /api/models/download - Download model
- GET /api/models/{id}/config - Get model configuration
- PUT /api/models/{id}/config - Update configuration
- POST /api/models/{id}/auto-config - Generate smart configuration
- POST /api/models/{id}/start - Start model
- POST /api/models/{id}/stop - Stop model
- DELETE /api/models/{id} - Delete model
- GET /api/models/safetensors/{model_id}/lmdeploy/config - Get LMDeploy config for a safetensors download
- PUT /api/models/safetensors/{model_id}/lmdeploy/config - Update LMDeploy config
- POST /api/models/safetensors/{model_id}/lmdeploy/start - Start LMDeploy runtime
- POST /api/models/safetensors/{model_id}/lmdeploy/stop - Stop LMDeploy runtime
- GET /api/models/safetensors/lmdeploy/status - LMDeploy manager status
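A minimal sketch of a typical flow against these endpoints using Python's requests package. The response field names ("id", "name") and the auto-config-then-start ordering are assumptions about the payload shape, not documented behavior.

```python
# Minimal sketch: list downloaded models, ask Smart Auto for a configuration,
# then start the first model through the studio API on port 8080.
import requests

BASE = "http://localhost:8080"

models = requests.get(f"{BASE}/api/models").json()
if not models:
    raise SystemExit("no models downloaded yet")

model_id = models[0]["id"]  # assumed field name
print("starting", models[0].get("name", model_id))

# Generate a Smart Auto configuration, then launch the model via llama-swap.
requests.post(f"{BASE}/api/models/{model_id}/auto-config").raise_for_status()
requests.post(f"{BASE}/api/models/{model_id}/start").raise_for_status()
```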
- GET /api/lmdeploy/status - Installer status (version, binary path, current operation)
- POST /api/lmdeploy/install - Install LMDeploy via pip at runtime
- POST /api/lmdeploy/remove - Remove LMDeploy from the runtime environment
- GET /api/lmdeploy/logs - Tail the LMDeploy installer log
- GET /api/llama-versions - List installed versions
- GET /api/llama-versions/check-updates - Check for updates
- GET /api/llama-versions/build-capabilities - Get build capabilities
- POST /api/llama-versions/install-release - Install release
- POST /api/llama-versions/build-source - Build from source
- DELETE /api/llama-versions/{id} - Delete version
- GET /api/status - System status
- GET /api/gpu-info - GPU information
- WebSocket /ws - Real-time updates
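Real-time metrics arrive over the /ws WebSocket. A minimal sketch using the third-party websockets package (an assumption; any WebSocket client works) that simply prints raw frames, since the message schema is whatever the web UI consumes:

```python
# Minimal sketch: subscribe to the studio's real-time update stream.
import asyncio
import websockets  # assumption: pip install websockets

async def watch():
    async with websockets.connect("ws://localhost:8080/ws") as ws:
        async for message in ws:  # print frames as they arrive
            print(message)

asyncio.run(watch())
```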
If upgrading from an older version, you may need to migrate your database:
# Run migration to support multi-quantization
python migrate_db.py
- GPU Not Detected
  - Ensure NVIDIA Container Toolkit is installed (for NVIDIA)
  - Check nvidia-smi output
  - Verify --gpus all flag in docker run
  - For AMD: Check Vulkan/ROCm drivers
- Build Failures
  - Check CUDA version compatibility (for NVIDIA)
  - Ensure sufficient disk space (at least 10GB free)
  - Verify internet connectivity for downloads
  - For Vulkan builds: Ensure glslang-tools is installed
  - Check build logs for specific errors
- Memory Issues
  - Use Smart Auto configuration
  - Reduce context size or batch size
  - Enable memory mapping
  - Check available system RAM and VRAM
- Model Download Failures
  - Check HuggingFace connectivity
  - Verify model exists and is public
  - Ensure sufficient disk space
  - Set HUGGINGFACE_API_KEY if using private models
- Validation Failed
  - Binary exists and is executable
  - Binary runs --version successfully
  - Output contains "llama" or "version:" string
- Application logs: docker logs llama-cpp-studio
- Model logs: Available in the web interface
- Build logs: Shown during source compilation
- WebSocket logs: DEBUG level for detailed connection info
- FastAPI with async support
- SQLAlchemy for database management
- WebSocket for real-time updates
- Background tasks for long operations
- Llama-swap integration for multi-model serving
- Vue.js 3 with Composition API
- PrimeVue component library
- Pinia for state management
- Vite for build tooling
- Dark mode support
- SQLite for simplicity
- Models, versions, and instances tracking
- Configuration storage
- Multi-quantization support
The studio’s capacity planning tooling is grounded in a three-component model for llama.cpp that provides a conservative upper bound on peak memory usage.
- Formula: M_total = M_weights + M_kv + M_compute (a worked example follows this list)
- Model weights (M_weights): Treat the GGUF file size as the ground truth. When --no-mmap is disabled (default), the file is memory-mapped so only referenced pages touch physical RAM, but the virtual footprint still equals the file size.
- KV cache (M_kv): Uses the GQA-aware formula n_ctx × N_layers × N_head_kv × (N_embd / N_head) × (p_a_k + p_a_v), where p_a_* are the bytes-per-value chosen via --cache-type-k / --cache-type-v.
- Compute buffers (M_compute): Approximate as a fixed CUDA overhead (~550 MB) plus a scratch buffer that scales with micro-batch size (n_ubatch × 0.5 MB by default).
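As a worked check of the formula against the benchmark target quoted later in this section (13B @ 2,048 tokens → ~1.6 GB KV cache), here is a small sketch. The Llama-2-13B architecture numbers (40 layers, 5120 embedding width, 40 query heads, 40 KV heads, i.e. no GQA) and the fp16 cache assumption are mine, not taken from the studio's code.

```python
# Worked example of the three-component memory estimate.
GIB = 1024 ** 3

def kv_cache_bytes(n_ctx, n_layers, n_head_kv, n_embd, n_head, bytes_k=2, bytes_v=2):
    # GQA-aware KV cache size; bytes_k/bytes_v = 2 assumes fp16 cache types.
    head_dim = n_embd // n_head
    return n_ctx * n_layers * n_head_kv * head_dim * (bytes_k + bytes_v)

def total_bytes(weights_bytes, kv_bytes, n_ubatch=512):
    # Compute buffers: ~550 MB fixed CUDA overhead + 0.5 MB per micro-batch row.
    compute = (550 + 0.5 * n_ubatch) * 1024 ** 2
    return weights_bytes + kv_bytes + compute

kv = kv_cache_bytes(n_ctx=2048, n_layers=40, n_head_kv=40, n_embd=5120, n_head=40)
print(f"13B KV cache @ 2048 ctx: {kv / GIB:.2f} GiB")  # ~1.56 GiB

# Plugging in a hypothetical 7.2 GiB GGUF file for the weights term (~9.5 GiB total):
print(f"total: {total_bytes(7.2 * GIB, kv) / GIB:.2f} GiB")
```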
- -ngl 0 (CPU-only): All components stay in RAM.
- -ngl > 0 (hybrid/full GPU): Model weights split by layer between RAM and VRAM, while both M_kv and M_compute move entirely to VRAM (the "VRAM trap").
- Full offload avoids PCIe contention; hybrid splits suffer a "performance cliff" because activations bounce between CPU and GPU.
- Attempt full offload first (best throughput). If weights + compute fit, deduce n_ctx_max from the remaining VRAM budget.
- When full offload fails, search decreasing n_ngl values that satisfy RAM limits while maximizing context length, accepting the hybrid performance penalty (a simplified sketch of this search follows the list).
- Iterate quantization choices to find the smallest model that still enables full offload on the target hardware profile.
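Below is an illustrative sketch of that allocation order, under the simplifying assumptions that weights split proportionally to the number of offloaded layers and that KV cache and compute buffers always land in VRAM once n_ngl > 0. It is not the studio's actual implementation.

```python
# Illustrative sketch: try full offload first, otherwise walk n_ngl down until
# the GPU share fits in VRAM and the CPU-side weight remainder fits in RAM.
def plan_offload(weights_gib, kv_gib, compute_gib, n_layers, vram_gib, ram_gib):
    if weights_gib + kv_gib + compute_gib <= vram_gib:
        return {"n_ngl": n_layers, "mode": "full-offload"}
    for n_ngl in range(n_layers - 1, 0, -1):
        gpu_weights = weights_gib * n_ngl / n_layers
        cpu_weights = weights_gib - gpu_weights
        if gpu_weights + kv_gib + compute_gib <= vram_gib and cpu_weights <= ram_gib:
            return {"n_ngl": n_ngl, "mode": "hybrid"}
    return {"n_ngl": 0, "mode": "cpu-only"}

# Made-up numbers: 7.2 GiB of weights on an 8 GiB card -> hybrid, n_ngl ≈ 31.
print(plan_offload(weights_gib=7.2, kv_gib=1.6, compute_gib=0.8,
                   n_layers=40, vram_gib=8.0, ram_gib=32.0))
```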
The Smart Auto subsystem applies the model above to recommend llama.cpp launch parameters. Priority 1 fixes are complete, eliminating prior memory underestimation bugs.
- Resolutions:
- Corrected KV cache math to respect grouped-query attention head counts.
- Removed the dangerous 0.30 multiplier on cache size; estimates now use real memory.
- Ensured KV cache/compute buffers migrate to VRAM whenever GPU layers are in play.
- Modeled compute overhead as 550 MB + 0.5 MB × n_ubatch.
- Improved GPU layer estimation using GGUF file size with a 20% safety buffer.
- Open improvements:
- Reorder calculations so KV cache quantization feeds batch/context sizing directly.
- Replace remaining heuristics with joint optimization across n_ctx, n_ngl, and n_ubatch.
- Benchmark against known examples (e.g., 13B @ 2,048 tokens → ~1.6 GB KV cache, 7B @ 4,096 tokens → ~6 GB total).
- Stress-test large contexts, tight VRAM scenarios, MoE models, and hybrid modes.
- Expand automated regression coverage around the estimator and Smart Auto flows.
Empirical testing with Llama-3.2-1B-Instruct.IQ1_M demonstrates that the estimator acts as a safe upper bound.
- Setup: n_ctx ≈ 35K, batch 32, CPU-only run.
- Estimated peak: 4.99 GB (weights 394 MB, KV cache 4.34 GB, batch 12 MB, llama.cpp overhead 256 MB).
- Observed deltas:
- With mmap enabled: ~608 MB (11.9% of estimate). Lower usage is expected because the KV cache grows as context fills and weights are paged on demand.
- With --no-mmap: ~1.16 GB (23% of estimate). Weights load fully, but KV cache still expands progressively.
- Takeaways:
- Estimates intentionally err on the high side to prevent OOM once the context window reaches capacity.
- Divergence between virtual and physical usage stems from memory mapping and lazy KV cache allocation.
- Additional GPU-focused measurements and long session traces are encouraged to correlate VRAM predictions with reality.
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright (c) 2024 llama.cpp Studio
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
For issues and questions:
- Create an issue on GitHub
- Check the troubleshooting section
- Review the API documentation
- llama.cpp - The core inference engine
- llama-swap - Multi-model serving proxy
- HuggingFace - Model hosting and search
- Vue.js - Frontend framework
- FastAPI - Backend framework