📘 API Documentation | 🌟 Examples | 📦 Model Zoo
usls is a cross-platform Rust library powered by ONNX Runtime for efficient inference of SOTA vision and vision-language models (typically under 1B parameters).
- ⚡ High Performance: Multi-threading, SIMD, and CUDA-accelerated processing
- 🌐 Cross-Platform: Linux, macOS, Windows with ONNX Runtime execution providers (CUDA, TensorRT, CoreML, OpenVINO, DirectML, etc.)
- 🏗️ Unified API: Single `Model` trait inference with `run()`/`forward()`/`encode_images()`/`encode_texts()` and unified `Y` output
- 📥 Auto-Management: Automatic model download (HuggingFace/GitHub), caching and path resolution
- 📦 Multiple Inputs: Image, directory, video, webcam, stream and combinations
- 🎯 Precision Support: FP32, FP16, INT8, UINT8, Q4, Q4F16, BNB4, and more
- 🛠️ Full-Stack Suite: `DataLoader`, `Annotator`, and `Viewer` for complete workflows
- 🌱 Model Ecosystem: 50+ SOTA vision and VLM models
Run the YOLO-Series demo to explore models with different tasks, precision and execution providers:
- Tasks: `detect`, `segment`, `pose`, `classify`, `obb`
- Versions: `YOLOv5`, `YOLOv6`, `YOLOv7`, `YOLOv8`, `YOLOv9`, `YOLOv10`, `YOLO11`, `YOLOv12`, `YOLOv13`, `YOLO26`
- Scales: `n`, `s`, `m`, `l`, `x`
- Precision: `fp32`, `fp16`, `q8`, `q4`, `q4f16`, `bnb4`
- Execution Providers: `CPU`, `CUDA`, `TensorRT`, `TensorRT-RTX`, `CoreML`, `OpenVINO`, and more

    # CPU: Object detection with YOLO26n (FP16)
    cargo run -r --example yolo -- --task detect --ver 26 --scale n --dtype fp16
    # CUDA model + CPU processor: Instance segmentation with YOLO11m
    cargo run -r -F cuda --example yolo -- --task segment --ver 11 --scale m --device cuda:0 --processor-device cpu
    # CUDA model + CUDA processor: Pose estimation with YOLOv8s
    cargo run -r -F cuda-full --example yolo -- --task pose --ver 8 --scale s --device cuda:0 --processor-device cuda:0
    # TensorRT model + CPU processor
    cargo run -r -F tensorrt --example yolo -- --device tensorrt:0 --processor-device cpu
    # TensorRT model + CUDA processor (CUDA 12.4)
    cargo run -r -F tensorrt-cuda-12040 --example yolo -- --device tensorrt:0 --processor-device cuda:0
    # TensorRT-RTX model + CUDA processor
    cargo run -r -F nvrtx-full --example yolo -- --device nvrtx:0 --processor-device cuda:0
    # TensorRT-RTX model + CPU processor
    cargo run -r -F nvrtx --example yolo -- --device nvrtx:0
    # Apple Silicon CoreML
    cargo run -r -F coreml --example yolo -- --device coreml
    # Intel OpenVINO (CPU/GPU/VPU)
    cargo run -r -F openvino -F ort-load-dynamic --example yolo -- --device openvino:CPU
    # Show all available options
    cargo run -r --example yolo -- --help

See YOLO Examples for more details and use cases.
See Device Combination Guide for feature and device configurations.
If you hit a CUDA error (e.g., `CUDA failure 1: invalid argument`), use `--processor-device cpu` instead of `--processor-device cuda:0` to avoid CUDA memory transfer issues.
Environment: NVIDIA RTX 3060Ti (TensorRT-10.11.0.33, CUDA 12.8, TensorRT-RTX-1.3.0.35) / Intel i5-12400F
Setup: YOLO26n-detect model (640×640), COCO2017 validation set (5,000 images), no warm-up
| Backend | DType | Preprocess | Inference | Postprocess | Total |
|---|---|---|---|---|---|
| TensorRT EP + CUDA processor | FP16 | 234.570µs | 1.333ms | 253.631µs | 1.821ms |
| TensorRT EP + CPU processor | FP16 | 783.852µs | 2.438ms | 83.701µs | 3.306ms |
| TensorRT-RTX EP + CUDA processor | FP32 | 232.003µs | 2.934ms | 192.660µs | 3.359ms |
| TensorRT-RTX EP + CUDA processor | FP16 | ❓ | ❓ | ❓ | ❓ |
| TensorRT-RTX EP + CPU processor | FP32 | 794.292µs | 3.974ms | 83.926µs | 4.852ms |
| TensorRT-RTX EP + CPU processor | FP16 | ❓ | ❓ | ❓ | ❓ |
| CUDA EP + CUDA processor | FP32 | 242.752µs | 5.053ms | 95.968µs | 5.392ms |
| CUDA EP + CUDA processor | FP16 | 244.065µs | 3.684ms | 100.828µs | 4.029ms |
| CUDA EP + CPU processor | FP32 | 796.886µs | 6.044ms | 74.687µs | 6.916ms |
| CUDA EP + CPU processor | FP16 | 787.805µs | 4.565ms | 71.001µs | 5.424ms |
| CPU EP + CPU processor | FP32 | 971.332µs | 20.243ms | 59.022µs | 21.273ms |
| CPU EP + CPU processor | FP16 | 954.297µs | 23.155ms | 59.197µs | 24.168ms |
| Backend | DType | Preprocess | Inference | Postprocess | Total |
|---|---|---|---|---|---|
| TensorRT EP + CUDA processor | FP16 | 2.100ms | 6.497ms | 203.484µs | 8.801ms |
| TensorRT EP + CPU processor | FP16 | 18.913ms | 26.406ms | 194.782µs | 45.514ms |
| TensorRT-RTX EP + CUDA processor | FP32 | 2.161ms | 15.370ms | 167.937µs | 17.699ms |
| TensorRT-RTX EP + CUDA processor | FP16 | ❓ | ❓ | ❓ | ❓ |
| TensorRT-RTX EP + CPU processor | FP32 | 18.988ms | 35.101ms | 173.829µs | 54.263ms |
| TensorRT-RTX EP + CPU processor | FP16 | ❓ | ❓ | ❓ | ❓ |
| CUDA EP + CUDA processor | FP32 | 2.222ms | 24.479ms | 160.767µs | 26.862ms |
| CUDA EP + CUDA processor | FP16 | 2.262ms | 14.842ms | 135.593µs | 17.240ms |
| CUDA EP + CPU processor | FP32 | 19.037ms | 44.720ms | 190.740µs | 63.948ms |
| CUDA EP + CPU processor | FP16 | 18.245ms | 33.865ms | 183.226µs | 52.293ms |
| CPU EP + CPU processor | FP32 | 17.852ms | 216.872ms | 158.297µs | 234.883ms |
| CPU EP + CPU processor | FP16 | 17.698ms | 255.365ms | 117.239µs | 273.180ms |
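
The Total column in both tables is the sum of the three stage columns, which mix microseconds and milliseconds. The following is a minimal Rust sketch for checking a row and comparing two backends. `StageTimes`, `us`, and `speedup` are hypothetical helper names for illustration only and are not part of usls.

```rust
/// One benchmark row with every stage expressed in milliseconds.
/// Hypothetical helper type for reading the tables; not part of usls.
#[derive(Debug, Clone, Copy)]
pub struct StageTimes {
    pub preprocess_ms: f64,
    pub inference_ms: f64,
    pub postprocess_ms: f64,
}

/// Converts a value reported in microseconds (µs) to milliseconds.
pub fn us(value: f64) -> f64 {
    value / 1000.0
}

impl StageTimes {
    /// Sum of the three stages, matching the Total column.
    pub fn total_ms(&self) -> f64 {
        self.preprocess_ms + self.inference_ms + self.postprocess_ms
    }
}

/// How many times faster `candidate` is than `baseline`, end to end.
pub fn speedup(baseline: &StageTimes, candidate: &StageTimes) -> f64 {
    baseline.total_ms() / candidate.total_ms()
}
```

With the first table's numbers, `StageTimes { preprocess_ms: us(234.570), inference_ms: 1.333, postprocess_ms: us(253.631) }` totals about 1.821 ms (TensorRT EP + CUDA processor, FP16), and the CPU EP + CPU processor FP32 row totals about 21.273 ms, so `speedup` between them is roughly 11.7. Note that this particular comparison mixes FP16 and FP32 rows.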
Status: ✅ Supported | ❓ Unknown | ❌ Not Supported For Now
🔥 YOLO-Series
| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv5 | Image Classification, Object Detection, Instance Segmentation | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| YOLOv6 | Object Detection | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| YOLOv7 | Object Detection | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| YOLOv8 | Object Detection, Instance Segmentation, Image Classification, Oriented Object Detection, Keypoint Detection | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| YOLO11 | Object Detection, Instance Segmentation, Image Classification, Oriented Object Detection, Keypoint Detection | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| YOLOv9 | Object Detection | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| YOLOv10 | Object Detection | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| YOLOv12 | Image Classification, Object Detection, Instance Segmentation | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| YOLOv13 | Object Detection | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| YOLO26 | Object Detection, Instance Segmentation, Image Classification, Oriented Object Detection, Keypoint Detection | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
🏷️ Image Classification & Tagging
| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| BEiT | Image Classification | demo | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| ConvNeXt | Image Classification | demo | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| FastViT | Image Classification | demo | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| MobileOne | Image Classification | demo | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| DeiT | Image Classification | demo | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| RAM | Image Tagging | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| RAM++ | Image Tagging | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
🎯 Object Detection
| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| RT-DETRv1 | Object Detection | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| RT-DETRv2 | Object Detection | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| RT-DETRv4 | Object Detection | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| RF-DETR | Object Detection | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| PP-PicoDet | Object Detection | demo | ❌ | ❓ | ✅ | ❌ | ❌ | ❌ | ❌ |
| D-FINE | Object Detection | demo | ✅ | ❓ | ✅ | ❌ | ❌ | ❌ | ❌ |
| DEIM | Object Detection | demo | ✅ | ❓ | ✅ | ❌ | ❌ | ❌ | ❌ |
| DEIMv2 | Object Detection | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
🎨 Image Segmentation
| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| SAM | Segment Anything | demo | ✅ | ❓ | ✅ | ❌ | ❌ | ❌ | ❌ |
| SAM-HQ | Segment Anything | demo | ✅ | ❓ | ✅ | ❌ | ❌ | ❌ | ❌ |
| MobileSAM | Segment Anything | demo | ✅ | ❓ | ✅ | ❌ | ❌ | ❌ | ❌ |
| EdgeSAM | Segment Anything | demo | ✅ | ❓ | ✅ | ❌ | ❌ | ❌ | ❌ |
| FastSAM | Instance Segmentation | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| SAM2 | Segment Anything | demo | ✅ | ❓ | ✅ | ❌ | ❌ | ❌ | ❌ |
| SAM3-Tracker | Segment Anything | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
🗺️ Open-Set Detection & Segmentation
| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| GroundingDINO | Open-Set Detection With Language | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| MM-GDINO | Open-Set Detection With Language | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| LLMDet | Open-Set Detection With Language | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| OWLv2 | Open-Set Object Detection | demo | ✅ | ❓ | ✅ | ✅ | ❌ | ❌ | ❌ |
| YOLO-World | Open-Set Detection With Language | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| YOLOE | Open-Set Detection And Segmentation | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| SAM3-Image | Open-Set Detection And Segmentation | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
✨ Background Removal
| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| RMBG | Image Segmentation, Background Removal | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| BEN2 | Image Segmentation, Background Removal | demo | ✅ | ❓ | ✅ | ✅ | ❌ | ❌ | ❌ |
🏃 Multi-Object Tracking
| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| ByteTrack | Multi-Object Tracking | demo | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
💎 Image Super-Resolution
| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| Swin2SR | Image Restoration | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| APISR | Anime Super-Resolution | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
✂️ Image Matting
| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| MODNet | Image Matting | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ❌ | ❌ |
| MediaPipe Selfie | Image Segmentation | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ❌ | ❌ |
🤸 Pose Estimation
| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| RTMPose | Keypoint Detection | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| DWPose | Keypoint Detection | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| RTMW | Keypoint Detection | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| RTMO | Keypoint Detection | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ❌ |
🔍 OCR & Document Understanding
| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| DB | Text Detection | demo | ✅ | ❓ | ✅ | ✅ | ❌ | ❌ | ❌ |
| FAST | Text Detection | demo | ✅ | ❓ | ✅ | ✅ | ❌ | ❌ | ❌ |
| LinkNet | Text Detection | demo | ✅ | ❓ | ✅ | ✅ | ❌ | ❌ | ❌ |
| SVTR | Text Recognition | demo | ✅ | ❓ | ✅ | ✅ | ❌ | ❌ | ❌ |
| TrOCR | Text Recognition | demo | ✅ | ❓ | ✅ | ✅ | ❌ | ❌ | ❌ |
| SLANet | Table Recognition | demo | ✅ | ❓ | ✅ | ✅ | ❌ | ❌ | ❌ |
| DocLayout-YOLO | Object Detection | demo | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
🧩 Vision-Language Models (VLM)
| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| BLIP | Image Captioning | demo | ✅ | ❓ | ✅ | ❓ | ❌ | ❌ | ❌ |
| Florence2 | A Variety of Vision Tasks | demo | ✅ | ❓ | ✅ | ✅ | ❌ | ❌ | ❌ |
| Moondream2 | Open-Set Object Detection, Open-Set Keypoints Detection, Image Captioning, Visual Question Answering | demo | ✅ | ❓ | ❌ | ❌ | ✅ | ✅ | ❌ |
| SmolVLM | Visual Question Answering | demo | ✅ | ❓ | ✅ | ❓ | ❓ | ❓ | ❓ |
| SmolVLM2 | Visual Question Answering | demo | ✅ | ❓ | ✅ | ❓ | ❓ | ❓ | ❓ |
| FastVLM | Vision Language Models | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
🧬 Embedding Model
| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| CLIP | Vision-Language Embedding | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| jina-clip-v1 | Vision-Language Embedding | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| jina-clip-v2 | Vision-Language Embedding | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| mobileclip | Vision-Language Embedding | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| DINOv2 | Vision Embedding | demo | ✅ | ❓ | ✅ | ❌ | ❌ | ❌ | ❌ |
| DINOv3 | Vision Embedding | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
📐 Depth Estimation
| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| DepthAnything v1 | Monocular Depth Estimation | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| DepthAnything v2 | Monocular Depth Estimation | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| DepthPro | Monocular Depth Estimation | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Depth-Anything-3 | Monocular, Metric, Multi-View | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
🌌 Others
| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| Sapiens | Foundation for Human Vision Models | demo | ✅ | ❓ | ✅ | ✅ | ✅ | ✅ | ✅ |
| YOLOPv2 | Panoptic Driving | demo | ✅ | ❓ | ✅ | ❌ | ❌ | ❌ | ❌ |
❕ Features in italics are enabled by default.

- `ort-download-binaries`: Automatically download prebuilt ONNX Runtime binaries from pyke.
- `ort-load-dynamic`: Manually link ONNX Runtime. Useful for custom builds or unsupported platforms. See Linking Guide for more details.
- `viewer`: Real-time image/video visualization (similar to OpenCV `imshow`). Powered by minifb.
- `video`: Video I/O support for reading and writing video streams. Powered by video-rs.
- `hf-hub`: Download model files from Hugging Face Hub.
- `annotator`: Annotation utilities for drawing bounding boxes, keypoints, and masks on images.
- Additional image format support (optional for faster compilation):
  - `image-all-formats`: Enable all additional image formats.
  - `image-gif`, `image-bmp`, `image-ico`, `image-avif`, `image-tiff`, `image-dds`, `image-exr`, `image-ff`, `image-hdr`, `image-pnm`, `image-qoi`, `image-tga`: Individual image format support.
- `vision`: Core vision models (Detection, Segmentation, Classification, Pose, etc.).
- `vlm`: Vision-Language Models (CLIP, BLIP, Florence2, etc.).
- `mot`: Multi-Object Tracking utilities.
- `all-models`: Enable all model categories.
- Hardware acceleration for inference. Enable the one matching your hardware:
  - `cuda`: NVIDIA CUDA execution provider (pure model inference acceleration).
  - `tensorrt`: NVIDIA TensorRT execution provider (pure model inference acceleration).
  - `nvrtx`: NVIDIA NvTensorRT-RTX execution provider (pure model inference acceleration).
  - `cuda-full`: `cuda` + `cuda-runtime-build` (Model + Image Preprocessing acceleration).
  - `tensorrt-full`: `tensorrt` + `cuda-runtime-build` (Model + Image Preprocessing acceleration).
  - `nvrtx-full`: `nvrtx` + `cuda-runtime-build` (Model + Image Preprocessing acceleration).
  - `coreml`: Apple Silicon (macOS/iOS).
  - `openvino`: Intel CPU/GPU/VPU.
  - `onednn`: Intel Deep Neural Network Library.
  - `directml`: DirectML (Windows).
  - `webgpu`: WebGPU (Web/Chrome).
  - `rocm`: AMD GPU acceleration.
  - `cann`: Huawei Ascend NPU.
  - `rknpu`: Rockchip NPU.
  - `xnnpack`: Mobile CPU optimization.
  - `acl`: Arm Compute Library.
  - `armnn`: Arm Neural Network SDK.
  - `azure`: Azure ML execution provider.
  - `migraphx`: AMD MIGraphX.
  - `nnapi`: Android Neural Networks API.
  - `qnn`: Qualcomm SNPE.
  - `tvm`: Apache TVM.
  - `vitis`: Xilinx Vitis AI.
- NVIDIA GPU acceleration with CUDA image processing kernels (requires `cudarc`):
  - `cuda-full`: Uses `cuda-version-from-build-system` (auto-detects via `nvcc`).
  - `cuda-11040`, `cuda-11050`, `cuda-11060`, `cuda-11070`, `cuda-11080`: CUDA 11.x versions (Model + Preprocess).
  - `cuda-12000`, `cuda-12010`, `cuda-12020`, `cuda-12030`, `cuda-12040`, `cuda-12050`, `cuda-12060`, `cuda-12080`, `cuda-12090`: CUDA 12.x versions (Model + Preprocess).
  - `cuda-13000`, `cuda-13010`: CUDA 13.x versions (Model + Preprocess).
- NVIDIA TensorRT execution provider with CUDA runtime libraries:
  - `tensorrt-full`: Uses `cuda-version-from-build-system` (auto-detects via `nvcc`).
  - `tensorrt-cuda-11040`, `tensorrt-cuda-11050`, `tensorrt-cuda-11060`, `tensorrt-cuda-11070`, `tensorrt-cuda-11080`: TensorRT + CUDA 11.x runtime.
  - `tensorrt-cuda-12000`, `tensorrt-cuda-12010`, `tensorrt-cuda-12020`, `tensorrt-cuda-12030`, `tensorrt-cuda-12040`, `tensorrt-cuda-12050`, `tensorrt-cuda-12060`, `tensorrt-cuda-12080`, `tensorrt-cuda-12090`: TensorRT + CUDA 12.x runtime.
  - `tensorrt-cuda-13000`, `tensorrt-cuda-13010`: TensorRT + CUDA 13.x runtime.

  Note: `tensorrt-cuda-*` features enable the TensorRT execution provider with CUDA runtime libraries for image processing. The "cuda" in the name refers to the `cudarc` dependency.
- NVIDIA NvTensorRT-RTX execution provider with CUDA runtime libraries:
  - `nvrtx-full`: Uses `cuda-version-from-build-system` (auto-detects via `nvcc`).
  - `nvrtx-cuda-11040`, `nvrtx-cuda-11050`, `nvrtx-cuda-11060`, `nvrtx-cuda-11070`, `nvrtx-cuda-11080`: NVRTX + CUDA 11.x runtime.
  - `nvrtx-cuda-12000`, `nvrtx-cuda-12010`, `nvrtx-cuda-12020`, `nvrtx-cuda-12030`, `nvrtx-cuda-12040`, `nvrtx-cuda-12050`, `nvrtx-cuda-12060`, `nvrtx-cuda-12080`, `nvrtx-cuda-12090`: NVRTX + CUDA 12.x runtime.
  - `nvrtx-cuda-13000`, `nvrtx-cuda-13010`: NVRTX + CUDA 13.x runtime.

  Note: `nvrtx-cuda-*` features enable the NVRTX execution provider with CUDA runtime libraries for image processing. The "cuda" in the name refers to the `cudarc` dependency.
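
The versioned feature names above appear to encode the CUDA version as the major number, a two-digit minor number, and a trailing `0` (for example, CUDA 12.4 corresponds to `cuda-12040` and `tensorrt-cuda-12040`). The Rust sketch below derives a feature name from a CUDA version under that assumption and only accepts the versions listed in this section. `versioned_feature` and `LISTED_CUDA_VERSIONS` are hypothetical names for illustration and are not part of usls.

```rust
/// CUDA (major, minor) versions that have a versioned feature listed above.
/// Note that 12.7 is not in the list.
const LISTED_CUDA_VERSIONS: &[(u32, u32)] = &[
    (11, 4), (11, 5), (11, 6), (11, 7), (11, 8),
    (12, 0), (12, 1), (12, 2), (12, 3), (12, 4),
    (12, 5), (12, 6), (12, 8), (12, 9),
    (13, 0), (13, 1),
];

/// Builds a feature name such as `cuda-12040` from a prefix
/// (`cuda`, `tensorrt-cuda`, or `nvrtx-cuda`) and a CUDA version.
/// Returns `None` when the version is not one of the listed ones.
/// The naming scheme is inferred from the feature list, not from usls itself.
pub fn versioned_feature(prefix: &str, major: u32, minor: u32) -> Option<String> {
    if LISTED_CUDA_VERSIONS.contains(&(major, minor)) {
        // Major number, zero-padded two-digit minor number, then a trailing 0.
        Some(format!("{prefix}-{major}{minor:02}0"))
    } else {
        None
    }
}
```

For instance, `versioned_feature("tensorrt-cuda", 12, 4)` gives `Some("tensorrt-cuda-12040")`, which matches the feature used in the TensorRT + CUDA 12.4 demo command, while `versioned_feature("cuda", 12, 7)` gives `None`.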
| Scenario | Model Device (`--device`) | Processor Device (`--processor-device`) | Required Features (`-F`) |
|---|---|---|---|
| CPU Only | `cpu` | `cpu` | `vision` (default) |
| GPU Inference (Slow Preprocess) | `cuda` | `cpu` | `cuda` |
| GPU Inference (Fast Preprocess) | `cuda` | `cuda` | `cuda-full` or `cuda-120xxx` |
| TensorRT (Slow Preprocess) | `tensorrt` | `cpu` | `tensorrt` |
| TensorRT (Fast Preprocess) | `tensorrt` | `cuda` | `tensorrt-full` or `tensorrt-cuda-120xxx` |

⚠️ In multi-GPU environments (e.g., `cuda:0`, `cuda:1`), you MUST ensure that both `--device` and `--processor-device` use the SAME GPU ID.
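
The same-GPU rule can be checked before launching a run. The Rust sketch below parses device strings of the form `kind:id` and rejects mismatched GPU IDs. It is an illustration only: `nvidia_gpu_id` and `check_same_gpu` are hypothetical helpers, not usls functions, and treating a missing ID as GPU 0 is an assumption.

```rust
/// Returns the GPU index for NVIDIA-backed device strings
/// (`cuda`, `tensorrt`, `nvrtx`) and `None` for anything else,
/// such as `cpu`, `coreml`, or `openvino:CPU`.
/// Assumption: a missing `:id` suffix means GPU 0.
pub fn nvidia_gpu_id(spec: &str) -> Result<Option<u32>, String> {
    let (kind, id) = match spec.split_once(':') {
        Some((kind, id)) => (kind, Some(id)),
        None => (spec, None),
    };
    if !["cuda", "tensorrt", "nvrtx"].contains(&kind) {
        return Ok(None);
    }
    match id {
        None => Ok(Some(0)),
        Some(raw) => raw
            .parse::<u32>()
            .map(Some)
            .map_err(|e| format!("invalid GPU id in `{spec}`: {e}")),
    }
}

/// Fails when the model device and the processor device are both on
/// NVIDIA GPUs but point at different GPU IDs.
pub fn check_same_gpu(device: &str, processor_device: &str) -> Result<(), String> {
    match (nvidia_gpu_id(device)?, nvidia_gpu_id(processor_device)?) {
        (Some(model), Some(processor)) if model != processor => Err(format!(
            "--device uses GPU {model} but --processor-device uses GPU {processor}"
        )),
        // Same GPU, or at least one side is not on an NVIDIA GPU.
        _ => Ok(()),
    }
}
```

Under these assumptions, `check_same_gpu("tensorrt:0", "cuda:0")` succeeds, `check_same_gpu("tensorrt:0", "cuda:1")` returns an error, and any combination involving `cpu` passes because there is no second GPU ID to compare.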
- ONNX Runtime Issues: For ONNX Runtime related errors, please check the ort issues or onnxruntime issues.
- Other Issues: For other questions or bug reports, see issues or open a new discussion.
If you encounter linking errors with `__isoc23_strtoll` or similar glibc symbols, use the dynamic loading feature:

    cargo run -F ort-load-dynamic --example

This project focuses on vision and VLM models under 1B parameters for efficient inference.
Many high-performance inference engines, such as vLLM, already exist for LM/LLM models.
Pure text embedding models may be considered in future releases.
Refer to YOLO performance benchmarks in the Performance section above.
This project uses multi-threading, SIMD, and CUDA hardware acceleration for optimization.
While vision models like YOLO and RF-DETR are optimized, other models may need further interface and post-processing optimization.
This is a personal project maintained in spare time, so progress on performance optimization and new model support may vary.
PRs for model optimization are very welcome! If you have expertise in specific models and can help optimize their interfaces or post-processing, your contributions would be invaluable. Feel free to open an issue or submit a pull request for suggestions, bug reports, or new features.
This project is built on top of ort (ONNX Runtime for Rust), which provides seamless Rust bindings for ONNX Runtime. Special thanks to the ort maintainers.
Thanks to all the open-source libraries and their maintainers that make this project possible. See Cargo.toml for a complete list of dependencies.
This project is licensed under LICENSE.
