usls



📘 API Documentation | 🌟 Examples | 📦 Model Zoo


usls is a cross-platform Rust library powered by ONNX Runtime for efficient inference of SOTA vision and vision-language models (typically under 1B parameters).

(Banner image generated by Seedream 4.5)

🌟 Highlights

  • ⚡ High Performance: Multi-threading, SIMD, and CUDA-accelerated processing
  • 🌐 Cross-Platform: Linux, macOS, Windows with ONNX Runtime execution providers (CUDA, TensorRT, CoreML, OpenVINO, DirectML, etc.)
  • 🏗️ Unified API: A single Model trait exposing run()/forward()/encode_images()/encode_texts() with a unified Y output (see the sketch after this list)
  • 📥 Auto-Management: Automatic model download (HuggingFace/GitHub), caching and path resolution
  • 📦 Multiple Inputs: Image, directory, video, webcam, stream and combinations
  • 🎯 Precision Support: FP32, FP16, INT8, UINT8, Q4, Q4F16, BNB4, and more
  • 🛠️ Full-Stack Suite: DataLoader, Annotator, and Viewer for complete workflows
  • 🌱 Model Ecosystem: 50+ SOTA vision and VLM models
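
A minimal sketch of that unified flow is shown below. Beyond the trait methods named above, the builder and helper names here are assumptions rather than verbatim usls APIs; consult the API documentation for the exact signatures.

```rust
// Minimal sketch of the unified Model API flow (hypothetical names;
// see the API documentation for the real signatures).
use usls::{Annotator, DataLoader};

fn main() -> anyhow::Result<()> {
    // Hypothetical construction; real code configures the model
    // (weights, device, dtype, ...) through the crate's config types.
    let mut model = usls::models::YOLO::new(Default::default())?;

    // DataLoader accepts images, directories, videos, webcams, and streams.
    let images = DataLoader::try_read_n(&["./assets/bus.jpg"])?;

    // Every model shares the same entry points (run()/forward()) and
    // returns the unified `Y` output type.
    let ys = model.forward(&images)?;

    // Draw detections/keypoints/masks back onto the inputs.
    let annotator = Annotator::default();
    for (image, y) in images.iter().zip(ys.iter()) {
        annotator.annotate(image, y)?;
    }
    Ok(())
}
```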

🚀 Quick Start

Run the YOLO-Series demo to explore models across different tasks, precisions, and execution providers:

  • Tasks: detect, segment, pose, classify, obb
  • Versions: YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, YOLOv10, YOLO11, YOLOv12, YOLOv13, YOLO26
  • Scales: n, s, m, l, x
  • Precision: fp32, fp16, q8, q4, q4f16, bnb4
  • Execution Providers: CPU, CUDA, TensorRT, TensorRT-RTX, CoreML, OpenVINO, and more

Examples

```shell
# CPU: Object detection with YOLO26n (FP16)
cargo run -r --example yolo -- --task detect --ver 26 --scale n --dtype fp16

# CUDA model + CPU processor: Instance segmentation with YOLO11m
cargo run -r -F cuda --example yolo -- --task segment --ver 11 --scale m --device cuda:0 --processor-device cpu

# CUDA model + CUDA processor: Pose estimation with YOLOv8s
cargo run -r -F cuda-full --example yolo -- --task pose --ver 8 --scale s --device cuda:0 --processor-device cuda:0

# TensorRT model + CPU processor
cargo run -r -F tensorrt --example yolo -- --device tensorrt:0 --processor-device cpu

# TensorRT model + CUDA processor (CUDA 12.4)
cargo run -r -F tensorrt-cuda-12040 --example yolo -- --device tensorrt:0 --processor-device cuda:0

# TensorRT-RTX model + CUDA processor
cargo run -r -F nvrtx-full --example yolo -- --device nvrtx:0 --processor-device cuda:0

# TensorRT-RTX model + CPU processor
cargo run -r -F nvrtx --example yolo -- --device nvrtx:0

# Apple Silicon CoreML
cargo run -r -F coreml --example yolo -- --device coreml

# Intel OpenVINO (CPU/GPU/VPU)
cargo run -r -F openvino -F ort-load-dynamic --example yolo -- --device openvino:CPU

# Show all available options
cargo run -r --example yolo -- --help
```

See YOLO Examples for more details and use cases.

See Device Combination Guide for feature and device configurations.

⚠️ Warning: If you encounter CUDA errors (e.g., CUDA failure 1: invalid argument), use --processor-device cpu instead of --processor-device cuda:0 to avoid CUDA memory-transfer issues.

Performance

Environment: NVIDIA RTX 3060Ti (TensorRT-10.11.0.33, CUDA 12.8, TensorRT-RTX-1.3.0.35) / Intel i5-12400F

Setup: YOLO26n-detect model (640×640), COCO2017 validation set (5,000 images), no warm-up
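
For reference, the rows below can be exercised with a single-run command assembled from the Quick Start flags. This invocation is an illustrative assumption (the COCO2017 iteration and timing harness are not part of this command):

```shell
# Hypothetical single-run counterpart of the "TensorRT EP + CUDA processor,
# FP16" row below (CUDA 12.8 toolkit assumed, hence tensorrt-cuda-12080).
cargo run -r -F tensorrt-cuda-12080 --example yolo -- \
    --task detect --ver 26 --scale n --dtype fp16 \
    --device tensorrt:0 --processor-device cuda:0
```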

Batch = 1

| Backend | DType | Preprocess | Inference | Postprocess | Total |
|---|---|---|---|---|---|
| TensorRT EP + CUDA processor | FP16 | 234.570µs | 1.333ms | 253.631µs | 1.821ms |
| TensorRT EP + CPU processor | FP16 | 783.852µs | 2.438ms | 83.701µs | 3.306ms |
| TensorRT-RTX EP + CUDA processor | FP32 | 232.003µs | 2.934ms | 192.660µs | 3.359ms |
| TensorRT-RTX EP + CUDA processor | FP16 |  |  |  |  |
| TensorRT-RTX EP + CPU processor | FP32 | 794.292µs | 3.974ms | 83.926µs | 4.852ms |
| TensorRT-RTX EP + CPU processor | FP16 |  |  |  |  |
| CUDA EP + CUDA processor | FP32 | 242.752µs | 5.053ms | 95.968µs | 5.392ms |
| CUDA EP + CUDA processor | FP16 | 244.065µs | 3.684ms | 100.828µs | 4.029ms |
| CUDA EP + CPU processor | FP32 | 796.886µs | 6.044ms | 74.687µs | 6.916ms |
| CUDA EP + CPU processor | FP16 | 787.805µs | 4.565ms | 71.001µs | 5.424ms |
| CPU EP + CPU processor | FP32 | 971.332µs | 20.243ms | 59.022µs | 21.273ms |
| CPU EP + CPU processor | FP16 | 954.297µs | 23.155ms | 59.197µs | 24.168ms |

Batch = 8

| Backend | DType | Preprocess | Inference | Postprocess | Total |
|---|---|---|---|---|---|
| TensorRT EP + CUDA processor | FP16 | 2.100ms | 6.497ms | 203.484µs | 8.801ms |
| TensorRT EP + CPU processor | FP16 | 18.913ms | 26.406ms | 194.782µs | 45.514ms |
| TensorRT-RTX EP + CUDA processor | FP32 | 2.161ms | 15.370ms | 167.937µs | 17.699ms |
| TensorRT-RTX EP + CUDA processor | FP16 |  |  |  |  |
| TensorRT-RTX EP + CPU processor | FP32 | 18.988ms | 35.101ms | 173.829µs | 54.263ms |
| TensorRT-RTX EP + CPU processor | FP16 |  |  |  |  |
| CUDA EP + CUDA processor | FP32 | 2.222ms | 24.479ms | 160.767µs | 26.862ms |
| CUDA EP + CUDA processor | FP16 | 2.262ms | 14.842ms | 135.593µs | 17.240ms |
| CUDA EP + CPU processor | FP32 | 19.037ms | 44.720ms | 190.740µs | 63.948ms |
| CUDA EP + CPU processor | FP16 | 18.245ms | 33.865ms | 183.226µs | 52.293ms |
| CPU EP + CPU processor | FP32 | 17.852ms | 216.872ms | 158.297µs | 234.883ms |
| CPU EP + CPU processor | FP16 | 17.698ms | 255.365ms | 117.239µs | 273.180ms |

📦 Model Zoo

Status: ✅ Supported  |  ❓ Unknown  |  ❌ Not Supported For Now

🔥 YOLO-Series

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv5 | Image Classification, Object Detection, Instance Segmentation | demo |  |  |  |  |  |  |  |
| YOLOv6 | Object Detection | demo |  |  |  |  |  |  |  |
| YOLOv7 | Object Detection | demo |  |  |  |  |  |  |  |
| YOLOv8 | Object Detection, Instance Segmentation, Image Classification, Oriented Object Detection, Keypoint Detection | demo |  |  |  |  |  |  |  |
| YOLO11 | Object Detection, Instance Segmentation, Image Classification, Oriented Object Detection, Keypoint Detection | demo |  |  |  |  |  |  |  |
| YOLOv9 | Object Detection | demo |  |  |  |  |  |  |  |
| YOLOv10 | Object Detection | demo |  |  |  |  |  |  |  |
| YOLOv12 | Image Classification, Object Detection, Instance Segmentation | demo |  |  |  |  |  |  |  |
| YOLOv13 | Object Detection | demo |  |  |  |  |  |  |  |
| YOLO26 | Object Detection, Instance Segmentation, Image Classification, Oriented Object Detection, Keypoint Detection | demo |  |  |  |  |  |  |  |

🏷️ Image Classification & Tagging

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| BEiT | Image Classification | demo |  |  |  |  |  |  |  |
| ConvNeXt | Image Classification | demo |  |  |  |  |  |  |  |
| FastViT | Image Classification | demo |  |  |  |  |  |  |  |
| MobileOne | Image Classification | demo |  |  |  |  |  |  |  |
| DeiT | Image Classification | demo |  |  |  |  |  |  |  |
| RAM | Image Tagging | demo |  |  |  |  |  |  |  |
| RAM++ | Image Tagging | demo |  |  |  |  |  |  |  |

🎯 Object Detection

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| RT-DETRv1 | Object Detection | demo |  |  |  |  |  |  |  |
| RT-DETRv2 | Object Detection | demo |  |  |  |  |  |  |  |
| RT-DETRv4 | Object Detection | demo |  |  |  |  |  |  |  |
| RF-DETR | Object Detection | demo |  |  |  |  |  |  |  |
| PP-PicoDet | Object Detection | demo |  |  |  |  |  |  |  |
| D-FINE | Object Detection | demo |  |  |  |  |  |  |  |
| DEIM | Object Detection | demo |  |  |  |  |  |  |  |
| DEIMv2 | Object Detection | demo |  |  |  |  |  |  |  |

🎨 Image Segmentation

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| SAM | Segment Anything | demo |  |  |  |  |  |  |  |
| SAM-HQ | Segment Anything | demo |  |  |  |  |  |  |  |
| MobileSAM | Segment Anything | demo |  |  |  |  |  |  |  |
| EdgeSAM | Segment Anything | demo |  |  |  |  |  |  |  |
| FastSAM | Instance Segmentation | demo |  |  |  |  |  |  |  |
| SAM2 | Segment Anything | demo |  |  |  |  |  |  |  |
| SAM3-Tracker | Segment Anything | demo |  |  |  |  |  |  |  |

🗺️ Open-Set Detection & Segmentation

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| GroundingDINO | Open-Set Detection With Language | demo |  |  |  |  |  |  |  |
| MM-GDINO | Open-Set Detection With Language | demo |  |  |  |  |  |  |  |
| LLMDet | Open-Set Detection With Language | demo |  |  |  |  |  |  |  |
| OWLv2 | Open-Set Object Detection | demo |  |  |  |  |  |  |  |
| YOLO-World | Open-Set Detection With Language | demo |  |  |  |  |  |  |  |
| YOLOE | Open-Set Detection And Segmentation | demo |  |  |  |  |  |  |  |
| SAM3-Image | Open-Set Detection And Segmentation | demo |  |  |  |  |  |  |  |

✨ Background Removal

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| RMBG | Image Segmentation, Background Removal | demo |  |  |  |  |  |  |  |
| BEN2 | Image Segmentation, Background Removal | demo |  |  |  |  |  |  |  |

🏃 Multi-Object Tracking

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| ByteTrack | Multi-Object Tracking | demo |  |  |  |  |  |  |  |

💎 Image Super-Resolution

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| Swin2SR | Image Restoration | demo |  |  |  |  |  |  |  |
| APISR | Anime Super-Resolution | demo |  |  |  |  |  |  |  |

✂️ Image Matting

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| MODNet | Image Matting | demo |  |  |  |  |  |  |  |
| MediaPipe Selfie | Image Segmentation | demo |  |  |  |  |  |  |  |

🤸 Pose Estimation

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| RTMPose | Keypoint Detection | demo |  |  |  |  |  |  |  |
| DWPose | Keypoint Detection | demo |  |  |  |  |  |  |  |
| RTMW | Keypoint Detection | demo |  |  |  |  |  |  |  |
| RTMO | Keypoint Detection | demo |  |  |  |  |  |  |  |

🔍 OCR & Document Understanding

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| DB | Text Detection | demo |  |  |  |  |  |  |  |
| FAST | Text Detection | demo |  |  |  |  |  |  |  |
| LinkNet | Text Detection | demo |  |  |  |  |  |  |  |
| SVTR | Text Recognition | demo |  |  |  |  |  |  |  |
| TrOCR | Text Recognition | demo |  |  |  |  |  |  |  |
| SLANet | Table Recognition | demo |  |  |  |  |  |  |  |
| DocLayout-YOLO | Object Detection | demo |  |  |  |  |  |  |  |

🧩 Vision-Language Models (VLM)

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| BLIP | Image Captioning | demo |  |  |  |  |  |  |  |
| Florence2 | A Variety of Vision Tasks | demo |  |  |  |  |  |  |  |
| Moondream2 | Open-Set Object Detection, Open-Set Keypoint Detection, Image Captioning, Visual Question Answering | demo |  |  |  |  |  |  |  |
| SmolVLM | Visual Question Answering | demo |  |  |  |  |  |  |  |
| SmolVLM2 | Visual Question Answering | demo |  |  |  |  |  |  |  |
| FastVLM | Vision Language Models | demo |  |  |  |  |  |  |  |

🧬 Embedding Model

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| CLIP | Vision-Language Embedding | demo |  |  |  |  |  |  |  |
| jina-clip-v1 | Vision-Language Embedding | demo |  |  |  |  |  |  |  |
| jina-clip-v2 | Vision-Language Embedding | demo |  |  |  |  |  |  |  |
| mobileclip | Vision-Language Embedding | demo |  |  |  |  |  |  |  |
| DINOv2 | Vision Embedding | demo |  |  |  |  |  |  |  |
| DINOv3 | Vision Embedding | demo |  |  |  |  |  |  |  |

📐 Depth Estimation

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| DepthAnything v1 | Monocular Depth Estimation | demo |  |  |  |  |  |  |  |
| DepthAnything v2 | Monocular Depth Estimation | demo |  |  |  |  |  |  |  |
| DepthPro | Monocular Depth Estimation | demo |  |  |  |  |  |  |  |
| Depth-Anything-3 | Monocular, Metric, and Multi-View Depth Estimation | demo |  |  |  |  |  |  |  |

🌌 Others

| Model | Task / Description | Demo | Dynamic Batch | TensorRT | FP32 | FP16 | Q8 | Q4f16 | BNB4 |
|---|---|---|---|---|---|---|---|---|---|
| Sapiens | Foundation for Human Vision Models | demo |  |  |  |  |  |  |  |
| YOLOPv2 | Panoptic Driving | demo |  |  |  |  |  |  |  |

Documentation

🔧 Cargo Features

❕ Features in italics are enabled by default. An example of enabling features follows this list.

  • Core & Utilities

    • ort-download-binaries: Automatically download prebuilt ONNX Runtime binaries from pyke.
    • ort-load-dynamic: Manually link ONNX Runtime. Useful for custom builds or unsupported platforms. See Linking Guide for more details.
    • viewer: Real-time image/video visualization (similar to OpenCV's imshow). Powered by minifb.
    • video: Video I/O support for reading and writing video streams. Powered by video-rs.
    • hf-hub: Download model files from Hugging Face Hub.
    • annotator: Annotation utilities for drawing bounding boxes, keypoints, and masks on images.
  • Image Formats

    Additional image format support (optional for faster compilation):

    • image-all-formats: Enable all additional image formats.
    • image-gif, image-bmp, image-ico, image-avif, image-tiff, image-dds, image-exr, image-ff, image-hdr, image-pnm, image-qoi, image-tga: Individual image format support.
  • Model Categories

    • vision: Core vision models (Detection, Segmentation, Classification, Pose, etc.).
    • vlm: Vision-Language Models (CLIP, BLIP, Florence2, etc.).
    • mot: Multi-Object Tracking utilities.
    • all-models: Enable all model categories.
  • Execution Providers

    Hardware acceleration for inference. Enable the one matching your hardware:

    • cuda: NVIDIA CUDA execution provider (pure model inference acceleration).
    • tensorrt: NVIDIA TensorRT execution provider (pure model inference acceleration).
    • nvrtx: NVIDIA NvTensorRT-RTX execution provider (pure model inference acceleration).
    • cuda-full: cuda + cuda-runtime-build (Model + Image Preprocessing acceleration).
    • tensorrt-full: tensorrt + cuda-runtime-build (Model + Image Preprocessing acceleration).
    • nvrtx-full: nvrtx + cuda-runtime-build (Model + Image Preprocessing acceleration).
    • coreml: Apple Silicon (macOS/iOS).
    • openvino: Intel CPU/GPU/VPU.
    • onednn: Intel oneDNN (oneAPI Deep Neural Network Library).
    • directml: DirectML (Windows).
    • webgpu: WebGPU (Web/Chrome).
    • rocm: AMD GPU acceleration.
    • cann: Huawei Ascend NPU.
    • rknpu: Rockchip NPU.
    • xnnpack: Mobile CPU optimization.
    • acl: Arm Compute Library.
    • armnn: Arm Neural Network SDK.
    • azure: Azure ML execution provider.
    • migraphx: AMD MIGraphX.
    • nnapi: Android Neural Networks API.
    • qnn: Qualcomm QNN (AI Engine Direct).
    • tvm: Apache TVM.
    • vitis: Xilinx Vitis AI.
  • CUDA Support

    NVIDIA GPU acceleration with CUDA image processing kernels (requires cudarc):

    • cuda-full: Uses cuda-version-from-build-system (auto-detects via nvcc).
    • cuda-11040, cuda-11050, cuda-11060, cuda-11070, cuda-11080: CUDA 11.x versions (Model + Preprocess).
    • cuda-12000, cuda-12010, cuda-12020, cuda-12030, cuda-12040, cuda-12050, cuda-12060, cuda-12080, cuda-12090: CUDA 12.x versions (Model + Preprocess).
    • cuda-13000, cuda-13010: CUDA 13.x versions (Model + Preprocess).
  • TensorRT Support

    NVIDIA TensorRT execution provider with CUDA runtime libraries:

    • tensorrt-full: Uses cuda-version-from-build-system (auto-detects via nvcc).
    • tensorrt-cuda-11040, tensorrt-cuda-11050, tensorrt-cuda-11060, tensorrt-cuda-11070, tensorrt-cuda-11080: TensorRT + CUDA 11.x runtime.
    • tensorrt-cuda-12000, tensorrt-cuda-12010, tensorrt-cuda-12020, tensorrt-cuda-12030, tensorrt-cuda-12040, tensorrt-cuda-12050, tensorrt-cuda-12060, tensorrt-cuda-12080, tensorrt-cuda-12090: TensorRT + CUDA 12.x runtime.
    • tensorrt-cuda-13000, tensorrt-cuda-13010: TensorRT + CUDA 13.x runtime.

    Note: The tensorrt-cuda-* features enable the TensorRT execution provider together with CUDA runtime libraries for image preprocessing; the "cuda" in the name refers to the cudarc dependency.

  • NVRTX Support

    NVIDIA NvTensorRT-RTX execution provider with CUDA runtime libraries:

    • nvrtx-full: Uses cuda-version-from-build-system (auto-detects via nvcc).
    • nvrtx-cuda-11040, nvrtx-cuda-11050, nvrtx-cuda-11060, nvrtx-cuda-11070, nvrtx-cuda-11080: NVRTX + CUDA 11.x runtime.
    • nvrtx-cuda-12000, nvrtx-cuda-12010, nvrtx-cuda-12020, nvrtx-cuda-12030, nvrtx-cuda-12040, nvrtx-cuda-12050, nvrtx-cuda-12060, nvrtx-cuda-12080, nvrtx-cuda-12090: NVRTX + CUDA 12.x runtime.
    • nvrtx-cuda-13000, nvrtx-cuda-13010: NVRTX + CUDA 13.x runtime.

    Note: The nvrtx-cuda-* features enable the NVRTX execution provider together with CUDA runtime libraries for image preprocessing; the "cuda" in the name refers to the cudarc dependency.
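
For example, to add usls with TensorRT inference plus CUDA-accelerated preprocessing (feature names as listed above):

```shell
# TensorRT inference + CUDA preprocessing, pinned to a CUDA 12.4 toolkit
cargo add usls -F tensorrt-cuda-12040

# Or let the build auto-detect the installed CUDA version via nvcc
cargo add usls -F tensorrt-full
```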


🚀 Device Combination Guide

| Scenario | Model Device (--device) | Processor Device (--processor-device) | Required Features (-F) |
|---|---|---|---|
| CPU Only | cpu | cpu | vision (default) |
| GPU Inference (Slow Preprocess) | cuda | cpu | cuda |
| GPU Inference (Fast Preprocess) | cuda | cuda | cuda-full or a versioned cuda-* feature (e.g., cuda-12040) |
| TensorRT (Slow Preprocess) | tensorrt | cpu | tensorrt |
| TensorRT (Fast Preprocess) | tensorrt | cuda | tensorrt-full or a versioned tensorrt-cuda-* feature (e.g., tensorrt-cuda-12040) |

⚠️ In multi-GPU environments (e.g., cuda:0, cuda:1), you MUST ensure that both --device and --processor-device use the SAME GPU ID.
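
For instance, to pin both the model and the processor to the second GPU:

```shell
# Multi-GPU: model and processor must target the same device (GPU 1 here)
cargo run -r -F cuda-full --example yolo -- --device cuda:1 --processor-device cuda:1
```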


❓ FAQ

  • ONNX Runtime Issues: For ONNX Runtime related errors, please check the ort issues or onnxruntime issues.
  • Other Issues: For other questions or bug reports, see issues or open a new discussion.

⚠️ Compatibility Note

If you encounter linking errors with __isoc23_strtoll or similar glibc symbols, use the dynamic loading feature:

```shell
cargo run -F ort-load-dynamic --example <example-name>
```

Why no LM models?

This project focuses on vision and VLM models under 1B parameters for efficient inference.

Many high-performance inference engines, such as vLLM, already exist for LM/LLM workloads.

Pure text embedding models may be considered in future releases.

How fast is it?

Refer to YOLO performance benchmarks in the Performance section above.

This project uses multi-threading, SIMD, and CUDA hardware acceleration for optimization.

While vision models like YOLO and RF-DETR are optimized, other models may still need further interface and post-processing optimization.

🤝 Contributing

This is a personal project maintained in spare time, so progress on performance optimization and new model support may vary.

We highly welcome PRs for model optimization! If you have expertise in specific models and can help optimize their interfaces or post-processing, your contributions would be invaluable. Feel free to open an issue or submit a pull request for suggestions, bug reports, or new features.

🙏 Acknowledgments

This project is built on top of ort (ONNX Runtime for Rust), which provides seamless Rust bindings for ONNX Runtime. Special thanks to the ort maintainers.

Thanks to all the open-source libraries and their maintainers that make this project possible. See Cargo.toml for a complete list of dependencies.

📜 License

This project is licensed under the terms of the LICENSE file.
