
InstantTensor

InstantTensor is an ultra-fast, distributed Safetensors loader designed to maximize I/O throughput when moving model weights from Safetensors files to GPU memory.

Model loading benchmark on inference engines

| Model | GPU | Backend | Load Time (s) | Throughput (GB/s) | Speedup |
| --- | --- | --- | --- | --- | --- |
| Qwen3-30B-A3B | 1*H200 | Safetensors | 57.4 | 1.1 | 1x |
| Qwen3-30B-A3B | 1*H200 | InstantTensor | 1.77 | 35 | 32.4x |
| DeepSeek-R1 | 8*H200 | Safetensors | 160 | 4.3 | 1x |
| DeepSeek-R1 | 8*H200 | InstantTensor | 15.3 | 45 | 10.5x |

See Benchmark for full benchmarks.

Quickstart

```python
from instanttensor import safe_open

tensors = {}
with safe_open("model.safetensors", framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()
```

NOTE: tensor points to an internal buffer owned by InstantTensor; copy it immediately (e.g., with clone() or copy_()), or its data may be overwritten when the buffer is reused.
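The hazard described in the note can be reproduced with plain CPU tensors. A minimal sketch, where buf stands in for InstantTensor's internal staging buffer (the name is illustrative, not part of the API):

```python
import torch

# Stand-in for InstantTensor's reusable staging buffer.
buf = torch.zeros(4)

# Load the "first tensor" into the buffer.
buf.copy_(torch.tensor([1.0, 2.0, 3.0, 4.0]))
alias = buf          # what iterating f.tensors() effectively yields: a view
safe = buf.clone()   # an owned copy, detached from the buffer

# The buffer is reused for the "next tensor" ...
buf.copy_(torch.tensor([9.0, 9.0, 9.0, 9.0]))

print(alias.tolist())  # [9.0, 9.0, 9.0, 9.0] -- stale view, silently clobbered
print(safe.tolist())   # [1.0, 2.0, 3.0, 4.0] -- intact
```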

See Usage for more details (multi-file and distributed usage).

Why InstantTensor?

  • Fast weight loading
    • Direct I/O: Avoid the slow page cache allocation on cold start. Friendly for large models and tight memory budgets.
    • Tuned I/O size and concurrency: Maximize hardware throughput.
    • Pipelining and prefetching: Parallelize and overlap the stages of the transfer.
  • Distributed loading
    • Use torch.distributed (NCCL) to speed up loading under any parallelism policy (TP/PP/EP/CP/DP).
  • Multiple I/O backends
    • Supports multiple backends: GPUDirect Storage, Legacy Storage, and Memory-based Storage.

When to Use InstantTensor

InstantTensor is recommended if any of the following conditions are met:

  • High storage bandwidth (>= 5 GB/s).
  • Unable to keep the model cached in host memory, for example:
    • Limited free memory for model caching (for example, when most memory is used for KV cache offloading in LLM serving).
    • Infrequent model loading, where Linux page cache is less effective.
    • Model switching, where multiple models cannot be cached in memory simultaneously.
  • The model is heavily sharded (for example, TP=8), resulting in small, non-contiguous I/O per GPU.
  • Loading from tmpfs.
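A rough way to check the bandwidth guideline above is to time a large sequential read. A sketch, where read_bandwidth_gbps is a hypothetical helper; note that warm reads served from the page cache report inflated numbers, so measure a cold file for a realistic storage figure:

```python
import time

def read_bandwidth_gbps(path, block_size=8 << 20):
    """Sequential-read throughput of `path` in GB/s.

    Page-cache hits inflate the result; measure a cold file
    for a realistic storage figure.
    """
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:  # unbuffered binary reads
        while chunk := f.read(block_size):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / elapsed / 1e9
```

If a cold read of a multi-gigabyte file reports roughly 5 GB/s or more, the direct-I/O path is likely to pay off.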

Installation

Requirements

  • GPU platforms: CUDA, ROCm
  • Framework: PyTorch

Method 1: Install from pip

```shell
pip install instanttensor
```

Method 2: Build from source

```shell
git clone https://github.com/scitix/InstantTensor.git
cd InstantTensor
./checkout_submodules.sh
pip install .
# For a debug build, set DEBUG=1 before pip (e.g., DEBUG=1 pip install .)
```

Usage

Multi-file loading

Passing a list of files lets the backend plan reads across all of them, which yields higher throughput than calling safe_open separately for each file:

```python
from instanttensor import safe_open

files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(files, framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()
```

Distributed loading

InstantTensor can use a torch.distributed NCCL process group to coordinate loading and achieve higher throughput compared to running safe_open independently on each GPU.

```python
import torch
import torch.distributed as dist
from instanttensor import safe_open

dist.init_process_group(backend="nccl")
process_group = dist.GroupMember.WORLD

files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(files, framework="pt", device=torch.cuda.current_device(), process_group=process_group) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()
```

NOTE: You can also load weights using a subgroup created via dist.new_group, which allows multiple subgroups to load weights independently. For example, with TP=8 and PP=2 (i.e., two TP groups), you can create two subgroups and let each TP group load its weights independently. In cross-node (multi-machine) scenarios, loading with per-node subgroups can sometimes be faster than loading on the world group; in most cases, though, the world group is a good default.
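The TP-subgroup pattern above boils down to partitioning the world ranks. A minimal sketch, assuming a TP-major rank layout (ranks 0..TP-1 form the first TP group); make_tp_groups is a hypothetical helper, not part of the InstantTensor API:

```python
def make_tp_groups(tp, pp):
    """Rank lists for each TP group, assuming a TP-major rank layout."""
    return [list(range(p * tp, (p + 1) * tp)) for p in range(pp)]

print(make_tp_groups(8, 2))  # two TP groups: ranks 0-7 and ranks 8-15

# With torch.distributed initialized (NCCL), every rank creates all
# subgroups -- dist.new_group must be called collectively on all ranks --
# then selects its own group to pass to safe_open:
#
#   groups = [dist.new_group(ranks) for ranks in make_tp_groups(8, 2)]
#   my_group = groups[dist.get_rank() // 8]
#   with safe_open(files, framework="pt",
#                  device=torch.cuda.current_device(),
#                  process_group=my_group) as f:
#       ...
```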

See tests/test.py for a full benchmark harness (TP/PP grouping, checksums, etc.).

API reference

See Build API reference

Thanks

Thanks to the AI Systems and Optimization team at ScitiX AI and the Wenfei Wu Lab at Peking University.
