InstantTensor is an ultra-fast, distributed Safetensors loader designed to maximize I/O throughput when moving model weights from Safetensors files to GPU memory.
| Model | GPU | Backend | Load Time (s) | Throughput (GB/s) | Speedup |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1*H200 | Safetensors | 57.4 | 1.1 | 1x |
| Qwen3-30B-A3B | 1*H200 | InstantTensor | 1.77 | 35 | 32.4x |
| DeepSeek-R1 | 8*H200 | Safetensors | 160 | 4.3 | 1x |
| DeepSeek-R1 | 8*H200 | InstantTensor | 15.3 | 45 | 10.5x |
See Benchmark for the full results.
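The speedup column follows directly from the load times in the table; as a quick sanity check (`speedup` is just an illustrative helper, not part of the API):

```python
# Speedup = Safetensors load time / InstantTensor load time,
# using the values from the table above.
def speedup(baseline_s, instant_s):
    return round(baseline_s / instant_s, 1)

print(speedup(57.4, 1.77))  # Qwen3-30B-A3B on 1*H200 -> 32.4
print(speedup(160, 15.3))   # DeepSeek-R1 on 8*H200   -> 10.5
```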
```python
from instanttensor import safe_open

tensors = {}
with safe_open("model.safetensors", framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()
```

> NOTE: `tensor` points to an internal buffer of InstantTensor and should be copied immediately (e.g. via `clone()` or `copy_()`) to avoid the data being overwritten when the buffer is reused.
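To see why the immediate copy matters, here is a minimal pure-Python analogue of buffer reuse (illustrative only, not InstantTensor internals): holding a reference to the shared buffer instead of copying it leaves you with whatever the loader writes next.

```python
buffer = [0] * 4                # shared staging buffer, reused per tensor

def load_into_buffer(values):
    buffer[:] = values          # the loader overwrites the buffer in place
    return buffer               # the caller receives a view, not a copy

a_view = load_into_buffer([1, 1, 1, 1])
a_copy = list(load_into_buffer([1, 1, 1, 1]))  # copy immediately (like .clone())
load_into_buffer([2, 2, 2, 2])  # the next tensor reuses the same buffer

print(a_view)  # [2, 2, 2, 2] -- clobbered by the reuse
print(a_copy)  # [1, 1, 1, 1] -- safe
```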
See Usage for more details (multi-file and distributed usage).
- Fast weight loading
  - Direct I/O: avoids slow page cache allocation on cold start; friendly to large models and tight memory budgets.
  - Tuned I/O size and concurrency: maximizes hardware throughput.
  - Pipelining and prefetching: parallelizes and overlaps the stages of the transfer.
- Distributed loading
  - Uses `torch.distributed` (NCCL) to speed up loading under any parallelism policy (TP/PP/EP/CP/DP).
- Multiple I/O backends
  - Supports GPUDirect Storage, Legacy Storage, and Memory-based Storage.
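The pipelining idea can be sketched with a bounded producer/consumer queue (an illustrative analogue, not InstantTensor's actual implementation): reads fill the queue while copies drain it, so the two stages overlap and the staging memory stays bounded.

```python
# Two-stage pipeline sketch: "disk reads" and "GPU copies" overlap via a
# bounded queue, so neither stage waits for the other to finish entirely.
import queue
import threading

def read_stage(chunks, q):
    for chunk in chunks:
        q.put(chunk)           # stage 1: read a chunk from storage
    q.put(None)                # sentinel: no more chunks

def copy_stage(q, out):
    while (chunk := q.get()) is not None:
        out.append(chunk * 2)  # stage 2: stand-in for "copy to GPU"

chunks = list(range(8))
q = queue.Queue(maxsize=2)     # bounded buffer -> bounded memory footprint
t = threading.Thread(target=read_stage, args=(chunks, q))
t.start()
out = []
copy_stage(q, out)
t.join()
print(out)  # [0, 2, 4, 6, 8, 10, 12, 14]
```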
InstantTensor is recommended if any of the following conditions are met:
- High storage bandwidth (>= 5 GB/s).
- Unable to keep the model cached in host memory, for example:
  - Limited free memory for model caching (for example, when most memory is used for KV cache offloading in LLM serving).
  - Infrequent model loading, where the Linux page cache is less effective.
  - Model switching, where multiple models cannot be cached in memory simultaneously.
- The model is heavily sharded (for example, TP=8), resulting in small, non-contiguous I/O per GPU.
- Loading from `tmpfs`.
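To illustrate the sharding point with assumed figures (a hypothetical 4096 x 14336 bf16 weight, not from the source), the per-rank I/O shrinks under TP=8 and, for column sharding, fragments into many small strided reads:

```python
# Hypothetical figures: a 4096 x 14336 bf16 weight sharded across TP=8 ranks.
rows, cols, dtype_bytes, tp = 4096, 14336, 2, 8

full = rows * cols * dtype_bytes          # bytes of the unsharded tensor
per_rank = full // tp                     # bytes each rank actually needs
print(full // 2**20, per_rank // 2**20)   # 112 MiB full, 14 MiB per rank

# Column sharding is worse for I/O: each rank reads one small strided chunk
# per row instead of one contiguous range.
chunk = (cols // tp) * dtype_bytes
print(chunk)                              # 3584 bytes per contiguous read
```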
- GPU platforms: CUDA, ROCm
- Framework: PyTorch
Install from PyPI:

```shell
pip install instanttensor
```

Or build from source:

```shell
git clone https://github.com/scitix/InstantTensor.git
cd InstantTensor
./checkout_submodules.sh
pip install .
# For a debug build, set DEBUG=1 before running "pip install ."
```

Passing a list of files allows the backend to plan reads and provides higher throughput than making multiple calls to load single files:
```python
from instanttensor import safe_open

files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(files, framework="pt", device=0) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()
```

InstantTensor can use a `torch.distributed` NCCL process group to coordinate loading, achieving higher throughput than running `safe_open` independently on each GPU.
```python
import torch
import torch.distributed as dist
from instanttensor import safe_open

dist.init_process_group(backend="nccl")
process_group = dist.GroupMember.WORLD

files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
tensors = {}
with safe_open(
    files,
    framework="pt",
    device=torch.cuda.current_device(),
    process_group=process_group,
) as f:
    for name, tensor in f.tensors():
        tensors[name] = tensor.clone()
```

> NOTE: You can also load weights using a subgroup created via `dist.new_group`, which lets multiple subgroups load weights independently. For example, with TP=8 and PP=2 (i.e., two TP groups), you can create two subgroups and load weights independently on each TP group. In cross-node (multi-machine) scenarios, loading with per-node subgroups can sometimes be faster than loading on the world group; for most cases, however, the world group is a good default.
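The subgroup layout can be sketched as plain rank arithmetic; `tp_groups` below is a hypothetical helper (not part of InstantTensor) whose rank lists could each be passed to `dist.new_group(ranks)`:

```python
def tp_groups(world_size, tp_size):
    # Partition consecutive ranks into TP groups, e.g. TP=8 with a
    # 16-rank world (PP=2) yields two groups: [0..7] and [8..15].
    assert world_size % tp_size == 0
    return [list(range(i, i + tp_size)) for i in range(0, world_size, tp_size)]

print(tp_groups(16, 8))  # [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]]
```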
See tests/test.py for a full benchmark harness (TP/PP grouping, checksums, etc.).
Thanks to the AI Systems and Optimization team at ScitiX AI and the Wenfei Wu Lab at Peking University.