Qwen3-4B-Instruct-2507 NVFP4 BS=1 produces lower prefill and decode throughput than the numbers published in docs/source/user_guide/performance/performance-benchmarks.md.
| Metric | Expected (perf-benchmarks.md v0.4.0) | Actual |
| --- | --- | --- |
| Prefill latency | 22.9 ms | ~34.2 ms |
| Prefill TPS | 15,895 tok/s | ~10,657 tok/s |
| Decode latency (osl=1, pastKVLen=364) | 11.1 ms | 14.9224 ms |
| Decode TPS | 90.2 tok/s | 67.0 tok/s |
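To put the gap in relative terms, the figures in the table can be turned into percentages. The following is a minimal sketch using only the numbers reported above; the helper names `slowdown_pct` and `shortfall_pct` are illustrative and not part of any tool in the repository.

```shell
# Relative latency increase versus the published value, in percent.
# Usage: slowdown_pct <expected_ms> <actual_ms>
slowdown_pct() {
  awk -v e="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (a / e - 1) * 100 }'
}

# Relative throughput shortfall versus the published value, in percent.
# Usage: shortfall_pct <expected_tps> <actual_tps>
shortfall_pct() {
  awk -v e="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (1 - a / e) * 100 }'
}

slowdown_pct 22.9 34.2        # prefill latency: 49.3
slowdown_pct 11.1 14.9224     # decode latency:  34.4
shortfall_pct 15895 10657     # prefill TPS:     33.0
shortfall_pct 90.2 67.0       # decode TPS:      25.7
```

By this arithmetic, prefill is roughly 49% slower and decode roughly 34% slower than the published numbers.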
The runtime logs confirm that the CUDA graph is captured and enabled:
[INFO] CUDA graph enabled
[INFO] CUDA graph captured successfully.
[INFO] E2E Time (actual performance): 14.9224 ms
[INFO] Tokens/sec (E2E): 67.0
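As a sanity check, the reported throughput can be recomputed from the logged latency. The sketch below assumes the log line format shown above and assumes that tokens/sec is simply tokens processed divided by E2E time, which matches both the decode log (1 token per step) and the published prefill figure (364 tokens in 22.9 ms). The helper names `e2e_ms` and `tps` are hypothetical.

```shell
# Extract the E2E latency (in ms) from a benchmark log read on stdin.
# Keeps the last matching line if the log contains more than one.
e2e_ms() {
  awk -F': ' '/E2E Time \(actual performance\)/ { sub(/ ms.*/, "", $2); v = $2 }
              END { if (v != "") print v }'
}

# Tokens/sec for <latency_ms> and an optional token count (default 1).
# Usage: tps <latency_ms> [tokens]
tps() {
  awk -v ms="$1" -v n="${2:-1}" 'BEGIN { printf "%.1f\n", n * 1000 / ms }'
}

# Example with the decode log lines quoted in this report.
ms="$(e2e_ms <<'EOF'
[INFO] CUDA graph enabled
[INFO] CUDA graph captured successfully.
[INFO] E2E Time (actual performance): 14.9224 ms
[INFO] Tokens/sec (E2E): 67.0
EOF
)"
tps "$ms"        # decode, 1 token per step: 67.0
tps 22.9 364     # published prefill, 364 tokens: 15895.2
```

The recomputed decode value agrees with the logged 67.0 tok/s, so the logged latency and throughput are consistent with each other.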
Steps/Code to reproduce bug
Build configuration:
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DTRT_PACKAGE_DIR=/usr \
-DCMAKE_TOOLCHAIN_FILE=cmake/aarch64_linux_toolchain.cmake \
-DEMBEDDED_TARGET=jetson-thor
make -j$(nproc)
Runtime commands used:
ENGINE=./trt_edge_workspace/Qwen3-4B-Instruct-2507/engine
# Engine build
./build/examples/llm/llm_build \
--onnxDir $ENGINE/../onnx/llm \
--engineDir $ENGINE \
--maxBatchSize 1 \
--maxInputLen 364 \
--maxKVCacheCapacity 4096
# Prefill benchmark
./build/examples/llm/llm_bench \
--engineDir $ENGINE \
--mode prefill \
--inputLen 364
# Decode benchmark
./build/examples/llm/llm_bench \
--engineDir $ENGINE \
--mode decode \
--pastKVLen 364
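To rule out run-to-run variance, the benchmark can be repeated and the per-run latencies compared. This is a rough sketch, not part of the repository: `repeat_e2e` is a hypothetical wrapper, and it assumes each run prints an `E2E Time (actual performance): <value> ms` line as in the logs above.

```shell
# Run a command several times, print the E2E latency of each run and the minimum.
# The command to run is passed as arguments after the run count.
# Usage: repeat_e2e <runs> <command> [args...]
repeat_e2e() {
  runs="$1"; shift
  i=1
  while [ "$i" -le "$runs" ]; do
    # Keep only the last E2E latency value printed by this run.
    "$@" 2>&1 | awk -F': ' '/E2E Time \(actual performance\)/ { sub(/ ms.*/, "", $2); v = $2 }
                            END { if (v != "") print v }'
    i=$((i + 1))
  done | awk '{ print "run " NR ": " $1 " ms"
                if (mins == "" || $1 + 0 < minv) { minv = $1 + 0; mins = $1 } }
              END { if (NR) print "min: " mins " ms" }'
}

# Example (uses the decode command from this report):
#   repeat_e2e 5 ./build/examples/llm/llm_bench --engineDir "$ENGINE" --mode decode --pastKVLen 364
```

If the minimum over several runs is still well above the published latency, warm-up effects or outliers are unlikely to explain the gap.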
Expected behavior
# Prefill
E2E Time (actual performance): 22.9 ms
Tokens/sec (E2E): 15895
# Decode
E2E Time (actual performance): 11.1 ms
Tokens/sec (E2E): 90.2
NOTE:
- Tested on JetPack 7.2 in MAXN power mode.
- Other NVFP4 models show a similar deviation.
- llm_inference was also tried, but its numbers are not close to the published figures either.
System information (Edge Device)
- Platform (e.g., NVIDIA Jetson Thor): NVIDIA Thor
- Software release (e.g., JetPack 7.1): R39 revision 2.0 (L4T, built 2026-06-01)
- CPU architecture: aarch64
- GPU compute capability (e.g., SM110 for Jetson Thor): SM110
- Total device memory: 122 GiB
- Build type (e.g., Release, Debug): Release
- Library versions:
- TensorRT Edge-LLM version or commit hash: 0.8.0 (git: f9cc746)
- CUDA: 13.2
- TensorRT: 10.16.2
- C++ compiler (e.g., GCC 11.4): GCC 13.3.0
- CMake options used:
- CMAKE_TOOLCHAIN_FILE: cmake/aarch64_linux_toolchain.cmake
- EMBEDDED_TARGET: jetson-thor
- TRT_PACKAGE_DIR: /usr