Perf: Qwen3-4B-Instruct-2507 NVFP4 BS=1 prefill/decode throughput ~25–33% below published benchmarks on Jetson AGX Thor #115

Description

@satabios

Qwen3-4B-Instruct-2507 NVFP4 BS=1 produces lower prefill and decode throughput than the numbers published in docs/source/user_guide/performance/performance-benchmarks.md.

| Metric | Expected (perf-benchmarks.md v0.4.0) | Actual |
| --- | --- | --- |
| Prefill latency | 22.9 ms | ~34.2 ms |
| Prefill TPS | 15,895 tok/s | ~10,657 tok/s |
| Decode latency (osl=1, pastKVLen=364) | 11.1 ms | 14.9224 ms |
| Decode TPS | 90.2 tok/s | 67.0 tok/s |
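For reference, the shortfall in the title can be re-derived from the table. This is only a small sketch: the helper names `pct_below` and `tps` are made up for illustration, and it assumes prefill TPS is simply input length (364 tokens) divided by latency, which matches the published 15,895 tok/s figure.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of TensorRT Edge-LLM.

# pct_below EXPECTED ACTUAL: percent by which ACTUAL falls short of EXPECTED.
pct_below() {
  awk -v e="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (e - a) / e * 100 }'
}

# tps TOKENS LATENCY_MS: tokens per second implied by a latency in milliseconds.
tps() {
  awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", n / (ms / 1000) }'
}

pct_below 15895 10657   # prefill TPS shortfall  -> 33.0
pct_below 90.2 67.0     # decode TPS shortfall   -> 25.7
tps 364 22.9            # published prefill latency as tok/s -> 15895.2
tps 1 14.9224           # measured decode latency as tok/s   -> 67.0
```

The two `pct_below` results give the ~25–33% range quoted in the title, and the `tps` results suggest the latency and TPS columns are internally consistent.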

CUDA graph is confirmed captured and enabled per runtime logs:

[INFO] CUDA graph enabled
[INFO] CUDA graph captured successfully.
[INFO] E2E Time (actual performance): 14.9224 ms
[INFO] Tokens/sec (E2E): 67.0
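To compare several runs, the two result lines can be pulled out of the log with a small filter. This is a sketch that assumes the log format is exactly as pasted above; `parse_bench_log` is a hypothetical helper name, not something shipped with the benchmark tools.

```shell
#!/usr/bin/env bash
# parse_bench_log: read benchmark log text on stdin and print
# "<e2e_ms> <tokens_per_sec>". Exits non-zero if either line is missing.
parse_bench_log() {
  awk '
    # "[INFO] E2E Time (actual performance): 14.9224 ms" -> value is second-to-last field
    /E2E Time \(actual performance\):/ { ms = $(NF - 1) }
    # "[INFO] Tokens/sec (E2E): 67.0" -> value is the last field
    /Tokens\/sec \(E2E\):/             { tps = $NF }
    END { if (ms != "" && tps != "") print ms, tps; else exit 1 }
  '
}

# Example with the log lines from this report:
parse_bench_log <<'EOF'
[INFO] CUDA graph enabled
[INFO] CUDA graph captured successfully.
[INFO] E2E Time (actual performance): 14.9224 ms
[INFO] Tokens/sec (E2E): 67.0
EOF
```

Piping the benchmark output through this filter once per run gives one line per run that is easy to paste into a table or average.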

Steps/Code to reproduce bug

Build configuration:

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DTRT_PACKAGE_DIR=/usr \
  -DCMAKE_TOOLCHAIN_FILE=cmake/aarch64_linux_toolchain.cmake \
  -DEMBEDDED_TARGET=jetson-thor
make -j$(nproc)

Runtime command used:

ENGINE=./trt_edge_workspace/Qwen3-4B-Instruct-2507/engine

# Engine build
./build/examples/llm/llm_build \
  --onnxDir "$ENGINE/../onnx/llm" \
  --engineDir "$ENGINE" \
  --maxBatchSize 1 \
  --maxInputLen 364 \
  --maxKVCacheCapacity 4096

# Prefill benchmark
./build/examples/llm/llm_bench \
  --engineDir "$ENGINE" \
  --mode prefill \
  --inputLen 364

# Decode benchmark
./build/examples/llm/llm_bench \
  --engineDir "$ENGINE" \
  --mode decode \
  --pastKVLen 364

Expected behavior

# Prefill
E2E Time (actual performance): 22.9 ms
Tokens/sec (E2E): 15895

# Decode
E2E Time (actual performance): 11.1 ms
Tokens/sec (E2E): 90.2

NOTE:

  • Tested on JetPack 7.2 in MAXN power mode.
  • Other NVFP4 models show a similar deviation from the published numbers.
  • llm_inference was also tried; its numbers are not close to the published values either.

System information (Edge Device)

  • Platform (e.g., NVIDIA Jetson Thor): NVIDIA Thor
  • Software release (e.g., JetPack 7.1): R39 revision 2.0 (L4T, built 2026-06-01)
  • CPU architecture: aarch64
  • GPU compute capability (e.g., SM110 for Jetson Thor): SM110
  • Total device memory: 122 GiB
  • Build type (e.g., Release, Debug): Release
  • Library versions:
    • TensorRT Edge-LLM version or commit hash: 0.8.0 (git: f9cc746)
    • CUDA: 13.2
    • TensorRT: 10.16.2
    • C++ compiler (e.g., GCC 11.4): GCC 13.3.0
  • CMake options used:
    • CMAKE_TOOLCHAIN_FILE: cmake/aarch64_linux_toolchain.cmake
    • EMBEDDED_TARGET: jetson-thor
    • TRT_PACKAGE_DIR: /usr

Labels: bug