Perf: Qwen3-4B-Instruct-2507 NVFP4 BS=1 prefill/decode throughput ~25–33% below published benchmarks on Jetson AGX Thor

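At batch size 1, `Qwen3-4B-Instruct-2507` (NVFP4) shows lower prefill and decode throughput on Jetson AGX Thor than the numbers published in `docs/source/user_guide/performance/performance-benchmarks.md`. Prefill throughput is about 33% below the published value and decode throughput is about 26% below it.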

| Metric | Expected (perf-benchmarks.md v0.4.0) | Actual |
|---|---|---|
| Prefill latency | 22.9 ms | ~34.2 ms |
| Prefill TPS | 15,895 tok/s | ~10,657 tok/s |
| Decode latency (osl=1, pastKVLen=364) | 11.1 ms | 14.9224 ms |
| Decode TPS | 90.2 tok/s | 67.0 tok/s |
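
The shortfall percentages follow directly from the table. A minimal sketch of that arithmetic, assuming `awk` is available; the helper names `pct_below` and `tps_from_latency` are illustrative and not part of the project:

```bash
# Relative shortfall of a measured value versus the published value, in percent.
# (Hypothetical helper, used only to illustrate the numbers in the table.)
pct_below() {
  awk -v expected="$1" -v actual="$2" \
    'BEGIN { printf "%.1f\n", (expected - actual) / expected * 100 }'
}

# Tokens per second for a given token count and a latency in milliseconds.
tps_from_latency() {
  awk -v tokens="$1" -v ms="$2" \
    'BEGIN { printf "%.1f\n", tokens / (ms / 1000) }'
}

pct_below 15895 10657        # prefill TPS shortfall: 33.0
pct_below 90.2 67.0          # decode TPS shortfall: 25.7
tps_from_latency 1 14.9224   # decode TPS implied by the measured latency: 67.0
```

The measured decode TPS (67.0 tok/s) matches the measured decode latency (14.9224 ms for one output token), so the reported latency and throughput are consistent with each other.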

Runtime logs from the decode run confirm that the CUDA graph was captured and enabled:
```
[INFO] CUDA graph enabled
[INFO] CUDA graph captured successfully.
[INFO] E2E Time (actual performance): 14.9224 ms
[INFO] Tokens/sec (E2E): 67.0
```
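
For comparing runs, the two metrics can be pulled out of log text like the above. This is a sketch that assumes the log lines keep exactly the format shown here; `extract_metrics` is a hypothetical helper, not a tool shipped with the project:

```bash
# Print the E2E time (ms) and tokens/sec found in log text read from stdin.
# Assumes lines of the form shown in the issue:
#   [INFO] E2E Time (actual performance): 14.9224 ms
#   [INFO] Tokens/sec (E2E): 67.0
extract_metrics() {
  awk '
    /E2E Time \(actual performance\):/ { ms = $(NF - 1) }   # value precedes "ms"
    /Tokens\/sec \(E2E\):/             { tps = $NF }        # value is the last field
    END { printf "latency_ms=%s tps=%s\n", ms, tps }
  '
}

printf '%s\n' \
  '[INFO] CUDA graph enabled' \
  '[INFO] CUDA graph captured successfully.' \
  '[INFO] E2E Time (actual performance): 14.9224 ms' \
  '[INFO] Tokens/sec (E2E): 67.0' | extract_metrics
# prints: latency_ms=14.9224 tps=67.0
```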

### Steps/Code to reproduce bug

**Build configuration:**
```bash
cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DTRT_PACKAGE_DIR=/usr \
  -DCMAKE_TOOLCHAIN_FILE=cmake/aarch64_linux_toolchain.cmake \
  -DEMBEDDED_TARGET=jetson-thor
make -j$(nproc)
```

**Commands used (engine build and benchmarks):**
```bash
ENGINE=./trt_edge_workspace/Qwen3-4B-Instruct-2507/engine

# Engine build
./build/examples/llm/llm_build \
  --onnxDir "$ENGINE/../onnx/llm" \
  --engineDir "$ENGINE" \
  --maxBatchSize 1 \
  --maxInputLen 364 \
  --maxKVCacheCapacity 4096

# Prefill benchmark
./build/examples/llm/llm_bench \
  --engineDir "$ENGINE" \
  --mode prefill \
  --inputLen 364

# Decode benchmark
./build/examples/llm/llm_bench \
  --engineDir "$ENGINE" \
  --mode decode \
  --pastKVLen 364
```
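
To rule out run-to-run noise, the benchmark can be repeated and the logged E2E times averaged. The sketch below assumes each invocation prints one `E2E Time (actual performance): … ms` line in the format shown in the logs above; `mean_e2e_ms` is a hypothetical helper, and whether `llm_bench` already averages internally is not known here:

```bash
# Run a command RUNS times and print the mean of the E2E times (ms) it logs.
# Usage: mean_e2e_ms RUNS COMMAND [ARGS...]
# The body runs in a subshell so its variables do not leak into the caller.
mean_e2e_ms() (
  runs=$1
  shift
  i=0
  while [ "$i" -lt "$runs" ]; do
    "$@" 2>&1            # capture stdout and stderr of each run
    i=$((i + 1))
  done | awk '
    /E2E Time \(actual performance\):/ { sum += $(NF - 1); n++ }
    END { if (n) printf "%.4f\n", sum / n }
  '
)

# Example invocation with the decode benchmark from this issue (not executed here):
#   mean_e2e_ms 5 ./build/examples/llm/llm_bench \
#     --engineDir "$ENGINE" --mode decode --pastKVLen 364
```

If the mean over several runs stays near 14.9 ms, the gap to the published 11.1 ms is unlikely to be measurement noise.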

### Expected behavior

```
# Prefill
E2E Time (actual performance): 22.9 ms
Tokens/sec (E2E): 15895

# Decode
E2E Time (actual performance): 11.1 ms
Tokens/sec (E2E): 90.2
```

**Notes:**
* Tested on JetPack 7.2 in MAXN power mode.
* Other NVFP4 models show a similar deviation from the published numbers.
* `llm_inference` was also tried, but its numbers were not close to the published values either.

### System information (Edge Device)

- Platform (e.g., NVIDIA Jetson Thor): NVIDIA Thor
- Software release (e.g., JetPack 7.1): R39 revision 2.0 (L4T, built 2026-06-01)
- CPU architecture: aarch64
- GPU compute capability (e.g., SM110 for Jetson Thor): SM110
- Total device memory: 122 GiB
- Build type (e.g., Release, Debug): Release
- Library versions:
  - TensorRT Edge-LLM version or commit hash: 0.8.0 (git: f9cc746)
  - CUDA: 13.2
  - TensorRT: 10.16.2
  - C++ compiler (e.g., GCC 11.4): GCC 13.3.0
- CMake options used:
  - CMAKE_TOOLCHAIN_FILE: `cmake/aarch64_linux_toolchain.cmake`
  - EMBEDDED_TARGET: `jetson-thor`
  - TRT_PACKAGE_DIR: `/usr`

