Comprehensive benchmark suite characterizing a 4-node, 32× NVIDIA H100 80GB HBM3 cluster operated by Voltage Park. Conducted in partnership with AI @ Georgia Tech. Results inform customer-facing performance claims and platform marketing for on-demand GPU infrastructure.
| Nodes | 4× bare metal |
| GPUs | 8× NVIDIA H100 80GB HBM3 SXM5 per node |
| CUDA | 12.6 |
| Interconnect | 3.2 Tbps InfiniBand |
| Container runtime | Docker + nvidia-container-toolkit |
| MPI | OpenMPI 4.1.2 |
| NCCL | 2.24.3+cuda12.6 |
H100 theoretical peaks (per GPU):
| Precision | Peak |
|---|---|
| FP64 | 67 TFLOPS |
| FP16 | 1,979 TFLOPS |
| FP8 | 3,958 TFLOPS |
| HBM3 memory bandwidth | 3,350 GB/s |
| Benchmark | Metric | Result | vs Theoretical Peak |
|---|---|---|---|
| HPCG (GPU) | GFLOP/s | 523 GFLOP/s | — |
| HPCG (CPU baseline) | GFLOP/s | 2.05 GFLOP/s | — |
| GPU Speedup | 255× | ||
| HPL FP64 | TFLOPS | 365.4 TFLOPS | 68% of FP64 peak |
| HPL-MxP FP16 | LU peak TFLOPS | 2,537 TFLOPS | — |
| HPL-MxP FP8 | LU peak TFLOPS | 3,554 TFLOPS | 89.8% of FP8 peak |
| Test | Result | vs Theoretical |
|---|---|---|
| CPU STREAM Triad | 486.7 GB/s | 79.3% of DDR5 peak (614 GB/s) |
| GPU H2D (PCIe) | 54.8 GB/s | 87% of PCIe 5.0 x16 |
| GPU D2H (PCIe) | 46.8 GB/s | — |
| GPU D2D (HBM3) | 2,903 GB/s | 86.7% of HBM3 peak |
| Test | Peak Bus BW | Notes |
|---|---|---|
| AllReduce | 479.72 GB/s | Peak at 8 GiB message |
| AllGather | 183.09 GB/s | Peak at 256 MiB |
| Broadcast | 212.86 GB/s | Peak at 128 MiB |
| Test | Bus BW | vs Single-Node |
|---|---|---|
| AllReduce | 330.93 GB/s | 69% efficiency |
| AllGather | 323.53 GB/s | 177% (scales super-linearly with more IB links) |
| Broadcast | 286.34 GB/s | 134% |
AllReduce at 69% cross-node efficiency reflects IB bandwidth across 4 nodes. AllGather and Broadcast exceed single-node bandwidth as 32 GPUs can aggregate across multiple IB links simultaneously.
| Configuration | Throughput | Time to Convergence | Scaling Efficiency |
|---|---|---|---|
| Single node (8× H100) | 1,555 seq/s | 491.85 min (8h 12m) | — |
| Multi-node (32× H100) | 4,937 seq/s | 154.86 min (2h 35m) | 79.4% |
Target: 0.720 masked LM accuracy. Both runs converged at step 1400/1563.
Scaling efficiency of 79.4% on Ethernet-only interconnect (no InfiniBand for this workload) is strong — efficiency loss is attributable to NCCL AllReduce collective overhead across nodes.
| Scenario | Result | Notes |
|---|---|---|
| Offline QPS | 3,336 QPS ✓ VALID | Batch=512 across 8 GPUs, FP16 |
| Server p99 latency | 207 ms ✗ INVALID | Harness-limited — see note below |
Raw CUDA inference latency (single GPU, no harness overhead):
| Batch | Median latency | Throughput |
|---|---|---|
| 1 | 8.6 ms | 116 q/s |
| 8 | 9.2 ms | 869 q/s |
| 16 | 16.0 ms | 998 q/s |
| 32 | 29.0 ms | 1,102 q/s |
8-GPU theoretical peak (batch=8/GPU): ~6,950 QPS
Note on Server scenario: The INVALID result is a harness artifact, not a hardware limitation. The custom Python harness (mlcommons-loadgen + PyTorch) introduces ~60ms of overhead per query from GIL serialization and threading, creating a 70ms latency floor against the 130ms MLPerf target. Raw CUDA inference latency on a single H100 is 8.6ms at batch=1. NVIDIA's official MLPerf submission uses TensorRT with CUDA graphs, achieving <10ms per-query latency and 5,000–8,000+ QPS Server on equivalent hardware. The Offline result (3,336 QPS) is the primary throughput metric and is not affected by harness overhead.
Container: nvcr.io/nvidia/hpc-benchmarks:24.09
- HPCG measures memory-bandwidth-bound sparse iterative solver performance — a real-world scientific computing proxy. Grid: 256³, runtime 60s.
- HPL (FP64) is the classic TOP500 dense linear algebra benchmark. Used
HPL-dgx-1N.dat(N=264,192, NB=1024, P×Q=4×2) — NVIDIA's standard single-node DGX H100 configuration. - HPL-MxP uses low-precision tensor cores (FP16 or FP8) for LU factorization, then refines to FP64 accuracy via GMRES. The LU GFLOPS figure directly measures tensor core throughput. Used N=274,432 to maximize GPU memory utilization (~37 GB / 79 GB per GPU). FP8 (
sloppy-type=1) is the primary result — 89.8% of H100 FP8 theoretical peak.
Container: nvcr.io/nvidia/nccl-tests
Single-node run with -g 8 (8 GPUs, single process). Multi-node run across all 4 nodes via OpenMPI. Message range: 8B to 8GiB. All results passed correctness validation.
CPU STREAM compiled with -O3 -march=native, 104 threads (matching physical core count). NVBench GPU bandwidth measured with pinned host memory across H2D, D2H, and D2D transfer directions. D2H throughput drops from 46.8 GB/s (small transfers) to ~20.5 GB/s (>2 MB) due to NUMA effects — H2D is unaffected.
Container: nvcr.io/nvidia/pytorch:24.09-py3 with custom training script
BERT-Large Phase 2 pretraining (seqlen=512). Dataset: AWS Neuron public S3 bert_pretrain_wikicorpus_tokenized_hdf5_seqlen512. Optimizer: APEX FusedLAMB with bf16 autocast + FlashAttention 2. Global batch size: 32,768. Multi-node launch via torchrun with static rendezvous (master_addr=10.15.21.81).
Methodology note: The official MLPerf Training container (
nvcr.io/nvidia/mlperf/training:bert-24.09) requires MLPerf program enrollment through MLCommons. Access was unavailable at time of testing. Results usepytorch:24.09-py3with equivalent training configuration. Throughput and convergence numbers are valid hardware characterizations but are not directly submittable to the MLCommons leaderboard.
Container: nvcr.io/nvidia/pytorch:24.09-py3 with mlcommons-loadgen 6.0.10
Model: official MLCommons BERT-Large checkpoint (Zenodo record 3733896). Dataset: SQuAD v1.1 validation set (10,570 examples → 10,772 sliding-window features). Accuracy: ~90.9 F1 (above BERT-99 threshold of ≥90.874 F1).
Methodology note: Same MLCommons enrollment requirement applied to the inference container. PyTorch backend used in place of TensorRT. Offline QPS is unaffected; Server scenario result is harness-limited as described above.
vpbenchmarking/
├── results/
│ ├── hpl_ai/
│ │ ├── hpcg_gpu_<timestamp>.txt
│ │ ├── hpl_fp64_<timestamp>.txt
│ │ ├── hpl_mxp_fp16_<timestamp>.txt
│ │ ├── hpl_mxp_fp8_<timestamp>.txt
│ │ └── hpl_ai_summary.md
│ ├── nccl/
│ │ ├── nccl_summary.md
│ │ └── nccl_multinode_summary.md
│ ├── stream/
│ │ └── stream_nvbench_summary.md
│ └── mlperf/
│ ├── bert/
│ │ └── bert_summary.md
│ └── inference/bert/
│ └── bert_inference_summary.md
├── hostfile
└── CLAUDE.md
Tensor core utilization is strong. HPL-MxP FP8 LU peak of 3,554 TFLOPS is 89.8% of H100 theoretical FP8 peak. The cluster is not compute-bottlenecked.
Memory bandwidth is healthy across all levels. HBM3 at 86.7% efficiency, PCIe H2D at 87%, CPU DDR5 at 79.3%.
Multi-node training scales well on Ethernet. 79.4% scaling efficiency across 4 nodes without InfiniBand. The 3.17× speedup on BERT training is attributable to NCCL AllReduce overhead, not hardware limitations.
Inference throughput is production-grade. 3,336 QPS Offline on 8× H100 for BERT-Large FP16. Scaled linearly across 4 nodes: ~13,344 QPS cluster-wide.
Benchmarking conducted March 2026 in partnership with Voltage Park by AI @ Georgia Tech.