A PyTorch benchmarking framework for training throughput analysis on NVIDIA GPUs. It combines end-to-end step timing, GPU telemetry, precision and batch-size sweeps, DataLoader analysis, bottleneck diagnosis, and report generation.
Most training performance discussions focus on model quality but under-specify system behavior. This suite is designed to answer practical and research-facing questions such as:
- What is the best stable batch size for a model on a given GPU?
- Is training limited by compute, memory bandwidth, host input pipeline, or optimizer overhead?
- What speedup do BF16/FP16 modes provide relative to FP32?
- Which DataLoader configuration minimizes data fetch latency?
The result is a reproducible JSON artifact plus a standalone HTML report.
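The BF16/FP16 question above reduces to timing the same training step under `torch.autocast`. Here is a minimal sketch of that comparison, assuming `model`, `x`, `y`, `loss_fn`, and `opt` are already defined (illustrative only, not the suite's implementation):

```python
import time

import torch

def mean_step_ms(model, x, y, loss_fn, opt, dtype=None, iters=50):
    """Average step time in ms; dtype=None times FP32, torch.bfloat16 times autocast BF16."""
    torch.cuda.synchronize()                      # drain pending GPU work before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        opt.zero_grad(set_to_none=True)
        with torch.autocast("cuda", dtype=dtype, enabled=dtype is not None):
            loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    torch.cuda.synchronize()                      # include all queued kernels in the total
    return (time.perf_counter() - t0) / iters * 1e3

# Hypothetical usage:
# fp32_ms = mean_step_ms(model, x, y, loss_fn, opt)
# bf16_ms = mean_step_ms(model, x, y, loss_fn, opt, dtype=torch.bfloat16)
# print(f"BF16 speedup: {fp32_ms / bf16_ms:.2f}x")
```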
- Throughput profiling (`samples/sec`, `step_time_ms`)
- Step-level phase breakdown (forward, backward, optimizer, data loading; see the timing sketch after this list)
- GPU utilization and memory telemetry via NVML
- Batch-size sweep with OOM-safe stopping
- Precision sweep across FP32/FP16/BF16
- DataLoader worker sweep
- Bottleneck detection with actionable recommendations
- Roofline-oriented analysis helpers
- Self-contained HTML report generation
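To illustrate how a step-level phase breakdown can be captured, here is a minimal sketch using CUDA events so that actual device time is measured rather than kernel launch time (illustrative only; the suite's profilers are more involved):

```python
import torch

def timed_ms(fn):
    """Time one GPU phase with CUDA events; returns (elapsed ms, fn's result)."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()   # wait until the `end` event has actually been reached
    return start.elapsed_time(end), result

# Hypothetical per-step usage (model, batch, targets, loss_fn, optimizer defined elsewhere):
# fwd_ms, logits = timed_ms(lambda: model(batch))
# bwd_ms, _      = timed_ms(lambda: loss_fn(logits, targets).backward())
# opt_ms, _      = timed_ms(optimizer.step)
```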
```
gpu-benchmark-suite-fixed/
  benchmark/
    analysis/        # bottlenecks, roofline, cost and recommendation logic
    core/            # config, metrics collector, benchmark orchestrator
    profilers/       # throughput and subsystem profilers
    sweeps/          # batch size, precision, dataloader, distributed sweeps
    utils/           # system and GPU helper utilities
    visualization/   # plots + HTML report generator
  benchmarks/        # suite overhead benchmarking
  docs/              # user docs + technical report
  examples/          # runnable scripts and CI regression example
  tests/             # unit tests
```
- Create and activate a virtual environment.
- Install PyTorch for your CUDA version from https://pytorch.org/get-started/locally/.
- Install this package:

```bash
pip install -r requirements.txt
pip install -e .
```

For development:

```bash
pip install -r requirements-dev.txt
```

Run the basic example:

```bash
python examples/basic_benchmark.py
```

Artifacts are written to `results/`:
- JSON metrics payload
- HTML report with embedded charts
```python
import torchvision.models as tvm

from benchmark.core.benchmark_runner import BenchmarkRunner
from benchmark.core.config import BenchmarkConfig

config = BenchmarkConfig(
    min_batch_size=32,
    max_batch_size=256,
    precision_modes=["fp32", "bf16"],
    num_workers_candidates=[0, 2, 4, 8],
)

runner = BenchmarkRunner(
    model=tvm.resnet50(num_classes=10),
    input_shape=(3, 224, 224),
    num_classes=10,
    config=config,
)

results = runner.run_full_suite()
runner.save_results()
runner.generate_report()
```
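Under the hood, an OOM-safe batch-size sweep can be as simple as catching `torch.cuda.OutOfMemoryError` and stopping. A minimal sketch of the idea (not the suite's actual sweep logic; `run_one_step` is a hypothetical helper):

```python
import torch

def max_stable_batch_size(make_step, sizes):
    """Try batch sizes in increasing order; return the largest that completes a step."""
    best = None
    for bs in sizes:
        try:
            make_step(bs)                 # run one full training step at this batch size
            best = bs
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()      # release the failed allocation before stopping
            break
    return best

# Hypothetical usage, doubling from min_batch_size to max_batch_size:
# best = max_stable_batch_size(run_one_step, [32, 64, 128, 256])
```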
For deep-dive analysis, export PyTorch Kineto traces:

```python
import torch

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
) as prof:
    train_step()  # one representative training step

prof.export_chrome_trace("trace.json")
```

Then open `trace.json` in `chrome://tracing` or https://ui.perfetto.dev (Chrome traces are not loadable by the Nsight Systems GUI).
This correlates high-level PyTorch ops with low-level CUDA kernels.
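The same `prof` object also provides an aggregated in-terminal summary of where device time went, via the standard `torch.profiler` API:

```python
# Top ops by total CUDA time, printed as a text table:
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```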
- Benchmark runs should be repeated and aggregated (median or trimmed mean recommended; see the sketch after this list).
- GPU clocks, thermals, background load, and driver state materially affect measured throughput.
- Use fixed seeds and stable software versions when reporting comparative results.
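For example, a trimmed mean over repeated step times discards warm-up spikes and thermal outliers (plain Python, not part of the suite's API):

```python
import statistics

def trimmed_mean(samples, trim_frac=0.1):
    """Drop the lowest and highest trim_frac of samples, then average the remainder."""
    xs = sorted(samples)
    k = int(len(xs) * trim_frac)
    kept = xs[k:len(xs) - k] or xs    # fall back to all samples if trimming empties the list
    return statistics.fmean(kept)

step_times_ms = [41.8, 42.1, 42.0, 55.3, 41.9]        # one warm-up/thermal spike at 55.3
print(statistics.median(step_times_ms))                # 42.0
print(trimmed_mean(step_times_ms, trim_frac=0.2))      # 42.0, spike excluded from the average
```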
See the technical report in `docs/` for methodology and limitations.
- Unit tests: `tests/`
- CI workflow: `.github/workflows/ci.yml`
- API docs: `docs/api_reference.md`
- Optimization guidance: `docs/optimization_guide.md`
- Hardware guidance: `docs/hardware_guide.md`
If this suite informs research or engineering reports, cite it via `CITATION.cff`.
This project is licensed under the MIT License. See LICENSE.