Olajide-Badejo/GPU-Training-Bench

GPU Training Benchmark Suite

A PyTorch benchmarking framework for training throughput analysis on NVIDIA GPUs. It combines end-to-end step timing, GPU telemetry, precision and batch-size sweeps, DataLoader analysis, bottleneck diagnosis, and report generation.

Why this project exists

Most training performance discussions focus on model quality but under-specify system behavior. This suite is designed to answer practical and research-facing questions such as:

  • What is the best stable batch size for a model on a given GPU?
  • Is training limited by compute, memory bandwidth, host input pipeline, or optimizer overhead?
  • What speedup do BF16/FP16 modes provide relative to FP32?
  • Which DataLoader configuration minimizes data fetch latency?

The result is a reproducible JSON artifact plus a standalone HTML report.
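The second question above, where a step's time actually goes, can be sketched with plain wall-clock timing. This is a hypothetical helper (`time_step_phases` is not the suite's API; its real profilers are more careful, e.g. using CUDA events), but it shows the idea of a per-phase breakdown:

```python
import time

import torch


def time_step_phases(model, batch, targets, optimizer, loss_fn):
    """Time the forward/backward/optimizer phases of one training step.

    Hypothetical sketch: synchronizes around each phase so GPU work is
    attributed to the phase that launched it.
    """
    sync = torch.cuda.synchronize if torch.cuda.is_available() else (lambda: None)
    phases = {}

    sync()
    t0 = time.perf_counter()
    loss = loss_fn(model(batch), targets)
    sync()
    phases["forward_ms"] = (time.perf_counter() - t0) * 1e3

    t0 = time.perf_counter()
    optimizer.zero_grad()
    loss.backward()
    sync()
    phases["backward_ms"] = (time.perf_counter() - t0) * 1e3

    t0 = time.perf_counter()
    optimizer.step()
    sync()
    phases["optimizer_ms"] = (time.perf_counter() - t0) * 1e3
    return phases
```

Comparing the three numbers against the total step time (the remainder is largely data loading) is the simplest form of bottleneck triage.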

Core capabilities

  • Throughput profiling (samples/sec, step_time_ms)
  • Step-level phase breakdown (forward, backward, optimizer, data loading)
  • GPU utilization and memory telemetry via NVML
  • Batch-size sweep with OOM-safe stopping
  • Precision sweep across FP32/FP16/BF16
  • DataLoader worker sweep
  • Bottleneck detection with actionable recommendations
  • Roofline-oriented analysis helpers
  • Self-contained HTML report generation
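The OOM-safe stopping behind the batch-size sweep can be reduced to a small loop. This sketch (the function name and `step_fn` interface are illustrative, not the suite's API) assumes PyTorch 1.13+ for the `torch.cuda.OutOfMemoryError` class:

```python
import torch


def find_max_batch_size(step_fn, start=1, limit=1024):
    """Double the batch size until a CUDA OOM (or the limit) is hit.

    `step_fn(batch_size)` should run one full training step at that
    batch size; the largest size that completed is returned.
    """
    best = None
    bs = start
    while bs <= limit:
        try:
            step_fn(bs)
            best = bs
            bs *= 2
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release the failed allocation
            break
    return best
```

A production sweep would also bisect between the last success and the first failure to find a tighter bound than the nearest power of two.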

Repository layout

gpu-benchmark-suite-fixed/
  benchmark/
    analysis/         # bottlenecks, roofline, cost and recommendation logic
    core/             # config, metrics collector, benchmark orchestrator
    profilers/        # throughput and subsystem profilers
    sweeps/           # batch size, precision, dataloader, distributed sweeps
    utils/            # system and GPU helper utilities
    visualization/    # plots + HTML report generator
  benchmarks/         # suite overhead benchmarking
  docs/               # user docs + technical report
  examples/           # runnable scripts and CI regression example
  tests/              # unit tests

Installation

  1. Create and activate a virtual environment.
  2. Install PyTorch for your CUDA version from https://pytorch.org/get-started/locally/.
  3. Install this package.
pip install -r requirements.txt
pip install -e .

For development:

pip install -r requirements-dev.txt

Quick start

python examples/basic_benchmark.py

Artifacts are written to results/:

  • JSON metrics payload
  • HTML report with embedded charts

Typical programmatic usage

import torchvision.models as tvm
from benchmark.core.benchmark_runner import BenchmarkRunner
from benchmark.core.config import BenchmarkConfig

config = BenchmarkConfig(
    min_batch_size=32,
    max_batch_size=256,
    precision_modes=["fp32", "bf16"],
    num_workers_candidates=[0, 2, 4, 8],
)

runner = BenchmarkRunner(
    model=tvm.resnet50(num_classes=10),
    input_shape=(3, 224, 224),
    num_classes=10,
    config=config,
)

results = runner.run_full_suite()
runner.save_results()
runner.generate_report()
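The precision sweep configured above boils down to timing the same work under different dtypes. A simplified stand-alone sketch (forward-only for brevity; a real sweep also times the backward pass and uses CUDA events on GPU):

```python
import contextlib
import time

import torch


def time_forward(model, batch, dtype, iters=10):
    """Median forward-pass latency at a given precision.

    fp32 runs without autocast as the baseline; reduced precisions
    run under torch.autocast on the batch's device.
    """
    device = batch.device.type
    ctx = (contextlib.nullcontext() if dtype == torch.float32
           else torch.autocast(device_type=device, dtype=dtype))
    times = []
    with ctx, torch.no_grad():
        for _ in range(iters):
            if device == "cuda":
                torch.cuda.synchronize()
            t0 = time.perf_counter()
            model(batch)
            if device == "cuda":
                torch.cuda.synchronize()
            times.append(time.perf_counter() - t0)
    times.sort()
    return times[len(times) // 2]
```

Reporting the ratio `time_forward(m, x, torch.float32) / time_forward(m, x, torch.bfloat16)` gives the kind of speedup figure the suite's precision sweep produces at full-step granularity.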

Integration with Nsight Systems

For deep-dive analysis, export PyTorch Kineto traces:

import torch

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
) as prof:
    train_step()
prof.export_chrome_trace("trace.json")

The exported Chrome trace opens in Perfetto (https://ui.perfetto.dev) or chrome://tracing; Nsight Systems does not read Chrome traces. To use Nsight Systems, capture a profile directly instead:

nsys profile -t cuda,nvtx -o benchmark python examples/basic_benchmark.py

Then open the resulting .nsys-rep file in nsys-ui. Either view correlates high-level PyTorch ops with low-level CUDA kernels.
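For a quick operator-level summary without leaving Python, `torch.profiler` can also aggregate in-process. CPU-only here so it runs anywhere; add `ProfilerActivity.CUDA` on a GPU machine:

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 64)
batch = torch.randn(32, 64)

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
) as prof:
    model(batch).sum().backward()

# One row per operator, sorted by self CPU time.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

This is often enough to spot an unexpectedly hot operator before committing to a full trace capture.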

Reproducibility notes

  • Benchmark runs should be repeated and aggregated (median or trimmed mean recommended).
  • GPU clocks, thermals, background load, and driver state materially affect measured throughput.
  • Use fixed seeds and stable software versions when reporting comparative results.
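The fixed-seed recommendation above usually means seeding every RNG a training run touches. A minimal sketch:

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 0) -> None:
    """Fix seeds for Python, NumPy, and PyTorch RNGs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # also seeds CUDA RNGs on GPU machines
    torch.cuda.manual_seed_all(seed)  # explicit, harmless no-op without CUDA


set_seed(42)
a = torch.randn(3)
set_seed(42)
b = torch.randn(3)
assert torch.equal(a, b)  # identical draws after reseeding
```

Note that seeding alone does not guarantee bitwise-identical runs; some CUDA kernels are nondeterministic unless `torch.use_deterministic_algorithms(True)` is also set, at a throughput cost that would itself skew a benchmark.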

See the technical report in docs/ for methodology and limitations.

Engineering quality

  • Unit tests: tests/
  • CI workflow: .github/workflows/ci.yml
  • API docs: docs/api_reference.md
  • Optimization guidance: docs/optimization_guide.md
  • Hardware guidance: docs/hardware_guide.md

Citation

If this suite informs research or engineering reports, cite it via CITATION.cff.

License

This project is licensed under the MIT License. See LICENSE.
