A PyTorch benchmarking framework for training throughput analysis on NVIDIA GPUs. It combines end-to-end step timing, GPU telemetry, precision and batch-size sweeps, DataLoader analysis, bottleneck diagnosis, and report generation.
Most training performance discussions focus on model quality but under-specify system behavior. This suite is designed to answer practical and research-facing questions such as:
- What is the best stable batch size for a model on a given GPU?
- Is training limited by compute, memory bandwidth, host input pipeline, or optimizer overhead?
- What speedup do BF16/FP16 modes provide relative to FP32?
- Which DataLoader configuration minimizes data fetch latency?
The result is a reproducible JSON artifact plus a standalone HTML report.
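The BF16/FP16 question above reduces to timing the same training step under `torch.autocast`. Here is a minimal sketch of that comparison, assuming `model`, `x`, `y`, `loss_fn`, and `opt` are already defined (illustrative only, not the suite's implementation):

```python
import time

import torch

def mean_step_ms(model, x, y, loss_fn, opt, dtype=None, iters=50):
    """Average step time in ms; dtype=None times FP32, torch.bfloat16 times autocast BF16."""
    torch.cuda.synchronize()                      # drain pending GPU work before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        opt.zero_grad(set_to_none=True)
        with torch.autocast("cuda", dtype=dtype, enabled=dtype is not None):
            loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    torch.cuda.synchronize()                      # include all queued kernels in the total
    return (time.perf_counter() - t0) / iters * 1e3

# Hypothetical usage:
# fp32_ms = mean_step_ms(model, x, y, loss_fn, opt)
# bf16_ms = mean_step_ms(model, x, y, loss_fn, opt, dtype=torch.bfloat16)
# print(f"BF16 speedup: {fp32_ms / bf16_ms:.2f}x")
```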
- Throughput profiling (`samples/sec`, `step_time_ms`)
- Step-level phase breakdown (forward, backward, optimizer, data loading; see the timing sketch after this list)
- GPU utilization and memory telemetry via NVML
- Batch-size sweep with OOM-safe stopping
- Precision sweep across FP32/FP16/BF16
- DataLoader worker sweep
- Bottleneck detection with actionable recommendations
- Roofline-oriented analysis helpers
- Self-contained HTML report generation
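To illustrate how a step-level phase breakdown can be captured, here is a minimal sketch using CUDA events so that actual device time is measured rather than kernel launch time (illustrative only; the suite's profilers are more involved):

```python
import torch

def timed_ms(fn):
    """Time one GPU phase with CUDA events; returns (elapsed ms, fn's result)."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()   # wait until the `end` event has actually been reached
    return start.elapsed_time(end), result

# Hypothetical per-step usage (model, batch, targets, loss_fn, optimizer defined elsewhere):
# fwd_ms, logits = timed_ms(lambda: model(batch))
# bwd_ms, _      = timed_ms(lambda: loss_fn(logits, targets).backward())
# opt_ms, _      = timed_ms(optimizer.step)
```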
```
gpu-benchmark-suite-fixed/
  benchmark/
    analysis/        # bottlenecks, roofline, cost and recommendation logic
    core/            # config, metrics collector, benchmark orchestrator
    profilers/       # throughput and subsystem profilers
    sweeps/          # batch size, precision, dataloader, distributed sweeps
    utils/           # system and GPU helper utilities
    visualization/   # plots + HTML report generator
  benchmarks/        # suite overhead benchmarking
  docs/              # user docs + technical report
  examples/          # runnable scripts and CI regression example
  tests/             # unit tests
```
- Create and activate a virtual environment.
- Install PyTorch for your CUDA version from https://pytorch.org/get-started/locally/.
- Install this package:

```bash
pip install -r requirements.txt
pip install -e .
```

For development:

```bash
pip install -r requirements-dev.txt
```

Run the basic example:

```bash
python examples/basic_benchmark.py
```

Artifacts are written to `results/`:
- JSON metrics payload
- HTML report with embedded charts
```python
import torchvision.models as tvm

from benchmark.core.benchmark_runner import BenchmarkRunner
from benchmark.core.config import BenchmarkConfig

config = BenchmarkConfig(
    min_batch_size=32,
    max_batch_size=256,
    precision_modes=["fp32", "bf16"],
    num_workers_candidates=[0, 2, 4, 8],
)

runner = BenchmarkRunner(
    model=tvm.resnet50(num_classes=10),
    input_shape=(3, 224, 224),
    num_classes=10,
    config=config,
)

results = runner.run_full_suite()
runner.save_results()
runner.generate_report()
```
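Under the hood, an OOM-safe batch-size sweep can be as simple as catching `torch.cuda.OutOfMemoryError` and stopping. A minimal sketch of the idea (not the suite's actual sweep logic; `run_one_step` is a hypothetical helper):

```python
import torch

def max_stable_batch_size(make_step, sizes):
    """Try batch sizes in increasing order; return the largest that completes a step."""
    best = None
    for bs in sizes:
        try:
            make_step(bs)                 # run one full training step at this batch size
            best = bs
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()      # release the failed allocation before stopping
            break
    return best

# Hypothetical usage, doubling from min_batch_size to max_batch_size:
# best = max_stable_batch_size(run_one_step, [32, 64, 128, 256])
```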
For deep-dive analysis, export PyTorch Kineto traces:

```python
import torch

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
) as prof:
    train_step()  # one representative training step

prof.export_chrome_trace("trace.json")
```

Then open `trace.json` in `chrome://tracing` or https://ui.perfetto.dev (Chrome traces are not loadable by the Nsight Systems GUI).
This correlates high-level PyTorch ops with low-level CUDA kernels.
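The same `prof` object also provides an aggregated in-terminal summary of where device time went, via the standard `torch.profiler` API:

```python
# Top ops by total CUDA time, printed as a text table:
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```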
- Benchmark runs should be repeated and aggregated (median or trimmed mean recommended; see the sketch after this list).
- GPU clocks, thermals, background load, and driver state materially affect measured throughput.
- Use fixed seeds and stable software versions when reporting comparative results.
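For example, a trimmed mean over repeated step times discards warm-up spikes and thermal outliers (plain Python, not part of the suite's API):

```python
import statistics

def trimmed_mean(samples, trim_frac=0.1):
    """Drop the lowest and highest trim_frac of samples, then average the remainder."""
    xs = sorted(samples)
    k = int(len(xs) * trim_frac)
    kept = xs[k:len(xs) - k] or xs    # fall back to all samples if trimming empties the list
    return statistics.fmean(kept)

step_times_ms = [41.8, 42.1, 42.0, 55.3, 41.9]        # one warm-up/thermal spike at 55.3
print(statistics.median(step_times_ms))                # 42.0
print(trimmed_mean(step_times_ms, trim_frac=0.2))      # 42.0, spike excluded from the average
```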
See the technical report in `docs/` for methodology and limitations.
- Unit tests: `tests/`
- CI workflow: `.github/workflows/ci.yml`
- API docs: `docs/api_reference.md`
- Optimization guidance: `docs/optimization_guide.md`
- Hardware guidance: `docs/hardware_guide.md`
If this suite informs research or engineering reports, cite it via `CITATION.cff`.
This project is licensed under the MIT License. See LICENSE.