TU102 (sm_75) microarchitecture characterisation

Instruction-level latency/throughput tables, memory-hierarchy measurements, and interconnect (NVLink/PCIe/NCCL) characterisation for the NVIDIA TU102 GPU (Turing, compute capability 7.5), measured on 2× Quadro RTX 6000 with an NV2 NVLink bridge. In the style of Agner Fog's x86 instruction tables, with the methodology of Jia et al.'s Volta/Turing dissections.

Status: measured, audit in progress. 264 rows across every family: ALU with contention-probed pipe bindings, the full memory hierarchy (smem/L1/L2/DRAM/TLB/constant/instruction caches, carveout, policies, atomics), tensor cores, SFU, shuffle/branch/barrier, the launch family, and the NVLink/PCIe/NCCL interconnect. Every row gates on locked clocks and SASS-verified loops, and every flag carries its investigation. All three pre-registered hypotheses (paper/ section 4) carry recorded outcomes: a fired kill criterion, a confirmation reached through two documented gate remediations (the second supplied by a first-read-after-peer-write visibility row), and two published refutations. The manifest auditor (tools/check_complete.py, CI-gated) holds the open remainder explicit in table/COVERAGE.md. The paper and v1.0 close the work.

What this will contain

table/tu102_ops.csv — one row per (op, variant, metric): latency in cycles, reciprocal throughput, pipe binding, bandwidth, with per-row provenance (measured_by → bench + results line), priors from published work (prior_src → bib locator), and deviation flags. Features absent on sm_75 (e.g. cp.async) appear as explicit NA_SM75 rows, not missing keys.
bench/ — self-contained CUDA microbenchmarks, one family per directory. Every binary gates on locked clocks and verifies its own SASS.
tools/ — SASS census, table generation, and a projection model (op-mix × table → issue-cycle estimate) verified against reference kernels to ±20%.
data/results/<host>/ — append-only raw run CSVs. The pristine reference dataset is t5820-2xrtx6000 (Xeon W-2140B, both GPUs PCIe 3.0 x16).
paper/ — LaTeX write-up (arXiv-ready). PAPER.md at the repository root is a generated GitHub-readable mirror (make paper-md, via pandoc), and the .tex is the source of truth. Prose is British English, while code identifiers and named APIs stay US English ASCII.

Hardware


GPU	2× Quadro RTX 6000 (TU102, 72 SMs, 24 GB GDDR6, ~672 GB/s)
Interconnect	NVLink NV2 (2 links, ~50 GB/s/dir aggregate), PCIe 3.0 x16 per GPU
Clock policy	SM locked at 1455 MHz for all cycle-domain rows; memory clock locked for bandwidth rows; ECC disabled (recorded per run)
Toolchain	CUDA 13.3 (`nvcc -O2 -arch=sm_75`), driver 610.43.02

Scope: compute path only. RT cores (not reachable from CUDA), NVENC/NVDEC, and the graphics pipe are out of scope, stated rather than silent. The table is a snapshot of this toolchain/driver pairing, not a living document. Run headers carry the exact versions, and the append-only results layout admits later datasets under newer toolchains.

Methodology lineage

Format and intent follow Agner Fog's instruction tables [fog2025instruction]. Latency measurement uses dependent-chain timing and fine-grained pointer-chase [wong2010demystifying, mei2017dissecting]. Per-row priors are taken primarily from Jia et al.'s T4 dissection [jia2019turing], with the Volta report [jia2018volta] and the NVIDIA Turing whitepaper as secondary sources. Full citations in paper/references.bib.

Reproducing

REPRODUCING.md is the fresh-clone runbook: lock the clocks, run bench/common/run_all.sh, regenerate the table, compare within the per-domain floors in table/SCHEMA.md. Replication datasets are welcome as pull requests adding a new data/results/<host>/ directory.

Citing

See CITATION.cff (GitHub's "Cite this repository" button). Until the paper is released, cite the repository directly.

License

Code (bench/, tools/, Makefile): MIT — see LICENSE.
Table, data, and paper (table/, data/, paper/): CC BY 4.0 — see LICENSE-CC-BY-4.0.

AI assistance

The microbenchmark harness, analysis tooling, and manuscript drafts were developed with the assistance of Claude (model claude-fable-5, Anthropic), used via Claude Code. All measurements were executed, validated, and interpreted on the author's hardware under the author's direction, and the author takes sole responsibility for the content.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TU102 (sm_75) microarchitecture characterisation

What this will contain

Hardware

Methodology lineage

Reproducing

Citing

License

AI assistance

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
.github/workflows		.github/workflows
bench		bench
data		data
paper		paper
table		table
tools		tools
.gitignore		.gitignore
CITATION.cff		CITATION.cff
LICENSE		LICENSE
LICENSE-CC-BY-4.0		LICENSE-CC-BY-4.0
Makefile		Makefile
PAPER.md		PAPER.md
README.md		README.md
REPRODUCING.md		REPRODUCING.md

Folders and files

Latest commit

History

Repository files navigation

TU102 (sm_75) microarchitecture characterisation

What this will contain

Hardware

Methodology lineage

Reproducing

Citing

License

AI assistance

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages