AI Inference Data Center Simulator

An interactive visualization tool for understanding how GPUs process AI inference requests and what constrains throughput at data center scale.

Live Demo

https://iaps-ai.github.io/inference-data-center-simulator/

Alternatively, open index.html locally in a browser; no server is required.

What This Teaches

The Two Phases of Inference

Prefill (Compute-Bound): Processing input tokens in parallel. High arithmetic intensity, GPU compute units stay busy.
Decode (Bandwidth-Bound): Generating output tokens one at a time. Must read all model weights for each token. Memory bandwidth is the bottleneck, not compute.
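The split between the two phases can be illustrated with a roofline-style check. This is a minimal sketch, not the simulator's own code: the H100 peak figures are approximate assumptions, the function names are hypothetical, and only weight traffic is counted (KV cache and activation reads are ignored).

```python
# Roofline-style check: is one forward pass compute-bound or bandwidth-bound?
# Hardware numbers are approximate assumptions, not values from the simulator.

H100_FLOPS = 989e12        # assumed dense BF16 peak, FLOP/s
H100_BANDWIDTH = 3.35e12   # assumed HBM bandwidth, bytes/s
BYTES_PER_PARAM = 2        # FP16/BF16 weights


def arithmetic_intensity(tokens_in_pass: int) -> float:
    """FLOPs per byte of weights read when the tokens share one weight read."""
    flops_per_param = 2 * tokens_in_pass  # one multiply-add per parameter per token
    return flops_per_param / BYTES_PER_PARAM


def bottleneck(tokens_in_pass: int) -> str:
    """Compare the pass's intensity with the GPU's ridge point (FLOPs per byte)."""
    ridge = H100_FLOPS / H100_BANDWIDTH
    return "compute" if arithmetic_intensity(tokens_in_pass) >= ridge else "bandwidth"


print(bottleneck(2048))  # prefill of a 2048-token prompt
print(bottleneck(1))     # decode of one token for one request
```

Under these assumptions the ridge point is roughly 295 FLOPs per byte, so a long prompt lands on the compute side and a single decoded token lands far on the bandwidth side.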

Key Constraints

Memory Bandwidth: Limits decode speed. Tokens/sec ≈ Bandwidth / (2 × Parameters), where the 2 is bytes per parameter for FP16/BF16 weights
Compute (FLOPS): Limits prefill speed
Memory Capacity: Limits concurrent requests via KV cache
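The memory-capacity constraint can be sketched as a simple budget: whatever HBM is left after loading the weights is divided among per-request KV caches. The function name and the example numbers (four 80 GB GPUs, FP16 weights, 10 GiB of KV cache per request) are illustrative assumptions, not values taken from the simulator.

```python
def max_concurrent_requests(total_hbm_bytes: float,
                            weight_bytes: float,
                            kv_bytes_per_request: float) -> int:
    """How many requests fit in the HBM left over after the weights are loaded."""
    free = total_hbm_bytes - weight_bytes
    if free <= 0:
        return 0  # the weights alone do not fit
    return int(free // kv_bytes_per_request)


# Assumed example: 70B parameters at 2 bytes each, sharded over 4 x 80 GB GPUs,
# with a hypothetical 10 GiB KV cache per request.
total_hbm = 4 * 80e9
weights = 70e9 * 2
kv_per_request = 10 * 2**30

print(max_concurrent_requests(total_hbm, weights, kv_per_request))
```

Real deployments also reserve memory for activations and runtime overhead, so this figure is an upper bound.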

Multi-GPU Parallelism

Tensor Parallelism (TP): Split weights across GPUs. Required when the model doesn't fit on one GPU. Adds communication overhead (all-reduce per layer).
Data Parallelism (DP): Multiple independent replicas. Linear throughput scaling, no communication overhead.
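One way to see how TP and DP trade off is to pick the smallest TP degree whose weight shard fits on a single GPU and spend the remaining GPUs on replicas. This is a sketch under stated assumptions: the helper name is hypothetical, TP is restricted to powers of two, and only weight memory is considered, so it gives a lower bound on TP rather than a recommended setting.

```python
def plan_parallelism(num_gpus: int, params: float, gpu_memory_bytes: float,
                     bytes_per_param: int = 2) -> tuple[int, int]:
    """Return (tp, dp): the smallest power-of-two TP that fits the weights, rest as DP."""
    weight_bytes = params * bytes_per_param
    tp = 1
    # Double TP until one shard of the weights fits in a single GPU's memory.
    while weight_bytes / tp > gpu_memory_bytes:
        tp *= 2
    if tp > num_gpus or num_gpus % tp != 0:
        raise ValueError("model does not fit on the given GPUs with this scheme")
    return tp, num_gpus // tp


# Assumed hardware: 80 GB per data center GPU, 24 GB for a consumer card.
print(plan_parallelism(32, 70e9, 80e9))  # 70B model on 32 GPUs
print(plan_parallelism(1, 7e9, 24e9))    # 7B model on a single GPU
```

A larger TP than this minimum leaves more room per GPU for KV cache, at the cost of more all-reduce traffic and fewer independent replicas.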

Features

GPU Architecture Visualization: Watch data flow between HBM, memory bus, SRAM, and tensor cores
Data Center View: See racks, nodes, GPUs with NVLink and InfiniBand connections
Realistic Presets: Configurations modeled after Anthropic (Claude), OpenAI (ChatGPT), Moonshot (Kimi), and more
Configurable Parameters: GPU specs, model size, TP/DP, batch size, demand profiles
Real-time Metrics: Throughput, latency, utilization, bottleneck identification

Presets

Preset	Model	GPUs	TP	DP	Use Case
Anthropic (Claude Code)	70B	32× H100	4	8	Coding agent
OpenAI (ChatGPT)	70B	32× H100	2	16	Casual chat
Moonshot (Kimi)	70B	16× H800	8	2	Long-context
DeepSeek (Coder)	34B	8× H100	2	4	Coding
Startup (Budget)	7B	1× RTX 4090	1	1	Prototyping
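In every row above, the GPU count equals TP × DP: each replica spans TP GPUs and there are DP replicas. The check below encodes the table as data to make that relationship explicit; the tuple layout and function name are illustrative and not part of the simulator.

```python
# (preset, model, gpus, tp, dp) copied from the table above.
PRESETS = [
    ("Anthropic (Claude Code)", "70B", 32, 4, 8),
    ("OpenAI (ChatGPT)", "70B", 32, 2, 16),
    ("Moonshot (Kimi)", "70B", 16, 8, 2),
    ("DeepSeek (Coder)", "34B", 8, 2, 4),
    ("Startup (Budget)", "7B", 1, 1, 1),
]


def inconsistent_presets(presets):
    """Names of presets whose GPU count is not TP x DP."""
    return [name for name, _model, gpus, tp, dp in presets if tp * dp != gpus]


print(inconsistent_presets(PRESETS))  # an empty list means every row is consistent
```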

Key Formulas

Decode tokens/sec (single request; memory_bandwidth in bytes/sec, FP16 weights at 2 bytes per parameter):

tokens/sec = memory_bandwidth / (2 × model_parameters)
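A worked example of the decode formula follows. The bandwidth figures are approximate assumptions for the named GPUs, and the function name is hypothetical. The result is a ceiling for a single request, since it ignores KV cache reads and kernel overhead.

```python
def decode_tokens_per_sec(bandwidth_bytes_per_sec: float, params: float,
                          bytes_per_param: int = 2) -> float:
    """Upper bound on decode speed: every token requires reading all weights once."""
    return bandwidth_bytes_per_sec / (bytes_per_param * params)


# Assumed bandwidths: about 1.008 TB/s (RTX 4090) and about 3.35 TB/s (H100).
print(round(decode_tokens_per_sec(1.008e12, 7e9), 1))   # 7B model on an RTX 4090
print(round(decode_tokens_per_sec(3.35e12, 34e9), 1))   # 34B model on an H100
```

With these inputs the ceilings come out near 72 and 49 tokens/sec respectively.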

Prefill tokens/sec (about 2 FLOPs per parameter per token):

tokens/sec = FLOPS / (2 × model_parameters)
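The prefill formula also gives a rough estimate of how long a prompt takes to process before the first output token. The sketch below assumes an approximate dense BF16 peak of 989 TFLOP/s for an H100 and perfect utilization, so real prefill will be slower; the function names are hypothetical.

```python
def prefill_tokens_per_sec(flops: float, params: float) -> float:
    """Upper bound on prefill speed at about 2 FLOPs per parameter per token."""
    return flops / (2 * params)


def prefill_seconds(prompt_tokens: int, flops: float, params: float) -> float:
    """Ideal time to process a prompt of the given length."""
    return prompt_tokens / prefill_tokens_per_sec(flops, params)


H100_FLOPS = 989e12  # assumed dense BF16 peak, FLOP/s

print(round(prefill_tokens_per_sec(H100_FLOPS, 34e9)))       # tokens/sec, 34B model
print(round(prefill_seconds(2048, H100_FLOPS, 34e9), 3))     # seconds for 2048 tokens
```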

KV Cache per request (leading 2 covers keys and values; FP16 at 2 bytes per value):

KV_cache = 2 × layers × hidden_dim × seq_length × 2 bytes
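As a worked example of the KV cache formula, the sketch below uses a hypothetical 70B-class shape (80 layers, hidden size 8192) and a 4096-token sequence; these values are assumptions chosen for illustration. The formula as written assumes standard multi-head attention, so models that share key/value heads across query heads would need less.

```python
def kv_cache_bytes(layers: int, hidden_dim: int, seq_length: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one request: keys and values for every layer and token."""
    return 2 * layers * hidden_dim * seq_length * bytes_per_value


size = kv_cache_bytes(layers=80, hidden_dim=8192, seq_length=4096)
print(size / 2**30, "GiB")
```

Under these assumptions one 4096-token request takes 10 GiB, which shows why memory capacity limits concurrency well before compute does.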

License

MIT
