An interactive visualization tool for understanding how GPUs process AI inference requests and what constrains throughput at data center scale.
https://iaps-ai.github.io/inference-data-center-simulator/
Or open `index.html` locally in a browser; no server is required.
- Prefill (Compute-Bound): Processing input tokens in parallel. High arithmetic intensity, GPU compute units stay busy.
- Decode (Bandwidth-Bound): Generating output tokens one at a time. Must read all model weights for each token. Memory bandwidth is the bottleneck, not compute.
- Memory Bandwidth: Limits decode speed. Tokens/sec ≈ Bandwidth / (2 × Parameters), where the factor of 2 is bytes per parameter (FP16)
- Compute (FLOPS): Limits prefill speed
- Memory Capacity: Limits concurrent requests via KV cache
- Tensor Parallelism (TP): Split weights across GPUs. Required when the model doesn't fit on one GPU. Adds communication overhead (all-reduce per layer).
- Data Parallelism (DP): Multiple independent replicas. Throughput scales linearly with the number of replicas, with no communication between them.
- GPU Architecture Visualization: Watch data flow between HBM, memory bus, SRAM, and tensor cores
- Data Center View: See racks, nodes, GPUs with NVLink and InfiniBand connections
- Realistic Presets: Configurations modeled after Anthropic (Claude), OpenAI (ChatGPT), Moonshot (Kimi), and more
- Configurable Parameters: GPU specs, model size, TP/DP, batch size, demand profiles
- Real-time Metrics: Throughput, latency, utilization, bottleneck identification
| Preset | Model | GPUs | TP | DP | Use Case |
|---|---|---|---|---|---|
| Anthropic (Claude Code) | 70B | 32× H100 | 4 | 8 | Coding agent |
| OpenAI (ChatGPT) | 70B | 32× H100 | 2 | 16 | Casual chat |
| Moonshot (Kimi) | 70B | 16× H800 | 8 | 2 | Long-context |
| DeepSeek (Coder) | 34B | 8× H100 | 2 | 4 | Coding |
| Startup (Budget) | 7B | 1× RTX 4090 | 1 | 1 | Prototyping |
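
As a quick sanity check on the table, the sketch below verifies that each preset's GPU count equals TP × DP and estimates the weight memory each GPU holds under tensor parallelism. The `PRESETS` list and the `weights_per_gpu_gb` helper are hypothetical names used for illustration, not part of the simulator, and the 2 bytes per parameter figure assumes FP16 weights.

```python
# Illustrative only: rows copied from the preset table.
# Each entry is (name, parameters, total GPUs, TP, DP).
PRESETS = [
    ("Anthropic (Claude Code)", 70e9, 32, 4, 8),
    ("OpenAI (ChatGPT)", 70e9, 32, 2, 16),
    ("Moonshot (Kimi)", 70e9, 16, 8, 2),
    ("DeepSeek (Coder)", 34e9, 8, 2, 4),
    ("Startup (Budget)", 7e9, 1, 1, 1),
]

BYTES_PER_PARAM = 2  # assumption: FP16 weights


def weights_per_gpu_gb(params: float, tp: int) -> float:
    """Weight memory per GPU in GB when weights are split evenly across tp GPUs."""
    return params * BYTES_PER_PARAM / tp / 1e9


for name, params, gpus, tp, dp in PRESETS:
    # Every GPU belongs to exactly one TP group inside one DP replica.
    assert gpus == tp * dp, f"{name}: TP x DP should equal the GPU count"
    print(f"{name}: {weights_per_gpu_gb(params, tp):.1f} GB of weights per GPU")
```

Under these assumptions, a 70B model occupies 35 GB per GPU at TP=4, 70 GB at TP=2, and 17.5 GB at TP=8. Whatever memory remains on each GPU is what is left for the KV cache.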
Decode tokens/sec (single request, assuming 2 bytes per parameter):
tokens/sec = memory_bandwidth / (2 × model_parameters)
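
A minimal sketch of the decode estimate, applied to the Startup (Budget) preset. The function name `decode_tokens_per_sec` is hypothetical, and the roughly 1008 GB/s bandwidth for an RTX 4090 is an assumed spec-sheet value; real throughput will be lower because the formula ignores KV cache reads and kernel overhead.

```python
def decode_tokens_per_sec(memory_bandwidth: float,
                          model_parameters: float,
                          bytes_per_param: float = 2.0) -> float:
    """Upper bound on single-request decode speed.

    Every generated token requires streaming all weights from memory once,
    so the rate is bandwidth divided by the model size in bytes.
    """
    return memory_bandwidth / (bytes_per_param * model_parameters)


# Assumed values: ~1008 GB/s of memory bandwidth, 7B parameters in FP16.
rate = decode_tokens_per_sec(memory_bandwidth=1.008e12, model_parameters=7e9)
print(f"{rate:.0f} tokens/sec")  # prints "72 tokens/sec"
```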
Prefill tokens/sec (assuming 2 FLOPs per parameter per token):
tokens/sec = FLOPS / (2 × model_parameters)
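
The prefill estimate can be sketched the same way. The names `prefill_tokens_per_sec` and `prefill_seconds` are hypothetical, and the 1 PFLOP/s figure is an assumed round number on the order of an H100's dense FP16 peak; sustained throughput in practice is below peak.

```python
def prefill_tokens_per_sec(flops: float, model_parameters: float) -> float:
    """Upper bound on prefill speed.

    Each token costs about 2 FLOPs per parameter (one multiply, one add).
    """
    return flops / (2 * model_parameters)


def prefill_seconds(prompt_tokens: int, flops: float, model_parameters: float) -> float:
    """Time to process a prompt of prompt_tokens tokens at the rate above."""
    return prompt_tokens / prefill_tokens_per_sec(flops, model_parameters)


# Assumed values: 1e15 FLOP/s of usable compute, 70B parameters.
rate = prefill_tokens_per_sec(flops=1e15, model_parameters=70e9)
latency = prefill_seconds(prompt_tokens=2000, flops=1e15, model_parameters=70e9)
print(f"{rate:.0f} tokens/sec, {latency:.2f} s for a 2000-token prompt")
```

Comparing this with the decode estimate shows the asymmetry between the two phases: prefill handles thousands of tokens per second, while decode for a single request handles tens.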
KV cache per request (keys and values, 2 bytes per element):
KV_cache = 2 × layers × hidden_dim × seq_length × 2 bytes
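
A sketch of how the KV cache formula turns into a limit on concurrent requests. The function names are hypothetical, and the model shape (80 layers, hidden dimension 8192) is an assumed shape for a 70B-class model. The formula as written assumes every layer stores full-width keys and values, and the capacity estimate ignores activations and memory fragmentation.

```python
def kv_cache_bytes(layers: int, hidden_dim: int, seq_length: int,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size for one request: keys and values (factor of 2) per layer."""
    return 2 * layers * hidden_dim * seq_length * bytes_per_elem


def max_concurrent_requests(total_memory_bytes: float, weight_bytes: float,
                            kv_bytes_per_request: int) -> int:
    """Requests that fit in the memory left over after loading the weights."""
    free = total_memory_bytes - weight_bytes
    return max(0, int(free // kv_bytes_per_request))


# Assumed shape for a 70B-class model with a 4096-token context.
kv = kv_cache_bytes(layers=80, hidden_dim=8192, seq_length=4096)
print(f"{kv / 2**30:.0f} GiB per request")  # prints "10 GiB per request"

# Assumed deployment: one TP=4 group of 80 GB GPUs holding 140 GB of FP16 weights.
slots = max_concurrent_requests(total_memory_bytes=4 * 80e9,
                                weight_bytes=140e9,
                                kv_bytes_per_request=kv)
print(f"{slots} concurrent requests")  # prints "16 concurrent requests"
```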
MIT