IAPS-AI/inference-data-center-simulator

AI Inference Data Center Simulator

An interactive visualization tool for understanding how GPUs process AI inference requests and what constrains throughput at data center scale.

Live Demo

https://iaps-ai.github.io/inference-data-center-simulator/

Or open index.html locally in a browser - no server required.

What This Teaches

The Two Phases of Inference

  1. Prefill (Compute-Bound): Processing input tokens in parallel. High arithmetic intensity, GPU compute units stay busy.

  2. Decode (Bandwidth-Bound): Generating output tokens one at a time. Must read all model weights for each token. Memory bandwidth is the bottleneck, not compute.
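One way to see why the two phases behave so differently is to compare each phase's arithmetic intensity (FLOPs per byte of weights read) with the GPU's ratio of compute to memory bandwidth. The sketch below is illustrative only and is not taken from the simulator. It assumes FP16 weights (2 bytes per parameter), roughly 2 FLOPs per parameter per token, and approximate H100 SXM figures (989 TFLOPS dense FP16, 3.35 TB/s). It ignores KV cache reads and attention FLOPs, and the function names are hypothetical.

```python
# Assumed, approximate H100 SXM figures; not taken from the simulator.
PEAK_FLOPS = 989e12        # dense FP16 FLOP/s
MEM_BANDWIDTH = 3.35e12    # HBM bytes/s

FLOPS_PER_PARAM_PER_TOKEN = 2   # one multiply and one add per weight
BYTES_PER_PARAM = 2             # FP16 weights


def arithmetic_intensity(tokens_per_weight_pass: int) -> float:
    """FLOPs performed per byte of weights read from memory.

    The weights are read once per forward pass and reused for every
    token processed in that pass.
    """
    return tokens_per_weight_pass * FLOPS_PER_PARAM_PER_TOKEN / BYTES_PER_PARAM


def bottleneck(tokens_per_weight_pass: int) -> str:
    """Classify a pass as compute-bound or bandwidth-bound (roofline model)."""
    ridge = PEAK_FLOPS / MEM_BANDWIDTH  # FLOPs/byte where the two limits meet
    if arithmetic_intensity(tokens_per_weight_pass) >= ridge:
        return "compute-bound"
    return "bandwidth-bound"


print(bottleneck(2048))  # prefill over a 2048-token prompt
print(bottleneck(1))     # decode, one token per pass
```

Under these assumptions the crossover sits at a few hundred tokens per weight pass, so a long prompt lands on the compute side while single-token decode lands on the bandwidth side.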

Key Constraints

  • Memory Bandwidth: Limits decode speed. For a single request with FP16 weights, Tokens/sec ≈ Bandwidth (bytes/sec) / (2 bytes × Parameters)
  • Compute (FLOPS): Limits prefill speed
  • Memory Capacity: Limits concurrent requests via KV cache
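The memory capacity constraint can be sketched as a simple budget: whatever memory remains after loading the weights is divided among per-request KV caches. This is a rough estimate under stated assumptions, not the simulator's implementation: FP16 weights, no allowance for activations or framework overhead, and a hypothetical function name. The per-request KV cache size is passed in as an input.

```python
def max_concurrent_requests(gpu_memory_bytes: float,
                            num_gpus: int,
                            model_params: float,
                            kv_bytes_per_request: float,
                            bytes_per_param: int = 2) -> int:
    """Requests whose KV cache fits in the memory left after loading weights."""
    total_memory = gpu_memory_bytes * num_gpus
    weight_bytes = model_params * bytes_per_param
    free_bytes = total_memory - weight_bytes
    if free_bytes <= 0:
        return 0  # the weights alone do not fit
    return int(free_bytes // kv_bytes_per_request)


# Assumed example: 70B FP16 model sharded over 4 GPUs with 80 GB each,
# and 10 GB of KV cache per request.
print(max_concurrent_requests(80e9, 4, 70e9, 10e9))
```

With these example numbers, 320 GB of total memory minus 140 GB of weights leaves 180 GB, enough for 18 requests. Longer sequences raise the per-request cost and shrink that count.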

Multi-GPU Parallelism

  • Tensor Parallelism (TP): Split weights across GPUs. Required when the model doesn't fit on one GPU. Adds communication overhead (all-reduce per layer).
  • Data Parallelism (DP): Multiple independent replicas. Linear throughput scaling, no communication overhead.
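The relationship between the two schemes can be sketched as follows: TP is chosen so that each GPU's weight shard fits in memory, and the remaining GPUs are grouped into DP replicas, so total GPUs = TP × DP. The helper names are hypothetical, and restricting TP to powers of two is an assumption for illustration, not something the simulator is known to enforce.

```python
def min_tensor_parallel(model_params: float,
                        gpu_memory_bytes: float,
                        bytes_per_param: int = 2) -> int:
    """Smallest power-of-two TP degree whose weight shard fits on one GPU.

    Considers weights only; KV cache and activations need extra headroom.
    """
    weight_bytes = model_params * bytes_per_param
    tp = 1
    while weight_bytes / tp > gpu_memory_bytes:
        tp *= 2
    return tp


def replicas(total_gpus: int, tp: int) -> int:
    """Number of data-parallel replicas for a given TP degree."""
    if total_gpus % tp != 0:
        raise ValueError("total_gpus must be a multiple of tp")
    return total_gpus // tp


tp = min_tensor_parallel(70e9, 80e9)   # 70B FP16 model on 80 GB GPUs
print(tp, replicas(32, tp))
```

A 70B FP16 model needs at least TP = 2 on 80 GB GPUs just to hold the weights. A higher TP degree leaves more room per GPU for KV cache, at the cost of fewer replicas and more all-reduce traffic.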

Features

  • GPU Architecture Visualization: Watch data flow between HBM, memory bus, SRAM, and tensor cores
  • Data Center View: See racks, nodes, GPUs with NVLink and InfiniBand connections
  • Realistic Presets: Configurations modeled after Anthropic (Claude), OpenAI (ChatGPT), Moonshot (Kimi), and more
  • Configurable Parameters: GPU specs, model size, TP/DP, batch size, demand profiles
  • Real-time Metrics: Throughput, latency, utilization, bottleneck identification

Presets

Preset                  | Model | GPUs        | TP | DP | Use Case
Anthropic (Claude Code) | 70B   | 32× H100    | 4  | 8  | Coding agent
OpenAI (ChatGPT)        | 70B   | 32× H100    | 2  | 16 | Casual chat
Moonshot (Kimi)         | 70B   | 16× H800    | 8  | 2  | Long-context
DeepSeek (Coder)        | 34B   | 8× H100     | 2  | 4  | Coding
Startup (Budget)        | 7B    | 1× RTX 4090 | 1  | 1  | Prototyping

Key Formulas

Decode tokens/sec (single request):

tokens/sec = memory_bandwidth / (2 × model_parameters)

Here memory_bandwidth is in bytes/sec and 2 is the number of bytes per parameter for FP16 weights.
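As a worked example of the decode formula, the sketch below plugs in figures resembling the Startup (Budget) preset: a 7B model on one RTX 4090. The bandwidth value (about 1,008 GB/s) is an assumed spec-sheet number, not one read from the simulator, and the function name is hypothetical.

```python
def decode_tokens_per_sec(memory_bandwidth_bytes_per_sec: float,
                          model_parameters: float,
                          bytes_per_param: int = 2) -> float:
    """Upper bound on single-request decode speed.

    Every generated token requires reading all weights once, so the
    rate is bandwidth divided by the size of the weights in bytes.
    """
    return memory_bandwidth_bytes_per_sec / (bytes_per_param * model_parameters)


# Assumed: ~1.008 TB/s memory bandwidth, 7B parameters, FP16 weights.
decode_rate = decode_tokens_per_sec(1.008e12, 7e9)
print(f"{decode_rate:.1f} tokens/sec")
```

This gives roughly 72 tokens/sec as a ceiling. Real decode speed is lower because KV cache reads and kernel overheads also consume bandwidth.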

Prefill tokens/sec:

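tokens/sec = FLOPS / (2 × model_parameters)

Here FLOPS is the GPU's peak floating-point operations per second and 2 is the approximate number of FLOPs per parameter per token (one multiply and one add).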
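The prefill formula can be exercised the same way. The sketch below assumes an approximate H100 SXM peak of 989 TFLOPS (dense FP16) and a 7B model. Both are illustrative inputs rather than values from the simulator, the function name is hypothetical, and real hardware sustains only a fraction of peak.

```python
def prefill_tokens_per_sec(peak_flops: float,
                           model_parameters: float,
                           flops_per_param: int = 2) -> float:
    """Upper bound on prefill speed when compute is the only limit."""
    return peak_flops / (flops_per_param * model_parameters)


# Assumed: 989 TFLOPS dense FP16 peak, 7B parameters.
prefill_rate = prefill_tokens_per_sec(989e12, 7e9)
print(f"{prefill_rate:,.0f} tokens/sec")

# Time to process a 2048-token prompt at that peak rate.
prompt_seconds = 2048 / prefill_rate
print(f"{prompt_seconds * 1000:.1f} ms")
```

The result is a theoretical ceiling in the tens of thousands of tokens per second, far above the decode ceiling for the same model.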

KV Cache per request:

KV_cache = 2 × layers × hidden_dim × seq_length × 2 bytes

The leading 2 covers the key and value tensors, and the trailing 2 bytes is the FP16 element size.
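To get a feel for the scale, the KV cache formula can be evaluated for dimensions typical of a 70B-class dense model. The values below (80 layers, hidden size 8192, 4096-token sequence) are assumptions for illustration, and the function name is hypothetical. The formula stores full-width keys and values, so models that use grouped-query attention would need less.

```python
def kv_cache_bytes(layers: int,
                   hidden_dim: int,
                   seq_length: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size for one request: keys and values for every layer and token."""
    return 2 * layers * hidden_dim * seq_length * bytes_per_value


# Assumed dimensions for a 70B-class model with a 4096-token sequence.
kv_size = kv_cache_bytes(layers=80, hidden_dim=8192, seq_length=4096)
print(f"{kv_size / 2**30:.1f} GiB per request")
```

That comes to 10 GiB per request under these assumptions, which is why long contexts quickly cap the number of concurrent requests.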

License

MIT
