A production-quality framework for profiling LLM inference latency, throughput, and memory usage.
```
llm-inference-profiler/
├── benchmark/
│   ├── run_inference.py         # Main entry point for benchmarking
│   ├── benchmark_latency.py     # Latency measurement logic
│   ├── benchmark_throughput.py  # Throughput measurement logic
│   ├── configs.yaml             # Default configuration
│   └── utils.py                 # Utilities and model management
├── profiling/
│   └── torch_profiler.py        # GPU profiling implementation
├── scripts/                     # Convenience scripts
│   ├── run_fp32.sh
│   ├── run_fp16.sh
│   └── run_nsight.sh
├── results/                     # Output directory for CSVs
└── plots/                       # Plotting scripts
    └── plot_results.py
```
- Install dependencies:

```bash
pip install -r requirements.txt
```
Run with default configuration (GPT-2, FP16 & FP32, various batch sizes):
```bash
python benchmark/run_inference.py
```

Override parameters:

```bash
python benchmark/run_inference.py --model gpt2 --precision fp16 --batch-sizes 1,8 --seq-lens 128
```

Or use the convenience scripts:

```bash
./scripts/run_fp16.sh
./scripts/run_fp32.sh
```

After running benchmarks, generate plots in plots/:

```bash
python plots/plot_results.py
```

The profiler reports the following metrics:

- Latency: End-to-end inference time (ms) for processing inputs.
- Throughput: Tokens processed per second.
- Memory: Peak GPU memory allocated and reserved.
- Utilization: Average GPU compute utilization.
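The sketch below shows how these numbers can be collected with CUDA events and PyTorch's memory statistics. It is illustrative only; the function and argument names are placeholders rather than the exact code in benchmark_latency.py.

```python
# Minimal measurement sketch (illustrative; not the exact code in
# benchmark_latency.py). Assumes `model` and `input_ids` are on the GPU.
import torch

def benchmark(model, input_ids, n_warmup=5, n_iters=20):
    with torch.no_grad():
        for _ in range(n_warmup):        # warm-up: one-time setup, autotuning
            model(input_ids)
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n_iters):
            model(input_ids)
        end.record()
        torch.cuda.synchronize()         # wait for all queued kernels to finish
    latency_ms = start.elapsed_time(end) / n_iters
    batch, seq_len = input_ids.shape
    tokens_per_s = batch * seq_len / (latency_ms / 1000)   # throughput
    peak_mem_mb = torch.cuda.max_memory_allocated() / 2**20
    return latency_ms, tokens_per_s, peak_mem_mb
```

The warm-up iterations matter: the first passes include one-time costs (CUDA context setup, kernel autotuning) that would otherwise skew the mean latency.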
Two precision modes are compared:

- FP32: Higher precision, increased memory usage, standard performance.
- FP16: Mixed precision (via `autocast`), reduced memory usage, higher throughput on Tensor Cores.
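As a concrete illustration of the FP16 path (a minimal sketch assuming a CUDA-resident model; the repository's implementation may differ in detail):

```python
# FP16 inference sketch using PyTorch autocast. Assumes `model` and
# `input_ids` already live on a CUDA device.
import torch

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(input_ids)   # eligible ops (matmuls) run in FP16 on Tensor Cores
```

Autocast keeps numerically sensitive ops such as reductions and softmax in FP32, so the memory and throughput gains come mainly from FP16 matrix multiplications.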
You can also run the dashboard using the Hugging Face Inference API:
- Select "Hugging Face API" in the dashboard sidebar.
- Enter your Hugging Face API Token (get one at hf.co/settings/tokens).
- Specify the Repository ID of the model you want to test (e.g., `mistralai/Mistral-7B-Instruct-v0.2`).
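Under the hood, an API-backed generation call looks roughly like the sketch below, using `InferenceClient` from huggingface_hub. The token and prompt are placeholders, and the dashboard's actual call may differ.

```python
# Sketch of a Hugging Face Inference API call via huggingface_hub.
# Token and prompt are placeholders; the dashboard's exact call may differ.
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    token="hf_...",   # your token from hf.co/settings/tokens
)
text = client.text_generation("Explain KV caching in one sentence.", max_new_tokens=64)
print(text)
```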
