LLM Inference Profiler

A production-quality framework for profiling LLM inference latency, throughput, and memory usage.

Dashboard Preview

Directory Structure

llm-inference-profiler/
├── benchmark/
│   ├── run_inference.py     # Main entry point for benchmarking
│   ├── benchmark_latency.py # Latency measurement logic
│   ├── benchmark_throughput.py # Throughput measurement logic
│   ├── configs.yaml         # Default configuration
│   └── utils.py             # Utilities and model management
├── profiling/
│   └── torch_profiler.py    # GPU profiling implementation
├── scripts/                 # Convenience scripts
│   ├── run_fp32.sh
│   ├── run_fp16.sh
│   └── run_nsight.sh
├── results/                 # Output directory for CSVs
└── plots/                   # Plotting scripts
    └── plot_results.py
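
profiling/torch_profiler.py wraps PyTorch's built-in profiler. A minimal sketch of that pattern, assuming a generic model and pre-built inputs (the function name and trace path below are illustrative, not the repository's actual API):

import torch
from torch.profiler import profile, ProfilerActivity

def profile_forward(model, inputs, trace_path="results/trace.json"):
    # Record CPU and CUDA activity for a single forward pass.
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
        profile_memory=True,
    ) as prof:
        with torch.no_grad():
            model(**inputs)
    # Print the ops that dominate GPU time and export a Chrome trace.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    prof.export_chrome_trace(trace_path)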

Setup

  1. Install dependencies:
    pip install -r requirements.txt

Usage

Run Benchmark

Run with default configuration (GPT-2, FP16 & FP32, various batch sizes):

python benchmark/run_inference.py

Override parameters:

python benchmark/run_inference.py --model gpt2 --precision fp16 --batch-sizes 1,8 --seq-lens 128
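
A rough sketch of how such a sweep over batch sizes and sequence lengths can be driven with Hugging Face transformers (names and structure here are illustrative, not the script's actual internals):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                     # overridable via --model
batch_sizes, seq_lens = [1, 8], [128]   # overridable via --batch-sizes / --seq-lens

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda").eval()

for bs in batch_sizes:
    for seq_len in seq_lens:
        # Synthetic input ids of the requested shape, kept on the GPU.
        input_ids = torch.randint(0, tokenizer.vocab_size, (bs, seq_len), device="cuda")
        with torch.no_grad():
            model(input_ids)  # measured by the latency/throughput helpers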

Run Using Scripts

./scripts/run_fp16.sh
./scripts/run_fp32.sh

Plot Results

After running benchmarks, generate plots in plots/:

python plots/plot_results.py
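
A minimal sketch of the kind of plot plot_results.py can produce, assuming the results CSV contains columns such as precision, batch_size, and latency_ms (the file name and column names are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("results/benchmark_results.csv")
for precision, group in df.groupby("precision"):
    group = group.sort_values("batch_size")
    plt.plot(group["batch_size"], group["latency_ms"], marker="o", label=precision)

plt.xlabel("Batch size")
plt.ylabel("Latency (ms)")
plt.legend()
plt.savefig("plots/latency_vs_batch_size.png")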

Metrics

  • Latency: End-to-end inference time (ms) for processing inputs.
  • Throughput: Tokens processed per second.
  • Memory: Peak GPU memory allocated and reserved.
  • Utilization: Average GPU compute utilization.
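
A minimal sketch of how these metrics can be collected with CUDA events and PyTorch's memory counters (the repository's own helpers may differ):

import torch

def measure(model, input_ids, n_iters=20):
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # Warm-up, then time n_iters forward passes with CUDA events.
    with torch.no_grad():
        model(input_ids)
        torch.cuda.synchronize()
        start.record()
        for _ in range(n_iters):
            model(input_ids)
        end.record()
        torch.cuda.synchronize()

    latency_ms = start.elapsed_time(end) / n_iters              # Latency
    tokens_per_s = input_ids.numel() / (latency_ms / 1000)      # Throughput
    mem_alloc = torch.cuda.max_memory_allocated() / 1e9         # Memory allocated (GB)
    mem_reserved = torch.cuda.max_memory_reserved() / 1e9       # Memory reserved (GB)
    # Utilization can be sampled separately, e.g. via pynvml.nvmlDeviceGetUtilizationRates.
    return latency_ms, tokens_per_s, mem_alloc, mem_reserved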

Trade-offs

  • FP32: Full precision, higher memory usage, baseline throughput.
  • FP16: Mixed precision (via autocast), reduced memory usage, higher throughput on Tensor Cores.
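
FP16 inference with autocast looks roughly like this (a sketch; how the benchmark actually wraps it may differ):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda").eval()
input_ids = torch.randint(0, 50257, (8, 128), device="cuda")  # batch 8, seq len 128

# Weights stay FP32; matmuls inside the context run in FP16 on Tensor Cores.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    outputs = model(input_ids)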

Hugging Face API Support

You can also run this dashboard using the Hugging Face Inference API.

  1. Select "Hugging Face API" in the dashboard sidebar.
  2. Enter your Hugging Face API Token (get one at hf.co/settings/tokens).
  3. Specify the Repository ID of the model you want to test (e.g., mistralai/Mistral-7B-Instruct-v0.2).
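
For reference, a stand-alone sketch of timing a hosted Inference API call with huggingface_hub (the model ID and token are placeholders; the dashboard's own code may differ):

import time
from huggingface_hub import InferenceClient

client = InferenceClient(model="mistralai/Mistral-7B-Instruct-v0.2", token="hf_...")

start = time.perf_counter()
text = client.text_generation("Explain KV caching in one sentence.", max_new_tokens=64)
latency_s = time.perf_counter() - start
print(f"{latency_s:.2f}s", text)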
