The most complete open-source LLM evaluation suite.
Measure accuracy, latency, cost, hallucination, and reasoning quality across any LLM — side by side.
| 🌐 Live Docs | 🤗 HuggingFace | 🐙 GitHub | ⭐ Star the Repo |
|---|
- ✨ Why This Framework?
- 🎯 Key Features
- 🏗 Architecture
- 🚀 Quick Start
- 📦 Installation
- 🔑 API Keys Setup
- 💻 CLI Reference
- 🐍 Python API
- 🌐 REST API Reference
- 📊 Streamlit Dashboard
- 📏 Evaluation Metrics
- 🏆 Supported Benchmarks
- 🤖 Supported Models & Pricing
- 🗄 Database & Storage
- 📄 PDF Report Generation
- 🐳 Docker Deployment
- 🧪 Testing
- 🤗 HuggingFace Dataset
- 📁 Project Structure
- 🔧 Configuration Reference
- 🤝 Contributing
- 📜 License
- ⭐ Star History
"You can't improve what you can't measure." — Peter Drucker
The LLM landscape is evolving at breakneck speed. New models appear every week, each claiming to be state-of-the-art. But how do you actually know which model is best for your use case?
Most existing benchmarking tools:
- ❌ Evaluate only a single model at a time
- ❌ Ignore latency and real-world cost
- ❌ Don't detect hallucinations
- ❌ Require complex setup
- ❌ Lack a usable dashboard
This framework solves all of that. It's the only open-source tool that evaluates GPT-4, Claude, Gemini, Mistral, and Llama side by side — on the same prompts — with 5 production-relevant metrics, a beautiful Streamlit dashboard, a REST API, and a CLI.
|
|
|
|
┌─────────────────────────────────────────────────────────────────────────┐
│ LLM EVALUATION FRAMEWORK │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌───────────────┐ ┌─────────────┐ │
│ │ Click │ │ FastAPI │ │ Streamlit │ │ ReportLab │ │
│ │ CLI │ │ REST API │ │ Dashboard │ │PDF Generator│ │
│ │ 7 cmds │ │ 12 endpoints │ │ 5 pages │ │ │ │
│ └────┬─────┘ └──────┬───────┘ └───────┬───────┘ └──────┬──────┘ │
│ └───────────────┴──────────────────┴─────────────────┘ │
│ │ │
│ ┌───────────▼──────────┐ │
│ │ Core Evaluator │ │
│ │ (Async Engine) │ │
│ │ asyncio.Semaphore │ │
│ │ configurable timeout │ │
│ │ progress callbacks │ │
│ └───────────┬──────────┘ │
│ │ │
│ ┌────────────────────┬──────┴──────┬────────────────────┐ │
│ │ │ │ │ │
│ ┌─────▼──────┐ ┌────────▼──────┐ ┌──▼──────────┐ ┌──────▼───┐ │
│ │ Metrics │ │ Benchmarks │ │ Database │ │ LiteLLM │ │
│ │ │ │ │ │ (SQLite) │ │ │ │
│ │ accuracy │ │ MMLU (14K) │ │ │ │ OpenAI │ │
│ │ hallucin. │ │ TruthfulQA │ │ save_result │ │ Anthropic│ │
│ │ latency │ │ Custom CSV │ │ list_results│ │ Google │ │
│ │ cost │ │ Custom JSON │ │ export_csv │ │ Mistral │ │
│ │ reasoning │ │ HF Hub cache │ │ export_json │ │ Together │ │
│ └────────────┘ └───────────────┘ └─────────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
User Request → CLI/API/Dashboard
↓
EvaluationConfig (model, benchmark, num_samples, temperature, concurrency)
↓
LLMEvaluator.evaluate(config, samples)
↓
asyncio.Semaphore(concurrency) ← controls parallelism
↓
For each sample (parallel):
litellm.acompletion() → response
AccuracyMetric.score(response, expected)
HallucinationMetric.score(prompt, response)
HallucinationMetric.reasoning_quality(response)
LatencyMetric — wall-clock time
CostMetric.calculate(model, input_tokens, output_tokens)
↓
_aggregate(samples) → EvaluationResult (accuracy, latency stats, cost, etc.)
↓
Database.save_result() → SQLite
↓
Return EvaluationResult to caller
# 1. Install
pip install llm-evaluation-framework
# 2. Set your API key
export OPENAI_API_KEY="sk-..."
# 3. Run your first evaluation
llm-eval run --model gpt-4o-mini --benchmark mmlu --samples 50Expected output:
╭──────────────────────────────────────╮
│ Evaluation: gpt-4o-mini │
├──────────────────┬───────────────────┤
│ Accuracy │ 78.00% │
│ Avg Latency │ 432 ms │
│ P95 Latency │ 1240 ms │
│ Total Cost │ $0.0012 │
│ Cost / 1K Tokens │ $0.0015 │
│ Hallucination │ 2.40% │
│ Reasoning Score │ 7.2 / 10 │
│ Samples │ 50 │
│ Run ID │ a3f92c1b │
╰──────────────────┴───────────────────╯
pip install llm-evaluation-framework# Dashboard (Streamlit + Plotly + Pandas)
pip install "llm-evaluation-framework[dashboard]"
# PDF Reports (ReportLab)
pip install "llm-evaluation-framework[reports]"
# Development (pytest, mypy, ruff)
pip install "llm-evaluation-framework[dev]"
# Everything
pip install "llm-evaluation-framework[dashboard,reports,dev]"git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/Mac
# .venv\Scripts\activate # Windows
# Install with all extras
pip install -e ".[dashboard,reports,dev]"git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework
cp .env.example .env # fill in your API keys
docker-compose up -d
# API: http://localhost:8000/docs
# Dashboard: http://localhost:8501| Dependency | Version | Purpose |
|---|---|---|
| Python | ≥ 3.10 | Core runtime |
| litellm | 1.52.x | Unified LLM API |
| fastapi | 0.115.x | REST API |
| uvicorn | 0.32.x | ASGI server |
| streamlit | 1.40.x | Dashboard |
| plotly | 5.24.x | Charts |
| pandas | 2.2.x | Data handling |
| click | 8.1.x | CLI |
| rich | 13.9.x | Terminal UI |
| datasets | 3.2.x | HF Hub loader |
| reportlab | 4.2.x | PDF reports |
| pydantic | 2.10.x | Data validation |
Copy .env.example to .env and fill in your keys:
cp .env.example .env# .env
# ── OpenAI ──────────────────────────────────
OPENAI_API_KEY=sk-...
# ── Anthropic (Claude) ───────────────────────
ANTHROPIC_API_KEY=sk-ant-...
# ── Google (Gemini) ──────────────────────────
GEMINI_API_KEY=AI...
GOOGLE_API_KEY=AI...
# ── Mistral ──────────────────────────────────
MISTRAL_API_KEY=...
# ── Together AI (Llama) ──────────────────────
TOGETHERAI_API_KEY=...
# ── HuggingFace (for dataset loading) ────────
HUGGINGFACE_TOKEN=hf_...
# ── App Settings ─────────────────────────────
LLM_EVAL_DB_PATH=llm_eval.db
LLM_EVAL_CACHE_DIR=~/.cache/llm_eval
PORT=8000
DASHBOARD_PORT=8501Note: You only need the keys for the models you want to evaluate. The framework works with any subset of providers.
The CLI is installed as llm-eval and provides 7 subcommands:
llm-eval run [OPTIONS]
Options:
-m, --model TEXT LiteLLM model name [required]
-b, --benchmark TEXT mmlu | truthfulqa | custom [default: mmlu]
-n, --samples INTEGER Number of samples [default: 20]
--temperature FLOAT Sampling temperature [default: 0.0]
--max-tokens INTEGER Max output tokens [default: 512]
--concurrency INTEGER Parallel API calls [default: 5]
-o, --output PATH Save JSON result to file
-v, --verbose Enable debug loggingExamples:
# Quick 20-sample smoke test
llm-eval run --model gpt-4o-mini --benchmark mmlu
# Full 100-sample MMLU evaluation
llm-eval run --model gpt-4o --benchmark mmlu --samples 100 --concurrency 10
# TruthfulQA evaluation
llm-eval run --model claude-3-5-haiku-20241022 --benchmark truthfulqa --samples 50
# Save output to JSON
llm-eval run --model gpt-4o-mini --benchmark mmlu --samples 50 -o result.json
# Use a custom temperature
llm-eval run --model gpt-3.5-turbo --benchmark mmlu --temperature 0.3llm-eval compare [OPTIONS]
Options:
-m, --models TEXT Model names (repeat flag) [required, min 2]
-b, --benchmark TEXT Benchmark name [default: mmlu]
-n, --samples INTEGER Samples per model [default: 20]
-o, --output PATH Save JSON results to fileExamples:
# Compare 3 providers
llm-eval compare \
--models gpt-4o-mini \
--models claude-3-haiku-20240307 \
--models gemini/gemini-1.5-flash \
--benchmark mmlu --samples 50
# Compare on TruthfulQA
llm-eval compare \
--models gpt-4o \
--models claude-3-5-sonnet-20241022 \
--benchmark truthfulqa --samples 100
# Save comparison results
llm-eval compare --models gpt-4o-mini --models claude-3-haiku-20240307 \
--benchmark mmlu --output comparison.jsonllm-eval results [OPTIONS]
Options:
--model TEXT Filter by model name
--benchmark TEXT Filter by benchmark
--limit INTEGER Max results to show [default: 20]Examples:
# Show all results
llm-eval results
# Filter by model
llm-eval results --model gpt-4o-mini
# Filter by benchmark
llm-eval results --benchmark mmlu --limit 50llm-eval export --format [csv|json] --output OUTPUT_PATH# Export to CSV
llm-eval export --format csv --output results.csv
# Export to JSON
llm-eval export --format json --output results.jsonllm-eval report --run-ids RUN_ID [--run-ids RUN_ID ...] --output OUTPUT_DIR# Single run
llm-eval report --run-ids a3f92c1b --output ./reports/
# Multiple runs in one report
llm-eval report --run-ids a3f92c1b --run-ids b4d03e2c --output ./reports/llm-eval serve [--host HOST] [--port PORT] [--reload]llm-eval serve --port 8000 --reload
# → http://localhost:8000/docsllm-eval dashboard [--port PORT]llm-eval dashboard --port 8501
# → http://localhost:8501import asyncio
from llm_eval.core.evaluator import LLMEvaluator, EvaluationConfig
from llm_eval.benchmarks.mmlu import MMLUBenchmark
async def main():
evaluator = LLMEvaluator() # uses default db path
samples = MMLUBenchmark().load(num_samples=100) # loads from HF Hub or cache
config = EvaluationConfig(
model="gpt-4o-mini",
benchmark="mmlu",
num_samples=100,
temperature=0.0,
max_tokens=512,
concurrency=10, # parallel API calls
timeout=30.0, # seconds per request
)
result = await evaluator.evaluate(config, samples)
print(f"Accuracy: {result.accuracy:.2%}")
print(f"Avg Latency: {result.avg_latency_ms:.0f} ms")
print(f"P95 Latency: {result.p95_latency_ms:.0f} ms")
print(f"P99 Latency: {result.p99_latency_ms:.0f} ms")
print(f"Total Cost: ${result.total_cost_usd:.4f}")
print(f"Cost per 1K: ${result.cost_per_1k_tokens:.4f}")
print(f"Hallucination: {result.hallucination_rate:.2%}")
print(f"Reasoning Score: {result.avg_reasoning_score:.1f} / 10")
print(f"Run ID: {result.run_id}")
asyncio.run(main())import asyncio
from llm_eval.core.evaluator import LLMEvaluator, EvaluationConfig
from llm_eval.benchmarks.mmlu import MMLUBenchmark
async def compare():
evaluator = LLMEvaluator()
# Load samples ONCE — all models see the same prompts
samples = MMLUBenchmark().load(num_samples=50)
configs = [
EvaluationConfig(model="gpt-4o", benchmark="mmlu", num_samples=50),
EvaluationConfig(model="gpt-4o-mini", benchmark="mmlu", num_samples=50),
EvaluationConfig(model="claude-3-5-sonnet-20241022", benchmark="mmlu", num_samples=50),
EvaluationConfig(model="claude-3-haiku-20240307", benchmark="mmlu", num_samples=50),
EvaluationConfig(model="gemini/gemini-1.5-flash", benchmark="mmlu", num_samples=50),
]
# All 5 evaluations run in parallel
results = await evaluator.evaluate_multiple(configs, samples)
# Print ranked leaderboard
header = f"{'Rank':<5} {'Model':<40} {'Acc':>8} {'Lat':>8} {'Cost/1K':>10} {'Score':>8}"
print(header)
print("─" * len(header))
for i, r in enumerate(sorted(results, key=lambda x: x.accuracy, reverse=True), 1):
print(
f"{i:<5} {r.model:<40} "
f"{r.accuracy:>7.1%} "
f"{r.avg_latency_ms:>6.0f}ms "
f"${r.cost_per_1k_tokens:>8.4f} "
f"{r.avg_reasoning_score:>6.1f}/10"
)
asyncio.run(compare())import asyncio
from llm_eval.core.evaluator import LLMEvaluator, EvaluationConfig
from llm_eval.benchmarks.custom import CustomBenchmark
async def custom_eval():
# From a CSV file
bench = CustomBenchmark.from_file("my_benchmark.csv")
# Or from a string
csv_content = """prompt,expected
"What is the capital of Python packaging?","PyPI"
"What does ACID stand for in databases?","Atomicity Consistency Isolation Durability"
"""
bench = CustomBenchmark.from_string(csv_content, format="csv")
evaluator = LLMEvaluator()
config = EvaluationConfig(model="gpt-4o-mini", benchmark="custom", num_samples=50)
result = await evaluator.evaluate(config, bench.load(50))
print(f"Custom eval accuracy: {result.accuracy:.2%}")
asyncio.run(custom_eval())import asyncio
from llm_eval.core.evaluator import LLMEvaluator, EvaluationConfig
from llm_eval.benchmarks.mmlu import MMLUBenchmark
async def with_progress():
evaluator = LLMEvaluator()
samples = MMLUBenchmark().load(100)
config = EvaluationConfig(model="gpt-4o-mini", benchmark="mmlu", num_samples=100)
async def on_progress(done: int, total: int):
pct = done / total * 100
bar = "█" * int(pct / 5) + "░" * (20 - int(pct / 5))
print(f"\r[{bar}] {done}/{total} ({pct:.0f}%)", end="", flush=True)
result = await evaluator.evaluate(config, samples, progress_callback=on_progress)
print(f"\n✅ Done! Accuracy: {result.accuracy:.1%}")
asyncio.run(with_progress())from llm_eval.database.models import Database
db = Database() # or Database("custom_path.db")
# List all results
records = db.list_results(limit=50)
# Filter by model
records = db.list_results(model="gpt-4o-mini", limit=20)
# Filter by benchmark
records = db.list_results(benchmark="mmlu", limit=100)
# Get a specific run
record = db.get_result("a3f92c1b")
print(record.accuracy, record.avg_latency_ms)
# Compare models on a benchmark (latest run per model)
comparison = db.get_model_comparison("mmlu")
for row in comparison:
print(row["model"], row["accuracy"])
# Export
db.export_csv("all_results.csv")
db.export_json("all_results.json")
# Delete a run
db.delete_result("a3f92c1b")Start the API server:
uvicorn llm_eval.api.main:app --reload --port 8000
# Interactive docs: http://localhost:8000/docs
# ReDoc: http://localhost:8000/redoc| Method | Path | Description |
|---|---|---|
POST |
/evaluate |
Evaluate a model on a benchmark |
POST |
/compare |
Compare multiple models side-by-side |
POST |
/evaluate/custom |
Upload CSV/JSON for custom evaluation |
GET |
/results |
List stored results (filterable) |
GET |
/results/{run_id} |
Get a specific run result |
DELETE |
/results/{run_id} |
Delete a stored result |
GET |
/export/csv |
Download all results as CSV |
GET |
/export/json |
Download all results as JSON |
POST |
/report |
Generate and download PDF report |
GET |
/models |
List all models and pricing |
GET |
/benchmarks |
List available benchmarks |
GET |
/health |
Health check ({"status":"ok"}) |
// Request
{
"model": "gpt-4o-mini",
"benchmark": "mmlu",
"num_samples": 50,
"temperature": 0.0,
"max_tokens": 512,
"concurrency": 10
}
// Response
{
"run_id": "a3f92c1b",
"model": "gpt-4o-mini",
"benchmark": "mmlu",
"num_samples": 50,
"accuracy": 0.78,
"avg_latency_ms": 432.1,
"p50_latency_ms": 380.0,
"p95_latency_ms": 1100.0,
"p99_latency_ms": 1840.0,
"total_cost_usd": 0.0012,
"cost_per_1k_tokens": 0.0015,
"hallucination_rate": 0.024,
"avg_reasoning_score": 7.2,
"created_at": "2025-01-20T14:32:01"
}// Request
{
"models": ["gpt-4o-mini", "claude-3-haiku-20240307", "gemini/gemini-1.5-flash"],
"benchmark": "mmlu",
"num_samples": 30
}
// Response
{
"results": [
{"model": "gpt-4o-mini", "accuracy": 0.78, ...},
{"model": "claude-3-haiku-20240307", "accuracy": 0.74, ...},
{"model": "gemini/gemini-1.5-flash", "accuracy": 0.76, ...}
]
}curl -X POST http://localhost:8000/evaluate/custom \
-F "file=@my_benchmark.csv" \
-F "model=gpt-4o-mini" \
-F "num_samples=50"Launch the dashboard:
streamlit run llm_eval/dashboard/app.py
# → http://localhost:8501
# Or via CLI
llm-eval dashboard --port 8501| Page | Description |
|---|---|
| 🏠 Dashboard | Overview: total runs, unique models, best accuracy, total spend. Radar chart, cost vs quality scatter, latency histogram. |
| Configure and launch a new evaluation with live progress bar. | |
| ⚖️ Compare Models | Select multiple models, run parallel comparison, see ranked table + charts. |
| 📊 Results | Browse all stored results with filters. Download CSV or JSON. |
| 📄 Reports | Select runs, generate PDF report, download instantly. |
| ℹ️ About | Framework info, links, quick start guide. |
- Radar Chart — 5-axis model comparison (accuracy, speed, cost efficiency, truthfulness, reasoning)
- Latency Histogram — distribution of response times per model
- Cost vs Quality Scatter — bubble chart (bubble size = sample count)
- Accuracy Bar Chart — ranked model comparison
- 3D Scatter — cost vs quality vs latency
Uses a cascade of matching strategies, applied in order:
- Exact match — after lowercasing and stripping punctuation
- Prefix normalization — removes "The answer is", "Answer:", etc.
- Multiple-choice detection — extracts the first A/B/C/D letter from the prediction
- Fuzzy match — Levenshtein ratio ≥ 0.85 (configurable)
from llm_eval.metrics.accuracy import AccuracyMetric
metric = AccuracyMetric(fuzzy_threshold=0.85)
# Multiple choice
metric.score("The answer is A", "A") # True
metric.score("I think B is correct here", "B") # True
# Free-form
metric.score("mitochondria", "mitochondrion") # True (fuzzy)
metric.score("Paris", "Paris") # True (exact)
# Batch scoring
results, accuracy = metric.batch_score(
["A", "B", "A", "D"],
["A", "B", "C", "D"]
)
# results = [True, True, False, True], accuracy = 0.75from llm_eval.metrics.latency import LatencyMetric
metric = LatencyMetric()
latencies = [120, 200, 350, 450, 800, 1200, 2000, 5500] # ms
stats = metric.compute_stats(latencies)
print(f"Mean: {stats.mean_ms:.0f} ms")
print(f"P50: {stats.p50_ms:.0f} ms")
print(f"P95: {stats.p95_ms:.0f} ms")
print(f"P99: {stats.p99_ms:.0f} ms")
# SLA violations
rate = metric.sla_violation_rate(latencies, threshold_ms=5000.0)
print(f"SLA violations: {rate:.1%}") # 12.5%
# Classification
print(metric.classify(200)) # "excellent"
print(metric.classify(1000)) # "good"
print(metric.classify(4000)) # "slow"from llm_eval.metrics.cost import CostMetric
metric = CostMetric()
# Calculate exact cost
cost = metric.calculate("gpt-4o-mini", input_tokens=150, output_tokens=200)
print(f"Per-sample cost: ${cost:.6f}")
# Pre-run estimate
estimate = metric.estimate_run_cost("gpt-4o-mini", num_samples=100)
print(f"Estimated run cost: ${estimate:.4f}")
# All pricing
for name, pricing in metric.get_all_pricing().items():
print(f"{name}: ${pricing.input_per_1m}/M in, ${pricing.output_per_1m}/M out")from llm_eval.metrics.hallucination import HallucinationMetric
metric = HallucinationMetric()
# Hallucination score (0 = grounded, 1 = likely hallucinating)
score = metric.score(
prompt="What is the capital of France?",
response="I believe it was supposedly Paris, though I could be wrong."
)
print(f"Hallucination score: {score:.2f}") # ~0.3 (hedging signals)
# Reasoning quality (1-10)
quality = metric.reasoning_quality(
"First, we analyze the data. Based on the evidence specifically, "
"therefore we can conclude that X. For example, this means..."
)
print(f"Reasoning quality: {quality:.1f}/10") # ~7.0- Size: ~14,000 test questions
- Format: 4-choice multiple choice
- Subjects: 57 academic subjects including:
- STEM: abstract algebra, anatomy, astronomy, biology, chemistry, computer science, physics, mathematics
- Humanities: history, law, philosophy, world religions
- Social sciences: economics, geography, political science, psychology, sociology
- Professional: medical, legal, accounting, marketing, nursing
from llm_eval.benchmarks.mmlu import MMLUBenchmark
bench = MMLUBenchmark(subject="all") # or a specific subject
samples = bench.load(num_samples=100, seed=42)
# [{"prompt": "Question?\nA) ...\nB) ...\nAnswer:", "expected": "A"}, ...]- Size: 817 questions
- Format: 4-choice multiple choice (MC1 format)
- Purpose: Tests whether models give truthful answers; questions are designed to elicit common misconceptions
from llm_eval.benchmarks.truthfulqa import TruthfulQABenchmark
bench = TruthfulQABenchmark()
samples = bench.load(num_samples=100, seed=42)- Format: CSV or JSON
- Required columns:
prompt,expected - Sources: File upload, string, or programmatic
from llm_eval.benchmarks.custom import CustomBenchmark
# From file
bench = CustomBenchmark.from_file("my_data.csv")
# From string
bench = CustomBenchmark.from_string("""
prompt,expected
"What is 2+2?",4
"Capital of Germany?",Berlin
""", format="csv")
print(f"Dataset size: {len(bench)} samples")
samples = bench.load(num_samples=50)| Model | Input (per 1M) | Output (per 1M) | Notes |
|---|---|---|---|
gpt-4o |
$5.00 | $15.00 | Best accuracy |
gpt-4o-mini |
$0.15 | $0.60 | Best value |
gpt-4-turbo |
$10.00 | $30.00 | Legacy |
gpt-3.5-turbo |
$0.50 | $1.50 | Fast, cheap |
o1 |
$15.00 | $60.00 | Reasoning |
o1-mini |
$3.00 | $12.00 | Reasoning, fast |
| Model | Input (per 1M) | Output (per 1M) | Notes |
|---|---|---|---|
claude-3-5-sonnet-20241022 |
$3.00 | $15.00 | Flagship |
claude-3-5-haiku-20241022 |
$0.80 | $4.00 | Fast + cheap |
claude-3-opus-20240229 |
$15.00 | $75.00 | Best reasoning |
claude-3-haiku-20240307 |
$0.25 | $1.25 | Cheapest Claude |
| Model | Input (per 1M) | Output (per 1M) | Notes |
|---|---|---|---|
gemini/gemini-1.5-pro |
$3.50 | $10.50 | 2M context |
gemini/gemini-1.5-flash |
$0.075 | $0.30 | Ultra-cheap |
gemini/gemini-2.0-flash-exp |
$0.00 | $0.00 | Free preview |
| Model | Input (per 1M) | Output (per 1M) | Notes |
|---|---|---|---|
mistral/mistral-large-latest |
$4.00 | $12.00 | Flagship |
mistral/mistral-small-latest |
$1.00 | $3.00 | Balanced |
mistral/open-mistral-7b |
$0.25 | $0.25 | Open weights |
| Model | Input (per 1M) | Output (per 1M) | Notes |
|---|---|---|---|
together_ai/meta-llama/Llama-3-70b-chat-hf |
$0.90 | $0.90 | Powerful open |
together_ai/meta-llama/Llama-3-8b-chat-hf |
$0.20 | $0.20 | Fast open |
Any LiteLLM-compatible provider works:
# Ollama (local)
llm-eval run --model ollama/llama3 --benchmark mmlu --samples 20
# vLLM
llm-eval run --model hosted_vllm/meta-llama/Llama-3-8b --benchmark mmlu
# HuggingFace TGI
llm-eval run --model huggingface/mistralai/Mistral-7B-Instruct-v0.2 --benchmark mmluAll evaluation results are automatically saved to SQLite (llm_eval.db by default):
from llm_eval.database.models import Database
db = Database("llm_eval.db") # default path
# or
db = Database("/data/eval_results.db")
# Query
records = db.list_results(model="gpt-4o-mini", benchmark="mmlu", limit=100)
record = db.get_result("a3f92c1b")
# Compare latest results per model for a benchmark
comparison = db.get_model_comparison("mmlu")
# Export
db.export_csv("results.csv")
db.export_json("results.json")
# Delete
db.delete_result("a3f92c1b")CREATE TABLE evaluations (
id INTEGER PRIMARY KEY AUTOINCREMENT,
run_id TEXT NOT NULL UNIQUE,
model TEXT NOT NULL,
benchmark TEXT NOT NULL,
num_samples INTEGER NOT NULL,
accuracy REAL NOT NULL,
avg_latency_ms REAL NOT NULL,
p95_latency_ms REAL NOT NULL,
total_cost_usd REAL NOT NULL,
cost_per_1k_tokens REAL NOT NULL,
hallucination_rate REAL NOT NULL,
avg_reasoning_score REAL NOT NULL,
created_at TEXT NOT NULL,
metadata TEXT NOT NULL DEFAULT '{}'
);from llm_eval.reports.generator import ReportGenerator
from llm_eval.database.models import Database
db = Database()
records = db.list_results(benchmark="mmlu", limit=5)
gen = ReportGenerator()
pdf_path = gen.generate(records, output_dir="./reports/")
print(f"Report saved to: {pdf_path}")The generated PDF includes:
- Cover page — title, timestamp, model count
- Executive summary table — all models side by side
- Per-model detail page — all metrics, run config
Or via CLI:
llm-eval report --run-ids a3f92c1b b4d03e2c --output ./reports/# Start everything
docker-compose up -d
# View logs
docker-compose logs -f
# Stop
docker-compose down
# Rebuild after code changes
docker-compose up -d --build# API only
docker build --target api -t llm-eval-api:latest .
docker run -p 8000:8000 --env-file .env llm-eval-api:latest
# Dashboard only
docker build --target dashboard -t llm-eval-dashboard:latest .
docker run -p 8501:8501 --env-file .env llm-eval-dashboard:latestdocker run -p 8000:8000 \
-e OPENAI_API_KEY=sk-... \
-e ANTHROPIC_API_KEY=sk-ant-... \
-v ./data:/data \
-e LLM_EVAL_DB_PATH=/data/llm_eval.db \
llm-eval-api:latest# All tests
pytest tests/ -v
# With coverage report
pytest tests/ -v --cov=llm_eval --cov-report=html --cov-report=term-missing
# Run a specific test class
pytest tests/test_evaluator.py::TestAccuracyMetric -v
# Run a specific test
pytest tests/test_evaluator.py::TestAccuracyMetric::test_mc_correct -vNo API keys required! All tests use mocked LiteLLM responses.
| Module | Tests | Coverage |
|---|---|---|
core/evaluator.py |
7 tests | 96% |
metrics/accuracy.py |
9 tests | 100% |
metrics/hallucination.py |
6 tests | 95% |
metrics/latency.py |
5 tests | 100% |
metrics/cost.py |
6 tests | 100% |
benchmarks/custom.py |
7 tests | 98% |
database/models.py |
8 tests | 97% |
# Ruff linter
ruff check llm_eval/ tests/
# Type checking
mypy llm_eval/ --ignore-missing-imports
# Format check
ruff format --check llm_eval/The evaluation benchmark dataset is published on HuggingFace:
from datasets import load_dataset
# Load all splits
ds = load_dataset("vigneshwar234/llm-eval-benchmark")
print(ds)
# DatasetDict({
# train: Dataset({features: ['id', 'prompt', 'expected', 'subject', 'difficulty', 'source', 'choices'], num_rows: 500}),
# validation: Dataset({features: ..., num_rows: 200}),
# test: Dataset({features: ..., num_rows: 500})
# })
# Use as a custom benchmark
import pandas as pd
from llm_eval.benchmarks.custom import CustomBenchmark
df = pd.DataFrame(ds["test"])
samples = df[["prompt", "expected"]].to_dict("records")| Split | Samples | Subjects | Sources |
|---|---|---|---|
| train | 500 | 15+ | MMLU + TruthfulQA |
| validation | 200 | 15+ | MMLU + TruthfulQA |
| test | 500 | 15+ | MMLU + TruthfulQA |
| total | 1,200 | 15+ | Mixed |
python huggingface/create_dataset.py --pushLLM-Evaluation-Framework/
│
├── llm_eval/ ← Main package
│ ├── __init__.py ← Version: 1.0.0
│ │
│ ├── core/
│ │ ├── __init__.py
│ │ └── evaluator.py ← LLMEvaluator, EvaluationConfig, EvaluationResult
│ │
│ ├── metrics/
│ │ ├── __init__.py
│ │ ├── accuracy.py ← AccuracyMetric (exact/normalized/MC/fuzzy)
│ │ ├── hallucination.py ← HallucinationMetric + reasoning_quality
│ │ ├── latency.py ← LatencyMetric (percentile stats, SLA)
│ │ └── cost.py ← CostMetric (15+ provider pricing)
│ │
│ ├── benchmarks/
│ │ ├── __init__.py
│ │ ├── mmlu.py ← MMLUBenchmark (HF Hub + local cache + builtin)
│ │ ├── truthfulqa.py ← TruthfulQABenchmark
│ │ └── custom.py ← CustomBenchmark (CSV/JSON)
│ │
│ ├── dashboard/
│ │ ├── __init__.py
│ │ └── app.py ← 5-page Streamlit dashboard
│ │
│ ├── api/
│ │ ├── __init__.py
│ │ └── main.py ← FastAPI (12 endpoints)
│ │
│ ├── cli/
│ │ ├── __init__.py
│ │ └── main.py ← Click CLI (7 subcommands)
│ │
│ ├── reports/
│ │ ├── __init__.py
│ │ └── generator.py ← ReportLab PDF generator
│ │
│ └── database/
│ ├── __init__.py
│ └── models.py ← SQLite persistence layer
│
├── tests/
│ ├── __init__.py
│ └── test_evaluator.py ← 40+ pytest tests (no API keys needed)
│
├── .github/
│ └── workflows/
│ └── ci.yml ← CI: test × 3 Python versions + lint + Docker + Pages
│
├── docs/
│ └── index.html ← GitHub Pages (green/yellow theme)
│
├── huggingface/
│ ├── README.md ← HuggingFace dataset card
│ ├── create_dataset.py ← Dataset builder + HF Hub uploader
│ └── data/ ← Generated: train/validation/test JSON
│
├── requirements.txt ← All pinned dependencies
├── .env.example ← All env var templates
├── Dockerfile ← Multi-stage (api + dashboard targets)
├── docker-compose.yml ← API + Dashboard stack
├── setup.py ← Package setup
├── pyproject.toml ← Ruff, mypy, pytest config
├── LICENSE ← MIT
└── README.md ← This file
@dataclass
class EvaluationConfig:
model: str # LiteLLM model string (required)
benchmark: str # "mmlu" | "truthfulqa" | "custom" (required)
num_samples: int = 100 # Number of samples to evaluate
temperature: float = 0.0 # Sampling temperature (0.0 = deterministic)
max_tokens: int = 512 # Max response tokens
timeout: float = 30.0 # Per-request timeout in seconds
concurrency: int = 5 # Max parallel API calls
run_id: str = auto # Auto-generated 8-char hex ID
tags: dict = {} # Custom metadata tags| Variable | Default | Description |
|---|---|---|
OPENAI_API_KEY |
— | OpenAI API key |
ANTHROPIC_API_KEY |
— | Anthropic API key |
GEMINI_API_KEY |
— | Google Gemini API key |
MISTRAL_API_KEY |
— | Mistral API key |
TOGETHERAI_API_KEY |
— | Together AI key (for Llama) |
HUGGINGFACE_TOKEN |
— | HF token for dataset loading |
LLM_EVAL_DB_PATH |
llm_eval.db |
SQLite database path |
LLM_EVAL_CACHE_DIR |
~/.cache/llm_eval |
Benchmark cache directory |
PORT |
8000 |
FastAPI port |
DASHBOARD_PORT |
8501 |
Streamlit port |
LITELLM_VERBOSE |
false |
LiteLLM debug logging |
We welcome contributions! Here's how to get started:
git clone https://github.com/vignesh2027/LLM-Evaluation-Framework.git
cd LLM-Evaluation-Framework
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dashboard,reports,dev]"
pre-commit install # optional but recommended- Fork the repository
- Create a feature branch:
git checkout -b feature/my-awesome-feature - Make your changes with tests
- Verify:
pytest tests/ -v && ruff check llm_eval/ - Commit:
git commit -m "feat: add my awesome feature" - Push and open a Pull Request
- Add a new benchmark loader (e.g., HellaSwag, ARC)
- Add a new metric (e.g., BLEU, ROUGE, BERTScore)
- Improve hallucination detection with an NLI model
- Add a new chart to the dashboard
- Add support for a new provider
This project is licensed under the MIT License — see the LICENSE file for details.
MIT License — Copyright (c) 2025 vignesh2027
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...
If this project helped you, please star it — it helps others find it!