The first benchmark that measures both cost AND quality for LLM routers.
Most benchmarks measure one dimension — accuracy OR cost. herma-eval tracks both simultaneously, so you can prove your router saves money without sacrificing quality. Every result shows: what did it cost, and how does it compare to always using the frontier model?
Works with any OpenAI-compatible API — not just Herma.
```bash
pip install herma-eval

# For HumanEval+ code benchmarks:
pip install 'herma-eval[humaneval]'
```

```bash
# Run benchmarks against any OpenAI-compatible endpoint
herma-eval run \
  --api-base https://your-api.com/v1 \
  --api-key YOUR_KEY \
  --model auto \
  --benchmarks gsm8k,humaneval
```
```bash
# Validate routing config against traffic scenarios (no API calls needed)
herma-eval validate --config router-config.json
```

```python
import asyncio
from herma_eval import run_benchmarks

results = asyncio.run(run_benchmarks(
    api_base="https://your-api.com/v1",
    api_key="your-key",
    model="auto",
    benchmarks=["gsm8k"],
    n_samples=50,
))

print(f"Quality vs frontier: {results.quality_pct:.1f}%")
print(f"Pass: {results.passes_thresholds}")
```

Validate your routing config against simulated traffic without making any API calls:
```python
from herma_eval import run_validation

result = run_validation(
    quality_map={
        "coding:easy": {"model": "gpt-4.1-mini", "pass_rate": 0.95},
        "coding:hard": {"model": "claude-sonnet-4", "pass_rate": 0.88, "cascade": "claude-opus-4"},
        "factual:easy": {"model": "deepseek-chat", "pass_rate": 0.98},
    },
    frontier_model="claude-opus-4",
)
print(f"All scenarios pass: {result['all_pass']}")
```

| Benchmark | What It Measures | Default Samples | Frontier Baseline |
|---|---|---|---|
| GSM8K | Grade-school math | 100 | Opus 4 (95%) |
| HumanEval+ | Python code generation | 164 | Opus 4 (87% pass@1) |
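For reference, the pass@1 figure in the baseline column is the standard unbiased estimator from the HumanEval benchmark. A minimal sketch of that formula, included as background rather than herma-eval code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples drawn
    from n generations (c of which are correct) solves the task."""
    if n - c < k:
        return 1.0  # every k-subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per task (k=1), pass@1 reduces to c/n:
print(pass_at_k(10, 7, 1))  # 0.7
```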
A run passes when it clears both default thresholds; the sketch after this list shows how they combine:

- Quality retention: >= 90% of the frontier model's accuracy
- Cost savings (`validate` mode): >= 60% cheaper than always using the frontier model
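To make the two gates concrete, here is a minimal sketch of how they combine into one pass/fail decision. The function and variable names are illustrative, not herma-eval's API:

```python
QUALITY_FLOOR = 0.90  # retain >= 90% of frontier accuracy
SAVINGS_FLOOR = 0.60  # validate mode: >= 60% cheaper than frontier

def passes_thresholds(router_acc: float, frontier_acc: float,
                      router_cost: float, frontier_cost: float) -> bool:
    """True when both the quality gate and the cost gate pass."""
    quality_retention = router_acc / frontier_acc
    cost_savings = 1.0 - router_cost / frontier_cost
    return quality_retention >= QUALITY_FLOOR and cost_savings >= SAVINGS_FLOOR

# 88% router accuracy vs a 95% frontier at a third of the cost:
# retention ~92.6%, savings ~66.7%, so both gates pass.
print(passes_thresholds(0.88, 0.95, router_cost=1.0, frontier_cost=3.0))  # True
```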
For `herma-eval validate`, provide a JSON config file like the `router-config.json` referenced above. Prices in `model_costs` are USD per million tokens:

```json
{
  "quality_map": {
    "coding:easy": {"model": "gpt-4.1-mini", "pass_rate": 0.95},
    "coding:medium": {"model": "gpt-4.1-mini", "pass_rate": 0.82, "cascade": "claude-sonnet-4"},
    "coding:hard": {"model": "claude-sonnet-4", "pass_rate": 0.88, "cascade": "claude-opus-4"},
    "analysis:easy": {"model": "gpt-4.1-mini", "pass_rate": 0.97},
    "factual:easy": {"model": "deepseek-chat", "pass_rate": 0.98}
  },
  "frontier_model": "claude-opus-4",
  "model_costs": {
    "claude-opus-4": {"input": 15.00, "output": 75.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60}
  }
}
```
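Since `model_costs` drives the savings calculation, it helps to see how per-million-token prices become a per-request estimate. A minimal sketch of that arithmetic (standard price math, not herma-eval's internals):

```python
# Prices from the config above, in USD per million tokens.
MODEL_COSTS = {
    "claude-opus-4": {"input": 15.00, "output": 75.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one request."""
    p = MODEL_COSTS[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Routing a 1,000-in / 500-out request to the small model instead of
# the frontier: $0.0012 vs $0.0525, a ~98% saving on this request.
print(request_cost("gpt-4.1-mini", 1000, 500))   # 0.0012
print(request_cost("claude-opus-4", 1000, 500))  # 0.0525
```

License: MIT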