herma-eval

The first benchmark that measures both cost AND quality for LLM routers.

Most benchmarks measure a single dimension: accuracy or cost. herma-eval tracks both simultaneously, so you can prove your router saves money without sacrificing quality. Every result answers two questions: what did it cost, and how does it compare to always using the frontier model?

Works with any OpenAI-compatible API, not just Herma.

Install

pip install herma-eval

# For HumanEval+ code benchmarks:
pip install 'herma-eval[humaneval]'

Quick Start

CLI

# Run benchmarks against any OpenAI-compatible endpoint
herma-eval run \
  --api-base https://your-api.com/v1 \
  --api-key YOUR_KEY \
  --model auto \
  --benchmarks gsm8k,humaneval

# Validate routing config against traffic scenarios (no API calls needed)
herma-eval validate --config router-config.json

Python API

import asyncio
from herma_eval import run_benchmarks

results = asyncio.run(run_benchmarks(
    api_base="https://your-api.com/v1",
    api_key="your-key",
    model="auto",
    benchmarks=["gsm8k"],
    n_samples=50,
))

print(f"Quality vs frontier: {results.quality_pct:.1f}%")
print(f"Pass: {results.passes_thresholds}")

Traffic Validation

Validate your routing config against simulated traffic without making any API calls:

from herma_eval import run_validation

result = run_validation(
    quality_map={
        "coding:easy": {"model": "gpt-4.1-mini", "pass_rate": 0.95},
        "coding:hard": {"model": "claude-sonnet-4", "pass_rate": 0.88, "cascade": "claude-opus-4"},
        "factual:easy": {"model": "deepseek-chat", "pass_rate": 0.98},
    },
    frontier_model="claude-opus-4",
)

print(f"All scenarios pass: {result['all_pass']}")

Benchmarks

| Benchmark  | What It Measures       | Default Samples | Frontier Baseline     |
|------------|------------------------|-----------------|-----------------------|
| GSM8K      | Grade-school math      | 100             | Opus 4.6 (95%)        |
| HumanEval+ | Python code generation | 164             | Opus 4.6 (87% pass@1) |

Thresholds

  • Quality retention: >= 90% of frontier model accuracy
  • Cost savings (validate): >= 60% cheaper than always using frontier
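As a concrete illustration of both thresholds (only the 90% and 60% cutoffs come from this README; every other number below is hypothetical): with the GSM8K frontier baseline at 95%, a router needs at least 0.90 × 95 = 85.5% accuracy, and the routed run must cost no more than 40% of the all-frontier bill.

# All numbers are hypothetical except the 90% / 60% cutoffs.
frontier_accuracy = 0.95  # GSM8K frontier baseline (see Benchmarks table)
router_accuracy = 0.88    # assumed: measured accuracy of the routed run
frontier_cost = 12.40     # assumed: $ for the suite on the frontier model
router_cost = 3.10        # assumed: $ for the suite through the router

quality_retention = router_accuracy / frontier_accuracy  # ~0.926
cost_savings = 1 - router_cost / frontier_cost           # 0.75

passes = quality_retention >= 0.90 and cost_savings >= 0.60
print(f"retention={quality_retention:.1%}, savings={cost_savings:.1%}, pass={passes}")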

Router Config Format

For herma-eval validate, provide a JSON file:

{
  "quality_map": {
    "coding:easy": {"model": "gpt-4.1-mini", "pass_rate": 0.95},
    "coding:medium": {"model": "gpt-4.1-mini", "pass_rate": 0.82, "cascade": "claude-sonnet-4"},
    "coding:hard": {"model": "claude-sonnet-4", "pass_rate": 0.88, "cascade": "claude-opus-4"},
    "analysis:easy": {"model": "gpt-4.1-mini", "pass_rate": 0.97},
    "factual:easy": {"model": "deepseek-chat", "pass_rate": 0.98}
  },
  "frontier_model": "claude-opus-4",
  "model_costs": {
    "claude-opus-4": {"input": 15.00, "output": 75.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60}
  }
}
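The same file can drive the Python API. A sketch, assuming run_validation accepts the config's top-level keys as keyword arguments (quality_map and frontier_model match the call shown under Traffic Validation; forwarding model_costs the same way is an assumption):

import json
from herma_eval import run_validation

with open("router-config.json") as f:
    config = json.load(f)

# quality_map and frontier_model mirror the Traffic Validation example;
# passing model_costs through as a keyword argument is an assumption.
result = run_validation(**config)
print(f"All scenarios pass: {result['all_pass']}")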

License

MIT
