An open-source rules-based framework for evaluating AI agent performance across various industries and use cases.
-
Updated
Aug 4, 2025 - Python
An open-source rules-based framework for evaluating AI agent performance across various industries and use cases.
frontier-evals-harness is a lightweight framework for benchmarking frontier language models. It provides deterministic suite versioning, modular adapters, standardized scoring, and paired statistical comparisons with confidence intervals. Built for regression tracking and analysis, it enables reproducible evaluation without infrastructure.
A minimal, code-first retrieval observability harness that measures why RAG systems fail to surface relevant evidence, without changing retrieval or generation.
Controlled experiment isolating reranking as a first-class RAG system boundary, measuring how evidence priority—not recall—changes retrieval outcomes.
Detecting Relational Boundary Erosion in AI systems. A framework for testing whether models maintain honest, calibrated, and appropriate boundaries.
Add a description, image, and links to the evaluation-harness topic page so that developers can more easily learn about it.
To associate your repository with the evaluation-harness topic, visit your repo's landing page and select "manage topics."