Releases: microsoft/eval-recipes
v0.0.36
- New tasks
- Updated existing tasks
Full Changelog: v0.0.35...v0.0.36
v0.0.35
- Tooling improvements
- Switch from make to uv commands
- Add prek for pre-commit checks
- Migrate to ty for type checking
Full Changelog: v0.0.34...v0.0.35
v0.0.34
- Fix bug in how eval-recipes is installed in a Docker container when executed outside the scope of the repo itself.
Full Changelog: v0.0.33...v0.0.34
v0.0.33
- Fixed `collect_eval_recipes_package()` incorrectly copying the entire `site-packages` directory when running as an installed package
- In installed mode, now generates a minimal `pyproject.toml` using package metadata so `uv` can resolve dependencies in containers
Full Changelog: v0.0.32...v0.0.33
v0.0.32
This release includes a complete rewrite of the benchmarking system for improved usability and maintainability.
- Replaced `Harness` and `ComparisonHarness` with new `ScorePipeline` and `ComparisonPipeline` classes
- Added new `loaders` module with `load_agents()`, `load_tasks()`, and `load_benchmark()` functions
- Agent definitions now consolidated into a single `agent.yaml` file (removed separate `install.dockerfile`, `command_template.txt`, `command_template_continue.txt` files)
- Task definitions now consolidated into a single `task.yaml` file (removed separate `instructions.txt`, `setup.dockerfile` files)
- Benchmark configurations moved from `data/eval-setups/` to `data/benchmarks/` with a new unified format supporting both score and comparison benchmarks
- New job framework with structured jobs for score and comparison benchmarks
- Old `Harness` and `ComparisonHarness` classes removed
- Agent and task definition file structure changed (migration required)
- Benchmark configuration format changed (migration required)
See BENCHMARKING.md for updated usage instructions and configuration formats.
Full Changelog: v0.0.31...v0.0.32
v0.0.31
- Added ability to pass agent logs through to the semantic test
Full Changelog: v0.0.30...v0.0.31
v0.0.30
- Version references updated across documentation and code to consistently use `v0.0.30`
Full Changelog: v0.0.29...v0.0.30
v0.0.29
- YAML-Based Run Configuration: Instead of using CLI filters to select agents and tasks, you now define your benchmark runs declaratively in YAML files with explicit control over which agents run which tasks and how many trials each combination gets.
- `Harness` constructor now requires a `run_definition: ScoreRunSpec` parameter
- Removed `agent_filters`, `task_filters`, and `num_trials` parameters from `Harness`
- `runs_dir` is now optional (defaults to `.benchmark_results`)
- `agents_dir` and `tasks_dir` are now required parameters
- Removed `filters.py` module
- `non_deterministic_evals` in `task_info` now defaults to `false`
- Rewritten `BENCHMARKING.md`
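A declarative run definition along these lines might look like the following. This is only a sketch: the field names and layout are assumptions for illustration, not the actual schema (see `BENCHMARKING.md` for the real format); the defaults shown for `runs_dir` come from the notes above.

```yaml
# Hypothetical run definition (field names are assumptions, not the real schema)
agents_dir: data/agents
tasks_dir: data/tasks
runs_dir: .benchmark_results   # optional; stated default
runs:
  - agent: my-agent            # hypothetical agent name
    tasks: [task-a, task-b]    # hypothetical task names
    num_trials: 3              # explicit trial count per agent/task pair
```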
Full Changelog: v0.0.28...v0.0.29
v0.0.28
Comparison-Based Benchmarking
Add a new evaluation mode for the benchmarking harness: comparison-based evaluation. In contrast to the existing score-based evaluation (where each agent's output is scored independently by a test script), you can now compare multiple agents' outputs head-to-head using a judge agent.
- Comparison harness (`harness_comparison.py`): Run agents on tasks and have an LLM compare their outputs to produce relative rankings
- Semantic test comparison (`semantic_test_comparison.py`): Blind comparison of agent outputs using an agent as a judge, with anonymized directory names
- Aggregate reports: LLM-generated summaries explaining why agents were ranked the way they were, both per-task and across all tasks
- HTML report generation (`create_comparison_html_report.py`): Interactive dashboard showing comparison results with metrics like Average Rank, Win Rate, Task Wins, and Kendall's W agreement scores
- New script `scripts/run_comparison_benchmarks.py` for running comparison benchmarks via CLI
- New sample tasks: Added `ppt-1`, `ppt-2`, `ppt-3` tasks for PowerPoint/Office document generation evaluation
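The dashboard metrics above (Average Rank, Win Rate, Task Wins) can all be derived from the judge's per-task rankings. A minimal self-contained sketch of that aggregation, not the library's actual implementation:

```python
from collections import defaultdict

def aggregate_rankings(task_rankings: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Turn per-task rankings (best-to-worst agent lists) into summary
    metrics: Average Rank, Win Rate (fraction of tasks ranked first),
    and Task Wins. Illustrative only."""
    ranks: dict[str, list[int]] = defaultdict(list)
    wins: dict[str, int] = defaultdict(int)
    for ranking in task_rankings.values():
        for position, agent in enumerate(ranking, start=1):
            ranks[agent].append(position)
        wins[ranking[0]] += 1  # first place counts as a task win
    n_tasks = len(task_rankings)
    return {
        agent: {
            "average_rank": sum(r) / len(r),
            "win_rate": wins[agent] / n_tasks,
            "task_wins": wins[agent],
        }
        for agent, r in ranks.items()
    }

# Example: two hypothetical agents judged on three tasks
rankings = {
    "ppt-1": ["agent-a", "agent-b"],
    "ppt-2": ["agent-b", "agent-a"],
    "ppt-3": ["agent-a", "agent-b"],
}
metrics = aggregate_rankings(rankings)
```

Here `agent-a` wins two of three tasks, so its win rate is 2/3 and its average rank is (1 + 2 + 1) / 3.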
Other
- Tasks now support
eval_typefield (score,comparison, orboth) intask.yaml - Added
comparison_eval.guidelinesfor task-specific evaluation guidance - Added
extract_directory_from_container()to DockerManager for extracting agent outputs - Updated dependencies
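A `task.yaml` using these fields might look like the fragment below. The exact nesting is an assumption inferred from the dotted name `comparison_eval.guidelines`; the guideline text is invented for illustration.

```yaml
# Illustrative task.yaml fragment; field layout is assumed, not confirmed
eval_type: both               # score, comparison, or both
comparison_eval:
  guidelines: |
    Judge outputs on layout consistency and chart correctness.
```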
Full Changelog: v0.0.27...v0.0.28
v0.0.27
- Added `setup_frontier_science.py` script to set up OpenAI FrontierScience benchmark tasks from HuggingFace
- Added `setup_arc_agi_2.py` script for ARC-AGI-2 benchmark task setup
- Added explicit `timeout` values to all existing benchmark tasks to prevent premature termination:
  - Most tasks: 4500 seconds (75 minutes)
  - Complex tasks (`email_drafting`, `github_docs_extractor`, `sec_10q_extractor`, `style_blender`): 5400 seconds (90 minutes)
- Updated OpenAI Codex agent continuation model from `gpt-5.1-codex` to `gpt-5.1-codex-max`
Full Changelog: v0.0.26...v0.0.27