Releases: microsoft/eval-recipes

v0.0.36

10 Feb 14:42

Pre-release
  • New tasks
  • Updated existing tasks

Full Changelog: v0.0.35...v0.0.36

v0.0.35

08 Feb 17:49

Pre-release
  • Tooling improvements
    • Switch from make to uv commands
    • Add prek for pre-commit checks
    • Migrate to ty for type checking

Full Changelog: v0.0.34...v0.0.35

v0.0.34

03 Feb 18:13

Pre-release
  • Fixed a bug in how eval-recipes is installed in a Docker container when executed outside the repository itself.

Full Changelog: v0.0.33...v0.0.34

v0.0.33

03 Feb 17:24

Pre-release
  • Fixed collect_eval_recipes_package() incorrectly copying the entire site-packages directory when running as an installed package
  • In installed mode, now generates a minimal pyproject.toml using package metadata so uv can resolve dependencies in containers
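A minimal pyproject.toml can be derived from installed-package metadata with the standard library alone. The sketch below is illustrative of the approach, not the repository's actual implementation, and uses `pip` only as a stand-in package:

```python
# Sketch: build a minimal pyproject.toml [project] table from the
# metadata of an installed package, so a resolver like uv can install
# dependencies inside a container. Illustrative only; not the
# eval-recipes implementation.
from importlib import metadata


def minimal_pyproject(package: str) -> str:
    meta = metadata.metadata(package)
    deps = metadata.requires(package) or []
    # Keep only unconditional runtime dependencies (drop extras markers).
    deps = [d for d in deps if "extra ==" not in d]
    dep_lines = ",\n".join(f'    "{d}"' for d in deps)
    return (
        "[project]\n"
        f'name = "{meta["Name"]}"\n'
        f'version = "{meta["Version"]}"\n'
        "dependencies = [\n" + dep_lines + "\n]\n"
    )


print(minimal_pyproject("pip"))
```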

Full Changelog: v0.0.32...v0.0.33

v0.0.32

03 Feb 15:10
318688f

Pre-release

This release includes a complete rewrite of the benchmarking system for improved usability and maintainability.

  • Replaced Harness and ComparisonHarness with new ScorePipeline and ComparisonPipeline classes
  • Added new loaders module with load_agents(), load_tasks(), and load_benchmark() functions
  • Agent definitions now consolidated into single agent.yaml file (removed separate install.dockerfile, command_template.txt, command_template_continue.txt files)
  • Task definitions now consolidated into single task.yaml file (removed separate instructions.txt, setup.dockerfile files)
  • Benchmark configurations moved from data/eval-setups/ to data/benchmarks/ with new unified format supporting both score and comparison benchmarks
  • New job framework with structured jobs for score and comparison benchmarks
  • Old Harness and ComparisonHarness classes removed
  • Agent and task definition file structure changed (migration required)
  • Benchmark configuration format changed (migration required)

See BENCHMARKING.md for updated usage instructions and configuration formats.
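As a rough illustration of the consolidation described above, the former per-agent files become keys of a single document. The shape below is an assumption (shown as the equivalent Python mapping), not the schema documented in BENCHMARKING.md:

```python
# Hypothetical shape of a consolidated agent definition; every field
# name here is an illustrative assumption, not the documented format.
agent_yaml = {
    "name": "example-agent",
    # Formerly install.dockerfile:
    "install": {
        "dockerfile": "FROM python:3.12-slim\nRUN pip install example-agent",
    },
    # Formerly command_template.txt and command_template_continue.txt:
    "command_template": "example-agent run {task_dir}",
    "command_template_continue": "example-agent continue {session_id}",
}

# Three separate files now map to keys of one document:
print(sorted(agent_yaml))
```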

Full Changelog: v0.0.31...v0.0.32

v0.0.31

27 Jan 18:28

Pre-release
  • Added the ability to pass agent logs through to the semantic test

Full Changelog: v0.0.30...v0.0.31

v0.0.30

20 Jan 21:10

Pre-release
  • Updated version references across documentation and code to consistently use v0.0.30

Full Changelog: v0.0.29...v0.0.30

v0.0.29

17 Jan 20:00

Pre-release
  • YAML-Based Run Configuration: Instead of using CLI filters to select agents and tasks, you now define your benchmark runs declaratively in YAML files with explicit control over which agents run which tasks and how many trials each combination gets.
    • Harness constructor now requires a run_definition: ScoreRunSpec parameter
    • Removed agent_filters, task_filters, and num_trials parameters from Harness
    • runs_dir is now optional (defaults to .benchmark_results)
    • agents_dir and tasks_dir are now required parameters
  • Removed filters.py module
  • non_deterministic_evals in task_info now defaults to false
  • Rewritten BENCHMARKING.md
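A declarative run definition as described above pairs agents with tasks and trial counts explicitly. The sketch below shows one plausible shape as the equivalent Python mapping; every field name is an assumption, not the ScoreRunSpec schema:

```python
# Hypothetical shape of a YAML run definition, shown as the equivalent
# Python mapping; field names are illustrative assumptions only.
run_definition = {
    "runs": [
        {"agent": "codex", "tasks": ["ppt-1", "ppt-2"], "trials": 3},
        {"agent": "claude-code", "tasks": ["ppt-1"], "trials": 5},
    ],
}

# Each agent/task combination carries its own trial count, so the total
# number of benchmark executions is explicit rather than implied by
# CLI filters:
total_trials = sum(
    run["trials"] * len(run["tasks"]) for run in run_definition["runs"]
)
print(total_trials)  # 11
```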

Full Changelog: v0.0.28...v0.0.29

v0.0.28

16 Jan 19:48

Pre-release

Comparison-Based Benchmarking

Adds a new evaluation mode for the benchmarking harness: comparison-based evaluation. In contrast to the existing score-based evaluation (where each agent's output is scored independently by a test script), you can now compare multiple agents' outputs head-to-head using a judge agent.

  • Comparison harness (harness_comparison.py): Run agents on tasks and have an LLM compare their outputs to produce relative rankings
  • Semantic test comparison (semantic_test_comparison.py): Blind comparison of agent outputs using an agent as a judge with anonymized directory names
  • Aggregate reports: LLM-generated summaries explaining why agents were ranked the way they were, both per-task and across all tasks
  • HTML report generation (create_comparison_html_report.py): Interactive dashboard showing comparison results with metrics like Average Rank, Win Rate, Task Wins, and Kendall's W agreement scores
  • New script scripts/run_comparison_benchmarks.py for running comparison benchmarks via CLI
  • New sample tasks: Added ppt-1, ppt-2, ppt-3 tasks for PowerPoint/Office document generation evaluation
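Kendall's W, the agreement score the dashboard reports, measures how consistently multiple judges rank the same set of items (1 = perfect agreement, 0 = no agreement). The sketch below is the standard textbook formula for untied ranks, not the repository's implementation:

```python
# Kendall's coefficient of concordance (W) for rankings without ties:
# W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared
# deviations of each item's total rank from the mean total rank.
def kendalls_w(rankings: list[list[int]]) -> float:
    """rankings: one list of ranks (1..n) per judge, no ties."""
    m = len(rankings)      # number of judges
    n = len(rankings[0])   # number of items ranked
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean = sum(totals) / n
    s = sum((t - mean) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Two judges in perfect agreement:
print(kendalls_w([[1, 2, 3], [1, 2, 3]]))  # 1.0
# Two judges in exact opposition:
print(kendalls_w([[1, 2, 3], [3, 2, 1]]))  # 0.0
```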

Other

  • Tasks now support eval_type field (score, comparison, or both) in task.yaml
  • Added comparison_eval.guidelines for task-specific evaluation guidance
  • Added extract_directory_from_container() to DockerManager for extracting agent outputs
  • Updated dependencies

Full Changelog: v0.0.27...v0.0.28

v0.0.27

13 Jan 14:27
eb835cf

Pre-release
  • Added setup_frontier_science.py script to set up OpenAI FrontierScience benchmark tasks from HuggingFace
  • Added setup_arc_agi_2.py script for ARC-AGI-2 benchmark task setup
  • Added explicit timeout values to all existing benchmark tasks to prevent premature termination:
    • Most tasks: 4500 seconds (75 minutes)
    • Complex tasks (email_drafting, github_docs_extractor, sec_10q_extractor, style_blender): 5400 seconds (90 minutes)
  • Updated OpenAI Codex agent continuation model from gpt-5.1-codex to gpt-5.1-codex-max

Full Changelog: v0.0.26...v0.0.27