Releases: microsoft/eval-recipes

v0.0.36

10 Feb 14:42

Pre-release
  • New tasks
  • Updated existing tasks

Full Changelog: v0.0.35...v0.0.36

v0.0.35

08 Feb 17:49

Pre-release
  • Tooling improvements
    • Switch from make to uv commands
    • Add prek for pre-commit checks
    • Migrate to ty for type checking

Full Changelog: v0.0.34...v0.0.35

v0.0.34

03 Feb 18:13

Pre-release
  • Fixed a bug in how eval-recipes is installed in a Docker container when executed outside the repository itself.

Full Changelog: v0.0.33...v0.0.34

v0.0.33

03 Feb 17:24

Pre-release
  • Fixed collect_eval_recipes_package() incorrectly copying the entire site-packages directory when running as an installed package
  • In installed mode, now generates a minimal pyproject.toml using package metadata so uv can resolve dependencies in containers
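A minimal pyproject.toml can be derived from installed-package metadata with the standard library alone. The sketch below is illustrative of the approach, not the repository's actual implementation, and uses `pip` only as a stand-in package:

```python
# Sketch: build a minimal pyproject.toml [project] table from the
# metadata of an installed package, so a resolver like uv can install
# dependencies inside a container. Illustrative only; not the
# eval-recipes implementation.
from importlib import metadata


def minimal_pyproject(package: str) -> str:
    meta = metadata.metadata(package)
    deps = metadata.requires(package) or []
    # Keep only unconditional runtime dependencies (drop extras markers).
    deps = [d for d in deps if "extra ==" not in d]
    dep_lines = ",\n".join(f'    "{d}"' for d in deps)
    return (
        "[project]\n"
        f'name = "{meta["Name"]}"\n'
        f'version = "{meta["Version"]}"\n'
        "dependencies = [\n" + dep_lines + "\n]\n"
    )


print(minimal_pyproject("pip"))
```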

Full Changelog: v0.0.32...v0.0.33

v0.0.32

03 Feb 15:10
318688f

Pre-release

This release includes a complete rewrite of the benchmarking system for improved usability and maintainability.

  • Replaced Harness and ComparisonHarness with new ScorePipeline and ComparisonPipeline classes
  • Added new loaders module with load_agents(), load_tasks(), and load_benchmark() functions
  • Agent definitions now consolidated into single agent.yaml file (removed separate install.dockerfile, command_template.txt, command_template_continue.txt files)
  • Task definitions now consolidated into single task.yaml file (removed separate instructions.txt, setup.dockerfile files)
  • Benchmark configurations moved from data/eval-setups/ to data/benchmarks/ with new unified format supporting both score and comparison benchmarks
  • New job framework with structured jobs for score and comparison benchmarks
  • Old Harness and ComparisonHarness classes removed
  • Agent and task definition file structure changed (migration required)
  • Benchmark configuration format changed (migration required)

See BENCHMARKING.md for updated usage instructions and configuration formats.
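As a rough illustration of the consolidation described above, the former per-agent files become keys of a single document. The shape below is an assumption (shown as the equivalent Python mapping), not the schema documented in BENCHMARKING.md:

```python
# Hypothetical shape of a consolidated agent definition; every field
# name here is an illustrative assumption, not the documented format.
agent_yaml = {
    "name": "example-agent",
    # Formerly install.dockerfile:
    "install": {
        "dockerfile": "FROM python:3.12-slim\nRUN pip install example-agent",
    },
    # Formerly command_template.txt and command_template_continue.txt:
    "command_template": "example-agent run {task_dir}",
    "command_template_continue": "example-agent continue {session_id}",
}

# Three separate files now map to keys of one document:
print(sorted(agent_yaml))
```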

Full Changelog: v0.0.31...v0.0.32

v0.0.31

27 Jan 18:28

Pre-release
  • Added the ability to pass agent logs through to the semantic test

Full Changelog: v0.0.30...v0.0.31

v0.0.30

20 Jan 21:10

Pre-release
  • Updated version references across documentation and code to consistently use v0.0.30

Full Changelog: v0.0.29...v0.0.30

v0.0.29

17 Jan 20:00

Pre-release
  • YAML-Based Run Configuration: Instead of using CLI filters to select agents and tasks, you now define your benchmark runs declaratively in YAML files with explicit control over which agents run which tasks and how many trials each combination gets.
    • Harness constructor now requires a run_definition: ScoreRunSpec parameter
    • Removed agent_filters, task_filters, and num_trials parameters from Harness
    • runs_dir is now optional (defaults to .benchmark_results)
    • agents_dir and tasks_dir are now required parameters
  • Removed filters.py module
  • non_deterministic_evals in task_info now defaults to false
  • Rewritten BENCHMARKING.md
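A declarative run definition as described above pairs agents with tasks and trial counts explicitly. The sketch below shows one plausible shape as the equivalent Python mapping; every field name is an assumption, not the ScoreRunSpec schema:

```python
# Hypothetical shape of a YAML run definition, shown as the equivalent
# Python mapping; field names are illustrative assumptions only.
run_definition = {
    "runs": [
        {"agent": "codex", "tasks": ["ppt-1", "ppt-2"], "trials": 3},
        {"agent": "claude-code", "tasks": ["ppt-1"], "trials": 5},
    ],
}

# Each agent/task combination carries its own trial count, so the total
# number of benchmark executions is explicit rather than implied by
# CLI filters:
total_trials = sum(
    run["trials"] * len(run["tasks"]) for run in run_definition["runs"]
)
print(total_trials)  # 11
```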

Full Changelog: v0.0.28...v0.0.29

v0.0.28

16 Jan 19:48

Pre-release

Comparison-Based Benchmarking

Adds a new evaluation mode for the benchmarking harness: comparison-based evaluation. In contrast to the existing score-based evaluation (where each agent's output is scored independently by a test script), you can now compare multiple agents' outputs head-to-head using a judge agent.

  • Comparison harness (harness_comparison.py): Run agents on tasks and have an LLM compare their outputs to produce relative rankings
  • Semantic test comparison (semantic_test_comparison.py): Blind comparison of agent outputs using an agent as a judge with anonymized directory names
  • Aggregate reports: LLM-generated summaries explaining why agents were ranked the way they were, both per-task and across all tasks
  • HTML report generation (create_comparison_html_report.py): Interactive dashboard showing comparison results with metrics like Average Rank, Win Rate, Task Wins, and Kendall's W agreement scores
  • New script scripts/run_comparison_benchmarks.py for running comparison benchmarks via CLI
  • New sample tasks: Added ppt-1, ppt-2, ppt-3 tasks for PowerPoint/Office document generation evaluation
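Kendall's W, the agreement score the dashboard reports, measures how consistently multiple judges rank the same set of items (1 = perfect agreement, 0 = no agreement). The sketch below is the standard textbook formula for untied ranks, not the repository's implementation:

```python
# Kendall's coefficient of concordance (W) for rankings without ties:
# W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared
# deviations of each item's total rank from the mean total rank.
def kendalls_w(rankings: list[list[int]]) -> float:
    """rankings: one list of ranks (1..n) per judge, no ties."""
    m = len(rankings)      # number of judges
    n = len(rankings[0])   # number of items ranked
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean = sum(totals) / n
    s = sum((t - mean) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Two judges in perfect agreement:
print(kendalls_w([[1, 2, 3], [1, 2, 3]]))  # 1.0
# Two judges in exact opposition:
print(kendalls_w([[1, 2, 3], [3, 2, 1]]))  # 0.0
```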

Other

  • Tasks now support eval_type field (score, comparison, or both) in task.yaml
  • Added comparison_eval.guidelines for task-specific evaluation guidance
  • Added extract_directory_from_container() to DockerManager for extracting agent outputs
  • Updated dependencies

Full Changelog: v0.0.27...v0.0.28

v0.0.27

13 Jan 14:27
eb835cf

Pre-release
  • Added setup_frontier_science.py script to set up OpenAI FrontierScience benchmark tasks from HuggingFace
  • Added setup_arc_agi_2.py script for ARC-AGI-2 benchmark task setup
  • Added explicit timeout values to all existing benchmark tasks to prevent premature termination:
    • Most tasks: 4500 seconds (75 minutes)
    • Complex tasks (email_drafting, github_docs_extractor, sec_10q_extractor, style_blender): 5400 seconds (90 minutes)
  • Updated OpenAI Codex agent continuation model from gpt-5.1-codex to gpt-5.1-codex-max

Full Changelog: v0.0.26...v0.0.27