Feat/mcp+skills by marianbasti · Pull Request #22 · surus-lat/benchy

marianbasti · 2026-03-05T20:43:34Z

Summary by CodeRabbit

New Features
- MCP server integration with benchy-mcp command for accessing run outputs and configurations
- Run validation script for smoke gate contract testing via CLI
- New facturas_argentinas task for Argentinian invoice data extraction
- Deprecated image_extraction task with guidance to use document_extraction instead
Documentation
- Added comprehensive skill guides for adding tasks, providers, and evaluating runs
- Enhanced AGENTS.md with MCP server setup and validation script instructions
Chores
- Updated SURUS API endpoints to new domain
- Restricted vllm dependency to Linux platforms

Handler-Based Task System - Major Refactor

Added initial CLI for models

[ADD] updated readme

Dev

- Introduced `add-provider` skill to facilitate the addition of new inference providers, covering both OpenAI-compatible and custom HTTP systems. - Added `add-task` skill for creating new benchmark tasks or task groups, including handler selection and task class implementation. - Implemented `evaluate` skill to run evaluations against models or systems, detailing the smoke and full run workflows. - Created `interpret-run` skill for reading and interpreting benchy run outputs, focusing on the structure of `run_outcome.json` and failure diagnosis. - Established `mcp` module with server capabilities to expose benchy run outputs and configurations as MCP tools for easier access and diagnostics.

Update documentation

coderabbitai · 2026-03-05T20:43:52Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

This PR completes a migration from image_extraction to document_extraction as the primary benchmark task, introduces an MCP server for serving benchy outputs and configurations, adds a validation script for smoke-gate checking, and deprecates the old task name with guidance messaging. Supporting changes include SURUS endpoint updates, new task metadata, reference data updates, and comprehensive skill documentation.

Changes

Cohort / File(s)	Summary
Task & Group Migration `configs/config.yaml`, `configs/models/openai_gpt-4o-mini.yaml`, `configs/models/openai_gpt-5-mini.yaml`, `configs/models/together_*.yaml`, `configs/tests/document-extract-gpt4.yaml`, `reference/tasks_groups.json`	Renamed `image_extraction` to `document_extraction` in task lists and group definitions; updated `document_extraction_only` task group in config.yaml to replace `image_extraction_only`.
SURUS System Configuration `configs/systems/surus-*.yaml`, `src/interfaces/README.md`	Updated SURUS API endpoints from `api.surus.dev` to `api.surus.lat`, added `image_max_edge: 2048` to surus-factura config, renamed task references to document_extraction, and updated subtask names to `facturas_argentinas`.
MCP Server Implementation `src/mcp/__init__.py`, `src/mcp/server.py`, `pyproject.toml`	New MCP server module exposing benchy outputs and configuration via FastMCP tools; added `benchy-mcp` CLI entry point and optional `mcp` dependency; includes functions for reading run outcomes, validating smoke gates, listing runs/configs/tasks, and aggregating task errors.
Validation & Deprecation Logic `scripts/validate_run.py`, `src/benchy_cli_eval.py`, `src/tasks/registry.py`, `tests/test_cli_arguments.py`	Added validation script with smoke-gate checking; introduced deprecation mapping and rejection helpers in CLI; added registry-level deprecation guards; new test cases for deprecated task detection.
Task Metadata & Registration `src/tasks/document_extraction/metadata.yaml`, `reference/tasks_list.json`	Updated document_extraction metadata with group wiring, multilingual descriptions, expanded capability requirements (`requires_files`, `requires_schema`); added new `facturas_argentinas` task entry to tasks_list.json.
Documentation & Skill Guides `AGENTS.md`, `README.md`, `docs/contribute_tasks.md`, `.agents/skills/add-provider/SKILL.md`, `.agents/skills/add-task/SKILL.md`, `.agents/skills/evaluate/SKILL.md`, `.agents/skills/interpret-run/SKILL.md`	Expanded AGENTS.md with Agent Rules guidance (samples, multimodal endpoints, document_extraction workflow), MCP Server section, and Validation Script section; added four new skill guides for adding providers, tasks, evaluating, and interpreting runs; updated CLI examples in README and task example paths.
Interface & Code Documentation `src/interfaces/surus/surus_ocr_interface.py`, `env.example`	Updated interface docstrings from image extraction to document extraction terminology; updated SURUS API key comment URL to new domain.
Dependencies & Build `pyproject.toml`	Added platform-restricted vllm dependency (`linux` only), new `benchy-mcp` script entry point, and optional `mcp` dependency group.
Tests & Dataset Handling `tests/test_facturas_argentinas_dataset_id.py`	Enhanced dataset ID resolution to flexibly detect URL from multiple metadata keys (`URL`, `url`, `Url`, `dataset_url`) with validation guard.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

New CLI and tests #18: Adds ad-hoc task/CLI and registry support; complements deprecation and task-renaming handling in registry and CLI modules.
Added initial CLI for models #16: Introduces image_extraction → document_extraction task surface and related config/CLI/task module updates; overlaps on task metadata and CLI wiring.
Code review and final refactor #8: Refactors task registry and group runner with related task metadata; intersects on deprecation and registry handling for image_extraction migration.

Poem

🐰 Hops through configs with glee,
Document extraction, wild and free,
MCP server springs to life,
Validation gates cut through the strife,
Deprecated tasks say their goodbye,
As benchy's future reaches the sky! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title is vague and generic, using non-descriptive shorthand ('Feat/mcp+skills') that does not clearly convey what changes are being made.	Revise the title to be more descriptive and specific, such as 'Add MCP server and agent skills for validation and task management' or a similar clear summary of the main changes.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	Docstring coverage is 94.44% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/mcp+skills

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 9

🧹 Nitpick comments (6)

configs/systems/surus-remove-background.yaml (1)
9-9: Clean up the stale inline endpoint comment on Line 9.

Keeping only the active endpoint improves readability and avoids accidental reuse of deprecated URLs.
Proposed cleanup
-  endpoint: 'https://api.surus.lat/functions/v1/remove-background' #"https://cuqhdcqkpspghnjbrged.supabase.co/functions/v1/remove-background" #
+  endpoint: "https://api.surus.lat/functions/v1/remove-background"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@configs/systems/surus-remove-background.yaml` at line 9, Remove the stale
inline commented URL on the endpoint line so the config contains only the active
endpoint value; locate the 'endpoint' entry (currently set to
'https://api.surus.lat/functions/v1/remove-background') and delete the trailing
commented URL and comment characters that reference the old Supabase URL,
leaving a single clean endpoint string.
skills/interpret-run/SKILL.md (1)
22-24: Apply markdownlint cleanup for fenced blocks/emphasis formatting.

The current markdown has fence/emphasis issues flagged by lint; cleaning them will keep docs CI-friendly.

Also applies to: 80-84, 259-259
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/interpret-run/SKILL.md` around lines 22 - 24, Fix markdown fence and
emphasis formatting in SKILL.md by ensuring all fenced code blocks use a proper
triple-backtick opening and closing fence (```) with no stray backticks or
inline emphasis inside the fence; search the file for the code block around the
list ending with "└── *_report.txt" and the other affected ranges and replace
malformed fences or stray asterisks so that fenced blocks are closed correctly
and emphasis uses single underscores or asterisks outside code fences as per
markdownlint rules.
src/mcp/server.py (1)
46-61: Consider reusing central config path resolution for output path lookup.

This function re-parses YAML directly. Using the existing config-loading path resolution logic would avoid drift (e.g., env-var/user-path expansion behavior differences).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/mcp/server.py` around lines 46 - 61, The _get_output_path function
currently re-parses YAML directly; replace that ad-hoc logic with the project's
central config-loading/path-resolution routine (import and call the shared
loader, e.g., load_config or equivalent) so you obtain the resolved config
object and then read cfg.get("paths", {}).get("benchmark_outputs",
"outputs/benchmark_outputs") from it; ensure you return a Path that preserves
the loader's env-var and user-path expansion behavior instead of manually
opening/parsing configs/config.yaml.
skills/add-task/SKILL.md (1)
184-186: Consider adding --run-id in the smoke-test example.

Including a run ID in the sample command improves reproducibility and resume/debug workflows.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/add-task/SKILL.md` around lines 184 - 186, Update the smoke-test
example command (the `benchy eval ... --exit-policy smoke` invocation) to
include an explicit `--run-id` flag so runs are reproducible and resumable; add
a sample run id (e.g., `--run-id smoke-<date-or-uuid>`) to the example and
mention that it should be unique per run.
src/tasks/registry.py (1)
17-20: Centralize deprecated-task mapping to avoid drift.

The same rename map now exists in multiple modules; move it to a shared constant/module and import it from both call sites.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tasks/registry.py` around lines 17 - 20, The duplicate deprecated-task
mapping in _DEPRECATED_TASK_GROUPS should be centralized: extract the mapping
into a new shared constant (e.g., DEPRECATED_TASK_GROUPS) in a shared module
(for example a new src/tasks/constants.py or existing shared config module),
replace the local _DEPRECATED_TASK_GROUPS in src/tasks/registry.py with an
import from that shared module, and update any other modules that currently
define the same mapping to import the single shared constant instead; ensure
_INTERNAL_TASK_GROUPS remains local if needed and update import paths/usages
(references to _DEPRECATED_TASK_GROUPS) to use the new shared symbol name.
src/benchy_cli_eval.py (1)
76-78: Consider a shared source of truth for deprecated task mappings.

This duplicates _DEPRECATED_TASK_GROUPS in src/tasks/registry.py (lines 18-20). While the current approach works and provides appropriate CLI vs registry error handling, a shared constant would prevent potential drift.
Example: Extract to shared module

Create a shared constants module (e.g., src/tasks/deprecations.py):
DEPRECATED_TASK_GROUPS = {
    "image_extraction": "document_extraction",
}
Then import from both benchy_cli_eval.py and registry.py.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/benchy_cli_eval.py` around lines 76 - 78, Replace the duplicated
deprecated-task mapping with a single shared constant: extract the mapping
currently defined as DEPRECATED_TASKS in benchy_cli_eval.py and
_DEPRECATED_TASK_GROUPS in registry.py into a new module (e.g., export
DEPRECATED_TASK_GROUPS from src/tasks/deprecations.py), then import that shared
constant in both benchy_cli_eval.py and registry.py and update references to use
the shared symbol (DEPRECATED_TASK_GROUPS) so both CLI and registry use the same
source of truth.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@reference/tasks_list.json`:
- Around line 241-246: The facturas_argentinas task entry uses the non-standard
field dataset_url; update the metadata for the task named "facturas_argentinas"
by renaming dataset_url to URL so it matches the canonical field used by the
normalization in generate_reference_files.py (which normalizes url/Url → URL),
ensuring the generated tasks_list.json uses the consistent URL key across all
document_extraction tasks.

In `@scripts/validate_run.py`:
- Around line 49-56: The current logic in validate_run.py auto-selects the first
directory when model_name is not provided, which can silently validate the wrong
model; change the logic in the block that computes model_dir (the variables
model_name, run_dir, candidates) so that if model_name is None and more than one
candidate exists you raise a clear error (e.g., ValueError or FileNotFoundError)
that lists the candidate directory names and instructs the caller to pass the
desired model_name instead of silently picking sorted(candidates)[0]; keep the
existing fallback of selecting the single candidate when there is exactly one.
- Around line 69-97: The validation currently ignores the run's exit code so a
non-zero exit can be marked passed; fetch exit_code = outcome.get("exit_code")
(or 0 default), add a violation when exit_code != 0 (e.g.,
violations.append(f"exit_code = {exit_code} (expected: 0)")), ensure the final
"passed" uses the updated violations and include the exit_code in the returned
dict (similar to "status" and "duration_s") so callers can see it; update
references around the existing status/counts checks (variables: status,
VALID_STATUSES, counts, BLOCKING_COUNT_KEYS, outcome, violations) accordingly.

In `@skills/evaluate/SKILL.md`:
- Around line 21-22: The README uses two different output-path placeholders
(`<model_name_last_segment>` and `<model_name>`) which can cause drift; pick one
placeholder and make it consistent across the file (e.g., replace all
occurrences of `<model_name_last_segment>` with `<model_name>` or vice versa),
updating every path example such as
outputs/benchmark_outputs/<run_id>/<model_name_last_segment>/run_outcome.json
and the occurrences near lines referencing run_outcome.json so the same
placeholder is used everywhere.

In `@skills/interpret-run/SKILL.md`:
- Around line 44-53: The JSON example uses a non-canonical key ("total_tasks");
update the example to use the canonical run outcome counters used by the
runtime: replace "total_tasks" with "requested_tasks" and add/ensure
"completed_tasks" is present alongside the existing keys (e.g.,
"requested_tasks", "completed_tasks", "passed_tasks", "degraded_tasks",
"failed_tasks", "skipped_tasks", "error_tasks", "no_samples_tasks",
"pending_tasks") so the example matches the actual run counters exposed by the
system.
- Around line 236-240: The example incorrectly uses open() with a glob pattern;
replace the single
open("outputs/benchmark_outputs/<run_id>/<model>/<task>/<subtask>/*_samples.json")
call with code that expands the glob (e.g., using glob.glob or
pathlib.Path.rglob) and iterates over the matching file paths, opening each file
and loading its JSON into data before unwrapping the envelope (referencing the
existing variable names data and all_samples); ensure you handle the case of no
matches and aggregate or process each file’s samples accordingly.

In `@src/mcp/server.py`:
- Around line 92-96: The code builds a candidates list and silently picks the
first directory into model_dir when model_name is omitted, which is
nondeterministic when multiple model outputs exist; change the logic in
src/mcp/server.py around the candidates/model_dir handling (references:
variables candidates, model_dir, model_name, base) to detect multiple candidate
directories and either (a) raise a clear error asking the caller to supply
model_name when more than one directory is found, or (b) deterministically
choose one (e.g., sort by name or use newest mtime) and emit a warning/log
explaining the choice; implement this change at all places that currently take
candidates[0] (the blocks referenced at lines ~95, ~316-320, ~362-365) so
selection is deterministic or explicitly disambiguated.
- Around line 83-90: The code composes filesystem paths from user-controlled
run_id and model_name (e.g., base = _get_output_path() / run_id and model_dir =
base / model_name) without confinement checks, allowing path traversal; fix by
normalizing and validating paths after composition: use pathlib.Path.resolve()
on the composed path(s) (e.g., resolved_base = ( _get_output_path() / run_id
).resolve() and resolved_model = (resolved_base / model_name).resolve()), then
ensure each resolved path is inside the expected output root (compare with
_get_output_path().resolve() via Path.is_relative_to() or os.path.commonpath)
and return an error if not; also disallow absolute inputs or components
containing '..' as an early check before using the values in functions/methods
that build paths.
- Around line 117-126: The smoke-gate validation currently only checks
run_outcome.status and blocking counts; update the validation logic to also
require run_outcome.exit_code == 0 when computing the 'passed' boolean (and when
adding to 'violations' if non-zero), i.e., include exit_code check alongside
status and counts in the same places that build the returned dict with 'passed',
'status', 'counts', and 'violations'; apply the same change to the second
occurrence of this logic (the other block referenced around lines 132-157) so
both code paths fail the gate for non-zero exit codes.

---

Nitpick comments:
In `@configs/systems/surus-remove-background.yaml`:
- Line 9: Remove the stale inline commented URL on the endpoint line so the
config contains only the active endpoint value; locate the 'endpoint' entry
(currently set to 'https://api.surus.lat/functions/v1/remove-background') and
delete the trailing commented URL and comment characters that reference the old
Supabase URL, leaving a single clean endpoint string.

In `@skills/add-task/SKILL.md`:
- Around line 184-186: Update the smoke-test example command (the `benchy eval
... --exit-policy smoke` invocation) to include an explicit `--run-id` flag so
runs are reproducible and resumable; add a sample run id (e.g., `--run-id
smoke-<date-or-uuid>`) to the example and mention that it should be unique per
run.

In `@skills/interpret-run/SKILL.md`:
- Around line 22-24: Fix markdown fence and emphasis formatting in SKILL.md by
ensuring all fenced code blocks use a proper triple-backtick opening and closing
fence (```) with no stray backticks or inline emphasis inside the fence; search
the file for the code block around the list ending with "└── *_report.txt" and
the other affected ranges and replace malformed fences or stray asterisks so
that fenced blocks are closed correctly and emphasis uses single underscores or
asterisks outside code fences as per markdownlint rules.

In `@src/benchy_cli_eval.py`:
- Around line 76-78: Replace the duplicated deprecated-task mapping with a
single shared constant: extract the mapping currently defined as
DEPRECATED_TASKS in benchy_cli_eval.py and _DEPRECATED_TASK_GROUPS in
registry.py into a new module (e.g., export DEPRECATED_TASK_GROUPS from
src/tasks/deprecations.py), then import that shared constant in both
benchy_cli_eval.py and registry.py and update references to use the shared
symbol (DEPRECATED_TASK_GROUPS) so both CLI and registry use the same source of
truth.

In `@src/mcp/server.py`:
- Around line 46-61: The _get_output_path function currently re-parses YAML
directly; replace that ad-hoc logic with the project's central
config-loading/path-resolution routine (import and call the shared loader, e.g.,
load_config or equivalent) so you obtain the resolved config object and then
read cfg.get("paths", {}).get("benchmark_outputs", "outputs/benchmark_outputs")
from it; ensure you return a Path that preserves the loader's env-var and
user-path expansion behavior instead of manually opening/parsing
configs/config.yaml.

In `@src/tasks/registry.py`:
- Around line 17-20: The duplicate deprecated-task mapping in
_DEPRECATED_TASK_GROUPS should be centralized: extract the mapping into a new
shared constant (e.g., DEPRECATED_TASK_GROUPS) in a shared module (for example a
new src/tasks/constants.py or existing shared config module), replace the local
_DEPRECATED_TASK_GROUPS in src/tasks/registry.py with an import from that shared
module, and update any other modules that currently define the same mapping to
import the single shared constant instead; ensure _INTERNAL_TASK_GROUPS remains
local if needed and update import paths/usages (references to
_DEPRECATED_TASK_GROUPS) to use the new shared symbol name.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8683461a-4209-4391-9667-8bf21e68a2f9

📥 Commits

Reviewing files that changed from the base of the PR and between e930906 and 01fdcd0.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (34)

AGENTS.md
README.md
configs/config.yaml
configs/models/openai_gpt-4o-mini.yaml
configs/models/openai_gpt-5-mini.yaml
configs/models/together_Apriel-1.6-15b-Thinker.yaml
configs/models/together_gemma-3n-E4B-it.yaml
configs/models/together_qwen2.5-vl-72b-instruct.yaml
configs/models/together_qwen3-next-80B-A3B-instruct.yaml
configs/models/zai-org_GLM-4.6V-Flash_single.yaml
configs/systems/README.md
configs/systems/surus-extract.yaml
configs/systems/surus-factura.yaml
configs/systems/surus-ocr.yaml
configs/systems/surus-remove-background.yaml
configs/tests/document-extract-gpt4.yaml
docs/contribute_tasks.md
env.example
pyproject.toml
reference/tasks_groups.json
reference/tasks_list.json
scripts/validate_run.py
skills/add-provider/SKILL.md
skills/add-task/SKILL.md
skills/evaluate/SKILL.md
skills/interpret-run/SKILL.md
src/benchy_cli_eval.py
src/interfaces/README.md
src/interfaces/surus/surus_ocr_interface.py
src/mcp/__init__.py
src/mcp/server.py
src/tasks/document_extraction/metadata.yaml
src/tasks/registry.py
tests/test_cli_arguments.py

coderabbitai · 2026-03-05T20:53:56Z

+outputs/benchmark_outputs/<run_id>/<model_name_last_segment>/run_outcome.json
+```


⚠️ Potential issue | 🟡 Minor

Use one output-path placeholder consistently.

Line 21 uses <model_name_last_segment> while Line 110 uses <model_name>. Pick one convention to avoid script/readme drift.

Also applies to: 110-111

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@skills/evaluate/SKILL.md` around lines 21 - 22, The README uses two different output-path placeholders (`<model_name_last_segment>` and `<model_name>`) which can cause drift; pick one placeholder and make it consistent across the file (e.g., replace all occurrences of `<model_name_last_segment>` with `<model_name>` or vice versa), updating every path example such as outputs/benchmark_outputs/<run_id>/<model_name_last_segment>/run_outcome.json and the occurrences near lines referencing run_outcome.json so the same placeholder is used everywhere.

coderabbitai · 2026-03-05T20:53:56Z

+  "counts": {
+    "total_tasks": 3,
+    "passed_tasks": 3,
+    "degraded_tasks": 0,
+    "failed_tasks": 0,
+    "skipped_tasks": 0,
+    "error_tasks": 0,
+    "no_samples_tasks": 0,
+    "pending_tasks": 0
+  },


⚠️ Potential issue | 🟡 Minor

Use canonical counts keys in the JSON example.

Line 45 uses total_tasks, but run outcomes expose task counters like requested_tasks / completed_tasks / passed_tasks etc. This example can mislead automation readers.

🔧 Suggested patch

"counts": { - "total_tasks": 3, + "requested_tasks": 3, + "completed_tasks": 3, "passed_tasks": 3, "degraded_tasks": 0, "failed_tasks": 0,

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@skills/interpret-run/SKILL.md` around lines 44 - 53, The JSON example uses a non-canonical key ("total_tasks"); update the example to use the canonical run outcome counters used by the runtime: replace "total_tasks" with "requested_tasks" and add/ensure "completed_tasks" is present alongside the existing keys (e.g., "requested_tasks", "completed_tasks", "passed_tasks", "degraded_tasks", "failed_tasks", "skipped_tasks", "error_tasks", "no_samples_tasks", "pending_tasks") so the example matches the actual run counters exposed by the system.

coderabbitai · 2026-03-05T20:53:56Z

+with open("outputs/benchmark_outputs/<run_id>/<model>/<task>/<subtask>/*_samples.json") as f:
+    data = json.load(f)
+
+# Unwrap the envelope
+all_samples = data["samples"] if isinstance(data, dict) else data


⚠️ Potential issue | 🟡 Minor

Fix the non-runnable wildcard file-open example.

Line 236 uses open(".../*_samples.json"), but open() does not expand globs.

🔧 Suggested patch

import json -with open("outputs/benchmark_outputs/<run_id>/<model>/<task>/<subtask>/*_samples.json") as f: - data = json.load(f) +from pathlib import Path + +samples_path = next( + Path("outputs/benchmark_outputs/<run_id>/<model>/<task>/<subtask>").glob("*_samples.json") +) +with open(samples_path) as f: + data = json.load(f)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@skills/interpret-run/SKILL.md` around lines 236 - 240, The example incorrectly uses open() with a glob pattern; replace the single open("outputs/benchmark_outputs/<run_id>/<model>/<task>/<subtask>/*_samples.json") call with code that expands the glob (e.g., using glob.glob or pathlib.Path.rglob) and iterates over the matching file paths, opening each file and loading its JSON into data before unwrapping the envelope (referencing the existing variable names data and all_samples); ensure you handle the case of no matches and aggregate or process each file’s samples accordingly.

Refactor dataset URL keys in metadata and task files; enhance error handling in model directory resolution

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (4)

scripts/validate_run.py (1)

118-129: Avoid silently swallowing all config parsing errors.

Line 127 catches every exception and silently falls back to outputs/benchmark_outputs, which can hide bad config and validate the wrong run directory. Consider catching only expected errors and surfacing parse/config failures in stderr.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@scripts/validate_run.py` around lines 118 - 129, In _default_output_path(),
don't silently swallow all exceptions; instead catch only expected errors (e.g.,
ImportError when yaml is missing, FileNotFoundError/OSError when reading the
file, and yaml.YAMLError when parsing) and handle them by writing a clear
message to stderr or the module logger (including the exception text) before
falling back, while allowing unexpected exceptions to propagate; update the
try/except around the yaml import/open/safe_load block in _default_output_path
to catch those specific exceptions and log/print the error (with context
mentioning "configs/config.yaml") rather than using a bare except.

src/mcp/server.py (3)

50-60: Specify encoding and narrow exception handling.

Line 55 opens the file without an explicit encoding (defaulting to platform locale). Line 58 catches all exceptions, which could mask YAML syntax errors or permission issues during development.

♻️ Suggested improvement

     try:
         import yaml

         cfg_path = Path("configs/config.yaml")
         if cfg_path.exists():
-            with open(cfg_path) as f:
+            with open(cfg_path, encoding="utf-8") as f:
                 cfg = yaml.safe_load(f)
             return Path(cfg.get("paths", {}).get("benchmark_outputs", "outputs/benchmark_outputs"))
-    except Exception:
+    except (OSError, yaml.YAMLError):
         pass
     return Path("outputs/benchmark_outputs")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/mcp/server.py` around lines 50 - 60, The file-reading block uses
open(...) without encoding and catches all exceptions; update the cfg_path open
to specify encoding="utf-8" and replace the broad except with targeted handlers:
catch FileNotFoundError / PermissionError for filesystem issues and
yaml.YAMLError for parse problems around yaml.safe_load, and let unexpected
exceptions propagate (or re-raise) so they aren't silently swallowed; refer to
cfg_path, cfg, and yaml.safe_load to locate and adjust the try/except scope
accordingly.

156-157: Specify UTF-8 encoding on file opens.

Multiple open() calls throughout the file (lines 156, 251, 332, 386, 439) read JSON files without explicit encoding. On some platforms, the default encoding may not be UTF-8, potentially causing decode errors for non-ASCII content.

♻️ Example fix (apply similarly to other locations)

-    with open(outcome_path) as f:
+    with open(outcome_path, encoding="utf-8") as f:
         return json.load(f)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/mcp/server.py` around lines 156 - 157, The file reads JSON files using
plain open() which relies on system default encoding; update each open call
(e.g., the one reading outcome_path in the block returning json.load(f) and the
other opens at the locations noted) to explicitly specify encoding="utf-8" (use
with open(outcome_path, encoding="utf-8") as f: and likewise for the other
filenames) so non-ASCII content decodes consistently across platforms; apply
this change to all similar calls in server.py (the open calls reading JSON at
the commented locations).

295-296: Inconsistent error format.

Other tools return {"error": "string"}, but this returns {"error": [list]}. This inconsistency could confuse MCP tool consumers expecting a uniform error schema.

♻️ Suggested fix

     if kind != "all" and kind not in dirs:
-        return {"error": [f"Unknown kind '{kind}'. Use: models, systems, tests, providers, all"]}
+        return {"error": f"Unknown kind '{kind}'. Use: models, systems, tests, providers, all"}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/mcp/server.py` around lines 295 - 296, The error response for unknown
'kind' is returning a list under the "error" key which is inconsistent with
other handlers; update the return in the branch that checks "if kind != 'all'
and kind not in dirs" so it returns a string error (e.g. {"error": f"Unknown
kind '{kind}'. Use: models, systems, tests, providers, all"}) instead of a list;
locate the return statement in src/mcp/server.py that references the variable
kind and replace the list-wrapped f-string with a plain f-string to match the
{"error": "string"} schema used elsewhere.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/validate_run.py`:
- Around line 165-166: Wrap the call to validate_smoke_gates(outcome) and the
print(json.dumps(result, indent=2)) in a try/except that catches Exception,
builds a structured JSON error envelope (e.g., {"status":"error","message":
"Validation failed","error": str(e), "traceback": traceback.format_exc()}), and
prints that JSON instead of letting the exception propagate; use the
validate_smoke_gates function name and the existing result/print logic to locate
where to add the try/except and import traceback if not already present.
- Around line 73-87: The code currently coerces missing/null fields to harmless
defaults (exit_code -> 0, counts keys -> 0), masking missing required fields;
instead, detect absent or null top-level keys and treat them as violations.
Update validation in validate_run.py to (1) define the required top-level keys
list (schema_version, benchy_version, model, run_id, status, exit_policy,
exit_code, started_at, ended_at, duration_s, counts, tasks, invocation, git,
artifacts, errors) and append violations if any are missing or None; (2) for
status, use outcome.get("status") and if missing/None append a missing-field
violation, otherwise check membership against VALID_STATUSES; (3) for exit_code,
do not coerce to 0 — if outcome lacks "exit_code" or it is None add a violation,
else check it equals 0; and (4) ensure counts is present and a dict before
iterating BLOCKING_COUNT_KEYS, emitting violations if counts is missing or if
any blocking key is missing/None or >0. Reference symbols: outcome,
VALID_STATUSES, BLOCKING_COUNT_KEYS, exit_code, counts.

In `@src/mcp/server.py`:
- Around line 428-432: task_name is currently turned into a path with
task_name.replace(".", "/") which allows absolute paths and traversal; instead
validate and canonicalize the segments before joining: split task_name on "."
and reject any empty segments, segments containing path separators or "..", or
segments that start with "/" or are absolute; ensure each segment matches a safe
pattern (e.g., alphanumeric, underscores, hyphens), then build task_path via
model_dir.joinpath(*segments) and verify the resulting Path.resolve() is
contained within model_dir.resolve() (use Path.is_relative_to if available)
before using it.

In `@src/tasks/document_extraction/metadata.yaml`:
- Line 24: The metadata uses the key "URL" which the registry normalization
doesn't promote; change the subtask dataset key to "dataset_url" (replace URL:
with dataset_url:) in metadata.yaml so the dataset reference is recognized, or
alternatively add compatibility handling in the registry normalization logic to
treat "URL" as an alias for "dataset_url" (detect "URL" and map it to
"dataset_url" before assembly).

In `@tests/test_facturas_argentinas_dataset_id.py`:
- Around line 31-38: The test currently accepts multiple URL key aliases from
task_metadata (via the url variable) which can mask runtime breakage; update the
assertion to require the canonical key "dataset_url" exists in task_metadata
(i.e., check task_metadata.get("dataset_url") is not None) and remove the
fallback checks for "URL", "url", and "Url" so the test enforces the runtime
metadata contract used by the production code (refer to task_metadata and the
dataset_url key in this test).

---

Nitpick comments:
In `@scripts/validate_run.py`:
- Around line 118-129: In _default_output_path(), don't silently swallow all
exceptions; instead catch only expected errors (e.g., ImportError when yaml is
missing, FileNotFoundError/OSError when reading the file, and yaml.YAMLError
when parsing) and handle them by writing a clear message to stderr or the module
logger (including the exception text) before falling back, while allowing
unexpected exceptions to propagate; update the try/except around the yaml
import/open/safe_load block in _default_output_path to catch those specific
exceptions and log/print the error (with context mentioning
"configs/config.yaml") rather than using a bare except.

In `@src/mcp/server.py`:
- Around line 50-60: The file-reading block uses open(...) without encoding and
catches all exceptions; update the cfg_path open to specify encoding="utf-8" and
replace the broad except with targeted handlers: catch FileNotFoundError /
PermissionError for filesystem issues and yaml.YAMLError for parse problems
around yaml.safe_load, and let unexpected exceptions propagate (or re-raise) so
they aren't silently swallowed; refer to cfg_path, cfg, and yaml.safe_load to
locate and adjust the try/except scope accordingly.
- Around line 156-157: The file reads JSON files using plain open() which relies
on system default encoding; update each open call (e.g., the one reading
outcome_path in the block returning json.load(f) and the other opens at the
locations noted) to explicitly specify encoding="utf-8" (use with
open(outcome_path, encoding="utf-8") as f: and likewise for the other filenames)
so non-ASCII content decodes consistently across platforms; apply this change to
all similar calls in server.py (the open calls reading JSON at the commented
locations).
- Around line 295-296: The error response for unknown 'kind' is returning a list
under the "error" key which is inconsistent with other handlers; update the
return in the branch that checks "if kind != 'all' and kind not in dirs" so it
returns a string error (e.g. {"error": f"Unknown kind '{kind}'. Use: models,
systems, tests, providers, all"}) instead of a list; locate the return statement
in src/mcp/server.py that references the variable kind and replace the
list-wrapped f-string with a plain f-string to match the {"error": "string"}
schema used elsewhere.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 56ee8f5b-5cf7-4cce-a1bb-41c162ab06cb

📥 Commits

Reviewing files that changed from the base of the PR and between 01fdcd0 and 42d5620.

📒 Files selected for processing (5)

reference/tasks_list.json
scripts/validate_run.py
src/mcp/server.py
src/tasks/document_extraction/metadata.yaml
tests/test_facturas_argentinas_dataset_id.py

🚧 Files skipped from review as they are similar to previous changes (1)

reference/tasks_list.json

coderabbitai · 2026-03-05T22:22:25Z

+    status = outcome.get("status", "unknown")
+    exit_code = outcome.get("exit_code", 0) or 0
+    counts = outcome.get("counts", {})
+    violations: list[str] = []
+
+    if status not in VALID_STATUSES:
+        violations.append(f"status is '{status}' (expected: passed or degraded)")
+
+    if exit_code != 0:
+        violations.append(f"exit_code = {exit_code} (expected: 0)")
+
+    for key in BLOCKING_COUNT_KEYS:
+        val = counts.get(key, 0) or 0
+        if val > 0:
+            violations.append(f"{key} = {val} (expected: 0)")


⚠️ Potential issue | 🟠 Major

Fail-closed on missing required run_outcome.json fields.

Line 74 and Line 85 currently coerce missing/null values to passing defaults (exit_code -> 0, count keys -> 0). That can produce false positives when the outcome schema is incomplete. Missing required fields should be violations.

Suggested patch

def validate_smoke_gates(outcome: dict[str, Any]) -> dict[str, Any]: """Apply AGENTS.md smoke gates and return a structured result.""" - status = outcome.get("status", "unknown") - exit_code = outcome.get("exit_code", 0) or 0 - counts = outcome.get("counts", {}) + status = outcome.get("status", "unknown") + exit_code = outcome.get("exit_code") + counts = outcome.get("counts") violations: list[str] = [] + required_top_level = { + "schema_version", "benchy_version", "model", "run_id", "status", + "exit_policy", "exit_code", "started_at", "ended_at", "duration_s", + "counts", "tasks", "invocation", "git", "artifacts", "errors", + } + missing_top_level = sorted(k for k in required_top_level if k not in outcome) + if missing_top_level: + violations.append(f"missing top-level keys: {', '.join(missing_top_level)}") + if status not in VALID_STATUSES: violations.append(f"status is '{status}' (expected: passed or degraded)") + if not isinstance(exit_code, int): + violations.append(f"exit_code has invalid type: {type(exit_code).__name__} (expected: int)") + elif exit_code != 0: + violations.append(f"exit_code = {exit_code} (expected: 0)") + - if exit_code != 0: - violations.append(f"exit_code = {exit_code} (expected: 0)") + if not isinstance(counts, dict): + violations.append("counts is missing or invalid (expected: object)") + counts = {} for key in BLOCKING_COUNT_KEYS: - val = counts.get(key, 0) or 0 - if val > 0: + val = counts.get(key) + if not isinstance(val, int): + violations.append(f"{key} has invalid type: {type(val).__name__} (expected: int)") + elif val > 0: violations.append(f"{key} = {val} (expected: 0)")

Based on learnings: Ensure run_outcome.json contains top-level keys: schema_version, benchy_version, model, run_id, status, exit_policy, exit_code, started_at, ended_at, duration_s, counts, tasks, invocation, git, artifacts, and errors.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

status = outcome.get("status", "unknown")

exit_code = outcome.get("exit_code", 0) or 0

counts = outcome.get("counts", {})

violations: list[str] = []

if status not in VALID_STATUSES:

violations.append(f"status is '{status}' (expected: passed or degraded)")

if exit_code != 0:

violations.append(f"exit_code = {exit_code} (expected: 0)")

for key in BLOCKING_COUNT_KEYS:

val = counts.get(key, 0) or 0

if val > 0:

violations.append(f"{key} = {val} (expected: 0)")

status = outcome.get("status", "unknown")

exit_code = outcome.get("exit_code")

counts = outcome.get("counts")

violations: list[str] = []

required_top_level = {

"schema_version", "benchy_version", "model", "run_id", "status",

"exit_policy", "exit_code", "started_at", "ended_at", "duration_s",

"counts", "tasks", "invocation", "git", "artifacts", "errors",

}

missing_top_level = sorted(k for k in required_top_level if k not in outcome)

if missing_top_level:

violations.append(f"missing top-level keys: {', '.join(missing_top_level)}")

if status not in VALID_STATUSES:

violations.append(f"status is '{status}' (expected: passed or degraded)")

if not isinstance(exit_code, int):

violations.append(f"exit_code has invalid type: {type(exit_code).__name__} (expected: int)")

elif exit_code != 0:

violations.append(f"exit_code = {exit_code} (expected: 0)")

if not isinstance(counts, dict):

violations.append("counts is missing or invalid (expected: object)")

counts = {}

for key in BLOCKING_COUNT_KEYS:

val = counts.get(key)

if not isinstance(val, int):

violations.append(f"{key} has invalid type: {type(val).__name__} (expected: int)")

elif val > 0:

violations.append(f"{key} = {val} (expected: 0)")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/validate_run.py` around lines 73 - 87, The code currently coerces missing/null fields to harmless defaults (exit_code -> 0, counts keys -> 0), masking missing required fields; instead, detect absent or null top-level keys and treat them as violations. Update validation in validate_run.py to (1) define the required top-level keys list (schema_version, benchy_version, model, run_id, status, exit_policy, exit_code, started_at, ended_at, duration_s, counts, tasks, invocation, git, artifacts, errors) and append violations if any are missing or None; (2) for status, use outcome.get("status") and if missing/None append a missing-field violation, otherwise check membership against VALID_STATUSES; (3) for exit_code, do not coerce to 0 — if outcome lacks "exit_code" or it is None add a violation, else check it equals 0; and (4) ensure counts is present and a dict before iterating BLOCKING_COUNT_KEYS, emitting violations if counts is missing or if any blocking key is missing/None or >0. Reference symbols: outcome, VALID_STATUSES, BLOCKING_COUNT_KEYS, exit_code, counts.

coderabbitai · 2026-03-05T22:22:25Z

+    result = validate_smoke_gates(outcome)
+    print(json.dumps(result, indent=2))


⚠️ Potential issue | 🟠 Major

Validation exceptions are not guarded, so JSON output can break.

If validate_smoke_gates(outcome) raises (e.g., malformed tasks/counts shape), the script exits with traceback instead of the structured JSON error envelope used earlier. Wrap validation/emission in a try/except and return a JSON error payload.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/validate_run.py` around lines 165 - 166, Wrap the call to validate_smoke_gates(outcome) and the print(json.dumps(result, indent=2)) in a try/except that catches Exception, builds a structured JSON error envelope (e.g., {"status":"error","message": "Validation failed","error": str(e), "traceback": traceback.format_exc()}), and prints that JSON instead of letting the exception propagate; use the validate_smoke_gates function name and the existing result/print logic to locate where to add the try/except and import traceback if not already present.

coderabbitai · 2026-03-05T22:22:25Z

+    # Support dotted subtask path (e.g. document_extraction.facturas_argentinas → document_extraction/facturas_argentinas)
+    task_path = model_dir / task_name.replace(".", "/")
+
+    if not task_path.exists():
+        return {"error": f"Task directory not found: {task_path}"}


⚠️ Potential issue | 🔴 Critical

Path traversal vulnerability in task_name handling.

The task_name.replace(".", "/") conversion can produce absolute paths that bypass the model directory containment. For example:

task_name = ".etc.passwd" → becomes "/etc/passwd"

In pathlib, Path(model_dir) / "/etc/passwd" returns Path("/etc/passwd"), escaping the output root entirely.

This is distinct from the run_id/model_name path traversal addressed earlier—the task_name parameter lacks validation.

🔒 Suggested fix: validate task_name segments

# Support dotted subtask path (e.g. document_extraction.facturas_argentinas → document_extraction/facturas_argentinas) - task_path = model_dir / task_name.replace(".", "/") + task_parts = task_name.split(".") + try: + for part in task_parts: + _validate_relative_component(part, "task_name segment") + except ValueError as exc: + return {"error": str(exc)} + + task_path = model_dir + for part in task_parts: + task_path = task_path / part + task_path = task_path.resolve() + + if not _is_relative_to(task_path, model_dir): + return {"error": "task_name resolves outside the model directory"} if not task_path.exists(): return {"error": f"Task directory not found: {task_path}"}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/mcp/server.py` around lines 428 - 432, task_name is currently turned into a path with task_name.replace(".", "/") which allows absolute paths and traversal; instead validate and canonicalize the segments before joining: split task_name on "." and reject any empty segments, segments containing path separators or "..", or segments that start with "/" or are absolute; ensure each segment matches a safe pattern (e.g., alphanumeric, underscores, hyphens), then build task_path via model_dir.joinpath(*segments) and verify the resulting Path.resolve() is contained within model_dir.resolve() (use Path.is_relative_to if available) before using it.

coderabbitai · 2026-03-05T22:22:25Z

  facturas_argentinas:
    description: Extract structured invoice data from Argentinian factura images
-    dataset_url: https://huggingface.co/datasets/mauroibz/facturas_argentinas_2
+    URL: https://huggingface.co/datasets/mauroibz/facturas_argentinas_2


⚠️ Potential issue | 🟠 Major

Use dataset_url (or add compatibility handling) instead of URL.

Line 24 changed the subtask dataset key to URL, but registry normalization only promotes dataset_url (and related dataset keys). This means the dataset reference can be silently skipped in runtime config assembly.

Suggested fix

facturas_argentinas: description: Extract structured invoice data from Argentinian factura images - URL: https://huggingface.co/datasets/mauroibz/facturas_argentinas_2 + dataset_url: https://huggingface.co/datasets/mauroibz/facturas_argentinas_2 + # Optional compatibility alias if needed by external tooling: + URL: https://huggingface.co/datasets/mauroibz/facturas_argentinas_2

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/tasks/document_extraction/metadata.yaml` at line 24, The metadata uses the key "URL" which the registry normalization doesn't promote; change the subtask dataset key to "dataset_url" (replace URL: with dataset_url:) in metadata.yaml so the dataset reference is recognized, or alternatively add compatibility handling in the registry normalization logic to treat "URL" as an alias for "dataset_url" (detect "URL" and map it to "dataset_url" before assembly).

coderabbitai · 2026-03-05T22:22:25Z

+    task_metadata = metadata["subtasks"]["facturas_argentinas"]
+    url = (
+        task_metadata.get("URL")
+        or task_metadata.get("url")
+        or task_metadata.get("Url")
+        or task_metadata.get("dataset_url")
+    )
+    assert url is not None


⚠️ Potential issue | 🟠 Major

Test should enforce the runtime metadata contract, not tolerate unsupported aliases.

This block now accepts multiple key variants, but production normalization currently relies on dataset_url. The test can pass even when metadata uses keys the runtime ignores, masking real breakage.

Suggested fix

- task_metadata = metadata["subtasks"]["facturas_argentinas"] - url = ( - task_metadata.get("URL") - or task_metadata.get("url") - or task_metadata.get("Url") - or task_metadata.get("dataset_url") - ) + task_metadata = metadata["subtasks"]["facturas_argentinas"] + url = task_metadata.get("dataset_url") assert url is not None

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/test_facturas_argentinas_dataset_id.py` around lines 31 - 38, The test currently accepts multiple URL key aliases from task_metadata (via the url variable) which can mask runtime breakage; update the assertion to require the canonical key "dataset_url" exists in task_metadata (i.e., check task_metadata.get("dataset_url") is not None) and remove the fallback checks for "URL", "url", and "Url" so the test enforces the runtime metadata contract used by the production code (refer to task_metadata and the dataset_url key in this test).

coderabbitai · 2026-03-07T14:03:07Z

Note

Docstrings generation - SUCCESS
Generated docstrings and committed to branch feat/mcp+skills (commit: bda2445efcf79e7dba3a737040bb623e832c6408)

@marianbasti

Docstrings generation was requested by @marianbasti. The following files were modified: * `scripts/validate_run.py` * `src/benchy_cli_eval.py` * `src/interfaces/surus/surus_ocr_interface.py` * `src/mcp/server.py` * `src/tasks/registry.py` * `tests/test_facturas_argentinas_dataset_id.py` These files were kept as they were: * `tests/test_cli_arguments.py` These file types are not supported: * `AGENTS.md` * `README.md` * `configs/config.yaml` * `configs/models/openai_gpt-4o-mini.yaml` * `configs/models/openai_gpt-5-mini.yaml` * `configs/models/together_Apriel-1.6-15b-Thinker.yaml` * `configs/models/together_gemma-3n-E4B-it.yaml` * `configs/models/together_qwen2.5-vl-72b-instruct.yaml` * `configs/models/together_qwen3-next-80B-A3B-instruct.yaml` * `configs/models/zai-org_GLM-4.6V-Flash_single.yaml` * `configs/systems/README.md` * `configs/systems/surus-extract.yaml` * `configs/systems/surus-factura.yaml` * `configs/systems/surus-ocr.yaml` * `configs/systems/surus-remove-background.yaml` * `configs/tests/document-extract-gpt4.yaml` * `docs/contribute_tasks.md` * `env.example` * `pyproject.toml` * `reference/tasks_groups.json` * `reference/tasks_list.json` * `skills/add-provider/SKILL.md` * `skills/add-task/SKILL.md` * `skills/evaluate/SKILL.md` * `skills/interpret-run/SKILL.md` * `src/interfaces/README.md` * `src/tasks/document_extraction/metadata.yaml`

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (2)

scripts/validate_run.py (2)
118-130: ⚠️ Potential issue | 🟠 Major

Fail closed instead of defaulting missing smoke-gate fields to passing values.

Line 119 and Line 130 still coerce absent/None values into success (exit_code -> 0, blocking counts -> 0), so an incomplete run_outcome.json can validate as passed. This should treat missing required fields as violations instead of filling them with passing defaults.

Based on learnings: Ensure run_outcome.json contains top-level keys: schema_version, benchy_version, model, run_id, status, exit_policy, exit_code, started_at, ended_at, duration_s, counts, tasks, invocation, git, artifacts, and errors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate_run.py` around lines 118 - 130, The validation currently
coerces missing/None fields into passing values by using outcome.get(...,
default) and "or 0"; update the checks in validate_run.py so that status,
exit_code and counts keys are treated as required: do not default missing or
None to success—check for key presence and None explicitly (use
outcome.get("status") is None or status not in VALID_STATUSES => append
violation; if "exit_code" not in outcome or outcome["exit_code"] is None =>
append violation else check value != 0; for BLOCKING_COUNT_KEYS, treat missing
or None counts[key] as a violation rather than silently using 0). Also add a
top-level required-keys check that verifies the presence (and non-None) of
schema_version, benchy_version, model, run_id, status, exit_policy, exit_code,
started_at, ended_at, duration_s, counts, tasks, invocation, git, artifacts, and
errors and append violations for any missing keys.
226-227: ⚠️ Potential issue | 🟠 Major

Preserve the JSON contract when validation itself fails.

If validate_smoke_gates() raises on malformed payload shapes, this path emits a traceback instead of the structured JSON envelope used for load errors. Wrap validation and JSON emission in the same fail-safe handling.

Based on learnings: Treat run_outcome.json as authoritative for run success/failure determination in automated agents.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate_run.py` around lines 226 - 227, The current direct call to
validate_smoke_gates(outcome) can raise and emit a traceback instead of the
expected structured JSON envelope; wrap the call and the subsequent
json.dumps(result...) in a try/except that catches validation exceptions and
emits the same fail-safe JSON envelope used for load errors (e.g., include keys
like "success": false, "error": "<message>", and preserve run_outcome.json
semantics), ensuring validate_smoke_gates, outcome handling, and the JSON print
path always produce the structured output even on malformed payloads.

🧹 Nitpick comments (1)

src/benchy_cli_eval.py (1)
76-78: Use a single source of truth for deprecated task names.

DEPRECATED_TASKS now duplicates src/tasks/registry.py::_DEPRECATED_TASK_GROUPS. If one side is updated without the other, the CLI and registry will disagree about which tasks are invalid and users will see different failure points/messages depending on the entry path. Please move this mapping into a shared module and import it from both places.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/benchy_cli_eval.py` around lines 76 - 78, The DEPRECATED_TASKS mapping in
benchy_cli_eval.py duplicates the _DEPRECATED_TASK_GROUPS mapping in
src/tasks/registry.py leading to inconsistent behavior; extract the shared
mapping into a single new module (e.g., src/tasks/deprecated.py or similar) that
exports the mapping, update benchy_cli_eval.py to import DEPRECATED_TASKS from
that module and update src/tasks/registry.py to import _DEPRECATED_TASK_GROUPS
(or rename to the same exported symbol) from the new module so both use the
single source of truth (ensure you update import names/uses to the exported
symbol and remove the duplicated literal mapping).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/validate_run.py`:
- Around line 134-138: The comprehension that builds failed_tasks currently
excludes "skipped" and "no_samples" by using info.get("status") not in
VALID_STATUSES | {"skipped", "no_samples"}, which omits tasks that should appear
in the blocker details; change the filter to only exclude VALID_STATUSES (e.g.,
info.get("status") not in VALID_STATUSES) so tasks with "skipped" or
"no_samples" are included in the failed_tasks details, and apply the same change
to the analogous comprehension at lines 150-153 (the other failed-collection
variable in this file).

In `@src/tasks/registry.py`:
- Around line 17-20: The problem is that adding "image_extraction" to
_INTERNAL_TASK_GROUPS causes discover_task_group to hide that group, so
is_handler_based_task() returns False before _raise_if_deprecated_task_ref() can
raise the deprecation error; fix by ensuring deprecation is checked before any
internal-group filtering or by removing "image_extraction" from
_INTERNAL_TASK_GROUPS. Concretely, update is_handler_based_task(task_name: str)
to call _raise_if_deprecated_task_ref(task_name) first (as in the suggested
snippet), keep the "_adhoc_" shortcut, then determine group_name and call
discover_task_group(group_name), or alternatively remove "image_extraction" from
_INTERNAL_TASK_GROUPS so discover_task_group can find it and allow the
deprecation check to run. Ensure references to _INTERNAL_TASK_GROUPS,
_DEPRECATED_TASK_GROUPS, is_handler_based_task, _raise_if_deprecated_task_ref,
and discover_task_group are updated consistently.

---

Duplicate comments:
In `@scripts/validate_run.py`:
- Around line 118-130: The validation currently coerces missing/None fields into
passing values by using outcome.get(..., default) and "or 0"; update the checks
in validate_run.py so that status, exit_code and counts keys are treated as
required: do not default missing or None to success—check for key presence and
None explicitly (use outcome.get("status") is None or status not in
VALID_STATUSES => append violation; if "exit_code" not in outcome or
outcome["exit_code"] is None => append violation else check value != 0; for
BLOCKING_COUNT_KEYS, treat missing or None counts[key] as a violation rather
than silently using 0). Also add a top-level required-keys check that verifies
the presence (and non-None) of schema_version, benchy_version, model, run_id,
status, exit_policy, exit_code, started_at, ended_at, duration_s, counts, tasks,
invocation, git, artifacts, and errors and append violations for any missing
keys.
- Around line 226-227: The current direct call to validate_smoke_gates(outcome)
can raise and emit a traceback instead of the expected structured JSON envelope;
wrap the call and the subsequent json.dumps(result...) in a try/except that
catches validation exceptions and emits the same fail-safe JSON envelope used
for load errors (e.g., include keys like "success": false, "error": "<message>",
and preserve run_outcome.json semantics), ensuring validate_smoke_gates, outcome
handling, and the JSON print path always produce the structured output even on
malformed payloads.

---

Nitpick comments:
In `@src/benchy_cli_eval.py`:
- Around line 76-78: The DEPRECATED_TASKS mapping in benchy_cli_eval.py
duplicates the _DEPRECATED_TASK_GROUPS mapping in src/tasks/registry.py leading
to inconsistent behavior; extract the shared mapping into a single new module
(e.g., src/tasks/deprecated.py or similar) that exports the mapping, update
benchy_cli_eval.py to import DEPRECATED_TASKS from that module and update
src/tasks/registry.py to import _DEPRECATED_TASK_GROUPS (or rename to the same
exported symbol) from the new module so both use the single source of truth
(ensure you update import names/uses to the exported symbol and remove the
duplicated literal mapping).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 39acacb5-da45-4f5e-9fdb-2abe9bcfd89f

📥 Commits

Reviewing files that changed from the base of the PR and between 42d5620 and bda2445.

📒 Files selected for processing (6)

scripts/validate_run.py
src/benchy_cli_eval.py
src/interfaces/surus/surus_ocr_interface.py
src/mcp/server.py
src/tasks/registry.py
tests/test_facturas_argentinas_dataset_id.py

🚧 Files skipped from review as they are similar to previous changes (2)

tests/test_facturas_argentinas_dataset_id.py
src/mcp/server.py

coderabbitai · 2026-03-07T14:10:13Z

+    failed_tasks = {
+        name: info
+        for name, info in (outcome.get("tasks") or {}).items()
+        if info.get("status") not in VALID_STATUSES | {"skipped", "no_samples"}
+    }


⚠️ Potential issue | 🟡 Minor

Include skipped and no_samples tasks in the blocker details.

Those statuses fail the smoke gate, but this filter drops them, so the JSON report and stderr summary can fail without naming the offending tasks.

🩹 Proposed fix

failed_tasks = { name: info for name, info in (outcome.get("tasks") or {}).items() - if info.get("status") not in VALID_STATUSES | {"skipped", "no_samples"} + if info.get("status") not in VALID_STATUSES }

Also applies to: 150-153

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@scripts/validate_run.py` around lines 134 - 138, The comprehension that builds failed_tasks currently excludes "skipped" and "no_samples" by using info.get("status") not in VALID_STATUSES | {"skipped", "no_samples"}, which omits tasks that should appear in the blocker details; change the filter to only exclude VALID_STATUSES (e.g., info.get("status") not in VALID_STATUSES) so tasks with "skipped" or "no_samples" are included in the failed_tasks details, and apply the same change to the analogous comprehension at lines 150-153 (the other failed-collection variable in this file).

coderabbitai · 2026-03-07T14:10:13Z

+_INTERNAL_TASK_GROUPS = {"common", "image_extraction"}
+_DEPRECATED_TASK_GROUPS = {
+    "image_extraction": "document_extraction",
+}


⚠️ Potential issue | 🟠 Major

Don't hide deprecated groups behind the internal-group filter.

Adding image_extraction to _INTERNAL_TASK_GROUPS makes discover_task_group("image_extraction") return None, so is_handler_based_task() reports False before _raise_if_deprecated_task_ref() ever runs. In the pipeline path, that turns a deprecated-task error into the generic “not handler-based / legacy task.json” failure for non-CLI callers.

Possible fix

-_INTERNAL_TASK_GROUPS = {"common", "image_extraction"} +_INTERNAL_TASK_GROUPS = {"common"}

def is_handler_based_task(task_name: str) -> bool: _raise_if_deprecated_task_ref(task_name) if task_name.startswith("_adhoc_"): return True group_name = task_name.split(".")[0] return discover_task_group(group_name) is not None

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

_INTERNAL_TASK_GROUPS = {"common", "image_extraction"}

_DEPRECATED_TASK_GROUPS = {

"image_extraction": "document_extraction",

}

_INTERNAL_TASK_GROUPS = {"common"}

_DEPRECATED_TASK_GROUPS = {

"image_extraction": "document_extraction",

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/tasks/registry.py` around lines 17 - 20, The problem is that adding "image_extraction" to _INTERNAL_TASK_GROUPS causes discover_task_group to hide that group, so is_handler_based_task() returns False before _raise_if_deprecated_task_ref() can raise the deprecation error; fix by ensuring deprecation is checked before any internal-group filtering or by removing "image_extraction" from _INTERNAL_TASK_GROUPS. Concretely, update is_handler_based_task(task_name: str) to call _raise_if_deprecated_task_ref(task_name) first (as in the suggested snippet), keep the "_adhoc_" shortcut, then determine group_name and call discover_task_group(group_name), or alternatively remove "image_extraction" from _INTERNAL_TASK_GROUPS so discover_task_group can find it and allow the deprecation check to run. Ensure references to _INTERNAL_TASK_GROUPS, _DEPRECATED_TASK_GROUPS, is_handler_based_task, _raise_if_deprecated_task_ref, and discover_task_group are updated consistently.

…path under repo-root/.data. Updated .gitignore files to exclude .data directories and adjusted dataset loading paths in Portuguese and translation tasks accordingly. Added optional dependencies for local inference in pyproject.toml and updated README for installation instructions.

…lowing users to evaluate arbitrary HTTP API endpoints. Added new flags for API URL, body template, response path, HTTP method, and headers. Updated documentation to reflect these changes and provided examples for usage.

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (3)

.agents/skills/interpret-run/SKILL.md (2)
233-236: ⚠️ Potential issue | 🟡 Minor

Fix the wildcard file-open example (it is not runnable Python).

open(".../*_samples.json") does not expand globs. Resolve file paths first (e.g., via pathlib.Path.glob) and then open each file.
Suggested patch
 import json
-with open("outputs/benchmark_outputs/<run_id>/<model>/<task>/<subtask>/*_samples.json") as f:
-    data = json.load(f)
+from pathlib import Path
+
+samples_dir = Path("outputs/benchmark_outputs/<run_id>/<model_name_last_segment>/<task>/<subtask>")
+all_samples = []
+for samples_path in samples_dir.glob("*_samples.json"):
+    with open(samples_path) as f:
+        data = json.load(f)
+    all_samples.extend(data["samples"] if isinstance(data, dict) else data)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/skills/interpret-run/SKILL.md around lines 233 - 236, The example
incorrectly uses open on a glob pattern string; change it to resolve the glob
first (e.g., use pathlib.Path(...).glob or Path.rglob to find files matching
"*_samples.json" under the
outputs/benchmark_outputs/<run_id>/<model>/<task>/<subtask> path), then iterate
over the resulting Path objects and open each file individually (or
read_text()/read_bytes()) to json.load/parse; update the example in SKILL.md to
demonstrate using Path(...).glob(...) and a for loop that opens/parses each
matched file.
43-52: ⚠️ Potential issue | 🟡 Minor

Use canonical counts keys in the run_outcome.json example.

"total_tasks" is non-canonical in this doc context; use "requested_tasks" and include "completed_tasks" so the example matches what automation expects.

Based on learnings: Ensure run_outcome.json contains the expected schema and canonical counters used by automation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/skills/interpret-run/SKILL.md around lines 43 - 52, Update the
example "counts" object in SKILL.md so it uses the canonical keys expected by
automation: replace "total_tasks" with "requested_tasks" and add a
"completed_tasks" entry (set to the appropriate number, e.g., 3) alongside the
existing "passed_tasks", "degraded_tasks", "failed_tasks", "skipped_tasks",
"error_tasks", "no_samples_tasks", and "pending_tasks" keys so the
run_outcome.json example matches the expected schema.
.agents/skills/evaluate/SKILL.md (1)
20-21: ⚠️ Potential issue | 🟡 Minor

Use one output-path placeholder consistently across the file.

The document mixes <model_name_last_segment> and <model_name> for the same directory level, which can cause copy/paste drift in automation docs. Please standardize on one placeholder everywhere (preferably <model_name_last_segment> to match runtime layout guidance).

Based on learnings: Write run outputs to <base_output_path>/<run_id>/<model_name_last_segment>/ and include required files.

Also applies to: 109-110
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/skills/evaluate/SKILL.md around lines 20 - 21, The document uses two
inconsistent placeholders (<model_name> and <model_name_last_segment>) for the
same directory level which can break automation; standardize all occurrences to
<model_name_last_segment> in SKILL.md (including the example path
"outputs/benchmark_outputs/<run_id>/<model_name_last_segment>/run_outcome.json"
and the references noted at lines 109-110) and update any adjacent instructions
that reference the output path so they all refer to
<base_output_path>/<run_id>/<model_name_last_segment>/ to match runtime layout
guidance.

🧹 Nitpick comments (2)

.agents/skills/add-task/SKILL.md (1)

232-245: Clarify import style to avoid copy/paste breakage.

This section uses from src.tasks.common import ..., while task-file examples above use relative imports (from ..common ...). Please align these or add a one-line note about when to use absolute vs relative imports.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.agents/skills/add-task/SKILL.md around lines 232 - 245, The import block
currently uses an absolute import ("from src.tasks.common import ...") while
examples above use relative imports ("from ..common import ..."); update the
SKILL.md import block to use the same relative style as the other task-file
examples (e.g., "from ..common import download_huggingface_dataset,
save_to_jsonl, ...") or alternatively add one clear sentence explaining when to
prefer absolute vs relative imports (project top-level scripts vs in-package
task files) so examples remain consistent and copy/paste-safe; ensure the change
references the existing import list (download_huggingface_dataset,
save_to_jsonl, normalize_text, ExactMatch, MultipleChoiceAccuracy, etc.) so
maintainers can locate and edit the exact line.

.agents/skills/interpret-run/SKILL.md (1)

13-23: Add fence languages to code blocks to satisfy markdown linting.

These fenced blocks are missing language tags (text/json/etc.), and static analysis already reports MD040 warnings.

Also applies to: 79-83, 107-128

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.agents/skills/interpret-run/SKILL.md around lines 13 - 23, The markdown
file contains fenced code blocks without language tags (e.g., the directory tree
and other examples in the SKILL.md content), causing MD040 lint warnings; update
each triple-backtick fence (including the directory tree block and the other
fenced examples referenced) to include an appropriate language tag such as text,
bash, or json depending on the block content (e.g., use ```text for ASCII tree,
```json for JSON snippets), ensuring all fenced blocks in SKILL.md have language
identifiers to satisfy the linter.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/skills/add-task/SKILL.md:
- Around line 217-225: The fenced directory-layout block in
.agents/skills/add-task/SKILL.md is missing a language tag (MD040); update the
triple-backtick fence that precedes the tree (the block showing
src/tasks/<my_benchmark>/ … my_task.jsonl) to include a language like "text"
(e.g., ```text) so markdownlint no longer flags the block, then save the file.

In @.agents/skills/interpret-run/SKILL.md:
- Line 257: Update the sentence to reflect resume semantics more generally:
change "Tasks already marked `passed` or `degraded` in `task_status.json` are
skipped." to indicate that when reusing the same `run_id`, the system skips
tasks already marked as completed in `<task>/task_status.json` (e.g., `passed`,
`degraded`, `skipped`, `no_samples`, etc.); reference reusing `run_id` and
`task_status.json` so readers understand the resumable-run behavior.

---

Duplicate comments:
In @.agents/skills/evaluate/SKILL.md:
- Around line 20-21: The document uses two inconsistent placeholders
(<model_name> and <model_name_last_segment>) for the same directory level which
can break automation; standardize all occurrences to <model_name_last_segment>
in SKILL.md (including the example path
"outputs/benchmark_outputs/<run_id>/<model_name_last_segment>/run_outcome.json"
and the references noted at lines 109-110) and update any adjacent instructions
that reference the output path so they all refer to
<base_output_path>/<run_id>/<model_name_last_segment>/ to match runtime layout
guidance.

In @.agents/skills/interpret-run/SKILL.md:
- Around line 233-236: The example incorrectly uses open on a glob pattern
string; change it to resolve the glob first (e.g., use pathlib.Path(...).glob or
Path.rglob to find files matching "*_samples.json" under the
outputs/benchmark_outputs/<run_id>/<model>/<task>/<subtask> path), then iterate
over the resulting Path objects and open each file individually (or
read_text()/read_bytes()) to json.load/parse; update the example in SKILL.md to
demonstrate using Path(...).glob(...) and a for loop that opens/parses each
matched file.
- Around line 43-52: Update the example "counts" object in SKILL.md so it uses
the canonical keys expected by automation: replace "total_tasks" with
"requested_tasks" and add a "completed_tasks" entry (set to the appropriate
number, e.g., 3) alongside the existing "passed_tasks", "degraded_tasks",
"failed_tasks", "skipped_tasks", "error_tasks", "no_samples_tasks", and
"pending_tasks" keys so the run_outcome.json example matches the expected
schema.

---

Nitpick comments:
In @.agents/skills/add-task/SKILL.md:
- Around line 232-245: The import block currently uses an absolute import ("from
src.tasks.common import ...") while examples above use relative imports ("from
..common import ..."); update the SKILL.md import block to use the same relative
style as the other task-file examples (e.g., "from ..common import
download_huggingface_dataset, save_to_jsonl, ...") or alternatively add one
clear sentence explaining when to prefer absolute vs relative imports (project
top-level scripts vs in-package task files) so examples remain consistent and
copy/paste-safe; ensure the change references the existing import list
(download_huggingface_dataset, save_to_jsonl, normalize_text, ExactMatch,
MultipleChoiceAccuracy, etc.) so maintainers can locate and edit the exact line.

In @.agents/skills/interpret-run/SKILL.md:
- Around line 13-23: The markdown file contains fenced code blocks without
language tags (e.g., the directory tree and other examples in the SKILL.md
content), causing MD040 lint warnings; update each triple-backtick fence
(including the directory tree block and the other fenced examples referenced) to
include an appropriate language tag such as text, bash, or json depending on the
block content (e.g., use ```text for ASCII tree, ```json for JSON snippets),
ensuring all fenced blocks in SKILL.md have language identifiers to satisfy the
linter.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ccb6db40-2c34-40fc-ab0d-c16c54931a4f

📥 Commits

Reviewing files that changed from the base of the PR and between 89a5d25 and 24b1b7c.

📒 Files selected for processing (4)

.agents/skills/add-provider/SKILL.md
.agents/skills/add-task/SKILL.md
.agents/skills/evaluate/SKILL.md
.agents/skills/interpret-run/SKILL.md

coderabbitai · 2026-03-26T14:31:40Z

+```
+src/tasks/<my_benchmark>/
+├── metadata.yaml          ← task group declaration (required)
+├── __init__.py            ← exports task classes (required)
+├── my_task.py             ← one file per subtask
+├── my_extraction.py
+└── .data/                 ← auto-created, cached dataset files
+    └── my_task.jsonl
+```


⚠️ Potential issue | 🟡 Minor

Add a language to this fenced code block (MD040).

The directory-layout block is missing a fence language, which trips markdownlint. Use text (or bash) for this block.

Proposed fix

-``` +```text src/tasks/<my_benchmark>/ ├── metadata.yaml ← task group declaration (required) ├── __init__.py ← exports task classes (required) ├── my_task.py ← one file per subtask ├── my_extraction.py └── .data/ ← auto-created, cached dataset files └── my_task.jsonl

</details> <details> <summary>🧰 Tools</summary> <details> <summary>🪛 markdownlint-cli2 (0.22.0)</summary> [warning] 217-217: Fenced code blocks should have a language specified (MD040, fenced-code-language) </details> </details> <details> <summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @.agents/skills/add-task/SKILL.md around lines 217 - 225, The fenced
directory-layout block in .agents/skills/add-task/SKILL.md is missing a language
tag (MD040); update the triple-backtick fence that precedes the tree (the block
showing src/tasks/<my_benchmark>/ … my_task.jsonl) to include a language like
"text" (e.g., ```text) so markdownlint no longer flags the block, then save the
file.

</details>  

coderabbitai · 2026-03-26T14:31:40Z

+benchy eval --config <config> --run-id <same-run-id> --exit-policy strict
+```
+
+Tasks already marked `passed` or `degraded` in `task_status.json` are skipped.


⚠️ Potential issue | 🟡 Minor

Resume semantics are too narrow in this sentence.

Saying only passed/degraded are skipped is incomplete; resumed runs skip tasks already marked completed, which also includes statuses like skipped and no_samples.

Based on learnings: When reusing the same run_id, skip tasks already marked as completed in <task>/task_status.json.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.agents/skills/interpret-run/SKILL.md at line 257, Update the sentence to reflect resume semantics more generally: change "Tasks already marked `passed` or `degraded` in `task_status.json` are skipped." to indicate that when reusing the same `run_id`, the system skips tasks already marked as completed in `<task>/task_status.json` (e.g., `passed`, `degraded`, `skipped`, `no_samples`, etc.); reference reusing `run_id` and `task_status.json` so readers understand the resumable-run behavior.

…tics for smoke failures

…ocument rendering, and dataset spec Enable evaluating parquet datasets dropped in .data/ without writing Python. Benchy now auto-discovers schema, labels, ground-truth mappings, and binary columns from dataset_info.json and schema.json, renders PDF/image documents to PNG for LLM providers, and supports both flat and nested JSON Schema formats with inline $ref resolution for OpenAI strict mode. - Add parquet dataset loading with binary materialization and caching - Add document-to-image rendering (PDF via pymupdf, TIFF/HEIC via Pillow) - Add auto-discovery of .data/ datasets (resolve_dataset_name, list_data_datasets) - Add `benchy datasets` CLI command for dataset discovery - Add GT mapping extraction from both custom fields and standard JSON Schema - Add nested dot-notation GT mapping to build nested expected dicts - Add multimodal image_path propagation in structured and classification handlers - Add $ref resolution and schema sanitization for OpenAI strict mode - Add render CLI flags (--render-documents, --render-dpi, --render-max-pages) - Add docs/DATASET_SPEC.md with full dataset building specification - Update README, AGENTS.md, CLI_DATASET_USAGE.md with new capabilities - Fix subtask_name unbound variable in registry.py for ad-hoc tasks

- Added new performance metrics candidates and exclusion keys for better metric selection. - Implemented functions to compute quantiles and categorize performance into buckets. - Introduced a build_performance_summary function to aggregate performance metrics. - Updated artifact building to include performance summaries and per-sample metrics. - Added tests to validate performance summary integration in benchmark results.

…uation workflows

- Updated `_convert_custom_schema` to return field type mappings alongside JSON schema and ground-truth mappings. - Modified `_extract_gt_mapping_from_json_schema` to also return field type mappings. - Adjusted `resolve_dataset_name` to handle new type mappings for improved dataset processing. - Implemented tests for coercing ground-truth string values to lists for array fields and handling invalid JSON gracefully. - Ensured backward compatibility with existing schema formats while enhancing functionality.

…array handling - Added support for `metrics_config.json` to allow dataset-specific evaluation behavior. - Implemented loading of dataset metrics configuration in `StructuredHandler` and `MultimodalStructuredHandler`. - Enhanced `MetricsCalculator` to handle unordered array scoring based on specified key fields. - Updated documentation to reflect the new metrics configuration options and their usage. - Added tests to ensure correct loading and application of dataset metrics configurations for structured and multimodal tasks.

…metrics configuration

mauroibz and others added 8 commits January 8, 2026 13:45

Merge pull request #10 from surus-lat/FEAT-simpler-task-interface

08bf788

Handler-Based Task System - Major Refactor

Merge pull request #16 from surus-lat/dev

3714a4f

Added initial CLI for models

Merge pull request #17 from surus-lat/dev

6647a68

[ADD] updated readme

[FIX] installed on windows

08dc9d6

Merge pull request #19 from surus-lat/dev

395c638

Dev

Deprecate image_extraction in favour of document_extraction

147b107

Update documentation

replace surus.dev with surus.lat

01fdcd0

coderabbitai Bot reviewed Mar 5, 2026

View reviewed changes

Coderabbit suggestions:

42d5620

Refactor dataset URL keys in metadata and task files; enhance error handling in model directory resolution

coderabbitai Bot reviewed Mar 5, 2026

View reviewed changes

coderabbitai Bot reviewed Mar 7, 2026

View reviewed changes

mauroibz and others added 4 commits March 24, 2026 12:46

Merge branch 'main' into feat/mcp+skills

89a5d25

Normalize SKILL documentation

24b1b7c

coderabbitai Bot reviewed Mar 26, 2026

View reviewed changes

marianbasti and others added 10 commits March 27, 2026 13:58

Update SKILL documentation with additional error handling and diagnos…

7ab29ca

…tics for smoke failures

Add documentation for local endpoints without authentication in SKILL.md

e7437d1

Merge remote-tracking branch 'origin' into feat/mcp+skills

45b1b76

Update SKILL documentation with new provider types and zero-code eval…

80e0d79

…uation workflows

Merge remote-tracking branch 'origin' into feat/mcp+skills

c689908

[MOD] Update SKILL documentation with new dataset source options and …

bfb8dae

…metrics configuration

		outputs/benchmark_outputs/<run_id>/<model_name_last_segment>/run_outcome.json
		```

		result = validate_smoke_gates(outcome)
		print(json.dumps(result, indent=2))

Conversation

marianbasti commented Mar 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented Mar 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

marianbasti commented Mar 5, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 5, 2026 •

edited

Loading

coderabbitai Bot commented Mar 7, 2026 •

edited

Loading