improve fixtures and reliability by justinmclean · Pull Request #668 · apache/magpie

justinmclean · 2026-07-01T10:33:50Z

Summary

Hardens the skill-eval suite so cases stop failing on legitimate model
variation, while still catching genuine skill and model regressions. Also
fixes the real skill gaps the evals surfaced along the way.

The guiding rule throughout: only descriptive, formatting, and set-valued
fields were loosened. Verdict fields (classification, decision, severity,
boolean flags, enum values) stay exact, so a wrong answer still fails.

Grading robustness (harness + fixtures)

- Scalar normalization in the runner: non-prose string comparison now
  ignores case, surrounding whitespace, and runs of internal whitespace, so a
  correct verdict written as invalid or request_changes matches INVALID /
  REQUEST_CHANGES. Different values (READY vs NEAR-MISS, advisory vs
  blocking) still fail.
- Judged set/list fields via per-step grading-schema.json: contribution
  tracks, blocking_factors, failing_criteria, violations, review
  findings, render sections, inventory blocks, confirm items, and other
  descriptive lists are graded for "same conclusion" rather than exact set
  equality, tolerating order and granularity differences.
- Redundant derived fields dropped from assertions where they only
  restated a verdict (e.g. axes_without_findings, the subjective
  tracks_thin_or_absent).
- Permissive review-findings fixtures: the pairing review cases now assert
  the essential finding(s) with axis and severity and explicitly allow
  additional findings, since a thorough reviewer legitimately returns a
  superset. A model that misses or misclassifies the key finding still fails.
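
The scalar normalization and permissive-findings rules described above can be sketched roughly as follows. This is an illustration only, not the runner's actual code: the function names `normalize_scalar`, `scalars_match`, and `has_essential_findings`, and the `axis` / `severity` dict keys, are assumptions made for the example.

```python
import re


def normalize_scalar(value: str) -> str:
    """Trim, collapse internal whitespace runs to one space, and fold case."""
    return re.sub(r"\s+", " ", value.strip()).casefold()


def scalars_match(expected: str, actual: str) -> bool:
    """Compare verdict-style strings, ignoring only case and whitespace."""
    return normalize_scalar(expected) == normalize_scalar(actual)


def has_essential_findings(expected: list[dict], actual: list[dict]) -> bool:
    """True when every expected (axis, severity) pair appears in `actual`.

    Extra findings in `actual` are tolerated (hypothetical finding shape).
    """
    seen = {
        (normalize_scalar(f["axis"]), normalize_scalar(f["severity"]))
        for f in actual
    }
    return all(
        (normalize_scalar(f["axis"]), normalize_scalar(f["severity"])) in seen
        for f in expected
    )
```

Under this sketch a verdict with a different value still fails, and a reviewer that returns a superset of the expected findings passes while one that downgrades the key finding's severity does not.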

Skill fixes (real gaps the evals found)

- contributor-nomination: validate the GitHub login before any API call and
  reject path-traversal / malformed handles; scope the mentoring track to the
  off-GitHub row rather than the community-interaction narrative.
- good-first-issue-author (readiness): R4 rejects a prose summary posing as
  acceptance criteria; R6 degrades gracefully when the link cannot be fetched
  or matched to config.
- good-first-issue-sweep: G3 excludes a bare command/CLI name as a code
  pointer, with an independence rule (one failing criterion is a NEAR-MISS)
  and a worked example.
- issue-reproducer: clarify E-vague vs E-precise (a named error is not
  enough if reproducing needs invented setup).
- pairing-multi-agent-review: a silent invariant violation is a blocking
  correctness finding, not advisory.
- pairing-self-review: add a prompt-injection guard (injection content is a
  blocking security finding, treated as data).
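
As an illustration of the contributor-nomination login check, here is a minimal sketch that validates a handle before it is placed in any request path. It assumes GitHub's documented username constraints (1 to 39 characters, alphanumeric or single hyphens, no leading or trailing hyphen); the helper names `is_valid_github_login` and `user_api_path` are hypothetical, and the skill's real validation may differ.

```python
import re

# Assumed GitHub username rules: 1-39 chars, alphanumerics or single
# hyphens, with no leading, trailing, or consecutive hyphens.
_LOGIN_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9]|-(?=[A-Za-z0-9])){0,38}")


def is_valid_github_login(login: str) -> bool:
    """Return True only for a syntactically valid GitHub login."""
    return isinstance(login, str) and _LOGIN_RE.fullmatch(login) is not None


def user_api_path(login: str) -> str:
    """Build the REST path for a user, refusing malformed handles up front."""
    if not is_valid_github_login(login):
        raise ValueError(f"refusing malformed GitHub login: {login!r}")
    return f"/users/{login}"
```

Because the pattern admits no slashes or dots, a handle such as `../../etc/passwd` is rejected before any API call is attempted. Bot accounts with bracketed suffixes would also be rejected by this sketch and would need separate handling.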

Harness context the tests were withholding

- Feed the Step 4 health-rating thresholds to the step-3-aggregate case.
- Define pool_name for explicit-key selectors, and runtime_version as the
  full runtime stack.
- Treat ordering as vacuously satisfied on an empty dependency-audit report.
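
The vacuous-ordering rule can be illustrated with a small sketch. The function name `is_ordered` and the use of plain string comparison are assumptions for the example; the real report's sort key is not stated here.

```python
def is_ordered(entries: list[str]) -> bool:
    """True when each entry sorts at or after the one before it.

    With zero or one entry there are no adjacent pairs to compare, so
    all() over the empty sequence returns True (vacuous truth).
    """
    return all(a <= b for a, b in zip(entries, entries[1:]))
```

An empty report therefore satisfies the ordering assertion without a special case, while any out-of-order pair in a non-empty report still fails.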

Test-case fix

Rewrote the no-code-pointer classify case to name nothing locatable, so it
cleanly tests the intent instead of arguing over whether a command name
counts as a pointer.

Type of change

- Skill change (.claude/skills/<name>/): eval fixtures updated below
- Tool / bridge contract (tools/<system>/*.md)
- Python package (tools/*/ with pyproject.toml)
- Groovy reference impl
- Cross-cutting (RFC, AGENTS.md, sandbox, privacy-LLM)
- Documentation (docs/, README.md, CONTRIBUTING.md)
- Project template (projects/_template/)
- CI / dev loop (prek, workflows, validators)
- Other:

Test plan

- prek run --all-files passes
- For Python packages touched: uv run pytest / ruff check / mypy passes
- For Groovy bridges touched: command-line invocation tested end-to-end
- For skill changes: eval suite passes for the affected skill
  (PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner tools/skill-evals/evals/<skill>/)
- For skill behaviour changes: a new or updated eval fixture is included in this PR
  (a regression test for the bug fixed / the behaviour added; see CONTRIBUTING.md)
- Other:

Stop failing legitimate model variation on contested set-valued fields. Route tracks/blocking_factors/failing_criteria/violations/findings to the judge or drop the redundant ones, and rewrite the pairing review-findings fixtures to assert the essential finding(s) while tolerating extras. Sharpen skills where evals surfaced real gaps: contributor-nomination login validation and mentoring-track scoping, readiness R4/R6, sweep G3 code pointer (with an independence rule and worked example), issue-reproducer E-vague vs E-precise, pairing correctness severity, and a self-review prompt-injection guard. Give the harness the step context it was withholding: issue-backlog-stats health thresholds, pool-selection pool_name, inventory runtime_version, and vacuous-true ordering on empty dependency-audit reports.

justinmclean self-assigned this Jul 1, 2026

justinmclean added 3 commits July 1, 2026 18:07

improve fixtures and relability

eabfb0c

Help opencode pass brittle cases without masking verdicts

344ea4b

potiuk force-pushed the improve-evals branch from c14662a to 344ea4b Compare July 1, 2026 16:21

potiuk merged commit 5d11eff into apache:main Jul 1, 2026
31 checks passed

potiuk merged 3 commits into apache:main from justinmclean:improve-evals

2 participants
