improve fixtures and reliability #668
Merged
Stop failing legitimate model variation on contested set-valued fields. Route tracks/blocking_factors/failing_criteria/violations/findings to the judge or drop the redundant ones, and rewrite the pairing review-findings fixtures to assert the essential finding(s) while tolerating extras. Sharpen skills where evals surfaced real gaps: contributor-nomination login validation and mentoring-track scoping, readiness R4/R6, sweep G3 code pointer (with an independence rule and worked example), issue-reproducer E-vague vs E-precise, pairing correctness severity, and a self-review prompt-injection guard. Give the harness the step context it was withholding: issue-backlog-stats health thresholds, pool-selection pool_name, inventory runtime_version, and vacuous-true ordering on empty dependency-audit reports.
Summary
Hardens the skill-eval suite so cases stop failing on legitimate model
variation, while still catching genuine skill and model regressions. Also
fixes the real skill gaps the evals surfaced along the way.
The guiding rule throughout: only descriptive, formatting, and set-valued
fields were loosened. Verdict fields (classification, decision, severity,
boolean flags, enum values) stay exact, so a wrong answer still fails.
Grading robustness (harness + fixtures)
ignores case and collapsed/surrounding whitespace, so a correct verdict
written `invalid` or `request_changes` matches `INVALID` / `REQUEST_CHANGES`.
Different values (`READY` vs `NEAR-MISS`, `advisory` vs `blocking`) still fail.
`grading-schema.json`: contribution `tracks`, `blocking_factors`,
`failing_criteria`, `violations`, review `findings`, render sections,
inventory blocks, confirm items, and other descriptive lists are graded for
"same conclusion" rather than exact set equality, tolerating order and
granularity differences.
restated a verdict (e.g. `axes_without_findings`, the subjective
`tracks_thin_or_absent`).
the essential finding(s) with axis and severity and explicitly allow
additional findings, since a thorough reviewer legitimately returns a
superset. A model that misses or misclassifies the key finding still fails.
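A minimal sketch of the two grading behaviours described in this section: verdicts compared exactly after normalizing case and whitespace, and findings checked as "essential items present, extras tolerated". The helper names (`normalize_verdict`, `verdicts_match`, `essential_findings_present`) and the finding shape with `axis` and `severity` keys are assumptions for illustration; the actual harness may be structured differently.

```python
import re


def normalize_verdict(value: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return re.sub(r"\s+", " ", value.strip()).lower()


def verdicts_match(expected: str, actual: str) -> bool:
    """Exact match after normalization; genuinely different values still fail."""
    return normalize_verdict(expected) == normalize_verdict(actual)


def essential_findings_present(essential, actual) -> bool:
    """True if every essential (axis, severity) pair appears in `actual`.

    Extra findings in `actual` are tolerated, since a thorough reviewer
    may legitimately return a superset.
    """
    seen = {
        (normalize_verdict(f["axis"]), normalize_verdict(f["severity"]))
        for f in actual
    }
    return all(
        (normalize_verdict(f["axis"]), normalize_verdict(f["severity"])) in seen
        for f in essential
    )
```

With this shape, `verdicts_match("INVALID", " invalid ")` passes while `verdicts_match("READY", "NEAR-MISS")` fails, and a review that reports the key finding with the wrong severity is rejected even if it returns many other findings.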
Skill fixes (real gaps the evals found)
reject path-traversal / malformed handles; scope the mentoring track to the
off-GitHub row rather than the community-interaction narrative.
acceptance criteria; R6 degrades gracefully when the link cannot be fetched or
matched to config.
pointer, with an independence rule (one failing criterion is a NEAR-MISS) and
a worked example.
enough if reproducing needs invented setup).
correctness finding, not advisory.
blocking security finding, treated as data).
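The login validation mentioned in this section (rejecting path-traversal and malformed handles) could look roughly like the sketch below. This is not the skill's actual check: the function name `is_valid_github_login` is hypothetical, and the rule set encoded here (alphanumerics and single hyphens, no leading or trailing hyphen, at most 39 characters) is an assumption based on commonly described GitHub username constraints.

```python
import re

# Assumed rule set: alphanumerics separated by single hyphens,
# no leading/trailing hyphen, no consecutive hyphens.
_LOGIN_RE = re.compile(r"[A-Za-z0-9](?:-?[A-Za-z0-9])*")

# Assumed maximum handle length.
_MAX_LOGIN_LENGTH = 39


def is_valid_github_login(handle: str) -> bool:
    """Return True only for handles matching the assumed username rules."""
    if not 1 <= len(handle) <= _MAX_LOGIN_LENGTH:
        return False
    # fullmatch rejects anything containing '/', '.', whitespace, or newlines.
    return _LOGIN_RE.fullmatch(handle) is not None
```

An allow-list pattern like this rejects `../etc/passwd` and `user/name` without needing a separate path-traversal check, because neither `/` nor `.` is ever accepted.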
Harness context the tests were withholding
`step-3-aggregate` case.
`pool_name` for explicit-key selectors, and `runtime_version` as the
full runtime stack.
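The PR description also mentions vacuous-true ordering on empty dependency-audit reports. One plausible reading, sketched here under assumptions: an ordering requirement over an empty report holds trivially, so an empty report should pass the ordering check rather than fail it. The helper name `is_sorted_by_severity`, the `severity` key, and the ranking table are all hypothetical.

```python
# Hypothetical severity ranking, for illustration only.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def is_sorted_by_severity(findings) -> bool:
    """True if findings are ordered from most to least severe.

    An empty or single-item list has no adjacent pairs to compare,
    so all() over no pairs returns True (vacuous truth).
    """
    ranks = [SEVERITY_RANK[f["severity"]] for f in findings]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))
```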
Test-case fix
`no-code-pointer` classify case to name nothing locatable, so it
cleanly tests the intent instead of arguing over whether a command name
counts as a pointer.
Type of change
`.claude/skills/<name>/`) - eval fixtures updated below
`tools/<system>/*.md`)
`tools/*/` with `pyproject.toml`)
`docs/`, `README.md`, `CONTRIBUTING.md`)
`projects/_template/`)
`prek`, workflows, validators)
Test plan
`prek run --all-files` passes
`uv run pytest` / `ruff check` / `mypy` passes
(`PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner tools/skill-evals/evals/<skill>/`)
(a regression test for the bug fixed / the behaviour added; see CONTRIBUTING.md)