improve fixtures and reliability #668

Merged
potiuk merged 3 commits into apache:main from justinmclean:improve-evals
Jul 1, 2026

justinmclean (Member) commented Jul 1, 2026

Summary

Hardens the skill-eval suite so cases stop failing on legitimate model
variation, while still catching genuine skill and model regressions. Also
fixes the real skill gaps the evals surfaced along the way.

The guiding rule throughout: only descriptive, formatting, and set-valued
fields were loosened. Verdict fields (classification, decision, severity,
boolean flags, enum values) stay exact, so a wrong answer still fails.

Grading robustness (harness + fixtures)

  • Scalar normalization in the runner: non-prose string comparison now
    ignores case and collapsed or surrounding whitespace, so a correct
    verdict written as invalid or request_changes matches INVALID /
    REQUEST_CHANGES. Different values (READY vs NEAR-MISS, advisory vs
    blocking) still fail.
  • Judged set/list fields via per-step grading-schema.json: contribution
    tracks, blocking_factors, failing_criteria, violations, review
    findings, render sections, inventory blocks, confirm items, and other
    descriptive lists are graded for "same conclusion" rather than exact set
    equality, tolerating order and granularity differences.
  • Redundant derived fields dropped from assertions where they only
    restated a verdict (e.g. axes_without_findings, the subjective
    tracks_thin_or_absent).
  • Permissive review-findings fixtures: the pairing review cases now assert
    the essential finding(s) with axis and severity and explicitly allow
    additional findings, since a thorough reviewer legitimately returns a
    superset. A model that misses or misclassifies the key finding still fails.
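
The scalar normalization and the permissive findings check described above could look roughly like the sketch below. It is illustrative only, not the runner's actual code: the function names are hypothetical, collapsing internal whitespace to a single space is an assumption about what "collapsed" means, and the finding dictionaries are assumed to carry the axis and severity keys mentioned in the description.

```python
import re


def normalize_scalar(value):
    """Normalize a non-prose string for comparison; leave other types untouched.

    Assumption: runs of whitespace collapse to one space, surrounding
    whitespace is trimmed, and case is ignored.
    """
    if isinstance(value, str):
        return re.sub(r"\s+", " ", value).strip().casefold()
    return value


def scalars_match(expected, actual):
    """Compare two verdict values after normalization; different values still differ."""
    return normalize_scalar(expected) == normalize_scalar(actual)


def findings_cover(required, returned):
    """True if every required (axis, severity) pair is present; extras are allowed."""
    got = {
        (normalize_scalar(f["axis"]), normalize_scalar(f["severity"]))
        for f in returned
    }
    return all(
        (normalize_scalar(r["axis"]), normalize_scalar(r["severity"])) in got
        for r in required
    )
```

With this shape, a reviewer that returns the key finding plus extra findings passes, while one that downgrades the key finding from blocking to advisory fails, because severity is still compared as an exact verdict.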

Skill fixes (real gaps the evals found)

  • contributor-nomination: validate the GitHub login before any API call and
    reject path-traversal / malformed handles; scope the mentoring track to the
    off-GitHub row rather than the community-interaction narrative.
  • good-first-issue-author (readiness): R4 rejects a prose summary posing as
    acceptance criteria; R6 degrades gracefully when the link cannot be fetched or
    matched to config.
  • good-first-issue-sweep: G3 excludes a bare command/CLI name as a code
    pointer, with an independence rule (one failing criterion is a NEAR-MISS) and
    a worked example.
  • issue-reproducer: clarify E-vague vs E-precise (a named error is not
    enough if reproducing needs invented setup).
  • pairing-multi-agent-review: a silent invariant violation is a blocking
    correctness finding, not advisory.
  • pairing-self-review: add a prompt-injection guard (injection content is a
    blocking security finding, treated as data).
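
As an illustration of the contributor-nomination fix, a login check that runs before any API call might be sketched as follows. This is not the skill's actual implementation. The function name is hypothetical, and the pattern encodes an assumption about GitHub's username rules (1 to 39 characters, alphanumeric, single interior hyphens only).

```python
import re

# Assumption: 1-39 characters, alphanumeric, hyphens only between
# alphanumerics. Slashes, dots, and whitespace never match, which is what
# rejects path-traversal strings such as "../etc/passwd".
_LOGIN_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9]|-(?=[A-Za-z0-9])){0,38}")


def is_valid_github_login(login):
    """Return True only for a well-formed login; call this before building any API URL."""
    return isinstance(login, str) and _LOGIN_RE.fullmatch(login) is not None
```

Using fullmatch rather than match means a trailing newline or appended path segment cannot slip past the check.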

Harness context the tests were withholding

  • Feed the Step 4 health-rating thresholds to the step-3-aggregate case.
  • Define pool_name for explicit-key selectors, and runtime_version as the
    full runtime stack.
  • Treat ordering as vacuously satisfied on an empty dependency-audit report.
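
The vacuous-ordering rule in the last bullet can be sketched in a few lines. The function name and the key argument are hypothetical, and the real dependency-audit report may order its entries by a different criterion. The point is only that a check over adjacent pairs passes when there are no pairs.

```python
def ordering_satisfied(entries, key):
    """True when entries are non-decreasing by key.

    An empty or single-entry report has no adjacent pairs, so all() over an
    empty sequence returns True and the ordering holds vacuously.
    """
    keys = [key(e) for e in entries]
    return all(a <= b for a, b in zip(keys, keys[1:]))
```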

Test-case fix

  • Rewrote the no-code-pointer classify case to name nothing locatable, so it
    cleanly tests the intent instead of arguing over whether a command name
    counts as a pointer.

Type of change

  • Skill change (.claude/skills/<name>/) — eval fixtures updated below
  • Tool / bridge contract (tools/<system>/*.md)
  • Python package (tools/*/ with pyproject.toml)
  • Groovy reference impl
  • Cross-cutting (RFC, AGENTS.md, sandbox, privacy-LLM)
  • Documentation (docs/, README.md, CONTRIBUTING.md)
  • Project template (projects/_template/)
  • CI / dev loop (prek, workflows, validators)
  • Other:

Test plan

  • prek run --all-files passes
  • For Python packages touched: uv run pytest / ruff check / mypy passes
  • For Groovy bridges touched: command-line invocation tested end-to-end
  • For skill changes: eval suite passes for the affected skill
    (PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner tools/skill-evals/evals/<skill>/)
  • For skill behaviour changes: a new or updated eval fixture is included in this PR
    (a regression test for the bug fixed / the behaviour added — see CONTRIBUTING.md)
  • Other:

justinmclean self-assigned this Jul 1, 2026

Stop failing legitimate model variation on contested set-valued fields.
Route tracks/blocking_factors/failing_criteria/violations/findings to the
judge or drop the redundant ones, and rewrite the pairing review-findings
fixtures to assert the essential finding(s) while tolerating extras.

Sharpen skills where evals surfaced real gaps: contributor-nomination login
validation and mentoring-track scoping, readiness R4/R6, sweep G3 code
pointer (with an independence rule and worked example), issue-reproducer
E-vague vs E-precise, pairing correctness severity, and a self-review
prompt-injection guard.

Give the harness the step context it was withholding: issue-backlog-stats
health thresholds, pool-selection pool_name, inventory runtime_version, and
vacuous-true ordering on empty dependency-audit reports.

potiuk merged commit 5d11eff into apache:main Jul 1, 2026
31 checks passed
2 participants