
fix(evals): correct pr-management-code-review fixtures #667

Merged
potiuk merged 3 commits into apache:main from justinmclean:pr-management-code-review-evals
Jul 1, 2026

@justinmclean

Member

Summary

  • injection-guard: grade excerpt as prose via grading-schema.json; fix
    case-1 gold to name both invoked authorities (security-team + maintainer)
  • selector-resolution: state area_label carries the area: prefix so the
    exact-match is deterministic
  • step-2.5-slop-detection: drop S2 from case-6 gold (injection body is not
    template-only; outcome stays early-exit)
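To illustrate the selector-resolution point, here is a minimal sketch of why an exact-match stays deterministic only when `area_label` already carries the `area:` prefix. The function name `matches_area` and the label values are hypothetical, chosen for illustration; they are not taken from the skill or its fixtures.

```python
def matches_area(pr_labels: list[str], area_label: str) -> bool:
    """Exact, case-sensitive membership test.

    Hypothetical helper: no prefix is added or stripped, so the caller
    must pass the label exactly as it appears on the PR.
    """
    return area_label in pr_labels


# Hypothetical labels on a PR.
labels = ["area:scheduler", "kind:bug"]

with_prefix = matches_area(labels, "area:scheduler")  # prefixed form matches
without_prefix = matches_area(labels, "scheduler")    # bare form does not
```

If the fixture left the prefix unstated, both the prefixed and the bare form would be defensible answers, and an exact-match grader would accept only one of them.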

Type of change

  • Skill change (.claude/skills/<name>/) — eval fixtures updated below
  • Tool / bridge contract (tools/<system>/*.md)
  • Python package (tools/*/ with pyproject.toml)
  • Groovy reference impl
  • Cross-cutting (RFC, AGENTS.md, sandbox, privacy-LLM)
  • Documentation (docs/, README.md, CONTRIBUTING.md)
  • Project template (projects/_template/)
  • CI / dev loop (prek, workflows, validators)
  • Other:

Test plan

  • prek run --all-files passes
  • For Python packages touched: uv run pytest / ruff check / mypy passes
  • For Groovy bridges touched: command-line invocation tested end-to-end
  • For skill changes: eval suite passes for the affected skill
    (PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner tools/skill-evals/evals/<skill>/)
  • For skill behaviour changes: a new or updated eval fixture is included in this PR
    (a regression test for the bug fixed / the behaviour added — see CONTRIBUTING.md)
  • Other:
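The eval-suite command in the checklist can be assembled per skill. The sketch below only builds the argument list and environment from the paths shown in the checklist; the helper name `build_eval_command` and its `repo_root` parameter are assumptions for illustration, not part of the `skill_evals` package.

```python
import os
from pathlib import Path


def build_eval_command(skill: str, repo_root: Path = Path(".")):
    """Return (argv, env) mirroring the checklist command for one skill.

    Hypothetical helper: it only mirrors the paths quoted in the test plan.
    """
    src_dir = repo_root / "tools" / "skill-evals" / "src"
    evals_dir = repo_root / "tools" / "skill-evals" / "evals" / skill
    # Copy the current environment and point PYTHONPATH at the runner sources.
    env = dict(os.environ, PYTHONPATH=src_dir.as_posix())
    argv = ["python3", "-m", "skill_evals.runner", evals_dir.as_posix() + "/"]
    return argv, env


argv, env = build_eval_command("pr-management-code-review")
```

From the repository root, passing the result to `subprocess.run(argv, env=env, check=True)` would then run the suite for that skill and raise if the runner exits with a non-zero status.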

Collapse "What's been built" to one line per item; all 22 planned work
items preserved verbatim; redundant shipped-state notes trimmed.

Generated-by: Claude (Opus 4.7)

Populate four empty stub directories in
tools/skill-evals/evals/pr-management-code-review/ with fixture cases:

- selector-resolution (4 cases): parser tests covering single-pr,
  composed area+collab+max flags, default my-reviews, and
  requested-only with dry-run and inline:off
- review-risk-classify (4 cases): per-finding severity tests covering
  blocking (GPL dep), major (missing tests), minor (AI disclosure
  absent), and none (clean code-quality change)
- injection-guard (4 cases): prompt-injection resistance across
  PR-body, code-comment, commit-message, and a clean-PR control
- review-handoff (3 cases): confirmation-gate tests for confirm-post,
  dry-run-skip, and wording-edit re-draft

Clear the "lacks a dedicated eval suite" Known-gap from
specs/pr-management-family.md now that acceptance criterion 6 is met.
Update pr-management-code-review/README.md from 79 to 112 total cases
and add the four new suites to the table; sync the skill-evals
top-level README.

Generated-by: Claude (Opus 4.7)

- injection-guard: grade excerpt as prose via grading-schema.json; fix
  case-1 gold to name both invoked authorities (security-team + maintainer)
- selector-resolution: state area_label carries the area: prefix so the
  exact-match is deterministic
- step-2.5-slop-detection: drop S2 from case-6 gold (injection body is not
  template-only; outcome stays early-exit)
@justinmclean self-assigned this Jul 1, 2026
@potiuk merged commit 3d1a2eb into apache:main Jul 1, 2026
30 checks passed
