fix(evals): correct pr-management-code-review fixtures by justinmclean · Pull Request #667 · apache/magpie

justinmclean · 2026-07-01T08:47:33Z

Summary

- injection-guard: grade excerpt as prose via grading-schema.json; fix case-1 gold to name both invoked authorities (security-team + maintainer)
- selector-resolution: state area_label carries the area: prefix so the exact-match is deterministic
- step-2.5-slop-detection: drop S2 from case-6 gold (injection body is not template-only; outcome stays early-exit)
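
To illustrate why carrying the area: prefix in area_label makes the match deterministic, here is a minimal sketch. It is not the skill's actual resolver: the function names and the label values (area:scheduler, kind:bug) are hypothetical, and it assumes PR labels are compared as plain strings.

```python
AREA_PREFIX = "area:"


def normalize_area_label(value: str) -> str:
    """Return the value with the 'area:' prefix, adding it only if absent."""
    value = value.strip()
    return value if value.startswith(AREA_PREFIX) else AREA_PREFIX + value


def pr_matches_area(pr_labels: list[str], area_label: str) -> bool:
    """Exact, case-sensitive membership test; no substring or fuzzy matching."""
    return area_label in pr_labels


# Hypothetical labels on a PR.
labels = ["area:scheduler", "kind:bug"]

print(pr_matches_area(labels, "area:scheduler"))  # True
print(pr_matches_area(labels, "scheduler"))       # False
print(pr_matches_area(labels, normalize_area_label("scheduler")))  # True
```

With the prefix stated in the fixture, the gold answer and the grader compare the same string, so a bare "scheduler" can never pass by accident.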

Type of change

Skill change (.claude/skills/<name>/) — eval fixtures updated below
Tool / bridge contract (tools/<system>/*.md)
Python package (tools/*/ with pyproject.toml)
Groovy reference impl
Cross-cutting (RFC, AGENTS.md, sandbox, privacy-LLM)
Documentation (docs/, README.md, CONTRIBUTING.md)
Project template (projects/_template/)
CI / dev loop (prek, workflows, validators)
Other:

Test plan

prek run --all-files passes
For Python packages touched: uv run pytest / ruff check / mypy passes
For Groovy bridges touched: command-line invocation tested end-to-end
For skill changes: eval suite passes for the affected skill
(PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner tools/skill-evals/evals/<skill>/)
For skill behaviour changes: a new or updated eval fixture is included in this PR
(a regression test for the bug fixed / the behaviour added — see CONTRIBUTING.md)
Other:

Populate four empty stub directories in tools/skill-evals/evals/pr-management-code-review/ with fixture cases:

- selector-resolution (4 cases): parser tests covering single-pr, composed area+collab+max flags, default my-reviews, and requested-only with dry-run and inline:off
- review-risk-classify (4 cases): per-finding severity tests covering blocking (GPL dep), major (missing tests), minor (AI disclosure absent), and none (clean code-quality change)
- injection-guard (4 cases): prompt-injection resistance across PR-body, code-comment, commit-message, and a clean-PR control
- review-handoff (3 cases): confirmation-gate tests for confirm-post, dry-run-skip, and wording-edit re-draft

Clear the "lacks a dedicated eval suite" Known-gap from specs/pr-management-family.md now that acceptance criterion 6 is met.

Update pr-management-code-review/README.md from 79 to 112 total cases and add the four new suites to the table; sync the skill-evals top-level README.

Generated-by: Claude (Opus 4.7)
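
The four review-risk-classify cases can be pictured as a gold table that a grader compares against predicted severities. The sketch below is illustrative only: the case keys, the Severity enum, and the grade function are hypothetical and do not reflect the real fixture format or the skill_evals runner.

```python
from enum import IntEnum


class Severity(IntEnum):
    NONE = 0
    MINOR = 1
    MAJOR = 2
    BLOCKING = 3


# Hypothetical gold labels mirroring the four cases named in the commit message.
GOLD = {
    "gpl-dependency": Severity.BLOCKING,
    "missing-tests": Severity.MAJOR,
    "ai-disclosure-absent": Severity.MINOR,
    "clean-code-quality-change": Severity.NONE,
}


def grade(predicted: dict[str, Severity]) -> list[str]:
    """Return the case names whose predicted severity differs from gold."""
    return [name for name, gold in GOLD.items() if predicted.get(name) != gold]


# One deliberately wrong prediction: missing tests under-rated as minor.
predicted = {**GOLD, "missing-tests": Severity.MINOR}
print(grade(predicted))  # ['missing-tests']
```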

justinmclean added 3 commits July 1, 2026 11:08

chore(spec-loop): consolidate implementation plan

3f3de5a

Collapse "What's been built" to one line per item; all 22 planned work items preserved verbatim; redundant shipped-state notes trimmed. Generated-by: Claude (Opus 4.7)

justinmclean self-assigned this Jul 1, 2026

potiuk merged commit 3d1a2eb into apache:main Jul 1, 2026
30 checks passed

potiuk merged 3 commits into apache:main from justinmclean:pr-management-code-review-evals
