Merged
11 changes: 11 additions & 0 deletions .github/actionlint.yaml
@@ -18,3 +18,14 @@ paths:
    ignore:
      - 'shellcheck reported issue in this script: SC2015'
      - 'shellcheck reported issue in this script: SC2016'
  # The waza eval workflows emit markdown PR comments via `printf` with
  # backticks inside single-quoted strings (literal markdown code spans like
  # `prompt`, `continue_session: true`). SC2016 ("Expressions don't expand in
  # single quotes") is exactly the intent — single quotes prevent the shell
  # from interpolating. Silence for these two files only.
  ".github/workflows/waza-evals.yml":
    ignore:
      - 'shellcheck reported issue in this script: SC2016'
  ".github/workflows/waza-agent-evals.yml":
    ignore:
      - 'shellcheck reported issue in this script: SC2016'
56 changes: 56 additions & 0 deletions .github/evals/README.md
@@ -0,0 +1,56 @@
# Git-Ape eval harness

Behavioral evals for the skills under `.github/skills/` and the agents
under `.github/agents/`. Investigated as part of [#61][issue-61].

## Decision: waza

We evaluated three options before landing the harness:

| Option | Verdict | Why |
|---|---|---|
| [`openai/evals`][openai-evals] | Rejected | Python-only ecosystem, Completion-Function-Protocol coupling to OpenAI models, and a registry shape that doesn't match how this repo loads skills/agents (filesystem-discovered Markdown with YAML frontmatter). |
| Custom Node harness (per [PR #40][pr-40] spike) | Rejected | Would have to reinvent grader composition, multi-model fan-out, CI fixture management, and PR-comment rendering. Net new surface area to maintain. |
| **[`waza`][waza]** | **Selected** | Already speaks the "skill / agent / task" vocabulary this repo uses, ships native cross-model `waza compare`, has a token/quality auditor, and integrates with both VS Code Copilot and GitHub Actions. Matches the maintainer workflow we want (`/skill-onboard` → `/skill-bench` → `/skill-improve` → `/skill-promote`). |

## Layout

```
.github/evals/
├── manifest.yaml                  # Skill tier configuration (skills only)
├── <skill-name>/
│   ├── eval.yaml                  # Skill eval definition
│   └── tasks/*.yaml               # Per-task graders
└── agents/<agent-name>/
    ├── eval.yaml                  # Agent eval definition
    ├── <agent-name>.agent.md      # Mirror of the canonical .agent.md
    └── tasks/*.yaml               # Per-task graders
```

Skills are discovered via [`manifest.yaml`](./manifest.yaml). Agents are
auto-discovered from the filesystem (no manifest entry needed).
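
As a rough illustration of the two discovery paths, the sketch below fans manifest skills out across the models of their tier and globs the filesystem for agent suites. It is not the workflow's actual code: the function names, the `skill` / `model` / `baseline` output keys, and the placeholder model names are assumptions made for this example, and the manifest is a hand-written dict standing in for the parsed YAML.

```python
from pathlib import Path


def expand_matrix(manifest: dict) -> list[dict]:
    """Fan each manifest skill out across every model in its tier."""
    include = []
    for skill in manifest["skills"]:
        for model in manifest["tiers"][skill["tier"]]["models"]:
            include.append(
                {
                    "skill": skill["name"],
                    "model": model["name"],
                    # Models without an explicit flag are treated as non-baseline.
                    "baseline": model.get("baseline", False),
                }
            )
    return include


def discover_agents(evals_root: Path) -> list[str]:
    """Return the names of agent directories that contain an eval.yaml."""
    return sorted(p.parent.name for p in evals_root.glob("agents/*/eval.yaml"))


# Hypothetical stand-in for a parsed manifest.yaml (illustrative values only).
manifest = {
    "skills": [{"name": "prereq-check", "tier": "pilot"}],
    "tiers": {
        "pilot": {"models": [{"name": "model-a"}, {"name": "model-b", "baseline": True}]},
        "expanded": {"models": [{"name": "model-a"}]},
    },
}
matrix = expand_matrix(manifest)
```

Here `matrix` holds one entry per skill and model pair, in manifest order, while `discover_agents(Path(".github/evals"))` would list agent suites without consulting the manifest at all.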

## How to add a new eval suite

Run one of the slash commands from VS Code (Copilot Chat). They scaffold
the directory, patch it to repo conventions, and run a smoke trial:

- **Skills** — `/skill-onboard skillName=<name>`
- **Agents** — `/agent-onboard agentName=<name>`

Full lifecycle (onboard → bench → improve → promote) is documented in
the [authoring docs][authoring-evals].

## CI wiring

- Skills — [`.github/workflows/waza-evals.yml`](../workflows/waza-evals.yml)
- Agents — [`.github/workflows/waza-agent-evals.yml`](../workflows/waza-agent-evals.yml)

Both run on PRs touching the relevant artifacts, post results as a PR
comment, and are currently **non-blocking**.

[issue-61]: https://github.com/Azure/git-ape/issues/61
[pr-40]: https://github.com/Azure/git-ape/pull/40
[openai-evals]: https://github.com/openai/evals
[waza]: https://github.com/microsoft/waza
[authoring-evals]: https://azure.github.io/git-ape/docs/authoring/evals
47 changes: 47 additions & 0 deletions .github/evals/manifest.yaml
@@ -0,0 +1,47 @@
# Single source of truth for the waza-evals workflow matrix.
#
# Consumed by `.github/workflows/waza-evals.yml` (prepare job) to:
# - decide which skills are configured for evaluation
# - generate the matrix.include payload (skill × model fan-out per tier)
# - drive the per-skill ordering of the PR comment
#
# Everything else (skill markdown, eval.yaml, tasks, fixtures) is
# auto-discovered from the filesystem by waza itself. This file only
# exists because waza has no native "tier" concept.
#
# Maintenance:
# - Add a skill: append a `{ name, tier }` entry to `skills:`. Make sure
# `.github/skills/<name>/SKILL.md` and `.github/evals/<name>/eval.yaml`
# exist.
# - Promote a skill (expanded → pilot): change its `tier:`.
# - Add/remove a model on a tier: edit `tiers.<tier>.models:`.
# - Editing this file triggers the FULL matrix on PR (config-wide change).
#
# Bootstrap state (PR 1 of the eval harness port):
# Only `prereq-check` is enabled at landing time — it doubles as the
# harness smoke test. Each remaining skill suite ships in its own PR
# tracked under https://github.com/Azure/git-ape/issues/93.

# Ordered list of evaluable skills. Order controls the PR-comment ordering.
skills:
  # Pilot tier: full multi-model fan-out (most-trusted skills).
  - name: prereq-check
    tier: pilot

# Per-tier model fan-out. The matrix runs each selected skill against every
# model in its tier. To compare additional models, add them here.
#
# Models with `baseline: true` run with `waza run --baseline` (A/B mode) to
# cap quota cost. The PR comment labels them clearly.
tiers:
  pilot:
    models:
      - name: claude-sonnet-4.6
      - name: gpt-5.4
        baseline: true
      - name: gpt-5-codex
      - name: claude-opus-4.6
  expanded:
    models:
      - name: claude-sonnet-4.6
      - name: gpt-5-codex
42 changes: 42 additions & 0 deletions .github/evals/prereq-check/eval.yaml
@@ -0,0 +1,42 @@
# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/eval.schema.json

# Pilot evaluation suite for the prereq-check skill.
# Validates trigger precision via the heuristic `trigger` grader.
#
# Run: waza run .github/evals/prereq-check/eval.yaml

name: prereq-check-eval
description: Trigger precision pilot for prereq-check.
skill: prereq-check
version: "0.2"

config:
  # Pilot tier: 3 trials per task for flake detection (per skill-promote contract).
  # Single-trial runs hide model nondeterminism on borderline triggers.
  trials_per_task: 3
  timeout_seconds: 60
  parallel: false
  executor: copilot-sdk
  model: claude-sonnet-4.6

metrics:
  - name: trigger_precision
    weight: 1.0
    threshold: 0.6
    description: Skill should activate on tooling/install prompts and stay quiet otherwise.

graders:
  # Budget grader: prereq-check is a lightweight diagnostic; flag anything
  # that explodes in tool calls or takes longer than expected.
  - type: behavior
    name: budget
    config:
      max_tool_calls: 30
      max_duration_ms: 240000

# answer_quality (LLM-as-judge) is scoped per-task on positive tasks only.
# Keeps judge-model errors from zeroing out the negative-task trigger check
# in the same leg.

tasks:
  - "tasks/*.yaml"
16 changes: 16 additions & 0 deletions .github/evals/prereq-check/tasks/negative-template-edit.yaml
@@ -0,0 +1,16 @@
# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json

id: negative-trigger-template-edit
name: Negative — Editing an ARM template
description: Editing template JSON should NOT trigger prereq-check.
# See positive-command-not-found.yaml for `mutable-by-*` tag semantics.
tags: [trigger, negative, mutable-by-skill]
inputs:
  prompt: "Add a tag block to the storageAccount resource in this ARM template."
graders:
  - name: trigger_relevance_negative
    type: trigger
    config:
      skill_path: .github/skills/prereq-check/SKILL.md
      mode: negative
      threshold: 0.5
@@ -0,0 +1,16 @@
# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json

id: negative-trigger-conceptual-azure
name: Negative — Azure service concept question
description: A pure conceptual question about an Azure service's behavior has nothing to do with local CLI prerequisites, versions, or auth sessions and must not trigger prereq-check.
# See positive-command-not-found.yaml for `mutable-by-*` tag semantics.
tags: [trigger, negative, mutable-by-skill]
inputs:
  prompt: "Can you explain how Azure Container Apps revisions work? I want to understand the difference between single-revision and multiple-revision mode, how traffic splitting between revisions behaves, and what triggers a new revision to be created."
graders:
  - name: trigger_relevance_negative
    type: trigger
    config:
      skill_path: .github/skills/prereq-check/SKILL.md
      mode: negative
      threshold: 0.5
54 changes: 54 additions & 0 deletions .github/evals/prereq-check/tasks/positive-command-not-found.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json

id: positive-command-not-found
name: Positive — "command not found" failure
description: Skill should be invoked when CLI tools are missing.
# `mutable-by-*` tag declares which artifact must change for this task's
# score to move. Values:
# mutable-by-skill — score reflects SKILL.md (trigger graders)
# mutable-by-agent — score reflects .agent.md (persona, workflow, identity)
# mutable-by-eval-grader — score is locked by grader/task design; only this YAML can fix it
# Currently consumed by `/agent-improve` (actionability surface + Locked?
# column + verdict gate). `/skill-improve` integration is tracked as a
# follow-up; until it lands, the tag is informational on skill tasks.
tags: [trigger, positive, mutable-by-skill]
inputs:
  prompt: "az: command not found — what tools should be installed for Git-Ape skills?"
graders:
  - name: trigger_relevance_positive
    type: trigger
    config:
      skill_path: .github/skills/prereq-check/SKILL.md
      mode: positive
      threshold: 0.5

  # answer_quality (LLM-as-judge): scoped per-task on positives so a flaky
  # judge call only zeroes out this task, not the whole leg. See eval.yaml.
  #
  # IMPORTANT: waza prompt graders are binary (set_waza_grade_pass = 1.0,
  # set_waza_grade_fail = 0.0). They are NOT 1–5 rubrics. The judge has NO
  # access to the agent's response unless continue_session: true is set — it
  # resumes the agent's own session so it can read the response.
  - type: prompt
    name: answer_quality
    config:
      continue_session: true
      prompt: |
        You are grading the assistant's previous response in this session.
        The user reported `az: command not found` and asked what tools should
        be installed for Git-Ape skills.

        PASS criteria — the response must contain ALL of:
        1. Names ALL of the core required tools: `az` (Azure CLI), `gh`
           (GitHub CLI), `jq`, and `git`.
        2. Provides a concrete install command for the missing `az` for at
           least one platform (macOS via `brew install azure-cli`, Linux
           via apt/yum/curl, or Windows via winget/MSI).
        3. Recommends verifying versions after install (e.g., `az --version`,
           `gh --version`).
        4. Reaches a verdict / next step — either a readiness summary
           (✅ READY / ⚠️ Missing) OR explicit "after install, run X to
           verify".

        If ALL four PASS criteria are met, call `set_waza_grade_pass`.
        Otherwise, call `set_waza_grade_fail` and list which criteria are missing.
46 changes: 46 additions & 0 deletions .github/evals/prereq-check/tasks/positive-what-do-i-need.yaml
@@ -0,0 +1,46 @@
# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json

id: positive-what-do-i-need
name: Positive — "What do I need to install?"
description: Direct phrasing from the SKILL.md "When to use" section.
# See positive-command-not-found.yaml for `mutable-by-*` tag semantics.
tags: [trigger, positive, mutable-by-skill]
inputs:
  prompt: "What do I need to install before running git-ape onboarding?"
graders:
  - name: trigger_relevance_positive
    type: trigger
    config:
      skill_path: .github/skills/prereq-check/SKILL.md
      mode: positive
      threshold: 0.5

  # answer_quality (LLM-as-judge): scoped per-task on positives so a flaky
  # judge call only zeroes out this task, not the whole leg. See eval.yaml.
  #
  # IMPORTANT: waza prompt graders are binary (set_waza_grade_pass = 1.0,
  # set_waza_grade_fail = 0.0). They are NOT 1–5 rubrics. The judge has NO
  # access to the agent's response unless continue_session: true is set — it
  # resumes the agent's own session so it can read the response.
  - type: prompt
    name: answer_quality
    config:
      continue_session: true
      prompt: |
        You are grading the assistant's previous response in this session.
        The user asked what they need to install before running Git-Ape
        onboarding.

        PASS criteria — the response must contain ALL of:
        1. Lists `az` (Azure CLI), `gh` (GitHub CLI), `jq`, and `git` as
           required tools.
        2. Notes authentication requirements (at minimum `az login`; ideally
           also `gh auth login`).
        3. Mentions either minimum versions OR "use latest stable" / a
           version check command.
        4. Provides install commands or points the user to a verification
           script (e.g., a prereq-check skill invocation that runs the
           checks for them).

        If ALL four PASS criteria are met, call `set_waza_grade_pass`.
        Otherwise, call `set_waza_grade_fail` and list which criteria are missing.