Azure · arnaudlh · May 25, 2026 · May 19, 2026
diff --git a/.github/actionlint.yaml b/.github/actionlint.yaml
@@ -18,3 +18,14 @@ paths:
     ignore:
       - 'shellcheck reported issue in this script: SC2015'
       - 'shellcheck reported issue in this script: SC2016'
+  # The waza eval workflows emit markdown PR comments via `printf` with
+  # backticks inside single-quoted strings (literal markdown code spans like
+  # `prompt`, `continue_session: true`). SC2016 ("Expressions don't expand in
+  # single quotes") is exactly the intent — single quotes prevent the shell
+  # from interpolating. Silence SC2016 for these two files only.
+  ".github/workflows/waza-evals.yml":
+    ignore:
+      - 'shellcheck reported issue in this script: SC2016'
+  ".github/workflows/waza-agent-evals.yml":
+    ignore:
+      - 'shellcheck reported issue in this script: SC2016'
diff --git a/.github/evals/README.md b/.github/evals/README.md
@@ -0,0 +1,56 @@
+# Git-Ape eval harness
+
+Behavioral evals for the skills under `.github/skills/` and the agents
+under `.github/agents/`. The harness was investigated as part of [#61][issue-61].
+
+## Decision: waza
+
+We evaluated three options before landing the harness:
+
+| Option | Verdict | Why |
+|---|---|---|
+| [`openai/evals`][openai-evals] | Rejected | Python-only ecosystem, Completion-Function-Protocol coupling to OpenAI models, and a registry shape that doesn't match how this repo loads skills/agents (filesystem-discovered Markdown with YAML frontmatter). |
+| Custom Node harness (per [PR #40][pr-40] spike) | Rejected | We would have to reinvent grader composition, multi-model fan-out, CI fixture management, and PR-comment rendering, all of it net-new surface area to maintain. |
+| **[`waza`][waza]** | **Selected** | Already speaks the "skill / agent / task" vocabulary this repo uses, ships native cross-model `waza compare`, has a token/quality auditor, and integrates with both VS Code Copilot and GitHub Actions. Matches the maintainer workflow we want (`/skill-onboard` → `/skill-bench` → `/skill-improve` → `/skill-promote`). |
+
+## Layout
+
+```
+.github/evals/
+├── manifest.yaml                       # Skill tier configuration (skills only)
+├── <skill-name>/
+│   ├── eval.yaml                       # Skill eval definition
+│   └── tasks/*.yaml                    # Per-task graders
+└── agents/<agent-name>/
+    ├── eval.yaml                       # Agent eval definition
+    ├── <agent-name>.agent.md           # Mirror of the canonical .agent.md
+    └── tasks/*.yaml                    # Per-task graders
+```
+
+Skills are discovered via [`manifest.yaml`](./manifest.yaml). Agents are
+auto-discovered from the filesystem (no manifest entry needed).
+
+## How to add a new eval suite
+
+Run one of the slash commands from VS Code (Copilot Chat). They scaffold
+the directory, patch it to repo conventions, and run a smoke trial:
+
+- **Skills** — `/skill-onboard skillName=<name>`
+- **Agents** — `/agent-onboard agentName=<name>`
+
+Full lifecycle (onboard → bench → improve → promote) is documented in
+the [authoring docs][authoring-evals].
+
+## CI wiring
+
+- Skills — [`.github/workflows/waza-evals.yml`](../workflows/waza-evals.yml)
+- Agents — [`.github/workflows/waza-agent-evals.yml`](../workflows/waza-agent-evals.yml)
+
+Both run on PRs touching the relevant artifacts, post results as a PR
+comment, and are currently **non-blocking**.
+
+[issue-61]: https://github.com/Azure/git-ape/issues/61
+[pr-40]: https://github.com/Azure/git-ape/pull/40
+[openai-evals]: https://github.com/openai/evals
+[waza]: https://github.com/microsoft/waza
+[authoring-evals]: https://azure.github.io/git-ape/docs/authoring/evals
diff --git a/.github/evals/manifest.yaml b/.github/evals/manifest.yaml
@@ -0,0 +1,47 @@
+# Single source of truth for the waza-evals workflow matrix.
+#
+# Consumed by `.github/workflows/waza-evals.yml` (prepare job) to:
+#   - decide which skills are configured for evaluation
+#   - generate the matrix.include payload (skill × model fan-out per tier)
+#   - drive the per-skill ordering of the PR comment
+#
+# Everything else (skill markdown, eval.yaml, tasks, fixtures) is
+# auto-discovered from the filesystem by waza itself. This file only
+# exists because waza has no native "tier" concept.
+#
+# Maintenance:
+#   - Add a skill: append a `{ name, tier }` entry to `skills:`. Make sure
+#     `.github/skills/<name>/SKILL.md` and `.github/evals/<name>/eval.yaml`
+#     exist.
+#   - Promote a skill (expanded → pilot): change its `tier:`.
+#   - Add/remove a model on a tier: edit `tiers.<tier>.models:`.
+#   - Editing this file triggers the FULL matrix on PR (config-wide change).
+#
+# Bootstrap state (PR 1 of the eval harness port):
+#   Only `prereq-check` is enabled at landing time — it doubles as the
+#   harness smoke test. Each remaining skill suite ships in its own PR
+#   tracked under https://github.com/Azure/git-ape/issues/93.
+
+# Ordered list of evaluable skills. Order controls the PR-comment ordering.
+skills:
+  # Pilot tier: full multi-model fan-out (most-trusted skills).
+  - name: prereq-check
+    tier: pilot
+
+# Per-tier model fan-out. The matrix runs each selected skill against every
+# model in its tier. To compare additional models, add them here.
+#
+# Models with `baseline: true` run with `waza run --baseline` (A/B mode) to
+# cap quota cost. The PR comment labels them clearly.
+tiers:
+  pilot:
+    models:
+      - name: claude-sonnet-4.6
+      - name: gpt-5.4
+        baseline: true
+      - name: gpt-5-codex
+      - name: claude-opus-4.6
+  expanded:
+    models:
+      - name: claude-sonnet-4.6
+      - name: gpt-5-codex
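
Reviewer note: the header comment says the prepare job turns this manifest into a `matrix.include` payload (skill × model fan-out per tier). The following is a minimal sketch of that expansion, not the workflow's actual implementation. The manifest is inlined as a Python dict to keep the example dependency-free, and the function name `build_matrix`, the `selected` parameter, and the output row keys (`skill`, `tier`, `model`, `baseline`) are all assumptions made for illustration.

```python
import json

# Inlined copy of manifest.yaml; a real job would parse the YAML file instead.
MANIFEST = {
    "skills": [{"name": "prereq-check", "tier": "pilot"}],
    "tiers": {
        "pilot": {
            "models": [
                {"name": "claude-sonnet-4.6"},
                {"name": "gpt-5.4", "baseline": True},
                {"name": "gpt-5-codex"},
                {"name": "claude-opus-4.6"},
            ]
        },
        "expanded": {
            "models": [
                {"name": "claude-sonnet-4.6"},
                {"name": "gpt-5-codex"},
            ]
        },
    },
}


def build_matrix(manifest, selected=None):
    """Return one row per (skill, model) pair, preserving manifest order.

    `selected` (hypothetical) restricts the run to a subset of skill names;
    None means every configured skill, i.e. the full matrix.
    """
    rows = []
    for skill in manifest["skills"]:
        if selected is not None and skill["name"] not in selected:
            continue
        tier = skill["tier"]
        if tier not in manifest["tiers"]:
            raise ValueError(f"skill {skill['name']!r} uses unknown tier {tier!r}")
        for model in manifest["tiers"][tier]["models"]:
            rows.append(
                {
                    "skill": skill["name"],
                    "tier": tier,
                    "model": model["name"],
                    # Models without the flag default to a non-baseline run.
                    "baseline": bool(model.get("baseline", False)),
                }
            )
    return rows


if __name__ == "__main__":
    print(json.dumps({"include": build_matrix(MANIFEST)}, indent=2))
```

With the bootstrap manifest above, this yields four rows: `prereq-check` against each of the four pilot-tier models, with only `gpt-5.4` flagged as a baseline run. Iterating skills in list order is what would let the same structure drive the PR-comment ordering.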
diff --git a/.github/evals/prereq-check/eval.yaml b/.github/evals/prereq-check/eval.yaml
@@ -0,0 +1,42 @@
+# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/eval.schema.json
+
+# Pilot evaluation suite for the prereq-check skill.
+# Validates trigger precision via the heuristic `trigger` grader.
+#
+# Run: waza run .github/evals/prereq-check/eval.yaml
+
+name: prereq-check-eval
+description: Trigger precision pilot for prereq-check.
+skill: prereq-check
+version: "0.2"
+
+config:
+  # Pilot tier: 3 trials per task for flake detection (per skill-promote contract).
+  # Single-trial runs hide model nondeterminism on borderline triggers.
+  trials_per_task: 3
+  timeout_seconds: 60
+  parallel: false
+  executor: copilot-sdk
+  model: claude-sonnet-4.6
+
+metrics:
+  - name: trigger_precision
+    weight: 1.0
+    threshold: 0.6
+    description: Skill should activate on tooling/install prompts and stay quiet otherwise.
+
+graders:
+  # Budget grader: prereq-check is a lightweight diagnostic; flag anything
+  # that explodes in tool calls or takes longer than expected.
+  - type: behavior
+    name: budget
+    config:
+      max_tool_calls: 30
+      max_duration_ms: 240000
+
+  # answer_quality (LLM-as-judge) is scoped per-task, on positive tasks only.
+  # This keeps judge-model errors from zeroing out the negative-task trigger
+  # check in the same leg.
+
+tasks:
+  - "tasks/*.yaml"
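
Reviewer note: the config comment argues that three trials per task expose nondeterminism that a single trial would hide. The sketch below illustrates that reasoning only. It assumes each trial yields a binary score (1.0 pass, 0.0 fail), that a task's score is the mean of its trials, and that the mean is compared against the 0.6 threshold; how waza actually aggregates trials is not stated in this PR, and `summarize_trials` is a hypothetical helper.

```python
from statistics import mean


def summarize_trials(scores, threshold=0.6):
    """Summarize binary trial scores (1.0 = pass, 0.0 = fail) for one task.

    Assumption: the task score is the mean of its trial scores.
    """
    if not scores:
        raise ValueError("at least one trial score is required")
    avg = mean(scores)
    return {
        "mean": avg,
        "passed": avg >= threshold,
        # Mixed outcomes across identical trials suggest model nondeterminism.
        "flaky": len(set(scores)) > 1,
    }


if __name__ == "__main__":
    for trial_scores in ([1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 0.0], [1.0, 0.0, 0.0]):
        print(trial_scores, summarize_trials(trial_scores))
```

Under these assumptions, a lone passing trial can never be flagged as flaky, whereas a 2-of-3 result still clears the threshold but is marked for a closer look.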
diff --git a/.github/evals/prereq-check/tasks/negative-template-edit.yaml b/.github/evals/prereq-check/tasks/negative-template-edit.yaml
@@ -0,0 +1,16 @@
+# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json
+
+id: negative-trigger-template-edit
+name: Negative — Editing an ARM template
+description: Editing template JSON should NOT trigger prereq-check.
+# See positive-command-not-found.yaml for `mutable-by-*` tag semantics.
+tags: [trigger, negative, mutable-by-skill]
+inputs:
+  prompt: "Add a tag block to the storageAccount resource in this ARM template."
+graders:
+  - name: trigger_relevance_negative
+    type: trigger
+    config:
+      skill_path: .github/skills/prereq-check/SKILL.md
+      mode: negative
+      threshold: 0.5
diff --git a/.github/evals/prereq-check/tasks/negative-trigger-conceptual-azure.yaml b/.github/evals/prereq-check/tasks/negative-trigger-conceptual-azure.yaml
@@ -0,0 +1,16 @@
+# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json
+
+id: negative-trigger-conceptual-azure
+name: Negative — Azure service concept question
+description: A pure conceptual question about an Azure service's behavior has nothing to do with local CLI prerequisites, versions, or auth sessions and must not trigger prereq-check.
+# See positive-command-not-found.yaml for `mutable-by-*` tag semantics.
+tags: [trigger, negative, mutable-by-skill]
+inputs:
+  prompt: "Can you explain how Azure Container Apps revisions work? I want to understand the difference between single-revision and multiple-revision mode, how traffic splitting between revisions behaves, and what triggers a new revision to be created."
+graders:
+  - name: trigger_relevance_negative
+    type: trigger
+    config:
+      skill_path: .github/skills/prereq-check/SKILL.md
+      mode: negative
+      threshold: 0.5
diff --git a/.github/evals/prereq-check/tasks/positive-command-not-found.yaml b/.github/evals/prereq-check/tasks/positive-command-not-found.yaml
@@ -0,0 +1,54 @@
+# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json
+
+id: positive-command-not-found
+name: Positive — "command not found" failure
+description: Skill should be invoked when CLI tools are missing.
+# `mutable-by-*` tag declares which artifact must change for this task's
+# score to move. Values:
+#   mutable-by-skill        — score reflects SKILL.md (trigger graders)
+#   mutable-by-agent        — score reflects .agent.md (persona, workflow, identity)
+#   mutable-by-eval-grader  — score is locked by grader/task design; only this YAML can fix it
+# Currently consumed by `/agent-improve` (actionability surface + Locked?
+# column + verdict gate). `/skill-improve` integration is tracked as a
+# follow-up; until it lands, the tag is informational on skill tasks.
+tags: [trigger, positive, mutable-by-skill]
+inputs:
+  prompt: "az: command not found — what tools should be installed for Git-Ape skills?"
+graders:
+  - name: trigger_relevance_positive
+    type: trigger
+    config:
+      skill_path: .github/skills/prereq-check/SKILL.md
+      mode: positive
+      threshold: 0.5
+
+  # answer_quality (LLM-as-judge): scoped per-task on positives so a flaky
+  # judge call only zeroes out this task, not the whole leg. See eval.yaml.
+  #
+  # IMPORTANT: waza prompt graders are binary (set_waza_grade_pass = 1.0,
+  # set_waza_grade_fail = 0.0). They are NOT 1–5 rubrics. The judge has NO
+  # access to the agent's response unless continue_session: true is set — it
+  # resumes the agent's own session so it can read the response.
+  - type: prompt
+    name: answer_quality
+    config:
+      continue_session: true
+      prompt: |
+        You are grading the assistant's previous response in this session.
+        The user reported `az: command not found` and asked what tools should
+        be installed for Git-Ape skills.
+
+        PASS criteria — the response must contain ALL of:
+          1. Names ALL of the core required tools: `az` (Azure CLI), `gh`
+             (GitHub CLI), `jq`, and `git`.
+          2. Provides a concrete install command for the missing `az` for at
+             least one platform (macOS via `brew install azure-cli`, Linux
+             via apt/yum/curl, or Windows via winget/MSI).
+          3. Recommends verifying versions after install (e.g., `az --version`,
+             `gh --version`).
+          4. Reaches a verdict / next step — either a readiness summary
+             (✅ READY / ⚠️ Missing) OR explicit "after install, run X to
+             verify".
+
+        If ALL four PASS criteria are met, call `set_waza_grade_pass`.
+        Otherwise, call `set_waza_grade_fail` and list which criteria are missing.
diff --git a/.github/evals/prereq-check/tasks/positive-what-do-i-need.yaml b/.github/evals/prereq-check/tasks/positive-what-do-i-need.yaml
@@ -0,0 +1,46 @@
+# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json
+
+id: positive-what-do-i-need
+name: Positive — "What do I need to install?"
+description: Direct phrasing from the SKILL.md "When to use" section.
+# See positive-command-not-found.yaml for `mutable-by-*` tag semantics.
+tags: [trigger, positive, mutable-by-skill]
+inputs:
+  prompt: "What do I need to install before running git-ape onboarding?"
+graders:
+  - name: trigger_relevance_positive
+    type: trigger
+    config:
+      skill_path: .github/skills/prereq-check/SKILL.md
+      mode: positive
+      threshold: 0.5
+
+  # answer_quality (LLM-as-judge): scoped per-task on positives so a flaky
+  # judge call only zeroes out this task, not the whole leg. See eval.yaml.
+  #
+  # IMPORTANT: waza prompt graders are binary (set_waza_grade_pass = 1.0,
+  # set_waza_grade_fail = 0.0). They are NOT 1–5 rubrics. The judge has NO
+  # access to the agent's response unless continue_session: true is set — it
+  # resumes the agent's own session so it can read the response.
+  - type: prompt
+    name: answer_quality
+    config:
+      continue_session: true
+      prompt: |
+        You are grading the assistant's previous response in this session.
+        The user asked what they need to install before running Git-Ape
+        onboarding.
+
+        PASS criteria — the response must contain ALL of:
+          1. Lists `az` (Azure CLI), `gh` (GitHub CLI), `jq`, and `git` as
+             required tools.
+          2. Notes authentication requirements (at minimum `az login`; ideally
+             also `gh auth login`).
+          3. Mentions either minimum versions OR "use latest stable" / a
+             version check command.
+          4. Provides install commands or points the user to a verification
+             script (e.g., a prereq-check skill invocation that runs the
+             checks for them).
+
+        If ALL four PASS criteria are met, call `set_waza_grade_pass`.
+        Otherwise, call `set_waza_grade_fail` and list which criteria are missing.