feat(onboarding): template-driven scaffold + register prompts/eval in matrix #142
The waza model catalog now ships gpt-5-codex under its versioned ID
gpt-5.3-codex. Align manifest tiers and bench-prompt argument hints so
dispatched runs resolve to a valid model.
- .github/evals/manifest.yaml: pilot + expanded tier model lists
- .github/prompts/agent-bench.prompt.md: default models in argument-hint + body
- .github/prompts/skill-bench.prompt.md: default models in argument-hint + body
🔖 - Generated by Copilot
Rewrite git-ape-onboarding as a skill-driven CLI playbook backed by a
sync-able template bundle. The previous .exampleyml workflows lived in
this repo's .github/workflows/ and were copy-pasted by users; they're
now first-class templates under the skill and pushed into target repos
by scripts/sync-templates.{sh,ps1}.
What ships:
- .github/agents/git-ape-onboarding.agent.md: rewritten flow + tools
- .github/skills/git-ape-onboarding/SKILL.md: new playbook structure
- .github/skills/git-ape-onboarding/scripts/: bash + pwsh helpers
  - scaffold-repo.{sh,ps1}: bootstrap target repo
  - sync-templates.{sh,ps1}: drop-in workflow + instructions update
- .github/skills/git-ape-onboarding/templates/: canonical target-repo
  artifacts (copilot-instructions.md, workflows/git-ape-{plan,deploy,
  destroy,verify,drift}.yml + drift agentic workflow + drift lockfile)
- .github/evals/git-ape-onboarding/: positive + negative tasks for
  first-time-setup, multi-env, skip-on-collision, and storage refusal
- .github/workflows/git-ape-onboarding-template-check.yml: CI check
  that the shipped templates pass actionlint and round-trip cleanly
- .github/evals/manifest.yaml: register git-ape-onboarding in pilot
  tier (matches its prior 4-model bench coverage)
Removed:
- .github/workflows/git-ape-{deploy,destroy,plan,verify}.exampleyml:
  retired; their content is now in skills/.../templates/workflows/
The .exampleyml extension was a workaround to keep GitHub Actions from
auto-loading workflow scaffolds; templates under the skill don't need
the workaround because their path isn't .github/workflows/.
🐵 - Generated by Copilot
Wire the .github/prompts/ directory into the published artifacts:
- plugin.json: declare 'prompts: .github/prompts/' so the plugin
  manifest exposes them alongside agents and skills.
- extension/package.template.json: register all 9 prompt files
  (git-ape, agent-{bench,improve,onboard,promote}, skill-{bench,
  improve,onboard,promote}) under chatPromptFiles so VS Code picks
  them up from the installed extension.
- extension/.vscodeignore: explicitly exclude dev-only .github
  subtrees (actionlint, dependabot, aw, copilot, evals, plugins,
  references, scripts, templates, workflows). Keeps agents/, skills/,
  plugin/, copilot-instructions.md, and now prompts/ in the VSIX
  while shedding ~MB of CI tooling that shouldn't ship to users.
🧩 - Generated by Copilot
…t Stacks
Align copilot-instructions with the actual workflow templates shipped by
the onboarding skill: use 'az stack sub' instead of 'az deployment sub' /
'az group delete' for the full plan-deploy-destroy lifecycle.
Why this matters for agents reading the instructions:
- The stack is the single unit of lifecycle: create, update, and destroy
  all operate on it, not on the underlying RGs.
- 'deleteAll' on unmanage cleans up every managed resource across every
  scope (subscription, multiple RGs, sub-scope role/policy assignments)
  in one call. No orphans, idempotent re-runs.
- See #30 for the design rationale.
Sample workflow snippet now also passes --action-on-unmanage deleteAll,
--deny-settings-mode none, --yes, matching what
.github/skills/git-ape-onboarding/templates/workflows/git-ape-deploy.yml
generates in target repos.
📘 - Generated by Copilot
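The flags named in the commit message above can be assembled into a single `az stack sub create` invocation. This is a minimal sketch, not the shipped workflow: the stack name, location, and template path are hypothetical placeholders that the commit message does not specify, and the command is only built and printed, never executed.

```python
import shlex


def build_stack_create_cmd(name: str, location: str, template_file: str) -> list[str]:
    """Build an `az stack sub create` argv with the flags named in the commit message."""
    return [
        "az", "stack", "sub", "create",
        "--name", name,                    # hypothetical stack name
        "--location", location,            # hypothetical location
        "--template-file", template_file,  # hypothetical template path
        "--action-on-unmanage", "deleteAll",
        "--deny-settings-mode", "none",
        "--yes",
    ]


cmd = build_stack_create_cmd("git-ape-demo", "westeurope", "main.bicep")
print(shlex.join(cmd))
```

Keeping the arguments as a list rather than a single string avoids shell-quoting problems if the list is later passed to `subprocess.run`.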
scripts/generate-docs.js: teach the workflow doc generator about two
source directories, the existing CI workflows under .github/workflows/
and the new user-facing templates under .github/skills/git-ape-
onboarding/templates/workflows/. Templated workflows get a Docusaurus
:::info admonition explaining they're scaffolded by /git-ape-onboarding
and don't run in the git-ape repo itself. Drops .exampleyml handling
since those stubs are gone.
README.md: update the Workflows table + repo tree to reflect the new
layout. The four git-ape-{plan,deploy,destroy,verify}.exampleyml stubs
no longer exist in .github/workflows/; their canonical sources are
inside the onboarding skill's templates/ directory and scaffolded into
user repos as ready-to-run .yml files. Mention skip-on-collision so
readers know existing workflows are never overwritten.
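The skip-on-collision rule can be sketched as follows. This is an illustrative Python version under stated assumptions, not the logic of the shipped scripts/sync-templates.{sh,ps1}: the function name `sync_templates`, the directory layout, and the file contents are all hypothetical.

```python
from pathlib import Path
import shutil
import tempfile


def sync_templates(template_dir: Path, target_dir: Path) -> tuple[list[str], list[str]]:
    """Copy template files into target_dir, never overwriting an existing file."""
    copied, skipped = [], []
    for src in sorted(p for p in template_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(template_dir)
        dest = target_dir / rel
        if dest.exists():
            # Collision: keep the user's file and report it instead of overwriting.
            skipped.append(rel.as_posix())
            continue
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        copied.append(rel.as_posix())
    return copied, skipped


with tempfile.TemporaryDirectory() as tmp:
    templates = Path(tmp) / "templates"
    target = Path(tmp) / "target" / ".github" / "workflows"
    templates.mkdir(parents=True)
    target.mkdir(parents=True)
    (templates / "git-ape-plan.yml").write_text("name: plan (template)\n")
    (templates / "git-ape-deploy.yml").write_text("name: deploy (template)\n")
    # The target repo already has its own plan workflow.
    (target / "git-ape-plan.yml").write_text("name: plan (user-edited)\n")

    copied, skipped = sync_templates(templates, target)
    kept = (target / "git-ape-plan.yml").read_text()
    print("copied:", copied)
    print("skipped:", skipped)
```

Returning the skipped paths lets the caller surface a notice for each collision rather than failing silently.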
website/docs/: regenerate every page that the generator touches:
- workflows/{git-ape-plan,deploy,destroy,verify}.md: relocated to the
  template source path + new admonition
- workflows/git-ape-drift-lock.md, git-ape-onboarding-template-check.md
  (new pages)
- workflows/overview.md: refreshed listing
- agents/git-ape-onboarding.md, skills/git-ape-onboarding.md,
  getting-started/onboarding.md: re-synced from current sources
- reference/{plugin-json,marketplace}.md: re-synced to pick up prompts:
  registration and chatPromptFiles entries
📚 - Generated by Copilot
…source
The auto-generated 'Continuous Drift Remediation' page documents the
compiled '.lock.yml' shape. This adds the missing hand-curated page
documenting the agentic '.md' source: schedule, severity model,
anti-flapping rules, safe-outputs configuration, and how to recompile
after editing.
Ported from the private repo with three small adaptations:
- Workflow-file path updated to the template location under
  .github/skills/git-ape-onboarding/templates/workflows/git-ape-drift.md
  (matches the autogen lock-page convention).
- Added the ':::info[Scaffolded by /git-ape-onboarding]' admonition for
  consistency with the autogen lock page; clarifies the file is shipped
  as a template, not run in the git-ape repo itself.
- Added a Related section linking to the lock-page, the
  azure-drift-detector skill, the deployment guide, and the use-case
  overview so readers can navigate the full drift story.
Marked HAND-CURATED at the top so generate-docs.js maintainers know not
to add a generator branch for '.md' workflow sources.
🌊 - Generated by Copilot
🤖 Waza agent evals (advisory)
Ran 0 agent evals against
📊 Agent file token comparison vs
🧪 Waza skill evals (advisory)
Ran 12 matrix legs in parallel (skills × models). Results are non-blocking — investigate failures via the workflow logs and the per-leg
📊 Token comparison vs
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-opus-4.6
Results saved to: .waza-results/prereq-check-claude-opus-4.6.json
JUnit XML saved to: .waza-results/prereq-check-claude-opus-4.6.junit.xml
Model: claude-sonnet-4.6
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✓ [4/4] Positive — "What do I need to install?"
[ERROR] session error: Execution failed: CAPIError: 422 422 422 Unprocessable Entity
(Request ID: 604A:1AA583:402BCC:435B2F:6A268531)
✗ [3/4] Positive — "command not found" failure
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.77 | Duration: 3m37.483s
- Tests: 4 total, 3 passed, 1 failed, 0 errors
- Success Rate: 75.0%
- Score Range: 0.57 - 1.00 (σ=0.1839)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — "command not found" failure: 67% pass rate, score=0.89±0.16
Failed Task Details
Positive — "command not found" failure
Run 2/3 (error):
- ❌ answer_quality (0.00): fail: Assistant did not deliver any user-facing response: The assistant's previous response only invoked the prereq-check skill and attempted tool calls that failed with "unexpected user permission response". No user-facing message was produced. Missing all four PASS criteria: (1) did not name az/gh/jq/git, (2) did not provide install commands, (3) did not recommend version verification, (4) did not give a verdict or next steps.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-sonnet-4.6
Results saved to: .waza-results/prereq-check-claude-sonnet-4.6.json
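The flaky-task line above (67% pass rate, score=0.89±0.16) can be reproduced from the grader scores in the failed run. This is a sketch under two assumptions that the log does not confirm: that a run's score is the unweighted mean of its grader scores, and that the ± value is a population standard deviation across runs. It also assumes the two passing runs scored 1.00 on every grader.

```python
from statistics import mean, pstdev

# Grader scores for the failed run: answer_quality, budget, trigger_relevance_positive.
failed_run_score = mean([0.00, 1.00, 1.00])

# (passed, score) per run. Assumption: both passing runs scored 1.00 on every grader.
runs = [(True, 1.00), (True, 1.00), (False, failed_run_score)]

pass_rate = sum(passed for passed, _ in runs) / len(runs)
scores = [score for _, score in runs]
summary = f"{pass_rate:.0%} pass rate, score={mean(scores):.2f}±{pstdev(scores):.2f}"
print(summary)  # 67% pass rate, score=0.89±0.16
```

The same averaging matches other rows in this report, for example budget 1.00 and trigger_relevance_negative 0.20 giving the 0.60 shown for the Azure service concept question.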
Model: gpt-5.3-codex
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: E82A:1F8B1A:3D7FE0:408A35:6A2684DF)
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: E82A:1F8B1A:3D7F60:4089A4:6A2684DE)
✓ [1/4] Negative — Editing an ARM template
✗ [2/4] Negative — Azure service concept question
✗ [4/4] Positive — "What do I need to install?"
✗ [3/4] Positive — "command not found" failure
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.74 | Duration: 2m43.554s
- Tests: 4 total, 1 passed, 3 failed, 0 errors
- Success Rate: 25.0%
- Score Range: 0.57 - 0.89 (σ=0.1519)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ❌ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Negative — Azure service concept question: 67% pass rate, score=0.60±0.00
- Positive — "command not found" failure: 67% pass rate, score=0.89±0.16
- Positive — "What do I need to install?": 67% pass rate, score=0.89±0.16
Failed Task Details
Negative — Azure service concept question
Run 1/3 (error):
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_negative (0.20): Prompt correctly treated as non-trigger (score 0.20 < 0.50)
Positive — "command not found" failure
Run 3/3 (failed):
- ❌ answer_quality (0.00): fail: Missing concrete install command for az: Criterion 2 not met: the response does not provide a concrete install command for `az` on any platform. It only vaguely says "On Linux, install from your distro/package source (or official vendor repos) for azure-cli, gh, jq, and git" without giving an actual command like `brew install azure-cli`, `sudo apt-get install azure-cli`, `curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash`, or `winget install Microsoft.AzureCLI`. Criteria 1, 3, and 4 are met (all four tools named with versions; verification commands `az version`, `gh --version`, etc. provided; verdict ⚠️ REPORTED MISSING with `az login`/`gh auth login` next steps).
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Positive — "What do I need to install?"
Run 1/3 (error):
- ❌ answer_quality (0.00): fail: No prior assistant response exists to grade: There is no previous assistant response in this session to evaluate. The user's question about Git-Ape onboarding prerequisites was never answered, so none of the four PASS criteria (required tools list, auth requirements, version guidance, install commands/verification) can be met.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.3-codex
Results saved to: .waza-results/prereq-check-gpt-5.3-codex.json
Model: gpt-5.4 *(baseline — A/B mode)*
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-opus-4.7
Parallel: 4 workers
════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
✓ [2/4] Negative — Azure service concept question
✓ [1/4] Negative — Editing an ARM template
✗ [4/4] Positive — "What do I need to install?"
✓ [3/4] Positive — "command not found" failure
════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✗ [3/4] Positive — "command not found" failure
✓ [4/4] Positive — "What do I need to install?"
════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 75.0% (3/4 tasks passed)
Without Skills: 75.0% (3/4 tasks passed)
Impact: no change
Per-Task Breakdown:
• Negative — Editing an ARM template [NEUTRAL] 100% → 100% (+0pp)
• Negative — Azure service concept question [NEUTRAL] 100% → 100% (+0pp)
• Positive — "command not found" failure [IMPROVED] 67% → 100% (+33pp)
• Positive — "What do I need to install?" [REGRESSED] 100% → 67% (-33pp)
Verdict: Skills have NEUTRAL IMPACT (no net change)
════════════════════════════════════════════════════════════════
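The per-task breakdown above follows a simple pattern: a baseline pass rate, a skills-enabled pass rate, and the difference in percentage points. The sketch below reproduces the printed lines under the assumption that the left-hand value is the baseline (skills stripped) and the right-hand value is the skills-enabled run; the function name `skill_impact` is hypothetical and is not part of waza.

```python
def skill_impact(baseline: float, with_skills: float) -> str:
    """Format a per-task delta as '[LABEL] base% → with% (±Npp)'."""
    base_pct = round(baseline * 100)
    with_pct = round(with_skills * 100)
    delta_pp = with_pct - base_pct
    if delta_pp > 0:
        label = "IMPROVED"
    elif delta_pp < 0:
        label = "REGRESSED"
    else:
        label = "NEUTRAL"
    return f"[{label}] {base_pct}% → {with_pct}% ({delta_pp:+d}pp)"


# Pass rates over three runs per task, as in the breakdown above.
print(skill_impact(2 / 3, 1.0))  # [IMPROVED] 67% → 100% (+33pp)
print(skill_impact(1.0, 2 / 3))  # [REGRESSED] 100% → 67% (-33pp)
print(skill_impact(1.0, 1.0))    # [NEUTRAL] 100% → 100% (+0pp)
```

One improvement and one regression of equal size cancel out, which is consistent with the overall verdict of no net change at 3/4 tasks passed in both passes.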
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.77 | Duration: 2m5.322s
- Tests: 4 total, 3 passed, 1 failed, 0 errors
- Success Rate: 75.0%
- Score Range: 0.57 - 1.00 (σ=0.1839)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 0.89 | ❌ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — "What do I need to install?": 67% pass rate, score=0.89±0.16
Failed Task Details
Positive — "What do I need to install?"
Run 1/3 (failed):
- ❌ answer_quality (0.00): fail: : Criterion 4 not met: the response lists the tools and auth commands and minimum versions, but does not provide install commands (e.g., brew/apt/winget) and does not point the user to a verification script or prereq-check skill invocation. Criteria 1, 2, and 3 are satisfied.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.4
Results saved to: .waza-results/prereq-check-gpt-5.4.json
🔢 Tokens (count + profile)
📊 prereq-check: 2,138 tokens (detailed ✓), 10 sections, 2 code blocks
⚠️ token count 2138 exceeds 1000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Purpose is immediately obvious, the Quick Reference table is an excellent at-a-glance aid, steps are logically ordered with clear section anchors, and the exact verdict strings (READY / TOOLS MISSING / REPORTED MISSING / AUTH MISSING) eliminate ambiguity in outputs.
completeness ████░ Error handling covers the most important failure modes well; however, there is no fallback for when the referenced external scripts (check-tools.sh, check-tools.ps1) are absent or not executable — the skill should specify inline commands to run in that case rather than silently depending on files that may not exist.
trigger_precision ████░ USE FOR is exceptionally detailed with concrete error-message literals, which aids accurate routing; DO NOT USE FOR is too terse ('Anything else') and could be sharpened by explicitly listing one or two adjacent intents to reject (e.g., 'do not use for Azure subscription validation or GitHub repo setup').
scope_coverage █████ Boundaries are explicitly stated in multiple places (Quick Reference side-effects row, Constraints Always/Never list, Rules), related skills are called out, and the read-only constraint is enforced consistently throughout — scope is tight and well-defended.
anti_patterns ████░ No conflicting directives or vague instructions; however, the Constraints 'Always' list includes 'Verify with command -v <tool> + <tool> --version after suggested fixes,' which implies the skill re-runs checks post-fix — contradicting the read-only, single-pass design and potentially confusing the agent about whether it should loop.
────────────────────────────────────────────
Overall: 4.4/5.0
A high-quality, production-ready skill document. It is notably strong in clarity and scope definition, with explicit verdict strings, a well-structured Quick Reference, and a thorough Always/Never constraints section. Two actionable improvements: add an inline fallback for missing check scripts, and tighten the DO NOT USE FOR trigger to list at least one concrete adjacent intent to reject.
✅ Check (compliance summary)
ℹ️ `waza check` expects `eval.yaml` colocated with `SKILL.md`. This repo separates them into `.github/evals/prereq-check/eval.yaml`, so the "Evaluation Suite: Not Found" line below is a false negative; the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: prereq-check
📋 Compliance Score: Medium-High
⚠️ Good, but could be improved. Missing routing clarity.
Issues found:
❌ SKILL.md is 2138 tokens (hard limit 500)
📐 Spec Compliance: 8/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
📎 Links: 4/4 valid
✅ All links valid.
📊 Token Budget: 2138 / 500 tokens
❌ Exceeds limit by 1638 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 4 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 1 reference module(s)
❌ [complexity] Complexity: comprehensive (2138 tokens, 1 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
❌ [cross-model-density] Advisory 16: word count is 122 (>60 may reduce cross-model effectiveness)
❌ [body-structure] Advisory 17: body structure quality — no examples section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
4. Reduce SKILL.md by 1638 tokens. Run 'waza tokens suggest' for optimization tips
Skill: git-ape-onboarding
📈 Score (per model) + Suggestions/Recommendations
Model: claude-opus-4.6
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: claude-opus-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [1/4] Negative — Storage service comparison (off-topic)
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 4433:3F3797:3C1AB5:3F25DF:6A2684DE)
✗ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.60 | Duration: 59.301s
- Tests: 4 total, 1 passed, 3 failed, 0 errors
- Success Rate: 25.0%
- Score Range: 0.56 - 0.65 (σ=0.0340)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.58 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.65 | ❌ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Positive — First-time repo setup
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Missing prereq results and explicit auth-gate surfacing: Criterion 1 NOT met: No prereq check results presented. The agent attempted to run check-tools.sh and several bash probes, but every shell call returned "unexpected user permission response". No tool versions, no status table, no inspection output of any kind was shown — the agent never proved it inspected the environment. Criterion 2 NOT met: The blocking gate the agent surfaced was "I cannot execute shell commands" (sandbox permission), not the expected auth/prereq gate (e.g. "az not authenticated", "az login required", ❌ on Azure auth row). There is no explicit auth status or tool-missing marker. Criterion 3 MET: The agent requested 4 inputs (GitHub repo URL, Azure subscription ID, RBAC role, onboarding mode). Criterion 4 MET: The agent made no false claims of having configured OIDC, federated credentials, RBAC, environments, or scaffolded files; it explicitly waits for inputs and access.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)
Positive — Multi-environment onboarding
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Response skipped prereq gating: Criterion 1 missing: No prereq check results / tool & auth status table; the response only listed prerequisites as assumptions ("You have Owner...", "az and gh are authenticated") without actually inspecting the local environment (no `az account show`, `gh auth status`, version checks, etc.).
Criterion 2 missing: No auth/prereq gate was surfaced. The agent did not verify whether az/gh were authenticated or block on missing auth; it proceeded straight to walkthrough steps.
Criterion 3 met: The agent requested 3 inputs at the end (org/repo, staging subscription ID, RBAC role).
Criterion 4 met: Multi-environment awareness is shown — explicitly names azure-deploy-staging environment, creates a separate federated credential entry for it, and discusses per-environment secrets/variables scoping subscription IDs.
Fails on criteria 1 and 2 (no prereq inspection, no explicit auth gate).
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)
Positive — Scaffold honors skip-with-notice on collision
Run 1/1 (error):
- ❌ answer_quality (0.00): fail: No prior assistant response to grade: There is no previous assistant response in the session to evaluate. None of the four criteria (skip-on-collision, notice surfaced, diff/backup recommendation, opt-in overwrite guidance) can be assessed because no answer was produced.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.94): Prompt is trigger-aligned (score 0.94 >= 0.50)
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-opus-4.6
Results saved to: .waza-results/git-ape-onboarding-claude-opus-4.6.json
Model: claude-sonnet-4.6
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✓ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.76 | Duration: 1m30.527s
- Tests: 4 total, 3 passed, 1 failed, 0 errors
- Success Rate: 75.0%
- Score Range: 0.56 - 0.98 (σ=0.1857)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.91 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Positive — First-time repo setup
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Missing prereq results presentation: Criterion 1 not met: No prereq check results table or list of tool versions (az, gh, jq, git) was presented. The agent attempted to run check-tools.sh and version commands but all bash/view tool calls failed with "unexpected user permission response", and the agent did not produce any substitute status table or version list proving environment inspection. Criterion 2 partially met but not via an Azure CLI auth status row — the agent surfaced a generic "sandboxed environment" blocker rather than an explicit Azure/GitHub auth gate. Criteria 3 (requested 5 inputs: repo URL, subscription ID, RBAC role, mode, OS/shell) and 4 (made no false configuration claims; explicitly waited for user input) are satisfied.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-sonnet-4.6
Results saved to: .waza-results/git-ape-onboarding-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
[ERROR] waiting for session.idle: context deadline exceeded
✗ [2/4] Positive — First-time repo setup
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.68 | Duration: 3m16.422s
- Tests: 4 total, 2 passed, 2 failed, 0 errors
- Success Rate: 50.0%
- Score Range: 0.56 - 0.98 (σ=0.1748)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.58 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Positive — First-time repo setup
Run 1/1 (error):
- ❌ answer_quality (0.00): fail: Response did not present a gated step-1 reply: Criterion 1 FAIL: No prereq check results were presented — the bash command returned "unexpected user permission response" and the assistant did not produce a tool/auth status table or list. Criterion 2 FAIL: No auth gate was surfaced (no Azure/GitHub auth status shown, no ❌ marker, no "az login required" message). Criterion 3 FAIL: The assistant did not request any of the required inputs (target repo, subscription ID, RBAC role, region, project name, mode) — it only said it would "collect" them later. Criterion 4 PASS: The assistant did not falsely claim to have configured OIDC, federated credentials, environments, or scaffolded files. Overall: the reply is a tool-invocation stub rather than the expected gated handoff that presents prereq results and solicits the required onboarding inputs.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)
Positive — Multi-environment onboarding
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Missing prereq execution and input gathering: Criteria 1 and 2 are not met: the assistant did not actually run or present prereq check results (no tool/auth status output, no inspection of az/gh/jq versions or auth state), and therefore did not surface the auth/prereq gate either; it merely instructed the user to run `/prereq-check` themselves. Criterion 3 is also not met: the assistant did not request the required inputs (e.g., target repo URL, staging subscription ID, RBAC role, App Registration reuse decision, environment name confirmation, onboarding mode) before proceeding; instead it presented a complete set of executable commands with placeholders. Criterion 4 IS met: the response mentions creating a new federated credential `fc-azure-deploy-staging`, references the `azure-deploy-staging` environment name, and discusses per-environment scoping of secrets/RBAC to the staging subscription.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.3-codex
Results saved to: .waza-results/git-ape-onboarding-gpt-5.3-codex.json
Model: gpt-5.4 *(baseline — A/B mode)*
Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-opus-4.7
Parallel: 4 workers
════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup
════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
✗ [2/4] Positive — First-time repo setup
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 50.0% (2/4 tasks passed)
Without Skills: 50.0% (2/4 tasks passed)
Impact: no change
Per-Task Breakdown:
• Negative — Storage service comparison (off-topic) [NEUTRAL] 100% → 100% (+0pp)
• Positive — First-time repo setup [NEUTRAL] 0% → 0% (+0pp)
• Positive — Multi-environment onboarding [NEUTRAL] 0% → 0% (+0pp)
• Positive — Scaffold honors skip-with-notice on collision [NEUTRAL] 100% → 100% (+0pp)
Verdict: Skills have NEUTRAL IMPACT (no net change)
════════════════════════════════════════════════════════════════
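The deltas in the impact analysis above can be reproduced with a short sketch. It assumes the delta is the with-skills pass rate minus the baseline pass rate, expressed in percentage points; the function name `skill_impact` is hypothetical and not part of waza.

```python
def skill_impact(with_skills: list[bool], baseline: list[bool]) -> tuple[float, float, float]:
    """Return (with-skills %, baseline %, delta in percentage points)."""
    with_pct = 100.0 * sum(with_skills) / len(with_skills)
    base_pct = 100.0 * sum(baseline) / len(baseline)
    return with_pct, base_pct, with_pct - base_pct


# Pass/fail per task for gpt-5.4, in task order 1..4, read from PASS 1 and PASS 2 above.
with_skills = [True, False, False, True]
baseline = [True, False, False, True]

with_pct, base_pct, delta = skill_impact(with_skills, baseline)
print(f"With Skills: {with_pct:.1f}%  Without Skills: {base_pct:.1f}%  Delta: {delta:+.0f}pp")
```

Because both passes fail the same two positive tasks, the delta is zero, which matches the NEUTRAL verdict.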
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.68 | Duration: 1m27.038s
- Tests: 4 total, 2 passed, 2 failed, 0 errors
- Success Rate: 50.0%
- Score Range: 0.56 - 0.98 (σ=0.1748)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Storage service comparison (off-topic) | 0.56 | ✅ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.58 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✅ | answer_quality, budget, trigger_relevance_positive |
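As a sanity check on the summary numbers, the headline score looks like the mean of the per-task scores in the table. The sketch below assumes that, and also assumes σ is a population standard deviation; neither is confirmed by the waza output. Recomputing from the rounded table values gives roughly 0.174 rather than the reported 0.1748, presumably because the report works from unrounded scores.

```python
from statistics import mean, pstdev

# Per-task scores copied from the Task Results table above (gpt-5.4 run).
scores = [0.56, 0.60, 0.58, 0.98]

overall = mean(scores)
spread = pstdev(scores)  # population standard deviation (an assumption)

print(f"Score: {overall:.2f}")
print(f"Score Range: {min(scores):.2f} - {max(scores):.2f} (sigma ~ {spread:.4f})")
```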
Failed Task Details
Positive — First-time repo setup
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Missing prereq results: Criterion 1 not met: the response does not present any prereq check results — no table/list of az/gh/jq/git versions and no Azure or GitHub auth status. The agent reported "UNABLE TO EXECUTE" because the shell was blocked, but never produced the inspection output the gate requires. Criterion 2 is also not satisfied in the expected form: there is no explicit ❌ on an Azure auth row or "az login required" message tied to a prereq check; only a generic runtime-permission failure. Criteria 3 (asks for repo URL, subscription IDs/roles, compliance framework, policy mode, confirmation — ≥3 inputs) and 4 (no fabricated claims of configuring OIDC/RBAC/environments/scaffolding) are met.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)
Positive — Multi-environment onboarding
Run 1/1 (failed):
- ❌ answer_quality (0.00): fail: Response skipped the prereq gate and proceeded straight to a 10-step playbook without gathering required inputs: Criterion 1 missing: no prereq check results / tool / auth inspection was presented; the response only tells the user to "Run /prereq-check" without showing any output or inspecting the local env. Criterion 2 missing: no explicit auth/prereq gate is surfaced (no statement that az/gh are or aren't authenticated). Criterion 3 missing: the agent does not request inputs before proceeding; there are no numbered questions or input block, and the staging subscription ID, repo, RBAC role, and app-reuse decision all appear as placeholders inside an executable playbook rather than as gating questions. Criterion 4 satisfied: the response explicitly names the `azure-deploy-staging` environment, describes a new `fc-azure-deploy-staging` federated credential, and discusses per-environment secrets/variables and staging-subscription RBAC scoping. Overall: only 1 of 4 criteria met (criterion 4).
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)
Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.4
Results saved to: .waza-results/git-ape-onboarding-gpt-5.4.json
🔢 Tokens (count + profile)
📊 git-ape-onboarding: 3,101 tokens (detailed ✓), 17 sections, 15 code blocks
⚠️ token count 3101 exceeds 1000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity ████░ The skill is well-structured with numbered playbook steps, concrete CLI examples, an Invariants block, and a Suggested Agent Flow summary. Minor deduction: the 'Create or reuse' wording in Step 3 is not backed by code for the reuse path, leaving the agent to improvise for an existing app registration.
completeness ████░ Edge cases (org OIDC subject override, disabled subscriptions) are explicitly handled with detection commands and fix instructions. Missing: no rollback guidance if onboarding fails mid-way (e.g., RBAC assigned but secrets not set), and no de-onboarding/teardown procedure.
trigger_precision ███░░ The 'When to Use' section lists four valid scenarios clearly, but there is no 'DO NOT USE FOR' section — the skill gives no signal to route away from it (e.g., re-onboarding an already-configured repo, or subscription-only RBAC changes without a repo). This risks misuse for partial reconfiguration tasks.
scope_coverage ████░ The 'What It Configures' list and Safe-Execution Rules together establish clear boundaries. However, limitations are implicit rather than explicit — there is no statement about what the skill deliberately does NOT do (e.g., it won't create the subscription, won't manage Azure Policy assignments, won't handle private GitHub Enterprise endpoints).
anti_patterns ████░ The Invariants block proactively prevents the classic main/master substitution bug, and safe-execution rule #3 guards against secret leakage. One anti-pattern remains: Step 9 references templates at './templates/' with a relative path that only resolves correctly if the agent's working directory is the skill root — this should use an explicit, anchored path or a clear resolution instruction.
────────────────────────────────────────────
Overall: 3.8/5.0
A high-quality, production-oriented skill with strong invariant enforcement, concrete CLI playbooks, and meaningful edge-case coverage. The main gaps are the absence of a DO NOT USE FOR trigger section (hurting routing precision), missing rollback guidance for partial failures, and the implicit rather than explicit limitations boundary. Addressing those three points would push this to a 4.5+.
✅ Check (compliance summary)
ℹ️ `waza check` expects `eval.yaml` colocated with `SKILL.md`. This repo separates them into `.github/evals/git-ape-onboarding/eval.yaml`, so the "Evaluation Suite: Not Found" line below is a false negative; the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: git-ape-onboarding
📋 Compliance Score: Medium
⚠️ Needs improvement. Missing anti-triggers and routing clarity.
Issues found:
❌ SKILL.md is 3101 tokens (hard limit 500)
📐 Spec Compliance: 8/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
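The `spec-allowed-fields` failure amounts to a set difference between the frontmatter keys and the allowed list quoted above. A minimal sketch, assuming the frontmatter has already been parsed into a dict; the function name and the placeholder values are hypothetical, and only the key names come from this report.

```python
# Allowed keys as listed in the check output above.
ALLOWED_FIELDS = {"name", "description", "license", "allowed-tools", "metadata", "compatibility"}


def unknown_frontmatter_fields(frontmatter: dict) -> list[str]:
    """Return frontmatter keys that are not in the allowed list, sorted."""
    return sorted(set(frontmatter) - ALLOWED_FIELDS)


# Placeholder values; the real SKILL.md frontmatter values are not shown in this report.
frontmatter = {
    "name": "git-ape-onboarding",
    "description": "placeholder",
    "argument-hint": "placeholder",
    "user-invocable": True,
}

print(unknown_frontmatter_fields(frontmatter))
```

Moving the two extra keys under `metadata` would be one way to satisfy the check, though whether the host tooling still honors them there is not something this report confirms.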
📎 Links: 2/5 valid
⚠️ 3 link issue(s) found.
❌ [templates/copilot-instructions.md] → .github/skills/azure-stack-deploy/SKILL.md: target does not exist
❌ [templates/copilot-instructions.md] → website/docs/deployment/state.md: target does not exist
❌ [templates/copilot-instructions.md] → .github/skills/azure-stack-destroy/SKILL.md: target does not exist
📊 Token Budget: 3101 / 500 tokens
❌ Exceeds limit by 2601 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 4 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (3101 tokens, 0 modules)
❌ [negative-delta-risk] Negative delta risk patterns detected: excessive constraints (12 constraint keywords found)
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
✅ [cross-model-density] Advisory 16: first sentence doesn't lead with action verb (reduces clarity)
❌ [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 10 signal(s) detected (10 level-2 heading(s), 5 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
2. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
3. Run 'waza dev' for interactive compliance improvement
4. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
5. Fix 3 broken link(s) — targets do not exist
6. Reduce SKILL.md by 2601 tokens. Run 'waza tokens suggest' for optimization tips
Skill: azure-stack-deploy
📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6
Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [5/5] Positive — Re-deploy after template edit
✓ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [3/5] Negative — What-if preview / preflight validation
✗ [1/5] Negative — Destroying / tearing down an existing deployment
✗ [4/5] Positive — Local deploy of an existing deployment artifact
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.78 | Duration: 2m28.072s
- Tests: 5 total, 2 passed, 3 failed, 0 errors
- Success Rate: 40.0%
- Score Range: 0.60 - 0.86 (σ=0.0946)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Destroying / tearing down an existing deployment | 0.86 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Negative — What-if preview / preflight validation | 0.82 | ❌ | budget, trigger_relevance_negative |
| Positive — Local deploy of an existing deployment artifact | 0.78 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Re-deploy after template edit | 0.85 | ✅ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — Local deploy of an existing deployment artifact: 50% pass rate, score=0.78±0.17
Failed Task Details
Negative — Destroying / tearing down an existing deployment
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Negative — What-if preview / preflight validation
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
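Every trigger-relevance message in this report compares the score against 0.50, with positive tasks passing at or above the cutoff and negative tasks failing at or above it. The sketch below encodes that reading; the function name is hypothetical, and the rule that a negative task passes below the cutoff is inferred from the messages rather than documented here.

```python
THRESHOLD = 0.50  # cutoff shown in the grader messages


def trigger_relevance_passes(score: float, expect_trigger: bool) -> bool:
    """Pass when the prompt's trigger alignment matches the task's expectation."""
    aligned = score >= THRESHOLD
    return aligned == expect_trigger


# Positive task: aligned as expected, so the grader passes.
print(trigger_relevance_passes(0.73, expect_trigger=True))
# Negative tasks from this run: aligned unexpectedly, so the grader fails.
print(trigger_relevance_passes(0.71, expect_trigger=False))
print(trigger_relevance_passes(0.65, expect_trigger=False))
```

Under this reading, the destroy and what-if prompts fail because they score above the cutoff for a skill they should not trigger, independent of answer quality.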
Positive — Local deploy of an existing deployment artifact
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: Missing criterion 4: Criteria 1, 2, 3 met (mentions `az stack sub create`, includes `--action-on-unmanage deleteAll`, references `.github/skills/azure-stack-deploy/scripts/deploy-stack.sh`). Criterion 4 not fully met: the response mentions `state.json` write but does not specify schemaVersion 1.0, nor does it explicitly state that the stack ID and managed resources are captured in it.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)
Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: claude-sonnet-4.6
Results saved to: .waza-results/azure-stack-deploy-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [1/5] Negative — Destroying / tearing down an existing deployment
✗ [3/5] Negative — What-if preview / preflight validation
✓ [5/5] Positive — Re-deploy after template edit
✗ [4/5] Positive — Local deploy of an existing deployment artifact
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.78 | Duration: 1m55.058s
- Tests: 5 total, 2 passed, 3 failed, 0 errors
- Success Rate: 40.0%
- Score Range: 0.60 - 0.86 (σ=0.0946)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Destroying / tearing down an existing deployment | 0.86 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Negative — What-if preview / preflight validation | 0.82 | ❌ | budget, trigger_relevance_negative |
| Positive — Local deploy of an existing deployment artifact | 0.78 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Re-deploy after template edit | 0.85 | ✅ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — Local deploy of an existing deployment artifact: 50% pass rate, score=0.78±0.17
Failed Task Details
Negative — Destroying / tearing down an existing deployment
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Negative — What-if preview / preflight validation
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Positive — Local deploy of an existing deployment artifact
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: Missing detail on state.json contents: Criteria 1, 2, 3 met: response names `az stack sub create`, includes `--action-on-unmanage deleteAll`, and references `.github/skills/azure-stack-deploy/scripts/deploy-stack.sh`. Criterion 4 NOT met: the response only says "writes `state.json` for single-command teardown" without mentioning schemaVersion 1.0 or that it captures the stack ID and managed resources.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)
Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: gpt-5.3-codex
Results saved to: .waza-results/azure-stack-deploy-gpt-5.3-codex.json
🔢 Tokens (count + profile)
📊 azure-stack-deploy: 1,912 tokens (detailed ✓), 13 sections, 5 code blocks
⚠️ token count 1912 exceeds 1000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Purpose is immediately obvious, steps are logically ordered (locate → run → inspect → report), and concrete bash/PowerShell commands eliminate ambiguity. The 'What to tell the user' section is an especially useful finishing directive.
completeness █████ Covers prerequisites, arguments, procedure, state.json schema, soft-deletable resource types, failure modes, and cross-links to related skills. Edge cases like race conditions on state.json and missing template.json are explicitly handled.
trigger_precision ████░ USE FOR and DO NOT USE FOR sections clearly name alternative skills for adjacent tasks (destroy, preflight, template authoring), preventing misrouting. Minor gap: 'prepared Git-Ape deployment artifact' assumes familiarity with the ecosystem — a one-line definition would help new users route correctly.
scope_coverage █████ Scope is tightly bounded to subscription-scoped stack creation and state capture. Capabilities and limitations are explicit, and the fallback path is documented with an explicit trade-off warning so agents understand when they're operating outside the ideal path.
anti_patterns ████░ No vague instructions, no conflicting directives, and error handling is concrete (failure table with causes and recovery steps). The mandatory fallback to `az deployment sub create` is a minor complexity risk — agents could silently lose stack semantics — but the `--no-fallback` flag and the ⚠️ warning message mitigate this well.
────────────────────────────────────────────
Overall: 4.6/5.0
A high-quality, production-ready skill document. It is exceptionally thorough — schema documentation, soft-delete classification, dual-shell support, and an explicit post-run communication contract are all strong differentiators. The only actionable improvements are defining 'Git-Ape deployment artifact' for cold-start routing and ensuring the fallback path cannot be silently triggered in security-sensitive contexts.
✅ Check (compliance summary)
ℹ️ `waza check` expects `eval.yaml` colocated with `SKILL.md`. This repo separates them into `.github/evals/azure-stack-deploy/eval.yaml`, so the "Evaluation Suite: Not Found" line below is a false negative; the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: azure-stack-deploy
📋 Compliance Score: Low
❌ Needs significant improvement. Description too short or missing triggers.
Issues found:
❌ SKILL.md is 1912 tokens (hard limit 500)
📐 Spec Compliance: 8/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
📎 Links: 0/8 valid
⚠️ 8 link issue(s) found.
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../../../website/docs/deployment/state.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-security-analyzer/SKILL.md: link escapes skill directory
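The "link escapes skill directory" findings can be reproduced by resolving each relative link against the skill directory and checking whether the result still sits under it. This is a sketch of one plausible check, not waza's implementation; `escapes_skill_dir` is a hypothetical name, and the skill directory path is taken from the script path quoted earlier in this report.

```python
import posixpath


def escapes_skill_dir(skill_dir: str, link: str) -> bool:
    """True when a link resolves to a path outside skill_dir."""
    if posixpath.isabs(link):
        return True  # absolute links cannot be inside a relative skill dir
    root = posixpath.normpath(skill_dir)
    target = posixpath.normpath(posixpath.join(root, link))
    return posixpath.commonpath([root, target]) != root


skill_dir = ".github/skills/azure-stack-deploy"

for link in [
    "../azure-stack-destroy/SKILL.md",
    "../../../website/docs/deployment/state.md",
    "scripts/deploy-stack.sh",
]:
    print(f"{link}: escapes={escapes_skill_dir(skill_dir, link)}")
```

The first two links resolve outside the skill directory and are flagged; the script path stays inside it. Cross-skill links like these are valid in the repo, so the finding reflects the spec's self-containment rule rather than a broken target.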
📊 Token Budget: 1912 / 500 tokens
❌ Exceeds limit by 1412 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 5 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (1912 tokens, 0 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
✅ [cross-model-density] Description density is optimal for cross-model use
❌ [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 10 signal(s) detected (10 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix 8 link(s) that escape the skill directory
7. Reduce SKILL.md by 1412 tokens. Run 'waza tokens suggest' for optimization tips
Skill: azure-stack-destroy
📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6
Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers
✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✗ [1/5] Negative — Deploying a new stack (opposite operation)
✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [4/5] Positive — Clean up the deployment stack
✓ [5/5] Positive — Local destroy of a Git-Ape deployment
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.77 | Duration: 1m55.947s
- Tests: 5 total, 2 passed, 3 failed, 0 errors
- Success Rate: 40.0%
- Score Range: 0.60 - 0.96 (σ=0.1399)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Deploying a new stack (opposite operation) | 0.81 | ❌ | budget, trigger_relevance_negative |
| Negative — Deleting a non-Git-Ape resource group | 0.87 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — Clean up the deployment stack | 0.62 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Local destroy of a Git-Ape deployment | 0.96 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Negative — Deploying a new stack (opposite operation)
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Negative — Deleting a non-Git-Ape resource group
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Positive — Clean up the deployment stack
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: Missing explicit contrast with raw az group delete: Criteria 2, 3, 4 are met: response references state.json prerequisite under .azure/deployments/deploy-20260524-test/, mentions `az stack sub delete --action-on-unmanage deleteAll` removing all resources across all RGs in one call, and covers purging Key Vaults and Cognitive Services soft-deletes. However, criterion 1 is only partially met: the response recommends the destroy-stack.sh script over raw az commands, but does NOT explicitly explain WHY raw `az group delete` is inadequate (i.e., that it misses soft-delete cleanup and multi-RG/subscription-scope resources). The rationale comparison is absent.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: Missing explicit comparison to raw az group delete: Criterion 1 not fully met: the response recommends the destroy-stack.sh script and mentions deleteAll semantics, but does NOT explicitly contrast with raw `az group delete` nor explain that raw RG delete misses soft-delete cleanup and multi-RG/subscription-scope resources. Criteria 2, 3, 4 are satisfied (state.json prerequisite mentioned, deleteAll semantics described, Key Vault/Cognitive Services purge sweep covered).
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: claude-sonnet-4.6
Results saved to: .waza-results/azure-stack-destroy-claude-sonnet-4.6.json
Model: gpt-5.3-codex
Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers
✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [1/5] Negative — Deploying a new stack (opposite operation)
✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✗ [5/5] Positive — Local destroy of a Git-Ape deployment
✗ [4/5] Positive — Clean up the deployment stack
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.74 | Duration: 1m54.897s
- Tests: 5 total, 1 passed, 4 failed, 0 errors
- Success Rate: 20.0%
- Score Range: 0.60 - 0.87 (σ=0.1064)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Deploying a new stack (opposite operation) | 0.81 | ❌ | budget, trigger_relevance_negative |
| Negative — Deleting a non-Git-Ape resource group | 0.87 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — Clean up the deployment stack | 0.62 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Local destroy of a Git-Ape deployment | 0.80 | ❌ | answer_quality, budget, trigger_relevance_positive |
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — Local destroy of a Git-Ape deployment: 50% pass rate, score=0.80±0.17
Failed Task Details
Negative — Deploying a new stack (opposite operation)
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Negative — Deleting a non-Git-Ape resource group
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Positive — Clean up the deployment stack
Run 1/2 (failed):
- ❌ answer_quality (0.00): fail: Response only points the user to the destroy script without covering the required substance. Missing criteria:
  - (Partial) Recommends destroy-stack.sh but does NOT explain why over raw `az group delete` (no mention of soft-delete cleanup or multi-RG coverage being missed).
  - Does NOT reference the `state.json` prerequisite under `.azure/deployments/deploy-20260524-test/`.
  - Does NOT mention `az stack sub delete --action-on-unmanage deleteAll` or its single-call cleanup semantics.
  - Does NOT cover the soft-delete purge sweep (Key Vault / Cognitive Services) nor the `purgeProtected: true` retention behavior.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: Response missing key destroy-skill details: The assistant was blocked by tool errors and only provided a one-line command to run the destroy script. Missing criteria:
  - (1) Partial: recommends destroy-stack.sh but does not explicitly contrast with raw `az group delete` or explain that raw RG delete misses soft-delete cleanup / multi-RG resources.
  - (2) Missing: no mention of the `state.json` prerequisite under `.azure/deployments/deploy-20260524-test/`.
  - (3) Missing: no mention of `az stack sub delete --action-on-unmanage deleteAll` (or equivalent single-delete semantics).
  - (4) Missing: no mention of the soft-delete purge sweep (Key Vault, Cognitive Services) nor of `purgeProtected: true` retention behavior.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Positive — Local destroy of a Git-Ape deployment
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: : Response recommends the azure-stack-destroy skill and destroy-stack.sh script (criterion 1 met), but fails criteria 2, 3, and 4: it does not reference state.json under .azure/deployments/deploy-20260506-001/ as the source of truth, does not name the `az stack sub delete --action-on-unmanage deleteAll` command/semantics, and does not mention `az keyvault purge` or explain the soft-delete purge sweep behavior for Key Vault name reuse.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)
Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: gpt-5.3-codex
Results saved to: .waza-results/azure-stack-destroy-gpt-5.3-codex.json
🔢 Tokens (count + profile)
📊 azure-stack-destroy: 2,644 tokens (detailed ✓), 14 sections, 7 code blocks
⚠️ token count 2644 exceeds 1000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Instructions are exceptionally well-ordered: purpose is stated in the frontmatter description, prerequisites are tabulated, the procedure is numbered with concrete shell commands, and output examples show exactly what success looks like. The fast-vs-sync mode table is a particularly strong clarity aid.
completeness █████ Edge cases are thoroughly covered — missing state.json, purge-protected vaults, out-of-sync stacks, subnet reference conflicts, already-destroyed stacks, and the fallback path when no stackId exists. The failure modes table with recovery steps leaves little room for an agent to get stuck.
trigger_precision ████░ USE FOR and DO NOT USE FOR sections are well-defined with concrete phrases and scenarios; the 'Prefer this over raw az group delete' sub-section adds excellent disambiguation. Minor overlap: the 'When to Use' section partially restates USE FOR triggers, which is redundant and could cause confusion — consolidate into one section.
scope_coverage █████ Scope is precisely bounded: subscription-scoped stacks created by Git-Ape only, with explicit exclusions for non-Git-Ape RGs, individual resource deletion, and non-Azure clouds. Capabilities (multi-RG, soft-delete purge, state update) and limitations (no surgical mode, no hand-written state.json) are both explicit.
anti_patterns ████░ Avoids most anti-patterns well — no vague instructions, error handling is explicit, and directives don't conflict. One minor issue: the 'When to Use' section duplicates trigger phrases already in 'USE FOR', adding noise. Also, App Configuration / API Management / ML workspace purge being silently skipped (natural expiry) is documented but the agent has no guidance on whether to warn the user proactively about name-reuse delays for those types.
────────────────────────────────────────────
Overall: 4.6/5.0
A high-quality, production-grade skill definition. It is thorough, honest about limitations, and provides enough operational detail for an agent to succeed without human guidance in the common case. Minor improvements: consolidate the duplicate 'When to Use' and 'USE FOR' sections, and add a user-facing warning step for non-auto-purged soft-delete types (App Config, APIM, ML) to prevent silent name-reuse confusion.
✅ Check (compliance summary)
ℹ️ `waza check` expects `eval.yaml` colocated with `SKILL.md`. This repo separates them into `.github/evals/azure-stack-destroy/eval.yaml`, so the "Evaluation Suite: Not Found" line below is a false negative; the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: azure-stack-destroy
📋 Compliance Score: Low
❌ Needs significant improvement. Description too short or missing triggers.
Issues found:
❌ SKILL.md is 2644 tokens (hard limit 500)
📐 Spec Compliance: 7/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
❌ [spec-security] Security risks detected: description contains XML angle brackets
📎 XML angle brackets and reserved prefixes pose injection and naming conflict risks
📎 Links: 0/4 valid
⚠️ 4 link issue(s) found.
❌ [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-drift-detector/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-resource-visualizer/SKILL.md: link escapes skill directory
📊 Token Budget: 2644 / 500 tokens
❌ Exceeds limit by 2144 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 5 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (2644 tokens, 0 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
✅ [cross-model-density] Advisory 16: first sentence doesn't lead with action verb (reduces clarity)
❌ [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix spec violation [spec-security]: Security risks detected: description contains XML angle brackets
7. Fix 4 link(s) that escape the skill directory
8. Reduce SKILL.md by 2144 tokens. Run 'waza tokens suggest' for optimization tips
Pull request overview
Overhauls the /git-ape-onboarding flow: replaces the .exampleyml activation hack with a template-driven scaffolder under the skill directory, migrates deploy/destroy from az deployment sub to Azure Deployment Stacks (closes part of #30), registers the onboarding eval suite in the pilot tier, declares prompt files in the VSIX, and regenerates website docs.
Changes:
- Removes .github/workflows/git-ape-{plan,deploy,destroy,verify}.exampleyml, ships canonical templates under .github/skills/git-ape-onboarding/templates/workflows/, plus scaffold-repo.{sh,ps1} and sync-templates.{sh,ps1} with a new git-ape-onboarding-template-check CI workflow enforcing parity.
- Rewrites deploy + destroy templates around az stack sub create/delete --action-on-unmanage deleteAll, adds a state-file stackId/managedResources[] schema, a rollback step, a templateanalyzer staging workaround, and switches AZURE_SUBSCRIPTION_ID from secrets to vars.
- Renames gpt-5-codex → gpt-5.3-codex in tier manifest and bench prompts; registers git-ape-onboarding in the pilot tier with 4 tasks; tightens the agent's identity contract and adds a "required inputs" gate; declares .github/prompts/ in plugin.json and registers 9 chatPromptFiles; trims .vscodeignore.
| File | Description |
|---|---|
| .github/workflows/git-ape-onboarding-template-check.yml | New CI parity check (bash+pwsh sync + scaffold byte-diff). |
| .github/workflows/git-ape-deploy.exampleyml (deleted) | Old activation stub; superseded by template. |
| .github/skills/git-ape-onboarding/templates/workflows/git-ape-{plan,deploy,destroy,verify}.yml | Canonical workflow templates; Stacks migration + scan staging + rollback. |
| .github/skills/git-ape-onboarding/templates/README.md + copilot-instructions.md | Maintainer doc + canonical deployment standards. |
| .github/skills/git-ape-onboarding/scripts/{scaffold-repo,sync-templates}.{sh,ps1} | Parity scaffold + mirror scripts. |
| .github/skills/git-ape-onboarding/SKILL.md, .github/agents/git-ape-onboarding.agent.md | Drop acknowledgment phase, add invariants/identity/non-goals/required-inputs gate. |
| .github/copilot-instructions.md | Stacks-based deploy/destroy guidance. |
| .github/evals/git-ape-onboarding/{eval,tasks/*}.yaml, .github/evals/manifest.yaml | New eval suite (3 positive + 1 negative); register skill in pilot, rename codex model. |
| .github/prompts/{agent,skill}-bench.prompt.md | Update default model list. |
| extension/package.template.json, extension/.vscodeignore, plugin.json | Register prompt files in VSIX, drop dev-only .github/* paths from VSIX. |
| scripts/generate-docs.js, README.md, website/docs/** | Regenerated docs for both repo CI and scaffolded user-facing workflows. |
Copilot's findings
Comments suppressed due to low confidence (3)
.github/skills/git-ape-onboarding/templates/workflows/git-ape-verify.yml:44
- The check now reads vars.AZURE_SUBSCRIPTION_ID (a repository variable), but the error message and summary still call it a "secret". This is misleading: a user looking at logs will go check repo Secrets, not repo Variables, and may waste time before realising the setup expects a variable. Update the user-facing messages and the missing-config copy to refer to AZURE_SUBSCRIPTION_ID as a variable. Also note git-ape-deploy.yml still writes subscription from vars.AZURE_SUBSCRIPTION_ID while the onboarding skill (Step 7) and copilot-instructions.md (line 405) still document AZURE_SUBSCRIPTION_ID as a secret, so the docs and the workflow contract have diverged.

.github/skills/git-ape-onboarding/templates/workflows/git-ape-verify.yml:121
- The verify workflow checks for git-ape-ttl-reaper.yml, but the scaffold helper (scaffold-repo.sh / scaffold-repo.ps1) does not ship a TTL Reaper template: the MAPPINGS only include plan, deploy, destroy, verify, and drift.{md,lock.yml}. Every onboarded repo will therefore see a perpetual "⚠️ Git-Ape: TTL Reaper (git-ape-ttl-reaper.yml) — not found" warning in Verify Setup. Either drop this entry from the workflow list, or add the TTL Reaper template to the scaffolder and the templates/workflows/ directory.

.github/skills/git-ape-onboarding/templates/workflows/git-ape-destroy.yml:151
- This gate accepts a state file as long as it has either stackId or deploymentId. Every state file ever written by this project has a deploymentId (it's the matrix key), so the check effectively only fails if state.json is corrupt. For a deployment created by the old (pre-Stacks) git-ape-deploy.exampleyml, stackId will be empty but deploymentId will be set, so the check passes. Then az stack sub show --name "$STACK_NAME" in the next step returns a non-zero exit, and the workflow records exists=false and exits 0 with "Already destroyed (stack not found)". Real Azure resources still exist and the resource group is never deleted, but the destroy run reports success and metadata.json will be flipped to destroyed. To make this Stacks-only and safe, require stackId explicitly (or, if you must accept old state files, fall back to az group delete on state.resourceGroup when stackId is empty).
- Files reviewed: 47/47 changed files
- Comments generated: 3
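The gate described in the git-ape-destroy.yml finding above comes down to a three-way decision on two state.json fields. The following is a minimal sketch of that decision only, assuming the two fields have already been read into shell variables; the `teardown_mode` function name, the mode labels, and the sample stack ID are illustrative and do not come from the templates.

```shell
# Decide the teardown mode from two fields of state.json.
# Arguments: $1 = stackId, $2 = resourceGroup (either may be empty).
# The mode labels are hypothetical, chosen for this example only.
teardown_mode() {
  if [ -n "$1" ]; then
    echo 'stack'        # delete via the Deployment Stack
  elif [ -n "$2" ]; then
    echo 'legacy-rg'    # pre-Stacks state: fall back to the resource group
  else
    echo 'refuse'       # nothing authoritative to delete; fail the run
    return 1
  fi
}

# Placeholder IDs, not real resources.
teardown_mode '/subscriptions/000/providers/Microsoft.Resources/deploymentStacks/demo' 'rg-demo'
teardown_mode '' 'rg-demo'
teardown_mode '' '' || echo 'gate failed: state.json has no stackId or resourceGroup'
```

Keying on stackId rather than deploymentId means a pre-Stacks state file can no longer pass the gate and then be reported as "already destroyed".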
STACK_ID='${{ steps.deploy.outputs.stack_id }}'
MANAGED='${{ steps.deploy.outputs.managed_resources }}'
MANAGED=${MANAGED:-[]}
Fixed in d6b41dc. managed_resources and stack_id are now passed through env: and read as MANAGED="${MANAGED_RESOURCES:-[]}", then validated with jq empty (falling back to [] on invalid/empty JSON) before being consumed via jq --argjson managedResources. No more single-quoted JSON literal.
# Also snapshot the previous template from git (parent commit of this merge
# or origin/main for /deploy comment). Used to redeploy last-known-good on failure.
DEPLOY_DIR="${{ steps.params.outputs.deploy_dir }}"
mkdir -p /tmp/rollback
if git show HEAD~1:"$DEPLOY_DIR/template.json" > /tmp/rollback/template.json 2>/dev/null; then
  cp "$DEPLOY_DIR/parameters.json" /tmp/rollback/parameters.json 2>/dev/null || true
  # Prefer the previous parameters if they exist at HEAD~1
  git show HEAD~1:"$DEPLOY_DIR/parameters.json" > /tmp/rollback/parameters.json 2>/dev/null || true
  echo "prior_template_available=true" >> "$GITHUB_OUTPUT"
  echo "[$(date -u +%H:%M:%S)] Previous template captured from HEAD~1 → /tmp/rollback/"
  echo " template bytes: $(wc -c < /tmp/rollback/template.json)"
else
  echo "prior_template_available=false" >> "$GITHUB_OUTPUT"
  echo "[$(date -u +%H:%M:%S)] No previous template in git history (first deployment)"
fi
Fixed in d6b41dc. The rollback baseline is now derived per trigger: HEAD~1 only for push; for /deploy comments we git fetch origin main --depth=1 and use origin/main. git show "$BASELINE_REF:$DEPLOY_DIR/template.json" then reads the correct previous known-good template instead of the PR head.
Follow-up: the /deploy PR-comment trigger referenced above has since been removed entirely (unverifiable comment-author authorization). The rollback baseline is now derived solely from the push trigger — HEAD~1 on main after merge — so the origin/main fetch path for /deploy no longer exists.
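The per-trigger baseline selection described in the d6b41dc reply can be sketched as a small lookup. This is an illustration under assumptions: `select_baseline_ref` and the `EVENT_NAME` variable (standing in for github.event_name passed through env:) are hypothetical names, and after the follow-up above only the push branch would remain.

```shell
# Pick the git ref to treat as last known-good, keyed on the triggering event.
# EVENT_NAME stands in for github.event_name passed through env: (an assumption).
select_baseline_ref() {
  case "$1" in
    push) echo 'HEAD~1' ;;                # previous main commit after merge
    issue_comment) echo 'origin/main' ;;  # /deploy path, later removed
    *) return 1 ;;                        # unknown trigger: no baseline
  esac
}

EVENT_NAME="${EVENT_NAME:-push}"
if BASELINE_REF=$(select_baseline_ref "$EVENT_NAME"); then
  echo "baseline=$BASELINE_REF"
else
  echo "no rollback baseline for event: $EVENT_NAME"
fi
```

Returning a non-zero status for unknown events keeps the caller from silently falling back to HEAD~1, which was the original defect.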
if (validationStatus === 'passed' && whatifResult) {
  comment += `### What-If Analysis\n\n`;
  comment += `\`\`\`\n${whatifResult}\n\`\`\`\n\n`;
} else if (whatifStatus === 'passed' && whatifResult) {
  comment += `### What-If Analysis\n\n`;
  comment += `\`\`\`\n${whatifResult}\n\`\`\`\n\n`;
Fixed in d6b41dc. Removed the unreachable validationStatus === passed && whatifResult branch; what-if rendering is now driven uniformly by whatifStatus === passed && whatifResult.
sendtoshailesh left a comment
Thanks for the substantial cleanup here — moving the onboarding scaffolds out of .github/workflows/, adding sync/parity tooling, and wiring prompt/eval registration all make sense. I also like the skip-on-collision behavior in the scaffolders and the explicit docs refresh.
I did find a few blocking issues that should be fixed before merge:
- Command injection in manual destroy path (.github/skills/git-ape-onboarding/templates/workflows/git-ape-destroy.yml, around lines 55-66)
  inputs.confirm and inputs.deployment_id are interpolated directly into a run: script via ${{ ... }}. Because Actions expands those expressions before bash parses the script, a crafted workflow_dispatch input can inject arbitrary shell. Please pass these values through env: (or another non-shell-interpolated channel) and read them from normal shell variables instead.
- Unsafe direct interpolation of github.base_ref into shell (.github/skills/git-ape-onboarding/templates/workflows/git-ape-plan.yml:44)
  github.base_ref is used directly inside the git diff command in a run: block. Per GitHub’s Actions hardening guidance, attacker-controlled context values should not be embedded into shell scripts this way. This should also be routed through env: and quoted normally in bash.
- Rollback source is wrong for /deploy runs (.github/skills/git-ape-onboarding/templates/workflows/git-ape-deploy.yml, around lines 219-228 and 475-486)
  The comment says the workflow should snapshot the parent commit or origin/main for /deploy comments, but the implementation always reads HEAD~1. On comment-triggered deploys that means rollback can redeploy an earlier PR commit that was never the last known-good state, instead of rolling back to main. That is especially risky on multi-commit PRs. Please branch this logic so /deploy captures from origin/main (or another authoritative deployed baseline) before using it for rollback.
One additional hardening nit: git-ape-verify.yml also embeds secret values directly into shell conditionals (${{ secrets.AZURE_CLIENT_ID }} etc.). I would strongly prefer converting those checks to env booleans/variables as well.
Once the injection issues and rollback baseline are fixed, I’d be happy to re-review.
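The injection class raised in this review can be reproduced without Actions at all. The sketch below imitates expression expansion by pasting a value into script source, then contrasts it with passing the same value through the environment. The helper names `render_inline` and `run_with_env` and the sample payload are illustrative assumptions, not part of the workflow templates.

```shell
# Hypothetical attacker-supplied workflow_dispatch input.
payload='x"; echo INJECTED; echo "'

# Unsafe: imitate `${{ inputs.deployment_id }}` being pasted into the
# script text before the shell parses it.
render_inline() {
  script="echo \"id=$1\""
  sh -c "$script"
}

# Safer: the value travels as an environment variable and is only
# expanded inside double quotes at run time, so it stays data.
run_with_env() {
  DEPLOYMENT_ID="$1" sh -c 'echo "id=$DEPLOYMENT_ID"'
}

render_inline "$payload"   # runs the injected `echo INJECTED`
run_with_env "$payload"    # prints the payload as a literal string
```

In a workflow the same separation is achieved by declaring the value under the step's env: key and referencing the shell variable in run:.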
Address PR review on the git-ape-onboarding workflow templates:
- Route attacker-controllable inputs (github.base_ref, workflow_dispatch inputs, JSON step outputs) through env: and read them as quoted shell variables to close script-injection vectors (plan, destroy).
- plan: compute the PR diff against origin/$BASE_REF instead of an unsanitised interpolation.
- deploy: derive the rollback baseline from HEAD~1 (push) or origin/main (/deploy comment); pass stack_id/managed_resources via env and validate the managed_resources JSON before jq consumes it.
- destroy: make teardown Deployment-Stacks-only with a guarded legacy resource-group fallback; emit explicit legacy/fallback_rg outputs.
- verify: gate required secrets/variable via env booleans; check the AZURE_SUBSCRIPTION_ID variable; align the scaffolded WORKFLOWS list with the scaffolder (drop ttl-reaper, add verify, use drift.lock.yml).
- plan: remove the unreachable what-if render branch.
Regenerate website workflow docs.
AZURE_SUBSCRIPTION_ID is consumed via vars. in every scaffolded workflow, so document it as a GitHub repository/environment variable (not a secret). AZURE_CLIENT_ID and AZURE_TENANT_ID remain secrets. Fix the OIDC snippet in both copilot-instructions templates to use vars.AZURE_SUBSCRIPTION_ID.
Thanks for the thorough review, @sendtoshailesh. All four points are addressed in d6b41dc (workflow templates) and 67005a7 (docs). Summary: 1. Command injection in manual destroy path ( 2. Unsafe interpolation of 3. Rollback baseline wrong for Hardening nit — While in here I also addressed the Copilot review threads (managed_resources JSON via Ready for re-review. |
sendtoshailesh left a comment
Follow-up review:
Previously raised issues:
- ✅ Fixed: git-ape-destroy.yml no longer interpolates inputs.* directly into shell; the workflow_dispatch inputs are routed via env and JSON-encoded with jq before use.
- ✅ Fixed: git-ape-plan.yml no longer inlines ${{ github.base_ref }} in the shell; it is passed through env.BASE_REF first.
- ✅ Fixed: git-ape-deploy.yml now uses origin/main for /deploy rollback baselines instead of always assuming HEAD~1; the push path still uses HEAD~1, which is the previous main commit after merge.
- ✅ Fixed: git-ape-verify.yml moved the secret checks to env booleans instead of embedding ${{ secrets.* }} directly in shell conditionals.
New issues found:
- ❌ Blocking: matrix.deployment_id is still derived from attacker-controlled deployment directory names and interpolated directly into run: blocks / JS string literals in the plan, deploy, and destroy templates. That reintroduces shell / script injection via paths under .azure/deployments/*/.
- ⚠️ Non-blocking: the /deploy comment path checks approval state, but it still does not verify that the commenter is an authorized collaborator/member before triggering deployment.
- ⚠️ Non-blocking: both deploy and destroy still swallow git push failures after updating state.json / metadata.json, which can leave Azure state changed without the repo state being persisted.
Overall verdict: the original blockers are resolved, but the new matrix.deployment_id injection path is still a release-blocking security issue, so this PR is not merge-ready yet.
…t_id injection matrix.deployment_id is derived from attacker-controllable .azure/deployments/*/ directory names and was interpolated directly into run: bash blocks and github-script JS string literals across the plan, deploy, and destroy workflow templates. Route it through job-level env (DEPLOYMENT_ID) so run blocks reference $DEPLOYMENT_ID and github-script reads process.env.DEPLOYMENT_ID, and reject any directory name outside ^[A-Za-z0-9._-]+$ at the detect step (defense in depth, also makes derived deploy_dir provably safe).
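The allow-list check from this commit message can be expressed without regex support by rejecting any character outside the permitted set. This is a minimal POSIX sh sketch of the same rule as ^[A-Za-z0-9._-]+$; the function name `is_safe_deployment_id` and the sample names are assumptions for illustration, not the detect step's actual code.

```shell
# Accept only names made of letters, digits, dot, underscore, and hyphen,
# mirroring ^[A-Za-z0-9._-]+$ from the commit message.
is_safe_deployment_id() {
  case "$1" in
    '' | *[!A-Za-z0-9._-]*) return 1 ;;  # empty, or contains a disallowed char
    *) return 0 ;;
  esac
}

# Sample directory names (made up for this example).
for id in 'deploy-20260101-app' 'bad$(id)' 'a b' ''; do
  if is_safe_deployment_id "$id"; then
    echo "accept: $id"
  else
    echo "reject: $id"
  fi
done
```

Rejecting at the detect step means every later use of the derived deploy_dir inherits the same guarantee, which is the defense-in-depth point the commit makes.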
…e push The /deploy comment trigger cannot reliably verify the commenter's authorization, so deployment is now gated solely on merge to main (which already requires PR review + approval via branch protection). Removes the issue_comment trigger, the check-comment-trigger job, and all PR-head-ref checkout paths. Also fails loud (exit 1) instead of swallowing git push failures when committing deployment/teardown state back to main.
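The "fail loud" behaviour for the state push can be sketched as a thin wrapper that propagates the failure instead of discarding it. The wrapper name `persist_or_fail` and the error wording are hypothetical; the `::error::` prefix is the standard Actions annotation command.

```shell
# Run the given command; on failure emit an Actions error annotation
# and propagate a non-zero status instead of swallowing it.
persist_or_fail() {
  if ! "$@"; then
    echo "::error::state was changed in Azure but could not be pushed: $*" >&2
    return 1
  fi
}

# In a workflow this would wrap the real push, for example:
#   persist_or_fail git push origin HEAD:main || exit 1
# Here `true` and `false` stand in for a succeeding and failing push.
persist_or_fail true && echo 'push ok'
persist_or_fail false 2>/dev/null || echo 'push failed, workflow would exit 1'
```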
sendtoshailesh left a comment
Follow-up review:
Previously raised issues:
- ✅ Fixed: matrix.deployment_id is now validated against ^[A-Za-z0-9._-]+$ before entering the matrix and routed through env.DEPLOYMENT_ID / process.env.DEPLOYMENT_ID in the plan, deploy, and destroy templates, so the earlier matrix.deployment_id shell/JS injection path is closed.
- ✅ Fixed: the /deploy comment path is gone entirely from git-ape-deploy.yml, so there is no longer an unauthenticated comment-triggered deployment path to authorize.
- ✅ Fixed: deploy and destroy now fail the workflow if the post-state git push fails instead of silently swallowing that error.
New issues found:
- ❌ Blocking: untrusted values read from parameters.json are still interpolated directly into run: scripts via ${{ ... }} in the workflow templates, which reintroduces the same GitHub Actions expression-to-shell injection class under a different input. Examples: git-ape-plan.yml uses ${{ steps.params.outputs.location }} in shell at lines 157, 414, and 455; git-ape-deploy.yml uses ${{ steps.params.outputs.location }}, ${{ steps.params.outputs.project }}, and ${{ steps.params.outputs.environment }} in shell/JQ argument positions at lines 175, 178, 241, 244-245, 258, 420, and 498-500. These values come from attacker-controlled PR content (parameters.json) and need the same treatment as deployment_id: validate if needed, pass through env:, and reference normal shell variables instead of inlining ${{ ... }} into script source.
- ⚠️ Non-blocking: git-ape-plan.yml still tells reviewers to comment /deploy (Plan Comment step, line 738), but that trigger has been intentionally removed. The PR guidance should be updated to avoid instructing users to use a nonexistent path.
Overall verdict:
The previously raised issues are resolved, but the new ${{ steps.params.outputs.* }} injection path is still a release-blocking security issue, so this PR is not merge-ready yet.
…ection
Untrusted location/project/environment values read from parameters.json
were interpolated directly into run: script bodies via ${{ steps.params.outputs.* }},
the same expression-to-shell injection class already fixed for deployment_id.
Route them through step-level env: blocks and reference $LOCATION/$PROJECT/$ENVIRONMENT
shell variables instead. Also drop the stale /deploy reviewer instruction in
git-ape-plan.yml (that trigger was removed). Regenerated workflow docs.
@sendtoshailesh Thanks for the thorough re-review. Fixed the remaining injection in Blocking item — Non-blocking item — stale
…rhaul
# Conflicts:
# .github/agents/git-ape.agent.md
# .github/copilot-instructions.md
# .github/evals/manifest.yaml
# .github/workflows/git-ape-deploy.exampleyml
# .github/workflows/git-ape-destroy.exampleyml
# website/docs/agents/git-ape.md
# website/docs/workflows/git-ape-deploy.md
# website/docs/workflows/git-ape-destroy.md
Merge resolution updated the .github/copilot-instructions.md mirror to the stack-based deployment flow (dropping the /deploy trigger). Propagate the same content to the canonical templates/copilot-instructions.md so the onboarding template-check (bash + pwsh) passes.
Regenerated from sources updated by the upstream/main merge (azure-resource-deployer and azure-template-generator agents now delegate to skills; lock workflow metadata).
sendtoshailesh left a comment
Round 4 follow-up review:
Previously raised issues:
- ✅ Fixed: untrusted parameters.json values (location, project, environment) are now routed through env: before use in shell steps instead of being interpolated directly into run: blocks.
- ✅ Fixed: the stale /deploy reference was removed from the plan comment path.
Conflict resolution assessment:
- ✅ Merge resolution looks clean overall. I did not find conflict markers or accidental duplicate sections in the changed templates/workflows, the key workflow YAML files parse successfully, and the onboarding template sync check passes.
New issues found:
- ❌ Blocking: website/docs/getting-started/onboarding.md still tells users to configure AZURE_SUBSCRIPTION_ID as a GitHub secret (gh secret set at lines 364-366, 383-391), but the scaffolded workflows and verify flow now read it from vars.AZURE_SUBSCRIPTION_ID as a variable. A user following the updated onboarding docs will end up with a broken setup: verify/deploy read from vars, but the docs populate secrets. Given this PR is specifically overhauling onboarding/scaffolding, that documentation contract needs to be consistent before merge.
- ⚠️ Non-blocking: git-ape-verify.yml and its generated docs still say "Merge or comment /deploy to deploy", and the summary still says "secret(s) missing" even though one of the required values is now a variable. That guidance is stale/misleading, though the actual deploy trigger removal in plan/deploy is correct.
Overall verdict:
The round-3 blockers are fixed and the merge conflict resolution looks solid, but the onboarding docs still misconfigure AZURE_SUBSCRIPTION_ID, so I don’t think this is merge-ready yet. Once the docs/template guidance are aligned with the new variable-based contract, I’d be happy to re-review.
Round 4 review (sendtoshailesh):
- Blocking: onboarding docs configured AZURE_SUBSCRIPTION_ID via 'gh secret set', but the scaffolded plan/deploy/destroy/verify workflows read it from vars.AZURE_SUBSCRIPTION_ID. Switch the single- and multi-environment setup steps to 'gh variable set' so the documented contract matches the workflows. AZURE_CLIENT_ID and AZURE_TENANT_ID remain secrets.
- Non-blocking: git-ape-verify.yml summary said 'secret(s) missing' (one value is now a variable) and 'Merge or comment /deploy to deploy' (the /deploy trigger was removed). Reworded to 'required value(s) missing' and 'Merge to main to deploy'; renamed the check step accordingly.
Regenerated git-ape-verify.md from the updated template.
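The secret/variable split this commit documents can be summarised as three GitHub CLI commands. The sketch below only prints them (a dry run) so it runs without gh authentication; `print_setup_commands` is a hypothetical helper and the subscription ID is a placeholder.

```shell
# Print the gh commands for the documented contract: subscription ID as a
# variable, client and tenant IDs as secrets. Values are placeholders.
print_setup_commands() {
  sub_id="$1"
  echo "gh variable set AZURE_SUBSCRIPTION_ID --body \"$sub_id\""
  echo 'gh secret set AZURE_CLIENT_ID --body "$AZURE_CLIENT_ID"'
  echo 'gh secret set AZURE_TENANT_ID --body "$AZURE_TENANT_ID"'
}

print_setup_commands '00000000-0000-0000-0000-000000000000'
```

A value set with gh secret set is not visible under the vars context, which is why the earlier docs produced a setup the workflows could not read.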
@sendtoshailesh Thanks for the round 4 review. Both points addressed in Blocking — Non-blocking — stale The template ↔ mirror sync check passes locally for both bash and pwsh. |
sendtoshailesh left a comment
Round 5 follow-up review:
Verified fixes from round 4:
- ✅ website/docs/getting-started/onboarding.md now treats AZURE_SUBSCRIPTION_ID as a GitHub variable (gh variable set ...) rather than a secret.
- ✅ git-ape-verify and its generated docs no longer mention /deploy, and the summary wording is now "required value(s) missing".
One issue still remains in the current diff:
- ❌ .github/agents/azure-resource-deployer.agent.md:78, website/docs/agents/azure-resource-deployer.md:103, and website/docs/use-cases/cicd-pipeline.md:100 still show subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}. That is now inconsistent with the onboarding/scaffolded workflow contract, which uses vars.AZURE_SUBSCRIPTION_ID (for example in .github/copilot-instructions.md and the generated workflow templates). Because these docs/agent definitions are part of this PR, they should be updated to the variable-based form before merge.
…workflows Round 5 review (sendtoshailesh): the azure-resource-deployer agent definition and the cicd-pipeline / drift-detection example workflows still passed subscription-id from secrets.AZURE_SUBSCRIPTION_ID, contradicting the variable-based contract used by the scaffolded workflows and copilot-instructions (vars.AZURE_SUBSCRIPTION_ID). Switch all three azure/login@v2 examples to vars.AZURE_SUBSCRIPTION_ID (client-id and tenant-id remain secrets). Regenerated the agent doc from source.
@sendtoshailesh Round 5 issue fixed in All three
I also swept the whole repo: a repo-wide grep confirms zero remaining Ready for re-review. |
sendtoshailesh left a comment
Round 6 follow-up: new commits were pushed after the previous requested-changes review, and the remaining round-5 blocker is now resolved.
Across rounds 1-4, the earlier issues had already been fixed. In round 5, the only outstanding blocker was the inconsistent use of secrets.AZURE_SUBSCRIPTION_ID in three docs/agent examples. I re-checked the updated PR and confirmed those references now use vars.AZURE_SUBSCRIPTION_ID in:
- .github/agents/azure-resource-deployer.agent.md
- website/docs/agents/azure-resource-deployer.md
- website/docs/use-cases/cicd-pipeline.md
I also searched the full current PR diff for any remaining secrets.AZURE_SUBSCRIPTION_ID references and found only removed lines, with the live additions consistently using vars.AZURE_SUBSCRIPTION_ID. I did a fresh scan of the current diff as well and did not find any new issues.
All issues raised across all six review rounds are now resolved. Approving.
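A sweep like the one described in this review can be sketched with a fixed-string recursive grep. The function name `count_stale_refs` and the throwaway file names are assumptions for the example; the reviewer searched the PR diff, whereas this sketch scans a directory tree.

```shell
# Count files under a directory that still reference the secret form.
count_stale_refs() {
  grep -rlF 'secrets.AZURE_SUBSCRIPTION_ID' "$1" 2>/dev/null | wc -l | tr -d ' '
}

# Demo on a throwaway tree (file names are made up for the example).
demo=$(mktemp -d)
printf 'subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}\n' > "$demo/ok.yml"
printf 'subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}\n' > "$demo/stale.yml"
echo "stale files: $(count_stale_refs "$demo")"
rm -rf "$demo"
```

Using -F treats the dot in the pattern literally, so unrelated names are not matched by accident.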
Summary
End-to-end overhaul of the /git-ape-onboarding flow plus the supporting packaging, instructions, eval registration, and regenerated docs. Five themed commits, each independently revertable:
- chore(models): rename gpt-5-codex to gpt-5.3-codex
- feat(onboarding): replace exampleyml stubs with template-driven scaffold
  .github/workflows/ (where the .exampleyml hack lived) into .github/skills/git-ape-onboarding/templates/; ship sync scripts; add eval suite + CI parity check; register the eval in the pilot tier
- feat(extension): register prompt files in VSIX and tighten .vscodeignore
  plugin.json now declares prompts:; package.template.json registers all 9 prompts as chatPromptFiles; .vscodeignore sheds dev-only .github subtrees from the published VSIX
- docs(instructions): switch deploy/destroy guidance to Azure Deployment Stacks
  az stack sub create/delete --action-on-unmanage deleteAll instead of az deployment sub + az group delete (see #30)
- docs(website): regenerate for templated workflows and prompt assets
  .github/workflows/ and the skill templates directory, tags scaffolded workflows with a Docusaurus admonition

Key change: .exampleyml is gone
Old shape (this repo's .github/workflows/):
.exampleyml was a workaround so GitHub Actions wouldn't auto-load the scaffolds.
New shape (.github/skills/git-ape-onboarding/templates/workflows/):
The path is no longer .github/workflows/ so the workaround isn't needed. /git-ape-onboarding copies these into the target repo with skip-on-collision so customized workflows are never overwritten.

Eval registration
Adds git-ape-onboarding to pilot tier in .github/evals/manifest.yaml, matching its prior 4-model bench coverage. The eval ships 4 tasks: positive-first-time-setup, positive-multi-env, positive-skip-on-collision, negative-storage-comparison. Closes part of #93.

Verification done
- actionlint clean on the new git-ape-onboarding-template-check.yml
- node scripts/generate-docs.js re-runs with no further drift
- scripts/ carry the executable bit

Dependency
Depends on #140 (LLM-as-judge → claude-opus-4.7). This PR is mergeable independently: if it lands first, the new git-ape-onboarding eval will run with whatever judge is pinned at merge time, then automatically pick up the opus judge once #140 merges. Prefer to merge #140 first to avoid a mixed-judge snapshot.

Risk
Medium-low.
- .exampleyml deletions are the only destructive change in this repo; their content is preserved verbatim in the templates directory.
- waza-evals matrix dispatch (pilot × 4 models = 16 legs). Quota cost: ~equivalent to prereq-check baseline.