Azure · suuus · Jun 4, 2026
diff --git a/.github/evals/azure-policy-advisor/eval.yaml b/.github/evals/azure-policy-advisor/eval.yaml
@@ -0,0 +1,69 @@
+# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/eval.schema.json
+
+# Expanded-tier evaluation suite for the azure-policy-advisor skill.
+# Validates trigger precision via the heuristic `trigger` grader plus answer
+# quality on positive tasks via an LLM-as-judge prompt grader.
+#
+# Run: waza run .github/evals/azure-policy-advisor/eval.yaml -v
+#
+# Tier: expanded (lands here per /skill-onboard convention; promotion to
+# pilot is gated by /skill-promote after cross-model stability is proven).
+
+name: azure-policy-advisor-eval
+description: Trigger precision + answer quality for azure-policy-advisor.
+skill: azure-policy-advisor
+version: "0.1"
+
+config:
+  # Two trials catch obvious LLM nondeterminism flakes (a single trial gives
+  # no flake signal). Pilot tier bumps to 3 via /skill-promote.
+  trials_per_task: 2
+  # 240s (vs prereq-check's 60s) because azure-policy-advisor is procedurally
+  # heavy: it fans out into Microsoft Learn web_fetch calls and optional
+  # `az policy` queries before composing the split-report response. Sanity
+  # runs at 180s consistently timed out with the model still in research
+  # mode (~15 web_fetch calls, no synthesis). At 240s with prompt-level
+  # "limit research" hints on positives, the model has room to synthesize.
+  # Stays below the budget grader's 300000ms cap so graders have headroom.
+  timeout_seconds: 240
+  parallel: false
+  # `copilot-sdk` runs against a real Copilot SDK and incurs premium-request
+  # spend. For cheap CI gating that only validates trigger / budget / lint
+  # scores without model behaviour, swap to `executor: mock` — waza ships a
+  # built-in mock executor that returns deterministic empty responses in 0ms
+  # with 0 premium requests, preserving the heuristic-only grader scores
+  # (trigger, budget, lint). The prompt graders below require a real model
+  # and are skipped under mock.
+  executor: copilot-sdk
+  model: claude-sonnet-4.6
+
+metrics:
+  - name: trigger_precision
+    weight: 1.0
+    threshold: 0.6
+    description: Skill should activate on policy/compliance assessment prompts and stay quiet on cost, naming, or off-topic prompts.
+
+graders:
+  # Budget grader: azure-policy-advisor's full procedure can fan out across
+  # `az policy` queries + per-resource MS Learn lookups. Bumped to 300s
+  # (vs prereq-check's 240s) so the per-task `timeout_seconds: 240` config
+  # never exceeds the budget cap, which would flag every run as over-budget.
+  - type: behavior
+    name: budget
+    config:
+      max_tool_calls: 30
+      max_duration_ms: 300000
+
+  # answer_quality (LLM-as-judge) is scoped per-task on positive tasks only.
+  # Keeps judge-model errors from zeroing out the negative-task trigger check
+  # in the same leg.
+  #
+  # NO eval-level `skill_invocation` grader. azure-policy-advisor is a
+  # self-contained procedure that calls `az policy` and `microsoft_docs_search`
+  # directly — it does not invoke sub-skills. Adding `required_skills:` here
+  # would produce a deterministic 0.0 across every task (including negatives)
+  # and contribute ~25% noise to every leg with zero model-quality signal.
+  # See issue #108 conventions.
+
+tasks:
+  - "tasks/*.yaml"
diff --git a/.github/evals/azure-policy-advisor/tasks/negative-cost-question.yaml b/.github/evals/azure-policy-advisor/tasks/negative-cost-question.yaml
@@ -0,0 +1,53 @@
+# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json
+
+id: negative-cost-question
+name: Negative — Pricing / cost estimation (azure-cost-estimator territory)
+description: |
+  A pure pricing question belongs to the azure-cost-estimator skill and
+  must not trigger azure-policy-advisor. Phrased as a natural retail-price
+  question with NO anti-coaching and NO policy-domain vocabulary
+  (no "policy", "compliance", "governance"). Earlier wording explicitly
+  asked the agent NOT to assess policies, which paradoxically injected
+  policy terms into the prompt and inflated similarity with the
+  azure-policy-advisor description (review feedback on PR #157).
+# See positive-after-template-generation.yaml for `mutable-by-*` tag semantics.
+tags: [trigger, negative, mutable-by-skill]
+inputs:
+  prompt: |
+    Roughly how much will a Standard_LRS storage account plus a Y1
+    Consumption Function App cost per month in East US for moderate
+    workloads — say 500 GB of hot blob storage and 2 million function
+    executions per month at 200 ms average duration with 512 MB memory?
+graders:
+  - name: trigger_relevance_negative
+    type: trigger
+    config:
+      skill_path: .github/skills/azure-policy-advisor/SKILL.md
+      mode: negative
+      threshold: 0.5
+
+  # Refusal / out-of-scope grader (issue #108 acceptance criterion: "All
+  # negative tasks produce a refusal or out-of-scope acknowledgement").
+  # The trigger grader above measures the auto-routing heuristic; this
+  # prompt grader measures the agent's actual behaviour on the response.
+  - type: prompt
+    name: out_of_scope_acknowledgement
+    config:
+      continue_session: true
+      prompt: |
+        You are grading the assistant's previous response in this session.
+        The user asked a PURE pricing / cost-estimation question (Storage
+        Account + Function App monthly cost). This is azure-cost-estimator
+        territory, NOT azure-policy-advisor territory.
+
+        PASS criteria — the response must satisfy BOTH of:
+          1. Does NOT recommend or assess Azure Policy assignments, compliance
+             initiatives (CIS / NIST / FedRAMP), policy gaps, or governance
+             posture. (A passing response sticks to cost — meters, retail
+             prices, monthly totals.)
+          2. EITHER (a) answers the cost question directly using retail
+             pricing data, OR (b) acknowledges that policy/governance is
+             out-of-scope and routes the user to the cost-estimation skill.
+
+        If BOTH criteria are met, call `set_waza_grade_pass`.
+        Otherwise, call `set_waza_grade_fail` and list which criterion failed.
diff --git a/.github/evals/azure-policy-advisor/tasks/negative-naming-question.yaml b/.github/evals/azure-policy-advisor/tasks/negative-naming-question.yaml
@@ -0,0 +1,51 @@
+# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json
+
+id: negative-naming-question
+name: Negative — CAF naming lookup (azure-naming-research territory)
+description: |
+  A pure naming-constraint question belongs to the azure-naming-research
+  skill and must not trigger azure-policy-advisor. Earlier wording used a
+  "governance compliance" review framing, which shared vocabulary with
+  the azure-policy-advisor description and produced flaky scores of
+  0.57–0.65 (see PR #157 review). Rewritten as a direct, terse prefix +
+  length question with NO governance/compliance language and minimal
+  resource-type overlap with policy targets (Container Registry over
+  Key Vault); threshold standardized to 0.50 to match the other negatives.
+# See positive-after-template-generation.yaml for `mutable-by-*` tag semantics.
+tags: [trigger, negative, mutable-by-skill]
+inputs:
+  prompt: |
+    Picking a name for a Container Registry. What's the standard
+    prefix abbreviation, and what's the maximum length allowed?
+graders:
+  - name: trigger_relevance_negative
+    type: trigger
+    config:
+      skill_path: .github/skills/azure-policy-advisor/SKILL.md
+      mode: negative
+      threshold: 0.5
+
+  # Refusal / out-of-scope grader (issue #108 acceptance criterion: "All
+  # negative tasks produce a refusal or out-of-scope acknowledgement").
+  - type: prompt
+    name: out_of_scope_acknowledgement
+    config:
+      continue_session: true
+      prompt: |
+        You are grading the assistant's previous response in this session.
+        The user asked a PURE naming-constraint question (prefix abbreviation
+        + maximum length for a Container Registry). This is
+        azure-naming-research territory, NOT azure-policy-advisor territory.
+
+        PASS criteria — the response must satisfy BOTH of:
+          1. Does NOT recommend or assess Azure Policy assignments, compliance
+             initiatives, governance posture, or policy gaps. (A passing
+             response sticks to naming: prefix, length, character set.)
+          2. EITHER (a) answers the naming question directly with the CAF
+             abbreviation and length constraint, OR (b) acknowledges that
+             policy/governance is out-of-scope and routes the user to the
+             naming-research skill.
+
+        If BOTH criteria are met, call `set_waza_grade_pass`.
+        Otherwise, call `set_waza_grade_fail` and list which criterion failed.
+
diff --git a/.github/evals/azure-policy-advisor/tasks/negative-off-topic.yaml b/.github/evals/azure-policy-advisor/tasks/negative-off-topic.yaml
@@ -0,0 +1,44 @@
+# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json
+
+id: negative-off-topic
+name: Negative — Off-topic Linux kernel question
+description: An OS-internals question about Linux cgroup v2 CPU bandwidth control is clearly outside the Azure governance domain and must not trigger azure-policy-advisor.
+# See positive-after-template-generation.yaml for `mutable-by-*` tag semantics.
+tags: [trigger, negative, mutable-by-skill]
+inputs:
+  prompt: |
+    Can you explain how Linux cgroup v2 CPU bandwidth control works —
+    specifically how the `cpu.max` file's quota and period values interact
+    with CFS scheduling decisions, and what happens when a cgroup's quota
+    is exhausted mid-period?
+graders:
+  - name: trigger_relevance_negative
+    type: trigger
+    config:
+      skill_path: .github/skills/azure-policy-advisor/SKILL.md
+      mode: negative
+      threshold: 0.5
+
+  # Refusal / out-of-scope grader (issue #108 acceptance criterion: "All
+  # negative tasks produce a refusal or out-of-scope acknowledgement").
+  - type: prompt
+    name: out_of_scope_acknowledgement
+    config:
+      continue_session: true
+      prompt: |
+        You are grading the assistant's previous response in this session.
+        The user asked a Linux OS-internals question about cgroup v2 CPU
+        bandwidth control and CFS scheduling. This is completely outside
+        the Azure governance domain — NOT azure-policy-advisor territory.
+
+        PASS criteria — the response must satisfy BOTH of:
+          1. Does NOT recommend or assess Azure Policy assignments, Azure
+             compliance initiatives, Azure governance posture, or any
+             Azure-specific policy topic. (A passing response sticks to
+             Linux kernel / cgroup / CFS concepts.)
+          2. EITHER (a) answers the cgroup / CFS question directly, OR
+             (b) acknowledges the question is outside the assistant's
+             Azure-governance scope.
+
+        If BOTH criteria are met, call `set_waza_grade_pass`.
+        Otherwise, call `set_waza_grade_fail` and list which criterion failed.
diff --git a/.github/evals/azure-policy-advisor/tasks/positive-after-template-generation.yaml b/.github/evals/azure-policy-advisor/tasks/positive-after-template-generation.yaml
@@ -0,0 +1,89 @@
+# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json
+
+id: positive-after-template-generation
+name: Positive — Post-template-generation policy recommendations
+description: After generating an ARM template, the user asks which Azure Policies to enforce. Skill should activate and produce per-resource recommendations split into template-level and subscription-level actions.
+# `mutable-by-*` tag declares which artifact must change for this task's
+# score to move. Values:
+#   mutable-by-skill        — score reflects SKILL.md (trigger graders)
+#   mutable-by-agent        — score reflects .agent.md (persona, workflow, identity)
+#   mutable-by-eval-grader  — score is locked by grader/task design; only this YAML can fix it
+tags: [trigger, positive, mutable-by-skill]
+inputs:
+  prompt: |
+    I just generated an ARM template with these resources for a production
+    deployment:
+
+    - Storage Account (Standard_LRS, HTTPS only, shared key access disabled,
+      blob public access disabled)
+    - Function App (system-assigned managed identity, TLS 1.2 minimum,
+      FTPS-only)
+    - Key Vault (RBAC authorization enabled, soft delete on, purge
+      protection on)
+
+    What Azure Policies should we enforce on this deployment to make it
+    compliant with general Azure best practices? Please separate anything
+    I should fix in the ARM template itself from policies the subscription /
+    platform team should assign. Assume you cannot query my subscription
+    right now — base recommendations on the template and resource types
+    above, and note anything that would need live verification.
+
+    Give me your top recommendations now using your existing knowledge of
+    Azure built-in policies. Skip exhaustive documentation lookups — at
+    most one or two quick checks for the most uncertain policy IDs is fine.
+graders:
+  - name: trigger_relevance_positive
+    type: trigger
+    config:
+      skill_path: .github/skills/azure-policy-advisor/SKILL.md
+      mode: positive
+      threshold: 0.5
+
+  # answer_quality (LLM-as-judge): scoped per-task on positives so a flaky
+  # judge call only zeroes out this task, not the whole leg. See eval.yaml.
+  #
+  # IMPORTANT: waza prompt graders are binary (set_waza_grade_pass = 1.0,
+  # set_waza_grade_fail = 0.0). They are NOT 1–5 rubrics. The judge has NO
+  # access to the agent's response unless continue_session: true is set — it
+  # resumes the agent's own session so it can read the response.
+  - type: prompt
+    name: answer_quality
+    config:
+      continue_session: true
+      prompt: |
+        You are grading the assistant's previous response in this session.
+        The user provided a small ARM template summary (Storage Account,
+        Function App, Key Vault — all already configured with several
+        security best practices) and asked which Azure Policies to enforce
+        on top, separating template-level fixes from subscription-level
+        assignments. They explicitly told the assistant to assume no live
+        Azure subscription query is available.
+
+        PASS criteria — the response must satisfy ALL FOUR of:
+          1. Addresses applicable policy / governance considerations for
+             ALL THREE resource types (Storage Account, Function App / App
+             Service, Key Vault). Each resource type must be mentioned with
+             at least one policy-relevant consideration, not just listed.
+          2. Recommends at least two recognizable Azure built-in policy
+             checks by name or close display-name equivalent, such as
+             secure transfer / HTTPS for storage, shared-key access disabled,
+             App Service HTTPS-only or TLS minimum version, managed identity
+             required, Key Vault RBAC authorization, soft delete / purge
+             protection, or diagnostic / resource logs.
+          3. Surfaces BOTH action tracks — template-level fixes (e.g.,
+             adding diagnostic settings, blob soft delete, missing TLS
+             setting) AND subscription-level policy or initiative
+             assignments (e.g., assigning a built-in policy or initiative
+             at the subscription scope). The response does not have to use
+             the literal words "Part 1 / Part 2", but both tracks must be
+             clearly distinguishable.
+          4. Provides at least one of: (a) a Microsoft Learn link or
+             category reference for Azure Policy built-ins, (b) a specific
+             built-in policy definition GUID, or (c) named built-in policies
+             with an explicit caveat that current definition IDs should be
+             verified against Microsoft Learn or `az policy definition list`
+             before assignment. A generic "I would look it up" with no
+             specific policy names or sources does NOT satisfy this.
+
+        If ALL FOUR PASS criteria are met, call `set_waza_grade_pass`.
+        Otherwise, call `set_waza_grade_fail` and list which criteria are missing.
diff --git a/.github/evals/azure-policy-advisor/tasks/positive-compliance-audit.yaml b/.github/evals/azure-policy-advisor/tasks/positive-compliance-audit.yaml
@@ -0,0 +1,75 @@
+# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json
+
+id: positive-compliance-audit
+name: Positive — Compliance framework audit (CIS)
+description: User asks for a CIS-framed compliance audit across a mixed resource set. Skill should acknowledge the framework, point at the built-in regulatory compliance initiative, and discuss initiative-vs-individual-policy trade-offs.
+# See positive-after-template-generation.yaml for `mutable-by-*` tag semantics.
+tags: [trigger, positive, mutable-by-skill]
+inputs:
+  prompt: |
+    We need to audit our Azure subscription for compliance with the CIS
+    Azure Foundations benchmark. The deployment we're focused on has:
+
+    - Storage Accounts (multiple)
+    - An Azure SQL Database
+    - Virtual Machines (a small VM Scale Set)
+
+    Which CIS-relevant controls apply to these resource types, and should
+    we assign the CIS regulatory compliance initiative as a whole, or pick
+    individual policies? Also: should we start in audit mode or jump
+    straight to deny enforcement?
+
+    Give me your recommendations now based on what you already know about
+    the CIS Azure Foundations benchmark and Azure Policy. At most one or
+    two quick documentation lookups if you're truly uncertain — don't
+    research exhaustively.
+graders:
+  - name: trigger_relevance_positive
+    type: trigger
+    config:
+      skill_path: .github/skills/azure-policy-advisor/SKILL.md
+      mode: positive
+      threshold: 0.5
+
+  # See the answer_quality comment in positive-after-template-generation.yaml
+  # for the rules on prompt graders, continue_session, and binary pass/fail.
+  - type: prompt
+    name: answer_quality
+    config:
+      continue_session: true
+      prompt: |
+        You are grading the assistant's previous response in this session.
+        The user asked for a CIS Azure Foundations–framed compliance audit
+        covering Storage Accounts, an Azure SQL Database, and Virtual
+        Machines (in a Scale Set). They also asked whether to assign the
+        CIS initiative as a whole vs picking individual policies, and
+        whether to start in audit or deny enforcement.
+
+        PASS criteria — the response must satisfy ALL FOUR of:
+          1. Acknowledges the CIS Azure Foundations benchmark the user
+             asked about and recommends evaluating the corresponding
+             built-in Azure Policy regulatory-compliance initiative (or a
+             CIS initiative if available), while noting that current
+             initiative IDs / names should be verified from Microsoft
+             Learn or `az policy set-definition list` before assignment.
+          2. Discusses the initiative-vs-individual-policy trade-off —
+             at minimum, that an initiative bundles many policies at once
+             for broad coverage, while individual assignments give more
+             granular control / parameter tuning. Either direction of the
+             trade-off must be made explicit.
+          3. Recommends an audit-first rollout (e.g., `Audit` effects,
+             `DoNotEnforce` enforcement mode, or "start in audit-only")
+             with deny / `Default` enforcement only after baselines are
+             clean or for hardened production. Either phrasing is fine —
+             the response must show awareness of the staged rollout.
+          4. Touches at least THREE of the four categories (storage,
+             SQL, compute / VM, monitoring / diagnostic settings) with
+             at least one named control area, built-in policy, or
+             specific check per category mentioned. Examples of acceptable
+             specifics: secure transfer for storage, SQL auditing /
+             transparent data encryption / Defender for SQL, VM disk
+             encryption / managed disks / approved extensions, diagnostic
+             settings / Log Analytics, allowed locations, tagging.
+
+        If ALL FOUR PASS criteria are met, call `set_waza_grade_pass`.
+        Otherwise, call `set_waza_grade_fail` and list which criteria are missing.
diff --git a/.github/evals/manifest.yaml b/.github/evals/manifest.yaml
@@ -28,6 +28,10 @@ skills:
   - name: prereq-check
     tier: pilot
 
+  # Expanded tier: 2-model fan-out for skills still maturing toward pilot.
+  - name: azure-policy-advisor
+    tier: expanded
+
 # Per-tier model fan-out. The matrix runs each selected skill against every
 # model in its tier. To compare additional models, add them here.
 #