69 changes: 69 additions & 0 deletions .github/evals/azure-policy-advisor/eval.yaml
@@ -0,0 +1,69 @@
# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/eval.schema.json

# Expanded-tier evaluation suite for the azure-policy-advisor skill.
# Validates trigger precision via the heuristic `trigger` grader plus answer
# quality on positive tasks via an LLM-as-judge prompt grader.
#
# Run: waza run .github/evals/azure-policy-advisor/eval.yaml -v
#
# Tier: expanded (lands here per /skill-onboard convention; promotion to
# pilot is gated by /skill-promote after cross-model stability is proven).

name: azure-policy-advisor-eval
description: Trigger precision + answer quality for azure-policy-advisor.
skill: azure-policy-advisor
version: "0.1"

config:
  # 2 trials catches obvious LLM nondeterminism flakes (single trial = no
  # flake signal). Pilot tier bumps to 3 via /skill-promote.
  trials_per_task: 2
  # 240s (vs prereq-check's 60s) because azure-policy-advisor is procedurally
  # heavy: it fans out into Microsoft Learn web_fetch calls and optional
  # `az policy` queries before composing the split-report response. Sanity
  # runs at 180s consistently timed out with the model still in research
  # mode (~15 web_fetch calls, no synthesis). At 240s with prompt-level
  # "limit research" hints on positives, the model has room to synthesize.
  # Stays below the budget grader's 300000ms cap so graders have headroom.
  timeout_seconds: 240
  parallel: false
  # `copilot-sdk` runs against a real Copilot SDK and incurs premium-request
  # spend. For cheap CI gating that only validates trigger / budget / lint
  # scores without model behaviour, swap to `executor: mock` — waza ships a
  # built-in mock executor that returns deterministic empty responses in 0ms
  # with 0 premium requests, preserving the heuristic-only grader scores
  # (trigger, budget, lint). The prompt graders below require a real model
  # and are skipped under mock.
  executor: copilot-sdk
  model: claude-sonnet-4.6

metrics:
  - name: trigger_precision
    weight: 1.0
    threshold: 0.6
    description: Skill should activate on policy/compliance assessment prompts and stay quiet on cost, naming, or off-topic prompts.

graders:
  # Budget grader: azure-policy-advisor's full procedure can fan out across
  # `az policy` queries + per-resource MS Learn lookups. Bumped to 300s
  # (vs prereq-check's 240s) so the per-task `timeout_seconds: 240` config
  # never exceeds the budget cap (would flag every run as over-budget).
  - type: behavior
    name: budget
    config:
      max_tool_calls: 30
      max_duration_ms: 300000

  # answer_quality (LLM-as-judge) is scoped per-task on positive tasks only.
  # Keeps judge-model errors from zeroing out the negative-task trigger check
  # in the same leg.
  #
  # NO eval-level `skill_invocation` grader. azure-policy-advisor is a
  # self-contained procedure that calls `az policy` and `microsoft_docs_search`
  # directly — it does not invoke sub-skills. Adding `required_skills:` here
  # would produce a deterministic 0.0 across every task (including negatives)
  # and contribute ~25% noise to every leg with zero model-quality signal.
  # See issue #108 conventions.

tasks:
  - "tasks/*.yaml"
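The comments in eval.yaml tie two numbers together: the per-task `timeout_seconds` (240) must stay at or below the budget grader's `max_duration_ms` (300000), otherwise every run would be flagged as over-budget. A minimal sketch of that invariant follows. The helper name `timeout_within_budget` and the plain dicts are hypothetical stand-ins for the parsed YAML, not part of waza.

```python
# Hypothetical helper: checks the invariant described in the eval.yaml
# comments. It is NOT a waza API; the dicts below mirror the YAML values.


def timeout_within_budget(timeout_seconds: int, max_duration_ms: int) -> bool:
    """Return True when the per-task timeout fits under the budget cap."""
    return timeout_seconds * 1000 <= max_duration_ms


# Values copied from eval.yaml in this diff.
eval_config = {"timeout_seconds": 240}
budget_grader = {"max_tool_calls": 30, "max_duration_ms": 300_000}

# Headroom left for graders once the task itself has used its full timeout.
headroom_ms = (
    budget_grader["max_duration_ms"] - eval_config["timeout_seconds"] * 1000
)

print(
    timeout_within_budget(
        eval_config["timeout_seconds"], budget_grader["max_duration_ms"]
    ),
    headroom_ms,
)
```

A check like this could run as a lint step before the eval so that a later timeout bump does not silently exceed the budget cap.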
@@ -0,0 +1,53 @@
# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json

id: negative-cost-question
name: Negative — Pricing / cost estimation (azure-cost-estimator territory)
description: |
  A pure pricing question belongs to the azure-cost-estimator skill and
  must not trigger azure-policy-advisor. Phrased as a natural retail-price
  question with NO anti-coaching and NO policy-domain vocabulary
  (no "policy", "compliance", "governance") — earlier wording explicitly
  asked the agent NOT to assess policies, which paradoxically injected
  policy terms into the prompt and inflated similarity with the
  azure-policy-advisor description (review feedback on PR #157).
# See positive-after-template-generation.yaml for `mutable-by-*` tag semantics.
tags: [trigger, negative, mutable-by-skill]
inputs:
  prompt: |
    Roughly how much will a Standard_LRS storage account plus a Y1
    Consumption Function App cost per month in East US for moderate
    workloads — say 500 GB of hot blob storage and 2 million function
    executions per month at 200 ms average duration with 512 MB memory?
graders:
  - name: trigger_relevance_negative
    type: trigger
    config:
      skill_path: .github/skills/azure-policy-advisor/SKILL.md
      mode: negative
      threshold: 0.5

  # Refusal / out-of-scope grader (issue #108 acceptance criterion: "All
  # negative tasks produce a refusal or out-of-scope acknowledgement").
  # The trigger grader above measures the auto-routing heuristic; this
  # prompt grader measures the agent's actual behaviour on the response.
  - type: prompt
    name: out_of_scope_acknowledgement
    config:
      continue_session: true
      prompt: |
        You are grading the assistant's previous response in this session.
        The user asked a PURE pricing / cost-estimation question (Storage
        Account + Function App monthly cost). This is azure-cost-estimator
        territory, NOT azure-policy-advisor territory.

        PASS criteria — the response must satisfy BOTH of:
        1. Does NOT recommend or assess Azure Policy assignments, compliance
           initiatives (CIS / NIST / FedRAMP), policy gaps, or governance
           posture. (A passing response sticks to cost — meters, retail
           prices, monthly totals.)
        2. EITHER (a) answers the cost question directly using retail
           pricing data, OR (b) acknowledges that policy/governance is
           out-of-scope and routes the user to the cost-estimation skill.

        If BOTH criteria are met, call `set_waza_grade_pass`.
        Otherwise, call `set_waza_grade_fail` and list which criterion failed.
@@ -0,0 +1,51 @@
# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json

id: negative-naming-question
name: Negative — CAF naming lookup (azure-naming-research territory)
description: |
  A pure naming-constraint question belongs to the azure-naming-research
  skill and must not trigger azure-policy-advisor. Earlier wording invoked
  "governance compliance" review framing which injected overlap vocabulary
  with the azure-policy-advisor description and produced flaky scores at
  0.57–0.65 (see PR #157 review). Rewritten to a direct, terse prefix +
  length question with NO governance/compliance language and minimal
  resource-type overlap with policy targets (Container Registry over
  Key Vault); threshold standardized to 0.50 to match the other negatives.
# See positive-after-template-generation.yaml for `mutable-by-*` tag semantics.
tags: [trigger, negative, mutable-by-skill]
inputs:
  prompt: |
    Picking a name for a Container Registry. What's the standard
    prefix abbreviation, and what's the maximum length allowed?
graders:
  - name: trigger_relevance_negative
    type: trigger
    config:
      skill_path: .github/skills/azure-policy-advisor/SKILL.md
      mode: negative
      threshold: 0.5

  # Refusal / out-of-scope grader (issue #108 acceptance criterion: "All
  # negative tasks produce a refusal or out-of-scope acknowledgement").
  - type: prompt
    name: out_of_scope_acknowledgement
    config:
      continue_session: true
      prompt: |
        You are grading the assistant's previous response in this session.
        The user asked a PURE naming-constraint question (prefix abbreviation
        + maximum length for a Container Registry). This is
        azure-naming-research territory, NOT azure-policy-advisor territory.

        PASS criteria — the response must satisfy BOTH of:
        1. Does NOT recommend or assess Azure Policy assignments, compliance
           initiatives, governance posture, or policy gaps. (A passing
           response sticks to naming: prefix, length, character set.)
        2. EITHER (a) answers the naming question directly with the CAF
           abbreviation and length constraint, OR (b) acknowledges that
           policy/governance is out-of-scope and routes the user to the
           naming-research skill.

        If BOTH criteria are met, call `set_waza_grade_pass`.
        Otherwise, call `set_waza_grade_fail` and list which criterion failed.

44 changes: 44 additions & 0 deletions .github/evals/azure-policy-advisor/tasks/negative-off-topic.yaml
@@ -0,0 +1,44 @@
# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json

id: negative-off-topic
name: Negative — Off-topic Linux kernel question
description: An OS-internals question about Linux cgroup v2 CPU bandwidth control is clearly outside the Azure governance domain and must not trigger azure-policy-advisor.
# See positive-after-template-generation.yaml for `mutable-by-*` tag semantics.
tags: [trigger, negative, mutable-by-skill]
inputs:
  prompt: |
    Can you explain how Linux cgroup v2 CPU bandwidth control works —
    specifically how the `cpu.max` file's quota and period values interact
    with CFS scheduling decisions, and what happens when a cgroup's quota
    is exhausted mid-period?
graders:
  - name: trigger_relevance_negative
    type: trigger
    config:
      skill_path: .github/skills/azure-policy-advisor/SKILL.md
      mode: negative
      threshold: 0.5

  # Refusal / out-of-scope grader (issue #108 acceptance criterion: "All
  # negative tasks produce a refusal or out-of-scope acknowledgement").
  - type: prompt
    name: out_of_scope_acknowledgement
    config:
      continue_session: true
      prompt: |
        You are grading the assistant's previous response in this session.
        The user asked a Linux OS-internals question about cgroup v2 CPU
        bandwidth control and CFS scheduling. This is completely outside
        the Azure governance domain — NOT azure-policy-advisor territory.

        PASS criteria — the response must satisfy BOTH of:
        1. Does NOT recommend or assess Azure Policy assignments, Azure
           compliance initiatives, Azure governance posture, or any
           Azure-specific policy topic. (A passing response sticks to
           Linux kernel / cgroup / CFS concepts.)
        2. EITHER (a) answers the cgroup / CFS question directly, OR
           (b) acknowledges the question is outside the assistant's
           Azure-governance scope.

        If BOTH criteria are met, call `set_waza_grade_pass`.
        Otherwise, call `set_waza_grade_fail` and list which criterion failed.
@@ -0,0 +1,89 @@
# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json

id: positive-after-template-generation
name: Positive — Post-template-generation policy recommendations
description: After generating an ARM template, the user asks which Azure Policies to enforce. Skill should activate and produce per-resource recommendations split into template-level and subscription-level actions.
# `mutable-by-*` tag declares which artifact must change for this task's
# score to move. Values:
#   mutable-by-skill — score reflects SKILL.md (trigger graders)
#   mutable-by-agent — score reflects .agent.md (persona, workflow, identity)
#   mutable-by-eval-grader — score is locked by grader/task design; only this YAML can fix it
tags: [trigger, positive, mutable-by-skill]
inputs:
  prompt: |
    I just generated an ARM template with these resources for a production
    deployment:

    - Storage Account (Standard_LRS, HTTPS only, shared key access disabled,
      blob public access disabled)
    - Function App (system-assigned managed identity, TLS 1.2 minimum,
      FTPS-only)
    - Key Vault (RBAC authorization enabled, soft delete on, purge
      protection on)

    What Azure Policies should we enforce on this deployment to make it
    compliant with general Azure best practices? Please separate anything
    I should fix in the ARM template itself from policies the subscription /
    platform team should assign. Assume you cannot query my subscription
    right now — base recommendations on the template and resource types
    above, and note anything that would need live verification.

    Give me your top recommendations now using your existing knowledge of
    Azure built-in policies. Skip exhaustive documentation lookups — at
    most one or two quick checks for the most uncertain policy IDs is fine.
graders:
  - name: trigger_relevance_positive
    type: trigger
    config:
      skill_path: .github/skills/azure-policy-advisor/SKILL.md
      mode: positive
      threshold: 0.5

  # answer_quality (LLM-as-judge): scoped per-task on positives so a flaky
  # judge call only zeroes out this task, not the whole leg. See eval.yaml.
  #
  # IMPORTANT: waza prompt graders are binary (set_waza_grade_pass = 1.0,
  # set_waza_grade_fail = 0.0). They are NOT 1–5 rubrics. The judge has NO
  # access to the agent's response unless continue_session: true is set — it
  # resumes the agent's own session so it can read the response.
  - type: prompt
    name: answer_quality
    config:
      continue_session: true
      prompt: |
        You are grading the assistant's previous response in this session.
        The user provided a small ARM template summary (Storage Account,
        Function App, Key Vault — all already configured with several
        security best practices) and asked which Azure Policies to enforce
        on top, separating template-level fixes from subscription-level
        assignments. They explicitly told the assistant to assume no live
        Azure subscription query is available.

        PASS criteria — the response must satisfy ALL FOUR of:
        1. Addresses applicable policy / governance considerations for
           ALL THREE resource types (Storage Account, Function App / App
           Service, Key Vault). Each resource type must be mentioned with
           at least one policy-relevant consideration, not just listed.
        2. Recommends at least two recognizable Azure built-in policy
           checks by name or close display-name equivalent, such as
           secure transfer / HTTPS for storage, shared-key access disabled,
           App Service HTTPS-only or TLS minimum version, managed identity
           required, Key Vault RBAC authorization, soft delete / purge
           protection, or diagnostic / resource logs.
        3. Surfaces BOTH action tracks — template-level fixes (e.g.,
           adding diagnostic settings, blob soft delete, missing TLS
           setting) AND subscription-level policy or initiative
           assignments (e.g., assigning a built-in policy or initiative
           at the subscription scope). The response does not have to use
           the literal words "Part 1 / Part 2", but both tracks must be
           clearly distinguishable.
        4. Provides at least one of: (a) a Microsoft Learn link or
           category reference for Azure Policy built-ins, (b) a specific
           built-in policy definition GUID, or (c) named built-in policies
           with an explicit caveat that current definition IDs should be
           verified against Microsoft Learn or `az policy definition list`
           before assignment. A generic "I would look it up" with no
           specific policy names or sources does NOT satisfy this.

        If ALL FOUR PASS criteria are met, call `set_waza_grade_pass`.
        Otherwise, call `set_waza_grade_fail` and list which criteria are missing.
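The grader comments stress that prompt graders are binary (`set_waza_grade_pass` = 1.0, `set_waza_grade_fail` = 0.0), so a response that meets three of the four criteria scores the same as one that meets none. The sketch below illustrates that all-or-nothing behaviour. Averaging the two trials into a task score is an assumption made for illustration only; how waza actually aggregates trials is not stated in this diff, and `binary_grade` is a hypothetical name.

```python
from statistics import mean

# Hypothetical illustration of binary prompt grading; not waza's code.


def binary_grade(criteria_met: list[bool]) -> float:
    """Return 1.0 only when every PASS criterion holds, otherwise 0.0."""
    return 1.0 if all(criteria_met) else 0.0


# Two trials, matching `trials_per_task: 2` in eval.yaml.
trial_1 = binary_grade([True, True, True, True])   # all four criteria met
trial_2 = binary_grade([True, True, False, True])  # criterion 3 missing

# ASSUMPTION: trials are combined by a plain mean. The real aggregation
# rule may differ.
task_score = mean([trial_1, trial_2])

print(trial_1, trial_2, task_score)
```

Under this assumed averaging, a single flaky judge call on a two-trial task halves the task score, which is consistent with the choice to scope `answer_quality` per task so one bad call does not affect the whole leg.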
@@ -0,0 +1,75 @@
# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/task.schema.json

id: positive-compliance-audit
name: Positive — Compliance framework audit (CIS)
description: User asks for a CIS-framed compliance audit across a mixed resource set. Skill should acknowledge the framework, point at the built-in regulatory compliance initiative, and discuss initiative-vs-individual-policy trade-offs.
# See positive-after-template-generation.yaml for `mutable-by-*` tag semantics.
tags: [trigger, positive, mutable-by-skill]
inputs:
  prompt: |
    We need to audit our Azure subscription for compliance with the CIS
    Azure Foundations benchmark. The deployment we're focused on has:

    - Storage Accounts (multiple)
    - An Azure SQL Database
    - Virtual Machines (a small VM Scale Set)

    Which CIS-relevant controls apply to these resource types, and should
    we assign the CIS regulatory compliance initiative as a whole, or pick
    individual policies? Also: should we start in audit mode or jump
    straight to deny enforcement?

    Give me your recommendations now based on what you already know about
    the CIS Azure Foundations benchmark and Azure Policy. At most one or
    two quick documentation lookups if you're truly uncertain — don't
    research exhaustively.
graders:
  - name: trigger_relevance_positive
    type: trigger
    config:
      skill_path: .github/skills/azure-policy-advisor/SKILL.md
      mode: positive
      threshold: 0.5

  # See positive-after-template-generation.yaml header for the rules around
  # prompt graders, continue_session, and binary pass/fail semantics.
  - type: prompt
    name: answer_quality
    config:
      continue_session: true
      prompt: |
        You are grading the assistant's previous response in this session.
        The user asked for a CIS Azure Foundations–framed compliance audit
        covering Storage Accounts, an Azure SQL Database, and Virtual
        Machines (in a Scale Set). They also asked whether to assign the
        CIS initiative as a whole vs picking individual policies, and
        whether to start in audit or deny enforcement.

        PASS criteria — the response must satisfy ALL FOUR of:
        1. Acknowledges the CIS Azure Foundations benchmark the user
           asked about and recommends evaluating the corresponding
           built-in Azure Policy regulatory-compliance initiative (or a
           CIS initiative if available), while noting that current
           initiative IDs / names should be verified from Microsoft
           Learn or `az policy set-definition list` before assignment.
        2. Discusses the initiative-vs-individual-policy trade-off —
           at minimum, that an initiative bundles many policies at once
           for broad coverage, while individual assignments give more
           granular control / parameter tuning. Either direction of the
           trade-off must be made explicit.
        3. Recommends an audit-first rollout (e.g., `Audit` effects,
           `DoNotEnforce` enforcement mode, or "start in audit-only")
           with deny / `Default` enforcement only after baselines are
           clean or for hardened production. Either phrasing is fine —
           the response must show awareness of the staged rollout.
        4. Touches at least THREE of the four categories (storage,
           SQL, compute / VM, monitoring / diagnostic settings) with
           at least one named control area, built-in policy, or
           specific check per category mentioned. Examples of acceptable
           specifics: secure transfer for storage, SQL auditing /
           transparent data encryption / Defender for SQL, VM disk
           encryption / managed disks / approved extensions, diagnostic
           settings / Log Analytics, allowed locations, tagging.

        If ALL FOUR PASS criteria are met, call `set_waza_grade_pass`.
        Otherwise, call `set_waza_grade_fail` and list which criteria are missing.
4 changes: 4 additions & 0 deletions .github/evals/manifest.yaml
@@ -28,6 +28,10 @@ skills:
  - name: prereq-check
    tier: pilot

  # Expanded tier: 2-model fan-out for skills still maturing toward pilot.
  - name: azure-policy-advisor
    tier: expanded

# Per-tier model fan-out. The matrix runs each selected skill against every
# model in its tier. To compare additional models, add them here.
#
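The manifest comment says the matrix runs each selected skill against every model in its tier. A minimal sketch of that expansion is shown here. The model names (`model-a`, `model-b`, `model-c`) and the function `expand_matrix` are hypothetical placeholders; the real per-tier model lists sit in the part of manifest.yaml that this hunk does not show.

```python
# Hypothetical sketch of the per-tier fan-out described in manifest.yaml.
# Model names are placeholders, NOT the models configured in the repo.


def expand_matrix(
    skills: list[dict[str, str]], tier_models: dict[str, list[str]]
) -> list[tuple[str, str]]:
    """Pair every skill with every model listed for the skill's tier."""
    return [
        (skill["name"], model)
        for skill in skills
        for model in tier_models[skill["tier"]]
    ]


# Skill entries copied from the manifest hunk in this diff.
skills = [
    {"name": "prereq-check", "tier": "pilot"},
    {"name": "azure-policy-advisor", "tier": "expanded"},
]

# ASSUMPTION: placeholder model lists. Only the "2-model fan-out" size for
# the expanded tier comes from the manifest comment.
tier_models = {
    "pilot": ["model-a", "model-b", "model-c"],
    "expanded": ["model-a", "model-b"],
}

legs = expand_matrix(skills, tier_models)
for skill_name, model in legs:
    print(f"{skill_name} x {model}")
```

With these placeholder lists, the expanded-tier skill produces two legs, one per model, which is the cross-model comparison the promotion to pilot depends on.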