Summary
Enhance the azure-integration-tester skill to detect false successes — cases where the agent reports "deployment successful" but the execution trace reveals that mandatory checkpoints were skipped or never evaluated. Based on findings from arXiv:2605.03159.
Depends on: #148 (structured trace capture), #149 (dominator-tree validation skill)
Why This Matters
The Problem
Autonomous agents frequently misreport their own success. The paper quantifies this precisely:
"The CUA frequently misreported failures as successes, often due to timing out or misinterpreting its own state, achieving only 82.2% accuracy and 60.0% recall. In contrast, the dominator tree achieved perfect differentiation by focusing on whether essential milestones were actually reached rather than relying on the agent's internal assessment."
In git-ape's context, this manifests as:
- Agent says "✅ Deployment complete" but never ran the security gate (prompt regression, context overflow, or model degradation)
- Agent says "✅ All checks passed" but the integration tester was never invoked
- Agent encounters an error mid-flow, recovers by skipping ahead, and reports success on the final step
The existing azure-integration-tester skill checks resource health (HTTP 200, resource exists) but does not verify that the process was followed correctly. A resource can be healthy AND insecurely deployed if the security gate was skipped.
The Research Finding
The paper's most impactful result for CI/CD is the ability to distinguish failure root causes:
"The most significant impact for developers is in reducing false alarms. When a test fails, high-signal feedback is needed to determine if the product code is broken or if the agent simply stumbled due to environmental noise. [...] The CUA's internal self-assessment was completely unable to identify 'not a bug' scenarios (0% F1-score)."
Applied to git-ape:
- Product bug = Azure deployment actually failed (resource unhealthy)
- Agent issue = Agent skipped steps but resources happen to work (process violation)
- False success = Agent claimed success, resources look fine, but security/compliance gates were never evaluated
Categories 2 and 3 are invisible today. This feature makes them visible.
Value to Customers
- Security teams: Proof that every deployment passed the security gate — not just the agent's word for it
- Platform engineers: Alert when agent behavior degresses silently (prompt changes causing step-skipping)
- Compliance: Detect deployments where policy assessment was supposed to run but didn't
- CI reliability: Distinguish "the agent messed up" from "the deployment is actually broken" — reduces wasted investigation time
Design
Detection Layers
The false-success detector runs as a post-deployment validation step, after resource health checks pass:
┌────────────────────────────────┐
│ azure-integration-tester │
│ (existing: resource health) │
│ HTTP 200? Resource exists? │
└──────────────┬─────────────────┘
│ Health: ✅
▼
┌────────────────────────────────┐
│ False-Success Detector (NEW) │
│ Structural trace validation │
│ Were all gates actually hit? │
└──────────────┬─────────────────┘
│
▼
┌───────────────┐
│ Final Verdict │
└───────────────┘
Detection Logic
INPUT: trace.jsonl + dominator-model.json + integration test results
OUTPUT: One of four verdicts:
1. ✅ VERIFIED SUCCESS
- Resources healthy AND trace covers 100% essential states
- "Deployment is correct and process was followed."
2. ⚠️ PROCESS VIOLATION (resources healthy, trace incomplete)
- Resources healthy BUT essential states missing from trace
- "Resources deployed successfully, but the security gate was never evaluated.
This deployment may not meet security requirements."
3. ❌ DEPLOYMENT FAILURE (resources unhealthy)
- Integration tests failed — actual infrastructure problem
- "Deployment failed: Function App returned HTTP 503."
4. ⚠️ AGENT ISSUE (trace shows abnormal flow)
- Trace contains error states, retries, or out-of-order transitions
- "Agent encountered errors during execution. Resources may be in inconsistent state."
Classification Heuristics
| Signal |
Indicates |
security_gate_evaluated missing from trace |
Process violation — gate skipped |
deployment_executed present but preflight_validated missing |
Deployed without validation |
| Trace has < N events (suspiciously short) |
Agent may have crashed/restarted |
state: "error" events in trace followed by jump to deployment_executed |
Agent recovered by skipping |
Terminal state doesn't match expected (integration_tests_passed) |
Incomplete workflow |
Integration with Existing Skill
Extend azure-integration-tester SKILL.md to add a "structural validation" phase after health checks:
## Phase 2: Structural Validation (NEW)
After resource health passes, validate the execution trace:
1. Load `trace.jsonl` from the deployment directory
2. Load the dominator model (`.azure/baselines/default/dominator-model.json`)
- If no model exists, fall back to the hardcoded essential-state checklist
3. Run topological subsequence matching (invoke `trace-validator` skill or inline)
4. Classify the result using the 4-verdict system above
5. Include classification in the PR comment / integration test output
Hardcoded Fallback (No Model Available)
When no dominator model has been built yet (first deployments), use a static checklist derived from git-ape's documented mandatory stages:
{
"essential_states_fallback": [
"requirements_gathered",
"template_generated",
"security_gate_evaluated",
"preflight_validated",
"deployment_executed",
"integration_tests_passed"
]
}
This ensures false-success detection works from day 1, even before golden traces are collected.
PR Comment Output
On process violation:
## 🧪 Post-Deployment Validation
### Resource Health
| Resource | Status | Check |
|----------|:------:|-------|
| func-api-dev-eastus | ✅ | HTTP 200, runtime healthy |
| st-api-dev-eastus | ✅ | Blob endpoint reachable |
### Process Integrity
| Verdict | ⚠️ PROCESS VIOLATION |
|---------|---------------------|
| Coverage | 5/7 essential states (71%) |
| Missing | `security_gate_evaluated`, `preflight_validated` |
> ⚠️ **Resources are healthy, but the deployment process was incomplete.**
> The security gate and preflight validation were never executed during this run.
> This may indicate a prompt regression or context overflow in the agent.
>
> **Action required:** Re-run security analysis on the deployed template, or
> trigger a redeployment through the standard workflow.
Alerting / Escalation
When a process violation is detected:
- PR comment — always (visible to reviewer)
- Workflow annotation —
::warning:: so it shows in the Actions UI
- Optional: GitHub issue — auto-create if configured (for tracking process regressions over time)
- Never block deployment retroactively — resources are already deployed; this is detection, not prevention
Prevention is the security gate's job (blocking, pre-deployment). False-success detection is a post-hoc safety net that catches when prevention was circumvented.
Acceptance Criteria
Non-Goals (This Issue)
- Blocking deployments based on trace validation (that's the security gate's job, pre-deployment)
- Auto-remediation (re-running skipped gates)
- Historical trend analysis ("this agent has been skipping gates more often lately")
- Alert routing to external systems (PagerDuty, Slack)
References
Summary
Enhance the
azure-integration-testerskill to detect false successes — cases where the agent reports "deployment successful" but the execution trace reveals that mandatory checkpoints were skipped or never evaluated. Based on findings from arXiv:2605.03159.Depends on: #148 (structured trace capture), #149 (dominator-tree validation skill)
Why This Matters
The Problem
Autonomous agents frequently misreport their own success. The paper quantifies this precisely:
In git-ape's context, this manifests as:
The existing
azure-integration-testerskill checks resource health (HTTP 200, resource exists) but does not verify that the process was followed correctly. A resource can be healthy AND insecurely deployed if the security gate was skipped.The Research Finding
The paper's most impactful result for CI/CD is the ability to distinguish failure root causes:
Applied to git-ape:
Categories 2 and 3 are invisible today. This feature makes them visible.
Value to Customers
Design
Detection Layers
The false-success detector runs as a post-deployment validation step, after resource health checks pass:
Detection Logic
Classification Heuristics
security_gate_evaluatedmissing from tracedeployment_executedpresent butpreflight_validatedmissingstate: "error"events in trace followed by jump todeployment_executedintegration_tests_passed)Integration with Existing Skill
Extend
azure-integration-testerSKILL.md to add a "structural validation" phase after health checks:Hardcoded Fallback (No Model Available)
When no dominator model has been built yet (first deployments), use a static checklist derived from git-ape's documented mandatory stages:
{ "essential_states_fallback": [ "requirements_gathered", "template_generated", "security_gate_evaluated", "preflight_validated", "deployment_executed", "integration_tests_passed" ] }This ensures false-success detection works from day 1, even before golden traces are collected.
PR Comment Output
On process violation:
Alerting / Escalation
When a process violation is detected:
::warning::so it shows in the Actions UIPrevention is the security gate's job (blocking, pre-deployment). False-success detection is a post-hoc safety net that catches when prevention was circumvented.
Acceptance Criteria
azure-integration-testerSKILL.md updated with Phase 2 (structural validation)::warning::) on process violation.github/evals/website/docs/skills/azure-integration-tester.md(new section)Non-Goals (This Issue)
References
.github/skills/azure-integration-tester/(to be extended).github/agents/git-ape.agent.md(mandatory checkpoints)