feat: False-success detection in post-deployment validation

## Summary

Enhance the `azure-integration-tester` skill to detect **false successes** — cases where the agent reports "deployment successful" but the execution trace reveals that mandatory checkpoints were skipped or never evaluated. Based on findings from [arXiv:2605.03159](https://arxiv.org/abs/2605.03159).

**Depends on:** #148 (structured trace capture), #149 (dominator-tree validation skill)

## Why This Matters

### The Problem

Autonomous agents frequently misreport their own success. The paper quantifies this precisely:

> *"The CUA frequently misreported failures as successes, often due to timing out or misinterpreting its own state, achieving only 82.2% accuracy and 60.0% recall. In contrast, the dominator tree achieved perfect differentiation by focusing on whether essential milestones were actually reached rather than relying on the agent's internal assessment."*

In git-ape's context, this manifests as:
- Agent says "✅ Deployment complete" but never ran the security gate (prompt regression, context overflow, or model degradation)
- Agent says "✅ All checks passed" but the integration tester was never invoked
- Agent encounters an error mid-flow, recovers by skipping ahead, and reports success on the final step

The existing `azure-integration-tester` skill checks resource health (HTTP 200, resource exists) but does **not** verify that the *process* was followed correctly. A resource can be healthy AND insecurely deployed if the security gate was skipped.

### The Research Finding

The paper's most impactful result for CI/CD is the ability to distinguish failure *root causes*:

> *"The most significant impact for developers is in reducing false alarms. When a test fails, high-signal feedback is needed to determine if the product code is broken or if the agent simply stumbled due to environmental noise. [...] The CUA's internal self-assessment was completely unable to identify 'not a bug' scenarios (0% F1-score)."*

Applied to git-ape:
- **Product bug** = Azure deployment actually failed (resource unhealthy)
- **Agent issue** = Agent skipped steps but resources happen to work (process violation)
- **False success** = Agent claimed success, resources look fine, but security/compliance gates were never evaluated

Categories 2 and 3 are invisible today. This feature makes them visible.

### Value to Customers

- **Security teams:** Proof that every deployment passed the security gate — not just the agent's word for it
- **Platform engineers:** Alert when agent behavior degresses silently (prompt changes causing step-skipping)
- **Compliance:** Detect deployments where policy assessment was supposed to run but didn't
- **CI reliability:** Distinguish "the agent messed up" from "the deployment is actually broken" — reduces wasted investigation time

## Design

### Detection Layers

The false-success detector runs as a **post-deployment validation step**, after resource health checks pass:

```
┌────────────────────────────────┐
│  azure-integration-tester      │
│  (existing: resource health)   │
│  HTTP 200? Resource exists?    │
└──────────────┬─────────────────┘
               │ Health: ✅
               ▼
┌────────────────────────────────┐
│  False-Success Detector (NEW)  │
│  Structural trace validation   │
│  Were all gates actually hit?  │
└──────────────┬─────────────────┘
               │
               ▼
       ┌───────────────┐
       │  Final Verdict │
       └───────────────┘
```

### Detection Logic

```
INPUT:  trace.jsonl + dominator-model.json + integration test results
OUTPUT: One of four verdicts:

1. ✅ VERIFIED SUCCESS
   - Resources healthy AND trace covers 100% essential states
   - "Deployment is correct and process was followed."

2. ⚠️ PROCESS VIOLATION (resources healthy, trace incomplete)
   - Resources healthy BUT essential states missing from trace
   - "Resources deployed successfully, but the security gate was never evaluated.
     This deployment may not meet security requirements."

3. ❌ DEPLOYMENT FAILURE (resources unhealthy)
   - Integration tests failed — actual infrastructure problem
   - "Deployment failed: Function App returned HTTP 503."

4. ⚠️ AGENT ISSUE (trace shows abnormal flow)
   - Trace contains error states, retries, or out-of-order transitions
   - "Agent encountered errors during execution. Resources may be in inconsistent state."
```

### Classification Heuristics

| Signal | Indicates |
|--------|-----------|
| `security_gate_evaluated` missing from trace | Process violation — gate skipped |
| `deployment_executed` present but `preflight_validated` missing | Deployed without validation |
| Trace has < N events (suspiciously short) | Agent may have crashed/restarted |
| `state: "error"` events in trace followed by jump to `deployment_executed` | Agent recovered by skipping |
| Terminal state doesn't match expected (`integration_tests_passed`) | Incomplete workflow |

### Integration with Existing Skill

Extend `azure-integration-tester` SKILL.md to add a "structural validation" phase after health checks:

```markdown
## Phase 2: Structural Validation (NEW)

After resource health passes, validate the execution trace:

1. Load `trace.jsonl` from the deployment directory
2. Load the dominator model (`.azure/baselines/default/dominator-model.json`)
   - If no model exists, fall back to the hardcoded essential-state checklist
3. Run topological subsequence matching (invoke `trace-validator` skill or inline)
4. Classify the result using the 4-verdict system above
5. Include classification in the PR comment / integration test output
```

### Hardcoded Fallback (No Model Available)

When no dominator model has been built yet (first deployments), use a static checklist derived from git-ape's documented mandatory stages:

```json
{
  "essential_states_fallback": [
    "requirements_gathered",
    "template_generated",
    "security_gate_evaluated",
    "preflight_validated",
    "deployment_executed",
    "integration_tests_passed"
  ]
}
```

This ensures false-success detection works from day 1, even before golden traces are collected.

### PR Comment Output

On process violation:

```markdown
## 🧪 Post-Deployment Validation

### Resource Health
| Resource | Status | Check |
|----------|:------:|-------|
| func-api-dev-eastus | ✅ | HTTP 200, runtime healthy |
| st-api-dev-eastus | ✅ | Blob endpoint reachable |

### Process Integrity
| Verdict | ⚠️ PROCESS VIOLATION |
|---------|---------------------|
| Coverage | 5/7 essential states (71%) |
| Missing | `security_gate_evaluated`, `preflight_validated` |

> ⚠️ **Resources are healthy, but the deployment process was incomplete.**
> The security gate and preflight validation were never executed during this run.
> This may indicate a prompt regression or context overflow in the agent.
>
> **Action required:** Re-run security analysis on the deployed template, or
> trigger a redeployment through the standard workflow.
```

### Alerting / Escalation

When a process violation is detected:

1. **PR comment** — always (visible to reviewer)
2. **Workflow annotation** — `::warning::` so it shows in the Actions UI
3. **Optional: GitHub issue** — auto-create if configured (for tracking process regressions over time)
4. **Never block deployment retroactively** — resources are already deployed; this is detection, not prevention

Prevention is the security gate's job (blocking, pre-deployment). False-success detection is a post-hoc safety net that catches when prevention was circumvented.

## Acceptance Criteria

- [ ] `azure-integration-tester` SKILL.md updated with Phase 2 (structural validation)
- [ ] 4-verdict classification logic implemented
- [ ] Hardcoded fallback checklist for when no dominator model exists
- [ ] PR comment format includes "Process Integrity" section
- [ ] Workflow annotation (`::warning::`) on process violation
- [ ] Test fixtures: traces for each of the 4 verdicts in `.github/evals/`
- [ ] Works without #149 model (uses fallback checklist) — graceful degradation
- [ ] Docs updated: `website/docs/skills/azure-integration-tester.md` (new section)

## Non-Goals (This Issue)

- Blocking deployments based on trace validation (that's the security gate's job, pre-deployment)
- Auto-remediation (re-running skipped gates)
- Historical trend analysis ("this agent has been skipping gates more often lately")
- Alert routing to external systems (PagerDuty, Slack)

## References

- Paper: [arXiv:2605.03159](https://arxiv.org/abs/2605.03159) — Section 5.3 (Results: false success detection) and Section 6 (Discussion: CI/CD application)
- Dependencies: #148 (trace capture), #149 (dominator validation skill)
- Existing skill: `.github/skills/azure-integration-tester/` (to be extended)
- Agent workflow: `.github/agents/git-ape.agent.md` (mandatory checkpoints)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: False-success detection in post-deployment validation #150

Summary

Why This Matters

The Problem

The Research Finding

Value to Customers

Design

Detection Layers

Detection Logic

Classification Heuristics

Integration with Existing Skill

Hardcoded Fallback (No Model Available)

PR Comment Output

Alerting / Escalation

Acceptance Criteria

Non-Goals (This Issue)

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Signal	Indicates
`security_gate_evaluated` missing from trace	Process violation — gate skipped
`deployment_executed` present but `preflight_validated` missing	Deployed without validation
Trace has < N events (suspiciously short)	Agent may have crashed/restarted
`state: "error"` events in trace followed by jump to `deployment_executed`	Agent recovered by skipping
Terminal state doesn't match expected (`integration_tests_passed`)	Incomplete workflow

feat: False-success detection in post-deployment validation #150

Description

Summary

Why This Matters

The Problem

The Research Finding

Value to Customers

Design

Detection Layers

Detection Logic

Classification Heuristics

Integration with Existing Skill

Hardcoded Fallback (No Model Available)

PR Comment Output

Alerting / Escalation

Acceptance Criteria

Non-Goals (This Issue)

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions