Skip to content

feat: False-success detection in post-deployment validation #150

@suuus

Description

@suuus

Summary

Enhance the azure-integration-tester skill to detect false successes — cases where the agent reports "deployment successful" but the execution trace reveals that mandatory checkpoints were skipped or never evaluated. Based on findings from arXiv:2605.03159.

Depends on: #148 (structured trace capture), #149 (dominator-tree validation skill)

Why This Matters

The Problem

Autonomous agents frequently misreport their own success. The paper quantifies this precisely:

"The CUA frequently misreported failures as successes, often due to timing out or misinterpreting its own state, achieving only 82.2% accuracy and 60.0% recall. In contrast, the dominator tree achieved perfect differentiation by focusing on whether essential milestones were actually reached rather than relying on the agent's internal assessment."

In git-ape's context, this manifests as:

  • Agent says "✅ Deployment complete" but never ran the security gate (prompt regression, context overflow, or model degradation)
  • Agent says "✅ All checks passed" but the integration tester was never invoked
  • Agent encounters an error mid-flow, recovers by skipping ahead, and reports success on the final step

The existing azure-integration-tester skill checks resource health (HTTP 200, resource exists) but does not verify that the process was followed correctly. A resource can be healthy AND insecurely deployed if the security gate was skipped.

The Research Finding

The paper's most impactful result for CI/CD is the ability to distinguish failure root causes:

"The most significant impact for developers is in reducing false alarms. When a test fails, high-signal feedback is needed to determine if the product code is broken or if the agent simply stumbled due to environmental noise. [...] The CUA's internal self-assessment was completely unable to identify 'not a bug' scenarios (0% F1-score)."

Applied to git-ape:

  • Product bug = Azure deployment actually failed (resource unhealthy)
  • Agent issue = Agent skipped steps but resources happen to work (process violation)
  • False success = Agent claimed success, resources look fine, but security/compliance gates were never evaluated

Categories 2 and 3 are invisible today. This feature makes them visible.

Value to Customers

  • Security teams: Proof that every deployment passed the security gate — not just the agent's word for it
  • Platform engineers: Alert when agent behavior degresses silently (prompt changes causing step-skipping)
  • Compliance: Detect deployments where policy assessment was supposed to run but didn't
  • CI reliability: Distinguish "the agent messed up" from "the deployment is actually broken" — reduces wasted investigation time

Design

Detection Layers

The false-success detector runs as a post-deployment validation step, after resource health checks pass:

┌────────────────────────────────┐
│  azure-integration-tester      │
│  (existing: resource health)   │
│  HTTP 200? Resource exists?    │
└──────────────┬─────────────────┘
               │ Health: ✅
               ▼
┌────────────────────────────────┐
│  False-Success Detector (NEW)  │
│  Structural trace validation   │
│  Were all gates actually hit?  │
└──────────────┬─────────────────┘
               │
               ▼
       ┌───────────────┐
       │  Final Verdict │
       └───────────────┘

Detection Logic

INPUT:  trace.jsonl + dominator-model.json + integration test results
OUTPUT: One of four verdicts:

1. ✅ VERIFIED SUCCESS
   - Resources healthy AND trace covers 100% essential states
   - "Deployment is correct and process was followed."

2. ⚠️ PROCESS VIOLATION (resources healthy, trace incomplete)
   - Resources healthy BUT essential states missing from trace
   - "Resources deployed successfully, but the security gate was never evaluated.
     This deployment may not meet security requirements."

3. ❌ DEPLOYMENT FAILURE (resources unhealthy)
   - Integration tests failed — actual infrastructure problem
   - "Deployment failed: Function App returned HTTP 503."

4. ⚠️ AGENT ISSUE (trace shows abnormal flow)
   - Trace contains error states, retries, or out-of-order transitions
   - "Agent encountered errors during execution. Resources may be in inconsistent state."

Classification Heuristics

Signal Indicates
security_gate_evaluated missing from trace Process violation — gate skipped
deployment_executed present but preflight_validated missing Deployed without validation
Trace has < N events (suspiciously short) Agent may have crashed/restarted
state: "error" events in trace followed by jump to deployment_executed Agent recovered by skipping
Terminal state doesn't match expected (integration_tests_passed) Incomplete workflow

Integration with Existing Skill

Extend azure-integration-tester SKILL.md to add a "structural validation" phase after health checks:

## Phase 2: Structural Validation (NEW)

After resource health passes, validate the execution trace:

1. Load `trace.jsonl` from the deployment directory
2. Load the dominator model (`.azure/baselines/default/dominator-model.json`)
   - If no model exists, fall back to the hardcoded essential-state checklist
3. Run topological subsequence matching (invoke `trace-validator` skill or inline)
4. Classify the result using the 4-verdict system above
5. Include classification in the PR comment / integration test output

Hardcoded Fallback (No Model Available)

When no dominator model has been built yet (first deployments), use a static checklist derived from git-ape's documented mandatory stages:

{
  "essential_states_fallback": [
    "requirements_gathered",
    "template_generated",
    "security_gate_evaluated",
    "preflight_validated",
    "deployment_executed",
    "integration_tests_passed"
  ]
}

This ensures false-success detection works from day 1, even before golden traces are collected.

PR Comment Output

On process violation:

## 🧪 Post-Deployment Validation

### Resource Health
| Resource | Status | Check |
|----------|:------:|-------|
| func-api-dev-eastus || HTTP 200, runtime healthy |
| st-api-dev-eastus || Blob endpoint reachable |

### Process Integrity
| Verdict | ⚠️ PROCESS VIOLATION |
|---------|---------------------|
| Coverage | 5/7 essential states (71%) |
| Missing | `security_gate_evaluated`, `preflight_validated` |

> ⚠️ **Resources are healthy, but the deployment process was incomplete.**
> The security gate and preflight validation were never executed during this run.
> This may indicate a prompt regression or context overflow in the agent.
>
> **Action required:** Re-run security analysis on the deployed template, or
> trigger a redeployment through the standard workflow.

Alerting / Escalation

When a process violation is detected:

  1. PR comment — always (visible to reviewer)
  2. Workflow annotation::warning:: so it shows in the Actions UI
  3. Optional: GitHub issue — auto-create if configured (for tracking process regressions over time)
  4. Never block deployment retroactively — resources are already deployed; this is detection, not prevention

Prevention is the security gate's job (blocking, pre-deployment). False-success detection is a post-hoc safety net that catches when prevention was circumvented.

Acceptance Criteria

  • azure-integration-tester SKILL.md updated with Phase 2 (structural validation)
  • 4-verdict classification logic implemented
  • Hardcoded fallback checklist for when no dominator model exists
  • PR comment format includes "Process Integrity" section
  • Workflow annotation (::warning::) on process violation
  • Test fixtures: traces for each of the 4 verdicts in .github/evals/
  • Works without feat: Dominator-tree validation skill for deployment workflows #149 model (uses fallback checklist) — graceful degradation
  • Docs updated: website/docs/skills/azure-integration-tester.md (new section)

Non-Goals (This Issue)

  • Blocking deployments based on trace validation (that's the security gate's job, pre-deployment)
  • Auto-remediation (re-running skipped gates)
  • Historical trend analysis ("this agent has been skipping gates more often lately")
  • Alert routing to external systems (PagerDuty, Slack)

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions