feat: Dominator-tree validation skill for deployment workflows #149

@suuus

Summary

Ship a customer-facing trace-validator skill that learns correct deployment behavior from a small number of golden execution traces and structurally validates all future runs against that model. Based on dominator analysis from arXiv:2605.03159.

Depends on: #148 (structured trace capture)

Why This Matters

The Problem

Git-ape orchestrates complex multi-stage deployment workflows. Customers need confidence that:

  1. Every mandatory checkpoint was actually executed (not just claimed by the agent)
  2. The workflow followed a valid path through the required stages
  3. Acceptable variations (optional cost estimation, architecture review) don't trigger false alarms

Today this is enforced purely by agent instructions and post-hoc integration tests. If the agent skips a step (or a prompt regression causes it to), there's no structural safety net.

The Research

The paper demonstrates that dominator analysis — borrowed from compiler control-flow theory — cleanly separates essential states from optional variations:

"A state d dominates state s if every path from the initial state to s must pass through d. [...] Loading screens do not dominate anything because they are optional."

Applied to git-ape: security_gate_evaluated dominates deployment_executed (you can't deploy without passing the gate), but cost_estimated does not (cost estimation is advisory).

The algorithm needs only 2–10 passing traces to learn this structure automatically:

"Using a model built from only three passing traces, the system achieved 100% accuracy in detecting product bugs and successfully identified false successes while tolerating valid UI variations."

Value to Customers

  • Regulated industries: Machine-verifiable proof that compliance gates were executed
  • Platform teams: Structural guardrails on what Copilot agents can skip
  • CI/CD pipelines: PR comments showing "5/5 essential states hit" or "⚠️ missing: security_gate_evaluated"
  • Multi-team orgs: Golden baselines per environment (prod requires all gates; dev allows shortcuts)

Design

Skill Interface

Name: trace-validator
Triggers: "validate deployment trace", "check if run was complete", post-deployment
Input: deployment-id (or path to trace.jsonl)
Output: PASS/FAIL with coverage metrics, matched/missing states, explanation

Algorithm (3 Phases from the Paper)

Phase 1: Build Prefix Tree Acceptors

Each golden trace (trace.jsonl) becomes a PTA — a directed graph where nodes = states, edges = transitions.

Golden trace 1: requirements → api_ref → template → security_gate → preflight → confirm → deploy → integration_tests
Golden trace 2: requirements → api_ref → template → security_gate → cost_est → preflight → confirm → deploy → integration_tests
Golden trace 3: requirements → api_ref → template → security_gate → policy → preflight → confirm → deploy → integration_tests
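As a rough illustration of Phase 1, the sketch below parses the text of a trace.jsonl file into a state sequence and turns it into a single-path graph. It assumes every line is a JSON object with a `state` field. The function names `parseTrace` and `buildPta`, and the `timestamp` values in the sample, are hypothetical and not taken from the #148 trace format.

```javascript
// Parse JSONL text into an ordered list of state names.
// Assumption: every non-empty line is a JSON object with a `state` field.
function parseTrace(jsonlText) {
  return jsonlText
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line).state);
}

// Turn one state sequence into a PTA: a single path whose nodes are states
// and whose edges are the observed transitions.
function buildPta(states) {
  const edges = [];
  for (let i = 0; i + 1 < states.length; i++) {
    edges.push([states[i], states[i + 1]]);
  }
  return {
    initial: states[0],
    terminal: states[states.length - 1],
    nodes: [...new Set(states)],
    edges,
  };
}

// Hypothetical three-event trace, shortened for illustration.
const sampleJsonl = [
  '{"state": "requirements", "timestamp": "08:30:00"}',
  '{"state": "api_ref", "timestamp": "08:31:12"}',
  '{"state": "template", "timestamp": "08:32:45"}',
].join("\n");

const pta = buildPta(parseTrace(sampleJsonl));
console.log(pta.edges);
```

Keeping the parser separate from the graph builder means the same `buildPta` could later accept state sequences from other trace formats.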

Phase 2: Merge + Dominator Extraction

  1. Merge PTAs into a unified graph using state equivalence (exact match on state field for structured traces — no LLM needed for JSON states)
  2. Compute dominators using the Lengauer-Tarjan algorithm
  3. Extract dominator tree by tracing back from terminal states

Result for the example above:

Dominator tree (essential states):
  requirements_gathered
    → api_reference_lookup
      → template_generated
        → security_gate_evaluated
          → preflight_validated
            → user_confirmation
              → deployment_executed
                → integration_tests_passed

Optional (not in dominator tree):
  cost_estimated, policy_assessed, architecture_reviewed
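The merge and dominator steps can be sketched as follows. This is a minimal illustration, not the proposed implementation: it swaps Lengauer-Tarjan for the simpler iterative dataflow formulation, which yields the same dominator sets but is slower on large graphs. It assumes every golden trace starts and ends in the same state; `mergeTraces`, `computeDominators`, and `essentialStates` are hypothetical names.

```javascript
// Merge golden traces into one graph: states with the same name share a node.
// Returns a map from each state to the set of its predecessor states.
function mergeTraces(traces) {
  const preds = new Map();
  for (const states of traces) {
    states.forEach((state, i) => {
      if (!preds.has(state)) preds.set(state, new Set());
      if (i > 0) preds.get(state).add(states[i - 1]);
    });
  }
  return preds;
}

// Iterative dominator computation (not Lengauer-Tarjan):
// dom(initial) = {initial}; dom(n) = {n} plus the intersection of dom(p) over predecessors p.
function computeDominators(preds, initial) {
  const all = [...preds.keys()];
  const dom = new Map();
  for (const n of all) dom.set(n, new Set(n === initial ? [initial] : all));
  let changed = true;
  while (changed) {
    changed = false;
    for (const n of all) {
      if (n === initial) continue;
      let next = null;
      for (const p of preds.get(n)) {
        const pd = dom.get(p);
        next = next === null ? new Set(pd) : new Set([...next].filter((s) => pd.has(s)));
      }
      next = next ?? new Set();
      next.add(n);
      // Sets only ever shrink, so a size change is enough to detect an update.
      if (next.size !== dom.get(n).size) {
        dom.set(n, next);
        changed = true;
      }
    }
  }
  return dom;
}

// Essential states are the dominators of the terminal state. A state that
// dominates another has a strictly smaller dominator set, so sorting by set
// size recovers the order along the dominator chain.
function essentialStates(traces) {
  const initial = traces[0][0];
  const terminal = traces[0][traces[0].length - 1];
  const dom = computeDominators(mergeTraces(traces), initial);
  return [...dom.get(terminal)].sort((a, b) => dom.get(a).size - dom.get(b).size);
}

const head = ["requirements", "api_ref", "template", "security_gate"];
const tail = ["preflight", "confirm", "deploy", "integration_tests"];
const goldenTraces = [
  [...head, ...tail],
  [...head, "cost_est", ...tail],
  [...head, "policy", ...tail],
];
console.log(essentialStates(goldenTraces));
```

On the three golden traces above, `cost_est` and `policy` drop out because `preflight` is also reachable directly from `security_gate`, so neither lies on every path.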

Phase 3: Validate New Traces

Topological subsequence matching:

  • Extract state sequence from the new trace.jsonl
  • Check that all dominator-tree states appear in the correct topological order
  • Extra states (optional) are allowed between essential states
  • Compute coverage = matched / total_essential × 100%
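The matching steps above can be sketched as a greedy in-order scan. This is an illustrative sketch under stated assumptions: `validateTrace` and its return shape are hypothetical, and an essential state that appears out of order is simply reported as missing.

```javascript
// Greedy ordered-subsequence match: walk the essential states in order and
// look for each one at or after the position of the previous match.
function validateTrace(observed, essential, threshold = 100) {
  const matched = [];
  const missing = [];
  let cursor = 0; // next position in the observed trace to search from
  for (const state of essential) {
    const index = observed.indexOf(state, cursor);
    if (index === -1) {
      missing.push(state); // absent, or present only before the cursor
    } else {
      matched.push(state);
      cursor = index + 1;
    }
  }
  const coverage =
    essential.length === 0 ? 100 : (matched.length / essential.length) * 100;
  // Anything not in the essential list is tolerated as an optional state.
  const optional = observed.filter((state) => !essential.includes(state));
  return {
    status: coverage >= threshold ? "PASS" : "FAIL",
    coverage,
    matched,
    missing,
    optional,
  };
}

const essential = [
  "requirements", "api_ref", "template", "security_gate",
  "preflight", "confirm", "deploy", "integration_tests",
];
const goodRun = [
  "requirements", "api_ref", "template", "security_gate", "cost_est",
  "preflight", "confirm", "deploy", "integration_tests",
];
// Hypothetical bad run: the agent skipped the gate and preflight.
const badRun = ["requirements", "api_ref", "template", "confirm", "deploy", "integration_tests"];

console.log(validateTrace(goodRun, essential).status); // PASS
console.log(validateTrace(badRun, essential).coverage); // 75
```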

State Equivalence Strategy

For git-ape traces (structured JSON, not screenshots), we use a simplified tiered approach:

| Tier | Method | When |
|------|--------|------|
| 1 | Exact match on state field | Default: handles 95% of cases |
| 2 | Normalized match (ignore timestamp, meta variations) | Same state ID with different metadata |
| 3 | LLM semantic equivalence | Future: for free-form trace states from other agent systems |

This is much simpler than the paper's visual equivalence (perceptual hash → SSIM → LLM) because our traces are structured.

File Layout

.azure/baselines/
├── default/                          # Default baseline (all environments)
│   ├── golden-traces/
│   │   ├── trace-001.jsonl
│   │   ├── trace-002.jsonl
│   │   └── trace-003.jsonl
│   └── dominator-model.json          # Auto-generated from golden traces
├── prod/                             # Stricter baseline for prod
│   ├── golden-traces/
│   │   └── ...
│   └── dominator-model.json
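One way a validator might pick a baseline from this layout is sketched below. Falling back to `default/` when an environment has no directory of its own is an assumption; the layout only states that `default/` covers all environments and `prod/` is stricter. `resolveModelPath` and the injected `exists` callback are hypothetical.

```javascript
// Resolve the dominator model path for an environment.
// `exists` is injected so the lookup can be exercised without touching disk;
// in a real script it could be fs.existsSync.
function resolveModelPath(environment, exists, root = ".azure/baselines") {
  const candidate = `${root}/${environment}/dominator-model.json`;
  if (exists(candidate)) return candidate;
  // Assumed fallback: use the default baseline when no override exists.
  return `${root}/default/dominator-model.json`;
}

// Simulated file system containing only the two baselines from the layout.
const knownFiles = new Set([
  ".azure/baselines/default/dominator-model.json",
  ".azure/baselines/prod/dominator-model.json",
]);
const exists = (path) => knownFiles.has(path);

console.log(resolveModelPath("prod", exists)); // .azure/baselines/prod/dominator-model.json
console.log(resolveModelPath("dev", exists)); // .azure/baselines/default/dominator-model.json
```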

Dominator Model Schema

// .azure/baselines/default/dominator-model.json
{
  "version": "1.0",
  "generated_from": ["trace-001.jsonl", "trace-002.jsonl", "trace-003.jsonl"],
  "generated_at": "2025-06-01T09:00:00Z",
  "essential_states": [
    { "id": "requirements_gathered", "stage": 1, "dominates": ["api_reference_lookup"] },
    { "id": "api_reference_lookup", "stage": 2, "dominates": ["template_generated"] },
    { "id": "template_generated", "stage": 2, "dominates": ["security_gate_evaluated"] },
    { "id": "security_gate_evaluated", "stage": 2.5, "dominates": ["preflight_validated"] },
    { "id": "preflight_validated", "stage": 2, "dominates": ["user_confirmation"] },
    { "id": "user_confirmation", "stage": 3, "dominates": ["deployment_executed"] },
    { "id": "deployment_executed", "stage": 3, "dominates": ["integration_tests_passed"] },
    { "id": "integration_tests_passed", "stage": 4, "dominates": [] }
  ],
  "optional_states": ["cost_estimated", "policy_assessed", "architecture_reviewed", "drift_checked"],
  "coverage_threshold": 100
}
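Until a formal schema file exists, a lightweight shape check along these lines could catch malformed models before validation runs. It is a sketch only: `checkModel` is a hypothetical helper, and the specific rules (non-empty `essential_states`, `dominates` pointing at known ids, threshold within 0 to 100) are assumptions inferred from the example above rather than a stated specification.

```javascript
// Return a list of human-readable problems; an empty list means the model
// looks structurally sound under the assumed rules.
function checkModel(model) {
  const errors = [];
  if (typeof model.version !== "string") {
    errors.push("version must be a string");
  }
  if (!Array.isArray(model.essential_states) || model.essential_states.length === 0) {
    errors.push("essential_states must be a non-empty array");
    return errors; // later checks need the list of ids
  }
  const ids = new Set(model.essential_states.map((s) => s.id));
  for (const s of model.essential_states) {
    for (const target of s.dominates ?? []) {
      if (!ids.has(target)) {
        errors.push(`${s.id} dominates unknown state ${target}`);
      }
    }
  }
  for (const id of model.optional_states ?? []) {
    if (ids.has(id)) {
      errors.push(`${id} is listed as both essential and optional`);
    }
  }
  const t = model.coverage_threshold;
  if (typeof t !== "number" || t < 0 || t > 100) {
    errors.push("coverage_threshold must be a number between 0 and 100");
  }
  return errors;
}

// Trimmed-down model for illustration.
const exampleModel = {
  version: "1.0",
  essential_states: [
    { id: "security_gate_evaluated", dominates: ["deployment_executed"] },
    { id: "deployment_executed", dominates: [] },
  ],
  optional_states: ["cost_estimated"],
  coverage_threshold: 100,
};
console.log(checkModel(exampleModel)); // []
```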

Skill Implementation

.github/skills/trace-validator/
├── SKILL.md                    # Skill definition (frontmatter + instructions)
├── scripts/
│   ├── build-model.sh          # Ingest golden traces → generate dominator-model.json
│   └── validate-trace.sh       # Validate a trace.jsonl against the model
└── references/
    └── algorithm.md            # Dominator extraction explained for the agent

The core algorithm (build-model and validate-trace) should be implemented as a Node.js script (matching the repo's existing scripts/ tooling) or as a standalone shell-based implementation using jq for JSON processing.

Integration with Deploy Workflow

In git-ape-deploy.yml (or .exampleyml), add a post-deployment step:

- name: Validate execution trace
  if: hashFiles(format('.azure/deployments/{0}/trace.jsonl', env.DEPLOYMENT_ID)) != ''
  run: |
    node .github/skills/trace-validator/scripts/validate-trace.js \
      --trace ".azure/deployments/${{ env.DEPLOYMENT_ID }}/trace.jsonl" \
      --model ".azure/baselines/default/dominator-model.json" \
      --threshold 100

PR Comment Output

## 🔍 Trace Validation

**Status:** 🟢 PASSED (7/7 essential states matched)

| # | Essential State | Status | Timestamp |
|---|----------------|:------:|-----------|
| 1 | requirements_gathered | ✅ | 08:30:00 |
| 2 | api_reference_lookup | ✅ | 08:31:12 |
| 3 | template_generated | ✅ | 08:32:45 |
| 4 | security_gate_evaluated | ✅ | 08:33:10 |
| 5 | preflight_validated | ✅ | 08:34:00 |
| 6 | deployment_executed | ✅ | 08:36:30 |
| 7 | integration_tests_passed | ✅ | 08:37:15 |

**Coverage:** 100% | **Optional states observed:** cost_estimated, policy_assessed
**Model:** `.azure/baselines/default/dominator-model.json` (built from 3 traces)

Or on failure:

## 🔍 Trace Validation

**Status:** 🔴 FAILED (5/7 essential states matched)

| # | Essential State | Status |
|---|----------------|:------:|
| 1 | requirements_gathered | ✅ |
| 2 | api_reference_lookup | ✅ |
| 3 | template_generated | ✅ |
| 4 | security_gate_evaluated | ❌ MISSING |
| 5 | preflight_validated | ❌ MISSING |
| 6 | deployment_executed | ✅ |
| 7 | integration_tests_passed | ✅ |

⚠️ **The agent skipped the security gate and preflight validation.**
This deployment may not meet security requirements.

**Coverage:** 71% (below 100% threshold)
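A comment like the ones above could be rendered from a validation result roughly as follows. This is a sketch, not the proposed renderer: `renderComment` is a hypothetical name, the ✅ marker for matched states is an assumption, and timestamps, optional states, and the model line are omitted for brevity.

```javascript
// Render a Markdown PR comment from the essential-state list and the subset
// that was matched. Assumes `matched` contains only states from `essential`.
function renderComment(essential, matched) {
  const hit = new Set(matched);
  const coverage = Math.round((hit.size / essential.length) * 100);
  const passed = hit.size === essential.length;
  const rows = essential.map(
    (state, i) => `| ${i + 1} | ${state} | ${hit.has(state) ? "✅" : "❌ MISSING"} |`
  );
  return [
    "## 🔍 Trace Validation",
    "",
    `**Status:** ${passed ? "🟢 PASSED" : "🔴 FAILED"} (${hit.size}/${essential.length} essential states matched)`,
    "",
    "| # | Essential State | Status |",
    "|---|----------------|:------:|",
    ...rows,
    "",
    `**Coverage:** ${coverage}%`,
  ].join("\n");
}

const essentialList = [
  "requirements_gathered", "api_reference_lookup", "template_generated",
  "security_gate_evaluated", "preflight_validated",
  "deployment_executed", "integration_tests_passed",
];
const matchedList = essentialList.filter(
  (s) => s !== "security_gate_evaluated" && s !== "preflight_validated"
);
console.log(renderComment(essentialList, matchedList));
```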

Customer Workflow

  1. Bootstrap: Run 3–5 deployments successfully → traces auto-captured (feat: Structured execution trace capture for agent workflows #148)
  2. Build model: trace-validator --build --traces .azure/baselines/default/golden-traces/
  3. Validate ongoing: Every deployment auto-validated in CI, or invoke interactively with "validate my last deployment trace"

Acceptance Criteria

  • SKILL.md created with correct frontmatter, triggers, and instructions
  • build-model script: ingests N trace files → produces dominator-model.json
  • validate-trace script: validates a trace against a model → outputs PASS/FAIL + coverage
  • Schema file for dominator-model.json in .github/schemas/
  • Integration example in deploy workflow (.exampleyml)
  • Fixture: 3 golden traces + expected model in .github/evals/trace-validator/
  • Docs page: website/docs/skills/trace-validator.md
  • Works in both interactive (skill invocation) and CI (workflow step) modes

Non-Goals (This Issue)

References

  • Paper: arXiv:2605.03159 — Sections 3.2 (Merge + Dominator Extraction) and 3.3 (Validation)
  • Dominator algorithm: Lengauer & Tarjan (1979) — "A fast algorithm for finding dominators in a flowgraph"
  • Dependency: feat: Structured execution trace capture for agent workflows #148 (trace capture)
  • Agent workflow stages: .github/agents/git-ape.agent.md
  • Existing skill pattern: .github/skills/prereq-check/SKILL.md
