216 changes: 216 additions & 0 deletions docs/validation-report.md
@@ -0,0 +1,216 @@
# Spec: HTML validation report

- Status: Draft
- Scope: A self-contained HTML report (plus a machine-readable `report.json`) emitted at the end of every agent run, summarizing all validation layers + the reviewer with embedded screenshots.
- Relationship to other docs: implements the "HTML validation report" idea (highest-ROI visual surface) without building a native GUI or a hosted web product. Aligns with `MONETIZATION.md` — a thin static artifact over the existing `dispatch()` core, not a new product surface.

---

## 1. Why

The pipeline already produces rich, visual validation evidence — leftover-token findings, per-platform build commands/durations, home-screen screenshots, vision-judge rubric scores with rationales, and a contract-parity diff. Today `index.ts` collapses all of it to two lines:

```
result: Layer 1 3/3 pass · Layer 2 3/3 pass · Layer 3 2/2 pass · reviewer PASS
overall: PASS
```

Everything else is written to `tmp/trace/*.log` (unstructured) and `tmp/screenshots/*.png` (ephemeral, cleaned). The single most valuable, most *visual* output of the agent is invisible unless you tail log files.

The report turns that evidence into one portable artifact you can open, attach to a PR, drop into a demo video, or hand to a non-technical buyer.

**Non-goals:** not interactive, no server, no JS framework, no live re-run, no hosting. That is the Phase-4 web product. This is a static file.

---

## 2. The data gap (read this first — it's the real work)

The report is *mostly* a rendering problem, but there's a structural blocker: **the per-layer detail is computed and then thrown away.**

`runJudge` (`src/agents/judge.ts`) builds a per-platform `PlatformReport` inside `evaluate()` — it has `layer1.findings`, `layer2.command`, `layer2.durationMs`, `layer2.stderrTail`, etc. — but the returned `JudgeResult` keeps only:

```ts
export type JudgeResult = {
  overallPass: boolean;
  summary: string; // a single formatted string
  visual?: VisualJudgeReport; // the only structured survivor
};
```

So `Layer1Finding[]`, `Layer2Result.stderrTail`, the reviewer's `diffs`, the layer-2 command/duration, and the domain rename plan never reach the caller. **The report cannot be a pure post-process of today's `JudgeResult`.** Step 1 of implementation is widening what survives the run.

### Decision: emit a `report.json`, render HTML from it

Rather than thread more return values through callers ad hoc, introduce a single structured aggregate, `RunReport`, written to disk as `report.json`. The HTML renderer is a **pure function of `RunReport`** (`renderReport(report): string`). Benefits:

- HTML generator is decoupled, deterministic, and golden-file testable (no pipeline needed to test rendering).
- `report.json` is a free machine-readable artifact for CI gating, the MCP tool result, and the future web product.
- `dispatch()` is the natural assembler — it already holds `domain`, the three `WorkerResult`s, `reviewer`, the `JudgeResult`, and run timing.

This requires one upstream change: `runJudge` must return the structured per-platform/per-layer detail it already computes (see §5).

---

## 3. Output layout

Written to the run's output root (sibling to the three generated projects):

```
out/<slug>/
├── rails/ ios/ android/ # existing generated projects
├── report.json # machine-readable RunReport (always)
├── validation-report.html # self-contained, openable anywhere (always)
└── report-assets/ # only when --report-embed=false
    ├── ios-home.png android-home.png
    └── ios-step-3.png ...
```

**Default = single self-contained file.** Screenshots are base64-embedded and CSS is inlined, so `validation-report.html` is one portable file (emailable, attachable, survives `tmp/` cleanup). For ~4–8 PNGs this is ~1–3 MB — acceptable. `--report-embed=false` writes images to `report-assets/` and references them relatively, for size-sensitive cases.

Screenshots originate in `tmp/screenshots/{ios-home,android-home}.png` (Stage 1) and `tmp/screenshots/*-step-*.png` (Stage 2); the report collector reads them from the paths recorded in `VisualJudgePlatformReport.screenshotPath` / `Stage2PlatformReport.screenshots` and copies-or-embeds them. Never link directly into `tmp/` — it's ephemeral.
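
The embed-or-copy choice above can be sketched as follows. This is a minimal illustration, not the collector itself: `toDataUri` and `imageSrc` are hypothetical helper names, and the sketch assumes the PNG bytes have already been read from the recorded screenshot path.

```ts
import { Buffer } from "node:buffer";

/** Hypothetical helper: encode raw image bytes as a data URI usable in <img src>. */
export function toDataUri(bytes: Uint8Array, mime = "image/png"): string {
  return `data:${mime};base64,${Buffer.from(bytes).toString("base64")}`;
}

/**
 * Hypothetical helper: pick the value for an <img src> attribute.
 * embed=true  -> inline data URI (single portable file)
 * embed=false -> path relative to validation-report.html, inside report-assets/
 */
export function imageSrc(fileName: string, bytes: Uint8Array, embed: boolean): string {
  return embed ? toDataUri(bytes) : `report-assets/${fileName}`;
}
```

Either way the HTML never points into `tmp/`: the bytes are inlined, or the file is copied next to the report first.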

---

## 4. Report content (sections, in order)

1. **Header** — overall `PASS`/`FAIL` badge, spec text, `displayName`, slug, timestamp, agent version (from `package.json`), judge model (`claude-opus-4-7`), `NATIVEAPPTEMPLATE_VISUAL` level, total run duration.
2. **Gate strip** — four chips: Layer 1, Layer 2, Layer 3, Reviewer. The layer chips show `x/3` (`x/2` for Layer 3, which has no Rails row) and the Reviewer chip shows `PASS`/`FAIL`, each with pass/fail color. This is the existing `summary` string, made visual.
3. **Platform × Layer matrix** — the headline. Rows `rails / ios / android`, columns `Layer 1 / Layer 2 / Layer 3`, each cell a pass/fail/skip mark. (Rails has no Layer 3 — render as "n/a".)
4. **Layer 1 — Structural.** Per platform: pass + count. On failure, a findings table: `token | file:line | text excerpt`. Clean state shows an explicit "no leftover tokens" row.
5. **Layer 2 — Runtime.** Per platform: `command`, mode (`fast`/`build`), `exitCode`, `durationMs`. On failure, `stderrTail` in a native `<details>` (collapsed).
6. **Layer 3 — Semantic (vision judge).** Per platform (ios/android):
- **Stage 1:** home-screen screenshot thumbnail + rubric table (`question | pass/fail | rationale`). Note "median of 3 samples per criterion" (`DEFAULT_SAMPLES = 3`).
- **Stage 2** (when `VISUAL=2`): scenario name, `stepsPassed/stepCount`, a screenshot filmstrip (`screenshots[]`), and the post-toggle Layer 3 scores (`layer3Scores`).
- `error` rendered prominently when a platform's visual phase failed.
7. **Reviewer — Contract parity.** `pass`/`fail` + `diffs[]` in a collapsed `<details>` when non-empty.
8. **Domain spec appendix.** Rename plan table (`from → to`), entities and fields, so the reader sees *what was generated and why the tokens changed*.
9. **Footer / reproduce.** The exact command to reproduce (`npx nativeapptemplate-agent "<spec>"` + the `NATIVEAPPTEMPLATE_VISUAL` value), and a pointer to `tmp/trace/*.log` for raw logs.

**Forward-looking (optional, render only if present):** a **self-repair** section showing the ≤5 iteration history (CLAUDE.md cap) — which layer failed, what was attempted, the delta. The model carries `repairAttempts?` so the renderer is ready when the loop is wired.
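
As a rough sketch of how the gate strip (item 2) could be derived from per-platform results: the types below are simplified stand-ins, not the real `PlatformDetail`, and the `pass` field on the Layer 3 entry is an assumption about `VisualJudgePlatformReport`. `gateStrip` and `layerChip` are hypothetical names.

```ts
type Platform = "rails" | "ios" | "android";
type LayerOutcome = { pass: boolean }; // stand-in for the per-layer result
type PlatformRow = {
  platform: Platform;
  layer1: LayerOutcome;
  layer2: LayerOutcome;
  layer3?: LayerOutcome; // absent for rails
};
type Chip = { label: string; text: string; pass: boolean };

// Count passes among the platforms that actually ran this layer.
function layerChip(label: string, outcomes: (LayerOutcome | undefined)[]): Chip {
  const present = outcomes.filter((o): o is LayerOutcome => o !== undefined);
  const passed = present.filter((o) => o.pass).length;
  return { label, text: `${passed}/${present.length}`, pass: passed === present.length };
}

export function gateStrip(rows: PlatformRow[], reviewerPass: boolean): Chip[] {
  return [
    layerChip("Layer 1", rows.map((r) => r.layer1)),
    layerChip("Layer 2", rows.map((r) => r.layer2)),
    layerChip("Layer 3", rows.map((r) => r.layer3)),
    { label: "Reviewer", text: reviewerPass ? "PASS" : "FAIL", pass: reviewerPass },
  ];
}
```

One edge the sketch leaves open: when no platform ran Layer 3, the chip reads `0/0` and counts as passing. A real renderer would probably show that as a skip instead.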

---

## 5. Data model

New file `src/report/model.ts`. Field types reuse existing exports verbatim where possible.

```ts
export type RunReport = {
  meta: {
    spec: string;
    slug: string;
    displayName: string;
    agentVersion: string; // package.json version
    judgeModel: string; // "claude-opus-4-7"
    visualLevel: 0 | 1 | 2; // from NATIVEAPPTEMPLATE_VISUAL
    startedAt: string; // ISO
    finishedAt: string; // ISO
    durationMs: number;
  };
  overallPass: boolean;
  summary: string; // existing one-line summary, preserved

  platforms: PlatformDetail[]; // rails, ios, android

  reviewer: {
    contractParity: "pass" | "fail";
    diffs: readonly string[]; // from ReviewerResult
  };

  domain: {
    renamePlan: readonly { from: string; to: string }[];
    entities: readonly {
      name: string;
      replaces: string;
      fields: readonly { name: string; type: string; references?: string }[];
      states?: readonly string[];
    }[];
  };

  repairAttempts?: readonly RepairAttempt[]; // future; render if present
};

export type PlatformDetail = {
  platform: "rails" | "ios" | "android";
  layer1: { pass: boolean; findings: readonly Layer1Finding[] }; // full findings, not a count
  layer2: {
    pass: boolean;
    command: string;
    mode: "fast" | "build";
    exitCode: number | null;
    durationMs: number;
    stderrTail?: string;
  };
  layer3?: VisualJudgePlatformReport; // ios/android only; reuse existing type (incl. .stage2)
};

export type RepairAttempt = {
  iteration: number; // 1..5
  failingLayer: "layer1" | "layer2" | "layer3" | "reviewer";
  platform?: "rails" | "ios" | "android";
  action: string;
  resolved: boolean;
};
```

`Layer1Finding` and `VisualJudgePlatformReport` are imported from existing modules — no duplication. The report model is the *only* place these get aggregated.

### Required upstream change

`runJudge` already computes everything in `PlatformDetail.layer1`/`layer2` inside `evaluate()`, but discards the detail. Change `evaluate()` to retain full `Layer1Result.findings` and full `Layer2Result` (not just `findings.length` and `command`), and have `runJudge` return them. Smallest viable shape: add an optional `platforms?: PlatformDetail[]` to `JudgeResult` (back-compatible — existing CLI/MCP consumers ignore it). `dispatch()` then assembles `RunReport` from `domain` + `reviewer` + the enriched `JudgeResult` + timing it records around the run.
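
To illustrate why an optional field is back-compatible, here is a reduced sketch. The `Layer1Finding` shape is an assumption inferred from the findings table in §4, `PlatformDetail` is trimmed to two layers, `visual` is omitted for brevity, and both consumer functions are hypothetical.

```ts
// Assumed shape, inferred from the "token | file:line | text excerpt" table in §4.
type Layer1Finding = { token: string; file: string; line: number; text: string };

type PlatformDetail = {
  platform: "rails" | "ios" | "android";
  layer1: { pass: boolean; findings: readonly Layer1Finding[] };
  layer2: { pass: boolean; command: string; durationMs: number };
};

type JudgeResult = {
  overallPass: boolean;
  summary: string;
  platforms?: PlatformDetail[]; // new and optional: old producers can omit it
};

// Existing-style consumer: never touches `platforms`, so it is unaffected.
export function printSummary(result: JudgeResult): string[] {
  return [`result: ${result.summary}`, `overall: ${result.overallPass ? "PASS" : "FAIL"}`];
}

// New consumer: treats a missing `platforms` as "no detail available".
export function platformsOrEmpty(result: JudgeResult): readonly PlatformDetail[] {
  return result.platforms ?? [];
}
```

A value produced by today's `runJudge` still type-checks against the widened type, and the report assembler can degrade gracefully when detail is missing.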

---

## 6. Code layout

```
src/report/
├── model.ts # RunReport, PlatformDetail, RepairAttempt
├── collect.ts # assemble RunReport in dispatch; copy/embed screenshots; write report.json
├── render.ts # renderReport(report: RunReport): string ← pure, no I/O, no network
└── theme.ts # inline CSS string (brand palette), reused by render.ts
```

- `render.ts` is pure: `RunReport → string`. No `fs`, no `Date.now()` inside (timestamps come from `meta`). This makes it golden-file testable and deterministic.
- `collect.ts` owns all I/O: read screenshots, base64-or-copy, write `report.json` and `validation-report.html`.
- Vanilla HTML + inlined CSS. Interactivity limited to native `<details>` — **no JS framework, no bundler, no external assets, no web fonts** (system font stack, matching the social-preview SVG).
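
A minimal sketch of the purity constraint, using a cut-down `MiniReport` type rather than the full `RunReport` (both `MiniReport` and `renderMiniReport` are hypothetical names). The function reads only its argument, escapes every interpolated string, and uses a native `<details>` for the collapsed stderr tail.

```ts
type MiniReport = {
  overallPass: boolean;
  summary: string;
  finishedAt: string; // ISO timestamp supplied by the caller, never Date.now()
  stderrTail?: string;
};

function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

export function renderMiniReport(r: MiniReport): string {
  const badge = r.overallPass ? "PASS" : "FAIL";
  const details =
    r.stderrTail === undefined
      ? ""
      : `<details><summary>stderr</summary><pre>${escapeHtml(r.stderrTail)}</pre></details>`;
  return [
    "<!doctype html>",
    `<html lang="en"><head><meta charset="utf-8"><title>Validation report: ${badge}</title></head>`,
    `<body><h1 class="${badge.toLowerCase()}">${badge}</h1>`,
    `<p>${escapeHtml(r.summary)}</p>`,
    `<p>Finished: ${escapeHtml(r.finishedAt)}</p>`,
    details,
    "</body></html>",
  ]
    .filter((line) => line !== "")
    .join("\n");
}
```

Because the output depends only on the input object, calling it twice with the same fixture yields byte-identical HTML, which is what a golden-file test relies on.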

### Styling

Reuse the social-preview palette for brand coherence (`docs/social-preview.svg`): dark base `#1F2933`→`#0B69A3`, accent `#40C3F7`/`#2BB0ED`, text `#F5F7FA`/`#CBD2D9`. Pass = green, fail = red, skip/n-a = muted gray. Responsive, print-friendly (`@media print`), single-column on narrow widths.

---

## 7. CLI / MCP integration

`index.ts`, after `dispatch()`:

- By default, write `out/<slug>/report.json` + `validation-report.html` (the flags below can skip or narrow this).
- Print the report path as a `file://` URL (clickable in terminals) alongside the existing summary lines.
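
One way to build the `file://` line is Node's `pathToFileURL`, which handles absolute-path resolution and percent-encoding. The sketch below assumes the output root and slug are already known; `reportUrl` is a hypothetical helper name.

```ts
import { resolve } from "node:path";
import { pathToFileURL } from "node:url";

/** Hypothetical helper: absolute file:// URL for the HTML report of one run. */
export function reportUrl(outRoot: string, slug: string): string {
  const htmlPath = resolve(outRoot, slug, "validation-report.html");
  return pathToFileURL(htmlPath).href;
}

// Example: console.log(`report:  ${reportUrl("out", "my-app")}`);
```

Going through `pathToFileURL` rather than string concatenation keeps paths with spaces or non-ASCII characters clickable.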

Flags:

| Flag | Default | Effect |
|---|---|---|
| `--no-report` | off | Skip both artifacts |
| `--report-format=html\|json\|both` | `both` | Which to emit |
| `--report-embed=true\|false` | `true` | Inline screenshots vs. `report-assets/` |
| `--report-open` | off | `open` the HTML after the run (macOS) |
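
The defaults in the table could be resolved along these lines. This is a hand-rolled sketch under the assumption that flags arrive as `--name` or `--name=value` tokens; `ReportOptions` and `parseReportFlags` are hypothetical names, and unrelated arguments (such as the spec text) are simply ignored.

```ts
type ReportFormat = "html" | "json" | "both";
type ReportOptions = { enabled: boolean; format: ReportFormat; embed: boolean; open: boolean };

export function parseReportFlags(argv: readonly string[]): ReportOptions {
  // Defaults mirror the flag table: report on, both formats, embedded, not opened.
  const opts: ReportOptions = { enabled: true, format: "both", embed: true, open: false };
  for (const arg of argv) {
    if (arg === "--no-report") {
      opts.enabled = false;
    } else if (arg === "--report-open") {
      opts.open = true;
    } else if (arg.startsWith("--report-format=")) {
      const value = arg.slice("--report-format=".length);
      if (value === "html" || value === "json" || value === "both") {
        opts.format = value;
      } else {
        throw new Error(`invalid --report-format: ${value}`);
      }
    } else if (arg.startsWith("--report-embed=")) {
      const value = arg.slice("--report-embed=".length);
      if (value === "true" || value === "false") {
        opts.embed = value === "true";
      } else {
        throw new Error(`invalid --report-embed: ${value}`);
      }
    }
  }
  return opts;
}
```

Rejecting unknown values early keeps a typo like `--report-format=htm` from silently falling back to the default.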

**MCP** (`src/mcp.ts`): return `report.json` inline in the tool result and include the `validation-report.html` path, so a calling assistant gets structured pass/fail data plus a human-openable artifact.

`dispatch()` must record `startedAt`/`finishedAt` and derive `meta.durationMs` from them; it currently records no timing.
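
A small sketch of the timing capture, assuming an injectable clock so stub runs can pin timestamps. `timingFrom` and `timed` are hypothetical helpers, not existing functions in the codebase.

```ts
type Timing = { startedAt: string; finishedAt: string; durationMs: number };

/** Pure: derive the three meta fields from two instants. */
export function timingFrom(start: Date, end: Date): Timing {
  return {
    startedAt: start.toISOString(),
    finishedAt: end.toISOString(),
    durationMs: end.getTime() - start.getTime(),
  };
}

/** Wrap an async run and record wall-clock timing around it. */
export async function timed<T>(
  run: () => Promise<T>,
  now: () => Date = () => new Date(),
): Promise<{ value: T; timing: Timing }> {
  const start = now();
  const value = await run();
  return { value, timing: timingFrom(start, now()) };
}
```

Passing a fixed `now` in stub mode keeps `report.json` deterministic, which the golden-file and smoke tests in §8 depend on.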

---

## 8. Testing

- **Golden-file render test:** two fixture `RunReport`s (all-pass; mixed-fail with Layer 1 findings + Layer 2 `stderrTail` + reviewer diffs) → `renderReport` → snapshot the HTML. Pure function, no pipeline, fast.
- **Smoke test:** stub mode (`NATIVEAPPTEMPLATE_STUB_ALL=1`, the existing `runStubJudge` path) must produce a deterministic `RunReport` and write a valid `report.json` + non-empty HTML. Assert the file exists and the gate strip reflects the stub's PASS.
- **Self-contained check:** with `--report-embed=true`, assert the HTML references no `tmp/` paths and no `http(s)://` asset URLs (portability guarantee).
- Stub fixtures keep timestamps fixed so golden output is stable.
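
The self-contained check might look roughly like this. It is a simplified sketch: it inspects only quoted `src`/`href` attribute values, so CSS `url(...)` references are not covered, and it treats any `http(s)` `href` as an asset. `portabilityViolations` is a hypothetical name.

```ts
/** Return a list of references that would break portability; empty means OK. */
export function portabilityViolations(html: string): string[] {
  const violations: string[] = [];
  const attr = /\b(?:src|href)\s*=\s*["']([^"']*)["']/gi;
  let m: RegExpExecArray | null;
  while ((m = attr.exec(html)) !== null) {
    const url = m[1] ?? "";
    // Embedded images are fine; skip them so base64 payloads cannot false-positive.
    if (url.startsWith("data:")) continue;
    if (/^https?:\/\//i.test(url)) {
      violations.push(`remote asset: ${url}`);
    } else if (/(^|\/)tmp\//.test(url)) {
      violations.push(`tmp path: ${url}`);
    }
  }
  return violations;
}
```

In a test, asserting that the returned array is empty gives a readable failure message listing each offending reference.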

---

## 9. Implementation order

1. `src/report/model.ts` — types.
2. Widen `runJudge`/`evaluate` to retain full Layer 1 findings + Layer 2 detail; add `platforms?` to `JudgeResult` (back-compatible).
3. `dispatch()` — record timing; assemble `RunReport`.
4. `src/report/render.ts` + `theme.ts` — pure renderer + golden test.
5. `src/report/collect.ts` — screenshot embed/copy + write files.
6. `index.ts` flags + `file://` output line; smoke-test assertion.
7. `src/mcp.ts` — return `report.json` + HTML path.

Steps 1–4 produce the in-memory `RunReport` and the rendered HTML string, both testable without I/O; step 5 writes the artifacts to disk, and steps 6–7 are surface integration.