diff --git a/docs/validation-report.md b/docs/validation-report.md
new file mode 100644
index 0000000..5529850
--- /dev/null
+++ b/docs/validation-report.md
@@ -0,0 +1,216 @@
+# Spec: HTML validation report
+
+- Status: Draft
+- Scope: A self-contained HTML report (plus a machine-readable `report.json`) emitted at the end of every agent run, summarizing all validation layers and the reviewer, with embedded screenshots.
+- Relationship to other docs: implements the "HTML validation report" idea (highest-ROI visual surface) without building a native GUI or a hosted web product. Aligns with `MONETIZATION.md`: a thin static artifact over the existing `dispatch()` core, not a new product surface.
+
+---
+
+## 1. Why
+
+The pipeline already produces rich, visual validation evidence: leftover-token findings, per-platform build commands/durations, home-screen screenshots, vision-judge rubric scores with rationales, and a contract-parity diff. Today `index.ts` collapses all of it to two lines:
+
+```
+result: Layer 1 3/3 pass · Layer 2 3/3 pass · Layer 3 2/2 pass · reviewer PASS
+overall: PASS
+```
+
+Everything else is written to `tmp/trace/*.log` (unstructured) and `tmp/screenshots/*.png` (ephemeral, cleaned). The single most valuable, most *visual* output of the agent is invisible unless you tail log files.
+
+The report turns that evidence into one portable artifact you can open, attach to a PR, drop into a demo video, or hand to a non-technical buyer.
+
+**Non-goals:** not interactive, no server, no JS framework, no live re-run, no hosting. That is the Phase-4 web product. This is a static file.
+
+---
+
+## 2. The data gap (read this first; it's the real work)
+
+The report is *mostly* a rendering problem, but there's a structural blocker: **the per-layer detail is computed and then thrown away.**
+
+`runJudge` (`src/agents/judge.ts`) builds a per-platform `PlatformReport` inside `evaluate()`, with `layer1.findings`, `layer2.command`, `layer2.durationMs`, `layer2.stderrTail`, etc., but the returned `JudgeResult` keeps only:
+
+```ts
+export type JudgeResult = {
+  overallPass: boolean;
+  summary: string; // a single formatted string
+  visual?: VisualJudgeReport; // the only structured survivor
+};
+```
+
+So `Layer1Finding[]`, `Layer2Result.stderrTail`, the reviewer's `diffs`, the layer-2 command/duration, and the domain rename plan never reach the caller. **The report cannot be a pure post-process of today's `JudgeResult`.** Widening what survives the run is the first real implementation step (§9, step 2, right after the types).
+
+### Decision: emit a `report.json`, render HTML from it
+
+Rather than thread more return values through callers ad hoc, introduce a single structured aggregate, `RunReport`, written to disk as `report.json`. The HTML renderer is a **pure function of `RunReport`** (`renderReport(report): string`). Benefits:
+
+- HTML generator is decoupled, deterministic, and golden-file testable (no pipeline needed to test rendering).
+- `report.json` is a free machine-readable artifact for CI gating, the MCP tool result, and the future web product.
+- `dispatch()` is the natural assembler: it already holds `domain`, the three `WorkerResult`s, `reviewer`, the `JudgeResult`, and run timing.
+
+This requires one upstream change: `runJudge` must return the structured per-platform/per-layer detail it already computes (see §5).
+
+---
+
+## 3. Output layout
+
+Written to the run's output root (sibling to the three generated projects):
+
+```
+out//
+├── rails/ ios/ android/     # existing generated projects
+├── report.json              # machine-readable RunReport (always)
+├── validation-report.html   # self-contained, openable anywhere (always)
+└── report-assets/           # only when --report-embed=false
+    ├── ios-home.png android-home.png
+    └── ios-step-3.png ...
+```
+
+**Default = single self-contained file.** Screenshots are base64-embedded and CSS is inlined, so `validation-report.html` is one portable file (emailable, attachable, survives `tmp/` cleanup). For ~4–8 PNGs this is ~1–3 MB, which is acceptable. `--report-embed=false` writes images to `report-assets/` and references them relatively, for size-sensitive cases.
+
+Screenshots originate in `tmp/screenshots/{ios-home,android-home}.png` (Stage 1) and `tmp/screenshots/*-step-*.png` (Stage 2); the report collector reads them from the paths recorded in `VisualJudgePlatformReport.screenshotPath` / `Stage2PlatformReport.screenshots` and copies or embeds them. Never link directly into `tmp/`; it is ephemeral.
+
+---
+
+## 4. Report content (sections, in order)
+
+1. **Header:** overall `PASS`/`FAIL` badge, spec text, `displayName`, slug, timestamp, agent version (from `package.json`), judge model (`claude-opus-4-7`), `NATIVEAPPTEMPLATE_VISUAL` level, total run duration.
+2. **Gate strip:** four chips: Layer 1, Layer 2, Layer 3, Reviewer. Layer chips show `x/3` (`x/2` for Layer 3) and the Reviewer chip shows `PASS`/`FAIL`, each with pass/fail color. This is the existing `summary` string, made visual.
+3. **Platform × Layer matrix:** the headline. Rows `rails / ios / android`, columns `Layer 1 / Layer 2 / Layer 3`, each cell a pass/fail/skip mark. (Rails has no Layer 3; render it as "n/a".)
+4. **Layer 1: Structural.** Per platform: pass/fail and finding count. On failure, a findings table: `token | file:line | text excerpt`. Clean state shows an explicit "no leftover tokens" row.
+5. **Layer 2: Runtime.** Per platform: `command`, mode (`fast`/`build`), `exitCode`, `durationMs`. On failure, `stderrTail` in a native `<details>` element (collapsed).
+6. **Layer 3: Semantic (vision judge).** Per platform (ios/android):
+   - **Stage 1:** home-screen screenshot thumbnail + rubric table (`question | pass/fail | rationale`). Note "median of 3 samples per criterion" (`DEFAULT_SAMPLES = 3`).
+   - **Stage 2** (when `NATIVEAPPTEMPLATE_VISUAL=2`): scenario name, `stepsPassed/stepCount`, a screenshot filmstrip (`screenshots[]`), and the post-toggle Layer 3 scores (`layer3Scores`).
+   - `error` rendered prominently when a platform's visual phase failed.
+7. **Reviewer: Contract parity.** `pass`/`fail`, plus `diffs[]` in a collapsed `<details>` element when non-empty.
+8. **Domain spec appendix.** Rename plan table (`from → to`), entities and fields, so the reader sees *what was generated and why the tokens changed*.
+9. **Footer / reproduce.** The exact command to reproduce (`npx nativeapptemplate-agent ""` plus the `NATIVEAPPTEMPLATE_VISUAL` value), and a pointer to `tmp/trace/*.log` for raw logs.
+
+**Forward-looking (optional, render only if present):** a **self-repair** section showing the ≤5 iteration history (CLAUDE.md cap): which layer failed, what was attempted, the delta. The model carries `repairAttempts?` so the renderer is ready when the loop is wired.
+
+---
+
+## 5. Data model
+
+New file `src/report/model.ts`. Field types reuse existing exports verbatim where possible.
+
+```ts
+export type RunReport = {
+  meta: {
+    spec: string;
+    slug: string;
+    displayName: string;
+    agentVersion: string; // package.json version
+    judgeModel: string; // "claude-opus-4-7"
+    visualLevel: 0 | 1 | 2; // from NATIVEAPPTEMPLATE_VISUAL
+    startedAt: string; // ISO
+    finishedAt: string; // ISO
+    durationMs: number;
+  };
+  overallPass: boolean;
+  summary: string; // existing one-line summary, preserved
+
+  platforms: PlatformDetail[]; // rails, ios, android
+
+  reviewer: {
+    contractParity: "pass" | "fail";
+    diffs: readonly string[]; // from ReviewerResult
+  };
+
+  domain: {
+    renamePlan: readonly { from: string; to: string }[];
+    entities: readonly { name: string; replaces: string;
+      fields: readonly { name: string; type: string; references?: string }[];
+      states?: readonly string[] }[];
+  };
+
+  repairAttempts?: readonly RepairAttempt[]; // future; render if present
+};
+
+export type PlatformDetail = {
+  platform: "rails" | "ios" | "android";
+  layer1: { pass: boolean; findings: readonly Layer1Finding[] }; // full findings, not a count
+  layer2: { pass: boolean; command: string; mode: "fast" | "build";
+    exitCode: number | null; durationMs: number; stderrTail?: string };
+  layer3?: VisualJudgePlatformReport; // ios/android only; reuse existing type (incl. .stage2)
+};
+
+export type RepairAttempt = {
+  iteration: number; // 1..5
+  failingLayer: "layer1" | "layer2" | "layer3" | "reviewer";
+  platform?: "rails" | "ios" | "android";
+  action: string;
+  resolved: boolean;
+};
+```
+
+`Layer1Finding` and `VisualJudgePlatformReport` are imported from existing modules, so nothing is duplicated. The report model is the *only* place these get aggregated.
+
+### Required upstream change
+
+`runJudge` already computes everything in `PlatformDetail.layer1`/`layer2` inside `evaluate()`, but discards the detail. Change `evaluate()` to retain full `Layer1Result.findings` and full `Layer2Result` (not just `findings.length` and `command`), and have `runJudge` return them. Smallest viable shape: add an optional `platforms?: PlatformDetail[]` to `JudgeResult` (back-compatible, since existing CLI/MCP consumers ignore it). `dispatch()` then assembles `RunReport` from `domain`, `reviewer`, the enriched `JudgeResult`, and the timing it records around the run.
+
+---
+
+## 6. Code layout
+
+```
+src/report/
+├── model.ts     # RunReport, PlatformDetail, RepairAttempt
+├── collect.ts   # assemble RunReport in dispatch; copy/embed screenshots; write report.json
+├── render.ts    # renderReport(report: RunReport): string ← pure, no I/O, no network
+└── theme.ts     # inline CSS string (brand palette), reused by render.ts
+```
+
+- `render.ts` is pure: `RunReport → string`. No `fs`, no `Date.now()` inside (timestamps come from `meta`). This makes it golden-file testable and deterministic.
+- `collect.ts` owns all I/O: read screenshots, base64-or-copy, write `report.json` and `validation-report.html`.
+- Vanilla HTML + inlined CSS. Interactivity is limited to native `<details>` elements: **no JS framework, no bundler, no external assets, no web fonts** (system font stack, matching the social-preview SVG).
+
+### Styling
+
+Reuse the social-preview palette for brand coherence (`docs/social-preview.svg`): dark base `#1F2933`→`#0B69A3`, accent `#40C3F7`/`#2BB0ED`, text `#F5F7FA`/`#CBD2D9`. Pass = green, fail = red, skip/n-a = muted gray. Responsive, print-friendly (`@media print`), single-column on narrow widths.
+
+---
+
+## 7. CLI / MCP integration
+
+`index.ts`, after `dispatch()`:
+
+- Always write `out//report.json` and `validation-report.html`.
+- Print the report path as a `file://` URL (clickable in terminals) alongside the existing summary lines.
+
+Flags:
+
+| Flag | Default | Effect |
+|---|---|---|
+| `--no-report` | off | Skip both artifacts |
+| `--report-format=html\|json\|both` | `both` | Which to emit |
+| `--report-embed=true\|false` | `true` | Inline screenshots vs. `report-assets/` |
+| `--report-open` | off | `open` the HTML after the run (macOS) |
+
+**MCP** (`src/mcp.ts`): return `report.json` inline in the tool result and include the `validation-report.html` path, so a calling assistant gets structured pass/fail data plus a human-openable artifact.
+
+`dispatch()` will record `startedAt`/`finishedAt` and derive `meta.durationMs` from them (it currently records no timing).
+
+---
+
+## 8. Testing
+
+- **Golden-file render test:** two fixture `RunReport`s (all-pass; mixed-fail with Layer 1 findings, Layer 2 `stderrTail`, and reviewer diffs) → `renderReport` → snapshot the HTML. Pure function, no pipeline, fast.
+- **Smoke test:** stub mode (`NATIVEAPPTEMPLATE_STUB_ALL=1`, the existing `runStubJudge` path) must produce a deterministic `RunReport` and write a valid `report.json` and non-empty HTML. Assert the file exists and the gate strip reflects the stub's PASS.
+- **Self-contained check:** with `--report-embed=true`, assert the HTML references no `tmp/` paths and no `http(s)://` asset URLs (portability guarantee).
+- Stub fixtures keep timestamps fixed so golden output is stable.
+
+---
+
+## 9. Implementation order
+
+1. `src/report/model.ts`: types.
+2. Widen `runJudge`/`evaluate` to retain full Layer 1 findings and Layer 2 detail; add `platforms?` to `JudgeResult` (back-compatible).
+3. `dispatch()`: record timing; assemble `RunReport`.
+4. `src/report/render.ts` and `theme.ts`: pure renderer and golden test.
+5. `src/report/collect.ts`: screenshot embed/copy and file writes.
+6. `index.ts` flags and `file://` output line; smoke-test assertion.
+7. `src/mcp.ts`: return `report.json` and the HTML path.
+
+Steps 1–4 land the machine-readable artifact and the HTML for the common case; 5–7 are polish and surface integration.
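
The pure-renderer contract from §6 and the gate strip from §4 could be sketched roughly as follows. This is an illustration under stated assumptions, not code from the patch: `GateInput`, `layerChip`, `gateChips`, `escapeHtml`, and `renderGateStrip` are hypothetical names, the CSS class names are invented, and the input type is a trimmed stand-in for `RunReport` that keeps only the `pass` flags and the optional `layer3` field from §5.

```typescript
// Trimmed, hypothetical stand-in for RunReport (§5): only what the gate strip needs.
type LayerPass = { pass: boolean };
type PlatformInput = {
  platform: "rails" | "ios" | "android";
  layer1: LayerPass;
  layer2: LayerPass;
  layer3?: LayerPass; // absent for rails
};
type GateInput = {
  platforms: readonly PlatformInput[];
  reviewer: { contractParity: "pass" | "fail" };
};
type Chip = { label: string; value: string; pass: boolean };

// Escape text before interpolating it into HTML.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Count passes for one layer across the platforms that actually ran it.
// Note: with no results this yields "0/0" and pass = true; a real renderer
// would probably show a "skip" chip instead.
function layerChip(label: string, results: readonly (LayerPass | undefined)[]): Chip {
  const ran = results.filter((r): r is LayerPass => r !== undefined);
  const passed = ran.filter((r) => r.pass).length;
  return { label, value: `${passed}/${ran.length}`, pass: passed === ran.length };
}

function gateChips(input: GateInput): Chip[] {
  const p = input.platforms;
  const reviewerPass = input.reviewer.contractParity === "pass";
  return [
    layerChip("Layer 1", p.map((x) => x.layer1)),
    layerChip("Layer 2", p.map((x) => x.layer2)),
    layerChip("Layer 3", p.map((x) => x.layer3)),
    { label: "Reviewer", value: reviewerPass ? "PASS" : "FAIL", pass: reviewerPass },
  ];
}

// Pure: no fs, no clock, no network. The same input always gives the same string.
function renderGateStrip(input: GateInput): string {
  const chips = gateChips(input).map(
    (c) =>
      `<span class="chip ${c.pass ? "pass" : "fail"}">` +
      `${escapeHtml(c.label)} ${escapeHtml(c.value)}</span>`,
  );
  return `<div class="gate-strip">${chips.join("")}</div>`;
}

// Example input: android fails Layer 2 and Layer 3, everything else passes.
const sample: GateInput = {
  platforms: [
    { platform: "rails", layer1: { pass: true }, layer2: { pass: true } },
    { platform: "ios", layer1: { pass: true }, layer2: { pass: true }, layer3: { pass: true } },
    { platform: "android", layer1: { pass: true }, layer2: { pass: false }, layer3: { pass: false } },
  ],
  reviewer: { contractParity: "pass" },
};
console.log(renderGateStrip(sample));
```

Because the function takes plain data and returns a string, a golden-file test in the spirit of §8 only needs a fixture object and a stored snapshot; no pipeline run is involved.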