nativeapptemplate · dadachi · May 22, 2026 · May 22, 2026
diff --git a/docs/validation-report.md b/docs/validation-report.md
@@ -0,0 +1,216 @@
+# Spec: HTML validation report
+
+- Status: Draft
+- Scope: A self-contained HTML report (plus a machine-readable `report.json`) emitted at the end of every agent run, summarizing all validation layers + the reviewer with embedded screenshots.
+- Relationship to other docs: implements the "HTML validation report" idea (highest-ROI visual surface) without building a native GUI or a hosted web product. Aligns with `MONETIZATION.md` — a thin static artifact over the existing `dispatch()` core, not a new product surface.
+
+---
+
+## 1. Why
+
+The pipeline already produces rich, visual validation evidence — leftover-token findings, per-platform build commands/durations, home-screen screenshots, vision-judge rubric scores with rationales, and a contract-parity diff. Today `index.ts` collapses all of it to two lines:
+
+```
+result: Layer 1 3/3 pass · Layer 2 3/3 pass · Layer 3 2/2 pass · reviewer PASS
+overall: PASS
+```
+
+Everything else is written to `tmp/trace/*.log` (unstructured) and `tmp/screenshots/*.png` (ephemeral, cleaned). The single most valuable, most *visual* output of the agent is invisible unless you tail log files.
+
+The report turns that evidence into one portable artifact you can open, attach to a PR, drop into a demo video, or hand to a non-technical buyer.
+
+**Non-goals:** not interactive, no server, no JS framework, no live re-run, no hosting. That is the Phase-4 web product. This is a static file.
+
+---
+
+## 2. The data gap (read this first — it's the real work)
+
+The report is *mostly* a rendering problem, but there's a structural blocker: **the per-layer detail is computed and then thrown away.**
+
+`runJudge` (`src/agents/judge.ts`) builds a per-platform `PlatformReport` inside `evaluate()` — it has `layer1.findings`, `layer2.command`, `layer2.durationMs`, `layer2.stderrTail`, etc. — but the returned `JudgeResult` keeps only:
+
+```ts
+export type JudgeResult = {
+  overallPass: boolean;
+  summary: string;            // a single formatted string
+  visual?: VisualJudgeReport; // the only structured survivor
+};
+```
+
+So `Layer1Finding[]`, `Layer2Result.stderrTail`, the reviewer's `diffs`, the layer-2 command/duration, and the domain rename plan never reach the caller. **The report cannot be a pure post-process of today's `JudgeResult`.** Step 1 of implementation is widening what survives the run.
+
+### Decision: emit a `report.json`, render HTML from it
+
+Rather than thread more return values through callers ad hoc, introduce a single structured aggregate, `RunReport`, written to disk as `report.json`. The HTML renderer is a **pure function of `RunReport`** (`renderReport(report): string`). Benefits:
+
+- HTML generator is decoupled, deterministic, and golden-file testable (no pipeline needed to test rendering).
+- `report.json` is a free machine-readable artifact for CI gating, the MCP tool result, and the future web product.
+- `dispatch()` is the natural assembler — it already holds `domain`, the three `WorkerResult`s, `reviewer`, the `JudgeResult`, and run timing.
+
+This requires one upstream change: `runJudge` must return the structured per-platform/per-layer detail it already computes (see §5).
+
+---
+
+## 3. Output layout
+
+Written to the run's output root (sibling to the three generated projects):
+
+```
+out/<slug>/
+├── rails/  ios/  android/        # existing generated projects
+├── report.json                   # machine-readable RunReport (default)
+├── validation-report.html        # self-contained, openable anywhere (default)
+└── report-assets/                # only when --report-embed=false
+    ├── ios-home.png  android-home.png
+    └── ios-step-3.png ...
+```
+
+**Default = single self-contained file.** Screenshots are base64-embedded and CSS is inlined, so `validation-report.html` is one portable file (emailable, attachable, survives `tmp/` cleanup). For ~4–8 PNGs this is ~1–3 MB — acceptable. `--report-embed=false` writes images to `report-assets/` and references them relatively, for size-sensitive cases.
+
+Screenshots originate in `tmp/screenshots/{ios-home,android-home}.png` (Stage 1) and `tmp/screenshots/*-step-*.png` (Stage 2); the report collector reads them from the paths recorded in `VisualJudgePlatformReport.screenshotPath` / `Stage2PlatformReport.screenshots` and copies-or-embeds them. Never link directly into `tmp/` — it's ephemeral.
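
The embed-or-copy choice described above could look roughly like the sketch below. The helper names `toDataUri` and `imageSrc` are hypothetical, and the sketch assumes the collector has already read the PNG bytes from the recorded screenshot path.

```typescript
// Hypothetical helpers for illustration; these names are not in the codebase.

/** Encode PNG bytes as a data URI that can be used directly as an <img src>. */
export function toDataUri(png: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < png.length; i++) {
    binary += String.fromCharCode(png[i]);
  }
  return "data:image/png;base64," + btoa(binary);
}

/**
 * Choose the <img src> value for one screenshot.
 * embed = true  -> inline data URI (single-file report).
 * embed = false -> relative path under report-assets/ (the caller copies the file there).
 */
export function imageSrc(fileName: string, png: Uint8Array, embed: boolean): string {
  return embed ? toDataUri(png) : "report-assets/" + fileName;
}
```

Neither branch yields a path into `tmp/`, which is the property the portability guarantee depends on.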
+
+---
+
+## 4. Report content (sections, in order)
+
+1. **Header** — overall `PASS`/`FAIL` badge, spec text, `displayName`, slug, timestamp, agent version (from `package.json`), judge model (`claude-opus-4-7`), `NATIVEAPPTEMPLATE_VISUAL` level, total run duration.
+2. **Gate strip** — four chips: Layer 1, Layer 2, Layer 3, Reviewer. The layer chips show `x/3` (`x/2` for Layer 3, which has no Rails row); the Reviewer chip shows `PASS`/`FAIL`. Each chip is colored by pass/fail. This is the existing `summary` string, made visual.
+3. **Platform × Layer matrix** — the headline. Rows `rails / ios / android`, columns `Layer 1 / Layer 2 / Layer 3`, each cell a pass/fail/skip mark. (Rails has no Layer 3 — render as "n/a".)
+4. **Layer 1 — Structural.** Per platform: pass + count. On failure, a findings table: `token | file:line | text excerpt`. Clean state shows an explicit "no leftover tokens" row.
+5. **Layer 2 — Runtime.** Per platform: `command`, mode (`fast`/`build`), `exitCode`, `durationMs`. On failure, `stderrTail` in a native `<details>` (collapsed).
+6. **Layer 3 — Semantic (vision judge).** Per platform (ios/android):
+   - **Stage 1:** home-screen screenshot thumbnail + rubric table (`question | pass/fail | rationale`). Note "median of 3 samples per criterion" (`DEFAULT_SAMPLES = 3`).
+   - **Stage 2** (when `NATIVEAPPTEMPLATE_VISUAL=2`): scenario name, `stepsPassed/stepCount`, a screenshot filmstrip (`screenshots[]`), and the post-toggle Layer 3 scores (`layer3Scores`).
+   - `error` rendered prominently when a platform's visual phase failed.
+7. **Reviewer — Contract parity.** `pass`/`fail` + `diffs[]` in a collapsed `<details>` when non-empty.
+8. **Domain spec appendix.** Rename plan table (`from → to`), entities and fields, so the reader sees *what was generated and why the tokens changed*.
+9. **Footer / reproduce.** The exact command to reproduce (`npx nativeapptemplate-agent "<spec>"` + the `NATIVEAPPTEMPLATE_VISUAL` value), and a pointer to `tmp/trace/*.log` for raw logs.
+
+**Forward-looking (optional, render only if present):** a **self-repair** section showing the ≤5 iteration history (CLAUDE.md cap) — which layer failed, what was attempted, the delta. The model carries `repairAttempts?` so the renderer is ready when the loop is wired.
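
As a rough illustration of how the gate strip (item 2) and the matrix cells (item 3) might be derived from per-platform detail, here is a minimal sketch. The types are simplified stand-ins for the spec's `PlatformDetail`; in particular, a boolean `pass` on `layer3` is an assumption, and `gateChips` and `cellMark` are hypothetical names.

```typescript
// Simplified stand-ins; layer3.pass is an assumed field, not a confirmed one.
type Mark = { pass: boolean };
type Row = { platform: "rails" | "ios" | "android"; layer1: Mark; layer2: Mark; layer3?: Mark };
type Chip = { label: string; passed: number; total: number; ok: boolean };

function chip(label: string, marks: readonly Mark[]): Chip {
  const passed = marks.filter((m) => m.pass).length;
  return { label, passed, total: marks.length, ok: passed === marks.length };
}

/** Layer chips for the gate strip: x/3 for Layers 1 and 2, x/2 for Layer 3. */
export function gateChips(rows: readonly Row[]): Chip[] {
  const layer3: Mark[] = [];
  for (const row of rows) {
    // Rails carries no layer3 entry, so it never counts toward the total.
    if (row.layer3 !== undefined) layer3.push(row.layer3);
  }
  return [
    chip("Layer 1", rows.map((r) => r.layer1)),
    chip("Layer 2", rows.map((r) => r.layer2)),
    chip("Layer 3", layer3),
  ];
}

/** One matrix cell; a missing layer renders as "n/a". */
export function cellMark(mark: Mark | undefined): "pass" | "fail" | "n/a" {
  if (mark === undefined) return "n/a";
  return mark.pass ? "pass" : "fail";
}
```

The Reviewer chip is not a count, so under this sketch it would be read directly from `reviewer.contractParity`.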
+
+---
+
+## 5. Data model
+
+New file `src/report/model.ts`. Field types reuse existing exports verbatim where possible.
+
+```ts
+export type RunReport = {
+  meta: {
+    spec: string;
+    slug: string;
+    displayName: string;
+    agentVersion: string;       // package.json version
+    judgeModel: string;         // "claude-opus-4-7"
+    visualLevel: 0 | 1 | 2;     // from NATIVEAPPTEMPLATE_VISUAL
+    startedAt: string;          // ISO
+    finishedAt: string;         // ISO
+    durationMs: number;
+  };
+  overallPass: boolean;
+  summary: string;              // existing one-line summary, preserved
+
+  platforms: PlatformDetail[];  // rails, ios, android
+
+  reviewer: {
+    contractParity: "pass" | "fail";
+    diffs: readonly string[];   // from ReviewerResult
+  };
+
+  domain: {
+    renamePlan: readonly { from: string; to: string }[];
+    entities: readonly { name: string; replaces: string;
+      fields: readonly { name: string; type: string; references?: string }[];
+      states?: readonly string[] }[];
+  };
+
+  repairAttempts?: readonly RepairAttempt[]; // future; render if present
+};
+
+export type PlatformDetail = {
+  platform: "rails" | "ios" | "android";
+  layer1: { pass: boolean; findings: readonly Layer1Finding[] };       // full findings, not a count
+  layer2: { pass: boolean; command: string; mode: "fast" | "build";
+            exitCode: number | null; durationMs: number; stderrTail?: string };
+  layer3?: VisualJudgePlatformReport;  // ios/android only; reuse existing type (incl. .stage2)
+};
+
+export type RepairAttempt = {
+  iteration: number;            // 1..5
+  failingLayer: "layer1" | "layer2" | "layer3" | "reviewer";
+  platform?: "rails" | "ios" | "android";
+  action: string;
+  resolved: boolean;
+};
+```
+
+`Layer1Finding` and `VisualJudgePlatformReport` are imported from existing modules — no duplication. The report model is the *only* place these get aggregated.
+
+### Required upstream change
+
+`runJudge` already computes everything in `PlatformDetail.layer1`/`layer2` inside `evaluate()`, but discards the detail. Change `evaluate()` to retain full `Layer1Result.findings` and full `Layer2Result` (not just `findings.length` and `command`), and have `runJudge` return them. Smallest viable shape: add an optional `platforms?: PlatformDetail[]` to `JudgeResult` (back-compatible — existing CLI/MCP consumers ignore it). `dispatch()` then assembles `RunReport` from `domain` + `reviewer` + the enriched `JudgeResult` + timing it records around the run.
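
A minimal sketch of the back-compatible widening and the assembly step, under stated assumptions: the `Layer1Finding` fields are guessed from the findings table in §4, `ReviewerResult` is reduced to the two fields the report needs, and `assembleCore` is a hypothetical name for the part of `dispatch()` that builds the report.

```typescript
// Stand-ins for existing exports; the real shapes live in the judge and reviewer modules.
type Layer1Finding = { token: string; file: string; line: number; text: string }; // assumed fields
type PlatformDetail = {
  platform: "rails" | "ios" | "android";
  layer1: { pass: boolean; findings: readonly Layer1Finding[] };
  layer2: { pass: boolean; command: string; mode: "fast" | "build";
            exitCode: number | null; durationMs: number; stderrTail?: string };
};
type JudgeResult = {
  overallPass: boolean;
  summary: string;
  platforms?: PlatformDetail[]; // new and optional: existing consumers never read it
};
type ReviewerResult = { contractParity: "pass" | "fail"; diffs: readonly string[] }; // assumed shape

/** Hypothetical assembly step: combine the enriched judge result, reviewer, and timing. */
export function assembleCore(
  judge: JudgeResult,
  reviewer: ReviewerResult,
  startedAt: Date,
  finishedAt: Date,
) {
  return {
    overallPass: judge.overallPass,
    summary: judge.summary,
    // A judge that has not been widened yet simply yields no per-platform rows.
    platforms: judge.platforms ?? [],
    reviewer: { contractParity: reviewer.contractParity, diffs: reviewer.diffs },
    timing: {
      startedAt: startedAt.toISOString(),
      finishedAt: finishedAt.toISOString(),
      durationMs: finishedAt.getTime() - startedAt.getTime(),
    },
  };
}
```

Because `platforms` is optional, a caller that still receives the old three-field `JudgeResult` keeps compiling, and the report degrades to an empty platform list instead of failing.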
+
+---
+
+## 6. Code layout
+
+```
+src/report/
+├── model.ts      # RunReport, PlatformDetail, RepairAttempt
+├── collect.ts    # assemble RunReport in dispatch; copy/embed screenshots; write report.json
+├── render.ts     # renderReport(report: RunReport): string   ← pure, no I/O, no network
+└── theme.ts      # inline CSS string (brand palette), reused by render.ts
+```
+
+- `render.ts` is pure: `RunReport → string`. No `fs`, no `Date.now()` inside (timestamps come from `meta`). This makes it golden-file testable and deterministic.
+- `collect.ts` owns all I/O: read screenshots, base64-or-copy, write `report.json` and `validation-report.html`.
+- Vanilla HTML + inlined CSS. Interactivity limited to native `<details>` — **no JS framework, no bundler, no external assets, no web fonts** (system font stack, matching the social-preview SVG).
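
To make the purity constraint concrete, here is a deliberately tiny sketch of a renderer in the same style. `MiniReport` and `renderMini` are hypothetical and cover only a fraction of `RunReport`; the point is that every value, including the timestamp, arrives through the argument, and that untrusted text such as `stderrTail` is escaped before it reaches the markup.

```typescript
// Hypothetical, cut-down report shape used only for this illustration.
type MiniReport = { overallPass: boolean; summary: string; finishedAt: string; stderrTail?: string };

/** Escape text so log output cannot inject markup into the report. */
export function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

/** Pure: no fs, no clock, no network. The same input always yields the same string. */
export function renderMini(report: MiniReport): string {
  const badge = report.overallPass ? "PASS" : "FAIL";
  const details = report.stderrTail === undefined
    ? ""
    : "<details><summary>stderr</summary><pre>" + escapeHtml(report.stderrTail) + "</pre></details>";
  return [
    "<!doctype html>",
    "<title>Validation report: " + badge + "</title>",
    '<h1 class="' + badge.toLowerCase() + '">' + badge + "</h1>",
    "<p>" + escapeHtml(report.summary) + "</p>",
    "<p>Finished at " + escapeHtml(report.finishedAt) + "</p>",
    details,
  ].filter((line) => line !== "").join("\n");
}
```

A golden-file test then reduces to calling the function on a fixture and comparing the result with a stored string.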
+
+### Styling
+
+Reuse the social-preview palette for brand coherence (`docs/social-preview.svg`): dark base `#1F2933`→`#0B69A3`, accent `#40C3F7`/`#2BB0ED`, text `#F5F7FA`/`#CBD2D9`. Pass = green, fail = red, skip/n-a = muted gray. Responsive, print-friendly (`@media print`), single-column on narrow widths.
+
+---
+
+## 7. CLI / MCP integration
+
+`index.ts`, after `dispatch()`:
+
+- By default, write `out/<slug>/report.json` + `validation-report.html`; the flags below can skip or narrow this.
+- Print the report path as a `file://` URL (clickable in most terminals) alongside the existing summary lines.
+
+Flags:
+
+| Flag | Default | Effect |
+|---|---|---|
+| `--no-report` | off | Skip both artifacts |
+| `--report-format=html\|json\|both` | `both` | Which to emit |
+| `--report-embed=true\|false` | `true` | Inline screenshots vs. `report-assets/` |
+| `--report-open` | off | `open` the HTML after the run (macOS) |
+
+**MCP** (`src/mcp.ts`): return `report.json` inline in the tool result and include the `validation-report.html` path, so a calling assistant gets structured pass/fail data plus a human-openable artifact.
+
+`dispatch()` must record `startedAt`/`finishedAt` and derive `meta.durationMs` from them (it currently records no timing).
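
The flag table above could be parsed along these lines. This is a sketch only: `ReportOptions` and `parseReportFlags` are hypothetical names, the defaults mirror the table, and how `index.ts` actually parses arguments is not specified here.

```typescript
// Hypothetical option shape; defaults follow the flag table.
type ReportOptions = {
  enabled: boolean;
  format: "html" | "json" | "both";
  embed: boolean;
  open: boolean;
};

/** Return the value of a `--flag=value` argument, or undefined if the prefix does not match. */
function flagValue(arg: string, prefix: string): string | undefined {
  return arg.slice(0, prefix.length) === prefix ? arg.slice(prefix.length) : undefined;
}

export function parseReportFlags(argv: readonly string[]): ReportOptions {
  const opts: ReportOptions = { enabled: true, format: "both", embed: true, open: false };
  for (const arg of argv) {
    const format = flagValue(arg, "--report-format=");
    const embed = flagValue(arg, "--report-embed=");
    if (arg === "--no-report") {
      opts.enabled = false;
    } else if (arg === "--report-open") {
      opts.open = true;
    } else if (format !== undefined) {
      if (format === "html" || format === "json" || format === "both") opts.format = format;
      else throw new Error("invalid --report-format: " + format);
    } else if (embed !== undefined) {
      if (embed !== "true" && embed !== "false") throw new Error("invalid --report-embed: " + embed);
      opts.embed = embed === "true";
    }
    // Anything else (for example the spec string) is left for the caller to handle.
  }
  return opts;
}
```

Rejecting unknown values early keeps a typo such as `--report-format=htm` from silently falling back to the default.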
+
+---
+
+## 8. Testing
+
+- **Golden-file render test:** two fixture `RunReport`s (all-pass; mixed-fail with Layer 1 findings + Layer 2 `stderrTail` + reviewer diffs) → `renderReport` → snapshot the HTML. Pure function, no pipeline, fast.
+- **Smoke test:** stub mode (`NATIVEAPPTEMPLATE_STUB_ALL=1`, the existing `runStubJudge` path) must produce a deterministic `RunReport` and write a valid `report.json` + non-empty HTML. Assert both files exist and the gate strip reflects the stub's PASS.
+- **Self-contained check:** with `--report-embed=true`, assert that no `src`/`href` attribute in the HTML points into `tmp/` or at an `http(s)://` URL (portability guarantee). The footer's plain-text pointer to `tmp/trace/*.log` is not an asset reference and is exempt.
+- Stub fixtures keep timestamps fixed so golden output is stable.
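
One way the self-contained check might be written is to scan asset-bearing attributes and not the whole document. The function name `externalRefs` is hypothetical, and the regex only handles double-quoted `src`/`href` values, which is an assumption about how the renderer writes attributes.

```typescript
/**
 * Return every src/href target that would break portability:
 * anything under tmp/ or any http(s):// URL. An empty array means the
 * report is self-contained as far as this check can tell.
 */
export function externalRefs(html: string): string[] {
  const attr = /\b(?:src|href)\s*=\s*"([^"]*)"/gi;
  const offenders: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = attr.exec(html)) !== null) {
    const target = match[1];
    const isRemote = /^https?:\/\//i.test(target);
    const isTmp = /(^|\/)tmp\//.test(target);
    if (isRemote || isTmp) offenders.push(target);
  }
  return offenders;
}
```

A test would then assert that `externalRefs(html)` is empty for a report rendered with embedding enabled.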
+
+---
+
+## 9. Implementation order
+
+1. `src/report/model.ts` — types.
+2. Widen `runJudge`/`evaluate` to retain full Layer 1 findings + Layer 2 detail; add `platforms?` to `JudgeResult` (back-compatible).
+3. `dispatch()` — record timing; assemble `RunReport`.
+4. `src/report/render.ts` + `theme.ts` — pure renderer + golden test.
+5. `src/report/collect.ts` — screenshot embed/copy + write files.
+6. `index.ts` flags + `file://` output line; smoke-test assertion.
+7. `src/mcp.ts` — return `report.json` + HTML path.
+
+Steps 1–4 produce the in-memory `RunReport` and a golden-tested renderer; step 5 writes `report.json` and the HTML to disk; steps 6–7 are surface integration.