- Status: Draft
- Scope: A self-contained HTML report (plus a machine-readable `report.json`) emitted at the end of every agent run, summarizing all validation layers + the reviewer with embedded screenshots.
- Relationship to other docs: implements the "HTML validation report" idea (highest-ROI visual surface) without building a native GUI or a hosted web product. Aligns with `MONETIZATION.md`: a thin static artifact over the existing `dispatch()` core, not a new product surface.

The pipeline already produces rich, visual validation evidence — leftover-token findings, per-platform build commands/durations, home-screen screenshots, vision-judge rubric scores with rationales, and a contract-parity diff. Today index.ts collapses all of it to two lines:

```text
result: Layer 1 3/3 pass · Layer 2 3/3 pass · Layer 3 2/2 pass · reviewer PASS
overall: PASS
```

Everything else is written to tmp/trace/*.log (unstructured) and tmp/screenshots/*.png (ephemeral, cleaned). The single most valuable, most visual output of the agent is invisible unless you tail log files.
The report turns that evidence into one portable artifact you can open, attach to a PR, drop into a demo video, or hand to a non-technical buyer.
Non-goals: not interactive, no server, no JS framework, no live re-run, no hosting. That is the Phase-4 web product. This is a static file.
The report is mostly a rendering problem, but there's a structural blocker: the per-layer detail is computed and then thrown away.
runJudge (src/agents/judge.ts) builds a per-platform PlatformReport inside evaluate() — it has layer1.findings, layer2.command, layer2.durationMs, layer2.stderrTail, etc. — but the returned JudgeResult keeps only:

```typescript
export type JudgeResult = {
  overallPass: boolean;
  summary: string; // a single formatted string
  visual?: VisualJudgeReport; // the only structured survivor
};
```

So `Layer1Finding[]`, `Layer2Result.stderrTail`, the reviewer's diffs, the layer-2 command/duration, and the domain rename plan never reach the caller. The report cannot be a pure post-process of today's `JudgeResult`. Step 1 of implementation is widening what survives the run.
Rather than thread more return values through callers ad hoc, introduce a single structured aggregate, RunReport, written to disk as report.json. The HTML renderer is a pure function of RunReport (renderReport(report): string). Benefits:
- The HTML generator is decoupled, deterministic, and golden-file testable (no pipeline needed to test rendering).
- `report.json` is a free machine-readable artifact for CI gating, the MCP tool result, and the future web product.
- `dispatch()` is the natural assembler: it already holds `domain`, the three `WorkerResult`s, `reviewer`, and the `JudgeResult`, and it is the natural place to record run timing.

This requires one upstream change: runJudge must return the structured per-platform/per-layer detail it already computes (see §5).
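As an illustration of the CI-gating use mentioned above, a consumer could derive an exit code from `report.json` alone. This is a minimal sketch under assumptions: `GateInput` is a hypothetical trimmed view of `RunReport` (only `overallPass` and `summary`), and `exitCodeFromReportJson` and its 0/1/2 convention are illustrative, not part of the existing code.

```typescript
// Hypothetical, trimmed view of RunReport: only the fields the gate reads.
type GateInput = { overallPass: boolean; summary: string };

// Returns 0 when the run passed, 1 when it failed, 2 when the JSON is unusable.
// (The 0/1/2 convention is an assumption made for this sketch.)
function exitCodeFromReportJson(json: string): 0 | 1 | 2 {
  let parsed: unknown;
  try {
    parsed = JSON.parse(json);
  } catch {
    return 2; // not valid JSON
  }
  if (typeof parsed !== "object" || parsed === null) return 2;
  const report = parsed as Partial<GateInput>;
  if (typeof report.overallPass !== "boolean") return 2;
  return report.overallPass ? 0 : 1;
}
```

Because the gate reads only the JSON, it needs neither the HTML renderer nor the pipeline.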
Written to the run's output root (sibling to the three generated projects):

```text
out/<slug>/
├── rails/ ios/ android/     # existing generated projects
├── report.json              # machine-readable RunReport (always)
├── validation-report.html   # self-contained, openable anywhere (always)
└── report-assets/           # only when --report-embed=false
    ├── ios-home.png android-home.png
    └── ios-step-3.png ...
```

Default = single self-contained file. Screenshots are base64-embedded and CSS is inlined, so validation-report.html is one portable file (emailable, attachable, survives tmp/ cleanup). For ~4–8 PNGs this is ~1–3 MB — acceptable. --report-embed=false writes images to report-assets/ and references them relatively, for size-sensitive cases.
Screenshots originate in tmp/screenshots/{ios-home,android-home}.png (Stage 1) and tmp/screenshots/*-step-*.png (Stage 2); the report collector reads them from the paths recorded in VisualJudgePlatformReport.screenshotPath / Stage2PlatformReport.screenshots and copies-or-embeds them. Never link directly into tmp/ — it's ephemeral.
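The embed-or-copy decision described above can be sketched as a small helper that produces the value for an image's `src` attribute. This is an illustrative sketch, not the actual collector: `screenshotSrc` and its parameters are hypothetical names, and copying the file into `report-assets/` is left to the caller.

```typescript
import { Buffer } from "node:buffer";

// Returns the value for an <img src="..."> attribute.
// embed=true  -> base64 data URI, so the report stays a single file
// embed=false -> relative path into report-assets/ (the caller copies the file there)
function screenshotSrc(png: Uint8Array, fileName: string, embed: boolean): string {
  if (!embed) return `report-assets/${fileName}`;
  return `data:image/png;base64,${Buffer.from(png).toString("base64")}`;
}
```

In either mode the resulting `src` never points into `tmp/`, which is what keeps the report valid after cleanup.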
- Header: overall `PASS`/`FAIL` badge, spec text, `displayName`, slug, timestamp, agent version (from `package.json`), judge model (`claude-opus-4-7`), `NATIVEAPPTEMPLATE_VISUAL` level, total run duration.
- Gate strip: four chips (Layer 1, Layer 2, Layer 3, Reviewer), each `x/3` (or `x/2` for L3) with pass/fail color. This is the existing `summary` string, made visual.
- Platform × Layer matrix: the headline. Rows `rails / ios / android`, columns `Layer 1 / Layer 2 / Layer 3`, each cell a pass/fail/skip mark. (Rails has no Layer 3; render it as "n/a".)
- Layer 1: Structural. Per platform: pass + count. On failure, a findings table: `token | file:line | text excerpt`. Clean state shows an explicit "no leftover tokens" row.
- Layer 2: Runtime. Per platform: `command`, mode (`fast`/`build`), `exitCode`, `durationMs`. On failure, `stderrTail` in a native `<details>` (collapsed).
- Layer 3: Semantic (vision judge). Per platform (ios/android):
  - Stage 1: home-screen screenshot thumbnail + rubric table (`question | pass/fail | rationale`). Note "median of 3 samples per criterion" (`DEFAULT_SAMPLES = 3`).
  - Stage 2 (when `VISUAL=2`): scenario name, `stepsPassed/stepCount`, a screenshot filmstrip (`screenshots[]`), and the post-toggle Layer 3 scores (`layer3Scores`).
  - `error` rendered prominently when a platform's visual phase failed.
- Reviewer: Contract parity. `pass`/`fail` + `diffs[]` in a collapsed `<details>` when non-empty.
- Domain spec appendix. Rename plan table (`from → to`), entities and fields, so the reader sees what was generated and why the tokens changed.
- Footer / reproduce. The exact command to reproduce (`npx nativeapptemplate-agent "<spec>"` + the `NATIVEAPPTEMPLATE_VISUAL` value), and a pointer to `tmp/trace/*.log` for raw logs.

Forward-looking (optional, render only if present): a self-repair section showing the ≤5 iteration history (CLAUDE.md cap) — which layer failed, what was attempted, the delta. The model carries repairAttempts? so the renderer is ready when the loop is wired.
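The gate strip and the platform × layer matrix described above reduce to a small mapping from per-platform results to cell marks. The sketch below is an assumption-laden illustration: `PlatformRow` is a hypothetical shape in which Layer 3 is reduced to a single `pass` flag, and treating a missing Layer 3 on ios/android as "skip" is a guess at the intended behavior when the visual level is 0.

```typescript
type Platform = "rails" | "ios" | "android";
type Cell = "pass" | "fail" | "skip" | "n/a";

// Hypothetical minimal shape: the real layer3 type is richer than a pass flag.
type PlatformRow = {
  platform: Platform;
  layer1: { pass: boolean };
  layer2: { pass: boolean };
  layer3?: { pass: boolean };
};

// One matrix row: [Layer 1, Layer 2, Layer 3].
function matrixRow(row: PlatformRow): [Cell, Cell, Cell] {
  const mark = (pass: boolean): Cell => (pass ? "pass" : "fail");
  const layer3: Cell =
    row.platform === "rails" ? "n/a" : row.layer3 ? mark(row.layer3.pass) : "skip";
  return [mark(row.layer1.pass), mark(row.layer2.pass), layer3];
}

// Gate chip text such as "3/3" or "1/2": passed cells over applicable cells.
// "skip" and "n/a" cells are excluded from both counts.
function gateChip(cells: readonly Cell[]): string {
  const applicable = cells.filter((c) => c === "pass" || c === "fail");
  const passed = applicable.filter((c) => c === "pass").length;
  return `${passed}/${applicable.length}`;
}
```

Excluding "n/a" cells from the denominator is what yields `x/2` for Layer 3 while Layers 1 and 2 show `x/3`.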
New file src/report/model.ts. Field types reuse existing exports verbatim where possible.

```typescript
export type RunReport = {
  meta: {
    spec: string;
    slug: string;
    displayName: string;
    agentVersion: string; // package.json version
    judgeModel: string; // "claude-opus-4-7"
    visualLevel: 0 | 1 | 2; // from NATIVEAPPTEMPLATE_VISUAL
    startedAt: string; // ISO
    finishedAt: string; // ISO
    durationMs: number;
  };
  overallPass: boolean;
  summary: string; // existing one-line summary, preserved
  platforms: PlatformDetail[]; // rails, ios, android
  reviewer: {
    contractParity: "pass" | "fail";
    diffs: readonly string[]; // from ReviewerResult
  };
  domain: {
    renamePlan: readonly { from: string; to: string }[];
    entities: readonly {
      name: string;
      replaces: string;
      fields: readonly { name: string; type: string; references?: string }[];
      states?: readonly string[];
    }[];
  };
  repairAttempts?: readonly RepairAttempt[]; // future; render if present
};

export type PlatformDetail = {
  platform: "rails" | "ios" | "android";
  layer1: { pass: boolean; findings: readonly Layer1Finding[] }; // full findings, not a count
  layer2: {
    pass: boolean;
    command: string;
    mode: "fast" | "build";
    exitCode: number | null;
    durationMs: number;
    stderrTail?: string;
  };
  layer3?: VisualJudgePlatformReport; // ios/android only; reuse existing type (incl. .stage2)
};

export type RepairAttempt = {
  iteration: number; // 1..5
  failingLayer: "layer1" | "layer2" | "layer3" | "reviewer";
  platform?: "rails" | "ios" | "android";
  action: string;
  resolved: boolean;
};
```

`Layer1Finding` and `VisualJudgePlatformReport` are imported from existing modules, so nothing is duplicated. The report model is the only place these get aggregated.
runJudge already computes everything in PlatformDetail.layer1/layer2 inside evaluate(), but discards the detail. Change evaluate() to retain full Layer1Result.findings and full Layer2Result (not just findings.length and command), and have runJudge return them. Smallest viable shape: add an optional platforms?: PlatformDetail[] to JudgeResult (back-compatible — existing CLI/MCP consumers ignore it). dispatch() then assembles RunReport from domain + reviewer + the enriched JudgeResult + timing it records around the run.
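A rough sketch of the back-compatible widening and the timing that `dispatch()` would record follows. All types here are simplified stand-ins declared locally for illustration (the real ones live in the judge and report modules), and `runTiming` and `platformsOf` are hypothetical helper names.

```typescript
// Simplified stand-in for the report model's PlatformDetail.
type PlatformDetail = {
  platform: "rails" | "ios" | "android";
  layer1: { pass: boolean; findings: readonly unknown[] };
  layer2: { pass: boolean; command: string; durationMs: number };
};

// Widened result: the new field is optional, so existing consumers can ignore it.
type JudgeResult = {
  overallPass: boolean;
  summary: string;
  platforms?: PlatformDetail[];
};

type RunTiming = { startedAt: string; finishedAt: string; durationMs: number };

// Derive the meta timing fields from two clock readings taken around the run.
function runTiming(startMs: number, endMs: number): RunTiming {
  return {
    startedAt: new Date(startMs).toISOString(),
    finishedAt: new Date(endMs).toISOString(),
    durationMs: endMs - startMs,
  };
}

// A judge result without per-platform detail falls back to an empty list.
function platformsOf(judge: JudgeResult): PlatformDetail[] {
  return judge.platforms ?? [];
}
```

Passing the clock readings in as arguments, rather than calling `Date.now()` inside, keeps the helper deterministic for stub fixtures.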

```text
src/report/
├── model.ts     # RunReport, PlatformDetail, RepairAttempt
├── collect.ts   # assemble RunReport in dispatch; copy/embed screenshots; write report.json
├── render.ts    # renderReport(report: RunReport): string  ← pure, no I/O, no network
└── theme.ts     # inline CSS string (brand palette), reused by render.ts
```

- `render.ts` is pure: `RunReport → string`. No `fs`, no `Date.now()` inside (timestamps come from `meta`). This makes it golden-file testable and deterministic.
- `collect.ts` owns all I/O: read screenshots, base64-or-copy, write `report.json` and `validation-report.html`.
- Vanilla HTML + inlined CSS. Interactivity is limited to native `<details>`: no JS framework, no bundler, no external assets, no web fonts (system font stack, matching the social-preview SVG).

Reuse the social-preview palette for brand coherence (docs/social-preview.svg): dark base #1F2933→#0B69A3, accent #40C3F7/#2BB0ED, text #F5F7FA/#CBD2D9. Pass = green, fail = red, skip/n-a = muted gray. Responsive, print-friendly (@media print), single-column on narrow widths.
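To make the "pure renderer" constraint concrete, here is a deliberately tiny sketch of what such a function could look like. It is not the actual `render.ts`: `MiniReport` is a hypothetical cut-down `RunReport`, styling is omitted, and only the badge and the collapsed reviewer diffs are rendered.

```typescript
// Hypothetical, trimmed RunReport used only for this sketch.
type MiniReport = {
  meta: { slug: string; finishedAt: string };
  overallPass: boolean;
  reviewer: { contractParity: "pass" | "fail"; diffs: readonly string[] };
};

// Escape text before interpolating it into HTML.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Pure: the same input always yields the same string. No fs, no Date.now(), no network.
function renderReport(report: MiniReport): string {
  const badge = report.overallPass ? "PASS" : "FAIL";
  const diffs =
    report.reviewer.diffs.length === 0
      ? ""
      : `<details><summary>Reviewer diffs (${report.reviewer.diffs.length})</summary><ul>` +
        report.reviewer.diffs.map((d) => `<li>${escapeHtml(d)}</li>`).join("") +
        `</ul></details>`;
  return [
    "<!doctype html>",
    `<html lang="en"><head><meta charset="utf-8"><title>${escapeHtml(report.meta.slug)}</title></head>`,
    `<body><h1 class="${badge.toLowerCase()}">${badge}</h1>`,
    `<p>Finished ${escapeHtml(report.meta.finishedAt)}</p>`,
    diffs,
    "</body></html>",
  ].join("\n");
}
```

Since the timestamp comes from `meta`, rendering the same fixture twice produces byte-identical output, which is what a golden-file test relies on.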
index.ts, after dispatch():
- Always write `out/<slug>/report.json` + `validation-report.html`.
- Print the report path as a `file://` URL (clickable in terminals) alongside the existing summary lines.

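One way to build that clickable URL is Node's `pathToFileURL`, which handles absolute-path resolution and percent-encoding. This is a small sketch; `reportFileUrl` and its parameters are hypothetical names.

```typescript
import { resolve } from "node:path";
import { pathToFileURL } from "node:url";

// Build the file:// URL for the report under the given output root.
// pathToFileURL percent-encodes characters such as spaces.
function reportFileUrl(outRoot: string, slug: string): string {
  return pathToFileURL(resolve(outRoot, slug, "validation-report.html")).href;
}
```

Using the URL helper rather than string concatenation avoids malformed links when the working directory contains spaces or other special characters.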
Flags:

| Flag | Default | Effect |
|---|---|---|
| `--no-report` | off | Skip both artifacts |
| `--report-format=html\|json\|both` | `both` | Which to emit |
| `--report-embed=true\|false` | `true` | Inline screenshots vs. `report-assets/` |
| `--report-open` | off | Open the HTML after the run (macOS) |

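The flags above could be parsed with Node's built-in `parseArgs`. The following is a sketch under assumptions, not the actual `index.ts` wiring: `parseReportFlags` and `ReportOptions` are hypothetical names, and treating the first positional argument as the spec text is a guess.

```typescript
import { parseArgs } from "node:util";

type ReportFormat = "html" | "json" | "both";
type ReportOptions = { enabled: boolean; format: ReportFormat; embed: boolean; open: boolean };

function parseReportFlags(argv: string[]): { spec: string | undefined; report: ReportOptions } {
  const { values, positionals } = parseArgs({
    args: argv,
    allowPositionals: true, // assumption: the spec text is the first positional argument
    options: {
      "no-report": { type: "boolean" },
      "report-format": { type: "string" },
      "report-embed": { type: "string" },
      "report-open": { type: "boolean" },
    },
  });
  // Apply the documented defaults, then validate the allowed values.
  const format = values["report-format"] ?? "both";
  if (format !== "html" && format !== "json" && format !== "both") {
    throw new Error(`--report-format must be html, json, or both (got "${format}")`);
  }
  const embed = values["report-embed"] ?? "true";
  if (embed !== "true" && embed !== "false") {
    throw new Error(`--report-embed must be true or false (got "${embed}")`);
  }
  return {
    spec: positionals[0],
    report: {
      enabled: values["no-report"] !== true,
      format,
      embed: embed === "true",
      open: values["report-open"] === true,
    },
  };
}
```

`parseArgs` runs in strict mode by default, so an unknown flag throws instead of being silently ignored.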
MCP (src/mcp.ts): return report.json inline in the tool result and include the validation-report.html path, so a calling assistant gets structured pass/fail data plus a human-openable artifact.
dispatch() records startedAt/finishedAt to populate meta.durationMs (it currently records no timing).
- Golden-file render test: two fixture `RunReport`s (all-pass; mixed-fail with Layer 1 findings + Layer 2 `stderrTail` + reviewer diffs) → `renderReport` → snapshot the HTML. Pure function, no pipeline, fast.
- Smoke test: stub mode (`NATIVEAPPTEMPLATE_STUB_ALL=1`, the existing `runStubJudge` path) must produce a deterministic `RunReport` and write a valid `report.json` + non-empty HTML. Assert the file exists and the gate strip reflects the stub's PASS.
- Self-contained check: with `--report-embed=true`, assert the HTML references no `tmp/` paths and no `http(s)://` asset URLs (portability guarantee).
- Stub fixtures keep timestamps fixed so golden output is stable.

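The self-contained check above could be approximated with a regex scan over the rendered HTML. This is a heuristic sketch, not an HTML parser, and `findNonPortableRefs` is a hypothetical name; it looks only at `src`/`href` attributes and CSS `url(...)` targets.

```typescript
// Collect src/href/url(...) targets that would break portability:
// remote http(s) assets or anything pointing into tmp/.
function findNonPortableRefs(html: string): string[] {
  const refs: string[] = [];
  const pattern = /(?:src|href)\s*=\s*["']([^"']*)["']|url\(\s*["']?([^"')]*)["']?\s*\)/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const target = match[1] ?? match[2] ?? "";
    // Embedded images are fine; base64 payloads may contain "tmp/" by coincidence.
    if (target.startsWith("data:")) continue;
    if (/^https?:\/\//i.test(target) || /(^|\/)tmp\//.test(target)) {
      refs.push(target);
    }
  }
  return refs;
}
```

A test would then assert that the returned list is empty for a report rendered with `--report-embed=true`.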
1. `src/report/model.ts`: types.
2. Widen `runJudge`/`evaluate` to retain full Layer 1 findings + Layer 2 detail; add `platforms?` to `JudgeResult` (back-compatible).
3. `dispatch()`: record timing; assemble `RunReport`.
4. `src/report/render.ts` + `theme.ts`: pure renderer + golden test.
5. `src/report/collect.ts`: screenshot embed/copy + write files.
6. `index.ts` flags + `file://` output line; smoke-test assertion.
7. `src/mcp.ts`: return `report.json` + HTML path.

Steps 1–4 land the machine-readable artifact and the HTML for the common case; 5–7 are polish and surface integration.
Status: shipped. Steps 1–2 in #73, steps 3–7 in #74, and the CI exit-code + README docs in #75. The forward-looking `repairAttempts` section renders only once the self-repair loop is wired (separate work).
validation-report.html is self-contained (screenshots base64-embedded, no external assets), so the zero-tooling path is just to open it:

```shell
open out/<slug>/validation-report.html   # macOS
```

For agent-driven visual verification (having Claude open the report and capture a screenshot), use the Playwright MCP. This is a maintainer convenience for eyeballing the rendered output; it is not part of the agent's pipeline (device screenshots in Layer 2/3 come from mobile-mcp, not Playwright).
One-time setup (local scope — this project only, not committed):

```shell
claude mcp add playwright -- npx -y @playwright/mcp@latest
# then restart the Claude Code session so the browser_* tools load
# (mid-session adds register the server but don't inject tools into the running session)
npx playwright install chromium   # only if the first run reports a missing browser
```

Workflow, once a run has produced `out/<slug>/validation-report.html`:
- Ask Claude to open and screenshot the report.
- Under the hood it calls `browser_navigate` with a `file://` URL, then `browser_take_screenshot`:
  - `browser_navigate` → `file://<abs-path>/out/<slug>/validation-report.html`
  - `browser_take_screenshot` (use the full-page option to capture the whole report)
- The PNG is the visual check; `report.json` carries the structured pass/fail.

Heads-up — @playwright/mcp blocks the file: protocol by default, so browser_navigate against a file://…/validation-report.html fails with "Access to 'file:' protocol is blocked." The reliable workaround is to serve the output dir over a throwaway local HTTP server and navigate to that instead:

```shell
# from the run's output dir (out/<slug>/, or wherever the report was written)
python3 -m http.server 8765 --bind 127.0.0.1 &
# then: browser_navigate → http://127.0.0.1:8765/validation-report.html
```

The report is self-contained (screenshots base64-embedded), so a plain static server suffices: there are no asset paths to resolve. Stop the server when done. (Some `@playwright/mcp` builds can instead be launched with `file:` access permitted via config, but that's version-dependent; the local-server route always works.) Note also that the PNG saved by `browser_take_screenshot` lands in the MCP server's working directory (the project root), not the output dir; move it out afterward.
Remove it later with claude mcp remove playwright.
No session restart and no MCP registration — a one-off that's also suitable for capturing a CI artifact:

```shell
npx -y playwright@latest screenshot --full-page \
  "file://$(pwd)/out/<slug>/validation-report.html" \
  report.png
# if it errors on a missing browser:
npx -y playwright@latest install chromium
```

This needs no project dependency: `npx` fetches Playwright transiently. Don't add Playwright to `package.json`; the report itself has no runtime browser dependency, and keeping it out preserves the lean install.