diff --git a/.gitignore b/.gitignore index b22c8944..1c1dd2a3 100644 --- a/.gitignore +++ b/.gitignore @@ -219,3 +219,6 @@ uv.lock .dev.vars* !.dev.vars.example !.env.example + +# macOS +.DS_Store diff --git a/CLAUDE.md b/CLAUDE.md index 50232314..1a5473d8 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -67,11 +67,16 @@ src/lightcone/ # namespace — NO __init__.py ├── harness.py, sandbox.py, graders.py, build.py, report.py, models.py claude/lightcone/ # Claude plugin source — force-included into the wheel -├── skills/ # lc-new, lc-build, lc-verify, lc-migrate, lc-feedback +├── skills/ # lc-new, lc-from-code, lc-from-paper, +│ # lc-feedback, ralph; +│ # paper-reproduction bundle: lc-from-paper (entry), +│ # ralph (loop substrate), narrative, +│ # paper-extraction, figure-comparison, +│ # check-sentence-by-sentence +│ # (see skills/README.md for the full bundle map) ├── agents/ # lc-extractor -├── guides/ # astra-reference, lightcone-cli-reference, ui-brand ├── templates/ # Project CLAUDE.md template -└── scripts/ # Session hooks (bash): venv activation, validate-on-save, status display +└── scripts/ # Session hooks (bash): venv activation, validate-on-save, session-start primer tests/ # pytest — mirrors src/ structure pyproject.toml # hatchling + hatch-vcs, ASTRA + Snakemake as deps diff --git a/README.md b/README.md index b4c0db3d..7145b1aa 100644 --- a/README.md +++ b/README.md @@ -18,10 +18,12 @@ cd my-analysis claude ``` -Then tell the agent `/lc-new` to scope your research question. After the spec exists, just tell the agent to build it — implementation is a normal Claude Code workflow guided by `.claude/guides/`. +Then tell the agent what you have to start from — a research question (`/lc-new`), existing code (`/lc-from-code`), or a paper to reproduce (`/lc-from-paper`). After the spec exists, work with the agent however suits you; the substrate (`astra.yaml`, `lc run`, `lc status`, `lc verify`) keeps things in sync. 
## Skills +The `/lc-new` and `/lc-from-*` entry skills are parallel, split by what you start from: a question, code, or a paper. + ### `/lc-new` — Scope and specify an analysis Guides you from a research question to a complete `astra.yaml` specification through interactive conversation. The agent will: @@ -34,17 +36,21 @@ Guides you from a research question to a complete `astra.yaml` specification thr You don't write any code or YAML during this phase — the agent produces the full specification. -### `/lc-migrate` — Bring an existing project into ASTRA +### `/lc-from-code` — Bring an existing project into ASTRA Scans an existing codebase, drafts an `astra.yaml` that captures its inputs, outputs, and analytical decisions, parameterizes the code so decisions can vary across universes, and runs the analysis through `lc` until every output materializes. Existing logic is left intact — changes are confined to parameter plumbing. +### `/lc-from-paper` — Reproduce a published paper + +ORIENT-first driver for reproducing a published paper in ASTRA. ORIENT runs in the user's main session in seven stages — asks for the paper, runs `/paper-extraction` inline to acquire it, interviews the user (grounded in the paper), clones the reference code and runs `/lc-from-code` scan-only (if a repo exists), optionally follows up, then drafts a per-paper `constitution.md` (the ralph loop's driving document) and `CLAUDE.md` (auto-loading rules and accumulators) from the full paper-plus-code context for user review. ORIENT then hands the rest of the reproduction to a **ralph loop** whose iterations carry the long middle: ARCHITECT → SPECIFY → LITERATURE → IMPLEMENT → RUN → COMPARE. Each iteration runs in a fresh tmux session against the constitution; the fresh-context property between iterations is what makes per-phase review work. When the loop closes (constitution `status: closed` after COMPARE returns `pass`), REVIEW runs back in the user's main session. 
Composes a bundle of sibling skills (`ralph`, `paper-extraction`, `narrative`, `figure-comparison`, `check-sentence-by-sentence`). See [`claude/lightcone/skills/README.md`](claude/lightcone/skills/README.md) for the full bundle map. + ### `/lc-feedback` — Report a bug Files a GitHub issue against the right repo (ASTRA or lightcone-cli) with version info and error context auto-collected from your session. ### Building and verifying -There is no `/lc-build` or `/lc-verify` skill — building and verifying are part of the normal Claude Code workflow once `astra.yaml` exists. The agent reads `.claude/guides/lightcone-cli-reference.md` (workflow, commands, status meanings) and `.claude/guides/astra-reference.md` (spec syntax) and drives the build directly: write scripts under `src/`, run `lc run`, watch `lc status` until every output is `ok`, then `astra validate astra.yaml` and `lc verify` to confirm the spec is valid and the provenance chain is intact. +Once `astra.yaml` exists, you (or the agent) build it however suits you. The typical flow is `lc run` to materialize outputs, `lc status` to track progress, `astra validate astra.yaml` for spec validity, and `lc verify` for provenance integrity — agent-driven, ralph-looped, or hand-written, the `lc` substrate stays in sync. ## CLI Reference @@ -52,8 +58,6 @@ There is no `/lc-build` or `/lc-verify` skill — building and verifying are par The first `lc` invocation auto-creates `~/.lightcone/config.yaml` with `container.runtime: auto`. To pin a runtime or change other settings, edit the file directly. -**Extraction model:** Literature extraction subagents default to Sonnet. To change this, set `extraction_model:` in `~/.lightcone/config.yaml` (options: `sonnet`, `haiku`, or omit for inherit). 
- ### Project scaffolding ```bash diff --git a/claude/lightcone/scripts/session-start.sh b/claude/lightcone/scripts/session-start.sh index 23b812ec..121f7c9f 100755 --- a/claude/lightcone/scripts/session-start.sh +++ b/claude/lightcone/scripts/session-start.sh @@ -1,11 +1,12 @@ #!/bin/bash # SessionStart hook: surface a terse project status to the agent. # -# Reports validation status, materialization counts, and pointers to the -# canonical reference docs. Project name / decision count / universe count -# are intentionally omitted -- they are trivia the agent reads from -# astra.yaml and CLAUDE.md when needed, and they cost against the 10k -# additionalContext budget. +# Reports validation status, materialization counts, and a tight CLI +# primer so the agent knows what substrate commands exist and which +# reference skills carry the depth. Project name / decision count / +# universe count are intentionally omitted -- they are trivia the agent +# reads from astra.yaml and CLAUDE.md when needed, and they cost against +# the 10k additionalContext budget. input=$(cat) cwd=$(echo "$input" | jq -r '.cwd // empty') @@ -48,7 +49,13 @@ fi summary="$summary Materialization: ok=$ok_count stale=$stale_count missing=$missing_count alias=$alias_count -References: .claude/guides/astra-reference.md (spec) and .claude/guides/lightcone-cli-reference.md (CLI)." 
+Substrate CLIs (use --help on any): + lc init / lc run / lc status / lc verify / lc build / lc export wrroc + astra validate / astra paper add / astra universe generate + +Reference skills (invoke when the surface above isn't enough): + /astra — astra.yaml spec: decisions, prior_insights, findings, evidence, sub-analyses, narrative anchors + /lc-cli — lc workflow: spec-code invariant, status interpretation, failure diagnosis" if [ "$validation_ok" -ne 0 ]; then # tail rather than head -- the leading lines are success markers diff --git a/claude/lightcone/skills/README.md b/claude/lightcone/skills/README.md new file mode 100644 index 00000000..6d3a238d --- /dev/null +++ b/claude/lightcone/skills/README.md @@ -0,0 +1,43 @@ +# lightcone-cli skills + +Each subdirectory is one Claude Code skill: `SKILL.md` plus optional `references/`, `assets/`, and `scripts/`. `lc init` copies these into a project's `.claude/skills/` so they are discoverable to Claude Code sessions. + +## Project lifecycle skills + +| Skill | Role | +|---|---| +| `lc-new` | Scaffold a new ASTRA-shaped project from a research question. | +| `lc-from-code` | Bring an existing codebase into ASTRA — scan, spec, parameterize. | +| `lc-from-paper` | Reproduce a published paper in ASTRA (paper-reproduction bundle entry point — see below). | +| `lc-feedback` | Report bugs and feature requests upstream. | +| `ralph` | Author a constitution and run a ralph loop against it (authoring + launching + iterating in one skill). `lc-from-paper` uses this for the long middle of a reproduction; standalone for any other long-running work. | + +## Reference skills + +Not direct entry points — Claude invokes these (or other skills invoke them) to load reference content into the session. The session-start hook primes their names so they're discoverable from turn one. 
+ +| Skill | Role | +|---|---| +| `astra` | Reference for the `astra.yaml` spec: structure, decisions, options, prior insights, findings, evidence, sub-analyses, narrative anchors, composition mechanics. | +| `lc-cli` | Reference for `lc` workflow: commands, the Spec-Code Invariant, status interpretation, failure diagnosis, multiverse runs, WRROC export. | + +## Paper-reproduction bundle + +A self-contained toolkit for reproducing published papers in ASTRA. The bundle is co-located so a single `lc init` brings the full toolkit into a project — no plugin marketplace, no separate installs. + +| Skill | Role | +|---|---| +| [`lc-from-paper`](lc-from-paper/SKILL.md) | **Reproduction driver.** ORIENT-first; one pre-loop phase in the user's main session that asks for the paper, runs `/paper-extraction` inline, interviews the user (grounded in the paper), clones the reference code and runs `/lc-from-code` scan-only (when a repo exists), and drafts the per-paper `constitution.md` + `CLAUDE.md`. Then hands off to a ralph loop whose iterations carry the long middle: ARCHITECT → SPECIFY → LITERATURE → IMPLEMENT → RUN → COMPARE. When the loop closes (constitution `status: closed` after COMPARE returns `pass`), REVIEW runs back in the user's main session. Fidelity intent — captured as prose at ORIENT — is what every iteration reads when sizing its next move, and what COMPARE grades opportunities against. | +| [`ralph`](ralph/SKILL.md) | The loop substrate. `lc-from-paper`'s ORIENT invokes `/ralph`'s Authoring mode to draft the per-paper constitution; the loop launcher hands off after ORIENT lands. Each iteration runs `/ralph`'s Loop protocol against the constitution. | +| [`narrative`](narrative/SKILL.md) | Author the `narrative:` prose and decision `rationale:` in `astra.yaml`. Invoked by `lc-from-paper`'s ARCHITECT (for the structural narrative) and SPECIFY (for anchored content narrative). 
| +| [`paper-extraction`](paper-extraction/SKILL.md) | Turn an arXiv ID or DOI into a standardized `work/reference/` directory: structural index (figures, tables, outline, citations with resolved DOIs) plus a stub `astra.yaml` for the paper. Primary acquisition path for `lc-from-paper`'s ORIENT (Stage 2); also invoked per cited paper by LITERATURE. | +| [`check-sentence-by-sentence`](check-sentence-by-sentence/SKILL.md) | Audit paper claims against code locations (`file:line` or `NOT FOUND`). User-invoked only: `lc-from-paper`'s REVIEW close-out suggests it as an opt-in follow-up, and the user can also run it directly. | +| [`figure-comparison`](figure-comparison/SKILL.md) | Build a self-contained HTML side-by-side: original figures/tables/numerics vs replicated. Invoked from `lc-from-paper`'s REVIEW close-out (mandatory); also user-invokable directly. | + +The full reproduction story spans these skills. `lc-from-paper`'s `SKILL.md` names each by role and tells the agent when to invoke (or, for `check-sentence-by-sentence`, suggest) them; the siblings stand alone and never invoke `lc-from-paper`. + +### Why bundle (not depend on plugin install) + +- **Testability.** We want to verify `lc-from-paper` invokes its sibling skills correctly. That only works when all are in the same checkout. +- **Single install path.** `lc init` brings the full toolkit. Adding a separate plugin-marketplace step is friction we don't need. +- **Future consolidation is open.** The long-run shape may be that `astra` skills ship in `astra` and `lc` skills ship in `lightcone-cli`, plus a centralized external-skills list. Today: bundle it all. See [[lightcone/skills-location-policy]].
diff --git a/claude/lightcone/guides/astra-reference.md b/claude/lightcone/skills/astra/SKILL.md similarity index 92% rename from claude/lightcone/guides/astra-reference.md rename to claude/lightcone/skills/astra/SKILL.md index 1c4ffec7..e46a044e 100644 --- a/claude/lightcone/guides/astra-reference.md +++ b/claude/lightcone/skills/astra/SKILL.md @@ -1,3 +1,17 @@ +--- +name: astra +description: > + Comprehensive reference for the `astra.yaml` specification — top-level + structure, sub-analyses, inputs/outputs, decisions and options, prior + insights and findings, evidence and quote verification, narrative + anchors, and composition mechanics. Invoke whenever reading, writing, + validating, or debugging an `astra.yaml` spec; whenever working with + decisions, options, prior_insights, findings, or evidence; or whenever + the user asks about ASTRA schema, spec syntax, or sub-analysis + composition. +allowed-tools: Read, Glob, Grep, Bash(astra:*) +--- + # ASTRA Reference ## What an ASTRA Analysis Is @@ -100,6 +114,17 @@ A decision is a methodological choice where a different defensible option could Decisions may carry an optional `tags:` list for grouping (e.g. `[preprocessing]`, `[physics]`, `[stats]`). Keep the tag vocabulary **small and consolidated** -- reuse existing tags rather than minting new ones, since tags are mostly useful for cross-cutting views over a shared decision space, and that view fragments quickly when every decision invents its own label. +### Options + +Each decision must have at least one option. Options are `key: { ... }` entries: + +- `label:` (required) -- short human-readable name for compact rendering. +- `description:` (optional) -- longer prose explaining what the option means. +- `insights:` (optional) -- list of `prior_insights:` IDs that justify this option; back-references the supporting evidence (see [Prior Insights and Findings](#prior-insights-and-findings)). +- `excluded:` + `excluded_reason:` -- option considered but rejected. 
See [Constraints](#constraints). + +`label:` and `options:` are required on the decision itself. An aliased decision (one that points at another via `from: ../decisions.foo` -- see [Composition Mechanics](#composition-mechanics)) inherits both from its source and doesn't redeclare them. + ### Parameterization **Every decision must be parameterized in code** -- never hardcode a decision value. The recipe's `command:` template references it via `{decisions.}` (see [Command Template Substitution](#command-template-substitution)). @@ -173,7 +198,7 @@ Two kinds of insight, distinguished by direction: - **Prior insights** (`prior_insights:`) — knowledge from outside the analysis that informs decisions. From literature (by DOI) or artifacts from a prior/parent analysis. - **Findings** (`findings:`) — conclusions from the analysis itself, backed by its own output artifacts. -Both use the same Insight model: `id`, `label` (optional), `claim`, `created_at`, `evidence`, plus optional `derived` (true if synthesized/inferred from multiple sources), `scope` (applicability conditions), `tags`, `notes`. Placement determines direction. +Both use the same Insight model. Required: `id`, `claim`, `created_at` (ISO 8601 datetime — e.g. `"2025-02-01T14:00:00"`), `evidence`. Optional: `label`, `derived` (true if synthesized/inferred from multiple sources), `scope` (applicability conditions), `tags`, `notes`. Placement determines direction. Each evidence item has its own fields: `id`, exactly one of `doi` (literature) or `artifact` (output ID), and either a `quote` (TextQuoteSelector with required `exact`, optional `prefix`/`suffix`) or `location` (FragmentSelector with `value` like `"page=6"` and/or 1-indexed `page`). DOI evidence may add `version` (arXiv version). Artifact evidence may add `snapshot` (path to an immutable artifact copy) and `source_commit` (git commit that produced it). 
diff --git a/claude/lightcone/skills/check-sentence-by-sentence/SKILL.md b/claude/lightcone/skills/check-sentence-by-sentence/SKILL.md new file mode 100644 index 00000000..06c0211a --- /dev/null +++ b/claude/lightcone/skills/check-sentence-by-sentence/SKILL.md @@ -0,0 +1,369 @@ +--- +name: check-sentence-by-sentence +description: > + Sentence-by-sentence audit of a paper against an ASTRA project's code. For + every claim about implementation or results in the methodology, results, + discussion, and appendices, locate the corresponding code (file:line) or + mark NOT FOUND. Only the user can invoke this skill; during paper reproduction, other skills may suggest it as an optional follow-up but should not invoke it themselves. Run from the project folder containing astra.yaml. In lc-from-paper projects, read paper sources from + work/reference/: prefer arXiv TeX under work/reference/source/, fall back to + Docling/Pandoc markdown at work/reference/document.md. +argument-hint: "[path to paper source, e.g. work/reference/source/main.tex or work/reference/document.md]" +--- + +# /check-sentence-by-sentence + +Audit a paper against the code in this ASTRA project, sentence by sentence. +Every sentence that asserts an implementation detail or a numerical/empirical +result is located in the code (`file:line`) or marked NOT FOUND. The agent +does NOT run any code -- this is a static reading audit. + +In lc-from-paper projects, the paper substrate comes from `work/reference/`. +Path A is arXiv source at `work/reference/source/`; Path B is the parsed +markdown fallback at `work/reference/document.md`, produced by Docling or +Pandoc. + +## Setup + +1. **Confirm project root.** Read `astra.yaml` in the current working + directory. If it is missing, ask the user: + + > "I do not see an `astra.yaml` in the current directory. Please point me + > to the ASTRA project folder, or `cd` there and re-invoke."
+ + Stop until resolved. + +2. **Confirm paper source.** The user may have passed a path as an + argument. Resolve it in this order: + + 1. If the argument is a `.tex` file, use it in `tex` mode. + 2. If the argument is `work/reference/` or another directory, first look + for TeX source under `/source/`, then for `/document.md`. + 3. If no argument was supplied, prefer the lc-from-paper layout: + - `work/reference/source/
.tex` if TeX source exists. Identify the + main file with `grep -l '\\documentclass' work/reference/source/*.tex`; + if exactly one file matches, use it. If multiple files match, ask the + user which one is the main paper file. After identifying the main + file, expand its local `\input{...}` and `\include{...}` files before + section enumeration; many arXiv papers keep most prose outside the + main TeX wrapper. + - `work/reference/document.md` if there is no TeX source. This is the + Docling/Pandoc fallback and should be audited in `markdown` mode. + 4. Only after those lc-from-paper paths fail, look for an obvious legacy + `.tex` source in cwd: a top-level `*.tex`, or one inside `paper/`, + `tex/`, or a similarly named subdirectory. If exactly one obvious + candidate is found, use it in `tex` mode. + + If no usable source is found, ask: + + > "Which paper source should I audit? Please give me a `.tex` path or + > `work/reference/document.md`." + + If only `work/reference/paper.pdf` exists, ask the user to run the PARSE + phase first so `work/reference/document.md` exists. Do not audit PDFs + directly. + +## Section enumeration + +This is **your job in the main agent** -- do it carefully so each subagent +gets a precise line range. Do NOT read full section content; only enough to +identify boundaries. + +1. Enumerate sections according to source mode: + - In `tex` mode, first build the ordered audit source list. Start with the + main TeX file, scan it for local `\input{...}` and `\include{...}` paths, + normalize missing `.tex` suffixes, and include those files when they + exist under the same source tree. Recurse one level deeper when an + included file itself includes local TeX files. Ignore package/style + imports (`\usepackage`, `.sty`, `.cls`) and remote/generated files. If + the main file is mostly a wrapper, the leaf included files will carry + most audit units. 
+ - For every file in the TeX audit source list, use `grep -n` for + `^\\section`, `^\\subsection`, and `^\\appendix`. Record each match's + file path, line number, and label. + - In `markdown` mode, use `grep -n` for markdown headings + (`^#`, `^##`, `^###`, etc.) in `work/reference/document.md`. Treat + heading depth the way TeX treats section/subsection. If Docling emitted + unnumbered headings, use their text labels. +2. Get the file's total line count with `wc -l`. +3. Compute each section's line range: **start = the section's own line + number; end = (next section/subsection or same/lower heading-depth start + minus 1 in the same source file), or that source file's last line for the + final section in that file.** For a section that contains subsections, + each subsection's range runs from its own line to (next subsection + start − 1), and the section's pre-subsection prose (if any) becomes its + own audit unit covering (section line + 1) to (first subsection − 1) if + that span is non-trivial. +4. Mark sections appearing after `\appendix` (TeX) or after an `Appendix` / + `Appendices` heading (markdown) as appendices regardless of label. + +Identify the audit-relevant sections: + +- Methodology (often `Methods`, `Analysis`, `Data`, `Sample selection`) +- Results +- Discussion (often `Discussion and Conclusions`) +- Appendices (every section after `\appendix`) + +Skip Abstract, Introduction, Acknowledgements, References, author lists. + +For each retained section, check whether it has subsections. **Spin up one +subagent per leaf (sub)section** -- a section with subsections becomes one +subagent per subsection (plus optionally one for any pre-subsection prose +span); a section without subsections becomes one subagent for the whole +section. Spawn them all in a single message so they run in parallel. + +## Subagent prompt + +Use `Agent(subagent_type="general-purpose", ...)`. 
Pass each subagent: + +- The absolute path to the paper source file for this section +- The paper source mode: `tex` or `markdown` +- The exact section/subsection label and the line range in the source file + it covers (so it knows where to read) +- The absolute path to the project root (which contains `astra.yaml`) +- The instructions below, verbatim + +``` +You are auditing one (sub)section of a paper against an ASTRA project's +code. Your job is mechanical and exhaustive. + +INPUTS +- Paper source file: +- Source mode: +- Section: , lines - +- Project root: + +PROCEDURE +1. Read the assigned section of the paper. Split it into sentences using + common sense, not naive period-splitting. In `tex` mode, use TeX-aware + splitting; in `markdown` mode, preserve Docling/Pandoc math blocks, + captions, and headings as source text. Treat `e.g.`, `i.e.`, `et al.`, + `Fig.`, `Eq.`, `Sec.`, `Dr.`, decimals (`0.5`), inline math `$...$`, + and citation commands (`\citep{...}`, `\citet{...}`) as part of the + surrounding sentence, not boundaries. Display equations belong to + whichever sentence introduces them. +2. For each sentence, decide using common sense: does it make a concrete + claim about an IMPLEMENTATION DETAIL (a method, parameter, threshold, + formula, data cut, model choice, sample definition, algorithmic step) + or a RESULTS DETAIL (a numerical value, plot, fitted parameter, + statistical outcome)? If neither -- pure motivation, citation prose, + or generic framing -- skip it. +3. Before searching, **read `astra.yaml` once** -- it is a pre-built + paper↔code map maintained by the project. Harvest specifically: + - `narrative.methods` — links paper methodology concepts to decision + IDs (e.g. 
paper prose "the chosen " → `#decisions.`) + - `narrative.findings` — links paper claims/values to result anchors + - `prior_insights` (if present) — extracted paper quotes already tied + to decisions + - per-decision `evidence` quotes and `description` fields + Treat these as your translation table: paper prose → decision/output + IDs → script files. Do not re-derive what the spec already encodes. + + For everything not covered by the spec, use common sense to translate + concepts. In general: + - A quality cut stated as a ratio or threshold may appear in code + under an inverted form or a different variable name -- map by + meaning, not by symbol. + - A named model or distribution will usually appear as a function + whose name describes its shape or role, not as the paper's prose + phrasing. + - A cited constant from a referenced paper will usually appear as a + module-level constant or as an option value in a decision. + Grep for the underlying concept, not just the paper's wording. +4. For every claim-bearing sentence, search the project code (`scripts/`, + source files, `universes/`, `astra.yaml`, `results/`) for where the + claim is implemented or computed. Use Grep, Glob, and Read. +5. Record one of: + - (quote, path/file.py:LINE, optional <10-word note) + when the sentence's claim is implemented or computed at that location + - (quote, NOT FOUND, optional <10-word note) + when no implementation or matching computation is present + +CONSTRAINTS +- Do NOT run any code. No Bash beyond ls/grep/find/wc for searching. +- Do NOT read the paper outside the assigned line range. +- Quote the sentence verbatim, trimmed to a single sentence. If the + sentence is long, you may include just the claim-bearing clause but + preserve enough text to identify it. +- file:line should point to the most specific line that implements or + states the claim (the function call, parameter assignment, or computed + value -- not just the file). +- Notes must be under 10 words. 
Use them for nuance like "approximate + match", "different constant", "implemented but commented out", + "value computed at runtime, not statically comparable", "produced as + figure but printed value not stored". +- For numerical results that the paper states as a final number, point + at the line that computes the value and use a note like "value + computed at runtime" -- you cannot verify numerical agreement without + executing code, and that is fine. + +OUTPUT +Return a JSON-ish list, one entry per sentence, in paper order: + +[ + {"quote": "...", "location": "scripts/foo.py:142", "note": "..."}, + {"quote": "...", "location": "NOT FOUND", "note": "..."}, + ... +] + +Return nothing else. +``` + +## Aggregation + +When all subagents return, you receive raw entries from every claim-bearing +sentence each subagent kept. **Do not just concatenate and print them.** +Two filtering passes happen here, in this order: + +### Pass 1 — drop non-computational sentences + +Subagents are deliberately generous about what they keep, so the raw list +contains a long tail of sentences that quote the paper but do not actually +correspond to anything you would expect to find in code. **Drop any entry +whose sentence is:** + +- **Framing / motivation** — sentences whose job is to set up the next + step, e.g. "the first step is...", "to investigate this...", "we want + to look at...", "for this reason..." +- **Citation prose / literature comparison** — sentences that compare to + or quote prior literature, e.g. "agrees with values typical of previous + measurements...", "much like Author+YYYY they show...", "in particular, + Author found ..." +- **Theoretical framing or derivations** — sentences asserting a property + expected from theory rather than implemented in code, and restatements + of textbook identities used only to introduce the next equation +- **Rhetorical / interpretive claims** — qualitative readings of a + figure or trend, e.g. 
"the trend clearly has an oscillatory + behaviour", "the trend seems to be independent of ", "this + supports that..." +- **Conclusions / justifications / qualitative observations** — + "thus we conclude that...", "we choose not to include this + because...", "by and large the trends are similar" +- **Future work / speculation** — "this could be improved by...", "the + discrepancy could be explained by..." +- **Forward/backward references with no claim** — "we discuss this in + Sec X below", "as described in Sec Y above" +- **NOT FOUND entries that fall in any of the above categories** — most + framing/motivation sentences will land as NOT FOUND because there is + nothing to find. Drop them silently; they are noise, not gaps. + +Keep an entry only if it asserts something a reader would expect to be +implemented or computed: a parameter value, a cut, a formula, an +algorithmic step, a fitted/measured value, a figure that the project +should produce, a sample size after a specific cut. + +When in doubt about a NOT FOUND, ask: "if this sentence is not in the +code, is that a real gap?" If no, drop it. + +### Pass 2 — deduplicate / merge near-duplicates + +Subagents do not see each other, and the same claim is often restated +across sentences within a (sub)section -- e.g. a prose statement of a +cut followed by a sentence asserting "this is the only cut we make", or +two sub-equations of one larger formula that map to the same line. +Collapse these: + +- If two adjacent sentences make the same claim and resolve to the same + `file:line`, keep one entry whose quote is the more specific or + formula-bearing of the two, and append the other in a short + parenthetical only if it adds information. +- If a paper-text claim and an explicit equation/quoted code map to the + same line, prefer the equation/quoted-code form. +- Do not merge across (sub)sections. 
+- Do not merge if the two sentences resolve to different `file:line` + locations -- they may look similar but are doing different things. + +### Pass 3 — render + +After filtering and deduplication, present the result to the user as +markdown, organized by section -> subsection -> sentence, in paper order: + +``` +# Sentence-by-sentence reproduction audit + +Paper: +Project: + +##
+ +### (omit if no subsections) + +- "" + → ✅ `scripts/foo.py:142` -- + +- "" + → ❌ NOT FOUND -- +- ... +``` + +Use `→ ✅ \`file:line\`` for found entries and `→ ❌ NOT FOUND` for +missing ones. Notes are optional; only include the trailing `-- ` +when the subagent supplied one. + +End with a one-line summary: + +> N sentences audited across M sections. K implemented, J not found. + +### Follow-up suggestion (conditional) + +After the summary, scan the NOT FOUND entries and **cluster them**. A +cluster is a group of NOT FOUND sentences that all relate to the same +missing piece of work (a missing analysis, a missing diagnostic, an +unimplemented model variant) -- usually a few consecutive sentences in +one (sub)section, or sentences that all reference the same concept across +sections. + +**Only emit the follow-up block if there is at least one major +unimplemented cluster** -- a cluster of genuine missing computation +substantial enough to be worth offering to add (rule of thumb: ≥3 +sentences of related missing-computation claims, or a single +heavyweight missing artifact like an entire missing analysis or +figure). If every NOT FOUND is isolated framing, motivation, or +qualitative interpretation -- or if the only clusters are tiny -- stop +after the one-line summary. Do not pad with a follow-up just to have +one. + +When the threshold is met, write a short follow-up block in this shape: + +> Major unimplemented clusters: (1) `` +> (`<§section>`, ~`` sentences), and (2) ` cluster 2>` (`<§section>`, ~`` sentences). The rest of the NOT +> FOUND entries are pure framing/motivation/qualitative interpretation, +> not computational claims. Worth considering as a follow-up if you +> want full coverage — want me to add `` and +> ``? + +Rules for this block: +- Only call out clusters that look like genuine missing computation, not + rhetoric. +- Keep it to 1–3 clusters. Do not enumerate every NOT FOUND entry. 
+- The closing offer must name **concrete artifacts** the user could add + (a new output ID, a new script filename, a new decision option, a new + figure) -- not vague promises like "fill in the gaps". +- Cite the section reference in the project's own notation (`§2.1`, + `Appendix B`, etc.) and an approximate sentence count. +- One short paragraph; do not pad. + +## Restrictions + +- You MUST NOT run project code, recipes, or `lc run`. This is static. +- You MUST NOT read the paper source wholesale into the main context; + delegate to subagents. +- You MUST NOT modify any project file. Read-only. +- You MUST NOT fabricate `file:line` locations -- if a subagent's location + looks suspicious, ask it to re-verify rather than guessing. +- You MUST spawn one subagent per leaf (sub)section, in parallel. + +## Anti-patterns + +- **Auditing intro/abstract** -- skip narrative-only sections; only + methodology, results, discussion, and appendices. +- **Bundling sentences** -- one entry per sentence. Do not collapse + multiple claims into one row even if they share a citation or location. +- **Vague locations** -- a bare filename (`scripts/foo.py`) is not + enough; a line number is required for found entries. +- **Long notes** -- the 10-word cap is a hard limit; reserve notes for + signal, not commentary. +- **Running code to verify** -- this skill is a reading audit. If a claim + cannot be verified by reading code alone, mark it found at the + computing line and note "value computed at runtime" rather than + executing anything. 
diff --git a/claude/lightcone/skills/figure-comparison/SKILL.md b/claude/lightcone/skills/figure-comparison/SKILL.md new file mode 100644 index 00000000..0c5247ce --- /dev/null +++ b/claude/lightcone/skills/figure-comparison/SKILL.md @@ -0,0 +1,578 @@ +--- +name: figure-comparison +description: > + Build a self-contained HTML report comparing the figures, tables, and + numerical results in lc-from-paper's `work/reference/` paper substrate + against artifacts produced under `results//`. When + `comparison-report.yaml` or `targets/targets.md` exists, use that scoped + target set first; otherwise fall back to paper-driven inventory from arXiv + TeX or Docling/Pandoc artifacts under `work/reference/`. Images are + base64-embedded; missing matches are flagged. Use when the user says + "compare results", "side-by-side comparison", "build comparison HTML", or + "did we reproduce the paper". Run from the project folder containing + astra.yaml. +argument-hint: "[path to paper reference dir, e.g. work/reference/]" +--- + +# /figure-comparison + +Generate a single self-contained HTML report (`.lightcone/comparison.html`) +that places paper reference artifacts from `work/reference/` on the left +and the project's reproduced artifacts from `results//` on the +right, with red flags wherever a counterpart is missing. Images are embedded +as base64 so the HTML is portable. The helper script and intermediate +manifest also live under `.lightcone/` so they don't pollute the baseline +results. + +## Setup + +1. **Confirm project root.** Read `astra.yaml` in the cwd. If missing, ask: + + > "I do not see an `astra.yaml` here. Please `cd` to the ASTRA project + > and re-invoke." + + Stop until resolved. + +2. **Confirm results exist.** Default universe is `baseline`, unless + `comparison-report.yaml` names reproduced files under another universe or + the user supplied a universe explicitly. Check `ls results//`. 
+ If the directory is missing or empty, ask: + + > "I cannot find populated results under `results//`. Build the + > universe first (`lc run --universe ` or equivalent), then + > re-invoke." + + Stop. Do NOT attempt to run the pipeline yourself -- this skill is + read-only over the build artifacts. + +3. **Locate the paper reference substrate.** The user may have passed a + path. Resolve it in this order: + + 1. If the argument is a directory containing `metadata.json`, + `document.md`, `figures/`, or `tables/`, use that directory as the + paper reference root. + 2. If the argument is an arXiv source directory containing `.tex` files, + use it as `source_root`, and use its parent `work/reference/` as the + paper reference root when that parent exists. + 3. If no argument was supplied, prefer lc-from-paper's layout: + - `work/reference/source/` when arXiv TeX source exists. Use the TeX + files there for labels/captions and the parsed artifacts under + `work/reference/{figures,tables,metadata.json}` for renderable + reference files. + - `work/reference/document.md` plus + `work/reference/{figures,tables,metadata.json}` when no TeX source + exists. This is the PDF + Docling fallback from lc-from-paper. + 4. Only after lc-from-paper paths fail, look for a legacy unzipped arXiv + dir in cwd: a directory containing both a `*.tex` file and figure + files (`*.pdf`, `*.png`, `*.eps`). Common names: `paper_source/`, + `arxiv_source/`, `*_Original_Paper/`. + + If no usable reference substrate is found, ask: + + > "Where is the paper reference directory? In a lc-from-paper project this + > should usually be `work/reference/`, containing `document.md`, + > `metadata.json`, and extracted `figures/` / `tables/`." + + If only `work/reference/paper.pdf` exists, ask the user to run the PARSE + phase first so Docling or the TeX parser populates `work/reference/`. + Do not compare directly against a whole PDF. 
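The Setup step 3 lookup order can be sketched in Python as below. This is a minimal illustration, not part of the skill: `resolve_reference` is a hypothetical name, it returns a `(reference_root, source_root)` pair where `(None, None)` means "ask the user", and it assumes two things the skill does not state: that an argument matching neither rule 1 nor rule 2 falls through to the default lookups, and that rule 1 also picks up a `source/` TeX subdirectory when one exists.

```python
from pathlib import Path
from typing import Optional, Tuple

MARKERS = ("metadata.json", "document.md", "figures", "tables")
FIGURE_GLOBS = ("*.pdf", "*.png", "*.eps")
Result = Tuple[Optional[Path], Optional[Path]]  # (reference_root, source_root)


def _has_tex(d: Path) -> bool:
    return d.is_dir() and any(d.glob("*.tex"))


def resolve_reference(arg: Optional[str], cwd: Path) -> Result:
    default_ref = (cwd / "work" / "reference").resolve()
    if arg:
        p = (cwd / arg).resolve()
        # Rule 1: an already-parsed reference directory.
        if p.is_dir() and any((p / m).exists() for m in MARKERS):
            src = p / "source"
            return p, (src if _has_tex(src) else None)
        # Rule 2: an arXiv source dir; its parent is the reference root
        # only when that parent is the project's work/reference/.
        if _has_tex(p):
            return (p.parent if p.parent == default_ref else None), p
    # Rule 3: lc-from-paper layout, TeX source first, then Docling/Pandoc.
    if _has_tex(default_ref / "source"):
        return default_ref, default_ref / "source"
    if (default_ref / "document.md").is_file():
        return default_ref, None
    # Rule 4: legacy unzipped arXiv dir in cwd with .tex plus figure files.
    for d in sorted(cwd.iterdir()):
        if _has_tex(d) and any(any(d.glob(g)) for g in FIGURE_GLOBS):
            return d.resolve(), d.resolve()
    # Nothing usable (e.g. only work/reference/paper.pdf): ask the user.
    return None, None
```

Note that a lone `work/reference/paper.pdf` resolves to `(None, None)`, which is where the skill asks the user to run the PARSE phase rather than comparing against the whole PDF.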
+ +## Phase 1 -- Understand the paper's main results + +Read, in this order: + +1. **Scoped comparison artifacts, if present.** + - If `comparison-report.yaml` exists, treat it as the highest-priority + scope because it records what lc-from-paper actually compared. Use its + `outputs:` entries, including `type`, `priority`, `paper_value`, + `reproduced_value`, `reference_file`, `reproduced_file`, `match`, and + `notes` when present. + - Else if `targets/targets.md` exists, treat it as the scope ledger. Use + only the targets it names, including out-of-scope notes, priorities, + reference paths, expected values/trends, and output/spec-home pointers. + - If neither file exists, use the default paper-driven flow below and + build a best-effort report from `astra.yaml` plus `work/reference/`. + +2. **`astra.yaml`** -- specifically `narrative.summary`, `narrative.outputs`, + `narrative.findings`, `outputs:`, and `findings:` if present. Use it to + map scoped targets to output IDs and to harvest declared findings. Do not + assume ASTRA outputs have a dedicated filename-hint field; result paths + come from the output ID and the result resolver in Phase 2. + +3. **The paper reference substrate**, in this order: + - Read `work/reference/metadata.json` when present. It is the primary + index for paper figures and tables; its paths are relative to + `work/reference/` and usually point into `figures/` or `tables/`. + - If `work/reference/source/` exists, grep its TeX files for + `\includegraphics`, `\label{fig:...}`, `\caption{...}`, and + `\begin{table}` to recover labels/captions that metadata may have + missed. + - If only `work/reference/document.md` exists, use the markdown plus + `metadata.json` as the source of captions, table text, and in-text + numerical claims. This is the Docling/Pandoc fallback; preserve its + line numbers and do not pretend it is TeX. 
+ - Grep the abstract, results, and discussion sections of the TeX or + markdown source for in-text numerical claims that look like primary + results -- typically a quantity with value + uncertainty (e.g. + `$X = a \pm b$ unit`). Prefer values that `astra.yaml`'s `findings:` + already names; do not try to extract every number in the paper. + + Do NOT read the paper wholesale. For long papers (>500 lines), read + only the abstract, results, and discussion sections. + +If the paper is large or has many sections and neither `comparison-report.yaml` +nor `targets/targets.md` exists, **delegate the figure / table / value +enumeration to a single subagent** with +`subagent_type="general-purpose"` -- pass it the paper path, the output +schema below, and ask it to return only the inventory. One subagent is +enough; do not fan out. Multiple subagents would have to re-read the +same file. + +## Phase 2 -- Build the comparison manifest + +Produce a manifest in memory (you'll write it as JSON in Phase 3) with +three sections: `figures`, `tables`, `values`. Each entry pairs a +paper-side artifact with a project-side artifact. + +Build entries in this priority order: + +1. **From `comparison-report.yaml` if present.** One manifest entry per + `outputs.` item. Use `type` to route it to `figures`, + `tables`, or `values`. Use `reference_file` as the paper-side path and + `reproduced_file` as the project-side path when present. Preserve the + report's `paper_value`, `reproduced_value`, `match`, and `notes` in the + manifest so the HTML reflects the completed COMPARE verdict. +2. **Else from `targets/targets.md` if present.** One manifest entry per + in-scope target. Use each target's reference path under `targets/`, its + expected values/trends, and its output/spec-home pointer. If the ledger + marks a target out of scope, omit it from the HTML unless the user asked + for out-of-scope targets too. +3. 
**Else use the default paper-driven inventory.** Enumerate figures, + tables, and values from `astra.yaml` plus `work/reference/`, and fall back + to filename-stem similarity only when no scoped ledger exists. + +For project-side result paths, resolve every output ID with this order: +- Use an explicit `reproduced_file` from `comparison-report.yaml` or an + explicit reproduced path/glob from `targets/targets.md`, if present and + the file exists. +- Search for flat files at `results//.` with the + first suitable type-specific extension: images (`.png`, `.jpg`, `.jpeg`, + `.pdf`, `.eps`), tables (`.csv`, `.parquet`, `.md`, `.txt`), values + (`.json`, `.yaml`, `.yml`, `.txt`, `.md`). +- If still unmatched and no scoped ledger exists, fall back to filename-stem + similarity within `results//`. +- If no match is found, use `project_path: null` and render a red + `NOT PRODUCED` panel. Do not include unrelated result files; the report is + target-driven when target/report files exist, and paper-driven otherwise. + +For tables: use `work/reference/metadata.json` and `work/reference/tables/` +when present. If TeX source exists, capture the raw LaTeX of the `tabular` +block and any `\caption{...}`. If only `work/reference/document.md` exists, +capture the Docling/Pandoc markdown table or the extracted table artifact +under `work/reference/tables/`. The project side is whatever artifact +carries the same content -- typically a CSV / parquet / markdown file at +`results//.`. If `astra.yaml` declares no matching +output, use `project_path: null`. **If the paper contains no tables at all, +leave the manifest's `tables` list empty; the helper must omit the entire +Tables section from the HTML in that case (no header, no "no tables" +placeholder).** + +For values: each entry is `{name, paper_value, paper_uncertainty?, +paper_unit?, paper_quote, project_value?, project_uncertainty?, +project_value_source?}`, matching the manifest schema in Phase 3. Pull +`paper_value` from the in-text claim or `astra.yaml`'s +`findings.*.paper_value`.
Pull `project_value` from +`astra.yaml`'s `findings.*.replicated_value` if present, otherwise from +a scoped `comparison-report.yaml` entry or a flat result summary file at +`results//.` that you can read statically. +**Never compute or re-derive values yourself.** If no project value can +be located statically, leave it null and flag in the HTML. + +When `comparison-report.yaml` or `targets/targets.md` exists, the values list +is scoped to that file. Otherwise, be exhaustive about values, not selective. +A common failure mode is the values section ending up with only 1--3 entries, +which makes the report feel thin. Aim for **every** numerical claim that the +paper asserts and the project tracks. Concretely, harvest from: +- Every entry under `findings:` in `astra.yaml` -- one manifest entry + per finding, even when several findings share a parent quantity. +- The paper's abstract: every ` ± ` it reports. +- The paper's results and discussion sections: every fitted parameter, + every feature location ("dip near x = X₁", "peak at x = X₂"), every + reported sample size after a specific cut, every bin width or step + used as a result-defining choice, every reported accuracy / score / + metric. +- Any explicit reproduction targets in `astra.yaml`'s `narrative.findings`. + +It is fine to repeat one quantity in multiple manifest entries when the +paper reports it under different conditions (preliminary vs. final, +per-subset, per-bin median, per-method variant). Each condition is its +own row. Feature locations are values too: encode "feature located at +domain coordinate X" as +`{name: "", paper_value: "", paper_unit: +""}`. **Target ≥6 value entries on a typical paper.** If you end +up with fewer than 4, you are filtering too aggressively -- re-read +`astra.yaml`'s `findings:` and the paper's results section. 
+ +## Phase 3 -- Generate the HTML + +Use a small Python helper rather than embedding base64 inline through +your tool calls -- multi-MB image base64 strings would balloon your +context. + +Use the existing `.lightcone/` directory in the project root. Do not create +directories in this skill. All three files this skill writes -- manifest, +helper, and final HTML -- live there. + +1. **Write the manifest** as JSON to + `.lightcone/comparison_manifest.json`. Schema: + + ```json + { + "project_name": "...", + "paper_path": "work/reference/document.md", + "scope_source": "comparison-report.yaml", + "universe": "baseline", + "results_path": "results/baseline", + "figures": [ + { + "paper_label": "fig:main_result", + "paper_caption": "...", + "paper_path": "targets/main_result.pdf", + "project_output_id": "primary_metric_plot", + "project_path": "results/baseline/primary_metric_plot.png" + } + ], + "tables": [ + { + "paper_label": "tab:summary", + "paper_caption": "...", + "paper_latex": "\\begin{tabular}{...}\\end{tabular}", + "project_output_id": "...", + "project_path": "results/baseline/summary_table.csv" + } + ], + "values": [ + { + "name": "primary_metric", + "paper_value": "12.5", + "paper_uncertainty": "0.4", + "paper_unit": "", + "paper_quote": "we find $\\mathrm{metric} = 12.5 \\pm 0.4$ ", + "project_value": "12.47", + "project_uncertainty": "0.41", + "project_value_source": "results/baseline/metric.json" + } + ] + } + ``` + + `figures`, `tables`, and `values` may each be `[]`. Empty lists mean + the helper skips that section entirely. There is no + `unmatched_baseline` field -- baseline files the paper does not + reference are not in scope for this report. + + Use `null` for any missing field. Paths are relative to the project + root. + +2. **Write the helper script** to `.lightcone/build_comparison.py`. + The helper must: + - Read the manifest JSON. + - For each figure entry: emit one `
` per figure, + with the structure described in **"Required HTML structure"** + below -- a single `
` containing a + `
` and one row-level status badge, followed + by a `
` of two `
`s + (paper, project). One badge per row, in flow inside `.row-head`. + **Never emit per-cell absolutely-positioned badges.** + Read `paper_path` and `project_path` as bytes, base64-encode, and + embed each image inside its cell. **PDFs must be converted to PNG + before base64-encoding -- never embed PDFs as PDF data URIs.** Use + `` uniformly for every + figure cell. Conversion order to try, falling back if a tool is + unavailable: + 1. `pdf2image` (Python) -- `convert_from_path(path, dpi=150, first_page=1, last_page=1)[0]` + 2. `pypdfium2` -- render page 1 at 150 DPI to a PIL image + 3. shell out to `pdftoppm -png -r 150 -f 1 -l 1 ` + and read the resulting PNG + 4. shell out to `magick -density 150 [0] ` (ImageMagick; + `-density` must come before the input file so it sets the + rasterization resolution) + If none are available, the helper renders a small ⚠️ panel that + says `PDF preview unavailable -- install pdf2image or pdftoppm` + and links to the `.pdf` file path. Do not fall back to embedding + the PDF binary. PNG / JPG inputs skip conversion and are + base64-encoded directly. For any non-image type, embed as a + UTF-8 text block. Missing path → render a red panel saying + `❌ NOT PRODUCED` with the expected output ID. Captions live as + `
` inside each cell, never as a row-spanning element. + - For each table entry: paper side renders the captured LaTeX inside + `
` plus the caption; project side renders the project file
+     (CSV/parquet → first ~20 rows as an HTML table; markdown → render
+     as `
`; missing → red ❌ panel). Same row structure as figures.
+   - For each value entry: emit one `
` + per value -- **same card layout as figures, not a ``.** + The row has a `.row-head` (value name + single status badge), + a `.row-grid` of two `.cell`s (paper | project), and a trailing + `.value-note` with the σ delta. The paper cell shows the value + (with uncertainty and unit) and the `paper_quote` as a + `
`. The project cell shows the value and the + `project_value_source` as a small `` line. Compute a simple + status -- ✅ if both values exist and the project value lies within + ±1 paper-uncertainty of the paper value; ⚠️ if both exist but + disagree by more than that; ❌ if either is missing. If + `paper_uncertainty` is null, fall back to a 5% tolerance: ✅ if + `|project − paper| ≤ max(0.05·|paper|, 0.05)`, otherwise ⚠️. Do + NOT do anything more sophisticated: the helper only compares the + values already recorded in the manifest and never re-derives them. **Do not + render values as a single HTML `
`** -- the report's whole + point is side-by-side cards. + - Emit a single self-contained HTML file with inline CSS in the + **Vellum** aesthetic (see below): the `` carries the + parchment background and grain, and **all content lives inside a + single `
` that is the lighter `--surface` cream + card with soft drop shadows.** This is non-negotiable -- the cream + page card on top of the parchment body is the headline visual. Two + content columns (paper | project) per row, the project name in the + `

`, and a top-of-page summary line counting found / missing + for each non-empty section. **Skip any section whose manifest list + is empty** -- omit its header and content entirely; do not emit a + "no tables found" placeholder. + - Write the HTML to `.lightcone/comparison.html` and print the + absolute path on stdout. + +### Required HTML structure (figures and values) + +The helper MUST produce this exact shape for every figure / value row. +Per-cell absolute badges, value-as-table, and missing `.row-head` are +all forbidden -- they break the layout (overlapping the cell heading, +losing the row-level status, breaking the visual rhythm with figures). + +```html +
+
+
+ fig:main_resultprimary_metric_plot +
+ ✅ matched +
+
+
+
PAPER
+ +
Caption from paper.
+
+
+
PROJECT · results/baseline/...
+ +
output_id
+
+
+ +
Δ = 0.03 <unit> (0.07σ)
+
+``` + +Status states for the row badge: `badge-ok` (matched), `badge-warn` +(partial / off-target / no σ), `badge-miss` (missing on either side). +Exactly one badge per row. + +3. **Run the helper:** `python3 .lightcone/build_comparison.py` + from the project root. If `python3` is missing, try `python`. If + the helper imports anything beyond the standard library (e.g. + `pyarrow` to read parquet, or `pandas` to render tables), have it + gracefully fall back to "preview not available -- file exists at + ``" rather than failing. The helper must work with stdlib + alone for the figure path; the parquet / pandas previews are + nice-to-haves. + +4. After the helper runs, **read back** the HTML's first ~50 lines and + the absolute file size to verify it was produced and isn't trivially + small (>10 KB sanity check). Then report to the user the path and a + one-line summary: + + > Comparison HTML at `.lightcone/comparison.html` -- N figures + > (K matched, J missing), N tables (...), N values (...). + +## Vellum aesthetic + +The helper must style the page in the **Vellum** aesthetic: a +weathered-parchment look that reads like a printed scientific paper, +not a web app. The helper bakes all of this into inline `