Structured artifacts from astra paper add: figures, tables, and a findings ledger under work/reference/ #82

@cailmdaley

Background

The paper2astra → lightcone-cli skill migration (lightcone-cli#86) preserves all functionality but surfaced an ingestion gap: the original parse_paper.py produced structured figure/table metadata that the new skill doesn't reproduce, and there's no input-side mechanism for the COMPARE phase to verify reproduction outputs against the paper's claimed numerical results.

This is complementary to #81 (substrate convergence). #81 picks the parser; this issue specifies what comes out of it.

Proposal

astra paper add <DOI> produces, under work/reference/:

paper.pdf           # original
paper.tex           # main source (when arXiv source available)
figures/            # files copied from LaTeX source where available
                    #   metadata.json: caption, label, page, source-path
                    #   Docling render-and-crop only as PDF-only fallback (per #81)
tables/             # \begin{table} blocks extracted by label, with caption
findings.yaml       # the paper's own numerical findings (see below)
code/               # cloned reference repo when available (Zenodo / GitHub link)

Every consumer (paper2astra, standalone reproductions, commentary tools) uses these artifacts as-is, without re-parsing the paper. paper2astra no longer needs to care how the paper information was obtained.

The findings artifact — input-side back-pressure

The most interesting piece. Today, COMPARE phases of reproductions rely on agents eyeballing figures and reading prose to decide whether their numerical results match the paper's. Without a structured surface to diff against, agents drift toward plausible-but-wrong values. (Nolan's figure-comparison skill in lightcone-cli#86 helps for figures; check-sentence-by-sentence helps for prose. There's no structured numerical surface yet.)

A structured ledger of the paper's published values would be a complementary forcing function. Example shape:

findings:
  - claim: "Ω_m = 0.315 ± 0.007"
    evidence:
      - type: quote
        anchor: "abstract"
        text: "We find Ω_m = 0.315 ± 0.007..."
  - claim: "χ² = 1834.2 with 1750 dof"
    evidence:
      - type: quote
        anchor: "table:bao-fits"
        text: "..."
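
To make the example shape concrete, here is a minimal Python sketch that splits a claim string such as the ones above into a name, a value, and an optional uncertainty. The function name parse_claim, the regex, and the returned dict keys are all hypothetical and not part of any existing ASTRA tooling. It assumes claims follow a "name = value ± error" pattern, which real papers will often violate.

```python
import re

# Hypothetical pattern: "<name> = <value>" with an optional "± <error>" suffix.
# Anything after the numbers (e.g. "with 1750 dof") is ignored.
CLAIM_RE = re.compile(
    r"^\s*(?P<name>.+?)\s*=\s*(?P<value>[-+]?\d+(?:\.\d+)?)"
    r"(?:\s*±\s*(?P<error>\d+(?:\.\d+)?))?"
)


def parse_claim(claim):
    """Return {'name', 'value', 'error'} for a claim string, or None if it does not fit."""
    m = CLAIM_RE.match(claim)
    if m is None:
        return None
    error = m.group("error")
    return {
        "name": m.group("name"),
        "value": float(m.group("value")),
        "error": float(error) if error is not None else None,
    }


if __name__ == "__main__":
    for text in ["Ω_m = 0.315 ± 0.007", "χ² = 1834.2 with 1750 dof"]:
        print(parse_claim(text))
```

A helper like this would let the ASTRA-shaped entries double as the flat records that COMPARE iterates over. Claims that do not parse return None and would still need agent or human handling.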

ASTRA semantics: in the ingestion artifact these are the paper's own findings: (what the paper claims). When a reproduction's astra.yaml references them, it treats them as prior_insights: from its own perspective. The data is the same but the roles differ, and there is no tautology because the data lives in the ingestion artifact, not in the reproduction's spec.

COMPARE then iterates against a structured ledger:

for finding in findings.yaml:
    locate matching value in reproduction outputs
    diff and log to comparison-report.md
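
One way the loop above could look in Python, as a sketch under stated assumptions rather than a proposed implementation: findings are taken to be already flattened into dicts with name, value, and error keys, the reproduction outputs are a plain name-to-value mapping, and the match rule (within n_sigma of the published uncertainty, or a relative tolerance when no uncertainty is given) is an illustrative choice. The names compare and render_report are hypothetical.

```python
import math


def compare(findings, outputs, n_sigma=1.0, rel_tol=1e-3):
    """Diff each published finding against the reproduction's value.

    findings: list of {"name": str, "value": float, "error": float or None}
    outputs:  dict mapping finding name -> reproduced value
    Returns a list of (name, published, reproduced, status) rows.
    """
    rows = []
    for finding in findings:
        name, published = finding["name"], finding["value"]
        error = finding.get("error")
        reproduced = outputs.get(name)
        if reproduced is None:
            status = "missing"  # the reproduction never produced this value
        elif error:
            # Published uncertainty available: accept values within n_sigma of it.
            ok = abs(reproduced - published) <= n_sigma * error
            status = "match" if ok else "mismatch"
        else:
            # No uncertainty: fall back to a relative tolerance (assumed default).
            ok = math.isclose(reproduced, published, rel_tol=rel_tol)
            status = "match" if ok else "mismatch"
        rows.append((name, published, reproduced, status))
    return rows


def render_report(rows):
    """Render rows as a Markdown table, e.g. for comparison-report.md."""
    lines = ["| finding | published | reproduced | status |", "|---|---|---|---|"]
    for name, published, reproduced, status in rows:
        shown = "n/a" if reproduced is None else f"{reproduced:g}"
        lines.append(f"| {name} | {published:g} | {shown} | {status} |")
    return "\n".join(lines) + "\n"
```

Reporting "missing" separately from "mismatch" matters for the back-pressure argument: a reproduction that silently skips a finding should not be able to look converged.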

This mirrors the output-side LaTeX-macro pattern (\newcommand{\Omegam}{0.315} so the rendered paper can't drift from the analysis): an input-side analogue where the paper's claims are extracted into a structured ledger and the reproduction can't claim convergence without matching them.

Two extraction paths

  1. Author-defined \newcommand macros — regex over the LaTeX source. Free, scriptable. Realistic estimate: most papers don't define these; we'll build our own variable set per-paper.
  2. Inline values like $\Omega_m = 0.315 \pm 0.007$ — agent-driven during ACQUIRE. Imperfect but tractable; iterating on the prompt and validation rules will get most of the way.
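
The first path is simple enough to sketch. The Python below is a rough illustration, assuming only argument-free macros whose body is a bare number are of interest. The function name extract_numeric_macros and both regexes are hypothetical. It does not strip LaTeX comments or handle nested braces, so commented-out definitions would still be picked up.

```python
import re

# Matches \newcommand{\Name}{body} and \newcommand\Name{body} (also \renewcommand).
# Macros that take arguments, e.g. \newcommand{\vect}[1]{...}, do not match.
MACRO_RE = re.compile(
    r"\\(?:re)?newcommand\*?\s*\{?\\(?P<name>[A-Za-z@]+)\}?\s*\{(?P<body>[^{}]*)\}"
)
# A bare decimal number, optionally signed, optionally with an exponent.
NUMBER_RE = re.compile(r"^[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?$")


def extract_numeric_macros(tex):
    """Return {macro_name: value} for macros whose body is a bare number."""
    macros = {}
    for m in MACRO_RE.finditer(tex):
        body = m.group("body").strip()
        if NUMBER_RE.match(body):
            macros[m.group("name")] = float(body)
    return macros


if __name__ == "__main__":
    sample = r"""
    \newcommand{\Omegam}{0.315}
    \newcommand{\LCDM}{$\Lambda$CDM}
    """
    print(extract_numeric_macros(sample))
```

Macros with non-numeric bodies are skipped rather than reported, which keeps the output limited to candidate findings. Whether those candidates are actually findings still needs a human or agent pass.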

Open questions

  • Format for findings. ASTRA-shape ({claim, evidence} entries under findings:) gives downstream consistency — astra paper add only ever emits valid ASTRA YAML, MySTRA renders paper findings with the same machinery as reproduction findings. Bespoke flat schema ([{name, value, error, source, quote}]) is more concise for COMPARE iteration. Agent-side back-pressure is roughly equivalent either way; argument for ASTRA-shape is downstream consistency.
  • Figures from LaTeX vs render-and-crop. When LaTeX source is available, copying figure files directly is cleaner than rendering pages. Render-and-crop survives only as a fallback for PDF-only inputs (depends on #81, "Converge PDF parsing across the stack: arXiv-LaTeX primary, single PDF fallback").
  • Tables — is metadata enough? A \begin{table} block with caption, by label, is enough for an agent to read. No need for extracted values as separate JSON; the LaTeX source already has the values.
  • Phasing. Phase 1: figures + tables + author-macros (all scriptable). Phase 2: inline findings extraction (agent-driven, iterative).
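
For the scriptable Phase 1 table step, extraction by label might look like the Python sketch below. It assumes tables are written as table or table* environments with at most one level of brace nesting inside \caption. The function name extract_tables, the output dict shape, and the "unlabeled-N" fallback key are hypothetical.

```python
import re

# A whole table or table* environment, non-greedy so adjacent tables stay separate.
TABLE_RE = re.compile(r"\\begin\{table(\*?)\}.*?\\end\{table\1\}", re.DOTALL)
LABEL_RE = re.compile(r"\\label\{([^{}]+)\}")
# Caption text, allowing one level of nested braces such as \textbf{...}.
CAPTION_RE = re.compile(r"\\caption\{((?:[^{}]|\{[^{}]*\})*)\}")


def extract_tables(tex):
    """Return {label: {"caption": str, "latex": str}} for each table block."""
    tables = {}
    for index, m in enumerate(TABLE_RE.finditer(tex)):
        block = m.group(0)
        label = LABEL_RE.search(block)
        caption = CAPTION_RE.search(block)
        # Tables without a \label get a positional placeholder key.
        key = label.group(1) if label else f"unlabeled-{index}"
        tables[key] = {
            "caption": caption.group(1).strip() if caption else "",
            "latex": block,  # raw block kept verbatim for agents to read
        }
    return tables
```

Keeping the raw LaTeX block alongside the caption follows the position taken in the tables bullet: the values stay in the source and no separate JSON of extracted cells is produced.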

Cross-references

Suggested labels

area:paper-management, enhancement, discuss-before-doing

— Claude on behalf of Cail
