aw-pr/explainer-batch

Simple explainers from long-form academic publications

Turn research PDFs into structured JSON explainer articles using the Claude or OpenAI batch/sync APIs.

  • Input: local PDFs and/or a list of paper URLs.
  • Output: one output/<slug>.json per paper — a validated, schema-stable explainer object — plus an optional self-contained output/<slug>.html.
  • Cost: uses the Claude Message Batches API / OpenAI Batch API (50% discount) or a synchronous path for fast single runs.
  • Determinism: model schema drift is normalised after generation; figures are extracted from the source PDF deterministically, never hallucinated.

What this is / is not

  • It is a focused pipeline that produces a specific JSON schema (src/types/explainer-json.ts) intended for a separate React renderer.
  • It is not a general-purpose PDF summariser or chat tool.
  • Figure extraction is macOS-first: it shells out to sips. On Linux the pipeline still runs but image blocks are dropped (see Troubleshooting).
  • The website integration is optional. Without it you still get JSON.

Architecture

flowchart LR
  subgraph IN["Inputs"]
    A1["Local PDFs<br/>(input/*.pdf)"]
    A2["Paper URLs<br/>(input/urls.txt)"]
  end

  IN --> B["Preprocess"]

  B --> C{"Run mode"}
  C -->|"batch (50% discount)"| D["Submit + poll provider batch"]
  C -->|"--sync (fast, single)"| E["Synchronous run"]

  D --> F["Validate + one-pass repair"]
  E --> F
  F --> G["Normalise schema drift"]
  G --> H["Extract figure from source PDF"]
  H --> R(["output/&lt;slug&gt;.json"])

  R -. "only if WEBSITE_REPO set" .-> S["Stage + render website HTML"]

  classDef opt stroke-dasharray:4 4;
  class S opt;

Claude and OpenAI are both supported as the provider; the website step (S) is entirely optional.

Prerequisites

  • Node 20+ (.nvmrc pins 20; package.json enforces engines).
  • One provider credential — an Anthropic or OpenAI API key, a Claude Max/Pro OAuth token, or Codex CLI auth (see Auth routes).
  • poppler (pdftotext, pdftoppm, pdfimages, pdfinfo), used for figure extraction. Install with brew install poppler or apt-get install poppler-utils.
  • sips — ships with macOS; used to downscale extracted figures. Not available on Linux: figures are skipped there, the rest of the pipeline is unaffected.
  • tmux — only required for the secure headless route (npm run process:secure:tmux).

Quick start

npm install
cp .env.example .env
# edit .env: set ANTHROPIC_API_KEY and/or OPENAI_API_KEY

# Provide input as a local PDF...
cp ~/Downloads/my-paper.pdf input/
# ...and/or paper URLs (arXiv, DOI, publisher page, or direct PDF link):
echo "https://arxiv.org/pdf/2411.13768" >> input/urls.txt

# Claude batch (50% discount, ~1h turnaround):
npm run process -- --provider claude

# or OpenAI batch:
npm run process -- --provider openai

# or a fast synchronous run (Claude OAuth or Codex CLI auth, no batch wait):
npm run process -- --provider openai --sync

.env is loaded fill-only (src/env.ts): a value there is used only if the variable is not already set, so it never overrides a real shell export or a 1Password-injected key.
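The fill-only rule can be sketched as follows. This is a minimal illustration, not the code in src/env.ts: the names parseDotenv and fillOnly are hypothetical, and it assumes "already set" means the variable is defined at all.

```typescript
// Hypothetical sketch of a fill-only loader; the real logic lives in src/env.ts.
type Env = Record<string, string | undefined>;

// Parse KEY=VALUE lines, skipping blanks and # comments.
function parseDotenv(text: string): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // not a KEY=VALUE line
    const key = line.slice(0, eq).trim();
    // Strip one pair of matching surrounding quotes, if present.
    const value = line.slice(eq + 1).trim().replace(/^(["'])(.*)\1$/, "$2");
    pairs.push([key, value]);
  }
  return pairs;
}

// Return a copy of env with file values filled in only where env has none.
function fillOnly(env: Env, dotenvText: string): Env {
  const merged: Env = { ...env };
  for (const [key, value] of parseDotenv(dotenvText)) {
    // Assumption: "already set" means defined, so a shell export always wins.
    if (merged[key] === undefined) merged[key] = value;
  }
  return merged;
}
```

Called as fillOnly(process.env, fileContents), a key exported in the shell or injected by 1Password keeps its value, and the file only supplies what is missing.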

Generated JSON lands in output/ (or EXPLAINER_OUTPUT_DIR). No website repo is required — staging and website-HTML export are skipped unless WEBSITE_REPO is set.

Headless / 1Password route (optional)

For non-interactive/headless runs without putting keys in .env or your shell:

cp op-refs.local.sh.example op-refs.local.sh
# edit op-refs.local.sh with your real op:// vault refs

npm run process:secure:tmux                 # default provider: openai
PROVIDER=claude npm run process:secure:tmux # Claude batch

scripts/run-batch-tmux.sh sources op-refs.sh, resolves only the key the selected route needs via an op-fetch resolver on PATH, and runs the child in a detached tmux session with a sanitised environment. If op-fetch or op-refs.local.sh is absent it falls back to running directly with .env.

Commands

| Command | What it does |
| --- | --- |
| npm run process | Full pipeline: preprocess → submit batch → poll → save JSON (+ optional HTML) |
| npm run submit | Preprocess + submit batch, then exit (prints batch ID) |
| npm run poll | Poll the latest pending batch and save results when ready |
| npm run poll -- <batch-id> | Poll a specific batch by ID |
| npm run status | Show all batches, per-request results, token counts |
| npm run process -- --sync | Run synchronously (no batch queue) |
| npm run export-html -- --input-dir output | Re-render HTML from existing JSON (needs WEBSITE_REPO) |
| npm run typecheck / npm run build | Type-check / compile to dist/ |
| npm run smoke | Offline check: every provider tier resolves and has a pricing row (no API calls) |
| npm run guards:install | Arm the local publish git hooks (contributors) |

Prefer passing --provider claude or --provider openai explicitly. With no flag and no PROVIDER env var, npm run process defaults to claude, whereas the secure tmux wrapper defaults to openai.
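The precedence described here (flag, then PROVIDER, then claude) could look roughly like this. resolveProvider is a hypothetical name, and throwing on an unknown value is an assumption; the project's actual argument parsing may differ.

```typescript
type Provider = "claude" | "openai";

function isProvider(value: string | undefined): value is Provider {
  return value === "claude" || value === "openai";
}

// argv: CLI arguments after the script name; env: environment variables.
function resolveProvider(
  argv: string[],
  env: Record<string, string | undefined>
): Provider {
  // 1. An explicit --provider flag wins.
  const i = argv.indexOf("--provider");
  const fromFlag = i >= 0 ? argv[i + 1] : undefined;
  if (fromFlag !== undefined) {
    // Assumption: unknown values are an error rather than a silent fallback.
    if (!isProvider(fromFlag)) throw new Error(`Unknown provider: ${fromFlag}`);
    return fromFlag;
  }
  // 2. Otherwise the PROVIDER env var, if set and non-empty.
  const fromEnv = env.PROVIDER;
  if (fromEnv !== undefined && fromEnv !== "") {
    if (!isProvider(fromEnv)) throw new Error(`Unknown PROVIDER: ${fromEnv}`);
    return fromEnv;
  }
  // 3. Documented default.
  return "claude";
}
```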

Inputs

  • Local PDFs — drop any .pdf into input/. OpenAI uploads are cached in state.json by filename + content hash; unchanged files are reused.
  • Remote URLs: input/urls.txt, one HTTP(S) link per line (# comments ignored).
  • Per-paper focus hint (optional) — steer emphasis for one paper without changing the global prompt:
    • PDF: sidecar input/<basename>.focus.md (body = emphasis block).
    • URL: append # focus: … to the line in urls.txt.
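As an illustration of the urls.txt format, the sketch below parses comment lines and the trailing # focus: hint. The names parseUrlList and UrlEntry are hypothetical, and it assumes a trailing comment is separated from the URL by whitespace so that URL fragments such as #page=2 survive.

```typescript
interface UrlEntry {
  url: string;
  focus?: string;
}

function parseUrlList(text: string): UrlEntry[] {
  const entries: UrlEntry[] = [];
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue; // blank or full-line comment
    // Assumption: "<url>" optionally followed by whitespace, "#", and a comment.
    const m = /^(\S+)(?:\s+#\s*(.*))?$/.exec(line);
    if (!m) continue;
    const url = m[1] ?? "";
    if (!/^https?:\/\//i.test(url)) continue; // only HTTP(S) links are accepted
    const comment = m[2] ?? "";
    const focusMatch = /^focus:\s*(.+)$/i.exec(comment);
    const focus = focusMatch ? (focusMatch[1] ?? "").trim() : "";
    entries.push(focus !== "" ? { url, focus } : { url });
  }
  return entries;
}
```

For example, the line `https://arxiv.org/pdf/2411.13768 # focus: evaluation setup` would yield one entry whose focus is "evaluation setup".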

Per-paper directives (<paper>.focus.md)

For a PDF at input/my-paper.pdf, an optional sidecar at input/my-paper.focus.md lets you steer one paper without touching the global prompt. Structured directive lines are stripped before the model sees the prose; the remainder is appended verbatim to the user message under the heading READER EMPHASIS FOR THIS PAPER. Malformed directive lines are ignored silently.

| Directive | Purpose |
| --- | --- |
| image: Figure N | Force a specific figure as the lead image (accepts Figure 3, Fig. 4a, etc.) |
| image_caption: … | Override the caption attached to the lead image |
| image_alt: … | Override the alt text for the lead image |
| image_page_hint: N | 1-based page number used by the figure cropper instead of caption search |

See input/.focus.md.example for a copy-paste-and-edit template.
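A rough sketch of how a sidecar might be split into directives and prose is shown below. parseFocusSidecar and its result shape are hypothetical. It assumes a directive line is always removed from the prose and that a malformed value (for example a non-numeric page hint) is simply dropped.

```typescript
interface FocusDirectives {
  image?: string;
  imageCaption?: string;
  imageAlt?: string;
  imagePageHint?: number;
}

function parseFocusSidecar(text: string): { directives: FocusDirectives; prose: string } {
  const directives: FocusDirectives = {};
  const proseLines: string[] = [];
  for (const line of text.split(/\r?\n/)) {
    const m = /^\s*(image|image_caption|image_alt|image_page_hint):\s*(.*)$/.exec(line);
    if (!m) {
      proseLines.push(line); // ordinary prose is kept verbatim
      continue;
    }
    const key = m[1] ?? "";
    const value = (m[2] ?? "").trim();
    if (value === "") continue; // malformed: ignored without an error
    if (key === "image") directives.image = value;
    else if (key === "image_caption") directives.imageCaption = value;
    else if (key === "image_alt") directives.imageAlt = value;
    else {
      // image_page_hint must be a 1-based page number.
      const page = Number(value);
      if (Number.isInteger(page) && page >= 1) directives.imagePageHint = page;
    }
  }
  return { directives, prose: proseLines.join("\n").trim() };
}
```

The prose half is what would be appended under READER EMPHASIS FOR THIS PAPER; the directives half feeds the lead-image override.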

Configuration

All variables are optional except a provider credential. Set them in .env, .env.local, or the shell.

Environment variables, model defaults, and auth routes
| Variable | Purpose | Default |
| --- | --- | --- |
| ANTHROPIC_API_KEY | Claude batch / API | |
| OPENAI_API_KEY | OpenAI batch / API | |
| CLAUDE_CODE_OAUTH_TOKEN | Claude Max/Pro sync route | |
| PROVIDER | Default provider when no --provider flag | claude |
| MODEL_BATCH, MODEL_LANE, MODEL_SYNTHESIS, MODEL_REPAIR | Per-stage model overrides | see below |
| MAX_TOKENS, LANE_MAX_TOKENS, SYNTHESIS_MAX_TOKENS, REPAIR_MAX_TOKENS | Per-stage token caps | see below |
| EXPLAINER_INPUT_DIR | Where source PDFs / urls.txt / focus sidecars are read | <repo>/input |
| EXPLAINER_OUTPUT_DIR | Where JSON is written | <repo>/output |
| EXPLAINER_OBSIDIAN_DIR | Extra always-on mirror for each saved JSON (independent of the output dir and website staging) | ~/obsidian/explainers (set empty to disable) |
| WEBSITE_REPO | Consuming website repo; enables staging + website HTML | unset (skipped) |
| EXPLAINER_JOBS_DIR | Shared batch-dashboard jobs dir (best-effort) | <repo>/jobs |

Model defaults — Claude: batch/lane/synthesis claude-opus-4-8, repair claude-sonnet-4-6. OpenAI: batch/synthesis gpt-5.4, lane/repair gpt-5.4-mini. For a committed profile, copy config/models.json.example to config/models.json (gitignored; env vars still win over the file).
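The override order (env var, then config/models.json, then built-in default) can be pictured with the sketch below. resolveModels is a hypothetical name, and the flat per-stage shape used for the file profile is an assumption; check config/models.json.example for the real layout.

```typescript
type Stage = "batch" | "lane" | "synthesis" | "repair";
type ModelProfile = Record<Stage, string>;

// Defaults as listed in this README.
const DEFAULTS: Record<"claude" | "openai", ModelProfile> = {
  claude: {
    batch: "claude-opus-4-8",
    lane: "claude-opus-4-8",
    synthesis: "claude-opus-4-8",
    repair: "claude-sonnet-4-6",
  },
  openai: {
    batch: "gpt-5.4",
    lane: "gpt-5.4-mini",
    synthesis: "gpt-5.4",
    repair: "gpt-5.4-mini",
  },
};

const ENV_NAMES: Record<Stage, string> = {
  batch: "MODEL_BATCH",
  lane: "MODEL_LANE",
  synthesis: "MODEL_SYNTHESIS",
  repair: "MODEL_REPAIR",
};

function resolveModels(
  provider: "claude" | "openai",
  env: Record<string, string | undefined>,
  fileProfile: Partial<ModelProfile> = {} // assumed shape of config/models.json
): ModelProfile {
  const stages: Stage[] = ["batch", "lane", "synthesis", "repair"];
  const resolved: ModelProfile = { ...DEFAULTS[provider] };
  for (const stage of stages) {
    // Env vars win over the file, which wins over the built-in default.
    resolved[stage] = env[ENV_NAMES[stage]] || fileProfile[stage] || resolved[stage];
  }
  return resolved;
}
```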

Fable 5 is opt-in, not a default. The Claude synthesisModel is only consulted on the --sync (OAuth/subscription) path; the batch path generates on batchModel. So reach for Fable on the route where it is cheapest:

  • API batch (preferred): MODEL_BATCH=claude-fable-5 npm run process -- --provider claude. The batch API applies its 50% discount. Fable 5 also requires that your Console admin has accepted 30-day data retention (misuse detection only, not training).
  • Sync/OAuth: MODEL_SYNTHESIS=claude-fable-5 npm run process -- --provider claude --sync. On the Claude.ai subscription, Fable 5 is free until 22 June 2026, then pre-paid with no batch discount, so prefer the API-batch route once that window closes.

Fable runs once per paper, so its cost multiplies by paper count.

Auth routes

| Route | Credential | Notes |
| --- | --- | --- |
| Claude batch | ANTHROPIC_API_KEY | 50% discount, ~1h turnaround |
| Claude sync | CLAUDE_CODE_OAUTH_TOKEN | Max/Pro quota; mixed OAuth+API-key env is rejected |
| OpenAI batch | OPENAI_API_KEY | semantic-lane extraction + synthesis |
| OpenAI sync | Codex CLI auth (no key fetched) or OPENAI_API_KEY | fast single runs, no batch wait |

1Password is optional and orthogonal to the route — see the headless route above and docs/SECURITY.md.
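The mixed-auth rejection on the Claude sync route amounts to a guard like the one below. assertNoMixedClaudeAuth and its error text are hypothetical; the sketch only illustrates the rule stated in the table.

```typescript
// Hypothetical guard for the Claude sync route.
function assertNoMixedClaudeAuth(env: Record<string, string | undefined>): void {
  const hasOAuth = Boolean(env.CLAUDE_CODE_OAUTH_TOKEN);
  const hasApiKey = Boolean(env.ANTHROPIC_API_KEY);
  if (hasOAuth && hasApiKey) {
    // Refuse to guess which credential the caller intended to use.
    throw new Error(
      "Mixed Claude auth: unset either CLAUDE_CODE_OAUTH_TOKEN or ANTHROPIC_API_KEY for a sync run"
    );
  }
}
```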

How it works

  1. Preprocess — read local PDFs and the optional URL list.
  2. Submit — Claude: one explainer request per paper. OpenAI: semantic lane extraction (methods/results/limitations/implications) with a smaller lane model.
  3. Poll — exponential backoff (30 s → 5 min cap) until the batch ends (skipped for --sync).
  4. Synthesis + save — OpenAI runs a second synthesis stage; all providers get validation + an optional one-pass repair, then:
    • normalizeSchemaDrift derives canonical paragraphs and end_takeaway.label/body when the model emits *_html/heading variants, so the renderer never shows blank blocks.
    • attachFigureImage applies any lead-image override, then deterministically locates the named figure in the source PDF (poppler + sips), re-encodes to a downscaled JPEG, and inlines it as a base64 data URL. Failures drop the image block silently — the explainer still renders.
    • JSON is written to the output dir, mirrored to EXPLAINER_OBSIDIAN_DIR (~/obsidian/explainers by default), and, if WEBSITE_REPO is set, also staged and rendered to standalone HTML.

The system prompt is the body of skill.md (everything after its YAML frontmatter), sent verbatim.
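Dropping the frontmatter before sending the prompt might look like this. stripFrontmatter is a hypothetical helper, and it assumes the frontmatter is a leading block delimited by lines containing only ---.

```typescript
// Remove a leading YAML frontmatter block; leave other input untouched.
function stripFrontmatter(markdown: string): string {
  // Matches: opening "---" line, any content (lazy), closing "---" line.
  return markdown.replace(/^---\r?\n[\s\S]*?\r?\n---[ \t]*(?:\r?\n|$)/, "");
}
```

A file without frontmatter passes through unchanged, so the helper is safe to apply unconditionally.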

Website integration (optional)

Only relevant if you set WEBSITE_REPO to a checkout of the consuming website repo. When set, every saved JSON is copied into that repo's explainers-new/ staging dir, and html-export.ts spawns its scripts/export-explainer-html.mts to write a self-contained HTML file next to the JSON (Chart.js from CDN, no Next.js runtime dependency). output/*.html opens directly in a browser for a pixel-accurate preview.

Shared type contract: src/types/explainer-json.ts here must stay in sync with the website repo's equivalent ExplainerJson type. Any schema change must land in both repos.

Troubleshooting

Common symptoms and fixes
| Symptom | Cause / fix |
| --- | --- |
| Website HTML export skipped (WEBSITE_REPO not set) | Expected when not using the website integration. Not an error. |
| Image silently dropped | poppler not installed, or on Linux (sips is macOS-only), or the named figure was not found in the PDF. |
| JSON parse failed — raw output saved to *_error.txt | Model returned non-JSON; inspect the .txt in the output dir. |
| Claude run rejected for mixed auth | Both CLAUDE_CODE_OAUTH_TOKEN and ANTHROPIC_API_KEY were set for a sync run. Use the secure tmux route or unset one. |
| op-fetch is not installed | Not fatal: the secure wrapper falls back to .env. Only the 1Password route needs op-fetch. |

Development

npm run typecheck      # tsc --noEmit
npm run build          # compile to dist/
npm run guards:install # arm local publish hooks (one-time, per clone)

state.json is auto-managed — do not hand-edit. Agent guidance lives in AGENTS.md; Claude Code specifics in CLAUDE.md. The private-work → public-mirror branching/publish model is documented in docs/PUBLISH-WORKFLOW.md.

Security

Credentials are never committed. The .env/.env.local loader is fill-only; the 1Password route resolves only the keys a route needs and runs the child with a sanitised environment; committed op-refs.sh holds placeholders while real refs live in a gitignored op-refs.local.sh. Local git hooks (scripts/install-guards.sh) block machine-specific data from being committed. See docs/SECURITY.md.

License

MIT — see LICENSE.
