Arcanum: AI-First Research Infrastructure

"The only truly exact science is the reproduction of a result by means of its deductive consequence." — Piero Sraffa, notes on method

An open-source framework for AI-assisted scholarly research — built to make the tools of rigorous empirical work available to independent researchers, not just those with institutional access to proprietary databases and research teams.

Why This Exists

This workspace was built by a PhD candidate at the New School for Social Research who needed to process thousands of archival PDFs, reconstruct historical datasets from primary sources, and produce fully reproducible replication packages — all with the rigor that heterodox economics demands.

The intellectual tradition here is specific: Sraffa's meticulous reconstruction of Ricardo's works from manuscripts, Leontief's input-output tables built from primary census data, Shaikh's empirical reconstructions of classical political economy. These scholars didn't just theorize — they built their knowledge from sources, one table at a time.

Arcanum is the infrastructure that makes this kind of work possible at scale with AI agents. Every methodology here has been battle-tested on real research projects: extracting tables from Soviet statistical yearbooks, digitizing pre-war tax directories, reconstructing decades of macroeconomic data from scattered government reports.

This repository shares everything: the full HDARP protocol for PDF processing, the complete Anu Framework for data construction, all skill templates, workspace standards, and pipeline documentation. Take what's useful, adapt it to your work, and build something rigorous.

Core Systems

HDARP v6.1 — Hybrid Direct Agent Reading Protocol

The crown jewel: a PDF processing methodology that achieves 95-98% accuracy through:

Vision-based table extraction (DARP) — AI agent vision reads table structures directly, achieving 98%+ structural accuracy
Document-adaptive OCR (Sraffa 4.0): digital pages are read via PyMuPDF; scanned pages go through EasyOCR on GPU with agent QA; pages that fail QA escalate to the Chandra 2 NF4 fallback. The core consensus engine (Sraffa 3.0: PaddleOCR + EasyOCR + Tesseract, 6-rule adjudication) remains the OCR adjudication layer
Flat 10-page chunking — Consistent document segmentation
4 content types: Tables (CSV), Equations (LaTeX), Figures (Markdown), Body Text (full OCR)
Automatic batch continuation — processes all prepared batches without stopping (--single for one-batch mode)
Sonnet Mandatory — Sonnet for all processors, Opus for validator (Haiku banned)
No API keys required — uses agent native capabilities
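
The Sraffa 4.0 routing described above can be sketched as a small decision function. This is a minimal illustration, not the project's implementation: the `PageInfo` fields, the `route_page` name, and the engine labels are hypothetical. Only the order of the rules (digital pages to PyMuPDF, scans to EasyOCR with agent QA, QA failures to the Chandra 2 NF4 fallback) comes from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageInfo:
    """Hypothetical per-page facts the router needs."""
    has_text_layer: bool              # True for born-digital pages
    qa_passed: Optional[bool] = None  # agent QA verdict on EasyOCR output; None = not run yet


def route_page(page: PageInfo) -> str:
    """Return an engine label for one page, following the rule order listed above."""
    if page.has_text_layer:
        return "pymupdf"          # digital page: read the embedded text
    if page.qa_passed is False:
        return "chandra2-nf4"     # scan whose EasyOCR output failed agent QA
    return "easyocr-gpu"          # scan: first pass, or QA already passed


if __name__ == "__main__":
    pages = [PageInfo(True), PageInfo(False), PageInfo(False, qa_passed=False)]
    print([route_page(p) for p in pages])
```

Keeping the routing decision in one pure function makes the escalation path easy to audit separately from the OCR engines themselves.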

Full documentation: docs/03-hdarp/

DARP Command Family

Four parallel processing commands built on the HDARP methodology:

Command	Smart	Hybrid	Extracts
`/pdarp`	No	No	Tables, Equations, Figures
`/phdarp`	No	Yes	+ Body Text (OCR)
`/spdarp`	Yes	No	Tables, Equations, Figures
`/sphdarp`	Yes	Yes	+ Body Text (OCR)

Key: P = Parallel (all spawn N+1 agents), S = Smart (document-aware batching), H = Hybrid (includes Sraffa 4.0 OCR)
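
The naming scheme in the key can be read mechanically from the command name. The sketch below assumes the letter order seen in the four commands of the table (optional `s`, then `p`, then optional `h`, then `darp`); `DarpFlags` and `decode_command` are hypothetical names, not part of the workspace.

```python
import re
from typing import NamedTuple


class DarpFlags(NamedTuple):
    parallel: bool
    smart: bool
    hybrid: bool


# Letter order taken from the four commands in the table: [s] p [h] darp
_COMMAND = re.compile(r"^/(s?)p(h?)darp$")


def decode_command(name: str) -> DarpFlags:
    """Decode a DARP-family command name into its P/S/H flags."""
    match = _COMMAND.match(name)
    if match is None:
        raise ValueError(f"not a DARP-family command: {name!r}")
    smart, hybrid = match.groups()
    return DarpFlags(parallel=True, smart=bool(smart), hybrid=bool(hybrid))


if __name__ == "__main__":
    for cmd in ("/pdarp", "/phdarp", "/spdarp", "/sphdarp"):
        print(cmd, decode_command(cmd))
```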

HDARP Pipeline (End-to-End)

/preparehdarp  →  /sphdarp N  →  /enrichhdarp  →  /sraffa-ocr  →  /hdarp-integrate  →  /hdarp-cleanup

/preparehdarp: chunk
/sphdarp N: extract + auto-continue
/enrichhdarp: audit/fix
/sraffa-ocr: OCR gaps
/hdarp-integrate: organize KB
/hdarp-cleanup: clean up
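
The stage order above can also be written down as plain data, which is handy when an agent needs to know what to run next. The command names and notes are copied from the diagram; the `PIPELINE` list and `next_command` helper are illustrative assumptions.

```python
# Stage order and notes copied from the diagram above; the structure is illustrative.
PIPELINE = [
    ("/preparehdarp", "chunk"),
    ("/sphdarp N", "extract + auto-continue"),
    ("/enrichhdarp", "audit/fix"),
    ("/sraffa-ocr", "OCR gaps"),
    ("/hdarp-integrate", "organize KB"),
    ("/hdarp-cleanup", "clean up"),
]


def next_command(current):
    """Return the command that follows `current`, or None after the last stage."""
    names = [name for name, _ in PIPELINE]
    index = names.index(current)  # raises ValueError for an unknown command
    return names[index + 1] if index + 1 < len(names) else None


if __name__ == "__main__":
    for name, note in PIPELINE:
        print(f"{name:18} {note}")
```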

Anu Framework v12.0 — Data Construction Framework

A framework of 19 active skills (plus the anu-build 9-stage orchestrator) for building structured research datasets from HDARP extractions. It covers the full lifecycle from source mining to reproducible, audit-grade publication:

Stage	Skill	Purpose
Rules	Anu Rules	Mandatory invariants (no synthetic data, no proxies, unit safety)
1	Anu Research	Mine quotes, methodology, references from the Knowledge Base
2 (gate)	Anu Adequacy	Post-research readiness gate: verifies that data sources are sufficient
3	Anu Ingestion	series_registry.json construction, series decomposition, DPRs/FPRs
4	Anu Extension	Faithful extension with live API data (FRED, BEA, BLS), EPRs, divergence register
5	Anu Scaffold / Replicator	Render + assemble self-contained L##/P##/V##/M## reproduction package
6	Anu Chopped / Extenbook	Machine-readable CSV / 4-sheet human-readable Excel
7	Anu Visualize	Interactive visualization (R Shiny + Plotly or Plotly Dash)
8	Anu Publish / Drive / Archive	Publication pipeline (GitHub / Google Drive / audit-grade archive)
Orchestration	Anu Build	9-stage pipeline orchestrator (anu-pipeline / anu-rebuild are deprecated redirect stubs)
Audit	Anu Review	14-dimension quality scoring (12 weighted D1–D12 + D13 Data Authenticity gate + D14 Outward-Facing Intelligibility gate)
Docs	Anu Docs	Per-series documentation (T1/T2/T3 tiers)
Tracking	Anu Variant	Methodology variant management (VPRs)
Manifest	Anu Ledger	Auto-generated project artifact inventory
Format	Anu Architecture	8-phase format standard for econometric data construction (formerly anu-data)
Health	Anu Doctor	Framework checks (D01–D19) + project checks (P01–P39)
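
The Anu Review row describes 12 weighted dimensions plus two gates. One way such a scheme could combine is sketched below. This is an assumption-heavy illustration: the README does not give the weights, the score scale, or how a failed gate is reported, so the equal weights, the 0-100 scale, and the "failed gate returns None" behavior are all hypothetical.

```python
from typing import Dict, Optional


def review_score(
    scores: Dict[str, float],
    weights: Dict[str, float],
    gates: Dict[str, bool],
) -> Optional[float]:
    """Weighted mean of the scored dimensions, or None if any gate fails."""
    if not all(gates.values()):
        return None  # assumption: a failed gate blocks the score regardless of D1-D12
    total_weight = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total_weight


if __name__ == "__main__":
    dims = [f"D{i}" for i in range(1, 13)]
    scores = {d: 80.0 for d in dims}   # placeholder values, not real review results
    weights = {d: 1.0 for d in dims}   # hypothetical equal weights
    print(review_score(scores, weights, {"D13": True, "D14": True}))
    print(review_score(scores, weights, {"D13": False, "D14": True}))
```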

Single source of truth: series_registry.json

Full documentation: docs/04-anu-framework/

AnuData Architecture — Econometric Research Pipeline

A standardized, self-contained architecture for bespoke econometric data construction and analysis. Extends the Anu Replicator's L##/P## pattern with a full 8-phase research pipeline:

Prefix	Name	Purpose
S##	Setup	Package installation, workspace config
L##	Load	Raw data from files and APIs
P##	Process	Clean, transform, construct analysis-ready datasets
V##	Validate	Data integrity + post-estimation diagnostics
M##	Manual Adjust	Documented corrections with audit trail
A##	Analyze	Econometric estimation, robustness, comparison
O##	Output	Publication-quality tables, figures, reports
E##	Explore	Standalone exploratory scripts (ephemeral)
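
Because each phase has a fixed letter prefix followed by two digits, a script's phase can be recovered from its filename. The sketch below is a minimal illustration of that convention; the example filenames and the `classify_script` helper are hypothetical.

```python
import re

# Prefix letters and phase names copied from the table above.
PHASES = {
    "S": "Setup", "L": "Load", "P": "Process", "V": "Validate",
    "M": "Manual Adjust", "A": "Analyze", "O": "Output", "E": "Explore",
}

_PREFIX = re.compile(r"^([SLPVMAOE])(\d{2})")


def classify_script(filename):
    """Return (phase name, sequence number) for a prefixed filename, else None."""
    match = _PREFIX.match(filename)
    if match is None:
        return None
    letter, number = match.groups()
    return PHASES[letter], int(number)


if __name__ == "__main__":
    # Hypothetical filenames, used only to exercise the pattern.
    for name in ("L01_load_raw.R", "M12_corrections.py", "README.md"):
        print(name, "->", classify_script(name))
```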

Key features: project_registry.json (single source of truth), structured DECISION_LOG.md, auto-generated CHECKLIST.md, language-agnostic (R/Python/Stata).

Full documentation: docs/09-anudata/

Council Tools

Infrastructure tools providing specialized capabilities:

Tool	Purpose
Druck	Performance monitoring, workspace standards, HDARP protocol hub
Robert	PDF processing infrastructure and Knowledge Base
Robin	Economic data platform (27 sources / 52.9 GB; canonical counts in AUTHORITATIVE_COUNTS.json)
Wynne	Economics research framework and HDARP campaigns
Grace	VLM cloud processing (A100 GPU)
Arthur	MIDI music generation
Caro	Biographical research with multi-engine OCR
Eleanor	Voice input processing
Manim	Mathematical animations
Pheidippides	Discord integration

Repository Structure

arcanum-workspace/
├── README.md                    # This file
├── AGENTS.md                    # Agent instructions and commands
├── skills/                      # Skill templates (the actual .md prompts)
│   ├── hdarp/                   # HDARP family (8 skills)
│   ├── anu-framework/           # Anu Framework v12.0 (19 active skills + anu-build)
│   └── workspace/               # Utility skills (6 skills)
├── docs/
│   ├── 02-philosophy/           # Basher methodology, real-data principles
│   ├── 03-hdarp/                # Complete HDARP documentation
│   ├── 04-anu-framework/        # Anu Framework architecture
│   ├── 05-council-tools/        # Council tool overviews
│   ├── 06-commands-skills/      # Command references
│   ├── 07-workspace-standards/  # Data analysis, file management, KB system
│   ├── 08-agent-operations/     # Session management
│   └── 09-anudata/              # AnuData econometric research architecture
├── council/                     # Infrastructure tool documentation
├── projects/                    # Research project descriptions
└── tools/                       # Build and maintenance scripts

Philosophy

Bashers, Not Swoopers

Borrowed from Kurt Vonnegut's distinction between swoopers and bashers: bashers work sentence by sentence, getting each one right before moving on. Swoopers write everything at once, then go back to fix it. This workspace operates as bashers.

In practice:

No placeholders — never write stub code or use dummy data
Real data only — all analysis uses genuine datasets, never synthetic
Step-by-step construction — complete each step fully before proceeding
Force column types early — immediately after data import, enforce types
Console-print debugging — print shapes, samples, and statistics at every transformation
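
The last two practices (force column types early, print at every transformation) can be sketched with the standard library alone. This is an illustration under stated assumptions: the `year`/`value` schema and the `load_typed` helper are hypothetical, and the inline rows are placeholder values with no empirical meaning, present only so the sketch runs. In real use the handle would be a genuine source file.

```python
import csv
import io

COLUMN_TYPES = {"year": int, "value": float}  # hypothetical schema


def load_typed(handle, column_types):
    """Read CSV rows and coerce every column to its declared type immediately."""
    rows = []
    for line_no, raw in enumerate(csv.DictReader(handle), start=2):
        try:
            rows.append({name: cast(raw[name]) for name, cast in column_types.items()})
        except (KeyError, ValueError, TypeError) as exc:
            # Fail loudly at import time instead of carrying bad types forward.
            raise ValueError(f"line {line_no}: bad value ({exc})") from exc
    # Console-print debugging: shape and a small sample after the transformation.
    print(f"rows={len(rows)} cols={len(column_types)}")
    print("sample:", rows[:2])
    return rows


if __name__ == "__main__":
    # Placeholder rows only; not real data.
    sample = io.StringIO("year,value\n1929,1.5\n1930,2.25\n")
    load_typed(sample, COLUMN_TYPES)
```

Raising on the first bad value, rather than silently coercing to a missing value, keeps the step-by-step rule intact: the import step is not complete until every column has the type it claims.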

Why This Matters

The tools of rigorous empirical research have historically been gatekept — behind expensive software licenses, institutional access, and research teams. AI agents change this equation fundamentally. A single researcher with well-designed infrastructure can now process document archives, reconstruct datasets, and produce replication packages that meet professional standards.

This workspace is an attempt to share that infrastructure openly, in the tradition of scholars who believed that method matters as much as theory — and that making methods transparent is itself a political act.

Getting Started

git clone https://github.com/andenick/arcanum-workspace.git
cd arcanum-workspace

For AI Agents

Read AGENTS.md for complete instructions
Run /readystart [project_name] to initialize
Follow project-specific instructions
Document work with /handoff

For Researchers

Browse skills/ to see the actual skill templates
Study docs/03-hdarp/ for the PDF processing methodology
Study docs/04-anu-framework/ for the data construction framework
Adapt the templates to your own workspace

Multi-LLM Compatibility

All methodologies work across platforms:

Claude Code
Cursor AI
Gemini CLI
Any vision-capable AI agent

Standards (The Short Version)

Mandatory 3-folder structure: Inputs/ (read-only originals), Technical/ (working files), Outputs/ (results)
Inputs/ is flat — no type-based subdirectories, preserve user organization
HDARP for PDFs — files >10 pages OR >1MB must be chunked (no exceptions)
Real data only — no placeholders, no dummy data, no stubs
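
The chunking rule and the flat 10-page segmentation can be expressed in a few lines. This is a minimal sketch, not the `/preparehdarp` implementation: the function names are hypothetical, and reading "1MB" as 1,000,000 bytes is an assumption.

```python
PAGE_LIMIT = 10          # rule above: more than 10 pages
SIZE_LIMIT = 1_000_000   # rule above: more than 1MB (decimal megabyte is an assumption)
CHUNK_PAGES = 10         # flat 10-page chunking


def needs_chunking(page_count, size_bytes):
    """True when either threshold in the standard is exceeded."""
    return page_count > PAGE_LIMIT or size_bytes > SIZE_LIMIT


def chunk_ranges(page_count, chunk_pages=CHUNK_PAGES):
    """Return 1-based inclusive (first, last) page ranges covering the document."""
    return [
        (start, min(start + chunk_pages - 1, page_count))
        for start in range(1, page_count + 1, chunk_pages)
    ]


if __name__ == "__main__":
    print(needs_chunking(25, 400_000))
    print(chunk_ranges(25))
```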

License

MIT License — take what's useful, build something rigorous.

Version: 5.0
Last Updated: 2026-05-12
