andenick/arcanum-workspace

Arcanum: AI-First Research Infrastructure

"The only truly exact science is the reproduction of a result by means of its deductive consequence." — Piero Sraffa, notes on method

An open-source framework for AI-assisted scholarly research — built to make the tools of rigorous empirical work available to independent researchers, not just those with institutional access to proprietary databases and research teams.

Why This Exists

This workspace was built by a PhD candidate at the New School for Social Research who needed to process thousands of archival PDFs, reconstruct historical datasets from primary sources, and produce fully reproducible replication packages — all with the rigor that heterodox economics demands.

The intellectual tradition here is specific: Sraffa's meticulous reconstruction of Ricardo's works from manuscripts, Leontief's input-output tables built from primary census data, Shaikh's empirical reconstructions of classical political economy. These scholars didn't just theorize — they built their knowledge from sources, one table at a time.

Arcanum is the infrastructure that makes this kind of work possible at scale with AI agents. Every methodology here has been battle-tested on real research projects: extracting tables from Soviet statistical yearbooks, digitizing pre-war tax directories, reconstructing decades of macroeconomic data from scattered government reports.

This repository shares everything: the full HDARP protocol for PDF processing, the complete Anu Framework for data construction, all skill templates, workspace standards, and pipeline documentation. Take what's useful, adapt it to your work, and build something rigorous.

Core Systems

HDARP v6.1 — Hybrid Direct Agent Reading Protocol

The crown jewel. A high-accuracy PDF processing methodology achieving 95-98% accuracy through:

  • Vision-based table extraction (DARP) — AI agent vision reads table structures directly, achieving 98%+ structural accuracy
  • Document-adaptive OCR (Sraffa 4.0): digital pages are read via PyMuPDF, scans go through EasyOCR (GPU) with agent QA, and pages that fail QA escalate to the Chandra 2 NF4 fallback. The core consensus engine (Sraffa 3.0: PaddleOCR + EasyOCR + Tesseract, 6-rule adjudication) remains the OCR adjudication layer
  • Flat 10-page chunking — Consistent document segmentation
  • 4 content types: Tables (CSV), Equations (LaTeX), Figures (Markdown), Body Text (full OCR)
  • Automatic batch continuation — processes all prepared batches without stopping (--single for one-batch mode)
  • Sonnet Mandatory — Sonnet for all processors, Opus for validator (Haiku banned)
  • No API keys required — uses agent native capabilities
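
As a rough illustration of two items in the list above (flat 10-page chunking and the Sraffa 4.0 page routing), here is a minimal Python sketch. It is not the HDARP implementation: the `Page` class, its `has_text_layer` and `qa_passed` flags, and the returned route labels are hypothetical names introduced only for this example, and no OCR library is actually called.

```python
from dataclasses import dataclass

CHUNK_SIZE = 10  # "flat 10-page chunking" from the list above


def chunk_ranges(page_count: int, chunk_size: int = CHUNK_SIZE) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) ranges covering the document."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return [
        (start, min(start + chunk_size - 1, page_count))
        for start in range(1, page_count + 1, chunk_size)
    ]


@dataclass
class Page:
    number: int
    has_text_layer: bool    # hypothetical flag: True for born-digital pages
    qa_passed: bool = True  # hypothetical outcome of the agent QA step on a scan


def route_page(page: Page) -> str:
    """Pick an extraction path for one page, following the routing described above."""
    if page.has_text_layer:
        return "pymupdf"           # digital page: read the embedded text
    if page.qa_passed:
        return "easyocr"           # scan whose OCR output passed agent QA
    return "chandra-fallback"      # scan that failed QA: escalate
```

For a 23-page document, `chunk_ranges(23)` returns `[(1, 10), (11, 20), (21, 23)]`: every chunk is full except possibly the last.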

Full documentation: docs/03-hdarp/

DARP Command Family

Four parallel processing commands built on the HDARP methodology:

| Command | Smart | Hybrid | Extracts |
| --- | --- | --- | --- |
| /pdarp | No | No | Tables, Equations, Figures |
| /phdarp | No | Yes | + Body Text (OCR) |
| /spdarp | Yes | No | Tables, Equations, Figures |
| /sphdarp | Yes | Yes | + Body Text (OCR) |

Key: P = Parallel (all spawn N+1 agents), S = Smart (document-aware batching), H = Hybrid (includes Sraffa 4.0 OCR)
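
The naming scheme in the key can be read mechanically off a command name. The sketch below assumes the four commands listed are the whole family. The function name `decode_darp_command` and the returned dictionary keys are hypothetical, chosen for illustration.

```python
import re

# Optional "s" (Smart), mandatory "p" (Parallel), optional "h" (Hybrid), then "darp".
_COMMAND = re.compile(r"^/(s?)p(h?)darp$")


def decode_darp_command(command: str) -> dict[str, bool]:
    """Translate a DARP family command name into the flags from the key above."""
    match = _COMMAND.match(command)
    if match is None:
        raise ValueError(f"not a DARP family command: {command!r}")
    smart, hybrid = match.groups()
    return {
        "parallel": True,            # every command in the family is parallel
        "smart": smart == "s",       # document-aware batching
        "hybrid": hybrid == "h",     # includes Sraffa 4.0 OCR
        "body_text": hybrid == "h",  # hybrid commands also extract body text
    }
```

For example, `decode_darp_command("/sphdarp")` sets all four flags to `True`, while `decode_darp_command("/pdarp")` sets only `parallel`.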

HDARP Pipeline (End-to-End)

/preparehdarp  →  /sphdarp N  →  /enrichhdarp  →  /sraffa-ocr  →  /hdarp-integrate  →  /hdarp-cleanup
   (chunk)     (extract+auto-   (audit/fix)      (OCR gaps)      (organize KB)        (clean up)
                continue)
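
The diagram above is a strict sequence, so progress through it can be tracked as a simple checklist. The following is a minimal sketch under that assumption. The `PIPELINE` list restates the diagram (with the `N` argument of `/sphdarp` omitted), and `next_step` is a hypothetical helper, not a workspace command.

```python
from typing import Optional

# (command, purpose) pairs in the order shown in the diagram.
PIPELINE = [
    ("/preparehdarp", "chunk"),
    ("/sphdarp", "extract + auto-continue"),
    ("/enrichhdarp", "audit/fix"),
    ("/sraffa-ocr", "OCR gaps"),
    ("/hdarp-integrate", "organize KB"),
    ("/hdarp-cleanup", "clean up"),
]


def next_step(completed: list[str]) -> Optional[str]:
    """Return the first pipeline command not yet completed, or None when all are done."""
    done = set(completed)
    for command, _purpose in PIPELINE:
        if command not in done:
            return command
    return None
```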

Anu Framework v12.0 — Data Construction Framework

A 19-active-skill framework (plus the anu-build 9-stage orchestrator) for building structured research datasets from HDARP extractions. Covers the full lifecycle from source mining to reproducible, audit-grade publication:

| Stage | Skill | Purpose |
| --- | --- | --- |
| Rules | Anu Rules | Mandatory invariants (no synthetic data, no proxies, unit safety) |
| 1 | Anu Research | Mine quotes, methodology, references from the Knowledge Base |
| 2 (gate) | Anu Adequacy | Post-research readiness gate: verify that data sources are sufficient |
| 3 | Anu Ingestion | series_registry.json construction, series decomposition, DPRs/FPRs |
| 4 | Anu Extension | Faithful extension with live API data (FRED, BEA, BLS), EPRs, divergence register |
| 5 | Anu Scaffold / Replicator | Render + assemble self-contained L##/P##/V##/M## reproduction package |
| 6 | Anu Chopped / Extenbook | Machine-readable CSV / 4-sheet human-readable Excel |
| 7 | Anu Visualize | Interactive visualization (R Shiny + Plotly or Plotly Dash) |
| 8 | Anu Publish / Drive / Archive | Publication pipeline (GitHub / Google Drive / audit-grade archive) |
| Orchestration | Anu Build | 9-stage pipeline orchestrator (anu-pipeline / anu-rebuild are deprecated redirect stubs) |
| Audit | Anu Review | 14-dimension quality scoring (12 weighted D1–D12 + D13 Data Authenticity gate + D14 Outward-Facing Intelligibility gate) |
| Docs | Anu Docs | Per-series documentation (T1/T2/T3 tiers) |
| Tracking | Anu Variant | Methodology variant management (VPRs) |
| Manifest | Anu Ledger | Auto-generated project artifact inventory |
| Format | Anu Architecture | 8-phase format standard for econometric data construction (formerly anu-data) |
| Health | Anu Doctor | Framework checks (D01–D19) + project checks (P01–P39) |
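
The Anu Review row describes twelve weighted dimensions plus two gates. One plausible reading is sketched below: a weighted average over D1–D12, with D13 and D14 treated as pass/fail conditions reported separately. The actual weights, score scale, and gate semantics are not given here, so every number passed to `review_score` is the caller's assumption, and the function itself is hypothetical.

```python
WEIGHTED_DIMENSIONS = [f"D{i}" for i in range(1, 13)]  # D1 .. D12
GATES = ["D13", "D14"]  # Data Authenticity, Outward-Facing Intelligibility


def review_score(
    scores: dict[str, float],
    weights: dict[str, float],
    gates: dict[str, bool],
) -> dict[str, object]:
    """Combine weighted dimension scores and report whether both gates passed."""
    missing = [d for d in WEIGHTED_DIMENSIONS if d not in scores or d not in weights]
    if missing:
        raise ValueError(f"missing score or weight for: {missing}")
    total_weight = sum(weights[d] for d in WEIGHTED_DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(scores[d] * weights[d] for d in WEIGHTED_DIMENSIONS) / total_weight
    # A gate that is absent from `gates` is treated as not passed.
    gates_passed = all(gates.get(g, False) for g in GATES)
    return {"weighted_score": weighted, "gates_passed": gates_passed}
```

Keeping the gates out of the average means a high weighted score cannot compensate for a failed authenticity or intelligibility check, which is the usual reason to model something as a gate rather than a weight.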

Single source of truth: series_registry.json

Full documentation: docs/04-anu-framework/

AnuData Architecture — Econometric Research Pipeline

A standardized, self-contained architecture for bespoke econometric data construction and analysis. Extends the Anu Replicator's L##/P## pattern with a full 8-phase research pipeline:

| Prefix | Name | Purpose |
| --- | --- | --- |
| S## | Setup | Package installation, workspace config |
| L## | Load | Raw data from files and APIs |
| P## | Process | Clean, transform, construct analysis-ready datasets |
| V## | Validate | Data integrity + post-estimation diagnostics |
| M## | Manual Adjust | Documented corrections with audit trail |
| A## | Analyze | Econometric estimation, robustness, comparison |
| O## | Output | Publication-quality tables, figures, reports |
| E## | Explore | Standalone exploratory scripts (ephemeral) |

Key features: project_registry.json (single source of truth), structured DECISION_LOG.md, auto-generated CHECKLIST.md, language-agnostic (R/Python/Stata).
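
To illustrate how the phase prefixes can drive script ordering, the sketch below sorts filenames by phase and then by number. It assumes that the table order is the intended execution order and that `##` stands for a two-digit number at the start of the filename. The example filenames in the usage note are hypothetical.

```python
import re

# Phase letters in the order they appear in the prefix table.
PHASES = ["S", "L", "P", "V", "M", "A", "O", "E"]
_SCRIPT = re.compile(r"^([SLPVMAOE])(\d{2})")


def phase_key(filename: str) -> tuple[int, int]:
    """Sort key: (position of the phase letter, two-digit script number)."""
    match = _SCRIPT.match(filename)
    if match is None:
        raise ValueError(f"filename lacks a phase prefix such as 'L01': {filename!r}")
    letter, number = match.groups()
    return PHASES.index(letter), int(number)


def execution_order(filenames: list[str]) -> list[str]:
    """Return the scripts sorted by phase, then by number within the phase."""
    return sorted(filenames, key=phase_key)
```

Given `["P01_clean.R", "L02_api.py", "S01_setup.R", "L01_files.py", "A01_model.do"]`, `execution_order` returns the setup script first, then both load scripts in numeric order, then the process and analyze scripts. The mixed extensions reflect the language-agnostic design.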

Full documentation: docs/09-anudata/

Council Tools

Infrastructure tools providing specialized capabilities:

| Tool | Purpose |
| --- | --- |
| Druck | Performance monitoring, workspace standards, HDARP protocol hub |
| Robert | PDF processing infrastructure and Knowledge Base |
| Robin | Economic data platform (27 sources / 52.9 GB; canonical counts in AUTHORITATIVE_COUNTS.json) |
| Wynne | Economics research framework and HDARP campaigns |
| Grace | VLM cloud processing (A100 GPU) |
| Arthur | MIDI music generation |
| Caro | Biographical research with multi-engine OCR |
| Eleanor | Voice input processing |
| Manim | Mathematical animations |
| Pheidippides | Discord integration |

Repository Structure

arcanum-workspace/
├── README.md                    # This file
├── AGENTS.md                    # Agent instructions and commands
├── skills/                      # Skill templates (the actual .md prompts)
│   ├── hdarp/                   # HDARP family (8 skills)
│   ├── anu-framework/           # Anu Framework v12.0 (19 active skills + anu-build)
│   └── workspace/               # Utility skills (6 skills)
├── docs/
│   ├── 02-philosophy/           # Basher methodology, real-data principles
│   ├── 03-hdarp/                # Complete HDARP documentation
│   ├── 04-anu-framework/        # Anu Framework architecture
│   ├── 05-council-tools/        # Council tool overviews
│   ├── 06-commands-skills/      # Command references
│   ├── 07-workspace-standards/  # Data analysis, file management, KB system
│   ├── 08-agent-operations/     # Session management
│   └── 09-anudata/              # AnuData econometric research architecture
├── council/                     # Infrastructure tool documentation
├── projects/                    # Research project descriptions
└── tools/                       # Build and maintenance scripts

Philosophy

Bashers, Not Sweepers

Borrowed from Kurt Vonnegut's distinction between bashers and swoopers (called sweepers here): bashers work sentence by sentence, getting each one right before moving on. Sweepers write everything at once, then go back to fix it. This workspace operates as bashers.

In practice:

  • No placeholders — never write stub code or use dummy data
  • Real data only — all analysis uses genuine datasets, never synthetic
  • Step-by-step construction — complete each step fully before proceeding
  • Force column types early — immediately after data import, enforce types
  • Console-print debugging — print shapes, samples, and statistics at every transformation
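
The last two practices above (forcing column types early and console-print debugging) can be combined in a single load step. The following is a minimal standard-library sketch, not the workspace's own loader. `load_typed` and its `schema` argument are hypothetical names, and the file path in the usage note is a placeholder for a real input file.

```python
import csv
from typing import Any, Callable


def load_typed(path: str, schema: dict[str, Callable[[str], Any]]) -> list[dict[str, Any]]:
    """Read a CSV, cast every column in `schema` immediately, and print a summary."""
    with open(path, newline="", encoding="utf-8") as handle:
        raw_rows = list(csv.DictReader(handle))

    # Cast right after import: a malformed cell raises here, not three steps later.
    rows = [{col: cast(row[col]) for col, cast in schema.items()} for row in raw_rows]

    # Console-print debugging: shape, a small sample, and basic statistics.
    print(f"shape: {len(rows)} rows x {len(schema)} columns")
    print(f"sample: {rows[:3]}")
    for col, cast in schema.items():
        if cast in (int, float) and rows:
            values = [row[col] for row in rows]
            print(f"{col}: min={min(values)} max={max(values)}")
    return rows
```

A call such as `load_typed("Inputs/series.csv", {"year": int, "value": float})` either returns fully typed rows or fails at the first cell that does not parse, which keeps type errors next to the import step where they are easiest to diagnose.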

Why This Matters

The tools of rigorous empirical research have historically been gatekept — behind expensive software licenses, institutional access, and research teams. AI agents change this equation fundamentally. A single researcher with well-designed infrastructure can now process document archives, reconstruct datasets, and produce replication packages that meet professional standards.

This workspace is an attempt to share that infrastructure openly, in the tradition of scholars who believed that method matters as much as theory — and that making methods transparent is itself a political act.

Getting Started

git clone https://github.com/andenick/arcanum-workspace.git
cd arcanum-workspace

For AI Agents

  1. Read AGENTS.md for complete instructions
  2. Run /readystart [project_name] to initialize
  3. Follow project-specific instructions
  4. Document work with /handoff

For Researchers

  1. Browse skills/ to see the actual skill templates
  2. Study docs/03-hdarp/ for the PDF processing methodology
  3. Study docs/04-anu-framework/ for the data construction framework
  4. Adapt the templates to your own workspace

Multi-LLM Compatibility

All methodologies work across platforms:

  • Claude Code
  • Cursor AI
  • Gemini CLI
  • Any vision-capable AI agent

Standards (The Short Version)

  • Mandatory 3-folder structure: Inputs/ (read-only originals), Technical/ (working files), Outputs/ (results)
  • Inputs/ is flat — no type-based subdirectories, preserve user organization
  • HDARP for PDFs — files >10 pages OR >1MB must be chunked (no exceptions)
  • Real data only — no placeholders, no dummy data, no stubs
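
Two of the standards above are easy to check mechanically: the 3-folder structure and the chunking threshold. The sketch below shows one way to do so. The function names are hypothetical, and "1MB" is read as 1,000,000 bytes, which is an assumption (a binary megabyte would be 1,048,576).

```python
from pathlib import Path

REQUIRED_FOLDERS = ("Inputs", "Technical", "Outputs")
MAX_PAGES = 10
MAX_BYTES = 1_000_000  # assumption: "1MB" interpreted as 10**6 bytes


def needs_chunking(page_count: int, size_bytes: int) -> bool:
    """True when a PDF exceeds either limit (>10 pages OR >1MB)."""
    return page_count > MAX_PAGES or size_bytes > MAX_BYTES


def layout_problems(project_dir: str) -> list[str]:
    """List the required top-level folders that are missing from a project."""
    root = Path(project_dir)
    return [
        f"missing folder: {name}/"
        for name in REQUIRED_FOLDERS
        if not (root / name).is_dir()
    ]
```

Note that the threshold is an OR: a 3-page scan of 2 MB still needs chunking, while a 10-page, 500 KB file does not.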

License

MIT License — take what's useful, build something rigorous.


Version: 5.0
Last Updated: 2026-05-12

About

HDARP v6.1 PDF processing (95-98% accuracy), Sraffa 4.0 OCR, Anu Framework v12.0 data construction (19 skills), reproducible research infrastructure. AI-first.
