A meta-tool for constructing agent-traversable knowledge codices. It packages a multi-phase orchestration process into reference prompts that Claude Code reads and executes autonomously in target projects.
A codex is a structured knowledge base: source material (manuals, documentation, specs) transcribed into indexed, tagged markdown files with a manifest for surgical retrieval by AI agents.
You have a large body of reference material (PDFs, documentation, specs) and want to make it efficiently queryable by AI agents. Instead of feeding entire documents into context windows, a codex gives agents a manifest to search, section-level line numbers to jump to, and a controlled vocabulary to filter by.
- Claude Code (this tool is designed to be driven by CC)
- Python 3.10+ (for scripts)
- PyMuPDF (pip install PyMuPDF): for PDF TOC extraction and verification
- pyyaml (pip install pyyaml): for YAML processing in scripts
codex-builder is a set of prompts, not a runtime. You clone it once, then run it from within your target project. The codex is generated entirely inside the target project — codex-builder itself stays untouched.
# 1. Clone codex-builder somewhere accessible
git clone https://github.com/ryan-voitiskis/codex-builder.git
# 2. Create or navigate to your target project
mkdir my-project && cd my-project
# 3. Start Claude Code and feed it the first prompt
claude
# Inside Claude Code, tell it to read the init prompt:
> Read /path/to/codex-builder/prompts/00-init.md and follow its instructions
Each phase prompt tells Claude Code what to do. Run them in order, because each one reads the outputs of the previous phases. The generated codex lives entirely in your target project:
my-project/
├── docs/{corpus-name}/ # Transcribed markdown (the codex)
├── source-material/ # Raw PDFs, docs (gitignored)
├── codex-config.yaml # Your codex settings
├── codex-state.yaml # Progress tracker
├── manifest.yaml # Section-level index for agent retrieval
└── validate.sh # Validation script (copied from templates)
Each phase reads the outputs of previous phases and produces inputs for the next.
Phase 00 (init) → 01 (gather) → 02 (plan) → 03 (transcribe) → 04 (verify) → 05 (map) → 06 (review)
Guided conversation to set up the project. Gathers domain info, creates directory structure, generates codex-config.yaml.
Input: User answers
Output: codex-config.yaml, codex-state.yaml, directory structure
Discovers and catalogs all source material. Helps fetch public resources, guides manual download of gated content.
Input: codex-config.yaml
Output: sources.yaml, populated source-material/
Analyzes source structure (PDF TOCs, web sitemaps) and plans how to split material into transcribable chunks.
Input: sources.yaml, source PDFs
Output: chunking-plan.yaml
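As a rough illustration of the planning step, the sketch below turns a PDF table of contents into page ranges. It assumes the TOC arrives as `[level, title, page]` entries with 1-based pages, which is the shape PyMuPDF's `Document.get_toc()` returns. The function name `plan_chunks` and the output fields are hypothetical and are not the actual format of chunking-plan.yaml.

```python
from typing import TypedDict


class Chunk(TypedDict):
    title: str
    start_page: int
    end_page: int


def plan_chunks(toc: list[list], page_count: int, level: int = 1) -> list[Chunk]:
    """Turn TOC entries at one heading level into contiguous page ranges.

    `toc` is assumed to hold [level, title, page, ...] entries with 1-based
    page numbers (the shape PyMuPDF's Document.get_toc() returns).
    """
    # Keep only the entries at the chosen level, in document order.
    heads = [(title, page) for lvl, title, page, *_ in toc if lvl == level]
    chunks: list[Chunk] = []
    for i, (title, start) in enumerate(heads):
        # A chunk ends on the page before the next heading starts,
        # or on the last page of the document.
        end = heads[i + 1][1] - 1 if i + 1 < len(heads) else page_count
        # Two headings on the same page would give end < start; clamp it.
        chunks.append({"title": title, "start_page": start, "end_page": max(start, end)})
    return chunks


if __name__ == "__main__":
    # Hypothetical TOC; with PyMuPDF it would come from fitz.open(path).get_toc().
    sample_toc = [[1, "Introduction", 1], [2, "Overview", 2], [1, "Setup", 5], [1, "Usage", 9]]
    for chunk in plan_chunks(sample_toc, page_count=20):
        print(chunk)
```

Splitting at a single heading level keeps the example short; a real plan would probably also need to subdivide very long chapters.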
Orchestrates parallel subagents to transcribe each chunk into markdown with YAML frontmatter. Tracks progress for resumability.
Input: chunking-plan.yaml, source material
Output: docs/{corpus}/ with markdown files
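The batching idea behind parallel transcription can be sketched in a few lines. This is only an analogy: in codex-builder the workers are Claude Code subagents driven by the prompt, not Python threads, and the names `batched`, `run_in_batches`, and the `transcribe` callback are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_in_batches(
    chunk_ids: list[str],
    batch_size: int,
    transcribe: Callable[[str], bool],
) -> dict[str, str]:
    """Run `transcribe` over each batch in parallel, one batch at a time.

    Returns a status per chunk so that failed chunks can be retried later.
    """
    status = {chunk_id: "pending" for chunk_id in chunk_ids}
    for batch in batched(chunk_ids, batch_size):
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            # map() preserves input order, so results line up with the batch.
            for chunk_id, ok in zip(batch, pool.map(transcribe, batch)):
                status[chunk_id] = "transcribed" if ok else "pending"
    return status


if __name__ == "__main__":
    # Stand-in worker: pretend every chunk except "ch-02" succeeds.
    result = run_in_batches(["ch-01", "ch-02", "ch-03"], 2, lambda c: c != "ch-02")
    print(result)
```

Chunks that stay `pending` are exactly the ones a resumed run would pick up again.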
Systematic accuracy checks against source material. Configurable depth: exhaustive, sampling, or hybrid.
Input: Transcribed docs, source material
Output: Confidence levels, fixes applied
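One way to picture the three depth settings is as a rule for choosing which documents get checked. The sketch below is an assumption about what the modes could mean, not the behaviour of the verification prompt: in particular, treating "hybrid" as "always check flagged documents, sample the rest" is a guess, and `select_for_verification`, `sample_rate`, and `high_risk` are hypothetical names.

```python
import random


def select_for_verification(
    doc_ids: list[str],
    depth: str,
    sample_rate: float = 0.2,
    high_risk: frozenset[str] = frozenset(),
    seed: int = 0,
) -> list[str]:
    """Pick the documents to check against source for a given depth."""
    if depth == "exhaustive":
        return list(doc_ids)
    if depth == "sampling":
        forced, pool = set(), list(doc_ids)
    elif depth == "hybrid":
        # Assumed meaning of "hybrid": always check flagged documents,
        # then sample from the remainder.
        forced = {d for d in doc_ids if d in high_risk}
        pool = [d for d in doc_ids if d not in high_risk]
    else:
        raise ValueError(f"unknown verification depth: {depth!r}")
    # Check at least one document whenever there is anything to sample.
    count = min(len(pool), max(1, round(len(pool) * sample_rate)))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    chosen = forced | set(rng.sample(pool, count))
    return [d for d in doc_ids if d in chosen]  # keep the original order


if __name__ == "__main__":
    docs = [f"doc-{i:02d}" for i in range(10)]
    print(select_for_verification(docs, "hybrid", high_risk=frozenset({"doc-03"})))
```

A fixed seed means a rerun selects the same sample, which makes verification results comparable between runs.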
Finalizes the controlled vocabulary (topics were tagged freely during transcription, now normalized). Generates the manifest with section-level entries. Adds cross-references.
Input: Verified docs
Output: manifest.yaml, normalized vocabulary, cross-references
Last-pass validation, spot-checks, manifest integrity verification, and README generation for the codex itself.
Input: Mapped codex
Output: Validated codex, README, completion status
codex-builder/
├── prompts/ # Phase prompt files — the core of the tool
│ ├── 00-init.md
│ ├── 01-source-gathering.md
│ ├── 02-structure-plan.md
│ ├── 03-transcription.md
│ ├── 04-verification.md
│ ├── 05-mapping.md
│ └── 06-final-review.md
├── templates/ # Reusable templates copied into target projects
│ ├── codex-config.yaml # Config template (single source of truth)
│ ├── manifest-template.yaml # Empty manifest skeleton
│ └── validate.sh # Bash validation script (no dependencies)
├── scripts/ # Python helper scripts
│ ├── pdf-toc-extract.py # PDF bookmark/TOC extraction
│ └── verify-transcription.py # Automated verification
├── examples/ # Real-world example configs
│ └── rekordbox-codex-config.yaml
└── README.md
The entire process is driven by codex-config.yaml. Key customization points:
- vocabulary — Document types, modes, and topics specific to your domain
- content_types — Toggle rules for prose, tables, code, formulas, images
- verification.depth — Trade off speed vs. confidence
- transcription.batch_size — Control parallelism based on your resources
- screenshot_notation — Format for image placeholders
Topics evolve organically. During transcription (phase 03), agents tag freely with whatever terms fit. Phase 05 collects all tags, analyzes frequency/overlap, proposes a normalized vocabulary, and runs a fixup pass. This avoids premature taxonomy design.
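The collect, count, and propose steps might look roughly like the sketch below. It is a simplification under stated assumptions: the near-duplicate rule (case, separators, a trailing "s") is a crude heuristic chosen for illustration, and `canonical_key`, `propose_vocabulary`, and `min_count` are hypothetical names rather than anything phase 05 defines.

```python
from collections import Counter, defaultdict


def canonical_key(tag: str) -> str:
    """Collapse case, separators, and a trailing plural 's' to spot near-duplicates."""
    key = tag.strip().lower().replace("_", "-").replace(" ", "-")
    # Crude plural handling; it will also clip words such as "class".
    return key[:-1] if key.endswith("s") and len(key) > 3 else key


def propose_vocabulary(
    doc_tags: dict[str, list[str]], min_count: int = 2
) -> tuple[dict[str, str], list[str]]:
    """Return (variant -> preferred tag, rarely used tags to review by hand)."""
    counts = Counter(tag for tags in doc_tags.values() for tag in tags)
    groups: dict[str, list[str]] = defaultdict(list)
    for tag in counts:
        groups[canonical_key(tag)].append(tag)

    mapping: dict[str, str] = {}
    rare: list[str] = []
    for variants in groups.values():
        # The most frequent variant wins; ties prefer lowercase, then alphabetical.
        preferred = min(variants, key=lambda t: (-counts[t], t != t.lower(), t))
        if sum(counts[v] for v in variants) < min_count:
            rare.extend(variants)
        for variant in variants:
            mapping[variant] = preferred
    return mapping, sorted(rare)


if __name__ == "__main__":
    tags = {
        "a.md": ["Beat Grid", "export"],
        "b.md": ["beat-grids", "beat-grid", "Export"],
        "c.md": ["beat-grid", "midi"],
    }
    mapping, rare = propose_vocabulary(tags)
    print(mapping)
    print(rare)
```

The returned mapping is what a fixup pass would apply to each file's tags, while the rare list is a prompt for human review rather than automatic deletion.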
Section-level manifest entries. The manifest doesn't just list documents — it indexes every ## heading with line numbers. Agents can jump directly to relevant sections without reading entire files.
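To make the idea concrete, here is a minimal sketch of indexing `## ` headings by line number and then reading back a single section. The entry fields (`heading`, `start_line`, `end_line`) are assumptions for illustration; the real manifest.yaml schema may differ.

```python
import re

HEADING = re.compile(r"^## +(.+?)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def index_sections(markdown: str) -> list[dict]:
    """Return one entry per level-2 heading with 1-based, inclusive line ranges."""
    lines = markdown.splitlines()
    entries: list[dict] = []
    in_fence = False
    for number, line in enumerate(lines, start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # headings inside fenced code do not count
            continue
        match = None if in_fence else HEADING.match(line)
        if match:
            if entries:
                # The previous section ends on the line before this heading.
                entries[-1]["end_line"] = number - 1
            entries.append(
                {"heading": match.group(1), "start_line": number, "end_line": len(lines)}
            )
    return entries


def read_section(markdown: str, entry: dict) -> str:
    """Return only the lines covered by one index entry."""
    lines = markdown.splitlines()
    return "\n".join(lines[entry["start_line"] - 1:entry["end_line"]])


if __name__ == "__main__":
    doc = "# Title\n## Setup\ninstall steps\n## Usage\nrun it\nthen check"
    sections = index_sections(doc)
    print(sections)
    print(read_section(doc, sections[1]))
```

An agent that finds the right entry in the manifest only needs the slice returned by something like `read_section`, not the whole file.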
Resumable progress tracking. codex-state.yaml tracks per-document status (pending → transcribed → verified → mapped → reviewed). If a phase is interrupted, it picks up where it left off.
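The status progression above behaves like a small state machine. The sketch below uses a plain dictionary as a stand-in for codex-state.yaml, whose actual layout is not shown here; `PHASE_INPUT`, `remaining`, and `advance` are hypothetical names.

```python
STAGES = ["pending", "transcribed", "verified", "mapped", "reviewed"]

# The status a document must already have for each phase to process it.
PHASE_INPUT = {
    "transcribe": "pending",
    "verify": "transcribed",
    "map": "verified",
    "review": "mapped",
}


def remaining(state: dict[str, str], phase: str) -> list[str]:
    """Documents this phase still has to process, in stored order."""
    needed = PHASE_INPUT[phase]
    return [doc for doc, status in state.items() if status == needed]


def advance(state: dict[str, str], doc: str) -> str:
    """Move one document to the next stage and return its new status."""
    position = STAGES.index(state[doc])
    if position == len(STAGES) - 1:
        raise ValueError(f"{doc} is already at the final stage")
    state[doc] = STAGES[position + 1]
    return state[doc]


if __name__ == "__main__":
    # Hypothetical state after an interrupted verification run.
    state = {"intro.md": "verified", "setup.md": "transcribed", "usage.md": "transcribed"}
    print(remaining(state, "verify"))  # only the documents not yet verified
    advance(state, "setup.md")
    print(remaining(state, "verify"))
```

Because each phase selects work by status rather than by position in a list, rerunning an interrupted phase naturally skips the documents it already finished.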
No heavy dependencies for validation. validate.sh uses only bash, awk, and grep — it runs anywhere without Python or jq.