ryan-voitiskis/codex-builder
codex-builder

A meta-tool for constructing agent-traversable knowledge codices. It packages a multi-phase orchestration process into reference prompts that Claude Code reads and executes autonomously in target projects.

A codex is a structured knowledge base: source material (manuals, documentation, specs) transcribed into indexed, tagged markdown files with a manifest for surgical retrieval by AI agents.

When to use this

You have a large body of reference material (PDFs, documentation, specs) and want to make it efficiently queryable by AI agents. Instead of feeding entire documents into context windows, a codex gives agents a manifest to search, section-level line numbers to jump to, and a controlled vocabulary to filter by.
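As a rough illustration of the retrieval pattern described above, the sketch below filters manifest entries by topic and reads only the matching line range. The entry fields (`file`, `heading`, `start_line`, `end_line`, `topics`) and the sample entries are hypothetical; the actual manifest.yaml schema is defined by the phase prompts and may differ.

```python
from pathlib import Path

# Hypothetical manifest entries; the real manifest.yaml schema may differ.
MANIFEST = [
    {"file": "guide.md", "heading": "Import", "start_line": 1, "end_line": 4, "topics": ["import"]},
    {"file": "guide.md", "heading": "Export", "start_line": 5, "end_line": 7, "topics": ["export"]},
]

def find_sections(manifest, topic):
    """Return the manifest entries tagged with the given topic."""
    return [entry for entry in manifest if topic in entry["topics"]]

def read_section(root, entry):
    """Read only the lines an entry points at (1-indexed, inclusive)."""
    lines = (Path(root) / entry["file"]).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[entry["start_line"] - 1 : entry["end_line"]])
```

An agent following this pattern loads one small index and a handful of lines, rather than the whole document.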

Prerequisites

  • Claude Code (this tool is designed to be driven by Claude Code)
  • Python 3.10+ (for scripts)
  • PyMuPDF (pip install PyMuPDF) — for PDF TOC extraction and verification
  • pyyaml (pip install pyyaml) — for YAML processing in scripts

Usage

codex-builder is a set of prompts, not a runtime. You clone it once, then run it from within your target project. The codex is generated entirely inside the target project — codex-builder itself stays untouched.

# 1. Clone codex-builder somewhere accessible
git clone https://github.com/ryan-voitiskis/codex-builder.git

# 2. Create or navigate to your target project
mkdir my-project && cd my-project

# 3. Start Claude Code and feed it the first prompt
claude

# Inside Claude Code, tell it to read the init prompt:
> Read /path/to/codex-builder/prompts/00-init.md and follow its instructions

Each phase prompt tells Claude Code what to do. Run them in order — each reads the outputs of previous phases. The generated codex lives entirely in your target project:

my-project/
├── docs/{corpus-name}/       # Transcribed markdown (the codex)
├── source-material/          # Raw PDFs, docs (gitignored)
├── codex-config.yaml         # Your codex settings
├── codex-state.yaml          # Progress tracker
├── manifest.yaml             # Section-level index for agent retrieval
└── validate.sh               # Validation script (copied from templates)

Workflow

Each phase reads the outputs of previous phases and produces inputs for the next.

Phase 00   → 01     → 02   → 03         → 04     → 05  → 06
      init   gather   plan   transcribe   verify   map   review

Phase 00 — Initialization (prompts/00-init.md)

Guided conversation to set up the project. Gathers domain info, creates directory structure, generates codex-config.yaml.

Input: User answers
Output: codex-config.yaml, codex-state.yaml, directory structure

Phase 01 — Source Gathering (prompts/01-source-gathering.md)

Discovers and catalogs all source material. Helps fetch public resources, guides manual download of gated content.

Input: codex-config.yaml
Output: sources.yaml, populated source-material/

Phase 02 — Structure Planning (prompts/02-structure-plan.md)

Analyzes source structure (PDF TOCs, web sitemaps) and plans how to split material into transcribable chunks.

Input: sources.yaml, source PDFs
Output: chunking-plan.yaml
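
One way to picture the planning step is turning a flat TOC into page-range chunks. The sketch below assumes TOC rows shaped as `[level, title, page]`, which is the shape PyMuPDF's `Document.get_toc()` returns. The chunk fields, the sample TOC, and the rule of splitting only at top-level headings are illustrative assumptions, not the format of chunking-plan.yaml.

```python
def plan_chunks(toc, total_pages, split_level=1):
    """Turn [level, title, page] TOC rows into page-range chunks.

    Each chunk starts at a heading of `split_level` and runs up to the
    page before the next such heading (or to the end of the document).
    """
    starts = [(title, page) for level, title, page in toc if level == split_level]
    chunks = []
    for i, (title, first_page) in enumerate(starts):
        if i + 1 < len(starts):
            # End one page before the next chunk, but never before our own
            # start page (two headings can share a page).
            last_page = max(first_page, starts[i + 1][1] - 1)
        else:
            last_page = total_pages
        chunks.append({"title": title, "first_page": first_page, "last_page": last_page})
    return chunks

# Hypothetical TOC for a 20-page manual.
toc = [[1, "Introduction", 1], [2, "Overview", 2], [1, "Setup", 5], [1, "Reference", 12]]
for chunk in plan_chunks(toc, total_pages=20):
    print(chunk)
```

Splitting at a single heading level keeps the example short; a real plan would likely also cap chunk size so that long chapters are divided further.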

Phase 03 — Transcription (prompts/03-transcription.md)

Orchestrates parallel subagents to transcribe each chunk into markdown with YAML frontmatter. Tracks progress for resumability.

Input: chunking-plan.yaml, source material
Output: docs/{corpus}/ with markdown files

Phase 04 — Verification (prompts/04-verification.md)

Systematic accuracy checks against source material. Configurable depth: exhaustive, sampling, or hybrid.

Input: Transcribed docs, source material
Output: Confidence levels, fixes applied

Phase 05 — Mapping (prompts/05-mapping.md)

Finalizes the controlled vocabulary (topics were tagged freely during transcription, now normalized). Generates the manifest with section-level entries. Adds cross-references.

Input: Verified docs
Output: manifest.yaml, normalized vocabulary, cross-references

Phase 06 — Final Review (prompts/06-final-review.md)

Last-pass validation, spot-checks, manifest integrity verification, and README generation for the codex itself.

Input: Mapped codex
Output: Validated codex, README, completion status

Project Structure

codex-builder/
├── prompts/                    # Phase prompt files — the core of the tool
│   ├── 00-init.md
│   ├── 01-source-gathering.md
│   ├── 02-structure-plan.md
│   ├── 03-transcription.md
│   ├── 04-verification.md
│   ├── 05-mapping.md
│   └── 06-final-review.md
├── templates/                  # Reusable templates copied into target projects
│   ├── codex-config.yaml       # Config template (single source of truth)
│   ├── manifest-template.yaml  # Empty manifest skeleton
│   └── validate.sh             # Bash validation script (no dependencies)
├── scripts/                    # Python helper scripts
│   ├── pdf-toc-extract.py      # PDF bookmark/TOC extraction
│   └── verify-transcription.py # Automated verification
├── examples/                   # Real-world example configs
│   └── rekordbox-codex-config.yaml
└── README.md

Customization

The entire process is driven by codex-config.yaml. Key customization points:

  • vocabulary — Document types, modes, and topics specific to your domain
  • content_types — Toggle rules for prose, tables, code, formulas, images
  • verification.depth — Trade off speed vs. confidence
  • transcription.batch_size — Control parallelism based on your resources
  • screenshot_notation — Format for image placeholders

Key Design Decisions

Topics evolve organically. During transcription (phase 03), agents tag freely with whatever terms fit. Phase 05 collects all tags, analyzes frequency/overlap, proposes a normalized vocabulary, and runs a fixup pass. This avoids premature taxonomy design.
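
The collect-and-analyze step could look roughly like the following. It counts free-form tags and groups variants that collapse to the same key after lowercasing and stripping separators. That grouping rule and the sample tags are assumptions for illustration; in practice the phase 05 analysis is carried out by the agent following the prompt.

```python
import re
from collections import Counter, defaultdict

def propose_vocabulary(doc_tags):
    """Map each free-form tag to the most frequent variant of its normalized key."""
    counts = Counter(tag for tags in doc_tags for tag in tags)
    variants = defaultdict(list)
    for tag in counts:
        # "Beat Grid" and "beat-grid" both collapse to "beatgrid".
        key = re.sub(r"[\s_-]+", "", tag.lower())
        variants[key].append(tag)
    mapping = {}
    for group in variants.values():
        # Most frequent spelling wins; ties break alphabetically for determinism.
        canonical = min(group, key=lambda t: (-counts[t], t))
        for tag in group:
            mapping[tag] = canonical
    return mapping

# Hypothetical tags gathered from three transcribed documents.
doc_tags = [["beat-grid", "export"], ["Beat Grid", "beat-grid"], ["Export", "export"]]
print(propose_vocabulary(doc_tags))
```

The resulting mapping is what a fixup pass would apply to each document's frontmatter.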

Section-level manifest entries. The manifest doesn't just list documents — it indexes every ## heading with line numbers. Agents can jump directly to relevant sections without reading entire files.
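
A minimal sketch of how such an index could be built, assuming each section runs from its `## ` heading to the line before the next one. The entry fields (`heading`, `start_line`, `end_line`) are hypothetical and are not taken from the actual manifest schema.

```python
FENCE = "`" * 3  # code-fence marker, built indirectly to keep this example readable

def index_sections(markdown_text):
    """Index every '## ' heading with 1-indexed, inclusive line ranges."""
    lines = markdown_text.splitlines()
    sections = []
    in_fence = False
    for number, line in enumerate(lines, start=1):
        if line.startswith(FENCE):
            in_fence = not in_fence  # skip '## ' lines inside fenced code
        elif not in_fence and line.startswith("## "):
            if sections:
                # The previous section ends on the line before this heading.
                sections[-1]["end_line"] = number - 1
            sections.append(
                {"heading": line[3:].strip(), "start_line": number, "end_line": len(lines)}
            )
    return sections
```

Because the ranges are line-based, any edit to a transcribed file shifts them, so the index has to be regenerated after changes.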

Resumable progress tracking. codex-state.yaml tracks per-document status (pending → transcribed → verified → mapped → reviewed). If a phase is interrupted, it picks up where it left off.
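
The resume logic can be sketched as an ordered status list plus a filter. The status names come from the sequence above; the dictionary shape and file names are hypothetical stand-ins for whatever codex-state.yaml actually stores.

```python
STATUSES = ["pending", "transcribed", "verified", "mapped", "reviewed"]

def remaining_for(state, target):
    """Return documents that have not yet reached `target`, in stable order."""
    goal = STATUSES.index(target)
    return sorted(doc for doc, status in state.items() if STATUSES.index(status) < goal)

# Hypothetical snapshot of per-document status after an interrupted phase 04.
state = {"ch01.md": "verified", "ch02.md": "transcribed", "ch03.md": "pending"}
print(remaining_for(state, "verified"))  # → ['ch02.md', 'ch03.md']
```

Since the statuses are ordered, a rerun only needs to compare each document's position against the phase's target and skip anything already at or past it.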

No heavy dependencies for validation. validate.sh uses only bash, awk, and grep — it runs anywhere without Python or jq.
