LexBuild

LexBuild is an open-source toolchain for U.S. legal texts. It transforms official source XML into structured Markdown with rich metadata, optimized for LLMs, RAG pipelines, and semantic search.

Overview

The United States Code is the official codification of federal statutory law, organized into 54 titles. It is available as USLM XML from the Office of the Law Revision Counsel (OLRC).

The Code of Federal Regulations (CFR) is the official codification of federal administrative regulations, organized into 50 titles. The Electronic Code of Federal Regulations (eCFR) is a continuously updated editorial compilation incorporating changes as they appear in the Federal Register. eCFR XML is available from the ecfr.gov API (daily-updated) and GovInfo (bulk data).

Both formats are dense and deeply nested, making them difficult to work with directly.

LexBuild transforms this XML into per-section Markdown files with YAML frontmatter and predictable file paths, with content sized for typical embedding model context windows. The result makes the full corpus of federal law and regulations accessible to LLMs, vector databases, and legal research tools.


Sources

| Source | Package | XML Format | Titles | Status |
| --- | --- | --- | --- | --- |
| U.S. Code | @lexbuild/usc | USLM 1.0 | 54 | Stable |
| eCFR (Code of Federal Regulations) | @lexbuild/ecfr | GPO/SGML | 50 | Stable |
| Annual CFR (official edition) | @lexbuild/cfr | GPO/SGML | 50 | Planned |
| Federal Register | @lexbuild/fr | GPO/SGML variant | | Planned |
| State statutes | @lexbuild/state-* | Varies | | Exploratory |

Data Sources

| Source | Download From | Update Frequency | Notes |
| --- | --- | --- | --- |
| U.S. Code | uscode.house.gov (OLRC) | Multiple times/month | Release point auto-detected from OLRC download page |
| eCFR (default) | ecfr.gov API | Daily | Point-in-time support via --date flag |
| eCFR (fallback) | govinfo.gov | Irregular | Bulk XML, updates per-title as regulations change |

Install

Run Directly (no install)

npx @lexbuild/cli download-usc --all
npx @lexbuild/cli convert-usc --all

Global Install

npm install -g @lexbuild/cli
# or
pnpm add -g @lexbuild/cli

Build From Source

Requires Node.js >= 22 and pnpm >= 10.

git clone https://github.com/chris-c-thomas/LexBuild.git
cd LexBuild
pnpm install && pnpm turbo build

Quick Start

U.S. Code

# Download and convert all 54 titles
lexbuild download-usc --all && lexbuild convert-usc --all

# Start small — a single title
lexbuild download-usc --titles 1 && lexbuild convert-usc --titles 1

# A range of titles
lexbuild download-usc --titles 1-5 && lexbuild convert-usc --titles 1-5

eCFR (Code of Federal Regulations)

# Download and convert all 50 titles
lexbuild download-ecfr --all && lexbuild convert-ecfr --all

# A single title
lexbuild download-ecfr --titles 17 && lexbuild convert-ecfr --titles 17

# Point-in-time download (CFR as of a specific date)
lexbuild download-ecfr --all --date 2025-01-01

Commands

download-usc

Fetch U.S. Code XML from the OLRC. Auto-detects the latest release point.

lexbuild download-usc --all                                  # All 54 titles
lexbuild download-usc --titles 1-5,8,11                      # Specific titles
lexbuild download-usc --all --release-point 119-73not60      # Pin a release

| Option | Default | Description |
| --- | --- | --- |
| `--titles <spec>` | | Title(s): 1, 1-5, 1-5,8,11 |
| `--all` | | Download all 54 titles (single bulk zip) |
| `-o, --output <dir>` | `./downloads/usc/xml` | Output directory |
| `--release-point <id>` | auto-detected | Pin a specific OLRC release point |
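
The `--titles` spec format above (single numbers, ranges, and comma-separated mixes) can be illustrated with a small parser. This is a minimal sketch, not LexBuild's actual implementation: `parseTitleSpec` and its `maxTitle` parameter are hypothetical names, and the bounds check assumes the 54 U.S. Code titles.

```typescript
// Hypothetical helper: expands a title spec such as "1-5,8,11" into a sorted
// list of unique title numbers. LexBuild's own parser may behave differently.
function parseTitleSpec(spec: string, maxTitle: number = 54): number[] {
  const titles = new Set<number>();
  for (const rawPart of spec.split(",")) {
    const part = rawPart.trim();
    // Accept either a single number ("8") or an inclusive range ("1-5").
    const match = /^(\d+)(?:-(\d+))?$/.exec(part);
    if (match === null) {
      throw new Error(`Invalid title spec segment: "${part}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end > maxTitle || start > end) {
      throw new Error(`Title range out of bounds: "${part}"`);
    }
    for (let n = start; n <= end; n++) {
      titles.add(n);
    }
  }
  return Array.from(titles).sort((a, b) => a - b);
}

// Usage: the third example spec from the options table.
const exampleTitles = parseTitleSpec("1-5,8,11");
```

Collecting into a `Set` means overlapping segments such as `1-5,3` collapse to a single entry per title.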

convert-usc

Convert downloaded USC XML to Markdown.

lexbuild convert-usc --all                                   # All downloaded titles
lexbuild convert-usc --titles 1 -g chapter                   # Chapter-level output
lexbuild convert-usc --titles 26 --dry-run                   # Preview without writing
lexbuild convert-usc ./downloads/usc/xml/usc01.xml           # Direct file path

| Option | Default | Description |
| --- | --- | --- |
| `--titles <spec>` | | Title(s) to convert |
| `--all` | | Convert all titles in input directory |
| `-i, --input-dir <dir>` | `./downloads/usc/xml` | Input XML directory |
| `-o, --output <dir>` | `./output` | Output directory |
| `-g, --granularity` | `section` | section, chapter, or title |
| `--link-style` | `plaintext` | plaintext, canonical, or relative |
| `--no-include-source-credits` | | Exclude source credits |
| `--no-include-notes` | | Exclude all notes |
| `--include-editorial-notes` | | Include editorial notes only |
| `--include-statutory-notes` | | Include statutory notes only |
| `--include-amendments` | | Include amendment notes only |
| `--dry-run` | | Parse and report without writing |
| `-v, --verbose` | | Verbose output |

download-ecfr

Fetch eCFR XML. Defaults to the ecfr.gov API (daily-updated); GovInfo bulk data is available as a fallback.

lexbuild download-ecfr --all                                 # All 50 titles (eCFR API)
lexbuild download-ecfr --titles 1-5,17                       # Specific titles
lexbuild download-ecfr --all --date 2025-01-01               # Point-in-time download
lexbuild download-ecfr --all --source govinfo                # Govinfo bulk fallback

| Option | Default | Description |
| --- | --- | --- |
| `--titles <spec>` | | Title(s): 1, 1-5, 1-5,17 |
| `--all` | | Download all 50 titles |
| `-o, --output <dir>` | `./downloads/ecfr/xml` | Output directory |
| `--source` | `ecfr-api` | ecfr-api (daily-updated) or govinfo (bulk) |
| `--date <YYYY-MM-DD>` | current | Point-in-time date (ecfr-api only) |
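
The table notes that `--date` applies only to the `ecfr-api` source. The sketch below shows one way a wrapper script might pre-check that combination before invoking the CLI. The function name `checkEcfrDownloadOptions` and the error strings are assumptions for illustration; how the real CLI reacts to an unsupported combination is not documented here.

```typescript
// Assumed option names mirror the documented flags; the real CLI's
// validation logic and error messages are not shown in this README.
type EcfrSource = "ecfr-api" | "govinfo";

// Returns null when the options look valid, otherwise a description of the problem.
function checkEcfrDownloadOptions(source: EcfrSource, date?: string): string | null {
  if (date === undefined) {
    return null; // no date means "current", which either source can serve
  }
  if (source !== "ecfr-api") {
    return "--date requires --source ecfr-api";
  }
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(date);
  if (match === null) {
    return "--date must use the YYYY-MM-DD format";
  }
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  // Round-trip through Date to reject impossible dates such as 2025-02-30.
  const parsed = new Date(Date.UTC(year, month - 1, day));
  const isRealDate =
    parsed.getUTCFullYear() === year &&
    parsed.getUTCMonth() === month - 1 &&
    parsed.getUTCDate() === day;
  return isRealDate ? null : `--date is not a real calendar date: ${date}`;
}
```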

convert-ecfr

Convert downloaded eCFR XML to Markdown.

lexbuild convert-ecfr --all                                  # All downloaded titles
lexbuild convert-ecfr --titles 17 -g part                    # Part-level output
lexbuild convert-ecfr --all --dry-run                        # Preview without writing
lexbuild convert-ecfr ./downloads/ecfr/xml/ECFR-title17.xml  # Direct file path

| Option | Default | Description |
| --- | --- | --- |
| `--titles <spec>` | | Title(s) to convert |
| `--all` | | Convert all titles in input directory |
| `-i, --input-dir <dir>` | `./downloads/ecfr/xml` | Input XML directory |
| `-o, --output <dir>` | `./output` | Output directory |
| `-g, --granularity` | `section` | section, part, chapter, or title |
| `--link-style` | `plaintext` | plaintext, canonical, or relative |
| `--no-include-source-credits` | | Exclude source credits |
| `--no-include-notes` | | Exclude all notes |
| `--include-editorial-notes` | | Include editorial/regulatory notes only |
| `--include-statutory-notes` | | Include statutory notes only |
| `--include-amendments` | | Include amendment notes only |
| `--dry-run` | | Parse and report without writing |
| `-v, --verbose` | | Verbose output |

Output

File Structure

U.S. Code (-g section, default):

output/usc/
  title-01/
    README.md
    _meta.json
    chapter-01/
      _meta.json
      section-1.md
      section-2.md

eCFR (-g section, default):

output/ecfr/
  title-17/
    README.md
    _meta.json
    chapter-IV/
      part-240/
        _meta.json
        section-240.10b-5.md

All granularity levels:

| Source | section | chapter/part | title |
| --- | --- | --- | --- |
| USC | title-01/chapter-01/section-1.md | title-01/chapter-01/chapter-01.md | title-01.md |
| eCFR | title-17/chapter-IV/part-240/section-240.10b-5.md | title-17/chapter-IV/part-240.md | title-17.md |
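
Because the paths are predictable, downstream tools can compute a file location from citation parts without scanning the output tree. The following is a sketch that reproduces only the path shapes shown in the table. `uscOutputPath`, `ecfrOutputPath`, and `pad2` are hypothetical helpers, and the two-digit zero padding and numeric USC chapter numbers are inferred from the examples rather than confirmed rules.

```typescript
// Zero-pads numbers below 10 to two digits, matching "title-01" in the table.
function pad2(n: number): string {
  return n < 10 ? `0${n}` : String(n);
}

type UscGranularity = "section" | "chapter" | "title";
type EcfrGranularity = "section" | "part" | "title";

// Hypothetical helper; assumes a numeric chapter number.
function uscOutputPath(
  title: number,
  chapter: number,
  section: string,
  granularity: UscGranularity
): string {
  const titleDir = `title-${pad2(title)}`;
  const chapterDir = `chapter-${pad2(chapter)}`;
  if (granularity === "title") {
    return `${titleDir}.md`;
  }
  if (granularity === "chapter") {
    return `${titleDir}/${chapterDir}/${chapterDir}.md`;
  }
  return `${titleDir}/${chapterDir}/section-${section}.md`;
}

// Hypothetical helper; eCFR chapters are Roman numerals, so they stay strings.
function ecfrOutputPath(
  title: number,
  chapter: string,
  part: string,
  section: string,
  granularity: EcfrGranularity
): string {
  const titleDir = `title-${pad2(title)}`;
  if (granularity === "title") {
    return `${titleDir}.md`;
  }
  if (granularity === "part") {
    return `${titleDir}/chapter-${chapter}/part-${part}.md`;
  }
  return `${titleDir}/chapter-${chapter}/part-${part}/section-${section}.md`;
}
```

The eCFR `chapter` granularity is left out on purpose, since the table does not show its path.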

Frontmatter

Every Markdown file includes YAML frontmatter with source-specific metadata:

U.S. Code:

---
identifier: "/us/usc/t1/s7"
source: "usc"
legal_status: "official_legal_evidence"
title: "1 USC § 7 - Marriage"
title_number: 1
title_name: "GENERAL PROVISIONS"
section_number: "7"
section_name: "Marriage"
chapter_number: 1
chapter_name: "RULES OF CONSTRUCTION"
positive_law: true
currency: "119-73"
last_updated: "2025-12-03"
format_version: "1.1.0"
generator: "lexbuild@1.9.3"
source_credit: "(Added Pub. L. 104-199, § 3(a), Sept. 21, 1996, ...)"
---

eCFR:

---
identifier: "/us/cfr/t17/s240.10b-5"
source: "ecfr"
legal_status: "authoritative_unofficial"
title: "17 CFR § 240.10b-5 - Employment of manipulative and deceptive devices"
title_number: 17
section_number: "240.10b-5"
positive_law: false
authority: "15 U.S.C. 78a et seq., ..."
cfr_part: "240"
---

The source field identifies the content's origin ("usc" or "ecfr"). The legal_status field indicates provenance: "official_legal_evidence" (positive law USC titles), "official_prima_facie" (non-positive law USC titles), or "authoritative_unofficial" (eCFR).
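
For consumers that only need the flat scalar fields shown above, the frontmatter can be read without a YAML dependency. This is a deliberately minimal sketch: `parseFrontmatter` and `parseScalar` are hypothetical names, and the code handles only single-line `key: value` pairs with double-quoted strings, integers or decimals, and booleans. It does not process escape sequences, nested structures, or multi-line values, so a full YAML parser is the safer choice for anything beyond these examples.

```typescript
type FrontmatterValue = string | number | boolean;

// Converts one raw YAML scalar into a string, number, or boolean.
function parseScalar(raw: string): FrontmatterValue {
  const isQuoted =
    raw.length >= 2 && raw.charAt(0) === '"' && raw.charAt(raw.length - 1) === '"';
  if (isQuoted) {
    return raw.slice(1, -1); // strip quotes; escapes are not interpreted
  }
  if (raw === "true") return true;
  if (raw === "false") return false;
  if (/^-?\d+(\.\d+)?$/.test(raw)) return Number(raw);
  return raw;
}

// Reads the block between the leading "---" and the next "---" line.
function parseFrontmatter(markdown: string): Record<string, FrontmatterValue> {
  const lines = markdown.split(/\r?\n/);
  if (lines[0] !== "---") {
    throw new Error("Document does not start with a frontmatter block");
  }
  const fields: Record<string, FrontmatterValue> = {};
  for (let i = 1; i < lines.length; i++) {
    const line = lines[i];
    if (line === undefined) continue;
    if (line === "---") {
      return fields; // closing delimiter reached
    }
    // Split on the first colon only, so values may contain colons.
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim();
    fields[key] = parseScalar(line.slice(sep + 1).trim());
  }
  throw new Error("Frontmatter block is not terminated");
}
```

A filter such as `fields["legal_status"] === "official_legal_evidence"` could then select positive law sections before indexing.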

Metadata Indexes

Each directory includes a _meta.json sidecar file for programmatic access without parsing Markdown:

{
  "format_version": "1.1.0",
  "identifier": "/us/usc/t5",
  "title_number": 5,
  "title_name": "Government Organization and Employees",
  "stats": {
    "chapter_count": 63,
    "section_count": 1162,
    "total_tokens_estimate": 2207855
  },
  "chapters": [
    {
      "identifier": "/us/usc/t5/ptI/ch1",
      "number": 1,
      "name": "Organization",
      "directory": "chapter-01",
      "sections": [
        {
          "identifier": "/us/usc/t5/s101",
          "number": "101",
          "name": "Executive departments",
          "file": "section-101.md",
          "token_estimate": 4200,
          "has_notes": true,
          "status": "current"
        }
      ]
    }
  ]
}
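
The sidecar structure above maps directly onto TypeScript types, which makes it straightforward to select files by token estimate before embedding them. The interfaces below are inferred from the example JSON only, so additional or optional fields may exist in real output. `sectionsWithinBudget` is a hypothetical helper, not part of any LexBuild package.

```typescript
// Shapes inferred from the example _meta.json; real files may carry more fields.
interface SectionMeta {
  identifier: string;
  number: string;
  name: string;
  file: string;
  token_estimate: number;
  has_notes: boolean;
  status: string;
}

interface ChapterMeta {
  identifier: string;
  number: number;
  name: string;
  directory: string;
  sections: SectionMeta[];
}

interface TitleMeta {
  format_version: string;
  identifier: string;
  title_number: number;
  title_name: string;
  stats: {
    chapter_count: number;
    section_count: number;
    total_tokens_estimate: number;
  };
  chapters: ChapterMeta[];
}

// Returns paths (relative to the title directory) of sections whose
// token estimate fits within the given budget.
function sectionsWithinBudget(meta: TitleMeta, maxTokens: number): string[] {
  const paths: string[] = [];
  for (const chapter of meta.chapters) {
    for (const section of chapter.sections) {
      if (section.token_estimate <= maxTokens) {
        paths.push(`${chapter.directory}/${section.file}`);
      }
    }
  }
  return paths;
}
```

In practice the `TitleMeta` object would come from `JSON.parse` on the contents of a title-level `_meta.json` file.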

Performance

| Corpus | Titles | Sections | Est. Tokens | Conversion Time |
| --- | --- | --- | --- | --- |
| U.S. Code | 54 | ~60,000 | ~85M | ~20–30s |
| eCFR | 49 (excl. reserved) | ~227,000 | ~350M | ~60–90s |
| Combined | 103 | ~287,000 | ~435M | ~2 min |

SAX streaming keeps memory bounded for even the largest titles (100MB+ XML). Conversion is CPU-bound — no network I/O during the convert step.


Monorepo

LexBuild is a monorepo managed with pnpm workspaces and Turborepo.

lexbuild/
├── packages/
│   ├── core/           # @lexbuild/core — XML parsing, AST, Markdown rendering
│   ├── usc/            # @lexbuild/usc — U.S. Code converter and downloader
│   ├── ecfr/           # @lexbuild/ecfr — eCFR converter and downloader
│   └── cli/            # @lexbuild/cli — CLI binary
├── apps/
│   └── astro/          # LexBuild web app (lexbuild.dev)
├── fixtures/           # Test fixtures (synthetic XML + expected output snapshots)
├── reference/          # GPO/OLRC XML schema reference guides
├── turbo.json
└── pnpm-workspace.yaml

Dependency Graph

@lexbuild/cli
  ├── @lexbuild/usc
  │     └── @lexbuild/core
  ├── @lexbuild/ecfr
  │     └── @lexbuild/core
  └── @lexbuild/core

apps/astro (no code deps — consumes output only)

Source packages are independent — @lexbuild/usc and @lexbuild/ecfr never import from each other. Future source packages follow the same pattern.

All internal dependencies use pnpm's workspace:* protocol. Changesets manages lockstep versioning across all published packages.


Packages

| Package | Description |
| --- | --- |
| @lexbuild/cli | CLI binary: download and convert legal XML |
| @lexbuild/core | Shared XML parsing, AST, Markdown rendering |
| @lexbuild/usc | U.S. Code (USLM XML) converter and downloader |
| @lexbuild/ecfr | eCFR converter and downloader (ecfr.gov API + govinfo) |

Each package has its own README with full API documentation.

Apps

LexBuild web app (lexbuild.dev)

A server-rendered legal content browser built with Astro 6, React 19, Tailwind CSS 4, and shadcn/ui.

  • 260,000+ section pages across U.S. Code and eCFR
  • Four granularity levels — title, chapter, part (eCFR), section
  • Syntax-highlighted source and rendered HTML preview
  • Sidebar navigation with virtualized section lists
  • Full-text search via Meilisearch
  • Dark mode with system preference detection
  • Zero client JS by default — interactive React islands only where needed

The web app consumes LexBuild's output (.md files and _meta.json sidecars) and has no code dependency on the conversion packages.

See apps/astro/README.md for setup and development instructions.


Development

Prerequisites

Node.js >= 22 and pnpm >= 10.

Getting Started

git clone https://github.com/chris-c-thomas/LexBuild.git
cd LexBuild
pnpm install
pnpm turbo build

Common Commands

pnpm turbo build           # Build all packages
pnpm turbo test            # Run all tests
pnpm turbo lint            # Lint all packages
pnpm turbo typecheck       # Type-check all packages
pnpm turbo dev             # Watch mode

Working on a Specific Package

pnpm turbo build --filter=@lexbuild/core
pnpm turbo test --filter=@lexbuild/ecfr

# Run the CLI locally
node packages/cli/dist/index.js download-usc --titles 1
node packages/cli/dist/index.js convert-usc --titles 1
node packages/cli/dist/index.js download-ecfr --titles 17
node packages/cli/dist/index.js convert-ecfr --titles 17

Web App Development

# Build packages first
pnpm turbo build

# Download and convert some content
node packages/cli/dist/index.js download-usc --titles 1 && node packages/cli/dist/index.js convert-usc --titles 1
node packages/cli/dist/index.js download-ecfr --titles 1 && node packages/cli/dist/index.js convert-ecfr --titles 1

# Set up the web app
cd apps/astro
bash scripts/link-content.sh
npx tsx scripts/generate-nav.ts
pnpm dev

Contributing

Contributions are welcome. Please see CONTRIBUTING.md.


License

MIT
