CompiledAI is a research benchmark suite for deterministic LLM-based workflow automation. Instead of calling LLMs at runtime (expensive, non-deterministic), it generates code once during "compilation," then executes deterministically with zero marginal LLM cost.
Authors: Geert Trooskens, Aaron Karlsberg, Anmol Sharma, Lamara De Brouwer, Walter A. De Brouwer
Affiliation: XY.AI Labs, Stanford University School of Medicine
Current LLM agents suffer from:
- High per-transaction inference costs
- Non-deterministic outputs (identical inputs → different results)
- Unpredictable latency (P99 variance)
- Reliability gaps (35-65% failure rates in multi-turn workflows)
Generate Once → Validate Thoroughly → Execute Deterministically
Break-Even: Compiled AI beats runtime inference after N* executions:
N* = Generation_Cost / Runtime_Cost_Per_Execution
Typically N* < 100 (often < 10)
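The break-even formula above can be sketched directly in Python (the dollar figures below are illustrative, not measured results):

```python
def break_even_executions(generation_cost: float,
                          runtime_cost_per_execution: float) -> float:
    """N* = Generation_Cost / Runtime_Cost_Per_Execution.

    Compiled execution has zero marginal LLM cost, so the one-time
    generation cost is amortized over N executions; beyond N* the
    compiled workflow is cheaper than per-call inference.
    """
    return generation_cost / runtime_cost_per_execution

# Illustrative numbers: $0.50 to generate and validate once,
# $0.01 per direct LLM call at runtime.
n_star = break_even_executions(0.50, 0.01)
print(n_star)  # 50.0 -> compiled wins after 50 executions
```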
| Category | Tools |
|---|---|
| Language | Python 3.11+ |
| Package Manager | uv (recommended) |
| LLM Providers | Anthropic Claude Opus 4.5, OpenAI GPT-4o, Google Gemini |
| Code Generation | PydanticAI agents |
| Workflow Execution | Temporal (prod), XY Local Workflow Executor (test) |
| Validation | Bandit, Semgrep, mypy, ruff, radon |
| Data | Pydantic v2, PyYAML, jsonschema |
YAML Spec (task definition)
│
▼
┌─────────────────────┐
│ CONFIG AGENT │ Parse spec, select templates
└─────────────────────┘
│
▼
┌─────────────────────┐
│ CODE FACTORY │
│ ┌───────────────┐ │
│ │ Planner Agent │ │ Design workflow structure
│ └───────────────┘ │
│ ┌───────────────┐ │
│ │ Coder Agent │ │ Generate YAML + Python
│ └───────────────┘ │
└─────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ 4-STAGE VALIDATION PIPELINE │
├─────────────────────────────────────────────┤
│ Stage 1: SECURITY │ Bandit, Semgrep │
│ Stage 2: SYNTAX │ AST, mypy, ruff │
│ Stage 3: EXECUTION │ Sandbox, fixtures │
│ Stage 4: ACCURACY │ Golden output match │
└─────────────────────────────────────────────┘
│
▼
TEMPORAL ACTIVITY
(Deterministic, validated, production-ready)
Self-Healing: Auto-regenerates on validation failure (max 5 attempts)
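A minimal sketch of that self-healing loop; `generate_code` and `validate` are hypothetical stand-ins for the Planner/Coder agents and the 4-stage pipeline, not the actual factory API:

```python
MAX_ATTEMPTS = 5  # matches the documented regeneration cap


def compile_with_self_healing(spec, generate_code, validate):
    """Regenerate on validation failure, up to MAX_ATTEMPTS times.

    On each failure, the validator's errors are fed back into the
    next generation attempt so the agents can correct themselves.
    """
    feedback = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        code = generate_code(spec, feedback)
        result = validate(code)
        if result.passed:
            return code
        feedback = result.errors  # inform the next attempt
    raise RuntimeError(f"Validation failed after {MAX_ATTEMPTS} attempts")
```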
CompiledAI/
├── src/compiled_ai/
│ ├── factory/
│ │ ├── code_factory/ # Core generation system
│ │ │ ├── factory.py # Main orchestrator
│ │ │ ├── agents.py # Planner + Coder agents
│ │ │ ├── template_registry.py
│ │ │ ├── semantic_search.py
│ │ │ └── llm_adapter.py
│ │ ├── activities/ # Pre-built templates
│ │ └── XYLocalWorkflowExecutor/
│ ├── validation/ # 4-stage pipeline
│ ├── baselines/ # Comparison implementations
│ │ ├── direct_llm.py
│ │ ├── code_factory.py
│ │ ├── langchain_agent.py
│ │ └── multi_agent.py
│ ├── metrics/ # 7 evaluation categories
│ │ ├── token_efficiency.py
│ │ ├── latency.py
│ │ ├── consistency.py
│ │ ├── reliability.py
│ │ ├── code_quality.py
│ │ ├── validation_pipeline.py
│ │ └── cost.py
│ ├── runner/ # Benchmark execution
│ └── evaluation/ # Output evaluators
│
├── datasets/
│ ├── xy_benchmark/ # 5 tasks, 12 instances
│ ├── bfcl_v4/ # 9 categories, 2,810 instances
│ └── agentbench/ # 5 environments, 146 instances
│
├── workflows/ # Generated artifacts
├── results/ # Benchmark results (97+ files)
├── paper/ # Research documentation
│ ├── compiled_ai_paper.md
│ └── framework.md # Evaluation framework
│
├── scripts/
│ ├── run_benchmark.py # Main runner
│ ├── run_bfcl_benchmark.py
│ └── run_agentbench_benchmark.py
│
└── tests/
| File | Purpose |
|---|---|
| factory.py | Main orchestrator with regeneration loop |
| agents.py | PydanticAI Planner + Coder agents |
| template_registry.py | Searchable activity templates |
| semantic_search.py | Embedding-based template discovery |
| llm_adapter.py | Multi-provider abstraction |
| compilation_metrics.py | Token amortization tracking |
| Stage | Tools | Checks |
|---|---|---|
| Security | Bandit, Semgrep | Injection, malware, key leaks |
| Syntax | AST, mypy, ruff, radon | Types, linting, complexity <10 |
| Execution | Sandbox, fixtures | Test cases, coverage >80% |
| Accuracy | Golden outputs | Schema compliance, output match |
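The staged gate can be sketched as a short-circuiting chain, where each stage only runs if every earlier (cheaper) stage passed. The per-stage check functions here are illustrative placeholders for the real tool invocations:

```python
import ast


def check_syntax(code: str) -> list[str]:
    """Stage 2 stand-in: AST parse (mypy/ruff/radon would also run here)."""
    try:
        ast.parse(code)
        return []
    except SyntaxError as exc:
        return [str(exc)]


def run_pipeline(code: str, stages) -> tuple[bool, list[str]]:
    """Run stages in order; stop at the first failure so later,
    more expensive stages never see code that failed an earlier gate."""
    for name, check in stages:
        errors = check(code)
        if errors:
            return False, [f"{name}: {err}" for err in errors]
    return True, []


# Stage order mirrors the table: security, syntax, execution, accuracy.
stages = [
    ("security", lambda code: []),  # placeholder for Bandit/Semgrep
    ("syntax", check_syntax),
    # ("execution", ...) and ("accuracy", ...) follow the same shape
]
ok, errors = run_pipeline("x = 1", stages)
```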
| Category | Key Metrics |
|---|---|
| Token Efficiency | Compression ratio (>4x), Break-even N* |
| Latency | TTFT (<500ms), TPOT (<200ms), P99 |
| Consistency | Semantic entropy, Exact match rate |
| Reliability | Task completion (>50%), Error rates |
| Code Quality | Cyclomatic complexity (<10), pass@k |
| Validation | First-pass rate (>70%), Regen attempts |
| Cost | Determinism Advantage (>1 = winning) |
| Baseline | Description |
|---|---|
| Direct LLM | Per-transaction inference (control) |
| Code Factory | Compiled with template reuse |
| LangChain Agent | Tool-using agent framework |
| Multi-Agent | AutoGen-style coordination |
| Dataset | Instances | Focus |
|---|---|---|
| XY_Benchmark | 12 | Internal: classification, normalization, extraction |
| BFCL v3 | 2,810 | Function calling accuracy |
| AgentBench | 146 | Multi-turn agents (OS, DB, Web) |
| Metric | Competitive | Excellent |
|---|---|---|
| Compression Ratio | >4x | >10x |
| Break-Even N* | <100 | <10 |
| TTFT | <500ms | <200ms |
| Exact Match Rate | >80% | >95% |
| Task Completion | >50% | >75% |
| Cyclomatic Complexity | <10 | <5 |
| First-Pass Rate | >70% | >90% |
| Determinism Advantage | >1 | >10 |
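A small helper (ours, illustrative) that maps a measured value onto the tiers in the table; note some thresholds are "lower is better" (latency, complexity, break-even) and some "higher is better" (rates, ratios):

```python
# (competitive, excellent) thresholds, lower is better
LOWER_IS_BETTER = {
    "break_even_n": (100, 10),
    "ttft_ms": (500, 200),
    "cyclomatic_complexity": (10, 5),
}
# (competitive, excellent) thresholds, higher is better
HIGHER_IS_BETTER = {
    "compression_ratio": (4, 10),
    "exact_match_rate": (0.80, 0.95),
    "task_completion": (0.50, 0.75),
    "first_pass_rate": (0.70, 0.90),
    "determinism_advantage": (1, 10),
}


def tier(metric: str, value: float) -> str:
    """Classify a measured value as excellent, competitive, or below target."""
    if metric in LOWER_IS_BETTER:
        competitive, excellent = LOWER_IS_BETTER[metric]
        if value < excellent:
            return "excellent"
        return "competitive" if value < competitive else "below target"
    competitive, excellent = HIGHER_IS_BETTER[metric]
    if value > excellent:
        return "excellent"
    return "competitive" if value > competitive else "below target"
```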
```bash
# Install
uv sync

# Configure
cp .env.example .env
# Add: ANTHROPIC_API_KEY=your_key

# Run benchmarks
python scripts/run_benchmark.py
python scripts/run_bfcl_benchmark.py
python scripts/run_xy_benchmark.py

# Development
pytest
mypy src/
ruff check src/
```

| File | Purpose |
|---|---|
| scripts/run_benchmark.py | Main interactive runner |
| BENCHMARK_PLAN.md | Implementation specification |
| paper/framework.md | Evaluation framework |
| paper/compiled_ai_paper.md | Research paper |
| src/compiled_ai/factory/code_factory/factory.py | Core orchestrator |
Novel Metrics:
- Compression Ratio - Output tokens / Input tokens
- Token Amortization Factor - Gen tokens / Expected executions
- Determinism Advantage - Runtime cost × N / Generation cost
- Break-Even N* - Executions for cost parity
- First-Pass Rate - Validation pass without regeneration
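Using the definitions above, the cost-side metrics reduce to a few ratios. This is an illustrative sketch with made-up numbers; the function names are ours, not the repo's API:

```python
def determinism_advantage(runtime_cost: float, n: int,
                          generation_cost: float) -> float:
    """Runtime cost x N / Generation cost; > 1 means compilation wins."""
    return runtime_cost * n / generation_cost


def token_amortization_factor(generation_tokens: int,
                              expected_executions: int) -> float:
    """One-time generation tokens spread over the expected executions."""
    return generation_tokens / expected_executions


# Illustrative: $0.01 per direct LLM call, 1,000 calls,
# $0.50 one-time generation cost.
advantage = determinism_advantage(0.01, 1000, 0.50)  # 20.0 -> winning
```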
Research Foundations:
- MLPerf Inference (latency standards)
- BFCL v3 (function calling benchmark)
- AgentBench (multi-turn evaluation, ICLR 2024)
- Semantic Entropy (Nature 2024)
- Pan & Wang 2025 (break-even analysis)