**File:** docs/26-FEB-2026-planexe-2026-strategy-auditor.md
# PlanExe 2026: From Plan Generator to Autonomous Agent Auditor

**Date:** 26 February 2026
**Authors:** Larry, Egon, Simon (for review)
**Status:** Strategic Proposal for Feedback

---

## Executive Summary

PlanExe was originally positioned as a plan *generator*: take a vague idea and have an LLM draft a full business plan. In 2025, we learned that LLMs hallucinate plans with no grounding in reality. By 2026, the market has moved on: agents don't need another hallucinated plan generator.

**What agents actually need:** A trusted auditing layer that validates whether the assumptions driving their autonomous workflows are sane.

This proposal argues that PlanExe's real value in 2026 is as **the canonical auditing gate for autonomous agent loops** — not as a plan creator, but as a safety layer that prevents hallucinations before they propagate downstream.

---

## The Problem: Autonomous Agents in Bubbles

Agents run in isolation. They have no world model, and they can't verify whether their assumptions are grounded in reality. They hallucinate:
- Cost estimates that are off by orders of magnitude
- Timelines that ignore real-world constraints
- Team sizes that make no sense

**The consequence:** Bad assumptions → bad downstream decisions → failed autonomy.

Agents need an external oracle that can say: **"This assumption is grounded. Proceed."** or **"This looks hallucinated. Re-evaluate."**

---

## The Opportunity: Validation as a Service

**What we've built in Phases 1 and 2:**

1. **FermiSanityCheck (Phase 1)**: A validation gate that inspects every quantified assumption:
- Are bounds present and non-contradictory?
- Is the span ratio reasonable (≤100×)?
   - Do low-confidence claims have supporting evidence?
- Do the numbers pass domain heuristics?

**Output:** Structured JSON + Markdown that agents can parse deterministically.
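
As a rough sketch, the checks above can be expressed as a single gate function. The names, field layout, and return shape here are illustrative only, not the actual FermiSanityCheck API:

```python
# Illustrative sketch of a FermiSanityCheck-style gate; field names and
# the verdict format are assumed, not the real implementation.

def sanity_check(assumption: dict, max_span_ratio: float = 100.0) -> dict:
    """Validate one quantified assumption and return a structured verdict."""
    issues = []
    low, high = assumption.get("low"), assumption.get("high")

    # Bounds must be present and non-contradictory.
    if low is None or high is None:
        issues.append("missing_bounds")
    elif low > high:
        issues.append("contradictory_bounds")
    # Span ratio between bounds must be reasonable (<= 100x by default).
    elif low > 0 and high / low > max_span_ratio:
        issues.append("span_ratio_too_wide")

    # Low-confidence claims need supporting evidence attached.
    if assumption.get("confidence") == "low" and not assumption.get("evidence"):
        issues.append("low_confidence_without_evidence")

    return {"ok": not issues, "issues": issues}
```

An assumption like `{"low": 10, "high": 5000, "confidence": "low"}` would fail on both the span ratio and the missing evidence, which is exactly the kind of deterministic verdict a downstream agent can branch on.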

2. **Domain-Aware Auditor (Phase 2)**: Auto-detect the domain (carpenter, dentist, personal project) and normalize to domain standards:
- Currency → domain default + EUR for comparison
- Units → metric
- Confidence keywords → domain-aware signals

**Why it matters:** "Cost 5000" means nothing without context. "5000 DKK for a carpenter project" is verifiable and sane. FermiSanityCheck becomes the translator.
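
A minimal sketch of that translation step, with the exchange rate and field names assumed purely for illustration:

```python
# Illustrative normalization step. The DKK/EUR rate and the output shape
# are assumptions for this sketch; real profiles live in the domain-profile schema.

DKK_PER_EUR = 7.46  # DKK trades near this pegged rate; treated as a constant here.

def normalize_cost(amount: float, currency: str, domain_default: str = "DKK") -> dict:
    """Attach domain context and an EUR reference value to a bare number."""
    currency = currency or domain_default  # bare "Cost 5000" inherits the domain default
    eur = amount / DKK_PER_EUR if currency.upper() == "DKK" else None
    return {"amount": amount, "currency": currency.upper(), "eur_reference": eur}
```

With this, "5000" in a carpenter context becomes `{"amount": 5000, "currency": "DKK", "eur_reference": ...}`, a value that can be compared against domain heuristics.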

---

## Why This Wins in the Agentic Economy

### 1. **Software Already Won the LLM Game**
Code is verifiable. It compiles or it doesn't. Tests pass or they don't. No trust required.

**Business plans?** No immediate validation. High trust requirement. High risk.

### 2. **Agents Are Untrusted Sources**
The lesson from 2025: don't trust the AI.

In 2026, agents will run in bubbles. External content will be labeled as untrusted to prevent prompt injection. But agents still need *some* external signal they can trust.

**PlanExe becomes that trusted signal.** It's not trying to out-think the agent; it's just saying: "Your assumption passes quantitative grounding. You can rely on it."

### 3. **Auditing is Composable**
Agents will chain together. Agent A's output becomes Agent B's input. Without a validation layer, assumptions compound into hallucinations.

**PlanExe sits in the middle:** catches bad assumptions before they propagate.

---

## The Business Model Shift

### Before (2025 thinking):
- Sell plans to humans
- Revenue: per-plan generation
- Value proposition: "Better plans than manual consulting"
- Problem: Plans are hallucinated; no immediate verification

### After (2026 reality):
- Sell validation to agents
- Revenue: per audited assumption (or per-agent subscription)
- Value proposition: "Safe, trustworthy validation gate for autonomous loops"
- Advantage: Immediate, deterministic output (JSON); agents can compose it

---

## Implementation Path

### Phase 1: ✅ Done
- FermiSanityCheck validator
- DAG integration (MakeAssumptions → Validate → DistillAssumptions)
- Structured JSON output

### Phase 2: 🔄 In Progress
- Domain profiles (Carpenter, Dentist, Personal, Startup, etc.)
- Auto-detection + normalization
- Ready for integration testing

### Phase 3: Proposed
- Auditing API (agents call `/validate` with assumptions)
- Trust scoring (confidence + grounding + domain consistency)
- Audit logs (track what agents relied on)
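
To make the proposal concrete, a hypothetical `/validate` exchange might look like the following. Every field name, the trust-score scale, and the audit-log identifier are proposals for discussion, not an existing interface:

```python
# Hypothetical Phase 3 request/response payloads, shown as Python dicts.
# Nothing here is implemented yet; it sketches the proposed API shape.

request_body = {
    "assumptions": [
        {"id": "a1", "kind": "budget", "low": 40000, "high": 60000,
         "currency": "DKK", "confidence": "medium"},
    ],
    "domain_hint": "carpenter",  # optional; auto-detection runs if omitted
}

example_response = {
    "results": [
        # trust_score would combine confidence, grounding, and domain consistency
        {"id": "a1", "ok": True, "trust_score": 0.82, "issues": []},
    ],
    "audit_log_id": "log-0001",  # lets agents record what they relied on
}
```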

---

## Key Questions for Simon

1. **Does this positioning resonate?** Are we solving the right problem for agents?

2. **Should we lean harder into the auditor narrative?**
- Update PRs to frame FermiSanityCheck as "validation gate for agents"
- Reposition marketing toward agent platforms (not humans)
- Build toward auditing API (Phase 3)

3. **Or stay hybrid?** Keep the plan-generator story + add auditing as a feature?

4. **What does success look like in 2026?**
- Agents paying for validation service?
- PlanExe as a required middleware in agentic workflows?
- Something else?

---

## Next Steps

1. **Simon's feedback** on positioning (auditor vs. hybrid)
2. **Phase 2 completion** + integration testing
3. **PR updates** (if auditor positioning is approved)
4. **Phase 3 design** (auditing API + trust scoring)

---

**End of proposal.** Ready for Simon's thoughts.
---

**File:** docs/domain-profiles/domain-profile-schema.md
---
title: Domain Profiles for FermiSanityCheck
---

# Domain Profile Schema

Domain profiles encode the **context** that FermiSanityCheck needs to interpret numerical assumptions correctly. Each profile describes the currency/unit conventions, confidence language, and detection signals for a vertical so the system can normalize any messy input (e.g. invoices, emails, photos) into validated data.

## Schema (YAML)

```yaml
profiles:
  - id: <slug>
    name: <human name>
    description: <what kind of projects this profile covers>
    currency:
      default: <canonical currency code (ISO 4217)>
      aliases:
        - <additional currencies or local abbreviations>
    units:
      metric: true
      convert:
        - from: "sqft"
          to: "m2"
          factor: 0.092903
        - from: "lbs"
          to: "kg"
          factor: 0.453592
    heuristics:
      budget_keywords:
        - budget
        - cost
        - invoice
      timeline_keywords:
        - days
        - weeks
        - timetable
      team_keywords:
        - crew
        - workers
      confidence_keywords:
        high:
          - guarantee
          - have done this
        medium:
          - plan to
          - intend
        low:
          - estimate
          - hope
    detection:
      currency_signals:
        - DKK
        - "kr"
      unit_signals:
        - m2
        - meter
        - hours
      keyword_signals:
        - contractor
        - materials
        - carpentry
```

### Fields explained

- **id / name / description**: human-friendly identifiers for the domain profile.
- **currency**: canonical currency + local aliases (DKK, kr, kroner) so we can map all budgets to one reference value before comparing.
- **units**: flag whether the domain is metric-first; provide conversion factors to normalize common imperial units we might still encounter.
- **heuristics**: keyword lists partitioned by topic (budget, timeline, team) plus per-tier confidence keywords.
- **detection**: signals to match incoming documents to this profile (currencies, units, domain-specific keywords). Used by auto-detection logic.
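
Assuming a profile has been loaded into a plain dict mirroring the YAML above, applying its `units.convert` table might look like this (the helper name and pass-through behavior are assumptions of this sketch):

```python
# Sketch of applying a profile's `units.convert` table to one measurement.
# The conversion rules below copy the schema example verbatim.

PROFILE_CONVERSIONS = [
    {"from": "sqft", "to": "m2", "factor": 0.092903},
    {"from": "lbs", "to": "kg", "factor": 0.453592},
]

def to_metric(value: float, unit: str, conversions=PROFILE_CONVERSIONS):
    """Return (value, unit) normalized via the profile's conversion table."""
    for rule in conversions:
        if unit == rule["from"]:
            return value * rule["factor"], rule["to"]
    return value, unit  # already metric or unknown: pass through unchanged
```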

## Examples

### Carpenter (DKK / metric crafts)

```yaml
- id: carpenter
  name: Carpenter / small contractor
  description: Tradespeople working with materials, local currencies, and hourly estimations.
  currency:
    default: DKK
    aliases: ["kr", "dkk", "kroner"]
  units:
    metric: true
    convert:
      - from: "sqft"
        to: "m2"
        factor: 0.092903
      - from: "ft"
        to: "m"
        factor: 0.3048
  heuristics:
    budget_keywords: ["material", "invoice", "estimate", "quote", "project cost"]
    timeline_keywords: ["days", "weeks", "duration", "weather delay", "delivery"]
    team_keywords: ["crew", "workers", "carpenter", "helper"]
    confidence_keywords:
      high: ["I've done this", "guarantee", "know"]
      medium: ["plan to", "expect"]
      low: ["estimate", "maybe", "roughly"]
  detection:
    currency_signals: ["DKK", "kr", "kroner"]
    unit_signals: ["m2", "meter", "cm", "mm"]
    keyword_signals: ["carpenter", "wood", "build", "materials", "client site"]
```

### Dentist (clinical services)

```yaml
- id: dentist
  name: Dental clinic
  description: Small medical/dental practices with patient capacity and procedural budgets.
  currency:
    default: USD
    aliases: ["usd", "$", "dollars", "clinic credit"]
  units:
    metric: true
    convert:
      - from: "chair"
        to: "unit"
        factor: 1
  heuristics:
    budget_keywords: ["treatment", "insurance", "revenue", "procedure cost"]
    timeline_keywords: ["week", "patient", "appointment", "quarter"]
    team_keywords: ["doctor", "assistant", "hygienist"]
    confidence_keywords:
      high: ["patient guarantee", "clinically proven", "always"]
      medium: ["plan to", "expect"]
      low: ["estimate", "maybe"]
  detection:
    currency_signals: ["USD", "$", "dollars", "USD/per"]
    keyword_signals: ["patient", "clinic", "treatment", "appointment", "revenue"]
```

### Personal Project (family trip / weight loss)

```yaml
- id: personal
  name: Personal project/goal
  description: Non-commercial plans with budgets, timelines, and behavioral commitments.
  currency:
    default: USD
    aliases: ["usd", "$", "personal budget"]
  units:
    metric: true
  heuristics:
    budget_keywords: ["budget", "cost", "ticket", "transport"]
    timeline_keywords: ["days", "weeks", "schedule"]
    team_keywords: ["family", "participants", "people"]
    confidence_keywords:
      high: ["definitely", "committed"]
      medium: ["plan to", "expect"]
      low: ["maybe", "hope to"]
  detection:
    keyword_signals: ["family", "trip", "weight loss", "goal", "personal"]
```

## Domain detection logic (overview)

1. **Scan incoming data** (assumptions metadata, extracted keywords, currency mentions, units).
2. **Score each profile** by counting matches across the `detection` sections (`currency_signals`, `unit_signals`, `keyword_signals`).
3. **Pick the highest scoring profile** above a configurable threshold (default: majority signal). If no profile wins, fall back to `default` (e.g., "general business").
4. **Tag the assumption** with the chosen profile so the normalizer/validator applies the correct heuristics and conversions.
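
The scoring steps above can be sketched as follows. The function name, the naive substring matching, and the numeric threshold are assumptions of this sketch, not the shipped detection logic:

```python
# Minimal sketch of profile scoring. Profile dicts mirror the `detection`
# sections of the YAML schema; substring matching stands in for real tokenization.

def detect_profile(text: str, profiles: dict, threshold: int = 2) -> str:
    """Score each profile by counting detection-signal matches in `text`."""
    text_lower = text.lower()
    scores = {}
    for pid, detection in profiles.items():
        signals = (detection.get("currency_signals", [])
                   + detection.get("unit_signals", [])
                   + detection.get("keyword_signals", []))
        scores[pid] = sum(1 for s in signals if s.lower() in text_lower)
    best = max(scores, key=scores.get)
    # Fall back to a general profile when no profile clears the threshold.
    return best if scores[best] >= threshold else "general"

profiles = {
    "carpenter": {"currency_signals": ["DKK", "kr"],
                  "keyword_signals": ["carpenter", "wood", "materials"]},
    "dentist": {"currency_signals": ["USD", "$"],
                "keyword_signals": ["patient", "clinic", "treatment"]},
}
```

For example, "Quote: 5000 kr for wood and materials" matches three carpenter signals and zero dentist signals, so it tags as `carpenter`; text with no matches falls back to `general`.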

This schema can be extended when new domains appear (A2A tokenization, manufacturing, etc.). Once the detection logic tags a profile, the normalizer can apply metric conversions, currency mapping, and confidence heuristics that align with the domain's expectations.