74 changes: 74 additions & 0 deletions docs/proposals/1-triage.md
@@ -0,0 +1,74 @@
---
title: PlanExe Proposal Triage — 80/20 Landscape
date: 2026-02-25
status: working note
author: Egon + Larry
---

# Overview

Simon asked us to triage the proposal space with an 80:20 lens. The goal of this note is to capture:
1. Which proposals deliver outsized value (the 20% that unlock 80% of the architecture)
2. Which other proposals are nearby in the graph and could reuse their artifacts or reasoning
3. High-leverage parameter tweaks, code tweaks, and second/third order effects
4. Gaps in the current docs and ideas for new proposals
5. Relevant questions/tasks you might not have asked yet

We focused on the most recent proposals ("67+" cluster) plus the ones directly touching the validation/orchestration story that FermiSanityCheck will unlock.

# High-Leverage Proposals (the 20%)

1. **#07 Elo Ranking System (1,751 lines)** – Core ranking mechanism for comparing idea variants, plan quality, and post-plan summaries. The Elo-style heuristics here inform nearly every downstream comparison use case.
2. **#63 Luigi Agent Integration & #64 Post-plan Orchestration Layer** – Together with #66, these documents describe how PlanExe schedules, retries, and enriches its Luigi DAG. Any change to the DAG (including FermiSanityCheck or arcgentica-style loops) ripples through this cluster.
3. **#62 Agent-first Frontend Discoverability (609 lines)** – Defines the agent UX, which depends on the scoring/ranking engine (#07) and the reliability signals that our validation cluster will provide.
4. **#69 Arcgentica Agent Patterns (279 lines)** – The arcgentica comparison is already referencing our validation work and sets the guardrails for self-evaluation/soft-autonomy.
5. **#41 Autonomous Execution of Plan & #05 Semantic Plan Search Graph** – These represent core system-level capabilities (distributed execution and semantic search) whose outputs feed the ranking and reporting layers.

These documents together unlock most of the architectural work. They interlock around: planning quality signals (#07, #69, Fermi), orchestration (#63, #64, #66), and the interfaces (#62, #41, #05).

# Related Proposals & Reuse Opportunities

- **#07 Elo Ranking + #62 Agent-first Frontend** can share heuristics. Instead of reinventing ranking weights in #62, reuse the cost/feasibility tradeoffs defined in #07 plus FermiSanityCheck flags as features.
- **#63-66 orchestration cluster** – these documents already describe the Luigi tasks. The validation loop doc should be cross-referenced there to show where FermiSanityCheck sits in the DAG and how downstream tasks like WBS, Scheduler, and ExpertOrchestrator should consume the validation report.
- **#69 + #56 (Adversarial Red Team) + #43 (Assumption Drift Monitor)** form a validation cluster. FermiSanityCheck is the front line; these others are observers (red team, drift monitor) that should consume the validation report and escalate to human review.
- **#32 Gantt Parallelization & #33 CBS** could reuse the same thresholds as FermiSanityCheck when calculating duration plausibility (e.g., if a duration falls outside the published feasible range, highlight the same issue in the Gantt UI).
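To keep the two layers aligned, the Gantt side could share the feasible-range constants rather than redefining its own. A minimal sketch (the threshold values mirror `fermi_sanity_check.py` in this PR; in the real integration the Gantt layer would import them, and `duration_is_plausible` is a hypothetical name):

```python
# Thresholds mirrored from fermi_sanity_check.py; the Gantt layer would
# import these from the validation module rather than redefine them.
TIMELINE_MIN_DAYS = 1
TIMELINE_MAX_DAYS = 3650  # ten years

def duration_is_plausible(duration_days: float) -> bool:
    """Flag durations outside the validation loop's feasible range."""
    return TIMELINE_MIN_DAYS <= duration_days <= TIMELINE_MAX_DAYS
```

Sharing one source of truth means a plan flagged by FermiSanityCheck and a bar highlighted in the Gantt UI can never disagree about what "implausible" means.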

# 80:20 Tweaks & Parameter Changes

- **Ranking weights (#07)** – adjust cost vs. feasibility vs. confidence to surface plans that pass quantitative grounding. No rewrite needed; just new weights (e.g., penalize plans where FermiSanityCheck flags >3 assumptions).
- **Batch size thresholds (#63)** – the Luigi DAG currently runs every task. We can gate the WBS tasks with a flag that only fires if FermiSanityCheck passes or fails softly, enabling a smaller workflow for low-risk inputs without re-architecting.
- **Risk terminology alignment (#38 & #44)** – harmonize the words used in the risk propagation network and investor audit pack so they can share visualization tooling, reducing duplicate explanations.
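The ranking-weight tweak above can be sketched as follows (the penalty constants and function name are illustrative assumptions, not the actual #07 API):

```python
# Illustrative sketch: penalize Elo-ranked plans once FermiSanityCheck
# flags exceed a threshold. FLAG_PENALTY and FLAG_THRESHOLD are invented
# numbers for the example; the real #07 weights would be tuned separately.
FLAG_PENALTY = 25.0   # score points deducted per flagged assumption over the limit
FLAG_THRESHOLD = 3    # only penalize plans with more than 3 flags

def adjusted_score(base_score: float, flagged_assumptions: int) -> float:
    """Deduct points only for flags beyond the tolerated threshold."""
    excess = max(0, flagged_assumptions - FLAG_THRESHOLD)
    return base_score - FLAG_PENALTY * excess
```

The point is that no rewrite of the ranking engine is needed: validation flags enter as one more weighted feature.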

# Second/Third Order Effects

- **Validation loop → downstream trust**: Once FermiSanityCheck is in place, client reports (e.g., #60 plan-to-repo, #41 autonomous execution) can annotate numbers with the validation status, reducing rework.
- **Arcgentica/agent patterns**: Hardening PlanExe encourages stricter typed outputs (#69). This lets the UI (#08) and ranking engine (#07) rely on structured data instead of parsing Markdown.
- **Quantitative grounding improves ranking** (#07, #62) which in turn makes downstream dashboards (#60, #62) more actionable and reduces QA overhead.
- **Clustering proposals** (#63-66, #69, #56) around validation/orchestration helps the next human reviewer (Simon) make a single decision that affects multiple docs.

# Gaps & Future Proposal Ideas

- **FermiSanityCheck Implementation Roadmap** – Document how MakeAssumptions output becomes QuantifiedAssumption, where heuristics live, and how Luigi tasks consume the validation_report. (We have the spec in `planexe-validation-loop-spec.md` but not a public proposal yet.)
- **Validation Observability Dashboard** – A proposal capturing how the validation report is surfaced to humans (per #44, #60). Could cover alerts (Slack/Discord) when FermiSanityCheck fails or when repeated fails accumulate.
- **Arbitration Workflow** – When FermiSanityCheck fails and ReviewPlan still thinks the plan is OK, we need a human-in-the-loop workflow. This is not yet documented anywhere.

# Questions You Might Not Be Asking

1. What are the acceptance criteria for FermiSanityCheck? (confidence levels, heuristics, why 100× spans?)
2. Who owns the validation report downstream? Should ExpertOrchestrator or Governance phases be responsible for acting on it?
3. Does FermiSanityCheck expire per run or is it stored for audit trails (per #42 evidence traceability)?
4. Can we reuse the same heuristics for other tasks (#32 Gantt, #34 finance) to maximize payoff?
5. How do we rank the outputs once FermiSanityCheck is added? Should ranking (#07) penalize low confidence even if the costs look good?
6. Do we need a battle plan for manual overrides when FermiSanityCheck is overzealous (e.g., ROI assumptions where domain experts know the average is >100×)?

# Tasks We Can Own Now

- Extract the QuantifiedAssumption schema (claim, lower_bound, upper_bound, unit, confidence, evidence) and add it to PlanExe’s assumption bundle.
- Implement a FermiSanityCheck Luigi task that runs immediately after MakeAssumptions and produces validation_report.json.
- Hook the validation report into DistillAssumptions / ReviewAssumptions by adding a `validation_passed` flag.
- Update #69 and #56 docs with references to the validation report to keep the narrative cohesive.
- Create the proposed dashboard proposal (validation observability) to track how many plans fail numeric sanity each week.
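The `validation_passed` flag from the third task above could be derived directly from the report JSON. A minimal sketch, assuming the report shape defined in this PR's `ValidationReport` model (`validation_passed` and the `min_pass_rate_pct` knob are hypothetical names):

```python
import json

def validation_passed(report_path: str, min_pass_rate_pct: float = 100.0) -> bool:
    """Read validation_report.json and decide whether the run cleared the check.

    pass_rate_pct matches the field emitted by ValidationReport; the
    threshold default treats any failed assumption as a soft failure.
    """
    with open(report_path) as f:
        report = json.load(f)
    return report["pass_rate_pct"] >= min_pass_rate_pct
```

DistillAssumptions / ReviewAssumptions would then branch on this boolean instead of re-parsing the full report.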

# Summary

The high-leverage 20% of proposals are: ranking (#07), orchestration (#63-66), UI (#62), arcgentica patterns (#69), and autonomous execution/search (#41, #05). We can activate them by implementing FermiSanityCheck, aligning their heuristics, and surfacing the new validation signals in the UI/dashboards. The docs already cover most of the research; now we need a short, focused proposal/clustering doc (this one) plus the Fermi implementation and dashboards. After Simon approves, we can execute the chosen cluster.
2 changes: 2 additions & 0 deletions worker_plan/worker_plan_api/filenames.py
@@ -37,6 +37,8 @@ class FilenameEnum(str, Enum):
    REVIEW_ASSUMPTIONS_MARKDOWN = "003-9-review_assumptions.md"
    CONSOLIDATE_ASSUMPTIONS_FULL_MARKDOWN = "003-10-consolidate_assumptions_full.md"
    CONSOLIDATE_ASSUMPTIONS_SHORT_MARKDOWN = "003-11-consolidate_assumptions_short.md"
    FERMI_SANITY_CHECK_REPORT = "003-12-fermi_sanity_check_report.json"
    FERMI_SANITY_CHECK_SUMMARY = "003-13-fermi_sanity_check_summary.md"
    PRE_PROJECT_ASSESSMENT_RAW = "004-1-pre_project_assessment_raw.json"
    PRE_PROJECT_ASSESSMENT = "004-2-pre_project_assessment.json"
    PROJECT_PLAN_RAW = "005-1-project_plan_raw.json"
224 changes: 224 additions & 0 deletions worker_plan/worker_plan_internal/assume/fermi_sanity_check.py
@@ -0,0 +1,224 @@
"""Validation helpers for QuantifiedAssumption data."""
from __future__ import annotations

from typing import List, Optional, Sequence

from pydantic import BaseModel, Field

from worker_plan_internal.assume.quantified_assumptions import ConfidenceLevel, QuantifiedAssumption

MAX_SPAN_RATIO = 100.0
MIN_EVIDENCE_LENGTH = 40
BUDGET_LOWER_THRESHOLD = 1_000.0
BUDGET_UPPER_THRESHOLD = 100_000_000.0
TIMELINE_MAX_DAYS = 3650
TIMELINE_MIN_DAYS = 1
TEAM_MIN = 1
TEAM_MAX = 1000

CURRENCY_UNITS = {
"usd",
"eur",
"dkk",
"gbp",
"cad",
"aud",
"sek",
"nzd",
"mxn",
"chf"
}

TIME_UNIT_TO_DAYS = {
"day": 1,
"days": 1,
"week": 7,
"weeks": 7,
"month": 30,
"months": 30,
"year": 365,
"years": 365
}

TEAM_KEYWORDS = {
"team",
"people",
"engineer",
"engineers",
"staff",
"headcount",
"crew",
"members",
"contractors",
"workers"
}

BUDGET_KEYWORDS = {
"budget",
"cost",
"funding",
"investment",
"price",
"capex",
"spend",
"expense",
"capital"
}

TIMELINE_KEYWORDS = {
"timeline",
"duration",
"schedule",
"milestone",
"delivery",
"months",
"years",
"weeks",
"days"
}


class ValidationEntry(BaseModel):
assumption_id: str = Field(description="Stable identifier for the assumption")
question: str = Field(description="Source question for context")
passed: bool = Field(description="Whether the assumption passed validation")
reasons: List[str] = Field(description="List of validation failures")


class ValidationReport(BaseModel):
entries: List[ValidationEntry] = Field(description="Detailed result per assumption")
total_assumptions: int = Field(description="Total number of assumptions processed")
passed: int = Field(description="Count of assumptions that passed")
failed: int = Field(description="Count of assumptions that failed")
pass_rate_pct: float = Field(description="Percentage of assumptions that passed")


def validate_quantified_assumptions(
assumptions: Sequence[QuantifiedAssumption]
) -> ValidationReport:
entries: List[ValidationEntry] = []
passed = 0

for assumption in assumptions:
reasons: List[str] = []
lower = assumption.lower_bound
upper = assumption.upper_bound

if lower is None or upper is None:
reasons.append("Missing lower or upper bound.")
elif lower > upper:
reasons.append("Lower bound is greater than upper bound.")
else:
if ratio := assumption.span_ratio:
if ratio > MAX_SPAN_RATIO:
reasons.append("Range spans more than 100×; too wide.")

if assumption.confidence == ConfidenceLevel.low:
evidence = assumption.evidence or ""
if len(evidence.strip()) < MIN_EVIDENCE_LENGTH:
reasons.append("Low confidence claim lacks sufficient evidence.")

if _should_check_budget(assumption):
_apply_budget_constraints(lower, upper, reasons)

if _should_check_timeline(assumption):
_apply_timeline_constraints(lower, upper, assumption.unit, reasons)

if _should_check_team(assumption):
_apply_team_constraints(lower, upper, reasons)

passed_flag = not reasons
if passed_flag:
passed += 1

entry = ValidationEntry(
assumption_id=assumption.assumption_id,
question=assumption.question,
passed=passed_flag,
reasons=reasons
)
entries.append(entry)

total = len(entries)
failed = total - passed
pass_rate = (passed / total * 100.0) if total else 0.0
return ValidationReport(
entries=entries,
total_assumptions=total,
passed=passed,
failed=failed,
pass_rate_pct=round(pass_rate, 2)
)


def render_validation_summary(report: ValidationReport) -> str:
lines = [
"# Fermi Sanity Check",
"",
f"- Total assumptions: {report.total_assumptions}",
f"- Passed: {report.passed}",
f"- Failed: {report.failed}",
f"- Pass rate: {report.pass_rate_pct:.1f}%",
""
]

if report.failed:
lines.append("## Failed assumptions")
for entry in report.entries:
if not entry.passed:
reasons = ", ".join(entry.reasons) if entry.reasons else "No details provided."
lines.append(f"- `{entry.assumption_id}` ({entry.question or 'question missing'}): {reasons}")

return "\n".join(lines)


def _should_check_budget(assumption: QuantifiedAssumption) -> bool:
text = (assumption.question or "").lower()
return any(keyword in text for keyword in BUDGET_KEYWORDS) or (assumption.unit or "") in CURRENCY_UNITS


def _should_check_timeline(assumption: QuantifiedAssumption) -> bool:
text = (assumption.question or "").lower()
return any(keyword in text for keyword in TIMELINE_KEYWORDS)


def _should_check_team(assumption: QuantifiedAssumption) -> bool:
text = (assumption.question or "").lower()
return any(keyword in text for keyword in TEAM_KEYWORDS)


def _apply_budget_constraints(lower: Optional[float], upper: Optional[float], reasons: List[str]) -> None:
if lower is not None and lower < BUDGET_LOWER_THRESHOLD:
reasons.append(f"Budget below ${BUDGET_LOWER_THRESHOLD:,.0f}.")
if upper is not None and upper > BUDGET_UPPER_THRESHOLD:
reasons.append(f"Budget above ${BUDGET_UPPER_THRESHOLD:,.0f}.")


def _apply_timeline_constraints(
lower: Optional[float], upper: Optional[float], unit: Optional[str], reasons: List[str]
) -> None:
lower_days = _normalize_to_days(lower, unit)
upper_days = _normalize_to_days(upper, unit)

if lower_days is not None and lower_days < TIMELINE_MIN_DAYS:
reasons.append("Timeline below 1 day.")
if upper_days is not None and upper_days > TIMELINE_MAX_DAYS:
reasons.append("Timeline exceeds ten years (3,650 days).")


def _normalize_to_days(value: Optional[float], unit: Optional[str]) -> Optional[float]:
if value is None:
return None
if not unit:
return value
normalized = TIME_UNIT_TO_DAYS.get(unit.lower())
if normalized is None:
return value
return value * normalized


def _apply_team_constraints(lower: Optional[float], upper: Optional[float], reasons: List[str]) -> None:
if lower is not None and lower < TEAM_MIN:
reasons.append("Team size below 1 person.")
if upper is not None and upper > TEAM_MAX:
reasons.append("Team size above 1,000 people.")
@@ -0,0 +1,39 @@
# QuantifiedAssumption Schema Reference

| Field | Type | Description |
| --- | --- | --- |
| `assumption_id` | `str` | Unique stable identifier for the assumption (use `assumption-<index>` when not provided). |
| `question` | `str` | The source question that prompted the assumption. |
| `claim` | `str` | Normalized assumption text with the `Assumption:` prefix removed. |
| `lower_bound` | `float?` | Parsed lower numeric bound (if present). |
| `upper_bound` | `float?` | Parsed upper numeric bound (mirror of lower_bound when none explicitly provided). |
| `unit` | `str?` | Detected unit token (e.g., `mw`, `days`, `usd`, `%`). |
| `confidence` | `ConfidenceLevel` (`high` / `medium` / `low`) | Estimated confidence level inferred from hedging words. |
| `evidence` | `str` | Text excerpt used as evidence (currently same as `claim` but can be overridden with extracted snippets). |
| `extracted_numbers` | `List[float]` | All numeric values found in the assumption for further heuristics. |
| `raw_assumption` | `str` | Original string returned by `MakeAssumptions` (includes prefix). |
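As an illustration of the fields above, a single parsed record might look like this (every value is invented for the example; only the field names come from the schema):

```python
# Hypothetical QuantifiedAssumption record, shown as a plain dict.
# All values are illustrative; field names follow the schema table above.
example = {
    "assumption_id": "assumption-3",
    "question": "What is the project budget?",
    "claim": "The budget is between 2 and 5 million USD.",
    "lower_bound": 2_000_000.0,
    "upper_bound": 5_000_000.0,
    "unit": "usd",
    "confidence": "medium",  # no hedging or strong modality detected
    "evidence": "The budget is between 2 and 5 million USD.",
    "extracted_numbers": [2_000_000.0, 5_000_000.0],
    "raw_assumption": "Assumption: The budget is between 2 and 5 million USD.",
}
```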

## Confidence Enum Values

| Level | Detection Signals |
| --- | --- |
| `high` | Contains strong modality ("will", "must", "ensure", "guarantee"). |
| `medium` | Default when no strong signal is detected. |
| `low` | Contains hedging words ("estimate", "approx", "may", "likely"). |

## Unit Examples

- Financial: `usd`, `eur`, `million`, `billion`
- Capacity/Scale: `mw`, `kw`, `tonnes`, `sqft`, `people`
- Time: `days`, `weeks`, `months`, `years` (expressed as words following the range)
- Percentage/Ratio: `%`, `bps`

Units are extracted by scanning the text around the numeric range or first detected unit word after the numbers.
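One simplified way that scan could be implemented (an illustrative sketch only; `extract_unit` and its unit list are assumptions, not the actual extractor):

```python
import re
from typing import Optional

# Illustrative unit scan: take the first known unit token after the last number.
KNOWN_UNITS = {"usd", "eur", "mw", "kw", "days", "weeks", "months", "years", "%"}

def extract_unit(text: str) -> Optional[str]:
    """Return the first known unit word following the final numeric value."""
    numbers = list(re.finditer(r"\d+(?:\.\d+)?", text))
    if not numbers:
        return None
    tail = text[numbers[-1].end():].lower()
    for token in re.findall(r"%|[a-z]+", tail):
        if token in KNOWN_UNITS:
            return token
    return None
```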

## Evidence Expectations by Confidence

- `high`: sentence should include explicit value statements or commitments (e.g., "We will deliver 30 MW") and the evidence string can be the same sentence.
- `medium`: treat as the default; evidence is the claim text itself.
- `low`: must cite qualifiers and ideally pair the claim with supporting context (e.g., "~8 months" followed by "assuming no permit delays"). Evidence may include surrounding context when available.

Use this reference when wiring FermiSanityCheck so the validation functions know what fields exist, what values they expect, and how to treat the evidence for confidence levels.