74 changes: 74 additions & 0 deletions docs/proposals/1-triage.md
@@ -0,0 +1,74 @@
---
title: PlanExe Proposal Triage — 80/20 Landscape
date: 2026-02-25
status: working note
author: Egon + Larry
---

# Overview

Simon asked us to triage the proposal space with an 80:20 lens. The goal of this note is to capture:
1. Which proposals deliver outsized value (the 20% that unlock 80% of the architecture)
2. Which other proposals are nearby in the graph and could reuse their artifacts or reasoning
3. High-leverage parameter tweaks, code tweaks, and second/third order effects
4. Gaps in the current docs and ideas for new proposals
5. Relevant questions/tasks you might not have asked yet

We focused on the most recent proposals ("67+" cluster) plus the ones directly touching the validation/orchestration story that FermiSanityCheck will unlock.

# High-Leverage Proposals (the 20%)

1. **#07 Elo Ranking System (1,751 lines)** – Core ranking mechanism for comparing idea variants, plan quality, and post-plan summaries. The Elo-style heuristics here inform nearly every downstream comparison use case.
2. **#63 Luigi Agent Integration & #64 Post-plan Orchestration Layer** – Together with #66, these documents describe how PlanExe schedules, retries, and enriches its Luigi DAG. Any change to the DAG (including FermiSanityCheck or arcgentica-style loops) ripples through this cluster.
3. **#62 Agent-first Frontend Discoverability (609 lines)** – Defines the agent UX, which depends on the scoring/ranking engine (#07) and the reliability signals that our validation cluster will provide.
4. **#69 Arcgentica Agent Patterns (279 lines)** – The arcgentica comparison is already referencing our validation work and sets the guardrails for self-evaluation/soft-autonomy.
5. **#41 Autonomous Execution of Plan & #05 Semantic Plan Search Graph** – These represent core system-level capabilities (distributed execution and semantic search) whose outputs feed the ranking and reporting layers.

These documents together unlock most of the architectural work. They interlock around: planning quality signals (#07, #69, Fermi), orchestration (#63, #64, #66), and the interfaces (#62, #41, #05).

# Related Proposals & Reuse Opportunities

- **#07 Elo Ranking + #62 Agent-first Frontend** can share heuristics. Instead of reinventing ranking weights in #62, reuse the cost/feasibility tradeoffs defined in #07 plus FermiSanityCheck flags as features.
- **#63-66 orchestration cluster** – these documents already describe the Luigi tasks. The validation loop doc should be cross-referenced there to show where FermiSanityCheck sits in the DAG and how downstream tasks like WBS, Scheduler, and ExpertOrchestrator should consume the validation report.
- **#69 + #56 (Adversarial Red Team) + #43 (Assumption Drift Monitor)** form a validation cluster. FermiSanityCheck is the front line; these others are observers (red team, drift monitor) that should consume the validation report and escalate to human review.
- **#32 Gantt Parallelization & #33 CBS** could reuse the same thresholds as FermiSanityCheck when calculating duration plausibility (e.g., if a duration falls outside the published feasible range, highlight the same issue in the Gantt UI).
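To keep the two layers aligned, the Gantt side could share the feasible-range constants rather than redefining its own. A minimal sketch (the threshold values mirror `fermi_sanity_check.py` in this PR; in the real integration the Gantt layer would import them, and `duration_is_plausible` is a hypothetical name):

```python
# Thresholds mirrored from fermi_sanity_check.py; the Gantt layer would
# import these from the validation module rather than redefine them.
TIMELINE_MIN_DAYS = 1
TIMELINE_MAX_DAYS = 3650  # ten years

def duration_is_plausible(duration_days: float) -> bool:
    """Flag durations outside the validation loop's feasible range."""
    return TIMELINE_MIN_DAYS <= duration_days <= TIMELINE_MAX_DAYS
```

Sharing one source of truth means a plan flagged by FermiSanityCheck and a bar highlighted in the Gantt UI can never disagree about what "implausible" means.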

# 80:20 Tweaks & Parameter Changes

- **Ranking weights (#07)** – adjust cost vs. feasibility vs. confidence to surface plans that pass quantitative grounding. No rewrite needed; just new weights (e.g., penalize plans where FermiSanityCheck flags >3 assumptions).
- **Batch size thresholds (#63)** – the Luigi DAG currently runs every task. We can gate the WBS tasks with a flag that only fires if FermiSanityCheck passes or fails softly, enabling a smaller workflow for low-risk inputs without re-architecting.
- **Risk terminology alignment (#38 & #44)** – harmonize the words used in the risk propagation network and investor audit pack so they can share visualization tooling, reducing duplicate explanations.
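The ranking-weight tweak above can be sketched as follows (the penalty constants and function name are illustrative assumptions, not the actual #07 API):

```python
# Illustrative sketch: penalize Elo-ranked plans once FermiSanityCheck
# flags exceed a threshold. FLAG_PENALTY and FLAG_THRESHOLD are invented
# numbers for the example; the real #07 weights would be tuned separately.
FLAG_PENALTY = 25.0   # score points deducted per flagged assumption over the limit
FLAG_THRESHOLD = 3    # only penalize plans with more than 3 flags

def adjusted_score(base_score: float, flagged_assumptions: int) -> float:
    """Deduct points only for flags beyond the tolerated threshold."""
    excess = max(0, flagged_assumptions - FLAG_THRESHOLD)
    return base_score - FLAG_PENALTY * excess
```

The point is that no rewrite of the ranking engine is needed: validation flags enter as one more weighted feature.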

# Second/Third Order Effects

- **Validation loop → downstream trust**: Once FermiSanityCheck is in place, client reports (e.g., #60 plan-to-repo, #41 autonomous execution) can annotate numbers with the validation status, reducing rework.
- **Arcgentica/agent patterns**: Hardening PlanExe encourages stricter typed outputs (#69). This lets the UI (#08) and ranking engine (#07) rely on structured data instead of parsing Markdown.
- **Quantitative grounding improves ranking** (#07, #62) which in turn makes downstream dashboards (#60, #62) more actionable and reduces QA overhead.
- **Clustering proposals** (#63-66, #69, #56) around validation/orchestration helps the next human reviewer (Simon) make a single decision that affects multiple docs.

# Gaps & Future Proposal Ideas

- **FermiSanityCheck Implementation Roadmap** – Document how MakeAssumptions output becomes QuantifiedAssumption, where heuristics live, and how Luigi tasks consume the validation_report. (We have the spec in `planexe-validation-loop-spec.md` but not a public proposal yet.)
- **Validation Observability Dashboard** – A proposal capturing how the validation report is surfaced to humans (per #44, #60). Could cover alerts (Slack/Discord) when FermiSanityCheck fails or when repeated fails accumulate.
- **Arbitration Workflow** – When FermiSanityCheck fails and ReviewPlan still thinks the plan is OK, we need a human-in-the-loop workflow. This is not yet documented anywhere.

# Questions You Might Not Be Asking

1. What are the acceptance criteria for FermiSanityCheck? (confidence levels, heuristics, why 100× spans?)
2. Who owns the validation report downstream? Should ExpertOrchestrator or Governance phases be responsible for acting on it?
3. Does FermiSanityCheck expire per run or is it stored for audit trails (per #42 evidence traceability)?
4. Can we reuse the same heuristics for other tasks (#32 Gantt, #34 finance) to maximize payoff?
5. How do we rank the outputs once FermiSanityCheck is added? Should ranking (#07) penalize low confidence even if the costs look good?
6. Do we need a battle plan for manual overrides when FermiSanityCheck is overzealous (e.g., ROI assumptions where domain experts know the average is >100×)?

# Tasks We Can Own Now

- Extract the QuantifiedAssumption schema (claim, lower_bound, upper_bound, unit, confidence, evidence) and add it to PlanExe’s assumption bundle.
- Implement a FermiSanityCheck Luigi task that runs immediately after MakeAssumptions and produces validation_report.json.
- Hook the validation report into DistillAssumptions / ReviewAssumptions by adding a `validation_passed` flag.
- Update #69 and #56 docs with references to the validation report to keep the narrative cohesive.
- Create the proposed dashboard proposal (validation observability) to track how many plans fail numeric sanity each week.
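The `validation_passed` flag from the third task above could be derived directly from the report JSON. A minimal sketch, assuming the report shape defined in this PR's `ValidationReport` model (`validation_passed` and the `min_pass_rate_pct` knob are hypothetical names):

```python
import json

def validation_passed(report_path: str, min_pass_rate_pct: float = 100.0) -> bool:
    """Read validation_report.json and decide whether the run cleared the check.

    pass_rate_pct matches the field emitted by ValidationReport; the
    threshold default treats any failed assumption as a soft failure.
    """
    with open(report_path) as f:
        report = json.load(f)
    return report["pass_rate_pct"] >= min_pass_rate_pct
```

DistillAssumptions / ReviewAssumptions would then branch on this boolean instead of re-parsing the full report.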

# Summary

The high-leverage 20% of proposals are: ranking (#07), orchestration (#63-66), UI (#62), arcgentica patterns (#69), and autonomous execution/search (#41, #05). We can activate them by implementing FermiSanityCheck, aligning their heuristics, and surfacing the new validation signals in the UI/dashboards. The docs already cover most of the research; now we need a short, focused proposal/clustering doc (this one) plus the Fermi implementation and dashboards. After Simon approves, we can execute the chosen cluster.
2 changes: 2 additions & 0 deletions worker_plan/worker_plan_api/filenames.py
@@ -37,6 +37,8 @@ class FilenameEnum(str, Enum):
    REVIEW_ASSUMPTIONS_MARKDOWN = "003-9-review_assumptions.md"
    CONSOLIDATE_ASSUMPTIONS_FULL_MARKDOWN = "003-10-consolidate_assumptions_full.md"
    CONSOLIDATE_ASSUMPTIONS_SHORT_MARKDOWN = "003-11-consolidate_assumptions_short.md"
    FERMI_SANITY_CHECK_REPORT = "003-12-fermi_sanity_check_report.json"
    FERMI_SANITY_CHECK_SUMMARY = "003-13-fermi_sanity_check_summary.md"
    PRE_PROJECT_ASSESSMENT_RAW = "004-1-pre_project_assessment_raw.json"
    PRE_PROJECT_ASSESSMENT = "004-2-pre_project_assessment.json"
    PROJECT_PLAN_RAW = "005-1-project_plan_raw.json"
224 changes: 224 additions & 0 deletions worker_plan/worker_plan_internal/assume/fermi_sanity_check.py
@@ -0,0 +1,224 @@
"""Validation helpers for QuantifiedAssumption data."""
from __future__ import annotations

from typing import List, Optional, Sequence

from pydantic import BaseModel, Field

from worker_plan_internal.assume.quantified_assumptions import ConfidenceLevel, QuantifiedAssumption

MAX_SPAN_RATIO = 100.0
MIN_EVIDENCE_LENGTH = 40
BUDGET_LOWER_THRESHOLD = 1_000.0
BUDGET_UPPER_THRESHOLD = 100_000_000.0
TIMELINE_MAX_DAYS = 3650
TIMELINE_MIN_DAYS = 1
TEAM_MIN = 1
TEAM_MAX = 1000

CURRENCY_UNITS = {
"usd",
"eur",
"dkk",
"gbp",
"cad",
"aud",
"sek",
"nzd",
"mxn",
"chf"
}

TIME_UNIT_TO_DAYS = {
"day": 1,
"days": 1,
"week": 7,
"weeks": 7,
"month": 30,
"months": 30,
"year": 365,
"years": 365
}

TEAM_KEYWORDS = {
"team",
"people",
"engineer",
"engineers",
"staff",
"headcount",
"crew",
"members",
"contractors",
"workers"
}

BUDGET_KEYWORDS = {
"budget",
"cost",
"funding",
"investment",
"price",
"capex",
"spend",
"expense",
"capital"
}

TIMELINE_KEYWORDS = {
"timeline",
"duration",
"schedule",
"milestone",
"delivery",
"months",
"years",
"weeks",
"days"
}


class ValidationEntry(BaseModel):
assumption_id: str = Field(description="Stable identifier for the assumption")
question: str = Field(description="Source question for context")
passed: bool = Field(description="Whether the assumption passed validation")
reasons: List[str] = Field(description="List of validation failures")


class ValidationReport(BaseModel):
entries: List[ValidationEntry] = Field(description="Detailed result per assumption")
total_assumptions: int = Field(description="Total number of assumptions processed")
passed: int = Field(description="Count of assumptions that passed")
failed: int = Field(description="Count of assumptions that failed")
pass_rate_pct: float = Field(description="Percentage of assumptions that passed")


def validate_quantified_assumptions(
assumptions: Sequence[QuantifiedAssumption]
) -> ValidationReport:
entries: List[ValidationEntry] = []
passed = 0

for assumption in assumptions:
reasons: List[str] = []
lower = assumption.lower_bound
upper = assumption.upper_bound

if lower is None or upper is None:
reasons.append("Missing lower or upper bound.")
elif lower > upper:
reasons.append("Lower bound is greater than upper bound.")
else:
if ratio := assumption.span_ratio:
if ratio > MAX_SPAN_RATIO:
reasons.append("Range spans more than 100×; too wide.")

if assumption.confidence == ConfidenceLevel.low:
evidence = assumption.evidence or ""
if len(evidence.strip()) < MIN_EVIDENCE_LENGTH:
reasons.append("Low confidence claim lacks sufficient evidence.")

if _should_check_budget(assumption):
_apply_budget_constraints(lower, upper, reasons)

if _should_check_timeline(assumption):
_apply_timeline_constraints(lower, upper, assumption.unit, reasons)

if _should_check_team(assumption):
_apply_team_constraints(lower, upper, reasons)

passed_flag = not reasons
if passed_flag:
passed += 1

entry = ValidationEntry(
assumption_id=assumption.assumption_id,
question=assumption.question,
passed=passed_flag,
reasons=reasons
)
entries.append(entry)

total = len(entries)
failed = total - passed
pass_rate = (passed / total * 100.0) if total else 0.0
return ValidationReport(
entries=entries,
total_assumptions=total,
passed=passed,
failed=failed,
pass_rate_pct=round(pass_rate, 2)
)


def render_validation_summary(report: ValidationReport) -> str:
lines = [
"# Fermi Sanity Check",
"",
f"- Total assumptions: {report.total_assumptions}",
f"- Passed: {report.passed}",
f"- Failed: {report.failed}",
f"- Pass rate: {report.pass_rate_pct:.1f}%",
""
]

if report.failed:
lines.append("## Failed assumptions")
for entry in report.entries:
if not entry.passed:
reasons = ", ".join(entry.reasons) if entry.reasons else "No details provided."
lines.append(f"- `{entry.assumption_id}` ({entry.question or 'question missing'}): {reasons}")

return "\n".join(lines)


def _should_check_budget(assumption: QuantifiedAssumption) -> bool:
text = (assumption.question or "").lower()
return any(keyword in text for keyword in BUDGET_KEYWORDS) or (assumption.unit or "") in CURRENCY_UNITS


def _should_check_timeline(assumption: QuantifiedAssumption) -> bool:
text = (assumption.question or "").lower()
return any(keyword in text for keyword in TIMELINE_KEYWORDS)


def _should_check_team(assumption: QuantifiedAssumption) -> bool:
text = (assumption.question or "").lower()
return any(keyword in text for keyword in TEAM_KEYWORDS)


def _apply_budget_constraints(lower: Optional[float], upper: Optional[float], reasons: List[str]) -> None:
if lower is not None and lower < BUDGET_LOWER_THRESHOLD:
reasons.append(f"Budget below ${BUDGET_LOWER_THRESHOLD:,.0f}.")
if upper is not None and upper > BUDGET_UPPER_THRESHOLD:
reasons.append(f"Budget above ${BUDGET_UPPER_THRESHOLD:,.0f}.")


def _apply_timeline_constraints(
lower: Optional[float], upper: Optional[float], unit: Optional[str], reasons: List[str]
) -> None:
lower_days = _normalize_to_days(lower, unit)
upper_days = _normalize_to_days(upper, unit)

if lower_days is not None and lower_days < TIMELINE_MIN_DAYS:
reasons.append("Timeline below 1 day.")
if upper_days is not None and upper_days > TIMELINE_MAX_DAYS:
reasons.append("Timeline exceeds ten years (3,650 days).")


def _normalize_to_days(value: Optional[float], unit: Optional[str]) -> Optional[float]:
if value is None:
return None
if not unit:
return value
normalized = TIME_UNIT_TO_DAYS.get(unit.lower())
if normalized is None:
return value
return value * normalized


def _apply_team_constraints(lower: Optional[float], upper: Optional[float], reasons: List[str]) -> None:
if lower is not None and lower < TEAM_MIN:
reasons.append("Team size below 1 person.")
if upper is not None and upper > TEAM_MAX:
reasons.append("Team size above 1,000 people.")
@@ -0,0 +1,39 @@
# QuantifiedAssumption Schema Reference

| Field | Type | Description |
| --- | --- | --- |
| `assumption_id` | `str` | Unique stable identifier for the assumption (use `assumption-<index>` when not provided). |
| `question` | `str` | The source question that prompted the assumption. |
| `claim` | `str` | Normalized assumption text with the `Assumption:` prefix removed. |
| `lower_bound` | `float?` | Parsed lower numeric bound (if present). |
| `upper_bound` | `float?` | Parsed upper numeric bound (mirror of lower_bound when none explicitly provided). |
| `unit` | `str?` | Detected unit token (e.g., `mw`, `days`, `usd`, `%`). |
| `confidence` | `ConfidenceLevel` (`high` / `medium` / `low`) | Estimated confidence level inferred from hedging words. |
| `evidence` | `str` | Text excerpt used as evidence (currently same as `claim` but can be overridden with extracted snippets). |
| `extracted_numbers` | `List[float]` | All numeric values found in the assumption for further heuristics. |
| `raw_assumption` | `str` | Original string returned by `MakeAssumptions` (includes prefix). |
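As an illustration of the fields above, a single parsed record might look like this (every value is invented for the example; only the field names come from the schema):

```python
# Hypothetical QuantifiedAssumption record, shown as a plain dict.
# All values are illustrative; field names follow the schema table above.
example = {
    "assumption_id": "assumption-3",
    "question": "What is the project budget?",
    "claim": "The budget is between 2 and 5 million USD.",
    "lower_bound": 2_000_000.0,
    "upper_bound": 5_000_000.0,
    "unit": "usd",
    "confidence": "medium",  # no hedging or strong modality detected
    "evidence": "The budget is between 2 and 5 million USD.",
    "extracted_numbers": [2_000_000.0, 5_000_000.0],
    "raw_assumption": "Assumption: The budget is between 2 and 5 million USD.",
}
```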

## Confidence Enum Values

| Level | Detection Signals |
| --- | --- |
| `high` | Contains strong modality ("will", "must", "ensure", "guarantee"). |
| `medium` | Default when no strong signal is detected. |
| `low` | Contains hedging words ("estimate", "approx", "may", "likely"). |

## Unit Examples

- Financial: `usd`, `eur`, `million`, `billion`
- Capacity/Scale: `mw`, `kw`, `tonnes`, `sqft`, `people`
- Time: `days`, `weeks`, `months`, `years` (expressed as words following the range)
- Percentage/Ratio: `%`, `bps`

Units are extracted by scanning the text around the numeric range or first detected unit word after the numbers.
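One simplified way that scan could be implemented (an illustrative sketch only; `extract_unit` and its unit list are assumptions, not the actual extractor):

```python
import re
from typing import Optional

# Illustrative unit scan: take the first known unit token after the last number.
KNOWN_UNITS = {"usd", "eur", "mw", "kw", "days", "weeks", "months", "years", "%"}

def extract_unit(text: str) -> Optional[str]:
    """Return the first known unit word following the final numeric value."""
    numbers = list(re.finditer(r"\d+(?:\.\d+)?", text))
    if not numbers:
        return None
    tail = text[numbers[-1].end():].lower()
    for token in re.findall(r"%|[a-z]+", tail):
        if token in KNOWN_UNITS:
            return token
    return None
```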

## Evidence Expectations by Confidence

- `high`: sentence should include explicit value statements or commitments (e.g., "We will deliver 30 MW") and the evidence string can be the same sentence.
- `medium`: treat as the default; evidence is the claim text itself.
- `low`: must cite qualifiers and ideally pair the claim with supporting context (e.g., "~8 months" followed by "assuming no permit delays"). Evidence may include surrounding context when available.

Use this reference when wiring FermiSanityCheck so the validation functions know what fields exist, what values they expect, and how to treat the evidence for confidence levels.