diff --git a/docs/proposals/2026-03-07-luigi-resume-enhancements.md b/docs/proposals/2026-03-07-luigi-resume-enhancements.md new file mode 100644 index 00000000..ec46be86 --- /dev/null +++ b/docs/proposals/2026-03-07-luigi-resume-enhancements.md @@ -0,0 +1,244 @@ +# Luigi Resume Enhancements — Proposal + +**Date:** 2026-03-07 +**Author:** Bubba (Mac Mini agent) +**Context:** Written from direct experience running overnight pipeline recovery sessions with Qwen 3.5-35B-A3B on a Mac Mini M4 Pro. Every suggestion here is grounded in a concrete friction point hit during real runs. + +--- + +## Background + +Luigi's file-based task completion tracking is one of PlanExe's most valuable properties for local model runs. When a 60-task pipeline fails at task 30, Luigi resumes from task 30 — not task 1. For long runs on slow local hardware, this is the difference between a 4-hour retry and a 4-minute one. + +But the current resume flow has rough edges that require manual intervention, log-watching, and tribal knowledge. This proposal describes targeted enhancements to make resume-driven iteration faster, safer, and automatable. + +--- + +## 1. Webhook / Event Hooks on Task Completion and Failure + +### The problem + +Right now, monitoring a running pipeline means tailing a log file and parsing freeform text every 60 seconds. Agents and humans alike poll `log.txt` to find out if a task passed or failed. There is no push notification. 
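To make the polling pain concrete, here is roughly what an agent has to do today. This is a sketch only: the log line format shown is an illustrative assumption, not PlanExe's actual `log.txt` format.

```python
import re

# Hypothetical log lines; the real log.txt mixes INFO/DEBUG lines and
# tracebacks, which is exactly why scraping it is brittle.
LOG_LINES = [
    "2026-03-06 21:53:01 INFO Running PreProjectAssessmentTask",
    "2026-03-06 21:55:22 ERROR PreProjectAssessmentTask failed: ValidationError",
]

def scrape_failures(lines):
    """Scrape freeform log text for task failures -- the status quo an event hook would replace."""
    pattern = re.compile(r"ERROR (\w+) failed: (\w+)")
    return [m.groups() for line in lines if (m := pattern.search(line))]

print(scrape_failures(LOG_LINES))
# prints [('PreProjectAssessmentTask', 'ValidationError')]
```

An event hook inverts this: instead of an agent guessing a regex against freeform text, the pipeline pushes the same facts as structured data.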
### What we want

A lightweight event hook system that fires on task state changes:

- `task.started` — task began executing
- `task.completed` — task wrote its output file and was marked done
- `task.failed` — task raised an exception or exhausted retries
- `pipeline.completed` — all tasks finished successfully
- `pipeline.failed` — pipeline terminated with at least one failure

### Proposed interface

In the run config or environment, specify a webhook URL:

```
PLANEXE_WEBHOOK_URL=http://localhost:9000/planexe-events
```

On each event, POST a JSON payload:

```json
{
  "event": "task.failed",
  "task": "PreProjectAssessmentTask",
  "run_id_dir": "/path/to/run",
  "timestamp_utc": "2026-03-06T21:55:00Z",
  "error_summary": "1 validation error for ExpertDetails — combined_summary: Field required",
  "attempt": 1,
  "model": "lmstudio-qwen3.5-35b-a3b"
}
```

### Why this matters

An agent monitoring a pipeline via webhook can react in seconds instead of polling every minute. It can detect failure, inspect the error, apply a fix, and resume — without ever reading a log file. This is the foundation for autonomous overnight pipeline repair.

A Discord webhook makes this immediately useful: post task completions and failures directly to a channel, with structured data, without any polling infrastructure.

---

## 2. Task Invalidation CLI

### The problem

To re-run a completed task, you currently delete its output file manually. This requires knowing which file corresponds to which task, finding it in the run directory, and deleting it without accidentally removing dependent outputs. There is no guard against invalidating a task that has already-complete downstream dependents that would then also need to be re-run.

### What we want

A CLI command:

```bash
planexe invalidate <task-name> [--run-dir /path/to/run] [--cascade]
```

Behavior:

- Without `--cascade`: deletes only the output file(s) for the named task.
On next resume, only that task re-runs.
- With `--cascade`: deletes output files for the named task AND all downstream tasks that depend on it. Useful after fixing a schema or prompt that affects interpretation further down the chain.
- Prints what would be deleted before deleting (dry-run first).

### Example

```
$ planexe invalidate SelectScenarioTask --run-dir ./run/Qwen_Clean_v1
Would delete:
  run/Qwen_Clean_v1/002-17-selected_scenario_raw.json
  run/Qwen_Clean_v1/002-18-selected_scenario.json
  run/Qwen_Clean_v1/002-19-scenarios.md
Proceed? [y/N]
```

### Why this matters

Tonight we needed to re-run `SelectScenarioTask` after applying a fix. Without knowing exactly which files to delete, the safe move is to delete all files from that task number onward — which means re-running 40+ already-complete tasks. A targeted invalidation command makes surgical retries possible.

---

## 3. Plan File Hot-Editing with Downstream Invalidation

### The problem

The input plan (`001-2-plan.txt`) is locked in at run start. If a user wants to refine the plan description mid-run — clarify scope, correct a factual error, tighten the framing — there is no supported path. The only option is to start a new run from scratch.

### What we want

A mechanism to edit the plan file and selectively invalidate downstream tasks:

```bash
planexe edit-plan --run-dir ./run/Qwen_Clean_v1
# opens plan.txt in $EDITOR
# after save, asks: "Invalidate all tasks? [Y/n]"
# or: "Which tasks to invalidate? (comma-separated, or 'all')"
```

Alternatively: a `--invalidate-from <task-name>` flag that cascades from a specific task boundary, allowing early tasks that don't use the plan text directly to be preserved.

### Why this matters

On local hardware, a full re-run can take 4–6 hours. If a user notices a plan description issue at task 20, they currently have to restart from scratch.
Targeted invalidation from the point where the plan text first materially influences output would save hours. + +--- + +## 4. Pipeline Status Command + +### The problem + +There is no quick way to ask "where is this pipeline right now?" without parsing `log.txt`. The run directory contains output files, but mapping file names to task names and status requires knowing the file-naming convention. + +### What we want + +```bash +planexe status --run-dir ./run/Qwen_Clean_v1 +``` + +Output: + +``` +Run: Qwen_Clean_v1 +Model: lmstudio-qwen3.5-35b-a3b +Started: 2026-03-06 19:47 UTC + +✅ DONE (23 tasks) + SetupTask, StartTimeTask, RedlineGateTask, ... + +⏳ RUNNING (1 task) + PreProjectAssessmentTask — started 21:53 UTC + +❌ FAILED (1 task) + PreProjectAssessmentTask — "1 validation error for ExpertDetails" + +⬜ PENDING (38 tasks) + IdentifyRisksTask, CreateWBSLevel3Task, ... +``` + +### Why this matters + +Agents and humans checking in on an overnight run need this information immediately. Currently it requires log-parsing expertise. A status command makes the pipeline observable without tribal knowledge. + +--- + +## 5. Per-Task Timeout Configuration + +### The problem + +LM Studio's `request_timeout` is set globally in the model config JSON. Some tasks (e.g., `PreProjectAssessmentTask` with multiple expert sub-calls) consistently take longer than others and hit the global timeout. Raising the global timeout to accommodate slow tasks means slow failure detection for tasks that genuinely hang. + +### What we want + +Per-task timeout overrides in the model config or a separate task config file: + +```json +{ + "task_timeouts": { + "PreProjectAssessmentTask": 900, + "CreateWBSLevel3Task": 1200, + "default": 600 + } +} +``` + +### Why this matters + +`PreProjectAssessmentTask` runs multiple expert sub-calls sequentially. On Qwen 3.5-35B-A3B, each expert call takes 2–4 minutes. 
A 600-second global timeout is right for most tasks but causes `ReadTimeout` failures on this specific task. The fix tonight was a full LM Studio restart — not ideal at 10pm when a pipeline is running unattended. + +--- + +## 6. Structured Failure Log + +### The problem + +When a task fails, the error goes into `log.txt` as a Python traceback mixed with INFO/DEBUG lines. Extracting the failure signature (task name, model, error type, missing fields, attempt number) requires log-parsing. + +### What we want + +On any task failure, append a structured entry to `failures.jsonl` in the run directory: + +```json +{ + "timestamp_utc": "2026-03-06T21:55:22Z", + "task": "PreProjectAssessmentTask", + "model": "lmstudio-qwen3.5-35b-a3b", + "attempt": 1, + "error_type": "ValidationError", + "missing_fields": ["combined_summary", "go_no_go_recommendation"], + "invalid_fields": [], + "raw_error": "1 validation error for ExpertDetails...", + "run_id_dir": "/path/to/run" +} +``` + +### Why this matters + +Structured failure data enables: +- Automated retry decisions (is this truncation or a model capability gap?) +- Cross-run comparison (does the same task fail on different models?) +- PR evidence (attach `failures.jsonl` excerpt to PRs as proof) +- Model scorecards (which models fail at which tasks?) + +Tonight, diagnosing each failure required reading a Python traceback and manually extracting the task name, model, and missing fields. A `failures.jsonl` file would have made that instant. 
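As a sketch of how an agent could consume such a file — note that `failures.jsonl` and its fields are the proposed schema above, not an existing API:

```python
import json

def latest_failure(jsonl_text: str) -> dict:
    """Return the most recent structured entry from failures.jsonl content."""
    entries = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return entries[-1] if entries else {}

# Two example entries in the proposed format.
sample = "\n".join([
    json.dumps({"task": "SetupTask", "error_type": "ReadTimeout",
                "missing_fields": [], "attempt": 1}),
    json.dumps({"task": "PreProjectAssessmentTask", "error_type": "ValidationError",
                "missing_fields": ["combined_summary"], "attempt": 1}),
])

failure = latest_failure(sample)
# A retry decision becomes a field lookup instead of traceback parsing:
is_schema_gap = failure["error_type"] == "ValidationError" and bool(failure["missing_fields"])
print(failure["task"], is_schema_gap)
# prints PreProjectAssessmentTask True
```

Because the file is append-only JSONL, the pipeline can write it with a one-line `f.write(json.dumps(entry) + "\n")` on each failure, with no locking or database required.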
+ +--- + +## Implementation Priority + +Ordered by value-to-effort ratio: + +| Priority | Feature | Effort | Value | +|----------|---------|--------|-------| +| 1 | Structured failure log (`failures.jsonl`) | Low | High | +| 2 | Pipeline status command | Medium | High | +| 3 | Task invalidation CLI | Medium | High | +| 4 | Webhook / event hooks | Medium | High | +| 5 | Per-task timeout config | Low | Medium | +| 6 | Plan hot-editing + cascade | High | Medium | + +Items 1 and 5 are config/logging changes with minimal blast radius. Items 2–4 are new CLI surface area but don't touch pipeline logic. Item 6 requires careful dependency graph analysis. + +--- + +## Decision Request + +Approve this roadmap and we will implement each item as a separate focused PR, starting with structured failure logging (item 1) and the task invalidation CLI (item 3). diff --git a/docs/proposals/2026-03-07-pipeline-intelligence-layer.md b/docs/proposals/2026-03-07-pipeline-intelligence-layer.md new file mode 100644 index 00000000..a2d3a8a1 --- /dev/null +++ b/docs/proposals/2026-03-07-pipeline-intelligence-layer.md @@ -0,0 +1,223 @@ +# Pipeline Intelligence Layer — Proposal + +**Date:** 2026-03-07 +**Author:** Bubba (Mac Mini agent) +**Context:** Second-order proposal. The first proposal (PR #157) covered operational tooling. This one covers the deeper architectural gap: the pipeline treats model failures as terminal events, when they should be learning opportunities. + +--- + +## The Core Problem + +The current retry loop in `llm_executor.py` does something fundamentally broken: when a model fails Pydantic validation, it retries the **identical prompt** with the **identical model**. This is not a retry — it's repetition. It produces the same output and fails the same way. + +Every truncation failure we fixed tonight with `default=""` is a symptom of this. We applied a bandage. 
The wound is that the pipeline has no way to tell the model what it did wrong and ask it to try again with that information.

This proposal is about giving the pipeline a real feedback loop.

---

## 1. Error-Feedback Retries

### Current behavior

```
Attempt 1: send prompt → model omits holistic_profile_of_the_plan → Pydantic fails
Attempt 2: send identical prompt → model omits holistic_profile_of_the_plan → Pydantic fails
Attempt 3: exhausted → pipeline dies
```

The model receives no information about what it did wrong.

### Proposed behavior

On Pydantic validation failure, extract structured error information and inject it into the next attempt:

```
Attempt 1: send prompt → model omits holistic_profile_of_the_plan → Pydantic fails
  → extract: missing field "holistic_profile_of_the_plan" in PlanCharacteristics

Attempt 2: send prompt + correction message:
  "Your previous response was missing the required field 'holistic_profile_of_the_plan'
   in PlanCharacteristics. Please regenerate the complete JSON, including this field:
   a concise holistic summary synthesizing the four characteristics above."
  → model generates complete JSON → Pydantic succeeds
```

### Implementation

In `llm_executor.py`, after catching a `ValidationError`:

1. Extract missing fields and invalid types from the Pydantic error
2. Generate a compact correction message (< 200 tokens)
3. Append a `ChatMessage(role=ASSISTANT, content=<previous bad output>)` + `ChatMessage(role=USER, content=<correction message>)`
4. Retry with the augmented message history

This converts a blind retry into a self-correcting dialogue. The model sees its own bad output and a precise instruction to fix it.

### Expected impact

Tasks that currently require `default=""` workarounds may self-correct in attempt 2 instead of failing silently with an empty field. The `default=""` fixes are still useful as a last resort — but they should be the fallback, not the primary recovery path.

---

## 2. Task-Level Output Validation Beyond Pydantic

### The problem

Pydantic validates structure. It does not validate content. A model can return `holistic_profile_of_the_plan: ""` (which our fix now allows) and the pipeline continues happily with an empty field. The plan output is now corrupted in a way that Pydantic cannot detect.

More broadly: a model can return `go_no_go_recommendation: "I'll go ahead and recommend proceeding"` instead of `"Go"` or `"No Go"`. Structurally valid, semantically wrong.

### Proposed solution

A lightweight post-validation layer per task that checks content constraints:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class ContentConstraint:
    field: str
    constraint_type: Literal["non_empty", "one_of", "min_length", "max_length", "regex"]
    params: dict
    severity: Literal["warn", "retry", "fail"]
```

For `ExpertDetails`:
```python
constraints = [
    ContentConstraint("combined_summary", "min_length", {"chars": 100}, "retry"),
    ContentConstraint("go_no_go_recommendation", "one_of",
                      {"values": ["Go", "No Go", "Execute Immediately",
                                  "Proceed with Caution", "Do Not Execute"]}, "retry"),
]
```

On `severity="retry"`, run the error-feedback retry loop with the content violation as the correction message. On `severity="warn"`, log and continue. On `severity="fail"`, treat as a hard failure.

This closes the gap between structural validity (Pydantic) and semantic validity (what the pipeline actually needs).

---

## 3. Adaptive Model Selection

### The problem

The current model config has a fixed priority list. If model A fails task X, the pipeline falls back to model B — but model B may be known to fail task X for a completely different reason. There is no memory of which models succeed at which tasks.
+ +### Proposed solution + +A `model_task_matrix.json` that records observed task outcomes per model: + +```json +{ + "lmstudio-qwen3.5-35b-a3b": { + "SelectScenarioTask": "pass", + "PreProjectAssessmentTask": "pass", + "CreateWBSLevel3Task": "unknown" + }, + "zai-org/glm-4.6v-flash": { + "SelectScenarioTask": "pass", + "PreProjectAssessmentTask": "fail:schema_echo", + "CreateWBSLevel3Task": "fail:schema_echo" + } +} +``` + +The pipeline updates this file after each task attempt. On the next run, when building the model fallback order for a given task, it consults the matrix and skips models with `fail:*` outcomes for that task. + +This creates an empirical model scorecard that improves over time without requiring human intervention. + +--- + +## 4. Human-in-the-Loop Gates + +### The problem + +Some decisions in the pipeline should not be automated. The `RedlineGateTask` already exists as a safety check — but it auto-continues. There is no mechanism to pause the pipeline and wait for a human to review output before downstream tasks consume it. + +For an agent running an overnight pipeline, this is critical: if `go_no_go_recommendation` comes back `"Do Not Execute"`, the pipeline should pause and alert the human rather than continuing to generate a 60-task project plan for something the expert assessment said was infeasible. + +### Proposed solution + +A `PLANEXE_PAUSE_ON` environment variable (comma-separated list of task names or field values): + +``` +PLANEXE_PAUSE_ON=go_no_go_recommendation:Do Not Execute,RedlineGateTask:blocked +``` + +When a listed condition is met: +1. Pipeline writes a `pause.json` to the run directory with the condition details +2. If a webhook is configured, fires a `pipeline.paused` event +3. Pipeline waits for `resume.json` to appear (human creates it to continue) or `abort.json` (human creates it to stop) +4. 
`planexe resume --run-dir ./run/X` creates the `resume.json` file + +This gives humans meaningful control over consequential decision points without requiring them to babysit every task. + +--- + +## 5. Streaming Output Monitor + +### The problem + +For long-running LLM calls (2–4 minutes for expert assessment tasks), the pipeline is completely opaque. You cannot tell whether the model is generating useful output, hallucinating, or stuck. You only know the result when it finishes — by which point it has already timed out or produced bad output. + +### Proposed solution + +A streaming preview mode that logs model output tokens to a `stream.log` file in the run directory as they arrive: + +``` +PLANEXE_STREAM_LOG=true +``` + +This does not change pipeline behavior — it adds a side channel. An agent or human can `tail -f ./run/Qwen_Clean_v1/stream.log` to watch output in real time and detect early if a model is going off the rails (e.g., generating markdown prose instead of JSON, or repeating the schema back instead of filling it). + +For agent monitoring, this enables early abort: if the stream shows the model is not producing JSON after the first 500 tokens, the agent can kill the request and retry before the full timeout expires. + +--- + +## 6. Run Comparison + +### The problem + +After applying a fix and re-running, there is no easy way to see what changed in task outputs between run A and run B. Did the model generate more complete output after the `default=""` fix? Did the scenario selection change? Did the expert assessment reach a different conclusion? + +### Proposed solution + +```bash +planexe diff --run-a ./run/Qwen_Only_Clean_v1 --run-b ./run/Qwen_Clean_v1 --task PreProjectAssessmentTask +``` + +Output: + +```diff +PreProjectAssessmentTask output diff: + + combined_summary: +- (empty — field was missing) ++ The project faces three critical blockers: insufficient capital runway... 
+ + go_no_go_recommendation: +- (empty — field was missing) ++ Proceed with Caution — the core concept is viable but... +``` + +This makes it immediately clear whether a fix improved output quality, changed conclusions, or introduced regressions. Essential for PR evidence and model comparison. + +--- + +## Priority + +| Priority | Feature | Complexity | Impact | +|----------|---------|-----------|--------| +| 1 | Error-feedback retries | Medium | Eliminates most `default=""` workarounds | +| 2 | Human-in-the-loop gates | Low | Safety for consequential decisions | +| 3 | Content validation beyond Pydantic | Medium | Closes silent corruption gap | +| 4 | Streaming output monitor | Medium | Enables early abort, faster debug | +| 5 | Adaptive model selection | Medium | Self-improving failure recovery | +| 6 | Run comparison | Low | Evidence-based development | + +--- + +## Relationship to PR #157 + +PR #157 covers operational tooling (status, invalidation, webhooks, timeouts). This proposal covers the intelligence layer — making the pipeline smarter about failure and recovery. Both are necessary. #157 makes the pipeline observable and controllable. This proposal makes it self-correcting. + +The highest-value item here is error-feedback retries (#1). Implementing it would make several of the `default=""` fixes in PRs #153, #155, #156 unnecessary — or demote them from primary fixes to last-resort fallbacks.
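For concreteness, a minimal sketch of the error-feedback retry shape in item 1. All names here are assumptions: `validate` stands in for the real Pydantic check in `llm_executor.py`, and `call_model` stands in for the real LLM call.

```python
import json

REQUIRED_FIELDS = ["holistic_profile_of_the_plan"]  # example task schema

def validate(raw: str) -> dict:
    """Stand-in for Pydantic validation: parse JSON, reject missing/empty required fields."""
    data = json.loads(raw)
    missing = [f for f in REQUIRED_FIELDS if not data.get(f)]
    if missing:
        raise ValueError(missing)
    return data

def retry_with_feedback(call_model, prompt: str, attempts: int = 3) -> dict:
    """Blind retry becomes a dialogue: the model sees its bad output plus a correction."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(attempts):
        raw = call_model(messages)
        try:
            return validate(raw)
        except ValueError as err:
            # JSONDecodeError is a ValueError subclass, so both paths land here.
            missing = (["<unparseable output>"] if isinstance(err, json.JSONDecodeError)
                       else err.args[0])
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user", "content":
                "Your previous response was missing required field(s): "
                + ", ".join(missing)
                + ". Regenerate the complete JSON including them."})
    raise RuntimeError("retries exhausted")

# Example: a model that omits the field once, then corrects itself.
def fake_model(messages):
    return '{}' if len(messages) == 1 else '{"holistic_profile_of_the_plan": "ok"}'

result = retry_with_feedback(fake_model, "Describe the plan as JSON.")
# result == {"holistic_profile_of_the_plan": "ok"}
```

The point of the sketch is the message-history shape: each failed attempt appends the bad output as an assistant turn and a precise correction as a user turn, so attempt N+1 is a continuation of a conversation rather than a replay of attempt 1.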