fix: air-tight workflow lifecycle (incomplete-marked-complete + state auto-detect + cascade close + stale sweep) by spencermarx · Pull Request #31 · spencermarx/open-code-review

spencermarx · 2026-05-10T13:20:43Z

Summary

A full audit of the review + map state lifecycle. Closes every gap that lets a workflow end up in an incorrect/inconsistent state, not just the original "Success shown when not really complete" symptom.

Audit findings & fixes

Gap	Root cause	Fix
#1 Outcome ≠ exit code	Dashboard derived "Success" from `exit_code === 0` alone, ignoring whether the workflow actually closed	New `deriveCommandOutcome(exit_code, sessions.status)`; "Incomplete" outcome surfaced in UI
#2 Two parallel auto-detect helpers	`resolveActiveSession` (transition/close) and `resolveSessionForCompletion` (round/map complete) diverged; fixing one missed the other	Collapsed into single `resolveSession` with `--session-id` > env-var > latest-active. Refuses on ambiguity. Decision printed to stderr
#3 No phase-progression validation	Flat `VALID_PHASES` set let `reviews → complete` through; let cross-type phases (e.g. `topology` on a review workflow) through	Workflow-typed phase graphs. `validatePhaseTransition` enforces legal source→target edges; round-boundary resets explicitly allowed
#4 `stateClose` not cascading	Closing a workflow left running dependent `command_executions` stranded until the heartbeat sweep noticed	Cascade-stamps in-flight deps with `exit_code = -4` and a structured note
#5 `stateClose` not idempotent	Second close → duplicate `session_closed` event	No-op + stderr notice on already-closed
#6 Round derived from filesystem	`stateInit` re-open walked `rounds/round-N/` on disk to decide next round number — broke if disk drifted	Derives from `MAX(round_completed.round) + 1` events. `stateRoundComplete` also bumps `sessions.current_round`
#7 No `workflow_type` compat check on re-open	`stateInit` would happily re-open a review session as a map (or vice versa), with disjoint phase graphs	Hard error on mismatched workflow_type
#8 `stateInit` set `current_phase = 'context'` for all workflows	Map workflows started with a review phase, breaking the phase graph immediately	Initial phase = `map-context` for maps
#9 `stateSync` left `phase_number` / `current_round` at 1 for completed backfilled sessions	Backfilled rows looked half-complete in the dashboard	Reconstruct terminal status from artifacts (final.md → phase 8, latest round number from disk)
#10 No stale-active `sessions` sweep	Stale-active rows accumulated forever, poisoning latest-active auto-detect (the Wrkbelt March stragglers)	New `sweepStaleSessions` — closes status='active' rows with no events in 7d and no in-flight dependents. Wired into dashboard startup + 5-min periodic timer
#11 Liveness sweep only on dashboard startup	Long-running dashboards accumulated stranded `command_executions` rows	5-min periodic timer runs both sweeps inside the running dashboard

What "air-tight" now looks like

Lifecycle invariants now enforced:

Invariant	Enforced
I1 Every `sessions.status` transition follows a legal graph edge	✓ (phase graph + cascade close)
I2 `closed` ⇒ all dependent `command_executions` are terminal	✓ (cascade close)
I3 `current_phase` only takes values from the workflow_type's enum	✓ (workflow-typed phase graphs)
I4 Phase transitions don't skip required phases	✓ (graph edge validation)
I5 `current_round` derives from `round_completed` events	✓ (deriveNextRound)
I6 Auto-detect surfaces its choice; refuses on ambiguity	✓ (stderr announce + ambiguity error)
I7 Sweep eventually reclassifies stranded `running` state without dashboard restart	✓ (5-min periodic timer)
I8 Backfill reconstructs terminal state from artifacts	✓ (stateSync infers phase_number + round)

Test plan

Commits in this PR

d656834 feat(dashboard): derive command outcome from workflow lifecycle
b1b8204 fix(cli): resolve completion session via dashboard execution UID
b7c3f4e fix(cli): air-tight workflow state lifecycle
4bf3596 feat(dashboard,cli): sweep stale-active sessions + periodic dashboard timer

🤖 Generated with claude-flow

When a parent AI process exits 0 mid-workflow (e.g. macOS sleep drops the streaming connection before the AI ever calls `ocr state close-session`), the dashboard previously labelled the command "Success" — its only completion signal was the process exit code. Users had no way to tell from the dashboard that the workflow was actually unfinished. Cross-check the workflow lifecycle when reporting outcome. New pure helper `deriveCommandOutcome(exit_code, workflow_status)` returns: - 'success' — exit 0 AND (no workflow | workflow.status='closed') - 'incomplete' — exit 0 BUT linked workflow still 'active' - 'failed' — non-zero exit code (excluding -2 cancel sentinel) - 'cancelled' — exit code -2 Wire-up: - command-runner.finishExecution queries workflow status via a single LEFT JOIN and includes outcome in the `command:finished` socket event - getCommandHistory LEFT JOINs sessions; the route projects outcome onto each row so historical data is reclassified retroactively at read time (no schema migration, no backfill) - Client provider, history list, and workflow-output badge prefer `outcome` from server, falling back to legacy exit-code mapping for older sockets / unhydrated rows - New 'Incomplete' StatusFilter chip (amber, distinct from green Success and red Failed); badge tooltip explains the likely cause and points to "Resume in terminal" Single source of truth means client + server can never disagree on labelling. Co-Authored-By: claude-flow <ruv@ruv.net>

`ocr state round-complete` and `ocr state close-session` previously used a "latest-active" heuristic when no `--session-id` was given — 'SELECT * FROM sessions WHERE status = active ORDER BY started_at DESC LIMIT 1'. With multiple stale-active rows in the DB (which the incomplete-workflow bug itself produces over time), this picked the wrong session and wrote round-meta.json into an unrelated session directory. The dashboard already sets OCR_DASHBOARD_EXECUTION_UID when it spawns the AI, and the SessionCaptureService binds that uid to command_executions.workflow_id. Teach `resolveSessionForCompletion` to follow that linkage before falling back to latest-active: 1. explicit --session-id (most specific) 2. process.env.OCR_DASHBOARD_EXECUTION_UID → command_executions.workflow_id 3. getLatestActiveSession (fine for direct CLI use) A dashboard-spawned AI now always knows its own workflow regardless of how many other active rows exist. Direct CLI users see no behavior change. Co-Authored-By: claude-flow <ruv@ruv.net>

Audited every state-mutation surface (state init/transition/close/ round-complete/map-complete/sync) for the failure modes that caused the "incomplete session wrongly marked complete" and "wrong session got closed" bugs. Closes the structural gaps that PR #31's outcome derivation alone couldn't fix: 1. Single session resolver. Two parallel helpers (resolveActiveSession + resolveSessionForCompletion) diverged — fixing one missed the other. Collapsed into `resolveSession`, used by every CLI subcommand that takes an optional --session-id. Resolution order: explicit --session-id → OCR_DASHBOARD_EXECUTION_UID → latest-active. Refuses with a hard ambiguity error when >1 active sessions exist and no env var is set, rather than silently picking one. Auto-detect decisions are now printed to stderr ("Auto-detected session: X (via latest-active)") so the user sees which session a command will affect. 2. Phase-progression graph. Replaced the flat VALID_PHASES set (12 phases mixing review + map) with two workflow-typed graphs. stateTransition validates source→target legality: a review workflow can no longer transition to map phases, and the AI can no longer skip from `reviews` straight to `complete`. Round/map-run boundaries are treated as a permitted reset to the initial phase. 3. Cascade stateClose + idempotency. Closing a workflow now stamps any in-flight dependent command_executions with exit_code=-4 and a "closed by parent workflow close" note. Idempotent: closing an already-closed session no-ops with a notice rather than writing a duplicate session_closed event. 4. Round derivation from events. stateInit's re-open path used to walk the filesystem (rounds/round-N/final.md presence) to decide the next round. Now derives from MAX(round_completed.round) + 1 — events are authoritative, filesystem is observational. stateRoundComplete also advances sessions.current_round so the column stays in sync. 5. Workflow_type compat on re-open. stateInit refuses to re-open a review session as a map (and vice versa) — disjoint phase graphs would corrupt state immediately. 6. Workflow-typed initial phase. stateInit now sets current_phase='map-context' for map workflows (was always 'context'). Without this, every subsequent map transition fails the new phase-graph check. Tests: stateTransition phase-graph (legal/illegal/cross-type/round- boundary), stateClose cascade + idempotency, stateInit round-from- events + type-mismatch, resolveActiveSession ambiguity refusal + env-var disambiguation. Pre-existing progress-sqlite test helper updated to walk legal phase edges. Co-Authored-By: claude-flow <ruv@ruv.net>

… timer `sessions.status = 'active'` rows previously had no automated cleanup — sessions that initialised but never reached close-session accumulated forever, poisoning latest-active auto-detect (the root cause of the "wrong session got closed" failure mode). The Wrkbelt DB showed 5 March stragglers from this exact path. Adds `sweepStaleSessions(db, thresholdSeconds)` — closes any status='active' row whose most recent orchestration_event is older than the threshold AND has no in-flight dependent command_executions. Writes a `session_auto_closed_stale` event recording the reason. Wires the new sweep alongside the existing agent-session liveness sweep: - Dashboard startup: both sweeps run before the API routes register - Periodic: 5-minute timer inside the running dashboard runs both sweeps so long-running dashboards don't accumulate stranded rows (the previous design only swept on startup) Threshold for stale sessions: 7 days. Long enough that an in-progress review can sit overnight without triggering, short enough that abandoned/crashed initialisations get cleaned up within a week. Tests cover: closes past-threshold sessions, leaves recent sessions alone, preserves stale-active sessions with in-flight dependents, writes the auto_closed_stale event with the threshold. Co-Authored-By: claude-flow <ruv@ruv.net>

spencermarx and others added 4 commits May 10, 2026 15:20

spencermarx changed the title ~~fix: incomplete workflows wrongly labelled "Success" + state auto-detect ambiguity~~ fix: air-tight workflow lifecycle (incomplete-marked-complete + state auto-detect + cascade close + stale sweep) May 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: air-tight workflow lifecycle (incomplete-marked-complete + state auto-detect + cascade close + stale sweep)#31

fix: air-tight workflow lifecycle (incomplete-marked-complete + state auto-detect + cascade close + stale sweep)#31
spencermarx wants to merge 4 commits into
mainfrom
fix/incomplete-workflow-shown-as-success

spencermarx commented May 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

spencermarx commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Audit findings & fixes

What "air-tight" now looks like

Test plan

Commits in this PR

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

spencermarx commented May 10, 2026 •

edited

Loading