fix: air-tight workflow lifecycle (incomplete-marked-complete + state auto-detect + cascade close + stale sweep)#31
Open
spencermarx wants to merge 4 commits into
Open
Conversation
When a parent AI process exits 0 mid-workflow (e.g. macOS sleep drops
the streaming connection before the AI ever calls
`ocr state close-session`), the dashboard previously labelled the
command "Success" — its only completion signal was the process exit
code. Users had no way to tell from the dashboard that the workflow
was actually unfinished.
Cross-check the workflow lifecycle when reporting outcome. New pure
helper `deriveCommandOutcome(exit_code, workflow_status)` returns:
- 'success' — exit 0 AND (no workflow | workflow.status='closed')
- 'incomplete' — exit 0 BUT linked workflow still 'active'
- 'failed' — non-zero exit code (excluding -2 cancel sentinel)
- 'cancelled' — exit code -2
Wire-up:
- command-runner.finishExecution queries workflow status via a single
LEFT JOIN and includes outcome in the `command:finished` socket
event
- getCommandHistory LEFT JOINs sessions; the route projects outcome
onto each row so historical data is reclassified retroactively at
read time (no schema migration, no backfill)
- Client provider, history list, and workflow-output badge prefer
`outcome` from server, falling back to legacy exit-code mapping
for older sockets / unhydrated rows
- New 'Incomplete' StatusFilter chip (amber, distinct from green
Success and red Failed); badge tooltip explains the likely cause
and points to "Resume in terminal"
Single source of truth means client + server can never disagree on
labelling.
Co-Authored-By: claude-flow <ruv@ruv.net>
`ocr state round-complete` and `ocr state close-session` previously used a "latest-active" heuristic when no `--session-id` was given — 'SELECT * FROM sessions WHERE status = active ORDER BY started_at DESC LIMIT 1'. With multiple stale-active rows in the DB (which the incomplete-workflow bug itself produces over time), this picked the wrong session and wrote round-meta.json into an unrelated session directory. The dashboard already sets OCR_DASHBOARD_EXECUTION_UID when it spawns the AI, and the SessionCaptureService binds that uid to command_executions.workflow_id. Teach `resolveSessionForCompletion` to follow that linkage before falling back to latest-active: 1. explicit --session-id (most specific) 2. process.env.OCR_DASHBOARD_EXECUTION_UID → command_executions.workflow_id 3. getLatestActiveSession (fine for direct CLI use) A dashboard-spawned AI now always knows its own workflow regardless of how many other active rows exist. Direct CLI users see no behavior change. Co-Authored-By: claude-flow <ruv@ruv.net>
Audited every state-mutation surface (state init/transition/close/ round-complete/map-complete/sync) for the failure modes that caused the "incomplete session wrongly marked complete" and "wrong session got closed" bugs. Closes the structural gaps that PR #31's outcome derivation alone couldn't fix: 1. Single session resolver. Two parallel helpers (resolveActiveSession + resolveSessionForCompletion) diverged — fixing one missed the other. Collapsed into `resolveSession`, used by every CLI subcommand that takes an optional --session-id. Resolution order: explicit --session-id → OCR_DASHBOARD_EXECUTION_UID → latest-active. Refuses with a hard ambiguity error when >1 active sessions exist and no env var is set, rather than silently picking one. Auto-detect decisions are now printed to stderr ("Auto-detected session: X (via latest-active)") so the user sees which session a command will affect. 2. Phase-progression graph. Replaced the flat VALID_PHASES set (12 phases mixing review + map) with two workflow-typed graphs. stateTransition validates source→target legality: a review workflow can no longer transition to map phases, and the AI can no longer skip from `reviews` straight to `complete`. Round/map-run boundaries are treated as a permitted reset to the initial phase. 3. Cascade stateClose + idempotency. Closing a workflow now stamps any in-flight dependent command_executions with exit_code=-4 and a "closed by parent workflow close" note. Idempotent: closing an already-closed session no-ops with a notice rather than writing a duplicate session_closed event. 4. Round derivation from events. stateInit's re-open path used to walk the filesystem (rounds/round-N/final.md presence) to decide the next round. Now derives from MAX(round_completed.round) + 1 — events are authoritative, filesystem is observational. stateRoundComplete also advances sessions.current_round so the column stays in sync. 5. Workflow_type compat on re-open. stateInit refuses to re-open a review session as a map (and vice versa) — disjoint phase graphs would corrupt state immediately. 6. Workflow-typed initial phase. stateInit now sets current_phase='map-context' for map workflows (was always 'context'). Without this, every subsequent map transition fails the new phase-graph check. Tests: stateTransition phase-graph (legal/illegal/cross-type/round- boundary), stateClose cascade + idempotency, stateInit round-from- events + type-mismatch, resolveActiveSession ambiguity refusal + env-var disambiguation. Pre-existing progress-sqlite test helper updated to walk legal phase edges. Co-Authored-By: claude-flow <ruv@ruv.net>
… timer
`sessions.status = 'active'` rows previously had no automated cleanup —
sessions that initialised but never reached close-session accumulated
forever, poisoning latest-active auto-detect (the root cause of the
"wrong session got closed" failure mode). The Wrkbelt DB showed 5
March stragglers from this exact path.
Adds `sweepStaleSessions(db, thresholdSeconds)` — closes any
status='active' row whose most recent orchestration_event is older
than the threshold AND has no in-flight dependent command_executions.
Writes a `session_auto_closed_stale` event recording the reason.
Wires the new sweep alongside the existing agent-session liveness
sweep:
- Dashboard startup: both sweeps run before the API routes register
- Periodic: 5-minute timer inside the running dashboard runs both
sweeps so long-running dashboards don't accumulate stranded rows
(the previous design only swept on startup)
Threshold for stale sessions: 7 days. Long enough that an in-progress
review can sit overnight without triggering, short enough that
abandoned/crashed initialisations get cleaned up within a week.
Tests cover: closes past-threshold sessions, leaves recent sessions
alone, preserves stale-active sessions with in-flight dependents,
writes the auto_closed_stale event with the threshold.
Co-Authored-By: claude-flow <ruv@ruv.net>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A full audit of the review + map state lifecycle. Closes every gap that lets a workflow end up in an incorrect/inconsistent state, not just the original "Success shown when not really complete" symptom.
Audit findings & fixes
exit_code === 0alone, ignoring whether the workflow actually closedderiveCommandOutcome(exit_code, sessions.status); "Incomplete" outcome surfaced in UIresolveActiveSession(transition/close) andresolveSessionForCompletion(round/map complete) diverged; fixing one missed the otherresolveSessionwith--session-id> env-var > latest-active. Refuses on ambiguity. Decision printed to stderrVALID_PHASESset letreviews → completethrough; let cross-type phases (e.g.topologyon a review workflow) throughvalidatePhaseTransitionenforces legal source→target edges; round-boundary resets explicitly allowedstateClosenot cascadingcommand_executionsstranded until the heartbeat sweep noticedexit_code = -4and a structured notestateClosenot idempotentsession_closedeventstateInitre-open walkedrounds/round-N/on disk to decide next round number — broke if disk driftedMAX(round_completed.round) + 1events.stateRoundCompletealso bumpssessions.current_roundworkflow_typecompat check on re-openstateInitwould happily re-open a review session as a map (or vice versa), with disjoint phase graphsstateInitsetcurrent_phase = 'context'for all workflowsmap-contextfor mapsstateSyncleftphase_number/current_roundat 1 for completed backfilled sessionssessionssweepsweepStaleSessions— closes status='active' rows with no events in 7d and no in-flight dependents. Wired into dashboard startup + 5-min periodic timercommand_executionsrowsWhat "air-tight" now looks like
Lifecycle invariants now enforced:
sessions.statustransition follows a legal graph edgeclosed⇒ all dependentcommand_executionsare terminalcurrent_phaseonly takes values from the workflow_type's enumcurrent_roundderives fromround_completedeventsrunningstate without dashboard restartTest plan
deriveCommandOutcome— 6 branchesresolveActiveSession— env-var disambiguation + ambiguity refusalstateTransition— legal/illegal/cross-type/round-boundarystateClose— cascade + idempotencystateInit— round from events + workflow_type compatsweepStaleSessions— close past-threshold, preserve fresh, preserve in-flight deps, structured eventnx run-many -t testgreen across cli + dashboard + dashboard-api-e2e packagesnx run-many -t e2e --projects=cli-e2e,dashboard-api-e2egreen (31/31)nx run-many -t build)state transition --phase completefromreviewsgets a clear errorCommits in this PR
d656834feat(dashboard): derive command outcome from workflow lifecycleb1b8204fix(cli): resolve completion session via dashboard execution UIDb7c3f4efix(cli): air-tight workflow state lifecycle4bf3596feat(dashboard,cli): sweep stale-active sessions + periodic dashboard timer🤖 Generated with claude-flow