Skip to content

fix: air-tight workflow lifecycle (incomplete-marked-complete + state auto-detect + cascade close + stale sweep)#31

Open
spencermarx wants to merge 4 commits into
mainfrom
fix/incomplete-workflow-shown-as-success
Open

fix: air-tight workflow lifecycle (incomplete-marked-complete + state auto-detect + cascade close + stale sweep)#31
spencermarx wants to merge 4 commits into
mainfrom
fix/incomplete-workflow-shown-as-success

Conversation

@spencermarx
Copy link
Copy Markdown
Owner

@spencermarx spencermarx commented May 10, 2026

Summary

A full audit of the review + map state lifecycle. Closes every gap that lets a workflow end up in an incorrect/inconsistent state, not just the original "Success shown when not really complete" symptom.

Audit findings & fixes

Gap Root cause Fix
#1 Outcome ≠ exit code Dashboard derived "Success" from exit_code === 0 alone, ignoring whether the workflow actually closed New deriveCommandOutcome(exit_code, sessions.status); "Incomplete" outcome surfaced in UI
#2 Two parallel auto-detect helpers resolveActiveSession (transition/close) and resolveSessionForCompletion (round/map complete) diverged; fixing one missed the other Collapsed into single resolveSession with --session-id > env-var > latest-active. Refuses on ambiguity. Decision printed to stderr
#3 No phase-progression validation Flat VALID_PHASES set let reviews → complete through; let cross-type phases (e.g. topology on a review workflow) through Workflow-typed phase graphs. validatePhaseTransition enforces legal source→target edges; round-boundary resets explicitly allowed
#4 stateClose not cascading Closing a workflow left running dependent command_executions stranded until the heartbeat sweep noticed Cascade-stamps in-flight deps with exit_code = -4 and a structured note
#5 stateClose not idempotent Second close → duplicate session_closed event No-op + stderr notice on already-closed
#6 Round derived from filesystem stateInit re-open walked rounds/round-N/ on disk to decide next round number — broke if disk drifted Derives from MAX(round_completed.round) + 1 events. stateRoundComplete also bumps sessions.current_round
#7 No workflow_type compat check on re-open stateInit would happily re-open a review session as a map (or vice versa), with disjoint phase graphs Hard error on mismatched workflow_type
#8 stateInit set current_phase = 'context' for all workflows Map workflows started with a review phase, breaking the phase graph immediately Initial phase = map-context for maps
#9 stateSync left phase_number / current_round at 1 for completed backfilled sessions Backfilled rows looked half-complete in the dashboard Reconstruct terminal status from artifacts (final.md → phase 8, latest round number from disk)
#10 No stale-active sessions sweep Stale-active rows accumulated forever, poisoning latest-active auto-detect (the Wrkbelt March stragglers) New sweepStaleSessions — closes status='active' rows with no events in 7d and no in-flight dependents. Wired into dashboard startup + 5-min periodic timer
#11 Liveness sweep only on dashboard startup Long-running dashboards accumulated stranded command_executions rows 5-min periodic timer runs both sweeps inside the running dashboard

What "air-tight" now looks like

Lifecycle invariants now enforced:

Invariant Enforced
I1 Every sessions.status transition follows a legal graph edge ✓ (phase graph + cascade close)
I2 closed ⇒ all dependent command_executions are terminal ✓ (cascade close)
I3 current_phase only takes values from the workflow_type's enum ✓ (workflow-typed phase graphs)
I4 Phase transitions don't skip required phases ✓ (graph edge validation)
I5 current_round derives from round_completed events ✓ (deriveNextRound)
I6 Auto-detect surfaces its choice; refuses on ambiguity ✓ (stderr announce + ambiguity error)
I7 Sweep eventually reclassifies stranded running state without dashboard restart ✓ (5-min periodic timer)
I8 Backfill reconstructs terminal state from artifacts ✓ (stateSync infers phase_number + round)

Test plan

  • Unit: deriveCommandOutcome — 6 branches
  • Unit: resolveActiveSession — env-var disambiguation + ambiguity refusal
  • Unit: stateTransition — legal/illegal/cross-type/round-boundary
  • Unit: stateClose — cascade + idempotency
  • Unit: stateInit — round from events + workflow_type compat
  • Unit: sweepStaleSessions — close past-threshold, preserve fresh, preserve in-flight deps, structured event
  • nx run-many -t test green across cli + dashboard + dashboard-api-e2e packages
  • nx run-many -t e2e --projects=cli-e2e,dashboard-api-e2e green (31/31)
  • Build green (nx run-many -t build)
  • Manual: existing Wrkbelt stale-active stragglers are auto-closed by the sweep on next dashboard restart
  • Manual: cancel a review mid-flight → command row shows "Incomplete" with structured outcome
  • Manual: a misbehaving AI that tries state transition --phase complete from reviews gets a clear error

Commits in this PR

  1. d656834 feat(dashboard): derive command outcome from workflow lifecycle
  2. b1b8204 fix(cli): resolve completion session via dashboard execution UID
  3. b7c3f4e fix(cli): air-tight workflow state lifecycle
  4. 4bf3596 feat(dashboard,cli): sweep stale-active sessions + periodic dashboard timer

🤖 Generated with claude-flow

spencermarx and others added 4 commits May 10, 2026 15:20
When a parent AI process exits 0 mid-workflow (e.g. macOS sleep drops
the streaming connection before the AI ever calls
`ocr state close-session`), the dashboard previously labelled the
command "Success" — its only completion signal was the process exit
code. Users had no way to tell from the dashboard that the workflow
was actually unfinished.

Cross-check the workflow lifecycle when reporting outcome. New pure
helper `deriveCommandOutcome(exit_code, workflow_status)` returns:

  - 'success'    — exit 0 AND (no workflow | workflow.status='closed')
  - 'incomplete' — exit 0 BUT linked workflow still 'active'
  - 'failed'     — non-zero exit code (excluding -2 cancel sentinel)
  - 'cancelled'  — exit code -2

Wire-up:
  - command-runner.finishExecution queries workflow status via a single
    LEFT JOIN and includes outcome in the `command:finished` socket
    event
  - getCommandHistory LEFT JOINs sessions; the route projects outcome
    onto each row so historical data is reclassified retroactively at
    read time (no schema migration, no backfill)
  - Client provider, history list, and workflow-output badge prefer
    `outcome` from server, falling back to legacy exit-code mapping
    for older sockets / unhydrated rows
  - New 'Incomplete' StatusFilter chip (amber, distinct from green
    Success and red Failed); badge tooltip explains the likely cause
    and points to "Resume in terminal"

Single source of truth means client + server can never disagree on
labelling.

Co-Authored-By: claude-flow <ruv@ruv.net>
`ocr state round-complete` and `ocr state close-session` previously
used a "latest-active" heuristic when no `--session-id` was given —
'SELECT * FROM sessions WHERE status = active ORDER BY started_at
DESC LIMIT 1'. With multiple stale-active rows in the DB (which the
incomplete-workflow bug itself produces over time), this picked the
wrong session and wrote round-meta.json into an unrelated session
directory.

The dashboard already sets OCR_DASHBOARD_EXECUTION_UID when it spawns
the AI, and the SessionCaptureService binds that uid to
command_executions.workflow_id. Teach `resolveSessionForCompletion`
to follow that linkage before falling back to latest-active:

  1. explicit --session-id (most specific)
  2. process.env.OCR_DASHBOARD_EXECUTION_UID → command_executions.workflow_id
  3. getLatestActiveSession (fine for direct CLI use)

A dashboard-spawned AI now always knows its own workflow regardless
of how many other active rows exist. Direct CLI users see no behavior
change.

Co-Authored-By: claude-flow <ruv@ruv.net>
Audited every state-mutation surface (state init/transition/close/
round-complete/map-complete/sync) for the failure modes that caused
the "incomplete session wrongly marked complete" and "wrong session
got closed" bugs. Closes the structural gaps that PR #31's outcome
derivation alone couldn't fix:

  1. Single session resolver. Two parallel helpers
     (resolveActiveSession + resolveSessionForCompletion) diverged —
     fixing one missed the other. Collapsed into `resolveSession`,
     used by every CLI subcommand that takes an optional --session-id.
     Resolution order: explicit --session-id → OCR_DASHBOARD_EXECUTION_UID
     → latest-active. Refuses with a hard ambiguity error when >1
     active sessions exist and no env var is set, rather than silently
     picking one. Auto-detect decisions are now printed to stderr
     ("Auto-detected session: X (via latest-active)") so the user
     sees which session a command will affect.

  2. Phase-progression graph. Replaced the flat VALID_PHASES set
     (12 phases mixing review + map) with two workflow-typed graphs.
     stateTransition validates source→target legality: a review
     workflow can no longer transition to map phases, and the AI can
     no longer skip from `reviews` straight to `complete`. Round/map-run
     boundaries are treated as a permitted reset to the initial phase.

  3. Cascade stateClose + idempotency. Closing a workflow now stamps
     any in-flight dependent command_executions with exit_code=-4
     and a "closed by parent workflow close" note. Idempotent: closing
     an already-closed session no-ops with a notice rather than
     writing a duplicate session_closed event.

  4. Round derivation from events. stateInit's re-open path used to
     walk the filesystem (rounds/round-N/final.md presence) to decide
     the next round. Now derives from MAX(round_completed.round) + 1
     — events are authoritative, filesystem is observational.
     stateRoundComplete also advances sessions.current_round so the
     column stays in sync.

  5. Workflow_type compat on re-open. stateInit refuses to re-open a
     review session as a map (and vice versa) — disjoint phase graphs
     would corrupt state immediately.

  6. Workflow-typed initial phase. stateInit now sets
     current_phase='map-context' for map workflows (was always
     'context'). Without this, every subsequent map transition fails
     the new phase-graph check.

Tests: stateTransition phase-graph (legal/illegal/cross-type/round-
boundary), stateClose cascade + idempotency, stateInit round-from-
events + type-mismatch, resolveActiveSession ambiguity refusal +
env-var disambiguation. Pre-existing progress-sqlite test helper
updated to walk legal phase edges.

Co-Authored-By: claude-flow <ruv@ruv.net>
… timer

`sessions.status = 'active'` rows previously had no automated cleanup —
sessions that initialised but never reached close-session accumulated
forever, poisoning latest-active auto-detect (the root cause of the
"wrong session got closed" failure mode). The Wrkbelt DB showed 5
March stragglers from this exact path.

Adds `sweepStaleSessions(db, thresholdSeconds)` — closes any
status='active' row whose most recent orchestration_event is older
than the threshold AND has no in-flight dependent command_executions.
Writes a `session_auto_closed_stale` event recording the reason.

Wires the new sweep alongside the existing agent-session liveness
sweep:

  - Dashboard startup: both sweeps run before the API routes register
  - Periodic: 5-minute timer inside the running dashboard runs both
    sweeps so long-running dashboards don't accumulate stranded rows
    (the previous design only swept on startup)

Threshold for stale sessions: 7 days. Long enough that an in-progress
review can sit overnight without triggering, short enough that
abandoned/crashed initialisations get cleaned up within a week.

Tests cover: closes past-threshold sessions, leaves recent sessions
alone, preserves stale-active sessions with in-flight dependents,
writes the auto_closed_stale event with the threshold.

Co-Authored-By: claude-flow <ruv@ruv.net>
@spencermarx spencermarx changed the title fix: incomplete workflows wrongly labelled "Success" + state auto-detect ambiguity fix: air-tight workflow lifecycle (incomplete-marked-complete + state auto-detect + cascade close + stale sweep) May 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant