feat(workflow): v0.6.0 — accuracy + harness upgrade (11 items) by cskwork · Pull Request #27 · cskwork/symphony-multi-agent

cskwork · 2026-05-17T06:57:49Z

Summary

v0.5.1 -> v0.6.0. Eleven coordinated changes across agent prompts and the orchestrator so Symphony ticket outcomes are predictable and the system is observable. Plan + post-fix verification log: docs/improvements/workflow-v0.5.2.md.

Prompt contracts (agent-visible)

A1 Plan emits ## Acceptance Tests + ## Done Signals; QA scores them via ## AC Scorecard.
A2 Rewinds receive SYMPHONY_REWIND_SCOPE JSON env and scope edits to flagged files (## Scope Expansion for justified spillover).
A3 Explore emits a scored reuse-inventory.md; Plan candidate set is now an explicit Markdown table with reuse_from + observability columns (revised after live demo showed claude falling back to bullets under the original prose-only spec).
B2 Review emits a 7-row ## Security Audit (secrets / input-validation / sql-injection / xss / csrf / authz / rate-limit); any fail auto-promotes to CRITICAL.
B3 Wiki entries record **Observability hooks:** (log / metric / trace) under ## Technical Reference (KO + EN). Plan candidate table gains observability column.

Harness (orchestrator + hooks)

C1 Orchestrator parses ## Touched Files at dispatch and auto-Blocks overlapping candidates with a ## Conflict section. Agent-side pre-check removed. Regex hardened to accept real-world annotations like `path` (new) and `path with spaces` (deleted) (live demo found this gap).
C2 BaseAgentBackend.is_progress_event lifts the meta-event stall filter into one place. Claude + Codex override; Pi + Gemini inherit the conservative default.
C3 Per-state EMA of completion tokens persists to .symphony/token_ema.json; dispatch injects SYMPHONY_TOKEN_EMA + SYMPHONY_TOKEN_BUDGET. The EMA captures the source state at worker_turn_started so token costs attribute to the stage that actually consumed them (live demo found pre-fix EMA was recording under the destination state).
B1 after_run prefixes the wip commit with [no-test] when prod-code diff lacks a paired test. Review scans the marker and promotes each to HIGH. Carve-out includes *.md, LICENSE*, NOTICE, CHANGELOG*, README*, AGENTS.md, GEMINI.md. 10 hermetic shell tests pin the classifier.
C4 200-line after_create heredoc extracted to scripts/symphony-setup-worktree.sh; hook is a one-liner.
C5 New symphony wiki-sweep CLI scans docs/llm-wiki/ for dup / orphan / stale rows. Orchestrator runs it every Nth Done (default wiki.sweep_every_n: 10). Done counter persists to .symphony/done_count.json so cadence survives orchestrator restarts.

Bug fixes

after_run YAML literal block fixed (replaced `IFS=$'\n'` with a POSIX printf trick so the closing quote no longer terminates the block).
C1 path-with-spaces regex hardened (backtick + plain forms, lenient trailing content).
C5 done counter persistence.
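
The hardened C1 parsing can be illustrated with a two-regex sketch. The regex names, the function name, and the exact patterns here are assumptions made for illustration (the commit messages only say the original `_BULLET_PATH_RE` was split into a backtick form and a plain form with lenient trailing content); the real patterns may differ.

```python
import re

# Backtick form: capture everything between the backticks, spaces included,
# and ignore whatever follows, e.g. " (new)" or " (deleted)".
_BACKTICK_PATH_RE = re.compile(r"^\s*[-*]\s+`([^`]+)`")
# Plain form: capture the first whitespace-delimited token after the bullet.
_PLAIN_PATH_RE = re.compile(r"^\s*[-*]\s+([^\s`]+)")


def parse_touched_files(body: str) -> set[str]:
    """Collect bullet paths listed under a '## Touched Files' heading."""
    paths: set[str] = set()
    in_section = False
    for line in body.splitlines():
        if line.startswith("## "):
            in_section = line.strip() == "## Touched Files"
            continue
        if not in_section:
            continue
        match = _BACKTICK_PATH_RE.match(line) or _PLAIN_PATH_RE.match(line)
        if match:
            paths.add(match.group(1))
    return paths


def overlapping_files(candidate_body: str, in_flight_body: str) -> set[str]:
    """Paths both tickets declare; a non-empty result would block the candidate."""
    return parse_touched_files(candidate_body) & parse_touched_files(in_flight_body)
```

Trying the backtick pattern first matters: a backticked path with spaces would otherwise be truncated at the first space by the plain pattern.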

Release hygiene

pyproject.toml + src/symphony/__init__.py bumped to 0.6.0 in lockstep, separate chore(release) commit.
CHANGELOG [0.6.0] entry with PM-readable per-item summaries.

Live demo verification

Two backends (claude + codex) ran the same demo ticket twice — once with the original v0.6.0 bundle, once after addressing the gaps surfaced by run 1.

| Fix | Run 1 (pre-fix) | Run 2 (post-fix) |
| --- | --- | --- |
| C1 regex `(new)` annotation | parser returned `set()` | parses both paths |
| C3 EMA keys | `plan, in progress, review` (destination state) | `explore, plan` (source state) on both backends |
| A3 Plan candidate table | claude=bullets; columns missing | both backends emit full 3-row table |
| B2 Security Audit | claude emitted exact 7-row table; codex sandbox blocked file write | not re-tested (out of mini-run scope) |

Also live-verified: ## Acceptance Tests, ## Done Signals, ## Implementation, reuse-inventory.md, work/feature.md, implementation-plan.md, conflict pre-check parser, no spurious stalls under 5+ minute codex turns, EMA persistence across restarts.

Test plan

pytest -q -> 519 passed, 6 skipped.
YAML parse check on all three WORKFLOW*.md files.
python -m symphony wiki-sweep --root docs/llm-wiki --dry-run exits 0 on clean / missing root.
scripts/symphony-setup-worktree.sh has +x mode bit (100755).
code-reviewer pass: HIGH x 2 + MEDIUM x 1 addressed in fix commit; 3 MEDIUMs filed as follow-ups.
Live demo, claude + codex, run 1 (full Todo->Done arc, max_turns=4) + run 2 (Explore->Plan, max_turns=2).
Manual: run a live ticket on the host repo's main board through one full cycle before tagging the release.

Commits

ba9273c -- docs: plan 11-item workflow accuracy + harness upgrade
61f5a2d -- feat(workflow): accuracy + harness upgrade (v0.6.0)
86ca62b -- chore(release): v0.6.0
8be3d26 -- fix(workflow): address code-review HIGH + MEDIUM items
1e82d70 -- fix(workflow): close gaps surfaced by live demo
aa31f3b -- chore: clarify wiki_sweep imports
123800c -- docs: record post-fix live re-verification + out-of-scope findings

Follow-ups (filed in plan doc, NOT in this PR)

C1 retry-overlap completeness (best-effort today; needs tracker re-read at dispatch).
C3 per-subprocess env= for true isolation under max_concurrent_agents > 1.
C3 Windows rename atomicity on concurrent reader hold.
Codex sandbox blocks symlinked board writes (pre-existing; surfaced during demo).
EMA last-writer-wins under multiple co-located orchestrators sharing .symphony/ (test-harness curiosity).

Canonical brief for feat/workflow-accuracy-and-harness-upgrades that parallel subagents read for scope, contracts, and acceptance criteria. Covers six prompt-contract changes (DoD signals, rewind scope, reuse scoring, TDD marker, security audit, observability hooks) and five harness changes (system-side conflict pre-check, stall predicate abstraction, adaptive token budget, hook script extraction, scheduled wiki sweep).

Eleven coordinated changes across agent prompts and the orchestrator to make ticket outcomes predictable and the system observable. Full contract in docs/improvements/workflow-v0.5.2.md.

Prompts (agent-visible):
- A1 plan.md emits ## Acceptance Tests + ## Done Signals; qa.md scores them via ## AC Scorecard; missing rows fail QA.
- A2 in-progress.md scopes rewinds to $SYMPHONY_REWIND_SCOPE; touching other files appends ## Scope Expansion (non-blocking).
- A3 explore.md makes reuse-inventory.md required (scored table); plan.md gains reuse_from column and demands ## Plan Rationale when high-fit candidates are rejected.
- B2 review.md adds ## Security Audit (exactly 7 rows: secrets, input-validation, sql-injection, xss, csrf, authz, rate-limit); any fail auto-promotes to CRITICAL.
- B3 learn.md wiki template adds **Observability hooks:** under ## Technical Reference (KO + EN); plan.md gains observability column.

Harness (orchestrator + hooks):
- C1 orchestrator parses ## Touched Files at dispatch and refuses overlap; candidate is auto-Blocked with ## Conflict. Agent-side pre-check removed from in-progress.md.
- C3 per-state EMA (alpha 0.3) of completion tokens persists across restarts; dispatch injects SYMPHONY_TOKEN_EMA + SYMPHONY_TOKEN_BUDGET.
- B1 after_run hook prefixes wip subject with [no-test] when prod diff lacks paired test; review.md scans for marker and promotes to HIGH finding.
- C2 BaseAgentBackend exposes is_progress_event(); claude + codex override with type=="assistant" filter; pi/gemini inherit default.
- C4 200-line after_create heredoc extracted to scripts/symphony-setup-worktree.sh; hook is a one-liner.
- C5 new `symphony wiki-sweep` CLI scans docs/llm-wiki/ for duplicates/orphans/stale; orchestrator runs it every Nth Done (default wiki.sweep_every_n=10).

Other:
- after_run YAML literal block fixed: IFS=$'\n' replaced with POSIX printf trick so the closing quote no longer terminates the block.

Tests: 508 passed, 6 skipped.
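
The C2 stall predicate can be pictured as one overridable method consulted by a stall timer. In this sketch only `BaseAgentBackend`, `is_progress_event`, and the `type == "assistant"` filter come from the commit message; the dict-shaped event, the `ExampleClaudeBackend` and `StallTracker` names, and the choice that the default treats every event as progress are assumptions made for illustration.

```python
class BaseAgentBackend:
    def is_progress_event(self, event: dict) -> bool:
        # Assumed conservative default: every event counts as progress,
        # so a backend without an override never stalls on meta-events.
        return True


class ExampleClaudeBackend(BaseAgentBackend):
    """Hypothetical subclass showing the override described for claude + codex."""

    def is_progress_event(self, event: dict) -> bool:
        # Only assistant output resets the stall clock; meta-events do not.
        return event.get("type") == "assistant"


class StallTracker:
    """Hypothetical timer that asks the backend which events count."""

    def __init__(self, backend: BaseAgentBackend, timeout_s: float, started_at: float):
        self.backend = backend
        self.timeout_s = timeout_s
        self.last_progress = started_at

    def observe(self, event: dict, now: float) -> None:
        if self.backend.is_progress_event(event):
            self.last_progress = now

    def is_stalled(self, now: float) -> bool:
        return now - self.last_progress > self.timeout_s
```

Keeping the predicate on the backend means the timer itself needs no per-backend branches, which is the "one place" the description refers to.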

Bump pyproject.toml + __init__.py in lockstep (memory: project_symphony_version_source_of_truth). Adds CHANGELOG entry summarising the eleven workflow-accuracy + harness items shipped on this branch.

- C1: split _BULLET_PATH_RE into backtick + plain regexes so paths with spaces in backticked bullets parse correctly (mac users with spaces in repo paths hit the prior `[^\s\`]+` bug).
- C5: persist `_done_count` to `.symphony/done_count.json` and reload on `start()` so the wiki-sweep cadence survives orchestrator restarts. Without this, every restart reset the counter to 0 and sweeps either bunched up or skipped indefinitely.
- B1: expand the after_run docs carve-out to include `*.md`, `LICENSE*`, `NOTICE`, `CHANGELOG*`, `README*`, `AGENTS.md`, `GEMINI.md` so root documentation edits don't trip the `[no-test]` marker.
- C2: documented the intentional spec deviation in the plan doc: `CodexAppServerBackend.is_progress_event` overrides because the existing test_on_codex_event_extracts_nested_item_preview_* tests prove codex's OTHER_MESSAGE bucket carries both real previews and tool/item notifications; removing the filter regresses those tests.
- Moved wiki_sweep to a top-level module import to keep pyright happy about the deferred symbol.

Plan doc updated with the post-review follow-up list (3 remaining MEDIUMs filed: C1 retry-overlap completeness, C3 env-mutation under concurrency, C3 Windows rename atomicity).

Tests: 508 passed, 6 skipped.
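
The C5 counter fix amounts to "load on start, save on every Done, sweep on every Nth". A minimal sketch under stated assumptions: the `DoneCounter` class, its method names, and the `{"done_count": N}` JSON layout are hypothetical; only the file location and the default cadence of 10 come from the text above.

```python
import json
from pathlib import Path


class DoneCounter:
    """Hypothetical persisted counter that decides when a wiki sweep is due."""

    def __init__(self, path: Path, sweep_every_n: int = 10):
        self.path = Path(path)
        self.sweep_every_n = sweep_every_n
        self.count = self._load()

    def _load(self) -> int:
        # A missing or unreadable file falls back to 0, matching a fresh start.
        try:
            return int(json.loads(self.path.read_text())["done_count"])
        except (FileNotFoundError, KeyError, ValueError, TypeError):
            return 0

    def record_done(self) -> bool:
        """Increment, persist, and report whether a sweep should run now."""
        self.count += 1
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps({"done_count": self.count}))
        return self.sweep_every_n > 0 and self.count % self.sweep_every_n == 0
```

With `sweep_every_n=3`, two Done events, a restart, and a third Done event trigger exactly one sweep on the third event, which is the cadence an in-memory counter would lose.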

Live claude + codex demos (workflow-v0.5.2 §B) exposed three regressions in the v0.6.0 bundle and one under-tested code path:

1. C1 regex rejected ` (new)` / ` (deleted)` / ` (M)` annotations after a backticked path, which is claude's actual output. Lenient match: stop anchoring after the closing backtick / first whitespace token. Reproduce in tests/test_orchestrator_dispatch.py.
2. C3 EMA recorded under the destination state instead of the source. `entry.issue.state` is already the next stage at EVENT_TURN_COMPLETED time because the agent flips `state:` in the ticket body mid-turn. Fix: capture `state_at_turn_start` on each worker_turn_started; read it at update time.
3. A3 plan.md asked for `reuse_from` / `observability` "columns" with no table template. Both backends silently dropped them in favor of bullets. Replace the bullet-prose with an explicit, non-optional Markdown table header.
4. B1 marker had no direct test: claude's strict TDD meant the `[no-test]` branch never fired naturally. Added a hermetic shell harness (tests/test_after_run_classifier.py) that exercises all marker / scope-expand / docs-carve-out combinations through bash.

Tests: 519 passed, 6 skipped (was 508).

Live demo also confirmed (no fix needed):
- B2 ## Security Audit emits exactly 7 rows in spec order (claude).
- A1 ## Acceptance Tests + ## Done Signals emit with What/Why/AsIs/ToBe headers (both backends).
- in-progress.md ## Implementation + docs/<id>/work/feature.md created (both backends).
- A3 reuse-inventory.md emitted under <workspace>/docs/<id>/explore/ (claude verified; codex pending Review turn).
- C2 no spurious stalls under 5+ min codex turns.
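
The real B1 classifier is a bash hook, so the following is only a Python rendering of the rule, meant to show the decision, not the implementation. The carve-out patterns are the ones listed in the review-fix commit; everything else is an assumption: the function names, matching patterns against the file's basename, treating any file under `tests/` or named `test_*` as a test, and reading "paired test" as "at least one test file in the same diff".

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Docs carve-out patterns as listed in the review-fix commit message.
DOC_PATTERNS = [
    "*.md", "LICENSE*", "NOTICE", "CHANGELOG*", "README*", "AGENTS.md", "GEMINI.md",
]


def is_doc(path: str) -> bool:
    # Assumption: patterns are matched against the basename only.
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in DOC_PATTERNS)


def is_test(path: str) -> bool:
    # Assumption: tests live under a tests/ directory or are named test_*.
    p = PurePosixPath(path)
    return "tests" in p.parts or p.name.startswith("test_")


def wip_subject(changed_files: list[str], subject: str = "wip") -> str:
    """Prefix the subject with [no-test] when prod changes arrive without a test."""
    prod = [f for f in changed_files if not is_doc(f) and not is_test(f)]
    has_test = any(is_test(f) for f in changed_files)
    if prod and not has_test:
        return f"[no-test] {subject}"
    return subject
```

Under these assumptions, `wip_subject(["src/app.py"])` yields `[no-test] wip`, while a docs-only diff or a diff that includes a test file keeps the plain subject.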

Use explicit `from .wiki_sweep import sweep` in orchestrator and the `as wiki_sweep` re-export idiom in cli so the symbol relationship is unambiguous to both readers and static checkers. Runtime behaviour unchanged.

Run 2 of the workflow-v0.5.2 demo (claude + codex, max_turns=2) confirmed every fix that landed in 1e82d70 works on both backends:

- C1 regex extracts paths from real `(new)` annotations.
- C3 EMA keys now match the source state (`explore, plan`) on both backends; run 1 had recorded them under the destination state.
- A3 Plan candidate table emits with full `reuse_from` and `observability` columns on both backends.
- B1 marker bash classifier is covered by 10 hermetic shell tests.

Also files two out-of-scope discoveries that surfaced during the demo (both pre-existing): codex sandbox blocking symlinked board writes, and EMA last-writer-wins under multiple co-located orchestrators.

CI Python 3.10 job failed on test_max_total_tokens_cap_cancels_worker because `asyncio.Task.cancelling()` was introduced in 3.11. The test already had a `cancelled()` fallback path; route through `getattr` so the attribute lookup never happens on 3.10. Test still validates the same behaviour (cancel was attempted) on 3.10/3.11/3.12.
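
The `getattr` routing described above can be sketched as a helper that works on 3.10 and later. The helper name `cancel_was_requested` is hypothetical and the actual test may be structured differently; `asyncio.Task.cancelling()` (3.11+) and `asyncio.Task.cancelled()` are the real standard-library calls.

```python
import asyncio


def cancel_was_requested(task: asyncio.Task) -> bool:
    """True if cancellation was requested or has already taken effect."""
    # Task.cancelling() exists only on Python 3.11+; getattr avoids an
    # AttributeError on 3.10 by never touching the attribute directly.
    cancelling = getattr(task, "cancelling", None)
    if cancelling is not None and cancelling() > 0:
        return True
    # Fallback for 3.10: only true once the task has finished as cancelled.
    return task.cancelled()


async def demo() -> bool:
    task = asyncio.create_task(asyncio.sleep(60))
    await asyncio.sleep(0)  # let the task start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return cancel_was_requested(task)
```

On 3.10 the helper can only report a cancel after the task has finished, so a check made immediately after `task.cancel()` is weaker there than on 3.11+.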

cskwork added 8 commits May 17, 2026 14:21


cskwork merged commit 3b86c7f into main May 17, 2026
2 of 3 checks passed

cskwork deleted the feat/workflow-accuracy-and-harness-upgrades branch May 17, 2026 07:03

cskwork merged 8 commits into main from feat/workflow-accuracy-and-harness-upgrades
