feat(workflow): v0.6.0 — accuracy + harness upgrade (11 items)#27
Merged
Canonical brief for feat/workflow-accuracy-and-harness-upgrades that parallel subagents read for scope, contracts, and acceptance criteria. Covers six prompt-contract changes (DoD signals, rewind scope, reuse scoring, TDD marker, security audit, observability hooks) and five harness changes (system-side conflict pre-check, stall predicate abstraction, adaptive token budget, hook script extraction, scheduled wiki sweep).
Eleven coordinated changes across agent prompts and the orchestrator to make ticket outcomes predictable and the system observable. Full contract in docs/improvements/workflow-v0.5.2.md.

Prompts (agent-visible):
- A1 plan.md emits ## Acceptance Tests + ## Done Signals; qa.md scores them via ## AC Scorecard; missing rows fail QA.
- A2 in-progress.md scopes rewinds to $SYMPHONY_REWIND_SCOPE; touching other files appends ## Scope Expansion (non-blocking).
- A3 explore.md makes reuse-inventory.md required (scored table); plan.md gains reuse_from column and demands ## Plan Rationale when high-fit candidates are rejected.
- B2 review.md adds ## Security Audit (exactly 7 rows: secrets, input-validation, sql-injection, xss, csrf, authz, rate-limit); any fail auto-promotes to CRITICAL.
- B3 learn.md wiki template adds **Observability hooks:** under ## Technical Reference (KO + EN); plan.md gains observability column.

Harness (orchestrator + hooks):
- C1 orchestrator parses ## Touched Files at dispatch and refuses overlap; candidate is auto-Blocked with ## Conflict. Agent-side pre-check removed from in-progress.md.
- C3 per-state EMA (alpha 0.3) of completion tokens persists across restarts; dispatch injects SYMPHONY_TOKEN_EMA + SYMPHONY_TOKEN_BUDGET.
- B1 after_run hook prefixes wip subject with [no-test] when prod diff lacks paired test; review.md scans for marker and promotes to HIGH finding.
- C2 BaseAgentBackend exposes is_progress_event(); claude + codex override with type=="assistant" filter; pi/gemini inherit default.
- C4 200-line after_create heredoc extracted to scripts/symphony-setup-worktree.sh; hook is a one-liner.
- C5 new `symphony wiki-sweep` CLI scans docs/llm-wiki/ for duplicates/orphans/stale; orchestrator runs it every Nth Done (default wiki.sweep_every_n=10).

Other:
- after_run YAML literal block fixed: IFS=$'\n' replaced with POSIX printf trick so the closing quote no longer terminates the block.

Tests: 508 passed, 6 skipped.
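To make the C3 item concrete, here is a minimal sketch of a per-state exponential moving average with alpha 0.3 that survives restarts. The function names, the JSON layout, and the choice to seed the average with the first sample are assumptions for illustration; only the alpha value and the idea of per-state persistence come from the commit message.

```python
import json
import os
import tempfile
from pathlib import Path

ALPHA = 0.3  # smoothing factor named in the commit message


def update_ema(ema_by_state, state, completion_tokens, alpha=ALPHA):
    """Fold one turn's completion-token count into the EMA for `state`."""
    previous = ema_by_state.get(state)
    if previous is None:
        # Assumption: the first observation seeds the average.
        ema_by_state[state] = float(completion_tokens)
    else:
        ema_by_state[state] = alpha * completion_tokens + (1 - alpha) * previous
    return ema_by_state[state]


def save_ema(path, ema_by_state):
    """Write to a temp file, then rename it over the target."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(ema_by_state, handle)
    os.replace(tmp_name, path)


def load_ema(path):
    """Return the persisted mapping, or an empty one if absent or corrupt."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
```

With alpha 0.3, a state that averaged 1000 tokens and then spends 2000 moves to 0.3 * 2000 + 0.7 * 1000 = 1300, so a single expensive turn shifts the estimate without dominating it.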
Bump pyproject.toml + __init__.py in lockstep (memory: project_symphony_version_source_of_truth). Adds CHANGELOG entry summarising the eleven workflow-accuracy + harness items shipped on this branch.
- C1: split _BULLET_PATH_RE into backtick + plain regexes so paths with spaces in backticked bullets parse correctly (mac users with spaces in repo paths hit the prior `[^\s\`]+` bug).
- C5: persist `_done_count` to `.symphony/done_count.json` and reload on `start()` so the wiki-sweep cadence survives orchestrator restarts. Without this, every restart reset the counter to 0 and sweeps either bunched up or skipped indefinitely.
- B1: expand the after_run docs carve-out to include `*.md`, `LICENSE*`, `NOTICE`, `CHANGELOG*`, `README*`, `AGENTS.md`, `GEMINI.md` so root documentation edits don't trip the `[no-test]` marker.
- C2: documented the intentional spec deviation in the plan doc: `CodexAppServerBackend.is_progress_event` overrides because the existing test_on_codex_event_extracts_nested_item_preview_* tests prove codex's OTHER_MESSAGE bucket carries both real previews and tool/item notifications; removing the filter regresses those tests.
- Moved wiki_sweep to a top-level module import to keep pyright happy about the deferred symbol.

Plan doc updated with the post-review follow-up list (3 remaining MEDIUMs filed: C1 retry-overlap completeness, C3 env-mutation under concurrency, C3 Windows rename atomicity).

Tests: 508 passed, 6 skipped.
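The C5 fix can be pictured with a small counter that reloads its value on construction and writes it back after every increment. This is a sketch under assumptions: the `DoneCounter` class, its method names, and the `{"done_count": N}` JSON shape are hypothetical, and only the file location and the every-Nth cadence come from the commit message.

```python
import json
from pathlib import Path


class DoneCounter:
    """Hypothetical counter that persists the number of Done tickets."""

    def __init__(self, path, sweep_every_n=10):
        self.path = Path(path)
        self.sweep_every_n = sweep_every_n
        self.count = self._load()  # reload so restarts keep the cadence

    def _load(self):
        try:
            data = json.loads(self.path.read_text(encoding="utf-8"))
            return int(data["done_count"])
        except (FileNotFoundError, ValueError, KeyError, TypeError):
            return 0  # missing or unreadable file: start from zero

    def record_done(self):
        """Increment, persist, and report whether a sweep is due."""
        self.count += 1
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(
            json.dumps({"done_count": self.count}), encoding="utf-8"
        )
        return self.sweep_every_n > 0 and self.count % self.sweep_every_n == 0
```

Because the count is reloaded rather than reset, a restart between the second and third Done with `sweep_every_n=3` still triggers the sweep on the third.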
Live claude + codex demos (workflow-v0.5.2 §B) exposed three regressions in the v0.6.0 bundle and one under-tested code path:

1. C1 regex rejected ` (new)` / ` (deleted)` / ` (M)` annotations after a backticked path (claude's actual output). Lenient match: stop anchoring after the closing backtick / first whitespace token. Reproduced in tests/test_orchestrator_dispatch.py.
2. C3 EMA recorded under the destination state instead of the source. `entry.issue.state` is already the next stage at EVENT_TURN_COMPLETED time because the agent flips `state:` in the ticket body mid-turn. Fix: capture `state_at_turn_start` on each worker_turn_started; read it at update time.
3. A3 plan.md asked for `reuse_from` / `observability` "columns" with no table template. Both backends silently dropped them in favor of bullets. Replace the bullet-prose with an explicit, non-optional Markdown table header.
4. B1 marker had no direct test: claude's strict TDD meant the `[no-test]` branch never fired naturally. Added a hermetic shell harness (tests/test_after_run_classifier.py) that exercises all marker / scope-expand / docs-carve-out combinations through bash.

Tests: 519 passed, 6 skipped (was 508).

Live demo also confirmed (no fix needed):
- B2 ## Security Audit emits exactly 7 rows in spec order (claude).
- A1 ## Acceptance Tests + ## Done Signals emit with What/Why/AsIs/ToBe headers (both backends).
- in-progress.md ## Implementation + docs/<id>/work/feature.md created (both backends).
- A3 reuse-inventory.md emitted under <workspace>/docs/<id>/explore/ (claude verified; codex pending Review turn).
- C2 no spurious stalls under 5+ min codex turns.
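The lenient C1 match from item 1 can be sketched as two unanchored regexes, one for backticked bullets and one for plain bullets, followed by a set intersection for the overlap check. The regex names, `parse_touched_files`, and `conflicting_paths` are hypothetical; the real patterns in the orchestrator may differ.

```python
import re

# Hypothetical patterns: neither anchors at end of line, so trailing
# annotations such as " (new)" or " (deleted)" are simply ignored.
_BACKTICK_PATH_RE = re.compile(r"^\s*[-*]\s+`([^`]+)`")
_PLAIN_PATH_RE = re.compile(r"^\s*[-*]\s+([^\s`]+)")


def parse_touched_files(section_text):
    """Collect bullet paths; backticked paths may contain spaces."""
    paths = set()
    for line in section_text.splitlines():
        match = _BACKTICK_PATH_RE.match(line) or _PLAIN_PATH_RE.match(line)
        if match:
            paths.add(match.group(1))
    return paths


def conflicting_paths(candidate_section, running_section):
    """Return the paths that both tickets declare they will touch."""
    return parse_touched_files(candidate_section) & parse_touched_files(
        running_section
    )
```

For example, a candidate listing `` - `src/app.py` (new) `` overlaps with a running ticket listing `- src/app.py (M)`, because both lines reduce to `src/app.py`.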
Use explicit `from .wiki_sweep import sweep` in orchestrator and the `as wiki_sweep` re-export idiom in cli so the symbol relationship is unambiguous to both readers and static checkers. Runtime behaviour unchanged.
Run 2 of the workflow-v0.5.2 demo (claude + codex, max_turns=2) confirmed every fix that landed in 1e82d70 works on both backends:

- C1 regex extracts paths from real `(new)` annotations.
- C3 EMA keys now match the source state (`explore, plan`) on both backends; run 1 had recorded them under the destination state.
- A3 Plan candidate table emits with full `reuse_from` and `observability` columns on both backends.
- B1 marker bash classifier is covered by 10 hermetic shell tests.

Also files two out-of-scope discoveries that surfaced during the demo (both pre-existing): codex sandbox blocking symlinked board writes, and EMA last-writer-wins under multiple co-located orchestrators.
CI Python 3.10 job failed on test_max_total_tokens_cap_cancels_worker because `asyncio.Task.cancelling()` was introduced in 3.11. The test already had a `cancelled()` fallback path; route through `getattr` so the attribute lookup never happens on 3.10. Test still validates the same behaviour (cancel was attempted) on 3.10/3.11/3.12.
Summary
v0.5.1 -> v0.6.0. Eleven coordinated changes across agent prompts and the orchestrator so Symphony ticket outcomes are predictable and the system is observable. Plan + post-fix verification log: `docs/improvements/workflow-v0.5.2.md`.

Prompt contracts (agent-visible)
- A1: plan.md emits `## Acceptance Tests` + `## Done Signals`; QA scores them via `## AC Scorecard`.
- A2: in-progress.md reads the `SYMPHONY_REWIND_SCOPE` JSON env and scopes edits to flagged files (`## Scope Expansion` for justified spillover).
- A3: explore.md requires `reuse-inventory.md`; Plan candidate set is now an explicit Markdown table with `reuse_from` + `observability` columns (revised after live demo showed claude falling back to bullets under the original prose-only spec).
- B2: review.md adds `## Security Audit` (secrets / input-validation / sql-injection / xss / csrf / authz / rate-limit); any `fail` auto-promotes to CRITICAL.
- B3: learn.md wiki template adds `**Observability hooks:**` (log / metric / trace) under `## Technical Reference` (KO + EN). Plan candidate table gains `observability` column.

Harness (orchestrator + hooks)
- C1: orchestrator parses `## Touched Files` at dispatch and auto-Blocks overlapping candidates with a `## Conflict` section. Agent-side pre-check removed. Regex hardened to accept real-world annotations like `` `path` (new) `` and `` `path with spaces` (deleted) `` (live demo found this gap).
- C2: `BaseAgentBackend.is_progress_event` lifts the meta-event stall filter into one place. Claude + Codex override; Pi + Gemini inherit the conservative default.
- C3: per-state EMA of completion tokens persists to `.symphony/token_ema.json`; dispatch injects `SYMPHONY_TOKEN_EMA` + `SYMPHONY_TOKEN_BUDGET`. The EMA captures the source state at `worker_turn_started` so token costs attribute to the stage that actually consumed them (live demo found pre-fix EMA was recording under the destination state).
- B1: `after_run` prefixes the wip commit with `[no-test]` when prod-code diff lacks a paired test. Review scans the marker and promotes each to HIGH. Carve-out includes `*.md`, `LICENSE*`, `NOTICE`, `CHANGELOG*`, `README*`, `AGENTS.md`, `GEMINI.md`. 10 hermetic shell tests pin the classifier.
- C4: `after_create` heredoc extracted to `scripts/symphony-setup-worktree.sh`; hook is a one-liner.
- C5: `symphony wiki-sweep` CLI scans `docs/llm-wiki/` for dup / orphan / stale rows. Orchestrator runs it every Nth `Done` (default `wiki.sweep_every_n: 10`). Done counter persists to `.symphony/done_count.json` so cadence survives orchestrator restarts.

Bug fixes
- `after_run` YAML literal block fixed (replaced `IFS=$'\n'` with POSIX `printf` trick).

Release hygiene
- `pyproject.toml` + `src/symphony/__init__.py` bumped to `0.6.0` in lockstep, separate `chore(release)` commit.
- CHANGELOG: `[0.6.0]` entry with PM-readable per-item summaries.

Live demo verification
Two backends (claude + codex) ran the same demo ticket twice: once with the original v0.6.0 bundle, and once after addressing the gaps surfaced by run 1.
- C1 `(new)` annotation: run 1 extracted `set()`.
- C3 EMA keys: run 1 `plan, in progress, review` (destination state); run 2 `explore, plan` (source state) on both backends.

Also live-verified:
`## Acceptance Tests`, `## Done Signals`, `## Implementation`, `reuse-inventory.md`, `work/feature.md`, `implementation-plan.md`, conflict pre-check parser, no spurious stalls under 5+ minute codex turns, EMA persistence across restarts.

Test plan
- `pytest -q` -> 519 passed, 6 skipped.
- `WORKFLOW*.md` files.
- `python -m symphony wiki-sweep --root docs/llm-wiki --dry-run` exits 0 on clean / missing root.
- `scripts/symphony-setup-worktree.sh` has `+x` mode bit (100755).

Commits
- `ba9273c` -- docs: plan 11-item workflow accuracy + harness upgrade
- `61f5a2d` -- feat(workflow): accuracy + harness upgrade (v0.6.0)
- `86ca62b` -- chore(release): v0.6.0
- `8be3d26` -- fix(workflow): address code-review HIGH + MEDIUM items
- `1e82d70` -- fix(workflow): close gaps surfaced by live demo
- `aa31f3b` -- chore: clarify wiki_sweep imports
- `123800c` -- docs: record post-fix live re-verification + out-of-scope findings

Follow-ups (filed in plan doc, NOT in this PR)
- `env=` for true isolation under `max_concurrent_agents > 1`.
- `.symphony/` (test-harness curiosity).
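The `env=` follow-up can be illustrated with a sketch that hands each worker its own environment mapping instead of mutating the shared `os.environ`. This assumes workers are launched as subprocesses; `run_worker` is a hypothetical helper and not part of Symphony.

```python
import os
import subprocess
import sys


def run_worker(extra_env, argv):
    """Launch a worker with a private copy of the environment."""
    env = dict(os.environ)  # copy, so concurrent workers never share state
    env.update(extra_env)
    return subprocess.run(
        argv, env=env, capture_output=True, text=True, check=True
    )


if __name__ == "__main__":
    result = run_worker(
        {"SYMPHONY_TOKEN_BUDGET": "4000"},
        [sys.executable, "-c",
         "import os; print(os.environ['SYMPHONY_TOKEN_BUDGET'])"],
    )
    print(result.stdout.strip())
```

Since the parent's `os.environ` is never written to, two workers dispatched at the same time cannot overwrite each other's `SYMPHONY_TOKEN_EMA` or `SYMPHONY_TOKEN_BUDGET` values.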