feat(workflow): v0.6.0 — accuracy + harness upgrade (11 items)#27

Merged
cskwork merged 8 commits into main from feat/workflow-accuracy-and-harness-upgrades
May 17, 2026

Conversation

@cskwork cskwork commented May 17, 2026

Summary

v0.5.1 -> v0.6.0. Eleven coordinated changes across agent prompts and the orchestrator so Symphony ticket outcomes are predictable and the system is observable. Plan + post-fix verification log: docs/improvements/workflow-v0.5.2.md.

Prompt contracts (agent-visible)

  • A1 Plan emits ## Acceptance Tests + ## Done Signals; QA scores them via ## AC Scorecard.
  • A2 Rewinds receive SYMPHONY_REWIND_SCOPE JSON env and scope edits to flagged files (## Scope Expansion for justified spillover).
  • A3 Explore emits a scored reuse-inventory.md; Plan candidate set is now an explicit Markdown table with reuse_from + observability columns (revised after live demo showed claude falling back to bullets under the original prose-only spec).
  • B2 Review emits a 7-row ## Security Audit (secrets / input-validation / sql-injection / xss / csrf / authz / rate-limit); any fail auto-promotes to CRITICAL.
  • B3 Wiki entries record **Observability hooks:** (log / metric / trace) under ## Technical Reference (KO + EN). Plan candidate table gains observability column.
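The B2 contract (exactly seven rows, any fail promotes to CRITICAL) lends itself to a mechanical check. The following is a minimal sketch, assuming the audit is rendered as a Markdown pipe table whose first column is the check name and whose second column is `pass` or `fail`. The function name `audit_severity`, the table layout, and the `"OK"` return value are illustrative assumptions, not code from this PR.

```python
import re

# The seven checks in spec order, as listed in the B2 contract.
EXPECTED_CHECKS = [
    "secrets", "input-validation", "sql-injection",
    "xss", "csrf", "authz", "rate-limit",
]


def audit_severity(review_md: str) -> str:
    """Return "CRITICAL" if any audit row fails, "OK" if all seven pass.

    Raises ValueError when the section is missing or does not contain
    exactly the seven expected rows in order. The pipe-table layout is
    an assumption made for this sketch.
    """
    # Isolate the section body up to the next level-2 heading (or EOF).
    section = re.search(
        r"^## Security Audit\s*\n(.*?)(?=^## |\Z)", review_md, re.M | re.S
    )
    if section is None:
        raise ValueError("missing ## Security Audit section")

    rows = []
    for line in section.group(1).splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Header and separator rows are skipped because their first cell
        # is not one of the known check names.
        if len(cells) >= 2 and cells[0] in EXPECTED_CHECKS:
            rows.append((cells[0], cells[1].lower()))

    if [name for name, _ in rows] != EXPECTED_CHECKS:
        raise ValueError("expected exactly 7 rows in spec order")
    return "CRITICAL" if any(result == "fail" for _, result in rows) else "OK"
```

A reviewer-side tool could call this on the Review output and refuse to advance the ticket when it raises, which keeps the "exactly 7 rows" rule enforceable rather than advisory.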

Harness (orchestrator + hooks)

  • C1 Orchestrator parses ## Touched Files at dispatch and auto-Blocks overlapping candidates with a ## Conflict section. Agent-side pre-check removed. Regex hardened to accept real-world annotations like `path` (new) and `path with spaces` (deleted) (live demo found this gap).
  • C2 BaseAgentBackend.is_progress_event lifts the meta-event stall filter into one place. Claude + Codex override; Pi + Gemini inherit the conservative default.
  • C3 Per-state EMA of completion tokens persists to .symphony/token_ema.json; dispatch injects SYMPHONY_TOKEN_EMA + SYMPHONY_TOKEN_BUDGET. The EMA captures the source state at worker_turn_started so token costs attribute to the stage that actually consumed them (live demo found pre-fix EMA was recording under the destination state).
  • B1 after_run prefixes the wip commit with [no-test] when prod-code diff lacks a paired test. Review scans the marker and promotes each to HIGH. Carve-out includes *.md, LICENSE*, NOTICE, CHANGELOG*, README*, AGENTS.md, GEMINI.md. 10 hermetic shell tests pin the classifier.
  • C4 200-line after_create heredoc extracted to scripts/symphony-setup-worktree.sh; hook is a one-liner.
  • C5 New symphony wiki-sweep CLI scans docs/llm-wiki/ for dup / orphan / stale rows. Orchestrator runs it every Nth Done (default wiki.sweep_every_n: 10). Done counter persists to .symphony/done_count.json so cadence survives orchestrator restarts.
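The per-state EMA in C3 can be sketched as below. This is an illustration under stated assumptions, not the orchestrator's implementation: the smoothing factor 0.3 comes from the commit message for this change, while the flat `{state: ema}` JSON layout, the seed-on-first-observation rule, and the function name `update_token_ema` are assumptions.

```python
import json
import os
import tempfile
from pathlib import Path

ALPHA = 0.3  # smoothing factor quoted in the commit message


def update_token_ema(path: Path, state: str, completion_tokens: int,
                     alpha: float = ALPHA) -> float:
    """Fold one turn's completion tokens into the EMA for `state` and persist it.

    `state` should be the stage the turn started in (the source state),
    not the stage the ticket ended up in.
    """
    try:
        data = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        data = {}

    previous = data.get(state)
    if previous is None:
        ema = float(completion_tokens)  # first observation seeds the average
    else:
        ema = alpha * completion_tokens + (1 - alpha) * previous
    data[state] = ema

    # Write to a temp file in the same directory, then rename over the target
    # so readers never observe a half-written file.
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(data, handle)
    os.replace(tmp_name, path)
    return ema
```

For example, a first Explore turn of 1000 tokens followed by one of 2000 yields 0.3 × 2000 + 0.7 × 1000 = 1300 under the `explore` key.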

Bug fixes

  • after_run YAML literal block fixed (replaced IFS=$'\n' with POSIX printf trick).
  • C1 path-with-spaces regex hardened (backtick + plain forms, lenient trailing content).
  • C5 done counter persistence.
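The C5 cadence fix amounts to keeping the Done counter on disk and reloading it at startup. A minimal sketch, assuming a JSON file with a single `done_count` key; the class name `DoneCounter` and the file layout are hypothetical and chosen only for illustration.

```python
import json
from pathlib import Path


class DoneCounter:
    """Counts Done transitions and reports when a wiki sweep is due."""

    def __init__(self, path: Path, sweep_every_n: int = 10) -> None:
        self.path = path
        self.sweep_every_n = sweep_every_n
        self.count = self._load()  # survives orchestrator restarts

    def _load(self) -> int:
        try:
            return int(json.loads(self.path.read_text())["done_count"])
        except (FileNotFoundError, KeyError, ValueError, TypeError):
            # Missing or unreadable file: start from zero rather than crash.
            return 0

    def record_done(self) -> bool:
        """Increment, persist, and return True when a sweep should run."""
        self.count += 1
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps({"done_count": self.count}))
        return self.count % self.sweep_every_n == 0
```

With the counter reloaded in the constructor, a restart after the ninth Done still triggers the sweep on the tenth instead of resetting the cadence to zero.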

Release hygiene

  • pyproject.toml + src/symphony/__init__.py bumped to 0.6.0 in lockstep, separate chore(release) commit.
  • CHANGELOG [0.6.0] entry with PM-readable per-item summaries.

Live demo verification

Two backends (claude + codex) ran the same demo ticket twice — once with the original v0.6.0 bundle, once after addressing the gaps surfaced by run 1.

| Fix | Run 1 (pre-fix) | Run 2 (post-fix) |
|---|---|---|
| C1 regex (new) annotation | parser returned set() | parses both paths |
| C3 EMA keys | plan, in progress, review (destination state) | explore, plan (source state) on both backends |
| A3 Plan candidate table | claude=bullets; columns missing | both backends emit full 3-row table |
| B2 Security Audit | claude emitted exact 7-row table; codex sandbox blocked file write | not re-tested (out of mini-run scope) |

Also live-verified: ## Acceptance Tests, ## Done Signals, ## Implementation, reuse-inventory.md, work/feature.md, implementation-plan.md, conflict pre-check parser, no spurious stalls under 5+ minute codex turns, EMA persistence across restarts.

Test plan

  • pytest -q -> 519 passed, 6 skipped.
  • YAML parse check on all three WORKFLOW*.md files.
  • python -m symphony wiki-sweep --root docs/llm-wiki --dry-run exits 0 on clean / missing root.
  • scripts/symphony-setup-worktree.sh has +x mode bit (100755).
  • code-reviewer pass: HIGH x 2 + MEDIUM x 1 addressed in fix commit; 3 MEDIUMs filed as follow-ups.
  • Live demo, claude + codex, run 1 (full Todo->Done arc, max_turns=4) + run 2 (Explore->Plan, max_turns=2).
  • Manual: run a live ticket on the host repo's main board through one full cycle before tagging the release.

Commits

  • ba9273c -- docs: plan 11-item workflow accuracy + harness upgrade
  • 61f5a2d -- feat(workflow): accuracy + harness upgrade (v0.6.0)
  • 86ca62b -- chore(release): v0.6.0
  • 8be3d26 -- fix(workflow): address code-review HIGH + MEDIUM items
  • 1e82d70 -- fix(workflow): close gaps surfaced by live demo
  • aa31f3b -- chore: clarify wiki_sweep imports
  • 123800c -- docs: record post-fix live re-verification + out-of-scope findings

Follow-ups (filed in plan doc, NOT in this PR)

  • C1 retry-overlap completeness (best-effort today; needs tracker re-read at dispatch).
  • C3 per-subprocess env= for true isolation under max_concurrent_agents > 1.
  • C3 Windows rename atomicity on concurrent reader hold.
  • Codex sandbox blocks symlinked board writes (pre-existing; surfaced during demo).
  • EMA last-writer-wins under multiple co-located orchestrators sharing .symphony/ (test-harness curiosity).

cskwork added 8 commits May 17, 2026 14:21

ba9273c -- docs: plan 11-item workflow accuracy + harness upgrade

Canonical brief for feat/workflow-accuracy-and-harness-upgrades that
parallel subagents read for scope, contracts, and acceptance criteria.

Covers six prompt-contract changes (DoD signals, rewind scope, reuse
scoring, TDD marker, security audit, observability hooks) and five
harness changes (system-side conflict pre-check, stall predicate
abstraction, adaptive token budget, hook script extraction, scheduled
wiki sweep).

61f5a2d -- feat(workflow): accuracy + harness upgrade (v0.6.0)

Eleven coordinated changes across agent prompts and the orchestrator
to make ticket outcomes predictable and the system observable. Full
contract in docs/improvements/workflow-v0.5.2.md.

Prompts (agent-visible):
- A1 plan.md emits ## Acceptance Tests + ## Done Signals; qa.md scores
  them via ## AC Scorecard; missing rows fail QA.
- A2 in-progress.md scopes rewinds to $SYMPHONY_REWIND_SCOPE; touching
  other files appends ## Scope Expansion (non-blocking).
- A3 explore.md makes reuse-inventory.md required (scored table);
  plan.md gains reuse_from column and demands ## Plan Rationale when
  high-fit candidates are rejected.
- B2 review.md adds ## Security Audit (exactly 7 rows: secrets,
  input-validation, sql-injection, xss, csrf, authz, rate-limit); any
  fail auto-promotes to CRITICAL.
- B3 learn.md wiki template adds **Observability hooks:** under
  ## Technical Reference (KO + EN); plan.md gains observability column.
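The A2 rewind rule can be illustrated with a small helper that compares the files an agent touched against the scope injected through the environment. This is a sketch only: the source states that `SYMPHONY_REWIND_SCOPE` carries JSON but does not give its shape, so the `{"files": [...]}` layout and the function name `out_of_scope_files` are assumptions.

```python
import json
import os
from typing import Iterable, List, Mapping


def out_of_scope_files(touched: Iterable[str],
                       environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return touched files that fall outside the rewind scope.

    A non-empty result is what would need a ## Scope Expansion note.
    """
    raw = environ.get("SYMPHONY_REWIND_SCOPE")
    if not raw:
        return []  # not a rewind turn: nothing to restrict
    # ASSUMPTION: the payload is an object with a "files" list.
    scope = set(json.loads(raw).get("files", []))
    return sorted(set(touched) - scope)
```

Because spillover is non-blocking in the contract, a caller would log or record the returned paths rather than abort the turn.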

Harness (orchestrator + hooks):
- C1 orchestrator parses ## Touched Files at dispatch and refuses
  overlap; candidate is auto-Blocked with ## Conflict. Agent-side
  pre-check removed from in-progress.md.
- C3 per-state EMA (alpha 0.3) of completion tokens persists across
  restarts; dispatch injects SYMPHONY_TOKEN_EMA + SYMPHONY_TOKEN_BUDGET.
- B1 after_run hook prefixes wip subject with [no-test] when prod
  diff lacks paired test; review.md scans for marker and promotes to
  HIGH finding.
- C2 BaseAgentBackend exposes is_progress_event(); claude + codex
  override with type=="assistant" filter; pi/gemini inherit default.
- C4 200-line after_create heredoc extracted to
  scripts/symphony-setup-worktree.sh; hook is a one-liner.
- C5 new `symphony wiki-sweep` CLI scans docs/llm-wiki/ for
  duplicates/orphans/stale; orchestrator runs it every Nth Done
  (default wiki.sweep_every_n=10).
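The C2 predicate can be pictured as a single overridable method that the stall detector consults. The sketch below is illustrative: `BaseAgentBackend` here is a stand-in for the real class, the default of treating every event as progress is an assumed reading of "conservative", and `ClaudeBackendSketch`, `StallTracker`, and the dict-shaped events are hypothetical.

```python
class BaseAgentBackend:
    """Stand-in for the real base class; only the predicate is modelled."""

    def is_progress_event(self, event: dict) -> bool:
        # ASSUMED default: any event counts as progress, so backends that do
        # not override never trigger a stall while events are flowing.
        return True


class ClaudeBackendSketch(BaseAgentBackend):
    def is_progress_event(self, event: dict) -> bool:
        # Only assistant output counts; meta events do not reset the clock.
        return event.get("type") == "assistant"


class StallTracker:
    """Tracks time since the last event the backend considers progress."""

    def __init__(self, backend: BaseAgentBackend, timeout_s: float, now: float) -> None:
        self.backend = backend
        self.timeout_s = timeout_s
        self.last_progress = now

    def observe(self, event: dict, now: float) -> None:
        if self.backend.is_progress_event(event):
            self.last_progress = now

    def stalled(self, now: float) -> bool:
        return now - self.last_progress > self.timeout_s
```

Keeping the filter on the backend means the orchestrator's stall loop needs no per-backend branches; a new backend only overrides the predicate if its event stream contains noise.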

Other:
- after_run YAML literal block fixed: IFS=$'\n' replaced with POSIX
  printf trick so the closing quote no longer terminates the block.

Tests: 508 passed, 6 skipped.

86ca62b -- chore(release): v0.6.0

Bump pyproject.toml + __init__.py in lockstep (memory:
project_symphony_version_source_of_truth). Adds CHANGELOG entry
summarising the eleven workflow-accuracy + harness items shipped on
this branch.

8be3d26 -- fix(workflow): address code-review HIGH + MEDIUM items

- C1: split _BULLET_PATH_RE into backtick + plain regexes so paths
  with spaces in backticked bullets parse correctly (mac users with
  spaces in repo paths hit the prior `[^\s\`]+` bug).
- C5: persist `_done_count` to `.symphony/done_count.json` and reload
  on `start()` so the wiki-sweep cadence survives orchestrator
  restarts. Without this, every restart reset the counter to 0 and
  sweeps either bunched up or skipped indefinitely.
- B1: expand the after_run docs carve-out to include `*.md`, `LICENSE*`,
  `NOTICE`, `CHANGELOG*`, `README*`, `AGENTS.md`, `GEMINI.md` so root
  documentation edits don't trip the `[no-test]` marker.
- C2: documented the intentional spec deviation in the plan doc —
  `CodexAppServerBackend.is_progress_event` overrides because the
  existing test_on_codex_event_extracts_nested_item_preview_* tests
  prove codex's OTHER_MESSAGE bucket carries both real previews and
  tool/item notifications; removing the filter regresses those tests.
- Moved wiki_sweep to a top-level module import to keep pyright happy
  about the deferred symbol.
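The B1 carve-out in the list above can be modelled in a few lines. The real classifier is a bash hook; this Python sketch only mirrors the decision it describes. The doc patterns are taken from the commit message, while the test-file heuristic (a `tests` path component or a `test_` filename prefix), the "any test file in the diff" reading of "paired test", and the function names are assumptions.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable

# Documentation carve-out patterns listed in the commit message.
DOC_PATTERNS = ["*.md", "LICENSE*", "NOTICE", "CHANGELOG*", "README*",
                "AGENTS.md", "GEMINI.md"]


def is_doc(path: str) -> bool:
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in DOC_PATTERNS)


def is_test(path: str) -> bool:
    # ASSUMED heuristic for what counts as a test file.
    p = PurePosixPath(path)
    return "tests" in p.parts or p.name.startswith("test_")


def wip_subject(subject: str, changed: Iterable[str]) -> str:
    """Prefix the wip commit subject with [no-test] when prod code lacks a test."""
    changed = list(changed)
    prod = [p for p in changed if not is_doc(p) and not is_test(p)]
    has_test = any(is_test(p) for p in changed)
    return f"[no-test] {subject}" if prod and not has_test else subject
```

A docs-only diff yields no production files, so the subject passes through unchanged, which is the behaviour the expanded carve-out is meant to guarantee.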

Plan doc updated with the post-review follow-up list (3 remaining
MEDIUMs filed: C1 retry-overlap completeness, C3 env-mutation under
concurrency, C3 Windows rename atomicity).

Tests: 508 passed, 6 skipped.

1e82d70 -- fix(workflow): close gaps surfaced by live demo

Live claude + codex demos (workflow-v0.5.2 §B) exposed three regressions
in the v0.6.0 bundle and one under-tested code path:

1. C1 regex rejected ` (new)` / ` (deleted)` / ` (M)` annotations after
   a backticked path — claude's actual output. Lenient match: stop
   anchoring after the closing backtick / first whitespace token.
   Reproduce in tests/test_orchestrator_dispatch.py.

2. C3 EMA recorded under the destination state instead of the source.
   `entry.issue.state` is already the next stage at EVENT_TURN_COMPLETED
   time because the agent flips `state:` in the ticket body mid-turn.
   Fix: capture `state_at_turn_start` on each worker_turn_started; read
   it at update time.

3. A3 plan.md asked for `reuse_from` / `observability` "columns" with
   no table template. Both backends silently dropped them in favor of
   bullets. Replace the bullet-prose with an explicit Markdown table
   header — non-optional.

4. B1 marker had no direct test — claude's strict TDD meant the
   `[no-test]` branch never fired naturally. Added a hermetic shell
   harness (tests/test_after_run_classifier.py) that exercises all
   marker / scope-expand / docs-carve-out combinations through bash.
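The lenient C1 match described in item 1 can be sketched as two patterns, one for backticked paths and one for plain paths, neither anchored at end of line. This is an illustration of the idea, not the orchestrator's parser: the regexes, the section-scanning loop, and the name `parse_touched_files` are assumptions.

```python
import re
from typing import Set

# Backticked form: everything between the backticks is the path, so spaces
# are allowed; trailing annotations such as " (new)" are ignored.
_BACKTICK_RE = re.compile(r"^\s*[-*]\s+`([^`]+)`")
# Plain form: the first whitespace-delimited token is the path.
_PLAIN_RE = re.compile(r"^\s*[-*]\s+(\S+)")


def parse_touched_files(ticket_md: str) -> Set[str]:
    """Collect bullet paths listed under the ## Touched Files heading."""
    paths: Set[str] = set()
    in_section = False
    for line in ticket_md.splitlines():
        if line.startswith("## "):
            in_section = line.strip() == "## Touched Files"
            continue
        if not in_section:
            continue
        match = _BACKTICK_RE.match(line) or _PLAIN_RE.match(line)
        if match:
            paths.add(match.group(1))
    return paths
```

An overlap check between a candidate and a running ticket then reduces to a set intersection, `parse_touched_files(a) & parse_touched_files(b)`, with a non-empty result leading to the auto-Block.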

Tests: 519 passed, 6 skipped (was 508).

Live demo also confirmed (no fix needed):
- B2 ## Security Audit emits exactly 7 rows in spec order (claude).
- A1 ## Acceptance Tests + ## Done Signals emit with What/Why/AsIs/ToBe
  headers (both backends).
- in-progress.md ## Implementation + docs/<id>/work/feature.md created
  (both backends).
- A3 reuse-inventory.md emitted under <workspace>/docs/<id>/explore/
  (claude verified; codex pending Review turn).
- C2 no spurious stalls under 5+ min codex turns.

aa31f3b -- chore: clarify wiki_sweep imports

Use explicit `from .wiki_sweep import sweep` in orchestrator and the
`as wiki_sweep` re-export idiom in cli so the symbol relationship is
unambiguous to both readers and static checkers. Runtime behaviour
unchanged.

123800c -- docs: record post-fix live re-verification + out-of-scope findings

Run 2 of the workflow-v0.5.2 demo (claude + codex, max_turns=2)
confirmed every fix that landed in 1e82d70 works on both backends:
- C1 regex extracts paths from real `(new)` annotations.
- C3 EMA keys now match the source state (`explore, plan`) on both
  backends; run 1 had recorded them under the destination state.
- A3 Plan candidate table emits with full `reuse_from` and
  `observability` columns on both backends.
- B1 marker bash classifier is covered by 10 hermetic shell tests.

Also files two out-of-scope discoveries that surfaced during the demo
(both pre-existing): codex sandbox blocking symlinked board writes,
and EMA last-writer-wins under multiple co-located orchestrators.

CI Python 3.10 job failed on test_max_total_tokens_cap_cancels_worker
because `asyncio.Task.cancelling()` was introduced in 3.11. The test
already had a `cancelled()` fallback path; route through `getattr` so
the attribute lookup never happens on 3.10.

Test still validates the same behaviour (cancel was attempted) on
3.10/3.11/3.12.
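The version-tolerant check described above might look like the following. It is a sketch rather than the test's actual code: the helper name `cancel_was_requested` is hypothetical, while `asyncio.Task.cancelling()` (3.11+) and `asyncio.Task.cancelled()` are the standard-library calls the commit message refers to.

```python
import asyncio


def cancel_was_requested(task: "asyncio.Task") -> bool:
    """Report whether cancellation was attempted, on Python 3.10 and later."""
    # Task.cancelling() only exists on 3.11+. Looking it up through getattr
    # means 3.10 never performs the attribute access that raised in CI.
    cancelling = getattr(task, "cancelling", None)
    if cancelling is not None and cancelling() > 0:
        return True
    # Fallback path: a task that finished cancelled reports it here.
    return task.cancelled()
```

On 3.10 the helper can only observe cancellation once the task has actually finished cancelled, so a test using it should await the task before asserting.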
@cskwork cskwork merged commit 3b86c7f into main May 17, 2026
2 of 3 checks passed
@cskwork cskwork deleted the feat/workflow-accuracy-and-harness-upgrades branch May 17, 2026 07:03