Skip to content

Coding agent#163

Open
v1r3n wants to merge 131 commits into
mainfrom
coding_agent
Open

Coding agent#163
v1r3n wants to merge 131 commits into
mainfrom
coding_agent

Conversation

@v1r3n

@v1r3n v1r3n commented Apr 26, 2026

Copy link
Copy Markdown
Contributor

No description provided.

v1r3n added 30 commits April 23, 2026 20:51
Multi-agent coding agent that takes a GitHub issue number, analyzes the
codebase, implements a fix with tests, and creates a PR autonomously.

Architecture: pipeline-wrapped swarm (Issue Analyst >> SWARM >> PR Creator)
with Tech Lead (Opus), Coder (Sonnet), DG Code Reviewer (skill agent),
and QA Lead (Sonnet). Includes contextbook for durable team memory,
stateful workers, idempotency via issue number, and full e2e test gate.

All project-specific values are configurable constants for reuse.
3-chunk plan: tools (21 @tool functions), agent assembly (6 agents +
instructions + pipeline), and verification. 8 tasks, ~30 steps.
Files: _issue_fixer_tools.py, _issue_fixer_instructions.py,
100_issue_fixer_agent.py.
Six detailed prompt strings: Issue Analyst, Tech Lead, Coder,
DG Reviewer, QA Lead, PR Creator. Parameterized with {repo},
{branch_prefix}, {max_review_cycles}, {max_e2e_retries}.
serve() re-registers workers in the default domain (no domain), but
start() already registered them under the execution's unique domain UUID.
The domain mismatch caused stateful tool tasks (like contextbook_read) to
stay in SCHEDULED state with pollCount=0 — no worker was polling in the
correct domain.

handle.join() blocks until completion using the workers already registered
by start() under the correct domain.
_register_workers recursed into sub-agents without passing the domain
parameter. This caused sub-agent tools (e.g. contextbook_read on
issue_analyst, a child of the pipeline) to register in the default
domain while the Conductor tasks were scheduled in the execution's
unique domain — resulting in pollCount=0 and SCHEDULED forever.
When an agent has stateful=True, the Conductor server schedules all
tasks (tools, stop_when, termination, handoff, check_transfer, etc.)
under the execution's unique domain UUID. But only tool workers were
registered with that domain — all 11 system worker types (guardrails,
stop_when, gate, callbacks, termination, check_transfer, router,
handoff, swarm transfers, manual selection) were registered without
a domain, causing them to poll in the default domain and never pick
up the scheduled tasks (pollCount=0, stuck in SCHEDULED forever).

Fix: add domain parameter to all _register_*_worker helpers and
propagate it from _register_workers through every call site and
recursive sub-agent registration.

Affected methods:
- _register_guardrail_worker
- _register_single_guardrail_worker
- _register_stop_when_worker
- _register_gate_worker
- _register_callback_worker
- _register_termination_worker
- _register_check_transfer_worker
- _register_router_worker
- _register_handoff_worker
- _register_swarm_transfer_workers
- _register_manual_selection_worker
6 tests that verify workers register under the correct Conductor domain
when stateful=True. All assertions inspect the workflow execution via
server API — no mocks, no LLM output parsing, fully deterministic.

Tests:
1. Stateful tool completes (not stuck SCHEDULED)
2. Stateful stop_when callback executes in domain
3. Stateful swarm handoff + termination execute in domain
4. Mixed @tool and @tool(stateful=True) share domain
5. Concurrent stateful executions get different domains (isolation)
6. Non-stateful agents work without domain (regression guard)

Counterfactual verified: removing the domain fix causes test 2 to fail
(stop_when stuck SCHEDULED with pollCount=0, timeout after 300s).
Root cause of failed execution: Issue Analyst cloned repo into a random
temp dir, but all other agents' tools operated in the SDK's CWD. The
Coder, QA Lead, and PR Creator couldn't see the cloned repo.

Fix:
- Add set_working_dir()/get_working_dir() to tools module
- Add _resolve(path) helper that resolves relative paths against the
  shared working directory
- Add _cwd() helper for subprocess calls
- All 21 tools now use _resolve() for file paths and _cwd() for
  subprocess cwd
- Contextbook stored inside working dir (.contextbook/)
- Entry point creates temp dir with random UUID suffix and calls
  set_working_dir() before starting the pipeline
- Issue Analyst instructions updated to clone into "." (the working dir)
  instead of a separate mktemp dir
- All agent instructions updated with "must use tools" reminders to
  prevent LLM from hallucinating code instead of calling tools
…ination

Observed failures in execution 56063a70:
- Issue Analyst: looped on contextbook_read 12 times, never produced output
- Tech Lead: spent 25 turns reading files, never wrote implementation_plan
- Coder: never ran (no handoff from Tech Lead)
- PR Creator: nothing to commit

Fixes:
- Add explicit turn budgets to every agent ("you have N turns, budget them")
- Add step-by-step turn allocation (turns 1-3: read, turns 4-15: explore, etc.)
- Add anti-loop rules ("do NOT call contextbook_read repeatedly")
- Add CRITICAL rules: "you MUST write implementation_plan before handing off"
- Add "if you run out of turns without writing the plan, you have FAILED"
- Simplify PR Creator to 5-turn max with numbered steps
- Every agent: "output HANDOFF_TO_X in your response text" (not as a tool call)
- DG Reviewer: "complete in 10 turns or fewer"
Observed in fe7daec4: Tech Lead hit max_turns=25 after reading 17 files
but never wrote implementation_plan. Each turn used only 1 tool call.

Changes:
- max_turns: tech_lead 30->80, coder 100->200, qa_lead 40->80
- All instructions now say "call multiple tools in parallel" with
  specific examples of what to batch per turn
- Tech Lead: phased approach with "reserve 30% of turns for writing"
  and "do NOT spend more than 70% reading"
- Every agent: parallel-first patterns ("read these 3-5 files at once")
- Removed fixed turn budgets that were too rigid, replaced with
  percentage-based guidance
Problems in c014e645:
- Coder made 6 edits and committed but never output HANDOFF_TO_DG.
  It kept making tool calls after commit, handoff check found no
  handoff text, swarm exited. DG and QA never ran.

Fixes:
1. Coder instructions: "After git commit, your VERY NEXT response
   must be the HANDOFF text with ZERO tool calls" and "The handoff
   text must be the ONLY content in that response"

2. New Documentation Agent (pipeline stage 3):
   - Runs after coding swarm, before PR creator
   - For features: MUST create example + update docs (mandatory)
   - For bug fixes: update relevant docs only if needed
   - Pipeline: issue_analyst >> coding_swarm >> docs_agent >> pr_creator
   - max_turns=40, has read/write/edit/glob/grep/list/run_command tools
…alls

Reflects all changes since the original spec:
- New pipeline: issue_analyst >> coding_swarm >> docs_agent >> pr_creator
- Docs Agent (Stage 3): updates docs, creates mandatory examples for features
- Working Directory section: shared temp dir, clone to ".", all tools relative
- Tech Lead max_turns: 30 -> 80
- Coder max_turns: 100 -> 200
- QA Lead max_turns: 40 -> 80
- Tool assignment matrix: added Docs Agent column
- Parallel tool calls: noted in design notes
- Updated PR Creator to Stage 4
Two-layer fix:
1. Issue Analyst adds '.contextbook/' to .gitignore after cloning
2. All git add commands use ':!.contextbook' exclude pathspec
New constants (user-overridable):
  DOCS_PLAN_DIR = "docs/plan"      — Tech Lead writes implementation plans here
  DOCS_DESIGN_DIR = "docs/design"  — Docs Agent writes feature design docs here

Tech Lead now writes the plan to BOTH:
  - {docs_plan_dir}/issue-<N>-plan.md (persisted in repo, committed)
  - contextbook implementation_plan (for agent communication)

Docs Agent now writes feature design docs to:
  - {docs_design_dir}/issue-<N>-<feature-slug>.md

Both paths are format placeholders in instructions, resolved from constants.
Issues often reference external URLs (RFCs, API docs, related PRs, design docs).
The web_fetch tool fetches a URL, strips HTML to plain text, and returns up to
16K chars. Assigned to Tech Lead, Coder, and Docs Agent.
Each PR now includes a machine-readable JSON block capturing the full
context of what changed, why, and by whom. Designed so that release
reviews can programmatically aggregate PR context across a release.

Flow:
1. Coder writes change_context JSON to contextbook after implementing
   (issue_number, title, change_type, date, author, root_cause,
    what_changed [{file, change}], testing, risks, related_issues)
2. Coder updates it again after writing tests (testing field, test files)
3. PR Creator reads change_context and embeds it in the PR body inside
   a collapsible <details> block with a ```json code fence

New contextbook section: "change_context"
Major restructure: replace single 4-agent swarm with deterministic
pipeline stages and focused 2-agent review loops.

Old: issue_analyst >> [SWARM: tech_lead, coder, dg, qa] >> docs >> pr
Problem: coder never handed off to DG — swarm exited after coder.

New: issue_analyst >> tech_lead >> [impl_loop] >> [test_loop] >> docs >> pr

Pipeline stages:
  1. Issue Analyst — fetch issue, clone, branch
  2. Tech Lead — analyze code, write implementation plan (deeper thinking)
  3. Implementation Loop (outer SWARM):
     - code_review_loop (inner SWARM: coder <-> DG, until CODE_APPROVED)
     - tl_reviewer (Tech Lead final review, IMPL_APPROVED or NEEDS_REWORK)
  4. Test Loop (SWARM: qa_lead <-> test_coder, until TESTS_PASS)
  5. Docs Agent — design docs, API docs, examples (MANDATORY for features)
  6. PR Creator — commit, push, create PR with change_context JSON

Key improvements:
- DG review is GUARANTEED — 2-agent swarm must alternate
- TL final review gate — implementation must be approved before testing
- QA evidence: test results saved to {qa_evidence_dir}/issue-<N>/
- Each loop is pluggable — swap DG with any code reviewer
- Separate test_coder instance for test writing
- New TL_REVIEW_INSTRUCTIONS for final sign-off
New mode: address review comments on an existing PR and update it.

Usage:
  python 100_issue_fixer_agent.py 42           # Fix issue #42 (new)
  python 100_issue_fixer_agent.py 42 --pr 157  # Address PR #157 feedback

Feedback pipeline:
  pr_feedback >> impl_loop >> test_loop >> pr_updater

Flow:
1. PR Feedback Agent: fetches PR comments, reviews, inline review
   comments (file:line), writes structured feedback to contextbook
2. Implementation Loop: coder addresses feedback, DG reviews changes
3. Test Loop: QA verifies tests still pass after changes
4. PR Updater: pushes to same branch, adds summary comment to PR
   with table of feedback items and resolutions

New agents: pr_feedback, pr_updater
New instructions: PR_FEEDBACK_INSTRUCTIONS, PR_UPDATER_INSTRUCTIONS
Idempotency key for feedback: "issue-{N}-pr-{PR}-feedback"
WorkerManager._start_new_workers() deduplicates by task_def_name only,
ignoring domain. When the same tool (e.g. contextbook_read) is registered
under both default domain (None) and a stateful execution domain, the
domain-specific worker was skipped because the name already existed.

This caused SCHEDULED tasks with pollCount=0 in domain-specific queues —
the worker was polling the default domain, not the execution's domain.

Fix: track (task_def_name, domain) tuples in the existing set instead
of just task_def_name. A worker under domain=None and the same worker
under a specific domain are different polling targets that both need
their own process.
…antee DG runs

Root cause: SWARM relies on LLM to output handoff text. The coder LLM
never outputs "HANDOFF_TO_DG" — it keeps making tool calls until max_turns.
Result: DG never ran in any execution. 4 code_review_loop iterations,
0 dg_reviewer sub-workflows.

Fix: Replace SWARM with SEQUENTIAL pipeline for review stages.

Before (SWARM — unreliable):
  code_review_loop = SWARM(coder, dg_reviewer)  # DG never runs

After (SEQUENTIAL — deterministic):
  code_then_review = coder >> dg_reviewer        # DG ALWAYS runs

The outer impl_loop remains a SWARM for the TL approval cycle:
  impl_loop = SWARM(code_then_review, tl_reviewer)
  - code_then_review runs (coder + DG guaranteed)
  - tl_reviewer checks: IMPL_APPROVED or NEEDS_REWORK

Testing also made sequential:
  test_then_verify = qa_lead >> test_coder >> qa_reviewer
  - QA plans, coder writes tests, QA reviews + runs e2e

Also: added write_file tool to QA agents for writing QA evidence files.
Multi-agent coding agent that takes a GitHub issue number, analyzes the
codebase, implements a fix with tests, and creates a PR autonomously.

Architecture: pipeline-wrapped swarm (Issue Analyst >> SWARM >> PR Creator)
with Tech Lead (Opus), Coder (Sonnet), DG Code Reviewer (skill agent),
and QA Lead (Sonnet). Includes contextbook for durable team memory,
stateful workers, idempotency via issue number, and full e2e test gate.

All project-specific values are configurable constants for reuse.
3-chunk plan: tools (21 @tool functions), agent assembly (6 agents +
instructions + pipeline), and verification. 8 tasks, ~30 steps.
Files: _issue_fixer_tools.py, _issue_fixer_instructions.py,
100_issue_fixer_agent.py.
Six detailed prompt strings: Issue Analyst, Tech Lead, Coder,
DG Reviewer, QA Lead, PR Creator. Parameterized with {repo},
{branch_prefix}, {max_review_cycles}, {max_e2e_retries}.
serve() re-registers workers in the default domain (no domain), but
start() already registered them under the execution's unique domain UUID.
The domain mismatch caused stateful tool tasks (like contextbook_read) to
stay in SCHEDULED state with pollCount=0 — no worker was polling in the
correct domain.

handle.join() blocks until completion using the workers already registered
by start() under the correct domain.
v1r3n added 18 commits May 6, 2026 15:49
…resh starts

E2E tests (Tasks 10–12 of the worker-liveness plan) — six tests under
tests/integration/test_worker_liveness_live.py covering all three
failure modes plus a validity counter-test for each:
- LocalLivenessCheck raises WorkerStartupError when no worker subprocess
  is alive after registration (Mode B-1).
- ServerLivenessMonitor surfaces WorkerStallError through join() when a
  SCHEDULED task in our domain has pollCount=0 past the stall threshold,
  with liveness_stall_policy="raise" (Mode B-2).
- AgentHandle.is_resumed flips True and the "Resumed existing execution"
  INFO log is emitted on idempotency_key replay (Mode A).

Each test pairs with a counter-test (liveness_enabled=False or fresh
idempotency_key) confirming the new check, not unrelated behavior, is
what's signaling. All assertions are algorithmic.

Hot-path fix: skip _extract_domain on fresh starts. is_resumed and the
resume INFO log can only fire when the caller passed an idempotency_key
— without one, the server cannot return a previously-recorded domain.
The original implementation made an extra get_workflow GET on every
start/run regardless. This regressed the test_runtime unit tests
(which assert get_workflow is called exactly once) and added latency
to the common case. All four call sites — start/run, sync/async — now
gate the resume detection behind ``if idempotency_key:``.

Verification:
- tests/unit/test_runtime.py: 160 pass (was 5 failing on the extra GET).
- tests/integration/test_worker_liveness_live.py: 6/6 pass in 26s.
A stateful agent re-run with the same task names but a different
``run_id`` (e.g., a fresh execution that isn't an idempotency replay)
needs ``WorkerManager.start()`` to register the new ``(task_name,
domain)`` pairs. The previous ``elif new_workers:`` branch only fired
when the *task-name set* changed; identical names under a different
domain were silently skipped, leaving the new domain unpolled.

Convert to ``else:`` at all four sites (sync/async ``run`` and
``start``, plus the framework and graph paths). ``WorkerManager`` is
domain-aware and only spawns missing pairs, so this remains safe under
concurrent re-runs and avoids the fork() deadlock window of a full
stop/restart cycle.

Adds test_prepare_workers_starts_worker_manager_when_only_domain_changes
to lock the contract: start() is called once even when the task names
have already been registered, as long as the domain differs.
…XECUTE

Adds an optional ``planSource`` field on AgentConfig that points to a
SIMPLE task (e.g., contextbook_read) which can serve a structured plan
when the planner agent's text output cannot be parsed as a JSON fence.

This is the recovery path for plan extraction failures — instead of
immediately falling through to the agentic fallback agent, the harness
can first try a deterministic source. Useful when the planner writes
its plan to a contextbook section as part of its own toolchain and
the extraction-from-LLM-output is unreliable.

- server: AgentConfig.planSource (Map<String,Object>) — {tool, args}.
- python sdk: Agent(plan_source={...}) constructor kwarg + serialization.
- typescript sdk: not wired in this commit; will be added when the TS
  PLAN_EXECUTE harness needs it.
Adds parallel e2e tests in Python, Java, and TypeScript verifying that
the GraalJS plan compiler honors ``max_tokens`` declared on a generate
block. Counterfactual coverage: without the compiler reading
``gen.max_tokens``, every LLM_CHAT_COMPLETE task uses the default 4096
ceiling; the test instructs the planner to set max_tokens=8192 and
requests sections >250 words, so a regression would surface as
truncated output and missing word-count thresholds.

All assertions are algorithmic — file existence, word counts,
expected sections.
No functional changes — line-wrap and indentation adjustments
across AgentCompiler, AgentChatCompleteTaskMapper, AgentService,
and AgentCompilerTest.
Replaces the single ``CODER_PLANNER_INSTRUCTIONS`` with three distinct
prompts to align the example with the deterministic-coding-workflows
harness:

- ``CODER_EXPLORER_INSTRUCTIONS`` — read-only exploration phase that
  writes a structured plan to the contextbook with the JSON fence
  required by Strategy.PLAN_EXECUTE.
- ``CODER_PLANNER_INSTRUCTIONS`` — slim copy-and-emit role used after
  exploration so the planner output is deterministic enough for plan
  extraction.
- ``CODER_IMPLEMENTER_FALLBACK_INSTRUCTIONS`` — agentic fallback when
  the deterministic plan execution fails validation.
CODING_AGENT_HARNESS_DESIGN.md — 1,568-line specification for a
coding-agent runtime: conversation state, tool safety, file edits,
command execution, sub-agent delegation, interruption recovery,
session persistence. Reference architecture for staged build.

GENERIC_AGENT_HARNESS_DESIGN.md — 2,814-line generalization beyond
coding: resource + capability model with adapters for filesystem,
browser, database, workflow, and device domains. Plugin/credential
boundaries and security policy.

.gitignore: ignore /.claude (per-user IDE artifacts).
… package rename

# Conflicts:
#	sdk/java/src/main/java/ai/agentspan/Agent.java
#	sdk/java/src/main/java/ai/agentspan/internal/AgentConfigSerializer.java
#	sdk/java/src/test/java/ai/agentspan/e2e/E2ePlanExecuteTest.java
Addresses 17 findings from the dg adversarial code review of
Strategy.PLAN_EXECUTE. The implementation had multiple paths where
errors silently routed to "success" and the planner's free-text could
inject script into the orchestrator. This commit fails closed everywhere.

JavaScriptBuilder.compilePlanToWorkflowScript:
- success_condition is filtered by safeCondition() — length-capped at 256
  chars and rejected if it contains function/loop/assignment/host-access
  tokens. Plan validation surfaces "unsafe success_condition: ..." instead
  of evaluating planner-supplied JS.
- Topological sort detects cycles and returns a structured error with the
  full cycle path, instead of silently dropping nodes.
- Cycle/duplicate-id/unsafe-condition/bad-output_schema all return
  {workflow_def: null, error: "..."} which the parent now surfaces.
- output_schema is validated as an instance-shape example object: real
  JSON Schema input ({"type":"object","properties":{...}}) is rejected
  rather than producing toolInputs.type / toolInputs.properties garbage.
- LLM_CHAT_COMPLETE temperature is read from gen.temperature (default 0),
  not hardcoded.
- LLM user message no longer appends "Respond with valid JSON only." —
  the system message + jsonOutput:true + temperature carry the contract.
- INLINE parse task is followed by a SWITCH(parsed.__parse_error). On
  parse failure → TERMINATE FAILED; on success → SIMPLE tool task. This
  prevents the tool task from firing with all-undefined args.
- Validation SWITCH inverts: decisionCases:{passed:onSuccess},
  defaultCase:onFailure. Aggregator returning null/error/anything-else
  fails closed instead of silently routing to onSuccess.
- optional:true is removed from all parse, eval, validation, terminate,
  on_success, on_failure, static, generated, and gate tasks. Failures
  bubble through SUB_WORKFLOW so the parent SWITCH can route to fallback.
  retryCount:1 on tool tasks covers transient errors.
- timeoutSeconds is sourced from harnessTimeoutSeconds input (default 600
  if absent), so the dynamic sub-workflow tracks the parent's contract.
- workflow_def is no longer array-wrapped — bare JSON.stringify(wfDef).
  parse_wf in the parent simplifies to JSON.parse(wfDefJson).

MultiAgentCompiler.compilePlanExecute:
- After compile_plan, a new compile_status INLINE + compile_gate SWITCH
  TERMINATE on compile_error with the actual error message, so cycle /
  duplicate-id / unsafe-condition errors are visible to the caller
  instead of disappearing into a null-def SUB_WORKFLOW launch.
- has_json predicate now requires Array.isArray(p.steps) && length > 0,
  matching what compilePlan actually accepts. Weaker predicates trip-
  wired the compile path with plans the compiler immediately rejected.
- Sub-workflow input forwards cwd, credentials, and media in addition
  to prompt/session_id/__agentspan_ctx__/context. Compiled plan tools
  that depend on working dir or credentials no longer silently default.
  This is why examples had to hardcode WORK_DIR.
- SUB_WORKFLOW.optional:true removed — sub-workflow failures must
  propagate so the fallback agent is reached.
- harnessTimeoutSeconds is plumbed into compile_plan inputs.
- planSource.tool is validated at compile time via
  isToolRegisteredInHarness — typos throw IllegalArgumentException at
  deploy instead of silently failing at runtime.
- Fallback config rebuild uses toBuilder().maxTurns(fbMaxTurns).build()
  instead of an explicit 8-field whitelist copy that silently dropped
  agents/strategy/memory/promptInputs/toolChoice/handoffs/callbacks/
  metadata/etc when fallback_max_turns was set.

AgentConfig: @builder(toBuilder = true) so toBuilder() is available for
the fallback rebuild above.

Tests: PlanCompilerScriptTest updated to read the bare-object workflow_def
(no more List<Map> unwrap) and to surface compile errors as test failures
instead of NPEs. All 5 tests + the broader server suite pass.

Skipped from this batch:
- Per-execution workflow naming (race fix): on closer reading the
  placeholder is registered once at deploy and is not overwritten at
  runtime; the actual sub-workflow definition is injected inline via
  SubWorkflowParams.workflowDefinition. The "race" Gilfoyle described
  doesn't actually exist in the code path.
- Move plan compiler to .js resource (large refactor — separate commit).
- TS SDK planSource parity (separate SDK).
- Failure-mode e2e tests (separate batch).
- Spec doc sync (separate commit).
… consistency

- TypeScript SDK: AgentOptions and Agent class gain planSource field;
  serializer forwards it as planSource on the wire to match the Java/Python
  AgentConfig. Closes the previously-flagged cross-SDK gap.

- docs/design/deterministic-coding-workflows.md
  - Replace plan_json_section with plan_source (matches actual SDK).
  - Clarify output_schema is an instance-shape example object, not real
    JSON Schema. Real JSON Schema is rejected at compile time.
  - Document success_condition as a length-capped, denylist-filtered JS
    expression rather than free-form jq/JS.
  - Replace the "degrade to Strategy.SEQUENTIAL" bullets in the Risks
    table with what the impl actually does: try plan_source → fall through
    to fallback agent → TERMINATE if no fallback.

- extractJsonFenceScript Case 1: when the planner returned an already-parsed
  plan object, markdown_plan is now the original coerced planner text (when
  available) instead of a re-pretty-printed JSON. Other cases already passed
  the original — this aligns Case 1 with the rest so the fallback agent
  always sees the LLM's actual prose, not a re-serialized object.
…ed paths

Adds 18 unit tests exercising the failure modes the dg review flagged.
All run as ordinary GraalVM Polyglot / pure Java unit tests — no live
Conductor server, no LLM API calls.

PlanCompilerScriptTest (+13 tests):
- testCycleInDependsOnIsRejected — a→b→a returns a structured cycle
  error with the full path, instead of the previous silent partial DAG.
- testDuplicateStepIdIsRejected — duplicate step ids produce a clear
  validation error.
- testEmptyStepsArrayIsRejected — empty steps array short-circuits with
  "non-empty steps array".
- testUnsafeSuccessConditionIsRejected — six adversarial conditions
  (function literal, while loop, Java.type, eval, assignment, var
  declaration) are each rejected by safeCondition.
- testSafeSuccessConditionsAreAccepted — counter-test: typical safe
  conditions ($.exit_code === 0, $.indexOf('passed') >= 0, etc.) compile.
- testJsonSchemaAsOutputSchemaIsRejected — passing real JSON Schema
  ({"type":"object","properties":...}) errors with a clear "use an
  example object instead" message instead of producing toolInputs.type
  / toolInputs.properties garbage.
- testInstanceShapeOutputSchemaIsAccepted — counter-test: plain
  instance-shape compiles cleanly.
- testGeneratedOpUsesParseGateSwitch — every generated op's chain
  contains a SWITCH(pgate_*) that routes parse errors to TERMINATE.
- testNoTaskIsOptional — verifies the optional:true cancer is gone
  across static, generated, validation, on_success, on_failure,
  TERMINATE, INLINE, and SWITCH tasks.
- testValidationSwitchFailsClosed — decisionCases:{passed:onSuccess},
  defaultCase:onFailure with a TERMINATE inside, so any non-passed
  aggregator result fails closed.
- testTimeoutFromHarnessConfig — harnessTimeoutSeconds=1234 flows into
  workflow.timeoutSeconds.
- testDefaultTimeoutWhenHarnessTimeoutAbsent — no harness timeout falls
  back to 600.
- testWorkflowDefIsNotArrayWrapped — workflow_def is a bare JSON object,
  not [wfDef] (the historical array wrap is gone, plus parse_wf simplified).

MultiAgentCompilerTest (+5 tests):
- testPlanExecutePlanSourceWithUnknownToolIsRejectedAtCompile —
  planSource.tool that's not registered anywhere in the harness throws
  IllegalArgumentException at compile time. Surfaces typos at deploy
  instead of as silent runtime no-ops.
- testPlanExecutePlanSourceMissingToolFieldIsRejected — planSource map
  without a 'tool' key fails fast.
- testPlanExecutePlanSourceWithRegisteredToolCompiles — counter-test:
  registered tool produces a plan_reader SIMPLE task.
- testPlanExecuteSurfacesCompileErrors — has_plan branch contains the
  new compile_status INLINE + compile_gate SWITCH that TERMINATEs with
  "Plan compilation failed: ..." instead of letting parse_wf trip on a
  null workflow def.
- testPlanExecuteSubWorkflowForwardsCwdCredentialsMedia — SUB_WORKFLOW
  inputs include cwd, credentials, media in addition to prompt/session_id.
- testPlanExecuteSubWorkflowIsNotOptional — execTask.isOptional() is
  false so failures propagate to the fallback SWITCH.

Total: 18 new tests, 0 failures, full server suite stays green.

Note on Python e2e tests for the same failure modes (no fence → fallback,
validation failure → fallback, etc.): those require a freshly-built
Conductor server with the new compiler code. Adding them requires a
``./gradlew build`` + server restart, which is destructive against the
locally-running shared server and was not done in this commit. The Java
unit tests above exercise the same compile paths against the same
compiler script, just without the LLM round-trip.
… fallback routing

Addresses 5 findings from a second /dg adversarial review of the
PLAN_EXECUTE implementation. Three were criticals introduced in the
previous "fix" pass.

JavaScriptBuilder.compilePlanToWorkflowScript:safeCondition (rewrite)
  Previous denylist was the wrong shape — it blocked function/eval/Java
  but left ``constructor``, ``prototype``, ``__proto__``, ``Object``,
  ``Reflect``, ``Proxy``, bracket access, comma operator, and ``+``-based
  string concatenation. The bypass
    $.constructor.constructor('return Java.type(...)')()
  passed the filter and reached GraalJS evaluation — a textbook sandbox
  escape via the Function constructor.

  Replaced with three layers:
    1. Character allowlist — only ``$.()'"!=<>&|-`` plus alphanumeric/
       underscore/whitespace. Rejects ``;``, ``{``, ``}``, backticks,
       backslashes (no escape evasion), ``,`` (no comma operator),
       ``?``/``:`` (no ternary), ``[``/``]`` (no bracket-key access),
       ``+``/``*``/``/`` (no string concat tricks).
    2. String literals are stripped before identifier check so legitimate
       conditions like ``$.kind === 'constructor'`` are preserved.
    3. Identifier denylist on the stripped string blocks the sandbox-
       escape primitives and host-bridge globals: constructor, prototype,
       __proto__, __defineGetter__, Function, eval, Object, Reflect,
       Proxy, Java, Promise, etc., plus all JS control-flow keywords.

  Bare assignment is still rejected. The ``globallThis`` (3-l) typo from
  the prior iteration is gone — globalThis is in the new identifier list
  with correct spelling.

JavaScriptBuilder.compilePlanToWorkflowScript:output_schema sniff (broaden)
  Previous heuristic required both ``properties`` AND ``type === 'object'``.
  Missed ``{type:'object', required:[...]}`` (no properties) and
  ``{$schema, properties}`` (no top-level type). Now flags presence of any
  JSON-Schema-only key: $schema, properties, required, additionalProperties,
  definitions, $defs, $ref, allOf, anyOf, oneOf, patternProperties.

MultiAgentCompiler.compilePlanExecute (compile_gate restructure)
  Two related fixes:

  (a) Previous gate had decision case only for ``compile_error`` and an
      empty ``defaultCase``. The compile_status script also emitted
      ``no_def`` (defensive sentinel for null wfDef without an error
      string) — that fell through, parse_wf JSON.parse(null) returned
      null, SUB_WORKFLOW launched with null definition. The very silent-
      degrade the gate was meant to prevent. Folded both error sentinels
      into one ``compile_failed`` so the gate cannot have a fall-through.

  (b) Compile failures TERMINATEd the parent harness even when a fallback
      agent was configured — an architectural inversion since fallback
      exists exactly to recover from deterministic-path failures, and
      compile failure is the canonical such failure. Now routes
      compile_failed into buildFallbackBranch when fallbackConfig != null
      (with a distinct prefix to avoid task-name collision with the
      exec-failure fallback emitted further down). Only TERMINATEs when
      no fallback is configured, with the actual error string in
      terminationReason.

Tests:
  - PlanCompilerScriptTest.testUnsafeSuccessConditionIsRejected expanded
    from 6 cases to 22, covering the round-2 bypasses: $.constructor.
    constructor(...), $['constructor'], comma operator, backticks,
    backslash escapes (\\u0063onstructor), Object/Reflect/Proxy globals,
    ternary, __defineGetter__, deeper-position assignment.
  - testSuccessConditionAllowsLiteralBannedWordInString — new counter-
    test asserting ``$.kind === 'constructor'`` (banned word in string
    literal) IS accepted. The new filter strips string literals before
    the identifier check.
  - testPlanExecuteSurfacesCompileErrors updated for the renamed
    ``compile_failed`` sentinel and the route-to-fallback behavior:
    expects last task in the failed branch to be SUB_WORKFLOW (fallback)
    when fallback is configured.
  - testPlanExecuteCompileErrorTerminatesWhenNoFallback (new) — counter-
    test for the no-fallback path: TERMINATE with the actual error
    message in terminationReason.
  - testAgentConfigToBuilderPreservesAllFieldsExceptOverridden (new) —
    locks the contract for the toBuilder() rebuild. Verifies maxTurns
    override applies but model/instructions/maxTokens/temperature/
    credentials all survive. Catches future regressions if @builder
    config changes.

Verification: PlanCompilerScriptTest 19/19, MultiAgentCompilerTest 42/42,
full server suite green.
…, output_select, namespace, docs

Addresses the 5 important findings from the second /dg adversarial review.

MultiAgentCompiler.compilePlanExecute (markdown_plan wired through)
  Previously the fallback agent prompt received ``plannerResult`` (the
  coerced string from planner_coerce). When the planner returned an
  already-parsed Map via jsonOutput:true, ``plannerResult`` was the
  re-stringified Map — losing the LLM's original prose. Meanwhile
  ``extract_json`` was always computing ``markdown_plan`` (the original
  text) but no consumer read it.

  Fixed by passing ``${extractRef}.output.result.markdown_plan`` as the
  fallback plan-text reference into both buildPlanExecutionBranch and
  buildFallbackOnlyBranch. The fallback agent now sees the planner's
  actual prose, which is more useful for agentic recovery.

MultiAgentCompiler.compilePlanExecute (output_select non-optional)
  Previous output_select used ``optional:true`` to mask unresolved
  ``${...}`` refs from inactive branches. That suppression also swallowed
  real expression bugs. With the new compile-failure fallback path
  (``${prefix}_compile_fallback``), there are now four possible result
  refs but only one runs per execution.

  Replaced with a ``safe()`` helper inside the JS expression that detects
  Conductor's literal-``${...}`` markers (constructed via
  String.fromCharCode(36) so this script's own source isn't pre-resolved)
  and treats them as null. Coalesces in priority order:
  plan_exec → fallback → compile_fallback → noplan_fallback. Drops
  ``optional:true`` so genuine expression failures now surface.

MultiAgentCompiler.isToolRegisteredInHarness (namespace tightened)
  Previously recursed into sub-agents. The plan_reader SIMPLE task is
  emitted in the parent harness's task namespace, so a tool registered
  only on a deeper sub-agent's worker would not be polled for the
  parent's task name — a silent runtime hang. Restricted the check to
  harness-level tools only. The user must declare the plan_source tool
  on the harness.

  testPlanExecutePlanSourceWithRegisteredToolCompiles renamed to
  testPlanExecutePlanSourceWithHarnessLevelToolCompiles and updated to
  put the tool on the harness (not the planner sub-agent). New
  testPlanExecutePlanSourceWithSubAgentOnlyToolIsRejected counter-test
  asserts a tool only on a sub-agent now fails compile with
  IllegalArgumentException.

100_issue_fixer_agent.py (brittleness documented)
  The ``coder`` PLAN_EXECUTE harness has one sub-agent and no fallback
  (per commit 1fad27a). Any compile/extract/validation/tool failure
  TERMINATEs the SUB_WORKFLOW, ending the SWARM iteration without a
  chance for QA feedback recovery. Added a comment block above the agent
  definition explaining the trade-off and how to add a fallback if
  agentic recovery is wanted.

deterministic-coding-workflows.md (failure-mode table expanded)
  Replaced the 3-row failure table with a 7-row enumeration covering
  every failure mode the implementation handles: no fence, compile
  errors (cycle/duplicate-id/unsafe-condition/bad-schema), null-def
  defensive case, parse failures inside generated ops, tool failures,
  validation failures, and plan_source.tool misconfiguration. Each row
  notes the detection point and the routing.

Verification: server suite green (43 MultiAgentCompiler tests, 19
PlanCompilerScript tests, all other compilers unchanged).
…lidation result, brace-matching, error message

Addresses the 5 findings from a third /dg adversarial review (2 important
+ 3 minor). Two of the importants are direct consequences of fixes from
prior rounds.

JavaScriptBuilder.compilePlanToWorkflowScript:
  Half-wired ambient input forwarding (the fix from re-review #2 was
  incomplete). Parent harness forwards cwd/credentials/media to the
  SUB_WORKFLOW input correctly, but compilePlanToWorkflowScript only
  injected __agentspan_ctx__ and session_id into per-tool SIMPLE task
  inputParameters — so tools saw workflow.input.cwd populated at the
  dynamic-workflow level but their own input maps didn't include it.
  Filesystem tools running inside a plan_execute strategy were
  effectively rootless.

  Refactored: added an injectAmbient(args) helper at the top of the
  script that sets all five ambient inputs (__agentspan_ctx__,
  session_id, cwd, credentials, media). Replaced six per-site
  __agentspan_ctx__/session_id assignments (static op, generated op
  toolInputs, validation, on_success, on_failure) with single
  injectAmbient(...) calls. DRYs the change and prevents per-site drift.

  Static 'completed' literal shadowing fallback output (consequence of
  the re-review #1 optional:true removal): when plan.validation is
  absent, outputParameters emitted result='completed' / status='completed'
  as literal strings — regardless of actual completion. Conductor
  resolves output mapping on FAILED workflows too. Then the parent's
  output_select coalesce read _plan_exec.output.result first, saw the
  literal 'completed', decided it was truthy and not a ${...} placeholder,
  and returned it. Fallback ran but its recovered output was shadowed.
  Tests masked it because every test path used a validation block.

  Fixed by tracking lastOpRef during step emission (= taskReferenceName
  of the last top-level task in each step). When lastAggRef is null,
  result is now ref(lastOpRef + '.output.result') — pointing at the last
  tool's actual output. On a FAILED workflow this resolves to a literal
  ${...} string which the parent's safe() helper detects and skips,
  letting the fallback's recovered output surface as the harness result.

  Status output keeps the 'completed' literal as a placeholder for the
  no-validation path — that string is informational only and not in the
  fallback-shadowing chain.

JavaScriptBuilder.extractJsonFenceScript Case 4:
  Brace-matching JSON extraction counted { / } without tracking string-
  literal state. {"key": "}{}"} miscounted and sliced the wrong
  substring. Now tracks in-string state with backslash-escape handling
  during the brace walk so braces inside string values are ignored.

MultiAgentCompiler.isToolRegisteredInHarness error message:
  Said "is not registered as a tool on '<harness>' or any of its sub-
  agents" but the validator only checks harness-level tools (after the
  re-review #2 namespace tightening). The message gave users false hope
  of putting the tool on a sub-agent — they'd move it and fail again.
  Updated to "is not registered as a harness-level tool on '<harness>'.
  The plan_reader task is emitted in the harness's task namespace, so
  the tool must be declared in tools=[...] on the harness itself
  (declaring it on a sub-agent does not work)."

Tests:
  - testInnerToolTasksReceiveCwdCredentialsMedia (new): compiles a real
    plan with static op + generated op + validation + on_success +
    on_failure, walks every emitted SIMPLE task, asserts every one of
    them contains cwd/credentials/media/session_id/__agentspan_ctx__ as
    inputParameters. Catches the prior incomplete-plumbing regression.
  - testNoValidationPlanResultPointsAtLastTaskNotLiteral (new): asserts
    outputParameters.result references a task output (${...}.output.result)
    instead of the literal 'completed' for plans without a validation
    block. Locks the fix against future regression.

Verification: PlanCompilerScriptTest 21/21, MultiAgentCompilerTest 43/43,
full server suite green.
…ator, sibling-fix gaps

Addresses 4 findings from a fourth /dg adversarial review (1 critical +
3 important). Three of the four are 'fix-didn't-reach-siblings' from
prior rounds — patches that landed at one site but missed structurally
identical neighbors. Adds a structural invariant test to catch this
class of regression going forward.

JavaScriptBuilder.compilePlanToWorkflowScript — injectAmbient ordering
  CRITICAL. The LLM-generated tool branch built toolInputs with
  ``var toolInputs = injectAmbient({})`` first, then overlaid LLM-driven
  schema keys on top. The static-args branch did the inverse (correct)
  ordering: copy LLM args first, ambient override last. The comment on
  the helper definition explicitly said ambient values are "forced
  overrides — LLM-supplied args cannot redirect these."

  An LLM-authored ``output_schema`` containing a key named ``cwd``,
  ``credentials``, ``media``, ``session_id``, or ``__agentspan_ctx__``
  could redirect the filesystem root or substitute credentials — exactly
  the threat the helper exists to prevent.

  Fix: changed the LLM branch to start with ``var toolInputs = {}``,
  populate from the schema-key loop (or _args fallback in the catch),
  and call ``injectAmbient(toolInputs)`` at the end so ambient keys
  always win. Mirrors the static-args branch.

JavaScriptBuilder.compilePlanToWorkflowScript — lastOpRef = joinRef
  Conductor's JoinTask.output is ``{taskRef → outputMap}`` with no
  top-level ``result`` key. ``${joinRef.output.result}`` resolved to a
  literal placeholder, the parent's safe() helper stripped it, and any
  plan whose terminal step was parallel:true with no validation block
  produced an empty result.

  Fix: after the JOIN, emit an INLINE aggregator (``parallel_agg_<step>``)
  that collects each branch's last task's output into an array and points
  lastOpRef at it. ``${aggRef.output.result}`` now resolves to the array
  of branch outputs.

JavaScriptBuilder.extractJsonFenceScript — Case 5c not string-aware
  Round 3 added string-literal-aware brace matching to Case 4
  (planner-text path) but the structurally identical block for
  planReaderContent (Case 5c) was untouched. A plan_source tool returning
  content with ``"steps"`` or ``{`` / ``}`` inside string literals would
  mis-slice.

  Fix: extracted the brace-walk into a ``findMatchingBrace(text, openIdx)``
  helper at the top of the script and replaced both Case 4 and Case 5c
  with helper calls. Single source of truth — future fixes won't drift.

MultiAgentCompiler — plan_reader doesn't forward cwd/credentials/media
  Round 3's ambient-forwarding fix landed in compilePlanToWorkflowScript
  via injectAmbient but missed the plan_reader SIMPLE task in
  buildPlanExecutionBranch. A reader tool needing cwd (e.g. workspace
  filesystem reads) would silently fail and fall through to the no_plan
  branch.

  Fix: added cwd/credentials/media forwards to readerInputs alongside
  the existing session_id/__agentspan_ctx__ forwards. Kept optional:true
  with a comment explaining why (a reader pointed at a not-yet-written
  contextbook section is a normal "no fallback content" condition; the
  no_plan SWITCH path surfaces real failures via the fallback agent or
  TERMINATE).

Tests:
  - testEverySimpleTaskHasFiveAmbientKeys (NEW, structural invariant) —
    compiles a plan exercising static + generated + parallel + validation
    + on_success + on_failure ops, walks every emitted SIMPLE task
    (recursing into FORK_JOIN branches AND SWITCH decisionCases for the
    parseGate-wrapped tool tasks), and asserts every one carries cwd,
    credentials, media, session_id, __agentspan_ctx__. This is the
    "fix-didn't-reach-siblings" guard — if any future tool emission
    forgets injectAmbient(), this test fails immediately.
  - testGeneratedOpAmbientKeysWinOverLLMSuppliedSchemaKeys (NEW,
    regression for the critical) — compiles a plan whose output_schema
    contains keys named ``cwd``, ``credentials``, ``media`` (the
    LLM-driven attack shape) plus a non-colliding ``safe_field``. Asserts
    the emitted SIMPLE task's cwd/credentials/media point at
    workflow.input.* (ambient won), and safe_field points at the parsed
    result (legitimate non-ambient flow).
  - collectTasks helper now recurses into SWITCH decisionCases and
    defaultCase — required for the structural invariant test to walk
    the parseGate-wrapped tool tasks.

Verification: PlanCompilerScriptTest 23/23, MultiAgentCompilerTest 43/43,
full server suite green.
… SWITCH for result references

Addresses 3 findings from a fifth /dg adversarial review (2 important,
1 minor). Same root cause as the round-4 lastOpRef-=-joinRef fix,
recurring at a third host: parseGate SWITCH terminals.

Root pattern (round 4 + round 5): wrapper task types like SWITCH
(parseGate) and JOIN don't expose a meaningful ``.result`` on their
output. SWITCH outputs the case decision; JOIN outputs
``{taskRef → outputMap}``. References like ``${pgate_*.output.result}``
or ``${join_*.output.result}`` resolve to literal placeholders.

The round-4 fix wrapped JOIN with a parallel_agg INLINE that produces
a real ``.result``. But the same disease afflicts parseGate SWITCHes —
when a generated op terminates a branch (``chain.push(parseGate)``),
the chain's terminal is the SWITCH, not the inner tool. parallel_agg's
per-branch inputs and the sequential lastOpRef both ended up pointing
at the SWITCH wrapper.

Fix: introduced ``innerRefMap`` and ``terminalRef(task)``. When emitting
a parseGate, register ``innerRefMap[parseGate.taskReferenceName] = toolRef``
— the actual SIMPLE tool task wrapped by the SWITCH. ``terminalRef(task)``
returns the inner ref when one exists, else the task's own ref. Use
``terminalRef`` at:
  - parallel_agg's per-branch inputs (was ``ref(joinOn[ja] + ...)``,
    now ``ref(terminalRef(branches[ja][...]) + ...)``). joinOn keeps
    the SWITCH ref for ordering.
  - sequential lastOpRef assignment (was branch task's ref directly,
    now ``terminalRef(branches[b2][t])``).

Tests:
  - testSequentialTerminalGeneratedOpResultPointsAtInnerTool (NEW) —
    sequential plan ending in generated op asserts outputParameters.result
    references ``t_*`` (inner tool), not ``pgate_*`` (SWITCH wrapper).
  - testParallelTerminalGeneratedOpAggregatorPointsAtInnerTool (NEW) —
    parallel branches all ending in generated ops; asserts parallel_agg
    inputs ``b0``, ``b1`` reference ``t_s1_*`` (inner tools), not
    ``pgate_s1_*``.

Both tests are CLAUDE.md-compliant: written first, manually verified
to fail against pre-fix code (lastOpRef pointed at parseGate SWITCH),
then pass after the terminalRef change.

Three rounds of the same root cause:
  - round 3: cwd/credentials/media — parent forwarded to SUB_WORKFLOW
    input, missed per-tool task injection (sibling miss).
  - round 4: lastOpRef = joinRef, JOIN has no .result (sibling miss).
  - round 5: lastOpRef = parseGate SWITCH, SWITCH has no .result
    (sibling miss).

The structural invariant test from round 4 caught ambient-key
forgetfulness but didn't assert result-reference correctness for
terminal tasks. The two new round-5 tests close that specific gap.

Verification: PlanCompilerScriptTest 25/25, MultiAgentCompilerTest 43/43,
full server suite green.
… nudge + non-empty SWITCH passed branch

Bumped Conductor from 3.3.0-SNAPSHOT (local build) to 3.30.0.rc12.
Two latent regressions surfaced under the new server.

JavaScriptBuilder.compilePlanToWorkflowScript — user-message json mention
  OpenAI's Responses API requires the literal word "json" in input
  (user) messages when ``text.format`` is ``json_object``. The system
  message's "JSON" mention is not sufficient — Responses API checks
  user-message content specifically.

  My round-1 fix had removed the trailing "Respond with valid JSON only."
  user-message nudge on the rationale that system prompt + jsonOutput:true
  carried the contract. That dismissal as "cargo cult" was wrong: it was
  the OpenAI-required guard.

  Restored as "Respond as json." (minimal, lowercase, satisfies the
  check). Comment updated to explain why this is required, not redundant.

JavaScriptBuilder.compilePlanToWorkflowScript — empty SWITCH branch
  Conductor 3.30's SWITCH falls through to defaultCase when the matched
  decision case is empty. With the round-2 inversion
  (``decisionCases:{passed: onSuccess}``, ``defaultCase: onFailure``)
  and the common case of no on_success actions, val_agg='passed'
  matched the 'passed' case (0 tasks), Conductor fell through to
  defaultCase, TERMINATE ran, workflow falsely FAILED.

  Insert a no-op INLINE in the 'passed' branch when on_success is empty
  so the matched case is never empty. The no-op produces a sentinel
  result downstream consumers ignore.

Tests:
  - testValidationPassedBranchHasNoOpWhenOnSuccessIsEmpty (NEW) —
    asserts the 'passed' branch is non-empty when no on_success is
    declared, and the inserted task is the ok_noop INLINE.
  - testValidationPassedBranchPreservesOnSuccessTasks (NEW) — counter-
    test: when on_success IS provided, those tasks land in the passed
    branch and the no-op is NOT inserted.

Verification:
  - PlanCompilerScriptTest: 27/27 (was 25; +2 regression tests)
  - MultiAgentCompilerTest: 43/43
  - Python unit tests: 1629/1629
  - Python e2e tests/integration/test_plan_execute_live.py: 3/3 in 31.9s
    (under Conductor 3.30.0.rc12 with real OpenAI calls)

@bradyyie bradyyie left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA review: This PR has merge conflicts. It's a massive change (80+ files) spanning Python SDK, TypeScript SDK, Java SDK, and server — adds plan-execute harness, coding agent, worker liveness, state updates protocol, and many tests. Too large and conflicting to approve as-is. Recommend: rebase against main and consider splitting the liveness, plan-execute, and coding agent features into separate PRs.

v1r3n and others added 9 commits May 13, 2026 10:37
# Conflicts:
#	sdk/java/src/main/java/ai/agentspan/internal/ToolRegistry.java
#	sdk/python/src/agentspan/agents/agent.py
#	sdk/python/src/agentspan/agents/tool.py
#	server/src/main/java/dev/agentspan/runtime/model/AgentConfig.java
Move the coding-agent and generic-agent harness design docs out of this PR.
The harness module itself is uncommitted WIP and will land on the separate
feat/coding-agent-harness branch.

This PR now covers only:
  a) stateful-tools fixes in the SDKs (domain propagation, worker liveness,
     handoff_check registration, stateful re-registration on domain change)
  b) PAC/PAE — Strategy.PLAN_EXECUTE, prefill_tools, planSource, success
     condition eval, parallel FORK_JOIN validations
Continuation of PAC/PAE work, restored from in-progress WIP after the
main-sync split:

  * PAC (Plan-And-Compile): plans.py SDK helpers, PlanAndCompileTask + config
    on the server, MultiAgentCompiler/AgentCompiler/ToolCompiler/GuardrailCompiler
    + accompanying tests (SynthOutputScriptTest, EnrichToolsScriptTest,
    PlanAndCompileTaskTest, AgentCompileE2ETest).
  * PAE (Plan-And-Execute): docs/concepts/plan-execute.md, integration tests
    (test_plan_execute_live, test_pac_toolType_routing_e2e), example flows
    (103_plan_and_compile.py, 104_plan_execute_guardrails.py,
    106_plan_execute_agent_fanout.py, 107_pac_mcp_proof.py).
  * UI: CompiledPlanView, AgentExecutionDiagram updates, agentExecutionUtils,
    LeftPanelTabs, Execution.jsx state machine for compiled-plan rendering.
  * SDK plumbing: ModelContextWindows, WorkflowTaskUtils, JavaScriptBuilder,
    AgentChatCompleteTaskMapper, AgentRun/StartRequest/GoogleADKNormalizer.
  * Java SDK examples: Example48Planner; Agent + AgentConfigSerializer
    updates.
  * Worker domain contract spec (docs/design/WORKER_DOMAIN_CONTRACT.md).
  * Restored main-side fields that the stash apply had overwritten:
    AgentConfig.maskedFields, Agent.synthesize/masked_fields,
    ToolDef.retry_count/retry_delay_seconds, plus their serializer paths.
… tests

The WIP stash apply in be808e0 clobbered three main-side PRs in
runtime.py (the merge had them, the WIP version did not). Restoring
them and fixing two unrelated CI breakers introduced by the same WIP
commit.

runtime.py — restore three main PRs:
  * #200 fix(python-sdk): register transfer tool workers for hybrid agents
    (this is a stateful-workers fix — without it, hybrid agents with both
    tools and sub-agents deadlock because the transfer_to_<name> SIMPLE
    tasks have no Python workers polling for them)
    -> re-add _register_hybrid_transfer_workers method + its call site
    -> include transfer_to_<sub> names in _collect_worker_names
  * #196 fix(sdk): guard _collect against non-Agent sub-agents (LangGraph)
    -> add isinstance() guard in both _collect inner functions
  * #188 feat(sdk): retry_count / retry_delay_seconds in @tool decorator
    -> re-add params to _default_task_def signature
    -> tool_registry.py passes ToolDef's retry_count/retry_delay_seconds

server/build.gradle — revert conductorVersion to 3.30.0.rc12.
The WIP bumped to rc13 which lives only in a local conductor checkout
(not published to Maven). CI server-tests couldn't resolve the JARs.

sdk/typescript/src/types.ts — ToolDef.call is now optional (call?()).
The WIP made it required, breaking ToolDef literals like the one in
code-execution.ts:246 (CodeExecutor.asTool()) which don't need to be
prefill-callable.

tests/unit/test_contextbook_flow.py — three stale assertions updated to
match the WIP's intentional behavior changes in _issue_fixer_tools.py:
  * VALID sections set now includes 'task_brief' (added by WIP)
  * Fetcher write step now also writes task_brief (matches the new
    _fetcher_done gate in 100_issue_fixer_agent.py)
  * Repeat-read assertion checks the new 'REPEAT READ #4' / 'STOP
    RE-READING' banner instead of the old 'repeat read limit exceeded'
    substring.

Local verification:
  * Python: 1689 unit tests pass (was 1681 pass + 8 fail)
  * TypeScript: full build + dts build succeed
  * Server: rc12 is the latest published Maven version (the previous
    coding_agent state used rc12 successfully)
Two more carry-overs from the WIP overwrite, both blocking server-tests
in CI:

AgentService.java — restore main PR #195's deleteExecutionRecord and
pruneExecutions methods (plus the java.time imports they need). The
WIP version of AgentService.java predates that PR, but AgentController
still calls both methods — the compileJava target failed with
'cannot find symbol' until these were re-added.

server/.../util/PlanCompilerScriptTest.java — delete (WIP intended).
The test exercised JavaScriptBuilder.compilePlanToWorkflowScript(),
which the PAC/PAE refactor removed. Replacement coverage now lives in
PlanAndCompileTaskTest, SynthOutputScriptTest, EnrichToolsScriptTest,
and ModelContextWindowsTest — all already committed in be808e0.

Local verification:
  ./gradlew test → BUILD SUCCESSFUL, 578 tests, 0 failures, 0 errors
The project enforces a 'no inline fully-qualified names' lint via a
custom Gradle task. Two carry-overs from the WIP overwrite violated it:

  * AgentCompiler.java:1257 — `.collect(java.util.stream.Collectors.joining(", "))`
    -> add `import java.util.stream.Collectors;`, use `Collectors.joining`
  * AgentChatCompleteTaskMapper.java:665 — `java.util.Set<Integer> ... = new java.util.LinkedHashSet<>();`
    -> add `import java.util.LinkedHashSet;`, drop the `java.util.` prefix
       (`java.util.Set` is already imported via `import java.util.Set;`)

Verified:
  ./gradlew checkNoInlineFQN -> BUILD SUCCESSFUL
  ./gradlew test             -> BUILD SUCCESSFUL (578 tests, 0 fail)
Run ./gradlew :spotlessApply on the 5 WIP files that violated formatting:
  * AgentCompiler.java
  * PlanAndCompileTask.java
  * AgentChatCompleteTaskMapperTest.java
  * AgentCompilerTest.java
  * PlanAndCompileTaskTest.java

Mechanical line-break / chain-call reformatting; no behavior change.

Verified:
  ./gradlew :spotlessJavaCheck -> BUILD SUCCESSFUL
  ./gradlew test               -> BUILD SUCCESSFUL
…sult.error

Carry-over from the WIP overwrite: PR #201 was on main when we merged it
in, but the stash apply rolled runtime.py back to the pre-#201 version.

Restored:
  * Two _run* tail paths inspect tasks for the first FAILED task and use
    its reasonForIncompletion (prefixed with the task's referenceTaskName)
    as AgentResult.error, falling back to the workflow-level reason.
  * _extract_failed_task_reason helper.

So instead of the opaque workflow reason, callers now see something like
'Task ''manager_llm'' failed: LLM API returned 429' — diagnosable without
opening the execution history UI.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants