Coding agent#163
Open
v1r3n wants to merge 131 commits into
Open
Conversation
Multi-agent coding agent that takes a GitHub issue number, analyzes the codebase, implements a fix with tests, and creates a PR autonomously. Architecture: pipeline-wrapped swarm (Issue Analyst >> SWARM >> PR Creator) with Tech Lead (Opus), Coder (Sonnet), DG Code Reviewer (skill agent), and QA Lead (Sonnet). Includes contextbook for durable team memory, stateful workers, idempotency via issue number, and full e2e test gate. All project-specific values are configurable constants for reuse.
3-chunk plan: tools (21 @tool functions), agent assembly (6 agents + instructions + pipeline), and verification. 8 tasks, ~30 steps. Files: _issue_fixer_tools.py, _issue_fixer_instructions.py, 100_issue_fixer_agent.py.
Six detailed prompt strings: Issue Analyst, Tech Lead, Coder,
DG Reviewer, QA Lead, PR Creator. Parameterized with {repo},
{branch_prefix}, {max_review_cycles}, {max_e2e_retries}.
serve() re-registers workers in the default domain (no domain), but start() already registered them under the execution's unique domain UUID. The domain mismatch caused stateful tool tasks (like contextbook_read) to stay in SCHEDULED state with pollCount=0 — no worker was polling in the correct domain. handle.join() blocks until completion using the workers already registered by start() under the correct domain.
_register_workers recursed into sub-agents without passing the domain parameter. This caused sub-agent tools (e.g. contextbook_read on issue_analyst, a child of the pipeline) to register in the default domain while the Conductor tasks were scheduled in the execution's unique domain — resulting in pollCount=0 and SCHEDULED forever.
When an agent has stateful=True, the Conductor server schedules all tasks (tools, stop_when, termination, handoff, check_transfer, etc.) under the execution's unique domain UUID. But only tool workers were registered with that domain — all 11 system worker types (guardrails, stop_when, gate, callbacks, termination, check_transfer, router, handoff, swarm transfers, manual selection) were registered without a domain, causing them to poll in the default domain and never pick up the scheduled tasks (pollCount=0, stuck in SCHEDULED forever). Fix: add domain parameter to all _register_*_worker helpers and propagate it from _register_workers through every call site and recursive sub-agent registration. Affected methods: - _register_guardrail_worker - _register_single_guardrail_worker - _register_stop_when_worker - _register_gate_worker - _register_callback_worker - _register_termination_worker - _register_check_transfer_worker - _register_router_worker - _register_handoff_worker - _register_swarm_transfer_workers - _register_manual_selection_worker
6 tests that verify workers register under the correct Conductor domain when stateful=True. All assertions inspect the workflow execution via server API — no mocks, no LLM output parsing, fully deterministic. Tests: 1. Stateful tool completes (not stuck SCHEDULED) 2. Stateful stop_when callback executes in domain 3. Stateful swarm handoff + termination execute in domain 4. Mixed @tool and @tool(stateful=True) share domain 5. Concurrent stateful executions get different domains (isolation) 6. Non-stateful agents work without domain (regression guard) Counterfactual verified: removing the domain fix causes test 2 to fail (stop_when stuck SCHEDULED with pollCount=0, timeout after 300s).
Root cause of failed execution: Issue Analyst cloned repo into a random temp dir, but all other agents' tools operated in the SDK's CWD. The Coder, QA Lead, and PR Creator couldn't see the cloned repo. Fix: - Add set_working_dir()/get_working_dir() to tools module - Add _resolve(path) helper that resolves relative paths against the shared working directory - Add _cwd() helper for subprocess calls - All 21 tools now use _resolve() for file paths and _cwd() for subprocess cwd - Contextbook stored inside working dir (.contextbook/) - Entry point creates temp dir with random UUID suffix and calls set_working_dir() before starting the pipeline - Issue Analyst instructions updated to clone into "." (the working dir) instead of a separate mktemp dir - All agent instructions updated with "must use tools" reminders to prevent LLM from hallucinating code instead of calling tools
…ination
Observed failures in execution 56063a70:
- Issue Analyst: looped on contextbook_read 12 times, never produced output
- Tech Lead: spent 25 turns reading files, never wrote implementation_plan
- Coder: never ran (no handoff from Tech Lead)
- PR Creator: nothing to commit
Fixes:
- Add explicit turn budgets to every agent ("you have N turns, budget them")
- Add step-by-step turn allocation (turns 1-3: read, turns 4-15: explore, etc.)
- Add anti-loop rules ("do NOT call contextbook_read repeatedly")
- Add CRITICAL rules: "you MUST write implementation_plan before handing off"
- Add "if you run out of turns without writing the plan, you have FAILED"
- Simplify PR Creator to 5-turn max with numbered steps
- Every agent: "output HANDOFF_TO_X in your response text" (not as a tool call)
- DG Reviewer: "complete in 10 turns or fewer"
Observed in fe7daec4: Tech Lead hit max_turns=25 after reading 17 files
but never wrote implementation_plan. Each turn used only 1 tool call.
Changes:
- max_turns: tech_lead 30->80, coder 100->200, qa_lead 40->80
- All instructions now say "call multiple tools in parallel" with
specific examples of what to batch per turn
- Tech Lead: phased approach with "reserve 30% of turns for writing"
and "do NOT spend more than 70% reading"
- Every agent: parallel-first patterns ("read these 3-5 files at once")
- Removed fixed turn budgets that were too rigid, replaced with
percentage-based guidance
Problems in c014e645: - Coder made 6 edits and committed but never output HANDOFF_TO_DG. It kept making tool calls after commit, handoff check found no handoff text, swarm exited. DG and QA never ran. Fixes: 1. Coder instructions: "After git commit, your VERY NEXT response must be the HANDOFF text with ZERO tool calls" and "The handoff text must be the ONLY content in that response" 2. New Documentation Agent (pipeline stage 3): - Runs after coding swarm, before PR creator - For features: MUST create example + update docs (mandatory) - For bug fixes: update relevant docs only if needed - Pipeline: issue_analyst >> coding_swarm >> docs_agent >> pr_creator - max_turns=40, has read/write/edit/glob/grep/list/run_command tools
…alls Reflects all changes since the original spec: - New pipeline: issue_analyst >> coding_swarm >> docs_agent >> pr_creator - Docs Agent (Stage 3): updates docs, creates mandatory examples for features - Working Directory section: shared temp dir, clone to ".", all tools relative - Tech Lead max_turns: 30 -> 80 - Coder max_turns: 100 -> 200 - QA Lead max_turns: 40 -> 80 - Tool assignment matrix: added Docs Agent column - Parallel tool calls: noted in design notes - Updated PR Creator to Stage 4
Two-layer fix: 1. Issue Analyst adds '.contextbook/' to .gitignore after cloning 2. All git add commands use ':!.contextbook' exclude pathspec
New constants (user-overridable):
DOCS_PLAN_DIR = "docs/plan" — Tech Lead writes implementation plans here
DOCS_DESIGN_DIR = "docs/design" — Docs Agent writes feature design docs here
Tech Lead now writes the plan to BOTH:
- {docs_plan_dir}/issue-<N>-plan.md (persisted in repo, committed)
- contextbook implementation_plan (for agent communication)
Docs Agent now writes feature design docs to:
- {docs_design_dir}/issue-<N>-<feature-slug>.md
Both paths are format placeholders in instructions, resolved from constants.
Issues often reference external URLs (RFCs, API docs, related PRs, design docs). The web_fetch tool fetches a URL, strips HTML to plain text, and returns up to 16K chars. Assigned to Tech Lead, Coder, and Docs Agent.
Each PR now includes a machine-readable JSON block capturing the full
context of what changed, why, and by whom. Designed so that release
reviews can programmatically aggregate PR context across a release.
Flow:
1. Coder writes change_context JSON to contextbook after implementing
(issue_number, title, change_type, date, author, root_cause,
what_changed [{file, change}], testing, risks, related_issues)
2. Coder updates it again after writing tests (testing field, test files)
3. PR Creator reads change_context and embeds it in the PR body inside
a collapsible <details> block with a ```json code fence
New contextbook section: "change_context"
Major restructure: replace single 4-agent swarm with deterministic
pipeline stages and focused 2-agent review loops.
Old: issue_analyst >> [SWARM: tech_lead, coder, dg, qa] >> docs >> pr
Problem: coder never handed off to DG — swarm exited after coder.
New: issue_analyst >> tech_lead >> [impl_loop] >> [test_loop] >> docs >> pr
Pipeline stages:
1. Issue Analyst — fetch issue, clone, branch
2. Tech Lead — analyze code, write implementation plan (deeper thinking)
3. Implementation Loop (outer SWARM):
- code_review_loop (inner SWARM: coder <-> DG, until CODE_APPROVED)
- tl_reviewer (Tech Lead final review, IMPL_APPROVED or NEEDS_REWORK)
4. Test Loop (SWARM: qa_lead <-> test_coder, until TESTS_PASS)
5. Docs Agent — design docs, API docs, examples (MANDATORY for features)
6. PR Creator — commit, push, create PR with change_context JSON
Key improvements:
- DG review is GUARANTEED — 2-agent swarm must alternate
- TL final review gate — implementation must be approved before testing
- QA evidence: test results saved to {qa_evidence_dir}/issue-<N>/
- Each loop is pluggable — swap DG with any code reviewer
- Separate test_coder instance for test writing
- New TL_REVIEW_INSTRUCTIONS for final sign-off
New mode: address review comments on an existing PR and update it. Usage: python 100_issue_fixer_agent.py 42 # Fix issue #42 (new) python 100_issue_fixer_agent.py 42 --pr 157 # Address PR #157 feedback Feedback pipeline: pr_feedback >> impl_loop >> test_loop >> pr_updater Flow: 1. PR Feedback Agent: fetches PR comments, reviews, inline review comments (file:line), writes structured feedback to contextbook 2. Implementation Loop: coder addresses feedback, DG reviews changes 3. Test Loop: QA verifies tests still pass after changes 4. PR Updater: pushes to same branch, adds summary comment to PR with table of feedback items and resolutions New agents: pr_feedback, pr_updater New instructions: PR_FEEDBACK_INSTRUCTIONS, PR_UPDATER_INSTRUCTIONS Idempotency key for feedback: "issue-{N}-pr-{PR}-feedback"
WorkerManager._start_new_workers() deduplicates by task_def_name only, ignoring domain. When the same tool (e.g. contextbook_read) is registered under both default domain (None) and a stateful execution domain, the domain-specific worker was skipped because the name already existed. This caused SCHEDULED tasks with pollCount=0 in domain-specific queues — the worker was polling the default domain, not the execution's domain. Fix: track (task_def_name, domain) tuples in the existing set instead of just task_def_name. A worker under domain=None and the same worker under a specific domain are different polling targets that both need their own process.
…antee DG runs Root cause: SWARM relies on LLM to output handoff text. The coder LLM never outputs "HANDOFF_TO_DG" — it keeps making tool calls until max_turns. Result: DG never ran in any execution. 4 code_review_loop iterations, 0 dg_reviewer sub-workflows. Fix: Replace SWARM with SEQUENTIAL pipeline for review stages. Before (SWARM — unreliable): code_review_loop = SWARM(coder, dg_reviewer) # DG never runs After (SEQUENTIAL — deterministic): code_then_review = coder >> dg_reviewer # DG ALWAYS runs The outer impl_loop remains a SWARM for the TL approval cycle: impl_loop = SWARM(code_then_review, tl_reviewer) - code_then_review runs (coder + DG guaranteed) - tl_reviewer checks: IMPL_APPROVED or NEEDS_REWORK Testing also made sequential: test_then_verify = qa_lead >> test_coder >> qa_reviewer - QA plans, coder writes tests, QA reviews + runs e2e Also: added write_file tool to QA agents for writing QA evidence files.
Multi-agent coding agent that takes a GitHub issue number, analyzes the codebase, implements a fix with tests, and creates a PR autonomously. Architecture: pipeline-wrapped swarm (Issue Analyst >> SWARM >> PR Creator) with Tech Lead (Opus), Coder (Sonnet), DG Code Reviewer (skill agent), and QA Lead (Sonnet). Includes contextbook for durable team memory, stateful workers, idempotency via issue number, and full e2e test gate. All project-specific values are configurable constants for reuse.
3-chunk plan: tools (21 @tool functions), agent assembly (6 agents + instructions + pipeline), and verification. 8 tasks, ~30 steps. Files: _issue_fixer_tools.py, _issue_fixer_instructions.py, 100_issue_fixer_agent.py.
Six detailed prompt strings: Issue Analyst, Tech Lead, Coder,
DG Reviewer, QA Lead, PR Creator. Parameterized with {repo},
{branch_prefix}, {max_review_cycles}, {max_e2e_retries}.
serve() re-registers workers in the default domain (no domain), but start() already registered them under the execution's unique domain UUID. The domain mismatch caused stateful tool tasks (like contextbook_read) to stay in SCHEDULED state with pollCount=0 — no worker was polling in the correct domain. handle.join() blocks until completion using the workers already registered by start() under the correct domain.
…resh starts E2E tests (Tasks 10–12 of the worker-liveness plan) — six tests under tests/integration/test_worker_liveness_live.py covering all three failure modes plus a validity counter-test for each: - LocalLivenessCheck raises WorkerStartupError when no worker subprocess is alive after registration (Mode B-1). - ServerLivenessMonitor surfaces WorkerStallError through join() when a SCHEDULED task in our domain has pollCount=0 past the stall threshold, with liveness_stall_policy="raise" (Mode B-2). - AgentHandle.is_resumed flips True and the "Resumed existing execution" INFO log is emitted on idempotency_key replay (Mode A). Each test pairs with a counter-test (liveness_enabled=False or fresh idempotency_key) confirming the new check, not unrelated behavior, is what's signaling. All assertions are algorithmic. Hot-path fix: skip _extract_domain on fresh starts. is_resumed and the resume INFO log can only fire when the caller passed an idempotency_key — without one, the server cannot return a previously-recorded domain. The original implementation made an extra get_workflow GET on every start/run regardless. This regressed the test_runtime unit tests (which assert get_workflow is called exactly once) and added latency to the common case. All four call sites — start/run, sync/async — now gate the resume detection behind ``if idempotency_key:``. Verification: - tests/unit/test_runtime.py: 160 pass (was 5 failing on the extra GET). - tests/integration/test_worker_liveness_live.py: 6/6 pass in 26s.
A stateful agent re-run with the same task names but a different ``run_id`` (e.g., a fresh execution that isn't an idempotency replay) needs ``WorkerManager.start()`` to register the new ``(task_name, domain)`` pairs. The previous ``elif new_workers:`` branch only fired when the *task-name set* changed; identical names under a different domain were silently skipped, leaving the new domain unpolled. Convert to ``else:`` at all four sites (sync/async ``run`` and ``start``, plus the framework and graph paths). ``WorkerManager`` is domain-aware and only spawns missing pairs, so this remains safe under concurrent re-runs and avoids the fork() deadlock window of a full stop/restart cycle. Adds test_prepare_workers_starts_worker_manager_when_only_domain_changes to lock the contract: start() is called once even when the task names have already been registered, as long as the domain differs.
…XECUTE
Adds an optional ``planSource`` field on AgentConfig that points to a
SIMPLE task (e.g., contextbook_read) which can serve a structured plan
when the planner agent's text output cannot be parsed as a JSON fence.
This is the recovery path for plan extraction failures — instead of
immediately falling through to the agentic fallback agent, the harness
can first try a deterministic source. Useful when the planner writes
its plan to a contextbook section as part of its own toolchain and
the extraction-from-LLM-output is unreliable.
- server: AgentConfig.planSource (Map<String,Object>) — {tool, args}.
- python sdk: Agent(plan_source={...}) constructor kwarg + serialization.
- typescript sdk: not wired in this commit; will be added when the TS
PLAN_EXECUTE harness needs it.
Adds parallel e2e tests in Python, Java, and TypeScript verifying that the GraalJS plan compiler honors ``max_tokens`` declared on a generate block. Counterfactual coverage: without the compiler reading ``gen.max_tokens``, every LLM_CHAT_COMPLETE task uses the default 4096 ceiling; the test instructs the planner to set max_tokens=8192 and requests sections >250 words, so a regression would surface as truncated output and missing word-count thresholds. All assertions are algorithmic — file existence, word counts, expected sections.
No functional changes — line-wrap and indentation adjustments across AgentCompiler, AgentChatCompleteTaskMapper, AgentService, and AgentCompilerTest.
Replaces the single ``CODER_PLANNER_INSTRUCTIONS`` with three distinct prompts to align the example with the deterministic-coding-workflows harness: - ``CODER_EXPLORER_INSTRUCTIONS`` — read-only exploration phase that writes a structured plan to the contextbook with the JSON fence required by Strategy.PLAN_EXECUTE. - ``CODER_PLANNER_INSTRUCTIONS`` — slim copy-and-emit role used after exploration so the planner output is deterministic enough for plan extraction. - ``CODER_IMPLEMENTER_FALLBACK_INSTRUCTIONS`` — agentic fallback when the deterministic plan execution fails validation.
CODING_AGENT_HARNESS_DESIGN.md — 1,568-line specification for a coding-agent runtime: conversation state, tool safety, file edits, command execution, sub-agent delegation, interruption recovery, session persistence. Reference architecture for staged build. GENERIC_AGENT_HARNESS_DESIGN.md — 2,814-line generalization beyond coding: resource + capability model with adapters for filesystem, browser, database, workflow, and device domains. Plugin/credential boundaries and security policy. .gitignore: ignore /.claude (per-user IDE artifacts).
… package rename # Conflicts: # sdk/java/src/main/java/ai/agentspan/Agent.java # sdk/java/src/main/java/ai/agentspan/internal/AgentConfigSerializer.java # sdk/java/src/test/java/ai/agentspan/e2e/E2ePlanExecuteTest.java
Addresses 17 findings from the dg adversarial code review of
Strategy.PLAN_EXECUTE. The implementation had multiple paths where
errors silently routed to "success" and the planner's free-text could
inject script into the orchestrator. This commit fails closed everywhere.
JavaScriptBuilder.compilePlanToWorkflowScript:
- success_condition is filtered by safeCondition() — length-capped at 256
chars and rejected if it contains function/loop/assignment/host-access
tokens. Plan validation surfaces "unsafe success_condition: ..." instead
of evaluating planner-supplied JS.
- Topological sort detects cycles and returns a structured error with the
full cycle path, instead of silently dropping nodes.
- Cycle/duplicate-id/unsafe-condition/bad-output_schema all return
{workflow_def: null, error: "..."} which the parent now surfaces.
- output_schema is validated as an instance-shape example object: real
JSON Schema input ({"type":"object","properties":{...}}) is rejected
rather than producing toolInputs.type / toolInputs.properties garbage.
- LLM_CHAT_COMPLETE temperature is read from gen.temperature (default 0),
not hardcoded.
- LLM user message no longer appends "Respond with valid JSON only." —
the system message + jsonOutput:true + temperature carry the contract.
- INLINE parse task is followed by a SWITCH(parsed.__parse_error). On
parse failure → TERMINATE FAILED; on success → SIMPLE tool task. This
prevents the tool task from firing with all-undefined args.
- Validation SWITCH inverts: decisionCases:{passed:onSuccess},
defaultCase:onFailure. Aggregator returning null/error/anything-else
fails closed instead of silently routing to onSuccess.
- optional:true is removed from all parse, eval, validation, terminate,
on_success, on_failure, static, generated, and gate tasks. Failures
bubble through SUB_WORKFLOW so the parent SWITCH can route to fallback.
retryCount:1 on tool tasks covers transient errors.
- timeoutSeconds is sourced from harnessTimeoutSeconds input (default 600
if absent), so the dynamic sub-workflow tracks the parent's contract.
- workflow_def is no longer array-wrapped — bare JSON.stringify(wfDef).
parse_wf in the parent simplifies to JSON.parse(wfDefJson).
MultiAgentCompiler.compilePlanExecute:
- After compile_plan, a new compile_status INLINE + compile_gate SWITCH
TERMINATE on compile_error with the actual error message, so cycle /
duplicate-id / unsafe-condition errors are visible to the caller
instead of disappearing into a null-def SUB_WORKFLOW launch.
- has_json predicate now requires Array.isArray(p.steps) && length > 0,
matching what compilePlan actually accepts. Weaker predicates trip-
wired the compile path with plans the compiler immediately rejected.
- Sub-workflow input forwards cwd, credentials, and media in addition
to prompt/session_id/__agentspan_ctx__/context. Compiled plan tools
that depend on working dir or credentials no longer silently default.
This is why examples had to hardcode WORK_DIR.
- SUB_WORKFLOW.optional:true removed — sub-workflow failures must
propagate so the fallback agent is reached.
- harnessTimeoutSeconds is plumbed into compile_plan inputs.
- planSource.tool is validated at compile time via
isToolRegisteredInHarness — typos throw IllegalArgumentException at
deploy instead of silently failing at runtime.
- Fallback config rebuild uses toBuilder().maxTurns(fbMaxTurns).build()
instead of an explicit 8-field whitelist copy that silently dropped
agents/strategy/memory/promptInputs/toolChoice/handoffs/callbacks/
metadata/etc when fallback_max_turns was set.
AgentConfig: @builder(toBuilder = true) so toBuilder() is available for
the fallback rebuild above.
Tests: PlanCompilerScriptTest updated to read the bare-object workflow_def
(no more List<Map> unwrap) and to surface compile errors as test failures
instead of NPEs. All 5 tests + the broader server suite pass.
Skipped from this batch:
- Per-execution workflow naming (race fix): on closer reading the
placeholder is registered once at deploy and is not overwritten at
runtime; the actual sub-workflow definition is injected inline via
SubWorkflowParams.workflowDefinition. The "race" Gilfoyle described
doesn't actually exist in the code path.
- Move plan compiler to .js resource (large refactor — separate commit).
- TS SDK planSource parity (separate SDK).
- Failure-mode e2e tests (separate batch).
- Spec doc sync (separate commit).
… consistency
- TypeScript SDK: AgentOptions and Agent class gain planSource field;
serializer forwards it as planSource on the wire to match the Java/Python
AgentConfig. Closes the previously-flagged cross-SDK gap.
- docs/design/deterministic-coding-workflows.md
- Replace plan_json_section with plan_source (matches actual SDK).
- Clarify output_schema is an instance-shape example object, not real
JSON Schema. Real JSON Schema is rejected at compile time.
- Document success_condition as a length-capped, denylist-filtered JS
expression rather than free-form jq/JS.
- Replace the "degrade to Strategy.SEQUENTIAL" bullets in the Risks
table with what the impl actually does: try plan_source → fall through
to fallback agent → TERMINATE if no fallback.
- extractJsonFenceScript Case 1: when the planner returned an already-parsed
plan object, markdown_plan is now the original coerced planner text (when
available) instead of a re-pretty-printed JSON. Other cases already passed
the original — this aligns Case 1 with the rest so the fallback agent
always sees the LLM's actual prose, not a re-serialized object.
…ed paths
Adds 18 unit tests exercising the failure modes the dg review flagged.
All run as ordinary GraalVM Polyglot / pure Java unit tests — no live
Conductor server, no LLM API calls.
PlanCompilerScriptTest (+13 tests):
- testCycleInDependsOnIsRejected — a→b→a returns a structured cycle
error with the full path, instead of the previous silent partial DAG.
- testDuplicateStepIdIsRejected — duplicate step ids produce a clear
validation error.
- testEmptyStepsArrayIsRejected — empty steps array short-circuits with
"non-empty steps array".
- testUnsafeSuccessConditionIsRejected — six adversarial conditions
(function literal, while loop, Java.type, eval, assignment, var
declaration) are each rejected by safeCondition.
- testSafeSuccessConditionsAreAccepted — counter-test: typical safe
conditions ($.exit_code === 0, $.indexOf('passed') >= 0, etc.) compile.
- testJsonSchemaAsOutputSchemaIsRejected — passing real JSON Schema
({"type":"object","properties":...}) errors with a clear "use an
example object instead" message instead of producing toolInputs.type
/ toolInputs.properties garbage.
- testInstanceShapeOutputSchemaIsAccepted — counter-test: plain
instance-shape compiles cleanly.
- testGeneratedOpUsesParseGateSwitch — every generated op's chain
contains a SWITCH(pgate_*) that routes parse errors to TERMINATE.
- testNoTaskIsOptional — verifies the optional:true cancer is gone
across static, generated, validation, on_success, on_failure,
TERMINATE, INLINE, and SWITCH tasks.
- testValidationSwitchFailsClosed — decisionCases:{passed:onSuccess},
defaultCase:onFailure with a TERMINATE inside, so any non-passed
aggregator result fails closed.
- testTimeoutFromHarnessConfig — harnessTimeoutSeconds=1234 flows into
workflow.timeoutSeconds.
- testDefaultTimeoutWhenHarnessTimeoutAbsent — no harness timeout falls
back to 600.
- testWorkflowDefIsNotArrayWrapped — workflow_def is a bare JSON object,
not [wfDef] (the historical array wrap is gone, plus parse_wf simplified).
MultiAgentCompilerTest (+5 tests):
- testPlanExecutePlanSourceWithUnknownToolIsRejectedAtCompile —
planSource.tool that's not registered anywhere in the harness throws
IllegalArgumentException at compile time. Surfaces typos at deploy
instead of as silent runtime no-ops.
- testPlanExecutePlanSourceMissingToolFieldIsRejected — planSource map
without a 'tool' key fails fast.
- testPlanExecutePlanSourceWithRegisteredToolCompiles — counter-test:
registered tool produces a plan_reader SIMPLE task.
- testPlanExecuteSurfacesCompileErrors — has_plan branch contains the
new compile_status INLINE + compile_gate SWITCH that TERMINATEs with
"Plan compilation failed: ..." instead of letting parse_wf trip on a
null workflow def.
- testPlanExecuteSubWorkflowForwardsCwdCredentialsMedia — SUB_WORKFLOW
inputs include cwd, credentials, media in addition to prompt/session_id.
- testPlanExecuteSubWorkflowIsNotOptional — execTask.isOptional() is
false so failures propagate to the fallback SWITCH.
Total: 18 new tests, 0 failures, full server suite stays green.
Note on Python e2e tests for the same failure modes (no fence → fallback,
validation failure → fallback, etc.): those require a freshly-built
Conductor server with the new compiler code. Adding them requires a
``./gradlew build`` + server restart, which is destructive against the
locally-running shared server and was not done in this commit. The Java
unit tests above exercise the same compile paths against the same
compiler script, just without the LLM round-trip.
… fallback routing
Addresses 5 findings from a second /dg adversarial review of the
PLAN_EXECUTE implementation. Three were criticals introduced in the
previous "fix" pass.
JavaScriptBuilder.compilePlanToWorkflowScript:safeCondition (rewrite)
Previous denylist was the wrong shape — it blocked function/eval/Java
but left ``constructor``, ``prototype``, ``__proto__``, ``Object``,
``Reflect``, ``Proxy``, bracket access, comma operator, and ``+``-based
string concatenation. The bypass
$.constructor.constructor('return Java.type(...)')()
passed the filter and reached GraalJS evaluation — a textbook sandbox
escape via the Function constructor.
Replaced with three layers:
1. Character allowlist — only ``$.()'"!=<>&|-`` plus alphanumeric/
underscore/whitespace. Rejects ``;``, ``{``, ``}``, backticks,
backslashes (no escape evasion), ``,`` (no comma operator),
``?``/``:`` (no ternary), ``[``/``]`` (no bracket-key access),
``+``/``*``/``/`` (no string concat tricks).
2. String literals are stripped before identifier check so legitimate
conditions like ``$.kind === 'constructor'`` are preserved.
3. Identifier denylist on the stripped string blocks the sandbox-
escape primitives and host-bridge globals: constructor, prototype,
__proto__, __defineGetter__, Function, eval, Object, Reflect,
Proxy, Java, Promise, etc., plus all JS control-flow keywords.
Bare assignment is still rejected. The ``globallThis`` (3-l) typo from
the prior iteration is gone — globalThis is in the new identifier list
with correct spelling.
JavaScriptBuilder.compilePlanToWorkflowScript:output_schema sniff (broaden)
Previous heuristic required both ``properties`` AND ``type === 'object'``.
Missed ``{type:'object', required:[...]}`` (no properties) and
``{$schema, properties}`` (no top-level type). Now flags presence of any
JSON-Schema-only key: $schema, properties, required, additionalProperties,
definitions, $defs, $ref, allOf, anyOf, oneOf, patternProperties.
MultiAgentCompiler.compilePlanExecute (compile_gate restructure)
Two related fixes:
(a) Previous gate had decision case only for ``compile_error`` and an
empty ``defaultCase``. The compile_status script also emitted
``no_def`` (defensive sentinel for null wfDef without an error
string) — that fell through, parse_wf JSON.parse(null) returned
null, SUB_WORKFLOW launched with null definition. The very silent-
degrade the gate was meant to prevent. Folded both error sentinels
into one ``compile_failed`` so the gate cannot have a fall-through.
(b) Compile failures TERMINATEd the parent harness even when a fallback
agent was configured — an architectural inversion since fallback
exists exactly to recover from deterministic-path failures, and
compile failure is the canonical such failure. Now routes
compile_failed into buildFallbackBranch when fallbackConfig != null
(with a distinct prefix to avoid task-name collision with the
exec-failure fallback emitted further down). Only TERMINATEs when
no fallback is configured, with the actual error string in
terminationReason.
Tests:
- PlanCompilerScriptTest.testUnsafeSuccessConditionIsRejected expanded
from 6 cases to 22, covering the round-2 bypasses: $.constructor.
constructor(...), $['constructor'], comma operator, backticks,
backslash escapes (\\u0063onstructor), Object/Reflect/Proxy globals,
ternary, __defineGetter__, deeper-position assignment.
- testSuccessConditionAllowsLiteralBannedWordInString — new counter-
test asserting ``$.kind === 'constructor'`` (banned word in string
literal) IS accepted. The new filter strips string literals before
the identifier check.
- testPlanExecuteSurfacesCompileErrors updated for the renamed
``compile_failed`` sentinel and the route-to-fallback behavior:
expects last task in the failed branch to be SUB_WORKFLOW (fallback)
when fallback is configured.
- testPlanExecuteCompileErrorTerminatesWhenNoFallback (new) — counter-
test for the no-fallback path: TERMINATE with the actual error
message in terminationReason.
- testAgentConfigToBuilderPreservesAllFieldsExceptOverridden (new) —
locks the contract for the toBuilder() rebuild. Verifies maxTurns
override applies but model/instructions/maxTokens/temperature/
credentials all survive. Catches future regressions if @builder
config changes.
Verification: PlanCompilerScriptTest 19/19, MultiAgentCompilerTest 42/42,
full server suite green.
…, output_select, namespace, docs
Addresses the 5 important findings from the second /dg adversarial review.
MultiAgentCompiler.compilePlanExecute (markdown_plan wired through)
Previously the fallback agent prompt received ``plannerResult`` (the
coerced string from planner_coerce). When the planner returned an
already-parsed Map via jsonOutput:true, ``plannerResult`` was the
re-stringified Map — losing the LLM's original prose. Meanwhile
``extract_json`` was always computing ``markdown_plan`` (the original
text) but no consumer read it.
Fixed by passing ``${extractRef}.output.result.markdown_plan`` as the
fallback plan-text reference into both buildPlanExecutionBranch and
buildFallbackOnlyBranch. The fallback agent now sees the planner's
actual prose, which is more useful for agentic recovery.
MultiAgentCompiler.compilePlanExecute (output_select non-optional)
Previous output_select used ``optional:true`` to mask unresolved
``${...}`` refs from inactive branches. That suppression also swallowed
real expression bugs. With the new compile-failure fallback path
(``${prefix}_compile_fallback``), there are now four possible result
refs but only one runs per execution.
Replaced with a ``safe()`` helper inside the JS expression that detects
Conductor's literal-``${...}`` markers (constructed via
String.fromCharCode(36) so this script's own source isn't pre-resolved)
and treats them as null. Coalesces in priority order:
plan_exec → fallback → compile_fallback → noplan_fallback. Drops
``optional:true`` so genuine expression failures now surface.
MultiAgentCompiler.isToolRegisteredInHarness (namespace tightened)
Previously recursed into sub-agents. The plan_reader SIMPLE task is
emitted in the parent harness's task namespace, so a tool registered
only on a deeper sub-agent's worker would not be polled for the
parent's task name — a silent runtime hang. Restricted the check to
harness-level tools only. The user must declare the plan_source tool
on the harness.
testPlanExecutePlanSourceWithRegisteredToolCompiles renamed to
testPlanExecutePlanSourceWithHarnessLevelToolCompiles and updated to
put the tool on the harness (not the planner sub-agent). New
testPlanExecutePlanSourceWithSubAgentOnlyToolIsRejected counter-test
asserts a tool only on a sub-agent now fails compile with
IllegalArgumentException.
100_issue_fixer_agent.py (brittleness documented)
The ``coder`` PLAN_EXECUTE harness has one sub-agent and no fallback
(per commit 1fad27a). Any compile/extract/validation/tool failure
TERMINATEs the SUB_WORKFLOW, ending the SWARM iteration without a
chance for QA feedback recovery. Added a comment block above the agent
definition explaining the trade-off and how to add a fallback if
agentic recovery is wanted.
deterministic-coding-workflows.md (failure-mode table expanded)
Replaced the 3-row failure table with a 7-row enumeration covering
every failure mode the implementation handles: no fence, compile
errors (cycle/duplicate-id/unsafe-condition/bad-schema), null-def
defensive case, parse failures inside generated ops, tool failures,
validation failures, and plan_source.tool misconfiguration. Each row
notes the detection point and the routing.
Verification: server suite green (43 MultiAgentCompiler tests, 19
PlanCompilerScript tests, all other compilers unchanged).
…lidation result, brace-matching, error message Addresses the 5 findings from a third /dg adversarial review (2 important + 3 minor). Two of the importants are direct consequences of fixes from prior rounds. JavaScriptBuilder.compilePlanToWorkflowScript: Half-wired ambient input forwarding (the fix from re-review #2 was incomplete). Parent harness forwards cwd/credentials/media to the SUB_WORKFLOW input correctly, but compilePlanToWorkflowScript only injected __agentspan_ctx__ and session_id into per-tool SIMPLE task inputParameters — so tools saw workflow.input.cwd populated at the dynamic-workflow level but their own input maps didn't include it. Filesystem tools running inside a plan_execute strategy were effectively rootless. Refactored: added an injectAmbient(args) helper at the top of the script that sets all five ambient inputs (__agentspan_ctx__, session_id, cwd, credentials, media). Replaced six per-site __agentspan_ctx__/session_id assignments (static op, generated op toolInputs, validation, on_success, on_failure) with single injectAmbient(...) calls. DRYs the change and prevents per-site drift. Static 'completed' literal shadowing fallback output (consequence of the re-review #1 optional:true removal): when plan.validation is absent, outputParameters emitted result='completed' / status='completed' as literal strings — regardless of actual completion. Conductor resolves output mapping on FAILED workflows too. Then the parent's output_select coalesce read _plan_exec.output.result first, saw the literal 'completed', decided it was truthy and not a ${...} placeholder, and returned it. Fallback ran but its recovered output was shadowed. Tests masked it because every test path used a validation block. Fixed by tracking lastOpRef during step emission (= taskReferenceName of the last top-level task in each step). When lastAggRef is null, result is now ref(lastOpRef + '.output.result') — pointing at the last tool's actual output. On a FAILED workflow this resolves to a literal ${...} string which the parent's safe() helper detects and skips, letting the fallback's recovered output surface as the harness result. Status output keeps the 'completed' literal as a placeholder for the no-validation path — that string is informational only and not in the fallback-shadowing chain. JavaScriptBuilder.extractJsonFenceScript Case 4: Brace-matching JSON extraction counted { / } without tracking string- literal state. {"key": "}{}"} miscounted and sliced the wrong substring. Now tracks in-string state with backslash-escape handling during the brace walk so braces inside string values are ignored. MultiAgentCompiler.isToolRegisteredInHarness error message: Said "is not registered as a tool on '<harness>' or any of its sub- agents" but the validator only checks harness-level tools (after the re-review #2 namespace tightening). The message gave users false hope of putting the tool on a sub-agent — they'd move it and fail again. Updated to "is not registered as a harness-level tool on '<harness>'. The plan_reader task is emitted in the harness's task namespace, so the tool must be declared in tools=[...] on the harness itself (declaring it on a sub-agent does not work)." Tests: - testInnerToolTasksReceiveCwdCredentialsMedia (new): compiles a real plan with static op + generated op + validation + on_success + on_failure, walks every emitted SIMPLE task, asserts every one of them contains cwd/credentials/media/session_id/__agentspan_ctx__ as inputParameters. Catches the prior incomplete-plumbing regression. - testNoValidationPlanResultPointsAtLastTaskNotLiteral (new): asserts outputParameters.result references a task output (${...}.output.result) instead of the literal 'completed' for plans without a validation block. Locks the fix against future regression. Verification: PlanCompilerScriptTest 21/21, MultiAgentCompilerTest 43/43, full server suite green.
…ator, sibling-fix gaps
Addresses 4 findings from a fourth /dg adversarial review (1 critical +
3 important). Three of the four are 'fix-didn't-reach-siblings' from
prior rounds — patches that landed at one site but missed structurally
identical neighbors. Adds a structural invariant test to catch this
class of regression going forward.
JavaScriptBuilder.compilePlanToWorkflowScript — injectAmbient ordering
CRITICAL. The LLM-generated tool branch built toolInputs with
``var toolInputs = injectAmbient({})`` first, then overlaid LLM-driven
schema keys on top. The static-args branch did the inverse (correct)
ordering: copy LLM args first, ambient override last. The comment on
the helper definition explicitly said ambient values are "forced
overrides — LLM-supplied args cannot redirect these."
An LLM-authored ``output_schema`` containing a key named ``cwd``,
``credentials``, ``media``, ``session_id``, or ``__agentspan_ctx__``
could redirect the filesystem root or substitute credentials — exactly
the threat the helper exists to prevent.
Fix: changed the LLM branch to start with ``var toolInputs = {}``,
populate from the schema-key loop (or _args fallback in the catch),
and call ``injectAmbient(toolInputs)`` at the end so ambient keys
always win. Mirrors the static-args branch.
JavaScriptBuilder.compilePlanToWorkflowScript — lastOpRef = joinRef
Conductor's JoinTask.output is ``{taskRef → outputMap}`` with no
top-level ``result`` key. ``${joinRef.output.result}`` resolved to a
literal placeholder, the parent's safe() helper stripped it, and any
plan whose terminal step was parallel:true with no validation block
produced an empty result.
Fix: after the JOIN, emit an INLINE aggregator (``parallel_agg_<step>``)
that collects each branch's last task's output into an array and points
lastOpRef at it. ``${aggRef.output.result}`` now resolves to the array
of branch outputs.
JavaScriptBuilder.extractJsonFenceScript — Case 5c not string-aware
Round 3 added string-literal-aware brace matching to Case 4
(planner-text path) but the structurally identical block for
planReaderContent (Case 5c) was untouched. A plan_source tool returning
content with ``"steps"`` or ``{`` / ``}`` inside string literals would
mis-slice.
Fix: extracted the brace-walk into a ``findMatchingBrace(text, openIdx)``
helper at the top of the script and replaced both Case 4 and Case 5c
with helper calls. Single source of truth — future fixes won't drift.
MultiAgentCompiler — plan_reader doesn't forward cwd/credentials/media
Round 3's ambient-forwarding fix landed in compilePlanToWorkflowScript
via injectAmbient but missed the plan_reader SIMPLE task in
buildPlanExecutionBranch. A reader tool needing cwd (e.g. workspace
filesystem reads) would silently fail and fall through to the no_plan
branch.
Fix: added cwd/credentials/media forwards to readerInputs alongside
the existing session_id/__agentspan_ctx__ forwards. Kept optional:true
with a comment explaining why (a reader pointed at a not-yet-written
contextbook section is a normal "no fallback content" condition; the
no_plan SWITCH path surfaces real failures via the fallback agent or
TERMINATE).
Tests:
- testEverySimpleTaskHasFiveAmbientKeys (NEW, structural invariant) —
compiles a plan exercising static + generated + parallel + validation
+ on_success + on_failure ops, walks every emitted SIMPLE task
(recursing into FORK_JOIN branches AND SWITCH decisionCases for the
parseGate-wrapped tool tasks), and asserts every one carries cwd,
credentials, media, session_id, __agentspan_ctx__. This is the
"fix-didn't-reach-siblings" guard — if any future tool emission
forgets injectAmbient(), this test fails immediately.
- testGeneratedOpAmbientKeysWinOverLLMSuppliedSchemaKeys (NEW,
regression for the critical) — compiles a plan whose output_schema
contains keys named ``cwd``, ``credentials``, ``media`` (the
LLM-driven attack shape) plus a non-colliding ``safe_field``. Asserts
the emitted SIMPLE task's cwd/credentials/media point at
workflow.input.* (ambient won), and safe_field points at the parsed
result (legitimate non-ambient flow).
- collectTasks helper now recurses into SWITCH decisionCases and
defaultCase — required for the structural invariant test to walk
the parseGate-wrapped tool tasks.
Verification: PlanCompilerScriptTest 23/23, MultiAgentCompilerTest 43/43,
full server suite green.
… SWITCH for result references
Addresses 3 findings from a fifth /dg adversarial review (2 important,
1 minor). Same root cause as the round-4 lastOpRef-=-joinRef fix,
recurring at a third host: parseGate SWITCH terminals.
Root pattern (round 4 + round 5): wrapper task types like SWITCH
(parseGate) and JOIN don't expose a meaningful ``.result`` on their
output. SWITCH outputs the case decision; JOIN outputs
``{taskRef → outputMap}``. References like ``${pgate_*.output.result}``
or ``${join_*.output.result}`` resolve to literal placeholders.
The round-4 fix wrapped JOIN with a parallel_agg INLINE that produces
a real ``.result``. But the same disease afflicts parseGate SWITCHes —
when a generated op terminates a branch (``chain.push(parseGate)``),
the chain's terminal is the SWITCH, not the inner tool. parallel_agg's
per-branch inputs and the sequential lastOpRef both ended up pointing
at the SWITCH wrapper.
Fix: introduced ``innerRefMap`` and ``terminalRef(task)``. When emitting
a parseGate, register ``innerRefMap[parseGate.taskReferenceName] = toolRef``
— the actual SIMPLE tool task wrapped by the SWITCH. ``terminalRef(task)``
returns the inner ref when one exists, else the task's own ref. Use
``terminalRef`` at:
- parallel_agg's per-branch inputs (was ``ref(joinOn[ja] + ...)``,
now ``ref(terminalRef(branches[ja][...]) + ...)``). joinOn keeps
the SWITCH ref for ordering.
- sequential lastOpRef assignment (was branch task's ref directly,
now ``terminalRef(branches[b2][t])``).
Tests:
- testSequentialTerminalGeneratedOpResultPointsAtInnerTool (NEW) —
sequential plan ending in generated op asserts outputParameters.result
references ``t_*`` (inner tool), not ``pgate_*`` (SWITCH wrapper).
- testParallelTerminalGeneratedOpAggregatorPointsAtInnerTool (NEW) —
parallel branches all ending in generated ops; asserts parallel_agg
inputs ``b0``, ``b1`` reference ``t_s1_*`` (inner tools), not
``pgate_s1_*``.
Both tests are CLAUDE.md-compliant: written first, manually verified
to fail against pre-fix code (lastOpRef pointed at parseGate SWITCH),
then pass after the terminalRef change.
Three rounds of the same root cause:
- round 3: cwd/credentials/media — parent forwarded to SUB_WORKFLOW
input, missed per-tool task injection (sibling miss).
- round 4: lastOpRef = joinRef, JOIN has no .result (sibling miss).
- round 5: lastOpRef = parseGate SWITCH, SWITCH has no .result
(sibling miss).
The structural invariant test from round 4 caught ambient-key
forgetfulness but didn't assert result-reference correctness for
terminal tasks. The two new round-5 tests close that specific gap.
Verification: PlanCompilerScriptTest 25/25, MultiAgentCompilerTest 43/43,
full server suite green.
… nudge + non-empty SWITCH passed branch
Bumped Conductor from 3.3.0-SNAPSHOT (local build) to 3.30.0.rc12.
Two latent regressions surfaced under the new server.
JavaScriptBuilder.compilePlanToWorkflowScript — user-message json mention
OpenAI's Responses API requires the literal word "json" in input
(user) messages when ``text.format`` is ``json_object``. The system
message's "JSON" mention is not sufficient — Responses API checks
user-message content specifically.
My round-1 fix had removed the trailing "Respond with valid JSON only."
user-message nudge on the rationale that system prompt + jsonOutput:true
carried the contract. That dismissal as "cargo cult" was wrong: it was
the OpenAI-required guard.
Restored as "Respond as json." (minimal, lowercase, satisfies the
check). Comment updated to explain why this is required, not redundant.
JavaScriptBuilder.compilePlanToWorkflowScript — empty SWITCH branch
Conductor 3.30's SWITCH falls through to defaultCase when the matched
decision case is empty. With the round-2 inversion
(``decisionCases:{passed: onSuccess}``, ``defaultCase: onFailure``)
and the common case of no on_success actions, val_agg='passed'
matched the 'passed' case (0 tasks), Conductor fell through to
defaultCase, TERMINATE ran, workflow falsely FAILED.
Insert a no-op INLINE in the 'passed' branch when on_success is empty
so the matched case is never empty. The no-op produces a sentinel
result downstream consumers ignore.
Tests:
- testValidationPassedBranchHasNoOpWhenOnSuccessIsEmpty (NEW) —
asserts the 'passed' branch is non-empty when no on_success is
declared, and the inserted task is the ok_noop INLINE.
- testValidationPassedBranchPreservesOnSuccessTasks (NEW) — counter-
test: when on_success IS provided, those tasks land in the passed
branch and the no-op is NOT inserted.
Verification:
- PlanCompilerScriptTest: 27/27 (was 25; +2 regression tests)
- MultiAgentCompilerTest: 43/43
- Python unit tests: 1629/1629
- Python e2e tests/integration/test_plan_execute_live.py: 3/3 in 31.9s
(under Conductor 3.30.0.rc12 with real OpenAI calls)
bradyyie
reviewed
May 12, 2026
bradyyie
left a comment
Collaborator
There was a problem hiding this comment.
QA review: This PR has merge conflicts. It's a massive change (80+ files) spanning Python SDK, TypeScript SDK, Java SDK, and server — adds plan-execute harness, coding agent, worker liveness, state updates protocol, and many tests. Too large and conflicting to approve as-is. Recommend: rebase against main and consider splitting the liveness, plan-execute, and coding agent features into separate PRs.
# Conflicts: # sdk/java/src/main/java/ai/agentspan/internal/ToolRegistry.java # sdk/python/src/agentspan/agents/agent.py # sdk/python/src/agentspan/agents/tool.py # server/src/main/java/dev/agentspan/runtime/model/AgentConfig.java
Move the coding-agent and generic-agent harness design docs out of this PR.
The harness module itself is uncommitted WIP and will land on the separate
feat/coding-agent-harness branch.
This PR now covers only:
a) stateful-tools fixes in the SDKs (domain propagation, worker liveness,
handoff_check registration, stateful re-registration on domain change)
b) PAC/PAE — Strategy.PLAN_EXECUTE, prefill_tools, planSource, success
condition eval, parallel FORK_JOIN validations
Continuation of PAC/PAE work, restored from in-progress WIP after the
main-sync split:
* PAC (Plan-And-Compile): plans.py SDK helpers, PlanAndCompileTask + config
on the server, MultiAgentCompiler/AgentCompiler/ToolCompiler/GuardrailCompiler
+ accompanying tests (SynthOutputScriptTest, EnrichToolsScriptTest,
PlanAndCompileTaskTest, AgentCompileE2ETest).
* PAE (Plan-And-Execute): docs/concepts/plan-execute.md, integration tests
(test_plan_execute_live, test_pac_toolType_routing_e2e), example flows
(103_plan_and_compile.py, 104_plan_execute_guardrails.py,
106_plan_execute_agent_fanout.py, 107_pac_mcp_proof.py).
* UI: CompiledPlanView, AgentExecutionDiagram updates, agentExecutionUtils,
LeftPanelTabs, Execution.jsx state machine for compiled-plan rendering.
* SDK plumbing: ModelContextWindows, WorkflowTaskUtils, JavaScriptBuilder,
AgentChatCompleteTaskMapper, AgentRun/StartRequest/GoogleADKNormalizer.
* Java SDK examples: Example48Planner; Agent + AgentConfigSerializer
updates.
* Worker domain contract spec (docs/design/WORKER_DOMAIN_CONTRACT.md).
* Restored main-side fields that the stash apply had overwritten:
AgentConfig.maskedFields, Agent.synthesize/masked_fields,
ToolDef.retry_count/retry_delay_seconds, plus their serializer paths.
… tests The WIP stash apply in be808e0 clobbered three main-side PRs in runtime.py (the merge had them, the WIP version did not). Restoring them and fixing two unrelated CI breakers introduced by the same WIP commit. runtime.py — restore three main PRs: * #200 fix(python-sdk): register transfer tool workers for hybrid agents (this is a stateful-workers fix — without it, hybrid agents with both tools and sub-agents deadlock because the transfer_to_<name> SIMPLE tasks have no Python workers polling for them) -> re-add _register_hybrid_transfer_workers method + its call site -> include transfer_to_<sub> names in _collect_worker_names * #196 fix(sdk): guard _collect against non-Agent sub-agents (LangGraph) -> add isinstance() guard in both _collect inner functions * #188 feat(sdk): retry_count / retry_delay_seconds in @tool decorator -> re-add params to _default_task_def signature -> tool_registry.py passes ToolDef's retry_count/retry_delay_seconds server/build.gradle — revert conductorVersion to 3.30.0.rc12. The WIP bumped to rc13 which lives only in a local conductor checkout (not published to Maven). CI server-tests couldn't resolve the JARs. sdk/typescript/src/types.ts — ToolDef.call is now optional (call?()). The WIP made it required, breaking ToolDef literals like the one in code-execution.ts:246 (CodeExecutor.asTool()) which don't need to be prefill-callable. tests/unit/test_contextbook_flow.py — three stale assertions updated to match the WIP's intentional behavior changes in _issue_fixer_tools.py: * VALID sections set now includes 'task_brief' (added by WIP) * Fetcher write step now also writes task_brief (matches the new _fetcher_done gate in 100_issue_fixer_agent.py) * Repeat-read assertion checks the new 'REPEAT READ #4' / 'STOP RE-READING' banner instead of the old 'repeat read limit exceeded' substring. Local verification: * Python: 1689 unit tests pass (was 1681 pass + 8 fail) * TypeScript: full build + dts build succeed * Server: rc12 is the latest published Maven version (the previous coding_agent state used rc12 successfully)
Two more carry-overs from the WIP overwrite, both blocking server-tests in CI: AgentService.java — restore main PR #195's deleteExecutionRecord and pruneExecutions methods (plus the java.time imports they need). The WIP version of AgentService.java predates that PR, but AgentController still calls both methods — the compileJava target failed with 'cannot find symbol' until these were re-added. server/.../util/PlanCompilerScriptTest.java — delete (WIP intended). The test exercised JavaScriptBuilder.compilePlanToWorkflowScript(), which the PAC/PAE refactor removed. Replacement coverage now lives in PlanAndCompileTaskTest, SynthOutputScriptTest, EnrichToolsScriptTest, and ModelContextWindowsTest — all already committed in be808e0. Local verification: ./gradlew test → BUILD SUCCESSFUL, 578 tests, 0 failures, 0 errors
The project enforces a 'no inline fully-qualified names' lint via a
custom Gradle task. Two carry-overs from the WIP overwrite violated it:
* AgentCompiler.java:1257 — `.collect(java.util.stream.Collectors.joining(", "))`
-> add `import java.util.stream.Collectors;`, use `Collectors.joining`
* AgentChatCompleteTaskMapper.java:665 — `java.util.Set<Integer> ... = new java.util.LinkedHashSet<>();`
-> add `import java.util.LinkedHashSet;`, drop the `java.util.` prefix
(`java.util.Set` is already imported via `import java.util.Set;`)
Verified:
./gradlew checkNoInlineFQN -> BUILD SUCCESSFUL
./gradlew test -> BUILD SUCCESSFUL (578 tests, 0 fail)
Run ./gradlew :spotlessApply on the 5 WIP files that violated formatting: * AgentCompiler.java * PlanAndCompileTask.java * AgentChatCompleteTaskMapperTest.java * AgentCompilerTest.java * PlanAndCompileTaskTest.java Mechanical line-break / chain-call reformatting; no behavior change. Verified: ./gradlew :spotlessJavaCheck -> BUILD SUCCESSFUL ./gradlew test -> BUILD SUCCESSFUL
…sult.error Carry-over from the WIP overwrite: PR #201 was on main when we merged it in, but the stash apply rolled runtime.py back to the pre-#201 version. Restored: * Two _run* tail paths inspect tasks for the first FAILED task and use its reasonForIncompletion (prefixed with the task's referenceTaskName) as AgentResult.error, falling back to the workflow-level reason. * _extract_failed_task_reason helper. So instead of the opaque workflow reason, callers now see something like 'Task ''manager_llm'' failed: LLM API returned 429' — diagnosable without opening the execution history UI.
This was referenced May 14, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.