
Experimental: multi-agent foundation (Phase 0) — AgentDefinition + tier slots + TOML registry#313

Merged
quangdang46 merged 28 commits into
masterfrom
experimental/multi-agent-foundation
Jun 5, 2026


@quangdang46
Owner

@quangdang46 quangdang46 commented May 25, 2026

Summary

Multi-agent architecture foundation for jcode adapted from CodebuffAI's design.

Phase 0 — Foundation (commits 1-3)

| Component | Description |
|---|---|
| `AgentDefinition` | Declarative schema (id, model, tools, prompt, reasoning, output mode, etc.) |
| `ModelTier` | `Routine` / `Thinking` enum; maps to env vars + session-gateway routing |
| `OutputMode` | `LastMessage` / `AllMessages` / `StructuredOutput`; controls how tool results surface |
| `AgentRegistry` | TOML directory loader for `.jcode/agents/*.toml` with round-trip validation |
| Cross-ref validation | Checks that `tool_names` and `spawnable_agents` IDs exist in the available tool/agent set; run by the caller at spawn time, not at TOML-load time |
| Skill MAS bridge | MAS-prefixed skill names invoke `jcode-skill-{name}` binaries |
| `sample_agents.rs` | 6 integration tests (bundled + disk-loaded TOML agents) |

Phase 1 — Agent TOML definitions

| Agent | Configuration |
|---|---|
| `basher.toml` | Routine + Minimal reasoning, bash-only leaf agent |
| `editor.toml` | Thinking + Medium reasoning, full edit toolkit (8 tools) |

Phase 4 — Prompt utilities

| Item | Description |
|---|---|
| `prompt_placeholders.rs` | `{{FILE_TREE}}`, `{{CURRENT_DATE}}`, `{{REMAINING_STEPS}}`, `{{KNOWLEDGE_FILES}}`, `{{GIT_CHANGES}}` substitution engine |
| `wrap_as_system_reminder()` | Wraps harness step prompts in `<system_reminder>` tags |
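A minimal sketch of the placeholder engine and the reminder wrapper, assuming a fixed context struct. `PromptContext` and its field names are hypothetical, and so is the exact newline layout inside the tags; the real `substitute_context_placeholders` signature is not shown in this PR. Unknown placeholders are deliberately left in place.

```rust
/// Hypothetical fixed context for the five placeholders listed above.
pub struct PromptContext {
    pub file_tree: String,
    pub current_date: String,
    pub remaining_steps: u32,
    pub knowledge_files: String,
    pub git_changes: String,
}

/// Replace each known `{{NAME}}` placeholder with its value; anything else is left untouched.
pub fn substitute_context_placeholders(template: &str, ctx: &PromptContext) -> String {
    let steps = ctx.remaining_steps.to_string();
    let bindings = [
        ("{{FILE_TREE}}", ctx.file_tree.as_str()),
        ("{{CURRENT_DATE}}", ctx.current_date.as_str()),
        ("{{REMAINING_STEPS}}", steps.as_str()),
        ("{{KNOWLEDGE_FILES}}", ctx.knowledge_files.as_str()),
        ("{{GIT_CHANGES}}", ctx.git_changes.as_str()),
    ];
    let mut out = template.to_string();
    for &(placeholder, value) in bindings.iter() {
        out = out.replace(placeholder, value);
    }
    out
}

/// Wrap a harness step prompt in `<system_reminder>` tags (layout assumed).
pub fn wrap_as_system_reminder(text: &str) -> String {
    format!("<system_reminder>\n{}\n</system_reminder>", text)
}
```

A fixed struct (rather than a `HashMap` of bindings) is what distinguishes this from `prompt_templates::substitute_placeholders`, per the rename fix later in this PR.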

Phase 5 — JBench evaluation framework

| File | Contents |
|---|---|
| `evals/jbench/src/agent_runner.rs` | `run_agent_in_repo()`: spawns a jcode subprocess, streams stdout, captures the diff via `git diff HEAD` |
| `evals/jbench/src/judge.rs` | `judge_with_three_models()`: GPT + Gemini + Claude in parallel, median analysis + averaged scores |
| `evals/jbench/src/lessons.rs` | `extract_lessons()` + `append_lessons_to_file()`: per-agent lessons accumulation |
| `evals/jbench/src/bin/jbench.rs` | CLI with `run` and `meta-analyze` implemented; `pick-commits` / `gen-evals` / `judge` are Phase stubs |

Test results

jcode-agent-runtime: 49 unit + 6 integration = 55 passed, 0 failed
jcode-jbench types: 3 roundtrip tests passed
cargo check --bin jcode: OK

🤖 Generated with Claude Code

quangdang46 and others added 7 commits May 25, 2026 21:44
…ase 0.1+0.2)

Lay the foundation for declarative agent definitions based on
Codebuff's AgentDefinition schema, adapted to jcode's single-OAuth
provider reality:

- signals.rs: existing soft-interrupt + cancellation primitives moved
  into a named module; root-level re-exports preserved so src/agent.rs
  consumers compile unchanged.
- definition.rs: AgentDefinition struct (id, model_override, prefer_tier,
  reasoning, tool_names, spawnable_agents, prompts, output_mode,
  inherit_parent_system_prompt, include_message_history) with TOML
  round-trip + validation for id format, system_prompt vs inherit
  conflict, structured_output schema requirement, self-spawn, and
  duplicate tool/agent ids.
- tier.rs: user-defined tier slot (routine/thinking) backed by the
  same JCODE_ROUTING_* env vars as model_routing.rs (#100). NOT a
  catalog — agents inherit session model when no tier is configured,
  so subscription users (Claude Pro / ChatGPT Plus / Gemini Advanced)
  see no behavior change. Pay-per-token users opt in by setting two
  env vars.
- reasoning.rs: ReasoningEffort enum (minimal/low/medium/high).
- output.rs: OutputMode enum (last_message/all_messages/structured_output).
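The validation rules listed for `definition.rs` can be sketched as a single pass over a cut-down struct. This is not jcode's `AgentDefinition`: `AgentDef`, `output_schema`, and the error variants are hypothetical, and the id rule (lowercase ASCII, digits, hyphens) is an assumption consistent with sample ids such as `file-picker`.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OutputMode { LastMessage, AllMessages, StructuredOutput }

/// Cut-down stand-in for AgentDefinition: only the fields the checks need.
pub struct AgentDef {
    pub id: String,
    pub system_prompt: Option<String>,
    pub inherit_parent_system_prompt: bool,
    pub output_mode: OutputMode,
    pub output_schema: Option<String>, // hypothetical field name
    pub tool_names: Vec<String>,
    pub spawnable_agents: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum DefinitionError {
    InvalidId(String),
    PromptConflict,
    MissingSchema,
    SelfSpawn,
    DuplicateTool(String),
    DuplicateAgent(String),
}

/// Return the first name that appears twice, if any.
fn first_duplicate(items: &[String]) -> Option<&str> {
    let mut seen = HashSet::new();
    for item in items {
        if !seen.insert(item.as_str()) {
            return Some(item.as_str());
        }
    }
    None
}

impl AgentDef {
    pub fn validate(&self) -> Result<(), DefinitionError> {
        // Assumed id format: non-empty, lowercase ASCII letters, digits, hyphens.
        let id_ok = !self.id.is_empty()
            && self.id.chars().all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-');
        if !id_ok {
            return Err(DefinitionError::InvalidId(self.id.clone()));
        }
        // An own system prompt contradicts inheriting the parent's.
        if self.inherit_parent_system_prompt && self.system_prompt.is_some() {
            return Err(DefinitionError::PromptConflict);
        }
        if self.output_mode == OutputMode::StructuredOutput && self.output_schema.is_none() {
            return Err(DefinitionError::MissingSchema);
        }
        if self.spawnable_agents.iter().any(|a| a == &self.id) {
            return Err(DefinitionError::SelfSpawn);
        }
        if let Some(t) = first_duplicate(&self.tool_names) {
            return Err(DefinitionError::DuplicateTool(t.to_string()));
        }
        if let Some(a) = first_duplicate(&self.spawnable_agents) {
            return Err(DefinitionError::DuplicateAgent(a.to_string()));
        }
        Ok(())
    }
}
```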

32 unit tests pass. Full `cargo check --bin jcode` succeeds.

This is Phase 0 of the multi-agent foundation — no runtime engine
changes yet. Next: TOML loader for .jcode/agents/*.toml + builtin
embedded agents (Phase 0.3).
…hase 0.3)

Discover and load AgentDefinition files from three locations with
priority order:

  1. <project>/.jcode/agents/*.toml   (project-local, highest)
  2. ~/.jcode/agents/*.toml           (user-global)
  3. AgentRegistry::register_builtin  (compiled-in defaults, lowest)

Project-local overrides user-global overrides builtin. Re-registering
a builtin after a higher-priority entry is loaded does NOT clobber the
override — the priority check is symmetric in `insert`.
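The symmetric priority check in `insert` can be sketched with an ordered source enum. `PriorityRegistry`, `Source`, and `Entry` are hypothetical stand-ins for `AgentRegistry` internals, and letting a same-priority re-insert win is an assumption.

```rust
use std::collections::HashMap;

/// derive(Ord) orders variants by declaration, so Builtin < User < Project.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Source { Builtin, User, Project }

pub struct Entry {
    pub source: Source,
    pub body: String,
}

#[derive(Default)]
pub struct PriorityRegistry {
    entries: HashMap<String, Entry>,
}

impl PriorityRegistry {
    /// Store the entry unless one from a strictly higher-priority source exists.
    /// The same check runs no matter which order sources are loaded in.
    pub fn insert(&mut self, id: &str, source: Source, body: &str) -> bool {
        if let Some(existing) = self.entries.get(id) {
            if existing.source > source {
                return false; // e.g. a builtin must not clobber a project override
            }
        }
        self.entries.insert(id.to_string(), Entry { source, body: body.to_string() });
        true
    }

    pub fn get(&self, id: &str) -> Option<&Entry> {
        self.entries.get(id)
    }
}
```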

Design choices:

- Filename must match `<id>.toml` so users can find agents by id without
  opening every file. Mismatches are surfaced as a load error rather
  than silently misindexing.
- Malformed/invalid files are collected as non-fatal LoadError entries
  so a single bad file doesn't prevent the rest of the registry from
  loading. `jcode doctor` (future) reads load_errors() to surface
  these.
- AgentRegistry intentionally does NOT cross-reference `tool_names` /
  `spawnable_agents` — that's done at spawn time because the tool
  universe may be feature-gated (Phase 0.4).
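The first two design choices (filename must equal `<id>.toml`, and bad files become non-fatal `LoadError`s) can be sketched as one loop. To stay dependency-free, the `parse` closure stands in for TOML parsing plus validation and returns the agent id; `load_dir` and the `LoadError` fields are hypothetical.

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, PartialEq)]
pub struct LoadError {
    pub path: PathBuf,
    pub message: String,
}

/// Load every `*.toml` path; collect failures instead of aborting on the first one.
pub fn load_dir<F>(paths: &[&Path], parse: F) -> (Vec<String>, Vec<LoadError>)
where
    F: Fn(&Path) -> Result<String, String>,
{
    let mut loaded = Vec::new();
    let mut errors = Vec::new();
    for &path in paths {
        if path.extension().and_then(|e| e.to_str()) != Some("toml") {
            continue; // not an agent definition file
        }
        let result = parse(path).and_then(|id| {
            // The file stem must equal the declared id so users can find agents by name.
            let stem = path.file_stem().and_then(|s| s.to_str());
            if stem == Some(id.as_str()) {
                Ok(id)
            } else {
                Err(format!("filename does not match id `{}`", id))
            }
        });
        match result {
            Ok(id) => loaded.push(id),
            Err(message) => errors.push(LoadError { path: path.to_path_buf(), message }),
        }
    }
    (loaded, errors)
}
```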

41 unit tests pass (32 prior + 9 new). `cargo check --bin jcode` succeeds.
… agents (Phase 0.4-0.6)

Phase 0.4 — Cross-reference validation:
  - ReferenceError enum (UnknownTools, UnknownSpawnableAgents) kept
    separate from DefinitionError because the runtime tool/agent
    universe isn't known at TOML-load time.
  - AgentDefinition::validate_tool_references<I, S>() and
    validate_spawn_references<I, S>() — caller passes the available
    name set, gets back a sorted, comma-joined list of unknowns.
  - 5 new tests covering the happy path, unknowns, empty lists,
    and deterministic alphabetical ordering of the error message.

  This deliberately does NOT modify src/tool/mod.rs. The whitelist
  check is a pure function over the agent definition + a name set;
  no need to refactor tool dispatch. Phase 1 will wire the actual
  tool registry into the spawn path.
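The reference checks described above are pure functions over a definition plus a name set. Here is a sketch written as free functions over slices rather than as `AgentDefinition` methods (the helper `unknown_names` is hypothetical). It reproduces the sorted, comma-joined error text.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum ReferenceError {
    UnknownTools(String),
    UnknownSpawnableAgents(String),
}

/// Names in `requested` missing from `available`, sorted and comma-joined.
fn unknown_names<I, S>(requested: &[String], available: I) -> Option<String>
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    let available: HashSet<String> =
        available.into_iter().map(|s| s.as_ref().to_string()).collect();
    let mut unknown: Vec<&str> = requested
        .iter()
        .map(String::as_str)
        .filter(|n| !available.contains(*n))
        .collect();
    if unknown.is_empty() {
        return None;
    }
    unknown.sort_unstable(); // deterministic, alphabetical error message
    Some(unknown.join(", "))
}

pub fn validate_tool_references<I, S>(tool_names: &[String], available: I) -> Result<(), ReferenceError>
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    match unknown_names(tool_names, available) {
        Some(list) => Err(ReferenceError::UnknownTools(list)),
        None => Ok(()),
    }
}

pub fn validate_spawn_references<I, S>(spawnable: &[String], available: I) -> Result<(), ReferenceError>
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    match unknown_names(spawnable, available) {
        Some(list) => Err(ReferenceError::UnknownSpawnableAgents(list)),
        None => Ok(()),
    }
}
```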

Phase 0.5 — Skill MAS (#94) bridge:
  - AgentRegistry::lookup_for_skill_routing(skill_agent_id) — named
    alias of get() that documents the integration point with the
    SKILL.md  field. Returns None for missing references; the
    skill activation site decides fallback policy.
  - 2 tests: hit + miss.

Phase 0.6 — Sample agents + integration test:
  - .jcode/agents/file-picker.toml — Routine tier, no message history,
    leaf agent. Demonstrates file-picker pattern adapted from Codebuff.
  - .jcode/agents/code-reviewer.toml — Thinking tier with
    inherit_parent_system_prompt=true to demonstrate the prompt-cache
    prefix-sharing trick (~90% input-token savings on cache hits).
  - tests/sample_agents.rs — integration test loads both files via the
    public AgentRegistry API and asserts shape + behavior. 4 tests.

Phase 0 totals: 49 unit + 4 integration = 53 tests, all passing.
`cargo check --bin jcode` succeeds (full workspace, 3m13s).

Phase 0 (foundation) is now complete:
  - Schema: AgentDefinition + ModelTier + OutputMode + ReasoningEffort
  - Loader: registry with priority order (project > user > builtin)
  - Validation: id format, internal invariants, cross-references
  - Sample agents demonstrating cache-hit and tier patterns
  - Skill MAS (#94) integration point established

Phase 1 (4 builtin agents + spawn_agents tool + cache benchmark) is
the next track.
…prompt utilities, sample agents

Phase 1: two real TOML agent definitions (basher + editor) with full schema
Phase 4: `prompt_placeholders.rs` — `{{FILE_TREE}}`, `{{CURRENT_DATE}}`, etc.
Phase 4: `wrap_as_system_reminder()` in `src/agent/prompting.rs`
Phase 5: `evals/jbench/` scaffold — types, judge stub, lessons stub, agent_runner stub
Phase 0.6: integration tests `basher_sample_has_expected_shape` + `editor_sample_has_expected_shape`

All jcode-agent-runtime tests pass (49 unit + 6 integration).
…, CLI

Phase 5.3 (agent_runner): `run_agent_in_repo()` spawns jcode subprocess
  with prompt on stdin, streams stdout, captures trace + diff via
  `git diff HEAD`. Uses `timeout()` for per-run deadline.

Phase 5.4 (judge): `judge_with_three_models()` runs GPT + Gemini +
  Claude judges in parallel via OpenAI Responses API + Anthropic
  Messages API. Median analysis, averaged scores. `run_single_judge()`
  exposes per-judge entry point.
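One plausible reading of "median analysis, averaged scores" is to keep the written analysis from the judge whose overall score is the median and to average each numeric score across judges. The sketch below implements that reading only; `Scorecard`'s `analysis` field and the `combine` function are hypothetical, and the score field names follow the snake_case on-disk names from the deserialization fix.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Scorecard {
    pub analysis: String,
    pub completion_score: f64,
    pub code_quality_score: f64,
    pub overall_score: f64,
}

/// Average every score; take the analysis from the median-overall judge.
/// Returns None when no judge produced a result.
pub fn combine(mut cards: Vec<Scorecard>) -> Option<Scorecard> {
    if cards.is_empty() {
        return None;
    }
    let n = cards.len() as f64;
    let completion = cards.iter().map(|c| c.completion_score).sum::<f64>() / n;
    let quality = cards.iter().map(|c| c.code_quality_score).sum::<f64>() / n;
    let overall = cards.iter().map(|c| c.overall_score).sum::<f64>() / n;
    // total_cmp gives a total order on f64, so sorting cannot panic on NaN.
    cards.sort_by(|a, b| a.overall_score.total_cmp(&b.overall_score));
    let median = cards.swap_remove(cards.len() / 2);
    Some(Scorecard {
        analysis: median.analysis,
        completion_score: completion,
        code_quality_score: quality,
        overall_score: overall,
    })
}
```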

Phase 5.5 (lessons): `extract_lessons()` calls lessons extractor model
  via Responses API. `append_lessons_to_file()` accumulates lessons in
  per-agent JSON files with read-modify-write.

Phase 5.6 (CLI): Full `jbench run` implemented (loads eval JSON, iterates
  commits, calls `run_agent_in_repo`, writes `.run.json` files).
  `jbench meta-analyze` aggregates results. Other subcommands print
  Phase stubs and exit 0.

Bug fixes:
- `JudgingResult: Default` impl added (needed for EvalRun init)
- `OnceLock` for lazy reqwest static client (fixes const-eval restrictions)
- `context` method from `anyhow::Context` imported in bin
Enhance session management, prompt navigation, and macOS shortcuts
Bugs fixed:

1. JudgingResult deserialization (jbench/types.rs)
   The judge prompt schema asks for camelCase fields
   (completionScore, codeQualityScore, overallScore) but
   the Rust struct used snake_case without serde rename.
   parse_scorecard would fail on every real judge response.

   Fix: add #[serde(alias = ...)] on each score field so
   on-disk JSON stays snake_case while LLM-returned
   camelCase still deserializes cleanly.

2. Anthropic judge authentication (jbench/judge.rs)
   run_anthropic_judge used Authorization: Bearer <key>
   which always 401s on the Anthropic Messages API.

   Fix: switch to x-api-key header (Anthropic standard).
   Also split JudgeConfig::api_base / api_key from new
   anthropic_api_base / anthropic_api_key so the Anthropic
   branch can target api.anthropic.com without breaking
   the OpenAI-compatible path. Plumbed through
   run_single_judge.

3. Duplicate substitute_placeholders (src/prompt_placeholders.rs)
   Conflicts with the existing
   prompt_templates::substitute_placeholders. Different
   semantics (fixed context vs HashMap bindings) but same
   name made grep / jump-to-def ambiguous.

   Fix: rename the new one to
   substitute_context_placeholders and document the
   relationship in the doc comment.

4. meta_analyze .run.json filter (jbench/bin/jbench.rs)
   path.extension() returns only the final extension
   ('json'), so matching against "run.json" never fired.
   meta-analyze would always report zero runs.

   Fix: match against file_name().ends_with(".run.json").
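Bug 4 can be reproduced with `std::path` alone. This sketch contrasts the original check with the fix; the function names are hypothetical.

```rust
use std::ffi::OsStr;
use std::path::Path;

/// Original check: extension() only yields the last component ("json"), so this never matches.
pub fn is_run_file_buggy(path: &Path) -> bool {
    path.extension() == Some(OsStr::new("run.json"))
}

/// Fixed check: test the whole file name's suffix.
pub fn is_run_file(path: &Path) -> bool {
    path.file_name()
        .and_then(|n| n.to_str())
        .map_or(false, |n| n.ends_with(".run.json"))
}
```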

Plus:
- Run cargo fmt --all to clear the Format CI job that PR
  #313 was failing.
- Add tests parse_scorecard_accepts_camelcase_from_llm and
  parse_scorecard_accepts_snake_case_from_disk to lock in
  the wire-format contract.
quangdang46 and others added 4 commits May 28, 2026 12:43
…review-issues

fix(agent-runtime): address PR #313 review issues (4 bugs + fmt)
…tions)

Synthesizes best patterns from 9 reference repos:
- AgentPath tree + mailbox (codex)
- Tool-based agent delegation (CC)
- DAG wave parallelism (oh-my-pi)
- Role-based config bundles (opencode + codex)
- Team pipeline lifecycle (oh-my-claudecode)
- Cost aggregation + ancestry tracking (codebuff)

Covers: architecture, types, pseudocode, Rust implementation,
CLI commands, config wiring, test cases, benchmarks, rollout
…ut, field caps, serde strictness

- Gate agent_runner behind 'agent-runner' feature flag
- Add KNOWLEDGE_FILES_MAX_CHARS = 100_000 constant with truncation
- Add #[serde(deny_unknown_fields)] to AgentDefinition
- Per-model timeout in judge_with_three_models (join_all with individual timeouts)
- Fix integer truncation in meta_analyze_impl avg_duration
- Remove stray merge conflict marker in src/lib.rs
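A sketch of the `KNOWLEDGE_FILES_MAX_CHARS` cap. The commit does not say whether the limit counts bytes or characters, or whether a truncation marker is appended. This version assumes Unicode scalar values (`char`) and cuts on a char boundary, so slicing can never panic on multi-byte text. `truncate_chars` is a hypothetical helper name.

```rust
pub const KNOWLEDGE_FILES_MAX_CHARS: usize = 100_000;

/// Keep at most `max_chars` chars, never splitting a multi-byte character.
pub fn truncate_chars(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        // Byte offset of the first char past the limit is a valid slice end.
        Some((byte_idx, _)) => &text[..byte_idx],
        None => text, // already within the limit
    }
}
```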
@quangdang46
Owner Author

Review-swarm fixes applied in 8feff0a

  • agent_runner feature gate: Gated behind agent-runner feature flag (not auto-enabled)
  • KNOWLEDGE_FILES cap: Added KNOWLEDGE_FILES_MAX_CHARS = 100_000 with truncation
  • serde strictness: #[serde(deny_unknown_fields)] on AgentDefinition
  • Per-model timeout: each judge future is wrapped in its own timeout before being passed to join_all, so a slow model only times out its own judgment instead of holding up or failing the others
  • Integer truncation: avg_duration uses f64 division + round() instead of integer division
  • Merge conflict marker: Removed stray <<<<<<< HEAD from src/lib.rs
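The `avg_duration` fix in miniature. The sketch assumes durations are whole milliseconds stored as `u64`; the real field type and unit are not shown. Integer division truncates (1000.5 becomes 1000), while f64 division plus `round()` yields 1001.

```rust
/// Original behavior: integer division truncates toward zero.
pub fn avg_duration_truncating(durations_ms: &[u64]) -> u64 {
    if durations_ms.is_empty() {
        return 0;
    }
    durations_ms.iter().sum::<u64>() / durations_ms.len() as u64
}

/// Fixed behavior: divide in f64, then round to the nearest millisecond.
pub fn avg_duration(durations_ms: &[u64]) -> u64 {
    if durations_ms.is_empty() {
        return 0; // avoid 0/0 = NaN
    }
    let total: u64 = durations_ms.iter().sum();
    (total as f64 / durations_ms.len() as f64).round() as u64
}
```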

Verified: cargo check -p jcode-jbench --features agent-runner and cargo check -p jcode-agent-runtime both pass.

quangdang46 and others added 17 commits June 4, 2026 22:08
- Revert src/lib.rs to master (remove stale 36-module list)
- Move prompt_placeholders.rs from src/ into crates/jcode-app-core/src/
- Add pub mod prompt_placeholders to jcode-app-core/src/lib.rs
- Resolve Cargo.lock merge conflict (hyper/hyper-rustls versions)

Build verified: cargo check --bin jcode passes.
Tests: jcode-agent-runtime 55 pass, jcode-jbench 3 pass.
…son tables + roadmap

feature-planning skill analysis across codebuff, codex, claude-code,
opencode, oh-my-pi, oh-my-openagent, oh-my-claudecode, pi-agent-rust,
oh-my-codex.

Includes:
- 9 per-dimension comparison tables (schema, registry, routing, lifecycle,
  permission, tool, eval, prompt, session)
- Top 5 gaps ranked by ROI
- Wire-up plan for SafetySystem + AgentDefinition.permissionMode
- Phase roadmap (Phase 1 → Phase 5)
- 5 actionable issues with severity and fix suggestions
Add PermissionMode enum to jcode-agent-runtime (mirrors dcg_core::Mode):
- Default: rule-based classification (legacy AUTO_ALLOWED list)
- AcceptEdits: file ops auto-allowed, network/spawn prompt
- Plan: read-only, writes denied without prompting
- DontAsk: allow-listed tools pass, never prompt
- BypassPermissions: skip all evaluation
- Auto: LLM-based classifier decides per call
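A sketch of the enum with its string form. The variants come from this commit, and `#[default]` on `Default` matches the later clippy fix that derives `Default`. The kebab-case strings follow serde's `rename_all = "kebab-case"` convention (so `DontAsk` becomes `dont-ask`), which M5 says `parse()` was aligned with; `ALL` and `as_str` are hypothetical helpers.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum PermissionMode {
    #[default]
    Default,
    AcceptEdits,
    Plan,
    DontAsk,
    BypassPermissions,
    Auto,
}

impl PermissionMode {
    pub const ALL: [PermissionMode; 6] = [
        PermissionMode::Default,
        PermissionMode::AcceptEdits,
        PermissionMode::Plan,
        PermissionMode::DontAsk,
        PermissionMode::BypassPermissions,
        PermissionMode::Auto,
    ];

    /// Parse the kebab-case form used in agent TOML files (case-sensitive).
    pub fn parse(s: &str) -> Option<Self> {
        match s {
            "default" => Some(Self::Default),
            "accept-edits" => Some(Self::AcceptEdits),
            "plan" => Some(Self::Plan),
            "dont-ask" => Some(Self::DontAsk),
            "bypass-permissions" => Some(Self::BypassPermissions),
            "auto" => Some(Self::Auto),
            _ => None,
        }
    }

    pub fn as_str(self) -> &'static str {
        match self {
            Self::Default => "default",
            Self::AcceptEdits => "accept-edits",
            Self::Plan => "plan",
            Self::DontAsk => "dont-ask",
            Self::BypassPermissions => "bypass-permissions",
            Self::Auto => "auto",
        }
    }
}
```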

Add permission_mode: Option<PermissionMode> to AgentDefinition.
When None, agent inherits session-global mode.

Update sample TOML agents:
- basher: accept-edits (auto-approve bash)
- editor: accept-edits (auto-approve file ops)
- file-picker: plan (read-only)
- code-reviewer: plan (read-only)

Tests: 54 unit + 6 integration = 60 passed, 0 failed.

Wire-up plan: at spawn time, convert PermissionMode to dcg_core::Mode
and pass to SubagentTool/SessionToolPolicy for per-agent override.
Add optional max_turns: Option<u32> field that limits the number of
agentic turns an agent may execute before being stopped. Prevents
runaway agents from consuming unbounded tokens/time.

When None, the agent has no per-agent turn limit (session global
limit still applies).

Tests: 56 unit + 6 integration = 62 passed, 0 failed.
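How a per-agent `max_turns` could combine with a session-global limit. This sketch assumes the global limit is also an `Option<u32>` and that the tighter of the two wins. `effective_turn_limit`, `run_turns`, and `StopReason` are hypothetical, and the `step` closure stands in for one agentic turn (returning `true` when the agent is done).

```rust
/// The tighter of the per-agent and session-global limits; None means unlimited.
pub fn effective_turn_limit(agent_max: Option<u32>, session_max: Option<u32>) -> Option<u32> {
    match (agent_max, session_max) {
        (Some(a), Some(s)) => Some(a.min(s)),
        (Some(a), None) => Some(a),
        (None, s) => s,
    }
}

#[derive(Debug, PartialEq)]
pub enum StopReason {
    Finished,
    TurnLimit(u32),
}

/// Run turns until `step` reports completion or the limit is reached.
pub fn run_turns<F: FnMut(u32) -> bool>(limit: Option<u32>, mut step: F) -> StopReason {
    let mut turn = 0;
    loop {
        if let Some(max) = limit {
            if turn >= max {
                return StopReason::TurnLimit(max); // stop a runaway agent
            }
        }
        turn += 1;
        if step(turn) {
            return StopReason::Finished;
        }
    }
}
```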
- basher: max_turns = 10 (quick shell commands)
- file-picker: max_turns = 5 (find files, done fast)
- code-reviewer: max_turns = 15 (review needs more context)
- editor: no limit (complex edits may need many turns)
- extract_diff_from_repo: wrap sync std::process::Command in
  tokio::task::spawn_blocking to avoid blocking the async runtime
- todo_step: use exit code 2 (not implemented) instead of 0 (success)
- Fix unused variable warnings (max_turns, timeout_secs)
- cfg-gate unused imports behind agent-runner feature
Add permission_mode_to_dcg() conversion from PermissionMode to
dcg_core::Mode (free function due to orphan rule).

Add per-session permission mode storage (SESSION_MODES) so subagents
can run under a different mode than the global default:
- set_session_mode(session_id, mode)
- clear_session_mode(session_id)
- session_mode(session_id) -> Option<Mode>

Add classify_for_agent(action, agent_permission_mode) that uses the
agent's mode when set, falling back to global mode otherwise.

Wire SubagentTool to propagate permission_mode from agent definition
to child session via set_session_mode, and clean up on completion.
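A std-only sketch of per-session mode storage with cleanup tied to scope. `Mode` is a stand-in for `dcg_core::Mode`, whose variants are not listed here. The `OnceLock<Mutex<HashMap>>` layout is an assumption, since the commit names the store `SESSION_MODES` but not its type. The guard mirrors the `SessionModeGuard` RAII type that M8 adds later, so the override is cleared even on early return or unwinding.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock, PoisonError};

/// Stand-in for dcg_core::Mode (hypothetical variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode { Default, Plan, AcceptEdits }

fn with_modes<R>(f: impl FnOnce(&mut HashMap<String, Mode>) -> R) -> R {
    static SESSION_MODES: OnceLock<Mutex<HashMap<String, Mode>>> = OnceLock::new();
    let lock = SESSION_MODES.get_or_init(|| Mutex::new(HashMap::new()));
    // Recover from poisoning so a panicking subagent cannot wedge cleanup.
    let mut map = lock.lock().unwrap_or_else(PoisonError::into_inner);
    f(&mut *map)
}

pub fn set_session_mode(session_id: &str, mode: Mode) {
    with_modes(|m| { m.insert(session_id.to_string(), mode); });
}

pub fn clear_session_mode(session_id: &str) {
    with_modes(|m| { m.remove(session_id); });
}

pub fn session_mode(session_id: &str) -> Option<Mode> {
    with_modes(|m| m.get(session_id).copied())
}

/// Agent mode when set, otherwise the global mode.
pub fn mode_for_agent(agent_mode: Option<Mode>, global: Mode) -> Mode {
    agent_mode.unwrap_or(global)
}

/// Sets the override on creation and clears it on drop.
pub struct SessionModeGuard {
    session_id: String,
}

impl SessionModeGuard {
    pub fn new(session_id: &str, mode: Mode) -> Self {
        set_session_mode(session_id, mode);
        SessionModeGuard { session_id: session_id.to_string() }
    }
}

impl Drop for SessionModeGuard {
    fn drop(&mut self) {
        clear_session_mode(&self.session_id);
    }
}
```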

Tests: 4 new tests in dcg_bridge (conversion, classify_for_agent,
session_mode lifecycle).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add disallowed_tools: Vec<String> denylist to AgentDefinition.
Takes precedence over tool_names — useful for inheriting a broad
whitelist while blocking specific dangerous tools.

Fix TOML consistency:
- file-picker.toml: add explicit inherit_parent_system_prompt = false
- Add documentation comment explaining why

Tests: 58 unit + 6 integration = 64 passed, 0 failed.
… parent-child tree

## SubagentTool wiring (task.rs)
- Add AgentRegistry to SubagentTool for definition lookup
- Look up AgentDefinition by subagent_type at spawn time
- Apply tool_names whitelist from definition (intersected with available)
- Apply disallowed_tools denylist from definition
- Inject system_prompt when inherit_parent_system_prompt is false
- Wire permission_mode: params override > definition > inherit session
- Map OutputMode: LastMessage->Answer, AllMessages->Compact
- Log max_turns for future enforcement
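The whitelist/denylist step above can be read as two filters over the available tools, with the denylist applied last so it takes precedence. `effective_tools` is hypothetical, and treating an empty `tool_names` as "inherit every available tool" is an assumption.

```rust
use std::collections::BTreeSet;

/// Tools a subagent may use: (available ∩ whitelist) minus denylist.
/// Assumption: an empty `tool_names` whitelist means "all available tools".
pub fn effective_tools(available: &[&str], tool_names: &[&str], disallowed: &[&str]) -> BTreeSet<String> {
    available
        .iter()
        .copied()
        .filter(|t| tool_names.is_empty() || tool_names.contains(t))
        .filter(|t| !disallowed.contains(t)) // denylist wins over the whitelist
        .map(|t| t.to_string())
        .collect()
}
```

Names in `tool_names` that are not available (like `delete` in the test) are silently dropped by the intersection; catching them is the job of the spawn-time reference check.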

## Parent-child tree (session.rs)
- Add children: Vec<String> to Session with serde(default)
- Add add_child() method for registering child sessions
- Wire SubagentTool to call parent.add_child() after spawn
- Children persisted in session JSON for TUI tree visualization

Backward compatible: all new fields use serde(default),
AgentRegistry is Option so missing registry falls back to existing behavior.
….children tree

## Registry wiring
- Thread Option<Arc<AgentRegistry>> through Registry::new()
- Pass to SubagentTool for definition lookup at spawn time
- Update all Registry::new() call sites (30+ files) with None

## Session parent-child tree (already committed in Phase 2)
- children: Vec<String> on Session with serde(default)
- add_child() method for registering child sessions
- SubagentTool calls parent.add_child() after spawn
- Children persisted in session JSON + journal meta

33 files changed, +436/-105 lines.
## New tools registered in Registry

### team_create
Creates a team with name + description. Stores config as JSON at
~/.jcode/teams/<name>.json. Idempotent — re-creating returns existing.

### team_delete
Deletes a team config file by name.

### task_create
Adds a task to an existing team. Validates team exists. Uses UUID
for task IDs.

### task_update
Updates task status and/or owner. Partial updates supported.

### task_list
Lists all tasks in a team with their status and owner.

## Files
- crates/jcode-app-core/src/tool/team.rs — TeamConfig, TeamCreateTool, TeamDeleteTool
- crates/jcode-app-core/src/tool/task_management.rs — TaskCreate/Update/ListTool
- crates/jcode-app-core/src/tool/mod.rs — register 5 new tools

Build: cargo check passes (2 pre-existing warnings).
Resolve Cargo.toml conflict: keep both evals/jbench (branch) and
crates/jcode-render-core (master).
- Remove stale >>>>>>> conflict marker in skill.rs
- Fix clippy: derive Default on PermissionMode instead of manual impl
- Fix clippy: collapsible if-let in tier.rs
- Fix clippy: doc list indentation in output.rs
- cargo fmt --all
Security fixes:
- H1: Add validate_team_name() to prevent path traversal in TeamConfig
- H4: Reject BypassPermissions in project-local TOML agent definitions
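A sketch of the H1 path-traversal guard. The exact rules `validate_team_name()` enforces are not given. This version assumes an allow-list of ASCII letters, digits, `-`, and `_` up to 64 bytes, which rules out `/`, `\`, and `..` by construction. `team_config_path` is a hypothetical helper for the `~/.jcode/teams/<name>.json` layout.

```rust
use std::path::{Path, PathBuf};

/// Reject anything that could escape the teams directory (assumed rules).
pub fn validate_team_name(name: &str) -> Result<(), String> {
    if name.is_empty() || name.len() > 64 {
        return Err(format!("team name must be 1-64 bytes, got {}", name.len()));
    }
    if !name.chars().all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_') {
        return Err(format!("invalid team name `{}`", name));
    }
    Ok(())
}

/// Build `<teams_dir>/<name>.json` only after validation passes.
pub fn team_config_path(teams_dir: &Path, name: &str) -> Result<PathBuf, String> {
    validate_team_name(name)?;
    Ok(teams_dir.join(format!("{}.json", name)))
}
```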

Runtime wiring:
- H2: Wire shared AgentRegistry into production Registry::new sites
- H3: Add classify_for_session() that checks per-session mode overrides
- H5: Add max_turns enforcement in Agent turn loop
- H6: Wire agent_def.resolve_model() into SubagentTool model resolution
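A sketch of model resolution consistent with the Phase 0.1 tier-slot notes: agents fall back to the session model when no tier is configured. The precedence (explicit override, then the configured tier slot, then the session model) is an assumption, as is this signature. The `tier_slot` closure stands in for the `JCODE_ROUTING_*` environment lookup, whose exact variable names are not spelled out here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelTier { Routine, Thinking }

/// Assumed precedence: model_override > configured tier slot > session model.
pub fn resolve_model<F>(
    model_override: Option<&str>,
    prefer_tier: Option<ModelTier>,
    session_model: &str,
    tier_slot: F,
) -> String
where
    F: Fn(ModelTier) -> Option<String>,
{
    if let Some(m) = model_override {
        return m.to_string();
    }
    // An unconfigured slot yields None, so subscription users keep the session model.
    prefer_tier
        .and_then(tier_slot)
        .unwrap_or_else(|| session_model.to_string())
}
```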

Code quality:
- M4: Remove deny_unknown_fields from AgentDefinition for forward compat
- M5: Align PermissionMode::parse() with serde kebab-case
- M6: Gate experimental team/task tools behind JCODE_EXPERIMENTAL_TOOLS env
- M7: Document parent session mutation race condition
- M8: Add SessionModeGuard RAII for automatic session mode cleanup

All 63 agent-runtime tests pass. cargo check clean.
@quangdang46 quangdang46 merged commit a4f8f1f into master Jun 5, 2026
4 of 9 checks passed