fix(agent-runtime): address PR #313 review issues (4 bugs + fmt) by quangdang46 · Pull Request #333 · quangdang46/jcode

quangdang46 · 2026-05-28T04:32:40Z

Summary

Sub-PR against #313. Fixes the four blocking bugs surfaced during review plus the Format CI failure.

Bugs fixed

JudgingResult deserialization — evals/jbench/src/types.rs
The judge prompt schema requests completionScore / codeQualityScore / overallScore (camelCase) but the Rust struct only accepted completion_score etc. (snake_case), so parse_scorecard (judge.rs:334) would have failed on every real judge response. On-disk JSON stays snake_case (matches the rest of jcode's eval files); added #[serde(alias = "completionScore")] etc. so the LLM-returned camelCase deserializes cleanly without a separate wire-format struct. Locked in with two new tests (parse_scorecard_accepts_camelcase_from_llm, parse_scorecard_accepts_snake_case_from_disk).
Anthropic judge authentication — evals/jbench/src/judge.rs
run_anthropic_judge was sending Authorization: Bearer <key>. The Anthropic Messages API uses x-api-key, so every Anthropic-routed judge would 401 with a valid key. Switched to x-api-key. Also split out JudgeConfig::anthropic_api_base / anthropic_api_key (with env-var defaults JBENCH_ANTHROPIC_API_BASE / JBENCH_ANTHROPIC_API_KEY) so the Anthropic branch can target api.anthropic.com without breaking the OpenAI-compatible path; the overrides are plumbed through run_single_judge and judge_with_three_models. Falls back to the primary api_base / api_key when overrides are not set (covers the gateway-proxy case).
Duplicate substitute_placeholders — src/prompt_placeholders.rs
Conflicted by name with the existing prompt_templates::substitute_placeholders (different semantics — fixed context fields vs arbitrary HashMap bindings) which made grep / jump-to-def ambiguous. Renamed the new function to substitute_context_placeholders and added a doc cross-link to the templates variant. All five existing tests updated and still green.
meta_analyze .run.json filter — evals/jbench/src/bin/jbench.rs
Path::extension() returns only the trailing component ("json"), so matching against "run.json" never fired and meta-analyze always reported zero runs. Switched to file_name().is_some_and(|s| s.ends_with(".run.json")).

Quality / CI

cargo fmt --all — clears the Format job that PR Experimental: multi-agent foundation (Phase 0) — AgentDefinition + tier slots + TOML registry #313 was failing. All formatting changes are within files PR Experimental: multi-agent foundation (Phase 0) — AgentDefinition + tier slots + TOML registry #313 already touches (no unrelated churn).
cargo check -p jcode-agent-runtime -p jcode-jbench — clean.
cargo test -p jcode-jbench — 6 unit + 3 integration tests pass (incl. 2 new ones).
cargo test -p jcode-agent-runtime — 49 unit + 6 integration tests pass.

Review & Testing Checklist for Human

Risk: yellow (4 small, surgical bug fixes plus a mechanical cargo fmt --all — but the Anthropic auth fix and the JudgingResult alias change touch wire-format / network paths I cannot exercise end-to-end without real API keys).

Run cargo test -p jcode-jbench --lib and confirm parse_scorecard_accepts_camelcase_from_llm + parse_scorecard_accepts_snake_case_from_disk pass; spot-check an existing .run.json file (if any) still deserializes via the snake_case alias path.
If you can hit a real Anthropic key: set JBENCH_ANTHROPIC_API_BASE=https://api.anthropic.com and JBENCH_ANTHROPIC_API_KEY=<sk-ant-...>, run the judge against a tiny diff, and confirm Anthropic returns 200 (not 401) and the scorecard parses.
Confirm cargo fmt --all --check is clean.
Confirm Format / Quality Guardrails CI jobs flip from red to green on this sub-PR.
Skim the rename in src/prompt_placeholders.rs — the new function is substitute_context_placeholders; the existing prompt_templates::substitute_placeholders is untouched.

Notes

Anthropic structured-output extraction in run_anthropic_judge still parses the final text block as JSON rather than using Messages-API tools for guaranteed structured output. That's a separate improvement and is out of scope for this fix-up PR.
JudgeConfig::models is still [String; 3]. Converting it to Vec<String> (plus a configurable min_success count) was flagged as a "should fix" in the review but kept out of this PR to keep the diff narrow and focused on the four blockers.
This PR targets experimental/multi-agent-foundation (the head of Experimental: multi-agent foundation (Phase 0) — AgentDefinition + tier slots + TOML registry #313) so reviewing it as a sub-PR is straightforward and Experimental: multi-agent foundation (Phase 0) — AgentDefinition + tier slots + TOML registry #313 itself stays intact.

Bugs fixed: 1. JudgingResult deserialization (jbench/types.rs) The judge prompt schema asks for camelCase fields (completionScore, codeQualityScore, overallScore) but the Rust struct used snake_case without serde rename. parse_scorecard would fail on every real judge response. Fix: add #[serde(alias = ...)] on each score field so on-disk JSON stays snake_case while LLM-returned camelCase still deserializes cleanly. 2. Anthropic judge authentication (jbench/judge.rs) run_anthropic_judge used Authorization: Bearer <key> which always 401s on the Anthropic Messages API. Fix: switch to x-api-key header (Anthropic standard). Also split JudgeConfig::api_base / api_key from new anthropic_api_base / anthropic_api_key so the Anthropic branch can target api.anthropic.com without breaking the OpenAI-compatible path. Plumbed through run_single_judge. 3. Duplicate substitute_placeholders (src/prompt_placeholders.rs) Conflicts with the existing prompt_templates::substitute_placeholders. Different semantics (fixed context vs HashMap bindings) but same name made grep / jump-to-def ambiguous. Fix: rename the new one to substitute_context_placeholders and document the relationship in the doc comment. 4. meta_analyze .run.json filter (jbench/bin/jbench.rs) path.extension() returns only the final extension ('json'), so matching against "run.json" never fired. meta-analyze would always report zero runs. Fix: match against file_name().ends_with(".run.json"). Plus: - Run cargo fmt --all to clear the Format CI job that PR #313 was failing. - Add tests parse_scorecard_accepts_camelcase_from_llm and parse_scorecard_accepts_snake_case_from_disk to lock in the wire-format contract.

quangdang46 merged commit 089fb6c into experimental/multi-agent-foundation May 28, 2026

quangdang46 deleted the devin/1779932200-fix-pr-313-review-issues branch May 28, 2026 05:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agent-runtime): address PR #313 review issues (4 bugs + fmt)#333

fix(agent-runtime): address PR #313 review issues (4 bugs + fmt)#333
quangdang46 merged 1 commit into
experimental/multi-agent-foundationfrom
devin/1779932200-fix-pr-313-review-issues

quangdang46 commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

quangdang46 commented May 28, 2026

Summary

Bugs fixed

Quality / CI

Review & Testing Checklist for Human

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant