Skip to content

fix(agent-runtime): address PR #313 review issues (4 bugs + fmt)#333

Merged
quangdang46 merged 1 commit into
experimental/multi-agent-foundationfrom
devin/1779932200-fix-pr-313-review-issues
May 28, 2026
Merged

fix(agent-runtime): address PR #313 review issues (4 bugs + fmt)#333
quangdang46 merged 1 commit into
experimental/multi-agent-foundationfrom
devin/1779932200-fix-pr-313-review-issues

Conversation

@quangdang46
Copy link
Copy Markdown
Owner

Summary

Sub-PR against #313. Fixes the four blocking bugs surfaced during review plus the Format CI failure.

Bugs fixed

  1. JudgingResult deserializationevals/jbench/src/types.rs
    The judge prompt schema requests completionScore / codeQualityScore / overallScore (camelCase) but the Rust struct only accepted completion_score etc. (snake_case), so parse_scorecard (judge.rs:334) would have failed on every real judge response. On-disk JSON stays snake_case (matches the rest of jcode's eval files); added #[serde(alias = "completionScore")] etc. so the LLM-returned camelCase deserializes cleanly without a separate wire-format struct. Locked in with two new tests (parse_scorecard_accepts_camelcase_from_llm, parse_scorecard_accepts_snake_case_from_disk).

  2. Anthropic judge authenticationevals/jbench/src/judge.rs
    run_anthropic_judge was sending Authorization: Bearer <key>. The Anthropic Messages API uses x-api-key, so every Anthropic-routed judge would 401 with a valid key. Switched to x-api-key. Also split out JudgeConfig::anthropic_api_base / anthropic_api_key (with env-var defaults JBENCH_ANTHROPIC_API_BASE / JBENCH_ANTHROPIC_API_KEY) so the Anthropic branch can target api.anthropic.com without breaking the OpenAI-compatible path; the overrides are plumbed through run_single_judge and judge_with_three_models. Falls back to the primary api_base / api_key when overrides are not set (covers the gateway-proxy case).

  3. Duplicate substitute_placeholderssrc/prompt_placeholders.rs
    Conflicted by name with the existing prompt_templates::substitute_placeholders (different semantics — fixed context fields vs arbitrary HashMap bindings) which made grep / jump-to-def ambiguous. Renamed the new function to substitute_context_placeholders and added a doc cross-link to the templates variant. All five existing tests updated and still green.

  4. meta_analyze .run.json filterevals/jbench/src/bin/jbench.rs
    Path::extension() returns only the trailing component ("json"), so matching against "run.json" never fired and meta-analyze always reported zero runs. Switched to file_name().is_some_and(|s| s.ends_with(".run.json")).

Quality / CI

Review & Testing Checklist for Human

Risk: yellow (4 small, surgical bug fixes plus a mechanical cargo fmt --all — but the Anthropic auth fix and the JudgingResult alias change touch wire-format / network paths I cannot exercise end-to-end without real API keys).

  • Run cargo test -p jcode-jbench --lib and confirm parse_scorecard_accepts_camelcase_from_llm + parse_scorecard_accepts_snake_case_from_disk pass; spot-check an existing .run.json file (if any) still deserializes via the snake_case alias path.
  • If you can hit a real Anthropic key: set JBENCH_ANTHROPIC_API_BASE=https://api.anthropic.com and JBENCH_ANTHROPIC_API_KEY=<sk-ant-...>, run the judge against a tiny diff, and confirm Anthropic returns 200 (not 401) and the scorecard parses.
  • Confirm cargo fmt --all --check is clean.
  • Confirm Format / Quality Guardrails CI jobs flip from red to green on this sub-PR.
  • Skim the rename in src/prompt_placeholders.rs — the new function is substitute_context_placeholders; the existing prompt_templates::substitute_placeholders is untouched.

Notes

Bugs fixed:

1. JudgingResult deserialization (jbench/types.rs)
   The judge prompt schema asks for camelCase fields
   (completionScore, codeQualityScore, overallScore) but
   the Rust struct used snake_case without serde rename.
   parse_scorecard would fail on every real judge response.

   Fix: add #[serde(alias = ...)] on each score field so
   on-disk JSON stays snake_case while LLM-returned
   camelCase still deserializes cleanly.

2. Anthropic judge authentication (jbench/judge.rs)
   run_anthropic_judge used Authorization: Bearer <key>
   which always 401s on the Anthropic Messages API.

   Fix: switch to x-api-key header (Anthropic standard).
   Also split JudgeConfig::api_base / api_key from new
   anthropic_api_base / anthropic_api_key so the Anthropic
   branch can target api.anthropic.com without breaking
   the OpenAI-compatible path. Plumbed through
   run_single_judge.

3. Duplicate substitute_placeholders (src/prompt_placeholders.rs)
   Conflicts with the existing
   prompt_templates::substitute_placeholders. Different
   semantics (fixed context vs HashMap bindings) but same
   name made grep / jump-to-def ambiguous.

   Fix: rename the new one to
   substitute_context_placeholders and document the
   relationship in the doc comment.

4. meta_analyze .run.json filter (jbench/bin/jbench.rs)
   path.extension() returns only the final extension
   ('json'), so matching against "run.json" never fired.
   meta-analyze would always report zero runs.

   Fix: match against file_name().ends_with(".run.json").

Plus:
- Run cargo fmt --all to clear the Format CI job that PR
  #313 was failing.
- Add tests parse_scorecard_accepts_camelcase_from_llm and
  parse_scorecard_accepts_snake_case_from_disk to lock in
  the wire-format contract.
@quangdang46 quangdang46 merged commit 089fb6c into experimental/multi-agent-foundation May 28, 2026
@quangdang46 quangdang46 deleted the devin/1779932200-fix-pr-313-review-issues branch May 28, 2026 05:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant