fix(agent-runtime): address PR #313 review issues (4 bugs + fmt) #333
Merged
quangdang46 merged 1 commit on May 28, 2026
Bugs fixed:
1. JudgingResult deserialization (jbench/types.rs)
The judge prompt schema asks for camelCase fields
(completionScore, codeQualityScore, overallScore) but
the Rust struct used snake_case without serde rename.
parse_scorecard would fail on every real judge response.
Fix: add #[serde(alias = ...)] on each score field so
on-disk JSON stays snake_case while LLM-returned
camelCase still deserializes cleanly.
2. Anthropic judge authentication (jbench/judge.rs)
run_anthropic_judge used Authorization: Bearer <key>
which always 401s on the Anthropic Messages API.
Fix: switch to x-api-key header (Anthropic standard).
Also split JudgeConfig::api_base / api_key from new
anthropic_api_base / anthropic_api_key so the Anthropic
branch can target api.anthropic.com without breaking
the OpenAI-compatible path. Plumbed through
run_single_judge.
3. Duplicate substitute_placeholders (src/prompt_placeholders.rs)
Conflicts with the existing
prompt_templates::substitute_placeholders. Different
semantics (fixed context vs HashMap bindings) but same
name made grep / jump-to-def ambiguous.
Fix: rename the new one to
substitute_context_placeholders and document the
relationship in the doc comment.
4. meta_analyze .run.json filter (jbench/bin/jbench.rs)
path.extension() returns only the final extension
('json'), so matching against "run.json" never fired.
meta-analyze would always report zero runs.
Fix: match against file_name().ends_with(".run.json").
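The difference behind item 4 can be reproduced with the standard library alone. The following is a minimal sketch, not the code from jbench.rs: the helper names `matches_by_extension` and `is_run_file` and the sample paths are hypothetical and exist only for illustration.

```rust
use std::path::Path;

/// Buggy check: `Path::extension()` yields only the text after the LAST dot,
/// so for "task-01.run.json" it is "json" and this comparison never succeeds.
fn matches_by_extension(path: &Path) -> bool {
    path.extension().and_then(|ext| ext.to_str()) == Some("run.json")
}

/// Fixed check: look at the end of the whole file name instead.
fn is_run_file(path: &Path) -> bool {
    path.file_name()
        .and_then(|name| name.to_str())
        .is_some_and(|name| name.ends_with(".run.json"))
}

fn main() {
    // Hypothetical file names, chosen only to exercise both checks.
    let run = Path::new("results/task-01.run.json");
    let summary = Path::new("results/summary.json");

    println!("extension of run file: {:?}", run.extension());
    println!("by extension: {}", matches_by_extension(run));
    println!("by file name: {}", is_run_file(run));
    println!("summary.json counted as a run: {}", is_run_file(summary));
}
```

Matching on the file name also keeps plain `.json` files out of the run set, which an extension check against "json" alone would not.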
Plus:
- Run cargo fmt --all to clear the Format CI job that PR
#313 was failing.
- Add tests parse_scorecard_accepts_camelcase_from_llm and
parse_scorecard_accepts_snake_case_from_disk to lock in
the wire-format contract.
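The Anthropic authentication fix (item 2 above) comes down to two decisions: which header carries the key, and which base URL / key pair the Anthropic branch uses. The sketch below illustrates both with the standard library only. It is not the code from judge.rs: the trimmed `JudgeConfig`, the `Option` field types, and the helper names `anthropic_endpoint`, `anthropic_headers`, and `openai_compatible_headers` are assumptions made for illustration. The `anthropic-version` header is not mentioned in this PR; it is included because the Anthropic Messages API expects it on every request.

```rust
/// Hypothetical, trimmed-down stand-in for the judge configuration.
/// The real struct has more fields; the Option types are an assumption.
struct JudgeConfig {
    api_base: String,
    api_key: String,
    anthropic_api_base: Option<String>,
    anthropic_api_key: Option<String>,
}

impl JudgeConfig {
    /// Base URL and key for the Anthropic branch. Falls back to the primary
    /// pair when no override is set (the gateway-proxy case).
    fn anthropic_endpoint(&self) -> (&str, &str) {
        (
            self.anthropic_api_base.as_deref().unwrap_or(self.api_base.as_str()),
            self.anthropic_api_key.as_deref().unwrap_or(self.api_key.as_str()),
        )
    }
}

/// Headers for the Anthropic Messages API: the key goes in `x-api-key`.
fn anthropic_headers(key: &str) -> Vec<(&'static str, String)> {
    vec![
        ("x-api-key", key.to_string()),
        ("anthropic-version", "2023-06-01".to_string()),
        ("content-type", "application/json".to_string()),
    ]
}

/// Headers for the OpenAI-compatible path: the key is a bearer token.
fn openai_compatible_headers(key: &str) -> Vec<(&'static str, String)> {
    vec![
        ("authorization", format!("Bearer {key}")),
        ("content-type", "application/json".to_string()),
    ]
}

fn main() {
    // Placeholder values only; no request is sent.
    let cfg = JudgeConfig {
        api_base: "https://gateway.example/v1".to_string(),
        api_key: "primary-key".to_string(),
        anthropic_api_base: Some("https://api.anthropic.com".to_string()),
        anthropic_api_key: None,
    };
    let (base, key) = cfg.anthropic_endpoint();
    println!("anthropic base: {base}");
    for (name, value) in anthropic_headers(key) {
        println!("{name}: {value}");
    }
    for (name, value) in openai_compatible_headers(&cfg.api_key) {
        println!("{name}: {value}");
    }
}
```

Resolving each override independently means a gateway setup can supply only the base URL, or only the key, and inherit the other from the primary configuration.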
Summary
Sub-PR against #313. Fixes the four blocking bugs surfaced during review plus the Format CI failure.
Bugs fixed
1. JudgingResult deserialization (evals/jbench/src/types.rs)
The judge prompt schema requests completionScore / codeQualityScore / overallScore (camelCase), but the Rust struct only accepted completion_score etc. (snake_case), so parse_scorecard (judge.rs:334) would have failed on every real judge response. On-disk JSON stays snake_case (matching the rest of jcode's eval files); added #[serde(alias = "completionScore")] etc. so the LLM-returned camelCase deserializes cleanly without a separate wire-format struct. Locked in with two new tests (parse_scorecard_accepts_camelcase_from_llm, parse_scorecard_accepts_snake_case_from_disk).
2. Anthropic judge authentication (evals/jbench/src/judge.rs)
run_anthropic_judge was sending Authorization: Bearer <key>. The Anthropic Messages API uses x-api-key, so every Anthropic-routed judge would 401 even with a valid key. Switched to x-api-key. Also split out JudgeConfig::anthropic_api_base / anthropic_api_key (with env-var defaults JBENCH_ANTHROPIC_API_BASE / JBENCH_ANTHROPIC_API_KEY) so the Anthropic branch can target api.anthropic.com without breaking the OpenAI-compatible path; the overrides are plumbed through run_single_judge and judge_with_three_models. Falls back to the primary api_base / api_key when the overrides are not set (covers the gateway-proxy case).
3. Duplicate substitute_placeholders (src/prompt_placeholders.rs)
Conflicted by name with the existing prompt_templates::substitute_placeholders (different semantics: fixed context fields vs arbitrary HashMap bindings), which made grep / jump-to-def ambiguous. Renamed the new function to substitute_context_placeholders and added a doc cross-link to the templates variant. All five existing tests updated and still green.
4. meta_analyze .run.json filter (evals/jbench/src/bin/jbench.rs)
Path::extension() returns only the trailing component ("json"), so matching against "run.json" never fired and meta-analyze always reported zero runs. Switched to file_name().is_some_and(|s| s.ends_with(".run.json")).
Quality / CI
- cargo fmt --all: clears the Format job that PR #313 was failing. All formatting changes are within files PR #313 already touches (no unrelated churn).
- cargo check -p jcode-agent-runtime -p jcode-jbench: clean.
- cargo test -p jcode-jbench: 6 unit + 3 integration tests pass (incl. 2 new ones).
- cargo test -p jcode-agent-runtime: 49 unit + 6 integration tests pass.
Review & Testing Checklist for Human
Risk: yellow (4 small, surgical bug fixes plus a mechanical cargo fmt --all, but the Anthropic auth fix and the JudgingResult alias change touch wire-format / network paths I cannot exercise end-to-end without real API keys).
- Run cargo test -p jcode-jbench --lib and confirm parse_scorecard_accepts_camelcase_from_llm + parse_scorecard_accepts_snake_case_from_disk pass; spot-check that an existing .run.json file (if any) still deserializes via the snake_case field names.
- Set JBENCH_ANTHROPIC_API_BASE=https://api.anthropic.com and JBENCH_ANTHROPIC_API_KEY=<sk-ant-...>, run the judge against a tiny diff, and confirm Anthropic returns 200 (not 401) and the scorecard parses.
- Confirm cargo fmt --all --check is clean.
- src/prompt_placeholders.rs: the new function is substitute_context_placeholders; the existing prompt_templates::substitute_placeholders is untouched.
Notes
- run_anthropic_judge still parses the final text block as JSON rather than using Messages-API tools for guaranteed structured output. That's a separate improvement and is out of scope for this fix-up PR.
- JudgeConfig::models is still [String; 3]. Converting it to Vec<String> (plus a configurable min_success count) was flagged as a "should fix" in the review but kept out of this PR to keep the diff narrow and focused on the four blockers.
- This PR targets experimental/multi-agent-foundation (the head of #313), so reviewing it as a sub-PR is straightforward and #313 itself stays intact.