
test: add integ tests for batch evaluation and evaluator lifecycle#1110

Open
notgitika wants to merge 2 commits into preview from test/batch-eval-integ-preview

Conversation

@notgitika
Contributor

Summary

Adds 20 integration tests covering the batch evaluation, evaluator, and online eval config CLI lifecycles:

  • Batch eval flag validation: --runtime required, --evaluator required
  • Run eval flag validation: --evaluator required
  • Evaluator lifecycle: SESSION/TRACE/TOOL_CALL levels, code-based (external lambda + managed), config file import
  • Online eval config lifecycle: builtin evaluator refs, enable-on-create, multiple evaluators, sampling rate
  • Evaluator validation edge cases: invalid level, --model with code-based, --lambda-arn without code-based
  • Ground truth file parsing: malformed JSON, wrong structure, valid array format, valid object format with sessionMetadata key

Originally from private-agentcore-cli-staging PR #104. Only the test file is needed — the code changes from that PR (remove code-based evaluator blocks, add test utils) are already on preview.

Test plan

  • npm run test:integ passes (20 new tests)
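For context on the ground-truth cases above, here is a minimal sketch of the four input shapes those tests exercise. The function name (validateGroundTruth), the record type, and the exact accepted schemas are illustrative assumptions, not the CLI's actual API — only the broad categories (malformed JSON, wrong structure, array format, object format with a sessionMetadata key) come from the test list.

```typescript
// Hypothetical sketch — not the repo's real parser. Shows the four input
// categories the ground-truth tests cover, under assumed schemas.
interface GroundTruthRecord {
  sessionId: string;
  [key: string]: unknown;
}

function validateGroundTruth(raw: string): GroundTruthRecord[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw); // case 1: malformed JSON is rejected here
  } catch {
    throw new Error('Ground truth file is not valid JSON');
  }
  // Case 3: valid array format — a top-level array of records.
  if (Array.isArray(parsed)) {
    return parsed as GroundTruthRecord[];
  }
  // Case 4: valid object format — records keyed under sessionMetadata
  // (the key name is from the test list; the nesting is an assumption).
  if (parsed !== null && typeof parsed === 'object' && 'sessionMetadata' in parsed) {
    const meta = (parsed as Record<string, unknown>).sessionMetadata;
    return Object.values(meta as Record<string, GroundTruthRecord>);
  }
  // Case 2: wrong structure — parses as JSON but matches neither shape.
  throw new Error('Ground truth file has unexpected structure');
}
```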

@notgitika notgitika requested a review from a team May 4, 2026 18:48
@github-actions github-actions Bot added the agentcore-harness-reviewing AgentCore Harness review in progress label May 4, 2026
@github-actions
Contributor

github-actions Bot commented May 4, 2026

Coverage Report

| Status | Category | Percentage | Covered / Total |
| --- | --- | --- | --- |
| 🔵 | Lines | 43.45% | 9904 / 22794 |
| 🔵 | Statements | 42.72% | 10519 / 24619 |
| 🔵 | Functions | 40.36% | 1667 / 4130 |
| 🔵 | Branches | 40.16% | 6401 / 15935 |
Generated in workflow #2369 for commit 62f9d80 by the Vitest Coverage Report Action

@agentcore-cli-automation

Reviewed the new integ test file. There are a few issues that I think will make several tests fail (or silently pass without testing anything) on preview:

1. run batch-evaluation subcommand doesn't exist on preview

I can't find a batch-evaluation subcommand registered anywhere under src/cli/commands/run/ or in the primitives — the only run subcommand is run eval (see src/cli/commands/run/command.tsx). Same for --ground-truth — I can't find that flag defined on any command (only groundTruth in TUI wizard types).

The PR description says the code changes from staging PR #104 are "already on preview," but it looks like run batch-evaluation and --ground-truth are part of those code changes and are not on preview yet. The following tests exercise run batch-evaluation and will hit commander's "unknown command" path:

  • run batch-evaluation requires flags > requires --runtime
  • run batch-evaluation requires flags > requires --evaluator
  • all four tests under ground truth file parsing

Options:

  • (a) Drop the run batch-evaluation / ground-truth tests from this PR and add them back when the command lands on preview.
  • (b) Land the run batch-evaluation command + --ground-truth flag on preview first, then this PR.
  • (c) Rewrite these tests against run eval if that was the intent (but run eval has no --ground-truth flag today either).

2. Several assertions in those tests won't hold even if exit codes happen to match

Even though commander exits with code 1 for an unknown subcommand (which would satisfy expect(result.exitCode).toBe(1)), the downstream assertions won't pass:

  • expect(result.stderr).toContain('--runtime') / '--evaluator' — commander will print "error: unknown command 'batch-evaluation'", not a flag-specific validation message.
  • parseJsonOutput(result.stdout) in accepts valid ground truth file (array format) and (object format...) will throw "Empty output, cannot parse JSON" because commander writes the unknown-command error to stderr and stdout is empty.
  • expect(json.error).toContain('deployed') — same issue; there's no JSON to parse.
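The empty-stdout failure mode can be made concrete with a stand-in for the parseJsonOutput helper. Only its quoted behavior ("Empty output, cannot parse JSON") is taken from the review; the rest is a guessed minimal implementation, not the repo's actual test util.

```typescript
// Hypothetical stand-in for the parseJsonOutput test helper. The empty-input
// error message is quoted from the review; everything else is assumed.
function parseJsonOutput(stdout: string): unknown {
  const trimmed = stdout.trim();
  if (trimmed.length === 0) {
    // The failure mode the review predicts: commander writes the
    // unknown-command error to stderr, leaving stdout empty.
    throw new Error('Empty output, cannot parse JSON');
  }
  return JSON.parse(trimmed);
}
```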

3. Silent no-op in the "rejects code-based evaluator in online eval config" test

const codeName = project
  ? (await readProjectConfig(project.projectPath)).evaluators.find(e => e.config.codeBased)?.name
  : undefined;

if (!codeName) return;

If the prior "adds a code-based evaluator" tests failed or ran in a different order, this test returns early with zero assertions and appears to pass. That defeats the purpose of the test. Worth either:

  • dropping the early return and letting it fail loudly (e.g. expect(codeName).toBeDefined()), or
  • adding a code-based evaluator inline in this test so it doesn't depend on prior test state.

Also, the project ? ternary is always truthy (project is assigned in beforeAll), so it can be simplified — but the early-return is the real issue.

Happy to re-review once these are sorted out.

@github-actions github-actions Bot removed the agentcore-harness-reviewing AgentCore Harness review in progress label May 4, 2026
The code-based evaluator restriction in OnlineEvalConfigPrimitive was
already removed, so this test expected a rejection that no longer occurs.
@notgitika notgitika force-pushed the test/batch-eval-integ-preview branch from eaca03a to 62f9d80 Compare May 4, 2026 19:03
@notgitika
Copy link
Copy Markdown
Contributor Author

addressed comments
