Plan: MCP Server Provisioning for Waza Evaluations
Problem Statement
Waza can evaluate AI agents on tasks, but cannot currently test agents that use MCP tools.
The mcp_servers field exists in the eval YAML schema and is parsed into Config.ServerConfigs,
but is never consumed — no servers are started, no config is written to workspaces, and no
MCP-aware grading exists.
Goal: Allow eval authors to declare MCP servers in eval YAML, have waza provision them
during task execution, and grade how effectively the agent uses those MCP tools.
Approach
Rather than waiting for Copilot SDK changes, use workspace-based MCP discovery:
the agent (Copilot CLI) already discovers MCP servers via .copilot/mcp.json in the
working directory. Waza can write this config file into each task's temp workspace and
manage the MCP server process lifecycle.
    eval.yaml                       Task Workspace (/tmp/waza-xxx/)
    ┌──────────────┐                ┌────────────────────────────┐
    │ mcp_servers: │   ──copy──▶    │ .copilot/mcp.json          │
    │   github:    │                │ fixtures/...               │
    │     command: │                │ (agent works here)         │
    │     args: ...│                └────────────────────────────┘
    └──────────────┘                              │
                                    waza starts MCP server process
                                    agent discovers via mcp.json
                                    waza tracks MCP tool calls
                                    waza grades MCP tool usage
                                    waza stops MCP server process
Phases & Todos
Phase 1: Formalize MCP Server Config Schema
Define a typed Go struct for MCP server configuration (replacing map[string]any),
and update the JSON schema with proper validation.
- 1a. Define MCPServerConfig struct in internal/models/spec.go
  - Fields: Command, Args, Env, WorkingDir, Url (for SSE/streamable-http transports)
  - Replace ServerConfigs map[string]any with MCPServers map[string]MCPServerConfig
  - Maintain backward compatibility with existing YAML parsing
- 1b. Update JSON schema in schemas/eval.schema.json
  - Replace additionalProperties: true with proper object schema for each server
  - Add command (required), args (array), env (object), url (string)
- 1c. Add config validation in spec loading
  - Validate that each server has either command or url (stdio vs remote)
  - Validate that command is a string, args is string array, etc.
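As a rough illustration of steps 1a and 1c, the typed struct and its validation might be sketched as follows. The yaml tag names, the Validate and ValidateServers helpers, and the reading of "either command or url" as mutually exclusive are assumptions, not decisions recorded in this plan.

```go
package main

import (
	"errors"
	"fmt"
)

// MCPServerConfig is a sketch of the typed replacement for map[string]any.
// Field names come from the plan; the yaml tags are assumed.
type MCPServerConfig struct {
	Command    string            `yaml:"command,omitempty"`
	Args       []string          `yaml:"args,omitempty"`
	Env        map[string]string `yaml:"env,omitempty"`
	WorkingDir string            `yaml:"working_dir,omitempty"`
	Url        string            `yaml:"url,omitempty"`
}

// Validate enforces the "command or url" rule (stdio vs remote).
// Treating the two as mutually exclusive is an assumption.
func (c MCPServerConfig) Validate(name string) error {
	switch {
	case c.Command == "" && c.Url == "":
		return fmt.Errorf("mcp server %q: one of command or url is required", name)
	case c.Command != "" && c.Url != "":
		return fmt.Errorf("mcp server %q: command and url cannot both be set", name)
	}
	return nil
}

// ValidateServers checks every server and reports all problems at once.
// It returns nil when every server is valid.
func ValidateServers(servers map[string]MCPServerConfig) error {
	var errs []error
	for name, cfg := range servers {
		if err := cfg.Validate(name); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

func main() {
	servers := map[string]MCPServerConfig{
		// Hypothetical command, for illustration only.
		"github": {Command: "example-mcp-server", Args: []string{"--stdio"}},
		"broken": {},
	}
	fmt.Println(ValidateServers(servers))
}
```

Because the Go type system already guarantees that command is a string and args is a string slice once YAML decoding succeeds, the explicit checks only need to cover rules the types cannot express.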
Phase 2: MCP Server Lifecycle Manager
Create a component that starts/stops MCP server processes and writes workspace config.
- 2a. Create internal/mcp/lifecycle.go
  - MCPServerManager struct with Start(ctx, configs) and Stop() methods
  - Starts each configured server as a subprocess
  - Tracks PIDs for cleanup
  - Health-check: wait for server readiness (configurable timeout)
  - Graceful shutdown with SIGTERM → SIGKILL fallback
- 2b. Create internal/mcp/workspace.go
  - WriteMCPConfig(workspaceDir, configs) function
  - Writes .copilot/mcp.json into the task workspace directory
  - Transforms MCPServerConfig → Copilot MCP JSON format
  - Creates .copilot/ directory if needed
- 2c. Add tests for lifecycle manager
  - Test start/stop with a simple echo MCP server
  - Test config file generation
  - Test cleanup on context cancellation
  - Test error handling (server fails to start)
Phase 3: Wire Into Execution Pipeline
Connect the lifecycle manager to the orchestration runner and execution engine.
- 3a. Update ExecutionRequest in internal/execution/engine.go
  - Add MCPServers map[string]MCPServerConfig field
  - Allows per-request MCP server config to flow through
- 3b. Update CopilotEngine.Execute() in internal/execution/copilot.go
  - Before creating session: call WriteMCPConfig() to write config to workspace
  - After task completion: clean up the MCP config file
- 3c. Update TestRunner.buildExecutionRequest() in internal/orchestration/runner.go
  - Pass spec.Config.MCPServers into ExecutionRequest.MCPServers
- 3d. Update TestRunner lifecycle hooks
  - Before benchmark: start MCP servers via MCPServerManager.Start()
  - After benchmark: stop MCP servers via MCPServerManager.Stop()
  - Consider: per-task vs per-benchmark server lifetime (config option)
- 3e. Add integration tests
  - Mock MCP server that exposes a simple tool
  - Eval YAML that configures the mock server
  - Verify agent can discover and call the MCP tool
  - Verify cleanup after task
Phase 4: MCP Tool Call Tracking
Enrich tool call data with MCP server origin information.
- 4a. Extend ToolCall model in internal/models/events.go
  - Add Source string field: "builtin", "mcp:{server_name}", "skill"
  - Add MCPServer string field (empty for non-MCP tools)
- 4b. Update FilterToolCalls() to classify tool origins
  - Match tool names against configured MCP server tool lists
  - Or use naming convention (MCP tools often have server-prefixed names)
  - Falls back to "builtin" if no MCP match
- 4c. Update SessionDigest to include MCP summary
  - Add MCPToolCalls int: count of MCP-originated tool calls
  - Add MCPServersUsed []string: which MCP servers were actually called
- 4d. Update dashboard in web/
  - Show MCP tool calls distinguished from built-in tools in TrajectoryViewer
  - Add MCP server badge/icon to tool call entries
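A minimal sketch of the classification and summary steps in 4b and 4c follows. The ClassifyToolCalls and Summarize helpers are hypothetical names, the types are trimmed to the fields listed above, and the "server-tool" prefix with a hyphen separator is an assumed naming convention that would need to be checked against how tool names actually appear in session events.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// ToolCall is trimmed to the fields this sketch needs.
type ToolCall struct {
	Name      string
	Source    string // "builtin", "mcp:{server_name}", or "skill"
	MCPServer string // empty for non-MCP tools
}

// SessionDigest holds only the proposed MCP summary fields.
type SessionDigest struct {
	MCPToolCalls   int
	MCPServersUsed []string
}

// ClassifyToolCalls tags calls that have no Source yet. It first matches the
// tool name against each server's tool list, then tries the assumed
// "<server>-<tool>" prefix convention, and otherwise falls back to "builtin".
// If two servers list the same tool name, which one wins is unspecified.
func ClassifyToolCalls(calls []ToolCall, serverTools map[string][]string) []ToolCall {
	owner := make(map[string]string)
	for server, tools := range serverTools {
		for _, tool := range tools {
			owner[tool] = server
		}
	}
	out := make([]ToolCall, len(calls))
	for i, call := range calls {
		if call.Source == "" {
			server, ok := owner[call.Name]
			if !ok {
				if prefix, _, found := strings.Cut(call.Name, "-"); found {
					if _, known := serverTools[prefix]; known {
						server, ok = prefix, true
					}
				}
			}
			if ok {
				call.Source, call.MCPServer = "mcp:"+server, server
			} else {
				call.Source = "builtin"
			}
		}
		out[i] = call
	}
	return out
}

// Summarize counts MCP-originated calls and lists the servers used, sorted.
func Summarize(calls []ToolCall) SessionDigest {
	var digest SessionDigest
	seen := make(map[string]bool)
	for _, call := range calls {
		if call.MCPServer == "" {
			continue
		}
		digest.MCPToolCalls++
		if !seen[call.MCPServer] {
			seen[call.MCPServer] = true
			digest.MCPServersUsed = append(digest.MCPServersUsed, call.MCPServer)
		}
	}
	sort.Strings(digest.MCPServersUsed)
	return digest
}

func main() {
	calls := ClassifyToolCalls(
		[]ToolCall{{Name: "get_issue"}, {Name: "bash"}, {Name: "github-search_code"}},
		map[string][]string{"github": {"get_issue"}},
	)
	fmt.Printf("%+v\n%+v\n", calls, Summarize(calls))
}
```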
Phase 5: MCP-Aware Graders
Add grading capabilities specific to MCP tool usage.
- 5a. Extend tool_constraint grader
  - Support source: mcp:{server_name} filter in expect_tools/reject_tools
  - Example: expect_tools: [{tool: "get_issue", source: "mcp:github"}]
- 5b. Extend behavior grader
  - Add max_mcp_calls constraint
  - Add required_mcp_servers: ensure agent used specific MCP servers
  - Add forbidden_mcp_servers: ensure agent didn't use certain servers
- 5c. Add mcp_compliance grader (new grader type)
  - Validates that the agent correctly discovered and used MCP tools
  - Checks: tool selection accuracy, parameter correctness, error handling
  - Configurable rubric for MCP-specific evaluation criteria
  - Example config:

        graders:
          - type: mcp_compliance
            name: github_tool_usage
            config:
              server: github
              expect_tools_used: [get_issue, search_code]
              max_tool_errors: 1
- 5d. Add tests for MCP graders
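To make the mcp_compliance example config concrete, the core check could be sketched as below. It covers only expect_tools_used and max_tool_errors; parameter correctness and rubric scoring are left out. The type names, the Failed flag on ToolCall, and the Result shape are assumptions rather than existing waza types.

```go
package main

import "fmt"

// MCPComplianceConfig mirrors the keys in the example grader config.
type MCPComplianceConfig struct {
	Server          string
	ExpectToolsUsed []string
	MaxToolErrors   int
}

// ToolCall is trimmed to what the check needs; Failed is a hypothetical flag
// marking a call that returned an error.
type ToolCall struct {
	Name      string
	MCPServer string
	Failed    bool
}

// Result is a hypothetical grading outcome.
type Result struct {
	Passed  bool
	Reasons []string
}

// GradeMCPCompliance passes when every expected tool was called on the
// configured server and the number of failed calls stays within the limit.
func GradeMCPCompliance(cfg MCPComplianceConfig, calls []ToolCall) Result {
	used := make(map[string]bool)
	toolErrors := 0
	for _, call := range calls {
		if call.MCPServer != cfg.Server {
			continue // calls to other servers are out of scope for this grader
		}
		used[call.Name] = true
		if call.Failed {
			toolErrors++
		}
	}
	var reasons []string
	for _, tool := range cfg.ExpectToolsUsed {
		if !used[tool] {
			reasons = append(reasons, fmt.Sprintf("expected tool %q was not called on server %q", tool, cfg.Server))
		}
	}
	if toolErrors > cfg.MaxToolErrors {
		reasons = append(reasons, fmt.Sprintf("%d tool errors exceed max_tool_errors=%d", toolErrors, cfg.MaxToolErrors))
	}
	return Result{Passed: len(reasons) == 0, Reasons: reasons}
}

func main() {
	cfg := MCPComplianceConfig{
		Server:          "github",
		ExpectToolsUsed: []string{"get_issue", "search_code"},
		MaxToolErrors:   1,
	}
	calls := []ToolCall{
		{Name: "get_issue", MCPServer: "github"},
		{Name: "search_code", MCPServer: "github", Failed: true},
	}
	fmt.Printf("%+v\n", GradeMCPCompliance(cfg, calls))
}
```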
Phase 6: Task-Level MCP Overrides
Allow individual tasks to specify additional or different MCP servers.
- 6a. Add mcp_servers to TestCase model
  - Tasks can add MCP servers beyond those in the eval config
  - Tasks can override eval-level MCP server configs
  - Merge strategy: task configs overlay eval configs
- 6b. Update task YAML schema in schemas/
  - Add mcp_servers field to task schema
- 6c. Update runner to merge configs
  - Eval-level mcp_servers as base
  - Task-level mcp_servers merged on top
  - Pass merged config into ExecutionRequest
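The overlay merge in 6c can be sketched in a few lines. This version replaces a server's whole entry when a task redefines it; whether overrides should instead merge field by field is not settled here, so whole-entry replacement is an assumption, as is the MergeMCPServers name.

```go
package main

import "fmt"

// MCPServerConfig is trimmed to what this sketch needs.
type MCPServerConfig struct {
	Command string
	Args    []string
	Url     string
}

// MergeMCPServers returns a new map with eval-level servers as the base and
// task-level servers laid on top. A task entry with the same name replaces
// the eval entry entirely. Neither input map is modified.
func MergeMCPServers(evalLevel, taskLevel map[string]MCPServerConfig) map[string]MCPServerConfig {
	merged := make(map[string]MCPServerConfig, len(evalLevel)+len(taskLevel))
	for name, cfg := range evalLevel {
		merged[name] = cfg
	}
	for name, cfg := range taskLevel {
		merged[name] = cfg
	}
	return merged
}

func main() {
	evalLevel := map[string]MCPServerConfig{
		"github": {Command: "example-mcp-server"}, // hypothetical command
	}
	taskLevel := map[string]MCPServerConfig{
		"github": {Url: "http://localhost:8080/mcp"}, // task overrides the eval entry
		"docs":   {Command: "example-docs-server"},   // task adds a new server
	}
	fmt.Printf("%+v\n", MergeMCPServers(evalLevel, taskLevel))
}
```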
Phase 7: Documentation & Examples
Update docs and provide example evals.
- 7a. Create example eval in examples/mcp-eval/
  - Example eval YAML with MCP server config
  - Example task that requires MCP tool usage
  - Mock MCP server for the example
- 7b. Update documentation
  - site/src/content/docs/guides/eval-yaml.mdx: mcp_servers section
  - site/src/content/docs/guides/graders.mdx: mcp_compliance grader
  - site/src/content/docs/reference/cli.mdx: any new flags
  - README.md: MCP eval support overview
- 7c. Update AGENTS.md with MCP eval patterns
Key Design Decisions
- Workspace-based discovery (not SDK-level): Write .copilot/mcp.json to workspace
  so agent discovers MCP servers naturally. No SDK changes needed.
- Per-benchmark server lifetime (default): Start MCP servers once per eval run,
  not per-task. Add mcp_server_lifetime: per_task | per_benchmark config option.
- Typed config over map[string]any: Replace generic map with MCPServerConfig
  struct for type safety and validation.
- Additive grading: Extend existing graders (tool_constraint, behavior) rather
  than replacing them. Add new mcp_compliance grader for MCP-specific checks.
- Tool origin tracking: Tag each tool call with its source (builtin/mcp/skill)
  to enable source-aware grading.
Open Questions
- Should waza validate that MCP servers are healthy before starting tasks?
  (Recommend: yes, with configurable timeout)
- Should MCP server stdout/stderr be captured in eval results for debugging?
  (Recommend: yes, as optional verbose output)
- Should waza support remote MCP servers (SSE/streamable-http) in addition to stdio?
  (Recommend: yes via url field, Phase 1)
- How to handle MCP servers that need auth tokens: env var passthrough?
  (Recommend: env map in config, supports ${VAR} expansion)
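For the auth-token question, the recommended ${VAR} expansion could be built on os.Expand from the standard library, as in the sketch below. The ExpandEnv helper and its lookup parameter are hypothetical, and failing on undefined variables (instead of substituting an empty string) is an assumed policy. Note that os.Expand also expands the bare $VAR form.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

// ExpandEnv resolves ${VAR} and $VAR references in each env value using
// lookup. It returns an error naming every variable that lookup cannot
// resolve, so a missing token fails fast rather than reaching the server
// as an empty string.
func ExpandEnv(env map[string]string, lookup func(string) (string, bool)) (map[string]string, error) {
	out := make(map[string]string, len(env))
	var missing []string
	for key, value := range env {
		out[key] = os.Expand(value, func(name string) string {
			resolved, ok := lookup(name)
			if !ok {
				missing = append(missing, name)
			}
			return resolved
		})
	}
	if len(missing) > 0 {
		sort.Strings(missing)
		return nil, fmt.Errorf("undefined environment variables: %s", strings.Join(missing, ", "))
	}
	return out, nil
}

func main() {
	env := map[string]string{
		"GITHUB_TOKEN": "${GITHUB_TOKEN}", // passed through from the host environment
		"LOG_LEVEL":    "debug",           // literal values are left unchanged
	}
	expanded, err := ExpandEnv(env, os.LookupEnv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("resolved", len(expanded), "variables")
}
```

Passing the lookup function in, rather than calling os.LookupEnv directly, keeps the helper testable without mutating the process environment.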
Dependencies
- No Copilot SDK changes required (workspace-based discovery)
- mcp-go v0.45.0 already in go.mod (indirect); may use for server health checks
- Go 1.26+ (already required)