Support evals against mcp #147

@RickWinter

Description

Plan: MCP Server Provisioning for Waza Evaluations

Problem Statement

Waza can evaluate AI agents on tasks, but cannot currently test agents that use MCP tools.
The mcp_servers field exists in the eval YAML schema and is parsed into Config.ServerConfigs,
but is never consumed — no servers are started, no config is written to workspaces, and no
MCP-aware grading exists.

Goal: Allow eval authors to declare MCP servers in eval YAML, have waza provision them
during task execution, and grade how effectively the agent uses those MCP tools.

Approach

Rather than waiting for Copilot SDK changes, use workspace-based MCP discovery:
the agent (Copilot CLI) already discovers MCP servers via .copilot/mcp.json in the
working directory. Waza can write this config file into each task's temp workspace and
manage the MCP server process lifecycle.

eval.yaml                    Task Workspace (/tmp/waza-xxx/)
┌───────────────┐            ┌────────────────────────────┐
│ mcp_servers:  │  ──copy──▶ │ .copilot/mcp.json          │
│   github:     │            │ fixtures/...               │
│     command:  │            │ (agent works here)         │
│     args: ... │            └────────────────────────────┘
└───────────────┘                      │
                              waza starts MCP server process
                              agent discovers via mcp.json
                              waza tracks MCP tool calls
                              waza grades MCP tool usage
                              waza stops MCP server process
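
As a concrete illustration, the workspace file waza writes might look like the following. The exact schema Copilot CLI expects is an assumption here; the `mcpServers` map follows the common MCP client convention, and the `github-mcp-server` command name is purely hypothetical:

```json
{
  "mcpServers": {
    "github": {
      "command": "github-mcp-server",
      "args": ["stdio"],
      "env": { "GITHUB_TOKEN": "${GITHUB_TOKEN}" }
    }
  }
}
```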

Phases & Todos

Phase 1: Formalize MCP Server Config Schema

Define a typed Go struct for MCP server configuration (replacing map[string]any),
and update the JSON schema with proper validation.

  • 1a. Define MCPServerConfig struct in internal/models/spec.go

    • Fields: Command, Args, Env, WorkingDir, URL (for SSE/streamable-http transports)
    • Replace ServerConfigs map[string]any with MCPServers map[string]MCPServerConfig
    • Maintain backward compatibility with existing YAML parsing
  • 1b. Update JSON schema in schemas/eval.schema.json

    • Replace additionalProperties: true with a proper per-server object schema
    • Add command (required), args (array), env (object), url (string)
  • 1c. Add config validation in spec loading

    • Validate that each server has either command or url (stdio vs remote)
    • Validate that command is a string, args is string array, etc.
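
A minimal sketch of what 1a and 1c could look like together; field and method names are illustrative, not the final API:

```go
package main

import "fmt"

// MCPServerConfig is a typed replacement for the current map[string]any.
type MCPServerConfig struct {
	Command    string            `yaml:"command,omitempty"`
	Args       []string          `yaml:"args,omitempty"`
	Env        map[string]string `yaml:"env,omitempty"`
	WorkingDir string            `yaml:"working_dir,omitempty"`
	URL        string            `yaml:"url,omitempty"` // SSE / streamable-http transports
}

// Validate enforces the stdio-vs-remote rule from 1c: a server must set
// exactly one of Command (stdio) or URL (remote).
func (c MCPServerConfig) Validate(name string) error {
	switch {
	case c.Command == "" && c.URL == "":
		return fmt.Errorf("mcp server %q: either command or url is required", name)
	case c.Command != "" && c.URL != "":
		return fmt.Errorf("mcp server %q: command and url are mutually exclusive", name)
	}
	return nil
}
```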

Phase 2: MCP Server Lifecycle Manager

Create a component that starts/stops MCP server processes and writes workspace config.

  • 2a. Create internal/mcp/lifecycle.go

    • MCPServerManager struct with Start(ctx, configs) and Stop() methods
    • Starts each configured server as a subprocess
    • Tracks PIDs for cleanup
    • Health-check: wait for server readiness (configurable timeout)
    • Graceful shutdown with SIGTERM → SIGKILL fallback
  • 2b. Create internal/mcp/workspace.go

    • WriteMCPConfig(workspaceDir, configs) function
    • Writes .copilot/mcp.json into the task workspace directory
    • Transforms MCPServerConfig → Copilot MCP JSON format
    • Creates .copilot/ directory if needed
  • 2c. Add tests for lifecycle manager

    • Test start/stop with a simple echo MCP server
    • Test config file generation
    • Test cleanup on context cancellation
    • Test error handling (server fails to start)

Phase 3: Wire Into Execution Pipeline

Connect the lifecycle manager to the orchestration runner and execution engine.

  • 3a. Update ExecutionRequest in internal/execution/engine.go

    • Add MCPServers map[string]MCPServerConfig field
    • Allows per-request MCP server config to flow through
  • 3b. Update CopilotEngine.Execute() in internal/execution/copilot.go

    • Before creating session: call WriteMCPConfig() to write config to workspace
    • After task completion: cleanup MCP config file
  • 3c. Update TestRunner.buildExecutionRequest() in internal/orchestration/runner.go

    • Pass spec.Config.MCPServers into ExecutionRequest.MCPServers
  • 3d. Update TestRunner lifecycle hooks

    • Before benchmark: start MCP servers via MCPServerManager.Start()
    • After benchmark: stop MCP servers via MCPServerManager.Stop()
    • Consider: per-task vs per-benchmark server lifetime (config option)
  • 3e. Add integration tests

    • Mock MCP server that exposes a simple tool
    • Eval YAML that configures the mock server
    • Verify agent can discover and call the MCP tool
    • Verify cleanup after task

Phase 4: MCP Tool Call Tracking

Enrich tool call data with MCP server origin information.

  • 4a. Extend ToolCall model in internal/models/events.go

    • Add Source string field: "builtin", "mcp:{server_name}", "skill"
    • Add MCPServer string field (empty for non-MCP tools)
  • 4b. Update FilterToolCalls() to classify tool origins

    • Match tool names against configured MCP server tool lists
    • Or use naming convention (MCP tools often have server-prefixed names)
    • Fall back to "builtin" if no MCP match
  • 4c. Update SessionDigest to include MCP summary

    • Add MCPToolCalls int — count of MCP-originated tool calls
    • Add MCPServersUsed []string — which MCP servers were actually called
  • 4d. Update dashboard in web/

    • Show MCP tool calls distinguished from built-in tools in TrajectoryViewer
    • Add MCP server badge/icon to tool call entries
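
The 4b fallback heuristic could be sketched as follows. The "{server}-{tool}" prefix convention is an assumption about how MCP tool names surface in the trajectory; matching against each server's advertised tool list is more robust when that list is available:

```go
package main

import "strings"

// classifyToolSource returns the Source and MCPServer values proposed in
// 4a, using the naming-convention fallback from 4b.
func classifyToolSource(toolName string, mcpServers []string) (source, server string) {
	for _, s := range mcpServers {
		if strings.HasPrefix(toolName, s+"-") {
			return "mcp:" + s, s
		}
	}
	return "builtin", ""
}
```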

Phase 5: MCP-Aware Graders

Add grading capabilities specific to MCP tool usage.

  • 5a. Extend tool_constraint grader

    • Support source: mcp:{server_name} filter in expect_tools / reject_tools
    • Example: expect_tools: [{tool: "get_issue", source: "mcp:github"}]
  • 5b. Extend behavior grader

    • Add max_mcp_calls constraint
    • Add required_mcp_servers — ensure agent used specific MCP servers
    • Add forbidden_mcp_servers — ensure agent didn't use certain servers
  • 5c. Add mcp_compliance grader (new grader type)

    • Validates that the agent correctly discovered and used MCP tools
    • Checks: tool selection accuracy, parameter correctness, error handling
    • Configurable rubric for MCP-specific evaluation criteria
    • Example config:
      graders:
        - type: mcp_compliance
          name: github_tool_usage
          config:
            server: github
            expect_tools_used: [get_issue, search_code]
            max_tool_errors: 1
  • 5d. Add tests for MCP graders
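
Pulling 5a and 5b together, an eval using the extended graders might read as follows; the field names come from the bullets above and are not final:

```yaml
graders:
  - type: tool_constraint
    name: uses_github_mcp
    config:
      expect_tools:
        - {tool: get_issue, source: "mcp:github"}
  - type: behavior
    name: mcp_budget
    config:
      max_mcp_calls: 10
      required_mcp_servers: [github]
```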

Phase 6: Task-Level MCP Overrides

Allow individual tasks to specify additional or different MCP servers.

  • 6a. Add mcp_servers to TestCase model

    • Tasks can add MCP servers beyond those in the eval config
    • Tasks can override eval-level MCP server configs
    • Merge strategy: task configs overlay eval configs
  • 6b. Update task YAML schema in schemas/

    • Add mcp_servers field to task schema
  • 6c. Update runner to merge configs

    • Eval-level mcp_servers as base
    • Task-level mcp_servers merged on top
    • Pass merged config into ExecutionRequest
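
The 6c overlay strategy is a straightforward map merge: task-level entries replace eval-level entries of the same name, and everything else passes through. Config values are kept generic here; in waza they would be the typed MCPServerConfig from Phase 1:

```go
package main

// mergeMCPServers overlays task-level MCP server configs on top of the
// eval-level base; the task config wins on a name collision.
func mergeMCPServers[T any](evalLevel, taskLevel map[string]T) map[string]T {
	merged := make(map[string]T, len(evalLevel)+len(taskLevel))
	for name, cfg := range evalLevel {
		merged[name] = cfg
	}
	for name, cfg := range taskLevel {
		merged[name] = cfg // task config wins on name collision
	}
	return merged
}
```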

Phase 7: Documentation & Examples

Update docs and provide example evals.

  • 7a. Create example eval in examples/mcp-eval/

    • Example eval YAML with MCP server config
    • Example task that requires MCP tool usage
    • Mock MCP server for the example
  • 7b. Update documentation

    • site/src/content/docs/guides/eval-yaml.mdx — mcp_servers section
    • site/src/content/docs/guides/graders.mdx — mcp_compliance grader
    • site/src/content/docs/reference/cli.mdx — any new flags
    • README.md — MCP eval support overview
  • 7c. Update AGENTS.md with MCP eval patterns

Key Design Decisions

  1. Workspace-based discovery (not SDK-level): Write .copilot/mcp.json to workspace
    so agent discovers MCP servers naturally. No SDK changes needed.

  2. Per-benchmark server lifetime (default): Start MCP servers once per eval run,
    not per-task. Add mcp_server_lifetime: per_task | per_benchmark config option.

  3. Typed config over map[string]any: Replace generic map with MCPServerConfig
    struct for type safety and validation.

  4. Additive grading: Extend existing graders (tool_constraint, behavior) rather
    than replacing them. Add new mcp_compliance grader for MCP-specific checks.

  5. Tool origin tracking: Tag each tool call with its source (builtin/mcp/skill)
    to enable source-aware grading.

Open Questions

  • Should waza validate that MCP servers are healthy before starting tasks?
    (Recommend: yes, with configurable timeout)
  • Should MCP server stdout/stderr be captured in eval results for debugging?
    (Recommend: yes, as optional verbose output)
  • Should waza support remote MCP servers (SSE/streamable-http) in addition to stdio?
    (Recommend: yes via url field, Phase 1)
  • How to handle MCP servers that need auth tokens — env var passthrough?
    (Recommend: env map in config, supports ${VAR} expansion)

Dependencies

  • No Copilot SDK changes required (workspace-based discovery)
  • mcp-go v0.45.0 already in go.mod (indirect) — may use for server health checks
  • Go 1.26+ (already required)
