Support evals against mcp #147

@RickWinter

Description

Plan: MCP Server Provisioning for Waza Evaluations

Problem Statement

Waza can evaluate AI agents on tasks, but cannot currently test agents that use MCP tools.
The mcp_servers field exists in the eval YAML schema and is parsed into Config.ServerConfigs,
but is never consumed — no servers are started, no config is written to workspaces, and no
MCP-aware grading exists.

Goal: Allow eval authors to declare MCP servers in eval YAML, have waza provision them
during task execution, and grade how effectively the agent uses those MCP tools.

Approach

Rather than waiting for Copilot SDK changes, use workspace-based MCP discovery:
the agent (Copilot CLI) already discovers MCP servers via .copilot/mcp.json in the
working directory. Waza can write this config file into each task's temp workspace and
manage the MCP server process lifecycle.

eval.yaml                    Task Workspace (/tmp/waza-xxx/)
┌───────────────┐            ┌────────────────────────────┐
│ mcp_servers:  │  ──copy──▶ │ .copilot/mcp.json          │
│   github:     │            │ fixtures/...               │
│     command:  │            │ (agent works here)         │
│     args: ... │            └────────────────────────────┘
└───────────────┘                      │
                              waza starts MCP server process
                              agent discovers via mcp.json
                              waza tracks MCP tool calls
                              waza grades MCP tool usage
                              waza stops MCP server process
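
As a concrete illustration, the workspace file waza writes might look like the following. The exact schema Copilot CLI expects is an assumption here; the `mcpServers` map follows the common MCP client convention, and the `github-mcp-server` command name is purely hypothetical:

```json
{
  "mcpServers": {
    "github": {
      "command": "github-mcp-server",
      "args": ["stdio"],
      "env": { "GITHUB_TOKEN": "${GITHUB_TOKEN}" }
    }
  }
}
```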

Phases & Todos

Phase 1: Formalize MCP Server Config Schema

Define a typed Go struct for MCP server configuration (replacing map[string]any),
and update the JSON schema with proper validation.

  • 1a. Define MCPServerConfig struct in internal/models/spec.go

    • Fields: Command, Args, Env, WorkingDir, URL (for SSE/streamable-http transports)
    • Replace ServerConfigs map[string]any with MCPServers map[string]MCPServerConfig
    • Maintain backward compatibility with existing YAML parsing
  • 1b. Update JSON schema in schemas/eval.schema.json

    • Replace additionalProperties: true with a proper per-server object schema
    • Add command (required), args (array), env (object), url (string)
  • 1c. Add config validation in spec loading

    • Validate that each server has either command or url (stdio vs remote)
    • Validate that command is a string, args is string array, etc.
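
A minimal sketch of what 1a and 1c could look like together; field and method names are illustrative, not the final API:

```go
package main

import "fmt"

// MCPServerConfig is a typed replacement for the current map[string]any.
type MCPServerConfig struct {
	Command    string            `yaml:"command,omitempty"`
	Args       []string          `yaml:"args,omitempty"`
	Env        map[string]string `yaml:"env,omitempty"`
	WorkingDir string            `yaml:"working_dir,omitempty"`
	URL        string            `yaml:"url,omitempty"` // SSE / streamable-http transports
}

// Validate enforces the stdio-vs-remote rule from 1c: a server must set
// exactly one of Command (stdio) or URL (remote).
func (c MCPServerConfig) Validate(name string) error {
	switch {
	case c.Command == "" && c.URL == "":
		return fmt.Errorf("mcp server %q: either command or url is required", name)
	case c.Command != "" && c.URL != "":
		return fmt.Errorf("mcp server %q: command and url are mutually exclusive", name)
	}
	return nil
}
```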

Phase 2: MCP Server Lifecycle Manager

Create a component that starts/stops MCP server processes and writes workspace config.

  • 2a. Create internal/mcp/lifecycle.go

    • MCPServerManager struct with Start(ctx, configs) and Stop() methods
    • Starts each configured server as a subprocess
    • Tracks PIDs for cleanup
    • Health-check: wait for server readiness (configurable timeout)
    • Graceful shutdown with SIGTERM → SIGKILL fallback
  • 2b. Create internal/mcp/workspace.go

    • WriteMCPConfig(workspaceDir, configs) function
    • Writes .copilot/mcp.json into the task workspace directory
    • Transforms MCPServerConfig → Copilot MCP JSON format
    • Creates .copilot/ directory if needed
  • 2c. Add tests for lifecycle manager

    • Test start/stop with a simple echo MCP server
    • Test config file generation
    • Test cleanup on context cancellation
    • Test error handling (server fails to start)

Phase 3: Wire Into Execution Pipeline

Connect the lifecycle manager to the orchestration runner and execution engine.

  • 3a. Update ExecutionRequest in internal/execution/engine.go

    • Add MCPServers map[string]MCPServerConfig field
    • Allows per-request MCP server config to flow through
  • 3b. Update CopilotEngine.Execute() in internal/execution/copilot.go

    • Before creating session: call WriteMCPConfig() to write config to workspace
    • After task completion: cleanup MCP config file
  • 3c. Update TestRunner.buildExecutionRequest() in internal/orchestration/runner.go

    • Pass spec.Config.MCPServers into ExecutionRequest.MCPServers
  • 3d. Update TestRunner lifecycle hooks

    • Before benchmark: start MCP servers via MCPServerManager.Start()
    • After benchmark: stop MCP servers via MCPServerManager.Stop()
    • Consider: per-task vs per-benchmark server lifetime (config option)
  • 3e. Add integration tests

    • Mock MCP server that exposes a simple tool
    • Eval YAML that configures the mock server
    • Verify agent can discover and call the MCP tool
    • Verify cleanup after task

Phase 4: MCP Tool Call Tracking

Enrich tool call data with MCP server origin information.

  • 4a. Extend ToolCall model in internal/models/events.go

    • Add Source string field: "builtin", "mcp:{server_name}", "skill"
    • Add MCPServer string field (empty for non-MCP tools)
  • 4b. Update FilterToolCalls() to classify tool origins

    • Match tool names against configured MCP server tool lists
    • Or use naming convention (MCP tools often have server-prefixed names)
    • Fall back to "builtin" if no MCP match
  • 4c. Update SessionDigest to include MCP summary

    • Add MCPToolCalls int — count of MCP-originated tool calls
    • Add MCPServersUsed []string — which MCP servers were actually called
  • 4d. Update dashboard in web/

    • Show MCP tool calls distinguished from built-in tools in TrajectoryViewer
    • Add MCP server badge/icon to tool call entries
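
The 4b fallback heuristic could be sketched as follows. The "{server}-{tool}" prefix convention is an assumption about how MCP tool names surface in the trajectory; matching against each server's advertised tool list is more robust when that list is available:

```go
package main

import "strings"

// classifyToolSource returns the Source and MCPServer values proposed in
// 4a, using the naming-convention fallback from 4b.
func classifyToolSource(toolName string, mcpServers []string) (source, server string) {
	for _, s := range mcpServers {
		if strings.HasPrefix(toolName, s+"-") {
			return "mcp:" + s, s
		}
	}
	return "builtin", ""
}
```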

Phase 5: MCP-Aware Graders

Add grading capabilities specific to MCP tool usage.

  • 5a. Extend tool_constraint grader

    • Support source: mcp:{server_name} filter in expect_tools / reject_tools
    • Example: expect_tools: [{tool: "get_issue", source: "mcp:github"}]
  • 5b. Extend behavior grader

    • Add max_mcp_calls constraint
    • Add required_mcp_servers — ensure agent used specific MCP servers
    • Add forbidden_mcp_servers — ensure agent didn't use certain servers
  • 5c. Add mcp_compliance grader (new grader type)

    • Validates that the agent correctly discovered and used MCP tools
    • Checks: tool selection accuracy, parameter correctness, error handling
    • Configurable rubric for MCP-specific evaluation criteria
    • Example config:
      graders:
        - type: mcp_compliance
          name: github_tool_usage
          config:
            server: github
            expect_tools_used: [get_issue, search_code]
            max_tool_errors: 1
  • 5d. Add tests for MCP graders
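
Pulling 5a and 5b together, an eval using the extended graders might read as follows; the field names come from the bullets above and are not final:

```yaml
graders:
  - type: tool_constraint
    name: uses_github_mcp
    config:
      expect_tools:
        - {tool: get_issue, source: "mcp:github"}
  - type: behavior
    name: mcp_budget
    config:
      max_mcp_calls: 10
      required_mcp_servers: [github]
```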

Phase 6: Task-Level MCP Overrides

Allow individual tasks to specify additional or different MCP servers.

  • 6a. Add mcp_servers to TestCase model

    • Tasks can add MCP servers beyond those in the eval config
    • Tasks can override eval-level MCP server configs
    • Merge strategy: task configs overlay eval configs
  • 6b. Update task YAML schema in schemas/

    • Add mcp_servers field to task schema
  • 6c. Update runner to merge configs

    • Eval-level mcp_servers as base
    • Task-level mcp_servers merged on top
    • Pass merged config into ExecutionRequest
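
The 6c overlay strategy is a straightforward map merge: task-level entries replace eval-level entries of the same name, and everything else passes through. Config values are kept generic here; in waza they would be the typed MCPServerConfig from Phase 1:

```go
package main

// mergeMCPServers overlays task-level MCP server configs on top of the
// eval-level base; the task config wins on a name collision.
func mergeMCPServers[T any](evalLevel, taskLevel map[string]T) map[string]T {
	merged := make(map[string]T, len(evalLevel)+len(taskLevel))
	for name, cfg := range evalLevel {
		merged[name] = cfg
	}
	for name, cfg := range taskLevel {
		merged[name] = cfg // task config wins on name collision
	}
	return merged
}
```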

Phase 7: Documentation & Examples

Update docs and provide example evals.

  • 7a. Create example eval in examples/mcp-eval/

    • Example eval YAML with MCP server config
    • Example task that requires MCP tool usage
    • Mock MCP server for the example
  • 7b. Update documentation

    • site/src/content/docs/guides/eval-yaml.mdx — mcp_servers section
    • site/src/content/docs/guides/graders.mdx — mcp_compliance grader
    • site/src/content/docs/reference/cli.mdx — any new flags
    • README.md — MCP eval support overview
  • 7c. Update AGENTS.md with MCP eval patterns

Key Design Decisions

  1. Workspace-based discovery (not SDK-level): Write .copilot/mcp.json to workspace
    so agent discovers MCP servers naturally. No SDK changes needed.

  2. Per-benchmark server lifetime (default): Start MCP servers once per eval run,
    not per-task. Add mcp_server_lifetime: per_task | per_benchmark config option.

  3. Typed config over map[string]any: Replace generic map with MCPServerConfig
    struct for type safety and validation.

  4. Additive grading: Extend existing graders (tool_constraint, behavior) rather
    than replacing them. Add new mcp_compliance grader for MCP-specific checks.

  5. Tool origin tracking: Tag each tool call with its source (builtin/mcp/skill)
    to enable source-aware grading.

Open Questions

  • Should waza validate that MCP servers are healthy before starting tasks?
    (Recommend: yes, with configurable timeout)
  • Should MCP server stdout/stderr be captured in eval results for debugging?
    (Recommend: yes, as optional verbose output)
  • Should waza support remote MCP servers (SSE/streamable-http) in addition to stdio?
    (Recommend: yes via url field, Phase 1)
  • How to handle MCP servers that need auth tokens — env var passthrough?
    (Recommend: env map in config, supports ${VAR} expansion)

Dependencies

  • No Copilot SDK changes required (workspace-based discovery)
  • mcp-go v0.45.0 already in go.mod (indirect) — may use for server health checks
  • Go 1.26+ (already required)
