Merged
51 changes: 51 additions & 0 deletions docs/configuration/reference.md
@@ -272,6 +272,7 @@ Configuration for an LLM provider used by the agent, synthetic user, or judge.
| `base_url` | `str \| None` | No | `None` | Base URL for API endpoint (optional, provider-specific) |
| `context_size` | `int \| None` | No | `None` | Context window size in tokens (mainly for Ollama, which defaults to only 2048) |
| `reasoning` | `"low" \| "medium" \| "high" \| None` | No | `None` | Reasoning/thinking effort level for models that support it |
| `extra_instructions` | `str \| None` | No | `None` | Additional instructions appended to system prompts (judge or synthetic_user) |

### Validation Rules

@@ -282,6 +283,56 @@ Configuration for an LLM provider used by the agent, synthetic user, or judge.
- `api_key`: Stored as SecretStr for security (won't be logged or printed)
- `context_size`: Must be at least 1024 if specified

### Extra Instructions

The `extra_instructions` field allows you to append custom instructions to the system prompts used by the judge and synthetic user. This is useful for:

- Adding domain-specific evaluation criteria to the judge
- Customizing synthetic user behavior for specific test scenarios
- Fine-tuning evaluation strictness

**Important:** Extra instructions accumulate across configuration levels. A value set at one level is **appended** to the values from the other levels rather than replacing them, and instructions from multiple sources are concatenated with newlines.

**Configuration priority for extra_instructions:**
1. CLI arguments (highest)
2. Per-scenario config (in scenario YAML)
3. Component-specific config (`judge:` or `synthetic_user:` sections)
4. Shared config (`llm:` section)
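
The append behavior can be pictured with a short Python sketch. This is not MCProbe's implementation: the function name `combine_extra_instructions` is hypothetical, and the ordering used here (lowest-priority source first, so higher-priority instructions come last) is an assumption that the text above does not confirm.

```python
from typing import Optional, Sequence


def combine_extra_instructions(sources: Sequence[Optional[str]]) -> Optional[str]:
    """Join every non-empty instruction block with newlines.

    `sources` is assumed to be ordered from lowest to highest priority
    (shared, component, per-scenario, CLI). That ordering is an assumption
    of this sketch, not documented MCProbe behavior.
    """
    parts = [s.strip() for s in sources if s and s.strip()]
    return "\n".join(parts) if parts else None


# Hypothetical values for each configuration level.
shared = "Answer in English."                    # llm: section
component = "Be strict about tool parameters."  # judge: section
scenario = None                                  # nothing set in the scenario YAML
cli = "Require exact matches."                   # CLI argument

combined = combine_extra_instructions([shared, component, scenario, cli])
print(combined)
```

Levels that set nothing are skipped, so a scenario without its own instructions still receives the shared and component-level text.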

**Example - Strict judge instructions:**
```yaml
# mcprobe.yaml
judge:
  provider: ollama
  model: llama3.2
  extra_instructions: |
    Be strict about tool parameter validation.
    Any missing or incorrect parameters should result in failure.
    Do not accept approximate answers - require exact matches.
```

**Example - Custom synthetic user behavior:**
```yaml
# mcprobe.yaml
synthetic_user:
  provider: ollama
  model: llama3.2
  extra_instructions: |
    Always express urgency in your requests.
    If the agent asks more than 2 clarifying questions, show impatience.
```

**Example - Per-scenario override (in scenario YAML):**
```yaml
# scenario.yaml
name: Strict Parameter Validation Test
config:
  judge:
    extra_instructions: |
      For this test, verify that ALL tool parameters match exactly.
      This is a strict validation scenario.
```

### Supported Providers

#### Ollama
118 changes: 117 additions & 1 deletion docs/scenarios/format.md
@@ -10,14 +10,15 @@ This document provides a complete reference for the MCProbe test scenario YAML f

## Schema Overview

-A test scenario consists of five top-level sections:
+A test scenario consists of six top-level sections:

```yaml
name: string # Required: Scenario identifier
description: string # Required: What this scenario tests
synthetic_user: {...} # Required: User simulation config
evaluation: {...} # Required: Success/failure criteria
tags: [...] # Optional: Classification tags
config: {...} # Optional: Per-scenario LLM overrides
```
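
As an illustration of the required/optional split above, here is a minimal Python sketch that checks the top-level keys of an already-parsed scenario. The function `check_top_level_keys` is hypothetical and not part of MCProbe, and flagging unknown keys is a choice made for this sketch rather than documented behavior.

```python
from typing import Any, Mapping

# Key names taken from the schema overview above.
REQUIRED_KEYS = ("name", "description", "synthetic_user", "evaluation")
OPTIONAL_KEYS = ("tags", "config")


def check_top_level_keys(scenario: Mapping[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the keys look right."""
    problems = [f"missing required key: {k}" for k in REQUIRED_KEYS if k not in scenario]
    allowed = set(REQUIRED_KEYS) | set(OPTIONAL_KEYS)
    # Assumption of this sketch: anything outside the six documented keys is flagged.
    problems += [f"unknown key: {k}" for k in scenario if k not in allowed]
    return problems


# A scenario as it might look after YAML parsing (values are placeholders).
scenario = {
    "name": "Strict Parameter Validation Test",
    "description": "Verifies exact parameter matching",
    "synthetic_user": {"persona": "A precise user"},
    "evaluation": {"correctness_criteria": ["..."]},
    "config": {"judge": {"extra_instructions": "Be strict."}},
}

print(check_top_level_keys(scenario))
print(check_top_level_keys({"name": "incomplete"}))
```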

## Complete Schema Reference
@@ -63,6 +64,112 @@ tags: [...] # Optional: Classification tags
- **Description:** Tags for organizing and filtering scenarios
- **Example:** `["weather", "clarification", "basic"]`

#### `config`
- **Type:** `ScenarioConfig` object
- **Required:** No
- **Default:** `null`
- **Description:** Per-scenario LLM configuration overrides for judge and synthetic user
- **See:** [Per-Scenario Configuration](#per-scenario-configuration) section below

### Per-Scenario Configuration

The `config` section allows you to override LLM settings for specific scenarios without changing your global `mcprobe.yaml` configuration. This is useful for:

- Using a stricter or more lenient judge for specific test cases
- Customizing synthetic user behavior per scenario
- Testing with different models for specific scenarios

#### ScenarioConfig Fields

##### `config.judge`
- **Type:** `ScenarioLLMOverride` object
- **Required:** No
- **Description:** Overrides for the judge LLM configuration

##### `config.synthetic_user`
- **Type:** `ScenarioLLMOverride` object
- **Required:** No
- **Description:** Overrides for the synthetic user LLM configuration

#### ScenarioLLMOverride Fields

| Field | Type | Description |
|-------|------|-------------|
| `model` | `string \| null` | Override the model for this scenario |
| `temperature` | `float \| null` | Override the temperature for this scenario |
| `extra_instructions` | `string \| null` | Additional instructions appended to system prompts |

**Note:** Only specified fields are overridden. Unspecified fields inherit from global config.
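
The inheritance rule in the note can be sketched in Python as follows. The classes `LLMSettings` and `ScenarioOverride` and the function `apply_override` are hypothetical stand-ins, not MCProbe's actual types. The sketch assumes that `model` and `temperature` replace the global value when set, while `extra_instructions` is appended, as the configuration reference describes.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class LLMSettings:
    """Hypothetical stand-in for the resolved global LLM configuration."""
    model: str
    temperature: float
    extra_instructions: Optional[str] = None


@dataclass(frozen=True)
class ScenarioOverride:
    """Mirrors the three ScenarioLLMOverride fields from the table above."""
    model: Optional[str] = None
    temperature: Optional[float] = None
    extra_instructions: Optional[str] = None


def apply_override(base: LLMSettings, override: Optional[ScenarioOverride]) -> LLMSettings:
    """Return settings where only the fields set on `override` differ from `base`."""
    if override is None:
        return base
    # Assumption: instructions accumulate (global first), joined with a newline.
    texts = [t for t in (base.extra_instructions, override.extra_instructions) if t]
    return replace(
        base,
        # `is not None` keeps legitimate falsy overrides such as temperature 0.0.
        model=override.model if override.model is not None else base.model,
        temperature=override.temperature if override.temperature is not None else base.temperature,
        extra_instructions="\n".join(texts) or None,
    )


global_judge = LLMSettings(model="llama3.2", temperature=0.0, extra_instructions="Be strict.")
override = ScenarioOverride(temperature=0.5, extra_instructions="Require exact matches.")
resolved = apply_override(global_judge, override)
print(resolved)
```

Here `model` is left unset in the override, so the resolved settings keep `llama3.2` from the global configuration.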

#### Per-Scenario Configuration Examples

**Custom judge instructions for strict validation:**
```yaml
name: Strict Tool Parameter Test
description: Verifies exact parameter matching

config:
  judge:
    extra_instructions: |
      Be extremely strict about tool parameter validation.
      Any parameter that doesn't match exactly should fail the test.
      Do not accept approximate or "close enough" values.

synthetic_user:
  persona: A precise user who expects exact results
  initial_query: Get the weather for latitude 37.7749, longitude -122.4194

evaluation:
  correctness_criteria:
    - The agent calls get_weather with exact coordinates provided
```

**Different model for complex scenario:**
```yaml
name: Complex Multi-Step Reasoning
description: Tests multi-step problem solving

config:
  judge:
    model: gpt-4o  # Use more capable model for complex evaluation
  synthetic_user:
    model: gpt-4o-mini
    temperature: 0.5  # Slightly more varied responses

synthetic_user:
  persona: A user with a complex, multi-part question
  initial_query: I need to plan a trip considering weather, flights, and hotels

evaluation:
  correctness_criteria:
    - The agent addresses all three aspects of the query
```

**Customizing synthetic user behavior:**
```yaml
name: Impatient User Test
description: Tests agent response to impatient users

config:
  synthetic_user:
    extra_instructions: |
      You are in a hurry. Express frustration if the agent asks
      more than one clarifying question. Use short, curt responses.

synthetic_user:
  persona: A busy executive with no time to spare
  initial_query: Weather now!
  clarification_behavior:
    traits:
      patience: low
      verbosity: concise

evaluation:
  correctness_criteria:
    - The agent provides weather information quickly
    - The agent minimizes clarifying questions
```

### SyntheticUserConfig Fields

#### `persona`
@@ -271,6 +378,15 @@ description: |
  ask appropriate clarifying questions, and provide accurate results with
  proper error handling.

# Optional: Per-scenario LLM configuration overrides
config:
  judge:
    extra_instructions: |
      Pay special attention to whether the agent correctly interprets
      "tomorrow" as the next calendar day, not "in 24 hours".
  synthetic_user:
    temperature: 0.2  # Slightly varied responses

synthetic_user:
  persona: |
    A busy professional who needs quick weather information but tends to