diff --git a/docs/configuration/reference.md b/docs/configuration/reference.md
index f19be6e..92bc8a7 100644
--- a/docs/configuration/reference.md
+++ b/docs/configuration/reference.md
@@ -272,6 +272,7 @@ Configuration for an LLM provider used by the agent, synthetic user, or judge.
 | `base_url` | `str \| None` | No | `None` | Base URL for API endpoint (optional, provider-specific) |
 | `context_size` | `int \| None` | No | `None` | Context window size in tokens (mainly for Ollama, which defaults to only 2048) |
 | `reasoning` | `"low" \| "medium" \| "high" \| None` | No | `None` | Reasoning/thinking effort level for models that support it |
+| `extra_instructions` | `str \| None` | No | `None` | Additional instructions appended to system prompts (judge or synthetic_user) |
 
 ### Validation Rules
 
@@ -282,6 +283,56 @@ Configuration for an LLM provider used by the agent, synthetic user, or judge.
 - `api_key`: Stored as SecretStr for security (won't be logged or printed)
 - `context_size`: Must be at least 1024 if specified
 
+### Extra Instructions
+
+The `extra_instructions` field allows you to append custom instructions to the system prompts used by the judge and synthetic user. This is useful for:
+
+- Adding domain-specific evaluation criteria to the judge
+- Customizing synthetic user behavior for specific test scenarios
+- Fine-tuning evaluation strictness
+
+**Important:** Extra instructions are **appended** (not replaced) across all configuration levels. Instructions from multiple sources are concatenated with newlines.
+
+**Configuration priority for extra_instructions:**
+1. CLI arguments (highest)
+2. Per-scenario config (in scenario YAML)
+3. Component-specific config (`judge:` or `synthetic_user:` sections)
+4. Shared config (`llm:` section)
+
+**Example - Strict judge instructions:**
+```yaml
+# mcprobe.yaml
+judge:
+  provider: ollama
+  model: llama3.2
+  extra_instructions: |
+    Be strict about tool parameter validation.
+    Any missing or incorrect parameters should result in failure.
+    Do not accept approximate answers - require exact matches.
+```
+
+**Example - Custom synthetic user behavior:**
+```yaml
+# mcprobe.yaml
+synthetic_user:
+  provider: ollama
+  model: llama3.2
+  extra_instructions: |
+    Always express urgency in your requests.
+    If the agent asks more than 2 clarifying questions, show impatience.
+```
+
+**Example - Per-scenario override (in scenario YAML):**
+```yaml
+# scenario.yaml
+name: Strict Parameter Validation Test
+config:
+  judge:
+    extra_instructions: |
+      For this test, verify that ALL tool parameters match exactly.
+      This is a strict validation scenario.
+```
+
 ### Supported Providers
 
 #### Ollama
diff --git a/docs/scenarios/format.md b/docs/scenarios/format.md
index ab4d33d..c46a764 100644
--- a/docs/scenarios/format.md
+++ b/docs/scenarios/format.md
@@ -10,7 +10,7 @@ This document provides a complete reference for the MCProbe test scenario YAML f
 
 ## Schema Overview
 
-A test scenario consists of five top-level sections:
+A test scenario consists of six top-level sections:
 
 ```yaml
 name: string  # Required: Scenario identifier
@@ -18,6 +18,7 @@ description: string  # Required: What this scenario tests
 synthetic_user: {...}  # Required: User simulation config
 evaluation: {...}  # Required: Success/failure criteria
 tags: [...]  # Optional: Classification tags
+config: {...}  # Optional: Per-scenario LLM overrides
 ```
 
 ## Complete Schema Reference
@@ -63,6 +64,112 @@ tags: [...]  # Optional: Classification tags
 - **Description:** Tags for organizing and filtering scenarios
 - **Example:** `["weather", "clarification", "basic"]`
 
+#### `config`
+- **Type:** `ScenarioConfig` object
+- **Required:** No
+- **Default:** `null`
+- **Description:** Per-scenario LLM configuration overrides for judge and synthetic user
+- **See:** [Per-Scenario Configuration](#per-scenario-configuration) section below
+
+### Per-Scenario Configuration
+
+The `config` section allows you to override LLM settings for specific scenarios without changing your global `mcprobe.yaml` configuration. This is useful for:
+
+- Using a stricter/more lenient judge for specific test cases
+- Customizing synthetic user behavior per scenario
+- Testing with different models for specific scenarios
+
+#### ScenarioConfig Fields
+
+##### `config.judge`
+- **Type:** `ScenarioLLMOverride` object
+- **Required:** No
+- **Description:** Overrides for the judge LLM configuration
+
+##### `config.synthetic_user`
+- **Type:** `ScenarioLLMOverride` object
+- **Required:** No
+- **Description:** Overrides for the synthetic user LLM configuration
+
+#### ScenarioLLMOverride Fields
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `model` | `string \| null` | Override the model for this scenario |
+| `temperature` | `float \| null` | Override the temperature for this scenario |
+| `extra_instructions` | `string \| null` | Additional instructions appended to system prompts |
+
+**Note:** Only specified fields are overridden. Unspecified fields inherit from global config.
+
+#### Per-Scenario Configuration Examples
+
+**Custom judge instructions for strict validation:**
+```yaml
+name: Strict Tool Parameter Test
+description: Verifies exact parameter matching
+
+config:
+  judge:
+    extra_instructions: |
+      Be extremely strict about tool parameter validation.
+      Any parameter that doesn't match exactly should fail the test.
+      Do not accept approximate or "close enough" values.
+
+synthetic_user:
+  persona: A precise user who expects exact results
+  initial_query: Get the weather for latitude 37.7749, longitude -122.4194
+
+evaluation:
+  correctness_criteria:
+    - The agent calls get_weather with exact coordinates provided
+```
+
+**Different model for complex scenario:**
+```yaml
+name: Complex Multi-Step Reasoning
+description: Tests multi-step problem solving
+
+config:
+  judge:
+    model: gpt-4o  # Use more capable model for complex evaluation
+  synthetic_user:
+    model: gpt-4o-mini
+    temperature: 0.5  # Slightly more varied responses
+
+synthetic_user:
+  persona: A user with a complex, multi-part question
+  initial_query: I need to plan a trip considering weather, flights, and hotels
+
+evaluation:
+  correctness_criteria:
+    - The agent addresses all three aspects of the query
+```
+
+**Customizing synthetic user behavior:**
+```yaml
+name: Impatient User Test
+description: Tests agent response to impatient users
+
+config:
+  synthetic_user:
+    extra_instructions: |
+      You are in a hurry. Express frustration if the agent asks
+      more than one clarifying question. Use short, curt responses.
+
+synthetic_user:
+  persona: A busy executive with no time to spare
+  initial_query: Weather now!
+  clarification_behavior:
+    traits:
+      patience: low
+      verbosity: concise
+
+evaluation:
+  correctness_criteria:
+    - The agent provides weather information quickly
+    - The agent minimizes clarifying questions
+```
+
 ### SyntheticUserConfig Fields
 
 #### `persona`
@@ -271,6 +378,15 @@ description: |
   ask appropriate clarifying questions, and provide accurate results with proper error handling.
 
+# Optional: Per-scenario LLM configuration overrides
+config:
+  judge:
+    extra_instructions: |
+      Pay special attention to whether the agent correctly interprets
+      "tomorrow" as the next calendar day, not "in 24 hours".
+  synthetic_user:
+    temperature: 0.2  # Slightly varied responses
+
 synthetic_user:
   persona: |
     A busy professional who needs quick weather information but tends to