Liquescent-Development · richardkiene · Jan 23, 2026 · Jan 23, 2026
diff --git a/docs/configuration/reference.md b/docs/configuration/reference.md
@@ -272,6 +272,7 @@ Configuration for an LLM provider used by the agent, synthetic user, or judge.
 | `base_url` | `str \| None` | No | `None` | Base URL for API endpoint (optional, provider-specific) |
 | `context_size` | `int \| None` | No | `None` | Context window size in tokens (mainly for Ollama, which defaults to only 2048) |
 | `reasoning` | `"low" \| "medium" \| "high" \| None` | No | `None` | Reasoning/thinking effort level for models that support it |
+| `extra_instructions` | `str \| None` | No | `None` | Additional instructions appended to system prompts (judge or synthetic_user) |
 
 ### Validation Rules
 
@@ -282,6 +283,56 @@ Configuration for an LLM provider used by the agent, synthetic user, or judge.
 - `api_key`: Stored as SecretStr for security (won't be logged or printed)
 - `context_size`: Must be at least 1024 if specified
 
+### Extra Instructions
+
+The `extra_instructions` field allows you to append custom instructions to the system prompts used by the judge and synthetic user. This is useful for:
+
+- Adding domain-specific evaluation criteria to the judge
+- Customizing synthetic user behavior for specific test scenarios
+- Fine-tuning evaluation strictness
+
+**Important:** Extra instructions are **appended**, not replaced, across all configuration levels: when several levels set `extra_instructions`, the values are concatenated with newlines rather than one level's value replacing another's.
+
+**Configuration priority for `extra_instructions`:**
+1. CLI arguments (highest)
+2. Per-scenario config (in scenario YAML)
+3. Component-specific config (`judge:` or `synthetic_user:` sections)
+4. Shared config (`llm:` section)
+
+**Example - Strict judge instructions:**
+```yaml
+# mcprobe.yaml
+judge:
+  provider: ollama
+  model: llama3.2
+  extra_instructions: |
+    Be strict about tool parameter validation.
+    Any missing or incorrect parameters should result in failure.
+    Do not accept approximate answers - require exact matches.
+```
+
+**Example - Custom synthetic user behavior:**
+```yaml
+# mcprobe.yaml
+synthetic_user:
+  provider: ollama
+  model: llama3.2
+  extra_instructions: |
+    Always express urgency in your requests.
+    If the agent asks more than 2 clarifying questions, show impatience.
+```
+
+**Example - Per-scenario override (in scenario YAML):**
+```yaml
+# scenario.yaml
+name: Strict Parameter Validation Test
+config:
+  judge:
+    extra_instructions: |
+      For this test, verify that ALL tool parameters match exactly.
+      This is a strict validation scenario.
+```
+
 ### Supported Providers
 
 #### Ollama
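
The append behavior described for `extra_instructions` in the hunk above can be sketched in Python as follows. This is a minimal illustration, not MCProbe's implementation: the function name `merge_extra_instructions` is hypothetical, and the concatenation order (lowest-priority level first, so higher-priority text comes last in the prompt) is an assumption that the documentation does not state.

```python
from typing import Optional


def merge_extra_instructions(
    shared: Optional[str] = None,
    component: Optional[str] = None,
    scenario: Optional[str] = None,
    cli: Optional[str] = None,
) -> Optional[str]:
    """Concatenate extra instructions from every level that sets them.

    Hypothetical helper for illustration only; not part of MCProbe.
    """
    # Assumed order: shared `llm:` config first, CLI arguments last.
    levels = [shared, component, scenario, cli]
    # Skip levels that are unset or contain only whitespace.
    parts = [text.strip() for text in levels if text and text.strip()]
    # Join with newlines, matching "concatenated with newlines" in the docs.
    return "\n".join(parts) if parts else None


merged = merge_extra_instructions(
    shared="Answer in English.",
    component="Be strict about tool parameter validation.",
    scenario="This is a strict validation scenario.",
)
print(merged)
```

In this sketch no level discards another's text; a level that sets nothing is simply skipped, and the result is `None` when no level provides instructions.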

diff --git a/docs/scenarios/format.md b/docs/scenarios/format.md
@@ -10,14 +10,15 @@ This document provides a complete reference for the MCProbe test scenario YAML f
 
 ## Schema Overview
 
-A test scenario consists of five top-level sections:
+A test scenario consists of six top-level sections:
 
 ```yaml
 name: string                    # Required: Scenario identifier
 description: string             # Required: What this scenario tests
 synthetic_user: {...}           # Required: User simulation config
 evaluation: {...}               # Required: Success/failure criteria
 tags: [...]                     # Optional: Classification tags
+config: {...}                   # Optional: Per-scenario LLM overrides
 ```
 
 ## Complete Schema Reference
@@ -63,6 +64,112 @@ tags: [...]                     # Optional: Classification tags
 - **Description:** Tags for organizing and filtering scenarios
 - **Example:** `["weather", "clarification", "basic"]`
 
+#### `config`
+- **Type:** `ScenarioConfig` object
+- **Required:** No
+- **Default:** `null`
+- **Description:** Per-scenario LLM configuration overrides for judge and synthetic user
+- **See:** [Per-Scenario Configuration](#per-scenario-configuration) section below
+
+### Per-Scenario Configuration
+
+The `config` section allows you to override LLM settings for specific scenarios without changing your global `mcprobe.yaml` configuration. This is useful for:
+
+- Using a stricter or more lenient judge for specific test cases
+- Customizing synthetic user behavior per scenario
+- Testing with different models for specific scenarios
+
+#### ScenarioConfig Fields
+
+##### `config.judge`
+- **Type:** `ScenarioLLMOverride` object
+- **Required:** No
+- **Description:** Overrides for the judge LLM configuration
+
+##### `config.synthetic_user`
+- **Type:** `ScenarioLLMOverride` object
+- **Required:** No
+- **Description:** Overrides for the synthetic user LLM configuration
+
+#### ScenarioLLMOverride Fields
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `model` | `string \| null` | Override the model for this scenario |
+| `temperature` | `float \| null` | Override the temperature for this scenario |
+| `extra_instructions` | `string \| null` | Additional instructions appended to system prompts |
+
+**Note:** Only the fields you specify are overridden; unspecified fields inherit from the global configuration.
+
+#### Per-Scenario Configuration Examples
+
+**Custom judge instructions for strict validation:**
+```yaml
+name: Strict Tool Parameter Test
+description: Verifies exact parameter matching
+
+config:
+  judge:
+    extra_instructions: |
+      Be extremely strict about tool parameter validation.
+      Any parameter that doesn't match exactly should fail the test.
+      Do not accept approximate or "close enough" values.
+
+synthetic_user:
+  persona: A precise user who expects exact results
+  initial_query: Get the weather for latitude 37.7749, longitude -122.4194
+
+evaluation:
+  correctness_criteria:
+    - The agent calls get_weather with exact coordinates provided
+```
+
+**Different model for complex scenario:**
+```yaml
+name: Complex Multi-Step Reasoning
+description: Tests multi-step problem solving
+
+config:
+  judge:
+    model: gpt-4o  # Use more capable model for complex evaluation
+  synthetic_user:
+    model: gpt-4o-mini
+    temperature: 0.5  # Slightly more varied responses
+
+synthetic_user:
+  persona: A user with a complex, multi-part question
+  initial_query: I need to plan a trip considering weather, flights, and hotels
+
+evaluation:
+  correctness_criteria:
+    - The agent addresses all three aspects of the query
+```
+
+**Customizing synthetic user behavior:**
+```yaml
+name: Impatient User Test
+description: Tests agent response to impatient users
+
+config:
+  synthetic_user:
+    extra_instructions: |
+      You are in a hurry. Express frustration if the agent asks
+      more than one clarifying question. Use short, curt responses.
+
+synthetic_user:
+  persona: A busy executive with no time to spare
+  initial_query: Weather now!
+  clarification_behavior:
+    traits:
+      patience: low
+      verbosity: concise
+
+evaluation:
+  correctness_criteria:
+    - The agent provides weather information quickly
+    - The agent minimizes clarifying questions
+```
+
 ### SyntheticUserConfig Fields
 
 #### `persona`
@@ -271,6 +378,15 @@ description: |
   ask appropriate clarifying questions, and provide accurate results with
   proper error handling.
 
+# Optional: Per-scenario LLM configuration overrides
+config:
+  judge:
+    extra_instructions: |
+      Pay special attention to whether the agent correctly interprets
+      "tomorrow" as the next calendar day, not "in 24 hours".
+  synthetic_user:
+    temperature: 0.2  # Slightly varied responses
+
 synthetic_user:
   persona: |
     A busy professional who needs quick weather information but tends to