Follow-up to #114. Introduce an LLM evaluation framework for the Fetch Details step that lets us compare agent performance across iterations of prompt, model, and tools.
The structured-heavy report schema that landed in #114 was designed with this in mind: outcome, source_kind, audio_url, extraction_confidence, cross_linked, and the discovered_* flags are all field-level evaluable without an LLM-as-judge. This issue is about the framework, dataset, and CI gating that turn those fields into a regression signal.
Scope
Dataset
Hand-curated YAML golden set, ~10-15 cases, in episodes/tests/fixtures/fetch_details/cases.yaml (one example entry is sketched after the list). Cover:
- All five outcomes (ok, partial, not_a_podcast_episode, unreachable, extraction_failed).
- Canonical-URL cases (rtve.es, ardsounds.de, similar publisher sites).
- Aggregator-URL cases (Apple Podcasts).
- Cross-linking cases (canonical → iTunes lookup; aggregator → canonical link).
- A 404 / non-podcast-page case for not_a_podcast_episode.
- A Spotify-style JS-only page for extraction_failed (or partial if static fetch returns enough).
- 2-3 absolute-fact cases (URL X should yield title='...') so v0 flaws don't get baked in as the de facto baseline.
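To make the expectations concrete, one entry in cases.yaml could look roughly like the sketch below; the field names (id, url, expected, audio_url_resolves, requires_apple_lookup) are illustrative assumptions, not a settled schema.

```yaml
# Sketch of a single golden case; the schema here is illustrative, not decided.
- id: rtve_canonical_cross_link
  url: https://www.rtve.es/play/audios/some-episode/   # hypothetical canonical page
  expected:
    outcome: ok
    source_kind: canonical
    cross_linked: true
    audio_url_resolves: true      # verified via HTTP HEAD by the eval
    requires_apple_lookup: true   # behavioral check: agent should call search_apple_podcasts
    # absolute-fact cases additionally pin exact values, e.g. title: "..."
```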
Framework
Sharp split:
- Field-level metrics (pytest, no LLM judge): outcome match, source_kind match, aggregator_provider match, required-field population, audio_url resolves via HTTP HEAD when expected, cross_linked flag matches, extraction_confidence calibration across the suite.
- Behavioral metrics (pytest, from tool trace): "for canonical-with-no-audio cases, did the agent call search_apple_podcasts?", median tool calls per case, ratio of correct cross-linking decisions (a pytest sketch for the field-level and behavioral checks follows this list).
- Prose quality (DeepEval, gated marker): narrative faithfulness vs trace, hints_for_next_step usefulness. Run on demand; not in the regular CI loop.
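A rough shape for the field-level and behavioral tests, assuming each YAML case carries an expected block as sketched above; run_fetch_details (returning the structured report plus the tool trace) and the trace's tool_name attribute are placeholder names:

```python
import pathlib

import pytest
import requests
import yaml

CASES = yaml.safe_load(
    pathlib.Path("episodes/tests/fixtures/fetch_details/cases.yaml").read_text()
)


@pytest.mark.fetch_details_eval
@pytest.mark.parametrize("case", CASES, ids=lambda c: c["id"])
def test_fetch_details_fields(case):
    # run_fetch_details is a placeholder for whatever invokes the agent for one
    # URL and returns (structured report, tool trace).
    report, trace = run_fetch_details(case["url"])
    expected = case["expected"]

    # Field-level checks: straight comparisons against the golden values.
    assert report.outcome == expected["outcome"]
    assert report.source_kind == expected["source_kind"]

    # audio_url must actually resolve when the case says it should.
    if expected.get("audio_url_resolves"):
        assert requests.head(report.audio_url, allow_redirects=True, timeout=10).ok

    # Behavioral check from the tool trace: canonical pages without audio
    # should trigger an Apple Podcasts lookup.
    if expected.get("requires_apple_lookup"):
        assert any(call.tool_name == "search_apple_podcasts" for call in trace)
```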
Test files:
episodes/tests/test_fetch_details_eval.py — pytest, parametrized over the YAML cases.
episodes/tests/test_fetch_details_prose_eval.py — DeepEval, behind @pytest.mark.deepeval or similar.
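For the prose suite, something in this direction; the GEval criteria wording and the load_recorded_run helper are placeholders, and the point is only that it sits behind its own marker and runs on demand:

```python
import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


@pytest.mark.deepeval
def test_narrative_faithfulness():
    # load_recorded_run is a placeholder: it should return the structured
    # report and a serialized tool trace from a prior agent run.
    report, trace_text = load_recorded_run("rtve_canonical_cross_link")

    faithfulness = GEval(
        name="Narrative faithfulness",
        criteria="The narrative must only state facts supported by the tool trace.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )
    assert_test(
        LLMTestCase(input=trace_text, actual_output=report.narrative),
        [faithfulness],
    )
```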
CI workflow
New separate workflow .github/workflows/fetch-details-eval.yml:
on:
  push:
    branches: [main]
    paths:
      - episodes/agents/fetch_details*.py
      - episodes/agents/_model.py
      - episodes/fetch_details_step.py
      - episodes/podcast_aggregators/**
      - episodes/tests/test_fetch_details*.py
      - episodes/tests/fixtures/fetch_details/**
      - .github/workflows/fetch-details-eval.yml
  pull_request:
    branches: [main]
    paths: [...same as above...]
  workflow_dispatch:
Mirror the Postgres service from ci.yml. Pin RAGTIME_FETCH_DETAILS_API_KEY as a GitHub Actions secret.
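The job section would look roughly like the sketch below; the Postgres image tag and the install step are placeholders and should be copied from ci.yml rather than from here:

```yaml
jobs:
  fetch-details-eval:
    runs-on: ubuntu-latest
    services:
      postgres:                 # mirror the exact service block from ci.yml
        image: postgres:16      # placeholder tag
        env:
          POSTGRES_PASSWORD: postgres
        ports: ["5432:5432"]
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    env:
      RAGTIME_FETCH_DETAILS_API_KEY: ${{ secrets.RAGTIME_FETCH_DETAILS_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt   # placeholder; match ci.yml's setup
      - run: pytest -m fetch_details_eval
```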
Stability
- VCR cassettes (vcrpy) for the HTTP layer (iTunes, fyyd, target podcast pages) so a third-party outage doesn't break the eval. LLM calls stay live (see the fixture sketch after this list).
- Cassettes regenerated on demand with a make target / pytest flag; reviewed in PRs that touch them.
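A possible conftest-level setup; the cassette directory, the EVAL_RECORD_MODE switch, and the LLM host list are assumptions:

```python
import os

import pytest
import vcr

# Replay-only by default (CI); set EVAL_RECORD_MODE=all locally to regenerate cassettes.
RECORD_MODE = os.environ.get("EVAL_RECORD_MODE", "none")

eval_vcr = vcr.VCR(
    cassette_library_dir="episodes/tests/fixtures/fetch_details/cassettes",
    record_mode=RECORD_MODE,
    # LLM traffic stays live: only third-party HTTP (iTunes, fyyd, podcast
    # pages) is recorded and replayed. The host list here is an assumption.
    ignore_hosts=["api.openai.com"],
    match_on=["method", "scheme", "host", "path", "query"],
)


@pytest.fixture
def fetch_details_cassette(request):
    # One cassette per eval case, named after the test id.
    with eval_vcr.use_cassette(f"{request.node.name}.yaml"):
        yield
```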
Open decisions
- CI model: production vs cheaper. The production model gives a faithful signal but costs more per CI run; a cheaper model (e.g., gpt-4.1-mini, if that isn't already the production model) tests the infrastructure but not production quality. Decide based on actual cost data once we run a few real evals.
Versioning add-on
Add agent_config_hash column to FetchDetailsRun:
agent_config_hash = models.CharField(max_length=16, blank=True, default="", db_index=True)
# = sha256(system_prompt + model_string + sorted_tool_names)[:16]
Computed once at get_agent() init, stamped on each run. Eval groups runs by hash to compare across versions. Caveat: tool names not implementations — coarse signal, but eval observes implementation behavior anyway.
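The hash itself is small enough to sketch now; the function name and where it lives are assumptions, but the recipe matches the comment above:

```python
import hashlib


def agent_config_hash(system_prompt: str, model_string: str, tool_names: list[str]) -> str:
    # sha256 over prompt + model + sorted tool names, truncated to 16 hex chars
    # so it fits the CharField(max_length=16) column.
    payload = system_prompt + model_string + "".join(sorted(tool_names))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```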
Acceptance criteria
- pytest -m fetch_details_eval runs the field-level + behavioral suite locally and in CI on path-filtered triggers.
- pytest -m deepeval runs the prose suite on demand.
- VCR cassettes covering the dataset cases.
- agent_config_hash migration + write-on-run wiring.
- Documentation: doc/plans/, doc/sessions/, doc/features/, CHANGELOG.
Related
- doc/plans/2026-04-29-fetch-details-cross-linking.md ("Deferred to follow-up issue" section)