Follow-up to #114. Introduce an LLM evaluation framework for the Fetch Details step that lets us compare agent performance across iterations of prompt, model, and tools.
The structured-heavy report schema that landed in #114 was designed with this in mind: outcome, source_kind, audio_url, extraction_confidence, cross_linked, and the discovered_* flags are all field-level evaluable without an LLM-as-judge. This issue is about the framework, dataset, and CI gating that turn those fields into a regression signal.
Scope
Dataset
Hand-curated YAML golden set, ~10-15 cases, in episodes/tests/fixtures/fetch_details/cases.yaml (one example entry is sketched after the list). Cover:
- All five outcomes (ok, partial, not_a_podcast_episode, unreachable, extraction_failed).
- Canonical-URL cases (rtve.es, ardsounds.de, similar publisher sites).
- Aggregator-URL cases (Apple Podcasts).
- Cross-linking cases (canonical → iTunes lookup; aggregator → canonical link).
- A 404 / non-podcast-page case for not_a_podcast_episode.
- A Spotify-style JS-only page for extraction_failed (or partial if static fetch returns enough).
- 2-3 absolute-fact cases (URL X should yield title='...') so v0 flaws don't get baked in as the de facto baseline.
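To make the expectations concrete, one entry in cases.yaml could look roughly like the sketch below; the field names (id, url, expected, audio_url_resolves, requires_apple_lookup) are illustrative assumptions, not a settled schema.

```yaml
# Sketch of a single golden case; the schema here is illustrative, not decided.
- id: rtve_canonical_cross_link
  url: https://www.rtve.es/play/audios/some-episode/   # hypothetical canonical page
  expected:
    outcome: ok
    source_kind: canonical
    cross_linked: true
    audio_url_resolves: true      # verified via HTTP HEAD by the eval
    requires_apple_lookup: true   # behavioral check: agent should call search_apple_podcasts
    # absolute-fact cases additionally pin exact values, e.g. title: "..."
```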
Framework
Sharp split:
- Field-level metrics (pytest, no LLM judge): outcome match, source_kind match, aggregator_provider match, required-field population, audio_url resolves via HTTP HEAD when expected, cross_linked flag matches, extraction_confidence calibration across the suite.
- Behavioral metrics (pytest, from tool trace): "for canonical-with-no-audio cases, did the agent call search_apple_podcasts?", median tool calls per case, ratio of correct cross-linking decisions (a pytest sketch for the field-level and behavioral checks follows this list).
- Prose quality (DeepEval, gated marker): narrative faithfulness vs trace, hints_for_next_step usefulness. Run on demand; not in the regular CI loop.
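A rough shape for the field-level and behavioral tests, assuming each YAML case carries an expected block as sketched above; run_fetch_details (returning the structured report plus the tool trace) and the trace's tool_name attribute are placeholder names:

```python
import pathlib

import pytest
import requests
import yaml

CASES = yaml.safe_load(
    pathlib.Path("episodes/tests/fixtures/fetch_details/cases.yaml").read_text()
)


@pytest.mark.fetch_details_eval
@pytest.mark.parametrize("case", CASES, ids=lambda c: c["id"])
def test_fetch_details_fields(case):
    # run_fetch_details is a placeholder for whatever invokes the agent for one
    # URL and returns (structured report, tool trace).
    report, trace = run_fetch_details(case["url"])
    expected = case["expected"]

    # Field-level checks: straight comparisons against the golden values.
    assert report.outcome == expected["outcome"]
    assert report.source_kind == expected["source_kind"]

    # audio_url must actually resolve when the case says it should.
    if expected.get("audio_url_resolves"):
        assert requests.head(report.audio_url, allow_redirects=True, timeout=10).ok

    # Behavioral check from the tool trace: canonical pages without audio
    # should trigger an Apple Podcasts lookup.
    if expected.get("requires_apple_lookup"):
        assert any(call.tool_name == "search_apple_podcasts" for call in trace)
```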
Test files:
episodes/tests/test_fetch_details_eval.py — pytest, parametrized over the YAML cases.
episodes/tests/test_fetch_details_prose_eval.py — DeepEval, behind @pytest.mark.deepeval or similar.
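For the prose suite, something in this direction; the GEval criteria wording and the load_recorded_run helper are placeholders, and the point is only that it sits behind its own marker and runs on demand:

```python
import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


@pytest.mark.deepeval
def test_narrative_faithfulness():
    # load_recorded_run is a placeholder: it should return the structured
    # report and a serialized tool trace from a prior agent run.
    report, trace_text = load_recorded_run("rtve_canonical_cross_link")

    faithfulness = GEval(
        name="Narrative faithfulness",
        criteria="The narrative must only state facts supported by the tool trace.",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )
    assert_test(
        LLMTestCase(input=trace_text, actual_output=report.narrative),
        [faithfulness],
    )
```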
CI workflow
New separate workflow .github/workflows/fetch-details-eval.yml:
on:
  push:
    branches: [main]
    paths:
      - episodes/agents/fetch_details*.py
      - episodes/agents/_model.py
      - episodes/fetch_details_step.py
      - episodes/podcast_aggregators/**
      - episodes/tests/test_fetch_details*.py
      - episodes/tests/fixtures/fetch_details/**
      - .github/workflows/fetch-details-eval.yml
  pull_request:
    branches: [main]
    paths: [...same as above...]
  workflow_dispatch:
Mirror the Postgres service from ci.yml. Pin RAGTIME_FETCH_DETAILS_API_KEY as a GitHub Actions secret.
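The job section would look roughly like the sketch below; the Postgres image tag and the install step are placeholders and should be copied from ci.yml rather than from here:

```yaml
jobs:
  fetch-details-eval:
    runs-on: ubuntu-latest
    services:
      postgres:                 # mirror the exact service block from ci.yml
        image: postgres:16      # placeholder tag
        env:
          POSTGRES_PASSWORD: postgres
        ports: ["5432:5432"]
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    env:
      RAGTIME_FETCH_DETAILS_API_KEY: ${{ secrets.RAGTIME_FETCH_DETAILS_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt   # placeholder; match ci.yml's setup
      - run: pytest -m fetch_details_eval
```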
Stability
- VCR cassettes (vcrpy) for the HTTP layer (iTunes, fyyd, target podcast pages) so a third-party outage doesn't break the eval. LLM calls stay live (see the fixture sketch after this list).
- Cassettes regenerated on demand with a make target / pytest flag; reviewed in PRs that touch them.
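A possible conftest-level setup; the cassette directory, the EVAL_RECORD_MODE switch, and the LLM host list are assumptions:

```python
import os

import pytest
import vcr

# Replay-only by default (CI); set EVAL_RECORD_MODE=all locally to regenerate cassettes.
RECORD_MODE = os.environ.get("EVAL_RECORD_MODE", "none")

eval_vcr = vcr.VCR(
    cassette_library_dir="episodes/tests/fixtures/fetch_details/cassettes",
    record_mode=RECORD_MODE,
    # LLM traffic stays live: only third-party HTTP (iTunes, fyyd, podcast
    # pages) is recorded and replayed. The host list here is an assumption.
    ignore_hosts=["api.openai.com"],
    match_on=["method", "scheme", "host", "path", "query"],
)


@pytest.fixture
def fetch_details_cassette(request):
    # One cassette per eval case, named after the test id.
    with eval_vcr.use_cassette(f"{request.node.name}.yaml"):
        yield
```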
Open decisions
- CI model: production vs cheaper. The production model gives a faithful signal but costs more per CI run; a cheaper model (e.g., gpt-4.1-mini, if that isn't already the production model) tests the infrastructure but not production quality. Decide based on actual cost data once we run a few real evals.
Versioning add-on
Add agent_config_hash column to FetchDetailsRun:
agent_config_hash = models.CharField(max_length=16, blank=True, default="", db_index=True)
# = sha256(system_prompt + model_string + sorted_tool_names)[:16]
Computed once at get_agent() init, stamped on each run. Eval groups runs by hash to compare across versions. Caveat: tool names not implementations — coarse signal, but eval observes implementation behavior anyway.
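The hash itself is small enough to sketch now; the function name and where it lives are assumptions, but the recipe matches the comment above:

```python
import hashlib


def agent_config_hash(system_prompt: str, model_string: str, tool_names: list[str]) -> str:
    # sha256 over prompt + model + sorted tool names, truncated to 16 hex chars
    # so it fits the CharField(max_length=16) column.
    payload = system_prompt + model_string + "".join(sorted(tool_names))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```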
Acceptance criteria
- pytest -m fetch_details_eval runs the field-level + behavioral suite locally and in CI on path-filtered triggers.
- pytest -m deepeval runs the prose suite on demand.
- VCR cassettes covering the dataset cases.
- agent_config_hash migration + write-on-run wiring.
- Documentation: doc/plans/, doc/sessions/, doc/features/, CHANGELOG.
Related
- doc/plans/2026-04-29-fetch-details-cross-linking.md ("Deferred to follow-up issue" section)