Fetch Details agent: LLM evaluation framework (DeepEval, dataset, CI workflow) #115

@rafacm

Follow-up to #114. Introduce an LLM evaluation framework for the Fetch Details step that lets us compare agent performance across iterations of prompt, model, and tools.

The structure-heavy report schema that landed in #114 was designed with this in mind: outcome, source_kind, audio_url, extraction_confidence, cross_linked, and the discovered_* flags are all evaluable at the field level without an LLM-as-judge. This issue covers the framework, dataset, and CI gating that turn those fields into a regression signal.

Scope

Dataset

A hand-curated YAML golden set of ~10-15 cases in episodes/tests/fixtures/fetch_details/cases.yaml, covering:

  • All five outcomes (ok, partial, not_a_podcast_episode, unreachable, extraction_failed).
  • Canonical-URL cases (rtve.es, ardsounds.de, similar publisher sites).
  • Aggregator-URL cases (Apple Podcasts).
  • Cross-linking cases (canonical → iTunes lookup; aggregator → canonical link).
  • A 404 / non-podcast-page case for not_a_podcast_episode.
  • A Spotify-style JS-only page for extraction_failed (or partial if static fetch returns enough).
  • 2-3 absolute-fact cases (URL X should yield title='...') so v0 flaws don't get baked in as the de facto baseline.
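A possible loader for the golden set, sketched below. The case schema (a top-level list of dicts with id, url, and expected keys) is an assumption, not a settled format:

# episodes/tests/fetch_details_cases.py: hypothetical helper module; the
# cases.yaml schema (id / url / expected) is an assumed shape.
from pathlib import Path

import yaml

CASES_PATH = Path(__file__).parent / "fixtures" / "fetch_details" / "cases.yaml"

def load_cases():
    """Return (case_id, case_dict) pairs for pytest.mark.parametrize."""
    with CASES_PATH.open() as f:
        cases = yaml.safe_load(f)
    return [(case["id"], case) for case in cases]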

Framework

Sharp split:

  • Field-level metrics (pytest, no LLM judge): outcome match, source_kind match, aggregator_provider match, required-field population, audio_url resolution via HTTP HEAD when expected, cross_linked flag match, and extraction_confidence calibration across the suite (sketched after this list).
  • Behavioral metrics (pytest, from tool trace): "for canonical-with-no-audio cases, did the agent call search_apple_podcasts?", median tool calls per case, ratio of correct cross-linking decisions.
  • Prose quality (DeepEval, gated marker): narrative faithfulness vs trace, hints_for_next_step usefulness. Run on demand; not in the regular CI loop.
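A field-level check could look roughly like this; run_fetch_details and the report attribute names are assumptions layered on the #114 schema:

# episodes/tests/test_fetch_details_eval.py: hypothetical shape.
import pytest
import requests

from episodes.fetch_details_step import run_fetch_details  # assumed entry point
from episodes.tests.fetch_details_cases import load_cases  # dataset sketch above

@pytest.mark.fetch_details_eval
@pytest.mark.parametrize("case_id,case", load_cases())
def test_field_level(case_id, case):
    report = run_fetch_details(case["url"])
    expected = case["expected"]

    assert report.outcome == expected["outcome"]
    assert report.source_kind == expected["source_kind"]
    assert report.cross_linked == expected["cross_linked"]

    # audio_url should resolve when the case expects one.
    if expected.get("audio_url_resolves"):
        resp = requests.head(report.audio_url, allow_redirects=True, timeout=10)
        assert resp.ok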

Test files:

  • episodes/tests/test_fetch_details_eval.py — pytest, parametrized over the YAML cases.
  • episodes/tests/test_fetch_details_prose_eval.py — DeepEval, behind @pytest.mark.deepeval or similar (sketched below).
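A sketch of the prose suite with DeepEval's GEval; the criteria wording, threshold, and the report fields fed into the test case are all placeholders:

# episodes/tests/test_fetch_details_prose_eval.py: hypothetical shape.
import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

from episodes.fetch_details_step import run_fetch_details  # assumed entry point
from episodes.tests.fetch_details_cases import load_cases  # dataset sketch above

faithfulness = GEval(
    name="Narrative faithfulness",
    criteria=(
        "The narrative must only state facts supported by the tool trace; "
        "penalize claims about pages or feeds the agent never fetched."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

@pytest.mark.deepeval
@pytest.mark.parametrize("case_id,case", load_cases())
def test_narrative_faithfulness(case_id, case):
    report = run_fetch_details(case["url"])
    assert_test(
        LLMTestCase(input=report.tool_trace, actual_output=report.narrative),
        [faithfulness],
    )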

CI workflow

A new, separate workflow at .github/workflows/fetch-details-eval.yml:

on:
  push:
    branches: [main]
    paths:
      - episodes/agents/fetch_details*.py
      - episodes/agents/_model.py
      - episodes/fetch_details_step.py
      - episodes/podcast_aggregators/**
      - episodes/tests/test_fetch_details*.py
      - episodes/tests/fixtures/fetch_details/**
      - .github/workflows/fetch-details-eval.yml
  pull_request:
    branches: [main]
    paths: [...same as above...]
  workflow_dispatch:

Mirror the Postgres service from ci.yml, and expose RAGTIME_FETCH_DETAILS_API_KEY to the job as a GitHub Actions secret.

Stability

  • VCR cassettes (vcrpy) for the HTTP layer (iTunes, fyyd, target podcast pages) so a third-party outage doesn't break the eval. LLM calls stay live.
  • Cassettes regenerated on demand with a make target / pytest flag; reviewed in PRs that touch them.
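A sketch of the VCR wiring, assuming the LLM provider is reached at api.openai.com (adjust ignore_hosts to the real provider):

# episodes/tests/conftest.py: hypothetical VCR setup; the cassette path and
# the ignored LLM host are assumptions.
import vcr

fetch_details_vcr = vcr.VCR(
    cassette_library_dir="episodes/tests/fixtures/fetch_details/cassettes",
    record_mode="none",               # CI replays only, never hits the network
    ignore_hosts=["api.openai.com"],  # LLM calls stay live, per the plan
    filter_headers=["authorization"],  # keep API keys out of cassettes
)

# Regeneration on demand could flip record_mode to "all" behind a pytest flag
# or a make target, with the resulting cassette diffs reviewed in the PR.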

Open decisions

  • CI model: production vs. cheaper. The production model gives a faithful signal but costs more per CI run; a cheaper model (e.g., gpt-4.1-mini, if that isn't already the production model) exercises the infrastructure without measuring production quality. Decide based on actual cost data once we run a few real evals.

Versioning add-on

Add agent_config_hash column to FetchDetailsRun:

agent_config_hash = models.CharField(max_length=16, blank=True, default="", db_index=True)
# = sha256(system_prompt + model_string + sorted_tool_names)[:16]

Computed once at get_agent() init and stamped on each run. The eval groups runs by hash to compare across versions. Caveat: the hash covers tool names, not implementations, so it is a coarse signal; the eval observes implementation behavior anyway.
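A sketch of the hash computation; the function name and the shape of the tools argument are assumptions:

# episodes/agents/_model.py or similar: hypothetical, names are assumed.
import hashlib

def compute_agent_config_hash(system_prompt, model_string, tools):
    """Stable 16-hex-char fingerprint of the agent configuration."""
    tool_names = ",".join(sorted(t.name for t in tools))
    payload = f"{system_prompt}\n{model_string}\n{tool_names}".encode()
    return hashlib.sha256(payload).hexdigest()[:16]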

Acceptance criteria

  • pytest -m fetch_details_eval runs the field-level + behavioral suite locally and in CI on path-filtered triggers.
  • pytest -m deepeval runs the prose suite on demand.
  • VCR cassettes covering the dataset cases.
  • agent_config_hash migration + write-on-run wiring.
  • Documentation: doc/plans/, doc/sessions/, doc/features/, CHANGELOG.
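For the -m selectors above to work without warnings, the markers need registering; a minimal sketch, assuming they aren't already declared in pytest configuration:

# conftest.py: hypothetical marker registration.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "fetch_details_eval: field-level + behavioral eval suite"
    )
    config.addinivalue_line(
        "markers", "deepeval: LLM-judged prose eval, run on demand"
    )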
