rafacm · May 4, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,8 +6,14 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 
 ## 2026-05-04
 
+### Added
+
+- `Episode.show_name` `CharField` (max 255, blank default) populated by the fetch_details agent from `<meta property="og:site_name">`, `<meta name="application-name">`, RSS `<channel><title>`, JSON-LD `isPartOf.name` / `partOfSeries.name`, or the visible publisher heading. Persisted only when the agent's extracted value is non-empty (a re-run that fails to extract leaves a previously-good value or admin edit in place). Migration `0025_episode_show_name`. No backfill — pre-prod data freedom per `feedback_reembed_ok_preprod.md`.
+- `EpisodeCandidate.published_at` (`date | None`) on the podcast-aggregator dataclass. Each aggregator now extracts a publication date from its native field (`pubdate` for fyyd, `releaseDate` for iTunes, `datePublished` epoch seconds for podcastindex) and logs + returns `None` on missing or malformed input rather than dropping the candidate. Surfaced through `DownloadDeps.published_at` and `IndexCandidate.published_at` so the download agent sees per-episode and per-candidate dates.
+
 ### Changed
 
+- Download agent's `_show_name(episode)` cascade now prefers `episode.show_name`, falling back to the URL host only as a defense-in-depth signal. The agent prompt is updated to detect hostname-shaped `Show` values (contains `.` and no spaces) and switch to a `(title, published_at)` match with ±1 day tolerance instead of requiring an exact `show_name` string match. Real broadcast titles still use the existing show-plus-title path. Closes #111 — [plan](doc/plans/2026-05-04-download-show-name-fix.md), [feature](doc/features/2026-05-04-download-show-name-fix.md), [planning session](doc/sessions/2026-05-04-download-show-name-fix-planning-session.md), [implementation session](doc/sessions/2026-05-04-download-show-name-fix-implementation-session.md)
 - AGENTS.md and the Feature PR Documentation Bundle AI check now recognize **agent-orchestrated sessions**. When a parallel implementation agent is launched from a parent Claude Code session (e.g. under Conductor) and has no direct user-to-implementation-agent messages, the transcript may use `### Parent agent (orchestrator)` headings *instead of* `### User`, provided the parent-agent's launching prompt is reproduced verbatim. The transcript must declare the session as agent-orchestrated at the top of `## Detailed conversation`. Same verbatim rule applies — summarized parent prompts are still rejected. This is a policy clarification only; no code changes.
 - AI checks workflow now renders non-applicable rules as gray "skipped" icons instead of green "pass". `list_ai_checks.py` evaluates each rule's `paths:` frontmatter against `git diff --name-only $BASE_REF...HEAD` and emits `applies: bool` per matrix include; `.github/workflows/ai-checks.yml` gates each `check` shard on `if: ${{ ... && matrix.applies }}`, so non-applicable shards skip at the GHA level — no runner spin-up, no model call, no token cost. Four rules gain `paths:` frontmatter (`pipeline-step-sync`, `asgi-wsgi-scott`, `qdrant-payload-slim`, `entity-creation-race-safety`); the other five remain semantic. The driver's verdict tool drops `skip` from its enum: semantic non-applicability is now `pass` with `summary: "Rule does not apply."` and a one-line `details` explanation. Closes #124 — [plan](doc/plans/2026-05-04-ai-checks-skipped.md), [feature](doc/features/2026-05-04-ai-checks-skipped.md), [planning session](doc/sessions/2026-05-04-ai-checks-skipped-planning-session.md), [implementation session](doc/sessions/2026-05-04-ai-checks-skipped-implementation-session.md)
 - **BREAKING** — `StepFailed`-derived exceptions now pickle as `RuntimeError(message)` rather than as their typed subclass. Pre-fix workflow rows in the DBOS `dbos.workflow_status` table will not deserialize to readable text — the Episode admin's "View workflow steps" page will show a base64 preview only. Action required: reprocess affected episodes (which produces fresh, portable pickles) or clear the workflow_status table in dev environments. No production impact since this project is pre-prod. See PR #129 for details.
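
Illustrative sketch, not part of this patch: one way a `StepFailed`-derived exception could be made to pickle as `RuntimeError(message)`, as the BREAKING entry above describes. The changelog does not show the mechanism, so the `__reduce__` override, the `DownloadFailed` subclass, and the `roundtrip` helper are assumptions made for illustration only.

```python
import pickle


class StepFailed(Exception):
    """Hypothetical stand-in for the project's StepFailed base class."""

    def __reduce__(self):
        # Pickle as a plain RuntimeError that carries only the message, so the
        # payload can be loaded without importing this module.
        return (RuntimeError, (str(self),))


class DownloadFailed(StepFailed):
    """Hypothetical typed subclass; inherits the portable pickling."""


def roundtrip(exc: BaseException) -> BaseException:
    """Serialize and deserialize an exception the way a workflow store might."""
    return pickle.loads(pickle.dumps(exc))


restored = roundtrip(DownloadFailed("no enclosure URL found"))
print(type(restored).__name__, restored)
```

The trade-off in this sketch is the one the entry names: the typed subclass is lost on deserialization, and only the message text survives.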

diff --git a/doc/README.md b/doc/README.md
@@ -36,7 +36,7 @@ Three keyless tools live in [`episodes/agents/fetch_details_tools.py`](../episod
 
 The agent emits a wrapped `FetchDetailsOutput { details, report, concise }`:
 
-- `details` — episode-level facts: `title`, `description`, `published_at`, `image_url`, `audio_url`, `audio_format` (closed `Literal`), `language` (ISO 639-1), `country` (ISO 3166-1 alpha-2), `guid`, `canonical_url`, `source_kind` (`canonical | aggregator | unknown`), `aggregator_provider`.
+- `details` — episode-level facts: `title`, `show_name` (broadcast / podcast title, e.g. "Zeitzeichen" — not the publisher's company name and not the URL hostname), `description`, `published_at`, `image_url`, `audio_url`, `audio_format` (closed `Literal`), `language` (ISO 639-1), `country` (ISO 3166-1 alpha-2), `guid`, `canonical_url`, `source_kind` (`canonical | aggregator | unknown`), `aggregator_provider`.
 - `report` — structured trace: `attempted_sources`, `discovered_canonical_url`, `discovered_audio_url`, `cross_linked`, `extraction_confidence` (`high | medium | low`), `narrative` (2–4 sentences), `hints_for_next_step` (carried into the Download step).
 - `concise` — `outcome` (5-value enum) + `summary` (≤140 chars).
 
@@ -52,7 +52,7 @@ Five outcomes drive the step's status transitions:
 
 Discrimination among the three terminal outcomes happens on `FetchDetailsRun.outcome` — only one `Episode.Status.FAILED` value is used.
 
-Every run persists a `FetchDetailsRun` row carrying the structured output, the auto-captured tool-call trace (input / output excerpts / `ok` flag), the Pydantic AI usage dict, and the DBOS workflow ID. `Episode` columns are overwritten directly by the agent's authoritative output (no empty-field-only merge); a re-run via the admin `reprocess` action increments `run_index` and overwrites again.
+Every run persists a `FetchDetailsRun` row carrying the structured output, the auto-captured tool-call trace (input / output excerpts / `ok` flag), the Pydantic AI usage dict, and the DBOS workflow ID. `Episode` columns are overwritten directly by the agent's authoritative output (no empty-field-only merge); a re-run via the admin `reprocess` action increments `run_index` and overwrites again. The one exception is `show_name`, which is **additive only** — when a fresh run fails to extract a value, any previously-good `show_name` (whether from an earlier run or an admin edit) is preserved rather than cleared.
 
 The step orchestrator ([`episodes/fetch_details_step.py`](../episodes/fetch_details_step.py)) is DBOS-agnostic: the `@DBOS.step()` wrapper in `episodes/workflows.py` reads `DBOS.workflow_id` and passes it in. The orchestrator records it onto `FetchDetailsRun.dbos_workflow_id` for cross-reference forensics.
 

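Illustrative sketch, not part of this patch: the "additive only" rule for `show_name` described in the README change above, next to the plain overwrite used for other columns. `EpisodeStub` and `apply_details` are hypothetical stand-ins, not the code in `episodes/fetch_details_step.py`, and treating whitespace-only values as empty is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class EpisodeStub:
    """Hypothetical stand-in for the Django Episode model."""

    title: str = ""
    show_name: str = ""


def apply_details(episode: EpisodeStub, title: str | None, show_name: str | None) -> None:
    """Copy one run's output onto the episode."""
    # Ordinary columns: the agent's output is authoritative, so overwrite.
    episode.title = title or ""
    # show_name: write only a non-empty value; a failed extraction must not
    # clear an earlier run's value or an admin edit.
    cleaned = (show_name or "").strip()
    if cleaned:
        episode.show_name = cleaned


episode = EpisodeStub(title="Old title", show_name="Zeitzeichen")
apply_details(episode, title="New title", show_name=None)
print(episode)
```

After the call, `title` has been overwritten while `show_name` still holds the earlier value.
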
diff --git a/doc/features/2026-05-04-download-show-name-fix.md b/doc/features/2026-05-04-download-show-name-fix.md
@@ -0,0 +1,127 @@
+# Download show_name + published_at fix
+
+**Date:** 2026-05-04
+
+## Problem
+
+The Download agent rejected every fyyd / podcastindex candidate for the
+canonical ARD Sounds test episode
+(`https://www.ardsounds.de/episode/urn:ard:episode:fdcf93eef8395b35/`)
+even though fyyd had a clean Akamai enclosure URL for it. Two factors
+combined:
+
+1. `Episode` had no `show_name` field. `episodes/downloader._show_name`
+   fell back to the URL host (`www.ardsounds.de`), which never matches
+   a real broadcast title (`Zeitzeichen`).
+2. The download agent's system prompt encouraged a strict show / title
+   match — so the LLM played it safe and rejected the candidate set.
+
+## Changes
+
+Three coordinated layers, single commit set.
+
+### Layer 1 — `Episode.show_name` source of truth
+
+- New `Episode.show_name = CharField(max_length=255, blank=True, default="")`.
+- Migration `0025_episode_show_name`. No backfill (pre-prod).
+- `EpisodeDetails` Pydantic schema in `episodes/agents/fetch_details.py`
+  gains `show_name: str | None = None`. The system prompt instructs the
+  agent to extract from (in order): `<meta property="og:site_name">`,
+  `<meta name="application-name">`, RSS `<channel><title>`, JSON-LD
+  `isPartOf.name` / `partOfSeries.name`, then the visible publisher
+  heading. Hostnames and company names are explicitly out of scope.
+- `episodes/fetch_details_step.py` persists
+  `EpisodeDetails.show_name → Episode.show_name` only when the agent's
+  value is non-empty. A re-run that fails to extract a show name leaves
+  the previously-good value (or admin edit) in place.
+- `episodes/downloader._show_name(episode)` cascades:
+  `episode.show_name` → URL host → `""`. The URL host fallback stays as
+  a defense-in-depth so the agent always receives some `show_name`
+  context, but the agent prompt now treats hostname-shaped values as a
+  degraded signal.
+
+### Layer 2 — `published_at` as a tiebreaker
+
+- `EpisodeCandidate` (in `episodes/podcast_aggregators/base.py`) gains
+  `published_at: date | None = None`.
+- Each aggregator now extracts a publication date. On missing / malformed
+  input it logs and returns `None` rather than dropping the candidate:
+  - **fyyd**: parses `item["pubdate"]` (string,
+    `"YYYY-MM-DD HH:MM:SS"` per fyyd docs; also handles ISO 8601 and
+    bare `YYYY-MM-DD`).
+  - **iTunes**: parses `item["releaseDate"]` (ISO 8601 datetime, e.g.
+    `"2024-08-30T04:00:00Z"`).
+  - **podcastindex**: converts `item["datePublished"]` (Unix epoch
+    seconds, ints or numeric strings) to UTC date.
+- `DownloadDeps` (in `episodes/agents/download_deps.py`) gains
+  `published_at: date | None = None`.
+- `IndexCandidate` in `episodes/agents/download_tools.py` gains
+  `published_at: date | None = None` so the agent sees per-candidate
+  dates inside `lookup_podcast_index` results.
+- `episodes/downloader.download_episode` passes `episode.published_at`
+  through to `run_download_agent` → `_run_agent_async` → `DownloadDeps`.
+
+### Layer 3 — hostname-aware download agent prompt
+
+`DOWNLOAD_SYSTEM_PROMPT` in `episodes/agents/download.py` now:
+
+- Surfaces the episode's `Published` date alongside title / show.
+- Tells the agent: when `Show` contains a `.` and no spaces (i.e. it
+  looks like a hostname), treat it as a degraded signal — do NOT
+  require exact string match on candidate `show_name`. Instead, prefer
+  matching candidates by `(title, published_at)`, with a window of ±1
+  day. When `Published` is unknown, fall back to title similarity alone.
+- Real show titles still use the existing show-plus-title match.
+
+## Key parameters
+
+- `Episode.show_name max_length = 255` — comfortable for very long
+  podcast titles (Apple Podcasts allows up to 255).
+- Date match window: ±1 day. Looser than exact match (covers timezone
+  drift between publisher and aggregator) but tighter than e.g. a
+  week (avoids matching an unrelated re-broadcast).
+
+## Verification
+
+Tests:
+
+```bash
+uv run python manage.py test
+# 384 tests pass.
+```
+
+Manual (post-deploy / pre-merge with reviewer — see PR body):
+
+1. Submit `https://www.ardsounds.de/episode/urn:ard:episode:fdcf93eef8395b35/`
+   via the admin "Submit episode" form.
+2. Watch fetch_details populate `Episode.show_name = "Zeitzeichen"` (or
+   similar).
+3. Confirm download step reaches `READY` via the index path with
+   `source="fyyd"`.
+4. `dbos workflow steps <id>` shows `download_step` succeeded.
+5. Smoke-test on one English episode (e.g. an iTunes-indexed show).
+6. Smoke-test on an episode where fetch_details fails to extract a
+   show name → confirm host-fallback path still works (agent reads
+   the hostname-shaped show_name, switches to `(title, published_at)`
+   match).
+
+## Files modified
+
+| Path | Summary |
+|------|---------|
+| `episodes/models.py` | Add `show_name` `CharField`. |
+| `episodes/migrations/0025_episode_show_name.py` | New migration. |
+| `episodes/agents/fetch_details.py` | Add `show_name` to `EpisodeDetails`; extend system prompt. |
+| `episodes/fetch_details_step.py` | Persist `show_name` only when non-empty. |
+| `episodes/downloader.py` | Cascade in `_show_name`; pass `published_at` through. |
+| `episodes/podcast_aggregators/base.py` | Add `published_at` to `EpisodeCandidate`. |
+| `episodes/podcast_aggregators/fyyd.py` | Parse `pubdate` → `date`. |
+| `episodes/podcast_aggregators/itunes.py` | Parse `releaseDate` → `date`. |
+| `episodes/podcast_aggregators/podcastindex.py` | Parse `datePublished` (epoch) → `date`. |
+| `episodes/agents/download_deps.py` | Add `published_at` to `DownloadDeps`. |
+| `episodes/agents/download_tools.py` | Add `published_at` to `IndexCandidate`; surface in tool output. |
+| `episodes/agents/download.py` | Hostname-aware prompt; plumb `published_at`. |
+| `episodes/tests/test_models.py` | `show_name` default + persistence. |
+| `episodes/tests/test_podcast_aggregators.py` | Pubdate parse / missing / malformed for all three aggregators. |
+| `episodes/tests/test_download.py` | `_show_name` cascade tests. |
+| `CHANGELOG.md` | Entry under `## 2026-05-04`. |
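
Illustrative sketch, not part of this patch: how the three aggregator date formats listed under Layer 2 could each be reduced to a `date`, logging and returning `None` on bad input. The function names and log messages are hypothetical; the real parsers live in `episodes/podcast_aggregators/` and may differ.

```python
from __future__ import annotations

import logging
from datetime import date, datetime, timezone

logger = logging.getLogger(__name__)


def _iso_to_date(raw: object, source: str) -> date | None:
    """Parse "YYYY-MM-DD HH:MM:SS", ISO 8601, or bare YYYY-MM-DD strings."""
    if isinstance(raw, str) and raw.strip():
        try:
            # A trailing "Z" is rewritten so older Python versions accept it.
            return datetime.fromisoformat(raw.strip().replace("Z", "+00:00")).date()
        except ValueError:
            pass
    logger.warning("%s: missing or malformed publication date %r", source, raw)
    return None


def parse_fyyd_pubdate(raw: object) -> date | None:
    return _iso_to_date(raw, "fyyd")


def parse_itunes_release_date(raw: object) -> date | None:
    return _iso_to_date(raw, "itunes")


def parse_podcastindex_date(raw: object) -> date | None:
    """Convert Unix epoch seconds (int or numeric string) to a UTC date."""
    try:
        return datetime.fromtimestamp(int(raw), tz=timezone.utc).date()
    except (TypeError, ValueError, OverflowError, OSError):
        logger.warning("podcastindex: missing or malformed datePublished %r", raw)
        return None


print(parse_fyyd_pubdate("2024-08-30 04:00:00"))
print(parse_itunes_release_date("2024-08-30T04:00:00Z"))
print(parse_podcastindex_date("1724990400"))
```

Each parser returns `None` instead of raising, so a caller can still build the candidate with `published_at=None`.
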
diff --git a/doc/plans/2026-05-04-download-show-name-fix.md b/doc/plans/2026-05-04-download-show-name-fix.md
@@ -0,0 +1,124 @@
+# Download show_name + published_at fix (issue #111)
+
+**Date:** 2026-05-04
+
+## Summary
+
+Fix the download agent's index-candidate matching for non-English shows
+that have no extracted show name. The fix targets the root cause (no
+`show_name` extraction in fetch-details) rather than papering over it in
+the agent prompt alone.
+
+## Problem
+
+Submitting `https://www.ardsounds.de/episode/urn:ard:episode:fdcf93eef8395b35/`
+fails the download step even though fyyd carries a clean enclosure URL for
+the episode. Two factors interact:
+
+1. **No `show_name` source of truth.** `Episode` has no `show_name` field
+   and `fetch_details` doesn't extract one. `episodes/downloader._show_name`
+   falls back to the URL host (`www.ardsounds.de`).
+2. **Strict match in the agent prompt.** With `show_name = "www.ardsounds.de"`
+   on the episode and `show_name = "Zeitzeichen"` on every fyyd candidate,
+   the agent rejects every candidate and the episode goes to `FAILED`.
+
+## Plan
+
+Three layers, bundled into one PR.
+
+### Layer 1 — real `show_name`
+
+- Add `Episode.show_name = CharField(max_length=255, blank=True, default="")`.
+  Generate Django migration. No backfill (project is pre-prod).
+- Extend `EpisodeDetails` schema in `episodes/agents/fetch_details.py` with
+  `show_name: str | None = None`. Update the system prompt to extract
+  `show_name` from `<meta property="og:site_name">`,
+  `<meta name="application-name">`, RSS `<channel><title>`, JSON-LD
+  `isPartOf.name`, and the visible publisher / show heading.
+- Persist `EpisodeDetails.show_name → Episode.show_name` in
+  `episodes/fetch_details_step.py` only when the value is non-empty (don't
+  wipe a previously-good value or a user edit on a re-run that fails to
+  extract).
+- Update `episodes/downloader._show_name(episode)` to prefer
+  `episode.show_name`, with the URL host as a defense-in-depth fallback.
+
+### Layer 2 — date as a tiebreaker
+
+- Add `published_at: date | None = None` to
+  `episodes/podcast_aggregators/base.EpisodeCandidate`.
+- Plumb pubdate through each aggregator's `_candidate()`:
+  - **fyyd**: `item["pubdate"]` (string like
+    `"2024-08-30 04:00:00"`) → `date`.
+  - **iTunes**: `item["releaseDate"]` (ISO 8601 datetime) → `date`.
+  - **podcastindex**: `item["datePublished"]` (Unix epoch seconds) →
+    `date`.
+  - On parse failure: leave `published_at = None` and log a warning. Do
+    NOT drop the candidate.
+- Pass `episode.published_at` through `DownloadDeps` and the download
+  agent's prompt template.
+- Surface `published_at` on each `IndexCandidate` returned by
+  `lookup_podcast_index`.
+
+### Layer 3 — looser, hostname-aware prompt
+
+Update the download agent system prompt in `episodes/agents/download.py`:
+
+> `show_name` may be a publisher hostname rather than the broadcast title
+> (e.g. `www.ardsounds.de` instead of `Zeitzeichen`). When `show_name`
+> looks like a hostname (contains `.` and no spaces), do NOT require an
+> exact string match against the candidate's `show_name`. Instead, prefer
+> matching candidates by `(title, published_at)`. A candidate is a strong
+> match when its title closely matches the episode title and its
+> `published_at` is within ±1 day of the episode's `published_at`.
+
+## Decisions
+
+- **Bundle all three layers**: Layer 1 alone leaves the prompt strict;
+  Layer 3 alone papers over the missing field. Shipping all three in one
+  PR gives the agent both real data and the right matching policy.
+- **Add new model field, no backfill**: `Episode.show_name` is a new
+  `CharField` with empty default. Pre-prod data per
+  `feedback_reembed_ok_preprod.md`.
+- **Manual verification in PR description**: requires a live ASGI server,
+  provider keys, and the canonical ARD episode. Not run by the
+  implementation agent — flagged in the PR body for the reviewer.
+
+## Test plan
+
+- Each aggregator's `_candidate()` populates `published_at` from a canned
+  payload.
+- Each aggregator handles missing / malformed pub dates gracefully
+  (returns `published_at=None`, doesn't drop the candidate).
+- `_show_name(episode)` returns `episode.show_name` when set, falls back
+  to URL host otherwise, returns `""` for a URL with no host.
+- `Episode.show_name` is blank by default and persists when set.
+- `manage.py makemigrations --check` passes (no further migrations
+  needed).
+- Full test suite passes via `uv run python manage.py test`.
+
+## Files touched
+
+- `episodes/models.py` — add `show_name` field.
+- `episodes/migrations/0025_episode_show_name.py` — new migration.
+- `episodes/agents/fetch_details.py` — `EpisodeDetails.show_name` +
+  prompt update.
+- `episodes/fetch_details_step.py` — persist `show_name` when non-empty.
+- `episodes/downloader.py` — `_show_name` cascade; pass `published_at`
+  to the agent.
+- `episodes/podcast_aggregators/base.py` — `EpisodeCandidate.published_at`.
+- `episodes/podcast_aggregators/{fyyd,itunes,podcastindex}.py` — parse
+  pubdate, log + return `None` on failure (candidate is kept).
+- `episodes/agents/download_deps.py` — `DownloadDeps.published_at`.
+- `episodes/agents/download_tools.py` — `IndexCandidate.published_at`,
+  surface in `lookup_podcast_index` return value.
+- `episodes/agents/download.py` — extend prompt, plumb `published_at`
+  through `_run_agent_async` / `run_download_agent`.
+- `episodes/tests/test_models.py` — `show_name` blank default + persistence.
+- `episodes/tests/test_podcast_aggregators.py` — three pubdate parse
+  scenarios per aggregator.
+- `episodes/tests/test_download.py` — `_show_name` cascade tests.
+- `doc/plans/2026-05-04-download-show-name-fix.md` — this file.
+- `doc/features/2026-05-04-download-show-name-fix.md` — implementation doc.
+- `doc/sessions/2026-05-04-download-show-name-fix-planning-session.md`
+- `doc/sessions/2026-05-04-download-show-name-fix-implementation-session.md`
+- `CHANGELOG.md` — entry under `## 2026-05-04`.
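
Illustrative sketch, not part of this patch: the `_show_name` cascade and the Layer 3 matching rule written as plain Python. In the plan the hostname check and the ±1 day window are instructions in the agent's system prompt and are applied by the LLM, not by code, so the function names and signatures below are hypothetical.

```python
from __future__ import annotations

from datetime import date, timedelta
from urllib.parse import urlparse


def show_name_for(show_name: str, url: str) -> str:
    """Cascade: stored show name, then URL host, then empty string."""
    if show_name:
        return show_name
    return urlparse(url).hostname or ""


def looks_like_hostname(show: str) -> bool:
    """Hostname-shaped per the prompt rule: contains "." and no spaces."""
    return "." in show and " " not in show


def dates_within_window(a: date | None, b: date | None, days: int = 1) -> bool:
    """True when both dates are known and differ by at most `days`."""
    if a is None or b is None:
        return False
    return abs(a - b) <= timedelta(days=days)


url = "https://www.ardsounds.de/episode/urn:ard:episode:fdcf93eef8395b35/"
show = show_name_for("", url)
print(show, looks_like_hostname(show))
print(dates_within_window(date(2024, 8, 30), date(2024, 8, 31)))
```

When `looks_like_hostname` is true, the prompt rule switches from a show-plus-title match to a `(title, published_at)` match; when either date is unknown, the window check cannot apply and title similarity alone decides.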