6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -6,8 +6,14 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## 2026-05-04

### Added

- `Episode.show_name` `CharField` (max 255, blank default) populated by the fetch_details agent from `<meta property="og:site_name">`, `<meta name="application-name">`, RSS `<channel><title>`, JSON-LD `isPartOf.name` / `partOfSeries.name`, or the visible publisher heading. Persisted only when the agent's extracted value is non-empty (a re-run that fails to extract leaves a previously-good value or admin edit in place). Migration `0025_episode_show_name`. No backfill — pre-prod data freedom per `feedback_reembed_ok_preprod.md`.
- `EpisodeCandidate.published_at` (`date | None`) on the podcast-aggregator dataclass. Each aggregator now extracts a publication date from its native field (`pubdate` for fyyd, `releaseDate` for iTunes, `datePublished` epoch seconds for podcastindex) and logs + returns `None` on missing or malformed input rather than dropping the candidate. Surfaced through `DownloadDeps.published_at` and `IndexCandidate.published_at` so the download agent sees per-episode and per-candidate dates.

### Changed

- Download agent's `_show_name(episode)` cascade now prefers `episode.show_name`, falling back to the URL host only as a defense-in-depth signal. The agent prompt is updated to detect hostname-shaped `Show` values (contains `.` and no spaces) and switch to a `(title, published_at)` match with ±1 day tolerance instead of requiring an exact `show_name` string match. Real broadcast titles still use the existing show-plus-title path. Closes #111 — [plan](doc/plans/2026-05-04-download-show-name-fix.md), [feature](doc/features/2026-05-04-download-show-name-fix.md), [planning session](doc/sessions/2026-05-04-download-show-name-fix-planning-session.md), [implementation session](doc/sessions/2026-05-04-download-show-name-fix-implementation-session.md)
- AGENTS.md and the Feature PR Documentation Bundle AI check now recognize **agent-orchestrated sessions**. When a parallel implementation agent is launched from a parent Claude Code session (e.g. under Conductor) and has no direct user-to-implementation-agent messages, the transcript may use `### Parent agent (orchestrator)` headings *instead of* `### User`, provided the parent-agent's launching prompt is reproduced verbatim. The transcript must declare the session as agent-orchestrated at the top of `## Detailed conversation`. Same verbatim rule applies — summarized parent prompts are still rejected. This is a policy clarification only; no code changes.
- AI checks workflow now renders non-applicable rules as gray "skipped" icons instead of green "pass". `list_ai_checks.py` evaluates each rule's `paths:` frontmatter against `git diff --name-only $BASE_REF...HEAD` and emits `applies: bool` per matrix include; `.github/workflows/ai-checks.yml` gates each `check` shard on `if: ${{ ... && matrix.applies }}`, so non-applicable shards skip at the GHA level — no runner spin-up, no model call, no token cost. Four rules gain `paths:` frontmatter (`pipeline-step-sync`, `asgi-wsgi-scott`, `qdrant-payload-slim`, `entity-creation-race-safety`); the other five remain semantic. The driver's verdict tool drops `skip` from its enum: semantic non-applicability is now `pass` with `summary: "Rule does not apply."` and a one-line `details` explanation. Closes #124 — [plan](doc/plans/2026-05-04-ai-checks-skipped.md), [feature](doc/features/2026-05-04-ai-checks-skipped.md), [planning session](doc/sessions/2026-05-04-ai-checks-skipped-planning-session.md), [implementation session](doc/sessions/2026-05-04-ai-checks-skipped-implementation-session.md)
- **BREAKING** — `StepFailed`-derived exceptions now pickle as `RuntimeError(message)` rather than as their typed subclass. Pre-fix workflow rows in the DBOS `dbos.workflow_status` table will not deserialize to readable text — the Episode admin's "View workflow steps" page will show a base64 preview only. Action required: reprocess affected episodes (which produces fresh, portable pickles) or clear the workflow_status table in dev environments. No production impact since this project is pre-prod. See PR #129 for details.
4 changes: 2 additions & 2 deletions doc/README.md
@@ -36,7 +36,7 @@ Three keyless tools live in [`episodes/agents/fetch_details_tools.py`](../episod

The agent emits a wrapped `FetchDetailsOutput { details, report, concise }`:

- `details` — episode-level facts: `title`, `description`, `published_at`, `image_url`, `audio_url`, `audio_format` (closed `Literal`), `language` (ISO 639-1), `country` (ISO 3166-1 alpha-2), `guid`, `canonical_url`, `source_kind` (`canonical | aggregator | unknown`), `aggregator_provider`.
- `details` — episode-level facts: `title`, `show_name` (broadcast / podcast title, e.g. "Zeitzeichen" — not the publisher's company name and not the URL hostname), `description`, `published_at`, `image_url`, `audio_url`, `audio_format` (closed `Literal`), `language` (ISO 639-1), `country` (ISO 3166-1 alpha-2), `guid`, `canonical_url`, `source_kind` (`canonical | aggregator | unknown`), `aggregator_provider`.
- `report` — structured trace: `attempted_sources`, `discovered_canonical_url`, `discovered_audio_url`, `cross_linked`, `extraction_confidence` (`high | medium | low`), `narrative` (2–4 sentences), `hints_for_next_step` (carried into the Download step).
- `concise` — `outcome` (5-value enum) + `summary` (≤140 chars).

@@ -52,7 +52,7 @@ Five outcomes drive the step's status transitions:

Discrimination among the three terminal outcomes happens on `FetchDetailsRun.outcome` — only one `Episode.Status.FAILED` value is used.

Every run persists a `FetchDetailsRun` row carrying the structured output, the auto-captured tool-call trace (input / output excerpts / `ok` flag), the Pydantic AI usage dict, and the DBOS workflow ID. `Episode` columns are overwritten directly by the agent's authoritative output (no empty-field-only merge); a re-run via the admin `reprocess` action increments `run_index` and overwrites again.
Every run persists a `FetchDetailsRun` row carrying the structured output, the auto-captured tool-call trace (input / output excerpts / `ok` flag), the Pydantic AI usage dict, and the DBOS workflow ID. `Episode` columns are overwritten directly by the agent's authoritative output (no empty-field-only merge); a re-run via the admin `reprocess` action increments `run_index` and overwrites again. The one exception is `show_name`, which is **additive only** — when a fresh run fails to extract a value, any previously-good `show_name` (whether from an earlier run or an admin edit) is preserved rather than cleared.

The step orchestrator ([`episodes/fetch_details_step.py`](../episodes/fetch_details_step.py)) is DBOS-agnostic: the `@DBOS.step()` wrapper in `episodes/workflows.py` reads `DBOS.workflow_id` and passes it in. The orchestrator records it onto `FetchDetailsRun.dbos_workflow_id` for cross-reference forensics.

127 changes: 127 additions & 0 deletions doc/features/2026-05-04-download-show-name-fix.md
@@ -0,0 +1,127 @@
# Download show_name + published_at fix

**Date:** 2026-05-04

## Problem

The Download agent rejected every fyyd / podcastindex candidate for the
canonical ARD Sounds test episode
(`https://www.ardsounds.de/episode/urn:ard:episode:fdcf93eef8395b35/`)
even though fyyd had a clean Akamai enclosure URL for it. Two factors
combined:

1. `Episode` had no `show_name` field. `episodes/downloader._show_name`
fell back to the URL host (`www.ardsounds.de`), which never matches
a real broadcast title (`Zeitzeichen`).
2. The download agent's system prompt encouraged a strict show / title
match — so the LLM played it safe and rejected the candidate set.

## Changes

Three coordinated layers, single commit set.

### Layer 1 — `Episode.show_name` source of truth

- New `Episode.show_name = CharField(max_length=255, blank=True, default="")`.
- Migration `0025_episode_show_name`. No backfill (pre-prod).
- `EpisodeDetails` Pydantic schema in `episodes/agents/fetch_details.py`
gains `show_name: str | None = None`. The system prompt instructs the
agent to extract from (in order): `<meta property="og:site_name">`,
`<meta name="application-name">`, RSS `<channel><title>`, JSON-LD
`isPartOf.name` / `partOfSeries.name`, then the visible publisher
heading. Hostnames and company names are explicitly out of scope.
- `episodes/fetch_details_step.py` persists
`EpisodeDetails.show_name → Episode.show_name` only when the agent's
value is non-empty. A re-run that fails to extract a show name leaves
the previously-good value (or admin edit) in place.
- `episodes/downloader._show_name(episode)` cascades:
`episode.show_name` → URL host → `""`. The URL host fallback stays as
a defense-in-depth measure so the agent always receives some `show_name`
context, but the agent prompt now treats hostname-shaped values as a
degraded signal.
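
The cascade and the non-empty-only persistence rule described above can be sketched as follows. This is a minimal illustration under assumptions, not the project's code: `EpisodeStub`, `show_name_for`, and `merge_show_name` are hypothetical names standing in for the Django `Episode` model, `episodes/downloader._show_name`, and the persistence logic in `episodes/fetch_details_step.py`.

```python
from __future__ import annotations

from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class EpisodeStub:
    """Hypothetical stand-in for the Django `Episode` model."""

    url: str
    show_name: str = ""


def show_name_for(episode: EpisodeStub) -> str:
    """Cascade: stored show_name -> URL host -> empty string."""
    stored = episode.show_name.strip()
    if stored:
        return stored
    # urlsplit().hostname is None when the URL carries no host.
    return urlsplit(episode.url).hostname or ""


def merge_show_name(episode: EpisodeStub, extracted: str | None) -> None:
    """Persist only a non-empty extraction; never clear a stored value."""
    value = (extracted or "").strip()
    if value:
        episode.show_name = value
```

With this shape, a re-run whose extraction returns `None` or whitespace leaves a previously stored `"Zeitzeichen"` untouched, while an episode that never had a show name still yields its URL host.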

### Layer 2 — `published_at` as a tiebreaker

- `EpisodeCandidate` (in `episodes/podcast_aggregators/base.py`) gains
`published_at: date | None = None`.
- Each aggregator now extracts a publication date and logs+returns
`None` on missing / malformed input rather than dropping the candidate:
- **fyyd**: parses `item["pubdate"]` (string,
`"YYYY-MM-DD HH:MM:SS"` per fyyd docs; also handles ISO 8601 and
bare `YYYY-MM-DD`).
- **iTunes**: parses `item["releaseDate"]` (ISO 8601 datetime, e.g.
`"2024-08-30T04:00:00Z"`).
- **podcastindex**: converts `item["datePublished"]` (Unix epoch
seconds, ints or numeric strings) to UTC date.
- `DownloadDeps` (in `episodes/agents/download_deps.py`) gains
`published_at: date | None = None`.
- `IndexCandidate` in `episodes/agents/download_tools.py` gains
`published_at: date | None = None` so the agent sees per-candidate
dates inside `lookup_podcast_index` results.
- `episodes/downloader.download_episode` passes `episode.published_at`
through to `run_download_agent` → `_run_agent_async` → `DownloadDeps`.
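
As a rough sketch of the per-aggregator parsing described above, the two helpers below cover the string formats (fyyd, iTunes) and the epoch format (podcastindex). The function names and the `source` parameter are assumptions made for illustration; the real `_candidate()` implementations may be organized differently.

```python
from __future__ import annotations

import logging
from datetime import date, datetime, timezone

logger = logging.getLogger(__name__)


def parse_iso_date(raw: object, source: str) -> date | None:
    """Parse 'YYYY-MM-DD HH:MM:SS', ISO 8601, or bare 'YYYY-MM-DD' into a date."""
    if not isinstance(raw, str) or not raw.strip():
        logger.warning("%s: missing publication date: %r", source, raw)
        return None
    text = raw.strip()
    if text.endswith("Z"):
        # Older fromisoformat() versions reject a trailing 'Z'.
        text = text[:-1] + "+00:00"
    try:
        # .date() keeps the calendar date in the value's own UTC offset.
        return datetime.fromisoformat(text).date()
    except ValueError:
        logger.warning("%s: malformed publication date: %r", source, raw)
        return None


def parse_epoch_date(raw: int | str | None, source: str) -> date | None:
    """Convert Unix epoch seconds (int or numeric string) into a UTC date."""
    try:
        return datetime.fromtimestamp(int(raw), tz=timezone.utc).date()
    except (TypeError, ValueError, OverflowError, OSError):
        logger.warning("%s: malformed epoch date: %r", source, raw)
        return None
```

Both helpers return `None` instead of raising, so the caller can always build the candidate and simply leave `published_at` unset.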

### Layer 3 — hostname-aware download agent prompt

`DOWNLOAD_SYSTEM_PROMPT` in `episodes/agents/download.py` now:

- Surfaces the episode's `Published` date alongside title / show.
- Tells the agent: when `Show` contains a `.` and no spaces (i.e. it
looks like a hostname), treat it as a degraded signal — do NOT
require exact string match on candidate `show_name`. Instead, prefer
matching candidates by `(title, published_at)`, with a window of ±1
day. When `Published` is unknown, fall back to title similarity alone.
- Real show titles still use the existing show-plus-title match.
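
The rule above lives in the prompt and is applied by the LLM, not by code. Purely to make the heuristic concrete, it can be written out as a small function; `looks_like_hostname` and `match_strategy` are hypothetical names and do not exist in the repository.

```python
def looks_like_hostname(show: str) -> bool:
    """True when the value contains a '.' and no whitespace."""
    value = show.strip()
    return "." in value and not any(ch.isspace() for ch in value)


def match_strategy(show: str, published_known: bool) -> str:
    """Pick the matching policy the prompt asks the agent to follow."""
    if not looks_like_hostname(show):
        # Real broadcast title: keep the existing show-plus-title match.
        return "show+title"
    if published_known:
        return "title+published_at"
    # Hostname-shaped show and no date: title similarity is all that is left.
    return "title-only"
```

For example, `match_strategy("www.ardsounds.de", True)` selects the `(title, published_at)` path, while `match_strategy("Zeitzeichen", True)` keeps the show-plus-title path. A title containing a period and a space, such as "Dr. Example", is not treated as a hostname.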

## Key parameters

- `Episode.show_name max_length = 255` — comfortable for very long
podcast titles (Apple Podcasts allows up to 255).
- Date match window: ±1 day. Looser than exact match (covers timezone
drift between publisher and aggregator) but tighter than e.g. a
week (avoids matching an unrelated re-broadcast).
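
To illustrate why a ±1 day window absorbs timezone drift, the sketch below compares the calendar date of one publication instant as seen in UTC and in a UTC+2 publisher timezone. `within_window` is a hypothetical helper; in the actual change the tolerance is applied by the agent following its prompt.

```python
from __future__ import annotations

from datetime import date, datetime, timedelta, timezone


def within_window(a: date | None, b: date | None, tolerance_days: int = 1) -> bool:
    """True when both dates are known and differ by at most tolerance_days."""
    if a is None or b is None:
        # No date evidence; the caller falls back to title similarity.
        return False
    return abs((a - b).days) <= tolerance_days


# One publication instant, two calendar dates.
published = datetime(2024, 8, 29, 22, 30, tzinfo=timezone.utc)
aggregator_date = published.date()  # date as stored in UTC
publisher_date = published.astimezone(timezone(timedelta(hours=2))).date()

print(aggregator_date, publisher_date, within_window(aggregator_date, publisher_date))
```

Here the aggregator records 29 August and the publisher shows 30 August for the same episode, which an exact date match would reject. A re-broadcast a week later falls outside the window.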

## Verification

Tests:

```bash
uv run python manage.py test
# 384 tests pass.
```

Manual (post-deploy / pre-merge with reviewer — see PR body):

1. Submit `https://www.ardsounds.de/episode/urn:ard:episode:fdcf93eef8395b35/`
via the admin "Submit episode" form.
2. Watch fetch_details populate `Episode.show_name = "Zeitzeichen"` (or
similar).
3. Confirm download step reaches `READY` via the index path with
`source="fyyd"`.
4. `dbos workflow steps <id>` shows `download_step` succeeded.
5. Smoke-test on one English episode (e.g. an iTunes-indexed show).
6. Smoke-test on an episode where fetch_details fails to extract a
show name → confirm host-fallback path still works (agent reads
the hostname-shaped show_name, switches to `(title, published_at)`
match).

## Files modified

| Path | Summary |
|------|---------|
| `episodes/models.py` | Add `show_name` `CharField`. |
| `episodes/migrations/0025_episode_show_name.py` | New migration. |
| `episodes/agents/fetch_details.py` | Add `show_name` to `EpisodeDetails`; extend system prompt. |
| `episodes/fetch_details_step.py` | Persist `show_name` only when non-empty. |
| `episodes/downloader.py` | Cascade in `_show_name`; pass `published_at` through. |
| `episodes/podcast_aggregators/base.py` | Add `published_at` to `EpisodeCandidate`. |
| `episodes/podcast_aggregators/fyyd.py` | Parse `pubdate` → `date`. |
| `episodes/podcast_aggregators/itunes.py` | Parse `releaseDate` → `date`. |
| `episodes/podcast_aggregators/podcastindex.py` | Parse `datePublished` (epoch) → `date`. |
| `episodes/agents/download_deps.py` | Add `published_at` to `DownloadDeps`. |
| `episodes/agents/download_tools.py` | Add `published_at` to `IndexCandidate`; surface in tool output. |
| `episodes/agents/download.py` | Hostname-aware prompt; plumb `published_at`. |
| `episodes/tests/test_models.py` | `show_name` default + persistence. |
| `episodes/tests/test_podcast_aggregators.py` | Pubdate parse / missing / malformed for all three aggregators. |
| `episodes/tests/test_download.py` | `_show_name` cascade tests. |
| `CHANGELOG.md` | Entry under `## 2026-05-04`. |
124 changes: 124 additions & 0 deletions doc/plans/2026-05-04-download-show-name-fix.md
@@ -0,0 +1,124 @@
# Download show_name + published_at fix (issue #111)

**Date:** 2026-05-04

## Summary

Fix the download agent's index-candidate matching for non-English shows
whose extracted show name is empty. Address the root cause (no `show_name`
extraction in fetch-details) rather than papering over it in the agent
prompt alone.

## Problem

Submitting `https://www.ardsounds.de/episode/urn:ard:episode:fdcf93eef8395b35/`
fails the download step even though fyyd carries a clean enclosure URL for
the episode. Two factors interact:

1. **No `show_name` source of truth.** `Episode` has no `show_name` field
and `fetch_details` doesn't extract one. `episodes/downloader._show_name`
falls back to the URL host (`www.ardsounds.de`).
2. **Strict match in the agent prompt.** With `show_name = "www.ardsounds.de"`
on the episode and `show_name = "Zeitzeichen"` on every fyyd candidate,
the agent rejects every candidate and the episode goes to `FAILED`.

## Plan

Three layers, bundled into one PR.

### Layer 1 — real `show_name`

- Add `Episode.show_name = CharField(max_length=255, blank=True, default="")`.
Generate Django migration. No backfill (project is pre-prod).
- Extend `EpisodeDetails` schema in `episodes/agents/fetch_details.py` with
`show_name: str | None = None`. Update the system prompt to extract
`show_name` from `<meta property="og:site_name">`,
`<meta name="application-name">`, RSS `<channel><title>`, JSON-LD
`isPartOf.name`, and the visible publisher / show heading.
- Persist `EpisodeDetails.show_name → Episode.show_name` in
`episodes/fetch_details_step.py` only when the value is non-empty (don't
wipe a previously-good value or a user edit on a re-run that fails to
extract).
- Update `episodes/downloader._show_name(episode)` to prefer
`episode.show_name`, with the URL host as a defense-in-depth fallback.

### Layer 2 — date as a tiebreaker

- Add `published_at: date | None = None` to
`episodes/podcast_aggregators/base.EpisodeCandidate`.
- Plumb pubdate through each aggregator's `_candidate()`:
- **fyyd**: `item["pubdate"]` (string like
`"2024-08-30 04:00:00"`, i.e. `YYYY-MM-DD HH:MM:SS`) → `date`.
- **iTunes**: `item["releaseDate"]` (ISO 8601 datetime) → `date`.
- **podcastindex**: `item["datePublished"]` (Unix epoch seconds) →
`date`.
- On parse failure: leave `published_at = None` and log a warning. Do
NOT drop the candidate.
- Pass `episode.published_at` through `DownloadDeps` and the download
agent's prompt template.
- Surface `published_at` on each `IndexCandidate` returned by
`lookup_podcast_index`.

### Layer 3 — looser, hostname-aware prompt

Update the download agent system prompt in `episodes/agents/download.py`:

> `show_name` may be a publisher hostname rather than the broadcast title
> (e.g. `www.ardsounds.de` instead of `Zeitzeichen`). When `show_name`
> looks like a hostname (contains `.` and no spaces), do NOT require an
> exact string match against the candidate's `show_name`. Instead, prefer
> matching candidates by `(title, published_at)`. A candidate is a strong
> match when its title closely matches the episode title and its
> `published_at` is within ±1 day of the episode's `published_at`.

## Decisions

- **Bundle all three layers**: Layer 1 alone leaves the prompt strict;
Layer 3 alone papers over the missing field. Doing all three in one
PR gives the agent both real data and the right matching policy.
- **Add new model field, no backfill**: `Episode.show_name` is a new
`CharField` with empty default. Pre-prod data per
`feedback_reembed_ok_preprod.md`.
- **Manual verification in PR description**: requires a live ASGI server,
provider keys, and the canonical ARD episode. Not run by the
implementation agent — flagged in the PR body for the reviewer.

## Test plan

- Each aggregator's `_candidate()` populates `published_at` from a canned
payload.
- Each aggregator handles missing / malformed pub dates gracefully
(returns `published_at=None`, doesn't drop the candidate).
- `_show_name(episode)` returns `episode.show_name` when set, falls back
to URL host otherwise, returns `""` for a URL with no host.
- `Episode.show_name` is blank by default and persists when set.
- `manage.py makemigrations --check` passes (no further migrations
needed).
- Full test suite passes via `uv run python manage.py test`.
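
One of the aggregator cases in this plan could look roughly like the sketch below. It uses plain `unittest` and hypothetical stand-ins (`CandidateStub`, `candidate_from_podcastindex`) rather than the project's `EpisodeCandidate` and `_candidate()`, so treat it as an outline of the assertions, not the actual test module.

```python
from __future__ import annotations

import unittest
from dataclasses import dataclass
from datetime import date, datetime, timezone


@dataclass
class CandidateStub:
    """Hypothetical, trimmed-down stand-in for `EpisodeCandidate`."""

    title: str
    published_at: date | None = None


def candidate_from_podcastindex(item: dict) -> CandidateStub:
    """Build a candidate; a bad `datePublished` yields None, never an exception."""
    try:
        seconds = int(item["datePublished"])
        published = datetime.fromtimestamp(seconds, tz=timezone.utc).date()
    except (KeyError, TypeError, ValueError, OverflowError, OSError):
        published = None
    return CandidateStub(title=item.get("title", ""), published_at=published)


class PublishedAtTests(unittest.TestCase):
    def test_parses_epoch_seconds(self):
        c = candidate_from_podcastindex({"title": "Ep", "datePublished": 1724990400})
        self.assertEqual(c.published_at, date(2024, 8, 30))

    def test_missing_date_keeps_candidate(self):
        c = candidate_from_podcastindex({"title": "Ep"})
        self.assertEqual(c.title, "Ep")
        self.assertIsNone(c.published_at)

    def test_malformed_date_keeps_candidate(self):
        c = candidate_from_podcastindex({"title": "Ep", "datePublished": "soon"})
        self.assertEqual(c.title, "Ep")
        self.assertIsNone(c.published_at)
```

The missing and malformed cases assert on the title as well as the date, which is what distinguishes "kept with `published_at=None`" from "dropped".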

## Files touched

- `episodes/models.py` — add `show_name` field.
- `episodes/migrations/0025_episode_show_name.py` — new migration.
- `episodes/agents/fetch_details.py` — `EpisodeDetails.show_name` +
prompt update.
- `episodes/fetch_details_step.py` — persist `show_name` when non-empty.
- `episodes/downloader.py` — `_show_name` cascade; pass `published_at`
to the agent.
- `episodes/podcast_aggregators/base.py` — `EpisodeCandidate.published_at`.
- `episodes/podcast_aggregators/{fyyd,itunes,podcastindex}.py` — parse
pubdate; on failure, log and set `published_at=None` (candidate is kept).
- `episodes/agents/download_deps.py` — `DownloadDeps.published_at`.
- `episodes/agents/download_tools.py` — `IndexCandidate.published_at`,
surface in `lookup_podcast_index` return value.
- `episodes/agents/download.py` — extend prompt, plumb `published_at`
through `_run_agent_async` / `run_download_agent`.
- `episodes/tests/test_models.py` — `show_name` blank default + persistence.
- `episodes/tests/test_podcast_aggregators.py` — three pubdate parse
scenarios per aggregator.
- `episodes/tests/test_download.py` — `_show_name` cascade tests.
- `doc/plans/2026-05-04-download-show-name-fix.md` — this file.
- `doc/features/2026-05-04-download-show-name-fix.md` — implementation doc.
- `doc/sessions/2026-05-04-download-show-name-fix-planning-session.md`
- `doc/sessions/2026-05-04-download-show-name-fix-implementation-session.md`
- `CHANGELOG.md` — entry under `## 2026-05-04`.