Problem
When a pipeline step fails and the recovery agent runs (a Pydantic AI agent driving Playwright), the DB is only written at two points:
- At dispatch: a `RecoveryAttempt(status=ATTEMPTED)` row is created.
- At terminal outcome: status flips to `RESOLVED` or `AWAITING_HUMAN`, or the row is updated with an error message.
Everything in between — every Playwright tool call, every click attempt, every screenshot, every visual-analysis step — is invisible to the admin and to anyone watching `Episode.status`. With the agent's per-click Playwright timeout (`Page.click: Timeout 30000ms`) and the overall `RAGTIME_RECOVERY_AGENT_TIMEOUT=120` default, a failing recovery can take 2 minutes per episode during which the only DB-visible state is "ATTEMPTED". For users debugging or just monitoring an ingest, this is indistinguishable from a hang.
Concrete example from a recent two-episode parallel run against ARDsounds.de: the agent worked through `button:has-text("Mehr Informationen")` → `button:has-text("⋮")` → `a:has-text("Mehr Informationen")` → `button:has-text("Information")`, each timing out at 30s. The user could see the eventual `AWAITING_HUMAN` outcome 2 minutes later but had no way to see what the agent was actually trying.
OpenTelemetry traces in Langfuse capture this fully (per tool call, per screenshot, per LLM step), but Langfuse is an opt-in collector and isn't always enabled. The admin should reflect agent progress directly.
Proposal
Surface in-flight agent progress on the RecoveryAttempt row (or a related table) as the agent runs.
Minimum viable: heartbeat field
Add `RecoveryAttempt.progress: JSONField(default=list)` (or `tool_log`), appended to on each Pydantic AI tool call:
```json
{
  "timestamp": "2026-04-27T12:37:54Z",
  "tool": "page.click",
  "args": {"selector": "button:has-text('Mehr Informationen')"},
  "outcome": "timeout",
  "duration_ms": 30000
}
```
Hook on Pydantic AI's tool-call lifecycle (the agent already emits these to OpenTelemetry — same hook, different sink). Each entry is a small dict; the JSON list grows linearly with the agent's tool calls (typically 5–20 per attempt).
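A minimal sketch of such an observer as a plain wrapper around a tool callable. The exact Pydantic AI hook to attach it with is not pinned down here; the function name and the in-memory `progress` list (standing in for `RecoveryAttempt.progress`, which would be saved back to the row after each append) are illustrative assumptions:

```python
import time
from datetime import datetime, timezone


def log_tool_call(progress: list, tool_name: str, fn):
    """Wrap a tool callable so each invocation appends a progress entry.

    `progress` stands in for RecoveryAttempt.progress; in the real code
    the list would be persisted after each append (assumption).
    """
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tool": tool_name,
            "args": kwargs or {"args": [repr(a) for a in args]},
        }
        try:
            result = fn(*args, **kwargs)
            entry["outcome"] = "success"
            return result
        except TimeoutError:
            entry["outcome"] = "timeout"
            raise
        except Exception as exc:
            # Truncate long error text, per the ~500-char cap proposed below.
            entry["outcome"] = f"error: {exc}"[:500]
            raise
        finally:
            entry["duration_ms"] = int((time.monotonic() - start) * 1000)
            progress.append(entry)
    return wrapper
```

The entry shape matches the JSON example above, so whatever renders the timeline only has to know one schema.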
Render the progress list in the admin's RecoveryAttempt change view as a vertical timeline so the user sees "trying X… timed out… trying Y… timed out… escalating to human" in near-real time.
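The rendering can be a pure helper that turns the progress list into timeline HTML. The function name and CSS class are illustrative; in `RecoveryAttemptAdmin` it would presumably back a read-only field via `django.utils.html.format_html` (an assumption about wiring, not the actual admin code):

```python
import html


def render_progress_timeline(progress: list) -> str:
    """Render progress entries as a simple vertical timeline.

    Field names follow the proposed entry schema ({tool, outcome,
    duration_ms}); everything else here is a sketch.
    """
    items = []
    for entry in progress:
        tool = html.escape(entry.get("tool", "?"))
        outcome = html.escape(entry.get("outcome", "pending"))
        ms = entry.get("duration_ms")
        dur = f" ({ms} ms)" if ms is not None else ""
        items.append(f"<li><code>{tool}</code> &rarr; {outcome}{dur}</li>")
    return "<ul class='recovery-timeline'>" + "".join(items) + "</ul>"
```

Escaping each field keeps arbitrary selector strings from breaking the admin page.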
Cleaner alternative: separate RecoveryStep rows
Mirror the ProcessingRun / ProcessingStep design pattern that already exists in the codebase. Each tool call gets its own row:
| Field | Purpose |
| --- | --- |
| `recovery_attempt` | FK |
| `step_number` | order |
| `tool_name` | e.g. `"page.click"`, `"analyze_screenshot"`, `"intercept_audio_requests"` |
| `tool_input` | JSON |
| `tool_output` | JSON or truncated text |
| `started_at`, `finished_at` | duration |
| `outcome` | success / timeout / error |
More queryable, supports filtering / aggregation across attempts, but heavier — new table, new migration, new admin inline. Probably the right long-term home; the JSONField is the cheaper first step.
Implementation order
- Add `RecoveryAttempt.progress` JSONField, default `list`.
- Wire a Pydantic AI tool-call observer (the existing OTel observer is the model — same callback shape, different sink). Append a `{timestamp, tool, args, outcome, duration_ms}` dict per call. Truncate `tool_output` to ~500 chars to keep rows small.
- Render the list in `RecoveryAttemptAdmin` as a read-only HTML timeline.
- (Later, if it earns its keep) migrate the `progress` JSON to a `RecoveryStep` model and replace the rendering.
Out of scope
- Streaming progress to a websocket / polling endpoint for the chat UI — admin-only is enough to start.
- Pre-empting / cancelling a running agent from the admin — separate concern.
- The actual recovery-agent failure on ARD pages — that's a Playwright selector / page-structure issue, separate.
Out of band
This pre-dates PR #100 (MusicBrainz resolution); the friction was just rediscovered while testing parallel-safety against #100. Filing here so the MB PR doesn't pick it up by accident.