Stream recovery-agent progress to RecoveryAttempt rows for live admin visibility #104

@rafacm

Problem

When a pipeline step fails and the recovery agent runs (a Pydantic AI agent driving Playwright), the DB is only written at two points:

  1. At dispatch: RecoveryAttempt(status=ATTEMPTED) row is created.
  2. At terminal outcome: status flips to RESOLVED or AWAITING_HUMAN, or the row is updated with an error message.

Everything in between (every Playwright tool call, click attempt, screenshot, and visual-analysis step) is invisible to the admin and to anyone watching Episode.status. With the default per-click timeout (Page.click: Timeout 30000ms) and the default overall RAGTIME_RECOVERY_AGENT_TIMEOUT=120, a failing recovery can take 2 minutes per episode, during which the only DB-visible state is "ATTEMPTED". For users debugging or just monitoring an ingest, this is indistinguishable from a hang.

Concrete example from a recent two-episode parallel run against ARDsounds.de: the agent worked through four selectors in turn, button:has-text("Mehr Informationen"), button:has-text("⋮"), a:has-text("Mehr Informationen") and button:has-text("Information"), each timing out at 30s. The user could see the eventual AWAITING_HUMAN outcome 2 minutes later but had no way to see what the agent was actually trying.

OpenTelemetry traces in Langfuse capture this fully (per tool call, per screenshot, per LLM step), but Langfuse is an opt-in collector and isn't always enabled. The admin should reflect agent progress directly.

Proposal

Surface in-flight agent progress on the RecoveryAttempt row (or a related table) as the agent runs.

Minimum viable: heartbeat field

Add RecoveryAttempt.progress: JSONField(default=list) (or tool_log), with one entry appended per Pydantic AI tool call:

{
    "timestamp": "2026-04-27T12:37:54Z",
    "tool": "page.click",
    "args": {"selector": "button:has-text('Mehr Informationen')"},
    "outcome": "timeout",
    "duration_ms": 30000
}

Hook into Pydantic AI's tool-call lifecycle (the agent already emits these events to OpenTelemetry; same hook, different sink). Each entry is a small dict; the JSON list grows linearly with the agent's tool calls (typically 5–20 per attempt).
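A minimal sketch of the "different sink" half, assuming nothing about Pydantic AI's actual hook API: record_tool_call is a hypothetical wrapper that times one tool call and appends an entry in the shape shown above. The plain list stands in for RecoveryAttempt.progress, and the builtin TimeoutError stands in for whatever timeout exception the Playwright tool actually raises.

```python
import time
from datetime import datetime, timezone
from typing import Any, Callable


def record_tool_call(
    progress: list[dict[str, Any]],
    tool: str,
    args: dict[str, Any],
    call: Callable[[], Any],
) -> Any:
    """Run one tool call and append a progress entry describing it.

    Hypothetical helper: `progress` is a stand-in for the proposed
    RecoveryAttempt.progress JSON list.
    """
    started = datetime.now(timezone.utc)
    t0 = time.monotonic()
    outcome = "success"
    try:
        return call()
    except TimeoutError:
        # Assumption: the real tool's timeout exception is mapped here.
        outcome = "timeout"
        raise
    except Exception:
        outcome = "error"
        raise
    finally:
        # Runs on success and on failure, so every call leaves one entry.
        progress.append({
            "timestamp": started.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "tool": tool,
            "args": args,
            "outcome": outcome,
            "duration_ms": int((time.monotonic() - t0) * 1000),
        })
```

With a real model instance, the append would presumably be followed by attempt.save(update_fields=["progress"]) so each entry becomes visible as soon as the tool call finishes.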

Render the progress list in the admin's RecoveryAttempt change view as a vertical timeline so the user sees "trying X… timed out… trying Y… timed out… escalating to human" in near-real time.
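The timeline rendering could look roughly like the following. This is a stdlib-only sketch; render_timeline is a hypothetical name, and in the admin it would likely be exposed through a read-only method on RecoveryAttemptAdmin with the result marked safe for output.

```python
from html import escape
from typing import Any


def render_timeline(progress: list[dict[str, Any]]) -> str:
    """Return an HTML ordered list with one item per recorded tool call."""
    items = []
    for entry in progress:
        seconds = entry.get("duration_ms", 0) / 1000
        text = (
            f"{entry.get('timestamp', '?')} "
            f"{entry.get('tool', '?')} {entry.get('args', {})} "
            f"-> {entry.get('outcome', '?')} ({seconds:.1f}s)"
        )
        # Escape the whole line: selectors and page text are untrusted input.
        items.append(f"<li>{escape(text)}</li>")
    return "<ol>" + "".join(items) + "</ol>"
```

Escaping the full line before wrapping it in markup keeps selector strings such as button:has-text("⋮") from being interpreted as HTML.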

Cleaner alternative: separate RecoveryStep rows

Mirror the ProcessingRun / ProcessingStep design pattern that already exists in the codebase. Each tool call gets its own row:

| Field | Purpose |
| --- | --- |
| recovery_attempt | FK |
| step_number | order |
| tool_name | e.g. "page.click", "analyze_screenshot", "intercept_audio_requests" |
| tool_input | JSON |
| tool_output | JSON or truncated text |
| started_at, finished_at | duration |
| outcome | success / timeout / error |

More queryable, supports filtering / aggregation across attempts, but heavier — new table, new migration, new admin inline. Probably the right long-term home; the JSONField is the cheaper first step.
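To make the "more queryable" point concrete, here is a sketch that mirrors the fields above as a plain dataclass rather than a Django model, so that it runs without project settings. The field types and the timeouts_by_tool helper are assumptions for illustration; a real model would use a ForeignKey, JSON fields and an ORM aggregate instead.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional


@dataclass
class RecoveryStep:
    """Stand-in for the proposed RecoveryStep row (types are assumed)."""
    recovery_attempt_id: int
    step_number: int
    tool_name: str
    tool_input: dict[str, Any]
    tool_output: Optional[str]
    started_at: datetime
    finished_at: Optional[datetime]
    outcome: str  # "success", "timeout" or "error"

    @property
    def duration_ms(self) -> Optional[int]:
        # Duration is derived from the two timestamps, not stored.
        if self.finished_at is None:
            return None
        delta = self.finished_at - self.started_at
        return int(delta.total_seconds() * 1000)


def timeouts_by_tool(steps: list[RecoveryStep]) -> Counter:
    """Count timed-out steps per tool across any number of attempts."""
    return Counter(s.tool_name for s in steps if s.outcome == "timeout")
```

This kind of cross-attempt question (which tool times out most often) is awkward to answer from a per-row JSON list and simple once each step is its own record.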

Implementation order

  1. Add RecoveryAttempt.progress JSONField, default list.
  2. Wire a Pydantic AI tool-call observer (the existing OTel observer is the model: same callback shape, different sink). Append a {timestamp, tool, args, outcome, duration_ms} dict per call. If tool output is also recorded, truncate it to ~500 chars to keep rows small.
  3. Render the list in RecoveryAttemptAdmin as a read-only HTML timeline.
  4. (Later, if it earns its keep) migrate progress JSON to a RecoveryStep model and replace the rendering.

Out of scope

  • Streaming progress to a websocket / polling endpoint for the chat UI — admin-only is enough to start.
  • Pre-empting / cancelling a running agent from the admin — separate concern.
  • The actual recovery-agent failure on ARD pages — that's a Playwright selector / page-structure issue, separate.

Out of band

This pre-dates PR #100 (MusicBrainz resolution); the friction was just rediscovered while testing parallel-safety against #100. Filing here so the MB PR doesn't pick it up by accident.
