Problem
When a pipeline step fails and the recovery agent runs (a Pydantic AI agent driving Playwright), the DB is only written at two points:
- At dispatch: a `RecoveryAttempt(status=ATTEMPTED)` row is created.
- At terminal outcome: status flips to `RESOLVED` or `AWAITING_HUMAN`, or the row is updated with an error message.
Everything in between — every Playwright tool call, every click attempt, every screenshot, every visual-analysis step — is invisible to the admin and to anyone watching `Episode.status`. With the agent's per-click Playwright timeout (`Page.click: Timeout 30000ms`) and the overall `RAGTIME_RECOVERY_AGENT_TIMEOUT=120` default, a failing recovery can take 2 minutes per episode during which the only DB-visible state is "ATTEMPTED". For users debugging or just monitoring an ingest, this is indistinguishable from a hang.
Concrete example from a recent two-episode parallel run against ARDsounds.de: the agent worked through `button:has-text("Mehr Informationen")` → `button:has-text("⋮")` → `a:has-text("Mehr Informationen")` → `button:has-text("Information")`, each timing out at 30s. The user could see the eventual `AWAITING_HUMAN` outcome 2 minutes later but had no way to see what the agent was actually trying.
OpenTelemetry traces in Langfuse capture this fully (per tool call, per screenshot, per LLM step), but Langfuse is an opt-in collector and isn't always enabled. The admin should reflect agent progress directly.
Proposal
Surface in-flight agent progress on the RecoveryAttempt row (or a related table) as the agent runs.
Minimum viable: heartbeat field
Add `RecoveryAttempt.progress: JSONField(default=list)` (or `tool_log`), appended to on each Pydantic AI tool call:
```json
{
  "timestamp": "2026-04-27T12:37:54Z",
  "tool": "page.click",
  "args": {"selector": "button:has-text('Mehr Informationen')"},
  "outcome": "timeout",
  "duration_ms": 30000
}
```
Hook on Pydantic AI's tool-call lifecycle (the agent already emits these to OpenTelemetry — same hook, different sink). Each entry is a small dict; the JSON list grows linearly with the agent's tool calls (typically 5–20 per attempt).
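A minimal sketch of such an observer as a plain wrapper around a tool callable. The exact Pydantic AI hook to attach it with is not pinned down here; the function name and the in-memory `progress` list (standing in for `RecoveryAttempt.progress`, which would be saved back to the row after each append) are illustrative assumptions:

```python
import time
from datetime import datetime, timezone


def log_tool_call(progress: list, tool_name: str, fn):
    """Wrap a tool callable so each invocation appends a progress entry.

    `progress` stands in for RecoveryAttempt.progress; in the real code
    the list would be persisted after each append (assumption).
    """
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tool": tool_name,
            "args": kwargs or {"args": [repr(a) for a in args]},
        }
        try:
            result = fn(*args, **kwargs)
            entry["outcome"] = "success"
            return result
        except TimeoutError:
            entry["outcome"] = "timeout"
            raise
        except Exception as exc:
            # Truncate long error text, per the ~500-char cap proposed below.
            entry["outcome"] = f"error: {exc}"[:500]
            raise
        finally:
            entry["duration_ms"] = int((time.monotonic() - start) * 1000)
            progress.append(entry)
    return wrapper
```

The entry shape matches the JSON example above, so whatever renders the timeline only has to know one schema.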
Render the progress list in the admin's RecoveryAttempt change view as a vertical timeline so the user sees "trying X… timed out… trying Y… timed out… escalating to human" in near-real time.
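The rendering can be a pure helper that turns the progress list into timeline HTML. The function name and CSS class are illustrative; in `RecoveryAttemptAdmin` it would presumably back a read-only field via `django.utils.html.format_html` (an assumption about wiring, not the actual admin code):

```python
import html


def render_progress_timeline(progress: list) -> str:
    """Render progress entries as a simple vertical timeline.

    Field names follow the proposed entry schema ({tool, outcome,
    duration_ms}); everything else here is a sketch.
    """
    items = []
    for entry in progress:
        tool = html.escape(entry.get("tool", "?"))
        outcome = html.escape(entry.get("outcome", "pending"))
        ms = entry.get("duration_ms")
        dur = f" ({ms} ms)" if ms is not None else ""
        items.append(f"<li><code>{tool}</code> &rarr; {outcome}{dur}</li>")
    return "<ul class='recovery-timeline'>" + "".join(items) + "</ul>"
```

Escaping each field keeps arbitrary selector strings from breaking the admin page.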
Cleaner alternative: separate RecoveryStep rows
Mirror the ProcessingRun / ProcessingStep design pattern that already exists in the codebase. Each tool call gets its own row:
| Field | Purpose |
| --- | --- |
| `recovery_attempt` | FK |
| `step_number` | order |
| `tool_name` | e.g. `"page.click"`, `"analyze_screenshot"`, `"intercept_audio_requests"` |
| `tool_input` | JSON |
| `tool_output` | JSON or truncated text |
| `started_at`, `finished_at` | duration |
| `outcome` | success / timeout / error |
More queryable, supports filtering / aggregation across attempts, but heavier — new table, new migration, new admin inline. Probably the right long-term home; the JSONField is the cheaper first step.
Implementation order
- Add `RecoveryAttempt.progress` JSONField, default `list`.
- Wire a Pydantic AI tool-call observer (the existing OTel observer is the model — same callback shape, different sink). Append a `{timestamp, tool, args, outcome, duration_ms}` dict per call. Truncate `tool_output` to ~500 chars to keep rows small.
- Render the list in `RecoveryAttemptAdmin` as a read-only HTML timeline.
- (Later, if it earns its keep) migrate the `progress` JSON to a `RecoveryStep` model and replace the rendering.
Out of scope
- Streaming progress to a websocket / polling endpoint for the chat UI — admin-only is enough to start.
- Pre-empting / cancelling a running agent from the admin — separate concern.
- The actual recovery-agent failure on ARD pages — that's a Playwright selector / page-structure issue, separate.
Out of band
This pre-dates PR #100 (MusicBrainz resolution); the friction was just rediscovered while testing parallel-safety against #100. Filing here so the MB PR doesn't pick it up by accident.