Merged
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,11 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

- AGENTS.md and the Feature PR Documentation Bundle AI check now recognize **agent-orchestrated sessions**. When a parallel implementation agent is launched from a parent Claude Code session (e.g. under Conductor) and has no direct user-to-implementation-agent messages, the transcript may use `### Parent agent (orchestrator)` headings *instead of* `### User`, provided the parent agent's launching prompt is reproduced verbatim. The transcript must declare the session as agent-orchestrated at the top of `## Detailed conversation`. The same verbatim rule applies — summarized parent prompts are still rejected. This is a policy clarification only; no code changes.
- AI checks workflow now renders non-applicable rules as gray "skipped" icons instead of green "pass". `list_ai_checks.py` evaluates each rule's `paths:` frontmatter against `git diff --name-only $BASE_REF...HEAD` and emits `applies: bool` per matrix include; `.github/workflows/ai-checks.yml` gates each `check` shard on `if: ${{ ... && matrix.applies }}`, so non-applicable shards skip at the GHA level — no runner spin-up, no model call, no token cost. Four rules gain `paths:` frontmatter (`pipeline-step-sync`, `asgi-wsgi-scott`, `qdrant-payload-slim`, `entity-creation-race-safety`); the other five remain semantic. The driver's verdict tool drops `skip` from its enum: semantic non-applicability is now `pass` with `summary: "Rule does not apply."` and a one-line `details` explanation. Closes #124 — [plan](doc/plans/2026-05-04-ai-checks-skipped.md), [feature](doc/features/2026-05-04-ai-checks-skipped.md), [planning session](doc/sessions/2026-05-04-ai-checks-skipped-planning-session.md), [implementation session](doc/sessions/2026-05-04-ai-checks-skipped-implementation-session.md)
- **BREAKING** — `StepFailed`-derived exceptions now pickle as `RuntimeError(message)` rather than as their typed subclass. Pre-fix workflow rows in the DBOS `dbos.workflow_status` table will not deserialize to readable text — the Episode admin's "View workflow steps" page will show a base64 preview only. Action required: reprocess affected episodes (which produces fresh, portable pickles) or clear the workflow_status table in dev environments. No production impact since this project is pre-prod. See PR #129 for details.

### Fixed

- DBOS step errors now deserialize cleanly outside the Django process. `StepFailed.__reduce__` returns `(RuntimeError, (str(self),))` so the on-the-wire pickle is a plain stdlib `RuntimeError` — the standalone `dbos workflow steps` CLI (and DBOS Conductor) can rehydrate it without importing `episodes.workflows`, eliminating the `Warning: exception object could not be deserialized … No module named 'episodes'` noise. Worker-side semantics are unchanged: `raise DownloadStepFailed(...)` still matches `except StepFailed`. The Episode admin's `_decode_dbos_payload()` fallback for legacy rows whose typed-class wire format can no longer be rehydrated now renders `<could not deserialize: <80-char preview>>` instead of the raw b64 blob. Closes #110 — [plan](doc/plans/2026-05-04-stepfailed-portable-pickle.md), [feature](doc/features/2026-05-04-stepfailed-portable-pickle.md), [planning session](doc/sessions/2026-05-04-stepfailed-portable-pickle-planning-session.md), [implementation session](doc/sessions/2026-05-04-stepfailed-portable-pickle-implementation-session.md)

## 2026-05-01

70 changes: 70 additions & 0 deletions doc/features/2026-05-04-stepfailed-portable-pickle.md
@@ -0,0 +1,70 @@
# Make `StepFailed` pickles portable across processes

**Date:** 2026-05-04

## Problem

DBOS persists per-step `error` payloads as base64-encoded pickle bytes. The standalone `dbos workflow steps <id>` CLI (and DBOS Conductor) runs outside the Django process and can't import `episodes.workflows`, so unpickling a typed `StepFailed` subclass there raises `ModuleNotFoundError: No module named 'episodes'` and the CLI prints:

```
Warning: exception object could not be deserialized for workflow … : No module named 'episodes'
```

Issue #110. Commit `9ff6719` partially addressed the problem by adding `StepFailed.__reduce__` returning `(self.__class__, (episode_id, error_message))` — that fixed the in-process Episode admin "View workflow steps" page but kept the wire format tied to a Django module the CLI can't reach.

## Changes

1. **`StepFailed.__reduce__` now returns `(RuntimeError, (str(self),))`** (Option 3 from the issue). The pickle written to DBOS's system DB is a plain stdlib `RuntimeError` carrying the formatted message (`"<step> failed for episode <id>: <error_message>"`). Any Python process — CLI, Conductor, ad-hoc REPL — can deserialize it without importing project code. Worker-side semantics are unchanged: `raise DownloadStepFailed(...)` inside `process_episode` still matches `except StepFailed` / `except DownloadStepFailed` because `__reduce__` only kicks in at pickle time, after the typed exception has done its job.

*Why Option 3 over Option 1 (`raise RuntimeError(json.dumps(...))`):* it keeps the typed hierarchy at raise time, so step wrappers still read as intent-revealing code (`raise DownloadStepFailed(...)`, not `raise RuntimeError(...)`), and touching one method on the base class is the smallest possible blast radius. Verified with `grep` that no caller catches by typed subclass — the typed shape is for log readability, not control flow.

2. **Admin legacy-row fallback upgraded.** `_decode_dbos_payload()` in `episodes/admin.py` previously returned the raw b64 blob when `pickle.loads` failed (e.g., for legacy rows pickled before `9ff6719` whose `__reduce__` representation can no longer round-trip). Now it returns `<could not deserialize: <preview>>` with the b64 preview truncated to 80 characters, so the admin "View workflow steps" page renders an explanation instead of a wall of base64.

3. **Tests** in `episodes/tests/test_admin.py`:
- `test_step_failed_pickle_round_trip_yields_runtime_error` — new — confirms `pickle.loads(pickle.dumps(DownloadStepFailed(...)))` returns a plain `RuntimeError` carrying the formatted message, and that the typed instance is still a `StepFailed` before pickling.
- `test_decode_dbos_payload_unpickles_step_output` — updated comment for the new RuntimeError-on-the-wire semantics; assertions unchanged because `str(RuntimeError(msg)) == msg` matches what the old typed unpickle produced.
- `test_decode_dbos_payload_passthrough_for_plain_values` — updated to expect the new "could not deserialize" notice for the `gASnotpickle` legacy-row case.
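The core change (item 1 above) can be sketched self-contained. The class bodies and the `step_name` attribute here are assumptions for illustration; only the `__reduce__` return value and the message shape come from this doc:

```python
import pickle


class StepFailed(Exception):
    """Sketch of the base step-failure exception (the real one lives in episodes/workflows.py)."""

    step_name = "step"  # hypothetical attribute; real subclasses format "<step> failed for episode <id>: <msg>"

    def __init__(self, episode_id, error_message):
        super().__init__(f"{self.step_name} failed for episode {episode_id}: {error_message}")

    def __reduce__(self):
        # On-the-wire form is a plain stdlib RuntimeError: any process can
        # unpickle it without importing this module.
        return (RuntimeError, (str(self),))


class DownloadStepFailed(StepFailed):
    step_name = "downloading"


err = DownloadStepFailed(7, "boom")
assert isinstance(err, StepFailed)  # typed shape intact at raise time
out = pickle.loads(pickle.dumps(err))
print(type(out).__name__, "->", out)  # RuntimeError -> downloading failed for episode 7: boom
```

Since `__reduce__` only runs at pickle time, `except StepFailed` inside the worker still sees the typed instance.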

## Key parameters

| Knob | Value | Why |
|---|---|---|
| Wire-format exception class | `RuntimeError` | Stdlib — no project imports needed at unpickle time |
| Legacy-row preview length | 80 chars (longer blobs truncated to `value[:77] + "..."`) | Long enough to identify the row, short enough not to overflow the admin column |

## Verification

Full unit-test suite:

```bash
uv run python manage.py test
# Ran 372 tests in ~23s — OK
```

Pickle round-trip (smoke test, no DB needed):

```bash
uv run python -c "
import pickle, django; django.setup()
from episodes.workflows import DownloadStepFailed
err = DownloadStepFailed(episode_id=7, error_message='boom')
out = pickle.loads(pickle.dumps(err))
print(type(out).__name__, '->', out)
"
# RuntimeError -> downloading failed for episode 7: boom
```

End-to-end CLI verification (requires a DBOS-recorded failure):

```bash
uv run dbos workflow steps episode-<id>-run-<n> --sys-db-url "$RAGTIME_DB_URL"
# Expect: error column renders the formatted message; no "could not be deserialized" warning.
```

The CLI repro path was not exercised in this PR — the existing pickle round-trip test covers the same property the CLI relies on (deserializing without importing `episodes.workflows`).

## Files modified

- `episodes/workflows.py` — `StepFailed.__reduce__` now returns `(RuntimeError, (str(self),))`; comment rewritten to explain cross-process portability.
- `episodes/admin.py` — `_decode_dbos_payload()` failure branch now returns `<could not deserialize: <80-char preview>>` instead of the raw blob.
- `episodes/tests/test_admin.py` — new pickle round-trip test; existing tests updated for the RuntimeError-on-the-wire semantics and the new fallback message.
38 changes: 38 additions & 0 deletions doc/plans/2026-05-04-stepfailed-portable-pickle.md
@@ -0,0 +1,38 @@
# Make `StepFailed` pickles portable across processes

**Date:** 2026-05-04

## Goal

Resolve issue #110: replace the typed `StepFailed` exception hierarchy on the DBOS wire so the standalone `dbos workflow steps` CLI (and Conductor) can render step errors instead of printing `Warning: exception object could not be deserialized … No module named 'episodes'`.

## Context

Commit `9ff6719` partially addressed the issue by adding `StepFailed.__reduce__` returning `(self.__class__, (episode_id, error_message))`, which fixed the in-process Episode admin view but kept the wire format tied to a Django module the CLI can't import. The issue's top comment narrows the remaining acceptance criteria to:

- [ ] `dbos workflow steps <id>` renders the step error as a readable message (no "could not be deserialized" warning).
- [ ] Existing in-process unpickle (Episode admin) keeps working.
- [ ] 349+ tests passing.

## Options considered

| Option | Sketch | Decision |
|---|---|---|
| **1. Inline raise** | Step wrappers raise `RuntimeError(json.dumps({...}))` directly, drop the typed hierarchy. | Loses intent-revealing code at raise sites. Rejected. |
| **2. Move classes to a stdlib-only module** | New `episodes_dbos/exceptions.py` (no Django imports), import path the CLI can resolve. | Overkill — these classes are 8 lines each, and the CLI still has to know where to find them. Rejected. |
| **3. `__reduce__` returns `RuntimeError`** | Worker-side typed hierarchy intact; the on-the-wire pickle collapses to a stdlib `RuntimeError`. | Smallest blast radius, preserves typed shape inside the raise process, portable everywhere. **Picked.** |

Verified with `grep` that no caller does `except FetchDetailsFailed` etc. — the typed hierarchy exists for log readability, not control flow, so collapsing on the wire is safe.
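A self-contained illustration of that check — the file contents and the `/tmp` path are fabricated for the demo; the real command would run against the repo:

```shell
mkdir -p /tmp/grep-demo/episodes
cat > /tmp/grep-demo/episodes/consumer.py <<'EOF'
try:
    process_episode()
except StepFailed:  # only the base class is caught
    raise
EOF
# An empty result (grep exit status 1) means no caller catches the typed
# subclasses by name, so collapsing them to RuntimeError on the wire is safe.
grep -rnE "except +(FetchDetailsFailed|DownloadStepFailed)" /tmp/grep-demo \
  || echo "no typed catches found"
```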

## Plan

1. **`episodes/workflows.py`** — change `StepFailed.__reduce__` to return `(RuntimeError, (str(self),))`. Update the docstring with the cross-process-portability rationale.
2. **`episodes/admin.py`** — when `_decode_dbos_payload()` can't unpickle a value that *looks* like a pickle blob (legacy rows pickled with the typed class), return `<could not deserialize: <80-char preview>>` instead of the raw b64 string. Improves the admin UX for legacy rows whose typed wire format the current process can no longer rehydrate.
3. **Tests** — new pickle round-trip test confirming `pickle.loads(pickle.dumps(DownloadStepFailed(...)))` returns a plain `RuntimeError`; update the existing legacy-row test to expect the new fallback message.
4. **Run the full test suite** — must stay at the 349+ baseline from `9ff6719`.
5. **Documentation** — feature doc + planning + implementation session transcripts + changelog entry per CLAUDE.md.

## Out of scope

- DBOS Conductor end-to-end verification (no Conductor running locally).
- Rewriting legacy admin rows' stored pickles as `RuntimeError` post-hoc — too invasive; they fall through to the readable fallback message instead.
@@ -0,0 +1,87 @@
# Make `StepFailed` pickles portable across processes — Implementation Session

**Date:** 2026-05-04

**Session ID:** unavailable

## Summary

Implemented Option 3 from issue #110: `StepFailed.__reduce__` now returns `(RuntimeError, (str(self),))` so DBOS-stored exceptions deserialize as plain stdlib `RuntimeError` in any process — the CLI and Conductor included. Worker-side typed hierarchy unchanged (still raises `DownloadStepFailed` etc.). Upgraded the admin's `_decode_dbos_payload()` legacy-row fallback to `<could not deserialize: <80-char preview>>` so legacy rows render an explanation instead of raw base64. Added a pickle round-trip test; existing tests updated for the new wire-format semantics. Full suite stays green at 372 tests.

## Detailed conversation

This session was orchestrated by a parent agent acting on behalf of the user via Claude Code's Agent tool with `isolation: worktree`. No direct user-to-implementation-agent messages occurred. The parent agent's launching prompt — which serves as the de facto user message for this session — is reproduced verbatim below.

### Parent agent (orchestrator)

```
You are working on issue #110 in the rafacm/ragtime repo: "Replace typed StepFailed exceptions with serializable failure objects so the dbos CLI can render them".

## Read first
- `CLAUDE.md` and `AGENTS.md` (project docs requirements, branching rules)
- The full issue and its top comment: run `gh issue view 110 --comments`
- Commit `9ff6719` which partially addressed this: `git show 9ff6719`

## Step 0 — Branching
Per AGENTS.md, confirm we are on `main` and pull latest. Create a branch named `rafacm/stepfailed-portable` off `main`.

## Step 1 — Staleness check
The issue is explicitly noted as "partially addressed in 9ff6719". Verify the current state:
1. Read `episodes/workflows.py` — find `StepFailed` and its subclasses, see if `__reduce__` is already there and what it returns
2. Check if any further work has landed: `git log --oneline --all -- episodes/workflows.py | head -20`

Decide based on what you find whether further work is needed. If it's already complete, report and stop.

## Step 2 — Implementation
Pick the smallest viable option from the issue:
- **Option 3 (recommended for minimal blast radius):** Make `__reduce__` return `(RuntimeError, (str(self),))` so the on-the-wire pickle is a plain RuntimeError that any process can deserialize, while the typed shape stays available inside the worker (raise-time).
- **Option 1:** raise `RuntimeError(json.dumps({...}))` directly — drops the typed hierarchy.

Confirm DBOS or any Django code does NOT do `except FetchDetailsFailed` etc. (grep). If nothing catches by type, both options are safe; pick option 3 for typed-shape preservation.

Also handle the legacy-rows graceful fallback: if Django's admin "View workflow steps" page tries to deserialize a pre-`9ff6719` pickle blob, it should not crash — instead render a graceful "could not deserialize: <preview>" message. Find where the admin renders these and add the fallback.

## Step 3 — Verification
1. Run `uv run python manage.py test` (full suite — issue says 349+ should pass)
2. If you can stand up DBOS locally and produce a failed workflow, run `uv run dbos workflow steps <id> --sys-db-url ...` and confirm no "could not be deserialized" warning. If impractical, document manual repro steps.

## Step 4 — Documentation
Follow CLAUDE.md's "Documentation > Implementation phase" — feature doc, session transcripts, changelog. Date prefix `2026-05-04-`.

## Step 5 — PR
- Push and `gh pr create --base main` (rebase strategy)
- "Closes #110" in body

## Reporting
At the end report: (a) staleness verdict, (b) which option you picked and why, (c) what you did about legacy rows, (d) files changed, (e) PR URL, (f) tests passing count, (g) what you couldn't manually verify. Under 250 words.
```

### Assistant (implementation agent)

**Branching.** The current worktree was on `rafacm/triage-issues`; pulled latest into the `main` worktree (already up to date), then `git checkout -b rafacm/stepfailed-portable main` from the `paris` worktree.

**Implementation.** Edited `episodes/workflows.py`:

```python
def __reduce__(self):
    # Pickle as a plain ``RuntimeError`` carrying the formatted message.
    # ... cross-process portability rationale + the broken default-Exception
    # round-trip note from 9ff6719, kept because it still applies.
    return (RuntimeError, (str(self),))
```

The two-arg `__init__` issue that 9ff6719 worked around is sidestepped automatically when the wire form is a one-arg `RuntimeError`.
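A contrived illustration of that two-arg pitfall — the class names here are made up and are not project code:

```python
import pickle


class Brittle(Exception):
    """Default exception pickling records (cls, self.args); args here is the
    single formatted string, so unpickling re-calls __init__ with one argument."""

    def __init__(self, episode_id, error_message):
        super().__init__(f"failed for episode {episode_id}: {error_message}")


try:
    pickle.loads(pickle.dumps(Brittle(7, "boom")))
except TypeError as exc:
    print("round-trip broke:", exc)  # signature mismatch at unpickle time


class Portable(Brittle):
    def __reduce__(self):
        # One-arg RuntimeError on the wire: no signature mismatch possible.
        return (RuntimeError, (str(self),))


out = pickle.loads(pickle.dumps(Portable(7, "boom")))
print(type(out).__name__)  # RuntimeError
```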

Edited `episodes/admin.py::_decode_dbos_payload()`. Previous behaviour: silently return the raw b64 string when `pickle.loads` failed. New behaviour: return `<could not deserialize: <preview>>` with the preview truncated at 80 characters (`value[:77] + "..."` for longer blobs). Most likely path that triggers the fallback now is legacy rows pickled before this PR whose typed-class wire format the current process can no longer rehydrate.

**Tests.** Edited `episodes/tests/test_admin.py`:

- New: `test_step_failed_pickle_round_trip_yields_runtime_error` — asserts `pickle.loads(pickle.dumps(DownloadStepFailed(7, "boom")))` returns a `type(...) is RuntimeError` carrying `"downloading failed for episode 7: boom"`, and that the original instance is still a `StepFailed` subclass before pickling.
- Updated comment block on `test_decode_dbos_payload_unpickles_step_output` to explain the RuntimeError-on-the-wire semantics. Assertions unchanged because `str(RuntimeError(msg)) == msg` matches what the typed unpickle produced.
- Updated `test_decode_dbos_payload_passthrough_for_plain_values`: the `gASnotpickle` case now expects `"could not deserialize"` and the original blob to both appear in the output.

**Verification.** `uv run python manage.py test` → `Ran 372 tests in 23.067s` → `OK`. Up from the 349 baseline because of intervening commits unrelated to this issue.

**End-to-end CLI repro.** Not run — would require a live DBOS instance and a freshly-failed workflow. The pickle round-trip test covers the same property the CLI relies on (deserializing without importing `episodes.workflows`); the unit test asserts on `type(out) is RuntimeError` which is exactly what the CLI needs.

**Documentation.** Wrote plan, feature doc, and both session transcripts (this is the implementation one). Changelog entry follows under today's date.