Optional per-iteration prompt dump for faithful reproduction #6

@samkeen

Motivation

memory_load events tell us which channels the agent saw and their hashes, but not the exact bytes the model received on each turn. For debugging surprising behavior, especially the kind that ships in the Altered Craft article, being able to read back the literal prompt, tool schemas, and interleaved tool results is invaluable.

Proposal

Behind an env flag (e.g. TILTH_PROMPT_DUMP=1), write the rendered request to sessions/<id>/prompts/<task_id>-iter<N>.md (or .json) before each client.chat() call. Include the system prompt, the user prompt, tool schemas, and the conversation history at the moment of the call.

events.jsonl would carry a path reference in the model_call payload (e.g. prompt_dump: "prompts/T-001-iter3.md") so consumers can navigate from the event to the exact request.
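
A minimal sketch of how the write and the path reference could fit together, assuming a JSON dump and a request already rendered as a dict. The helper names (`prompt_dump_enabled`, `dump_prompt`, `model_call_payload`) and the payload shape are hypothetical, not existing tilth code; only the env flag, the directory layout, and the `prompt_dump` key come from the proposal above.

```python
import json
import os
from pathlib import Path
from typing import Any, Optional


def prompt_dump_enabled() -> bool:
    # Assumption: only the literal value "1" enables dumps; unset means off.
    return os.environ.get("TILTH_PROMPT_DUMP") == "1"


def dump_prompt(session_dir: Path, task_id: str, iteration: int,
                request: dict[str, Any]) -> Optional[str]:
    """Write the rendered request; return its path relative to the session dir.

    Returns None (and touches nothing on disk) when the flag is off.
    """
    if not prompt_dump_enabled():
        return None
    rel_path = Path("prompts") / f"{task_id}-iter{iteration}.json"
    out_path = session_dir / rel_path
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(request, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return rel_path.as_posix()


def model_call_payload(base: dict[str, Any],
                       dump_path: Optional[str]) -> dict[str, Any]:
    """Attach the dump reference to a model_call payload only when one exists."""
    payload = dict(base)
    if dump_path is not None:
        payload["prompt_dump"] = dump_path
    return payload
```

The caller would invoke `dump_prompt(...)` immediately before `client.chat()` and pass the returned path into the event payload, so the event stays small and the default-off path does no extra work.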

Trade-offs

  • Cost: a few KB to ~20KB per iteration. A 20-iteration task tops out around 400KB (20 × ~20KB); a multi-task session could push into the MB range. Mitigations: gzip on write, or only enable on demand via the env flag (default off).
  • Privacy: prompts can include workspace contents (file reads, diffs). Gating behind a flag keeps this opt-in.
  • Bloat to events.jsonl: none. The event only stores a path, not the prompt itself.
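
The gzip mitigation and the cost estimate above can be sketched as follows. This is an illustration under assumptions, not a proposed implementation: `write_dump_gz`, `read_dump_gz`, and `worst_case_kb` are hypothetical names, and the dump is assumed to be JSON.

```python
import gzip
import json
from pathlib import Path
from typing import Any


def write_dump_gz(path: Path, request: dict[str, Any]) -> int:
    """Write the request as gzip-compressed JSON; return the size on disk in bytes."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(request, fh, ensure_ascii=False)
    return path.stat().st_size


def read_dump_gz(path: Path) -> dict[str, Any]:
    """Read a dump back for inspection."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def worst_case_kb(iterations: int, per_iteration_kb: int = 20) -> int:
    """Upper-bound estimate from the cost bullet: iterations x ~20KB each."""
    return iterations * per_iteration_kb
```

With the figures from the cost bullet, `worst_case_kb(20)` gives 400. Prompts repeat the system prompt and tool schemas on every iteration, so they are likely to compress well, though the actual ratio would need measuring on real sessions.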

Default behavior

Off. This is a debugger feature; reach for it when chasing a specific question. Production runs should not pay the disk cost by default.

Acceptance criteria

  • TILTH_PROMPT_DUMP=1 enables prompt dumps; default is off
  • Dumps land in sessions/<id>/prompts/ with one file per client.chat() call
  • model_call event payload references the dump path when written
  • File naming makes ordering obvious (task_id + iter + monotonic suffix for non-iter calls like judge/self-improve)
  • No change to default-off behavior; existing test suite still passes
  • README/CLAUDE.md mention the flag in the observability section
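
One possible reading of the naming criterion, sketched under assumptions: iteration calls keep the `<task_id>-iter<N>` shape from the proposal, while non-iteration calls (judge, self-improve) get a zero-padded suffix drawn from a per-session counter. The `DumpNamer` class, the `kind` labels, and the four-digit padding are hypothetical choices, not part of the issue.

```python
import itertools
from typing import Optional


class DumpNamer:
    """Hands out one dump filename per client.chat() call within a session."""

    def __init__(self) -> None:
        # Shared counter: advances on every call so suffixes reflect call order.
        self._seq = itertools.count(1)

    def name(self, task_id: str, iteration: Optional[int] = None,
             kind: str = "call") -> str:
        seq = next(self._seq)
        if iteration is not None:
            # Iteration calls are already ordered by their iteration number.
            return f"{task_id}-iter{iteration}.json"
        # Non-iteration calls carry the monotonic sequence number instead.
        return f"{task_id}-{kind}-{seq:04d}.json"
```

For example, a namer that sees iteration 1, then a judge call, then iteration 2 for task `T-001` would produce `T-001-iter1.json`, `T-001-judge-0002.json`, and `T-001-iter2.json`. Because the counter advances on every call, the judge suffix shows where that call fell relative to the iterations.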

Context

Spun out of an observability pass that added memory_load and hook_run events plus OTel-shaped trace/span IDs. See conversation around the addition of `tilth/summary.json` aggregation for related work.
