stage.tgz archives recursively include prior visit_*/stage.tgz files, causing O(2^N) disk growth

## Summary

In a long attractor run with many revisits to the same node, each
`<node>/visit_N/stage.tgz` archive appears to include every prior
`<node>/visit_M/stage.tgz` (M < N) verbatim, with only ~50 MB of new
per-visit metadata added. Sizes therefore double per visit, leading to
exponential disk growth. A 17-cycle run produced a 115 GB archive at
visit_17, 229 GB at visit_18, and a 490 GB root-level
`merge_implementation/stage.tgz` — most of which is recursively
duplicated content from earlier visits. The same pattern is visible
under `postmortem/`, `verify_fidelity/`, and other revisited nodes.

We're not certain this is unintended; stage.tgz may be meant as a
complete cumulative snapshot for forensic resume. But the size cost
made our run unrecoverable (1.6 TB of disk consumed, while each visit
adds only ~50 MB of genuinely new content), so we wanted to flag it
and ask.

## Environment

- kilroy 0.1.0 (binary built 2026-04-27)
- Ubuntu 24.04 LTS in an Incus container, ZFS backing store
- Pipeline: ~30-node attractor with `impl_fanout`, `merge_implementation`,
  `verify_fidelity`, `verify_test`, `postmortem`, etc.
- 17 implement/merge cycles completed before disk-out

## Evidence

### Inspecting one archive

`tar -tzf merge_implementation/visit_17/stage.tgz` shows 16 nested
`*.tgz` files — one per prior visit:

    visit_1/stage.tgz
    visit_2/stage.tgz
    visit_3/stage.tgz
    ...
    visit_16/stage.tgz

Top consumers *inside* the 115 GB visit_17 archive (`tar -tzvf | sort -k3,3nr`, sizes abbreviated):

    61 GB   visit_16/stage.tgz
    30 GB   visit_15/stage.tgz
    15 GB   visit_14/stage.tgz
    7.6 GB  visit_13/stage.tgz
    3.8 GB  visit_12/stage.tgz
    1.9 GB  visit_11/stage.tgz
    954 MB  visit_10/stage.tgz
    ... (perfect halving down to MB-scale)
    5 MB    visit_13/events.json
    4.9 MB  visit_13/response.md
    4.9 MB  visit_13/stdout.log
    3.6 MB  events.json

So 99.9% of visit_17/stage.tgz is older visits' stage.tgz files; the
genuinely new per-visit content (logs, response, events) is ~50 MB.
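
As a cross-check on the listing above, the share of an archive taken up by nested `stage.tgz` members can be tallied from the tar headers alone. This is a minimal sketch using Python's standard `tarfile` module; the function name `nested_archive_share` is ours and is not part of kilroy.

```python
import posixpath
import tarfile


def nested_archive_share(archive_path, nested_name="stage.tgz"):
    """Return (nested_bytes, total_bytes) over the regular files in a .tgz."""
    nested = 0
    total = 0
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:  # walks member headers; nothing is written to disk
            if not member.isfile():
                continue
            total += member.size
            # Tar member names always use "/" as the separator.
            if posixpath.basename(member.name) == nested_name:
                nested += member.size
    return nested, total
```

Called on `merge_implementation/visit_17/stage.tgz`, we would expect the first number to be nearly all of the second. It still has to decompress the whole stream to reach each header, so it is slow on a 115 GB archive.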

### Size progression on disk (`merge_implementation/`)

    visit_1:    741 K
    visit_2:    1.9 M    (×2.6)
    visit_3:    4.9 M    (×2.6)
    visit_4:    12 M     (×2.4)
    visit_5:    27 M     (×2.25)
    visit_6:    56 M     (×2.07)
    visit_7:    112 M    (×2.0)
    visit_8:    227 M    (×2.0)
    visit_9:    454 M    (×2.0)
    visit_10:   911 M    (×2.0)
    visit_11:   1.8 G    (×2.0)
    visit_12:   3.6 G    (×2.0)
    visit_13:   7.2 G    (×2.0)
    visit_14:   15 G     (×2.0)
    visit_15:   29 G     (×2.0)
    visit_16:   58 G     (×2.0)
    visit_17:   115 G    (×2.0)
    visit_18:   229 G    (×2.0)
    root-level: 490 G    (×2.1)

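Each archive holds every earlier archive, and the previous archive
already holds all of the ones before it, so visit_N is roughly twice
visit_(N-1) plus the change in per-visit content. Once that content is
negligible next to the nested payload, the ratio settles at ×2.0.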
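
The arithmetic can be checked with a toy model. The sketch below assumes each archive is exactly its own new content plus every earlier archive stored verbatim (no tar overhead, no recompression gain); `archive_sizes` is a hypothetical helper, not kilroy code.

```python
def archive_sizes(new_content):
    """new_content[i] is the fresh data of visit i+1, in any unit.

    Returns the modelled size of each visit's archive, assuming it holds
    that fresh data plus a verbatim copy of every earlier archive.
    """
    sizes = []
    for fresh in new_content:
        sizes.append(fresh + sum(sizes))
    return sizes


# One unit of fresh data per visit, six visits.
print(archive_sizes([1] * 6))  # [1, 2, 4, 8, 16, 32]
```

Under this model the total disk use after N visits is the sum of the list, which is one unit short of twice the last archive.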

## Impact

- This single run consumed 1.6 TB. After deleting the recursive archives
  (`find ... -name 'stage.tgz' -delete` of older visit_N archives),
  the actual unique on-disk state was on the order of 100-200 GB.
- The kilroy run died after 17 cycles with `stall watchdog timeout` —
  the proximate cause appears to be that the next stage couldn't write
  its working files because the disk filled.
- Resuming the run (`kilroy attractor resume --logs-root ...`) currently
  fails with `no space left on device` even after substantial cleanup.

## Possible fix / question

If the per-visit `stage.tgz` is intended to be a self-contained snapshot
for `kilroy attractor resume`, then including prior visits is by design —
but a tar of a tar (no recompression benefit, since the inner is already
gzipped) seems like an expensive way to do it. A few options that would
defuse the growth:

1. Exclude `visit_*/stage.tgz` from the archive set when packaging a
   stage. The next `stage.tgz` would be the per-visit logs + response
   only (~50 MB).
2. Symlink prior visits' stage.tgz from inside the new archive rather
   than re-tarring them.
3. Make stage.tgz creation optional via a flag (e.g.
   `--no-stage-archive`), or keep it but cap the visit history retained
   (e.g. last 3 visits).
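
To illustrate option 1, here is a minimal Python sketch of packaging a node directory while skipping nested archives. We have not read kilroy's packaging code, so this is not a patch: `package_stage` and its parameters are hypothetical, and the sketch only shows the exclusion rule using the standard `tarfile` filter hook.

```python
import posixpath
import tarfile


def package_stage(node_dir, out_path, skip_name="stage.tgz"):
    """Archive node_dir into out_path, leaving out every file named skip_name."""

    def drop_nested(info):
        # Returning None from the filter excludes the member from the archive.
        if info.isfile() and posixpath.basename(info.name) == skip_name:
            return None
        return info

    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(node_dir, arcname=".", filter=drop_nested)
```

Whether `kilroy attractor resume` needs the nested archives is exactly the open question here; if it does, an exclusion like this would break resume and option 2 or 3 would be the safer route.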

Happy to test a patch if useful. Mostly we wanted to confirm whether the
recursive inclusion is intentional. If yes, would appreciate guidance on
disk-budgeting for long attractor runs.

## Reproduction

Run any pipeline in which a node is revisited more than 10 times, e.g.
an implement → verify → postmortem → implement loop in an attractor run
that doesn't converge quickly. The growth becomes catastrophic around
visits 12-15.
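
The growth pattern can also be imitated without kilroy. The sketch below is our guess at the behaviour, not a trace of the real code: on each simulated visit it writes a small log and then archives the whole node directory, earlier `visit_M/stage.tgz` files included. `simulate_revisits` and the 4 KiB payload are assumptions chosen to keep the run small.

```python
import os
import tarfile
import tempfile


def simulate_revisits(visits, payload_bytes=4096):
    """Return the size in bytes of each simulated visit_N/stage.tgz."""
    sizes = []
    with tempfile.TemporaryDirectory() as node_dir:
        for n in range(1, visits + 1):
            visit_dir = os.path.join(node_dir, f"visit_{n}")
            os.mkdir(visit_dir)
            # Random bytes stand in for logs; they do not compress.
            with open(os.path.join(visit_dir, "stdout.log"), "wb") as fh:
                fh.write(os.urandom(payload_bytes))
            archive = os.path.join(visit_dir, "stage.tgz")
            with tarfile.open(archive, "w:gz") as tar:
                # No filter: every earlier stage.tgz is swept in again.
                # tarfile skips only the archive currently being written.
                tar.add(node_dir, arcname=".")
            sizes.append(os.path.getsize(archive))
    return sizes


if __name__ == "__main__":
    for visit, size in enumerate(simulate_revisits(10), start=1):
        print(f"visit_{visit}: {size} bytes")
```

If the real packaging step works this way, consecutive sizes should settle toward a ×2 ratio, as in the table under Evidence.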

