stage.tgz archives recursively include prior visit_*/stage.tgz files, causing O(2^N) disk growth #89

@x85446

Summary

In a long attractor run with many revisits to the same node, each
<node>/visit_N/stage.tgz archive appears to include every prior
<node>/visit_M/stage.tgz (M < N) verbatim, with only ~50 MB of new
per-visit metadata added. Sizes therefore double per visit, leading to
exponential disk growth. A 17-cycle run produced a 115 GB archive at
visit_17, 229 GB at visit_18, and a 490 GB root-level
merge_implementation/stage.tgz — most of which is recursively
duplicated content from earlier visits. The same pattern is visible
under postmortem/, verify_fidelity/, and other revisited nodes.

We're not certain this is unintended: stage.tgz may be meant to be a
complete cumulative snapshot for forensic resume. But the size cost
made our run unrecoverable (1.6 TB of disk consumed, against roughly
50 MB of genuinely new content per visit), so we wanted to flag it
and ask.

Environment

  • kilroy 0.1.0 (binary built 2026-04-27)
  • Ubuntu 24.04 LTS in an Incus container, ZFS backing store
  • Pipeline: ~30-node attractor with impl_fanout, merge_implementation,
    verify_fidelity, verify_test, postmortem, etc.
  • 17 implement/merge cycles completed before disk-out

Evidence

Inspecting one archive

tar -tzf merge_implementation/visit_17/stage.tgz shows 16 nested
*.tgz files — one per prior visit:

visit_1/stage.tgz
visit_2/stage.tgz
visit_3/stage.tgz
...
visit_16/stage.tgz

Top consumers inside the 115 GB visit_17 archive
(tar -tzvf | sort -k3 -rn, i.e. sorted on the size column of GNU tar's
listing):

61 GB   visit_16/stage.tgz
30 GB   visit_15/stage.tgz
15 GB   visit_14/stage.tgz
7.6 GB  visit_13/stage.tgz
3.8 GB  visit_12/stage.tgz
1.9 GB  visit_11/stage.tgz
954 MB  visit_10/stage.tgz
... (perfect halving down to MB-scale)
5 MB    visit_13/events.json
4.9 MB  visit_13/response.md
4.9 MB  visit_13/stdout.log
3.6 MB  events.json

So 99.9% of visit_17/stage.tgz is older visits' stage.tgz files; the
genuinely new per-visit content (logs, response, events) is ~50 MB.
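
The same inspection can be scripted. Below is a minimal sketch using
Python's standard tarfile module; largest_members is a hypothetical
helper name, not part of kilroy, and it lists only the archive's
top-level entries without descending into the nested .tgz files.

```python
import tarfile


def largest_members(archive_path, top=10):
    """Return (size_bytes, name) pairs for the largest files in a .tgz."""
    with tarfile.open(archive_path, "r:gz") as tar:
        # Only regular files carry a payload; directories are skipped.
        sizes = [(member.size, member.name) for member in tar if member.isfile()]
    # Largest first; ties fall back to reverse name order.
    return sorted(sizes, reverse=True)[:top]
```

For example, largest_members("merge_implementation/visit_17/stage.tgz")
would return the entries behind the listing above. Reading a gzipped
archive of this size means decompressing all of it, so expect this to
be slow on the larger visits.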

Size progression on disk (merge_implementation/)

visit_1:    741 K
visit_2:    1.9 M    (×2.6)
visit_3:    4.9 M    (×2.6)
visit_4:    12 M     (×2.4)
visit_5:    27 M     (×2.25)
visit_6:    56 M     (×2.07)
visit_7:    112 M    (×2.0)
visit_8:    227 M    (×2.0)
visit_9:    454 M    (×2.0)
visit_10:   911 M    (×2.0)
visit_11:   1.8 G    (×2.0)
visit_12:   3.6 G    (×2.0)
visit_13:   7.2 G    (×2.0)
visit_14:   15 G     (×2.0)
visit_15:   29 G     (×2.0)
visit_16:   58 G     (×2.0)
visit_17:   115 G    (×2.0)
visit_18:   229 G    (×2.0)
root-level: 490 G    (×2.1)

Once per-visit metadata becomes negligible vs. the recursive payload,
the doubling becomes exact.
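
The doubling follows from a simple model, sketched here in Python under
the assumption that each visit's archive holds its own new bytes plus
every earlier archive verbatim. Under that assumption
size_N = 2 * size_(N-1) + (new_N - new_(N-1)), so the ratio settles at 2
once the change in per-visit content is small relative to the previous
archive. The function names and the byte counts are illustrative and
are not measurements from our run.

```python
def archive_sizes(new_bytes_per_visit):
    """Size of each visit's archive under the recursive-inclusion model."""
    sizes = []
    for new_bytes in new_bytes_per_visit:
        # Assumption: archive N = its own new content + all earlier archives.
        sizes.append(new_bytes + sum(sizes))
    return sizes


def growth_ratios(sizes):
    """Ratio of each archive to the one before it."""
    return [later / earlier for earlier, later in zip(sizes, sizes[1:])]


# Hypothetical per-visit content that grows at first, then levels off.
sizes = archive_sizes([10, 16, 20, 20, 20, 20])
ratios = growth_ratios(sizes)
```

With constant per-visit content c, the model gives c * 2^(N-1) for
visit N and c * (2^N - 1) in total across N visits, which is one rough
way to budget disk for a long run.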

Impact

  • This single run consumed 1.6 TB. After deleting the older visit_N
    archives (find ... -name 'stage.tgz' -delete), the actual unique
    on-disk state was on the order of 100-200 GB.
  • The kilroy run died after 17 cycles with a stall watchdog timeout;
    the proximate cause appears to be that the next stage couldn't write
    its working files because the disk filled.
  • Resuming the run (kilroy attractor resume --logs-root ...) currently
    fails with no space left on device even after substantial cleanup.
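
The cleanup in the first bullet can be previewed before anything is
deleted. This is a minimal sketch that assumes the
<node>/visit_N/stage.tgz layout described above; stale_stage_archives
is a hypothetical helper, and we have not confirmed that kilroy can
resume once older archives are gone. It only returns candidate paths
and removes nothing.

```python
import re
from pathlib import Path

VISIT_DIR = re.compile(r"visit_(\d+)$")


def stale_stage_archives(node_dir, keep=1):
    """List visit_N/stage.tgz under node_dir, minus the newest `keep`."""
    found = []
    for archive in Path(node_dir).glob("visit_*/stage.tgz"):
        match = VISIT_DIR.match(archive.parent.name)
        if match:
            # Sort on the numeric visit index so visit_10 follows visit_9.
            found.append((int(match.group(1)), archive))
    found.sort()
    cutoff = max(len(found) - keep, 0)
    return [path for _, path in found[:cutoff]]
```

Reviewing the returned list first is the safer route, because the
newest archive under each node may be the one a resume needs.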

Possible fix / question

If the per-visit stage.tgz is intended to be a self-contained snapshot
for kilroy attractor resume, then including prior visits is by design —
but a tar of a tar (no recompression benefit, since the inner is already
gzipped) seems like an expensive way to do it. A few options that would
defuse the growth:

  1. Exclude visit_*/stage.tgz from the archive set when packaging a
    stage. The next stage.tgz would be the per-visit logs + response
    only (~50 MB).
  2. Symlink prior visits' stage.tgz from inside the new archive rather
    than re-tarring them.
  3. Make stage.tgz creation optional via a flag (e.g.
    --no-stage-archive), or keep it but cap the visit history retained
    (e.g. last 3 visits).
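
Option 1 can be sketched with a filter on Python's tarfile module. This
illustrates the exclusion rule only and is not kilroy's packaging code:
pack_stage and skip_nested_stage_archives are hypothetical names, and
the directory layout (prior visit_N/ directories sitting inside the
directory being archived) is an assumption drawn from the listing
above.

```python
import tarfile
from pathlib import PurePosixPath


def skip_nested_stage_archives(member):
    """tarfile filter: drop any visit_*/stage.tgz, keep everything else."""
    parts = PurePosixPath(member.name).parts
    if len(parts) >= 2 and parts[-1] == "stage.tgz" and parts[-2].startswith("visit_"):
        return None  # returning None tells tarfile to leave the member out
    return member


def pack_stage(stage_dir, out_path):
    """Archive stage_dir into out_path without earlier visits' archives."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(stage_dir, arcname=".", filter=skip_nested_stage_archives)
```

Logs, responses, and events from earlier visits would still be
included; only the nested stage.tgz files are dropped, which is what
breaks the doubling.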

Happy to test a patch if useful. Mostly we wanted to confirm whether the
recursive inclusion is intentional. If yes, would appreciate guidance on
disk-budgeting for long attractor runs.

Reproduction

Run any pipeline in which a node is revisited more than 10 times (e.g.
an implement → verify → postmortem → implement loop in an attractor run
that doesn't converge quickly). The growth becomes catastrophic around
visit 12-15.
