Summary
In a long attractor run with many revisits to the same node, each
<node>/visit_N/stage.tgz archive appears to include every prior
<node>/visit_M/stage.tgz (M < N) verbatim, with only ~50 MB of new
per-visit metadata added. Sizes therefore double per visit, leading to
exponential disk growth. A 17-cycle run produced a 115 GB archive at
visit_17, 229 GB at visit_18, and a 490 GB root-level
merge_implementation/stage.tgz — most of which is recursively
duplicated content from earlier visits. The same pattern is visible
under postmortem/, verify_fidelity/, and other revisited nodes.
We're not certain this is unintended — possibly stage.tgz is meant to be
a complete cumulative snapshot for forensic resume. But the size cost
made our run unrecoverable (1.6 TB of disk consumed by what is
effectively ~50 MB of unique per-visit content), so we wanted to flag it
and ask.
Environment
- kilroy 0.1.0 (binary built 2026-04-27)
- Ubuntu 24.04 LTS in an Incus container, ZFS backing store
- Pipeline: ~30-node attractor with impl_fanout, merge_implementation,
verify_fidelity, verify_test, postmortem, etc.
- 17 implement/merge cycles completed before disk-out
Evidence
Inspecting one archive
tar -tzf merge_implementation/visit_17/stage.tgz shows 16 nested
*.tgz files — one per prior visit:
visit_1/stage.tgz
visit_2/stage.tgz
visit_3/stage.tgz
...
visit_16/stage.tgz
Top consumers inside the 115 GB visit_17 archive (tar -tzvf | sort -rn):
61 GB visit_16/stage.tgz
30 GB visit_15/stage.tgz
15 GB visit_14/stage.tgz
7.6 GB visit_13/stage.tgz
3.8 GB visit_12/stage.tgz
1.9 GB visit_11/stage.tgz
954 MB visit_10/stage.tgz
... (perfect halving down to MB-scale)
5 MB visit_13/events.json
4.9 MB visit_13/response.md
4.9 MB visit_13/stdout.log
3.6 MB events.json
So 99.9% of visit_17/stage.tgz is older visits' stage.tgz files; the
genuinely new per-visit content (logs, response, events) is ~50 MB.
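For anyone who wants to check their own archives, the breakdown above can be approximated with a short Python sketch. It is not part of kilroy; the function name and its defaults are ours, and it assumes the archive is a gzip-compressed tar whose nested archives are all named stage.tgz.

```python
import tarfile
from pathlib import PurePosixPath


def nested_archive_share(archive_path, nested_name="stage.tgz"):
    """Return (nested_bytes, total_bytes, ten_largest_members) for a .tgz file.

    Sizes are the uncompressed member sizes recorded in the tar headers.
    """
    nested = 0
    total = 0
    members = []
    # Stream mode ("r|gz") reads the archive once, front to back, without
    # seeking, which matters for archives in the 100 GB range.
    with tarfile.open(archive_path, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            total += member.size
            members.append((member.size, member.name))
            # Count a member as "nested" when its basename is stage.tgz.
            if PurePosixPath(member.name).name == nested_name:
                nested += member.size
    members.sort(reverse=True)
    return nested, total, members[:10]
```

Dividing the first return value by the second gives the share of the archive taken up by older visits' archives.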
Size progression on disk (merge_implementation/)
visit_1: 741 K
visit_2: 1.9 M (×2.6)
visit_3: 4.9 M (×2.6)
visit_4: 12 M (×2.4)
visit_5: 27 M (×2.25)
visit_6: 56 M (×2.07)
visit_7: 112 M (×2.0)
visit_8: 227 M (×2.0)
visit_9: 454 M (×2.0)
visit_10: 911 M (×2.0)
visit_11: 1.8 G (×2.0)
visit_12: 3.6 G (×2.0)
visit_13: 7.2 G (×2.0)
visit_14: 15 G (×2.0)
visit_15: 29 G (×2.0)
visit_16: 58 G (×2.0)
visit_17: 115 G (×2.0)
visit_18: 229 G (×2.0)
root-level: 490 G (×2.1)
Once per-visit metadata becomes negligible vs. the recursive payload,
the doubling becomes exact.
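The doubling is what a simple recurrence predicts. The sketch below is our model, not kilroy's code: it assumes each archive holds that visit's new content plus every earlier archive at full size (nested gzip does not shrink further), so size(N) = new(N) + sum of size(M) for M < N.

```python
def cumulative_archive_sizes(new_bytes_per_visit):
    """Model archive sizes when each archive embeds all earlier archives."""
    sizes = []
    for new_bytes in new_bytes_per_visit:
        # This visit's archive = its own new content + every prior archive.
        sizes.append(new_bytes + sum(sizes))
    return sizes


def growth_ratios(sizes):
    """Ratio of each archive's size to the one before it."""
    return [later / earlier for earlier, later in zip(sizes, sizes[1:])]


# Illustrative units only: new content grows for a few visits, then levels off.
sizes = cumulative_archive_sizes([1, 2, 5, 5, 5, 5])
print(sizes)                 # [1, 3, 9, 18, 36, 72]
print(growth_ratios(sizes))  # [3.0, 3.0, 2.0, 2.0, 2.0]
```

In this model the ratio settles at exactly 2 as soon as the new content per visit stops growing, which matches the shape of the table above.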
Impact
- This single run consumed 1.6 TB. After deleting the recursive archives
(find ... -name 'stage.tgz' -delete against the older visit_N directories),
the actual unique on-disk state was on the order of 100-200 GB.
- The kilroy run died after 17 cycles with stall watchdog timeout; the
proximate cause appears to be that the next stage couldn't write its
working files because the disk had filled.
- Resuming the run (kilroy attractor resume --logs-root ...) currently
fails with no space left on device even after substantial cleanup.
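To see where the space went before deleting anything, a per-node tally of stage.tgz files is enough. This Python sketch is ours and assumes the layout described in this report (<logs root>/<node>/visit_N/stage.tgz plus a <node>/stage.tgz at the node's top level); the function name is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def stage_archive_usage(logs_root, archive_name="stage.tgz"):
    """Return {node_name: total bytes of stage.tgz files under that node}."""
    root = Path(logs_root)
    usage = defaultdict(int)
    for archive in root.rglob(archive_name):
        if not archive.is_file():
            continue
        rel = archive.relative_to(root)
        # The first path component under the logs root is taken as the node;
        # an archive sitting directly in the root is filed under ".".
        node = rel.parts[0] if len(rel.parts) > 1 else "."
        usage[node] += archive.stat().st_size
    return dict(usage)
```

Sorting the result by value shows which revisited nodes dominate the disk budget.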
Possible fix / question
If the per-visit stage.tgz is intended to be a self-contained snapshot
for kilroy attractor resume, then including prior visits is by design —
but a tar of a tar (no recompression benefit, since the inner is already
gzipped) seems like an expensive way to do it. A few options that would
defuse the growth:
- Exclude visit_*/stage.tgz from the archive set when packaging a
stage. The next stage.tgz would then contain only the per-visit logs +
response (~50 MB).
- Symlink prior visits' stage.tgz from inside the new archive rather
than re-tarring them.
- Make stage.tgz creation optional via a flag (e.g.
--no-stage-archive), or keep it but cap the visit history retained
(e.g. last 3 visits).
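To make the first option concrete, here is a minimal sketch of exclusion at packaging time. It is written in Python purely for illustration; we have not looked at how kilroy actually builds stage.tgz, so the function name, its arguments, and the assumption that a stage archive is a tar of the node directory are all ours.

```python
import fnmatch
import tarfile
from pathlib import Path


def package_stage(node_dir, out_path, exclude_pattern="visit_*/stage.tgz"):
    """Tar and gzip node_dir into out_path, leaving out earlier stage archives."""

    def skip_nested(tarinfo):
        # Returning None from a tarfile filter drops that member.
        # Note that fnmatch's "*" also matches "/", so deeper paths that end
        # in stage.tgz under a visit_* directory are dropped as well.
        if fnmatch.fnmatchcase(tarinfo.name, exclude_pattern):
            return None
        return tarinfo

    with tarfile.open(str(out_path), "w:gz") as tar:
        # Add each top-level entry by its own name so member paths look like
        # "visit_3/response.md" rather than "./visit_3/response.md".
        for child in sorted(Path(node_dir).iterdir()):
            tar.add(str(child), arcname=child.name, filter=skip_nested)
```

With a filter like this, each archive's size would track the logs and responses alone, so growth would be at worst linear in the number of visits rather than doubling.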
Happy to test a patch if useful. Mostly we wanted to confirm whether the
recursive inclusion is intentional. If yes, would appreciate guidance on
disk-budgeting for long attractor runs.
Reproduction
Any pipeline with a node that gets revisited more than 10 times should
show it (e.g. an implement → verify → postmortem → implement loop in an
attractor run that doesn't converge quickly). The growth becomes
catastrophic around visit 12-15.
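The growth pattern can also be imitated without kilroy. The Python sketch below is a synthetic model, not a reproduction of kilroy's behaviour: it assumes each visit tars the whole node directory, earlier stage.tgz files included, and uses random bytes as a stand-in for logs. All names are hypothetical.

```python
import os
import tarfile
from pathlib import Path


def simulate_cumulative_archives(work_dir, visits=8, payload_bytes=4096):
    """Create visit_N/stage.tgz files that each embed the whole node directory.

    Returns the on-disk size of each visit's archive, in visit order.
    """
    work = Path(work_dir)
    node = work / "node"
    sizes = []
    for n in range(1, visits + 1):
        visit = node / f"visit_{n}"
        visit.mkdir(parents=True)
        # Random bytes are incompressible, much like an already-gzipped file.
        (visit / "response.md").write_bytes(os.urandom(payload_bytes))
        # Build the archive outside the node directory, then move it in, so
        # it never tries to include itself.
        tmp = work / "stage.tmp"
        with tarfile.open(str(tmp), "w:gz") as tar:
            for child in sorted(node.iterdir()):
                tar.add(str(child), arcname=child.name)
        final = visit / "stage.tgz"
        os.replace(tmp, final)
        sizes.append(final.stat().st_size)
    return sizes
```

Run against an empty scratch directory, the returned sizes should roughly double from one visit to the next after the first few visits.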