Memory allocation must be structural, not usage-driven.
Papers • Honest Scope • Quick Start • Key Results • Mechanism • What Broke • Roadmap • Citation
Every deployed agent-memory system allocates its bounded budget by recency or current utility — LRU context paging, usage-weighted compression, learned relevance selectors. NicheMem identifies a structural failure of this entire family: in long-horizon deployments where task families recur after dormancy (quarterly reports, annual audits, rare incident runbooks), the memory most worth protecting is precisely the memory emitting no usage signal. Usage-driven policies erase dormant-family competence and pay a re-acquisition cost at every reactivation.
Core thesis: retention of dormant-family skills tracks a single property — whether what survives the budget is decided by the usage stream or by a usage-independent ownership structure. NicheMem reaches the ownership structure emergently: memory modules acquire task families through quality-weighted winner-take-all competition; converged ownership is pinned; eviction pressure is confined within niches; and each module's skill footprint is a structural floor that budget rebalancing may not squeeze.
This is the sibling instantiation of the GAUSE reward-independence principle, transported from learner populations (weight space) into agent memory (context space) — with three mapped substrate differences that are the most novel findings here.
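The failure mode described above can be reproduced in a few lines. The following is a toy sketch, not code from this repo: `run_lru`, `run_owned`, the family names, and the slot counts are all hypothetical, and skills are modeled as `(family, index)` tuples occupying one slot each. Both stores get the same total budget of six slots.

```python
from collections import OrderedDict

def run_lru(stream, budget):
    """Global LRU: one shared store, eviction driven purely by the usage stream."""
    store = OrderedDict()
    for skill in stream:
        if skill in store:
            store.move_to_end(skill)          # a use refreshes recency
        else:
            store[skill] = True
            if len(store) > budget:
                store.popitem(last=False)     # evict the least recently used entry
    return set(store)

def run_owned(stream, floors):
    """Per-family stores with fixed slot floors: eviction never crosses families."""
    stores = {family: OrderedDict() for family in floors}
    for skill in stream:
        family = skill[0]
        store = stores[family]
        if skill in store:
            store.move_to_end(skill)
        else:
            store[skill] = True
            if len(store) > floors[family]:
                store.popitem(last=False)     # pressure stays inside the niche
    return set().union(*(set(s) for s in stores.values()))

# Hypothetical stream: 3 "audit" skills are learned, then the family goes dormant
# while 60 "daily" tasks cycle through 6 other skills.
audit = [("audit", i) for i in range(3)]
daily = [("daily", i % 6) for i in range(60)]
stream = audit + daily

lru_kept = run_lru(stream, budget=6)
owned_kept = run_owned(stream, floors={"audit": 3, "daily": 3})

print(sum(s in lru_kept for s in audit))    # 0: every dormant skill was evicted
print(sum(s in owned_kept for s in audit))  # 3: the dormant footprint is intact
```

In this toy setting the owned store pays for retention with more churn inside the active family, since only three slots serve six daily skills. That is the kind of trade-off a break-even analysis has to weigh.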
All evidence in this repo is mechanism-level. CycleBench-Sim replaces the frozen LLM with a calibrated agent abstraction (per-rule success is a function of whether the needed skill entries survive in retrieved context, and at what fidelity). This isolates who owns budget and what survives churn — the allocation mechanism — exactly as tabular experiments isolate capacity allocation in GAUSE.
- ✅ What is tested: memory allocation dynamics, retention laws, the mechanism, the theory.
- ❌ What is not tested: end-to-end LLM behavior. No LLM experiment has been run.
- 🔭 Every LLM-level statement in the papers is a stated, falsifiable prediction for the future LLM tier; the papers are written to serve as that tier's pre-registration. See paper/Next Steps and Review.tex for the cost-and-priority plan (~$300 buys the decisive gate experiment).
git clone https://github.com/HowardLiYH/NicheMem.git
cd NicheMem
pip install numpy scipy matplotlib
python3 experiments/run_all.py # full suite E1–E10 → results/*.json (~15 min, CPU)
python3 experiments/verify_theory.py # numerical checks of every proposition
python3 experiments/make_figures.py # publication figures → paper/figures/
cd paper && latexmk -pdf main.tex # + "NicheMem Explainer.tex", "Deep Dive.tex"

Every number in the papers is generated from results/*.json, and audited against them.
The dissociation (E1, 24 seeds, matched budget, Holm-adjusted p ≤ 1.2×10⁻³⁰):
| Policy class | Post-reactivation | Active-family | Verdict |
|---|---|---|---|
| Usage-driven (LRU / usage-compress / EMO / learned utility) | 0.35 – 0.41 | 0.91 – 0.93 | forgets dormant families |
| Unbounded RAG ("store everything") | 0.84 → 0.37 at long horizons | 0.85 | retrieval decays into de-facto forgetting |
| Reward-independent (quotas / NicheMem / oracle) | 0.92 – 0.94 | 0.93 – 0.94 | retains exactly |
- Coverage is a commodity; retention is not. Active-family success is substantively equivalent across every bounded arm — the dissociation is dormancy-specific.
- Retention tracks the class, not the policy. The four usage-driven heuristics are statistically indistinguishable from one another. What the score measures (recency, frequency, utility, relevance) is irrelevant; that it is a function of the usage stream is decisive.
- NicheMem recovers 97% of a privileged hand-built oracle (0.915 vs 0.944) with no labels, no quotas, no trained router, at 1.8× evaluation calls concentrated in an organization phase touching ~7% of the stream.
- Three retention-curve shapes, each matching its derived law: step (eviction; collapse within ~2 epochs of dormancy — there is no grace period), geometric (score decay), graded-then-cliff with near-miss failures (lossy summarization: the agent half-remembers and fails plausibly).
- Verified theory: LRU survival bound, fidelity-decay law, exactness of retention under ownership + skill floors, cold-start acquisition chain (analytic 2.620 vs Monte-Carlo 2.615), and a break-even condition that retention satisfies 14:1 at the operating point but that fails below a measurable budget (NicheMem honestly loses to LRU at B=2,400).
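To make the three retention-curve shapes listed above concrete, here is a minimal sketch of each as a function of dormancy length `t` in epochs. These are illustrations of the shapes only, not the derived laws from the papers: the function names and every parameter value (`collapse_epoch`, `decay`, `loss_per_pass`, `usable_floor`) are assumptions chosen for readability.

```python
def step_retention(t, collapse_epoch=2):
    """Eviction: the skill is either fully present or gone."""
    return 1.0 if t < collapse_epoch else 0.0

def geometric_retention(t, decay=0.8):
    """Score decay: retention shrinks by a constant factor per dormant epoch."""
    return decay ** t

def graded_then_cliff(t, loss_per_pass=0.15, usable_floor=0.5):
    """Lossy summarization: fidelity erodes gradually, then becomes unusable.

    While fidelity is above the floor the skill is half-remembered (the
    near-miss regime); below the floor it contributes nothing.
    """
    fidelity = max(0.0, 1.0 - loss_per_pass * t)
    return fidelity if fidelity >= usable_floor else 0.0

# Print the three curves side by side for the first six dormant epochs.
for t in range(6):
    print(t,
          step_retention(t),
          round(geometric_retention(t), 3),
          round(graded_then_cliff(t), 3))
```

The step curve has no intermediate values at all, which is what "no grace period" looks like in this toy form, while the graded-then-cliff curve passes through partial fidelity before dropping to zero.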
- Compete. Per task, memory modules tournament: each retrieves from its own store; the module whose context yields the best verifiable outcome wins, ingests the distilled skill, and applies a quality-weighted exponentiated-gradient affinity update (a zero-quality win moves nothing — the cold-start floor).
- Pin. At affinity > 0.9 (~8 quality wins), ownership locks: routing becomes a table lookup; the owner stops contesting other families (competitive exclusion).
- Idle ≠ Forget. A dormant family's owner receives no tasks → no ingests, no reclaims. Its skill footprint is a structural floor invisible to budget rebalancing; only episodic working memory flexes with need. Retention is exact — an identity, not a bound.
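The Compete and Pin steps above can be sketched as follows. This is a hedged illustration rather than the repo's implementation: `tournament_winner`, `eg_update`, and `wins_until_pinned` are hypothetical names, and the learning rate `ETA` is an assumption picked so that, with four modules and quality 1.0, the 0.9 threshold is crossed on the eighth win, in line with the "~8 quality wins" stated in the text. Only the 0.9 threshold itself comes from the source.

```python
import math

PIN_THRESHOLD = 0.9   # from the text: ownership locks at affinity > 0.9
ETA = 0.45            # assumed learning rate, chosen for illustration only

def tournament_winner(outcomes):
    """Index of the module whose retrieved context gave the best verified outcome."""
    return max(range(len(outcomes)), key=lambda m: outcomes[m])

def eg_update(affinity, winner, quality, eta=ETA):
    """Quality-weighted exponentiated-gradient step on one family's affinities.

    The winner's weight is multiplied by exp(eta * quality) and the vector is
    renormalized, so a zero-quality win leaves every affinity unchanged.
    """
    scaled = [a * math.exp(eta * quality) if m == winner else a
              for m, a in enumerate(affinity)]
    total = sum(scaled)
    return [s / total for s in scaled]

def wins_until_pinned(n_modules, quality, max_wins=1000):
    """Count consecutive wins by module 0 until its affinity exceeds the threshold."""
    affinity = [1.0 / n_modules] * n_modules
    for wins in range(1, max_wins + 1):
        affinity = eg_update(affinity, winner=0, quality=quality)
        if affinity[0] > PIN_THRESHOLD:
            return wins
    return None  # never pinned within max_wins (e.g. quality == 0)

print(tournament_winner([0.2, 0.9, 0.4]))            # 1
print(eg_update([0.25] * 4, winner=0, quality=0.0))  # [0.25, 0.25, 0.25, 0.25]
print(wins_until_pinned(n_modules=4, quality=1.0))   # 8
```

Once a family is pinned, routing for that family no longer needs the tournament, so a real implementation would presumably consult a family-to-owner table before calling anything like `tournament_winner`.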
Ten predictions were pre-registered before the first run; four were refuted and are reported as findings (full outcome table in main.pdf, Appendix B):
| # | We predicted | The data said |
|---|---|---|
| 1 | Smarter usage heuristics retain more | No — the class decides; the heuristic doesn't |
| 2 | NicheMem is robust to cluster noise | No — label noise defeats every label-keyed store, pinned NicheMem included; robustness lives in per-task tournament selection at 12× cost |
| 3 | A staleness trigger rescues within-family drift (as in GAUSE) | No-op — memory stores self-repair through distillation; the weight-space remedy doesn't transfer |
| 4 | Surplus modules idle harmlessly (as GAUSE proves for learners) | No — competence is acquired through winning, so surplus competitors duplicate acquisition and waste budget under tight B |
Plus two documented self-inflicted failures kept in the record because each is a design theorem:
- The budget side door: our first "budget tracks need" allocator silently starved dormant niches — a reward-chaser smuggled in through budget rebalancing (cost 15 retention points before diagnosis). Repair: skill floors are structural.
- Retrieval–refresh coupling: when a store cell exceeds the retrieval top-k, un-retrieved skills age invisibly and LRU evicts them during active periods — it dents even the oracle.
And the drift inversion: at full rule replacement, the most retentive arm becomes the worst (protected wrongness crowds working memory) — retention's value is bounded by within-family stability, crossover ≈ ⅔ replacement.
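The budget side door and its repair, described in the list above, can be illustrated with two small allocators. This is a sketch under stated assumptions: the function names, module names, token counts, and the proportional-share rule are all hypothetical and are not taken from the repo. The first allocator splits the whole budget by recent demand; the second reserves each module's skill footprint first and lets only the remainder follow demand.

```python
def need_tracking_allocation(budget, recent_tasks):
    """Flawed variant: the entire budget follows recent task share."""
    total = sum(recent_tasks.values())
    if total == 0:
        return {m: budget / len(recent_tasks) for m in recent_tasks}
    return {m: budget * n / total for m, n in recent_tasks.items()}

def floored_allocation(budget, recent_tasks, skill_footprint):
    """Repaired variant: skill footprints are reserved before any rebalancing."""
    reserved = sum(skill_footprint.values())
    if reserved > budget:
        raise ValueError("skill footprints exceed the total budget")
    flexible = budget - reserved           # only working memory flexes with need
    total = sum(recent_tasks.values())
    allocation = {}
    for module, n in recent_tasks.items():
        share = flexible * n / total if total else flexible / len(recent_tasks)
        allocation[module] = skill_footprint[module] + share
    return allocation

# Hypothetical deployment: "reports" is dormant, "tickets" is busy.
recent_tasks = {"reports": 0, "tickets": 40}
skill_footprint = {"reports": 300, "tickets": 300}

print(need_tracking_allocation(1000, recent_tasks))
# {'reports': 0.0, 'tickets': 1000.0}  -> the dormant niche is starved
print(floored_allocation(1000, recent_tasks, skill_footprint))
# {'reports': 300.0, 'tickets': 700.0} -> the dormant footprint survives
```

The first allocator never evicts anything explicitly, yet a dormant module ends up with zero tokens, which is how a usage signal can re-enter through rebalancing even when eviction itself is ownership-aware.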
| Document | Pages | Role |
|---|---|---|
| paper/main.tex | 17 | The research paper (NeurIPS style): propositions + proofs, all experiments, pre-registration outcomes |
| paper/NicheMem Explainer.tex | 19 | Architecture & mechanism companion: intuition, worked demos, applications map, deployment recipe, when not to use it |
| paper/Deep Dive.tex | 20 | Mathematical foundations: derivations, worked numeric examples verified from code, verbatim load-bearing code, failure post-mortems, auto-generated reference tables |
| paper/Next Steps and Review.tex | 4 | What's deferred (LLM tiers), cost estimates, and an honest self-review |
Repository map: src/cyclebench/ (simulator) • experiments/ (suite, theory checks, figures) • results/ (canonical JSONs) • PLAN.md, PREREGISTRATION.md, THEORY.md (process record).
The decisive missing experiment is cheap: a Tier-1-LLM mini gate (~$60–300 self-hosted; 4 policies × 5 seeds on a small frozen model with real procedurally-generated tool quirks) answers whether the class dissociation survives a real model. Then: learned family inference in the loop (<$30), Tier-2 (ALFWorld/GAIA-style pools under cyclic schedules, $500–2,500), and the Tier-3 longitudinal "runbook survives 800 tasks of dormancy" hero run. Full plan with gates and costs: paper/Next Steps and Review.tex.
Pre-registered prediction for that tier: usage-driven stores will lose dormant-family competence at rates governed by their reclaim channel; structurally-owned stores will not; and the gap will shrink as zero-shot competence rises.
Both projects instantiate one principle in different substrates — capacity protection must not depend on the signal whose absence defines the thing being protected:
| | GAUSE (learners) | NicheMem (agent memory) |
|---|---|---|
| Capacity | learner calibration (weights) | token budget (context) |
| Regimes | market/environment states | task families |
| Forgetting cost | post-reactivation error | post-reactivation task failure |
| What did not transfer | — | drift trigger (stores self-repair) • surplus harmlessness (acquisition coupling) • single-allocator assumption (budget side door) |

    @techreport{li2026nichemem,
      title       = {Idle but Not Forgotten: Reward-Independent Memory Ownership
                     for Long-Horizon Agents --- A Mechanism-Level Study with CycleBench-Sim},
      author      = {Li, Yuhao},
      institution = {University of Pennsylvania},
      year        = {2026},
      note        = {Mechanism-level evidence; LLM-tier predictions pre-registered}
    }
