Suggestion: measure multi-session coherence, not just single-session endurance

## Context

The Time Horizon benchmark is excellent: it is the clearest visualization of capability growth across model generations. But it implicitly measures a single-session ceiling, that is, how long the model can sustain coherent work in one continuous run.

Real-world long-horizon tasks don't run in one session. They span days or weeks across dozens of sessions — resuming, recalling prior decisions, building on past reasoning. The effective time horizon for these tasks depends not just on model capability but on **memory architecture**: can the agent recall what it decided, what it rejected, and why?

## The gap

Current models hit context rot well before their advertised limits. [Chroma's study](https://research.trychroma.com/evaluating-chunking) of 18 leading LLMs shows performance degrading — not gradually but catastrophically — as context grows. Models advertising 200K tokens become unreliable around 130K. So even within a single session, the effective horizon is shorter than the benchmark suggests.

Across sessions, the gap widens further. `/compact` (context summarization) flattens disputes into neutral narration and erases causal chains. `/clear` wipes everything. The next session re-derives decisions the previous session had already settled. Without persistent structured memory, multi-session time horizon is effectively **1× single-session** — you restart each time.

## What we're seeing with structured memory

We've been experimenting with persistent belief graphs as an external memory layer for Claude Code ([Bella](https://github.com/immartian/bellamem), based on [Recursive Emergence](https://github.com/Recursive-Emergence/RE/blob/main/thesis.md)). Beliefs are stored as nodes in a hypergraph with causal (⇒) and dispute (⊥) edges. Each belief carries a Bayesian mass score that is reinforced on re-observation and decays without use.
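To make the idea concrete, here is a minimal sketch of a belief store with typed edges and a mass score that reinforces and decays. It is not Bella's actual API or data model: the names `Belief`, `BeliefGraph`, `observe`, `link`, and `mass` are hypothetical, edges are plain pairs rather than true hyperedges, and the log-odds update with a half-life decay is just one plausible reading of "reinforce on re-observation and decay without use".

```python
import math
from dataclasses import dataclass, field

EDGE_KINDS = ("causes", "disputes")  # stand-ins for the ⇒ and ⊥ edges


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class Belief:
    text: str
    log_odds: float = 0.0  # 0.0 corresponds to a mass of 0.5 (undecided)
    last_seen: int = 0     # index of the session that last observed it


@dataclass
class BeliefGraph:
    half_life: float = 10.0  # assumed: sessions until stored evidence halves
    beliefs: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)

    def _decayed(self, belief: Belief, session: int) -> float:
        # Evidence shrinks toward 0 (mass 0.5) the longer a belief goes unused.
        elapsed = max(0, session - belief.last_seen)
        return belief.log_odds * 0.5 ** (elapsed / self.half_life)

    def observe(self, key: str, text: str, session: int, evidence: float = 1.0) -> Belief:
        # Re-observation first applies the pending decay, then adds evidence.
        belief = self.beliefs.setdefault(key, Belief(text, last_seen=session))
        belief.log_odds = self._decayed(belief, session) + evidence
        belief.last_seen = max(belief.last_seen, session)
        return belief

    def link(self, src: str, kind: str, dst: str) -> None:
        if kind not in EDGE_KINDS:
            raise ValueError(f"unknown edge kind: {kind}")
        if src not in self.beliefs or dst not in self.beliefs:
            raise KeyError("both endpoints must be known beliefs")
        self.edges.add((src, kind, dst))

    def mass(self, key: str, session: int) -> float:
        # Mass as seen from a given session, including decay since last use.
        return sigmoid(self._decayed(self.beliefs[key], session))


# Hypothetical usage: one decision observed twice, one rejected alternative.
g = BeliefGraph(half_life=2.0)
g.observe("use-sqlite", "Use SQLite for the cache", session=0)
g.observe("use-redis", "Use Redis for the cache", session=0)
g.link("use-sqlite", "disputes", "use-redis")
g.observe("use-sqlite", "Use SQLite for the cache", session=0)
```

In this sketch a resumed session could query `g.mass(key, session)` together with the `disputes` edges to see which approach won and which was rejected, rather than re-deriving the decision.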

From several weeks of daily dogfooding on real engineering projects: structured belief persistence extends the effective sustained horizon by roughly **8–20×** compared to stateless sessions. Sessions resume with disputes remembered, causal chains intact, rejected approaches blocked from resurfacing. The per-session ceiling (e.g., Opus 4.6's 12 hours on your chart) becomes a per-session *contribution* to an indefinite sustained horizon, rather than the horizon itself.

## Suggestion

Consider adding a **multi-session coherence** axis to the benchmark:

1. **Task setup:** Tasks that require resuming after a context reset (simulating compaction or session boundary). The agent must recall prior decisions and maintain consistency.
2. **Metric:** Success rate on tasks where critical information was established in session N and must be applied in session N+1, with only the agent's chosen persistence mechanism bridging the gap.
3. **Baseline comparison:** Stateless (no memory), flat-file memory (CLAUDE.md/MEMORY.md), and structured memory (knowledge graphs, belief graphs).
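The three points above can be sketched as a small harness. This is an illustration under assumptions, not a proposal for the benchmark's actual interface: `Task`, `Stateless`, `FlatFile`, `EchoAgent`, and `coherence_rate` are hypothetical names, the agent is a toy that stands in for a real model call, and success is judged by a simple substring match.

```python
from dataclasses import dataclass


@dataclass
class Task:
    setup: str     # information established in session N
    question: str  # asked in session N+1, after the context reset
    expected: str  # string that a correct answer must contain


class Stateless:
    """Baseline: nothing survives the session boundary."""
    def save(self, note: str) -> None:
        pass

    def load(self) -> str:
        return ""


class FlatFile:
    """Baseline: notes appended to one flat text blob (CLAUDE.md style)."""
    def __init__(self) -> None:
        self._notes = []

    def save(self, note: str) -> None:
        self._notes.append(note)

    def load(self) -> str:
        return "\n".join(self._notes)


class EchoAgent:
    """Toy agent standing in for a model; it only relays what it is given."""
    def work(self, prompt: str) -> str:
        return prompt    # session N: decide what to persist

    def resume(self, question: str, recalled: str) -> str:
        return recalled  # session N+1: answer from recalled memory only


def coherence_rate(agent, make_memory, tasks) -> float:
    """Fraction of tasks where session N+1 applies what session N established."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = 0
    for task in tasks:
        memory = make_memory()               # fresh store per task
        memory.save(agent.work(task.setup))  # session N
        # Context reset: only memory.load() crosses the boundary.
        answer = agent.resume(task.question, memory.load())
        passed += task.expected in answer
    return passed / len(tasks)


tasks = [
    Task("Decision: retries are capped at 3.", "What is the retry cap?", "capped at 3"),
    Task("Rejected: polling. Chosen: webhooks.", "Which approach was chosen?", "webhooks"),
]
stateless_rate = coherence_rate(EchoAgent(), Stateless, tasks)
flat_file_rate = coherence_rate(EchoAgent(), FlatFile, tasks)
```

A structured-memory baseline would plug in as a third class with the same `save`/`load` shape, which keeps the comparison focused on the persistence mechanism rather than the model.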

This would separate two capabilities your current benchmark conflates: **sustained reasoning within context** (what you measure now) and **coherent reasoning across context boundaries** (what determines real-world time horizon for multi-day projects).

The single-session chart is already the reference benchmark for model capability. A multi-session axis would become the reference benchmark for agent architecture — and that's where the binding constraint is moving as per-session capability scales.
