Suggestion: measure multi-session coherence, not just single-session endurance #40

@immartian

Context

The Time Horizon benchmark is excellent — the clearest visualization of capability growth over model generations. But it implicitly measures a single-session ceiling: how long can the model sustain coherent work in one continuous run.

Real-world long-horizon tasks don't run in one session. They span days or weeks across dozens of sessions — resuming, recalling prior decisions, building on past reasoning. The effective time horizon for these tasks depends not just on model capability but on memory architecture: can the agent recall what it decided, what it rejected, and why?

The gap

Current models hit context rot well before their advertised limits. Chroma's study of 18 leading LLMs shows performance degrading — not gradually but catastrophically — as context grows. Models advertising 200K tokens become unreliable around 130K. So even within a single session, the effective horizon is shorter than the benchmark suggests.

Across sessions, the gap widens further. /compact (context summarization) flattens disputes into neutral narration and erases causal chains. /clear wipes everything. The next session re-derives decisions the previous session had already settled. Without persistent structured memory, multi-session time horizon is effectively 1× single-session — you restart each time.

What we're seeing with structured memory

We've been experimenting with persistent belief graphs as an external memory layer for Claude Code (Bella, based on Recursive Emergence). Beliefs are stored as nodes in a hypergraph with causal (⇒) and dispute (⊥) edges, and each belief carries a Bayesian mass score that is reinforced on re-observation and decays without use.
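
To make the data structure concrete, here is a minimal sketch of a belief store with reinforcement and decay. It is not Bella's implementation: the class names, the odds-based update, the exponential half-life decay, and the default parameter values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Belief:
    text: str
    mass: float      # score in (0, 1); higher means more trusted
    last_seen: int   # index of the session that last observed it


@dataclass
class BeliefGraph:
    # Both parameters are assumed values, not taken from any real system.
    half_life: float = 10.0         # sessions until an unused belief's mass halves
    likelihood_ratio: float = 2.0   # evidence strength of one re-observation
    beliefs: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (kind, sources, target)

    def current_mass(self, key, session):
        """Stored mass, decayed by the number of sessions since last use."""
        b = self.beliefs[key]
        elapsed = max(0, session - b.last_seen)
        return b.mass * 0.5 ** (elapsed / self.half_life)

    def observe(self, key, text, session, prior=0.5):
        """Reinforce a belief with a Bayesian odds update."""
        mass = self.current_mass(key, session) if key in self.beliefs else prior
        mass = min(mass, 1.0 - 1e-9)  # keep the odds finite
        odds = mass / (1.0 - mass) * self.likelihood_ratio
        self.beliefs[key] = Belief(text, odds / (1.0 + odds), session)

    def link(self, kind, sources, target):
        """Add a hyperedge: one or more source beliefs pointing at a target."""
        if kind not in ("causes", "disputes"):
            raise ValueError(f"unknown edge kind: {kind}")
        self.edges.append((kind, frozenset(sources), target))

    def disputes_of(self, key):
        """Dispute edges aimed at a belief, so a later session can see them."""
        return [e for e in self.edges if e[0] == "disputes" and e[2] == key]
```

In this sketch a session would call `observe` whenever a decision is restated, `link("disputes", ...)` when an alternative is rejected, and `current_mass` on resume to rank which beliefs to load back into context.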

From several weeks of daily dogfooding on real engineering projects: structured belief persistence extends the effective sustained horizon by roughly 8–20× compared to stateless sessions. Sessions resume with disputes remembered, causal chains intact, and rejected approaches blocked from resurfacing. The per-session ceiling (e.g., Opus 4.6's 12 hours on your chart) becomes a per-session contribution to an indefinite sustained horizon, rather than the horizon itself.

Suggestion

Consider adding a multi-session coherence axis to the benchmark:

  1. Task setup: Tasks that require resuming after a context reset (simulating compaction or a session boundary). The agent must recall prior decisions and maintain consistency.
  2. Metric: Success rate on tasks where critical information was established in session N and must be applied in session N+1, with only the agent's chosen persistence mechanism bridging the gap.
  3. Baseline comparison: Stateless (no memory), flat-file memory (CLAUDE.md/MEMORY.md), and structured memory (knowledge graphs, belief graphs).
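
The metric in the list above could be computed along these lines. This is only a sketch: the `Trial` record, its field names, and the condition labels are hypothetical, and how `success` is judged for a given task is left to the benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    task_id: str
    condition: str   # e.g. "stateless", "flat_file", "structured" (assumed labels)
    success: bool    # did session N+1 correctly apply what session N established?


def coherence_by_condition(trials):
    """Cross-session success rate for each persistence condition."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for t in trials:
        totals[t.condition] += 1
        wins[t.condition] += int(t.success)
    return {condition: wins[condition] / totals[condition] for condition in totals}
```

Running every task under each condition and comparing the resulting rates would show how much of the cross-session success comes from the memory layer rather than from the model.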

This would separate two capabilities your current benchmark conflates: sustained reasoning within context (what you measure now) and coherent reasoning across context boundaries (what determines real-world time horizon for multi-day projects).

The single-session chart is already the reference benchmark for model capability. A multi-session axis would become the reference benchmark for agent architecture — and that's where the binding constraint is moving as per-session capability scales.
