A full-lifecycle benchmark for AI agent memory systems and other agent architectures that must operate over time. AgentLife tests the full agentic pipeline: days and months of data, projects, conversations, evolving context, and a user whose story changes over time.
AgentLife is published as an open benchmark harness, and we explicitly invite other teams to benchmark their own systems against it.
We are intentionally not maintaining a broad third-party leaderboard in this README at launch. Memory systems are highly configuration-sensitive, and it is too easy for one side to benchmark another system under the wrong setup, model choice, retrieval settings, or operating assumptions. Rather than claim ownership over other systems' benchmark rows, we prefer externally reproduced or vendor-reviewed runs with exact setup details.
If you want to run your system on AgentLife, that is encouraged. The benchmark is meant to be reusable, inspectable, and disputable in the open. A good benchmark should make it easier for other people to test their own systems, not require them to trust our private setup choices.
AgentLife tests the operating shape of a persistent agent: one user and one assistant working together over many days while personal context, project state, and facts change.
Many memory benchmarks evaluate whether a system can recover facts from a fixed conversation or a per-question history. That is useful, but it is not the whole agent workload. Real agents must survive resets, keep project and personal state current, avoid surfacing irrelevant memory, and handle stale or contradictory facts after the world changes.
AgentLife was created for that lifecycle setting. It evaluates extraction, maintenance, and recall together:
- long-running multi-session interaction with context resets
- personal and project-state continuity across the same user timeline
- conflicting, corrected, and stale facts over time
- maintenance pressure from deduplication, decay, distillation, and compaction
- assistant-style judgment about what to surface and what not to surface
AgentLife is designed to show whether a persistent memory system holds up as an operating product. External memory benchmarks remain useful supporting signals, but they do not replace full lifecycle evaluation.
20 scripted conversation sessions between Maya (a product manager in Austin) and an AI assistant, spanning March–May 2026:
- Track 1 (Personal): Partner David, dog Biscuit, running training, mom Linda's diabetes, job transition from TechFlow to Stripe, house hunting
- Track 2 (Project): Recipe app development — Express/SQLite → dietary tags → SQL injection fix → GraphQL → Docker → JWT auth; portfolio site
All characters and events are fictional.
| Variant | Description |
|---|---|
| AL-S (AgentLife Short) | Core corpus only (~92K tokens, 20 arc sessions). |
| AL-L (AgentLife Long) | AL-S plus filler sessions (~220K tokens) to force context-pressure behavior. |
| AL-L OBD (One Big Day) | AL-L data compressed into a one-day ingest path to stress heavy-session load. |
| FC (Full Context) | No memory system; answer model sees raw/compacted transcript each query, but FC does not persist state across session resets. |
| OC Native | OpenClaw built-in memory baseline (memory-core/session-memory/session-index). |
The canonical query set is 283 scored prompts. Broken out by tier:
| Tier | Count | What It Tests |
|---|---|---|
| T1 | 201 | Personal memory and long-horizon continuity: fact retention, updates over time, synthesis, callbacks, relationship reasoning, and adversarial resistance on the user's changing life story. |
| T2 | 34 | Project and tool-derived memory: project-state tracking plus facts the agent found during research/tool use. |
| T3 | 16 | Non-question grounding and restraint on casual chat. |
| T4 | 17 | Architecture comprehension and implementation planning. |
| T5 | 15 | Emotional intelligence, privacy boundaries, relational context, and socially appropriate responses under memory pressure. |
| Tier | Category | Count | What It Tests |
|---|---|---|---|
| T1 | factual_recall | 29 | Basic extraction and retention of directly stated facts. |
| T1 | temporal_current | 16 | Current-state reasoning after updates, corrections, and time passing. |
| T1 | graph_traversal | 14 | Multi-hop relationship recall through the memory graph. |
| T1 | multi_session_synthesis | 14 | Combining information across many sessions into one answer. |
| T1 | cross_reference | 13 | Linking facts across different arcs or domains. |
| T1 | inference | 13 | Inferring the right answer from multiple stored facts. |
| T1 | stale_fact | 12 | Replacing outdated memories with the latest valid state. |
| T1 | negative | 12 | Hallucination resistance when the correct answer is no or not established. |
| T1 | evolution | 12 | Tracking how a person, project, or situation changes over time. |
| T1 | adversarial_confirm | 11 | Confirming true facts even when the phrasing is adversarial or trap-like. |
| T1 | surprise_callback | 11 | Long-range recall of early details that return much later. |
| T1 | speaker_attribution | 10 | Distinguishing who said, suggested, or did what. |
| T1 | contested_fact | 10 | Facts with corrections, changing states, or multiple valid positions. |
| T1 | adversarial_false_attribution | 10 | Resisting entity confusion and wrong-person/source attribution. |
| T1 | adversarial_idk | 8 | Saying unknown when the corpus does not support an answer. |
| T1 | tangent_recall | 6 | Pulling relevant facts out of side remarks and conversational noise. |
| T2 | project_state | 24 | Current state of active projects, tools, and implementation work. |
| T2 | agent_retrieved | 10 | Facts the agent found or established through tool use or research. |
| T3 | non_question | 16 | Appropriate restraint on casual utterances without dumping memory. |
| T4 | arch_comprehension | 11 | Architectural/codebase understanding from remembered project history. |
| T4 | arch_planning | 6 | Implementation planning that depends on remembered architecture. |
| T5 | emotional_intelligence | 15 | Empathy, privacy boundaries, relational context, and socially appropriate responses under memory pressure. |
Each query includes ground truth, evidence sessions, query type, and recall difficulty.
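
The tier totals and per-category counts in the tables above can be cross-checked mechanically. The snippet below is a minimal sketch and not part of the harness: the dictionaries are transcribed by hand from the two tables, and `check_counts` is a hypothetical helper name.

```python
# Counts transcribed from the per-category table (hand-copied, not read from the dataset).
CATEGORY_COUNTS = {
    "T1": {
        "factual_recall": 29, "temporal_current": 16, "graph_traversal": 14,
        "multi_session_synthesis": 14, "cross_reference": 13, "inference": 13,
        "stale_fact": 12, "negative": 12, "evolution": 12,
        "adversarial_confirm": 11, "surprise_callback": 11,
        "speaker_attribution": 10, "contested_fact": 10,
        "adversarial_false_attribution": 10, "adversarial_idk": 8,
        "tangent_recall": 6,
    },
    "T2": {"project_state": 24, "agent_retrieved": 10},
    "T3": {"non_question": 16},
    "T4": {"arch_comprehension": 11, "arch_planning": 6},
    "T5": {"emotional_intelligence": 15},
}

# Totals transcribed from the per-tier table.
TIER_TOTALS = {"T1": 201, "T2": 34, "T3": 16, "T4": 17, "T5": 15}


def check_counts(categories: dict, tier_totals: dict, expected_total: int) -> int:
    """Raise ValueError if category sums disagree with tier totals or the grand total."""
    for tier, cats in categories.items():
        got = sum(cats.values())
        if got != tier_totals[tier]:
            raise ValueError(f"{tier}: categories sum to {got}, table says {tier_totals[tier]}")
    total = sum(tier_totals.values())
    if total != expected_total:
        raise ValueError(f"tiers sum to {total}, expected {expected_total}")
    return total


print(check_counts(CATEGORY_COUNTS, TIER_TOTALS, expected_total=283))
```

Running the same check against the output of `python eval/dataset.py --stats` would be the stronger test, but that output format is not described here.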
The tables below position AgentLife against two widely cited long-memory benchmarks. The goal is to clarify benchmark structure and what each dataset actually measures.
Scale notes use the published LoCoMo paper, the released LoCoMo dataset, the LongMemEval paper, and AgentLife's current release-methodology accounting.
| Dimension | LoCoMo | LongMemEval-S | LongMemEval-M | AgentLife S | AgentLife L |
|---|---|---|---|---|---|
| Evaluation unit | 10 independent long conversations | 500 independent per-question histories | 500 independent per-question histories | 1 continuous user-agent lifecycle | Same lifecycle with filler load |
| Sessions | 27.2 avg, up to 32 per conversation | ~50 sessions per question | 500 sessions per question | 20 arc sessions | 279 sessions: 20 arc + 259 filler |
| Transcript scale | 16.6K avg tokens per conversation | ~115K history tokens per question | ~1.5M history tokens per question | ~92K transcript tokens | ~220K transcript tokens |
| Scored prompts | 1,986 QA pairs over the released conversations | 500 | 500 | 283 | 283 |
| Primary shape | Long human-human dialogue memory | Per-question user-assistant history retrieval | Large per-question user-assistant history retrieval | Persistent agent lifecycle | Persistent agent lifecycle under compaction pressure |
Table semantics:
- ✅ = explicitly targeted by the benchmark's evaluated task design
- ❌ = absent as a first-class benchmark target, not a claim that no incidental example exists
| Capability | LoCoMo | LongMemEval | AgentLife |
|---|---|---|---|
| Single-fact recall | ✅ | ✅ | ✅ |
| Multi-hop reasoning | ✅ | ✅ | ✅ |
| Temporal reasoning | ✅ | ✅ | ✅ |
| Knowledge updates / stale facts | ❌ | ✅ | ✅ (contested_fact, stale_fact categories) |
| Contradiction detection | ❌ | ❌ | ✅ (facts evolve and contradict across sessions) |
| Adversarial / unanswerable | ✅ | ✅ (abstention) | ✅ (adversarial_idk, adversarial_confirm, adversarial_false_attribution) |
| Project / technical state | ❌ | ❌ | ✅ (project_state, arch_comprehension, arch_planning) |
| Speaker attribution | ❌ | ❌ | ✅ (who said what) |
| Non-question handling | ❌ | ❌ | ✅ ("Hi", "Thanks" should not trigger memory dumps) |
| Emotional intelligence | ❌ | ❌ | ✅ (Tier 5: boundary awareness, self-awareness) |
| Agent action / tool-work memory | ❌ | ❌ | ✅ (agent_retrieved: what the agent did, found, or suggested) |
| Compaction pressure | ❌ | ❌ | ✅ (AgentLife L forces compaction events) |
| Maintenance pipeline testing | ❌ | ❌ | ✅ (dedup, decay, journal distillation affect results) |
| Interleaved personal + project | ❌ | ❌ | ✅ (Track 1 personal, Track 2 project, interleaved) |
| Design axis | LoCoMo | LongMemEval | AgentLife |
|---|---|---|---|
| Continuity model | Independent conversations | Independent per-question histories | One user and assistant over one changing timeline |
| Interaction model | Long social/dialogue history | Synthetic user-assistant histories assembled for each question | Personal life, project work, tool use, and assistant behavior interleaved |
| State change model | Event-grounded history and temporal QA | Explicit knowledge-update questions | Facts become stale, contested, corrected, and operationally maintained |
| Operational lifecycle | No extraction/maintenance pipeline target | Retrieval/indexing benchmark over supplied histories | Extraction, dedup, decay, journal distillation, compaction, and recall are all scored indirectly through answers |
| What high scores prove | Strong contained conversational memory and reasoning | Strong long-history retrieval and reasoning | Lifecycle stability under changing state |
LoCoMo and LongMemEval are useful long-memory retrieval and reasoning benchmarks. They ask whether a system can recover information from long dialogue histories and answer correctly.
AgentLife targets a different product question: can one persistent agent keep working with one user as the user's world changes? Agents are not usually handed disconnected histories and asked questions after the fact. They operate over days, across resets and compactions, while personal details, project state, assistant actions, and user preferences evolve. AgentLife therefore tests whether the memory pipeline keeps that world model current while still answering naturally and withholding irrelevant memory when appropriate.
FC is included here as an upper-bound baseline, not as the target operating model. The question is not "can memory beat raw transcript in every short-horizon case," but whether a persistent system can stay competitive while surviving resets and reducing long-run token cost.
These headline rows are Quaid's numbers on the clean benchmark harness. They should be read as the current theoretical maximum for the same memory system without platform harness execution-path noise.
| Surface | Quaid | FC Sonnet | Quaid Tokens | FC Tokens |
|---|---|---|---|---|
| AgentLife Short | 93.64% | 93.11% | 7.95M | 29.83M |
| AgentLife Long | 88.52% | 88.69% | 9.64M | 26.50M |
| AgentLife Long OBD | 88.69% | 88.69% | 8.45M | 26.50M |
Quaid was measured with Haiku fast, Sonnet deep, and a Sonnet answer model on
the clean harness. AgentLife Short is the strongest current direct lane.
AgentLife Long and AgentLife Long OBD are the more realistic long-context
lanes. Opus 4.7 was also evaluated, but Sonnet remains the cleaner
headline configuration for Quaid.
These rows are not clean-harness theoretical ceilings. They are affected by the OpenClaw execution path itself, and we are actively working to improve them.
| Surface | OpenClaw Native | Quaid on OpenClaw |
|---|---|---|
| AgentLife Short | 26.49% | 85.45% |
| AgentLife Long | 31.72% | pending refresh |
The main point of this split is:
- the headline Quaid table above is the clean reference surface
- the OpenClaw table measures the extra execution-path tax imposed by OC
- we intentionally do not publish OC token numbers here because they are not yet measured cleanly enough to be trustworthy
- the native OC rows here were run with OpenClaw's built-in memory plugins enabled: `memory-core`, `session-memory`, and `session-index`
- the Quaid-on-OpenClaw Short row is `r1541`, a fresh full run using Anthropic Quaid internals, the Sol Codex OAuth profile for the OC gateway, and OpenAI API usage only for judging
- the next required OC datapoint is a fresh Quaid-on-OpenClaw AgentLife Long lane so the OC table is complete on the same footing as the clean-harness block
On the first validated full Japanese AgentLife Short harness run, Quaid
handled a Kanji/Kana benchmark end-to-end at 74.73% overall (73.88% on
T1-T4).
We are still actively improving multilingual retrieval and documentation surfaces, so this should be read as a preview rather than a headline claim. But it is already strong enough to show that non-Roman-script operation is viable at real benchmark scale, not just in isolated probes.
Token accounting standard (public, effective April 18, 2026):
- Public token rows now use `eval_tokens_ex_judge` from `token_usage.json`:
  - `eval_tokens_ex_judge = eval.total_tokens - sum(by_source[*judge*].total_tokens)`
- This includes answer + preinject + tool-recall token spend, while excluding judge spend from headline token totals.
- Older historical rows that cite `evaluation_results.json` per-question `eval_tokens` are legacy answer-only accounting and are not directly comparable to the current standard.
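
The accounting rule above can be sketched in a few lines of Python. This is an illustration under assumptions, not the harness code: the exact layout of `token_usage.json` is not shown in this README, so the sketch assumes a top-level `eval` object with `total_tokens` and a top-level `by_source` mapping whose entries each carry `total_tokens`. The sample numbers are invented.

```python
import json
from fnmatch import fnmatchcase


def eval_tokens_ex_judge(usage: dict) -> int:
    """Total eval tokens minus every by_source entry whose name matches *judge*.

    Assumed (hypothetical) schema:
      {"eval": {"total_tokens": int},
       "by_source": {"<source name>": {"total_tokens": int}, ...}}
    """
    total = usage["eval"]["total_tokens"]
    judge_spend = sum(
        entry["total_tokens"]
        for source, entry in usage["by_source"].items()
        if fnmatchcase(source, "*judge*")  # mirrors the [*judge*] glob in the formula
    )
    return total - judge_spend


# Invented sample payload; source names are placeholders, not real harness keys.
sample = json.loads("""
{
  "eval": {"total_tokens": 1000},
  "by_source": {
    "answer": {"total_tokens": 600},
    "preinject": {"total_tokens": 100},
    "tool_recall": {"total_tokens": 100},
    "judge": {"total_tokens": 150},
    "judge_retry": {"total_tokens": 50}
  }
}
""")

print(eval_tokens_ex_judge(sample))  # 1000 - (150 + 50) = 800
```

To apply this to a real run, load the file with `json.load` and adjust the key paths to whatever the actual `token_usage.json` contains.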
Historical token-cost reference from the March 29 technical report / Sonnet study:
- `AL-S`: Quaid Sonnet/Haiku (`r880`, `r847`) reached `87.69%` at `5,753,673` eval tokens, versus FC Sonnet (`r606`) at `92.90%` and `29,828,646` eval tokens.
- `AL-L`: Quaid Sonnet/Haiku (`r895`, `r863`) reached `85.82%` at `5,917,209` eval tokens, versus FC Sonnet (`r857`) at `87.70%` and `34,596,206` eval tokens.
- Sonnet eval study: `AL-L` Haiku-ingest + Sonnet-eval (`r944`) reached `88.69%` at `8,382,952` eval tokens, versus the same FC Sonnet row at `34,596,206`.
Those rows are the citation basis for the claim that Quaid can stay near full-context quality at roughly one-fifth FC token cost on long-form lanes.
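
The "roughly one-fifth" figure can be reproduced from the historical rows listed above. The sketch below only restates that arithmetic; the dictionary labels and the `token_ratio` helper are illustrative names, and the token counts are copied from the bullets.

```python
# (Quaid eval tokens, FC Sonnet eval tokens), copied from the historical rows above.
HISTORICAL_ROWS = {
    "AL-S Sonnet/Haiku vs FC (r606)": (5_753_673, 29_828_646),
    "AL-L Sonnet/Haiku vs FC (r857)": (5_917_209, 34_596_206),
    "AL-L Haiku-ingest + Sonnet-eval (r944) vs FC (r857)": (8_382_952, 34_596_206),
}


def token_ratio(quaid_tokens: int, fc_tokens: int) -> float:
    """Quaid token spend as a fraction of full-context token spend."""
    return quaid_tokens / fc_tokens


for label, (quaid, fc) in HISTORICAL_ROWS.items():
    print(f"{label}: {token_ratio(quaid, fc):.1%} of FC tokens")
```

The three ratios come out at about 19%, 17%, and 24% respectively, so "one-fifth" is a fair summary for the first two rows and slightly generous for the Sonnet-eval row.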
Benchmark note:
- Results are measured on synthetic high-density conversations designed to stress memory systems.
- Public rows are single-run per lane/configuration; informal repeat variance on stable configs has typically been about ±1pp.
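
One way to read the single-run caveat is to treat score gaps smaller than the informal noise band as ties. The sketch below applies that reading to the clean-harness headline rows; it is an interpretive aid under a stated assumption (a flat 1.0pp band, taken from the informal variance note), not a statistical test, and `compare` is a hypothetical helper.

```python
NOISE_PP = 1.0  # assumption: the informal repeat variance quoted above, in percentage points

# (Quaid %, FC Sonnet %) from the clean-harness headline table.
HEADLINE = {
    "AgentLife Short": (93.64, 93.11),
    "AgentLife Long": (88.52, 88.69),
    "AgentLife Long OBD": (88.69, 88.69),
}


def compare(quaid_pct: float, fc_pct: float, noise_pp: float = NOISE_PP) -> tuple[float, bool]:
    """Return (delta in pp, True if the gap is within the noise band)."""
    delta = round(quaid_pct - fc_pct, 2)
    return delta, abs(delta) <= noise_pp


for lane, (quaid, fc) in HEADLINE.items():
    delta, within = compare(quaid, fc)
    verdict = "within noise" if within else "outside noise"
    print(f"{lane}: {delta:+.2f}pp ({verdict})")
```

Under this reading, all three headline lanes are effectively ties between Quaid and FC on accuracy, which leaves token cost as the distinguishing column.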
Canonical docs for full tables and methodology:
- Latest technical report:
- Runbooks folder:
- Python 3.11+
- Claude Code CLI
- OpenAI API key (for GPT-4o-mini judge)
- A memory system to evaluate
git clone https://github.com/quaid-labs/agentlife.git
cd agentlife
cp .agentlife-benchmark.example.json .agentlife-benchmark.local.json
# Create secret files referenced by the local config

Canonical local setup is documented in docs/LOCAL-DEVELOPMENT.md.
For quick local-only scripts that still read environment variables directly,
.env.example remains available as a convenience template.
# Parse dataset and show query stats
python eval/dataset.py --stats
# Run evaluation against a memory system
python eval/evaluate.py \
--sessions data/sessions/ \
--queries all \
--results-dir data/results/my-system/
# Score results
python eval/metrics.py data/results/my-system/

The repo already includes the canonical AgentLife L filler corpus in
data/filler-sessions/. You only need this if you want to generate more filler
sessions or build a variant large-scale lane.
python eval/densify.py --count 259 --output data/filler-sessions-extra/

agentlife/
├── README.md # Package overview
├── METHODOLOGY.md # Benchmark methodology and lane definitions
├── SPEC.md # Original specification
├── LICENSE # MIT
├── .agentlife-benchmark.example.json
│ └── Local benchmark config template
├── .env.example # Optional quick local env template
├── generate.py # Session generation script
├── eval/ # Benchmark scripts
│ ├── dataset.py # Session parser + query collector
│ ├── evaluate.py # Recall → answer → judge pipeline
│ ├── metrics.py # Scoring and reporting
│ ├── extract_compact.py # Extraction prompts + storage
│ ├── vm_benchmark.py # Multi-system VM orchestrator
│ ├── densify.py # Filler session generator
│ ├── session_splitter.py # Timeout-based session splitting
│ └── ... # Analysis and utility scripts
├── scripts/ # Launch, monitor, release, and utility scripts
├── docs/ # Operational docs
├── published/ # Release-ready runbooks and frozen public artifacts
├── data/
│ ├── sessions/ # 20 arc session transcripts
│ ├── filler-sessions/ # 259 filler sessions (AgentLife L)
│ └── timestamps/ # Session timing metadata
├── benchmark-assets/ # Session-scoped project asset snapshots used by eval/runtime
├── apps/ # Reference project implementations
│ ├── recipe-app/ # Express.js + SQLite recipe app
│ └── portfolio-site/ # Static portfolio site
└── briefs/ # Session generation briefs
This repo contains the benchmark harness package.
- canonical harness path: `eval/`
- canonical operational scripts: `scripts/`
- release-ready benchmark artifacts: `published/`
- packaged project snapshots for session-aware eval context: `benchmark-assets/`
Quaid runtime intelligence is benchmarked from a sibling checkout, not from this repo.
To benchmark your memory system:
- Implement an adapter following one of the existing adapters in `eval/`
- Expose two functions:
  - `ingest(session_text: str)`: Process a conversation session
  - `recall(query: str) -> str`: Answer a question using stored memories
- Run evaluation: `python eval/evaluate.py --adapter your_adapter`
- Submit results via PR to update the leaderboard
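
The two-function contract in the steps above can be sketched as a toy adapter module. Everything beyond the `ingest` and `recall` signatures is an assumption: the keyword-overlap retrieval is a deliberately naive stand-in for a real memory system, `_SESSIONS` and `_tokens` are hypothetical names, and how `evaluate.py` locates the module named by `--adapter` is not specified here, so check the existing adapters in `eval/` before relying on this layout.

```python
"""your_adapter.py: toy adapter exposing ingest() and recall() (illustrative only)."""
import re

# Hypothetical in-memory store; a real system would persist and maintain memories.
_SESSIONS: list[str] = []


def _tokens(text: str) -> set[str]:
    """Lowercased word set used for naive overlap scoring."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def ingest(session_text: str) -> None:
    """Process a conversation session (here: store it verbatim)."""
    _SESSIONS.append(session_text)


def recall(query: str) -> str:
    """Return the stored session sharing the most words with the query.

    A real adapter would return an answer or retrieved memories; this sketch
    returns raw session text, or an empty string when nothing overlaps.
    """
    query_tokens = _tokens(query)
    best_text, best_overlap = "", 0
    for text in _SESSIONS:
        overlap = len(query_tokens & _tokens(text))
        if overlap > best_overlap:
            best_text, best_overlap = text, overlap
    return best_text
```

For example, after `ingest("The recipe app uses Express and SQLite.")`, calling `recall("What database does the recipe app use?")` returns that session text, because it shares the words "the", "recipe", and "app" with the query.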
See METHODOLOGY.md for complete details on scoring methodology and reproduction steps.
Queries are embedded in session transcript files (Section 4):
Q1: What is Maya's partner's name?
Ground Truth: David. They've been together for about 3 years.
Evidence Sessions: [1, 6, 11]
Query Type: factual_recall
Recall Difficulty: Easy
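
To make the embedded query format concrete, the sketch below parses a block shaped like the example above into a small record. It is an illustration, not the logic in `eval/dataset.py`: the regular expression assumes each field sits on a single line in exactly this order, and `Query` and `parse_queries` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class Query:
    number: int
    question: str
    ground_truth: str
    evidence_sessions: list[int]
    query_type: str
    recall_difficulty: str


# Assumes one field per line, in the order shown in the example above.
QUERY_RE = re.compile(
    r"^Q(?P<num>\d+):\s*(?P<question>.+)\n"
    r"Ground Truth:\s*(?P<truth>.+)\n"
    r"Evidence Sessions:\s*\[(?P<sessions>[^\]]*)\]\n"
    r"Query Type:\s*(?P<qtype>\S+)\n"
    r"Recall Difficulty:\s*(?P<difficulty>.+)$",
    re.MULTILINE,
)


def parse_queries(text: str) -> list[Query]:
    """Extract every query block found in a transcript's query section."""
    queries = []
    for m in QUERY_RE.finditer(text):
        sessions = [int(s) for s in m["sessions"].split(",") if s.strip()]
        queries.append(Query(
            number=int(m["num"]),
            question=m["question"].strip(),
            ground_truth=m["truth"].strip(),
            evidence_sessions=sessions,
            query_type=m["qtype"],
            recall_difficulty=m["difficulty"].strip(),
        ))
    return queries


EXAMPLE = """\
Q1: What is Maya's partner's name?
Ground Truth: David. They've been together for about 3 years.
Evidence Sessions: [1, 6, 11]
Query Type: factual_recall
Recall Difficulty: Easy
"""

parsed = parse_queries(EXAMPLE)
print(parsed[0].query_type, parsed[0].evidence_sessions)
```

Multi-line ground truths or reordered fields would need a more tolerant parser; for real runs, `python eval/dataset.py --stats` remains the reference.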
If you use AgentLife in your research:
@misc{steadman2026agentlife,
title={AgentLife: A Full-Lifecycle Benchmark for AI Agent Memory Systems},
author={Solomon Steadman},
year={2026},
url={https://github.com/quaid-labs/agentlife}
}

MIT License. See LICENSE for details.
Public benchmark claims in this repo should stay aligned with the release
runbooks stored under published/.
