AgentLife

A full-lifecycle benchmark for AI agent memory systems and other agent architectures that must operate over time. AgentLife tests the full agentic pipeline: days and months of data, projects, conversations, evolving context, and a user whose story changes over time.

Open Benchmark

AgentLife is published as an open benchmark harness, and we explicitly invite other teams to benchmark their own systems against it.

We are intentionally not maintaining a broad third-party leaderboard in this README at launch. Memory systems are highly configuration-sensitive, and it is too easy for one side to benchmark another system under the wrong setup, model choice, retrieval settings, or operating assumptions. Rather than claim ownership over other systems' benchmark rows, we prefer externally reproduced or vendor-reviewed runs with exact setup details.

If you want to run your system on AgentLife, that is encouraged. The benchmark is meant to be reusable, inspectable, and disputable in the open. A good benchmark should make it easier for other people to test their own systems, not require them to trust our private setup choices.

Why AgentLife?

AgentLife tests the operating shape of a persistent agent: one user and one assistant working together over many days while personal context, project state, and facts change.

Many memory benchmarks evaluate whether a system can recover facts from a fixed conversation or a per-question history. That is useful, but it is not the whole agent workload. Real agents must survive resets, keep project and personal state current, avoid surfacing irrelevant memory, and handle stale or contradictory facts after the world changes.

AgentLife was created for that lifecycle setting. It evaluates extraction, maintenance, and recall together:

long-running multi-session interaction with context resets
personal and project-state continuity across the same user timeline
conflicting, corrected, and stale facts over time
maintenance pressure from deduplication, decay, distillation, and compaction
assistant-style judgment about what to surface and what not to surface

AgentLife is designed to show whether a persistent memory system holds up as an operating product. External memory benchmarks remain useful supporting signals, but they do not replace full lifecycle evaluation.

Dataset

20 scripted conversation sessions between Maya (a product manager in Austin) and an AI assistant, spanning March–May 2026:

Track 1 (Personal): Partner David, dog Biscuit, running training, mom Linda's diabetes, job transition from TechFlow to Stripe, house hunting
Track 2 (Project): Recipe app development — Express/SQLite → dietary tags → SQL injection fix → GraphQL → Docker → JWT auth; portfolio site

All characters and events are fictional.

Benchmark Variants

Variant	Description
AL-S (AgentLife Short)	Core corpus only (~92K tokens, 20 arc sessions).
AL-L (AgentLife Long)	AL-S plus filler sessions (~220K tokens) to force context-pressure behavior.
AL-L OBD (One Big Day)	AL-L data compressed into a one-day ingest path to stress heavy-session load.
FC (Full Context)	No memory system; answer model sees raw/compacted transcript each query, but FC does not persist state across session resets.
OC Native	OpenClaw built-in memory baseline (memory-core/session-memory/session-index).

Eval Queries

The canonical query set is 283 scored prompts. Broken out by tier:

Tier	Count	What It Tests
T1	201	Personal memory and long-horizon continuity: fact retention, updates over time, synthesis, callbacks, relationship reasoning, and adversarial resistance on the user's changing life story.
T2	34	Project and tool-derived memory: project-state tracking plus facts the agent found during research/tool use.
T3	16	Non-question grounding and restraint on casual chat.
T4	17	Architecture comprehension and implementation planning.
T5	15	Emotional intelligence, privacy boundaries, relational context, and socially appropriate responses under memory pressure.

Category Breakdown

Tier	Category	Count	What It Tests
T1	factual_recall	29	Basic extraction and retention of directly stated facts.
T1	temporal_current	16	Current-state reasoning after updates, corrections, and time passing.
T1	graph_traversal	14	Multi-hop relationship recall through the memory graph.
T1	multi_session_synthesis	14	Combining information across many sessions into one answer.
T1	cross_reference	13	Linking facts across different arcs or domains.
T1	inference	13	Inferring the right answer from multiple stored facts.
T1	stale_fact	12	Replacing outdated memories with the latest valid state.
T1	negative	12	Hallucination resistance when the correct answer is no or not established.
T1	evolution	12	Tracking how a person, project, or situation changes over time.
T1	adversarial_confirm	11	Confirming true facts even when the phrasing is adversarial or trap-like.
T1	surprise_callback	11	Long-range recall of early details that return much later.
T1	speaker_attribution	10	Distinguishing who said, suggested, or did what.
T1	contested_fact	10	Facts with corrections, changing states, or multiple valid positions.
T1	adversarial_false_attribution	10	Resisting entity confusion and wrong-person/source attribution.
T1	adversarial_idk	8	Saying unknown when the corpus does not support an answer.
T1	tangent_recall	6	Pulling relevant facts out of side remarks and conversational noise.
T2	project_state	24	Current state of active projects, tools, and implementation work.
T2	agent_retrieved	10	Facts the agent found or established through tool use or research.
T3	non_question	16	Appropriate restraint on casual utterances without dumping memory.
T4	arch_comprehension	11	Architectural/codebase understanding from remembered project history.
T4	arch_planning	6	Implementation planning that depends on remembered architecture.
T5	emotional_intelligence	15	Empathy, privacy boundaries, relational context, and socially appropriate responses under memory pressure.

Each query includes ground truth, evidence sessions, query type, and recall difficulty.

Benchmark Comparison

The tables below position AgentLife against two widely cited long-memory benchmarks. The goal is to clarify benchmark structure and what each dataset actually measures.

Scale notes use the published LoCoMo paper, the released LoCoMo dataset, the LongMemEval paper, and AgentLife's current release-methodology accounting.

Dataset Shape

Dimension	LoCoMo	LongMemEval-S	LongMemEval-M	AgentLife S	AgentLife L
Evaluation unit	10 independent long conversations	500 independent per-question histories	500 independent per-question histories	1 continuous user-agent lifecycle	Same lifecycle with filler load
Sessions	27.2 avg, up to 32 per conversation	~50 sessions per question	500 sessions per question	20 arc sessions	279 sessions: 20 arc + 259 filler
Transcript scale	16.6K avg tokens per conversation	~115K history tokens per question	~1.5M history tokens per question	~92K transcript tokens	~220K transcript tokens
Scored prompts	1,986 QA pairs over the released conversations	500	500	283	283
Primary shape	Long human-human dialogue memory	Per-question user-assistant history retrieval	Large per-question user-assistant history retrieval	Persistent agent lifecycle	Persistent agent lifecycle under compaction pressure

What Each Benchmark Tests

Table semantics:

✅ = explicitly targeted by the benchmark's evaluated task design
❌ = absent as a first-class benchmark target, not a claim that no incidental example exists

Capability	LoCoMo	LongMemEval	AgentLife
Single-fact recall	✅	✅	✅
Multi-hop reasoning	✅	✅	✅
Temporal reasoning	✅	✅	✅
Knowledge updates / stale facts	❌	✅	✅ (contested_fact, stale_fact tiers)
Contradiction detection	❌	❌	✅ (facts evolve and contradict across sessions)
Adversarial / unanswerable	✅	✅ (abstention)	✅ (adversarial_idk, adversarial_confirm, false_attribution)
Project / technical state	❌	❌	✅ (project_state, arch_comprehension, arch_planning)
Speaker attribution	❌	❌	✅ (who said what)
Non-question handling	❌	❌	✅ ("Hi", "Thanks" should not trigger memory dumps)
Emotional intelligence	❌	❌	✅ (Tier 5: boundary awareness, self-awareness)
Agent action / tool-work memory	❌	❌	✅ (agent_retrieved: what the agent did, found, or suggested)
Compaction pressure	❌	❌	✅ (AgentLife L forces compaction events)
Maintenance pipeline testing	❌	❌	✅ (dedup, decay, journal distillation affect results)
Interleaved personal + project	❌	❌	✅ (Track 1 personal, Track 2 project, interleaved)

Structural Difference

Design axis	LoCoMo	LongMemEval	AgentLife
Continuity model	Independent conversations	Independent per-question histories	One user and assistant over one changing timeline
Interaction model	Long social/dialogue history	Synthetic user-assistant histories assembled for each question	Personal life, project work, tool use, and assistant behavior interleaved
State change model	Event-grounded history and temporal QA	Explicit knowledge-update questions	Facts become stale, contested, corrected, and operationally maintained
Operational lifecycle	No extraction/maintenance pipeline target	Retrieval/indexing benchmark over supplied histories	Extraction, dedup, decay, journal distillation, compaction, and recall are all scored indirectly through answers
What high scores prove	Strong contained conversational memory and reasoning	Strong long-history retrieval and reasoning	Lifecycle stability under changing state

Core Distinction

LoCoMo and LongMemEval are useful long-memory retrieval and reasoning benchmarks. They ask whether a system can recover information from long dialogue histories and answer correctly.

AgentLife targets a different product question: can one persistent agent keep working with one user as the user's world changes? Agents are not usually handed disconnected histories and asked questions after the fact. They operate over days, across resets and compactions, while personal details, project state, assistant actions, and user preferences evolve. AgentLife therefore tests whether the memory pipeline keeps that world model current while still answering naturally and withholding irrelevant memory when appropriate.

Current Headline Results

FC is included here as an upper-bound baseline, not as the target operating model. The question is not "can memory beat raw transcript in every short horizon case," but whether a persistent system can stay competitive while surviving resets and reducing long-run token cost.

These headline rows are Quaid's numbers on the clean benchmark harness. They should be read as the current theoretical maximum for the same memory system without platform harness execution-path noise.

Surface	Quaid	FC Sonnet	Quaid Tokens	FC Tokens
AgentLife Short	93.64%	93.11%	7.95M	29.83M
AgentLife Long	88.52%	88.69%	9.64M	26.50M
AgentLife Long OBD	88.69%	88.69%	8.45M	26.50M

Quaid was measured with Haiku fast, Sonnet deep, and a Sonnet answer model on the clean harness. AgentLife Short is the strongest current direct lane. AgentLife Long and AgentLife Long OBD are the more realistic long-context lanes. Opus 4.7 was also evaluated, but Sonnet remains the cleaner headline configuration for Quaid.

OpenClaw Execution Results

These rows are not clean-harness theoretical ceilings. They are affected by the OpenClaw execution path itself, and we are actively working to improve them.

Surface	OpenClaw Native	Quaid on OpenClaw
AgentLife Short	26.49%	85.45%
AgentLife Long	31.72%	pending refresh

The main point of this split is:

the headline Quaid table above is the clean reference surface
the OpenClaw table measures the extra execution-path tax imposed by OC
we intentionally do not publish OC token numbers here because they are not yet measured cleanly enough to be trustworthy
the native OC rows here were run with OpenClaw's built-in memory plugins enabled: memory-core, session-memory, and session-index
the Quaid-on-OpenClaw Short row is r1541, a fresh full run using Anthropic Quaid internals, the Sol Codex OAuth profile for the OC gateway, and OpenAI API usage only for judging
the next required OC datapoint is a fresh Quaid-on-OpenClaw AgentLife Long lane so the OC table is complete on the same footing as the clean-harness block

Multilingual Preview

On the first validated full Japanese AgentLife Short harness run, Quaid handled a Kanji/Kana benchmark end-to-end at 74.73% overall (73.88% on T1-T4).

We are still actively improving multilingual retrieval and documentation surfaces, so this should be read as a preview rather than a headline claim. But it is already strong enough to show that non-Roman-script operation is viable at real benchmark scale, not just in isolated probes.

Token accounting standard (public, effective April 18, 2026):

Public token rows now use eval_tokens_ex_judge from token_usage.json:
- eval_tokens_ex_judge = eval.total_tokens - sum(by_source[*judge*].total_tokens)
This includes answer + preinject + tool-recall token spend, while excluding judge spend from headline token totals.
Older historical rows that cite evaluation_results.json per-question eval_tokens are legacy answer-only accounting and are not directly comparable to the current standard.

Historical token-cost reference from the March 29 technical report / Sonnet study:

AL-S: Quaid Sonnet/Haiku (r880, r847) reached 87.69% at 5,753,673 eval tokens, versus FC Sonnet (r606) at 92.90% and 29,828,646 eval tokens.
AL-L: Quaid Sonnet/Haiku (r895, r863) reached 85.82% at 5,917,209 eval tokens, versus FC Sonnet (r857) at 87.70% and 34,596,206 eval tokens.
Sonnet eval study: AL-L Haiku-ingest + Sonnet-eval (r944) reached 88.69% at 8,382,952 eval tokens, versus the same FC Sonnet row at 34,596,206.

Those rows are the citation basis for the claim that Quaid can stay near full-context quality at roughly one-fifth FC token cost on long-form lanes.

Benchmark note:

Results are measured on synthetic high-density conversations designed to stress memory systems.
Public rows are single-run per lane/configuration; informal repeat variance on stable configs has typically been about +-1pp.

Canonical docs for full tables and methodology:

Latest technical report:
- published/runbooks/AGENTLIFE_TECHNICAL_REPORT_20260430.md
Runbooks folder:
- published/runbooks/

Quick Start

Prerequisites

Python 3.11+
Claude Code CLI
OpenAI API key (for GPT-4o-mini judge)
A memory system to evaluate

Setup

git clone https://github.com/quaid-labs/agentlife.git
cd agentlife
cp .agentlife-benchmark.example.json .agentlife-benchmark.local.json
# Create secret files referenced by the local config

Canonical local setup is documented in docs/LOCAL-DEVELOPMENT.md.

For quick local-only scripts that still read environment variables directly, .env.example remains available as a convenience template.

Run Evaluation

# Parse dataset and show query stats
python eval/dataset.py --stats

# Run evaluation against a memory system
python eval/evaluate.py \
    --sessions data/sessions/ \
    --queries all \
    --results-dir data/results/my-system/

# Score results
python eval/metrics.py data/results/my-system/

Generate Additional Filler Sessions (Optional)

The repo already includes the canonical AgentLife L filler corpus in data/filler-sessions/. You only need this if you want to generate more filler sessions or build a variant large-scale lane.

python eval/densify.py --count 259 --output data/filler-sessions-extra/

Repository Structure

agentlife/
├── README.md                  # Package overview
├── METHODOLOGY.md             # Benchmark methodology and lane definitions
├── SPEC.md                    # Original specification
├── LICENSE                    # MIT
├── .agentlife-benchmark.example.json
│   └── Local benchmark config template
├── .env.example               # Optional quick local env template
├── generate.py                # Session generation script
├── eval/                      # Benchmark scripts
│   ├── dataset.py             # Session parser + query collector
│   ├── evaluate.py            # Recall → answer → judge pipeline
│   ├── metrics.py             # Scoring and reporting
│   ├── extract_compact.py     # Extraction prompts + storage
│   ├── vm_benchmark.py        # Multi-system VM orchestrator
│   ├── densify.py             # Filler session generator
│   ├── session_splitter.py    # Timeout-based session splitting
│   └── ...                    # Analysis and utility scripts
├── scripts/                   # Launch, monitor, release, and utility scripts
├── docs/                      # Operational docs
├── published/                 # Release-ready runbooks and frozen public artifacts
├── data/
│   ├── sessions/              # 20 arc session transcripts
│   ├── filler-sessions/       # 259 filler sessions (AgentLife L)
│   └── timestamps/            # Session timing metadata
├── benchmark-assets/          # Session-scoped project asset snapshots used by eval/runtime
├── apps/                      # Reference project implementations
│   ├── recipe-app/            # Express.js + SQLite recipe app
│   └── portfolio-site/        # Static portfolio site
└── briefs/                    # Session generation briefs

Benchmark Package Boundaries

This repo contains the benchmark harness package.

canonical harness path: eval/
canonical operational scripts: scripts/
release-ready benchmark artifacts: published/
packaged project snapshots for session-aware eval context: benchmark-assets/

Quaid runtime intelligence is benchmarked from a sibling checkout, not from this repo.

Adding a New System

To benchmark your memory system:

Implement an adapter following one of the existing adapters in eval/
Expose two functions:
- ingest(session_text: str) — Process a conversation session
- recall(query: str) -> str — Answer a question using stored memories
Run evaluation: python eval/evaluate.py --adapter your_adapter
Submit results via PR to update the leaderboard

See METHODOLOGY.md for complete details on scoring methodology and reproduction steps.

Eval Query Format

Queries are embedded in session transcript files (Section 4):

Q1: What is Maya's partner's name?
Ground Truth: David. They've been together for about 3 years.
Evidence Sessions: [1, 6, 11]
Query Type: factual_recall
Recall Difficulty: Easy

Citation

If you use AgentLife in your research:

@misc{steadman2026agentlife,
  title={AgentLife: A Full-Lifecycle Benchmark for AI Agent Memory Systems},
  author={Solomon Steadman},
  year={2026},
  url={https://github.com/quaid-labs/agentlife}
}

License

MIT License. See LICENSE for details.

Notes

Public benchmark claims in this repo should stay aligned with the release runbooks stored under published/.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AgentLife

Open Benchmark

Why AgentLife?

Dataset

Benchmark Variants

Eval Queries

Category Breakdown

Benchmark Comparison

Dataset Shape

What Each Benchmark Tests

Structural Difference

Core Distinction

Current Headline Results

OpenClaw Execution Results

Multilingual Preview

Quick Start

Prerequisites

Setup

Run Evaluation

Generate Additional Filler Sessions (Optional)

Repository Structure

Benchmark Package Boundaries

Adding a New System

Eval Query Format

Citation

License

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 236 Commits
agentlife/eval		agentlife/eval
apps		apps
assets		assets
benchmark-assets/projects/portfolio-site/sessions		benchmark-assets/projects/portfolio-site/sessions
briefs/pilot		briefs/pilot
data		data
docs		docs
eval		eval
published		published
scripts		scripts
.agentlife-benchmark.example.json		.agentlife-benchmark.example.json
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
METHODOLOGY.md		METHODOLOGY.md
README.md		README.md
SPEC-FULL.md		SPEC-FULL.md
SPEC.md		SPEC.md
generate.py		generate.py

Folders and files

Latest commit

History

Repository files navigation

AgentLife

Open Benchmark

Why AgentLife?

Dataset

Benchmark Variants

Eval Queries

Category Breakdown

Benchmark Comparison

Dataset Shape

What Each Benchmark Tests

Structural Difference

Core Distinction

Current Headline Results

OpenClaw Execution Results

Multilingual Preview

Quick Start

Prerequisites

Setup

Run Evaluation

Generate Additional Filler Sessions (Optional)

Repository Structure

Benchmark Package Boundaries

Adding a New System

Eval Query Format

Citation

License

Notes

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages