feat(retrieval): LongMemEval evaluation harness by KIvanow · Pull Request #252 · BetterDB-inc/monitor

KIvanow · 2026-06-16T19:05:14Z

Add a mock-by-default LongMemEval retrieval+QA harness for @betterdb/retrieval with four env-selected seams (embedder, store, reader, judge): OPENAI_API_KEY enables real embedder/reader/judge, a reachable VALKEY_URL enables the real valkey-search store, else deterministic offline mocks. Reader and judge models are set independently (gpt-5.4 reader, gpt-5.5 judge); temperature is omitted for GPT-5-tier models and kept at 0 otherwise. The embedding cache flush is non-fatal and runs in a finally, so a serialization failure at large haystack scale cannot discard a completed eval or lose embeddings already computed. Includes a Tier-0 offline vitest smoke test; datasets and the embedding cache are gitignored.

Summary

Adds a runnable LongMemEval harness to the retrieval eval suite and makes the reader and judge models independently configurable. Establishes our first reportable agent-memory benchmark: 75.6% QA-accuracy with 98.4% recall@10 on the full longmemeval_s split (all 500 records). Retrieval is near-perfect; the remaining QA gap is reader-side reasoning, concentrated in single-session-preference and temporal-reasoning.

Results (longmemeval_s, full 500, k=10, gpt-5.4 reader / gpt-5.5 judge)

question_type	n	recall@k	QA-acc
knowledge-update	78	100.0%	87.2%
multi-session	133	100.0%	66.2%
single-session-assistant	56	100.0%	98.2%
single-session-preference	30	93.3%	50.0%
single-session-user	70	95.7%	94.3%
temporal-reasoning	133	97.7%	64.7%
OVERALL	500	98.4%	75.6%

23,889 chunks indexed.
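As a sanity check on the table above, the OVERALL row can be recomputed as the record-weighted mean of the per-type rows. The sketch below only re-derives numbers already in the table; the `Row` type and helper names are illustrative and not part of the harness.

```typescript
// Per-type rows copied from the results table above.
type Row = { type: string; n: number; recallPct: number; qaPct: number };

const rows: Row[] = [
  { type: "knowledge-update", n: 78, recallPct: 100.0, qaPct: 87.2 },
  { type: "multi-session", n: 133, recallPct: 100.0, qaPct: 66.2 },
  { type: "single-session-assistant", n: 56, recallPct: 100.0, qaPct: 98.2 },
  { type: "single-session-preference", n: 30, recallPct: 93.3, qaPct: 50.0 },
  { type: "single-session-user", n: 70, recallPct: 95.7, qaPct: 94.3 },
  { type: "temporal-reasoning", n: 133, recallPct: 97.7, qaPct: 64.7 },
];

// Mean of a per-type percentage, weighted by each type's record count.
function weightedPct(rs: Row[], pick: (r: Row) => number): number {
  const total = rs.reduce((sum, r) => sum + r.n, 0);
  const weighted = rs.reduce((sum, r) => sum + r.n * pick(r), 0);
  return weighted / total;
}

// Round to one decimal place, matching the table's precision.
const round1 = (x: number): number => Math.round(x * 10) / 10;

const totalN = rows.reduce((sum, r) => sum + r.n, 0);
const overallRecall = round1(weightedPct(rows, (r) => r.recallPct));
const overallQa = round1(weightedPct(rows, (r) => r.qaPct));

console.log(totalN, overallRecall, overallQa); // 500 98.4 75.6
```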

These numbers are after fixes for the BugBot-flagged issues (date-context handling for reader and judge, long-session splitting, index-readiness gating, and the non-fatal flush). Two things to call out:

Recall is unchanged at 98.4% — exactly as expected; the date-context and judge fixes touch only the QA/grading path, not retrieval.
Temporal-reasoning QA more than doubled (25.0% → 64.7%), driving the overall QA jump from 61.0% to 75.6%. That's the date-context fix paying off: the reader and judge now both see each excerpt's session date plus the question's asked-on date, which temporal questions depend on.
This run was the full 500 records (not a filtered 495) — the 5 previously-oversized records run without crashing, validating the long-session split.

Zep-matched comparison run (gpt-4o reader + judge)

Zep's published LongMemEval_S figure (71.2% overall) uses gpt-4o as both the agent reader and the judge. Our headline result above uses gpt-5.4/gpt-5.5, a stronger reader, so it is not a fair architecture-to-architecture comparison. To control the reader/judge variable, the runs below repeat the full _s split with gpt-4o on both seams (everything else identical: same embedder, same chunk-retrieval memory, same k). Reader/judge models are env-overridable (LONGMEMEVAL_READER_MODEL / LONGMEMEVAL_JUDGE_MODEL), defaulting to gpt-5.4/gpt-5.5 so the headline result stays reproducible.

Reader prompt fix. The first gpt-4o run scored 0/30 on single-session-preference. A read-only diagnostic established this was not grader strictness: the gpt-4o reader literally answered "I don't know" on all 30 preference questions, and a lenient preference-aware judge produced 0 flips — it agreed with the generic judge on every record. Root cause: the extractive-only reader prompt ("answer using only the excerpts; else say I don't know") combined with preference golds being recommendation rubrics that have no answer string in the excerpts. The reader prompt was revised to keep factual abstention while inferring recommendations grounded strictly in preferences/signals present in the retrieved excerpts (no invented preferences, no general knowledge). The two gpt-4o runs below isolate that single prompt change.

question_type	n	recall@k	extractive (v1)	generative-grounded (v2)	Δ
knowledge-update	78	100.0%	80.8%	82.1%	+1.3
multi-session	133	100.0%	49.6%	62.4%	+12.8
single-session-assistant	56	100.0%	91.1%	100.0%	+8.9
single-session-preference	30	93.3%	0.0%	13.3%	+13.3
single-session-user	70	95.7%	74.3%	92.9%	+18.6
temporal-reasoning	133	97.7%	34.6%	50.4%	+15.8
OVERALL	500	98.4%	55.6%	67.8%	+12.2

Footer: reader gpt-4o · judge gpt-4o (generic, unchanged) · embedder text-embedding-3-small (1536d) · split longmemeval_s · k=10 · 500 records · reader prompt generative-grounded (v2).

No category regressed — every category rose. Recall is unchanged at 98.4% in both runs (the prompt never touches retrieval). The gains are broad, not preference-only: the old extractive framing was also making gpt-4o over-abstain on factual questions, so the v2 prompt converts wrong abstentions into correct answers across the board (accuracy cannot rise by answering incorrectly). A grounding spot-check confirmed the reader anchors recommendations to the user's actual history in the excerpts and does not fabricate preferences; where retrieval is thin it falls back to generic advice that the judge correctly marks incorrect — which is why single-session-preference improves only to 13.3% despite 93.3% recall.

Three configs side by side (gpt-4o-matched = generative-grounded v2):

config	reader / judge	embedder	memory	recall@10	QA overall
Ours (frontier)	gpt-5.4 / gpt-5.5	text-embedding-3-small	retrieval-over-chunks	98.4%	75.6%
Ours (gpt-4o-matched)	gpt-4o / gpt-4o	text-embedding-3-small	retrieval-over-chunks	98.4%	67.8%
Zep (published)	gpt-4o / gpt-4o	BGE-m3	temporal knowledge graph	—	71.2%

With the reader/judge LLM controlled to gpt-4o, our retrieval-over-chunks memory scores 67.8% vs Zep's published 71.2% (−3.4 pt). This is not a clean win or loss: two variables remain uncontrolled — (1) embedder (ours text-embedding-3-small vs Zep's BGE-m3) and (2) memory architecture (ours retrieval-over-chunks vs Zep's temporal knowledge graph) — plus reader-prompt/agent and judge-rubric differences. Recall is unchanged at 98.4%, so the entire remaining QA gap is reader-reasoning + grading, not retrieval.

Changes

Threaded a model parameter through chat() and gave the judge its own JUDGE_MODEL constant, so reader and judge are independently settable (previously the judge inherited the reader's model and the judge name label was cosmetic/misleading).
Set reader gpt-5.4, judge gpt-5.5; judge label now derives from the actual judge model. Reader/judge models are env-overridable via LONGMEMEVAL_READER_MODEL / LONGMEMEVAL_JUDGE_MODEL (defaults unchanged), so a like-for-like comparison config (e.g. gpt-4o) runs without editing the defaults or making the two runs non-reproducible.
Conditional temperature handling: omit temperature for GPT-5-tier models (which reject non-default values), keep temperature: 0 otherwise. Applied per-model so reader and judge each resolve correctly.
Split sessions exceeding the embedder's input budget into multiple chunks sharing the same session_id (and hard-slice an oversized single turn in turn mode), so every record embeds and indexes without dropping evidence; recall stays session-granular. All 500 _s records run (23,889 chunks).
Reader prompt distinguishes fact-extraction from recommendation: factual questions keep "answer from the excerpts or abstain", while recommendation/advice/tips questions infer an answer grounded strictly in preferences present in the retrieved excerpts (no invented preferences, no general knowledge). This removes spurious abstention; on the gpt-4o-matched config it lifts overall QA 55.6% → 67.8% with no per-type regression and unchanged recall.
QA prompts are date-aware: each retrieved excerpt is prefixed with its session date and the question's asked-on date is appended to the question given to BOTH the reader and the judge, so temporal items are graded on the same anchored question the reader saw.
Index readiness on real Valkey gates on full index coverage (numDocs + percentIndexed) before querying, so recall is not measured while HNSW is still backfilling.
Embedding cache flush is non-fatal and runs in a finally, so a mid-run failure still persists embeddings already computed and a serialization failure at _s scale cannot crash a completed eval.
Per-type results table with config footer (reader/judge/embedder/split/k/record count) so no number is ever quoted without its split.
Datasets (longmemeval_oracle.json, longmemeval_s) remain gitignored; embedding cache reused across runs where it fits (the _s map exceeds V8's max string length, so _s runs are cold by necessity — results are unaffected since embeddings are deterministic).
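The conditional temperature handling in the list above could look roughly like the sketch below. The `ChatRequest` shape and `buildChatRequest` helper are hypothetical; only the `isGpt5Tier` name and its `/^gpt-5/i` pattern come from the review discussion on this PR.

```typescript
// Hypothetical request shape: only the fields relevant to this sketch are modeled.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  temperature?: number;
}

// GPT-5-tier models reject non-default temperature values.
function isGpt5Tier(model: string): boolean {
  return /^gpt-5/i.test(model);
}

// Resolved per model, so reader and judge each get the right parameters.
function buildChatRequest(model: string, messages: ChatMessage[]): ChatRequest {
  const base: ChatRequest = { model, messages };
  // Omit the key entirely for GPT-5-tier models; pin to 0 otherwise.
  return isGpt5Tier(model) ? base : { ...base, temperature: 0 };
}
```

Omitting the key, rather than sending a default value, avoids having to know what each model family treats as its default.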

Checklist

Unit / integration tests added
Docs added / updated
Roborev review passed — run roborev review --branch or /roborev-review-branch in Claude Code (internal)
Competitive analysis done / discussed (internal)
Blog post about it discussed (internal)

Note

Low Risk
Eval-only tooling and tests under packages/retrieval; no production retrieval API changes. Optional OpenAI calls use env API keys in local/CI eval runs.

Overview
Adds a LongMemEval benchmark harness under packages/retrieval/eval/longmemeval that exercises @betterdb/retrieval end-to-end: chunk haystack sessions into UpsertEntry records, index per question, measure recall@k (evidence session_id in top hits), and optionally QA accuracy via reader + judge.
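A minimal sketch of the recall@k measurement described above. It assumes a question counts as a hit when at least one evidence session_id appears among the top-k hits; whether the harness uses "any" or "all" evidence sessions is not stated here, and the `Hit` and `QuestionResult` types are illustrative.

```typescript
// Illustrative types: a retrieved hit carries the session it came from.
interface Hit {
  sessionId: string;
}

interface QuestionResult {
  hits: Hit[]; // ranked, best first
  evidence: string[]; // gold evidence session_ids
}

// Assumption: "any evidence session in the top k" counts as a hit.
function isRecallHit(result: QuestionResult, k: number): boolean {
  const topK = new Set(result.hits.slice(0, k).map((h) => h.sessionId));
  return result.evidence.some((id) => topK.has(id));
}

// Fraction of questions with a hit; an empty category yields 0 instead of NaN.
function recallAtK(results: QuestionResult[], k: number): number {
  if (results.length === 0) return 0;
  const hitCount = results.filter((r) => isRecallHit(r, k)).length;
  return hitCount / results.length;
}
```

Because the match is on session_id rather than chunk id, splitting a long session into several chunks does not change what counts as a hit.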

Runnable tiers are selected by env: no API key → deterministic mock embedder + in-memory Valkey-search mock; OPENAI_API_KEY → OpenAI embeddings (cached on disk, non-fatal flush in finally); reachable VALKEY_URL → real valkey-search with HNSW index-settling before query. LONGMEMEVAL_QA=1 enables Tier 2 with independently configurable reader/judge models (LONGMEMEVAL_READER_MODEL / LONGMEMEVAL_JUDGE_MODEL), date-aware QA prompts, and GPT-5-tier temperature handling.
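The env-driven tier selection above can be pictured as a small pure function. This is a sketch under assumptions: the `Seams` shape and `selectSeams` name are hypothetical, the reachability probe is passed in as a boolean, and treating `LONGMEMEVAL_QA=1` without an API key as "mock reader/judge" is an inference from the mock-by-default description.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical summary of which implementation each seam resolves to.
interface Seams {
  embedder: "mock" | "openai";
  store: "mock" | "valkey";
  qa: "off" | "mock" | "openai";
}

function selectSeams(env: Env, valkeyReachable: boolean): Seams {
  const hasKey = Boolean(env["OPENAI_API_KEY"]);
  const qaEnabled = env["LONGMEMEVAL_QA"] === "1";
  const hasValkey = Boolean(env["VALKEY_URL"]) && valkeyReachable;
  return {
    embedder: hasKey ? "openai" : "mock",
    store: hasValkey ? "valkey" : "mock",
    qa: !qaEnabled ? "off" : hasKey ? "openai" : "mock",
  };
}

// Usage (in Node): selectSeams(process.env, await probeValkey()) where
// probeValkey is whatever reachability check the caller supplies.
```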

Chunking supports session and turn modes and splits oversized sessions/turns to fit embedder limits while preserving session_id for recall. .gitignore excludes the eval cache and downloaded datasets; the eval:longmemeval npm script and the Tier 0 vitest smoke test cover offline recall/QA/chunking without network access.

Reviewed by Cursor Bugbot for commit eab9387.

- Register/unregister a type:retrieval marker on __betterdb:caches
- Add health() returning an IndexHealthSnapshot (percent_indexed, dims, numDocs, indexing state, optional recall-estimate hook)
- Add a pluggable telemetry seam (RetrievalMetrics, RetrievalTracer) wired into query/upsert: operation duration, query result counts, embedding calls
- Add createPrometheusMetrics factory (prom-client) implementing RetrievalMetrics
- Export discovery, health, and telemetry types

- Add Retriever.integration.test.ts (skip-guarded, VALKEY_URL default 6384)
- Cover create index, upsert, vector query returning the upserted doc, TAG and NUMERIC filters, delete, health, and drop end-to-end
- Add iovalkey devDependency for the integration client

Add a mock-by-default LongMemEval retrieval+QA harness for @betterdb/retrieval with four env-selected seams (embedder, store, reader, judge): OPENAI_API_KEY enables real embedder/reader/judge, a reachable VALKEY_URL enables the real valkey-search store, else deterministic offline mocks. Reader and judge models are set independently (gpt-5.4 reader, gpt-5.5 judge); temperature is omitted for GPT-5-tier models and kept at 0 otherwise. The embedding cache flush is non-fatal so a serialization failure at large haystack scale cannot discard a completed eval. Includes a Tier-0 offline vitest smoke test; datasets and the embedding cache are gitignored.

- Mock judge no longer grades an empty prediction correct (g.includes('') is always true), which inflated QA accuracy after a retrieval miss.
- OpenAI judge matches only a leading "correct" word so "incorrect", "not correct", and "partially correct" are not scored as correct.
- Real-store run warns when HNSW indexing does not settle within the poll window instead of silently undercounting recall.
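To illustrate the empty-prediction bug in the mock judge: `String.prototype.includes("")` returns true for every string, so a substring-based judge needs an explicit empty guard. The sketch below assumes a simple case-insensitive substring match; the actual mock judge's matching rule may differ, and `mockJudge` is an illustrative name.

```typescript
// Assumption: the mock judge grades by case-insensitive substring match.
function mockJudge(prediction: string, gold: string): boolean {
  const p = prediction.trim().toLowerCase();
  const g = gold.trim().toLowerCase();
  // Guard first: g.includes("") is always true, which would grade an
  // empty prediction (e.g. after a retrieval miss) as correct.
  if (p === "" || g === "") return false;
  return g.includes(p) || p.includes(g);
}
```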

Replace the hit-count settle check with retriever.health(): wait until numDocs covers every upserted chunk and percentIndexed reaches 1. A question query could return min(k, chunkCount) hits while HNSW was still backfilling, so recall@k could be measured on an incomplete graph.
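A sketch of the coverage-based readiness check described above. The `HealthSnapshot` shape is a local stand-in for whatever retriever.health() returns, and the value that means "fully indexed" is passed in explicitly because the scale of percentIndexed is an assumption here. The poll function is synchronous for brevity; a real run would await health() and sleep between polls.

```typescript
// Local stand-in for the fields of the health snapshot used here.
interface HealthSnapshot {
  numDocs: number;
  percentIndexed: number;
}

// Ready only when every upserted chunk is counted AND indexing is complete.
function isIndexReady(h: HealthSnapshot, expectedChunks: number, fullyIndexed: number): boolean {
  return h.numDocs >= expectedChunks && h.percentIndexed >= fullyIndexed;
}

// Bounded polling; returns false on timeout so the caller can warn
// instead of silently measuring recall on an incomplete graph.
function waitForIndex(
  poll: () => HealthSnapshot,
  expectedChunks: number,
  fullyIndexed: number,
  maxPolls: number,
): boolean {
  for (let attempt = 0; attempt < maxPolls; attempt++) {
    if (isIndexReady(poll(), expectedChunks, fullyIndexed)) return true;
  }
  return false;
}
```

A hit-count check cannot substitute for this: a query can already return min(k, chunkCount) hits while the index is still backfilling.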

…ession splitting
- runner: require percentIndexed >= 100 (0-100 scale) so recall is not measured while HNSW is still backfilling
- judge: parse the verdict by whole word and reject negated/partial forms ("incorrect", "not correct", "partially correct") instead of matching the first word only
- adapter: split sessions that exceed the embedder input budget into multiple chunks sharing the same session_id, so long sessions embed instead of failing
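The whole-word verdict parsing described in this commit might be sketched as follows. `parseVerdict` is an illustrative name and the exact list of rejected forms is taken from the commit message; the real judge may handle more phrasings.

```typescript
// Returns true only for an affirmative "correct" verdict.
function parseVerdict(reply: string): boolean {
  const text = reply.trim().toLowerCase();
  if (text === "") return false;
  // Reject negated or partial forms before accepting a whole-word match.
  if (/\b(incorrect|not\s+correct|partially\s+correct)\b/.test(text)) return false;
  // \b keeps "incorrect" from matching, since "n" and "c" are both word characters.
  return /\bcorrect\b/.test(text);
}
```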

- adapter: hard-slice a single turn that exceeds the embedder budget in turn mode (extracted sliceToBudget, shared with session packing) so a long turn embeds instead of failing or dropping the record
- runner: prefix each retrieved excerpt with its session date and pass the question's asked-on date to the reader, so temporal-reasoning QA is not depressed by missing date context
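A minimal sketch of budget-aware splitting, assuming the budget is measured in characters (the harness may count tokens instead). The `sliceToBudget` name comes from the commit message above; its signature, the `Chunk` type, and the greedy `packSession` helper are assumptions made for illustration.

```typescript
interface Chunk {
  sessionId: string;
  text: string;
}

// Hard-slice one oversized string into pieces of at most `budget` characters.
function sliceToBudget(text: string, budget: number): string[] {
  if (budget <= 0) throw new Error("budget must be positive");
  const pieces: string[] = [];
  for (let i = 0; i < text.length; i += budget) {
    pieces.push(text.slice(i, i + budget));
  }
  return pieces;
}

// Greedily pack turns into chunks; every chunk keeps the same session_id,
// so recall stays session-granular even when a session is split.
function packSession(sessionId: string, turns: string[], budget: number): Chunk[] {
  const chunks: Chunk[] = [];
  let current = "";
  const emit = (): void => {
    if (current !== "") chunks.push({ sessionId, text: current });
    current = "";
  };
  for (const turn of turns) {
    for (const piece of sliceToBudget(turn, budget)) {
      const joined = current === "" ? piece : `${current}\n${piece}`;
      if (joined.length > budget) {
        emit();
        current = piece;
      } else {
        current = joined;
      }
    }
  }
  emit();
  return chunks;
}
```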

…reader
The judge was called with the raw record.question while the reader saw the question augmented with question_date, leaving the grader without the temporal anchor and skewing temporal-reasoning QA. Pass the augmented question to both.
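The date-aware prompt construction can be sketched as below. The exact prompt wording used by the harness is not shown in this PR, so the bracketed date prefix and the "(asked on ...)" suffix are placeholders; the point is that the augmented question is built once and handed to both reader and judge.

```typescript
interface Excerpt {
  sessionDate: string;
  text: string;
}

// Prefix each retrieved excerpt with its session date (format is a placeholder).
function formatExcerpts(excerpts: Excerpt[]): string {
  return excerpts.map((e) => `[${e.sessionDate}] ${e.text}`).join("\n\n");
}

// Append the asked-on date when the record has one.
function augmentQuestion(question: string, questionDate?: string): string {
  return questionDate ? `${question} (asked on ${questionDate})` : question;
}

// Build the question once so reader and judge see the identical anchored text.
function buildQaInputs(question: string, questionDate: string | undefined, excerpts: Excerpt[]) {
  const anchored = augmentQuestion(question, questionDate);
  return {
    reader: { question: anchored, context: formatExcerpts(excerpts) },
    judge: { question: anchored },
  };
}
```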

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 644eb6d.

embedder.flush ran only after every record finished, so a mid-run failure (embedding API error, dimension mismatch, Valkey or reader/judge error) discarded the embeddings already computed in memory — billable work lost on a long OpenAI-backed run. Move the flush into a finally so partial progress is always persisted.
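The flush-in-finally pattern described above might look like this sketch. `runEval` and its callback parameters are hypothetical stand-ins for the runner's internals; the two properties it demonstrates are that the flush runs even when a record fails mid-run, and that a failing flush is downgraded to a warning rather than replacing the eval's result.

```typescript
async function runEval<T>(
  records: T[],
  processRecord: (record: T) => Promise<void>,
  flush: () => Promise<void>,
  warn: (message: string) => void,
): Promise<number> {
  let completed = 0;
  try {
    for (const record of records) {
      await processRecord(record);
      completed++;
    }
    return completed;
  } finally {
    // Runs on success and on failure, so partial progress is persisted.
    try {
      await flush();
    } catch (err) {
      // Non-fatal: a serialization failure must not discard a completed eval.
      warn(`embedding cache flush failed (non-fatal): ${String(err)}`);
    }
  }
}
```

The inner try/catch matters: an exception thrown directly from a finally block would otherwise mask both the return value and any original error.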

Reader and judge models were hardcoded constants, so a like-for-like comparison config (e.g. gpt-4o to match Zep's published LongMemEval_S setup) could only be run by editing the defaults in place, which mutates the existing config and makes the two runs non-reproducible side by side. Read the models from LONGMEMEVAL_READER_MODEL / LONGMEMEVAL_JUDGE_MODEL, defaulting to gpt-5.4 / gpt-5.5 so existing results stay reproducible when the env is unset.
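A sketch of the env-overridable model resolution. The env var names and defaults are those stated above; the `resolveModels` helper and the choice to treat a blank value as unset are assumptions.

```typescript
type Env = Record<string, string | undefined>;

const DEFAULT_READER_MODEL = "gpt-5.4";
const DEFAULT_JUDGE_MODEL = "gpt-5.5";

// Assumption: a missing or blank variable falls back to the default.
function pickModel(value: string | undefined, fallback: string): string {
  const trimmed = value?.trim();
  return trimmed ? trimmed : fallback;
}

function resolveModels(env: Env): { reader: string; judge: string } {
  return {
    reader: pickModel(env["LONGMEMEVAL_READER_MODEL"], DEFAULT_READER_MODEL),
    judge: pickModel(env["LONGMEMEVAL_JUDGE_MODEL"], DEFAULT_JUDGE_MODEL),
  };
}

// Usage (in Node): resolveModels(process.env)
```

Deriving the judge label from the resolved judge model, rather than a separate constant, keeps the results footer honest about what actually graded the run.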

The extractive-only reader prompt ("answer using only the excerpts; else say I don't know") made the reader abstain on every recommendation-style question, since preference golds are rubrics with no answer string in the excerpts. Revise the single shared prompt to keep factual abstention while inferring a recommendation grounded strictly in preferences/signals present in the retrieved excerpts (no invented preferences, no general knowledge).

jamby77

Nice harness — clean four-seam design (embedder/store/reader/judge) and the defensive comments make it easy to review. Build/config and API usage check out: eval/ is correctly excluded from the package tsc build, the Tier-0 test is genuinely hermetic (fixture + mocks, no env/network), and the recall@k / QA-accuracy aggregation guards empty categories. The judge (\bcorrect\b word-boundary so "incorrect" doesn't match, empty-string guard) and the non-fatal flush in finally are well done.

One non-blocking thing: the in-memory mock store's numeric @field:[min max] filter does exact-equality on the min and ignores the max (eval/longmemeval/store.ts:68-70, 249 — value: parseFloat(m[2]) then Number(stored) !== f.value), and a malformed bound yields NaN (so !== NaN matches nothing). It's effectively dead today since the schema only declares TAG fields (session_id, date), so the reported numbers are unaffected — but it'll silently mis-filter if anyone adds a numeric field to a future test. Worth a min <= x <= max + Number.isFinite guard while it's fresh.
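For reference, a range-aware version of the mock numeric filter could be sketched as follows. This is not the code in store.ts: the parser, the `NumericFilter` shape, and the handling of -inf/+inf are assumptions, and exclusive bounds are deliberately not handled.

```typescript
interface NumericFilter {
  field: string;
  min: number;
  max: number;
}

// Accepts finite numbers plus -inf / +inf / inf; anything else is NaN.
function parseBound(raw: string): number {
  const lower = raw.toLowerCase();
  if (lower === "-inf") return -Infinity;
  if (lower === "+inf" || lower === "inf") return Infinity;
  const n = Number(raw);
  return Number.isFinite(n) ? n : NaN;
}

// Parses "@field:[min max]"; returns null for malformed input instead of
// producing a NaN bound that silently matches nothing.
function parseNumericFilter(expr: string): NumericFilter | null {
  const m = /^@(\w+):\[(\S+)\s+(\S+)\]$/.exec(expr.trim());
  if (!m) return null;
  const [, field, lo, hi] = m;
  if (field === undefined || lo === undefined || hi === undefined) return null;
  const min = parseBound(lo);
  const max = parseBound(hi);
  if (Number.isNaN(min) || Number.isNaN(max)) return null;
  return { field, min, max };
}

// Inclusive range check with a finiteness guard on the stored value.
function matchesNumeric(stored: string | number, f: NumericFilter): boolean {
  const x = Number(stored);
  return Number.isFinite(x) && f.min <= x && x <= f.max;
}
```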

Minor nits (optional): dataset.ts JSON.parse is unguarded (raw SyntaxError on a malformed dataset — runs before any billable work, so nothing's lost), and isGpt5Tier's /^gpt-5/i won't catch a future temperature-rejecting family, though that fails loud (API 400) rather than silently.

Approving — the one real finding is latent and the rest are nits.

jamby77 and others added 3 commits June 16, 2026 13:01

KIvanow requested a review from jamby77 June 16, 2026 19:05