feat(retrieval): LongMemEval evaluation harness #252
- Register/unregister a type:retrieval marker on __betterdb:caches
- Add health() returning an IndexHealthSnapshot (percent_indexed, dims, numDocs, indexing state, optional recall-estimate hook)
- Add a pluggable telemetry seam (RetrievalMetrics, RetrievalTracer) wired into query/upsert: operation duration, query result counts, embedding calls
- Add createPrometheusMetrics factory (prom-client) implementing RetrievalMetrics
- Export discovery, health, and telemetry types
- Add Retriever.integration.test.ts (skip-guarded, VALKEY_URL default 6384)
- Cover create index, upsert, vector query returning the upserted doc, TAG and NUMERIC filters, delete, health, and drop end-to-end
- Add iovalkey devDependency for the integration client
Add a mock-by-default LongMemEval retrieval+QA harness for @betterdb/retrieval with four env-selected seams (embedder, store, reader, judge): OPENAI_API_KEY enables real embedder/reader/judge, a reachable VALKEY_URL enables the real valkey-search store, else deterministic offline mocks. Reader and judge models are set independently (gpt-5.4 reader, gpt-5.5 judge); temperature is omitted for GPT-5-tier models and kept at 0 otherwise. The embedding cache flush is non-fatal so a serialization failure at large haystack scale cannot discard a completed eval. Includes a Tier-0 offline vitest smoke test; datasets and the embedding cache are gitignored.
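The temperature rule described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: the `ChatParams` shape and `buildChatParams` are hypothetical names, and `isGpt5Tier` uses the `/^gpt-5/i` pattern quoted in the review discussion.

```typescript
// Hypothetical request shape; only the fields this sketch needs.
interface ChatParams {
  model: string;
  messages: { role: "system" | "user"; content: string }[];
  temperature?: number;
}

// Pattern quoted in the review: any model whose name starts with "gpt-5".
function isGpt5Tier(model: string): boolean {
  return /^gpt-5/i.test(model);
}

// Build request params, leaving `temperature` out entirely for GPT-5-tier
// models and pinning it to 0 for everything else.
function buildChatParams(model: string, prompt: string): ChatParams {
  const params: ChatParams = {
    model,
    messages: [{ role: "user", content: prompt }],
  };
  if (!isGpt5Tier(model)) {
    params.temperature = 0;
  }
  return params;
}
```

Omitting the key, rather than sending `undefined` or a default, keeps the request valid for models that reject the parameter.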
- Mock judge no longer grades an empty prediction correct (g.includes('')
is always true), which inflated QA accuracy after a retrieval miss.
- OpenAI judge matches only a leading "correct" word so "incorrect",
"not correct", and "partially correct" are not scored as correct.
- Real-store run warns when HNSW indexing does not settle within the poll
window instead of silently undercounting recall.
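The two judge fixes in this commit can be sketched roughly as below. Both functions are hypothetical stand-ins: the mock judge is assumed to use a case-insensitive containment check in either direction, and the verdict check is assumed to look only at the first word of the judge model's reply.

```typescript
// Hypothetical mock judge. The bug: String.prototype.includes("") is always
// true, so an empty prediction was graded correct. Guard empties first.
function mockJudge(prediction: string, gold: string): boolean {
  const p = prediction.trim().toLowerCase();
  const g = gold.trim().toLowerCase();
  if (p === "" || g === "") return false;
  return p.includes(g) || g.includes(p);
}

// Hypothetical verdict check for the LLM judge: accept only a reply whose
// first word is "correct". "incorrect", "not correct", and
// "partially correct" all fail because "correct" is not the leading word.
function isLeadingCorrect(reply: string): boolean {
  return /^\s*correct\b/i.test(reply);
}
```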
Replace the hit-count settle check with retriever.health(): wait until numDocs covers every upserted chunk and percentIndexed reaches 1. A question query could return min(k, chunkCount) hits while HNSW was still backfilling, so recall@k could be measured on an incomplete graph.
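A minimal sketch of this kind of readiness gate, assuming only that the health call returns `numDocs` and `percentIndexed`. `HealthLike`, `SettleOptions`, and `waitForIndexSettled` are hypothetical names. The readiness threshold is a parameter because a later commit in this PR switches the check to a 0-100 scale.

```typescript
// Assumed subset of the snapshot returned by retriever.health().
interface HealthLike {
  numDocs: number;
  percentIndexed: number;
}

interface SettleOptions {
  expectedDocs: number; // number of chunks that were upserted
  readyPercent: number; // 1 on a 0-1 scale, 100 on a 0-100 scale
  maxPolls: number;
  delayMs: number;
}

// Poll health until every upserted chunk is counted and indexing is complete.
// Returns false if the poll window runs out, so the caller can warn instead
// of measuring recall on a graph that is still backfilling.
async function waitForIndexSettled(
  health: () => Promise<HealthLike>,
  opts: SettleOptions,
): Promise<boolean> {
  for (let i = 0; i < opts.maxPolls; i++) {
    const h = await health();
    if (h.numDocs >= opts.expectedDocs && h.percentIndexed >= opts.readyPercent) {
      return true;
    }
    if (i < opts.maxPolls - 1) {
      await new Promise<void>((resolve) => setTimeout(resolve, opts.delayMs));
    }
  }
  return false;
}
```

Gating on document count and indexing progress, rather than on how many hits a query returns, avoids the case where min(k, chunkCount) hits come back from a partially built graph.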
…ession splitting
- runner: require percentIndexed >= 100 (0-100 scale) so recall is not
measured while HNSW is still backfilling
- judge: parse the verdict by whole word and reject negated/partial forms
("incorrect", "not correct", "partially correct") instead of matching the
first word only
- adapter: split sessions that exceed the embedder input budget into multiple
chunks sharing the same session_id, so long sessions embed instead of failing
- adapter: hard-slice a single turn that exceeds the embedder budget in turn mode (extracted sliceToBudget, shared with session packing) so a long turn embeds instead of failing or dropping the record
- runner: prefix each retrieved excerpt with its session date and pass the question's asked-on date to the reader, so temporal-reasoning QA is not depressed by missing date context
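The adapter changes above can be illustrated with a small sketch. Only the name `sliceToBudget` comes from the commit message; its signature, the `packSession` helper, the `Chunk` shape, and measuring the budget in characters (rather than tokens) are assumptions made for illustration.

```typescript
// Hard-slice text into pieces of at most `budget` characters.
// Note: slicing by UTF-16 code units can split a surrogate pair.
function sliceToBudget(text: string, budget: number): string[] {
  if (budget <= 0) throw new RangeError("budget must be positive");
  const parts: string[] = [];
  for (let i = 0; i < text.length; i += budget) {
    parts.push(text.slice(i, i + budget));
  }
  return parts;
}

interface Chunk {
  session_id: string;
  text: string;
}

// Greedily pack a session's turns into chunks that each fit the budget.
// Every chunk keeps the same session_id, so recall is still scored per session.
function packSession(sessionId: string, turns: string[], budget: number): Chunk[] {
  const chunks: Chunk[] = [];
  let current = "";
  const flushCurrent = (): void => {
    if (current !== "") {
      chunks.push({ session_id: sessionId, text: current });
      current = "";
    }
  };
  for (const turn of turns) {
    // An oversized turn is sliced first, so no single piece exceeds the budget.
    for (const piece of sliceToBudget(turn, budget)) {
      const joined = current === "" ? piece : current + "\n" + piece;
      if (joined.length > budget) {
        flushCurrent();
        current = piece;
      } else {
        current = joined;
      }
    }
  }
  flushCurrent();
  return chunks;
}
```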
…reader
The judge was called with the raw record.question while the reader saw the question augmented with question_date, leaving the grader without the temporal anchor and skewing temporal-reasoning QA. Pass the augmented question to both.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 644eb6d.
embedder.flush ran only after every record finished, so a mid-run failure (embedding API error, dimension mismatch, Valkey or reader/judge error) discarded the embeddings already computed in memory — billable work lost on a long OpenAI-backed run. Move the flush into a finally so partial progress is always persisted.
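The shape of this fix, combined with the earlier non-fatal flush, might look like the sketch below. `CachingEmbedder` and `runWithFlush` are hypothetical names; the only thing taken from the PR is that the embedder exposes a `flush` and that it now runs in a `finally`.

```typescript
// Assumed minimal interface: an embedder that can persist its cache.
interface CachingEmbedder {
  flush(): Promise<void>;
}

// Run the eval and always attempt to persist the embedding cache afterwards,
// whether the run succeeded or threw part-way through.
async function runWithFlush<T>(
  embedder: CachingEmbedder,
  run: () => Promise<T>,
): Promise<T> {
  try {
    return await run();
  } finally {
    try {
      await embedder.flush();
    } catch (err) {
      // Non-fatal: a cache write failure must not replace the run's own
      // result or error.
      console.warn("embedding cache flush failed:", err);
    }
  }
}
```

Catching inside the `finally` matters: an exception thrown from a `finally` block would otherwise replace the original return value or error.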
Reader and judge models were hardcoded constants, so a like-for-like comparison config (e.g. gpt-4o to match Zep's published LongMemEval_S setup) could only be run by editing the defaults in place, which mutates the existing config and makes the two runs non-reproducible side by side. Read the models from LONGMEMEVAL_READER_MODEL / LONGMEMEVAL_JUDGE_MODEL, defaulting to gpt-5.4 / gpt-5.5 so existing results stay reproducible when the env is unset.
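A minimal sketch of the env-driven model selection described above. The env variable names and defaults come from the commit message; `resolveModels`, the `Env` type, and treating a blank value as unset are assumptions.

```typescript
// Environment passed in explicitly so the sketch stays self-contained.
type Env = Record<string, string | undefined>;

const DEFAULT_READER_MODEL = "gpt-5.4";
const DEFAULT_JUDGE_MODEL = "gpt-5.5";

// Resolve reader and judge models independently, falling back to the
// defaults when a variable is unset or blank (the blank case is an assumption).
function resolveModels(env: Env): { reader: string; judge: string } {
  const pick = (value: string | undefined, fallback: string): string => {
    const trimmed = value?.trim();
    return trimmed ? trimmed : fallback;
  };
  return {
    reader: pick(env["LONGMEMEVAL_READER_MODEL"], DEFAULT_READER_MODEL),
    judge: pick(env["LONGMEMEVAL_JUDGE_MODEL"], DEFAULT_JUDGE_MODEL),
  };
}
```

In a Node script this would typically be called as `resolveModels(process.env)`, so a comparison run only needs different env values and leaves the defaults untouched.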
The extractive-only reader prompt ("answer using only the excerpts; else
say I don't know") made the reader abstain on every recommendation-style
question, since preference golds are rubrics with no answer string in the
excerpts. Revise the single shared prompt to keep factual abstention while
inferring a recommendation grounded strictly in preferences/signals present
in the retrieved excerpts (no invented preferences, no general knowledge).
jamby77 left a comment
Nice harness — clean four-seam design (embedder/store/reader/judge) and the defensive comments make it easy to review. Build/config and API usage check out: eval/ is correctly excluded from the package tsc build, the Tier-0 test is genuinely hermetic (fixture + mocks, no env/network), and the recall@k / QA-accuracy aggregation guards empty categories. The judge (\bcorrect\b word-boundary so "incorrect" doesn't match, empty-string guard) and the non-fatal flush in finally are well done.
One non-blocking thing: the in-memory mock store's numeric @field:[min max] filter does exact-equality on the min and ignores the max (eval/longmemeval/store.ts:68-70, 249 — value: parseFloat(m[2]) then Number(stored) !== f.value), and a malformed bound yields NaN (so !== NaN matches nothing). It's effectively dead today since the schema only declares TAG fields (session_id, date), so the reported numbers are unaffected — but it'll silently mis-filter if anyone adds a numeric field to a future test. Worth a min <= x <= max + Number.isFinite guard while it's fresh.
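One way the suggested guard could look, as a sketch only. `parseNumericRange` and `matchesNumericRange` are hypothetical names, not the functions in store.ts, and this version treats non-finite bounds (including `-inf`/`+inf`) as malformed instead of supporting open-ended or exclusive ranges.

```typescript
interface NumericRange {
  field: string;
  min: number;
  max: number;
}

// Parse "@field:[min max]". Returns null for anything malformed, including
// non-numeric or non-finite bounds, so a bad filter never becomes NaN.
function parseNumericRange(filter: string): NumericRange | null {
  const m = /^@(\w+):\[\s*(\S+)\s+(\S+)\s*\]$/.exec(filter.trim());
  if (!m) return null;
  const min = Number(m[2]);
  const max = Number(m[3]);
  if (!Number.isFinite(min) || !Number.isFinite(max) || min > max) return null;
  return { field: m[1] ?? "", min, max };
}

// Inclusive range check: min <= x <= max, rejecting missing or blank values.
function matchesNumericRange(
  stored: string | number | undefined,
  range: NumericRange,
): boolean {
  if (stored === undefined) return false;
  if (typeof stored === "string" && stored.trim() === "") return false;
  const x = Number(stored);
  return Number.isFinite(x) && x >= range.min && x <= range.max;
}
```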
Minor nits (optional): dataset.ts JSON.parse is unguarded (raw SyntaxError on a malformed dataset — runs before any billable work, so nothing's lost), and isGpt5Tier's /^gpt-5/i won't catch a future temperature-rejecting family, though that fails loud (API 400) rather than silently.
Approving — the one real finding is latent and the rest are nits.
94a8bf3 to bd4a678

Add a mock-by-default LongMemEval retrieval+QA harness for @betterdb/retrieval with four env-selected seams (embedder, store, reader, judge): OPENAI_API_KEY enables real embedder/reader/judge, a reachable VALKEY_URL enables the real valkey-search store, else deterministic offline mocks. Reader and judge models are set independently (gpt-5.4 reader, gpt-5.5 judge); temperature is omitted for GPT-5-tier models and kept at 0 otherwise. The embedding cache flush is non-fatal and runs in a finally, so a serialization failure at large haystack scale cannot discard a completed eval or lose embeddings already computed. Includes a Tier-0 offline vitest smoke test; datasets and the embedding cache are gitignored.
Summary
Adds a runnable LongMemEval harness to the retrieval eval suite and makes the reader and judge models independently configurable. Establishes our first reportable agent-memory benchmark: 75.6% QA-accuracy with 98.4% recall@10 on the full longmemeval_s split (all 500 records). Retrieval is near-perfect; the remaining QA gap is reader-side reasoning, concentrated in single-session-preference and temporal-reasoning.
Results (longmemeval_s, full 500, k=10, gpt-5.4 reader / gpt-5.5 judge)
23,889 chunks indexed.
These numbers are after fixes for the BugBot-flagged issues (date-context handling for reader and judge, long-session splitting, index-readiness gating, and the non-fatal flush). Two things to call out:
Zep-matched comparison run (gpt-4o reader + judge)
Zep's published LongMemEval_S figure (71.2% overall) uses gpt-4o as both the agent reader and the judge. Our headline result above uses gpt-5.4/gpt-5.5, a stronger reader and not a fair architecture-to-architecture comparison. To control the reader/judge variable, these runs re-run the full _s split with gpt-4o on both seams (everything else identical: same embedder, same chunk-retrieval memory, same k). Reader/judge models are env-overridable (LONGMEMEVAL_READER_MODEL/LONGMEMEVAL_JUDGE_MODEL), defaulting to gpt-5.4/gpt-5.5 so the headline result stays reproducible.
Reader prompt fix. The first gpt-4o run scored 0/30 on single-session-preference. A read-only diagnostic established this was not grader strictness: the gpt-4o reader literally answered "I don't know" on all 30 preference questions, and a lenient preference-aware judge produced 0 flips: it agreed with the generic judge on every record. Root cause: the extractive-only reader prompt ("answer using only the excerpts; else say I don't know") combined with preference golds being recommendation rubrics that have no answer string in the excerpts. The reader prompt was revised to keep factual abstention while inferring recommendations grounded strictly in preferences/signals present in the retrieved excerpts (no invented preferences, no general knowledge). The two gpt-4o runs below isolate that single prompt change.
Footer: reader gpt-4o · judge gpt-4o (generic, unchanged) · embedder text-embedding-3-small (1536d) · split longmemeval_s · k=10 · 500 records · reader prompt generative-grounded (v2).
No category regressed — every category rose. Recall is unchanged at 98.4% in both runs (the prompt never touches retrieval). The gains are broad, not preference-only: the old extractive framing was also making gpt-4o over-abstain on factual questions, so the v2 prompt converts wrong abstentions into correct answers across the board (accuracy cannot rise by answering incorrectly). A grounding spot-check confirmed the reader anchors recommendations to the user's actual history in the excerpts and does not fabricate preferences; where retrieval is thin it falls back to generic advice that the judge correctly marks incorrect — which is why single-session-preference improves only to 13.3% despite 93.3% recall.
Three configs side by side (gpt-4o-matched = generative-grounded v2):
With the reader/judge LLM controlled to gpt-4o, our retrieval-over-chunks memory scores 67.8% vs Zep's published 71.2% (−3.4 pt). This is not a clean win or loss: two variables remain uncontrolled — (1) embedder (ours text-embedding-3-small vs Zep's BGE-m3) and (2) memory architecture (ours retrieval-over-chunks vs Zep's temporal knowledge graph) — plus reader-prompt/agent and judge-rubric differences. Recall is unchanged at 98.4%, so the entire remaining QA gap is reader-reasoning + grading, not retrieval.
Changes
finally, so a mid-run failure still persists embeddings already computed and a serialization failure at _s scale cannot crash a completed eval.
Checklist
roborev review --branch or /roborev-review-branch in Claude Code (internal)
Note
Low Risk
Eval-only tooling and tests under packages/retrieval; no production retrieval API changes. Optional OpenAI calls use env API keys in local/CI eval runs.
Overview
Adds a LongMemEval benchmark harness under packages/retrieval/eval/longmemeval that exercises @betterdb/retrieval end-to-end: chunk haystack sessions into UpsertEntry records, index per question, measure recall@k (evidence session_id in top hits), and optionally QA accuracy via reader + judge.
Runnable tiers are selected by env: no API key → deterministic mock embedder + in-memory Valkey-search mock; OPENAI_API_KEY → OpenAI embeddings (cached on disk, non-fatal flush in finally); reachable VALKEY_URL → real valkey-search with HNSW index-settling before query. LONGMEMEVAL_QA=1 enables Tier 2 with independently configurable reader/judge models (LONGMEMEVAL_READER_MODEL/LONGMEMEVAL_JUDGE_MODEL), date-aware QA prompts, and GPT-5-tier temperature handling.
Chunking supports session vs turn mode, splits oversized sessions/turns for embedder limits while preserving session_id for recall. .gitignore excludes eval cache and downloaded datasets; eval:longmemeval npm script and Tier 0 vitest smoke cover offline recall/QA/chunking without network.
Reviewed by Cursor Bugbot for commit eab9387.
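For readers unfamiliar with the metric, recall@k as described in the overview (an evidence session_id appearing in the top hits) can be sketched as below. The `QuestionResult` shape and function names are hypothetical, and counting a question as a hit when at least one evidence session appears in the top k is an assumption; the harness may require all evidence sessions instead.

```typescript
// Hypothetical per-question record for scoring.
interface QuestionResult {
  evidenceSessionIds: string[]; // gold sessions that contain the answer
  retrievedSessionIds: string[]; // session_id of each hit, in rank order
}

// Assumption: a question is a hit if ANY evidence session is in the top k.
function isHitAtK(result: QuestionResult, k: number): boolean {
  const top = new Set(result.retrievedSessionIds.slice(0, k));
  return result.evidenceSessionIds.some((id) => top.has(id));
}

// Fraction of questions with a hit in the top k. Returns null for an empty
// category so an aggregate never divides by zero.
function recallAtK(results: QuestionResult[], k: number): number | null {
  if (results.length === 0) return null;
  const hits = results.filter((r) => isHitAtK(r, k)).length;
  return hits / results.length;
}
```

Because chunks from a split session share one session_id, a hit on any chunk of the evidence session counts toward recall under this definition.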