feat(retrieval): LongMemEval evaluation harness#252

Open
KIvanow wants to merge 11 commits into
feature/retrieval-sdk-phase6-integrationfrom
feature/retrieval-longmemeval-harness

@KIvanow KIvanow commented Jun 16, 2026

Member

Add a mock-by-default LongMemEval retrieval+QA harness for @betterdb/retrieval with four env-selected seams (embedder, store, reader, judge): OPENAI_API_KEY enables real embedder/reader/judge, a reachable VALKEY_URL enables the real valkey-search store, else deterministic offline mocks. Reader and judge models are set independently (gpt-5.4 reader, gpt-5.5 judge); temperature is omitted for GPT-5-tier models and kept at 0 otherwise. The embedding cache flush is non-fatal and runs in a finally, so a serialization failure at large haystack scale cannot discard a completed eval or lose embeddings already computed. Includes a Tier-0 offline vitest smoke test; datasets and the embedding cache are gitignored.

Summary

Adds a runnable LongMemEval harness to the retrieval eval suite and makes the reader and judge models independently configurable. Establishes our first reportable agent-memory benchmark: 75.6% QA-accuracy with 98.4% recall@10 on the full longmemeval_s split (all 500 records). Retrieval is near-perfect; the remaining QA gap is reader-side reasoning, concentrated in single-session-preference and temporal-reasoning.

Results (longmemeval_s, full 500, k=10, gpt-5.4 reader / gpt-5.5 judge)

| question_type | n | recall@k | QA-acc |
|---|---|---|---|
| knowledge-update | 78 | 100.0% | 87.2% |
| multi-session | 133 | 100.0% | 66.2% |
| single-session-assistant | 56 | 100.0% | 98.2% |
| single-session-preference | 30 | 93.3% | 50.0% |
| single-session-user | 70 | 95.7% | 94.3% |
| temporal-reasoning | 133 | 97.7% | 64.7% |
| OVERALL | 500 | 98.4% | 75.6% |

23,889 chunks indexed.
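
As a sanity check on the table above, the OVERALL row is consistent with an n-weighted mean of the per-type rows rather than a plain average of the six percentages. The sketch below uses only the rounded figures from the table; the harness itself presumably aggregates from raw per-record outcomes, and the `Row` type and `weightedPct` helper are illustrative names, not taken from the PR.

```typescript
// Per-type rows copied from the results table: [question_type, n, recall@k %, QA-acc %].
type Row = [string, number, number, number];

const rows: Row[] = [
  ["knowledge-update", 78, 100.0, 87.2],
  ["multi-session", 133, 100.0, 66.2],
  ["single-session-assistant", 56, 100.0, 98.2],
  ["single-session-preference", 30, 93.3, 50.0],
  ["single-session-user", 70, 95.7, 94.3],
  ["temporal-reasoning", 133, 97.7, 64.7],
];

// Weighted mean of one percentage column, weighting each type by its record count.
function weightedPct(data: Row[], col: 2 | 3): number {
  const total = data.reduce((sum, r) => sum + r[1], 0);
  const weighted = data.reduce((sum, r) => sum + r[1] * r[col], 0);
  return weighted / total;
}

const totalRecords = rows.reduce((sum, r) => sum + r[1], 0);
const overallRecall = weightedPct(rows, 2).toFixed(1);
const overallQa = weightedPct(rows, 3).toFixed(1);

console.log(totalRecords, overallRecall, overallQa);
```

Because the large multi-session and temporal-reasoning categories (133 records each) carry the lowest QA scores, the weighted figure sits below the unweighted average of the six rows.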

These numbers are after fixes for the BugBot-flagged issues (date-context handling for reader and judge, long-session splitting, index-readiness gating, and the non-fatal flush). Three things to call out:

  • Recall is unchanged at 98.4% — exactly as expected; the date-context and judge fixes touch only the QA/grading path, not retrieval.
  • Temporal-reasoning QA more than doubled (25.0% → 64.7%), driving the overall QA jump from 61.0% to 75.6%. That's the date-context fix paying off: the reader and judge now both see each excerpt's session date plus the question's asked-on date, which temporal questions depend on.
  • This run was the full 500 records (not a filtered 495) — the 5 previously-oversized records run without crashing, validating the long-session split.

Zep-matched comparison run (gpt-4o reader + judge)

Zep's published LongMemEval_S figure (71.2% overall) uses gpt-4o as both the agent reader and the judge. Our headline result above uses gpt-5.4/gpt-5.5, a stronger reader, so it is not a fair architecture-to-architecture comparison. To control the reader/judge variable, the runs below repeat the full _s split with gpt-4o on both seams (everything else identical: same embedder, same chunk-retrieval memory, same k). Reader/judge models are env-overridable (LONGMEMEVAL_READER_MODEL / LONGMEMEVAL_JUDGE_MODEL), defaulting to gpt-5.4/gpt-5.5 so the headline result stays reproducible.
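
The env override described above can be sketched roughly as follows. Only the two variable names and the two default model names come from this PR; `resolveQaModels`, `QaModels`, and the blank-value handling are assumptions made for illustration. The environment is passed in as a plain object so the sketch stays testable.

```typescript
// Hypothetical helper: the harness's real function names are not shown in this PR text.
type Env = Record<string, string | undefined>;

interface QaModels {
  reader: string;
  judge: string;
}

const DEFAULT_READER_MODEL = "gpt-5.4";
const DEFAULT_JUDGE_MODEL = "gpt-5.5";

function resolveQaModels(env: Env): QaModels {
  // Assumption: unset or blank values fall back to the default.
  const pick = (value: string | undefined, fallback: string): string => {
    const trimmed = value?.trim();
    return trimmed ? trimmed : fallback;
  };
  return {
    reader: pick(env["LONGMEMEVAL_READER_MODEL"], DEFAULT_READER_MODEL),
    judge: pick(env["LONGMEMEVAL_JUDGE_MODEL"], DEFAULT_JUDGE_MODEL),
  };
}

// Headline config (env unset) versus the gpt-4o-matched comparison config.
const headline = resolveQaModels({});
const matched = resolveQaModels({
  LONGMEMEVAL_READER_MODEL: "gpt-4o",
  LONGMEMEVAL_JUDGE_MODEL: "gpt-4o",
});
console.log(headline, matched);
```

In a Node run the caller would pass `process.env`; keeping the defaults in code means an unset environment reproduces the headline numbers.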

Reader prompt fix. The first gpt-4o run scored 0/30 on single-session-preference. A read-only diagnostic established this was not grader strictness: the gpt-4o reader literally answered "I don't know" on all 30 preference questions, and a lenient preference-aware judge produced 0 flips — it agreed with the generic judge on every record. Root cause: the extractive-only reader prompt ("answer using only the excerpts; else say I don't know") combined with preference golds being recommendation rubrics that have no answer string in the excerpts. The reader prompt was revised to keep factual abstention while inferring recommendations grounded strictly in preferences/signals present in the retrieved excerpts (no invented preferences, no general knowledge). The two gpt-4o runs below isolate that single prompt change.

| question_type | n | recall@k | extractive (v1) | generative-grounded (v2) | Δ (pts) |
|---|---|---|---|---|---|
| knowledge-update | 78 | 100.0% | 80.8% | 82.1% | +1.3 |
| multi-session | 133 | 100.0% | 49.6% | 62.4% | +12.8 |
| single-session-assistant | 56 | 100.0% | 91.1% | 100.0% | +8.9 |
| single-session-preference | 30 | 93.3% | 0.0% | 13.3% | +13.3 |
| single-session-user | 70 | 95.7% | 74.3% | 92.9% | +18.6 |
| temporal-reasoning | 133 | 97.7% | 34.6% | 50.4% | +15.8 |
| OVERALL | 500 | 98.4% | 55.6% | 67.8% | +12.2 |

Footer: reader gpt-4o · judge gpt-4o (generic, unchanged) · embedder text-embedding-3-small (1536d) · split longmemeval_s · k=10 · 500 records · reader prompt generative-grounded (v2).

No category regressed — every category rose. Recall is unchanged at 98.4% in both runs (the prompt never touches retrieval). The gains are broad, not preference-only: the old extractive framing was also making gpt-4o over-abstain on factual questions, so the v2 prompt converts wrong abstentions into correct answers across the board (accuracy cannot rise by answering incorrectly). A grounding spot-check confirmed the reader anchors recommendations to the user's actual history in the excerpts and does not fabricate preferences; where retrieval is thin it falls back to generic advice that the judge correctly marks incorrect — which is why single-session-preference improves only to 13.3% despite 93.3% recall.

Three configs side by side (gpt-4o-matched = generative-grounded v2):

| config | reader / judge | embedder | memory | recall@10 | QA overall |
|---|---|---|---|---|---|
| Ours (frontier) | gpt-5.4 / gpt-5.5 | text-embedding-3-small | retrieval-over-chunks | 98.4% | 75.6% |
| Ours (gpt-4o-matched) | gpt-4o / gpt-4o | text-embedding-3-small | retrieval-over-chunks | 98.4% | 67.8% |
| Zep (published) | gpt-4o / gpt-4o | BGE-m3 | temporal knowledge graph | | 71.2% |

With the reader/judge LLM controlled to gpt-4o, our retrieval-over-chunks memory scores 67.8% vs Zep's published 71.2% (−3.4 pt). This is not a clean win or loss: two variables remain uncontrolled — (1) embedder (ours text-embedding-3-small vs Zep's BGE-m3) and (2) memory architecture (ours retrieval-over-chunks vs Zep's temporal knowledge graph) — plus reader-prompt/agent and judge-rubric differences. Recall is unchanged at 98.4%, so the entire remaining QA gap is reader-reasoning + grading, not retrieval.

Changes

  • Threaded a model parameter through chat() and gave the judge its own JUDGE_MODEL constant, so reader and judge are independently settable (previously the judge inherited the reader's model and the judge name label was cosmetic/misleading).
  • Set reader gpt-5.4, judge gpt-5.5; judge label now derives from the actual judge model. Reader/judge models are env-overridable via LONGMEMEVAL_READER_MODEL / LONGMEMEVAL_JUDGE_MODEL (defaults unchanged), so a like-for-like comparison config (e.g. gpt-4o) runs without editing the defaults or making the two runs non-reproducible.
  • Conditional temperature handling: omit temperature for GPT-5-tier models (which reject non-default values), keep temperature: 0 otherwise. Applied per-model so reader and judge each resolve correctly.
  • Split sessions exceeding the embedder's input budget into multiple chunks sharing the same session_id (and hard-slice an oversized single turn in turn mode), so every record embeds and indexes without dropping evidence; recall stays session-granular. All 500 _s records run (23,889 chunks).
  • Reader prompt distinguishes fact-extraction from recommendation: factual questions keep "answer from the excerpts or abstain", while recommendation/advice/tips questions infer an answer grounded strictly in preferences present in the retrieved excerpts (no invented preferences, no general knowledge). This removes spurious abstention; on the gpt-4o-matched config it lifts overall QA 55.6% → 67.8% with no per-type regression and unchanged recall.
  • QA prompts are date-aware: each retrieved excerpt is prefixed with its session date and the question's asked-on date is appended to the question given to BOTH the reader and the judge, so temporal items are graded on the same anchored question the reader saw.
  • Index readiness on real Valkey gates on full index coverage (numDocs + percentIndexed) before querying, so recall is not measured while HNSW is still backfilling.
  • Embedding cache flush is non-fatal and runs in a finally, so a mid-run failure still persists embeddings already computed and a serialization failure at _s scale cannot crash a completed eval.
  • Per-type results table with config footer (reader/judge/embedder/split/k/record count) so no number is ever quoted without its split.
  • Datasets (longmemeval_oracle.json, longmemeval_s) remain gitignored; embedding cache reused across runs where it fits (the _s map exceeds V8's max string length, so _s runs are cold by necessity — results are unaffected since embeddings are deterministic).
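
The conditional temperature handling in the list above might look roughly like this. The `/^gpt-5/i` pattern and the `isGpt5Tier` name are quoted in the review on this PR; `ChatParams` and `buildChatParams` are hypothetical names, and the request shape is a simplified stand-in rather than the harness's actual chat() signature.

```typescript
// Pattern quoted in the review; treats any model id starting with "gpt-5" as GPT-5-tier.
function isGpt5Tier(model: string): boolean {
  return /^gpt-5/i.test(model);
}

// Simplified, hypothetical request shape for illustration only.
interface ChatParams {
  model: string;
  messages: { role: "system" | "user"; content: string }[];
  temperature?: number;
}

function buildChatParams(model: string, system: string, user: string): ChatParams {
  const params: ChatParams = {
    model,
    messages: [
      { role: "system", content: system },
      { role: "user", content: user },
    ],
  };
  // Omit the key entirely for GPT-5-tier models; pin to 0 for everything else.
  if (!isGpt5Tier(model)) {
    params.temperature = 0;
  }
  return params;
}

const readerParams = buildChatParams("gpt-5.4", "You answer from excerpts.", "Q?");
const matchedParams = buildChatParams("gpt-4o", "You answer from excerpts.", "Q?");
console.log("temperature" in readerParams, matchedParams.temperature);
```

Resolving this per call, from the model actually passed in, is what lets a gpt-5.4 reader and a gpt-4o judge (or the reverse) coexist in one run.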

Checklist

  • Unit / integration tests added
  • Docs added / updated
  • Roborev review passed — run roborev review --branch or /roborev-review-branch in Claude Code (internal)
  • Competitive analysis done / discussed (internal)
  • Blog post about it discussed (internal)

Note

Low Risk
Eval-only tooling and tests under packages/retrieval; no production retrieval API changes. Optional OpenAI calls use env API keys in local/CI eval runs.

Overview
Adds a LongMemEval benchmark harness under packages/retrieval/eval/longmemeval that exercises @betterdb/retrieval end-to-end: chunk haystack sessions into UpsertEntry records, index per question, measure recall@k (evidence session_id in top hits), and optionally QA accuracy via reader + judge.
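
A minimal sketch of the session-granular recall@k described above. It assumes a question counts as a hit when any evidence session_id appears among the top-k results; whether the harness requires any or all evidence sessions is not stated in this PR, and the `Hit` type and function names are illustrative.

```typescript
// Hypothetical shape of a retrieved chunk; several chunks may share one sessionId.
interface Hit {
  sessionId: string;
}

// Assumption: one matching evidence session in the top k is enough for a hit.
function recallHit(hits: Hit[], evidenceSessionIds: string[], k: number): boolean {
  const evidence = new Set(evidenceSessionIds);
  return hits.slice(0, k).some((h) => evidence.has(h.sessionId));
}

// Fraction of questions with a hit; an empty category reports 0 instead of NaN.
function recallAtK(outcomes: boolean[]): number {
  if (outcomes.length === 0) return 0;
  return outcomes.filter(Boolean).length / outcomes.length;
}

const ranked: Hit[] = [{ sessionId: "a" }, { sessionId: "b" }, { sessionId: "c" }];
console.log(recallHit(ranked, ["c"], 2), recallHit(ranked, ["c"], 3));
```

Matching on session_id rather than chunk id is what keeps recall comparable when a long session is split into several chunks.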

Runnable tiers are selected by env: no API key → deterministic mock embedder + in-memory Valkey-search mock; OPENAI_API_KEY → OpenAI embeddings (cached on disk, non-fatal flush in finally); reachable VALKEY_URL → real valkey-search with HNSW index-settling before query. LONGMEMEVAL_QA=1 enables Tier 2 with independently configurable reader/judge models (LONGMEMEVAL_READER_MODEL / LONGMEMEVAL_JUDGE_MODEL), date-aware QA prompts, and GPT-5-tier temperature handling.
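
The env-driven tier selection above can be summarized as a small decision function. This is a sketch of the rules as stated in the description, not the harness's code: `selectSeams`, `SeamSelection`, and the `valkeyReachable` flag (standing in for whatever connectivity probe the runner performs) are assumptions.

```typescript
type Env = Record<string, string | undefined>;

interface SeamSelection {
  embedder: "mock" | "openai";
  store: "in-memory-mock" | "valkey-search";
  qaEnabled: boolean;
  readerAndJudge: "disabled" | "mock" | "openai";
}

// valkeyReachable is supplied by the caller; the real probe is not shown in this PR.
function selectSeams(env: Env, valkeyReachable: boolean): SeamSelection {
  const hasKey = Boolean(env["OPENAI_API_KEY"]?.trim());
  const hasValkey = Boolean(env["VALKEY_URL"]?.trim()) && valkeyReachable;
  const qaEnabled = env["LONGMEMEVAL_QA"] === "1";
  return {
    embedder: hasKey ? "openai" : "mock",
    store: hasValkey ? "valkey-search" : "in-memory-mock",
    qaEnabled,
    readerAndJudge: !qaEnabled ? "disabled" : hasKey ? "openai" : "mock",
  };
}

const offline = selectSeams({}, false);
const full = selectSeams(
  { OPENAI_API_KEY: "sk-example", VALKEY_URL: "redis://localhost:6384", LONGMEMEVAL_QA: "1" },
  true,
);
console.log(offline, full);
```

With nothing set, every seam resolves to a mock, which is the property the Tier 0 smoke test relies on.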

Chunking supports session vs turn mode, splits oversized sessions/turns for embedder limits while preserving session_id for recall. .gitignore excludes eval cache and downloaded datasets; eval:longmemeval npm script and Tier 0 vitest smoke cover offline recall/QA/chunking without network.
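
One way to picture the oversized-session handling is greedy packing with a hard-slice fallback. The sketch below measures the budget in characters purely for simplicity (the real embedder budget is more likely token-based), and only the `sliceToBudget` name appears in this PR's commits; `chunkSession` and the `Chunk` shape are hypothetical.

```typescript
interface Chunk {
  sessionId: string;
  text: string;
}

// Hard-slice text into pieces of at most `budget` characters.
function sliceToBudget(text: string, budget: number): string[] {
  if (budget <= 0) throw new Error("budget must be positive");
  const pieces: string[] = [];
  for (let i = 0; i < text.length; i += budget) {
    pieces.push(text.slice(i, i + budget));
  }
  return pieces;
}

// Pack turns greedily; every emitted chunk keeps the original sessionId.
function chunkSession(sessionId: string, turns: string[], budget: number): Chunk[] {
  const chunks: Chunk[] = [];
  let current = "";
  const flush = (): void => {
    if (current) {
      chunks.push({ sessionId, text: current });
      current = "";
    }
  };
  for (const turn of turns) {
    // An oversized single turn is sliced first so no piece exceeds the budget.
    for (const piece of sliceToBudget(turn, budget)) {
      const joined = current ? current + "\n" + piece : piece;
      if (joined.length > budget) {
        flush();
        current = piece;
      } else {
        current = joined;
      }
    }
  }
  flush();
  return chunks;
}

const chunks = chunkSession("s1", ["aaaa", "bb", "cccccccccc"], 6);
console.log(chunks.map((c) => c.text));
```

Because every chunk carries the same `sessionId`, a retrieval hit on any slice still counts toward the evidence session.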

Reviewed by Cursor Bugbot for commit eab9387.

jamby77 and others added 3 commits June 16, 2026 13:01
- Register/unregister a type:retrieval marker on __betterdb:caches
- Add health() returning an IndexHealthSnapshot (percent_indexed, dims,
  numDocs, indexing state, optional recall-estimate hook)
- Add a pluggable telemetry seam (RetrievalMetrics, RetrievalTracer) wired
  into query/upsert: operation duration, query result counts, embedding calls
- Add createPrometheusMetrics factory (prom-client) implementing RetrievalMetrics
- Export discovery, health, and telemetry types

- Add Retriever.integration.test.ts (skip-guarded, VALKEY_URL default 6384)
- Cover create index, upsert, vector query returning the upserted doc,
  TAG and NUMERIC filters, delete, health, and drop end-to-end
- Add iovalkey devDependency for the integration client

Add a mock-by-default LongMemEval retrieval+QA harness for
@betterdb/retrieval with four env-selected seams (embedder, store,
reader, judge): OPENAI_API_KEY enables real embedder/reader/judge,
a reachable VALKEY_URL enables the real valkey-search store, else
deterministic offline mocks. Reader and judge models are set
independently (gpt-5.4 reader, gpt-5.5 judge); temperature is omitted
for GPT-5-tier models and kept at 0 otherwise. The embedding cache
flush is non-fatal so a serialization failure at large haystack scale
cannot discard a completed eval. Includes a Tier-0 offline vitest
smoke test; datasets and the embedding cache are gitignored.
@KIvanow KIvanow requested a review from jamby77 June 16, 2026 19:05
Comment thread packages/retrieval/eval/longmemeval/judge.ts
Comment thread packages/retrieval/eval/longmemeval/judge.ts Outdated
Comment thread packages/retrieval/eval/longmemeval/runner.ts
- Mock judge no longer grades an empty prediction correct (g.includes('')
  is always true), which inflated QA accuracy after a retrieval miss.
- OpenAI judge matches only a leading "correct" word so "incorrect",
  "not correct", and "partially correct" are not scored as correct.
- Real-store run warns when HNSW indexing does not settle within the poll
  window instead of silently undercounting recall.
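
The empty-prediction bug in the first item comes from a JavaScript string rule: every string includes the empty string. A minimal sketch of a guarded mock judge follows; the containment-based matching and the `mockJudge` name are assumptions about how the offline judge compares strings, not the harness's code.

```typescript
// Hypothetical offline judge: substring match in either direction, after normalization.
function mockJudge(prediction: string, gold: string): boolean {
  const p = prediction.trim().toLowerCase();
  const g = gold.trim().toLowerCase();
  // Without this guard, g.includes("") is true and an empty answer grades as correct.
  if (p === "" || g === "") return false;
  return g.includes(p) || p.includes(g);
}

console.log("Paris".includes(""), mockJudge("", "Paris"), mockJudge("paris", "Paris"));
```

The guard matters most after a retrieval miss, where a mock reader with nothing to read may return an empty prediction.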
Comment thread packages/retrieval/eval/longmemeval/runner.ts
Replace the hit-count settle check with retriever.health(): wait until
numDocs covers every upserted chunk and percentIndexed reaches 1. A
question query could return min(k, chunkCount) hits while HNSW was still
backfilling, so recall@k could be measured on an incomplete graph.
Comment thread packages/retrieval/eval/longmemeval/runner.ts Outdated
Comment thread packages/retrieval/eval/longmemeval/judge.ts Outdated
Comment thread packages/retrieval/eval/longmemeval/adapter.ts
…ession splitting

- runner: require percentIndexed >= 100 (0-100 scale) so recall is not
  measured while HNSW is still backfilling
- judge: parse the verdict by whole word and reject negated/partial forms
  ("incorrect", "not correct", "partially correct") instead of matching the
  first word only
- adapter: split sessions that exceed the embedder input budget into multiple
  chunks sharing the same session_id, so long sessions embed instead of failing
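
The readiness gate in the runner item can be sketched as a pure predicate plus a bounded poll loop. The `numDocs` and `percentIndexed` field names follow this PR's wording and the 0-100 scale follows the commit above; `IndexHealth`, `isIndexReady`, `waitForIndex`, and the injected `sleep` function are illustrative assumptions.

```typescript
// Assumed subset of the health snapshot; percentIndexed is on a 0-100 scale.
interface IndexHealth {
  numDocs: number;
  percentIndexed: number;
}

// Ready only when every upserted chunk is visible and backfill has finished.
function isIndexReady(health: IndexHealth, expectedDocs: number): boolean {
  return health.numDocs >= expectedDocs && health.percentIndexed >= 100;
}

// Poll until ready or until maxPolls is exhausted; the caller warns on false.
async function waitForIndex(
  getHealth: () => Promise<IndexHealth>,
  expectedDocs: number,
  sleep: (ms: number) => Promise<void>,
  maxPolls = 50,
  intervalMs = 200,
): Promise<boolean> {
  for (let i = 0; i < maxPolls; i++) {
    if (isIndexReady(await getHealth(), expectedDocs)) return true;
    await sleep(intervalMs);
  }
  return false;
}

console.log(isIndexReady({ numDocs: 10, percentIndexed: 99.5 }, 10));
```

Checking both fields guards against the case called out in the earlier commit, where a query already returns k hits while the graph is still incomplete.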
Comment thread packages/retrieval/eval/longmemeval/adapter.ts
Comment thread packages/retrieval/eval/longmemeval/runner.ts Outdated
- adapter: hard-slice a single turn that exceeds the embedder budget in turn
  mode (extracted sliceToBudget, shared with session packing) so a long turn
  embeds instead of failing or dropping the record
- runner: prefix each retrieved excerpt with its session date and pass the
  question's asked-on date to the reader, so temporal-reasoning QA is not
  depressed by missing date context
Comment thread packages/retrieval/eval/longmemeval/runner.ts Outdated
…reader

The judge was called with the raw record.question while the reader saw the
question augmented with question_date, leaving the grader without the temporal
anchor and skewing temporal-reasoning QA. Pass the augmented question to both.
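
A sketch of the date anchoring, with the key property that the reader and the judge receive the same augmented question. The exact prompt wording used by the harness is not shown in this PR, so the strings below, along with `anchorQuestion`, `formatExcerpts`, and the `Excerpt` shape, are placeholders.

```typescript
interface Excerpt {
  sessionDate: string;
  text: string;
}

// Placeholder wording: append the asked-on date to the question.
function anchorQuestion(question: string, questionDate: string): string {
  return `${question}\n(Question asked on: ${questionDate})`;
}

// Placeholder wording: prefix every retrieved excerpt with its session date.
function formatExcerpts(excerpts: Excerpt[]): string {
  return excerpts
    .map((e, i) => `[${i + 1}] (session date: ${e.sessionDate})\n${e.text}`)
    .join("\n\n");
}

const anchored = anchorQuestion("How long ago did I move?", "2023/05/30");
const context = formatExcerpts([{ sessionDate: "2023/02/14", text: "I moved last week." }]);

// Build the anchored question once and hand the same string to both seams.
const readerInput = { question: anchored, context };
const judgeInput = { question: anchored };
console.log(readerInput, judgeInput);
```

Building the anchored question once and passing it to both call sites makes the mismatch described above hard to reintroduce.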

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 644eb6d.

Comment thread packages/retrieval/eval/longmemeval/runner.ts Outdated
embedder.flush ran only after every record finished, so a mid-run failure
(embedding API error, dimension mismatch, Valkey or reader/judge error)
discarded the embeddings already computed in memory — billable work lost on a
long OpenAI-backed run. Move the flush into a finally so partial progress is
always persisted.
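
The flush-in-finally pattern can be shown in a few lines. This sketch is synchronous to keep it short, whereas the real run loop is presumably async; `runWithFlush` and the `Embedder` interface are hypothetical names.

```typescript
// Hypothetical minimal embedder surface: only the cache flush matters here.
interface Embedder {
  flush(): void;
}

function runWithFlush<T>(embedder: Embedder, run: () => T, warn: (msg: string) => void): T {
  try {
    return run();
  } finally {
    // Always attempt to persist; a failed flush is reported, never rethrown,
    // so it cannot replace a completed result or mask the original error.
    try {
      embedder.flush();
    } catch (err) {
      warn(`embedding cache flush failed: ${String(err)}`);
    }
  }
}

const warnings: string[] = [];
const result = runWithFlush(
  { flush: () => { throw new Error("string too long"); } },
  () => 42,
  (msg) => warnings.push(msg),
);
console.log(result, warnings.length);
```

The inner try/catch is what makes the flush non-fatal; the outer finally is what makes it run after a mid-run failure.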
@KIvanow KIvanow marked this pull request as ready for review June 16, 2026 21:55
KIvanow added 2 commits June 17, 2026 08:42
Reader and judge models were hardcoded constants, so a like-for-like
comparison config (e.g. gpt-4o to match Zep's published LongMemEval_S setup)
could only be run by editing the defaults in place, which mutates the existing
config and makes the two runs non-reproducible side by side. Read the models
from LONGMEMEVAL_READER_MODEL / LONGMEMEVAL_JUDGE_MODEL, defaulting to
gpt-5.4 / gpt-5.5 so existing results stay reproducible when the env is unset.

The extractive-only reader prompt ("answer using only the excerpts; else
say I don't know") made the reader abstain on every recommendation-style
question, since preference golds are rubrics with no answer string in the
excerpts. Revise the single shared prompt to keep factual abstention while
inferring a recommendation grounded strictly in preferences/signals present
in the retrieved excerpts (no invented preferences, no general knowledge).

@jamby77 jamby77 left a comment

Collaborator

Nice harness — clean four-seam design (embedder/store/reader/judge) and the defensive comments make it easy to review. Build/config and API usage check out: eval/ is correctly excluded from the package tsc build, the Tier-0 test is genuinely hermetic (fixture + mocks, no env/network), and the recall@k / QA-accuracy aggregation guards empty categories. The judge (\bcorrect\b word-boundary so "incorrect" doesn't match, empty-string guard) and the non-fatal flush in finally are well done.
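
For reference, the word-boundary verdict parsing praised above can be sketched as below. The `\bcorrect\b` pattern is quoted in the review; the explicit rejection list mirrors the forms named in the commit messages, and `parseVerdict` is a hypothetical name.

```typescript
// Returns true only for an affirmative "correct" verdict.
function parseVerdict(reply: string): boolean {
  const text = reply.trim().toLowerCase();
  // "not correct" and "partially correct" both contain the whole word "correct",
  // so they have to be rejected explicitly before the positive match.
  if (/\b(incorrect|not\s+correct|partially\s+correct)\b/.test(text)) return false;
  return /\bcorrect\b/.test(text);
}

console.log(parseVerdict("Correct."), parseVerdict("incorrect"), parseVerdict("not correct"));
```

The word boundary alone already excludes "incorrect", since there is no boundary between "in" and "correct"; the rejection list covers the multi-word negations.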

One non-blocking thing: the in-memory mock store's numeric @field:[min max] filter does exact-equality on the min and ignores the max (eval/longmemeval/store.ts:68-70, 249: value: parseFloat(m[2]), then Number(stored) !== f.value), and a malformed bound yields NaN (so !== NaN matches nothing). It's effectively dead today since the schema only declares TAG fields (session_id, date), so the reported numbers are unaffected, but it will silently mis-filter if anyone adds a numeric field to a future test. Worth a min <= x <= max + Number.isFinite guard while it's fresh.
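
A possible shape for the suggested guard, as a sketch rather than a patch against store.ts (whose code is not shown here). `parseBound`, `parseRange`, and `inRange` are hypothetical names; the sketch accepts `-inf`, `inf`, and `+inf` as unbounded ends and does not handle exclusive `(` bounds.

```typescript
interface NumericRange {
  field: string;
  min: number;
  max: number;
}

// Returns NaN for anything that is neither a finite number nor an infinity keyword.
function parseBound(raw: string | undefined): number {
  if (raw === undefined) return NaN;
  if (raw === "-inf") return -Infinity;
  if (raw === "inf" || raw === "+inf") return Infinity;
  const n = Number(raw);
  return Number.isFinite(n) ? n : NaN;
}

// Parse "@field:[min max]"; a malformed expression or bound yields null.
function parseRange(expr: string): NumericRange | null {
  const m = /^@(\w+):\[(\S+)\s+(\S+)\]$/.exec(expr.trim());
  if (!m) return null;
  const min = parseBound(m[2]);
  const max = parseBound(m[3]);
  if (Number.isNaN(min) || Number.isNaN(max)) return null;
  return { field: m[1] ?? "", min, max };
}

// Inclusive range check that rejects blank and non-finite stored values.
function inRange(stored: string | number, range: NumericRange): boolean {
  if (typeof stored === "string" && stored.trim() === "") return false;
  const x = Number(stored);
  return Number.isFinite(x) && range.min <= x && x <= range.max;
}

const range = parseRange("@year:[2020 2024]");
console.log(range, range ? inRange("2022", range) : null);
```

Returning null for a malformed bound lets the mock store fail loudly at parse time instead of quietly matching nothing.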

Minor nits (optional): dataset.ts JSON.parse is unguarded (raw SyntaxError on a malformed dataset — runs before any billable work, so nothing's lost), and isGpt5Tier's /^gpt-5/i won't catch a future temperature-rejecting family, though that fails loud (API 400) rather than silently.

Approving — the one real finding is latent and the rest are nits.

@jamby77 jamby77 force-pushed the feature/retrieval-sdk-phase6-integration branch from 94a8bf3 to bd4a678 Compare June 18, 2026 06:36