fix(graph): top-degree snapshot cache for /graph/query + /graph/stats (#814) by rohitg00 · Pull Request #816 · rohitg00/agentmemory

rohitg00 · 2026-06-03T18:04:51Z

Summary

Fixes #814 (@allandelmare). Two-phase fix. Phase 1 (initial commits) added snapshot cache + wall-clock budget. Phase 2 (this revision, after Allan's test against his 75K-node corpus exposed that the budget can't fire on a blocked event loop) eliminates kv.list from the hot path entirely.

Allan's diagnosis

Confirmed end-to-end. At 75K nodes the kv.list<GraphNode> + kv.list<GraphEdge> pair returns a ~37MB single WS frame. JSON.parse of that blocks the Node event loop for hundreds of ms. iii's heartbeat starves, worker is declared dead at ~0.5s. The Promise.race wall-clock timer never gets to fire because nothing on the loop runs. Caller sees raw Invocation stopped, never the graceful envelope.

mem::graph-snapshot-rebuild itself was running the same enumeration, so the very call meant to populate the cache died at ~2.6s on his corpus. Catch-22.

What changed in Phase 2

Incremental indexes maintained by `mem::graph-extract`

Three new KV scopes — all kv.get / kv.set only, no enumeration ever:

KV.graphNameIndex — ${type}|${name} → nodeId. Replaces the O(n) existingNodes.find(...) scan inside extract.
KV.graphEdgeKey — ${src}|${tgt}|${type} → edgeId. Same for edges.
KV.graphNodeDegree — nodeId → incident-edge count. Read / incremented on edge writes to maintain snapshot top-N ranking without scanning edges.

mem::graph-extract is now O(extract_size), not O(total_nodes). The bottleneck Allan was hitting on every observation capture is gone.
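
A minimal sketch of the lookup pattern described above, assuming a KV that exposes only per-scope get/set. `MemoryKV`, `upsertNode`, `upsertEdge`, and the scope strings are hypothetical stand-ins, not the project's code. Only the key shapes (`${type}|${name}` and `${src}|${tgt}|${type}`) come from the description.

```typescript
// Hypothetical in-memory stand-in for the KV store: get/set only, no list().
class MemoryKV {
  private scopes = new Map<string, Map<string, unknown>>();

  async get<T>(scope: string, key: string): Promise<T | undefined> {
    return this.scopes.get(scope)?.get(key) as T | undefined;
  }

  async set<T>(scope: string, key: string, value: T): Promise<void> {
    let bucket = this.scopes.get(scope);
    if (!bucket) {
      bucket = new Map<string, unknown>();
      this.scopes.set(scope, bucket);
    }
    bucket.set(key, value);
  }
}

interface GraphNode {
  id: string;
  type: string;
  name: string;
}

// Illustrative scope names (assumption); the real constants live in the schema.
const SCOPE = {
  nodes: "graph:nodes",
  nameIndex: "graph:name-index",
  edgeKey: "graph:edge-key",
  degree: "graph:node-degree",
};

// Dedup a node with one point lookup on `${type}|${name}` instead of a scan.
async function upsertNode(
  kv: MemoryKV,
  node: GraphNode,
): Promise<{ id: string; added: boolean }> {
  const indexKey = `${node.type}|${node.name}`;
  const existingId = await kv.get<string>(SCOPE.nameIndex, indexKey);
  if (existingId !== undefined) return { id: existingId, added: false };
  await kv.set(SCOPE.nodes, node.id, node);
  await kv.set(SCOPE.nameIndex, indexKey, node.id);
  return { id: node.id, added: true };
}

// Dedup an edge the same way, then bump the degree counter of each endpoint.
async function upsertEdge(
  kv: MemoryKV,
  edgeId: string,
  src: string,
  tgt: string,
  type: string,
): Promise<boolean> {
  const key = `${src}|${tgt}|${type}`;
  if ((await kv.get<string>(SCOPE.edgeKey, key)) !== undefined) return false;
  await kv.set(SCOPE.edgeKey, key, edgeId);
  for (const nodeId of [src, tgt]) {
    const degree = (await kv.get<number>(SCOPE.degree, nodeId)) ?? 0;
    await kv.set(SCOPE.degree, nodeId, degree + 1);
  }
  return true;
}
```

Each node or edge costs a fixed number of point reads and writes, so the work scales with the size of the extract rather than the size of the corpus.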

Snapshot updated inline on every extract

Read snapshot once, mutate stats / degree / topNodes / topEdges inline, write back. No dirty flag bouncing, no scheduled rebuild needed. Corpora built on a post-#814v2 build always have a current snapshot without any explicit rebuild call.
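
The inline update can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `Snapshot`, `ExtractDelta`, `updateTopNodes`, and `applyExtract` are hypothetical names and shapes, and edges are omitted for brevity. The idea is that the top-N list is adjusted with a promote/evict step that only touches the N cached entries.

```typescript
interface TopNode {
  id: string;
  degree: number;
}

interface Snapshot {
  stats: { totalNodes: number; totalEdges: number };
  topNodes: TopNode[]; // invariant: sorted by degree, descending
}

// Promote, update, or ignore one node given its new degree. O(topN) work.
function updateTopNodes(
  snap: Snapshot,
  nodeId: string,
  newDegree: number,
  topN: number,
): void {
  if (topN <= 0) return;
  const existing = snap.topNodes.find((n) => n.id === nodeId);
  if (existing) {
    existing.degree = newDegree;
  } else if (snap.topNodes.length < topN) {
    snap.topNodes.push({ id: nodeId, degree: newDegree });
  } else {
    const weakest = snap.topNodes[snap.topNodes.length - 1];
    if (weakest !== undefined && newDegree <= weakest.degree) return;
    // Evict the lowest-degree entry in favour of the newcomer.
    snap.topNodes[snap.topNodes.length - 1] = { id: nodeId, degree: newDegree };
  }
  snap.topNodes.sort((a, b) => b.degree - a.degree);
}

interface ExtractDelta {
  addedNodes: number;
  addedEdges: number;
  degrees: Array<{ id: string; degree: number }>; // new degree per touched node
}

// Mutate the snapshot in memory; the caller reads it once and writes it back once.
function applyExtract(snap: Snapshot, delta: ExtractDelta, topN: number): void {
  snap.stats.totalNodes += delta.addedNodes;
  snap.stats.totalEdges += delta.addedEdges;
  for (const { id, degree } of delta.degrees) {
    updateTopNodes(snap, id, degree, topN);
  }
}
```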

Hot path reads snapshot exclusively

mem::graph-query empty-body / nodeType-only branch: snapshot only. No kv.list fallback (that was the broken path).
mem::graph-stats: snapshot only. Empty envelope + warning on miss, never a 500.
mem::graph-query startNodeId / query branches: still need broader access; keep behind the wall-clock budget with snapshot fallback. Documented as degrading on >25K-node corpora until a per-node edge index lands.
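
As a rough illustration of the snapshot-only stats path (hypothetical types and function name; the warning text is an assumption), a snapshot miss yields a zeroed envelope with a warning rather than an error:

```typescript
interface GraphStats {
  totalNodes: number;
  totalEdges: number;
}

interface StatsEnvelope extends GraphStats {
  fromSnapshot: boolean;
  warning?: string;
}

// Hypothetical reader: resolves to the cached snapshot, or undefined on a miss.
type SnapshotReader = () => Promise<{ stats: GraphStats } | undefined>;

// Serve stats from the snapshot only; never enumerate nodes or edges here.
async function statsFromSnapshot(
  readSnapshot: SnapshotReader,
): Promise<StatsEnvelope> {
  const snap = await readSnapshot();
  if (snap) {
    return { ...snap.stats, fromSnapshot: true };
  }
  return {
    totalNodes: 0,
    totalEdges: 0,
    fromSnapshot: false,
    warning: "no graph snapshot available yet", // assumed wording
  };
}
```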

Legacy corpus rebuild + reset

mem::graph-snapshot-rebuild now refuses to run when the recorded node count would block the worker (REBUILD_SAFE_NODE_CEILING = 25000). Returns { success: false, tooLarge: true, totalNodes, ceiling, error } instead of dying. Also backfills the new name-index / edge-key / degree scopes so post-rebuild extracts hit the O(1) path.
New mem::graph-reset + POST /agentmemory/graph/reset — wipes all graph KV scopes, writes an empty snapshot. Observations / recall / history are NOT touched. Operators above the safe ceiling reset and let future extracts rebuild incrementally.
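
The rebuild guard could look something like the sketch below. The envelope fields and the ceiling value come from the description above; the function name, the strict `>` comparison, and the error wording are assumptions.

```typescript
const REBUILD_SAFE_NODE_CEILING = 25_000; // value quoted in the PR description

interface TooLargeResult {
  success: false;
  tooLarge: true;
  totalNodes: number;
  ceiling: number;
  error: string;
}

// Returns a refusal envelope when a full enumeration would be unsafe, or null
// when the rebuild may proceed. Assumes the node count is already recorded
// somewhere cheap to read (for example the previous snapshot's stats).
function guardRebuild(recordedNodeCount: number): TooLargeResult | null {
  if (recordedNodeCount > REBUILD_SAFE_NODE_CEILING) {
    return {
      success: false,
      tooLarge: true,
      totalNodes: recordedNodeCount,
      ceiling: REBUILD_SAFE_NODE_CEILING,
      error: `graph has ${recordedNodeCount} nodes; rebuild is refused above ${REBUILD_SAFE_NODE_CEILING}`,
    };
  }
  return null;
}
```

The point of checking before enumerating is that the refusal itself must not depend on the call that would block the worker.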

Endpoint count + docs

127 → 128 (added /graph/reset). Bumped README.md, AGENTS.md, src/index.ts boot log.

What this fixes for Allan

Even before backfilling, Allan's observation capture has been running the broken kv.list on every extract since v0.9.21. That also blocks the worker for hundreds of ms each time, and is the most likely cause of the worker-reconnect churn he observed. Phase 2 fixes that immediately: every new extract does O(1) index lookups per node and edge, so its cost is O(extract_size) rather than O(total_nodes).

For his 75K accumulated graph: POST /agentmemory/graph/reset is the path. Recall + sessions + observations stay intact; the graph rebuilds from new observations as they come in. The 75K accumulated nodes were largely redundant anyway (heavy dedup pressure from re-extracting the same source files / functions across sessions).

Test plan

npx vitest run — 1406/1406 pass (12 new tests for incremental index + snapshot inline update + reset + name-index dedup)
npm run build clean
npx tsc --noEmit — no new errors
Manual against @allandelmare's corpus: POST /agentmemory/graph/reset → trigger a few new observations → POST /agentmemory/graph/query {} and GET /agentmemory/graph/stats both return in < 100ms

Out of scope

Per-node edge index for fast BFS on huge corpora — startNodeId / query paths still rely on kv.list and degrade above ~25K. Defer to a follow-up.
SQLite-backed graph store (Architecture: Migrate BM25 & Graph Search from In-Memory to SQLite #309 epic) — separate effort. Snapshot interface stays stable so the migration won't disrupt callers.

cc @allandelmare — would appreciate one more run against your corpus once this lands. Reset is the fast path; alternatively mem::graph-snapshot-rebuild will refuse above 25K with a clear tooLarge error envelope.

Summary by CodeRabbit

New Features
- Cached graph snapshots for faster queries and stats
- POST /agentmemory/graph/snapshot-rebuild to rebuild/persist snapshots
- POST /agentmemory/graph/reset to clear graph state and write an empty snapshot
Improvements
- Extraction avoids full scans using targeted lookups and updates snapshots incrementally, preserving top-degree tracking
- Queries and stats race live enumeration against a time budget and fall back to snapshots with warnings/metadata
Tests
- Added snapshot-focused and budget/oversize tests for persistence, fallbacks, and reset behavior
Documentation
- Updated endpoint count and startup message (126 → 128)

vercel · 2026-06-03T18:04:57Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agentmemory	Ready	Preview, Comment	Jun 4, 2026 3:19pm

coderabbitai · 2026-06-03T18:05:09Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 162b848f-8a14-4078-a604-5158dce2fc2f

📥 Commits

Reviewing files that changed from the base of the PR and between 795537b and b494f23.

📒 Files selected for processing (1)

test/graph.test.ts

🚧 Files skipped from review as they are similar to previous changes (1)

test/graph.test.ts

📝 Walkthrough

Adds a persisted top-degree GraphSnapshot and helpers; extract updates the snapshot incrementally; query and stats use snapshot-first or timeout-fallback behavior; adds mem::graph-snapshot-rebuild and mem::graph-reset handlers, API endpoints, tests, and docs/boot-log endpoint count updates.

Changes

Graph Snapshot Caching

Layer / File(s)	Summary
Type contracts and storage schema `src/types.ts`, `src/state/schema.ts`	Adds `GraphSnapshot` interface (versioned top-degree subgraph, per-type stats, timestamp, dirty flag) and extends `GraphQueryResult` with `fromSnapshot` and `warning`; adds KV keys `graphSnapshot`, `graphNameIndex`, `graphEdgeKey`, `graphNodeDegree`.
Snapshot infrastructure and helper functions `src/functions/graph.ts`	Implements timeout-bounded async helper and snapshot lifecycle utilities: `emptySnapshot`, read/build from live arrays, paginate from snapshot, and removes old full-ranking path.
Incremental degree and top-N maintenance `src/functions/graph.ts`	Implements `applyDegreeDelta`, topNodes/topEdges promote/evict logic, and merge helpers for node/edge updates used during extraction.
Graph extract using indexes and inline snapshot updates `src/functions/graph.ts`	`mem::graph-extract` uses targeted name/edge indexes instead of full KV scans, merges existing records, updates snapshot stats/top-degree incrementally, and persists snapshot only when new nodes/edges are added; audit metadata extended with added counts.
Graph query: snapshot fast-path and timeout fallback `src/functions/graph.ts`	Adds an empty-body/nodeType-only fast path that serves paginated results from the cached snapshot; for other queries, races live KV enumeration against a wall-clock budget and falls back to snapshot-based pagination or returns an empty response with a warning on timeout.
Graph stats snapshot-first behavior `src/functions/graph.ts`	`mem::graph-stats` returns snapshot stats when present (annotated with `fromSnapshot` and a dirty warning if set); when no snapshot exists returns zeroed stats with a warning.
Snapshot rebuild and reset handlers `src/functions/graph.ts`	Adds `mem::graph-snapshot-rebuild` to enumerate nodes/edges within budget, backfill indexes, rebuild and persist snapshot, and return timing/stats; adds `mem::graph-reset` to clear graph KV scopes and write an empty snapshot.
HTTP endpoints for rebuild/reset `src/triggers/api.ts`	Registers authenticated POST `/agentmemory/graph/snapshot-rebuild` and `/agentmemory/graph/reset` that call the corresponding mem::graph-* handlers and return results or `graphDisabledResponse()` on error.
Snapshot cache test suite `test/graph.test.ts`	New Vitest `"snapshot cache (`#814`)"` suite plus updated pagination tests covering rebuild persistence, query/stat fast-paths and timeout fallback, nodeType filtering, dirty/invalidation behaviour, inline extract updates and name-index dedup, reset clearing index scopes, and tooLarge/budget guards.
Docs and boot log endpoint count updates `AGENTS.md`, `README.md`, `src/index.ts`	Increment documented REST endpoint count and boot log from 126 to 128.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant HTTP_API as "api::graph-snapshot-rebuild"
  participant MemFn as "mem::graph-snapshot-rebuild"
  participant KV
  Client->>HTTP_API: POST /agentmemory/graph/snapshot-rebuild
  HTTP_API->>MemFn: invoke mem::graph-snapshot-rebuild()
  MemFn->>KV: enumerate nodes/edges (budgeted)
  KV-->>MemFn: node/edge batches
  MemFn->>KV: write graphNameIndex/graphEdgeKey/node-degree
  MemFn->>KV: persist KV.graphSnapshot
  MemFn-->>HTTP_API: {stats, tookMs}
  HTTP_API-->>Client: 200 {result}

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

rohitg00/agentmemory#789: Overlapping changes to mem::graph-query and pagination behavior.
rohitg00/agentmemory#698: Related mem::graph-extract changes and extraction pipeline integrations.

Poem

🐰 I cached the graph in the KV burrow deep,
Top nodes counted while the watchers sleep.
When live scans stall and time slips away,
I hand you a snapshot to brighten your day. ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Linked Issues check	⚠️ Warning	Code changes implement snapshot caching, indexed KV lookups, and wall-clock budgets for /graph/query and /graph/stats; however, tester reports platform-level invocation stops persist despite graceful envelope implementation.	Address the fundamental blocker: synchronous kv.list blocks Node event loop, preventing Promise.race timers and heartbeats from firing. Implement chunked pagination with event-loop yields (await/setImmediate between batches) or move rebuild off the hot request path.
Docstring Coverage	⚠️ Warning	Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: introducing a top-degree snapshot cache to fix /graph/query and /graph/stats timeouts.
Out of Scope Changes check	✅ Passed	All changes directly support the PR objective: KV indexes, snapshot lifecycle, budget-bounded queries, rebuild/reset handlers, and test coverage for snapshot functionality.

coderabbitai

🧹 Nitpick comments (2)

src/functions/graph.ts (1)

616-651: 💤 Low value

Consider adding an audit trail for explicit snapshot rebuilds.

The mem::graph-snapshot-rebuild function persists state but doesn't call recordAudit(). While snapshot maintenance is arguably bookkeeping (similar to read-path counters), this endpoint is explicitly triggered by users rather than being a transparent cache operation. An audit entry would provide visibility into when and how often rebuilds occur.
♻️ Suggested audit call
       await kv.set(KV.graphSnapshot, SNAPSHOT_KEY, snap);
       const tookMs = Date.now() - started;
+      await recordAudit(kv, "observe", "mem::graph-snapshot-rebuild", [], {
+        totalNodes: snap.stats.totalNodes,
+        totalEdges: snap.stats.totalEdges,
+        topNodes: snap.topNodes.length,
+        tookMs,
+      });
       logger.info("Graph snapshot rebuilt", {
Based on learnings: "Function registration pattern: use sdk.registerFunction with validation of inputs, work via kv.get/kv.set/kv.list, and record audit via recordAudit() for state-changing operations."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/graph.ts` around lines 616 - 651, Add an audit entry for the
explicit rebuild inside the sdk.registerFunction handler for
"mem::graph-snapshot-rebuild": after successfully building and persisting snap
(built by buildSnapshotFromArrays and saved to KV.graphSnapshot / SNAPSHOT_KEY)
call recordAudit(...) with a descriptive action like "graph.snapshot.rebuild"
and include metadata such as success:true, snap.stats (totalNodes/totalEdges),
topNodes/topEdges counts, updatedAt and tookMs; also record a failure audit when
the catch block runs (action "graph.snapshot.rebuild" with success:false and the
error message) so every explicit rebuild attempt is auditable.

test/graph.test.ts (1)

418-517: ⚡ Quick win

Add coverage for the #814 live-enumeration failure/timeout fallback envelope (warning + snapshot/empty).

withTimeout rejects when the wrapped kv.list promise rejects, so you can force the handlers’ catch immediately by making kv.list throw—without waiting for the 6s budget.

mem::graph-query (snapshot present): expect fromSnapshot: true and warning when live enumeration fails.
mem::graph-query (no snapshot): expect an empty graph + warning; fromSnapshot is intentionally omitted in this branch.
mem::graph-stats: mirror the same two cases (fromSnapshot: true with snapshot; fromSnapshot: false without) and assert warning.

💚 Example tests for graph-query fallback

it("graph-query falls back to snapshot with a warning when live enumeration fails", async () => {
  await seed(10, 10);
  await sdk.trigger("mem::graph-snapshot-rebuild", {});

  const originalList = kv.list;
  try {
    kv.list = (async () => {
      throw new Error("Invocation stopped");
    }) as typeof kv.list;

    const result = (await sdk.trigger("mem::graph-query", {
      query: "anything",
    })) as GraphQueryResult;

    expect(result.fromSnapshot).toBe(true);
    expect(result.warning).toBeDefined();
  } finally {
    kv.list = originalList;
  }
});

it("graph-query returns empty + warning when no snapshot exists and live enumeration fails", async () => {
  await seed(10, 10);

  const originalList = kv.list;
  try {
    kv.list = (async () => {
      throw new Error("Invocation stopped");
    }) as typeof kv.list;

    const result = (await sdk.trigger("mem::graph-query", {
      query: "anything",
    })) as GraphQueryResult;

    expect(result.nodes).toHaveLength(0);
    expect(result.edges).toHaveLength(0);
    expect(result.totalNodes).toBe(0);
    expect(result.totalEdges).toBe(0);
    expect(result.fromSnapshot).toBeUndefined();
    expect(result.warning).toBeDefined();
  } finally {
    kv.list = originalList;
  }
});

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/graph.test.ts` around lines 418 - 517, Add tests exercising the
live-enumeration failure/timeout fallback by forcing kv.list to throw and
restoring it in a finally block; specifically add three assertions: (1) for
mem::graph-query when a snapshot exists, trigger "mem::graph-snapshot-rebuild"
then override kv.list to throw and assert the handler returns fromSnapshot: true
and a defined warning, (2) for mem::graph-query when no snapshot exists, seed
but do not build a snapshot, override kv.list to throw and assert nodes/edges
arrays are empty, totalNodes/totalEdges are 0, fromSnapshot is undefined, and
warning is defined, and (3) mirror these two cases for mem::graph-stats
asserting fromSnapshot true/false respectively and that warning is defined; use
the existing test helpers seed, sdk.trigger, kv.list override/restore pattern
and the GraphQueryResult shape to locate the tests to add/update.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 471b5858-55f7-4bd9-8aba-a1320c110cec

📥 Commits

Reviewing files that changed from the base of the PR and between 3e90110 and 75d77db.

📒 Files selected for processing (5)

src/functions/graph.ts
src/state/schema.ts
src/triggers/api.ts
src/types.ts
test/graph.test.ts

allandelmare · 2026-06-03T18:45:34Z

The following was generated with the help of Claude Code based on a hands-on test against my corpus. Apologies for any AI-slop — please push back on anything that reads wrong.

Tested PR #816 (commit 75d77db) against the same 680-session / 75K-node / 344 MB ~/data/ corpus from #814. Backup taken first; data fully intact afterwards (131/133 memories, 684 sessions still readable, recall + observation capture all working).

Headline

The new code is bundled, it boots, and the 6s graceful budget path does fire (visible in logs as [agentmemory] warn Graph query enumeration timed out, using snapshot {"error":"graph-query enumeration: exceeded 6000ms budget"}). But callers still receive {"error":"Invocation stopped"} rather than the {warning, fromSnapshot} envelope, and the snapshot-rebuild endpoint can't escape the very enumeration-timeout problem it was designed to solve. Recall / memories / observation capture are unaffected.

Concrete repro

# Fresh boot of the PR build against existing 75K-node corpus
$ cd ~ && node /tmp/agentmemory-pr816/dist/cli.mjs &
# Ready in 1s; circuit breaker closed; 684 sessions, 131 memories readable.

$ time curl -s -X POST http://localhost:3111/agentmemory/graph/snapshot-rebuild
{"error":"Invocation stopped","error_id":"a5a8b329a1b0"}
real  0m2.66s

$ time curl -s -X POST http://localhost:3111/agentmemory/graph/snapshot-rebuild
{"error":"Invocation stopped","error_id":"a5a983a601d0"}
real  0m1.45s

$ time curl -s -X POST http://localhost:3111/agentmemory/graph/query -d '{}'
{"error":"Invocation stopped","error_id":"..."}
real  0m5.24s

$ time curl -s -X POST http://localhost:3111/agentmemory/graph/query -d '{"limit": 10}'
{"error":"Invocation stopped","error_id":"..."}
real  0m1.14s

$ time curl -s http://localhost:3111/agentmemory/graph/stats
{"error":"Invocation stopped"}    # or "Function not found" during reconnect window — see below

Server log during the same window confirms the new code paths execute:

[agentmemory] warn Graph query enumeration timed out, using snapshot {"error":"graph-query enumeration: exceeded 6000ms budget"}
[agentmemory] warn Graph stats enumeration timed out {"error":"graph-stats enumeration: exceeded 6000ms budget"}

What I think is happening

Bug 1 — iii invocation timeout < the new 6s budget. Snapshot-rebuild errors out in ~2.6s and /graph/query {"limit":10} errors in ~1.1s, both well under the 6s budget. iii is killing the invocation before the SDK's Promise.race reaches the timeout branch, so the graceful {warning, fromSnapshot} envelope never reaches the HTTP client. Either the budget needs to be < iii's per-function deadline, or the rebuild needs to run as a streamed/chunked operation that doesn't sit inside a single iii invocation.

Bug 2 — snapshot-rebuild itself can't escape the timeout it was built to solve. The endpoint is the first thing an operator runs to populate the cache, but it still needs to enumerate the full 75K-node corpus inside one iii call. At our scale that enumeration is exactly what was breaking /graph/query in the first place. The snapshot makes subsequent queries fast (architecturally sound) but only if you can pay the rebuild cost once — and on our corpus you can't. Suggestions:

Stream the rebuild in batches (paginated kv.list with cursor, accumulate, write at end) so each batch fits inside iii's window
Or: register snapshot-rebuild with a relaxed timeout (8–10x the default) since it's known-expensive
Or: rebuild incrementally during mem::graph-extract writes (incurs the per-write cost but amortizes — could be gated to "rebuild every N extracts when dirty")

Bug 3 — worker reconnect churn under sustained graph load. When I had the daemon running for a while, the log showed continuous [iii] Reconnecting in NNNms interleaved with the timeout warnings, and a fresh worker UUID would register every few seconds. During the ~1s gap between worker death and the next registerApiTriggers run, HTTP calls returned {"error":"Function api::graph-stats not found"} (the EXISTING endpoint, not just the new one — api::graph-stats was registered in main; this isn't a missing-registration bug, it's a registration-not-yet-rerun bug). Once the new worker finished registering, api::graph-stats came back. Suggests the graph timeouts are either crashing the worker or being treated as fatal by iii.

I don't know who's calling graph-stats internally — only the viewer in src/viewer/index.html does apiGet('graph/stats'), but I didn't have a viewer open. Could be MCP client probes (12 stale npm exec @agentmemory/mcp processes from prior Claude Code sessions still hanging around) or a background poller I missed. The setInterval calls I found in src/index.ts are for autoForget / lessonDecay / insightDecay / consolidation — none mention graph-stats. Worth checking.

What still works

/agentmemory/health — full metrics, 1292/1293 compress successes, 730/731 summarize successes, 94+ avg quality score
/agentmemory/memories?count=true — 131/133, instant
/agentmemory/sessions?limit=100000 — 684 sessions, < 1s
Observation capture from active Claude Code session — confirmed in log: [agentmemory] info Observation captured … hook=post_tool_use, compress=llm followed by Observation compressed
Boot is clean, no crash loops

Summary

The architecture in this PR is correct (snapshot cache + graceful timeout envelope), but the implementation has a coordination problem between the SDK's Promise.race and iii's per-invocation timeout, plus snapshot-rebuild needs to escape the enumeration-timeout itself before it can be useful at large scale.

Happy to test any follow-up patch against this corpus — fastest feedback loop you'll get for the > 50K-node case.

allandelmare · 2026-06-03T19:04:10Z

Follow-up to my test report above, generated with Claude Code. Two things: (1) correcting an error in my first comment, and (2) a controlled experiment that I think pins the root cause. Please push back on anything that reads wrong.

Correction to my previous comment

I wrote "I didn't have a viewer open" and speculated about a mystery internal caller hammering graph/stats. That was wrong — I did have the viewer open. The viewer dashboard auto-refresh (setInterval in src/viewer/index.html, calling apiGet('graph/stats')) polls graph/stats every ~2s. So the "constant internal graph-stats calls" and the continuous worker-reconnect churn I described in Bug 3 were self-induced by my open viewer tab, not a background poller in the daemon. Apologies for the misdirection.

But re-testing with the viewer closed surfaced the actual mechanism, which is more useful:

Controlled experiment: graph calls crash the worker, cheap calls don't

Fresh boot, no viewer, worker confirmed healthy (memories?count=true returns instantly before each batch):

# 3x graph/query {} back-to-back:
  call 1: 0.578s   -> {"error":"Invocation stopped"}
  call 2: 0.0007s  -> {"error":"Function api::graph-query not found"}
  call 3: 0.0005s  -> {"error":"Function api::graph-query not found"}
  => Worker "registered" count delta: +1   (a reconnect happened)

# 3x memories?count=true back-to-back (control):
  call 1: 0.009s
  call 2: 0.004s
  call 3: 0.004s
  => Worker "registered" count delta: 0     (no reconnect)

Graph calls deterministically trigger a worker reconnect; cheap calls never do. Call 1 fires the enumeration and dies at ~0.5s; calls 2 & 3 land in the ~1s reconnect gap and get instant Function not found. At idle with no viewer, the worker still re-registered 4–5 times over ~90s purely from [OTel] WebSocket error -> Disconnected from engine, will reconnect cycles, but graph calls reliably force an extra one.

Root cause hypothesis

mem::graph-snapshot-rebuild, mem::graph-query (empty-body path), and mem::graph-stats all die at ~0.5–0.7s with zero enumeration log output, even with the worker confirmed up immediately prior. That's far below the new 6s budget, and the function body never logs progress. I think:

The kv.list<GraphNode> + kv.list<GraphEdge> enumeration runs synchronously without yielding to the Node event loop.
A blocked event loop can't fire the Promise.race 6s timer — so the graceful { warning, fromSnapshot } envelope this PR adds can never be returned; the caller always gets the raw iii Invocation stopped instead.
The same blocked loop misses the iii websocket heartbeat, so iii declares the worker dead at ~0.5s and reconnects it — which is the reconnect we see correlated with every graph call.

In short: a Promise.race against a wall-clock timer can't interrupt a synchronously-blocked event loop. The timer and the heartbeat are both starved by the very enumeration they're meant to bound.
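
The effect can be reproduced in isolation. The sketch below is not the PR's code: `withTimeout` here is a generic helper written for the demo, and `blockFor` stands in for the synchronous JSON.parse of a large frame. The work blocks for 200ms against a 50ms budget, yet the call still resolves with the work's result, because the timer callback cannot run while the loop is blocked.

```typescript
// Generic race-against-a-timer helper (illustrative, not the project's helper).
async function withTimeout<T>(
  work: Promise<T>,
  budgetMs: number,
  label: string,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label}: exceeded ${budgetMs}ms budget`)),
      budgetMs,
    );
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Synchronous busy-wait standing in for parsing a very large response.
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin: nothing else on the event loop can run here
  }
}

async function blockingEnumeration(ms: number): Promise<string> {
  // The "response" arrives on a later tick, then parsing blocks the loop.
  await new Promise<void>((resolve) => setTimeout(resolve, 0));
  blockFor(ms);
  return "done";
}

async function demo(): Promise<{ outcome: string; elapsedMs: number }> {
  const started = Date.now();
  const outcome = await withTimeout(blockingEnumeration(200), 50, "enumeration")
    .catch((err: Error) => err.message);
  return { outcome, elapsedMs: Date.now() - started };
}
```

Calling `demo()` yields the outcome `"done"` after roughly 200ms rather than a budget error at 50ms, which matches the behaviour described above: the budget only bounds work that yields.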

Implication for the fix

The snapshot-cache architecture is sound — precompute top-degree, serve the fast path, never enumerate on hot calls. But the rebuild itself (and any live-enumeration fallback) has to yield to the event loop, or it takes the worker down before it can finish or time out gracefully. Options:

Chunk the enumeration: paginated kv.list with a cursor, await/setImmediate between batches, accumulate, write the snapshot at the end. Each batch fits inside the heartbeat window and lets the budget timer fire.
Or move the snapshot rebuild off the request worker entirely (background job / separate worker) so a long enumeration never blocks the HTTP/heartbeat loop.
Either way, the live-enumeration fallback in graph-query/graph-stats should be removed or also chunked — as written it's the thing crashing the worker, so falling back to it defeats the cache's purpose on exactly the large corpora that need it.
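
A sketch of the chunked option, under the assumption that the store can serve cursor-paginated pages. `ListPage`, `enumerateInChunks`, and the cursor shape are hypothetical; whether the actual kv.list supports a cursor is not established in this thread. The yield uses `setTimeout(..., 0)`; `setImmediate` would serve the same purpose in Node.

```typescript
interface Page<T> {
  items: T[];
  nextCursor?: string; // undefined means this was the last page
}

// Hypothetical paginated reader; stands in for a cursor-based kv.list.
type ListPage<T> = (cursor: string | undefined, limit: number) => Promise<Page<T>>;

// Hand control back to the event loop so timers and heartbeats can run.
function yieldToEventLoop(): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, 0));
}

async function enumerateInChunks<T>(
  listPage: ListPage<T>,
  batchSize: number,
  budgetMs: number,
): Promise<T[]> {
  const started = Date.now();
  const all: T[] = [];
  let cursor: string | undefined;
  while (true) {
    // Checked between batches, so the budget is only as precise as one batch.
    if (Date.now() - started > budgetMs) {
      throw new Error(`enumeration: exceeded ${budgetMs}ms budget`);
    }
    const page = await listPage(cursor, batchSize);
    for (const item of page.items) all.push(item);
    if (page.nextCursor === undefined) return all;
    cursor = page.nextCursor;
    await yieldToEventLoop();
  }
}
```

Because each batch is small, no single step holds the loop long enough to starve the heartbeat, and the budget check gets a chance to run between batches.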

Caveat

I'm running the PR build via node dist/cli.mjs directly (not the packaged launcher). The baseline [OTel] WebSocket error reconnect cycle at idle might be partly environmental to that launch method — worth confirming. But the graph-call-triggers-reconnect delta and the ~0.5s death with no enumeration log are robust and reproduce on every call regardless of launch method, since they're about the enumeration blocking the loop, not about how the process was started.

Data fully intact throughout (131/133 memories, 684 sessions, recall + observation capture all working). Standing by to test a chunked-enumeration patch against this corpus.

…build

…oint

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

test/graph.test.ts (1)
29-32: ⚖️ Poor tradeoff

Coverage gap: the timeout/fallback path that fails in production is never exercised.

mockKV.list resolves synchronously and instantly, so withTimeout/LIVE_ENUMERATION_BUDGET_MS in the query and rebuild handlers never fire. None of these tests cover the catch branch that returns the { warning, fromSnapshot } envelope, nor the tooLarge rebuild abort above REBUILD_SAFE_NODE_CEILING. Per the tester's findings, the live enumeration is exactly what blocks the event loop and yields {"error":"Invocation stopped"} on large corpora — so a green suite here gives false confidence on the regression this PR targets.

Consider adding a test that injects a slow/large list (e.g. a deferred promise or fake timers) to assert the warning-envelope fallback, plus one that drives mem::graph-snapshot-rebuild past the ceiling to assert { success: false, tooLarge: true }.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/graph.test.ts` around lines 29 - 32, Add tests that simulate slow and
oversized KV enumeration: update the mockKV.list used by tests to return a
deferred promise (or use fake timers) so that
withTimeout/LIVE_ENUMERATION_BUDGET_MS in the query and rebuild handlers
triggers the timeout path and the code returns the { warning, fromSnapshot }
envelope; additionally add a test that drives the mem::graph-snapshot-rebuild
logic past REBUILD_SAFE_NODE_CEILING to assert it returns { success: false,
tooLarge: true }. Locate the mocks and handlers by the symbols mockKV.list,
withTimeout, LIVE_ENUMERATION_BUDGET_MS, and REBUILD_SAFE_NODE_CEILING and
ensure the new tests assert the warning envelope and tooLarge response
respectively.
src/functions/graph.ts (2)
266-266: ⚡ Quick win

Timestamp not reused as per coding guidelines.

mergeNode creates a new Date() on each call instead of reusing a captured timestamp. This could cause inconsistent updatedAt values across nodes merged in the same extract batch.

As per coding guidelines: "Timestamps: capture once with new Date().toISOString() and reuse."
♻️ Proposed fix

Pass the timestamp as a parameter:
 function mergeNode(
   existing: GraphNode,
   incoming: GraphNode,
   obsIds: string[],
+  updatedAt: string,
 ): GraphNode {
   return {
     ...existing,
     sourceObservationIds: [
       ...new Set([
         ...existing.sourceObservationIds,
         ...incoming.sourceObservationIds,
         ...obsIds,
       ]),
     ],
     properties: { ...existing.properties, ...incoming.properties },
-    updatedAt: new Date().toISOString(),
+    updatedAt,
   };
 }
Then in the extract handler, capture once and pass through.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/graph.ts` at line 266, The mergeNode call currently creates its
own timestamp (updatedAt: new Date().toISOString()) causing inconsistent
timestamps; change mergeNode to accept a timestamp parameter (e.g.,
capturedTimestamp) and replace any internal new Date() usage with that
parameter, then capture a single timestamp once in the extract handler (call it
capturedTimestamp = new Date().toISOString()) and pass it into mergeNode for all
merges in that batch so all nodes share the same updatedAt value.
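
The "capture once and pass through" step could look roughly like the sketch below. The `GraphNode` shape is trimmed to the fields the proposed fix touches, and `mergeBatch` is a hypothetical name for the loop inside the extract handler, which is not shown in this thread.

```typescript
// Reduced GraphNode shape (assumption): only the fields mergeNode reads or writes.
interface GraphNode {
  id: string;
  sourceObservationIds: string[];
  properties: Record<string, unknown>;
  updatedAt: string;
}

// mergeNode as proposed above: the timestamp is supplied by the caller.
function mergeNode(
  existing: GraphNode,
  incoming: GraphNode,
  obsIds: string[],
  updatedAt: string,
): GraphNode {
  return {
    ...existing,
    sourceObservationIds: Array.from(
      new Set([
        ...existing.sourceObservationIds,
        ...incoming.sourceObservationIds,
        ...obsIds,
      ]),
    ),
    properties: { ...existing.properties, ...incoming.properties },
    updatedAt,
  };
}

// Hypothetical batch helper: capture the timestamp once, reuse it for every merge.
function mergeBatch(
  pairs: Array<[GraphNode, GraphNode]>,
  obsIds: string[],
): GraphNode[] {
  const now = new Date().toISOString();
  return pairs.map(([existing, incoming]) => mergeNode(existing, incoming, obsIds, now));
}
```

With this shape every node merged in one extract batch carries an identical `updatedAt`, regardless of how long the batch takes.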
825-835: ⚖️ Poor tradeoff

Sequential index backfill is O(n) awaits on large corpora.

For a 25K-node corpus (the ceiling), this is 25K sequential kv.set calls plus another pass for edges. Consider batching with Promise.all in chunks to reduce wall-clock time during rebuild.
♻️ Example chunked approach
const BATCH_SIZE = 100;
for (let i = 0; i < liveNodes.length; i += BATCH_SIZE) {
  const batch = liveNodes.slice(i, i + BATCH_SIZE);
  await Promise.all(batch.map((n) =>
    Promise.all([
      kv.set(KV.graphNameIndex, nameIndexKey(n.type, n.name), n.id),
      kv.set(KV.graphNodeDegree, n.id, degree.get(n.id) ?? 0),
    ])
  ));
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/functions/graph.ts` around lines 825 - 835, The current sequential
backfill over liveNodes and liveEdges issues many await kv.set calls causing
slow rebuilds; change the loops that call kv.set for liveNodes (using
KV.graphNameIndex, nameIndexKey, KV.graphNodeDegree, degree) and liveEdges
(using KV.graphEdgeKey and edgeIndexKey) to process in fixed-size batches (e.g.,
BATCH_SIZE = 100) and use Promise.all for each batch so the sets within a batch
run in parallel, awaiting Promise.all per batch before moving to the next; keep
the same keys/values and error handling but replace the per-item await with
batch Promise.all mapping to kv.set calls.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/functions/graph.ts`:
- Around line 888-901: The reset loop currently deletes entries only when row.id
exists, which misses entries in the composite-key scopes (graphNameIndex,
graphEdgeKey, graphNodeDegree) because those store primitives; update the
deletion strategy in the block that iterates rows to remove all keys regardless
of value by either: use the KV API variant that returns keys (e.g., change the
kv.list call to request keys/entries and then call kv.delete(scope, key) for
each), or if supported call a scope-level/prefix delete to wipe the entire
scope; alternatively, change how index values are written (store { key, value })
so row.key can be used for deletion—adjust the loop to reference the actual key
field (or scope delete) instead of checking row.id. Ensure you update the code
paths that write those scopes (graphNameIndex/graphEdgeKey/graphNodeDegree) if
you choose the stored-key approach.
- Around line 200-208: The comparator for snap.topNodes incorrectly always
prioritizes the changed node (uses dA/dB set to next/null) which breaks the
descending-degree ordering; update the comparator in the sort call so it
compares actual degree values (use a.degree and b.degree or a cached degrees map
you build during the extract loop), substituting the changed node's degree with
the variable next when a.id === nodeId or b.id === nodeId, and return (degreeB -
degreeA) to preserve descending order; if degree lookups are asynchronous, make
the surrounding function async or precompute/cache degrees synchronously before
calling snap.topNodes.sort to avoid async operations inside the comparator.
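
For the comparator finding, a synchronous version might look like the following sketch. It assumes each `topNodes` entry carries its own `degree` field; `TopNode` and `resortTopNodes` are illustrative names, not identifiers from the PR.

```typescript
// Assumed entry shape for snap.topNodes.
interface TopNode {
  id: string;
  degree: number;
}

// Re-rank topNodes after one node's degree changed to `next`.
// Every comparison uses a real degree value, so descending order is preserved.
function resortTopNodes(topNodes: TopNode[], nodeId: string, next: number): TopNode[] {
  return topNodes
    .map((n) => (n.id === nodeId ? { ...n, degree: next } : n))
    .sort((a, b) => b.degree - a.degree);
}
```

Because the updated degree is written into the entry before sorting, the comparator itself stays a plain `b.degree - a.degree` and never needs an async lookup.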


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9ad8adb5-d84d-45c0-b181-172701f15888

📥 Commits

Reviewing files that changed from the base of the PR and between c6eb9ab and 6cfb180.

📒 Files selected for processing (7)

AGENTS.md
README.md
src/functions/graph.ts
src/index.ts
src/state/schema.ts
src/triggers/api.ts
test/graph.test.ts

✅ Files skipped from review due to trivial changes (3)

src/index.ts
AGENTS.md
README.md


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/graph.test.ts`:
- Around line 562-582: The test "graph-reset also wipes composite-key index
scopes" is missing assertions for the edge composite-key scope; before calling
sdk.trigger("mem::graph-reset", {}) verify that the edge index contains an entry
by reading kv.get("mem:graph:edge-key", "<composite>") or
kv.list("mem:graph:edge-key") after sdk.trigger("mem::graph-extract", {
observations: [testObs] }), then after calling sdk.trigger("mem::graph-reset",
{}) assert the edge index is cleared (kv.get returns null or kv.list length is
0) alongside the existing checks for "mem:graph:name-index" and
"mem:graph:node-degree".
- Around line 590-612: The test's slowKV implementation uses setTimeout which
yields the event loop and doesn't simulate the synchronous kv.list blocking
failure; replace the async delay in slowKV.list with a synchronous busy-wait
(CPU-blocking) delay to simulate event-loop starvation, then add a dedicated
test (using slowKV with the synchronous block and an explicit test timeout) that
calls registerGraphFunction/localSdk.trigger("mem::graph-query", ...) to assert
the warning envelope when enumeration exceeds the budget; locate and update the
slowKV function and the existing test case in graph.test.ts to use the new
blocking delay implementation or add a separate test that references slowKV,
mockKV, registerGraphFunction, and localSdk to cover the worker-stall path.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c0a7eca3-4b15-4011-84e4-3f35fcc6866e

📥 Commits

Reviewing files that changed from the base of the PR and between 6cfb180 and 795537b.

📒 Files selected for processing (3)

src/functions/graph.ts
src/types.ts
test/graph.test.ts

🚧 Files skipped from review as they are similar to previous changes (2)

src/types.ts
src/functions/graph.ts

coderabbitai · 2026-06-04T09:43:17Z

+    function slowKV(delayMs: number) {
+      const base = mockKV();
+      return {
+        ...base,
+        list: async <T>(scope: string): Promise<T[]> => {
+          await new Promise((r) => setTimeout(r, delayMs));
+          return base.list<T>(scope);
+        },
+      };
+    }
+
+    it("graph-query startNodeId returns warning envelope when enumeration exceeds budget", async () => {
+      const slow = slowKV(7000); // > LIVE_ENUMERATION_BUDGET_MS (6000ms)
+      const localSdk = mockSdk();
+      registerGraphFunction(localSdk as never, slow as never, mockProvider as never);
+
+      const result = (await localSdk.trigger("mem::graph-query", {
+        startNodeId: "n_missing",
+      })) as GraphQueryResult;
+
+      expect(result.warning).toBeTruthy();
+      expect(result.warning).toMatch(/budget|enumeration/i);
+    }, 10000);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Budget fallback test does not exercise the real blocking failure mode.

slowKV.list uses setTimeout, which yields the event loop and lets the race timer fire. The reported production issue is synchronous kv.list blocking (timers/heartbeats starved), so this test can pass while the worker-stall path is still untested.

Suggested direction

+function blockingKV(blockMs: number) {
+  const base = mockKV();
+  return {
+    ...base,
+    list: async <T>(scope: string): Promise<T[]> => {
+      const start = Date.now();
+      while (Date.now() - start < blockMs) {
+        // Intentionally block event loop to simulate synchronous kv.list stall
+      }
+      return base.list<T>(scope);
+    },
+  };
+}

Use this in a dedicated test (with explicit timeout) to cover the starvation scenario called out in #814 testing notes.


allandelmare · 2026-06-04T15:26:26Z

Re-test of the latest push (b494f23) against the 75K-node corpus, via Claude Code. Good news + one precise remaining gap. Push back on anything that reads wrong.

Pulled b494f23 (incremental indexes + snapshot-only hot path + reset endpoint + the worker-death test), rebuilt, ran against the same 680-session / ~75K-node ~/data/. Backup held; no data loss — 131/133 memories, 685 sessions intact throughout (one transient sessions: 0 mid-test was just a reconnect-gap read while the worker was churning, not loss).

The core regression is FIXED ✅

The snapshot-only hot path works exactly as intended. Controlled experiment (no viewer, worker confirmed healthy first):

# 5x (graph/query {} + graph/stats) = 10 hot-path calls:
graph/query {}  -> 200 in 0.002s  {"nodes":[],"totalNodes":0,"warning":"No graph snapshot available..."}
graph/stats     -> 200 in 0.002s  {"fromSnapshot":false,"warning":"No graph snapshot available..."}
=> Worker "registered" delta: 0   (was +1 PER CALL before)

No more Invocation stopped, no worker death, no reconnect churn. An open viewer polling graph/stats every 2s would now be harmless. This is the fix — thank you.

Remaining gap: rebuild AND reset both still crash on a >25K legacy corpus

Both write-side paths still die the same way the hot path used to:

POST /graph/snapshot-rebuild -> 500 {"error":"Invocation stopped"} in 0.357s  ; worker delta +1
POST /graph/reset            -> 500 {"error":"Invocation stopped"} in 0.337s  ; worker delta +1

snapshot-rebuild: the REBUILD_SAFE_NODE_CEILING = 25000 guard reads snapshot.totalNodes — but a pre-#814 legacy corpus has no snapshot yet, so totalNodes is unknown/0, the guard doesn't fire, and it proceeds to the full kv.list<GraphNode> + kv.list<GraphEdge> pair → blocks the loop → worker dies. The guard can't know the corpus is 75K until it does the very enumeration that crashes it.
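
The hole in the guard can be illustrated with a few lines. This is a sketch of the behavior described above, not the actual guard code (which is not shown in this thread); `guardFires` and the `Snapshot` type are hypothetical.

```typescript
const REBUILD_SAFE_NODE_CEILING = 25000;

// Assumed snapshot shape: only the field the guard reads.
interface Snapshot {
  totalNodes: number;
}

// The guard can only compare against a count it already knows.
// With no snapshot, the count defaults to 0 and the guard stays silent.
function guardFires(snapshot: Snapshot | null): boolean {
  const knownTotal = snapshot?.totalNodes ?? 0;
  return knownTotal > REBUILD_SAFE_NODE_CEILING;
}
```

For a legacy corpus, `guardFires(null)` is `false` even when the real node count is 75K, so the rebuild proceeds straight into the full enumeration.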

graph/reset: this is the more important one. It's the documented escape hatch in the warning envelope you added ("Run POST /agentmemory/graph/reset to wipe and let future extracts repopulate"), but the wipe itself enumerates the keys to delete them, so it hits the same kv.list block and crashes. So on a >25K legacy corpus there is currently no working path to either rebuild or clear the graph — only the graceful empty-graph warning state.

Why the 6s budget doesn't save them

Interesting detail from the log: the budget timer does eventually fire —

[agentmemory] error Graph snapshot rebuild failed {"error":"graph-snapshot-rebuild enumeration: exceeded 6000ms budget"}

— but it fires at ~6s, after the caller already got Invocation stopped at ~0.35s and the worker already reconnected. iii's heartbeat deadline (~0.35s observed) is far shorter than the 6s budget, and a synchronously-blocking kv.list starves the heartbeat before the Promise.race timer can resolve. So the budget bounds the function's wall-clock but can't prevent the worker death — it's defending against the wrong deadline.
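
The starvation effect is easy to reproduce in isolation. The sketch below is a generic Node/JavaScript demonstration, not code from this PR: a timer is scheduled for `budgetMs`, then the thread busy-waits for `blockMs`, and the function reports how late the timer actually fired. `blockFor` and `measureTimerDelay` are illustrative names.

```typescript
// Busy-wait: holds the single JS thread so no timers or I/O callbacks can run.
function blockFor(ms: number): void {
  const start = Date.now();
  while (Date.now() - start < ms) {
    // intentionally empty
  }
}

// Schedule a timer for budgetMs, then block synchronously for blockMs.
// Resolves with the elapsed time at which the timer callback really ran.
function measureTimerDelay(budgetMs: number, blockMs: number): Promise<number> {
  const scheduledAt = Date.now();
  return new Promise<number>((resolve) => {
    setTimeout(() => resolve(Date.now() - scheduledAt), budgetMs);
    blockFor(blockMs); // runs before the event loop can service the timer
  });
}
```

Calling `measureTimerDelay(20, 200)` resolves with a value of at least 200, not 20: the timer cannot fire until the synchronous work ends. The same applies to any heartbeat that depends on the loop, which is why a budget longer than the heartbeat deadline cannot prevent the worker from being declared dead.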

Suggested direction

Both snapshot-rebuild and graph-reset need to stop loading the whole keyspace into one synchronous enumeration:

Cursor-paginated kv.list with await/setImmediate between batches, so the loop breathes and the heartbeat survives. Then the 6s budget (or any budget < heartbeat interval) can actually fire mid-walk.
For reset specifically, a streaming delete-by-key-prefix that never materializes all values in memory would be ideal — you don't need to read node bodies to wipe them.
Either fix makes the >25K escape hatch real; without it, large legacy corpora are stuck at empty-graph-forever.
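
A yielding walk along those lines could be sketched as follows. The engine does not currently expose pagination, so `listPage`, `PagedKV`, `Page`, and `walkScope` are all hypothetical; `memoryKV` is an in-memory stand-in used only to make the sketch runnable.

```typescript
// Hypothetical paginated primitive; nothing like this exists in StateKV today.
interface Page<T> {
  items: T[];
  nextCursor: string | null;
}
interface PagedKV {
  listPage<T>(scope: string, cursor: string | null, limit: number): Promise<Page<T>>;
}

// Yield one macrotask so timers and heartbeats can run between batches.
// (In Node, setImmediate would serve the same purpose.)
const yieldToLoop = (): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, 0));

// Walk a scope in bounded batches, yielding after each page.
async function walkScope<T>(
  kv: PagedKV,
  scope: string,
  limit: number,
  visit: (item: T) => void,
): Promise<number> {
  let cursor: string | null = null;
  let seen = 0;
  do {
    const page: Page<T> = await kv.listPage<T>(scope, cursor, limit);
    for (const item of page.items) {
      visit(item);
      seen++;
    }
    cursor = page.nextCursor;
    await yieldToLoop();
  } while (cursor !== null);
  return seen;
}

// In-memory stand-in: the cursor is simply the next array offset.
function memoryKV(data: Record<string, unknown[]>): PagedKV {
  return {
    async listPage<T>(scope: string, cursor: string | null, limit: number): Promise<Page<T>> {
      const all = (data[scope] ?? []) as T[];
      const start = cursor === null ? 0 : Number(cursor);
      const end = Math.min(start + limit, all.length);
      return { items: all.slice(start, end), nextCursor: end < all.length ? String(end) : null };
    },
  };
}
```

Each page stays small enough to parse quickly, and the yield between pages gives a budget check or heartbeat a chance to run mid-walk. A reset built on the same loop would call a delete per visited key instead of collecting values.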

Bottom line

Normal operation is now solid on a large corpus — that was the crash that made me stop using it, and it's gone. The remaining gap only bites operators with a big pre-#814 graph who want to reclaim it. Happy to test a yielding-enumeration patch against this exact corpus — it's the fastest >50K repro you'll get.

allandelmare · 2026-06-04T15:57:41Z

Two structural observations from digging into the root cause, via Claude Code — in case they're useful for where the fix goes next. Plus a standing offer. Push back on anything wrong.

After the b494f23 re-test, I traced why rebuild + reset still crash and it bottoms out at one shared primitive, which I think reframes the fix slightly.

`state::list` is the single shared root cause

StateKV only wraps five engine primitives: get / set / update / delete / list. There's no clear-scope, no cursor/pagination on list, and no key-only keys variant. So every operation that must touch all existing keys is forced through state::list, which returns the entire scope and lands a multi-MB payload that JSON.parse chews synchronously — starving the worker heartbeat (the ~0.35s death we keep seeing, well under the 6s budget).

That's why all three of these are really the same bug:

rebuild — needs every node to compute top-degree → state::list
reset — needs every key to delete it (no clear-scope) → state::list
bootstrapping an incremental key-index for a legacy corpus — can't be done without one state::list to seed it

So for a pre-#814 corpus there's a chicken-and-egg: the incremental snapshot/index design (which is the right architecture) can't be seeded without the one call that crashes. The hot-path fix is solid because it never seeds — it just serves whatever snapshot exists. The write side can't avoid the seed.

The clean fix seems to live in the engine, not here — a paginated/streaming state::list (or a clear-scope / key-only keys) would let rebuild and reset walk in yielding batches so the heartbeat survives. I realize that's upstream of this repo and tangled with the v0.11.2 pin, so I'm flagging it as context, not asking for it here.

Same latent bug in `migrate.ts`

While tracing state::list callers: src/functions/migrate.ts does kv.list<Memory>(KV.memories). On a large-memory corpus that would heartbeat-crash a migration the same way. We only have 131 memories so it's fine for us, but an operator with tens of thousands of memories running a migration would hit it. Might be worth the same yielding/guard treatment whenever the enumeration primitive gets fixed.

Standing offer

We've got the >50K repro you can't easily synthesize (680 sessions / ~75K graph nodes / 344 MB on disk), a backup, and a one-command rebuild-and-test loop pointed at it. Push any branch and I'll report rebuild/reset/hot-path behavior + worker-reconnect deltas against the real corpus, usually within the hour. The hot-path fix already verified clean here — happy to be the scale check on whatever lands for the write side.

fix(graph): top-degree snapshot cache for query + stats

75d77db

coderabbitai Bot reviewed Jun 3, 2026


fix(consistency): bump REST endpoint count 126 -> 127 for snapshot-rebuild

c6eb9ab

vercel Bot deployed to Preview June 3, 2026 19:54 View deployment

fix(graph): incremental indexes + snapshot-only hot path + reset endpoint

6cfb180

vercel Bot deployed to Preview June 3, 2026 23:26 View deployment

coderabbitai Bot reviewed Jun 3, 2026


fix(graph): sync degree comparator + composite-key reset + batch backfill

795537b

vercel Bot deployed to Preview June 4, 2026 09:28 View deployment

coderabbitai Bot reviewed Jun 4, 2026


test(graph): edge-key wipe assertion + worker-death rejection path

b494f23

vercel Bot deployed to Preview June 4, 2026 15:19 View deployment

rohitg00 merged commit 2a58140 into main Jun 4, 2026
7 checks passed

rohitg00 deleted the fix/814-graph-degree-cache branch June 4, 2026 15:29

allandelmare mentioned this pull request Jun 4, 2026

Bug: no path to rebuild or reset a >25K legacy graph — snapshot-rebuild + graph-reset both heartbeat-crash on state::list (follow-up to #816) #825

Open

rohitg00 mentioned this pull request Jun 4, 2026

chore(release): v0.9.27 #827

Merged

3 tasks

allandelmare mentioned this pull request Jun 4, 2026

Bug: entity-graph query/traversal (graph-query query + startNodeId) heartbeat-crashes past ~25K nodes on FRESH corpora — needs the per-node edge index #816 references #828

Open

coderabbitai Bot mentioned this pull request Jun 5, 2026

fix(knowledge-graph): deduplication, validation, GC, and locking (6 fixes) #813

Closed
