fix(graph): top-degree snapshot cache for /graph/query + /graph/stats (#814)#816
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a persisted top-degree GraphSnapshot and helpers; extract updates the snapshot incrementally; query and stats use snapshot-first or timeout-fallback behavior; adds mem::graph-snapshot-rebuild and mem::graph-reset handlers, API endpoints, tests, and docs/boot-log endpoint count updates. ChangesGraph Snapshot Caching
Sequence Diagram(s)sequenceDiagram
participant Client
participant HTTP_API as "api::graph-snapshot-rebuild"
participant MemFn as "mem::graph-snapshot-rebuild"
participant KV
Client->>HTTP_API: POST /agentmemory/graph/snapshot-rebuild
HTTP_API->>MemFn: invoke mem::graph-snapshot-rebuild()
MemFn->>KV: enumerate nodes/edges (budgeted)
KV-->>MemFn: node/edge batches
MemFn->>KV: write graphNameIndex/graphEdgeKey/node-degree
MemFn->>KV: persist KV.graphSnapshot
MemFn-->>HTTP_API: {stats, tookMs}
HTTP_API-->>Client: 200 {result}
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/functions/graph.ts (1)
616-651: 💤 Low valueConsider adding an audit trail for explicit snapshot rebuilds.
The
mem::graph-snapshot-rebuildfunction persists state but doesn't callrecordAudit(). While snapshot maintenance is arguably bookkeeping (similar to read-path counters), this endpoint is explicitly triggered by users rather than being a transparent cache operation. An audit entry would provide visibility into when and how often rebuilds occur.♻️ Suggested audit call
await kv.set(KV.graphSnapshot, SNAPSHOT_KEY, snap); const tookMs = Date.now() - started; + await recordAudit(kv, "observe", "mem::graph-snapshot-rebuild", [], { + totalNodes: snap.stats.totalNodes, + totalEdges: snap.stats.totalEdges, + topNodes: snap.topNodes.length, + tookMs, + }); logger.info("Graph snapshot rebuilt", {Based on learnings: "Function registration pattern: use sdk.registerFunction with validation of inputs, work via kv.get/kv.set/kv.list, and record audit via recordAudit() for state-changing operations."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/functions/graph.ts` around lines 616 - 651, Add an audit entry for the explicit rebuild inside the sdk.registerFunction handler for "mem::graph-snapshot-rebuild": after successfully building and persisting snap (built by buildSnapshotFromArrays and saved to KV.graphSnapshot / SNAPSHOT_KEY) call recordAudit(...) with a descriptive action like "graph.snapshot.rebuild" and include metadata such as success:true, snap.stats (totalNodes/totalEdges), topNodes/topEdges counts, updatedAt and tookMs; also record a failure audit when the catch block runs (action "graph.snapshot.rebuild" with success:false and the error message) so every explicit rebuild attempt is auditable.test/graph.test.ts (1)
418-517: ⚡ Quick winAdd coverage for the
#814live-enumeration failure/timeout fallback envelope (warning + snapshot/empty).
withTimeoutrejects when the wrappedkv.listpromise rejects, so you can force the handlers’catchimmediately by makingkv.listthrow—without waiting for the 6s budget.
- mem::graph-query (snapshot present): expect
fromSnapshot: trueandwarningwhen live enumeration fails.- mem::graph-query (no snapshot): expect an empty graph +
warning;fromSnapshotis intentionally omitted in this branch.- mem::graph-stats: mirror the same two cases (
fromSnapshot: truewith snapshot;fromSnapshot: falsewithout) and assertwarning.💚 Example tests for graph-query fallback
it("graph-query falls back to snapshot with a warning when live enumeration fails", async () => { await seed(10, 10); await sdk.trigger("mem::graph-snapshot-rebuild", {}); const originalList = kv.list; try { kv.list = (async () => { throw new Error("Invocation stopped"); }) as typeof kv.list; const result = (await sdk.trigger("mem::graph-query", { query: "anything", })) as GraphQueryResult; expect(result.fromSnapshot).toBe(true); expect(result.warning).toBeDefined(); } finally { kv.list = originalList; } }); it("graph-query returns empty + warning when no snapshot exists and live enumeration fails", async () => { await seed(10, 10); const originalList = kv.list; try { kv.list = (async () => { throw new Error("Invocation stopped"); }) as typeof kv.list; const result = (await sdk.trigger("mem::graph-query", { query: "anything", })) as GraphQueryResult; expect(result.nodes).toHaveLength(0); expect(result.edges).toHaveLength(0); expect(result.totalNodes).toBe(0); expect(result.totalEdges).toBe(0); expect(result.fromSnapshot).toBeUndefined(); expect(result.warning).toBeDefined(); } finally { kv.list = originalList; } });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/graph.test.ts` around lines 418 - 517, Add tests exercising the live-enumeration failure/timeout fallback by forcing kv.list to throw and restoring it in a finally block; specifically add three assertions: (1) for mem::graph-query when a snapshot exists, trigger "mem::graph-snapshot-rebuild" then override kv.list to throw and assert the handler returns fromSnapshot: true and a defined warning, (2) for mem::graph-query when no snapshot exists, seed but do not build a snapshot, override kv.list to throw and assert nodes/edges arrays are empty, totalNodes/totalEdges are 0, fromSnapshot is undefined, and warning is defined, and (3) mirror these two cases for mem::graph-stats asserting fromSnapshot true/false respectively and that warning is defined; use the existing test helpers seed, sdk.trigger, kv.list override/restore pattern and the GraphQueryResult shape to locate the tests to add/update.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/functions/graph.ts`:
- Around line 616-651: Add an audit entry for the explicit rebuild inside the
sdk.registerFunction handler for "mem::graph-snapshot-rebuild": after
successfully building and persisting snap (built by buildSnapshotFromArrays and
saved to KV.graphSnapshot / SNAPSHOT_KEY) call recordAudit(...) with a
descriptive action like "graph.snapshot.rebuild" and include metadata such as
success:true, snap.stats (totalNodes/totalEdges), topNodes/topEdges counts,
updatedAt and tookMs; also record a failure audit when the catch block runs
(action "graph.snapshot.rebuild" with success:false and the error message) so
every explicit rebuild attempt is auditable.
In `@test/graph.test.ts`:
- Around line 418-517: Add tests exercising the live-enumeration failure/timeout
fallback by forcing kv.list to throw and restoring it in a finally block;
specifically add three assertions: (1) for mem::graph-query when a snapshot
exists, trigger "mem::graph-snapshot-rebuild" then override kv.list to throw and
assert the handler returns fromSnapshot: true and a defined warning, (2) for
mem::graph-query when no snapshot exists, seed but do not build a snapshot,
override kv.list to throw and assert nodes/edges arrays are empty,
totalNodes/totalEdges are 0, fromSnapshot is undefined, and warning is defined,
and (3) mirror these two cases for mem::graph-stats asserting fromSnapshot
true/false respectively and that warning is defined; use the existing test
helpers seed, sdk.trigger, kv.list override/restore pattern and the
GraphQueryResult shape to locate the tests to add/update.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 471b5858-55f7-4bd9-8aba-a1320c110cec
📒 Files selected for processing (5)
src/functions/graph.tssrc/state/schema.tssrc/triggers/api.tssrc/types.tstest/graph.test.ts
Tested PR #816 (commit HeadlineThe new code is bundled, it boots, and the 6s graceful budget path does fire (visible in logs as Concrete repro# Fresh boot of the PR build against existing 75K-node corpus
$ cd ~ && node /tmp/agentmemory-pr816/dist/cli.mjs &
# Ready in 1s; circuit breaker closed; 684 sessions, 131 memories readable.
$ time curl -s -X POST http://localhost:3111/agentmemory/graph/snapshot-rebuild
{"error":"Invocation stopped","error_id":"a5a8b329a1b0"}
real 0m2.66s
$ time curl -s -X POST http://localhost:3111/agentmemory/graph/snapshot-rebuild
{"error":"Invocation stopped","error_id":"a5a983a601d0"}
real 0m1.45s
$ time curl -s -X POST http://localhost:3111/agentmemory/graph/query -d '{}'
{"error":"Invocation stopped","error_id":"..."}
real 0m5.24s
$ time curl -s -X POST http://localhost:3111/agentmemory/graph/query -d '{"limit": 10}'
{"error":"Invocation stopped","error_id":"..."}
real 0m1.14s
$ time curl -s http://localhost:3111/agentmemory/graph/stats
{"error":"Invocation stopped"} # or "Function not found" during reconnect window — see belowServer log during the same window confirms the new code paths execute: What I think is happeningBug 1 — iii invocation timeout < the new 6s budget. Snapshot-rebuild errors out in ~2.6s and Bug 2 — snapshot-rebuild itself can't escape the timeout it was built to solve. The endpoint is the first thing an operator runs to populate the cache, but it still needs to enumerate the full 75K-node corpus inside one iii call. At our scale that enumeration is exactly what was breaking
Bug 3 — worker reconnect churn under sustained graph load. When I had the daemon running for a while, the log showed continuous I don't know who's calling graph-stats internally — only the viewer in What still works
SummaryThe architecture in this PR is correct (snapshot cache + graceful timeout envelope), but the implementation has a coordination problem between the SDK's Happy to test any follow-up patch against this corpus — fastest feedback loop you'll get for the > 50K-node case. |
Correction to my previous commentI wrote "I didn't have a viewer open" and speculated about a mystery internal caller hammering But re-testing with the viewer closed surfaced the actual mechanism, which is more useful: Controlled experiment: graph calls crash the worker, cheap calls don'tFresh boot, no viewer, worker confirmed healthy ( Graph calls deterministically trigger a worker reconnect; cheap calls never do. Call 1 fires the enumeration and dies at ~0.5s; calls 2 & 3 land in the ~1s reconnect gap and get instant Root cause hypothesis
In short: a Implication for the fixThe snapshot-cache architecture is sound — precompute top-degree, serve the fast path, never enumerate on hot calls. But the rebuild itself (and any live-enumeration fallback) has to yield to the event loop, or it takes the worker down before it can finish or time out gracefully. Options:
CaveatI'm running the PR build via Data fully intact throughout (131/133 memories, 684 sessions, recall + observation capture all working). Standing by to test a chunked-enumeration patch against this corpus. |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
test/graph.test.ts (1)
29-32: ⚖️ Poor tradeoffCoverage gap: the timeout/fallback path that fails in production is never exercised.
mockKV.listresolves synchronously and instantly, sowithTimeout/LIVE_ENUMERATION_BUDGET_MSin the query and rebuild handlers never fire. None of these tests cover the catch branch that returns the{ warning, fromSnapshot }envelope, nor thetooLargerebuild abort aboveREBUILD_SAFE_NODE_CEILING. Per the tester's findings, the live enumeration is exactly what blocks the event loop and yields{"error":"Invocation stopped"}on large corpora — so a green suite here gives false confidence on the regression this PR targets.Consider adding a test that injects a slow/large
list(e.g. a deferred promise or fake timers) to assert the warning-envelope fallback, plus one that drivesmem::graph-snapshot-rebuildpast the ceiling to assert{ success: false, tooLarge: true }.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/graph.test.ts` around lines 29 - 32, Add tests that simulate slow and oversized KV enumeration: update the mockKV.list used by tests to return a deferred promise (or use fake timers) so that withTimeout/LIVE_ENUMERATION_BUDGET_MS in the query and rebuild handlers triggers the timeout path and the code returns the { warning, fromSnapshot } envelope; additionally add a test that drives the mem::graph-snapshot-rebuild logic past REBUILD_SAFE_NODE_CEILING to assert it returns { success: false, tooLarge: true }. Locate the mocks and handlers by the symbols mockKV.list, withTimeout, LIVE_ENUMERATION_BUDGET_MS, and REBUILD_SAFE_NODE_CEILING and ensure the new tests assert the warning envelope and tooLarge response respectively.src/functions/graph.ts (2)
266-266: ⚡ Quick winTimestamp not reused as per coding guidelines.
mergeNodecreates a newDate()on each call instead of reusing a captured timestamp. This could cause inconsistentupdatedAtvalues across nodes merged in the same extract batch.As per coding guidelines: "Timestamps: capture once with
new Date().toISOString()and reuse."♻️ Proposed fix
Pass the timestamp as a parameter:
function mergeNode( existing: GraphNode, incoming: GraphNode, obsIds: string[], + updatedAt: string, ): GraphNode { return { ...existing, sourceObservationIds: [ ...new Set([ ...existing.sourceObservationIds, ...incoming.sourceObservationIds, ...obsIds, ]), ], properties: { ...existing.properties, ...incoming.properties }, - updatedAt: new Date().toISOString(), + updatedAt, }; }Then in the extract handler, capture once and pass through.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/functions/graph.ts` at line 266, The mergeNode call currently creates its own timestamp (updatedAt: new Date().toISOString()) causing inconsistent timestamps; change mergeNode to accept a timestamp parameter (e.g., capturedTimestamp) and replace any internal new Date() usage with that parameter, then capture a single timestamp once in the extract handler (call it capturedTimestamp = new Date().toISOString()) and pass it into mergeNode for all merges in that batch so all nodes share the same updatedAt value.
825-835: ⚖️ Poor tradeoffSequential index backfill is O(n) awaits on large corpora.
For a 25K-node corpus (the ceiling), this is 25K sequential
kv.setcalls plus another pass for edges. Consider batching withPromise.allin chunks to reduce wall-clock time during rebuild.♻️ Example chunked approach
const BATCH_SIZE = 100; for (let i = 0; i < liveNodes.length; i += BATCH_SIZE) { const batch = liveNodes.slice(i, i + BATCH_SIZE); await Promise.all(batch.map((n) => Promise.all([ kv.set(KV.graphNameIndex, nameIndexKey(n.type, n.name), n.id), kv.set(KV.graphNodeDegree, n.id, degree.get(n.id) ?? 0), ]) )); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/functions/graph.ts` around lines 825 - 835, The current sequential backfill over liveNodes and liveEdges issues many await kv.set calls causing slow rebuilds; change the loops that call kv.set for liveNodes (using KV.graphNameIndex, nameIndexKey, KV.graphNodeDegree, degree) and liveEdges (using KV.graphEdgeKey and edgeIndexKey) to process in fixed-size batches (e.g., BATCH_SIZE = 100) and use Promise.all for each batch so the sets within a batch run in parallel, awaiting Promise.all per batch before moving to the next; keep the same keys/values and error handling but replace the per-item await with batch Promise.all mapping to kv.set calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/functions/graph.ts`:
- Around line 888-901: The reset loop currently deletes entries only when row.id
exists, which misses entries in the composite-key scopes (graphNameIndex,
graphEdgeKey, graphNodeDegree) because those store primitives; update the
deletion strategy in the block that iterates rows to remove all keys regardless
of value by either: use the KV API variant that returns keys (e.g., change the
kv.list call to request keys/entries and then call kv.delete(scope, key) for
each), or if supported call a scope-level/prefix delete to wipe the entire
scope; alternatively, change how index values are written (store { key, value })
so row.key can be used for deletion—adjust the loop to reference the actual key
field (or scope delete) instead of checking row.id. Ensure you update the code
paths that write those scopes (graphNameIndex/graphEdgeKey/graphNodeDegree) if
you choose the stored-key approach.
- Around line 200-208: The comparator for snap.topNodes incorrectly always
prioritizes the changed node (uses dA/dB set to next/null) which breaks the
descending-degree ordering; update the comparator in the sort call so it
compares actual degree values (use a.degree and b.degree or a cached degrees map
you build during the extract loop), substituting the changed node's degree with
the variable next when a.id === nodeId or b.id === nodeId, and return (degreeB -
degreeA) to preserve descending order; if degree lookups are asynchronous, make
the surrounding function async or precompute/cache degrees synchronously before
calling snap.topNodes.sort to avoid async operations inside the comparator.
---
Nitpick comments:
In `@src/functions/graph.ts`:
- Line 266: The mergeNode call currently creates its own timestamp (updatedAt:
new Date().toISOString()) causing inconsistent timestamps; change mergeNode to
accept a timestamp parameter (e.g., capturedTimestamp) and replace any internal
new Date() usage with that parameter, then capture a single timestamp once in
the extract handler (call it capturedTimestamp = new Date().toISOString()) and
pass it into mergeNode for all merges in that batch so all nodes share the same
updatedAt value.
- Around line 825-835: The current sequential backfill over liveNodes and
liveEdges issues many await kv.set calls causing slow rebuilds; change the loops
that call kv.set for liveNodes (using KV.graphNameIndex, nameIndexKey,
KV.graphNodeDegree, degree) and liveEdges (using KV.graphEdgeKey and
edgeIndexKey) to process in fixed-size batches (e.g., BATCH_SIZE = 100) and use
Promise.all for each batch so the sets within a batch run in parallel, awaiting
Promise.all per batch before moving to the next; keep the same keys/values and
error handling but replace the per-item await with batch Promise.all mapping to
kv.set calls.
In `@test/graph.test.ts`:
- Around line 29-32: Add tests that simulate slow and oversized KV enumeration:
update the mockKV.list used by tests to return a deferred promise (or use fake
timers) so that withTimeout/LIVE_ENUMERATION_BUDGET_MS in the query and rebuild
handlers triggers the timeout path and the code returns the { warning,
fromSnapshot } envelope; additionally add a test that drives the
mem::graph-snapshot-rebuild logic past REBUILD_SAFE_NODE_CEILING to assert it
returns { success: false, tooLarge: true }. Locate the mocks and handlers by the
symbols mockKV.list, withTimeout, LIVE_ENUMERATION_BUDGET_MS, and
REBUILD_SAFE_NODE_CEILING and ensure the new tests assert the warning envelope
and tooLarge response respectively.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9ad8adb5-d84d-45c0-b181-172701f15888
📒 Files selected for processing (7)
AGENTS.mdREADME.mdsrc/functions/graph.tssrc/index.tssrc/state/schema.tssrc/triggers/api.tstest/graph.test.ts
✅ Files skipped from review due to trivial changes (3)
- src/index.ts
- AGENTS.md
- README.md
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/graph.test.ts`:
- Around line 562-582: The test "graph-reset also wipes composite-key index
scopes" is missing assertions for the edge composite-key scope; before calling
sdk.trigger("mem::graph-reset", {}) verify that the edge index contains an entry
by reading kv.get("mem:graph:edge-key", "<composite>") or
kv.list("mem:graph:edge-key") after sdk.trigger("mem::graph-extract", {
observations: [testObs] }), then after calling sdk.trigger("mem::graph-reset",
{}) assert the edge index is cleared (kv.get returns null or kv.list length is
0) alongside the existing checks for "mem:graph:name-index" and
"mem:graph:node-degree".
- Around line 590-612: The test's slowKV implementation uses setTimeout which
yields the event loop and doesn't simulate the synchronous kv.list blocking
failure; replace the async delay in slowKV.list with a synchronous busy-wait
(CPU-blocking) delay to simulate event-loop starvation, then add a dedicated
test (using slowKV with the synchronous block and an explicit test timeout) that
calls registerGraphFunction/localSdk.trigger("mem::graph-query", ...) to assert
the warning envelope when enumeration exceeds the budget; locate and update the
slowKV function and the existing test case in graph.test.ts to use the new
blocking delay implementation or add a separate test that references slowKV,
mockKV, registerGraphFunction, and localSdk to cover the worker-stall path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c0a7eca3-4b15-4011-84e4-3f35fcc6866e
📒 Files selected for processing (3)
src/functions/graph.tssrc/types.tstest/graph.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- src/types.ts
- src/functions/graph.ts
| function slowKV(delayMs: number) { | ||
| const base = mockKV(); | ||
| return { | ||
| ...base, | ||
| list: async <T>(scope: string): Promise<T[]> => { | ||
| await new Promise((r) => setTimeout(r, delayMs)); | ||
| return base.list<T>(scope); | ||
| }, | ||
| }; | ||
| } | ||
|
|
||
| it("graph-query startNodeId returns warning envelope when enumeration exceeds budget", async () => { | ||
| const slow = slowKV(7000); // > LIVE_ENUMERATION_BUDGET_MS (6000ms) | ||
| const localSdk = mockSdk(); | ||
| registerGraphFunction(localSdk as never, slow as never, mockProvider as never); | ||
|
|
||
| const result = (await localSdk.trigger("mem::graph-query", { | ||
| startNodeId: "n_missing", | ||
| })) as GraphQueryResult; | ||
|
|
||
| expect(result.warning).toBeTruthy(); | ||
| expect(result.warning).toMatch(/budget|enumeration/i); | ||
| }, 10000); |
There was a problem hiding this comment.
Budget fallback test does not exercise the real blocking failure mode.
slowKV.list uses setTimeout, which yields the event loop and lets the race timer fire. The reported production issue is synchronous kv.list blocking (timers/heartbeats starved), so this test can pass while the worker-stall path is still untested.
Suggested direction
+function blockingKV(blockMs: number) {
+ const base = mockKV();
+ return {
+ ...base,
+ list: async <T>(scope: string): Promise<T[]> => {
+ const start = Date.now();
+ while (Date.now() - start < blockMs) {
+ // Intentionally block event loop to simulate synchronous kv.list stall
+ }
+ return base.list<T>(scope);
+ },
+ };
+}Use this in a dedicated test (with explicit timeout) to cover the starvation scenario called out in #814 testing notes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/graph.test.ts` around lines 590 - 612, The test's slowKV implementation
uses setTimeout which yields the event loop and doesn't simulate the synchronous
kv.list blocking failure; replace the async delay in slowKV.list with a
synchronous busy-wait (CPU-blocking) delay to simulate event-loop starvation,
then add a dedicated test (using slowKV with the synchronous block and an
explicit test timeout) that calls
registerGraphFunction/localSdk.trigger("mem::graph-query", ...) to assert the
warning envelope when enumeration exceeds the budget; locate and update the
slowKV function and the existing test case in graph.test.ts to use the new
blocking delay implementation or add a separate test that references slowKV,
mockKV, registerGraphFunction, and localSdk to cover the worker-stall path.
Pulled The core regression is FIXED ✅The snapshot-only hot path works exactly as intended. Controlled experiment (no viewer, worker confirmed healthy first): No more Remaining gap: rebuild AND reset both still crash on a >25K legacy corpusBoth write-side paths still die the same way the hot path used to: snapshot-rebuild: the graph/reset: this is the more important one. It's the documented escape hatch in the warning envelope you added ("Run POST /agentmemory/graph/reset to wipe and let future extracts repopulate"), but the wipe itself enumerates the keys to delete them, so it hits the same Why the 6s budget doesn't save themInteresting detail from the log: the budget timer does eventually fire — — but it fires at ~6s, after the caller already got Suggested directionBoth
Bottom lineNormal operation is now solid on a large corpus — that was the crash that made me stop using it, and it's gone. The remaining gap only bites operators with a big pre-#814 graph who want to reclaim it. Happy to test a yielding-enumeration patch against this exact corpus — it's the fastest >50K repro you'll get. |
After the
|
Summary
Fixes #814 (@allandelmare). Two-phase fix. Phase 1 (initial commits) added snapshot cache + wall-clock budget. Phase 2 (this revision, after Allan's test against his 75K-node corpus exposed that the budget can't fire on a blocked event loop) eliminates
kv.listfrom the hot path entirely.Allan's diagnosis
Confirmed end-to-end. At 75K nodes the
kv.list<GraphNode>+kv.list<GraphEdge>pair returns a ~37MB single WS frame. JSON.parse of that blocks the Node event loop for hundreds of ms. iii's heartbeat starves, worker is declared dead at ~0.5s. The Promise.race wall-clock timer never gets to fire because nothing on the loop runs. Caller sees rawInvocation stopped, never the graceful envelope.mem::graph-snapshot-rebuilditself was running the same enumeration, so the very call meant to populate the cache died at ~2.6s on his corpus. Catch-22.What changed in Phase 2
Incremental indexes maintained by
mem::graph-extractThree new KV scopes — all
kv.get/kv.setonly, no enumeration ever:KV.graphNameIndex—${type}|${name}→nodeId. Replaces the O(n)existingNodes.find(...)scan inside extract.KV.graphEdgeKey—${src}|${tgt}|${type}→edgeId. Same for edges.KV.graphNodeDegree—nodeId→ incident-edge count. Read / incremented on edge writes to maintain snapshot top-N ranking without scanning edges.mem::graph-extractis now O(extract_size), not O(total_nodes). The bottleneck Allan was hitting on every observation capture is gone.Snapshot updated inline on every extract
Read snapshot once, mutate stats / degree / topNodes / topEdges inline, write back. No dirty flag bouncing, no scheduled rebuild needed. Corpora built on a post-#814v2 build always have a current snapshot without any explicit rebuild call.
Hot path reads snapshot exclusively
mem::graph-queryempty-body / nodeType-only branch: snapshot only. Nokv.listfallback (that was the broken path).mem::graph-stats: snapshot only. Empty envelope + warning on miss, never a 500.mem::graph-querystartNodeId/querybranches: still need broader access; keep behind the wall-clock budget with snapshot fallback. Documented as degrading on >25K-node corpora until a per-node edge index lands.Legacy corpus rebuild + reset
mem::graph-snapshot-rebuildnow refuses to run when the recorded node count would block the worker (REBUILD_SAFE_NODE_CEILING = 25000). Returns{ success: false, tooLarge: true, totalNodes, ceiling, error }instead of dying. Also backfills the new name-index / edge-key / degree scopes so post-rebuild extracts hit the O(1) path.mem::graph-reset+POST /agentmemory/graph/reset— wipes all graph KV scopes, writes an empty snapshot. Observations / recall / history are NOT touched. Operators above the safe ceiling reset and let future extracts rebuild incrementally.Endpoint count + docs
/graph/reset). BumpedREADME.md,AGENTS.md,src/index.tsboot log.What this fixes for Allan
Even before backfilling, Allan's observation capture has been doing the broken
kv.liston every extract since v0.9.21. That ALSO blocks the worker for hundreds of ms each time, and is the most likely cause of the worker-reconnect churn he observed. Phase 2 fixes that immediately — every new extract is O(1).For his 75K accumulated graph:
POST /agentmemory/graph/resetis the path. Recall + sessions + observations stay intact; the graph rebuilds from new observations as they come in. The 75K accumulated nodes were largely redundant anyway (heavy dedup pressure from re-extracting the same source files / functions across sessions).Test plan
npx vitest run— 1406/1406 pass (12 new tests for incremental index + snapshot inline update + reset + name-index dedup)npm run buildcleannpx tsc --noEmit— no new errorsPOST /agentmemory/graph/reset→ trigger a few new observations →POST /agentmemory/graph/query {}andGET /agentmemory/graph/statsboth return < 100msOut of scope
startNodeId/querypaths still rely onkv.listand degrade above ~25K. Defer to a follow-up.cc @allandelmare — would appreciate one more run against your corpus once this lands. Reset is the fast path; alternatively
mem::graph-snapshot-rebuildwill refuse above 25K with a cleartooLargeerror envelope.Summary by CodeRabbit
New Features
Improvements
Tests
Documentation