fix(knowledge-graph): deduplication, validation, GC, and locking (6 fixes)#813
fix(knowledge-graph): deduplication, validation, GC, and locking (6 fixes)#813Acharnite wants to merge 6 commits into
Conversation
|
@Acharnite is attempting to deploy a commit to the rohitg00's projects Team on Vercel. A member of the Team first needs to authorize it. |
|
Need an answer fast? Review this PR in Change Stack to ask focused questions about the PR or a changed range. 📝 WalkthroughWalkthroughPrevents duplicate session extraction, adds VALID_TYPES and Jaccard-based fuzzy deduplication during graph extraction (with cached node mapping and edge remapping), and extends eviction with a graph GC pass removing stale and isolated nodes plus new counters. ChangesGraph Extraction and Eviction Improvements
Sequence Diagram(s)sequenceDiagram
participant Event as event::session::stopped
participant Triggers as registerEventTriggers
participant KV as KV.observations
participant Mem as mem::graph-extract
Event->>Triggers: invoke handler(sessionId)
Triggers->>Triggers: check extractionLocks (graph-extract-${sessionId})
alt lock present
Triggers->>Triggers: log "Skipping duplicate"
else lock absent
Triggers->>KV: list(sessionId)
KV->>Triggers: compressed observations[]
opt observations present
Triggers->>Mem: trigger mem::graph-extract with observations
end
Triggers->>Triggers: release lock (finally)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (2)
src/functions/evict.ts (2)
350-351: ⚡ Quick winParallelize the graph KV reads and add failure fallbacks.
These two reads are independent and every other
kv.listin this function uses.catch(() => []). As written, if either graph scan throws, the wholemem::evictrun aborts after sessions/observations/memories were already deleted, leaving stats unreported. Mirror the existing resilience and run them concurrently.♻️ Proposed change
- const allNodes = await kv.list<GraphNode>(KV.graphNodes); - const allEdges = await kv.list<GraphEdge>(KV.graphEdges); + const [allNodes, allEdges] = await Promise.all([ + kv.list<GraphNode>(KV.graphNodes).catch(() => []), + kv.list<GraphEdge>(KV.graphEdges).catch(() => []), + ]);As per coding guidelines: "Use parallel operations where possible (Promise.all for independent kv writes/reads)".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/functions/evict.ts` around lines 350 - 351, In evict.ts (mem::evict) the two independent graph scans (currently: const allNodes = await kv.list<GraphNode>(KV.graphNodes); const allEdges = await kv.list<GraphEdge>(KV.graphEdges);) should be executed in parallel and each should fall back to an empty array on error; replace with a Promise.all that runs both kv.list calls concurrently and applies .catch(() => []) to each (assign results back into allNodes and allEdges), so a failure in one read won’t abort the whole evict run and mirrors the existing resilient pattern used elsewhere.
394-402: ⚡ Quick winAvoid the O(nodes × edges) scan and reuse the existing
now.For each non-stale node this calls
allEdges.some(...), making isolation detection O(N·E) over the full graph (~1,614 nodes per the PR notes). Precompute the set of edge-connected node IDs once, then look up in O(1). Also,nowTimere-derives the timestamp already captured asnowat Line 111 — reuse it.⚡ Proposed change
- const nowTime = Date.now(); const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000; + const connectedNodeIds = new Set<string>(); + for (const e of allEdges) { + connectedNodeIds.add(e.sourceNodeId); + connectedNodeIds.add(e.targetNodeId); + } for (const node of allNodes) { if (node.stale) continue; // handled above - const hasEdges = allEdges.some( - (e) => e.sourceNodeId === node.id || e.targetNodeId === node.id, - ); - if (hasEdges) continue; + if (connectedNodeIds.has(node.id)) continue;As per coding guidelines: "Timestamps: capture once with new Date().toISOString() and reuse".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/functions/evict.ts` around lines 394 - 402, The loop over allNodes does an O(nodes×edges) check using allEdges.some(...) and recomputes nowTime instead of reusing the earlier now; fix by precomputing a Set of connected node IDs from allEdges (iterate allEdges once and add e.sourceNodeId and e.targetNodeId into connectedNodeIds) and then replace the hasEdges check with connectedNodeIds.has(node.id); also remove nowTime and reuse the previously captured now timestamp (use the existing now variable instead of creating nowTime). Ensure this change is applied around the code using allNodes, allEdges, node.stale, hasEdges, and now.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/functions/evict.ts`:
- Around line 405-406: Validate node.createdAt before using it to compute age in
the eviction logic: when computing const age = nowTime - new
Date(node.createdAt).getTime() (where age and SEVEN_DAYS_MS are used to decide
deletion), first check that node.createdAt exists and that the parsed timestamp
is a valid finite number (e.g., via Number.isFinite(createdAtTime) or
!Number.isNaN(createdAtTime)); if createdAt is missing or invalid, treat the
node as not eligible for eviction (skip/continue) and optionally log a warning
so corrupt nodes are preserved rather than deleted.
- Around line 355-390: The staleNodes/staleEdges loops currently skip counting
when dryRun because `if (dryRun) continue;` sits before incrementing stats and
recording the audit; move or duplicate the stats increment
(stats.staleGraphNodes / stats.staleGraphEdges) and the call to recordAudit so
they run when dryRun is true (and do not perform kv.delete in that case), or
alternatively perform the stats increment and recordAudit before the dryRun
`continue`; ensure you still omit `await kv.delete(KV.graphNodes, node.id)` /
`await kv.delete(KV.graphEdges, edge.id)` when dryRun is true while keeping
logger/audit and counters updated.
In `@src/functions/graph.ts`:
- Around line 221-226: The fuzzy-merge currently treats names with
empty/very-short token sets as identical because jaccardSimilarity returns 1 for
two empty token sets; update the merge logic in the existingNodes.find block
(the predicate that compares n.type and n.name against node.name) to guard
against empty/short names before using jaccardSimilarity: require both names
have meaningful tokens (e.g., check name length >= 3 or call a helper to compute
token sets and ensure both are non-empty) and only then apply
jaccardSimilarity(...) > 0.8; alternatively, change jaccardSimilarity to return
0 when either side yields zero tokens and use that safer function here—either
approach prevents accidental merges of empty/short names.
In `@src/triggers/events.ts`:
- Around line 72-92: The current lock (extractionLocks, lockKey) is released
immediately after calling sdk.trigger(..., TriggerAction.Void()), so it does not
cover the actual async extraction and allows duplicate edge inserts; fix by
either making the trigger span the full extraction (await or use a trigger
action that waits for completion instead of TriggerAction.Void) or, better, make
mem::graph-extract idempotent by changing parseGraphXml/generateId to produce
deterministic node IDs and computing stable edge fingerprints (e.g., hash of
source content, target content, and relationship type) so dedupe uses those
stable keys rather than non-deterministic generateId; update code references:
extractionLocks, lockKey, sdk.trigger, mem::graph-extract, parseGraphXml,
generateId to implement one of these approaches so the lock or idempotency
covers the entire extraction work.
---
Nitpick comments:
In `@src/functions/evict.ts`:
- Around line 350-351: In evict.ts (mem::evict) the two independent graph scans
(currently: const allNodes = await kv.list<GraphNode>(KV.graphNodes); const
allEdges = await kv.list<GraphEdge>(KV.graphEdges);) should be executed in
parallel and each should fall back to an empty array on error; replace with a
Promise.all that runs both kv.list calls concurrently and applies .catch(() =>
[]) to each (assign results back into allNodes and allEdges), so a failure in
one read won’t abort the whole evict run and mirrors the existing resilient
pattern used elsewhere.
- Around line 394-402: The loop over allNodes does an O(nodes×edges) check using
allEdges.some(...) and recomputes nowTime instead of reusing the earlier now;
fix by precomputing a Set of connected node IDs from allEdges (iterate allEdges
once and add e.sourceNodeId and e.targetNodeId into connectedNodeIds) and then
replace the hasEdges check with connectedNodeIds.has(node.id); also remove
nowTime and reuse the previously captured now timestamp (use the existing now
variable instead of creating nowTime). Ensure this change is applied around the
code using allNodes, allEdges, node.stale, hasEdges, and now.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ee7a5733-9281-410b-aa06-1c2486caef35
📒 Files selected for processing (3)
src/functions/evict.tssrc/functions/graph.tssrc/triggers/events.ts
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/functions/evict.ts (1)
403-413:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winIgnore stale edges when testing isolation.
Line 405 still checks
allEdges, so nodes attached only to stale edges survive this run even though those edges are deleted just above. That delays isolated-node cleanup and undercountsisolatedNodesindryRun.♻️ Proposed fix
- const hasEdges = allEdges.some( - (e) => e.sourceNodeId === node.id || e.targetNodeId === node.id, - ); + const hasEdges = allEdges.some( + (e) => + !e.stale && + (e.sourceNodeId === node.id || e.targetNodeId === node.id), + );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/functions/evict.ts` around lines 403 - 413, The isolation check currently uses allEdges including edges marked stale, so nodes connected only to stale edges are considered non-isolated; update the edge check in the loop to ignore stale edges (e.g., when computing hasEdges use only edges with !e.stale) so nodes attached solely to stale edges become eligible for eviction and are correctly counted for isolatedNodes/dryRun; adjust the hasEdges predicate that references allEdges to filter out stale edges while keeping other logic (node.stale, sourceObservationIds, createdAt) intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/functions/evict.ts`:
- Around line 403-413: The isolation check currently uses allEdges including
edges marked stale, so nodes connected only to stale edges are considered
non-isolated; update the edge check in the loop to ignore stale edges (e.g.,
when computing hasEdges use only edges with !e.stale) so nodes attached solely
to stale edges become eligible for eviction and are correctly counted for
isolatedNodes/dryRun; adjust the hasEdges predicate that references allEdges to
filter out stale edges while keeping other logic (node.stale,
sourceObservationIds, createdAt) intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 83aa4d90-1b36-4f7d-9974-cadb34c436af
📒 Files selected for processing (3)
src/functions/evict.tssrc/functions/graph.tssrc/triggers/events.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- src/functions/graph.ts
- src/triggers/events.ts
…guard, and tests - Jaccard similarity secondary scan when name index exact match misses - Only applies for names >= 3 chars to avoid false positives on short names - VALID_TYPES guard in parseGraphXml rejects corrupt entity types from LLM - graph-dedup.test.ts (8 tests): exact match, fuzzy merge, type scoping, short name guard - graph-valid-types.test.ts (11 tests): all valid types, invalid types, mixed entities - extraction-locking.test.ts (5 tests): concurrent extraction blocking, release, errors All 41 patch-specific tests pass.
- Conflict 1: Name index (upstream) + Jaccard secondary scan (ours) - Conflict 2: Upstream edge index approach (graph-gc handles related_to dedup post-hoc) - Both approaches preserved, no regressions
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
src/functions/graph.ts (1)
529-535: ⚡ Quick winParallelize independent KV writes in the hot extraction path.
These writes are independent and currently serialized, adding avoidable latency per node/edge.
Suggested refactor
- await kv.set(KV.graphNodes, node.id, node); - await kv.set(KV.graphNameIndex, indexKey, node.id); - await kv.set(KV.graphNodeDegree, node.id, 0); + await Promise.all([ + kv.set(KV.graphNodes, node.id, node), + kv.set(KV.graphNameIndex, indexKey, node.id), + kv.set(KV.graphNodeDegree, node.id, 0), + ]); ... - await kv.set(KV.graphEdges, edge.id, edge); - await kv.set(KV.graphEdgeKey, eKey, edge.id); + await Promise.all([ + kv.set(KV.graphEdges, edge.id, edge), + kv.set(KV.graphEdgeKey, eKey, edge.id), + ]);As per coding guidelines, "Use parallel operations with Promise.all for independent KV writes/reads."
Also applies to: 567-569
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/functions/graph.ts` around lines 529 - 535, The three independent KV writes (kv.set calls to KV.graphNodes, KV.graphNameIndex, and KV.graphNodeDegree) are currently serialized; change them to run in parallel by invoking Promise.all on the three kv.set promises (while keeping in-memory updates to snap.stats and newNodeCount as-is). Do the same refactor for the other group of writes mentioned (the kv.set calls around the region referenced as 567-569). Ensure you reference the exact functions/keys: kv.set for KV.graphNodes, KV.graphNameIndex, and KV.graphNodeDegree and replace the serialized awaits with a single await Promise.all([...]) to reduce hot-path latency.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/functions/graph.ts`:
- Around line 504-517: The dedupe logic currently can match against stale nodes;
ensure you exclude stale/old records when checking both the exact match
(existing = await kv.get<GraphNode>(KV.graphNodes, existingId)) and the fuzzy
search over allNodes (allNodes = await kv.list<GraphNode>(KV.graphNodes)). Add a
freshness filter (e.g., skip nodes where node.stale === true or where
node.lastSeen/updatedAt is older than your freshness threshold) before assigning
to existing or considering them in jaccardSimilarity comparisons (use
GraphNode.stale or GraphNode.lastSeen/updatedAt accordingly) so merges only
target fresh observations; apply the same filter to the later duplicate block
(lines ~519-527) as well.
- Around line 510-517: The fuzzy fallback currently calls
kv.list<GraphNode>(KV.graphNodes) inside the per-node unmatched branch, causing
repeated full scans; modify the logic so that allNodes is fetched once outside
the per-node loop (or cached on first use) and reused for subsequent fuzzy
checks, then use that cached allNodes when computing jaccardSimilarity against
node.name to set existing; update references to the existing/null assignment
block (the unmatched node branch that calls kv.list, jaccardSimilarity,
node.name, and KV.graphNodes) to use the single cached list instead of calling
kv.list repeatedly.
- Around line 497-576: Nodes that merged into existing persisted nodes leave
edges using transient parsed node IDs, corrupting edge keys and degrees; fix by
creating a mapping from original parsed node IDs to the final persisted node IDs
during the node loop (use node.id and existing?.id or merged.id from mergeNode),
then before processing each edge in the edges loop remap edge.sourceNodeId and
edge.targetNodeId to the persisted IDs (and recompute eKey via edgeIndexKey) so
writes to KV.graphEdges / KV.graphEdgeKey and calls to applyDegreeDelta use the
canonical IDs; if a mapped persisted ID is missing, skip the edge or log and
avoid updating degrees to prevent corruption.
In `@test/extraction-locking.test.ts`:
- Around line 179-183: The test uses a fixed delay (await new Promise(r =>
setTimeout(r, 5))) to let the handler reach kv.list which is flaky; replace that
with a deterministic barrier by creating a promise (e.g., let listReached =
createDeferred()) and resolving it inside the overridden kv.list implementation
(where listCalls is incremented), then await that promise in the test before
asserting expect(listCalls).toBe(1); repeat the same change for the other
occurrence around lines 262-264 so both waits rely on an explicit promise
resolved from within kv.list rather than setTimeout.
In `@test/graph-dedup.test.ts`:
- Around line 104-247: The test suite lacks a regression test to verify
relationships are updated when a node is deduped; add a new case in
test/graph-dedup.test.ts that uses sdk.trigger("mem::graph-extract") with XML
including a <relationship source="..." target="..."> where one endpoint's
<entity name="..."> will merge via the existing dedup logic (e.g., "AgentMemory"
-> "agentmemory"), then fetch stored edges (via kv.list or the same GraphNode
store if edges are stored alongside nodes) and assert that the relationship's
endpoint ID(s) have been rewritten to the merged node.id (and that no edges
reference the old/orphaned id); reference existing helpers used in other tests
(mockProvider.compress, testObs1/testObs2, jaccard flow) so the new test mirrors
the merge path exercised by the other cases.
---
Nitpick comments:
In `@src/functions/graph.ts`:
- Around line 529-535: The three independent KV writes (kv.set calls to
KV.graphNodes, KV.graphNameIndex, and KV.graphNodeDegree) are currently
serialized; change them to run in parallel by invoking Promise.all on the three
kv.set promises (while keeping in-memory updates to snap.stats and newNodeCount
as-is). Do the same refactor for the other group of writes mentioned (the kv.set
calls around the region referenced as 567-569). Ensure you reference the exact
functions/keys: kv.set for KV.graphNodes, KV.graphNameIndex, and
KV.graphNodeDegree and replace the serialized awaits with a single await
Promise.all([...]) to reduce hot-path latency.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 80a864b6-7876-46cf-8af0-c621b835bb96
📒 Files selected for processing (4)
src/functions/graph.tstest/extraction-locking.test.tstest/graph-dedup.test.tstest/graph-valid-types.test.ts
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 5
🧹 Nitpick comments (1)
src/functions/graph.ts (1)
529-535: ⚡ Quick winParallelize independent KV writes in the hot extraction path.
These writes are independent and currently serialized, adding avoidable latency per node/edge.
Suggested refactor
- await kv.set(KV.graphNodes, node.id, node); - await kv.set(KV.graphNameIndex, indexKey, node.id); - await kv.set(KV.graphNodeDegree, node.id, 0); + await Promise.all([ + kv.set(KV.graphNodes, node.id, node), + kv.set(KV.graphNameIndex, indexKey, node.id), + kv.set(KV.graphNodeDegree, node.id, 0), + ]); ... - await kv.set(KV.graphEdges, edge.id, edge); - await kv.set(KV.graphEdgeKey, eKey, edge.id); + await Promise.all([ + kv.set(KV.graphEdges, edge.id, edge), + kv.set(KV.graphEdgeKey, eKey, edge.id), + ]);As per coding guidelines, "Use parallel operations with Promise.all for independent KV writes/reads."
Also applies to: 567-569
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/functions/graph.ts` around lines 529 - 535, The three independent KV writes (kv.set calls to KV.graphNodes, KV.graphNameIndex, and KV.graphNodeDegree) are currently serialized; change them to run in parallel by invoking Promise.all on the three kv.set promises (while keeping in-memory updates to snap.stats and newNodeCount as-is). Do the same refactor for the other group of writes mentioned (the kv.set calls around the region referenced as 567-569). Ensure you reference the exact functions/keys: kv.set for KV.graphNodes, KV.graphNameIndex, and KV.graphNodeDegree and replace the serialized awaits with a single await Promise.all([...]) to reduce hot-path latency.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/functions/graph.ts`:
- Around line 504-517: The dedupe logic currently can match against stale nodes;
ensure you exclude stale/old records when checking both the exact match
(existing = await kv.get<GraphNode>(KV.graphNodes, existingId)) and the fuzzy
search over allNodes (allNodes = await kv.list<GraphNode>(KV.graphNodes)). Add a
freshness filter (e.g., skip nodes where node.stale === true or where
node.lastSeen/updatedAt is older than your freshness threshold) before assigning
to existing or considering them in jaccardSimilarity comparisons (use
GraphNode.stale or GraphNode.lastSeen/updatedAt accordingly) so merges only
target fresh observations; apply the same filter to the later duplicate block
(lines ~519-527) as well.
- Around line 510-517: The fuzzy fallback currently calls
kv.list<GraphNode>(KV.graphNodes) inside the per-node unmatched branch, causing
repeated full scans; modify the logic so that allNodes is fetched once outside
the per-node loop (or cached on first use) and reused for subsequent fuzzy
checks, then use that cached allNodes when computing jaccardSimilarity against
node.name to set existing; update references to the existing/null assignment
block (the unmatched node branch that calls kv.list, jaccardSimilarity,
node.name, and KV.graphNodes) to use the single cached list instead of calling
kv.list repeatedly.
- Around line 497-576: Nodes that merged into existing persisted nodes leave
edges using transient parsed node IDs, corrupting edge keys and degrees; fix by
creating a mapping from original parsed node IDs to the final persisted node IDs
during the node loop (use node.id and existing?.id or merged.id from mergeNode),
then before processing each edge in the edges loop remap edge.sourceNodeId and
edge.targetNodeId to the persisted IDs (and recompute eKey via edgeIndexKey) so
writes to KV.graphEdges / KV.graphEdgeKey and calls to applyDegreeDelta use the
canonical IDs; if a mapped persisted ID is missing, skip the edge or log and
avoid updating degrees to prevent corruption.
In `@test/extraction-locking.test.ts`:
- Around line 179-183: The test uses a fixed delay (await new Promise(r =>
setTimeout(r, 5))) to let the handler reach kv.list which is flaky; replace that
with a deterministic barrier by creating a promise (e.g., let listReached =
createDeferred()) and resolving it inside the overridden kv.list implementation
(where listCalls is incremented), then await that promise in the test before
asserting expect(listCalls).toBe(1); repeat the same change for the other
occurrence around lines 262-264 so both waits rely on an explicit promise
resolved from within kv.list rather than setTimeout.
In `@test/graph-dedup.test.ts`:
- Around line 104-247: The test suite lacks a regression test to verify
relationships are updated when a node is deduped; add a new case in
test/graph-dedup.test.ts that uses sdk.trigger("mem::graph-extract") with XML
including a <relationship source="..." target="..."> where one endpoint's
<entity name="..."> will merge via the existing dedup logic (e.g., "AgentMemory"
-> "agentmemory"), then fetch stored edges (via kv.list or the same GraphNode
store if edges are stored alongside nodes) and assert that the relationship's
endpoint ID(s) have been rewritten to the merged node.id (and that no edges
reference the old/orphaned id); reference existing helpers used in other tests
(mockProvider.compress, testObs1/testObs2, jaccard flow) so the new test mirrors
the merge path exercised by the other cases.
---
Nitpick comments:
In `@src/functions/graph.ts`:
- Around line 529-535: The three independent KV writes (kv.set calls to
KV.graphNodes, KV.graphNameIndex, and KV.graphNodeDegree) are currently
serialized; change them to run in parallel by invoking Promise.all on the three
kv.set promises (while keeping in-memory updates to snap.stats and newNodeCount
as-is). Do the same refactor for the other group of writes mentioned (the kv.set
calls around the region referenced as 567-569). Ensure you reference the exact
functions/keys: kv.set for KV.graphNodes, KV.graphNameIndex, and
KV.graphNodeDegree and replace the serialized awaits with a single await
Promise.all([...]) to reduce hot-path latency.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 80a864b6-7876-46cf-8af0-c621b835bb96
📒 Files selected for processing (4)
src/functions/graph.tstest/extraction-locking.test.tstest/graph-dedup.test.tstest/graph-valid-types.test.ts
🛑 Comments failed to post (5)
src/functions/graph.ts (3)
497-576:
⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy liftRemap relationship endpoints to persisted node IDs before writing edges.
When an extracted node merges into an existing node, edge endpoints still use transient parsed IDs. That can persist edges pointing to non-existent nodes, corrupting edge-key dedup and degree counters.
Suggested fix
+ const resolvedNodeIds = new Map<string, string>(); for (const node of nodes) { const indexKey = nameIndexKey(node.type, node.name); ... if (existing) { const merged = mergeNode(existing, node, obsIds, capturedAt); await kv.set(KV.graphNodes, existing.id, merged); + resolvedNodeIds.set(node.id, existing.id); ... } else { await kv.set(KV.graphNodes, node.id, node); ... + resolvedNodeIds.set(node.id, node.id); } } for (const edge of edges) { + const sourceNodeId = resolvedNodeIds.get(edge.sourceNodeId); + const targetNodeId = resolvedNodeIds.get(edge.targetNodeId); + if (!sourceNodeId || !targetNodeId) continue; + const resolvedEdge = { ...edge, sourceNodeId, targetNodeId }; - const eKey = edgeIndexKey(edge.sourceNodeId, edge.targetNodeId, edge.type); + const eKey = edgeIndexKey(sourceNodeId, targetNodeId, resolvedEdge.type); ... - await kv.set(KV.graphEdges, edge.id, edge); - await kv.set(KV.graphEdgeKey, eKey, edge.id); + await kv.set(KV.graphEdges, resolvedEdge.id, resolvedEdge); + await kv.set(KV.graphEdgeKey, eKey, resolvedEdge.id); ... - newEdgesForTopCheck.push(edge); + newEdgesForTopCheck.push(resolvedEdge); } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/functions/graph.ts` around lines 497 - 576, Nodes that merged into existing persisted nodes leave edges using transient parsed node IDs, corrupting edge keys and degrees; fix by creating a mapping from original parsed node IDs to the final persisted node IDs during the node loop (use node.id and existing?.id or merged.id from mergeNode), then before processing each edge in the edges loop remap edge.sourceNodeId and edge.targetNodeId to the persisted IDs (and recompute eKey via edgeIndexKey) so writes to KV.graphEdges / KV.graphEdgeKey and calls to applyDegreeDelta use the canonical IDs; if a mapped persisted ID is missing, skip the edge or log and avoid updating degrees to prevent corruption.
504-517:
⚠️ Potential issue | 🟠 Major | ⚡ Quick winDo not deduplicate against stale nodes.
Exact and fuzzy matching can select stale nodes, and merged output keeps stale state from the existing record. Fresh observations can become hidden/evictable immediately after merge.
Suggested fix
let existing: GraphNode | null = null; if (existingId) { existing = await kv.get<GraphNode>(KV.graphNodes, existingId); + if (existing?.stale) existing = null; } if (!existing && node.name.length >= 3) { const allNodes = await kv.list<GraphNode>(KV.graphNodes); existing = allNodes.find( (n) => + !n.stale && n.type === node.type && n.name.length >= 3 && jaccardSimilarity(n.name.toLowerCase(), node.name.toLowerCase()) > 0.8, ) ?? null; }Also applies to: 519-527
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/functions/graph.ts` around lines 504 - 517, The dedupe logic currently can match against stale nodes; ensure you exclude stale/old records when checking both the exact match (existing = await kv.get<GraphNode>(KV.graphNodes, existingId)) and the fuzzy search over allNodes (allNodes = await kv.list<GraphNode>(KV.graphNodes)). Add a freshness filter (e.g., skip nodes where node.stale === true or where node.lastSeen/updatedAt is older than your freshness threshold) before assigning to existing or considering them in jaccardSimilarity comparisons (use GraphNode.stale or GraphNode.lastSeen/updatedAt accordingly) so merges only target fresh observations; apply the same filter to the later duplicate block (lines ~519-527) as well.
510-517:
⚠️ Potential issue | 🟠 Major | ⚡ Quick winFuzzy fallback reintroduces repeated full-scope scans.
kv.list(KV.graphNodes)runs once per unmatched extracted node. On larger corpora this can timeout extraction again and negate the targeted-index improvement.Suggested fix
+ let fuzzyCandidates: GraphNode[] | null = null; for (const node of nodes) { ... if (!existing && node.name.length >= 3) { - const allNodes = await kv.list<GraphNode>(KV.graphNodes); - existing = allNodes.find( + fuzzyCandidates ??= await kv.list<GraphNode>(KV.graphNodes); + existing = fuzzyCandidates.find( (n) => n.type === node.type && n.name.length >= 3 && jaccardSimilarity(n.name.toLowerCase(), node.name.toLowerCase()) > 0.8, ) ?? null; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/functions/graph.ts` around lines 510 - 517, The fuzzy fallback currently calls kv.list<GraphNode>(KV.graphNodes) inside the per-node unmatched branch, causing repeated full scans; modify the logic so that allNodes is fetched once outside the per-node loop (or cached on first use) and reused for subsequent fuzzy checks, then use that cached allNodes when computing jaccardSimilarity against node.name to set existing; update references to the existing/null assignment block (the unmatched node branch that calls kv.list, jaccardSimilarity, node.name, and KV.graphNodes) to use the single cached list instead of calling kv.list repeatedly.test/extraction-locking.test.ts (1)
179-183:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick winReplace fixed-delay yields with deterministic barriers in concurrency tests.
Using
setTimeout(..., 5)to wait for handler progress can cause intermittent failures under CI load. Prefer awaiting explicit promises resolved inside the overriddenkv.listpath.Also applies to: 262-264
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extraction-locking.test.ts` around lines 179 - 183, The test uses a fixed delay (await new Promise(r => setTimeout(r, 5))) to let the handler reach kv.list which is flaky; replace that with a deterministic barrier by creating a promise (e.g., let listReached = createDeferred()) and resolving it inside the overridden kv.list implementation (where listCalls is incremented), then await that promise in the test before asserting expect(listCalls).toBe(1); repeat the same change for the other occurrence around lines 262-264 so both waits rely on an explicit promise resolved from within kv.list rather than setTimeout.test/graph-dedup.test.ts (1)
104-247: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Add a regression test for relationships after node dedup merges.
Current assertions focus on node counts. A case with
<relationship ...>where one endpoint dedups to an existing node is needed to catch orphan-edge endpoint regressions.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/graph-dedup.test.ts` around lines 104 - 247, The test suite lacks a regression test to verify relationships are updated when a node is deduped; add a new case in test/graph-dedup.test.ts that uses sdk.trigger("mem::graph-extract") with XML including a <relationship source="..." target="..."> where one endpoint's <entity name="..."> will merge via the existing dedup logic (e.g., "AgentMemory" -> "agentmemory"), then fetch stored edges (via kv.list or the same GraphNode store if edges are stored alongside nodes) and assert that the relationship's endpoint ID(s) have been rewritten to the merged node.id (and that no edges reference the old/orphaned id); reference existing helpers used in other tests (mockProvider.compress, testObs1/testObs2, jaccard flow) so the new test mirrors the merge path exercised by the other cases.
… nodes, remap edge IDs - Cache all graph nodes once per batch instead of per-node kv.list calls - Skip stale nodes during fuzzy dedup - Remap transient parsed node IDs to persisted node IDs for edges - Parallelize kv.set calls for new nodes
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/functions/graph.ts (1)
514-521:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard on meaningful tokens, not raw string length, before using Jaccard.
Line 515/520 only check character length. That still lets values like
"a b"and"x y"through, andsrc/state/schema.tsreturns1when both token sets are empty after thet.length > 2filter. Those names will still collapse into one node even though they share no meaningful tokens. Please require at least one retained token on both sides before applyingjaccardSimilarity(...) > 0.8.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/functions/graph.ts` around lines 514 - 521, The fuzzy dedup branch uses raw string length to decide when to call jaccardSimilarity, which can still allow names with no meaningful tokens to be compared; update the allNodes.find predicate (around the existing variable and jaccardSimilarity call) to first compute the token sets for both n.name and node.name using the same tokenization/filtering logic used elsewhere (e.g., split and keep tokens with length > 2) and only proceed to compute jaccardSimilarity if both token sets have at least one retained token; otherwise skip that candidate. Ensure you reference the same tokenization rule used elsewhere so the check (both token sets non-empty) prevents spurious merges.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/functions/graph.ts`:
- Around line 496-498: The in-memory fuzzy-dedup cache (allNodes and nodeIdMap)
is snapshotted once and never updated when you insert new nodes, so subsequent
names in the same extraction run can be persisted as duplicates; after you
create/persist a new GraphNode (the code path that inserts to KV.graphNodes),
append that new node into the allNodes array and update nodeIdMap accordingly
(or insert into whatever in-memory index you use for fuzzy matching) so future
fuzzy-match checks in the same run will see it; locate usages of
allNodes/nodeIdMap and the block that persists a new node and add a single-step
cache update there.
- Around line 496-497: The change reintroduces a full-scope
kv.list<GraphNode>(KV.graphNodes) inside mem::graph-extract (allNodes) which
will re-trigger full graph scans at scale; replace this with a targeted
candidate lookup for fuzzy dedup instead of enumerating KV.graphNodes: e.g.,
consult an existing narrower index or embedding nearest-neighbour store (or a
prefix/time-windowed index, recentIds list, or KV.graphIndex) to produce a small
candidate id set, then fetch only those nodes via kv.get/kv.mget and run fuzzy
dedup against that set; ensure the new flow keeps the variable name (allNodes or
candidates) but limits size and is bounded/timeboxed to avoid heartbeat/timeout
issues.
---
Duplicate comments:
In `@src/functions/graph.ts`:
- Around line 514-521: The fuzzy dedup branch uses raw string length to decide
when to call jaccardSimilarity, which can still allow names with no meaningful
tokens to be compared; update the allNodes.find predicate (around the existing
variable and jaccardSimilarity call) to first compute the token sets for both
n.name and node.name using the same tokenization/filtering logic used elsewhere
(e.g., split and keep tokens with length > 2) and only proceed to compute
jaccardSimilarity if both token sets have at least one retained token; otherwise
skip that candidate. Ensure you reference the same tokenization rule used
elsewhere so the check (both token sets non-empty) prevents spurious merges.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| // #813: cache all nodes once for fuzzy dedup, avoiding per-node kv.list calls | ||
| const allNodes = await kv.list<GraphNode>(KV.graphNodes); |
There was a problem hiding this comment.
Avoid reintroducing a full graph scan in mem::graph-extract.
Line 497 brings back the exact kv.list<GraphNode>(KV.graphNodes) enumeration that Lines 486-490 say was removed because large corpora can blow the iii heartbeat budget. Since this now runs on every extract and has no timeout/fallback, fuzzy dedup can make extraction fail again at scale. Please move the fuzzy lookup onto a narrower index/candidate set instead of a whole-scope list.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/functions/graph.ts` around lines 496 - 497, The change reintroduces a
full-scope kv.list<GraphNode>(KV.graphNodes) inside mem::graph-extract
(allNodes) which will re-trigger full graph scans at scale; replace this with a
targeted candidate lookup for fuzzy dedup instead of enumerating KV.graphNodes:
e.g., consult an existing narrower index or embedding nearest-neighbour store
(or a prefix/time-windowed index, recentIds list, or KV.graphIndex) to produce a
small candidate id set, then fetch only those nodes via kv.get/kv.mget and run
fuzzy dedup against that set; ensure the new flow keeps the variable name
(allNodes or candidates) but limits size and is bounded/timeboxed to avoid
heartbeat/timeout issues.
| // #813: cache all nodes once for fuzzy dedup, avoiding per-node kv.list calls | ||
| const allNodes = await kv.list<GraphNode>(KV.graphNodes); | ||
| const nodeIdMap = new Map<string, string>(); |
There was a problem hiding this comment.
Keep the fuzzy-dedup candidate cache in sync within the same extraction run.
allNodes is snapshotted once before the loop, then never updated after Line 537 inserts a new node. If one LLM response emits two similar names in the same batch, the second node will miss the fuzzy match and be persisted as a duplicate because it only searches the pre-existing corpus.
Suggested fix
} else {
await Promise.all([
kv.set(KV.graphNodes, node.id, node),
kv.set(KV.graphNameIndex, indexKey, node.id),
kv.set(KV.graphNodeDegree, node.id, 0),
]);
+ allNodes.push(node);
snap.stats.totalNodes += 1;
snap.stats.nodesByType[node.type] =
(snap.stats.nodesByType[node.type] ?? 0) + 1;
newNodeCount += 1;
nodeIdMap.set(node.id, node.id);Also applies to: 536-545
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/functions/graph.ts` around lines 496 - 498, The in-memory fuzzy-dedup
cache (allNodes and nodeIdMap) is snapshotted once and never updated when you
insert new nodes, so subsequent names in the same extraction run can be
persisted as duplicates; after you create/persist a new GraphNode (the code path
that inserts to KV.graphNodes), append that new node into the allNodes array and
update nodeIdMap accordingly (or insert into whatever in-memory index you use
for fuzzy matching) so future fuzzy-match checks in the same run will see it;
locate usages of allNodes/nodeIdMap and the block that persists a new node and
add a single-step cache update there.
Summary
6 knowledge graph fixes for agentmemory's graph extraction, eviction, and event systems:
Fix 1: Stale Node/Edge Garbage Collection
When consolidation cascade marks nodes/edges as
stale: true, they were filtered from queries but never deleted. This fix adds stale cleanup tomem::evict.Fix 2: Fuzzy Deduplication via Jaccard Similarity
Node dedup was exact match only (
n.name === node.name). Now usesjaccardSimilarity()(>0.8) to catch duplicates with minor name variations (trailing spaces, etc.).Fix 3: Isolated Node Cleanup
43% of nodes had 0 edges and 0 observations.
mem::evictnow deletes nodes with no edges, no observations, and age >7 days.Fix 4: LLM Output Validation
The graph extraction prompt specifies valid entity types (
file|function|concept|error|decision|pattern|library|person), but the LLM sometimes regurgitates the type list itself. AddedVALID_TYPESguard inparseGraphXml.Fix 5:
related_toEdge Dedup49% of edges were the generic
related_totype. When a more specific edge exists between the same node pair,related_tois now skipped.Fix 6: Extraction Locking (Race Conditions)
session::stoppedtriggersgraph-extractasynchronously. Added an in-memory lock set to prevent concurrent extraction for the same session.Testing
Existing test suites pass. Key areas affected:
src/functions/evict.ts— stale GC + isolated node cleanupsrc/functions/graph.ts— Jaccard dedup, LLM validation, related_to dedupsrc/triggers/events.ts— extraction lockingRelated
jaccardSimilarity()was already exported fromsrc/state/schema.tsbut only used in lesson recallpatches/in the KodeHold orchestrator repoSummary by CodeRabbit
New Features
Bug Fixes
Tests