Close the feedback loop between performance tracking and memory retrieval. Entries that correlated with better task outcomes now rank higher in search, and synthesis reflects on which patterns associate with better performance.

Ship 1 — insightsActive tracking:
- SessionInsightsTracker records which entries the curate agent consulted via tools.searchKnowledge() during each curation session
- Session-scoped (parentSessionId), curate-only (query sessions excluded)
- canonicalPath field added to search results in all return branches, using getSummaryAccessPath resolution for consistent keys
- CurateExecutor drains the tracker on success, clears it in finally
- NormalizedPerformanceLogEntry type ensures insightsActive: string[] at the domain boundary; readPerformanceLog normalizes missing values

Ship 2 — Performance-weighted retrieval boost:
- performance-correlation.ts: pure functions computing per-entry and per-domain factors from insightsActive correlation data
- Factors bounded to ±15% via tanh saturation (BM25 still dominates)
- Applied in runTextSearch and propagateScoresToParents
- Cached with a stat-only mtime check (no JSONL parse on cache hit)

Ship 3 — Performance-aware synthesis:
- Synthesis prompt updated with a 5th instruction about performance
- buildPerformanceContext injects domain stats and high/low-performing entries into the synthesis user message
- Gracefully degrades when fewer than 3 relevant performance entries exist

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
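The ±15% bound described in Ship 2 can be sketched as a tanh-saturated multiplier. This is a minimal illustration under stated assumptions: `boundedFactor`, `boostedScore`, and `MAX_BOOST` are hypothetical names, not the PR's actual API.

```typescript
// Hypothetical sketch of tanh-bounded boosting: the raw correlation signal is
// squashed into (-1, 1) by tanh, so the multiplicative factor stays within
// [1 - MAX_BOOST, 1 + MAX_BOOST] and BM25 remains the dominant ranking signal.
const MAX_BOOST = 0.15

function boundedFactor(rawCorrelation: number): number {
  return 1 + MAX_BOOST * Math.tanh(rawCorrelation)
}

// A boosted score multiplies the base BM25 score by the bounded factor.
function boostedScore(bm25Score: number, rawCorrelation: number): number {
  return bm25Score * boundedFactor(rawCorrelation)
}
```

Even an extreme correlation input moves a score by at most 15% in either direction, which matches the PR's claim that BM25 still dominates.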
Claude finished @ngduyanhece's task in 3m 22s — PR Review
Summary

This PR implements a performance-memory correlation system (Ship 1–3): tracking which knowledge entries were surfaced during curation, correlating them with task performance scores, and feeding those signals back into search ranking and synthesis prompts. The architecture is clean and the fail-open design throughout is appropriate. 9 inline comments posted covering correctness, robustness, style, and test gaps. Here's a prioritized overview:

Issues worth addressing
Nitpicks / non-blocking
What's well done
Resolved threads (outdated):
- src/agent/infra/tools/implementations/search-knowledge-service.ts
- src/agent/infra/tools/implementations/search-knowledge-service.ts
- test/unit/server/core/domain/knowledge/performance-correlation.test.ts
Claude finished @ngduyanhece's task in 4m 12s — PR Review
Summary

All 9 issues from the previous review round were addressed — the domain-baseline bias fix, substring match replacement, mtime …

5 new inline comments posted on this pass:
What's well done in this revision
```ts
/**
 * Set the session insights tracker for performance-memory correlation.
 * Forwarded to sandboxService so ToolsSDK.searchKnowledge() records surfaced paths.
 */
public setInsightsTracker(tracker: import('../../../server/infra/context-tree/session-insights-tracker.js').SessionInsightsTracker): void {
```
nitpick (import style): Same inline dynamic import(...) type pattern that was flagged and fixed in tools-sdk.ts in the previous round — now it's reintroduced here. Consider lifting it to a top-level import to keep the file consistent with the rest of the imports:
```diff
- public setInsightsTracker(tracker: import('../../../server/infra/context-tree/session-insights-tracker.js').SessionInsightsTracker): void {
+ public setInsightsTracker(tracker: SessionInsightsTracker): void {
```
And add at the top of the file alongside other import type declarations:
```ts
import type { SessionInsightsTracker } from '../../../server/infra/context-tree/session-insights-tracker.js'
```

```ts
const lowEntries = new Map<string, number>()

for (const entry of relevant) {
  const bucket = entry.score > domainAvg ? highEntries : lowEntries
```
nitpick (off-by-one boundary): Entries with score === domainAvg are classified as low-performing (lowEntries) due to the strict > comparison. In practice, a score exactly at the mean is neither high nor low. This won't cause incorrect output but could surface neutrally-correlated entries as "frequently active in lower-scoring tasks," which could mislead the LLM synthesis.
A simple fix is to use a small threshold instead of the mean as the boundary:
```diff
- const bucket = entry.score > domainAvg ? highEntries : lowEntries
+ const bucket = entry.score >= domainAvg ? highEntries : lowEntries
```
Alternatively, skip entries whose score is within a small epsilon of the mean (`Math.abs(entry.score - domainAvg) < 0.05`) — though the current approach is acceptable for a prompting hint.
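The epsilon variant can be sketched as a three-way classification. This is a standalone sketch; `bucketFor` and `EPSILON` are hypothetical names, not code from the PR.

```typescript
// Hypothetical three-way classification: scores within EPSILON of the domain
// mean are treated as neutral and excluded from both buckets.
const EPSILON = 0.05

function bucketFor(score: number, domainAvg: number): 'high' | 'low' | null {
  if (Math.abs(score - domainAvg) < EPSILON) return null // neutral: skip
  return score > domainAvg ? 'high' : 'low'
}
```

This keeps entries that hover around the mean out of the "frequently active in lower-scoring tasks" list entirely, rather than forcing them into one bucket.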
```ts
const mid = Math.floor(scores.length / 2)
const firstHalf = scores.slice(0, mid)
const secondHalf = scores.slice(mid)
const avgFirst = firstHalf.reduce((a, b) => a + b, 0) / firstHalf.length
```
nitpick (unequal halves for odd counts): When relevant.length is odd, mid = Math.floor(N/2) gives firstHalf one fewer element than secondHalf. For example, with exactly 3 entries: firstHalf = [s0], secondHalf = [s1, s2]. The most recent score gets 2× weight in the "second half" average, which can produce a spurious trend signal for small samples.
This only affects synthesis prompts (not hard decisions), but a note in the code would prevent future confusion, e.g.:

```ts
// Note: for odd-length arrays firstHalf is 1 element shorter than secondHalf —
// acceptable bias for a heuristic prompt hint.
```

Low priority given the fail-open nature of the context injection.
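The bias is easy to see with a worked example. This is a standalone sketch mirroring the quoted split logic; `halfAverages` is a hypothetical name, not the PR's function.

```typescript
// Mirrors the quoted split: for an odd-length array the second half gets the
// extra element, so the most recent scores carry slightly more weight.
// Assumes scores.length >= 2 (the PR guards on a minimum entry count upstream).
function halfAverages(scores: number[]): [number, number] {
  const mid = Math.floor(scores.length / 2)
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length
  return [avg(scores.slice(0, mid)), avg(scores.slice(mid))]
}
```

With `[1, 2, 3]` this yields half-averages of `1` and `2.5`: the lone first-half score is compared against a two-element second half, which is exactly the spurious-trend risk the comment describes.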
```ts
if (textPart) {
  // Scoped search: search text within the matched subtree
  return this.runTextSearch(textPart, documentMap, index, limit, topMatch.path, symbolTree, referenceIndex, summaryMap, options)
const perfFactors = await this.getPerformanceFactors()
```
nitpick (design inconsistency): trySymbolicSearch became async solely to call getPerformanceFactors() in this one branch. The outer search() already computes perfFactors for the text-search fallback path — but those two calls cannot be trivially unified because trySymbolicSearch short-circuits before perfFactors is computed at the outer level.
The cached stat call here is fast, so this is not a performance issue. However, adding a brief comment explains why the method is async and why perfFactors isn't passed as a parameter:
```ts
// Fetch performance factors here rather than receiving them as a parameter:
// the outer search() path computes perfFactors *after* this method returns null,
// so passing them in would require eagerly computing them even for direct-match exits.
const perfFactors = await this.getPerformanceFactors()
```

```ts
await service.synthesize(store, 5)

const userMessage = (llm.generate as sinon.SinonStub).firstCall.args[1] as string
expect(userMessage).to.not.include('**Performance Context**')
```
issue (test gap — negative test only): The new test correctly verifies that dead-end-lessons paths do NOT match the lessons subfolder. But there's no complementary test confirming that the **Performance Context** section IS injected into the user message when sufficient relevant performance data is present.
Without a positive-path test, a future regression that silently drops the performance context section (e.g., a guard condition change) would go undetected.
Summary
Type of change
Scope (select all touched areas)
Linked issues
Root cause (bug fixes only, otherwise write N/A)

Test plan
User-visible changes
List user-visible changes (including defaults, config, or CLI output).
If none, write None.

Evidence
Attach at least one:
Checklist

- Tests pass (`npm test`)
- Lint passes (`npm run lint`)
- Typecheck passes (`npm run typecheck`)
- Build succeeds (`npm run build`)
- Branch is up to date with `main`

Risks and mitigations
List real risks for this PR. If none, write None.