Skip to content

Feat/eng 1753#306

Open
ngduyanhece wants to merge 4 commits intoproj/experience-domainfrom
feat/ENG-1753
Open

Feat/eng 1753#306
ngduyanhece wants to merge 4 commits intoproj/experience-domainfrom
feat/ENG-1753

Conversation

@ngduyanhece
Copy link
Copy Markdown
Contributor

Summary

  • Problem:
  • Why it matters:
  • What changed:
  • What did NOT change (scope boundary):

Type of change

  • Bug fix
  • New feature
  • Refactor (no behavior change)
  • Documentation
  • Test
  • Chore (build, dependencies, CI)

Scope (select all touched areas)

  • TUI / REPL
  • Agent / Tools
  • LLM Providers
  • Server / Daemon
  • Shared (constants, types, transport events)
  • CLI Commands (oclif)
  • Hub / Connectors
  • Cloud Sync
  • CI/CD / Infra

Linked issues

  • Closes #
  • Related #

Root cause (bug fixes only, otherwise write N/A)

  • Root cause:
  • Why this was not caught earlier:

Test plan

  • Coverage added:
    • Unit test
    • Integration test
    • Manual verification only
  • Test file(s):
  • Key scenario(s) covered:

User-visible changes

List user-visible changes (including defaults, config, or CLI output).
If none, write None.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording

Checklist

  • Tests added or updated and passing (npm test)
  • Lint passes (npm run lint)
  • Type check passes (npm run typecheck)
  • Build succeeds (npm run build)
  • Commits follow Conventional Commits format
  • Documentation updated (if applicable)
  • No breaking changes (or clearly documented above)
  • Branch is up to date with main

Risks and mitigations

List real risks for this PR. If none, write None.

  • Risk:
    • Mitigation:

ngduyanhece and others added 3 commits March 31, 2026 16:52
Close the feedback loop between performance tracking and memory retrieval.
Entries that correlated with better task outcomes now rank higher in search,
and synthesis reflects on which patterns associate with better performance.

Ship 1 — insightsActive tracking:
- SessionInsightsTracker records which entries the curate agent consulted
  via tools.searchKnowledge() during each curation session
- Session-scoped (parentSessionId), curate-only (query sessions excluded)
- canonicalPath field added to search results in all return branches
  using getSummaryAccessPath resolution for consistent keys
- CurateExecutor drains tracker on success, clears in finally
- NormalizedPerformanceLogEntry type ensures insightsActive: string[]
  at the domain boundary; readPerformanceLog normalizes missing values

Ship 2 — Performance-weighted retrieval boost:
- performance-correlation.ts: pure functions computing per-entry and
  per-domain factors from insightsActive correlation data
- Factors bounded to ±15% via tanh saturation (BM25 still dominates)
- Applied in runTextSearch and propagateScoresToParents
- Cached with stat-only mtime check (no JSONL parse on cache hit)

Ship 3 — Performance-aware synthesis:
- Synthesis prompt updated with 5th instruction about performance
- buildPerformanceContext injects domain stats, high/low-performing
  entries into the synthesis user message
- Gracefully degrades when < 3 relevant performance entries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 3, 2026

Claude finished @ngduyanhece's task in 3m 22s —— View job


PR Review

  • Gather context and read changed files
  • Analyze code for bugs, security issues, performance problems
  • Post inline comments and final review summary

Summary

This PR implements a performance-memory correlation system (Ship 1–3): tracking which knowledge entries were surfaced during curation, correlating them with task performance scores, and feeding those signals back into search ranking and synthesis prompts. The architecture is clean and the fail-open design throughout is appropriate.

9 inline comments posted covering correctness, robustness, style, and test gaps. Here's a prioritized overview:

Issues worth addressing

Severity Location Summary
issue performance-correlation.ts:108 domainAvg is built from insight-filtered entries only, biasing the baseline — domain fallback 0.5 used for domains without insights may skew deltas
issue experience-synthesis-service.ts:179 Substring match p.includes(\/${subfolder}/`)can false-positive on paths like"dead-end-lessons"when subfolder is"lessons"`
issue search-knowledge-service.ts:1051 mtimeMs === comparison — mtime floats are fine on most filesystems but a >= guard is safer for append-only files
issue performance-correlation.test.ts:27 Missing boundary test: no coverage for exactly 5 entries having insightsActive (the activation threshold)

Nitpicks / non-blocking

  • performance-correlation.ts:67 — "domain" in computeDomainFactors is a path-derived directory name, while entry.domain is a task-level LLM field; they'll never share keys. A clarifying comment prevents future confusion.
  • experience-synthesis-service.ts:191firstHalf.length || 1 guard is unreachable given the relevant.length < 3 guard above it.
  • tools-sdk.ts:260 — Inline import(...) type for SessionInsightsTracker is inconsistent with top-level import style.
  • search-knowledge-service.ts:1056ExperienceStore re-allocated on every cache miss; could be a class field.
  • sandbox-service.ts:11 — Missing space inside braces on {CurateResultCollector}.

What's well done

  • SessionInsightsTracker is minimal and correct — Set-based dedup, session-scoped isolation, drain-vs-clear separation is clean.
  • The drainSession-before-hook ordering in CurateExecutor is correct: the snapshot is captured before the experience entries are written, preventing circular self-reference in performance correlation.
  • getPerformanceFactors mtime-based caching avoids reading the JSONL on every search — good for performance.
  • tanh saturation bounding factors to ±0.15 prevents runaway reinforcement while keeping BM25 dominant. Well reasoned.
  • Fail-open behavior is consistently maintained throughout the new code paths.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 3, 2026

Claude finished @ngduyanhece's task in 4m 12s —— View job


PR Review

  • Gather context and read changed files
  • Analyze code for bugs, security issues, performance problems
  • Post inline comments and final review summary

Summary

All 9 issues from the previous review round were addressed — the domain-baseline bias fix, substring match replacement, mtime >= guard, boundary test, naming comment, unreachable || 1 guard, import style, ExperienceStore class field, and brace spacing are all confirmed resolved.

5 new inline comments posted on this pass:

Severity Location Summary
issue experience-synthesis-service.test.ts:257 Negative-only coverage for buildPerformanceContext — no test that the **Performance Context** section is injected when data is present
nitpick cipher-agent.ts:695 Inline dynamic import(...) type reintroduced for SessionInsightsTracker — same pattern fixed in tools-sdk.ts last round
nitpick experience-synthesis-service.ts:203 entry.score > domainAvg classifies exact-mean entries as "low performing"; >= is more defensible
nitpick experience-synthesis-service.ts:192 Unequal halves in trend calc for odd-length relevant log — firstHalf is one element shorter than secondHalf; a comment clarifies intent
nitpick search-knowledge-service.ts:1329 trySymbolicSearch is async solely to call getPerformanceFactors() internally; the design trade-off (avoiding eager computation on direct-match paths) is valid but deserves a comment

What's well done in this revision

  • All four prior correctness issues resolved cleanly. The buildDomainAverages(log) full-log baseline fix is the most impactful of these.
  • path.split('/').includes(subfolder) is a clean, correct replacement for the substring false-positive.
  • The boundary test (activates once exactly five entries have non-empty insightsActive) is well structured.
  • The new matchesSubfolderPath negative test correctly exercises the regression that was being guarded.
  • experienceStore as a class field and the top-level SessionInsightsTracker import in tools-sdk.ts are clean wins.

* Set the session insights tracker for performance-memory correlation.
* Forwarded to sandboxService so ToolsSDK.searchKnowledge() records surfaced paths.
*/
public setInsightsTracker(tracker: import('../../../server/infra/context-tree/session-insights-tracker.js').SessionInsightsTracker): void {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick (import style): Same inline dynamic import(...) type pattern that was flagged and fixed in tools-sdk.ts in the previous round — now it's reintroduced here. Consider lifting it to a top-level import to keep the file consistent with the rest of the imports:

Suggested change
public setInsightsTracker(tracker: import('../../../server/infra/context-tree/session-insights-tracker.js').SessionInsightsTracker): void {
public setInsightsTracker(tracker: SessionInsightsTracker): void {

And add at the top of the file alongside other import type declarations:

import type { SessionInsightsTracker } from '../../../server/infra/context-tree/session-insights-tracker.js'

const lowEntries = new Map<string, number>()

for (const entry of relevant) {
const bucket = entry.score > domainAvg ? highEntries : lowEntries
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick (off-by-one boundary): Entries with score === domainAvg are classified as low-performing (lowEntries) due to the strict > comparison. In practice, a score exactly at the mean is neither high nor low. This won't cause incorrect output but could surface neutrally-correlated entries as "frequently active in lower-scoring tasks," which could mislead the LLM synthesis.

A simple fix is to use a small threshold instead of the mean as the boundary:

Suggested change
const bucket = entry.score > domainAvg ? highEntries : lowEntries
const bucket = entry.score >= domainAvg ? highEntries : lowEntries

Or alternatively, skip entries whose score is within a small epsilon of the mean (Math.abs(entry.score - domainAvg) < 0.05) — though the current approach is acceptable for a prompting hint.

const mid = Math.floor(scores.length / 2)
const firstHalf = scores.slice(0, mid)
const secondHalf = scores.slice(mid)
const avgFirst = firstHalf.reduce((a, b) => a + b, 0) / firstHalf.length
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick (unequal halves for odd counts): When relevant.length is odd, mid = Math.floor(N/2) gives firstHalf one fewer element than secondHalf. For example, with exactly 3 entries: firstHalf = [s0], secondHalf = [s1, s2]. The most recent score gets 2× weight in the "second half" average, which can produce a spurious trend signal for small samples.

This only affects synthesis prompts (not hard decisions), but a note in the code would prevent future confusion, e.g.:

// Note: for odd-length arrays firstHalf is 1 element shorter than secondHalf —
// acceptable bias for a heuristic prompt hint.

Low priority given the fail-open nature of the context injection.

if (textPart) {
// Scoped search: search text within the matched subtree
return this.runTextSearch(textPart, documentMap, index, limit, topMatch.path, symbolTree, referenceIndex, summaryMap, options)
const perfFactors = await this.getPerformanceFactors()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick (design inconsistency): trySymbolicSearch became async solely to call getPerformanceFactors() in this one branch. The outer search() already computes perfFactors for the text-search fallback path — but those two calls cannot be trivially unified because trySymbolicSearch short-circuits before perfFactors is computed at the outer level.

The cached stat call here is fast, so this is not a performance issue. However, adding a brief comment explains why the method is async and why perfFactors isn't passed as a parameter:

// Fetch performance factors here rather than receiving them as a parameter:
// the outer search() path computes perfFactors *after* this method returns null,
// so passing them in would require eagerly computing them even for direct-match exits.
const perfFactors = await this.getPerformanceFactors()

await service.synthesize(store, 5)

const userMessage = (llm.generate as sinon.SinonStub).firstCall.args[1] as string
expect(userMessage).to.not.include('**Performance Context**')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (test gap — negative test only): The new test correctly verifies that dead-end-lessons paths do NOT match the lessons subfolder. But there's no complementary test confirming that the **Performance Context** section IS injected into the user message when sufficient relevant performance data is present.

Without a positive-path test, a future regression that silently drops the performance context section (e.g., a guard condition change) would go undetected. Fix this →

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant