feat(salience): dedup near-duplicate memories by source_key / title#152

Merged
cipher813 merged 1 commit into main from feat/salience-dedup-by-content-hash on May 22, 2026

@cipher813

Summary

Caught on the first real prod-vault scoring output: 3 of the auto-selected top-10 were the same memory ("System-wide deploy changelog" at ids 600, 617, and 644: same title, content edited iteratively across sessions, three different content_hashes).

Bare content-hash dedup misses this: those memories are near-duplicates, not byte-duplicates. The right dedup primitive is mnemon's source_key (the post-rc16 canonical identity from store.save's upsert-by-slug), falling back to title for pre-rc16 saves.
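
To make the failure mode concrete, here is a minimal sketch of why a content hash cannot group iteratively edited memories. It assumes content_hash is a SHA-256 digest of the body text, which this PR does not confirm; the `content_hash` helper and the sample revisions are hypothetical and for illustration only.

```python
import hashlib


def content_hash(text: str) -> str:
    # Hypothetical stand-in for mnemon's content_hash; SHA-256 is an assumption.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Three iterative edits of one logical memory (made-up content).
revisions = [
    "Deploy changelog: v1 shipped.",
    "Deploy changelog: v1 shipped. v2 shipped.",
    "Deploy changelog: v1 shipped. v2 shipped. v3 shipped.",
]

# Any edit to the body changes the digest, so each revision gets its own hash
# and a hash-keyed dedup treats them as three unrelated memories.
distinct_hash_count = len({content_hash(r) for r in revisions})
```

Under that assumption every revision produces a different digest, so a hash-keyed dedup would keep all three.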

Dedup key priority

  1. source_key: post-rc16 canonical identity (rc16, PR #122 "fix(mirror): upsert by stable slug + bump 0.6.0rc16 (P0)")
  2. title: lowercased + stripped, catches pre-rc16 dupes
  3. id: no-title memories or genuinely unique titles (never dedupes with another)

Keep the most recent memory (highest id) per dedup key, since it carries the most current title/confidence/content metadata.
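
The key priority and the keep-highest-id rule can be sketched roughly as follows. This is not the code from the PR: the `Memory` dataclass, `dedup_key`, and `dedup` are hypothetical names, "stripped" is assumed to mean trimming surrounding whitespace, and tagging each key with its kind (so a source_key can never collide with a title) is a design assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Memory:
    # Hypothetical record shape; field names follow the PR text.
    id: int
    title: Optional[str] = None
    source_key: Optional[str] = None


def dedup_key(m: Memory) -> tuple:
    # 1. source_key wins when present.
    if m.source_key:
        return ("source_key", m.source_key)
    # 2. Otherwise fall back to the normalized title.
    title = (m.title or "").strip().lower()
    if title:
        return ("title", title)
    # 3. No usable title: key on id, which never matches another memory.
    return ("id", m.id)


def dedup(memories: Iterable[Memory]) -> List[Memory]:
    latest = {}
    for m in memories:
        key = dedup_key(m)
        kept = latest.get(key)
        # Keep the highest id per key, i.e. the most recent save.
        if kept is None or m.id > kept.id:
            latest[key] = m
    return sorted(latest.values(), key=lambda m: m.id)


vault = [
    Memory(600, "System-wide deploy changelog"),
    Memory(617, "system-wide deploy changelog "),
    Memory(644, "System-wide deploy changelog"),
    Memory(650),  # untitled, stays distinct
    Memory(651),  # untitled, stays distinct
]
kept_ids = [m.id for m in dedup(vault)]  # [644, 650, 651]
```

One caveat of this sketch: a post-rc16 memory with a source_key and a pre-rc16 memory that shares its title but has no source_key land under different keys and are not merged. Whether the real implementation bridges that case is not stated here.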

Verified against prod snapshot

BEFORE: Live memories: 2084 (3 of top-10 are the same memory)
AFTER:  Live memories: 1872 (deduped 212 near-duplicates across whole vault)

212 memories collapsed across the vault, not just the 3 we knew about, so the dedup uncovered substantial historical noise.

The 3 deploy-changelog entries collapsed to id 644 (most recent). Auto-selected top-10 now picks 9 more distinct candidates instead of triple-counting the same fact.

Tests

801/801 pytest, 13/13 harness — no regressions.

After merge

git checkout main && git pull
scripts/salience_phase0.sh score   # 9 fresh slots opened, picks should be more diverse

🤖 Generated with Claude Code

Caught 2026-05-21 on first real prod-vault scoring output: 3 of the
auto-selected top-10 were the same memory ("System-wide deploy
changelog" at ids 600, 617, 644 — same title, content edited
iteratively across sessions, three different content_hashes).

Bare content-hash dedup misses this — those memories were
near-duplicates not byte-duplicates. The right dedup primitive is
mnemon's source_key (post-rc16 canonical identity via
store.save's upsert-by-slug) falling back to title for pre-rc16
saves.

Dedup key priority:
  1. source_key   — post-rc16 canonical identity
  2. title        — lowercased + stripped, catches pre-rc16 dupes
  3. id           — no-title memories or genuinely unique titles

Keep most recent (highest id) per dedup key — has the most current
title / confidence / content metadata.

Verified against prod snapshot (/tmp/mnemon-prod-snap.sqlite):
  BEFORE: Live memories: 2084 (3 of top-10 are the same memory)
  AFTER:  Live memories: 1872 (deduped 212 across whole vault)

The 3 deploy-changelog entries collapsed to id 644. Auto-selected
top-10 now picks 9 more distinct candidates instead of triple-counting
the same fact.

Full suite: 801 passed. Harness: 13/13.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit da5fbc4 into main May 22, 2026
9 checks passed
@cipher813 cipher813 deleted the feat/salience-dedup-by-content-hash branch May 22, 2026 00:23