feat(salience): dedup near-duplicate memories by source_key / title by cipher813 · Pull Request #152 · cipher813/mnemon

cipher813 · 2026-05-22T00:21:17Z

Summary

Caught on the first real prod-vault scoring output: 3 of the auto-selected top-10 were the same memory ("System-wide deploy changelog" at ids 600, 617, 644: same title, content edited iteratively across sessions, three different content_hashes).

Bare content-hash dedup misses this because those memories are near-duplicates, not byte-duplicates. The right dedup primitive is mnemon's source_key (the post-rc16 canonical identity from store.save's upsert-by-slug), falling back to title for pre-rc16 saves.
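To illustrate why a byte-level hash cannot catch this case, here is a minimal sketch. It assumes SHA-256 as a stand-in for whatever mnemon actually uses for content_hash; the `content_hash` helper and the revision strings are hypothetical, not taken from the codebase.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hypothetical stand-in for a byte-level content hash (SHA-256 here)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Three iterative edits of one logical memory (illustrative content only).
revisions = [
    "System-wide deploy changelog\n- step one",
    "System-wide deploy changelog\n- step one\n- step two",
    "System-wide deploy changelog\n- step one\n- step two\n- step three",
]

# Each edit changes the bytes, so each revision gets its own hash and
# hash-based dedup sees three unrelated memories.
distinct_hashes = len({content_hash(r) for r in revisions})
```

Any edit to the body, however small, produces a new hash, which is why an identity that survives edits (source_key or title) is needed.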

Dedup key priority

1. source_key: post-rc16 canonical identity (rc16, PR #122 "fix(mirror): upsert by stable slug + bump 0.6.0rc16 (P0)")
2. title: lowercased + stripped, catches pre-rc16 dupes
3. id: no-title memories or genuinely unique titles (never dedupes with another)

Keep the most recent (highest id) entry per dedup key, since it has the most current title/confidence/content metadata.
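The priority order and the keep-highest-id rule can be sketched roughly as follows. This is not the code from the PR: the `Memory` shape and the `dedup_key` and `dedup` names are assumptions made for illustration, and tagging each key with its kind (so a source_key can never collide with a title) is a choice of this sketch rather than something the PR confirms.

```python
from typing import Iterable, Optional, TypedDict


class Memory(TypedDict):
    # Hypothetical row shape; the real mnemon schema may differ.
    id: int
    source_key: Optional[str]
    title: Optional[str]


def dedup_key(m: Memory) -> tuple[str, object]:
    """Pick the dedup key by priority: source_key, then title, then id."""
    if m.get("source_key"):
        return ("source_key", m["source_key"])
    title = (m.get("title") or "").strip().lower()
    if title:
        return ("title", title)
    # No usable identity: the id is unique, so this never merges.
    return ("id", m["id"])


def dedup(memories: Iterable[Memory]) -> list[Memory]:
    """Keep the highest-id memory for each dedup key."""
    best: dict[tuple[str, object], Memory] = {}
    for m in memories:
        key = dedup_key(m)
        if key not in best or m["id"] > best[key]["id"]:
            best[key] = m
    return sorted(best.values(), key=lambda m: m["id"])


sample: list[Memory] = [
    {"id": 600, "source_key": None, "title": "System-wide deploy changelog"},
    {"id": 617, "source_key": None, "title": "System-wide deploy changelog"},
    {"id": 644, "source_key": None, "title": "System-wide deploy changelog"},
]
kept_ids = [m["id"] for m in dedup(sample)]  # [644]
```

Under these assumptions, a memory saved with a source_key and an older one sharing only its title would stay separate; whether the merged implementation bridges those two cases is not stated here.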

Verified against prod snapshot

BEFORE: Live memories: 2084 (3 of top-10 are the same memory)
AFTER:  Live memories: 1872 (deduped 212 near-duplicates across whole vault)

212 collapsed — not just the 3 we knew about. The dedup uncovered substantial historical noise.

The 3 deploy-changelog entries collapsed to id 644 (most recent). Auto-selected top-10 now picks 9 more distinct candidates instead of triple-counting the same fact.

Tests

801/801 pytest, 13/13 harness — no regressions.

After merge

git checkout main && git pull
scripts/salience_phase0.sh score   # 9 fresh slots opened, picks should be more diverse

🤖 Generated with Claude Code

Caught 2026-05-21 on first real prod-vault scoring output: 3 of the auto-selected top-10 were the same memory ("System-wide deploy changelog" at ids 600, 617, 644; same title, content edited iteratively across sessions, three different content_hashes).

Bare content-hash dedup misses this: those memories were near-duplicates, not byte-duplicates. The right dedup primitive is mnemon's source_key (post-rc16 canonical identity via store.save's upsert-by-slug) falling back to title for pre-rc16 saves.

Dedup key priority:
1. source_key: post-rc16 canonical identity
2. title: lowercased + stripped, catches pre-rc16 dupes
3. id: no-title memories or genuinely unique titles

Keep most recent (highest id) per dedup key, which has the most current title / confidence / content metadata.

Verified against prod snapshot (/tmp/mnemon-prod-snap.sqlite):
BEFORE: Live memories: 2084 (3 of top-10 are the same memory)
AFTER:  Live memories: 1872 (deduped 212 across whole vault)

The 3 deploy-changelog entries collapsed to id 644. Auto-selected top-10 now picks 9 more distinct candidates instead of triple-counting the same fact.

Full suite: 801 passed. Harness: 13/13.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cipher813 merged commit da5fbc4 into main May 22, 2026
9 checks passed

cipher813 deleted the feat/salience-dedup-by-content-hash branch May 22, 2026 00:23
