A TypeScript implementation of SimHash variants for near-duplicate detection and exact-match workflows.
- Baseline/original implementation.
- Uses character bigram features from raw text.
- Best when you want a simple classic SimHash baseline.
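For orientation, here is a minimal sketch of a classic SimHash over character bigrams, together with the Hamming distance used to compare fingerprints. The function names (`fnv1a32`, `charBigrams`, `simhash32`, `hammingDistance32`), the 32-bit width, and the FNV-1a style hash are assumptions chosen for illustration; they are not this package's API and may differ from what the baseline profile actually does.

```typescript
// Illustrative sketch only: names, hash choice, and 32-bit width are assumptions.

/** FNV-1a style 32-bit hash over the UTF-16 code units of a string. */
function fnv1a32(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

/** Character bigrams of the raw text, e.g. "abc" -> ["ab", "bc"]. */
function charBigrams(text: string): string[] {
  const out: string[] = [];
  for (let i = 0; i + 2 <= text.length; i++) {
    out.push(text.slice(i, i + 2));
  }
  return out;
}

/** Classic SimHash: every feature votes +1 or -1 on each bit; the sign of the sum sets the bit. */
function simhash32(text: string): number {
  const votes = new Array<number>(32).fill(0);
  for (const gram of charBigrams(text)) {
    const h = fnv1a32(gram);
    for (let bit = 0; bit < 32; bit++) {
      votes[bit] += ((h >>> bit) & 1) === 1 ? 1 : -1;
    }
  }
  let fp = 0;
  for (let bit = 0; bit < 32; bit++) {
    if (votes[bit] > 0) fp |= 1 << bit;
  }
  return fp >>> 0; // force an unsigned 32-bit result
}

/** Number of differing bits between two 32-bit fingerprints. */
function hammingDistance32(a: number, b: number): number {
  let x = (a ^ b) >>> 0;
  let count = 0;
  while (x !== 0) {
    x &= x - 1; // clear the lowest set bit
    count++;
  }
  return count;
}

// Usage: near-duplicates are expected to land at a small distance.
const d = hammingDistance32(
  simhash32("the quick brown fox jumps over the lazy dog"),
  simhash32("the quick brown fox jumped over the lazy dog")
);
console.log(`hamming distance: ${d}`);
```

A distance threshold then decides whether two texts count as near-duplicates; the right threshold depends on fingerprint width and corpus.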
- Distance-oriented profile for better robustness than baseline.
- Adds deterministic canonicalization, mixed token/character features, TF capping, and optional window voting.
- Best when you still care about Hamming distance behavior and nearest-neighbor style similarity.
- Equality-oriented profile designed for exact tag matching.
- Uses aggressive canonicalization + stemming + stopword filtering, then bucketed min-hash style sketching.
- Best when your query system can only do exact hash equality and not distance thresholds.
- Current default profile is `simhash-equality-v2`.
- Default parameters: `shingleSize=1`, `bucketCount=2`, `keptHexCharsPerBucket=3`, `minTokenLength=4`.
- Descriptor payload includes `n`, `b`, `k`, and `m` so independent implementations can produce the same `X` value deterministically.
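To make the default parameters concrete, the following is a rough sketch of how a bucketed min-hash style tag could be derived from them. It is an assumption-laden illustration, not the actual `simhash-equality-v2` algorithm: the canonicalization is reduced to lowercasing and splitting on non-alphanumerics, stemming and stopword filtering are omitted, and the hash function, bucket assignment, and the names `EqualityParams`, `DEFAULTS`, and `equalityTag` are all hypothetical. Output from this sketch will not match tags produced by the package.

```typescript
// Illustrative sketch only: not the real simhash-equality-v2 algorithm.

interface EqualityParams {
  shingleSize: number;
  bucketCount: number;
  keptHexCharsPerBucket: number;
  minTokenLength: number;
}

// Defaults as listed in this README.
const DEFAULTS: EqualityParams = {
  shingleSize: 1,
  bucketCount: 2,
  keptHexCharsPerBucket: 3,
  minTokenLength: 4,
};

/** FNV-1a style 32-bit hash (an assumed stand-in for the real hash). */
function fnv1a32(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

function equalityTag(text: string, p: EqualityParams = DEFAULTS): string {
  // Simplified canonicalization: lowercase, split on non-alphanumerics,
  // drop tokens shorter than minTokenLength.
  const tokens = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length >= p.minTokenLength);

  // Token shingles of shingleSize consecutive tokens.
  const shingles: string[] = [];
  for (let i = 0; i + p.shingleSize <= tokens.length; i++) {
    shingles.push(tokens.slice(i, i + p.shingleSize).join(" "));
  }

  // Keep the minimum hash seen in each bucket; Infinity marks an empty bucket.
  const mins = new Array<number>(p.bucketCount).fill(Infinity);
  for (const s of shingles) {
    const h = fnv1a32(s);
    const bucket = h % p.bucketCount;
    if (h < mins[bucket]) mins[bucket] = h;
  }

  // Keep only the leading hex characters of each bucket minimum.
  return mins
    .map((m) =>
      Number.isFinite(m)
        ? ("00000000" + m.toString(16)).slice(-8).slice(0, p.keptHexCharsPerBucket)
        : "-".repeat(p.keptHexCharsPerBucket)
    )
    .join("");
}

// Usage: case, punctuation, and short words do not change the tag in this sketch.
console.log(equalityTag("The quick brown foxes jumped"));
console.log(equalityTag("the QUICK, brown foxes... jumped!"));
```

The design point this illustrates is that truncating each bucket minimum to a few hex characters trades precision for recall: small edits often leave the per-bucket minimum untouched, so the resulting tag can still be compared by plain equality.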
npm install
npm run build
npm test
npm run benchmark
npm run benchmark -- path/to/corpus.json
The benchmark supports:
- Legacy shape: top-level `texts` array
- New shape: grouped `families` with expected equality pairs
Example (new shape):
{
  "topNeighbors": 6,
  "families": [
    {
      "id": "my-family",
      "description": "Optional family note",
      "expectedEqualityPairs": [
        ["text-a", "text-b"]
      ],
      "texts": [
        { "id": "text-a", "text": "..." },
        { "id": "text-b", "text": "..." },
        { "id": "text-c", "text": "..." }
      ]
    }
  ]
}

`expectedEqualityPairs` are used for TP/FN/FP reporting under equality-mode scoring.
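As a reading aid, the new corpus shape and one plausible way of counting TP/FN/FP can be sketched as follows. The type names (`CorpusText`, `Family`, `Corpus`, `Counts`), the `scoreFamily` function, and the optionality of `topNeighbors` and `description` are assumptions for illustration; the benchmark's real scoring code may count differently.

```typescript
// Illustrative sketch only: type and function names are hypothetical.

interface CorpusText {
  id: string;
  text: string;
}

interface Family {
  id: string;
  description?: string; // assumed optional
  expectedEqualityPairs: [string, string][];
  texts: CorpusText[];
}

interface Corpus {
  topNeighbors?: number; // assumed optional
  families: Family[];
}

interface Counts {
  tp: number;
  fn: number;
  fp: number;
}

/** Order-independent key for a pair of text IDs. */
function pairKey(a: string, b: string): string {
  return a < b ? `${a}\u0000${b}` : `${b}\u0000${a}`;
}

/**
 * Counts outcomes over every unordered pair of texts in one family:
 *  - TP: pair is expected equal and the tags match
 *  - FN: pair is expected equal but the tags differ
 *  - FP: pair is not expected equal but the tags match
 */
function scoreFamily(family: Family, tagOf: (text: string) => string): Counts {
  const expected = new Set(
    family.expectedEqualityPairs.map(([a, b]) => pairKey(a, b))
  );
  const tagged = family.texts.map((t) => ({ id: t.id, tag: tagOf(t.text) }));
  const counts: Counts = { tp: 0, fn: 0, fp: 0 };

  for (let i = 0; i < tagged.length; i++) {
    for (let j = i + 1; j < tagged.length; j++) {
      const same = tagged[i].tag === tagged[j].tag;
      const want = expected.has(pairKey(tagged[i].id, tagged[j].id));
      if (want && same) counts.tp++;
      else if (want && !same) counts.fn++;
      else if (!want && same) counts.fp++;
    }
  }
  return counts;
}
```

Here `tagOf` stands in for whichever equality-mode hash is under test; any function from text to a string tag will do.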
- `synthetic-article`: regression baseline
- `real-article`: populated with the provided regular-length article and variants
- `tweet-sized`: short-text stress tests
- `extra-long-article`: populated with the provided extra-long article and variants
- Keep IDs stable over time so benchmark comparisons remain meaningful.
- For each family, include at least:
  - original
  - light edit
  - padded/noisy variant
  - unrelated control
- Update `expectedEqualityPairs` whenever you add or revise vectors.
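One way to catch stale pairs after editing a corpus is a small consistency check such as the sketch below. `FamilyLike` and `findCorpusProblems` are hypothetical names and this check is not part of the package; it only assumes the family fields shown in the example corpus.

```typescript
// Illustrative sketch only: a hypothetical corpus lint, not part of this package.

interface FamilyLike {
  id: string;
  expectedEqualityPairs: [string, string][];
  texts: { id: string; text: string }[];
}

/** Returns human-readable problems; an empty array means the corpus looks consistent. */
function findCorpusProblems(families: FamilyLike[]): string[] {
  const problems: string[] = [];

  for (const family of families) {
    // Text IDs must be unique within a family.
    const ids = new Set<string>();
    for (const t of family.texts) {
      if (ids.has(t.id)) {
        problems.push(`${family.id}: duplicate text id "${t.id}"`);
      }
      ids.add(t.id);
    }

    // Every expected pair must reference two distinct, existing text IDs.
    for (const [a, b] of family.expectedEqualityPairs) {
      if (a === b) {
        problems.push(`${family.id}: pair ["${a}", "${b}"] references the same id twice`);
      }
      for (const id of [a, b]) {
        if (!ids.has(id)) {
          problems.push(`${family.id}: pair references unknown id "${id}"`);
        }
      }
    }
  }
  return problems;
}

// Usage: a pair pointing at a removed or renamed text is reported.
const report = findCorpusProblems([
  {
    id: "my-family",
    expectedEqualityPairs: [["text-a", "text-z"]],
    texts: [
      { id: "text-a", text: "..." },
      { id: "text-b", text: "..." },
    ],
  },
]);
console.log(report);
```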