High-performance text fingerprinting SDK for Rust with classical sketches (MinHash + LSH, SimHash, TLSH), Unicode-correct canonicalization, and semantic embeddings (ONNX local).
txtfp produces compact, deterministic, byte-stable hashes for text deduplication, near-duplicate detection, and semantic search:
| Method | Use case | Output | Complexity |
|---|---|---|---|
| MinHash | Set-similarity dedup (Jaccard) | `[u64; H]` | O(n) sketch |
| LSH | Sub-linear near-duplicate lookup | bucketed index | O(1) avg query |
| SimHash | Bit-LSH near-dup (Hamming) | `u64` | O(n) sketch |
| TLSH | Byte-level locality-sensitive hash | hex string | O(n) sketch |
| Embedding | Semantic similarity (ANN) | `Vec<f32>` | model-dependent |
It is the text counterpart to audiofp (audio) and imgfprint (image), and is consumed by the cross-modal ucfp integrator.
Perfect for:
- LLM training-set deduplication
- RAG retrieval ranking
- Content moderation
- Plagiarism detection
- Email / document de-dup at scale
- Byte-stable hash layouts: `MinHashSig<H>` and `SimHash64` are `repr(C)` `bytemuck::Pod`. Schema-versioned, semver-frozen, golden-byte enforced (18 fixtures).
- Production canonicalization: NFKC + simple casefold + Bidi/format strip; defends against Trojan Source, ZWJ injection, NFC bombs.
- `no_std + alloc`-clean default features: builds for `wasm32-unknown-unknown` out of the box.
- Streaming + offline fingerprinters: every classical sketcher has both a `Fingerprinter` (whole-doc) and `StreamingFingerprinter` (chunk-fed) variant.
- Local embeddings: `LocalProvider` (ONNX via `ort` + Hugging Face Hub) for semantic similarity. Cloud-hosted providers are out of scope; implement `EmbeddingProvider` against your HTTP client of choice.
- Markup helpers: HTML → text, Markdown → text.
- Unicode security: UTS #39 confusable skeleton behind the `security` feature.
- CJK tokenizer: `jieba-rs` with `OnceLock`-lazy dictionary for Simplified Chinese.
- Cross-SDK parity: `EmbeddingProvider`, `Embedding`, `semantic_similarity`, `FORMAT_VERSION` aligned with `imgfprint` / `audiofp`.
```toml
[dependencies]
txtfp = "0.2"
```

Upgrading from 0.1.x? v0.2.0 flipped the default hash family from `MurmurHash3_x64_128` to `Xxh3_64` for both MinHash and SimHash, so signature bytes change. Pin to `0.1` or pass `HashFamily::MurmurHash3_x64_128` explicitly for v0.1.x / Python `datasketch` byte parity. v0.2.1 is API- and bytes-compatible with v0.2.0 (patch release).
| Feature | Default | Pulls |
|---|---|---|
| `std` | ✅ | libstd. Without it, `no_std + alloc`. |
| `minhash` | ✅ | MinHash sketcher. |
| `simhash` | ✅ | SimHash sketcher. |
| `lsh` | ✅ | Banded LSH index over MinHash signatures. |
| `markup` | | `html_to_text`, `markdown_to_text`. |
| `cjk` | | `CjkTokenizer` (jieba, Simplified Chinese). |
| `tlsh` | | `TlshFingerprinter`. |
| `security` | | UTS #39 confusable skeleton in the canonicalizer. |
| `serde` | | `Serialize` / `Deserialize` on signatures (incl. const-generic MinHash). |
| `parallel` | | Rayon-powered batch helpers. |
| `semantic` | | `LocalProvider` via `ort` + Hugging Face Hub. |

For Japanese / Korean tokenization or PDF text extraction, implement
the Tokenizer / Canonicalizer upstream of this crate against
your preferred dedicated library (lindera, vibrato, pdf-extract,
poppler, …). Cloud-hosted embedding endpoints (OpenAI, Voyage,
Cohere, …) are similarly out of scope; implement
EmbeddingProvider against any HTTP client of choice — see
USAGE.md for a
worked example.
Minimal build (`no_std + alloc`, MinHash + SimHash only; drops LSH):

```toml
[dependencies]
txtfp = { version = "0.3", default-features = false, features = ["minhash", "simhash"] }
```

Without LSH (still on default `std`):

```toml
[dependencies]
txtfp = { version = "0.3", default-features = false, features = ["std", "minhash", "simhash"] }
```

With local ONNX embeddings:

```toml
[dependencies]
txtfp = { version = "0.3", features = ["semantic"] }
```

```rust
use txtfp::{
    Canonicalizer, Fingerprinter, MinHashFingerprinter,
    ShingleTokenizer, WordTokenizer, jaccard,
};

fn main() -> Result<(), txtfp::Error> {
    let canon = Canonicalizer::default();
    let tok = ShingleTokenizer { k: 5, inner: WordTokenizer };
    let fp = MinHashFingerprinter::<_, 128>::new(canon, tok);

    let a = fp.fingerprint("the quick brown fox jumps over the lazy dog at noon today")?;
    let b = fp.fingerprint("the quick brown fox jumps over the lazy dog at dusk today")?;

    let similarity = jaccard(&a, &b);
    println!("Jaccard estimate: {:.2}", similarity);
    if similarity > 0.6 {
        println!("near-duplicate");
    }
    Ok(())
}
```

```rust
# #[cfg(feature = "lsh")]
# fn demo() -> Result<(), txtfp::Error> {
use txtfp::{
    Canonicalizer, Fingerprinter, LshIndex, LshIndexBuilder,
    MinHashFingerprinter, ShingleTokenizer, WordTokenizer,
};

let canon = Canonicalizer::default();
let tok = ShingleTokenizer { k: 5, inner: WordTokenizer };
let fp = MinHashFingerprinter::<_, 128>::new(canon, tok);

// Optimize bands/rows for a Jaccard threshold of 0.7.
let mut idx: LshIndex<128> = LshIndexBuilder::for_threshold(0.7, 128)?.build();
idx.insert(0, fp.fingerprint("the quick brown fox jumps over the lazy dog at noon today")?);
idx.insert(1, fp.fingerprint("astronomers detect cosmic background radiation")?);

let probe = fp.fingerprint("the quick brown fox jumps over the lazy dog at dusk today")?;
let neighbours = idx.query_with_threshold(&probe, 0.5);
println!("near-duplicates: {neighbours:?}");
# Ok(()) }
```

For the complete API reference and worked examples, see USAGE.md.
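To build intuition for what a bands/rows split means for a banded LSH index, the sketch below tabulates the textbook collision probability `1 - (1 - s^r)^b` for every partition of a 128-slot signature. It uses only the standard library; `candidate_probability` and `partitions` are illustrative names and not part of txtfp, and the sketch does not reproduce how `LshIndexBuilder::for_threshold` actually selects a partition.

```rust
/// Probability that two signatures with true Jaccard similarity `s`
/// share at least one bucket when the signature is split into
/// `bands` bands of `rows` slots each.
fn candidate_probability(s: f64, bands: u32, rows: u32) -> f64 {
    1.0 - (1.0 - s.powi(rows as i32)).powi(bands as i32)
}

/// Every (bands, rows) pair with bands * rows == h.
fn partitions(h: u32) -> Vec<(u32, u32)> {
    (1..=h).filter(|b| h % b == 0).map(|b| (b, h / b)).collect()
}

fn main() {
    // More bands (fewer rows) makes the index more permissive;
    // more rows per band makes it stricter.
    for (bands, rows) in partitions(128) {
        let low = candidate_probability(0.5, bands, rows);
        let high = candidate_probability(0.9, bands, rows);
        println!("b={bands:3} r={rows:3}  P(s=0.5)={low:.4}  P(s=0.9)={high:.4}");
    }
}
```

Reading the table row by row shows the trade-off a threshold optimizer has to balance: false positives below the threshold against false negatives above it.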
input bytes
│
▼
canonicalize (NFKC + casefold + Bidi/format strip)
│
▼
tokenize (Word | Grapheme | Shingle | CJK)
│
▼
sketch (MinHash | SimHash | TLSH | Embedding)
│
▼
compare (jaccard | hamming | cosine_estimate | semantic_similarity)
Every layer is independently swappable: pick a canonicalizer config, plug any Tokenizer, choose a HashFamily, and the same input always produces the same byte-stable signature.
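As a rough illustration of the four stages, here is a standard-library-only sketch that wires a toy canonicalizer, a whitespace tokenizer, a 64-lane SimHash accumulator, and a Hamming comparison together. Everything in it is an assumption made for illustration: FNV-1a stands in for the crate's hash families, `canonicalize` only lowercases ASCII (no NFKC, no Bidi strip), and none of the function names are txtfp APIs.

```rust
/// FNV-1a 64-bit: a simple stand-in hash, not one of the crate's hash families.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Stage 1: toy canonicalizer (ASCII lowercase only).
fn canonicalize(input: &str) -> String {
    input.to_ascii_lowercase()
}

/// Stages 2 and 3: whitespace tokens fed into a 64-lane signed accumulator.
fn simhash64(text: &str) -> u64 {
    let mut lanes = [0i32; 64];
    for token in text.split_whitespace() {
        let h = fnv1a64(token.as_bytes());
        for (bit, lane) in lanes.iter_mut().enumerate() {
            // +1 when the token's hash has this bit set, -1 otherwise.
            if (h >> bit) & 1 == 1 {
                *lane += 1;
            } else {
                *lane -= 1;
            }
        }
    }
    // Sign-extract: a strictly positive lane becomes a 1 bit.
    lanes.iter().enumerate().fold(0u64, |acc, (bit, &lane)| {
        if lane > 0 { acc | (1u64 << bit) } else { acc }
    })
}

/// Stage 4: Hamming distance between two 64-bit sketches.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

fn main() {
    let a = simhash64(&canonicalize("The quick brown fox jumps over the lazy dog at noon today"));
    let b = simhash64(&canonicalize("the quick brown fox jumps over the lazy dog at dusk today"));
    let c = simhash64(&canonicalize("astronomers detect cosmic background radiation"));
    println!("near-dup distance:  {}", hamming(a, b));
    println!("unrelated distance: {}", hamming(a, c));
}
```

Because each stage is a plain function over the previous stage's output, swapping one (say, a different tokenizer) leaves the others untouched, which is the property the pipeline description relies on.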
MinHashSig<H> SimHash64
├── schema: u16 (= 1) └── 8 bytes (u64, little-endian)
├── _pad: [u8; 6] (zero)
└── hashes: [u64; H], LE
Total size: 8 + 8*H bytes
These layouts are enforced by 18 byte-frozen golden-test fixtures (tests/data/golden/). Failing a golden test is a hard breakage that requires a major-version bump.
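The size formula can be checked with a small stand-alone mirror of the documented `MinHashSig<H>` layout. `SigLayout` below is a hypothetical type written for this illustration, not the crate's own struct, and serializing `schema` as little-endian is an assumption (the layout above only states LE for `hashes`).

```rust
use std::mem::size_of;

/// Hypothetical mirror of the documented layout; not the crate's type.
#[repr(C)]
#[derive(Clone, Copy)]
struct SigLayout<const H: usize> {
    schema: u16,
    _pad: [u8; 6],
    hashes: [u64; H],
}

impl<const H: usize> SigLayout<H> {
    /// Serialize field by field in little-endian order.
    fn to_le_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(8 + 8 * H);
        out.extend_from_slice(&self.schema.to_le_bytes());
        out.extend_from_slice(&self._pad);
        for h in &self.hashes {
            out.extend_from_slice(&h.to_le_bytes());
        }
        out
    }
}

fn main() {
    // With an explicit 6-byte pad, repr(C) inserts no implicit padding.
    assert_eq!(size_of::<SigLayout<128>>(), 8 + 8 * 128);

    let sig = SigLayout::<2> { schema: 1, _pad: [0; 6], hashes: [0x0102, 0xff] };
    let bytes = sig.to_le_bytes();
    assert_eq!(bytes.len(), 24);
    // Header: schema = 1 (LE) followed by six zero pad bytes.
    assert_eq!(&bytes[..8], &[1, 0, 0, 0, 0, 0, 0, 0]);
    // First hash slot, least-significant byte first.
    assert_eq!(&bytes[8..10], &[0x02, 0x01]);
    println!("{} bytes", bytes.len());
}
```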
- MinHash uses double-hashing (Indyk–Motwani 1998 + Kirsch–Mitzenmacher 2008): one `xxh3_128` per shingle, then derive `H` slots as `low + (i * high)`. v0.2.0+ default; pass `HashFamily::MurmurHash3_x64_128` for `datasketch` byte parity.
- SimHash is Charikar 2002: token-weighted bag, 64-lane signed accumulator, sign-extract.
- LSH is banded: `bands * rows == H`. `LshIndexBuilder::for_threshold` picks the partition by numerically minimizing the sum of the false-positive integral over `[0, threshold]` and the false-negative integral over `[threshold, 1]`.
- TLSH wraps `tlsh2` 128/1.
- Local embeddings load HF Hub ONNX models, tokenize with `tokenizers`, run `ort` 2.0, and pool with `Pooling::{Cls, Mean, MeanNoNorm, Max}`. The pooling default is looked up per-model (BGE → Cls, E5 → Mean, etc.).
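The double-hashing idea behind the MinHash bullet can be sketched in a few lines of standard-library Rust. This is a minimal illustration under stated assumptions, not the crate's implementation: a seeded FNV-1a stands in for the two halves of a 128-bit hash, and `minhash`, `jaccard`, and `word_shingles` are local helper names chosen for this example.

```rust
/// Seeded FNV-1a 64-bit; a stand-in for one half of a 128-bit hash.
fn fnv1a64_seeded(bytes: &[u8], seed: u64) -> u64 {
    let mut h = 0xcbf29ce484222325u64 ^ seed;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Word k-shingles over whitespace tokens (k must be at least 1).
fn word_shingles(text: &str, k: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    words.windows(k).map(|w| w.join(" ")).collect()
}

/// MinHash with double hashing: two hashes per shingle, and slot `i`
/// keeps the minimum of `low + i * high` over all shingles.
fn minhash<const H: usize>(shingles: &[String]) -> [u64; H] {
    let mut slots = [u64::MAX; H];
    for s in shingles {
        let low = fnv1a64_seeded(s.as_bytes(), 0);
        let high = fnv1a64_seeded(s.as_bytes(), 0x9e3779b97f4a7c15);
        for (i, slot) in slots.iter_mut().enumerate() {
            let v = low.wrapping_add((i as u64).wrapping_mul(high));
            if v < *slot {
                *slot = v;
            }
        }
    }
    slots
}

/// Jaccard estimate: fraction of slots on which both signatures agree.
fn jaccard<const H: usize>(a: &[u64; H], b: &[u64; H]) -> f64 {
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    same as f64 / H as f64
}

fn main() {
    let a = minhash::<128>(&word_shingles(
        "the quick brown fox jumps over the lazy dog at noon today", 3));
    let b = minhash::<128>(&word_shingles(
        "the quick brown fox jumps over the lazy dog at dusk today", 3));
    println!("Jaccard estimate: {:.2}", jaccard(&a, &b));
}
```

The point of double hashing is cost: each shingle is hashed twice regardless of `H`, and the remaining per-slot work is one multiply, one add, and one compare.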
Single-thread throughput on a 2024-class x86_64 laptop, fat-LTO release with RUSTFLAGS="-C target-cpu=native" and mimalloc as the benches' global allocator, measured with cargo bench --features lsh over the 5 KB lorem_ipsum (ASCII) corpus:
v0.2.0+ baseline (HashFamily::Xxh3_64 default):
| Operation | Time | Throughput |
|---|---|---|
| MinHash sketch (h=128) | ~110 µs/doc | ~9K docs/sec |
| MinHash sketch (h=64) | ~76 µs/doc | ~13K docs/sec |
| SimHash sketch (b=64) | ~205 µs/doc | ~5K docs/sec¹ |
| Canonicalize NFKC (ASCII) | ~540 ns/doc | ~1.9M docs/sec |
| LSH insert (h=128) | ~1.9 µs/sig | ~530K signatures/sec |
| LSH query (10K-doc index) | ~393 µs² | ~2.5K queries/sec |
| Hamming compare (`hamming`) | ~0.5 ns | ~2B comparisons/sec |
| Jaccard compare (h=128) | ~50 ns | ~20M comparisons/sec |
¹ SimHash 5 KB throughput improved 40% from v0.1.2 (345 µs → 205 µs)
via the streaming ±1-per-occurrence accumulator under Weighting::Tf.
² LSH query is slower than v0.1.x on adversarial bench corpora —
xxh3's collision profile produces 1.62× more bucket candidates than
MurmurHash3 on a 9/10-shared-words corpus. Per-candidate cost is
unchanged. Pin HashFamily::MurmurHash3_x64_128 if your workload
matches the bench shape and you need v0.1.x query latency. See
CHANGELOG.md for the analysis.
Run benchmarks:

```sh
RUSTFLAGS="-C target-cpu=native" cargo bench --features lsh
```

- The canonicalizer takes a single-pass ASCII fast path. v0.2.0 extends it to ASCII + droppable bidi/format codepoints (BOM, ZWSP, RLO, variation selectors), measured 17× faster on a 5 KB corpus with one BOM and a ZWSP every 80 bytes (170 µs → 9.8 µs).
- `Tokenizer::for_each_token` is a callback-style API that skips per-token `String` allocation; classical sketchers route through it.
- mimalloc gives ~2× on `LSH insert` (alloc-heavy), ~6% on SimHash, marginal elsewhere.
- The MinHash slot-update inner loop and the SimHash 64-lane accumulator are already auto-vectorized by LLVM (verified via release-build assembly: `vpcmpltuq` + AVX-512 mask blending on `ymm` registers). No hand-rolled SIMD planned.
- `LshIndex::extend_par` (v0.2.0, `parallel` feature) shards bulk insert by band across the rayon thread pool: measured 1.74× speedup on 8 cores for 10K-doc bench.
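The allocation point about callback-style tokenization can be illustrated without the crate. In the sketch below, `for_each_word` and `collect_words` are hypothetical helpers; the real `Tokenizer::for_each_token` signature is not shown in this README, so treat the shape of the callback as an assumption.

```rust
/// Callback-style tokenization: each token is a borrowed slice of
/// `text`, so no per-token `String` is allocated.
fn for_each_word<F: FnMut(&str)>(text: &str, mut f: F) {
    for tok in text.split_whitespace() {
        f(tok);
    }
}

/// Allocating counterpart, for comparison: one `String` per token.
fn collect_words(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_owned).collect()
}

fn main() {
    let text = "the quick brown fox";
    let mut count = 0usize;
    let mut total_len = 0usize;
    // A sketcher can fold every token into its state inside the closure.
    for_each_word(text, |tok| {
        count += 1;
        total_len += tok.len();
    });
    assert_eq!(count, collect_words(text).len());
    println!("{count} tokens, {total_len} bytes");
}
```

A sketcher that only needs to hash each token once can do so inside the closure and never hold more than one borrowed token at a time.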
- Hash byte struct layouts (`MinHashSig<H>`, `SimHash64`, `TlshFingerprint`): frozen since v0.1.0. Golden tests enforce on every PR.
- Hash byte values: changed once at v0.2.0 with the default-hasher flip from MurmurHash3 to xxh3. The struct layout did not change. v0.1.x byte parity is one builder call away (`with_hasher(HashFamily::MurmurHash3_x64_128)`); golden fixtures regenerated, no further byte changes planned for v0.2.x.
- `EmbeddingProvider`, `Embedding`, `semantic_similarity`: parity-compatible with `imgfprint` 0.4.x and `audiofp` 0.2.x.
- `FORMAT_VERSION = 1`: mirrored across the cross-modal sibling crates so the integrator (`ucfp`) can refuse to open a database whose layout predates the running build.
- Cross-config comparisons are gated by `FingerprintMetadata::config_hash`. Two fingerprints with different non-zero `config_hash` values must not be compared.
- SemVer enforcement: every PR runs `cargo-semver-checks` (added in v0.2.1) against the published baseline. Accidental SemVer breaks fail CI.
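One way a caller might enforce the `config_hash` rule is a guard in front of every comparison. The sketch below is hypothetical throughout: `Fp`, `CompareError`, and `checked_hamming` are invented for illustration, the `u64` field type is an assumption, and treating a zero hash as "configuration unknown, comparison allowed" is only a literal reading of the rule above, not confirmed crate behavior.

```rust
/// Hypothetical error type for a refused comparison.
#[derive(Debug, PartialEq)]
enum CompareError {
    ConfigMismatch { left: u64, right: u64 },
}

/// Hypothetical fingerprint: a 64-bit sketch plus the hash of the
/// configuration that produced it (0 is assumed to mean "unknown").
struct Fp {
    config_hash: u64,
    bits: u64,
}

/// Refuse to compare fingerprints whose non-zero config hashes differ.
fn checked_hamming(a: &Fp, b: &Fp) -> Result<u32, CompareError> {
    if a.config_hash != 0 && b.config_hash != 0 && a.config_hash != b.config_hash {
        return Err(CompareError::ConfigMismatch {
            left: a.config_hash,
            right: b.config_hash,
        });
    }
    Ok((a.bits ^ b.bits).count_ones())
}

fn main() {
    let a = Fp { config_hash: 1, bits: 0b1010 };
    let b = Fp { config_hash: 1, bits: 0b0110 };
    let c = Fp { config_hash: 2, bits: 0b0110 };
    println!("{:?}", checked_hamming(&a, &b));
    println!("{:?}", checked_hamming(&a, &c));
}
```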
- OOM protection: streaming sketchers cap their internal buffer at 16 MiB; oversized chunks are rejected at `update` time.
- Trojan Source / homoglyph defense: canonicalizer strips Bidi controls and the Cf category. The `security` feature adds the UTS #39 confusable skeleton so Cyrillic 'а' folds to Latin 'a'.
- NFC bombs bounded: NFKC growth capped at 18× (Unicode-spec-mandated worst case).
- Deterministic output: same input always produces the same byte-identical signature; no hidden RNG, no clock dependency.
- Cryptographic-level attacks on the hash families: out of scope. MurmurHash3, xxh3, and SimHash are non-cryptographic by design.
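To show why stripping Bidi and format controls matters, the sketch below removes a hand-picked subset of such code points and lowercases the rest, so that visually disguised strings collapse to the same form. It is a simplified stand-in, not the crate's canonicalizer: there is no NFKC step, `char::to_lowercase` is not a full casefold, the code point list is far smaller than the whole Cf category, and `is_droppable` / `strip_format` are illustrative names.

```rust
/// A small, hand-picked subset of droppable code points.
fn is_droppable(c: char) -> bool {
    matches!(
        c,
        '\u{FEFF}'                  // BOM / zero-width no-break space
        | '\u{200B}'..='\u{200F}'   // ZWSP, ZWNJ, ZWJ, LRM, RLM
        | '\u{202A}'..='\u{202E}'   // LRE, RLE, PDF, LRO, RLO
        | '\u{2066}'..='\u{2069}'   // LRI, RLI, FSI, PDI
    )
}

/// Drop format controls and lowercase; pure-ASCII input takes a fast path.
fn strip_format(input: &str) -> String {
    if input.is_ascii() {
        // No droppable code point is ASCII, so no per-char filter is needed.
        return input.to_ascii_lowercase();
    }
    input
        .chars()
        .filter(|c| !is_droppable(*c))
        .flat_map(char::to_lowercase)
        .collect()
}

fn main() {
    // A right-to-left override hidden inside an identifier.
    let disguised = "ad\u{202E}min";
    assert_eq!(strip_format(disguised), strip_format("Admin"));
    println!("{:?} -> {:?}", disguised, strip_format(disguised));
}
```

Once both strings reduce to the same canonical text, every downstream sketch of them is identical, which is what makes the defense effective for deduplication.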
| Feature | txtfp | datasketch (py) | sourmash (py) | rapidfuzz |
|---|---|---|---|---|
| MinHash | ✓ | ✓ | ✓ | — |
| Banded LSH | ✓ | ✓ | ✓ | — |
| SimHash | ✓ | ✓ | — | — |
| TLSH | ✓ | — | — | — |
| Streaming sketches | ✓ | ✓ | ✓ | — |
| Unicode canonicalization | ✓ | — | — | ~ |
| Trojan Source defense | ✓ | — | — | — |
| Local ONNX embeddings | ✓ | — | — | — |
| Byte-stable hash layouts | ✓ | — | — | — |
| `no_std + alloc` | ✓ | — | — | — |
| Pure Rust (no Python GIL) | ✓ | — | — | ✓ |
See the examples/ directory:
- `dedup.rs`: MinHash + LSH end-to-end deduplication
- `near_dup.rs`: SimHash near-duplicate detection
- `semantic.rs`: Local ONNX embedding similarity (requires `semantic`)
- `regen_goldens.rs`: Regenerate the byte-frozen test fixtures (do not run on a patch release; only when intentionally bumping a minor)

```sh
cargo run --example dedup --features lsh --release
cargo run --example near_dup --release
cargo run --example semantic --features semantic --release
```

Contributions welcome. The contract:
- Fork the repository.
- Branch (`git checkout -b feature/x`).
- Run the matrix locally: `cargo test --no-default-features --features "std,minhash,simhash,lsh,tlsh,markup,security,serde,parallel"`.
- Run clippy: `cargo clippy --all-targets -- -D warnings`.
- Run benches if the change touches a hot path: `cargo bench`.
- Never regenerate golden fixtures unless you're explicitly bumping a minor version.
- Open a PR. CI gates on fmt, clippy, doc, deny, audit, semver-checks, and a 60-second fuzz smoke (`canonicalize` and `minhash_streaming` targets under `fuzz/`).
- Releases: see `RELEASING.md`.
```sh
git clone https://github.com/themankindproject/txtfp
cd txtfp

# Default-feature smoke
cargo test

# Full classical surface (no semantic: it pulls heavy ONNX deps)
cargo test --features "lsh,markup,security,serde,parallel,tlsh,cjk"

# Build the docs
cargo doc --no-deps --open

# Run the fuzz harness locally (requires nightly + cargo-fuzz)
cd fuzz && cargo +nightly fuzz run canonicalize -- -max_total_time=60
```

Licensed under the [MIT