txtfp

High-performance text fingerprinting SDK for Rust with classical sketches (MinHash + LSH, SimHash, TLSH), Unicode-correct canonicalization, and semantic embeddings (ONNX local).

Overview

txtfp produces compact, deterministic, byte-stable hashes for text deduplication, near-duplicate detection, and semantic search:

| Method | Use case | Output | Complexity |
| --- | --- | --- | --- |
| MinHash | Set-similarity dedup (Jaccard) | `[u64; H]` | O(n) sketch |
| LSH | Sub-linear near-duplicate lookup | bucketed index | O(1) avg query |
| SimHash | Bit-LSH near-dup (Hamming) | `u64` | O(n) sketch |
| TLSH | Byte-level locality-sensitive hash | hex string | O(n) sketch |
| Embedding | Semantic similarity (ANN) | `Vec<f32>` | model-dependent |

It is the text counterpart to audiofp (audio) and imgfprint (image), and is consumed by the cross-modal ucfp integrator.

Perfect for:

LLM training-set deduplication
RAG retrieval ranking
Content moderation
Plagiarism detection
Email / document de-dup at scale

Features

Byte-stable hash layouts — MinHashSig<H> and SimHash64 are repr(C) bytemuck::Pod. Schema-versioned, semver-frozen, golden-byte enforced (18 fixtures).
Production canonicalization — NFKC + simple casefold + Bidi/format strip; defends against Trojan Source, ZWJ injection, NFC bombs.
no_std + alloc-clean default features — builds for wasm32-unknown-unknown out of the box.
Streaming + offline fingerprinters — every classical sketcher has both a Fingerprinter (whole-doc) and StreamingFingerprinter (chunk-fed) variant.
Local embeddings — LocalProvider (ONNX via ort + Hugging Face Hub) for semantic similarity. Cloud-hosted providers are out of scope; implement EmbeddingProvider against your HTTP client of choice.
Markup helpers — HTML → text, Markdown → text.
Unicode security — UTS #39 confusable skeleton behind the security feature.
CJK tokenizer — jieba-rs with OnceLock-lazy dictionary for Simplified Chinese.
Cross-SDK parity — EmbeddingProvider, Embedding, semantic_similarity, FORMAT_VERSION aligned with imgfprint / audiofp.

Installation

[dependencies]
txtfp = "0.2"

Upgrading from 0.1.x? v0.2.0 flipped the default hash family from MurmurHash3_x64_128 to Xxh3_64 for both MinHash and SimHash — signature bytes change. Pin to 0.1 or pass HashFamily::MurmurHash3_x64_128 explicitly for v0.1.x / Python datasketch byte parity. v0.2.1 is API- and bytes-compatible with v0.2.0 (patch release).

Feature flags

| Feature | Default | Pulls |
| --- | --- | --- |
| `std` | ✅ | libstd. Without it, `no_std + alloc`. |
| `minhash` | ✅ | MinHash sketcher. |
| `simhash` | ✅ | SimHash sketcher. |
| `lsh` | ✅ | Banded LSH index over MinHash signatures. |
| `markup` | | `html_to_text`, `markdown_to_text`. |
| `cjk` | | `CjkTokenizer` (jieba, Simplified Chinese). |
| `tlsh` | | `TlshFingerprinter`. |
| `security` | | UTS #39 confusable skeleton in the canonicalizer. |
| `serde` | | `Serialize` / `Deserialize` on signatures (incl. const-generic MinHash). |
| `parallel` | | Rayon-powered batch helpers. |
| `semantic` | | `LocalProvider` via `ort` + Hugging Face Hub. |

For Japanese / Korean tokenization or PDF text extraction, implement Tokenizer / Canonicalizer upstream of this crate using a dedicated library of your choice (lindera, vibrato, pdf-extract, poppler, …). Cloud-hosted embedding endpoints (OpenAI, Voyage, Cohere, …) are likewise out of scope; implement EmbeddingProvider against your HTTP client of choice. See USAGE.md for a worked example.

Minimal build (no_std + alloc, MinHash + SimHash only — drops LSH):

[dependencies]
txtfp = { version = "0.2", default-features = false, features = ["minhash", "simhash"] }

Without LSH (still on default std):

[dependencies]
txtfp = { version = "0.2", default-features = false, features = ["std", "minhash", "simhash"] }

With local ONNX embeddings:

[dependencies]
txtfp = { version = "0.2", features = ["semantic"] }

Quick Start

use txtfp::{
    Canonicalizer, Fingerprinter, MinHashFingerprinter,
    ShingleTokenizer, WordTokenizer, jaccard,
};

fn main() -> Result<(), txtfp::Error> {
    let canon = Canonicalizer::default();
    let tok = ShingleTokenizer { k: 5, inner: WordTokenizer };
    let fp = MinHashFingerprinter::<_, 128>::new(canon, tok);

    let a = fp.fingerprint("the quick brown fox jumps over the lazy dog at noon today")?;
    let b = fp.fingerprint("the quick brown fox jumps over the lazy dog at dusk today")?;

    let similarity = jaccard(&a, &b);
    println!("Jaccard estimate: {:.2}", similarity);

    if similarity > 0.6 {
        println!("near-duplicate");
    }
    Ok(())
}
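To make the Jaccard estimate in the Quick Start less of a black box, here is a minimal, dependency-free sketch of the standard MinHash estimator: the fraction of signature slots on which two sketches agree. The function name `estimate_jaccard` and the plain `[u64; H]` input are assumptions for illustration; this is not the crate's `jaccard` implementation.

```rust
/// Estimate Jaccard similarity as the fraction of slots on which two
/// MinHash signatures agree. Hypothetical helper, not txtfp's `jaccard`.
fn estimate_jaccard<const H: usize>(a: &[u64; H], b: &[u64; H]) -> f64 {
    // Count the positions where both signatures hold the same minimum.
    let matching = a.iter().zip(b.iter()).filter(|(x, y)| x == y).count();
    matching as f64 / H as f64
}

fn main() {
    // Two toy 8-slot signatures that differ in exactly two slots.
    let a = [1u64, 2, 3, 4, 5, 6, 7, 8];
    let mut b = a;
    b[0] = 100;
    b[5] = 200;
    println!("estimated Jaccard: {:.2}", estimate_jaccard(&a, &b));
}
```

The estimate gets tighter as H grows, which is why larger signatures trade sketch time for accuracy.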

LSH for sub-linear near-duplicate lookup

use txtfp::{
    Canonicalizer, Fingerprinter, LshIndex, LshIndexBuilder,
    MinHashFingerprinter, ShingleTokenizer, WordTokenizer,
};

// Requires the `lsh` feature (enabled by default).
fn main() -> Result<(), txtfp::Error> {
    let canon = Canonicalizer::default();
    let tok = ShingleTokenizer { k: 5, inner: WordTokenizer };
    let fp = MinHashFingerprinter::<_, 128>::new(canon, tok);

    // Optimize bands/rows for a Jaccard threshold of 0.7.
    let mut idx: LshIndex<128> = LshIndexBuilder::for_threshold(0.7, 128)?.build();

    idx.insert(0, fp.fingerprint("the quick brown fox jumps over the lazy dog at noon today")?);
    idx.insert(1, fp.fingerprint("astronomers detect cosmic background radiation")?);

    // Query with a looser threshold than the one the index was tuned for.
    let probe = fp.fingerprint("the quick brown fox jumps over the lazy dog at dusk today")?;
    let neighbours = idx.query_with_threshold(&probe, 0.5);
    println!("near-duplicates: {neighbours:?}");
    Ok(())
}
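The trade-off behind choosing bands and rows can be sketched with the textbook banding formula: two signatures with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b. The helper names below are hypothetical, and this is only the classical S-curve, not the optimizer inside `LshIndexBuilder::for_threshold`.

```rust
/// Probability that two signatures with Jaccard similarity `s` share at
/// least one identical band, for `bands` bands of `rows` rows each.
fn candidate_probability(s: f64, bands: u32, rows: u32) -> f64 {
    1.0 - (1.0 - s.powi(rows as i32)).powi(bands as i32)
}

/// Rule-of-thumb similarity where the S-curve rises fastest: (1/b)^(1/r).
fn approx_threshold(bands: u32, rows: u32) -> f64 {
    (1.0 / bands as f64).powf(1.0 / rows as f64)
}

fn main() {
    // Every (bands, rows) split that satisfies bands * rows == 128.
    for rows in [1u32, 2, 4, 8, 16, 32, 64, 128] {
        let bands = 128 / rows;
        println!(
            "bands={bands:3} rows={rows:3} threshold~{:.3} P(0.5)={:.4} P(0.8)={:.4}",
            approx_threshold(bands, rows),
            candidate_probability(0.5, bands, rows),
            candidate_probability(0.8, bands, rows),
        );
    }
}
```

More rows per band push the curve to the right (fewer false positives); more bands push it to the left (fewer false negatives).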

Documentation

For the complete API reference and worked examples, see USAGE.md.

Architecture

Pipeline

input bytes
    │
    ▼
canonicalize  (NFKC + casefold + Bidi/format strip)
    │
    ▼
tokenize      (Word | Grapheme | Shingle | CJK)
    │
    ▼
sketch        (MinHash | SimHash | TLSH | Embedding)
    │
    ▼
compare       (jaccard | hamming | cosine_estimate | semantic_similarity)

Every layer is independently swappable: pick a canonicalizer config, plug any Tokenizer, choose a HashFamily, and the same input always produces the same byte-stable signature.
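For the compare stage, one common way to turn a SimHash Hamming distance into a cosine estimate follows from Charikar's analysis: each bit differs with probability θ/π, so θ ≈ π · d / bits. The sketch below assumes that relationship; whether the crate's `cosine_estimate` uses exactly this formula is not stated here, and `cosine_from_hamming` is a hypothetical name.

```rust
use std::f64::consts::PI;

/// Estimate cosine similarity from the Hamming distance between two
/// `bits`-wide SimHash values, using angle ~= PI * distance / bits.
fn cosine_from_hamming(distance: u32, bits: u32) -> f64 {
    (PI * distance as f64 / bits as f64).cos()
}

fn main() {
    for d in [0u32, 8, 16, 32, 64] {
        println!("hamming={d:2} -> cosine~{:+.3}", cosine_from_hamming(d, 64));
    }
}
```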

Signature byte layouts (frozen for v0.1.x)

MinHashSig<H>                       SimHash64
├── schema: u16  (= 1)              └── 8 bytes (u64, little-endian)
├── _pad:   [u8; 6] (zero)
└── hashes: [u64; H], LE

Total size: 8 + 8*H bytes

These layouts are enforced by 18 byte-frozen golden-test fixtures (tests/data/golden/). Failing a golden test is a hard breakage that requires a major-version bump.
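As an illustration of the MinHashSig layout above, the following sketch writes and reads the 8 + 8*H byte form by hand. The function names are hypothetical, and encoding the schema field as little-endian is an assumption (the layout only states LE for the hashes); the crate itself exposes the struct as bytemuck::Pod rather than through helpers like these.

```rust
/// Serialize a signature as: 2-byte schema, 6 zero padding bytes, then
/// H little-endian u64 slots. Little-endian schema is an assumption.
fn encode_minhash_sig<const H: usize>(schema: u16, hashes: &[u64; H]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + 8 * H);
    out.extend_from_slice(&schema.to_le_bytes());
    out.extend_from_slice(&[0u8; 6]);
    for h in hashes {
        out.extend_from_slice(&h.to_le_bytes());
    }
    out
}

/// Inverse of `encode_minhash_sig`; returns None on a size mismatch or
/// non-zero padding.
fn decode_minhash_sig<const H: usize>(bytes: &[u8]) -> Option<(u16, [u64; H])> {
    if bytes.len() != 8 + 8 * H || bytes[2..8].iter().any(|&b| b != 0) {
        return None;
    }
    let schema = u16::from_le_bytes([bytes[0], bytes[1]]);
    let mut hashes = [0u64; H];
    for (slot, chunk) in hashes.iter_mut().zip(bytes[8..].chunks_exact(8)) {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        *slot = u64::from_le_bytes(buf);
    }
    Some((schema, hashes))
}

fn main() {
    let sig = [0x0102_0304_0506_0708u64, 42];
    let bytes = encode_minhash_sig(1, &sig);
    println!("{} bytes: {:02x?}", bytes.len(), bytes);
    assert_eq!(decode_minhash_sig::<2>(&bytes), Some((1, sig)));
}
```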

Algorithms

MinHash uses double hashing (Indyk–Motwani 1998 + Kirsch–Mitzenmacher 2008): one xxh3_128 per shingle, split into 64-bit halves low and high, then slot i is derived as low + (i * high) for i in 0..H. xxh3 is the default since v0.2.0; pass HashFamily::MurmurHash3_x64_128 for datasketch byte parity.
SimHash is Charikar 2002: token-weighted bag, 64-lane signed accumulator, sign-extract.
LSH is banded: bands * rows == H. LshIndexBuilder::for_threshold picks the partition by numerically minimizing the sum of the false-positive integral over [0, threshold] and the false-negative integral over [threshold, 1].
TLSH wraps tlsh2 128/1.
Local embeddings load HF Hub ONNX models, tokenize with tokenizers, run ort 2.0, and pool with Pooling::{Cls, Mean, MeanNoNorm, Max}. The pooling default is looked up per-model (BGE → Cls, E5 → Mean, etc.).
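The double-hashing step described for MinHash can be sketched without any dependencies. This is a minimal illustration under stated assumptions: FNV-1a with two seeds stands in for the two halves of xxh3_128, forcing `high` to be odd is a choice of this sketch, and `minhash` is a hypothetical function. It will not reproduce the crate's signature bytes.

```rust
/// FNV-1a, used here only as a dependency-free stand-in for xxh3.
fn fnv1a(bytes: &[u8], seed: u64) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325u64 ^ seed;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Build an H-slot signature with double hashing: slot i keeps the
/// minimum of low + i * high (wrapping) over all shingles.
fn minhash<const H: usize>(shingles: &[&str]) -> [u64; H] {
    let mut sig = [u64::MAX; H];
    for s in shingles {
        let low = fnv1a(s.as_bytes(), 0);
        // Sketch-specific choice: an odd `high` keeps the slots distinct.
        let high = fnv1a(s.as_bytes(), 0x9e37_79b9_7f4a_7c15) | 1;
        for (i, slot) in sig.iter_mut().enumerate() {
            let v = low.wrapping_add((i as u64).wrapping_mul(high));
            if v < *slot {
                *slot = v;
            }
        }
    }
    sig
}

fn main() {
    let a = minhash::<128>(&["the quick brown", "quick brown fox", "brown fox jumps"]);
    let b = minhash::<128>(&["the quick brown", "quick brown fox", "brown fox leaps"]);
    let agree = a.iter().zip(b.iter()).filter(|(x, y)| x == y).count();
    println!("{agree} of 128 slots agree");
}
```

Only two hash evaluations are needed per shingle regardless of H, which is the point of the double-hashing construction.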

Performance

Single-thread throughput on a 2024-class x86_64 laptop, fat-LTO release with RUSTFLAGS="-C target-cpu=native" and mimalloc as the benches' global allocator, measured with cargo bench --features lsh over the 5 KB lorem_ipsum (ASCII) corpus:

v0.2.0+ baseline (HashFamily::Xxh3_64 default):

| Operation | Time | Throughput |
| --- | --- | --- |
| MinHash sketch (h=128) | ~110 µs/doc | ~9K docs/sec |
| MinHash sketch (h=64) | ~76 µs/doc | ~13K docs/sec |
| SimHash sketch (b=64) | ~205 µs/doc | ~5K docs/sec¹ |
| Canonicalize NFKC (ASCII) | ~540 ns/doc | ~1.9M docs/sec |
| LSH insert (h=128) | ~1.9 µs/sig | ~530K signatures/sec |
| LSH query (10K-doc index) | ~393 µs² | ~2.5K queries/sec |
| Hamming compare (`hamming`) | ~0.5 ns | ~2B comparisons/sec |
| Jaccard compare (h=128) | ~50 ns | ~20M comparisons/sec |

¹ SimHash 5 KB sketch time dropped about 40% from v0.1.2 (345 µs → 205 µs), roughly a 1.7× throughput gain, via the streaming ±1-per-occurrence accumulator under Weighting::Tf.

² LSH query is slower than v0.1.x on adversarial bench corpora — xxh3's collision profile produces 1.62× more bucket candidates than MurmurHash3 on a 9/10-shared-words corpus. Per-candidate cost is unchanged. Pin HashFamily::MurmurHash3_x64_128 if your workload matches the bench shape and you need v0.1.x query latency. See CHANGELOG.md for the analysis.
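The ±1-per-occurrence accumulator mentioned in footnote ¹ can be illustrated with a small Charikar-style SimHash. This is a sketch under assumptions: FNV-1a stands in for the crate's hash family, tokens are taken as whitespace-split words, and `simhash64` / `hamming` here are hypothetical stand-ins that will not match the crate's output bytes.

```rust
/// FNV-1a, a dependency-free stand-in for the crate's hash family.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325u64;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Term-frequency SimHash: every token occurrence adds +1 or -1 to each
/// of 64 lanes, depending on the matching bit of the token hash.
fn simhash64<'a>(tokens: impl IntoIterator<Item = &'a str>) -> u64 {
    let mut lanes = [0i32; 64];
    for tok in tokens {
        let h = fnv1a(tok.as_bytes());
        for (bit, lane) in lanes.iter_mut().enumerate() {
            *lane += if (h >> bit) & 1 == 1 { 1 } else { -1 };
        }
    }
    // Sign extraction: a strictly positive lane sets the output bit.
    let mut out = 0u64;
    for (bit, &lane) in lanes.iter().enumerate() {
        if lane > 0 {
            out |= 1 << bit;
        }
    }
    out
}

/// Hamming distance between two 64-bit fingerprints.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

fn main() {
    let a = simhash64("the quick brown fox jumps over the lazy dog".split(' '));
    let b = simhash64("the quick brown fox leaps over the lazy dog".split(' '));
    println!("hamming distance: {}", hamming(a, b));
}
```

Because tokens are folded into the lanes as they arrive, no per-document token map has to be built first.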

Run benchmarks:

RUSTFLAGS="-C target-cpu=native" cargo bench --features lsh

Optimization knobs

The canonicalizer takes a single-pass ASCII fast path. v0.2.0 extends it to ASCII + droppable bidi/format codepoints (BOM, ZWSP, RLO, variation selectors) — measured 17× faster on a 5 KB corpus with one BOM and a ZWSP every 80 bytes (170 µs → 9.8 µs).
Tokenizer::for_each_token is a callback-style API that skips per-token String allocation; classical sketchers route through it.
mimalloc gives ~2× on LSH insert (alloc-heavy), ~6% on SimHash, marginal elsewhere.
The MinHash slot-update inner loop and the SimHash 64-lane accumulator are already auto-vectorized by LLVM (verified via release-build assembly: vpcmpltuq + AVX-512 mask blending on ymm registers). No hand-rolled SIMD planned.
LshIndex::extend_par (v0.2.0, parallel feature) shards bulk insert by band across the rayon thread pool: measured 1.74× speedup on 8 cores for 10K-doc bench.
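The ASCII fast path from the first item can be sketched as a single pass that lowercases ASCII, drops a small set of format code points, and bails out on anything else. The set in `is_droppable` is illustrative and the function names are hypothetical; the crate's actual list and fallback mechanism are not reproduced here.

```rust
/// True for the droppable format code points used in this sketch.
/// Illustrative only; the crate's real set may differ.
fn is_droppable(c: char) -> bool {
    matches!(
        c,
        '\u{FEFF}'                 // BOM
        | '\u{200B}'               // zero-width space
        | '\u{202A}'..='\u{202E}'  // bidi embeddings and overrides (incl. RLO)
        | '\u{2066}'..='\u{2069}'  // bidi isolates
        | '\u{FE00}'..='\u{FE0F}'  // variation selectors
    )
}

/// Single-pass fast path: lowercases ASCII and drops format code points.
/// Returns None when any other non-ASCII character appears, signalling
/// that the caller should fall back to full NFKC + casefold.
fn canonicalize_fast(input: &str) -> Option<String> {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        if c.is_ascii() {
            out.push(c.to_ascii_lowercase());
        } else if !is_droppable(c) {
            return None;
        }
    }
    Some(out)
}

fn main() {
    println!("{:?}", canonicalize_fast("\u{FEFF}Hello\u{200B} World"));
    println!("{:?}", canonicalize_fast("caf\u{e9}"));
}
```

For pure ASCII input NFKC is the identity and simple casefolding reduces to lowercasing, which is what makes the shortcut safe.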

Stability

Hash byte struct layouts (MinHashSig<H>, SimHash64, TlshFingerprint): frozen since v0.1.0. Golden tests enforce on every PR.
Hash byte values: changed once at v0.2.0 with the default-hasher flip from MurmurHash3 to xxh3. The struct layout did not change. v0.1.x byte parity is one builder call away (with_hasher(HashFamily::MurmurHash3_x64_128)). Golden fixtures were regenerated for v0.2.0, and no further byte changes are planned for v0.2.x.
EmbeddingProvider, Embedding, semantic_similarity: parity-compatible with imgfprint 0.4.x and audiofp 0.2.x.
FORMAT_VERSION = 1: mirrored across the cross-modal sibling crates so the integrator (ucfp) can refuse to open a database whose layout predates the running build.
Cross-config comparisons are gated by FingerprintMetadata::config_hash. Two fingerprints with different non-zero config_hash values must not be compared.
SemVer enforcement: every PR runs cargo-semver-checks (added in v0.2.1) against the published baseline. Accidental SemVer breaks fail CI.
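The config_hash rule above can be enforced with a small guard before any comparison. In this sketch `Meta`, `CompareError`, and `check_comparable` are hypothetical stand-ins (the real type is FingerprintMetadata), and treating a zero hash as "unknown, allow" is an assumption read off the wording of the rule.

```rust
/// Hypothetical stand-in for the metadata attached to a fingerprint.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Meta {
    config_hash: u64,
}

#[derive(Debug, PartialEq)]
enum CompareError {
    ConfigMismatch { left: u64, right: u64 },
}

/// Allow a comparison unless both hashes are non-zero and differ.
fn check_comparable(a: Meta, b: Meta) -> Result<(), CompareError> {
    if a.config_hash != 0 && b.config_hash != 0 && a.config_hash != b.config_hash {
        return Err(CompareError::ConfigMismatch {
            left: a.config_hash,
            right: b.config_hash,
        });
    }
    Ok(())
}

fn main() {
    let a = Meta { config_hash: 0xAAAA };
    let b = Meta { config_hash: 0xBBBB };
    println!("{:?}", check_comparable(a, a));
    println!("{:?}", check_comparable(a, b));
}
```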

Security

OOM protection: streaming sketchers cap their internal buffer at 16 MiB; oversized chunks are rejected at update time.
Trojan Source / homoglyph defense: canonicalizer strips Bidi controls and the Cf category. security feature adds the UTS #39 confusable skeleton so Cyrillic 'а' folds to Latin 'a'.
NFC bombs bounded: NFKC growth capped at 18× (Unicode-spec-mandated worst case).
Deterministic output: same input always produces the same byte-identical signature; no hidden RNG, no clock dependency.
Cryptographic-level attacks on the hash families: out of scope. MurmurHash3, xxh3, and SimHash are non-cryptographic by design.
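The OOM protection in the first item amounts to checking the size before buffering. Below is a minimal sketch of that idea; `BoundedBuffer`, `ChunkTooLarge`, and the constant name are hypothetical, and the crate's streaming sketchers may structure the check differently.

```rust
/// 16 MiB cap, matching the figure quoted above.
const MAX_BUFFER_BYTES: usize = 16 * 1024 * 1024;

#[derive(Debug, PartialEq)]
struct ChunkTooLarge {
    requested: usize,
    limit: usize,
}

/// Hypothetical bounded accumulator for chunk-fed input.
struct BoundedBuffer {
    buf: Vec<u8>,
    limit: usize,
}

impl BoundedBuffer {
    fn with_limit(limit: usize) -> Self {
        Self { buf: Vec::new(), limit }
    }

    /// Append a chunk, or reject it before allocating if the buffered
    /// total would exceed the limit.
    fn update(&mut self, chunk: &[u8]) -> Result<(), ChunkTooLarge> {
        let requested = self.buf.len().saturating_add(chunk.len());
        if requested > self.limit {
            return Err(ChunkTooLarge { requested, limit: self.limit });
        }
        self.buf.extend_from_slice(chunk);
        Ok(())
    }

    fn len(&self) -> usize {
        self.buf.len()
    }
}

fn main() {
    let mut b = BoundedBuffer::with_limit(MAX_BUFFER_BYTES);
    b.update(b"hello world").expect("small chunk fits");
    println!("buffered {} bytes", b.len());
}
```

Rejecting at update time means a hostile stream cannot force an allocation larger than the cap.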

Comparison with alternatives

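| Feature | txtfp | datasketch (py) | sourmash (py) | rapidfuzz |
| --- | --- | --- | --- | --- |
| MinHash | ✓ | ✓ | ✓ | — |
| Banded LSH | ✓ | ✓ | ✓ | — |
| SimHash | ✓ | ✓ | — | — |
| TLSH | ✓ | — | — | — |
| Streaming sketches | ✓ | ✓ | ✓ | — |
| Unicode canonicalization | ✓ | — | — | ~ |
| Trojan Source defense | ✓ | — | — | — |
| Local ONNX embeddings | ✓ | — | — | — |
| Byte-stable hash layouts | ✓ | — | — | — |
| `no_std + alloc` | ✓ | — | — | — |
| Pure Rust (no Python GIL) | ✓ | — | — | ✓ |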

Examples

See the examples/ directory:

dedup.rs — MinHash + LSH end-to-end deduplication
near_dup.rs — SimHash near-duplicate detection
semantic.rs — Local ONNX embedding similarity (requires semantic)
regen_goldens.rs — Regenerate the byte-frozen test fixtures (do not run on a patch release; only when intentionally bumping a minor)

cargo run --example dedup --features lsh --release
cargo run --example near_dup --release
cargo run --example semantic --features semantic --release

Contributing

Contributions welcome. The contract:

Fork the repository.
Branch (git checkout -b feature/x).
Run the matrix locally: cargo test --no-default-features --features "std,minhash,simhash,lsh,tlsh,markup,security,serde,parallel".
Run clippy: cargo clippy --all-targets -- -D warnings.
Run benches if the change touches a hot path: cargo bench.
Never regenerate golden fixtures unless you're explicitly bumping a minor version.
Open a PR. CI gates on fmt, clippy, doc, deny, audit, semver-checks, and a 60-second fuzz smoke (canonicalize and minhash_streaming targets under fuzz/).
Releases: see RELEASING.md.

Development

git clone https://github.com/themankindproject/txtfp
cd txtfp

# Default-feature smoke
cargo test

# Full classical surface (no semantic — pulls heavy ONNX deps)
cargo test --features "lsh,markup,security,serde,parallel,tlsh,cjk"

# Build the docs
cargo doc --no-deps --open

# Run the fuzz harness locally (requires nightly + cargo-fuzz)
cd fuzz && cargo +nightly fuzz run canonicalize -- -max_total_time=60

License

Licensed under the MIT License. See LICENSE-MIT.
