feat: SIMD ASCII fast path for Lowercase normalizer (~30-49x) by KimBioInfoStudio · Pull Request #2036 · huggingface/tokenizers

KimBioInfoStudio · 2026-04-26T16:20:24Z

Summary

Adds utils::simd::ascii_lower, a runtime-dispatched in-place ASCII lowercaser with AVX2, SSE2, NEON, and scalar back-ends.
Gates NormalizedString::lowercase on s.is_ascii() so all-ASCII inputs skip the per-char Unicode case-folding loop and the transform() alignments rebuild, mutating bytes in place instead.
Microbench shows ~30-49× speedup over the previous Unicode chars() path on representative ASCII buffers (Apple Silicon / NEON).

Why this is safe

For any ASCII char, char::to_lowercase() returns exactly one ASCII char equal to b | 0x20 for A..=Z (Unicode SpecialCasing.txt has no ASCII entries), so 1 byte → 1 byte and transform() would have rebuilt alignments to the same values it already had. Skipping transform() is therefore observationally equivalent for ASCII inputs.

Two new unit tests pin this down:

lowercase_ascii_fast_path_preserves_alignments — after a non-trivial nfkd() transform, asserts normalized, alignments, original, and original_shift are byte-identical between fast-path output and the expected values.
lowercase_ascii_matches_unicode_path_byte_for_byte — for every printable ASCII byte, checks the fast path output equals chars().flat_map(to_lowercase).

Non-ASCII inputs hit the original code path unchanged.

Implementation notes

Runtime dispatch via is_x86_feature_detected!("avx2"). SSE2 is always available on the x86_64 baseline; aarch64 stable ABI mandates NEON, so no detect there.
Each unsafe fn ascii_lower_* is #[target_feature(enable = "...")]-gated and has a scalar tail. The SSE2/AVX2 paths use signed cmpgt_epi8, which excludes bytes ≥ 0x80 naturally — defensive even though the gate filters them out.
Equivalence-tested across critical lengths (0, 1, 7, 15, 16, 17, 31, 32, 33, ...) and against high-bit bytes mixed in.

Microbench (Apple Silicon, NEON)

cargo bench --bench ascii_lower_benchmark:

size	SIMD	scalar (auto-vec)	unicode chars (old)	SIMD vs old
64 B	30.2 GiB/s	5.9 GiB/s	1.0 GiB/s	~30×
1 KiB	55.4 GiB/s	6.3 GiB/s	1.2 GiB/s	~48×
16 KiB	58.4 GiB/s	6.3 GiB/s	1.2 GiB/s	~49×
256 KiB	57.9 GiB/s	6.3 GiB/s	1.2 GiB/s	~48×

The "unicode chars" stand-in mirrors what NormalizedString::lowercase did before this change (per-char UTF-8 decode + to_lowercase() + Vec<(char, isize)> accumulation, alignment bookkeeping omitted).

Test plan

cargo test --lib (207 tests pass)
cargo build --release --lib
New unit tests in utils::simd::tests (5 tests covering empty input, critical lengths, mixed-case random, idempotence, and high-byte safety)
New equivalence tests in tokenizer::normalizer::tests
cargo bench --bench ascii_lower_benchmark
CI: clippy + tests on x86_64 (will run via repo CI)

Follow-ups (not in this PR)

Same pattern can extend to Strip, BertPreTokenizer byte classification, and ByteLevel ASCII pass-through.

🤖 Generated with Claude Code

Adds `utils::simd::ascii_lower`, a runtime-dispatched in-place ASCII lowercaser with AVX2, SSE2, NEON, and scalar back-ends, and gates `NormalizedString::lowercase` on `is_ascii()` so all-ASCII inputs skip the per-`char` Unicode case-folding loop and the alignments rebuild. For ASCII inputs the slow path produced byte-identical output and byte-identical alignments (each `char::to_lowercase()` of an ASCII char is a single ASCII char, so the `transform` rebuild was a no-op on alignments); the fast path therefore just flips the `0x20` bit on bytes in `A`..=`Z` in place. Two new unit tests in `normalizer.rs` lock that equivalence in: - `lowercase_ascii_fast_path_preserves_alignments` — checks that after a non-trivial NFKD transform the fast path leaves `alignments`, `original`, and `original_shift` unchanged. - `lowercase_ascii_matches_unicode_path_byte_for_byte` — checks every printable ASCII byte against `char::to_lowercase`. Microbench (`benches/ascii_lower_benchmark.rs`, Apple Silicon, NEON): | size | SIMD | scalar (auto-vec) | unicode chars (old) | |--------|------------|-------------------|---------------------| | 64 B | 30.2 GiB/s | 5.9 GiB/s | 1.0 GiB/s | | 1 KiB | 55.4 GiB/s | 6.3 GiB/s | 1.2 GiB/s | | 16 KiB | 58.4 GiB/s | 6.3 GiB/s | 1.2 GiB/s | |256 KiB | 57.9 GiB/s | 6.3 GiB/s | 1.2 GiB/s | i.e. ~30-49x over the previous Unicode path on real-text-like ASCII. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a 64-byte-wide back-end (`_mm512_cmpgt_epi8_mask` + `_mm512_movm_epi8`) to `utils::simd::ascii_lower`. The dispatcher only routes to it when `avx512f`, `avx512bw`, and `avx512fp16` are all detected. `avx512fp16` is used as a proxy for "AVX-512 without meaningful license-mode downclock": - present on Intel Sapphire Rapids / Emerald Rapids / Granite Rapids - present on AMD Zen 4 (Ryzen 7000 / EPYC Genoa) and Zen 5 (Turin) - absent on Skylake-X, Cascade Lake, Cooper Lake, Ice Lake-SP, Rocket Lake — exactly the generations where 512-bit ops cause measurable frequency throttling Older AVX-512-capable hardware therefore stays on the AVX2 path, where the 256-bit work is already memory-bandwidth-bound on long buffers. Verified with `cargo check --target x86_64-unknown-linux-gnu` plus the existing 207-test lib suite on aarch64. The AVX-512 path itself is exercised at runtime only on capable hosts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

HuggingFaceDocBuilderDev · 2026-04-27T06:03:39Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

KimBioInfoStudio and others added 2 commits April 27, 2026 00:18

This was referenced Apr 26, 2026

feat(NFC): skip Unicode pass for all-ASCII inputs #2037

Open

feat(ByteLevel): skip per-byte transform for printable-ASCII tokens #2038

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: SIMD ASCII fast path for Lowercase normalizer (~30-49x)#2036

feat: SIMD ASCII fast path for Lowercase normalizer (~30-49x)#2036
KimBioInfoStudio wants to merge 2 commits intohuggingface:mainfrom
KimBioInfoStudio:feat/simd-ascii-lower

KimBioInfoStudio commented Apr 26, 2026

Uh oh!

HuggingFaceDocBuilderDev commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

KimBioInfoStudio commented Apr 26, 2026

Summary

Why this is safe

Implementation notes

Microbench (Apple Silicon, NEON)

Test plan

Follow-ups (not in this PR)

Uh oh!

HuggingFaceDocBuilderDev commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants