feat: SIMD ASCII fast path for Lowercase normalizer (~30-49x)#2036
Open
KimBioInfoStudio wants to merge 2 commits intohuggingface:mainfrom
Open
feat: SIMD ASCII fast path for Lowercase normalizer (~30-49x)#2036KimBioInfoStudio wants to merge 2 commits intohuggingface:mainfrom
KimBioInfoStudio wants to merge 2 commits intohuggingface:mainfrom
Conversation
Adds `utils::simd::ascii_lower`, a runtime-dispatched in-place ASCII
lowercaser with AVX2, SSE2, NEON, and scalar back-ends, and gates
`NormalizedString::lowercase` on `is_ascii()` so all-ASCII inputs skip the
per-`char` Unicode case-folding loop and the alignments rebuild.
For ASCII inputs the slow path produced byte-identical output and
byte-identical alignments (each `char::to_lowercase()` of an ASCII char is
a single ASCII char, so the `transform` rebuild was a no-op on
alignments); the fast path therefore just flips the `0x20` bit on bytes in
`A`..=`Z` in place. Two new unit tests in `normalizer.rs` lock that
equivalence in:
- `lowercase_ascii_fast_path_preserves_alignments` — checks that after a
non-trivial NFKD transform the fast path leaves `alignments`,
`original`, and `original_shift` unchanged.
- `lowercase_ascii_matches_unicode_path_byte_for_byte` — checks every
printable ASCII byte against `char::to_lowercase`.
Microbench (`benches/ascii_lower_benchmark.rs`, Apple Silicon, NEON):
| size | SIMD | scalar (auto-vec) | unicode chars (old) |
|--------|------------|-------------------|---------------------|
| 64 B | 30.2 GiB/s | 5.9 GiB/s | 1.0 GiB/s |
| 1 KiB | 55.4 GiB/s | 6.3 GiB/s | 1.2 GiB/s |
| 16 KiB | 58.4 GiB/s | 6.3 GiB/s | 1.2 GiB/s |
|256 KiB | 57.9 GiB/s | 6.3 GiB/s | 1.2 GiB/s |
i.e. ~30-49x over the previous Unicode path on real-text-like ASCII.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a 64-byte-wide back-end (`_mm512_cmpgt_epi8_mask` + `_mm512_movm_epi8`)
to `utils::simd::ascii_lower`. The dispatcher only routes to it when
`avx512f`, `avx512bw`, and `avx512fp16` are all detected.
`avx512fp16` is used as a proxy for "AVX-512 without meaningful license-mode
downclock":
- present on Intel Sapphire Rapids / Emerald Rapids / Granite Rapids
- present on AMD Zen 4 (Ryzen 7000 / EPYC Genoa) and Zen 5 (Turin)
- absent on Skylake-X, Cascade Lake, Cooper Lake, Ice Lake-SP, Rocket
Lake — exactly the generations where 512-bit ops cause measurable
frequency throttling
Older AVX-512-capable hardware therefore stays on the AVX2 path, where
the 256-bit work is already memory-bandwidth-bound on long buffers.
Verified with `cargo check --target x86_64-unknown-linux-gnu` plus the
existing 207-test lib suite on aarch64. The AVX-512 path itself is
exercised at runtime only on capable hosts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 26, 2026
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
utils::simd::ascii_lower, a runtime-dispatched in-place ASCII lowercaser with AVX2, SSE2, NEON, and scalar back-ends.NormalizedString::lowercaseons.is_ascii()so all-ASCII inputs skip the per-charUnicode case-folding loop and thetransform()alignments rebuild, mutating bytes in place instead.chars()path on representative ASCII buffers (Apple Silicon / NEON).Why this is safe
For any ASCII char,
char::to_lowercase()returns exactly one ASCII char equal tob | 0x20forA..=Z(UnicodeSpecialCasing.txthas no ASCII entries), so 1 byte → 1 byte andtransform()would have rebuiltalignmentsto the same values it already had. Skippingtransform()is therefore observationally equivalent for ASCII inputs.Two new unit tests pin this down:
lowercase_ascii_fast_path_preserves_alignments— after a non-trivialnfkd()transform, assertsnormalized,alignments,original, andoriginal_shiftare byte-identical between fast-path output and the expected values.lowercase_ascii_matches_unicode_path_byte_for_byte— for every printable ASCII byte, checks the fast path output equalschars().flat_map(to_lowercase).Non-ASCII inputs hit the original code path unchanged.
Implementation notes
is_x86_feature_detected!("avx2"). SSE2 is always available on the x86_64 baseline; aarch64 stable ABI mandates NEON, so no detect there.unsafe fn ascii_lower_*is#[target_feature(enable = "...")]-gated and has a scalar tail. The SSE2/AVX2 paths use signedcmpgt_epi8, which excludes bytes ≥ 0x80 naturally — defensive even though the gate filters them out.Microbench (Apple Silicon, NEON)
cargo bench --bench ascii_lower_benchmark:The "unicode chars" stand-in mirrors what
NormalizedString::lowercasedid before this change (per-charUTF-8 decode +to_lowercase()+Vec<(char, isize)>accumulation, alignment bookkeeping omitted).Test plan
cargo test --lib(207 tests pass)cargo build --release --libutils::simd::tests(5 tests covering empty input, critical lengths, mixed-case random, idempotence, and high-byte safety)tokenizer::normalizer::testscargo bench --bench ascii_lower_benchmarkFollow-ups (not in this PR)
Strip,BertPreTokenizerbyte classification, andByteLevelASCII pass-through.🤖 Generated with Claude Code