Skip to content

feat: SIMD ASCII fast path for Lowercase normalizer (~30-49x)#2036

Open
KimBioInfoStudio wants to merge 2 commits intohuggingface:mainfrom
KimBioInfoStudio:feat/simd-ascii-lower
Open

feat: SIMD ASCII fast path for Lowercase normalizer (~30-49x)#2036
KimBioInfoStudio wants to merge 2 commits intohuggingface:mainfrom
KimBioInfoStudio:feat/simd-ascii-lower

Conversation

@KimBioInfoStudio
Copy link
Copy Markdown

Summary

  • Adds utils::simd::ascii_lower, a runtime-dispatched in-place ASCII lowercaser with AVX2, SSE2, NEON, and scalar back-ends.
  • Gates NormalizedString::lowercase on s.is_ascii() so all-ASCII inputs skip the per-char Unicode case-folding loop and the transform() alignments rebuild, mutating bytes in place instead.
  • Microbench shows ~30-49× speedup over the previous Unicode chars() path on representative ASCII buffers (Apple Silicon / NEON).

Why this is safe

For any ASCII char, char::to_lowercase() returns exactly one ASCII char equal to b | 0x20 for A..=Z (Unicode SpecialCasing.txt has no ASCII entries), so 1 byte → 1 byte and transform() would have rebuilt alignments to the same values it already had. Skipping transform() is therefore observationally equivalent for ASCII inputs.

Two new unit tests pin this down:

  • lowercase_ascii_fast_path_preserves_alignments — after a non-trivial nfkd() transform, asserts normalized, alignments, original, and original_shift are byte-identical between fast-path output and the expected values.
  • lowercase_ascii_matches_unicode_path_byte_for_byte — for every printable ASCII byte, checks the fast path output equals chars().flat_map(to_lowercase).

Non-ASCII inputs hit the original code path unchanged.

Implementation notes

  • Runtime dispatch via is_x86_feature_detected!("avx2"). SSE2 is always available on the x86_64 baseline; aarch64 stable ABI mandates NEON, so no detect there.
  • Each unsafe fn ascii_lower_* is #[target_feature(enable = "...")]-gated and has a scalar tail. The SSE2/AVX2 paths use signed cmpgt_epi8, which excludes bytes ≥ 0x80 naturally — defensive even though the gate filters them out.
  • Equivalence-tested across critical lengths (0, 1, 7, 15, 16, 17, 31, 32, 33, ...) and against high-bit bytes mixed in.

Microbench (Apple Silicon, NEON)

cargo bench --bench ascii_lower_benchmark:

size SIMD scalar (auto-vec) unicode chars (old) SIMD vs old
64 B 30.2 GiB/s 5.9 GiB/s 1.0 GiB/s ~30×
1 KiB 55.4 GiB/s 6.3 GiB/s 1.2 GiB/s ~48×
16 KiB 58.4 GiB/s 6.3 GiB/s 1.2 GiB/s ~49×
256 KiB 57.9 GiB/s 6.3 GiB/s 1.2 GiB/s ~48×

The "unicode chars" stand-in mirrors what NormalizedString::lowercase did before this change (per-char UTF-8 decode + to_lowercase() + Vec<(char, isize)> accumulation, alignment bookkeeping omitted).

Test plan

  • cargo test --lib (207 tests pass)
  • cargo build --release --lib
  • New unit tests in utils::simd::tests (5 tests covering empty input, critical lengths, mixed-case random, idempotence, and high-byte safety)
  • New equivalence tests in tokenizer::normalizer::tests
  • cargo bench --bench ascii_lower_benchmark
  • CI: clippy + tests on x86_64 (will run via repo CI)

Follow-ups (not in this PR)

  • Same pattern can extend to Strip, BertPreTokenizer byte classification, and ByteLevel ASCII pass-through.

🤖 Generated with Claude Code

KimBioInfoStudio and others added 2 commits April 27, 2026 00:18
Adds `utils::simd::ascii_lower`, a runtime-dispatched in-place ASCII
lowercaser with AVX2, SSE2, NEON, and scalar back-ends, and gates
`NormalizedString::lowercase` on `is_ascii()` so all-ASCII inputs skip the
per-`char` Unicode case-folding loop and the alignments rebuild.

For ASCII inputs the slow path produced byte-identical output and
byte-identical alignments (each `char::to_lowercase()` of an ASCII char is
a single ASCII char, so the `transform` rebuild was a no-op on
alignments); the fast path therefore just flips the `0x20` bit on bytes in
`A`..=`Z` in place. Two new unit tests in `normalizer.rs` lock that
equivalence in:

  - `lowercase_ascii_fast_path_preserves_alignments` — checks that after a
    non-trivial NFKD transform the fast path leaves `alignments`,
    `original`, and `original_shift` unchanged.
  - `lowercase_ascii_matches_unicode_path_byte_for_byte` — checks every
    printable ASCII byte against `char::to_lowercase`.

Microbench (`benches/ascii_lower_benchmark.rs`, Apple Silicon, NEON):

  | size   | SIMD       | scalar (auto-vec) | unicode chars (old) |
  |--------|------------|-------------------|---------------------|
  |   64 B | 30.2 GiB/s |      5.9 GiB/s    |       1.0 GiB/s     |
  |  1 KiB | 55.4 GiB/s |      6.3 GiB/s    |       1.2 GiB/s     |
  | 16 KiB | 58.4 GiB/s |      6.3 GiB/s    |       1.2 GiB/s     |
  |256 KiB | 57.9 GiB/s |      6.3 GiB/s    |       1.2 GiB/s     |

i.e. ~30-49x over the previous Unicode path on real-text-like ASCII.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a 64-byte-wide back-end (`_mm512_cmpgt_epi8_mask` + `_mm512_movm_epi8`)
to `utils::simd::ascii_lower`. The dispatcher only routes to it when
`avx512f`, `avx512bw`, and `avx512fp16` are all detected.

`avx512fp16` is used as a proxy for "AVX-512 without meaningful license-mode
downclock":

  - present on Intel Sapphire Rapids / Emerald Rapids / Granite Rapids
  - present on AMD Zen 4 (Ryzen 7000 / EPYC Genoa) and Zen 5 (Turin)
  - absent on Skylake-X, Cascade Lake, Cooper Lake, Ice Lake-SP, Rocket
    Lake — exactly the generations where 512-bit ops cause measurable
    frequency throttling

Older AVX-512-capable hardware therefore stays on the AVX2 path, where
the 256-bit work is already memory-bandwidth-bound on long buffers.

Verified with `cargo check --target x86_64-unknown-linux-gnu` plus the
existing 207-test lib suite on aarch64. The AVX-512 path itself is
exercised at runtime only on capable hosts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants