feat(ByteLevel): skip per-byte transform for printable-ASCII tokens#2038

Open
KimBioInfoStudio wants to merge 2 commits into huggingface:main from KimBioInfoStudio:feat/byte-level-printable-ascii-fast-path

Conversation

@KimBioInfoStudio

Summary

Adds an early-return fast path to the ByteLevel pre-tokenizer for tokens whose bytes all live in 0x21..=0x7E (printable ASCII, excluding space). For those tokens the existing per-byte transform produces an output byte-identical to the input and rebuilds alignments to the exact tuples they already had — so we can simply return without doing any of that work.
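The idea can be sketched in self-contained form; `byte_level_transform` is an illustrative name (not the PR's actual diff), and the slow path here is reduced to the single `' ' → 'Ġ'` remap from the GPT-2 byte table:

```rust
// Illustrative sketch only: the real ByteLevel pre-tokenizer works on a
// NormalizedString with alignments; this strips the idea down to strings.
fn byte_level_transform(token: &str) -> String {
    let bytes = token.as_bytes();
    // Fast path: every byte is printable ASCII (0x21..=0x7E), so the
    // per-byte transform would be the identity. Return the input as-is.
    if bytes.iter().all(|&b| (0x21..=0x7E).contains(&b)) {
        return token.to_string();
    }
    // Slow path (heavily simplified): remap bytes per the GPT-2 byte table.
    token.chars().map(|c| if c == ' ' { 'Ġ' } else { c }).collect()
}

fn main() {
    assert_eq!(byte_level_transform("hello!"), "hello!"); // fast path
    assert_eq!(byte_level_transform(" hi"), "Ġhi");       // falls through
}
```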

Why this is safe

The mapping in bytes_char() is the identity on 0x21..=0x7E: each such byte b maps to char::from_u32(b as u32), which UTF-8-encodes back to the same single byte. Walking the slow path on a token consisting only of those bytes therefore:

  • builds transformations = [(b1 as char, 0), (b2 as char, 0), ...] (one entry per byte, all isize=0)
  • calls transform(transformations, 0), which rebuilds normalized to the same bytes and alignments to the same per-byte tuples it already had
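The identity claim is mechanically checkable with the standard library alone; a minimal sketch:

```rust
// Check that for every byte in 0x21..=0x7E, the char with the same code
// point UTF-8-encodes back to exactly that single byte.
fn main() {
    for b in 0x21u8..=0x7E {
        let c = char::from_u32(b as u32).expect("valid scalar value");
        let mut buf = [0u8; 4];
        let encoded = c.encode_utf8(&mut buf).as_bytes();
        assert_eq!(encoded, &[b]); // identity: one byte in, same byte out
    }
    println!("identity holds on 0x21..=0x7E");
}
```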

Returning early is observationally equivalent. Any byte outside 0x21..=0x7E (space, tab, newline, DEL, any Latin-1, any UTF-8 lead/continuation byte) makes the predicate false and the original code path runs unchanged, so:

  • " word" (post-regex ' ?\p{L}+') still gets ' ' → 'Ġ' (U+0120)
  • "\n", "\t", runs of whitespace still get their GPT-2 mappings
  • non-ASCII text (CJK, Arabic, Cyrillic, Vietnamese, …) still goes through the full byte-level transform
  • all existing offsets, decoder behavior, and post-processor trim_offsets semantics are untouched
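For the excluded bytes themselves, the control-and-space range illustrates what the slow path produces. Assuming (consistently with the `' ' → 'Ġ'` (U+0120) example above) that bytes `0x00..=0x20` map to `0x100 + byte` in the GPT-2 table:

```rust
// Assumption for illustration: bytes 0x00..=0x20 remap to 0x100 + byte.
// This matches ' ' (0x20) -> 'Ġ' (U+0120); the real table is bytes_char().
fn main() {
    let remap = |b: u8| char::from_u32(0x100 + b as u32).unwrap();
    assert_eq!(remap(0x20), 'Ġ'); // space
    assert_eq!(remap(0x0A), 'Ċ'); // newline
    assert_eq!(remap(0x09), 'ĉ'); // tab
}
```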

Tests

Two new tests in byte_level::tests:

  • printable_ascii_fast_path_matches_slow_path — for several inputs of varying lengths (including > 32 bytes to cross auto-vectorized chunk boundaries), checks that pre_tokenize produces the input back unchanged with offsets (0, len).
  • fast_path_does_not_swallow_non_printable_bytes — " hi" still produces "Ġhi", proving the gate falls through correctly when even one non-printable byte is present.

The full existing test suite (202 tests) continues to pass.

Performance

The detection predicate bytes.iter().all(|&b| (0x21..=0x7E).contains(&b)) is auto-vectorized by stable Rust on x86_64 / aarch64, so the gate runs at SIMD-class throughput. After the GPT-2 regex split, a typical English token stream has a substantial fraction of pure-printable tokens (letters, digits, ,, ., ?, !, identifier-like text in code, etc.); each one of those skips a Vec<(char, isize)> allocation, ~len HashMap lookups, and the transform() rebuild.
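The predicate as a free function, exercised past the 32-byte boundary the tests mention so any vectorized chunking plus its scalar tail is covered:

```rust
// The gate predicate from the PR, checked on an input longer than 32 bytes.
fn all_printable_ascii(bytes: &[u8]) -> bool {
    bytes.iter().all(|&b| (0x21..=0x7E).contains(&b))
}

fn main() {
    let long = "x".repeat(40); // > 32 bytes, all printable
    assert!(all_printable_ascii(long.as_bytes()));
    let mut tail = long.clone();
    tail.push('\t'); // one non-printable byte anywhere flips the gate
    assert!(!all_printable_ascii(tail.as_bytes()));
}
```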

Test plan

  • cargo test --lib (202 tests pass)
  • New equivalence + non-printable regression tests
  • CI on x86_64 / Linux / macOS

Notes

This is a no-unsafe, no-new-dependency, single-file change. Companion to #2036 (SIMD ASCII lowercase) and #2037 (NFC ASCII early-exit), but fully independent of both.

A follow-up could add an explicit SIMD predicate via the dispatcher introduced in #2036 plus a separate fast path for the very common " word" shape (one leading byte → 'Ġ', rest identity-passthrough).
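That follow-up shape could look roughly like this; `space_word_fast_path` is an illustrative name and this is not part of the PR:

```rust
// Sketch of the suggested follow-up: recognize the " word" shape (one
// leading space + printable-ASCII tail) and emit 'Ġ' plus the tail verbatim.
fn space_word_fast_path(token: &str) -> Option<String> {
    let (first, rest) = token.as_bytes().split_first()?;
    if *first != b' ' || !rest.iter().all(|&b| (0x21..=0x7E).contains(&b)) {
        return None; // not the " word" shape; take the general path
    }
    let mut out = String::with_capacity(token.len() + 2); // 'Ġ' is 2 bytes
    out.push('Ġ');
    out.push_str(&token[1..]);
    Some(out)
}

fn main() {
    assert_eq!(space_word_fast_path(" word").as_deref(), Some("Ġword"));
    assert_eq!(space_word_fast_path("word"), None);  // no leading space
    assert_eq!(space_word_fast_path(" a b"), None);  // non-printable in tail
}
```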

🤖 Generated with Claude Code

The byte → char map produced by `bytes_char()` is the identity on
`0x21..=0x7E` (printable ASCII excluding space): each such byte maps to
the char with the same code point, which encodes back to that exact
byte in UTF-8. So when an entire post-regex token already lives in that
range, the per-byte transform produces an output byte-identical to the
input and rebuilds `alignments` to the same tuples it already had. We
can return early and skip the `Vec<(char, isize)>` build, the
`HashMap` lookups, and the `transform()` rebuild.

The gate is conservative: any byte outside `0x21..=0x7E` (space, tab,
newline, DEL, any Latin-1, any UTF-8 lead/continuation byte) falls
through to the original code path with zero changes. Tokens like
`" word"` (regex `' ?\p{L}+'` output), `"\n"`, or any non-ASCII text
therefore retain their existing GPT-2 byte mapping (' ' → 'Ġ' etc.).

The `iter().all(...)` predicate auto-vectorizes on stable Rust, so the
detection is O(n) over bytes with SIMD-class throughput on x86_64 / aarch64.

Two new tests in `byte_level::tests`:

  - `printable_ascii_fast_path_matches_slow_path` — tokens of various
    sizes (including > 32 bytes to cross auto-vectorized chunks) round-trip
    byte-identical with the same offsets the slow path produces.
  - `fast_path_does_not_swallow_non_printable_bytes` — `" hi"` is still
    mapped to `"Ġhi"` (slow path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ArthurZucker
Collaborator

/benchmark


@ArthurZucker ArthurZucker left a comment


Sounds good, I want to make sure this has a positive net effect on perfs!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
