feat(ByteLevel): skip per-byte transform for printable-ASCII tokens #2038
Open
KimBioInfoStudio wants to merge 2 commits into huggingface:main from
The byte → char map produced by `bytes_char()` is the identity on
`0x21..=0x7E` (printable ASCII excluding space): each such byte maps to
the char with the same code point, which encodes back to that exact
byte in UTF-8. So when an entire post-regex token already lives in that
range, the per-byte transform produces an output byte-identical to the
input and rebuilds `alignments` to the same tuples it already had. We
can return early and skip the `Vec<(char, isize)>` build, the
`HashMap` lookups, and the `transform()` rebuild.
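
To see why that range is an identity, it can help to rebuild the table. The sketch below follows the widely documented GPT-2 `bytes_to_unicode` recipe; `byte_to_char_table` and `is_identity_byte` are hypothetical names, and the table is assumed, not verified, to match what `bytes_char()` returns.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `bytes_char()`, rebuilt from the GPT-2
/// `bytes_to_unicode` recipe (an assumption, not the crate's code).
fn byte_to_char_table() -> HashMap<u8, char> {
    // Bytes that keep their own code point.
    let keeps_code_point =
        |b: u8| matches!(b, 0x21..=0x7E | 0xA1..=0xAC | 0xAE..=0xFF);

    let mut table = HashMap::with_capacity(256);
    let mut next_free = 256u32; // all other bytes are remapped to U+0100 and up
    for b in 0u8..=255 {
        let c = if keeps_code_point(b) {
            char::from(b)
        } else {
            let c = char::from_u32(next_free).expect("valid scalar value");
            next_free += 1;
            c
        };
        table.insert(b, c);
    }
    table
}

/// True when `b` maps to a char whose UTF-8 encoding is the single byte `b`.
fn is_identity_byte(table: &HashMap<u8, char>, b: u8) -> bool {
    let mut buf = [0u8; 4];
    table[&b].encode_utf8(&mut buf).as_bytes() == [b]
}
```

Under these assumptions `is_identity_byte` holds for every byte in `0x21..=0x7E`. A Latin-1 byte such as `0xE9` keeps its code point but encodes to two UTF-8 bytes, so it is not a byte-level identity and has to stay on the slow path.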
The gate is conservative: any byte outside `0x21..=0x7E` (space, tab,
newline, DEL, any Latin-1, any UTF-8 lead/continuation byte) falls
through to the original code path with zero changes. Tokens like
`" word"` (regex `' ?\p{L}+'` output), `"\n"`, or any non-ASCII text
therefore retain their existing GPT-2 byte mapping (' ' → 'Ġ' etc.).
The `iter().all(...)` predicate auto-vectorizes on stable Rust, so the
detection is O(n) over bytes with SIMD-class throughput on x86_64 / aarch64.
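
A minimal sketch of the gate as a standalone function may make the fall-through cases concrete. The function name is hypothetical; only the predicate itself comes from the PR text.

```rust
/// The gate as described in the PR: every byte is printable ASCII
/// other than space. `is_printable_ascii_token` is a hypothetical name.
///
/// Passes:        "hello", "x=y+1;"
/// Falls through: " word" (space), "\n", "a\tb", "\u{7f}" (DEL), "héllo"
fn is_printable_ascii_token(token: &str) -> bool {
    token.bytes().all(|b| (0x21..=0x7E).contains(&b))
}
```

The standard library's `u8::is_ascii_graphic` covers the same `0x21..=0x7E` range, so `token.bytes().all(|b| b.is_ascii_graphic())` would be an equivalent spelling. An empty token passes vacuously, which should be harmless since there is nothing to transform.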
Two new tests in `byte_level::tests`:
- `printable_ascii_fast_path_matches_slow_path` — tokens of various
sizes (including > 32 bytes to cross auto-vectorized chunks) round-trip
byte-identical with the same offsets the slow path produces.
- `fast_path_does_not_swallow_non_printable_bytes` — `" hi"` is still
mapped to `"Ġhi"` (slow path).
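
The property these two tests check can be modeled outside the crate. The sketch below is a simplification: it maps strings only and ignores alignments and offsets, the helper names are hypothetical, and `gpt2_char` is assumed (not verified) to agree with `bytes_char()`.

```rust
use std::borrow::Cow;

/// GPT-2 style byte-to-char mapping computed directly (assumed to agree
/// with `bytes_char()`; this helper is not part of the crate).
fn gpt2_char(b: u8) -> char {
    let kept = |x: u8| matches!(x, 0x21..=0x7E | 0xA1..=0xAC | 0xAE..=0xFF);
    if kept(b) {
        char::from(b)
    } else {
        // Remapped bytes take U+0100 upward, in increasing byte order.
        let rank = (0..b).filter(|&x| !kept(x)).count() as u32;
        char::from_u32(256 + rank).expect("valid scalar value")
    }
}

/// Simplified slow path: map every byte, ignoring alignments.
fn map_slow(token: &str) -> String {
    token.bytes().map(gpt2_char).collect()
}

/// Simplified gated path: borrow the input when the gate passes,
/// otherwise fall through to the slow path.
fn map_gated(token: &str) -> Cow<'_, str> {
    if token.bytes().all(|b| (0x21..=0x7E).contains(&b)) {
        Cow::Borrowed(token)
    } else {
        Cow::Owned(map_slow(token))
    }
}
```

In this model `map_gated` returns `Cow::Borrowed` for pure-printable tokens and agrees with `map_slow` on every input, while `" hi"` still takes the owned slow-path branch.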
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
/benchmark
Collaborator
ArthurZucker left a comment
Sounds good, I want to make sure this has a positive net effect on perf!
Summary
Adds an early-return fast path to the `ByteLevel` pre-tokenizer for tokens whose bytes all live in `0x21..=0x7E` (printable ASCII, excluding space). For those tokens the existing per-byte transform produces an output byte-identical to the input and rebuilds `alignments` to the exact tuples they already had, so we can simply return without doing any of that work.
Why this is safe
The mapping in `bytes_char()` is the identity on `0x21..=0x7E`: each such byte `b` maps to `char::from_u32(b as u32)`, which UTF-8-encodes back to the same single byte. Walking the slow path on a token consisting only of those bytes therefore:
- builds `transformations = [(b1 as char, 0), (b2 as char, 0), ...]` (one entry per byte, all `isize = 0`)
- calls `transform(transformations, 0)`, which rebuilds `normalized` to the same bytes and `alignments` to the same per-byte tuples it already had
Returning early is observationally equivalent. Any byte outside `0x21..=0x7E` (space, tab, newline, DEL, any Latin-1, any UTF-8 lead/continuation byte) makes the predicate false and the original code path runs unchanged, so:
- `" word"` (post-regex `' ?\p{L}+'`) still gets ' ' → 'Ġ' (U+0120)
- `"\n"`, `"\t"`, runs of whitespace still get their GPT-2 mappings
- `trim_offsets` semantics are untouched
Tests
Two new tests in `byte_level::tests`:
- `printable_ascii_fast_path_matches_slow_path`: for several inputs of varying lengths (including > 32 bytes to cross auto-vectorized chunk boundaries), checks that `pre_tokenize` produces the input back unchanged with offsets `(0, len)`.
- `fast_path_does_not_swallow_non_printable_bytes`: `" hi"` still produces `"Ġhi"`, proving the gate falls through correctly when even one non-printable byte is present.
The full existing test suite (202 tests) continues to pass.
Performance
The detection predicate `bytes.iter().all(|&b| (0x21..=0x7E).contains(&b))` is auto-vectorized by stable Rust on x86_64 / aarch64, so the gate runs at SIMD-class throughput. After the GPT-2 regex split, a typical English token stream has a substantial fraction of pure-printable tokens (letters, digits, `,`, `.`, `?`, `!`, identifier-like text in code, etc.); each one of those skips a `Vec<(char, isize)>` allocation, ~`len` `HashMap` lookups, and the `transform()` rebuild.
Test plan
- `cargo test --lib` (202 tests pass)
Notes
This is a no-`unsafe`, no-new-dependency, single-file change. Companion to #2036 (SIMD ASCII lowercase) and #2037 (NFC ASCII early-exit), but fully independent of both.
A follow-up could add an explicit SIMD predicate via the dispatcher introduced in #2036 plus a separate fast path for the very common `" word"` shape (one leading byte → 'Ġ', rest identity-passthrough).
🤖 Generated with Claude Code