feat(ByteLevel): skip per-byte transform for printable-ASCII tokens#2038

Open
KimBioInfoStudio wants to merge 2 commits into huggingface:main from KimBioInfoStudio:feat/byte-level-printable-ascii-fast-path

Conversation

@KimBioInfoStudio

Summary

Adds an early-return fast path to the ByteLevel pre-tokenizer for tokens whose bytes all live in 0x21..=0x7E (printable ASCII, excluding space). For those tokens the existing per-byte transform produces an output byte-identical to the input and rebuilds alignments to the exact tuples they already had — so we can simply return without doing any of that work.
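The idea can be sketched in self-contained form; `byte_level_transform` is an illustrative name (not the PR's actual diff), and the slow path here is reduced to the single `' ' → 'Ġ'` remap from the GPT-2 byte table:

```rust
// Illustrative sketch only: the real ByteLevel pre-tokenizer works on a
// NormalizedString with alignments; this strips the idea down to strings.
fn byte_level_transform(token: &str) -> String {
    let bytes = token.as_bytes();
    // Fast path: every byte is printable ASCII (0x21..=0x7E), so the
    // per-byte transform would be the identity. Return the input as-is.
    if bytes.iter().all(|&b| (0x21..=0x7E).contains(&b)) {
        return token.to_string();
    }
    // Slow path (heavily simplified): remap bytes per the GPT-2 byte table.
    token.chars().map(|c| if c == ' ' { 'Ġ' } else { c }).collect()
}

fn main() {
    assert_eq!(byte_level_transform("hello!"), "hello!"); // fast path
    assert_eq!(byte_level_transform(" hi"), "Ġhi");       // falls through
}
```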

Why this is safe

The mapping in bytes_char() is the identity on 0x21..=0x7E: each such byte b maps to char::from_u32(b as u32), which UTF-8-encodes back to the same single byte. Walking the slow path on a token consisting only of those bytes therefore:

  • builds transformations = [(b1 as char, 0), (b2 as char, 0), ...] (one entry per byte, all isize=0)
  • calls transform(transformations, 0), which rebuilds normalized to the same bytes and alignments to the same per-byte tuples it already had
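The identity claim is mechanically checkable with the standard library alone; a minimal sketch:

```rust
// Check that for every byte in 0x21..=0x7E, the char with the same code
// point UTF-8-encodes back to exactly that single byte.
fn main() {
    for b in 0x21u8..=0x7E {
        let c = char::from_u32(b as u32).expect("valid scalar value");
        let mut buf = [0u8; 4];
        let encoded = c.encode_utf8(&mut buf).as_bytes();
        assert_eq!(encoded, &[b]); // identity: one byte in, same byte out
    }
    println!("identity holds on 0x21..=0x7E");
}
```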

Returning early is observationally equivalent. Any byte outside 0x21..=0x7E (space, tab, newline, DEL, any Latin-1, any UTF-8 lead/continuation byte) makes the predicate false and the original code path runs unchanged, so:

  • " word" (post-regex ' ?\p{L}+') still gets ' ' → 'Ġ' (U+0120)
  • "\n", "\t", runs of whitespace still get their GPT-2 mappings
  • non-ASCII text (CJK, Arabic, Cyrillic, Vietnamese, …) still goes through the full byte-level transform
  • all existing offsets, decoder behavior, and post-processor trim_offsets semantics are untouched
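For the excluded bytes themselves, the control-and-space range illustrates what the slow path produces. Assuming (consistently with the `' ' → 'Ġ'` (U+0120) example above) that bytes `0x00..=0x20` map to `0x100 + byte` in the GPT-2 table:

```rust
// Assumption for illustration: bytes 0x00..=0x20 remap to 0x100 + byte.
// This matches ' ' (0x20) -> 'Ġ' (U+0120); the real table is bytes_char().
fn main() {
    let remap = |b: u8| char::from_u32(0x100 + b as u32).unwrap();
    assert_eq!(remap(0x20), 'Ġ'); // space
    assert_eq!(remap(0x0A), 'Ċ'); // newline
    assert_eq!(remap(0x09), 'ĉ'); // tab
}
```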

Tests

Two new tests in byte_level::tests:

  • printable_ascii_fast_path_matches_slow_path — for several inputs of varying lengths (including > 32 bytes to cross auto-vectorized chunk boundaries), checks that pre_tokenize produces the input back unchanged with offsets (0, len).
  • fast_path_does_not_swallow_non_printable_bytes — " hi" still produces "Ġhi", proving the gate falls through correctly when even one non-printable byte is present.

The full existing test suite (202 tests) continues to pass.

Performance

The detection predicate bytes.iter().all(|&b| (0x21..=0x7E).contains(&b)) is auto-vectorized by stable Rust on x86_64 / aarch64, so the gate runs at SIMD-class throughput. After the GPT-2 regex split, a typical English token stream has a substantial fraction of pure-printable tokens (letters, digits, ,, ., ?, !, identifier-like text in code, etc.); each one of those skips a Vec<(char, isize)> allocation, ~len HashMap lookups, and the transform() rebuild.
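The predicate as a free function, exercised past the 32-byte boundary the tests mention so any vectorized chunking plus its scalar tail is covered:

```rust
// The gate predicate from the PR, checked on an input longer than 32 bytes.
fn all_printable_ascii(bytes: &[u8]) -> bool {
    bytes.iter().all(|&b| (0x21..=0x7E).contains(&b))
}

fn main() {
    let long = "x".repeat(40); // > 32 bytes, all printable
    assert!(all_printable_ascii(long.as_bytes()));
    let mut tail = long.clone();
    tail.push('\t'); // one non-printable byte anywhere flips the gate
    assert!(!all_printable_ascii(tail.as_bytes()));
}
```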

Test plan

  • cargo test --lib (202 tests pass)
  • New equivalence + non-printable regression tests
  • CI on x86_64 / Linux / macOS

Notes

This is a no-unsafe, no-new-dependency, single-file change. Companion to #2036 (SIMD ASCII lowercase) and #2037 (NFC ASCII early-exit), but fully independent of both.

A follow-up could add an explicit SIMD predicate via the dispatcher introduced in #2036 plus a separate fast path for the very common " word" shape (one leading byte → 'Ġ', rest identity-passthrough).
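That follow-up shape could look roughly like this; `space_word_fast_path` is an illustrative name and this is not part of the PR:

```rust
// Sketch of the suggested follow-up: recognize the " word" shape (one
// leading space + printable-ASCII tail) and emit 'Ġ' plus the tail verbatim.
fn space_word_fast_path(token: &str) -> Option<String> {
    let (first, rest) = token.as_bytes().split_first()?;
    if *first != b' ' || !rest.iter().all(|&b| (0x21..=0x7E).contains(&b)) {
        return None; // not the " word" shape; take the general path
    }
    let mut out = String::with_capacity(token.len() + 2); // 'Ġ' is 2 bytes
    out.push('Ġ');
    out.push_str(&token[1..]);
    Some(out)
}

fn main() {
    assert_eq!(space_word_fast_path(" word").as_deref(), Some("Ġword"));
    assert_eq!(space_word_fast_path("word"), None);  // no leading space
    assert_eq!(space_word_fast_path(" a b"), None);  // non-printable in tail
}
```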

🤖 Generated with Claude Code

The byte → char map produced by `bytes_char()` is the identity on
`0x21..=0x7E` (printable ASCII excluding space): each such byte maps to
the char with the same code point, which encodes back to that exact
byte in UTF-8. So when an entire post-regex token already lives in that
range, the per-byte transform produces an output byte-identical to the
input and rebuilds `alignments` to the same tuples it already had. We
can return early and skip the `Vec<(char, isize)>` build, the
`HashMap` lookups, and the `transform()` rebuild.

The gate is conservative: any byte outside `0x21..=0x7E` (space, tab,
newline, DEL, any Latin-1, any UTF-8 lead/continuation byte) falls
through to the original code path with zero changes. Tokens like
`" word"` (regex `' ?\p{L}+'` output), `"\n"`, or any non-ASCII text
therefore retain their existing GPT-2 byte mapping (' ' → 'Ġ' etc.).

The `iter().all(...)` predicate auto-vectorizes on stable Rust, so the
detection is O(n) over bytes with SIMD-class throughput on x86_64 / aarch64.

Two new tests in `byte_level::tests`:

  - `printable_ascii_fast_path_matches_slow_path` — tokens of various
    sizes (including > 32 bytes to cross auto-vectorized chunks) round-trip
    byte-identical with the same offsets the slow path produces.
  - `fast_path_does_not_swallow_non_printable_bytes` — `" hi"` is still
    mapped to `"Ġhi"` (slow path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ArthurZucker
Collaborator

/benchmark


@ArthurZucker ArthurZucker left a comment


Sounds good, I want to make sure this has a positive net effect on perfs!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
