
feat: Optimize BPE tokenization: sharded cache, packed merge keys, FxHash (10-15% speedup) #1967

Open
michaelfeil wants to merge 15 commits into huggingface:main from michaelfeil:mf/performance-improvements

Conversation

@michaelfeil (Contributor) commented Mar 19, 2026

Implements tips and tricks from a blog post describing how this library is 4-10x slower than it needs to be: https://www.crusoe.ai/resources/blog/reducing-ttft-by-cpumaxxing-tokenization

Performance: sharded cache, packed merge keys, FxHash, byte table

Benchmarked on LLaMA 3 tokenizer (data/llama-3-tokenizer.json) with data/big.txt (6.5 MB, 128K lines).

Setup

  • Baseline: main branch (onig regex, single RwLock<AHashMap> cache, AHashMap<(u32,u32)> merge map)
  • Test: mf/performance-improvements branch
  • Hardware: Linux, Intel Xeon
  • Benchmark: cargo bench --bench llama3_benchmark

Results

| Benchmark | main | optimized | Change |
|---|---|---|---|
| Offsets (batch + char offsets) | 298 ms | 226 ms | -24.1% |
| Batch encode (1000 items) | 737 ms | 594 ms | -19.3% |
| Sequential encode (full file) | 1.82 s | 1.80 s | -1.0% (noise) |
| BPE Train | ~10 s | ~11 s | no change (noise) |

What changed

  1. Sharded cache (cache.rs): Replaced single RwLock<AHashMap> with 64-shard RwLock<FxHashMap>. Eliminates lock contention in parallel encode_batch / rayon workloads.

  2. Packed u64 MergeMap (mod.rs): BPE merge lookup keys packed from (u32, u32) into a single u64. FxHash on one u64 is a single multiply vs hashing two fields separately (see the sketch after this list).

  3. FxHash (Cargo.toml, word.rs): rustc-hash for merge map and cache — faster non-cryptographic hashing for small integer and string keys.

  4. Flat byte-to-char table (byte_level.rs): Pre-computed [char; 256] lookup replaces AHashMap<u8, char> in the byte-level encoding hot path.
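
A minimal sketch of the packing and shard-selection ideas behind items 1-3. The names here (pack_pair, PackedMergeMap, shard_for) and the modulo shard pick are illustrative assumptions, not the actual cache.rs / mod.rs code, which uses a fixed 64-shard layout:

```rust
use rustc_hash::{FxHashMap, FxHasher};
use std::hash::{Hash, Hasher};

/// Pack a (u32, u32) token-id pair into a single u64 so the merge map
/// hashes one integer instead of a two-field tuple.
#[inline]
fn pack_pair(left: u32, right: u32) -> u64 {
    ((left as u64) << 32) | (right as u64)
}

/// Merge map keyed by the packed pair; the value is (rank, new_id).
type PackedMergeMap = FxHashMap<u64, (u32, u32)>;

/// Pick one of `shards` buckets for a key. With one RwLock per shard,
/// parallel encode_batch workers rarely contend on the same lock.
fn shard_for<K: Hash>(key: &K, shards: usize) -> usize {
    let mut hasher = FxHasher::default();
    key.hash(&mut hasher);
    (hasher.finish() as usize) % shards
}

fn main() {
    let mut merges = PackedMergeMap::default();
    merges.insert(pack_pair(15, 72), (0, 256)); // merge rank 0 produces token 256

    let key = pack_pair(15, 72);
    assert_eq!(merges.get(&key), Some(&(0, 256)));
    println!("key {key:#x} lands in shard {}", shard_for(&key, 64));
}
```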

Why sequential encode barely moved

The sequential path is dominated by the regex step (~40-50% of time) and per-split NormalizedString allocations (~20-30%). These changes target the merge lookup and cache layers, which matter most in parallel/batch workloads where lock contention was the bottleneck.

Files changed

| File | Change |
|---|---|
| Cargo.toml | Added rustc-hash = "2" |
| src/utils/cache.rs | Sharded cache (64 shards, FxHashMap, per-shard RwLock) |
| src/models/bpe/mod.rs | MergeMap struct with packed u64 keys |
| src/models/bpe/model.rs | Use new MergeMap, remove old type alias |
| src/models/bpe/word.rs | Accept &MergeMap in merge_all |
| src/models/bpe/serialization.rs | Updated merge iteration for new API |
| src/models/bpe/trainer.rs | Updated test imports |
| src/pre_tokenizers/byte_level.rs | Flat [char; 256] byte encoding table |
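
For the byte_level.rs row above: the change boils down to building the byte-to-char table once as a flat array indexed by byte, instead of hashing each byte. A standalone sketch, assuming the GPT-2 byte-to-unicode convention the crate already followed (function name is illustrative, not the actual byte_level.rs code):

```rust
/// Build a flat byte -> char table once; the hot path then replaces a
/// HashMap<u8, char> lookup with direct array indexing.
/// Printable bytes keep their own code point; the rest are remapped above U+00FF.
fn bytes_char_table() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut n: u32 = 0;
    for b in 0u32..256 {
        let keep = (b'!' as u32..=b'~' as u32).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        table[b as usize] = if keep {
            char::from_u32(b).unwrap()
        } else {
            let c = char::from_u32(256 + n).unwrap();
            n += 1;
            c
        };
    }
    table
}

fn main() {
    let table = bytes_char_table();
    assert_eq!(table[b'A' as usize], 'A');
    assert_eq!(table[b' ' as usize], '\u{0120}'); // space maps to 'Ġ'
}
```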

michaelfeil changed the title on Mar 19, 2026, revising the claimed speedup from 20-30% to 10-20% and finally to 10-15%.
@michaelfeil (Contributor, Author)

Numbers below are as measured on my local Mac; the table above was from the Linux/Xeon machine:

```
tokenizers % cargo bench --bench llama3_benchmark -- --baseline main
    Finished `bench` profile [optimized] target(s) in 0.21s
     Running benches/llama3_benchmark.rs (target/release/deps/llama3_benchmark-09b144b2844b4db2)
Gnuplot not found, using plotters backend
llama3-encode/llama3-offsets
                        time:   [230.20 ms 234.48 ms 240.30 ms]
                        thrpt:  [25.751 MiB/s 26.391 MiB/s 26.882 MiB/s]
                 change:
                        time:   [−13.852% −11.623% −8.5892%] (p = 0.00 < 0.05)
                        thrpt:  [+9.3962% +13.152% +16.080%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Benchmarking llama3-encode/llama3-encode: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 11.1s.
llama3-encode/llama3-encode
                        time:   [1.1439 s 1.1629 s 1.1849 s]
                        thrpt:  [5.2225 MiB/s 5.3211 MiB/s 5.4095 MiB/s]
                 change:
                        time:   [+0.3665% +2.6718% +4.8842%] (p = 0.05 < 0.05)
                        thrpt:  [−4.6568% −2.6023% −0.3652%]
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
llama3-encode/llama3-batch
                        time:   [222.81 ms 224.74 ms 226.62 ms]
                        thrpt:  [27.306 MiB/s 27.535 MiB/s 27.773 MiB/s]
                 change:
                        time:   [−15.455% −13.218% −11.156%] (p = 0.00 < 0.05)
                        thrpt:  [+12.557% +15.231% +18.281%]
                        Performance has improved.
Found 3 outliers among 10 measurements (30.00%)
  2 (20.00%) low mild
  1 (10.00%) high mild
llama3-encode/llama3-concurrent-long-1t
                        time:   [11.749 ms 11.845 ms 12.034 ms]
                        thrpt:  [7.2850 MiB/s 7.4015 MiB/s 7.4619 MiB/s]
                 change:
                        time:   [−1.6366% +0.0071% +2.0543%] (p = 0.99 > 0.05)
                        thrpt:  [−2.0130% −0.0071% +1.6639%]
                        No change in performance detected.
llama3-encode/llama3-concurrent-long-2t
                        time:   [16.860 ms 17.014 ms 17.326 ms]
                        thrpt:  [11.824 MiB/s 12.042 MiB/s 12.151 MiB/s]
                 change:
                        time:   [−5.4154% −3.6714% −1.8124%] (p = 0.00 < 0.05)
                        thrpt:  [+1.8459% +3.8113% +5.7254%]
                        Performance has improved.
llama3-encode/llama3-concurrent-long-4t
                        time:   [20.717 ms 20.792 ms 20.935 ms]
                        thrpt:  [19.646 MiB/s 19.780 MiB/s 19.852 MiB/s]
                 change:
                        time:   [−9.0411% −8.4655% −7.8370%] (p = 0.00 < 0.05)
                        thrpt:  [+8.5034% +9.2484% +9.9398%]
                        Performance has improved.
llama3-encode/llama3-concurrent-long-8t
                        time:   [22.864 ms 23.296 ms 23.730 ms]
                        thrpt:  [28.523 MiB/s 29.054 MiB/s 29.604 MiB/s]
                 change:
                        time:   [+0.4146% +2.3962% +4.3271%] (p = 0.03 < 0.05)
                        thrpt:  [−4.1476% −2.3401% −0.4129%]
                        Change within noise threshold.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe
Benchmarking llama3-encode/BPE Train vocabulary (big): Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.6s.
llama3-encode/BPE Train vocabulary (big)
                        time:   [847.22 ms 853.69 ms 861.35 ms]
                        thrpt:  [804.66 KiB/s 811.88 KiB/s 818.07 KiB/s]
                 change:
                        time:   [−0.5710% +0.5100% +1.6473%] (p = 0.41 > 0.05)
                        thrpt:  [−1.6206% −0.5075% +0.5743%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
```

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator)

Having a look!

@ArthurZucker (Collaborator)

One thing: this seems to affect encode_batch, not necessarily encode_batch_fast, which does not compute offsets. Though this would still be a win for cases where you need offsets.

@McPatate (Member) left a comment

Nice stuff! Thanks!

Review comment on src/models/bpe/word.rs:

```rust
new_id: *new_id,
});
pub(super) fn merge_all(&mut self, merges: &MergeMap, dropout: Option<f32>) {
    TL_MERGE_HEAP.with(|heap_cell| {
```
Member:

Was allocation a bottleneck here?

@michaelfeil (Author):

I just blindly copied the per-thread pattern here. Ideally, it's combined with an external rayon pool, which does not fully translate here.
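
For context, the per-thread pattern referenced here keeps one buffer per thread and reuses its allocation across calls instead of reallocating each time. A generic sketch of that pattern with hypothetical names (not the actual word.rs code):

```rust
use std::cell::RefCell;
use std::collections::BinaryHeap;

thread_local! {
    // One reusable heap per thread; cleared between uses so its capacity is reused.
    static TL_HEAP: RefCell<BinaryHeap<u32>> = RefCell::new(BinaryHeap::new());
}

fn max_of(values: &[u32]) -> Option<u32> {
    TL_HEAP.with(|cell| {
        let mut heap = cell.borrow_mut();
        heap.clear(); // keep the capacity from previous calls on this thread
        heap.extend(values.iter().copied());
        heap.peek().copied()
    })
}

fn main() {
    assert_eq!(max_of(&[3, 9, 4]), Some(9));
    assert_eq!(max_of(&[]), None);
}
```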

Comment thread tokenizers/src/pre_tokenizers/byte_level.rs
Comment thread tokenizers/src/utils/cache.rs Outdated
Comment thread tokenizers/src/utils/cache.rs Outdated
Comment thread tokenizers/src/utils/cache.rs Outdated
Review comment on src/utils/cache.rs:

```rust
let idx = Self::shard_for(key);
let shard = &self.shards[idx];
if let Ok(guard) = shard.try_read() {
    guard.get(key).cloned()
```
Member:

.cloned()?

@michaelfeil (Author):

re: comment above.

Review comment on src/utils/cache.rs:

```rust
}
}

impl<K, V> PartialEq for Cache<K, V>
```

Member:

Suggested change:

```diff
-impl<K, V> PartialEq for Cache<K, V>
+// We dont really care about Cache comparison, so let's make them always equal
+impl<K, V> PartialEq for Cache<K, V>
```

Comment thread tokenizers/src/utils/cache.rs Outdated
Comment thread tokenizers/src/models/bpe/mod.rs
Comment thread tokenizers/src/models/bpe/mod.rs
Comment thread tokenizers/src/utils/cache.rs Outdated
michaelfeil and others added 3 commits March 25, 2026 15:08
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
@ArthurZucker (Collaborator)

/benchmark

@michaelfeil (Contributor, Author)

@ArthurZucker @McPatate thanks for the review. Should be mostly resolved now?

@ArthurZucker (Collaborator)

/benchmark
