
feat: Optimize BPE tokenization: sharded cache, packed merge keys, FxHash (10-15% speedup) #1967

Open
michaelfeil wants to merge 15 commits into huggingface:main from michaelfeil:mf/performance-improvements

Conversation

@michaelfeil (Contributor) commented Mar 19, 2026

Implements tips and tricks from a blog post describing how this library is 4-10x slower than it needs to be: https://www.crusoe.ai/resources/blog/reducing-ttft-by-cpumaxxing-tokenization

Performance: sharded cache, packed merge keys, FxHash, byte table

Benchmarked on LLaMA 3 tokenizer (data/llama-3-tokenizer.json) with data/big.txt (6.5 MB, 128K lines).

Setup

  • Baseline: main branch (onig regex, single RwLock<AHashMap> cache, AHashMap<(u32,u32)> merge map)
  • Test: mf/performance-improvements branch
  • Hardware: Linux, Intel Xeon
  • Benchmark: cargo bench --bench llama3_benchmark

Results

| Benchmark | main | optimized | Change |
|---|---|---|---|
| Offsets (batch + char offsets) | 298 ms | 226 ms | -24.1% |
| Batch encode (1000 items) | 737 ms | 594 ms | -19.3% |
| Sequential encode (full file) | 1.82 s | 1.80 s | -1.0% (noise) |
| BPE Train | ~10 s | ~11 s | no change (noise) |

What changed

  1. Sharded cache (cache.rs): Replaced single RwLock<AHashMap> with 64-shard RwLock<FxHashMap>. Eliminates lock contention in parallel encode_batch / rayon workloads.

  2. Packed u64 MergeMap (mod.rs): BPE merge lookup keys packed from (u32, u32) into a single u64. FxHash on one u64 is a single multiply vs hashing two fields separately (see the sketch after this list).

  3. FxHash (Cargo.toml, word.rs): rustc-hash for merge map and cache — faster non-cryptographic hashing for small integer and string keys.

  4. Flat byte-to-char table (byte_level.rs): Pre-computed [char; 256] lookup replaces AHashMap<u8, char> in the byte-level encoding hot path.
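
A minimal sketch of the packing and shard-selection ideas behind items 1-3. The names here (pack_pair, PackedMergeMap, shard_for) and the modulo shard pick are illustrative assumptions, not the actual cache.rs / mod.rs code, which uses a fixed 64-shard layout:

```rust
use rustc_hash::{FxHashMap, FxHasher};
use std::hash::{Hash, Hasher};

/// Pack a (u32, u32) token-id pair into a single u64 so the merge map
/// hashes one integer instead of a two-field tuple.
#[inline]
fn pack_pair(left: u32, right: u32) -> u64 {
    ((left as u64) << 32) | (right as u64)
}

/// Merge map keyed by the packed pair; the value is (rank, new_id).
type PackedMergeMap = FxHashMap<u64, (u32, u32)>;

/// Pick one of `shards` buckets for a key. With one RwLock per shard,
/// parallel encode_batch workers rarely contend on the same lock.
fn shard_for<K: Hash>(key: &K, shards: usize) -> usize {
    let mut hasher = FxHasher::default();
    key.hash(&mut hasher);
    (hasher.finish() as usize) % shards
}

fn main() {
    let mut merges = PackedMergeMap::default();
    merges.insert(pack_pair(15, 72), (0, 256)); // merge rank 0 produces token 256

    let key = pack_pair(15, 72);
    assert_eq!(merges.get(&key), Some(&(0, 256)));
    println!("key {key:#x} lands in shard {}", shard_for(&key, 64));
}
```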

Why sequential encode barely moved

The sequential path is dominated by the regex step (~40-50% of time) and per-split NormalizedString allocations (~20-30%). These changes target the merge lookup and cache layers, which matter most in parallel/batch workloads where lock contention was the bottleneck.

Files changed

| File | Change |
|---|---|
| Cargo.toml | Added rustc-hash = "2" |
| src/utils/cache.rs | Sharded cache (64 shards, FxHashMap, per-shard RwLock) |
| src/models/bpe/mod.rs | MergeMap struct with packed u64 keys |
| src/models/bpe/model.rs | Use new MergeMap, remove old type alias |
| src/models/bpe/word.rs | Accept &MergeMap in merge_all |
| src/models/bpe/serialization.rs | Updated merge iteration for new API |
| src/models/bpe/trainer.rs | Updated test imports |
| src/pre_tokenizers/byte_level.rs | Flat [char; 256] byte encoding table |
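
For the byte_level.rs row above: the change boils down to building the byte-to-char table once as a flat array indexed by byte, instead of hashing each byte. A standalone sketch, assuming the GPT-2 byte-to-unicode convention the crate already followed (function name is illustrative, not the actual byte_level.rs code):

```rust
/// Build a flat byte -> char table once; the hot path then replaces a
/// HashMap<u8, char> lookup with direct array indexing.
/// Printable bytes keep their own code point; the rest are remapped above U+00FF.
fn bytes_char_table() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut n: u32 = 0;
    for b in 0u32..256 {
        let keep = (b'!' as u32..=b'~' as u32).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        table[b as usize] = if keep {
            char::from_u32(b).unwrap()
        } else {
            let c = char::from_u32(256 + n).unwrap();
            n += 1;
            c
        };
    }
    table
}

fn main() {
    let table = bytes_char_table();
    assert_eq!(table[b'A' as usize], 'A');
    assert_eq!(table[b' ' as usize], '\u{0120}'); // space maps to 'Ġ'
}
```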

michaelfeil changed the title on Mar 19, 2026, revising the claimed speedup from 20-30% to 10-20% and finally to 10-15%.
@michaelfeil (Contributor, Author)

Numbers below are as measured on my local Mac; the table above was from the Linux/Xeon machine:

```
tokenizers % cargo bench --bench llama3_benchmark -- --baseline main
    Finished `bench` profile [optimized] target(s) in 0.21s
     Running benches/llama3_benchmark.rs (target/release/deps/llama3_benchmark-09b144b2844b4db2)
Gnuplot not found, using plotters backend
llama3-encode/llama3-offsets
                        time:   [230.20 ms 234.48 ms 240.30 ms]
                        thrpt:  [25.751 MiB/s 26.391 MiB/s 26.882 MiB/s]
                 change:
                        time:   [−13.852% −11.623% −8.5892%] (p = 0.00 < 0.05)
                        thrpt:  [+9.3962% +13.152% +16.080%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Benchmarking llama3-encode/llama3-encode: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 11.1s.
llama3-encode/llama3-encode
                        time:   [1.1439 s 1.1629 s 1.1849 s]
                        thrpt:  [5.2225 MiB/s 5.3211 MiB/s 5.4095 MiB/s]
                 change:
                        time:   [+0.3665% +2.6718% +4.8842%] (p = 0.05 < 0.05)
                        thrpt:  [−4.6568% −2.6023% −0.3652%]
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
llama3-encode/llama3-batch
                        time:   [222.81 ms 224.74 ms 226.62 ms]
                        thrpt:  [27.306 MiB/s 27.535 MiB/s 27.773 MiB/s]
                 change:
                        time:   [−15.455% −13.218% −11.156%] (p = 0.00 < 0.05)
                        thrpt:  [+12.557% +15.231% +18.281%]
                        Performance has improved.
Found 3 outliers among 10 measurements (30.00%)
  2 (20.00%) low mild
  1 (10.00%) high mild
llama3-encode/llama3-concurrent-long-1t
                        time:   [11.749 ms 11.845 ms 12.034 ms]
                        thrpt:  [7.2850 MiB/s 7.4015 MiB/s 7.4619 MiB/s]
                 change:
                        time:   [−1.6366% +0.0071% +2.0543%] (p = 0.99 > 0.05)
                        thrpt:  [−2.0130% −0.0071% +1.6639%]
                        No change in performance detected.
llama3-encode/llama3-concurrent-long-2t
                        time:   [16.860 ms 17.014 ms 17.326 ms]
                        thrpt:  [11.824 MiB/s 12.042 MiB/s 12.151 MiB/s]
                 change:
                        time:   [−5.4154% −3.6714% −1.8124%] (p = 0.00 < 0.05)
                        thrpt:  [+1.8459% +3.8113% +5.7254%]
                        Performance has improved.
llama3-encode/llama3-concurrent-long-4t
                        time:   [20.717 ms 20.792 ms 20.935 ms]
                        thrpt:  [19.646 MiB/s 19.780 MiB/s 19.852 MiB/s]
                 change:
                        time:   [−9.0411% −8.4655% −7.8370%] (p = 0.00 < 0.05)
                        thrpt:  [+8.5034% +9.2484% +9.9398%]
                        Performance has improved.
llama3-encode/llama3-concurrent-long-8t
                        time:   [22.864 ms 23.296 ms 23.730 ms]
                        thrpt:  [28.523 MiB/s 29.054 MiB/s 29.604 MiB/s]
                 change:
                        time:   [+0.4146% +2.3962% +4.3271%] (p = 0.03 < 0.05)
                        thrpt:  [−4.1476% −2.3401% −0.4129%]
                        Change within noise threshold.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe
Benchmarking llama3-encode/BPE Train vocabulary (big): Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.6s.
llama3-encode/BPE Train vocabulary (big)
                        time:   [847.22 ms 853.69 ms 861.35 ms]
                        thrpt:  [804.66 KiB/s 811.88 KiB/s 818.07 KiB/s]
                 change:
                        time:   [−0.5710% +0.5100% +1.6473%] (p = 0.41 > 0.05)
                        thrpt:  [−1.6206% −0.5075% +0.5743%]
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
```

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator)

Having a look!

@ArthurZucker (Collaborator)

One thing: this seems to affect encode_batch, not necessarily encode_batch_fast, which does not compute offsets. Though this would still be a win for cases where you need offsets.

@McPatate (Member) left a comment

Nice stuff! Thanks!

Review comment on src/models/bpe/word.rs:

```rust
new_id: *new_id,
});
pub(super) fn merge_all(&mut self, merges: &MergeMap, dropout: Option<f32>) {
    TL_MERGE_HEAP.with(|heap_cell| {
```
Member:

Was allocation a bottleneck here?

@michaelfeil (Author):

I just blindly copied the per-thread pattern here. Ideally, it's combined with an external rayon pool, which does not fully translate here.
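
For context, the per-thread pattern referenced here keeps one buffer per thread and reuses its allocation across calls instead of reallocating each time. A generic sketch of that pattern with hypothetical names (not the actual word.rs code):

```rust
use std::cell::RefCell;
use std::collections::BinaryHeap;

thread_local! {
    // One reusable heap per thread; cleared between uses so its capacity is reused.
    static TL_HEAP: RefCell<BinaryHeap<u32>> = RefCell::new(BinaryHeap::new());
}

fn max_of(values: &[u32]) -> Option<u32> {
    TL_HEAP.with(|cell| {
        let mut heap = cell.borrow_mut();
        heap.clear(); // keep the capacity from previous calls on this thread
        heap.extend(values.iter().copied());
        heap.peek().copied()
    })
}

fn main() {
    assert_eq!(max_of(&[3, 9, 4]), Some(9));
    assert_eq!(max_of(&[]), None);
}
```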

Comment thread tokenizers/src/pre_tokenizers/byte_level.rs
Comment thread tokenizers/src/utils/cache.rs Outdated
Comment thread tokenizers/src/utils/cache.rs Outdated
Comment thread tokenizers/src/utils/cache.rs Outdated
Review comment on src/utils/cache.rs:

```rust
let idx = Self::shard_for(key);
let shard = &self.shards[idx];
if let Ok(guard) = shard.try_read() {
    guard.get(key).cloned()
```
Member:

.cloned()?

@michaelfeil (Author):

re: comment above.

Review comment on src/utils/cache.rs:

```rust
}
}

impl<K, V> PartialEq for Cache<K, V>
```

Member:

Suggested change:

```diff
-impl<K, V> PartialEq for Cache<K, V>
+// We dont really care about Cache comparison, so let's make them always equal
+impl<K, V> PartialEq for Cache<K, V>
```

Comment thread tokenizers/src/utils/cache.rs Outdated
Comment thread tokenizers/src/models/bpe/mod.rs
Comment thread tokenizers/src/models/bpe/mod.rs
Comment thread tokenizers/src/utils/cache.rs Outdated
michaelfeil and others added 3 commits March 25, 2026 15:08
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
@ArthurZucker (Collaborator)

/benchmark

@michaelfeil (Contributor, Author)

@ArthurZucker @McPatate thanks for the review. Should be mostly resolved now?

@ArthurZucker (Collaborator)

/benchmark
