feat: Optimize BPE tokenization: sharded cache, packed merge keys, FxHash (10-15% speedup) #1967

michaelfeil wants to merge 15 commits into huggingface:main from
Conversation
Change as is on my local mac, above was on

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Having a look!

One thing is this seems to affect
```rust
            new_id: *new_id,
        });
```

```rust
pub(super) fn merge_all(&mut self, merges: &MergeMap, dropout: Option<f32>) {
    TL_MERGE_HEAP.with(|heap_cell| {
```
Was allocation a bottleneck here?
I just blindly copied the per-thread pattern here. Ideally, it's combined with an external rayon pool, which does not fully translate here.
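For reference, the `thread_local!` scratch-buffer pattern being discussed (reuse one allocation per thread instead of allocating a fresh heap on every call) might look like the sketch below. `TL_SCRATCH` and `max_of` are illustrative names, not the PR's actual code:

```rust
use std::cell::RefCell;
use std::collections::BinaryHeap;

// Illustrative: a per-thread scratch heap reused across calls, so a
// merge_all-style hot loop does not allocate a new BinaryHeap each time.
thread_local! {
    static TL_SCRATCH: RefCell<BinaryHeap<u32>> = RefCell::new(BinaryHeap::new());
}

fn max_of(values: &[u32]) -> Option<u32> {
    TL_SCRATCH.with(|cell| {
        let mut heap = cell.borrow_mut();
        heap.clear(); // reuse the allocation left over from previous calls on this thread
        heap.extend(values.iter().copied());
        heap.pop()
    })
}
```

With rayon, each worker thread in the pool gets its own copy of the heap, so no synchronization is needed.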
```rust
let idx = Self::shard_for(key);
let shard = &self.shards[idx];
if let Ok(guard) = shard.try_read() {
    guard.get(key).cloned()
```
re: comment above.
```rust
impl<K, V> PartialEq for Cache<K, V>
```

Suggested change:

```rust
// We don't really care about Cache comparison, so let's make them always equal
impl<K, V> PartialEq for Cache<K, V>
```
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
/benchmark

@ArthurZucker @McPatate thanks for the review. Should be mostly resolved now?

/benchmark
Tips and tricks from a blog post describing how this library is 4-10x slower than it needs to be: https://www.crusoe.ai/resources/blog/reducing-ttft-by-cpumaxxing-tokenization
Performance: sharded cache, packed merge keys, FxHash, byte table

Benchmarked on the LLaMA 3 tokenizer (`data/llama-3-tokenizer.json`) with `data/big.txt` (6.5 MB, 128K lines).

Setup

- `main` branch (onig regex, single `RwLock<AHashMap>` cache, `AHashMap<(u32, u32)>` merge map)
- `mf/performance-improvements` branch
- `cargo bench --bench llama3_benchmark`

Results

What changed

- Sharded cache (`cache.rs`): Replaced the single `RwLock<AHashMap>` with a 64-shard `RwLock<FxHashMap>`. Eliminates lock contention in parallel `encode_batch` / rayon workloads.
- Packed u64 MergeMap (`mod.rs`): BPE merge lookup keys packed from `(u32, u32)` into a single `u64`. FxHash on one u64 is a single multiply vs hashing two fields separately.
- FxHash (`Cargo.toml`, `word.rs`): `rustc-hash` for the merge map and cache; faster non-cryptographic hashing for small integer and string keys.
- Flat byte-to-char table (`byte_level.rs`): Pre-computed `[char; 256]` lookup replaces `AHashMap<u8, char>` in the byte-level encoding hot path.

Why sequential encode barely moved

The sequential path is dominated by the regex step (~40-50% of time) and per-split `NormalizedString` allocations (~20-30%). These changes target the merge lookup and cache layers, which matter most in parallel/batch workloads where lock contention was the bottleneck.

Files changed

- `Cargo.toml`: add `rustc-hash = "2"`
- `src/utils/cache.rs`: sharded cache
- `src/models/bpe/mod.rs`: `MergeMap` struct with packed u64 keys
- `src/models/bpe/model.rs`: use `MergeMap`, remove old type alias
- `src/models/bpe/word.rs`: `&MergeMap` in `merge_all`
- `src/models/bpe/serialization.rs`
- `src/models/bpe/trainer.rs`
- `src/pre_tokenizers/byte_level.rs`: `[char; 256]` byte encoding table
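The packed-key change can be illustrated with a pair of helper functions. `pack` and `unpack` are hypothetical names for this sketch, assuming the first token id goes in the high 32 bits:

```rust
// Illustrative sketch of the packed merge key: a BPE merge is looked up by a
// pair of token ids, and packing both u32 ids into one u64 lets the hash
// function process a single integer instead of two separate fields.
#[inline]
fn pack(a: u32, b: u32) -> u64 {
    ((a as u64) << 32) | (b as u64)
}

#[inline]
fn unpack(key: u64) -> (u32, u32) {
    ((key >> 32) as u32, key as u32)
}
```

Because the packing is lossless and order-preserving per field, the map behaves identically to one keyed on `(u32, u32)`; only the hashing gets cheaper.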
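The flat table replaces a hash-map probe with a plain array index. A sketch of building such a `[char; 256]` table, assuming the GPT-2-style byte-to-unicode mapping that byte-level BPE uses (printable bytes map to themselves, the rest are remapped to code points above 255):

```rust
// Illustrative: build a flat byte-to-char table once, then the hot path is
// `table[b as usize]` instead of a HashMap<u8, char> lookup.
fn build_byte_table() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut n = 0u32;
    for b in 0u32..256 {
        // GPT-2's "printable" ranges: '!'..='~', 0xA1..=0xAC, 0xAE..=0xFF.
        let printable = (b'!' as u32..=b'~' as u32).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        table[b as usize] = if printable {
            char::from_u32(b).unwrap()
        } else {
            // Non-printable bytes get successive code points starting at 256.
            let c = char::from_u32(256 + n).unwrap();
            n += 1;
            c
        };
    }
    table
}
```

Since the table is only 256 entries of 4 bytes, it fits comfortably in L1 cache, which is why an array beats even a fast hash map here.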