Track B (perf): memoize single-token detokenization on the generation hot path #59
Merged
…hot path

The batch scheduler called tokenizer.decode([next_id]) once per generated token (three hot-path sites). That decode is deterministic per token id, so common ids (frequent words, punctuation, whitespace) were re-detokenized constantly, across a response and across concurrent requests.

Add TokenDecodeCache: a bounded id->text LUT (default 65,536 entries) that decodes each distinct id once. It is strictly behaviour-preserving: each id still maps to exactly the string an isolated decode([id]) produces, the same call the hot path already made, so output is identical; only redundant work is removed. Uses `is not None` so an id decoding to "" is still cached.

Wired into BatchScheduler (the continuous-generation hot path); the scattered server.py decode sites can adopt it as a follow-up.

- New module squish/serving/token_decode_cache.py (+6 unit tests).
- Bumped the three module-census assertions 103 -> 104.

Perf gain is realized on the Apple-Silicon generation path; correctness is fully covered here. CI-mode suite: 4020 passed, 127 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01W8bTep4nw7ybFHhx7QjzMv
Summary
First item of Track B (perf quick wins from the PR #57 roadmap). The batch scheduler decoded each generated token individually (`tokenizer.decode([next_id])`) at three hot-path sites. That call is deterministic per token id, so frequent ids (common words, punctuation, whitespace) were re-detokenized on every occurrence, both within a response and across concurrent requests.

This adds `TokenDecodeCache`, a bounded `id → text` LUT that decodes each distinct id once.

Why it's safe
Strictly behaviour-preserving. Each id still maps to exactly the string an isolated `tokenizer.decode([id])` produces, which is the same call the hot path already made, so generated output is byte-identical; only the redundant repeat work is removed. Uses `is not None` (not truthiness) so an id that legitimately decodes to `""` is still cached rather than re-decoded.

Changes
- `squish/serving/token_decode_cache.py`: `TokenDecodeCache(tokenizer, max_entries=65_536)` with `decode()`, `clear()`, `size`; bounded so a pathological stream of distinct ids can't grow it without limit.
- `scheduler.py`: construct one cache per scheduler; route the three per-token decode sites through it.
- Unit tests (`tests/serving/test_token_decode_cache.py`): equivalence with direct decode, memoization (one decode per id), empty-string caching, bounded size, `clear()`, validation.

Validation
`CI=1` full suite: 4020 passed, 127 skipped, `ruff check` clean.

Scope note
Wired into `BatchScheduler` (the continuous-generation hot path) only. The scattered `server.py` single-token decode sites can adopt the same cache as a follow-up; kept out of this PR to keep the change small and reviewable.

🤖 Generated with Claude Code
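For readers without the diff open, the cache described under Changes might look roughly like the sketch below. It is not the merged code. The constructor signature, `decode()`, `clear()`, `size`, and the `is not None` check come from the description above; the private attribute names, the `ValueError` on a non-positive bound, and the choice to stop inserting once the table is full (rather than evict) are assumptions made for illustration. `CountingTokenizer` is a hypothetical stand-in for a real tokenizer.

```python
from typing import Dict, Protocol, Sequence


class _Tokenizer(Protocol):
    """Minimal tokenizer interface the cache relies on."""

    def decode(self, ids: Sequence[int]) -> str: ...


class TokenDecodeCache:
    """Bounded id -> text lookup table for single-token decodes (sketch)."""

    def __init__(self, tokenizer: _Tokenizer, max_entries: int = 65_536) -> None:
        # Assumption: the real module may validate its arguments differently.
        if max_entries < 1:
            raise ValueError("max_entries must be >= 1")
        self._tokenizer = tokenizer
        self._max_entries = max_entries
        self._table: Dict[int, str] = {}

    def decode(self, token_id: int) -> str:
        cached = self._table.get(token_id)
        # `is not None`, not truthiness: "" is a valid cached value.
        if cached is not None:
            return cached
        text = self._tokenizer.decode([token_id])
        # Assumption: once full, decode without storing (no eviction).
        if len(self._table) < self._max_entries:
            self._table[token_id] = text
        return text

    def clear(self) -> None:
        self._table.clear()

    @property
    def size(self) -> int:
        return len(self._table)


class CountingTokenizer:
    """Hypothetical stand-in tokenizer that counts decode() calls."""

    def __init__(self, vocab: Dict[int, str]) -> None:
        self.vocab = vocab
        self.calls = 0

    def decode(self, ids: Sequence[int]) -> str:
        self.calls += 1
        return "".join(self.vocab[i] for i in ids)


if __name__ == "__main__":
    tok = CountingTokenizer({0: "", 1: "the", 2: " "})
    cache = TokenDecodeCache(tok)
    pieces = [cache.decode(i) for i in (1, 2, 1, 0, 0)]
    # Five lookups, but the tokenizer is only called once per distinct id.
    print(pieces, "tokenizer calls:", tok.calls, "cached ids:", cache.size)
```

With a truthiness check (`if cached:`), the id that decodes to `""` would miss on every lookup and be re-decoded each time, which is the case the empty-string test guards against.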