
Track B (perf): memoize single-token detokenization on the generation hot path #59

Merged
wesleyscholl merged 1 commit into main from claude/track-b-perf
Jun 18, 2026
@konjoinfinity

Collaborator

Summary

First item of Track B — perf quick wins from the PR #57 roadmap. The batch scheduler decoded each generated token individually (tokenizer.decode([next_id])) at three hot-path sites. That call is deterministic per token id, so frequent ids (common words, punctuation, whitespace) were re-detokenized on every occurrence — within a response and across concurrent requests.

This adds TokenDecodeCache, a bounded id → text LUT that decodes each distinct id once.

Why it's safe

Strictly behaviour-preserving. Each id still maps to exactly the string an isolated `tokenizer.decode([id])` produces (the same call the hot path already made), so generated output is byte-identical; only the redundant repeat work is removed. Uses `is not None` (not truthiness) so an id that legitimately decodes to `""` is still cached rather than re-decoded.
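The memoization described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the class name `TokenDecodeCacheSketch`, the `CountingTokenizer` stand-in, and in particular the behaviour once the bound is reached (here, new ids are simply decoded without being stored) are assumptions made for the example. Only the `decode()` / `clear()` / `size` surface, the `max_entries=65_536` default and the `is not None` check come from the description.

```python
from typing import Dict, List, Protocol


class Tokenizer(Protocol):
    """Anything with a decode(list_of_ids) -> str method."""

    def decode(self, ids: List[int]) -> str: ...


class TokenDecodeCacheSketch:
    """Illustrative bounded id -> text memo (hypothetical, not the PR's code)."""

    def __init__(self, tokenizer: Tokenizer, max_entries: int = 65_536) -> None:
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._tokenizer = tokenizer
        self._max_entries = max_entries
        self._cache: Dict[int, str] = {}

    def decode(self, token_id: int) -> str:
        cached = self._cache.get(token_id)
        # `is not None` rather than truthiness: an id that decodes to ""
        # is a valid cached value and must not trigger a re-decode.
        if cached is not None:
            return cached
        # Same call the hot path made before, so the text is identical.
        text = self._tokenizer.decode([token_id])
        # Assumed policy: once full, stop admitting new ids (no eviction).
        if len(self._cache) < self._max_entries:
            self._cache[token_id] = text
        return text

    def clear(self) -> None:
        self._cache.clear()

    @property
    def size(self) -> int:
        return len(self._cache)


class CountingTokenizer:
    """Hypothetical stand-in tokenizer that counts decode() calls."""

    def __init__(self, vocab: Dict[int, str]) -> None:
        self.vocab = vocab
        self.calls = 0

    def decode(self, ids: List[int]) -> str:
        self.calls += 1
        return "".join(self.vocab[i] for i in ids)
```

With a truthiness check (`if cached:`), the id mapped to `""` would miss the cache on every lookup and be decoded again each time, which is exactly the repeat work the cache is meant to remove.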

Changes

  • New squish/serving/token_decode_cache.py: TokenDecodeCache(tokenizer, max_entries=65_536) with decode(), clear(), size; bounded so a pathological stream of distinct ids can't grow it without limit.
  • scheduler.py — construct one cache per scheduler; route the three per-token decode sites through it.
  • +6 unit tests (tests/serving/test_token_decode_cache.py): equivalence with direct decode, memoization (one decode per id), empty-string caching, bounded size, clear(), validation.
  • Bumped the three module-census assertions 103 → 104 for the new module.

Validation

  • CI=1 full suite: 4020 passed, 127 skipped, ruff check clean.
  • The perf gain itself is realized on the Apple-Silicon generation path and isn't measured by CI. Correctness is fully covered here; measuring the throughput impact needs a benchmark on a real device (noted for follow-up).

Scope note

Wired into BatchScheduler (the continuous-generation hot path) only. The scattered server.py single-token decode sites can adopt the same cache as a follow-up; kept out of this PR to keep the change small and reviewable.

🤖 Generated with Claude Code



…hot path

The batch scheduler called tokenizer.decode([next_id]) once per generated token
(three hot-path sites). That decode is deterministic per token id, so common ids
— frequent words, punctuation, whitespace — were re-detokenized constantly,
across a response and across concurrent requests.

Add TokenDecodeCache: a bounded id->text LUT (default 65,536 entries) that
decodes each distinct id once. It is strictly behaviour-preserving — each id
still maps to exactly the string an isolated decode([id]) produces, the same
call the hot path already made — so output is identical; only redundant work is
removed. Uses `is not None` so an id decoding to "" is still cached.

Wired into BatchScheduler (the continuous-generation hot path); the scattered
server.py decode sites can adopt it as a follow-up.

- New module squish/serving/token_decode_cache.py (+6 unit tests).
- Bumped the three module-census assertions 103 -> 104.

Perf gain is realized on the Apple-Silicon generation path; correctness is
fully covered here. CI-mode suite: 4020 passed, 127 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01W8bTep4nw7ybFHhx7QjzMv
@wesleyscholl wesleyscholl marked this pull request as ready for review June 18, 2026 00:00
@wesleyscholl wesleyscholl merged commit 5fd3f76 into main Jun 18, 2026
16 checks passed
@wesleyscholl wesleyscholl deleted the claude/track-b-perf branch June 18, 2026 00:01

3 participants