Track B (perf): memoize single-token detokenization on the generation hot path by konjoinfinity · Pull Request #59 · konjoai/squish

konjoinfinity · 2026-06-17T23:39:48Z

Summary

First item of Track B — perf quick wins from the PR #57 roadmap. The batch scheduler decoded each generated token individually (tokenizer.decode([next_id])) at three hot-path sites. That call is deterministic per token id, so frequent ids (common words, punctuation, whitespace) were re-detokenized on every occurrence — within a response and across concurrent requests.

This adds TokenDecodeCache, a bounded id → text LUT that decodes each distinct id once.

Why it's safe

Strictly behaviour-preserving. Each id still maps to exactly the string an isolated tokenizer.decode([id]) produces (the same call the hot path already made), so generated output is byte-identical; only the redundant repeat work is removed. The cache-hit check uses is not None rather than truthiness, so an id that legitimately decodes to "" is still cached rather than re-decoded.
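To illustrate the empty-string point, here is a minimal sketch comparing a truthiness check with an is not None check. CountingTokenizer, decode_truthy and decode_is_not_none are hypothetical names made up for this example and are not taken from the PR; the plain dict stands in for the cache's lookup table.

```python
class CountingTokenizer:
    """Hypothetical stand-in tokenizer that counts every decode call."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.calls = 0

    def decode(self, ids):
        self.calls += 1
        return "".join(self.vocab[i] for i in ids)


def decode_truthy(cache, tokenizer, token_id):
    # Buggy variant: "" is falsy, so a cached empty string looks like a miss.
    text = cache.get(token_id)
    if text:
        return text
    text = tokenizer.decode([token_id])
    cache[token_id] = text
    return text


def decode_is_not_none(cache, tokenizer, token_id):
    # Variant described above: only a missing entry (None) counts as a miss.
    text = cache.get(token_id)
    if text is not None:
        return text
    text = tokenizer.decode([token_id])
    cache[token_id] = text
    return text


vocab = {0: "", 1: "hello"}  # id 0 decodes to the empty string

tok_a, cache_a = CountingTokenizer(vocab), {}
for _ in range(3):
    decode_truthy(cache_a, tok_a, 0)

tok_b, cache_b = CountingTokenizer(vocab), {}
for _ in range(3):
    decode_is_not_none(cache_b, tok_b, 0)

print(tok_a.calls, tok_b.calls)  # prints: 3 1
```

Both variants return the same text; the truthiness check only costs the repeated decode work for ids whose text is empty.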

Changes

- New squish/serving/token_decode_cache.py: TokenDecodeCache(tokenizer, max_entries=65_536) with decode(), clear(), size; bounded so a pathological stream of distinct ids can't grow it without limit.
- scheduler.py: construct one cache per scheduler; route the three per-token decode sites through it.
- +6 unit tests (tests/serving/test_token_decode_cache.py): equivalence with direct decode, memoization (one decode per id), empty-string caching, bounded size, clear(), validation.
- Bumped the three module-census assertions 103 → 104 for the new module.
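For readers who want a picture of the shape described in the list above, the following is a rough sketch of what such a cache might look like; it is not the code from squish/serving/token_decode_cache.py. The constructor signature and the decode(), clear() and size names come from the PR description. Everything else is an assumption: that size is a property, that validation means rejecting max_entries < 1, and that a full cache simply stops storing new ids (the PR does not state the policy used when the bound is reached). StubTokenizer is a hypothetical test double.

```python
from typing import Dict


class TokenDecodeCache:
    """Sketch of a bounded id -> text lookup table (assumed behaviour)."""

    def __init__(self, tokenizer, max_entries: int = 65_536) -> None:
        # Assumption: "validation" means rejecting a non-positive bound.
        if max_entries < 1:
            raise ValueError("max_entries must be >= 1")
        self._tokenizer = tokenizer
        self._max_entries = max_entries
        self._table: Dict[int, str] = {}

    def decode(self, token_id: int) -> str:
        text = self._table.get(token_id)
        if text is not None:  # "" is a valid cached value
            return text
        text = self._tokenizer.decode([token_id])
        # Assumption: once full, new ids are decoded but not stored.
        if len(self._table) < self._max_entries:
            self._table[token_id] = text
        return text

    def clear(self) -> None:
        self._table.clear()

    @property
    def size(self) -> int:
        return len(self._table)


class StubTokenizer:
    """Hypothetical tokenizer double that counts decode calls."""

    def __init__(self) -> None:
        self.calls = 0

    def decode(self, ids):
        self.calls += 1
        return "".join(f"<{i}>" for i in ids)


tok = StubTokenizer()
cache = TokenDecodeCache(tok, max_entries=2)
out = [cache.decode(i) for i in (7, 7, 8, 9, 9)]
print(out, tok.calls, cache.size)
```

With max_entries=2 in this sketch, ids 7 and 8 are stored and id 9 is decoded on every occurrence, so the table never grows past the bound. An eviction policy such as LRU would be an alternative design; the sketch uses the simplest option that satisfies "bounded".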

Validation

CI=1 full suite: 4020 passed, 127 skipped, ruff check clean.
The perf gain itself is realized on the Apple-Silicon generation path and isn't measured by CI — correctness is fully covered here; throughput impact wants a real device benchmark (noted for follow-up).
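As a starting point for that follow-up, a micro-benchmark could be shaped roughly like the sketch below. It uses a hypothetical SlowStubTokenizer with an artificial per-call cost, so the numbers it prints say nothing about a real tokenizer or about Apple-Silicon throughput; a real measurement would swap in the actual tokenizer and a recorded stream of token ids.

```python
import time


class SlowStubTokenizer:
    """Hypothetical tokenizer with an artificial per-call cost."""

    def __init__(self, vocab):
        self._vocab = vocab

    def decode(self, ids):
        acc = 0
        for i in range(200):  # burn a little CPU to mimic decode overhead
            acc += i
        return "".join(self._vocab[i] for i in ids)


def decode_direct(tokenizer, ids):
    # One tokenizer call per generated token, as on the old hot path.
    return [tokenizer.decode([i]) for i in ids]


def decode_memoized(tokenizer, ids):
    # One tokenizer call per distinct id; repeats are dict lookups.
    table = {}
    out = []
    for i in ids:
        text = table.get(i)
        if text is None:
            text = tokenizer.decode([i])
            table[i] = text
        out.append(text)
    return out


def best_of(fn, *args, repeats=5):
    # Report the fastest of several runs to reduce timing noise.
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


vocab = {i: f"tok{i} " for i in range(50)}
ids = [i % 50 for i in range(20_000)]  # heavily repeated ids
tok = SlowStubTokenizer(vocab)

t_direct, out_direct = best_of(decode_direct, tok, ids)
t_memo, out_memo = best_of(decode_memoized, tok, ids)

assert out_direct == out_memo  # memoization must not change the output
print(f"direct:   {t_direct * 1e3:.2f} ms")
print(f"memoized: {t_memo * 1e3:.2f} ms")
```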

Scope note

Wired into BatchScheduler (the continuous-generation hot path) only. The scattered server.py single-token decode sites can adopt the same cache as a follow-up; kept out of this PR to keep the change small and reviewable.

🤖 Generated with Claude Code


…hot path

The batch scheduler called tokenizer.decode([next_id]) once per generated token (three hot-path sites). That decode is deterministic per token id, so common ids (frequent words, punctuation, whitespace) were re-detokenized constantly, across a response and across concurrent requests.

Add TokenDecodeCache: a bounded id->text LUT (default 65,536 entries) that decodes each distinct id once. It is strictly behaviour-preserving: each id still maps to exactly the string an isolated decode([id]) produces, the same call the hot path already made, so output is identical; only redundant work is removed. Uses `is not None` so an id decoding to "" is still cached.

Wired into BatchScheduler (the continuous-generation hot path); the scattered server.py decode sites can adopt it as a follow-up.

- New module squish/serving/token_decode_cache.py (+6 unit tests).
- Bumped the three module-census assertions 103 -> 104.

Perf gain is realized on the Apple-Silicon generation path; correctness is fully covered here. CI-mode suite: 4020 passed, 127 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01W8bTep4nw7ybFHhx7QjzMv

wesleyscholl marked this pull request as ready for review June 18, 2026 00:00

wesleyscholl merged commit 5fd3f76 into main Jun 18, 2026
16 checks passed

wesleyscholl deleted the claude/track-b-perf branch June 18, 2026 00:01

wesleyscholl merged 1 commit into main from claude/track-b-perf
