neluca/tinybpe

TinyBPE


An ultra-fast, lightweight BPE tokenizer and trainer with a pure-C core.

Ever wished you could load a GPT-4-compatible tokenizer in one line, with no network calls? TinyBPE ships 8 pre-built ByteLevel BPE models directly in the package. The C core, built as a CPython extension, runs BPE encoding and decoding at native speed, typically 10-50× faster than pure-Python implementations, and its only runtime dependency is regex.

Why TinyBPE?

| Feature | TinyBPE | tiktoken | HuggingFace tokenizers |
| --- | --- | --- | --- |
| Core engine | Pure C (CPython) | Pure Rust (PyO3) | Pure Rust (PyO3) |
| Dependencies | regex only | tiktoken + Rust toolchain | tokenizers + Rust toolchain |
| Built-in models | 8 models ship in package | Downloads on first use | Downloads on first use |
| Offline ready | ✅ Fully offline | ❌ Requires download | ❌ Requires download |
| Model format | Human-readable .tbm text | Binary blob | JSON / binary |
| One-liner load | Tokenizer.from_pretrained("cl100k_base") | tiktoken.get_encoding("cl100k_base") | AutoTokenizer.from_pretrained(...) |
| Train new models | ✅ Pure-C trainer | | ✅ (requires Rust build) |
| Streaming decode | ✅ UTF-8 boundary caching | | |
| Portable C core | ✅ Embeddable | | |
| Install size | ~3 MB compressed | ~2 MB + cached models | ~4 MB + cached models |

Installation

pip install tinybpe

Optional extras:

pip install "tinybpe[dev]"       # Development tools (pytest, ruff, mypy)
pip install "tinybpe[tiktoken]"  # For tiktoken comparison testing
pip install "tinybpe[hf]"        # For HuggingFace model conversion
pip install "tinybpe[all]"       # Everything

Quick Start

One-Line Model Loading

from tinybpe import Tokenizer

# Load any built-in model in one line — no network, no download
tok = Tokenizer.from_pretrained("cl100k_base")

ids = tok.encode("hello world")
tok.decode(ids)  # → 'hello world'

List Available Models

import tinybpe

tinybpe.list_models()
# ['cl100k_base', 'deepseek-v4', 'llama4', 'minicpm5', 'o200k_base',
#  'p50k_base', 'qwen35', 'r50k_base']

Built-in Model Catalog

| Model | LLM Compatibility | Vocab Size |
| --- | --- | --- |
| cl100k_base | GPT-4, GPT-3.5-turbo, text-embedding-ada-002 | 100,256 |
| o200k_base | GPT-4o, GPT-4o-mini, GPT-5 | 199,998 |
| p50k_base | GPT-3 (davinci, curie, babbage, ada) | 50,280 |
| r50k_base | GPT-2 | 50,256 |
| qwen35 | Qwen3.5 (0.8B-35B) | 247,843 |
| deepseek-v4 | DeepSeek-V4 Flash | 127,997 |
| llama4 | Llama 4 Scout (17B) | 440,058 |
| minicpm5 | MiniCPM5-1B (ByteLevel BPE) | 130,050 |

Training

from tinybpe import Trainer

trainer = Trainer("hello world " * 500)
trainer.train(100)          # learn 100 merges
trainer.save("my_model")    # → my_model.tbm
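
For readers new to BPE, the training loop can be sketched in pure Python. This is a minimal illustration, not TinyBPE's C implementation: the function name `train_bpe`, the most-frequent-pair rule, and the first-seen tie-break are all assumptions made for the example.

```python
from collections import Counter


def train_bpe(text: str, n_merges: int) -> list[tuple[int, int]]:
    """Learn up to n_merges byte-pair merges from text (illustrative only)."""
    ids = list(text.encode("utf-8"))  # start from raw bytes, ids 0..255
    merges: list[tuple[int, int]] = []
    for step in range(n_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        # Assumption: the most frequent pair wins, ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        new_id = 256 + step  # new tokens are numbered after the 256 bytes
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
        merges.append(best)
    return merges


toy_merges = train_bpe("aaabdaaabac", 3)
```

Each pass recounts every pair, which is far slower than a tree-backed pair index, but the merge list it returns has the same shape as the `merges` property: a list of `(left, right)` id pairs.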

Streaming Decode

parts = []
decoder = tok.stream_decode(parts.append)  # callback receives each decoded piece
for token_id in ids:
    decoder(token_id)
assert "".join(parts) == "hello world"
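
Why a streaming decoder needs to cache bytes: a single token can end in the middle of a multi-byte UTF-8 character, so its bytes cannot always be turned into text immediately. The sketch below shows the idea with the standard library's incremental decoder. The helper `make_stream_decoder` and the three-token vocabulary are hypothetical, and TinyBPE's own caching may differ in detail.

```python
import codecs
from typing import Callable


def make_stream_decoder(
    token_bytes: dict[int, bytes], callback: Callable[[str], None]
) -> Callable[[int], None]:
    # The incremental decoder holds back an incomplete UTF-8 sequence
    # until the bytes that finish it arrive.
    utf8 = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def feed(token_id: int) -> None:
        text = utf8.decode(token_bytes[token_id])
        if text:  # emit nothing while a character is still incomplete
            callback(text)

    return feed


# Hypothetical vocabulary that splits "é" (bytes 0xC3 0xA9) across two tokens.
toy_vocab = {0: b"caf", 1: b"\xc3", 2: b"\xa9"}
pieces: list[str] = []
feed = make_stream_decoder(toy_vocab, pieces.append)
for tid in (0, 1, 2):
    feed(tid)
```

The callback fires twice here, not three times: token 1 alone is half a character, so it is held until token 2 completes it.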

With Regex Pre-tokenization

PAT = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

tok = Tokenizer.from_file("my_model.tbm", pat_str=PAT)
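
To see what pre-tokenization does, here is a simplified stand-in that runs on the standard-library `re` module. It is not the pattern above (which needs the `regex` package for `\p{L}` and related classes) and it skips several of that pattern's cases; `SIMPLE_PAT` and `pre_tokenize` are names invented for this sketch.

```python
import re

# Simplified, stdlib-only approximation of a GPT-style split pattern.
SIMPLE_PAT = re.compile(
    r"'(?:[sdmt]|ll|ve|re)"  # common English contractions
    r"| ?[^\W\d_]+"          # optional leading space + letters
    r"| ?\d+"                # optional leading space + digits
    r"| ?[^\s\w]+"           # optional leading space + punctuation
    r"|\s+(?!\S)|\s+"        # whitespace runs
)


def pre_tokenize(text: str) -> list[str]:
    """Split text into chunks; BPE merges never cross a chunk boundary."""
    return SIMPLE_PAT.findall(text)


chunks = pre_tokenize("Hello world, it's 2024!")
```

The chunks concatenate back to the input, so splitting loses nothing. It only limits which bytes may be merged together.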

With Special Tokens

# merges: a list of merge pairs, e.g. from load_model() or Trainer.merges
special_tokens = {"<eot>": 1000, "<fim_prefix>": 1001, "<fim_suffix>": 1002}
tok = Tokenizer(merges, special_tokens=special_tokens)
ids = tok.encode("<fim_prefix> hello world <eot>")
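
One common way to handle special tokens is to cut them out of the text before any BPE merging, so each maps straight to its reserved id. The sketch below illustrates that approach; `split_on_special` is a hypothetical helper, and whether TinyBPE works this way internally is an assumption.

```python
import re
from typing import Union


def split_on_special(
    text: str, special_tokens: dict[str, int]
) -> list[Union[str, int]]:
    """Return ordinary text chunks as str and special tokens as their ids."""
    if not special_tokens:
        return [text] if text else []
    # Longest first so a token that is a prefix of another cannot shadow it.
    names = sorted(special_tokens, key=len, reverse=True)
    splitter = re.compile("(" + "|".join(re.escape(n) for n in names) + ")")
    out: list[Union[str, int]] = []
    for part in splitter.split(text):
        if part in special_tokens:
            out.append(special_tokens[part])
        elif part:  # drop the empty strings re.split leaves at the edges
            out.append(part)
    return out


special = {"<eot>": 1000, "<fim_prefix>": 1001, "<fim_suffix>": 1002}
segments = split_on_special("<fim_prefix> hello world <eot>", special)
```

Only the `str` segments would then go through ordinary encoding, which is also the natural difference between `encode` and `encode_ordinary`.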

With Byte Remapping (TikToken Compat)

from tinybpe import Tokenizer, load_model

merges, bytes_maps = load_model("cl100k_base.tbm")
tok = Tokenizer(merges, bytes_maps=bytes_maps)
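
A byte remap exists because tiktoken-style vocabularies do not number their 256 single-byte tokens in byte order. The sketch below shows how such a table could be applied and inverted. The direction (index = raw byte, value = token id) is an assumption, and both helper names and the toy permutation are hypothetical.

```python
def apply_byte_map(data: bytes, bytes_maps: list[int]) -> list[int]:
    """Map each raw byte to its base token id (assumed direction: byte -> id)."""
    return [bytes_maps[b] for b in data]


def invert_byte_map(bytes_maps: list[int]) -> list[int]:
    """Build the id -> byte table needed when decoding."""
    inverse = [0] * 256
    for byte, token_id in enumerate(bytes_maps):
        inverse[token_id] = byte
    return inverse


# Hypothetical permutation: rotate every byte value by one position.
toy_map = [(b + 1) % 256 for b in range(256)]
base_ids = apply_byte_map(b"hi", toy_map)
restored = bytes(invert_byte_map(toy_map)[i] for i in base_ids)
```

Because the table is a permutation of 0..255, inverting it is always possible and the round trip is lossless.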

API Reference

Tokenizer

class Tokenizer:
    def __init__(self, merges, *, bytes_maps=None, pat_str=None, special_tokens=None)
    def encode(self, text: str) -> list[int]
    def encode_ordinary(self, text: str) -> list[int]
    def count_tokens(self, text: str) -> int
    def decode(self, ids: list[int]) -> str
    def stream_decode(self, callback: Callable[[str], None]) -> Callable[[int], None]
    def stream_decode_reset(self) -> None
    def save(self, path: str) -> None
    def save_vocab(self, path: str) -> None

    @classmethod
    def from_file(cls, path: str, *, pat_str=None, special_tokens=None) -> Tokenizer
    @classmethod
    def from_pretrained(cls, name: str) -> Tokenizer

    @property
    def merges(self) -> list[tuple[int, int]]
    @property
    def vocab(self) -> dict[int, bytes]
    @property
    def n_vocab(self) -> int
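
The `merges` and `vocab` properties are two views of the same model. As a rough sketch, assuming no byte remap, no special tokens, and that merge number k creates token id 256 + k, a vocabulary can be rebuilt from the merge list like this (`build_vocab` is a name invented for the example):

```python
def build_vocab(merges: list[tuple[int, int]]) -> dict[int, bytes]:
    """Expand a merge list into an id -> bytes table."""
    vocab = {i: bytes([i]) for i in range(256)}  # ids 0..255 are raw bytes
    for offset, (left, right) in enumerate(merges):
        # A merged token's bytes are the concatenation of its two parts.
        vocab[256 + offset] = vocab[left] + vocab[right]
    return vocab


toy_table = build_vocab([(104, 101), (256, 108)])
```

Decoding under these assumptions is then just a lookup and a join: `b"".join(toy_table[i] for i in ids).decode("utf-8")`.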

Trainer

class Trainer(bpe.Trainer):
    def __init__(self, text, *, preprocess=None, callback=None)
    def step(self) -> tuple | None
    def train(self, n: int) -> int
    def save(self, path: str) -> None

    @property
    def merges(self) -> list[tuple[int, int]]
    @property
    def n_merges(self) -> int

Model Discovery

def list_models() -> list[str]
def get_model_info(name: str) -> dict  # returns vocab_size, family, description, pat_str, special_tokens, has_byte_remap

File I/O

def load_model(path: str) -> tuple[list[tuple[int, int]], list[int] | None]
def save_model(path: str, merges, bytes_maps=None) -> None
def load_vocab(path: str) -> dict[int, bytes]
def save_vocab(path: str, vocab: dict[int, bytes]) -> None

Model Format

.tbm (TinyBPE Model) is a human-readable text file:

TinyBPE Model v1
0               # 0 = no remap, 256 = has remap
104 101         # merge pairs, one per line
256 108
...

See docs/file-formats.md for the full specification.
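
The layout shown above is simple enough to read by hand. The sketch below parses only the no-remap form, and it assumes that `#` starts a comment and that the header line must match exactly; neither is confirmed here, so treat docs/file-formats.md as the authority. `parse_tbm` is a hypothetical helper, not part of the package; use `load_model` for real files.

```python
def parse_tbm(text: str) -> list[tuple[int, int]]:
    """Parse the no-remap form of the layout shown above into merge pairs."""
    lines = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if line:
            lines.append(line)
    if len(lines) < 2 or lines[0] != "TinyBPE Model v1":
        raise ValueError("missing 'TinyBPE Model v1' header")
    if int(lines[1]) != 0:
        # The remap table layout is not shown here, so this sketch stops.
        raise NotImplementedError("byte-remap models are out of scope")
    merges = []
    for line in lines[2:]:
        left, right = line.split()
        merges.append((int(left), int(right)))
    return merges


sample = """TinyBPE Model v1
0               # 0 = no remap, 256 = has remap
104 101         # merge pairs, one per line
256 108
"""
parsed = parse_tbm(sample)
```

Read with byte values in mind, the two pairs say: merge `h` (104) and `e` (101), then merge that new token (256) with `l` (108).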

Conversion Scripts

Convert existing tokenizers to TinyBPE format:

# TikToken
python scripts/convert_tiktoken.py cl100k_base -o models/cl100k_base.tbm

# HuggingFace
python scripts/convert_hf_tokenizer.py tokenizer.json -o output.tbm
python scripts/convert_hf_tokenizer.py Qwen/Qwen3.5-0.8B -o models/qwen35.tbm

See scripts/README.md for details.

Performance

The C core uses an AVL tree for O(log n) pair lookup during training and greedy lowest-rank-first merging during encoding. Typical throughput on a modern CPU:

| Operation | Throughput |
| --- | --- |
| Training (C core) | ~5-10M chars/sec |
| Encoding (C core) | ~2-5M tokens/sec |
| Decoding (C core) | ~10-20M tokens/sec |
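
"Greedy lowest-rank-first merging" means the encoder repeatedly applies whichever adjacent pair was learned earliest. A pure-Python sketch of that rule follows. It assumes no pre-tokenization, no byte remap, and no special tokens, and it is quadratic where the C core is not; `bpe_encode` is a name made up for the example.

```python
def bpe_encode(data: bytes, merges: list[tuple[int, int]]) -> list[int]:
    """Encode bytes by always applying the lowest-ranked available merge."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    ids = list(data)
    while len(ids) >= 2:
        # Pick the adjacent pair that was learned earliest (lowest rank).
        best = min(zip(ids, ids[1:]), key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no adjacent pair is mergeable any more
        new_id = 256 + ranks[best]
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


toy_rules = [(104, 101), (256, 108), (108, 111)]  # "he", "hel", "lo"
encoded = bpe_encode(b"hello", toy_rules)
```

On `b"hello"` the merges fire in rank order: `he`, then `hel`, then `lo`, leaving two tokens.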

Run benchmarks locally:

python benchmarks/bench_train.py
python benchmarks/bench_encode.py
python benchmarks/bench_decode.py

Development

git clone https://github.com/neluca/tinybpe.git
cd tinybpe
pip install -e ".[dev]"
make test && make lint && make typecheck

See CONTRIBUTING.md for full development setup and PR guidelines.

License

MIT — see LICENSE.
