Releases · neluca/tinybpe

Highlights

8 built-in ByteLevel BPE models — GPT-4, GPT-4o, GPT-3, GPT-2, Qwen3.5, DeepSeek-V4, Llama 4 Scout, MiniCPM5-1B
One-line model loading — Tokenizer.from_pretrained("cl100k_base"), no network required
JSON-based model registry — add new models by editing models.json, no code changes
96% test coverage — 203 tests, mypy strict, ruff clean
Professional packaging — PyPI-ready with pre-built wheels, CONTRIBUTING/SECURITY/CODE_OF_CONDUCT

New Features

Model Registry & One-Line Loading (`#41e7685`)

from tinybpe import Tokenizer, list_models

list_models()
# ['cl100k_base', 'deepseek-v4', 'llama4', 'minicpm5', 'o200k_base', 'p50k_base', 'qwen35', 'r50k_base']

tok = Tokenizer.from_pretrained("cl100k_base")
ids = tok.encode("hello world")

JSON-Based Model Configuration (`#3c44355`)

All model metadata (name, vocab size, regex pattern, special tokens) lives in tinybpe/models/models.json. Adding a new model:

Drop the .tbm file into tinybpe/models/
Add a JSON entry to models.json

Special Token Support

TikToken models now support special tokens out of the box:

tok = Tokenizer.from_pretrained("cl100k_base")
ids = tok.encode("<|endoftext|> hello <|fim_prefix|>")
# → [100257, 24748, 220, 100258]

Bug Fixes

Special token regex ordering — tokens sorted by length descending so "<ab>" matches before "<a>" (#866502b)
mypy type errors — _find_package_file type inconsistency fixed
No-op test — test_empty_text replaced with actual assertions
Dead code — removed unused variable assignment in convert_minicpm.py
UTF-8 continuation byte check — convert_hf_tokenizer.py now validates bytes 0x80-0xBF
Docstring mismatches — fixed encode_ordinary description and vocab format docs
Duplicate regex patterns — _PAT_GPT2 and _PAT_HF_BYTELEVEL consolidated into _PAT_BYTELEVEL
Author name — updated to Romani Isa

Developer Experience

Makefile — make install, make test, make lint, make format, make typecheck, make clean
pre-commit hooks — ruff + ruff-format + mypy
Optional dependencies — pip install tinybpe[dev|tiktoken|hf|all]
Community files — CONTRIBUTING.md, SECURITY.md, CODE_OF_CONDUCT.md, issue templates, PR template
README_zh.md — Chinese translation of the full README

Technical Details

Ruff: zero lint errors, full formatting compliance
mypy: strict mode, zero errors across 7 source files
Test suite: 203 tests, 96% branch coverage
Wheel size: ~6 MB (includes all 8 models + models.json)
Python: 3.9–3.13 supported, 3.14 classified
Platforms: Linux (manylinux2014, x86_64 + aarch64), macOS, Windows

What's Changed (since v1.0.0)

28 commits
29 files changed (new), 17 files modified
~380K lines of model data added
Author changed to Romani Isa

Installation

pip install tinybpe

Quick Start

from tinybpe import Tokenizer

# Load any built-in model — no network
tok = Tokenizer.from_pretrained("cl100k_base")
ids = tok.encode("hello world")
print(tok.decode(ids))  # → 'hello world'

Full Changelog: https://github.com/neluca/tinybpe/commits/v1.1.0

🌟 Features

The core is meticulously designed and implemented in C , using an AVL-Tree as the index for fast and efficient performance.

Used as a Python module with a simple and elegant API.

Supports training BPE models and continuing training on imported models to expand the vocabulary.

Implements a general byte-level tokenizer, supporting fast encoding and decoding,as well asstreaming decoding.

Supports regular expression pre-tokenization and adding special Tokens.

Supports converting model parameters from tiktoken.

Highly customizable, easy to integrate and extend, and the core is zero dependencies.

Refine the content of the document.

🌟 Features

The core is meticulously designed and implemented in C , using an AVL-Tree as the index for fast and efficient performance.

Used as a Python module with a simple and elegant API.

Supports training BPE models and continuing training on imported models to expand the vocabulary.

Implements a general byte-level tokenizer, supporting fast encoding and decoding,as well asstreaming decoding.

Supports regular expression pre-tokenization and adding special Tokens.

Supports converting model parameters from tiktoken.

Highly customizable, easy to integrate and extend, and the core is zero dependencies.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Highlights

New Features

Model Registry & One-Line Loading (`#41e7685`)

JSON-Based Model Configuration (`#3c44355`)

Special Token Support

Bug Fixes

Developer Experience

Technical Details

What's Changed (since v1.0.0)

Installation

Quick Start

Uh oh!

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Uh oh!

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

🌟 Features

Uh oh!

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

🌟 Features

Uh oh!

Releases: neluca/tinybpe

TinyBPE v1.1.0 Release

Highlights

New Features

Model Registry & One-Line Loading (#41e7685)

JSON-Based Model Configuration (#3c44355)

Special Token Support

Bug Fixes

Developer Experience

Technical Details

What's Changed (since v1.0.0)

Installation

Quick Start

Uh oh!

v0.1.2

Uh oh!

TinyBPE 0.1.1 Release

🌟 Features

Uh oh!

TinyBPE 0.1.0 Release

🌟 Features

Uh oh!

Model Registry & One-Line Loading (`#41e7685`)

JSON-Based Model Configuration (`#3c44355`)