Scripts for converting various tokenizer model formats to TinyBPE .tbm files.
Convert OpenAI tiktoken encodings to TinyBPE.
# Install tiktoken first
pip install tiktoken
# Convert a built-in encoding
python scripts/convert_tiktoken.py cl100k_base -o models/cl100k_base.tbm
python scripts/convert_tiktoken.py o200k_base -o models/o200k_base.tbm
python scripts/convert_tiktoken.py p50k_base -o models/p50k_base.tbm
python scripts/convert_tiktoken.py r50k_base -o models/r50k_base.tbm

Convert HuggingFace tokenizer.json files to TinyBPE.
# Install huggingface_hub first
pip install huggingface_hub
# Convert from a local tokenizer.json
python scripts/convert_hf_tokenizer.py path/to/tokenizer.json -o models/my_model.tbm
# Convert from a HuggingFace model ID
python scripts/convert_hf_tokenizer.py meta-llama/Meta-Llama-3-8B -o models/llama3.tbm

When adding a new conversion script:
- Follow the existing pattern: CLI with argparse, -o for output path
- Output TinyBPE .tbm models via tinybpe._model_io.save_model()
- Add documentation to this README
- Do NOT add conversion code to the tinybpe package itself; keep it in scripts/
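
The pattern above might look roughly like the skeleton below. This is only a sketch, not code taken from the existing scripts: `load_source_model` is a hypothetical placeholder for the format-specific parsing, and the argument order passed to `save_model()` is an assumption that should be checked against `tinybpe._model_io` before use.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """CLI in the style of the existing scripts: one positional source, -o for output."""
    parser = argparse.ArgumentParser(
        description="Convert a tokenizer model to a TinyBPE .tbm file."
    )
    parser.add_argument("source", help="encoding name, model ID, or path to convert")
    parser.add_argument(
        "-o", dest="output", required=True, type=Path, help="output .tbm path"
    )
    return parser


def load_source_model(source: str):
    """Hypothetical placeholder: parse the source format into whatever
    object save_model() expects. Each conversion script supplies its own."""
    raise NotImplementedError(f"no loader implemented for {source!r}")


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    model = load_source_model(args.source)

    # Imported lazily so that --help works even without tinybpe installed.
    from tinybpe._model_io import save_model

    args.output.parent.mkdir(parents=True, exist_ok=True)
    # ASSUMPTION: save_model takes (model, path); verify the real signature.
    save_model(model, args.output)
    return 0
```

A real script would finish with the usual `if __name__ == "__main__":` guard that calls `main()`. Keeping the parser in its own function makes the argument handling easy to check without running a conversion.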