A custom Byte Pair Encoding (BPE) tokenizer built from scratch on the WikiText-2 dataset. The tokenizer is trained using HuggingFace's tokenizers library, evaluated on validation and test splits, and saved in a HuggingFace-compatible format ready for downstream language modeling tasks.
- Overview
- Workflow
- Project Structure
- Key Features
- Dataset
- Tokenizer Configuration
- Evaluation Metrics
- Getting Started
- Notebook Walkthrough
- License
This project demonstrates how to build a production-ready BPE tokenizer entirely from scratch, covering data loading, cleaning, deduplication, tokenizer training, evaluation, and serialization. The tokenizer targets English text and is compatible with HuggingFace's PreTrainedTokenizerFast interface, making it a drop-in replacement for downstream NLP pipelines.
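For example, once the notebook has produced `custom_bpe_tokenizer/`, the tokenizer loads and behaves like any other HuggingFace fast tokenizer (a minimal sketch; the sample sentence is illustrative):

```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer saved by the notebook (directory is created at runtime)
tokenizer = PreTrainedTokenizerFast.from_pretrained("custom_bpe_tokenizer")

enc = tokenizer("BPE splits rare words into subword units.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))   # [CLS] ... [SEP] added by the post-processor
print(tokenizer.decode(enc["input_ids"], skip_special_tokens=True))
```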
```mermaid
flowchart TD
A([🚀 Start]) --> B[1 · Setup\nImports · Constants\nVOCAB_SIZE=30k · SPECIAL_TOKENS\nData/ output dir]
B --> C[2 · Load WikiText-2-v1\nSalesforce/wikitext\n~44.8k rows · 3 splits]
C --> D[2 · Explore Dataset\nCorpus stats · Word length dist\nSplit sizes · Char frequency\nSave → Data/dataset_exploration.png]
D --> E[3 · Data Cleaning\nRemove unk · section headers\nNormalize @-@ · whitespace]
E --> F[3 · Deduplication\nExact-match dedup\nMin 3 words per sentence\nSave → Data/cleaning_comparison.png]
F --> G[4 · Initialize BPE Tokenizer\nModel: BPE · unk=UNK\nNFD → Lowercase → StripAccents\nPre-tokenizer: Whitespace\nDecoder: BPEDecoder]
G --> H[4 · Configure BPE Trainer\nVocab: 30,000 · min_freq=2\nSpecial: PAD UNK CLS SEP MASK\nSubword prefix: ##]
H --> I[4 · Train Tokenizer\ntrain_from_iterator\nbatch_size=1000\non clean training corpus]
I --> J[4 · Post-Processor\nTemplateProcessing\nCLS · A · SEP template\nEnable padding with PAD]
J --> K[4 · Sanity Check\nEncode · Decode\n4 sample sentences\nTokens · IDs · Decoded]
K --> L[4 · Vocabulary Inspection\nBreakdown: special · single chars\nsubwords ## · full words\nSave → Data/vocab_composition.png]
L --> M[5 · Evaluate Val & Test\nAvg tokens/sentence\nCompression ratio\nUNK-free coverage\nConsistency check]
M --> N[5 · Evaluate Train Sample\n5,000 sentence sample\nAll-splits comparison table\nSave → Data/tokenizer_evaluation.png]
N --> O[6 · Wrap as PreTrainedTokenizerFast\nSpecial token mappings\nVocab size verification]
O --> P[6 · Save Tokenizer\ncustom_bpe_tokenizer/\ntokenizer.json\ntokenizer_config.json]
P --> Q[6 · Reload & Verify\nfrom_pretrained\nAssert identical output\nEncode · Decode demo\nBatch padding · Pair encoding]
Q --> R[7 · Final Summary\nMetrics table\nAll config + results]
R --> S([✅ Done])
style A fill:#4CAF50,color:#fff
style S fill:#4CAF50,color:#fff
style B fill:#607D8B,color:#fff
style C fill:#2196F3,color:#fff
style D fill:#2196F3,color:#fff
style E fill:#FF5722,color:#fff
style F fill:#FF5722,color:#fff
style G fill:#9C27B0,color:#fff
style H fill:#9C27B0,color:#fff
style I fill:#FF9800,color:#fff
style J fill:#FF9800,color:#fff
style K fill:#FF9800,color:#fff
style L fill:#FF9800,color:#fff
style M fill:#00BCD4,color:#fff
style N fill:#00BCD4,color:#fff
style O fill:#673AB7,color:#fff
style P fill:#673AB7,color:#fff
style Q fill:#673AB7,color:#fff
style R fill:#607D8B,color:#fff
```

The full `.mmd` source is at `Flow/workflow.mmd`.
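The core training step in the diagram (node 4) streams the cleaned corpus into `train_from_iterator` in batches of 1,000 sentences. A minimal sketch, assuming the cleaned training sentences live in a Python list named `clean_train` and that `tokenizer` and `trainer` are configured as in the Tokenizer Configuration section below (names are illustrative, not the notebook's exact code):

```python
def batch_iterator(sentences, batch_size=1000):
    """Yield the cleaned training corpus in batches for streaming training."""
    for i in range(0, len(sentences), batch_size):
        yield sentences[i:i + batch_size]

# Train the BPE model on the cleaned, deduplicated training split
tokenizer.train_from_iterator(batch_iterator(clean_train), trainer=trainer, length=len(clean_train))
```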
```
Wikitext_2-BPE-Tokenizer/
├── Wikitext_2-BPE-Tokenizer.ipynb    # Main notebook (all steps end-to-end)
├── Flow/
│   └── workflow.mmd                  # Mermaid workflow diagram source
├── Data/                             # Generated plots (auto-created at runtime)
│   ├── dataset_exploration.png       # Word count distribution, split sizes, char freq
│   ├── cleaning_comparison.png       # Before vs after cleaning row counts
│   ├── vocab_composition.png         # Vocab pie chart + token length distribution
│   └── tokenizer_evaluation.png      # 4-panel evaluation summary plot
├── custom_bpe_tokenizer/             # Saved tokenizer (auto-created at runtime)
│   ├── tokenizer.json                # Full tokenizer config + vocab + merges
│   └── tokenizer_config.json         # Special token mappings
├── requirements.txt                  # Python dependencies
├── .gitignore
└── LICENSE
```
`Data/` and `custom_bpe_tokenizer/` are generated at runtime and excluded from version control via `.gitignore`.
| Feature | Detail |
|---|---|
| BPE from scratch | Built with HuggingFace `tokenizers`; no pre-trained vocab |
| Data cleaning | Removes <unk>, section headers, normalizes whitespace |
| Deduplication | Exact-match dedup on the training split before training |
| Normalizer pipeline | NFD → Lowercase → StripAccents → whitespace Replace → Strip |
| Special tokens | [PAD], [UNK], [CLS], [SEP], [MASK] |
| Post-processor | Auto-wraps sequences with [CLS] ... [SEP] |
| Padding | Enabled with [PAD] token for batch encoding |
| HF-compatible save | Saved as PreTrainedTokenizerFast; reload in one line |
| Evaluation suite | Compression ratio, UNK-free coverage, consistency check |
| Visualizations | 4 evaluation plots + vocab composition + cleaning comparison |
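The data cleaning and deduplication rows above correspond to a handful of simple text rules. A rough sketch of what they might look like (regex patterns and thresholds are illustrative; the notebook's exact rules may differ):

```python
import re

def clean_line(text: str) -> str:
    """Drop <unk> markers and section headers, normalize @-@ and whitespace."""
    text = text.replace("<unk>", " ")
    if re.match(r"^\s*=+.*=+\s*$", text):        # section headers like " = = History = = "
        return ""
    text = text.replace("@-@", "-")              # WikiText hyphen escape
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace

def build_corpus(lines):
    """Keep cleaned lines with at least 3 words; exact-match deduplicate, preserving order."""
    seen, corpus = set(), []
    for line in lines:
        cleaned = clean_line(line)
        if len(cleaned.split()) >= 3 and cleaned not in seen:
            seen.add(cleaned)
            corpus.append(cleaned)
    return corpus
```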
- Source: `Salesforce/wikitext` on HuggingFace Hub
- Subset: `wikitext-2-v1`
- Task: Language modeling (English Wikipedia text)
- Splits:
| Split | Raw Rows |
|---|---|
| Train | ~36,718 |
| Validation | ~3,760 |
| Test | ~4,358 |
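Loading the same splits takes a couple of lines with the `datasets` library (a minimal sketch):

```python
from datasets import load_dataset

# Download WikiText-2 (three splits) from the HuggingFace Hub
dataset = load_dataset("Salesforce/wikitext", "wikitext-2-v1")

for split in ("train", "validation", "test"):
    print(split, len(dataset[split]))
```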
```python
# BPE Model
Tokenizer(models.BPE(unk_token="[UNK]"))

# Normalizer
NFD() → Lowercase() → StripAccents() → Replace(r'\s+', ' ') → Strip()

# Pre-tokenizer
Whitespace()

# Trainer
BpeTrainer(
    vocab_size=30_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    min_frequency=2,
    continuing_subword_prefix="##",
)

# Post-processor
TemplateProcessing(single="[CLS] $A [SEP]", pair="[CLS] $A [SEP] $B:1 [SEP]:1")
```

The tokenizer is evaluated on both the validation and test splits across the following metrics:
| Metric | Description |
|---|---|
| Vocabulary size | Total tokens in the trained vocabulary |
| Avg tokens / sentence | Mean BPE token count per sentence (excl. special tokens) |
| Compression ratio | Average characters per token; higher means more compression |
| UNK-free coverage | % of sentences containing zero [UNK] tokens |
| Consistency | Same input always produces identical token output |
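For reference, the per-split metrics can be computed along these lines, with `tokenizer` being the loaded PreTrainedTokenizerFast (a rough sketch; the function and variable names are illustrative, not the notebook's exact code):

```python
def evaluate_split(tokenizer, sentences, unk_token="[UNK]"):
    """Compute avg tokens/sentence, chars-per-token compression, and UNK-free coverage."""
    token_counts, char_counts, unk_free = [], [], 0
    for text in sentences:
        tokens = tokenizer.tokenize(text)          # no special tokens added here
        token_counts.append(len(tokens))
        char_counts.append(len(text))
        if unk_token not in tokens:
            unk_free += 1
    avg_tokens = sum(token_counts) / len(sentences)
    compression = sum(char_counts) / sum(token_counts)    # characters per token
    coverage = 100.0 * unk_free / len(sentences)           # % sentences with zero [UNK]
    return avg_tokens, compression, coverage
```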
Prerequisites: Python 3.10+
```bash
git clone https://github.com/SANJAI-s0/Wikitext_2-BPE-Tokenizer.git
cd Wikitext_2-BPE-Tokenizer
pip install -r requirements.txt
```

Then open and run the notebook:

```bash
jupyter notebook Wikitext_2-BPE-Tokenizer.ipynb
```

Or use Google Colab / Kaggle; no GPU is required, and the tokenizer trains on CPU in 1–3 minutes.
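Once the notebook has run, the saved tokenizer in `custom_bpe_tokenizer/` supports batch padding and pair encoding out of the box (a minimal sketch; the sample sentences are illustrative):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("custom_bpe_tokenizer")

# Batch encoding: shorter sequences are padded with [PAD]
batch = tokenizer(
    ["A short sentence.", "A noticeably longer sentence with several more words."],
    padding=True,
)
print(batch["input_ids"])
print(batch["attention_mask"])   # 0s mark the [PAD] positions

# Pair encoding follows the [CLS] $A [SEP] $B [SEP] template
pair = tokenizer("first segment", "second segment")
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
```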
| Section | Description |
|---|---|
| 1. Setup | Imports, constants (VOCAB_SIZE=30000, special tokens), output dirs |
| 2. Load & Explore | Load WikiText-2-v1, compute corpus stats, visualize length distributions |
| 3. Data Cleaning | Clean text, remove noise, deduplicate training corpus |
| 4. BPE Training | Initialize tokenizer, configure trainer, train on clean corpus |
| 5. Evaluation | Evaluate on val/test splits, print metrics table, generate plots |
| 6. Save & Reload | Save as HF-compatible tokenizer, reload with PreTrainedTokenizerFast |
| 7. Summary | Final metrics summary and conclusions |
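Step 6 wraps the trained `tokenizers.Tokenizer` in a `PreTrainedTokenizerFast` and writes it to `custom_bpe_tokenizer/`. A minimal sketch of that wrapping, assuming `tokenizer` is the trained `Tokenizer` object (the exact keyword set may differ from the notebook):

```python
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,      # the trained tokenizers.Tokenizer
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
hf_tokenizer.save_pretrained("custom_bpe_tokenizer")   # writes tokenizer.json + config

# Reload and verify the round trip produces identical IDs
reloaded = PreTrainedTokenizerFast.from_pretrained("custom_bpe_tokenizer")
sample = "a quick sanity check"
assert reloaded(sample)["input_ids"] == hf_tokenizer(sample)["input_ids"]
```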
This project is licensed under the MIT License.