
WikiText-2 BPE Tokenizer

A custom Byte Pair Encoding (BPE) tokenizer built from scratch on the WikiText-2 dataset. The tokenizer is trained using HuggingFace's tokenizers library, evaluated on validation and test splits, and saved in a HuggingFace-compatible format ready for downstream language modeling tasks.


📋 Table of Contents

  1. Overview
  2. Workflow
  3. Project Structure
  4. Key Features
  5. Dataset
  6. Tokenizer Configuration
  7. Evaluation Metrics
  8. Getting Started
  9. Notebook Walkthrough
  10. License

πŸ” Overview

This project demonstrates how to build a production-ready BPE tokenizer entirely from scratch, covering data loading, cleaning, deduplication, tokenizer training, evaluation, and serialization. The tokenizer targets English text and is compatible with HuggingFace's PreTrainedTokenizerFast interface, making it a drop-in replacement for downstream NLP pipelines.
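
For instance, once the notebook has written the custom_bpe_tokenizer/ directory (see Project Structure below), the result loads like any other fast tokenizer. A minimal sketch, assuming the saved directory sits next to your script; the sample sentence is arbitrary:

```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizer saved by the notebook (directory name from this repo's layout)
tok = PreTrainedTokenizerFast.from_pretrained("custom_bpe_tokenizer")

enc = tok("The quick brown fox jumps over the lazy dog.")
print(tok.convert_ids_to_tokens(enc["input_ids"]))            # [CLS] ... [SEP] added by the post-processor
print(tok.decode(enc["input_ids"], skip_special_tokens=True))  # back to plain text
```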


🔄 Workflow

flowchart TD
    A([🚀 Start]) --> B[1 · Setup\nImports · Constants\nVOCAB_SIZE=30k · SPECIAL_TOKENS\nData/ output dir]

    B --> C[2 · Load WikiText-2-v1\nSalesforce/wikitext\n~44.8k rows · 3 splits]
    C --> D[2 · Explore Dataset\nCorpus stats · Word length dist\nSplit sizes · Char frequency\nSave → Data/dataset_exploration.png]

    D --> E[3 · Data Cleaning\nRemove unk · section headers\nNormalize @-@ · whitespace]
    E --> F[3 · Deduplication\nExact-match dedup\nMin 3 words per sentence\nSave → Data/cleaning_comparison.png]

    F --> G[4 · Initialize BPE Tokenizer\nModel: BPE · unk=UNK\nNFD → Lowercase → StripAccents\nPre-tokenizer: Whitespace\nDecoder: BPEDecoder]

    G --> H[4 · Configure BPE Trainer\nVocab: 30,000 · min_freq=2\nSpecial: PAD UNK CLS SEP MASK\nSubword prefix: ##]

    H --> I[4 · Train Tokenizer\ntrain_from_iterator\nbatch_size=1000\non clean training corpus]

    I --> J[4 · Post-Processor\nTemplateProcessing\nCLS · A · SEP template\nEnable padding with PAD]

    J --> K[4 · Sanity Check\nEncode · Decode\n4 sample sentences\nTokens · IDs · Decoded]

    K --> L[4 · Vocabulary Inspection\nBreakdown: special · single chars\nsubwords ## · full words\nSave → Data/vocab_composition.png]

    L --> M[5 · Evaluate Val & Test\nAvg tokens/sentence\nCompression ratio\nUNK-free coverage\nConsistency check]

    M --> N[5 · Evaluate Train Sample\n5,000 sentence sample\nAll-splits comparison table\nSave → Data/tokenizer_evaluation.png]

    N --> O[6 · Wrap as PreTrainedTokenizerFast\nSpecial token mappings\nVocab size verification]

    O --> P[6 · Save Tokenizer\ncustom_bpe_tokenizer/\ntokenizer.json\ntokenizer_config.json]

    P --> Q[6 · Reload & Verify\nfrom_pretrained\nAssert identical output\nEncode · Decode demo\nBatch padding · Pair encoding]

    Q --> R[7 · Final Summary\nMetrics table\nAll config + results]

    R --> S([✅ Done])

    style A fill:#4CAF50,color:#fff
    style S fill:#4CAF50,color:#fff
    style B fill:#607D8B,color:#fff
    style C fill:#2196F3,color:#fff
    style D fill:#2196F3,color:#fff
    style E fill:#FF5722,color:#fff
    style F fill:#FF5722,color:#fff
    style G fill:#9C27B0,color:#fff
    style H fill:#9C27B0,color:#fff
    style I fill:#FF9800,color:#fff
    style J fill:#FF9800,color:#fff
    style K fill:#FF9800,color:#fff
    style L fill:#FF9800,color:#fff
    style M fill:#00BCD4,color:#fff
    style N fill:#00BCD4,color:#fff
    style O fill:#673AB7,color:#fff
    style P fill:#673AB7,color:#fff
    style Q fill:#673AB7,color:#fff
    style R fill:#607D8B,color:#fff

The full .mmd source is at Flow/workflow.mmd.
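
Step 3 of the diagram (cleaning and deduplication) boils down to a few regex passes plus an exact-match filter. The helper below is an illustrative sketch based on the rules named in the diagram, not the notebook's own function:

```python
import re

def clean_wikitext(lines, min_words=3):
    """Illustrative cleaning pass: drop section headers and <unk> markers,
    normalize WikiText '@-@'-style escapes, collapse whitespace, then keep
    only deduplicated sentences with at least `min_words` words."""
    seen, cleaned = set(), []
    for text in lines:
        stripped = text.strip()
        if not stripped or stripped.startswith("="):         # " = Heading = " lines
            continue
        text = stripped.replace("<unk>", " ")
        text = re.sub(r" @([-.,])@ ", r"\1", text)            # "3 @-@ 4" -> "3-4"
        text = re.sub(r"\s+", " ", text).strip()
        if len(text.split()) < min_words or text in seen:     # min length + exact-match dedup
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```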


πŸ“ Project Structure

Wikitext_2-BPE-Tokenizer/
├── Wikitext_2-BPE-Tokenizer.ipynb   # Main notebook (all steps end-to-end)
├── Flow/
│   └── workflow.mmd                  # Mermaid workflow diagram source
├── Data/                             # Generated plots (auto-created at runtime)
│   ├── dataset_exploration.png       # Word count distribution, split sizes, char freq
│   ├── cleaning_comparison.png       # Before vs after cleaning row counts
│   ├── vocab_composition.png         # Vocab pie chart + token length distribution
│   └── tokenizer_evaluation.png      # 4-panel evaluation summary plot
├── custom_bpe_tokenizer/             # Saved tokenizer (auto-created at runtime)
│   ├── tokenizer.json                # Full tokenizer config + vocab + merges
│   └── tokenizer_config.json         # Special token mappings
├── requirements.txt                  # Python dependencies
├── .gitignore
└── LICENSE

Data/ and custom_bpe_tokenizer/ are generated at runtime and excluded from version control via .gitignore.


✨ Key Features

| Feature | Detail |
|---|---|
| BPE from scratch | Built using HuggingFace tokenizers, no pre-trained vocab |
| Data cleaning | Removes `<unk>`, section headers, normalizes whitespace |
| Deduplication | Exact-match dedup on the training split before training |
| Normalizer pipeline | NFD → Lowercase → StripAccents → Strip |
| Special tokens | [PAD], [UNK], [CLS], [SEP], [MASK] |
| Post-processor | Auto-wraps sequences with [CLS] ... [SEP] |
| Padding | Enabled with [PAD] token for batch encoding |
| HF-compatible save | Saved as PreTrainedTokenizerFast, reloads in one line |
| Evaluation suite | Compression ratio, UNK-free coverage, consistency check |
| Visualizations | 4 evaluation plots + vocab composition + cleaning comparison |

📊 Dataset

  • Source: Salesforce/wikitext on HuggingFace Hub
  • Subset: wikitext-2-v1
  • Task: Language modeling (English Wikipedia text)
  • Splits:
| Split | Raw Rows |
|---|---|
| Train | ~36,718 |
| Validation | ~3,760 |
| Test | ~4,358 |
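
Loading all three splits is a one-liner with the datasets library (dataset and config names as listed above):

```python
from datasets import load_dataset

ds = load_dataset("Salesforce/wikitext", "wikitext-2-v1")
print(ds)                        # DatasetDict with train / validation / test splits
print(ds["train"][10]["text"])   # each row is a single raw-text field
```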

βš™οΈ Tokenizer Configuration

from tokenizers import Tokenizer, Regex, models, normalizers, pre_tokenizers, trainers, processors

# BPE model
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Normalizer: NFD → Lowercase → StripAccents → collapse whitespace → Strip
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFD(),
    normalizers.Lowercase(),
    normalizers.StripAccents(),
    normalizers.Replace(Regex(r"\s+"), " "),
    normalizers.Strip(),
])

# Pre-tokenizer
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Trainer
trainer = trainers.BpeTrainer(
    vocab_size=30_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    min_frequency=2,
    continuing_subword_prefix="##",
)

# Post-processor (attached after training, once [CLS]/[SEP] have ids)
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", tokenizer.token_to_id("[CLS]")),
                    ("[SEP]", tokenizer.token_to_id("[SEP]"))],
)
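
Putting the pieces together, training follows the train_from_iterator pattern from the workflow diagram (batches of 1,000 cleaned sentences). A sketch, assuming clean_train is the list of cleaned, deduplicated training sentences from step 3:

```python
def batch_iterator(sentences, batch_size=1_000):
    # Yield the cleaned corpus in chunks of `batch_size` sentences
    for i in range(0, len(sentences), batch_size):
        yield sentences[i : i + batch_size]

tokenizer.train_from_iterator(batch_iterator(clean_train), trainer=trainer)

# Enable [PAD]-based padding for batch encoding
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")
```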

📈 Evaluation Metrics

The tokenizer is evaluated on both the validation and test splits across four metrics:

| Metric | Description |
|---|---|
| Vocabulary size | Total tokens in the trained vocabulary |
| Avg tokens / sentence | Mean BPE token count per sentence (excl. special tokens) |
| Compression ratio | Average characters per token; higher = more compression |
| UNK-free coverage | % of sentences containing zero [UNK] tokens |
| Consistency | Same input always produces identical token output |
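
Each metric is cheap to compute from the raw encodings. A hedged sketch, assuming tokenizer is the trained tokenizers.Tokenizer; the helper name and exact bookkeeping are illustrative, not the notebook's:

```python
def evaluate_split(tokenizer, sentences):
    """Avg tokens per sentence, chars-per-token compression ratio, and the
    share of sentences containing no [UNK] token (excluding special tokens)."""
    encodings = [tokenizer.encode(s) for s in sentences]
    content = [[t for t in e.tokens if t not in ("[CLS]", "[SEP]", "[PAD]")] for e in encodings]
    avg_tokens = sum(len(toks) for toks in content) / len(content)
    compression = sum(len(s) for s in sentences) / sum(len(toks) for toks in content)
    unk_free = 100 * sum("[UNK]" not in toks for toks in content) / len(content)
    return {"avg_tokens_per_sentence": avg_tokens,
            "compression_ratio": compression,
            "unk_free_coverage_pct": unk_free}
```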

🚀 Getting Started

Prerequisites: Python 3.10+

git clone https://github.com/SANJAI-s0/Wikitext_2-BPE-Tokenizer.git
cd Wikitext_2-BPE-Tokenizer
pip install -r requirements.txt

Then open and run the notebook:

jupyter notebook Wikitext_2-BPE-Tokenizer.ipynb

Or use Google Colab / Kaggle: no GPU is required, and the tokenizer trains on CPU in 1–3 minutes.


📓 Notebook Walkthrough

| Section | Description |
|---|---|
| 1. Setup | Imports, constants (VOCAB_SIZE=30000, special tokens), output dirs |
| 2. Load & Explore | Load WikiText-2-v1, compute corpus stats, visualize length distributions |
| 3. Data Cleaning | Clean text, remove noise, deduplicate training corpus |
| 4. BPE Training | Initialize tokenizer, configure trainer, train on clean corpus |
| 5. Evaluation | Evaluate on val/test splits, print metrics table, generate plots |
| 6. Save & Reload | Save as HF-compatible tokenizer, reload with PreTrainedTokenizerFast |
| 7. Summary | Final metrics summary and conclusions |
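
Section 6 reduces to wrapping the trained tokenizers.Tokenizer in PreTrainedTokenizerFast, saving it, and verifying the reload round trip. A sketch using the special-token names from the configuration above:

```python
from transformers import PreTrainedTokenizerFast

hf_tok = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,          # the trained tokenizers.Tokenizer
    pad_token="[PAD]", unk_token="[UNK]",
    cls_token="[CLS]", sep_token="[SEP]", mask_token="[MASK]",
)
hf_tok.save_pretrained("custom_bpe_tokenizer")   # writes tokenizer.json + tokenizer_config.json

# Reload and check that the round trip produces identical ids
reloaded = PreTrainedTokenizerFast.from_pretrained("custom_bpe_tokenizer")
sample = "Tokenization should survive a save and reload."
assert hf_tok(sample)["input_ids"] == reloaded(sample)["input_ids"]
```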

📄 License

This project is licensed under the MIT License.
