# annotated-mpnet

annotated-mpnet provides a lightweight, heavily annotated PyTorch/einops implementation for pretraining MPNet models. This project aims to demystify the MPNet pretraining process, originally part of the larger fairseq codebase, making it more accessible for research and custom pretraining.
- About the Project
- Key Features
- Installation
- Quick Start
- Documentation
- Project Structure
- Changelog
- Contributing
- License
- Acknowledgements
> [!NOTE]
> This repo is a fork/update of the original by yext.
## About the Project

MPNet (Masked and Permuted Pre-training for Language Understanding) is a powerful pretraining method. However, its original pretraining code is embedded within the `fairseq` library, which can be complex to navigate and adapt. annotated-mpnet addresses this by:

- Providing a clean, raw PyTorch implementation of MPNet pretraining.
- Offering extensive annotations and comments throughout the codebase to improve understanding.
- Enabling pretraining without the full `fairseq` dependency, facilitating use on various hardware setups.
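To give a feel for the objective being annotated: MPNet permutes token positions, keeps a prefix of the permutation as visible context, and predicts the remaining tokens with full position information. The sketch below is a simplified, pure-Python illustration of that input construction (the `permute_and_mask` helper and its arguments are hypothetical; the repo's actual implementation is Cython-accelerated and more involved):

```python
import random

def permute_and_mask(tokens, pred_ratio=0.15, seed=0):
    """Illustrative MPNet-style input construction: permute positions,
    then treat the tail of the permutation as the predicted span."""
    rng = random.Random(seed)
    positions = list(range(len(tokens)))
    rng.shuffle(positions)
    num_pred = max(1, int(len(tokens) * pred_ratio))
    non_pred, pred = positions[:-num_pred], positions[-num_pred:]
    # Predicted positions are replaced by [MASK]; the model predicts them
    # while seeing the non-predicted content AND the positions of all tokens.
    inputs = [tokens[p] for p in non_pred] + ["[MASK]"] * num_pred
    targets = [tokens[p] for p in pred]
    return inputs, targets

inputs, targets = permute_and_mask(list("abcdefgh"))
```

Seeing both the visible context and every position is what distinguishes MPNet from plain masked language modeling, and it is why the model needs the two-stream attention implemented in `transformer_modules/`.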
## Key Features

- **Standalone PyTorch Implementation**: No `fairseq` dependency required for pretraining.
- **Heavily Annotated Code**: Detailed comments explain the model architecture and training process.
- **Flexible Data Handling**: Supports pretraining with HuggingFace streaming datasets or local text files.
- **HuggingFace Compatibility**: Includes a tool to convert pretrained checkpoints to the HuggingFace `MPNetForMaskedLM` format for easy fine-tuning.
- **Integrated Logging**: Supports TensorBoard and Weights & Biases for experiment tracking.
## Installation

Install directly from the GitHub repository with `pip`:

```sh
pip install "git+https://github.com/pszemraj/annotated-mpnet.git"
```

Or, clone the repository and install in editable mode:

```sh
git clone https://github.com/pszemraj/annotated-mpnet.git
cd annotated-mpnet
pip install -e .
```

> [!NOTE]
> Pretraining MPNet is computationally intensive and requires a CUDA-enabled GPU. The training script will exit if CUDA is not available.
### Requirements

- Python 3.x
- PyTorch (>= 2.6.0; CUDA is required for training)
- `einops` (>= 0.7.0) for explicit tensor shape transforms
- CUDA GPU with BF16 support (Ampere+). As of 2026, legacy GPUs without BF16 are not supported.
- HuggingFace `transformers` and `datasets`
- `wandb` (optional, for Weights & Biases logging)
- `rich` (for enhanced console logging)
- `numpy`, `cython`, `tensorboard` (optional)

See `pyproject.toml` for a full list of dependencies.
## Quick Start

Stream data from HuggingFace and start pretraining:

```sh
pretrain-mpnet \
    --dataset-name "HuggingFaceFW/fineweb-edu" \
    --tokenizer-name "microsoft/mpnet-base" \
    --batch-size 16 \
    --update-freq 8 \
    --total-updates 100000 \
    --checkpoint-dir "./checkpoints/my_run"
```

Run `pretrain-mpnet -h` for all available options.
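Assuming `--update-freq` follows the usual fairseq convention of gradient accumulation (an assumption worth checking against the configuration reference), the settings above accumulate gradients over 8 forward passes per optimizer step, so the effective batch size is:

```python
# Effective batch size under gradient accumulation
# (assumes fairseq-style --update-freq semantics).
def effective_batch_size(batch_size: int, update_freq: int, num_gpus: int = 1) -> int:
    """Sequences contributing to each optimizer update."""
    return batch_size * update_freq * num_gpus

print(effective_batch_size(16, 8))  # -> 128
```

Raising `--update-freq` is the usual way to simulate a larger batch on a single GPU at the cost of more wall-clock time per update.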
## Documentation

- Training Guide - Full usage: streaming vs. local data, resuming, exporting to HuggingFace
- Configuration Reference - Complete reference for all CLI arguments
- Architecture - Model internals: two-stream attention, encoder structure
- Development Guide - Running tests, contributing
## Project Structure

```
annotated-mpnet/
├── annotated_mpnet/          # Core library code
│   ├── data/                 # Data loading, collation, streaming dataset
│   ├── modeling/             # MPNetForPretraining model definition
│   ├── scheduler/            # Learning rate scheduler
│   ├── tracking/             # Metrics tracking (AverageMeter)
│   ├── transformer_modules/  # Transformer building blocks
│   └── utils/                # Utilities, Cython-accelerated permutation
├── cli_tools/                # CLI scripts (pretrain-mpnet, convert-to-hf)
├── docs/                     # Documentation
├── tests/                    # Unit tests
└── pyproject.toml            # Build configuration
```
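As a taste of the utilities involved, `tracking/` houses an `AverageMeter` for running metrics. A minimal sketch of what such a class typically looks like (hypothetical, not the repo's exact implementation):

```python
class AverageMeter:
    """Tracks a running average of a scalar metric (e.g. training loss)."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.sum = 0.0
        self.count = 0

    def update(self, value: float, n: int = 1):
        # n lets a batch-mean contribute in proportion to its batch size.
        self.sum += value * n
        self.count += n

    @property
    def avg(self) -> float:
        return self.sum / self.count if self.count else 0.0

m = AverageMeter()
for loss in [2.0, 1.0, 0.5, 0.5]:
    m.update(loss)
print(m.avg)  # -> 1.0
```

Meters like this smooth per-step noise before values reach TensorBoard or wandb.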
## Changelog

All notable changes to this project are documented in CHANGELOG.md. The latest version is v0.1.6.
## Contributing

Contributions are welcome! Please consider the following:

- Reporting Issues: Use GitHub Issues to report bugs or suggest new features.
- Pull Requests: For code contributions, please open a pull request with a clear description of your changes.
- Running Tests: Ensure tests pass with `python -m unittest discover tests`.
## License

The licenses for third-party libraries used in this project are detailed in LICENSE-3RD-PARTY.txt. The original MPNet code by Microsoft is licensed under the MIT License.

> [!NOTE]
> The detailed line-by-line license info is from the original repo and has not been updated in this fork.
## Acknowledgements

- This work is heavily based on the original MPNet paper and implementation by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu from Microsoft.
- The core Transformer module structures are adapted from the `fairseq` library.