MiniMoE is a Mixture-of-Experts (MoE) language model built from scratch in PyTorch. The project implements sparse expert routing and load balancing to demonstrate MoE mechanics at small scale.
| Spec | Value |
|---|---|
| Total Parameters | 20.8M |
| Active Parameters | 17.7M (~85%) |
| Experts per Layer | 4 |
| Top-k Routing | 2 |
| Layers | 6 |
| Embedding Dim | 256 |
| Training Time | ~1 hour (M4 Max) |
The model processes input through these stages:
- Token and Position Embeddings. The model converts input tokens to vectors and adds positional information.
- Transformer Blocks (x6). Each block contains:
  - A multi-head attention layer that lets tokens attend to previous tokens
  - An MoE layer where a router selects 2 of 4 experts to process each token
- Language Model Head. A final linear layer projects hidden states to vocabulary logits. This layer shares weights with the token embeddings.
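These stages map onto a forward pass roughly like the sketch below. This is a minimal illustration, not the repo's code: the head count (4), context length (512), and GPT-2 vocab size (tiktoken) are assumptions, and PyTorch's stock `TransformerEncoderLayer` stands in for the real attention+MoE block.

```python
import torch
import torch.nn as nn

class MiniMoESketch(nn.Module):
    """Illustrative sketch of the stages above, not the repo's actual classes.
    ASSUMPTIONS: 4 attention heads, 512-token context, GPT-2 vocab size."""
    def __init__(self, vocab_size=50257, dim=256, n_layers=6, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)   # token embeddings
        self.pos_emb = nn.Embedding(max_len, dim)      # position embeddings
        # Stand-in blocks: the real blocks replace the dense FFN with an MoE layer.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight      # weight tying with embeddings

    def forward(self, idx):
        b, t = idx.shape
        pos = torch.arange(t, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)      # stage 1: embeddings
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(idx.device)
        for block in self.blocks:                      # stage 2: 6 blocks
            x = block(x, src_mask=mask)                # causal attention + FFN/MoE
        return self.lm_head(x)                         # stage 3: vocabulary logits
```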
Sparse Routing. Each token is processed by only 2 of 4 experts, which reduces compute per token while keeping total model capacity.

Load Balancing Loss. An auxiliary loss term prevents expert collapse, where the router would send nearly all tokens to only 1 or 2 experts:

```python
aux_loss = num_experts * sum(tokens_per_expert * mean_probs)
```

Expert Specialization. Different experts learn different patterns from the mixed training data - in theory. This implementation didn't quite get there (see the routing analysis below), but it's a demo.
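The first two ideas combine in the router. Here is a minimal sketch of top-2 routing with that auxiliary loss, assuming a linear softmax gate; the actual `Router`/`MoELayer` in `src/model/experts.py` may differ in detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Sketch of top-k routing with the load-balancing loss above.
    Illustrative only; the repo's Router may differ in detail."""
    def __init__(self, dim=256, num_experts=4, k=2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.num_experts, self.k = num_experts, k

    def forward(self, x):                        # x: (num_tokens, dim)
        probs = F.softmax(self.gate(x), dim=-1)  # router probabilities
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)

        # tokens_per_expert: fraction of routing slots each expert receives
        one_hot = F.one_hot(topk_idx, self.num_experts).float()  # (T, k, E)
        tokens_per_expert = one_hot.sum(dim=(0, 1)) / topk_idx.numel()
        # mean_probs: average router probability assigned to each expert
        mean_probs = probs.mean(dim=0)
        aux_loss = self.num_experts * (tokens_per_expert * mean_probs).sum()
        return topk_probs, topk_idx, aux_loss
```

Each token's MoE output is then the probability-weighted sum of its two selected experts' outputs, and `aux_loss` is added to the cross-entropy loss, typically with a small coefficient.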
The model was trained on three datasets:
- OpenAssistant for conversational examples
- TinyStories for creative writing (capped at 50k examples)
- Code Alpaca for code generation
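A hypothetical loading recipe for such a mix is sketched below; the Hugging Face hub IDs and column handling are assumptions, and `src/data.py` may build the mix differently:

```python
from datasets import load_dataset, concatenate_datasets

# Hypothetical mixing recipe; hub IDs and field names are assumptions.
stories = load_dataset("roneneldan/TinyStories", split="train")
stories = stories.select(range(50_000)).select_columns(["text"])   # 50k cap

code = load_dataset("sahil2801/CodeAlpaca-20k", split="train")
code = code.map(lambda ex: {"text": ex["instruction"] + "\n" + ex["output"]},
                remove_columns=code.column_names)

oasst = load_dataset("OpenAssistant/oasst1", split="train")
oasst = oasst.select_columns(["text"])

mixed = concatenate_datasets([stories, code, oasst]).shuffle(seed=42)
```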
The table below shows expert activation patterns for different prompt types, averaged across all layers:
| Prompt Type | Expert 0 | Expert 1 | Expert 2 | Expert 3 |
|---|---|---|---|---|
| Code | 16.3% | 25.3% | 25.6% | 32.7% |
| Story | 22.5% | 24.8% | 26.2% | 26.6% |
| Chat | 23.5% | 23.5% | 25.4% | 27.7% |
Observations:
- All four experts remain active throughout training. The model does not have dead experts.
- Code prompts route only 16.3% of tokens to Expert 0, while Story and Chat prompts route 22.5% and 23.5% to it, respectively.
- Story prompts show the most even distribution across experts.
- Expert 3 receives slightly more tokens across all categories.
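Percentages like these can be gathered by counting top-k assignments at each MoE layer. A sketch building on the `TopKRouter` above (the repo's `src.analyze_experts` may work differently):

```python
import torch

@torch.no_grad()
def expert_usage(router, hidden_states, num_experts=4, k=2):
    """Fraction of top-k routing slots each expert receives.
    Illustrative sketch; the repo's analysis script may differ."""
    probs = torch.softmax(router.gate(hidden_states), dim=-1)
    _, topk_idx = probs.topk(k, dim=-1)
    counts = torch.bincount(topk_idx.flatten(), minlength=num_experts)
    return counts.float() / counts.sum()
```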
You can run the analysis yourself:

```bash
uv run python -m src.analyze_experts --checkpoint checkpoints/final.pt
```

Training:

```bash
uv sync
uv run python -m src.train
```

Chat:

```bash
uv run python -m src.chat --checkpoint checkpoints/final.pt
```

Story prompt. The model produces TinyStories-style output with simple sentences and child-like narrative:
> Write a short story about a cat.

```
One day, a happy cat came to the park. The bird was very nice and he
could play with the friends. The cat wanted to play with the toys.
The cat was not happy. He played with the toy.
```
Code prompt. The model recognizes code patterns but produces invalid syntax:

> Write a Python function to reverse a string.

```
def find_string(string):
    return string.sub(string[-1] = string(string)
```
Chat prompt. The model defaults to code-like output even for general questions:

> What is the capital of France?

```
The following code to replace the data in Python? A program is a
function that makes the input used of the input input...
```
```
minimoe/
├── src/
│   ├── tokenizer.py          # BPE tokenizer (tiktoken)
│   ├── data.py               # Mixed dataset loader
│   ├── train.py              # Training loop
│   ├── chat.py               # Interactive chat interface
│   ├── analyze_experts.py    # Expert routing analysis
│   └── model/
│       ├── attention.py      # Multi-head attention
│       ├── experts.py        # Expert, Router, MoELayer
│       ├── transformer.py    # MoE transformer block
│       └── moe_gpt.py        # Full model
└── checkpoints/
```
- Top-k routing: each token is processed by 2 of 4 experts
- All 4 experts remain active throughout training (no dead experts)
- Sparse computation: 17.7M of 20.8M parameters active per forward pass
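The 3.1M-parameter gap is consistent with skipping 2 of 4 experts in each of the 6 layers, if each expert is a two-matrix FFN of shape 256 -> 512 -> 256 (the hidden size here is an assumption, not confirmed by the repo):

```python
# Back-of-the-envelope check, ASSUMING expert FFNs of shape 256 -> 512 -> 256.
dim, hidden, layers, skipped_experts = 256, 512, 6, 2
per_expert = dim * hidden + hidden * dim      # two weight matrices, ~262K params
print(layers * skipped_experts * per_expert)  # 3,145,728 ~ 20.8M - 17.7M
```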
This is an educational implementation. The model has 20M parameters, which limits its capabilities:
- The model performs pattern matching rather than reasoning
- Outputs tend toward repetition and topic drift on longer generations
- The model shows bias toward code or story patterns depending on the prompt