
MiniMoE

MiniMoE is a Mixture of Experts language model built from scratch in PyTorch. The project implements sparse expert routing and load balancing to demonstrate MoE mechanics at small scale.

Overview

Spec              | Value
------------------|------------------------
Total Parameters  | 20.8M
Active Parameters | 17.7M (~85%)
Experts per Layer | 4
Top-k Routing     | 2
Layers            | 6
Embedding Dim     | 256
Training Time     | ~1 hour (M4 Max)
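The active-parameter share follows from the routing itself: with top-2 of 4 experts, two expert networks per layer are skipped on every forward pass. A quick back-of-the-envelope check against the numbers above (the per-expert size is inferred here, not read from the code):

total, active = 20.8e6, 17.7e6
layers, num_experts, top_k = 6, 4, 2
inactive = total - active                          # ~3.1M parameters skipped per forward pass
skipped_experts = layers * (num_experts - top_k)   # 12 expert networks sit idle per pass
print(f"{inactive / skipped_experts:.2e}")         # ~2.6e5 parameters per expert (inferred)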

Architecture

The model processes input through the following stages (a code sketch of one block appears after the list):

  1. Token and Position Embeddings. The model converts input tokens to vectors and adds positional information.
  2. Transformer Blocks (x6). Each block contains:
    • A multi-head attention layer that lets tokens attend to previous tokens
    • An MoE layer where a router selects 2 of 4 experts to process each token
  3. Language Model Head. A final linear layer projects hidden states to vocabulary logits. This layer shares weights with the token embeddings.
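Putting stages 1-3 together, here is a minimal, self-contained PyTorch sketch of one transformer block: causal self-attention followed by an MoE layer in which a router picks 2 of 4 expert MLPs per token. Class names, the number of attention heads, and the expert hidden size are illustrative, not taken from the repository; the actual modules live under src/model/ (attention.py, experts.py, transformer.py).

import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A small feed-forward network; one of 4 per layer."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    """Routes each token to its top-2 experts and mixes their outputs."""
    def __init__(self, dim=256, num_experts=4, top_k=2, hidden=512):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([Expert(dim, hidden) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):                                      # x: (batch, seq, dim)
        flat = x.reshape(-1, x.size(-1))                       # treat every token independently
        probs = F.softmax(self.router(flat), dim=-1)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize the kept weights
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = idx[:, slot] == e                       # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(flat[mask])
        return out.reshape_as(x)

class MoEBlock(nn.Module):
    """One transformer block: causal self-attention followed by the MoE layer."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.moe = MoELayer(dim)

    def forward(self, x):
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out                                       # residual around attention
        x = x + self.moe(self.ln2(x))                          # residual around the MoE layer
        return x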

Key MoE Concepts

Sparse Routing. Each token is processed by only 2 of 4 experts. This reduces compute while maintaining model capacity.

Load Balancing Loss. An auxiliary loss term prevents expert collapse, where all tokens would go to only 1 or 2 experts:

aux_loss = num_experts * sum(tokens_per_expert * mean_probs)
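A sketch of how this term can be computed, assuming the router's softmax probabilities and the chosen top-k expert indices are available for a batch of tokens (variable names mirror the formula above and are not taken from the repository):

import torch

def load_balancing_loss(router_probs, topk_idx, num_experts=4):
    # router_probs: (num_tokens, num_experts) softmax output of the router
    # topk_idx:     (num_tokens, top_k) indices of the experts chosen per token
    mean_probs = router_probs.mean(dim=0)                         # average routing prob per expert
    counts = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    tokens_per_expert = counts / counts.sum()                     # fraction of assignments per expert
    return num_experts * torch.sum(tokens_per_expert * mean_probs)

The loss is smallest when assignments and router probabilities are both spread evenly across the experts, which is exactly what discourages collapse onto one or two of them.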

Expert Specialization. In theory, different experts learn different patterns from the mixed training data. In practice this implementation shows only mild specialization (see the Expert Routing Analysis below), but that is part of what the demo illustrates.

Training

The model was trained on a mix of three datasets (a loading sketch follows the list):

  • OpenAssistant for conversational examples
  • TinyStories for creative writing (capped at 50k examples)
  • Code Alpaca for code generation
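A rough sketch of how such a mixture could be assembled with the Hugging Face datasets library. The dataset IDs, the column handling, and where the 50k cap is applied are assumptions based on the description above, not the repository's src/data.py:

from datasets import load_dataset, concatenate_datasets

def build_mixed_corpus(seed=0):
    # Reduce each source to a single "text" column so the datasets can be concatenated.
    oasst = load_dataset("OpenAssistant/oasst1", split="train")                # assumed ID
    oasst = oasst.map(lambda ex: {"text": ex["text"]}, remove_columns=oasst.column_names)

    stories = load_dataset("roneneldan/TinyStories", split="train[:50000]")    # assumed ID; capped at 50k
    stories = stories.map(lambda ex: {"text": ex["text"]}, remove_columns=stories.column_names)

    code = load_dataset("sahil2801/CodeAlpaca-20k", split="train")             # assumed ID
    code = code.map(lambda ex: {"text": ex["instruction"] + "\n" + ex["output"]},
                    remove_columns=code.column_names)

    return concatenate_datasets([oasst, stories, code]).shuffle(seed=seed)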

Expert Routing Analysis

The table below shows expert activation patterns for different prompt types, averaged across all layers:

Prompt Type    | Expert 0 | Expert 1 | Expert 2 | Expert 3
---------------|----------|----------|----------|----------
Code           |   16.3%  |   25.3%  |   25.6%  |   32.7%
Story          |   22.5%  |   24.8%  |   26.2%  |   26.6%
Chat           |   23.5%  |   23.5%  |   25.4%  |   27.7%

Observations:

  • All four experts remain active throughout training. The model does not have dead experts.
  • Code prompts route 16% of tokens to Expert 0, while Story and Chat prompts route 22-23% to Expert 0.
  • Story prompts show the most even distribution across experts.
  • Expert 3 receives slightly more tokens across all categories.

You can run the analysis yourself:

uv run python -m src.analyze_experts --checkpoint checkpoints/final.pt
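For reference, routing statistics like the table above can be gathered with forward hooks on the MoE layers. The sketch below assumes each MoE layer caches the top-k expert indices it chose (last_topk_idx) and that the model exposes its blocks as model.blocks; both names are hypothetical, and src/analyze_experts.py may work differently:

import torch

@torch.no_grad()
def expert_shares(model, token_ids, num_experts=4):
    """Fraction of top-k assignments per expert, averaged over all MoE layers."""
    shares = []

    def hook(module, args, output):
        idx = module.last_topk_idx                     # hypothetical cached routing choice
        counts = torch.bincount(idx.flatten(), minlength=num_experts).float()
        shares.append(counts / counts.sum())

    handles = [block.moe.register_forward_hook(hook) for block in model.blocks]
    model(token_ids)
    for h in handles:
        h.remove()
    return torch.stack(shares).mean(dim=0)             # one share per expert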

Usage

Training:

uv sync
uv run python -m src.train

Chat:

uv run python -m src.chat --checkpoint checkpoints/final.pt
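If you want to script generation rather than use the interactive chat, a minimal sampling loop looks roughly like the following. It assumes the model maps a (1, seq_len) tensor of token ids to logits of shape (1, seq_len, vocab_size) and that the tokenizer wraps tiktoken's encode/decode; neither detail is confirmed by the repository.

import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=100, temperature=0.8):
    ids = torch.tensor([tokenizer.encode(prompt)])
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature     # logits for the last position
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample the next token
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())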

Sample Outputs

Story prompt. The model produces TinyStories-style output with simple sentences and child-like narrative:

> Write a short story about a cat.
One day, a happy cat came to the park. The bird was very nice and he
could play with the friends. The cat wanted to play with the toys.
The cat was not happy. He played with the toy.

Code prompt. The model recognizes code patterns but produces invalid syntax:

> Write a Python function to reverse a string.
def find_string(string):
     return string.sub(string[-1] = string(string)

Chat prompt. The model defaults to code-like output even for general questions:

> What is the capital of France?
The following code to replace the data in Python? A program is a
function that makes the input used of the input input...

Project Structure

minimoe/
├── src/
│   ├── tokenizer.py        # BPE tokenizer (tiktoken)
│   ├── data.py             # Mixed dataset loader
│   ├── train.py            # Training loop
│   ├── chat.py             # Interactive chat interface
│   ├── analyze_experts.py  # Expert routing analysis
│   └── model/
│       ├── attention.py    # Multi-head attention
│       ├── experts.py      # Expert, Router, MoELayer
│       ├── transformer.py  # MoE transformer block
│       └── moe_gpt.py      # Full model
└── checkpoints/

What This Demonstrates

  • Top-k routing: each token is processed by 2 of 4 experts
  • All 4 experts remain active throughout training (no dead experts)
  • Sparse computation: 17.7M of 20.8M parameters active per forward pass

Limitations

This is an educational implementation. The model has 20M parameters, which limits its capabilities:

  • The model performs pattern matching rather than reasoning
  • Outputs tend toward repetition and topic drift on longer generations
  • The model shows bias toward code or story patterns depending on the prompt
