
MiniMoE

MiniMoE is a Mixture of Experts language model built from scratch in PyTorch. The project implements sparse expert routing and load balancing to demonstrate MoE mechanics at small scale.

Overview

Spec              | Value
------------------|------------------------
Total Parameters  | 20.8M
Active Parameters | 17.7M (~85%)
Experts per Layer | 4
Top-k Routing     | 2
Layers            | 6
Embedding Dim     | 256
Training Time     | ~1 hour (M4 Max)
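The active-parameter share follows from the routing itself: with top-2 of 4 experts, two expert networks per layer are skipped on every forward pass. A quick back-of-the-envelope check against the numbers above (the per-expert size is inferred here, not read from the code):

total, active = 20.8e6, 17.7e6
layers, num_experts, top_k = 6, 4, 2
inactive = total - active                          # ~3.1M parameters skipped per forward pass
skipped_experts = layers * (num_experts - top_k)   # 12 expert networks sit idle per pass
print(f"{inactive / skipped_experts:.2e}")         # ~2.6e5 parameters per expert (inferred)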

Architecture

The model processes input through the following stages (a code sketch of one block appears after the list):

  1. Token and Position Embeddings. The model converts input tokens to vectors and adds positional information.
  2. Transformer Blocks (x6). Each block contains:
    • A multi-head attention layer that lets tokens attend to previous tokens
    • An MoE layer where a router selects 2 of 4 experts to process each token
  3. Language Model Head. A final linear layer projects hidden states to vocabulary logits. This layer shares weights with the token embeddings.
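Putting stages 1-3 together, here is a minimal, self-contained PyTorch sketch of one transformer block: causal self-attention followed by an MoE layer in which a router picks 2 of 4 expert MLPs per token. Class names, the number of attention heads, and the expert hidden size are illustrative, not taken from the repository; the actual modules live under src/model/ (attention.py, experts.py, transformer.py).

import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A small feed-forward network; one of 4 per layer."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    """Routes each token to its top-2 experts and mixes their outputs."""
    def __init__(self, dim=256, num_experts=4, top_k=2, hidden=512):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([Expert(dim, hidden) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):                                      # x: (batch, seq, dim)
        flat = x.reshape(-1, x.size(-1))                       # treat every token independently
        probs = F.softmax(self.router(flat), dim=-1)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize the kept weights
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = idx[:, slot] == e                       # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(flat[mask])
        return out.reshape_as(x)

class MoEBlock(nn.Module):
    """One transformer block: causal self-attention followed by the MoE layer."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.moe = MoELayer(dim)

    def forward(self, x):
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out                                       # residual around attention
        x = x + self.moe(self.ln2(x))                          # residual around the MoE layer
        return x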

Key MoE Concepts

Sparse Routing. Each token is processed by only 2 of 4 experts. This reduces compute while maintaining model capacity.

Load Balancing Loss. An auxiliary loss term prevents expert collapse, where all tokens would go to only 1 or 2 experts:

aux_loss = num_experts * sum(tokens_per_expert * mean_probs)
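A sketch of how this term can be computed, assuming the router's softmax probabilities and the chosen top-k expert indices are available for a batch of tokens (variable names mirror the formula above and are not taken from the repository):

import torch

def load_balancing_loss(router_probs, topk_idx, num_experts=4):
    # router_probs: (num_tokens, num_experts) softmax output of the router
    # topk_idx:     (num_tokens, top_k) indices of the experts chosen per token
    mean_probs = router_probs.mean(dim=0)                         # average routing prob per expert
    counts = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    tokens_per_expert = counts / counts.sum()                     # fraction of assignments per expert
    return num_experts * torch.sum(tokens_per_expert * mean_probs)

The loss is smallest when assignments and router probabilities are both spread evenly across the experts, which is exactly what discourages collapse onto one or two of them.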

Expert Specialization. In theory, different experts learn different patterns from the mixed training data. In practice this implementation shows only mild specialization (see the Expert Routing Analysis below), but that is part of what the demo illustrates.

Training

The model was trained on a mix of three datasets (a loading sketch follows the list):

  • OpenAssistant for conversational examples
  • TinyStories for creative writing (capped at 50k examples)
  • Code Alpaca for code generation
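A rough sketch of how such a mixture could be assembled with the Hugging Face datasets library. The dataset IDs, the column handling, and where the 50k cap is applied are assumptions based on the description above, not the repository's src/data.py:

from datasets import load_dataset, concatenate_datasets

def build_mixed_corpus(seed=0):
    # Reduce each source to a single "text" column so the datasets can be concatenated.
    oasst = load_dataset("OpenAssistant/oasst1", split="train")                # assumed ID
    oasst = oasst.map(lambda ex: {"text": ex["text"]}, remove_columns=oasst.column_names)

    stories = load_dataset("roneneldan/TinyStories", split="train[:50000]")    # assumed ID; capped at 50k
    stories = stories.map(lambda ex: {"text": ex["text"]}, remove_columns=stories.column_names)

    code = load_dataset("sahil2801/CodeAlpaca-20k", split="train")             # assumed ID
    code = code.map(lambda ex: {"text": ex["instruction"] + "\n" + ex["output"]},
                    remove_columns=code.column_names)

    return concatenate_datasets([oasst, stories, code]).shuffle(seed=seed)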

Expert Routing Analysis

The table below shows expert activation patterns for different prompt types, averaged across all layers:

Prompt Type    | Expert 0 | Expert 1 | Expert 2 | Expert 3
---------------|----------|----------|----------|----------
Code           |   16.3%  |   25.3%  |   25.6%  |   32.7%
Story          |   22.5%  |   24.8%  |   26.2%  |   26.6%
Chat           |   23.5%  |   23.5%  |   25.4%  |   27.7%

Observations:

  • All four experts remain active throughout training. The model does not have dead experts.
  • Code prompts route 16% of tokens to Expert 0, while Story and Chat prompts route 22-23% to Expert 0.
  • Story prompts show the most even distribution across experts.
  • Expert 3 receives slightly more tokens across all categories.

You can run the analysis yourself:

uv run python -m src.analyze_experts --checkpoint checkpoints/final.pt
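For reference, routing statistics like the table above can be gathered with forward hooks on the MoE layers. The sketch below assumes each MoE layer caches the top-k expert indices it chose (last_topk_idx) and that the model exposes its blocks as model.blocks; both names are hypothetical, and src/analyze_experts.py may work differently:

import torch

@torch.no_grad()
def expert_shares(model, token_ids, num_experts=4):
    """Fraction of top-k assignments per expert, averaged over all MoE layers."""
    shares = []

    def hook(module, args, output):
        idx = module.last_topk_idx                     # hypothetical cached routing choice
        counts = torch.bincount(idx.flatten(), minlength=num_experts).float()
        shares.append(counts / counts.sum())

    handles = [block.moe.register_forward_hook(hook) for block in model.blocks]
    model(token_ids)
    for h in handles:
        h.remove()
    return torch.stack(shares).mean(dim=0)             # one share per expert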

Usage

Training:

uv sync
uv run python -m src.train

Chat:

uv run python -m src.chat --checkpoint checkpoints/final.pt
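If you want to script generation rather than use the interactive chat, a minimal sampling loop looks roughly like the following. It assumes the model maps a (1, seq_len) tensor of token ids to logits of shape (1, seq_len, vocab_size) and that the tokenizer wraps tiktoken's encode/decode; neither detail is confirmed by the repository.

import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=100, temperature=0.8):
    ids = torch.tensor([tokenizer.encode(prompt)])
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature     # logits for the last position
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample the next token
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())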

Sample Outputs

Story prompt. The model produces TinyStories-style output with simple sentences and child-like narrative:

> Write a short story about a cat.
One day, a happy cat came to the park. The bird was very nice and he
could play with the friends. The cat wanted to play with the toys.
The cat was not happy. He played with the toy.

Code prompt. The model recognizes code patterns but produces invalid syntax:

> Write a Python function to reverse a string.
def find_string(string):
     return string.sub(string[-1] = string(string)

Chat prompt. The model defaults to code-like output even for general questions:

> What is the capital of France?
The following code to replace the data in Python? A program is a
function that makes the input used of the input input...

Project Structure

minimoe/
├── src/
│   ├── tokenizer.py        # BPE tokenizer (tiktoken)
│   ├── data.py             # Mixed dataset loader
│   ├── train.py            # Training loop
│   ├── chat.py             # Interactive chat interface
│   ├── analyze_experts.py  # Expert routing analysis
│   └── model/
│       ├── attention.py    # Multi-head attention
│       ├── experts.py      # Expert, Router, MoELayer
│       ├── transformer.py  # MoE transformer block
│       └── moe_gpt.py      # Full model
└── checkpoints/

What This Demonstrates

  • Top-k routing: each token is processed by 2 of 4 experts
  • All 4 experts remain active throughout training (no dead experts)
  • Sparse computation: 17.7M of 20.8M parameters active per forward pass

Limitations

This is an educational implementation. The model has 20M parameters, which limits its capabilities:

  • The model performs pattern matching rather than reasoning
  • Outputs tend toward repetition and topic drift on longer generations
  • The model shows bias toward code or story patterns depending on the prompt
