MorphLLM is a proof-of-concept language model that grows organically during training. Instead of training a fixed-size model from scratch, MorphLLM starts small (seed model) and expands its capacity (width, depth, sparsity) based on real-time resource monitoring and loss plateau detection.
Key Features:
- Dynamic Growth:
  - Net2WiderNet: Expands embedding dimension (d_model) and FFN width.
  - Net2DeeperNet: Inserts identity-initialized layers (depth growth).
  - Sparse Scaling: Converts dense FFNs to Mixture-of-Experts (MoE) when compute allows.
- Resource Awareness: Monitors VRAM, RAM, and Compute Utilization to decide when and how to grow.
- Anti-Forgetting:
  - Rehearsal (SSR): Mixes past model generations into training batches.
  - Self-Distillation (SDFT): Uses an EMA teacher to anchor the student model.
  - Freeze-Expand-Tune: Protects learned representations during structural adaptation.
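To make the width-growth idea concrete, here is a minimal, illustrative sketch of a function-preserving Net2WiderNet expansion for a two-layer MLP. This is not MorphLLM's actual implementation: the helper name is hypothetical, and NumPy stands in for the real tensor library.

```python
import numpy as np

def net2wider(W1, b1, W2, new_width):
    """Widen the hidden layer from h to new_width units while preserving
    the network's function (Net2WiderNet-style).
    Shapes: W1 (h, d), b1 (h,), W2 (o, h)."""
    h = W1.shape[0]
    # Pick existing units to replicate for the extra capacity.
    extra = np.random.randint(0, h, size=new_width - h)
    mapping = np.concatenate([np.arange(h), extra])
    # How many copies of each original unit exist after widening.
    counts = np.bincount(mapping, minlength=h)
    W1_new = W1[mapping]              # copy incoming weights (and biases)
    b1_new = b1[mapping]
    # Split outgoing weights across the copies so the output is unchanged.
    W2_new = W2[:, mapping] / counts[mapping]
    return W1_new, b1_new, W2_new
```

Because duplicated units produce identical activations (even through ReLU), dividing their outgoing weights by the replication count leaves the network's output exactly unchanged, so training can continue without a loss spike.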
```shell
pip install -e ".[dev]"
```

Start training a small model. It will grow automatically!
```shell
morphllm train --data data/input.txt --steps 1000 --output-dir checkpoints
```

Inspect a saved checkpoint to see how much the model has grown.
```shell
morphllm status --checkpoint checkpoints/checkpoint_500.pt
```

Example output:

```
--- Checkpoint Status ---
Global Step: 500
Architecture:
  d_model: 128 (started at 64)
  n_layers: 4 (started at 2)
  n_head: 4
  vocab_size: 1000
```
You can mostly rely on the automatic controller, but you can also inspect its decision logic via the tests or tune its behavior by modifying GrowthConfig.
Control growth behavior via GrowthConfig in morphllm/controller.py:
- vram_ceiling_pct: Max VRAM usage (default 85%).
- plateau_window: Steps used to detect a loss plateau (default 50).
- enable_moe: Whether to allow spawning MoE experts.
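The documented options suggest a config shaped roughly like the dataclass below. The field names and defaults come from the list above; the surrounding structure is a sketch, not a copy of morphllm/controller.py, and the real class may have additional fields.

```python
from dataclasses import dataclass

@dataclass
class GrowthConfig:
    # Defaults mirror the documented values; other fields may exist.
    vram_ceiling_pct: float = 85.0  # max VRAM usage before growth is blocked
    plateau_window: int = 50        # steps over which a loss plateau is detected
    enable_moe: bool = True         # whether spawning MoE experts is allowed
```

A run with a lower memory budget might use `GrowthConfig(vram_ceiling_pct=70.0, enable_moe=False)`.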
- MorphLLM: Main container (Embeddings + Stack of Blocks + Head).
- DynamicTransformerBlock: Growable Attention + FFN/MoE.
- GrowthController: The "brain" that monitors resources and loss.
- Trainer: Orchestrates training, growth, and migration.
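To illustrate the kind of plateau check the GrowthController performs, here is one possible sketch of a plateau_window-based detector. This is an assumption about the logic, not the project's code; the class name and min_delta threshold are hypothetical.

```python
from collections import deque

class PlateauDetector:
    """Illustrative check: the loss over the last `window` steps has not
    improved by more than `min_delta` versus the start of the window."""

    def __init__(self, window=50, min_delta=1e-3):
        self.window = window
        self.min_delta = min_delta
        self.losses = deque(maxlen=window)

    def update(self, loss):
        """Record one training loss; return True if a plateau is detected."""
        self.losses.append(loss)
        if len(self.losses) < self.window:
            return False  # not enough history yet
        improvement = self.losses[0] - min(self.losses)
        return improvement < self.min_delta
```

When the detector fires, the controller would consult resource headroom (VRAM, RAM, compute) to choose a growth action such as widening, deepening, or spawning experts.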
Run the full test suite (including E2E integration):
```shell
pytest tests/ -v
```

Roadmap:
- Distributed Training: Support multi-GPU via FSDP (currently single-GPU/CPU).
- Advanced MoE: Implement expert parallelism.
- Quantization: Implement the PRUNE action for low-VRAM scenarios.