When Second-Order Dynamics Help and When They Don't
This is a negative-result paper. MoGRU works in a narrow niche (moderate-density discrete interference) but fails to generalize to real-world signals, long sequences, or cross-domain transfer. We publish the complete characterization so others don't have to rediscover these limits.
MoGRU augments the GRU with a velocity state and a learned per-dimension momentum gate.
Key findings:
- MoGRU achieves 0.94-1.00 accuracy on interference resistance at N <= 50 distractors, where GRU scores 0.71-0.82
- The advantage is structurally bounded: at N >= 100, the momentum buffer itself becomes corrupted
- On real-world vibration data (CWRU Bearing), momentum smooths away high-frequency fault impulses -- GRU wins at all noise levels
- Four attempted fixes (velocity clipping, velocity LayerNorm, velocity write gate, damping) all fail to close the long-range gap
- MoGRU runs 1.7-2.8x slower than GRU because it cannot use cuDNN fused kernels
Two states per cell:
- $h_t$ (position): hidden representation, same role as GRU hidden state
- $v_t$ (velocity): exponential moving average of state deltas
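As a rough illustration of how the two states could interact, the sketch below updates a velocity as an exponential moving average of state deltas and then moves the position along it. The update rule, the function name `momentum_step`, and the use of `beta` for the per-dimension gate value are assumptions made for illustration only; the actual cell lives in `mogru/mogru.py` and may differ.

```python
def momentum_step(h_prev, v_prev, h_cand, beta):
    """Hypothetical position/velocity update (not the repository's code).

    Assumed rule, per dimension i:
        v[i] = beta[i] * v_prev[i] + (1 - beta[i]) * (h_cand[i] - h_prev[i])
        h[i] = h_prev[i] + v[i]
    where h_cand is the state a plain GRU step would have produced.
    """
    v = [b * vp + (1.0 - b) * (hc - hp)
         for b, vp, hc, hp in zip(beta, v_prev, h_cand, h_prev)]
    h = [hp + vi for hp, vi in zip(h_prev, v)]
    return h, v


# Two dimensions: the first has no momentum, the second has strong momentum.
h, v = momentum_step(h_prev=[0.0, 0.0], v_prev=[0.0, 0.0],
                     h_cand=[1.0, 1.0], beta=[0.0, 0.9])
print(h)
```

Under this assumed rule, a dimension with `beta = 0` lands exactly on the candidate state, while a dimension with `beta` close to 1 moves only a small fraction of the way in a single step.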
When
MomGRU = MomentumRNN (Nguyen et al., 2020) applied to a GRU backbone: fixed
| Task | MoGRU | GRU | LSTM | MomGRU |
|---|---|---|---|---|
| Copy (acc) | 0.350 +/- 0.010* | 0.221 +/- 0.090 | 0.063 +/- 0.001 | 0.243 +/- 0.012 |
| Adding (MSE) | 0.003 +/- 0.001 | 0.000 +/- 0.000* | 0.003 +/- 0.002 | 0.003 +/- 0.001 |
| Trend (MSE) | 0.775 +/- 0.041 | 0.792 +/- 0.031 | 0.833 +/- 0.031 | 0.777 +/- 0.017 |
| Sel. Copy (acc) | 1.000 +/- 0.000 | 1.000 +/- 0.000 | 1.000 +/- 0.000 | 0.415 +/- 0.062* |
* p < 0.05 (two-sample t-test)
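The significance markers can be sanity-checked from the summary statistics alone. The sketch below computes a two-sample t statistic for the Copy task (MoGRU vs. GRU). It rests on three assumptions that the table does not state: that `+/-` is the sample standard deviation, that each model was run with 5 seeds (as the benchmark command's comment suggests), and that the unequal-variance (Welch) form is appropriate. The helper name `welch_t` is illustrative.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and approximate degrees of freedom from summary stats."""
    var_a = sd_a ** 2 / n_a  # squared standard error of mean A
    var_b = sd_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# Copy task: MoGRU 0.350 +/- 0.010 vs. GRU 0.221 +/- 0.090, assuming 5 seeds each.
t, df = welch_t(0.350, 0.010, 5, 0.221, 0.090, 5)
print(f"t = {t:.2f}, df = {df:.1f}")
```

With per-seed results available in `results/`, a library routine applied to the raw scores would be preferable to this summary-statistic approximation.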
Interference Resistance (K=5 items, hidden=128)
| N distractors | MoGRU | GRU | LSTM | Winner |
|---|---|---|---|---|
| 10 | 1.000 | 0.765 | 0.065 | MoGRU |
| 25 | 1.000 | 0.710 | 0.064 | MoGRU |
| 50 | 0.938 | 0.816 | 0.068 | MoGRU |
| 100 | 0.553 | 0.729 | 0.067 | GRU |
| 200 | 0.063 | 0.293 | 0.067 | GRU |
MoGRU wins at moderate distractor loads. At N >= 100, accumulated distractor influence overwhelms the momentum buffer.
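To make the crossover explicit, the short script below recomputes the MoGRU minus GRU margin at each distractor count and reports the pair of adjacent N values between which the sign flips. The accuracies are copied from the table above; the function name `crossover_bracket` is illustrative.

```python
# (MoGRU, GRU) accuracies copied from the interference table (K=5 items, hidden=128).
results = {
    10: (1.000, 0.765),
    25: (1.000, 0.710),
    50: (0.938, 0.816),
    100: (0.553, 0.729),
    200: (0.063, 0.293),
}


def crossover_bracket(results):
    """Return the adjacent (N_low, N_high) pair where MoGRU - GRU changes sign."""
    ns = sorted(results)
    for lo, hi in zip(ns, ns[1:]):
        margin_lo = results[lo][0] - results[lo][1]
        margin_hi = results[hi][0] - results[hi][1]
        if margin_lo > 0 >= margin_hi:
            return lo, hi
    return None


for n in sorted(results):
    mogru, gru = results[n]
    print(f"N={n:>3}  MoGRU - GRU = {mogru - gru:+.3f}")
print(crossover_bracket(results))
```

The sweep only samples five distractor counts, so this locates the crossover to an interval (between N = 50 and N = 100) rather than to a specific N.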
| Noise sigma | GRU | LSTM | MoGRU | Winner |
|---|---|---|---|---|
| 0.0 | 0.474 | 0.411 | 0.355 | GRU |
| 0.1 | 0.484 | 0.468 | 0.447 | GRU |
| 0.2 | 0.507 | 0.461 | 0.275 | GRU |
| 0.5 | 0.493 | 0.465 | 0.313 | GRU |
| 1.0 | 0.495 | 0.486 | 0.299 | GRU |
| 2.0 | 0.468 | 0.449 | 0.298 | GRU |
GRU wins at all noise levels. Momentum smooths away high-frequency fault impulses that carry diagnostic information.
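The smoothing effect can be seen in a toy example that is independent of the model. The sketch below passes a one-sample spike and a sustained step through a plain exponential moving average. The choice of `beta = 0.9` and both synthetic signals are assumptions for illustration; this is not the MoGRU velocity path or the CWRU data.

```python
def ema(signal, beta):
    """Exponential moving average: y[t] = beta * y[t-1] + (1 - beta) * x[t]."""
    out, y = [], 0.0
    for x in signal:
        y = beta * y + (1.0 - beta) * x
        out.append(y)
    return out


impulse = [0.0] * 10 + [1.0] + [0.0] * 10  # one-sample spike, loosely like a fault impact
step = [0.0] * 10 + [1.0] * 40             # slow, sustained change

peak_impulse = max(ema(impulse, beta=0.9))
final_step = ema(step, beta=0.9)[-1]
print(round(peak_impulse, 3), round(final_step, 3))
```

The spike's peak is scaled by `1 - beta`, so it survives at only about a tenth of its amplitude, while the sustained step passes through almost unchanged. An averaging state therefore favours slow trends over brief impulses.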
| Config | GRU | MoGRU | LSTM | GRU / MoGRU |
|---|---|---|---|---|
| | 203,893 | 72,588 | 129,212 | 2.8x |
| | 125,023 | 53,673 | 93,870 | 2.3x |
| | 117,604 | 54,759 | 83,009 | 2.1x |
| | 55,265 | 31,681 | 39,232 | 1.7x |
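The ratio column is GRU throughput divided by MoGRU throughput, which the following snippet recomputes from the reported figures. Rows are listed in table order, and the numbers are in whatever units the profiling script reports.

```python
# (GRU, MoGRU, LSTM) throughput per row, copied from the table in row order.
rows = [
    (203_893, 72_588, 129_212),
    (125_023, 53_673, 93_870),
    (117_604, 54_759, 83_009),
    (55_265, 31_681, 39_232),
]

# Slowdown of MoGRU relative to GRU for each configuration.
slowdowns = [gru / mogru for gru, mogru, _ in rows]
for s in slowdowns:
    print(f"{s:.1f}x")
```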
| Variant | Val Acc | Δ vs. full_mogru |
|---|---|---|
| full_mogru | 1.000 | -- |
| no_momentum | 1.000 | 0.000 |
| fixed_beta | 0.730 | -0.270 |
| no_layernorm | 1.000 | 0.000 |
| no_reset | 1.000 | 0.000 |
The learned per-dimension no_momentum (
# Install dependencies
pip install -r requirements.txt
# Run the 4-task benchmark (5 seeds each)
python -m mogru.benchmark
# Run interference resistance sweep
python -m mogru.interference_deep_dive
# Run ablation study
python -m mogru.ablation
# Run CWRU bearing benchmark (requires data, see data/cwru/README.md)
python -m mogru.bearing_benchmark

mogru/
├── README.md
├── LICENSE
├── requirements.txt
├── .gitignore
├── assets/
│   └── mogru-architecture.svg
├── paper/
│   └── mogru.tex # arXiv-ready paper
├── mogru/
│   ├── __init__.py
│   ├── mogru.py # MoGRUCell, MoGRU, MomentumGRUCell, MomentumGRU
│   ├── benchmark.py # 4-task benchmark suite
│   ├── ablation.py # 5-way ablation study
│   ├── bearing_benchmark.py # CWRU real-world fault detection
│   ├── crossover_sweep.py # Sequence length crossover analysis
│   ├── head_to_head.py # 3-way model comparison
│   ├── interference_deep_dive.py # Distractor count + items sweep
│   ├── velocity_fix_test.py # Long-range collapse fix attempts
│   ├── strategic_transfer_test.py # VICReg + LOO cross-domain transfer
│   └── experiments/
│       ├── __init__.py
│       ├── compile_results.py
│       ├── profiling.py
│       ├── real_world.py
│       └── scaling.py
├── results/
│   ├── benchmark_summary.json # Aggregated benchmark results
│   ├── profiling_results.json # Throughput profiling data
│   └── [20 per-seed JSON files] # Raw results per task per seed
└── data/
    └── cwru/
        └── README.md # CWRU dataset download instructions
If you find this work useful (even as a negative result), please cite:
@article{matthiasson2026mogru,
title = {Momentum-Gated Recurrent Unit: When Second-Order Dynamics Help and When They Don't},
author = {Matthiasson, Thor},
year = {2026},
note = {Available at \url{https://github.com/Thormatt/mogru}}
}