Thormatt/mogru

MoGRU Architecture

MoGRU: Momentum-Gated Recurrent Unit

When Second-Order Dynamics Help and When They Don't

License: MIT · Python 3.10+ · PyTorch


This is a negative-result paper. MoGRU works in a narrow niche (moderate-density discrete interference) but fails to generalize to real-world signals, long sequences, or cross-domain transfer. We publish the complete characterization so others don't have to rediscover these limits.


Overview

MoGRU augments the GRU with a velocity state and a learned per-dimension momentum gate ($\beta_t$), giving hidden states second-order dynamics analogous to momentum in SGD. The model is a strict generalization of the GRU: when $\beta_t = 0$, MoGRU recovers exact GRU behavior (modulo the stabilizing LayerNorm, which our ablations show does not impact baseline performance).

Key findings:

  • MoGRU achieves perfect accuracy on interference resistance at N <= 25 distractors and 0.938 at N = 50, where GRU scores 0.71-0.82
  • The advantage is structurally bounded: at N >= 100, the momentum buffer itself becomes corrupted
  • On real-world vibration data (CWRU Bearing), momentum smooths away high-frequency fault impulses -- GRU wins at all noise levels
  • Four attempted fixes (velocity clipping, velocity LayerNorm, velocity write gate, damping) all fail to close the long-range gap
  • MoGRU runs 1.7-2.8x slower than GRU in the CPU throughput benchmark, and it cannot use cuDNN fused kernels

Architecture

Two states per cell:

  • $h_t$ (position): hidden representation, same role as GRU hidden state
  • $v_t$ (velocity): exponential moving average of state deltas

$$
\begin{align}
[r_t, u_t] &= \sigma(W_{ru} [x_t, h_{t-1}]) & &\text{Standard GRU gates} \\
\beta_t &= \sigma(W_\beta [x_t, h_{t-1}]) & &\text{Momentum retention (novel)} \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1})) & &\text{Candidate} \\
d_t &= \tilde{h}_t - h_{t-1} & &\text{State delta} \\
v_t &= \beta_t \odot v_{t-1} + (1 - \beta_t) \odot d_t & &\text{Velocity EMA} \\
h_t &= \text{LayerNorm}(h_{t-1} + u_t \odot v_t) & &\text{Additive position step}
\end{align}
$$

When $\beta \to 0$, velocity equals the raw delta, and the update reduces to the standard GRU convex combination (inside a LayerNorm). The optimizer can "turn off" momentum per-dimension where it's unhelpful.
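The reduction to the GRU update can be checked numerically. The following is a minimal sketch, not the repository's `MoGRUCell`: it takes the gate activations and the candidate as given (the weight matrices are omitted), works elementwise on plain Python lists, and leaves out the LayerNorm. The function names `mogru_update` and `gru_update` are hypothetical.

```python
def mogru_update(h_prev, v_prev, h_cand, u, beta):
    """One elementwise MoGRU state update, before the LayerNorm.

    All arguments are equal-length lists of floats. The gates (u, beta)
    and the candidate h_cand are assumed to be already computed.
    """
    h_new, v_new = [], []
    for h, v, c, u_i, b in zip(h_prev, v_prev, h_cand, u, beta):
        d = c - h                      # state delta d_t
        v_t = b * v + (1.0 - b) * d    # velocity EMA
        v_new.append(v_t)
        h_new.append(h + u_i * v_t)    # additive position step
    return h_new, v_new


def gru_update(h_prev, h_cand, u):
    """Standard GRU convex combination, for comparison."""
    return [(1.0 - u_i) * h + u_i * c for h, c, u_i in zip(h_prev, h_cand, u)]


# Arbitrary illustrative values (not taken from any experiment).
h_prev = [0.2, -0.5, 0.9]
v_prev = [0.3, 0.1, -0.4]
h_cand = [0.7, 0.0, -0.1]
u = [0.6, 0.1, 0.8]

# With beta = 0 the old velocity is discarded and v_t equals the raw delta.
h_mogru, v_mogru = mogru_update(h_prev, v_prev, h_cand, u, beta=[0.0, 0.0, 0.0])
h_gru = gru_update(h_prev, h_cand, u)
max_gap = max(abs(a - b) for a, b in zip(h_mogru, h_gru))
print(max_gap)
```

With any nonzero `beta`, the previous velocity `v_prev` leaks into the step and the two updates diverge, which is the second-order behavior the architecture adds.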

Results

Benchmark (mean $\pm$ std over 5 seeds, $h = 128$)

MomGRU = MomentumRNN (Nguyen et al., 2020) applied to a GRU backbone: fixed $\mu = 0.9$ momentum on the input transformation, not on hidden-state dynamics.

| Task | MoGRU | GRU | LSTM | MomGRU |
| --- | --- | --- | --- | --- |
| Copy (acc) | 0.350 +/- 0.010* | 0.221 +/- 0.090 | 0.063 +/- 0.001 | 0.243 +/- 0.012 |
| Adding (MSE) | 0.003 +/- 0.001 | 0.000 +/- 0.000* | 0.003 +/- 0.002 | 0.003 +/- 0.001 |
| Trend (MSE) | 0.775 +/- 0.041 | 0.792 +/- 0.031 | 0.833 +/- 0.031 | 0.777 +/- 0.017 |
| Sel. Copy (acc) | 1.000 +/- 0.000 | 1.000 +/- 0.000 | 1.000 +/- 0.000 | 0.415 +/- 0.062* |

* p < 0.05 (two-sample t-test)
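As a rough illustration of how such a comparison can be computed from the reported summary statistics, the sketch below evaluates a two-sample t-statistic for the Copy task (MoGRU vs. GRU). It assumes the Welch (unequal-variance) form, sample standard deviations, and n = 5 seeds per model; the README does not state which variant was used or which pairs were compared, so treat this as one plausible reading rather than the exact procedure. The helper `welch_t` is hypothetical.

```python
import math


def welch_t(mean_a, std_a, mean_b, std_b, n_a=5, n_b=5):
    """Welch t-statistic and approximate degrees of freedom from summary stats."""
    var_a = std_a ** 2 / n_a   # squared standard error of mean A
    var_b = std_b ** 2 / n_b   # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# Copy task, values from the benchmark table: MoGRU vs. GRU.
t_stat, dof = welch_t(0.350, 0.010, 0.221, 0.090)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The resulting statistic would then be compared against the t distribution with `dof` degrees of freedom to obtain a p-value.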

Interference Resistance (K=5 items, hidden=128)

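| N distractors | MoGRU | GRU | LSTM | Winner |
| --- | --- | --- | --- | --- |
| 10 | 1.000 | 0.765 | 0.065 | MoGRU |
| 25 | 1.000 | 0.710 | 0.064 | MoGRU |
| 50 | 0.938 | 0.816 | 0.068 | MoGRU |
| 100 | 0.553 | 0.729 | 0.067 | GRU |
| 200 | 0.063 | 0.293 | 0.067 | GRU |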

MoGRU wins at moderate distractor loads. At N >= 100, accumulated distractor influence overwhelms the momentum buffer.
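One way to read this bound is to unroll the velocity EMA over $N$ steps. Assuming, purely for illustration, a constant retention $\beta$ in place of the learned, time-varying $\beta_t$:

$$
v_t = \beta^{N} \odot v_{t-N} + (1 - \beta) \odot \sum_{j=0}^{N-1} \beta^{j} \odot d_{t-j}
$$

The velocity held before the distractors keeps weight $\beta^{N}$, while every distractor-driven delta $d_{t-j}$ enters the sum. For example, with $\beta = 0.9$ the retained weight is roughly $5 \times 10^{-3}$ at $N = 50$ and $3 \times 10^{-5}$ at $N = 100$. The learned gate can choose $\beta_t$ closer to 1, so these numbers only indicate the trend, not the trained model's behavior.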

CWRU Bearing Fault Detection (4-class accuracy, real-world)

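| Noise sigma | GRU | LSTM | MoGRU | Winner |
| --- | --- | --- | --- | --- |
| 0.0 | 0.474 | 0.411 | 0.355 | GRU |
| 0.1 | 0.484 | 0.468 | 0.447 | GRU |
| 0.2 | 0.507 | 0.461 | 0.275 | GRU |
| 0.5 | 0.493 | 0.465 | 0.313 | GRU |
| 1.0 | 0.495 | 0.486 | 0.299 | GRU |
| 2.0 | 0.468 | 0.449 | 0.298 | GRU |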

GRU wins at all noise levels. Momentum smooths away high-frequency fault impulses that carry diagnostic information.
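The smoothing effect can be seen in a toy setting. The sketch below passes a single unit impulse through the velocity EMA with a fixed $\beta$; it is an illustration under that assumption, not a reproduction of the trained model or of the CWRU pipeline, and the helper name `ema` is hypothetical.

```python
def ema(signal, beta):
    """Velocity-style EMA: v_t = beta * v_{t-1} + (1 - beta) * d_t, with v_0 = 0."""
    out, v = [], 0.0
    for d in signal:
        v = beta * v + (1.0 - beta) * d
        out.append(v)
    return out


# A single unit impulse, standing in for a short fault transient.
impulse = [0.0] * 20
impulse[5] = 1.0

peak_smoothed = max(ema(impulse, beta=0.9))   # impulse attenuated to (1 - beta)
peak_raw = max(ema(impulse, beta=0.0))        # beta = 0 passes the impulse through
print(peak_smoothed, peak_raw)
```

With $\beta = 0.9$ the impulse's peak drops to about a tenth of its original height and is smeared over the following steps, whereas $\beta = 0$ leaves it intact.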

Throughput (tokens/sec, CPU, batch=64)

$h$ = hidden dimension, $T$ = sequence length.

| Config | GRU | MoGRU | LSTM | GRU / MoGRU |
| --- | --- | --- | --- | --- |
| $h = 64,\; T = 50$ | 203,893 | 72,588 | 129,212 | 2.8x |
| $h = 128,\; T = 50$ | 125,023 | 53,673 | 93,870 | 2.3x |
| $h = 128,\; T = 200$ | 117,604 | 54,759 | 83,009 | 2.1x |
| $h = 256,\; T = 200$ | 55,265 | 31,681 | 39,232 | 1.7x |
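The ratio column follows directly from the raw throughput numbers. A short check, using the GRU and MoGRU figures from the table above (the dictionary layout is just for illustration):

```python
# (GRU tokens/sec, MoGRU tokens/sec) per configuration, copied from the table.
throughput = {
    "h=64, T=50": (203_893, 72_588),
    "h=128, T=50": (125_023, 53_673),
    "h=128, T=200": (117_604, 54_759),
    "h=256, T=200": (55_265, 31_681),
}

# Slowdown factor of MoGRU relative to GRU, rounded to one decimal place.
ratios = {cfg: round(gru / mogru, 1) for cfg, (gru, mogru) in throughput.items()}

for cfg, ratio in ratios.items():
    print(f"{cfg}: {ratio}x")
```

The gap narrows as the hidden dimension grows, from 2.8x at $h = 64$ to 1.7x at $h = 256$.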

Ablation (selective copy, $T = 50$)

| Variant | Val Acc | $\Delta$ |
| --- | --- | --- |
| full_mogru | 1.000 | -- |
| no_momentum ($\beta = 0$) | 1.000 | 0.000 |
| fixed_beta ($\beta = 0.9$) | 0.730 | -0.270 |
| no_layernorm | 1.000 | 0.000 |
| no_reset ($r = 1$) | 1.000 | 0.000 |

The learned per-dimension $\beta_t$ gate is the critical innovation. Fixed momentum is worse than no momentum. Note the irony: no_momentum ($\beta = 0$, i.e. pure GRU) solves the task perfectly, meaning the learned gate's "success" on this task is that it learns to drive $\beta \to 0$ -- effectively turning momentum off to recover GRU behavior where momentum hurts.

Quick Start

# Install dependencies
pip install -r requirements.txt

# Run the 4-task benchmark (5 seeds each)
python -m mogru.benchmark

# Run interference resistance sweep
python -m mogru.interference_deep_dive

# Run ablation study
python -m mogru.ablation

# Run CWRU bearing benchmark (requires data, see data/cwru/README.md)
python -m mogru.bearing_benchmark

Repository Structure

mogru/
├── README.md
├── LICENSE
├── requirements.txt
├── .gitignore
├── assets/
│   └── mogru-architecture.svg
├── paper/
│   └── mogru.tex                    # arXiv-ready paper
├── mogru/
│   ├── __init__.py
│   ├── mogru.py                     # MoGRUCell, MoGRU, MomentumGRUCell, MomentumGRU
│   ├── benchmark.py                 # 4-task benchmark suite
│   ├── ablation.py                  # 5-way ablation study
│   ├── bearing_benchmark.py         # CWRU real-world fault detection
│   ├── crossover_sweep.py           # Sequence length crossover analysis
│   ├── head_to_head.py              # 3-way model comparison
│   ├── interference_deep_dive.py    # Distractor count + items sweep
│   ├── velocity_fix_test.py         # Long-range collapse fix attempts
│   ├── strategic_transfer_test.py   # VICReg + LOO cross-domain transfer
│   └── experiments/
│       ├── __init__.py
│       ├── compile_results.py
│       ├── profiling.py
│       ├── real_world.py
│       └── scaling.py
├── results/
│   ├── benchmark_summary.json       # Aggregated benchmark results
│   ├── profiling_results.json       # Throughput profiling data
│   └── [20 per-seed JSON files]     # Raw results per task per seed
└── data/
    └── cwru/
        └── README.md                # CWRU dataset download instructions

Citation

If you find this work useful (even as a negative result), please cite:

@article{matthiasson2026mogru,
  title   = {Momentum-Gated Recurrent Unit: When Second-Order Dynamics Help and When They Don't},
  author  = {Matthiasson, Thor},
  year    = {2026},
  note    = {Available at \url{https://github.com/Thormatt/mogru}}
}

License

MIT
