# SESAME: Speech Enhancement with Sparse Adaptive Mixture of Experts
SESAME is a TF-domain monaural speech enhancement model that simultaneously denoises magnitude and phase spectra. Built on the MP-SENet architecture, it replaces standard FFN layers with a Sparse Mixture-of-Experts (MoE) design for improved capacity without a proportional increase in compute.
```
Audio -> STFT -> DenseEncoder -> TSTransformerBlocks (with MoE FFN) -> MaskDecoder + PhaseDecoder -> ISTFT
```
The generator (MPNet) processes noisy magnitude and phase in parallel:
- DenseEncoder: dilated Conv2d blocks compress the TF representation
- TSTransformerBlocks: alternating time and frequency self-attention with BiGRU-based FFN
- MaskDecoder: predicts a multiplicative magnitude mask via learnable sigmoid
- PhaseDecoder: directly estimates the clean phase via atan2
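As a rough illustration of the two decoder heads described above, the sketch below shows a learnable-sigmoid magnitude mask and an atan2 phase estimate combined into a complex spectrum for the ISTFT. The class/function names and the exact slope parameterization are assumptions for illustration, not the repo's actual API:

```python
import torch


class LearnableSigmoid(torch.nn.Module):
    """Sigmoid with a learnable per-bin slope, scaled into (0, beta). Illustrative sketch."""

    def __init__(self, freq_bins, beta=2.0):
        super().__init__()
        self.beta = beta
        self.slope = torch.nn.Parameter(torch.ones(freq_bins, 1))

    def forward(self, x):
        # mask values lie in the open interval (0, beta)
        return self.beta * torch.sigmoid(self.slope * x)


def decode(noisy_mag, mask_logits, phase_real, phase_imag, mask_act):
    # MaskDecoder path: multiplicative magnitude mask
    mag = noisy_mag * mask_act(mask_logits)
    # PhaseDecoder path: direct phase estimate via atan2 of two conv outputs
    phase = torch.atan2(phase_imag, phase_real)
    # combine into a complex spectrum ready for the ISTFT
    return torch.polar(mag, phase)
```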
A MetricDiscriminator provides adversarial training signal by predicting a PESQ-proxy quality score.
The FFN in selected Transformer blocks is replaced by MoEFFN:
- Token Choice Top-2: each token selects its best 2 out of N experts via softmax gating
- Shared BiGRU backbone: all experts share a bidirectional GRU; only the projection heads are per-expert
- Switch Transformer balance loss: gradient-based load balancing (f_i * P_i penalty)
- DeepSeek-V3 adaptive bias: non-gradient bias term on routing logits, updated by load imbalance
- Router z-loss: stabilizes gating logit magnitudes (ST-MoE)
- Noise-conditioned routing: spectral magnitude is projected to a noise embedding that conditions the gate
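The routing machinery listed above can be sketched as follows. This is a minimal, self-contained illustration of Top-2 token-choice gating with the Switch-style balance loss, the ST-MoE router z-loss, and a DeepSeek-V3-style selection-only bias; function names and tensor shapes are assumptions, not the repo's code:

```python
import torch
import torch.nn.functional as F


def route_top2(logits, bias=None):
    """Token-choice Top-2 routing. logits: (tokens, experts)."""
    probs = F.softmax(logits, dim=-1)
    # DeepSeek-V3-style adaptive bias affects which experts are *selected*,
    # but the gate weights themselves come from the unbiased probabilities.
    sel_scores = probs if bias is None else probs + bias
    _, topk_idx = sel_scores.topk(2, dim=-1)
    gates = probs.gather(-1, topk_idx)
    gates = gates / gates.sum(-1, keepdim=True)  # renormalize over the 2 chosen experts
    return gates, topk_idx, probs


def aux_losses(probs, topk_idx, logits, num_experts):
    """Switch-style balance loss (f_i * P_i) and ST-MoE router z-loss."""
    # f_i: mean number of top-2 assignments per token going to expert i
    assigned = F.one_hot(topk_idx, num_experts).float().sum(dim=1)  # (tokens, experts)
    f = assigned.mean(dim=0)
    # P_i: mean gate probability mass on expert i
    P = probs.mean(dim=0)
    balance = num_experts * (f * P).sum()
    # z-loss penalizes large gating logit magnitudes for numerical stability
    z = torch.logsumexp(logits, dim=-1).pow(2).mean()
    return balance, z
```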
- Python 3.13+
- uv package manager
- CUDA-capable GPU
```
git clone https://github.com/<your-org>/sesame.git
cd sesame
uv sync
```

This project uses the VoiceBank+DEMAND dataset.
- Download and extract the dataset
- Resample all wav files to 16 kHz
- Organize the files into `data/clean/` and `data/noisy/` directories
- Create pipe-delimited file lists `data/train.txt` and `data/test.txt` (only the second field, the filename, is used)
Update the paths in `config.yaml` accordingly.
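A minimal parser matching the pipe-delimited list format described above (the helper name is illustrative, not the repo's actual loader):

```python
def read_filelist(path):
    """Read a pipe-delimited file list; only the second field (the filename) is used."""
    names = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            fields = line.split("|")
            names.append(fields[1])  # second field: the wav filename
    return names
```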
Single GPU:

```
uv run python train.py --config config.yaml
```

Multi-GPU (DDP via `torchrun`):

```
uv run torchrun --nproc-per-node=gpu train.py --config config.yaml
```

Checkpoints and training logs are saved to the `checkpoint_path` directory (default: `cp_model/`). A copy of the config is saved alongside the checkpoints.
```
uv run python inference.py --checkpoint_file cp_model/g_best
```

Options:

- `--input_noisy_wavs_dir`: override the noisy input directory
- `--input_clean_wavs_dir`: provide clean references to compute metrics (PESQ, CSIG, CBAK, COVL, SSNR, STOI)
- `--output_dir`: output directory for enhanced wav files (default: `../generated_files`)
Key sections in `config.yaml`:

| Section | Key parameters |
|---|---|
| `model` | `dense_channel`, `num_tsblocks`, `n_heads`, `compress_factor`, `beta` |
| `model.moe` | `apply_to` (layer indices), `num_experts`, `top_k`, `expert_ffn_dim`, `noise_ctx_dim` |
| `training` | `learning_rate`, `batch_size`, `epochs`, `warmup_steps`, `loss_weights` |
| `data` | `sampling_rate`, `segment_size`, `n_fft`, `hop_size`, `win_size` |
| `paths` | `checkpoint_path`, `input_clean_wavs_dir`, `input_noisy_wavs_dir` |
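A minimal `config.yaml` sketch consistent with the sections above. All values are illustrative placeholders, not the repo's shipped defaults:

```yaml
model:
  dense_channel: 64
  num_tsblocks: 4
  n_heads: 4
  compress_factor: 0.3
  beta: 2.0
  moe:
    apply_to: [1, 3]       # layer indices whose FFN is replaced by MoEFFN
    num_experts: 4
    top_k: 2
    expert_ffn_dim: 64
    noise_ctx_dim: 16
training:
  learning_rate: 0.0005
  batch_size: 4
  epochs: 100
data:
  sampling_rate: 16000
  segment_size: 32000
  n_fft: 400
  hop_size: 100
  win_size: 400
paths:
  checkpoint_path: cp_model/
  input_clean_wavs_dir: data/clean/
  input_noisy_wavs_dir: data/noisy/
```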
Set `moe.apply_to: []` or remove the `moe` section entirely to train a baseline model without MoE.
```bibtex
@article{sesame2026,
  title={{SESAME}: Speech Enhancement with Sparse Adaptive Mixture of Experts},
  author={},
  year={2026}
}
```
```bibtex
@inproceedings{lu2023mp,
  title={{MP-SENet}: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra},
  author={Lu, Ye-Xin and Ai, Yang and Ling, Zhen-Hua},
  booktitle={Proc. Interspeech},
  pages={3834--3838},
  year={2023}
}
```

- MP-SENet — base architecture
- HiFi-GAN — training utilities
- NSPP — phase estimation
- CMGAN — composite metrics implementation
- Switch Transformers — balance loss
- DeepSeek-V3 — adaptive bias balancing