SESAME

Speech Enhancement with Sparse Adaptive Mixture of Experts

SESAME is a time-frequency (TF) domain monaural speech enhancement model that simultaneously denoises magnitude and phase spectra. Built on the MP-SENet architecture, it replaces the standard FFN layers with a sparse Mixture-of-Experts (MoE) design, adding model capacity without a proportional increase in compute. This work was submitted to Interspeech 2026.

Architecture

Audio -> STFT -> DenseEncoder -> TSTransformerBlocks (with MoE FFN) -> MaskDecoder + PhaseDecoder -> ISTFT
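
The front end and back end are a plain STFT/ISTFT pair with power-law magnitude compression (compress_factor in config.yaml). A minimal sketch of the analysis side, assuming PyTorch; the parameter defaults below are illustrative, not the repository's actual values:

import torch

def stft_features(wav, n_fft=400, hop_size=100, win_size=400, compress_factor=0.3):
    # Complex spectrogram of a (batched) waveform
    window = torch.hann_window(win_size)
    spec = torch.stft(wav, n_fft, hop_length=hop_size, win_length=win_size,
                      window=window, return_complex=True)
    mag = spec.abs().pow(compress_factor)  # power-law compressed magnitude
    pha = spec.angle()                     # noisy phase, enhanced downstream
    return mag, pha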

The generator (MPNet) processes noisy magnitude and phase in parallel:

  • DenseEncoder: dilated Conv2d blocks compress the TF representation
  • TSTransformerBlocks: alternating time and frequency self-attention with BiGRU-based FFN
  • MaskDecoder: predicts a multiplicative magnitude mask via learnable sigmoid
  • PhaseDecoder: directly estimates the clean phase via atan2

A MetricDiscriminator provides an adversarial training signal by predicting a PESQ-proxy quality score.
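
For intuition, here is a minimal sketch of the two decoder output activations, assuming MP-SENet-style conventions; the exact parameterization in this repository may differ (beta corresponds to model.beta in config.yaml):

import torch
import torch.nn as nn

class LearnableSigmoid(nn.Module):
    # Mask activation bounded above by beta, with a learnable per-frequency slope
    def __init__(self, in_features, beta=2.0):
        super().__init__()
        self.beta = beta
        self.slope = nn.Parameter(torch.ones(in_features))

    def forward(self, x):
        return self.beta * torch.sigmoid(self.slope * x)

def decode_phase(real_part, imag_part):
    # PhaseDecoder emits two real-valued maps and combines them with atan2
    return torch.atan2(imag_part, real_part)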

MoE Design

The FFN in selected Transformer blocks is replaced by MoEFFN, which combines several routing techniques (sketched after this list):

  • Token Choice Top-2: each token selects its best 2 out of N experts via softmax gating
  • Shared BiGRU backbone: all experts share a bidirectional GRU; only the projection heads are per-expert
  • Switch Transformer balance loss: gradient-based load balancing (f_i * P_i penalty)
  • DeepSeek-V3 adaptive bias: non-gradient bias term on routing logits, updated by load imbalance
  • Router z-loss: stabilizes gating logit magnitudes (ST-MoE)
  • Noise-conditioned routing: spectral magnitude is projected to a noise embedding that conditions the gate
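
Here is a minimal sketch of the routing path; token logits are assumed to have shape [tokens, num_experts], the bias update rate is illustrative, and the shared-BiGRU expert bodies and noise conditioning are omitted:

import torch
import torch.nn.functional as F

def top2_route(logits, bias, update_rate=1e-3):
    # logits: [T, N] router outputs; bias: [N] non-gradient routing bias
    num_experts = logits.size(-1)
    probs = F.softmax(logits, dim=-1)                # gate probabilities P_i
    _, idx = torch.topk(logits + bias, k=2, dim=-1)  # bias shifts selection only
    gates = probs.gather(-1, idx)
    gates = gates / gates.sum(-1, keepdim=True)      # renormalized top-2 gate weights

    # Switch Transformer balance loss: N * sum_i f_i * P_i
    routed = F.one_hot(idx, num_experts).sum(dim=-2).float()  # [T, N] routing mask
    f = routed.mean(dim=0)                           # fraction of tokens per expert
    P = probs.mean(dim=0)                            # mean gate probability per expert
    balance_loss = num_experts * (f * P).sum()

    # Router z-loss (ST-MoE): penalize large gating logit magnitudes
    z_loss = torch.logsumexp(logits, dim=-1).pow(2).mean()

    # DeepSeek-V3-style adaptive bias: raise the bias of underloaded experts
    with torch.no_grad():
        bias += update_rate * torch.sign(f.mean() - f)

    return idx, gates, balance_loss, z_loss

During training, balance_loss and z_loss would be added to the generator objective with small weights (presumably via loss_weights in config.yaml).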

Requirements

  • Python 3.13+
  • uv package manager
  • CUDA-capable GPU

Installation

git clone https://github.com/<your-org>/sesame.git
cd sesame
uv sync

Dataset

This project uses the VoiceBank+DEMAND dataset.

  1. Download and extract the dataset
  2. Resample all wav files to 16 kHz
  3. Organize into data/clean/ and data/noisy/ directories
  4. Create pipe-delimited file lists data/train.txt and data/test.txt; only the second field, the filename, is used (example below)
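
For illustration, each file-list line pairs an arbitrary identifier with a filename; the entries below are hypothetical:

p226_001|p226_001.wav
p226_002|p226_002.wav

Since only the second field is read, the first field can be any identifier.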

Update paths in config.yaml accordingly.

Training

Single GPU:

uv run python train.py --config config.yaml

Multi-GPU (DDP via torchrun; --nproc-per-node=gpu launches one process per visible GPU):

uv run torchrun --nproc-per-node=gpu train.py --config config.yaml

Checkpoints and training logs are saved to the checkpoint_path directory (default: cp_model/). A copy of the config is saved alongside checkpoints.

Inference

uv run python inference.py --checkpoint_file cp_model/g_best

Options:

  • --input_noisy_wavs_dir: override noisy input directory
  • --input_clean_wavs_dir: provide clean references to compute metrics (PESQ, CSIG, CBAK, COVL, SSNR, STOI)
  • --output_dir: output directory for enhanced wav files (default: ../generated_files)
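
A full invocation with every option might look like this; the directory paths are illustrative:

uv run python inference.py --checkpoint_file cp_model/g_best --input_noisy_wavs_dir data/noisy --input_clean_wavs_dir data/clean --output_dir enhanced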

Configuration

Key sections in config.yaml:

Section     Key parameters
model       dense_channel, num_tsblocks, n_heads, compress_factor, beta
model.moe   apply_to (layer indices), num_experts, top_k, expert_ffn_dim, noise_ctx_dim
training    learning_rate, batch_size, epochs, warmup_steps, loss_weights
data        sampling_rate, segment_size, n_fft, hop_size, win_size
paths       checkpoint_path, input_clean_wavs_dir, input_noisy_wavs_dir

Set moe.apply_to: [] or remove the moe section entirely to train a baseline model without MoE.
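
For example, a baseline run could use a config along these lines (a sketch assuming the section layout above):

model:
  moe:
    apply_to: []   # no MoE layers; every block keeps its dense FFN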

Citation

@article{sesame2026,
  title={{SESAME}: Speech Enhancement with Sparse Adaptive Mixture of Experts},
  author={},
  year={2026}
}

@inproceedings{lu2023mp,
  title={{MP-SENet}: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra},
  author={Lu, Ye-Xin and Ai, Yang and Ling, Zhen-Hua},
  booktitle={Proc. Interspeech},
  pages={3834--3838},
  year={2023}
}

Acknowledgements

License

MIT
