[Blog] [Original SincNet Paper (M. Ravanelli, Y. Bengio)]
iSincNet is a fast and lightweight SincNet spectrogram vocoder: a neural network trained to reconstruct audio waveforms from their SincNet spectrogram (a real, signed 2D representation). We used the GTZAN dataset, which is the most-used public dataset for evaluation in machine listening research for music genre recognition (MGR). The files were collected in 2000-2001 from a variety of sources, including personal CDs, radio, and microphone recordings, in order to represent a variety of recording conditions (http://marsyas.info/downloads/datasets.html).
Datasets used during development:
The first 5 seconds of the audio file audio/invertibility/15033000.mp3:
| | Non-causal Encoder | Causal Encoder |
|---|---|---|
| signed values | ![]() | ![]() |
| abs values | ![]() | ![]() |
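The table contrasts a non-causal and a causal encoder. As a rough illustration of that distinction, and not the repository's actual encoder, the sketch below assumes "causal" has its usual signal-processing meaning: each output frame depends only on current and past samples. The helper `conv1d` and the toy kernel are hypothetical and written for this example only.

```python
import numpy as np

def conv1d(signal, kernel, causal):
    """Filter `signal` with `kernel`, keeping the output as long as the input."""
    k = len(kernel)
    if causal:
        # Pad only on the left: output[t] depends on signal[t-k+1 .. t].
        padded = np.pad(signal, (k - 1, 0))
    else:
        # Pad on both sides: output[t] also sees k - 1 - left future samples.
        left = (k - 1) // 2
        padded = np.pad(signal, (left, k - 1 - left))
    # Sliding dot product (cross-correlation, as in deep-learning "convolution").
    return np.array([padded[t:t + k] @ kernel for t in range(len(signal))])

# A unit impulse at t = 4 makes the timing of each response easy to read.
impulse = np.zeros(8)
impulse[4] = 1.0
kernel = np.array([1.0, 2.0, 3.0])

causal_out = conv1d(impulse, kernel, causal=True)
noncausal_out = conv1d(impulse, kernel, causal=False)

print("causal    :", causal_out)
print("non-causal:", noncausal_out)
```

In this toy setting the causal output stays at zero until the impulse arrives, while the non-causal output already reacts one sample earlier. That look-ahead is what a causal encoder gives up in exchange for being usable in streaming scenarios.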
The following table summarizes the key characteristics and access points for the available pretrained models.
All models are open-source and stored in the pretrained/ folder.
| Sample Rate | FPS | #Bins | Weights | Corpus | Causal Encoder | Scale | Open-Source |
|---|---|---|---|---|---|---|---|
| 16000 | 128 | 128 | 📦 | GTZAN | ✗ | Linear | √ |
| 16000 | 128 | 128 | 📦 | GTZAN | √ | Linear | √ |
| 16000 | 128 | 128 | 📦 | GTZAN | ✗ | Mel | √ |
| 44100 | 350 | 128 | 📦 | GTZAN | ✗ | Linear | √ |
| 44100 | 350 | 128 | 📦 | GTZAN | ✗ | Mel | √ |
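To make the Sample Rate, FPS and #Bins columns more concrete, here is a small worked calculation. It assumes that FPS means spectrogram frames per second, so that the hop between frames is `sample_rate / fps` samples, and that a spectrogram is laid out as (bins, frames); the helper `frame_geometry` is hypothetical, and the real models may pad the input and return a slightly different number of frames.

```python
def frame_geometry(sample_rate, fps, n_bins, duration_s):
    """Return (samples per frame, assumed spectrogram shape) for a clip."""
    samples_per_frame = sample_rate / fps   # hop size in samples (assumption)
    n_frames = int(duration_s * fps)        # ignores any edge padding
    return samples_per_frame, (n_bins, n_frames)

# The two (sample rate, FPS) configurations listed in the table, on a 5 s clip.
for sample_rate, fps in [(16_000, 128), (44_100, 350)]:
    hop, shape = frame_geometry(sample_rate, fps, n_bins=128, duration_s=5)
    print(f"{sample_rate} Hz @ {fps} fps: hop = {hop} samples, shape = {shape}")
```

Under these assumptions both configurations use a hop of roughly 125 samples, and a 5-second clip maps to 640 frames at 16 kHz and 1750 frames at 44.1 kHz.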
pip install -r requirements.txt
Please refer to the demo notebook, which shows how to load and use the model.
import numpy as np
import librosa
import torch
from sincnet.model import SincNet, Quantizer
from datasets.utils.waveform import WaveformLoader
SAMPLE_RATE = 16_000
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
audio_loader = WaveformLoader(sample_rate=SAMPLE_RATE)
# load the model
params = {
    "fs": SAMPLE_RATE,
    "fps": 128,
    "scale": "lin",
    "component": "complex",
}
model: SincNet = (
    SincNet(**params)
    .load_pretrained_weights(weights_folder="pretrained", verbose=False)
    .eval()
    .to(device)
)
# encode and decode an audio waveform
duration = 5
offset = 0
audio_path = ...
waveform = audio_loader.load_segment(audio_path, offset=offset, duration=duration, nchannels=1)
loudness = audio_loader.measure_loudness(waveform)
waveform = audio_loader.normalise_loudness(waveform, loudness, target_lufs=-23)
with torch.no_grad():
    audio_tensor = torch.from_numpy(waveform).to(device).float()
    spectrogram = model.encode(audio_tensor.unsqueeze(0), scale="mel")
    reconstructed_audio_tensor = model.decode(spectrogram, scale="mel")
    # (optional) elementwise quantization into a discrete vocabulary of size 2^{q_bits}
    quantizer = Quantizer(q_bits=10).to(device)
    indices = quantizer(spectrogram)
    dequantized_spectrogram = quantizer.inverse(indices)
    dequantized_audio = model.decode(dequantized_spectrogram)
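To give an intuition for the optional elementwise quantization into a vocabulary of size 2^{q_bits}, the sketch below implements a plain uniform quantizer in NumPy. It is not the repository's `Quantizer`: the value range [-1, 1], the uniform step size, and the function names `quantize` and `dequantize` are all assumptions made for illustration, and the real module may use a different scheme.

```python
import numpy as np

def quantize(x, q_bits, x_min=-1.0, x_max=1.0):
    """Map each value to an integer index in [0, 2**q_bits - 1]."""
    levels = 2 ** q_bits
    x = np.clip(x, x_min, x_max)
    scaled = (x - x_min) / (x_max - x_min) * (levels - 1)
    return np.rint(scaled).astype(np.int64)

def dequantize(indices, q_bits, x_min=-1.0, x_max=1.0):
    """Map integer indices back to the values they stand for."""
    levels = 2 ** q_bits
    return indices / (levels - 1) * (x_max - x_min) + x_min

# Stand-in for a signed spectrogram: random values in [-1, 1], 128 bins x 640 frames.
rng = np.random.default_rng(0)
spec = rng.uniform(-1.0, 1.0, size=(128, 640))

indices = quantize(spec, q_bits=10)
restored = dequantize(indices, q_bits=10)
max_error = np.abs(restored - spec).max()
print("vocabulary size:", 2 ** 10, "max abs error:", max_error)
```

With a uniform quantizer the reconstruction error of each element is bounded by half a quantization step, so increasing `q_bits` trades a larger vocabulary for a finer approximation of the spectrogram.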
- [1] Mirco Ravanelli, Yoshua Bengio, “Speaker Recognition from raw waveform with SincNet” arXiv
- [2] MS-SincResNet: Joint Learning of 1D and 2D Kernels Using Multi-scale SincNet and ResNet for Music Genre Classification arXiv
- [3] Curricular SincNet: Towards Robust Deep Speaker Recognition by Emphasizing Hard Samples in Latent Space arXiv
- [4] Interpretable SincNet-based Deep Learning for Emotion Recognition from EEG brain activity arXiv
- [5] Toward end-to-end interpretable convolutional neural networks for waveform signals arXiv
- [6] Filterbank design for end-to-end speech separation arXiv. This paper decomposes SincNet into a product sin * cos, as implemented in this repo, bridging the gap with Gabor filterbanks.
- [7] PF-Net: Personalized Filter for Speaker Recognition from Raw Waveform arXiv. This paper extends SincNet for more flexibility by allowing shapes other than the rectangular function in the spectral domain.
- [8] MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis arXiv
- [9] iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform arXiv
- [10] iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN arXiv
- [11] Deep Griffin-Lim Iteration arXiv
- [12] Mel-Spectrogram Inversion via Alternating Direction Method of Multipliers arXiv
- [13] HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis arXiv
Related discussion about SincNet vs. STFT: mravanelli/SincNet#74
- https://github.com/mravanelli/SincNet
- https://github.com/mravanelli/pytorch-kaldi
- https://github.com/PeiChunChang/MS-SincResNet
- https://github.com/ZaUt-bio/Exploring-Filters-in-SincNet-Access-and-Visualization/blob/main/SincNet_filters_visualization_initials.ipynb
- Host weights in Github and add auto-download
- Benchmark of inversion vs Griffin-Lim, iSTFTNet




