wkzng/iSincNet

iSincNet (Lightweight SincNet Spectrogram Vocoder)

[Blog] [Original SincNet Paper (M. Ravanelli, Y. Bengio)]

iSincNet is a fast and lightweight SincNet spectrogram vocoder: a neural network trained to reconstruct audio waveforms from their SincNet spectrogram (a real-valued, signed 2D representation). We used the GTZAN dataset, the most widely used public dataset for evaluating music genre recognition (MGR) in machine listening research. The files were collected in 2000-2001 from a variety of sources, including personal CDs, radio, and microphone recordings, in order to represent a variety of recording conditions (http://marsyas.info/downloads/datasets.html).

Datasets used during development:

Example Spectrogram

The first 5 seconds of the audio file audio/invertibility/15033000.mp3

[Figure: spectrograms of 15033000 from the non-causal encoder and the causal encoder, each shown as signed values and as absolute values]

🎧 Pretrained Models

The following table summarizes the key characteristics and access points for the available pretrained models. All models are open-source and stored in the pretrained/ folder.

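| Sample Rate | FPS | #Bins | Weights | Corpus | Causal Encoder | Scale | Open-Source |
|---|---|---|---|---|---|---|---|
| 16000 | 128 | 128 | 📦 | GTZAN | | Linear | |
| 16000 | 128 | 128 | 📦 | GTZAN | | Linear | |
| 16000 | 128 | 128 | 📦 | GTZAN | | Mel | |
| 44100 | 350 | 128 | 📦 | GTZAN | | Linear | |
| 44100 | 350 | 128 | 📦 | GTZAN | | Mel | |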
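
As a rough guide to what the Sample Rate, FPS and #Bins columns imply, the sketch below derives a hop size and a spectrogram shape for a clip. It assumes that FPS means spectrogram frames per second of audio and that the hop is simply sample_rate / fps; neither is confirmed by the repository, and `frame_layout` is a hypothetical helper, not part of the codebase.

```python
def frame_layout(sample_rate: int, fps: int, n_bins: int, duration_s: float) -> dict:
    """Derive hop size and spectrogram shape (assumption: hop = sample_rate / fps)."""
    hop = sample_rate / fps             # samples between consecutive frames
    n_frames = round(duration_s * fps)  # frames needed to cover the clip
    return {"hop_samples": hop, "n_frames": n_frames, "shape": (n_bins, n_frames)}


# The two (sample rate, FPS) pairs listed in the table, for a 5-second clip.
for sr, fps in [(16_000, 128), (44_100, 350)]:
    print(sr, fps, frame_layout(sr, fps, n_bins=128, duration_s=5.0))
```

Under this assumption both configurations advance by roughly the same number of samples per frame, so the higher FPS at 44.1 kHz mostly reflects the higher sample rate.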

Quick Start

pip install -r requirements.txt

Please refer to the demo notebook, which shows how to load and use the model.

import numpy as np
import librosa
import torch
from sincnet.model import SincNet, Quantizer
from datasets.utils.waveform import WaveformLoader 


SAMPLE_RATE = 16_000
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
audio_loader = WaveformLoader(sample_rate=SAMPLE_RATE) 

# load the model
params = {
    "fs": SAMPLE_RATE,
    "fps": 128,
    "scale": "lin",
    "component": "complex"
}

model : SincNet = (
    SincNet(**params)
    .load_pretrained_weights(weights_folder="pretrained", verbose=False)
    .eval()
    .to(device)
)

# encode and decode an audio waveform
duration = 5  # seconds
offset = 0    # seconds from the start of the file
audio_path = ...  # set this to the path of an audio file
waveform = audio_loader.load_segment(audio_path, offset=offset, duration=duration, nchannels=1)
loudness = audio_loader.measure_loudness(waveform)
waveform = audio_loader.normalise_loudness(waveform, loudness, target_lufs=-23)

with torch.no_grad():
    audio_tensor = torch.from_numpy(waveform).to(device).float()
    # unsqueeze(0) adds the batch dimension
    spectrogram = model.encode(audio_tensor.unsqueeze(0), scale="mel")
    reconstructed_audio_tensor = model.decode(spectrogram, scale="mel")

# (optional) elementwise quantization into a discrete vocabulary of size 2**q_bits
quantizer = Quantizer(q_bits=10).to(device)
indices = quantizer(spectrogram)
dequantized_spectrogram = quantizer.inverse(indices)
dequantized_audio = model.decode(dequantized_spectrogram)
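
To illustrate what elementwise quantization into 2**q_bits levels can look like, here is a minimal NumPy sketch of a uniform quantizer. It is not the repository's `Quantizer`: the value range `[-1, 1]`, the uniform spacing, and the helper names `quantize` and `dequantize` are assumptions made for illustration only.

```python
import numpy as np


def quantize(x: np.ndarray, q_bits: int, lo: float = -1.0, hi: float = 1.0) -> np.ndarray:
    """Map each value in [lo, hi] to an integer index in [0, 2**q_bits - 1]."""
    levels = 2 ** q_bits
    x = np.clip(x, lo, hi)                        # out-of-range values saturate
    scaled = (x - lo) / (hi - lo) * (levels - 1)  # rescale to [0, levels - 1]
    return np.rint(scaled).astype(np.int64)       # nearest level


def dequantize(indices: np.ndarray, q_bits: int, lo: float = -1.0, hi: float = 1.0) -> np.ndarray:
    """Map integer indices back to the centre value of their level."""
    levels = 2 ** q_bits
    return indices.astype(np.float64) / (levels - 1) * (hi - lo) + lo


rng = np.random.default_rng(0)
spec = rng.uniform(-1.0, 1.0, size=(128, 640))  # stand-in for a signed spectrogram
idx = quantize(spec, q_bits=10)
recon = dequantize(idx, q_bits=10)

step = 2.0 / (2 ** 10 - 1)                      # spacing between adjacent levels
max_err = np.abs(spec - recon).max()
print(idx.min(), idx.max(), max_err <= step / 2 + 1e-12)
```

With a uniform grid, the round-trip error of any in-range value is bounded by half the level spacing, so each additional bit roughly halves the worst-case error.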

Reference Papers and Related Topics

  • [1] Mirco Ravanelli, Yoshua Bengio, “Speaker Recognition from raw waveform with SincNet” Arxiv

  • [2] MS-SincResNet: Joint Learning of 1D and 2D Kernels Using Multi-scale SincNet and ResNet for Music Genre Classification Arxiv

  • [3] Curricular SincNet: Towards Robust Deep Speaker Recognition by Emphasizing Hard Samples in Latent Space Arxiv

  • [4] Interpretable SincNet-based Deep Learning for Emotion Recognition from EEG brain activity Arxiv

  • [5] Toward end-to-end interpretable convolutional neural networks for waveform signals Arxiv

  • [6] Filterbank design for end-to-end speech separation Arxiv. This paper decomposes SincNet filters into a product sin * cos, as implemented in this repo, bridging the gap with Gabor filterbanks

  • [7] PF-Net: Personalized Filter for Speaker Recognition from Raw Waveform Arxiv. This paper proposes extending SincNet for more flexibility by allowing alternative shapes to the rectangle function in the spectral domain

  • [8] MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis Arxiv

  • [9] iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform Arxiv

  • [10] iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN Arxiv

  • [11] Deep Griffin-Lim Iteration Arxiv

  • [12] Mel-Spectrogram Inversion via Alternating Direction Method of Multipliers Arxiv

  • [13] HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis Arxiv
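
The product decomposition mentioned under [6] reads most naturally as the standard identity that rewrites SincNet's difference-of-sincs band-pass filter from [1] as a low-pass sinc modulated by a cosine at the band centre. The sketch below checks that identity numerically. The cutoff values and tap count are arbitrary illustrative choices, and this is one reading of the note, not code taken from the repository.

```python
import numpy as np

f1, f2 = 0.05, 0.12     # normalized cutoffs in cycles/sample (illustrative values)
n = np.arange(-64, 65)  # symmetric filter taps

# Difference-of-sincs band-pass filter, as defined in the SincNet paper [1].
# np.sinc is the normalized sinc: sin(pi * x) / (pi * x).
diff_form = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)

# Product form: a low-pass sinc of bandwidth B = f2 - f1,
# modulated by a cosine at the centre frequency fc = (f1 + f2) / 2.
bandwidth = f2 - f1
center = (f1 + f2) / 2
prod_form = 2 * bandwidth * np.sinc(bandwidth * n) * np.cos(2 * np.pi * center * n)

print(np.allclose(diff_form, prod_form))
```

The product form makes the link to Gabor filterbanks visible: replacing the sinc envelope with a Gaussian, while keeping the cosine carrier, gives a Gabor filter.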

Related discussion about SincNet vs STFT: mravanelli/SincNet#74

Usages and Implementations around SincNet

Roadmap and project status

  • Host weights on GitHub and add auto-download
  • Benchmark of inversion vs Griffin-Lim, iSTFTNet
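
For the planned inversion benchmark, one candidate waveform-level metric is the scale-invariant signal-to-distortion ratio (SI-SDR) between the original and the reconstructed audio. The sketch below is a generic NumPy implementation. The choice of metric and the name `si_sdr` are assumptions for illustration; the repository does not state which metrics the benchmark will use.

```python
import numpy as np


def si_sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-12) -> float:
    """Scale-invariant SDR in dB between two 1-D waveforms of equal length."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to find the best-matching scale.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return float(10 * np.log10(ratio))


# Toy check: a sine wave and a rescaled, slightly noisy copy of it.
rng = np.random.default_rng(0)
t = np.arange(16_000) / 16_000
clean = np.sin(2 * np.pi * 440 * t)
noisy = 0.5 * clean + 0.01 * rng.standard_normal(clean.shape)

print(si_sdr(clean, noisy), si_sdr(clean, 2.0 * noisy))
```

Because the metric ignores overall gain, it does not penalize a vocoder for small loudness mismatches, which is convenient when inputs are loudness-normalised before encoding.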

Contributions and acknowledgment (TODO)