iSincNet (Lightweight SincNet Spectrogram Vocoder)

[Blog] [Original SincNet Paper (M. Ravanelli, Y. Bengio)]

iSincNet is a fast and lightweight SincNet spectrogram vocoder: a neural network trained to reconstruct audio waveforms from their SincNet spectrogram (a real-valued, signed 2D representation). We used the GTZAN dataset, the most widely used public dataset for evaluation in machine listening research on music genre recognition (MGR). Its files were collected in 2000-2001 from a variety of sources, including personal CDs, radio, and microphone recordings, in order to represent a variety of recording conditions (http://marsyas.info/downloads/datasets.html).

Datasets used during development:

Example Spectrogram

The first 5 seconds of the audio file audio/invertibility/15033000.mp3

[Figure: example spectrograms from the non-causal encoder and the causal encoder, each shown with signed values and with absolute values.]

🎧 Pretrained Models

The following table summarizes the key characteristics and access points for the available pretrained models. All models are open-source and stored in the pretrained/ folder.

Sample Rate	FPS	#Bins	Weights	Corpus	Causal Encoder	Scale	Open-Source
16000	128	128	📦	GTZAN	✗	Linear	√
16000	128	128	📦	GTZAN	√	Linear	√
16000	128	128	📦	GTZAN	✗	Mel	√
44100	350	128	📦	GTZAN	✗	Linear	√
44100	350	128	📦	GTZAN	✗	Mel	√
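
To make the table easier to read, here is a minimal sketch of the frame arithmetic it implies. It assumes that FPS means spectrogram frames per second and that #Bins is the number of values per frame; the helper name `frame_geometry` is hypothetical and not part of the repository.

```python
def frame_geometry(sample_rate: int, fps: int, n_bins: int, duration_s: float):
    """Return (samples per frame, number of frames, spectrogram values per second).

    Assumption: `fps` is the spectrogram frame rate and `n_bins` the frame height.
    """
    hop = sample_rate / fps            # waveform samples covered by one frame
    n_frames = int(duration_s * fps)   # frames in the clip, ignoring any padding
    values_per_second = fps * n_bins   # spectrogram entries per second of audio
    return hop, n_frames, values_per_second


# The two configurations listed in the table, applied to a 5 s clip.
for sr, fps in [(16_000, 128), (44_100, 350)]:
    hop, n_frames, vps = frame_geometry(sr, fps, n_bins=128, duration_s=5.0)
    print(f"{sr} Hz @ {fps} fps: hop={hop:g} samples, frames={n_frames}, values/s={vps}")
```

Under these assumptions, both configurations use a hop of about 125 samples, and the spectrogram holds roughly as many values per second as the waveform has samples (16384 vs. 16000, and 44800 vs. 44100).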

Quick Start

pip install -r requirements.txt

Please refer to the demo notebook (demo.ipynb), which shows how to load and use the model.

import numpy as np
import librosa
import torch
from sincnet.model import SincNet, Quantizer
from datasets.utils.waveform import WaveformLoader 


SAMPLE_RATE = 16_000
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
audio_loader = WaveformLoader(sample_rate=SAMPLE_RATE) 

# load the model
params = {
    "fs": SAMPLE_RATE,
    "fps": 128,
    "scale": "lin",
    "component": "complex"
}

model : SincNet = (
    SincNet(**params)
    .load_pretrained_weights(weights_folder="pretrained", verbose=False)
    .eval()
    .to(device)
)

# encode and decode an audio waveform
duration = 5  # seconds
offset = 0    # seconds
audio_path = ...  # path to your audio file
waveform = audio_loader.load_segment(audio_path, offset=offset, duration=duration, nchannels=1)
loudness = audio_loader.measure_loudness(waveform)
waveform = audio_loader.normalise_loudness(waveform, loudness, target_lufs=-23)

with torch.no_grad():
  audio_tensor = torch.from_numpy(waveform).to(device).float()
  spectrogram = model.encode(audio_tensor.unsqueeze(0), scale="mel")
  reconstructed_audio_tensor = model.decode(spectrogram, scale="mel")

# (optional) element-wise quantization into a discrete vocabulary of size 2^{q_bits}
quantizer = Quantizer(q_bits=10).to(device)
indices = quantizer(spectrogram)
dequantized_spectrogram = quantizer.inverse(indices)
# decode with the same scale that was used for encoding
dequantized_audio = model.decode(dequantized_spectrogram, scale="mel")
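
For intuition about the optional quantization step, the sketch below shows a generic uniform element-wise quantizer with a vocabulary of size 2^q_bits. It is not the repository's `Quantizer`: the function names, the assumed value range [-1, 1], and the rounding scheme are all assumptions made for illustration.

```python
def quantize(values, q_bits, lo=-1.0, hi=1.0):
    """Map each value to an integer index in [0, 2**q_bits - 1].

    Assumption: inputs lie in [lo, hi]; anything outside is clipped.
    """
    step = (hi - lo) / (2 ** q_bits - 1)
    indices = []
    for v in values:
        v = min(max(v, lo), hi)                 # clip to the supported range
        indices.append(round((v - lo) / step))  # nearest quantization level
    return indices


def dequantize(indices, q_bits, lo=-1.0, hi=1.0):
    """Map indices back to the centre value of their quantization level."""
    step = (hi - lo) / (2 ** q_bits - 1)
    return [lo + i * step for i in indices]


# Round trip on a few signed values, as found in a signed spectrogram.
values = [-1.0, -0.25, 0.0, 0.3, 1.0]
indices = quantize(values, q_bits=10)
restored = dequantize(indices, q_bits=10)
max_error = max(abs(a - b) for a, b in zip(values, restored))
print(indices, f"max round-trip error = {max_error:.2e}")
```

With a uniform quantizer of this kind, the round-trip error per element is at most half a quantization step, so each extra bit roughly halves the worst-case error.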

Reference Papers and Related Topics

[1] Mirco Ravanelli, Yoshua Bengio, “Speaker Recognition from raw waveform with SincNet” Arxiv
[2] MS-SincResNet: Joint Learning of 1D and 2D Kernels Using Multi-scale SincNet and ResNet for Music Genre Classification Arxiv
[3] Curricular SincNet: Towards Robust Deep Speaker Recognition by Emphasizing Hard Samples in Latent Space Arxiv
[4] Interpretable SincNet-based Deep Learning for Emotion Recognition from EEG brain activity Arxiv
[5] Toward end-to-end interpretable convolutional neural networks for waveform signals Arxiv
[6] Filterbank design for end-to-end speech separation Arxiv. This paper decomposes SincNet into a sin * cos product, as implemented in this repo, bridging the gap with Gabor filterbanks.
[7] PF-Net: Personalized Filter for Speaker Recognition from Raw Waveform Arxiv. This paper extends SincNet for more flexibility by allowing alternative shapes to the rectangular function in the spectral domain.
[8] MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis Arxiv
[9] iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform Arxiv
[10] iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN Arxiv
[11] Deep Griffin-Lim Iteration Arxiv
[12] Mel-Spectrogram Inversion via Alternating Direction Method of Multipliers Arxiv
[13] HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis Arxiv
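
The sin * cos decomposition mentioned in [6] can be checked numerically. The sketch below compares the band-pass filter from the SincNet paper [1], g[n] = 2 f2 sinc(2π f2 n) - 2 f1 sinc(2π f1 n), with the same filter written as a cosine carrier times a sine envelope, using the identity sin A - sin B = 2 cos((A+B)/2) sin((A-B)/2). The function names and cutoff values are hypothetical, the window function is omitted, and whether the repository uses exactly this form is an assumption.

```python
import math


def sinc_bandpass(n, f1, f2):
    """Band-pass tap as a difference of two low-pass sinc filters."""
    if n == 0:
        return 2.0 * (f2 - f1)  # limit of the expression as n -> 0
    return (math.sin(2 * math.pi * f2 * n) - math.sin(2 * math.pi * f1 * n)) / (math.pi * n)


def sin_cos_product(n, f1, f2):
    """Same tap written as a cosine carrier times a sine-over-n envelope."""
    if n == 0:
        return 2.0 * (f2 - f1)
    carrier = math.cos(math.pi * (f1 + f2) * n)                         # centre frequency
    envelope = 2.0 * math.sin(math.pi * (f2 - f1) * n) / (math.pi * n)  # bandwidth
    return carrier * envelope


f1, f2 = 0.05, 0.12    # hypothetical normalized cutoffs, in cycles per sample
taps = range(-32, 33)  # symmetric filter support
max_diff = max(abs(sinc_bandpass(n, f1, f2) - sin_cos_product(n, f1, f2)) for n in taps)
print(f"max difference over {len(taps)} taps: {max_diff:.2e}")
```

In the product form, the cosine term carries the centre frequency (f1 + f2) / 2 and the sine term carries the bandwidth f2 - f1, which is the structure that invites the comparison with Gabor filters.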

Repository: wkzng/iSincNet

Folders and files

Latest commit

History

Repository files navigation

iSincNet (Lightweight Sincnet Spectrogram Vocoder)

Example Spectrogram

🎧 Pretrained Models

Quick Start

References Papers and Related Topics

Usages and Implementations around SincNet

Roadmap and projects status

Contributions and acknowledgment (TODO)

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages