Official repository for the audio model ranking evaluation presented in
"Neural Proxies for Sound Synthesizers: Learning Perceptually Informed Preset Representations"
published in the Journal of the Audio Engineering Society (JAES).
This repository contains the code for evaluating pretrained audio models. The evaluation is based on a ranking experiment designed to assess the perceptual relevance of audio embedding spaces, i.e., do embedding distances reflect monotonic changes in perceptual sound attributes?
The project's main repository can be found here.
The evaluation relies on a custom dataset based on the TAL-NoiseMaker synthesizer. It can be downloaded here and should be placed in the data/ directory.
Dataset description:
- The dataset consists of 13 groups, each corresponding to a synthesizer parameter (e.g., amplitude envelope, filter cutoff, pitch).
- Each group contains 10 presets.
- For each preset, the associated parameter was monotonically increased in 20 steps.
- Bipolar parameters (centered around zero) were restricted so that all 20 values remained either above or below the midpoint.
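To make the sweep construction concrete, here is a minimal sketch of how 20 monotonically increasing values could be generated for one parameter. It assumes parameter values normalized to [0, 1] with a bipolar midpoint at 0.5 and linearly spaced steps; the function names, the normalization, and the spacing are illustrative assumptions, not taken from the dataset generation code.

```python
def sweep(lo, hi, steps=20):
    """Return `steps` linearly spaced, monotonically increasing values in [lo, hi]."""
    return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]


def bipolar_sweep(side, steps=20, midpoint=0.5):
    """Sweep a bipolar parameter on one side of its midpoint only.

    Assumption: values are normalized to [0, 1] and the midpoint itself
    may be used as an endpoint of the sweep.
    """
    if side == "upper":
        return sweep(midpoint, 1.0, steps)
    if side == "lower":
        return sweep(0.0, midpoint, steps)
    raise ValueError("side must be 'upper' or 'lower'")


unipolar_values = sweep(0.0, 1.0)        # e.g., a unipolar parameter
bipolar_values = bipolar_sweep("upper")  # never crosses the midpoint
```

With 13 groups, 10 presets per group, and 20 steps per preset, the dataset described above amounts to 13 × 10 × 20 = 2600 sounds.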
For each sound attribute, the evaluation can be described as follows:
- Extract representations from the audio model under evaluation.
- Apply a temporal reduction function across time frames.
- Compute pairwise L1 distances between presets.
- Rank the sounds relative to the minimum and maximum parameter values.
- Compute Spearman rank correlation coefficients for both rankings and average them.
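The steps above can be sketched for a single preset as follows. This is a simplified, standard-library-only illustration and not the code in src/eval.py: the function names are hypothetical, time averaging is used as the reduction, and the input is assumed to be a list of frame-level representations ordered by increasing parameter value.

```python
from math import sqrt
from statistics import mean


def reduce_time(frames):
    """Average frame-level vectors over time, giving one vector per sound."""
    return [mean(dim) for dim in zip(*frames)]


def l1(a, b):
    """L1 (Manhattan) distance between two vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation, i.e., Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # undefined for constant input; 0.0 is a convention of this sketch
    return cov / sqrt(vx * vy)


def ranking_score(sounds):
    """Average Spearman correlation for one preset's parameter sweep."""
    emb = [reduce_time(frames) for frames in sounds]
    dist = [[l1(a, b) for b in emb] for a in emb]  # pairwise L1 distances
    steps = list(range(len(emb)))
    # Distances to the minimum-value sound should grow with the step index.
    rho_min = spearman(dist[0], steps)
    # Distances to the maximum-value sound should shrink with the step index.
    rho_max = spearman(dist[-1], steps[::-1])
    return (rho_min + rho_max) / 2


# Toy sweep: 20 sounds, 3 frames each, features growing with the step index.
toy = [[[0.1 * k, 0.2 * k]] * 3 for k in range(20)]
score = ranking_score(toy)
```

A score close to 1 means that distances in the embedding space order the sounds in the same way as the underlying parameter does, from both ends of the sweep.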
Seven popular pretrained audio model families are evaluated, plus hand-crafted baselines:
- AudioMAE
- CLAP
- DAC
- EfficientAT (the mn prefix indicates a MobileNetV3 backbone)
- M2L
- OpenL3
- PaSST
- Baselines: (i) 128-bin Mel spectrogram (mel128); (ii) MFCCs of 40 bands (mfcc40); (iii) a multiresolution log-spectrogram (mstft).
→ EfficientAT and PaSST use a combination of hand-crafted features (time-averaged Mel-spectrograms) and learned features, while the other models rely only on learned features.
- nop: Concatenate all frame-level representations.
- avg time: Average across time frames → length-independent representation.
- CLAP already produces time-averaged embeddings.
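The difference between the two reductions can be illustrated with a short sketch. The function names below are hypothetical and the frames are plain Python lists; the repository's own implementation may differ.

```python
def reduce_nop(frames):
    """Concatenate all frame-level vectors into one long vector."""
    return [x for frame in frames for x in frame]


def reduce_avg_time(frames):
    """Average each feature dimension across time frames."""
    return [sum(dim) / len(frames) for dim in zip(*frames)]


short_clip = [[1.0, 2.0], [3.0, 4.0]]   # 2 frames, 2 features
long_clip = short_clip + [[5.0, 6.0]]   # 3 frames, 2 features

flat_short = reduce_nop(short_clip)     # length grows with the number of frames
flat_long = reduce_nop(long_clip)
avg_short = reduce_avg_time(short_clip)  # length equals the feature dimension
avg_long = reduce_avg_time(long_clip)
```

Concatenation keeps temporal detail but only allows distances between sounds with the same number of frames, whereas averaging yields vectors of a fixed size regardless of duration.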
git clone https://github.com/pcmbs/synth-proxy_audio-model-selection.git
cd synth-proxy_audio-model-selection
pip install -r requirements.txt
Download the custom TAL-NoiseMaker dataset here. After downloading, place the dataset in the data/ directory.
Example command (EfficientAT models only):
python src/eval.py -m model="glob(mn*,exclude=*_as)" distance_fn="glob(*)" reduce_fn="glob(*,exclude=identity)"
All models:
python src/eval.py -m model="glob(*,exclude=[clap_*,*0_as,*4_as])" distance_fn="glob(*)" reduce_fn="glob(*,exclude=identity)" ; python src/eval.py -m model="glob([clap_*,*0_as,*4_as])" distance_fn="glob(*)" reduce_fn="identity"
Results will be generated in the logs/ directory, and can also be accessed via WandB.
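The glob(...) overrides in the commands above select which configurations are swept: a name is included when it matches at least one include pattern and no exclude pattern. The sketch below imitates that selection rule with Python's fnmatch module to show which names a pattern would pick; it is not Hydra's implementation, and the configuration names are made up for illustration.

```python
from fnmatch import fnmatchcase


def glob_select(names, include, exclude=()):
    """Keep names matching any include pattern and no exclude pattern."""
    return [
        name
        for name in names
        if any(fnmatchcase(name, pat) for pat in include)
        and not any(fnmatchcase(name, pat) for pat in exclude)
    ]


# Hypothetical configuration names, for illustration only.
names = ["mn_small", "mn_small_as", "clap_x", "mel128"]

# Mirrors model="glob(mn*,exclude=*_as)"
efficientat_only = glob_select(names, ["mn*"], ["*_as"])

# Mirrors a pattern of the form model="glob(*,exclude=[clap_*,*_as])"
without_clap = glob_select(names, ["*"], ["clap_*", "*_as"])
```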
@article{combes2025neural,
author={Combes, Paolo and Weinzierl, Stefan and Obermayer, Klaus},
journal={Journal of the Audio Engineering Society},
title={Neural Proxies for Sound Synthesizers: Learning Perceptually Informed Preset Representations},
year={2025},
volume={73},
issue={9},
pages={561-577},
month={september},
} 