Neural Proxies for Sound Synthesizers:
Perceptually Informed Preset Representations

Official repository for the audio model ranking evaluation presented in
"Neural Proxies for Sound Synthesizers: Learning Perceptually Informed Preset Representations"
published in the Journal of the Audio Engineering Society (JAES).

Audio Model Ranking Evaluation

This repository contains the code for evaluating pretrained audio models. The evaluation is based on a ranking experiment designed to assess the perceptual relevance of audio embedding spaces, i.e., do embedding distances reflect monotonic changes in perceptual sound attributes?

The project's main repository can be found here.

Experimental Design

Dataset

The evaluation relies on a custom dataset based on the TAL-NoiseMaker synthesizer. It can be downloaded here and should be placed in the data/ directory.

Dataset description:

The dataset consists of 13 groups, each corresponding to a synthesizer parameter (e.g., amplitude envelope, filter cutoff, pitch).
Each group contains 10 presets.
For each preset, the associated parameter was monotonically increased in 20 steps.
Bipolar parameters (centered around zero) were restricted so that all 20 values remained either above or below the midpoint.
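
To make the dataset layout concrete, the following is a minimal sketch of how one parameter sweep could be generated and how many sounds the dataset holds in total. It assumes parameter values normalized to [0, 1] with a bipolar midpoint at 0.5 and evenly spaced steps; neither assumption is stated in this README, and `sweep_values` is a hypothetical helper, not a function from this repository.

```python
N_GROUPS = 13    # one group per synthesizer parameter
N_PRESETS = 10   # presets per group
N_STEPS = 20     # parameter values per preset


def sweep_values(low, high, n_steps=N_STEPS):
    """Return n_steps evenly spaced, strictly increasing values from low to high.

    Even spacing is an assumption made for illustration only.
    """
    step = (high - low) / (n_steps - 1)
    return [low + i * step for i in range(n_steps)]


# Unipolar parameter: sweep the full (assumed) normalized range.
unipolar = sweep_values(0.0, 1.0)

# Bipolar parameter: stay on one side of the (assumed) midpoint 0.5.
bipolar_upper = sweep_values(0.5, 1.0)
bipolar_lower = sweep_values(0.0, 0.5)

# Total number of sounds implied by the description above.
total_sounds = N_GROUPS * N_PRESETS * N_STEPS
print(total_sounds)
```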

Evaluation procedure

For each sound attribute, the evaluation can be described as follows:

Extract representations from the audio model under evaluation.
Apply a temporal reduction function across time frames.
Compute pairwise L1 distances between presets.
Rank the sounds relative to the minimum and maximum parameter values.
Compute Spearman rank correlation coefficients for both rankings and average them.
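
The steps above can be sketched for a single parameter sweep as follows. This is an illustrative, dependency-free reading of the procedure rather than the code in src/eval.py: the function names are hypothetical, the reference sound is kept in each ranking, and the embeddings are assumed to be already reduced to one vector per sound.

```python
from statistics import mean


def l1_distance(a, b):
    """Sum of absolute element-wise differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def ranking_score(embeddings):
    """Score one sweep; embeddings[i] is the vector for the i-th parameter step."""
    steps = list(range(len(embeddings)))
    # Distances to the minimum-value sound should grow with the step index.
    dist_from_min = [l1_distance(embeddings[0], e) for e in embeddings]
    # Distances to the maximum-value sound should shrink with the step index.
    dist_from_max = [l1_distance(embeddings[-1], e) for e in embeddings]
    rho_min = spearman(dist_from_min, steps)
    rho_max = spearman(dist_from_max, steps[::-1])
    return (rho_min + rho_max) / 2


# Toy sweep whose embedding moves monotonically with the parameter.
monotonic_sweep = [[float(i), 2.0 * i] for i in range(20)]
print(ranking_score(monotonic_sweep))
```

An embedding space that tracks the parameter perfectly yields a score close to 1, while one that scrambles the order of the steps scores lower.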

Models evaluated

Seven popular pretrained audio model families, plus hand-crafted baselines:

AudioMAE
CLAP
DAC
EfficientAT (the mn prefix indicates a MobileNetV3 backbone)
M2L
OpenL3
PaSST
Baselines: (i) 128-bin Mel spectrogram (mel128); (ii) MFCCs of 40 bands (mfcc40); (iii) a multiresolution log-spectrogram (mstft).

→ EfficientAT and PaSST use a combination of hand-crafted features (time-averaged Mel-spectrograms) and learned features, while the other models rely only on learned features.

Temporal reduction functions

nop: Concatenate all frame-level representations.
avg time: Average across time frames → length-independent representation.
CLAP already produces time-averaged embeddings.
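
As a minimal sketch of the two reductions, assume the model output is a list of per-frame feature vectors. The function names `reduce_nop` and `reduce_avg_time` are hypothetical and only mirror the descriptions above; the repository's own implementation may differ.

```python
def reduce_nop(frames):
    """Concatenate all frame-level vectors; the length grows with the number of frames."""
    return [value for frame in frames for value in frame]


def reduce_avg_time(frames):
    """Average each feature dimension over time; the length is independent of duration."""
    n_frames = len(frames)
    return [sum(column) / n_frames for column in zip(*frames)]


# Three time frames with two features each.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flat = reduce_nop(frames)         # six values: one per frame and feature
pooled = reduce_avg_time(frames)  # two values: one per feature
```

Concatenation preserves temporal structure at the cost of a representation whose size depends on the audio length, whereas averaging discards temporal order but yields a fixed-size vector.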

Results

Usage

Installation

git clone https://github.com/pcmbs/synth-proxy_audio-model-selection.git
cd synth-proxy_audio-model-selection
pip install -r requirements.txt

Dataset

Download the custom TAL-NoiseMaker dataset here. After downloading, place the dataset in the data/ directory.

Running evaluation

Example command (EfficientAT models only):

python src/eval.py -m model="glob(mn*,exclude=*_as)" distance_fn="glob(*)" reduce_fn="glob(*,exclude=identity)"

All models:

python src/eval.py -m model="glob(*,exclude=[clap_*,*0_as,*4_as])" distance_fn="glob(*)" reduce_fn="glob(*,exclude=identity)" ; python src/eval.py -m model="glob([clap_*,*0_as,*4_as])" distance_fn="glob(*)" reduce_fn="identity"

Results

Results are written to the logs/ directory and can also be accessed via WandB.

Citation

@article{combes2025neural,
  author={Combes, Paolo and Weinzierl, Stefan and Obermayer, Klaus},
  journal={Journal of the Audio Engineering Society},
  title={Neural Proxies for Sound Synthesizers: Learning Perceptually Informed Preset Representations},
  year={2025},
  volume={73},
  issue={9},
  pages={561-577},
  month={september},
}
