
feat: Add CLAP (Contrastive Language-Audio Pretraining) support #1137

Open
JeniaJitsev wants to merge 1 commit into mlfoundations:main from SLAMPAI:clap_v2_exp

Conversation

@JeniaJitsev
Contributor

Summary

Add audio-text contrastive learning (CLAP) to open_clip, extending it beyond vision-text to support audio-text modality pairs.

Audio encoders

  • HTSAT (tiny/base/large) with attentional feature fusion (DAF/AFF/iAFF)
  • Whisper (tiny/base/small/medium/large) with optional pretrained OpenAI weight loading

Text encoder

  • RoBERTa-base via HuggingFace transformers (mean_pooler / cls_pooler; both pooling strategies are sketched below)
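
For reference, the two pooling options reduce RoBERTa's per-token hidden states to a single text embedding roughly as follows. This is a generic sketch of masked mean pooling and CLS pooling, not the PR's exact pooler classes:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Average token embeddings, ignoring padded positions.
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)

def cls_pool(last_hidden_state: torch.Tensor) -> torch.Tensor:
    # Take the embedding of the first (<s>) token.
    return last_hidden_state[:, 0]
```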

Training pipeline

  • WebDataset audio pipeline (--dataset-type webdataset-audio); an example command combining these flags follows this list
  • Synthetic audio dataset for CI/smoke tests (--dataset-type synthetic-audio)
  • Configurable preprocessing: --data-truncating {rand_trunc,trunc,fusion}, --data-filling {pad,repeat,repeatpad}
  • Fusion mode for longer audio clips (--enable-fusion --fusion-type {daf_2d,aff_2d,iaff_2d,...})
  • Pretrained audio encoder loading (--pretrained-audio)
  • Audio-aware profiling
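
To show how these options compose, here is a hedged example invocation. The entry-point module and the model config name "CLAP-HTSAT-tiny" are assumptions made for illustration; the audio-specific flags are the ones listed above, and --train-data, --batch-size, and --epochs are standard open_clip training flags.

```bash
# Illustrative only: the training entry point and the model name
# "CLAP-HTSAT-tiny" are assumptions, not taken from this PR.
python -m open_clip_train.main \
    --model CLAP-HTSAT-tiny \
    --dataset-type webdataset-audio \
    --train-data "/data/audio-shards/{00000..00099}.tar" \
    --data-truncating fusion \
    --data-filling repeatpad \
    --enable-fusion \
    --fusion-type aff_2d \
    --pretrained-audio /path/to/htsat_checkpoint.pt \
    --batch-size 64 \
    --epochs 1
```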

Model configs

  • 13 HTSAT/CLAP/Whisper JSON model configurations (a schematic example follows this list)
  • CLAP model class (mirrors CustomTextCLIP, with an audio tower in place of the vision tower)
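
For orientation, a schematic of what one of these configs might contain, mirroring open_clip's existing embed_dim/text_cfg layout with an audio_cfg block in place of vision_cfg. It is rendered here as a Python dict so the guesses can be commented; every key and value is an illustrative assumption, not the PR's actual JSON schema.

```python
# Illustrative guess at the config layout; all keys and values below are
# assumptions, not copied from the PR's JSON files.
clap_htsat_tiny_cfg = {
    "embed_dim": 512,
    "audio_cfg": {               # stands in for vision_cfg in audio-text models
        "model_type": "htsat",
        "model_size": "tiny",
        "enable_fusion": False,
        "fusion_type": "aff_2d",
    },
    "text_cfg": {
        "hf_model_name": "roberta-base",
        "hf_tokenizer_name": "roberta-base",
        "pooler_type": "mean_pooler",
    },
}
```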

Key design decisions

  • CLAP class returns an image_features key for ClipLoss compatibility, so the loss/training pipeline needs zero changes (see the sketch after this list)
  • AudioTower wraps encoders (HTSAT or Whisper) with a 2-layer MLP projection
  • Audio models are automatically excluded from vision test suites
  • SyntheticAudioDataset generates random waveforms for CI without audio dependencies
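
A minimal sketch of the two design points above: the 2-layer MLP projection in AudioTower and the ClipLoss-compatible output dict. The class names CLAP and AudioTower are from this PR, but the constructor signatures and internals are assumptions, not the PR's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioTower(nn.Module):
    """Sketch: wraps an audio encoder (HTSAT or Whisper) and projects its
    pooled output into the joint embedding space with a 2-layer MLP."""

    def __init__(self, encoder: nn.Module, encoder_width: int, embed_dim: int):
        super().__init__()
        self.encoder = encoder
        self.proj = nn.Sequential(
            nn.Linear(encoder_width, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        return self.proj(self.encoder(audio))

class CLAP(nn.Module):
    """Sketch: audio-text model whose output dict mimics CustomTextCLIP,
    so the existing ClipLoss consumes it unchanged."""

    def __init__(self, audio_tower: nn.Module, text_tower: nn.Module):
        super().__init__()
        self.audio = audio_tower
        self.text = text_tower
        # log(1/0.07), open_clip's usual logit-scale initialization
        self.logit_scale = nn.Parameter(torch.ones([]) * 2.6593)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> dict:
        audio_features = F.normalize(self.audio(audio), dim=-1)
        text_features = F.normalize(self.text(text), dim=-1)
        # Audio embeddings are returned under "image_features" so that
        # ClipLoss and the training loop need no modification.
        return {
            "image_features": audio_features,
            "text_features": text_features,
            "logit_scale": self.logit_scale.exp(),
        }
```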

Test plan

  • test_training_clap passes (HTSAT-tiny, synthetic-audio, 1 epoch, CPU); a sample invocation follows this list
  • Whisper-tiny smoke test passes (synthetic-audio, 1 epoch, CPU)
  • Pretrained Whisper weight loading verified (67/70 keys for tiny, 187/190 for small)
  • Whisper layer counts verified against official OpenAI spec (4/6/12/24/32)
  • Existing vision tests unaffected (audio models filtered out)
  • Multi-GPU distributed training (tested on an NVIDIA GH200 cluster, up to 1024 GPUs)
  • Evaluation on the UrbanSound8K, AudioCaps, and Clotho benchmarks (via a custom fork of CLIP Benchmark)
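
To reproduce the smoke test locally, something along these lines should work. Only the test name test_training_clap is taken from this PR; the rest of the invocation is a guess at the usual pytest workflow.

```bash
# Run only the CLAP smoke test by name; -x stops on the first failure.
pytest -k "test_training_clap" -x
```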

Add audio-text contrastive learning (CLAP) to open_clip, extending it
beyond vision-text to support audio-text modality pairs.

Audio encoders:
- HTSAT (tiny/base/large) with attentional feature fusion (DAF/AFF/iAFF)
- Whisper (tiny/base/small) audio encoder
- Configurable mel spectrogram preprocessing with truncation modes
  (rand_trunc, trunc, fusion)

Text encoder:
- RoBERTa-base via HuggingFace transformers (mean_pooler/cls_pooler)

Training pipeline:
- WebDataset audio pipeline (--dataset-type webdataset-audio)
- Synthetic audio dataset for CI/smoke tests (--dataset-type synthetic-audio)
- Audio batch handling in train/eval loops
- Fusion mode for longer audio clips (--enable-fusion)
- Audio-aware profiling

Model configs:
- 13 HTSAT/CLAP/Whisper model configuration JSONs
- Support for pretrained audio checkpoints (--pretrained-audio)

Tests:
- Audio model filtering in inference regression tests
- CLAP smoke test (test_training_clap) with synthetic-audio
- Audio truncation/preprocessing tests

With assistance from Claude Opus 4.6
