feat: Add CLAP (Contrastive Language-Audio Pretraining) support #1137
Open
JeniaJitsev wants to merge 1 commit into mlfoundations:main from
Conversation
Add audio-text contrastive learning (CLAP) to open_clip, extending it beyond vision-text to support audio-text modality pairs.

Audio encoders:
- HTSAT (tiny/base/large) with attentional feature fusion (DAF/AFF/iAFF)
- Whisper (tiny/base/small) audio encoder
- Configurable mel spectrogram preprocessing with truncation modes (rand_trunc, trunc, fusion); a preprocessing sketch follows below

Text encoder:
- RoBERTa-base via HuggingFace transformers (mean_pooler/cls_pooler)

Training pipeline:
- WebDataset audio pipeline (--dataset-type webdataset-audio)
- Synthetic audio dataset for CI/smoke tests (--dataset-type synthetic-audio)
- Audio batch handling in train/eval loops
- Fusion mode for longer audio clips (--enable-fusion)
- Audio-aware profiling

Model configs:
- 13 HTSAT/CLAP/Whisper model configuration JSONs
- Support for pretrained audio checkpoints (--pretrained-audio)

Tests:
- Audio model filtering in inference regression tests
- CLAP smoke test (test_training_clap) with synthetic-audio
- Audio truncation/preprocessing tests

With assistance from Claude Opus 4.6
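The truncation modes listed above are easiest to see on a toy example. The sketch below is not code from this PR; it is a minimal illustration, assuming torchaudio-style mel spectrograms, of how rand_trunc and trunc could differ when fitting variable-length audio to a fixed window (fusion is omitted). The function and parameter names (truncate_waveform, max_samples) are hypothetical.

```python
# Illustrative sketch only: rand_trunc vs. trunc handling of variable-length
# audio before computing a fixed-size mel spectrogram. Names are hypothetical.
import random

import torch
import torchaudio


def truncate_waveform(waveform: torch.Tensor, max_samples: int, mode: str = "rand_trunc") -> torch.Tensor:
    """Crop or pad a mono waveform of shape (num_samples,) to exactly max_samples."""
    n = waveform.shape[-1]
    if n > max_samples:
        if mode == "rand_trunc":
            # random crop: pick a random window of max_samples
            start = random.randint(0, n - max_samples)
        else:  # "trunc": always keep the beginning of the clip
            start = 0
        return waveform[..., start:start + max_samples]
    # shorter clips are zero-padded on the right
    return torch.nn.functional.pad(waveform, (0, max_samples - n))


# 12 s of fake 48 kHz audio, cropped to 10 s, then turned into a log-mel spectrogram
mel = torchaudio.transforms.MelSpectrogram(sample_rate=48_000, n_fft=1024, hop_length=480, n_mels=64)
wav = torch.randn(48_000 * 12)
clip = truncate_waveform(wav, max_samples=48_000 * 10, mode="rand_trunc")
spec = torch.log(mel(clip) + 1e-6)  # shape: (n_mels, time)
```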
Contributor
Author
Using implementations from https://github.com/marianna13/open_clip/tree/clap. @rwightman @rom1504 @marianna13
Summary

Add audio-text contrastive learning (CLAP) to open_clip, extending it beyond vision-text to support audio-text modality pairs.

Audio encoders

- HTSAT (tiny/base/large) with attentional feature fusion (DAF/AFF/iAFF)
- Whisper (tiny/base/small) audio encoder
- Configurable mel spectrogram preprocessing

Text encoder

- RoBERTa-base via HuggingFace transformers (mean_pooler/cls_pooler)

Training pipeline

- WebDataset audio pipeline (--dataset-type webdataset-audio)
- Synthetic audio dataset for CI/smoke tests (--dataset-type synthetic-audio)
- Truncation and filling options for variable-length audio (--data-truncating {rand_trunc,trunc,fusion}, --data-filling {pad,repeat,repeatpad})
- Fusion mode for longer audio clips (--enable-fusion --fusion-type {daf_2d,aff_2d,iaff_2d,...})
- Support for pretrained audio checkpoints (--pretrained-audio)

Model configs

- 13 HTSAT/CLAP/Whisper model configuration JSONs
- CLAP model class (mirrors CustomTextCLIP, with an audio tower instead of a vision tower)

Key design decisions

- The CLAP class returns an image_features key for ClipLoss compatibility, so the loss/training pipeline needs zero changes (see the sketch after this list)
- AudioTower wraps the encoders (HTSAT or Whisper) with a 2-layer MLP projection
- SyntheticAudioDataset generates random waveforms for CI without audio dependencies
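The key-naming trick in the first bullet is easy to illustrate. The sketch below is not the PR's implementation; it only shows, with assumed class and argument names (ToyCLAP, clip_style_loss), how a model that emits image_features, text_features, and logit_scale can feed an unmodified CLIP-style contrastive loss even though the "image" features come from audio.

```python
# Illustrative sketch, not the PR's code: an audio-text model that reuses a
# CLIP-style contrastive loss unchanged by labelling its audio embeddings
# "image_features". All names here are assumptions.
import torch
import torch.nn.functional as F
from torch import nn


class ToyCLAP(nn.Module):
    def __init__(self, audio_dim=64, text_dim=32, embed_dim=16):
        super().__init__()
        # stand-ins for the HTSAT/Whisper audio tower and the RoBERTa text tower;
        # the audio side ends in a 2-layer MLP projection, as described above
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ln(1/0.07), as in CLIP

    def forward(self, audio, text):
        return {
            # audio embeddings reported under the "image_features" key so an
            # existing CLIP loss / training loop keeps working unmodified
            "image_features": F.normalize(self.audio_proj(audio), dim=-1),
            "text_features": F.normalize(self.text_proj(text), dim=-1),
            "logit_scale": self.logit_scale.exp(),
        }


def clip_style_loss(image_features, text_features, logit_scale):
    # standard symmetric InfoNCE over the in-batch similarity matrix
    logits = logit_scale * image_features @ text_features.t()
    labels = torch.arange(logits.shape[0])
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2


model = ToyCLAP()
out = model(torch.randn(8, 64), torch.randn(8, 32))
loss = clip_style_loss(**out)  # the loss never needs to know the features came from audio
```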
Test plan

- test_training_clap passes (HTSAT-tiny, synthetic-audio, 1 epoch, CPU)
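For context on how a synthetic-audio smoke test can avoid audio files and decoders entirely, here is a minimal sketch of a random-waveform dataset. It is hypothetical (ToySyntheticAudioDataset and its fields are not the PR's identifiers), meant only to show the idea behind the synthetic-audio dataset type used in the test plan.

```python
# Illustrative sketch of a synthetic audio dataset for CI smoke tests: random
# waveforms paired with blank token sequences, so no audio I/O is required.
# Class and field names are assumptions, not the PR's actual code.
import torch
from torch.utils.data import DataLoader, Dataset


class ToySyntheticAudioDataset(Dataset):
    def __init__(self, num_samples=32, sample_rate=48_000, duration_s=5, context_length=77):
        self.num_samples = num_samples
        self.num_audio_samples = sample_rate * duration_s
        self.context_length = context_length

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        waveform = torch.randn(self.num_audio_samples)               # fake mono audio
        tokens = torch.zeros(self.context_length, dtype=torch.long)  # blank "caption"
        return waveform, tokens


# a tiny loader, enough for a 1-epoch CPU smoke test of the training loop
loader = DataLoader(ToySyntheticAudioDataset(num_samples=8), batch_size=4)
waveforms, tokens = next(iter(loader))
```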