Cross-modal alignment framework that maps RoBERTa text embeddings into the DINOv2 vision latent space using several adapter architectures, Orthogonal Procrustes initialization, and contrastive learning.
pytorch multimodal-learning roberta cross-modal-retrieval contrastive-learning procrustes-analysis dinov2 latent-space-alignment embedding-mapping infonce-loss
Updated Apr 3, 2026 - Jupyter Notebook
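
The description names two techniques that are easy to show in miniature: an Orthogonal Procrustes initialization and an InfoNCE contrastive loss. The following is a minimal sketch, not the repository's code. It uses NumPy rather than PyTorch to stay compact, random synthetic vectors stand in for RoBERTa and DINOv2 embeddings, both spaces are assumed to share the same dimension `d`, and the function names (`procrustes_init`, `info_nce`) and the temperature of 0.07 are illustrative assumptions.

```python
import numpy as np

def procrustes_init(text_emb, image_emb):
    """Orthogonal W minimizing ||text_emb @ W - image_emb||_F (hypothetical helper)."""
    # Closed-form solution: SVD of the cross-covariance matrix, then drop the singular values.
    u, _, vt = np.linalg.svd(text_emb.T @ image_emb)
    return u @ vt

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(mapped_text, image_emb, temperature=0.07):
    """Symmetric InfoNCE; row i of each input is assumed to be a matched pair."""
    logits = (l2_normalize(mapped_text) @ l2_normalize(image_emb).T) / temperature

    def diagonal_cross_entropy(z):
        # Numerically stable log-softmax over each row; the target is the diagonal entry.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the text-to-image and image-to-text directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Synthetic stand-in data: "image" vectors are a hidden rotation of "text" vectors plus noise.
rng = np.random.default_rng(0)
d = 16  # assumed shared embedding dimension
text = rng.normal(size=(64, d))
true_rot, _ = np.linalg.qr(rng.normal(size=(d, d)))
image = text @ true_rot + 0.01 * rng.normal(size=(64, d))

W = procrustes_init(text, image)
loss_identity = info_nce(text, image)        # no alignment applied
loss_procrustes = info_nce(text @ W, image)  # Procrustes-aligned text embeddings
print(f"InfoNCE without alignment: {loss_identity:.3f}")
print(f"InfoNCE after Procrustes:  {loss_procrustes:.3f}")
```

In a setup like the one described, a matrix such as `W` would plausibly serve as the starting weights of a linear adapter that is then fine-tuned with the contrastive loss. If the text and vision dimensions differ, the square orthogonal solution above no longer applies directly and a projection or semi-orthogonal variant would be needed.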