🤖 AI Summary
Existing multimodal methods predominantly focus on bimodal (e.g., image–text) alignment, failing to fully exploit the synergistic representational capacity of trimodal data, and are hindered by the absence of balanced, large-scale trimodal benchmarks. Method: We propose the first CLIP-based framework for *equal* visual–textual–auditory alignment, featuring a modality-symmetric joint encoder and latent-space alignment mechanism; introduce VGG-Sound+, the first balanced, large-scale trimodal dataset; and incorporate a self-supervised missing-modality reconstruction task to explicitly model cross-modal complementarity. Contribution/Results: Our approach achieves significant improvements over CLIP and state-of-the-art multimodal baselines on zero-shot classification and other downstream tasks. Crucially, it demonstrates superior robustness and generalization under partial modality absence—e.g., when one or two modalities are missing at inference time—validating its effective modeling of trimodal synergy and redundancy.
📝 Abstract
Multi-modal representation learning has become a pivotal area in artificial intelligence, enabling the integration of diverse modalities such as vision, text, and audio to solve complex problems. However, existing approaches predominantly focus on bimodal interactions, such as image–text pairs, which limits their ability to fully exploit the richness of multi-modal data. Furthermore, integrating modalities at equal scale remains underexplored due to the challenge of constructing large-scale, balanced datasets. In this study, we propose Synergy-CLIP, a novel framework that extends the contrastive language–image pre-training (CLIP) architecture to enhance multi-modal representation learning by integrating visual, textual, and audio modalities. Unlike existing methods that adapt individual modalities to vanilla CLIP, Synergy-CLIP aligns the three modalities on an equal footing and captures the latent information shared across them. To address the high cost of constructing large-scale multi-modal datasets, we introduce VGG-Sound+, a triple-modal dataset designed to provide equal-scale representation of visual, textual, and audio data. Synergy-CLIP is validated on various downstream tasks, including zero-shot classification, where it outperforms existing baselines. Additionally, we introduce a missing-modality reconstruction task, demonstrating Synergy-CLIP’s ability to extract synergy among modalities in realistic application scenarios. These contributions provide a robust foundation for advancing multi-modal representation learning and exploring new research directions. The code is available at https://github.com/JoSangYeon/Synergy-CLIP.
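To make the idea of *equal* three-way alignment concrete, the following is a minimal sketch of a symmetric trimodal contrastive objective: a CLIP-style InfoNCE loss applied to each of the three modality pairs and averaged so that no pair dominates. The function names, the averaging scheme, and the temperature value are illustrative assumptions, not the paper's released implementation.

```python
import numpy as np

def pair_loss(a, b, temperature=0.07):
    """Symmetric CLIP-style InfoNCE between two batches of embeddings (rows = items)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)   # L2-normalize each embedding
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                     # (B, B) similarity matrix

    def ce(l):
        # cross-entropy with matched pairs on the diagonal
        l = l - l.max(axis=1, keepdims=True)           # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average both directions (a -> b and b -> a), as in CLIP
    return 0.5 * (ce(logits) + ce(logits.T))

def trimodal_loss(v, t, au, temperature=0.07):
    """Average the three pairwise losses so every modality pair is weighted equally."""
    return (pair_loss(v, t, temperature)
            + pair_loss(v, au, temperature)
            + pair_loss(t, au, temperature)) / 3.0

# Toy usage: random "embeddings" for a batch of 8 video/text/audio clips.
rng = np.random.default_rng(0)
v, t, au = (rng.standard_normal((8, 512)) for _ in range(3))
loss = trimodal_loss(v, t, au)
```

Averaging the three pairwise terms is one simple way to realize the "modality-symmetric" alignment described above; perfectly aligned embeddings (all three modalities mapping a clip to the same point) drive every pairwise term, and hence the total, toward its minimum.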