🤖 AI Summary
To address overfitting and high computational/storage costs associated with full-parameter fine-tuning of CLIP for text-video cross-modal retrieval, this paper proposes a lightweight Cross-modal Adapter. While keeping the CLIP dual-encoder frozen, our method introduces, for the first time, an early-stage feature interaction mechanism between the text and video encoders. The adapter requires tuning only 0.4% of CLIP’s parameters, significantly improving parameter efficiency and generalization. Evaluated on five standard benchmarks—including MSR-VTT and MSVD—our approach matches or surpasses full-parameter fine-tuning in retrieval performance, reduces trainable parameters by 99.6%, accelerates training by 30%, and enables cross-dataset model reuse. The core innovation lies in the design of a multimodal-aware, early-interaction adapter architecture that facilitates synergistic cross-modal representation learning without altering the pretrained encoder weights.
📝 Abstract
Text-video retrieval is an important multi-modal learning task, where the goal is to retrieve the most relevant video for a given text query. Recently, pre-trained models, e.g., CLIP, show great potential on this task. However, as pre-trained models are scaling up, fully fine-tuning them on text-video retrieval datasets has a high risk of overfitting. Moreover, in practice, it would be costly to train and store a large model for each task. To overcome the above issues, we present a novel **Cross-Modal Adapter** for parameter-efficient fine-tuning. Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers. However, there are two notable differences. First, our method is designed for the multi-modal domain. Second, it allows early cross-modal interactions between CLIP's two encoders. Although surprisingly simple, our approach has three notable benefits: (1) it reduces the fine-tuned parameters by **99.6%** and alleviates the problem of overfitting, (2) it saves approximately 30% of training time, and (3) it keeps all the pre-trained parameters fixed, enabling the pre-trained model to be shared across datasets. Extensive experiments demonstrate that, without bells and whistles, it achieves superior or comparable performance compared to fully fine-tuned methods on the MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets. The code will be available at https://github.com/LeapLabTHU/Cross-Modal-Adapter.
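To make the idea concrete, here is a minimal numpy sketch of a bottleneck adapter attached to frozen encoder features. The dimensions, the zero-initialized up-projections, and the choice of realizing "early cross-modal interaction" by sharing the down-projection between the text and video branches are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

D, R = 512, 64  # hypothetical CLIP feature dim and adapter bottleneck rank

# Only the adapter weights are trainable; the CLIP encoders stay frozen.
# Sharing the down-projection across modalities is one simple way to model
# cross-modal interaction (an assumption for this sketch).
W_down_shared = rng.normal(0.0, 0.02, (D, R))
W_up_text = np.zeros((R, D))    # zero-init: adapter starts as an identity map
W_up_video = np.zeros((R, D))

def relu(x):
    return np.maximum(x, 0.0)

def adapt(x, w_up):
    # Residual bottleneck adapter: x + up(relu(down(x)))
    return x + relu(x @ W_down_shared) @ w_up

text_feat = rng.normal(size=(2, D))    # stand-in for frozen text features
video_feat = rng.normal(size=(2, D))   # stand-in for frozen video features

t = adapt(text_feat, W_up_text)
v = adapt(video_feat, W_up_video)

# With zero-initialized up-projections, the adapted features equal the
# frozen ones, so training starts from the pre-trained model's behavior.
assert np.allclose(t, text_feat) and np.allclose(v, video_feat)

# Trainable parameter count: tiny relative to a CLIP-scale model.
n_trainable = W_down_shared.size + W_up_text.size + W_up_video.size
print(n_trainable)
```

Because only `W_down_shared` and the two up-projections receive gradients, the trainable-parameter footprint stays small, which is the mechanism behind the reported 99.6% reduction and the reduced overfitting risk.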