Cross-Modal Adapter for Text-Video Retrieval

📅 2022-11-17
🏛️ arXiv.org
📈 Citations: 43
Influential: 5
📄 PDF
🤖 AI Summary
To address overfitting and the high computational and storage costs of full-parameter fine-tuning of CLIP for text-video cross-modal retrieval, this paper proposes a lightweight Cross-Modal Adapter. While keeping the CLIP dual encoder frozen, the method introduces, for the first time, an early-stage feature interaction mechanism between the text and video encoders. The adapter requires tuning only 0.4% of CLIP's parameters, significantly improving parameter efficiency and generalization. Evaluated on five standard benchmarks, including MSR-VTT and MSVD, the approach matches or surpasses full-parameter fine-tuning in retrieval performance, reduces trainable parameters by 99.6%, accelerates training by about 30%, and enables the frozen pre-trained model to be reused across datasets. The core innovation is a multimodal-aware, early-interaction adapter architecture that facilitates synergistic cross-modal representation learning without altering the pre-trained encoder weights.
📝 Abstract
Text-video retrieval is an important multi-modal learning task, where the goal is to retrieve the most relevant video for a given text query. Recently, pre-trained models, e.g., CLIP, show great potential on this task. However, as pre-trained models are scaling up, fully fine-tuning them on text-video retrieval datasets has a high risk of overfitting. Moreover, in practice, it would be costly to train and store a large model for each task. To overcome the above issues, we present a novel Cross-Modal Adapter for parameter-efficient fine-tuning. Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers. However, there are two notable differences. First, our method is designed for the multi-modal domain. Secondly, it allows early cross-modal interactions between CLIP's two encoders. Although surprisingly simple, our approach has three notable benefits: (1) reduces 99.6% of fine-tuned parameters, and alleviates the problem of overfitting, (2) saves approximately 30% of training time, and (3) allows all the pre-trained parameters to be fixed, enabling the pre-trained model to be shared across datasets. Extensive experiments demonstrate that, without bells and whistles, it achieves superior or comparable performance compared to fully fine-tuned methods on MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets. The code will be available at https://github.com/LeapLabTHU/Cross-Modal-Adapter.
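The headline efficiency numbers fit together with simple arithmetic. As a back-of-envelope sketch (the ~151M total for CLIP ViT-B/32 is an assumption for illustration; the paper reports the 0.4% tuned fraction and the 99.6% reduction):

```python
# Back-of-envelope parameter arithmetic for the paper's efficiency claims.
# clip_params (~151M, roughly CLIP ViT-B/32) is an assumed figure for
# illustration; the 0.4% tuned fraction comes from the summary above.
clip_params = 151_000_000
tuned_fraction = 0.004                      # adapter tunes ~0.4% of CLIP
tuned = int(clip_params * tuned_fraction)   # ~604,000 trainable parameters
frozen_fraction = 1 - tuned_fraction        # 0.996 -> the "99.6% reduction"
print(tuned, frozen_fraction)
```

Freezing the remaining 99.6% is what lets one shared pre-trained backbone serve many datasets, with only the small adapter stored per task.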
Problem

Research questions and friction points this paper is trying to address.

Efficiently fine-tuning large pre-trained models for retrieval
Reducing overfitting and computational costs in multi-modal learning
Enabling cross-modal interactions with minimal parameter adjustments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Adapter for parameter-efficient transfer learning
Adapter-based layers adjust pre-trained multi-modal models
Encoder-level implicit cross-modal interactions between vision-language encoders
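The adapter idea above can be sketched as a residual bottleneck attached to frozen encoder features. The following minimal NumPy sketch is illustrative only: the class name, shapes, and the weight-sharing scheme (modality-specific down-projections into a shared up-projection, so both modalities pass through common low-dimensional weights) are assumptions, not the authors' exact architecture.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


class CrossModalAdapter:
    """Illustrative bottleneck adapter with a shared cross-modal projection.

    Names, shapes, and the sharing scheme are assumptions for illustration;
    this is not the paper's exact architecture.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        # Modality-specific down-projections (text / video).
        self.down_t = rng.normal(0.0, scale, (dim, bottleneck))
        self.down_v = rng.normal(0.0, scale, (dim, bottleneck))
        # Shared up-projection: both modalities pass through the same
        # low-dimensional weights, a simple form of cross-modal coupling.
        # Zero-init means the adapter starts as an identity map, so the
        # frozen CLIP features are reproduced exactly at initialization.
        self.up = np.zeros((bottleneck, dim))

    def __call__(self, text_feat, video_feat):
        # Residual connections keep the frozen backbone's features intact;
        # only the small down/up projections would be trained.
        t = text_feat + relu(text_feat @ self.down_t) @ self.up
        v = video_feat + relu(video_feat @ self.down_v) @ self.up
        return t, v
```

With a bottleneck far smaller than the encoder width, the adapter's parameter count is a tiny fraction of the backbone's, which is the mechanism behind the 99.6% reduction in trainable parameters.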
👥 Authors

Haojun Jiang
Department of Automation, BNRist, Tsinghua University, 100084, Beijing, China

Jianke Zhang
Tsinghua University, IIIS. Interests: Embodied AI, VLM, Multimodal Learning

Rui Huang
Department of Automation, BNRist, Tsinghua University, 100084, Beijing, China

Chunjiang Ge
Department of Automation, BNRist, Tsinghua University, 100084, Beijing, China

Zanlin Ni
Tsinghua University. Interests: Computer Vision, Deep Learning

Shiji Song
Tsinghua University. Interests: Modeling and optimization, complex systems, and stochastic systems

Gao Huang
Department of Automation, BNRist, Tsinghua University, 100084, Beijing, China