M$^2$IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension

📅 2024-07-01

📈 Citations: 1

✨ Influential: 0


🤖 AI Summary

To address insufficient multi-modal interaction and high GPU memory consumption when applying parameter-efficient transfer learning (PETL) to referring expression comprehension (REC), this paper proposes M²IST, a multi-modal interactive side-tuning framework. M²IST freezes the pre-trained uni-modal vision and language encoders and updates only lightweight M³ISAs (Mixture of Multi-Modal Interactive Side-Adapters) to align vision and language features for REC. On benchmarks including RefCOCO, M²IST remains competitive with full fine-tuning while requiring only 2.11% of the tunable parameters, 39.61% of the GPU memory, and 63.46% of the training time. The authors report a better performance-efficiency trade-off than both full fine-tuning and other PETL methods.

📝 Abstract

Referring expression comprehension (REC) is a vision-language task to locate a target object in an image based on a language expression. Fully fine-tuning general-purpose pre-trained vision-language foundation models for REC yields impressive performance but becomes increasingly costly. Parameter-efficient transfer learning (PETL) methods have shown strong performance with fewer tunable parameters. However, directly applying PETL to REC faces two challenges: (1) insufficient multi-modal interaction between pre-trained vision-language foundation models, and (2) high GPU memory usage due to gradients passing through the heavy vision-language foundation models. To this end, we present M$^2$IST: Multi-Modal Interactive Side-Tuning with M$^3$ISAs: Mixture of Multi-Modal Interactive Side-Adapters. During fine-tuning, we fix the pre-trained uni-modal encoders and update M$^3$ISAs to enable efficient vision-language alignment for REC. Empirical results reveal that M$^2$IST achieves better performance-efficiency trade-off than full fine-tuning and other PETL methods, requiring only 2.11% tunable parameters, 39.61% GPU memory, and 63.46% training time while maintaining competitive performance. Our code is released at https://github.com/xuyang-liu16/M2IST.
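The "tunable parameters" figure quoted above is a ratio of updated parameters to all parameters in the model. The following is a minimal sketch of that accounting for a side-tuning setup, assuming frozen uni-modal encoders plus a small trainable side module. The `Component` class, the component names, and every parameter count are hypothetical placeholders chosen for illustration; they are not taken from the paper or its released code.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One block of a model, described only by its size and whether it is updated."""
    name: str
    num_params: int
    trainable: bool


def trainable_fraction(components):
    """Return the share of parameters that receive updates during fine-tuning."""
    total = sum(c.num_params for c in components)
    if total == 0:
        raise ValueError("model has no parameters")
    tuned = sum(c.num_params for c in components if c.trainable)
    return tuned / total


# Hypothetical sizes: two frozen encoders and one small trainable side module.
components = [
    Component("vision_encoder", 86_000_000, trainable=False),
    Component("language_encoder", 110_000_000, trainable=False),
    Component("side_adapters", 4_000_000, trainable=True),
]

fraction = trainable_fraction(components)
print(f"trainable share: {100 * fraction:.2f}%")
```

With these placeholder sizes, 4M of 200M parameters are updated. Note that the parameter share alone does not capture the memory saving: in side-tuning the frozen encoders are also kept out of the backward pass, which is why the abstract reports GPU memory and training time as separate figures.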

Problem

Research questions and friction points this paper is trying to address.

Efficient vision-language alignment

Reduced GPU memory usage

Improved multi-modal interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Modal Interactive Side-Tuning

Mixture of Multi-Modal Interactive Side-Adapters (M$^3$ISAs)

Efficient Vision-Language Alignment
