MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension

📅 2024-09-20
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing referring expression comprehension (REC) methods rely on full fine-tuning of pre-trained models, which degrades pretrained knowledge and incurs high computational overhead, while off-the-shelf Parameter-Efficient Transfer Learning (PETL) methods lack the domain-specific ability to capture local visual details and cross-modal alignment. This paper proposes a lightweight multimodal tuning framework, MaPPER, that, for the first time, explicitly integrates domain-specific priors into parameter-efficient tuning. Specifically, it uses Dynamic Prior Adapters guided by an aligned prior, Local Convolution Adapters to enhance fine-grained visual perception, and a Prior-Guided Text module to achieve precise language-vision alignment. On RefCOCO, RefCOCO+, and RefCOCOg, the method achieves the best localization accuracy compared with both full fine-tuning and other PETL approaches, while tuning only 1.41% of the backbone parameters.

📝 Abstract
Referring Expression Comprehension (REC), which aims to ground a local visual region via natural language, is a task that heavily relies on multimodal alignment. Most existing methods utilize powerful pre-trained models to transfer visual/linguistic knowledge by full fine-tuning. However, fully fine-tuning the entire backbone not only breaks the rich prior knowledge embedded during pre-training, but also incurs significant computational costs. Motivated by the recent emergence of Parameter-Efficient Transfer Learning (PETL) methods, we aim to solve the REC task in an effective and efficient manner. Directly applying these PETL methods to the REC task is inappropriate, as they lack the domain-specific abilities for precise local visual perception and visual-language alignment. Therefore, we propose a novel framework of Multimodal Prior-guided Parameter Efficient Tuning, namely MaPPER. Specifically, MaPPER comprises Dynamic Prior Adapters guided by an aligned prior, and Local Convolution Adapters to extract precise local semantics for better visual perception. Moreover, the Prior-Guided Text module is proposed to further utilize the prior for facilitating the cross-modal alignment. Experimental results on three widely used benchmarks demonstrate that MaPPER achieves the best accuracy compared to full fine-tuning and other PETL methods with only 1.41% tunable backbone parameters.
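To give a feel for what a small tunable fraction means, here is a minimal sketch that counts the parameters of a generic down-project/up-project bottleneck adapter and divides by the size of a frozen backbone. The function names and every size used (backbone parameter count, number of adapters, hidden width, bottleneck width) are assumptions chosen for illustration; they are not MaPPER's actual configuration and are not meant to reproduce the reported 1.41%.

```python
def adapter_params(d_model: int, bottleneck: int) -> int:
    """Weights and biases of one down-project/up-project bottleneck adapter."""
    down = d_model * bottleneck + bottleneck  # W_down and b_down
    up = bottleneck * d_model + d_model       # W_up and b_up
    return down + up


def tunable_fraction(backbone_params: int, n_adapters: int,
                     d_model: int, bottleneck: int) -> float:
    """Adapter parameters divided by the frozen backbone's parameters."""
    return n_adapters * adapter_params(d_model, bottleneck) / backbone_params


# Hypothetical sizes, for illustration only (not taken from the paper).
frac = tunable_fraction(backbone_params=100_000_000, n_adapters=12,
                        d_model=768, bottleneck=64)
print(f"{frac:.2%}")
```

Because the adapter cost grows with the bottleneck width rather than with the full layer width squared, shrinking the bottleneck is the main lever for keeping the tunable fraction low.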

Problem

Research questions and friction points this paper is trying to address.

Pre-trained Model Optimization
Computational Efficiency
Visual-Linguistic Alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Adjustment
Local Convolution
Prior-Guided Text Module
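
The "dynamic adjustment" idea can be illustrated with a toy adapter whose residual update is scaled by a gate computed from a prior vector. This is only a sketch under stated assumptions: the sigmoid gate, the ReLU bottleneck, and the names `prior_scaled_adapter`, `W_down`, `W_up`, and `w_gate` are hypothetical and are not the paper's exact formulation of the Dynamic Prior Adapter.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def prior_scaled_adapter(x, prior, W_down, W_up, w_gate):
    """Bottleneck adapter whose update is scaled by a prior-dependent gate."""
    # Scalar gate in (0, 1) computed from the prior (assumed form).
    score = sum(g * p for g, p in zip(w_gate, prior))
    gate = 1.0 / (1.0 + math.exp(-score))
    hidden = [max(0.0, h) for h in matvec(W_down, x)]  # down-project + ReLU
    delta = matvec(W_up, hidden)                       # up-project
    # Residual connection: the frozen feature x plus the gated update.
    return [xi + gate * di for xi, di in zip(x, delta)]


# Tiny hand-made example: 4-dim feature, 2-dim bottleneck, 2-dim prior.
x = [1.0, -2.0, 0.5, 3.0]
prior = [0.0, 0.0]  # a zero prior gives gate = 0.5
W_down = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
W_up = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
w_gate = [0.3, -0.1]
out = prior_scaled_adapter(x, prior, W_down, W_up, w_gate)
print(out)
```

In this toy setting only `W_down`, `W_up`, and `w_gate` would be trained, while the feature `x` comes from a frozen backbone layer.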